# Reddit Project on Data Science

This project is to web scrape data science related information from Reddit, the social news aggregation, web content rating, and discussion website. We aim to collect general in

In [1]:
# Load necessary libraries
import praw
from praw.models import MoreComments
import pandas as pd
import PIL
import numpy as np
from wordcloud import WordCloud
from datetime import datetime
from praw.models import MoreComments

### Reddit's Sorting Method:

Within a subreddit, there are multiple post submissions. Reddit provides us multiple ways to sort the submissions:
- rising: submissions that are getting a lot of activities (comments/upvotes) right now
- new: latest submissions by time
- hot: submissions that have been getting a lot of upvotes/comments
- gilded: comments that have been given reddit gold by someone
- controversial: submissions that have been getting multiple downvotes and upvotes. 
- top: submissions that have gotten the most upvotes over the set period 

### Authentication
First, we need to authenticate ourselves. In order to authenticate ourselves, we need to first create an app on reddit by filling in a name, description and redirect uri. After creating the app, we can use the authentication information to create the praw.Reddit instance. 

In [2]:
# we need to first authenticate ourselves
reddit = praw.Reddit(client_id='Z1grqQBW7ei7hA', client_secret='ekx286gd903s742SoxFSc7mF-kg', 
                     user_agent='simon_tutorial', redirect_uri="http://localhost:8080", username='simonneedsleep', 
                     password='ZXW1025reddit!')

# check the username
print(reddit.user.me())

simonneedsleep


### Get subreddit data
For the next step, we need to collect subreddit on data science relevant topics. The subreddits that we want to explore are r/DataScience, r/DataScienceJobs, r/MachineLearning, etc. For the moment, we will be looking at **r/DataScience** subreddit. We will collect the top 50 'hot' posts. 

In [3]:
## Scrape the DataScience subreddit
posts_hot = []
posts_top = []
ds_subreddit = reddit.subreddit('DataScience')
#ds_subreddit = reddit.subreddit('MachineLearning')


# Obtain the top 50 hot posts
for post in ds_subreddit.hot(limit=50):
    # convert timestamp to datetime
    posts_hot.append([post.title, post.score, post.id, post.subreddit, post.url, post.num_comments, post.selftext, datetime.fromtimestamp(post.created)])
posts_hot = pd.DataFrame(posts_hot,columns=['title', 'score', 'id', 'subreddit', 'url', 'num_comments', 'body', 'created'])

# Obtain the top 50 top posts
for post in ds_subreddit.hot(limit=50):
    # convert timestamp to datetime
    posts_top.append([post.title, post.score, post.id, post.subreddit, post.url, post.num_comments, post.selftext, datetime.fromtimestamp(post.created)])
posts_top = pd.DataFrame(posts_top,columns=['title', 'score', 'id', 'subreddit', 'url', 'num_comments', 'body', 'created'])

Let's take a look at the dataframe of the top 50 hot posts, sorted by score:

In [5]:
posts_hot.sort_values(by = 'score', ascending = False)

Unnamed: 0,title,score,id,subreddit,url,num_comments,body,created
22,Experience/Advice from a 10+ year data scientist,768,iorbjg,datascience,https://www.reddit.com/r/datascience/comments/...,86,"For context, I was in most people's shoes here...",2020-09-08 14:41:58
36,What we look for in hiring,727,innazn,datascience,https://www.reddit.com/r/datascience/comments/...,151,"I wrote this post 6 months ago, and I know tha...",2020-09-06 18:52:49
9,Today I reached a new milestone: got rejected ...,622,ipstlf,datascience,https://www.reddit.com/r/datascience/comments/...,112,On-Campus Recruiting has been so stressful. Ju...,2020-09-10 04:33:25
48,How is Python easier than R?,314,in710s,datascience,https://www.reddit.com/r/datascience/comments/...,309,I come from a more statistical background but ...,2020-09-05 23:16:00
31,"Currently a Data Scientist with a Bachelors, s...",204,io7nwi,datascience,https://www.reddit.com/r/datascience/comments/...,129,"Hi all,\n\nI’m in a pickle. I’m currently work...",2020-09-07 17:43:43
1,What do you look for in a data science leader?,95,iqiop1,datascience,https://www.reddit.com/r/datascience/comments/...,32,Whether you're currently working as a data sci...,2020-09-11 07:14:44
28,"First Real Project Done, What's Next?",92,iom3q3,datascience,https://www.reddit.com/r/datascience/comments/...,25,Hi! I'm a 20-something math undergrad in a sma...,2020-09-08 07:15:23
15,What would this statistical technique be called?,91,ipfp8j,datascience,https://www.reddit.com/r/datascience/comments/...,30,"Say I have a group of 30,000 customers at a co...",2020-09-09 17:17:27
41,How do you learn the Maths?,82,inkgvv,datascience,https://www.reddit.com/r/datascience/comments/...,90,"I'm on a quest to become a Data Scientist, how...",2020-09-06 15:25:43
20,How to organize data cleaning scripts,47,ip95c4,datascience,https://www.reddit.com/r/datascience/comments/...,34,I started work as a data scientist earlier thi...,2020-09-09 08:14:27


Let's take a look at the dataframe of the top 50 top posts, sorted by score:

In [6]:
posts_top.sort_values(by = 'score', ascending = False)

Unnamed: 0,title,score,id,subreddit,url,num_comments,body,created
22,Experience/Advice from a 10+ year data scientist,768,iorbjg,datascience,https://www.reddit.com/r/datascience/comments/...,86,"For context, I was in most people's shoes here...",2020-09-08 14:41:58
36,What we look for in hiring,728,innazn,datascience,https://www.reddit.com/r/datascience/comments/...,151,"I wrote this post 6 months ago, and I know tha...",2020-09-06 18:52:49
9,Today I reached a new milestone: got rejected ...,623,ipstlf,datascience,https://www.reddit.com/r/datascience/comments/...,112,On-Campus Recruiting has been so stressful. Ju...,2020-09-10 04:33:25
48,How is Python easier than R?,322,in710s,datascience,https://www.reddit.com/r/datascience/comments/...,309,I come from a more statistical background but ...,2020-09-05 23:16:00
31,"Currently a Data Scientist with a Bachelors, s...",207,io7nwi,datascience,https://www.reddit.com/r/datascience/comments/...,129,"Hi all,\n\nI’m in a pickle. I’m currently work...",2020-09-07 17:43:43
1,What do you look for in a data science leader?,96,iqiop1,datascience,https://www.reddit.com/r/datascience/comments/...,32,Whether you're currently working as a data sci...,2020-09-11 07:14:44
15,What would this statistical technique be called?,93,ipfp8j,datascience,https://www.reddit.com/r/datascience/comments/...,30,"Say I have a group of 30,000 customers at a co...",2020-09-09 17:17:27
28,"First Real Project Done, What's Next?",91,iom3q3,datascience,https://www.reddit.com/r/datascience/comments/...,25,Hi! I'm a 20-something math undergrad in a sma...,2020-09-08 07:15:23
41,How do you learn the Maths?,79,inkgvv,datascience,https://www.reddit.com/r/datascience/comments/...,90,"I'm on a quest to become a Data Scientist, how...",2020-09-06 15:25:43
2,Do math-oriented entry level data science posi...,42,iqilhn,datascience,https://www.reddit.com/r/datascience/comments/...,28,I am currently three years into the building o...,2020-09-11 07:08:47


We then concatenate the two dataframes and then remove the duplicates.

In [7]:
posts_ds = pd.concat([posts_top, posts_hot], axis=0).drop_duplicates(subset = 'id').reset_index(drop = True)

posts_ds.sort_values(by = 'score', ascending = False).reset_index(drop = True).head()

Unnamed: 0,title,score,id,subreddit,url,num_comments,body,created
0,Experience/Advice from a 10+ year data scientist,768,iorbjg,datascience,https://www.reddit.com/r/datascience/comments/...,86,"For context, I was in most people's shoes here...",2020-09-08 14:41:58
1,What we look for in hiring,728,innazn,datascience,https://www.reddit.com/r/datascience/comments/...,151,"I wrote this post 6 months ago, and I know tha...",2020-09-06 18:52:49
2,Today I reached a new milestone: got rejected ...,623,ipstlf,datascience,https://www.reddit.com/r/datascience/comments/...,112,On-Campus Recruiting has been so stressful. Ju...,2020-09-10 04:33:25
3,How is Python easier than R?,322,in710s,datascience,https://www.reddit.com/r/datascience/comments/...,309,I come from a more statistical background but ...,2020-09-05 23:16:00
4,"Currently a Data Scientist with a Bachelors, s...",207,io7nwi,datascience,https://www.reddit.com/r/datascience/comments/...,129,"Hi all,\n\nI’m in a pickle. I’m currently work...",2020-09-07 17:43:43


### Load the dataset

In [8]:
## save the dataframe locally
#print(posts_ds.shape)
posts_ds.to_csv('data/posts.csv')

## load the saved dataframe
#posts_ds = pd.read_csv('data/posts.csv', index_col = 0)

We plot a scatter plot of the number of comments against score for posts that have at least one comment. 

In [None]:
import matplotlib.pyplot as plt
sub_posts = posts_ds[posts_ds['num_comments'] > 0]
plt.scatter(sub_posts['score'], sub_posts['num_comments'], marker='o')
plt.xlabel('Score')
plt.ylabel('Number of Comments')

In [None]:
### Combine the title and the body columns to create a text str column
posts_ds['text'] = posts_ds['title'] + posts_ds['body']

In [None]:
### Needs to process the title and the body as well
import re

## Remove empty space:
posts_ds['text'] = posts_ds['text'].map(lambda x: re.sub('\s+', ' ', x))

## Convert all string to lower cases:
posts_ds['text'] = posts_ds['text'].str.lower()

## Remove all the punctuations:
posts_ds['text'] = posts_ds['text'].map(lambda x: re.sub('[^\w\s]', '', x))
## need to think about the dash

## Take a look at the processed df
posts_ds.head()

In [16]:
### Remove stopwords
import nltk
from nltk.corpus import stopwords
# the stop words list by nltk
stop_words = stopwords.words('english')

# extra stop words that we need to remove
more_stop_words = ['could', 'really', 'would', 'uses','use','using','used','one','also','days', 'im', 'dont', 
                   'say', 'can', 'not', 'id', 'like', 'youre', 'ive', 'arent', 'something', 'many', 'etc', 'even']

# define a collection of words related to data science education
key_words = ['project', 'course', 'degree', 'masters', 'program', 'experience', 'bootcamp', 'courses', 
             'projects', 'certificates', 'ms', 'phd', 'kaggle', 'capstone']

# online programs
key_words_online = ['online', 'programs', 'program', 'udemy', 'udacity', 'certificates', 'certificate', 'khan']

# programming language
key_words_language = ['python', 'r', 'sql']

In [None]:
posts_ds['text'] = posts_ds['text'].map(lambda text: " ".join(word for word in text.split() if word not in stop_words))

posts_ds['text'] = posts_ds['text'].map(lambda text: " ".join(word for word in text.split() if word not in more_stop_words))

### tokenization
import nltk
from nltk.tokenize import word_tokenize 
#nltk.download('punkt')

# tokenize the words
posts_ds['Tokens'] = posts_ds['text'].map(lambda x: word_tokenize(x))

posts_ds.head()

In [None]:
from wordcloud import WordCloud
wc = WordCloud(background_color="white", max_words=2000, width=800, height = 400)
# generate word cloud
wc.generate(' '.join(posts_ds['text']))
#wc.generate(' '.join(ds_comments_raw['Tokens']))

import matplotlib.pyplot as plt
%matplotlib inline

# show
plt.figure(figsize=(12, 6))
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()

In [None]:
from nltk.probability import FreqDist # this also uses Counter. 


print(FreqDist(' '.join(posts_ds['text']).split()).most_common(50))
#' '.join(ds_comments_raw['Comments']).split()

In [None]:
post_df = posts_ds['text'].str.contains('|'.join(key_words), regex = True).value_counts()
#pd.DataFrame(ds_df, columns=["Contains", "Count"])

#ds_df.reset_index()

#ds_df.reset_index().plot.bar(x='index', y='Comments', rot=0)
#pd.DataFrame(post_df, columns = ['Post'])

In [None]:
pd.DataFrame(post_df).reset_index().plot.bar(x='index', y='text', rot=0)

### Get comments from these posts
After collecting the subreddit, we then want to get comments from a specific post. 

In [9]:
# Store the comments in a dictionary for the subreddit CHANGE the representation to df
dict_comments = {}

# Get the id of the post
keys = posts_ds.id
for key in keys:
    submission = reddit.submission(id = key)
    for top_level_comment in submission.comments:
        try:
            dict_comments[key].append(top_level_comment.body)
        except KeyError:
            dict_comments[key] = [top_level_comment.body]

In [10]:
# Flatten the dictionary to a dataframe
ds_comments_raw = pd.DataFrame(list(dict_comments.items()), columns=['ID', 'Comments'])

# Transform each element of a list-like to a row, replicating index values
ds_comments = ds_comments_raw.explode('Comments')
ds_comments_raw.head()

Unnamed: 0,ID,Comments
0,inkv4p,"[Hi Guys,\n\nI am a Bioinformatician looking t..."
1,iqiop1,[-\tDefines a strategy and vision for the team...
2,iqilhn,[How big is your company? Larger places will h...
3,iqll63,[I work as a data scientist and can confirm th...
4,iqk3ag,[Thank you for your post. It's great to see th...


In [12]:
ds_comments.head()
ds_comments.to_csv('data/comments.csv')
ds_comments_raw.to_csv('data/comments_raw.csv')

In [13]:
### Save the dataframe to a local file
#ds_comments_raw.to_csv('data/comments.csv')
ds_comments.groupby('ID').count().shape

(47, 1)

In [14]:
### Preprocessing
import re
## Join the list of string into one long string:
ds_comments_raw['Comments'] = ds_comments_raw['Comments'].str.join(' ')

## Remove empty space:
ds_comments_raw['Comments'] = ds_comments_raw['Comments'].map(lambda x: re.sub('\s+', ' ', x))

## Convert all string to lower cases:
ds_comments_raw['Comments'] = ds_comments_raw['Comments'].str.lower()

## Remove all the punctuations:
ds_comments_raw['Comments'] = ds_comments_raw['Comments'].map(lambda x: re.sub('[^\w\s]', '', x))
## need to think about the dash

## Take a look at the processed df
ds_comments_raw.head()

Unnamed: 0,ID,Comments
0,inkv4p,hi guys i am a bioinformatician looking to get...
1,iqiop1,defines a strategy and vision for the team e...
2,iqilhn,how big is your company larger places will hav...
3,iqll63,i work as a data scientist and can confirm tha...
4,iqk3ag,thank you for your post its great to see that ...


In [17]:
### Remove stopwords
import nltk
from nltk.corpus import stopwords

ds_comments_raw['Comments'] = ds_comments_raw['Comments'].map(lambda text: " ".join(word for word in text.split() if word not in stop_words))

ds_comments_raw['Comments'] = ds_comments_raw['Comments'].map(lambda text: " ".join(word for word in text.split() if word not in more_stop_words))

ds_comments_raw.head()

Unnamed: 0,ID,Comments
0,inkv4p,hi guys bioinformatician looking get broader d...
1,iqiop1,defines strategy vision team excellent communi...
2,iqilhn,big company larger places room specialization ...
3,iqll63,work data scientist confirm active move compan...
4,iqk3ag,thank post great see succeed world gives sprea...


In [18]:
ds_comments_raw['Comments'].values

array(['hi guys bioinformatician looking get broader data science career wondering anyone tell important domain knowledge bioinformatician good knowledge biologyhealthcare transition different domain business analytics case find opportunities domain much background knowledge get applying job learn job thanks looking examples traditional academic programs viable given circumstances current facts situation finished gre score 332 165 quant 167 verbal limited professional success blue collar labour years current position requires data entry programming light data base admin professional experience back office automation python file server management email tools process automation building filemaker features coursera certification python data sciences live south bay area california bachelors degree economics psychology gpa 37 33 years old looking examples programs might willing accept someone position someone go learn data tools software want see electronics walk best buy look around cant d

### Using Tfidf Vectorizer and kMeans

In [31]:
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from nltk.tokenize import RegexpTokenizer
from nltk.stem.snowball import SnowballStemmer
punc = ['.', ',', '"', "'", '?', '!', ':', ';', '(', ')', '[', ']', '{', '}',"%"]
stop_words = text.ENGLISH_STOP_WORDS.union(punc)

In [32]:
comms = ds_comments_raw['Comments'].values
vectorizer = TfidfVectorizer(stop_words = stop_words)
X = vectorizer.fit_transform(comms)

In [33]:
word_features = vectorizer.get_feature_names()
word_features[300:320]

['ass',
 'assembled',
 'assertion',
 'assessing',
 'assessment',
 'asset',
 'assigned',
 'assigning',
 'assignment',
 'assignments',
 'associate',
 'associated',
 'assume',
 'assumed',
 'assuming',
 'assumption',
 'astrophysical',
 'astrophysics',
 'asu',
 'asymmetrical']

In [34]:
stemmer = SnowballStemmer('english')
tokenizer = RegexpTokenizer(r'[a-zA-Z\']+')

def tokenize(text):
    return [stemmer.stem(word) for word in tokenizer.tokenize(text.lower())]

After writing the function, I pass it through as an argument when instantiating the vectorizer. We can see that those repeated forms I mentioned above are now gone and only the root form is present thanks to the stemming we added :)

In [35]:
vectorizer2 = TfidfVectorizer(stop_words = stop_words, tokenizer = tokenize)
X2 = vectorizer2.fit_transform(comms)
word_features2 = vectorizer2.get_feature_names()
word_features2[:50]

  'stop_words.' % sorted(inconsistent))


['ab',
 'abhttpswwwyoutubecomwatchvfnk',
 'abil',
 'abl',
 'ablewil',
 'abn',
 'abroad',
 'abrupt',
 'absolut',
 'absorb',
 'abstract',
 'absurd',
 'abund',
 'academ',
 'academi',
 'academia',
 'acceler',
 'accept',
 'access',
 'accomplish',
 'account',
 'accredit',
 'accross',
 'accumul',
 'accur',
 'accuraci',
 'accustom',
 'achiev',
 'acknowledg',
 'acquaint',
 'acquisit',
 'acronym',
 'act',
 'action',
 'activ',
 'actual',
 'actuari',
 'acumen',
 'ad',
 'adapt',
 'add',
 'addin',
 'addit',
 'adhoc',
 'admin',
 'admiss',
 'admit',
 'adob',
 'adopt',
 'advanc']

In [38]:
vectorizer3 = TfidfVectorizer(stop_words = stop_words, tokenizer = tokenize, max_features = 1000)
X3 = vectorizer3.fit_transform(comms)
words3 = vectorizer3.get_feature_names()
words3[:50]

  'stop_words.' % sorted(inconsistent))


['ab',
 'abil',
 'abl',
 'absolut',
 'abstract',
 'academ',
 'academi',
 'academia',
 'accept',
 'access',
 'achiev',
 'activ',
 'actual',
 'ad',
 'adapt',
 'add',
 'addit',
 'advanc',
 'advantag',
 'advic',
 'affect',
 'age',
 'ago',
 'agre',
 'ahead',
 'ai',
 'algebra',
 'algorithm',
 'allow',
 'alon',
 'alreadi',
 'altern',
 'alway',
 'amazon',
 'analys',
 'analysi',
 'analyst',
 'analyt',
 'analyz',
 'anoth',
 'answer',
 'anyon',
 'anyth',
 'anywher',
 'api',
 'app',
 'appli',
 'applic',
 'appreci',
 'approach']

### KMeans Clustering

After the preprocessing, I can finally apply the kMeans algorithm to cluster the comments vector. 

#### n_clusters = 10

In [53]:
kmeans_10 = KMeans(n_clusters = 10, n_init = 5, n_jobs = -1)
kmeans_10.fit(X3)

### change the distance function: cosine distance
### silhouette analysis/score k means: automate through: to get the top 3 silhouette: pick the highest
### the elbow plot: visualization

### LDA whether they collaborate or not. when there's sort of agreement, we can start labelling. hand labelling. each document becomes binary classification

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=10, n_init=5, n_jobs=-1, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)

In [55]:
common_words = kmeans_10.cluster_centers_.argsort()[:,-1:-12:-1]
for num, centroid in enumerate(common_words):
    print(str(num) + ' : ' + ', '.join(words[word] for word in centroid))

0 : function, column, notebook, file, board, googl, loss, forecast, data, creat, wrong
1 : data, work, job, scienc, peopl, time, project, math, compani, learn, master
2 : custom, segment, algorithm, model, cluster, ltv, test, promot, ab, featur, data
3 : r, python, data, f, sql, languag, learn, rd, want, excel, work
4 : messag, purpos, man, assign, depth, compet, hey, situat, join, worri, opinion
5 : thank, gave, spreadsheet, world, idea, great, post, need, final, financ, formula
6 : stuff, sourc, read, document, learn, new, code, way, glanc, everi, relat
7 : data, scientist, field, scienc, good, tutori, aw, model, job, ds, lot
8 : cert, test, practic, studi, know, impress, simpli, someon, project, say, tool
9 : lol, date, hi, articl, nice, featur, bit, add, mayb, form, forget


In [56]:
ds_comments_raw['Cluster_10'] = kmeans_10.labels_

In [58]:
ds_comments_raw.groupby(['Cluster_10']).size()

Cluster_10
0     4
1    20
2     3
3     7
4     1
5     2
6     2
7     6
8     1
9     1
dtype: int64

#### n_clusters = 6

In [64]:
kmeans_6 = KMeans(n_clusters = 6, n_init = 4, n_jobs = -1)
kmeans_6.fit(X3)

common_words = kmeans_6.cluster_centers_.argsort()[:,-1:-12:-1]
for num, centroid in enumerate(common_words):
    print(str(num) + ' : ' + ', '.join(words[word] for word in centroid))

0 : thank, gave, spreadsheet, world, idea, great, post, need, final, financ, formula
1 : data, work, scienc, job, learn, project, peopl, time, compani, think, know
2 : lol, date, hi, articl, nice, featur, bit, add, mayb, form, forget
3 : r, python, function, f, data, sql, learn, rd, languag, excel, easier
4 : load, want, isnt, scrape, word, data, page, thread, raw, loop, sleep
5 : technic, manageri, staff, shot, forget, ignor, effort, guy, told, step, clear


In [65]:
ds_comments_raw['Cluster_6'] = kmeans_6.labels_
ds_comments_raw.groupby(['Cluster_6']).size()

Cluster_6
0     2
1    35
2     1
3     7
4     1
5     1
dtype: int64

#### n_clusters = 3

In [62]:
kmeans_3 = KMeans(n_clusters = 3, n_init = 2, n_jobs = -1)
kmeans_3.fit(X3)

common_words = kmeans_3.cluster_centers_.argsort()[:,-1:-12:-1]
for num, centroid in enumerate(common_words):
    print(str(num) + ' : ' + ', '.join(words[word] for word in centroid))

0 : data, job, scienc, work, role, compani, master, want, degre, time, scientist
1 : test, project, know, team, data, manag, cert, dataset, leader, work, promot
2 : r, data, python, learn, work, thank, function, need, project, code, make


In [63]:
ds_comments_raw['Cluster_3'] = kmeans_3.labels_
ds_comments_raw.groupby(['Cluster_3']).size()

Cluster_3
0    17
1     4
2    26
dtype: int64

In [None]:
### tokenization
import nltk
from nltk.tokenize import word_tokenize 
#nltk.download('punkt')

# tokenize the words
ds_comments_raw['Tokens'] = ds_comments_raw['Comments'].map(lambda x: word_tokenize(x))

ds_comments_raw.head()

In [None]:
from wordcloud import WordCloud
wc = WordCloud(background_color="white", max_words=2000, width=800, height = 400)
# generate word cloud
wc.generate(' '.join(ds_comments_raw['Comments']))
#wc.generate(' '.join(ds_comments_raw['Tokens']))

import matplotlib.pyplot as plt
%matplotlib inline

# show
plt.figure(figsize=(12, 6))
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()

In [None]:
from textblob import TextBlob
TextBlob(ds_comments_raw['Comments'][0]).ngrams(2)

In [None]:
## word frequency by post
from nltk.probability import FreqDist # this also uses Counter. 
for i in range(ds_comments_raw.shape[0]):
    print(FreqDist(ds_comments_raw['Tokens'][i]).most_common(5))

In [None]:
from nltk.probability import FreqDist # this also uses Counter. 


unigram_list = FreqDist(' '.join(ds_comments_raw['Comments']).split()).most_common(100)
#' '.join(ds_comments_raw['Comments']).split()

unigram_df = pd.DataFrame(unigram_list, columns=["Word", "Frequency"])

#unigram_df.plot.bar(x='Word', y='Frequency', rot=0)

In [None]:
unigram_list

In [None]:
unigram_df = pd.DataFrame(unigram_list, columns=["Word", "Frequency"])

In [None]:
unigram_df.plot.bar(x='Word', y='Frequency', rot=0)

In [None]:
# Get the frequency of bigrams
from nltk.collocations import BigramCollocationFinder

text = ' '.join(ds_comments_raw['Comments'])
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(word_tokenize(text))
    
for k,v in sorted(finder.ngram_fd.items(), key=lambda item: item[1], reverse=True):
    if v > 2:
        print(k, v)

In [None]:
d = {'bigram': ['go back', 'masters degree', 'online courses', 'project work', 'get handson', 'kaggle competition', 'project work', 'sql experience', 'python data', 'back school', 'grad program'], 'Count': [6, 6, 4, 3, 3, 3, 3, 3, 3, 3, 3]}
df = pd.DataFrame(data=d)
df.plot.barh(x='bigram', y='Count', rot=0)

In [None]:
for top_level_comment in submission.comments:
    if isinstance(top_level_comment, MoreComments):
        continue
    print(top_level_comment.body)

### Data Exploration
- Look for the keywords, such as bootcamp, masters, program, learn (lectures), certified
- Build a wordcloud
- Top frequency
- n-grams (bi-gram: exponentially increases in your sample size)
- Look for stopwords (equivalent proxies, mooc)
- Look for new Data Science jobs (especially from the DataScienceJobs, or other related website)

#### What to look for?
- Look for the information to help with the models (what terms may signify the topics)
- How to extract topics? Manual label (regex)
- highly voted? highly response to? time series response? weekly hot? To reinforce the topics? No. comments.
- Think for how would I find information manually. And then automate it. 

In [None]:
# Using spacy to process the comments
import spacy
nlp = spacy.load('en_core_web_sm')
example_doc = nlp(example_comment)

In [None]:
for token in example_doc:
    if not token.is_stop and not token.is_punct:
        print(token)

In [None]:
### put this into a function
from collections import Counter
# Remove stop words and punctuation symbols
words = [token.text for token in example_doc
       if not token.is_stop and not token.is_punct]

word_freq = Counter(words)

# 5 commonly occuring words with their frequencies
common_words = word_freq.most_common(5)
print(common_words)

# Unique words
unique_words = [word for (word, freq) in word_freq.items() if freq == 1]
#print(unique_words)

In [None]:
# Extract Noun Phrases
for chunk in example_doc.noun_chunks:
    print(chunk)

In [None]:
for ent in example_doc.ents:
    print(ent.text, ent.start_char, ent.end_char,
         ent.label_, spacy.explain(ent.label_))

get a list of tokens. join every single comment together to a string. that's the corpus. str.split(). frequency distribution. Go for the bi-gram/unigram! 

what are things occured in relation to others. what grams appear in the same line with others. 
nltk. what terms occur with each other. statistical analysis with market basket analysis. the lift, support and confidence. NLP version. 

In [None]:
from nltk.probability import FreqDist # this also uses Counter. 
list_A = ['list', 'form', 'list', 'A', 'B']
FreqDist(list_A).most_common(2)

In [None]:
key_words = ['project', 'course', 'degree', 'masters', 'program', 'experience', 'bootcamp', 'courses', 
             'projects', 'certificates', 'ms', 'phd', 'kaggle', 'capstone']

'|'.join(key_words)

In [None]:
ds_df = ds_comments_raw['Comments'].str.contains('|'.join(key_words), regex = True).value_counts()
#pd.DataFrame(ds_df, columns=["Contains", "Count"])

ds_df.reset_index()

ds_df.reset_index().plot.bar(x='index', y='Comments', rot=0)

In [None]:
posts_ds['text'].str.contains('|'.join(key_words), regex = True).value_counts()

### Models (NLP)
- Topic detection
- Text classification

### Create a Job Posting Database (Reddit, Linkedin, Indeed, etc)
- Web scrape websites such as Reddit, Linkedin, Indeed for job posts related to data science
- Analyze the job description and requirements
- Create a database for the job posts

### Create a database to store the reddits

project, we have gathered some interesting findings. Initial objectives 

first component: technical work done. what kind of information. 3-4 slides

Here are some findings: 3-4 slides

some bar charts: the distribution charts.

Visuals: put on the slides



classify those who might be a good fit. 

marketting. 

what tools people are interested in 

check those who are in the industry. 

also the promotion of R in market. 

projects is thats needed

another one. stackoverflow. two more sites. 

### Meeting on 9/4/20:

Clustering method: topic modelling; similarity (cosine similarity); terms that are similar to each other; concordance and context. Don't label by hand!! Clusters are good enough. Label the clusters. Standard clustering. k-means with manhattan distance. translate words. translate word-embedding. 

database building. 

### Using TfidVectorizer and kMeans Clustering

After preprocessing the text, I will proceed to using kMeans algorithm to t cluster the comments vector. In order for kMeans to work, I need to initialize the number of centroids. I will run the algorithm multiple times and have it choose the version that has the lowest within cluster variance. 

In [None]:
kmeans = KMeans(n_clusters = 15, n_init = 5, n_jobs = -1)
kmeans.fit(X3)