# Week 6 Assignment - MSDS 682

## Summary:

The dataset used in this assignment was collected from [r/datascience on Reddit](https://www.reddit.com/r/datascience/) for the period of April 8-14, 2019, using the Pushshift API. It contains information about each thread posted in the subreddit, such as the date the post was created, the author, the title, the flair (topic category), and the number of comments. I explore frequently occurring single words, bigrams, and trigrams, use predictive modeling to determine the flair category of a post, and use topic modeling to find distinct subject groups.

In [1]:
import numpy as np
import pandas as pd
import string
import spacy

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
from nltk.corpus import stopwords
from nltk.util import ngrams

en_stopwords = stopwords.words('english')  
stopwords = set(en_stopwords)  #stopwords has split contractions in it
nlp = spacy.load("en")

### Text Processing

This section of the notebook loads in the data gathered from r/datascience on Reddit. The dataset has 236 rows and 9 columns. I then modified the `cleanCorpus` function from previous assignments and ran it on the `title` column of the dataframe, which contains the text from each thread post title in the subreddit. Once processed, I checked for thread title duplication and removed rows in which a user posted the same thread twice.

In [2]:
#read in data collected from Reddit stored as csv file
ds_df = pd.read_csv("datafrom_r_datascience.csv")
ds_df.shape

(236, 9)

In [3]:
#verify first 5 rows of data
ds_df.head()

Unnamed: 0,id,created_utc,author,can_mod_post,link_flair_text,title,num_comments,score,permalink
0,bamzb9,1554682018,bbbobbyb,False,,Is it hard to get into a Masters in Data Scien...,1,1,/r/datascience/comments/bamzb9/is_it_hard_to_g...
1,ban361,1554682674,bobbyb2222,False,Education,Is it hard to get into a Masters in Data Scien...,2,1,/r/datascience/comments/ban361/is_it_hard_to_g...
2,banj60,1554685531,rdy1107,False,,Just finished an academic thesis on technical ...,1,1,/r/datascience/comments/banj60/just_finished_a...
3,banxus,1554688162,TwoToneDonut,False,Education,EdX Microsoft Professional Program - Data Scie...,0,1,/r/datascience/comments/banxus/edx_microsoft_p...
4,baod6v,1554690889,CharlesPolley,False,Discussion,"10 Data Structure, Algorithms, and SQL Courses...",1,0,/r/datascience/comments/baod6v/10_data_structu...


In [4]:
"""
This function will take in each row of a pandas dataframe column (corpus) that contains each document (title text) as a string after making the text lowercase, 
removing punctuation/digits/special characters, converting into a spaCy object (tokenizing), removing stopwords, and rejoining as a string.
The final string product is then returned to the dataframe.
"""

#function to lowercase, remove punctuation/digits, tokenize (split into words), and remove stopwords
def cleanCorpus(corpus):

    #remove punctuation, digits, and non-standard single quotes/apostrophes
    rmv = str.maketrans({key: None for key in string.punctuation + string.digits + "‘’"})

    #make string lowercase                    
    lower_str = corpus.translate(rmv).lower()

    #make string into a spaCy object
    nlpdoc = nlp(lower_str)

    #holds tokens that are not stopwords
    clean_wordls = []

    #each token in the spaCy object
    for token in nlpdoc:
        
        #get the text for the token
        tknwd = token.text
        
        #list of exception words
        exceptions = ['but', 'nor', 'not']
        
        #check to see if the token is a stopword
        if tknwd in stopwords:
            
            #check to see if token is an exception word
            #if True, then add to the clean words list
            #if False, skip the word; don't add to list
            if tknwd in exceptions: clean_wordls.append(tknwd)
            else: pass
        
        #if word is not a stopword, add to clean words list
        else:
            clean_wordls.append(tknwd)

    #join list of cleaned words together into one string
    clean_str = " ".join(clean_wordls)  
    
    #returned output is an item from corpus as a string
    return clean_str
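
One side effect of the translation table is easier to see in isolation: punctuation and digits are deleted outright rather than replaced with a space, so the whitespace around them is left behind. That is why some cleaned titles later show runs of spaces. The sketch below uses only the standard library; `strip_chars` and `collapse_spaces` are hypothetical helper names, and the whitespace-collapsing step is an optional extra that the notebook does not perform.

```python
import string

# Same idea as the table in cleanCorpus: map every unwanted character to None
RMV = str.maketrans({ch: None for ch in string.punctuation + string.digits + "‘’"})

def strip_chars(text):
    """Delete punctuation/digits/curly quotes, then lowercase (hypothetical helper)."""
    return text.translate(RMV).lower()

def collapse_spaces(text):
    """Optional extra step (an assumption, not in the notebook): squeeze whitespace runs."""
    return " ".join(text.split())

raw = "EdX Program - Data Science"
stripped = strip_chars(raw)

# repr() makes the leftover double space visible
print(repr(stripped))
print(repr(collapse_spaces(stripped)))
```

Because the later steps tokenize the text again, the extra spaces do not change the word counts; collapsing them would only make the intermediate strings easier to read.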

In [5]:
#make a copy of original dataframe
clean_ds_df = ds_df.copy()

In [6]:
#run the cleanCorpus function on the thread title text
clean_ds_df['title'] = clean_ds_df['title'].apply(cleanCorpus)

In [7]:
#verify cleaned text in new dataframe
clean_ds_df.head()

Unnamed: 0,id,created_utc,author,can_mod_post,link_flair_text,title,num_comments,score,permalink
0,bamzb9,1554682018,bbbobbyb,False,,hard get masters data science program,1,1,/r/datascience/comments/bamzb9/is_it_hard_to_g...
1,ban361,1554682674,bobbyb2222,False,Education,hard get masters data science program,2,1,/r/datascience/comments/ban361/is_it_hard_to_g...
2,banj60,1554685531,rdy1107,False,,finished academic thesis technical analysis wr...,1,1,/r/datascience/comments/banj60/just_finished_a...
3,banxus,1554688162,TwoToneDonut,False,Education,edx microsoft professional program data scie...,0,1,/r/datascience/comments/banxus/edx_microsoft_p...
4,baod6v,1554690889,CharlesPolley,False,Discussion,data structure algorithms sql courses crack ...,1,0,/r/datascience/comments/baod6v/10_data_structu...


In [8]:
#duplicated thread titles
#some users try to post thread more than once to get it noticed
clean_ds_df.loc[clean_ds_df['title'].duplicated()]

Unnamed: 0,id,created_utc,author,can_mod_post,link_flair_text,title,num_comments,score,permalink
1,ban361,1554682674,bobbyb2222,False,Education,hard get masters data science program,2,1,/r/datascience/comments/ban361/is_it_hard_to_g...
10,barv7c,1554717195,plexex,False,Discussion,best tool learn data science,2,1,/r/datascience/comments/barv7c/the_best_tool_t...
21,bauj4r,1554734042,arnauda9,False,Discussion,apache airflow distributes jobs celery workers,0,2,/r/datascience/comments/bauj4r/how_apache_airf...
30,baw5jz,1554742284,arnauda9,False,,apache airflow distributes jobs celery workers,0,1,/r/datascience/comments/baw5jz/how_apache_airf...
55,bb5le5,1554800041,syslynx,False,Projects,guys hosting first presentation care help,2,1,/r/datascience/comments/bb5le5/guys_im_hosting...
65,bb8a1k,1554818419,jeetugalav,False,Projects,state data analytics,1,1,/r/datascience/comments/bb8a1k/state_of_data_a...
72,bb9re3,1554826110,awkwardable,False,Career,data analyst intern vs data science intern,2,0,/r/datascience/comments/bb9re3/data_analyst_in...
81,bbegxh,1554849706,generalizederror,False,Job Search,machine learning research job interview experi...,0,2,/r/datascience/comments/bbegxh/d_my_machine_le...
133,bbtnad,1554942643,Chaostorrent48,False,Education,given months free time data science,2,1,/r/datascience/comments/bbtnad/what_to_do_give...
149,bbza4h,1554985722,multiks2200,False,Discussion,data scientist statistical tests one need know,3,1,/r/datascience/comments/bbza4h/as_a_data_scien...


In [9]:
#keep only the first row for each (title, author) pair
#the same title posted by a different author is not treated as a duplicate
nodupe_ds_df = clean_ds_df.loc[~clean_ds_df[['title', 'author']].duplicated()]

In [10]:
#length of original dataframe
len(clean_ds_df)

236

In [11]:
#length of dataframe without duplicates
len(nodupe_ds_df)

229
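
Ten rows were flagged when checking `title` alone, but only seven rows were removed, because the filter compares the `title` and `author` columns together. The following is a minimal standard-library sketch of that "keep the first occurrence" behavior, using three titles and authors taken from the output above; `first_occurrences` is a hypothetical helper, not a pandas function.

```python
def first_occurrences(rows, keys):
    """Keep the first row seen for each combination of the given keys."""
    seen = set()
    kept = []
    for row in rows:
        combo = tuple(row[name] for name in keys)
        if combo not in seen:
            seen.add(combo)
            kept.append(row)
    return kept

rows = [
    {"author": "bbbobbyb", "title": "hard get masters data science program"},
    {"author": "bobbyb2222", "title": "hard get masters data science program"},
    {"author": "arnauda9", "title": "apache airflow distributes jobs celery workers"},
    {"author": "arnauda9", "title": "apache airflow distributes jobs celery workers"},
]

by_title = first_occurrences(rows, ["title"])
by_title_and_author = first_occurrences(rows, ["title", "author"])

print(len(by_title), len(by_title_and_author))
```

Under the title-and-author rule, the repeated title from two different usernames survives, which matches the repeated first entry in the title list in the next section.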

### Explore Data

This section examines the frequency of words, bigrams, and trigrams in the title text data. First, I converted the `title` column in the dataframe to a list, then joined all the list items into one string. The string was passed through the NLTK `word_tokenize` function, which returned a list of all the tokens in the dataset. With all the tokens in a single list, I could get a frequency distribution and find the top 10 occurring words in the dataset. Then, for every thread title, I collected the possible bigrams and trigrams and calculated their frequencies as well.

In [12]:
#make a list out of the 'title' column
#each list item is a thread title
title_ls = list(nodupe_ds_df['title'])

In [13]:
#first 5 items in list
title_ls[:5]

['hard get masters data science program',
 'hard get masters data science program',
 'finished academic thesis technical analysis written python would like get eyes   go check code accompanying pdf',
 'edx microsoft professional program   data science last class selection suggestions',
 '  data structure algorithms sql courses crack programming job interview']

In [14]:
#put all items together into a whole string
title_str = " ".join(title_ls)

In [15]:
#use word_tokenize function
#will convert string into a list
#each list item is a token(word)
all_tknz = word_tokenize(title_str)

In [16]:
#top 10 occurring words in thread titles
FreqDist(all_tknz).most_common(10)

[('data', 128),
 ('science', 61),
 ('scientist', 15),
 ('job', 14),
 ('python', 12),
 ('know', 12),
 ('best', 10),
 ('program', 9),
 ('vs', 9),
 ('career', 9)]

In [17]:
#create bigrams from each thread title
#append all bigrams from entire dataset into a list
all_bgs =[]

for topic in title_ls:
    
    title_tknz = word_tokenize(topic)
    
    for bigram in ngrams(title_tknz, 2):
        all_bgs.append(bigram)

In [18]:
#list of all bigrams in dataset
all_bgs[:5]

[('hard', 'get'),
 ('get', 'masters'),
 ('masters', 'data'),
 ('data', 'science'),
 ('science', 'program')]

In [19]:
#top 10 occurring bigrams in thread titles
FreqDist(all_bgs).most_common(10)

[(('data', 'science'), 59),
 (('data', 'scientist'), 15),
 (('data', 'scientists'), 9),
 (('data', 'analyst'), 6),
 (('machine', 'learning'), 4),
 (('science', 'program'), 3),
 (('would', 'like'), 3),
 (('online', 'courses'), 3),
 (('resume', 'review'), 3),
 (('science', 'career'), 3)]

In [20]:
#create trigrams from each thread title
#append all trigrams from entire dataset into a list
all_tgs =[]

for topic in title_ls:
    
    title_tknz = word_tokenize(topic)
    
    for trigram in ngrams(title_tknz, 3):
        all_tgs.append(trigram)


In [21]:
#top 10 occurring trigrams in thread titles
FreqDist(all_tgs).most_common(10)

[(('data', 'science', 'program'), 3),
 (('data', 'science', 'career'), 3),
 (('data', 'science', 'jobs'), 3),
 (('hard', 'get', 'masters'), 2),
 (('get', 'masters', 'data'), 2),
 (('masters', 'data', 'science'), 2),
 (('best', 'tool', 'learn'), 2),
 (('tool', 'learn', 'data'), 2),
 (('learn', 'data', 'science'), 2),
 (('apache', 'airflow', 'distributes'), 2)]
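
The n-grams above are built one title at a time, so no bigram or trigram spans the boundary between two unrelated titles. Here is a rough standard-library sketch of the same counting idea. It assumes plain `str.split` is an acceptable stand-in for `word_tokenize` on already-cleaned text, and `title_ngrams` is a hypothetical helper rather than the NLTK function.

```python
from collections import Counter

def title_ngrams(tokens, n):
    """Return consecutive n-token tuples from a single title's token list."""
    return zip(*(tokens[i:] for i in range(n)))

# Two cleaned titles taken from the dataset above
titles = [
    "hard get masters data science program",
    "best tool learn data science",
]

bigram_counts = Counter()
for title in titles:
    # Counting per title keeps ('program', 'best') from being created
    bigram_counts.update(title_ngrams(title.split(), 2))

print(bigram_counts.most_common(1))
```

Joining all titles into one string first, as was done for the single-word counts, would be harmless for unigrams but would introduce spurious n-grams at every title boundary.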

### Text Analytics

To use predictive modeling on the dataset with `link_flair_text` (flair) as the target, I subsetted the data to include only rows in which the flair was not missing. Flairs indicate the type of thread that was created (for example, to differentiate serious discussion threads from upbeat memes), but not all thread posts are required to have one. I mapped each flair category to a numerical value to use in the model and then split the data into training and testing sets. The title text in each set was converted into a TF-IDF matrix, and the training matrix was used to build a decision tree model. I then created another TF-IDF matrix from all of the flaired title text rows and used the k-means algorithm to find 5 clusters. After getting the cluster predictions, I evaluated the clustering with a silhouette score and analyzed the most frequent terms in each cluster for topic modeling.

In [22]:
#drop rows of data without a flair
#will use for Decision Tree model
flair_df = nodupe_ds_df.loc[nodupe_ds_df['link_flair_text'].notnull()]

In [23]:
#dataframe with only 'link_flair_text' and 'title' columns
flair_df = flair_df[['link_flair_text', 'title']]

In [24]:
#count number of each flair category
flair_df['link_flair_text'].value_counts()

Discussion    38
Education     37
Career        28
Projects      24
Job Search     7
Tooling        7
Fun/Trivia     2
Meta           1
Name: link_flair_text, dtype: int64

In [25]:
#verify first 5 rows in flair dataframe
flair_df.head()

Unnamed: 0,link_flair_text,title
1,Education,hard get masters data science program
3,Education,edx microsoft professional program data scie...
4,Discussion,data structure algorithms sql courses crack ...
6,Career,working japan tell job
8,Discussion,python vs r


In [26]:
#put each distinct flair category name into a list
flair_names = list(flair_df['link_flair_text'].unique())

In [27]:
#verify this is a list
type(flair_names)

list

In [28]:
#first item in list
flair_names[0]

'Education'

In [29]:
#all flair category names
flair_names

['Education',
 'Discussion',
 'Career',
 'Projects',
 'Tooling',
 'Job Search',
 'Meta',
 'Fun/Trivia']

In [30]:
#use flair name index positions to create a dictionary
#dict_key will be the flair name, dict_value is the index position
target_val = {}

for flair in flair_names:
    target_val[flair] = flair_names.index(flair)

In [31]:
#full dictionary of flair name(key) and category number(value)
target_val

{'Career': 2,
 'Discussion': 1,
 'Education': 0,
 'Fun/Trivia': 7,
 'Job Search': 5,
 'Meta': 6,
 'Projects': 3,
 'Tooling': 4}
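
The same mapping can be built in one pass with `enumerate`, which avoids calling `list.index` for every flair. The sketch below also builds the reverse lookup, which is handy for translating the numeric labels in the classification report back into flair names. The name `target_to_flair` is an illustrative assumption and is not used elsewhere in the notebook.

```python
# Flair names in the order they first appear in the dataframe (copied from above)
flair_names = ["Education", "Discussion", "Career", "Projects",
               "Tooling", "Job Search", "Meta", "Fun/Trivia"]

# flair name -> category number
target_val = {name: idx for idx, name in enumerate(flair_names)}

# category number -> flair name (hypothetical reverse lookup)
target_to_flair = {idx: name for name, idx in target_val.items()}

print(target_val["Career"], target_to_flair[2])
```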

In [32]:
#create column to hold the flair category numbers
#numbers will be used to train the model (since model can't take string values)
#dictionary will look for flair name then assign 'target' column the dict_val
flair_df['target'] = flair_df['link_flair_text'].map(target_val)

In [33]:
#verify new column in dataframe
flair_df.head()

Unnamed: 0,link_flair_text,title,target
1,Education,hard get masters data science program,0
3,Education,edx microsoft professional program data scie...,0
4,Discussion,data structure algorithms sql courses crack ...,1
6,Career,working japan tell job,2
8,Discussion,python vs r,1


In [34]:
#get rid of flair name (string) column
#model_df is dataframe that is for the machine learning models
model_df = flair_df.drop('link_flair_text', axis=1)

In [35]:
#variable that only has the 'title' column
#will be used for train/test split and to create matrix for k-means
X = model_df['title']

In [36]:
#training set=60%, test set=40%
#X is "title", y is "target"
X_train, X_test, y_train, y_test = train_test_split(X, model_df['target'], random_state=90, test_size=0.4)

In [37]:
#initialize TF-IDF vectorizer function to a variable
tfidf_vec = TfidfVectorizer()

In [38]:
#turn text data into a TF-IDF matrix
#each column is a "term"(word)
#each row is a thread title
X_train_mtx = tfidf_vec.fit_transform(X_train)

In [39]:
#86 rows, 365 columns
X_train_mtx.shape

(86, 365)
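
To make the matrix less of a black box, here is a small pure-Python sketch of the weighting that, to my understanding, `TfidfVectorizer` applies with its default settings: tokens of two or more word characters, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The function `tfidf_rows` is a hypothetical illustration, not scikit-learn code, and the three documents are toy titles.

```python
import math
import re
from collections import Counter

# Default-style token pattern: runs of at least two word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tfidf_rows(docs):
    """Return one {term: weight} dict per document (toy-sized inputs only)."""
    tokenized = [TOKEN_PATTERN.findall(doc.lower()) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term
    df = Counter(term for toks in tokenized for term in set(toks))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    rows = []
    for toks in tokenized:
        weights = {term: tf * idf[term] for term, tf in Counter(toks).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows

docs = ["python vs r", "data science program", "data science career"]
rows = tfidf_rows(docs)
print(sorted(rows[0]))
```

One consequence of the two-character token rule is that the single-letter token "r" never becomes a column, so a title such as "python vs r" is represented only by "python" and "vs". Terms that appear in fewer documents, such as "program" here, receive a larger weight than terms shared across documents, such as "data".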

#### Decision Tree

In [40]:
#initialize decision tree model
tree = DecisionTreeClassifier()

In [41]:
#build the tree model using the training data
tree.fit(X_train_mtx, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [42]:
#model has a ~99% accuracy score on the training data
tree.score(X_train_mtx, y_train)

0.9883720930232558

In [43]:
#turn the test data into a TF-IDF matrix
X_test_mtx = tfidf_vec.transform(X_test)

In [44]:
#feed the test data into the model
#generate predictive output into variable "y_predict"
y_predict = tree.predict(X_test_mtx)

In [45]:
#model had a ~17% accuracy score on the test data
tree.score(X_test_mtx, y_test)

0.1724137931034483

In [46]:
#predictive scores from decision tree model by category
print(classification_report(y_test, y_predict))



              precision    recall  f1-score   support

           0       0.21      0.36      0.26        14
           1       0.12      0.21      0.15        14
           2       0.67      0.15      0.25        13
           3       0.00      0.00      0.00        10
           4       0.00      0.00      0.00         5
           5       0.00      0.00      0.00         2
           7       0.00      0.00      0.00         0

   micro avg       0.17      0.17      0.17        58
   macro avg       0.14      0.10      0.09        58
weighted avg       0.23      0.17      0.16        58



In [47]:
#verify number of each category in test set
y_test.value_counts()

1    14
0    14
2    13
3    10
4     5
5     2
Name: target, dtype: int64

In [48]:
#verify number of each category in training set
y_train.value_counts()

1    24
0    23
2    15
3    14
5     5
7     2
4     2
6     1
Name: target, dtype: int64
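
A quick sanity check on the test accuracy is to compare it with a baseline that always predicts the most common training class. The sketch below is plain arithmetic on the counts printed above; it does not refit or rerun any model.

```python
from collections import Counter

# Class counts copied from the y_train and y_test outputs above
train_counts = Counter({1: 24, 0: 23, 2: 15, 3: 14, 5: 5, 7: 2, 4: 2, 6: 1})
test_counts = Counter({1: 14, 0: 14, 2: 13, 3: 10, 4: 5, 5: 2})

# Most frequent class in the training set
majority_class, _ = train_counts.most_common(1)[0]

# Accuracy of always predicting that class on the test set
baseline_acc = test_counts[majority_class] / sum(test_counts.values())

print(f"Always predicting class {majority_class}: {baseline_acc:.3f}")
```

With these counts the baseline comes out higher than the decision tree's test score of about 0.17, so the tree did not generalize beyond what guessing the largest class would achieve.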

#### K-Means Clustering

In [49]:
#initialize the TF-IDF vectorizer
tfidf_vec = TfidfVectorizer()

In [50]:
#transform all of the flaired thread titles into a TF-IDF matrix (unsupervised learning)
X_mtx = tfidf_vec.fit_transform(X)

In [51]:
#save the KMeans algorithm into a variable
#initialized with 5 clusters and a fixed random_state for reproducible results
kmeans5 = KMeans(n_clusters=5, random_state=90)

In [52]:
#build the model with the data
kmeans5.fit(X_mtx)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=5, n_init=10, n_jobs=None, precompute_distances='auto',
    random_state=90, tol=0.0001, verbose=0)

In [53]:
#predict which clusters the thread titles belong to
cluster_pred = kmeans5.predict(X_mtx)

In [54]:
#verify first 5 data points' cluster assignments
cluster_pred[:5]

array([0, 0, 4, 4, 1])

In [55]:
#get the silhouette score for the model    
sil_avg = silhouette_score(X_mtx, cluster_pred)

print(f"For 5 clusters, the average silhouette score is: {sil_avg}")

For 5 clusters, the average silhouette score is: 0.014429950780745184
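
For context on how to read that number: each point's silhouette is `(b - a) / max(a, b)`, where `a` is its mean distance to the other points in its own cluster and `b` is its mean distance to the nearest other cluster. Values near 1 mean tight, well-separated clusters, and values near 0 mean the clusters overlap. The following is a toy pure-Python sketch of that formula with Euclidean distance; `silhouette_mean` is a hypothetical helper, not the scikit-learn implementation, and the four points are made up for illustration.

```python
import math

def silhouette_mean(points, labels):
    """Mean silhouette coefficient with Euclidean distance (toy-sized inputs only)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention for single-point clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(point, q) for q in members) / len(members)
            for other, members in clusters.items() if other != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Two obvious groups, far apart on the x-axis (made-up points)
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
well_separated = silhouette_mean(points, [0, 0, 1, 1])
scrambled = silhouette_mean(points, [0, 1, 0, 1])

print(round(well_separated, 3), round(scrambled, 3))
```

The sensible labeling scores close to 1 and the scrambled labeling scores below 0. A score of about 0.014 for the thread titles therefore sits much closer to the overlapping end of the scale.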


In [56]:
#print the top 10 words for each cluster from the KMeans model (5 clusters)

k5cluster_labels = list(set(cluster_pred))

for cluster_label in k5cluster_labels:
    
    cluster_titles = np.array(X)[cluster_pred == cluster_label]
    
    top10 = FreqDist(word_tokenize(' '.join(cluster_titles))).most_common(10)
    
    print(f"The Top 10 words in Cluster {cluster_label}:")
    
    for word in top10:
        print(word[0])
        
    print("\n")

The Top 10 words in Cluster 0:
data
science
program
best
learn
vs
project
hard
tool
school


The Top 10 words in Cluster 1:
python
vs
r
decision
data
web
scraping
using
opensource
commercial


The Top 10 words in Cluster 2:
data
best
scientist
help
jobs
looking
actually
scientists
first
analysis


The Top 10 words in Cluster 3:
free
data
analytics
science
given
months
time
webinar
hr
new


The Top 10 words in Cluster 4:
data
job
learning
interview
machine
courses
science
ds
research
structure




## Conclusion:

r/datascience is a fairly active subreddit, and even though the dataset was small, covering only a week's worth of posts, I was still able to find some distinctive trends. Most of the flaired posts are labeled "Discussion", "Education", "Career", or "Projects". Unsurprisingly given the subreddit's name, "data" and "science" are the most common words in the thread titles, but other noteworthy words such as "scientist", "job", "python", and "career" show that many posts focus on job positions and the tools used in data science rather than on general discussion of the field. The bigram and trigram output reflects this as well.

The decision tree model did not predict the flair category well from the TF-IDF matrix: it scored about 99% accuracy on the training data but only about 17% on the test data, which suggests it memorized the training titles rather than learning patterns that generalize. This is most likely because there were few data points to learn from, both in the data as a whole and for each particular flair. Flairs such as "Meta" and "Fun/Trivia" had too few examples for the model to learn patterns from, so it was not surprising that the model performed poorly overall. However, "Discussion", "Education", and "Career" seemed distinct enough for the model to pick up some patterns, most notably the relatively high precision (0.67) for the "Career" flair, although its recall was only 0.15.

Among the clusters predicted by the k-means model, clusters 0-2 seemed to have more cohesive themes than clusters 3 and 4. Cluster 0 seems to represent questions about which data science education program is best for the user asking. Cluster 1 reflects the comparison of the Python and R languages (users ask quite often which one they should learn), as well as other tools related to working on a personal portfolio project. Cluster 2 is slightly murky, but I think it reveals a topic related to people looking for their first data scientist jobs (or possibly asking what to expect in their first data scientist job). The terms "job" and "interview" show up in Cluster 4, which makes it look career/job related as well, but words such as "machine", "learning", "courses", and "research" seem to indicate a mixture of topics in this cluster. We possibly need more clusters to form more distinct topic groups, which is also reflected in the low silhouette score.

### References:

Loot, R. (2018, October 28). Using Pushshift's API to extract Reddit Submissions. Retrieved from https://medium.com/@RareLoot/using-pushshifts-api-to-extract-reddit-submissions-fb517b286563