# Matthew Garton

## NLP Project: Using Reddit's API for Predicting Comments

In this project, I will use NLP to classify posts I have gathered using Reddit's API as coming from one of two subreddits: r/DCcomics or r/Marvel.

Link to presentation: 

### Scraping Thread Info from Reddit.com

In [None]:
# import necessary modules
import requests
import json
import time
import pandas as pd
import datetime
import ast

In [None]:
# define the JSON listing URL for each subreddit (both over https)
url_dc = 'https://www.reddit.com/r/DCcomics.json'
url_marvel = 'https://www.reddit.com/r/Marvel.json'

In [None]:
# create my header to get my requests to work
my_header = {'User-agent': 'Matt Garton'}

In [None]:
def get_reddit_data(urls, header, size = 1000):
    '''Returns a dataframe of posts from a given list of subreddits'''
    
    posts = []
    
    for u in urls: # iterate over urls passed to function
        after = None # reset the paging token so each subreddit starts from its first page
        for _ in range(size): # loop the desired number of times
            params = {} if after is None else {'after': after}
                
            res = requests.get(u, params = params, headers = header) # get the data
            
            if res.status_code != 200: # break and provide feedback if request fails
                print(res.status_code)
                break
            
            the_json = res.json()
            posts.extend(p['data'] for p in the_json['data']['children']) # add data to posts
            after = the_json['data']['after'] # update after
            if after is None: # no further page: stop instead of requesting pages already collected
                break
                
            time.sleep(1) # pause
        
    df = pd.DataFrame(posts)
    
    now = datetime.datetime.now()
    df.to_csv('./data/reddit_data_{}{}{}_{}:{}:{}'.format(now.year, now.month, now.day, now.hour, now.minute, now.second))
    return df

comic_urls = [url_dc, url_marvel]

get_reddit_data(comic_urls, my_header)
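
The stopping rule for paging can be checked without touching the network. The sketch below is illustrative only: `collect_pages`, `fetch_page` and the fake two-page listing are hypothetical names and data, shaped like the listing JSON that the function above reads (`data.children` and `data.after`).

```python
def collect_pages(fetch_page, max_pages=1000):
    '''Collect posts page by page until the listing reports no further page.

    fetch_page is a hypothetical callable that takes the current `after`
    token (or None) and returns a dict shaped like a Reddit listing.
    '''
    posts = []
    after = None
    for _ in range(max_pages):
        listing = fetch_page(after)
        posts.extend(child['data'] for child in listing['data']['children'])
        after = listing['data']['after']
        if after is None:  # no further page, so stop requesting
            break
    return posts


# Fake two-page listing standing in for the real API (made-up data)
fake_pages = {
    None: {'data': {'children': [{'data': {'id': 'a'}}, {'data': {'id': 'b'}}], 'after': 't3_b'}},
    't3_b': {'data': {'children': [{'data': {'id': 'c'}}], 'after': None}},
}

posts = collect_pages(lambda after: fake_pages[after])
print([p['id'] for p in posts])
```

Passing the fetcher in as an argument keeps the loop testable; in the notebook the same role is played by the `requests.get` call.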

# Data Preparation

Due to an error in my initial attempts at data scraping, I had collected a large amount of data in a format I could not use. Because I had pulled the data at the wrong 'level', I was left with a dataframe in which only one column held the data I needed, and each observation in that column was a string representation of the JSON record I wanted to turn into a dataframe row. In the following cells, I demonstrate the workaround I used to convert the 'bad' data into something usable: the `literal_eval` function from Python's `ast` (Abstract Syntax Trees) module.
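
As a minimal sketch of that workaround on toy data (the two records below are invented for illustration, not taken from the scraped files), `ast.literal_eval` parses each string back into a dictionary, and the list of dictionaries then becomes a dataframe:

```python
import ast
import pandas as pd

# Toy stand-in for the 'bad' data: each cell holds the string form of a dict
bad = pd.DataFrame({'data': [
    "{'id': 'abc', 'title': 'Batman', 'selftext': None}",
    "{'id': 'def', 'title': 'Thor', 'selftext': 'Discussion'}",
]})

# literal_eval only parses Python literals (dicts, lists, strings, numbers, None),
# so it does not execute arbitrary code the way eval would
records = [ast.literal_eval(s) for s in bad['data']]
fixed = pd.DataFrame(records)
print(fixed.columns.tolist())
```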

# Do not run the 'old data' cells

I have removed the data referenced here from this repository, because the total amount of my data is too large for GitHub. I have left these cells in to demonstrate my workflow and show the workaround I used to make the 'bad' data usable. For the purposes of testing, run the `get_reddit_data` function to get new data, and proceed as normal.

In [None]:
# Read in all csv data and combine into one 
old_datasets = ['./data/dc_posts_1','./data/dc_posts_2','./data/dc_posts_3','./data/dc_posts_4',
           './data/dc_posts_5','./data/dc_posts_6','./data/marvel_posts']

old_data = [pd.read_csv(d) for d in old_datasets]

In [None]:
# for each dataframe in the list I created, convert the string in the 'data' column
# into a dictionary, then convert the result into a dataframe
# finally, put all dataframes into a list
# when I tried to do the below in one for loop over all dataframes, my Jupyter Notebook crashed
# so I did the 'brute force' approach below

data_1 = pd.DataFrame([ast.literal_eval(d) for d in old_data[0]['data']])
data_2 = pd.DataFrame([ast.literal_eval(d) for d in old_data[1]['data']])
data_3 = pd.DataFrame([ast.literal_eval(d) for d in old_data[2]['data']])
data_4 = pd.DataFrame([ast.literal_eval(d) for d in old_data[3]['data']])
data_5 = pd.DataFrame([ast.literal_eval(d) for d in old_data[4]['data']])
data_6 = pd.DataFrame([ast.literal_eval(d) for d in old_data[5]['data']])
data_7 = pd.DataFrame([ast.literal_eval(d) for d in old_data[6]['data']])

In [None]:
# concatenate the results into one dataframe (data_7 holds the Marvel posts, so it must be included)
old_data = pd.concat([data_1, data_2, data_3, data_4, data_5, data_6, data_7])

In [None]:
old_data.shape

In [None]:
old_data.drop_duplicates(subset = 'id', inplace = True)

In [None]:
old_data.shape

In [None]:
old_data.dropna(subset = ['selftext'], inplace = True)

In [None]:
old_data.shape

I found myself repeating the steps above a number of times, so I decided to write a function for my initial data cleaning. Run this cell to ensure the function is in memory.

In [None]:
def clean_reddit_data(df):
    '''Returns a cleaned dataframe with duplicates and nulls removed'''
    
    df.drop_duplicates(subset = 'id', inplace = True) # remove instances where the same post was pulled twice
    df.dropna(subset = ['selftext'], inplace = True) # remove observations which contain no data
    
    return df

# Note: Do not run the two cells below. 

Again, these are here to demonstrate my workflow. For the purposes of testing my models, it is necessary to get new data.

In [None]:
# pull in good datasets and concat to a dataframe

good_datasets = ['./data/reddit_data_201897_10:14:21', './data/reddit_data_201897_10:35:26',
                 './data/reddit_data_201897_11:10:35', './data/reddit_data_201897_11:34:3', 
                 './data/reddit_data_201897_14:34:56', './data/reddit_data_201898_14:24:10',
                './data/reddit_data_201897_15:9:56', './data/reddit_data_201899_9:37:39',
                './data/reddit_data_201899_9:46:17']

good_data = pd.concat([pd.read_csv(d, index_col = 0) for d in good_datasets])

In [None]:
data_clean = clean_reddit_data(good_data) # clean the data

# combine old and new data - repeat cleaning process for good measure
data_full = pd.concat([old_data, data_clean]) 

reddit_data = clean_reddit_data(data_full)

# EDA Part 1

I now believe I have a sufficiently sized dataset (this is the most I can get at this point without getting archives). Time for some basic EDA, starting with: what is in each column? How balanced is my data? What is the text data that I will use for my predictions?

In [None]:
reddit_data.shape

In [None]:
reddit_data.columns

In [None]:
reddit_data['subreddit'].unique()

# Baseline Accuracy

Right off the bat, I have much more DC data than Marvel Data. I am not sure what is driving this, but I want to try to get more Marvel Data to correct this.

In [None]:
print(reddit_data['subreddit'].value_counts(normalize = True))
print('')
print(reddit_data['subreddit'].value_counts(normalize = False))

In [None]:
# Re-sample (w/o replacement) for new marvel data to balance out classes.

urls = [url_marvel]
new_marvel = get_reddit_data(urls, header = my_header, size = 1000)
new_marvel = clean_reddit_data(new_marvel)
reddit_data = pd.concat([reddit_data, new_marvel])
reddit_data = clean_reddit_data(reddit_data)

In [None]:
reddit_data.shape

In [None]:
print(reddit_data['subreddit'].value_counts(normalize = True))
print('')
print(reddit_data['subreddit'].value_counts(normalize = False))

Now I have more balanced data to work with. Export to csv to keep.

In [None]:
reddit_data.to_csv('./data/reddit_data_from_notebook_3', index = False)

Pull in full dataset from last check point, rather than going through full data gathering/cleaning process again.

# WARNING: I ran into errors when I tried to follow my workflow after pulling in the data from the saved csv. 

I need to look into this, but for now, I am skipping this step and using the data I have gathered above. The biggest problem with this is that, for some reason, I need to re-sample Marvel data each time I want to build a model.

In [None]:
reddit_data = pd.read_csv('./data/reddit_data_from_notebook_3')
reddit_data.head()
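
One possible cause of the errors described above, offered as an assumption rather than a confirmed diagnosis: `pd.read_csv` reads empty fields back as `NaN` by default, so a post whose `selftext` was an empty string before saving would no longer be a string after reloading. The sketch below uses made-up rows and an in-memory buffer instead of the real file.

```python
import io
import pandas as pd

df = pd.DataFrame({'selftext': ['', 'Who wins?'], 'title': ['Cover art', 'Hulk vs Thor']})

buffer = io.StringIO()
df.to_csv(buffer, index=False)

buffer.seek(0)
reloaded = pd.read_csv(buffer)                     # empty string comes back as NaN
buffer.seek(0)
kept = pd.read_csv(buffer, keep_default_na=False)  # empty string stays ''

print(reloaded['selftext'].isna().tolist())
print(kept['selftext'].tolist())
```

If this is what happened, either `keep_default_na=False` on read or a `fillna('')` on the text columns before preprocessing would keep every value a string.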

# Preparing Data for Modeling

In order to prepare my text data for modeling, I need to take the following steps:

1. Create my X variable - I will create a 'document' for each row containing the selftext and title of each post
2. Create my Y variable - I will create a binary variable from the 'subreddit' column, indicating 1 for r/DCcomics and 0 for r/Marvel.
3. Preprocess text - convert the long string of my data into a list of lemmatized words, all lowercase with no punctuation.
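
The first two steps can be sketched on a toy dataframe. The rows and the column name `is_dc` are hypothetical; comparing the column directly is shown here as an alternative to `pd.get_dummies` that yields the same 1/0 coding.

```python
import pandas as pd

posts = pd.DataFrame({
    'subreddit': ['DCcomics', 'Marvel', 'DCcomics'],
    'title': ['Batman', 'Thor', 'Flash'],
    'selftext': ['Best run?', 'New movie', ''],
})

# Step 1: one 'document' per post, built from selftext and title
posts['data'] = posts['selftext'] + ' ' + posts['title']

# Step 2: binary target, 1 for r/DCcomics and 0 for r/Marvel
posts['is_dc'] = (posts['subreddit'] == 'DCcomics').astype(int)

print(posts['is_dc'].tolist())
```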

In [None]:
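# Combining title and selftext into one column which represents 'data'
# fillna('') guards against missing values (empty CSV fields are read back as NaN)
reddit_data['data'] = reddit_data['selftext'].fillna('') + ' ' + reddit_data['title'].fillna('')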

In [None]:
# reset index of df
reddit_data.reset_index(drop = True, inplace = True)

In [None]:
# Preprocess text

# import necessary modules
import re
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer() # needs the NLTK 'wordnet' corpus to be downloaded

#function to preprocess text
def preprocess_text(string):
    '''Generalized function to preprocess text data'''
    letters_only = re.sub("[^a-zA-Z]", " ", string) # replace everything that is not a letter (punctuation, digits) with a space
    letters_lower = letters_only.lower() # make lowercase
    words = letters_lower.split() # split into words
    lemmas = [lemmatizer.lemmatize(w) for w in words]  # lemmatize
    return (' '.join(lemmas))

In [None]:
reddit_data['data'] = reddit_data['data'].apply(preprocess_text)

In [None]:
# set y variable as dummy for subreddit
reddit_data = pd.get_dummies(reddit_data, columns = ['subreddit'])

In [None]:
# Train-test split

from sklearn.model_selection import train_test_split

X = reddit_data['data']
y = reddit_data['subreddit_DCcomics']

X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle = True, random_state = 43, stratify = y)

## NLP

#### Use `CountVectorizer` or `TfidfVectorizer` from scikit-learn to create features from the thread titles and descriptions (NOTE: Not all threads have a description)
- Examine using count or binary features in the model
- Re-evaluate your models using these. Does this improve the model performance? 
- What text features are the most valuable? 
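
To make the count versus binary distinction concrete, here is a small sketch on two made-up documents. With `binary=True`, `CountVectorizer` records only whether a word occurs, not how often.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ['batman batman joker', 'thor loki']

# Columns are ordered alphabetically: batman, joker, loki, thor
counts = CountVectorizer().fit_transform(docs).toarray()
binary = CountVectorizer(binary=True).fit_transform(docs).toarray()

print(counts[0].tolist())  # 'batman' appears twice in the first document
print(binary[0].tolist())  # presence only
```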

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

## Count Vectorizer First - remove stopwords
cvec = CountVectorizer(stop_words = 'english') # Instantiate
cvec = cvec.fit(X_train) # Fit

# Transform train and test

cvec_train = cvec.transform(X_train)
cvec_test = cvec.transform(X_test)

# Tfidf Vectorizer - remove stopwords
tfidf = TfidfVectorizer(stop_words = 'english') # Instantiate
tfidf = tfidf.fit(X_train)

# Transform train and test

tfidf_train = tfidf.transform(X_train)
tfidf_test = tfidf.transform(X_test)

# EDA Part 2 - Word Counts

In [None]:
words = cvec.get_feature_names_out() # replaces get_feature_names(), which newer scikit-learn releases removed
dense_cvec = cvec_train.todense()

In [None]:
words_df = pd.DataFrame(dense_cvec, columns = words)
words_df.head()

# total count of each word across the training documents
word_counts = words_df.sum().to_frame('count')

word_counts.sort_values(by = 'count', ascending = False).head(20)

Looking at the most frequent words reveals another problem with my data - words like 'http' and 'www' are not meaningful text. I would want to figure out how to remove these from my data to improve upon my model.
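
One way to handle this, sketched under the assumption that the unwanted tokens come from URLs in the post text, is to strip URLs before the punctuation step of preprocessing. The pattern and the helper name `strip_urls` are illustrative, not part of the original workflow.

```python
import re

# Matches http(s) links and bare www. links up to the next whitespace
URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')

def strip_urls(text):
    '''Replace URLs with a space so their pieces never reach the vectorizer.'''
    return URL_PATTERN.sub(' ', text)

sample = 'Check this out https://www.reddit.com/r/Marvel and www.dccomics.com today'
print(strip_urls(sample).split())
```

This would need to run before non-letter characters are replaced, because once the punctuation is gone a URL is no longer recognizable as one. Adding 'http' and 'www' to the vectorizer's stop words is a simpler alternative, although it leaves the remaining URL fragments in place.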

## Predicting subreddit using Random Forests + Another Classifier

In [None]:
# import necessary modules for modeling

from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, ExtraTreesClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier

#### Thought experiment: What is the baseline accuracy for this model?

As discovered earlier, the baseline accuracy for this model is the proportion of my data that is in the majority class. In this case, about 59%, since 59% of my data came from the DCcomics subreddit. 

Note again that my initial dataset was heavily unbalanced, with over 80% of observations coming from DCcomics. To address this, I simply resampled Marvel data. Since my goal here was to get as many datapoints from each subreddit as possible, I do not think that doing so is bad practice. If it weren't for Reddit's restrictions, I would have pulled an exact 50/50 split.

It is possible that the reason I initially got more DC data than Marvel data is that a larger amount of Marvel data was lost in the data cleaning process. If that is the case, then the lack of balance would actually tell me something about the difference between the two populations, and my assumption above would not be valid.
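
The baseline itself is a one-line calculation. The sketch below uses a made-up label series sized to mirror the 59% figure quoted above, not the real dataset.

```python
import pandas as pd

# Made-up labels: 59 DC posts and 41 Marvel posts
labels = pd.Series(['DCcomics'] * 59 + ['Marvel'] * 41)

shares = labels.value_counts(normalize=True)
baseline = shares.max()  # accuracy from always predicting the majority class
print(shares.idxmax(), round(baseline, 2))
```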

#### Create a `RandomForestClassifier` model to predict which subreddit a given post belongs to.

In [None]:
## RandomForestClassifier - Count Vectorizer

rf = RandomForestClassifier()
rf = rf.fit(cvec_train, y_train)

In [None]:
rf.score(cvec_test, y_test)

In [None]:
rf.n_features_in_ # number of features seen during fit (n_features_ was removed in newer scikit-learn releases)

In [None]:
from sklearn.metrics import confusion_matrix

def evaluate_model(model, test_data, test_target):
    '''Prints the accuracy of a given model and returns its
    confusion matrix as a dataframe'''
    accuracy = model.score(test_data, test_target) # Calculate accuracy
    predictions = model.predict(test_data) # Use model to predict class
    
    # Create confusion matrix from the target passed in (rows: actual, columns: predicted; 0 = Marvel, 1 = DC)
    cm = confusion_matrix(test_target, predictions)
    cm_df = pd.DataFrame(data=cm, columns=['predicted Marvel', 'predicted DC'], 
                         index=['actual Marvel', 'actual DC'])
    print('Model Accuracy: {}\n'.format(accuracy))
    return cm_df

In [None]:
evaluate_model(rf, cvec_test, y_test)

In [None]:
## RandomForestClassifier - Tfidf Vectorizer

rf = RandomForestClassifier()
rf.fit(tfidf_train, y_train)

In [None]:
evaluate_model(rf, tfidf_test, y_test) # evaluate on the tf-idf features the model was trained on

In [None]:
features = rf.feature_importances_
features
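
The raw importance array is hard to read without the words attached. A minimal sketch on toy documents (the data, labels and variable names are invented for illustration) pairs each importance with its word and sorts the result:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

docs = ['batman gotham', 'superman metropolis', 'thor asgard', 'hulk avengers']
target = [1, 1, 0, 0]  # 1 = DC, 0 = Marvel (toy labels)

vec = CountVectorizer()
X_toy = vec.fit_transform(docs)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_toy, target)

# vocabulary_ maps each word to its column index, so sorting the words by
# that index recovers the column order of the feature matrix
words = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
ranked = pd.Series(forest.feature_importances_, index=words).sort_values(ascending=False)
print(ranked.head())
```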

#### Use cross-validation in scikit-learn to evaluate the model above. 
- Evaluate the accuracy of the model, as well as any other metrics you feel are appropriate. 
- **Bonus**: Use `GridSearchCV` with `Pipeline` to optimize your `CountVectorizer`/`TfidfVectorizer` and classification model.
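
A minimal sketch of cross-validating a vectorizer and classifier together, using toy documents and labels rather than the scraped data. The choice of `LogisticRegression` and two folds is arbitrary here.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

docs = ['batman gotham joker', 'superman krypton', 'flash central city', 'wonder woman amazon',
        'thor asgard loki', 'hulk smash avengers', 'iron man stark', 'spider man parker']
target = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = DC, 0 = Marvel (toy labels)

# Keeping the vectorizer inside the pipeline means it is refit on each
# training fold, so the held-out fold never influences the vocabulary
pipe = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('model', LogisticRegression()),
])

scores = cross_val_score(pipe, docs, target, cv=2)
print(scores.mean())
```

On real data, the raw text in `X_train` would take the place of `docs`, and a larger `cv` value would give a more stable estimate.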

#### Repeat the model-building process using a different classifier (e.g. `MultinomialNB`, `LogisticRegression`, etc)

In [None]:
## MultinomialNB

mnb = MultinomialNB()
mnb = mnb.fit(cvec_train, y_train)
evaluate_model(mnb, cvec_test, y_test)

In [None]:
# LogisticRegression

lr = LogisticRegression()
lr = lr.fit(cvec_train, y_train)
evaluate_model(lr, cvec_test, y_test)

In [None]:
# K-Nearest Neighbors

knn = KNeighborsClassifier()
knn = knn.fit(cvec_train, y_train)
evaluate_model(knn, cvec_test, y_test)

In [None]:
## MultinomialNB - Tfidf Vectorizer

mnb = MultinomialNB()
mnb = mnb.fit(tfidf_train, y_train)
evaluate_model(mnb, tfidf_test, y_test) # evaluate on tf-idf features, matching the training data

In [None]:
# LogisticRegression - Tfidf Vectorizer

lr = LogisticRegression()
lr = lr.fit(tfidf_train, y_train)
evaluate_model(lr, tfidf_test, y_test) # evaluate on tf-idf features, matching the training data

In [None]:
# K-Nearest Neighbors - Tfidf Vectorizer

knn = KNeighborsClassifier()
knn = knn.fit(tfidf_train, y_train)
evaluate_model(knn, tfidf_test, y_test) # evaluate on tf-idf features, matching the training data

# Permutations of Models to Test

I will follow the below framework when I return to this project to clean it up and finish the research.

## Vectorizers

1. CountVectorizer (binary - yes or no; n-gram range)
2. TfidfVectorizer

## Models

1. RandomForestClassifier
2. KNeighborsClassifier (n_neighbors; distance metric; weights)
3. LogisticRegression
4. MultinomialNB
5. Support Vector Machines


# Output

## For each model, produce:

1. Score
2. Confusion Matrix
3. ROC-AUC curve
4. Best Params
5. Best Features
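
A rough sketch of how most of these outputs could be collected for one vectorizer and model pairing. Everything here is illustrative: the toy documents, the `report` dictionary and the small `alpha` grid are assumptions, the ROC-AUC entry is the score rather than a plotted curve, and best features are left out because how to extract them depends on the model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

train_docs = ['batman gotham', 'superman krypton', 'flash speed', 'joker gotham',
              'thor asgard', 'hulk avengers', 'loki asgard', 'stark avengers']
train_y = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = DC, 0 = Marvel (toy labels)
test_docs = ['gotham batman', 'asgard thor']
test_y = [1, 0]

pipe = Pipeline([('vectorizer', TfidfVectorizer()), ('model', MultinomialNB())])
grid = GridSearchCV(pipe, {'model__alpha': [0.1, 1.0]}, cv=2)
grid.fit(train_docs, train_y)

report = {
    'score': grid.score(test_docs, test_y),
    'confusion_matrix': confusion_matrix(test_y, grid.predict(test_docs)),
    'roc_auc': roc_auc_score(test_y, grid.predict_proba(test_docs)[:, 1]),
    'best_params': grid.best_params_,
}
print(report)
```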

In [None]:
from sklearn.pipeline import Pipeline

steps = [
    ("vectorizer", CountVectorizer()),
    ("knn", KNeighborsClassifier())
]

pipe = Pipeline(steps)

grid_params = {
    "knn__n_neighbors": [3,10,20],
    "knn__weights": ["distance", "uniform"]
}

gs = GridSearchCV(pipe, grid_params, verbose=2)
results = gs.fit(X_train, y_train)

In [None]:
results.score(X_test, y_test)

# Executive Summary
---

    The data science problem I attempted to address in this project was: _What characteristics of a post on Reddit best indicate which subreddit it belongs to?_ My approach was to collect as many posts as possible from two subreddits using Reddit's API, and then to apply NLP and classification models to that data. My goals were the following: 1) Build the best prediction model possible, 2) Learn what characteristics were most useful in predicting a subreddit, 3) Automate and streamline my workflow as much as possible by building functions to handle repeated tasks, 4) Learn how to leverage Pipelines and GridSearchCV, as well as review the various models I have learned thus far.

    The two subreddits I chose to use for this analysis were r/DCcomics and r/Marvel. I thought it might be a relatively easy problem for a human (with enough subject matter knowledge) to distinguish between the two, and I wondered if the same was true for a machine learning algorithm. I also wanted to explore which features would be most useful in predicting between the two - are they the things I would expect to be most obvious, such as character names? Additionally, I am interested in looking at how a model would change if the 'proper noun' features were removed.
    
    The major problem encountered in this project was in data gathering. First, Reddit only retains 1000 of the most recent posts, so there is an upper limit on the amount of data that can be obtained without learning how to access the archives. Compounding the problem was the fact that my function did not 'break' as I would expect when reaching the end of the available posts - it looped back up to the beginning. Looking through the posts on each subreddit, I noticed a further problem: many of the posts did not contain text data at all - they were just pictures. I think this could be a unique problem with using comic book subreddits, but I am not sure. 
    
    The second problem I encountered with my data is that in my initial data pulls, I was saving the data to a dataframe at the wrong level, so I needed to figure out how to extract the data that I needed (this was necessary because I could not get enough 'new' data, so I needed to access the 'old' data). I used the `literal_eval` function from Python's `ast` (Abstract Syntax Trees) module to accomplish this.
    
    Once I had all of the data that I needed, I began building prediction models. I used both CountVectorizer and TfidfVectorizer to transform my text data, and tested several models: Random Forests, Multinomial Naive Bayes, Logistic Regression, and K Nearest Neighbors. All models improved on the 'Baseline Accuracy', but Logistic Regression with Count Vectorized data was the best model, with an accuracy of ~91%. 
    
    The primary lesson I learned from this project is to carefully examine data as I am collecting it to catch problems early. In terms of modeling, I was ultimately unable to do the modeling practice that I was hoping to accomplish with this project, due to time constraints, so I did not learn much in terms of model building or hyperparameter tuning.
