<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# DSI 9 Project 3: Classification of Reddit Posts
Author: Jordan Bai

Objectives

- Collect posts from two subreddits using Reddit's API.
- Use NLP to train a classifier on which subreddit a given post came from. This is a binary classification problem.

In [1]:
import requests
import time
import pandas as pd
import numpy as np

### 1. Collecting Posts from Two Subreddits Using Reddit's API

The subreddits of choice were **NBA** and **MLB**. These are sports discussion communities for two different leagues, the National Basketball Association and Major League Baseball. To access the API, `.json` is added to the end of the URL: https://www.reddit.com/r/mlb.json

Reddit gives 25 posts **per request**. To get enough data, Reddit's API was hit **repeatedly** in a `for` loop, passing the `after` token from each response to request the next page. A `time.sleep()` call was added at the end of the loop to allow for a break in between requests.

In [2]:
posts = []
urls = ['https://www.reddit.com/r/mlb.json', 'https://www.reddit.com/r/nba.json']

for url in urls:
    after = None  # reset the pagination token for each subreddit
    for i in range(40):
        # The first request has no 'after' token; later requests page forward from it.
        params = {} if after is None else {'after': after}

        res = requests.get(url, params=params, headers={'User-agent': 'mozl'})

        if res.status_code == 200:
            data = res.json()
            posts.extend(data['data']['children'])
            after = data['data']['after']
            if after is None:  # no further pages to request
                break
        else:
            print(res.status_code)
            break
        time.sleep(1)  # pause between requests

ConnectionError: HTTPSConnectionPool(host='www.reddit.com', port=443): Max retries exceeded with url: /r/mlb.json (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x00000201A13701D0>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed'))

### 2. Data Cleaning and Extraction

As many of the posts were images, this project focuses on the thread **titles**. Each title was extracted together with the subreddit it belonged to. The sample sizes of the two subreddits (NBA: 972 vs MLB: 983) were checked to be similar in order to avoid a class imbalance problem.

In [None]:
df = pd.DataFrame([{'title':p['data']['title'], 'subreddit':p['data']['subreddit']} for p in posts])

In [None]:
df.shape

In [None]:
df = df.dropna()
df.shape

In [None]:
df.to_csv('data.csv',index=False)

In [None]:
df['nba'] = np.where(df['subreddit']=='nba', 1, 0)

In [None]:
print('No. of NBA: ',df.nba.sum())
print('No. of MLB: {}'.format(len(df) - df.nba.sum()))
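
The class balance check described in this section can also be done without subtracting from a total, using `value_counts()`. The sketch below is a minimal illustration on a hypothetical `toy_df` (invented here, not the scraped data); the same two calls would apply to `df['subreddit']`.

```python
import pandas as pd

# Hypothetical stand-in for the scraped DataFrame (values invented for illustration).
toy_df = pd.DataFrame({
    "title": ["a", "b", "c", "d", "e"],
    "subreddit": ["nba", "mlb", "nba", "mlb", "mlb"],
})

# Absolute number of posts per subreddit.
class_counts = toy_df["subreddit"].value_counts()

# Share of each subreddit; values near 0.5 indicate balanced classes.
class_share = toy_df["subreddit"].value_counts(normalize=True)

print(class_counts)
print(class_share.round(3))
```

Reading the shares directly makes it easy to see how far the split is from 50-50.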

### 3. Feature Extraction and Train-Test Split

Two feature extraction methods were employed (**Count Vectorizer** and **TF-IDF**) to explore which gave a better score for this classification problem. A 75-25 train-test split was defined to evaluate the accuracy of the model.
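
To make the difference between the two representations concrete, here is a small sketch on three hypothetical titles (invented for illustration, not taken from the scraped data). Count Vectorizer stores raw term counts, while TF-IDF down-weights terms that appear in many titles.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical titles, invented for illustration only.
toy_titles = [
    "Lakers win game seven",
    "Yankees win in extra innings",
    "Lakers trade rumors",
]

toy_cvec = CountVectorizer()
toy_counts = toy_cvec.fit_transform(toy_titles)

toy_tvec = TfidfVectorizer()
toy_tfidf = toy_tvec.fit_transform(toy_titles)

# Both vectorizers learn the same vocabulary, so the matrices share a shape.
print(toy_counts.shape, toy_tfidf.shape)

# In the first title, "win" and "seven" both have a raw count of 1 ...
print(toy_counts[0, toy_cvec.vocabulary_["win"]],
      toy_counts[0, toy_cvec.vocabulary_["seven"]])

# ... but TF-IDF weights "seven" higher, because "win" also occurs in another title.
print(toy_tfidf[0, toy_tvec.vocabulary_["win"]],
      toy_tfidf[0, toy_tvec.vocabulary_["seven"]])
```

The default settings are used here for simplicity; the notebook's vectorizers additionally set `stop_words`, `max_features`, and (for TF-IDF) `sublinear_tf` and `max_df`.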

In [None]:
X = df['title']
y = df['nba']

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.25,
                                                    random_state=42,
                                                    stratify=y)

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer, TfidfVectorizer

# Instantiate our Vectorizers.
cvec = CountVectorizer(stop_words='english', max_features=1000)
tvec = TfidfVectorizer(stop_words='english',
                       sublinear_tf=True,
                       max_df=0.5,
                       max_features=1000)

In [None]:
# Fit our Vectorizers on the training data only, then transform both the training and testing data.
# Note: get_feature_names_out() requires scikit-learn >= 1.0; older versions use get_feature_names().
cvec.fit(X_train)
X_train_cvec = pd.DataFrame(cvec.transform(X_train).toarray(),
                            columns=cvec.get_feature_names_out())
X_test_cvec = pd.DataFrame(cvec.transform(X_test).toarray(),
                           columns=cvec.get_feature_names_out())

tvec.fit(X_train)
X_train_tvec = pd.DataFrame(tvec.transform(X_train).toarray(),
                            columns=tvec.get_feature_names_out())
X_test_tvec = pd.DataFrame(tvec.transform(X_test).toarray(),
                           columns=tvec.get_feature_names_out())

### 4. Model Building and Testing

Two classification methods were selected for evaluation (**Multinomial Naive Bayes Classifier** and **Logistic Regression**). Each classifier was paired with each feature extraction method, and the best combination was determined by prediction accuracy on the test set.

In [None]:
# Train and score our Multinomial Naive Bayes model
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB()
nb = nb.fit(X_train_cvec, y_train)
print('Bayes CountV - Train: {}, Test: {}'.format(nb.score(X_train_cvec, y_train),nb.score(X_test_cvec, y_test)))

nb = MultinomialNB()
nb = nb.fit(X_train_tvec, y_train)
print('Bayes TfidV - Train: {}, Test: {}'.format(nb.score(X_train_tvec, y_train),nb.score(X_test_tvec, y_test)))

In [None]:
# Train and score our Logistic model
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr = lr.fit(X_train_cvec, y_train)
print('Log CountV - Train: {}, Test: {}'.format(lr.score(X_train_cvec, y_train),lr.score(X_test_cvec, y_test)))

lr = LogisticRegression()
lr = lr.fit(X_train_tvec, y_train)
print('Log TfidV - Train: {}, Test: {}'.format(lr.score(X_train_tvec, y_train),lr.score(X_test_tvec, y_test)))
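
As an aside, the vectorize-then-classify steps in this section can also be chained into a single scikit-learn `Pipeline`, so that the vectorizer is always fitted on the training data only. This is an alternative arrangement, not what the cells above do, and the titles below are hypothetical examples invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical training titles (1 = NBA, 0 = MLB), invented for illustration.
toy_titles = [
    "Lakers beat Celtics in overtime",
    "LeBron scores forty points tonight",
    "Warriors set three pointer record",
    "Yankees hit walkoff home run",
    "Dodgers pitcher throws no hitter",
    "Red Sox win World Series",
]
toy_labels = [1, 1, 1, 0, 0, 0]

# The pipeline applies TF-IDF first, then feeds the result to Naive Bayes.
toy_pipe = Pipeline([
    ("tvec", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])
toy_pipe.fit(toy_titles, toy_labels)

# Raw strings go in; the fitted vectorizer is applied automatically.
toy_pred = toy_pipe.predict(["Lakers points", "Yankees pitcher"])
print(toy_pred)
```

With this arrangement, `toy_pipe.score(...)` can be called on raw titles directly, which removes the need to keep separate transformed train and test frames.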

### 5. Best Performing Model

Logistic Regression yielded the best training scores (>97% accuracy), but its testing score was lower, at around 91% accuracy. This gap might indicate that the model overfits the training data, i.e. has high variance. On the other hand, the difference between the training and testing accuracies for the Multinomial Naive Bayes Classifier was smaller (around 4%). With this classifier, feature extraction by TF-IDF had a higher testing accuracy of 92%. This was identified as the best combination in this exercise. We study the confusion matrix of the best model below.

In [None]:
# Import the confusion matrix function.
from sklearn.metrics import confusion_matrix

nb = MultinomialNB()
nb = nb.fit(X_train_tvec, y_train)
predictions = nb.predict(X_test_tvec)

tn, fp, fn, tp = confusion_matrix(y_test, predictions).ravel()
print("True Negatives: %s" % tn)
print("False Positives: %s" % fp)
print("False Negatives: %s" % fn)
print("True Positives: %s" % tp)
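
The four confusion matrix counts can be combined into summary metrics. The sketch below uses hypothetical labels (invented for illustration, not the notebook's predictions) to show how accuracy, precision, and recall follow from `tn`, `fp`, `fn`, and `tp`, with NBA as the positive class.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels, invented for illustration: 1 = NBA, 0 = MLB.
toy_true = [1, 1, 1, 0, 0, 0, 0, 1]
toy_pred = [1, 1, 0, 0, 0, 1, 0, 1]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all titles classified correctly
precision = tp / (tp + fp)                  # of titles predicted NBA, share that are NBA
recall = tp / (tp + fn)                     # of actual NBA titles, share that were found

print("Accuracy: %s" % accuracy)
print("Precision: %s" % precision)
print("Recall: %s" % recall)
```

Comparing `fp` and `fn` shows whether the classifier errs more often by labelling MLB titles as NBA or the other way around.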