<h1>Project Three: History, or Story?</h1>

### Imports

In [None]:
import pandas as pd, numpy as np, matplotlib.pyplot as plt, seaborn as sns

from sklearn.model_selection import GridSearchCV, train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, plot_confusion_matrix
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_extraction import text
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC, SVC
import string

from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import re

from IPython.display import display_html

plt.style.use("dark_background")
plt.figure(figsize=(12,12));

Before getting started, we will set a global random seed to ensure reproducible results.

In [None]:
np.random.seed(42)

### Tagging _y_ variable

We have already acquired our data and given it an initial cleaning with the scraper.ipynb and cleaner.ipynb notebooks, respectively.  Here, we have two datasets, one from each subreddit: r/AskHistorians and r/HistoricalWhatIf.

We will begin by identifying which rows belong to the WhatIf sub, before merging the sets into one dataframe.

In [None]:
df1 = pd.read_csv('./data/askhistorians_clean.csv')
df2 = pd.read_csv('./data/whatif_clean.csv')

In [None]:
# tag the what-if entries as 1 in the target column
df2['whatif'] = 1
df2['whatif'].value_counts()

In [None]:
# fill the ask-historians entries as 0
df1['whatif'] = 0

### Additional Cleaning
We will then use regular expressions to render the selftext column (our intended X variable) into a more machine-friendly form.  The function below converts our text to lowercase and strips special characters, punctuation, hyperlinks, and whitespace that might interfere with our modeling.  It also removes any words of two letters or fewer, which are too common to be sufficiently predictive.

In [None]:
# clean text

# function courtesy of Marta Ghiglioni
def cleaner(text):
    # Make lowercase
    text = text.lower()
    # Remove HTML special characters
    text = re.sub(r'\&\w*;', '', text)
    # Remove hyperlinks (match only up to the next whitespace, not the rest of the line)
    text = re.sub(r'https?://\S+', '', text)
    # Remove punctuation (except '@'), replacing each run with a space
    text = re.sub(r'[' + re.escape(string.punctuation.replace('@', '')) + ']+', ' ', text)
    # Remove words with 2 or fewer characters
    text = re.sub(r'\b\w{1,2}\b', '', text)
    # Collapse whitespace (including new line characters) into single spaces
    text = re.sub(r'\s+', ' ', text)
    # Remove characters beyond Basic Multilingual Plane
    text = ''.join(c for c in text if c <= '\uFFFF') 
    return text

df1['selftext'] = df1['selftext'].apply(cleaner)
df2['selftext'] = df2['selftext'].apply(cleaner)
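
To see what these cleaning steps do to a single post, here is a minimal standalone sketch. The function name `clean_sketch` and the sample sentence are made up for illustration. The URL pattern (`\S+`) and the final `strip()` are assumptions of this sketch and not necessarily what the notebook's `cleaner` does.

```python
import re
import string

def clean_sketch(raw):
    """Apply the cleaning steps described above to one string (illustrative only)."""
    out = raw.lower()
    out = re.sub(r'&\w*;', '', out)            # HTML entities such as &amp;
    out = re.sub(r'https?://\S+', '', out)     # hyperlinks, up to the next whitespace
    punct = re.escape(string.punctuation.replace('@', ''))
    out = re.sub('[' + punct + ']+', ' ', out) # punctuation runs become one space
    out = re.sub(r'\b\w{1,2}\b', '', out)      # words of one or two characters
    out = re.sub(r'\s+', ' ', out).strip()     # collapse whitespace, trim the ends
    return out

# Hypothetical post text, not taken from the dataset.
sample = ("What if Rome &amp; Carthage had made peace? "
          "See https://example.com/punic-wars for context!")
print(clean_sketch(sample))
```

Note that "if" disappears at the short-word step, so part of the most obvious what-if signal is already gone before any stopword list is applied.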

### Merge

Now, we can merge them into a single dataframe.

In [None]:
# merge the dataframes
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df = pd.concat([df1, df2], ignore_index=True)

In [None]:
# drop the irrelevant columns
df.drop(columns=['Unnamed: 0', 'Unnamed: 0.1', 'subreddit',
       'created_utc', 'author', 'num_comments', 'score', 'is_self',
       'timestamp'], inplace=True)

In [None]:
X = df['selftext']
y = df['whatif']

In [None]:
# Split the data into the training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.4,
                                                    stratify=y,
                                                    random_state=42)

In [None]:
# Baseline accuracy
y_test.value_counts(normalize=True)
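
The baseline here is the accuracy of always guessing the majority class, and `stratify=y` keeps the class ratio the same in both splits. The sketch below illustrates both points on made-up labels. The 60/40 ratio and the `*_toy` names are assumptions for the example, not the real class balance.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical labels: 60 AskHistorians posts (0) and 40 what-if posts (1).
y_toy = pd.Series([0] * 60 + [1] * 40, name='whatif')
X_toy = pd.Series([f'post {i}' for i in range(100)], name='selftext')

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.4, stratify=y_toy, random_state=42)

# Stratifying preserves the class proportions in the training and test sets.
train_share = y_tr.value_counts(normalize=True)
test_share = y_te.value_counts(normalize=True)

# The baseline is the share of the majority class in the test set.
baseline = test_share.max()
print(f'Baseline accuracy: {baseline:.2f}')
```

Any model worth keeping should score clearly above this number on the test set.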

### Using TF-IDF to examine word frequency

I will use custom stopwords to eliminate common words and zero in on more unusual, predictive terminology.  To do this, I will repeatedly inspect the top words in the corpus, adding words to the custom stopword list with each pass, until the top 25 words all seem reasonably useful for analysis.

In [None]:
tvec = TfidfVectorizer()

In [None]:
# convert training data to dataframe
X_train_df = pd.DataFrame(tvec.fit_transform(X_train).todense(), 
                          columns=tvec.get_feature_names())

X_train_df = X_train_df.drop(columns=['history', 'what', 'reddit', 'subreddit', 'their', 'any',
          'been', 'had', 'from', 'not', 'like', 'with', 'about', 'were',
          'did', 'but', 'for', 'they', 'how', 'this', 'what', 'have', 'was',
          'that', 'would', 'and', 'after', 'much', 'when', 'them',
          'who', 'why', 'could', 'all', 'some', 'his', 'more', 'you', 'are',
          'there', 'the', 'where', 'has', 'into', 'which', 'also', 'out', 'does', 'these'])

# plot top occurring words
X_train_df.sum().sort_values(ascending=False).head(25).plot(kind='barh');

While many of our top 25 are still fairly generic, they are at least meaningful.  We have successfully plucked out the words that are structural rather than informative.

In addition, we will remove a few words that would tell us either nothing, or too much, based on the particular dataset.  Namely: "history", "what", "if", "reddit" and "subreddit".

In [None]:
# excluding what and if is going to make this a lot harder, but it feels
# like the right thing to do in the spirit of the goal I am aiming at.

custom = ['history', 'what', 'if', 'reddit', 'subreddit', 'their', 'any',
          'been', 'had', 'from', 'not', 'like', 'with', 'about', 'were',
          'did', 'but', 'for', 'they', 'how', 'this', 'what', 'have', 'was',
          'that', 'would', 'and', 'he', 'after', 'much', 'when', 'them',
          'who', 'why', 'could', 'all', 'some', 'his', 'more', 'you', 'are',
          'there', 'the', 'where', 'has', 'into', 'which', 'also', 'out', 
          'does', 'these', 'most', 'many', 'being']
# recent scikit-learn versions require stop_words to be a list, not a frozenset
combined_words = list(text.ENGLISH_STOP_WORDS.union(custom))
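
The inspect-and-extend loop described in this section can be sketched on a toy corpus as follows. The corpus, the helper `top_terms`, and the hand-picked `custom_stop` list are all hypothetical. The sketch ranks terms by summed TF-IDF weight, the same quantity the bar chart above plots.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus standing in for the cleaned selftext column.
toy_corpus = [
    'what would have happened had napoleon won waterloo',
    'how did roman soldiers carry water during long marches',
    'what would europe look like had rome never fallen',
    'why did medieval peasants rarely travel beyond their village',
]

def top_terms(corpus, stop_words=None, n=5):
    """Return the n terms with the largest summed TF-IDF weight."""
    tvec = TfidfVectorizer(stop_words=stop_words)
    weights = np.asarray(tvec.fit_transform(corpus).sum(axis=0)).ravel()
    # vocabulary_ maps term -> column index; sort terms by that index.
    terms = sorted(tvec.vocabulary_, key=tvec.vocabulary_.get)
    order = np.argsort(weights)[::-1][:n]
    return [terms[i] for i in order]

# Pass 1: look at the raw top terms.
first_pass = top_terms(toy_corpus)
# Pass 2: add the structural words spotted in pass 1 and look again.
custom_stop = ['what', 'would', 'had', 'did', 'have', 'how', 'why']
second_pass = top_terms(toy_corpus, stop_words=custom_stop)
print(first_pass)
print(second_pass)
```

On the real corpus, the same two steps would be repeated until the top of the list contains only words judged informative.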

### Initial Modeling

#### Selection
I selected the following models:

1. KNN, to provide a baseline. If more sophisticated modeling techniques cannot beat KNN for accuracy and precision, their sophistication is counterproductive.
2. Decision Tree Classifier.  An excellent 'mid-level' model that will bring a more sophisticated technique to the table, yet still run quickly and be highly interpretable.  Will require tuning to avoid overfitting.
3. Random Forest Classifier.  An improvement upon decision trees: by selecting a random subset of features at each split, we decrease decision trees' tendency toward high variance.  Unfortunately, this will be less interpretable.
4. Support Vector Classifier.  A very effective classifier for high dimensional data (such as NLP data).  Expected to be the most accurate model used.  Black box, not interpretable.

In [None]:
knn_cvec = CountVectorizer(analyzer = "word",
                       preprocessor = None,
                       stop_words = combined_words, 
                       max_features = 100, 
                       ngram_range = (1, 4)
                      )

knn = KNeighborsClassifier()

knn_sc = StandardScaler(with_mean=False)

knn_pipe = Pipeline([
    ('cvec', knn_cvec),
    ('sc', knn_sc),
    ('knn', knn)
])

knn_pipe.fit(X_train, y_train)

knn_train = knn_pipe.score(X_train, y_train)
knn_test = knn_pipe.score(X_test, y_test)
print(f'KNN, score on training set: {knn_train}, score on test set: {knn_test}.')

In [None]:
plot_confusion_matrix(knn_pipe, X_test, y_test)

In [None]:
dt = DecisionTreeClassifier(max_depth=12, min_samples_split=180,
                            min_samples_leaf=60,
                            random_state=42)
p_stemmer = PorterStemmer()
dt_cvec = CountVectorizer(analyzer="word",
                          preprocessor=None,
                          stop_words=combined_words,
                          max_features=5,
                          max_df=.90,
                          min_df=.10,
                          ngram_range=(1,2)
                          )
dt_pipe = Pipeline([
    ('cvec', dt_cvec),
    ('dt', dt)
])

dt_pipe.fit(X_train, y_train)

dt_train = dt_pipe.score(X_train, y_train)
dt_test = dt_pipe.score(X_test, y_test)

print(f'Decision Trees, Score on training set: {dt_train}, score on test set: {dt_test}')

In [None]:
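# Possible gridsearch to tune the decision tree and curb overfitting.
# Searching over the pipeline (rather than the bare estimator) lets it take
# raw text, so each parameter name carries the 'dt__' step prefix.
#
# grid = GridSearchCV(estimator=dt_pipe,
#                     param_grid={'dt__max_depth': [2,3,5,7],
#                                 'dt__min_samples_split': [5,10,15,20],
#                                 'dt__min_samples_leaf': [2,3,4,5,6],
#                                 'dt__ccp_alpha': [0,0.001, 0.01, 0.1, 1, 10]},
#                     cv = 5,
#                     verbose = 1)

# grid.fit(X_train, y_train);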

In [None]:
plot_confusion_matrix(dt_pipe, X_test, y_test)

In [None]:
rf_cvec = CountVectorizer(analyzer = "word",
                       preprocessor = None,
                       stop_words = combined_words, 
                       max_features = 1000, 
                       ngram_range = (1, 4)
                      )
                    
rf = RandomForestClassifier(random_state = 42, n_estimators = 25)

rf_pipe = Pipeline([
    ('cvec', rf_cvec),
    ('rf', rf)
])
rf_pipe.fit(X_train, y_train)
rf_train = rf_pipe.score(X_train, y_train)
rf_test = rf_pipe.score(X_test, y_test)
print(f'Random Forest, score on training set: {rf_train}, score on test set: {rf_test}.')

In [None]:
plot_confusion_matrix(rf_pipe, X_test, y_test)

In [None]:
svc_cvec = CountVectorizer(analyzer = "word",
                       preprocessor = None,
                       stop_words = combined_words, 
                       max_features = 2000, 
                       ngram_range = (1, 1),
                      )

pgrid = {"svc__C": np.linspace(0.01, 1, 20)}

svc = SVC(max_iter=7000, tol=0.1) # model object

svc_sc = StandardScaler(with_mean=False)

svc_pipe = Pipeline([
    ('cvec', svc_cvec),
    ('sc', svc_sc),
    ('svc', svc)
])

svc_grid = GridSearchCV(svc_pipe,
                        pgrid,
                        cv = 5)

svc_grid.fit(X_train, y_train)
svc_train = svc_grid.score(X_train, y_train)
svc_test = svc_grid.score(X_test, y_test)
print(f'SVC, score on training set: {svc_train}, score on test set: {svc_test}.')

In [None]:
plot_confusion_matrix(svc_grid, X_test, y_test)

In [None]:
preds = svc_grid.predict(X_test)

### Model Tuning
With results in, I selected the best-performing models for additional tuning.  I will set up a gridsearch to find the best parameters for the Random Forest and SVM classifiers before declaring one of the models victorious.

I have commented out the search cells in the interest of time, but the variety of parameters I tried searching is reflected here.
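
When a gridsearch wraps a `Pipeline`, each key in the parameter grid must be written as `<step name>__<parameter>` so the search knows which step receives the value. The sketch below shows this on a small made-up dataset. The texts, labels, grid values, and the `toy_*` names are all assumptions for illustration, not the searches run for this project.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy posts: label 1 = what-if, 0 = factual question.
toy_texts = [
    'what if rome had never fallen',
    'what if napoleon had won at waterloo',
    'suppose the library of alexandria had survived',
    'imagine the mongols had conquered western europe',
    'what if the printing press had been invented in rome',
    'suppose the spanish armada had landed in england',
    'how did roman legions supply water on long marches',
    'why did the western roman empire decline',
    'when was the printing press introduced to england',
    'how were medieval castles heated in winter',
    'what did viking sailors eat at sea',
    'why did the spanish armada fail',
]
toy_labels = [1] * 6 + [0] * 6

toy_pipe = Pipeline([
    ('cvec', CountVectorizer()),
    ('svc', SVC()),
])

# Keys follow the '<step name>__<parameter>' pattern.
toy_params = {
    'cvec__ngram_range': [(1, 1), (1, 2)],
    'svc__C': [0.1, 1.0],
}

toy_search = GridSearchCV(toy_pipe, param_grid=toy_params, cv=3)
toy_search.fit(toy_texts, toy_labels)
print(toy_search.best_params_)
```

An unprefixed key such as `'C'` would be rejected, because the pipeline itself has no parameter by that name.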

In [1]:
# rf_params = {'cvec__analyzer':  ['word'],
#              'cvec__stop_words': [combined_words],
#              'cvec__max_features': [1000],
#              'cvec__ngram_range': [(1, 4)],
#              'rf__random_state': [42],
#              'rf__n_estimators': [5, 15, 25],
#              'rf__max_depth': [3, 5, 8],
#              'rf__max_features': [5, 10, 15, 20],
#              'rf__max_leaf_nodes': [5, 8, 12],
#              'rf__min_samples_leaf': [9, 15, 25],
#              'rf__bootstrap': [True, False]
#             }



In [2]:
# gs1 = GridSearchCV(rf_pipe,
#                    param_grid=rf_params,
#                    cv=5)

# gs1.fit(X_train, y_train)
# print(gs1.best_score_)
# gs1.best_params_

In [3]:
# svc_params = {
#     'svc__max_iter': [3000, 5000, 7000],
#     'svc__tol': [0.001, 0.01, 0.1],
#     'svc__C': [0.1, 0.5, 1],
#     'svc__class_weight': ['balanced', None]
# }

# gs2 = GridSearchCV(svc_pipe,
#                   param_grid=svc_params,
#                   cv=5)

In [4]:
# gs2.fit(X_train, y_train)
# print(gs2.best_score_)
# gs2.best_params_

In [None]:
# notes for future improvements
#
# - most common ten words
# - TF-IDF
# - lemma/stem
# - plot ROC curve
# - explain choice of measurement
# - pull out misclassifications to examine

### Choice of Measurement

We are using accuracy as our target metric, to gauge the overall feasibility of predicting our target category (fact / fiction).
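
As a small illustration of what accuracy does and does not capture, the sketch below computes it next to the confusion-matrix counts, precision, and recall. The labels and predictions are invented for the example and are not results from the models above.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Hypothetical labels and predictions (1 = what-if, 0 = AskHistorians).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

# Accuracy: share of all posts classified correctly.
acc = accuracy_score(y_true, y_pred)

# The confusion matrix breaks the errors down by type.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Precision and recall for the what-if class.
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

print(f'accuracy={acc:.2f}, precision={precision:.2f}, recall={recall:.2f}')
print(f'tn={tn}, fp={fp}, fn={fn}, tp={tp}')
```

Accuracy treats both kinds of error the same. If mistaking fiction for fact is costlier than the reverse, recall on the what-if class would be worth reporting alongside it.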

### Results

With a maximum accuracy of .83, it is too soon to say this model can reliably predict fiction in real-world scenarios where the stakes are high.

Unfortunately, gridsearching for better-tuned parameters had to be cut short, so the above accuracy does not reflect a well-tuned model.

### Conclusions and Recommendations

It remains to be seen how well this technique generalizes from the field of history to another, such as current politics vs. propaganda.

However, the goal was to provide a proof of concept and assess the feasibility of this as a practical application of machine learning, and that goal has been met.  An accuracy of .83 is a reasonable starting benchmark, and further tuning can likely make the model more robust.