### Imports

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.feature_extraction import text
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re

from sklearn.metrics import recall_score, precision_score, f1_score, accuracy_score
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

### Importing Data

For this notebook, I am only using fiction_sample to try out models before running them on the other datasets. I had hoped that the results would be transferable across datasets, but the models all ended up with varying scores. Below, you will find the models that I tried and an explanation of why I chose them.

In our problem statement, we are building a recommender system that suggests a book to read based on words or phrases the user gives about the kind of story they'd like. Ideally, a request like "I want a 19th Century novel about love" would return a recommendation such as "Pride and Prejudice".

In [4]:
fiction_df = pd.read_csv('./data/fiction_sample.csv')

### Functions

* **best_params(pipeline, params, X_train, y_train)**: Reads in a pipeline, parameters, X_train, and y_train set that you've created, performs a GridSearchCV to find the best score and parameters through hypertuning. 
* **return_gs(pipeline, params, X_train, y_train)**: Returns GridSearch of a given pipeline and parameters
* **scores(gs, X_train, y_train, X_test, y_test)**: Fits the returned grid search on the training set and reports its scores on the train and test sets. Since every model here is a classifier, these scores are accuracy, not R2.
* **predictions(pipeline, X_train, X_test, y_train)**: Returns predictions based on a pipeline and its model
* **classification_scores(model, y_test, y_pred)**: Using the predictions, it'll return recall, precision, f1, and accuracy scores for you to evaluate.

Note: Functions are reused from [my previous Subreddit project](https://git.generalassemb.ly/lisaliang/project-3.git)

In [4]:
def best_params(pipeline, params, X_train, y_train):
    gs = GridSearchCV(pipeline,
                      param_grid = params,
                      n_jobs=-1)

    gs.fit(X_train, y_train)
    return f'Best Score: {gs.best_score_}, Params: {gs.best_params_}'

In [5]:
def return_gs(pipeline, params, X_train, y_train):
    gs = GridSearchCV(pipeline,
                      param_grid = params,
                      n_jobs=-1)
    return gs

In [6]:
def scores(gs, X_train, y_train, X_test, y_test):
    gs.fit(X_train, y_train)
    return f'Train Score: {gs.score(X_train, y_train)}, Test Score: {gs.score(X_test, y_test)}'

In [7]:
def predictions(pipeline, X_train, X_test, y_train):
    pipeline.fit(X_train, y_train)
    prediction = pipeline.predict(X_test)
    
    return prediction

In [8]:
def classification_scores(model, y_test, y_pred):
    dataframe = pd.DataFrame(columns = ['Recall', 'Precision', 'F1', 'Accuracy'])
    
    recall = recall_score(y_test, y_pred, average = 'weighted')
    precision = precision_score(y_test, y_pred, average = 'weighted')
    f1 = f1_score(y_test, y_pred, average = 'weighted')
    accuracy = accuracy_score(y_test, y_pred)
    
    dataframe.loc[model] = [recall, precision, f1, accuracy]
    
    return dataframe

* **my_lemmatizer(text)**: Splits the input text on whitespace and reduces each word to its dictionary form. Words containing apostrophes and words made up only of digits are filtered out first so that the remaining words are lemmatized as accurately as possible.

Additional: We created a list of English stopwords, contractions, and numbers for the model to remove while it's iterating through the text. These attributes were seen as not adding significance in helping the model distinguish book titles.

In [11]:
def my_lemmatizer(text):
    wnet = WordNetLemmatizer()
    # exclude words with apostrophes and numbers
    return [wnet.lemmatize(w) for w in text.split() if "'" not in w and not w.isdigit()]
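
To see what the filter in my_lemmatizer lets through, here is a minimal sketch of the filtering step on its own. The WordNet lookup is left out so the example runs without any NLTK data, and `keep_token` and the sample sentence are made up for illustration.

```python
def keep_token(word):
    # Same condition as in my_lemmatizer: drop anything containing an
    # apostrophe and anything made up only of digits.
    return "'" not in word and not word.isdigit()


# Hypothetical sample text, not taken from the dataset.
sample = "It's 1813 and the Bennets' five daughters don't marry in 1813, sadly"

kept = [w for w in sample.split() if keep_token(w)]
print(kept)
```

Because the text is split on whitespace only, a token with punctuation attached (such as "1813," above) is not all digits and is kept.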

In [12]:
wnet = WordNetLemmatizer()
lem_stopwords = [wnet.lemmatize(w) for w in stopwords.words('english')]

contractions = ['ve', 't', 's', 'd', 'll', 'm', 're']
lem_contractions = [wnet.lemmatize(contraction) for contraction in contractions]

numbers = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
lem_numbers = [wnet.lemmatize(num) for num in numbers]

lem_stopwords = lem_stopwords + lem_contractions + lem_numbers

### Baseline Accuracy 

For the baseline accuracy, we see that there are 6722 unique titles. The most common, Pride and Prejudice, makes up only 3.69% of the rows, so a model that always predicted it would be right about 3.7% of the time. The dataset (along with all the others) is highly imbalanced, both in the number of classes and in how much data each class has.

In [9]:
fiction_df['Title'].value_counts(normalize = True)

Pride and Prejudice                                         0.03694
Brave New World                                             0.01184
Great Expectations                                          0.01102
To kill a mockingbird                                       0.00634
Alice's Adventures in Wonderland                            0.00580
                                                             ...   
Chocolate Dipped Death (A Candy Shop Mystery)               0.00002
Predator: Concrete Jungle                                   0.00002
The Gates of Damascus                                       0.00002
His Love Saved Her                                          0.00002
Miss Billings Treads the Boards (Signet Regency Romance)    0.00002
Name: Title, Length: 6722, dtype: float64
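
The baseline itself is just the share of the most common title. A small sketch on a toy Series (standing in for fiction_df['Title'], which is not reproduced here) shows how it can be read off value_counts:

```python
import pandas as pd

# Toy stand-in for fiction_df['Title']; the real column has 6722 distinct titles.
titles = pd.Series(
    ["Pride and Prejudice"] * 3
    + ["Brave New World"] * 2
    + ["Great Expectations"]
)

# Share of rows per title, largest first.
class_shares = titles.value_counts(normalize=True)

# Always predicting the most common title gives the baseline accuracy.
baseline_title = class_shares.idxmax()
baseline_accuracy = class_shares.max()

print(baseline_title, baseline_accuracy)
```

On the real data, the same two lines would give Pride and Prejudice and 0.03694, the first row of the output above.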

### Instantiating X and y
X is the book description that the model learns from, and y is the title that each description can be classified as.

In [9]:
X = fiction_df['description']
y = fiction_df['Title']

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

### Multinomial Naive Bayes (and Hypertuning)

I chose Multinomial Naive Bayes for a few reasons:
* Efficiency with large datasets
* Ability to use text for probability and predictions

This model works by assigning documents to classes based on an analysis of their content ([source](https://towardsdatascience.com/multinomial-na%C3%AFve-bayes-for-documents-classification-and-natural-language-processing-nlp-e08cc848ce6)). In doing so, it can take a fragment of text and determine the likelihood that it belongs to a specific class. It was ideal for my intentions since I was parsing through the book descriptions to make these classifications and predictions for my system.
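
As a rough illustration of that idea, the sketch below fits a tiny tf-idf + MNB pipeline and asks for a title from a short query. The descriptions are invented for the example and the vectorizer uses its defaults, so this is not the pipeline tuned in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical mini-corpus: two made-up descriptions per title.
descriptions = [
    "a witty romance about marriage and manners in regency england",
    "love and pride between elizabeth and a proud wealthy gentleman",
    "a dystopian society built on engineered happiness and control",
    "citizens conditioned from birth in a future world state",
]
titles = [
    "Pride and Prejudice",
    "Pride and Prejudice",
    "Brave New World",
    "Brave New World",
]

toy_pipe = Pipeline([
    ("tf", TfidfVectorizer()),
    ("mnb", MultinomialNB(alpha=0.5)),
])
toy_pipe.fit(descriptions, titles)

# A fragment of text goes in; a title and per-title probabilities come out.
query = ["a novel about love and marriage"]
predicted_title = toy_pipe.predict(query)[0]
class_probabilities = dict(zip(toy_pipe.classes_, toy_pipe.predict_proba(query)[0]))

print(predicted_title)
print(class_probabilities)
```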

For the pipeline, I paired the MNB with a TfidfVectorizer, since the latter weighs how often a word appears in a description against how common it is across all descriptions. In our EDA, we saw the words, bigrams, and trigrams that were in our datasets. Most were generic enough that they didn't add meaning to our exploration, but not simple enough to be taken out by our lemmatization efforts. I set stop_words and tokenizer to the custom stop word list and lemmatizer function that I had created. The max_features was set to 5_000 so that the model could run.

As for the parameters, I used min_df and max_df to set thresholds on the share of descriptions a term must appear in (min_df) and may not exceed (max_df) to be considered in the tf-idf process. I included an ngram_range so the GridSearch could compare unigrams, bigrams, and trigrams.
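
A small sketch of how the two thresholds behave when given as floats, using four invented documents. It also shows that a combination where min_df is larger than max_df cannot be fitted; combinations like that exist in the grid below, and they may be part of the reason the search printed so many warnings, though I can't confirm that from the cleared output.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents; "and" appears in all four, "love", "war", "peace" in two.
docs = [
    "love and marriage",
    "love and war",
    "war and peace",
    "peace and quiet",
]

# Floats are proportions of documents: keep terms found in at least 50%
# of the documents, and drop terms found in more than 80% of them.
vectorizer = TfidfVectorizer(min_df=0.5, max_df=0.8)
vectorizer.fit(docs)
kept_terms = sorted(vectorizer.vocabulary_)
print(kept_terms)

# A min_df above max_df leaves no valid range, so fitting raises an error.
try:
    TfidfVectorizer(min_df=0.5, max_df=0.25).fit(docs)
    conflict_error = None
except ValueError as err:
    conflict_error = str(err)
print(conflict_error)
```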

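Finally, for the MNB, we tuned alpha, the smoothing term that prevents zero probabilities for words unseen in a class, and fit_prior, which controls whether class prior probabilities are learned from the training data or assumed to be uniform.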
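
To make the two MNB parameters concrete, here is a minimal sketch on a toy count matrix (the numbers are invented). It compares the class priors with fit_prior on and off, and checks that alpha keeps an unseen word from getting a probability of zero.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy term counts: 3 documents of class "a", 1 of class "b".
# The first term never appears in class "b".
X = np.array([[2, 0], [1, 0], [3, 1], [0, 2]])
y = np.array(["a", "a", "a", "b"])

learned = MultinomialNB(alpha=0.1, fit_prior=True).fit(X, y)
uniform = MultinomialNB(alpha=0.1, fit_prior=False).fit(X, y)

# fit_prior=True uses the class frequencies; False gives every class the same prior.
learned_priors = np.exp(learned.class_log_prior_)
uniform_priors = np.exp(uniform.class_log_prior_)

# With alpha > 0, the first term still gets a small non-zero probability in class "b".
smoothed_prob = np.exp(learned.feature_log_prob_[1, 0])

print(learned_priors, uniform_priors, smoothed_prob)
```

With thousands of imbalanced titles, fit_prior=False stops the model from leaning on how common a title is, which may be why the grid search preferred it here.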

In [14]:
mnb_pipe = Pipeline([
    ('tf', TfidfVectorizer(stop_words = lem_stopwords, 
                           tokenizer = my_lemmatizer,
                           token_pattern = None,
                           max_features = 5_000)),
    ('mnb', MultinomialNB(alpha = 0.5))
])

In [15]:
mnb_params = {
    'tf__min_df': [0.1, 0.25, 0.5, 1.0],
    'tf__max_df': [0.25, 0.5, 0.8, 1.0],
    'tf__ngram_range': [(1,1), (2,2), (3,3)],
    'mnb__alpha': [0.1, 0.25, 0.5, 1],
    'mnb__fit_prior': [True, False]
}

In [None]:
best_params(mnb_pipe, mnb_params, X_train, y_train)

# this cell runs a long warning, so the output was cleared
# Output: "Best Score: 0.7210666666666666, Params: {'mnb__alpha': 0.1, 'mnb__fit_prior': False, 'tf__max_df': 0.5, 'tf__min_df': 0.1, 'tf__ngram_range': (1, 1)}"

In [23]:
mnb_gs = return_gs(mnb_pipe, mnb_params, X_train, y_train)

In [None]:
scores(mnb_gs, X_train, y_train, X_test, y_test)

# this cell runs a long warning, so the output was cleared
# Output: 'Train Score: 0.7585866666666666, Test Score: 0.70728'

In [25]:
mnb_pred = predictions(mnb_pipe, X_train, X_test, y_train)

In [26]:
classification_scores('Multinomial Naive Bayes', y_test, mnb_pred)



| Model | Recall | Precision | F1 | Accuracy |
|---|---|---|---|---|
| Multinomial Naive Bayes | 0.71232 | 0.580423 | 0.625477 | 0.71232 |


**Evaluation**: The model had an accuracy of 71.23%, meaning it assigned the correct title to 71.23% of the test descriptions. However, weighted precision was low at 58.04%, which I had hoped to improve. Low precision means many false positives: descriptions assigned to a title they do not belong to. For this recommender, I considered false negatives (missing a book the user would enjoy) to be the worst outcome, and those are measured by recall, which was 71.23%. Since the model is based on book descriptions, a user who enjoyed one book may still enjoy a similar book that was recommended in its place.
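
Two things about these weighted scores are worth a small sketch on invented labels. Weighted recall is always equal to accuracy, which is why both columns show 0.71232. Titles that are never predicted count as a precision of 0, which is likely one reason precision falls below recall on a dataset with thousands of rare titles. The zero_division argument is used here only to make that choice explicit.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Invented labels: the "model" predicts the majority title every time.
y_true = ["P&P", "P&P", "BNW", "GE"]
y_pred = ["P&P", "P&P", "P&P", "P&P"]

accuracy = accuracy_score(y_true, y_pred)
weighted_recall = recall_score(y_true, y_pred, average="weighted")

# "BNW" and "GE" are never predicted, so their precision is set to 0.
weighted_precision = precision_score(
    y_true, y_pred, average="weighted", zero_division=0
)

print(accuracy, weighted_recall, weighted_precision)
```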

### Additional Recommendations

The only model I ran successfully was the Multinomial Naive Bayes. I attempted to run the Random Forest Classification, but it only worked for some of my datasets. The Support Vector Classification ran for 8 hours and ultimately crashed. For anyone who wants to explore methods similar to mine, I highly suggest using a cloud service so that you can run your models properly without memory limits affecting your local computer. Below, you can still read about why I wanted to test certain models.

### Random Forest Classification (and Hypertuning)

I wanted to explore this model because:
* Ability to pinpoint strongest predictors in a model
* Training on imbalanced data

With the Random Forest Classifier, the ensemble combines multiple Decision Trees to make its predictions. In doing so, it would alleviate the impact of the imbalanced dataset, as it randomly selects the subset of variables each Decision Tree uses. The parameters I chose to tune were n_estimators and max_depth. n_estimators sets the number of Decision Trees used, while max_depth controls how complex the trees are by limiting the distance between the root and leaf nodes.
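
As a sketch of what "pinpointing the strongest predictors" could look like, the example below fits a small forest on invented descriptions and ranks the tf-idf terms by feature_importances_. The data and the parameter values are placeholders, not results from this project.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus, made up for this example.
descriptions = [
    "romance marriage manners england",
    "love pride wealthy gentleman",
    "dystopian society engineered control",
    "citizens conditioned future state",
]
titles = [
    "Pride and Prejudice",
    "Pride and Prejudice",
    "Brave New World",
    "Brave New World",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(descriptions)

# n_estimators = number of trees, max_depth = longest root-to-leaf path allowed.
forest = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
forest.fit(X, titles)

# Order the vocabulary by column index, then rank terms by importance.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
ranked = sorted(
    zip(terms, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked[:5])
```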

However, one disadvantage is that the model takes a long time to run (similar to Logistic Regression), especially with many decision trees and large depths. The memory requirements often caused my computer to crash, but occasionally a model ran successfully (go to part 5). For future projects, a cloud service or a computer with more available memory should be used to run this model.

In [13]:
rfc_pipe = Pipeline([
    ('tf', TfidfVectorizer(stop_words = lem_stopwords, 
                           tokenizer = my_lemmatizer,
                           token_pattern = None,
                           max_features = 1_000)),
    ('rfc', RandomForestClassifier(max_features = 1_000))
])

In [15]:
rfc_params = {
    'tf__min_df': [0.05, 0.1],
    'tf__max_df': [0.5],
    'tf__ngram_range': [(1,1)],
    'rfc__n_estimators': [100, 200, 300],
    'rfc__max_depth': [None, 5, 10, 20]
}

In [None]:
best_params(rfc_pipe, rfc_params, X_train, y_train)

In [None]:
rfc_gs = return_gs(rfc_pipe, rfc_params, X_train, y_train)

In [None]:
scores(rfc_gs, X_train, y_train, X_test, y_test)

In [None]:
rfc_pred = predictions(rfc_pipe, X_train, X_test, y_train)

In [None]:
classification_scores('Random Forest Classifier', y_test, rfc_pred)

### Logistic Regression

Reasons why I wanted to explore Logistic Regression:
* Interpretability

With Logistic Regression, we find the relationship between the features and the target variable by minimizing a loss function. Its regularization helps with overfit models, such as the MNB model above, whose train score was higher than its test score. I never actually ran this model because of the time that tuning the regularization would take.
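
Since I never ran this model, the following is only a sketch of the interpretability I was after, on invented descriptions with default regularization. Each title gets one row of coefficients, and the largest coefficient in a row points to the term that pushes a description most strongly toward that title.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical mini-corpus; no term is shared between titles.
descriptions = [
    "romance marriage manners",
    "love pride gentleman",
    "dystopia control conditioning",
    "future society engineered",
    "orphan fortune benefactor",
    "convict inheritance london",
]
titles = [
    "Pride and Prejudice", "Pride and Prejudice",
    "Brave New World", "Brave New World",
    "Great Expectations", "Great Expectations",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(descriptions)

# C is the inverse regularization strength; smaller values shrink coefficients more.
clf = LogisticRegression(C=1.0, max_iter=1000)
clf.fit(X, titles)

# Order the vocabulary by column index so it lines up with clf.coef_.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
top_term_per_title = {
    title: terms[row.argmax()] for title, row in zip(clf.classes_, clf.coef_)
}
print(top_term_per_title)
```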

In [18]:
lr_pipe = Pipeline([
    ('tf', TfidfVectorizer(stop_words = lem_stopwords, 
                           tokenizer = my_lemmatizer,
                           token_pattern = None,
                           max_features = 1_000)),
    ('lr', LogisticRegression(solver = 'saga'))
])

In [19]:
lr_params = {
    'tf__min_df': [0.05, 0.1, 0.25, 0.5],
    'tf__max_df': [0.25, 0.5, 0.8],
    'tf__ngram_range': [(1,1), (2,2), (3,3)],
    'lr__penalty': ['l1', 'l2', 'elasticnet', None],
    'lr__C': [0.05, 1.0, 10],
    'lr__class_weight': [None, 'balanced']
}

In [None]:
best_params(lr_pipe, lr_params, X_train, y_train)

In [None]:
lr_gs = return_gs(lr_pipe, lr_params, X_train, y_train)

In [None]:
scores(lr_gs, X_train, y_train, X_test, y_test)

In [None]:
lr_pred = predictions(lr_pipe, X_train, X_test, y_train)

In [None]:
classification_scores('Logistic Regression', y_test, lr_pred)

### Support Vector Classification

Originally, I wanted to evaluate the Support Vector Classification for its ability to handle high-dimensional data. It looks for the boundary that separates the classes with the widest margin, and with a non-linear kernel that boundary can be curved, so it works well with non-linear relationships.

However, as I read more and ran some models, I would not recommend using it further, as its characteristics don't align with what I want the model to do or with the dataset we are given. The Support Vector Classifier is not well suited to large or noisy datasets, and all of my datasets are both, since they involve thousands of overlapping classes. The Support Vector Classifier also tends to underperform on imbalanced datasets by favoring the majority class.

In [35]:
sv_pipe = Pipeline([
    ('tf', TfidfVectorizer(stop_words = lem_stopwords, 
                           tokenizer = my_lemmatizer,
                           token_pattern = None,
                           max_features = 1_000)),
    ('sv', SVC())
])

In [36]:
sv_params = {
    'tf__min_df': [0.05, 0.1, 0.25, 0.5],
    'tf__max_df': [0.25, 0.5, 0.8],
    'tf__ngram_range': [(1,1), (2,2), (3,3)],
    'sv__C': [0.5, 1, 10],
    'sv__kernel': ['linear', 'poly', 'rbf'],
    'sv__class_weight': [None, 'balanced']
}
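
For a sense of how much work this grid asks for, the sketch below counts its combinations with ParameterGrid. The dictionary is copied from the cell above so the example runs on its own, and the factor of 5 assumes GridSearchCV's default 5-fold cross-validation, which the helper functions in this notebook do not override.

```python
from sklearn.model_selection import ParameterGrid

# Copy of sv_params from the cell above.
sv_params = {
    'tf__min_df': [0.05, 0.1, 0.25, 0.5],
    'tf__max_df': [0.25, 0.5, 0.8],
    'tf__ngram_range': [(1, 1), (2, 2), (3, 3)],
    'sv__C': [0.5, 1, 10],
    'sv__kernel': ['linear', 'poly', 'rbf'],
    'sv__class_weight': [None, 'balanced']
}

# Every combination of the listed values is one candidate model.
n_candidates = len(ParameterGrid(sv_params))

# Each candidate is fitted once per cross-validation fold (5 by default).
n_fits = n_candidates * 5

print(n_candidates, n_fits)
```

Trimming the grid to the values that matter most, as the Random Forest grid does, is one way to keep the run time manageable.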

In [None]:
best_params(sv_pipe, sv_params, X_train, y_train)

In [None]:
sv_gs = return_gs(sv_pipe, sv_params, X_train, y_train)

In [None]:
scores(sv_gs, X_train, y_train, X_test, y_test)

In [None]:
sv_pred = predictions(sv_pipe, X_train, X_test, y_train)

In [None]:
classification_scores('Support Vector', y_test, sv_pred)