# Model the Data

For my MVP, I will try several algorithms, tuning each with grid search and cross-validation. The algorithms I plan to use are:

* Random Forest Classifier
* K-Nearest Neighbors
* Gaussian Naive Bayes
* Multinomial Naive Bayes
* XGBoost (if I can get it to work; I will attempt this later)

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Load the Data

In [2]:
blurbs = pd.read_csv('blurbs_for_exploration.csv')

In [3]:
blurbs.head()

Unnamed: 0,genre,sub-genre,original,clean,stemmed,lemmatized,lem_char_count,lem_word_count,lem_unique_word_count,sentence_count,avg_words_per_sentence,sentiment,stopword_count,word_stopword_ratio
0,Horror,ghost-stories,"Designed to appeal to the book lover, the Macm...",designed appeal book lover macmillan collector...,design appeal book lover macmillan collector '...,designed appeal book lover macmillan collector...,1102,147,120,8,18,0.9582,96,0.65
1,Horror,ghost-stories,"Part of the Penguin Orange Collection, a limit...",part penguin orange collection limitedrun seri...,part penguin orang collect limitedrun seri twe...,part penguin orange collection limitedrun seri...,954,118,87,2,59,0.91,55,0.47
2,Horror,ghost-stories,Part of a new six-volume series of the best in...,part new sixvolume series best classic horror ...,part new sixvolum seri best classic horror sel...,part new sixvolume series best classic horror ...,1260,173,138,7,25,-0.2144,85,0.49
3,Horror,ghost-stories,A USA TODAY BESTSELLER!An Indie Next Pick!An O...,usa today bestselleran indie next pickan octob...,usa today bestselleran indi next pickan octob ...,usa today bestselleran indie next pickan octob...,800,104,92,2,52,-0.953,63,0.61
4,Horror,ghost-stories,From the New York Times best-selling author of...,new york times bestselling author southern boo...,new york time bestsel author southern book clu...,new york time bestselling author southern book...,603,77,74,5,15,-0.9726,28,0.36


In [4]:
blurbs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21414 entries, 0 to 21413
Data columns (total 14 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   genre                   21414 non-null  object 
 1   sub-genre               21414 non-null  object 
 2   original                21414 non-null  object 
 3   clean                   21414 non-null  object 
 4   stemmed                 21414 non-null  object 
 5   lemmatized              21414 non-null  object 
 6   lem_char_count          21414 non-null  int64  
 7   lem_word_count          21414 non-null  int64  
 8   lem_unique_word_count   21414 non-null  int64  
 9   sentence_count          21414 non-null  int64  
 10  avg_words_per_sentence  21414 non-null  int64  
 11  sentiment               21414 non-null  float64
 12  stopword_count          21414 non-null  int64  
 13  word_stopword_ratio     21414 non-null  float64
dtypes: float64(2), int64(6), object(6)
mem

# Split the Data

Since I will be using cross-validation, I don't need a separate validation set, only train and test sets.

In [5]:
train, test = train_test_split(blurbs, stratify = blurbs.genre, test_size = .25, random_state = 123)
train.shape, test.shape

((16060, 14), (5354, 14))
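As a quick illustration of what `stratify` does in the split above, here is a minimal sketch on a toy DataFrame. The `toy` frame, its class labels, and the 60/30/10 class mix are made up for illustration and are not taken from the blurbs data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: three classes with a 60/30/10 mix
toy = pd.DataFrame({
    "genre": ["A"] * 240 + ["B"] * 120 + ["C"] * 40,
    "feature": range(400),
})

# Same call pattern as the notebook: stratify on the target column
toy_train, toy_test = train_test_split(
    toy, stratify=toy.genre, test_size=0.25, random_state=123
)

# Compare class proportions in each split
train_props = toy_train.genre.value_counts(normalize=True).sort_index()
test_props = toy_test.genre.value_counts(normalize=True).sort_index()
print(pd.DataFrame({"train": train_props, "test": test_props}))
```

Both columns should show roughly the same class mix, which is the point of stratifying before any model sees the data.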

# Create X and y Groups

In [6]:
X_train, y_train = train.drop(columns = ['genre']), train.genre
X_test, y_test = test.drop(columns = ['genre']), test.genre

In [7]:
#Check train shapes
X_train.shape, y_train.shape

((16060, 13), (16060,))

In [8]:
#Check test shapes
X_test.shape, y_test.shape

((5354, 13), (5354,))

# Scale the X Groups

In [9]:
#Instantiate the scaler
scaler = MinMaxScaler()

#List the numeric columns to scale
numeric_cols = ['lem_char_count', 'lem_word_count', 'lem_unique_word_count', 'sentence_count',
                'avg_words_per_sentence', 'sentiment', 'stopword_count', 'word_stopword_ratio']

#Fit and transform the data on train
X_train[numeric_cols] = scaler.fit_transform(X_train[numeric_cols])

#Transform the data on test
X_test[numeric_cols] = scaler.transform(X_test[numeric_cols])
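One consequence of fitting the scaler on train only is that test values outside the training range land outside [0, 1]. The sketch below shows this on toy arrays; `toy_train`, `toy_test`, and `toy_scaler` are illustrative names, not part of the notebook.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature data: train spans 0 to 10
toy_train = np.array([[0.0], [5.0], [10.0]])
toy_test = np.array([[-5.0], [2.5], [15.0]])

# Fit on train only, then apply the same min/max to test
toy_scaler = MinMaxScaler()
toy_scaler.fit(toy_train)
scaled_test = toy_scaler.transform(toy_test)

print(scaled_test.ravel())  # values below 0 and above 1 are possible
```

This is expected behavior rather than a bug, but it matters for any model that requires non-negative inputs at prediction time.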

# Part 1

Make predictions using only the engineered features, not the term frequencies or TF-IDF.

In [10]:
#Select only the engineered features
X_train_part1 = X_train.drop(columns = ['sub-genre', 'original', 'clean', 'stemmed', 'lemmatized'])


In [11]:
X_train_part1.head()

Unnamed: 0,lem_char_count,lem_word_count,lem_unique_word_count,sentence_count,avg_words_per_sentence,sentiment,stopword_count,word_stopword_ratio
17405,0.063722,0.070363,0.107901,0.029289,0.132867,0.986747,0.048842,0.255
11323,0.037368,0.041007,0.068819,0.037657,0.062937,0.87872,0.030056,0.27
18975,0.053889,0.058248,0.100255,0.020921,0.146853,0.023256,0.047589,0.3
8934,0.050461,0.060112,0.092608,0.025105,0.132867,0.014904,0.053225,0.325
594,0.05445,0.064772,0.105353,0.029289,0.125874,0.6001,0.071384,0.405


### Create Baseline

I will use the dummy classifier with the `stratified` strategy to create my baseline.

In [12]:
#Instantiate the model
baseline_model = DummyClassifier(strategy = 'stratified', random_state = 123)

#Fit the model
baseline_model.fit(X_train_part1, y_train)

#Score the model
baseline_model.score(X_train_part1, y_train)

0.27017434620174346

__Baseline Accuracy: 27%__ 
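For context on that number: a classifier that guesses labels at random in proportion to the class frequencies has an expected accuracy equal to the sum of the squared class proportions. A minimal sketch with made-up labels follows; the helper name `expected_stratified_accuracy` and the 50/30/20 mix are assumptions for illustration.

```python
import numpy as np

def expected_stratified_accuracy(labels):
    """Expected accuracy of guessing labels at random with the class frequencies."""
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    # P(correct) = sum over classes of P(true = k) * P(guess = k)
    return float(np.sum(proportions ** 2))

# Hypothetical labels with a 50/30/20 class mix
toy_labels = ["A"] * 50 + ["B"] * 30 + ["C"] * 20
toy_expected = expected_stratified_accuracy(toy_labels)
print(round(toy_expected, 4))  # 0.25 + 0.09 + 0.04
```

Applying the same helper to `y_train` should give a value close to the dummy classifier's score, up to the randomness of a single draw.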

### Random Forest Classifier

Begin with the random forest classifier algorithm. Write a function that utilizes grid search and cross-validation to optimize it and return the best model.

In [13]:
def get_random_forest_models(X_train, y_train, param_dict, cv = 5):
    """
    This function creates and returns an optimized random forest classification model. It also
    prints out the best model's mean cross-validated accuracy score and parameters.
    
    This function takes in the X and y training sets to fit the models.
    
    This function takes in a dictionary that contains the parameters to be iterated through.
    
    This function also takes in a value for the number of cross validation folds to do.
    The cv value defaults to 5.
    """
    #Create the classifier model
    clf = RandomForestClassifier(random_state = 123)
    
    #Create the GridSearchCV object
    grid = GridSearchCV(clf, param_dict, cv = cv)
    
    #Fit the GridSearchCV object
    grid.fit(X_train, y_train)
    
    #Print the best model's score and parameters
    print('Mean Cross-Validated Accuracy: ', round(grid.best_score_, 4))
    print('Max Depth: ', grid.best_params_['max_depth'])
    print('Min Samples Per Leaf: ', grid.best_params_['min_samples_leaf'])
    
    #Return the best model
    return grid.best_estimator_

In [14]:
#Create the dictionary of parameters and their values to iterate through
rf_dict = {
    'max_depth': range(14, 26),
    'min_samples_leaf': range(1, 16)
}
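To get a feel for how much work this grid implies, the sketch below counts the parameter combinations and the resulting number of model fits. It assumes the default of 5 folds used by the function above; `rf_grid` and `cv_folds` are illustrative names.

```python
from itertools import product

# Same ranges as rf_dict above
rf_grid = {
    "max_depth": range(14, 26),
    "min_samples_leaf": range(1, 16),
}
cv_folds = 5  # assumed: the function's default cv value

# Every combination of values is one candidate model
n_candidates = len(list(product(*rf_grid.values())))

# Each candidate is fit once per fold
n_fits = n_candidates * cv_folds

print(n_candidates, n_fits)
```

With its default settings, `GridSearchCV` also refits the best candidate once on the full training set after the search.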

In [15]:
#best_rf_model = get_random_forest_models(X_train_part1, y_train, rf_dict)

Best Attempt:

- Mean Cross-Validated Accuracy: 0.4351
- Max Depth: 14
- Min Samples Per Leaf: 4

### K-Nearest Neighbors

Now write a function to create an optimized K-Nearest Neighbors model. It will behave like the previous function.

In [16]:
def get_KNN_models(X_train_scaled, y_train, param_dict, cv = 5):
    """
    This function takes in scaled data and builds an optimized KNN classification model. 
    It will use the parameters specified in the param_dict to optimize across.
    
    This function utilizes GridSearchCV.
    
    This function returns the best model and prints out its parameters and mean
    cross-validated accuracy.
    """
    #Create the KNN model
    clf = KNeighborsClassifier()
    
    #Create the GridSearchCV object
    grid = GridSearchCV(clf, param_dict, cv = cv)
    
    #Fit the GridSearchCV object
    grid.fit(X_train_scaled, y_train)
    
    #Print the best model's score and parameters
    print('Mean Cross-Validated Accuracy: ', round(grid.best_score_, 4))
    print('Num Neighbors: ', grid.best_params_['n_neighbors'])
    print('Weights: ', grid.best_params_['weights'])
    
    #Return the best model
    return grid.best_estimator_

In [17]:
#Create the dict of parameters and their values to optimize across
knn_dict = {
    'n_neighbors': range(10, 500, 10),
    'weights': ['uniform', 'distance']
}

In [18]:
#best_knn_model = get_KNN_models(X_train_part1, y_train, knn_dict)

Mean Cross-Validated Accuracy:  0.4215
Num Neighbors:  130
Weights:  distance


Best Attempt:

- Mean Cross-Validated Accuracy: 0.4215
- Num Neighbors: 130
- Weights: distance
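The `weights` option changes how the neighbors vote: `uniform` counts each neighbor equally, while `distance` weights each vote by the inverse of its distance. Here is a minimal sketch on three made-up points where the two settings disagree; all names and values are illustrative.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical 1-D data: one 'a' point close to the query, two 'b' points far away
X_toy = np.array([[0.0], [1.0], [1.1]])
y_toy = np.array(["a", "b", "b"])
query = np.array([[0.1]])

# Uniform voting: two 'b' neighbors outvote one 'a' neighbor
uniform_knn = KNeighborsClassifier(n_neighbors=3, weights="uniform").fit(X_toy, y_toy)
uniform_pred = uniform_knn.predict(query)[0]

# Distance voting: the very close 'a' neighbor carries the most weight
distance_knn = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X_toy, y_toy)
distance_pred = distance_knn.predict(query)[0]

print(uniform_pred, distance_pred)
```

With a large neighbor count, distance weighting keeps far-away neighbors from dominating the vote.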

### Gaussian Naive Bayes

Write a function that creates a Gaussian Naive Bayes model.

In [22]:
def get_gauss_models(X_train_scaled, y_train, param_dict, cv = 5):
    """
    This function will create an optimized Gaussian Naive Bayes classification model. It will
    use the X_train_scaled and y_train data for fitting. The param_dict contains the parameters
    and their values that the GridSearchCV function will optimize across. The cv parameter
    indicates how many folds will be fitted and evaluated, and defaults to 5.
    
    This function prints out the mean cross-validated accuracy, best parameters, and returns
    the best model.
    """
    
    #Create the Gaussian Naive Bayes model
    clf = GaussianNB()
    
    #Create the GridSearchCV object
    grid = GridSearchCV(clf, param_dict, cv = cv)
    
    #Fit the GridSearchCV object
    grid.fit(X_train_scaled, y_train)
    
    #Print the best model's score and parameters
    print('Mean Cross-Validated Accuracy: ', round(grid.best_score_, 4))
    print('Smoothing: ', grid.best_params_['var_smoothing'])
    
    #Return the best model
    return grid.best_estimator_

In [23]:
#Create the param_dict
gauss_dict = {
    'var_smoothing': np.logspace(0,-9, num=100)
}

In [24]:
#best_gauss_model = get_gauss_models(X_train_part1, y_train, gauss_dict)

Mean Cross-Validated Accuracy:  0.3993
Smoothing:  0.03511191734215131


Best Attempt:

- Mean Cross-Validated Accuracy: 0.3993
- Smoothing: 0.035
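The smoothing value reported above is one of the points on the `np.logspace(0, -9, num=100)` grid, which runs from 1 down to 1e-9 in equal steps on a log scale. A small sketch to inspect that grid (`grid_values` and `best_idx` are illustrative names):

```python
import numpy as np

# Same grid as gauss_dict['var_smoothing']
grid_values = np.logspace(0, -9, num=100)

# Endpoints and size of the grid
print(grid_values[0], grid_values[-1], len(grid_values))

# Locate the grid point closest to the reported best value
reported_best = 0.03511191734215131
best_idx = int(np.argmin(np.abs(grid_values - reported_best)))
print(best_idx, grid_values[best_idx])
```

Since the best value sits near the large end of the grid, a follow-up search could use a finer grid around it.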

### Multinomial Naive Bayes

Write a function that creates a Multinomial Naive Bayes model. I know this model is better suited to discrete counts, such as word counts, but I want to see if it will work using the engineered features in part 1.

In [25]:
def get_multinomial_models(X_train_scaled, y_train, param_dict, cv = 5):
    """
    This function will create an optimized Multinomial Naive Bayes classification model. It will
    use the X_train_scaled and y_train data for fitting. The param_dict contains the parameters
    and their values that the GridSearchCV function will optimize across. The cv parameter
    indicates how many folds will be fitted and evaluated, and defaults to 5.
    
    This function prints out the mean cross-validated accuracy, best parameters, and returns
    the best model.
    """
    
    #Create the Multinomial Naive Bayes model
    clf = MultinomialNB()
    
    #Create the GridSearchCV object
    grid = GridSearchCV(clf, param_dict, cv = cv)
    
    #Fit the GridSearchCV object
    grid.fit(X_train_scaled, y_train)
    
    #Print the best model's score and parameters
    print('Mean Cross-Validated Accuracy: ', round(grid.best_score_, 4))
    print('Alpha: ', grid.best_params_['alpha'])
    print('Fit Prior: ', grid.best_params_['fit_prior'])
    
    #Return the best model
    return grid.best_estimator_

In [28]:
#Create the param_dict
multinomial_dict = {
    'alpha': range(1, 101),
    'fit_prior': [True, False]
}

In [29]:
#Get the multinomial models
#best_multinomial_model = get_multinomial_models(X_train_part1, y_train, multinomial_dict)

Mean Cross-Validated Accuracy:  0.3217
Alpha:  1
Fit Prior:  True


Best Attempt:

- Mean Cross-Validated Accuracy: 0.3217
- Alpha: 1
- Fit Prior: True
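One practical detail when feeding continuous features to this model: `MultinomialNB` accepts fractional values but rejects negative ones. A minimal sketch on made-up data (all arrays here are illustrative):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical features already scaled into [0, 1]
X_ok = np.array([[0.0, 0.5], [1.0, 0.2], [0.3, 0.9], [0.8, 0.1]])
y_toy = np.array(["a", "b", "a", "b"])

# Fractional, non-negative values fit without error
MultinomialNB().fit(X_ok, y_toy)

# A single negative value makes fit raise a ValueError
X_bad = X_ok.copy()
X_bad[0, 0] = -0.1
try:
    MultinomialNB().fit(X_bad, y_toy)
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", err)
```

The min-max scaled training features are non-negative by construction, which is why this model can be fit on them at all.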