# Model the Data

For my MVP, I would like to use several algorithms with cross-validation and grid search. The algorithms I plan on using are:

* Random Forest Classifier
* K-Nearest Neighbors
* Gaussian Naive Bayes
* Multinomial Naive Bayes
* XGBoost (if I can get it to work; I will attempt this later)

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Load the Data

In [2]:
blurbs = pd.read_csv('blurbs_for_exploration.csv')

In [3]:
blurbs.head()

| | genre | sub-genre | original | clean | stemmed | lemmatized | lem_char_count | lem_word_count | lem_unique_word_count | sentence_count | avg_words_per_sentence | sentiment | stopword_count | word_stopword_ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Horror | ghost-stories | Designed to appeal to the book lover, the Macm... | designed appeal book lover macmillan collector... | design appeal book lover macmillan collector '... | designed appeal book lover macmillan collector... | 1102 | 147 | 120 | 8 | 18 | 0.9582 | 96 | 0.65 |
| 1 | Horror | ghost-stories | Part of the Penguin Orange Collection, a limit... | part penguin orange collection limitedrun seri... | part penguin orang collect limitedrun seri twe... | part penguin orange collection limitedrun seri... | 954 | 118 | 87 | 2 | 59 | 0.91 | 55 | 0.47 |
| 2 | Horror | ghost-stories | Part of a new six-volume series of the best in... | part new sixvolume series best classic horror ... | part new sixvolum seri best classic horror sel... | part new sixvolume series best classic horror ... | 1260 | 173 | 138 | 7 | 25 | -0.2144 | 85 | 0.49 |
| 3 | Horror | ghost-stories | A USA TODAY BESTSELLER!An Indie Next Pick!An O... | usa today bestselleran indie next pickan octob... | usa today bestselleran indi next pickan octob ... | usa today bestselleran indie next pickan octob... | 800 | 104 | 92 | 2 | 52 | -0.953 | 63 | 0.61 |
| 4 | Horror | ghost-stories | From the New York Times best-selling author of... | new york times bestselling author southern boo... | new york time bestsel author southern book clu... | new york time bestselling author southern book... | 603 | 77 | 74 | 5 | 15 | -0.9726 | 28 | 0.36 |


In [4]:
blurbs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21414 entries, 0 to 21413
Data columns (total 14 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   genre                   21414 non-null  object 
 1   sub-genre               21414 non-null  object 
 2   original                21414 non-null  object 
 3   clean                   21414 non-null  object 
 4   stemmed                 21414 non-null  object 
 5   lemmatized              21414 non-null  object 
 6   lem_char_count          21414 non-null  int64  
 7   lem_word_count          21414 non-null  int64  
 8   lem_unique_word_count   21414 non-null  int64  
 9   sentence_count          21414 non-null  int64  
 10  avg_words_per_sentence  21414 non-null  int64  
 11  sentiment               21414 non-null  float64
 12  stopword_count          21414 non-null  int64  
 13  word_stopword_ratio     21414 non-null  float64
dtypes: float64(2), int64(6), object(6)
mem

# Split the Data

Since I will be using cross-validation, I won't need a separate validation set, only train and test sets.

In [5]:
train, test = train_test_split(blurbs, stratify = blurbs.genre, test_size = .25, random_state = 123)
train.shape, test.shape

((16060, 14), (5354, 14))
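
To illustrate what the `stratify` argument does in the split above, here is a minimal sketch on made-up labels. The toy `genre` values and their proportions are assumptions for illustration, not the blurbs data. It compares class proportions in the full set, the train split, and the test split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with an imbalanced label column (hypothetical data)
toy = pd.DataFrame({
    'genre': ['Horror'] * 60 + ['Romance'] * 30 + ['Mystery'] * 10,
    'value': range(100),
})

# Same call pattern as the notebook: stratify on the label column
toy_train, toy_test = train_test_split(
    toy, stratify=toy.genre, test_size=0.25, random_state=123
)

# Class proportions in the full data and in each split
full_props = toy.genre.value_counts(normalize=True).sort_index()
train_props = toy_train.genre.value_counts(normalize=True).sort_index()
test_props = toy_test.genre.value_counts(normalize=True).sort_index()

print(pd.DataFrame({'full': full_props, 'train': train_props, 'test': test_props}))
```

The three columns should be nearly identical, which is the point of stratifying: each genre is represented in train and test in roughly the same share as in the full data.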

# Create X and y Groups

In [6]:
X_train, y_train = train.drop(columns = ['genre']), train.genre
X_test, y_test = test.drop(columns = ['genre']), test.genre

In [7]:
#Check train shapes
X_train.shape, y_train.shape

((16060, 13), (16060,))

In [8]:
#Check test shapes
X_test.shape, y_test.shape

((5354, 13), (5354,))

# Scale the X Groups

In [9]:
#Instantiate the scaler
scaler = MinMaxScaler()

#List the numeric columns to scale
num_cols = ['lem_char_count', 'lem_word_count', 'lem_unique_word_count', 'sentence_count',
            'avg_words_per_sentence', 'sentiment', 'stopword_count', 'word_stopword_ratio']

#Fit and transform the data on train
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])

#Transform the data on test
X_test[num_cols] = scaler.transform(X_test[num_cols])
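
The fit-on-train, transform-on-test pattern above can be seen on a tiny example. This is only a sketch with made-up numbers (the `toy_*` names are hypothetical). It shows that test values are scaled with the train minimum and maximum, so they can land outside the 0 to 1 range.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical train/test values for a single numeric feature
toy_train = pd.DataFrame({'lem_word_count': [50, 100, 150]})
toy_test = pd.DataFrame({'lem_word_count': [75, 200]})

toy_scaler = MinMaxScaler()

# Learn the min (50) and max (150) from train only
train_scaled = toy_scaler.fit_transform(toy_train)

# Reuse the train min/max on test; 200 lies above the train max
test_scaled = toy_scaler.transform(toy_test)

print(train_scaled.ravel())  # values 0.0, 0.5, 1.0
print(test_scaled.ravel())   # values 0.25, 1.5
```

A test value below the train minimum would come out negative in the same way, which matters for algorithms that require non-negative input.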

# Part 1 - Engineered Features Only

Make predictions using only the engineered features, not the term frequencies or TF-IDF.

In [10]:
#Select only the engineered features
X_train_part1 = X_train.drop(columns = ['sub-genre', 'original', 'clean', 'stemmed', 'lemmatized'])


In [11]:
X_train_part1.head()

| | lem_char_count | lem_word_count | lem_unique_word_count | sentence_count | avg_words_per_sentence | sentiment | stopword_count | word_stopword_ratio |
|---|---|---|---|---|---|---|---|---|
| 17405 | 0.063722 | 0.070363 | 0.107901 | 0.029289 | 0.132867 | 0.986747 | 0.048842 | 0.255 |
| 11323 | 0.037368 | 0.041007 | 0.068819 | 0.037657 | 0.062937 | 0.87872 | 0.030056 | 0.27 |
| 18975 | 0.053889 | 0.058248 | 0.100255 | 0.020921 | 0.146853 | 0.023256 | 0.047589 | 0.3 |
| 8934 | 0.050461 | 0.060112 | 0.092608 | 0.025105 | 0.132867 | 0.014904 | 0.053225 | 0.325 |
| 594 | 0.05445 | 0.064772 | 0.105353 | 0.029289 | 0.125874 | 0.6001 | 0.071384 | 0.405 |


### Create Baseline

I will use the dummy classifier with the stratified strategy to create my baseline.

In [12]:
#Instantiate the model
baseline_model = DummyClassifier(strategy = 'stratified', random_state = 123)

#Fit the model
baseline_model.fit(X_train_part1, y_train)

#Score the model
baseline_model.score(X_train_part1, y_train)

0.27017434620174346

__Baseline Accuracy: 27%__ 
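
For context on the baseline number: the stratified strategy guesses each class at random, in proportion to how often it appears in the training labels. The expected accuracy of that guess is therefore the sum of the squared class proportions. The sketch below computes that value for a made-up label distribution. The `expected_stratified_accuracy` helper and the toy labels are assumptions for illustration, not the actual genre counts.

```python
import numpy as np

def expected_stratified_accuracy(labels):
    """Expected accuracy of guessing each class with its observed frequency."""
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    # A guess is right when the true class and the random guess agree
    return float(np.sum(proportions ** 2))

# Hypothetical label distribution: 50% / 30% / 20%
toy_labels = ['Horror'] * 50 + ['Romance'] * 30 + ['Mystery'] * 20
print(expected_stratified_accuracy(toy_labels))  # 0.25 + 0.09 + 0.04 = 0.38
```

A single scored run of the dummy classifier will land near this expected value rather than exactly on it, since the guesses are random.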

### Random Forest Classifier

Begin with the random forest classifier algorithm. Write a function that utilizes grid search and cross-validation to optimize it and return the best model.

In [13]:
def get_random_forest_models(X_train, y_train, param_dict, cv = 5):
    """
    This function creates and returns an optimized random forest classification model. It also
    prints out the best model's mean cross-validated accuracy score and parameters.
    
    This function takes in the X and y training sets to fit the models.
    
    This function takes in a dictionary that contains the parameters to be iterated through.
    
    This function also takes in a value for the number of cross validation folds to do.
    The cv value defaults to 5.
    """
    #Create the classifier model
    clf = RandomForestClassifier(random_state = 123)
    
    #Create the GridSearchCV object
    grid = GridSearchCV(clf, param_dict, cv = cv)
    
    #Fit the GridSearchCV object
    grid.fit(X_train, y_train)
    
    #Print the best model's score and parameters
    print('Mean Cross-Validated Accuracy: ', round(grid.best_score_, 4))
    print('Max Depth: ', grid.best_params_['max_depth'])
    print('Min Samples Per Leaf: ', grid.best_params_['min_samples_leaf'])
    
    #Return the best model
    return grid.best_estimator_

In [14]:
#Create the dictionary of parameters and their values to iterate through
rf_dict = {
    'max_depth': range(14, 26),
    'min_samples_leaf': range(1, 16)
}

In [15]:
#best_rf_model = get_random_forest_models(X_train_part1, y_train, rf_dict)

Best Attempt:

- Mean Cross-Validated Accuracy: 0.4351
- Max Depth: 14
- Min Samples Per Leaf: 4
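
The function above reports only the winning combination. As a possible complement, the sketch below shows how the fitted `GridSearchCV` object's `cv_results_` attribute can be turned into a table with one row per parameter combination. It runs on a small synthetic dataset (an assumption standing in for `X_train_part1` and `y_train`) with a deliberately tiny grid so it finishes quickly.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic stand-in for the real training data (assumption)
X_toy, y_toy = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=123
)

toy_grid = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=123),
    {'max_depth': [2, 4], 'min_samples_leaf': [1, 5]},
    cv=5,
)
toy_grid.fit(X_toy, y_toy)

# One row per parameter combination, best-ranked first
columns = ['param_max_depth', 'param_min_samples_leaf',
           'mean_test_score', 'std_test_score', 'rank_test_score']
results = pd.DataFrame(toy_grid.cv_results_)[columns].sort_values('rank_test_score')
print(results)
```

Looking at the full table makes it easier to tell whether the best score sits at the edge of the grid, as it does for Max Depth: 14 above, which may suggest extending the range. `GridSearchCV` also accepts `n_jobs=-1` to spread the fits across CPU cores, which can help with long run times.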

### K-Nearest Neighbors

Now write a function to create an optimized K-Nearest Neighbors model. It will behave like the previous function.

In [16]:
def get_KNN_models(X_train_scaled, y_train, param_dict, cv = 5):
    """
    This function takes in scaled data and builds an optimized KNN classification model. 
    It will use the parameters specified in the param_dict to optimize across.
    
    This function utilizes GridSearchCV.
    
    This function returns the best model and prints out its parameters and mean
    cross-validated accuracy.
    """
    #Create the KNN model
    clf = KNeighborsClassifier()
    
    #Create the GridSearchCV object
    grid = GridSearchCV(clf, param_dict, cv = cv)
    
    #Fit the GridSearchCV object
    grid.fit(X_train_scaled, y_train)
    
    #Print the best model's score and parameters
    print('Mean Cross-Validated Accuracy: ', round(grid.best_score_, 4))
    print('Num Neighbors: ', grid.best_params_['n_neighbors'])
    print('Weights: ', grid.best_params_['weights'])
    
    #Return the best model
    return grid.best_estimator_

In [17]:
#Create the dict of parameters and their values to optimize across
knn_dict = {
    'n_neighbors': range(10, 500, 10),
    'weights': ['uniform', 'distance']
}

In [18]:
#best_knn_model = get_KNN_models(X_train_part1, y_train, knn_dict)

Best Attempt:

- Mean Cross-Validated Accuracy: 0.4215
- Num Neighbors: 130
- Weights: distance

### Gaussian Naive Bayes

Write a function that creates a Gaussian Naive Bayes model.

In [19]:
def get_gauss_models(X_train_scaled, y_train, param_dict, cv = 5):
    """
    This function will create an optimized Gaussian Naive Bayes classification model. It will
    use the X_train_scaled and y_train data for fitting. The param_dict contains the parameters
    and their values that the GridSearchCV function will optimize across. The cv parameter
    indicates how many folds will be fitted and evaluated, and defaults to 5.
    
    This function prints out the mean cross-validated accuracy, best parameters, and returns
    the best model.
    """
    
    #Create the Gaussian Naive Bayes model
    clf = GaussianNB()
    
    #Create the GridSearchCV object
    grid = GridSearchCV(clf, param_dict, cv = cv)
    
    #Fit the GridSearchCV object
    grid.fit(X_train_scaled, y_train)
    
    #Print the best model's score and parameters
    print('Mean Cross-Validated Accuracy: ', round(grid.best_score_, 4))
    print('Smoothing: ', grid.best_params_['var_smoothing'])
    
    #Return the best model
    return grid.best_estimator_

In [20]:
#Create the param_dict
gauss_dict = {
    'var_smoothing': np.logspace(0,-9, num=100)
}

In [21]:
#best_gauss_model = get_gauss_models(X_train_part1, y_train, gauss_dict)

Best Attempt:

- Mean Cross-Validated Accuracy: 0.3993
- Smoothing: 0.035
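
To make the `var_smoothing` grid more concrete, this short sketch rebuilds the same `np.logspace(0, -9, num=100)` array and looks at its endpoints, its spacing, and the grid point nearest the reported best value. The variable names are illustrative only.

```python
import numpy as np

# Same grid as gauss_dict: 100 points from 10**0 down to 10**-9
smoothing_grid = np.logspace(0, -9, num=100)

# Endpoints: roughly 1.0 and 1e-09
print(smoothing_grid[0], smoothing_grid[-1])

# Points are evenly spaced on a log scale, so neighbors share a constant ratio
print(smoothing_grid[1] / smoothing_grid[0])

# The grid point closest to the reported best value of 0.035
closest = smoothing_grid[np.argmin(np.abs(smoothing_grid - 0.035))]
print(round(closest, 4))
```

The lower endpoint, 1e-9, is the default `var_smoothing` in `GaussianNB`, so the search covers the default as well as much heavier smoothing.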

### Multinomial Naive Bayes

Write a function that creates a Multinomial Naive Bayes model. I know this algorithm is better suited to discrete counts, such as word counts, but I want to see if it will work using the engineered features in Part 1.

In [22]:
def get_multinomial_models(X_train_scaled, y_train, param_dict, cv = 5):
    """
    This function will create an optimized Multinomial Naive Bayes classification model. It will
    use the X_train_scaled and y_train data for fitting. The param_dict contains the parameters
    and their values that the GridSearchCV function will optimize across. The cv parameter
    indicates how many folds will be fitted and evaluated, and defaults to 5.
    
    This function prints out the mean cross-validated accuracy, best parameters, and returns
    the best model.
    """
    
    #Create the Multinomial Naive Bayes model
    clf = MultinomialNB()
    
    #Create the GridSearchCV object
    grid = GridSearchCV(clf, param_dict, cv = cv)
    
    #Fit the GridSearchCV object
    grid.fit(X_train_scaled, y_train)
    
    #Print the best model's score and parameters
    print('Mean Cross-Validated Accuracy: ', round(grid.best_score_, 4))
    print('Alpha: ', grid.best_params_['alpha'])
    print('Fit Prior: ', grid.best_params_['fit_prior'])
    
    #Return the best model
    return grid.best_estimator_

In [23]:
#Create the param_dict
multinomial_dict = {
    'alpha': range(1, 101),
    'fit_prior': [True, False]
}

In [24]:
#Get the multinomial models
#best_multinomial_model = get_multinomial_models(X_train_part1, y_train, multinomial_dict)

Best Attempt:

- Mean Cross-Validated Accuracy: 0.3217
- Alpha: 1
- Fit Prior: True
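
One practical reason the scaled features can be fed to this algorithm at all is that `MultinomialNB` requires non-negative input, and min-max scaling maps the training values into the 0 to 1 range. The sketch below uses made-up feature values (an assumption, loosely modeled on a sentiment score that can be negative) to show the raw values being rejected and the scaled values being accepted.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import MinMaxScaler

# Hypothetical features; the first column contains negative values
X_raw = np.array([[-0.9, 10.0], [0.8, 25.0], [-0.2, 40.0], [0.5, 5.0]])
y_toy = np.array(['Horror', 'Romance', 'Horror', 'Romance'])

# Fitting on the raw values fails because of the negative entries
try:
    MultinomialNB().fit(X_raw, y_toy)
    raw_fit_ok = True
except ValueError as err:
    raw_fit_ok = False
    print('Raw features rejected:', err)

# After min-max scaling every value lies in [0, 1], so the fit succeeds
X_scaled = MinMaxScaler().fit_transform(X_raw)
scaled_model = MultinomialNB().fit(X_scaled, y_toy)
print(scaled_model.predict(X_scaled))
```

That the model fits does not mean the features suit it: the low accuracy above is consistent with these features not behaving like counts.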

# Part 2 - Word Counts Only

Make predictions using only the word counts, not the engineered features or TF-IDF.

In [25]:
#Initialize the count vectorizer
cv = CountVectorizer()

#Create the bag of words from the lemmatized column in the X_train df
X_train_bow = cv.fit_transform(X_train.lemmatized)

In [26]:
#Initialize another count vectorizer and add bigrams
cv = CountVectorizer(ngram_range = (1,2))

#Create the bag of words from the lemmatized column in the X_train df
X_train_bow_bigrams = cv.fit_transform(X_train.lemmatized)

In [27]:
#Initialize another count vectorizer and add trigrams
cv = CountVectorizer(ngram_range = (1,3))

#Create the bag of words from the lemmatized column in the X_train df
X_train_bow_trigrams = cv.fit_transform(X_train.lemmatized)
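
To show how `ngram_range` changes the feature set built above, here is a small sketch on two made-up blurbs (the `toy_*` names and text are assumptions). It fits one `CountVectorizer` per range and records how many columns each produces.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up, already-cleaned blurbs
toy_docs = ['the haunted house', 'the haunted forest']

vocab_sizes = {}
for ngram_range in [(1, 1), (1, 2), (1, 3)]:
    toy_cv = CountVectorizer(ngram_range=ngram_range)
    toy_bow = toy_cv.fit_transform(toy_docs)
    # Number of columns equals the number of distinct n-grams found
    vocab_sizes[ngram_range] = toy_bow.shape[1]
    print(ngram_range, sorted(toy_cv.vocabulary_))

print(vocab_sizes)  # {(1, 1): 4, (1, 2): 7, (1, 3): 9}
```

Even on two short documents the column count more than doubles. On thousands of blurbs the growth is much larger, which is likely a big part of why the bigram and trigram searches take so long.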

### Random Forest Classifier - Individual Words

Create the Random Forest Classifier models using only the bag of words as features. Start with only the individual word counts, then add bigrams and trigrams.

In [28]:
rf_dict = {
    'max_depth': range(41, 46),
    'min_samples_leaf': range(1, 2)
}

In [29]:
#best_rf_model = get_random_forest_models(X_train_bow, y_train, rf_dict)

Best Attempt:

- Mean Cross-Validated Accuracy: 0.8043
- Max Depth: 43
- Min Samples Per Leaf: 1

### Random Forest Classifier - Individual Words and Bigrams

In [30]:
rf_dict = {
    'max_depth': range(46, 51),
    'min_samples_leaf': range(1,2)
}

In [31]:
#Call the function to create the models
#best_rf_model = get_random_forest_models(X_train_bow_bigrams, y_train, rf_dict)

Best Attempt:

- Mean Cross-Validated Accuracy: 0.7773
- Max Depth: 50
- Min Samples Per Leaf: 1

Due to the long run time of each search, I'm stopping here. Adding bigrams does not appear to improve on the unigram-only feature set.

### Random Forest Classifier - Individual Words, Bigrams, and Trigrams

In [32]:
rf_dict = {
    'max_depth': range(45, 51),
    'min_samples_leaf': range(1, 2)
}

In [33]:
#Call the function to create the models
#best_rf_model = get_random_forest_models(X_train_bow_trigrams, y_train, rf_dict)

Best Attempt:
    
- Mean Cross-Validated Accuracy: 0.7468
- Max Depth: 49
- Min Samples Per Leaf: 1

Due to the long run time of each search, I'm stopping here. Adding trigrams does not appear to improve on the unigram-only feature set either.

### K-Nearest Neighbors - Individual Words

After doing some research, I have learned that KNN can actually be used for text classification. However, instead of just using simple word counts, I should use normalized TF-IDF values. I will still attempt the classification with just word counts, but the real test for this algorithm will be in the TF-IDF section.
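
One likely reason normalization matters for KNN: `KNeighborsClassifier` uses Euclidean distance by default, and raw counts make long blurbs look far from short ones regardless of topic. Once vectors are scaled to unit length (the default for `TfidfVectorizer`), Euclidean distance depends only on the angle between them. The sketch below checks that relationship numerically on two random vectors, which are made-up stand-ins for TF-IDF rows.

```python
import numpy as np

rng = np.random.default_rng(123)

# Two hypothetical non-negative vectors, scaled to unit L2 length
a = rng.random(50)
b = rng.random(50)
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

cosine_similarity = float(a @ b)
squared_euclidean = float(np.sum((a - b) ** 2))

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
print(squared_euclidean, 2 - 2 * cosine_similarity)
```

The two printed numbers agree, so ranking neighbors by Euclidean distance on unit-length vectors gives the same order as ranking them by cosine similarity.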

In [34]:
#Create the dict of parameters and their values to optimize across
knn_dict = {
    'n_neighbors': range(11, 502, 10),
    'weights': ['uniform', 'distance']
}

In [35]:
#Create the models
#best_knn_model = get_KNN_models(X_train_bow, y_train, knn_dict)

Best Attempt:
    
- Mean Cross-Validated Accuracy: 0.3832
- Num Neighbors: 10
- Weights: distance

Yeah, it seems like KNN won't be much use here. I'm going to move on.

### Gaussian Naive Bayes - Individual Words

After doing some research, I have learned that the Gaussian Naive Bayes algorithm is actually not a good choice for text classification. From here on, I will no longer consider this algorithm and instead focus on the others.

### Multinomial Naive Bayes - Individual Words

After doing some research, I have learned that the Multinomial Naive Bayes algorithm is a good choice for text classification, especially when dealing with word counts. I do not know how it will perform with TF-IDF, but I will find out later.

In [36]:
#Create the param_dict
multinomial_dict = {
    'alpha': range(1, 101),
    'fit_prior': [True, False]
}

In [37]:
#Create the models
#best_multinomial_model = get_multinomial_models(X_train_bow, y_train, multinomial_dict)

Best Attempt:
    
- Mean Cross-Validated Accuracy: 0.8649
- Alpha: 1
- Fit Prior: True

I'd also like to note that this algorithm was __extremely fast__ compared to the Random Forest Classifiers above. It also performed quite a bit better.

### Multinomial Naive Bayes - Individual Words and Bigrams

In [38]:
#Create the models
#best_multinomial_model = get_multinomial_models(X_train_bow_bigrams, y_train, multinomial_dict)

Best Attempt:

- Mean Cross-Validated Accuracy: 0.8786
- Alpha: 1
- Fit Prior: True

Performed slightly better than just using the individual word counts.

### Multinomial Naive Bayes - Individual Words, Bigrams, and Trigrams

In [39]:
#Create the models
#best_multinomial_model = get_multinomial_models(X_train_bow_trigrams, y_train, multinomial_dict)

Best Attempt:
    
- Mean Cross-Validated Accuracy: 0.8813
- Alpha: 1
- Fit Prior: True

Performed slightly better than just using unigram and bigram counts.

### Multinomial Naive Bayes - Bigrams

Since this algorithm is performing so well and training so quickly, I've decided to see how it will do using only Bigram counts.

In [40]:
#initialize the count vectorizer
cv = CountVectorizer(ngram_range = (2,2))

#Fit and transform the train data
X_train_bigrams_only = cv.fit_transform(X_train.lemmatized)

In [41]:
#Create the models
#best_multinomial_model = get_multinomial_models(X_train_bigrams_only, y_train, multinomial_dict)

Best Attempt:
    
- Mean Cross-Validated Accuracy: 0.8311
- Alpha: 5
- Fit Prior: True

Although it still performed well, it wasn't quite as good as the others.

### Multinomial Naive Bayes - Trigrams

Same as above, but with trigrams only.

In [42]:
#Initialize the count vectorizer
cv = CountVectorizer(ngram_range = (3,3))

#Fit and transform the train data
X_train_trigrams_only = cv.fit_transform(X_train.lemmatized)

In [43]:
#Create the models
#best_multinomial_model = get_multinomial_models(X_train_trigrams_only, y_train, multinomial_dict)

Best Attempt:
    
- Mean Cross-Validated Accuracy: 0.6686
- Alpha: 13
- Fit Prior: True

This one was surprisingly bad compared to the others. It seems using bigrams and trigrams on their own is not better than using them as supplemental data to the individual word counts.

# Part 3 - TF-IDF

Now I will build models using only the TF-IDF values.

In [44]:
#Create tf-idf vectorizer without bigrams or trigrams
tfidf = TfidfVectorizer()

X_train_tfidf = tfidf.fit_transform(X_train.lemmatized)

In [45]:
#Create tf-idf vectorizer with unigrams and bigrams
tfidf = TfidfVectorizer(ngram_range = (1,2))

X_train_tfidf_bigrams = tfidf.fit_transform(X_train.lemmatized)

In [46]:
#Create tf-idf vectorizer with unigrams, bigrams, and trigrams
tfidf = TfidfVectorizer(ngram_range = (1,3))

X_train_tfidf_trigrams = tfidf.fit_transform(X_train.lemmatized)
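
When it is time to score held-out text, the vectorizer fitted on train has to be reused with `transform` so that the columns line up. The sketch below shows that pattern on made-up text (all `toy_*` names are assumptions). Note that the cells above reassign `tfidf` each time, so each fitted vectorizer would need its own variable name to be reused later.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up lemmatized blurbs standing in for train and test text
toy_train_text = ['haunted house ghost night', 'love story summer romance']
toy_test_text = ['ghost story unknown word']

toy_tfidf = TfidfVectorizer(ngram_range=(1, 2))

# Learn the vocabulary and IDF weights from train only
toy_train_matrix = toy_tfidf.fit_transform(toy_train_text)

# Map unseen text onto the same columns; n-grams not seen in train are dropped
toy_test_matrix = toy_tfidf.transform(toy_test_text)

print(toy_train_matrix.shape, toy_test_matrix.shape)
```

Both matrices have the same number of columns. Only "ghost" and "story" from the test sentence survive, because the other words and all of its bigrams never appeared in the training text.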

### Random Forest Classifier - Unigrams Only

Here, I will build a Random Forest Classifier model using the unigram tf-idf feature set. Since the random forest classifier took so long to run in the last section, I'm only going to run each of these once, using the same parameter dictionary that gave me the best results before. It will give me a general idea of what I can expect out of each model.

In [47]:
rf_dict = {
    'max_depth': range(45, 51),
    'min_samples_leaf': range(1, 2)
}

In [48]:
#Call the function to create the models
#best_rf_model = get_random_forest_models(X_train_tfidf, y_train, rf_dict)

Best Attempt:
    
- Mean Cross-Validated Accuracy: 0.8104
- Max Depth: 49
- Min Samples Per Leaf: 1

### Random Forest Classifier - Unigrams and Bigrams

In [49]:
#Call the function to create the models
#best_rf_model = get_random_forest_models(X_train_tfidf_bigrams, y_train, rf_dict)

Best Attempt:
    
- Mean Cross-Validated Accuracy: 0.7773
- Max Depth: 49
- Min Samples Per Leaf: 1

### Random Forest Classifier - Unigrams, Bigrams, and Trigrams

In [50]:
#Call the function to create the models
#best_rf_model = get_random_forest_models(X_train_tfidf_trigrams, y_train, rf_dict)

Best Attempt:
    
- Mean Cross-Validated Accuracy: 0.7502
- Max Depth: 50
- Min Samples Per Leaf: 1

Not quite as good as the last model, but still not bad. It did take a while though.

### K-Nearest Neighbors - Unigrams Only

Now create a KNN model that uses the unigram tf-idf feature set.

In [51]:
#Create the dict of parameters and their values to optimize across
knn_dict = {
    'n_neighbors': range(11, 502, 10),
    'weights': ['uniform', 'distance']
}

In [52]:
#Create the models
#best_knn_model = get_KNN_models(X_train_tfidf, y_train, knn_dict)

Best Attempt:
    
- Mean Cross-Validated Accuracy: 0.8286
- Num Neighbors: 171
- Weights: distance

Pretty good! Way better than when we used KNN for the unigram counts.

### K-Nearest Neighbors - Unigrams and Bigrams

In [53]:
#Create the models
#best_knn_model = get_KNN_models(X_train_tfidf_bigrams, y_train, knn_dict)

Best Attempt:
    
- Mean Cross-Validated Accuracy: 0.8353
- Num Neighbors: 251
- Weights: distance

### K-Nearest Neighbors - Unigrams, Bigrams, and Trigrams

In [54]:
#Create the models
#best_knn_model = get_KNN_models(X_train_tfidf_trigrams, y_train, knn_dict)

Best Attempt:
    
- Mean Cross-Validated Accuracy: 0.8333
- Num Neighbors: 101
- Weights: distance

### Multinomial Naive Bayes - Unigrams

I know that multinomial naive bayes is really only supposed to work with integer counts, but I read in the sklearn documentation for the algorithm that fractional counts also work, specifically tf-idf. I'm not sure how it will perform, but I've read that multinomial naive bayes often does better with tf-idf than just a simple word count. Let's find out.

In [55]:
#Create the param_dict
multinomial_dict = {
    'alpha': range(1, 101),
    'fit_prior': [True, False]
}

In [56]:
#Create the models
#best_multinomial_model = get_multinomial_models(X_train_tfidf, y_train, multinomial_dict)

Best Attempt:
    
- Mean Cross-Validated Accuracy: 0.8302
- Alpha: 1
- Fit Prior: False

Although not as good as the model using the simple word counts, it still performed surprisingly well.

### Multinomial Naive Bayes - Unigrams and Bigrams

In [57]:
#Create the models
#best_multinomial_model = get_multinomial_models(X_train_tfidf_bigrams, y_train, multinomial_dict)


Best Attempt:
    
- Mean Cross-Validated Accuracy: 0.8373
- Alpha: 1
- Fit Prior: False

### Multinomial Naive Bayes - Unigrams, Bigrams, and Trigrams

In [58]:
#Create the models
#best_multinomial_model = get_multinomial_models(X_train_tfidf_trigrams, y_train, multinomial_dict)

# Part 4 - Word Count and TF-IDF

Can we combine the word count feature set with the tf-idf feature set and get better results? I will use scipy.sparse.hstack to concatenate the sparse matrices and then feed the combined sparse matrix into a multinomial naive bayes model.

In [59]:
from scipy.sparse import hstack

In [60]:
#Concatenate the sparse matrices
X_train_bow_tfidf = hstack([X_train_bow, X_train_tfidf])
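
As a small illustration of what `hstack` does here, the sketch below joins two tiny sparse matrices that stand in for a count matrix and a TF-IDF matrix over the same three documents. The values are made up. The `format='csr'` argument is an optional choice in this sketch, not something the notebook uses.

```python
from scipy.sparse import csr_matrix, hstack

# Tiny stand-ins for a count matrix and a TF-IDF matrix over the same 3 documents
toy_counts = csr_matrix([[1, 0, 2], [0, 1, 0], [3, 0, 0]])
toy_weights = csr_matrix([[0.4, 0.0, 0.9], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0]])

# Columns are placed side by side; the row counts must match
toy_combined = hstack([toy_counts, toy_weights], format='csr')

print(toy_combined.shape)  # (3, 6)
print(toy_combined.toarray())
```

Each document keeps its row, and the feature count is the sum of the two inputs. Without `format`, the result typically comes back in COO format. Requesting CSR gives a matrix that supports the row slicing used when splitting into folds.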

### Multinomial Naive Bayes - Unigrams Only

In [61]:
#Create the param_dict
multinomial_dict = {
    'alpha': range(1, 101),
    'fit_prior': [True, False]
}

In [62]:
#Create the models
#best_multinomial_model = get_multinomial_models(X_train_bow_tfidf, y_train, multinomial_dict)

Best Attempt:
    
- Mean Cross-Validated Accuracy: 0.8669
- Alpha: 1
- Fit Prior: True

Performed slightly better than simply using the word counts for unigrams only.

### Multinomial Naive Bayes - Unigrams and Bigrams

In [63]:
#Concatenate the sparse matrices
X_train_bow_tfidf_bigrams = hstack([X_train_bow_bigrams, X_train_tfidf_bigrams])

In [64]:
#Create the models
#best_multinomial_model = get_multinomial_models(X_train_bow_tfidf_bigrams, y_train, multinomial_dict)

Best Attempt:
    
- Mean Cross-Validated Accuracy: 0.875
- Alpha: 1
- Fit Prior: False

### Multinomial Naive Bayes - Unigrams, Bigrams, and Trigrams

In [65]:
#Concatenate the sparse matrices
X_train_bow_tfidf_trigrams = hstack([X_train_bow_trigrams, X_train_tfidf_trigrams])

In [66]:
#Create the models
#best_multinomial_model = get_multinomial_models(X_train_bow_tfidf_trigrams, y_train, multinomial_dict)

Best Attempt:
    
- Mean Cross-Validated Accuracy: 0.8786
- Alpha: 1
- Fit Prior: False

Although this ties for the second-best mean accuracy so far, it still trails the unigram, bigram, and trigram word count feature set, though only by about 0.3 percentage points (0.8786 vs. 0.8813).

# Part 5 - Combining Word Counts with Engineered Features

In this section, I will attempt to combine the vectorized word counts with the engineered features from Part 1. I'm not sure if it will work, but I think it's worth a try.

### Multinomial Naive Bayes - Unigrams with Engineered Features

In [67]:
#Concatenate the data
X_train_part_5 = hstack([X_train_part1, X_train_bow])
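
The cell above passes a pandas DataFrame directly to `hstack` alongside a sparse matrix. A more explicit variant, sketched below on made-up data, converts the frame to a sparse block first. The `toy_*` names and values are assumptions. The point to notice is that rows are matched by position, not by index label.

```python
import pandas as pd
from scipy.sparse import csr_matrix, hstack

# Hypothetical scaled engineered features for 3 rows, with a shuffled index
# like the one left behind by train_test_split
toy_features = pd.DataFrame(
    {'sentiment': [0.9, 0.1, 0.5], 'word_stopword_ratio': [0.3, 0.4, 0.2]},
    index=[17405, 11323, 18975],
)

# Hypothetical word counts for the same 3 rows, in the same order
toy_bow = csr_matrix([[1, 0, 2], [0, 1, 0], [3, 0, 0]])

# Convert the frame to a sparse block, then stack by row position
toy_part5 = hstack([csr_matrix(toy_features.to_numpy()), toy_bow], format='csr')

print(toy_part5.shape)  # (3, 5)
```

Because both inputs here come from the same `X_train` in the same order, the positional match lines up. If either side were re-sorted, the features would be paired with the wrong blurbs without any error being raised.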

In [68]:
#Create the models
#best_multinomial_model = get_multinomial_models(X_train_part_5, y_train, multinomial_dict)

Best Attempt:
    
- Mean Cross-Validated Accuracy: 0.8648
- Alpha: 1
- Fit Prior: True

### Multinomial Naive Bayes - Unigrams and Bigrams with Engineered Features

In [69]:
#Concatenate the data
X_train_part_5_bigrams = hstack([X_train_part1, X_train_bow_bigrams])

In [70]:
#Create the models
#best_multinomial_model = get_multinomial_models(X_train_part_5_bigrams, y_train, multinomial_dict)

Best Attempt:
    
- Mean Cross-Validated Accuracy: 0.8784
- Alpha: 1
- Fit Prior: True

### Multinomial Naive Bayes - Unigrams, Bigrams, and Trigrams with Engineered Features

In [71]:
#Concatenate the data
X_train_part_5_trigrams = hstack([X_train_part1, X_train_bow_trigrams])

In [72]:
#Create the models
#best_multinomial_model = get_multinomial_models(X_train_part_5_trigrams, y_train, multinomial_dict)

Best Attempt:
    
- Mean Cross-Validated Accuracy: 0.8813
- Alpha: 1
- Fit Prior: True

After reviewing the results, it appears that the engineered features made essentially no difference: each mean accuracy is within 0.0002 of the corresponding word-count-only model from Part 2.

# Part 6 - TF-IDF Combined with Engineered Features

In this section, I will test whether or not the engineered features offer any value when combined with the TF-IDF feature sets. Since KNN performed relatively well with the TF-IDF feature sets, I will test it along with the multinomial naive bayes algorithm.

### K-Nearest Neighbors - Unigrams and Engineered Features

In [73]:
#Concatenate the data sets
X_train_part_6 = hstack([X_train_tfidf, X_train_part1])

In [74]:
#Create the dict of parameters and their values to optimize across
knn_dict = {
    'n_neighbors': range(11, 502, 10),
    'weights': ['uniform', 'distance']
}

In [75]:
#Create the models
#best_knn_model = get_KNN_models(X_train_part_6, y_train, knn_dict)

Best Attempt:
    
- Mean Cross-Validated Accuracy: 0.8007
- Num Neighbors: 31
- Weights: distance

### K-Nearest Neighbors - Unigrams, Bigrams and Engineered Features

In [76]:
#Concatenate the data sets
X_train_part_6_bigrams = hstack([X_train_tfidf_bigrams, X_train_part1])

In [78]:
#Create the models
#best_knn_model = get_KNN_models(X_train_part_6_bigrams, y_train, knn_dict)


Best Attempt:
    
- Mean Cross-Validated Accuracy: 0.803
- Num Neighbors: 11
- Weights: distance

### K-Nearest Neighbors - Unigrams, Bigrams, Trigrams and Engineered Features

In [79]:
#Concatenate the data sets
X_train_part_6_trigrams = hstack([X_train_tfidf_trigrams, X_train_part1])

In [80]:
#Create the models
#best_knn_model = get_KNN_models(X_train_part_6_trigrams, y_train, knn_dict)


Best Attempt:
    
- Mean Cross-Validated Accuracy: 0.7958
- Num Neighbors: 11
- Weights: distance

### Multinomial Naive Bayes - Unigrams and Engineered Features

In [81]:
#Create the param_dict
multinomial_dict = {
    'alpha': range(1, 101),
    'fit_prior': [True, False]
}

In [82]:
#Create the models
#best_multinomial_model = get_multinomial_models(X_train_part_6, y_train, multinomial_dict)


Best Attempt:
    
- Mean Cross-Validated Accuracy: 0.8085
- Alpha: 1
- Fit Prior: False

### Multinomial Naive Bayes - Unigrams, Bigrams and Engineered Features

In [83]:
#Create the models
#best_multinomial_model = get_multinomial_models(X_train_part_6_bigrams, y_train, multinomial_dict)


Best Attempt:
    
- Mean Cross-Validated Accuracy: 0.7575
- Alpha: 1
- Fit Prior: False

### Multinomial Naive Bayes - Unigrams, Bigrams, Trigrams and Engineered Features

In [84]:
#Create the models
#best_multinomial_model = get_multinomial_models(X_train_part_6_trigrams, y_train, multinomial_dict)


Best Attempt:
    
- Mean Cross-Validated Accuracy: 0.714
- Alpha: 1
- Fit Prior: False

# Conclusion

After creating all of these different models and testing the different feature sets, I have found that my best model was the Multinomial Naive Bayes algorithm with the vectorized unigram, bigram, and trigram count feature set. If I want to improve the accuracy of my model, I will need to build better engineered features or improve the quality of my data. I could also try rerunning everything above with the stemmed descriptions instead of the lemmatized ones. From here, I will move on to the final report.

Best Model:

- Algorithm: Multinomial Naive Bayes
- Mean Cross-Validated Accuracy: 0.8813
- Alpha: 1
- Fit Prior: True
- Feature Set: Unigram, Bigram, and Trigram Counts
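
All of the scores above are cross-validated on the training data, and the test split has not been used yet. One possible way to run the final check is sketched below. It is not necessarily how the final report does it. The sketch wraps the vectorizer and classifier in a scikit-learn pipeline so the vocabulary is learned from train only. The text, labels, and `toy_*` names are made up. Only the `ngram_range`, `alpha`, and `fit_prior` settings come from the best model listed above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up lemmatized blurbs and genres standing in for the train and test splits
toy_train_text = [
    'haunted house ghost night', 'ghost curse dark night',
    'love story summer romance', 'summer love wedding kiss',
]
toy_train_genre = ['Horror', 'Horror', 'Romance', 'Romance']
toy_test_text = ['dark haunted night', 'summer wedding romance']
toy_test_genre = ['Horror', 'Romance']

# Vectorizer and classifier configured like the best model above
final_model = make_pipeline(
    CountVectorizer(ngram_range=(1, 3)),
    MultinomialNB(alpha=1, fit_prior=True),
)

# Fit on train only, then score once on the untouched test split
final_model.fit(toy_train_text, toy_train_genre)
test_predictions = final_model.predict(toy_test_text)
test_accuracy = final_model.score(toy_test_text, toy_test_genre)
print(list(test_predictions), test_accuracy)
```

With the real data, the same pattern would be `final_model.fit(X_train.lemmatized, y_train)` followed by `final_model.score(X_test.lemmatized, y_test)`.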