In [252]:
url = 'https://www.thinkful.com'

### Unsupervised Learning Challenge: Build your own NLP model  
#### For this challenge, you will need to choose a corpus of data from nltk or another source that includes categories you can predict and create an analysis pipeline that includes the following steps:

1.  Data cleaning / processing / language parsing  
2.  Create features using two different NLP methods: For example, BoW vs tf-idf.  
3.  Use the features to fit supervised learning models for each feature set to predict the category outcomes.  
4.  Assess your models using cross-validation and determine whether one model performed better.  
5.  Pick one of the models and try to increase accuracy by at least 5 percentage points. 

Write up your report in a Jupyter notebook. Be sure to explicitly justify the choices you make throughout, and submit it below.  
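
Steps 2 through 4 can be sketched on a toy corpus as follows. This is only an illustration under stated assumptions: the twelve sentences, the author labels, the helper name `mean_cv_accuracy`, and the choice of logistic regression with 3 folds are all made up for the example and are not the data or settings used in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: six short sentences for each of two categories.
texts = [
    "the whale rose beside the ship",
    "the captain watched the sea for the whale",
    "sailors lowered the boats at dawn",
    "the harpoon struck and the whale dived",
    "the ship sailed on through the storm",
    "the crew hauled the ropes on deck",
    "the rabbit looked at his watch and ran",
    "alice fell slowly down the rabbit hole",
    "the queen shouted at the gardeners",
    "the cat grinned and vanished from the tree",
    "alice drank from the little bottle",
    "the hatter poured tea for the hare",
]
labels = ["melville"] * 6 + ["carroll"] * 6


def mean_cv_accuracy(vectorizer, folds=3):
    """Mean cross-validated accuracy of one feature method plus one classifier."""
    pipe = Pipeline([("features", vectorizer),
                     ("clf", LogisticRegression(max_iter=1000))])
    scores = cross_val_score(pipe, texts, labels, scoring="accuracy", cv=folds)
    return scores.mean()


# Same classifier, two feature sets: raw counts (BoW) versus tf-idf weights.
bow_accuracy = mean_cv_accuracy(CountVectorizer())
tfidf_accuracy = mean_cv_accuracy(TfidfVectorizer())
print("BoW:    {:.2%}".format(bow_accuracy))
print("tf-idf: {:.2%}".format(tfidf_accuracy))
```

Putting the vectorizer inside the pipeline means it is refit on the training part of each fold, so no vocabulary leaks in from the held-out sentences.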

### Questions that need answering:

 1. What question are you trying to solve (or prove wrong) ?   
 __No questions to answer for this challenge.__
 1. What kind of data do you have? -> describe the source.. 
 __Sentiment analysis, using bag of words, tf-idf, Random Forest, Bernoulli Naive Bayes, and Multinomial Naive Bayes.__
 1. Do some EDA, plots
 1. What's missing from the data and how do you deal with it?  
 __No missing data here.__
 1. How can you add, change, or remove features to get more out of your data?  
 __No added features here; it's sentiment analysis.__

### Test Report Template

#### 


#### Key Learning. 
You cannot always improve a model by even 5 percentage points, especially when using bag of words, which as used here has no configurable parameters.

In [253]:
# Constants
max_iterations         = 10            # set it to > 0 for determining the features importance
random_state           = 57
rows_in_training_set   = 10000
rows_in_test_set       = 200000
test_size              = 0.10
train_size             = 0.90
rfc_test_size          = 50000
rfc_train_size         = 5000
sample_size            = 10
run_CountVectorizer    = False
run_TfidfVectorizer    = True
BegTimeStampNewlines   = 3
EndTimeStampNewlines   = 3
EndTimeStamp           = '\n'*EndTimeStampNewlines+'End'
BegTimeStamp           = 'Begin'+'\n'*BegTimeStampNewlines
SustainerSTDDEVLimit   = 0.020

num_clusters = 3
target_column = 'yyyy'
xcolumnname = 'xxxx'
CrossValidations = 5 # number of cross-validation folds

In [254]:
# Controls for running sentiment_analyzer
flag_to_run_rf = False
flag_to_plot_them = False
flag_to_run_correlation_matrix = False
flag_to_run_features_importance = False
flag_to_run_gradient_boosting  = False
flag_to_run_linear_regression  = False
flag_to_run_logistic_regression = False
flag_to_run_lasso_regression = False
flag_to_run_ridge_regression = False
flag_to_run_svc = False
flag_to_run_vectorizer_nb = False
flag_to_run_sentiment_analyzer = False
flag_to_run_affinity_propagation = False
flag_to_run_kmeans = False
flag_to_run_mean_shift = False
flag_to_run_spectral_clustering = False
flag_to_run_elbow_plot = False

debug = False

In [255]:
import numpy as np
import pandas as pd
import scipy
import matplotlib.pyplot as plt

%matplotlib inline

import chardet
import datetime
from sklearn import datasets, ensemble, metrics, linear_model
from sklearn.utils import shuffle
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
import time, sys
import seaborn as sns
from sklearn.svm import SVC
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, cross_val_score, GridSearchCV, train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import MinMaxScaler, normalize
from IPython.display import Markdown, display
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.metrics import pairwise_distances, mean_squared_error
from sklearn.cluster import AffinityPropagation, KMeans, MeanShift, estimate_bandwidth, SpectralClustering
from scipy.spatial.distance import cdist
import spacy
import re
from nltk.corpus import gutenberg, stopwords
import nltk
from collections import Counter
import gensim
from gensim.models import word2vec
import random

nltk.download('gutenberg') # Load the gutenberg nltk works
nlp = spacy.load('en')

[nltk_data] Downloading package gutenberg to /Users/lou/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


In [256]:
# add this to a dictionary
# Constants
max_iterations         = 10            # set it to > 0 for determining the features importance
random_state           = 57
test_size              = 0.10
train_size             = 0.90

begin_string = '\n'*3+'Begin'
end_string = 'End'+'\n'*3

# Regression/Classification control
Regression = False 

print("Regression = {}".format(Regression))

Regression = False


In [257]:
# Display preferences.
%matplotlib inline
pd.options.display.float_format = '{:.3f}'.format
pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

In [258]:
def plot_time_to_complete():
    objects = ('BernoulliNB', 'MultinomialNB', 'Logistic Regression')
    y_pos = np.arange(len(objects))
    performance = [18,17,32]

    plt.bar(y_pos, performance, align='center', alpha=0.5)
    plt.xticks(y_pos, objects)
    plt.ylabel('Time in Minutes')
    plt.title('Yelp Sentiment Analysis Time to Complete')

    plt.show()

In [259]:
def file_stuff():
    # Use this for stand-alone file
    
    path = "../../../../"
    filename = "Datafiles/bostonmarathon/results/2013/results.csv"
    print("fullfilename = {}".format(path+filename))
    df = pd.read_csv(path+filename)
    print("There are {} rows in this file.".format(df.shape[0]))
    return df

In [260]:
def text_cleaner(text):
    '''
    Utility function for standard text cleaning.
    '''
    # Visual inspection identifies a form of punctuation spaCy does not
    # recognize: the double dash '--'.  Better get rid of it now!
    text = re.sub(r'--', ' ', text)
    # Remove bracketed passages such as '[...]'.
    text = re.sub(r'\[.*?\]', '', text)
    # Collapse runs of whitespace (including newlines) into single spaces.
    text = ' '.join(text.split())
    
    return text
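
To make the three cleaning steps concrete, here is a small self-contained sketch that repeats the same substitutions on a made-up sample string. The sample text is hypothetical; only the regular expressions mirror the function above.

```python
import re


def text_cleaner(text):
    """Replace '--', drop bracketed passages, and collapse whitespace."""
    text = re.sub(r'--', ' ', text)
    text = re.sub(r'\[.*?\]', '', text)
    return ' '.join(text.split())


# Hypothetical sample: a bracketed title line, a double dash, and uneven spacing.
sample = "[Persuasion by Jane Austen 1818]\n\nSir Walter Elliot--a man of  vanity."
cleaned = text_cleaner(sample)
print(cleaned)
# prints: Sir Walter Elliot a man of vanity.
```

Because `.` does not match newlines by default, a bracketed passage is only removed when its opening and closing brackets sit on the same line.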

In [261]:
def load_clean_parse_group_gutenberg(gutenberg_file, author, percent_of_file): 
    """
    Load, clean, and parse one Gutenberg text, then group it into sentences.

    Currently, this function handles:
        persuasion  Persuasion, Austen
        alice       Alice in Wonderland, Carroll
        paradise    Paradise Lost, Milton
        moby        Moby Dick, Melville
    """

    # Load the raw text and strip the chapter/book headings.
    if gutenberg_file == "persuasion":
        file_to_load = gutenberg.raw('austen-persuasion.txt')
        book = re.sub(r'Chapter \d+', '', file_to_load)
    elif gutenberg_file == "alice":
        file_to_load = gutenberg.raw('carroll-alice.txt')
        book = re.sub(r'CHAPTER .*', '', file_to_load)
    elif gutenberg_file == "paradise":
        file_to_load = gutenberg.raw('milton-paradise.txt')
        book = re.sub(r'BOOK .*', '', file_to_load)
    elif gutenberg_file == "moby":
        file_to_load = gutenberg.raw('melville-moby_dick.txt')
        # Moby Dick is divided into chapters, not books.
        book = re.sub(r'CHAPTER .*', '', file_to_load)
    else:
        raise ValueError("Unknown gutenberg_file: {}".format(gutenberg_file))

    # Keep only the first 1/percent_of_file of the text (the argument is a divisor).
    book = text_cleaner(book[:int(len(book)/percent_of_file)])
    
    nlp = spacy.load('en')
    book_doc = nlp(book)
    
    print("book_doc datatype is {}".format(type(book_doc)))
    
    # One row per sentence: [sentence, author].
    return pd.DataFrame([[sent, author] for sent in book_doc.sents]), book_doc

In [262]:
def parse_gutenberg(book):
    '''
    # Parse the cleaned novels
    '''
    
    nlp = spacy.load('en')
#     book_doc = nlp(book)

    return nlp(book)

In [263]:
def group_gutenberg(book_doc, author):
    '''
    Group a parsed book into sentences, one row per [sentence, author].
    '''
    print("book_doc datatype is {}".format(type(book_doc)))
    
    book_sents = [[sent, author] for sent in book_doc.sents]

    sentences = pd.DataFrame(book_sents)
    if debug:
        display(sentences.head())
        
    return sentences

In [264]:
def dataset_cleanup(df):

    # Data cleanup, left over from the Boston Marathon challenge.

    df['gender_int'] = np.where(df['gender'] == 'M', 1, 0).astype(float)
    df['bib_int'] = df['bib'].replace(to_replace=r'[WF]', value='-', regex=True).astype(int)
    kcolumns = ['5k', '10k', '20k', '25k', '30k', '35k', '40k', 'half']
    for kcol in kcolumns:
        df[kcol] = np.where(df[kcol] == '-', 0, df[kcol])
        df[kcol] = df[kcol].astype(float)
    df['5kpace']   = df['5k']/5.0
    df['10kpace']  = df['10k']/10.0
    df['20kpace']  = df['20k']/20.0
    df['halfpace'] = df['half']/21.095
    df['25kpace']  = df['25k']/25.0
    df['30kpace']  = df['30k']/30.0
    df['35kpace']  = df['35k']/35.0
    df['40kpace']  = df['40k']/40.0
    df['officialpace'] = df['official']/42.19
    # df['raceavg'] = ,axis=0).mean()
    df['racestd'] = df[['5kpace','10kpace','20kpace','halfpace','25kpace','30kpace','35kpace','40kpace','officialpace']].std(axis=1)
    df['raceavg'] = df[['5kpace','10kpace','20kpace','halfpace','25kpace','30kpace','35kpace','40kpace','officialpace']].mean(axis=1)
#     X = df[['age', 'gender_int','genderdiv', 'country', 'official','racestd','raceavg']]
#     X = pd.get_dummies(X)

    df.drop('ctz', axis=1, inplace=True)
    df.drop('state',axis=1, inplace=True)
    # these are the 2% sustainers.  They can be running at any pace, but they are consistent!
    df['sustainer'] = np.where(df['racestd'] <= SustainerSTDDEVLimit, 1, 0).astype(float) # they sustained their pace very well for the race
    scaler = MinMaxScaler()
    
    scaler.fit(df[['age']])
    df['age_scaled'] = scaler.transform(df[['age']]).astype(float)
    
    scaler.fit(df[['overall']])
    df['overall_scaled'] = scaler.transform(df[['overall']]).astype(float)
    
    scaler.fit(df[['pace']])
    df['pace_scaled'] = scaler.transform(df[['pace']])
    
    scaler.fit(df[['official']])
    df['official_scaled'] = scaler.transform(df[['official']]).astype(float)
    
    display('columns are now', df.columns)
#     df = fcn_MinMaxScaler(df, 'age', 'age_scaled')
#     df = fcn_MinMaxScaler(df, 'official', 'official_scaled')
    X = df[['age_scaled', 'sustainer', 'gender_int', 'racestd']]
#     X = pd.get_dummies(X)
   
    
    display("df columns cpt 92310: ", df.columns)
    
    global target_column, xcolumnname, ycolumnname
    
#     target_column = 'overall_scaled'
#     xcolumnname = 'age_scaled'
    ycolumnname = target_column
    
    y = df[target_column]
    printFormatted("target, y column is {}".format(target_column))

    if debug == True:
        print_timestamp("X and y variables created")
        
    printFormatted('we have cleaned up the dataframe.')
    display_column_names('df values', df)
    display_column_names('X values', X)
    return df, X, y
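
The "sustainer" flag above marks runners whose pace barely varies between checkpoints. A minimal sketch of that idea, using two invented runners and only three checkpoints (the runner names, split times, and the constant name `SUSTAINER_STD_LIMIT` are assumptions for illustration):

```python
import numpy as np
import pandas as pd

SUSTAINER_STD_LIMIT = 0.020  # same threshold value as SustainerSTDDEVLimit

# Hypothetical cumulative split times in minutes at each checkpoint.
splits = pd.DataFrame(
    {'5k': [25.0, 25.0], '10k': [50.0, 52.0], '20k': [100.0, 110.0]},
    index=['steady', 'fading'],
)
distances = {'5k': 5.0, '10k': 10.0, '20k': 20.0}

# Average pace (minutes per km) up to each checkpoint.
paces = pd.DataFrame({name + 'pace': splits[name] / km for name, km in distances.items()})

# Row-wise standard deviation of the paces, then the 0/1 sustainer flag.
paces['racestd'] = paces.std(axis=1)
paces['sustainer'] = np.where(paces['racestd'] <= SUSTAINER_STD_LIMIT, 1, 0).astype(float)
print(paces)
```

The "steady" runner holds 5 minutes per kilometre throughout and is flagged, while the "fading" runner slows down and is not.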

In [265]:
def printFormatted(string):
    newline = '\n'
    display(Markdown(string))
    write_to_logfile(string+newline)

In [266]:
def fcn_MinMaxScaler(dataframe, orig_column, new_column):
    display("cp 1: In fcn_MinMaxScaler.  shape is:", dataframe.shape)
    scaler = MinMaxScaler()
    # MinMaxScaler expects 2-D input, so select the column with double
    # brackets both when fitting and when transforming.
    scaler.fit(dataframe[[orig_column]])
    dataframe[new_column] = scaler.transform(dataframe[[orig_column]]).ravel()
    display("cp 2: In fcn_MinMaxScaler.  shape is:", dataframe.shape)
    
    return dataframe
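
As a quick illustration of the scaling step, the sketch below rescales a single made-up `age` column to the range 0 to 1. The three ages are hypothetical; the point is that scikit-learn's scaler takes a 2-D selection (`df[['age']]`) rather than a 1-D Series.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: three ages.
df = pd.DataFrame({'age': [20, 35, 50]})

scaler = MinMaxScaler()
# Double brackets keep the selection 2-D; ravel() flattens the (n, 1) result.
df['age_scaled'] = scaler.fit_transform(df[['age']]).ravel()
print(df)
```

The smallest value maps to 0, the largest to 1, and everything else falls proportionally in between.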

In [267]:
def plot_facet():
    g = sns.FacetGrid(data=df, col='stars')
    g.map(plt.hist, 'message_length', bins=50)

In [268]:
def write_to_logfile(message, mdformat=''):
    # Append the message to the Markdown results log.
    with open('TestResults.md', 'a+') as the_file:
        the_file.write('{} {}'.format(mdformat, message))

In [269]:
def plot_model_accuracy():
    objects = ('BernoulliNB', 'MultinomialNB', 'Logistic Regression')
    y_pos = np.arange(len(objects))
    performance = [75.81,85.98,91.08]

    plt.bar(y_pos, performance, align='center', alpha=0.5)
    plt.xticks(y_pos, objects)
    plt.ylabel('Accuracy Percent')
    plt.title('Yelp Sentiment Analysis Accuracy')

    plt.show()

In [270]:
def print_timestamp(displaytext):    
    import sys
    import datetime
    datetime_now = str(datetime.datetime.now())
    printFormatted("{:19.19}: In: {} {} ".format(datetime_now, sys._getframe(1).f_code.co_name, displaytext))

In [271]:
def return_current_datetime():
    datetime_now = str(datetime.datetime.now())
    return datetime_now

In [272]:
def data_demographics(dataframe, num_rows):

    display("dataframe.isnull().sum()", dataframe.isnull().sum())

    display("dataframe.columns\n", dataframe.columns)
    display("dataframe.head({})\n".format(num_rows), dataframe.head(num_rows))

    display("dataframe.sample({})\n".format(num_rows), dataframe.sample(num_rows))
    display("dataframe.dtypes\n", dataframe.dtypes)
    display("dataframe.describe()\n", dataframe.describe())

In [273]:
def plot_them():
    # Scatter each feature (scaled by 100) against the target.
    for column in X_train.columns:
        plt.scatter(y_train, X_train[column]*100)
        plt.xlabel('target')
        plt.ylabel(column)
        plt.show()

In [274]:
def rfc_and_feature_importances(rf):
    # Use a Random Forest classifier to determine the top 30 features.

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, train_size=train_size)
    
    ## Fit the model on your training data.
    rf.fit(X_train, y_train) 
    
    ## And score it on your testing data.
    rf.score(X_test, y_test)

    feature_importance = rf.feature_importances_

    # Make importances relative to max importance.
    feature_importance = 100.0 * (feature_importance / feature_importance.max())
    sorted_idx = np.argsort(feature_importance)
    cols=X.columns[sorted_idx].tolist() 
    cols=cols[::-1]
    pos = np.arange(sorted_idx.shape[0]) + .5
    plt.subplot(1, 2, 2)
    plt.barh(pos, feature_importance[sorted_idx], align='center')
    plt.yticks(pos, X.columns[sorted_idx])
    plt.xlabel('Relative Importance')
    plt.title('Variable Importance')
    plt.show()
#     print("We are returning these columns {}".format(cols))
    return cols[:30] # return it sorted
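
The "relative to max importance" step can be seen in isolation in the sketch below. It uses a synthetic dataset from `make_classification`, which is an assumption for the example; the notebook's own `X` and `y` are not involved.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 6 features, 2 of them informative.
X, y = make_classification(n_samples=300, n_features=6, n_informative=2,
                           n_redundant=0, random_state=57)

rf = RandomForestClassifier(n_estimators=100, random_state=57).fit(X, y)

# Rescale so the most important feature scores 100.
relative = 100.0 * (rf.feature_importances_ / rf.feature_importances_.max())

# Feature indices from most to least important.
order = np.argsort(relative)[::-1]
for idx in order:
    print("feature {}: {:.1f}".format(idx, relative[idx]))
```

Rescaling does not change the ranking; it only makes the bar chart easier to read.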

In [275]:
def run_features_importance(rf, n):
    # Run the feature-importance ranking n times and report how often
    # each feature appears among the top-ranked columns.
    all_feature_important_columns = []
 
    for i in range(1, n+1):
        print_timestamp('running rfc features importance, iteration {} of {}'.format(i, n))
        columns2 = rfc_and_feature_importances(rf)
        all_feature_important_columns = all_feature_important_columns + columns2

    print("\nBOD:\nall_feature_important_columns = {}\nEOD".format(sorted(all_feature_important_columns)))
    for feature in set(all_feature_important_columns):
        print_timestamp("the number of occurrences of feature {} in all_feature_important_columns is {}".format(feature, all_feature_important_columns.count(feature)))
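
The tallying at the end of this function could also be done with `collections.Counter`, which the notebook already imports. A small sketch, with invented column names standing in for the results of three runs:

```python
from collections import Counter

# Hypothetical top-feature lists returned by three separate runs.
runs = [
    ['pace', 'age'],
    ['pace', 'gender_int'],
    ['age', 'pace'],
]

# Flatten the runs and count how often each feature appears.
counts = Counter(feature for run in runs for feature in run)
for feature, occurrences in counts.most_common():
    print("{}: {}".format(feature, occurrences))
```

Features that show up in every run are the ones whose importance is stable across random train/test splits.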

In [276]:
def run_correlation_matrix():
    
    print_timestamp('Begin'+'\n'*3)
    
    # Setup the correlation matrix.
    corrmat = X.corr()
    print(corrmat)

    # Set up the subplots
    f, ax = plt.subplots(figsize=(12, 9))

    # Let's draw the heatmap using seaborn.
    sns.heatmap(corrmat, vmax=.6, square=True)
    plt.show()
    
    print_timestamp('\n'*3+'End')

In [277]:
def data_characteristics():
    
    printFormatted("#### Columns used in the dataset")
    display(df.columns)

    print("\n\n")
    printFormatted("#### Describe of the df dataset")
    display(df.describe())

    print("\n\n")
    printFormatted("#### Sample of 10 from the dataset")
    display(df.sample(sample_size))

    print("\n\n")
    printFormatted("#### Number of nulls in X")
    display(X.isnull().sum())
    print("\n\n\n")

In [278]:
def training_test_set(X, y):
#     global X_train, X_test, y_train, y_test
    # Split the data into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, train_size=train_size, random_state=0)
    print("train_size = {}, X_train is {}, and y_train is {}".format(train_size, len(X_train), len(y_train)))
    print("test_size  = {}, X_test  is {}, and y_test is {}".format(test_size, len(X_test), len(y_test)))
    
    return X_train, X_test, y_train, y_test

In [279]:
def run_rf(X_train=None, X_test=None, y_train=None, y_test=None, params=None, cross_validate=None, show_confusion_matrix=False):
    
    rfc = ensemble.RandomForestClassifier(n_estimators=1000)
    if params is not None:
        print("In run_rf, params = {}".format(params)) 
        rfc.set_params(**params)
        
    ## Fit the model on your training data.
    rfc_fit = rfc.fit(X_train, y_train)  
    
    # Let's score it with the training and test data sets
    print_training_and_test_scores(rfc_fit, X_train, X_test, y_train, y_test)
    
    # Let's produce the metrics scores
    print_metrics_score(rfc_fit, X_train, X_test, y_train, y_test)
    
    # Let's run cross validation 
    if cross_validate == True:
        print_cross_validation_scores(rfc_fit, X_train, X_test, y_train, y_test)
        
    # Let's run the confusion matrix.  The flag is a separate argument because
    # the name confusion_matrix refers to the function imported from sklearn.metrics.
    if show_confusion_matrix == True:
        confusion_matrix_function(rfc_fit, X_train, X_test, y_train, y_test)
      
    print_timestamp('End run_rf')

In [280]:
def run_BernoulliNB(X_train=None, X_test=None, y_train=None, y_test=None, params=None, cross_validate=None, show_confusion_matrix=False):
    
    # Our data is binary / boolean, so we're using the Bernoulli classifier.

    # Instantiate our model and store it in a new variable.
    bnb = BernoulliNB()
    if params is not None:
        print("In run_BernoulliNB, params = {}".format(params))
        bnb.set_params(**params)

    # Fit our model to the data.
    bnb_fit = bnb.fit(X_train, y_train)

    # Let's score it with the training and test data sets
    print_training_and_test_scores(bnb_fit, X_train, X_test, y_train, y_test)
    
    # Let's produce the metrics scores
    print_metrics_score(bnb_fit, X_train, X_test, y_train, y_test)
    
    # Let's run cross validation 
    if cross_validate == True:
        print_cross_validation_scores(bnb_fit, X_train, X_test, y_train, y_test)
        
    # Let's run the confusion matrix.  The flag is a separate argument because
    # the name confusion_matrix refers to the function imported from sklearn.metrics.
    if show_confusion_matrix == True:
        confusion_matrix_function(bnb_fit, X_train, X_test, y_train, y_test)

In [281]:
def sentiment_analyzer(path, parameters, classifier, tfidf_parms, X_train=None, X_test=None, y_train=None, y_test=None, cross_validate=None, show_confusion_matrix=False):
    # path A = the old path
    # path B = the new path, no CountVectorizer at all
    # Both paths build the same tfidf + classifier pipeline, except that
    # path B uses a default TfidfVectorizer() for the 'logit' classifier.
  
    if debug == True:
        print_timestamp(BegTimeStamp+" running with path={}".format(path))
    
    global vectorized
    vectorized = True
    
    # Classifier key -> constructor for the final pipeline step.
    classifier_builders = {
        'bnb':   lambda: BernoulliNB(**parameters),
        'svc':   lambda: SVC(kernel = 'linear', **parameters),
        'mlb':   lambda: MultinomialNB(**parameters),
        'logit': lambda: LogisticRegression(**parameters),
        'rfc':   lambda: ensemble.RandomForestClassifier(**parameters),
    }
    if path not in ("A", "B") or classifier not in classifier_builders:
        raise ValueError("Unknown path={!r} or classifier={!r}".format(path, classifier))

    if path == "B" and classifier == 'logit':
        tfidf = TfidfVectorizer()
    else:
        tfidf = TfidfVectorizer(**tfidf_parms)

    pipe = Pipeline([
        ('tfidf', tfidf),
        ('clf',   classifier_builders[classifier]())
    ])

    # Class names of the two pipeline steps, for the report header.
    classifier_name = type(pipe.named_steps['clf']).__name__
    tfidf_name = type(pipe.named_steps['tfidf']).__name__

    printFormatted("###  Now running with: tfidf={} and clf={} {}\nparameters={} \n\n tfidf_parms={}".format(tfidf_name,
                                                                                                classifier_name,
                                                                                                return_current_datetime(),
                                                                                                parameters,
                                                                                                tfidf_parms
                                                                                                ))
    pipefit = pipe.fit(X_train, y_train)

    # Let's score it with the training and test data sets
    print_training_and_test_scores(pipefit, X_train, X_test, y_train, y_test)
    
    # Let's produce the metrics scores
    print_metrics_score(pipefit, X_train, X_test, y_train, y_test)
    
    # Let's run cross validation 
    if cross_validate == True:
        print_cross_validation_scores(pipefit, X_train, X_test, y_train, y_test)
        
    # Let's run the confusion matrix.  The flag is a separate argument because
    # the name confusion_matrix refers to the function imported from sklearn.metrics.
    if show_confusion_matrix == True:
        confusion_matrix_function(pipefit, X_train, X_test, y_train, y_test)
            
    if debug == True:
        printFormatted("Steps information: {}".format(pipe.steps))
        print_timestamp("Finished running pipeline with:\n{}: ".format(classifier_name))
        
    print_timestamp(EndTimeStamp)
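
The function looks up pipeline steps by name to report which vectorizer and classifier are running. A minimal sketch of that lookup, using a two-step pipeline chosen for illustration (the variable names `step_names` and `has_vect` are not from the notebook):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf',   MultinomialNB()),
])

# named_steps maps each step name to its estimator object.
step_names = {name: type(step).__name__ for name, step in pipe.named_steps.items()}
print(step_names)

# A step that was never added is simply absent from the mapping.
has_vect = 'vect' in pipe.named_steps
print(has_vect)
```

Using `type(step).__name__` gives the class name directly, without parsing the estimator's printed representation.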

In [282]:
def print_training_and_test_scores(model, X_train, X_test, y_train, y_test):
    
    ## Let's score it with the training data set
    training_score = model.score(X_train, y_train) 
    printFormatted("### Training score = {:.2%}".format(training_score))
    
    ## Let's score it with the test data set  this is new 13-Aug-2019
    test_score = model.score(X_test, y_test)
    printFormatted("### Test score = {:.2%}".format(test_score))

In [283]:
def print_metrics_score(fit_model, X_train, X_test, y_train, y_test):
    y_pred_class  = fit_model.predict(X_test)
    metrics_test_score =  metrics.accuracy_score(y_test, y_pred_class)
    printFormatted('###  Metrics test accuracy score = {:.2%}'.format(metrics_test_score))

In [284]:
def print_cross_validation_scores(fit_model, X_train, X_test, y_train, y_test):
    # cross_val_score clones the estimator and refits it on each fold,
    # so these scores do not depend on the earlier fit.
    accuracy = cross_val_score(fit_model, X_train, y_train, scoring='accuracy', cv=CrossValidations)
    printFormatted("### Cross validation scores:  {}".format(accuracy))
    printFormatted("### Accuracy of Model with Cross Validation average is: {:.2%}".format(accuracy.mean()))

In [285]:
def run_gradient_boosting():

    print_timestamp('Begin')
    
    # Define the parameters before constructing the classifier.
    # GradientBoostingClassifier names the loss parameter 'loss';
    # 'deviance' is called 'log_loss' in recent scikit-learn releases.
    loss_function = 'deviance' # could be exponential
    depth_value = 8
    params = {'n_estimators': 500,
              'max_depth': 8,
              'loss': loss_function,
              'max_leaf_nodes': depth_value, # 8 worked best...
              'min_samples_leaf': depth_value * 3,
              'random_state': random_state
             }

    clf = ensemble.GradientBoostingClassifier(**params)

    # Let's run cross validate score with the training data set
    scores = cross_val_score(clf, X_train, y_train, cv=5)
    print("Cross validation scores: {}".format(scores))

    clf.fit(X_train, y_train)

    predict_train = clf.predict(X_train)
    predict_test = clf.predict(X_test)
    
    print_timestamp('End')

In [286]:
def run_svc(X_train=None, X_test=None, y_train=None, y_test=None, cross_validate=None, params=None):

    print_timestamp('\n'*3+'Begin run_svc')
    
    # Let's do a linear Support Vector Classifier
    print_timestamp('Running SVC(kernel=linear)')
    svm = SVC(kernel = 'linear')
    
    if params is not None:
        print("In run_svc, params = {}".format(params)) 
        svm.set_params(**params)
    
    # Let's fit the training model
    print_timestamp('Running svm.fit')
    svm.fit(X_train, y_train)
    
    # Let's score the training set
    print_timestamp('Running svm.score for the training set')
    svm_training_score = svm.score(X_train, y_train)
    printFormatted("###  SVM Training score={:.2%}".format(svm_training_score))

    # Let's score the test set
    print_timestamp('Running svm.score for the test set')
    svm_test_score = svm.score(X_test, y_test)
    printFormatted("###  SVM Test score={:.2%}".format(svm_test_score))

    if cross_validate == True:
        accuracy = cross_val_score(svm, X_train, y_train, scoring='accuracy', cv = 5)
        printFormatted("### Cross validation scores:  {}".format(accuracy))
        printFormatted("### Accuracy of Model with Cross Validation average is: {:.2%}".format(accuracy.mean()))

    print_timestamp('\n'*3+'End run_svc')

In [287]:
def run_logistic_regression():
    print_timestamp('\n'*3+'Begin')

    lr = LogisticRegression(C=1e20, solver='lbfgs', max_iter=1000)

    print_timestamp('Running lr.fit for the training set')
    lr.fit(X_train, y_train)
    
    print_timestamp('Running lr.score for the training set')
    print('\nAccuracy of the simple model on the training set:')
    print(lr.score(X_train, y_train))
    print("here comes the test set")
    lrscore = lr.score(X_test, y_test)
    printFormatted("###  Logistic Regression test score={:.2%}".format(lrscore))
    
    print_timestamp('\n'*3+'End')

In [288]:
def run_linear_regression():

    print_timestamp('\n'*3+'Begin')

    regr = linear_model.LinearRegression()

    print_timestamp('Running regr.fit for the training set')
    regr.fit(X_train, y_train)
    
    print("\nCoefficients: \n", regr.coef_)
    print("\nIntercept: \n", regr.intercept_)
    print("\nR-squared for training data set:")
    print(regr.score(X_train, y_train))
    
    print("\nR-squared for test data set:")
    print(regr.score(X_test, y_test))
    
    print_timestamp('\n'*3+'End')

In [289]:
def run_ridge_regression():
    # Fitting a ridge regression model. Alpha is the regularization
    # parameter (usually called lambda). As alpha gets larger, parameter
    # shrinkage grows more pronounced. Note that by convention, the
    # intercept is not regularized. Since we standardized the data
    # earlier, the intercept should be equal to zero and can be dropped.
    print_timestamp('\n'*3+'Begin')
    
    ridgeregr = linear_model.Ridge(alpha=10, fit_intercept=False) 
    ridgeregr.fit(X_train, y_train)
    print(ridgeregr.score(X_train, y_train))

    print_timestamp('\n'*3+'End')
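
The comment about shrinkage can be checked numerically. The sketch below fits ridge models with increasing `alpha` on synthetic data and records the size of the coefficient vector each time. The data, the true coefficients, and the three alpha values are assumptions chosen for the illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic, zero-centred data with known coefficients plus noise.
rng = np.random.RandomState(57)
X = rng.normal(size=(200, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=200)

alphas = [0.1, 10, 1000]
norms = []
for alpha in alphas:
    model = Ridge(alpha=alpha, fit_intercept=False).fit(X, y)
    # Euclidean length of the coefficient vector for this alpha.
    norms.append(np.linalg.norm(model.coef_))
    print("alpha={:>6}: coef={}, norm={:.3f}".format(alpha, np.round(model.coef_, 3), norms[-1]))
```

As `alpha` grows, the penalty dominates the fit and the coefficients are pulled toward zero, so the norm shrinks at every step.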

In [290]:
def run_affinity_propagation(data, target):
      
    print_timestamp('\n'*3+'Begin AffinityPropagation')
    
    ap = AffinityPropagation()
#     ap = AffinityPropagation(damping=0.5,
#                          max_iter=200,
#                          convergence_iter=15,
#                          copy=True,
#                          preference=None,
#                          affinity='euclidean',
#                          verbose=False) 

    model = ap.fit(data)
    pred = ap.predict(data)

    Z = merge_predict_and_cluster(data, target, pred) # let's merge the data dataframe, prediction, and the cluster
    
    # Pull the number of clusters and cluster assignments for each data point.
    cluster_centers_indices = ap.cluster_centers_indices_
    n_clusters_ = len(cluster_centers_indices)
    labels = ap.labels_
    
    print('Estimated number of clusters: {}'.format(n_clusters_))

    print("Silhouette score from run_affinity_propagation: {}".format(metrics.silhouette_score(data, labels, metric='euclidean')))
    
    print_timestamp('\n'*3+'finished with AffinityPropagation')
    
    return Z, n_clusters_
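
For reference, the same sequence (fit, count the exemplars, score with the silhouette coefficient) is sketched below on synthetic blobs. The blob centres, the `preference=-50` setting, and the guard around the silhouette call are assumptions for this example; the notebook itself runs `AffinityPropagation()` with defaults.

```python
from sklearn import metrics
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

# Synthetic data: three well-separated groups of points.
centers = [[1, 1], [-1, -1], [1, -1]]
data, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.5, random_state=0)

ap = AffinityPropagation(preference=-50, random_state=0).fit(data)

# Each exemplar index corresponds to one cluster.
n_clusters_ = len(ap.cluster_centers_indices_)
print('Estimated number of clusters: {}'.format(n_clusters_))

# The silhouette score is only defined when there are at least two clusters.
score = None
if n_clusters_ > 1:
    score = metrics.silhouette_score(data, ap.labels_, metric='euclidean')
    print('Silhouette score: {:.3f}'.format(score))
```

Unlike k-means, affinity propagation is not told how many clusters to find; the `preference` value influences how many exemplars emerge.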

In [291]:
def run_kmeans(data, target, K):

    print_timestamp('\n'*3+'Begin')
    print("running with number of clusters = {}".format(K))
    km = KMeans(n_clusters=K, random_state=42)

    pred = km.fit_predict(data)
    Z = merge_predict_and_cluster(data, target, pred) # let's merge the data dataframe, prediction, and the cluster

    if debug == True:  
        print("the shape of Kmeans_pred is {}, and the shape of X is {}, and the shape of Z is {}".format(pred.shape,
                                                                                                      data.shape,
                                                                                                      Z.shape))
        display(Z.head(100))
        display_column_names('Z below values', Z)

        count = Z.groupby(['cluster']).count() 
        display("Z: Count by clusters are this:\n", count) 
  
    print_timestamp('\n'*3+'End')

    return Z

In [292]:
def merge_predict_and_cluster(dataframe, target, predict):
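    # predict is a plain array; give it the dataframe's index so the index-based
    # merge lines up even when the index is not 0..n-1 (e.g. after a train/test split).
    Z = pd.merge(dataframe, target, left_index=True, right_index=True)
    clusters = pd.DataFrame(predict, index=dataframe.index)
    Z = pd.merge(Z, clusters, left_index=True, right_index=True)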
    Z.rename(columns={Z.columns[-1]: 'cluster'}, inplace=True)
    
    return Z
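
Merging on the index only works when both sides carry the same index labels. The sketch below uses a small hypothetical frame (the names `features` and `pred` are made up for illustration) to show what happens when cluster labels get a default 0..n-1 index while the feature rows keep their original labels, as they would after a shuffled split.

```python
import numpy as np
import pandas as pd

# Hypothetical feature rows whose index is not 0..n-1, as after a shuffled split.
features = pd.DataFrame({"x": [1.0, 2.0, 3.0]}, index=[7, 2, 9])
pred = np.array([0, 1, 0])  # one cluster label per row, in row order

# Default RangeIndex (0, 1, 2): only index label 2 is shared, so rows are lost.
naive = pd.merge(features, pd.DataFrame(pred), left_index=True, right_index=True)

# Reusing the feature index keeps every row paired with its own label.
aligned = pd.merge(features, pd.DataFrame(pred, index=features.index),
                   left_index=True, right_index=True)
aligned = aligned.rename(columns={aligned.columns[-1]: "cluster"})

print(len(naive), len(aligned))
```

The naive merge keeps a single row here, and that row is paired with the label that belongs to a different row, which is why passing the index explicitly is the safer habit.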

In [293]:
def confusion_matrix_function(model_fit, X_train, X_test, y_train, y_test):
    
    y_pred_class  = model_fit.predict(X_test)
    
    # Use scikit-learn's confusion_matrix; calling this function again would recurse forever.
    conf_matrix = metrics.confusion_matrix(y_test, y_pred_class)
    printFormatted("### Confusion Matrix:  {}".format(conf_matrix))
    
    print("\n\n")

In [294]:
def run_spectral_clustering(data, target, K):
    display_dataframe_shape('entering run_spectral_clustering, data has shape of:', data)
    display_dataframe_shape('entering run_spectral_clustering, target has shape of:', target)
    print_timestamp('\n'*3+'Begin')
    
#     for clusternum in range(2, K):
    print_timestamp("Running spectral_clustering with {} clusters.".format(K))
    n_clusters=K

    # Declare the model, then fit it and get the predicted clusters in one pass.
    sc = SpectralClustering(n_clusters=K)
    predict = sc.fit_predict(data)

    Z = merge_predict_and_cluster(data, target, predict) # let's merge the data dataframe, prediction, and the cluster

    if debug == True:
        display_dataframe_shape('in run_spectral_clustering, Z has shape of:', Z)
        display_dataframe_shape('in run_spectral_clustering, target has shape of:', target)
        display("the datatypes of Z are", Z.dtypes)

#     plt.scatter(Z['cluster'], Z[target_column], c=Z['cluster'])
#     plt.show()

    labels = sc.labels_
    print("from spectral clustering {}".format(metrics.silhouette_score(data, labels, metric='euclidean')))

#     print('Comparing the assigned categories to the ones in the data:')
#     print(pd.crosstab(target,predict))
    
    print_timestamp('\n'*3+'End')
    
    return Z

In [295]:
def do_the_elbow(X):
    printFormatted("## We are plotting the elbow method!")
    # calculate distortion for a range of number of cluster
    distortions = []
    for i in range(1, 11):
        km = KMeans(
            n_clusters=i, init='random',
            n_init=10, max_iter=300,
            tol=1e-04, random_state=0
        )
        km.fit(X)
        distortions.append(km.inertia_)

    # plot
    plt.plot(range(1, 11), distortions, marker='o')
    plt.xlabel('Number of clusters')
    plt.ylabel('Distortion')
    plt.show()
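
The elbow plot needs to be read by eye. As a complement, here is a minimal sketch that picks the number of clusters by silhouette score, the same metric the clustering functions in this notebook print. The data are synthetic blobs generated just for this example, not the notebook's data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Three tight, well-separated synthetic blobs (an assumption for illustration).
rng = np.random.RandomState(0)
centers = np.array([[0, 0], [10, 10], [-10, 10]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Silhouette needs at least 2 clusters, so the range starts at 2.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels, metric='euclidean')

best_k = max(scores, key=scores.get)
print(best_k)
```

On real text features the peak is usually far less clear than on blobs like these, so I would treat the result as one more hint next to the elbow plot rather than a final answer.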

In [296]:
def plot_it_clusters(dataframe, xvalue, yvalue, title):
    
    if debug == True:
        display_dataframe_shape('entry received in plot_it_clusters', dataframe)
        display(dataframe.dtypes)

    data_demographics(dataframe, 10)
        
    plt.rcParams['figure.figsize'] = [xvalue, yvalue]
    plt.xlabel(xcolumnname)
    plt.ylabel(ycolumnname)
    
    # One scatter per cluster label, each in its own color.
    # Only clusters 0 through 5 are drawn; higher-numbered clusters are skipped.
    colors = ['green', 'red', 'blue', 'black', 'magenta', 'orange']
    for cluster_id, color in enumerate(colors):
        subset = dataframe[dataframe.cluster == cluster_id]
        plt.scatter(subset[xcolumnname], subset[ycolumnname], color=color)
    plt.title(title)
    plt.show()
        
#     plt.scatter(km.cluster_centers_[:,0],km.cluster_centers_[:,1],color='purple',marker='*',label='centroid')
    
#     if type == 'KMeans':
#         plt.xlabel('Age')
#         plt.ylabel('Income ($)')
#         plt.legend()
#         plt.scatter(km.cluster_centers[:,0], 
#                     km.cluster_centers[:,1],
#                     marker = '*',
#                     label = 'centroid')

In [297]:
def run_mean_shift(data, target):
    
    print_timestamp('\n'*3+'Begin')  

    X_train = data
    
    # Here we set the bandwidth. This function automatically derives a bandwidth
    # number based on an inspection of the distances among points in the data.
    bandwidth = estimate_bandwidth(X_train, quantile=0.2, n_samples=500)

    # Declare and fit the model.
    ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
    if debug == True:
        display_dataframe_shape('this is the shape of data coming into run_mean_shift', data)
    ms.fit(data)

    if debug == True:
        display_dataframe_shape('this is the shape of target coming into run_mean_shift', target)
    pred = ms.predict(data)
    if debug == True:
        display_dataframe_shape('this is the shape of pred after predict in run_mean_shift', pred)
        
    Z = merge_predict_and_cluster(data, target, pred) # let's merge the data dataframe, prediction, and the cluster

    # Extract cluster assignments for each data point.
    labels = ms.labels_

    print("from mean shift {}".format(metrics.silhouette_score(data, labels, metric='euclidean')))
    
    # Coordinates of the cluster centers.
    cluster_centers = ms.cluster_centers_

    # Count our clusters.
    n_clusters_ = len(np.unique(labels))

    print("Number of estimated clusters: {}".format(n_clusters_))
    
    print_timestamp('\n'*3+'End')
    
    return Z, n_clusters_

In [298]:
def run_tfidf_vectorizer(df, max_df=0.5, min_df=2, stop_words='english', lowercase=True, use_idf=True, norm=u'l2', smooth_idf=True):
    print_timestamp("in run_tfidf_vectorizer: df is a {} datatype.".format(type(df)))
    vectorizer = TfidfVectorizer(
                             max_df=max_df,         # drop words that occur in more than this share of documents (default: half)
                             min_df=min_df,         # only use words that appear in at least this many documents (default: two)
                             stop_words=stop_words,
                             lowercase=lowercase,   # convert everything to lower case
                             use_idf=use_idf,       # use inverse document frequencies in the weighting
                             norm=norm,             # correction factor so that longer and shorter documents are treated equally
                             smooth_idf=smooth_idf  # adds 1 to all document frequencies, as if an extra document used every word once; prevents divide-by-zero errors
                            )

    # Fit the vectorizer. Note that fit() returns the fitted vectorizer itself,
    # not a document-term matrix, so the caller still has to call transform().
    tfidf_df = vectorizer.fit(df)
    
    return tfidf_df
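
To make the `smooth_idf` comment concrete, here is a small worked example of the smoothed inverse document frequency, idf = ln((1 + n) / (1 + df)) + 1, which is the form scikit-learn documents for `smooth_idf=True`. The corpus size and document frequencies are made-up numbers, and `smoothed_idf` is a helper written for this sketch, not a library function.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """Smoothed idf: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 9  # hypothetical corpus size
for doc_freq in (1, 4, 9):
    # Rarer words (small doc_freq) get a larger weight.
    print(doc_freq, round(smoothed_idf(n_docs, doc_freq), 3))
```

A word that appears in every document still gets a weight of 1 rather than 0, so it is down-weighted but not erased. Removing very common words is the job of `max_df` and `stop_words`.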

In [299]:
def text_cleaner(text):
    # Visual inspection identifies a form of punctuation spaCy does not
    # recognize: the double dash '--'.  Better get rid of it now!
    text = re.sub(r'--',' ',text)
    # Remove anything in square brackets (raw string avoids invalid-escape warnings).
    text = re.sub(r"\[.*?\]", "", text)
    text = ' '.join(text.split())
    return text
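
A quick check of what the cleaner does, on a made-up sample string. The function is repeated here so the snippet runs on its own.

```python
import re

def text_cleaner(text):
    text = re.sub(r'--', ' ', text)        # double dash becomes a space
    text = re.sub(r"\[.*?\]", "", text)    # drop bracketed text on a single line
    text = ' '.join(text.split())          # collapse all whitespace runs
    return text

sample = "Call me--Ishmael.  [Footnote 1]\nSome   years ago"
print(text_cleaner(sample))
```

One thing to be aware of: `.` does not match a newline, so a bracketed note that spans two lines is left in place (only its whitespace is collapsed).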

In [300]:
def bag_of_words(text):
    
    # Filter out punctuation and stop words.
    allwords = [token.lemma_
                for token in text
                if not token.is_punct
                and not token.is_stop]
    
    # Return the most common words.
    return [item[0] for item in Counter(allwords).most_common(2000)]

In [301]:
def bow_features_dev(sentences, common_words):
    display(sentences.head(10))
    num_sentences_to_print = 10
    print("inside bow_features: sentences is a {} datatype, of {} length,\nand common_words is a {} datatype of {} length."
          .format(type(sentences), sentences.shape[0], type(common_words), len(common_words)))
    print("here come {} sentences: {}".format(num_sentences_to_print, sentences[0:num_sentences_to_print]))
    # Scaffold the data frame and initialize counts to zero.
    df = pd.DataFrame(columns=common_words)
    print("in bow_features: sentences.iloc[0] = {}".format(sentences.iloc[0]))
    df['text_sentence'] = sentences.iloc[:, 0] # column 0 holds the sentences (.iloc[0] would select the first row instead)
    df['text_source'] = sentences.iloc[:, 1]   # column 1 holds the author label
    df.loc[:, common_words] = 0
    
    # Process each row, counting the occurrence of words in each sentence.
    num_values = len(list(enumerate(df['text_sentence'])))
    print("There are {} enumerated items for iterations.".format(num_values))
    for i, sentence in enumerate(df['text_sentence']):
        
        # Convert the sentence to lemmas, then filter out punctuation,
        # stop words, and uncommon words.
        words = [token.lemma_
                 for token in sentence
                 if (
                     not token.is_punct
                     and not token.is_stop
                     and token.lemma_ in common_words
                 )]
        
        # Populate the row with word counts.
        for word in words:
            df.loc[i, word] += 1
        
        # This counter is just to make sure the kernel didn't hang.
        if i % 50 == 0:
            print("Processing row {}".format(i))
            
    return df

In [302]:
def bow_features(sentences, common_words):
  
    # Scaffold the data frame and initialize counts to zero.
    df = pd.DataFrame(columns=common_words)
    df['text_sentence'] = sentences[0] # column 0: the spaCy sentence spans
    df['text_source'] = sentences[1]   # column 1: the author label
    df.loc[:, common_words] = 0
    
    # Process each row, counting the occurrence of words in each sentence.
    for i, sentence in enumerate(df['text_sentence']):
        
        # Convert the sentence to lemmas, then filter out punctuation,
        # stop words, and uncommon words.
        words = [token.lemma_
                 for token in sentence
                 if (
                     not token.is_punct
                     and not token.is_stop
                     and token.lemma_ in common_words
                 )]
        
        # Populate the row with word counts.
        for word in words:
            df.loc[i, word] += 1
        
        # This counter is just to make sure the kernel didn't hang.
        if i % 50 == 0:
            print("Processing row {}".format(i))
            
    return df
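
Incrementing one DataFrame cell at a time is slow, which is why the progress counter is there. Below is a minimal sketch of an alternative that counts each sentence with `collections.Counter` and builds the rows in plain Python. It assumes the sentences are already lists of lemmas (plain strings standing in for spaCy tokens), and `count_rows` is a hypothetical helper, not part of this notebook.

```python
from collections import Counter

def count_rows(token_lists, common_words):
    """Return one {word: count} dict per sentence, restricted to common_words."""
    vocab = set(common_words)  # set membership is much faster than scanning a list
    rows = []
    for tokens in token_lists:
        counts = Counter(t for t in tokens if t in vocab)
        rows.append({word: counts.get(word, 0) for word in common_words})
    return rows

# Hypothetical pre-lemmatized sentences standing in for spaCy output.
sentences = [["whale", "sea", "whale"], ["heaven", "sea"]]
common_words = ["whale", "sea", "heaven"]
rows = count_rows(sentences, common_words)
print(rows)
```

The list of dicts can be handed to `pd.DataFrame(rows)` to get the same one-column-per-word layout in a single step.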

In [303]:
def run_mnb(X_train=None, X_test=None, y_train=None, y_test=None, cross_validate=None, params=None):

    mnb = MultinomialNB()
    
    if params:  # a non-empty dict; "params == True" is never true for a dict
        print("In run_mnb, params = {}".format(params)) 
        mnb.set_params(**params)
        
    mnb_fit = mnb.fit(X_train, y_train)
        
#     training_score = mnb.score(X_train, y_train) 
#     printFormatted("### Training score = {:.2%}".format(training_score))
    
#      ## Let's score it with the test data set
#     test_score = mnb.score(X_test, y_test)
    
    #   Let's score it with the test data set    this is new 13-Aug-2019
    print_training_and_test_scores(mnb_fit, X_train, X_test, y_train, y_test) # new on 13-Aug-2019
    
#   Let's produce the metrics scores
    print_metrics_score(mnb_fit, X_train, X_test, y_train, y_test) # new on 13-Aug-2019
    
    if cross_validate == True:
        print_cross_validation_scores(mnb_fit, X_train, X_test, y_train, y_test)
        
    if confusion_matrix == True:
        confusion_matrix_function(mnb_fit, X_train, X_test, y_train, y_test)
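
A note on testing whether `params` was supplied: a dict never compares equal to `True`, so a check written as `params == True` is skipped even when settings are passed in. The snippet shows the difference, using a made-up `alpha` setting.

```python
params = {'alpha': 0.5}   # hypothetical MultinomialNB setting

equals_true = (params == True)   # compares a dict with a bool
is_truthy = bool(params)         # what "if params:" evaluates
empty_is_truthy = bool({})       # an empty dict counts as "no params"

print(equals_true, is_truthy, empty_is_truthy)
```

So `if params:` is the check that runs `set_params` for a non-empty dict and skips it for `{}` or `None`.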

In [304]:
def word2vec_function(sentences, workers=4, min_count=10, window=6, sg=0, sample=1e-3, size=300, hs=1):
# import gensim
# from gensim.models import word2vec
# original values from the curriculum:
# workers = 4, min_count=10, window=6, sg=0, sample=1e-3, size=300, hs=1

    model = word2vec.Word2Vec(
        sentences,
        workers=workers,     # Number of threads to run in parallel (if your computer does parallel processing).
        min_count=min_count,  # Minimum word count threshold.
        window=window,      # Number of words around target word to consider.
        sg=sg,          # Use CBOW because our corpus is small.
        sample=sample ,  # Penalize frequent words.
        size=size,      # Word vector length (called size in gensim < 4.0; renamed vector_size in 4.0).
        hs=hs           # Use hierarchical softmax.
    )

    print('done!')
    
    return model

In [305]:
def vectorizer_nb(type_of_vectorizer):

    print_timestamp(BegTimeStamp)
    
    # 1. import and instantiate CountVectorizer (with the default parameters)

    # 2. instantiate CountVectorizer (vectorizer)

#     X = df.message
#     y = df.sentiment_label

    # split X and y into training and testing sets
    # by default, it splits 75% training and 25% test
    # random_state=1 for reproducibility
    
#     X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

    # 3. pick the vectorizer
    # (X_train, X_test, y_train and y_test are read from the notebook's globals)
    if type_of_vectorizer == 'Count':
        print("We are running with CountVectorizer")
        vectorizer = CountVectorizer()
        vectorizer_method = 'CountVectorizer'
    elif type_of_vectorizer == 'Tfidf':
        print("We are running with TfidfVectorizer")
        vectorizer = TfidfVectorizer()
        vectorizer_method = 'TfidfVectorizer'
    else:
        raise ValueError("type_of_vectorizer must be 'Count' or 'Tfidf'")
    
    # 4. learn the vocabulary and transform the training data in a single step
    X_train_dtm = vectorizer.fit_transform(X_train)

    # 5. transform testing data (using fitted vocabulary) into a document-term matrix
    X_test_dtm = vectorizer.transform(X_test)

    # 1. import

    # 2. instantiate a Multinomial Naive Bayes model
    nb = MultinomialNB()

    # 3. train the model 
    # using X_train_dtm (timing it with an IPython "magic command")

    nb.fit(X_train_dtm, y_train)
    
    
    # 4. make class predictions for X_test_dtm
    y_pred_class = nb.predict(X_test_dtm)

    # calculate accuracy of class predictions

    met_test_score = metrics.accuracy_score(y_test, y_pred_class)
    printFormatted('###  With {} vectorizer, the metrics accuracy score = {:.2%}'.format(vectorizer_method,
                                                                                         met_test_score))
    
    print_timestamp(EndTimeStamp)
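
The vectorize-then-classify steps can also be bundled with scikit-learn's `make_pipeline`, which makes sure the vocabulary is learned from the training documents only. This is a minimal sketch on a tiny made-up corpus, not a replacement for the function above. The documents and labels are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny hypothetical corpus; labels: 1 = sea-themed, 0 = heaven-themed.
train_docs = ["the whale swam in the sea", "the ship sailed the sea",
              "angels sang in heaven", "heaven and the fallen angels"]
train_labels = [1, 1, 0, 0]
test_docs = ["a whale near the ship", "angels of heaven"]

predictions = {}
for vectorizer in (CountVectorizer(), TfidfVectorizer()):
    # The pipeline fits the vectorizer and the classifier together.
    model = make_pipeline(vectorizer, MultinomialNB())
    model.fit(train_docs, train_labels)
    predictions[type(vectorizer).__name__] = model.predict(test_docs).tolist()

print(predictions)
```

A pipeline object can also be passed straight to `cross_val_score`, so each fold re-fits the vectorizer on its own training split.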

In [306]:
def display_column_names(label, df):
    display("Label: {}: Column names are:".format(label), df.columns)

In [307]:
def display_datatype(var):
    # this function just returns the data type of the variable var
    return "{}".format(type(var))

In [308]:
def display_dataframe_shape(label, df):
    display("Label: {}: Dataframe shape is:".format(label), df.shape)

In [309]:
def run_it(X_train, X_test, y_train, y_test, y):
    
#     file_stuff()
    
#     data_cleanup()
    
    print_timestamp('\n'*3+'Begin')
    
    if Regression == True:
        print_timestamp("We are running with a Regression model")
    elif Regression == False:
        print_timestamp("We are running with a Classifier model")
    else:
        print_timestamp("We have failed to set the Regression variable")
        sys.exit("Regression must be set to True or False")
        

    if flag_to_plot_them == True:
        plot_them()

    if flag_to_run_features_importance == True:
        
        number_of_features_to_consider = 50
        params = {'n_estimators': 100}

        if Regression == True:
            print_timestamp('We are running RandomForestRegressor')
            rf = ensemble.RandomForestRegressor(**params)
            
        else:
            print_timestamp('We are running RandomForestClassifier')
            rf = ensemble.RandomForestClassifier(**params)

        run_features_importance(rf, number_of_features_to_consider)

    if flag_to_run_correlation_matrix == True:
        run_correlation_matrix()
        
    if flag_to_run_sentiment_analyzer == True:
        path = "B"


        for path in ['A']:
            for vectorizer_iterator in ['logit', 'mlb', 'bnb']:
                if vectorizer_iterator == 'rfc':
                    sentiment_analyzer(path=path, parameters=params, classifier=vectorizer_iterator, tfidf_parms=tfidf_parms)

                elif vectorizer_iterator == 'bnb':
                    parameters = {}
                    sentiment_analyzer(path=path, parameters=parameters, classifier=vectorizer_iterator, tfidf_parms=tfidf_parms)

                elif vectorizer_iterator == 'mlb':
                    parameters = {}
                    sentiment_analyzer(path=path, parameters=parameters, classifier=vectorizer_iterator, tfidf_parms=tfidf_parms)

                elif vectorizer_iterator == 'logit': # newton-cg took too long. sag and saga about the same as lbfgs.
                    tfidf_parms = {'max_features' :  10000 } # determined this through iterative testing
                    parameters = {'C' :1e20, 'solver': 'lbfgs', 'max_iter': 1000} # max_iter=100 reports warning, try 1000
                    sentiment_analyzer(path=path, parameters=parameters, classifier=vectorizer_iterator, tfidf_parms=tfidf_parms)

                elif vectorizer_iterator == 'svc':
                    parameters = {}
                    sentiment_analyzer(path=path, parameters=parameters, classifier=vectorizer_iterator, tfidf_parms=tfidf_parms)
                    if confusion_matrix != None:
                        confusion_matrix_function(y_test, y_pred_class)

    if flag_to_run_rf == True:
        #     params = {}
        params = {'n_estimators': 100} 

        if Regression == True:
            rf = ensemble.RandomForestRegressor(**params)
            print_timestamp('We are running RandomForestRegressor')
        else:
            rf = ensemble.RandomForestClassifier(**params)
            print_timestamp('We are running RandomForestClassifier')

        run_rf(rf)

    if flag_to_run_gradient_boosting  == True:
        run_gradient_boosting()

    if flag_to_run_linear_regression  == True:
        run_linear_regression()

    if flag_to_run_logistic_regression == True:
        run_logistic_regression()

    if flag_to_run_svc == True:
        run_svc() 

    if flag_to_run_ridge_regression == True:
        run_ridge_regression()
        
    if flag_to_run_vectorizer_nb == True:
        for vectorizer_iterator in ['Count', 'Tfidf']:
            vectorizer_nb(vectorizer_iterator)
        
    if flag_to_run_kmeans == True:
        method = KMeans(
             n_clusters=num_clusters
#                 ,random_state=42
#                 ,init='random'
#                 ,n_init=10
#                 ,max_iter=300
#                 ,tol=1e-04 
        )
        df1 = run_kmeans(X_train, y_train, num_clusters)
        plot_it_clusters(df1, xvalue=16, yvalue=16, title="KMeans with number of clusters = {}".format(num_clusters))
        display("next plot please")

    if flag_to_run_affinity_propagation == True:
        display_column_names('columns of X_train going into affinity_propagation: ', X_train)
        df2, ap_num_clusters = run_affinity_propagation(X_train, y_train)
        plot_it_clusters(df2, xvalue=16, yvalue=16, title="Affinity Propagation with number of clusters = {}".format(ap_num_clusters))
        
    if flag_to_run_mean_shift == True:
        df3, mean_shift_num_clusters = run_mean_shift(X_train, y_train)
        plot_it_clusters(df3, xvalue=16, yvalue=16, title="Mean Shift with number of clusters = {}".format(mean_shift_num_clusters))
    
    if flag_to_run_spectral_clustering == True:
        df4 = run_spectral_clustering(X_train, y_train, K=num_clusters)
        plot_it_clusters(df4, xvalue=16, yvalue=16,title="Spectral clustering with number of clusters = {}".format(num_clusters) )

    print_timestamp('End'+'\n'*3)

In [310]:
def main(entry_point):
        
    if entry_point == 0:
        print_timestamp("Starting main()")
        df = file_stuff()
        data_demographics(df, 5)
        display_column_names('post data_demographics of df', df)
        df, X, y = dataset_cleanup(df)
        display_column_names('post dataset_cleanup on X', X)
        data_demographics(df, 5)
        display_column_names('post data_demographics on X #2', X)
#         make_X_and_Y()
        X_train, X_test, y_train, y_test = training_test_set(X, y)
        display_column_names('after training_test_set: columns of X_train going into affinity_propagation: ', X_train)
#         data_characteristics()
#         plot_time_to_complete()
#         plot_model_accuracy()
#         plot_facet()

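        # Keep these inside the entry_point block: X and the train/test splits
        # only exist when the setup above has run.
        if flag_to_run_elbow_plot == True:
            do_the_elbow(X)
        run_it(X_train, X_test, y_train, y_test, y)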
        
    print_timestamp("Ending main()")

In [311]:
def do_word_counts_from_text_files(gutenberg_books, regex=None, book_divisor=1 ):
    print_timestamp("Starting do_word_counts_from_text_files()")

    # (file id, author label, heading pattern to strip) for each book.
    # gutenberg_books and regex are not used yet; the list below is hard-coded.
    book_list = [('milton-paradise.txt', 'Milton', r'BOOK .*'),
                 ('melville-moby_dick.txt', 'Melville', r'BOOK .*')]

    all_sents = []
    all_words = []
    for file_id, author, heading_pattern in book_list:
        text = gutenberg.raw(file_id)
        text = re.sub(heading_pattern, '', text)
        # Keep only the first 1/book_divisor of the book.
        text = text_cleaner(text[:int(len(text)/book_divisor)])

        if debug == True:
            print("length of {} is {}.".format(file_id, len(text)))

        doc = nlp(text)
        all_sents += [[sent, author] for sent in doc.sents]
        all_words += bag_of_words(doc)

    # Build the frame from the combined list in one go, so no append is needed.
    sentences = pd.DataFrame(all_sents)

    if debug == True:
        display("Here is a sample from sentences:\n", sentences.sample(20))
        print("sentences is a {} datatype.".format(display_datatype(sentences)))

    common_words = list(set(all_words))
    word_counts = bow_features(sentences, common_words)

    return word_counts

In [312]:
def main_nlp(entry_point):
    
    # this function was specifically designed for this NLP Challenge
    
    if entry_point == 0:
        confusion_matrix = None
        
        paradise_divisor = 20 # was 20
        moby_divisor     = 53 # was 53
        
        print_timestamp("Starting main_nlp()")
        
        paradise_book = 'milton-paradise.txt'
        moby_book = 'melville-moby_dick.txt'
        
        paradise = gutenberg.raw(paradise_book)
        paradise = re.sub(r'BOOK .*', '', paradise)

        moby = gutenberg.raw(moby_book)
        moby = re.sub(r'BOOK .*', '', moby)
        
        paradise = text_cleaner(paradise[:int(len(paradise)/paradise_divisor)])
        moby     = text_cleaner(moby[:int(len(moby)/moby_divisor)])
        
        if debug == True:
            print("length of paradise is {}, and length of moby is {}.".format(len(paradise), len(moby)))
        
        paradise_doc = nlp(paradise)
        moby_doc = nlp(moby)
        
        paradise_sents = [[sent, "Milton"] for sent in paradise_doc.sents]
        moby_sents = [[sent, "Melville"] for sent in moby_doc.sents]

        print("paradise_sents is a {} datatype, and moby_sents is a {} datatype.".format(display_datatype(paradise_sents), display_datatype(moby_sents)))
        sentences = pd.DataFrame(paradise_sents + moby_sents) 
        if debug == True:
            display("Here is a sample from sentences:\n", sentences.sample(20))  
            print("sentences is a {} datatype.".format(display_datatype(sentences)))
            sentences.head(30)
        
        paradisewords = bag_of_words(paradise_doc)
        mobywords = bag_of_words(moby_doc)
        common_words = list(set(paradisewords + mobywords))
        word_counts = bow_features(sentences, common_words)
        
        if debug == True:  display("words_counts sample:\n", word_counts.sample(20))
        word_counts['author'] = np.where(word_counts['text_source'] == 'Milton', pd.to_numeric(1), pd.to_numeric(0))

        X = np.array(word_counts.drop(['text_sentence', 'text_source', 'author'], axis=1))
        Y = word_counts['author']
        X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4, random_state=0)
        
        for runit in range(0,2):
            if runit == 0:
                
                n_estimators = 100
                
                params = {'n_estimators': n_estimators}
                printFormatted('## Running Random Forests with Word Counts')
                run_rf(X_train=X_train, X_test=X_test, y_train=y_train, y_test=y_test, params=params, cross_validate=True)

                printFormatted('## Running MultiNomial Naive Bayes with Word Counts')
                params = {}
                run_mnb(X_train=X_train, X_test=X_test, y_train=y_train, y_test=y_test, params=params, cross_validate=True)

                printFormatted('## Running Bernoulli Naive Bayes with Word Counts')
                run_BernoulliNB(X_train=X_train, X_test=X_test, y_train=y_train, y_test=y_test, params=params, cross_validate=True) 
       
            else:
            
                author = []
                all_paras = []
                for paragraph in gutenberg.paras(paradise_book):
                    para = paragraph[0]
                    para = [re.sub(r'--', '', word) for word in para]
                    all_paras.append(' '.join(para))
                    author.append('Milton')

                for paragraph in gutenberg.paras(moby_book):
                    para = paragraph[0]
                    para = [re.sub(r'--', '', word) for word in para]
                    all_paras.append(' '.join(para))
                    author.append('Melville')

                paragraphs = pd.DataFrame()        
                paragraphs['paragraphs'] = all_paras
                paragraphs['author'] = author
                paragraphs['author'] = np.where(paragraphs['author'] == 'Milton', pd.to_numeric(1), pd.to_numeric(0))
                
                if debug == True:
                    display("paragraphs sample:\n", paragraphs.sample(20))
                    display("describe paragraphs\n", paragraphs.describe())

                X = paragraphs['paragraphs']
                Y = paragraphs['author']
                printFormatted('## Running with Paragraph Counts and tfidf and others')

                X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4, random_state=0)
                params = {}
                tfidf_parms = {}
                for vectorizer_iterator in ['rfc', 'bnb', 'mlb', 'svc']:

                    path = 'A'
                    if vectorizer_iterator == 'rfc':
                        sentiment_analyzer(path=path, parameters=params, classifier=vectorizer_iterator, tfidf_parms=tfidf_parms, 
                                           X_train=X_train, X_test=X_test, y_train=y_train, y_test=y_test, cross_validate=True)

                    elif vectorizer_iterator == 'bnb':
                        parameters = {}
                        sentiment_analyzer(path=path, parameters=params, classifier=vectorizer_iterator, tfidf_parms=tfidf_parms, 
                                           X_train=X_train, X_test=X_test, y_train=y_train, y_test=y_test, cross_validate=True)

                    elif vectorizer_iterator == 'mlb':
                        parameters = {}
                        sentiment_analyzer(path=path, parameters=params, classifier=vectorizer_iterator, tfidf_parms=tfidf_parms, 
                                           X_train=X_train, X_test=X_test, y_train=y_train, y_test=y_test, cross_validate=True)
                        
                    elif vectorizer_iterator == 'logit': # newton-cg took too long. sag and saga about the same as lbfgs.
                        tfidf_parms = {'max_features' : 10000 } # determined this through iterative testing
                        
                        parameters = {'C' :1e20, 'solver': 'lbfgs', 'max_iter': 1000} # max_iter=100 reports warning, try 1000
                        sentiment_analyzer(path=path, parameters=parameters, classifier=vectorizer_iterator, tfidf_parms=tfidf_parms, 
                                           X_train=X_train, X_test=X_test, y_train=y_train, y_test=y_test, cross_validate=True)

                    elif vectorizer_iterator == 'svc':
                        parameters = {}
                        sentiment_analyzer(path=path, parameters=parameters, classifier=vectorizer_iterator, tfidf_parms=tfidf_parms, 
                                           X_train=X_train, X_test=X_test, y_train=y_train, y_test=y_test, cross_validate=True)

In [313]:
main_nlp(0)
print("it ran ok, dude!")

2019-08-14 23:06:41: In: main_nlp Starting main_nlp() 

paradise_sents is a <class 'list'> datatype, and moby_sents is a <class 'list'> datatype.
Processing row 0
Processing row 50
Processing row 100
Processing row 150
Processing row 200
Processing row 250
Processing row 300
Processing row 350
Processing row 400
Processing row 450
Processing row 500
Processing row 550
Processing row 600


## Running Random Forests with Word Counts

In run_rf, params = {'n_estimators': 100}


### Training score = 98.37%

### Test score = 86.59%

###  Metrics test accuracy score = 86.59%

### Cross validation scores:  [0.83783784 0.83783784 0.7972973  0.83783784 0.79166667]

### Accuracy of Model with Cross Validation average is: 82.05%

2019-08-14 23:09:57: In: run_rf End run_rfr part 1 

## Running MultiNomial Naive Bayes with Word Counts

### Training score = 97.55%

### Test score = 92.28%

###  Metrics test accuracy score = 92.28%

### Cross validation scores:  [0.90540541 0.85135135 0.89189189 0.90540541 0.875     ]

### Accuracy of Model with Cross Validation average is: 88.58%

## Running Bernoulli Naive Bayes with Word Counts

### Training score = 82.34%

### Test score = 83.33%

###  Metrics test accuracy score = 83.33%

### Cross validation scores:  [0.7972973  0.74324324 0.74324324 0.75675676 0.76388889]

### Accuracy of Model with Cross Validation average is: 76.09%

## Running with Paragraph Counts and tfidf and others

###  Now running with:  tfidf=TfidfVectorizer and clf=RandomForestClassifier 2019-08-14 23:09:59.268467
parameters={} 

 tfidf_parms={}



### Training score = 99.76%

### Test score = 98.76%

###  Metrics test accuracy score = 98.76%

### Cross validation scores:  [0.99115044 0.99115044 0.99115044 0.99112426 0.99112426]

### Accuracy of Model with Cross Validation average is: 99.11%

2019-08-14 23:09:59: In: sentiment_analyzer 


End 

###  Now running with:  tfidf=TfidfVectorizer and clf=BernoulliNB 2019-08-14 23:09:59.774286
parameters={} 

 tfidf_parms={}

### Training score = 99.05%

### Test score = 98.76%

###  Metrics test accuracy score = 98.76%

### Cross validation scores:  [0.99115044 0.99115044 0.99115044 0.99112426 0.99112426]

### Accuracy of Model with Cross Validation average is: 99.11%

2019-08-14 23:10:00: In: sentiment_analyzer 


End 

###  Now running with:  tfidf=TfidfVectorizer and clf=MultinomialNB 2019-08-14 23:10:00.125135
parameters={} 

 tfidf_parms={}

### Training score = 99.11%

### Test score = 98.76%

###  Metrics test accuracy score = 98.76%

### Cross validation scores:  [0.99115044 0.99115044 0.99115044 0.99112426 0.99112426]

### Accuracy of Model with Cross Validation average is: 99.11%

2019-08-14 23:10:00: In: sentiment_analyzer 


End 

###  Now running with:  tfidf=TfidfVectorizer and clf=SVC 2019-08-14 23:10:00.467223
parameters={} 

 tfidf_parms={}

### Training score = 99.11%

### Test score = 98.76%

###  Metrics test accuracy score = 98.76%

### Cross validation scores:  [0.99115044 0.99115044 0.99115044 0.99112426 0.99112426]

### Accuracy of Model with Cross Validation average is: 99.11%

2019-08-14 23:10:01: In: sentiment_analyzer 


End 

it ran ok, dude!


### Challenge Conclusion
#### As you can see, the best scores came from the models using tfidf for feature generation: all of them tied at a test accuracy of 98.76% and a cross-validation average of 99.11%.

#### I tried to increase the accuracy of the word-counts Random Forest model but could not improve it. I found no parameter settings to change on the feature side, as the word-counts feature generator used here has fewer options than the tfidf method of generating features.
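
If the word-count features were built with scikit-learn's `CountVectorizer` rather than a hand-written bag-of-words function, both the feature step and the classifier could be tuned together. The following is a minimal sketch under that assumption, not the notebook's actual pipeline: the toy `texts` and `labels`, the step names `counts` and `rf`, and the parameter grid are all hypothetical placeholders.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus (hypothetical); replace with the notebook's sentences and labels.
texts = [
    "the whale rose from the sea", "the ship chased the white whale",
    "the harpoon struck the whale", "the sea was calm that morning",
    "the captain paced the deck", "the crew lowered the boats",
    "heaven and hell were at war", "the angel fell from heaven",
    "satan spoke to the fallen host", "the garden was full of light",
    "the serpent tempted eve", "adam walked in the garden",
]
labels = ["moby"] * 6 + ["paradise"] * 6

# Word counts and the classifier live in one pipeline so that each
# cross-validation fold refits the vocabulary on its own training split.
pipeline = Pipeline([
    ("counts", CountVectorizer()),
    ("rf", RandomForestClassifier(random_state=57)),
])

# Assumed search space: two feature-side and two model-side parameters.
param_grid = {
    "counts__ngram_range": [(1, 1), (1, 2)],
    "counts__binary": [False, True],
    "rf__n_estimators": [50, 100],
    "rf__max_depth": [None, 5],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_)
print(f"best CV accuracy = {search.best_score_:.2%}")
```

Whether any of these settings would move the real model by 5 percentage points is not something this sketch can show; it only illustrates where tunable parameters might be found.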

#### All of the models running with tfidf did exceptionally well, while the models using word counts scored lower on both test accuracy and cross-validation accuracy.
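
Every tfidf run above reports the same test score and the same five fold scores, whichever classifier was used. One possible explanation, which the output alone does not confirm, is that the classes are heavily imbalanced and each model predicts the majority class. A quick way to check would be to compare against a majority-class baseline. The sketch below uses made-up labels (495 of one class, 5 of the other) purely to illustrate the check; it does not use the notebook's data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical, heavily imbalanced labels; substitute the real y here.
y = np.array([0] * 495 + [1] * 5)

# The baseline ignores the features, so a placeholder matrix is enough.
X = np.zeros((len(y), 1))

# Always predict the most frequent class seen during training.
baseline = DummyClassifier(strategy="most_frequent")

# Same 5-fold setup as the runs above (stratified for classifiers).
scores = cross_val_score(baseline, X, y, cv=5)

print("baseline fold scores:", scores)
print(f"majority-class baseline CV accuracy = {scores.mean():.2%}")
```

If the real baseline lands near 99%, the tfidf scores would say little about the models themselves, and a metric such as per-class recall or a confusion matrix would be more informative than accuracy.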

###  Summary of result runs

|Model|NLP Feature Generator|Training Score (%)|Test Score (%)|Metrics Accuracy (%)|Cross-Validation Average (%)|Status|
|:----|:--------------------|:-----------------|:-------------|:-------------------|:---------------------------|:-----|
|Random Forests|Word Counts|98.37|86.59|86.59|82.05| |
|Multinomial Naive Bayes|Word Counts|97.55|92.28|92.28|88.58| |
|Bernoulli Naive Bayes|Word Counts|82.34|83.33|83.33|76.09| |
|Random Forests|tfidf|99.76|98.76|98.76|99.11|**Winner** |
|Multinomial Naive Bayes|tfidf|99.11|98.76|98.76|99.11|**Winner** |
|Bernoulli Naive Bayes|tfidf|99.05|98.76|98.76|99.11|**Winner** |
|SVC|tfidf|99.11|98.76|98.76|99.11|**Winner** |
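
The last numeric column of the summary table is the mean of the five fold scores printed in the run output. As a minimal sketch, the averages can be recomputed from those printed values; the fold-to-fold standard deviation is added here as an extra, and the `summarize` helper and dictionary keys are illustrative names, not part of the notebook.

```python
import numpy as np

# Fold scores copied from the run output above.
cv_scores = {
    "Random Forests / word counts": [0.83783784, 0.83783784, 0.7972973, 0.83783784, 0.79166667],
    "Multinomial NB / word counts": [0.90540541, 0.85135135, 0.89189189, 0.90540541, 0.875],
    "Bernoulli NB / word counts": [0.7972973, 0.74324324, 0.74324324, 0.75675676, 0.76388889],
    "tfidf (all classifiers)": [0.99115044, 0.99115044, 0.99115044, 0.99112426, 0.99112426],
}


def summarize(scores):
    """Return (mean, standard deviation) of the fold scores, in percent."""
    arr = np.asarray(scores, dtype=float)
    return arr.mean() * 100, arr.std() * 100


for name, scores in cv_scores.items():
    mean_pct, std_pct = summarize(scores)
    print(f"{name}: {mean_pct:.2f}% +/- {std_pct:.2f}")
```

Reporting the spread alongside the mean makes it easier to judge whether the gap between two models is larger than the variation between folds.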