In [1]:
url = 'https://www.kaggle.com/yelp-dataset/yelp-dataset'

### Supervised Learning Capstone  
#### You're ready to put into practice everything you've learned so far.  

####  First: Go out and find a dataset of interest. It could be from one of our recommended resources, some other aggregation, or scraped yourself. Just make sure it has lots of variables in it, including an outcome of interest to you.  

####  Second: Explore the data. Get to know the data. Spend a lot of time going over its quirks and peccadilloes. You should understand how it was gathered, what's in it, and what the variables look like.  

####  Third: Model your outcome of interest. You should try several different approaches and really work to tune a variety of models before using the model evaluation techniques to choose what you consider to be the best performer. Make sure to think about explanatory versus predictive power and experiment with both.  

####  So, here is the deliverable: Prepare a slide deck and 15 minute presentation that guides viewers through your model. Be sure to cover a few specific things:  

1. A specified research question your model addresses  
1. How you chose your model specification and what alternatives you compared it to  
1. The practical uses of your model for an audience of interest  
1. Any weak points or shortcomings of your model  


This presentation is not a drill. You'll be presenting this slide deck live to a group as the culmination of your work in the last two supervised learning units. As a secondary matter, your slides and / or the Jupyter notebook you use or adapt them into should be worthy of inclusion as examples of your work product when applying to jobs.

Good luck!

#### Specific items to cover:

1.  #### Research question that this model addresses:
    #### Can we predict the number of stars for a Yelp rating from the review provided by the customer?  
    
1.  #### How did I choose my model specifications and against what alternatives did I compare it?
#### This was my performance criterion for this capstone:
    1. #### The model needed to complete fitting, prediction, and scoring within 20 minutes, using a classifier with minimal parameter tuning of either the vectorizer or the model.
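
One way to check the 20-minute criterion is to time the fit, predict, and score steps together. The sketch below is an assumption, not the notebook's own code: `timed_fit_score`, `TIME_BUDGET_SECONDS`, and the toy reviews are hypothetical names and data used only for illustration.

```python
import time

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

TIME_BUDGET_SECONDS = 20 * 60  # hypothetical constant for the 20-minute criterion


def timed_fit_score(pipe, X_train, y_train, X_test, y_test):
    """Fit and score a pipeline; return accuracy, elapsed seconds, and a budget flag."""
    start = time.perf_counter()
    pipe.fit(X_train, y_train)
    accuracy = pipe.score(X_test, y_test)  # score() predicts, then computes accuracy
    elapsed = time.perf_counter() - start
    return accuracy, elapsed, elapsed <= TIME_BUDGET_SECONDS


# Toy data standing in for the Yelp reviews (illustrative only).
train_text = ["great food", "terrible service", "loved it", "awful place"]
train_label = [1, 0, 1, 0]
test_text = ["great place", "terrible food"]
test_label = [1, 0]

pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LogisticRegression())])
accuracy, elapsed, within_budget = timed_fit_score(
    pipe, train_text, train_label, test_text, test_label
)
print("accuracy={:.2%}, elapsed={:.2f}s, within budget={}".format(
    accuracy, elapsed, within_budget))
```

Timing the whole fit-and-score call, rather than each step, matches how the criterion is worded: one budget for the full run.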
    
    
####  I chose a classifier model suitable for sentiment analysis, and chose from these 5 models:  
1. #### Random Forests
2. #### Support Vector Machines
3. #### Bernoulli Naive Bayes
4. #### Multinomial Naive Bayes
5. #### Logistic Regression
    
####    In addition, I chose 2 Vectorizers for the model:
1. #### CountVectorizer
1. #### TfidfVectorizer
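
To make the difference between the two vectorizers concrete, here is a minimal sketch on a three-document toy corpus (the documents are made up for illustration). `CountVectorizer` produces raw term counts, while `TfidfVectorizer` with its default settings reweights those counts and scales each row to unit L2 length.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus, illustrative only.
docs = ["good food good service", "bad food", "good service"]

counts = CountVectorizer().fit_transform(docs)  # sparse matrix of integer counts
tfidf = TfidfVectorizer().fit_transform(docs)   # sparse matrix of TF-IDF weights

# Both vectorizers learn the same vocabulary, so the shapes match.
print(counts.shape, tfidf.shape)

# Row norms of the TF-IDF matrix (the default norm is 'l2').
row_norms = np.sqrt(np.asarray(tfidf.multiply(tfidf).sum(axis=1))).ravel()
print(counts.toarray())
print(row_norms)
```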
    
    
#### My findings on the 5 models attempted above:
1. Random Forests - RFC
With the dataset from Kaggle, RFC was not sufficiently performant with either CountVectorizer or TfidfVectorizer, relative to the timings of the other models.  Several of my test runs did complete, but only after several hours, which was outside my criterion.
       
2. Support Vector Machines - SVC
SVC was another candidate that was not sufficiently performant, and it did not finish within the 20-minute time requirement.  Perhaps this model requires more powerful hardware to complete in less than 20 minutes.  On several test runs, I allowed it to run overnight, and it made minimal progress.  There may be parameters that could speed it up, but this was my first time using SVC for sentiment analysis.
       
3. Naive Bayes - BernoulliNB
After testing several parameter settings and combinations of CountVectorizer and TfidfVectorizer, this model was performant.  It became my third choice, with an accuracy of 75.81% and an elapsed time of 11 minutes.  
       
4. Multinomial Naive Bayes - MultinomialNB
This model was initially my pick to be the most performant, but it came in second, with an accuracy of 85.98% and a run time of 12 minutes.  
       
5. Logistic Regression - LR  
This model was not the fastest, but it had the best accuracy at 91.08% and completed in 17 minutes.  It is my pick for best accuracy within my time requirement.
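
The selection rule described above (stay within the time budget, then prefer the higher accuracy) can be sketched as follows. The accuracy and timing figures are the ones reported in the findings; the variable names are hypothetical.

```python
# (model name, accuracy, elapsed minutes) as reported in the findings above.
results = [
    ("BernoulliNB", 0.7581, 11),
    ("MultinomialNB", 0.8598, 12),
    ("LogisticRegression", 0.9108, 17),
]

TIME_BUDGET_MINUTES = 20  # the capstone's time criterion

# Keep only the models that finished in time, then rank by accuracy.
eligible = [r for r in results if r[2] <= TIME_BUDGET_MINUTES]
ranked = sorted(eligible, key=lambda r: r[1], reverse=True)
best = ranked[0]

for name, accuracy, minutes in ranked:
    print("{:<20} {:.2%}  {} min".format(name, accuracy, minutes))
```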
       
Discoveries in my testing:
    
1.  Iterating through the different models, and varying parameters to find the best fit can take a long, long time.  Perhaps optimizations can help on this.  
2.  Given the numerous parameters on some of the models, finding the right combination of n parameters could use some extra computing resources to shorten the iteration times.
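
One possible way to organize the parameter iteration is scikit-learn's `GridSearchCV`, which cross-validates every combination in a grid and can spread the work across cores through `n_jobs`. This is a sketch under assumptions: the toy reviews, the grid values, and the choice of `MultinomialNB` are illustrative, not the settings used in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data, illustrative only.
texts = [
    "great food", "loved the service", "amazing place", "friendly staff",
    "terrible food", "hated the service", "awful place", "rude staff",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", MultinomialNB())])

# Step name + double underscore + parameter name addresses a pipeline step.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__alpha": [0.1, 1.0],
}

# n_jobs=1 keeps the sketch simple; n_jobs=-1 would use every available core.
search = GridSearchCV(pipe, param_grid, cv=2, scoring="accuracy", n_jobs=1)
search.fit(texts, labels)

print(search.best_params_, search.best_score_)
```

On the full dataset it may be worth running the search on a subsample first, since every grid point is refit once per fold.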
    
    
       
    

### Practical Uses of the Model

#### This model is very valuable in many ways.

#### Once trained, and optimized for performance, which is beyond the scope of this capstone, these are some of the things that could be of great business value:

1.  #### Sentiment analysis in real time, while providing a review, and potentially linking up an unhappy customer with a discounted "next visit" special.
1.  #### Daily sentiment temperature for business owners that use Yelp, to see if the reviews are going up, down, or level for a given business period - daily, weekly, monthly, and reaching out to those customers in some way.
1.  #### Applying this same model to e-mails from customers/vendors/employees to measure sentiment, and to be proactive.
1.  #### With modifications, determine if fake reviews are being submitted, and to stop them.
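
As an illustration of the first use case, the sketch below scores a single incoming review and flags it when the predicted probability of a positive review falls below a threshold. The training reviews, the `flag_unhappy` helper, and the 0.5 threshold are all hypothetical; a production system would use the trained capstone model instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy training data, illustrative only (0 = negative, 1 = positive).
train_text = [
    "terrible food and rude service", "awful slow service",
    "bad cold food", "rude staff terrible experience",
    "great food and friendly service", "amazing fast service",
    "delicious warm food", "friendly staff great experience",
]
train_label = [0, 0, 0, 0, 1, 1, 1, 1]

model = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LogisticRegression())])
model.fit(train_text, train_label)


def flag_unhappy(review, threshold=0.5):
    """Return (flag, probability) where flag is True for a likely unhappy customer."""
    # Column 1 of predict_proba is the probability of class 1 (positive).
    prob_positive = model.predict_proba([review])[0, 1]
    return bool(prob_positive < threshold), prob_positive


flagged, prob = flag_unhappy("terrible rude staff")
print("flagged={}, P(positive)={:.3f}".format(flagged, prob))
```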

### The Dataset

#### The Yelp dataset, which I obtained from Kaggle, includes 5.2 million Yelp user reviews for 174,000 businesses, spanning 11 metropolitan areas around the world.

#### Dataset components:
1. review_id  
An internal id from Yelp
1. user_id  
An internal user id from Yelp.
1. business_id  
An internal business id from Yelp.
1. stars  
The number of stars that a review was given by a reviewer.  1 being the worst, and 5 being the best.
1. useful
This is another rating that a person reading the review can provide as a rating of usefulness.
1. funny  
This is another level of review that a person reading the review can provide as a rating of funny.
1. cool  
This is another level of review that a person reading the review can provide as a rating of cool.
1. text  
This is a free form text review field, that can be up to 5,000 characters.
1. date  
Date of review.
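
The review file stores one JSON object per line, so each line can be parsed on its own and collected into a DataFrame with the columns listed above. The two sample records below are invented for illustration; only the field names come from the dataset.

```python
import io
import json

import pandas as pd

column_names = ['review_id', 'user_id', 'business_id', 'stars',
                'useful', 'funny', 'cool', 'text', 'date']

# Two made-up records in the same one-object-per-line layout as the data file.
sample_lines = io.StringIO(
    '{"review_id": "r1", "user_id": "u1", "business_id": "b1", "stars": 5.0, '
    '"useful": 1, "funny": 0, "cool": 0, "text": "Great tacos.", '
    '"date": "2018-01-01 12:00:00"}\n'
    '{"review_id": "r2", "user_id": "u2", "business_id": "b1", "stars": 1.0, '
    '"useful": 0, "funny": 2, "cool": 0, "text": "Cold food and slow service.", '
    '"date": "2017-06-15 08:30:00"}\n'
)

# Parse each line separately, then build the frame with a fixed column order.
records = [json.loads(line) for line in sample_lines]
reviews = pd.DataFrame(records, columns=column_names)
print(reviews)
```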

### Questions that need answering:

 1. What question are you trying to solve or prove in this model?   
 __We are correlating the sentiment of the review with the star rating that the customer provided.  We want to predict the star rating from the customer review.__
 1. What kind of data do you have? -> describe the source.  
 __The dataset includes 5.2 million Yelp user reviews for 174,000 businesses, spanning 11 metropolitan areas.__
 1. What's missing from the data and how do you deal with it?  
 __There really is nothing missing from the data.__
 1. Where are the outliers and why should we pay attention to them?  
 __There were no outliers to mention in the dataset.  Since this is sentiment data, we don't have outliers.__
 1. How can you add, change, or remove features to get more out of your data?  
 __I added message_length, derived from the length of the message.  Further options include feature importances, PCA, dropping low-variance features, dropping one of each highly correlated feature pair, and splitting the timestamp into month, day, and year columns.__
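
The missing-value check and the feature ideas mentioned in the answers above can be sketched on a tiny frame. The two rows are invented; `message_length` mirrors the feature added in this notebook, while the `year`, `month`, and `day` columns are the suggested timestamp split and are not computed elsewhere in the notebook.

```python
import pandas as pd

# Made-up rows using the notebook's column names.
reviews = pd.DataFrame({
    "message": ["Great tacos.", "Cold food and slow service."],
    "date": ["2018-01-01 12:00:00", "2017-06-15 08:30:00"],
})

# Count missing values per column before engineering features.
missing_per_column = reviews.isnull().sum()

# Character length of each review.
reviews["message_length"] = reviews["message"].str.len()

# Split the timestamp into calendar parts.
parsed = pd.to_datetime(reviews["date"])
reviews["year"] = parsed.dt.year
reviews["month"] = parsed.dt.month
reviews["day"] = parsed.dt.day

print(missing_per_column)
print(reviews[["message_length", "year", "month", "day"]])
```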

In [2]:
# Constants
random_state           = 57
train_size             = 0.90
test_size              = 0.10
max_data               = 0.149568495  # float: fraction of the file to read (0.149568495 of 6,685,900 lines is about 1M reviews); int: number of lines to read
rfc_test_size          = 50000
rfc_train_size         = 5000
run_CountVectorizer    = True
run_TfidfVectorizer    = True
BegTimeStampNewlines   = 3
EndTimeStampNewlines   = 3
EndTimeStamp           = '\n'*EndTimeStampNewlines+'End'
BegTimeStamp           = 'Begin'+'\n'*BegTimeStampNewlines

# Regression/Classification control
Regression = False 

print("Regression = {}".format(Regression))

Regression = False


In [3]:
# Controls
flag_to_run_rf = False
flag_to_plot_them = False
flag_to_run_correlation_matrix = False
flag_to_run_features_importance = False
flag_to_run_gradient_boosting  = False
flag_to_run_linear_regression  = False
flag_to_run_logistic_regression = False
flag_to_run_lasso_regression = False
flag_to_run_ridge_regression = False
flag_to_run_svc = False
flag_to_run_vectorizer_nb = False
flag_to_run_sentiment_analyzer = True

In [4]:
import pandas as pd
import numpy as np
import scipy
import matplotlib.pyplot as plt

%matplotlib inline

import chardet
import datetime
from sklearn import ensemble
from sklearn import datasets
from sklearn import metrics
from sklearn import linear_model
from sklearn.utils import shuffle
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import json
import time, sys
import seaborn as sns
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn import linear_model
from sklearn.model_selection import cross_val_predict, GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import MinMaxScaler
from IPython.display import Markdown, display
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline
from IPython.display import HTML, display

import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
# from tqdm import tqdm

In [5]:
# pandas display preferences.
pd.options.display.float_format = '{:.3f}'.format
pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
pd.set_option('display.max_colwidth', 200)

In [6]:
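def file_length(full_file_name):
    # Stream the file so the multi-million-line dataset is never held in memory,
    # and let the context manager close the handle.
    with open(full_file_name, encoding='UTF-8') as f:
        return sum(1 for _ in f)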

In [7]:
def write_to_logfile(message, mdformat=''):
    # Append one entry to the Markdown results log.
    with open('TestResults.md', 'a+') as the_file:
        the_file.write('{} {}'.format(mdformat, message))

In [8]:
def print_timestamp(displaytext):    
    import sys
    import datetime
    datetime_now = str(datetime.datetime.now())
    printFormatted("{:19.19}: In: {} {} ".format(datetime_now, sys._getframe(1).f_code.co_name, displaytext))

In [9]:
def file_stuff():

    global df
    
    file = 'yelp_academic_dataset_review.json'
    path = "../../../../Datafiles/yelp/"
    column_names = []

    # Count the lines once; scanning the full file takes a while.
    print_timestamp("Log start get filecount")
    file_len = file_length(path+file)
    print_timestamp("There are {:,} lines in data file {}.".format(file_len, file))

    line = None
    data = []
    cnt = 0
    if isinstance(max_data, float):
        max_data2 = max_data * file_len # it's a factor of the full file size
    elif isinstance(max_data, int):
        max_data2 = max_data # it's the number of records in the file

    # Read one JSON record per line until the max_data2 limit is reached.
    with open(path+file, 'r', encoding='UTF-8') as f:
        for line in f:
            if cnt >= max_data2:
                break
            data.append(json.loads(line))
            cnt += 1
    print_timestamp("Log end read datafile {}".format(file))

    column_names = ['review_id', 'user_id', 'business_id', 'stars', 'useful','funny','cool','text', 'date']
    df = pd.DataFrame(data, columns = column_names)
    print_timestamp("there are {:,} entries in  {}".format(len(df), 'data frame df'))
    print_timestamp("columns={}".format(df.columns))

    # data Cleanup

    df['sentiment_label'] = df.stars.map({4.0:1, 5.0: 1, 3.0:0, 2.0: 0, 1.0:0}) # Convert the star ratings into positive (1) or negative (0) sentiment
    df = df.rename({'text': 'message'}, axis=1)
    df['message_length'] = df['message'].str.len()
    print_timestamp("columns are {}".format(df.columns))

    print_timestamp('we have cleaned up the dataframe.')
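
The label mapping used in `file_stuff` can be illustrated on a handful of made-up ratings. The sketch also computes the class balance, which gives a baseline for reading accuracy scores: a model that always predicts the majority class already reaches that share. The threshold form `stars >= 4` is an equivalent alternative shown for comparison, not the notebook's code.

```python
import pandas as pd

# Made-up star ratings, illustrative only.
stars = pd.Series([5.0, 4.0, 3.0, 2.0, 1.0, 5.0])

# Same mapping as the notebook: 4-5 stars -> 1 (positive), 1-3 stars -> 0 (negative).
mapped = stars.map({4.0: 1, 5.0: 1, 3.0: 0, 2.0: 0, 1.0: 0})

# Equivalent threshold form.
thresholded = (stars >= 4).astype(int)

# Share of each class; the larger share is the majority-class baseline accuracy.
balance = mapped.value_counts(normalize=True)
print(balance)
```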

In [10]:
def printFormatted(string):
    newline = '\n'
    display(Markdown(string))
    write_to_logfile(string+newline)

In [11]:
def make_X_and_Y():
        
    global X, y
    
    # Log the start time.
    print_timestamp("the current time of start is {}".format(str(datetime.datetime.now())))

#     # Make the categorical variables below into enumerated categorical variables
#     for dummy_column in []: # Remove these for now...
#         df = pd.concat([df, pd.get_dummies(df[dummy_column])], axis=1)
#     df.columns = df.columns.str.replace(' ', '')

    # Define outcome and predictors.

    y = df['sentiment_label']
    X = df['message']

In [12]:
def plot_them():
    for column in X_train.columns:
#         plt.hist(X_train[column]*100, bins=40)
        plt.scatter(y_train, X_train[column]*100)
        plt.xlabel(column)
        plt.show()

In [13]:
def training_test_set():
    
    global X_train, X_test, y_train, y_test
    printFormatted("test_size={}, and train_size={}".format(test_size,train_size))
    # Let's fit it with the RFC training set
    #  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, train_size=train_size, random_state=0)
    printFormatted("train_size = {}, X_train is {}, and y_train is {}".format(train_size, len(X_train), len(y_train)))
    printFormatted("test_size  = {}, X_test  is {}, and y_test is {}".format(test_size, len(X_test), len(y_test)))
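
The split above does not stratify on the label. As a possible variation (an assumption, not what this notebook ran), passing `stratify` to `train_test_split` keeps the positive/negative ratio the same in both sets. The toy data and variable names below are hypothetical.

```python
from sklearn.model_selection import train_test_split

# Toy data: 70 positive and 30 negative labels, illustrative only.
texts = ["review {}".format(i) for i in range(100)]
labels = [1] * 70 + [0] * 30

text_train, text_test, label_train, label_test = train_test_split(
    texts, labels,
    test_size=0.10, train_size=0.90,
    random_state=0,
    stratify=labels,  # preserve the 70/30 class ratio in both splits
)

print(len(text_train), len(text_test), sum(label_test))
```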

In [14]:
def plot_them_others(df,columnlist):
    
    print_timestamp(BegTimeStamp)
    
    print_timestamp("We will plot these independent variables {}".format(columnlist))
    for column in columnlist:
        plt.hist(df[column], bins=90)
        plt.xlabel(column)
        plt.ylabel('count')
        plt.show()
#     dfcolumn.plot.hist()

    print_timestamp(EndTimeStamp)

In [15]:
def run_correlation_matrix():
    
    print_timestamp(BegTimeStamp)
    
    # Setup the correlation matrix.
    corrmat = X.corr()
    print(corrmat)

    # Set up the subplots
    f, ax = plt.subplots(figsize=(12, 9))

    # Let's draw the heatmap using seaborn.
    sns.heatmap(corrmat, vmax=.6, square=True)
    plt.show()
    
    print_timestamp(EndTimeStamp)

Let's try predicting sentiment with TF-IDF text classification pipelines

In [None]:
def sentiment_analyzer(path, parameters, classifier, tfidf_parms):
    # path A = the old path
    # path B = the new path, no CountVectorizer at all
    
    
# run block of code and catch warnings
  
    print_timestamp(BegTimeStamp+" running with path={}".format(path))
   
#     tfidf_parms = {
#         'strip_accents': 'unicode',
#         'stop_words'   : 'english',
#         'ngram_range'  : (2,2),
#         'max_df'       : 3,
#         'min_df'       : 2
#     }
    
    global vectorized
    vectorized = True
    
    pipeline_array = []
   
    if path == "A":
        if classifier == 'bnb':
            pipeline_array.append(Pipeline([
#                 ('vect',  CountVectorizer()),
#                 ('tfidf', TfidfTransformer()),
                ('tfidf', TfidfVectorizer(**tfidf_parms)),
                ('clf',   BernoulliNB(**parameters))
            ]))
        elif classifier == 'svc':
            pipeline_array.append(Pipeline([
    #             ('vect',  CountVectorizer()),
    #             ('tfidf', TfidfTransformer()),
                ('tfidf', TfidfVectorizer(**tfidf_parms)),
                ('clf',   SVC(kernel = 'linear', **parameters))
            ])) 
        elif classifier == 'mlb':
            pipeline_array.append(Pipeline([
#                 ('vect',  CountVectorizer()),
#                 ('tfidf', TfidfTransformer()),
                ('tfidf', TfidfVectorizer(**tfidf_parms)),
                ('clf',   MultinomialNB(**parameters))
            ]))
        elif classifier == 'logit':
            pipeline_array.append(Pipeline([
                ('tfidf', TfidfVectorizer(**tfidf_parms)),
#                 ('vect',  CountVectorizer()),
#                 ('tfidf', TfidfTransformer()),
                ('clf',   LogisticRegression(**parameters))
            ]))
        elif classifier == 'rfc':
            pipeline_array.append(Pipeline([
                ('vect',  CountVectorizer()),
                ('tfidf', TfidfTransformer()),
#                 ('tfidf', TfidfVectorizer(**tfidf_parms)),
                ('clf',   ensemble.RandomForestClassifier(**parameters))
            ]))  
            
    elif path == "B":
        if classifier == 'bnb':
            pipeline_array.append(Pipeline([
                ('tfidf', TfidfVectorizer(**tfidf_parms)),
                ('clf',   BernoulliNB(**parameters))
            ]))
        elif classifier == 'svc':
            pipeline_array.append(Pipeline([
                ('tfidf', TfidfVectorizer(**tfidf_parms)),
                ('clf',   SVC(kernel = 'linear', **parameters))
            ])) 
        elif classifier == 'mlb':
            pipeline_array.append(Pipeline([
                ('tfidf', TfidfVectorizer(**tfidf_parms)),
                ('clf',   MultinomialNB(**parameters))
            ]))
        elif classifier == 'logit':
            pipeline_array.append(Pipeline([
                ('tfidf', TfidfVectorizer(**tfidf_parms)),
                ('clf',   LogisticRegression(**parameters))
            ]))
        elif classifier == 'rfc':
            pipeline_array.append(Pipeline([
                ('tfidf', TfidfVectorizer(**tfidf_parms)),
                ('clf',   ensemble.RandomForestClassifier(**parameters))
            ]))

    for pipe in pipeline_array:
        # Only the path "A" random-forest pipeline has a separate 'vect' step.
        if 'vect' in pipe.named_steps:
            vect_name = "vect = {}, ".format(str(pipe.named_steps['vect']).split('(')[0])
        else:
            vect_name = ''
            
        classifier_name_list=str(pipe.named_steps['clf']).split('(')
        classifier_name=classifier_name_list[0]
        tfidf_name_list = str(pipe.named_steps['tfidf']).split('(')
        if len(tfidf_name_list) > 0:
            tfidf_name = tfidf_name_list[0]
        else:
            tfidf_name = ''

        printFormatted("####  Now running pipeline with: {} tfidf={} and clf={}\nparameters={}:".format(vect_name,
                                                                                                        tfidf_name,
                                                                                                        classifier_name,
                                                                                                        parameters))

        pipe.fit(X_train, y_train)

        y_pred_class = pipe.predict(X_test)

        metrics_test_score = metrics.accuracy_score(y_test, y_pred_class)
        printFormatted('###  Metrics accuracy score = {:.2%} with {}'.format(metrics_test_score, classifier_name))

        printFormatted("Steps information: {}".format(pipe.steps))
        print_timestamp("Finished running pipeline with:\n{}: ".format(classifier_name))
            
    print_timestamp(EndTimeStamp)
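
The long `if`/`elif` chain in `sentiment_analyzer` builds nearly identical pipelines. One possible simplification, sketched here as an assumption rather than the notebook's method, is a lookup table from the classifier key to its constructor. `CLASSIFIERS` and `build_pipeline` are hypothetical names; the keys match the ones used in the notebook, and every pipeline uses `TfidfVectorizer` as in path "B".

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Map each classifier key to a callable that builds the estimator.
CLASSIFIERS = {
    "bnb": BernoulliNB,
    "mlb": MultinomialNB,
    "logit": LogisticRegression,
    "rfc": RandomForestClassifier,
    "svc": lambda **kwargs: SVC(kernel="linear", **kwargs),
}


def build_pipeline(classifier, parameters, tfidf_parms):
    """Return a TF-IDF + classifier pipeline for the given key."""
    make_clf = CLASSIFIERS[classifier]  # raises KeyError for an unknown key
    return Pipeline([
        ("tfidf", TfidfVectorizer(**tfidf_parms)),
        ("clf", make_clf(**parameters)),
    ])


pipe = build_pipeline("mlb", {}, {"stop_words": "english"})
print(pipe.named_steps["clf"])
```

Adding a sixth model then means adding one dictionary entry instead of two more `elif` branches.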

In [None]:
def run_it():
    
    print_timestamp(BegTimeStamp)
   
    if Regression == True:
        print_timestamp("We are running with a Regression model")
    elif Regression == False:
        print_timestamp("We are running with a Classifier model")
    else:
        print_timestamp("We have failed to set the Regression variable")
        sys.exit("Regression must be set to True or False")
        
    if flag_to_plot_them == True:
        print("we went to flag_to_plot_them")
#         for colnames in ['stars','useful','funny','cool','message_length']:
#             plot_them_others(df[colnames])
        columns_to_plot = ['funny','cool','useful','message_length','sentiment_label']
        plot_them_others(df, columns_to_plot)
        
#         plot_them()

    if flag_to_run_features_importance == True:
        
        number_of_features_to_consider = 50
        params = {'n_estimators': 100}

        if Regression == True:
            print_timestamp('We are running RandomForestRegressor')
            rf = ensemble.RandomForestRegressor(**params)
            
        else:
            print_timestamp('We are running RandomForestClassifier')
            rf = ensemble.RandomForestClassifier(**params)

        run_features_importance(rf, number_of_features_to_consider)

    if flag_to_run_correlation_matrix == True:
        run_correlation_matrix()

    if flag_to_run_rf == True:
        #     params = {}
        params = {'n_estimators': 100} 

        if Regression == True:
            rf = ensemble.RandomForestRegressor(**params)
            print_timestamp('We are running RandomForestRegressor')
        else:
            rf = ensemble.RandomForestClassifier(**params)
            print_timestamp('We are running RandomForestClassifier')

        run_rf(rf)

    if flag_to_run_gradient_boosting  == True:
        run_gradient_boosting()

    if flag_to_run_linear_regression  == True:
        run_linear_regression()

    if flag_to_run_logistic_regression == True:
        run_logistic_regression()

    if flag_to_run_svc == True:
        run_svc() 

    if flag_to_run_ridge_regression == True:
        run_ridge_regression()
        
    if flag_to_run_sentiment_analyzer == True:
        path = "B"
        tfidf_parms = {
#       original parameters
#         'strip_accents': 'unicode', unused
#             'stop_words'   : 'english',
#             'ngram_range'  : (2,2),
#             'max_df'       : .5,
#             'min_df'       : 15,
#             'max_features' : 50000
            
            'stop_words'   : 'english',
            'ngram_range'  : (1,3),
            'max_df'       : .8,
            'min_df'       : .1,
            'max_features' : 50000
        }
                
        printFormatted("tfidf_parms={}".format(tfidf_parms))
        for path in ['A']:
#             for vectorizer_iterator in ['bnb', 'mlb', 'logit','rfc', 'svc']:  # removed 'rfc' -> too slow with large datasets
            for vectorizer_iterator in ['logit', 'mlb', 'bnb']:
                if vectorizer_iterator == 'rfc':
                    parameters = {'n_estimators': 100}  # same setting as the flag_to_run_rf branch above
                    sentiment_analyzer(path, parameters, vectorizer_iterator, tfidf_parms)
                elif vectorizer_iterator == 'bnb':
                    parameters = {}
                    sentiment_analyzer(path, parameters, vectorizer_iterator, tfidf_parms)
                elif vectorizer_iterator == 'mlb':
                    parameters = {}
                    sentiment_analyzer(path, parameters, vectorizer_iterator, tfidf_parms)
                elif vectorizer_iterator == 'logit': # newton-cg took too long. sag and saga about the same as lbfgs.
                    parameters = {'C' :1e20, 'solver': 'lbfgs', 'max_iter': 100} # not that good
                    sentiment_analyzer(path, parameters, vectorizer_iterator, tfidf_parms)
                elif vectorizer_iterator == 'svc':
                    parameters = {}
                    sentiment_analyzer(path, parameters, vectorizer_iterator, tfidf_parms)

    print_timestamp(EndTimeStamp)
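
A note on the `tfidf_parms` above: in scikit-learn, an integer `min_df` is an absolute document count, while a float `min_df` is a proportion of documents. With `min_df` set to `.1`, a term must therefore appear in at least 10% of the training reviews to be kept, which is far stricter than the earlier commented-out value of 15. The sketch below shows the effect on a made-up ten-document corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Ten toy documents: 'food' is in all ten, 'good' in six, every other token in one.
docs = (["good food unique{}".format(i) for i in range(6)]
        + ["food only{}".format(i) for i in range(4)])

# Integer min_df: keep terms that appear in at least 1 document (everything).
all_terms = TfidfVectorizer(min_df=1).fit(docs)

# Float min_df: keep terms that appear in at least 50% of the documents.
frequent = TfidfVectorizer(min_df=0.5).fit(docs)

print(len(all_terms.vocabulary_), sorted(frequent.vocabulary_))
```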

In [None]:
def main(entry_point):
    if entry_point == 0:
        print_timestamp("Starting main()")
        file_stuff()

        make_X_and_Y()

#         X_describe()
#         X_dtypes()
#         df_sample(10)
#         X_isnull()

        training_test_set()

    run_it()
    
    print_timestamp("Ending main()")

In [None]:
main(0)

2019-06-24 22:16:29: In: main Starting main() 

2019-06-24 22:16:29: In: file_stuff Log start get filecount 

2019-06-24 22:17:01: In: file_stuff There are 6,685,900 lines in data file yelp_academic_dataset_review.json. 

2019-06-24 22:17:09: In: file_stuff Log end read datafile yelp_academic_dataset_review.json 

2019-06-24 22:17:11: In: file_stuff there are 1,000,001 entries in  data frame df 

2019-06-24 22:17:11: In: file_stuff columns=Index(['review_id', 'user_id', 'business_id', 'stars', 'useful', 'funny', 'cool', 'text', 'date'], dtype='object') 

2019-06-24 22:17:12: In: file_stuff columns are Index(['review_id', 'user_id', 'business_id', 'stars', 'useful', 'funny', 'cool', 'message', 'date', 'sentiment_label', 'message_length'], dtype='object') 

2019-06-24 22:17:12: In: file_stuff we have cleaned up the dataframe. 

2019-06-24 22:17:12: In: make_X_and_Y the current time of start is 2019-06-24 22:17:12.605287 

test_size=0.1, and train_size=0.9

train_size = 0.9, X_train is 900000, and y_train is 900000

test_size  = 0.1, X_test  is 100001, and y_test is 100001

2019-06-24 22:17:12: In: run_it Begin


 

2019-06-24 22:17:12: In: run_it We are running with a Classifier model 

tfidf_parms={'stop_words': 'english', 'ngram_range': (1, 3), 'max_df': 0.8, 'min_df': 0.1, 'max_features': 50000}

2019-06-24 22:17:12: In: sentiment_analyzer Begin


 running with path=A 

####  Now running pipeline with:  tfidf=TfidfVectorizer and clf=LogisticRegression
parameters={'C': 1e+20, 'solver': 'lbfgs', 'max_iter': 100}:

###  Metrics accuracy score = 76.24% with LogisticRegression

Steps information: [('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.float64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.8, max_features=50000, min_df=0.1,
        ngram_range=(1, 3), norm='l2', preprocessor=None, smooth_idf=True,
        stop_words='english', strip_accents=None, sublinear_tf=False,
        token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
        vocabulary=None)), ('clf', LogisticRegression(C=1e+20, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False))]

2019-06-24 22:26:05: In: sentiment_analyzer Finished running pipeline with:
LogisticRegression:  

2019-06-24 22:26:05: In: sentiment_analyzer 


End 

2019-06-24 22:26:16: In: sentiment_analyzer Begin


 running with path=A 

####  Now running pipeline with:  tfidf=TfidfVectorizer and clf=MultinomialNB
parameters={}:

In [None]:
print_timestamp(BegTimeStamp)
tfidf_parms = {
#         'strip_accents': 'unicode',
        'stop_words'   : 'english',
        'ngram_range'  : (2,2),
        'max_df'       : .5,
        'min_df'       : 20,
        'max_features' : 50000
    }

Tf = TfidfVectorizer(**tfidf_parms)
tfifx = Tf.fit_transform(X_train[:100000])

# Tf.vocabulary_
# len(Tf.vocabulary_)
# tfifx.shape
print_timestamp(EndTimeStamp+"Here is the shape of Tf " + str(tfifx.shape))

In [None]:
# Tf.vocabulary_
tfifx.shape
# Tf.vocabulary_

In [None]:
df.columns

In [None]:
df.columns.tolist()

In [None]:
df.describe()

In [None]:
def X_describe():
    return X.describe()

In [None]:
def X_dtypes():
    return X.dtypes

In [None]:
def df_sample(n):
    return df.sample(n)

In [None]:
def X_isnull():
    return X.isnull().sum()