<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Reading-Data" data-toc-modified-id="Reading-Data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Reading Data</a></span></li><li><span><a href="#Data-pre-processing" data-toc-modified-id="Data-pre-processing-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Data pre-processing</a></span></li><li><span><a href="#Splitting-data-75-25(train-test)" data-toc-modified-id="Splitting-data-75-25(train-test)-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Splitting data 75-25(train-test)</a></span></li><li><span><a href="#Hyperparameter-Tuning" data-toc-modified-id="Hyperparameter-Tuning-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Hyperparameter Tuning</a></span><ul class="toc-item"><li><span><a href="#SGD-model-Hyperparameter-tuning" data-toc-modified-id="SGD-model-Hyperparameter-tuning-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>SGD model Hyperparameter tuning</a></span></li><li><span><a href="#SVC-model" data-toc-modified-id="SVC-model-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>SVC model</a></span></li><li><span><a href="#Random-Forest-model" data-toc-modified-id="Random-Forest-model-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>Random Forest model</a></span></li></ul></li></ul></div>

## Reading Data

In [103]:
import pandas as pd
dataset = pd.read_csv('cleaned_mar_eng1.csv', delimiter=',')

In [104]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4977 entries, 0 to 4976
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   TEXT       4977 non-null   object
 1   Sentiment  4977 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 77.9+ KB


In [105]:
dataset.columns

Index(['TEXT', 'Sentiment'], dtype='object')

In [106]:
dataset.Sentiment.value_counts()

3    2404
2    1712
1     861
Name: Sentiment, dtype: int64
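
The class counts above are uneven, so it helps to know what accuracy a trivial model would reach before reading the scores further down. The following is a minimal sketch that assumes nothing beyond the three counts printed by `value_counts()`; the variable names are illustrative and not part of the original notebook.

```python
# Class counts copied from the value_counts() output above.
class_counts = {3: 2404, 2: 1712, 1: 861}

total = sum(class_counts.values())
majority_label = max(class_counts, key=class_counts.get)

# Accuracy of a model that always predicts the most frequent class.
baseline_accuracy = class_counts[majority_label] / total

print(f"total rows: {total}")
print(f"majority class: {majority_label}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.2%}")
```

Any classifier in this notebook should be judged against this baseline (computed on the full dataset) rather than against 0%.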

## Data pre-processing

In [107]:
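from nltk.tokenize import RegexpTokenizer
from sklearn.feature_extraction.text import CountVectorizer

# Keep only runs of ASCII letters and digits as tokens
token = RegexpTokenizer(r'[a-zA-Z0-9]+')
cv = CountVectorizer(stop_words='english', ngram_range=(1, 1), tokenizer=token.tokenize)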
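
To see what the token pattern keeps and drops, the sketch below applies the same regular expression with the standard `re` module. It is a stand-in for illustration, not the NLTK tokenizer itself; the sample sentence is made up, and the explicit `lower()` call mirrors `CountVectorizer`'s default `lowercase=True`.

```python
import re

# Same pattern as the RegexpTokenizer above: runs of ASCII letters and digits.
TOKEN_PATTERN = re.compile(r"[a-zA-Z0-9]+")

def tokenize(text):
    """Return every maximal run of ASCII letters and digits in text."""
    return TOKEN_PATTERN.findall(text)

# Hypothetical sample sentence, used only for illustration.
sample = "Great phone, 10/10 -- would buy again!"
tokens = tokenize(sample.lower())
print(tokens)
```

Punctuation acts as a separator, and any character outside `a-z`, `A-Z`, `0-9` (including non-Latin scripts) is discarded, which matters if the text column mixes languages.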

In [108]:
#DC-FEM
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score, train_test_split 


In [109]:
text_counts = cv.fit_transform(dataset.TEXT)

## Splitting data 75-25(train-test)

In [110]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(text_counts, dataset['Sentiment'], test_size = 0.25, random_state = 5)

In [111]:
MNB = MultinomialNB()
MNB.fit(X_train, Y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [112]:
from sklearn import metrics
predicted = MNB.predict(X_test)

accuracy_score = metrics.accuracy_score(predicted, Y_test)

In [113]:
print(str('{:04.2f}'.format(accuracy_score*100))+'%')

62.33%


In [114]:
# Bigrams only: ngram_range=(2, 2)
cv = CountVectorizer(stop_words='english', ngram_range = (2,2), tokenizer = token.tokenize)
text_counts = cv.fit_transform(dataset['TEXT'])

X_train, X_test, Y_train, Y_test = train_test_split(text_counts, dataset['Sentiment'], test_size = 0.25, random_state = 5)

MNB = MultinomialNB()
MNB.fit(X_train, Y_train)

accuracy_score = metrics.accuracy_score(MNB.predict(X_test), Y_test)
print(str('{:04.2f}'.format(accuracy_score*100))+'%')


53.01%
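
With `ngram_range=(2, 2)` the vectorizer emits only pairs of adjacent tokens and no single words. The sketch below shows the difference on a made-up token list; `ngrams` is a hypothetical helper written for this illustration, not a scikit-learn function.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token sequences, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical token list, used only for illustration.
tokens = ["battery", "life", "really", "good"]

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)

print("unigrams:", unigrams)
print("bigrams: ", bigrams)
```

Each bigram is a rarer event than either of its words, so on a corpus of about 5,000 rows many bigrams seen at test time will never have appeared in training. That is one plausible reason for the lower accuracy of the bigram-only run.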


In [115]:
# Unigrams only: ngram_range=(1, 1), the same setting as the first run
cv = CountVectorizer(stop_words='english', ngram_range = (1,1), tokenizer = token.tokenize)
text_counts = cv.fit_transform(dataset['TEXT'])

X_train, X_test, Y_train, Y_test = train_test_split(text_counts, dataset['Sentiment'], test_size = 0.25, random_state = 5)

MNB = MultinomialNB()
MNB.fit(X_train, Y_train)

accuracy_score = metrics.accuracy_score(MNB.predict(X_test), Y_test)
print(str('{:04.2f}'.format(accuracy_score*100))+'%')

62.33%


In [116]:
# Unigram counts gave the better MultinomialNB accuracy (62.33% vs 53.01% for bigrams), so they are used from here on
cv = CountVectorizer(stop_words='english', ngram_range = (1,1), tokenizer = token.tokenize)
text_counts = cv.fit_transform(dataset['TEXT'])

X_train, X_test, Y_train, Y_test = train_test_split(text_counts, dataset['Sentiment'], test_size = 0.25, random_state = 5)

# Import ComplementNB, a Naive Bayes variant suited to imbalanced classes
from sklearn.naive_bayes import ComplementNB

#fitting the model
CNB = ComplementNB()
CNB.fit(X_train, Y_train)

#evaluating the model
accuracy_score = metrics.accuracy_score(CNB.predict(X_test), Y_test)

print(str('{:4.2f}'.format(accuracy_score*100))+'%')

65.54%


In [117]:
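from sklearn.naive_bayes import GaussianNB
GNB = GaussianNB()
# GaussianNB does not accept sparse input, so convert the count matrices to dense arrays
GNB.fit(X_train.toarray(), Y_train)
# Score the GaussianNB model itself. The figure logged below (65.54%) came from a run that
# called CNB.predict here, so it repeats the ComplementNB score from the previous cell.
accuracy_score = metrics.accuracy_score(GNB.predict(X_test.toarray()), Y_test)

print('GNB accuracy = ' + str('{:4.2f}'.format(accuracy_score*100))+'%')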

GNB accuracy = 65.54%


In [118]:
from sklearn.naive_bayes import BernoulliNB
BNB = BernoulliNB()
BNB.fit(X_train, Y_train)
accuracy_score_bnb = metrics.accuracy_score(BNB.predict(X_test),Y_test)
print('BNB accuracy = ' + str('{:4.2f}'.format(accuracy_score_bnb*100))+'%')

BNB accuracy = 53.57%


In [119]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
text_count_2 = tfidf.fit_transform(dataset['TEXT'])

#splitting data in test and training

x_train, x_test, y_train, y_test = train_test_split(text_count_2, dataset['Sentiment'], test_size = 0.25, random_state = 5)

# Refit the Naive Bayes models defined above on the TF-IDF features
MNB.fit(x_train, y_train)
accuracy_score_mnb = metrics.accuracy_score(MNB.predict(x_test), y_test)
print('accuracy_score_mnb = '+str('{:4.2f}'.format(accuracy_score_mnb*100))+'%')

BNB.fit(x_train, y_train)
accuracy_score_bnb = metrics.accuracy_score(BNB.predict(x_test), y_test)
print('accuracy_score_bnb = '+str('{:4.2f}'.format(accuracy_score_bnb*100))+'%')

CNB.fit(x_train, y_train)
accuracy_score_cnb = metrics.accuracy_score(CNB.predict(x_test), y_test)
print('accuracy_score_cnb = '+str('{:4.2f}'.format(accuracy_score_cnb*100))+'%')

GNB.fit(x_train.toarray(), y_train)
accuracy_score_gnb = metrics.accuracy_score(GNB.predict(x_test.toarray()), y_test)
print('accuracy_score_gnb = '+str('{:4.2f}'.format(accuracy_score_gnb*100))+'%')


accuracy_score_mnb = 58.15%
accuracy_score_bnb = 53.98%
accuracy_score_cnb = 64.18%
accuracy_score_gnb = 58.07%


In [120]:
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import LinearSVC
SGDC = SGDClassifier()
LSVC = LinearSVC()

#on TF-IDF data
#LSVC.fit(x_train, y_train)
#accuracy_score_lsvc = metrics.accuracy_score(LSVC.predict(x_test), y_test)
#print('accuracy_score_lsvc = '+str('{:4.2f}'.format(accuracy_score_lsvc*100))+'%')

SGDC.fit(x_train, y_train)
accuracy_score_sgdc = metrics.accuracy_score(SGDC.predict(x_test), y_test)
print('accuracy_score_sgdc = '+str('{:4.2f}'.format(accuracy_score_sgdc*100))+'%')

#on countvectorize data
LSVC.fit(X_train, Y_train)
accuracy_score_lsvc_CV = metrics.accuracy_score(LSVC.predict(X_test), Y_test)
print('accuracy_score_lsvc_cv = '+str('{:4.2f}'.format(accuracy_score_lsvc_CV*100))+'%')

#SGDC.fit(X_train, Y_train)
#accuracy_score_sgdc_CV = metrics.accuracy_score(SGDC.predict(X_test), Y_test)
#print('accuracy_score_sgdc_cv = '+str('{:4.2f}'.format(accuracy_score_sgdc_CV*100))+'%')

accuracy_score_sgdc = 69.64%
accuracy_score_lsvc_cv = 68.19%


In [121]:
from sklearn.metrics import classification_report,accuracy_score,confusion_matrix
pred = LSVC.predict(X_test)
print(classification_report(Y_test, pred))
print()
print("confusion matrix : \n", confusion_matrix(Y_test, pred) )
print()
print("accuracy: \n", accuracy_score(Y_test, pred))

              precision    recall  f1-score   support

           1       0.66      0.43      0.52       213
           2       0.58      0.73      0.65       425
           3       0.78      0.74      0.76       607

    accuracy                           0.68      1245
   macro avg       0.67      0.63      0.64      1245
weighted avg       0.69      0.68      0.68      1245


confusion matrix : 
 [[ 91  85  37]
 [ 25 309  91]
 [ 21 137 449]]

accuracy: 
 0.6819277108433734
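
The figures in the classification report can be recomputed by hand from the confusion matrix, which makes it clearer where each class loses points. The sketch below uses only the matrix printed above (rows are true labels 1, 2, 3; columns are predicted labels); the function name is illustrative.

```python
# Confusion matrix copied from the LinearSVC output above
# (rows = true labels 1, 2, 3; columns = predicted labels 1, 2, 3).
labels = [1, 2, 3]
cm = [
    [91, 85, 37],
    [25, 309, 91],
    [21, 137, 449],
]

def per_class_metrics(cm):
    """Return a (precision, recall, f1) tuple for each class index."""
    n = len(cm)
    rows = []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum
        actual_k = sum(cm[k])                          # row sum
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        rows.append((precision, recall, f1))
    return rows

metrics_by_class = per_class_metrics(cm)
accuracy = sum(cm[k][k] for k in range(len(cm))) / sum(map(sum, cm))

for label, (p, r, f1) in zip(labels, metrics_by_class):
    print(f"class {label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
print(f"accuracy={accuracy:.4f}")
```

Reading along the first row shows that class 1 is confused with class 2 almost as often as it is predicted correctly (85 vs 91), which accounts for its low recall.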


In [122]:
from sklearn.metrics import classification_report,accuracy_score,confusion_matrix
pred = SGDC.predict(x_test)
print(classification_report(y_test, pred))
print()
print("confusion matrix : \n", confusion_matrix(y_test, pred) )
print()
print("accuracy: \n", accuracy_score(y_test, pred))

              precision    recall  f1-score   support

           1       0.64      0.53      0.58       213
           2       0.60      0.68      0.64       425
           3       0.79      0.76      0.78       607

    accuracy                           0.70      1245
   macro avg       0.68      0.66      0.67      1245
weighted avg       0.70      0.70      0.70      1245


confusion matrix : 
 [[113  75  25]
 [ 38 291  96]
 [ 26 118 463]]

accuracy: 
 0.6963855421686747


In [123]:
from sklearn.ensemble import RandomForestClassifier

text_classifier = RandomForestClassifier(n_estimators=350, max_depth=15, random_state=1, min_samples_leaf=1)
text_classifier.fit(x_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=15, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=350,
                       n_jobs=None, oob_score=False, random_state=1, verbose=0,
                       warm_start=False)

In [124]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

predictions = text_classifier.predict(x_test)
print(confusion_matrix(y_test,predictions))
print(classification_report(y_test,predictions))
print(accuracy_score(y_test, predictions))

[[  6   2 205]
 [  2  25 398]
 [  2   1 604]]
              precision    recall  f1-score   support

           1       0.60      0.03      0.05       213
           2       0.89      0.06      0.11       425
           3       0.50      1.00      0.67       607

    accuracy                           0.51      1245
   macro avg       0.66      0.36      0.28      1245
weighted avg       0.65      0.51      0.37      1245

0.5100401606425703
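
The random forest's accuracy hides how one-sided its predictions are. As a rough check, the sketch below derives the per-class recall, their unweighted mean, and the share of test rows assigned to class 3, using only the confusion matrix printed above; the variable names are illustrative.

```python
# Confusion matrix copied from the random forest output above
# (rows = true labels 1, 2, 3; columns = predicted labels 1, 2, 3).
cm = [
    [6, 2, 205],
    [2, 25, 398],
    [2, 1, 604],
]

n_samples = sum(map(sum, cm))
recalls = [row[k] / sum(row) for k, row in enumerate(cm)]
macro_recall = sum(recalls) / len(recalls)
accuracy = sum(cm[k][k] for k in range(len(cm))) / n_samples

# Fraction of all test rows that the model labelled as class 3 (last column).
share_predicted_3 = sum(row[2] for row in cm) / n_samples

print("recall per class:", [round(r, 3) for r in recalls])
print(f"macro recall: {macro_recall:.3f}")
print(f"accuracy: {accuracy:.3f}")
print(f"share of rows predicted as class 3: {share_predicted_3:.3f}")
```

Nearly every row is assigned to class 3, so the model behaves much like a majority-class predictor on this test split.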


In [48]:
!pip install xgboost

Collecting xgboost
  Downloading xgboost-1.1.1-py3-none-win_amd64.whl (54.4 MB)
Installing collected packages: xgboost
Successfully installed xgboost-1.1.1


In [125]:
import xgboost
from xgboost import XGBClassifier
classifier = XGBClassifier()
classifier.fit(X_train, Y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=6,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=0, num_parallel_tree=1,
              objective='multi:softprob', random_state=0, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=None, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

In [126]:
predictions = classifier.predict(X_test)

In [127]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(confusion_matrix(Y_test,predictions))
print(classification_report(Y_test,predictions))
print(accuracy_score(Y_test, predictions))

[[ 82 101  30]
 [ 27 322  76]
 [ 19 141 447]]
              precision    recall  f1-score   support

           1       0.64      0.38      0.48       213
           2       0.57      0.76      0.65       425
           3       0.81      0.74      0.77       607

    accuracy                           0.68      1245
   macro avg       0.67      0.63      0.63      1245
weighted avg       0.70      0.68      0.68      1245

0.6835341365461848


In [133]:
# Hyperparameter tuning for XGBoost

from __future__ import print_function

from pprint import pprint
from time import time
import logging

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# #############################################################################
# Define a pipeline holding only the XGBoost classifier; the inputs (X_train) are
# already vectorized, so the text feature extraction steps stay commented out

pipeline = Pipeline([
    #('vect', CountVectorizer()),  #http://scikit-learn.org/stable/modules/feature_extraction.html
    #('tfidf', TfidfTransformer()), #ignore for now
    ('clf', XGBClassifier(base_score=0.5, booster='gbtree',min_child_weight=1, gamma=0, objective= 'multi:softprob', seed=27)) 
])


# uncommenting more parameters will give better exploring power but will
# increase processing time in a combinatorial way
parameters = { #listed in the form of "step__parameter", e.g, clf__penalty
    #'vect__max_df': (0.5, 0.75, 1.0),
    # jgs 'vect__max_features': (None, 500, 5000, 10000, 50000),
    # jgs 'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams (single words) or bigrams (or sequence of words of length 2)
    #'tfidf__use_idf': (True, False),
    #'tfidf__norm': ('l1', 'l2'),
    'clf__n_estimators': ( 200, 300, 400, 500),
    'clf__learning_rate': (0.1, 0.3, 0.5, 0.8),
    'clf__max_depth': (5, 10)
    #'clf__loss': ('log', 'hinge'),  #hinge linear SVM
    #'clf__n_iter': (10, 50, 80),
}

if __name__ == "__main__":
    # multiprocessing requires the fork to happen in a __main__ protected
    # block

    # find the best parameters for both the feature extraction and the
    # classifier
    # n_jobs=-1 means that the computation will be dispatched on all the CPUs of the computer.
    #
    # cv=3 requests 3-fold cross-validation (scikit-learn 0.22 and later default to 5 folds).
    # With an integer cv and a classifier, GridSearchCV uses stratified folds.
    grid_search = GridSearchCV(pipeline, parameters, cv=3, n_jobs=-1, verbose=1)

    print("Performing grid search...")
    print("pipeline:", [name for name, _ in pipeline.steps])
    print("parameters:")
    pprint(parameters)
    t0 = time()
    grid_search.fit(X_train, Y_train)
    print("done in %0.3fs" % (time() - t0))
    print()
    #print("grid_search.cv_results_", grid_search.cv_results_)
    #estimator : estimator object. This is assumed to implement the scikit-learn estimator interface.  
    #            Either estimator needs to provide a score function, or scoring must be passed.
    #Accuracy is the default for classification; feel free to change this to precision, recall, fbeta
    print("Best score: %0.3f" % grid_search.best_score_)
    print("Best parameters set:")
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print("\t%s: %r" % (param_name, best_parameters[param_name]))

Performing grid search...
pipeline: ['clf']
parameters:
{'clf__learning_rate': (0.1, 0.3, 0.5, 0.8),
 'clf__max_depth': (5, 10),
 'clf__n_estimators': (200, 300, 400, 500)}
Fitting 3 folds for each of 32 candidates, totalling 96 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:  2.6min
[Parallel(n_jobs=-1)]: Done  96 out of  96 | elapsed:  7.2min finished


done in 432.386s

Best score: 0.663
Best parameters set:
	clf__learning_rate: 0.3
	clf__max_depth: 5
	clf__n_estimators: 300
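
The "32 candidates, totalling 96 fits" line in the log follows directly from the grid: every combination of parameter values is one candidate, and each candidate is fitted once per fold. The sketch below enumerates the same grid with `itertools.product`; it is a plain-Python illustration and does not call scikit-learn.

```python
from itertools import product

# Grid and fold count copied from the XGBoost search above.
param_grid = {
    "clf__n_estimators": (200, 300, 400, 500),
    "clf__learning_rate": (0.1, 0.3, 0.5, 0.8),
    "clf__max_depth": (5, 10),
}
n_folds = 3

names = sorted(param_grid)
candidates = [
    dict(zip(names, values))
    for values in product(*(param_grid[name] for name in names))
]
n_fits = len(candidates) * n_folds

print(f"{len(candidates)} candidates x {n_folds} folds = {n_fits} fits")
print("first candidate:", candidates[0])
```

Because the count is a product, adding one more parameter with four values would quadruple the run time, which is the "combinatorial" cost mentioned in the cell's comments.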


In [134]:
# best model
classifier = XGBClassifier(base_score=0.5, booster='gbtree',min_child_weight=1,
                           gamma=0, objective= 'multi:softprob', seed=27, 
                           learning_rate = 0.3, max_depth = 5, n_estimators = 300)
classifier.fit(X_train, Y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.3, max_delta_step=0, max_depth=5,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=300, n_jobs=0, num_parallel_tree=1,
              objective='multi:softprob', random_state=27, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=None, seed=27, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

In [135]:
# predictions
predictions = classifier.predict(X_test)

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(confusion_matrix(Y_test,predictions))
print(classification_report(Y_test,predictions))
print(accuracy_score(Y_test, predictions))


[[ 92  91  30]
 [ 35 311  79]
 [ 19 129 459]]
              precision    recall  f1-score   support

           1       0.63      0.43      0.51       213
           2       0.59      0.73      0.65       425
           3       0.81      0.76      0.78       607

    accuracy                           0.69      1245
   macro avg       0.67      0.64      0.65      1245
weighted avg       0.70      0.69      0.69      1245

0.6923694779116466


## Hyperparameter Tuning

In [73]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier


from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

In [88]:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

# Convert a number to a percent.    
def pct(x):
    return round(100*x,1)

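class DenseTransformer(BaseEstimator, TransformerMixin):
    """Convert a sparse matrix to a dense array for estimators that need dense input."""

    def fit(self, X, y=None, **fit_params):
        return self

    def transform(self, X, y=None, **fit_params):
        # toarray() returns a plain ndarray; todense() returns np.matrix,
        # which recent scikit-learn versions reject
        return X.toarray()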

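# Feature-preparation pipeline: bag-of-words counts from CountVectorizer.
# Uncomment the 'to_dense' step for estimators that need dense input (e.g. GaussianNB).

full_pipeline = Pipeline([
     ('vectorizer', CountVectorizer()),
     #('to_dense', DenseTransformer())
])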




In [89]:

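import json

import numpy as np
from time import time
from sklearn.model_selection import ShuffleSplit

# Neither `cv30Splits` nor `results` is defined elsewhere in this notebook. The definitions
# below are assumptions inferred from how ConductGridSearch uses them: a 30-split
# shuffle-split CV and a seven-column results table matching the row written per classifier.
cv30Splits = ShuffleSplit(n_splits=30, test_size=0.3, random_state=0)
results = pd.DataFrame(columns=['ExpID', 'Train Accuracy', 'Test Accuracy', 'p-value',
                                'Train Time(s)', 'Test Time(s)', 'Experiment description'])

# A function to execute the grid search and record the results.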
def ConductGridSearch(X_train, y_train, X_test, y_test, i=0, prefix='', n_jobs=-1,verbose=1):
    # Create a list of classifiers for our grid search experiment
    classifiers = [
        ('Logistic Regression', LogisticRegression(random_state=42)),
        ('K-Nearest Neighbors', KNeighborsClassifier()),
        ('Naive Bayes', GaussianNB()),
        ('Support Vector', SVC(random_state=42)),
        ('Stochastic GD', SGDClassifier(random_state=42)),
        ('RandomForest', RandomForestClassifier()),
    ]

    # Arrange grid search parameters for each classifier
    params_grid = {
        'Logistic Regression': {
            'penalty': ('l1', 'l2'),
            'tol': (0.0001, 0.00001, 0.0000001), 
            'C': (10, 1, 0.1, 0.01),
        },
        
        'K-Nearest Neighbors': {},  # no grid: evaluated with default settings
        'Naive Bayes': {},
        'Support Vector' : {
            'kernel': ('rbf', 'poly'),     
            'degree': (1, 2, 3, 4, 5),
            'C': (10, 1, 0.1, 0.01),
        },
        'Stochastic GD': {
            'loss': ('hinge', 'perceptron', 'log'),
            'penalty': ('l1', 'l2', 'elasticnet'),
            'tol': (0.0001, 0.00001, 0.0000001), 
            'alpha': (0.1, 0.01, 0.001, 0.0001), 
        },
        'RandomForest':  {
            'max_depth': [9, 15, 22, 26, 30],
            'max_features': [1, 3, 5],
            'min_samples_split': [5, 10, 15],
            'min_samples_leaf': [3, 5, 10],
            'bootstrap': [False],
            'n_estimators':[20, 80, 150, 200, 300]},
    }
    
    for (name, classifier) in classifiers:
        i += 1
        # Print classifier and parameters
        print('****** START',prefix, name,'*****')
        parameters = params_grid[name]
        print("Parameters:")
        for p in sorted(parameters.keys()):
            print("\t"+str(p)+": "+ str(parameters[p]))
        
        # generate the pipeline
        full_pipeline_with_predictor = Pipeline([
        ("preparation", full_pipeline),
        ("predictor", classifier)
        ])
        
        # Execute the grid search
        params = {}
        for p in parameters.keys():
            pipe_key = 'predictor__'+str(p)
            params[pipe_key] = parameters[p] 
        grid_search = GridSearchCV(full_pipeline_with_predictor, params, scoring='accuracy', cv=5, verbose=1)
        grid_search.fit(X_train, y_train)
                
        # Best estimator score
        best_train = pct(grid_search.best_score_)

        # Best estimator fitting time
        start = time()
        grid_search.best_estimator_.fit(X_train, y_train)
        train_time = round(time() - start, 4)

        # Best estimator prediction time
        start = time()
        best_test_accuracy = pct(grid_search.best_estimator_.score(X_test, y_test))
        test_time = round(time() - start, 4)

        # Generate 30 training accuracy scores with the best estimator and 30-split CV
        
        #==================================================#
        best_train_scores = cross_val_score(grid_search.best_estimator_, X_train, y_train, cv=cv30Splits)
        best_train_accuracy = pct(np.mean(best_train_scores))
        #==================================================#    
       
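        # Conduct t-test with baseline logit (control) and best estimator (experiment).
        # Disabled because logit_scores is not defined in this notebook; p_value is
        # recorded as NaN so the results row below can still be written.
        #(t_stat, p_value) = stats.ttest_rel(logit_scores, best_train_scores)
        p_value = float('nan')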
        
        # Collect the best parameters found by the grid search
        print("Best Parameters:")
        best_parameters = grid_search.best_estimator_.get_params()
        param_dump = []
        for param_name in sorted(params.keys()):
            param_dump.append((param_name, best_parameters[param_name]))
            print("\t"+str(param_name)+": " + str(best_parameters[param_name]))
        print("****** FINISH",prefix,name," *****")
        print("")
        
        # Record the results
        results.loc[i] = [prefix+name, best_train_accuracy, best_test_accuracy, round(p_value,5), train_time, test_time, json.dumps(param_dump)]
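
Inside `ConductGridSearch`, each classifier's grid is rewritten so that its keys address the `predictor` step of the pipeline, using scikit-learn's `step__parameter` naming. The sketch below isolates that renaming; `prefix_params` is a hypothetical helper introduced for illustration and the grid is a shortened copy of the SVC entry.

```python
def prefix_params(step_name, grid):
    """Return a copy of grid whose keys are prefixed with '<step_name>__'."""
    return {f"{step_name}__{param}": values for param, values in grid.items()}

# Shortened copy of the 'Support Vector' grid, used only for illustration.
svc_grid = {
    "kernel": ("rbf", "poly"),
    "C": (10, 1, 0.1, 0.01),
}

prefixed = prefix_params("predictor", svc_grid)
for key in sorted(prefixed):
    print(key, "->", prefixed[key])
```

The prefix has to match the step name given when the pipeline is built; a mismatch makes `GridSearchCV` raise an invalid-parameter error instead of silently ignoring the key.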

### SGD model Hyperparameter tuning

In [93]:


from __future__ import print_function

from pprint import pprint
from time import time
import logging

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# #############################################################################
# Define a pipeline combining a text feature extractor
# with a linear classifier trained by stochastic gradient descent

pipeline = Pipeline([
    ('vect', CountVectorizer()),  #http://scikit-learn.org/stable/modules/feature_extraction.html
    #('tfidf', TfidfTransformer()), #ignore for now
    ('clf', SGDClassifier(max_iter=5)) 
])


# uncommenting more parameters will give better exploring power but will
# increase processing time in a combinatorial way
parameters = { #listed in the form of "step__parameter", e.g, clf__penalty
    #'vect__max_df': (0.5, 0.75, 1.0),
    # jgs 'vect__max_features': (None, 500, 5000, 10000, 50000),
    # jgs 'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams (single words) or bigrams (or sequence of words of length 2)
    #'tfidf__use_idf': (True, False),
    #'tfidf__norm': ('l1', 'l2'),
    'clf__alpha': (0.00001, 0.000001),
    'clf__penalty': ('l1', 'l2', 'elasticnet'),
    'clf__loss': ('log', 'hinge'),  # 'log' = logistic regression (named 'log_loss' in newer scikit-learn), 'hinge' = linear SVM
    #'clf__max_iter': (10, 50, 80),
}

if __name__ == "__main__":
    # multiprocessing requires the fork to happen in a __main__ protected
    # block

    # find the best parameters for both the feature extraction and the
    # classifier
    # n_jobs=-1 means that the computation will be dispatched on all the CPUs of the computer.
    #
    # cv=3 requests 3-fold cross-validation (scikit-learn 0.22 and later default to 5 folds).
    # With an integer cv and a classifier, GridSearchCV uses stratified folds.
    grid_search = GridSearchCV(pipeline, parameters, cv=3, n_jobs=-1, verbose=1)

    print("Performing grid search...")
    print("pipeline:", [name for name, _ in pipeline.steps])
    print("parameters:")
    pprint(parameters)
    t0 = time()
    grid_search.fit(dataset.TEXT, dataset.Sentiment)
    print("done in %0.3fs" % (time() - t0))
    print()
    #print("grid_search.cv_results_", grid_search.cv_results_)
    #estimator : estimator object. This is assumed to implement the scikit-learn estimator interface.  
    #            Either estimator needs to provide a score function, or scoring must be passed.
    #Accuracy is the default for classification; feel free to change this to precision, recall, fbeta
    print("Best score: %0.3f" % grid_search.best_score_)
    print("Best parameters set:")
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print("\t%s: %r" % (param_name, best_parameters[param_name]))

Performing grid search...
pipeline: ['vect', 'clf']
parameters:
{'clf__alpha': (1e-05, 1e-06),
 'clf__loss': ('log', 'hinge'),
 'clf__penalty': ('l1', 'l2', 'elasticnet')}
Fitting 3 folds for each of 12 candidates, totalling 36 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.


done in 0.829s

Best score: 0.635
Best parameters set:
	clf__alpha: 1e-05
	clf__loss: 'hinge'
	clf__penalty: 'elasticnet'
Wall time: 834 ms


[Parallel(n_jobs=-1)]: Done  36 out of  36 | elapsed:    0.6s finished


### SVC model

In [94]:
%%time
# This code is adapted and has been modified from
#
# Author: Olivier Grisel <olivier.grisel@ensta.org>
#         Peter Prettenhofer <peter.prettenhofer@gmail.com>
#         Mathieu Blondel <mathieu@mblondel.org>
# License: BSD 3 clause

from __future__ import print_function

from pprint import pprint
from time import time
import logging

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# #############################################################################
# Define a pipeline combining a text feature extractor
# with a support vector classifier (SVC)

pipeline = Pipeline([
    ('vect', CountVectorizer()),  #http://scikit-learn.org/stable/modules/feature_extraction.html
    #('tfidf', TfidfTransformer()), #ignore for now
    ('clf', SVC(random_state=42))
])


# uncommenting more parameters will give better exploring power but will
# increase processing time in a combinatorial way
parameters = { #listed in the form of "step__parameter", e.g, clf__penalty
    #'vect__max_df': (0.5, 0.75, 1.0),
    # jgs 'vect__max_features': (None, 500, 5000, 10000, 50000),
    'clf__kernel': ('rbf', 'poly'),     
    'clf__degree': (1, 2, 3, 4, 5),  # only used by the 'poly' kernel; ignored by 'rbf'
    'clf__C': (10, 1, 0.1, 0.01)
}

if __name__ == "__main__":
    # multiprocessing requires the fork to happen in a __main__ protected
    # block

    # find the best parameters for both the feature extraction and the
    # classifier
    # n_jobs=-1 means that the computation will be dispatched on all the CPUs of the computer.
    #
    # cv=3 requests 3-fold cross-validation (scikit-learn 0.22 and later default to 5 folds).
    # With an integer cv and a classifier, GridSearchCV uses stratified folds.
    grid_search = GridSearchCV(pipeline, parameters, cv=3, n_jobs=-1, verbose=1)

    print("Performing grid search...")
    print("pipeline:", [name for name, _ in pipeline.steps])
    print("parameters:")
    pprint(parameters)
    t0 = time()
    grid_search.fit(dataset.TEXT, dataset.Sentiment)
    print("done in %0.3fs" % (time() - t0))
    print()
    #print("grid_search.cv_results_", grid_search.cv_results_)
    #estimator : estimator object. This is assumed to implement the scikit-learn estimator interface.  
    #            Either estimator needs to provide a score function, or scoring must be passed.
    #Accuracy is the default for classification; feel free to change this to precision, recall, fbeta
    print("Best score: %0.3f" % grid_search.best_score_)
    print("Best parameters set:")
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print("\t%s: %r" % (param_name, best_parameters[param_name]))

Performing grid search...
pipeline: ['vect', 'clf']
parameters:
{'clf__C': (10, 1, 0.1, 0.01),
 'clf__degree': (1, 2, 3, 4, 5),
 'clf__kernel': ('rbf', 'poly')}
Fitting 3 folds for each of 40 candidates, totalling 120 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   12.2s
[Parallel(n_jobs=-1)]: Done 120 out of 120 | elapsed:   29.2s finished


done in 31.370s

Best score: 0.648
Best parameters set:
	clf__C: 10
	clf__degree: 1
	clf__kernel: 'rbf'
Wall time: 31.4 s
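
The winning setting pairs `kernel='rbf'` with `degree=1`, but `degree` only affects the polynomial kernel, so the five `rbf` candidates per `C` value are the same model fitted five times. `GridSearchCV` also accepts a list of grids, which avoids the duplicates. The sketch below only counts candidates for both layouts; `count_candidates` is a hypothetical helper, and no search is run.

```python
def count_candidates(grids):
    """Count parameter combinations for one dict grid or a list of dict grids."""
    if isinstance(grids, dict):
        grids = [grids]
    total = 0
    for grid in grids:
        size = 1
        for values in grid.values():
            size *= len(values)
        total += size
    return total

C_values = (10, 1, 0.1, 0.01)
degrees = (1, 2, 3, 4, 5)

# Single grid as used above: degree is crossed with both kernels.
single_grid = {"clf__kernel": ("rbf", "poly"), "clf__degree": degrees, "clf__C": C_values}

# Split layout: degree is searched only where it has an effect.
split_grid = [
    {"clf__kernel": ("rbf",), "clf__C": C_values},
    {"clf__kernel": ("poly",), "clf__degree": degrees, "clf__C": C_values},
]

print("single grid:", count_candidates(single_grid), "candidates")
print("split grid: ", count_candidates(split_grid), "candidates")
```

Passing `split_grid` as the parameter grid would cover the same distinct models with fewer fits.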


### Random Forest model

In [97]:
%%time
# This code is adapted and has been modified from
#
# Author: Olivier Grisel <olivier.grisel@ensta.org>
#         Peter Prettenhofer <peter.prettenhofer@gmail.com>
#         Mathieu Blondel <mathieu@mblondel.org>
# License: BSD 3 clause

from __future__ import print_function

from pprint import pprint
from time import time
import logging

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# #############################################################################
# Define a pipeline combining a text feature extractor
# with a random forest classifier

pipeline = Pipeline([
    ('vect', CountVectorizer()),  #http://scikit-learn.org/stable/modules/feature_extraction.html
    #('tfidf', TfidfTransformer()), #ignore for now
    ('clf', RandomForestClassifier())
])


# uncommenting more parameters will give better exploring power but will
# increase processing time in a combinatorial way
parameters = { #listed in the form of "step__parameter", e.g, clf__penalty
    #'vect__max_df': (0.5, 0.75, 1.0),
    # jgs 'vect__max_features': (None, 500, 5000, 10000, 50000),
    'clf__max_depth': [9, 15, 22, 26, 30],
    'clf__max_features': [1, 3, 5],
    'clf__min_samples_split': [5, 10, 15],
    'clf__min_samples_leaf': [3, 5, 10],
    'clf__bootstrap': [False],
    'clf__n_estimators':[20, 80, 150, 200, 300]
}


if __name__ == "__main__":
    # multiprocessing requires the fork to happen in a __main__ protected
    # block

    # find the best parameters for both the feature extraction and the
    # classifier
    # n_jobs=-1 means that the computation will be dispatched on all the CPUs of the computer.
    #
    # cv=3 requests 3-fold cross-validation (scikit-learn 0.22 and later default to 5 folds).
    # With an integer cv and a classifier, GridSearchCV uses stratified folds.
    grid_search = GridSearchCV(pipeline, parameters, cv=3, n_jobs=-1, verbose=1)

    print("Performing grid search...")
    print("pipeline:", [name for name, _ in pipeline.steps])
    print("parameters:")
    pprint(parameters)
    t0 = time()
    grid_search.fit(dataset.TEXT, dataset.Sentiment)
    print("done in %0.3fs" % (time() - t0))
    print()
    #print("grid_search.cv_results_", grid_search.cv_results_)
    #estimator : estimator object. This is assumed to implement the scikit-learn estimator interface.  
    #            Either estimator needs to provide a score function, or scoring must be passed.
    #Accuracy is the default for classification; feel free to change this to precision, recall, fbeta
    print("Best score: %0.3f" % grid_search.best_score_)
    print("Best parameters set:")
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print("\t%s: %r" % (param_name, best_parameters[param_name]))

Performing grid search...
pipeline: ['vect', 'clf']
parameters:
{'clf__bootstrap': [False],
 'clf__max_depth': [9, 15, 22, 26, 30],
 'clf__max_features': [1, 3, 5],
 'clf__min_samples_leaf': [3, 5, 10],
 'clf__min_samples_split': [5, 10, 15],
 'clf__n_estimators': [20, 80, 150, 200, 300]}
Fitting 3 folds for each of 675 candidates, totalling 2025 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    3.5s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:   18.5s
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:   39.7s
[Parallel(n_jobs=-1)]: Done 784 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-1)]: Done 1234 tasks      | elapsed:  1.9min
[Parallel(n_jobs=-1)]: Done 1784 tasks      | elapsed:  2.8min


done in 190.587s

Best score: 0.550
Best parameters set:
	clf__bootstrap: False
	clf__max_depth: 9
	clf__max_features: 1
	clf__min_samples_leaf: 3
	clf__min_samples_split: 5
	clf__n_estimators: 20
Wall time: 3min 10s


[Parallel(n_jobs=-1)]: Done 2025 out of 2025 | elapsed:  3.2min finished


### XGBoost model

### Logistic model