### Comparing Models and Vectorization Strategies for Text Classification

This try-it weighs the strengths and weaknesses of different estimators and vectorization strategies for a text classification problem. To compare them, use the `Pipeline` and `GridSearchCV` objects in scikit-learn to try different combinations of vectorizers and estimators. For each combination, also use the `.cv_results_` attribute to examine how long the estimator takes to fit the data.
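
As a minimal sketch of how fit times can be read from `.cv_results_`, the example below runs a small grid search on a toy corpus. The corpus, the labels, and the step names `vect` and `clf` are made up for illustration and are not part of the assignment data.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus (hypothetical): 1 = joke, 0 = headline.
texts = [
    "why did the chicken cross the road",
    "what do you call a fish with no eyes",
    "knock knock who is there",
    "why was the math book sad",
    "senate passes budget bill after long debate",
    "stocks fall as markets react to rate hike",
    "city council approves new transit plan",
    "storm causes power outages across the region",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipe = Pipeline([("vect", CountVectorizer()), ("clf", LogisticRegression())])
param_grid = {
    "vect__max_features": [5, 20],
    "vect__stop_words": ["english", None],
}

# cv=2 only because the toy corpus is tiny.
grid = GridSearchCV(pipe, param_grid=param_grid, cv=2)
grid.fit(texts, labels)

# cv_results_ holds one row per parameter combination, including timings.
results = pd.DataFrame(grid.cv_results_)
timing = results[["params", "mean_fit_time", "mean_test_score", "rank_test_score"]]
print(timing.sort_values("rank_test_score"))
```

`mean_fit_time` is the average number of seconds spent fitting the whole pipeline (vectorizer plus estimator) across the cross-validation folds, so it can be compared across estimators fitted on the same data.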

### The Data

The dataset below is from [kaggle]() and is named the "ColBert Dataset"; it was created for this [paper](https://arxiv.org/pdf/2004.12765.pdf). Use the text column to classify whether or not the text is humorous. The data is loaded and displayed below.

**Note:** The original dataset contains 200K rows of data. It is best to use the full dataset. If the full dataset is too large for your computer, please use 'dataset-minimal.csv', which has been reduced to 100K rows.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('try-it_18_1_starter/text_data/dataset.csv')

In [3]:
df.head()

Unnamed: 0,text,humor
0,"Joe biden rules out 2020 bid: 'guys, i'm not r...",False
1,Watch: darvish gave hitter whiplash with slow ...,False
2,What do you call a turtle without its shell? d...,True
3,5 reasons the 2016 election feels so personal,False
4,"Pasco police shot mexican migrant from behind,...",False


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    200000 non-null  object
 1   humor   200000 non-null  bool  
dtypes: bool(1), object(1)
memory usage: 1.7+ MB


#### Task


**Text preprocessing:** As a pre-processing step, perform both stemming and lemmatizing to normalize your text before classifying. For each technique, use both the `CountVectorizer` and the `TfidfVectorizer`, and use the stop words and max features options to prepare the text data for your estimator.

**Classification:** Once you have prepared the text data with the stemming and lemmatizing techniques, consider `LogisticRegression`, `DecisionTreeClassifier`, and `MultinomialNB` as classification algorithms for the data. Compare their performance in terms of accuracy and speed.

Share the results of your best classifier in the form of a table with the best version of each estimator, a dictionary of the best parameters and the best score.

In [63]:
pd.DataFrame({'model': ['Logistic', 'Decision Tree', 'Bayes'], 
             'best_params': ['', '', ''],
             'best_score': ['0.9273', '0.87038', '0.9136']}).set_index('model')

model,best_params,best_score
Logistic,,0.9273
Decision Tree,,0.87038
Bayes,,0.9136
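
The `best_params` column above was filled in by hand and left empty. One way to build such a table directly from fitted searches is sketched below. The helper name `summarize_searches`, the step names, and the toy data are assumptions made for illustration. Note that `best_score_` is the mean cross-validated accuracy, not the holdout accuracy reported in this notebook.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def summarize_searches(searches):
    """Hypothetical helper: one row per fitted GridSearchCV in `searches`."""
    rows = []
    for name, search in searches.items():
        rows.append({
            "model": name,
            "best_params": search.best_params_,
            "best_score": search.best_score_,  # mean cross-validated accuracy
            # Fit time of the winning parameter combination, in seconds.
            "mean_fit_time": search.cv_results_["mean_fit_time"][search.best_index_],
        })
    return pd.DataFrame(rows).set_index("model")


# Toy data (hypothetical): 1 = joke, 0 = headline.
texts = [
    "why did the chicken cross the road",
    "what do you call a fish with no eyes",
    "knock knock who is there",
    "why was the math book sad",
    "senate passes budget bill after long debate",
    "stocks fall as markets react to rate hike",
    "city council approves new transit plan",
    "storm causes power outages across the region",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

param_grid = {"cvect__stop_words": ["english", None]}
estimators = {"Logistic": LogisticRegression(), "Bayes": MultinomialNB()}

searches = {}
for name, estimator in estimators.items():
    pipe = Pipeline([("cvect", CountVectorizer()), ("clf", estimator)])
    searches[name] = GridSearchCV(pipe, param_grid=param_grid, cv=2).fit(texts, labels)

table = summarize_searches(searches)
print(table)
```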


### Summary:

#### The LogisticRegression classifier produced the best score on a test data holdout of 25%.
#### The performance of Bag-of-Words (CountVectorizer) was better than TF-IDF in every case. 
#### The DecisionTree took a lot more compute time to produce an inferior score.
#### Bayes had a very close score to the LogisticRegression and fit the data much more quickly.
#### Overall winner: CountVectorizer with default settings (no feature cap, no stop word removal) and LogisticRegression classifier, with a test accuracy of 0.9273.
#### Close second: CountVectorizer with default settings (no feature cap, no stop word removal) and MultinomialNB classifier, with a test accuracy of 0.9136.

In [69]:
pd.DataFrame({'model': ['Cvect+Logistic', 'Cvect+Decision Tree', 'Cvect+Bayes','Tfidf+Logistic', 'Tfidf+Decision Tree', 'Tfidf+Bayes'], 
             
             'best_score': ['0.9273', '0.87038','0.9136','0.92054','0.86032','0.90892']}).set_index('model')

model,best_score
Cvect+Logistic,0.9273
Cvect+Decision Tree,0.87038
Cvect+Bayes,0.9136
Tfidf+Logistic,0.92054
Tfidf+Decision Tree,0.86032
Tfidf+Bayes,0.90892


### Stemming

In [6]:
import nltk
from nltk import word_tokenize, sent_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt to /Users/kellen/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/kellen/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /Users/kellen/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [7]:
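def stemmer(text):
    # Split the text into word and punctuation tokens
    lst = word_tokenize(text)
    stem = PorterStemmer()
    # Stem each token (PorterStemmer also lowercases) and rejoin with spaces
    return ' '.join([stem.stem(i) for i in lst])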


In [8]:
print(stemmer("Joe biden rules out 2020 bid: 'guys, i'm not"))

joe biden rule out 2020 bid : 'guy , i 'm not


In [9]:
df['stemmer'] = df['text'].apply(stemmer)
df.head()

Unnamed: 0,text,humor,stemmer
0,"Joe biden rules out 2020 bid: 'guys, i'm not r...",False,"joe biden rule out 2020 bid : 'guy , i 'm not ..."
1,Watch: darvish gave hitter whiplash with slow ...,False,watch : darvish gave hitter whiplash with slow...
2,What do you call a turtle without its shell? d...,True,what do you call a turtl without it shell ? de...
3,5 reasons the 2016 election feels so personal,False,5 reason the 2016 elect feel so person
4,"Pasco police shot mexican migrant from behind,...",False,"pasco polic shot mexican migrant from behind ,..."


### Lemmatizing

In [10]:
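def lemma(text):
    # Split the text into word and punctuation tokens
    lst = word_tokenize(text)
    # Use a distinct name so the local variable does not shadow the function
    lemmatizer = WordNetLemmatizer()
    # lemmatize() treats each token as a noun by default and keeps its case
    return ' '.join([lemmatizer.lemmatize(i) for i in lst])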


In [11]:
print(lemma("Joe biden rules out 2020 bid: 'guys, i'm not"))

Joe biden rule out 2020 bid : 'guys , i 'm not


In [12]:
df['lemma'] = df['text'].apply(lemma)
df.head()

Unnamed: 0,text,humor,stemmer,lemma
0,"Joe biden rules out 2020 bid: 'guys, i'm not r...",False,"joe biden rule out 2020 bid : 'guy , i 'm not ...","Joe biden rule out 2020 bid : 'guys , i 'm not..."
1,Watch: darvish gave hitter whiplash with slow ...,False,watch : darvish gave hitter whiplash with slow...,Watch : darvish gave hitter whiplash with slow...
2,What do you call a turtle without its shell? d...,True,what do you call a turtl without it shell ? de...,What do you call a turtle without it shell ? d...
3,5 reasons the 2016 election feels so personal,False,5 reason the 2016 elect feel so person,5 reason the 2016 election feel so personal
4,"Pasco police shot mexican migrant from behind,...",False,"pasco polic shot mexican migrant from behind ,...",Pasco police shot mexican migrant from behind ...


### Count Vectorization

In [13]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error

In [14]:
X1 = df[['stemmer']]
y1 = df['humor']

In [15]:
X1_train, X1_test, y1_train, y1_test = train_test_split(X1['stemmer'],y1,random_state = 42)
X1_train.head()

21743       9 fact you probabl would n't believ 50 year ago
124554    knock knock ? ? who 's there ? ? jehovah wit ....
10351       i 'm not an expert on masturb but i hold my own
135164    dad demand apolog from ann coulter for use 're...
49969     whi us it not good to have an oili face ? the ...
Name: stemmer, dtype: object

In [16]:
X1_train.info()

<class 'pandas.core.series.Series'>
Index: 150000 entries, 21743 to 121958
Series name: stemmer
Non-Null Count   Dtype 
--------------   ----- 
150000 non-null  object
dtypes: object(1)
memory usage: 2.3+ MB


### CountVectorizer + Logistic Regression

In [17]:
vect_pipe = Pipeline([('cvect',CountVectorizer()),
                      ('lgr',LogisticRegression())])
vect_pipe

In [18]:
vect_pipe.fit(X1_train,y1_train)

In [19]:
vect_acc = vect_pipe.score(X1_test,y1_test)
vect_acc

0.9273

In [20]:
df.humor.value_counts()

humor
False    100000
True     100000
Name: count, dtype: int64
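
Because the two classes are evenly split, a classifier that always predicts the same class would score about 0.5, which puts the 0.9273 accuracy above in context. A minimal sketch of that baseline with scikit-learn's `DummyClassifier`, using synthetic balanced labels (made up for illustration, not the notebook's data):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic, perfectly balanced labels (hypothetical stand-in for df['humor']).
y = np.array([True] * 50 + [False] * 50)
# DummyClassifier ignores the features, so a placeholder column is enough.
X = np.zeros((len(y), 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)

# Always predicting a single class is right for exactly half of the rows.
baseline_acc = baseline.score(X, y)
print(baseline_acc)
```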

### Pipeline and Grid Search

In [21]:
params = {'cvect__max_features':[100,500,1000,2000,3000,4000,5000],
         'cvect__stop_words':['english',None] }

In [22]:
cvect_grid = GridSearchCV(vect_pipe, param_grid = params)
cvect_grid.fit(X1_train, y1_train)

In [23]:
cvect_grid_acc = cvect_grid.score(X1_test,y1_test)
cvect_grid_acc

0.92318

In [24]:
cvect_grid_best = cvect_grid.best_params_
cvect_grid_best

{'cvect__max_features': 5000, 'cvect__stop_words': None}
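
In the search above, the best `max_features` was the largest value offered (5000), and the tuned test accuracy (0.92318) came out slightly below the uncapped default pipeline (0.9273). One option, offered as a suggestion rather than something the original run did, is to include `None` in the grid so the uncapped vocabulary is also a candidate. The sketch below uses a toy corpus (invented for illustration) to show what `max_features=None` does.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (hypothetical).
docs = ["the cat sat on the mat", "the dog chased the cat", "a bird sang"]

# max_features=3 keeps only the three most frequent terms.
capped = CountVectorizer(max_features=3).fit(docs)
# max_features=None (the default) keeps the whole vocabulary.
full = CountVectorizer(max_features=None).fit(docs)

print(len(capped.vocabulary_), len(full.vocabulary_))

# A grid that also lets the search consider the uncapped vocabulary.
params_with_none = {
    "cvect__max_features": [100, 500, 1000, 2000, 3000, 4000, 5000, None],
    "cvect__stop_words": ["english", None],
}
```

The single-letter token "a" is not counted because the default token pattern only keeps tokens of two or more characters.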

### CountVectorizer + DecisionTree

In [28]:
from sklearn.tree import DecisionTreeClassifier, export_text, plot_tree

In [29]:
vect_tree_pipe = Pipeline([('cvect',CountVectorizer()),
                           ('tree', DecisionTreeClassifier())])
vect_tree_pipe

In [30]:
vect_tree_pipe.fit(X1_train,y1_train)

In [31]:
vect_tree_acc = vect_tree_pipe.score(X1_test,y1_test)
vect_tree_acc

0.87038

In [37]:
tree_params = {'cvect__max_features':[100,5000],
            'tree__min_samples_split':[0.1,0.05],
              'tree__max_depth':[2,10],
              'tree__min_impurity_decrease':[0.01,0.05]}


In [38]:
tree_grid1 = GridSearchCV(vect_tree_pipe, param_grid = tree_params)
tree_grid1.fit(X1_train,y1_train)

In [39]:
tree_grid1.score(X1_test,y1_test)

0.76992

In [40]:
tree_grid1.best_params_

{'cvect__max_features': 100,
 'tree__max_depth': 10,
 'tree__min_impurity_decrease': 0.01,
 'tree__min_samples_split': 0.1}

In [34]:
tree_grid = GridSearchCV(vect_tree_pipe, param_grid = params)
tree_grid.fit(X1_train,y1_train)

In [35]:
tree_grid_acc = tree_grid.score(X1_test,y1_test)
tree_grid_acc

0.86976

In [36]:
tree_grid_best = tree_grid.best_params_
tree_grid_best

{'cvect__max_features': 5000, 'cvect__stop_words': None}

### CountVectorizer + MultinomialNB

In [42]:
from sklearn.naive_bayes import MultinomialNB

In [43]:
vect_NB_pipe = Pipeline([('cvect',CountVectorizer()),
                        ('bayes',MultinomialNB())])
vect_NB_pipe.fit(X1_train,y1_train)

In [44]:
vect_NB_pipe.score(X1_test,y1_test)

0.9136

In [45]:
bayes_grid = GridSearchCV(vect_NB_pipe,param_grid = params)
bayes_grid.fit(X1_train,y1_train)

In [46]:
bayes_grid.score(X1_test,y1_test)

0.90594

In [47]:
bayes_grid.best_params_

{'cvect__max_features': 5000, 'cvect__stop_words': None}

### Tfidf + Logistic Regression

In [48]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [49]:
tfidf_pipe = Pipeline([('tfidf',TfidfVectorizer()),
                      ('lgr',LogisticRegression())])
tfidf_pipe.fit(X1_train,y1_train)

In [50]:
tfidf_pipe.score(X1_test,y1_test)

0.92054

In [51]:
tfidf_params = {'tfidf__max_features': [100,500,1000,2000],
               'tfidf__stop_words':['english',None]}


In [52]:
tfidf_grid = GridSearchCV(tfidf_pipe, param_grid = tfidf_params)
tfidf_grid.fit(X1_train,y1_train)

In [53]:
tfidf_grid.score(X1_test,y1_test)

0.91104

In [54]:
tfidf_grid.best_params_

{'tfidf__max_features': 2000, 'tfidf__stop_words': None}

### Tfidf + Decision Tree

In [55]:
tfidf_tree_pipe = Pipeline([('tfidf',TfidfVectorizer()),
                           ('tree', DecisionTreeClassifier())])
tfidf_tree_pipe.fit(X1_train,y1_train)

In [56]:
tfidf_tree_pipe.score(X1_test,y1_test)

0.86032

In [57]:
tfidf_tree_params = {'tfidf__max_features': [100,1000,2000],
            'tree__min_samples_split':[0.1,0.05],
              'tree__max_depth':[2,10],
              'tree__min_impurity_decrease':[0.01,0.05]}
                    

In [58]:
tfidf_tree_grid = GridSearchCV(tfidf_tree_pipe, param_grid = tfidf_tree_params)
tfidf_tree_grid.fit(X1_train,y1_train)

In [59]:
tfidf_tree_grid.score(X1_test,y1_test)

0.76992

In [60]:
tfidf_tree_grid.best_params_

{'tfidf__max_features': 1000,
 'tree__max_depth': 10,
 'tree__min_impurity_decrease': 0.01,
 'tree__min_samples_split': 0.1}

### Tfidf + MultinomialNB

In [61]:
tfidf_bayes_pipe = Pipeline([('tfidf',TfidfVectorizer()),
                            ('bayes',MultinomialNB())])
tfidf_bayes_pipe.fit(X1_train,y1_train)

In [62]:
tfidf_bayes_pipe.score(X1_test,y1_test)

0.90892