### Try-It Activity 18.1: Comparing Methods


This Try-It activity focuses on weighing the strengths and weaknesses of different estimators and vectorization strategies for a text classification problem.  To compare them, use the `Pipeline` and `GridSearchCV` objects in scikit-learn to try different combinations of vectorizers and estimators.  For each combination, also use the fitted grid search's `.cv_results_` attribute to examine how long the estimator takes to fit the data.
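As a rough illustration of the workflow described above, the sketch below swaps the vectorizer step inside a `Pipeline` through the parameter grid and then reads fit times from `.cv_results_`. The tiny `texts` and `labels` lists are made up for the example and stand in for the real data; `cv=2` and `max_iter=1000` are arbitrary choices made only so the toy example runs cleanly.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny made-up corpus (an assumption, not the activity's dataset)
texts = [
    "why did the chicken cross the road",
    "what do you call a fish with no eyes",
    "knock knock who is there",
    "a horse walks into a bar",
    "senate votes on new budget bill",
    "stocks fall after earnings report",
    "city council approves housing plan",
    "storm expected to hit the coast",
]
labels = [True, True, True, True, False, False, False, False]

pipe = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# A whole pipeline step can be a grid parameter, so both vectorizers are tried
param_grid = {
    "vectorizer": [CountVectorizer(), TfidfVectorizer()],
    "classifier__C": [0.1, 1.0],
}

grid = GridSearchCV(pipe, param_grid, cv=2)
grid.fit(texts, labels)

# cv_results_ holds one row per parameter combination, including fit times
timing = pd.DataFrame(grid.cv_results_)[
    ["param_vectorizer", "param_classifier__C", "mean_fit_time", "mean_test_score"]
]
print(timing)
```

On data this small the timings are not meaningful; the point is only where the numbers live once the search has been fit.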

### The Data

The dataset below is from [Kaggle]() and is known as the "ColBert Dataset"; it was created for this [paper](https://arxiv.org/pdf/2004.12765.pdf).  You are to use the `text` column to classify whether or not each text is humorous.  The data is loaded and displayed below.


In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('data/dataset-minimal.csv')

In [3]:
df.head()

Unnamed: 0,text,humor
0,"Joe biden rules out 2020 bid: 'guys, i'm not r...",False
1,Watch: darvish gave hitter whiplash with slow ...,False
2,What do you call a turtle without its shell? d...,True
3,5 reasons the 2016 election feels so personal,False
4,"Pasco police shot mexican migrant from behind,...",False


#### Task


**Text preprocessing:** As a pre-processing step, perform both stemming and lemmatizing to normalize your text before classifying. For each technique, use both the `CountVectorizer` and the `TfidfVectorizer`, and use their stop-word and max-features options to prepare the text data for your estimator.

**Classification:** Once you have prepared the text data with the stemming and lemmatizing techniques, consider `LogisticRegression`, `DecisionTreeClassifier`, and `MultinomialNB` as classification algorithms for the data. Compare their performance in terms of accuracy and speed.

Share your results in the form of a table that lists, for the best version of each estimator, a dictionary of the best parameters and the best score.

In [4]:
pd.DataFrame({'model': ['Logistic', 'Decision Tree', 'Bayes'],
             'best_params': ['', '', ''],
             'best_score': ['', '', '']}).set_index('model')

model,best_params,best_score
Logistic,,
Decision Tree,,
Bayes,,


In [5]:
import pandas as pd
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.stem import PorterStemmer, WordNetLemmatizer
import nltk
from nltk.tokenize import word_tokenize

# Download required NLTK data
nltk.download('punkt')
nltk.download('wordnet')

# Custom tokenizer callables that stem or lemmatize each token
class StemTokenizer:
    def __init__(self):
        self.stemmer = PorterStemmer()

    def __call__(self, doc):
        return [self.stemmer.stem(t) for t in word_tokenize(doc)]

class LemmaTokenizer:
    def __init__(self):
        self.lemmatizer = WordNetLemmatizer()

    def __call__(self, doc):
        return [self.lemmatizer.lemmatize(t) for t in word_tokenize(doc)]

# Create pipelines
pipelines = {
    'Logistic_stem': Pipeline([
        ('vectorizer', CountVectorizer(tokenizer=StemTokenizer())),
        ('classifier', LogisticRegression(random_state=42))
    ]),
    'Logistic_lemma': Pipeline([
        ('vectorizer', TfidfVectorizer(tokenizer=LemmaTokenizer())),
        ('classifier', LogisticRegression(random_state=42))
    ]),
    'DecisionTree_stem': Pipeline([
        ('vectorizer', CountVectorizer(tokenizer=StemTokenizer())),
        ('classifier', DecisionTreeClassifier(random_state=42))
    ]),
    'DecisionTree_lemma': Pipeline([
        ('vectorizer', TfidfVectorizer(tokenizer=LemmaTokenizer())),
        ('classifier', DecisionTreeClassifier(random_state=42))
    ]),
    'MultinomialNB_stem': Pipeline([
        ('vectorizer', CountVectorizer(tokenizer=StemTokenizer())),
        ('classifier', MultinomialNB())
    ]),
    'MultinomialNB_lemma': Pipeline([
        ('vectorizer', TfidfVectorizer(tokenizer=LemmaTokenizer())),
        ('classifier', MultinomialNB())
    ])
}

# Parameter grids for each pipeline
param_grids = {
    'Logistic_stem': {
        'vectorizer__max_features': [1000, 2000],
        'vectorizer__stop_words': ['english', None],
        'classifier__C': [0.1, 1.0]
    },
    'Logistic_lemma': {
        'vectorizer__max_features': [1000, 2000],
        'vectorizer__stop_words': ['english', None],
        'classifier__C': [0.1, 1.0]
    },
    'DecisionTree_stem': {
        'vectorizer__max_features': [1000, 2000],
        'vectorizer__stop_words': ['english', None],
        'classifier__max_depth': [5, 10]
    },
    'DecisionTree_lemma': {
        'vectorizer__max_features': [1000, 2000],
        'vectorizer__stop_words': ['english', None],
        'classifier__max_depth': [5, 10]
    },
    'MultinomialNB_stem': {
        'vectorizer__max_features': [1000, 2000],
        'vectorizer__stop_words': ['english', None],
        'classifier__alpha': [0.1, 1.0]
    },
    'MultinomialNB_lemma': {
        'vectorizer__max_features': [1000, 2000],
        'vectorizer__stop_words': ['english', None],
        'classifier__alpha': [0.1, 1.0]
    }
}

# Train and evaluate models
results = {}
X = df['text']
y = df['humor']

for name, pipeline in pipelines.items():
    grid_search = GridSearchCV(pipeline, param_grids[name], cv=5, n_jobs=-1)
    grid_search.fit(X, y)
    results[name] = {
        'best_params': grid_search.best_params_,
        'best_score': grid_search.best_score_
    }

# Create results DataFrame: for each estimator, keep whichever variant
# (stemmed or lemmatized) achieved the higher cross-validated score
def best_variant(prefix):
    stem, lemma = results[f'{prefix}_stem'], results[f'{prefix}_lemma']
    return stem if stem['best_score'] > lemma['best_score'] else lemma

winners = [best_variant(p) for p in ['Logistic', 'DecisionTree', 'MultinomialNB']]

best_results = pd.DataFrame({
    'model': ['Logistic', 'Decision Tree', 'Bayes'],
    'best_params': [str(w['best_params']) for w in winners],
    'best_score': [w['best_score'] for w in winners]
}).set_index('model')

best_results

[nltk_data] Downloading package punkt to /Users/mpt/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/mpt/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


model,best_params,best_score
Logistic,"{'classifier__C': 1.0, 'vectorizer__max_featur...",0.95901
Decision Tree,"{'classifier__max_depth': 10, 'vectorizer__max...",0.939569
Bayes,"{'classifier__alpha': 0.1, 'vectorizer__max_fe...",0.935229
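
The task also asks for a speed comparison, which the table above does not capture. One possible way to record it, sketched here on made-up toy data rather than the humor dataset, is to look up the `mean_fit_time` and `mean_score_time` entries of `.cv_results_` at `best_index_` for each estimator. The names `estimators`, `rows`, and `speed_table` are hypothetical, and the sketch uses a plain `CountVectorizer` without the stemming or lemmatizing tokenizers to stay self-contained.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in data (an assumption, not the activity's dataset)
texts = [
    "why did the chicken cross the road",
    "what do you call a fish with no eyes",
    "knock knock who is there",
    "a horse walks into a bar",
    "senate votes on new budget bill",
    "stocks fall after earnings report",
    "city council approves housing plan",
    "storm expected to hit the coast",
]
labels = [True, True, True, True, False, False, False, False]

# Each estimator paired with a small grid over one of its hyperparameters
estimators = {
    "Logistic": (LogisticRegression(max_iter=1000), {"classifier__C": [0.1, 1.0]}),
    "Decision Tree": (DecisionTreeClassifier(random_state=42),
                      {"classifier__max_depth": [5, 10]}),
    "Bayes": (MultinomialNB(), {"classifier__alpha": [0.1, 1.0]}),
}

rows = []
for name, (estimator, grid_params) in estimators.items():
    pipe = Pipeline([("vectorizer", CountVectorizer()), ("classifier", estimator)])
    search = GridSearchCV(pipe, grid_params, cv=2)
    search.fit(texts, labels)
    best = search.best_index_  # row of cv_results_ for the best parameters
    rows.append({
        "model": name,
        "best_score": search.best_score_,
        "mean_fit_time": search.cv_results_["mean_fit_time"][best],
        "mean_score_time": search.cv_results_["mean_score_time"][best],
    })

speed_table = pd.DataFrame(rows).set_index("model")
print(speed_table)
```

The same lookup could be added to the `results` dictionary in the loop above so that accuracy and fit time appear side by side for the real data.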
