## ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Project 3: NLP Classification: Subreddit Pepsi vs Coca-Cola | Part 2: Vectorizer

---

[README](../README.md) | [Part 1: EDA](01_EDA.ipynb) | **Part 2: Vectorizer** | [Part 3: Vectorizer Performance](03_Vectorizer_Performance.ipynb) | [Part 4: Model Tuning](04_Model_Tuning.ipynb)

---

### Introduction
From [Part 1: EDA](01_EDA.ipynb), we obtained the cleaned dataset `subreddit_pepsi_vs_cocacola_clean.csv`. In this part, we tune the hyperparameters of two vectorizers, **CountVectorizer** and **TfidfVectorizer**, while keeping the classifier at its default settings. The goal is to analyze how the vectorizer hyperparameters affect model performance.

**CountVectorizer** represents each document by how often each word (token) appears in its text, while **TfidfVectorizer** weights each word by balancing its frequency in that document (Term Frequency) against how rare it is across all documents (Inverse Document Frequency). TF-IDF often gives better results, especially when a few distinctive terms need to stand out in a large corpus, but that is not guaranteed, so both vectorizers are compared here.
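
To make the difference concrete, here is a minimal pure-Python sketch of both representations on a toy corpus. The three documents are invented for illustration and tokens are split on whitespace. The weighting uses the smoothed IDF `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is what scikit-learn's `TfidfVectorizer` applies with its default settings. The real vectorizers also handle tokenization and sparse storage, which this sketch leaves out.

```python
import math
from collections import Counter

# Toy corpus (invented for illustration, not taken from the dataset)
docs = [
    "cold can of soda",
    "cold glass bottle",
    "soda soda soda",
]
tokenized = [doc.split() for doc in docs]
vocab = sorted({token for doc in tokenized for token in doc})

# Count representation: one row per document, one column per vocabulary term
counts = [[Counter(doc)[term] for term in vocab] for doc in tokenized]

# Document frequency: in how many documents each term appears
n_docs = len(tokenized)
doc_freq = {term: sum(term in doc for doc in tokenized) for term in vocab}

# Smoothed inverse document frequency: rarer terms get a larger weight
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocab}

def tfidf_row(doc):
    """Return the L2-normalized TF-IDF weights of one tokenized document."""
    term_counts = Counter(doc)
    raw = [term_counts[term] * idf[term] for term in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]

tfidf = [tfidf_row(doc) for doc in tokenized]

for term, count, weight in zip(vocab, counts[0], tfidf[0]):
    print(f"{term:<7} count={count} tfidf={weight:.3f}")
```

In the first document every word appears once, so the count row cannot tell the words apart, while the TF-IDF row gives `can` and `of` (each found in one document) more weight than `cold` and `soda` (each found in two).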

### Import

In [5]:
# Data manipulation and plotting
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# NLP tools
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, RegexpTokenizer
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Model selection and evaluation
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import make_scorer, accuracy_score, confusion_matrix, recall_score, precision_score, f1_score
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Classifiers
from sklearn.naive_bayes import MultinomialNB 
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

# Timing and parallel execution
import time
from joblib import parallel_backend

# Set seed for reproducibility
np.random.seed(42)             

### Data Preparation

In [7]:
df = pd.read_csv('../data/subreddit_pepsi_vs_cocacola_clean.csv')          # Load Data
df.head(1)                                                                 # Check first row

| | title | score | id | url | comms_num | created | body | is_pepsi | title_body | title_body_length | title_body_word_count |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Wall clock. | 17 | 1godhom | https://www.reddit.com/gallery/1godhom | 1 | 11/11/2024 5:58 | I'm trying to locate a value for this clock. I... | 0 | Wall clock. I'm trying to locate a value for t... | 220 | 39 |


In [8]:
df.shape

(1958, 11)

In [9]:
# Set Features and Target

X = df['title_body']
y = df['is_pepsi']

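# Split the data into training and testing sets.
# random_state is passed explicitly so the split does not depend on cell execution order.
X_train, X_test, y_train, y_test = train_test_split(X
                                                    , y
                                                    , test_size = 0.2
                                                    , stratify = y
                                                    , random_state = 42
                                                   )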

### Vectorizer
- We compare two vectorizers, **CountVectorizer** and **TfidfVectorizer**, which share the tokenization and vocabulary parameters tuned below.
- Converting the text to lowercase and removing words related to **Pepsi** and **Coca-Cola** are applied in every configuration.
- The parameters varied for comparison are:
    - Token normalization: none, stemming, or lemmatizing
    - English stopwords: removed or kept
    - Maximum features (`max_features`): 3000, 5000, or None
    - N-gram range (`ngram_range`): unigrams only, or unigrams and bigrams
    - Minimum document frequency (`min_df`): 2 or 3
    - Maximum document frequency (`max_df`): 0.8 or 0.9
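
The vocabulary-related parameters above interact, so a small sketch may help. The function below is a simplified pure-Python stand-in for how a vectorizer could build its vocabulary; it is not scikit-learn's implementation. `build_vocabulary` and the toy documents are hypothetical names and data. It treats an integer `min_df` as an absolute document count and a float `max_df` as a proportion of documents, and it keeps the `max_features` terms with the highest total count, breaking ties alphabetically (an assumption of this sketch).

```python
from collections import Counter

def ngrams(tokens, n_min, n_max):
    """Return all n-grams of length n_min..n_max as space-joined strings."""
    terms = []
    for n in range(n_min, n_max + 1):
        terms.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return terms

def build_vocabulary(token_docs, ngram_range=(1, 1), min_df=1, max_df=1.0,
                     max_features=None):
    """Simplified sketch of vocabulary selection (hypothetical helper)."""
    n_docs = len(token_docs)
    doc_freq = Counter()     # number of documents containing each term
    total_count = Counter()  # total occurrences of each term in the corpus
    for tokens in token_docs:
        terms = ngrams(tokens, *ngram_range)
        total_count.update(terms)
        doc_freq.update(set(terms))

    # min_df is an absolute count here; max_df is a proportion of documents
    kept = [term for term, df in doc_freq.items()
            if df >= min_df and df <= max_df * n_docs]

    # max_features keeps the most frequent surviving terms
    kept.sort(key=lambda term: (-total_count[term], term))
    if max_features is not None:
        kept = kept[:max_features]
    return sorted(kept)

# Toy tokenized documents (invented for illustration)
toy_docs = [
    ["ice", "cold", "drink"],
    ["ice", "cold", "can"],
    ["ice", "zero", "sugar"],
    ["ice", "zero", "can"],
]

unigrams = build_vocabulary(toy_docs, min_df=2, max_df=0.8)
with_bigrams = build_vocabulary(toy_docs, ngram_range=(1, 2), min_df=2, max_df=0.8)
capped = build_vocabulary(toy_docs, min_df=2, max_df=0.8, max_features=2)
print(unigrams)
print(with_bigrams)
print(capped)
```

Here `ice` is dropped by `max_df` because it appears in every document, `drink` and `sugar` are dropped by `min_df` because they appear only once, and allowing bigrams adds `ice cold` and `ice zero` to the surviving terms.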

In [11]:
# Related words make the subreddit trivially identifiable (for example, brand names),
# so they are removed from every document.
related_words = {'pepsi', 'pepsico', 'coca', 'cola', 'coke'}

# Custom stop words combine common English words that carry little meaning
# with the related words above.

custom_stop_words = set(stopwords.words('english')).union(related_words)

In [12]:
# Instantiate the stemmer and lemmatizer (classes imported from nltk.stem above)
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Create 6 custom tokenizers. 
# Note that all tokenizers are designed to remove symbols, related words, and convert text to lowercase, 
# with the following additional functions:
# - lower_only: no other changes
# - lower_stop: removes stopwords
# - stem_only: applies stemming
# - stem_stop: applies stemming and removes stopwords
# - lem_only: applies lemmatizing
# - lem_stop: applies lemmatizing and removes stopwords

def lower_only(doc):
    doc = doc.lower()
    tokens = word_tokenize(doc)
    return [token for token in tokens if token not in related_words and token.isalpha()]

def lower_stop(doc):
    doc = doc.lower()
    tokens = word_tokenize(doc)
    return [token for token in tokens if token not in custom_stop_words and token.isalpha()]

def stem_only(doc):
    doc = doc.lower()
    tokens = word_tokenize(doc)
    return [stemmer.stem(token) for token in tokens if token not in related_words and token.isalpha()]

def stem_stop(doc):
    doc = doc.lower()
    tokens = word_tokenize(doc)
    return [stemmer.stem(token) for token in tokens if token not in custom_stop_words and token.isalpha()]

def lem_only(doc):
    doc = doc.lower()
    tokens = word_tokenize(doc)
    return [lemmatizer.lemmatize(token) for token in tokens if token not in related_words and token.isalpha()]

def lem_stop(doc):
    doc = doc.lower()
    tokens = word_tokenize(doc)
    return [lemmatizer.lemmatize(token) for token in tokens if token not in custom_stop_words and token.isalpha()]
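
The six tokenizers differ only in the normalization step and the stop-word set, so they could also be generated from one function. The sketch below shows that idea under stated assumptions. `tokenize`, `identity`, `crude_stem`, and `TOY_STOP_WORDS` are hypothetical names, a regular expression stands in for `word_tokenize` plus the `isalpha()` filter (the two do not split text identically), and `crude_stem` is a toy placeholder for `PorterStemmer.stem`. `functools.partial` is applied to a module-level function so the resulting tokenizers can still be pickled.

```python
import re
from functools import partial

RELATED_WORDS = frozenset({"pepsi", "pepsico", "coca", "cola", "coke"})
# Toy stop-word list; the notebook uses NLTK's English list instead
TOY_STOP_WORDS = RELATED_WORDS | {"the", "a", "is", "of"}

def identity(token):
    """Leave the token unchanged (the 'lower only' case)."""
    return token

def crude_stem(token):
    """Toy stand-in for a real stemmer: strip one trailing 's'."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

def tokenize(doc, normalize, stop_words):
    """Lowercase, keep alphabetic runs, drop stop words, then normalize."""
    tokens = re.findall(r"[a-z]+", doc.lower())
    return [normalize(token) for token in tokens if token not in stop_words]

# Each variant is the same function with different fixed arguments
lower_only_sketch = partial(tokenize, normalize=identity, stop_words=RELATED_WORDS)
stem_stop_sketch = partial(tokenize, normalize=crude_stem, stop_words=TOY_STOP_WORDS)

sample = "The Pepsi cans, 2 of them!"
print(lower_only_sketch(sample))
print(stem_stop_sketch(sample))
```

With the real NLTK objects, `normalize` would be `stemmer.stem` or `lemmatizer.lemmatize`, and `stop_words` would be `related_words` or `custom_stop_words`.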

In [13]:
# Custom metrics used in GridSearch results
scorers = {
    'accuracy': make_scorer(accuracy_score)
    , 'recall': make_scorer(recall_score, average = 'binary')
    , 'precision': make_scorer(precision_score, average = 'binary')
    , 'f1': make_scorer(f1_score, average = 'binary')
}
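
As a quick reference for how these four metrics relate, here is a small pure-Python sketch. `precision_recall_f1` and `f1_from` are hypothetical helper names and the confusion counts are invented. The positive class is `is_pepsi = 1`, as implied by `average = 'binary'`. The last lines recompute F1 from the rounded precision and recall that the training log reports for its first fold (0.636 and 0.819), so that result is only approximate.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = f1_from(precision, recall)
    return precision, recall, f1

def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Invented counts: 8 true positives, 2 false positives, 4 false negatives
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")

# Sanity check against the rounded values from the first fold in the log
log_f1 = f1_from(0.636, 0.819)
print(f"f1 from logged precision/recall: {log_f1:.3f}")
```

The recomputed value rounds to 0.716, which agrees with the F1 reported for that fold.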

In [14]:
# GridSearch parameters
# The 'vectorizer__' prefix routes a parameter to the pipeline step named 'vectorizer'.
# Because a custom tokenizer is supplied, 'token_pattern' is unused;
# setting it to None avoids the warning scikit-learn would otherwise raise.
params = {
    'vectorizer': [CountVectorizer(token_pattern = None)
                   , TfidfVectorizer(token_pattern = None)
                  ] 
    , 'vectorizer__tokenizer': [lower_only
                                , lower_stop
                                , stem_only
                                , stem_stop
                                , lem_only
                                , lem_stop
                               ] 
    , 'vectorizer__max_features': [3000
                                   , 5000
                                   , None
                                  ]
    , 'vectorizer__ngram_range': [(1, 1)                  # unigrams
                                  , (1, 2)                # unigrams and bigrams
                                 ]
    , 'vectorizer__min_df': [2, 3]
    , 'vectorizer__max_df': [0.8, 0.9]
    , 'classifier': [#MultinomialNB()
                     #, LogisticRegression()
                     #, KNeighborsClassifier()
                     #, DecisionTreeClassifier()
                     #, BaggingClassifier()
                     #, RandomForestClassifier()
                     #, AdaBoostClassifier(algorithm = 'SAMME')
                      GradientBoostingClassifier()
                     #, SVC()
                     #, XGBClassifier()
                    ]
}
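
The size of this grid is worth checking before starting a long run. The sketch below enumerates the same option lists with `itertools.product`, using plain strings as stand-ins for the estimator and tokenizer objects. It is an illustration of the arithmetic only, not how `GridSearchCV` builds its candidates internally.

```python
from itertools import product

# Same option lists as the parameter grid, with strings as placeholders
grid = {
    "vectorizer": ["CountVectorizer", "TfidfVectorizer"],
    "tokenizer": ["lower_only", "lower_stop", "stem_only",
                  "stem_stop", "lem_only", "lem_stop"],
    "max_features": [3000, 5000, None],
    "ngram_range": [(1, 1), (1, 2)],
    "min_df": [2, 3],
    "max_df": [0.8, 0.9],
    "classifier": ["GradientBoostingClassifier"],
}

# Every combination of one value per parameter is one candidate
candidates = list(product(*grid.values()))
n_candidates = len(candidates)

cv_folds = 5  # matches cv = 5 in the GridSearchCV call
total_fits = n_candidates * cv_folds

print(n_candidates, total_fits)  # prints: 288 1440
```

These numbers match the training log ("Fitting 5 folds for each of 288 candidates, totalling 1440 fits"). Uncommenting a second classifier would double both.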

In [15]:
# Create Pipeline
pipeline = Pipeline([
    ('vectorizer', CountVectorizer())                    # Will replace in GridSearch  
    , ('classifier', MultinomialNB())                    # Will replace in GridSearch 
])

In [16]:
# GridSearchCV
grid_search = GridSearchCV(pipeline
                           , param_grid=params
                           , cv = 5
                           , verbose = 3
                           , scoring = scorers
                           , refit = 'f1'
                           , return_train_score = True
)

In [17]:
# Fit the grid search

# Record start time
start_time = time.time()

with parallel_backend('threading', n_jobs=-1):
    grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 288 candidates, totalling 1440 fits
[CV 1/5] END classifier=GradientBoostingClassifier(), vectorizer=CountVectorizer(token_pattern=None), vectorizer__max_df=0.8, vectorizer__max_features=3000, vectorizer__min_df=2, vectorizer__ngram_range=(1, 1), vectorizer__tokenizer=<function lower_stop at 0x00000236FE7A89A0>; accuracy: (train=0.819, test=0.669) f1: (train=0.842, test=0.716) precision: (train=0.759, test=0.636) recall: (train=0.945, test=0.819) total time=   8.2s
[CV 3/5] END classifier=GradientBoostingClassifier(), vectorizer=CountVectorizer(token_pattern=None), vectorizer__max_df=0.8, vectorizer__max_features=3000, vectorizer__min_df=2, vectorizer__ngram_range=(1, 1), vectorizer__tokenizer=<function lower_stop at 0x00000236FE7A89A0>; accuracy: (train=0.812, test=0.665) f1: (train=0.836, test=0.715) precision: (train=0.751, test=0.629) recall: (train=0.943, test=0.830) total time=   9.0s
[CV 2/5] END classifier=GradientBoostingClassifier(), vectorizer=Cou

3 fits failed out of a total of 1440.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
3 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\Home\anaconda3\Lib\site-packages\sklearn\model_selection\_validation.py", line 895, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\Home\anaconda3\Lib\site-packages\sklearn\base.py", line 1474, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Home\anaconda3\Lib\site-packages\sklearn\pipeline.py", line 471, in fit
    Xt = self._fit(X, y, routed_params)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Home\anaconda3\Lib\site-packages\sklearn\

In [18]:
# Record end time
end_time = time.time()

# Calculate elapsed time
elapsed_time = end_time - start_time
print(f"Time taken: {elapsed_time:.4f} seconds")

# Earlier run with XGBClassifier: 4522.0351 seconds

Time taken: 4437.6006 seconds


In [19]:
# Collect results
cv_results = pd.DataFrame(grid_search.cv_results_)
cv_results.to_csv('../data/cv_results_gb.csv') 
# Display the full results table
cv_results

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_classifier | param_vectorizer | param_vectorizer__max_df | param_vectorizer__max_features | param_vectorizer__min_df | param_vectorizer__ngram_range | ... | mean_test_f1 | std_test_f1 | rank_test_f1 | split0_train_f1 | split1_train_f1 | split2_train_f1 | split3_train_f1 | split4_train_f1 | mean_train_f1 | std_train_f1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11.951224 | 0.910323 | 2.033591 | 0.862367 | GradientBoostingClassifier() | CountVectorizer(token_pattern=None) | 0.8 | 3000 | 2 | (1, 1) | ... | 0.735552 | 0.023590 | 58 | 0.843056 | 0.830269 | 0.832523 | 0.825018 | 0.837631 | 0.833699 | 0.006189 |
| 1 | 11.831407 | 4.265018 | 0.579388 | 0.276554 | GradientBoostingClassifier() | CountVectorizer(token_pattern=None) | 0.8 | 3000 | 2 | (1, 1) | ... | 0.736700 | 0.019390 | 45 | 0.841737 | 0.829713 | 0.836465 | 0.824843 | 0.829030 | 0.832357 | 0.005990 |
| 2 | 18.754280 | 1.204162 | 2.971510 | 0.585754 | GradientBoostingClassifier() | CountVectorizer(token_pattern=None) | 0.8 | 3000 | 2 | (1, 1) | ... | 0.735884 | 0.023719 | 54 | 0.837145 | 0.832509 | 0.829474 | 0.827343 | 0.837307 | 0.832756 | 0.004003 |
| 3 | 24.379055 | 7.465999 | 0.944496 | 0.567476 | GradientBoostingClassifier() | CountVectorizer(token_pattern=None) | 0.8 | 3000 | 2 | (1, 1) | ... | 0.739829 | 0.028917 | 13 | 0.828411 | 0.843728 | 0.825996 | 0.821201 | 0.823695 | 0.828606 | 0.007931 |
| 4 | 13.130297 | 7.015924 | 0.471777 | 0.580527 | GradientBoostingClassifier() | CountVectorizer(token_pattern=None) | 0.8 | 3000 | 2 | (1, 1) | ... | NaN | NaN | 288 | 0.837370 | NaN | NaN | NaN | 0.836692 | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 283 | 13.178176 | 6.216785 | 0.524394 | 0.471759 | GradientBoostingClassifier() | TfidfVectorizer(token_pattern=None) | 0.9 | None | 3 | (1, 2) | ... | 0.707965 | 0.019803 | 190 | 0.868140 | 0.868876 | 0.858965 | 0.861051 | 0.865018 | 0.864410 | 0.003877 |
| 284 | 26.129953 | 1.747220 | 3.614737 | 1.476485 | GradientBoostingClassifier() | TfidfVectorizer(token_pattern=None) | 0.9 | None | 3 | (1, 2) | ... | 0.706111 | 0.004610 | 202 | 0.880935 | 0.885965 | 0.886792 | 0.876274 | 0.881406 | 0.882274 | 0.003810 |
| 285 | 19.888088 | 10.872776 | 1.510359 | 2.091426 | GradientBoostingClassifier() | TfidfVectorizer(token_pattern=None) | 0.9 | None | 3 | (1, 2) | ... | 0.714203 | 0.014387 | 150 | 0.853087 | 0.867787 | 0.853296 | 0.861091 | 0.860246 | 0.859101 | 0.005488 |
| 286 | 21.877410 | 1.394613 | 1.628071 | 0.668358 | GradientBoostingClassifier() | TfidfVectorizer(token_pattern=None) | 0.9 | None | 3 | (1, 2) | ... | 0.697257 | 0.015942 | 246 | 0.876990 | 0.889868 | 0.888727 | 0.881798 | 0.882396 | 0.883956 | 0.004761 |
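
A results table like this is usually read for three things: which candidates failed, which scored best on the validation folds, and how far the train score sits above the test score. The pure-Python sketch below does this for the nine rows displayed above only, copying their `mean_test_f1` and `mean_train_f1` values by hand. The full table has 288 rows, so its conclusions apply to this excerpt and not to the whole search.

```python
import math

NAN = float("nan")

# (candidate index, mean_test_f1, mean_train_f1) copied from the displayed rows
displayed = [
    (0, 0.735552, 0.833699),
    (1, 0.736700, 0.832357),
    (2, 0.735884, 0.832756),
    (3, 0.739829, 0.828606),
    (4, NAN, NAN),  # at least one fold failed, so the mean is NaN
    (283, 0.707965, 0.864410),
    (284, 0.706111, 0.882274),
    (285, 0.714203, 0.859101),
    (286, 0.697257, 0.883956),
]

# Candidates whose mean score is NaN had at least one failed fit
failed = [idx for idx, test_f1, _ in displayed if math.isnan(test_f1)]
valid = [row for row in displayed if not math.isnan(row[1])]

# Best validation F1 among the displayed rows
best_idx, best_test_f1, _ = max(valid, key=lambda row: row[1])

# Train-test gap: a larger gap suggests more overfitting
gaps = sorted(((train_f1 - test_f1, idx) for idx, test_f1, train_f1 in valid),
              reverse=True)
largest_gap, largest_gap_idx = gaps[0]

print("failed candidates:", failed)
print(f"best displayed candidate: {best_idx} (mean_test_f1={best_test_f1:.6f})")
print(f"largest train-test gap: candidate {largest_gap_idx} ({largest_gap:.6f})")
```

Within this excerpt, the TF-IDF rows with bigrams and no feature cap reach a higher train F1 but a lower test F1 than the CountVectorizer rows, which is the kind of pattern the gap calculation is meant to surface.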
