### Building Machine Learning Classifiers: Evaluate Random Forest with GridSearchCV

**Grid-search:** Exhaustively search all parameter combinations in a given grid to determine the best model.
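
To make "all parameter combinations" concrete, here is a minimal sketch that expands the same grid searched later in this notebook, using scikit-learn's `ParameterGrid`. The variable name `combos` is only for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as the one searched later in this notebook.
param = {'n_estimators': [10, 150, 300],
         'max_depth': [30, 60, 90, None]}

# ParameterGrid expands the dictionary into every combination of values.
combos = list(ParameterGrid(param))
print(len(combos))          # 3 values of n_estimators x 4 values of max_depth
for combo in combos[:3]:
    print(combo)
```

Each of these combinations is one candidate model; the cost of a grid search grows with the product of the list lengths.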

**Cross-validation:** Divide a dataset into k subsets (folds) and repeat the holdout method k times, using a different fold as the holdout set in each iteration and averaging the k scores.
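
The splitting step can be sketched on toy data as follows. This assumes plain `KFold` on ten made-up samples; when `cv` is an integer and the model is a classifier, scikit-learn uses a stratified variant that also balances the class labels across folds.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)   # ten toy samples, not the SMS data
kf = KFold(n_splits=5)

holdout_sets = []
for i, (train_idx, test_idx) in enumerate(kf.split(X)):
    # Each iteration trains on four folds and holds out the fifth.
    print(f"fold {i}: train={train_idx.tolist()} holdout={test_idx.tolist()}")
    holdout_sets.append(test_idx.tolist())
```

Every sample lands in the holdout set exactly once across the five iterations.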

In [1]:
import nltk
import pandas as pd
import re
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import string


In [2]:

# The stopword list must be available locally; run nltk.download('stopwords') once if it is not.
stopwords = nltk.corpus.stopwords.words('english')
ps = nltk.PorterStemmer()

In [3]:
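# Note: read_csv treats the first line of the file as a header by default.
# If the file has no header row, pass header=None so the first message is kept.
data = pd.read_csv("SMSSpamCollection.tsv", sep='\t')
data.columns = ['label', 'body_text']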

In [4]:
def count_punct(text):
    # Percentage of non-space characters that are punctuation.
    count = sum([1 for char in text if char in string.punctuation])
    return round(count/(len(text) - text.count(" ")), 3)*100
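
As a worked example of the arithmetic, the sketch below applies the same function to a made-up message (the string `sample` is hypothetical, not taken from the dataset).

```python
import string

def count_punct(text):
    # Percentage of non-space characters that are punctuation.
    count = sum([1 for char in text if char in string.punctuation])
    return round(count/(len(text) - text.count(" ")), 3)*100

sample = "Hello, world!"                      # hypothetical message
non_space = len(sample) - sample.count(" ")   # 13 characters minus 1 space
pct = count_punct(sample)                     # 2 punctuation marks out of 12
print(non_space, pct)                         # pct is roughly 16.7
```

A message that is empty or contains only spaces would make the denominator zero and raise `ZeroDivisionError`, so it may be worth guarding against that case on other datasets.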

In [5]:
data['body_len'] = data['body_text'].apply(lambda x: len(x) - x.count(" "))
data['punct%'] = data['body_text'].apply(lambda x: count_punct(x))

In [6]:
def clean_text(text):
    # Iterating over a string yields characters: drop punctuation and lowercase the rest.
    text = "".join([char.lower() for char in text if char not in string.punctuation])
    tokens = re.split(r'\W+', text)  # raw string avoids an invalid-escape warning
    text = [ps.stem(word) for word in tokens if word not in stopwords]
    return text

# clean_text removes punctuation, lowercases the text, splits it into tokens, drops stopwords, and stems the remaining tokens with the Porter stemmer.
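
The individual steps can be traced on a single sentence. This is a simplified sketch: it swaps nltk's stopword list for a tiny stand-in set (`toy_stopwords`, an assumption) and leaves out the stemming step so that it runs with the standard library alone.

```python
import re
import string

# Stand-in for nltk's English stopword list (assumption for this sketch).
toy_stopwords = {"i", "am", "the", "to", "a"}

text = "I am going to the store, now!"   # hypothetical message

# Step 1: drop punctuation characters and lowercase the rest.
no_punct = "".join([ch.lower() for ch in text if ch not in string.punctuation])
# Step 2: split on runs of non-word characters.
tokens = re.split(r'\W+', no_punct)
# Step 3: remove stopwords (stemming with ps.stem would follow here).
kept = [tok for tok in tokens if tok not in toy_stopwords]

print(no_punct)
print(tokens)
print(kept)
```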

In [7]:
# TF-IDF
tfidf_vect = TfidfVectorizer(analyzer=clean_text)
X_tfidf = tfidf_vect.fit_transform(data['body_text'])
X_tfidf_feat = pd.concat([data['body_len'], data['punct%'], pd.DataFrame(X_tfidf.toarray())], axis=1)

# The text is vectorized with TfidfVectorizer, using clean_text as the analyzer. The resulting sparse TF-IDF matrix is converted to a dense DataFrame and concatenated with the 'body_len' and 'punct%' features to form X_tfidf_feat.
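
The shape of the resulting feature matrix is easier to see on a tiny corpus. The sketch below assumes a three-message stand-in for `data['body_text']` and uses the vectorizer's default analyzer rather than `clean_text`; the names `toy`, `vect`, and `features` are illustrative only.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical three-message corpus standing in for data['body_text'].
toy = pd.DataFrame({'body_text': ["free prize now", "see you at lunch", "free lunch"]})
toy['body_len'] = toy['body_text'].apply(lambda x: len(x) - x.count(" "))

# Default analyzer here; the notebook passes analyzer=clean_text instead.
vect = TfidfVectorizer()
X = vect.fit_transform(toy['body_text'])   # sparse matrix, one row per message

# Densify and place the engineered feature next to the TF-IDF columns.
features = pd.concat([toy['body_len'], pd.DataFrame(X.toarray())], axis=1)
features.columns = features.columns.astype(str)

print(X.shape)          # one column per distinct term
print(features.shape)   # one extra column for body_len
```

Calling `toarray()` is fine at this scale, but on a large vocabulary the dense matrix can use a lot of memory.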

In [11]:
# CountVectorizer
count_vect = CountVectorizer(analyzer=clean_text)
X_count = count_vect.fit_transform(data['body_text'])
X_count_feat = pd.concat([data['body_len'], data['punct%'], pd.DataFrame(X_count.toarray())], axis=1)

X_count_feat.head()

# CountVectorizer is applied the same way, with clean_text as the analyzer. The resulting count matrix is concatenated with 'body_len' and 'punct%' to form X_count_feat.

index,body_len,punct%,0,1,2,3,4,5,6,7,...,8094,8095,8096,8097,8098,8099,8100,8101,8102,8103
0,128,4.7,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,49,4.1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,62,3.2,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,28,7.1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,135,4.4,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [9]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV


GridSearchCV (grid search with cross-validation) is scikit-learn's tool for hyperparameter tuning. It exhaustively evaluates every combination in a predefined grid of hyperparameter values, scores each combination with cross-validation, and reports the combination with the best mean score.

In [13]:
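# scikit-learn expects feature names to share one type; the concatenated frames mix
# string names ('body_len', 'punct%') with integer names, so cast them all to str.
X_count_feat.columns = X_count_feat.columns.astype(str)
X_tfidf_feat.columns = X_tfidf_feat.columns.astype(str)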


In [16]:
# for TfidfVectorizer

rf = RandomForestClassifier()


# First, you specify the machine learning model you want to tune (e.g., RandomForestClassifier) and create a dictionary where the keys are hyperparameter names, and the values are lists of hyperparameter values to explore.
param = {'n_estimators': [10, 150, 300],
         'max_depth': [30, 60, 90, None]}


# You create a GridSearchCV object by passing the model, the hyperparameter grid, and optional settings such as the number of cross-validation folds (cv), the scoring metric (scoring), and the number of parallel jobs (n_jobs=-1 uses all available cores).
gs = GridSearchCV(rf, param, cv=5, n_jobs=-1)


# You fit the GridSearchCV object to the data (here the full feature matrix and labels). GridSearchCV tries every combination in the grid; for each one it trains and scores the model with 5-fold cross-validation. Because no scoring argument is passed, it uses the classifier's default score, which is accuracy.
gs_fit = gs.fit(X_tfidf_feat, data['label'])


pd.DataFrame(gs_fit.cv_results_).sort_values('mean_test_score', ascending=False)[0:5] # printing the top 5 models


# This code initializes a random forest classifier and specifies hyperparameters for tuning (n_estimators and max_depth). Grid search with cross-validation (5-fold) is performed using GridSearchCV.

index,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_max_depth,param_n_estimators,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
7,14.86339,0.820195,0.235477,0.035604,90.0,150,"{'max_depth': 90, 'n_estimators': 150}",0.980251,0.978456,0.973944,0.967655,0.974843,0.97503,0.00435,1
8,30.626752,0.63426,0.376012,0.024613,90.0,300,"{'max_depth': 90, 'n_estimators': 300}",0.977558,0.979354,0.973944,0.967655,0.972147,0.974132,0.004121,2
10,17.380368,2.417373,0.255923,0.063604,,150,"{'max_depth': None, 'n_estimators': 150}",0.976661,0.975763,0.975741,0.967655,0.971249,0.973414,0.003445,3
5,22.96787,0.275618,0.271146,0.012495,60.0,300,"{'max_depth': 60, 'n_estimators': 300}",0.977558,0.973968,0.974843,0.965858,0.97035,0.972515,0.004049,4
11,25.315645,1.296192,0.190464,0.002023,,300,"{'max_depth': None, 'n_estimators': 300}",0.977558,0.974865,0.973944,0.965858,0.969452,0.972336,0.00416,5
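
What GridSearchCV does internally can be approximated with an explicit loop. The sketch below is an illustration under stated assumptions, not the library's implementation: it uses synthetic data from `make_classification` in place of `X_tfidf_feat`, a deliberately small grid, and a fixed `random_state` so that the manual loop and GridSearchCV are comparable.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, ParameterGrid, cross_val_score

# Synthetic data standing in for X_tfidf_feat and data['label'].
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# A deliberately small grid so the sketch runs in seconds.
param = {'n_estimators': [5, 20], 'max_depth': [3, None]}

# Manual version: one cross-validated mean score per combination.
manual = []
for combo in ParameterGrid(param):
    rf = RandomForestClassifier(random_state=0, **combo)
    scores = cross_val_score(rf, X, y, cv=5)   # default scoring: accuracy
    manual.append((scores.mean(), combo))
best_manual_score = max(score for score, _ in manual)

# GridSearchCV does the same bookkeeping and also refits the best model.
gs = GridSearchCV(RandomForestClassifier(random_state=0), param, cv=5)
gs.fit(X, y)
print(best_manual_score, gs.best_score_, gs.best_params_)
```

After fitting, `best_params_` and `best_score_` give the winning combination directly, which is often more convenient than sorting `cv_results_` by hand.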


In [15]:
# for CountVectorizer

rf = RandomForestClassifier()
param = {'n_estimators': [10, 150, 300],
         'max_depth': [30, 60, 90, None]}

gs = GridSearchCV(rf, param, cv=5, n_jobs=-1)
gs_fit = gs.fit(X_count_feat, data['label'])
pd.DataFrame(gs_fit.cv_results_).sort_values('mean_test_score', ascending=False)[0:5]

index,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_max_depth,param_n_estimators,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
8,31.202008,0.339279,0.420038,0.038618,90.0,300,"{'max_depth': 90, 'n_estimators': 300}",0.977558,0.974865,0.972147,0.97035,0.971249,0.973234,0.002638,1
11,25.551421,1.414573,0.197908,0.004267,,300,"{'max_depth': None, 'n_estimators': 300}",0.976661,0.973968,0.973944,0.967655,0.969452,0.972336,0.003292,2
7,14.223841,0.623316,0.231218,0.012547,90.0,150,"{'max_depth': 90, 'n_estimators': 150}",0.974865,0.973968,0.973944,0.969452,0.968553,0.972157,0.002612,3
10,16.907369,1.217214,0.277853,0.062739,,150,"{'max_depth': None, 'n_estimators': 150}",0.976661,0.97307,0.972147,0.968553,0.965858,0.971258,0.003735,4
5,21.554027,0.356724,0.285362,0.01891,60.0,300,"{'max_depth': 60, 'n_estimators': 300}",0.975763,0.972172,0.971249,0.966757,0.969452,0.971079,0.002983,5


In [None]:

# Hyperparameter tuning is the process of finding the optimal hyperparameters for a machine learning model. 

# Hyperparameters are configuration settings external to the model that cannot be directly estimated from the data. They govern the learning process of the model and have a significant impact on its performance and generalization ability.

# Number of Estimators (n_estimators): In ensemble methods like random forests or gradient boosting, this hyperparameter specifies the number of base learners (trees) in the ensemble.

# Maximum Depth of Trees (max_depth): For tree-based models, such as decision trees, random forests, and gradient boosting, this hyperparameter limits the maximum depth of each tree in the ensemble, thereby controlling the complexity of the model.

# Learning Rate: In boosting algorithms such as gradient boosting (e.g., XGBoost) and AdaBoost, the learning rate scales the contribution of each base learner to the final ensemble, which affects both the speed of convergence and the number of learners needed.

# Regularization Parameters: Hyperparameters like regularization strength (e.g., alpha in Lasso and Ridge regression) control the balance between fitting the training data and preventing overfitting by penalizing large coefficients.
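
The difference between hyperparameters and what the model learns can be seen directly on a random forest. This is a minimal sketch on synthetic data (an assumption; the values 25 and 4 are arbitrary): `n_estimators` and `max_depth` are fixed before training, while the trees themselves only exist after `fit`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only to have something to fit.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Hyperparameters are chosen before training ...
rf = RandomForestClassifier(n_estimators=25, max_depth=4, random_state=0)
print(rf.get_params()['n_estimators'], rf.get_params()['max_depth'])

# ... while the trees themselves are learned from the data during fit.
rf.fit(X, y)
n_trees = len(rf.estimators_)
deepest = max(tree.get_depth() for tree in rf.estimators_)
print(n_trees, deepest)   # number of fitted trees and the depth of the deepest one
```

The fitted forest contains exactly `n_estimators` trees, and no tree grows deeper than `max_depth`.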

