### Building Machine Learning Classifiers: Explore Gradient Boosting model with grid-search

**Grid-search:** Exhaustively search all parameter combinations in a given grid to determine the best model.
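To make "all parameter combinations" concrete, the sketch below enumerates a grid with `itertools.product`. The grid values mirror the ones tried later in this notebook; the `param_grid` dictionary and the `iter_grid` helper are illustrative names, not part of scikit-learn.

```python
from itertools import product

# Illustrative grid, using the same values tried later in this notebook
param_grid = {
    "n_estimators": [50, 100, 150],
    "max_depth": [3, 7, 11, 15],
    "learning_rate": [0.01, 0.1, 1],
}

def iter_grid(grid):
    """Yield one dict per combination of the values in `grid`."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))

combos = list(iter_grid(param_grid))
print(len(combos))  # 3 * 4 * 3 = 36 candidate models
print(combos[0])    # {'n_estimators': 50, 'max_depth': 3, 'learning_rate': 0.01}
```

Every added value multiplies the number of models to fit, which is why exhaustive search becomes expensive quickly.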

### Read in & clean text

In [1]:
import nltk
import pandas as pd
import re
from sklearn.feature_extraction.text import TfidfVectorizer
import string

stopwords = nltk.corpus.stopwords.words('english')
ps = nltk.PorterStemmer()

data = pd.read_csv("SMSSpamCollection.tsv", sep='\t')
data.columns = ['label', 'body_text']

def count_punct(text):
    count = sum([1 for char in text if char in string.punctuation])
    return round(count/(len(text) - text.count(" ")), 3)*100

data['body_len'] = data['body_text'].apply(lambda x: len(x) - x.count(" "))
data['punct%'] = data['body_text'].apply(lambda x: count_punct(x))

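def clean_text(text):
    # Lowercase the text and drop punctuation, character by character
    text = "".join([char.lower() for char in text if char not in string.punctuation])
    # Raw string so that \W is not treated as an invalid escape sequence
    tokens = re.split(r'\W+', text)
    # Remove stopwords, then stem what is left
    text = [ps.stem(word) for word in tokens if word not in stopwords]
    return text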

tfidf_vect = TfidfVectorizer(analyzer=clean_text)
X_tfidf = tfidf_vect.fit_transform(data['body_text'])

X_features = pd.concat([data['body_len'], data['punct%'], pd.DataFrame(X_tfidf.toarray())], axis=1)
X_features.head()

Unnamed: 0,body_len,punct%,0,1,2,3,4,5,6,7,...,8094,8095,8096,8097,8098,8099,8100,8101,8102,8103
0,128,4.7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,49,4.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,62,3.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,28,7.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,135,4.4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
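The first two columns are the hand-built features: message length without spaces, and the percentage of those characters that are punctuation. A minimal standalone check of the same arithmetic on a made-up message (the helper names `body_len` and `punct_pct` and the sample text are illustrative, not from the dataset):

```python
import string

def body_len(text):
    """Number of characters in `text`, not counting spaces."""
    return len(text) - text.count(" ")

def punct_pct(text):
    """Percentage of non-space characters that are punctuation."""
    count = sum(1 for char in text if char in string.punctuation)
    return round(count / body_len(text), 3) * 100

sample = "Hello, world!!"  # made-up message, not from the dataset
print(body_len(sample))    # 14 characters minus 1 space
print(punct_pct(sample))   # 3 punctuation marks out of 13 characters, about 23.1
```

Note that both helpers divide by the non-space length, so a message consisting only of spaces would raise a `ZeroDivisionError`.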


### Explore GradientBoostingClassifier Attributes & Hyperparameters

In [2]:
from sklearn.ensemble import GradientBoostingClassifier

In [3]:
print(dir(GradientBoostingClassifier))
print(GradientBoostingClassifier().get_params()) # .get_params() method to show hyperparameters

# Note: There is no "n_jobs" hyperparameter because boosting fits its trees sequentially
# (each tree depends on the ones before it), so the trees cannot be built in parallel

['__abstractmethods__', '__annotations__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__sklearn_clone__', '__str__', '__subclasshook__', '__weakref__', '_abc_impl', '_build_request_for_signature', '_check_feature_names', '_check_initialized', '_check_n_features', '_clear_state', '_compute_partial_dependence_recursion', '_doc_link_module', '_doc_link_template', '_doc_link_url_param_generator', '_encode_y', '_estimator_type', '_fit_stage', '_fit_stages', '_get_default_requests', '_get_doc_link', '_get_loss', '_get_metadata_request', '_get_param_names', '_get_tags', '_init_state', '_is_fitted', '_make_estimator', '_more_tags', '_parameter_constraints', '_raw_predict', 

### Build our own Grid-search

In [4]:
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.model_selection import train_test_split

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X_features, data['label'], test_size=0.2)

Gradient Boosting is a popular machine learning technique for classification and regression. It is an ensemble method: it combines multiple weak learners (models that are only slightly better than random guessing) into a single strong learner. Specifically, Gradient Boosting builds an additive model in a forward stage-wise manner, so each new model is added to correct the errors made by the models already in the ensemble.

In [None]:
# The columns mix string names ('body_len', 'punct%') with integer names (the TF-IDF columns).
# scikit-learn expects feature names to be all strings, so convert them
X_features.columns = X_features.columns.astype(str)
X_train.columns = X_train.columns.astype(str)
X_test.columns = X_test.columns.astype(str)


def train_GB(est, max_depth, lr):
    """Fit one Gradient Boosting model and print its test-set precision, recall, and accuracy."""
    gb = GradientBoostingClassifier(n_estimators=est, max_depth=max_depth, learning_rate=lr)
    gb_model = gb.fit(X_train, y_train)
    y_pred = gb_model.predict(X_test)
    # F-score and support are returned as well but are not reported here
    precision, recall, _, _ = score(y_test, y_pred, pos_label='spam', average='binary')
    accuracy = (y_pred == y_test).sum() / len(y_pred)
    print('Est: {} / Depth: {} / LR: {} ---- Precision: {} / Recall: {} / Accuracy: {}'.format(
        est, max_depth, lr, round(precision, 3), round(recall, 3), round(accuracy, 3)))
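To make the three numbers reported by `train_GB` concrete, here is a small pure-Python sketch of the same metrics on made-up labels. The `binary_metrics` helper is illustrative and is not scikit-learn's implementation. It assumes precision is reported as 0.0 when nothing is predicted as spam.

```python
def binary_metrics(y_true, y_pred, pos_label="spam"):
    """Return (precision, recall, accuracy) for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)
    correct = sum(1 for t, p in pairs if t == p)
    # Precision is undefined when nothing is predicted positive; report 0.0 then
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = correct / len(pairs)
    return precision, recall, accuracy

# Toy labels (made up): 8 ham messages and 2 spam messages
y_true = ["ham"] * 8 + ["spam"] * 2
all_ham = ["ham"] * 10  # a model that never predicts spam
print(binary_metrics(y_true, all_ham))  # (0.0, 0.0, 0.8)
```

A model that never predicts spam gets zero precision and recall, yet its accuracy equals the share of ham in the test set. This is why accuracy alone is a weak guide on an imbalanced dataset.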

In [12]:
for n_est in [50, 100, 150]:
    for max_depth in [3, 7, 11, 15]:
        for lr in [0.01, 0.1, 1]:
            train_GB(n_est, max_depth, lr)

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


Est: 50 / Depth: 3 / LR: 0.01 ---- Precision: 0.0 / Recall: 0.0 / Accuracy: 0.862
Est: 50 / Depth: 3 / LR: 0.1 ---- Precision: 0.945 / Recall: 0.675 / Accuracy: 0.95
Est: 50 / Depth: 3 / LR: 1 ---- Precision: 0.939 / Recall: 0.799 / Accuracy: 0.965


Est: 50 / Depth: 7 / LR: 0.01 ---- Precision: 0.0 / Recall: 0.0 / Accuracy: 0.862
Est: 50 / Depth: 7 / LR: 0.1 ---- Precision: 0.922 / Recall: 0.773 / Accuracy: 0.96
Est: 50 / Depth: 7 / LR: 1 ---- Precision: 0.918 / Recall: 0.799 / Accuracy: 0.962


Est: 50 / Depth: 11 / LR: 0.01 ---- Precision: 0.0 / Recall: 0.0 / Accuracy: 0.862
Est: 50 / Depth: 11 / LR: 0.1 ---- Precision: 0.928 / Recall: 0.753 / Accuracy: 0.958
Est: 50 / Depth: 11 / LR: 1 ---- Precision: 0.912 / Recall: 0.812 / Accuracy: 0.963


Est: 50 / Depth: 15 / LR: 0.01 ---- Precision: 0.0 / Recall: 0.0 / Accuracy: 0.862
Est: 50 / Depth: 15 / LR: 0.1 ---- Precision: 0.924 / Recall: 0.792 / Accuracy: 0.962
Est: 50 / Depth: 15 / LR: 1 ---- Precision: 0.883 / Recall: 0.831 / Accuracy: 0.961
Est: 100 / Depth: 3 / LR: 0.01 ---- Precision: 0.947 / Recall: 0.461 / Accuracy: 0.922
Est: 100 / Depth: 3 / LR: 0.1 ---- Precision: 0.959 / Recall: 0.766 / Accuracy: 0.963
Est: 100 / Depth: 3 / LR: 1 ---- Precision: 0.915 / Recall: 0.766 / Accuracy: 0.958
Est: 100 / Depth: 7 / LR: 0.01 ---- Precision: 0.939 / Recall: 0.604 / Accuracy: 0.94
Est: 100 / Depth: 7 / LR: 0.1 ---- Precision: 0.926 / Recall: 0.818 / Accuracy: 0.966
Est: 100 / Depth: 7 / LR: 1 ---- Precision: 0.899 / Recall: 0.812 / Accuracy: 0.961
Est: 100 / Depth: 11 / LR: 0.01 ---- Precision: 0.923 / Recall: 0.701 / Accuracy: 0.951
Est: 100 / Depth: 11 / LR: 0.1 ---- Precision: 0.933 / Recall: 0.818 / Accuracy: 0.967
Est: 100 / Depth: 11 / LR: 1 ---- Precision: 0.903 / Recall

Note: This Gradient Boosting grid search took around 1 hour and 30 minutes to run.
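Comparing printed lines by eye gets tedious as the grid grows. One option is to collect each result as a record and rank the records. The sketch below uses the complete `Est: 100` rows copied from the output above; the `Result` tuple is an illustrative structure, and it assumes a variant of `train_GB` that returns its metrics instead of printing them.

```python
from collections import namedtuple

Result = namedtuple("Result", "est depth lr precision recall accuracy")

# Rows copied from the printed output above (complete Est: 100 rows only)
results = [
    Result(100, 3, 0.01, 0.947, 0.461, 0.922),
    Result(100, 3, 0.1, 0.959, 0.766, 0.963),
    Result(100, 3, 1, 0.915, 0.766, 0.958),
    Result(100, 7, 0.01, 0.939, 0.604, 0.94),
    Result(100, 7, 0.1, 0.926, 0.818, 0.966),
    Result(100, 7, 1, 0.899, 0.812, 0.961),
    Result(100, 11, 0.01, 0.923, 0.701, 0.951),
    Result(100, 11, 0.1, 0.933, 0.818, 0.967),
]

best_accuracy = max(results, key=lambda r: r.accuracy)
best_precision = max(results, key=lambda r: r.precision)
print(best_accuracy)   # depth 11, learning rate 0.1
print(best_precision)  # depth 3, learning rate 0.1
```

Among these rows, the highest accuracy and the highest precision come from different settings, so the "best" model depends on which metric matters most for the task.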

In [None]:
# A simplified explanation of how Gradient Boosting works:

# 1. Initial model: Gradient Boosting starts with a simple initial model that predicts the target variable (for example, a constant prediction).

# 2. Residual calculation: It calculates the residuals (the differences between the actual and predicted values) of the current model. These residuals are the errors the model has not yet captured.

# 3. Base learner fit: A new model (a "weak learner") is trained to predict the residuals from step 2. This "base learner" is typically a shallow decision tree with a small number of nodes.

# 4. Weighted combination: The new model's predictions are added to the current ensemble's predictions, scaled by a weight (the learning rate), so that the overall error decreases.

# 5. Iterative process: Steps 2-4 are repeated many times. In each iteration, a new model is trained on the residuals of the combined model from the previous iteration.

# 6. Final model: The final model is the sum of all the models created in the previous steps. It combines many weak learners into a strong learner that can make accurate predictions.

# The "gradient" in Gradient Boosting: each new learner is fit to the negative gradient of a differentiable loss function with respect to the current predictions, which amounts to gradient descent on that loss. For squared-error loss, the negative gradient is exactly the residual.
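The steps described above can be sketched in a few lines of pure Python. This is a toy version under stated assumptions, not scikit-learn's implementation: it handles one numeric feature, uses squared-error loss (so the negative gradient is the residual), and uses one-split "stumps" as weak learners. The names `fit_stump`, `fit_boosting`, and `mse`, and the data, are made up for illustration.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump to the residuals by least squares."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keep both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def fit_boosting(xs, ys, n_estimators=50, learning_rate=0.1):
    """Return a predict function built by stage-wise boosting."""
    base = sum(ys) / len(ys)  # initial model: predict the mean
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_estimators):
        # For squared error, the negative gradient is the residual
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Add the new learner, shrunk by the learning rate
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def mse(model, xs, ys):
    """Mean squared error of `model` on the given points."""
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up data with a jump between x = 4 and x = 5
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]

weak = fit_boosting(xs, ys, n_estimators=1)
strong = fit_boosting(xs, ys, n_estimators=50)
print(mse(weak, xs, ys), mse(strong, xs, ys))
```

On this training data the error shrinks as stages are added, which matches the pattern in the grid-search output: with a small learning rate, 50 estimators were not enough to move the predictions far from the initial model.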

