
### Statistical Learning for Data Science 2 (229352)
#### Instructor: Donlapark Ponnoprat

#### [Course website](https://donlapark.pages.dev/229352/)

## Lab #4

In [1]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

from scipy.stats import uniform

In [2]:
train = fetch_20newsgroups(subset='train')
test = fetch_20newsgroups(subset='test')

Xtrain = train.data[:3000]
ytrain = train.target[:3000]
Xtest = test.data[:500]
ytest = test.target[:500]

print("X:", len(Xtest))
print("y:", len(ytest))

X: 500
y: 500


### Naive Bayes [(Documentation)](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html)

In [3]:
# MultinomialNB was imported in the first cell.
# alpha is the additive (Laplace/Lidstone) smoothing parameter.
nb = MultinomialNB(alpha=0.1)
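
To make the role of `alpha` concrete, here is a minimal sketch on a made-up count matrix (the numbers are illustrative assumptions, not part of the lab data). It checks that the smoothed estimate (N_yi + alpha) / (N_y + alpha * n_features) reproduces the probabilities that `MultinomialNB` stores in `feature_log_prob_`.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy word-count matrix: 4 documents, 3 vocabulary terms (made-up numbers).
X = np.array([[2, 1, 0],
              [3, 0, 0],
              [0, 1, 4],
              [0, 0, 5]])
y = np.array([0, 0, 1, 1])

alpha = 0.1
model = MultinomialNB(alpha=alpha).fit(X, y)

# Per-class term counts N_yi (one row per class).
counts = np.vstack([X[y == c].sum(axis=0) for c in (0, 1)])

# Smoothed estimate: (N_yi + alpha) / (N_y + alpha * n_features)
manual = (counts + alpha) / (counts.sum(axis=1, keepdims=True) + alpha * X.shape[1])

# Probabilities learned by scikit-learn, converted back from log space.
sklearn_probs = np.exp(model.feature_log_prob_)

matches = np.allclose(manual, sklearn_probs)
print("Manual smoothing matches scikit-learn:", matches)
```

Without smoothing, a term that never occurs in a class would get probability zero; a positive `alpha` keeps every estimate strictly positive, and larger values pull the estimates toward a uniform distribution over the vocabulary.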

### Random Search Cross-Validation [(Documentation)](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html)

### Uniform distribution in SciPy [(Documentation)](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.uniform.html)
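
A point that is easy to miss: `uniform(loc, scale)` is parameterized by a start and a width, so it samples from `[loc, loc + scale]`, not `[loc, scale]`. The short check below uses the same `loc=0.1, scale=10` values as this lab; the sample size and seed are arbitrary choices for illustration.

```python
from scipy.stats import uniform

# Same distribution as the one used for alpha in this lab.
alpha_dist = uniform(loc=0.1, scale=10)

# support() returns the lower and upper end of the sampling interval.
low, high = alpha_dist.support()
print("Sampling interval:", low, high)

# Draw a reproducible sample and confirm it stays inside the interval.
samples = alpha_dist.rvs(size=1000, random_state=0)
print("All samples inside interval:", samples.min() >= low and samples.max() <= high)
```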

In [4]:
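from sklearn.feature_extraction.text import CountVectorizer

# Bag-of-words counts (English stop words removed) feeding a multinomial Naive Bayes classifier
pipeline = Pipeline([
    ('count', CountVectorizer(stop_words='english')),
    ('nb', MultinomialNB()),
])

# uniform(loc, scale) samples from [loc, loc + scale], so alpha is drawn from [0.1, 10.1]
param_dist = {'nb__alpha': uniform(loc=0.1, scale=10)}

# Try 10 randomly sampled alpha values, each scored with 3-fold cross-validation
clf = RandomizedSearchCV(pipeline, param_dist, cv=3, n_iter=10)

clf.fit(Xtrain, ytrain)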
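
After fitting, the search object exposes the selected hyperparameters and the cross-validation results. The sketch below shows how one might inspect them. To stay self-contained it uses a tiny made-up corpus (an assumption for illustration, not the 20 Newsgroups data), and it fixes `random_state` so the sampled values are reproducible.

```python
from scipy.stats import uniform
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up two-class corpus: 6 "space" documents and 6 "hockey" documents.
docs = [
    "the rocket launch reached orbit", "nasa plans a moon mission",
    "satellite orbit around the planet", "astronauts aboard the space station",
    "telescope images of a distant galaxy", "mars rover landing on the planet",
    "the team won the hockey game", "goalie saves in the playoff game",
    "hockey season starts with a win", "players traded before the playoff",
    "coach praises the team defense", "final score of the hockey match",
]
labels = [0] * 6 + [1] * 6

toy_pipeline = Pipeline([
    ("count", CountVectorizer(stop_words="english")),
    ("nb", MultinomialNB()),
])

toy_search = RandomizedSearchCV(
    toy_pipeline,
    {"nb__alpha": uniform(loc=0.1, scale=10)},
    cv=3,
    n_iter=5,
    random_state=0,  # makes the sampled alpha values reproducible
)
toy_search.fit(docs, labels)

# Best sampled alpha and its mean cross-validated score (accuracy by default).
print("Best params:", toy_search.best_params_)
print("Best CV score:", toy_search.best_score_)

# Every sampled candidate with its mean test score across the folds.
for params, score in zip(toy_search.cv_results_["params"],
                         toy_search.cv_results_["mean_test_score"]):
    print(params, score)
```

The same attributes (`best_params_`, `best_score_`, `cv_results_`) are available on the `clf` object fitted above.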

In [6]:
ypred = clf.predict(Xtest)
print(classification_report(ytest, ypred))

              precision    recall  f1-score   support

           0       0.74      0.67      0.70        21
           1       0.67      0.48      0.56        21
           2       1.00      0.04      0.07        26
           3       0.51      0.71      0.59        34
           4       0.81      0.85      0.83        34
           5       0.69      0.77      0.73        26
           6       0.86      0.82      0.84        22
           7       0.77      0.96      0.86        28
           8       0.93      0.82      0.87        33
           9       0.91      0.84      0.88        25
          10       0.90      0.96      0.93        27
          11       0.78      0.90      0.84        20
          12       0.57      0.67      0.62        24
          13       0.86      0.83      0.84        23
          14       0.79      0.93      0.85        28
          15       0.77      0.93      0.84        29
          16       0.53      0.95      0.68        21
          17       0.94    

#### Exercise

1. For the Naive Bayes model, use grid search with 5-fold cross-validation over different values of `alpha` to find the best model.

2. For the best value of `alpha`, compute the `f1_macro` score on the test set.
* What value of `alpha` did you obtain?
* What is the model's `f1_macro` score?

3. Repeat Exercises 1 and 2 using **random search** with 5-fold cross-validation over different values of `alpha`. Compute the `f1_macro` score on the test set.
* What value of `alpha` did you obtain?
* Did you get a better `f1_macro` score than with grid search in Exercise 2?
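
As a reminder of what `f1_macro` measures: it is the unweighted mean of the per-class F1 scores, so every class counts equally regardless of how many samples it has. The small check below uses made-up labels (an assumption for illustration only) to confirm this with `f1_score`.

```python
import numpy as np
from sklearn.metrics import f1_score

# Made-up true and predicted labels for a 3-class problem.
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 0]

# average=None returns one F1 score per class.
per_class = f1_score(y_true, y_pred, average=None)

# average='macro' returns their unweighted mean.
macro = f1_score(y_true, y_pred, average="macro")

print("Per-class F1:", per_class)
print("Macro F1:", macro)
print("Macro equals mean of per-class F1:", np.isclose(per_class.mean(), macro))
```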

In [8]:
from sklearn.metrics import f1_score

# Exercise 1 & 2: Grid Search
print("--- Grid Search ---")
grid_pipeline = Pipeline([('count', CountVectorizer(stop_words = 'english')),('nb' , MultinomialNB())])

# Define the parameter grid for alpha
grid_param_grid = {'nb__alpha': [0.1, 0.5, 1.0, 2.0, 5.0, 10.0]}

# Perform GridSearchCV with 5-fold cross-validation
grid_search = GridSearchCV(grid_pipeline, grid_param_grid, cv=5, scoring='f1_macro')
grid_search.fit(Xtrain, ytrain)

# Get the best alpha and best estimator from grid search
best_alpha_grid = grid_search.best_params_['nb__alpha']
best_grid_model = grid_search.best_estimator_

# Predict on the test set using the best grid search model
grid_ypred = best_grid_model.predict(Xtest)

# Compute the f1_macro score on the test set
grid_f1_macro = f1_score(ytest, grid_ypred, average='macro')

print(f"Best alpha (Grid Search): {best_alpha_grid}")
print(f"f1_macro score (Grid Search): {grid_f1_macro}")


--- Grid Search ---
Best alpha (Grid Search): 0.1
f1_macro score (Grid Search): 0.7188482458540597


In [9]:
# Exercise 3: Random Search
print("\n--- Random Search ---")
random_pipeline = Pipeline([('count', CountVectorizer(stop_words = 'english')),('nb' , MultinomialNB())])

# Define the parameter distribution for alpha
random_param_dist = {'nb__alpha': uniform(loc=0.1, scale=10.0)}

# Perform RandomizedSearchCV with 5-fold cross-validation and more iterations
random_search = RandomizedSearchCV(random_pipeline, random_param_dist, n_iter=50, cv=5, scoring='f1_macro', random_state=42)
random_search.fit(Xtrain, ytrain)

# Get the best alpha and best estimator from random search
best_alpha_random = random_search.best_params_['nb__alpha']
best_random_model = random_search.best_estimator_

# Predict on the test set using the best random search model
random_ypred = best_random_model.predict(Xtest)

# Compute the f1_macro score on the test set
random_f1_macro = f1_score(ytest, random_ypred, average='macro')

print(f"Best alpha (Random Search): {best_alpha_random}")
print(f"f1_macro score (Random Search): {random_f1_macro}")

# Compare f1_macro scores
print(f"\nDid Random Search get a better f1_macro score? {random_f1_macro > grid_f1_macro}")


--- Random Search ---
Best alpha (Random Search): 0.30584494295802445
f1_macro score (Random Search): 0.7136681491699264

Did Random Search get a better f1_macro score? False
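
In the runs above, grid search selected `alpha = 0.1`, the smallest value in its grid, and neither search covered values below 0.1. One possible follow-up, which is not part of the original lab, is to sample `alpha` on a log scale so that small values are explored more densely. The sketch below only compares how the two distributions spread their samples; the bounds `1e-3` and `10` are assumed values chosen for illustration.

```python
from scipy.stats import loguniform, uniform

# Distribution used in this lab: uniform on [0.1, 10.1].
uniform_dist = uniform(loc=0.1, scale=10)

# Hypothetical alternative: log-uniform on [1e-3, 10] (bounds are an assumption).
log_dist = loguniform(1e-3, 10)

uniform_samples = uniform_dist.rvs(size=10000, random_state=0)
log_samples = log_dist.rvs(size=10000, random_state=0)

# Fraction of sampled alpha values that fall below 1.0 under each distribution.
frac_uniform_small = (uniform_samples < 1.0).mean()
frac_log_small = (log_samples < 1.0).mean()

print("Fraction of samples below 1.0 (uniform):    ", frac_uniform_small)
print("Fraction of samples below 1.0 (log-uniform):", frac_log_small)
```

To try it in the search, one could pass `{'nb__alpha': loguniform(1e-3, 10)}` as the parameter distribution to `RandomizedSearchCV`; whether this improves the test `f1_macro` score would need to be checked empirically.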
