<a href="https://colab.research.google.com/github/krisana-y/229352-StatisticalLearning/blob/main/650510483_kNN_using_Grid_Search.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Statistical Learning for Data Science 2 (229352)
#### Instructor: Donlapark Ponnoprat

#### [Course website](https://donlapark.pages.dev/229352/)

## Lab #3

In [None]:
from sklearn.datasets import fetch_20newsgroups

train = fetch_20newsgroups(subset='train')
test = fetch_20newsgroups(subset='test')

Xtrain = train.data[:3000]
ytrain = train.target[:3000]
Xtest = test.data[:500]
ytest = test.target[:500]

print("X:", len(Xtest))
print("y:", len(ytest))

In [None]:
print("X[0]:", Xtrain[0])
print("y[0]:", ytrain[0])

In [None]:
train.target_names

### Apply Tfidf ([documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html))

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# define transformation
tfidf = TfidfVectorizer(stop_words='english')

# fit+transform training set
Xtrain_tfidf = tfidf.fit_transform(Xtrain)

# See output
Xtrain_tfidf

In [None]:
tfidf.vocabulary_
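
To make the two outputs above easier to read, here is a minimal sketch on a made-up three-sentence corpus. The `toy_*` names and the sentences are illustrative assumptions, not part of the lab data.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus (not from 20 Newsgroups)
toy_docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stock markets fell sharply today",
]

toy_tfidf = TfidfVectorizer(stop_words='english')
toy_matrix = toy_tfidf.fit_transform(toy_docs)

# Sparse matrix: one row per document, one column per vocabulary term
print(toy_matrix.shape)

# Dictionary mapping each kept term to its column index
print(toy_tfidf.vocabulary_)

# With the default norm='l2', every row is scaled to unit Euclidean length
row_norms = np.linalg.norm(toy_matrix.toarray(), axis=1)
print(row_norms)
```

Stop words such as "the" are dropped before the vocabulary is built, so they never get a column.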

#### Exercise 1: Find the post in the training set that is closest in tf-idf space to the first post in the test set (`Xtest[0]`). Print the content of both posts (not the tf-idf vectors).

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Reuse the vectorizer already fitted on the training set;
# the test post is only transformed, never used for fitting
Xtest0_tfidf = tfidf.transform([Xtest[0]])

# Cosine similarity between the test post and every training post
similarities = cosine_similarity(Xtest0_tfidf, Xtrain_tfidf)

# Index of the most similar training post
most_similar_index = np.argmax(similarities)

print("Test post:\n", Xtest[0])
print("Closest training post:\n", Xtrain[most_similar_index])
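
The same nearest-post search can be sketched on a tiny example where the answer is easy to check by eye. The corpus, the query, and the names `corpus`, `query`, `vec`, `sims` and `best` are all made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical mini corpus and query
corpus = [
    "graphics card driver crashes on windows",
    "the team won the hockey game last night",
    "new space shuttle launch scheduled by nasa",
]
query = "hockey playoffs game tonight"

vec = TfidfVectorizer(stop_words='english')
corpus_tfidf = vec.fit_transform(corpus)   # fit on the corpus only
query_tfidf = vec.transform([query])       # words unseen during fit are ignored

# One row (the query) by three columns (the corpus documents)
sims = cosine_similarity(query_tfidf, corpus_tfidf)
best = int(np.argmax(sims))

print(sims.round(3))
print(corpus[best])
```

Only the second document shares any vocabulary with the query, so it is the only one with a nonzero similarity.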

### Classify with k-Nearest Neighbor (kNN) ([documentation](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html))

In [None]:
from sklearn.neighbors import KNeighborsClassifier

# Define model
nb = KNeighborsClassifier(n_neighbors=5)

# Fit the model
nb.fit(Xtrain_tfidf, ytrain)

Evaluate the model on the test set using [`classification_report`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html).

We will focus on the [F1-score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html).
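
As a reminder of what is being reported: the F1-score of a class is the harmonic mean of its precision and recall, and the macro average is the unweighted mean of the per-class F1-scores. A small sketch with made-up labels (`y_true` and `y_pred` below are illustrative, not lab results):

```python
from sklearn.metrics import f1_score

# Hypothetical labels for a 3-class problem
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 0]

# One F1-score per class, in label order
per_class = f1_score(y_true, y_pred, average=None)

# Macro average: plain mean of the per-class scores, so every class
# counts equally regardless of how many samples it has
macro = f1_score(y_true, y_pred, average='macro')

print(per_class)
print(macro)
```

For class 0 in this example, precision and recall are both 3/4, so its F1-score is 0.75.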

In [None]:
from sklearn.metrics import classification_report

# Transform the test set
Xtest_tfidf = tfidf.transform(Xtest)

# Make predictions on the test set
ypred = nb.predict(Xtest_tfidf)

# Report classification scores
print(classification_report(ytest, ypred))

### Combine all methods into a [pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)

In [None]:
from sklearn.pipeline import Pipeline

pipeline = Pipeline([('tfidf', TfidfVectorizer(stop_words='english')),
                     ('nb', KNeighborsClassifier(n_neighbors=5))])

# Fit the pipeline to the training set
pipeline.fit(Xtrain, ytrain)

# Make predictions on the test set
ypred = pipeline.predict(Xtest)

# report classification scores
print(classification_report(ytest, ypred))
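
A pipeline is meant to give the same predictions as running the steps by hand; it just keeps the vectorizer and the classifier together so the vectorizer is always fitted on training data only. The sketch below checks this on a made-up corpus. All names and sentences are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy data: label 0 = sports, label 1 = space
docs_train = [
    "the goalie stopped every shot in the hockey game",
    "the pitcher threw a perfect baseball game",
    "nasa launched a new satellite into orbit",
    "the shuttle docked with the space station",
]
labels_train = [0, 0, 1, 1]
docs_test = ["hockey players skate fast", "the satellite reached orbit"]

# Manual route: fit the vectorizer on train, transform test, then classify
vec = TfidfVectorizer(stop_words='english')
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(vec.fit_transform(docs_train), labels_train)
manual_pred = knn.predict(vec.transform(docs_test))

# Pipeline route: the same two steps wrapped in one estimator
pipe = Pipeline([('tfidf', TfidfVectorizer(stop_words='english')),
                 ('knn', KNeighborsClassifier(n_neighbors=1))])
pipe.fit(docs_train, labels_train)
pipe_pred = pipe.predict(docs_test)

print(manual_pred, pipe_pred)
print(np.array_equal(manual_pred, pipe_pred))
```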

Now we will use [grid search cross-validation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) to find the model with the best hyperparameters.

![5CV](https://scikit-learn.org/stable/_images/grid_search_cross_validation.png)
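
Conceptually, grid search cross-validation scores every hyperparameter candidate by its mean cross-validation score and keeps the best one. The sketch below spells that loop out by hand and compares it with `GridSearchCV`. To stay small and fast it uses the bundled iris data rather than the newsgroup posts; that choice and the variable names are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
candidates = [1, 3, 5]

# Manual grid search: mean 3-fold macro-F1 for each candidate k
manual_scores = {}
for k in candidates:
    fold_scores = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                                  X, y, scoring='f1_macro', cv=3)
    manual_scores[k] = fold_scores.mean()
best_k = max(manual_scores, key=manual_scores.get)

# The same search done by GridSearchCV
grid = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': candidates},
                    scoring='f1_macro', cv=3)
grid.fit(X, y)

print(manual_scores)
print(best_k, grid.best_params_, grid.best_score_)
```

After the search, `GridSearchCV` also refits the best candidate on the full training data (the default `refit=True`), which is why the fitted object can be used for prediction directly.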

In [None]:
from sklearn.model_selection import GridSearchCV

params = {'tfidf__stop_words':['english'],
          'nb__n_neighbors':[1, 3, 5]}

# Define GridSearchCV
gridcv = GridSearchCV(pipeline, params,
                      scoring='f1_macro', cv=3)

# Fit the pipeline, running 3-fold cross-validation for each hyperparameter combination
gridcv.fit(Xtrain, ytrain)

In [None]:
gridcv.best_estimator_
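
Besides `best_estimator_`, a fitted `GridSearchCV` object exposes `best_params_`, `best_score_` and `cv_results_`, which show how each candidate performed. A minimal sketch on a made-up corpus follows; the sentences, the step name `knn`, and the variable names are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy data: six sports posts (0) and six space posts (1)
docs = [
    "the hockey team won the game",
    "our goalie saved the hockey match",
    "baseball pitcher threw a fast ball",
    "the baseball team lost the game",
    "hockey players skate on ice",
    "the pitcher won the baseball match",
    "nasa launched a rocket into orbit",
    "the satellite reached orbit today",
    "astronauts live on the space station",
    "the rocket carried a satellite",
    "nasa plans a mission to the moon",
    "the space shuttle returned from orbit",
]
labels = [0] * 6 + [1] * 6

pipe = Pipeline([('tfidf', TfidfVectorizer(stop_words='english')),
                 ('knn', KNeighborsClassifier())])

# Parameter names follow the pattern <step name>__<parameter name>
grid = GridSearchCV(pipe, {'knn__n_neighbors': [1, 3]},
                    scoring='f1_macro', cv=3)
grid.fit(docs, labels)

print("Best parameters:", grid.best_params_)
print("Best mean CV score:", grid.best_score_)

# Mean cross-validation score of every candidate
for p, s in zip(grid.cv_results_['params'], grid.cv_results_['mean_test_score']):
    print(p, round(float(s), 3))
```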

In [None]:
# Make predictions on the test set
ypred = gridcv.predict(Xtest)

# Report classification scores
print(classification_report(ytest, ypred))

#### Exercise 2:

1. Use grid search with 5-fold cross-validation over different values of the following two kNN parameters, `n_neighbors` and `metric`, **on the training set** to find the best model.

2. For the best values of `n_neighbors` and `metric` you found above, compute the `f1_macro` score **on the test set**.
* Print the values of `n_neighbors` and `metric`.
* Print the model's `f1_macro` score.

In [None]:
from sklearn.model_selection import GridSearchCV

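params2 = {
    'tfidf__stop_words': ['english'],
    'nb__n_neighbors': [1, 3, 5, 7, 9],
    # Note: 'minkowski' with the default p=2 is the same distance as 'euclidean',
    # so those two candidates receive identical scores
    'nb__metric': ['euclidean', 'manhattan', 'minkowski']
}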

# Define GridSearchCV
gridcv2 = GridSearchCV(pipeline, params2, scoring='f1_macro', cv=5)

# Fit the model on training data
gridcv2.fit(Xtrain, ytrain)


In [None]:
# Print the best values of n_neighbors and metric found by the search
print(gridcv2.best_params_)

In [None]:
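from sklearn.metrics import f1_score

# Make predictions on the test set
ypred2 = gridcv2.predict(Xtest)

# Print the macro-averaged F1 score, then the per-class breakdown
print("f1_macro:", f1_score(ytest, ypred2, average='macro'))
print(classification_report(ytest, ypred2))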
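
One detail worth checking when reading the results of a `metric` grid: in `KNeighborsClassifier`, `metric='minkowski'` with the default `p=2` is the Euclidean distance, so the two settings should behave identically. The sketch below verifies this on the bundled iris data, which is used here only as a small stand-in; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Same k, two metric names
knn_euclid = KNeighborsClassifier(n_neighbors=5, metric='euclidean').fit(X_tr, y_tr)
knn_minkow = KNeighborsClassifier(n_neighbors=5, metric='minkowski').fit(X_tr, y_tr)  # p=2 by default

# Compare the predictions of the two models on the held-out points
same = np.array_equal(knn_euclid.predict(X_te), knn_minkow.predict(X_te))
print(same)

# The metric actually used after fitting
print(knn_minkow.effective_metric_)
```

To make the grid compare genuinely different Minkowski distances, one option is to also vary `p`, for example by adding a `p` entry to the parameter grid.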