### Statistical Learning for Data Science 2 (229352)
#### Instructor: Donlapark Ponnoprat

#### [Course website](https://donlapark.pages.dev/229352/)

## Lab #3

In [1]:
from sklearn.datasets import fetch_20newsgroups

train = fetch_20newsgroups(subset='train')
test = fetch_20newsgroups(subset='test')

Xtrain = train.data[:3000]
ytrain = train.target[:3000]
Xtest = test.data[:500]
ytest = test.target[:500]

print("X:", len(Xtest))
print("y:", len(ytest))

X: 500
y: 500


In [2]:
print("X[0]:", Xtrain[0])
print("y[0]:", ytrain[0])

X[0]: From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----





y[0]: 7


In [3]:
train.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

### Apply Tfidf ([documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html))

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

# define transformation
tfidf = TfidfVectorizer()

# fit+transform training set
Xtrain_tfidf = tfidf.fit_transform(Xtrain)

# See output
Xtrain_tfidf

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 470232 stored elements and shape (3000, 61994)>

In [5]:
tfidf.vocabulary_

{'from': 26262,
 'lerxst': 35043,
 'wam': 58861,
 'umd': 56422,
 'edu': 22922,
 'where': 59292,
 'my': 39746,
 'thing': 54692,
 'subject': 52978,
 'what': 59268,
 'car': 16257,
 'is': 31982,
 'this': 54718,
 'nntp': 40813,
 'posting': 44658,
 'host': 29698,
 'rac3': 46536,
 'organization': 42202,
 'university': 56775,
 'of': 41688,
 'maryland': 37056,
 'college': 17956,
 'park': 43037,
 'lines': 35397,
 '15': 1529,
 'was': 58934,
 'wondering': 59745,
 'if': 30492,
 'anyone': 11667,
 'out': 42420,
 'there': 54637,
 'could': 19062,
 'enlighten': 23554,
 'me': 37517,
 'on': 41921,
 'saw': 49528,
 'the': 54578,
 'other': 42373,
 'day': 20257,
 'it': 32110,
 'door': 22022,
 'sports': 52053,
 'looked': 35736,
 'to': 55083,
 'be': 13596,
 'late': 34675,
 '60s': 6506,
 'early': 22710,
 '70s': 7242,
 'called': 16056,
 'bricklin': 14952,
 'doors': 22024,
 'were': 59174,
 'really': 47001,
 'small': 51319,
 'in': 30898,
 'addition': 10376,
 'front': 26267,
 'bumper': 15260,
 'separate': 50178,
 'r

#### Exercise 1: Find post in the training set that is closest in tf-idf to the first post in the test set (`Xtest[0]`). Print the content of both posts (not the tf-idf vectors).

### Classify with k-Nearest Neighbor (kNN) ([documentation](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html))

In [6]:
from sklearn.neighbors import KNeighborsClassifier

# Define model
nb = KNeighborsClassifier(n_neighbors=5)

# Fit the model
nb.fit(Xtrain_tfidf, ytrain)

Evaluate on the test set using [`classification_report`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html)

We will focus on the [F1-score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html)

In [7]:
from sklearn.metrics import classification_report

# Transform the test set
Xtest_tfidf = tfidf.transform(Xtest)

# Make predictions on the test set
ypred = nb.predict(Xtest_tfidf)

#report classification scores
print(classification_report(ytest, ypred))

              precision    recall  f1-score   support

           0       0.18      0.48      0.26        21
           1       0.38      0.48      0.43        21
           2       0.56      0.58      0.57        26
           3       0.52      0.50      0.51        34
           4       0.67      0.53      0.59        34
           5       0.88      0.54      0.67        26
           6       0.50      0.41      0.45        22
           7       0.58      0.54      0.56        28
           8       0.74      0.70      0.72        33
           9       0.82      0.72      0.77        25
          10       0.75      0.56      0.64        27
          11       0.71      0.85      0.77        20
          12       0.31      0.21      0.25        24
          13       0.79      0.48      0.59        23
          14       0.76      0.57      0.65        28
          15       0.67      0.55      0.60        29
          16       0.56      0.67      0.61        21
          17       0.59    

### Combine all methods into a [pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)

In [8]:
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('nb', KNeighborsClassifier(n_neighbors=5))
])

# Fit the pipeline to the training set
pipeline.fit(Xtrain, ytrain)

# Make predictions on the test set
ypred = pipeline.predict(Xtest)

# report classification scores
print(classification_report(ytest, ypred))

              precision    recall  f1-score   support

           0       0.18      0.48      0.26        21
           1       0.38      0.48      0.43        21
           2       0.56      0.58      0.57        26
           3       0.52      0.50      0.51        34
           4       0.67      0.53      0.59        34
           5       0.88      0.54      0.67        26
           6       0.50      0.41      0.45        22
           7       0.58      0.54      0.56        28
           8       0.74      0.70      0.72        33
           9       0.82      0.72      0.77        25
          10       0.75      0.56      0.64        27
          11       0.71      0.85      0.77        20
          12       0.31      0.21      0.25        24
          13       0.79      0.48      0.59        23
          14       0.76      0.57      0.65        28
          15       0.67      0.55      0.60        29
          16       0.56      0.67      0.61        21
          17       0.59    

Now we will use [grid search cross-validation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) to find model with the best hyperparameters

![5CV](https://scikit-learn.org/stable/_images/grid_search_cross_validation.png)

In [9]:
from sklearn.model_selection import GridSearchCV

params = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'nb__n_neighbors': [1, 3, 5, 7],
    'nb__weights': ['uniform', 'distance']
}

# Define GridSearchCV
gridcv = GridSearchCV(pipeline, params,
                      scoring='f1_macro', cv=3)

# Fit and cross-validate the model on 3-fold data
gridcv.fit(Xtrain, ytrain)

In [10]:
gridcv.best_estimator_

In [11]:
# Make predictions on the test set
ypred = gridcv.predict(Xtest)

# Report classification scores
print(classification_report(ytest, ypred))

              precision    recall  f1-score   support

           0       0.29      0.52      0.37        21
           1       0.38      0.43      0.40        21
           2       0.54      0.54      0.54        26
           3       0.46      0.38      0.42        34
           4       0.81      0.50      0.62        34
           5       0.68      0.50      0.58        26
           6       0.41      0.41      0.41        22
           7       0.67      0.57      0.62        28
           8       0.82      0.70      0.75        33
           9       0.69      0.72      0.71        25
          10       0.68      0.56      0.61        27
          11       0.70      0.80      0.74        20
          12       0.43      0.42      0.43        24
          13       0.65      0.57      0.60        23
          14       0.64      0.64      0.64        28
          15       0.48      0.52      0.50        29
          16       0.63      0.57      0.60        21
          17       0.43    

#### Exercise 2:

1. Use grid search 5-fold cross-validation across different values of the following two kNN parameters: `n_neighbors` and `metric`  **on the training set** to find the best model.

2. For the best value of `n_neighbors` and `metric` you found above, compute the `f1_macro` score **on the test set**.
* Print the value of `n_neighbors` and `metric`.
* Print the model's `f1_macro` score.

In [12]:
from sklearn.metrics import f1_score

# สร้าง Pipeline
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('knn', KNeighborsClassifier())
])

param_grid = {
    'knn__n_neighbors': [1, 3, 5, 7],
    'knn__metric': ['euclidean', 'manhattan', 'cosine']
}

# สร้าง GridSearchCV ใช้ 5-fold
gridcv = GridSearchCV(pipeline, param_grid, cv=5, scoring='f1_macro')

# ฝึกและค้นหาพารามิเตอร์ที่ดีที่สุด
gridcv.fit(Xtrain, ytrain)

# ทำนายชุดทดสอบ
y_pred = gridcv.predict(Xtest)

# คำนวณคะแนน f1_macro
f1 = f1_score(ytest, ypred, average='macro')

# แสดงผลลัพธ์
print("Best Parameters:")
print("n_neighbors:", gridcv.best_params_['knn__n_neighbors'])
print("metric     :", gridcv.best_params_['knn__metric'])
print("f1_macro score on test set:", round(f1, 4))

Best Parameters:
n_neighbors: 1
metric     : euclidean
f1_macro score on test set: 0.5497
