# Load a dataset from sklearn

This is a news dataset for text classification: given a news post as the input, classify its category (one of twenty pre-defined categories) as the output. The details are here: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html#sklearn.datasets.fetch_20newsgroups

In [None]:
from sklearn.datasets import fetch_20newsgroups

#we use the same random seed for reproducibility
RANDOM_STATE = 0
#this dataset has almost 20k instances. For a quick demonstration,
#we will only use a sub-sample from the three categories below
categories = ['comp.graphics', 'sci.med', 'talk.politics.guns']
#now let's retrieve all the data
data_all = fetch_20newsgroups(
    subset="all",
    shuffle=True,
    categories=categories,
    random_state=RANDOM_STATE,
)



In [None]:
#we can now print the dataset description to better understand the dataset
#it shows the number of instances, attributes, examples, etc.
#note that the description covers the full 20-category dataset, not our three-category subset

print(data_all.DESCR)

.. _20newsgroups_dataset:

The 20 newsgroups text dataset
------------------------------

The 20 newsgroups dataset comprises around 18000 newsgroups posts on
20 topics split in two subsets: one for training (or development)
and the other one for testing (or for performance evaluation). The split
between the train and test set is based upon a messages posted before
and after a specific date.

This module contains two loaders. The first one,
:func:`sklearn.datasets.fetch_20newsgroups`,
returns a list of the raw texts that can be fed to text feature
extractors such as :class:`~sklearn.feature_extraction.text.CountVectorizer`
with custom parameters so as to extract feature vectors.
The second one, :func:`sklearn.datasets.fetch_20newsgroups_vectorized`,
returns ready-to-use features, i.e., it is not necessary to use a feature
extractor.

**Data Set Characteristics:**

Classes                     20
Samples total            18846
Dimensionality               1
Features                  text

# Train-valid-test split

In class we talked about splitting a dataset into training, validation, and testing sets for developing methods. Recall that we use the training set for training the model.

We use the validation set for tuning the hyperparameters.

When the model is finalized, we apply the model to the test set and evaluate.

Sklearn datasets often already provide training and testing sets. However, in practice, we need to do the split ourselves. Our first exercise is to split the data.

## Exercise 1: split the dataset into training (70% of the data), validation (10% of the data), and testing sets (20% of the data) randomly

In [None]:
#Step 1: calculate the train, validation, test set size

train_set_size = int(len(data_all.data)*0.7) #70% of the data for training
valid_set_size = int(len(data_all.data)*0.1) #10% of the data for validation
test_set_size = len(data_all.data) - train_set_size - valid_set_size #the remaining data for testing

print('Training set size:', train_set_size)
print('Valid set size:', valid_set_size)
print('Testing set size:', test_set_size)

from sklearn.model_selection import train_test_split

#Step 2: now we first split the testing set out. The remaining data will be used for training and validation
#We use train_test_split: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
#Passing an integer as test_size requests that exact number of instances

#data_all.data: they are the input instances that we want our model to learn
#data_all.target: they are the output class labels that we want our model to predict
X_trainvalid, X_test, y_trainvalid, y_test = train_test_split(data_all.data,
                                                              data_all.target,
                                                              test_size=test_set_size,
                                                              random_state=RANDOM_STATE)

#Step 3: now it's your turn. We further split the remaining data into training and validation sets
#Please fill in XXX
X_train, X_valid, y_train, y_valid = train_test_split(X_trainvalid, y_trainvalid, test_size=valid_set_size, random_state=RANDOM_STATE)


Training set size: 2011
Valid set size: 287
Testing set size: 575
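
A purely random split can leave the class proportions slightly different across the three sets. As a minimal sketch (not part of the original exercise), `train_test_split` accepts a `stratify` argument that keeps the class proportions roughly equal in each split. The `texts` and `labels` lists below are made-up stand-ins for `data_all.data` and `data_all.target`, so the snippet runs without downloading anything.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy stand-ins (assumption): 100 short "documents" spread evenly over 4 classes
texts = [f"document {i}" for i in range(100)]
labels = [i % 4 for i in range(100)]

# Same size calculation as in Step 1 above
n_total = len(texts)
n_train = int(n_total * 0.7)
n_valid = int(n_total * 0.1)
n_test = n_total - n_train - n_valid

# First split off the test set, keeping class proportions via stratify
X_rest, X_test, y_rest, y_test = train_test_split(
    texts, labels, test_size=n_test, random_state=0, stratify=labels
)

# Then split the remainder into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(
    X_rest, y_rest, test_size=n_valid, random_state=0, stratify=y_rest
)

print("sizes:", len(X_train), len(X_valid), len(X_test))
print("test set class counts:", Counter(y_test))
```

In the notebook itself, the same idea would amount to adding `stratify=data_all.target` to the first call and `stratify=y_trainvalid` to the second.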


# Training a text classification model

## Exercise 2: build a KNN model

Now that we have the training, validation, and testing sets available, we can train a model for text classification!

This week, we introduced the KNN model and the primary evaluation metrics. We will start with KNN, and you can try other models too.

In [None]:
#Step 1: text representation
#Recall we talked about different text representation methods
#For demonstration, let's use tfidf representation

from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()

#we use the tfidf to generate representation from the training set
X_train_vector = vectorizer.fit_transform(X_train)

#then we apply the fitted vectorizer to the validation set
#note here we use transform instead of fit_transform because the vocabulary and idf weights
#were already learned from the training set
X_valid_vector = vectorizer.transform(X_valid)
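
To make the difference between `fit_transform` and `transform` concrete, here is a small sketch on a made-up toy corpus (the sentences are illustrative assumptions, not the newsgroups data). It shows that `transform` reuses the vocabulary learned from the training documents, so both matrices have the same number of columns and words unseen during training are ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up toy corpus (assumption), standing in for X_train and X_valid
train_docs = ["the doctor prescribed medicine", "the image was rendered in 3d"]
valid_docs = ["the doctor rendered a zebra"]  # "zebra" never appears in training

toy_vectorizer = TfidfVectorizer()

# fit_transform learns the vocabulary and idf weights, then vectorizes
train_matrix = toy_vectorizer.fit_transform(train_docs)

# transform only vectorizes, reusing what was learned from the training docs
valid_matrix = toy_vectorizer.transform(valid_docs)

print("vocabulary:", sorted(toy_vectorizer.vocabulary_))
print("train shape:", train_matrix.shape)
print("valid shape:", valid_matrix.shape)
print("is 'zebra' in the vocabulary?", "zebra" in toy_vectorizer.vocabulary_)
```

Calling `fit_transform` on the validation documents instead would build a different vocabulary, and the resulting columns would no longer line up with the features the model was trained on.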



In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score


#create an instance of the KNN model https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
#the documentation mentions that it uses k=5 nearest neighbors by default
#n_jobs sets the number of parallel jobs used for the neighbor search
model = KNeighborsClassifier(n_neighbors=5, n_jobs=10)

#use the training set tfidf vector to train the model
model.fit(X_train_vector, y_train)

#now we apply this model to the validation set tfidf vector to see the performance
y_predict_valid = model.predict(X_valid_vector)
print ('accuracy in the validation set', accuracy_score(y_valid, y_predict_valid))

accuracy in the validation set 0.9198606271777003


In [None]:
#now let's try different numbers of nearest neighbors from 1 to 10, train
#the KNN model, and report the accuracy on the validation set

for i in range(1, 11):
  print('K =', i)
  #it's your turn: create the KNN model with i as the n_neighbors
  #please fill in XXX
  model = KNeighborsClassifier(n_neighbors=i, n_jobs=10)
  #same as above use the training set tfidf vector to train the model
  model.fit(X_train_vector, y_train)
  y_predict_valid = model.predict(X_valid_vector)
  print ('accuracy in the validation set', accuracy_score(y_valid, y_predict_valid))

K = 1
accuracy in the validation set 0.9512195121951219
K = 2
accuracy in the validation set 0.9442508710801394
K = 3
accuracy in the validation set 0.9337979094076655
K = 4
accuracy in the validation set 0.9407665505226481
K = 5
accuracy in the validation set 0.9198606271777003
K = 6
accuracy in the validation set 0.9198606271777003
K = 7
accuracy in the validation set 0.9059233449477352
K = 8
accuracy in the validation set 0.9198606271777003
K = 9
accuracy in the validation set 0.9059233449477352
K = 10
accuracy in the validation set 0.9128919860627178
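
Rather than reading the best K off the printed list by eye, the accuracies can be stored and the best K selected in code. The sketch below copies the validation accuracies printed above into a dictionary (the name `valid_accuracy` is an illustrative assumption); in the loop itself, each value would come from `accuracy_score(y_valid, y_predict_valid)`.

```python
# Validation accuracies copied from the output above, keyed by K
valid_accuracy = {
    1: 0.9512195121951219,
    2: 0.9442508710801394,
    3: 0.9337979094076655,
    4: 0.9407665505226481,
    5: 0.9198606271777003,
    6: 0.9198606271777003,
    7: 0.9059233449477352,
    8: 0.9198606271777003,
    9: 0.9059233449477352,
    10: 0.9128919860627178,
}

# Pick the K with the highest validation accuracy
# (on ties, max returns the first such K in insertion order, i.e. the smallest)
best_k = max(valid_accuracy, key=valid_accuracy.get)

print("best K:", best_k)
print("validation accuracy:", valid_accuracy[best_k])
```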


In [None]:
#in this specific subset, K=1 (the closest neighbor in the training set) gave
#the highest accuracy on the validation set. So we set K=1

#note that this is a simplified demonstration for this week since we have just started talking
#about machine learning. We focused on splitting the data and on training and evaluating a model
#We will go into more depth with more detailed exercises in the following weeks
#In week 6, we will have a complete and thorough machine learning pipeline

#now we test the model with K=1 on the testing set

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Re-initialize and fit the TfidfVectorizer on the training data to ensure consistency
vectorizer = TfidfVectorizer()
X_train_vector = vectorizer.fit_transform(X_train)

# Apply this fitted vectorizer to the testing set
X_test_vector = vectorizer.transform(X_test)

# Create and train the KNN model with K=1
model = KNeighborsClassifier(n_neighbors=1, n_jobs=10)
model.fit(X_train_vector, y_train)

# Make predictions on the testing set
y_predict_test = model.predict(X_test_vector)
print ('accuracy in the testing set', accuracy_score(y_test, y_predict_test))

accuracy in the testing set 0.928695652173913
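
Accuracy is a single number and does not show which categories get confused with each other. As a hedged sketch, the snippet below applies `confusion_matrix` and `classification_report` from `sklearn.metrics` to a handful of made-up labels; the `y_true` and `y_pred` values are illustrative assumptions, not results from the model above. It also assumes the integer labels 0, 1, 2 follow the order of the three category names.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Made-up labels (assumption), standing in for y_test and y_predict_test
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

# Category names assumed to correspond to labels 0, 1, 2 in this order
target_names = ["comp.graphics", "sci.med", "talk.politics.guns"]

acc = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)

print("accuracy:", acc)
print(cm)

# Per-class precision, recall, and F1
print(classification_report(y_true, y_pred, target_names=target_names))
```

In the notebook, the same calls would take `y_test` and `y_predict_test`, with `data_all.target_names` supplying the category names.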


Congratulations on finishing the exercises!

Note that this is a simplified demonstration for this week since we have just started talking
about machine learning. We focused on splitting the data and on training and evaluating a model.

We will go into more depth with more detailed exercises in the following weeks.

By week 6, we will have a complete and thorough machine learning pipeline.