# Lab 9: Document Analysis

In this assignment, we will learn how to do document classification and clustering.



## 1. Example

In this example, we use the [20newsgroups](https://scikit-learn.org/stable/datasets/real_world.html#newsgroups-dataset) dataset. Each sample is a document, and there are 20 classes in total.

### 1.1 Load data

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_20newsgroups

data_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
data_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))

print("Train data target labels: {}".format(data_train.target))
print("Train data target names: {}".format(data_train.target_names))

print('#training samples: {}'.format(len(data_train.data)))
print('#testing samples: {}'.format(len(data_test.data)))


Train data target labels: [7 4 4 ... 3 1 8]
Train data target names: ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
#training samples: 11314
#testing samples: 7532


### 1.2 Represent documents with TF-IDF representation

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer


#TF-IDF representation for each document
vectorizer = TfidfVectorizer()
data_train_vectors = vectorizer.fit_transform(data_train.data)
data_test_vectors = vectorizer.transform(data_test.data)

print(data_train_vectors.shape, data_test_vectors.shape)


(11314, 101631) (7532, 101631)
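
To make the TF-IDF representation concrete, here is a minimal sketch on a toy corpus. The three sentences and the `toy_*` names are made up for illustration and are not part of 20newsgroups. It shows that terms occurring in fewer documents receive a larger IDF weight, and that `TfidfVectorizer` L2-normalizes each document vector by default.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (made up for illustration, not taken from 20newsgroups)
toy_docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

toy_vectorizer = TfidfVectorizer()
toy_vectors = toy_vectorizer.fit_transform(toy_docs)  # sparse matrix: documents x terms

print(toy_vectors.shape)

# "the" occurs in two documents, "cat" in one, so "cat" gets the larger IDF
vocab = toy_vectorizer.vocabulary_
idf = toy_vectorizer.idf_
print("idf('the') =", idf[vocab["the"]])
print("idf('cat') =", idf[vocab["cat"]])

# With the default settings each row is L2-normalized
row_norms = np.sqrt(np.asarray(toy_vectors.multiply(toy_vectors).sum(axis=1)).ravel())
print(row_norms)
```

The same pattern applies to the real data: `fit_transform` learns the vocabulary and IDF weights from the training documents, and `transform` reuses them for the test documents.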


### 1.3 Use KNN to do document classification

Here, we use 5-fold cross-validation to select $K$ from the values 1 to 4.

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, f1_score


Xtr = data_train_vectors
Ytr = data_train.target

Xte = data_test_vectors
Yte = data_test.target

k_range = range(1, 5)
param_grid = dict(n_neighbors=k_range)

clf_knn =  KNeighborsClassifier(n_neighbors=1)

grid = GridSearchCV(clf_knn, param_grid, cv=5, scoring='accuracy')
grid.fit(Xtr, Ytr)

print(grid.best_score_)
print(grid.best_params_)

0.16846385009722467
{'n_neighbors': 1}
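
`best_score_` and `best_params_` report only the winning candidate. The per-candidate scores are stored in `cv_results_`, which can help judge whether the choice of $K$ matters much. The sketch below uses synthetic data from `make_classification` as a stand-in (an assumption, so that the snippet runs without downloading anything); the `*_demo` names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption; not the TF-IDF matrix used above)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, n_informative=5, n_classes=3, random_state=0
)

demo_grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": range(1, 5)},
    cv=5,
    scoring="accuracy",
)
demo_grid.fit(X_demo, y_demo)

# cv_results_ holds one entry per candidate value of n_neighbors
for params, mean, std in zip(
    demo_grid.cv_results_["params"],
    demo_grid.cv_results_["mean_test_score"],
    demo_grid.cv_results_["std_test_score"],
):
    print(params, "mean accuracy = {:.3f} (+/- {:.3f})".format(mean, std))
```

On the real data, the same loop over `grid.cv_results_` would list the cross-validated accuracy for each $K$ in `k_range`.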


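### 1.4 Use Logistic Regression to do document classification
Here, we also use 5-fold cross-validation to select the regularization parameter `C` from the integers 1 to 9. In scikit-learn, `C` is the inverse of the regularization strength, so smaller values mean stronger regularization.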

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
import numpy as np

#=====training with cross validation======
coeff = range(1, 10)
param_grid = dict(C=coeff)

clf_lr = LogisticRegression(penalty='l2')

grid = GridSearchCV(clf_lr, param_grid, cv=5, scoring='accuracy')
grid.fit(Xtr, Ytr)

print(grid.best_params_)

#=====testing======
clf_lr = LogisticRegression(penalty='l2', C=grid.best_params_['C'])
clf_lr.fit(Xtr, Ytr)

y_pred = clf_lr.predict(Xte)

acc = accuracy_score(Yte, y_pred)
macro_f1 = f1_score(Yte, y_pred, average='macro')
micro_f1 = f1_score(Yte, y_pred, average='micro')

print(acc, macro_f1, micro_f1)



{'C': 8}
0.6889272437599575 0.6778761181105242 0.6889272437599575
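
In the output above, accuracy and micro-F1 are identical. For single-label multiclass problems this is expected, because micro-averaging pools all decisions before computing F1. Macro-F1 averages the per-class F1 scores instead, so rare classes count as much as frequent ones. A minimal sketch with hand-made labels (illustrative only, not taken from the dataset):

```python
from sklearn.metrics import accuracy_score, f1_score

# Hand-made labels (illustrative only): class 0 is frequent, class 2 is rare
y_true_demo = [0, 0, 0, 0, 1, 1, 2]
y_pred_demo = [0, 0, 0, 0, 1, 0, 0]

acc_demo = accuracy_score(y_true_demo, y_pred_demo)
micro_demo = f1_score(y_true_demo, y_pred_demo, average="micro")
# zero_division=0 scores the never-predicted class 2 as 0 without a warning
macro_demo = f1_score(y_true_demo, y_pred_demo, average="macro", zero_division=0)

print("accuracy:", acc_demo)
print("micro-F1:", micro_demo)
print("macro-F1:", macro_demo)
```

Here the classifier never predicts class 2, which lowers macro-F1 well below micro-F1 even though most samples are classified correctly.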


## 2. Task: Document Classification and Clustering

In this task, we use the [BBCNews](BBC_News_Train.csv) dataset. It contains 1490 articles from 5 topics: tech, business, sport, entertainment, and politics.

* Task 1: Please use KNN and logistic regression to do classification, and compare their performance.

* Task 2: Please use K-means to partition this dataset into 5 clusters and find the representative words in each cluster.

### 2.1 Load data and represent it with TF-IDF representation

In [1]:
# import csv file
from google.colab import drive
drive.mount('/content/drive')

from google.colab import files
uploaded = files.upload()

Mounted at /content/drive


Saving BBC_News_Train.csv to BBC_News_Train.csv


In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('BBC_News_Train.csv')

In [3]:
df.head()

| | ArticleId | Text | Category |
|---|---|---|---|
| 0 | 1833 | worldcom ex-boss launches defence lawyers defe... | business |
| 1 | 154 | german business confidence slides german busin... | business |
| 2 | 1101 | bbc poll indicates economic gloom citizens in ... | business |
| 3 | 1976 | lifestyle governs mobile choice faster bett... | tech |
| 4 | 917 | enron bosses in $168m payout eighteen former e... | business |


In [4]:
df.dtypes

ArticleId     int64
Text         object
Category     object
dtype: object

In [5]:
from sklearn.model_selection import train_test_split

data_train, data_test = train_test_split(df, test_size = 0.15, random_state = 42)

In [6]:
print('Train data target names: {}'.format(data_train["Category"].unique()))
print('num training samples: {}'.format(len(data_train)))
print('num testing samples: {}'.format(len(data_test)))

Train data target names: ['entertainment' 'tech' 'business' 'politics' 'sport']
num training samples: 1266
num testing samples: 224


**Represent the data with TF-IDF representation**



In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer

#TF-IDF representation for each document
vectorizer = TfidfVectorizer(stop_words = 'english')
data_train_vectors = vectorizer.fit_transform(data_train["Text"])
data_test_vectors = vectorizer.transform(data_test["Text"])

print(data_train_vectors.shape, data_test_vectors.shape)

(1266, 22864) (224, 22864)


### 2.2 Use KNN to do document classification

In [8]:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, f1_score

Xtr = data_train_vectors
Ytr = data_train['Category']

Xte = data_test_vectors
Yte = data_test['Category']

k_range = range(1, 5)
param_grid = dict(n_neighbors = k_range)

clf_knn =  KNeighborsClassifier(n_neighbors = 1)

grid = GridSearchCV(clf_knn, param_grid, cv = 5, scoring = 'accuracy')
grid.fit(Xtr, Ytr)

In [9]:
print(grid.best_score_)
print(grid.best_params_)

0.9296878404033488
{'n_neighbors': 4}


In [10]:
# test
clf_knn =  KNeighborsClassifier(n_neighbors = grid.best_params_['n_neighbors'])
clf_knn.fit(Xtr, Ytr)

y_pred = clf_knn.predict(Xte)

# performance
acc = accuracy_score(Yte, y_pred)
macro_f1 = f1_score(Yte, y_pred, average = 'macro')
micro_f1 = f1_score(Yte, y_pred, average = 'micro')

print('acc: {}, macro_f1: {}, micro_f1: {}'.format(acc, macro_f1, micro_f1))

acc: 0.9464285714285714, macro_f1: 0.9442512742303932, micro_f1: 0.9464285714285714


### 2.3 Use Logistic Regression to do document classification

In [11]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

#=====training with cross validation======
coeff = range(1, 10)
param_grid = dict(C=coeff)

clf_lr = LogisticRegression(penalty = 'l2')

grid = GridSearchCV(clf_lr, param_grid, cv = 5, scoring = 'accuracy')
grid.fit(Xtr, Ytr)

print(grid.best_params_)

{'C': 5}
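
To see what the selected `C` controls: in scikit-learn's `LogisticRegression`, `C` is the inverse of the regularization strength, so a larger `C` allows larger weights. The sketch below fits the default L2-regularized model on synthetic data (a stand-in chosen so the snippet runs on its own; the `*_demo` names and the values of `C` are illustrative) and compares the norm of the learned weight vector.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data (an assumption; not the BBC TF-IDF matrix)
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)

coef_norms = {}
for C in (0.01, 1, 100):
    # The default penalty is L2; max_iter is raised so the solver converges
    model = LogisticRegression(C=C, max_iter=1000).fit(X_demo, y_demo)
    coef_norms[C] = np.linalg.norm(model.coef_)
    print("C = {:>6}: ||w|| = {:.3f}".format(C, coef_norms[C]))
```

Small values of `C` shrink the weights toward zero, while large values let the model fit the training data more closely.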


In [12]:
# test
clf_lr = LogisticRegression(penalty='l2', C = grid.best_params_['C'])
clf_lr.fit(Xtr, Ytr)

y_pred = clf_lr.predict(Xte)

# performance

acc = accuracy_score(Yte, y_pred)
macro_f1 = f1_score(Yte, y_pred, average = 'macro')
micro_f1 = f1_score(Yte, y_pred, average = 'micro')

print('acc: {}, macro_f1: {}, micro_f1: {}'.format(acc, macro_f1, micro_f1))

acc: 0.9732142857142857, macro_f1: 0.9737690086489567, micro_f1: 0.9732142857142857
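
One possible refinement, sketched under assumptions: the cells above fit `TfidfVectorizer` on all training documents before calling `GridSearchCV`, so the IDF weights used in each fold have also seen that fold's validation documents. Wrapping the vectorizer and the classifier in a `Pipeline` refits the vectorizer inside every fold. The toy texts, labels, and step names (`tfidf`, `clf`) below are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus (made up for illustration, not taken from the BBC dataset)
toy_texts = [
    "the team won the match",
    "the striker scored a late goal",
    "coach praises the team after cup win",
    "the match ended in a draw",
    "shares fell as the market slipped",
    "the bank raised interest rates",
    "company profits rose this quarter",
    "market growth lifted the economy",
]
toy_labels = ["sport"] * 4 + ["business"] * 4

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Pipeline parameters are addressed as <step name>__<parameter name>
search = GridSearchCV(pipe, {"clf__C": [1, 5]}, cv=2, scoring="accuracy")
search.fit(toy_texts, toy_labels)

print(search.best_params_)
toy_pred = search.predict(["the team won the cup"])
print(toy_pred)
```

With a pipeline, raw text goes in directly; for the notebook's data that would be `data_train["Text"]` and `data_train["Category"]` rather than the precomputed `Xtr`.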


### 2.4 Use K-means to do document clustering and find the 10 most representative words in each cluster.

In [13]:
# k means with 5 clusters for 5 categories
from sklearn.cluster import KMeans

#clusters
cluster = KMeans(n_clusters = 5, random_state = 42, n_init = 'auto').fit(Xtr)

#centroids
centroids = cluster.cluster_centers_

# get words
terms = vectorizer.get_feature_names_out()

# get order of centroids
order_centroids = centroids.argsort()[:, ::-1]

# Print the most representative words for each cluster
for i in range(5):
    print("Cluster " + str(i) + ":")
    for index in order_centroids[i,:10]:
        print(str(terms[index]))
    print('')

Cluster 0:
mr
labour
election
blair
party
said
brown
tax
howard
government

Cluster 1:
growth
said
economy
economic
sales
year
eu
india
dollar
market

Cluster 2:
england
game
win
said
cup
chelsea
match
world
season
team

Cluster 3:
said
mr
people
mobile
new
music
firm
phone
government
uk

Cluster 4:
film
best
awards
actor
band
award
festival
films
star
oscar
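
The top words suggest that most clusters line up with one topic, while Cluster 3 mixes words such as "mobile", "music", and "government". One way to check this is to cross-tabulate cluster assignments against the known categories. The sketch below does so on a made-up toy corpus so that it runs on its own; in the notebook, the analogous inputs would be `cluster.labels_` and `data_train["Category"]`.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import adjusted_rand_score

# Toy corpus (made up for illustration, not taken from the BBC dataset)
toy_texts = [
    "the team won the match",
    "the team lost the cup match",
    "the coach praised the team",
    "the match ended in a draw for the team",
    "the market fell as shares slipped",
    "shares rose as the market recovered",
    "bank shares lifted the market",
    "the market closed higher on bank shares",
]
toy_labels = ["sport"] * 4 + ["business"] * 4

X_toy = TfidfVectorizer(stop_words="english").fit_transform(toy_texts)
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X_toy)

# Rows are clusters, columns are true categories, cells are document counts
table = pd.crosstab(
    pd.Series(km.labels_, name="cluster"),
    pd.Series(toy_labels, name="category"),
)
print(table)

# Adjusted Rand index: 1.0 means the clustering matches the categories exactly
ari = adjusted_rand_score(toy_labels, km.labels_)
print("ARI:", ari)
```

A row dominated by one category indicates a topically pure cluster, while a row spread over several categories points to a mixed one.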

