# Lab 9: Document Analysis

In this assignment, we will learn how to perform document classification and clustering.



## 1. Example

In this example, we use the [20newsgroups](https://scikit-learn.org/stable/datasets/real_world.html#newsgroups-dataset) dataset. Each sample is a document, and there are 20 classes in total.

### 1.1 Load data

In [8]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_20newsgroups

data_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
data_test = fetch_20newsgroups(subset='test', remove=('headers', 'footers', 'quotes'))

print("Train data target labels: {}".format(data_train.target))
print("Train data target names: {}".format(data_train.target_names))

print('#training samples: {}'.format(len(data_train.data)))
print('#testing samples: {}'.format(len(data_test.data)))


Train data target labels: [7 4 4 ... 3 1 8]
Train data target names: ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
#training samples: 11314
#testing samples: 7532


### 1.2 Represent documents with TF-IDF representation

<dl>
    <div>
        <dt><abbr described-by='tf-idf-title'>TF-IDF</abbr></dt>
        <dd role='tooltip' id='tf-idf-title'>Term Frequency-Inverse Document Frequency</dd>
        <dd>Reflects how important a word is to a document in a collection</dd>
        <dd>
            $$
                \operatorname{TF\_IDF}\left(t, d, \mathcal{D}\right)
                := \operatorname{TF}\left(t, d\right)%
                \times\operatorname{IDF}\left(t, \mathcal{D}\right),
            $$
            where Term frequency
            $$
                \operatorname{TF}\left(t, d\right)
                := \frac{\#\left(t\text{ in document }d\right)}
                {\#\left(\mathbf{words}\text{ in document }d\right)}
            $$
            measures the frequency of a word in a document,
            and Inverse Document Frequency
            $$
                \operatorname{IDF}\left(t, \mathcal{D}\right)
                := \log\left(
                    \mathbb{P}^{-1}\!\left\{\mathcal{D}\text{ contains }t\right\}
                \right)
            $$
            measures the rareness of a word <strong>in all documents</strong>.
        </dd>
    </div>
</dl>
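To make the definition concrete, here is a minimal sketch that evaluates the formulas above by hand on a three-document toy corpus. The corpus and the helper names `tf`, `idf`, and `tf_idf` are illustrative assumptions, not part of the lab. Note that `TfidfVectorizer` with its default settings uses a smoothed IDF and L2-normalizes each document vector, so its numbers will differ from these.

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    """Term frequency: occurrences of `term` divided by the number of words in the document."""
    return Counter(doc_tokens)[term] / len(doc_tokens)

def idf(term, corpus_tokens):
    """Inverse document frequency: log of the inverse fraction of documents containing `term`.

    Assumes `term` occurs in at least one document (otherwise the fraction is zero).
    """
    n_containing = sum(1 for doc in corpus_tokens if term in doc)
    return math.log(len(corpus_tokens) / n_containing)

def tf_idf(term, doc_tokens, corpus_tokens):
    """Product of the two quantities, as in the definition above."""
    return tf(term, doc_tokens) * idf(term, corpus_tokens)

# Hypothetical toy corpus, already split into words.
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]

# "the" occurs in every document, so its IDF is log(1) = 0.
print(tf_idf("the", corpus[0], corpus))  # 0.0
# "mat" occurs once among 6 words and in 1 of 3 documents: (1/6) * log(3), about 0.183.
print(tf_idf("mat", corpus[0], corpus))
```

A word that appears everywhere gets weight zero regardless of how often it occurs, while a word confined to one document is weighted up.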

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF representation for each document
vectorizer = TfidfVectorizer()
data_train_vectors = vectorizer.fit_transform(data_train.data)
data_test_vectors = vectorizer.transform(data_test.data) 

print(data_train_vectors.shape, data_test_vectors.shape)


(11314, 101631) (7532, 101631)


### 1.3 Use KNN to do document classification

Here, we use cross-validation to select $K$.

In [4]:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, f1_score


Xtr = data_train_vectors
Ytr = data_train.target

Xte = data_test_vectors
Yte = data_test.target

k_range = range(1, 5)
param_grid = dict(n_neighbors=k_range)

clf_knn =  KNeighborsClassifier(n_neighbors=1)

grid = GridSearchCV(clf_knn, param_grid, cv=5, scoring='accuracy')
grid.fit(Xtr, Ytr)

print(grid.best_score_)
print(grid.best_params_)

0.16846385009722467
{'n_neighbors': 1}


### 1.4 Use Logistic Regression to do document classification
Here, we also use cross-validation to select the regularization parameter `C` (in scikit-learn, `C` is the inverse of the regularization strength).

In [19]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
import numpy as np

#=====training with cross validation======
coeff = range(1, 10)
param_grid = dict(C=coeff)

clf_lr = LogisticRegression(penalty='l2')

grid = GridSearchCV(clf_lr, param_grid, cv=5, scoring='accuracy')
grid.fit(Xtr, Ytr)

print(grid.best_params_)

#=====testing======
clf_lr = LogisticRegression(penalty='l2', C=grid.best_params_['C'])
clf_lr.fit(Xtr, Ytr)

y_pred = clf_lr.predict(Xte)

acc = accuracy_score(Yte, y_pred)
macro_f1 = f1_score(Yte, y_pred, average='macro')
micro_f1 = f1_score(Yte, y_pred, average='micro')

print(acc, macro_f1, micro_f1)



{'C': 8}
0.6889272437599575 0.6778761181105242 0.6889272437599575
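The output above shows accuracy and micro-averaged F1 with the same value. For single-label multiclass problems this is expected: every misclassified sample counts as one false positive (for the predicted class) and one false negative (for the true class), so the pooled F1 reduces to the fraction of correct predictions. The sketch below checks this by hand; the label lists and helper names are made up for illustration.

```python
from collections import Counter

def per_class_counts(y_true, y_pred):
    """Count true positives, false positives and false negatives per class."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # the predicted class gains a false positive
            fn[t] += 1  # the true class gains a false negative
    return tp, fp, fn

def f1(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN), defined as 0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_macro_f1(y_true, y_pred):
    tp, fp, fn = per_class_counts(y_true, y_pred)
    labels = sorted(set(y_true) | set(y_pred))
    # Macro: average the per-class F1 scores, each class weighted equally.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro: pool the counts over all classes, then compute a single F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return micro, macro

# Hypothetical labels: 5 of 7 predictions are correct.
y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 1, 1, 2, 2]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
micro, macro = micro_macro_f1(y_true, y_pred)
print(accuracy, micro, macro)  # accuracy and micro are both 5/7
```

Macro F1 differs because it weights rare and frequent classes equally, which is why it is reported alongside accuracy.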


## 2. Task: Document Classification and Clustering

In this task, we are going to use the [BBCNews](BBC_News_Train.csv) dataset. It contains 1490 articles from 5 topics: tech, business, sport, entertainment, and politics.

* Task 1: Please use KNN and logistic regression to do classification, and compare their performance.

* Task 2: Please use K-means to partition this dataset into 5 clusters and find the representative words in each cluster. 

In [56]:
# importing the necessary modules
import pandas as pd                                         # for the dataframe
# for TF-IDF representation
from sklearn.model_selection import train_test_split        # for splitting the data
from sklearn.feature_extraction.text import TfidfVectorizer # creates a TF-IDF vector from data
# for document classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, f1_score

In [73]:
# constants
DATASET_FILENAME = r'BBC_News_Train.csv'                    # filename of the dataset input
TEST_SIZE = 1.0/5.0                                         # proportion of test data
SEED = 42                                                   # seed for sampling
KILOBYTES_PER_BYTE = 1.0/1024.0                             # for converting bytes to KB
MAX_RANGE = 5                                               # largest number of neighbors K to try for KNN

### 2.1 Load data and represent it with TF-IDF representation

In [65]:
# load the BBCNews dataset
df = pd.read_csv(DATASET_FILENAME)

# split the data into article IDs, text, and category labels
(ids_train, ids_test, X_df_train, X_df_test, y_df_train, y_df_test) = \
    train_test_split(df[df.columns[0]], df[df.columns[1:-1]], df[df.columns[-1]],
                     test_size=TEST_SIZE, random_state=SEED)

# flatten X and y into 1-D arrays
X_train, X_test, y_train, y_test = \
    (part.to_numpy().ravel()
     for part in (X_df_train, X_df_test, y_df_train, y_df_test))

#TF-IDF representation for each document
vectorizer = TfidfVectorizer()
M_train = vectorizer.fit_transform(X_train)
M_test = vectorizer.transform(X_test)

# get the shape and summary data
(TOTAL_ARTICLES, _) = df.shape
(N_ARTICLES, ) = X_train.shape

# display the number of articles
print(r"{} total articles".format(TOTAL_ARTICLES))
print()
print(type(X_train))
print(r"{} entries".format(N_ARTICLES))
print("{} non-null Count".format(sum(X_train != None)))
print(r"dtypes: {}".format(X_train.dtype))
print(r"memory usage: {:.1f} KB".format(X_train.nbytes * KILOBYTES_PER_BYTE))
print()
print("{} non-null Count".format(sum(y_train != None)))
print(r"dtypes: {}".format(y_train.dtype))
print(r"memory usage: {:.1f} KB".format(y_train.nbytes * KILOBYTES_PER_BYTE))
print()
print(r"TF-IDF training data shape: {}".format(M_train.shape))
print(r"TF-IDF testing data shape: {}".format(M_test.shape))

1490 total articles

<class 'numpy.ndarray'>
1192 entries
1192 non-null Count
dtypes: object
memory usage: 9.3 KB

1192 non-null Count
dtypes: object
memory usage: 9.3 KB

TF-IDF training data shape: (1192, 22591)
TF-IDF testing data shape: (298, 22591)


### 2.2 Use KNN to do document classification

In [74]:
param_grid = dict(n_neighbors=range(1, (1 + MAX_RANGE)))

clf_knn =  KNeighborsClassifier(n_neighbors=1)

grid = GridSearchCV(clf_knn, param_grid, cv=5, scoring='accuracy')
grid.fit(M_train, y_train)

print(grid.best_score_)
print(grid.best_params_)

0.9177701206005414
{'n_neighbors': 4}


### 2.3 Use Logistic Regression to do document classification

In [22]:
# your code
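
One possible starting point for this step, sketched on a toy corpus rather than the BBC articles so that it runs on its own. The documents, labels, and the grid of `C` values are assumptions for illustration. Wrapping the vectorizer and the classifier in a pipeline means the TF-IDF vocabulary is refit inside each cross-validation fold.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the article texts and categories.
train_docs = [
    "football team wins cup",
    "football team scores goal",
    "football team loses match",
    "football team signs striker",
    "market shares fall sharply",
    "market shares rise again",
    "market shares hit bank",
    "market shares lift profit",
]
train_labels = ["sport"] * 4 + ["business"] * 4
test_docs = ["football team wins match", "market shares fall again"]
test_labels = ["sport", "business"]

# TF-IDF followed by logistic regression (L2 penalty is the scikit-learn default).
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

# make_pipeline names each step after its lowercased class name.
param_grid = {"logisticregression__C": [0.1, 1, 10]}

# cv=2 only because the toy corpus is tiny; the examples above use cv=5.
grid = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
grid.fit(train_docs, train_labels)

# GridSearchCV refits the best pipeline on all training data by default.
y_pred = grid.predict(test_docs)
print(grid.best_params_)
print(accuracy_score(test_labels, y_pred),
      f1_score(test_labels, y_pred, average="macro"))
```

For the lab itself, the same pattern would apply to `X_train`, `y_train`, `X_test`, and `y_test` from section 2.1, and the test accuracy can then be compared with the KNN result.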

### 2.4 Use K-means to do document clustering and find the 10 most representative words in each cluster. 

In [23]:
# your code