# Project 3: Topic Models
In this project, we analyze the 20 Newsgroups dataset (http://qwone.com/~jason/20Newsgroups/) using topic models. We consider all articles in the following two newsgroups: comp.sys.ibm.pc.hardware and comp.sys.mac.hardware.

After removing stopwords from the vocabulary and limiting it to the 1000 most frequent terms, we summarize the N_train training articles and N_test testing articles in a 1000 × (N_train + N_test) word-frequency count matrix.

Denote this matrix as X.
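
As a rough illustration of how X could be assembled, here is a minimal sketch on a toy corpus. The toy documents, the `toy_stopwords` list, and the choice to pick the vocabulary from the training articles only are assumptions made for this example; in the project, the inputs would be `hardware_train.data`, `hardware_test.data`, and the stopword file.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for hardware_train.data / hardware_test.data
train_docs = ["the mac has scsi drive",
              "the pc has an ide drive and an ide controller"]
test_docs = ["mac scsi controller"]
toy_stopwords = ["the", "has", "an", "and"]

# Assumption: the vocabulary is chosen from the training articles only
count_vectorizer = CountVectorizer(max_features=1000, stop_words=toy_stopwords)
train_counts = count_vectorizer.fit_transform(train_docs)  # documents x terms
test_counts = count_vectorizer.transform(test_docs)        # same vocabulary

# Transpose so that rows are terms and columns are documents,
# with the training columns first and the test columns after them
X = hstack([train_counts.T, test_counts.T]).tocsr()
print(X.shape)  # (number of terms, N_train + N_test)
```

With the real data and at least 1000 distinct non-stopword terms, the same steps would give a 1000 × (N_train + N_test) matrix.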

In [2]:
# load the two hardware newsgroups with scikit-learn's 20 Newsgroups loader
from sklearn.datasets import fetch_20newsgroups
from pprint import pprint

categories = ['comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware']
hardware_train = fetch_20newsgroups(subset='train', categories=categories)
hardware_test = fetch_20newsgroups(subset='test', categories=categories)

#### Q1. Use tf as the feature for each document. Train a binary logistic regression model on the training set. Evaluate its document classification accuracy on the testing set.

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer

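# read the stopword list (whitespace-separated words) and close the file afterwards
with open("stopwords.txt") as f:
    stopwords = f.read().split()

# vectorize the training data using term frequency (tf) as the feature;
# use_idf=False turns off idf weighting (rows are still L2-normalized by default)
tf_vectorizer = TfidfVectorizer(use_idf=False,
                                max_features=1000,
                                stop_words=stopwords)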

tf_vectors = tf_vectorizer.fit_transform(hardware_train.data)

print("train vector shape:", tf_vectors.shape)

# logistic regression classifier and evaluation metrics
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

# vectorize test data
tf_vectors_test = tf_vectorizer.transform(hardware_test.data)
print("test vector shape:", tf_vectors_test.shape)



# train
tf_clf = LogisticRegression(solver='lbfgs')
tf_clf.fit(tf_vectors, hardware_train.target)

# predict
tf_pred = tf_clf.predict(tf_vectors_test)
print("f1_score", f1_score(hardware_test.target,tf_pred,average='macro'))
print("accuracy:", accuracy_score(hardware_test.target, tf_pred))

train vector shape: (1168, 1000)
test vector shape: (777, 1000)
f1_score 0.8712826017811705
accuracy: 0.8712998712998713


#### Q2. Use tf-idf as the feature for each document and repeat Q1.

In [6]:
# vectorize the training data using tf-idf as the feature
tfidf_vectorizer = TfidfVectorizer(use_idf=True, 
                                max_features=1000, 
                                stop_words = stopwords)

tfidf_vectors = tfidf_vectorizer.fit_transform(hardware_train.data)

# vectorize test data
tfidf_vectors_test = tfidf_vectorizer.transform(hardware_test.data)

# train
tfidf_clf = LogisticRegression(solver='lbfgs')
tfidf_clf.fit(tfidf_vectors, hardware_train.target)

# predict
tfidf_pred = tfidf_clf.predict(tfidf_vectors_test)
print("f1_score", f1_score(hardware_test.target,tfidf_pred, average='macro'))
print("accuracy:", accuracy_score(hardware_test.target, tfidf_pred))

f1_score 0.8815799936386769
accuracy: 0.8815958815958816


#### Q3. Decompose X into USV^T using SVD, where the dimension of U is set to 1000×20. Use SV^T as the document features and repeat Q1.

In [13]:
# TODO: X (the 1000 x (N_train + N_test) count matrix) has not been built yet;
# for now, inspect the tf matrices from Q1
print(type(tf_vectors))
print(type(tf_vectors_test))
print(tf_vectors.shape)
print(tf_vectors_test.shape)

<class 'scipy.sparse.csr.csr_matrix'>
<class 'scipy.sparse.csr.csr_matrix'>
(1168, 1000)
(777, 1000)
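
One possible way to approach Q3 once X exists is sketched below. It is not the notebook's solution: the matrix is a small synthetic stand-in (Poisson counts with two made-up class profiles), the sizes and `k = 5` are chosen only to keep the example fast, and `np.linalg.svd` on a dense array is used for simplicity. For the real sparse 1000 × (N_train + N_test) matrix with 20 components, a truncated solver such as `scipy.sparse.linalg.svds` would likely be a better fit.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Hypothetical stand-in for X: 50 terms, 40 training and 20 test documents
n_terms, n_train, n_test, k = 50, 40, 20, 5
labels = rng.integers(0, 2, size=n_train + n_test)
rates = rng.uniform(0.2, 3.0, size=(n_terms, 2))   # one term-rate profile per class
X = rng.poisson(rates[:, labels]).astype(float)    # terms x documents

# Thin SVD, then keep the k largest singular values and vectors
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_k = U[:, :k]                                     # n_terms x k

# Columns of S V^T are the k-dimensional document features;
# transpose so that each row is one document
features = (s[:k, None] * Vt[:k]).T
train_feats, test_feats = features[:n_train], features[n_train:]
y_train, y_test = labels[:n_train], labels[n_train:]

# Repeat the Q1 classifier on the reduced features
svd_clf = LogisticRegression(solver='lbfgs', max_iter=1000)
svd_clf.fit(train_feats, y_train)
acc = accuracy_score(y_test, svd_clf.predict(test_feats))
print("accuracy:", acc)
```

Because the columns of U are orthonormal, the same features can be obtained by projecting the documents onto the first k left singular vectors, that is, `U_k.T @ X` equals the truncated SV^T. The projection form also applies to documents that were not part of the decomposition.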
