<a href="https://colab.research.google.com/github/mesahwi/TextAnlaysis/blob/master/Learning/LSA_textClassification_toy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Topic Modelling, for testing purposes (English texts used)



Rough Implementation

Step 0 : Import packages

In [0]:
import gensim
import numpy as np
import sklearn
import warnings
import os
warnings.filterwarnings('ignore')

from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
# from sklearn.feature_extraction.text import HashingVectorizer
# from sklearn.feature_extraction.text import TfidfTransformer
# from sklearn.feature_extraction.text import CountVectorizer

categories = [
    'talk.politics.mideast',
    'rec.sport.baseball',
    'sci.electronics'
]

from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

Step 1 : Gather Data

In [2]:
data_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)

length = len(np.unique(data_train.target))
for i in range(0, length):
  n = len(np.where(data_train.target==0)[0])
  print(categories[i], ' : ', n)
  i=i+1

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


talk.politics.mideast  :  597
rec.sport.baseball  :  597
sci.electronics  :  597


In [3]:
data_test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=42)

length = len(np.unique(data_test.target))
for i in range(0, length):
  n = len(np.where(data_test.target==0)[0])
  print(categories[i], ' : ', n)
  i=i+1

talk.politics.mideast  :  397
rec.sport.baseball  :  397
sci.electronics  :  397


Step 2 : Preprocess
 1. Create a tf-idf (or word-document) matrix
 2. Run a SVD on the matrix (A = u Sig v', where \
      u = Word matrix for Topic \\
      Sig = Topic Strength \\
      v = Document matrix for Topic)
      

In [0]:
# tf-idf matrix
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(data_train.data)  #X is in CSR format
# print(vectorizer.get_feature_names())

#SVD - naive : ARPACK
# XArr = X.toarray()
# U, sig, Vt = np.linalg.svd(XArr)
# sigDiag = np.diag(sig)


#SVD - faster : randomized solver
numComp = len(np.unique(data_train.target))
svd = TruncatedSVD(n_components = numComp, n_iter=100)
normalizer = Normalizer(copy=False)
lsa = make_pipeline(svd, normalizer)

X_train = lsa.fit_transform(X) # ~~ np.dot(sigDiag[:numComp, :numComp], Vt[:numComp, :])

It's worth noting that if you try to run a naive SVD with a sizeable dataset, the memory explodes, so making use of Truncated SVD is recommended

Step 3 : Train! (and test performance on training data)

In [0]:
#Testing performance of model on training data
gnb = GaussianNB()
gnbfit = gnb.fit(X_train, data_train.target)
y_retro_NB = gnbfit.predict(X_train)

svm = SVC()
svmfit = svm.fit(X_train, data_train.target)
y_retro_SVM = svmfit.predict(X_train)

In [9]:
dif = y_retro_NB - data_train.target
val, cnt = np.unique(dif, return_counts = True)
correctN = dict(zip(val, cnt))[0]
print ('Training set Performance NB : ', correctN / sum(cnt))

dif = y_retro_SVM - data_train.target
val, cnt = np.unique(dif, return_counts = True)
correctN = dict(zip(val, cnt))[0]
print ('Training set Performance SVM : ', correctN / sum(cnt))

Training set Performance NB :  0.8664383561643836
Training set Performance SVM :  0.8635844748858448


Step 4 : Test performance on testing dataset

1.   Preprocess testing dataset
2.   Test performance



In [0]:
X2 = vectorizer.fit_transform(data_test.data)
X_test = lsa.fit_transform(X2)

In [11]:
y_pred_NB = gnbfit.predict(X_test)
y_pred_SVM = svmfit.predict(X_test)


dif = y_pred_NB - data_test.target
val, cnt = np.unique(dif, return_counts = True)
correctN = dict(zip(val, cnt))[0]
print ('Testing set Performance NB : ', correctN / sum(cnt))

dif = y_pred_SVM - data_test.target
val, cnt = np.unique(dif, return_counts = True)
correctN = dict(zip(val, cnt))[0]
print ('Testing set Performance SVM : ', correctN / sum(cnt))

Testing set Performance NB :  0.8001715265866209
Testing set Performance SVM :  0.8456260720411664
