<a href="https://colab.research.google.com/github/mesahwi/TextAnlaysis/blob/master/Learning/LSA_textClassification_toy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Text classification using Latent Semantic Analysis,  for testing purposes <br>
Using 20newsgroups

Step 0 : Import packages

In [0]:
import gensim
import numpy as np
import sklearn
import warnings
import os
warnings.filterwarnings('ignore')

from sklearn.datasets import fetch_20newsgroups
categories = [
    'talk.politics.mideast',
    'rec.sport.baseball',
    'sci.electronics'
]

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
# from sklearn.feature_extraction.text import HashingVectorizer
# from sklearn.feature_extraction.text import TfidfTransformer
# from sklearn.feature_extraction.text import CountVectorizer


from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

Step 1 : Gather Data

In [2]:
print('-----Training Dataset Word Count-----')
data_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)
length = len(np.unique(data_train.target))
for i in range(length):
  n = len(np.where(data_train.target==i)[0])
  wc = 0
  for j in range(n):
    wc = wc + len(data_train.data[j])
    
  print('type',i, ' : ', wc)
  i=i+1
  
print('-----Test Dataset Word Count-----')  
data_test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=42)
length = len(np.unique(data_test.target))
for i in range(length):
  n = len(np.where(data_test.target==i)[0])
  wc = 0
  for j in range(n):
    wc = wc + len(data_test.data[j])
    
  print('type',i, ' : ', wc)
  i=i+1


-----Training Dataset Word Count-----
type 0  :  1175617
type 1  :  1163580
type 2  :  1130432
-----Test Dataset Word Count-----
type 0  :  735298
type 1  :  726829
type 2  :  693695


Step 2 : Preprocess
 1. Create a tf-idf (or word-document) matrix
 2. Run a SVD on the matrix (A = u Sig v', where \
      u = Word matrix for Topic \\
      Sig = Topic Strength \\
      v = Document matrix for Topic)
      

In [0]:
# tf-idf matrix
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(data_train.data)  #X is in CSR format
# print(vectorizer.get_feature_names())

#SVD - naive : ARPACK
# XArr = X.toarray()
# U, sig, Vt = np.linalg.svd(XArr)
# sigDiag = np.diag(sig)


#SVD - faster : randomized solver
numComp = len(np.unique(data_train.target))
svd = TruncatedSVD(n_components = numComp, n_iter=100)
normalizer = Normalizer(copy=False)
lsa = make_pipeline(svd, normalizer)

X_train = lsa.fit_transform(X) # ~~ np.dot(sigDiag[:numComp, :numComp], Vt[:numComp, :])

It's worth noting that if you try to run a naive SVD with a sizeable dataset, the memory explodes, so making use of Truncated SVD is recommended

Step 3 : Train! (and test performance on training data)

In [0]:
#Testing performance of model on training data
gnb = GaussianNB()
gnbfit = gnb.fit(X_train, data_train.target)
y_retro_NB = gnbfit.predict(X_train)

svm = SVC()
svmfit = svm.fit(X_train, data_train.target)
y_retro_SVM = svmfit.predict(X_train)

In [5]:
dif = y_retro_NB - data_train.target
val, cnt = np.unique(dif, return_counts = True)
correctN = dict(zip(val, cnt))[0]
print ('Training set Performance NB : ', correctN / sum(cnt))

dif = y_retro_SVM - data_train.target
val, cnt = np.unique(dif, return_counts = True)
correctN = dict(zip(val, cnt))[0]
print ('Training set Performance SVM : ', correctN / sum(cnt))

Training set Performance NB :  0.8664383561643836
Training set Performance SVM :  0.8635844748858448


Step 4 : Test performance on testing dataset

1.   Preprocess test dataset
2.   Test performance



In [0]:
X2 = vectorizer.fit_transform(data_test.data)
X_test = lsa.fit_transform(X2)

In [7]:
y_pred_NB = gnbfit.predict(X_test)
y_pred_SVM = svmfit.predict(X_test)


dif = y_pred_NB - data_test.target
val, cnt = np.unique(dif, return_counts = True)
correctN = dict(zip(val, cnt))[0]
print ('Test set Performance NB : ', correctN / sum(cnt))

dif = y_pred_SVM - data_test.target
val, cnt = np.unique(dif, return_counts = True)
correctN = dict(zip(val, cnt))[0]
print ('Test set Performance SVM : ', correctN / sum(cnt))

Test set Performance NB :  0.8001715265866209
Test set Performance SVM :  0.8456260720411664
