# Tutorial - Text Mining - Classification 

We will predict the category of discussion posts in a newsgroup.

**The unit of analysis is a discussion post**

### Import common packages

In [1]:
import pandas as pd
import numpy as np

np.random_seed = 1

### Load data

In [2]:
news = pd.read_csv('news.csv')

news.shape


(597, 5)

In [3]:
news.head(5)

Unnamed: 0,TEXT,graphics,hockey,medical,newsgroup
0,I have a few reprints left of chapters from my...,1,0,0,graphics
1,"gnuplot, etc. make it easy to plot real valued...",1,0,0,graphics
2,Article-I.D.: snoopy.1pqlhnINN8k1 References: ...,1,0,0,graphics
3,"Hello, I am looking to add voice input capabil...",1,0,0,graphics
4,I recently got a file describing a library of ...,1,0,0,graphics


### Check for missing values

In [4]:
news[['TEXT']].isna().sum()

TEXT    0
dtype: int64

## Assign the input variable to X and the target variable to y

In [5]:
X = news['TEXT']

This is a multi-class classification problem. There are three categories we will predict:<br>
Whether a post is "graphics," "hockey," or "medical" related

In [6]:
y = news['newsgroup']
y.unique()

array(['graphics', 'hockey', 'medical'], dtype=object)

In [7]:
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(y)
print(le.classes_)
y = le.transform(y)

y


['graphics' 'hockey' 'medical']


array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

## Split the data

In [8]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [9]:
X_train.shape, y_train.shape

((417,), (417,))

In [10]:
X_test.shape, y_test.shape

((180,), (180,))

In [11]:
X_train.head(5)

536    Thanks for all your assistance. I'll see if he...
431    My 14-y-o son has the usual teenage spotty chi...
451    Article-I.D.: pitt.19408 References: < x> < LM...
63     In a previous article, trb3@Ra.MsState.Edu (To...
590    Article-I.D.: reed.1993Apr16.170752.6312 Refer...
Name: TEXT, dtype: object

In [12]:
y_train[:5]

array([2, 2, 2, 0, 2])

## Sklearn: Text preparation

For simplicity (and focus), we will not do any text cleaning or preprocessing. We will just use the raw text as input to the model. See the text mining fundamentals tutorial for more details on text cleaning and preprocessing.

In [13]:
#TfidfVectorizer includes pre-processing, tokenization, filtering stop words
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer(stop_words='english', lowercase=True, token_pattern="[^\W\d_]+")

X_train = tfidf_vect.fit_transform(X_train)

**Notice in the previous step that we use `fit_transform` on TRAIN. When we transform the TEST data, we need to use `transform` only. This enables us to keep the number of columns (features) the same across the data sets. Otherwise, they WILL be different, and no model will work!**

In [14]:
# Perform the TfidfVectorizer transformation
# Be careful: We are using the train fit to transform the test data set. Otherwise, the test data 
# features will be very different and match the train set!!!

X_test = tfidf_vect.transform(X_test)


In [15]:
X_train.shape, X_test.shape

((417, 9887), (180, 9887))

In [16]:
# These data sets are "sparse matrix". We can't see them unless we convert using toarray()
X_train

<417x9887 sparse matrix of type '<class 'numpy.float64'>'
	with 29911 stored elements in Compressed Sparse Row format>

In [17]:
# These data sets are "sparse matrix". We can't see them unless we convert using toarray()
X_train.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

## Latent Semantic Analysis (Singular Value Decomposition)

### SVD with n_componets = 100

In [20]:
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=100, n_iter=10) #n_components is the number of topics, which should be less than the number of features

X_train_100= svd.fit_transform(X_train)
X_test_100 = svd.transform(X_test)


In [21]:
X_train_100.shape, X_test_100.shape

((417, 100), (180, 100))

### SVD with n_componets = 300

In [20]:
svd = TruncatedSVD(n_components=300, n_iter=10) #n_components is the number of topics, which should be less than the number of features

X_train_300 = svd.fit_transform(X_train)
X_test_300 = svd.transform(X_test)


In [21]:
X_train_300.shape, X_test_300.shape

((417, 300), (180, 300))

### SVD with n_componets = 500

In [24]:
svd = TruncatedSVD(n_components=500, n_iter=10) #n_components is the number of topics, which should be less than the number of features

X_train_500 = svd.fit_transform(X_train)
X_test_500 = svd.transform(X_test)


In [25]:
X_train_500.shape, X_test_500.shape

((417, 417), (180, 417))

#### Notice that even though the n_components value is 500, the shape shows only 417 columns. Meaning the value cannot be more than the number of features.


## Random Forest

In [34]:
from sklearn.ensemble import RandomForestClassifier 

rnd_clf_100 = RandomForestClassifier(n_estimators=100, max_leaf_nodes=16, n_jobs=-1) 
_ = rnd_clf_100.fit(X_train_100, y_train)

rnd_clf_300 = RandomForestClassifier(n_estimators=100, max_leaf_nodes=16, n_jobs=-1)
_ = rnd_clf_300.fit(X_train_300, y_train)

rnd_clf_500 = RandomForestClassifier(n_estimators=100, max_leaf_nodes=16, n_jobs=-1)
_ = rnd_clf_500.fit(X_train_500, y_train)

### Evaluating Model Performance

In [35]:
from sklearn.metrics import accuracy_score

In [37]:
#Test accuracy
y_pred_test_100 = rnd_clf_100.predict(X_test_100)
#acc = accuracy_score(y_test, y_pred_test_100)
print(f"Test acc 100: {accuracy_score(y_test, y_pred_test_100):.4f}")

y_pred_test_300 = rnd_clf_300.predict(X_test_300)
#acc = accuracy_score(y_test, y_pred_test)
print(f"Test acc 300: {accuracy_score(y_test, y_pred_test_300):.4f}")

y_pred_test_500 = rnd_clf_500.predict(X_test_500)
#acc = accuracy_score(y_test, y_pred_test)
print(f"Test acc 500: {accuracy_score(y_test, y_pred_test_500):.4f}")

Test acc 100: 0.8833
Test acc 300: 0.8556
Test acc 500: 0.8278


In [38]:
# Confusion Matrix
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, y_pred_test_100)

array([[56,  0,  6],
       [ 0, 45,  1],
       [ 9,  5, 58]], dtype=int64)

In [39]:
# Confusion Matrix
confusion_matrix(y_test, y_pred_test_300)

array([[49,  2, 11],
       [ 0, 45,  1],
       [ 8,  4, 60]], dtype=int64)

In [40]:
# Confusion Matrix
confusion_matrix(y_test, y_pred_test_500)

array([[55,  4,  3],
       [ 0, 45,  1],
       [10, 13, 49]], dtype=int64)

## Stochastic Gradient Descent Classifier

In [41]:
from sklearn.linear_model import SGDClassifier

sgd_clf_100 = SGDClassifier(max_iter=100)
_ = sgd_clf_100.fit(X_train_100, y_train)

sgd_clf_300 = SGDClassifier(max_iter=100)
_ = sgd_clf_300.fit(X_train_300, y_train)

sgd_clf_500 = SGDClassifier(max_iter=100)
_ = sgd_clf_500.fit(X_train_500, y_train)

### Evaluating Model Performance

In [44]:
#Test accuracy
y_pred_test_100 = sgd_clf_100.predict(X_test_100)
print(f"Test acc 100: {accuracy_score(y_test, y_pred_test_100):.4f}")

y_pred_test_300 = sgd_clf_300.predict(X_test_300)
print(f"Test acc 300: {accuracy_score(y_test, y_pred_test_300):.4f}")

y_pred_test_500 = sgd_clf_500.predict(X_test_500)
print(f"Test acc 500: {accuracy_score(y_test, y_pred_test_500):.4f}")

Test acc 100: 0.9167
Test acc 300: 0.9111
Test acc 500: 0.9333


In [45]:
# Confusion Matrix
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, y_pred_test_100)

array([[61,  1,  0],
       [ 1, 45,  0],
       [11,  2, 59]], dtype=int64)

In [46]:
# Confusion Matrix
confusion_matrix(y_test, y_pred_test_300)

array([[62,  0,  0],
       [ 1, 45,  0],
       [12,  3, 57]], dtype=int64)

In [47]:
# Confusion Matrix
confusion_matrix(y_test, y_pred_test_500)

array([[62,  0,  0],
       [ 1, 45,  0],
       [10,  1, 61]], dtype=int64)

### SUMMARY

SVD is a widely used technique to decompose a matrix into several component matrices, exposing many of the useful and interesting properties of the original matrix. The impact of applying SVD to data is considerable. Once we reduce the columns using SVD, we tend to compress and loose some information compared to the original data. As we might try to increase the n_components value, we add more data to the algorithms for processing.
We may want to use SGD to get summarized information when there is a 1000's of columns in text and processing them all is not possible.

The Random Forest model evaluation was highest(88.3%) when the SVD n_components value was the least(100). Thus, it is evident that, as the n-component value increases, there is additional data to process and evaluate, decreasing the accuracy of the model to 82.7% at n_components as 500.

In the case of SGD classifier evaluation, the model accuracy went down from 91.6%(for 100 components) to 91.1%(for 300 components). But displayed highest accuracy of 93.3% at n_components as 500.