# Tutorial - Text Mining - Classification - SCIKIT-LEARN

We will predict the category of discussion posts in a newsgroup.

**The unit of analysis is a discussion post**

In [None]:
import pandas as pd
import numpy as np

In [None]:
news = pd.read_csv('news.csv')

In [None]:
news.head(5)

## Assign the "target" variable

This is a multi-class classification problem. There are three categories we will predict:<br>
Whether a post is "graphics," "hockey," or "medical" related

In [None]:
target = news['newsgroup']

## Assign the "text" (input) variable

In [None]:
# Check for missing values

news[['TEXT']].isna().sum()

In [None]:
input_data = news['TEXT']

## Split the data

In [None]:
from sklearn.model_selection import train_test_split

train_set, test_set, train_y, test_y = train_test_split(input_data, target, test_size=0.3, random_state=42)

In [None]:
train_set.shape, train_y.shape

In [None]:
test_set.shape, test_y.shape

## Sklearn: Text preparation

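We need to convert the raw text into numeric features. sklearn's CountVectorizer counts how often each word appears in each document. The code below uses TfidfVectorizer instead, which applies the same preprocessing and tokenization and then reweights the counts by TF-IDF.<br>
CountVectorizer: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html<br>
TfidfVectorizer: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html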

If you don't use the CountVectorizer, you have to do all the text prep on your own:<br>
1- Convert to lowercase<br>
2- Remove numbers (if needed)<br>
3- Remove punctuation<br>
4- Remove whitespace<br>
5- Tokenize<br>
6- Stemming<br>
etc.

Note that neither CountVectorizer nor TfidfVectorizer performs stemming or lemmatization. You may want to use NLTK for that (`import nltk`).
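
The manual steps listed above can be sketched in plain Python. This is only an illustration: the function name `basic_preprocess` and the sample sentence are made up for this example, and stemming is left out because it needs an extra library such as NLTK.

```python
import re
import string

def basic_preprocess(text):
    """Lowercase, drop digits and punctuation, then split on whitespace."""
    text = text.lower()                               # 1- convert to lowercase
    text = re.sub(r"\d+", " ", text)                  # 2- remove numbers
    # 3- replace every punctuation character with a space
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    text = text.translate(table)
    # 4/5- split() collapses repeated whitespace and tokenizes in one step
    return text.split()

# Hypothetical sentence, not taken from news.csv
print(basic_preprocess("The Goalie made 32 saves -- an AMAZING game!"))
# ['the', 'goalie', 'made', 'saves', 'an', 'amazing', 'game']
```

The vectorizers handle lowercasing and tokenization internally, so a function like this is only needed when you want full control over the preparation.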

In [None]:
#TfidfVectorizer includes pre-processing, tokenization, filtering stop words
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer(stop_words='english')

train_x_tr = tfidf_vect.fit_transform(train_set)

**Notice in the previous step that we use `fit_transform` on TRAIN. When we transform the TEST data, we need to use `transform` only. This enables us to keep the number of columns (features) the same across the data sets. Otherwise, they WILL be different, and no model will work!**
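
To see why this matters, here is a minimal sketch on toy documents (invented for illustration, not taken from `news.csv`). The names `toy_train`, `toy_test`, and `vect` are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents, made up for this example
toy_train = ["the goalie stopped the puck", "the doctor prescribed medicine"]
toy_test = ["the puck hit the referee"]

vect = TfidfVectorizer()
toy_train_x = vect.fit_transform(toy_train)  # learns the vocabulary from TRAIN
toy_test_x = vect.transform(toy_test)        # reuses the TRAIN vocabulary

# Both matrices have the same number of columns
print(toy_train_x.shape, toy_test_x.shape)

# Fitting a new vectorizer on TEST builds a different vocabulary,
# so the number of columns no longer matches the train matrix
wrong_x = TfidfVectorizer().fit_transform(toy_test)
print(wrong_x.shape)
```

Words in the test set that were never seen in train (here "hit" and "referee") are simply ignored by `transform`.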

In [None]:
# Perform the TfidfVectorizer transformation
# Be careful: We are using the vectorizer fitted on train to transform the test data set.
# Otherwise, the test features will be different and will not match the train set!

test_x_tr = tfidf_vect.transform(test_set)


In [None]:
train_x_tr.shape, test_x_tr.shape

In [None]:
# These data sets are sparse matrices. Displaying one shows only a summary (shape and number of stored elements)
train_x_tr

In [None]:
# To see the actual values, convert to a dense array using toarray() (this can use a lot of memory on large data)
train_x_tr.toarray()
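
Converting the whole matrix with `toarray()` can be expensive. As a rough sketch on toy documents (invented for this example), you can inspect how sparse the matrix is and densify only a few rows:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents, made up for this example
toy_docs = ["hockey game tonight", "graphics card driver", "hockey playoff game"]
toy_x = TfidfVectorizer().fit_transform(toy_docs)

n_rows, n_cols = toy_x.shape
# nnz is the number of explicitly stored (non-zero) entries
density = toy_x.nnz / (n_rows * n_cols)
print(f"shape={toy_x.shape}, stored non-zeros={toy_x.nnz}, density={density:.2f}")

# Densify only the first two rows instead of the whole matrix
print(toy_x[:2].toarray())
```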

## Latent Semantic Analysis (Singular Value Decomposition)

In [None]:
from sklearn.decomposition import TruncatedSVD

In [None]:
# For Latent Semantic Analysis, the scikit-learn documentation recommends n_components=100; this tutorial uses 300

svd = TruncatedSVD(n_components=300, n_iter=10)

In [None]:
train_x_lsa = svd.fit_transform(train_x_tr)

In [None]:
train_x_lsa.shape

In [None]:
train_x_lsa

### Let's transform the test data set

In [None]:
test_x_lsa = svd.transform(test_x_tr)

In [None]:
test_x_lsa.shape
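
The vectorizer and the SVD both follow the same rule: fit on train, then only transform test. One possible way to keep the two steps together is an sklearn `Pipeline`. The sketch below uses toy documents and only 2 components, both chosen purely for illustration; the name `lsa` is hypothetical.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Toy documents, made up for this example
toy_train = [
    "the goalie stopped the puck",
    "the team won the hockey game",
    "the doctor prescribed new medicine",
    "the patient felt pain after surgery",
    "the graphics card renders images",
    "the image file uses a new format",
]
toy_test = ["the hockey team lost the game"]

# TF-IDF followed by truncated SVD, i.e. a small LSA pipeline
lsa = make_pipeline(
    TfidfVectorizer(stop_words='english'),
    TruncatedSVD(n_components=2, random_state=42),
)

toy_train_lsa = lsa.fit_transform(toy_train)  # fits both steps on train
toy_test_lsa = lsa.transform(toy_test)        # applies the fitted steps to test

print(toy_train_lsa.shape, toy_test_lsa.shape)
```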

## Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier 

from sklearn.metrics import accuracy_score

In [None]:
rnd_clf = RandomForestClassifier(n_estimators=100, max_leaf_nodes=16, n_jobs=-1) 

rnd_clf.fit(train_x_lsa, train_y)



## Accuracy

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
#Train accuracy

train_y_pred = rnd_clf.predict(train_x_lsa)

train_acc = accuracy_score(train_y, train_y_pred)

print('Train acc: {}'.format(train_acc))

In [None]:
#Test accuracy

test_y_pred = rnd_clf.predict(test_x_lsa)

test_acc = accuracy_score(test_y, test_y_pred)

print('Test acc: {}'.format(test_acc))

## Confusion Matrix

In [None]:
from sklearn.metrics import confusion_matrix

# Usually computed on the test set; rows are true classes, columns are predicted classes
confusion_matrix(test_y, test_y_pred)
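
The raw matrix has no row or column labels. A small sketch of one way to label it with pandas follows. The label strings and the toy predictions are assumptions made for this example; the actual values in the `newsgroup` column may be spelled differently.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Assumed label names and toy predictions, made up for this example
labels = ["graphics", "hockey", "medical"]
toy_true = ["graphics", "hockey", "medical", "hockey", "medical", "graphics"]
toy_pred = ["graphics", "hockey", "hockey", "hockey", "medical", "medical"]

# labels= fixes the order of rows (true) and columns (predicted)
cm = confusion_matrix(toy_true, toy_pred, labels=labels)

cm_df = pd.DataFrame(
    cm,
    index=["true_" + name for name in labels],
    columns=["pred_" + name for name in labels],
)
print(cm_df)
```

The diagonal holds the correctly classified posts; off-diagonal cells show which categories get confused with each other.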

## Stochastic Gradient Descent Classifier

In [None]:
from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(max_iter=100)


In [None]:
sgd_clf.fit(train_x_lsa, train_y)

## Accuracy

In [None]:
#Train accuracy

train_y_pred = sgd_clf.predict(train_x_lsa)

train_acc = accuracy_score(train_y, train_y_pred)

print('Train acc: {}'.format(train_acc))

In [None]:
#Test accuracy

test_y_pred = sgd_clf.predict(test_x_lsa)

test_acc = accuracy_score(test_y, test_y_pred)

print('Test acc: {}'.format(test_acc))

## Confusion Matrix

In [None]:
from sklearn.metrics import confusion_matrix

# Usually computed on the test set; rows are true classes, columns are predicted classes
confusion_matrix(test_y, test_y_pred)

## Explore the SVD components - OPTIONAL

In [None]:
svd.explained_variance_.sum()
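
`explained_variance_` is on the scale of the data, so its sum is hard to interpret by itself. `TruncatedSVD` also exposes `explained_variance_ratio_`, which gives each component's share of the total variance. A minimal sketch on a random toy matrix (made up for this example, not the tutorial data):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Random toy matrix, used only to illustrate the attribute
rng = np.random.RandomState(0)
toy_matrix = rng.rand(20, 8)

toy_svd = TruncatedSVD(n_components=3, random_state=0)
toy_svd.fit(toy_matrix)

# Share of the total variance captured by each component
print(toy_svd.explained_variance_ratio_)

# Share captured by all kept components together (at most 1)
print(toy_svd.explained_variance_ratio_.sum())
```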

In [None]:
# These are all the components (one row per component, one column per term):
svd.components_

In [None]:
svd.components_.shape

In [None]:
#Let's select the first component:

first_component = svd.components_[0,:]

In [None]:
# Sort the weights in the first component, and get the indices

indeces = np.argsort(first_component).tolist()

In [None]:
# Be careful: np.argsort sorts in ascending order, so the indices of the smallest weights come first

print(indeces)

In [None]:
# Let's get the feature names from the TF-IDF vectorizer
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
feat_names = tfidf_vect.get_feature_names_out()

In [None]:
# Print the last 10 terms (i.e., the 10 terms that have the highest weights)

for index in indeces[-10:]:
    print(feat_names[index], "\t\tweight =", first_component[index])