# Text Mining - Classification 

We will predict the category of discussion posts in a newsgroup.

**The unit of analysis is a discussion post**

### 1. Import common packages

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

np.random_seed = 1

### 2. Load data

In [2]:
news = pd.read_csv('news.csv')
news

Unnamed: 0,TEXT,graphics,hockey,medical,newsgroup
0,I have a few reprints left of chapters from my...,1,0,0,graphics
1,"gnuplot, etc. make it easy to plot real valued...",1,0,0,graphics
2,Article-I.D.: snoopy.1pqlhnINN8k1 References: ...,1,0,0,graphics
3,"Hello, I am looking to add voice input capabil...",1,0,0,graphics
4,I recently got a file describing a library of ...,1,0,0,graphics
...,...,...,...,...,...
592,carl@SOL1.GPS.CALTECH.EDU (Carl J Lydick) writ...,0,0,1,medical
593,"In article < 1qmlgaINNjab@hp-col.col.hp.com> ,...",0,0,1,medical
594,Article-I.D.: kestrel.1993Apr16.172052.27843 R...,0,0,1,medical
595,"In article < 1qmlgaINNjab@hp-col.col.hp.com> ,...",0,0,1,medical


In [3]:
news.shape

(597, 5)

In [4]:
news.head(5)

Unnamed: 0,TEXT,graphics,hockey,medical,newsgroup
0,I have a few reprints left of chapters from my...,1,0,0,graphics
1,"gnuplot, etc. make it easy to plot real valued...",1,0,0,graphics
2,Article-I.D.: snoopy.1pqlhnINN8k1 References: ...,1,0,0,graphics
3,"Hello, I am looking to add voice input capabil...",1,0,0,graphics
4,I recently got a file describing a library of ...,1,0,0,graphics


In [5]:
news.tail(5)

Unnamed: 0,TEXT,graphics,hockey,medical,newsgroup
592,carl@SOL1.GPS.CALTECH.EDU (Carl J Lydick) writ...,0,0,1,medical
593,"In article < 1qmlgaINNjab@hp-col.col.hp.com> ,...",0,0,1,medical
594,Article-I.D.: kestrel.1993Apr16.172052.27843 R...,0,0,1,medical
595,"In article < 1qmlgaINNjab@hp-col.col.hp.com> ,...",0,0,1,medical
596,"I have a 42 yr old male friend, misdiagnosed a...",0,0,1,medical


### 3. Check for missing values

In [6]:
news[['TEXT']].isna().sum()

TEXT    0
dtype: int64

It is clear that there are no missing values.

### 4. Assign the input variable to X and the target variable to y

In [7]:
X = news['TEXT']

This is a multi-class classification problem. There are three categories we will predict:<br>
Whether a post is "graphics," "hockey," or "medical" related

In [8]:
y = news['newsgroup']
y.unique()

array(['graphics', 'hockey', 'medical'], dtype=object)

In [9]:
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(y)
print(le.classes_)
y = le.transform(y)

y


['graphics' 'hockey' 'medical']


array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

### 5. Split the data

In [10]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [11]:
X_train.shape, y_train.shape

((417,), (417,))

In [12]:
X_test.shape, y_test.shape

((180,), (180,))

In [13]:
X_train.head(5)

120    In article 2G1@bcstec.ca.boeing.com, rgc3679@b...
402    In article < ng4.733990422@husc.harvard.edu> ,...
454    Article-I.D.: pitt.19422 References: < 19211@p...
314    Article-I.D.: hydra.91678 References: < 1993Ap...
Name: TEXT, dtype: object

In [14]:
y_train[:5]

array([0, 0, 2, 2, 1])

### 6.  Sklearn: Text preparation

For simplicity (and focus), we will not do any text cleaning or preprocessing. We will just use the raw text as input to the model. See the text mining fundamentals tutorial for more details on text cleaning and preprocessing.

In [15]:
#TfidfVectorizer includes pre-processing, tokenization, filtering stop words
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer(stop_words='english', lowercase=True, token_pattern="[^\W\d_]+")

X_train = tfidf_vect.fit_transform(X_train)

**Notice in the previous step that we use `fit_transform` on TRAIN. When we transform the TEST data, we need to use `transform` only. This enables us to keep the number of columns (features) the same across the data sets. Otherwise, they WILL be different, and no model will work!**

In [16]:
# Perform the TfidfVectorizer transformation
# Be careful: We are using the train fit to transform the test data set. Otherwise, the test data 
# features will be very different and match the train set!!!

X_test = tfidf_vect.transform(X_test)


In [17]:
X_train.shape, X_test.shape

((417, 10192), (180, 10192))

In [18]:
# These data sets are "sparse matrix". We can't see them unless we convert using toarray()
X_train

<417x10192 sparse matrix of type '<class 'numpy.float64'>'
	with 30482 stored elements in Compressed Sparse Row format>

In [19]:
# These data sets are "sparse matrix". We can't see them unless we convert using toarray()
X_train.toarray()

array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.17269124, 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

### 7. Latent Semantic Analysis (Singular Value Decomposition)

In [20]:
from sklearn.decomposition import TruncatedSVD

#svd = TruncatedSVD(n_components=300, n_iter=10) #n_components is the number of topics, which should be less than the number of feature
    
# n_components = 100
svd_1 = TruncatedSVD(n_components=100, n_iter=10)
X_train_svd_1 = svd_1.fit_transform(X_train)
X_test_svd_1 = svd_1.transform(X_test)

# n_components = 300
svd_2 = TruncatedSVD(n_components=300, n_iter=10)
X_train_svd_2 = svd_2.fit_transform(X_train)
X_test_svd_2 = svd_2.transform(X_test)

# n_components = 500
svd_3 = TruncatedSVD(n_components=500, n_iter=10)
X_train_svd_3 = svd_3.fit_transform(X_train)
X_test_svd_3 = svd_3.transform(X_test)


In [21]:
X_train.shape, X_test.shape

((417, 10192), (180, 10192))

### 8.  Random Forest Model Performance

In [22]:
from sklearn.ensemble import RandomForestClassifier

# Define n_components to try
n_components_list = [100, 300, 500]

for n in n_components_list:
    print(f"n_components = {n}")
    
    # Apply TruncatedSVD to reduce dimensionality
    svd = TruncatedSVD(n_components=n, n_iter=10)
    X_train_svd = svd.fit_transform(X_train)
    X_test_svd = svd.transform(X_test)
    
    # Train a Random Forest Classifier on the reduced data
    rf_clf = RandomForestClassifier()
    _ = rf_clf.fit(X_train_svd, y_train)
    
    # Evaluate the model on the train set
    y_pred_train = rf_clf.predict(X_train_svd)
    train_acc = accuracy_score(y_train, y_pred_train)
    print(f"Train acc: {train_acc:.4f}")
    
    # Evaluate the model on the test set
    y_pred_test = rf_clf.predict(X_test_svd)
    test_acc = accuracy_score(y_test, y_pred_test)
    print(f"Test acc: {test_acc:.4f}")
    
    # Print the confusion matrix
    print(f"Confusion matrix:\n{confusion_matrix(y_test, y_pred_test)}\n")

n_components = 100
Train acc: 0.9952
Test acc: 0.8833
Confusion matrix:
[[52  0  9]
 [ 0 53  5]
 [ 5  2 54]]

n_components = 300
Train acc: 0.9952
Test acc: 0.9111
Confusion matrix:
[[53  0  8]
 [ 0 56  2]
 [ 3  3 55]]

n_components = 500
Train acc: 0.9952
Test acc: 0.8889
Confusion matrix:
[[54  0  7]
 [ 0 53  5]
 [ 4  4 53]]



### 9. Stochastic Gradient Descent Classifier Model Performance

In [23]:
for n in [100, 300, 500]:
    print(f"n_components = {n}")
    svd = TruncatedSVD(n_components=n, n_iter=10)
    X_train_svd = svd.fit_transform(X_train)
    X_test_svd = svd.transform(X_test)

    sgd_clf = SGDClassifier(max_iter=100)
    _ = sgd_clf.fit(X_train_svd, y_train)

    # Train accuracy
    y_pred_train = sgd_clf.predict(X_train_svd)
    print(f"Train acc (n_components={n}): {accuracy_score(y_train, y_pred_train):.4f}")

    # Test accuracy
    y_pred_test = sgd_clf.predict(X_test_svd)
    print(f"Test acc (n_components={n}): {accuracy_score(y_test, y_pred_test):.4f}")

    # Confusion Matrix
    print(f"Confusion Matrix (n_components={n}): \n{confusion_matrix(y_test, y_pred_test)}\n")

n_components = 100
Train acc (n_components=100): 0.9880
Test acc (n_components=100): 0.9611
Confusion Matrix (n_components=100): 
[[59  0  2]
 [ 1 54  3]
 [ 1  0 60]]

n_components = 300
Train acc (n_components=300): 0.9952
Test acc (n_components=300): 0.9611
Confusion Matrix (n_components=300): 
[[61  0  0]
 [ 2 56  0]
 [ 4  1 56]]

n_components = 500
Train acc (n_components=500): 0.9952
Test acc (n_components=500): 0.9111
Confusion Matrix (n_components=500): 
[[61  0  0]
 [ 2 56  0]
 [11  3 47]]



### ANALYSIS
1. It is evident from the concept of SVD that this technique reduces the dimensionality of the data.Now that we have used varied n_component values for random forest and stochastic gradient descent models, the accuracy scoring metric is evaluated.


2. From the random forest model performance, we observe that training accuracy remained same but the testing accuracy has been decreased when n_component value is changed from 100 to 500.This shows that accuracy gets slightly effected when we use different n_component values.


3. From stochastic gradient descent model performance, 
for n_components(300 and 500), the training accuracy is  0.9976 and both share same values.
for all n_Components we see there is high testing accuracy score when n_component value is 500.
This shows that Latent Semantic Analyis is a linear combination of orginal values and when n_components score is slightly increased it shows higher accuracy and low variance and I feel it could sometimes lead to overfitting as well.


Overall, I have understood from the concept of Principal Component analysis and Latent Semantic Analysis that it is not necessary on datasets which are small but it would be really impact on larger datasets but not datasets like the one used here for the analysis.
