## Import and inspect the dataset

In [130]:
import sklearn.utils
import numpy as np
from sklearn import metrics
from sklearn.datasets import fetch_20newsgroups

In [57]:
twenty_train = fetch_20newsgroups(subset='train', shuffle=True, random_state=42)

In [58]:
twenty_test = fetch_20newsgroups(subset='test')

In [5]:
print(twenty_train.keys())

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])


In [59]:
print(twenty_train.data[0])

From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----







In [7]:
twenty_train.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

## Preprocessing

Before turning text into vectors with `Bag of Words`, `TF`, and similar encodings, we need to clean it. This preparation of text data before encoding is called **text preprocessing**.

***There are 3 main components:***
- Tokenization
- Normalization
- Noise removal

**Tokenization** splits paragraphs into sentences and sentences into words. **Normalization** aims to put all text on a level playing field, e.g., by converting all characters to lowercase. **Noise removal** cleans up the text, e.g., by removing extra whitespace.
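As a minimal sketch of these three components using only the standard library: the helper names `remove_noise`, `normalize`, and `tokenize` and the sample text are hypothetical, and the regex-based splitting is only a rough stand-in for NLTK's `sent_tokenize` and `word_tokenize`, not an equivalent.

```python
import re

def remove_noise(text):
    # Collapse runs of whitespace (spaces, tabs, newlines) into single spaces.
    return re.sub(r'\s+', ' ', text).strip()

def normalize(text):
    # Put all text on the same footing by lowercasing it.
    return text.lower()

def tokenize(text):
    # Split into sentences after '.', '!' or '?', then split each sentence into words.
    sentences = re.split(r'(?<=[.!?])\s+', text)
    return [re.findall(r'[a-z0-9]+', sentence) for sentence in sentences]

sample = "The doors were   really small.\nThe bumper was SEPARATE!"
tokens = tokenize(normalize(remove_noise(sample)))
print(tokens)
```

Each inner list holds the words of one sentence. Real text has abbreviations, e-mail addresses, and similar cases where this naive splitting would likely go wrong, which is one reason to prefer a tokenizer library.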

***Common text preprocessing steps*** (the `clean` function defined below covers only some of them):
- Remove HTML tags
- Remove extra whitespaces
- Convert accented characters to ASCII characters
- Expand contractions
- Remove special characters
- Lowercase all texts
- Convert number words to numeric form
- Remove numbers
- Remove stopwords
- Lemmatization
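Some of the steps listed above (removing HTML tags, converting accented characters, expanding contractions, removing extra whitespace) can be sketched with the standard library alone. The helper names and the tiny `CONTRACTIONS` table are hypothetical and only illustrative; a real contraction list would be much longer.

```python
import html
import re
import unicodedata

def strip_html(text):
    # Drop anything that looks like a tag, then decode entities such as &amp;.
    return html.unescape(re.sub(r'<[^>]+>', ' ', text))

def to_ascii(text):
    # Decompose accented characters (NFKD) and drop the non-ASCII combining marks.
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')

# Hypothetical, deliberately tiny contraction table.
CONTRACTIONS = {"don't": "do not", "it's": "it is", "can't": "cannot"}

def expand_contractions(text):
    pattern = re.compile('|'.join(re.escape(c) for c in CONTRACTIONS), re.IGNORECASE)
    # The replacement is looked up in lowercase, so capitalization is not preserved.
    return pattern.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

def squeeze_whitespace(text):
    # Split on any whitespace and re-join with single spaces.
    return ' '.join(text.split())

raw = "<p>It's a   caf\u00e9 &amp; bar, don't miss it</p>"
cleaned = squeeze_whitespace(expand_contractions(to_ascii(strip_html(raw))))
print(cleaned)
```

The order matters here: tags are stripped before entities are decoded, so an escaped `&lt;b&gt;` in the text is not mistaken for markup.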

In [86]:
import re
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\manhd\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\manhd\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\manhd\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [87]:
# build these once instead of on every call
STOP_WORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()
LEMMATIZER = WordNetLemmatizer()

# remove URLs
def clean_url(text):
    return re.sub(r'http\S+', '', text)

# replace every non-letter character (digits, punctuation) with a space
def clean_special_character(text):
    return re.sub(r'[^a-zA-Z]', ' ', text)

# convert to lowercase
def clean_uppercase(text):
    return str(text).lower()

# sentence segmentation (not used in clean() below)
def sent_tokenization(text):
    return sent_tokenize(text)

# word tokenization
def tokenization(text):
    return word_tokenize(text)

# remove stop words
def clean_stop_word(tokens):
    return [token for token in tokens if token not in STOP_WORDS]

# stemming (not used in clean() below)
def stem(tokens):
    return [STEMMER.stem(token) for token in tokens]

# lemmatization, treating every token as a verb (pos='v')
def lemmatization(tokens):
    return [LEMMATIZER.lemmatize(word=token, pos='v') for token in tokens]

# remove words of length <= 2
def clean_length(tokens):
    return [token for token in tokens if len(token) > 2]

# join the tokens back into a single string
def convert_2_string(tokens):
    return ' '.join(tokens)

# apply all cleaning steps in order
def clean(text):
    res = clean_url(text)
    res = clean_special_character(res)
    res = clean_uppercase(res)
    res = tokenization(res)
    res = clean_stop_word(res)
    res = lemmatization(res)
    res = clean_length(res)
    return convert_2_string(res)

In [60]:
#example
example = twenty_train.data[0]
after_clean = clean(example)
print(example, after_clean)

From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----




 lerxst wam umd edu thing subject car nntp post host rac wam umd edu organization university maryland college park line wonder anyone could enlighten car saw day door sport car look late early call bricklin doors really small addition front bumper separate rest body know anyone 

## Feature Extraction

**Feature extraction** transforms each text into a numerical representation in the form of a vector. (This process can include *tokenization, vectorization, etc.*)

More generally, feature extraction creates new features from the existing ones (and then discards the originals), often with the aim of reducing the number of features in a dataset.

***Feature Extraction advantages:***
- Accuracy improvements.
- Overfitting risk reduction.
- Speed up in training.
- Improved Data Visualization.
- Increase in explainability of our model.
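To make "a numerical representation in the form of a vector" concrete, here is a minimal hand-rolled sketch of count vectors on a hypothetical two-document corpus. It only illustrates the idea and does not reproduce the tokenization rules of scikit-learn's `CountVectorizer`.

```python
from collections import Counter

docs = ["the car is red", "the red car hit the blue car"]

# Vocabulary: every distinct word, sorted so each word gets a stable column index.
vocabulary = sorted({word for doc in docs for word in doc.split()})

def to_count_vector(doc):
    # One entry per vocabulary word: how often it occurs in this document.
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

vectors = [to_count_vector(doc) for doc in docs]
print(vocabulary)
print(vectors)
```

Every document becomes a vector with one column per vocabulary word, so the vector length equals the vocabulary size regardless of how long the document is.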

### FE using Bag of Words

In [66]:
from sklearn.feature_extraction.text import CountVectorizer

count_vector = CountVectorizer()
analyze = count_vector.build_analyzer()

In [62]:
print(analyze(twenty_train.data[0]))

['from', 'lerxst', 'wam', 'umd', 'edu', 'where', 'my', 'thing', 'subject', 'what', 'car', 'is', 'this', 'nntp', 'posting', 'host', 'rac3', 'wam', 'umd', 'edu', 'organization', 'university', 'of', 'maryland', 'college', 'park', 'lines', '15', 'was', 'wondering', 'if', 'anyone', 'out', 'there', 'could', 'enlighten', 'me', 'on', 'this', 'car', 'saw', 'the', 'other', 'day', 'it', 'was', 'door', 'sports', 'car', 'looked', 'to', 'be', 'from', 'the', 'late', '60s', 'early', '70s', 'it', 'was', 'called', 'bricklin', 'the', 'doors', 'were', 'really', 'small', 'in', 'addition', 'the', 'front', 'bumper', 'was', 'separate', 'from', 'the', 'rest', 'of', 'the', 'body', 'this', 'is', 'all', 'know', 'if', 'anyone', 'can', 'tellme', 'model', 'name', 'engine', 'specs', 'years', 'of', 'production', 'where', 'this', 'car', 'is', 'made', 'history', 'or', 'whatever', 'info', 'you', 'have', 'on', 'this', 'funky', 'looking', 'car', 'please', 'mail', 'thanks', 'il', 'brought', 'to', 'you', 'by', 'your', 'neigh

In [64]:
X_train_counts = count_vector.fit_transform(twenty_train.data)
X_train_counts.shape

(11314, 130107)

### FE using TF, TF-IDF

***TF*** stands for "Term Frequency"

***TF-IDF*** stands for "Term Frequency times Inverse Document Frequency"

Longer documents will have higher average count values than shorter documents, even though they might talk about the same topics. 

To avoid these potential discrepancies it suffices to divide the number of occurrences of each word in a document by the total number of words in the document: these new features are called `tf` for Term Frequencies.
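A small sketch of that division, on a hypothetical pair of documents where the longer one simply repeats the shorter one. The `term_frequencies` helper is illustrative only. Note that `TfidfTransformer(use_idf=False)` with its default settings scales each row to unit L2 norm rather than dividing by the word total, so its numbers would differ from these even though the goal of removing the length effect is the same.

```python
from collections import Counter

def term_frequencies(doc):
    # Occurrences of each word divided by the total number of words in the document.
    words = doc.split()
    counts = Counter(words)
    return {word: count / len(words) for word, count in counts.items()}

short_doc = "car engine car"
long_doc = " ".join([short_doc] * 10)  # same content repeated ten times

raw_short = Counter(short_doc.split())["car"]
raw_long = Counter(long_doc.split())["car"]
tf_short = term_frequencies(short_doc)["car"]
tf_long = term_frequencies(long_doc)["car"]

print(raw_short, raw_long)  # raw counts grow with document length
print(tf_short, tf_long)    # term frequencies stay the same
```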

<img src="./tf-idf equation.png"/>
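The sketch below follows the smoothed variant that scikit-learn's `TfidfTransformer` uses by default, which may differ from the equation in the image: `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, where `n` is the number of documents and `df(t)` the number of documents containing `t`, followed by L2 normalization of each document vector. The toy corpus and helper names are hypothetical.

```python
import math
from collections import Counter

docs = [["car", "engine", "car"], ["car", "hockey"], ["hockey", "goal"]]
n_docs = len(docs)

# Document frequency: number of documents that contain each term.
df = Counter(term for doc in docs for term in set(doc))

def idf(term):
    # Smoothed inverse document frequency: rarer terms get larger weights.
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf(doc):
    # Raw count times idf, then scale the document vector to unit L2 norm.
    counts = Counter(doc)
    weights = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

weights = tfidf(docs[0])
print(weights)
```

In this corpus "engine" appears in one document and "car" in two, so "engine" receives the larger idf; "car" still carries weight in the first document because it occurs twice there.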

### TF:

In [67]:
from sklearn.feature_extraction.text import TfidfTransformer

tf_transformer = TfidfTransformer(use_idf=False)

In [68]:
tf_transformer.fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
X_train_tf.shape

(11314, 130107)

### TF-IDF:

In [69]:
tfidf_transformer = TfidfTransformer()

In [70]:
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(11314, 130107)

## Training a classifier

### Using Naive Bayes

#### Bag of words only

In [74]:
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB  # needed by the pipelines below

In [125]:
NB_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', MultinomialNB()),
])

In [126]:
NB_clf.fit(twenty_train.data, twenty_train.target)


Pipeline(steps=[('vect', CountVectorizer()), ('clf', MultinomialNB())])

In [132]:
predicted = NB_clf.predict(twenty_test.data)
np.mean(predicted == twenty_test.target)

0.7728359001593202

#### Evaluate the model

In [133]:
from sklearn import metrics

In [135]:
print(metrics.classification_report(twenty_test.target, predicted,
    target_names=twenty_test.target_names))

                          precision    recall  f1-score   support

             alt.atheism       0.79      0.77      0.78       319
           comp.graphics       0.67      0.74      0.70       389
 comp.os.ms-windows.misc       0.20      0.00      0.01       394
comp.sys.ibm.pc.hardware       0.56      0.77      0.65       392
   comp.sys.mac.hardware       0.84      0.75      0.79       385
          comp.windows.x       0.65      0.84      0.73       395
            misc.forsale       0.93      0.65      0.77       390
               rec.autos       0.87      0.91      0.89       396
         rec.motorcycles       0.96      0.92      0.94       398
      rec.sport.baseball       0.96      0.87      0.91       397
        rec.sport.hockey       0.93      0.96      0.95       399
               sci.crypt       0.67      0.95      0.78       396
         sci.electronics       0.79      0.66      0.72       393
                 sci.med       0.87      0.82      0.85       396
         

In [138]:
print(metrics.confusion_matrix(twenty_test.target, predicted))

[[245   0   0   1   0   1   0   0   1   0   2   1   1   2   2  41   2  11
    5   4]
 [  1 287   0  12   4  31   1   0   0   1   0  26   5   2   8   2   2   1
    6   0]
 [  2  55   1 134  13 112   2   0   1   3   1  31   4   4   8   5   2   1
   14   1]
 [  0  11   1 300  15  11   3   5   0   0   1  11  23   0   5   0   1   2
    3   0]
 [  0  12   1  22 289   5   3   5   1   1   0  14  10   3   3   1   4   2
    9   0]
 [  1  25   2  11   1 332   0   0   0   0   0  13   0   2   4   1   2   1
    0   0]
 [  0   6   0  35  17   3 253  16   4   1   4   6  16   7   6   2   5   4
    5   0]
 [  0   1   0   2   0   0   4 360   3   2   2   3   0   0   4   0   4   2
    9   0]
 [  0   0   0   1   0   0   2  13 365   0   0   4   0   0   0   1   3   4
    5   0]
 [  1   1   0   0   1   1   0   6   0 345  16   0   0   0   5   6   2   2
   11   0]
 [  0   0   0   0   0   0   0   0   0   2 385   1   0   1   1   3   1   2
    3   0]
 [  0   3   0   0   0   2   1   2   0   0   0 377   2   1   1   0
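The report and the confusion matrix are two views of the same counts. In the matrix above, the third row has only 1 on the diagonal, which matches the 0.00 recall reported for comp.os.ms-windows.misc. As a rough sketch of that relationship, the following computes per-class precision, recall, and F1 from a small hypothetical 3-class matrix (rows are true classes, columns are predicted classes, as in scikit-learn); the helper names are illustrative.

```python
# Hypothetical confusion matrix: rows are true classes, columns are predictions.
matrix = [
    [8, 2, 0],
    [1, 6, 3],
    [0, 1, 9],
]

def precision(matrix, k):
    # Of everything predicted as class k (column sum), how much really is class k.
    predicted_as_k = sum(row[k] for row in matrix)
    return matrix[k][k] / predicted_as_k if predicted_as_k else 0.0

def recall(matrix, k):
    # Of everything that really is class k (row sum), how much was predicted as k.
    actually_k = sum(matrix[k])
    return matrix[k][k] / actually_k if actually_k else 0.0

def f1(matrix, k):
    # Harmonic mean of precision and recall.
    p, r = precision(matrix, k), recall(matrix, k)
    return 2 * p * r / (p + r) if p + r else 0.0

# Accuracy: correct predictions (the diagonal) over all predictions.
accuracy = sum(matrix[i][i] for i in range(len(matrix))) / sum(map(sum, matrix))

for k in range(len(matrix)):
    print(k, round(precision(matrix, k), 2), round(recall(matrix, k), 2), round(f1(matrix, k), 2))
print(round(accuracy, 2))
```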

#### TF

In [140]:
TF_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tf', TfidfTransformer(use_idf=False)),
    ('clf', MultinomialNB()),
])

In [141]:
TF_clf.fit(twenty_train.data, twenty_train.target)

Pipeline(steps=[('vect', CountVectorizer()),
                ('tf', TfidfTransformer(use_idf=False)),
                ('clf', MultinomialNB())])

In [145]:
predicted = TF_clf.predict(twenty_test.data)
np.mean(predicted == twenty_test.target)

0.7052575677110993

In [144]:
print(metrics.classification_report(twenty_test.target, predicted,
    target_names=twenty_test.target_names))

                          precision    recall  f1-score   support

             alt.atheism       0.85      0.24      0.37       319
           comp.graphics       0.71      0.60      0.65       389
 comp.os.ms-windows.misc       0.79      0.65      0.71       394
comp.sys.ibm.pc.hardware       0.63      0.75      0.69       392
   comp.sys.mac.hardware       0.86      0.68      0.76       385
          comp.windows.x       0.88      0.68      0.77       395
            misc.forsale       0.90      0.72      0.80       390
               rec.autos       0.71      0.92      0.80       396
         rec.motorcycles       0.84      0.91      0.87       398
      rec.sport.baseball       0.86      0.85      0.86       397
        rec.sport.hockey       0.90      0.93      0.91       399
               sci.crypt       0.52      0.96      0.67       396
         sci.electronics       0.78      0.52      0.63       393
                 sci.med       0.82      0.76      0.79       396
         

#### TF-IDF

In [147]:
TF_IDF_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

In [148]:
TF_IDF_clf.fit(twenty_train.data, twenty_train.target)

Pipeline(steps=[('vect', CountVectorizer()), ('tf', TfidfTransformer()),
                ('clf', MultinomialNB())])

In [149]:
predicted = TF_IDF_clf.predict(twenty_test.data)
np.mean(predicted == twenty_test.target)

0.7738980350504514

In [150]:
print(metrics.classification_report(twenty_test.target, predicted,
    target_names=twenty_test.target_names))

                          precision    recall  f1-score   support

             alt.atheism       0.80      0.52      0.63       319
           comp.graphics       0.81      0.65      0.72       389
 comp.os.ms-windows.misc       0.82      0.65      0.73       394
comp.sys.ibm.pc.hardware       0.67      0.78      0.72       392
   comp.sys.mac.hardware       0.86      0.77      0.81       385
          comp.windows.x       0.89      0.75      0.82       395
            misc.forsale       0.93      0.69      0.80       390
               rec.autos       0.85      0.92      0.88       396
         rec.motorcycles       0.94      0.93      0.93       398
      rec.sport.baseball       0.92      0.90      0.91       397
        rec.sport.hockey       0.89      0.97      0.93       399
               sci.crypt       0.59      0.97      0.74       396
         sci.electronics       0.84      0.60      0.70       393
                 sci.med       0.92      0.74      0.82       396
         

### Using SVM

In [151]:
from sklearn.linear_model import SGDClassifier

In [152]:
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier(loss='hinge', penalty='l2',
                          alpha=1e-3, random_state=42,
                          max_iter=5, tol=None)),
])

In [155]:
text_clf.fit(twenty_train.data, twenty_train.target)

Pipeline(steps=[('vect', CountVectorizer()), ('tfidf', TfidfTransformer()),
                ('clf',
                 SGDClassifier(alpha=0.001, max_iter=5, random_state=42,
                               tol=None))])

In [156]:
predicted = text_clf.predict(twenty_test.data)
np.mean(predicted == twenty_test.target)

0.8248805098247477

In [157]:
print(metrics.classification_report(twenty_test.target, predicted,
    target_names=twenty_test.target_names))

                          precision    recall  f1-score   support

             alt.atheism       0.73      0.71      0.72       319
           comp.graphics       0.78      0.72      0.75       389
 comp.os.ms-windows.misc       0.73      0.78      0.75       394
comp.sys.ibm.pc.hardware       0.74      0.67      0.70       392
   comp.sys.mac.hardware       0.81      0.83      0.82       385
          comp.windows.x       0.84      0.76      0.80       395
            misc.forsale       0.84      0.90      0.87       390
               rec.autos       0.91      0.90      0.90       396
         rec.motorcycles       0.93      0.96      0.95       398
      rec.sport.baseball       0.88      0.90      0.89       397
        rec.sport.hockey       0.88      0.99      0.93       399
               sci.crypt       0.84      0.96      0.90       396
         sci.electronics       0.83      0.62      0.71       393
                 sci.med       0.87      0.86      0.87       396
         

### Preprocessing before Training

In [158]:
processed_train_data = [clean(letter) for letter in twenty_train.data]

In [167]:
print(processed_train_data[0])

lerxst wam umd edu thing subject car nntp post host rac wam umd edu organization university maryland college park line wonder anyone could enlighten car saw day door sport car look late early call bricklin doors really small addition front bumper separate rest body know anyone tellme model name engine specs years production car make history whatever info funky look car please mail thank bring neighborhood lerxst


In [168]:
print(twenty_train.data[0])

From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----







In [160]:
processed_test_data = [clean(letter) for letter in twenty_test.data]
len(processed_test_data)

7532

### Training the SVM on preprocessed data

In [161]:
text_clf.fit(processed_train_data, twenty_train.target)

Pipeline(steps=[('vect', CountVectorizer()), ('tfidf', TfidfTransformer()),
                ('clf',
                 SGDClassifier(alpha=0.001, max_iter=5, random_state=42,
                               tol=None))])

In [169]:
predicted = text_clf.predict(processed_test_data)
np.mean(predicted == twenty_test.target)

0.8177110993096123

In [170]:
print(metrics.classification_report(twenty_test.target, predicted,
    target_names=twenty_test.target_names))

                          precision    recall  f1-score   support

             alt.atheism       0.73      0.69      0.71       319
           comp.graphics       0.79      0.72      0.75       389
 comp.os.ms-windows.misc       0.71      0.74      0.73       394
comp.sys.ibm.pc.hardware       0.73      0.67      0.70       392
   comp.sys.mac.hardware       0.81      0.78      0.80       385
          comp.windows.x       0.85      0.77      0.81       395
            misc.forsale       0.81      0.85      0.83       390
               rec.autos       0.89      0.87      0.88       396
         rec.motorcycles       0.92      0.96      0.94       398
      rec.sport.baseball       0.89      0.91      0.90       397
        rec.sport.hockey       0.88      0.99      0.93       399
               sci.crypt       0.82      0.95      0.88       396
         sci.electronics       0.80      0.64      0.71       393
                 sci.med       0.88      0.86      0.87       396
         