## Import and visualize dataset

In [2]:
import sklearn.utils
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
from sklearn.datasets import fetch_20newsgroups

In [3]:
twenty_train = fetch_20newsgroups(subset='train', shuffle=True)

In [4]:
twenty_test = fetch_20newsgroups(subset='test')

In [5]:
print(twenty_train.keys())

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])


In [6]:
print(twenty_train.data[0])

From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----







In [7]:
twenty_train.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']
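
A quick way to see how balanced the classes are is to count posts per newsgroup. The sketch below is not part of the original notebook: `class_counts` is a hypothetical helper, and it is shown on toy labels so that it runs without downloading the dataset.

```python
from collections import Counter

def class_counts(targets, target_names):
    """Return {class name: number of samples}, in target_names order (hypothetical helper)."""
    counts = Counter(int(t) for t in targets)
    return {name: counts.get(i, 0) for i, name in enumerate(target_names)}

# Toy labels standing in for twenty_train.target / twenty_train.target_names
toy_names = ['alt.atheism', 'comp.graphics', 'rec.autos']
toy_targets = [2, 0, 2, 1, 2]
toy_counts = class_counts(toy_targets, toy_names)
for name, n in toy_counts.items():
    print(f'{name:<15} {n}')
```

On the real data the call would presumably be `class_counts(twenty_train.target, twenty_train.target_names)`.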

## Preprocessing

Before encoding text as vectors with `Bag of Words`, `TF`, or similar methods, we need to clean it. This step of preparing (cleaning) text data before encoding is called **text preprocessing**.

***There are 3 main components:***
- Tokenization
- Normalization
- Noise removal

**Tokenization** splits text into smaller units: paragraphs into sentences, and sentences into words. **Normalization** aims to put all text on a level playing field, e.g., converting all characters to lowercase. **Noise removal** cleans up the text, e.g., removing extra whitespace.
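
The three components can be illustrated with the standard library alone. This is a minimal sketch on a made-up sentence, using simple regular expressions rather than the NLTK tokenizer the notebook relies on later, so the splitting rules here are assumptions.

```python
import re

text = "Hello   World!  This is   NLP.  It's FUN."

# Tokenization: split the paragraph into sentences, then each sentence into words.
sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
words = [re.findall(r"[A-Za-z']+", s) for s in sentences]

# Normalization: lowercase every token.
normalized = [[w.lower() for w in sent] for sent in words]

# Noise removal: collapse runs of whitespace into a single space.
denoised = re.sub(r'\s+', ' ', text).strip()

print(sentences)
print(normalized)
print(denoised)
```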

***Text Preprocessing steps:*** 
- Remove HTML tags
- Remove extra whitespaces
- Convert accented characters to ASCII characters
- Expand contractions
- Remove special characters
- Lowercase all texts
- Convert number words to numeric form
- Remove numbers
- Remove stopwords
- Lemmatization
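
Several of these steps need only the standard library. The sketch below covers HTML tag removal, accent conversion, contraction expansion, and whitespace cleanup. All function names and the tiny `CONTRACTIONS` mapping are hypothetical and for illustration only; a real project would need a much fuller contraction list.

```python
import html
import re
import unicodedata

# Tiny illustrative mapping (assumption); keys are lowercase.
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is", "i'm": "i am"}
CONTRACTION_RE = re.compile(
    r'\b(?:' + '|'.join(re.escape(c) for c in CONTRACTIONS) + r')\b',
    flags=re.IGNORECASE,
)

def remove_html_tags(text):
    # Replace anything that looks like a tag, then decode entities such as &amp;
    return html.unescape(re.sub(r'<[^>]+>', ' ', text))

def to_ascii(text):
    # Decompose accented characters and drop the combining marks
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')

def expand_contractions(text):
    # The replacement is always lowercase, whatever the case of the match
    return CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)

def remove_extra_whitespace(text):
    return re.sub(r'\s+', ' ', text).strip()

raw = "<p>It's a  caf\u00e9 &amp; I can't   wait</p>"
step = remove_html_tags(raw)
step = to_ascii(step)
step = expand_contractions(step)
step = remove_extra_whitespace(step)
print(step)
```

Order matters in this sketch: tags are removed before whitespace is collapsed, because tag removal leaves extra spaces behind.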

In [32]:
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\manhd\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\wordnet.zip.


True

In [46]:
# build these once instead of on every call
STOP_WORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()
LEMMATIZER = WordNetLemmatizer()

# remove URLs
def clean_url(text):
    return re.sub(r'http\S+', '', text)

# replace every non-letter character (digits, punctuation) with a space
def clean_special_character(text):
    return re.sub(r'[^a-zA-Z]', ' ', text)

# convert to lowercase
def clean_uppercase(text):
    return str(text).lower()

# tokenization
def tokenization(text):
    return word_tokenize(text)

# remove stop words
def clean_stop_word(tokens):
    return [token for token in tokens if token not in STOP_WORDS]

# stemming (an alternative to lemmatization; not used in clean() below)
def stem(tokens):
    return [STEMMER.stem(token) for token in tokens]

# lemmatization, treating every token as a verb
def lemmatization(tokens):
    return [LEMMATIZER.lemmatize(word=token, pos='v') for token in tokens]

# remove words of length <= 2
def clean_length(tokens):
    return [token for token in tokens if len(token) > 2]

# convert tokens back to a string
def convert_2_string(tokens):
    return ' '.join(tokens)

# apply all cleaners in order
def clean(text):
    res = clean_url(text)
    res = clean_special_character(res)
    res = clean_uppercase(res)
    res = tokenization(res)
    res = clean_stop_word(res)
    res = lemmatization(res)
    res = clean_length(res)
    return convert_2_string(res)

In [48]:
#example
example = twenty_train.data[0]
after_clean = clean(example)
print(example, after_clean)

From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----




 lerxst wam umd edu thing subject car nntp post host rac wam umd edu organization university maryland college park line wonder anyone could enlighten car saw day door sport car look late early call bricklin doors really small addition front bumper separate rest body know anyone 

## Feature Extraction

**Feature extraction** transforms each text into a numerical representation in the form of a vector. (This process can include *tokenization, vectorization, etc.*)

More generally, feature extraction aims to reduce the number of features in a dataset by creating new features from the existing ones (and then discarding the original features).

***Potential advantages of feature extraction:***
- Improved accuracy.
- Reduced risk of overfitting.
- Faster training.
- Easier data visualization.
- A more explainable model.
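
To make the "text to vector" idea concrete, here is a minimal count-based sketch written with the standard library only. `bag_of_words` is a hypothetical helper and the two documents are made up; the token rule (lowercase, words of two or more characters) is an assumption chosen to keep the example short.

```python
import re
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, count vectors) for a list of strings (hypothetical helper)."""
    # Lowercase and keep tokens of two or more word characters
    tokenized = [re.findall(r'\b\w\w+\b', doc.lower()) for doc in docs]
    # One vector position per distinct token, in alphabetical order
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors

docs = ['the cat sat', 'the cat saw the dog']
vocab, vecs = bag_of_words(docs)
print(vocab)  # ['cat', 'dog', 'sat', 'saw', 'the']
print(vecs)   # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
```

Each document becomes a vector with one entry per vocabulary term, so word order is discarded and only counts remain.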

### Feature extraction using Bag of Words

In [11]:
vectorizer = CountVectorizer()

In [12]:
vectorizer 

CountVectorizer()

In [14]:
from pprint import pprint

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
X = vectorizer.fit_transform(corpus)
pprint(X)

<4x9 sparse matrix of type '<class 'numpy.int64'>'
	with 19 stored elements in Compressed Sparse Row format>
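
The sparse matrix summary hides the actual counts. As a sketch, the fitted vocabulary and a dense view of the matrix can be printed as follows; the variable names `terms` and `dense` are introduced here for illustration, and the corpus is repeated so the block runs on its own.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# vocabulary_ maps each term to its column index; sort terms by that index
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
dense = X.toarray()  # fine for a toy corpus; avoid on large data

print(terms)
print(dense)
```

Each row is one document and each column one vocabulary term, so the `2` in the second row is the repeated word "second". Converting to a dense array is only sensible for small examples like this one.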


In [20]:
analyze = vectorizer.build_analyzer()
print(analyze("<h1>This &*$) https://google.com is a text document to analyze.</h1>"))

['h1', 'this', 'https', 'google', 'com', 'is', 'text', 'document', 'to', 'analyze', 'h1']
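
Note that the default analyzer keeps `h1`, `https`, `google`, and `com` as tokens: it lowercases and splits on word boundaries but does not strip markup or URLs. One possible way to handle this, sketched below, is to pass a custom function through the `preprocessor` parameter of `CountVectorizer`. The `strip_markup` function is a hypothetical example, not part of the original notebook.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

def strip_markup(text):
    """Hypothetical preprocessor: drop HTML tags and URLs, then lowercase."""
    text = re.sub(r'<[^>]+>', ' ', text)  # HTML tags
    text = re.sub(r'http\S+', ' ', text)  # URLs
    # A custom preprocessor replaces the built-in lowercasing, so do it here
    return text.lower()

clean_vectorizer = CountVectorizer(preprocessor=strip_markup)
clean_analyze = clean_vectorizer.build_analyzer()
tokens = clean_analyze("<h1>This &*$) https://google.com is a text document to analyze.</h1>")
print(tokens)
```

The tokenizing step is unchanged, so single-character words such as "a" are still dropped by the default token pattern.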
