Feature extraction is the process of transforming raw data into numerical features that a model can process while preserving the information in the original data set. It often yields better results than applying machine learning directly to the raw data.

In [1]:
import spacy

# Building a natural language processor from scratch
Let's use basic Python to build a rudimentary NLP system. We'll build a *corpus of documents* (two small text files), create a *vocabulary* from the words they contain, and then demonstrate a *Bag of Words* technique to extract features from each document. The vocabulary is read from `one.txt`, which already holds every word used in both files.

In [2]:
%%writefile one.txt
let us learn how to create a bag of words

Writing one.txt


In [9]:
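vocab = {}
i = 1  # indices start at 1, so slot 0 of each vector stays unused
with open('one.txt') as file:
    x = file.read().lower().split()

# assign the next free index to each word not seen before
for word in x:
    if word not in vocab:
        vocab[word] = i
        i += 1

print(vocab)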

{'let': 1, 'us': 2, 'learn': 3, 'how': 4, 'to': 5, 'create': 6, 'a': 7, 'bag': 8, 'of': 9, 'words': 10}


In [14]:
one = [0]*(len(vocab)+1)
one

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

In [15]:
with open('one.txt') as file:
    x= file.read().lower().split()

# count each word as it occurs in the document, not once per vocabulary entry
for word in x:
    one[vocab[word]]+=1
    
print(one)

[0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


In [16]:
x

['let', 'us', 'learn', 'how', 'to', 'create', 'a', 'bag', 'of', 'words']

In [20]:
%%writefile two.txt
let us learn how to create a bag of words let us

Writing two.txt


In [23]:
two = [0]*(len(vocab)+1)
two

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

In [24]:
with open('two.txt') as file:
    x= file.read().lower().split()

for word in x:
    two[vocab[word]]+=1
    
print(two)

[0, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1]
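
The cells above can be folded into two small helper functions. The sketch below is one possible arrangement, not the notebook's original code: the names `build_vocab` and `bag_of_words` are hypothetical, and the documents are passed as strings instead of being read from `one.txt` and `two.txt`. It builds the vocabulary from both documents and keeps index 0 unused, as in the cells above.

```python
def build_vocab(documents):
    """Map each distinct lowercase word to an index, starting at 1."""
    vocab = {}
    for doc in documents:
        for word in doc.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab) + 1
    return vocab


def bag_of_words(doc, vocab):
    """Count how often each vocabulary word occurs in doc (slot 0 stays unused)."""
    vector = [0] * (len(vocab) + 1)
    for word in doc.lower().split():
        vector[vocab[word]] += 1
    return vector


# Same text as one.txt and two.txt
docs = [
    "let us learn how to create a bag of words",
    "let us learn how to create a bag of words let us",
]
vocab = build_vocab(docs)
vectors = [bag_of_words(doc, vocab) for doc in docs]
print(vectors[0])
print(vectors[1])
```

Because every vector has the same length and the same word-to-index mapping, the two documents can be compared position by position.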


# Feature extraction from text

In [25]:
import numpy as np
import pandas as pd

In [26]:
df = pd.read_csv('smsspamcollection.tsv', sep='\t')
df.head()

| | label | message | length | punct |
|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | 111 | 9 |
| 1 | ham | Ok lar... Joking wif u oni... | 29 | 6 |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 155 | 6 |
| 3 | ham | U dun say so early hor... U c already then say... | 49 | 6 |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | 61 | 2 |


In [27]:
# checking for missing values
df.isnull().sum()

label      0
message    0
length     0
punct      0
dtype: int64

In [28]:
df['label'].value_counts()

ham     4825
spam     747
Name: label, dtype: int64
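
These counts show that the classes are imbalanced, which matters when reading the accuracy figure reported further down. As a rough sketch, the snippet below rebuilds a label column with the same counts (so the real `df` is not needed) and computes the share of each class. A classifier that always answered `ham` would already be right about 87% of the time.

```python
import pandas as pd

# Synthetic labels with the same class counts as reported above
labels = pd.Series(["ham"] * 4825 + ["spam"] * 747, name="label")

# Share of each class instead of raw counts
proportions = labels.value_counts(normalize=True)
print(proportions.round(3))

# Accuracy of a trivial model that always predicts the majority class
baseline_accuracy = proportions.max()
print(round(baseline_accuracy, 3))
```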

In [29]:
from sklearn.model_selection import train_test_split

X = df['message']  # this time we want to look at the text
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

## Scikit-learn's CountVectorizer
Text preprocessing, tokenizing and the ability to filter out stopwords are all included in [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html), which builds a dictionary of features and transforms documents to feature vectors.

In [30]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()

X_train_counts = count_vect.fit_transform(X_train)
X_train_counts.shape

(3733, 7082)

## Transform Counts to Frequencies with Tf-idf
While counting words is helpful, longer documents will have higher average count values than shorter documents, even though they might talk about the same topics.

To avoid this we can simply divide the number of occurrences of each word in a document by the total number of words in the document: these new features are called **tf** for Term Frequencies.

Another refinement on top of **tf** is to downscale weights for words that occur in many documents in the corpus and are therefore less informative than those that occur only in a smaller portion of the corpus.

This downscaling is called **tf–idf** for “Term Frequency times Inverse Document Frequency”.

Both tf and tf–idf can be computed as follows using [TfidfTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html):

In [31]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()

X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(3733, 7082)
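
With default settings, `TfidfTransformer` does not divide by document length as in the simple description above. As the scikit-learn documentation describes it, the default multiplies raw counts by a smoothed idf, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, and then scales each row to unit Euclidean length. The sketch below reproduces that by hand on a made-up 2×3 count matrix and compares the result with the transformer; the variable names are illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfTransformer

# Made-up counts: 2 documents (rows) x 3 terms (columns)
counts = np.array([[1, 1, 0],
                   [2, 0, 1]])

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)               # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1   # smoothed inverse document frequency

weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)  # L2-normalize each row

from_sklearn = TfidfTransformer().fit_transform(csr_matrix(counts)).toarray()
matches = np.allclose(manual, from_sklearn)
print(matches)
```

The row normalization is what removes the document-length effect here: every document ends up as a vector of length 1 regardless of how many words it contains.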

## Combine Steps with TfidfVectorizer
We can combine the CountVectorizer and TfidfTransformer steps into one using [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html):

In [32]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()

X_train_tfidf = vectorizer.fit_transform(X_train) # remember to use the original X_train set
X_train_tfidf.shape

(3733, 7082)
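
A quick way to check that the two routes agree is to run both on the same input and compare the results. The sketch below does this on a three-message toy corpus (made up for illustration) instead of `X_train`.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

messages = [
    "free entry to win a prize",
    "are you free for lunch today",
    "win a free prize today",
]

# Route 1: count first, then reweight
two_step = TfidfTransformer().fit_transform(CountVectorizer().fit_transform(messages))

# Route 2: both steps in a single estimator
one_step = TfidfVectorizer().fit_transform(messages)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(one_step.shape, same)
```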

In [33]:
from sklearn.svm import LinearSVC
clf = LinearSVC()
clf.fit(X_train_tfidf,y_train)

LinearSVC()
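
A classifier fitted this way expects inputs in the same feature space as `X_train_tfidf`, so new text has to go through `vectorizer.transform` (not `fit_transform`, which would learn a different vocabulary). A minimal sketch on made-up messages and labels, with hypothetical variable names:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Made-up training data for illustration only
train_messages = [
    "win a free prize now",
    "free entry to win cash",
    "are we still meeting for lunch",
    "see you at home tonight",
]
train_labels = ["spam", "spam", "ham", "ham"]

toy_vectorizer = TfidfVectorizer()
train_features = toy_vectorizer.fit_transform(train_messages)  # learns the vocabulary
toy_clf = LinearSVC().fit(train_features, train_labels)

new_messages = ["free cash prize", "lunch at home"]
new_features = toy_vectorizer.transform(new_messages)  # reuses the learned vocabulary
toy_predictions = toy_clf.predict(new_features)

print(new_features.shape[1] == train_features.shape[1])
print(toy_predictions)
```

Words in the new messages that were not seen during fitting are simply ignored by `transform`.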

# Creating a pipeline of TfidfVectorizer() and LinearSVC()

In [34]:
from sklearn.pipeline import Pipeline
# from sklearn.feature_extraction.text import TfidfVectorizer
# from sklearn.svm import LinearSVC

text_clf = Pipeline([('tfidf', TfidfVectorizer()),
                     ('clf', LinearSVC()),
])

# Feed the training data through the pipeline
text_clf.fit(X_train, y_train)  

Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])

In [35]:
# Form a prediction set
predictions = text_clf.predict(X_test)

In [36]:
# Report the confusion matrix
from sklearn import metrics
print(metrics.confusion_matrix(y_test,predictions))

[[1586    7]
 [  12  234]]
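
scikit-learn's `confusion_matrix` puts true labels on the rows and predicted labels on the columns, with labels in sorted order (here `ham`, then `spam`). Read that way, 7 ham messages were flagged as spam and 12 spam messages were classified as ham. The sketch below derives the accuracy and the spam precision and recall directly from the numbers printed above; the variable names are illustrative.

```python
import numpy as np

# Rows: true ham, true spam; columns: predicted ham, predicted spam
cm = np.array([[1586, 7],
               [12, 234]])

accuracy = np.trace(cm) / cm.sum()          # correct predictions over all predictions
spam_precision = cm[1, 1] / cm[:, 1].sum()  # of messages flagged as spam, how many were spam
spam_recall = cm[1, 1] / cm[1, :].sum()     # of actual spam, how many were caught

print(f"accuracy={accuracy:.4f} precision={spam_precision:.4f} recall={spam_recall:.4f}")
```

In practice, `metrics.classification_report(y_test, predictions)` reports the same per-class figures without the manual arithmetic.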


In [37]:
# Print the overall accuracy
print(metrics.accuracy_score(y_test,predictions))

0.989668297988037


# Checking our spam classifier

In [38]:
text_clf.predict(['Congratulations, you won the lottery'])

array(['spam'], dtype=object)

In [39]:
text_clf.predict(['Assignment is submitted'])

array(['ham'], dtype=object)