<a href="https://colab.research.google.com/github/noircir/Python/blob/master/014_Feature_extraction_from_raw_text.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Steps to build an NLP system that can turn a body of text into a numerical array of *features*.


# Building a Natural Language Processor From Scratch
In this section we'll use basic Python to build a rudimentary NLP system. We'll build a *corpus of documents* (two small text files), create a *vocabulary* from all the words in both documents, and then demonstrate a *Bag of Words* technique to extract features from each document.<br>
<div class="alert alert-info" style="margin: 20px">**For illustration only!**</div>

In [0]:
%%writefile 1.txt
This is a story about cats
our feline pets
Cats are furry animals

Writing 1.txt


In [0]:
%%writefile 2.txt
This story is about surfing
Catching waves is fun
Surfing is a popular water sport

Writing 2.txt


In [0]:
# Build a vocabulary out of the two documents.

vocab = {}
i = 1

with open('1.txt') as f:
    x = f.read().lower().split()

for word in x:
    if word not in vocab:
        vocab[word] = i
        i += 1

print(vocab)

{'this': 1, 'is': 2, 'a': 3, 'story': 4, 'about': 5, 'cats': 6, 'our': 7, 'feline': 8, 'pets': 9, 'are': 10, 'furry': 11, 'animals': 12}


In [0]:
with open('2.txt') as f:
    x = f.read().lower().split()

for word in x:
    if word not in vocab:
        vocab[word] = i
        i += 1

print(vocab)

{'this': 1, 'is': 2, 'a': 3, 'story': 4, 'about': 5, 'cats': 6, 'our': 7, 'feline': 8, 'pets': 9, 'are': 10, 'furry': 11, 'animals': 12, 'surfing': 13, 'catching': 14, 'waves': 15, 'fun': 16, 'popular': 17, 'water': 18, 'sport': 19}


In [0]:
## Feature extraction

# Create a zero vector with one slot per vocabulary word.
# Position 0 holds the filename, so the 1-based vocabulary indices line up.
one = ['1.txt']+[0]*len(vocab)
one

['1.txt', 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

In [0]:
# map the frequencies of each word in 1.txt to our vector:
with open('1.txt') as f:
    x = f.read().lower().split()
    
for word in x:
    one[vocab[word]]+=1
    
one

['1.txt', 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

In [0]:
two = ['2.txt']+[0]*len(vocab)

with open('2.txt') as f:
    x = f.read().lower().split()
    
for word in x:
    two[vocab[word]]+=1

In [0]:
# Compare the two vectors (two bags of words):
print(f'{one}\n{two}')

['1.txt', 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
['2.txt', 1, 3, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 2, 1, 1, 1, 1, 1, 1]
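
The cells above can be condensed into two small helpers. This is a minimal sketch rather than the notebook's original code: the function names `build_vocab` and `bag_of_words` are made up for illustration, the documents are passed in as strings instead of being read from `1.txt` and `2.txt`, and the vectors omit the leading filename label, so column indices are 0-based.

```python
def build_vocab(docs):
    """Map each distinct lowercased token to a 0-based column index,
    in order of first appearance."""
    vocab = {}
    for doc in docs:
        for word in doc.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def bag_of_words(doc, vocab):
    """Count how often each vocabulary word occurs in one document."""
    vector = [0] * len(vocab)
    for word in doc.lower().split():
        if word in vocab:  # words outside the vocabulary are ignored
            vector[vocab[word]] += 1
    return vector


doc1 = "This is a story about cats\nour feline pets\nCats are furry animals"
doc2 = "This story is about surfing\nCatching waves is fun\nSurfing is a popular water sport"

vocab = build_vocab([doc1, doc2])
vec1 = bag_of_words(doc1, vocab)
vec2 = bag_of_words(doc2, vocab)
print(vec1)
print(vec2)
```

Apart from the missing filename in position 0, the two printed lists should match the `one` and `two` vectors shown above.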


In [0]:
# Next steps: clean the text (remove stopwords and punctuation, lemmatize),
# then calculate tf-idf weights for the words.
# After that, add tags with information about POS, dependencies, etc. =>
# this adds more dimensions to our data and enables a deeper understanding 
# of the context of specific documents. => Vectors become high-dimensional sparse matrices.
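
The cleaning step mentioned in the comment could look roughly like this in plain Python. It is only a sketch: the `STOPWORDS` set is a tiny hypothetical list chosen for this example, `clean_tokens` is an illustrative name, and lemmatization is left out because it needs a library such as spaCy or NLTK.

```python
import string

# Hypothetical, deliberately tiny stopword list for illustration only;
# real projects usually take one from a library.
STOPWORDS = {"this", "is", "a", "about", "our", "are"}


def clean_tokens(text):
    """Lowercase, strip surrounding punctuation, drop stopwords and empty tokens."""
    tokens = []
    for raw in text.lower().split():
        word = raw.strip(string.punctuation)
        if word and word not in STOPWORDS:
            tokens.append(word)
    return tokens


tokens = clean_tokens("This is a story about cats, our feline pets. Cats are furry animals!")
print(tokens)
# ['story', 'cats', 'feline', 'pets', 'cats', 'furry', 'animals']
```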

# Feature extraction from text with Scikit-Learn

In [0]:
import numpy as np
import pandas as pd

df = pd.read_csv('/content/drive/My Drive/Colab Notebooks/NLP-Spacy/TextFiles/smsspamcollection.tsv', sep='\t')
df.head()

Unnamed: 0,label,message,length,punct
0,ham,"Go until jurong point, crazy.. Available only ...",111,9
1,ham,Ok lar... Joking wif u oni...,29,6
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...,155,6
3,ham,U dun say so early hor... U c already then say...,49,6
4,ham,"Nah I don't think he goes to usf, he lives aro...",61,2


In [0]:
df.isnull().sum()

label      0
message    0
length     0
punct      0
dtype: int64

In [0]:
df['label'].value_counts()

ham     4825
spam     747
Name: label, dtype: int64

In [0]:
from sklearn.model_selection import train_test_split

X = df['message']  # this time we want to look at the text
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [0]:
# Text preprocessing, tokenizing and the ability to filter out stopwords 
# are all included in CountVectorizer, which builds a dictionary of features 
# and transforms documents to feature vectors.

from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()

X_train_counts = count_vect.fit_transform(X_train)
X_train_counts.shape

(3733, 7082)
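
To see what `CountVectorizer` does on a scale small enough to read, here is a sketch that applies it to the two toy documents from the first part of the notebook. The names `toy_vect`, `toy_counts`, and `toy_vocab` are introduced here for illustration. Note that the default tokenizer keeps only tokens of two or more characters, so the word "a" does not make it into the vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "This is a story about cats our feline pets Cats are furry animals",
    "This story is about surfing Catching waves is fun Surfing is a popular water sport",
]

toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(docs)  # sparse matrix, one row per document

# vocabulary_ maps each token to its column index (columns are sorted alphabetically)
toy_vocab = toy_vect.vocabulary_
print(sorted(toy_vocab, key=toy_vocab.get))

dense = toy_counts.toarray()
print(dense[0, toy_vocab["cats"]])  # occurrences of "cats" in the first document
print(dense[1, toy_vocab["is"]])    # occurrences of "is" in the second document
print("a" in toy_vocab)             # single-character tokens are dropped by default
```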

In [0]:
print(X_train_counts)

  (0, 1736)	1
  (0, 4415)	1
  (0, 7069)	1
  (1, 849)	1
  (1, 878)	1
  (1, 935)	1
  (1, 938)	1
  (1, 957)	1
  (1, 1797)	1
  (1, 1835)	1
  (1, 2156)	2
  (1, 2472)	1
  (1, 2566)	1
  (1, 2677)	1
  (1, 3008)	1
  (1, 3116)	1
  (1, 3280)	1
  (1, 3416)	1
  (1, 3501)	1
  (1, 3726)	1
  (1, 4018)	1
  (1, 4270)	1
  (1, 4470)	1
  (1, 4489)	1
  (1, 4513)	1
  :	:
  (3728, 6913)	1
  (3728, 6928)	1
  (3728, 7048)	1
  (3729, 1454)	1
  (3729, 3674)	1
  (3729, 3794)	1
  (3729, 5795)	1
  (3730, 2743)	1
  (3730, 3085)	1
  (3730, 4902)	1
  (3730, 5141)	1
  (3730, 5799)	1
  (3730, 5800)	1
  (3731, 3505)	1
  (3731, 4429)	1
  (3731, 5520)	1
  (3731, 6345)	1
  (3732, 2090)	1
  (3732, 3073)	1
  (3732, 3416)	1
  (3732, 3532)	1
  (3732, 4285)	1
  (3732, 5423)	1
  (3732, 5763)	1
  (3732, 6119)	1


In [0]:
# Transform raw counts to tf-idf weights

from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()

X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(3733, 7082)
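
The weighting itself is simple enough to write by hand. The sketch below follows the formula given in the scikit-learn documentation for the default settings (`smooth_idf=True`, `norm='l2'`): idf(t) = ln((1 + n) / (1 + df(t))) + 1, with each row then scaled to unit length. The function name `tfidf_rows` and the toy count matrix are assumptions made for this example.

```python
import math


def tfidf_rows(count_rows):
    """Apply smoothed idf weighting and L2 normalisation to rows of raw term counts."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # document frequency: number of documents in which each term occurs
    df = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + df_j)) + 1 for df_j in df]

    weighted_rows = []
    for row in count_rows:
        weighted = [count * idf_j for count, idf_j in zip(row, idf)]
        # "or 1.0" avoids dividing by zero for a row with no terms
        norm = math.sqrt(sum(w * w for w in weighted)) or 1.0
        weighted_rows.append([w / norm for w in weighted])
    return weighted_rows


# Toy counts: 2 documents, 3 terms.
# Term 0 occurs in both documents, terms 1 and 2 in one document each.
counts = [[1, 2, 0],
          [1, 0, 1]]
weights = tfidf_rows(counts)
for row in weights:
    print([round(w, 3) for w in row])
```

In the second row, terms 0 and 2 have the same raw count, but term 2 receives the larger weight because it appears in fewer documents.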

In [0]:
print(X_train_tfidf)

  (0, 7069)	0.6019702680143677
  (0, 4415)	0.35852876712053044
  (0, 1736)	0.7135046738275388
  (1, 7048)	0.06220835924135395
  (1, 6330)	0.12175375951774331
  (1, 6250)	0.18699157878309497
  (1, 6247)	0.12959378252338027
  (1, 6219)	0.07434593703652681
  (1, 5791)	0.236698937005449
  (1, 5512)	0.15094660409003707
  (1, 5468)	0.236698937005449
  (1, 5443)	0.22545044092015623
  (1, 5437)	0.22545044092015623
  (1, 5436)	0.17424313877010147
  (1, 5243)	0.22545044092015623
  (1, 4519)	0.1982400748683877
  (1, 4518)	0.12552667891184655
  (1, 4513)	0.09521739789951615
  (1, 4489)	0.236698937005449
  (1, 4470)	0.0931011308331512
  (1, 4270)	0.08916256276356054
  (1, 4018)	0.08517903405827136
  (1, 3726)	0.20194453011537272
  (1, 3501)	0.2062210098516256
  (1, 3416)	0.08402534579064598
  :	:
  (3728, 2090)	0.220194057121089
  (3728, 1440)	0.19745041735266144
  (3728, 1080)	0.2074000350927176
  (3729, 5795)	0.5474874445685368
  (3729, 3794)	0.4832600441125685
  (3729, 3674)	0.5570538854722596
 

In [0]:
# Using TfidfVectorizer instead of CountVectorizer.
# It converts a collection of raw documents to a matrix of TF-IDF features.
# Equivalent to CountVectorizer followed by TfidfTransformer.

from sklearn.feature_extraction.text import TfidfVectorizer

In [0]:
TfidfVec = TfidfVectorizer(stop_words='english')

In [0]:
X_train_tfidf = TfidfVec.fit_transform(X_train)

In [0]:
print(X_train_tfidf) # different weights

  (0, 1682)	0.7643175120589921
  (0, 6810)	0.6448400892934248
  (1, 4363)	0.23420934589848316
  (1, 2394)	0.25692782234957695
  (1, 5262)	0.20585833407164308
  (1, 4336)	0.2796462988006707
  (1, 3382)	0.24363836555217044
  (1, 1743)	0.18064935619933836
  (1, 1781)	0.1919988484140977
  (1, 5070)	0.2663568420032642
  (1, 911)	0.2796462988006707
  (1, 6043)	0.22091988910107668
  (1, 6040)	0.15310766532681155
  (1, 5263)	0.2663568420032642
  (1, 908)	0.2796462988006707
  (1, 5294)	0.2796462988006707
  (1, 5598)	0.2796462988006707
  (1, 5269)	0.2663568420032642
  (1, 5334)	0.1783348065873963
  (2, 3840)	0.5864912369683677
  (2, 4789)	0.5864912369683677
  (2, 3501)	0.5586197793836414
  (3, 3173)	0.47065622909137744
  (3, 3299)	0.5402157093551054
  (3, 2247)	0.51454337483273
  :	:
  (3728, 2224)	0.4675008243643285
  (3728, 6660)	0.4675008243643285
  (3728, 6675)	0.39154125318646427
  (3728, 3021)	0.26077899741932115
  (3728, 2033)	0.26818566261748344
  (3728, 2830)	0.22604236235844813
  (3729

In [0]:
print(X_train_tfidf.shape) # after removing stopwords, fewer features

(3733, 6823)
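
The equivalence claimed above can be checked on a small scale. This sketch reuses the two toy documents rather than the SMS data, so the shapes differ from the ones printed in this notebook; the variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

docs = [
    "This is a story about cats our feline pets Cats are furry animals",
    "This story is about surfing Catching waves is fun Surfing is a popular water sport",
]

# Route 1: counts first, then tf-idf weighting
two_step = TfidfTransformer().fit_transform(CountVectorizer().fit_transform(docs))

# Route 2: both steps in one object
one_step = TfidfVectorizer().fit_transform(docs)

same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)

# With stop_words='english' the vocabulary shrinks, so the matrix has fewer columns
filtered = TfidfVectorizer(stop_words='english').fit_transform(docs)
print(one_step.shape, filtered.shape)
```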


In [0]:
## Train a classifier

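# LinearSVC is similar to SVC with kernel='linear', but it is implemented
# with liblinear rather than libsvm, so it should scale better to large
# numbers of samples. It accepts sparse input such as our tf-idf matrix.
# https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html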

from sklearn.svm import LinearSVC
clf = LinearSVC()
clf.fit(X_train_tfidf,y_train)

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
          verbose=0)

# Build a pipeline

Only our training set has been vectorized so far. To analyze the test set we have to apply the same transformation to it, reusing the vocabulary learned from the training data. Scikit-Learn offers a Pipeline class that chains the vectorizer and the classifier so that together they behave like a single compound classifier.
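
What "the same procedures" means in practice is that the vectorizer is fitted once, on the training text, and then only *applied* to new text. The sketch below shows this with hypothetical toy sentences (not the SMS data): words that never appeared in training are ignored at transform time, and the number of columns stays fixed. A `Pipeline` makes these calls for us: `fit` runs `fit_transform` on the vectorizer, and `predict` runs `transform`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["cats are furry animals", "surfing is a popular water sport"]
test_docs = ["furry cats enjoy kayaking"]  # "enjoy" and "kayaking" never appear in training

vect = TfidfVectorizer()
train_matrix = vect.fit_transform(train_docs)  # learns the vocabulary and idf weights
test_matrix = vect.transform(test_docs)        # reuses them; nothing is re-learned

print(train_matrix.shape, test_matrix.shape)   # same number of columns
print(test_matrix.nnz)                         # only the known words give non-zero entries
```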


In [0]:
from sklearn.pipeline import Pipeline
# from sklearn.feature_extraction.text import TfidfVectorizer
# from sklearn.svm import LinearSVC

text_clf = Pipeline([('tfidf', TfidfVectorizer()),
                     ('clf', LinearSVC()),
])

# Feed the training data through the pipeline
text_clf.fit(X_train, y_train)  

Pipeline(memory=None,
         steps=[('tfidf',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, use_idf=True,
                                 vocabulary=None)),
                ('clf',
                 LinearSVC(C=1.0, class_weight=None, dual=True,
                           fit_intercept=True, intercept_scaling=1,
               

In [0]:
predictions = text_clf.predict(X_test)

In [0]:
X_test

3245    Squeeeeeze!! This is christmas hug.. If u lik ...
944     And also I've sorta blown him off a couple tim...
1044    Mmm thats better now i got a roast down me! i...
2484        Mm have some kanji dont eat anything heavy ok
812     So there's a ring that comes with the guys cos...
                              ...                        
4944    Check mail.i have mailed varma and kept copy t...
3313    I know you are serving. I mean what are you do...
3652         Want to send me a virtual hug?... I need one
14                    I HAVE A DATE ON SUNDAY WITH WILL!!
4758    hey, looks like I was wrong and one of the kap...
Name: message, Length: 1839, dtype: object

In [0]:
predictions

array(['ham', 'ham', 'ham', ..., 'ham', 'ham', 'ham'], dtype=object)

In [0]:
from sklearn import metrics
print(metrics.confusion_matrix(y_test,predictions))

[[1586    7]
 [  12  234]]


In [0]:
# Print a classification report
print(metrics.classification_report(y_test,predictions))

              precision    recall  f1-score   support

         ham       0.99      1.00      0.99      1593
        spam       0.97      0.95      0.96       246

    accuracy                           0.99      1839
   macro avg       0.98      0.97      0.98      1839
weighted avg       0.99      0.99      0.99      1839



In [0]:
# Print the overall accuracy
print(metrics.accuracy_score(y_test,predictions))

0.989668297988037
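
The numbers in the classification report and the accuracy score can all be derived from the confusion matrix. The sketch below recomputes them by hand from the matrix printed above; the variable names are chosen for this example. In scikit-learn's confusion matrix, rows are the true labels and columns the predicted labels, here in the order `['ham', 'spam']`.

```python
# Confusion matrix copied from the output above.
# Rows are the true labels, columns the predicted labels, both ordered ['ham', 'spam'].
cm = [[1586, 7],
      [12, 234]]

ham_as_ham, ham_as_spam = cm[0]
spam_as_ham, spam_as_spam = cm[1]
total = sum(sum(row) for row in cm)

accuracy = (ham_as_ham + spam_as_spam) / total
# of the messages flagged as spam, how many really were spam
spam_precision = spam_as_spam / (spam_as_spam + ham_as_spam)
# of the actual spam messages, how many were caught
spam_recall = spam_as_spam / (spam_as_spam + spam_as_ham)

print(f"accuracy:       {accuracy:.4f}")
print(f"spam precision: {spam_precision:.4f}")
print(f"spam recall:    {spam_recall:.4f}")
```

Because ham makes up most of the test set, accuracy is dominated by the ham class; the spam recall is the better indicator of how much spam gets through.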


In [0]:
# Using only the text of the messages, our model performed very well:
# it labeled 98.97% of the test messages correctly, and it caught
# 234 of the 246 spam messages (spam recall of 0.95).

In [0]:
text_clf.predict(['''
Amy, Pete and Tom getting out within hours of each other consolidates the field 
a ton around Bernie and Joe. Elizabeth just raised $29 million in February 
and isn’t going anywhere. The big question mark is Bloomberg and how he performs 
on Super Tuesday.
'''])

array(['ham'], dtype=object)

In [0]:
text_clf.predict(['''
Are you going to convince Bernard to adopt his UBI policy? No?
'''])

array(['ham'], dtype=object)

In [0]:
# Should be 'spam' as it is a promoted ad.

text_clf.predict(['''
More than 90% of IT Professionals in IDC’s CloudView Survey indicated 
that they would evolve their digital transformation strategies to encompass 
multicloud postures this year.
'''])

array(['ham'], dtype=object)

In [0]:
# Should be 'spam'
# This model was trained on data in which the spam messages are cruder.
# Polite messages slip through, even though they ARE spam.

text_clf.predict(['''
Apply for the NEW Amex Business Edge™ Card and start earning 3x the points on eligible office 
supplies & electronics. T&Cs apply. 
'''])

array(['ham'], dtype=object)