<a href="https://colab.research.google.com/github/kshitijmamgain/NLP/blob/master/1578772832_Exercise_SVM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
import pandas as pd
import numpy as np
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.preprocessing import LabelEncoder
from collections import defaultdict
from nltk.corpus import wordnet as wn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import model_selection, naive_bayes, svm
from sklearn.metrics import accuracy_score

In [0]:
np.random.seed(500)

In [0]:
from google.colab import drive
drive.mount('/content/drive')
Corpus = pd.read_csv(r"/content/drive/My Drive/Colab Notebooks/1578771807_corpus.csv",encoding='latin-1')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [0]:
print(Corpus.shape)
Corpus.head(3)

(10000, 2)


Unnamed: 0,text,label
0,Stuning even for the non-gamer: This sound tr...,__label__2
1,The best soundtrack ever to anything.: I'm re...,__label__2
2,Amazing!: This soundtrack is my favorite musi...,__label__2


##### STEP -4: Data pre-processing
Pre-processing is an important step in any data mining process. 
It transforms raw data into a format that NLP models can work with. Real-world data is often incomplete, inconsistent, lacking in certain behaviors or trends, and likely to contain many errors. Data pre-processing is a proven method of resolving such issues, and it helps the classification algorithms produce better results.

Besides the more self-explanatory steps, two pre-processing techniques deserve a short explanation:

######  Tokenization: 


Tokenization breaks a stream of text into words, phrases, symbols, or other meaningful elements called tokens. The list of tokens becomes the input for further processing. The NLTK library provides `word_tokenize` and `sent_tokenize` to split a stream of text into a list of words or sentences, respectively.
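
As a rough illustration of what tokenization produces, the sketch below uses only Python's `re` module instead of NLTK, so it runs without any NLTK data downloads. The regular expressions and the function names `simple_word_tokenize` and `simple_sent_tokenize` are assumptions made for this example; NLTK's own tokenizers apply more careful rules (for contractions and abbreviations, for instance).

```python
import re

def simple_word_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters/digits/underscores; [^\w\s] matches one punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

def simple_sent_tokenize(text):
    """Split text into sentences after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

sample = "Amazing! This soundtrack is my favorite music."
words = simple_word_tokenize(sample)
sentences = simple_sent_tokenize(sample)
print(words)
print(sentences)
```

Punctuation comes out as separate tokens here, which is why a later step filters out non-alphabetic tokens.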


###### Word Stemming/Lemmatization: 


Both processes have the same aim: reducing the inflectional forms of each word to a common base or root. Lemmatization is closely related to stemming. The difference is that a stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words that have different meanings depending on part of speech. However, stemmers are typically easier to implement and run faster, and the reduced accuracy may not matter for some applications.
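
To make the difference concrete, here is a toy sketch in plain Python. The suffix rules in `toy_stem` and the entries in `TOY_LEMMAS` are invented for this example and are far simpler than a real stemmer or WordNet; the point is only that the stemmer sees the word alone, while the lemmatizer also receives a part-of-speech tag.

```python
def toy_stem(word):
    """Strip a few common suffixes without looking at context (hypothetical rules)."""
    for suffix in ("ing", "ed", "s"):
        # keep at least three characters so short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical lemma table keyed by (word, part of speech); "n" = noun, "v" = verb
TOY_LEMMAS = {
    ("meeting", "n"): "meeting",
    ("meeting", "v"): "meet",
    ("bears", "n"): "bear",
}

def toy_lemmatize(word, pos):
    """Look up the base form for a word used with a given part of speech."""
    return TOY_LEMMAS.get((word, pos), word)

print(toy_stem("meeting"))             # same result whatever the part of speech
print(toy_lemmatize("meeting", "n"))   # noun reading is kept as-is
print(toy_lemmatize("meeting", "v"))   # verb reading is reduced to its base form
```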


Here's the complete script that performs the pre-processing steps listed below. You can add or remove steps to suit the data set you are working with:

* Remove Blank rows in Data, if any
* Change all the text to lower case
* Word Tokenization
* Remove Stop words
* Remove Non-alpha text
* Word Lemmatization


In [0]:
# Step - a : Remove blank rows if any.
Corpus.dropna(axis = 0, how = 'any', inplace = True)

# Step - b : Change all the text to lower case. This is required because Python treats 'dog' and 'DOG' as different strings
Corpus['text'] = [str.lower(txt) for txt in Corpus['text']]

# Step - c : Tokenization : each entry in the corpus is broken into a list of tokens
Corpus['text'] = [word_tokenize(txt) for txt in Corpus['text']]


In [0]:
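# Uncomment on first run to download the NLTK data used below (requires `import nltk`)
#nltk.download('punkt')  # used by word_tokenize; newer NLTK releases ask for 'punkt_tab' instead
#nltk.download('wordnet')
#nltk.download('stopwords')
#nltk.download('averaged_perceptron_tagger')
# Step - d1 : Remove stop words and non-alphabetic tokens (lemmatization follows in step d2).
"""
1. to remove stop words, keep only words that are not in stopwords.words('english')
2. to drop numbers and punctuation, keep only tokens for which .isalpha() is True"""
def remove_stopwrds(series):
  """Return a new series with stop words and non-alphabetic tokens removed from each token list."""
  # Build the stop-word set once; calling stopwords.words() for every token is very slow
  stop_words = set(stopwords.words('english'))
  return series.apply(lambda tokens: [word for word in tokens if word not in stop_words and word.isalpha()])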

# define a new column without stop words
Corpus['stp_text'] = remove_stopwrds(Corpus['text'])

In [0]:
#Step - d2
# WordNetLemmatizer requires Pos tags to understand if the word is noun or verb or adjective etc. By default it is set to Noun
"""
1. pos tag function
2. Define a function for lemmatizing
"""
def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    # pos_tag is imported directly from nltk at the top of the notebook
    tag = pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wn.ADJ,
                "N": wn.NOUN,
                "V": wn.VERB,
                "R": wn.ADV}

    return tag_dict.get(tag, wn.NOUN)

def word_lemmatizer(series):
  """lemmatizes words in series to their root using POS tag"""

  lemmatizer = WordNetLemmatizer() # initializing lemmatizer

  # apply() builds a new Series, so the input column is left unchanged
  return series.apply(lambda tokens: [lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in tokens])

# a new column with lemmatized words
Corpus['lem_txt'] = word_lemmatizer(Corpus['stp_text'])

In [0]:
Corpus.tail(10)

Unnamed: 0,text,label,stp_text,lem_txt
9990,"[psychological, thriller, !, :, this, movie, r...",__label__2,"[psychological, thriller, movie, really, scare...","[psychological, thriller, movie, really, scar,..."
9991,"[a, little, more, money, than, what, i, expect...",__label__2,"[little, money, expected, spend, good, movie, ...","[little, money, expect, spend, good, movie, hu..."
9992,"[``, the, silence, of, the, dummies, '', :, th...",__label__1,"[silence, dummies, overall, real, hoot, anthon...","[silence, dummy, overall, real, hoot, anthony,..."
9993,"[mauled, again, -, killing, bears, to, enrich,...",__label__1,"[mauled, killing, bears, enrich, jim, cole, ti...","[maul, kill, bear, enrich, jim, cole, ticket, ..."
9994,"[sorry, jim, :, as, a, former, realtor, ,, mr....",__label__1,"[sorry, jim, former, realtor, cole, owes, natu...","[sorry, jim, former, realtor, cole, owes, natu..."
9995,"[a, revelation, of, life, in, small, town, ame...",__label__2,"[revelation, life, small, town, america, early...","[revelation, life, small, town, america, early..."
9996,"[great, biography, of, a, very, interesting, j...",__label__2,"[great, biography, interesting, journalist, bi...","[great, biography, interest, journalist, biogr..."
9997,"[interesting, subject, ;, poor, presentation, ...",__label__1,"[interesting, subject, poor, presentation, tel...","[interest, subject, poor, presentation, tell, ..."
9998,"[do, n't, buy, :, the, box, looked, used, and,...",__label__1,"[buy, box, looked, used, obviously, new, tried...","[buy, box, look, use, obviously, new, try, con..."
9999,"[beautiful, pen, and, fast, delivery, ., :, th...",__label__2,"[beautiful, pen, fast, delivery, pen, shipped,...","[beautiful, pen, fast, delivery, pen, ship, pr..."


In [0]:
# The dataframe above holds the final tokens as lists, but TF-IDF needs each document as a single string, so we join the tokens in a new column
Corpus['text_final'] = Corpus['lem_txt'].apply(lambda x: ' '.join(x))
Corpus['text_final']

0       stun even sound track beautiful paint senery m...
1       best soundtrack ever anything reading lot revi...
2       amaze soundtrack favorite music time hand inte...
3       excellent soundtrack truly like soundtrack enj...
4       remember pull jaw floor hearing played game kn...
                              ...                        
9995    revelation life small town america early thoug...
9996    great biography interest journalist biography ...
9997    interest subject poor presentation tell boring...
9998    buy box look use obviously new try contact ema...
9999    beautiful pen fast delivery pen ship promptly ...
Name: text_final, Length: 10000, dtype: object

##### STEP -5: Prepare Train and Test Datasets

In [0]:
Train_X, Test_X, Train_Y, Test_Y = model_selection.train_test_split(Corpus['text_final'],Corpus['label'],test_size=1/3)

##### STEP -6: Encoding
Label-encode the target variable. This converts the string class labels in the data set (`__label__1`, `__label__2`) into numerical values that the model can work with.
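
Conceptually, label encoding assigns an integer to each distinct class. The sketch below does this in plain Python rather than with scikit-learn; the helper names `fit_label_mapping` and `encode_labels` are hypothetical, and the sorted ordering is chosen to mirror how `LabelEncoder` orders its `classes_`.

```python
def fit_label_mapping(labels):
    """Map each distinct label to an integer, in sorted order."""
    return {label: index for index, label in enumerate(sorted(set(labels)))}

def encode_labels(labels, mapping):
    """Replace each label with its integer code using an already fitted mapping."""
    return [mapping[label] for label in labels]

train_labels = ["__label__2", "__label__1", "__label__2"]
test_labels = ["__label__1", "__label__2"]

mapping = fit_label_mapping(train_labels)          # learned from the training labels only
train_codes = encode_labels(train_labels, mapping)
test_codes = encode_labels(test_labels, mapping)   # the same mapping is reused for the test labels
print(mapping, train_codes, test_codes)
```

Reusing one mapping for both splits keeps a given integer tied to the same class everywhere.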

In [0]:
Encoder = LabelEncoder()
Train_Y = Encoder.fit_transform(Train_Y)
Test_Y = Encoder.transform(Test_Y)  # encode the test labels with the mapping learned from the training labels

In [0]:
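# Keep at most 5000 terms in the vocabulary
Tfidf_vect = TfidfVectorizer(max_features=5000)
# Learn the vocabulary and IDF weights; note that this fit uses the whole corpus, including rows that end up in the test split
Tfidf_vect.fit(Corpus['text_final'])
# Convert each split into a sparse matrix of TF-IDF weights
Train_X_Tfidf = Tfidf_vect.transform(Train_X)
Test_X_Tfidf = Tfidf_vect.transform(Test_X)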
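
For intuition about the weights the vectorizer produces, here is a minimal plain-Python sketch of TF-IDF. It uses the smoothed form idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalisation of each document, which to my understanding matches the default settings of `TfidfVectorizer`. The function name `toy_tfidf`, the whitespace tokenisation, and the three sample documents are assumptions made for this example, and the sketch expects every document to contain at least one token.

```python
import math
from collections import Counter

def toy_tfidf(documents):
    """Return one {term: weight} dict per document, each L2-normalised."""
    tokenised = [doc.split() for doc in documents]
    n_docs = len(tokenised)
    # document frequency: in how many documents each term appears
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    rows = []
    for tokens in tokenised:
        counts = Counter(tokens)
        # raw term count times the smoothed inverse document frequency
        weights = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0)
                   for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

docs = ["best soundtrack ever", "boring soundtrack", "best pen"]
rows = toy_tfidf(docs)
print(rows[0])
```

Terms that occur in fewer documents (such as "ever" here) receive a larger weight than terms shared across documents (such as "soundtrack").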

In [227]:
print(Tfidf_vect.vocabulary_)

{'stun': 4283, 'even': 1500, 'sound': 4136, 'track': 4558, 'beautiful': 362, 'paint': 3135, 'mind': 2817, 'well': 4862, 'would': 4948, 'recomend': 3596, 'people': 3207, 'hate': 2031, 'video': 4762, 'game': 1838, 'music': 2901, 'played': 3288, 'cross': 989, 'ever': 1503, 'best': 401, 'back': 307, 'away': 298, 'crude': 992, 'take': 4376, 'step': 4221, 'guitar': 1983, 'orchestra': 3072, 'impress': 2215, 'anyone': 186, 'care': 605, 'listen': 2594, 'soundtrack': 4137, 'anything': 187, 'reading': 3565, 'lot': 2637, 'review': 3720, 'say': 3834, 'figure': 1669, 'write': 4956, 'disagree': 1201, 'bit': 420, 'ultimate': 4633, 'masterpiece': 2732, 'timeless': 4506, 'year': 4975, 'beauty': 364, 'simply': 4019, 'refuse': 3620, 'price': 3396, 'tag': 4375, 'pretty': 3391, 'must': 2907, 'go': 1909, 'buy': 567, 'cd': 636, 'much': 2892, 'money': 2852, 'one': 3051, 'feel': 1646, 'worth': 4944, 'every': 1505, 'penny': 3206, 'amaze': 142, 'favorite': 1634, 'time': 4505, 'hand': 2002, 'intense': 2304, 'sadne

In [228]:
print(Train_X_Tfidf)

  (0, 4975)	0.06938596572401537
  (0, 4948)	0.05382985370268391
  (0, 4892)	0.11723818854107895
  (0, 4862)	0.0545130914338351
  (0, 4708)	0.12776119152305676
  (0, 4614)	0.09147880821345104
  (0, 4156)	0.13488611075298393
  (0, 4073)	0.09196696652161021
  (0, 4040)	0.09767867624304678
  (0, 4000)	0.10198096658119934
  (0, 3695)	0.15484697105647668
  (0, 3552)	0.09850515549500942
  (0, 3416)	0.0845546218683585
  (0, 3124)	0.13395307442770482
  (0, 3073)	0.08660292783244004
  (0, 3068)	0.16047779489294817
  (0, 2960)	0.09117937572932755
  (0, 2944)	0.1344144086284772
  (0, 2892)	0.06492657036844217
  (0, 2734)	0.13262695805357477
  (0, 2359)	0.14292506495836502
  (0, 2351)	0.10663986396277707
  (0, 1880)	0.049649752402516775
  (0, 1859)	0.14513656755519097
  (0, 1701)	0.19284096047956154
  :	:
  (6664, 1420)	0.20735738397295808
  (6664, 1241)	0.20363414869065574
  (6664, 1150)	0.20606269567044813
  (6664, 922)	0.4332651379076388
  (6664, 753)	0.19221565197016288
  (6664, 666)	0.10233038

##### STEP -7: Use the ML Algorithms

In [229]:
# Classifier - Algorithm - SVM
# fit the training dataset on the classifier
# (degree and gamma are ignored when kernel='linear')
SVM = svm.SVC(C=1.0, kernel='linear', degree=3, gamma='auto')
SVM.fit(Train_X_Tfidf,Train_Y)
# predict the labels on the test dataset
predictions_SVM = SVM.predict(Test_X_Tfidf)
# Use accuracy_score(y_true, y_pred) to get the accuracy as a percentage
print("SVM Accuracy Score -> ",accuracy_score(Test_Y, predictions_SVM)*100)

SVM Accuracy Score ->  84.64307138572286
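
The score above is the percentage of test rows whose predicted label equals the true label. A minimal plain-Python sketch of that calculation follows; the helper name `accuracy_percent` and the sample values are made up for illustration and are not taken from the notebook's data.

```python
def accuracy_percent(y_true, y_pred):
    """Percentage of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return 100.0 * correct / len(y_true)

true_labels = [1, 0, 1, 1]   # made-up values for illustration
predicted = [1, 0, 0, 1]
print(accuracy_percent(true_labels, predicted))
```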


