### What is Text Classification?
- Assigning a category (label) to a piece of text based on its content
- Examples: spam/ham detection, sentiment analysis, etc.

#### Types of Text Classification:
- Binary classification: two possible classes, e.g. email spam/ham
- Multi-class classification: exactly one class out of several, e.g. deciding whether a news article is about sports, entertainment, etc.
- Multi-label classification: a single input with multiple outputs; taking the same news example, one article can be related to politics, sports, and entertainment at the same time
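
A minimal sketch of how the target looks in each of the three cases. The label names and example texts are made up for illustration, and `to_multi_hot` is a hypothetical helper, not part of any library.

```python
# Hypothetical label set, used only for illustration.
TOPICS = ["entertainment", "politics", "sports"]

# Binary: exactly one of two classes per input.
binary_target = {"Win a free prize now!!!": "spam"}

# Multi-class: exactly one class out of more than two.
multiclass_target = {"Local team wins the cup final": "sports"}

# Multi-label: any subset of the classes may apply to one input.
multilabel_target = {"Minister attends the cup final": {"politics", "sports"}}


def to_multi_hot(labels, classes=TOPICS):
    """Encode a set of labels as a 0/1 vector, one slot per class."""
    return [1 if c in labels else 0 for c in classes]


vector = to_multi_hot(multilabel_target["Minister attends the cup final"])
print(vector)  # [0, 1, 1]
```

In the multi-label case the model predicts one 0/1 decision per class instead of a single class.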


#### Applications:
- email filtering
- customer support
- language detection
- fake news detection

#### Different Approaches:
- Heuristic approach (used when you have little or no data, so a model cannot be trained; in short, it turns the task into a software engineering problem of hand-written rules)
- Using APIs (cloud resources, or services like https://nlpcloud.com/home/playground/)
- Using ML:
    - at the text vectorization stage, with techniques such as BoW, n-grams, and TF-IDF
    - at the modeling stage, with algorithms such as Naive Bayes, Random Forest, and SVM
- Using DL:
    - RNN (LSTM)
    - CNN
    - Pre-trained models
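
As a rough illustration of the heuristic approach, the sketch below labels a review by counting hand-picked keywords. The word lists and the function name `heuristic_sentiment` are assumptions made up for this example; a real rule set would be curated by hand for the domain.

```python
import re

# Hypothetical keyword lists; a real system would curate these by hand.
POSITIVE_WORDS = {"wonderful", "brilliant", "great", "masterful"}
NEGATIVE_WORDS = {"terrible", "unwatchable", "frustrating", "bad"}


def heuristic_sentiment(text):
    """Label text by counting hand-picked positive and negative keywords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(t in POSITIVE_WORDS for t in tokens) - sum(
        t in NEGATIVE_WORDS for t in tokens
    )
    # Ties default to "negative" here; that choice is arbitrary.
    return "positive" if score > 0 else "negative"


print(heuristic_sentiment("A wonderful little production."))
print(heuristic_sentiment("Absolutely unwatchable, terrible acting."))
```

No training data is needed, but every rule has to be written and maintained manually.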

#### Implementation with BoW & n-grams

In [60]:
import numpy as np
import pandas as pd

In [61]:
temp_df = pd.read_csv('IMDB_Dataset.csv')

In [62]:
df = temp_df.iloc[:10000]

In [63]:
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [64]:
df['review'][1]

'A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams\' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master\'s of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional \'dream\' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell\'s murals decorating every surface) are terribly well d

In [65]:
df['sentiment'].value_counts()

sentiment,count
positive,5028
negative,4972


In [66]:
df.isnull().sum()

Unnamed: 0,0
review,0
sentiment,0


In [67]:
df.duplicated().sum()

17

In [68]:
# Reassign instead of using inplace=True: `df` is a slice of `temp_df`,
# so in-place edits trigger pandas' SettingWithCopyWarning.
df = df.drop_duplicates().copy()


In [69]:
df.duplicated().sum()

0

In [70]:
df.sample(5)

Unnamed: 0,review,sentiment
9152,...that Jamie Foxx would ever deliver such a w...,negative
8975,"OK, I admit that I still associate Sophie Marc...",negative
5225,"I saw this movie a time ago, because some of m...",negative
6554,The reviews I read for this movie were pretty ...,negative
7901,i rented this when it came out on video casset...,negative


#### Basic Preprocessing:
- Remove HTML tags
- Lowercase
- Remove stopwords
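
The steps listed above can be sketched on a single string before applying them to the whole DataFrame. The tiny `STOPWORDS` set is a stand-in assumption (the notebook uses NLTK's English list), and `preprocess` is a hypothetical helper name.

```python
import re

# Tiny stand-in stopword list (an assumption; the notebook uses NLTK's list).
STOPWORDS = {"a", "the", "is", "and", "very"}
TAG_RE = re.compile(r"<.*?>")


def preprocess(text):
    """Remove HTML tags, lowercase, then drop stopwords."""
    text = TAG_RE.sub("", text)
    text = text.lower()
    return " ".join(w for w in text.split() if w not in STOPWORDS)


sample = "A wonderful little production. <br /><br />The filming technique is very unassuming"
cleaned = preprocess(sample)
print(cleaned)  # wonderful little production. filming technique unassuming
```

Punctuation stays attached to words (`production.`) because the text is split on whitespace only.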

In [71]:
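import re

# Non-greedy pattern: matches each HTML tag such as <br /> separately
TAG_RE = re.compile(r'<.*?>')

def remove_tags(raw_text):
    """Return raw_text with all HTML tags removed."""
    return TAG_RE.sub('', raw_text)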

In [72]:
df['review'] = df['review'].apply(remove_tags)



In [73]:
df.sample(5)

Unnamed: 0,review,sentiment
28,This movie was so frustrating. Everything seem...,negative
9191,"As a movie, THE ITALIAN JOB is ok at best; goo...",positive
6781,"Absolutely unwatchable, lowest quality film ma...",negative
6286,"I just finished viewing this finely conceived,...",positive
3226,One of the commentators on the subject of Lil'...,negative


In [74]:
df['review'] = df['review'].apply(lambda x:x.lower())



In [75]:
import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords

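sw_list = stopwords.words('english')
sw_set = set(sw_list)  # set lookup is much faster than scanning a list for every word

# Drop stopwords from every review, then re-join the remaining words
df['review'] = df['review'].apply(lambda x: " ".join(w for w in x.split() if w not in sw_set))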

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [76]:
df.sample(7)

Unnamed: 0,review,sentiment
6417,saimin (usa: hypnotist /uk: hypnosis) aspect r...,negative
7168,much like show hard know start. unlikeable cha...,negative
2837,"dead man walking, absolutely brilliant, tears ...",positive
5708,"**attention spoilers**first all, let say rob r...",positive
9320,"think jason lee huge potential, wrong vehicle ...",negative
2881,event really happened mean make good screenpla...,negative
9915,"christian say movie terrible acting, unreal si...",negative


In [77]:
X = df.iloc[:,0:1]
y = df['sentiment']

In [78]:
X

Unnamed: 0,review
0,one reviewers mentioned watching 1 oz episode ...
1,wonderful little production. filming technique...
2,thought wonderful way spend time hot summer we...
3,basically there's family little boy (jake) thi...
4,"petter mattei's ""love time money"" visually stu..."
...,...
9995,"fun, entertaining movie wwii german spy (julie..."
9996,"give break. anyone say ""good hockey movie""? kn..."
9997,movie bad movie. watching endless series bad h...
9998,"movie probably made entertain middle school, e..."


In [79]:
y

Unnamed: 0,sentiment
0,positive
1,positive
2,positive
3,negative
4,positive
...,...
9995,positive
9996,negative
9997,negative
9998,negative


In [80]:
# Now we have to convert `sentiment` into 0 / 1.
# `LabelEncoder` converts categorical labels into numeric form.
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
y = encoder.fit_transform(y)

In [81]:
y

array([1, 1, 1, ..., 0, 0, 1])
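
A rough sketch of the mapping behind this output, assuming (as scikit-learn documents for `LabelEncoder`) that codes follow the sorted order of the class names. This is why `negative` becomes 0 and `positive` becomes 1. The short `labels` list is made up for illustration.

```python
labels = ["positive", "positive", "negative", "positive"]

# Sorted unique classes; the position in this list becomes the integer code.
classes = sorted(set(labels))
to_code = {name: code for code, name in enumerate(classes)}

encoded = [to_code[name] for name in labels]
print(classes)  # ['negative', 'positive']
print(encoded)  # [1, 1, 0, 1]
```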

In [82]:
# split the data for training and testing
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=1)

In [83]:
X_train.shape

(7986, 1)
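
The split sizes can be checked by hand. The sketch below assumes the test count is rounded up and the remainder goes to training, which matches the shapes shown in this notebook.

```python
import math

n_samples = 10000 - 17  # rows left after dropping the 17 duplicates
test_size = 0.2

# Assumption: the test count is rounded up and the rest goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 7986 1997
```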

In [84]:
# Applying BoW
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()

X_train_bow = cv.fit_transform(X_train['review']).toarray()
X_test_bow = cv.transform(X_test['review']).toarray()

In [85]:
X_train_bow.shape

(7986, 48282)
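
To make the `fit_transform` / `transform` distinction concrete, here is a hand-rolled bag-of-words sketch on toy documents. It is not how `CountVectorizer` is implemented (no tokenization rules, no sparse matrices); `to_counts` is a hypothetical helper.

```python
from collections import Counter

train_docs = ["good movie good acting", "bad movie"]
test_doc = "good plot bad ending"

# "fit": the vocabulary is learned from the training documents only.
vocabulary = sorted({word for doc in train_docs for word in doc.split()})


def to_counts(doc):
    """Count each vocabulary word; words outside the vocabulary are ignored."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]


print(vocabulary)                          # ['acting', 'bad', 'good', 'movie']
print([to_counts(d) for d in train_docs])  # [[1, 0, 2, 1], [0, 1, 0, 1]]
print(to_counts(test_doc))                 # [0, 1, 1, 0]
```

Each column is one vocabulary word, so the 48282 columns above correspond to 48282 distinct terms in the training reviews. Words seen only at test time (`plot`, `ending`) are dropped.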

In [86]:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()

gnb.fit(X_train_bow,y_train)

In [87]:
y_pred = gnb.predict(X_test_bow)

from sklearn.metrics import accuracy_score,confusion_matrix
accuracy_score(y_test,y_pred)

0.6324486730095142

In [88]:
confusion_matrix(y_test,y_pred)

array([[717, 235],
       [499, 546]])
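
The confusion matrix says more than the accuracy alone. The sketch below recomputes accuracy from it and adds precision and recall for the positive class, assuming rows are true classes, columns are predicted classes, and class 1 is `positive`.

```python
# Confusion matrix from the GaussianNB run: rows = true class, columns = predicted.
cm = [[717, 235],
      [499, 546]]

(tn, fp), (fn, tp) = cm
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
precision_pos = tp / (tp + fp)  # of reviews predicted positive, share truly positive
recall_pos = tp / (tp + fn)     # of truly positive reviews, share found

print(round(accuracy, 4), round(precision_pos, 4), round(recall_pos, 4))
```

Under these assumptions, 499 of the 1045 positive test reviews are predicted as negative, which is where most of the errors come from.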

In [89]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()

rf.fit(X_train_bow,y_train)
y_pred = rf.predict(X_test_bow)
accuracy_score(y_test,y_pred)

0.8517776664997496

In [90]:
# To try to improve accuracy, keep only the 3000 most frequent terms
# instead of the full vocabulary of 48282 features
# (see X_train_bow.shape in cell 85, after applying BoW)

cv = CountVectorizer(max_features=3000)

X_train_bow = cv.fit_transform(X_train['review']).toarray()
X_test_bow = cv.transform(X_test['review']).toarray()

rf = RandomForestClassifier()

rf.fit(X_train_bow,y_train)
y_pred = rf.predict(X_test_bow)
accuracy_score(y_test,y_pred)

0.8407611417125689
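
`max_features` keeps the terms with the highest total count across the training corpus, not the first ones encountered. A simplified sketch of that selection on toy documents (tie-breaking details of `CountVectorizer` are not modelled here):

```python
from collections import Counter

docs = ["good movie good acting", "bad movie", "good plot"]

# Total frequency of every term across the whole corpus.
term_freq = Counter(word for doc in docs for word in doc.split())

max_features = 2
kept = sorted(word for word, _ in term_freq.most_common(max_features))
print(kept)  # ['good', 'movie']
```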

In [91]:
# Accuracy dropped slightly, so let's experiment with `ngram_range`
cv = CountVectorizer(ngram_range=(1,3),max_features=3000)

X_train_bow = cv.fit_transform(X_train['review']).toarray()
X_test_bow = cv.transform(X_test['review']).toarray()

rf = RandomForestClassifier()

rf.fit(X_train_bow,y_train)
y_pred = rf.predict(X_test_bow)
accuracy_score(y_test,y_pred)

0.8382573860791187

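Adding n-grams lowered accuracy slightly again (0.8408 to 0.8383). Since `RandomForestClassifier` is created without a fixed `random_state`, differences this small may partly be run-to-run noise; to improve accuracy further, we can tune the models' hyperparameters.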
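
For reference, this is what `ngram_range=(1, 3)` adds to the feature set. The sketch is a hand-rolled illustration with a hypothetical `ngrams` helper, not the vectorizer's own code.

```python
def ngrams(tokens, n_min, n_max):
    """Return all contiguous n-grams with n_min <= n <= n_max, joined by spaces."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


features = ngrams("not a good movie".split(), 1, 3)
print(features)
```

Four words produce nine features (4 unigrams, 3 bigrams, 2 trigrams). With `max_features=3000`, unigrams, bigrams, and trigrams all compete for the same 3000 slots, which is one plausible reason the score did not improve.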

#### Using TF-IDF

In [92]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [93]:
tfidf = TfidfVectorizer()

In [94]:
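# Fit TF-IDF on the training reviews only, then reuse the fitted vocabulary for the test set
X_train_tfidf = tfidf.fit_transform(X_train['review']).toarray()
X_test_tfidf = tfidf.transform(X_test['review']).toarray()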

In [95]:
rf = RandomForestClassifier()

rf.fit(X_train_tfidf,y_train)
y_pred = rf.predict(X_test_tfidf)

accuracy_score(y_test,y_pred)

0.8547821732598898
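
A small sketch of the weighting behind TF-IDF, using the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which scikit-learn documents as its default settings. The toy documents and the helper name `tfidf_vector` are assumptions for illustration.

```python
import math
from collections import Counter

docs = ["good movie", "bad movie", "good good plot"]
tokenized = [doc.split() for doc in docs]
vocabulary = sorted({w for tokens in tokenized for w in tokens})
n_docs = len(docs)

# Document frequency: in how many documents each term appears.
doc_freq = Counter(w for tokens in tokenized for w in set(tokens))

# Smoothed IDF: rarer terms get a larger weight.
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocabulary}


def tfidf_vector(tokens):
    """Raw term count times IDF, then L2-normalised."""
    counts = Counter(tokens)
    weights = [counts[w] * idf[w] for w in vocabulary]
    norm = math.sqrt(sum(x * x for x in weights)) or 1.0
    return [x / norm for x in weights]


for doc, tokens in zip(docs, tokenized):
    print(doc, [round(x, 3) for x in tfidf_vector(tokens)])
```

Terms that appear in fewer documents (`bad`, `plot`) get a higher IDF than terms that appear in more of them (`good`, `movie`), so they contribute more to the vector than a raw count would.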

#### Using Word2Vec:

In [96]:
import gensim

In [97]:
from gensim.models import Word2Vec,KeyedVectors

In [98]:
!pip install gdown
!gdown --id 0B7XkCwpI5KDYNlNUTTlSS21pQmM --output GoogleNews-vectors-negative300.bin.gz

Downloading...
From (original): https://drive.google.com/uc?id=0B7XkCwpI5KDYNlNUTTlSS21pQmM
From (redirected): https://drive.google.com/uc?id=0B7XkCwpI5KDYNlNUTTlSS21pQmM&confirm=t&uuid=f7ca9a0d-957e-4a40-8f86-95bafec92a12
To: /content/GoogleNews-vectors-negative300.bin.gz
100% 1.65G/1.65G [00:21<00:00, 76.8MB/s]


In [99]:
model = KeyedVectors.load_word2vec_format('/content/GoogleNews-vectors-negative300.bin.gz',binary=True,limit=500000)

In [100]:
model['cricket'].shape

(300,)

In [101]:
from nltk.corpus import stopwords
sw_list = stopwords.words('english')

In [102]:
sw_list

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [103]:
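# Stopwords were already removed from `df['review']` in cell 75, so re-applying the filter changes nothing;
# the practical effect here is converting X_train / X_test from one-column DataFrames to Series of strings.
X_train = X_train['review'].apply(lambda x: [item for item in x.split() if item not in sw_list]).apply(lambda x:" ".join(x))

# Same step for the test set
X_test = X_test['review'].apply(lambda x: [item for item in x.split() if item not in sw_list]).apply(lambda x:" ".join(x))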

In [104]:
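import spacy
import en_core_web_sm
# Load the small English spaCy pipeline. This takes a few seconds.
nlp = en_core_web_sm.load()
# Process one review; doc.vector is a single fixed-length vector for the whole text.
# Note: the features used below come from spaCy, not from the Word2Vec `model` loaded above.
doc = nlp(X_train.values[0])
print(doc.vector)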

[-9.38924626e-02 -3.84084344e-01  5.38174808e-02 -5.68839815e-03
  5.53607009e-02  4.34335917e-01  1.12127282e-01  3.69744062e-01
  2.90832311e-01 -1.70808136e-01  2.64332980e-01 -1.62290543e-01
 -3.06538224e-01 -2.85405099e-01 -1.81089625e-01  1.67842567e-01
 -5.53924069e-02 -2.39561155e-01 -9.84708741e-02 -2.47755840e-01
 -1.03777826e-01  3.14559907e-01 -1.58189446e-01  9.29157138e-02
 -6.30339906e-02  1.74012944e-01  2.98766553e-01  6.10951722e-01
  2.79511511e-01  6.51886985e-02 -5.98237105e-02 -1.00067459e-01
  2.43561059e-01 -5.19757196e-02  1.00620603e-02 -1.06087245e-01
  2.54588515e-01 -9.97592583e-02 -1.51748762e-01 -2.58460432e-01
 -3.79791379e-01  1.97484657e-01 -5.39284982e-02  4.59246263e-02
 -4.72242162e-02 -1.32725611e-01  2.24066466e-01 -1.11504689e-01
  2.12011918e-01  1.45252958e-01 -3.38403374e-01  1.08032361e-01
  7.14340210e-02 -1.43950149e-01 -4.83557396e-02  7.11817760e-03
  2.28908435e-01 -6.45604953e-02  1.57704279e-02 -2.28218064e-02
  8.61754715e-02 -1.64328

In [105]:
!pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1.tar.gz

Collecting https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1.tar.gz
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1.tar.gz (12.8 MB)
  Preparing metadata (setup.py) ... done


In [106]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [107]:
input_arr = []
for item in X_train.values:
    doc = nlp(item)
    input_arr.append(doc.vector)

In [108]:
input_arr = np.array(input_arr)

In [109]:
input_arr.shape

(7986, 96)

In [110]:
input_test_arr = []
for item in X_test.values:
    doc = nlp(item)
    input_test_arr.append(doc.vector)

In [111]:
input_test_arr = np.array(input_test_arr)

In [112]:
input_test_arr.shape

(1997, 96)
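
Each review becomes one fixed-length vector regardless of its length. spaCy documents `doc.vector` as defaulting to an average of the token vectors; the sketch below shows that averaging idea with made-up 3-dimensional vectors. The same function would work with any mapping that supports `word in vectors` and `vectors[word]`, which is assumed here to include the `KeyedVectors` model loaded earlier.

```python
# Toy 3-dimensional word vectors (made-up numbers, for illustration only).
toy_vectors = {
    "good":  [0.9, 0.1, 0.0],
    "movie": [0.1, 0.8, 0.1],
    "bad":   [-0.8, 0.1, 0.1],
}


def document_vector(text, vectors, dim=3):
    """Average the vectors of known words; all zeros if none are known."""
    known = [vectors[w] for w in text.split() if w in vectors]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]


print(document_vector("good movie", toy_vectors))
print(document_vector("unknown words only", toy_vectors))
```

Averaging discards word order, so "not good" and "good not" map to the same vector.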

In [113]:
from sklearn.naive_bayes import GaussianNB

In [114]:
gnb = GaussianNB()
gnb.fit(input_arr,y_train)

In [115]:
y_pred = gnb.predict(input_test_arr)
accuracy_score(y_test,y_pred)

0.6169253880821232