# Guided Projects Artificial Intelligence & Machine Learning
## Guided Projects: Natural Language Processing 
### Text Classification
Text classification is the process of assigning tags or categories to text according to its content. 
It is one of the fundamental tasks in Natural Language Processing (NLP) with broad applications 
such as sentiment analysis, topic labelling, spam detection, and intent detection. Text classifiers 
can automatically analyze text and then assign a set of pre-defined tags or categories based on its 
content.
### Question:
Using vector semantics, we can convert a given text into a corresponding vector. 
Given any text, first preprocess it and convert it into a vector using bag-of-words (BoW) methods. 
Then implement your own classifier to assign the vector to one of the pre-defined 
categories. You may use one of these datasets for training and for defining the categories:


14 Best Text classification Datasets for Machine Learning
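
As a rough illustration of what the question asks for, the sketch below builds bag-of-words count vectors by hand and trains a small multinomial Naive Bayes classifier written from scratch. The helper names (`tokenize`, `build_vocabulary`, `to_bow`, `TinyNaiveBayes`) and the four toy documents are assumptions made for this example only; they are not part of the notebook or of any library.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def build_vocabulary(documents):
    """Map each distinct token to a column index."""
    vocab = {}
    for doc in documents:
        for token in tokenize(doc):
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(text, vocab):
    """Return a count vector; tokens outside the vocabulary are ignored."""
    vector = [0] * len(vocab)
    for token in tokenize(text):
        if token in vocab:
            vector[vocab[token]] += 1
    return vector


class TinyNaiveBayes:
    """Hypothetical from-scratch multinomial Naive Bayes with add-alpha smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, vectors, labels):
        n_features = len(vectors[0])
        class_counts = Counter(labels)
        token_counts = {c: [0] * n_features for c in class_counts}
        for vec, label in zip(vectors, labels):
            for j, count in enumerate(vec):
                token_counts[label][j] += count
        self.classes_ = sorted(class_counts)
        # log P(class)
        self.log_prior_ = {
            c: math.log(class_counts[c] / len(labels)) for c in self.classes_
        }
        # log P(token | class) with add-alpha smoothing
        self.log_likelihood_ = {}
        for c in self.classes_:
            total = sum(token_counts[c]) + self.alpha * n_features
            self.log_likelihood_[c] = [
                math.log((n + self.alpha) / total) for n in token_counts[c]
            ]
        return self

    def predict_one(self, vec):
        """Return the class with the highest log posterior score."""
        scores = {
            c: self.log_prior_[c]
            + sum(x * w for x, w in zip(vec, self.log_likelihood_[c]))
            for c in self.classes_
        }
        return max(scores, key=scores.get)


# Toy data, invented for this sketch
docs = [
    "great service and friendly staff",
    "friendly staff great prices",
    "empty shelves and long queues",
    "long queues terrible service",
]
labels = ["Positive", "Positive", "Negative", "Negative"]

vocab = build_vocabulary(docs)
X = [to_bow(d, vocab) for d in docs]
clf = TinyNaiveBayes().fit(X, labels)

pred_pos = clf.predict_one(to_bow("friendly staff", vocab))
pred_neg = clf.predict_one(to_bow("long queues", vocab))
print(pred_pos, pred_neg)
```

The cells that follow take a different route: they use TF-IDF features and scikit-learn classifiers rather than a hand-written model.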

In [32]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import string
import nltk
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from tensorflow.keras.utils import to_categorical
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

In [2]:
train = pd.read_csv('../24. Text Classification/Corona_NLP_train.csv')
test = pd.read_csv('../24. Text Classification/Corona_NLP_test.csv')
train.head()

Unnamed: 0,UserName,ScreenName,Location,TweetAt,OriginalTweet,Sentiment
0,3799,48751,London,16-03-2020,@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i...,Neutral
1,3800,48752,UK,16-03-2020,advice Talk to your neighbours family to excha...,Positive
2,3801,48753,Vagabonds,16-03-2020,Coronavirus Australia: Woolworths to give elde...,Positive
3,3802,48754,,16-03-2020,My food stock is not the only one which is emp...,Positive
4,3803,48755,,16-03-2020,"Me, ready to go at supermarket during the #COV...",Extremely Negative


In [3]:
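# characters to strip; anything not listed here (e.g. "?" and "!") is kept
punctuation = ["'","@","#",",",".","/",":",";",'"',"(",")","_"]
def remove_punctuation(text):
    """Return text with every character listed in `punctuation` removed."""
    return "".join(ch for ch in text if ch not in punctuation)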
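
Removing characters one by one turns URLs and @-mentions into long, meaningless tokens (for example `httpstco...`). One possible alternative, sketched here under assumptions, is to drop URLs and mentions with regular expressions before stripping punctuation. The function name `clean_tweet`, the patterns, and the sample tweet are hypothetical. Note that `string.punctuation` removes more characters than the list used in this notebook.

```python
import re
import string

# Assumed patterns for this sketch; adjust them to the data at hand
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_tweet(text):
    """Drop URLs and @-mentions, strip ASCII punctuation, collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.translate(PUNCT_TABLE)
    return " ".join(text.split())


example = "@SomeUser Panic buying hits #NewYork: see https://example.com/abc now!"
cleaned = clean_tweet(example)
print(cleaned)
```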

In [4]:
# store the punctuation-free text
train['clean_msg'] = train['OriginalTweet'].apply(remove_punctuation)
train.head()

Unnamed: 0,UserName,ScreenName,Location,TweetAt,OriginalTweet,Sentiment,clean_msg
0,3799,48751,London,16-03-2020,@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i...,Neutral,MeNyrbie PhilGahan Chrisitv httpstcoiFz9FAn2Pa...
1,3800,48752,UK,16-03-2020,advice Talk to your neighbours family to excha...,Positive,advice Talk to your neighbours family to excha...
2,3801,48753,Vagabonds,16-03-2020,Coronavirus Australia: Woolworths to give elde...,Positive,Coronavirus Australia Woolworths to give elder...
3,3802,48754,,16-03-2020,My food stock is not the only one which is emp...,Positive,My food stock is not the only one which is emp...
4,3803,48755,,16-03-2020,"Me, ready to go at supermarket during the #COV...",Extremely Negative,Me ready to go at supermarket during the COVID...


In [5]:
test['clean_msg']= test['OriginalTweet'].apply(lambda x:remove_punctuation(x))
test.head()

Unnamed: 0,UserName,ScreenName,Location,TweetAt,OriginalTweet,Sentiment,clean_msg
0,1,44953,NYC,02-03-2020,TRENDING: New Yorkers encounter empty supermar...,Extremely Negative,TRENDING New Yorkers encounter empty supermark...
1,2,44954,"Seattle, WA",02-03-2020,When I couldn't find hand sanitizer at Fred Me...,Positive,When I couldnt find hand sanitizer at Fred Mey...
2,3,44955,,02-03-2020,Find out how you can protect yourself and love...,Extremely Positive,Find out how you can protect yourself and love...
3,4,44956,Chicagoland,02-03-2020,#Panic buying hits #NewYork City as anxious sh...,Negative,Panic buying hits NewYork City as anxious shop...
4,5,44957,"Melbourne, Victoria",03-03-2020,#toiletpaper #dunnypaper #coronavirus #coronav...,Neutral,toiletpaper dunnypaper coronavirus coronavirus...


In [6]:
train['msg_lower']= train['clean_msg'].apply(lambda x: x.lower())
test['msg_lower']= test['clean_msg'].apply(lambda x: x.lower())
train.head()

Unnamed: 0,UserName,ScreenName,Location,TweetAt,OriginalTweet,Sentiment,clean_msg,msg_lower
0,3799,48751,London,16-03-2020,@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i...,Neutral,MeNyrbie PhilGahan Chrisitv httpstcoiFz9FAn2Pa...,menyrbie philgahan chrisitv httpstcoifz9fan2pa...
1,3800,48752,UK,16-03-2020,advice Talk to your neighbours family to excha...,Positive,advice Talk to your neighbours family to excha...,advice talk to your neighbours family to excha...
2,3801,48753,Vagabonds,16-03-2020,Coronavirus Australia: Woolworths to give elde...,Positive,Coronavirus Australia Woolworths to give elder...,coronavirus australia woolworths to give elder...
3,3802,48754,,16-03-2020,My food stock is not the only one which is emp...,Positive,My food stock is not the only one which is emp...,my food stock is not the only one which is emp...
4,3803,48755,,16-03-2020,"Me, ready to go at supermarket during the #COV...",Extremely Negative,Me ready to go at supermarket during the COVID...,me ready to go at supermarket during the covid...


In [7]:
unwanted_cols = ['UserName', 'ScreenName', 'Location', 'TweetAt', 'OriginalTweet', 'clean_msg','Sentiment']

In [8]:
train_X = train.drop(unwanted_cols, axis=1)
train_Y = train['Sentiment']
test_X = test.drop(unwanted_cols, axis=1)
test_Y = test['Sentiment']

In [9]:
stop_words = nltk.corpus.stopwords.words('english')
stop_words[:5]

['i', 'me', 'my', 'myself', 'we']

In [10]:
train_X['length'] = train_X['msg_lower'].apply(len)
train_X.head()

Unnamed: 0,msg_lower,length
0,menyrbie philgahan chrisitv httpstcoifz9fan2pa...,92
1,advice talk to your neighbours family to excha...,237
2,coronavirus australia woolworths to give elder...,124
3,my food stock is not the only one which is emp...,284
4,me ready to go at supermarket during the covid...,287


In [11]:
max(train_X['length'])

322

In [12]:
ps = PorterStemmer()

In [13]:
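def pre_process(text):
    """Drop English stop words, then Porter-stem the remaining tokens."""
    tokens = [word for word in text.split() if word.lower() not in stop_words]
    # each stem is followed by a space, so the result keeps a trailing space
    return "".join(ps.stem(token) + " " for token in tokens)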

In [14]:
textFeatures_train = train_X['msg_lower'].copy()
textFeatures_train = textFeatures_train.apply(pre_process)
textFeatures_train[:5]

0    menyrbi philgahan chrisitv httpstcoifz9fan2pa ...
1    advic talk neighbour famili exchang phone numb...
2    coronaviru australia woolworth give elderli di...
3    food stock one empti pleas dont panic enough f...
4    readi go supermarket covid19 outbreak im paran...
Name: msg_lower, dtype: object

In [15]:
textFeatures_test = test_X['msg_lower'].copy()
textFeatures_test = textFeatures_test.apply(pre_process)
textFeatures_test[:5]

0    trend new yorker encount empti supermarket she...
1    couldnt find hand sanit fred meyer turn amazon...
2                  find protect love one coronaviru ? 
3    panic buy hit newyork citi anxiou shopper stoc...
4    toiletpap dunnypap coronaviru coronavirusaustr...
Name: msg_lower, dtype: object

In [22]:
textFeatures_combined = pd.concat([textFeatures_train, textFeatures_test], axis = 0)
len(textFeatures_combined)

44955

In [23]:
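# Note: fitting on train and test together lets test-set vocabulary and IDF
# statistics influence the features. Fitting on the training texts only and
# calling transform() on the test texts avoids this.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
features_combined = vectorizer.fit_transform(textFeatures_combined)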
features_combined.shape

(44955, 546515)
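
A common alternative to fitting on the combined texts is to fit the vectorizer on the training texts only and reuse it for the test texts. The sketch below shows that pattern on a few made-up, already-stemmed strings; `train_texts` and `test_texts` are assumptions for the example, not the notebook's data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for the pre-processed tweets
train_texts = ["food stock empti", "panic buy supermarket", "supermarket shelv empti"]
test_texts = ["panic buy toiletpap", "empti supermarket"]

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
# Learn vocabulary and IDF weights from the training texts only
train_features = vectorizer.fit_transform(train_texts)
# Reuse the fitted vocabulary; unseen test terms are simply ignored
test_features = vectorizer.transform(test_texts)

print(train_features.shape, test_features.shape)
print("toiletpap" in vectorizer.vocabulary_)
```

Both matrices share the same columns, so a classifier trained on `train_features` can score `test_features` directly.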

In [24]:
train_X = features_combined[:len(textFeatures_train)]
test_X = features_combined[len(textFeatures_train):]
train_X.shape, test_X.shape

((41157, 546515), (3798, 546515))

In [25]:
len(train_Y), len(test_Y)

(41157, 3798)

In [26]:
target_names = train_Y.unique()
target_names

array(['Neutral', 'Positive', 'Extremely Negative', 'Negative',
       'Extremely Positive'], dtype=object)

In [27]:
# Prediction using Support Vector Machine
svc = SVC(kernel='sigmoid', gamma=1.0)
svc.fit(train_X, train_Y)
prediction = svc.predict(test_X)
# accuracy_score(test_Y, prediction)
# classification_report lists classes in sorted label order, so the display
# names must be sorted too; unique() returns them in order of appearance.
print(classification_report(test_Y, prediction, target_names=sorted(target_names)))

                    precision    recall  f1-score   support

Extremely Negative       0.72      0.39      0.50       592
Extremely Positive       0.77      0.47      0.58       599
          Negative       0.50      0.56      0.53      1041
           Neutral       0.61      0.63      0.62       619
          Positive       0.48      0.67      0.56       947

          accuracy                           0.56      3798
         macro avg       0.62      0.54      0.56      3798
      weighted avg       0.59      0.56      0.56      3798
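
`classification_report` pairs `target_names` with the class labels in sorted order, not in the order returned by `unique()`, so unsorted names can end up attached to the wrong rows. The sketch below, on made-up labels, shows one way to keep names and rows aligned by passing the same sorted list as both `labels` and `target_names`.

```python
import pandas as pd
from sklearn.metrics import classification_report

# Toy labels, invented for this sketch
y_true = pd.Series(["Neutral", "Positive", "Negative", "Positive"])
y_pred = pd.Series(["Neutral", "Positive", "Negative", "Negative"])

appearance_order = list(y_true.unique())  # order of first appearance
sorted_order = sorted(y_true.unique())    # order the report uses for its rows

report = classification_report(
    y_true,
    y_pred,
    labels=sorted_order,
    target_names=sorted_order,
    output_dict=True,
    zero_division=0,
)
print(appearance_order, sorted_order)
print(report["Positive"])
```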



In [28]:
# Prediction using Multinomial Naive Bayes Model
mnb = MultinomialNB(alpha=0.2)
mnb.fit(train_X, train_Y)
prediction = mnb.predict(test_X)
# accuracy_score(test_Y, prediction)
# classification_report lists classes in sorted label order, so the display
# names must be sorted too; unique() returns them in order of appearance.
print(classification_report(test_Y, prediction, target_names=sorted(target_names)))

                    precision    recall  f1-score   support

Extremely Negative       0.81      0.07      0.13       592
Extremely Positive       0.77      0.14      0.24       599
          Negative       0.39      0.50      0.43      1041
           Neutral       0.64      0.18      0.28       619
          Positive       0.33      0.75      0.46       947

          accuracy                           0.38      3798
         macro avg       0.59      0.33      0.31      3798
      weighted avg       0.54      0.38      0.34      3798

