## Sentiment Analysis from Tweets
##### (Multiclass sentiment analysis using textual data)

### About the Dataset:
* Dataset used in this project consists of a **collection of tweets** that were **posted during covid-19 pandemic** by users from different locations across the globe.
* Datset **consists of tweets and sentiments** reflected from them, alongwith other information like location, date, username etc.

> *Workflow:*
> * Load necessary **packages and libraries**.
> * **Data Agumention and Organization:** Use data from multiple sources and combine them with proper organisation to increase size of data. Split data into train and test sets.
> * **EDA & Data Visualization:** Analyze, Understand and Explore the Data.
> * **Preprocessing:**
    * **(NLP -> Text Cleaning + Text Vectorization)**
        * **Text cleaning:** Remove emojis, Convert Emoticons to Words, Apply Text Contractions, Correct Abbreviations, Remove Mentions and URLs, Remove Special Characters, Punctuations, HTML tags, Escape Characters and Extra Spaces, Apply Spell Check, Lemmatize/Stem texts.
        * **Text Vectorization:** Convert text data to numerical vectors. (I wil use **Tfidf** vectorization technique, but other methods like Bag of Words or Word embedding can also be used.)
    * **Label Encoding:** Machine Learning models can not understand text labelled categories hence we must convert text labels to numerical values before feeding our data to a ML model.
> * **Training:** Feed processed data to various **Machine Learning/Deep Learning models** and **Test** each model's accuracy using Test data (created during train-test-split).
> * **Create Pipeline:** Choose the **best performing model** and create a **model pipeline**.
> * **Ground Testing:** Use the trained model to predict language from user inputs.

## Load Dependencies
(I prefer loading all the necessary files and libraries at the beginning, makes notebook cleaner!)

In [1]:
# install extra packages (not included in kaggle by default)
!pip install nltk # provides text cleaning and normalizing function
!pip install contractions # should't -> should not
!pip install pyspellchecker  # python spell checker

[0mCollecting contractions
  Downloading contractions-0.1.73-py2.py3-none-any.whl (8.7 kB)
Collecting textsearch>=0.0.21
  Downloading textsearch-0.0.24-py2.py3-none-any.whl (7.6 kB)
Collecting pyahocorasick
  Downloading pyahocorasick-2.0.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (101 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m101.8/101.8 kB[0m [31m5.9 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting anyascii
  Downloading anyascii-0.3.2-py3-none-any.whl (289 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m289.9/289.9 kB[0m [31m19.6 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: pyahocorasick, anyascii, textsearch, contractions
Successfully installed anyascii-0.3.2 contractions-0.1.73 pyahocorasick-2.0.0 textsearch-0.0.24
[0mCollecting pyspellchecker
  Downloading pyspellchecker-0.7.2-py3-none-any.whl (3.4 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m3.4/3.4 MB[0m [31

In [2]:
# Basic Libraries
import os
import pickle
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.notebook import tqdm
import warnings
warnings.filterwarnings("ignore")

# libraries for preprocessing
import re
import contractions
from spellchecker import SpellChecker
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer

import nltk
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re
from nltk.stem import WordNetLemmatizer

# initiating lemmatizer(kaggle specific):
# Define a install path for nltk
if os.environ.get('KAGGLE_KERNEL_RUN_TYPE', ''):
    nltk_path='/kaggle/working'
else:
    nltk_path="{}".format(os.getcwd())

isnltk_installed = os.path.isdir(f'{nltk_path}/nltk_data/corpora/wordnet')

# Install relevent libraries to nltk path
if isnltk_installed:
    nltk.data.path.append(f'{nltk_path}/nltk_data')
else:
    # Make directort name 'nlrk_data' in current work directory '/kaggle/working/'
    !mkdir nltk_data
    # Download neccessaty package as .zip file ('corpora' directory are automatically created)
    nltk.download('wordnet', f"{nltk_path}/nltk_data")
    nltk.download('omw-1.4', f"{nltk_path}/nltk_data/")
    # Unzip .zip file in folder '/kaggle/working/nltk_data/corpora'
    !unzip /kaggle/working/nltk_data/corpora/wordnet.zip -d /kaggle/working/nltk_data/corpora
    !unzip /kaggle/working/nltk_data/corpora/omw-1.4.zip -d /kaggle/working/nltk_data/corpora
    # Add custom location nltk file data path
    nltk.data.path.append(f'{nltk_path}/nltk_data')

# libraries for ML
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC, SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /usr/share/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /kaggle/working/nltk_data...
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /kaggle/working/nltk_data/...
Archive:  /kaggle/working/nltk_data/corpora/wordnet.zip
   creating: /kaggle/working/nltk_data/corpora/wordnet/
  inflating: /kaggle/working/nltk_data/corpora/wordnet/lexnames  
  inflating: /kaggle/working/nltk_data/corpora/wordnet/data.verb  
  inflating: /kaggle/working/nltk_data/corpora/wordnet/index.adv  
  inflating: /kaggle/working/nltk_data/corpora/wordnet/adv.exc  
  inflating: /kaggle/working/nltk_data/corpora/wordnet/index.verb  
  inflating: /kaggle/working/nltk_data/corpora/wordnet/cntlist.rev  
  inflating: /kaggle/working/nltk_data/corpora/wordnet/da

## Load and Organize Data

In [3]:
train_data = pd.read_csv('/kaggle/input/covid-19-nlp-text-classification/Corona_NLP_train.csv',encoding='latin1')
test_data = pd.read_csv('/kaggle/input/covid-19-nlp-text-classification/Corona_NLP_test.csv',encoding='latin1')

train_data.head()

Unnamed: 0,UserName,ScreenName,Location,TweetAt,OriginalTweet,Sentiment
0,3799,48751,London,16-03-2020,@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i...,Neutral
1,3800,48752,UK,16-03-2020,advice Talk to your neighbours family to excha...,Positive
2,3801,48753,Vagabonds,16-03-2020,Coronavirus Australia: Woolworths to give elde...,Positive
3,3802,48754,,16-03-2020,My food stock is not the only one which is emp...,Positive
4,3803,48755,,16-03-2020,"Me, ready to go at supermarket during the #COV...",Extremely Negative


In [4]:
# keep only useful features
train_data = pd.concat((train_data.OriginalTweet,train_data.Sentiment), axis=1, keys = ['text','sentiment'])
test_data = pd.concat((test_data.OriginalTweet,test_data.Sentiment), axis=1, keys = ['text','sentiment'])

display(train_data.head(3))
display(test_data.head(3))

Unnamed: 0,text,sentiment
0,@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i...,Neutral
1,advice Talk to your neighbours family to excha...,Positive
2,Coronavirus Australia: Woolworths to give elde...,Positive


Unnamed: 0,text,sentiment
0,TRENDING: New Yorkers encounter empty supermar...,Extremely Negative
1,When I couldn't find hand sanitizer at Fred Me...,Positive
2,Find out how you can protect yourself and love...,Extremely Positive


In [5]:
train_data.shape, test_data.shape

((41157, 2), (3798, 2))

In [6]:
train_overview = train_data.isnull().sum()
test_overview = test_data.isnull().sum()
overview = pd.concat((train_overview,test_overview), axis=1, keys=['train_nulls','test_nulls'])
overview

Unnamed: 0,train_nulls,test_nulls
text,0,0
sentiment,0,0


In [7]:
train_data.sentiment.value_counts()

Positive              11422
Negative               9917
Neutral                7713
Extremely Positive     6624
Extremely Negative     5481
Name: sentiment, dtype: int64

In [8]:
# grouping labels:
def group(data):
    for i in range(len(data)):
        if data.sentiment.iloc[i]=='Extremely Positive':
            data.sentiment.iloc[i]='Positive'
        if data.sentiment.iloc[i]=='Extremely Negative':
            data.sentiment.iloc[i]='Negative'
    return data

train_data = group(train_data)
test_data = group(test_data)

In [9]:
train_overview = train_data.isnull().sum()
test_overview = test_data.isnull().sum()
overview = pd.concat((train_overview,test_overview), axis=1, keys=['train_nulls','test_nulls'])
overview

Unnamed: 0,train_nulls,test_nulls
text,0,0
sentiment,0,0


In [10]:
# orgnaize data
x_train = train_data.text.values
y_train = train_data.sentiment.values

x_test = test_data.text.values
y_test = test_data.sentiment.values

In [11]:
# checking a text before cleaning:
x_train[47]

'People posting and sharing photos of of half to completely empty shelves calling those people "dumb" or "idiots." All while shopping at the grocery store. lol\r\r\n\r\r\n#coronavirus #COVID19'

## Textual Preprocessing
* **Hashtag**, **Mention** and **URL** Removal 
* Punctuation removal:
    * **Apostrophe**
    * **Special Characters**
    * **Numbers**
* **Formatting Symbols** and **Escape character** removal
* **Lowercasing**
* **Redundant Spaces** removal

In [12]:
# function for NLP:
def nlp(text):
  def remove_emoji(text):
    emoji_pattern = re.compile(
      '['
      u'\U0001F600-\U0001F64F'  # emoticons
      u'\U0001F300-\U0001F5FF'  # symbols & pictographs
      u'\U0001F680-\U0001F6FF'  # transport & map symbols
      u'\U0001F1E0-\U0001F1FF'  # flags
      u'\U00002702-\U000027B0'
      u'\U000024C2-\U0001F251'
      ']+',
      flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)
  def lemmatize(text):
    lemmatizer = WordNetLemmatizer()
    words = text.split()
    words = ' '.join([lemmatizer.lemmatize(word) for word in words if word not in stopwords.words('english')])
    return words
  text = remove_emoji(text)
  text = str(text).lower()
  text = re.sub(r'https?://\S+|www\.\S+', '', text)
  text = re.sub(r'RT[\s]+', '', text)
  text = re.sub(r'@\S+', '', text)
  text = re.sub(r'#', '', text)
  text = re.sub(r'\[', '', text)
  text = re.sub(r'\]', '', text)
  text = re.sub('â\\x92', "'", text)
  text = re.sub('â\S+', '', text)
  text = re.sub('\.+', '.', text)
  text = re.sub('&amp;', 'and', text)
  text = re.sub("let's", 'let us', text)
  text = re.sub("'s", ' is', text)
  text = re.sub("'re", ' are', text)
  text = re.sub("ain't", 'am not', text)
  text = re.sub("won't", 'will not', text)
  text = re.sub("n't", ' not', text)
  text = re.sub("'ve", ' have', text)
  text = re.sub("y'all", "you all", text)
  text = re.sub("'ll", ' will', text)
  text = re.sub("i'd", 'i would', text)
  text = re.sub("i'm", 'i am', text)
  text = re.sub(r"[^a-z<>!?\s]+", '', text)
  text = re.sub('covid\S*', 'coronavirus', text)
  text = re.sub('corona\S*', 'coronavirus', text)
  text = re.sub(r'\s+', ' ', text)
  text = lemmatize(text)
  return text

In [13]:
# cleaning texts
x_train_clean = np.array([nlp(text) for text in tqdm(x_train, desc='Progress')])
x_test_clean = np.array([nlp(sent) for sent in tqdm(x_test, desc='Progress')])

Progress:   0%|          | 0/41157 [00:00<?, ?it/s]

Progress:   0%|          | 0/3798 [00:00<?, ?it/s]

In [14]:
# checking tweet after cleaning:
x_train_clean[47]

'people posting sharing photo half completely empty shelf calling people dumb idiot shopping grocery store lol coronavirus coronavirus'

## Vectorization
##### (We are using TF-IDF Vectorization method)

In [15]:
tfidf = TfidfVectorizer()
tfidf.fit(x_train)
x_train_ready = tfidf.transform(x_train_clean)
x_test_ready = tfidf.transform(x_test_clean)

x_train_ready, x_test_ready

(<41157x80424 sparse matrix of type '<class 'numpy.float64'>'
 	with 653364 stored elements in Compressed Sparse Row format>,
 <3798x80424 sparse matrix of type '<class 'numpy.float64'>'
 	with 61710 stored elements in Compressed Sparse Row format>)

## Label Encoding

In [16]:
enc = LabelEncoder()
enc.fit(y_train)
y_train_ready = enc.transform(y_train)
y_test_ready = enc.transform(y_test)

In [17]:
enc.classes_

array(['Negative', 'Neutral', 'Positive'], dtype=object)

## Hyperparameter Tuning and Model Selection
#### (Using GridSearchCV)

In [18]:
# function for fitting tuning and result generation:

def result_grid(x_train, y_train, x_test, y_test):
    model_param_grid={
        'Logistic Regression':(LogisticRegression(max_iter=10000),{'penalty': ['l1','l2'],'solver': ['liblinear']}),
        'KNN':(KNeighborsClassifier(),{'n_neighbors': [3, 5, 7]}),
        'Naive Bayes':(MultinomialNB(),{'alpha': [0.1, 1.0, 10.0], 'fit_prior': [True,False]}),
        'SVM':(LinearSVC(max_iter=10000),{'C': [0.1, 1.0, 10.0], 'penalty':['l1','l2']}),
#         'Decision Tree':(DecisionTreeClassifier(),{'criterion': ['gini','entropy'], 'splitter':['best','random']})
    }
    
    results=[]
    
    for name, (model,parameters) in model_param_grid.items():
        grid=GridSearchCV(model,parameters)
        
        print(f"Tuning for {name}")
        
        st=time.time()
        
        grid.fit(x_train,y_train) # fitting in GridSearchCV
        y_pred=grid.predict(x_test) # predicts using best hyperparameters
        acc=accuracy_score(y_test,y_pred) # best accuracy obtained
        
        en=time.time()
        
        net_time=en-st
        
        # printing best parameters:
        print(f"Best hyperparameters for {name}: {grid.best_params_}")
        print(f"Best obtined score for {name}: {grid.best_score_*100:.3f}%")
        print(f"Running time for {name}:{net_time:.3f}s")
        
        results.append([name,grid.best_params_,grid.best_score_*100,net_time])
        print('-'*100)
        print()
    
    result_df=pd.DataFrame(results, columns=['model_name','best_parameters','best_test_score','running_time'])
    result_df=result_df.style.highlight_max(subset=['best_test_score'], color = 'green')
    result_df=result_df.highlight_min(subset=['best_test_score'], color = 'red')
    result_df=result_df.highlight_max(subset=['running_time'], color = 'red')
    result_df=result_df.highlight_min(subset=['running_time'], color = 'green')
    return result_df

In [19]:
%%time
result_df=result_grid(x_train_ready,y_train_ready,x_test_ready,y_test_ready)
result_df

Tuning for Logistic Regression
Best hyperparameters for Logistic Regression: {'penalty': 'l1', 'solver': 'liblinear'}
Best obtined score for Logistic Regression: 81.843%
Running time for Logistic Regression:18.974s
----------------------------------------------------------------------------------------------------

Tuning for KNN
Best hyperparameters for KNN: {'n_neighbors': 3}
Best obtined score for KNN: 20.842%
Running time for KNN:194.095s
----------------------------------------------------------------------------------------------------

Tuning for Naive Bayes
Best hyperparameters for Naive Bayes: {'alpha': 0.1, 'fit_prior': False}
Best obtined score for Naive Bayes: 65.977%
Running time for Naive Bayes:0.696s
----------------------------------------------------------------------------------------------------

Tuning for SVM
Best hyperparameters for SVM: {'C': 1.0, 'penalty': 'l2'}
Best obtined score for SVM: 79.299%
Running time for SVM:40.497s
-----------------------------------

Unnamed: 0,model_name,best_parameters,best_test_score,running_time
0,Logistic Regression,"{'penalty': 'l1', 'solver': 'liblinear'}",81.842715,18.973832
1,KNN,{'n_neighbors': 3},20.842153,194.095217
2,Naive Bayes,"{'alpha': 0.1, 'fit_prior': False}",65.976603,0.696015
3,SVM,"{'C': 1.0, 'penalty': 'l2'}",79.29878,40.496924


***Logistic Regression performed best for our data.***

## Create pipeline and save model

In [20]:
best_clf = LogisticRegression(penalty='l1', solver='liblinear')
best_clf.fit(x_train_ready,y_train_ready)
best_clf.score(x_test_ready,y_test_ready)

0.8120063191153238

In [21]:
model = Pipeline([('vectorizer',tfidf),('classifier',best_clf)])
model.score(x_test_clean,y_test_ready)

0.8120063191153238

In [22]:
# save model and encoder:
pickle.dump(model, open('/kaggle/working/model.pkl','wb'))
pickle.dump(enc, open('/kaggle/working/encoder.pkl','wb'))

## Ground testing

In [23]:
def predict(text):
    pred = model.predict([nlp(text)])
    return enc.inverse_transform(pred)[0]

In [24]:
predict('I am glad that we are safe!'), predict('you dumbass!')

('Positive', 'Neutral')