# **Project Name**    - Coronavirus Tweet Sentiment Analysis


##### **Project Type**    - Classification
##### **Contribution**    - Individual
##### **Project by Mohd Ravi** 


# **Project Summary -**

In this classification project, our aim is to predict the correct sentiment of a tweet from its text. The notebook covers the following steps:
* Dataset Loading
* Dataset Information
* Checking Duplicates
* Understanding Variables
* Data Wrangling
* Storytelling with several kinds of charts, such as bar plots, a heatmap, a pie chart, word clouds, and a pair plot
* Feature Engineering, such as handling missing values
* Text Preprocessing: removing user names, expanding contractions, lower casing, removing punctuation, removing URLs and digits, removing stopwords and white spaces, removing accents, removing hashtags, tokenization, text normalization, POS tagging, and text vectorization
* Data transformation
* Data Splitting
* Checking for imbalanced classes
* Training different models:

1. Logistic Regression

2. Stochastic Gradient Descent

3. Multinomial Naive Bayes

# **GitHub Link -**

https://github.com/mohdraavi/Coronavirus_Tweet_Sentiment_Analysis

# **Problem Statement**


**Building a classification model to predict the sentiment of COVID-19 tweets.**


**Determine the sentiment (Negative, Positive, or Neutral) of tweets posted during COVID-19 (2020).**

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import necessary libraries
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC
import numpy as np
import pandas as pd
import math  # `from numpy import math` relied on an alias that newer NumPy releases no longer provide
from numpy import loadtxt
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from matplotlib import rcParams
from datetime import *
import warnings

# Ignore warning messages
warnings.filterwarnings('ignore')

# Import stemming and lemmatizing libraries from NLTK
from nltk.stem.snowball import SnowballStemmer
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import WhitespaceTokenizer

# Import vectorization libraries from scikit-learn
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Import image-related libraries
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

# Import evaluation metrics from scikit-learn
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

# Import logistic regression model from scikit-learn
from sklearn.linear_model import LogisticRegression


import string
import nltk
from nltk.corpus import stopwords

In [None]:
#Mounting Drive
from google.colab import drive
drive.mount('/content/drive')

### Dataset Loading

In [None]:
!pip install unidecode
import unidecode


In [None]:
# Load Dataset
df = pd.read_csv('/content/drive/MyDrive/Coronavirus Tweets.csv',encoding='latin-1')

### Dataset First View

In [None]:
# Dataset First Look
df.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print(f'Number of Rows : {df.shape[0]}')
print(f'Number of columns : {df.shape[1]}')

### Dataset Information

In [None]:
# Dataset Info
df.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
# Visualizing the missing values
sns.heatmap(df.isnull(),cmap='plasma',annot=False)

### What did you know about your dataset?

* There are 41157 rows and 6 columns.
* The Location feature has null values.
* There are no duplicate rows, i.e. all 41157 rows are unique.
* All features have the object data type except UserName and ScreenName.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df.columns

In [None]:
# Dataset Describe
df.describe(include='all')

### Variables Description 

* UserName - CodedUsername
* ScreenName - CodedScreenname
* Location - Region of origin
* TweetAt - Tweet Timing
* OriginalTweet - First tweet in the thread
* Sentiment - Sentiment of the tweet(target)

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
df.nunique()

In [None]:
#Checking unique values in Sentiment
df['Sentiment'].unique()

In [None]:
df['TweetAt'].head()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.
#copy original dataset just to preserve original
data = df.copy()

In [None]:
data.rename(columns={'TweetAt':'Date'},inplace=True)

In [None]:
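#converting Date variable into datetime type
#assumption: TweetAt is stored day-first (DD-MM-YYYY); confirm with df['TweetAt'].head()
data['Date'] = pd.to_datetime(data['Date'], dayfirst=True)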

In [None]:
#extracting the month and storing it in a new column
data['month'] = pd.DatetimeIndex(data['Date']).month

In [None]:
#in which month the most number of tweet was made
month_df=pd.DataFrame(data.groupby(['month'])['OriginalTweet'].count().reset_index().sort_values(['OriginalTweet'],ascending=False).rename(columns={'OriginalTweet':'Count'}))
month_df

In [None]:
locatoin_count_df=pd.DataFrame(df.groupby(['Location'])['OriginalTweet'].count().reset_index().sort_values(by='OriginalTweet',ascending=False).rename(columns={'OriginalTweet':'Count'}))
locatoin_count_df

In [None]:
# rename_axis/reset_index(name=...) yields the same 'Sentiment'/'Count' columns across pandas versions
sentiment_count_df = data['Sentiment'].value_counts().rename_axis('Sentiment').reset_index(name='Count')
sentiment_count_df

### What all manipulations have you done and insights you found?

* Converted the Date column to the datetime type.
* Checked the month-wise tweet count.
* Checked which locations produced the most tweets.
* Counted the number of tweets in each sentiment class.
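
The month-wise counts depend on how the `TweetAt` strings are parsed. A minimal sketch, assuming the dates are stored day-first (for example `16-03-2020`), which should be confirmed against `df['TweetAt'].head()`; the sample values below are made up for illustration:

```python
import pandas as pd

# Hypothetical sample in DD-MM-YYYY form (an assumption about the dataset's format)
sample = pd.Series(["16-03-2020", "02-04-2020", "13-04-2020"])

# An explicit format removes any day/month ambiguity
parsed = pd.to_datetime(sample, format="%d-%m-%Y")

# Tweets per month, most frequent first
month_counts = parsed.dt.month.value_counts()
print(month_counts)
```

Without an explicit format or `dayfirst=True`, a value such as `02-04-2020` may be read as 4 February rather than 2 April, which would distort the monthly counts.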

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1 Most Tweeted Months

In [None]:
# Chart - 1 visualization code
#checking which month is most tweeted
plt.figure(figsize=(12,6))
sns.barplot(data=month_df,x='month',y='Count')
plt.show()

##### 1. Why did you pick the specific chart?

To check which month has the most tweets.

##### 2. What is/are the insight(s) found from the chart?

Month 3 (March) has the most tweets.

#### Chart - 2 

In [None]:
#countplot of sentiment in each month
plt.figure(figsize=(12,7))
sns.countplot(data=data,x='month',hue='Sentiment')
plt.show()

##### 1. Why did you pick the specific chart?

To compare the number of tweets per sentiment in each month.

#### Chart - 3

In [None]:
# Chart - 3 visualization code
#from which location most no. of tweets were made
plt.figure(figsize=(16,7))
sns.barplot(data=locatoin_count_df[:10],x='Location',y='Count')
plt.title('Top 10 locations by number of tweets')
plt.xlabel('Location')
plt.ylabel('No. of tweets')
plt.show()

##### 1. Why did you pick the specific chart?

To check which locations posted the most tweets.

##### 2. What is/are the insight(s) found from the chart?

Most tweets were posted from London, United States, and London England.
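
The top locations include both London and London England, which suggests that the free-text `Location` field splits one place across several spellings. A rough sketch of grouping such variants before counting; the `ALIASES` mapping and the sample values are hypothetical and would need to be built from the real data:

```python
from collections import Counter

# Hypothetical alias table: maps lower-cased variants to one canonical name
ALIASES = {
    "london, england": "London",
    "london, uk": "London",
    "usa": "United States",
    "united states": "United States",
}

def normalize_location(raw):
    """Return a canonical location name, or the trimmed input if no alias is known."""
    key = raw.strip().lower()
    return ALIASES.get(key, raw.strip())

# Made-up sample values for illustration
locations = ["London", "London, England", "USA", "United States", " london, uk "]
counts = Counter(normalize_location(loc) for loc in locations)
print(counts.most_common())
```

Applied with `data['Location'].apply(normalize_location)`, a table like this would merge the variants before the top-10 chart is drawn.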

#### Chart - 4

In [None]:
# Chart - 4 visualization code
#plotting pie chart for sentiments
d = sentiment_count_df['Count'].tolist()
label = sentiment_count_df['Sentiment'].tolist()
plt.pie(d,labels=label,colors = sns.color_palette('pastel'),autopct='%.0f%%')
plt.show()

##### 1. Why did you pick the specific chart?

To see which sentiment classes are the most common.

##### 2. What is/are the insight(s) found from the chart?

The pie chart shows that Positive is the most common sentiment, followed by Negative.

#### Chart - 5

In [None]:
# Chart - 5 visualization code
#locations with most negative tweets
neg_df = data[(data['Sentiment']=='Negative') | (data['Sentiment']=='Extremely Negative')]
plt.figure(figsize=(12,6))
neg_df['Location'].value_counts().head(20).plot(kind='bar',color = sns.color_palette("husl"),width=0.7)
plt.title("Top 20 locations with the most negative tweets", fontsize = 10)
plt.xlabel('Location', fontsize = 12)
plt.ylabel('Tweets', fontsize = 12)
plt.xticks(rotation=60)
plt.show()

##### 1. Why did you pick the specific chart?

To see which locations posted the most negative tweets.

##### 2. What is/are the insight(s) found from the chart?

The most negative tweets were posted from London.

#### Chart - 6

In [None]:
# Chart - 6 visualization code
#locations with most positive tweets
Pos_df = data[(data['Sentiment']=='Positive') | (data['Sentiment']=='Extremely Positive')]
plt.figure(figsize=(12,6))
Pos_df['Location'].value_counts().head(20).plot(kind='bar',color = sns.color_palette("husl"),width=0.7)
plt.title("Top 20 locations with the most positive tweets", fontsize = 10)
plt.xlabel('Location', fontsize = 12)
plt.ylabel('Tweets', fontsize = 12)
plt.xticks(rotation=60)
plt.show()

##### 1. Why did you pick the specific chart?

To see which locations posted the most positive tweets.

##### 2. What is/are the insight(s) found from the chart?

The most positive tweets were posted from the United States.

#### Chart - 7

In [None]:
#Function for Extract #tags 
def ext_hashtag(t):
  return [i for i in t.split() if '#' in i]


In [None]:
#apply above function 
data['hastag']=data['OriginalTweet'].apply(ext_hashtag)


In [None]:
# Count every hashtag with collections.Counter
from collections import Counter
d = Counter(data.hastag.sum())
hashtags= pd.DataFrame([d]).T.reset_index()


In [None]:
#rename COlumn name
hashtags.rename(columns = {'index':'Hashtags',0:'Count'}, inplace = True)

In [None]:
#sort data frame
top_hashtags=hashtags.sort_values(by='Count',ascending=False).reset_index(drop=True)
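
Summing a column of lists, as done above with `data.hastag.sum()`, concatenates the lists one by one, which can get slow on larger datasets. A minimal alternative sketch that feeds a `Counter` from a flattened iterator; the regular expression and the sample tweets are illustrative assumptions, and unlike the `split()`-based extractor it drops trailing punctuation and merges case variants:

```python
import re
from collections import Counter
from itertools import chain

HASHTAG_RE = re.compile(r"#\w+")  # assumed pattern: '#' followed by word characters

def extract_hashtags(text):
    """Return all hashtags in a tweet, lower-cased so variants count together."""
    return [tag.lower() for tag in HASHTAG_RE.findall(text)]

# Made-up tweets for illustration
tweets = [
    "Stock up safely #coronavirus #COVID19",
    "Empty shelves again... #Coronavirus, #toiletpaper",
    "No hashtags here",
]

hashtag_counts = Counter(chain.from_iterable(extract_hashtags(t) for t in tweets))
print(hashtag_counts.most_common(2))
```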

In [None]:
# Bar plot of Most used hashtags
plt.rcParams["figure.figsize"]=(10,6)

ax=top_hashtags.head(20).plot(kind='bar',
                           x='Hashtags',
                           color = sns.color_palette("husl"),
                           width=0.8
                           )
plt.xticks(rotation=70)
plt.title(" Top 20 most used #tags ", fontsize = 10)
plt.xlabel('#tags', fontsize = 12)
plt.ylabel('Count ', fontsize = 12)


#Patches Height 
for p in ax.patches:
  x = p.get_x() + p.get_width() / 2 - 0.4
  y = p.get_y() + p.get_height() 
  ax.annotate(p.get_height(),(x,y) ,size = 8)


#### Chart - 8

In [None]:
# Chart - 8 visualization code
freq_dict=dict(zip(hashtags['Hashtags'].tolist(),hashtags['Count'].tolist()))
word = WordCloud(width=900,height=400,max_words=200,background_color='black').generate_from_frequencies(freq_dict)
plt.figure(figsize=(14, 12))
plt.imshow(word, interpolation='bilinear')
plt.axis('off')
plt.show()

#### Chart - 9

In [None]:
# Chart - 9 visualization code
#Most used user name in tweets
def ext_name(text):
  return [i for i in text.split() if '@' in i ]

In [None]:
data['User_Name']=data['OriginalTweet'].apply(ext_name)

In [None]:
# Count every username from Counter dependency 

from collections import Counter
cnt= Counter(data.User_Name.sum())

#dataframe of username and value count
usrnam= pd.DataFrame([cnt]).T.reset_index()


In [None]:
#rename Column name
usrnam.rename(columns = {'index':'User_name',0:'Count'}, inplace = True)

In [None]:
#sorted by descending order
top_username=usrnam.sort_values(by='Count',ascending=False).reset_index(drop=True)

In [None]:
# Bar plot of Most used @username

plt.rcParams["figure.figsize"]=(11,6)

ax=top_username.head(20).plot(kind='bar',
                           x='User_name',
                           color = sns.color_palette("husl"),
                           width=0.8
                           )
plt.xticks(rotation=70)

#Annotate each bar with its height
for p in ax.patches:
  x = p.get_x() + p.get_width() / 2 - 0.4
  y = p.get_y() + p.get_height() 
  ax.annotate(p.get_height(),(x,y) ,size = 8)


#### Chart - 10

In [None]:
#Word cloud of @usernames; more frequently mentioned names are drawn larger.

# dictionary of usernames and their counts
wcloud_data_user = dict(zip(usrnam['User_name'].tolist(), usrnam['Count'].tolist()))

# generate image
wcloud = WordCloud(width=800, height=400, max_words=200,background_color = 'black').generate_from_frequencies(wcloud_data_user)
plt.figure(figsize=(14, 12))
plt.imshow(wcloud, interpolation='bilinear')
plt.axis('off')
plt.show()


#### Chart - 11 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
plt.figure(figsize=(12,10))   
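# numeric_only=True skips the text and list columns, which newer pandas versions refuse to correlate
sns.heatmap(data.corr(numeric_only=True),cmap="Spectral", cbar_kws={'shrink': .6}, square=True, annot=True, fmt='.2f', linewidths=0.8)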
plt.show()

#### Chart - 12 - Pair Plot 

In [None]:
# Pair Plot visualization code
sns.pairplot(data)

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
data.isnull().sum()

In [None]:
data['Location'] = data['Location'].ffill(limit=1)

In [None]:
data.Location.isnull().sum()

In [None]:
data['Location'] = data['Location'].bfill(limit=1)
data.Location.isnull().sum()

In [None]:
data['Location'].mode()

In [None]:
# fill the remaining NaN values with the most frequent value
data['Location'] = data['Location'].fillna('London')
data['Location'].isnull().sum()

#### What all missing value imputation techniques have you used and why did you use those techniques?


We used ffill, bfill, and the mode of the column because Location is a categorical column.
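
The three-step imputation described above can be traced on a tiny example. This is only a sketch of the mechanics on made-up values; forward and backward filling assume that neighbouring rows are related, which may not hold for unrelated tweets:

```python
import pandas as pd

# Made-up sample with a gap of four missing values
loc = pd.Series(["London", None, None, None, None, "Paris", "London"])

# Step 1: copy the previous value forward, but only into the first missing slot of a gap
loc = loc.ffill(limit=1)
# Step 2: copy the next value backward, again only one slot per gap
loc = loc.bfill(limit=1)
# Step 3: fill whatever is left with the most frequent value
most_frequent = loc.mode()[0]
loc = loc.fillna(most_frequent)

print(loc.tolist())
```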

### 2. Textual Data Preprocessing 
(It's mandatory for textual dataset i.e., NLP, Sentiment Analysis, Text Clustering etc.)

#### 1. Removing User Names

In [None]:
def remove_name(text):
  return " ".join([i for i in text.split() if '@' not in i])

In [None]:
data['Tweets'] = data['OriginalTweet'].apply(remove_name)

#### 2. Expand Contraction

In [None]:
!pip install contractions

In [None]:
#Expand Contraction
import contractions
data['Tweets'] = data['Tweets'].apply(lambda x : contractions.fix(x))

#### 3. Lower Casing

In [None]:
# Lower Casing
def lower(text):
  return " ".join([i.lower() for i in text.split()])

In [None]:
data['Tweets'] = data['Tweets'].apply(lower)

#### 4. Removing Punctuations

In [None]:
# Remove Punctuations
def remove_punc(text):
  # the parameter is named `text` so it does not shadow the global `data` DataFrame
  translator = str.maketrans('', '', string.punctuation)
  return text.translate(translator)

In [None]:
data['Tweets'] = data['Tweets'].apply(remove_punc)

#### 5. Removing URLs & Removing Digits

In [None]:
# Remove URLs and digits
import re
#remove urls
data['Tweets'] = data['Tweets'].apply(lambda x :re.sub(r'http\S+',"",x))
#removing digits
data['Tweets'] = data['Tweets'].apply(lambda x : re.sub(r"\d+","",x))

#### 6. Removing Stopwords & Removing White spaces

In [None]:
nltk.download('stopwords')

In [None]:
sw = stopwords.words('english')

In [None]:
# Remove Stopwords
def remvove_stopwords(text):
  '''a function for removing the stopword'''
  # removing the stop words and lowercasing the selected words
  text = [word.lower() for word in text.split() if word.lower() not in sw]
  # joining the list of words with space separator
  return " ".join(text)

In [None]:
data['Tweets'] = data['Tweets'].apply(remvove_stopwords)

In [None]:
# Remove White spaces
def space_rem(text):
    text = [word.strip() for word in text.split()]
    return " ".join(text)

In [None]:
data['Tweets'] = data['Tweets'].apply(space_rem)

#### 7. Removing Accents from Words

In [None]:
#function for remove accents from words
def remove_accents(text):
  return " ".join([unidecode.unidecode(i) for i in text.split()])
  

In [None]:
data['Tweets'] = data['Tweets'].apply(remove_accents)


#### 8. Remove Hashtags

In [None]:
#function for remove hashtags
def remove_hash(text):
  l=[]
  for i in text.split():
    if '#' not in i:
      l.append(i)
    else:
      l.append(i.replace("#", ""))
  return ' '.join(l)   

In [None]:
data['Tweets'] = data['Tweets'].apply(remove_hash)
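
The cleaning steps so far can also be combined into a single function. A minimal sketch using only the standard library, with a tiny hypothetical stopword set standing in for NLTK's English list; it removes URLs and mentions before punctuation, an ordering chosen here (not taken from the notebook) so that `://` and `@` are still intact when the patterns run:

```python
import re
import string

# Tiny illustrative stopword list (assumption); the notebook uses NLTK's English list
STOPWORDS = {"the", "is", "a", "to", "and", "of", "in", "at"}
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_tweet(text):
    """Apply URL, mention, case, punctuation, digit and stopword cleaning in a fixed order."""
    text = re.sub(r"http\S+", " ", text)        # URLs first, while '://' is still intact
    text = re.sub(r"@\w+", " ", text)           # user mentions
    text = text.lower().translate(PUNCT_TABLE)  # lower-case, then drop punctuation (incl. '#')
    text = re.sub(r"\d+", " ", text)            # digits
    words = [w for w in text.split() if w not in STOPWORDS]
    return " ".join(words)                      # split/join also collapses extra white space

example = "@Tesco The shelves are EMPTY at 9am!! #coronavirus https://t.co/abc123"
print(clean_tweet(example))
```

As with the notebook's own digit removal, a token such as `9am` leaves the fragment `am` behind.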


#### 9. Tokenization

In [None]:
nltk.download('punkt')

In [None]:
# Tokenization
data['Tweets'] = data['Tweets'].apply(nltk.word_tokenize)

#### 10. Text Normalization

In [None]:
nltk.download('wordnet')

In [None]:
# Normalizing Text (i.e., Stemming, Lemmatization etc.)
from nltk.stem import WordNetLemmatizer

# Create a lemmatizer
lemmatizer = WordNetLemmatizer()

def lemmatize_tokens(tokens):
    lemmatized_tokens = " ".join([lemmatizer.lemmatize(token) for token in tokens])
    return lemmatized_tokens

# Lemmatize the 'Tweets' column
data['Tweets'] = data['Tweets'].apply(lemmatize_tokens)


#### 11. Part of speech tagging

In [None]:
# POS Tagging
#nltk.download('averaged_perceptron_tagger')
#def pos(text):
#  text=nltk.pos_tag(text)
#  return text

In [None]:
#POS tagging is skipped here because it did not give good results

In [None]:
#data['text']=data['Tweets'].apply(pos)

In [None]:
# collect the cleaned tweets into a plain list
corpus = data['Tweets'].tolist()

In [None]:
corpus[:10]

In [None]:
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS
import numpy as npy
from PIL import Image
import requests
import io
response = requests.get("https://res.cloudinary.com/maxie/image/upload/v1617197755/TEMP/covid_ywd7ph.jpg")
image_bytes = io.BytesIO(response.content)
dataset = " ".join(corpus)
def create_word_cloud(string):

    maskArray = npy.array(Image.open(image_bytes))
    cloud = WordCloud(background_color = "black", max_words = 150, mask = maskArray, stopwords = set(STOPWORDS),contour_width=1, contour_color='#333')
    cloud.generate(string)
#     cloud.to_file("wordCloud.png")
    return cloud
dataset = dataset.lower()
wordcloud=create_word_cloud(dataset)
plt.figure(figsize=[20,10])
plt.imshow(wordcloud) # image show
plt.axis('off') # to off the axis of x and y
plt.show()

#### 12. Text Vectorization

In [None]:
from sklearn.feature_extraction.text import CountVectorizer


In [None]:
# Vectorizing Text
count_vectorizer = CountVectorizer()
count_vectorizer.fit_transform(data['Tweets'])

### 3. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features
# already have done

#### 2. Feature Selection

In [None]:
# Select your features wisely to avoid overfitting
data.drop(['UserName','ScreenName'],axis =1,inplace=True)

In [None]:
data.sample(5)

In [None]:
data.columns.tolist()

### 5. Data Transformation

#### Do you think that your data needs to be transformed? If yes, which transformation have you used. Explain Why?

In [None]:
# Transform Your data
def senti_anal(val):
  if (val=='Positive') | (val=='Extremely Positive'):  
    return 'Positive'
  elif (val=='Negative') | (val=='Extremely Negative'):
    return 'Negative'
  else:
    return 'Neutral' 

In [None]:
data['Sentiment']=data['Sentiment'].apply(senti_anal)

In [None]:
data['Sentiment'].unique()

### 6. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
x=data['Tweets']
y=data['Sentiment']

In [None]:
from sklearn.model_selection import train_test_split

x_train,x_test,y_train,y_test=train_test_split(x,y ,test_size=0.2, random_state = 42, stratify = y)
print(x_train.shape)                                   
print(x_test.shape )

##### What data splitting ratio have you used and why? 

We used an 80:20 split, a commonly used ratio that keeps most of the data for training while leaving a reasonably sized test set.

That is 80% of the dataset goes into the training set and 20% of the dataset goes into the testing set.
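
The split also passes `stratify = y`, which keeps the class proportions the same in both parts. A small sketch on made-up labels that shows the effect:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 50 Positive, 30 Negative, 20 Neutral
labels = ["Positive"] * 50 + ["Negative"] * 30 + ["Neutral"] * 20
texts = [f"tweet {i}" for i in range(len(labels))]

x_tr, x_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# Both parts keep the 50/30/20 ratio of the full data
print(Counter(y_tr))
print(Counter(y_te))
```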

### 7. Handling Imbalanced Dataset

In [None]:
# Handling Imbalanced Dataset (If needed)
plt.figure(figsize=(6,6))

data['Sentiment'].value_counts().plot(kind='pie',
                                         fontsize=15,
                                         autopct="%0.1f%%",
                                         labels=data['Sentiment'].value_counts().index,
                                         explode=[0.01,0.04,0.09],
                                         colors = sns.color_palette("husl"),
                                         shadow=True
                                         )
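
The pie chart only inspects the class balance; the notebook does not rebalance the data. If the imbalance turned out to matter, one common option is to weight classes inversely to their frequency, which is what scikit-learn's `class_weight='balanced'` setting does using `n_samples / (n_classes * class_count)`. A small sketch of that formula on made-up counts:

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Made-up label distribution for illustration
labels = ["Positive"] * 50 + ["Negative"] * 30 + ["Neutral"] * 20
weights = balanced_class_weights(labels)
print(weights)
```

Both `LogisticRegression` and `SGDClassifier` accept `class_weight='balanced'`, so the weights do not have to be computed by hand in practice.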

In [None]:
count_vectorizer = CountVectorizer()
#fit value
x_train_vectorized=count_vectorizer.fit_transform(x_train)
x_test_vectorized =count_vectorizer.transform(x_test)
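
The vectorizer is fitted on the training text only and then reused on the test text, so the test set cannot influence the vocabulary. A minimal sketch on made-up, already-cleaned tweets that shows what the resulting count matrix looks like:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up, already-cleaned tweets
train_docs = ["empty shelf supermarket", "supermarket open today"]
test_docs = ["supermarket shelf restocked"]

vec = CountVectorizer()
train_matrix = vec.fit_transform(train_docs)  # learns the vocabulary from training text only
test_matrix = vec.transform(test_docs)        # reuses it; unseen words are ignored

print(sorted(vec.vocabulary_))
print(test_matrix.toarray())
```

The word `restocked` never appears in the training documents, so it has no column and is dropped from the test vector.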

## ***7. ML Model Implementation***

### ML Model - 1 Logistic Regression

In [None]:
def mod_evaluat(model, true, predicted):
  
  """Print train/test accuracy and the classification report for a model
     (used as output display)."""

  train_accuracy = model.score(x_train_vectorized, y_train)
  test_accuracy = accuracy_score(true, predicted)
  # classification_report expects (y_true, y_pred); swapping them swaps precision and recall
  report = classification_report(true, predicted)

  print(model,'\n')
  print('Train Accuracy: ',round(train_accuracy*100, 2),'%')
  print('Test Accuracy: ',round(test_accuracy*100, 2),'%')
  print('Model Report: \n', report)
  

In [None]:
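def conf_matrix(model, true, predicted):
  print('-'*40)
  print('Confusion Matrix: \n')
  # use one sorted label list for both the matrix and the axis ticks so they line up
  labels = sorted(y.unique())
  cm = confusion_matrix(true, predicted, labels=labels)
  # normalize each row so it sums to 1
  cm = cm / np.sum(cm, axis = 1)[:,None]
  sns.set(rc={'figure.figsize':[5,4]})
  sns.heatmap(cm, xticklabels = labels,
            yticklabels = labels, 
            annot=True, 
            cmap = 'YlGnBu')
  plt.xlabel('Predicted')
  plt.ylabel('Actual')
  plt.show()
  plt.pause(0.05)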
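
The heatmap shows a row-normalized confusion matrix. A short sketch on made-up labels of what that normalization computes; note that `confusion_matrix` orders rows and columns by sorted label unless a `labels` list is given, so the tick labels need to follow the same order:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration
y_true = ["Negative", "Negative", "Neutral", "Positive", "Positive", "Positive"]
y_pred = ["Negative", "Positive", "Neutral", "Positive", "Positive", "Negative"]

# Passing labels explicitly fixes the row/column order used for the axes
labels = ["Negative", "Neutral", "Positive"]
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Divide each row by its total so every row sums to 1 (the diagonal is per-class recall)
cm_normalized = cm / cm.sum(axis=1, keepdims=True)
print(cm_normalized.round(2))
```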

In [None]:
#function to compute test accuracy and weighted F1 score for comparison
from sklearn.metrics import f1_score
def mod_comp(model, true, predicted):
  test_accur= accuracy_score(true, predicted)
  f1_scor = f1_score(true, predicted,average='weighted')
  return test_accur,f1_scor
  

In [None]:
# ML Model - 1 Implementation
LogReg = LogisticRegression()
# Fit the Algorithm
LogReg.fit(x_train_vectorized, y_train)

# Predict on the model
LogReg_prediction = LogReg.predict(x_test_vectorized)
# Report

In [None]:
# Visualizing evaluation Metric Score chart
mod_evaluat(LogReg, y_test, LogReg_prediction)
conf_matrix(LogReg, y_test, LogReg_prediction)

In [None]:
#Actual vs Predict Values
act_vs_pred=pd.DataFrame({"actual":y_test,"prediction": LogReg_prediction})
act_vs_pred[:20]

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
from sklearn.model_selection import GridSearchCV
Logreg_cv = LogisticRegression()

# Note: the default 'lbfgs' solver does not support the 'l1' penalty, so those
# candidates fail to fit and are scored as NaN; a solver such as 'liblinear' or
# 'saga' would be needed to actually evaluate them.
parameters = dict(penalty=['l1', 'l2'],C=[100, 10, 1.0, 0.1, 0.01])


#Hyperparameter tuning by GridSearchCV
logreg_Gcv=GridSearchCV(Logreg_cv,parameters,cv=5)


#fitting the data to model
%time logreg_Gcv.fit(x_train_vectorized, y_train)


# Predict on the model
lgcv_pred=logreg_Gcv.predict(x_test_vectorized)


In [None]:
#evaluation Chart of model
mod_evaluat(logreg_Gcv.best_estimator_, y_test, lgcv_pred)
conf_matrix(logreg_Gcv.best_estimator_, y_test, lgcv_pred)

In [None]:
print('The Best estimator or model : ',logreg_Gcv.best_estimator_)
print("\nThe best fit parameters value is found out to be : " ,logreg_Gcv.best_params_)
print( "\n the average of all the cross-validation fold : ", logreg_Gcv.best_score_)

##### Which hyperparameter optimization technique have you used and why?

I used GridSearchCV because it exhaustively evaluates every combination of the parameter values supplied in a grid and scores each combination with cross-validation. The estimator and the parameter grid are passed in, and the best combination found is then used to make predictions.
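
A minimal sketch of the grid search idea on synthetic data. The data set, the grid values, and `solver='liblinear'` are assumptions made for illustration (liblinear is picked only because it accepts both the `l1` and `l2` penalties); this is not the configuration used in the notebook:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the vectorized tweets
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

param_grid = {"penalty": ["l1", "l2"], "C": [0.01, 0.1, 1.0, 10.0]}

# solver='liblinear' is chosen here because it accepts both penalties
search = GridSearchCV(LogisticRegression(solver="liblinear"), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)
print(len(search.cv_results_["params"]))  # 2 penalties x 4 C values = 8 candidates
```

Every one of the 8 candidates is fitted 5 times, once per fold, so the cost grows with the product of the grid sizes.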

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

In [None]:
print(' Evaluation metric Score Chart (Base model)\n')
mod_evaluat(LogReg, y_test, LogReg_prediction)
print('-'*40,'\n Evaluation metric Score Chart with Hyperparameter tuning technique\n')
mod_evaluat(logreg_Gcv.best_estimator_, y_test, lgcv_pred)

We can see there is no difference between the base model and the tuned model.

### ML Model - 2 Stochastic Gradient Descent

In [None]:
# ML Model - 2 Implementation
SGDClassifier_model = SGDClassifier(max_iter = 10000)
SGDClassifier_model.fit(x_train_vectorized, y_train)

# Predict on the model
SGDC_prediction = SGDClassifier_model.predict(x_test_vectorized)

# Report

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
mod_evaluat(SGDClassifier_model, y_test, SGDC_prediction)
conf_matrix(SGDClassifier_model, y_test, SGDC_prediction)

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 2 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
from sklearn.model_selection import RandomizedSearchCV 
sgdc_cv = SGDClassifier()
parameters = dict(penalty=['l1', 'l2'],alpha = [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000] )

#Hyperparameter tuning by RandomizedSearchCV
#the lists above give only 2 x 8 = 16 combinations, so n_iter=16 already tries all of them
sgdc_Rcv=RandomizedSearchCV(estimator=sgdc_cv,param_distributions=parameters,  
                              verbose=1, n_jobs=-1, n_iter=16) 

#fitting the data to model
%time sgdc_Rcv.fit(x_train_vectorized, y_train)

# Predict on the model
sgdcr_pred=sgdc_Rcv.predict(x_test_vectorized)


In [None]:
# Visualizing evaluation Metric Score chart
mod_evaluat(sgdc_Rcv.best_estimator_, y_test, sgdcr_pred)
conf_matrix(sgdc_Rcv.best_estimator_, y_test, sgdcr_pred)

In [None]:
print('The Best estimator : ',sgdc_Rcv.best_estimator_)
print("\nThe best fit alpha value is found out to be :" ,sgdc_Rcv.best_params_)
print( "\n the average of all the cross-validation fold : ", sgdc_Rcv.best_score_)

##### Which hyperparameter optimization technique have you used and why?

I used RandomizedSearchCV because it samples combinations of hyperparameters at random, scores each one with cross-validation, and returns the combination with the best score.
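
Random search differs most from grid search when a parameter is sampled from a continuous range rather than a short list; when every parameter is a short list and `n_iter` covers all combinations, it tries the same candidates as a grid search. A sketch on synthetic data, where the `loguniform` bounds and `n_iter=10` are illustrative assumptions:

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the vectorized tweets
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# alpha is sampled from a continuous log-uniform range instead of a fixed list
param_distributions = {
    "penalty": ["l1", "l2"],
    "alpha": loguniform(1e-5, 1e-1),
}

search = RandomizedSearchCV(
    SGDClassifier(random_state=0),
    param_distributions,
    n_iter=10,        # number of sampled candidates
    cv=5,
    random_state=0,   # makes the sampled candidates reproducible
)
search.fit(X, y)
print(search.best_params_)
```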

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

In [None]:
print(' Evaluation metric Score Chart (Base model)\n')
mod_evaluat(SGDClassifier_model, y_test, SGDC_prediction)
print('-'*40,'\n Evaluation metric Score Chart with Hyperparameter technique\n')
mod_evaluat(sgdc_Rcv.best_estimator_, y_test, sgdcr_pred)


#### 3. Explain each evaluation metric's indication towards business and the business impact of the ML model used.


### ML Model - 3 Multinomial Naive Bayes

In [None]:
# ML Model - 3 Implementation
from sklearn.naive_bayes import MultinomialNB
mnb = MultinomialNB()
#Fit the Algorithm
mnb.fit(x_train_vectorized, y_train)

# Predict on the model
mnb_pred = mnb.predict(x_test_vectorized)

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
mod_evaluat(mnb, y_test, mnb_pred)
conf_matrix(mnb, y_test, mnb_pred)

#### 2. Cross-Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

params = {'alpha': [0.01, 0.1, 0.5, 1.0, 10.0, ],
          'fit_prior': [True, False]
         }
# Fit the Algorithm

multinomial_nb_grid = GridSearchCV(MultinomialNB(), param_grid=params, n_jobs=-1, cv=5, verbose=5)
multinomial_nb_grid.fit(x_train_vectorized, y_train)

# Predict on the model
nb_pred_cv=multinomial_nb_grid.predict(x_test_vectorized)

In [None]:
# Visualizing evaluation Metric Score chart
mod_evaluat(multinomial_nb_grid.best_estimator_, y_test, nb_pred_cv)
conf_matrix(multinomial_nb_grid.best_estimator_, y_test, nb_pred_cv)

In [None]:
print('The Best estimator : ',multinomial_nb_grid.best_estimator_)
print("\nThe best fit alpha value is found out to be :" ,multinomial_nb_grid.best_params_)
print( "\nthe average of all the cross-validation fold : ", multinomial_nb_grid.best_score_)

##### Which hyperparameter optimization technique have you used and why?

I used GridSearchCV because it exhaustively evaluates every combination of the parameter values supplied in a grid and scores each combination with cross-validation. The estimator and the parameter grid are passed in, and the best combination found is then used to make predictions.

##### Have you seen any improvement? Note down the improvement with updates Evaluation metric Score Chart.

In [None]:
print(' Evaluation metric Score Chart (Base model)\n')
mod_evaluat(mnb, y_test,mnb_pred)
print('-'*40,'\n Evaluation metric Score Chart with Hyperparameter technique\n')
mod_evaluat(multinomial_nb_grid.best_estimator_, y_test, nb_pred_cv)

### Models Comparison

In [None]:
# Comparison of accuracy, and f1_score of Models 
logreg_acc , logreg_f1 = mod_comp(LogReg, y_test, LogReg_prediction)

logregCV_acc , logregCV_f1 = mod_comp(logreg_Gcv.best_estimator_, y_test, lgcv_pred)

sgdc_acc , sgdc_f1 = mod_comp(SGDClassifier_model, y_test, SGDC_prediction)

sgdcCV_acc , sgdcCV_f1 = mod_comp(sgdc_Rcv.best_estimator_, y_test, sgdcr_pred)

naiveb_acc , naiveb_f1 = mod_comp(mnb, y_test, mnb_pred)

naivebCV_acc , naivebCV_f1 = mod_comp(multinomial_nb_grid.best_estimator_, y_test, nb_pred_cv)

In [None]:
models_df = pd.DataFrame(
    {'Models': ['Logistic Regression','SGDClassifier','Naive Bayes Classifier','Logistic Regression on CV','SGDClassifier on CV','Naive_BC on CV'],
     'Test Accuracy': [logreg_acc, sgdc_acc, naiveb_acc,logregCV_acc,sgdcCV_acc,naivebCV_acc],
     'F1 Score' : [logreg_f1, sgdc_f1, naiveb_f1,logregCV_f1,sgdcCV_f1,naivebCV_f1]
    })
models_df.sort_values(by=['Test Accuracy'], ascending=False, inplace=True)
models_df.reset_index(drop=True)

### 1. Which Evaluation metrics did you consider for a positive business impact and why?

Accuracy was used as the main evaluation metric because it answers a direct question: what percentage of the model's predictions were correct? The weighted F1 score is reported alongside it.
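
To make the two metrics concrete, here is a small hand-computed sketch on made-up labels. The helper names are hypothetical; the notebook itself relies on scikit-learn's `accuracy_score` and `f1_score(average='weighted')`:

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's true count."""
    support = Counter(y_true)
    total = 0.0
    for cls, n_true in support.items():
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        n_pred = sum(p == cls for p in y_pred)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        total += f1 * n_true
    return total / len(y_true)

# Made-up labels for illustration
y_true = ["Pos", "Pos", "Pos", "Neg", "Neg", "Neu"]
y_pred = ["Pos", "Pos", "Neg", "Neg", "Neg", "Pos"]

print(round(accuracy(y_true, y_pred), 3))
print(round(weighted_f1(y_true, y_pred), 3))
```

In this example the `Neu` class is never predicted correctly, which pulls the weighted F1 below the accuracy; accuracy alone can hide that kind of per-class failure.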

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

**SGDClassifier** is the best of the models created above because it has the highest accuracy and F1 score.

Even after cross-validation and hyperparameter tuning, **SGDClassifier** has the highest accuracy.

## ***8.*** ***Future Work (Optional)***

### 1. Save the best performing ml model in a pickle file or joblib file format for deployment process.


In [None]:
# Save the File
# import pickle

#serialized_model = pickle.dumps(SGDClassifier_model)
# Save the model to a file
#with open('best_model.pkl', 'wb') as file:
#    file.write(serialized_model)
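
The commented-out code above would pickle only the classifier, but predictions on new tweets also need the fitted vectorizer. One possible approach, sketched on made-up data, is to bundle both in a scikit-learn pipeline and save it with `joblib`; the file name `best_model.joblib` and the training texts are hypothetical:

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

# Made-up training data for illustration
texts = ["great news store open", "love helping neighbours",
         "panic buying awful", "terrible empty shelves"]
labels = ["Positive", "Positive", "Negative", "Negative"]

# Bundling the vectorizer with the classifier keeps the vocabulary and the model together
pipeline = make_pipeline(CountVectorizer(), SGDClassifier(random_state=0))
pipeline.fit(texts, labels)

# 'best_model.joblib' is a hypothetical file name, written to a temporary directory
path = os.path.join(tempfile.mkdtemp(), "best_model.joblib")
joblib.dump(pipeline, path)

# The restored pipeline accepts raw (cleaned) text directly
restored = joblib.load(path)
print(restored.predict(["awful panic"]))
```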

### ***Congrats! Your model is successfully created and ready for deployment on a live server for a real user interaction !!!***

# **Conclusion**

* March was the month with the most tweets; the dataset covers the year 2020.
* The most tweets were posted from London and the United States.
* The most negative tweets were posted from London and the most positive from the United States.
* Trending hashtags were #coronavirus and #COVID19.
* @realDonaldTrump and @Tesco were the most frequently mentioned accounts in the tweets.
* The most frequent words in the data are coronavirus, supermarket, grocery store, sanitizer, and people.
* CountVectorizer was used for vectorization; it converts a collection of raw documents into vectors of term/token counts.
* For this multiclass classification task, SGDClassifier was the best-performing model, with 82% accuracy, which increased by about 1% after hyperparameter tuning with RandomizedSearchCV.


### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***