To start practicing, make a copy of this notebook: go to File > Save a Copy in Drive, then work in the copy that opens in a new tab.


# AfterWork Data Science: Getting Started with NLP Project

### Prerequisites

In [None]:
# Importing the required libraries
# ---
# 
import pandas as pd # library for data manipulation
import numpy as np  # library for scientific computations
import re           # regex library to perform text preprocessing
import string       # library to work with strings
import nltk         # library for natural language processing
import scipy        # scientific computing
import scipy.sparse # sparse matrices (used later to stack our features)

### 1. Importing our Data

In [None]:
# Question: Given new tweets, create a sentiment analysis model that will 
# predict whether a tweet contains positive or negative sentiment.
# ---
# Dataset url = https://bit.ly/31kqByD 
# ---
#
df = pd.read_csv('https://bit.ly/31kqByD', encoding='latin-1')
df.head()

Unnamed: 0.1,Unnamed: 0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. You shoulda got David Carr of Third Day to do it. ;D"
0,346508,0,2016177685,Wed Jun 03 06:18:50 PDT 2009,NO_QUERY,UriGrey,Obama forges his Muslim alliance against the c...
1,883537,4,1686152287,Sun May 03 04:02:08 PDT 2009,NO_QUERY,MariesolW,Had the most spectacular prom ever but now my...
2,764173,0,2298725623,Tue Jun 23 12:02:12 PDT 2009,NO_QUERY,ColleenBurns,I am overwhelmed today taking a moment to eat...
3,638701,0,2234530495,Thu Jun 18 23:13:54 PDT 2009,NO_QUERY,queenarchy,@lindork Tres sad. I was totally a Max fan. #...
4,664821,0,2244623416,Fri Jun 19 14:59:46 PDT 2009,NO_QUERY,reinventingjess,"Crap, I was counting down the hours until my d..."


### 2. Data Exploration

In [None]:
# We can determine the size of our dataset
# ---
#
df.shape

(10000, 7)

This dataset needs some cleaning: it has no meaningful column names, and several of its columns are not needed to create our model. We will rename the columns and then drop the ones we don't need.

### 3. Data Preparation

#### Basic Data Cleaning Techniques

In [None]:
# We rename the columns for ease of referencing our columns later on
# ---
#
df.columns = ['id', 'target', 't_id', 'created_at', 'query', 'user', 'text']
df.head()

Unnamed: 0,id,target,t_id,created_at,query,user,text
0,346508,0,2016177685,Wed Jun 03 06:18:50 PDT 2009,NO_QUERY,UriGrey,Obama forges his Muslim alliance against the c...
1,883537,4,1686152287,Sun May 03 04:02:08 PDT 2009,NO_QUERY,MariesolW,Had the most spectacular prom ever but now my...
2,764173,0,2298725623,Tue Jun 23 12:02:12 PDT 2009,NO_QUERY,ColleenBurns,I am overwhelmed today taking a moment to eat...
3,638701,0,2234530495,Thu Jun 18 23:13:54 PDT 2009,NO_QUERY,queenarchy,@lindork Tres sad. I was totally a Max fan. #...
4,664821,0,2244623416,Fri Jun 19 14:59:46 PDT 2009,NO_QUERY,reinventingjess,"Crap, I was counting down the hours until my d..."


In [None]:
# We retain the relevant columns by dropping the columns we don't need 
# for creating a sentiment analysis model. 
# ---
#
df = df.drop(['id', 't_id', 'created_at', 'query', 'user'], axis = 1)
df.head()

Unnamed: 0,target,text
0,0,Obama forges his Muslim alliance against the c...
1,4,Had the most spectacular prom ever but now my...
2,0,I am overwhelmed today taking a moment to eat...
3,0,@lindork Tres sad. I was totally a Max fan. #...
4,0,"Crap, I was counting down the hours until my d..."


In [None]:
# Understanding the distribution of target
# ---
#
df.target.value_counts() 

0    5067
4    4933
Name: target, dtype: int64

In [None]:
# Let's determine whether our columns have the right data types
# ---
#
df.dtypes

target     int64
text      object
dtype: object

In [None]:
# What values are in our target variable?
# ---
#
df.target.unique()

array([0, 4])

These are the two classes to which each document (text) belongs. The target value 0 means a text with a negative sentiment, while that of 4 means a text with a positive sentiment. 
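Some tools and metrics expect binary labels coded as 0 and 1. The following is a minimal sketch of such a remapping on a toy frame; the `demo` frame and the `label` column are hypothetical names used only for illustration, and the rest of this notebook keeps the original 0/4 coding.

```python
import pandas as pd

# Toy frame standing in for df (assumption: same 0/4 target coding)
demo = pd.DataFrame({
    "target": [0, 4, 0],
    "text": ["so tired", "new haircut yay", "my knee hurts"],
})

# Map 0 (negative) -> 0 and 4 (positive) -> 1 in a new, hypothetical column
demo["label"] = demo["target"].map({0: 0, 4: 1})
print(demo)
```
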

In [None]:
# Let's check for missing values 
# ---
# 
df.isnull().sum()

target    0
text      0
dtype: int64

We don't have any missing values, so we are good to go.

#### Text Processing

In [None]:
# Text Cleaning: Removing all urls/links
# ---
# 
df['text'] =  df['text'].apply(lambda x: re.sub(r'http\S+|www\S+|https\S+','', str(x)))
df[['text']].head()

Unnamed: 0,text
0,Obama forges his Muslim alliance against the c...
1,Had the most spectacular prom ever but now my...
2,I am overwhelmed today taking a moment to eat...
3,@lindork Tres sad. I was totally a Max fan. #...
4,"Crap, I was counting down the hours until my d..."


In [None]:
# Text Cleaning: Removing the @ and # characters
# ---
# regex=True is passed explicitly: pandas 2.0+ treats the pattern as a literal string by default
df['text'] = df.text.str.replace('[@#]', '', regex=True)
df[['text']].head()



Unnamed: 0,text
0,Obama forges his Muslim alliance against the c...
1,Had the most spectacular prom ever but now my...
2,I am overwhelmed today taking a moment to eat...
3,lindork Tres sad. I was totally a Max fan. SY...
4,"Crap, I was counting down the hours until my d..."


In [None]:
# Text Cleaning: Conversion to lowercase
# ---
df['text'] = df.text.str.lower() 
df[['text']].head()


Unnamed: 0,text
0,obama forges his muslim alliance against the c...
1,had the most spectacular prom ever but now my...
2,i am overwhelmed today taking a moment to eat...
3,lindork tres sad. i was totally a max fan. sy...
4,"crap, i was counting down the hours until my d..."


In [None]:
# Text Cleaning: Splitting concatenated words
# ---
# Performing this step will take a few minutes...
# ---
# YOUR CODE GOES BELOW
# 

# Installing wordninja and textblob
!pip3 install wordninja
!pip3 install textblob


# Importing those libraries
import wordninja 
from textblob import TextBlob



In [None]:
# Performing the split

df['text'] = df.text.apply(lambda x: wordninja.split(str(TextBlob(x))))  
df['text'] = df.text.str.join(' ')
df[['text']].sample(10)


Unnamed: 0,text
4131,tom mcfly hah a a nice ee hope you're having f...
4239,miley cyrus shout out s please u never do that...
3655,nose bleed i wish i never left bed
4335,garb off man my keyboard is messed up i cant f...
464,cpe d raza yep me too that should save me an h...
8700,brittany ta stic it used to be fun now its jus...
7333,my knee hurts also my back
7603,dougie mcfly dont leave poynter
2603,brook a ayy i figured out how to reply to you ...
8546,omg i'm so sad a stone fell out of my dior s a...


In [None]:
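# Text Cleaning: Removing punctuation characters
# ---
# A raw string avoids invalid-escape warnings; regex=True is required on pandas 2.0+
df['text'] = df.text.str.replace(r'[^\w\s]', '', regex=True)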
df[['text']].head(10)

Unnamed: 0,text
0,obama forges his muslim alliance against the c...
1,had the most spectacular prom ever but now my ...
2,i am overwhelmed today taking a moment to eat ...
3,lin dork tres sad i was totally a max fan sytycd
4,crap i was counting down the hours until my da...
5,dc b tv dc b tv i had to go check some things ...
6,s mr or ke why are you never on gmail anymore
7,alex jeffrey s id have loved to have come just...
8,br rrr heading to work chilly today
9,ga bri iii ella i nee ed to talk to you u good...


In [None]:
# Text Cleaning: Removing stop words
# ---

# Importing the TfidfVectorizer which will help us with this process
from sklearn.feature_extraction.text import TfidfVectorizer

# Now creating a word-level TF-IDF feature
# For our parameters:
# 1.   max_features: We keep only the 1000 most frequent terms as features.
# 2.   stop_words: TfidfVectorizer can remove English stop words for us.
# 3.   analyzer='word': We compute the TF-IDF of word n-grams by setting the analyzer to 'word'.
# 4.   ngram_range=(1, 3): We set ngram_range=(a, b), where a is the minimum 
#      and b is the maximum size of the n-grams we want to include in our features.

tfidf = TfidfVectorizer(max_features=1000,analyzer='word', ngram_range=(1,3),  stop_words= 'english')
df_text_vect = tfidf.fit_transform(df['text'])

df_text_vect.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [None]:
# Text Cleaning: Lemmatization
# ---
# YOUR CODE GOES BELOW
#

# For lemmatization, we will need to download wordnet

nltk.download('wordnet')

# Lemmatizing our text: we import the Word object from the textblob library, 
# pass it the word that we want to lemmatize, and then call its lemmatize method as shown

from textblob import Word

# Then perform lemmatization

df['lemmatization'] = df.text.apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()])) 
df[['text', 'lemmatization']].head(10)


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Unnamed: 0,text,lemmatization
0,obama forges his muslim alliance against the c...,obama forge his muslim alliance against the ci...
1,had the most spectacular prom ever but now my ...,had the most spectacular prom ever but now my ...
2,i am overwhelmed today taking a moment to eat ...,i am overwhelmed today taking a moment to eat ...
3,lin dork tres sad i was totally a max fan sytycd,lin dork tres sad i wa totally a max fan sytycd
4,crap i was counting down the hours until my da...,crap i wa counting down the hour until my dad ...
5,dc b tv dc b tv i had to go check some things ...,dc b tv dc b tv i had to go check some thing b...
6,s mr or ke why are you never on gmail anymore,s mr or ke why are you never on gmail anymore
7,alex jeffrey s id have loved to have come just...,alex jeffrey s id have loved to have come just...
8,br rrr heading to work chilly today,br rrr heading to work chilly today
9,ga bri iii ella i nee ed to talk to you u good...,ga bri iii ella i nee ed to talk to you u good...


We won't remove numerics because our text could lose some of its meaning without them. We could also further prepare our text by performing spelling correction, but this is a resource-intensive process that we will skip for now.

#### Feature Engineering Techniques 

In [None]:
# Feature Construction: Length of tweet (number of characters, including spaces)
# ---
df['length_of_tweet'] = df.text.str.len()
df[['text','length_of_tweet']].head(10)


Unnamed: 0,text,length_of_tweet
0,obama forges his muslim alliance against the c...,103
1,had the most spectacular prom ever but now my ...,129
2,i am overwhelmed today taking a moment to eat ...,54
3,lin dork tres sad i was totally a max fan sytycd,48
4,crap i was counting down the hours until my da...,133
5,dc b tv dc b tv i had to go check some things ...,82
6,s mr or ke why are you never on gmail anymore,45
7,alex jeffrey s id have loved to have come just...,98
8,br rrr heading to work chilly today,35
9,ga bri iii ella i nee ed to talk to you u good...,54


In [None]:
# Feature Construction: Word count
# ---
# Counts the number of tokens in the text (separated by a space)

df['word_count'] = df["text"].apply(lambda x: len(str(x).split(" ")))
df.head(10)

Unnamed: 0,target,text,lemmatization,length_of_tweet,word_count
0,0,obama forges his muslim alliance against the c...,obama forge his muslim alliance against the ci...,103,20
1,4,had the most spectacular prom ever but now my ...,had the most spectacular prom ever but now my ...,129,25
2,0,i am overwhelmed today taking a moment to eat ...,i am overwhelmed today taking a moment to eat ...,54,11
3,0,lin dork tres sad i was totally a max fan sytycd,lin dork tres sad i wa totally a max fan sytycd,48,11
4,0,crap i was counting down the hours until my da...,crap i wa counting down the hour until my dad ...,133,29
5,4,dc b tv dc b tv i had to go check some things ...,dc b tv dc b tv i had to go check some thing b...,82,20
6,0,s mr or ke why are you never on gmail anymore,s mr or ke why are you never on gmail anymore,45,11
7,0,alex jeffrey s id have loved to have come just...,alex jeffrey s id have loved to have come just...,98,20
8,0,br rrr heading to work chilly today,br rrr heading to work chilly today,35,7
9,4,ga bri iii ella i nee ed to talk to you u good...,ga bri iii ella i nee ed to talk to you u good...,54,15


In [None]:
# Feature Construction: Word density
# ---
# YOUR CODE GOES BELOW
# Note: this divides each tweet's word count by the total number of tweets
# in the dataset, so it is a rescaled word count rather than a per-tweet average.

# Number of tweets (rows) in df
number_of_rows = len(df.index)
print(number_of_rows)

df['Word density'] = df['word_count'] / number_of_rows
df.sample(20)

10000


Unnamed: 0,target,text,lemmatization,length_of_tweet,word_count,Word density
3018,0,exhausted,exhausted,9,1,0.0001
4538,0,i really wanted to go to super target tonight ...,i really wanted to go to super target tonight ...,109,22,0.0022
1976,4,cars honking around here i see a bii iii g whi...,car honking around here i see a bii iii g whit...,127,24,0.0024
527,4,is hungry yyyy yy going to eat traditional ind...,is hungry yyyy yy going to eat traditional ind...,93,19,0.0019
2627,4,yeah twit terrific re marche,yeah twit terrific re marche,28,5,0.0005
851,4,another day home resting my knee after knee su...,another day home resting my knee after knee su...,133,25,0.0025
2905,4,just got home had a fun day with foxy angela m...,just got home had a fun day with foxy angela m...,72,16,0.0016
5139,4,justine ville it doesnt feel like youre connec...,justine ville it doesnt feel like youre connec...,76,13,0.0013
6511,4,ariel emo on fire yeah i usually make mulled w...,ariel emo on fire yeah i usually make mulled w...,62,13,0.0013
3166,0,so tired 3 hours of work left and so much to d...,so tired 3 hour of work left and so much to do...,129,29,0.0029


In [None]:
# Feature Construction: Noun count
# ---
# YOUR CODE GOES BELOW
#
# First, we download punkt and averaged_perceptron_tagger into our notebook environment, 
# which will allow us to find the part-of-speech tags.
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')




[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [None]:

# We create a function that counts the words in a given sentence that carry a given part-of-speech tag

from textblob import TextBlob
pos_dic = {
    'noun' : ['NN','NNS','NNP','NNPS'],
    'pron' : ['PRP','PRP$','WP','WP$'],
    'verb' : ['VB','VBD','VBG','VBN','VBP','VBZ'],
    'adj' :  ['JJ','JJR','JJS'],
    'adv' : ['RB','RBR','RBS','WRB']
}

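def pos_check(x, flag):
    # Counts the words in x whose POS tag belongs to the group named by flag
    cnt = 0
    try:
        wiki = TextBlob(x)
        for tup in wiki.tags:
            ppo = list(tup)[1]   # the tag is the second element of each (word, tag) pair
            if ppo in pos_dic[flag]:
                cnt += 1
    except Exception:
        # If tagging fails for this text, we keep whatever was counted so far
        pass
    return cnt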

In [None]:
# Noun Count

df['noun_count'] = df.text.apply(lambda x: pos_check(x, 'noun'))
df[['text','noun_count']].sample(10)


Unnamed: 0,text,noun_count
6181,techno phobic xo not much different really sho...,6
457,lady maryann watched it,2
3935,xx ree is rad 16 xx oh no well im glad you are...,4
9475,yay for the lake today,3
1652,reagan gomez yes we are,1
2160,varsity tutors video in our future for sure on...,8
9966,so sleepy maybe i should sleep,1
494,trying to figure out this thing its not going ...,1
178,cat rah nah teaching should be dull unimaginat...,3
9869,cherry coke rocks not the new pic cie sweetheart,5


In [None]:
# Feature Construction: Verb count
# ---
df['verb_count'] = df.text.apply(lambda x: pos_check(x, 'verb'))
df[['text','verb_count']].sample(10)

Unnamed: 0,text,verb_count
6270,teff 95 youre welcome sweetie,0
984,bus to subway oh you know the usual 8 5 grind ...,6
2677,le mez ma thank you for the follow friday and ...,3
9584,painting faces at the agricultural museum all day,2
7456,gaa h me thinks there should be plans for tonight,3
5598,suite basement go thru the side door if you di...,3
3912,a comic for yall,0
1574,watched a home movie mad me sad last time i sa...,2
9910,downloading the big bang theory series one 3 g...,2
8253,stam at a this wasnt posting thus pun fail,1


In [None]:
# Feature Construction: Adjective count / Tweet

df['adj_count'] = df.text.apply(lambda x: pos_check(x, 'adj'))
df[['text','adj_count']].sample(10)


Unnamed: 0,text,adj_count
1458,another fight woke up with no eyes and i have ...,0
6417,bored like crazy nothing on the tv,1
6898,read mr fry s review of the iphone 3 gs findin...,2
1315,taking a hiatus from my blog in pt her words i...,4
6359,mtv awards were imo boring my pc is screwed at...,3
920,weekend that means you have two days to get al...,1
8935,home day with 1 son whose not feeling great tu...,1
5686,fuck life really hard just take it and fuck it...,2
3945,niro who,0
2897,summer,0


In [None]:
# Feature Construction: Adverb count / Tweet
df['adv_count'] = df.text.apply(lambda x: pos_check(x, 'adv'))
df[['text','adv_count']].sample(10)


Unnamed: 0,text,adv_count
3372,by the way the turkish representative at the e...,3
5939,robin taylor roth i couldnt tame my hair to sa...,0
3801,bye twitter still l addicted but have to go b ...,4
4251,eucalypt deletes mine as well,2
4439,web my c its only for us and canada ever note ...,2
6691,p ryo was freaking amazing weston is omg that ...,1
4544,hello kt junkie ya know i have no idea its kin...,0
7201,good morning malta sending some tweets to all ...,0
5481,so i definitely just watched the last 1 3 of t...,5
5234,mitchel musso hi can you com to norway u rock,0


In [None]:
# Feature Construction: Pronoun count
# ---
df['pron_count'] = df.text.apply(lambda x: pos_check(x, 'pron'))
df[['text','pron_count']].sample(10)


Unnamed: 0,text,pron_count
5309,i guess im not hanging out with bra ad,0
9087,green i girl a ww poor girl im about to go to ...,2
4517,i wish i would just blackout when i do stupid ...,0
2078,ca voce r u hoh sounds like moving isnt going ...,0
4657,chad lad oh lend me your flip flops my feet ar...,4
7812,j ough dee a ww sorry it was exiled blog whoop...,1
1133,yu h ng fresh that z how you do it no play mor...,2
6922,and i hate when i really wanna talk to somebod...,1
822,new haircut yay,0
8187,baby rabies i dont know my dds blood ty pr the...,4


In [None]:
# Feature Construction: Subjectivity
def get_subjectivity(text):
    try:
        textblob = TextBlob(text)
        subj = textblob.sentiment.subjectivity
    except:
        subj = 0.0
    return subj

df['subjectivity'] = df.text.apply(get_subjectivity)
df[['text', 'subjectivity']].sample(10)


Unnamed: 0,text,subjectivity
7942,sch of e yeah totally it would have been nice ...,0.875
6538,i want my blackberry back im over my sidekick l x,0.0
4351,is going to eat out with the family this doesn...,0.575
4870,a beautiful mind 1 fingers amp toes crossed go...,0.8
6363,sky lowe theres only one place worth venturing...,0.55
5920,why is it so hard for me to understand math,0.541667
9412,emd an yell aw that sucks i have a feeling my ...,0.61
6720,glad to be home sip pin tea makin rice with da...,0.928571
3425,shine y new chase card arrived i can has pay p...,0.454545
9865,ben patrick 90069 but im not waiting for him t...,0.0


In [None]:
# Feature Construction: Polarity
def get_polarity(text):
    try:
        textblob = TextBlob(text)
        pol = textblob.sentiment.polarity
    except:
        pol = 0.0
    return pol

df['polarity'] = df.text.apply(get_polarity)
df[['text', 'polarity']].sample(10)


Unnamed: 0,text,polarity
8034,jos or doni would you believe it todays boot s...,0.0
7031,ugh final exam today ready for my summer to start,0.1
7822,for so cal this sure is a dark dreary day,0.175
7655,ok this year the fire starters mean business h...,0.256167
1066,um dm syntax dont work,0.0
4604,arsenal fan n coming up sow wy,0.0
8480,just before pmqs too brilliant timing,0.9
7304,today marks 4 yrs at my current job get to cel...,0.0
7160,most i 1 your tweet was just included in the l...,0.5
9150,ty kat mcgraw 4 my hug needed it x,0.0


In [None]:
# Feature Construction: Word Level N-Gram TF-IDF Feature 

from nltk import word_tokenize, ngrams

# Word ngrams
# ---
#
list(ngrams(word_tokenize(df['text'][0]), 2)) 


[('obama', 'forges'),
 ('forges', 'his'),
 ('his', 'muslim'),
 ('muslim', 'alliance'),
 ('alliance', 'against'),
 ('against', 'the'),
 ('the', 'civilized'),
 ('civilized', 'world'),
 ('world', 'and'),
 ('and', 'he'),
 ('he', 'didnt'),
 ('didnt', 'even'),
 ('even', 'drop'),
 ('drop', 'in'),
 ('in', 'for'),
 ('for', 'a'),
 ('a', 'cup'),
 ('cup', 'of'),
 ('of', 'tea')]

In [None]:
# Feature Construction: Character Level N-Gram TF-IDF Feature
list(ngrams(df['text'][0], 2))

[('o', 'b'),
 ('b', 'a'),
 ('a', 'm'),
 ('m', 'a'),
 ('a', ' '),
 (' ', 'f'),
 ('f', 'o'),
 ('o', 'r'),
 ('r', 'g'),
 ('g', 'e'),
 ('e', 's'),
 ('s', ' '),
 (' ', 'h'),
 ('h', 'i'),
 ('i', 's'),
 ('s', ' '),
 (' ', 'm'),
 ('m', 'u'),
 ('u', 's'),
 ('s', 'l'),
 ('l', 'i'),
 ('i', 'm'),
 ('m', ' '),
 (' ', 'a'),
 ('a', 'l'),
 ('l', 'l'),
 ('l', 'i'),
 ('i', 'a'),
 ('a', 'n'),
 ('n', 'c'),
 ('c', 'e'),
 ('e', ' '),
 (' ', 'a'),
 ('a', 'g'),
 ('g', 'a'),
 ('a', 'i'),
 ('i', 'n'),
 ('n', 's'),
 ('s', 't'),
 ('t', ' '),
 (' ', 't'),
 ('t', 'h'),
 ('h', 'e'),
 ('e', ' '),
 (' ', 'c'),
 ('c', 'i'),
 ('i', 'v'),
 ('v', 'i'),
 ('i', 'l'),
 ('l', 'i'),
 ('i', 'z'),
 ('z', 'e'),
 ('e', 'd'),
 ('d', ' '),
 (' ', 'w'),
 ('w', 'o'),
 ('o', 'r'),
 ('r', 'l'),
 ('l', 'd'),
 ('d', ' '),
 (' ', 'a'),
 ('a', 'n'),
 ('n', 'd'),
 ('d', ' '),
 (' ', 'h'),
 ('h', 'e'),
 ('e', ' '),
 (' ', 'd'),
 ('d', 'i'),
 ('i', 'd'),
 ('d', 'n'),
 ('n', 't'),
 ('t', ' '),
 (' ', 'e'),
 ('e', 'v'),
 ('v', 'e'),
 ('e', 'n'),

In [None]:
df.shape
df.head(2)

Unnamed: 0,target,text,lemmatization,length_of_tweet,word_count,Word density,noun_count,verb_count,adj_count,adv_count,pron_count,subjectivity,polarity
0,0,obama forges his muslim alliance against the c...,obama forge his muslim alliance against the ci...,103,20,0.002,6,2,2,1,2,0.9,0.4
1,4,had the most spectacular prom ever but now my ...,had the most spectacular prom ever but now my ...,129,25,0.0025,5,5,3,3,4,0.7625,0.6125


In [None]:
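# Let's prepare the constructed features for modeling
# ---
# Columns 3 to 11 run from length_of_tweet to subjectivity. The slice stops before
# polarity, which can be negative; MultinomialNB only accepts non-negative features.
X_metadata = np.array(df.iloc[:, 3:12])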
X_metadata

array([[1.03000000e+02, 2.00000000e+01, 2.00000000e-03, ...,
        1.00000000e+00, 2.00000000e+00, 9.00000000e-01],
       [1.29000000e+02, 2.50000000e+01, 2.50000000e-03, ...,
        3.00000000e+00, 4.00000000e+00, 7.62500000e-01],
       [5.40000000e+01, 1.10000000e+01, 1.10000000e-03, ...,
        0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       ...,
       [5.40000000e+01, 1.10000000e+01, 1.10000000e-03, ...,
        2.00000000e+00, 1.00000000e+00, 8.33333333e-01],
       [3.90000000e+01, 7.00000000e+00, 7.00000000e-04, ...,
        3.00000000e+00, 0.00000000e+00, 5.67857143e-01],
       [6.80000000e+01, 1.70000000e+01, 1.70000000e-03, ...,
        0.00000000e+00, 3.00000000e+00, 6.00000000e-01]])

TF-IDF is another feature extraction technique that we can use to represent text data in a format that can be consumed by models. Unlike Bag of Words, which weights every term by its raw count, TF-IDF assumes that words appearing in many documents carry little information, so it down-weights them and gives rarer words more weight.
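To make the weighting concrete, here is a minimal sketch that computes the textbook form of TF-IDF by hand on three toy documents. The documents and the helper `tfidf_weights` are made up for illustration. Note that scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Three toy documents (assumption: already cleaned and lowercased)
docs = [
    "good morning twitter",
    "good night twitter",
    "my knee hurts",
]

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents that contain each term
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_weights(tokens):
    """Return {term: tf * idf} for one document (unsmoothed, unnormalized)."""
    counts = Counter(tokens)
    return {
        term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
    }

weights = [tfidf_weights(tokens) for tokens in tokenized]
for doc, w in zip(docs, weights):
    print(doc, {term: round(value, 3) for term, value in w.items()})
```

In the first document, "morning" appears in only one of the three documents, so it receives a larger weight than "good" and "twitter", which each appear in two.
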

In [None]:
# Word-level TF-IDF features (unigrams to trigrams)
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=1000, lowercase=True, analyzer='word', ngram_range=(1,3),  stop_words= 'english')
df_word_vect = tfidf.fit_transform(df['text'])

df_word_vect.toarray()

In [None]:
# Character-level TF-IDF features; stop_words is omitted because it only applies when analyzer='word'
tfidf = TfidfVectorizer(max_features=1000, lowercase=True, analyzer='char', ngram_range=(1,3))
df_char_vect = tfidf.fit_transform(df['text'])

df_char_vect.toarray()


In [None]:
# We combine our two tfidf (sparse) matrices and X_metadata
# ---
#
# X = scipy.sparse.hstack([df_word_vect.astype(object), df_char_vect.astype(object),  X_metadata])
# X


X = scipy.sparse.hstack([df_word_vect, df_char_vect,  X_metadata])
X

<10000x2009 sparse matrix of type '<class 'numpy.float64'>'
	with 1244167 stored elements in COOrdinate format>

In [None]:
# Getting our response variable
# ---
#
y = np.array(df.iloc[:, 0])
y

array([0, 4, 0, ..., 0, 4, 0])

### 4. Data Modelling

During this step, we will use machine learning algorithms to train and test our sentiment analysis models.

In [None]:
# Splitting our data
# ---
#
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Fitting our model
# ---
#

# Importing the algorithms
from sklearn.naive_bayes import MultinomialNB 
from sklearn.linear_model import LogisticRegression

nb_classifier = MultinomialNB() 
lr_classifier = LogisticRegression(max_iter=1000) 

# Training our model
nb_classifier.fit(X_train, y_train) 
lr_classifier.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression(max_iter=1000)

In [None]:
# Making predictions
# ---
#
y_predict_nb = nb_classifier.predict(X_test) 
y_predict_lr = lr_classifier.predict(X_test)

In [None]:
# Evaluating the Models
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Accuracy scores
# ---
#
print("Naive Bayes Classifier:\n", accuracy_score(y_test, y_predict_nb)) 
print("Logistic Regression Classifier: \n", accuracy_score(y_test, y_predict_lr))

Naive Bayes Classifier:
 0.735
Logistic Regression Classifier: 
 0.756


In [None]:
# Confusion matrices
# ---
# 
print("Naive Bayes Classifier: \n", confusion_matrix(y_test, y_predict_nb)) 
print("Logistic Regression Classifier: \n", confusion_matrix(y_test, y_predict_lr))

Naive Bayes Classifier: 
 [[775 275]
 [255 695]]
Logistic Regression Classifier: 
 [[788 262]
 [226 724]]


In [None]:
# Classification Reports
# ---
#
print("Naive Bayes Classifier: \n", classification_report(y_test, y_predict_nb)) 
print("Logistic Regression Classifier: \n", classification_report(y_test, y_predict_lr))

Naive Bayes Classifier: 
               precision    recall  f1-score   support

           0       0.75      0.74      0.75      1050
           4       0.72      0.73      0.72       950

    accuracy                           0.73      2000
   macro avg       0.73      0.73      0.73      2000
weighted avg       0.74      0.73      0.74      2000

Logistic Regression Classifier: 
               precision    recall  f1-score   support

           0       0.78      0.75      0.76      1050
           4       0.73      0.76      0.75       950

    accuracy                           0.76      2000
   macro avg       0.76      0.76      0.76      2000
weighted avg       0.76      0.76      0.76      2000



**Evaluating our Models**

* **Accuracy:** the percentage of texts that were assigned the correct class.
* **Precision:** of the texts the classifier assigned to a class, the percentage that actually belong to that class.
* **Recall:** of the texts that actually belong to a class, the percentage the classifier assigned to that class.
* **F1 Score:** the harmonic mean of precision and recall.
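These metrics can be checked by hand against the logistic regression confusion matrix printed earlier. The sketch below assumes scikit-learn's layout, where rows are the true classes and columns are the predicted classes; the helper `per_class_metrics` is a hypothetical name. F1 is computed as the harmonic mean of precision and recall, which is what `classification_report` reports.

```python
# Logistic regression confusion matrix from the output above
# (rows: true class, columns: predicted class; class order is 0, 4)
cm = [[788, 262],
      [226, 724]]
labels = [0, 4]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

def per_class_metrics(cm, i):
    """Return (precision, recall, f1) for the class at index i."""
    tp = cm[i][i]
    predicted_as_class = sum(row[i] for row in cm)  # column sum
    actually_in_class = sum(cm[i])                  # row sum
    precision = tp / predicted_as_class
    recall = tp / actually_in_class
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

metrics = {label: per_class_metrics(cm, i) for i, label in enumerate(labels)}

print(f"accuracy = {accuracy:.3f}")
for label, (precision, recall, f1) in metrics.items():
    print(f"class {label}: precision = {precision:.2f}, recall = {recall:.2f}, f1 = {f1:.2f}")
```

The rounded values agree with the classification report for the logistic regression classifier.
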

To improve our model, we can try performing other text processing techniques that would better prepare our data for fitting our model. We can also use different vectorizing techniques, implement other machine learning models and perform hyperparameter tuning.
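As one possible starting point for hyperparameter tuning, the sketch below searches over the regularization strength `C` of a logistic regression with `GridSearchCV`. It runs on a small synthetic dataset (`X_demo`, `y_demo`) rather than on this notebook's features, and the grid values are arbitrary assumptions. The `MaxAbsScaler` step follows the suggestion in the convergence warning to scale the data; it also works on sparse matrices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler

# Synthetic stand-in for X_train / y_train (assumption, for illustration only)
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=42)

# Scale the features, then fit a logistic regression
pipeline = make_pipeline(MaxAbsScaler(), LogisticRegression(max_iter=1000))

# make_pipeline names each step after its lowercased class name
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

To try this on the notebook's data, `X_demo` and `y_demo` would be replaced by `X_train` and `y_train`, with the test split held back for the final evaluation.
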

### 5. Recommendations


Our best model, the logistic regression classifier, had an accuracy of 75.6%, and we can use it for classifying newer tweets. We can improve this performance through hyperparameter tuning and further feature engineering.