## What is Natural Language Processing?

Natural language is the language you and I talk in. It could be Hindi, English, Spanish, anything. When we talk about natural language processing (NLP), we mean making computers able to process this language and, more importantly, understand it and take actions based on it. This language can be text based or audio based. Google Assistant, Siri and even Google Translate are great examples of this.


## So what are we doing in the Twitter Sentiment Analysis Project?

I am going to build a machine learning model that can analyze a large number of tweets and judge the sentiment behind each one. We are going to analyze tweets about climate change and classify the stance each tweet takes (the dataset's labels are described below). Models like this are widely used by platforms such as Twitter and Facebook to filter out, say, hate speech or other unwanted comments.

In [3]:
# Import packages

# Data manipulation package
import pandas as pd

# Natural language Toolkit packages.
# Necessary libraries and modules that are 
# going to help us do the data processing 
# from the nltk library.
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
from nltk.stem import PorterStemmer
import string

# Regular expression
import re

# to make bag of words
from sklearn.feature_extraction.text import CountVectorizer

# Packages to create models
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, f1_score



[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\henre\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [4]:
# Import dataset

tweets = pd.read_csv("train.csv")

In [7]:
tweets.head()

| | sentiment | message | tweetid |
|---|---|---|---|
| 0 | 1 | PolySciMajor EPA chief doesn't think carbon di... | 625221 |
| 1 | 1 | It's not like we lack evidence of anthropogeni... | 126103 |
| 2 | 2 | RT @RawStory: Researchers say we have three ye... | 698562 |
| 3 | 1 | #TodayinMaker# WIRED : 2016 was a pivotal year... | 573736 |
| 4 | 1 | RT @SoyNovioDeTodas: It's 2016, and a racist, ... | 466954 |


In [17]:
tweets['sentiment'].unique()

array([ 1,  2,  0, -1], dtype=int64)

Sentiment Description:
* 2 News: the tweet links to factual news about climate change

* 1 Pro: the tweet supports the belief of man-made climate change

* 0 Neutral: the tweet neither supports nor refutes the belief of man-made climate change

* -1 Anti: the tweet does not believe in man-made climate change

In [15]:
tweets.shape

(15819, 3)

The shape of the data is (15819, 3). This means that there are 15819 tweets in the dataset, with 3 columns associated with each tweet.

## Data Pre-Processing is KEY


In NLP projects, the most important part is preprocessing the data, so that it is ready for a model to use. That’s what we are doing now.
Since there are unnecessary columns in the dataset, let’s extract and store the input and the output, which are the 'message' and 'sentiment' columns, using the following lines:

In [18]:
X = tweets['message']
y = tweets['sentiment']

## Why Do We Need Data Processing, and What Does It Involve?

Remember your input is just a bunch of words for now. Machine learning models only understand and work on numbers. So we have to convert all the text into numbers.

To remove unnecessary words, I am going to use the following techniques:

1. Removing Stop Words: Words like ‘this’, ‘an’, ‘a’, ‘the’, etc. that do not affect the meaning of the tweet
2. Removing Punctuation: (‘,.*!’) and other punctuation marks that are not really needed by the model
3. Stemming: Reducing words like ‘jumping, jumped, jump’ to their root word (also called the stem), which is ‘jump’ in this case. Since all variations of the root word convey the same meaning, we don’t need each variation to be converted into a different number.
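The three techniques above can be sketched on a single made-up sentence. This is only an illustration: `demo_stop_words` is a hypothetical mini list (the notebook uses NLTK's full English list), and punctuation is stripped here with `str.translate` rather than the regular expression used later in the notebook.

```python
import string
from nltk.stem import PorterStemmer

# Hypothetical mini stop-word list for illustration only; the notebook
# uses NLTK's full English list via stopwords.words('english').
demo_stop_words = {"the", "a", "an", "this", "is", "and"}

stemmer = PorterStemmer()

# Made-up sentence, not taken from the dataset.
sentence = "The cat is jumping, and this dog jumped!"

# 1. Strip punctuation characters.
no_punct = sentence.translate(str.maketrans("", "", string.punctuation))

# 2. Lowercase, split into words, and drop stop words.
kept = [w for w in no_punct.lower().split() if w not in demo_stop_words]

# 3. Reduce each remaining word to its stem.
stems = [stemmer.stem(w) for w in kept]

print(kept)   # ['cat', 'jumping', 'dog', 'jumped']
print(stems)  # ['cat', 'jump', 'dog', 'jump']
```

Both ‘jumping’ and ‘jumped’ collapse to ‘jump’, so the model later sees them as the same word.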

In [22]:
# Store the English stop words, the punctuation characters and
# the stemmer in separate variables.

stop_words = stopwords.words('english')
punct = string.punctuation
stemmer = PorterStemmer()


Now, for the actual data processing, we use the following lines of code.

In [23]:
cleaned_data=[]

for i in range(len(X)):
    tweet=re.sub('[^a-zA-Z]',' ',X.iloc[i])
    tweet=tweet.lower().split()
    tweet=[stemmer.stem(word) for word in tweet if (word not in stop_words) and (word not in punct)]
    tweet=' '.join(tweet)
    cleaned_data.append(tweet)

First, we create an empty list called cleaned_data, where we will store our text data after getting rid of all unnecessary words.

Earlier we imported a library called re (regular expressions), which helps us remove a lot of unnecessary characters.

To begin with, we run a for loop to iterate through the dataset one tweet at a time. Using re.sub we can substitute and replace things in a string. So we take one tweet at a time, and every character in it that is not a letter (a-z or A-Z) is replaced by a space. This filters out punctuation marks, digits and other non-letters.

In the next line, we convert all words to lowercase and split the tweet into a list of words.
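To see what these two steps do in isolation, here is a small sketch on a made-up tweet (the sample string is an assumption, not taken from the dataset). It uses the same pattern as the loop above.

```python
import re

# Made-up sample tweet, not taken from the dataset.
sample = "RT @user: It's 2016!"

# Every character that is not a letter becomes a single space.
letters_only = re.sub('[^a-zA-Z]', ' ', sample)

# Lowercase, then split on whitespace (runs of spaces collapse).
tokens = letters_only.lower().split()

print(tokens)  # ['rt', 'user', 'it', 's']
```

Note that the digits disappear entirely and that the apostrophe splits "It's" into two tokens, which is why fragments such as a lone 's' or 't' can show up before stop-word removal.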

Next, we iterate through each word in the tweet, and if the word is not a stop word, we stem it using stemmer.stem(word). The extra punctuation check in that line has no effect here, because re.sub has already replaced every non-letter with a space.

After that we join all the words using ' '.join(tweet), to get a single string instead of separate words. Then we simply append it to the cleaned_data list that we created. If you print the cleaned list, it should look something like this.



In [27]:
cleaned_data[:100]

['polyscimajor epa chief think carbon dioxid main caus global warm wait http co yelvcefxkc via mashabl',
 'like lack evid anthropogen global warm',
 'rt rawstori research say three year act climat chang late http co wdt kdur f http co z anpt',
 'todayinmak wire pivot year war climat chang http co wotxtlcd',
 'rt soynoviodetoda racist sexist climat chang deni bigot lead poll electionnight',
 'worth read whether believ climat chang http co gglzvnyjun http co afe mah j',
 'rt thenat mike penc believ global warm smoke caus lung cancer http co gvwyaauu r',
 'rt makeandmendlif six big thing today fight climat chang climat activist http co tymlu dbnm h',
 'aceofspadeshq yo nephew inconsol want die old age like perish fieri hellscap climat chang',
 'rt paigetweedi offens like believ global warm',
 'rt stephenschlegel think go die husband believ climat chang http co sjofon',
 'hope peopl vocal climat chang also power home renew energi like goodenergi',
 'rt tveitdal percent chanc avoid danger g

Here, climat and chang are the subject the tweets are about. We’ll remove them later, as they do not play a role in the sentiment of the tweet. You’ll also notice weird words like ‘tsqzmjwlh’ that do not make sense. Many of these look like leftover pieces of shortened URLs (the tokens that follow http co). Others appear because people often misspell words in tweets, and because the stemmer sometimes produces stems that are not real English words. That is something we will just roll with for now.

## Bringing in the Bag of Words Approach

Now that the input is clean and ready, we convert it into numbers using the ‘Bag of Words’ approach. We create a matrix where each row represents a tweet and each word has a separate column that holds its frequency in that tweet.

One drawback of this method that you might not notice is that the order of the words in a sentence is lost. There are other approaches that address this, but we are just going to stick with this method.



In [28]:
count_vector = CountVectorizer(max_features=3000,
                               stop_words=['chang', 'climat', 'http', 'co'])
X_fin = count_vector.fit_transform(cleaned_data).toarray()


CountVectorizer converts a list of documents into a bag of words. Notice that we pass it a parameter called max_features. As described above, each word gets its own column, so the number of columns can explode in big datasets.

To avoid this, we cap the number of columns at 3000, which keeps only the 3000 most frequent words. We also set the stop_words parameter to the stemmed subject words (climat, chang) and the URL fragments (http, co), as we want to remove those from the tweets as well.

Finally, count_vector.fit_transform takes cleaned_data and converts it into the bag of words that we wanted, and .toarray() turns the sparse result into a regular NumPy array.

Let’s have a look at the output column ‘y’

In [30]:
y.head()

0    1
1    1
2    2
3    1
4    1
Name: sentiment, dtype: int64

As we can see the output is already in numeric format, so no processing is necessary.


## Build the NLP models

I am going to use the Multinomial Naive Bayes model to learn the relationship between the input and the output. Multinomial NB is a supervised learning algorithm that often works well for text data represented as word counts.

To build the model, we split the dataset into a training set and a testing set (test size = 30% of the data). To actually fit the model, we call the model.fit function and supply it with the training input and output.
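Because train_test_split shuffles randomly, the score changes a little on every run unless a seed is fixed. The sketch below, on made-up toy counts, shows one way to make the split repeatable with random_state and to keep the class proportions similar in both parts with stratify. Neither argument is used in the notebook itself; both are optional suggestions, and the seed value 42 is arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Toy word-count matrix (10 "tweets", 3 words) and labels; illustrative only.
X_toy = np.array([[2, 0, 1], [1, 0, 0], [3, 1, 0], [2, 0, 0], [1, 1, 0],
                  [0, 2, 1], [0, 1, 2], [0, 3, 1], [1, 2, 2], [0, 1, 1]])
y_toy = np.array([1, 1, 1, 1, 1, -1, -1, -1, -1, -1])

# random_state fixes the shuffle; stratify keeps both classes in each part.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy)

toy_model = MultinomialNB()
toy_model.fit(X_tr, y_tr)
toy_pred = toy_model.predict(X_te)

print(X_tr.shape, X_te.shape)  # (7, 3) (3, 3)
```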

In [40]:
X_train,X_test,y_train,y_test=train_test_split(X_fin,y,test_size=0.3)

model = MultinomialNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

#c_report = classification_report(y_test, y_pred)

f1score = round(f1_score(y_test, y_pred, average= 'macro'),5)

In [46]:
#print(c_report)
print('macro f1 score:', f1score)

macro f1 score: 0.61959


The score is not great, as a macro F1 score greater than 0.7 is required. To improve it, there are several things we could try, for example using lemmatization instead of stemming, using a balanced dataset, and trying different machine learning models.
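It helps to remember that macro F1 is not the same thing as accuracy: it computes an F1 score per class and averages them with equal weight, so a class with few tweets counts as much as a class with many. A small worked example with made-up labels (not results from this notebook) shows the difference:

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels: class 0 is common, class 1 is rare.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1]

per_class = f1_score(y_true, y_pred, average=None)   # one F1 per class
macro = f1_score(y_true, y_pred, average='macro')    # unweighted mean of those
accuracy = accuracy_score(y_true, y_pred)

print(per_class.round(4))   # [0.8889 0.6667]
print(round(macro, 4))      # 0.7778
print(round(accuracy, 4))   # 0.8333
```

Missing one of the two rare-class examples pulls the macro F1 below the accuracy, which is why an imbalanced dataset can hold this score down.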