# Exploring NLP Pipeline (Part 1)
As mentioned in the lecture slides, an NLP pipeline consists of the following steps:
- Data acquisition
- Text extraction and cleaning
- Pre-processing
- Feature engineering
- Modelling
- Evaluation
- Deployment
- Monitoring and model updating

In this notebook we will walk through some of these steps using the Pandas, NLTK, string, contractions and scikit-learn libraries. You can open the cloud version of this notebook using the following link:
<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/Ali-Alameer/NLP/blob/main/week2_code.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
</table> 

## Tweets Sentiment Analysis
With so many tweets circulating every second, it is hard to tell whether a specific tweet will help a company's or a person's brand by going viral (positive) or hurt profits because it strikes a negative tone. Capturing sentiment in language matters when decisions and reactions are formed and updated within seconds. In this workshop, we will create an NLP pipeline to predict the sentiment of each tweet.


## Data acquisition

To do any type of NLP analysis, we need data to analyze. Twitter data can be collected using the Twitter API (https://developer.twitter.com/en/docs/twitter-api). The Twitter API is the official programmatic interface provided by Twitter. It gives developers access to the large volume of public data that millions of users share on Twitter daily.

Tweepy (https://www.tweepy.org/) is an easy-to-use Python library for accessing the Twitter API. Its API class provides access to the RESTful methods of the Twitter API. We will skip the data acquisition process in this workshop to keep it short. However, you can develop the process of extracting tweets from the Twitter API as an individual project for your portfolio.

## Data extraction

The second step in the NLP pipeline is extracting the text from its native form (such as PDF, image or HTML files).

Our dataset is a CSV (comma-separated values) file that contains tweet data. Each row contains the text of a tweet and a sentiment label. We will use the <b>Pandas</b> library to read the CSV file and load the data into a dataframe.

A <b>Pandas DataFrame</b> is a two-dimensional data structure, like a two-dimensional array or a table with rows and columns.
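As a small illustration of this table-like structure, the sketch below builds a DataFrame by hand. The two rows and the name `toy_df` are made up for this example and are not part of the workshop dataset.

```python
import pandas as pd

# a hypothetical two-row table with the same kind of columns as our dataset
toy_df = pd.DataFrame({
    "label": [0, 1],
    "tweet": ["what a lovely day", "this is awful"],
})

print(toy_df.shape)            # (number of rows, number of columns)
print(list(toy_df.columns))    # column names
print(toy_df.loc[1, "tweet"])  # the value at row 1, column 'tweet'
```

Each column can hold a different data type, which is why a dataframe suits a mix of numeric labels and free text.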

In [1]:
import pandas as pd

train_raw = pd.read_csv('./data/train_tweets.csv')

Let's check the loaded data by displaying the first 10 tweets in the dataset.

In [2]:
# print the first 10 rows of training data
train_raw.head(10)

Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is s...
1,2,0,@user @user thanks for #lyft credit i can't us...
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation
5,6,0,[2/2] huge fan fare and big talking before the...
6,7,0,@user camping tomorrow @user @user @user @use...
7,8,0,the next school year is the year for exams.ð...
8,9,0,we won!!! love the land!!! #allin #cavs #champ...
9,10,0,@user @user welcome here ! i'm it's so #gr...


To find out how the data is structured, we can print the shape attribute, which shows how many rows and columns the dataset contains.

In [3]:
train_raw.shape

(31962, 3)

The id column is not needed in our process, so we can drop it. We can also rearrange the columns so that the tweet text comes first and the sentiment label second.

In [4]:
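# keep only the tweet text and the label (this drops the id column),
# with the tweet text first; .copy() gives an independent dataframe so
# that adding columns later does not raise a SettingWithCopyWarning
train_df = train_raw[['tweet', 'label']].copy()
# rename 'label' to 'sentiment'
train_df.columns = ['tweet', 'sentiment']
train_df.head()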

Unnamed: 0,tweet,sentiment
0,@user when a father is dysfunctional and is s...,0
1,@user @user thanks for #lyft credit i can't us...,0
2,bihday your majesty,0
3,#model i love u take with u all the time in ...,0
4,factsguide: society now #motivation,0


We can count the number of positive and negative tweets using the value_counts() method of the sentiment column (a Pandas Series).

In [7]:
train_df.sentiment.value_counts()

0    29720
1     2242
Name: sentiment, dtype: int64

The dataset description indicates that:
- <b>0</b> ==> <b>positive sentiments</b>
- <b>1</b> ==> <b>negative sentiments</b>

According to the result of the previous cell, there are 29,720 positive tweets and 2,242 negative tweets in the training dataset. The training dataset is therefore <b>imbalanced</b>: the positive class has roughly 13 times as many examples as the negative class.
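To put a number on the imbalance, the sketch below rebuilds a label column from the two counts reported above and computes each class's share. The names `labels`, `shares` and `counts` are introduced here only for illustration; on the real data you would call the same methods on the sentiment column.

```python
import pandas as pd

# rebuild a label column with the class counts reported above
labels = pd.Series([0] * 29720 + [1] * 2242, name="sentiment")

# normalize=True returns the share of each class instead of raw counts
shares = labels.value_counts(normalize=True)
print(shares.round(3))

# ratio of the majority class to the minority class
counts = labels.value_counts()
print(round(counts.loc[0] / counts.loc[1], 1))
```

Only about 7% of the tweets belong to the negative class, which is worth keeping in mind when choosing evaluation metrics later in the pipeline.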

A Python dictionary is an appropriate data structure for mapping the numeric labels to sentiment names.

In [8]:
# define a dictionary to map numbers to corresponding sentiments
# (named sentiment_map to avoid shadowing Python's built-in map function)
sentiment_map = {0: 'positive', 1: 'negative'}
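A dictionary like the one above can be applied to a whole column with `Series.map`. The following is a minimal sketch on a made-up three-value column; `sentiment`, `sentiment_names` and `readable` are hypothetical names used only here.

```python
import pandas as pd

# hypothetical stand-in for the sentiment column
sentiment = pd.Series([0, 1, 0])
sentiment_names = {0: "positive", 1: "negative"}

# Series.map looks up every value in the dictionary
readable = sentiment.map(sentiment_names)
print(readable.tolist())
```

Values missing from the dictionary would become NaN, so it is worth checking that every label has an entry.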

## Text cleaning & pre-processing

Why do we need to clean and pre-process text?

- <b>Extracting plain text</b>: Textual data can come from a wide variety of sources: the web, PDFs, Word documents, speech recognition systems, book scans, etc. Your goal is to extract plain text that is free of any source-specific markup or constructs that are not relevant to your task.
- <b>Reducing complexity</b>: Some features of our language, like capitalization, punctuation, and common words such as a, of, and the, often help provide structure but don't add much meaning. Sometimes it's best to remove them if that helps reduce the complexity of the procedures you want to apply later.


To clean the tweet text, we will first create a function that lowercases the text, expands contractions, removes text enclosed in square or angle brackets, removes links, removes punctuation, removes words containing numbers, and removes newline characters.


In [9]:
import re
from string import punctuation
import contractions

def clean_text(text):
    # make text lowercase
    text = str(text).lower()
    # expand contractions
    text = " ".join([contractions.fix(token) for token in str(text).split()])
    # remove text enclosed in square brackets
    text = re.sub(r"\[.*?\]", "", text)
    # remove text in angle brackets
    text = re.sub(r"<.*?>", "", text)
    # remove links (raw strings keep the regex backslashes intact)
    text = re.sub(r"https?://\S+|www\.\S+", "", text)
    # remove punctuation
    text = re.sub(r"[%s]" % re.escape(punctuation), "", text)
    # remove words containing numbers
    text = re.sub(r"\w*\d\w*", "", text)
    # remove new lines
    text = re.sub(r"\n", "", text)
    return text
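To see what the individual regular expressions do, the sketch below applies the same patterns one at a time to a made-up tweet. The contraction step is left out on purpose so that the example needs only the standard library; the sample text and the variable `step` are assumptions for illustration.

```python
import re
from string import punctuation

# a made-up tweet containing a bracketed note, a link and a number
sample = "Check [2/2] this out: https://example.com it's GR8!!!"

step = sample.lower()
step = re.sub(r"\[.*?\]", "", step)                        # drop "[2/2]"
step = re.sub(r"https?://\S+|www\.\S+", "", step)          # drop the link
step = re.sub(r"[%s]" % re.escape(punctuation), "", step)  # drop punctuation
step = re.sub(r"\w*\d\w*", "", step)                       # drop "gr8"
print(step.split())
```

Without the contraction step, "it's" collapses to "its" once the apostrophe is stripped, which is one reason the function above expands contractions before removing punctuation.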

In [10]:
# apply the clean_text function to each tweet in the training dataset
train_df['clean_tweet'] = train_df['tweet'].apply(clean_text)
train_df.head()

Unnamed: 0,tweet,sentiment,clean_tweet
0,@user when a father is dysfunctional and is s...,0,user when a father is dysfunctional and is so ...
1,@user @user thanks for #lyft credit i can't us...,0,user user thanks for lyft credit i cannot use ...
2,bihday your majesty,0,bihday your majesty
3,#model i love u take with u all the time in ...,0,model i love you take with you all the time in...
4,factsguide: society now #motivation,0,factsguide society now motivation


### <font color='blue'>Exercise</font>

Complete the following code to create a column named "no_sentences" containing the number of sentences for each tweet.


In [11]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\SES100\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
from nltk.tokenize import sent_tokenize

# calculate the number of sentences for each tweet
train_raw['no_sentences']  = ?

train_raw.head()


### Word tokenization

Now we can tokenize tweets into words and extract a list of words for each tweet. We can use the NLTK word tokenizer.

In [12]:
from nltk.tokenize import word_tokenize

train_df['word_list'] = train_df['clean_tweet'].apply(word_tokenize)
train_df.head()

Unnamed: 0,tweet,sentiment,clean_tweet,word_list
0,@user when a father is dysfunctional and is s...,0,user when a father is dysfunctional and is so ...,"[user, when, a, father, is, dysfunctional, and..."
1,@user @user thanks for #lyft credit i can't us...,0,user user thanks for lyft credit i cannot use ...,"[user, user, thanks, for, lyft, credit, i, can..."
2,bihday your majesty,0,bihday your majesty,"[bihday, your, majesty]"
3,#model i love u take with u all the time in ...,0,model i love you take with you all the time in...,"[model, i, love, you, take, with, you, all, th..."
4,factsguide: society now #motivation,0,factsguide society now motivation,"[factsguide, society, now, motivation]"


### Finding the most common words in tweets text

Before removing stop words, it is worth looking at the tweets' word lists and extracting the most common words in the tweet texts. This step will help us understand why we need to remove stop words from the word lists.

Python's "collections" module provides a class designed to count many different objects in one go. This class is conveniently called <b>Counter</b>. We use the Counter class to count how often each word occurs in the word list column, and then we store the result in a new dataframe.
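Before running this on the full dataset, here is a minimal sketch of the same idea on two made-up word lists. The names `word_lists` and `counts` are hypothetical and used only in this example.

```python
from collections import Counter

# hypothetical word lists, one per tweet
word_lists = [["love", "the", "day"], ["the", "day", "the"]]

# flatten the nested lists and count every word
counts = Counter(word for words in word_lists for word in words)
print(counts.most_common(2))
```

`most_common(n)` returns the n most frequent items as (word, count) pairs, which is a convenient shape for building a dataframe.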

In [14]:
from collections import Counter

top = Counter([word for word_list in train_df['word_list'] for word in word_list])
temp_df = pd.DataFrame(top.most_common(20), columns=['word', 'count'])
temp_df.style.background_gradient(cmap='Blues', axis=0)

Unnamed: 0,word,count
0,user,17474
1,the,10156
2,to,10089
3,you,7510
4,i,7288
5,a,6416
6,is,6108
7,and,4871
8,in,4638
9,for,4479


### Stop words removal

As you can see, many of the most commonly used words are not useful for identifying tweet sentiment. They belong to the stop words list and should be removed from the tweets' word lists.

In [15]:
from nltk.corpus import stopwords

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\SES100\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [16]:
# build the stop word set once instead of reloading the list for every word
stop_words = set(stopwords.words('english'))

def remove_stopword(word_list):
    return [word for word in word_list if word not in stop_words]

train_df['word_list_without_sw'] = train_df['word_list'].apply(remove_stopword)

In [17]:
train_df.head()

Unnamed: 0,tweet,sentiment,clean_tweet,word_list,word_list_without_sw
0,@user when a father is dysfunctional and is s...,0,user when a father is dysfunctional and is so ...,"[user, when, a, father, is, dysfunctional, and...","[user, father, dysfunctional, selfish, drags, ..."
1,@user @user thanks for #lyft credit i can't us...,0,user user thanks for lyft credit i cannot use ...,"[user, user, thanks, for, lyft, credit, i, can...","[user, user, thanks, lyft, credit, use, offer,..."
2,bihday your majesty,0,bihday your majesty,"[bihday, your, majesty]","[bihday, majesty]"
3,#model i love u take with u all the time in ...,0,model i love you take with you all the time in...,"[model, i, love, you, take, with, you, all, th...","[model, love, take, time, areð±, ððð..."
4,factsguide: society now #motivation,0,factsguide society now motivation,"[factsguide, society, now, motivation]","[factsguide, society, motivation]"


Let's check the most common words in the tweets after removing all stop words.

In [18]:
top = Counter([word for word_list in train_df['word_list_without_sw'] for word in word_list])
temp_df = pd.DataFrame(top.most_common(20), columns=['word', 'count'])
temp_df.style.background_gradient(cmap='Purples', axis=0)

Unnamed: 0,word,count
0,user,17474
1,love,2668
2,day,2198
3,happy,1663
4,amp,1582
5,time,1110
6,life,1086
7,like,1042
8,â¦,1004
9,today,991
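
Even after stop word removal, tokens such as `user` (left over from @user mentions) and `amp` (possibly a remnant of the HTML entity `&amp;`) still rank near the top. One possible follow-up, shown here only as a sketch and not as part of the workshop pipeline, is to extend the stop word set with such dataset-specific tokens. The tiny `base_stop_words` set and the names `extra_tokens` and `custom_stop_words` are assumptions for illustration.

```python
# a tiny hand-written stand-in for stopwords.words('english')
base_stop_words = {"the", "a", "is", "and", "for"}

# dataset-specific tokens assumed to carry no sentiment
extra_tokens = {"user", "amp"}
custom_stop_words = base_stop_words | extra_tokens

words = ["user", "love", "the", "day", "amp", "happy"]
filtered = [w for w in words if w not in custom_stop_words]
print(filtered)
```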


### Most common words by sentiment

As a result of this process, we can now see some meaningful words among the most common words. Because our dataset contains more positive tweets, positive words make up a larger proportion. We can check the most common words in the negative and positive tweets separately. In the following cells, we create a separate dataframe for each sentiment and repeat the above process.

In [22]:
# create separate dataframes for each sentiment
Positive_sent = train_df[train_df['sentiment'] == 0]
Negative_sent = train_df[train_df['sentiment'] == 1]

In [23]:
# Most common positive words
top = Counter([word for word_list in Positive_sent['word_list_without_sw'] for word in word_list])
temp_df = pd.DataFrame(top.most_common(20), columns=['word', 'count'])
temp_df.style.background_gradient(cmap='Greens', axis=0)

Unnamed: 0,word,count
0,user,15614
1,love,2643
2,day,2190
3,happy,1651
4,amp,1314
5,time,1088
6,life,1080
7,today,979
8,positive,925
9,thankful,919


In [24]:
# Most common negative words
top = Counter([word for word_list in Negative_sent['word_list_without_sw'] for word in word_list])
temp_df = pd.DataFrame(top.most_common(20), columns=['word', 'count'])
temp_df.style.background_gradient(cmap='Reds', axis=0)

Unnamed: 0,word,count
0,user,1860
1,amp,268
2,trump,197
3,â¦,181
4,libtard,149
5,like,137
6,white,137
7,black,131
8,racist,102
9,people,99


### Lemmatization

Both stemming and lemmatization convert a word to its base form. Stemming is a fast, rule-based technique that sometimes chops words inaccurately (under-stemming and over-stemming). NLTK provides PorterStemmer and the slightly improved SnowballStemmer.

Lemmatization is a dictionary-based technique that is more accurate but slightly slower than stemming. We will use WordNetLemmatizer from NLTK, which requires downloading the wordnet resource.
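For contrast with the lemmatizer used in the next cell, the sketch below runs NLTK's PorterStemmer on a few hand-picked words. The word list and the name `stems` are chosen only for illustration, and the exact stems may vary slightly between NLTK versions.

```python
from nltk.stem import PorterStemmer

# PorterStemmer is rule based and needs no extra NLTK downloads
stemmer = PorterStemmer()

words = ["running", "studies", "generous", "general"]
stems = {word: stemmer.stem(word) for word in words}
print(stems)
```

Stems need not be valid dictionary words (for example, a stem such as `studi`), whereas a lemmatizer aims to return a real word form.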

In [25]:
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
train_df['word_list_without_sw'] = train_df['word_list_without_sw'].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])
train_df.head()

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\SES100\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Unnamed: 0,tweet,sentiment,clean_tweet,word_list,word_list_without_sw
0,@user when a father is dysfunctional and is s...,0,user when a father is dysfunctional and is so ...,"[user, when, a, father, is, dysfunctional, and...","[user, father, dysfunctional, selfish, drag, k..."
1,@user @user thanks for #lyft credit i can't us...,0,user user thanks for lyft credit i cannot use ...,"[user, user, thanks, for, lyft, credit, i, can...","[user, user, thanks, lyft, credit, use, offer,..."
2,bihday your majesty,0,bihday your majesty,"[bihday, your, majesty]","[bihday, majesty]"
3,#model i love u take with u all the time in ...,0,model i love you take with you all the time in...,"[model, i, love, you, take, with, you, all, th...","[model, love, take, time, areð±, ððð..."
4,factsguide: society now #motivation,0,factsguide society now motivation,"[factsguide, society, now, motivation]","[factsguide, society, motivation]"


### Final pre-processing

Let's concatenate all the words in the last column of the dataframe to create a cleaned version of the tweet text.

In [26]:
train_df['final_tweet'] = train_df['word_list_without_sw'].apply(lambda x:' '.join(x))
train_df.head()

Unnamed: 0,tweet,sentiment,clean_tweet,word_list,word_list_without_sw,final_tweet
0,@user when a father is dysfunctional and is s...,0,user when a father is dysfunctional and is so ...,"[user, when, a, father, is, dysfunctional, and...","[user, father, dysfunctional, selfish, drag, k...",user father dysfunctional selfish drag kid dys...
1,@user @user thanks for #lyft credit i can't us...,0,user user thanks for lyft credit i cannot use ...,"[user, user, thanks, for, lyft, credit, i, can...","[user, user, thanks, lyft, credit, use, offer,...",user user thanks lyft credit use offer wheelch...
2,bihday your majesty,0,bihday your majesty,"[bihday, your, majesty]","[bihday, majesty]",bihday majesty
3,#model i love u take with u all the time in ...,0,model i love you take with you all the time in...,"[model, i, love, you, take, with, you, all, th...","[model, love, take, time, areð±, ððð...",model love take time areð± ðððð ð...
4,factsguide: society now #motivation,0,factsguide society now motivation,"[factsguide, society, now, motivation]","[factsguide, society, motivation]",factsguide society motivation


### <font color='blue'>Exercise</font>

Next week's workshop will continue the process by adding new steps to the current pipeline. First, however, we need to save the result of today's workshop in a CSV file. Please search the internet and find the proper code to save the train dataframe as a CSV file in the current folder.

In [None]:
# Your code goes here:
