## Exploring NLP Pipeline (Part 1)

Here are the 8 stages of the NLP pipeline:

- Data Acquisition
- Text Extraction & Cleaning
- Pre-Processing
- Feature Engineering
- Modelling
- Evaluation
- Deployment
- Monitoring and Model Updating

In [63]:
import nltk
import pandas as pd
import re
import contractions
from nltk.corpus import stopwords
from itertools import chain
from collections import Counter
from string import punctuation
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize, sent_tokenize

## Twitter Sentiment Analysis
With so many tweets posted every second, it is hard to tell whether the sentiment behind a specific tweet will help a company's or a person's brand by going viral (positive) or hurt profits because it strikes a negative tone. Capturing sentiment in language matters when decisions and reactions are formed and updated in seconds. In this workshop, we'll create an NLP pipeline to predict the sentiment of each tweet.

## Data acquisition
In order to do any type of NLP analysis, one requires data to analyze. Twitter data can be collected using the Twitter API (https://developer.twitter.com/en/docs/twitter-api). The Twitter API is the official programmatic interface provided by Twitter. It allows developers to access the enormous amount of public data that millions of Twitter users share daily.

Tweepy (https://www.tweepy.org/) is an easy-to-use Python library for accessing the Twitter API. Its API class provides access to the RESTful methods of the Twitter API. We will skip the data acquisition process in this workshop to keep it short. However, you can build a process for extracting tweets from the Twitter API as an individual project for your portfolio.

## Data extraction
The second step in the NLP pipeline is extracting the text from its native form (such as PDF, image, or HTML files).

Our dataset is a CSV (Comma Separated Values) file that contains tweet data. Each row contains the text of a tweet and a sentiment label. We will use the Pandas library to read the CSV file and load the data into a dataframe.

A Pandas DataFrame is a two-dimensional data structure, like a two-dimensional array or a table with rows and columns.

In [64]:
file_path = "/Users/Odera Nnaji/Downloads/Uni Note/NLP/week 2/train_tweets.csv"
file = pd.read_csv(file_path)

In [65]:
file.head()

Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is s...
1,2,0,@user @user thanks for #lyft credit i can't us...
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation


In [66]:
file.columns

Index(['id', 'label', 'tweet'], dtype='object')

In [67]:
file.isnull().sum()

id       0
label    0
tweet    0
dtype: int64

In [68]:
len(file)

31962

In [69]:
file.describe()

Unnamed: 0,id,label
count,31962.0,31962.0
mean,15981.5,0.070146
std,9226.778988,0.255397
min,1.0,0.0
25%,7991.25,0.0
50%,15981.5,0.0
75%,23971.75,0.0
max,31962.0,1.0


In [70]:
file.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31962 entries, 0 to 31961
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      31962 non-null  int64 
 1   label   31962 non-null  int64 
 2   tweet   31962 non-null  object
dtypes: int64(2), object(1)
memory usage: 749.2+ KB


In [71]:
train_df = file[['tweet', 'label']]
train_df.columns = ['tweet', 'sentiment']

We can count the number of positive and negative tweets using the value_counts() method of the sentiment column (a Pandas Series).

In [72]:
train_df.sentiment.value_counts()

sentiment
0    29720
1     2242
Name: count, dtype: int64

In [73]:
train_df.head()

Unnamed: 0,tweet,sentiment
0,@user when a father is dysfunctional and is s...,0
1,@user @user thanks for #lyft credit i can't us...,0
2,bihday your majesty,0
3,#model i love u take with u all the time in ...,0
4,factsguide: society now #motivation,0


The dataset description indicates that:

- 0 ==> positive sentiments
- 1 ==> negative sentiments

According to the result of the previous cell, there are 29,720 positive tweets and 2,242 negative tweets in the training dataset. The training dataset is therefore imbalanced: roughly 93% of the tweets belong to the positive class and only about 7% to the negative class.
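
The size of the imbalance can be worked out directly from the two counts. The following is a minimal sketch using only the numbers printed by value_counts(); the variable names are illustrative and not part of the original notebook.

```python
# Class counts copied from the value_counts() output of the notebook
counts = {"Positive": 29720, "Negative": 2242}

total = sum(counts.values())

# Share of each class in the training data
shares = {label: n / total for label, n in counts.items()}

# How many positive tweets there are for every negative tweet
imbalance_ratio = counts["Positive"] / counts["Negative"]

# Accuracy of a trivial model that always predicts the majority class
majority_baseline_accuracy = max(counts.values()) / total

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"positive-to-negative ratio: {imbalance_ratio:.1f} : 1")
print(f"majority-class baseline accuracy: {majority_baseline_accuracy:.1%}")
```

A model that always answers "Positive" would already be right for about 93% of the tweets, so accuracy alone says little on this dataset; per-class metrics are usually more informative when the classes are this uneven.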

To map the numeric labels to readable sentiment names, a Python dictionary is an appropriate data structure.

In [74]:
# lookup table from numeric label to sentiment name
sentiment_map = {0: 'Positive', 1: 'Negative'}
train_df = file[['tweet', 'label']].copy()
train_df.columns = ['tweet', 'sentiment']
train_df['sentiment'] = train_df['sentiment'].map(sentiment_map)

In [75]:
train_df.head(25)

Unnamed: 0,tweet,sentiment
0,@user when a father is dysfunctional and is s...,Positive
1,@user @user thanks for #lyft credit i can't us...,Positive
2,bihday your majesty,Positive
3,#model i love u take with u all the time in ...,Positive
4,factsguide: society now #motivation,Positive
5,[2/2] huge fan fare and big talking before the...,Positive
6,@user camping tomorrow @user @user @user @use...,Positive
7,the next school year is the year for exams.ð...,Positive
8,we won!!! love the land!!! #allin #cavs #champ...,Positive
9,@user @user welcome here ! i'm it's so #gr...,Positive


## Text cleaning & pre-processing
Why do we need to clean and pre-process text?

- Extracting plain text: Textual data can come from a wide variety of sources: the web, PDFs, word documents, speech recognition systems, book scans, etc. Your goal is to extract plain text that is free of any source specific markup or constructs that are not relevant to your task.

- Reducing complexity: Some features of our language like capitalization, punctuation, and common words such as a, of, and the, often help provide structure, but don't add much meaning. Sometimes it's best to remove them if that helps reduce the complexity of the procedures you want to apply later.

In order to clean the text of tweets, we will first create a function that lowercases the text, expands contractions, removes text enclosed in square brackets, removes links and HTML tags, removes punctuation, removes newlines, and removes words containing numbers.

In [76]:
def clean_text(text):
    # make text lowercase    
    text = str(text).lower()
    # expand contractions
    text = " ".join([contractions.fix(expanded_word) for expanded_word in text.split()])
    # remove text in square brackets
    text = re.sub(r'\[.*?\]', '', text)
    # remove links
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    # remove HTML tags
    text = re.sub(r'<.*?>+', '', text)
    # remove punctuation
    text = re.sub(r'[%s]' % re.escape(punctuation), '', text)
    # remove new lines
    text = re.sub(r'\n', '', text)
    # remove words containing numbers
    text = re.sub(r'\w*\d\w*', '', text)
    return text
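
To see what the individual steps do, here is a minimal sketch that applies the same regular expressions to a made-up tweet. The function name clean_text_no_contractions and the sample string are hypothetical, and the contractions step is deliberately left out so that the block runs with the standard library only.

```python
import re
from string import punctuation

def clean_text_no_contractions(text):
    """Same regex steps as clean_text, minus the contractions expansion (assumption)."""
    text = str(text).lower()                                    # lowercase
    text = re.sub(r'\[.*?\]', '', text)                         # text in square brackets
    text = re.sub(r'https?://\S+|www\.\S+', '', text)           # links
    text = re.sub(r'<.*?>+', '', text)                          # HTML tags
    text = re.sub(r'[%s]' % re.escape(punctuation), '', text)   # punctuation
    text = re.sub(r'\n', '', text)                              # newlines
    text = re.sub(r'\w*\d\w*', '', text)                        # words containing digits
    return text

# Made-up example tweet, not taken from the dataset
sample = "[1/2] Check https://example.com NOW!!! #win2024 @user <b>great</b> day"
cleaned = clean_text_no_contractions(sample)

print(repr(cleaned))
print(cleaned.split())  # ['check', 'now', 'user', 'great', 'day']
```

Two things are worth noticing. Removed pieces leave runs of spaces behind, which is harmless because the later tokenization step splits on whitespace. And since only the "@" and "#" symbols are stripped, a mention such as "@user" survives as the plain word "user".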

In [77]:
train_df.head(20)

Unnamed: 0,tweet,sentiment
0,@user when a father is dysfunctional and is s...,Positive
1,@user @user thanks for #lyft credit i can't us...,Positive
2,bihday your majesty,Positive
3,#model i love u take with u all the time in ...,Positive
4,factsguide: society now #motivation,Positive
5,[2/2] huge fan fare and big talking before the...,Positive
6,@user camping tomorrow @user @user @user @use...,Positive
7,the next school year is the year for exams.ð...,Positive
8,we won!!! love the land!!! #allin #cavs #champ...,Positive
9,@user @user welcome here ! i'm it's so #gr...,Positive


In [78]:
train_df['clean_tweet'] = train_df['tweet'].apply(clean_text)

In [79]:
train_df.head(20)

Unnamed: 0,tweet,sentiment,clean_tweet
0,@user when a father is dysfunctional and is s...,Positive,user when a father is dysfunctional and is so ...
1,@user @user thanks for #lyft credit i can't us...,Positive,user user thanks for lyft credit i cannot use ...
2,bihday your majesty,Positive,bihday your majesty
3,#model i love u take with u all the time in ...,Positive,model i love you take with you all the time in...
4,factsguide: society now #motivation,Positive,factsguide society now motivation
5,[2/2] huge fan fare and big talking before the...,Positive,huge fan fare and big talking before they lea...
6,@user camping tomorrow @user @user @user @use...,Positive,user camping tomorrow user user user user user...
7,the next school year is the year for exams.ð...,Positive,the next school year is the year for examsð¯...
8,we won!!! love the land!!! #allin #cavs #champ...,Positive,we won love the land allin cavs champions clev...
9,@user @user welcome here ! i'm it's so #gr...,Positive,user user welcome here i am it is so


OR

In [80]:
train_df['clean_tweet'] = train_df['tweet'].apply(lambda x: clean_text(x))

In [81]:
train_df.head(20)

Unnamed: 0,tweet,sentiment,clean_tweet
0,@user when a father is dysfunctional and is s...,Positive,user when a father is dysfunctional and is so ...
1,@user @user thanks for #lyft credit i can't us...,Positive,user user thanks for lyft credit i cannot use ...
2,bihday your majesty,Positive,bihday your majesty
3,#model i love u take with u all the time in ...,Positive,model i love you take with you all the time in...
4,factsguide: society now #motivation,Positive,factsguide society now motivation
5,[2/2] huge fan fare and big talking before the...,Positive,huge fan fare and big talking before they lea...
6,@user camping tomorrow @user @user @user @use...,Positive,user camping tomorrow user user user user user...
7,the next school year is the year for exams.ð...,Positive,the next school year is the year for examsð¯...
8,we won!!! love the land!!! #allin #cavs #champ...,Positive,we won love the land allin cavs champions clev...
9,@user @user welcome here ! i'm it's so #gr...,Positive,user user welcome here i am it is so


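The following code creates a column named "no_sentence" containing the number of sentences in each tweet. The count is computed on the raw tweet text, because the cleaned text no longer contains the punctuation that the sentence tokenizer relies on.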

In [82]:
def len_sentence(text):
    return len(sent_tokenize(text))

In [83]:
train_df['no_sentence'] = train_df['tweet'].apply(len_sentence)

In [84]:
train_df.head()

Unnamed: 0,tweet,sentiment,clean_tweet,no_sentence
0,@user when a father is dysfunctional and is s...,Positive,user when a father is dysfunctional and is so ...,2
1,@user @user thanks for #lyft credit i can't us...,Positive,user user thanks for lyft credit i cannot use ...,2
2,bihday your majesty,Positive,bihday your majesty,1
3,#model i love u take with u all the time in ...,Positive,model i love you take with you all the time in...,2
4,factsguide: society now #motivation,Positive,factsguide society now motivation,1


OR

In [85]:
train_df['no_sentence'] = train_df['tweet'].apply(lambda x: len(sent_tokenize(x)))

In [86]:
train_df.head()

Unnamed: 0,tweet,sentiment,clean_tweet,no_sentence
0,@user when a father is dysfunctional and is s...,Positive,user when a father is dysfunctional and is so ...,2
1,@user @user thanks for #lyft credit i can't us...,Positive,user user thanks for lyft credit i cannot use ...,2
2,bihday your majesty,Positive,bihday your majesty,1
3,#model i love u take with u all the time in ...,Positive,model i love you take with you all the time in...,2
4,factsguide: society now #motivation,Positive,factsguide society now motivation,1


## Word tokenization
Now we can tokenize tweets into words and extract a list of words for each tweet. We can use the NLTK word tokenizer (word_tokenize) on the cleaned text.

In [87]:
def word_token(text):
    return word_tokenize(text)

In [88]:
train_df['word_list'] = train_df['clean_tweet'].apply(word_token)

In [89]:
train_df.head(20)

Unnamed: 0,tweet,sentiment,clean_tweet,no_sentence,word_list
0,@user when a father is dysfunctional and is s...,Positive,user when a father is dysfunctional and is so ...,2,"[user, when, a, father, is, dysfunctional, and..."
1,@user @user thanks for #lyft credit i can't us...,Positive,user user thanks for lyft credit i cannot use ...,2,"[user, user, thanks, for, lyft, credit, i, can..."
2,bihday your majesty,Positive,bihday your majesty,1,"[bihday, your, majesty]"
3,#model i love u take with u all the time in ...,Positive,model i love you take with you all the time in...,2,"[model, i, love, you, take, with, you, all, th..."
4,factsguide: society now #motivation,Positive,factsguide society now motivation,1,"[factsguide, society, now, motivation]"
5,[2/2] huge fan fare and big talking before the...,Positive,huge fan fare and big talking before they lea...,3,"[huge, fan, fare, and, big, talking, before, t..."
6,@user camping tomorrow @user @user @user @use...,Positive,user camping tomorrow user user user user user...,1,"[user, camping, tomorrow, user, user, user, us..."
7,the next school year is the year for exams.ð...,Positive,the next school year is the year for examsð¯...,1,"[the, next, school, year, is, the, year, for, ..."
8,we won!!! love the land!!! #allin #cavs #champ...,Positive,we won love the land allin cavs champions clev...,3,"[we, won, love, the, land, allin, cavs, champi..."
9,@user @user welcome here ! i'm it's so #gr...,Positive,user user welcome here i am it is so,2,"[user, user, welcome, here, i, am, it, is, so]"


OR

In [90]:
train_df['word_list'] = train_df['clean_tweet'].apply(lambda x: word_tokenize(x))

In [91]:
train_df.head()

Unnamed: 0,tweet,sentiment,clean_tweet,no_sentence,word_list
0,@user when a father is dysfunctional and is s...,Positive,user when a father is dysfunctional and is so ...,2,"[user, when, a, father, is, dysfunctional, and..."
1,@user @user thanks for #lyft credit i can't us...,Positive,user user thanks for lyft credit i cannot use ...,2,"[user, user, thanks, for, lyft, credit, i, can..."
2,bihday your majesty,Positive,bihday your majesty,1,"[bihday, your, majesty]"
3,#model i love u take with u all the time in ...,Positive,model i love you take with you all the time in...,2,"[model, i, love, you, take, with, you, all, th..."
4,factsguide: society now #motivation,Positive,factsguide society now motivation,1,"[factsguide, society, now, motivation]"


## Finding the most common words in tweet text
Before removing stop words, it is worth looking at the tweets' word lists and extracting the most common words in the tweet texts. This step will help us understand why we need to remove stop words from the word lists.

In the "collections" module of Python, you'll find a class specially designed to count several different objects in one go. This class is conveniently called Counter. We flatten the per-tweet word lists into a single stream of words, use the Counter class to count how often each word occurs in the word list column, and then store the result in a new dataframe.
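
The flatten-then-count idea can be tried on a tiny example first. This is a minimal sketch with made-up word lists; only chain.from_iterable and Counter come from the notebook, the data and variable names are illustrative.

```python
from collections import Counter
from itertools import chain

# Made-up word lists standing in for the 'word_list' column
word_lists = [
    ["user", "love", "the", "day"],
    ["the", "day", "is", "happy"],
    ["user", "the", "love"],
]

# chain.from_iterable lazily walks every inner list as one flat stream of words
flattened = chain.from_iterable(word_lists)

# Counter tallies each distinct word in a single pass
counts = Counter(flattened)

# most_common(n) returns the n highest counts as (word, count) pairs
top = counts.most_common(2)
print(top)
```

Note that chain.from_iterable returns an iterator that is consumed once, which is why the notebook builds a fresh one each time it counts.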

In [92]:
flattened_word = chain.from_iterable(train_df['word_list'])
top_word = Counter(flattened_word)
temp_df = pd.DataFrame(top_word.most_common(20), columns = ['Words', 'Counts'])
temp_df.style.background_gradient(cmap = 'gist_earth')

Unnamed: 0,Words,Counts
0,user,17474
1,the,10156
2,to,10089
3,you,7510
4,i,7288
5,a,6416
6,is,6108
7,and,4871
8,in,4638
9,for,4479


## Stop words removal

As you can see, many of the most commonly used words are not useful for identifying tweet sentiment. They belong to the stop words list and should be removed from each tweet's word list.

In [93]:
stopword = set(stopwords.words('english'))

In [94]:
def remove_stopwords(word_list):

    return [word for word in word_list if word not in stopword]
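
As a quick check of the filtering logic, here is a minimal sketch that uses a small hand-picked stop-word set instead of NLTK's list. The set toy_stopwords and the function name remove_stopwords_sketch are hypothetical; the token list mirrors one of the tweets shown in the outputs.

```python
# Hypothetical, hand-picked subset of English stop words (NLTK's list is much longer)
toy_stopwords = {"a", "am", "and", "here", "i", "is", "it", "so", "the"}

def remove_stopwords_sketch(word_list, stop_set):
    """Return the tokens of word_list that are not in stop_set, preserving order."""
    return [word for word in word_list if word not in stop_set]

tokens = ["user", "user", "welcome", "here", "i", "am", "it", "is", "so"]
filtered = remove_stopwords_sketch(tokens, toy_stopwords)

print(filtered)  # ['user', 'user', 'welcome']
```

Storing the stop words in a set rather than a list keeps each membership test fast, which matters when the check runs once per token over tens of thousands of tweets.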

In [95]:
train_df['word_list_without_sw'] = train_df['word_list'].apply(remove_stopwords)

In [96]:
train_df.head(20)

Unnamed: 0,tweet,sentiment,clean_tweet,no_sentence,word_list,word_list_without_sw
0,@user when a father is dysfunctional and is s...,Positive,user when a father is dysfunctional and is so ...,2,"[user, when, a, father, is, dysfunctional, and...","[user, father, dysfunctional, selfish, drags, ..."
1,@user @user thanks for #lyft credit i can't us...,Positive,user user thanks for lyft credit i cannot use ...,2,"[user, user, thanks, for, lyft, credit, i, can...","[user, user, thanks, lyft, credit, use, offer,..."
2,bihday your majesty,Positive,bihday your majesty,1,"[bihday, your, majesty]","[bihday, majesty]"
3,#model i love u take with u all the time in ...,Positive,model i love you take with you all the time in...,2,"[model, i, love, you, take, with, you, all, th...","[model, love, take, time, areð±, ððð..."
4,factsguide: society now #motivation,Positive,factsguide society now motivation,1,"[factsguide, society, now, motivation]","[factsguide, society, motivation]"
5,[2/2] huge fan fare and big talking before the...,Positive,huge fan fare and big talking before they lea...,3,"[huge, fan, fare, and, big, talking, before, t...","[huge, fan, fare, big, talking, leave, chaos, ..."
6,@user camping tomorrow @user @user @user @use...,Positive,user camping tomorrow user user user user user...,1,"[user, camping, tomorrow, user, user, user, us...","[user, camping, tomorrow, user, user, user, us..."
7,the next school year is the year for exams.ð...,Positive,the next school year is the year for examsð¯...,1,"[the, next, school, year, is, the, year, for, ...","[next, school, year, year, examsð¯, think, ð..."
8,we won!!! love the land!!! #allin #cavs #champ...,Positive,we won love the land allin cavs champions clev...,3,"[we, won, love, the, land, allin, cavs, champi...","[love, land, allin, cavs, champions, cleveland..."
9,@user @user welcome here ! i'm it's so #gr...,Positive,user user welcome here i am it is so,2,"[user, user, welcome, here, i, am, it, is, so]","[user, user, welcome]"


OR

In [97]:
train_df['word_list_without_sw'] = train_df['word_list'].apply(lambda x: remove_stopwords(x))

In [98]:
train_df.head()

Unnamed: 0,tweet,sentiment,clean_tweet,no_sentence,word_list,word_list_without_sw
0,@user when a father is dysfunctional and is s...,Positive,user when a father is dysfunctional and is so ...,2,"[user, when, a, father, is, dysfunctional, and...","[user, father, dysfunctional, selfish, drags, ..."
1,@user @user thanks for #lyft credit i can't us...,Positive,user user thanks for lyft credit i cannot use ...,2,"[user, user, thanks, for, lyft, credit, i, can...","[user, user, thanks, lyft, credit, use, offer,..."
2,bihday your majesty,Positive,bihday your majesty,1,"[bihday, your, majesty]","[bihday, majesty]"
3,#model i love u take with u all the time in ...,Positive,model i love you take with you all the time in...,2,"[model, i, love, you, take, with, you, all, th...","[model, love, take, time, areð±, ððð..."
4,factsguide: society now #motivation,Positive,factsguide society now motivation,1,"[factsguide, society, now, motivation]","[factsguide, society, motivation]"


In [99]:
flat_words  = chain.from_iterable(train_df['word_list_without_sw'])
Top = Counter(flat_words)
tempdf = pd.DataFrame(Top.most_common(20), columns = ['Words', 'Count'])
tempdf.style.background_gradient(cmap = 'Purples')

Unnamed: 0,Words,Count
0,user,17474
1,love,2668
2,day,2198
3,happy,1663
4,amp,1582
5,time,1110
6,life,1086
7,like,1042
8,â¦,1004
9,today,991


## Most common words by sentiment
After removing stop words, we can see some meaningful words among the most common words. As we have more positive tweets in our dataset, positive words make up a larger proportion. We can check the most common words in negative and positive tweets separately. In the following cells, we will create a separate dataframe for each sentiment and repeat the above process.
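
The per-sentiment counting can also be expressed without splitting the data first. The following is a minimal sketch on made-up rows, using one Counter per sentiment; the rows, the variable names, and the choice to filter the placeholder token "user" by name are assumptions for illustration, not the notebook's method.

```python
from collections import Counter, defaultdict

# Made-up (sentiment, word list) pairs standing in for dataframe rows
rows = [
    ("Positive", ["user", "love", "day"]),
    ("Positive", ["user", "happy", "day"]),
    ("Negative", ["user", "angry", "bad"]),
    ("Negative", ["user", "angry"]),
]

# One Counter per sentiment label, created on first use
counts_by_sentiment = defaultdict(Counter)
for sentiment, words in rows:
    # Skip the placeholder token by name instead of by its rank in the result
    counts_by_sentiment[sentiment].update(w for w in words if w != "user")

top_positive = counts_by_sentiment["Positive"].most_common(1)
top_negative = counts_by_sentiment["Negative"].most_common(1)
print(top_positive, top_negative)
```

Filtering "user" by name gives the same kind of table as dropping the first row of the ranking, but it does not depend on "user" actually being the most frequent word in every subset.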

In [100]:
# create separate dataframes for each sentiment
Positive_sentiment = train_df[train_df['sentiment'] == 'Positive']
Negative_sentiment = train_df[train_df['sentiment'] == 'Negative']

In [101]:
positive_sent = Positive_sentiment['word_list_without_sw']
flat_word = chain.from_iterable(positive_sent)
topwords = Counter(flat_word)
tmpdf = pd.DataFrame(topwords.most_common(20), columns = ['Words', 'Counts'])
# drop the top row (the placeholder token 'user', the most frequent word overall)
tmpdf = tmpdf.iloc[1:,:]
tmpdf.style.background_gradient(cmap = 'Reds')

Unnamed: 0,Words,Counts
1,love,2643
2,day,2190
3,happy,1651
4,amp,1314
5,time,1088
6,life,1080
7,today,979
8,positive,925
9,thankful,919
10,new,914


In [102]:
negative_sent = Negative_sentiment['word_list_without_sw']
fflatwords = chain.from_iterable(negative_sent)
topwording = Counter(fflatwords)
tmpdfs = pd.DataFrame(topwording.most_common(20), columns= ['Words', 'Counts'])
# drop the top row (presumably the placeholder token 'user' again)
tmpdfs = tmpdfs.iloc[1:,:]
tmpdfs.style.background_gradient(cmap = 'nipy_spectral_r')

Unnamed: 0,Words,Counts
1,amp,268
2,trump,197
3,â¦,181
4,libtard,149
5,like,137
6,white,137
7,black,131
8,racist,102
9,people,99
10,politics,96


## Lemmatization
Both stemming and lemmatization convert a word to its base form. Stemming is a fast, rule-based technique that sometimes chops words inaccurately (under-stemming and over-stemming). NLTK provides PorterStemmer and the slightly improved SnowballStemmer.

Lemmatization is a dictionary-based technique, more accurate but slightly slower than stemming. We will use WordNetLemmatizer from NLTK, which requires the wordnet resource to be downloaded first (for example with nltk.download('wordnet')).
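
The difference between the two approaches can be illustrated with a toy example. This is only a sketch: toy_stem is a deliberately crude suffix stripper and TOY_LEMMAS is a four-entry hand-written dictionary, both hypothetical. Neither reproduces the Porter algorithm or WordNet.

```python
def toy_stem(word):
    """Rule-based: strip the first matching suffix, with no check that a real word remains."""
    for suffix in ("ing", "ies", "es", "s"):
        # Keep at least three characters so very short words are left alone
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Dictionary-based: a tiny hand-written lookup table (real lemmatizers use a full lexicon)
TOY_LEMMAS = {"drags": "drag", "champions": "champion", "studies": "study", "was": "be"}

def toy_lemmatize(word):
    """Return the dictionary form if the word is known, otherwise the word unchanged."""
    return TOY_LEMMAS.get(word, word)

for word in ["drags", "champions", "studies", "was"]:
    print(word, "->", toy_stem(word), "|", toy_lemmatize(word))
```

For "drags" and "champions" both functions agree. For "studies" the suffix rule produces "stud", which is the kind of inaccurate chopping described above, while the lookup returns "study". The irregular form "was" is out of reach for suffix rules but is handled by the dictionary.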

In [103]:
lemmatizer = WordNetLemmatizer()

In [104]:
def lemmatize_word(word_list_without_sw):
    # pos='r' treats every token as an adverb, so nouns and verbs such as "drags" stay unchanged
    return [lemmatizer.lemmatize(word, pos = 'r') for word in word_list_without_sw]

In [105]:
train_df['Lemmatized_word'] = train_df['word_list_without_sw'].apply(lemmatize_word)

In [106]:
train_df.head(50)

Unnamed: 0,tweet,sentiment,clean_tweet,no_sentence,word_list,word_list_without_sw,Lemmatized_word
0,@user when a father is dysfunctional and is s...,Positive,user when a father is dysfunctional and is so ...,2,"[user, when, a, father, is, dysfunctional, and...","[user, father, dysfunctional, selfish, drags, ...","[user, father, dysfunctional, selfish, drags, ..."
1,@user @user thanks for #lyft credit i can't us...,Positive,user user thanks for lyft credit i cannot use ...,2,"[user, user, thanks, for, lyft, credit, i, can...","[user, user, thanks, lyft, credit, use, offer,...","[user, user, thanks, lyft, credit, use, offer,..."
2,bihday your majesty,Positive,bihday your majesty,1,"[bihday, your, majesty]","[bihday, majesty]","[bihday, majesty]"
3,#model i love u take with u all the time in ...,Positive,model i love you take with you all the time in...,2,"[model, i, love, you, take, with, you, all, th...","[model, love, take, time, areð±, ððð...","[model, love, take, time, areð±, ððð..."
4,factsguide: society now #motivation,Positive,factsguide society now motivation,1,"[factsguide, society, now, motivation]","[factsguide, society, motivation]","[factsguide, society, motivation]"
5,[2/2] huge fan fare and big talking before the...,Positive,huge fan fare and big talking before they lea...,3,"[huge, fan, fare, and, big, talking, before, t...","[huge, fan, fare, big, talking, leave, chaos, ...","[huge, fan, fare, big, talking, leave, chaos, ..."
6,@user camping tomorrow @user @user @user @use...,Positive,user camping tomorrow user user user user user...,1,"[user, camping, tomorrow, user, user, user, us...","[user, camping, tomorrow, user, user, user, us...","[user, camping, tomorrow, user, user, user, us..."
7,the next school year is the year for exams.ð...,Positive,the next school year is the year for examsð¯...,1,"[the, next, school, year, is, the, year, for, ...","[next, school, year, year, examsð¯, think, ð...","[next, school, year, year, examsð¯, think, ð..."
8,we won!!! love the land!!! #allin #cavs #champ...,Positive,we won love the land allin cavs champions clev...,3,"[we, won, love, the, land, allin, cavs, champi...","[love, land, allin, cavs, champions, cleveland...","[love, land, allin, cavs, champions, cleveland..."
9,@user @user welcome here ! i'm it's so #gr...,Positive,user user welcome here i am it is so,2,"[user, user, welcome, here, i, am, it, is, so]","[user, user, welcome]","[user, user, welcome]"


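OR, using the lemmatizer's default part of speech (noun), which also reduces plurals such as "drags" and "champions" to "drag" and "champion":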

In [107]:
train_df['Lemmatized_word'] = train_df['word_list_without_sw'].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])

In [108]:
train_df.head(20)

Unnamed: 0,tweet,sentiment,clean_tweet,no_sentence,word_list,word_list_without_sw,Lemmatized_word
0,@user when a father is dysfunctional and is s...,Positive,user when a father is dysfunctional and is so ...,2,"[user, when, a, father, is, dysfunctional, and...","[user, father, dysfunctional, selfish, drags, ...","[user, father, dysfunctional, selfish, drag, k..."
1,@user @user thanks for #lyft credit i can't us...,Positive,user user thanks for lyft credit i cannot use ...,2,"[user, user, thanks, for, lyft, credit, i, can...","[user, user, thanks, lyft, credit, use, offer,...","[user, user, thanks, lyft, credit, use, offer,..."
2,bihday your majesty,Positive,bihday your majesty,1,"[bihday, your, majesty]","[bihday, majesty]","[bihday, majesty]"
3,#model i love u take with u all the time in ...,Positive,model i love you take with you all the time in...,2,"[model, i, love, you, take, with, you, all, th...","[model, love, take, time, areð±, ððð...","[model, love, take, time, areð±, ððð..."
4,factsguide: society now #motivation,Positive,factsguide society now motivation,1,"[factsguide, society, now, motivation]","[factsguide, society, motivation]","[factsguide, society, motivation]"
5,[2/2] huge fan fare and big talking before the...,Positive,huge fan fare and big talking before they lea...,3,"[huge, fan, fare, and, big, talking, before, t...","[huge, fan, fare, big, talking, leave, chaos, ...","[huge, fan, fare, big, talking, leave, chaos, ..."
6,@user camping tomorrow @user @user @user @use...,Positive,user camping tomorrow user user user user user...,1,"[user, camping, tomorrow, user, user, user, us...","[user, camping, tomorrow, user, user, user, us...","[user, camping, tomorrow, user, user, user, us..."
7,the next school year is the year for exams.ð...,Positive,the next school year is the year for examsð¯...,1,"[the, next, school, year, is, the, year, for, ...","[next, school, year, year, examsð¯, think, ð...","[next, school, year, year, examsð¯, think, ð..."
8,we won!!! love the land!!! #allin #cavs #champ...,Positive,we won love the land allin cavs champions clev...,3,"[we, won, love, the, land, allin, cavs, champi...","[love, land, allin, cavs, champions, cleveland...","[love, land, allin, cavs, champion, cleveland,..."
9,@user @user welcome here ! i'm it's so #gr...,Positive,user user welcome here i am it is so,2,"[user, user, welcome, here, i, am, it, is, so]","[user, user, welcome]","[user, user, welcome]"


## Final pre-processing
Let's concatenate all the words in the last column of the dataframe to create a cleaned version of the tweet text.

In [110]:
train_df['final_tweet'] = train_df['Lemmatized_word'].apply(lambda x:' '.join(x))

In [112]:
train_df.head(20)

Unnamed: 0,tweet,sentiment,clean_tweet,no_sentence,word_list,word_list_without_sw,Lemmatized_word,final_tweet
0,@user when a father is dysfunctional and is s...,Positive,user when a father is dysfunctional and is so ...,2,"[user, when, a, father, is, dysfunctional, and...","[user, father, dysfunctional, selfish, drags, ...","[user, father, dysfunctional, selfish, drag, k...",user father dysfunctional selfish drag kid dys...
1,@user @user thanks for #lyft credit i can't us...,Positive,user user thanks for lyft credit i cannot use ...,2,"[user, user, thanks, for, lyft, credit, i, can...","[user, user, thanks, lyft, credit, use, offer,...","[user, user, thanks, lyft, credit, use, offer,...",user user thanks lyft credit use offer wheelch...
2,bihday your majesty,Positive,bihday your majesty,1,"[bihday, your, majesty]","[bihday, majesty]","[bihday, majesty]",bihday majesty
3,#model i love u take with u all the time in ...,Positive,model i love you take with you all the time in...,2,"[model, i, love, you, take, with, you, all, th...","[model, love, take, time, areð±, ððð...","[model, love, take, time, areð±, ððð...",model love take time areð± ðððð ð...
4,factsguide: society now #motivation,Positive,factsguide society now motivation,1,"[factsguide, society, now, motivation]","[factsguide, society, motivation]","[factsguide, society, motivation]",factsguide society motivation
5,[2/2] huge fan fare and big talking before the...,Positive,huge fan fare and big talking before they lea...,3,"[huge, fan, fare, and, big, talking, before, t...","[huge, fan, fare, big, talking, leave, chaos, ...","[huge, fan, fare, big, talking, leave, chaos, ...",huge fan fare big talking leave chaos pay disp...
6,@user camping tomorrow @user @user @user @use...,Positive,user camping tomorrow user user user user user...,1,"[user, camping, tomorrow, user, user, user, us...","[user, camping, tomorrow, user, user, user, us...","[user, camping, tomorrow, user, user, user, us...",user camping tomorrow user user user user user...
7,the next school year is the year for exams.ð...,Positive,the next school year is the year for examsð¯...,1,"[the, next, school, year, is, the, year, for, ...","[next, school, year, year, examsð¯, think, ð...","[next, school, year, year, examsð¯, think, ð...",next school year year examsð¯ think ð­ sch...
8,we won!!! love the land!!! #allin #cavs #champ...,Positive,we won love the land allin cavs champions clev...,3,"[we, won, love, the, land, allin, cavs, champi...","[love, land, allin, cavs, champions, cleveland...","[love, land, allin, cavs, champion, cleveland,...",love land allin cavs champion cleveland clevel...
9,@user @user welcome here ! i'm it's so #gr...,Positive,user user welcome here i am it is so,2,"[user, user, welcome, here, i, am, it, is, so]","[user, user, welcome]","[user, user, welcome]",user user welcome
