### **Implementation Details**


#### **Step 2: Data Preprocessing**
In the Data Preprocessing step, we clean the collected tweets and save the result at several stages, so that sentiment accuracy can be calculated at each one. The goal at this stage is to check how data cleaning affects sentiment analysis.

This step is divided into three sub-parts:
1. Initial Data Preparation (Current File)
    * Replacing Empty Locations with Unknown
    * Filtering out Non-English Tweets
    * Creating a pickle for the 1st Sentiment Analysis baseline using VADER


2. Generic Data Cleaning
* In this step, we perform the initial data cleaning procedures:
    * Lowercasing
    * Removing special characters
    * Removing Whitespace
    * Removing tagged Usernames
    * Removing Hashtags
    * Removing RT
    * Removing URLs and HTTP tags
    * Removing Punctuation
    * Creating a pickle for the 2nd Sentiment Analysis baseline


3. NLP Specific Data Cleaning
* In this step, we perform additional preprocessing on the cleaned tweets from the steps above:
    * Removing Emojis
    * Stopword Removal
    * Lemmatization
    * Creating a pickle for the 3rd Sentiment Analysis baseline using VADER
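
The comparison these three stages prepare for can be sketched roughly as follows. This is only an illustration under stated assumptions: `toy_score`, `TOY_LEXICON`, `compare_stages` and the sample texts are all hypothetical, and the toy scorer is a stand-in for a real sentiment scorer such as VADER, not a reproduction of it.

```python
from statistics import mean

# Hypothetical toy lexicon; a real run would use a sentiment library instead.
TOY_LEXICON = {"love": 1.0, "great": 1.0, "hate": -1.0, "bad": -1.0}

def toy_score(text):
    """Average lexicon value of the known words in text, 0.0 if none match."""
    hits = [TOY_LEXICON[w] for w in text.lower().split() if w in TOY_LEXICON]
    return mean(hits) if hits else 0.0

def compare_stages(stages, scorer):
    """Return {stage name: mean score over that stage's texts}."""
    return {name: mean([scorer(t) for t in texts]) for name, texts in stages.items()}

# Invented example texts, one list per preprocessing stage.
stages = {
    "filtered": ["This is GREAT!!!", "bad, bad idea"],
    "gen_cleaned": ["this is great", "bad bad idea"],
    "nlp_cleaned": ["great", "bad bad idea"],
}

for name, score in compare_stages(stages, toy_score).items():
    print(name, score)
```

With this toy scorer the uncleaned stage scores differently from the cleaned ones, because tokens such as `great!!!` do not match the lexicon. A real scorer may react to punctuation and casing in other ways, which is the effect the three baselines are meant to measure.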



In [14]:
# Libraries needed
import pandas as pd
pd.options.display.max_columns = 50
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set()
plt.style.use('bmh')
# packages for data cleaning function
import re
import string
import pickle

import nltk
from sklearn.feature_extraction import text 
from nltk.stem import WordNetLemmatizer 
from nltk.probability import FreqDist
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.collocations import *
from nltk.tokenize import word_tokenize 
from emot.emo_unicode import UNICODE_EMOJI

Import *tweets_final.csv*

In [15]:
df = pd.read_csv('../data/tweets_final.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,Date,ID,location,tweet,num_of_likes,num_of_retweet,language
0,0,2022-11-07 23:59:59+00:00,1589769667765469186,"California, USA",Taking into account personal contributions &am...,2,1,en
1,1,2022-11-07 23:59:59+00:00,1589769667652235267,@jlo follows ♡ 01.29.21,whats your fav song?\n\n❥ I’m voting #Jennifer...,0,10,en
2,2,2022-11-07 23:59:59+00:00,1589769667127934977,,@MayoIsSpicyy He is allowed to speak his opini...,0,0,en
3,3,2022-11-07 23:59:59+00:00,1589769666918244352,USA,HEY NY DISTRICT 10! PLEASE VOTE FOR @danielsgo...,1,1,en
4,4,2022-11-07 23:59:59+00:00,1589769666679144448,DMV,@YDanasmithdutra @BaddCompani @politicalblond ...,3,0,en


* Initial Data Preparation

In [16]:
### Data Cleaning 

# Checking Locations which are NaN
df.location.isna().sum()

7809

In [17]:
### Data Cleaning 

# Replacing NaN location values with Unknown
df['location'] = df['location'].fillna('Unknown')
df.head()

Unnamed: 0.1,Unnamed: 0,Date,ID,location,tweet,num_of_likes,num_of_retweet,language
0,0,2022-11-07 23:59:59+00:00,1589769667765469186,"California, USA",Taking into account personal contributions &am...,2,1,en
1,1,2022-11-07 23:59:59+00:00,1589769667652235267,@jlo follows ♡ 01.29.21,whats your fav song?\n\n❥ I’m voting #Jennifer...,0,10,en
2,2,2022-11-07 23:59:59+00:00,1589769667127934977,Unknown,@MayoIsSpicyy He is allowed to speak his opini...,0,0,en
3,3,2022-11-07 23:59:59+00:00,1589769666918244352,USA,HEY NY DISTRICT 10! PLEASE VOTE FOR @danielsgo...,1,1,en
4,4,2022-11-07 23:59:59+00:00,1589769666679144448,DMV,@YDanasmithdutra @BaddCompani @politicalblond ...,3,0,en


In [18]:
df.location.isna().sum()

0

In [19]:
### Data Cleaning 

# Dropping non-English tweets

df.drop(df[(df['language'] != 'en')].index, inplace=True)
df.head()

Unnamed: 0.1,Unnamed: 0,Date,ID,location,tweet,num_of_likes,num_of_retweet,language
0,0,2022-11-07 23:59:59+00:00,1589769667765469186,"California, USA",Taking into account personal contributions &am...,2,1,en
1,1,2022-11-07 23:59:59+00:00,1589769667652235267,@jlo follows ♡ 01.29.21,whats your fav song?\n\n❥ I’m voting #Jennifer...,0,10,en
2,2,2022-11-07 23:59:59+00:00,1589769667127934977,Unknown,@MayoIsSpicyy He is allowed to speak his opini...,0,0,en
3,3,2022-11-07 23:59:59+00:00,1589769666918244352,USA,HEY NY DISTRICT 10! PLEASE VOTE FOR @danielsgo...,1,1,en
4,4,2022-11-07 23:59:59+00:00,1589769666679144448,DMV,@YDanasmithdutra @BaddCompani @politicalblond ...,3,0,en


In [20]:
# Create pickle for 1st Sentiment Analysis baseline using VADER 
df.to_pickle("../source/filtered_df.pkl")

* Generic Data Cleaning
    * Lowercasing
    * Removing special characters
    * Removing Whitespace
    * Removing tagged Usernames
    * Removing Hashtags
    * Removing RT
    * Removing URLs and HTTP tags
    * Removing Punctuation

In [21]:
# Generic Data Cleaning function
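def clean_tweet(text):
    text = text.lower()                            # lowercase
    text = re.sub(r'https?://\S+', '', text)       # URLs, removed while still intact
    text = re.sub(r'\[.*?\]', '', text)            # text in square brackets
    text = re.sub(r'\(.*?\)', '', text)            # text in parentheses
    text = re.sub(r'\w*\d\w*', '', text)           # tokens containing digits
    text = re.sub(r'\s+', ' ', text)               # collapse whitespace, including newlines
    text = re.sub(r'"+', '', text)                 # double quotes
    text = re.sub(r'&amp;', '', text)              # HTML-escaped ampersands
    text = re.sub(r'@\S+', '', text)               # tagged usernames
    text = re.sub(r'#\S+', '', text)               # hashtags
    text = re.sub(r'\brt\b', '', text)             # retweet marker, as a whole word only
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # ASCII punctuation
    text = re.sub(r'https?tco', '', text)          # leftover t.co fragments

    return text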
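
Two details matter in a regex cleaner of this kind: whether the retweet marker is matched as a whole word, and whether URLs are removed before or after punctuation is stripped. The sketch below compares the alternatives on made-up strings. The helper names and the sample inputs are hypothetical and are not part of the notebook.

```python
import re
import string

PUNCT = re.compile("[%s]" % re.escape(string.punctuation))

def strip_rt_anywhere(text):
    # Unanchored pattern: also deletes "rt" inside ordinary words.
    return re.sub(r"rt", "", text)

def strip_rt_token(text):
    # Word-bounded pattern: deletes only a standalone "rt" token.
    return re.sub(r"\brt\b", "", text)

def punctuation_then_url(text):
    # Punctuation goes first, so the URL collapses into a single word
    # and only its "httpstco" prefix can be removed afterwards.
    return re.sub(r"https?tco", "", PUNCT.sub("", text))

def url_then_punctuation(text):
    # The URL is removed while it is still intact.
    return PUNCT.sub("", re.sub(r"https?://\S+", "", text))

print(repr(strip_rt_anywhere("rt party support")))              # ' pay suppo'
print(repr(strip_rt_token("rt party support")))                 # ' party support'
print(repr(punctuation_then_url("vote! https://t.co/abcdef")))  # 'vote abcdef'
print(repr(url_then_punctuation("vote! https://t.co/abcdef")))  # 'vote '
```

The unanchored pattern damages words such as "party", and stripping punctuation before URLs can leave the link's trailing slug in the text.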

In [22]:
df_filtered = pd.read_pickle('../source/filtered_df.pkl')
df_filtered.head()

Unnamed: 0.1,Unnamed: 0,Date,ID,location,tweet,num_of_likes,num_of_retweet,language
0,0,2022-11-07 23:59:59+00:00,1589769667765469186,"California, USA",Taking into account personal contributions &am...,2,1,en
1,1,2022-11-07 23:59:59+00:00,1589769667652235267,@jlo follows ♡ 01.29.21,whats your fav song?\n\n❥ I’m voting #Jennifer...,0,10,en
2,2,2022-11-07 23:59:59+00:00,1589769667127934977,Unknown,@MayoIsSpicyy He is allowed to speak his opini...,0,0,en
3,3,2022-11-07 23:59:59+00:00,1589769666918244352,USA,HEY NY DISTRICT 10! PLEASE VOTE FOR @danielsgo...,1,1,en
4,4,2022-11-07 23:59:59+00:00,1589769666679144448,DMV,@YDanasmithdutra @BaddCompani @politicalblond ...,3,0,en


In [23]:
df_filtered['cleaned_tweets'] = df_filtered['tweet'].apply(clean_tweet)
df_filtered.head()

Unnamed: 0.1,Unnamed: 0,Date,ID,location,tweet,num_of_likes,num_of_retweet,language,cleaned_tweets
0,0,2022-11-07 23:59:59+00:00,1589769667765469186,"California, USA",Taking into account personal contributions &am...,2,1,en,taking into account personal contributions de...
1,1,2022-11-07 23:59:59+00:00,1589769667652235267,@jlo follows ♡ 01.29.21,whats your fav song?\n\n❥ I’m voting #Jennifer...,0,10,en,whats your fav song ❥ i’m voting for at the
2,2,2022-11-07 23:59:59+00:00,1589769667127934977,Unknown,@MayoIsSpicyy He is allowed to speak his opini...,0,0,en,he is allowed to speak his opinion just like ...
3,3,2022-11-07 23:59:59+00:00,1589769666918244352,USA,HEY NY DISTRICT 10! PLEASE VOTE FOR @danielsgo...,1,1,en,hey ny district please vote for
4,4,2022-11-07 23:59:59+00:00,1589769666679144448,DMV,@YDanasmithdutra @BaddCompani @politicalblond ...,3,0,en,vote blue no matter who 💙🇺🇸💙🇺🇸💙🇺🇸💙🇺🇸💙🇺🇸💙
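
Row 1 above still contains the curly apostrophe in `i’m`, because `string.punctuation` covers ASCII characters only. If Unicode punctuation should go as well, one option is to filter on the Unicode category, as in the minimal sketch below. The helper name `strip_unicode_punctuation` is hypothetical.

```python
import unicodedata

def strip_unicode_punctuation(text):
    """Drop every character whose Unicode general category starts with 'P'."""
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("P")
    )

print(strip_unicode_punctuation("i’m voting, ok?"))  # im voting ok
```

Some ASCII characters in `string.punctuation`, such as `$`, `+` and `~`, are classed as symbols rather than punctuation in Unicode, so this filter complements the ASCII-based step rather than replacing it.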


In [24]:
# Create pickle for 2nd Sentiment Analysis baseline
df_filtered.to_pickle("../source/gen_cleaned_df.pkl")

* NLP Specific Data Cleaning
    * Removing Emojis
    * Stopword Removal 
    * Lemmatization

In [25]:
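emoji = set(UNICODE_EMOJI.keys())             # set gives fast membership tests
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()              # created once, reused for every tweet

def nlp_clean_tweet(text):
    tokens = word_tokenize(text)
    # drop stopwords
    filtered_words = [w for w in tokens if w not in stop_words]
    # drop tokens that are exactly one emoji; emoji that share a token
    # with other characters are not caught by this test
    filtered_words = [w for w in filtered_words if w not in emoji]
    # lemmatize the remaining words
    lemma_words = [lemmatizer.lemmatize(w) for w in filtered_words]
    text = " ".join(lemma_words)
    return text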

In [26]:
df_preprocessed = pd.read_pickle('../source/gen_cleaned_df.pkl')
df_preprocessed.head()

Unnamed: 0.1,Unnamed: 0,Date,ID,location,tweet,num_of_likes,num_of_retweet,language,cleaned_tweets
0,0,2022-11-07 23:59:59+00:00,1589769667765469186,"California, USA",Taking into account personal contributions &am...,2,1,en,taking into account personal contributions de...
1,1,2022-11-07 23:59:59+00:00,1589769667652235267,@jlo follows ♡ 01.29.21,whats your fav song?\n\n❥ I’m voting #Jennifer...,0,10,en,whats your fav song ❥ i’m voting for at the
2,2,2022-11-07 23:59:59+00:00,1589769667127934977,Unknown,@MayoIsSpicyy He is allowed to speak his opini...,0,0,en,he is allowed to speak his opinion just like ...
3,3,2022-11-07 23:59:59+00:00,1589769666918244352,USA,HEY NY DISTRICT 10! PLEASE VOTE FOR @danielsgo...,1,1,en,hey ny district please vote for
4,4,2022-11-07 23:59:59+00:00,1589769666679144448,DMV,@YDanasmithdutra @BaddCompani @politicalblond ...,3,0,en,vote blue no matter who 💙🇺🇸💙🇺🇸💙🇺🇸💙🇺🇸💙🇺🇸💙


In [27]:
df_preprocessed['final_cleaned_tweets'] = df_preprocessed['cleaned_tweets'].apply(nlp_clean_tweet)
df_preprocessed.head()

Unnamed: 0.1,Unnamed: 0,Date,ID,location,tweet,num_of_likes,num_of_retweet,language,cleaned_tweets,final_cleaned_tweets
0,0,2022-11-07 23:59:59+00:00,1589769667765469186,"California, USA",Taking into account personal contributions &am...,2,1,en,taking into account personal contributions de...,taking account personal contribution degree ba...
1,1,2022-11-07 23:59:59+00:00,1589769667652235267,@jlo follows ♡ 01.29.21,whats your fav song?\n\n❥ I’m voting #Jennifer...,0,10,en,whats your fav song ❥ i’m voting for at the,whats fav song ❥ ’ voting
2,2,2022-11-07 23:59:59+00:00,1589769667127934977,Unknown,@MayoIsSpicyy He is allowed to speak his opini...,0,0,en,he is allowed to speak his opinion just like ...,allowed speak opinion like rest u opinion vote...
3,3,2022-11-07 23:59:59+00:00,1589769666918244352,USA,HEY NY DISTRICT 10! PLEASE VOTE FOR @danielsgo...,1,1,en,hey ny district please vote for,hey ny district please vote
4,4,2022-11-07 23:59:59+00:00,1589769666679144448,DMV,@YDanasmithdutra @BaddCompani @politicalblond ...,3,0,en,vote blue no matter who 💙🇺🇸💙🇺🇸💙🇺🇸💙🇺🇸💙🇺🇸💙,vote blue matter 💙🇺🇸💙🇺🇸💙🇺🇸💙🇺🇸💙🇺🇸💙
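
In the last row above, the emoji run is still present in `final_cleaned_tweets`. The most likely reason is that the tokenizer returns the whole run as one token, so a per-token lookup against single emoji never matches. A possible alternative is to remove emoji at the string level before tokenizing, as sketched below. The helpers `build_emoji_pattern` and `strip_emoji` are hypothetical, and `sample_emoji` is a tiny hand-written stand-in for the keys of an emoji dictionary such as `UNICODE_EMOJI`.

```python
import re

def build_emoji_pattern(emoji_keys):
    """Compile one regex that matches any of the given emoji strings.

    Longer sequences come first so that multi-codepoint emoji (flags,
    for example) are matched whole rather than piece by piece.
    """
    ordered = sorted(emoji_keys, key=len, reverse=True)
    return re.compile("|".join(re.escape(e) for e in ordered))

def strip_emoji(text, pattern):
    """Remove every emoji match, then tidy the leftover whitespace."""
    return re.sub(r"\s+", " ", pattern.sub("", text)).strip()

# Hypothetical stand-in for an emoji -> name dictionary.
sample_emoji = {"💙": ":blue_heart:", "🇺🇸": ":United_States:"}
pattern = build_emoji_pattern(sample_emoji.keys())

print(strip_emoji("vote blue no matter who 💙🇺🇸💙🇺🇸💙", pattern))
# vote blue no matter who
```

Characters that are missing from the dictionary would still survive this step.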


In [28]:
# Create pickle for 3rd Sentiment Analysis baseline
df_preprocessed.to_pickle("../source/nlp_cleaned_df.pkl")