Here we try TextBlob, a Python library for processing text data. As we have identified, tweets commonly contain spelling and grammar mistakes that need to be rectified.

In [4]:
import re

import nltk
import numpy as np
import pandas as pd
import spacy
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from sklearn.model_selection import train_test_split
from textblob import TextBlob

# Download the tokenizer model and stop-word list used later in the notebook
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /Users/henry/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/henry/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [6]:
# Load the dataset
data = pd.read_csv('../Twitter Sentiment/dataset.csv', header=None)
data.columns = ['labels', 'text']
print(len(data))
print(data.head())

100000
   labels                                               text
0       0  @switchfoot http://twitpic.com/2y1zl - Awww, t...
1       0  is upset that he can't update his Facebook by ...
2       0  @Kenichan I dived many times for the ball. Man...
3       0    my whole body feels itchy and like its on fire 
4       0  @nationwideclass no, it's not behaving at all....


TextBlob was one of the packages we experimented with. It has its own built-in sentiment analysis method, which returns a polarity score between -1 and 1. The large number of scores equal to exactly 0 (neutral) in the run below indicates that it cannot classify a large share of the tweets, so we decided not to go ahead with it in the end.

In [7]:
predictions = []

# Score rows 24999-74999 with TextBlob's built-in polarity (range -1 to 1)
for i in range(24999, 75000):
    tweet = data.text[i]
    polarity = TextBlob(tweet).sentiment.polarity
    predictions.append(polarity)

In [8]:
# Count the tweets that received a polarity of exactly 0 (neutral)
sum(1 for p in predictions if p == 0)

17789
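
To put that count in context: `range(24999, 75000)` covers 50,001 tweets, so 17,789 zeros is roughly 36% of the slice. A minimal sketch of how the scores could be bucketed to inspect this split is given here. The `bucket` helper, its `tol` parameter and the sample scores are illustrative assumptions, not part of the original notebook.

```python
from collections import Counter


def bucket(polarity, tol=0.0):
    """Map a polarity score in [-1, 1] to a coarse label (hypothetical helper)."""
    if polarity > tol:
        return "positive"
    if polarity < -tol:
        return "negative"
    return "neutral"


# Hypothetical polarity scores standing in for `predictions`
scores = [0.5, 0.0, -0.25, 0.0, 0.8]

counts = Counter(bucket(s) for s in scores)
neutral_share = counts["neutral"] / len(scores)
print(counts, neutral_share)
```

Raising `tol` above 0 would also treat weakly polarised scores as neutral, which only makes the unclassified share larger.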

In [None]:
# Import the data once again
data = pd.read_csv('../Twitter Sentiment/dataset.csv', header=None)
data.columns = ['labels', 'text']
print(len(data))
print(data.head())

In [10]:
def clean_text(text):
    text = re.sub(r'http\S+', '', text)  # Remove URLs
    text = re.sub(r'@[A-Za-z0-9_]+', '', text)  # Remove mentions
    text = re.sub(r'#', '', text)  # Remove the '#' symbol but keep the hashtag word
    text = re.sub(r'[^A-Za-z\']+', ' ', text)  # Replace anything other than letters and apostrophes with a space
    text = text.lower()  # Convert to lowercase
    return text

data['cleaned_text'] = data['text'].apply(clean_text)
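
As a quick check of what these rules do to a single tweet, the snippet below repeats the function so that it runs on its own and applies it to a made-up sample. The sample text and the variable names are assumptions for illustration only.

```python
import re


def clean_text(text):
    text = re.sub(r'http\S+', '', text)          # Remove URLs
    text = re.sub(r'@[A-Za-z0-9_]+', '', text)   # Remove mentions
    text = re.sub(r'#', '', text)                # Remove the '#' symbol only
    text = re.sub(r'[^A-Za-z\']+', ' ', text)    # Non-letters (except apostrophes) -> space
    return text.lower()


# Hypothetical tweet, not taken from the dataset
sample = "@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer. #sad"
cleaned = clean_text(sample)
print(repr(cleaned))
```

Note that the result can start with a space where a mention or URL was removed; calling `.strip()` on it would drop that if it matters downstream.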

We use the autocorrect library because it runs the fastest, and we expect most autocorrect libraries to perform similarly.

In [14]:

from autocorrect import Speller

# English spell checker; the object is called directly on a string
check = Speller(lang='en')

In [15]:
# Apply the spell checker to every cleaned tweet
data['spelled'] = data['cleaned_text'].apply(check)
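
Because tweets repeat many of the same words, one way to reduce the cost of this step might be to correct each distinct word only once and reuse the result. The sketch below shows the idea with a stand-in `toy_correct` function and hypothetical `cached_correct` and `correct_sentence` helpers. It assumes the corrector treats each word independently, which should be checked against the library before relying on it.

```python
from functools import lru_cache

# Stand-in for a real spell checker; the mapping is purely illustrative
TOY_FIXES = {"compleate": "complete", "tommorow": "tomorrow"}


def toy_correct(word):
    return TOY_FIXES.get(word, word)


@lru_cache(maxsize=None)
def cached_correct(word):
    # Each distinct word reaches the (expensive) corrector only once
    return toy_correct(word)


def correct_sentence(sentence):
    # split() also collapses repeated spaces and drops leading/trailing ones
    return " ".join(cached_correct(w) for w in sentence.split())


sentences = [
    "now need followers to compleate follow",
    "need to compleate this tommorow",
]
corrected = [correct_sentence(s) for s in sentences]
print(corrected)
print(cached_correct.cache_info())
```

In this toy run the eleven words trigger eight corrector calls; on a large corpus with a heavily repeated vocabulary the saving would be expected to be much larger.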

Exporting the data first, as the above step took very long to execute every time

In [None]:
data.to_csv('spelt.csv')

In [9]:
data = pd.read_csv('../Twitter Sentiment/spelt.csv', header=0, index_col=0)
data

Unnamed: 0,labels,text,cleaned_text,spelled
0,0,"@switchfoot http://twitpic.com/2y1zl - Awww, t...",awww that's a bummer you shoulda got david ca...,www that's a summer you should got david carr...
1,0,is upset that he can't update his Facebook by ...,is upset that he can't update his facebook by ...,is upset that he can't update his facebook by ...
2,0,@Kenichan I dived many times for the ball. Man...,i dived many times for the ball managed to sa...,i dived many times for the ball managed to sa...
3,0,my whole body feels itchy and like its on fire,my whole body feels itchy and like its on fire,my whole body feels itch and like its on fire
4,0,"@nationwideclass no, it's not behaving at all....",no it's not behaving at all i'm mad why am i ...,no it's not behaving at all i'm mad why am i ...
...,...,...,...,...
99995,1,Now need 8 followers to compleate 1000 Follow...,now need followers to compleate follow,now need followers to complete follow
99996,1,I knew I had to explain something to my friend...,i knew i had to explain something to my friend...,i knew i had to explain something to my friend...
99997,1,done tweeting..... til tomorrow..,done tweeting til tomorrow,done meeting til tomorrow
99998,1,@cmozilo Act II set is pretty breath-taking -L...,act ii set is pretty breath taking love the r...,act ii set is pretty breath taking love the r...


Removal of apostrophes

In [17]:
def remove_apostrophes(text):
    # Named separately so it does not overwrite the earlier clean_text function
    text = re.sub(r'\'', '', text)
    return text

data['apostrophe'] = data['spelled'].apply(remove_apostrophes)
data.apostrophe

0         www thats a summer you should got david carr ...
1        is upset that he cant update his facebook by t...
2         i dived many times for the ball managed to sa...
3           my whole body feels itch and like its on fire 
4         no its not behaving at all im mad why am i he...
                               ...                        
99995               now need followers to complete follow 
99996    i knew i had to explain something to my friend...
99997                           done meeting til tomorrow 
99998     act ii set is pretty breath taking love the r...
99999    if you dont have an attire account to sell you...
Name: apostrophe, Length: 100000, dtype: object
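
Stripping apostrophes turns contractions into non-words such as `cant` and `dont`. A possible alternative, shown here only as a sketch, is to expand common contractions before removing any remaining apostrophes. The `CONTRACTIONS` map and the `expand_contractions` helper are hypothetical and far from complete.

```python
import re

# Hypothetical, deliberately tiny contraction map
CONTRACTIONS = {
    "can't": "can not",
    "don't": "do not",
    "i'm": "i am",
    "it's": "it is",
    "that's": "that is",
}

_pattern = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in CONTRACTIONS) + r")\b"
)


def expand_contractions(text):
    # Expand known contractions, then drop any apostrophes that are left
    text = _pattern.sub(lambda m: CONTRACTIONS[m.group(1)], text)
    return text.replace("'", "")


print(expand_contractions("no it's not behaving at all i'm mad"))
```

Some contractions are ambiguous (`it's` can also mean "it has"), so a map like this trades one kind of error for another.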

Tokenization 

In [18]:
data['tokens'] = data['apostrophe'].apply(word_tokenize)
data.tokens

0        [www, thats, a, summer, you, should, got, davi...
1        [is, upset, that, he, cant, update, his, faceb...
2        [i, dived, many, times, for, the, ball, manage...
3        [my, whole, body, feels, itch, and, like, its,...
4        [no, its, not, behaving, at, all, im, mad, why...
                               ...                        
99995         [now, need, followers, to, complete, follow]
99996    [i, knew, i, had, to, explain, something, to, ...
99997                       [done, meeting, til, tomorrow]
99998    [act, ii, set, is, pretty, breath, taking, lov...
99999    [if, you, dont, have, an, attire, account, to,...
Name: tokens, Length: 100000, dtype: object
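
At this stage the text contains only lower-case letters and spaces, so a plain whitespace split gives a close approximation of these token lists, which can be useful for a quick sanity check when the NLTK data is not available. This is a shortcut under that assumption rather than a replacement: `word_tokenize` applies additional rules (it may, for instance, split some informal contractions such as `gonna`), so the two can differ on some rows. The `simple_tokenize` helper is hypothetical.

```python
def simple_tokenize(text):
    """Whitespace tokenizer; str.split() also ignores leading and trailing spaces."""
    return text.split()


sample = " now need followers to complete follow "
tokens = simple_tokenize(sample)
print(tokens)
```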

Lemmatization 

In [21]:
nlp = spacy.load("en_core_web_sm")

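def lemmatize_words(word_list):
    # spaCy re-tokenizes the joined string, so the result can contain more
    # tokens than the input (e.g. "cant" comes back as "ca", "nt")
    doc = nlp(" ".join(word_list))
    lemmatized_text = [token.lemma_ for token in doc]
    return lemmatized_text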

data['lemmatized_token'] = data['tokens'].apply(lemmatize_words)
data.lemmatized_token

0        [www, that, s, a, summer, you, should, got, da...
1        [be, upset, that, he, ca, nt, update, his, fac...
2        [I, dive, many, time, for, the, ball, manage, ...
3        [my, whole, body, feel, itch, and, like, its, ...
4        [no, its, not, behave, at, all, I, m, mad, why...
                               ...                        
99995          [now, need, follower, to, complete, follow]
99996    [I, know, I, have, to, explain, something, to,...
99997                         [do, meeting, til, tomorrow]
99998    [act, ii, set, be, pretty, breath, take, love,...
99999    [if, you, do, nt, have, an, attire, account, t...
Name: lemmatized_token, Length: 100000, dtype: object

Stop-word removal, then lower-casing again, since lemmatization capitalised some words (for example, "i" became "I")

In [23]:
stop_words = set(stopwords.words('english'))
data['filtered_tokens'] = data['lemmatized_token'].apply(lambda tokens: [word for word in tokens if word not in stop_words])
data['filtered_tokens'] = data['filtered_tokens'].apply(lambda tokens: [word.lower() for word in tokens])


data['filtered_tokens']

0              [www, summer, got, david, carr, third, day]
1        [upset, ca, nt, update, facebook, texte, might...
2        [i, dive, many, time, ball, manage, save, rest...
3                    [whole, body, feel, itch, like, fire]
4                      [behave, i, mad, i, i, ca, nt, see]
                               ...                        
99995                   [need, follower, complete, follow]
99996    [i, know, i, explain, something, friend, say, ...
99997                             [meeting, til, tomorrow]
99998    [act, ii, set, pretty, breath, take, love, rea...
99999    [nt, attire, account, sell, fun, thing, i, sug...
Name: filtered_tokens, Length: 100000, dtype: object
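
The `i` tokens that survive in this output come from the order of operations: `I` is compared against a lower-case stop-word list before it is lower-cased, so it is not recognised as a stop word. A minimal sketch of the alternative order (lower-case first, then filter) is shown here, with a small hypothetical stop-word set standing in for NLTK's list and made-up helper names.

```python
# Hypothetical subset standing in for stopwords.words('english')
STOP_WORDS = {"i", "no", "its", "not", "at", "all", "m", "why", "am"}


def filter_then_lower(tokens):
    # Order used in the cell above: "I" is not in the lower-case set, so it stays
    return [t.lower() for t in tokens if t not in STOP_WORDS]


def lower_then_filter(tokens):
    # Alternative order: normalise case first, then drop stop words
    return [t for t in (tok.lower() for tok in tokens) if t not in STOP_WORDS]


lemmas = ["no", "its", "not", "behave", "at", "all", "I", "m", "mad", "why", "am", "I"]
print(filter_then_lower(lemmas))
print(lower_then_filter(lemmas))
```

Fragments such as `ca` and `nt` are not stop words in either order, so they would need separate handling.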

Exporting the cleaned data to CSV so that it can be used in the respective models

In [None]:
data.to_csv('cleanedNspelt.csv')
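
One thing to keep in mind when loading this file later: CSV has no list type, so the token columns are written as their string representation. A sketch of turning such a cell back into a list with `ast.literal_eval` follows; the sample cell is made up for illustration.

```python
import ast

original = ["need", "follower", "complete", "follow"]

# What ends up in the CSV cell: the list's string form, not a list
cell = str(original)

# literal_eval safely parses Python literals such as lists of strings
restored = ast.literal_eval(cell)
print(type(cell).__name__, type(restored).__name__)
```

With pandas, the same function could be passed through the `converters` argument of `pd.read_csv`, for example `converters={'filtered_tokens': ast.literal_eval}`.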