# Clean Sentiment140 Data Set
This notebook cleans the Sentiment140 data set so that the tweets can be used with the model. The cleaning removes URLs, mentions, punctuation and similar noise.

In [1]:
import json
import os
import pandas as pd
import preprocessor
import re

## Set some useful variables

In [2]:
DATA_DIR = '../data'
SENTIMENT140_FILE = 'training.1600000.processed.noemoticon.csv'
CONTRACTIONS_FILE = 'contractions.json'
CLEAN_SENTIMENT140_FILE = 'sentiment140_clean.csv'

## Load the data set

In [3]:
df = pd.read_csv(os.path.join(DATA_DIR, SENTIMENT140_FILE),
                 encoding='latin-1',
                 names=['target', 'ids', 'date', 'flag', 'user', 'text'])

Convert all text to lowercase.

In [4]:
df['text'] = df['text'].str.lower()

In [5]:
df.head()

Unnamed: 0,target,ids,date,flag,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@kenichan i dived many times for the ball. man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


The data set is perfectly balanced, with 800,000 tweets for each of the two labels (0 for negative, 4 for positive), so no resampling is needed.

In [6]:
df['target'].value_counts()

4    800000
0    800000
Name: target, dtype: int64
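
Class shares can be easier to read than raw counts. The following is a minimal sketch on a small hypothetical frame (`toy` and `is_balanced` are illustrative names, not part of the notebook), using `value_counts(normalize=True)` to get fractions instead of counts.

```python
import pandas as pd

# Hypothetical toy frame standing in for the full data set.
toy = pd.DataFrame({'target': [0, 0, 4, 4]})

# normalize=True returns the share of each class rather than the raw count.
shares = toy['target'].value_counts(normalize=True)

# The classes are balanced when every class has the same share.
is_balanced = shares.nunique() == 1
print(shares.to_dict(), is_balanced)
```

The same two lines should work unchanged on `df['target']`.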

## Clean the tweets
Convert everything to lowercase, expand contractions and remove punctuation.

Load the contraction mapping from the JSON file.

In [7]:
with open(os.path.join(DATA_DIR, CONTRACTIONS_FILE), 'r') as file:
    contractions = json.load(file)

Convert the contractions and their expansions to lowercase so that they match the lowercased tweets.

In [8]:
contractions = {key.lower(): value.lower() for key, value in contractions.items()}

In [9]:
contractions

{"ain't": 'am not',
 "aren't": 'are not',
 "can't": 'can not',
 "can't've": 'can not have',
 "'cause": 'because',
 "could've": 'could have',
 "couldn't": 'could not',
 "couldn't've": 'could not have',
 "didn't": 'did not',
 "doesn't": 'does not',
 "don't": 'do not',
 "hadn't": 'had not',
 "hadn't've": 'had not have',
 "hasn't": 'has not',
 "haven't": 'have not',
 "he'd": 'he would',
 "he'd've": 'he would have',
 "he'll": 'he will',
 "he'll've": 'he will have',
 "he's": 'he is',
 "how'd": 'how did',
 "how'd'y": 'how do you',
 "how'll": 'how will',
 "how's": 'how is',
 "i'd": 'i would',
 "i'd've": 'i would have',
 "i'll": 'i will',
 "i'll've": 'i will have',
 "i'm": 'i am',
 "i've": 'i have',
 "isn't": 'is not',
 "it'd": 'it had',
 "it'd've": 'it would have',
 "it'll": 'it will',
 "it'll've": 'it will have',
 "it's": 'it is',
 "let's": 'let us',
 "ma'am": 'madam',
 "mayn't": 'may not',
 "might've": 'might have',
 "mightn't": 'might not',
 "mightn't've": 'might not have',
 "must've": 'mus

Function to replace the contractions.

In [10]:
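# Try longer keys first so that "can't've" is not cut short by its prefix "can't",
# and escape each key so that it is matched literally.
c_re = re.compile('|'.join(
    re.escape(key) for key in sorted(contractions, key=len, reverse=True)))

def expand_contractions(text, c_re=c_re):
    """Replace every contraction in text with its expanded form."""
    def replace(match):
        return contractions[match.group(0)]
    return c_re.sub(replace, text)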

In [11]:
expand_contractions("i'm not")

'i am not'
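
One detail worth keeping in mind: alternation in Python's `re` module takes the first alternative that matches at a position, not the longest one, so key order matters whenever one contraction is a prefix of another. The sketch below illustrates this with a hypothetical two-entry mapping (`toy`, `build`, `naive` and `longest_first` are illustrative names only).

```python
import re

# Hypothetical two-entry mapping; the notebook loads the full one from JSON.
toy = {"can't": 'can not', "can't've": 'can not have'}

def build(keys):
    """Return a function that expands the given keys, tried in the given order."""
    pattern = re.compile('|'.join(re.escape(k) for k in keys))
    return lambda text: pattern.sub(lambda m: toy[m.group(0)], text)

# Insertion order: the shorter key "can't" is tried first.
naive = build(toy.keys())
# Longest key first: "can't've" is tried before its prefix.
longest_first = build(sorted(toy, key=len, reverse=True))

print(naive("can't've"))          # can not've
print(longest_first("can't've"))  # can not have
```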

Function to clean a single tweet.

In [12]:
BAD_SYMBOLS_RE = re.compile('[^0-9a-z #+_]')

In [13]:
def clean_tweet(tweet):
    # Remove URLs and mentions
    tweet = preprocessor.clean(tweet)
    # Expand contractions first, while the apostrophes are still present
    tweet = expand_contractions(tweet)
    # Remove bad symbols
    tweet = BAD_SYMBOLS_RE.sub(' ', tweet)
    # Remove remaining punctuation and collapse repeated whitespace
    tweet = ' '.join(re.sub("([^0-9A-Za-z \t])", " ", tweet).split())
    # For RNN models such as LSTM we do not remove stopwords
    return tweet
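
The order of the steps matters: contractions can only be expanded while their apostrophes are still in the text. The following self-contained sketch shows the difference on one example. It uses a hypothetical two-entry mapping and a simplified symbol filter rather than the notebook's `preprocessor` call, so it illustrates the principle and does not reproduce the full cleaning.

```python
import re

# Hypothetical stand-ins for the notebook's contraction mapping and filters.
toy_contractions = {"i'm": 'i am', "can't": 'can not'}
toy_re = re.compile('|'.join(re.escape(k) for k in toy_contractions))
non_alnum_re = re.compile('[^0-9a-z ]')

def expand(text):
    return toy_re.sub(lambda m: toy_contractions[m.group(0)], text)

def strip_symbols(text):
    # Replace every non-alphanumeric character and collapse whitespace.
    return ' '.join(non_alnum_re.sub(' ', text).split())

tweet = "i'm sure i can't go!"

# Stripping first destroys the apostrophes, so nothing is left to expand.
strip_first = expand(strip_symbols(tweet))
# Expanding first gives the intended result.
expand_first = strip_symbols(expand(tweet))

print(strip_first)   # i m sure i can t go
print(expand_first)  # i am sure i can not go
```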

Apply the cleaning function to all tweets.

In [14]:
df['clean_text'] = df['text'].apply(clean_tweet)

## Recode the labels
Recode the labels in a slightly more intuitive way: 0 for negative and 1 for positive.

In [15]:
df['target'] = df['target'].replace({4: 1})
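
As an optional variation, an explicit mapping with `Series.map` makes unexpected labels visible, because any value missing from the mapping becomes `NaN`. This is a sketch on a hypothetical toy frame; `label_map` is an illustrative name.

```python
import pandas as pd

# Hypothetical toy frame with the original Sentiment140 labels.
toy = pd.DataFrame({'target': [0, 4, 4, 0]})

# Explicit mapping: 0 stays negative, 4 becomes 1 (positive).
label_map = {0: 0, 4: 1}

# map() yields NaN for any label that is missing from label_map.
recoded = toy['target'].map(label_map)
assert recoded.notna().all(), 'unexpected label in target column'

toy['target'] = recoded.astype(int)
print(toy['target'].tolist())  # [0, 1, 1, 0]
```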

## Shuffle the data set and save it

In [16]:
df = df.sample(frac=1)
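
Without a seed, `sample` produces a different order on every run. If a reproducible shuffle is wanted, one option is to pass `random_state`, as in this sketch on a hypothetical toy frame (`SEED` is an arbitrary illustrative value). The optional `reset_index(drop=True)` replaces the original row labels with a fresh 0..n-1 index.

```python
import pandas as pd

# Hypothetical toy frame standing in for the cleaned data set.
toy = pd.DataFrame({'target': [0, 0, 1, 1],
                    'clean_text': ['a', 'b', 'c', 'd']})

# Arbitrary seed, chosen only for illustration.
SEED = 42

# The same seed gives the same permutation on every call.
shuffled_a = toy.sample(frac=1, random_state=SEED).reset_index(drop=True)
shuffled_b = toy.sample(frac=1, random_state=SEED).reset_index(drop=True)

print(shuffled_a.equals(shuffled_b))  # True
```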

Preview the final data set.

In [17]:
df.head()

Unnamed: 0,target,ids,date,flag,user,text,clean_text
1574145,1,2189197570,Mon Jun 15 23:28:07 PDT 2009,NO_QUERY,pattylOoves,@yayitsa hey hey what about u and jose????...umm,hey hey what about u and jose umm
656615,0,2240575390,Fri Jun 19 09:54:40 PDT 2009,NO_QUERY,Kaitikins,@tiffaknee sorry....?,sorry
1253508,1,1996915756,Mon Jun 01 15:55:23 PDT 2009,NO_QUERY,AngelicVampira,@baneen glad you had a good time i think we a...,glad you had a good time i think we all apprec...
1586982,1,2190874142,Tue Jun 16 03:57:48 PDT 2009,NO_QUERY,alexaawasheree,getting ready to leave for my class trip... to...,getting ready to leave for my class trip today...
187285,0,1968614329,Fri May 29 21:49:08 PDT 2009,NO_QUERY,jbxbaybee,im in serious need of ice cream!,im in serious need of ice cream


Save the label and the cleaned text to a CSV file.

In [19]:
df[['target', 'clean_text']].to_csv(os.path.join(DATA_DIR, CLEAN_SENTIMENT140_FILE), index=False)
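
One thing to watch when reading the file back: a tweet that consisted only of mentions or URLs can end up as an empty string after cleaning, and `pd.read_csv` reads empty fields as `NaN` by default. The sketch below shows the round trip on a hypothetical toy frame, using an in-memory buffer instead of a file.

```python
import io
import pandas as pd

# Hypothetical toy frame; the second tweet is empty after cleaning.
toy = pd.DataFrame({'target': [1, 0], 'clean_text': ['hey hey', '']})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default behaviour: the empty field comes back as NaN.
buffer.seek(0)
default_read = pd.read_csv(buffer)

# keep_default_na=False keeps it as an empty string.
buffer.seek(0)
kept_empty = pd.read_csv(buffer, keep_default_na=False)

print(default_read['clean_text'].isna().sum())  # 1
print(kept_empty['clean_text'].tolist())        # ['hey hey', '']
```

Depending on the model, either dropping such rows or reading them as empty strings may be appropriate.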