# Data Wrangling

In this notebook we:
1. Load data scraped from Twitter.
2. Clean the data by inspecting the tweets and removing unwanted tokens such as '\n' and URLs.

## Import Packages

In [1]:
import pandas as pd
import re

## Load Data

In [2]:
# Check how many csv files there are
data_dir = '../Data/*.csv'
! ls {data_dir}

../Data/tweets.csv          ../Data/tweets_eda.csv
../Data/tweets_cleansed.csv
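
The `! ls` call above depends on a Unix-like shell. As a minimal sketch of a portable alternative, the same listing can be produced with Python's standard `glob` module. The helper name `list_csv_files` is hypothetical and not part of the original notebook.

```python
import glob

def list_csv_files(pattern):
    """Return the sorted list of paths matching a glob pattern."""
    return sorted(glob.glob(pattern))

# Same pattern as in the cell above.
csv_files = list_csv_files('../Data/*.csv')
print(len(csv_files), 'csv file(s) found')
for path in csv_files:
    print(path)
```

If the directory does not exist or holds no CSV files, the function returns an empty list instead of raising an error.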


In [3]:
# Load data for the training set
df = pd.read_csv('../Data/tweets.csv')

In [4]:
pd.options.display.max_colwidth = 500
df.head(20)

Unnamed: 0,Tweets
0,I think there’s a first date going on near me and it’s a disaster:\n\n“You know Harry Potter?”\n“Not really.”\n“The movies?”\n“No.”\n“I’m a Hufflepuff.”\n“Congrats?”\n\nGET OUT OF THERE SWEET HUFFLEPUFF.
1,Sell all your houses. Stop flying private jets. Do movies for free. And give away your cash to the IRS - then we’ll talk
2,Starship Troopers (1997) is a deliciously ambiguous movie:\nliberal viewers think it's a hilarious satire of toxic masculinity & militarism;\nconservatives think it's a savage space-war parable with Red Pill dog-whistles.\nWhat other movies have this 'political Necker cube' quality?
3,Then this & they never ran. I felt like I was in a @Disney movie & they were letting me know my dog Diesel was in a better place.
4,Best Gary Oldman movie? (Pt. 2)
5,Find out how an unexpected practical effect brought the world of #Solo: A #StarWars Story to life. Enjoy the film digitally in HD and 4K Ultra HD and Movies Anywhere on September 14 and on Blu-ray on September 25.
6,"fauni just called me from jail and then he passed the phone to lil wop like ""aye this white boy adam wanna talk to you"". also fauni wants the people to know he's trying to get out in time to see the Predator movie"
7,Jamie Dornan and Matt Bomer both stopped by a lounge at #TIFF to promote their new movies this afternoon!
8,"Just saw that a Russian/South Korean team of scientist are creating a cloning lab where they will attempt to recreate Jurassic era animals with DNA found from long extinct creatures in the permafrost. Didn't they make a movie about this? Everything worked out in that, right?"
9,"My brother, Benjamin Rice, engineered and co-produced all of the music and vocals for this movie .... I’m blown away by these songs. So proud of him. Just pre-ordered the album on iTunes. Go get ittt #AStarIsBorn"


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15001 entries, 0 to 15000
Data columns (total 1 columns):
Tweets    15001 non-null object
dtypes: object(1)
memory usage: 117.3+ KB


## Remove Unwanted Tokens

Inspecting the tweets above shows many "\n" sequences, which represent newline characters. Let's remove them:

In [6]:
df['Tweets'] = df['Tweets'].apply(lambda x: x.replace('\n', ''))
df.head(20)

Unnamed: 0,Tweets
0,I think there’s a first date going on near me and it’s a disaster:“You know Harry Potter?”“Not really.”“The movies?”“No.”“I’m a Hufflepuff.”“Congrats?”GET OUT OF THERE SWEET HUFFLEPUFF.
1,Sell all your houses. Stop flying private jets. Do movies for free. And give away your cash to the IRS - then we’ll talk
2,Starship Troopers (1997) is a deliciously ambiguous movie:liberal viewers think it's a hilarious satire of toxic masculinity & militarism;conservatives think it's a savage space-war parable with Red Pill dog-whistles.What other movies have this 'political Necker cube' quality?
3,Then this & they never ran. I felt like I was in a @Disney movie & they were letting me know my dog Diesel was in a better place.
4,Best Gary Oldman movie? (Pt. 2)
5,Find out how an unexpected practical effect brought the world of #Solo: A #StarWars Story to life. Enjoy the film digitally in HD and 4K Ultra HD and Movies Anywhere on September 14 and on Blu-ray on September 25.
6,"fauni just called me from jail and then he passed the phone to lil wop like ""aye this white boy adam wanna talk to you"". also fauni wants the people to know he's trying to get out in time to see the Predator movie"
7,Jamie Dornan and Matt Bomer both stopped by a lounge at #TIFF to promote their new movies this afternoon!
8,"Just saw that a Russian/South Korean team of scientist are creating a cloning lab where they will attempt to recreate Jurassic era animals with DNA found from long extinct creatures in the permafrost. Didn't they make a movie about this? Everything worked out in that, right?"
9,"My brother, Benjamin Rice, engineered and co-produced all of the music and vocals for this movie .... I’m blown away by these songs. So proud of him. Just pre-ordered the album on iTunes. Go get ittt #AStarIsBorn"
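
Replacing "\n" with an empty string joins the text on either side of each line break, as in `disaster:“You know` in row 0. If word boundaries matter for later analysis, one option is to collapse each run of whitespace into a single space instead. The following is a minimal sketch on made-up strings, not the notebook's data; `collapse_whitespace` and `samples` are hypothetical names.

```python
import re

def collapse_whitespace(text):
    """Replace each run of whitespace (including newlines) with one space."""
    return re.sub(r'\s+', ' ', text).strip()

# Made-up sample strings for illustration only.
samples = ['first line:\n\nsecond line', 'no newline here']
cleaned = [collapse_whitespace(s) for s in samples]
print(cleaned)
```

On the notebook's frame, this could be applied as `df['Tweets'].apply(collapse_whitespace)`.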


There are many URLs as well. Let's remove them:

In [7]:
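def remove_url(text):
    # Raw string so the regex escapes reach `re` unchanged. The space was dropped from the
    # punctuation class: with it, a match would run past the URL into the following words.
    regex = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', re.IGNORECASE)
    return regex.sub('', text)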
    
df['Tweets'] = df['Tweets'].apply(remove_url)
df.head(20)

Unnamed: 0,Tweets
0,I think there’s a first date going on near me and it’s a disaster:“You know Harry Potter?”“Not really.”“The movies?”“No.”“I’m a Hufflepuff.”“Congrats?”GET OUT OF THERE SWEET HUFFLEPUFF.
1,Sell all your houses. Stop flying private jets. Do movies for free. And give away your cash to the IRS - then we’ll talk
2,Starship Troopers (1997) is a deliciously ambiguous movie:liberal viewers think it's a hilarious satire of toxic masculinity & militarism;conservatives think it's a savage space-war parable with Red Pill dog-whistles.What other movies have this 'political Necker cube' quality?
3,Then this & they never ran. I felt like I was in a @Disney movie & they were letting me know my dog Diesel was in a better place.
4,Best Gary Oldman movie? (Pt. 2)
5,Find out how an unexpected practical effect brought the world of #Solo: A #StarWars Story to life. Enjoy the film digitally in HD and 4K Ultra HD and Movies Anywhere on September 14 and on Blu-ray on September 25.
6,"fauni just called me from jail and then he passed the phone to lil wop like ""aye this white boy adam wanna talk to you"". also fauni wants the people to know he's trying to get out in time to see the Predator movie"
7,Jamie Dornan and Matt Bomer both stopped by a lounge at #TIFF to promote their new movies this afternoon!
8,"Just saw that a Russian/South Korean team of scientist are creating a cloning lab where they will attempt to recreate Jurassic era animals with DNA found from long extinct creatures in the permafrost. Didn't they make a movie about this? Everything worked out in that, right?"
9,"My brother, Benjamin Rice, engineered and co-produced all of the music and vocals for this movie .... I’m blown away by these songs. So proud of him. Just pre-ordered the album on iTunes. Go get ittt #AStarIsBorn"
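
None of the rows shown above contains a URL, so the effect of the pattern is not visible in this output. A quick way to check a URL pattern is to run it on a made-up string. The sketch below uses a simpler pattern, `https?://\S+`, which removes everything from the scheme up to the next whitespace, rather than the notebook's regex. The names `URL_PATTERN` and `strip_urls` and the sample text are hypothetical.

```python
import re

# Assumption: a URL ends at the first whitespace character.
URL_PATTERN = re.compile(r'https?://\S+', re.IGNORECASE)

def strip_urls(text):
    """Remove URLs, then collapse the doubled spaces they leave behind."""
    without_urls = URL_PATTERN.sub('', text)
    return re.sub(r' {2,}', ' ', without_urls).strip()

sample = 'Great trailer https://example.com/watch?v=1 must see'
print(strip_urls(sample))
```

Because `\S+` stops only at whitespace, punctuation attached to the end of a URL is removed along with it. That trade-off may or may not suit a given dataset.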


In [8]:
# Dump tweets to csv file for further analysis.
df.to_csv('../Data/tweets_cleansed.csv', index=False)