In [1]:
import pandas as pd
import numpy as np
import re

**Columns**

```
0 - the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
1 - the id of the tweet (2087)
2 - the date of the tweet (Sat May 16 23:58:44 UTC 2009)
3 - the query (lyx). If there is no query, then this value is NO_QUERY.
4 - the user that tweeted (robotickilldozr)
5 - the text of the tweet (Lyx is cool)
```

In [2]:
cols = ['polarity','id', 'date', 'query', 'user', 'tweet']

data = pd.read_csv('sentiment.csv',names=cols, encoding='ISO-8859-1')
print('length of data {}'.format(len(data)))

length of data 1600000


In [3]:
data[:5]

Unnamed: 0,polarity,id,date,query,user,tweet
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


#### 1.) Randomly sample 1% of the data (otherwise, the dataset is too big!)

In [4]:
random = data.sample(frac = 0.01)
random

Unnamed: 0,polarity,id,date,query,user,tweet
134607,0,1836294640,Mon May 18 07:44:26 PDT 2009,NO_QUERY,ellejonees,i think ... i really love you sweet boy
1352694,4,2046479548,Fri Jun 05 12:14:22 PDT 2009,NO_QUERY,chipcummings,Almost margarita time..... Too nice out - will...
1259845,4,1998207188,Mon Jun 01 18:10:12 PDT 2009,NO_QUERY,Upstatemomof3,@WaitingLisa thanks!! Those NEVER happen to yo...
1297993,4,2004703426,Tue Jun 02 08:36:43 PDT 2009,NO_QUERY,thegurl,House painting like a NIN/JA. Rawk
1238275,4,1993245241,Mon Jun 01 10:03:59 PDT 2009,NO_QUERY,SugaredFoxx,@gorjuss Happy Anniversary to you and Mr G
...,...,...,...,...,...,...
182859,0,1967280978,Fri May 29 19:17:34 PDT 2009,NO_QUERY,__nancy,@lavishness WOE.
826594,4,1556704784,Sat Apr 18 23:57:43 PDT 2009,NO_QUERY,linz8976,Almost done at work!!! Sucha great night!!
9285,0,1548690539,Fri Apr 17 21:40:33 PDT 2009,NO_QUERY,carabeth1989,Tied at 3 bottom of the 7th 1 out &lt;~*Liza...
372915,0,2050880970,Fri Jun 05 19:27:41 PDT 2009,NO_QUERY,shanley20,only 4more days till school. N i dont kno if i...
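
Because `sample` draws different rows on every run, the tables in this notebook will not reproduce exactly. A minimal sketch of how passing `random_state` pins the draw, assuming a small stand-in frame (`toy` and the seed value 42 are hypothetical, not part of the tweet data):

```python
import pandas as pd

# Hypothetical stand-in for the tweet data: 1,000 rows, two polarity labels.
toy = pd.DataFrame({
    'polarity': [0, 4] * 500,
    'tweet': ['tweet {}'.format(i) for i in range(1000)],
})

# frac=0.01 keeps 1% of the rows; random_state fixes the random draw.
sample_a = toy.sample(frac=0.01, random_state=42)
sample_b = toy.sample(frac=0.01, random_state=42)

print(len(sample_a))              # 10 rows (1% of 1,000)
print(sample_a.equals(sample_b))  # True: same seed, same rows
```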


#### 2.) Drop the id, date, query, and user columns

In [85]:
new_data = random.drop(columns = ['id', 'date', 'query', 'user'])
new_data

Unnamed: 0,polarity,tweet
134607,0,i think ... i really love you sweet boy
1352694,4,Almost margarita time..... Too nice out - will...
1259845,4,@WaitingLisa thanks!! Those NEVER happen to yo...
1297993,4,House painting like a NIN/JA. Rawk
1238275,4,@gorjuss Happy Anniversary to you and Mr G
...,...,...
182859,0,@lavishness WOE.
826594,4,Almost done at work!!! Sucha great night!!
9285,0,Tied at 3 bottom of the 7th 1 out &lt;~*Liza...
372915,0,only 4more days till school. N i dont kno if i...


#### 3.) Change all 4s in polarity to 1

- A lambda function might be useful

In [86]:
# Recode only the 4s; any other value (e.g., a neutral 2) is left unchanged.
new_data['polarity'] = new_data['polarity'].apply(lambda x: 1 if x == 4 else x)
new_data

Unnamed: 0,polarity,tweet
134607,0,i think ... i really love you sweet boy
1352694,1,Almost margarita time..... Too nice out - will...
1259845,1,@WaitingLisa thanks!! Those NEVER happen to yo...
1297993,1,House painting like a NIN/JA. Rawk
1238275,1,@gorjuss Happy Anniversary to you and Mr G
...,...,...
182859,0,@lavishness WOE.
826594,1,Almost done at work!!! Sucha great night!!
9285,0,Tied at 3 bottom of the 7th 1 out &lt;~*Liza...
372915,0,only 4more days till school. N i dont kno if i...
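
The column description lists a neutral label (2) as well, so it may be worth checking that a recoding step touches only the 4s. A small sketch on hypothetical labels (the `labels` series is made up for illustration), comparing `Series.replace` with a conditional that sends everything else to 0:

```python
import pandas as pd

# Hypothetical labels, including a neutral 2.
labels = pd.Series([0, 2, 4, 4, 0])

# replace() changes only the 4s and leaves every other value alone.
replaced = labels.replace(4, 1)

# A conditional with "else 0" also overwrites the neutral 2.
conditional = labels.apply(lambda x: 1 if x == 4 else 0)

print(replaced.tolist())     # [0, 2, 1, 1, 0]
print(conditional.tolist())  # [0, 0, 1, 1, 0]
```

On data that contains only 0 and 4, both forms give the same result.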


#### 4.) How many are there for each polarity?

- groupby might be useful

In [87]:
grouped = new_data.groupby(['polarity']).count()
grouped

Unnamed: 0_level_0,tweet
polarity,Unnamed: 1_level_1
0,8065
1,7935
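
As an alternative to `groupby`, `value_counts` gives the same per-label counts in one call. A short sketch on a hypothetical five-row frame (`toy` is illustrative only):

```python
import pandas as pd

# Hypothetical frame with the recoded 0/1 polarity labels.
toy = pd.DataFrame({
    'polarity': [0, 1, 1, 0, 1],
    'tweet': ['a', 'b', 'c', 'd', 'e'],
})

# value_counts() counts rows per label; sort_index() orders by label.
counts = toy['polarity'].value_counts().sort_index()

# The groupby form, restricted to one column, yields the same numbers.
by_group = toy.groupby('polarity')['tweet'].count()

print(counts.to_dict())    # {0: 2, 1: 3}
print(by_group.to_dict())  # {0: 2, 1: 3}
```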


#### 5.) Perform the following operations on the tweet column:

- create a new column that contains the tweet's words in all caps (e.g., HELLO) (for tweets without all-caps words, the row should be NaN)
- create a new column that contains the tweet's hashtags (#) (for tweets without hashtags, the row should be NaN)
- create a new column that contains the tweet's mentions (@) (for tweets without mentions, the row should be NaN)
- create a new column that contains the tweet's URLs (http) (for tweets without URLs, the row should be NaN)
- create a new column that contains the tweet's numbers (e.g., 55) (for tweets without numbers, the row should be NaN)
- create a new column that contains the original tweet in all lowercase


In [88]:
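# Split each tweet on whitespace, then filter the tokens for each feature.
# Note: tweets with no match get an empty list here rather than NaN.
regex_h = re.compile(r'#\w+')  # hashtag pattern; raw string avoids an invalid-escape warning
new_data['splittweet'] = new_data['tweet'].apply(lambda x: x.split())
new_data['capital'] = new_data['splittweet'].apply(lambda x: [i for i in x if i.isupper()])
new_data['mentions'] = new_data['splittweet'].apply(lambda x: [i for i in x if i.startswith('@')])
# 'no #' holds the hashtags
new_data['no #'] = new_data['splittweet'].apply(lambda x: [i for i in x if i.startswith('#')])
# a token is a URL if it starts with 'http' (comparing with == 'http' never matches a full URL)
new_data['urls'] = new_data['splittweet'].apply(lambda x: [i for i in x if i.startswith('http')])
new_data['numbers'] = new_data['splittweet'].apply(lambda x: [i for i in x if i.isdigit()])
new_data['lowercase'] = new_data['splittweet'].apply(lambda x: [i.lower() for i in x])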
new_data

Unnamed: 0,polarity,tweet,splittweet,capital,mentions,no #,urls,numbers,lowercase
134607,0,i think ... i really love you sweet boy,"[i, think, ..., i, really, love, you, sweet, boy]",[],[],[],[],[],"[i, think, ..., i, really, love, you, sweet, boy]"
1352694,1,Almost margarita time..... Too nice out - will...,"[Almost, margarita, time....., Too, nice, out,...",[],[],[],[],[],"[almost, margarita, time....., too, nice, out,..."
1259845,1,@WaitingLisa thanks!! Those NEVER happen to yo...,"[@WaitingLisa, thanks!!, Those, NEVER, happen,...","[NEVER, I]",[@WaitingLisa],[],[],[],"[@waitinglisa, thanks!!, those, never, happen,..."
1297993,1,House painting like a NIN/JA. Rawk,"[House, painting, like, a, NIN/JA., Rawk]",[NIN/JA.],[],[],[],[],"[house, painting, like, a, nin/ja., rawk]"
1238275,1,@gorjuss Happy Anniversary to you and Mr G,"[@gorjuss, Happy, Anniversary, to, you, and, M...",[G],[@gorjuss],[],[],[],"[@gorjuss, happy, anniversary, to, you, and, m..."
...,...,...,...,...,...,...,...,...,...
182859,0,@lavishness WOE.,"[@lavishness, WOE.]",[WOE.],[@lavishness],[],[],[],"[@lavishness, woe.]"
826594,1,Almost done at work!!! Sucha great night!!,"[Almost, done, at, work!!!, Sucha, great, nigh...",[],[],[],[],[],"[almost, done, at, work!!!, sucha, great, nigh..."
9285,0,Tied at 3 bottom of the 7th 1 out &lt;~*Liza...,"[Tied, at, 3, bottom, of, the, 7th, 1, out, &l...",[],[],[],[],"[3, 1]","[tied, at, 3, bottom, of, the, 7th, 1, out, &l..."
372915,0,only 4more days till school. N i dont kno if i...,"[only, 4more, days, till, school., N, i, dont,...","[N, B, N, N]",[],[],[],[],"[only, 4more, days, till, school., n, i, dont,..."
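
The task asks for NaN where a tweet has no match, and for the lowercase column to hold the tweet text itself. One possible way to get there is sketched here with `re.findall` on a hypothetical three-tweet frame. The helper `find_or_nan`, the toy tweets, and the exact patterns (for example, treating an all-caps word as two or more capital letters) are assumptions for illustration, not the only reasonable choices.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical tweets chosen to exercise each pattern.
toy = pd.DataFrame({'tweet': [
    'LOVE this #sunny day with @sam http://example.com',
    'nothing special here',
    'Room 55 is FULL',
]})

def find_or_nan(pattern, text):
    """Return every match of pattern in text, or NaN when there is none."""
    matches = re.findall(pattern, text)
    return matches if matches else np.nan

toy['capital'] = toy['tweet'].apply(lambda t: find_or_nan(r'\b[A-Z]{2,}\b', t))
toy['hashtags'] = toy['tweet'].apply(lambda t: find_or_nan(r'#\w+', t))
toy['mentions'] = toy['tweet'].apply(lambda t: find_or_nan(r'@\w+', t))
toy['urls'] = toy['tweet'].apply(lambda t: find_or_nan(r'https?://\S+', t))
toy['numbers'] = toy['tweet'].apply(lambda t: find_or_nan(r'\b\d+\b', t))
# The lowercase column keeps the tweet as a single string.
toy['lowercase'] = toy['tweet'].str.lower()

print(toy[['capital', 'hashtags', 'mentions', 'urls', 'numbers']])
```

Rows without a match now show NaN, so `toy['hashtags'].isna()` can be used to filter them.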


#### 6.) Stem all of the words

- Some help: [Learn Python Stemming](https://data-flair.training/blogs/python-stemming/)
- Stemming is the act of reducing a word to its stem. A stem is like a root for a word: for writing, the stem is write. A stem doesn't always have to be a word, though; study, studies, and studying all stem to studi, which isn't actually a word.
- Use the lowercase tweet column
- Create a new column called "stem"


In [103]:
# example: ps.stem('writing') returns 'write'
from nltk.stem import PorterStemmer
ps = PorterStemmer()
ps.stem('writing')
# 'lowercase' already holds a list of words per tweet, so stem each word in turn
new_data['stem'] = new_data['lowercase'].apply(lambda x: [ps.stem(i) for i in x])
# join the stemmed words back into a single space-separated string
new_data['join'] = new_data['stem'].apply(lambda x: ' '.join(x))
new_data

Unnamed: 0,polarity,tweet,splittweet,capital,mentions,no #,urls,numbers,lowercase,stem,join
134607,0,i think ... i really love you sweet boy,"[i, think, ..., i, really, love, you, sweet, boy]",[],[],[],[],[],"[i, think, ..., i, really, love, you, sweet, boy]","[i, think, ..., i, realli, love, you, sweet, boy]",i think ... i realli love you sweet boy
1352694,1,Almost margarita time..... Too nice out - will...,"[Almost, margarita, time....., Too, nice, out,...",[],[],[],[],[],"[almost, margarita, time....., too, nice, out,...","[almost, margarita, time....., too, nice, out,...",almost margarita time..... too nice out - will...
1259845,1,@WaitingLisa thanks!! Those NEVER happen to yo...,"[@WaitingLisa, thanks!!, Those, NEVER, happen,...","[NEVER, I]",[@WaitingLisa],[],[],[],"[@waitinglisa, thanks!!, those, never, happen,...","[@waitinglisa, thanks!!, those, never, happen,...",@waitinglisa thanks!! those never happen to yo...
1297993,1,House painting like a NIN/JA. Rawk,"[House, painting, like, a, NIN/JA., Rawk]",[NIN/JA.],[],[],[],[],"[house, painting, like, a, nin/ja., rawk]","[hous, paint, like, a, nin/ja., rawk]",hous paint like a nin/ja. rawk
1238275,1,@gorjuss Happy Anniversary to you and Mr G,"[@gorjuss, Happy, Anniversary, to, you, and, M...",[G],[@gorjuss],[],[],[],"[@gorjuss, happy, anniversary, to, you, and, m...","[@gorjuss, happi, anniversari, to, you, and, m...",@gorjuss happi anniversari to you and mr g
...,...,...,...,...,...,...,...,...,...,...,...
182859,0,@lavishness WOE.,"[@lavishness, WOE.]",[WOE.],[@lavishness],[],[],[],"[@lavishness, woe.]","[@lavish, woe.]",@lavish woe.
826594,1,Almost done at work!!! Sucha great night!!,"[Almost, done, at, work!!!, Sucha, great, nigh...",[],[],[],[],[],"[almost, done, at, work!!!, sucha, great, nigh...","[almost, done, at, work!!!, sucha, great, nigh...",almost done at work!!! sucha great night!!
9285,0,Tied at 3 bottom of the 7th 1 out &lt;~*Liza...,"[Tied, at, 3, bottom, of, the, 7th, 1, out, &l...",[],[],[],[],"[3, 1]","[tied, at, 3, bottom, of, the, 7th, 1, out, &l...","[tie, at, 3, bottom, of, the, 7th, 1, out, &lt...",tie at 3 bottom of the 7th 1 out &lt;~*lizabet...
372915,0,only 4more days till school. N i dont kno if i...,"[only, 4more, days, till, school., N, i, dont,...","[N, B, N, N]",[],[],[],[],"[only, 4more, days, till, school., n, i, dont,...","[onli, 4more, day, till, school., n, i, dont, ...",onli 4more day till school. n i dont kno if il...


That is, you'll need to call `ps.stem()` on each word for each tweet. 

Hints:

- Convert the lowercase tweets column to a list of strings (e.g., use the string split() function)
- Use a lambda function that steps through each row, then a loop that steps through each word in that row
- Use a join to convert the list of words back into a single string (e.g., ' '.join(list))
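
The hints above can be sketched in a few lines. This is a minimal example on a hypothetical two-row frame (`toy` and its contents are made up), assuming the lowercase column holds plain strings and that NLTK is installed:

```python
import pandas as pd
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Hypothetical lowercase tweets stored as single strings.
toy = pd.DataFrame({'lowercase': ['house painting', 'i really love writing']})

# For each row: split into words, stem each word, join back with spaces.
toy['stem'] = toy['lowercase'].apply(
    lambda text: ' '.join(ps.stem(word) for word in text.split())
)

print(toy['stem'].tolist())  # ['hous paint', 'i realli love write']
```

Joining with `' '` rather than `''` keeps the words separated in the result.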

#### 7.) Dump your dataframe to csv

In [104]:
new_data.to_csv('munged_data.csv')
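
One thing to keep in mind when dumping a frame like this: columns that hold Python lists are written to CSV as their string representation, so they come back as strings when the file is read again. A sketch of the round trip, using an in-memory buffer and a hypothetical two-row frame instead of `munged_data.csv`; `ast.literal_eval` is one way to recover the lists:

```python
import ast
import io

import pandas as pd

# Hypothetical frame with one list-valued column.
toy = pd.DataFrame({
    'polarity': [0, 1],
    'stem': [['hous', 'paint'], ['happi']],
})

# Write to an in-memory buffer; index=False leaves the row index out of the file.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

loaded = pd.read_csv(buffer)
as_read = loaded.loc[0, 'stem']
print(type(as_read))  # <class 'str'>

# Parse the string representation back into a real list.
loaded['stem'] = loaded['stem'].apply(ast.literal_eval)
print(loaded.loc[0, 'stem'])  # ['hous', 'paint']
```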