# 3.2 Skills/Wrangling

In this notebook, we will focus on two essential skills in data analysis:

1. The ability to select, aggregate and transform data in a dataframe (**part 1**)
2. The ability to get insights about a dataset by means of plotting and summary statistics (**part 2**)

## Part 1

### Imports

In [1]:
import pandas as pd

### Load dataset

Let's read in a CSV file containing [Elon Musk's tweets](https://twitter.com/elonmusk), exported from Twitter's API.

In [128]:
dataset_path = '../data/musk_tweets/elonmusk_tweets.csv'
df = pd.read_csv(dataset_path)

In [129]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2819 entries, 0 to 2818
Data columns (total 3 columns):
id            2819 non-null int64
created_at    2819 non-null object
text          2819 non-null object
dtypes: int64(1), object(2)
memory usage: 66.1+ KB


In [130]:
df.set_index('id', drop=True, inplace=True)

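Let's give this dataset a bit more structure:
- the `id` column becomes the dataframe's index (see the `set_index` call above), which lets us, for example, select a tweet by its id;
- `created_at` contains a timestamp, so it can easily be converted into a `datetime` value;
- but what's going on with the `text` column? Every value is wrapped in `b'...'`, which suggests the export stored the printed form of Python bytes objects rather than plain text.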
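One possible way to clean up the `text` column is sketched below. It assumes that every value is a valid Python bytes literal and that the bytes are UTF-8 encoded; neither is confirmed by the dataset itself. The helper name `decode_bytes_literal` is hypothetical.

```python
import ast

def decode_bytes_literal(raw_text):
    """Convert a string that looks like b'...' into the text it encodes."""
    # literal_eval safely parses the b'...' literal into a bytes object
    value = ast.literal_eval(raw_text)
    # Assumption: the bytes are UTF-8 encoded
    return value.decode('utf-8') if isinstance(value, bytes) else value

print(decode_bytes_literal("b'Stormy weather in Shortville ...'"))
```

On the full dataframe this could be applied with `df.text.apply(decode_bytes_literal)`, provided the assumptions above hold for every row.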

In [131]:
df.created_at = pd.to_datetime(df.created_at)

In [132]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2819 entries, 849636868052275200 to 15434727182
Data columns (total 2 columns):
created_at    2819 non-null datetime64[ns]
text          2819 non-null object
dtypes: datetime64[ns](1), object(1)
memory usage: 66.1+ KB
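As an alternative sketch, the three steps above (load, set the index, parse the timestamps) can be folded into a single `read_csv` call with its `index_col` and `parse_dates` parameters. The in-memory CSV below is a hypothetical two-row stand-in for `elonmusk_tweets.csv`, so the example runs without the file.

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for elonmusk_tweets.csv
sample_csv = io.StringIO(
    "id,created_at,text\n"
    "848943072423497728,2017-04-03 16:59:35,\"b'@waltmossberg @mims @defcon_5 Et tu, Walt?'\"\n"
    "848935705057280001,2017-04-03 16:30:19,b'Stormy weather in Shortville ...'\n"
)

# index_col sets the index while loading; parse_dates converts the column to datetime
sample_df = pd.read_csv(sample_csv, index_col='id', parse_dates=['created_at'])
sample_df.info()
```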


### Selection

An operation you'll find yourself doing very often on dataframes is renaming columns.

In [135]:
df.head()

| id | created_at | text |
|---|---|---|
| 849636868052275200 | 2017-04-05 14:56:29 | b'And so the robots spared humanity ... https:... |
| 848988730585096192 | 2017-04-03 20:01:01 | b"@ForIn2020 @waltmossberg @mims @defcon_5 Exa... |
| 848943072423497728 | 2017-04-03 16:59:35 | b'@waltmossberg @mims @defcon_5 Et tu, Walt?' |
| 848935705057280001 | 2017-04-03 16:30:19 | b'Stormy weather in Shortville ...' |
| 848416049573658624 | 2017-04-02 06:05:23 | b"@DaveLeeBBC @verge Coal is dying due to nat ... |


In [136]:
df.columns = ['created_at', 'tweet']

In [137]:
df.head()

| id | created_at | tweet |
|---|---|---|
| 849636868052275200 | 2017-04-05 14:56:29 | b'And so the robots spared humanity ... https:... |
| 848988730585096192 | 2017-04-03 20:01:01 | b"@ForIn2020 @waltmossberg @mims @defcon_5 Exa... |
| 848943072423497728 | 2017-04-03 16:59:35 | b'@waltmossberg @mims @defcon_5 Et tu, Walt?' |
| 848935705057280001 | 2017-04-03 16:30:19 | b'Stormy weather in Shortville ...' |
| 848416049573658624 | 2017-04-02 06:05:23 | b"@DaveLeeBBC @verge Coal is dying due to nat ... |


In [138]:
df = df.rename(columns={"tweet": "text"})

In [139]:
df.head()

| id | created_at | text |
|---|---|---|
| 849636868052275200 | 2017-04-05 14:56:29 | b'And so the robots spared humanity ... https:... |
| 848988730585096192 | 2017-04-03 20:01:01 | b"@ForIn2020 @waltmossberg @mims @defcon_5 Exa... |
| 848943072423497728 | 2017-04-03 16:59:35 | b'@waltmossberg @mims @defcon_5 Et tu, Walt?' |
| 848935705057280001 | 2017-04-03 16:30:19 | b'Stormy weather in Shortville ...' |
| 848416049573658624 | 2017-04-02 06:05:23 | b"@DaveLeeBBC @verge Coal is dying due to nat ... |
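Beyond renaming, selection usually means picking specific rows and columns. The sketch below illustrates three common patterns: a row by index label, a single column, and rows matching a boolean mask. It uses a hypothetical three-row miniature of the dataframe (`mini_df`), with values copied from the outputs in this notebook, so that it runs on its own.

```python
import pandas as pd

# Hypothetical miniature of the tweets dataframe
mini_df = pd.DataFrame(
    {
        'created_at': pd.to_datetime([
            '2017-04-03 16:59:35',
            '2017-04-03 16:30:19',
            '2011-12-03 08:22:07',
        ]),
        'text': [
            "b'@waltmossberg @mims @defcon_5 Et tu, Walt?'",
            "b'Stormy weather in Shortville ...'",
            "b'That was a total non sequitur btw'",
        ],
    },
    index=pd.Index(
        [848943072423497728, 848935705057280001, 142881284019060736], name='id'
    ),
)

# Select one row by its index label (the tweet id)
one_tweet = mini_df.loc[848935705057280001]

# Select a single column as a Series
texts = mini_df['text']

# Select rows with a boolean mask: tweets created in 2017
tweets_2017 = mini_df[mini_df['created_at'].dt.year == 2017]
```

Selecting by label with `.loc` is what setting `id` as the index makes possible.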


### Transformation

- add a column with link to original tweet
- add a column with links contained in tweet
- add a column with mentions
- add a column with tweet's length (chars)

In [124]:
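import re

def find_mentions(tweet_text):
    """Return all @handles found in a tweet, in order of appearance."""
    # Twitter handles: letters, digits and underscores, up to 15 characters
    handle_regexp = r'@[a-zA-Z0-9_]{1,15}'
    return re.findall(handle_regexp, tweet_text)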

In [125]:
df['tweet_mentions'] = df.text.apply(find_mentions)

In [126]:
df['n_mentions'] = df.tweet_mentions.apply(len)

In [127]:
df.head()

| id | created_at | text | tweet_mentions | n_mentions |
|---|---|---|---|---|
| 849636868052275200 | 2017-04-05 14:56:29 | b'And so the robots spared humanity ... https:... | [] | 0 |
| 848988730585096192 | 2017-04-03 20:01:01 | b"@ForIn2020 @waltmossberg @mims @defcon_5 Exa... | [@ForIn2020, @waltmossberg, @mims, @defcon_5] | 4 |
| 848943072423497728 | 2017-04-03 16:59:35 | b'@waltmossberg @mims @defcon_5 Et tu, Walt?' | [@waltmossberg, @mims, @defcon_5] | 3 |
| 848935705057280001 | 2017-04-03 16:30:19 | b'Stormy weather in Shortville ...' | [] | 0 |
| 848416049573658624 | 2017-04-02 06:05:23 | b"@DaveLeeBBC @verge Coal is dying due to nat ... | [@DaveLeeBBC, @verge] | 2 |
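The other transformations listed at the start of this section (links contained in the tweet, tweet length, link to the original tweet) can be sketched along the same lines. Everything below is illustrative: `demo_df`, its id and its text are hypothetical, the URL regular expression is a deliberate simplification, and the `https://twitter.com/elonmusk/status/<id>` pattern is assumed from the account URL used earlier.

```python
import re
import pandas as pd

# Hypothetical one-row stand-in for the tweets dataframe
demo_df = pd.DataFrame(
    {'text': ["b'Example tweet with a link https://example.com/page'"]},
    index=pd.Index([1234567890], name='id'),
)

def find_links(tweet_text):
    """Return all URLs found in a tweet (simplified pattern, an assumption)."""
    # http(s):// followed by anything that is not whitespace or a quote
    return re.findall(r"https?://[^\s'\"]+", tweet_text)

# Links contained in the tweet
demo_df['tweet_links'] = demo_df['text'].apply(find_links)

# Length in characters; note this still counts the b'...' wrapper
demo_df['n_chars'] = demo_df['text'].str.len()

# Link to the original tweet, built from the id stored in the index
base_url = 'https://twitter.com/elonmusk/status/'
demo_df['tweet_url'] = [f'{base_url}{tweet_id}' for tweet_id in demo_df.index]
```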


### Aggregation

- number of tweets by hour of the day
- number of tweets by day of the week

#### Grouping
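
A minimal sketch of the two aggregations listed above, assuming `created_at` has already been converted to `datetime`. The dataframe `ts_df` is a hypothetical five-row sample whose timestamps are copied from the outputs earlier in this notebook. Grouping uses the `.dt` accessor to derive the hour and the weekday name, then counts rows per group with `size()`.

```python
import pandas as pd

# Hypothetical sample of tweet timestamps
ts_df = pd.DataFrame({'created_at': pd.to_datetime([
    '2017-04-05 14:56:29',
    '2017-04-03 20:01:01',
    '2017-04-03 16:59:35',
    '2017-04-03 16:30:19',
    '2017-04-02 06:05:23',
])})

# Number of tweets by hour of the day (0-23)
tweets_by_hour = ts_df.groupby(ts_df['created_at'].dt.hour).size()

# Number of tweets by day of the week (Monday, Tuesday, ...)
tweets_by_weekday = ts_df.groupby(ts_df['created_at'].dt.day_name()).size()

print(tweets_by_hour)
print(tweets_by_weekday)
```

On the full dataset the same calls would apply to `df` directly, for example `df.groupby(df.created_at.dt.hour).size()`.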

#### Sorting

In [143]:
df.sort_values(by='created_at', ascending=True)

| id | created_at | text |
|---|---|---|
| 15434727182 | 2010-06-04 18:31:57 | b'Please ignore prior tweets, as that was some... |
| 142179928203460608 | 2011-12-01 09:55:11 | b"Went to Iceland on Sat to ride bumper cars o... |
| 142188458125963264 | 2011-12-01 10:29:04 | b'I made the volume on the Model S http://t.co... |
| 142880871391838208 | 2011-12-03 08:20:28 | b'Great Voltaire quote, arguably better than T... |
| 142881284019060736 | 2011-12-03 08:22:07 | b'That was a total non sequitur btw' |
| 143171132814671872 | 2011-12-04 03:33:52 | b'Am reading a great biography of Ben Franklin... |
| 149435658115612672 | 2011-12-21 10:26:51 | b'Read "Lying", the new book by my friend Sam ... |
| 149436471764459520 | 2011-12-21 10:30:05 | b'Sam Harris also wrote a nice piece on the aw... |
| 149439686702661632 | 2011-12-21 10:42:52 | b"Why does the crowd cry over the glorious lea... |
| 149441101684686848 | 2011-12-21 10:48:29 | b'His singing and acting talent will be sorely... |


## Part 2