# 2.3 Skills/Wrangling with 🐼

In this notebook, we will focus on one skill in data analysis, namely the ability to select, aggregate and transform data in a dataframe.

### Imports

In [1]:
import pandas as pd

### Load dataset

Let's read in a CSV file containing [Elon Musk's tweets](https://twitter.com/elonmusk), exported from Twitter's API.

In [20]:
dataset_path = '../data/musk_tweets/elonmusk_tweets.csv'
df = pd.read_csv(dataset_path)

In [21]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2819 entries, 0 to 2818
Data columns (total 3 columns):
id            2819 non-null int64
created_at    2819 non-null object
text          2819 non-null object
dtypes: int64(1), object(2)
memory usage: 66.1+ KB


In [22]:
df.set_index('id', drop=True, inplace=True)

Let's give this dataset a bit more structure:
- the `id` column can be turned into the dataframe's index, which lets us, for example, select a tweet by its id;
- `created_at` contains a timestamp, so it can easily be converted into a `datetime` value;
- but what's going on with the `text` column?
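As for the `text` column: judging from the values shown further down in this notebook, each entry looks like the printed form of a Python `bytes` object (`b'...'`) stored as a plain string. The sketch below is one possible way to turn such values back into readable text. It uses a small toy dataframe, and the helper name `decode_bytes_repr` and the column name `text_clean` are made up for illustration; it assumes every value is a valid bytes literal encoded as UTF-8.

```python
import ast

import pandas as pd

# Toy rows mimicking the CSV content: each value is the *string* "b'...'",
# i.e. the repr of a bytes object, not a bytes object itself.
toy = pd.DataFrame({
    "text": [
        "b'Stormy weather in Shortville ...'",
        "b\"that's irr\\xe2\\x80\\xa6\"",
    ]
})


def decode_bytes_repr(value):
    # hypothetical helper: literal_eval safely parses the repr back into
    # bytes, which are then decoded as UTF-8 (an assumption about the data)
    return ast.literal_eval(value).decode("utf-8")


toy["text_clean"] = toy["text"].map(decode_bytes_repr)
print(toy["text_clean"].tolist())
```

The rest of the notebook keeps working on the raw `text` column, so this step is optional.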

In [23]:
df.created_at = pd.to_datetime(df.created_at)

In [24]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2819 entries, 849636868052275200 to 15434727182
Data columns (total 2 columns):
created_at    2819 non-null datetime64[ns]
text          2819 non-null object
dtypes: datetime64[ns](1), object(1)
memory usage: 66.1+ KB


### Selection

#### Renaming columns

An operation on dataframes that you'll find yourself doing very often is renaming columns.

The first way of renaming columns is to assign directly to the dataframe's `columns` attribute, which holds the column index.

In [25]:
df.columns

Index(['created_at', 'text'], dtype='object')

We can change the column names by assigning to `columns` a list of the new column names.

**NB**: the length of the list must match the number of columns!

In [26]:
# here we rename the column `text` => `tweet`
df.columns = ['created_at', 'tweet']

In [27]:
# let's check that the change did take place
df.head()

Unnamed: 0_level_0,created_at,tweet
id,Unnamed: 1_level_1,Unnamed: 2_level_1
849636868052275200,2017-04-05 14:56:29,b'And so the robots spared humanity ... https://t.co/v7JUJQWfCv'
848988730585096192,2017-04-03 20:01:01,"b""@ForIn2020 @waltmossberg @mims @defcon_5 Exactly. Tesla is absurdly overvalued if based on the past, but that's irr\xe2\x80\xa6 https://t.co/qQcTqkzgMl"""
848943072423497728,2017-04-03 16:59:35,"b'@waltmossberg @mims @defcon_5 Et tu, Walt?'"
848935705057280001,2017-04-03 16:30:19,b'Stormy weather in Shortville ...'
848416049573658624,2017-04-02 06:05:23,"b""@DaveLeeBBC @verge Coal is dying due to nat gas fracking. It's basically dead."""


The second way of renaming columns is to use the `rename()` method of a dataframe. Its `columns` parameter takes a dictionary mapping old column names to new ones.

```python
mapping_dict = {
    "old_column_name": "new_column_name"
}
```

In [28]:
# let's change column `tweet` => `text`
df = df.rename(columns={"tweet": "text"})

In [29]:
df.head()

Unnamed: 0_level_0,created_at,text
id,Unnamed: 1_level_1,Unnamed: 2_level_1
849636868052275200,2017-04-05 14:56:29,b'And so the robots spared humanity ... https://t.co/v7JUJQWfCv'
848988730585096192,2017-04-03 20:01:01,"b""@ForIn2020 @waltmossberg @mims @defcon_5 Exactly. Tesla is absurdly overvalued if based on the past, but that's irr\xe2\x80\xa6 https://t.co/qQcTqkzgMl"""
848943072423497728,2017-04-03 16:59:35,"b'@waltmossberg @mims @defcon_5 Et tu, Walt?'"
848935705057280001,2017-04-03 16:30:19,b'Stormy weather in Shortville ...'
848416049573658624,2017-04-02 06:05:23,"b""@DaveLeeBBC @verge Coal is dying due to nat gas fracking. It's basically dead."""


**Question**: in which cases is it more convenient to use the second method over the first?
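One way to explore this question is to try both methods on a small toy dataframe (the column names `a`, `b`, `c` and `z` are placeholders, not part of the tweets dataset). The sketch suggests that assigning to `columns` requires listing every name, while `rename()` only needs the columns you want to change.

```python
import pandas as pd

toy = pd.DataFrame({"a": [1], "b": [2], "c": [3]})

# Method 1: every column name must be listed, in the right order
renamed_all = toy.copy()
renamed_all.columns = ["a", "b", "z"]

# Method 2: only the columns to change are mentioned;
# `rename` returns a new dataframe and leaves `toy` untouched
renamed_one = toy.rename(columns={"c": "z"})

print(list(renamed_all.columns))
print(list(renamed_one.columns))

# A list of the wrong length is rejected with a ValueError
try:
    toy.copy().columns = ["a", "b"]
except ValueError as err:
    print("ValueError:", err)
```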

#### Selecting columns

In [30]:
# this selects a single column and returns it as a Series
df["created_at"].head()

id
849636868052275200   2017-04-05 14:56:29
848988730585096192   2017-04-03 20:01:01
848943072423497728   2017-04-03 16:59:35
848935705057280001   2017-04-03 16:30:19
848416049573658624   2017-04-02 06:05:23
Name: created_at, dtype: datetime64[ns]

In [31]:
type(df["created_at"])

pandas.core.series.Series

In [32]:
# whereas this syntax selects a single column
# but returns a DataFrame
df[["created_at"]].head()

Unnamed: 0_level_0,created_at
id,Unnamed: 1_level_1
849636868052275200,2017-04-05 14:56:29
848988730585096192,2017-04-03 20:01:01
848943072423497728,2017-04-03 16:59:35
848935705057280001,2017-04-03 16:30:19
848416049573658624,2017-04-02 06:05:23


In [33]:
type(df[["created_at"]])

pandas.core.frame.DataFrame

####  Selecting rows

Filtering rows in `pandas` is done by means of `[ ]`, which accepts either a slice of row positions (e.g. `0:2`) or a boolean condition for the selection.
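As a minimal sketch of these options, the toy dataframe below (made-up ids and values) is filtered by position, by label and by condition. The `.iloc` and `.loc` accessors are not used elsewhere in this notebook; they are shown here only as the explicit position-based and label-based counterparts of `[ ]`.

```python
import pandas as pd

toy = pd.DataFrame(
    {"n_mentions": [0, 4, 3, 0]},
    index=pd.Index([101, 102, 103, 104], name="id"),
)

first_two = toy[0:2]                      # slice of row positions
same_rows = toy.iloc[0:2]                 # explicit position-based selection
by_label = toy.loc[[102]]                 # label-based selection (tweet id)
with_mentions = toy[toy.n_mentions > 0]   # boolean condition

print(first_two.index.tolist())
print(by_label.index.tolist())
print(with_mentions.index.tolist())
```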

In [419]:
df[0:2]

Unnamed: 0_level_0,created_at,text,tweet_link,week_day,day_hour,tweet_mentions,n_mentions,year,week_day_name
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
849636868052275200,2017-04-05 14:56:29,b'And so the robots spared humanity ... https://t.co/v7JUJQWfCv',https://twitter.com/i/web/status/849636868052275200,2,14,[],0,2017,Wednesday
848988730585096192,2017-04-03 20:01:01,"b""@ForIn2020 @waltmossberg @mims @defcon_5 Exactly. Tesla is absurdly overvalued if based on the past, but that's irr\xe2\x80\xa6 https://t.co/qQcTqkzgMl""",https://twitter.com/i/web/status/848988730585096192,0,20,"[@ForIn2020, @waltmossberg, @mims, @defcon_5]",4,2017,Monday


##### Numerical values

In [423]:
# equivalent of `df.query('n_mentions > 0')`

df[df.n_mentions > 0].shape

(1674, 9)

In [398]:
df[df.n_mentions <= 0].shape

(1145, 9)

##### Strings

In [401]:
df[df.week_day_name == 'Saturday'].shape

(426, 9)

In [402]:
df[df.week_day_name.str.startswith('S')].shape

(848, 9)

##### Multiple conditions

In [410]:
# AND condition with `&`

df[
    (df.week_day_name == 'Saturday') & (df.n_mentions == 0)
].shape

(187, 9)

In [430]:
# equivalent expression with `query()`

df.query("week_day_name == 'Saturday' and n_mentions == 0").shape

(187, 9)

In [412]:
# OR condition with `|`

df[
    (df.week_day_name == 'Saturday') | (df.n_mentions == 0)
].shape

(1384, 9)
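The conditions above are ordinary boolean Series, so they can be stored in variables and combined. The sketch below, on a made-up four-row dataframe, also shows negation with `~` and membership tests with `isin`, neither of which appears in the cells above. Each comparison needs its own parentheses when written inline because `&` and `|` bind more tightly than `==`.

```python
import pandas as pd

toy = pd.DataFrame({
    "week_day_name": ["Saturday", "Saturday", "Monday", "Sunday"],
    "n_mentions": [0, 2, 0, 1],
})

# each condition is a boolean Series aligned with the dataframe
is_saturday = toy.week_day_name == "Saturday"
no_mentions = toy.n_mentions == 0

both = toy[is_saturday & no_mentions]          # AND
either = toy[is_saturday | no_mentions]        # OR
neither = toy[~(is_saturday | no_mentions)]    # NOT
weekend = toy[toy.week_day_name.isin(["Saturday", "Sunday"])]

print(len(both), len(either), len(neither), len(weekend))
```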

### Transformation


The two main functions used to manipulate and transform values in a dataframe are:
- `map()`
- `apply()`

In this section we'll be using both to enrich our datasets with useful information (useful for exploration, for later visualizations, etc.).

#### Add link to original tweet

The `map()` method can be called on a column, as well as on the dataframe's index.

When passed as a parameter to `map`, an anonymous `lambda` function can be used to transform each value of that column (or index) into another one.
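A minimal sketch of `map()` on a toy dataframe follows. The ids, the `example.org` URL and the `day_names` dictionary are placeholders chosen for illustration. Besides a function, `Series.map()` also accepts a dictionary, which is handy for simple lookups.

```python
import pandas as pd

toy = pd.DataFrame(
    {"week_day": [2, 0, 6]},
    index=pd.Index([11, 12, 13], name="id"),
)

# map a function over the index (placeholder URL, not a real endpoint)
toy["link"] = toy.index.map(lambda x: f"https://example.org/status/{x}")

# map a dictionary over a column: each value is looked up as a key
day_names = {0: "Monday", 2: "Wednesday", 6: "Sunday"}
toy["week_day_name"] = toy["week_day"].map(day_names)

print(toy)
```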

In [141]:
df['tweet_link'] = df.index.map(lambda x: f'https://twitter.com/i/web/status/{x}')

In [142]:
pd.set_option("display.max_colwidth", 10000)

def make_clickable(val):
    # target _blank to open new window
    return '<a target="_blank" href="{}">{}</a>'.format(val, val)

df.head().style.format({'tweet_link': make_clickable})

# to apply the style to the entire dataframe just remove
# `.head` from the line above

Unnamed: 0_level_0,created_at,text,tweet_link,week_day,day_hour
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
849636868052275200,2017-04-05 14:56:29,b'And so the robots spared humanity ... https://t.co/v7JUJQWfCv',https://twitter.com/i/web/status/849636868052275200,Wednesday,14
848988730585096192,2017-04-03 20:01:01,"b""@ForIn2020 @waltmossberg @mims @defcon_5 Exactly. Tesla is absurdly overvalued if based on the past, but that's irr\xe2\x80\xa6 https://t.co/qQcTqkzgMl""",https://twitter.com/i/web/status/848988730585096192,Monday,20
848943072423497728,2017-04-03 16:59:35,"b'@waltmossberg @mims @defcon_5 Et tu, Walt?'",https://twitter.com/i/web/status/848943072423497728,Monday,16
848935705057280001,2017-04-03 16:30:19,b'Stormy weather in Shortville ...',https://twitter.com/i/web/status/848935705057280001,Monday,16
848416049573658624,2017-04-02 06:05:23,"b""@DaveLeeBBC @verge Coal is dying due to nat gas fracking. It's basically dead.""",https://twitter.com/i/web/status/848416049573658624,Sunday,6


#### Add columns with mentions

In [143]:
import re

def find_mentions(tweet_text):
    handle_regexp = r'@[a-zA-Z0-9_]{1,15}'
    return re.findall(handle_regexp, tweet_text)

In [144]:
df['tweet_mentions'] = df.text.apply(find_mentions)

In [145]:
df['n_mentions'] = df.tweet_mentions.apply(len)
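To see what the regular expression matches, here is the same function applied to three toy strings (the third one is made up to show a limitation). The pattern accepts an `@` followed by 1 to 15 letters, digits or underscores, so it would also pick up part of an email address; whether that matters for this dataset is not checked here.

```python
import re

import pandas as pd

HANDLE_REGEXP = r"@[a-zA-Z0-9_]{1,15}"


def find_mentions(tweet_text):
    # returns a list with every non-overlapping match, possibly empty
    return re.findall(HANDLE_REGEXP, tweet_text)


toy = pd.Series([
    "@waltmossberg @mims Et tu, Walt?",
    "Stormy weather in Shortville ...",
    "write to someone@example.com",  # made-up text: not a real mention
])

mentions = toy.apply(find_mentions)
n_mentions = mentions.apply(len)

print(mentions.tolist())
print(n_mentions.tolist())
```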

In [146]:
df.head()

Unnamed: 0_level_0,created_at,text,tweet_link,week_day,day_hour,tweet_mentions,n_mentions
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
849636868052275200,2017-04-05 14:56:29,b'And so the robots spared humanity ... https://t.co/v7JUJQWfCv',https://twitter.com/i/web/status/849636868052275200,Wednesday,14,[],0
848988730585096192,2017-04-03 20:01:01,"b""@ForIn2020 @waltmossberg @mims @defcon_5 Exactly. Tesla is absurdly overvalued if based on the past, but that's irr\xe2\x80\xa6 https://t.co/qQcTqkzgMl""",https://twitter.com/i/web/status/848988730585096192,Monday,20,"[@ForIn2020, @waltmossberg, @mims, @defcon_5]",4
848943072423497728,2017-04-03 16:59:35,"b'@waltmossberg @mims @defcon_5 Et tu, Walt?'",https://twitter.com/i/web/status/848943072423497728,Monday,16,"[@waltmossberg, @mims, @defcon_5]",3
848935705057280001,2017-04-03 16:30:19,b'Stormy weather in Shortville ...',https://twitter.com/i/web/status/848935705057280001,Monday,16,[],0
848416049573658624,2017-04-02 06:05:23,"b""@DaveLeeBBC @verge Coal is dying due to nat gas fracking. It's basically dead.""",https://twitter.com/i/web/status/848416049573658624,Sunday,6,"[@DaveLeeBBC, @verge]",2


#### Add column with week day

In [147]:
# `dt.weekday_name` was removed in pandas 1.0; `dt.day_name()` replaces it
df.created_at.dt.day_name().head()

id
849636868052275200    Wednesday
848988730585096192       Monday
848943072423497728       Monday
848935705057280001       Monday
848416049573658624       Sunday
Name: created_at, dtype: object

In [219]:
df["week_day_name"] = df.created_at.dt.day_name()

In [220]:
df["week_day"] = df.created_at.dt.weekday

In [221]:
df.head(3)

Unnamed: 0_level_0,created_at,text,tweet_link,week_day,day_hour,tweet_mentions,n_mentions,year,week_day_name
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
849636868052275200,2017-04-05 14:56:29,b'And so the robots spared humanity ... https://t.co/v7JUJQWfCv',https://twitter.com/i/web/status/849636868052275200,2,14,[],0,2017,Wednesday
848988730585096192,2017-04-03 20:01:01,"b""@ForIn2020 @waltmossberg @mims @defcon_5 Exactly. Tesla is absurdly overvalued if based on the past, but that's irr\xe2\x80\xa6 https://t.co/qQcTqkzgMl""",https://twitter.com/i/web/status/848988730585096192,0,20,"[@ForIn2020, @waltmossberg, @mims, @defcon_5]",4,2017,Monday
848943072423497728,2017-04-03 16:59:35,"b'@waltmossberg @mims @defcon_5 Et tu, Walt?'",https://twitter.com/i/web/status/848943072423497728,0,16,"[@waltmossberg, @mims, @defcon_5]",3,2017,Monday


#### Add column with day hour

In [150]:
df.created_at.dt?

In [151]:
df.created_at.dt.hour.head()

id
849636868052275200    14
848988730585096192    20
848943072423497728    16
848935705057280001    16
848416049573658624     6
Name: created_at, dtype: int64

In [152]:
df["day_hour"] = df.created_at.dt.hour
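The `.dt` accessor used in the last two sections can be tried out on a standalone Series. The sketch below takes three timestamps from the tweets shown earlier and extracts the same three parts; it assumes a pandas version that provides `dt.day_name()` (0.23 or later).

```python
import pandas as pd

stamps = pd.Series(pd.to_datetime([
    "2017-04-05 14:56:29",
    "2017-04-03 20:01:01",
    "2017-04-02 06:05:23",
]))

parts = pd.DataFrame({
    "week_day_name": stamps.dt.day_name(),  # e.g. "Wednesday"
    "week_day": stamps.dt.weekday,          # Monday = 0 ... Sunday = 6
    "day_hour": stamps.dt.hour,             # 0 to 23
})

print(parts)
```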

In [153]:
display_cols = ['created_at', 'week_day', 'day_hour']
df[display_cols].head(4)

Unnamed: 0_level_0,created_at,week_day,day_hour
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
849636868052275200,2017-04-05 14:56:29,Wednesday,14
848988730585096192,2017-04-03 20:01:01,Monday,20
848943072423497728,2017-04-03 16:59:35,Monday,16
848935705057280001,2017-04-03 16:30:19,Monday,16


### Aggregation

(Figure taken from W. McKinney's *Python for Data Analysis*, 2013, p. 252)

<img src='figures/groupby-mechanics.png' width='600px'>

In [297]:
df.agg({'n_mentions': ['min', 'max', 'sum']})

Unnamed: 0,n_mentions
min,0
max,6
sum,2277
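The split-apply-combine idea behind grouped aggregation can be spelled out by hand. The following sketch uses made-up toy data (columns `key` and `data`) and performs the three steps explicitly, then compares the result with the one-line `groupby` equivalent. It is meant as an illustration of the mechanics, not of how pandas implements them internally.

```python
import pandas as pd

toy = pd.DataFrame({
    "key": ["A", "B", "C", "A", "B", "C"],
    "data": [0, 5, 10, 5, 10, 15],
})

# 1. split: one sub-dataframe per distinct key
groups = {key: toy[toy.key == key] for key in toy.key.unique()}

# 2. apply: reduce each sub-dataframe to a single value
sums = {key: rows["data"].sum() for key, rows in groups.items()}

# 3. combine: collect the per-group results into one Series
manual = pd.Series(sums, name="data")

# the same computation in one step
one_step = toy.groupby("key")["data"].sum()

print(manual.to_dict())
print(one_step.to_dict())
```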


#### Grouping

In [253]:
df.groupby?

In [154]:
grp_by_day = df.groupby('week_day')

The object returned by `groupby` is a `DataFrameGroupBy`, **not** a normal `DataFrame`.

However, some methods of the latter also work on the former, e.g. `head` and `tail`.

In [155]:
# the head of a DataFrameGroupBy consists of the first
# n records for each group (see `help(grp_by_day.head)`)

grp_by_day.head(1)

Unnamed: 0_level_0,created_at,text,tweet_link,week_day,day_hour,tweet_mentions,n_mentions
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
849636868052275200,2017-04-05 14:56:29,b'And so the robots spared humanity ... https://t.co/v7JUJQWfCv',https://twitter.com/i/web/status/849636868052275200,Wednesday,14,[],0
848988730585096192,2017-04-03 20:01:01,"b""@ForIn2020 @waltmossberg @mims @defcon_5 Exactly. Tesla is absurdly overvalued if based on the past, but that's irr\xe2\x80\xa6 https://t.co/qQcTqkzgMl""",https://twitter.com/i/web/status/848988730585096192,Monday,20,"[@ForIn2020, @waltmossberg, @mims, @defcon_5]",4
848416049573658624,2017-04-02 06:05:23,"b""@DaveLeeBBC @verge Coal is dying due to nat gas fracking. It's basically dead.""",https://twitter.com/i/web/status/848416049573658624,Sunday,6,"[@DaveLeeBBC, @verge]",2
848244577521647616,2017-04-01 18:44:01,"b""Why did we waste so much time developing silly rockets? Damn you, aliens! So obtuse! You have all this crazy tech, but can't speak English!?""",https://twitter.com/i/web/status/848244577521647616,Saturday,18,[],0
847958571895619584,2017-03-31 23:47:32,b'@BadAstronomer We can def bring it back like Dragon. Just a question of how much weight we need to add.',https://twitter.com/i/web/status/847958571895619584,Friday,23,[@BadAstronomer],1
847594208219336705,2017-03-30 23:39:41,b'Incredibly proud of the SpaceX team for achieving this milestone in space! Next goal is reflight within 24 hours.',https://twitter.com/i/web/status/847594208219336705,Thursday,23,[],0
846772378067648513,2017-03-28 17:14:01,b'@danahull Very few. We have yet to do a China (or Europe) launch of Model 3.',https://twitter.com/i/web/status/846772378067648513,Tuesday,17,[@danahull],1


`agg` is used to pass an aggregation function to be applied to each group resulting from `groupby`.

In [156]:
# here we are interested in how many tweets
# there are for each group, so we pass `len`

grp_by_day.agg(len)

Unnamed: 0_level_0,created_at,text,tweet_link,day_hour,tweet_mentions,n_mentions
week_day,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Friday,530,530,530,530,530,530
Monday,315,315,315,315,315,315
Saturday,426,426,426,426,426,426
Sunday,422,422,422,422,422,422
Thursday,361,361,361,361,361,361
Tuesday,385,385,385,385,385,385
Wednesday,380,380,380,380,380,380


In [157]:
# however, we are not interested in having the count for all
# columns. rather we want to create a new dataframe with renamed
# column names

grp_by_day.agg({'text': len}).rename({'text': 'tweet_count'}, axis='columns')

Unnamed: 0_level_0,tweet_count
week_day,Unnamed: 1_level_1
Friday,530
Monday,315
Saturday,426
Sunday,422
Thursday,361
Tuesday,385
Wednesday,380
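Two alternatives for getting a per-group count with a custom name are sketched below on toy data. `size()` counts the rows of each group, while named aggregation (available, to the best of my knowledge, from pandas 0.25 onwards) sets the output column name directly inside `agg`. The column name `tweet_count` mirrors the one used above.

```python
import pandas as pd

toy = pd.DataFrame({
    "week_day": ["Friday", "Monday", "Friday", "Sunday"],
    "text": ["a", "b", "c", "d"],
})

# rows per group, returned as a Series
by_size = toy.groupby("week_day").size().rename("tweet_count")

# named aggregation: output_name=(input_column, function)
by_named = toy.groupby("week_day").agg(tweet_count=("text", "count"))

print(by_size)
print(by_named)
```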


##### By label (column)

Previously we added columns indicating the day of the week and the hour of the day at which a given tweet was posted. Here we group by the latter, `day_hour`.

In [259]:
grpby_result_as_series = df.groupby('day_hour')['text'].count()

In [261]:
grpby_result_as_series.head()

day_hour
0    144
1    183
2    119
3    104
4     98
Name: text, dtype: int64

In [128]:
grpby_result_as_df = df.groupby('day_hour')[['text']]\
    .count()\
    .rename({'text': 'count'}, axis='columns')

In [129]:
grpby_result_as_df.head()

Unnamed: 0_level_0,count
day_hour,Unnamed: 1_level_1
0,144
1,183
2,119
3,104
4,98


##### By series or dict

In [160]:
df.groupby?

In [161]:
for group, rows in df.groupby(df.created_at.dt.day):
    print(group, type(rows))

1 <class 'pandas.core.frame.DataFrame'>
2 <class 'pandas.core.frame.DataFrame'>
3 <class 'pandas.core.frame.DataFrame'>
4 <class 'pandas.core.frame.DataFrame'>
5 <class 'pandas.core.frame.DataFrame'>
6 <class 'pandas.core.frame.DataFrame'>
7 <class 'pandas.core.frame.DataFrame'>
8 <class 'pandas.core.frame.DataFrame'>
9 <class 'pandas.core.frame.DataFrame'>
10 <class 'pandas.core.frame.DataFrame'>
11 <class 'pandas.core.frame.DataFrame'>
12 <class 'pandas.core.frame.DataFrame'>
13 <class 'pandas.core.frame.DataFrame'>
14 <class 'pandas.core.frame.DataFrame'>
15 <class 'pandas.core.frame.DataFrame'>
16 <class 'pandas.core.frame.DataFrame'>
17 <class 'pandas.core.frame.DataFrame'>
18 <class 'pandas.core.frame.DataFrame'>
19 <class 'pandas.core.frame.DataFrame'>
20 <class 'pandas.core.frame.DataFrame'>
21 <class 'pandas.core.frame.DataFrame'>
22 <class 'pandas.core.frame.DataFrame'>
23 <class 'pandas.core.frame.DataFrame'>
24 <class 'pandas.core.frame.DataFrame'>
25 <class 'pandas.core.fr

In [162]:
# here we pass the groups as a series
df.groupby(df.created_at.dt.day).agg({'text':len}).head()

Unnamed: 0_level_0,text
created_at,Unnamed: 1_level_1
1,107
2,107
3,114
4,97
5,117


In [163]:
# here we pass the groups as a series
df.groupby(df.created_at.dt.day)[['text']].count().head()

Unnamed: 0_level_0,text
created_at,Unnamed: 1_level_1
1,107
2,107
3,114
4,97
5,117


In [164]:
df.groupby(df.created_at.dt.hour)[['text']].count().head()

Unnamed: 0_level_0,text
created_at,Unnamed: 1_level_1
0,144
1,183
2,119
3,104
4,98
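The cells above only pass a Series to `groupby`. For completeness, here is a hedged sketch of the dictionary variant on toy data: the dictionary maps index labels to group names. The ids `t1` to `t4`, the topic names and the `is_reply` flags are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame(
    {"n_mentions": [0, 4, 3, 1]},
    index=pd.Index(["t1", "t2", "t3", "t4"], name="id"),
)

# dict: index label -> group name (hypothetical topics)
topic_of = {"t1": "robots", "t2": "tesla", "t3": "tesla", "t4": "robots"}
by_dict = toy.groupby(topic_of)["n_mentions"].sum()

# Series: aligned with the dataframe's index, its values define the groups
is_reply = pd.Series([False, True, True, True], index=toy.index)
by_series = toy.groupby(is_reply)["n_mentions"].sum()

print(by_dict.to_dict())
print(by_series.to_dict())
```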


##### By multiple labels (columns)

In [165]:
# here we group based on the values of two columns
# instead of one

x = df.groupby(['week_day', 'day_hour'])[['text']].count()

In [166]:
x.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,text
week_day,day_hour,Unnamed: 2_level_1
Friday,0,28
Friday,1,29
Friday,2,20
Friday,3,20
Friday,4,15
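Grouping by two columns produces a result indexed by a two-level `MultiIndex`. The sketch below, on toy data, shows three common ways of working with such a result: selecting one group with a tuple, flattening the index back into columns, and pivoting the inner level into columns with `unstack`. The name `tweet_count` is chosen here for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "week_day": ["Friday", "Friday", "Friday", "Monday"],
    "day_hour": [0, 0, 1, 0],
    "text": ["a", "b", "c", "d"],
})

counts = toy.groupby(["week_day", "day_hour"])["text"].count()

# select a single (week_day, day_hour) combination with a tuple
friday_midnight = counts.loc[("Friday", 0)]

# turn the index levels back into ordinary columns
flat = counts.reset_index(name="tweet_count")

# one row per week day, one column per hour; missing pairs become 0
wide = counts.unstack(fill_value=0)

print(friday_midnight)
print(flat)
print(wide)
```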


#### Aggregation methods

**Summary**:

- `count`: Number of non-NA values
- `sum`: Sum of non-NA values
- `mean`: Mean of non-NA values
- `median`: Arithmetic median of non-NA values
- `std`, `var`: standard deviation and variance
- `min`, `max`: Minimum and maximum of non-NA values

They can be used on a single series:

In [370]:
df.n_mentions.max()

6

On the entire dataframe:

In [374]:
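# `numeric_only=True` skips the text and datetime columns
# (required in recent pandas versions, harmless in older ones)
df.mean(numeric_only=True)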

week_day         3.196169
day_hour        12.782547
n_mentions       0.807733
year          2014.777226
dtype: float64

Or also as aggregation functions within a groupby:

In [378]:
df.groupby('week_day').agg(
    {
        # each key in this dict specifies
        # a given column
        'n_mentions':[
            # the list contains aggregation functions
            # to be applied to this column
            'count',
            'mean',
            'min',
            'max',
            'std',
            'var'
        ]
    }
)

Unnamed: 0_level_0,n_mentions,n_mentions,n_mentions,n_mentions,n_mentions,n_mentions
Unnamed: 0_level_1,count,mean,min,max,std,var
week_day,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
0,315,0.812698,0,5,0.964248,0.929775
1,385,0.72987,0,6,0.874861,0.765381
2,380,0.786842,0,4,0.768328,0.590328
3,361,0.781163,0,4,0.832853,0.693644
4,530,0.879245,0,5,0.901869,0.813368
5,426,0.7277,0,5,0.800607,0.640972
6,422,0.907583,0,6,0.851707,0.725406
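The summary above describes these methods as working on non-NA values. A small sketch with a toy Series containing one missing value makes the difference visible: `len` counts every entry, whereas `count`, `sum` and `mean` skip the `NaN` unless told otherwise via `skipna=False`.

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, np.nan, 3.0])

n_entries = len(values)                  # counts every entry, NaN included
n_valid = values.count()                 # counts non-NA values only
total = values.sum()                     # NaN is skipped
average = values.mean()                  # mean of the two valid values
strict = values.mean(skipna=False)       # NaN propagates when not skipped

print(n_entries, n_valid, total, average, strict)
```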


#### Sorting

To sort the values of a dataframe we use its `sort_values` method:
- `by`: specifies the name of the column to be used for sorting
- `ascending` (default = `True`): specifies whether the sorting should be *ascending* (A-Z, 0-9) or *descending* (Z-A, 9-0)
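Both parameters also accept lists, which allows sorting by several columns with a different direction for each. A minimal sketch on toy data follows; note that `sort_values` returns a new dataframe and leaves the original order untouched.

```python
import pandas as pd

toy = pd.DataFrame({
    "n_mentions": [0, 6, 6, 2],
    "day_hour": [14, 20, 4, 6],
})

# most mentions first; ties broken by the earliest hour
ranked = toy.sort_values(
    by=["n_mentions", "day_hour"],
    ascending=[False, True],
)

print(ranked)
```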

In [167]:
df.sort_values(by='created_at', ascending=True).head()

Unnamed: 0_level_0,created_at,text,tweet_link,week_day,day_hour,tweet_mentions,n_mentions
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
15434727182,2010-06-04 18:31:57,"b'Please ignore prior tweets, as that was someone pretending to be me :) This is actually me.'",https://twitter.com/i/web/status/15434727182,Friday,18,[],0
142179928203460608,2011-12-01 09:55:11,"b""Went to Iceland on Sat to ride bumper cars on ice! No, not the country, Vlad's rink in Van Nuys. Awesome family fun :) http://t.co/rBQXJ9IT""",https://twitter.com/i/web/status/142179928203460608,Thursday,9,[],0
142188458125963264,2011-12-01 10:29:04,b'I made the volume on the Model S http://t.co/wMCnT53M go to 11. Now I just need to work in a miniature Stonehenge...',https://twitter.com/i/web/status/142188458125963264,Thursday,10,[],0
142880871391838208,2011-12-03 08:20:28,"b'Great Voltaire quote, arguably better than Twain. Hearing news of his own death, Voltaire replied the reports were true, only premature.'",https://twitter.com/i/web/status/142880871391838208,Saturday,8,[],0
142881284019060736,2011-12-03 08:22:07,b'That was a total non sequitur btw',https://twitter.com/i/web/status/142881284019060736,Saturday,8,[],0


In [168]:
df.sort_values(by='n_mentions', ascending=False).head()

Unnamed: 0_level_0,created_at,text,tweet_link,week_day,day_hour,tweet_mentions,n_mentions
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
752721031735812096,2016-07-12 04:27:33,"b""@kumailn @RealDaveBarton @JuddApatow @SiliconHBO @FredericLambert @MikeJudge can't wait to see the costume""",https://twitter.com/i/web/status/752721031735812096,Tuesday,4,"[@kumailn, @RealDaveBarton, @JuddApatow, @SiliconHBO, @FredericLambert, @MikeJudge]",6
303244501957365760,2013-02-17 20:48:17,b'RT @LawrenceChanin: @TeslaRoadTrip @avantgame @TEDchris @elonmusk @nytimes \nThe driving public is the winner thanks to the efforts of ...',https://twitter.com/i/web/status/303244501957365760,Sunday,20,"[@LawrenceChanin, @TeslaRoadTrip, @avantgame, @TEDchris, @elonmusk, @nytimes]",6
845289977051148289,2017-03-24 15:03:29,b'@faultywarrior @matt_trulli @FredericLambert @TimShelton @JimPengelly Let me just go fetch my magic wand ...',https://twitter.com/i/web/status/845289977051148289,Friday,15,"[@faultywarrior, @matt_trulli, @FredericLambert, @TimShelton, @JimPengelly]",5
672792504895434753,2015-12-04 15:00:07,"b""RT @WSJLife: Supermodel @KarlieKloss blasts off to the future at @elonmusk's @spacex HQ https://t.co/PfE2bWcQwM @wsjmag https://t.co/zc7QM9\xe2\x80\xa6""",https://twitter.com/i/web/status/672792504895434753,Friday,15,"[@WSJLife, @KarlieKloss, @elonmusk, @spacex, @wsjmag]",5
191005784862236672,2012-04-14 03:31:42,b'RT @SethGreen: My love @ClareGrant & @elonmusk partied with the space cheese on our tour of @SpaceX & @TeslaMotors ...which was AWESO\xe2\x80\xa6 h ...',https://twitter.com/i/web/status/191005784862236672,Saturday,3,"[@SethGreen, @ClareGrant, @elonmusk, @SpaceX, @TeslaMotors]",5


### Save

Before continuing with the plotting, let's save our enhanced dataframe, so that we can come back to it without having to redo the same manipulations on it.

`pandas` provides a number of handy functions to export dataframes in a variety of formats.

Here we use `to_pickle` to serialize the dataframe into a binary format, using Python's `pickle` library behind the scenes.

In [181]:
df.to_pickle("./musk_tweets_enhanced.pkl")

In [182]:
ls -al | grep pkl

-rw-r--r--   1 matteo  staff   602215 Jul 21 17:42 musk_tweets_enhanced.pkl
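To come back to the saved dataframe later, `pd.read_pickle` is the counterpart of `to_pickle`. The sketch below does a round trip with a toy dataframe in a temporary directory (the file name `toy.pkl` is arbitrary). One reason pickling suits this dataset is that it preserves Python objects such as the lists of mentions and the datetime dtype, which a CSV export would turn into plain strings. As with any pickle file, only load files from sources you trust.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({
    "created_at": pd.to_datetime(["2017-04-05 14:56:29"]),
    "tweet_mentions": [["@a", "@b"]],  # a list stored in a single cell
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.pkl")
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# dtypes and nested Python objects survive the round trip
print(restored.dtypes)
print(type(restored["tweet_mentions"][0]))
```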
