# Exploratory Data Analysis 
An approach to EDA:  
![image of the data flow showing visualization as an exploratory and iterative process](http://benbestphd.com/images/r4ds_data-science.png)

#### The goal of EDA is to discover patterns in data. This is a fundamental stepping stone towards predictive modelling, or an end goal in itself. 

Tips for good EDA:
- Get to know the context of the data.  
- Question the data: Who collected it? Who is distributing it? Do all of the patterns make sense given what you know about the world? If they don’t, go back and look more closely at your data.

- Use EDA to formulate a question based on the patterns that you see.
- Use EDA to check if a hypothesis is worth a deeper analysis.

- Keep the questions SIMPLE and BRIEF: the goal is to understand the data first and build complexity later.
- It's an iterative process; it's okay to repeat steps as long as you learn from the previous output.

In [1]:
# importing the libraries for data processing
import numpy as np 
import pandas as pd 


Read the CSV files.

In [25]:
df_chart = pd.read_csv('data/spotify_top200_tracks_ph.csv')

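# data cleaning commands from the past lesson
# keep rows with a track name; .copy() avoids a SettingWithCopyWarning on the assignment below
df_chart = df_chart[~df_chart['track_name'].isnull()].copy()
# regex=False treats the URL prefix as a literal string, not a pattern
df_chart['track_id'] = df_chart['URL'].str.replace('https://open.spotify.com/track/', '', regex=False)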
df_chart = df_chart.drop(columns = 'URL')
df_chart.head()

Unnamed: 0,date,position,track_name,artist,streams,track_id
0,2017-01-01,1,Versace on the Floor,Bruno Mars,185236,0kN8xEmgMW9mh7UmDYHlJP
1,2017-01-01,2,Say You Won't Let Go,James Arthur,180552,5uCax9HTNlzGybIStD3vDh
2,2017-01-01,3,Closer,The Chainsmokers,158720,7BKLCZ1jbUBVqRi2FVlTVw
3,2017-01-01,4,All We Know,The Chainsmokers,130874,2rizacJSyD9S1IQUxUxnsK
4,2017-01-01,5,Don't Wanna Know,Maroon 5,129656,5MFzQMkrl1FOOng9tq6R9r


In [3]:
df_tracks = pd.read_csv('data/spotify_track_data_top200_ph.csv')
df_tracks.head()

Unnamed: 0.1,Unnamed: 0,index,track_id,track_name,artist_id,album_id,duration,release_date,popularity,danceability,...,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_mins,release_year,release_month,release_day
0,0,0,0kN8xEmgMW9mh7UmDYHlJP,Versace on the Floor,0du5cEVh5yTK9QJze8zA0C,4PgleR09JVnm3zY1fW3XBA,261240,2016-11-17,75,0.578,...,0.0454,0.196,0.0,0.083,0.301,174.152,4.354,2016.0,11.0,17.0
1,1,1,5uCax9HTNlzGybIStD3vDh,Say You Won't Let Go,4IWBUUAFIplrNtaOHcJPRM,7oiJYvEJHsmYtrgviAVIBD,211466,2016-10-28,85,0.358,...,0.059,0.695,0.0,0.0902,0.494,85.043,3.524433,2016.0,10.0,28.0
2,2,2,7BKLCZ1jbUBVqRi2FVlTVw,Closer,69GGBxA162lTqCwzJG5jLp,0rSLgV8p5FzfnqlEk4GzxE,244960,2016-07-29,84,0.748,...,0.0338,0.414,0.0,0.111,0.661,95.01,4.082667,2016.0,7.0,29.0
3,3,3,2rizacJSyD9S1IQUxUxnsK,All We Know,69GGBxA162lTqCwzJG5jLp,0xmaV6EtJ4M3ebZUPRnhyb,194080,2016-09-29,69,0.662,...,0.0307,0.097,0.00272,0.115,0.296,90.0,3.234667,2016.0,9.0,29.0
4,4,4,5MFzQMkrl1FOOng9tq6R9r,Don't Wanna Know,04gDigrS5kc9YWfZHwBETP,0fvTn3WXF39kQs9i3bnNpP,214480,2016-10-11,14,0.783,...,0.08,0.338,0.0,0.0975,0.447,100.048,3.574667,2016.0,10.0,11.0


## 1. Combining using `pd.merge`

In [12]:
avengers_df = pd.DataFrame({'id': [101,102,103,104,105,106],
                            'character_name':['Iron Man','Thor','Captain America',\
                                              'Black Widow','Hulk','Hawkeye'],
                            'gender':['M','M','M','F','M','M']
                            })
celeb_df = pd.DataFrame({'id': [101,102,103,104,105,106,107,108],
                            'celeb_name':['Robert Downey Jr','Chris Hemsworth',\
                                          'Chris Evans','Scarlett Johansson',\
                                          'Mark Ruffalo','Jeremy Renner','Vic Sotto',\
                                         'Vice Ganda']
                            })

In [13]:
avengers_df

Unnamed: 0,id,character_name,gender
0,101,Iron Man,M
1,102,Thor,M
2,103,Captain America,M
3,104,Black Widow,F
4,105,Hulk,M
5,106,Hawkeye,M


In [14]:
celeb_df

Unnamed: 0,id,celeb_name
0,101,Robert Downey Jr
1,102,Chris Hemsworth
2,103,Chris Evans
3,104,Scarlett Johansson
4,105,Mark Ruffalo
5,106,Jeremy Renner
6,107,Vic Sotto
7,108,Vice Ganda


- Merging uses the **`on`** column as the key for the merge. When `on` is omitted, as in the code below, pandas joins on the columns that both data frames have in common, which here is `id`.
- Setting the argument **`how`** to 'inner' makes the merge keep only rows whose key occurs in both data frames.

In [15]:
merged_df = pd.merge(avengers_df, celeb_df, how='inner')
merged_df

Unnamed: 0,id,character_name,gender,celeb_name
0,101,Iron Man,M,Robert Downey Jr
1,102,Thor,M,Chris Hemsworth
2,103,Captain America,M,Chris Evans
3,104,Black Widow,F,Scarlett Johansson
4,105,Hulk,M,Mark Ruffalo
5,106,Hawkeye,M,Jeremy Renner


- The default value of the parameter `how` is 'inner'. The following code performs the same task as above.

In [16]:
merged_df = pd.merge(avengers_df, celeb_df)
merged_df

Unnamed: 0,id,character_name,gender,celeb_name
0,101,Iron Man,M,Robert Downey Jr
1,102,Thor,M,Chris Hemsworth
2,103,Captain America,M,Chris Evans
3,104,Black Widow,F,Scarlett Johansson
4,105,Hulk,M,Mark Ruffalo
5,106,Hawkeye,M,Jeremy Renner
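
Naming the key explicitly can make a merge easier to read, and it avoids pandas joining on every shared column when two frames happen to have more than one in common. A minimal sketch, using cut-down stand-ins for the two frames above (the `_demo` names are illustrative only):

```python
import pandas as pd

# Small stand-ins for avengers_df and celeb_df (names are illustrative)
avengers_demo = pd.DataFrame({'id': [101, 102, 103],
                              'character_name': ['Iron Man', 'Thor', 'Captain America']})
celeb_demo = pd.DataFrame({'id': [101, 102, 104],
                           'celeb_name': ['Robert Downey Jr', 'Chris Hemsworth',
                                          'Scarlett Johansson']})

# on='id' names the join key explicitly; how='inner' keeps only ids found in both frames
inner_demo = pd.merge(avengers_demo, celeb_demo, on='id', how='inner')
print(inner_demo)
```

Only ids 101 and 102 appear in both stand-in frames, so those are the only rows kept.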


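- To keep every row in `avengers_df`, set the parameter `how` to 'left'. Every `id` in `avengers_df` has a match in `celeb_df`, so the result here is the same as the inner merge.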

In [17]:
merged_df = pd.merge(avengers_df, celeb_df, how='left')
merged_df

Unnamed: 0,id,character_name,gender,celeb_name
0,101,Iron Man,M,Robert Downey Jr
1,102,Thor,M,Chris Hemsworth
2,103,Captain America,M,Chris Evans
3,104,Black Widow,F,Scarlett Johansson
4,105,Hulk,M,Mark Ruffalo
5,106,Hawkeye,M,Jeremy Renner


- To keep every row in `celeb_df`, set the parameter `how` to 'right'. Rows with no match in `avengers_df` get missing values in the `avengers_df` columns.

In [18]:
merged_df = pd.merge(avengers_df, celeb_df, how='right')
merged_df

Unnamed: 0,id,character_name,gender,celeb_name
0,101,Iron Man,M,Robert Downey Jr
1,102,Thor,M,Chris Hemsworth
2,103,Captain America,M,Chris Evans
3,104,Black Widow,F,Scarlett Johansson
4,105,Hulk,M,Mark Ruffalo
5,106,Hawkeye,M,Jeremy Renner
6,107,,,Vic Sotto
7,108,,,Vice Ganda


- To keep all rows from both dataframes, set the parameter `how` = 'outer'.

In [19]:
merged_df = pd.merge(avengers_df, celeb_df, how='outer')
merged_df

Unnamed: 0,id,character_name,gender,celeb_name
0,101,Iron Man,M,Robert Downey Jr
1,102,Thor,M,Chris Hemsworth
2,103,Captain America,M,Chris Evans
3,104,Black Widow,F,Scarlett Johansson
4,105,Hulk,M,Mark Ruffalo
5,106,Hawkeye,M,Jeremy Renner
6,107,,,Vic Sotto
7,108,,,Vice Ganda
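
When checking an outer merge, it can help to see which frame each row came from. One option is the `indicator` argument of `pd.merge`, sketched here on small stand-in frames (the `_demo` names are illustrative):

```python
import pandas as pd

left_demo = pd.DataFrame({'id': [101, 102],
                          'character_name': ['Iron Man', 'Thor']})
right_demo = pd.DataFrame({'id': [101, 102, 107],
                           'celeb_name': ['Robert Downey Jr', 'Chris Hemsworth', 'Vic Sotto']})

# indicator=True adds a '_merge' column with 'both', 'left_only' or 'right_only'
outer_demo = pd.merge(left_demo, right_demo, on='id', how='outer', indicator=True)

# Rows that exist only in the right frame (no matching character)
unmatched = outer_demo[outer_demo['_merge'] == 'right_only']
print(unmatched)
```

Filtering on `_merge` is a quick way to list keys that failed to match before deciding which `how` to use.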


-------

In [48]:
df_streams = df_chart.merge(df_tracks, on='track_id', how='left')
df_streams.head()

Unnamed: 0.1,date,position,track_name_x,artist,streams,track_id,Unnamed: 0,index,track_name_y,artist_id,...,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_mins,release_year,release_month,release_day
0,2017-01-01,1,Versace on the Floor,Bruno Mars,185236,0kN8xEmgMW9mh7UmDYHlJP,0,0,Versace on the Floor,0du5cEVh5yTK9QJze8zA0C,...,0.0454,0.196,0.0,0.083,0.301,174.152,4.354,2016.0,11.0,17.0
1,2017-01-01,2,Say You Won't Let Go,James Arthur,180552,5uCax9HTNlzGybIStD3vDh,1,1,Say You Won't Let Go,4IWBUUAFIplrNtaOHcJPRM,...,0.059,0.695,0.0,0.0902,0.494,85.043,3.524433,2016.0,10.0,28.0
2,2017-01-01,3,Closer,The Chainsmokers,158720,7BKLCZ1jbUBVqRi2FVlTVw,2,2,Closer,69GGBxA162lTqCwzJG5jLp,...,0.0338,0.414,0.0,0.111,0.661,95.01,4.082667,2016.0,7.0,29.0
3,2017-01-01,4,All We Know,The Chainsmokers,130874,2rizacJSyD9S1IQUxUxnsK,3,3,All We Know,69GGBxA162lTqCwzJG5jLp,...,0.0307,0.097,0.00272,0.115,0.296,90.0,3.234667,2016.0,9.0,29.0
4,2017-01-01,5,Don't Wanna Know,Maroon 5,129656,5MFzQMkrl1FOOng9tq6R9r,4,4,Don't Wanna Know,04gDigrS5kc9YWfZHwBETP,...,0.08,0.338,0.0,0.0975,0.447,100.048,3.574667,2016.0,10.0,11.0


In [49]:
# drop the duplicated track_name column that came from df_tracks
df_streams = df_streams.drop(columns='track_name_y')
# rename track_name_x back to track_name
df_streams = df_streams.rename(columns={'track_name_x':'track_name'})
df_streams = df_streams.drop(columns = ['Unnamed: 0', 'index'])
df_streams.head()

Unnamed: 0,date,position,track_name,artist,streams,track_id,artist_id,album_id,duration,release_date,...,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_mins,release_year,release_month,release_day
0,2017-01-01,1,Versace on the Floor,Bruno Mars,185236,0kN8xEmgMW9mh7UmDYHlJP,0du5cEVh5yTK9QJze8zA0C,4PgleR09JVnm3zY1fW3XBA,261240,2016-11-17,...,0.0454,0.196,0.0,0.083,0.301,174.152,4.354,2016.0,11.0,17.0
1,2017-01-01,2,Say You Won't Let Go,James Arthur,180552,5uCax9HTNlzGybIStD3vDh,4IWBUUAFIplrNtaOHcJPRM,7oiJYvEJHsmYtrgviAVIBD,211466,2016-10-28,...,0.059,0.695,0.0,0.0902,0.494,85.043,3.524433,2016.0,10.0,28.0
2,2017-01-01,3,Closer,The Chainsmokers,158720,7BKLCZ1jbUBVqRi2FVlTVw,69GGBxA162lTqCwzJG5jLp,0rSLgV8p5FzfnqlEk4GzxE,244960,2016-07-29,...,0.0338,0.414,0.0,0.111,0.661,95.01,4.082667,2016.0,7.0,29.0
3,2017-01-01,4,All We Know,The Chainsmokers,130874,2rizacJSyD9S1IQUxUxnsK,69GGBxA162lTqCwzJG5jLp,0xmaV6EtJ4M3ebZUPRnhyb,194080,2016-09-29,...,0.0307,0.097,0.00272,0.115,0.296,90.0,3.234667,2016.0,9.0,29.0
4,2017-01-01,5,Don't Wanna Know,Maroon 5,129656,5MFzQMkrl1FOOng9tq6R9r,04gDigrS5kc9YWfZHwBETP,0fvTn3WXF39kQs9i3bnNpP,214480,2016-10-11,...,0.08,0.338,0.0,0.0975,0.447,100.048,3.574667,2016.0,10.0,11.0
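
The `_x` and `_y` suffixes appear because both frames carry a `track_name` column. An alternative, sketched below on toy data, is to drop the redundant column from the right-hand frame before merging. The frames and values are made up for illustration, and `validate='many_to_one'` is an optional safeguard rather than something the original workflow uses:

```python
import pandas as pd

# Toy stand-ins for df_chart and df_tracks (values are made up)
chart_demo = pd.DataFrame({'track_id': ['a', 'a', 'b'],
                           'track_name': ['Song A', 'Song A', 'Song B'],
                           'streams': [10, 20, 30]})
tracks_demo = pd.DataFrame({'track_id': ['a', 'b'],
                            'track_name': ['Song A', 'Song B'],
                            'tempo': [100.0, 120.0]})

# Drop the column that would otherwise be duplicated as track_name_x / track_name_y
tracks_slim = tracks_demo.drop(columns='track_name')

# validate='many_to_one' raises a MergeError if track_id repeats in the right frame
streams_demo = chart_demo.merge(tracks_slim, on='track_id', how='left',
                                validate='many_to_one')
print(streams_demo.columns.tolist())
```

With this approach there is nothing to rename afterwards, and the `validate` check flags duplicated track rows that would otherwise silently multiply chart rows.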


Write `df_streams` to a file

In [50]:
df_streams.columns

Index(['date', 'position', 'track_name', 'artist', 'streams', 'track_id',
       'artist_id', 'album_id', 'duration', 'release_date', 'popularity',
       'danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness',
       'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo',
       'duration_mins', 'release_year', 'release_month', 'release_day'],
      dtype='object')

In [51]:
df_streams.to_csv('data/merged_chart_tracks.csv', index=False)

## Q&A

Q1: What are the top 10 songs in terms of total streams from 2017 to April 2020?

In [37]:
df_streams.groupby(['track_id','track_name'])['streams'].sum().sort_values(ascending=False)[:10]

track_id                track_name            
3WUEs51GpcvlgU7lehLgLh  Kathang Isip              104149008
2BgD4nRyx9EZ5o8YEnjRSV  Kung 'Di Rin Lang Ikaw    103032375
5uCax9HTNlzGybIStD3vDh  Say You Won't Let Go       94072731
1X4l4i472kW5ofFP8Xo0x0  Sana                       92532756
1yDiru08Q6omDOGkZMPnei  Maybe The Night            86217838
00mBzIWv5gHOYxwuEJXjOG  Sa Ngalan Ng Pag-Ibig      80741291
4u8RkgV6P4TLi89SmlUtv8  Mundo                      79751659
0tgVpDi06FyKpA1z0VMD4v  Perfect                    72842489
5f9808hpiCpuNyqqdXmpF2  Buwan                      70748311
7qiZfU4dY1lWllzX7mPBI3  Shape of You               70434067
Name: streams, dtype: int64

Q2: What's the mean tempo of the top 10 most streamed songs?

In [45]:
top10songs = df_streams.groupby(['track_id','track_name'])['streams'].sum()\
            .sort_values(ascending=False)[:10]\
            .reset_index()['track_id'].values
top10songs

array(['3WUEs51GpcvlgU7lehLgLh', '2BgD4nRyx9EZ5o8YEnjRSV',
       '5uCax9HTNlzGybIStD3vDh', '1X4l4i472kW5ofFP8Xo0x0',
       '1yDiru08Q6omDOGkZMPnei', '00mBzIWv5gHOYxwuEJXjOG',
       '4u8RkgV6P4TLi89SmlUtv8', '0tgVpDi06FyKpA1z0VMD4v',
       '5f9808hpiCpuNyqqdXmpF2', '7qiZfU4dY1lWllzX7mPBI3'], dtype=object)

In [46]:
df_streams[df_streams['track_id'].isin(top10songs)]['tempo'].mean()

112.7347515535611
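
Note that the expression above averages over every chart row of those tracks, so a song that charted on more days carries more weight. If the intent is one tempo value per song, a possible variant is to keep one row per `track_id` before averaging. The toy frame below is made up to show the difference:

```python
import pandas as pd

# Toy data: track 'a' charted on three days, track 'b' on one
streams_demo = pd.DataFrame({'track_id': ['a', 'a', 'a', 'b'],
                             'streams': [10, 10, 10, 5],
                             'tempo': [100.0, 100.0, 100.0, 140.0]})
top_ids = ['a', 'b']

top_rows = streams_demo[streams_demo['track_id'].isin(top_ids)]

# Row-level mean: track 'a' is counted three times
row_mean = top_rows['tempo'].mean()

# Per-track mean: keep one row per track before averaging
track_mean = top_rows.drop_duplicates('track_id')['tempo'].mean()

print(row_mean, track_mean)  # prints: 110.0 120.0
```

Which of the two numbers answers Q2 depends on whether "mean tempo" should be weighted by time spent on the chart.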

Q3: What are the top 5 “saddest” songs in the whole dataset? 

In [47]:
df_streams.sort_values('valence')[:5]

Unnamed: 0,date,position,track_name,artist,streams,track_id,artist_id,album_id,duration,release_date,...,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_mins,release_year,release_month,release_day
18989,2017-04-05,190,Midnight,Coldplay,21043,4GKk1uNzpxIptBuaY97Dkj,4gzpq5DPGxSnKTe4SA8HAU,2G4AUqfwxcV1UdQjm2ouYr,294666,2014-05-19,...,0.0355,0.615,0.808,0.0944,0.0349,126.976,4.9111,2014.0,5.0,19.0
18795,2017-04-04,196,Midnight,Coldplay,19750,4GKk1uNzpxIptBuaY97Dkj,4gzpq5DPGxSnKTe4SA8HAU,2G4AUqfwxcV1UdQjm2ouYr,294666,2014-05-19,...,0.0355,0.615,0.808,0.0944,0.0349,126.976,4.9111,2014.0,5.0,19.0
18997,2017-04-05,198,Always in My Head,Coldplay,20076,0FMjqbY3aWo1QDbo3GwXib,4gzpq5DPGxSnKTe4SA8HAU,2G4AUqfwxcV1UdQjm2ouYr,216626,2014-05-19,...,0.0254,0.0128,0.687,0.068,0.0397,97.544,3.610433,2014.0,5.0,19.0
139952,2018-12-04,163,I Always Wanna Die (Sometimes),The 1975,24410,7iPlcFvOMOzt6v0QvcAueZ,3mIj9lX2MWuHmhNCA7LSCW,6PWXKiakqhI17mTYM4y6oY,314662,2018-11-30,...,0.0325,0.00527,0.00162,0.123,0.0398,148.835,5.244367,2018.0,11.0,30.0
140154,2018-12-05,165,I Always Wanna Die (Sometimes),The 1975,24064,7iPlcFvOMOzt6v0QvcAueZ,3mIj9lX2MWuHmhNCA7LSCW,6PWXKiakqhI17mTYM4y6oY,314662,2018-11-30,...,0.0325,0.00527,0.00162,0.123,0.0398,148.835,5.244367,2018.0,11.0,30.0
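
Because `df_streams` has one row per track per chart date, the five rows above contain only three distinct songs. A sketch of one way to rank distinct tracks by `valence`, shown on a small stand-in frame (the `track_id` values are placeholders; the valence figures are taken from the outputs in this notebook):

```python
import pandas as pd

# Stand-in for df_streams: 'Midnight' appears on two chart dates
streams_demo = pd.DataFrame({'track_id': ['x', 'x', 'y', 'z'],
                             'track_name': ['Midnight', 'Midnight',
                                            'Always in My Head', 'Closer'],
                             'valence': [0.0349, 0.0349, 0.0397, 0.661]})

# One row per track, then the n rows with the lowest valence
saddest = (streams_demo
           .drop_duplicates('track_id')
           .nsmallest(2, 'valence')[['track_name', 'valence']])
print(saddest)
```

On the full data, the same pattern with `nsmallest(5, 'valence')` would return five different songs rather than repeated chart entries.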
