# Exploratory Data Analysis 1
An approach to EDA:  
![image of the data flow showing visualization as an exploratory and iterative process](http://benbestphd.com/images/r4ds_data-science.png)

#### The goal of EDA is to discover patterns in data. This is a fundamental stepping stone towards predictive modelling, or an end goal in itself. 

Tips for good EDA:
- Get to know the context of the data.  
- Question the data: Who collected it? Who is distributing it? Do all of the patterns make sense to what you know about the world? If they don’t, go back and look more closely at your data.

- Use EDA to formulate a question based on the patterns that you see.
- Use EDA to check if a hypothesis is worth a deeper analysis.

- Keep the questions SIMPLE and BRIEF- the goal is to understand and build complexity further on.
- Its an iterative process-- its okay to repeat things so long as you learn from previous output.

In [1]:
# importing the libraries for data processing
import numpy as np 
import pandas as pd 


### 1. Tidying our charts data

Read the csv file, check for missing, duplicated and unexpected values, and filtering if needed

In [2]:
# read the charts dataset
charts_df = pd.read_csv('data/spotify_daily_charts.csv')
charts_df.head()

Unnamed: 0,date,position,track_id,track_name,artist,streams
0,2018-01-01,1,0ofbQMrRDsUaVKq2mGLEAb,Havana,Camila Cabello,155633
1,2018-01-01,2,0tgVpDi06FyKpA1z0VMD4v,Perfect,Ed Sheeran,134756
2,2018-01-01,3,3hBBKuWJfxlIlnd9QFoC8k,What Lovers Do (feat. SZA),Maroon 5,130898
3,2018-01-01,4,1mXVgsBdtIVeCLJnSnmtdV,Too Good At Goodbyes,Sam Smith,130798
4,2018-01-01,5,2ekn2ttSfGqwhhate0LSR0,New Rules,Dua Lipa,125472


### Data Checks
It is prudent to do the following on a DataFrame before any analysis is made
1. Check shape
2. Check data types of columns
3. Check null values in columns
4. Check rows with null values
5. Check for duplicates

In [3]:
#Check the shape of the dataframe
charts_df.shape 

(197800, 6)

In [4]:
#Check data types of the columns
charts_df.dtypes

date          object
position       int64
track_id      object
track_name    object
artist        object
streams        int64
dtype: object

In [5]:
#Check null values in the columns
charts_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 197800 entries, 0 to 197799
Data columns (total 6 columns):
date          197800 non-null object
position      197800 non-null int64
track_id      197800 non-null object
track_name    197800 non-null object
artist        197800 non-null object
streams       197800 non-null int64
dtypes: int64(2), object(4)
memory usage: 9.1+ MB


In [12]:
#Check for duplicates
sum(charts_df.duplicated())

0

In [13]:
#check if unique values are expected
charts_df['position'].unique()

array([  1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,  13,
        14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,
        27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,  39,
        40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,  52,
        53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,  65,
        66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,  78,
        79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,  91,
        92,  93,  94,  95,  96,  97,  98,  99, 100, 101, 102, 103, 104,
       105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117,
       118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130,
       131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143,
       144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156,
       157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169,
       170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 18

In [14]:
len(charts_df['artist'].unique())

605

In [15]:
len(charts_df['track_name'].unique())

1826

In [16]:
len(charts_df['track_id'].unique())

2292

##### Convert date to datetime index
Pandas has a very useful method `pd.to_datetime` that smartly recognizes date and time columns and allows for easier time series techniques

In [17]:
#transform date column into a datetime column
charts_df['date'] = pd.to_datetime(charts_df['date'])
charts_df.head()

Unnamed: 0,date,position,track_id,track_name,artist,streams
0,2018-01-01,1,0ofbQMrRDsUaVKq2mGLEAb,Havana,Camila Cabello,155633
1,2018-01-01,2,0tgVpDi06FyKpA1z0VMD4v,Perfect,Ed Sheeran,134756
2,2018-01-01,3,3hBBKuWJfxlIlnd9QFoC8k,What Lovers Do (feat. SZA),Maroon 5,130898
3,2018-01-01,4,1mXVgsBdtIVeCLJnSnmtdV,Too Good At Goodbyes,Sam Smith,130798
4,2018-01-01,5,2ekn2ttSfGqwhhate0LSR0,New Rules,Dua Lipa,125472


In [18]:
#extract month 
charts_df['month']=charts_df['date'].dt.month
charts_df.head()

Unnamed: 0,date,position,track_id,track_name,artist,streams,month
0,2018-01-01,1,0ofbQMrRDsUaVKq2mGLEAb,Havana,Camila Cabello,155633,1
1,2018-01-01,2,0tgVpDi06FyKpA1z0VMD4v,Perfect,Ed Sheeran,134756,1
2,2018-01-01,3,3hBBKuWJfxlIlnd9QFoC8k,What Lovers Do (feat. SZA),Maroon 5,130898,1
3,2018-01-01,4,1mXVgsBdtIVeCLJnSnmtdV,Too Good At Goodbyes,Sam Smith,130798,1
4,2018-01-01,5,2ekn2ttSfGqwhhate0LSR0,New Rules,Dua Lipa,125472,1


In [19]:
#extract year
charts_df['year']=charts_df['date'].dt.year
# get day and day of week
charts_df['day']=charts_df['date'].dt.day
charts_df['day_of_week']=charts_df['date'].dt.dayofweek # The day of the week with Monday=0, Sunday=6.
charts_df.head()

Unnamed: 0,date,position,track_id,track_name,artist,streams,month,year,day,day_of_week
0,2018-01-01,1,0ofbQMrRDsUaVKq2mGLEAb,Havana,Camila Cabello,155633,1,2018,1,0
1,2018-01-01,2,0tgVpDi06FyKpA1z0VMD4v,Perfect,Ed Sheeran,134756,1,2018,1,0
2,2018-01-01,3,3hBBKuWJfxlIlnd9QFoC8k,What Lovers Do (feat. SZA),Maroon 5,130898,1,2018,1,0
3,2018-01-01,4,1mXVgsBdtIVeCLJnSnmtdV,Too Good At Goodbyes,Sam Smith,130798,1,2018,1,0
4,2018-01-01,5,2ekn2ttSfGqwhhate0LSR0,New Rules,Dua Lipa,125472,1,2018,1,0


### 2. Examining the charts data
Reshape and aggregate the DataFrame to answer basic data questions 

In [7]:
#Lets create tallies of each column using the `value_counts` method
charts_df['artist'].value_counts()[:10]

LANY                6559
Ben&Ben             5554
Ed Sheeran          5274
December Avenue     5087
Moira Dela Torre    5019
Lauv                4122
Taylor Swift        3256
Post Malone         3204
Ariana Grande       2969
Maroon 5            2821
Name: artist, dtype: int64

In [8]:
charts_df['track_name'].value_counts()

Happier                         1245
Your Song                       1023
Ang Huling El Bimbo             1009
Tadhana                          989
Kathang Isip                     989
                                ... 
Sweet Shadow                       1
The Bells At Christmas             1
Summer on You                      1
can’t bear to be without you       1
It Is the Lord!                    1
Name: track_name, Length: 1826, dtype: int64

In [9]:
#filtering columns
charts_df[charts_df['track_name']=='Happier']

Unnamed: 0,date,position,track_id,track_name,artist,streams
112,2018-01-01,113,2RttW7RAu5nOAfq6YFvApB,Happier,Ed Sheeran,22152
310,2018-01-02,111,2RttW7RAu5nOAfq6YFvApB,Happier,Ed Sheeran,27343
517,2018-01-03,118,2RttW7RAu5nOAfq6YFvApB,Happier,Ed Sheeran,27619
717,2018-01-04,118,2RttW7RAu5nOAfq6YFvApB,Happier,Ed Sheeran,27921
922,2018-01-05,123,2RttW7RAu5nOAfq6YFvApB,Happier,Ed Sheeran,27318
...,...,...,...,...,...,...
155183,2020-02-15,184,2dpaYNEQHiRxtZbfNsse99,Happier,Marshmello,28298
155396,2020-02-16,197,2dpaYNEQHiRxtZbfNsse99,Happier,Marshmello,26734
159395,2020-03-07,196,2dpaYNEQHiRxtZbfNsse99,Happier,Marshmello,26758
169590,2020-04-27,191,2RttW7RAu5nOAfq6YFvApB,Happier,Ed Sheeran,21555


> Q1. From top 50 most streamed, get top 20 most frequently occuring artists

In [10]:
charts_df[charts_df['position']<=50]['artist'].value_counts()[:20]

Ben&Ben                3369
December Avenue        2348
Moira Dela Torre       1771
Lauv                   1579
LANY                   1512
Post Malone            1111
Ariana Grande          1091
This Band              1085
Dua Lipa               1024
I Belong to the Zoo    1014
Ed Sheeran              888
Maroon 5                880
Bazzi                   851
Taylor Swift            841
Khalid                  785
Sam Smith               669
Matthaios               657
Russ                    585
Marshmello              562
PDL                     547
Name: artist, dtype: int64

> Q2. From top 50 list this year, get top 20 most frequently occuring artists

In [21]:
charts_df[(charts_df['position']>=50)&(charts_df['year']==2020)]['artist'].value_counts()[:20]

Taylor Swift          1336
December Avenue       1242
Lauv                  1231
Moira Dela Torre       957
LANY                   916
Ed Sheeran             891
Ben&Ben                763
BTS                    713
Sam Smith              654
This Band              621
TWICE                  584
Harry Styles           581
Hale                   518
South Border           515
Jason Mraz             514
Michael Pangilinan     481
Shawn Mendes           446
Post Malone            443
BLACKPINK              417
PDL                    417
Name: artist, dtype: int64

> Q3. On what positions did Taylor Swift land on the chart for 2019? What were her songs that landed first on the chart?

In [42]:
np.sort(charts_df[(charts_df['artist']=='Taylor Swift')&(charts_df['year']==2019)]['position'].unique())

array([  1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,  13,
        14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,
        27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,  39,
        40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,  52,
        53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,  65,
        66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,  78,
        79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,  91,
        92,  93,  94,  95,  96,  97,  98,  99, 100, 101, 102, 103, 104,
       105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117,
       118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130,
       131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143,
       144, 145, 146, 147, 148, 150, 152, 153, 154, 155, 156, 157, 158,
       159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171,
       172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 18

In [43]:
charts_df[(charts_df['artist']=='Taylor Swift')&\
                    (charts_df['year']==2019)&\
                    (charts_df['position']==1)]['track_name'].unique()

array(['ME! (feat. Brendon Urie of Panic! At The Disco)', 'Lover'],
      dtype=object)

### 3. Describing and Aggregating the charts dataset


Basic stats on the streams column using the `describe` method

In [50]:
charts_df['streams'].describe()

count    197800.000000
mean      52119.772958
std       42018.196845
min       15287.000000
25%       26968.000000
50%       34626.500000
75%       59291.250000
max      514546.000000
Name: streams, dtype: float64

The pandas GroupBy operator functions in the same way as pivot_table in excel

The syntax is:
```python
df.groupby('index_column')['agg_column'].aggfunc
df.groupby(['index_column1','index_column2']).agg('agg_column1':aggfunc1, 'agg_column2':aggfunc2)
```


Q: How many total streams did Spotify earn per year?

In [52]:
charts_df.groupby('year')['streams'].sum()   #inputting a column name string in agg_column outputs a Series

year
2018    3467089600
2019    4081571771
2020    2760629720
Name: streams, dtype: int64

In [53]:
charts_df.groupby('year')[['streams']].sum()   #inputting a list in agg_column outputs a DataFrame

Unnamed: 0_level_0,streams
year,Unnamed: 1_level_1
2018,3467089600
2019,4081571771
2020,2760629720


> Q: How many streams did each of the 200 positions contribute to the annual streams of spotify?

In [54]:
charts_df.groupby(['year','position'])[['streams']].sum()   #inputting a list in agg_column outputs a DataFrame

Unnamed: 0_level_0,Unnamed: 1_level_0,streams
year,position,Unnamed: 2_level_1
2018,1,84309969
2018,2,72077911
2018,3,66770318
2018,4,63507534
2018,5,60422355
...,...,...
2020,196,6249376
2020,197,6229542
2020,198,6210949
2020,199,6193245


> Q: What visualization would best suit the output of the cell above?

### 4. Combining two datasets

- What insights could we get from merging the charts and tracks datasets?

In [205]:
# read the tracks dataset
tracks_df = pd.read_csv('data/spotify_daily_charts_tracks.csv')
tracks_df.head()

Unnamed: 0,track_id,track_name,artist_id,artist_name,album_id,duration,release_date,popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo
0,0ofbQMrRDsUaVKq2mGLEAb,Havana,4nDoRrQiYLoBzwC5BhVJzF,Camila Cabello,5chBPOVY2I0bG5V3igb5QL,216896,2017-08-03,5,0.768,0.517,7,-4.323,0,0.0312,0.186,3.8e-05,0.104,0.418,104.992
1,0tgVpDi06FyKpA1z0VMD4v,Perfect,6eUKZXaKkcviH0Ku9w2n3V,Ed Sheeran,3T4tUhGYeRNVUGevb0wThu,263400,2017-03-03,86,0.599,0.448,8,-6.312,1,0.0232,0.163,0.0,0.106,0.168,95.05
2,3hBBKuWJfxlIlnd9QFoC8k,What Lovers Do (feat. SZA),04gDigrS5kc9YWfZHwBETP,Maroon 5,1Jmq5HEJeA9kNi2SgQul4U,199849,2017-11-03,5,0.795,0.615,5,-5.211,0,0.0671,0.0786,3e-06,0.0855,0.393,110.009
3,1mXVgsBdtIVeCLJnSnmtdV,Too Good At Goodbyes,2wY79sveU1sp5g7SokKOiI,Sam Smith,3TJz2UBNYJtlEly0sPeNrQ,201000,2017-11-03,81,0.681,0.372,5,-8.237,1,0.0432,0.64,0.0,0.169,0.476,91.873
4,2ekn2ttSfGqwhhate0LSR0,New Rules,6M2wZ9GZgrQXHCFfjv46we,Dua Lipa,01sfgrNbnnPUEyz6GZYlt9,209320,2017-06-02,81,0.762,0.7,9,-6.021,0,0.0694,0.00261,1.6e-05,0.153,0.608,116.073


In [206]:
df = charts_df.merge(tracks_df, on='track_id', how='left')
df.head()

Unnamed: 0,date,position,track_id,track_name_x,artist,streams,month,year,day,day_of_week,...,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo
0,2018-01-01,1,0ofbQMrRDsUaVKq2mGLEAb,Havana,Camila Cabello,155633,1,2018,1,0,...,0.517,7,-4.323,0,0.0312,0.186,3.8e-05,0.104,0.418,104.992
1,2018-01-01,2,0tgVpDi06FyKpA1z0VMD4v,Perfect,Ed Sheeran,134756,1,2018,1,0,...,0.448,8,-6.312,1,0.0232,0.163,0.0,0.106,0.168,95.05
2,2018-01-01,3,3hBBKuWJfxlIlnd9QFoC8k,What Lovers Do (feat. SZA),Maroon 5,130898,1,2018,1,0,...,0.615,5,-5.211,0,0.0671,0.0786,3e-06,0.0855,0.393,110.009
3,2018-01-01,4,1mXVgsBdtIVeCLJnSnmtdV,Too Good At Goodbyes,Sam Smith,130798,1,2018,1,0,...,0.372,5,-8.237,1,0.0432,0.64,0.0,0.169,0.476,91.873
4,2018-01-01,5,2ekn2ttSfGqwhhate0LSR0,New Rules,Dua Lipa,125472,1,2018,1,0,...,0.7,9,-6.021,0,0.0694,0.00261,1.6e-05,0.153,0.608,116.073


In [223]:
#Always check number of rows when performing merges
charts_df.shape, tracks_df.shape, df.shape

((197800, 10), (2292, 19), (197800, 27))

In [207]:
df.columns

Index(['date', 'position', 'track_id', 'track_name_x', 'artist', 'streams',
       'month', 'year', 'day', 'day_of_week', 'track_name_y', 'artist_id',
       'artist_name', 'album_id', 'duration', 'release_date', 'popularity',
       'danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness',
       'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo'],
      dtype='object')

In [208]:
#drop duplicated track_name column
df = df.drop(columns='track_name_y')
#rename trace_name x
df = df.rename(columns={'track_name_x':'track_name'})
df.head()

Unnamed: 0,date,position,track_id,track_name,artist,streams,month,year,day,day_of_week,...,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo
0,2018-01-01,1,0ofbQMrRDsUaVKq2mGLEAb,Havana,Camila Cabello,155633,1,2018,1,0,...,0.517,7,-4.323,0,0.0312,0.186,3.8e-05,0.104,0.418,104.992
1,2018-01-01,2,0tgVpDi06FyKpA1z0VMD4v,Perfect,Ed Sheeran,134756,1,2018,1,0,...,0.448,8,-6.312,1,0.0232,0.163,0.0,0.106,0.168,95.05
2,2018-01-01,3,3hBBKuWJfxlIlnd9QFoC8k,What Lovers Do (feat. SZA),Maroon 5,130898,1,2018,1,0,...,0.615,5,-5.211,0,0.0671,0.0786,3e-06,0.0855,0.393,110.009
3,2018-01-01,4,1mXVgsBdtIVeCLJnSnmtdV,Too Good At Goodbyes,Sam Smith,130798,1,2018,1,0,...,0.372,5,-8.237,1,0.0432,0.64,0.0,0.169,0.476,91.873
4,2018-01-01,5,2ekn2ttSfGqwhhate0LSR0,New Rules,Dua Lipa,125472,1,2018,1,0,...,0.7,9,-6.021,0,0.0694,0.00261,1.6e-05,0.153,0.608,116.073


In [209]:
#check if expected columns are present
df.columns

Index(['date', 'position', 'track_id', 'track_name', 'artist', 'streams',
       'month', 'year', 'day', 'day_of_week', 'artist_id', 'artist_name',
       'album_id', 'duration', 'release_date', 'popularity', 'danceability',
       'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness',
       'instrumentalness', 'liveness', 'valence', 'tempo'],
      dtype='object')

## Q&A

Q1: What are the top 10 songs in terms of total streams from 2018 to 2020?

In [210]:
# groupby tracks and sum streams, sort and get first 10 rows 
df.groupby(['track_id','track_name'])['streams'].sum().sort_values(ascending=False)[:10]

track_id                track_name            
3WUEs51GpcvlgU7lehLgLh  Kathang Isip              117417737
2BgD4nRyx9EZ5o8YEnjRSV  Kung 'Di Rin Lang Ikaw    111660025
1X4l4i472kW5ofFP8Xo0x0  Sana                      102183621
1yDiru08Q6omDOGkZMPnei  Maybe The Night            98601014
00mBzIWv5gHOYxwuEJXjOG  Sa Ngalan Ng Pag-Ibig      87436861
4u8RkgV6P4TLi89SmlUtv8  Mundo                      85182154
5f9808hpiCpuNyqqdXmpF2  Buwan                      75693387
5l9g7py8RCblcvbZgGQgSd  Pagtingin                  71359213
0Eqg0CQ7bK3RQIMPw1A7pl  Malibu Nights              65806253
76cy1WJvNGJTj78UqeA5zr  IDGAF                      64416473
Name: streams, dtype: int64

Q2: Whats the mean tempo of the top 10 most streamed songs?

In [211]:
top10songs = df.groupby(['track_id','track_name'])['streams'].sum()\
            .sort_values(ascending=False)[:10]\
            .reset_index()['track_id'].values
top10songs

array(['3WUEs51GpcvlgU7lehLgLh', '2BgD4nRyx9EZ5o8YEnjRSV',
       '1X4l4i472kW5ofFP8Xo0x0', '1yDiru08Q6omDOGkZMPnei',
       '00mBzIWv5gHOYxwuEJXjOG', '4u8RkgV6P4TLi89SmlUtv8',
       '5f9808hpiCpuNyqqdXmpF2', '5l9g7py8RCblcvbZgGQgSd',
       '0Eqg0CQ7bK3RQIMPw1A7pl', '76cy1WJvNGJTj78UqeA5zr'], dtype=object)

In [212]:
#isin selects elements in list
df[df['track_id'].isin(top10songs)]['tempo'].mean() #in bpm

116.29211823016844

Q2a. Follow-up: How does this compare with the mean tempo of the rest of the songs?

In [213]:
#use ~ to negate
df[~df['track_id'].isin(top10songs)]['tempo'].mean() #in bpm

117.11312242520918

Q3: What are the top 5 “saddest” charting songs for 2020? 

In [221]:
#filter by year, drop duplicates for track, sort valence from least to greatest,get first 5 indices 
df[df['year']==2020].drop_duplicates(['track_id','track_name']).sort_values('valence')[:5][['track_name','artist']]

Unnamed: 0,track_name,artist
187135,Delicate,Taylor Swift
154863,No Time To Die,Billie Eilish
161793,Oceans (Where Feet May Fail),Hillsong UNITED
175853,Chromatica I,Lady Gaga
146057,Falling,Harry Styles


### Plain tables as output?
1. Tables are simple fast answers to simple fast questions
2. Tables are very useful for troubleshooting. The numbers often reveal if there was something wrong with the data source/processing
3. In most office setups, analtyics output are often offtaked by another team (e.g. market segments group -> finance for sales projections). As it could be readily plugged into their computations, they usually prefer tables instead of deployed products.