# Exploratory Data Analysis 2B
## *Pandas Time Series Techniques*

In [None]:
# importing the libraries for data processing
import numpy as np 
import pandas as pd 

#matplotlib for visualizations
import matplotlib.pyplot as plt


### 1. Data Preparation
Merge the charts and tracks datasets, repeating the process from the previous notebook.

In [None]:
# read and process the charts dataset
charts_df = pd.read_csv('data/spotify_daily_charts.csv')
#transform date column into a datetime column
charts_df['date'] = pd.to_datetime(charts_df['date'])
charts_df.head()

In [None]:
# read and process the tracks dataset
tracks_df = pd.read_csv('data/spotify_daily_charts_tracks.csv')
tracks_df.head()

In [None]:
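# left-join the track metadata onto the daily chart rows
df = charts_df.merge(tracks_df, on='track_id', how='left')

# both tables carry track_name; keep the charts copy and drop the duplicate
df = df.drop(columns='track_name_y')
df = df.rename(columns={'track_name_x':'track_name'})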
df.head()
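
The merge above can be sketched on two miniature tables. The frames `charts` and `tracks`, their values, and the `duration_ms` column are hypothetical stand-ins for the real CSV files; `validate='many_to_one'` is an optional safeguard that is not used in the notebook.

```python
import pandas as pd

# Hypothetical miniature versions of the charts and tracks tables
charts = pd.DataFrame({
    "date": pd.to_datetime(["2020-01-01", "2020-01-01", "2020-01-02"]),
    "track_id": ["a", "b", "a"],
    "track_name": ["Song A", "Song B", "Song A"],
    "streams": [100, 80, 120],
})
tracks = pd.DataFrame({
    "track_id": ["a", "b"],
    "track_name": ["Song A", "Song B"],
    "duration_ms": [200000, 180000],
})

# how='left' keeps every chart row; validate='many_to_one' raises an error
# if a track_id appears more than once in the tracks table
merged = charts.merge(tracks, on="track_id", how="left", validate="many_to_one")

# The column shared by both tables comes back with _x / _y suffixes
print(merged.columns.tolist())
```

Because `track_name` exists in both tables but is not a join key, pandas suffixes the two copies, which is why the notebook drops `track_name_y` and renames `track_name_x`.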

### 2. Seasonal and Trend decomposition using Loess (STL) Method

From [Forecasting: Principles and Practice](https://otexts.com/fpp2/stl.html) (Hyndman & Athanasopoulos):

STL is a versatile and robust method for decomposing time series. 

STL is an acronym for “Seasonal and Trend decomposition using Loess”, while Loess is a method for estimating nonlinear relationships. 

The STL method was developed by Cleveland, Cleveland, McRae, & Terpenning (1990).

STL has several advantages over the classical decomposition methods:

 - STL will handle any type of seasonality, not only monthly and quarterly data.

 - The seasonal component is allowed to change over time, and the rate of change can be controlled by the user.

 - The smoothness of the trend-cycle can also be controlled by the user.

 - It can be robust to outliers (i.e., the user can specify a robust decomposition), so that occasional unusual observations will not affect the estimates of the trend-cycle and seasonal components. They will, however, affect the remainder component.

In [None]:
import statsmodels.api as sm
from statsmodels import datasets


# weekly atmospheric CO2 measurements from Mauna Loa, bundled with statsmodels
sample_df = datasets.co2.load_pandas().data

In [None]:
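# fill gaps in the weekly series by linear interpolation, then decompose it
# note: seasonal_decompose is the classical moving-average decomposition, not STL
res = sm.tsa.seasonal_decompose(sample_df['co2'].interpolate())
resplot = res.plot()
plt.suptitle("Mauna Loa Weekly Atmospheric CO2 Concentration", y=1.01)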

Q: What does the decomposition look like for the top-streamed artist Ben&Ben?

In [None]:

# get all dates that appear in the charts
data1 = pd.DataFrame({'date': pd.unique(df['date'])}).set_index('date')
# get total streams of all charting songs of the artist per day
artist_streams = df[df['artist']=='Ben&Ben'].groupby('date')['streams'].sum()
# align with the complete set of dates (days without a charting song become NaN)
data1['streams'] = artist_streams
# fill days with no charting songs with 0
data1['streams'] = data1['streams'].fillna(0)
data1.plot()
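
An alternative way to build the complete daily series is to reindex against an explicit calendar. This is a sketch on hypothetical values, not the notebook's method. It also covers days that are missing from the charts file altogether, and it leaves the index with an explicit daily frequency.

```python
import pandas as pd

# Hypothetical daily totals with a gap on 2020-01-03 and 2020-01-04
streams = pd.Series(
    [500, 450, 300],
    index=pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-05"]),
    name="streams",
)

# Calendar of every day between the first and last observation
full_index = pd.date_range(streams.index.min(), streams.index.max(), freq="D")

# Days absent from the original series are filled with 0 streams
daily = streams.reindex(full_index, fill_value=0)
print(daily)
```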

In [None]:
# no period is given, so statsmodels infers it from the date index (7 for daily data)
res = sm.tsa.seasonal_decompose(data1['streams'])
resplot = res.plot()

In [None]:
#get each component 
data_decomposed = data1.copy()
data_decomposed['trend_component'] = res.trend
data_decomposed['seasonal_component'] = res.seasonal
data_decomposed['residual_component'] = res.resid

data_decomposed.tail(400).plot()

Q: How about Jose Mari Chan, an artist known for Christmas songs?

In [None]:
# same preparation as for Ben&Ben: total daily streams, with 0 on days off the chart
data2 = pd.DataFrame({'date':pd.unique(df['date'])}).set_index('date')
artist_streams = df[df['artist']=='Jose Mari Chan'].groupby('date')[['streams']].sum()
data2['streams']=artist_streams
data2['streams']=data2['streams'].fillna(0)
data2.plot()

In [None]:
# decompose the series and collect the components in one dataframe
res = sm.tsa.seasonal_decompose(data2['streams'])
data_decomposed = data2.copy()
data_decomposed['trend_component'] = res.trend
data_decomposed['seasonal_component'] = res.seasonal
data_decomposed['residual_component'] = res.resid
# ratio of seasonal to trend component (NaN where the trend is NaN, infinite where it is 0)
data_decomposed['season_strength'] = data_decomposed['seasonal_component']/data_decomposed['trend_component']
data_decomposed[['streams','trend_component']].tail(400).plot()

In [None]:
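# Ben&Ben: decompose the day-over-day change in streams (first difference) instead of the raw series
res1 = sm.tsa.seasonal_decompose(data1['streams'].diff()[1:])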
data_decomposed = data1[1:].copy()
data_decomposed['trend_component'] = res1.trend
data_decomposed['seasonal_component'] = res1.seasonal
data_decomposed['residual_component'] = res1.resid
data_decomposed['season_strength'] = data_decomposed['seasonal_component']/data_decomposed['trend_component']
data_decomposed[['streams','seasonal_component','trend_component']].tail(400).plot()

In [None]:
# Jose Mari Chan: same decomposition on the first-differenced series
res2 = sm.tsa.seasonal_decompose(data2['streams'].diff()[1:])
data_decomposed = data2[1:].copy()
data_decomposed['trend_component'] = res2.trend
data_decomposed['seasonal_component'] = res2.seasonal
data_decomposed['residual_component'] = res2.resid
data_decomposed['season_strength'] = data_decomposed['seasonal_component']/data_decomposed['trend_component']
data_decomposed[['seasonal_component','trend_component']].tail(400).plot()

## 3. Autocorrelation and Partial Autocorrelation Functions

In [None]:
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

### Stationarity and differencing time series data

A *stationary time series* is one whose statistical properties such as mean, variance, autocorrelation, etc. are all constant over time. Most statistical forecasting methods are based on the assumption that the time series can be rendered approximately stationary (i.e., “stationarized”) through the use of mathematical transformations. 

Lagged differencing is a simple transformation that can remove a trend (using a lag of 1) or a seasonal component (using a lag equal to the seasonal period). A lagged difference over an interval $n$ is the difference between the value at the current time $t$ and the value at the earlier time $t-n$, i.e. $y_t - y_{t-n}$.

This is easily done in pandas using the `diff()` method, which uses a lag of 1 by default.

In [None]:
fig = plt.figure(figsize=(10,3))
ax = fig.add_subplot(111)

plt.plot(sample_df['co2'].interpolate().diff())
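
The effect of the lag can be illustrated on a noise-free synthetic series. This is a sketch, and the slope, amplitude, and 52-step cycle are assumed values.

```python
import numpy as np
import pandas as pd

# Synthetic weekly series (an assumption): linear trend plus a 52-week cycle
t = np.arange(4 * 52)
y = pd.Series(0.5 * t + 10 * np.sin(2 * np.pi * t / 52))

# y_t - y_{t-1}: the linear trend becomes a constant, the cycle remains
d1 = y.diff(periods=1).dropna()

# y_t - y_{t-52}: the 52-week cycle cancels, leaving 52 * slope
d52 = y.diff(periods=52).dropna()

print(d1.std(), d52.std())
```

Here the seasonal difference is constant (52 times the slope), while the first difference still oscillates with the yearly cycle.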

### The Autocorrelation Function (ACF)
Simply put, a time series is autocorrelated when it is linearly related to a lagged version of itself.

It is a measure of how much of the past resembles the present.

The ACF can be used to uncover and verify seasonality in time series data. 

In [None]:
fig = plt.figure(figsize=(10,3))
ax = fig.add_subplot(111)

acf = plot_acf(sample_df['co2'].interpolate().diff()[1:], lags=104, ax=ax)


Values outside the band mean that the correlation value at that time lag is significant. 

The peaks occur at lags of about 26 and 52 weeks (half a year and a full year) and alternate in sign, which reflects the annual cycle in which CO2 falls and rises as the seasons change.
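
The alternating sign can be reproduced on a synthetic series with a 52-step cycle, using the pandas method `Series.autocorr`. This is a sketch, and the noise level and series length are assumptions. The series correlates negatively with itself half a cycle back and positively a full cycle back.

```python
import numpy as np
import pandas as pd

# Synthetic series (an assumption): a 52-step sine cycle with a little noise
rng = np.random.default_rng(1)
t = np.arange(6 * 52)
s = pd.Series(np.sin(2 * np.pi * t / 52) + rng.normal(0, 0.1, len(t)))

# Pearson correlation between the series and a copy of itself shifted by `lag`
for lag in (13, 26, 52):
    print(lag, round(s.autocorr(lag=lag), 2))
```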

### The Partial Autocorrelation Function (PACF)
The partial autocorrelation function is a measure of the correlation between observations of a time series that are separated by $k$ time units ($y_t$ and $y_{t-k}$), after adjusting for the presence of all the other terms of shorter lag ($y_{t-1}, y_{t-2}, \ldots, y_{t-k+1}$).
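
One way to make "adjusting for the shorter lags" concrete is the regression view: the partial autocorrelation at lag $k$ is the coefficient on $y_{t-k}$ when $y_t$ is regressed on $y_{t-1}, \ldots, y_{t-k}$. The helper `pacf_by_regression` below is a hypothetical illustration written for this sketch; `plot_pacf` may use a different estimator, so its values can differ slightly.

```python
import numpy as np

def pacf_by_regression(y, k):
    """Coefficient on y[t-k] when y[t] is regressed on y[t-1], ..., y[t-k] plus an intercept."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    # Column j holds y[t-j] for j = 1..k, aligned with the targets y[t], t = k..n-1
    lags = np.column_stack([y[k - j : n - j] for j in range(1, k + 1)])
    X = np.column_stack([np.ones(n - k), lags])
    coef, *_ = np.linalg.lstsq(X, y[k:], rcond=None)
    return coef[-1]

# Synthetic AR(1) series (an assumption): y[t] = 0.7 * y[t-1] + noise
rng = np.random.default_rng(0)
y = np.zeros(5000)
for t in range(1, len(y)):
    y[t] = 0.7 * y[t - 1] + rng.normal()

pacf_1 = pacf_by_regression(y, 1)
pacf_2 = pacf_by_regression(y, 2)
print(pacf_1, pacf_2)
```

For this AR(1) series the lag-1 value should be close to 0.7. The lag-2 value should be close to 0, because the dependence on $y_{t-2}$ runs entirely through $y_{t-1}$.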

In [None]:
fig = plt.figure(figsize=(10,3))
ax = fig.add_subplot(111)

pacf = plot_pacf(sample_df['co2'].interpolate().diff()[1:], lags=104, ax=ax)

Values outside the band mean that the correlation value at that time lag is significant. 
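
A common rule of thumb for such a band is $\pm 1.96/\sqrt{N}$, the approximate 95% range of a sample autocorrelation when the series is white noise of length $N$. This is a sketch under that white-noise assumption; the band that statsmodels draws is computed by its own formula and may differ.

```python
import numpy as np
import pandas as pd

# White noise (an assumption): no true autocorrelation at any lag
rng = np.random.default_rng(42)
n = 400
noise = pd.Series(rng.normal(size=n))

# Approximate 95% band for sample autocorrelations of white noise
band = 1.96 / np.sqrt(n)

# Sample autocorrelations at lags 1..40 and which of them fall outside the band
r = np.array([noise.autocorr(lag=k) for k in range(1, 41)])
outside = np.abs(r) > band
print(f"band = ±{band:.3f}, lags outside: {outside.sum()} of {len(r)}")
```

Even for pure noise, about 1 lag in 20 is expected to fall outside a 95% band, so isolated small excursions should not be over-interpreted.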

For the CO2 concentration data, the partial autocorrelations are sharpest at around lags 100-104, when the series completes approximately two annual cycles.

> Q: What do the ACF and PACF look like for Ben&Ben's and Jose Mari Chan's total daily streams?

In [None]:
fig = plt.figure(figsize=(10,3))
ax = fig.add_subplot(111)

plt.plot(data1['streams'].interpolate().diff()[1:])

In [None]:
#Ben & Ben
fig = plt.figure(figsize=(10,6))
ax1 = fig.add_subplot(211)
ax2 = fig.add_subplot(212)

acf = plot_acf(data1['streams'].interpolate().diff()[1:], lags=365, ax=ax1)
pacf = plot_pacf(data1['streams'].interpolate().diff()[1:], lags=365, ax=ax2)

In [None]:
fig = plt.figure(figsize=(10,3))
ax = fig.add_subplot(111)

plt.plot(data2['streams'].interpolate().diff()[1:])

In [None]:
#Jose Mari Chan
fig = plt.figure(figsize=(10,6))
ax1 = fig.add_subplot(211)
ax2 = fig.add_subplot(212)

#full year lag
acf = plot_acf(data2['streams'].interpolate().diff()[1:], lags=365, ax=ax1)
pacf = plot_pacf(data2['streams'].interpolate().diff()[1:], lags=365, ax=ax2)

# *Day Deliverables*

1. (*Easy - Individual Work*) Among those included in the Spotify charts, pick an artist you like.

   a. Plot the streams and positions of their top 5 streamed songs.

   b. Compare these charts with streams and positions of what you feel to be a possible collaborator/competitor/related artist. 

   What insights can you draw from the data?

2. (*Intermediate - Group Work*) A song may be classified as follows:

   - **Mainstay** - Song with high streams ($>X_1$ streams) and position ($>P_1$ position) throughout the year
   - **Viral** - Song that reaches its peak position quickly with a steep increase in streams ($>X_2$ streams/day),
     followed by a rapid decline in position ($>P_2$ places/day) and streams ($>X_3$ streams/day)
   - **Seasonal** - Song that consistently appears ($>C$ autocorrelation score) OR stays and garners considerable streams ($>X_4$ streams) within a certain season, then drops to low ranks or disappears from the chart after the season
    
   a. Discuss among your group how you would define and set values to the thresholds that you will use to classify the songs according to the categories as described above. (You may add more thresholds to refine the definitions, as you see fit)
   
   b. Name as many songs as you can per category and plot their streams and position as a time series.

3. (*Advanced - Group Work, Optional*) What percentage of Spotify charts streams from 2018-2020 are from mainstay songs? viral songs? seasonal songs? songs that do not belong in any of these categories? What does this reveal about the streaming market?