In [1]:
import pandas as pd
import numpy as np

import plotly.offline as po
import plotly.graph_objs as go
from plotly import tools

from copy import deepcopy

po.init_notebook_mode(connected=True)

In [2]:
original_data = pd.read_csv("data.csv", parse_dates = ['Date'])

df = deepcopy(original_data)

In [3]:
# From https://www.kaggle.com/connerbrown/visualization-and-exploration
# Drop rows whose track name is missing
df_nans = df.isnull()

track_name_nans = df_nans['Track Name'][df_nans['Track Name'] == True].index
artist_nans = df_nans['Artist'][df_nans['Artist'] == True].index
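# Percentage of rows missing a track name that are also missing an artist
# (isin avoids the length-mismatch error that == raises on unequal indexes)
nans_overlap = track_name_nans.isin(artist_nans).sum() / df_nans['Track Name'].sum() * 100.0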

df = df.drop(track_name_nans)

In [4]:
#Preprocess for continent
df['Region'] = df['Region'].str.upper()

countries = pd.read_json("countries.json", orient = "index").reset_index()
countries_index_continent = countries[['index', 'continent']]

df = df.merge(countries_index_continent, left_on = 'Region', right_on = 'index', how = "left")\
    .drop('index', axis = 1)

# Index order matches datetime.weekday(): Monday = 0 ... Sunday = 6
weekdays = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

df['Weekday_index'] = df['Date'].dt.weekday
df['Weekday'] = df['Weekday_index'].apply(lambda x: weekdays[x])

df.head(10)

Unnamed: 0,Position,Track Name,Artist,Streams,URL,Date,Region,continent,Weekday_index,Weekday
0,1,Reggaetón Lento (Bailemos),CNCO,19272,https://open.spotify.com/track/3AEZUABDXNtecAO...,2017-01-01,EC,SA,6,Sun
1,2,Chantaje,Shakira,19270,https://open.spotify.com/track/6mICuAdrwEjh6Y6...,2017-01-01,EC,SA,6,Sun
2,3,Otra Vez (feat. J Balvin),Zion & Lennox,15761,https://open.spotify.com/track/3QwBODjSEzelZyV...,2017-01-01,EC,SA,6,Sun
3,4,Vente Pa' Ca,Ricky Martin,14954,https://open.spotify.com/track/7DM4BPaS7uofFul...,2017-01-01,EC,SA,6,Sun
4,5,Safari,J Balvin,14269,https://open.spotify.com/track/6rQSrBHf7HlZjtc...,2017-01-01,EC,SA,6,Sun
5,6,La Bicicleta,Carlos Vives,12843,https://open.spotify.com/track/0sXvAOmXgjR2QUq...,2017-01-01,EC,SA,6,Sun
6,7,Ay Mi Dios,IAmChino,10986,https://open.spotify.com/track/6stYbAJgTszHAHZ...,2017-01-01,EC,SA,6,Sun
7,8,Andas En Mi Cabeza,Chino & Nacho,10653,https://open.spotify.com/track/5mey7CLLuFToM2P...,2017-01-01,EC,SA,6,Sun
8,9,Traicionera,Sebastian Yatra,9807,https://open.spotify.com/track/5J1c3M4EldCfNxX...,2017-01-01,EC,SA,6,Sun
9,10,Shaky Shaky,Daddy Yankee,9612,https://open.spotify.com/track/58IL315gMSTD37D...,2017-01-01,EC,SA,6,Sun


### In the last graph of Level 4 - Basic, the streams of Ed Sheeran's "Shape of You" show an obvious periodic change. This is the start of the story: we explore the patterns behind this song and then check whether they hold for all streams.

### The period appears to be one week, so we split "global" into continents to see whether this principle of periodic change holds in every part of the world.
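
One way to check the one-week period numerically is to correlate a daily series with itself shifted by a few days and see which shift scores highest. The sketch below is only an illustration: `lag_correlation` is a hypothetical helper, and `daily_streams` is made-up data, not values from the dataset.

```python
from math import sqrt


def lag_correlation(values, lag):
    """Pearson correlation between a series and itself shifted by `lag` steps."""
    if lag <= 0 or lag >= len(values) - 1:
        raise ValueError("lag must be between 1 and len(values) - 2")
    head, tail = values[:-lag], values[lag:]
    n = len(head)
    mean_h = sum(head) / n
    mean_t = sum(tail) / n
    cov = sum((h - mean_h) * (t - mean_t) for h, t in zip(head, tail))
    var_h = sum((h - mean_h) ** 2 for h in head)
    var_t = sum((t - mean_t) ** 2 for t in tail)
    if var_h == 0 or var_t == 0:
        return 0.0  # a constant series carries no periodic signal
    return cov / sqrt(var_h * var_t)


# Hypothetical daily streams: one weekly pattern repeated for 8 weeks
weekly_pattern = [100, 105, 110, 115, 130, 125, 90]
daily_streams = weekly_pattern * 8

# Score every candidate period from 1 to 10 days
scores = {lag: lag_correlation(daily_streams, lag) for lag in range(1, 11)}
best_lag = max(scores, key=scores.get)
print(best_lag, round(scores[best_lag], 3))
```

On a pandas Series of daily sums, `Series.autocorr(lag=7)` computes the same lag correlation, so the helper is only needed when working with plain lists.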

In [5]:
df_soy = df[df['Track Name'] == "Shape of You"]

In [6]:
continent_names = ["AS", "SA", "NA", "EU", "OC"]
df_continents = [{"name": continent_name, "data": df_soy[df_soy['continent'] == continent_name]}\
                 for continent_name in continent_names]

In [7]:
df_continent_groups = [{"name": df['name'], \
                        "data": df['data'].groupby('Date').sum().reset_index()}\
                        for df in df_continents]

In [8]:
trace_soy = [go.Scatter(x = df['data']['Date'], y = df['data']['Streams'],
                        name = df['name'])\
            for df in df_continent_groups]

fig = go.Figure(data = trace_soy)

In [9]:
po.iplot(fig)

### The graph above shows an obvious periodic change in EU, NA and SA. Such a strong periodic feature may hide some interesting information, and we will focus on it later. First, we need more graphs to see whether OC and AS share the same feature, and then we continue by exploring the change within a single week.

### We can also find other information in this graph.

#### 1. General trend of the song

This matches general expectations: the number of streams gradually decreases over time.

#### 2. Abnormal points and possible explanation

- March 3rd and the following 2 weeks: According to the Guardian, the album Divide, which includes Shape of You, was released on that day [1]. An album release could greatly increase streams in the following days.

- All of December, and especially December 31st: According to the Spotify Community, "Your 2017 Wrapped" could be accessed from December 5th [2]. This may have reminded people of songs they listened to during the past year and led them to replay those songs.
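
Abnormal points like these can also be flagged automatically instead of read off the graph. A minimal sketch, assuming that comparing each day with the same weekday one week earlier is enough to cancel the weekly cycle; `week_over_week_jumps`, its `threshold` and the `streams` list are all hypothetical and not part of the original analysis.

```python
def week_over_week_jumps(values, threshold=1.25, period=7):
    """Indices where a value is at least `threshold` times the value one period earlier."""
    return [
        i
        for i in range(period, len(values))
        if values[i - period] > 0 and values[i] / values[i - period] >= threshold
    ]


# Hypothetical daily streams: a weekly pattern repeated for 4 weeks
pattern = [100, 105, 110, 115, 130, 125, 90]
streams = pattern * 4

# Inject a made-up spike on day 17 to imitate a release-day jump
streams[17] = int(streams[17] * 1.5)

flagged = week_over_week_jumps(streams)
print(flagged)
```

The same function could be fed a continent's `Streams` column after sorting by `Date`; a suitable threshold depends on how noisy the series is and would need tuning.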



References:

[1] "Ed Sheeran's new album Divide to be released on 3 March", The Guardian, 2017. [Online]. Available: https://www.theguardian.com/music/2017/jan/12/ed-sheeran-third-album-divide-release-date-3-march. [Accessed: 21 Jan. 2019].

[2] "Your 2017 Wrapped", Community.spotify.com, 2017. [Online]. Available: https://community.spotify.com/t5/Community-Blog/Your-2017-Wrapped/ba-p/3638025. [Accessed: 21 Jan. 2019].

In [10]:
special_continent_names = ['OC', 'AS']

df_special_continent_groups = [df for df in df_continent_groups\
                               if df['name'] in special_continent_names]

In [11]:
trace_special_soy = [go.Scatter(x = df['data']['Date'], 
                                y = df['data']['Streams'],
                        name = df['name'])\
            for df in df_special_continent_groups]

fig = go.Figure(data = trace_special_soy)

In [12]:
po.iplot(fig)

### In AS and OC there seems to be a periodic change, but it is less obvious than in the other 3 continents. In the next step we focus on those 3 continents (NA, SA and EU) to see the change within a single week, keeping AS for comparison.

In [13]:
chosen_continent_names = ['NA', 'SA', 'EU', 'AS']

In [14]:
df_soy_chosen_continent = [{"name": continent_name, "data": df_soy[df_soy['continent'] == continent_name]}\
                 for continent_name in chosen_continent_names]

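# Average Streams over the countries of each continent for every date
# (numeric_only=True skips text columns, which newer pandas refuses to average)
df_soy_chosen_date_groups = [{"name": df['name'], \
                        "data": df['data'].groupby('Date').mean(numeric_only=True)\
                                .sort_values('Weekday_index')}\
                        for df in df_soy_chosen_continent]

# Average those daily values over all dates that share a weekday
df_soy_chosen_weekday_groups = [{"name": df['name'], \
                        "data": df['data'].groupby('Weekday_index').mean(numeric_only=True)\
                                .sort_values('Weekday_index')}\
                        for df in df_soy_chosen_date_groups]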

In [15]:
trace_soy_weekday = [go.Scatter(x = weekdays, y = df['data']['Streams'],
                        name = df['name'])\
            for df in df_soy_chosen_weekday_groups]

fig = tools.make_subplots(rows=1, cols=4)

fig.append_trace(trace_soy_weekday[0], 1, 1)
fig.append_trace(trace_soy_weekday[1], 1, 2)
fig.append_trace(trace_soy_weekday[2], 1, 3)
fig.append_trace(trace_soy_weekday[3], 1, 4)

This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]  [ (1,3) x3,y3 ]  [ (1,4) x4,y4 ]



In [16]:
po.iplot(fig)

### The graph above shows some interesting phenomena, for which we try to find explanations

In general, streams increase from Monday to Thursday, peak around Friday and slump on Sunday.

#### 1. The increasing trend from Monday to Thursday

This matches everyday experience: on Monday we are busy with all kinds of tasks, including weekly meetings and whatever came up over the weekend, and from Monday to Thursday we gradually focus less on work. The stream counts follow a similar trend, which suggests that people spend more time listening to music on Thursday than on Monday.

This finding might serve as a rough indicator of how people's enthusiasm for and attention to work change over the week.

#### 2. Streams peak on Friday in SA and NA, but on Saturday in EU

(This is not explained by time zones, since NA and SA are behind EU. However, the AS peak, which falls on Wednesday and Thursday, may be related to them.) We may consider this a difference in when people prefer to listen to songs or go out.

#### 3. The lowest streams always occur on Sunday

There are two main possible reasons:

- People in these three continents spend more time in church on Sunday, which reduces their listening time.

- People go back to work after Sunday, so they may choose not to stay up late, party or do other activities.

(To test the first possible reason, the graph of AS is also drawn. Although its lowest streams also occur on Sunday, the gap between Sunday and the peak is only about 6%, while the gaps in NA, SA and EU are approximately 14%, 15% and 13%, more than twice the gap in AS. This finding might be used to estimate the proportion of Christians, including Catholics, in a given region.)
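
The percentages quoted above compare Sunday with the peak day. A minimal sketch of that arithmetic, assuming the gap is measured relative to the peak value; `relative_dip` is a hypothetical helper and the numbers in `profile` are invented for illustration, not read from the plots.

```python
def relative_dip(weekday_means, low_day="Sun"):
    """Return the peak day and the fraction by which `low_day` falls below the peak."""
    peak_day = max(weekday_means, key=weekday_means.get)
    peak = weekday_means[peak_day]
    return peak_day, (peak - weekday_means[low_day]) / peak


# Hypothetical average streams per weekday for one continent
profile = {
    "Mon": 860, "Tue": 880, "Wed": 900, "Thu": 910,
    "Fri": 1000, "Sat": 950, "Sun": 850,
}

peak_day, dip = relative_dip(profile)
print(peak_day, f"{dip:.0%}")
```

With the notebook's data, a profile for each continent could be built with `dict(zip(weekdays, group['data']['Streams']))` for each `group` in `df_soy_chosen_weekday_groups`.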

#### 4. The phenomenon of two peaks

This phenomenon is not obvious in SA and EU, but it exists in NA, where the two peaks are Wednesday and Friday. We found no evidence in American culture that explains it. One possible explanation is that on Wednesday, the middle of the working week, people reach their point of exhaustion and relax by listening to music, and on the next day they grow tired of it.


### Checking whether the trends and rules above apply to the streams of all songs

In [17]:
chosen_continent_names = ['NA', 'SA', 'EU', 'AS']

In [18]:
df_chosen_continent = [{"name": continent_name, "data": df[df['continent'] == continent_name]}\
                 for continent_name in chosen_continent_names]


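# Total Streams of all songs per continent for every date
# (numeric_only=True keeps text columns out of the aggregation)
df_chosen_date_groups = [{"name": df['name'], \
                        "data": df['data'].groupby(['Date', 'Weekday_index']).sum(numeric_only=True)\
                                .sort_values('Weekday_index').reset_index()}\
                        for df in df_chosen_continent]

# Average those daily totals over all dates that share a weekday
df_chosen_weekday_groups = [{"name": df['name'], \
                        "data": df['data'].groupby('Weekday_index').mean(numeric_only=True)\
                                .sort_values('Weekday_index')}\
                        for df in df_chosen_date_groups]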

In [19]:
from plotly import tools
trace_soy_weekday = [go.Scatter(x = weekdays, y = df['data']['Streams'],
                        name = df['name'])\
            for df in df_chosen_weekday_groups]

fig = tools.make_subplots(rows=1, cols=4)

fig.append_trace(trace_soy_weekday[0], 1, 1)
fig.append_trace(trace_soy_weekday[1], 1, 2)
fig.append_trace(trace_soy_weekday[2], 1, 3)
fig.append_trace(trace_soy_weekday[3], 1, 4)

This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]  [ (1,3) x3,y3 ]  [ (1,4) x4,y4 ]



In [20]:
po.iplot(fig)

### From the graph above, most findings for 'Shape of You' also apply to all songs. However, the phenomenon of two peaks disappears, so we may consider it a special case. One point needs revising: the peak in SA and EU is better described as Friday and Saturday together rather than a single day, since streams on both days are much higher than on the other days (for both 'Shape of You' and all songs).
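
Whether a peak is a single day or a pair of days can be made explicit with a tolerance. The sketch below treats every day within a few percent of the maximum as part of the peak; `peak_days`, the 3% tolerance and the `profile` values are assumptions made for illustration only.

```python
def peak_days(weekday_means, tolerance=0.03):
    """Days whose mean lies within `tolerance` (as a fraction) of the maximum."""
    peak = max(weekday_means.values())
    return [day for day, value in weekday_means.items()
            if value >= peak * (1 - tolerance)]


# Hypothetical average streams per weekday, with Friday and Saturday close together
profile = {
    "Mon": 860, "Tue": 880, "Wed": 900, "Thu": 910,
    "Fri": 1000, "Sat": 985, "Sun": 850,
}

shared_peak = peak_days(profile)
print(shared_peak)
```

A wider tolerance merges more days into the peak, so the chosen value should be reported alongside any conclusion drawn from it.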