# Top Charts!
## Data Science Project - Exploratory Data Analysis & Problem formulation

Today we're going to analyse chart music from the turn of the millennium. Given a Billboard chart dataset, can we find out:

- What made a hit soar to the top of the charts?
- How long did hits stay there?

We will dig into our handy Data Scientist's toolbox to answer these questions, using __`python`__ and __`pandas`__ for data cleaning, as well as __Tableau__ for some beautiful visualisations to tell our story! Along the way we will use the __`pivot_table`__ and __`melt`__ functionality of `pandas` to make our lives easier.

## Loading the data.

We'll start off the project by importing our trusty `pandas` library to read in our `csv` file:

In [1]:
import pandas as pd
dataset = pd.read_csv("assets/topcharts/billboard.csv")
print(dataset.shape)
print(dataset.dtypes.value_counts())
pd.set_option('display.max_columns', dataset.shape[1])
dataset.head()

(317, 83)
float64    75
object      6
int64       2
dtype: int64


Unnamed: 0,year,artist.inverted,track,time,genre,date.entered,date.peaked,x1st.week,x2nd.week,x3rd.week,x4th.week,x5th.week,x6th.week,x7th.week,x8th.week,x9th.week,x10th.week,x11th.week,x12th.week,x13th.week,x14th.week,x15th.week,x16th.week,x17th.week,x18th.week,x19th.week,x20th.week,x21st.week,x22nd.week,x23rd.week,x24th.week,x25th.week,x26th.week,x27th.week,x28th.week,x29th.week,x30th.week,x31st.week,x32nd.week,x33rd.week,x34th.week,x35th.week,x36th.week,x37th.week,x38th.week,x39th.week,x40th.week,x41st.week,x42nd.week,x43rd.week,x44th.week,x45th.week,x46th.week,x47th.week,x48th.week,x49th.week,x50th.week,x51st.week,x52nd.week,x53rd.week,x54th.week,x55th.week,x56th.week,x57th.week,x58th.week,x59th.week,x60th.week,x61st.week,x62nd.week,x63rd.week,x64th.week,x65th.week,x66th.week,x67th.week,x68th.week,x69th.week,x70th.week,x71st.week,x72nd.week,x73rd.week,x74th.week,x75th.week,x76th.week
0,2000,Destiny's Child,Independent Women Part I,3:38,Rock,2000-09-23,2000-11-18,78,63.0,49.0,33.0,23.0,15.0,7.0,5.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,3.0,7.0,10.0,12.0,15.0,22.0,29.0,31.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,2000,Santana,"Maria, Maria",4:18,Rock,2000-02-12,2000-04-08,15,8.0,6.0,5.0,2.0,3.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,8.0,15.0,19.0,21.0,26.0,36.0,48.0,47.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,2000,Savage Garden,I Knew I Loved You,4:07,Rock,1999-10-23,2000-01-29,71,48.0,43.0,31.0,20.0,13.0,7.0,6.0,4.0,4.0,4.0,6.0,4.0,2.0,1.0,1.0,1.0,2.0,1.0,2.0,4.0,8.0,8.0,12.0,14.0,17.0,21.0,24.0,30.0,34.0,37.0,46.0,47.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,2000,Madonna,Music,3:45,Rock,2000-08-12,2000-09-16,41,23.0,18.0,14.0,2.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,2.0,2.0,4.0,8.0,11.0,16.0,20.0,25.0,27.0,27.0,29.0,44.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,2000,"Aguilera, Christina",Come On Over Baby (All I Want Is You),3:38,Rock,2000-08-05,2000-10-14,57,47.0,45.0,29.0,23.0,18.0,11.0,9.0,9.0,11.0,1.0,1.0,1.0,1.0,4.0,8.0,12.0,22.0,23.0,43.0,44.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


#### Let's take a closer look at the dataset.

Our DataFrame contains: 

- Top 100 chart ranking data;
- 317 songs;
- From the period of June 1999 to December 2000;
- Spanning 76 weeks

In addition to basic information about each song such as its __artist__, __track name__, __song length__ and __genre__, the dataset also tracks chart position week by week, including when each song entered the chart and when it peaked.

There are a large number of NaNs in the weekly chart ranking columns. It can reasonably be assumed that a NaN indicates a song that has fallen out of the Top 100. These NaNs will have to be handled as we do our data cleaning.

Furthermore, this table is very wide due to the way it organises the weekly rank information by column. The use of pivot tables and melting will be useful for consolidating that information.
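To get a feel for how sparse the weekly columns are, something like the following could be run. This is only a sketch on a made-up toy table: the values and the `toy` DataFrame are assumptions for illustration, and only the column naming pattern is taken from the real dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the wide Billboard table (values are made up for illustration)
toy = pd.DataFrame({
    "track": ["Song A", "Song B"],
    "x1st.week": [78, 15],
    "x2nd.week": [63.0, 8.0],
    "x3rd.week": [49.0, np.nan],
    "x4th.week": [np.nan, np.nan],
})

# Pick out the weekly rank columns by their shared suffix
week_cols = [c for c in toy.columns if c.endswith(".week")]

# Share of missing cells across all weekly columns
missing_share = toy[week_cols].isna().mean().mean()

# Number of weeks with a recorded rank, per song
weeks_recorded = toy[week_cols].notna().sum(axis=1)

print(missing_share)
print(weeks_recorded.tolist())
```

On the real table the same two lines would show how much of the 76-week grid is empty and how many weeks each song actually charted.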

## Data cleaning. 

We're going to:

- Rename some feature columns that are poorly named
- Shorten any strings that may be too long
- Check for missing values and impute them as appropriate

In [2]:
# building dictionary for renaming columns
rename = {'artist.inverted':'artist'}
for i in dataset.columns:
    if 'week' in i:
        if len(i) == 9:
            new = int(i[1:2])
        else:
            new = int(i[1:3])
        rename.setdefault(i,new)

The following code block simplifies column names:

In [3]:
# renaming columns in place
dataset.rename(columns=rename,inplace=True) 
dataset.head()

Unnamed: 0,year,artist,track,time,genre,date.entered,date.peaked,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76
0,2000,Destiny's Child,Independent Women Part I,3:38,Rock,2000-09-23,2000-11-18,78,63.0,49.0,33.0,23.0,15.0,7.0,5.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,3.0,7.0,10.0,12.0,15.0,22.0,29.0,31.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,2000,Santana,"Maria, Maria",4:18,Rock,2000-02-12,2000-04-08,15,8.0,6.0,5.0,2.0,3.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,8.0,15.0,19.0,21.0,26.0,36.0,48.0,47.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,2000,Savage Garden,I Knew I Loved You,4:07,Rock,1999-10-23,2000-01-29,71,48.0,43.0,31.0,20.0,13.0,7.0,6.0,4.0,4.0,4.0,6.0,4.0,2.0,1.0,1.0,1.0,2.0,1.0,2.0,4.0,8.0,8.0,12.0,14.0,17.0,21.0,24.0,30.0,34.0,37.0,46.0,47.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,2000,Madonna,Music,3:45,Rock,2000-08-12,2000-09-16,41,23.0,18.0,14.0,2.0,1.0,1.0,1.0,1.0,2.0,2.0,2.0,2.0,2.0,4.0,8.0,11.0,16.0,20.0,25.0,27.0,27.0,29.0,44.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,2000,"Aguilera, Christina",Come On Over Baby (All I Want Is You),3:38,Rock,2000-08-05,2000-10-14,57,47.0,45.0,29.0,23.0,18.0,11.0,9.0,9.0,11.0,1.0,1.0,1.0,1.0,4.0,8.0,12.0,22.0,23.0,43.0,44.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
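The renaming dictionary above pulls the week number out of each column name by slicing on the length of the name. A regular expression is one alternative that does not depend on name length. The sketch below is not the method used in this notebook; `week_number` is a hypothetical helper and the short `columns` list stands in for `dataset.columns`.

```python
import re

def week_number(column_name):
    """Return the week number embedded in a name like 'x12th.week', else None."""
    match = re.fullmatch(r"x(\d+)(?:st|nd|rd|th)\.week", column_name)
    return int(match.group(1)) if match else None

# Stand-in for dataset.columns
columns = ["artist.inverted", "x1st.week", "x22nd.week", "x76th.week"]

rename_map = {"artist.inverted": "artist"}
for name in columns:
    number = week_number(name)
    if number is not None:
        rename_map[name] = number

print(rename_map)
```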


Here the rank values are converted from __floats to integers__:

In [4]:
# the weekly rank columns start at position 7; only these are filled and cast
week_cols = dataset.columns[7:]
# converting NaN rank numbers to 101, then float rank numbers to integers
dataset[week_cols] = dataset[week_cols].fillna(101).astype(int)
dataset.head()

Unnamed: 0,year,artist,track,time,genre,date.entered,date.peaked,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76
0,2000,Destiny's Child,Independent Women Part I,3:38,Rock,2000-09-23,2000-11-18,78,63,49,33,23,15,7,5,1,1,1,1,1,1,1,1,1,1,1,2,3,7,10,12,15,22,29,31,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101
1,2000,Santana,"Maria, Maria",4:18,Rock,2000-02-12,2000-04-08,15,8,6,5,2,3,2,2,1,1,1,1,1,1,1,1,1,1,8,15,19,21,26,36,48,47,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101
2,2000,Savage Garden,I Knew I Loved You,4:07,Rock,1999-10-23,2000-01-29,71,48,43,31,20,13,7,6,4,4,4,6,4,2,1,1,1,2,1,2,4,8,8,12,14,17,21,24,30,34,37,46,47,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101
3,2000,Madonna,Music,3:45,Rock,2000-08-12,2000-09-16,41,23,18,14,2,1,1,1,1,2,2,2,2,2,4,8,11,16,20,25,27,27,29,44,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101
4,2000,"Aguilera, Christina",Come On Over Baby (All I Want Is You),3:38,Rock,2000-08-05,2000-10-14,57,47,45,29,23,18,11,9,9,11,1,1,1,1,4,8,12,22,23,43,44,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101,101


The above code converts the NaN rankings to 101 to indicate that the song has fallen off the top 100.
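One consequence of using 101 as a sentinel is worth keeping in mind: any statistic computed over the filled columns will include the off-chart weeks unless they are filtered out first. The following sketch, on made-up rank values, illustrates the difference; the `OFF_CHART` name is an assumption introduced here.

```python
import numpy as np
import pandas as pd

# Made-up weekly ranks for one song: three charting weeks, then off the chart
ranks = pd.Series([57.0, 47.0, 45.0, np.nan, np.nan])
OFF_CHART = 101  # same sentinel value as in the cleaning step

filled = ranks.fillna(OFF_CHART).astype(int)

# Weeks actually spent in the Top 100
weeks_on_chart = int((filled < OFF_CHART).sum())

# The sentinel drags the mean towards 101 unless it is excluded
mean_with_sentinel = filled.mean()
mean_on_chart_only = filled[filled < OFF_CHART].mean()

print(weeks_on_chart, mean_with_sentinel, mean_on_chart_only)
```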

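Below, the columns 'year', 'time', 'date.entered' and 'date.peaked' are converted to datetime. The results are assigned back to the DataFrame, because `pd.to_datetime` returns a new object rather than modifying the column in place:

In [5]:
for col in ['date.entered','date.peaked']:
    dataset[col] = pd.to_datetime(dataset[col])
# 'year' holds integers such as 2000; 'time' is a song length such as 3:38 (assumed m:ss)
dataset['year'] = pd.to_datetime(dataset['year'].astype(str), format='%Y')
dataset['time'] = pd.to_datetime(dataset['time'], format='%M:%S')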

Now we will use pandas' built-in `melt` function to reshape the weekly ranking data from wide to long. Doing so removes the 76 'week' columns and replaces them with just two: `Week` and `Ranking`.

In [6]:
weeklist = list(dataset.columns[7:])

In [7]:
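melted = pd.melt(dataset, id_vars=['artist', 'track'],
       var_name="Week", value_name="Ranking",
       value_vars = weeklist)
# one row per artist, track and week (the default aggregation is the mean)
pivoted = pd.pivot_table(melted, index = ['artist','track', 'Week'], values='Ranking')
# displaying .head(10)
pivoted.head(10)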
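To make the wide-to-long reshaping concrete, here is a minimal sketch of `melt` on a three-week toy table. The artists, tracks and ranks are made up; only the column layout mirrors the cleaned DataFrame.

```python
import pandas as pd

# Toy wide table: one row per song, one column per week (values are made up)
wide = pd.DataFrame({
    "artist": ["Artist A", "Artist B"],
    "track": ["Song A", "Song B"],
    1: [78, 15],
    2: [63, 8],
    3: [49, 101],
})

# Each (song, week) pair becomes its own row
long = pd.melt(wide, id_vars=["artist", "track"], value_vars=[1, 2, 3],
               var_name="Week", value_name="Ranking")
long = long.sort_values(["artist", "track", "Week"]).reset_index(drop=True)

print(long)
```

Two songs with three week columns become six rows with a single `Week` and a single `Ranking` column, which is the shape Tableau and `groupby` work with most easily.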

## Data Visualisation.

Using __Tableau__, we will create some visualisations that will provide context to this dataset. The cleaned DataFrame and melted pivot table from the last code block will be looked at using data visualisation.

The following charts and descriptions group visualisations and observations around __three themes__:

### 1. Top Genres

[Click here for visualisation with Tableau](https://github.com/matthewwilfred/Data-Science-Projects/blob/master/assets/topcharts/Dashboard%201.png)

The histogram on the left counts appearances on the Top 100 by genre. A Pareto-like distribution of popularity emerges, with the 3 most popular genres being:
1. Rock
2. Country
3. Rap

Looking to the treemaps on the right, although the above genres clearly dominate, the number of songs per artist that made it into Top 10 and Top 100 seems to be much more evenly distributed. Most artists only had one Top 100 song, and even the most prolific and popular artists have at most five songs in Top 100. 

Note also that while Latin (in purple) is not in the top 3 most popular genres, it has an outsize influence in terms of number of Top 10 tracks as shown in the upper right hand chart, coming in just behind Rock and Rap and ahead of Country.
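The counts behind these charts were produced in Tableau, but a rough pandas equivalent could look like the sketch below. It assumes that an "appearance" means one charting track, and it uses a made-up `songs` table with a hypothetical `best_rank` column holding each track's highest chart position.

```python
import pandas as pd

# Made-up table: one row per track with its genre and best chart position
songs = pd.DataFrame({
    "track": ["A", "B", "C", "D", "E"],
    "genre": ["Rock", "Rock", "Country", "Latin", "Rock"],
    "best_rank": [1, 35, 60, 4, 12],
})

# Number of charting tracks per genre
songs_per_genre = songs["genre"].value_counts()

# Number of tracks per genre that reached the Top 10
top10_per_genre = songs.loc[songs["best_rank"] <= 10, "genre"].value_counts()

print(songs_per_genre)
print(top10_per_genre)
```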

### 2. Long-lasting artists

[Click here for visualisation with Tableau](https://github.com/matthewwilfred/Data-Science-Projects/blob/master/assets/topcharts/Dashboard%204.png)

An interesting feature of this dataset is that we can extract information to find out how long individual artists stayed in the Top 100 charts. While no artist was able to remain in Top 100 for the entire duration of 76 weeks, a handful shown in the top chart lasted more than 35 weeks, with Creed topping out in terms of chart persistence with 65 weeks in the Top 100 before falling off. By contrast, at the bottom rung there were artists who only appeared on the list for one week. 

Also noteworthy is that, compared with the first chart, the set of artists with the longest chart persistence does not overlap significantly with those who created the most Top 10 songs.
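Chart persistence can also be approximated directly from the long-format table. The sketch below is one possible definition, not necessarily the one behind the Tableau dashboard: it counts, per track, the weeks with a rank of 100 or better, and then takes each artist's longest-running track. The data is made up.

```python
import pandas as pd

# Made-up long-format data; 101 marks a week spent off the chart
long = pd.DataFrame({
    "artist": ["A", "A", "A", "A", "B", "B"],
    "track": ["s1", "s1", "s2", "s2", "s3", "s3"],
    "Week": [1, 2, 1, 2, 1, 2],
    "Ranking": [40, 101, 12, 9, 88, 101],
})

# Keep only the weeks in which the track was actually in the Top 100
on_chart = long[long["Ranking"] <= 100]

# Weeks on chart per track, then the longest-running track per artist
weeks_per_track = on_chart.groupby(["artist", "track"])["Week"].count()
longest_run_per_artist = weeks_per_track.groupby(level="artist").max()

print(longest_run_per_artist)
```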

### 3. How genres fared over time

[Click here for visualisation with Tableau](https://github.com/matthewwilfred/Data-Science-Projects/blob/master/assets/topcharts/Dashboard%205.png)

The above graph uses median chart rankings of a genre to show the general trends of individual genres over time. We can look at two parameters:

1. How long individual genres lasted; and
2. How high in the rankings they climbed
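A median-rank curve of this kind could be computed in pandas along the following lines. This is a sketch under assumptions: it presumes the long-format table also carries a `genre` column (the melt above keeps only `artist` and `track` as identifiers), it excludes off-chart weeks marked with 101, and the data is made up.

```python
import pandas as pd

# Made-up long-format data with a genre column; 101 marks an off-chart week
long = pd.DataFrame({
    "genre": ["Rock", "Rock", "Rock", "Rock", "Jazz", "Jazz"],
    "track": ["r1", "r1", "r2", "r2", "j1", "j1"],
    "Week": [1, 2, 1, 2, 1, 2],
    "Ranking": [50, 30, 70, 101, 90, 7],
})

# Drop off-chart weeks so the sentinel does not distort the median
on_chart = long[long["Ranking"] <= 100]

# Median rank per genre and week, with one column per genre
median_rank = (on_chart.groupby(["genre", "Week"])["Ranking"]
               .median()
               .unstack("genre"))

print(median_rank)
```

Each column of `median_rank` is then one line in the chart: its length shows how long the genre lasted and its minimum shows how high it climbed.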

Looking at the colour-coded curves by genre, we see that Country, Rock and Rap persisted the longest in the charts; these are the same three genres that were most popular in the very first graph. The median ranking of each of these genres also broke into the Top 10 at one point or another. Of the three, Rap peaked at 2nd place, as shown by the table in the bottom left, while the other two both reached the number one spot.

We can also see that a number of genres did not last long, falling off the chart in fewer than 20 weeks:

- Electronica
- Jazz (red)
- Gospel
- R&B
- Reggae

These genres also failed to break into the Top 10, with the notable exception of Jazz, which spiked in its third week but fell off the chart altogether by its sixth week. As the first chart shows, Jazz appeared on the charts only 5 times in the entire period covered by this dataset. Taken together, these observations indicate that while Jazz rarely appeared in the Top 100, when it did it climbed quickly, reaching number 7 before falling back into obscurity in the space of a month.

One last feature of the line chart bears mentioning: the bottom right hand corner is empty space. You can imagine this area being bounded roughly by the two lines Median Ranking = 50 in the y-axis and Week = 30 in the x-axis. What this tells us is that for this dataset:

- Once a song has been in the charts for more than 30 weeks, when it falls out of the top 50 it generally disappears off the Top 100 altogether.


#### Now that we have explored the data, let's try to come up with a problem statement. After all, data science is about answering questions using data, so let's think of some problems to solve!

## Problem Statement.

#### 1. Were the songs in the Top 100 chart dominated by a few artists in '99-'00?

The rationale behind posing this question can be summed up as follows:

- The music industry has to devote resources and manpower to promote artists and songs.
- If Top 100 chart songs are dominated by a few artists, the industry can pour resources into just a handful of them to maximize their audience, and by extension their economic return.
- Conversely, if the statement were proven to not be true, then industry leaders should focus instead on casting a wide net in order to capture a greater variety of artists that appeal to more people.
- The above logic applies equally to genres as to artists: if certain genres have an outsize popularity as reflected by the charts, then more effort should be put into promoting artists who create that genre of music.

#### 2. Follow-up: Does that hold true today?

The follow-up question is outside the scope of this exercise, but answering this second and more relevant question in future work would show whether the distributions of song/artist popularity and chart persistence have changed. The results could be very revealing, because a number of trends between the dataset timeframe of 1999 - 2000 and the present day of 2016 have changed the music landscape beyond recognition:

- The advent of iPods, then smartphones
- The explosive popularity of YouTube and music streaming services
- Fall-off in popularity of traditional media, e.g. CDs, radio, TV
- Shift from traditional to digital marketing

As the underlying factors that influence music production, promotion and consumption change, it would be interesting to see if these have led to a change in the distribution of songs, artists and genres in the Top 100, not only for academic reasons but, more importantly, for the market research that players in the music industry today will no doubt be keeping an eye on.

## Approaches to problem solving.

We can approach problem statement 1 by looking at artist popularity in a number of different ways, mainly by investigating the distribution of:

1. number of songs by artist that appeared in the Top 100
2. ranking achieved by these songs
3. persistence in the charts

Data visualisation is essential for seeing whether any of the three elements is dominated by a small number of artists. For item 1, a treemap can be very useful because it links the area of each block to the magnitude of the measurement.
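Alongside a treemap, a couple of summary numbers can indicate how concentrated the chart is. The sketch below computes the share of songs contributed by the two most prolific artists and the share of artists with a single song. Both measures are illustrative choices rather than results from this project, and the `songs` table is made up.

```python
import pandas as pd

# Made-up table: one row per charting song
songs = pd.DataFrame({
    "artist": ["A", "A", "A", "B", "B", "C", "D", "E"],
    "track": ["a1", "a2", "a3", "b1", "b2", "c1", "d1", "e1"],
})

# Number of distinct songs per artist, most prolific first
songs_per_artist = (songs.groupby("artist")["track"]
                    .nunique()
                    .sort_values(ascending=False))

# Share of all songs that come from the two most prolific artists
top2_share = songs_per_artist.head(2).sum() / songs_per_artist.sum()

# Share of artists with exactly one charting song
one_hit_share = (songs_per_artist == 1).mean()

print(top2_share, one_hit_share)
```

A high `top2_share` together with a low `one_hit_share` would point towards a chart dominated by a few artists; the reverse would support the "wide net" reading.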

Regarding item 2, a full investigation can follow the ranks of individual songs or of all songs by an artist, with the latter also characterised by highest rank, lowest rank, median rank, standard deviation and other statistical measures over the group of songs. A simple table with heat-coloured cells can be an effective visualisation for simple measures of rank, and candlestick charts may give a better view of the distribution.
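The per-artist rank statistics mentioned above map naturally onto a single `groupby(...).agg(...)` call. A minimal sketch on made-up ranks, assuming off-chart weeks have already been filtered out:

```python
import pandas as pd

# Made-up on-chart ranks, one row per artist and week
long = pd.DataFrame({
    "artist": ["A"] * 4 + ["B"] * 2,
    "Ranking": [10, 20, 30, 40, 5, 15],
})

# Best rank (min), worst rank (max), median and spread per artist
rank_summary = long.groupby("artist")["Ranking"].agg(["min", "max", "median", "std"])

print(rank_summary)
```

Note that the best rank is the minimum, since a lower number means a higher chart position. The resulting table is the kind of input a heat-coloured table or candlestick chart would be drawn from.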

To tackle item 3, we can track chart persistence by visualising changes in rank over time. Plotting a time series line chart of median rank can be an effective means to screen out extreme outliers in ranking; though it is arguably more important to look at top ranks, because those are likely to be the greatest drivers of revenue. An additional tool can again be candlestick charts to visualise the distribution of ranks.

Looking at genre popularity is just as important, for which the above methods apply, but generalised to an entire genre of music as opposed to just one artist.

That's it for now. To recap: we've looked at two essential aspects of any Data Science workflow, __Exploratory Data Analysis__ and __problem formulation__, and also brainstormed approaches to problem solving using data. For a showcase of other equally important elements and tools for the practising Data Scientist, check out the other notebooks hosted on my GitHub page!