<h1 align='center'>Popularity of music records</h1>

<img src='http://www.dropcards.com/45rpm/images/record.png'/>


The music industry has a well-developed market with a global annual revenue around $15 billion. The recording industry is highly competitive and is dominated by three big production companies which make up nearly 82% of the total annual album sales.

Artists are at the core of the music industry and record labels provide them with the necessary resources to sell their music on a large scale. A record label incurs numerous costs (studio recording, marketing, distribution, and touring) in exchange for a percentage of the profits from album sales, singles and concert tickets.

Unfortunately, the success of an artist's release is highly uncertain: a single may be extremely popular, resulting in widespread radio play and digital downloads, while another single may turn out quite unpopular, and therefore unprofitable. 

Given the competitive nature of the recording industry, record labels face a fundamental decision problem: which musical releases to support in order to maximize their financial success.

How can we use analytics to predict the popularity of a song? <p style='color: blue'>In this assignment, we challenge ourselves to predict whether a song will reach a spot in the Top 10 of the Billboard Hot 100 Chart.</p>

Taking an analytics approach, we aim to use information about a song's properties to predict its popularity. The dataset songs.csv (available in the data folder) consists of all songs that made it to the Top 10 of the Billboard Hot 100 Chart from 1990 to 2010, plus a sample of additional songs that didn't make the Top 10. This data comes from three sources: Wikipedia, Billboard.com, and EchoNest.

The variables included in the dataset either describe the artist or the song, or they are associated with the following song attributes: time signature, loudness, key, pitch, tempo, and timbre.

Here's a detailed description of the variables:

**year** = the year the song was released

**songtitle** = the title of the song

**artistname** = the name of the artist of the song

**songID and artistID** = identifying variables for the song and artist

**timesignature and timesignature_confidence** = a variable estimating the time signature of the song, and the confidence in the estimate

**loudness** = a continuous variable indicating the average amplitude of the audio in decibels

**tempo and tempo_confidence** = a variable indicating the estimated beats per minute of the song, and the confidence in the estimate

**key and key_confidence** = a variable with twelve levels indicating the estimated key of the song (C, C#, . . ., B), and the confidence in the estimate

**energy** = a variable that represents the overall acoustic energy of the song, using a mix of features such as loudness

**pitch** = a continuous variable that indicates the pitch of the song

**timbre_0_min, timbre_0_max, timbre_1_min, timbre_1_max, . . . , timbre_11_min, and timbre_11_max** = variables that indicate the minimum/maximum values over all segments for each of the twelve values in the timbre vector (resulting in 24 continuous variables)

**Top10** = a binary variable indicating whether or not the song made it to the Top 10 of the Billboard Hot 100 Chart (1 if it was in the top 10, and 0 if it was not)
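
The naming pattern of the timbre variables described above (twelve components, each with a minimum and a maximum) can be sketched in a few lines. This is only an illustration of the pattern; the name `timbre_columns` is introduced here and is not part of the original notebook.

```python
# Rebuild the 24 timbre column names from the pattern in the description above.
timbre_columns = [
    f"timbre_{component}_{stat}"
    for component in range(12)      # timbre vector components 0..11
    for stat in ("min", "max")      # minimum and maximum over all segments
]

print(len(timbre_columns))   # 24
print(timbre_columns[:2])    # ['timbre_0_min', 'timbre_0_max']
print(timbre_columns[-1])    # timbre_11_max
```

Assuming the column names in songs.csv follow this pattern exactly, a list like this could be used to select all timbre variables at once from the loaded DataFrame.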

In [1]:
import pandas as pd

songs = pd.read_csv('data/songs.csv', encoding='latin1')

print(f'number of rows: {songs.shape[0]}\nnumber of columns: {songs.shape[1]}')
songs.head()

number of rows: 7574
number of columns: 39


Unnamed: 0,year,songtitle,artistname,songID,artistID,timesignature,timesignature_confidence,loudness,tempo,tempo_confidence,...,timbre_7_max,timbre_8_min,timbre_8_max,timbre_9_min,timbre_9_max,timbre_10_min,timbre_10_max,timbre_11_min,timbre_11_max,Top10
0,2010,This Is the House That Doubt Built,A Day to Remember,SOBGGAB12C5664F054,AROBSHL1187B9AFB01,3,0.853,-4.262,91.525,0.953,...,82.475,-52.025,39.116,-35.368,71.642,-126.44,18.658,-44.77,25.989,0
1,2010,Sticks & Bricks,A Day to Remember,SOPAQHU1315CD47F31,AROBSHL1187B9AFB01,4,1.0,-4.051,140.048,0.921,...,106.918,-61.32,35.378,-81.928,74.574,-103.808,121.935,-38.892,22.513,0
2,2010,All I Want,A Day to Remember,SOOIZOU1376E7C6386,AROBSHL1187B9AFB01,4,1.0,-3.571,160.512,0.489,...,80.621,-59.773,45.979,-46.293,59.904,-108.313,33.3,-43.733,25.744,0
3,2010,It's Complicated,A Day to Remember,SODRYWD1315CD49DBE,AROBSHL1187B9AFB01,4,1.0,-3.815,97.525,0.794,...,96.675,-78.66,41.088,-49.194,95.44,-102.676,46.422,-59.439,37.082,0
4,2010,2nd Sucks,A Day to Remember,SOICMQB1315CD46EE3,AROBSHL1187B9AFB01,4,0.788,-4.707,140.053,0.286,...,110.332,-56.45,37.555,-48.588,67.57,-52.796,22.888,-50.414,32.758,0


In [2]:
songs.isnull().sum()

year                        0
songtitle                   0
artistname                  0
songID                      0
artistID                    0
timesignature               0
timesignature_confidence    0
loudness                    0
tempo                       0
tempo_confidence            0
key                         0
key_confidence              0
energy                      0
pitch                       0
timbre_0_min                0
timbre_0_max                0
timbre_1_min                0
timbre_1_max                0
timbre_2_min                0
timbre_2_max                0
timbre_3_min                0
timbre_3_max                0
timbre_4_min                0
timbre_4_max                0
timbre_5_min                0
timbre_5_max                0
timbre_6_min                0
timbre_6_max                0
timbre_7_min                0
timbre_7_max                0
timbre_8_min                0
timbre_8_max                0
timbre_9_min                0
timbre_9_m

<h2 align='center'>Questions</h2>

In [3]:
# 1 - How many observations (songs) are from the year 2010?

songs[songs.year == 2010].shape

# Answer: 373

(373, 39)

In [4]:
# 2 - How many songs does the dataset include for which the artist name is "Michael Jackson"?

songs[songs.artistname == 'Michael Jackson'].shape

# Answer: 18

(18, 39)

In [5]:
# 3 - Which of these songs by Michael Jackson made it to the Top 10? Select all that apply.

# a) Beat It
# b) You Rock My World
# c) Billie Jean
# d) You Are Not Alone

michael_songs = ['Beat It', 'You Rock My World', 'Billie Jean', 'You Are Not Alone']
# Parenthesize each comparison: & binds more tightly than ==
x = songs.songtitle[(songs.artistname == 'Michael Jackson') & (songs.Top10 == 1)]
set(michael_songs).intersection(set(x))


# Answer: 'You Are Not Alone' and 'You Rock My World'

{'You Are Not Alone', 'You Rock My World'}
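
The filter above combines two conditions with `&`. In Python, `&` binds more tightly than `==`, so each comparison is best wrapped in its own parentheses. The sketch below shows the parsing rule with plain Python values standing in for a single row; `is_artist` and `chart_rank` are hypothetical names, and `chart_rank` is an invented integer column that is not restricted to 0/1 (unlike `Top10`).

```python
# Plain Python values stand in for one row of a DataFrame (hypothetical row).
is_artist = True   # result of (artistname == 'Michael Jackson') for this row
chart_rank = 3     # invented integer value that is neither 0 nor 1

# Without parentheses, Python parses this as (is_artist & chart_rank) == 1.
unparenthesized = is_artist & chart_rank == 1

# With parentheses, the comparison is evaluated before the bitwise AND.
parenthesized = is_artist & (chart_rank == 1)

print(unparenthesized)  # True, because (True & 3) is 1
print(parenthesized)    # False, because chart_rank is not 1
```

Because operator precedence is fixed by the language grammar, the same parsing applies to expressions built from pandas Series.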

In [6]:
# 4 - The variable corresponding to the estimated time signature (timesignature) is discrete.
#     What are the values of this variable that occur in our dataset?

# 5 - Which timesignature value is the most frequent among songs in our dataset?


songs.timesignature.value_counts()

# Answer 4: 0, 1, 3, 4, 5 and 7
# Answer 5: 4 (6787 songs)

4    6787
3     503
1     143
5     112
7      19
0      10
Name: timesignature, dtype: int64

In [7]:
# 6 - Out of all of the songs in our dataset, which song has the highest tempo?

songs.songtitle[songs.tempo == songs.tempo.max()]

# Answer: "Wanna Be Startin' Somethin'"

6205    Wanna Be Startin' Somethin'
Name: songtitle, dtype: object

In [32]:
from sklearn.linear_model import LogisticRegression

# Train on songs released through 2009; hold out the 2010 songs for testing.
train = songs[songs.year <= 2009]
test  = songs[songs.year > 2009]

print(train.shape)
print(test.shape)

# Columns that are not used as predictors: the target, the year, and the identifiers.
non_features = ["Top10", "year", "songtitle", "artistname", "songID", "artistID"]

train_x = train.drop(non_features, axis=1)
train_y = train.Top10

test_x = test.drop(non_features, axis=1)
test_y = test.Top10

# Model 1: logistic regression on all remaining variables.
log = LogisticRegression()
log.fit(train_x, train_y);

(7201, 39)
(373, 39)


In [33]:
# 7 - Let's now think about the variables in our dataset related to the confidence of the time signature,
#     key and tempo (timesignature_confidence, key_confidence, and tempo_confidence)
#     What does the model suggest?

# a) The lower our confidence about time signature, key and tempo, the more likely the song is to be in the Top 10
# b) The higher our confidence about time signature, key and tempo, the more likely the song is to be in the Top 10


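# Select the numeric columns of interest first so that corr() never sees the text columns.
conf_cols = ['timesignature_confidence', 'key_confidence', 'tempo_confidence']
songs[conf_cols + ['Top10']].corr()['Top10'][conf_cols]


# Answer: b (all three confidences correlate positively with Top10; the Model 1 coefficients inspected in the next cell are positive as well)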

timesignature_confidence    0.060802
key_confidence              0.010182
tempo_confidence            0.084852
Name: Top10, dtype: float64

In [34]:
# 8 - In general, if the confidence is low for the time signature, tempo, and key,
#     then the song is more likely to be complex. What does Model 1 suggest in terms of complexity?

# a) Mainstream listeners tend to prefer more complex songs
# b) Mainstream listeners tend to prefer less complex songs

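# Map each feature name to its fitted Model 1 coefficient.
d = dict(zip(train_x.columns, log.coef_[0]))
[d[i] for i in ['timesignature_confidence', 'key_confidence', 'tempo_confidence']]

# Answer: b  (all three coefficients are positive: higher confidence, i.e. a less complex song, raises the predicted odds of a Top 10 hit)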

[0.77102497637790035, 0.41065266849934173, 0.67084432664528659]
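
In a logistic regression, a coefficient is a change in log-odds per unit increase of its variable, so exponentiating it gives the factor by which the odds are multiplied. The sketch below applies this to the three coefficients printed above. It assumes all other variables are held fixed, and the names `coefs` and `odds_ratios` are introduced here for illustration.

```python
import math

# Coefficients copied from the output above.
coefs = {
    "timesignature_confidence": 0.77102497637790035,
    "key_confidence": 0.41065266849934173,
    "tempo_confidence": 0.67084432664528659,
}

# exp(coefficient) is the factor by which the odds of Top10 = 1 change
# when the variable increases by one unit, all else held fixed.
odds_ratios = {name: math.exp(beta) for name, beta in coefs.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds multiplied by {ratio:.2f}")
```

All three factors are greater than 1, which is the numerical form of the statement that higher confidence goes with a higher predicted chance of reaching the Top 10.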

In [36]:
# 9 - Songs with heavier instrumentation tend to be louder (have higher values in the variable "loudness")
#     and more energetic (have higher values in the variable "energy").

#     By inspecting the coefficient of the variable "loudness", what does Model 1 suggest?


[d[i] for i in ['loudness', 'energy']]


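# Answer: the coefficient for loudness is positive, so Model 1 suggests that louder songs
# (heavier instrumentation) are more likely to reach the Top 10.
# Note that the coefficient for energy is negative, which points in the opposite direction.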

[0.13717935487020416, -1.2353623166110881]

In [40]:
# 10 - What is the correlation between the variables "loudness" and "energy" in the training set?

train_x.corr()['loudness']['energy']


# Answer: positive correlation of 0.74

0.73990670845580597
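
The value above is a Pearson correlation coefficient, which is the default method of pandas' `corr()`. As a minimal sketch of what that number measures, the function below computes it by hand. The two short lists are made-up values for illustration, not rows from songs.csv, and the helper name `pearson` is introduced here.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up values (not from songs.csv): louder tracks paired with higher energy.
loudness = [-12.0, -9.5, -7.0, -5.5, -4.0]
energy = [0.35, 0.50, 0.62, 0.80, 0.91]

r = pearson(loudness, energy)
print(round(r, 3))
```

A value close to 1 means the two variables rise together almost linearly, so each one carries much of the same information as the other.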

Given that these two variables are highly correlated, Model 1 suffers from multicollinearity. To avoid this issue, we will omit one of these two variables and rerun the logistic regression. In the rest of this problem, we'll build two variations of our original model: Model 2, in which we keep "energy" and omit "loudness", and Model 3, in which we keep "loudness" and omit "energy".

In [51]:
base_drop = ["Top10", "year", "songtitle", "artistname", "songID", "artistID"]

# Model 2 keeps "energy" and omits "loudness".
model2_train = train.drop(["loudness"] + base_drop, axis=1)
model2_test = test.drop(["loudness"] + base_drop, axis=1)

# Model 3 keeps "loudness" and omits "energy".
model3_train = train.drop(["energy"] + base_drop, axis=1)
model3_test = test.drop(["energy"] + base_drop, axis=1)

In [69]:
# 11 - Inspect the coefficient of the variable "energy". What do you observe?


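# Refit on the Model 2 features (this overwrites the Model 1 fit stored in log).
log.fit(model2_train, train_y)
dd = dict(zip(model2_train.columns, log.coef_[0]))
dd['energy']


# Answer: the coefficient here is negative, which would suggest that songs with high energy
# tend to be less popular. The assignment's reference answer is the reverse: there the
# coefficient is positive, so songs with high energy tend to be more popular.
# One possible cause of the mismatch: scikit-learn's LogisticRegression applies L2
# regularization by default, so its coefficients need not match an unregularized fit.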

-0.16971457640352669

As far as I can tell, the remaining questions in the assignment are specific to R.