<a id='research_question'></a>
# Research Question

What criteria must an up-and-coming artist meet to succeed in the music industry? More specifically, which musical style elements are associated with the most streams or record sales?

# Dataset(s)

- Dataset Name: spotify_dataset.csv
- Link to the dataset: https://www.kaggle.com/sashankpillai/spotify-top-200-charts-20202021
- Number of observations: 1556

This dataset lists the songs that reached the Spotify Top 200 charts between 2020 and 2021. For each song it records the highest charting position, number of times charted, week of highest charting, number of streams, artist, artist follower count, genre, and release date, as well as scores for danceability, energy, loudness, speechiness, acousticness, and other audio features calculated and recorded by the Spotify platform and API.

# Setup

In [24]:
import pandas as pd
pd.options.display.float_format = '{:.2f}'.format
import numpy as np
#import matplotlib.pyplot as plt

In [32]:
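# Spotify Top 200 chart songs (2020-2021)
songs = pd.read_csv('data/spotify_dataset.csv')
# Song-level audio features (Title, Artist, Top Genre, Year, ...)
hundf = pd.read_csv('data/2000song.csv')
# Chart entries with a stream count per entry (title, artist, streams, ...)
hunndf = pd.read_csv('data/charts.csv')
songs.head()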

Unnamed: 0,Index,Highest Charting Position,Number of Times Charted,Week of Highest Charting,Song Name,Streams,Artist,Artist Followers,Song ID,Genre,...,Danceability,Energy,Loudness,Speechiness,Acousticness,Liveness,Tempo,Duration (ms),Valence,Chord
0,1,1,8,2021-07-23--2021-07-30,Beggin',48633449,Måneskin,3377762,3Wrjm47oTz2sjIgck11l5e,"['indie rock italiano', 'italian pop']",...,0.714,0.8,-4.808,0.0504,0.127,0.359,134.002,211560,0.589,B
1,2,2,3,2021-07-23--2021-07-30,STAY (with Justin Bieber),47248719,The Kid LAROI,2230022,5HCyWlXZPP0y6Gqq8TgA20,['australian hip hop'],...,0.591,0.764,-5.484,0.0483,0.0383,0.103,169.928,141806,0.478,C#/Db
2,3,1,11,2021-06-25--2021-07-02,good 4 u,40162559,Olivia Rodrigo,6266514,4ZtFanR9U6ndgddUvNcjcG,['pop'],...,0.563,0.664,-5.044,0.154,0.335,0.0849,166.928,178147,0.688,A
3,4,3,5,2021-07-02--2021-07-09,Bad Habits,37799456,Ed Sheeran,83293380,6PQ88X9TkUIAUIZJHW2upE,"['pop', 'uk pop']",...,0.808,0.897,-3.712,0.0348,0.0469,0.364,126.026,231041,0.591,B
4,5,5,1,2021-07-23--2021-07-30,INDUSTRY BABY (feat. Jack Harlow),33948454,Lil Nas X,5473565,27NovPIUIRrOZoCHxABJwK,"['lgbtq+ hip hop', 'pop rap']",...,0.736,0.704,-7.409,0.0615,0.0203,0.0501,149.995,212000,0.894,D#/Eb
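
The `Genre` column in the preview above stores each song's genres as the text of a Python list (for example `"['pop', 'uk pop']"`), so it has to be parsed before genres can be counted. The following is a minimal sketch, assuming every non-empty value is a valid list literal. The helper `parse_genres` and the toy frame `genre_demo` are illustrative names and are not part of the notebook.

```python
import ast

import pandas as pd

# Toy rows copied from the preview above; the notebook itself would use `songs`.
genre_demo = pd.DataFrame({
    "Song Name": ["Beggin'", "good 4 u", "Bad Habits"],
    "Genre": ["['indie rock italiano', 'italian pop']", "['pop']", "['pop', 'uk pop']"],
})


def parse_genres(value):
    """Turn the text of a list literal into a list; empty or unparsable values become []."""
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError, TypeError):
        return []
    return list(parsed) if isinstance(parsed, (list, tuple)) else []


genre_demo["genre_list"] = genre_demo["Genre"].apply(parse_genres)

# One row per (song, genre) pair, then count how often each genre appears.
genre_counts = genre_demo.explode("genre_list")["genre_list"].value_counts()
print(genre_counts)
```

Using `ast.literal_eval` rather than `eval` means only literals are accepted, so a malformed cell cannot execute arbitrary code.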


In [33]:
hunndf = hunndf[['title','artist','streams']]
hunndf.head()


Unnamed: 0,title,artist,streams
0,Despacito (Featuring Daddy Yankee),Luis Fonsi,365941.0
1,El Amante,Nicky Jam,179697.0
2,Reggaetón Lento (Bailemos),CNCO,169647.0
3,Shape of You,Ed Sheeran,168495.0
4,Chantaje (feat. Maluma),Shakira,141696.0


In [35]:
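# Sum streams over every chart entry that shares a title.
# Note: grouping on title alone also combines different songs with the same name.
hunndf['totstream'] = hunndf.groupby('title')['streams'].transform('sum')
# Select the two columns first so drop_duplicates leaves one row per title.
hunndf = hunndf[['title', 'totstream']].drop_duplicates()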

In [36]:
hunndf = hunndf[['title','totstream']]
hunndf

Unnamed: 0,title,totstream
0,Despacito (Featuring Daddy Yankee),1520215032.00
1,El Amante,508847644.00
2,Reggaetón Lento (Bailemos),439157161.00
3,Shape of You,4961902422.00
4,Chantaje (feat. Maluma),588792731.00
...,...,...
25508450,Народная,0.00
25508653,Para Mucho Más,0.00
25510402,Una Domenica Al Mare,0.00
25510798,Beat It (Base Trap Remix),0.00
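
The totals above are grouped on `title` alone, so two different songs that happen to share a title would be added together. A hedged sketch of one alternative is to group on both `title` and `artist`. The frame `charts_demo`, its artist labels, and its stream values are made up for illustration.

```python
import pandas as pd

# Made-up chart entries: two artists share the title "Sunrise".
charts_demo = pd.DataFrame({
    "title": ["Sunrise", "Sunrise", "Sunrise", "Hound Dog"],
    "artist": ["Artist A", "Artist A", "Artist B", "Artist C"],
    "streams": [100.0, 50.0, 30.0, None],
})

# One row per (title, artist) pair; sum() skips missing values,
# so a song with no recorded streams ends up with a total of 0.
totals = (
    charts_demo
    .groupby(["title", "artist"], as_index=False)["streams"]
    .sum()
    .rename(columns={"streams": "totstream"})
)
print(totals)
```

Because the result already has one row per group, no `drop_duplicates` step is needed afterwards.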


In [41]:
hunnndf = pd.merge(hundf, hunndf,  how='left', left_on='Title', right_on = 'title')
hunnndf = hunnndf.drop_duplicates()

In [42]:
hunnndf

Unnamed: 0,Index,Title,Artist,Top Genre,Year,Beats Per Minute (BPM),Energy,Danceability,Loudness (dB),Liveness,Valence,Length (Duration),Acousticness,Speechiness,Popularity,title,totstream
0,1,Sunrise,Norah Jones,adult standards,2004,157,30,53,-14,11,68,201,94,3,71,Sunrise,846378.00
86,2,Black Night,Deep Purple,album rock,2000,135,79,50,-11,17,81,207,17,7,39,Black Night,0.00
87,3,Clint Eastwood,Gorillaz,alternative hip hop,2001,168,69,66,-9,7,52,341,2,17,69,Clint Eastwood,519819.00
176,4,The Pretender,Foo Fighters,alternative metal,2007,173,96,43,-4,3,37,269,0,4,76,The Pretender,461580.00
231,5,Waitin' On A Sunny Day,Bruce Springsteen,classic rock,2002,106,82,58,-5,10,87,256,1,3,59,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1210572,1990,Heartbreak Hotel,Elvis Presley,adult standards,1958,94,21,70,-12,11,72,128,84,7,63,Heartbreak Hotel,20102.00
1210582,1991,Hound Dog,Elvis Presley,adult standards,1958,175,76,36,-8,76,95,136,73,6,69,Hound Dog,0.00
1210583,1992,Johnny B. Goode,Chuck Berry,blues rock,1959,168,80,53,-9,31,97,162,74,7,74,Johnny B. Goode,856475.00
1210604,1993,Take Five,The Dave Brubeck Quartet,bebop,1959,174,26,45,-13,7,60,324,54,4,65,Take Five,0.00
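
The row index above jumps from 0 to 86 between the first two songs, which suggests that each title on the left matched many rows on the right before `drop_duplicates` removed the repeats. The sketch below shows how `pd.merge` can guard against that, assuming the right-hand table is first reduced to one row per title. The frames `left_demo` and `right_demo` are small stand-ins for `hundf` and `hunndf`, not the real data.

```python
import pandas as pd

left_demo = pd.DataFrame({"Title": ["Sunrise", "Black Night", "Waitin' On A Sunny Day"]})
right_demo = pd.DataFrame({
    "title": ["Sunrise", "Sunrise", "Black Night"],
    "totstream": [846378.0, 846378.0, 0.0],
})

# Keep a single row per title so the join cannot multiply left-hand rows.
right_unique = right_demo.drop_duplicates(subset="title")

merged = pd.merge(
    left_demo,
    right_unique,
    how="left",
    left_on="Title",
    right_on="title",
    validate="many_to_one",  # raises MergeError if a right-hand title repeats
    indicator=True,          # adds a `_merge` column: 'both' or 'left_only'
)
print(merged["_merge"].value_counts())
```

The `_merge` column makes it easy to see how many songs found no stream total, such as the `Waitin' On A Sunny Day` row in the output above.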


In [10]:
# Loop-based version of the per-title stream total (the groupby cell above does the same work).
# It needs the original 'streams' column, so run it before hunndf is reduced to title/totstream.
totdictionary = {}
for title, streams in zip(hunndf['title'], hunndf['streams']):
    # Treat a missing stream count as 0 and add it to the running total for this title
    totdictionary[title] = totdictionary.get(title, 0) + (0 if pd.isna(streams) else streams)

# Convert the title -> total mapping into a new data frame
totdf = pd.DataFrame(list(totdictionary.items()), columns=['title', 'totstream'])

In [9]:
hunndf.iloc[0]

title      Despacito (Featuring Daddy Yankee)
artist                             Luis Fonsi
streams                                365941
Name: 0, dtype: object

In [3]:
# List 'Song Name' only once: a repeated label creates a duplicate column that breaks the merge below
songs = songs[['Song Name','Highest Charting Position','Number of Times Charted','Week of Highest Charting','Streams','Genre']]
songs.head()

Unnamed: 0,Song Name,Highest Charting Position,Number of Times Charted,Week of Highest Charting,Streams,Genre
0,Beggin',1,8,2021-07-23--2021-07-30,48633449,"['indie rock italiano', 'italian pop']"
1,STAY (with Justin Bieber),2,3,2021-07-23--2021-07-30,47248719,['australian hip hop']
2,good 4 u,1,11,2021-06-25--2021-07-02,40162559,['pop']
3,Bad Habits,3,5,2021-07-02--2021-07-09,37799456,"['pop', 'uk pop']"
4,INDUSTRY BABY (feat. Jack Harlow),5,1,2021-07-23--2021-07-30,33948454,"['lgbtq+ hip hop', 'pop rap']"


In [9]:
hundf.head()

Unnamed: 0,Index,Title,Artist,Top Genre,Year,Beats Per Minute (BPM),Energy,Danceability,Loudness (dB),Liveness,Valence,Length (Duration),Acousticness,Speechiness,Popularity
0,1,Sunrise,Norah Jones,adult standards,2004,157,30,53,-14,11,68,201,94,3,71
1,2,Black Night,Deep Purple,album rock,2000,135,79,50,-11,17,81,207,17,7,39
2,3,Clint Eastwood,Gorillaz,alternative hip hop,2001,168,69,66,-9,7,52,341,2,17,69
3,4,The Pretender,Foo Fighters,alternative metal,2007,173,96,43,-4,3,37,269,0,4,76
4,5,Waitin' On A Sunny Day,Bruce Springsteen,classic rock,2002,106,82,58,-5,10,87,256,1,3,59


In [8]:
# Left join: keep every row of hundf and attach chart data where the song names match
pd.merge(hundf, songs, left_on='Title', right_on='Song Name', how='left')

In [None]:
# Look up the row(s) for 'bad guy'; hundf stores song names in the 'Title' column
hundf[hundf['Title'].str.lower() == 'bad guy']

# Data Cleaning

As a first cleaning step, we flag the missing values in `songs`.

In [31]:
# Boolean mask that is True wherever a value in songs is missing
null_songs = songs.isnull()

['Tempo', 'Acousticness', 'Highest Charting Position', 'Week of Highest Charting', 'Genre', 'Liveness', 'Weeks Charted', 'Song Name', 'Duration (ms)', 'Index', 'Song ID', 'Release Date', 'Artist', 'Danceability', 'Streams', 'Chord', 'Valence', 'Popularity', 'Artist Followers', 'Loudness', 'Speechiness', 'Number of Times Charted', 'Energy']
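
`songs.isnull()` returns a frame of booleans with the same shape as `songs`, and summing it per column is a common next step. The sketch below runs on a made-up frame, `songs_demo`, and makes one extra assumption that the source does not confirm: that some missing values may be stored as blank strings rather than as NaN.

```python
import numpy as np
import pandas as pd

# Made-up rows with one missing stream count and one blank genre.
songs_demo = pd.DataFrame({
    "Song Name": ["Beggin'", "good 4 u", "Bad Habits"],
    "Streams": [48633449, np.nan, 37799456],
    "Genre": ["['pop']", " ", "['pop', 'uk pop']"],
})

# Assumption: whitespace-only strings stand for missing values.
cleaned = songs_demo.replace(r"^\s*$", np.nan, regex=True)

# Number of missing values in each column.
missing_per_column = cleaned.isnull().sum()
print(missing_per_column)

# Keep only the rows with no missing values at all.
complete_rows = cleaned.dropna()
```

Whether to drop incomplete rows or fill them in depends on how many there are, which the per-column counts make visible.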
