# <span style="color:DarkSlateGray">Analysis of genre popularity based on the Kaggle platform's database 'Spotify HUGE database - daily charts over 3 years (2017-2020)'</span>

<img src="https://rootblog.pl/wp-content/uploads/2020/10/spotify.jpg" alt="Spotify" title="Spotify" width="200" height="100">


<b>Contributors:</b><br>
Dominika Gerszewska<br>
Marcin Sidoruk<br>
Andrzej Łososowski<br>
Joanna Zielińska <br>

## <span style="color:DarkSlateGray">Kaggle dataset:</span>

Source: https://www.kaggle.com/datasets/pepepython/spotify-huge-database-daily-charts-over-3-years?select=Final+database

This database contains information about the top 200 daily streaming songs on Spotify for over three years.<br>
It includes a wealth of information for each track, gathered via Spotify's API, such as the artist, country, genre, and other relevant details.<br>
To simplify the data, the popularity of each song has been aggregated into a single score.<br>
This Spotify database is a valuable resource for anyone interested in music or data analysis.

## <span style="color:DarkSlateGray">Goal of the project:</span>

The goal of this project is to explore and analyze Spotify's daily top 200 streaming songs data over a period of three years. <br>The project includes a variety of visualizations and analyses, such as identifying the most popular music genres, creating a map of average popularity, analyzing popularity by language, examining musical diversity, and identifying the most frequently occurring genres by country. 
<br>Additionally, the project includes a feature that allows users to input a specific genre and view the top 10 countries where that genre is most popular.

The business objective of this project is to provide valuable insights for artists and music industry professionals who are looking to understand music trends and identify opportunities for market entry. 
<br>For example, an artist could use this information to determine which countries to target when promoting their music based on the popularity of their genre in different regions. <br>Similarly, music industry professionals could leverage this data to make informed decisions about marketing and distribution strategies.
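
The genre lookup described above can be sketched as a small, non-interactive helper. This is only an illustration under stated assumptions: the function name `top_countries_for_genre`, the case-insensitive matching, and the sample rows are hypothetical and not part of the original notebook; only the `Country` and `Genre` column names come from the dataset.

```python
import pandas as pd

def top_countries_for_genre(df, genre, n=10):
    """Return the n countries in which `genre` appears most often (hypothetical helper)."""
    # Assumption: match case-insensitively and ignore surrounding whitespace
    wanted = genre.strip().lower()
    mask = df["Genre"].str.strip().str.lower() == wanted
    counts = df.loc[mask, "Country"].value_counts().head(n)
    # Name both columns explicitly so the result does not depend on the pandas version
    return counts.rename_axis("Country").reset_index(name="Counts")

# Made-up sample rows, only to show the shape of the result
sample = pd.DataFrame({
    "Country": ["Poland", "Poland", "Poland", "Turkey", "Turkey", "Ecuador", "Ecuador"],
    "Genre": ["k-pop", "k-pop", "K-Pop ", "k-pop", "k-pop", "k-pop", "reggaeton"],
})
top = top_countries_for_genre(sample, "k-pop", n=2)
print(top)
```

Returning a DataFrame rather than plotting directly keeps the lookup reusable: the same result can feed a bar chart, a map, or a table.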

In [1]:
#requirements

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import warnings
warnings.filterwarnings('ignore') 
%matplotlib inline
pd.set_option('display.max_columns', 151)

In [2]:
# Loading data #1

#Importing the database with selected columns
df = pd.read_csv('Orginal_database_from_Kaggle/Final database.csv', usecols=['Country', 'Popularity', 'Genre'])
df_1 = pd.read_csv('Orginal_database_from_Kaggle/Final database.csv', usecols=['Country', 'Genre', 'Artist','Title','Album','Cluster','Popularity','Artist_followers'])

In [3]:
# Loading data #2
# Adding an extra data set with ISO country codes so that plotly can identify each country

df_country_iso = pd.read_csv('Country_ISO/countries_codes_and_coordinates.csv') # forward slash works on every OS
df_country_iso = df_country_iso.replace('"','', regex=True) # remove stray quotation marks
df_country_iso = df_country_iso.replace('United Kingdom', 'UK') # adjusting to data in Spotify dataset

In [4]:
# Loading data #3

# Creating a dictionary that maps each country name to its three-letter ISO code
kraj = list(df_country_iso['Country']) # country names from the ISO table
iso = list(df_country_iso['Alpha-3 code']) # three-letter country codes from the ISO table
dict = {} # note: this name shadows the built-in dict
iso = [x.strip(' ') for x in iso] # strip spaces from the codes
for i,j in zip(kraj,iso): # build the lookup used to fill the iso_alpha column of df
    dict.setdefault(i,j)

In [5]:
# Loading data #4

df['iso_alpha'] = df['Country'] # add an iso_alpha column holding the Country values, to be swapped for three-letter codes

df.replace({"iso_alpha": dict},inplace=True) # replace each iso_alpha value with its three-letter code, which plotly needs to display the countries
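
As a cross-check on the mapping step, the sketch below builds the same kind of lookup with `dict(zip(...))` and applies it with `Series.map`, which leaves `NaN` wherever no code exists (whereas `replace` silently keeps the country name). The tables and the names `iso_table`, `charts`, and `country_to_iso` are made up for illustration. Note that `dict(zip(...))` keeps the last entry for a duplicated country, while `setdefault` keeps the first.

```python
import pandas as pd

# Hypothetical miniature versions of the ISO table and the chart data
iso_table = pd.DataFrame({
    "Country": ["Poland", "Turkey", "UK"],
    "Alpha-3 code": [" POL", " TUR", " GBR"],
})
charts = pd.DataFrame({"Country": ["Poland", "UK", "Global"]})

# Build the lookup in one step, stripping stray spaces around the codes
country_to_iso = dict(zip(iso_table["Country"], iso_table["Alpha-3 code"].str.strip()))

# map() yields NaN for countries without a code, which makes gaps easy to spot
charts["iso_alpha"] = charts["Country"].map(country_to_iso)
unmapped = sorted(charts.loc[charts["iso_alpha"].isna(), "Country"].unique())
print(unmapped)
```

Listing the unmapped names before plotting helps catch spelling differences between the two data sets, such as the 'United Kingdom' versus 'UK' case handled earlier.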


## <span style="color:DarkSlateGray">Data exploration and identification of basic issues</span>

In [7]:
df_1.head()

Unnamed: 0,Country,Popularity,Title,Artist,Genre,Artist_followers,Album,Cluster
0,Global,31833.95,adan y eva,Paulo Londra,argentine hip hop,11427104.0,Adan y Eva,global
1,USA,8.0,adan y eva,Paulo Londra,argentine hip hop,11427104.0,Adan y Eva,english speaking and nordic
2,Argentina,76924.4,adan y eva,Paulo Londra,argentine hip hop,11427104.0,Adan y Eva,spanish speaking
3,Belgium,849.6,adan y eva,Paulo Londra,argentine hip hop,11427104.0,Adan y Eva,english speaking and nordic
4,Switzerland,20739.1,adan y eva,Paulo Londra,argentine hip hop,11427104.0,Adan y Eva,english speaking and nordic


In [8]:
df_1.info(show_counts=False) # show_counts replaces the null_counts argument removed in pandas 2.0

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 170633 entries, 0 to 170632
Data columns (total 8 columns):
 #   Column            Dtype  
---  ------            -----  
 0   Country           object 
 1   Popularity        float64
 2   Title             object 
 3   Artist            object 
 4   Genre             object 
 5   Artist_followers  object 
 6   Album             object 
 7   Cluster           object 
dtypes: float64(1), object(7)
memory usage: 10.4+ MB


In [9]:
# Data cleansing

df = df.replace('n-a', np.nan)
df = df.dropna()
df_1 = df_1.replace('n-a', np.nan)
df_1 = df_1.dropna()
drop_index_cl = df_1[df_1.Cluster == 'global'].index
drop_index_c = df[df.Country == 'Global'].index
df.drop(drop_index_c,inplace=True)
df_1.drop(drop_index_cl,inplace=True)
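
The cleansing step can also be sketched as a single function. This version additionally converts `Artist_followers` to a numeric type, on the assumption that its `object` dtype in the `info()` output comes from placeholder strings such as 'n-a'; that cause is a guess, not something the notebook states. The function name `clean_charts` and the sample rows are hypothetical.

```python
import numpy as np
import pandas as pd

def clean_charts(raw):
    """Drop placeholder rows and the aggregate 'Global' entries (hypothetical helper)."""
    # replace() returns a new frame, so the input is left untouched
    cleaned = raw.replace("n-a", np.nan)
    # Assumption: followers are stored as text; coerce anything unparseable to NaN
    cleaned["Artist_followers"] = pd.to_numeric(cleaned["Artist_followers"], errors="coerce")
    cleaned = cleaned.dropna()
    # 'Global' is an aggregate, not a country, so it is excluded from per-country analysis
    return cleaned[cleaned["Country"] != "Global"].reset_index(drop=True)

# Made-up sample rows covering each case that gets dropped
sample = pd.DataFrame({
    "Country": ["Global", "USA", "Poland", "Poland"],
    "Genre": ["pop", "n-a", "k-pop", "k-pop"],
    "Artist_followers": ["100.0", "200.0", "300.0", "n-a"],
    "Popularity": [10.0, 8.0, 5.0, 2.0],
})
cleaned = clean_charts(sample)
print(cleaned)
```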

### Table of unique values 

In [10]:
Countries = df_1['Country'].nunique() 
Genres = df_1['Genre'].nunique() 
Titles = df_1['Title'].nunique() 
Albums = df_1['Album'].nunique() 
Artist = df_1['Artist'].nunique()

df_unique = pd.DataFrame({'Countries': [Countries],'Genres':[Genres], 'Artist': [Artist] , 'Albums':[Albums], 'Title': [Titles],})
df_unique.style.hide(axis='index') # hide_index() was removed in pandas 2.0

Countries,Genres,Artist,Albums,Title
34,1119,23347,32633,44930


## <span style="color:DarkSlateGray">Data analysis</span>

### Map of mean popularity

In [11]:
# Mean popularity per country (keyed by ISO code) to show on the map
by_country = df.groupby('iso_alpha')['Popularity'].mean().reset_index().rename(columns={'iso_alpha': 'Country','Popularity':'Mean Popularity'})

In [12]:
# Number of chart entries and mean popularity per country and language cluster

uniq = df_1.groupby(['Country','Cluster'])['Popularity'].count().reset_index().sort_values(by = 'Country')
country = df_1.groupby(['Country','Cluster'])['Popularity'].mean().reset_index().rename(columns={'Popularity':'Mean_Popularity'}).sort_values(by = 'Country')
uniq['Mean_Popularity'] = country['Mean_Popularity']
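
The count and the mean can also be computed in one pass with named aggregation, which avoids relying on index alignment between two separately grouped frames. The following is a minimal sketch on made-up rows; the output column name `Occurrences` is an assumption (the notebook keeps the count in the `Popularity` column).

```python
import pandas as pd

# Made-up rows using the column names of the Spotify dataset
sample = pd.DataFrame({
    "Country": ["Argentina", "Argentina", "USA"],
    "Cluster": ["spanish speaking", "spanish speaking", "english speaking and nordic"],
    "Popularity": [10.0, 30.0, 5.0],
})

# One groupby producing both statistics, each under an explicit column name
summary = (
    sample.groupby(["Country", "Cluster"])["Popularity"]
    .agg(Occurrences="count", Mean_Popularity="mean")
    .reset_index()
    .sort_values("Country")
)
print(summary)
```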

In [48]:
# Choropleth map of mean popularity per country

country_list = by_country
fig = px.choropleth(country_list, locations='Country', # the Country column holds three-letter ISO codes
                        color='Mean Popularity', # column that sets the colour
                        hover_name='Country', # column to add to hover information
                        color_continuous_scale=px.colors.sequential.Rainbow,
                        width=600,
                        height=600,
                        projection = 'mercator')
fig.update_layout(title='Map of mean popularity by country')
fig.show()

# Bar chart of the number of chart entries per country, coloured by mean popularity
uniq = uniq.sort_values(ascending=False, by = 'Popularity') # here Popularity holds the count of entries
fig = px.bar(uniq,x=uniq.Country,
            y=uniq.Popularity,
            labels={'Country':'Country', 'Popularity':'Number of occurrences'},    
            color = 'Mean_Popularity',            
            color_continuous_scale = px.colors.sequential.Rainbow)
fig.update_layout(title='Number of songs in the top 200 list in each country')
fig.update_traces(width=0.4)
fig.show()

### Distribution by language

In [14]:
fig = px.sunburst(country, 
                  path=['Cluster','Country'], 
                  values='Mean_Popularity',
                  color='Mean_Popularity', 
                  color_continuous_scale=px.colors.sequential.Rainbow,
                  width = 600,
                  height = 800,
                  title= 'Distribution by language'
                 )
fig.show()

### Mean popularity in each country for language cluster

In [15]:
country_en = country[country.Cluster == 'english speaking and nordic'].sort_values(ascending=False, by = 'Mean_Popularity')
country_spanish = country[country.Cluster == 'spanish speaking'].sort_values(ascending=False, by = 'Mean_Popularity')
country_portuguese = country[country.Cluster == 'southern europe and portuguese heritage'].sort_values(ascending=False, by = 'Mean_Popularity')

fig = make_subplots(rows=1, cols=3, subplot_titles=("Spanish speaking", "Southern Europe and Portuguese heritage", "English speaking and Nordic"), shared_yaxes=True, horizontal_spacing=0.1)
fig.add_trace(go.Bar(x=country_en.Country, y=country_en.Mean_Popularity), row=1, col=3)
fig.add_trace(go.Bar(x=country_spanish.Country, y=country_spanish.Mean_Popularity), row=1, col=1)
fig.add_trace(go.Bar(x=country_portuguese.Country, y=country_portuguese.Mean_Popularity), row=1, col=2)
fig.update_layout(height=400, width=1000,
                  title_text="Mean popularity in each country for language cluster", showlegend=False, yaxis_title='Mean popularity', xaxis_title='Country')
fig.update_traces(width=0.4)
fig.show()

### Number of different genres in each country

In [16]:
count_genre2 = df.groupby('Country')['Genre'].nunique().sort_values(ascending= False)
fig = px.bar(x=count_genre2.index, y=count_genre2.values, labels={'x':'Country', 'y':'The number of different genres'})
fig.update_layout(title='Number of different genres in each country')
fig.update_traces(width=0.4)
fig.show()

### Top 10 most popular music genres

In [17]:
genre_counts = df['Genre'].value_counts().nlargest(10)
fig = px.bar(x=genre_counts.index, y=genre_counts, labels={'x':'Genre', 'y':'The number of songs'})
fig.update_layout(title='Top 10 most popular music genres')
fig.update_traces(width=0.4)
fig.show()

### The most common genre in each country

In [18]:
# Most frequent genre per country; the index levels are named explicitly because
# the default level names differ between pandas versions
result = df.groupby('Country')['Genre'].apply(lambda x: x.value_counts().nlargest(1)).rename_axis(['Country', 'Legend']).sort_values(ascending=True).reset_index(name='Counts')
wykres2 = px.bar(result, y='Country', x='Counts', color='Legend', orientation='h')
wykres2.update_traces(textposition='inside',width =0.4)
wykres2.update_layout(xaxis_title='Number of occurrences',
                  yaxis_title='Country',
                  height=800)
wykres2.update_layout(title='The most common genre in each country')
wykres2.show()

### Top 10 most popular music genres in selected countries

In [19]:
poland_counts = df.query('Country == "Poland"')['Genre'].value_counts().nlargest(10)
turkey_counts = df.query('Country == "Turkey"')['Genre'].value_counts().nlargest(10)
ecuador_counts = df.query('Country == "Ecuador"')['Genre'].value_counts().nlargest(10)
fig = make_subplots(rows=1, cols=3, subplot_titles=("Poland", "Turkey", "Ecuador"), shared_yaxes=True)
fig.add_trace(go.Bar(x=poland_counts.index, y=poland_counts), row=1, col=1)
fig.add_trace(go.Bar(x=turkey_counts.index, y=turkey_counts), row=1, col=2)
fig.add_trace(go.Bar(x=ecuador_counts.index, y=ecuador_counts), row=1, col=3)
fig.update_layout(height=400, width=1000, title_text="Top 10 most popular music genres in selected countries", showlegend=False, yaxis_title='Number of occurrences', xaxis_title='Genre')
fig.show()

### Number of occurrences in countries for selected genre

In [46]:
#Display countries ordered by count for the selected genre
wprowadzony_gatunek = input("Please enter the name of the music genre for which you want to see ordered countries by count: ")
nowy_df = df.loc[df['Genre'] == wprowadzony_gatunek, ['Genre', 'Country','iso_alpha']]
# count occurrences per country; naming the count column explicitly works across pandas versions
top_counts = nowy_df[['iso_alpha','Country']].value_counts().reset_index(name='Counts')
# plotting a map of the countries where the genre appears
country_list = top_counts
fig = px.choropleth(country_list, locations="iso_alpha",
                        color="Counts", # number of occurrences sets the colour
                        hover_name="Country", # column to add to hover information
                        color_continuous_scale=px.colors.sequential.Rainbow,
                        width=800,
                        height=800,
                        projection = 'mercator')
fig.show()

# plotting a bar chart of the same counts
fig = px.bar(top_counts, x='Country', y='Counts', labels={'Counts':'Number of occurrences'})
fig.update_layout(title=f"Count in countries for selected genre ({wprowadzony_gatunek})")
fig.show()

Please enter the name of the music genre for which you want to see ordered countries by count: k-pop
