### Data Analysis - Popularity Score

### Notation

- $K$: number of groups.
    In this exercise, $K = 4$ (pop_sample1, rock_sample1, hiphop_sample1, country_sample1).

- $X_{ij}$: the $j$th observation in the $i$th group.
    For example, pop_sample1 (one value of $i$) holds 700 observations, each indexed by $j$.

- $\bar X_{i}$: the mean of the $i$th group.
    For example, the mean of the $X_{ij}$ in pop_sample1. There are $K$ such group means, one per group.

- $\bar X$: the mean of all observations across all groups.
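
Written out, the two means in this notation look as follows. The symbols $n_i$ (number of observations in group $i$) and $n$ (total number of observations) are not part of the list above; they are introduced here only as a convenience.

$$
\bar X_i = \frac{1}{n_i}\sum_{j=1}^{n_i} X_{ij},
\qquad
\bar X = \frac{1}{n}\sum_{i=1}^{K}\sum_{j=1}^{n_i} X_{ij},
\qquad
n = \sum_{i=1}^{K} n_i
$$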

### Importing libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats      # stats.f_oneway, stats.ttest_ind
from scipy.stats import f    # F distribution, used for the p-value
import math
from statsmodels.stats.multicomp import (pairwise_tukeyhsd,
                                         MultiComparison)


import statsmodels.api as sm
from statsmodels.formula.api import ols
%matplotlib inline

- Reading CSV file

In [2]:
# Load the cleaned artist data into the artists DataFrame
artists = pd.read_csv('Data/spotify_artists_cleaned.csv')

- Setting a variable for each parent genre

In [3]:
pop_artists = artists[(artists['pop']==True)]
rock_artists = artists[(artists['rock']==True)]
hiphop_artists = artists[(artists['hiphop']==True)]
country_artists = artists[(artists['country']==True)]

- Taking a random sample per parent genre as a NumPy array

In [4]:
pop_sample1 = np.array(pop_artists['artist_popularity'].sample(700))
rock_sample1 = np.array(rock_artists['artist_popularity'].sample(300))
hiphop_sample1 = np.array(hiphop_artists['artist_popularity'].sample(300))
country_sample1 = np.array(country_artists['artist_popularity'].sample(200))
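
Because `sample()` draws at random, every re-run of the cell above produces different samples and therefore slightly different statistics. One possible way to make the draw repeatable is to pass `random_state`. The sketch below uses a made-up Series in place of the real `artist_popularity` column, and the seed value 42 is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for one genre's 'artist_popularity' column.
popularity = pd.Series(np.arange(1000) % 101, name="artist_popularity")

# A fixed random_state makes the draw repeatable across runs.
sample_a = popularity.sample(300, random_state=42).to_numpy()
sample_b = popularity.sample(300, random_state=42).to_numpy()

# Both draws select exactly the same rows in the same order.
same_draw = np.array_equal(sample_a, sample_b)
print(same_draw)
```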

### $K$ Groups

Number of Groups: 4
 - pop_sample1
 - rock_sample1
 - hiphop_sample1
 - country_sample1


In [5]:
genres_samples = [pop_sample1, rock_sample1, hiphop_sample1, country_sample1]
k = len(genres_samples)


### Number of observations $X_{ij}$ within each group:

In [6]:
pop_x1 = len(pop_sample1) 
rock_x2 = len(rock_sample1)
hiphop_x3 = len(hiphop_sample1)
country_x4 = len(country_sample1)

### $\bar X$ Mean of all observations across groups

In [7]:
x_all  = np.concatenate([pop_sample1,
                rock_sample1, 
                hiphop_sample1,
                country_sample1], axis=0)

x_bar = x_all.mean()

ss_total = ((x_all - x_bar)**2).sum()

## Sum of Squares 

- Sum of Squares for Treatments (SST)
- Sum of Squares for Error (SSE)
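
In the notation used at the top of this notebook, the two quantities can be written as below. Here $n_i$ denotes the number of observations in group $i$; that symbol is an assumption added for this formula and does not appear in the notation list.

$$
\text{SST} = \sum_{i=1}^{K} n_i\,(\bar X_i - \bar X)^2,
\qquad
\text{SSE} = \sum_{i=1}^{K}\sum_{j=1}^{n_i} (X_{ij} - \bar X_i)^2
$$

SST measures how far the group means sit from the overall mean (between groups), while SSE measures the spread of observations around their own group mean (within groups).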

#### For SST

In [8]:
group_list_t = [pop_sample1,
               rock_sample1, hiphop_sample1, country_sample1]

preview_sst = ([len(sample)*(sample.mean() - x_bar)**2 for sample in group_list_t])

- Final sum between groups 

In [9]:
final_sst = np.sum(preview_sst)

#### For SSE

In [10]:
sample_list_e = [pop_sample1,
               rock_sample1, hiphop_sample1, country_sample1]

preview_sse = ([((sample - sample.mean())**2).sum() for sample in sample_list_e])

- Final sum within groups

In [11]:
final_sse = np.sum(preview_sse)
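
A useful sanity check at this point is that the total sum of squares (`ss_total`, computed earlier) should equal the between-group sum plus the within-group sum. The sketch below verifies that identity on synthetic data; the group sizes and means are invented for illustration and are not drawn from the Spotify dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins for the four genre samples (means and sizes are made up).
groups = [rng.normal(loc=m, scale=10, size=n)
          for m, n in [(50, 70), (45, 30), (52, 30), (51, 20)]]

x_all = np.concatenate(groups)
x_bar = x_all.mean()

ss_total = ((x_all - x_bar) ** 2).sum()
sst = sum(len(g) * (g.mean() - x_bar) ** 2 for g in groups)  # between groups
sse = sum(((g - g.mean()) ** 2).sum() for g in groups)       # within groups

# The decomposition SS_total = SST + SSE holds up to floating-point error.
decomposition_holds = np.isclose(ss_total, sst + sse)
print(decomposition_holds)
```

In the notebook itself, the equivalent check would be comparing `ss_total` against `final_sst + final_sse`.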

## Degrees of Freedom

- Degrees of Freedom for Treatments (k - 1)
- Degrees of Freedom for Error (n - k)

#### For (k - 1)

In [12]:
df_treatments = (k - 1) 

#### For (n - k)

In [13]:
df_error = (pop_x1 + rock_x2 + hiphop_x3 + country_x4) - k
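
As a worked example of the arithmetic, the sketch below plugs in the sample sizes requested in the sampling cell (700, 300, 300 and 200). The dictionary `sizes` is a hypothetical helper used only for this illustration.

```python
# Sample sizes as requested in the sampling cell of this notebook.
sizes = {"pop": 700, "rock": 300, "hiphop": 300, "country": 200}

k = len(sizes)            # number of groups
n = sum(sizes.values())   # total number of observations

df_treatments = k - 1     # between-group degrees of freedom
df_error = n - k          # within-group degrees of freedom

print(df_treatments, df_error)  # 3 1496
```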

### F Score

- For MST

In [14]:
mst = final_sst / df_treatments

- For MSE

In [15]:
mse = final_sse / df_error

- F Score ratio

In [16]:
f_score = mst / mse

In [17]:
f_score

9.686736916394066
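
Putting the previous cells together, the F score is the ratio of the two mean squares, each being a sum of squares divided by its degrees of freedom ($n$ is the total number of observations, $k$ the number of groups):

$$
F = \frac{\text{MST}}{\text{MSE}} = \frac{\text{SST}/(k-1)}{\text{SSE}/(n-k)}
$$

A large $F$ indicates that the variation between group means is large relative to the variation within groups.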

### P-Value

In [18]:
p_value = 1 - f.cdf(f_score, df_treatments, df_error)

In [19]:
p_value

2.47750709503336e-06
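
For very small p-values, `1 - f.cdf(...)` can lose precision because it subtracts two numbers that are nearly equal. SciPy's survival function `f.sf` computes the same upper-tail probability directly. The sketch below compares the two, reusing the F score printed above and assuming degrees of freedom of 3 and 1496 (that is, k = 4 and n = 1500 from the sample sizes in this notebook).

```python
from scipy.stats import f

f_score = 9.686736916394066          # value reported above
df_treatments, df_error = 3, 1496    # assumed: k - 1 and n - k for n = 1500, k = 4

p_cdf = 1 - f.cdf(f_score, df_treatments, df_error)
p_sf = f.sf(f_score, df_treatments, df_error)  # survival function, i.e. 1 - CDF

print(p_cdf, p_sf)
```

The two values should agree to many digits here; the difference only becomes material when the p-value approaches machine precision.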

### One-Way ANOVA test

In [20]:
f_score_anova, p_value_anova = stats.f_oneway(pop_sample1,
                                              rock_sample1, 
                                              hiphop_sample1, 
                                              country_sample1)
print('f-score:', f_score_anova)
print('p-value:', p_value_anova)

f-score: 9.686736916394063
p-value: 2.477507095034371e-06
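
The agreement between the hand-computed values and `stats.f_oneway` can be reproduced on any data. The sketch below wraps the manual steps in a hypothetical helper, `one_way_anova` (not part of this notebook or of SciPy), and compares it with SciPy on synthetic groups whose sizes and means are invented.

```python
import numpy as np
from scipy import stats

def one_way_anova(groups):
    """Return (F, p) for a list of 1-D arrays via the SST/SSE decomposition."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = np.concatenate(groups).mean()
    sst = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    sse = sum(((g - g.mean()) ** 2).sum() for g in groups)
    f_stat = (sst / (k - 1)) / (sse / (n - k))
    return f_stat, stats.f.sf(f_stat, k - 1, n - k)

rng = np.random.default_rng(1)
# Synthetic groups: (mean, size) pairs are made up for illustration.
groups = [rng.normal(m, 10, size=n)
          for m, n in [(50, 70), (45, 30), (52, 30), (51, 20)]]

manual_f, manual_p = one_way_anova(groups)
scipy_f, scipy_p = stats.f_oneway(*groups)

print(np.isclose(manual_f, scipy_f), np.isclose(manual_p, scipy_p))
```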


### Tukey's Multi-Comparison Method

#### Convert NumPy array to data frame

In [21]:
pop_df = pd.DataFrame(pop_sample1)
pop_df = pop_df.rename(columns={0: 'popularity'})
pop_df['genre'] = 'pop'
pop_df['id'] = 0

rock_df = pd.DataFrame(rock_sample1)
rock_df = rock_df.rename(columns={0: 'popularity'})
rock_df['genre'] = 'rock'
rock_df['id'] = 1

hiphop_df = pd.DataFrame(hiphop_sample1)
hiphop_df = hiphop_df.rename(columns={0: 'popularity'})
hiphop_df['genre'] = 'hiphop'
hiphop_df['id'] = 2

country_df = pd.DataFrame(country_sample1)
country_df = country_df.rename(columns={0: 'popularity'})
country_df['genre'] = 'country'
country_df['id'] = 3

combined_sample_df = pd.concat([pop_df, rock_df, hiphop_df, country_df], axis=0)
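
The four near-identical blocks above could also be written as a loop, which avoids copy-paste slips such as mixing integer and string ids. This is only one possible arrangement; the `samples` dictionary and its tiny arrays are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up arrays standing in for the four genre samples.
samples = {
    "pop": np.array([60, 55, 70]),
    "rock": np.array([40, 45]),
    "hiphop": np.array([58, 62]),
    "country": np.array([50, 52]),
}

# One long-format frame per genre: popularity values, genre label, integer id.
frames = [
    pd.DataFrame({"popularity": values, "genre": genre, "id": idx})
    for idx, (genre, values) in enumerate(samples.items())
]

combined = pd.concat(frames, ignore_index=True)
print(combined)
```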

#### Set up the data for comparison (creates a specialised object)

In [22]:
MultiComp = MultiComparison(combined_sample_df['popularity'],
                           combined_sample_df['genre'])

print(MultiComp.tukeyhsd().summary())

 Multiple Comparison of Means - Tukey HSD, FWER=0.05 
 group1 group2 meandiff p-adj   lower   upper  reject
-----------------------------------------------------
country hiphop   0.6967    0.9 -3.3006  4.6939  False
country    pop   1.4414 0.6925 -2.0694  4.9523  False
country   rock    -4.78 0.0115 -8.7773 -0.7827   True
 hiphop    pop   0.7448    0.9 -2.2769  3.7664  False
 hiphop   rock  -5.4767  0.001 -9.0519 -1.9014   True
    pop   rock  -6.2214  0.001 -9.2431 -3.1998   True
-----------------------------------------------------
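
When reading the table, note that statsmodels reports `meandiff` as the mean of `group2` minus the mean of `group1`, so a negative value in a row ending in `rock` means rock scored lower. The same comparison can be run in a single call with `pairwise_tukeyhsd`, which is imported at the top of this notebook. The sketch below uses synthetic data with invented means and sizes, so its numbers will not match the table above.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(2)
# Synthetic data: (mean, size) per genre are invented for illustration only.
spec = {"pop": (52, 70), "rock": (45, 30), "hiphop": (51, 30), "country": (50, 20)}

values = np.concatenate([rng.normal(m, 10, size=n) for m, n in spec.values()])
labels = np.concatenate([[g] * n for g, (m, n) in spec.items()])

# One call runs all pairwise comparisons at a family-wise error rate of 0.05.
result = pairwise_tukeyhsd(endog=values, groups=labels, alpha=0.05)
print(result.summary())
```

With four groups there are six pairs, so `result.reject` holds six booleans, one per row of the summary.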


### Welch's t-test

- T test Pop and Rock

In [23]:
stats.ttest_ind(pop_sample1, rock_sample1, equal_var=False)

Ttest_indResult(statistic=5.214768050226803, pvalue=2.6075424174950963e-07)

- T test Pop and Hip Hop

In [24]:
stats.ttest_ind(pop_sample1, hiphop_sample1, equal_var=False)

Ttest_indResult(statistic=0.6293714321041988, pvalue=0.5293639253531013)

- T test Pop and Country

In [25]:
stats.ttest_ind(pop_sample1, country_sample1, equal_var=False)

Ttest_indResult(statistic=1.090149052951046, pvalue=0.2764409106951442)

- T test Rock and Hip Hop 

In [26]:
stats.ttest_ind(rock_sample1, hiphop_sample1, equal_var=False)

Ttest_indResult(statistic=-3.869264708301612, pvalue=0.00012116192528670297)

- T test Rock and Country

In [27]:
stats.ttest_ind(rock_sample1, country_sample1, equal_var=False)

- T test Hip Hop and Country

In [28]:
stats.ttest_ind(hiphop_sample1, country_sample1, equal_var=False)

Ttest_indResult(statistic=0.45655965930289033, pvalue=0.6482120780306619)
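
Writing each pairwise test by hand makes it easy to repeat a pair by accident. One possible alternative is to loop over all pairs with `itertools.combinations`. The sketch below runs Welch's t-test on every pair of synthetic samples; the sample means and sizes are invented, and the `results` dictionary is a hypothetical structure chosen for this example.

```python
from itertools import combinations

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Synthetic stand-ins for the four genre samples.
samples = {
    "pop": rng.normal(52, 10, 70),
    "rock": rng.normal(45, 10, 30),
    "hiphop": rng.normal(51, 10, 30),
    "country": rng.normal(50, 10, 20),
}

results = {}
# combinations(..., 2) yields each unordered pair exactly once.
for (name_a, a), (name_b, b) in combinations(samples.items(), 2):
    t_stat, p_val = stats.ttest_ind(a, b, equal_var=False)  # Welch's t-test
    results[(name_a, name_b)] = (t_stat, p_val)

for pair, (t_stat, p_val) in results.items():
    print(pair, round(t_stat, 3), round(p_val, 4))
```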