# Predicting a parameter from other variables

We try to predict the Tonality of a sound from other variables, defining dummy (dichotomous) variables for the categorical variables that require them.

In [1]:
import pandas as pd
import numpy as np

from sklearn import linear_model
import statsmodels.api as sm

np.set_printoptions(precision=2)

In [2]:
data = pd.read_csv('snd-dataset-from-plain-json.csv')
data.head()

Unnamed: 0,Duration,Loudness,LogAttackTime,Tempo,Tempo.confidence,TemporalCentroid,SingleEvent,Loop,Tonality,Tonality.confidence,DynamicRange,Note.midi,Note.frequency,Note.confidence,Genre,Mood
0,24.218412,-16.581459,0.769376,95,0.133154,0.498596,False,False,G major,0.524679,9.689243,55,197.9729,0.0,Genre B,Mood B
1,243.983673,-16.891335,1.618665,65,0.545527,0.479576,False,False,G major,0.785114,5.247044,40,85.456451,0.0,Genre A,Mood A
2,15.281632,-21.658251,0.582658,63,0.996905,0.492315,True,True,C minor,0.698095,1.060242,50,151.972198,0.352345,Genre B,Mood B
3,2.0,-10.525232,-1.590209,119,0.0,0.468918,False,False,G# minor,0.64668,0.0,41,91.402817,0.0,Genre A,Mood A
4,1.45415,-28.335722,-0.492548,152,0.0,0.502481,True,False,F# minor,0.408481,0.0,107,3984.657227,0.695633,Genre A,Mood A


In [3]:
data.shape[0]

1017

In [4]:
# Drop the columns that hold the confidence of the feature estimates
# Tip: axis=0 refers to rows and axis=1 to columns
data = data.drop(["Tempo.confidence", "Tonality.confidence", "Note.confidence"], axis=1)

### Correlation between variables

In [5]:
data.corr(method='pearson', min_periods=1) # pearson -> standard method

Unnamed: 0,Duration,Loudness,LogAttackTime,Tempo,TemporalCentroid,SingleEvent,Loop,DynamicRange,Note.midi,Note.frequency
Duration,1.0,0.081523,0.501519,0.104969,0.258024,-0.232345,-0.233539,0.31864,-0.154564,-0.133548
Loudness,0.081523,1.0,0.060905,0.03522,0.046486,-0.465222,-0.070003,-0.118577,-0.105028,-0.122475
LogAttackTime,0.501519,0.060905,1.0,0.071875,0.340716,-0.238297,-0.234266,0.359254,-0.054808,-0.100731
Tempo,0.104969,0.03522,0.071875,1.0,0.07214,-0.021407,-0.159896,0.046447,0.012247,0.034907
TemporalCentroid,0.258024,0.046486,0.340716,0.07214,1.0,-0.18725,-0.136633,0.038687,-0.007034,-0.021946
SingleEvent,-0.232345,-0.465222,-0.238297,-0.021407,-0.18725,1.0,0.163248,-0.118497,0.12889,0.155892
Loop,-0.233539,-0.070003,-0.234266,-0.159896,-0.136633,0.163248,1.0,-0.088077,0.047451,0.078194
DynamicRange,0.31864,-0.118577,0.359254,0.046447,0.038687,-0.118497,-0.088077,1.0,0.093423,0.002418
Note.midi,-0.154564,-0.105028,-0.054808,0.012247,-0.007034,0.12889,0.047451,0.093423,1.0,0.801161
Note.frequency,-0.133548,-0.122475,-0.100731,0.034907,-0.021946,0.155892,0.078194,0.002418,0.801161,1.0


**Observation: Correlation between variables is generally low, except for Duration with LogAttackTime (0.50) and, to a lesser extent, with DynamicRange (0.32) and TemporalCentroid (0.26).**
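Reading a full correlation matrix by eye is error-prone, so the strongest pairs can be extracted programmatically. The following is a minimal sketch on toy data; the helper name `top_correlations`, the threshold and the toy columns are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def top_correlations(df, threshold=0.3):
    """Return variable pairs whose absolute Pearson correlation exceeds threshold."""
    # Keep numeric and boolean columns only; text columns cannot be correlated.
    numeric = df.select_dtypes(include=['number', 'bool']).astype(float)
    corr = numeric.corr(method='pearson')
    # Keep the strict upper triangle so each pair appears once and the diagonal is dropped.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    pairs = pairs[pairs.abs() > threshold]
    # Strongest relationships first, regardless of sign.
    return pairs.reindex(pairs.abs().sort_values(ascending=False).index)

# Hypothetical toy data, not the notebook's dataset.
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 5.0],
    'b': [2.0, 4.1, 5.9, 8.2, 9.9],
    'c': [5.0, 1.0, 4.0, 2.0, 3.0],
})
result = top_correlations(toy, threshold=0.5)
print(result)
```

Applied to `data`, the same helper would list pairs such as Duration and LogAttackTime without scanning the table manually.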

## Multiple regression (several variables)

In [6]:
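# Reference: https://stackoverflow.com/questions/11479064/multiple-linear-regression-in-python
def reg_multiple(y, x):
    # x is a sequence of 1-D arrays, one per explanatory variable.
    # Start with the first variable plus an explicit column of ones (the intercept).
    ones = np.ones(len(x[0]))
    X = sm.add_constant(np.column_stack((x[0], ones)))
    # Each later variable is prepended, so in the summary x1 is the LAST
    # variable passed in and the first one sits just before const.
    for ele in x[1:]:
        X = sm.add_constant(np.column_stack((ele, X)))
    results = sm.OLS(y, X).fit()
    return results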
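Because `reg_multiple` stacks plain arrays, the summary labels its coefficients only as `x1`, `x2`, ... and the column order is easy to lose track of. As a hedged illustration of what the fit computes, the sketch below solves the same least-squares problem with NumPy alone and keeps a name for each coefficient. The function `fit_ols` and the synthetic data are assumptions for illustration; this is not the statsmodels implementation.

```python
import numpy as np

def fit_ols(y, columns):
    """Least-squares fit with an intercept.

    columns maps a variable name to a 1-D array; returns (coefficients by name, R-squared).
    """
    names = list(columns) + ['const']
    # Design matrix: one column per variable, plus a column of ones for the intercept.
    X = np.column_stack([columns[name] for name in columns] + [np.ones(len(y))])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return dict(zip(names, beta)), 1.0 - ss_res / ss_tot

# Synthetic, noise-free data (an assumption for illustration, not the notebook's dataset).
rng = np.random.default_rng(0)
a = rng.normal(size=50)
b = rng.normal(size=50)
target = 2.0 * a - 3.0 * b + 1.0

coefs, r2 = fit_ols(target, {'a': a, 'b': b})
print(coefs, r2)
```

With noise-free data the recovered coefficients match the ones used to build `target` and R-squared is essentially 1; on real data R-squared measures the share of variance in `y` explained by the fit.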

In [7]:
Duration = np.asarray( data.loc[:, 'Duration' ] )
DynamicRange = np.asarray( data.loc[:, 'DynamicRange' ] )
TemporalCentroid = np.asarray( data.loc[:, 'TemporalCentroid' ] )
LogAttackTime = np.asarray( data.loc[:, 'LogAttackTime' ] )
#Tempo = np.asarray( data.loc[:, 'Tempo' ] )


In [8]:
y = LogAttackTime
X = np.array( [Duration, DynamicRange])

reg_multiple(y, X).summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.296
Model:,OLS,Adj. R-squared:,0.294
Method:,Least Squares,F-statistic:,213.0
Date:,"Fri, 16 Nov 2018",Prob (F-statistic):,6.06e-78
Time:,04:11:58,Log-Likelihood:,-1329.5
No. Observations:,1017,AIC:,2665.0
Df Residuals:,1014,BIC:,2680.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
x1,0.0442,0.006,7.985,0.000,0.033,0.055
x2,0.0026,0.000,15.495,0.000,0.002,0.003
const,0.2564,0.047,5.403,0.000,0.163,0.349

0,1,2,3
Omnibus:,158.744,Durbin-Watson:,2.052
Prob(Omnibus):,0.0,Jarque-Bera (JB):,235.732
Skew:,-1.112,Prob(JB):,6.48e-52
Kurtosis:,3.788,Cond. No.,407.0


**Observation: R-squared does not reach 0.3, whereas a value of at least 0.5 was expected.**

## Adding the TemporalCentroid feature

In [9]:
y = LogAttackTime
X = np.array( [Duration, DynamicRange, TemporalCentroid])

reg_multiple(y, X).summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.348
Model:,OLS,Adj. R-squared:,0.346
Method:,Least Squares,F-statistic:,180.4
Date:,"Fri, 16 Nov 2018",Prob (F-statistic):,1.04e-93
Time:,04:11:59,Log-Likelihood:,-1290.2
No. Observations:,1017,AIC:,2588.0
Df Residuals:,1013,BIC:,2608.0
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
x1,3.6500,0.404,9.027,0.000,2.857,4.443
x2,0.0464,0.005,8.715,0.000,0.036,0.057
x3,0.0022,0.000,13.205,0.000,0.002,0.003
const,-1.5264,0.203,-7.530,0.000,-1.924,-1.129

0,1,2,3
Omnibus:,163.368,Durbin-Watson:,2.039
Prob(Omnibus):,0.0,Jarque-Bera (JB):,247.697
Skew:,-1.111,Prob(JB):,1.63e-54
Kurtosis:,3.951,Cond. No.,4000.0


**Observation: R-squared improves (0.348 versus 0.296) but stays below 0.5. A 95% confidence level corresponds to p < 0.05, and every coefficient in the table above meets that threshold.**
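As a rough check of coefficient significance, a two-sided p-value can be approximated from each t statistic in the summary above. The sketch below assumes a standard normal approximation, which is reasonable here only because the residual degrees of freedom are large (1013); the helper name `two_sided_p_value` is hypothetical, and statsmodels itself uses the exact t distribution.

```python
import math

def two_sided_p_value(t_stat):
    """Two-sided p-value under a standard normal approximation of the t distribution."""
    return math.erfc(abs(t_stat) / math.sqrt(2.0))

alpha = 0.05  # significance level that corresponds to 95% confidence

# t statistics copied from the summary table above.
t_values = [('x1', 9.027), ('x2', 8.715), ('x3', 13.205), ('const', -7.530)]
significant = {name: two_sided_p_value(t) < alpha for name, t in t_values}
print(significant)
```

A coefficient is considered significant at the 95% level when its p-value falls below `alpha`, not above it.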

## Categorical variables and conversion to dummy variables


**Assigning a number to each category does not solve the problem, because it imposes an arbitrary order and spacing on the categories. The solution is to create one dummy variable per category, leaving one category out as the reference: for k possible values, k-1 dummy variables are used.**
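The k-1 encoding described above can also be produced with `pandas.get_dummies`. This is a minimal sketch on toy data; the shortened key list, the toy values and the `D` prefix are assumptions for illustration, and the notebook itself builds its dummy columns by hand.

```python
import pandas as pd

# Shortened, hypothetical key list; the notebook uses all 24 keys.
keys = ['A minor', 'A major', 'C major', 'G major']
toy = pd.DataFrame({'Tonality': ['A minor', 'C major', 'G major', 'C major']})

# Declaring the categories explicitly keeps a column even for keys absent from the data.
tonality = pd.Series(pd.Categorical(toy['Tonality'], categories=keys))

# drop_first=True removes the first category ('A minor'), which becomes the reference.
dummies = pd.get_dummies(tonality, prefix='D', drop_first=True, dtype=int)
print(dummies)
```

A row that belongs to the reference category has zeros in every dummy column, which is how k categories fit into k-1 variables.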

### Key or Tonality

The possible tonality categories are written in English letter notation, where 'A' is La, 'B' is Si, and so on. They are listed in the array 'key_to_number_list' and then mapped to 23 dummy variables, since there are 24 possible values.

### Genre, Loop and Mood

Genre, Loop (whether the sound can be looped) and Mood are each mapped to a single dummy variable, since each has only two possible values.

In [10]:
key_to_number_list = ['A minor', 'A major', 'A# minor', 'A# major', 'B minor', 'B major', 'C minor', 'C major', 'C# minor', 'C# major', 'D minor', 'D major', 'D# minor', 'D# major','E minor', 'E major', 'F minor', 'F major', 'F# minor', 'F# major', 'G minor', 'G major','G# minor', 'G# major']

def keyToNumber(x_value):
    # Position of the key in key_to_number_list (0 = 'A minor', ..., 23 = 'G# major')
    return key_to_number_list.index(x_value)

def keyToNumberD(x_value):
    # 1 if the key is 'A major' (index 1), 0 otherwise
    return 1 if keyToNumber(x_value) == 1 else 0

In [11]:
len(key_to_number_list)

24

In [12]:
# Map the tonalities from text to numeric categories
# k-1 dummy variables are needed, with k=len(key_to_number_list)

# Genre
data['G1'] = data['Genre'].map(lambda x: 1 if x=='Genre A' else 0)

# Loop and Mood
data['L1'] = data['Loop'].map(lambda x: 1 if x==True else 0)
data['M1'] = data['Mood'].map(lambda x: 1 if x=='Mood A' else 0)

# Tonality / Key: D1..D23, with 'A minor' (index 0) as the reference category
key_index = data['Tonality'].map(keyToNumber)
for k in range(1, len(key_to_number_list)):
    data['D%d' % k] = (key_index == k).astype(int)

data.head()

Unnamed: 0,Duration,Loudness,LogAttackTime,Tempo,TemporalCentroid,SingleEvent,Loop,Tonality,DynamicRange,Note.midi,...,D14,D15,D16,D17,D18,D19,D20,D21,D22,D23
0,24.218412,-16.581459,0.769376,95,0.498596,False,False,G major,9.689243,55,...,0,0,0,0,0,0,0,1,0,0
1,243.983673,-16.891335,1.618665,65,0.479576,False,False,G major,5.247044,40,...,0,0,0,0,0,0,0,1,0,0
2,15.281632,-21.658251,0.582658,63,0.492315,True,True,C minor,1.060242,50,...,0,0,0,0,0,0,0,0,0,0
3,2.0,-10.525232,-1.590209,119,0.468918,False,False,G# minor,0.0,41,...,0,0,0,0,0,0,0,0,1,0
4,1.45415,-28.335722,-0.492548,152,0.502481,True,False,F# minor,0.0,107,...,0,0,0,0,1,0,0,0,0,0


In [13]:
G1 = np.asarray( data.loc[:, 'G1' ] )
L1 = np.asarray( data.loc[:, 'L1' ] )
M1 = np.asarray( data.loc[:, 'M1' ] )

D1 = np.asarray( data.loc[:, 'D1' ] )
D2 = np.asarray( data.loc[:, 'D2' ] )
D3 = np.asarray( data.loc[:, 'D3' ] )
D4 = np.asarray( data.loc[:, 'D4' ] )
D5 = np.asarray( data.loc[:, 'D5' ] )
D6 = np.asarray( data.loc[:, 'D6' ] )

D7 = np.asarray( data.loc[:, 'D7' ] )
D8 = np.asarray( data.loc[:, 'D8' ] )
D9 = np.asarray( data.loc[:, 'D9' ] )
D10 = np.asarray( data.loc[:, 'D10' ] )
D11 = np.asarray( data.loc[:, 'D11' ] )
D12 = np.asarray( data.loc[:, 'D12' ] )
D13 = np.asarray( data.loc[:, 'D13' ] )
D14 = np.asarray( data.loc[:, 'D14' ] )
D15 = np.asarray( data.loc[:, 'D15' ] )
D16 = np.asarray( data.loc[:, 'D16' ] )
D17 = np.asarray( data.loc[:, 'D17' ] )

D18 = np.asarray( data.loc[:, 'D18' ] )
D19 = np.asarray( data.loc[:, 'D19' ] )
D20 = np.asarray( data.loc[:, 'D20' ] )
D21 = np.asarray( data.loc[:, 'D21' ] )
D22 = np.asarray( data.loc[:, 'D22' ] )

D23 = np.asarray( data.loc[:, 'D23' ] )

In [14]:
# multiple regression with dummy variables

y = LogAttackTime
X = np.array( [Duration, DynamicRange, TemporalCentroid, G1, L1, M1, D1, D2, D3, D4, D5, D6,D7,D8,D9,D10,D11,D12,D13,D14,D15,D16,D17,D18,D19,D20,D21,D22,D23])

reg_multiple(y, X).summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.404
Model:,OLS,Adj. R-squared:,0.388
Method:,Least Squares,F-statistic:,23.96
Date:,"Fri, 16 Nov 2018",Prob (F-statistic):,1.10e-91
Time:,04:12:03,Log-Likelihood:,-1244.4
No. Observations:,1017,AIC:,2547.0
Df Residuals:,988,BIC:,2690.0
Df Model:,28,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
x1,-0.4543,0.183,-2.482,0.013,-0.813,-0.095
x2,-0.4212,0.213,-1.974,0.049,-0.840,-0.002
x3,0.1762,0.121,1.454,0.146,-0.062,0.414
x4,0.0006,0.168,0.003,0.997,-0.329,0.330
x5,0.2102,0.250,0.840,0.401,-0.281,0.701
x6,0.1222,0.167,0.730,0.465,-0.206,0.451
x7,-0.2688,0.176,-1.526,0.127,-0.614,0.077
x8,-0.1213,0.173,-0.703,0.482,-0.460,0.217
x9,0.0060,0.156,0.038,0.969,-0.299,0.311

0,1,2,3
Omnibus:,145.433,Durbin-Watson:,2.066
Prob(Omnibus):,0.0,Jarque-Bera (JB):,213.543
Skew:,-1.012,Prob(JB):,4.26e-47
Kurtosis:,3.97,Cond. No.,1e+16


**Observation: R-squared (the coefficient of determination) improves, to 0.404.**

## Another strategy: cleaning the dataset a bit more

In [15]:
data.shape[0]

1017

### Filter out the 'single events', since they are assumed not to be songs

That is, one-off events, which will not characterize tonality, genre, etc. well.


In [19]:
data = data[~data['SingleEvent'].isin([True])] 
data = data.drop("SingleEvent", axis=1);

data.shape[0]

871

### Keep only sounds longer than 120 seconds (songs)

Criterion: keep sounds longer than 2 minutes and shorter than 5, to restrict the dataset to more conventional songs (less experimental, neither very short nor very long).

In [20]:
data = data[ data['Duration'] > 60*2 ]
data = data[ data['Duration'] < 60*5 ]
data.shape[0]

399

A total of 399 instances remain.

## Correlation

In [21]:
data.corr(method='pearson', min_periods=1) # pearson -> standard method

Unnamed: 0,Duration,Loudness,LogAttackTime,Tempo,Tempo.confidence,TemporalCentroid,Loop,Tonality.confidence,DynamicRange,Note.midi,Note.frequency,Note.confidence
Duration,1.0,0.138401,0.078643,0.073934,0.121327,-0.052267,,0.214241,-0.137162,-0.136723,-0.092955,
Loudness,0.138401,1.0,-0.035528,0.000409,0.185173,-0.084362,,0.246491,-0.385033,-0.140231,-0.211257,
LogAttackTime,0.078643,-0.035528,1.0,0.025246,-0.076045,0.196094,,0.110509,0.184478,-0.134777,-0.128704,
Tempo,0.073934,0.000409,0.025246,1.0,-0.151628,0.034338,,0.027799,0.080032,0.064543,-0.013254,
Tempo.confidence,0.121327,0.185173,-0.076045,-0.151628,1.0,-0.142393,,0.321075,-0.432871,-0.188001,-0.035153,
TemporalCentroid,-0.052267,-0.084362,0.196094,0.034338,-0.142393,1.0,,-0.055116,0.101423,-2e-05,-0.033961,
Loop,,,,,,,,,,,,
Tonality.confidence,0.214241,0.246491,0.110509,0.027799,0.321075,-0.055116,,1.0,-0.152033,-0.170471,-0.113172,
DynamicRange,-0.137162,-0.385033,0.184478,0.080032,-0.432871,0.101423,,-0.152033,1.0,0.195923,0.091224,
Note.midi,-0.136723,-0.140231,-0.134777,0.064543,-0.188001,-2e-05,,-0.170471,0.195923,1.0,0.819797,


**Observation: The correlations decrease; for example, Duration versus LogAttackTime falls from 0.50 to 0.08.**

## Another strategy: analyzing the single events

In [56]:
data = pd.read_csv('snd-dataset-from-plain-json.csv')
data = data[data['SingleEvent'].isin([True])] 

data.shape[0]

146

In [57]:
# short events
data = data[ data['Duration'] < 5 ]
data.shape[0]

29

In [58]:
data.corr(method='pearson', min_periods=1) # pearson -> standard method

Unnamed: 0,Duration,Loudness,LogAttackTime,Tempo,Tempo.confidence,TemporalCentroid,SingleEvent,Loop,Tonality.confidence,DynamicRange,Note.midi,Note.frequency,Note.confidence
Duration,1.0,0.409527,0.458463,0.413927,0.10987,-0.003603,,0.111727,0.267255,0.314945,-0.140897,-0.048192,0.070128
Loudness,0.409527,1.0,0.093151,0.011048,0.254183,-0.259784,,0.254445,0.122818,0.344705,-0.113036,-0.263562,0.137129
LogAttackTime,0.458463,0.093151,1.0,0.299679,-0.105107,0.6469,,-0.106338,0.292855,-0.274957,0.360239,0.351031,0.46561
Tempo,0.413927,0.011048,0.299679,1.0,-0.049662,0.220953,,-0.049214,0.106185,0.006667,0.195123,0.199756,0.195006
Tempo.confidence,0.10987,0.254183,-0.105107,-0.049662,1.0,-0.096474,,0.99991,-0.144554,0.173904,-0.070938,-0.152904,-0.010962
TemporalCentroid,-0.003603,-0.259784,0.6469,0.220953,-0.096474,1.0,,-0.102161,0.340015,-0.656999,0.343358,0.257487,0.368906
SingleEvent,,,,,,,,,,,,,
Loop,0.111727,0.254445,-0.106338,-0.049214,0.99991,-0.102161,,1.0,-0.145321,0.180476,-0.070443,-0.152925,-0.012989
Tonality.confidence,0.267255,0.122818,0.292855,0.106185,-0.144554,0.340015,,-0.145321,1.0,-0.285782,-0.502435,-0.548159,0.184913
DynamicRange,0.314945,0.344705,-0.274957,0.006667,0.173904,-0.656999,,0.180476,-0.285782,1.0,-0.121501,-0.129814,-0.149882


**Observation:** LogAttackTime is fairly strongly linearly correlated with Duration (0.46) and TemporalCentroid (0.65), so an acceptable regression is expected.

In [59]:
Duration = np.asarray( data.loc[:, 'Duration' ] )
DynamicRange = np.asarray( data.loc[:, 'DynamicRange' ] )
TemporalCentroid = np.asarray( data.loc[:, 'TemporalCentroid' ] )
LogAttackTime = np.asarray( data.loc[:, 'LogAttackTime' ] )
#Tempo = np.asarray( data.loc[:, 'Tempo' ] )

y = LogAttackTime
X = np.array( [Duration, TemporalCentroid])

reg_multiple(y, X).summary()

0,1,2,3
Dep. Variable:,y,R-squared:,0.631
Model:,OLS,Adj. R-squared:,0.602
Method:,Least Squares,F-statistic:,22.21
Date:,"Fri, 16 Nov 2018",Prob (F-statistic):,2.37e-06
Time:,04:22:17,Log-Likelihood:,-19.97
No. Observations:,29,AIC:,45.94
Df Residuals:,26,BIC:,50.04
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
x1,3.6067,0.663,5.443,0.000,2.245,4.969
x2,0.2551,0.066,3.867,0.001,0.119,0.391
const,-2.9599,0.336,-8.805,0.000,-3.651,-2.269

0,1,2,3
Omnibus:,3.6,Durbin-Watson:,1.777
Prob(Omnibus):,0.165,Jarque-Bera (JB):,3.174
Skew:,-0.786,Prob(JB):,0.205
Kurtosis:,2.604,Cond. No.,22.2


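**An acceptable R-squared is obtained (0.631, adjusted 0.602), and the coefficient p-values (at most 0.001) are below 0.05, although the fit rests on only 29 observations.**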

## Keeping only the features computed with high confidence

In [118]:
data = pd.read_csv('snd-dataset-from-plain-json.csv')

#data = data[ data['Tempo.confidence'] > 0.5 ]
data = data[ data['Tonality.confidence'] > 0.8 ]
data = data[ data['Note.confidence'] > 0.5 ]

data.shape[0]

32

In [119]:
data.corr(method='pearson', min_periods=1) # pearson -> standard method

Unnamed: 0,Duration,Loudness,LogAttackTime,Tempo,Tempo.confidence,TemporalCentroid,SingleEvent,Loop,Tonality.confidence,DynamicRange,Note.midi,Note.frequency,Note.confidence
Duration,1.0,0.304569,0.642006,0.242524,0.034554,0.089504,,-0.082728,0.442781,0.356598,0.067216,-0.023755,-0.234207
Loudness,0.304569,1.0,-0.07918,0.032689,-0.122179,0.044776,,-0.168705,0.144841,-0.310329,0.152639,0.116591,-0.320799
LogAttackTime,0.642006,-0.07918,1.0,0.067663,-0.020464,0.264089,,0.165728,0.243919,0.513857,-0.023101,-0.14245,-0.225773
Tempo,0.242524,0.032689,0.067663,1.0,-0.510652,0.338942,,-0.309448,0.231979,-0.068813,-0.075381,-0.055689,-0.125349
Tempo.confidence,0.034554,-0.122179,-0.020464,-0.510652,1.0,-0.281462,,0.77439,-0.058799,0.186137,-0.069464,-0.061536,0.280368
TemporalCentroid,0.089504,0.044776,0.264089,0.338942,-0.281462,1.0,,-0.069671,0.098502,-0.264802,0.258988,0.158615,0.018859
SingleEvent,,,,,,,,,,,,,
Loop,-0.082728,-0.168705,0.165728,-0.309448,0.77439,-0.069671,,1.0,-0.205802,0.084184,-0.125787,-0.14728,0.264528
Tonality.confidence,0.442781,0.144841,0.243919,0.231979,-0.058799,0.098502,,-0.205802,1.0,0.163412,0.062781,0.090452,0.062282
DynamicRange,0.356598,-0.310329,0.513857,-0.068813,0.186137,-0.264802,,0.084184,0.163412,1.0,-0.003761,-0.004028,-0.211804


In [120]:
# Map the tonalities from text to numeric categories
# k-1 dummy variables are needed, with k=len(key_to_number_list)

# Genre
data['G1'] = data['Genre'].map(lambda x: 1 if x=='Genre A' else 0)

# Loop and Mood
data['L1'] = data['Loop'].map(lambda x: 1 if x==True else 0)
data['M1'] = data['Mood'].map(lambda x: 1 if x=='Mood A' else 0)

# Tonality / Key: D1..D23, with 'A minor' (index 0) as the reference category
key_index = data['Tonality'].map(keyToNumber)
for k in range(1, len(key_to_number_list)):
    data['D%d' % k] = (key_index == k).astype(int)

#data.head()

G1 = np.asarray( data.loc[:, 'G1' ] )
L1 = np.asarray( data.loc[:, 'L1' ] )
M1 = np.asarray( data.loc[:, 'M1' ] )

D1 = np.asarray( data.loc[:, 'D1' ] )
D2 = np.asarray( data.loc[:, 'D2' ] )
D3 = np.asarray( data.loc[:, 'D3' ] )
D4 = np.asarray( data.loc[:, 'D4' ] )
D5 = np.asarray( data.loc[:, 'D5' ] )
D6 = np.asarray( data.loc[:, 'D6' ] )

D7 = np.asarray( data.loc[:, 'D7' ] )
D8 = np.asarray( data.loc[:, 'D8' ] )
D9 = np.asarray( data.loc[:, 'D9' ] )
D10 = np.asarray( data.loc[:, 'D10' ] )
D11 = np.asarray( data.loc[:, 'D11' ] )
D12 = np.asarray( data.loc[:, 'D12' ] )
D13 = np.asarray( data.loc[:, 'D13' ] )
D14 = np.asarray( data.loc[:, 'D14' ] )
D15 = np.asarray( data.loc[:, 'D15' ] )
D16 = np.asarray( data.loc[:, 'D16' ] )
D17 = np.asarray( data.loc[:, 'D17' ] )

D18 = np.asarray( data.loc[:, 'D18' ] )
D19 = np.asarray( data.loc[:, 'D19' ] )
D20 = np.asarray( data.loc[:, 'D20' ] )
D21 = np.asarray( data.loc[:, 'D21' ] )
D22 = np.asarray( data.loc[:, 'D22' ] )

D23 = np.asarray( data.loc[:, 'D23' ] )

In [121]:
Duration = np.asarray( data.loc[:, 'Duration' ] )
DynamicRange = np.asarray( data.loc[:, 'DynamicRange' ] )
LogAttackTime = np.asarray( data.loc[:, 'LogAttackTime' ] )
Loudness = np.asarray( data.loc[:, 'Loudness' ] )
Tempo = np.asarray( data.loc[:, 'Tempo' ] )
TemporalCentroid = np.asarray( data.loc[:, 'TemporalCentroid' ] )

y = LogAttackTime
#X = np.array( [Loudness, TemporalCentroid, Duration] )
X = np.array( [DynamicRange, TemporalCentroid, Duration, G1, L1, M1, D1, D2, D3, D4, D5, D6,D7,D8,D9,D10,D11,D12,D13,D14,D15,D16,D17,D18,D19,D20,D21,D22,D23])

reg_multiple(y, X).summary()





0,1,2,3
Dep. Variable:,y,R-squared:,0.844
Model:,OLS,Adj. R-squared:,0.678
Method:,Least Squares,F-statistic:,5.085
Date:,"Fri, 16 Nov 2018",Prob (F-statistic):,0.00148
Time:,04:36:08,Log-Likelihood:,-12.247
No. Observations:,32,AIC:,58.49
Df Residuals:,15,BIC:,83.41
Df Model:,16,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-1.213e-14,3.42e-14,-0.355,0.728,-8.51e-14,6.08e-14
x1,1.183e-13,4.23e-14,2.795,0.014,2.81e-14,2.09e-13
x2,-1.3670,0.595,-2.297,0.036,-2.636,-0.098
x3,0.1394,0.668,0.209,0.837,-1.285,1.563
x4,1.884e-15,1.38e-15,1.370,0.191,-1.05e-15,4.82e-15
x5,3.406e-15,2.04e-15,1.667,0.116,-9.5e-16,7.76e-15
x6,-2.0880,0.605,-3.454,0.004,-3.376,-0.799
x7,-1.2971,0.717,-1.810,0.090,-2.824,0.230
x8,-0.7046,0.547,-1.288,0.217,-1.871,0.462

0,1,2,3
Omnibus:,4.464,Durbin-Watson:,1.733
Prob(Omnibus):,0.107,Jarque-Bera (JB):,2.959
Skew:,-0.54,Prob(JB):,0.228
Kurtosis:,4.025,Cond. No.,1.23e+16


**Observation: Restricting the dataset to the instances whose features were computed with high confidence gave the best results: an adjusted R-squared of 0.678, with a Prob (F-statistic) of 0.00148 for the overall fit.**
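The gap between R-squared and adjusted R-squared in the summaries can be reproduced from the reported figures with the usual adjustment, which penalizes a model for using many regressors on few observations. The helper name `adjusted_r_squared` is hypothetical; the inputs are copied from the summary tables in this notebook.

```python
def adjusted_r_squared(r_squared, n_obs, df_resid):
    """Adjusted R-squared from R-squared, number of observations and residual degrees of freedom."""
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / df_resid

# High-confidence subset: R-squared 0.844, 32 observations, 15 residual degrees of freedom.
small_sample = round(adjusted_r_squared(0.844, 32, 15), 3)
# Full dataset with three features: R-squared 0.348, 1017 observations, 1013 residual degrees of freedom.
full_sample = round(adjusted_r_squared(0.348, 1017, 1013), 3)
print(small_sample, full_sample)  # 0.678 0.346
```

With 1017 observations the adjustment is negligible, whereas with 32 observations and only 15 residual degrees of freedom it lowers the value from 0.844 to 0.678.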


---

**Next: [3 - Dimensionality reduction, SVD and PCA](3%20-%20Reducción%20de%20la%20dimensionalidad%20SVD%20y%20PCA.ipynb)**