### Machine Learning / Assumptions - Pair Programming

The goal of this pair programming exercise is for you to evaluate whether your dataset meets all the assumptions required for a linear regression. As a reminder, these assumptions are:
- Normality (you should already have evaluated this).
- Homogeneity of variances.
- Independence of the variables.

Each assumption should be tested both visually and analytically.

In [5]:
import numpy as np
import pandas as pd
import random 

import matplotlib.pyplot as plt
import seaborn as sns


import researchpy as rp
from scipy import stats
from scipy.stats import levene
from scipy.stats import kstest

plt.rcParams["figure.figsize"] = (10,8) 

In [2]:
df = pd.read_csv("data/adult.data_limpio.csv", index_col = 0)

- Normality assumption:

In [3]:
df.shape

(32560, 15)

In [4]:
df.head()

39,work_class,final_weight,education,education_yrs,marital_status,occupation,relationship,ethnicity,gender,capital_gain,capital_lost,hours_week,country,salary,census
50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,45719,Bajo
38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,9004,Bajo
53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,9920,Bajo
28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,36986,Bajo
37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,9246,Bajo


Our DataFrame has more than 5,000 rows (32,560 in total), so we will use the Kolmogorov-Smirnov test rather than Shapiro-Wilk, whose p-value may not be accurate for samples of that size.

In [6]:
kstest(df["salary"], 'norm')

KstestResult(statistic=0.9999385739519873, pvalue=0.0)

A p-value below 0.05 tells us to reject the null hypothesis, so we conclude that our data are not normally distributed.
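One detail worth keeping in mind: `kstest(x, 'norm')` compares the sample against a standard normal (mean 0, standard deviation 1), so a variable on a scale like `salary` will tend to give a statistic close to 1 whatever its shape. A minimal sketch of the same test on standardized values follows; the synthetic sample is an assumption used only for illustration and is not the notebook's data.

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(42)

# Synthetic stand-in for a salary-like column: normal in shape, but far from N(0, 1).
sample = rng.normal(loc=30_000, scale=8_000, size=5_000)

# Raw values against the standard normal: rejected because of location and scale alone.
raw_result = kstest(sample, "norm")

# Standardize first so that only the shape of the distribution is compared.
z_scores = (sample - sample.mean()) / sample.std(ddof=1)
standardized_result = kstest(z_scores, "norm")

print(f"raw:          statistic={raw_result.statistic:.4f}, p={raw_result.pvalue:.4f}")
print(f"standardized: statistic={standardized_result.statistic:.4f}, p={standardized_result.pvalue:.4f}")
```

Because the mean and standard deviation are estimated from the same sample, the p-value of the standardized test is only approximate; the Lilliefors variant of the test is designed for that situation.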

Since the data are not normal, we would need to apply a series of changes, which we will cover soon, before feeding them into a machine learning model.
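For the visual side of the normality check, a histogram and a Q-Q plot are common choices. The sketch below assumes synthetic right-skewed data in place of `df["salary"]`; the generated values and the variable names are illustrative only.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; this line can be dropped inside a notebook
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic right-skewed data, standing in for a non-normal column.
values = rng.exponential(scale=10_000, size=2_000)

fig, (ax_hist, ax_qq) = plt.subplots(1, 2, figsize=(10, 4))

# Left panel: histogram of the raw values.
ax_hist.hist(values, bins=40)
ax_hist.set_title("Histogram")

# Right panel: Q-Q plot. probplot returns the plotted points and a
# least-squares fit (slope, intercept, correlation coefficient r).
(osm, osr), (slope, intercept, r) = stats.probplot(values, dist="norm", plot=ax_qq)
ax_qq.set_title("Q-Q plot against a normal distribution")

fig.tight_layout()
```

Points that bend away from the straight line in the Q-Q plot indicate a departure from normality.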

- Homogeneity of variances.

We will use Levene's test because it is more robust and is recommended for data that are not normally distributed.

In [9]:
df.columns

Index(['work_class', 'final_weight', 'education', 'education_yrs',
       'marital_status', 'occupation', 'relationship', 'ethnicity', 'gender',
       'capital_gain', 'capital_lost', 'hours_week', 'country', 'salary',
       'census'],
      dtype='object')

In [10]:
resultados = {}

# salary is the response variable, so it is excluded from the columns to test
numericas_col = df.select_dtypes(include = np.number).drop("salary", axis = 1).columns

for col in numericas_col:
    # Levene's test centred on the median, comparing each numeric column with salary
    statistic, p_val = levene(df[col], df["salary"], center='median')
    resultados[col] = p_val

In [16]:
resultados

{'final_weight': 0.0,
 'education_yrs': 0.0,
 'capital_gain': 0.0,
 'capital_lost': 0.0,
 'hours_week': 0.0}

As we can see, all the p-values are below 0.05, so the assumption is not met: the variances are not homogeneous, that is, the data are heteroscedastic.

We may be getting these results because the test should be run on the residuals, a concept we have not covered yet.
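Another possible factor is the layout of the test: Levene's test is usually applied to a single numeric variable split into groups, rather than to two different columns measured on different scales. A hedged sketch of that layout follows, using a small synthetic frame whose column names (`group`, `value`) are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.stats import levene

rng = np.random.default_rng(1)

# Synthetic data: three groups with the same mean but different spreads.
toy = pd.DataFrame({
    "group": np.repeat(["a", "b", "c"], 500),
    "value": np.concatenate([
        rng.normal(0, 1, 500),
        rng.normal(0, 1, 500),
        rng.normal(0, 4, 500),
    ]),
})

# One array of values per group, passed to levene as separate samples.
samples = [g["value"].to_numpy() for _, g in toy.groupby("group")]
statistic, p_value = levene(*samples, center="median")

print(f"statistic={statistic:.2f}, p-value={p_value:.4g}")
```

With the notebook's data, the same pattern could be tried with `salary` as the numeric variable and a categorical column such as `gender` as the grouping variable.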

- Independence of the variables.

In [17]:
crosstab, test_results, expected = rp.crosstab(df["salary"], df["hours_week"],
                                               test= "chi-square",
                                               expected_freqs= True,
                                               prop= "cell")

In [18]:
crosstab.head()

salary / hours_week,1,2,3,4,5,6,7,8,9,10,...,90,91,92,94,95,96,97,98,99,All
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
12,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
16,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01


# Warning: review this markdown cell #
Given the previous results, it is no surprise that we keep getting 0. These two variables may not be related, because we generated the `salary` values ourselves with `random`.

In [42]:
crosstab, test_results, expected = rp.crosstab(df["education_yrs"], df["hours_week"],
                                               test= "chi-square",
                                               expected_freqs= True,
                                               prop= "cell")

In [43]:
crosstab.head()

education_yrs / hours_week,1,2,3,4,5,6,7,8,9,10,...,90,91,92,94,95,96,97,98,99,All
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.16
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.52
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.01,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.02
4,0.0,0.01,0.01,0.01,0.01,0.0,0.0,0.01,0.0,0.06,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,1.98
5,0.0,0.0,0.0,0.0,0.01,0.01,0.0,0.0,0.01,0.01,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.58


However, some of these results do seem to show some dependence. This makes sense, since it is reasonable for the number of hours you work to be related to the number of years you have studied.
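The cell proportions alone do not say whether two variables are independent; the chi-square statistic, its p-value and an effect size such as Cramér's V give a more direct answer. The sketch below uses `scipy.stats.chi2_contingency` on a `pd.crosstab` table instead of `researchpy`, and the categories and probabilities are synthetic assumptions chosen so that the two variables are dependent by construction.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

rng = np.random.default_rng(7)
n = 2_000

# Synthetic, hypothetical categories: the chance of working full time
# increases with the education level.
education = rng.choice(["low", "mid", "high"], size=n)
p_full_time = pd.Series(education).map({"low": 0.4, "mid": 0.6, "high": 0.8}).to_numpy()
hours = np.where(rng.random(n) < p_full_time, "full_time", "part_time")

# Contingency table of observed counts (3 rows x 2 columns).
table = pd.crosstab(education, hours, rownames=["education"], colnames=["hours"])

chi2, p_value, dof, expected = chi2_contingency(table)

# Cramér's V: effect size in [0, 1] derived from the chi-square statistic.
k = min(table.shape) - 1
cramers_v = np.sqrt(chi2 / (n * k))

print(table)
print(f"chi2={chi2:.2f}, dof={dof}, p-value={p_value:.4g}, Cramér's V={cramers_v:.3f}")
```

When both variables have many distinct values, as with `salary` and `hours_week`, most expected counts are very small and the chi-square approximation becomes unreliable; grouping the values into a few bins before building the table is one way to avoid that.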