# Introduction

## Motivation

https://archive-beta.ics.uci.edu/ml/datasets/adult

The data set is used to predict whether a person in the United States earns more or less than $50,000 a year. Beyond the academic learning, one benefit of working with this data set is the chance to detect what drives inequality. For example, will the models predict that a Black person statistically earns less than a white person, as is commonly believed?

The idea is to analyze the predictions the models make on several example rows, as a small and purely informal probe. In addition, we try to trace patterns that point to the factors that determine whether someone earns more or less than $50,000, at least for this data set.
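
One way to run such an informal probe is to compare the share of rows predicted as `>50K` across groups. The following is a minimal sketch, assuming a hypothetical `predicted` column already produced by some model; the toy rows and the column name are illustrative and do not come from the notebook.

```python
import pandas as pd

# Hypothetical toy rows: `race` as in the data set, `predicted` is an assumed
# column holding a model's output (1 means ">50K", 0 means "<=50K").
sample = pd.DataFrame({
    "race": ["White", "White", "Black", "Black", "White", "Black"],
    "predicted": [1, 0, 0, 0, 1, 1],
})

# Share of rows predicted as ">50K" within each group.
positive_rate = sample.groupby("race")["predicted"].mean()
print(positive_rate)
```

A gap between groups in such a table is only a hint. With a handful of rows it says nothing about statistical significance.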

## Exploration

In [13]:
import pandas as pd

In [14]:
PREDICTORS = ["age", "workclass", "fnlwgt", "education", "education-num", "marital-status",
              "occupation", "relationship", "race", "sex", "capital-gain",
              "capital-loss", "hours-per-week", "native-country"]
PREDICTORS_STRING = ["workclass", "education", "marital-status", "occupation", "relationship", "race", "sex", "native-country"]
PREDICTORS_INT = ["age", "fnlwgt", "capital-gain", "capital-loss", "hours-per-week"]
TARGET_VARIABLE = "class"

# adult.data has no header row, so the column names are supplied explicitly.
data = pd.read_csv('adult.data', names=PREDICTORS + [TARGET_VARIABLE])
data.head()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


In [16]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education-num   32561 non-null  int64 
 5   marital-status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital-gain    32561 non-null  int64 
 11  capital-loss    32561 non-null  int64 
 12  hours-per-week  32561 non-null  int64 
 13  native-country  32561 non-null  object
 14  class           32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


In [17]:
data.max()

age                              90
workclass               Without-pay
fnlwgt                      1484705
education              Some-college
education-num                    16
marital-status              Widowed
occupation         Transport-moving
relationship                   Wife
race                          White
sex                            Male
capital-gain                  99999
capital-loss                   4356
hours-per-week                   99
native-country           Yugoslavia
class                          >50K
dtype: object

In [18]:
data.min()

age                                17
workclass                           ?
fnlwgt                          12285
education                        10th
education-num                       1
marital-status               Divorced
occupation                          ?
relationship                  Husband
race               Amer-Indian-Eskimo
sex                            Female
capital-gain                        0
capital-loss                        0
hours-per-week                      1
native-country                      ?
class                           <=50K
dtype: object

In [21]:
data["age"].mean()

38.58164675532078

Two simple observations from these exploratory calls are useful for data cleaning:

1. Some columns are not numeric, so the data must be converted to integers before running the respective algorithm.
2. The data set includes minors (17 years old). In addition, the mean age is about 38.6 years.
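
Both observations can be checked directly. The following is a minimal sketch on a hypothetical four-row frame (in the notebook, the real `data` frame would take its place): it lists the non-numeric columns, counts rows for people under 18, and compares the mean and the median age, which are different statistics.

```python
import pandas as pd

# Hypothetical toy frame standing in for `data`.
toy = pd.DataFrame({
    "age": [17, 25, 38, 60],
    "workclass": ["Private", "State-gov", "Private", "?"],
    "hours-per-week": [20, 40, 40, 35],
})

# Columns that pandas did not parse as numbers and therefore need encoding.
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()

# Number of rows for people under 18.
minors = int((toy["age"] < 18).sum())

# Mean and median can differ noticeably, so report both.
mean_age = toy["age"].mean()
median_age = toy["age"].median()
print(non_numeric, minors, mean_age, median_age)
```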

## Data cleaning

## Modularizing the string-to-integer conversion
These string values must be converted to integers before the algorithm can run.

In [10]:
def from_string_to_int(key, data_set):
    """Encode column `key` of `data_set` as integers in place and return the original labels."""
    data_set[key], class_names = pd.factorize(data_set[key])
    return class_names
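
To illustrate what `pd.factorize` returns, here is a small sketch on a hypothetical Series (the labels are illustrative). It yields one integer code per row, numbered in order of first appearance, plus the distinct labels, which can be used to map codes back to the original strings.

```python
import pandas as pd

# Hypothetical column of string labels.
labels = pd.Series(["Private", "State-gov", "Private", "Self-emp-inc"])

# codes: one integer per row; uniques: distinct labels in order of first appearance.
codes, uniques = pd.factorize(labels)
print(codes)
print(uniques)

# Decode the integers back to the original strings.
decoded = uniques[codes]
print(list(decoded))
```

The integer codes carry no meaningful order, which is worth keeping in mind for models that treat their inputs as ordinal.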

## Target from string to integer

In [11]:
class_names = from_string_to_int(TARGET_VARIABLE, data)
print(class_names)
print(data[TARGET_VARIABLE].unique())

Index([' <=50K', ' >50K'], dtype='object')
[0 1]


## Remaining attributes from string to integer

In [12]:
for predictor in PREDICTORS_STRING:
    from_string_to_int(predictor, data)

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   age             32561 non-null  int64
 1   workclass       32561 non-null  int64
 2   fnlwgt          32561 non-null  int64
 3   education       32561 non-null  int64
 4   education-num   32561 non-null  int64
 5   marital-status  32561 non-null  int64
 6   occupation      32561 non-null  int64
 7   relationship    32561 non-null  int64
 8   race            32561 non-null  int64
 9   sex             32561 non-null  int64
 10  capital-gain    32561 non-null  int64
 11  capital-loss    32561 non-null  int64
 12  hours-per-week  32561 non-null  int64
 13  native-country  32561 non-null  int64
 14  class           32561 non-null  int64
dtypes: int64(15)
memory usage: 3.7 MB


## Three important attributes

According to the predictions made, the most decisive attributes were `education`, `workclass`, and `occupation`.
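
The notebook does not show how these attributes were ranked. One common way, sketched here under the assumption that a tree-based model from scikit-learn is used, is to read `feature_importances_` after fitting. The synthetic data below is purely illustrative and is built so that only `education` determines the target.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in data (an assumption, not the Adult data set):
# the target depends only on `education`.
X = pd.DataFrame({
    "education": rng.integers(0, 16, size=200),
    "workclass": rng.integers(0, 8, size=200),
    "noise": rng.integers(0, 100, size=200),
})
y = (X["education"] >= 8).astype(int)

model = DecisionTreeClassifier(random_state=0).fit(X, y)

# Rank attributes by the importance the fitted tree assigns to them.
importances = pd.Series(model.feature_importances_, index=X.columns)
top = importances.sort_values(ascending=False)
print(top)
```

On the real data, the same ranking would be read from whichever model was actually trained. Importances from a single tree can vary between runs and splits, so they are best treated as a rough guide.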