## 1. Description and justification

This is a classification problem. The task on this dataset is to predict whether a person in the United States earns more or less than $50,000 a year. The broader question is which variables determine the annual income of people in the United States.

The problem is split into two parts:

1. Detect which features are relevant for determining whether a person earns more or less than $50,000. One way to approach this is, for example, to discard columns with low variance.

2. Implement a model that, given a tuple of data for an arbitrary citizen, predicts whether that person earns more or less than $50,000. Such a model could give an intuition for which factors may be contributing to inequality in the United States.
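
As a rough illustration of the first point, the sketch below flags numeric columns whose variance falls at or below a cutoff. The toy frame, the `low_variance_columns` helper, and the `threshold` value are all hypothetical; variance depends on each column's scale, so the cutoff would need tuning on the real data.

```python
import pandas as pd


def low_variance_columns(frame: pd.DataFrame, threshold: float) -> list:
    """Return names of numeric columns whose sample variance is <= threshold."""
    variances = frame.select_dtypes(include="number").var()
    return [col for col, var in variances.items() if var <= threshold]


# Hypothetical toy data, not taken from the census file.
toy = pd.DataFrame({
    "age": [25, 38, 52, 41],
    "constant": [1, 1, 1, 1],
    "hours-per-week": [40, 40, 40, 41],
})

flagged = low_variance_columns(toy, threshold=0.5)
print(flagged)
```

Low variance is only a first filter: a column can vary a lot and still carry no signal about income, so it would normally be combined with other relevance checks.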

## 2. Exploration and interpretation

Most columns are self-explanatory. However, "fnlwgt" is easy to misread. It stands for "final weight". Without going into much detail, it is a sampling weight, roughly the number of people in the population that the record represents, and it is derived from socioeconomic and demographic characteristics. As a result, people with the same demographic characteristics have similar weights.

In [1]:
import pandas as pd

# Column names, in the order passed to read_csv below.
PREDICTORS = ["age", "workclass", "fnlwgt", "education", "education-num", "marital-status",
              "occupation", "relationship", "race", "sex", "capital-gain",
              "capital-loss", "hours-per-week", "native-country"]
PREDICTORS_STRING = ["workclass", "education", "marital-status", "occupation", "relationship", "race", "sex", "native-country"]
PREDICTORS_INT = ["age", "fnlwgt", "capital-gain", "capital-loss", "hours-per-week"]
TARGET_VARIABLE = "class"

# Reuse the constants above instead of repeating the full column list.
data = pd.read_csv('adult.data', names=PREDICTORS + [TARGET_VARIABLE])
data.head()

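|   | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | class |
|---|-----|-----------|--------|-----------|---------------|----------------|------------|--------------|------|-----|--------------|--------------|----------------|----------------|-------|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |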


In [2]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             10000 non-null  int64 
 1   workclass       10000 non-null  object
 2   fnlwgt          10000 non-null  int64 
 3   education       10000 non-null  object
 4   education-num   10000 non-null  int64 
 5   marital-status  10000 non-null  object
 6   occupation      10000 non-null  object
 7   relationship    10000 non-null  object
 8   race            10000 non-null  object
 9   sex             10000 non-null  object
 10  capital-gain    10000 non-null  int64 
 11  capital-loss    10000 non-null  int64 
 12  hours-per-week  10000 non-null  int64 
 13  native-country  10000 non-null  object
 14  class           10000 non-null  object
dtypes: int64(6), object(9)
memory usage: 1.1+ MB


### Count unique values per column

In [3]:
print(data.nunique())

age                 71
workclass            9
fnlwgt            8507
education           16
education-num       16
marital-status       7
occupation          15
relationship         6
race                 5
sex                  2
capital-gain       102
capital-loss        73
hours-per-week      83
native-country      41
class                2
dtype: int64


### Find duplicate rows

In [4]:
duplicates = data.duplicated()
print(data[duplicates])

      age workclass  fnlwgt      education  education-num  marital-status  \
4881   25   Private  308144      Bachelors             13   Never-married   
5104   90   Private   52386   Some-college             10   Never-married   
9171   21   Private  250051   Some-college             10   Never-married   

           occupation    relationship                 race      sex  \
4881     Craft-repair   Not-in-family                White     Male   
5104    Other-service   Not-in-family   Asian-Pac-Islander     Male   
9171   Prof-specialty       Own-child                White   Female   

      capital-gain  capital-loss  hours-per-week  native-country   class  
4881             0             0              40          Mexico   <=50K  
5104             0             0              35   United-States   <=50K  
9171             0             0              10   United-States   <=50K  
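
By default `duplicated()` flags only the second and later copies of a row, which is why each duplicate appears once above. To inspect every copy next to its first occurrence, `keep=False` can be passed instead. A minimal sketch on a hypothetical toy frame:

```python
import pandas as pd

# Hypothetical toy frame: rows 0 and 2 are identical.
toy = pd.DataFrame({
    "age": [25, 90, 25, 21],
    "class": ["<=50K", "<=50K", "<=50K", ">50K"],
})

# Default keep="first": only the later copy is flagged.
later_copies = toy[toy.duplicated()]
# keep=False: every member of a duplicate group is flagged.
all_copies = toy[toy.duplicated(keep=False)]

print(later_copies.index.tolist())
print(all_copies.index.tolist())
```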


## 3. Data

### Helper to convert string columns to integers

In [5]:
def from_string_to_int(key, data_set):
    """Replace column `key` with integer codes in place and return the original labels."""
    data_set[key], class_names = pd.factorize(data_set[key])
    return class_names
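
To make the encoding concrete, the sketch below shows what `pd.factorize` returns and how the codes can be mapped back to labels. The sample values are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical sample values, used only for illustration.
sample = pd.Series(["Private", "State-gov", "Private", "Self-emp-inc"])

codes, uniques = pd.factorize(sample)
print(codes)    # one integer code per row, numbered by first appearance
print(uniques)  # the label that each code stands for

# Map the integer codes back to their original labels.
decoded = uniques.take(codes)
print(list(decoded))
```

Two caveats apply: the codes follow order of first appearance, so they carry no meaningful ranking, and `pd.factorize` marks missing values with `-1` by default, which this decoding step does not handle.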

### Target to integer

In [6]:
class_names = from_string_to_int(TARGET_VARIABLE, data)
print(class_names)
print(data[TARGET_VARIABLE].unique())

Index([' <=50K', ' >50K'], dtype='object')
[0 1]
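
The labels printed above keep a leading space (`' <=50K'`), which suggests the file separates fields with a comma followed by a space. One possible way to drop that space at load time, assuming this layout, is `skipinitialspace=True`. The inline CSV below is a hypothetical two-row stand-in for `adult.data`.

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for the real file.
raw = "39, State-gov, <=50K\n52, Private, >50K\n"
columns = ["age", "workclass", "class"]

plain = pd.read_csv(io.StringIO(raw), names=columns)
stripped = pd.read_csv(
    io.StringIO(raw),
    names=columns,
    skipinitialspace=True,  # ignore spaces that follow the delimiter
)

print(plain["class"].tolist())
print(stripped["class"].tolist())
```

Once the columns are factorized the space no longer matters for modeling, but clean labels are easier to compare against and to display.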


### Remaining string attributes to integers

In [7]:
for predictor in PREDICTORS_STRING:
    from_string_to_int(predictor, data)

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   age             10000 non-null  int64
 1   workclass       10000 non-null  int64
 2   fnlwgt          10000 non-null  int64
 3   education       10000 non-null  int64
 4   education-num   10000 non-null  int64
 5   marital-status  10000 non-null  int64
 6   occupation      10000 non-null  int64
 7   relationship    10000 non-null  int64
 8   race            10000 non-null  int64
 9   sex             10000 non-null  int64
 10  capital-gain    10000 non-null  int64
 11  capital-loss    10000 non-null  int64
 12  hours-per-week  10000 non-null  int64
 13  native-country  10000 non-null  int64
 14  class           10000 non-null  int64
dtypes: int64(15)
memory usage: 1.1 MB


### Remove duplicate rows (if any)
Duplicate rows are removed, following Jason Brownlee's recommendation, since in general there is no need to keep them.

In [8]:
print(data.shape)
data.drop_duplicates(inplace=True)
print(data.shape)

(10000, 15)
(9997, 15)


### Remove single-value columns (if any)

In [9]:
print(data.shape)
counts = data.nunique()
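# Collect column names (not positions) that hold a single value, since drop() matches labels.
to_del = [col for col, n_unique in counts.items() if n_unique == 1]
print(to_del)
data.drop(columns=to_del, inplace=True)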
print(data.shape)

(9997, 15)
[]
(9997, 15)
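
Since the real data has no single-value columns, this step removes nothing here. The sketch below runs the same idea on a hypothetical toy frame, selecting columns by name so that `drop` receives labels it can match.

```python
import pandas as pd

# Hypothetical toy frame with one constant column.
toy = pd.DataFrame({
    "age": [39, 50, 38],
    "country": ["US", "US", "US"],
    "sex": ["Male", "Female", "Male"],
})

counts = toy.nunique()
# Boolean-mask the counts, then read the matching column names off the index.
constant_cols = counts[counts == 1].index.tolist()
print(constant_cols)

reduced = toy.drop(columns=constant_cols)
print(reduced.shape)
```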
