## *03. Data Preprocessing*
Data preprocessing is a critical step in preparing data for analysis or for use in machine learning applications. It consists of a set of techniques and transformations applied to the raw data to improve its quality, consistency, and relevance for the analysis. This section loads the dataset, normalizes the column names, removes duplicate rows, checks for missing values, and standardizes the categorical values.

In [1]:
# Libraries
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None)

In [2]:
# File paths
file_path = '../datasets/adult.data'
col_names = pd.read_csv('../datasets/col_names.txt').T.iloc[0].tolist()

# Read the data
data = pd.read_csv(filepath_or_buffer=file_path,
                   names=col_names)
data.head()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
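
As a hedged aside, `pd.read_csv` can take care of some cleanup at load time. The sketch below uses a small inline sample (illustrative rows and a hypothetical subset of columns, not the real file) and assumes the raw file separates fields with a comma followed by a space and marks unknown values with `?`. `skipinitialspace` and `na_values` are standard `read_csv` parameters.

```python
import io
import pandas as pd

# Inline stand-in for the raw file (illustrative rows, not read from adult.data)
raw = io.StringIO(
    "39, State-gov, 77516, <=50K\n"
    "54, ?, 140359, >50K\n"
)
sample_cols = ["age", "workclass", "fnlwgt", "income"]  # hypothetical subset of column names

sample = pd.read_csv(
    raw,
    names=sample_cols,
    skipinitialspace=True,  # drop the blank that follows each comma
    na_values="?",          # treat the '?' placeholder as missing (assumption)
)
print(sample)
print(sample.isnull().sum())
```

Handling these details while reading avoids a separate stripping and recoding pass, at the cost of baking assumptions about the file format into the loader.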


In [3]:
# Normalize the predictor names: trim whitespace first, then capitalize
data = data.rename(columns=lambda col: str(col).strip().capitalize())
data.head()

| | Age | Workclass | Fnlwgt | Education | Education-num | Marital-status | Occupation | Relationship | Race | Sex | Capital-gain | Capital-loss | Hours-per-week | Native-country | Income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
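
When `strip` and `capitalize` are chained on a column name, the order matters if the name carries leading whitespace. The following minimal sketch uses a hypothetical name that is not taken from the dataset:

```python
raw_name = "  hours-per-week "  # hypothetical column name with stray whitespace

# capitalize() upper-cases only the first character, which here is a space
capitalize_first = raw_name.capitalize().strip()
print(repr(capitalize_first))  # 'hours-per-week'

# Stripping first lets capitalize() act on the first letter
strip_first = raw_name.strip().capitalize()
print(repr(strip_first))  # 'Hours-per-week'
```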


In [4]:
# Duplicates
print(f'Original size: {data.shape}')
data.drop_duplicates(inplace=True, ignore_index=True)
print(f'Size without duplicates: {data.shape}')

Original size: (32561, 15)
Size without duplicates: (32537, 15)
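
The shapes above imply that 32561 - 32537 = 24 rows were dropped. Before dropping, it can help to look at what is being removed. The sketch below does so on a toy frame (illustrative values only, not the real data) using `DataFrame.duplicated`:

```python
import pandas as pd

# Toy frame with one repeated row (illustrative values only)
toy = pd.DataFrame({
    "Age": [39, 50, 39, 28],
    "Workclass": ["State-gov", "Private", "State-gov", "Private"],
})

# keep=False flags every member of a duplicated group, not just the later copies
dupes = toy[toy.duplicated(keep=False)]
print(dupes)

# Number of rows drop_duplicates() would remove (copies after the first occurrence)
n_removed = int(toy.duplicated().sum())
deduped = toy.drop_duplicates(ignore_index=True)
print(n_removed, deduped.shape)
```

Whether exact duplicates should be removed at all depends on the data: in survey-style data, two identical rows may describe two different people, so this step is a modeling choice rather than a pure cleanup.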


In [5]:
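# Normalize missing values to np.nan. Note: fillna(np.nan) only affects values
# pandas already treats as missing; string placeholders such as '?' are untouched.
data = data.fillna(np.nan)

# Fraction of missing values per column
data.isnull().mean().sort_values(ascending=False)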

Age               0.0
Workclass         0.0
Fnlwgt            0.0
Education         0.0
Education-num     0.0
Marital-status    0.0
Occupation        0.0
Relationship      0.0
Race              0.0
Sex               0.0
Capital-gain      0.0
Capital-loss      0.0
Hours-per-week    0.0
Native-country    0.0
Income            0.0
dtype: float64

In [6]:
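# Standardize the categorical predictors: cast to string, trim whitespace, lowercase
# (uses apply with the .str accessor because DataFrame.applymap is deprecated since pandas 2.1)
categoricals = list(data.select_dtypes(include=['object', 'bool']).columns)
data[categoricals] = data[categoricals].apply(lambda s: s.astype(str).str.strip().str.lower())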
data[categoricals].sample(5, random_state=42)

| | Workclass | Education | Marital-status | Occupation | Relationship | Race | Sex | Native-country | Income |
|---|---|---|---|---|---|---|---|---|---|
| 3643 | state-gov | assoc-voc | married-civ-spouse | craft-repair | husband | white | male | united-states | <=50k |
| 16036 | federal-gov | bachelors | never-married | exec-managerial | not-in-family | white | male | united-states | <=50k |
| 9401 | local-gov | some-college | married-civ-spouse | other-service | husband | asian-pac-islander | male | philippines | <=50k |
| 17903 | private | some-college | never-married | exec-managerial | not-in-family | white | male | united-states | <=50k |
| 5198 | federal-gov | bachelors | never-married | exec-managerial | not-in-family | white | male | united-states | >50k |


In [7]:
# Validate the changes: list the distinct values of each categorical column
for col in categoricals:
    print(data[col].unique())

['state-gov' 'self-emp-not-inc' 'private' 'federal-gov' 'local-gov' '?'
 'self-emp-inc' 'without-pay' 'never-worked']
['bachelors' 'hs-grad' '11th' 'masters' '9th' 'some-college' 'assoc-acdm'
 'assoc-voc' '7th-8th' 'doctorate' 'prof-school' '5th-6th' '10th'
 '1st-4th' 'preschool' '12th']
['never-married' 'married-civ-spouse' 'divorced' 'married-spouse-absent'
 'separated' 'married-af-spouse' 'widowed']
['adm-clerical' 'exec-managerial' 'handlers-cleaners' 'prof-specialty'
 'other-service' 'sales' 'craft-repair' 'transport-moving'
 'farming-fishing' 'machine-op-inspct' 'tech-support' '?'
 'protective-serv' 'armed-forces' 'priv-house-serv']
['not-in-family' 'husband' 'wife' 'own-child' 'unmarried' 'other-relative']
['white' 'black' 'asian-pac-islander' 'amer-indian-eskimo' 'other']
['male' 'female']
['united-states' 'cuba' 'jamaica' 'india' '?' 'mexico' 'south'
 'puerto-rico' 'honduras' 'england' 'canada' 'germany' 'iran'
 'philippines' 'italy' 'poland' 'columbia' 'cambodia' 'thailand' '
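
The listings above include a `'?'` entry for the workclass, occupation, and native-country columns, even though the earlier `isnull()` check reported 0.0 for every column. A plausible reading, stated here as an assumption, is that `?` is the dataset's placeholder for unknown values. The sketch below shows, on a toy frame with illustrative values, one way to measure such placeholders and recode them as real missing values:

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the cleaned categoricals (illustrative values only)
toy = pd.DataFrame({
    "Workclass": ["private", "?", "state-gov", "?"],
    "Occupation": ["sales", "?", "tech-support", "craft-repair"],
    "Sex": ["male", "female", "male", "female"],
})

# Share of '?' entries per column
placeholder_share = (toy == "?").mean()
print(placeholder_share)

# Recode the placeholder as a real missing value so isnull() can see it
toy_clean = toy.replace("?", np.nan)
print(toy_clean.isnull().mean())
```

After recoding, the usual tools (`isnull`, `dropna`, imputers) treat those entries as missing; whether to drop, impute, or keep them as their own category is a separate decision.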

---