# Handling missing values

To explore the different options for data imputation, we will use the missing migrants dataset, available on Kaggle: https://www.kaggle.com/datasets/nelgiriyewithana/global-missing-migrants-dataset/code

In [23]:
import pandas as pd

df = pd.read_csv('datasets/global-missing-migrants-dataset.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13020 entries, 0 to 13019
Data columns (total 19 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   Incident Type                        13020 non-null  object 
 1   Incident year                        13020 non-null  int64  
 2   Reported Month                       13020 non-null  object 
 3   Region of Origin                     12998 non-null  object 
 4   Region of Incident                   13020 non-null  object 
 5   Country of Origin                    13012 non-null  object 
 6   Number of Dead                       12470 non-null  float64
 7   Minimum Estimated Number of Missing  13020 non-null  int64  
 8   Total Number of Dead and Missing     13020 non-null  int64  
 9   Number of Survivors                  13020 non-null  int64  
 10  Number of Females                    13020 non-null  int64  
 11  Number of Males             

Let's look at a few rows to get a feel for the dataset.

In [24]:
df.sample(5)

Unnamed: 0,Incident Type,Incident year,Reported Month,Region of Origin,Region of Incident,Country of Origin,Number of Dead,Minimum Estimated Number of Missing,Total Number of Dead and Missing,Number of Survivors,Number of Females,Number of Males,Number of Children,Cause of Death,Migration route,Location of death,Information Source,Coordinates,UNSD Geographical Grouping
12206,Incident,2022,December,Northern Africa (P),Mediterranean,Unknown,,14,14,0,2,0,1,Drowning,Western Mediterranean,Unspecified location in the Balaeric Sea - dep...,"Caminando Fronteras, CIPIMD","36.76729267, 3.096717",Uncategorized
12137,Incident,2022,November,Southern Asia,Southern Asia,Afghanistan,1.0,0,1,0,0,1,0,Vehicle accident / death linked to hazardous t...,,"Rubat Safcha village, Obe district, Herat prov...",IOM Afghanistan,"34.2728176, 63.1089658",Southern Asia
9043,Incident,2021,August,Latin America / Caribbean (P),North America,Unknown,3.0,0,3,8,0,1,0,Vehicle accident / death linked to hazardous t...,US-Mexico border crossing,Approximately 39 miles northwest of Tucson nea...,US Border Patrol,"32.64174, -111.38443",Northern America
26,Incident,2014,February,Latin America / Caribbean (P),North America,Unknown,1.0,0,1,0,0,0,0,Mixed or unknown,US-Mexico border crossing,Pima Country Office of the Medical Examiner ju...,Pima County Office of the Medical Examiner (PC...,"32.08795, -112.58407",Northern America
11952,Incident,2022,August,Southern Asia,Southern Asia,Afghanistan,1.0,0,1,0,0,1,0,Vehicle accident / death linked to hazardous t...,,"Argan village, Kiti district, Daykundi provinc...",IOM Afghanistan,"33.4096584, 65.8085049",Southern Asia


Some columns have missing values. Let's count exactly how many there are in each one.

In [25]:
df.isnull().sum()

Incident Type                             0
Incident year                             0
Reported Month                            0
Region of Origin                         22
Region of Incident                        0
Country of Origin                         8
Number of Dead                          550
Minimum Estimated Number of Missing       0
Total Number of Dead and Missing          0
Number of Survivors                       0
Number of Females                         0
Number of Males                           0
Number of Children                        0
Cause of Death                            0
Migration route                        3021
Location of death                         0
Information Source                        8
Coordinates                              36
UNSD Geographical Grouping                1
dtype: int64
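
Absolute counts are easier to judge as a share of the rows: `Migration route`, for instance, is missing in 3021 of 13020 rows, roughly 23%. The sketch below shows one way to get that share per column. It uses a small toy DataFrame (the values are made up for illustration) so that it runs without the CSV file.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are illustrative only).
toy = pd.DataFrame({
    "Country of Origin": ["Sudan", None, "Egypt", "Chad"],
    "Migration route": [None, None, "Central Mediterranean", None],
    "Number of Dead": [1.0, np.nan, 3.0, 2.0],
})

# isnull() yields booleans; their mean is the fraction of missing values per column.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print((missing_share * 100).round(1))
```

On the real data the same expression would be `df.isnull().mean()`.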

Let's look at some cases. We combine `isnull` with `any(axis=1)` to retrieve the rows that have at least one null attribute.

In [26]:
df[df.isnull().any(axis=1)].sample(5)

Unnamed: 0,Incident Type,Incident year,Reported Month,Region of Origin,Region of Incident,Country of Origin,Number of Dead,Minimum Estimated Number of Missing,Total Number of Dead and Missing,Number of Survivors,Number of Females,Number of Males,Number of Children,Cause of Death,Migration route,Location of death,Information Source,Coordinates,UNSD Geographical Grouping
1733,Incident,2016,June,Unknown,Northern Africa,Unknown,3.0,0,3,0,0,0,0,Violence,,"Tripoli, Libya",MHub,"32.86414222, 13.1762320988",Northern Africa
6153,Incident,2019,June,Southern Asia (P),Southern Asia,Unknown,2.0,0,2,0,0,2,0,Violence,,"Kurdistan, Iran",Mixed Migration Monitoring Mechanism Initiativ...,"36.048739, 45.809583",Southern Asia
10650,Incident,2022,April,Sub-Saharan Africa (P),Mediterranean,Unknown,,12,12,94,0,0,0,Drowning,Central Mediterranean,"North of Al-Khums, Libya - location of rescue ...",SOS Méditerranée,"33.14699825, 14.021359",Uncategorized
67,Incident,2014,April,Southern Asia,South-eastern Asia,Cambodia,7.0,0,7,7,2,5,0,Vehicle accident / death linked to hazardous t...,,"344 Road, 39-40th km, Nong Sue Chang subdistri...","Naewna, Post Today, Daily News, Manager Online","13.3493133, 101",South-eastern Asia
6993,Incident,2019,December,Sub-Saharan Africa (P),Western Africa,Unknown,2.0,0,2,0,1,1,1,Violence,,"Gao, Mali",Mixed Migration Monitoring Mechanism Initiativ...,"-0.028555, 16.261959",Western Africa
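
To make the role of `axis` explicit, here is a minimal sketch on a toy DataFrame (made-up values). Note that a placeholder string such as 'Unknown' is not null as far as pandas is concerned, so `isnull` does not flag it.

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative values only).
toy = pd.DataFrame({
    "Region of Origin": ["Southern Asia", None, "Unknown"],
    "Number of Dead": [1.0, 2.0, np.nan],
    "Number of Survivors": [0, 5, 94],
})

null_mask = toy.isnull()                # same shape as toy, True where a value is missing
rows_with_null = null_mask.any(axis=1)  # one boolean per row: True if any column is null
cols_with_null = null_mask.any(axis=0)  # one boolean per column: True if any row is null

print(toy[rows_with_null])
print(cols_with_null)
```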


Given the nature of the data, some columns are clearly related to each other, and that relationship can drive the imputation. 'Country of Origin' and 'Region of Origin' are one such pair: knowing the country, we can deduce the region it belongs to, but not the other way around, since a region contains several countries.
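
This one-way relationship can be checked on the rows where both values are known. The sketch below does so on a toy DataFrame (made-up rows). On the real data the check may report more than one label for a country, for example because of the '(P)' variants of the region names, so it is best read as a diagnostic rather than a guarantee.

```python
import pandas as pd

# Toy frame (illustrative rows only).
toy = pd.DataFrame({
    "Country of Origin": ["Sudan", "Sudan", "Egypt", "Chad", "Chad", "Sudan"],
    "Region of Origin": ["Northern Africa", "Northern Africa", "Northern Africa",
                         "Middle Africa", None, None],
})

# Keep only rows where both columns are known.
known = toy.dropna(subset=["Country of Origin", "Region of Origin"])

# Country -> region: each country should appear with exactly one region.
regions_per_country = known.groupby("Country of Origin")["Region of Origin"].nunique()
# Region -> country: a region usually contains several countries.
countries_per_region = known.groupby("Region of Origin")["Country of Origin"].nunique()

print(regions_per_country)
print(countries_per_region)
```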

Let's see which countries have a null region, so we can build the mapping.

In [27]:
df[~df['Country of Origin'].isna() & df['Region of Origin'].isna()]['Country of Origin'].unique()

array(['Afghanistan,Iraq,Syrian Arab Republic', 'Sudan', 'Egypt',
       'Unknown', 'Nigeria', 'Chad', 'Syrian Arab Republic', 'Mauritania',
       'Bangladesh'], dtype=object)

Which regions already exist in the dataset?

In [18]:
df['Region of Origin'].unique()

array(['Central America', 'Latin America / Caribbean (P)',
       'Northern Africa', 'Unknown', 'Southern Asia', 'Caribbean',
       'South-eastern Asia', 'Eastern Africa', 'Europe', 'South America',
       'Western Asia', 'Middle Africa', 'Eastern Asia',
       'Sub-Saharan Africa (P)', 'Western Africa', 'Mixed',
       'Western / Southern Asia (P)', 'Eastern Africa (P)',
       'Western / Southern Asia', 'Eastern Asia (P)',
       'Western Africa (P)', 'Sub-Saharan Africa', nan, 'Oceania',
       'Central Asia', 'Southern Asia (P)', 'Northern Africa (P)',
       'Southern Africa', 'Caribbean (P)', 'Western Asia (P)',
       'South America (P)', 'Central America (P)',
       'South-eastern Asia (P)', 'Northern America'], dtype=object)

Based on a brief online search, we create the mapping.

In [28]:
# Build the country -> region mapping
region_mapping = {
    'Chad': 'Middle Africa',
    'Bangladesh': 'Southern Asia',
    'Sudan': 'Northern Africa',
    'Egypt': 'Northern Africa',
    'Syrian Arab Republic': 'Western Asia',
    'Nigeria': 'Western Africa',
    'Mauritania': 'Western Africa',
    'Iraq': 'Western Asia',
    'Iran': 'Southern Asia (P)',
    # The dataset's own rows list Afghanistan under 'Southern Asia'
    'Afghanistan': 'Southern Asia',
    'Unknown': 'Unknown'
}
# Fill only the null regions; rows that already have a region are left untouched
df['Region of Origin'] = df['Region of Origin'].fillna(df['Country of Origin'].map(region_mapping))
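
The `fillna(... .map(...))` pattern has two properties worth spelling out: regions that are already present are never overwritten, and countries missing from the dictionary stay null. The sketch below illustrates both on a toy DataFrame; `toy_mapping` and its rows are made up for the example.

```python
import pandas as pd

# Toy frame and mapping (illustrative values only).
toy = pd.DataFrame({
    "Country of Origin": ["Sudan", "Chad", "Egypt", "Peru"],
    "Region of Origin": [None, "Sub-Saharan Africa (P)", None, None],
})
toy_mapping = {"Sudan": "Northern Africa", "Chad": "Middle Africa", "Egypt": "Northern Africa"}

# map() builds a candidate region for every row; countries not in the dict become NaN.
candidates = toy["Country of Origin"].map(toy_mapping)
# fillna() aligns on the index and uses a candidate only where the region is currently null.
toy["Region of Origin"] = toy["Region of Origin"].fillna(candidates)
print(toy)
```

Here the Chad row keeps its original region, and the Peru row remains null because the mapping has no entry for it.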

Let's check the result of the imputation.

In [29]:
df.isna().sum()

Incident Type                             0
Incident year                             0
Reported Month                            0
Region of Origin                          1
Region of Incident                        0
Country of Origin                         8
Number of Dead                          550
Minimum Estimated Number of Missing       0
Total Number of Dead and Missing          0
Number of Survivors                       0
Number of Females                         0
Number of Males                           0
Number of Children                        0
Cause of Death                            0
Migration route                        3021
Location of death                         0
Information Source                        8
Coordinates                              36
UNSD Geographical Grouping                1
dtype: int64

One null value remains. Let's look at it.

In [31]:
df[df['Region of Origin'].isnull()]

Unnamed: 0,Incident Type,Incident year,Reported Month,Region of Origin,Region of Incident,Country of Origin,Number of Dead,Minimum Estimated Number of Missing,Total Number of Dead and Missing,Number of Survivors,Number of Females,Number of Males,Number of Children,Cause of Death,Migration route,Location of death,Information Source,Coordinates,UNSD Geographical Grouping
1217,Incident,2016,January,,Mediterranean,"Afghanistan,Iraq,Syrian Arab Republic",39.0,0,39,75,0,0,5,Drowning,Eastern Mediterranean,"between Ayvacik, Canakkale, Türkiye and Lesvos...",Turkish Coast Guard via IOM Athens. AFP and th...,"39.2893824, 26.4734281",Uncategorized


This case is tricky because it lists more than one country. We assign it the value 'Mixed', which already exists among the dataset's regions.

In [35]:
df['Region of Origin'] = df['Region of Origin'].fillna(df['Country of Origin'].map({'Afghanistan,Iraq,Syrian Arab Republic': 'Mixed'}))

We check the null values again.

In [36]:
df.isna().sum()

Incident Type                             0
Incident year                             0
Reported Month                            0
Region of Origin                          0
Region of Incident                        0
Country of Origin                         8
Number of Dead                          550
Minimum Estimated Number of Missing       0
Total Number of Dead and Missing          0
Number of Survivors                       0
Number of Females                         0
Number of Males                           0
Number of Children                        0
Cause of Death                            0
Migration route                        3021
Location of death                         0
Information Source                        8
Coordinates                              36
UNSD Geographical Grouping                1
dtype: int64
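
The multi-country row was handled by hand with a one-entry dictionary. If more such rows appeared, a rule could replace the dictionary. The sketch below is one possible approach on a toy DataFrame (made-up rows). It assumes that a comma in the country field always separates several countries, which would misfire on country names that themselves contain a comma.

```python
import pandas as pd

# Toy frame (illustrative rows only).
toy = pd.DataFrame({
    "Country of Origin": ["Afghanistan,Iraq,Syrian Arab Republic", "Sudan", None, "Mali,Niger"],
    "Region of Origin": [None, None, None, "Western Africa"],
})

# Assumption: a comma inside the country field means several countries were recorded.
# na=False treats a missing country as "not multi-country".
is_multi = toy["Country of Origin"].str.contains(",", regex=False, na=False)
needs_region = toy["Region of Origin"].isna()

# Only rows that are multi-country AND still lack a region receive 'Mixed'.
toy.loc[is_multi & needs_region, "Region of Origin"] = "Mixed"
print(toy)
```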

The 8 null values in `Country of Origin` cannot be imputed with deduced values, so we assign them 'Unknown', the placeholder the dataset already uses.

In [37]:
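# Assign the result back: calling fillna(..., inplace=True) on a selected column is deprecated in recent pandas
df['Country of Origin'] = df['Country of Origin'].fillna('Unknown')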
df.isna().sum()

Incident Type                             0
Incident year                             0
Reported Month                            0
Region of Origin                          0
Region of Incident                        0
Country of Origin                         0
Number of Dead                          550
Minimum Estimated Number of Missing       0
Total Number of Dead and Missing          0
Number of Survivors                       0
Number of Females                         0
Number of Males                           0
Number of Children                        0
Cause of Death                            0
Migration route                        3021
Location of death                         0
Information Source                        8
Coordinates                              36
UNSD Geographical Grouping                1
dtype: int64

Suggested exercise:

Impute 3 more columns of the dataset, of your choice. Study the nature of each column and make a decision based on that exploration. Before imputing with 'Unknown', confirm that it really is the only alternative.
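
As a starting point, here is a hedged sketch of the "validate first, then impute" workflow for a numeric column. It assumes that `Total Number of Dead and Missing` equals `Number of Dead` plus `Minimum Estimated Number of Missing`. The sampled rows shown earlier are consistent with that, but it is an assumption that has to be verified on the full dataset. The toy values below are made up.

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative values only).
toy = pd.DataFrame({
    "Number of Dead": [1.0, 3.0, np.nan, np.nan],
    "Minimum Estimated Number of Missing": [0, 2, 14, 12],
    "Total Number of Dead and Missing": [1, 5, 14, 12],
})

dead = toy["Number of Dead"]
missing = toy["Minimum Estimated Number of Missing"]
total = toy["Total Number of Dead and Missing"]

# Step 1: validate the assumed relationship on the rows where nothing is null.
complete = dead.notna()
relation_holds = bool((dead[complete] + missing[complete] == total[complete]).all())

# Step 2: impute only if the relationship held on every complete row.
if relation_holds:
    toy["Number of Dead"] = dead.fillna(total - missing)

print(relation_holds)
print(toy)
```

If the check fails on the real data, the rows that break it are worth inspecting before choosing another strategy.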