## Missing Values with Pandas

* isna() and isnull(): isnull() is an alias of isna(). Both return a DataFrame of the same shape as the original, where each element is a Boolean value indicating whether it’s missing (True) or not (False).
* notna() and notnull(): These functions return the opposite of isna() and isnull(), marking non-missing values as True.
* info(): This method prints a concise summary of the DataFrame, including the count of non-null values and the dtype of each column.
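
As a minimal sketch of how these functions are typically combined, the snippet below counts missing and present values per column. The `toy` DataFrame and its values are hypothetical and are not part of the penguins dataset used later.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only to illustrate the detection functions
toy = pd.DataFrame({
    "height": [1.70, np.nan, 1.82],
    "city": ["Lima", "Quito", None],
})

# Boolean mask with the same shape as the DataFrame (True = missing)
mask = toy.isna()

# Summing the mask counts the missing values in each column
missing_per_column = mask.sum()
print(missing_per_column)

# notna() is the element-wise negation of isna()
present_per_column = toy.notna().sum()
print(present_per_column)

# info() prints the non-null count and dtype of each column
toy.info()
```

Chaining `.sum()` onto `isna()` is usually the quickest way to see which columns need attention before choosing a strategy.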

### Importing Pandas and Loading the DataFrame

In [15]:
import pandas as pd

# Load the dataset from a URL on GitHub
download_url = "https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv"
df = pd.read_csv(download_url)
type(df)

pandas.core.frame.DataFrame

In [16]:
df.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female,2007
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female,2007
3,Adelie,Torgersen,,,,,,2007
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female,2007


### Removing Rows or Columns
If the missing values are limited to a few rows or columns and don’t represent a significant portion of the dataset, removing those rows or columns is a reasonable option.

In [20]:
# Remove rows with any missing values
df_rows_dropped = df.dropna()

# Remove columns with any missing values
df_cleaned = df.dropna(axis=1)

# axis=1 drops every column that contains at least one NaN value.
# With axis=0 (the default), the rows that contain NaN values are dropped instead.
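
Dropping every row or column that has any NaN can discard a lot of data. The sketch below illustrates the `how`, `thresh`, and `subset` parameters of `dropna()`, which give finer control. The `toy` DataFrame is hypothetical and only serves to make the effect visible.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 1 is entirely missing, rows 0 and 3 are partly missing
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [np.nan, np.nan, 6.0, 8.0],
    "c": ["x", None, "z", "w"],
})

# Drop a row only if every value in it is missing
rows_all_missing_dropped = toy.dropna(how="all")

# Keep only rows that have at least 3 non-missing values
rows_with_three_values = toy.dropna(thresh=3)

# Drop rows only when column "a" is missing
rows_with_a_present = toy.dropna(subset=["a"])

print(rows_all_missing_dropped.index.tolist())
print(rows_with_three_values.index.tolist())
print(rows_with_a_present.index.tolist())
```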

### Filling with a Constant Value

In [21]:
# Fill missing values with a constant value
df_filled = df.fillna(0)
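
A single constant such as 0 is rarely meaningful for every column at once (for example, 0 is not a sensible value for a text column like sex). As a hedged sketch, `fillna()` also accepts a dictionary that maps column names to fill values. The `toy` DataFrame and the chosen constants are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one numeric and one text column
toy = pd.DataFrame({
    "body_mass_g": [3750.0, np.nan, 3250.0],
    "sex": ["male", None, "female"],
})

# Map each column name to the constant used for its missing values
fill_values = {"body_mass_g": 0, "sex": "unknown"}
toy_filled = toy.fillna(value=fill_values)

print(toy_filled)
```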

### Filling with mean, mode, etc.

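We do not run this code: because the DataFrame contains categorical (string) columns, computing the mean or median of every column would raise an error in recent pandas versions.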

In [24]:
# Fill missing values with the mean of each column
# mean_imputed_df = df.fillna(df.mean())

# Fill missing values with the median of each column
# median_imputed_df = df.fillna(df.median())

# Fill missing values with the mode of each column
# mode_imputed_df = df.fillna(df.mode().iloc[0])
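
One possible way to make these calls work on a DataFrame with mixed column types is to pass `numeric_only=True`, so the statistics are computed for numeric columns only. The sketch below uses a small hypothetical `toy` DataFrame rather than the penguins data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one numeric and one text column
toy = pd.DataFrame({
    "bill_length_mm": [39.1, np.nan, 40.3, 36.7],
    "sex": ["male", None, "female", "female"],
})

# numeric_only=True skips the text column instead of raising an error
mean_imputed = toy.fillna(toy.mean(numeric_only=True))
median_imputed = toy.fillna(toy.median(numeric_only=True))

# mode() also works on text columns; iloc[0] takes the first mode of each column
mode_imputed = toy.fillna(toy.mode().iloc[0])

print(mean_imputed)
print(median_imputed)
print(mode_imputed)
```

With the mean or median, the text column keeps its missing value; only the mode fills it.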

### Interpolation

Interpolation estimates missing values from the values of neighboring data points.

In [29]:
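# Interpolate only the numeric columns; text columns such as sex cannot be interpolated
numeric_cols = df.select_dtypes(include="number").columns

# Interpolate missing values using linear method (the default)
linear_interpolated_df = df.copy()
linear_interpolated_df[numeric_cols] = df[numeric_cols].interpolate()

# Interpolate missing values using polynomial method of order 2 (requires SciPy)
polynomial_interpolated_df = df.copy()
polynomial_interpolated_df[numeric_cols] = df[numeric_cols].interpolate(method='polynomial', order=2)

# Interpolate missing values using time-based method (requires a DatetimeIndex)
# time_based_interpolated_df = df.interpolate(method='time')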



In [30]:
linear_interpolated_df.head()  # Show the first rows of the interpolated DataFrame

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female,2007
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female,2007
3,Adelie,Torgersen,38.5,18.65,194.0,3350.0,,2007
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female,2007


In [33]:
# Categorical NaN values remain after interpolation, so we replace them with "No data"

df_filled_linear_interpolated = linear_interpolated_df.fillna("No data")
df_filled_linear_interpolated.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,male,2007
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,female,2007
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,female,2007
3,Adelie,Torgersen,38.5,18.65,194.0,3350.0,No data,2007
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,female,2007


### Example 1: Mean Imputation
Let’s consider a simple example where we have a dataset of students’ exam scores. Some students have missing scores.

In [1]:
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
    'Math_Score': [85, 90, None, 70, None],
    'Physics_Score': [92, None, 78, 88, 75]
}

df2 = pd.DataFrame(data)

We can fill the missing values in this DataFrame with the mean of their respective columns.

Be careful when the DataFrame contains categorical variables: compute the means of the numeric columns first.

In [6]:

# Calculate means of numeric columns only
means = df2.select_dtypes(include="number").mean()

# Fill missing values with the means of respective columns
mean_imputed_df2 = df2.fillna(means)

print(mean_imputed_df2)


      Name  Math_Score  Physics_Score
0    Alice   85.000000          92.00
1      Bob   90.000000          83.25
2  Charlie   81.666667          78.00
3    David   70.000000          88.00
4    Emily   81.666667          75.00
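
After imputation it is no longer visible which scores were originally missing. One possible approach, sketched below, is to record a Boolean indicator per score column before filling. The `scores` name and the `_was_missing` column suffix are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Same exam-score data as above, rebuilt so the sketch is self-contained
scores = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Emily"],
    "Math_Score": [85, 90, None, 70, None],
    "Physics_Score": [92, None, 78, 88, 75],
})

numeric_cols = scores.select_dtypes(include="number").columns

# Record which cells were missing before imputation
was_missing = scores[numeric_cols].isna()

# Impute with column means, then attach one indicator column per score
imputed = scores.fillna(scores[numeric_cols].mean())
for col in numeric_cols:
    imputed[f"{col}_was_missing"] = was_missing[col]

print(imputed)
```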


### Example 2: Advanced Imputation with Interpolation

In [11]:
import pandas as pd
import numpy as np

# Generate a range of dates (business days only)
dates = pd.date_range(start='2023-01-01', end='2023-01-15', freq='B')
print(len(dates))  # Should print 10

# Price list sized to match the number of dates
prices = [100, 105, 110, None, 120, 125, 130, None, 140, 150]
print(len(prices))  # Should print 10

data = {
    'Date': dates,
    'Price': prices
}

df = pd.DataFrame(data)

print(df)

10
10
        Date  Price
0 2023-01-02  100.0
1 2023-01-03  105.0
2 2023-01-04  110.0
3 2023-01-05    NaN
4 2023-01-06  120.0
5 2023-01-09  125.0
6 2023-01-10  130.0
7 2023-01-11    NaN
8 2023-01-12  140.0
9 2023-01-13  150.0


In this scenario, linear interpolation might be a suitable choice to estimate missing prices.


In [12]:
# Interpolate missing values using linear method
linear_interpolated_df = df.interpolate()

print(linear_interpolated_df)

        Date  Price
0 2023-01-02  100.0
1 2023-01-03  105.0
2 2023-01-04  110.0
3 2023-01-05  115.0
4 2023-01-06  120.0
5 2023-01-09  125.0
6 2023-01-10  130.0
7 2023-01-11  135.0
8 2023-01-12  140.0
9 2023-01-13  150.0
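
Linear interpolation treats rows as equally spaced, which matters when a gap spans a weekend. The sketch below compares it with `method="time"`, which weights by the actual time between observations and requires a DatetimeIndex. The price series is a hypothetical variant of the one above, with the missing value moved to Friday 2023-01-06 so that the two methods differ.

```python
import pandas as pd

# Business days from 2023-01-02 to 2023-01-13
dates = pd.date_range(start="2023-01-01", end="2023-01-15", freq="B")

# Hypothetical prices: the Friday value (2023-01-06) is missing
prices = [100, 105, 110, 115, None, 125, 130, 135, 140, 150]
ts = pd.Series(prices, index=dates, name="Price")

# Linear: ignores the index and takes the midpoint of the neighboring rows
linear = ts.interpolate(method="linear")

# Time: Friday is 1 day after Thursday but 3 days before Monday,
# so the estimate stays closer to Thursday's price
time_based = ts.interpolate(method="time")

friday = pd.Timestamp("2023-01-06")
print(linear.loc[friday], time_based.loc[friday])
```

Here the linear estimate is the midpoint of 115 and 125, while the time-based estimate moves only one quarter of the way from Thursday's price toward Monday's, because Friday lies one day into a four-day gap.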
