# 02 - Preprocessing

### 1. Clean Data and Correct Dtypes
Some columns are not useful and others have the wrong data type, so we drop the former and correct the latter.

### 2. Imputing and Duplicates
Since the percentage of missing values in each feature (column) of our dataset is low (at most about 6% of the rows), imputation is a better option than dropping rows.

For imputation, given that our variables are mostly categorical, we have two strategies:
- Replacing all missing values with a placeholder such as `Unknown` or `No sabe` (we use this one as the baseline)
- Replacing all missing values with the most frequent value
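
As a minimal sketch of the two strategies, the snippet below fills a toy frame both ways. The column names `education` and `has_internet` and their values are illustrative assumptions, not the real dataset.

```python
import numpy as np
import pandas as pd

# Toy frame; column names and values are hypothetical, not the real dataset.
toy = pd.DataFrame({
    "education": ["Primaria completa", np.nan, "Postgrado", "Postgrado"],
    "has_internet": ["Si", "Si", np.nan, "No"],
})

# Strategy 1: flag missing values with an explicit placeholder category.
filled_placeholder = toy["education"].fillna("Unknown")

# Strategy 2: replace missing values with the most frequent value (the mode).
most_frequent = toy["has_internet"].mode()[0]
filled_mode = toy["has_internet"].fillna(most_frequent)

print(filled_placeholder.tolist())
print(filled_mode.tolist())
```

The placeholder keeps "missing" visible to the model as its own category, while the mode is a reasonable choice for binary columns where an extra category would break a 0/1 encoding.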

### 3. Encoding
For encoding, we apply:
- Frequency encoding for high-cardinality features
- One-hot encoding for binary and nominal features
- Ordinal encoding for every column that describes an order
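
To make the first two encodings concrete, here is a small sketch using plain pandas on a toy frame. The column names `program` and `year` are assumptions chosen for illustration.

```python
import pandas as pd

# Toy frame; column names and values are hypothetical.
toy = pd.DataFrame({
    "program": ["DERECHO", "DERECHO", "PSICOLOGIA", "ENFERMERIA"],
    "year": ["2019", "2020", "2020", "2021"],
})

# Frequency encoding: replace each category with its relative frequency.
freq = toy["program"].value_counts(normalize=True)
toy["program_freq"] = toy["program"].map(freq)

# One-hot encoding: one 0/1 indicator column per category.
encoded = pd.get_dummies(toy.drop(columns="program"), columns=["year"], dtype=int)

print(encoded)
```

Frequency encoding keeps a high-cardinality feature in a single numeric column, whereas one-hot encoding adds one column per category and is therefore better suited to features with few distinct values.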

### Data Loading and Exploration

In [15]:
# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import scipy
%matplotlib inline

from sklearn.preprocessing import OrdinalEncoder, LabelEncoder
from sklearn.base import BaseEstimator, TransformerMixin
from typing import Union

sns.set_palette('dark')

In [16]:
# Load data
df = pd.read_csv('data/train.csv')
df.head()

Unnamed: 0,ID,PERIODO,ESTU_PRGM_ACADEMICO,ESTU_PRGM_DEPARTAMENTO,ESTU_VALORMATRICULAUNIVERSIDAD,ESTU_HORASSEMANATRABAJA,FAMI_ESTRATOVIVIENDA,FAMI_TIENEINTERNET,FAMI_EDUCACIONPADRE,FAMI_TIENELAVADORA,FAMI_TIENEAUTOMOVIL,ESTU_PRIVADO_LIBERTAD,ESTU_PAGOMATRICULAPROPIO,FAMI_TIENECOMPUTADOR,FAMI_TIENEINTERNET.1,FAMI_EDUCACIONMADRE,RENDIMIENTO_GLOBAL
0,904256,20212,ENFERMERIA,BOGOTÁ,Entre 5.5 millones y menos de 7 millones,Menos de 10 horas,Estrato 3,Si,Técnica o tecnológica incompleta,Si,Si,N,No,Si,Si,Postgrado,medio-alto
1,645256,20212,DERECHO,ATLANTICO,Entre 2.5 millones y menos de 4 millones,0,Estrato 3,No,Técnica o tecnológica completa,Si,No,N,No,Si,No,Técnica o tecnológica incompleta,bajo
2,308367,20203,MERCADEO Y PUBLICIDAD,BOGOTÁ,Entre 2.5 millones y menos de 4 millones,Más de 30 horas,Estrato 3,Si,Secundaria (Bachillerato) completa,Si,No,N,No,No,Si,Secundaria (Bachillerato) completa,bajo
3,470353,20195,ADMINISTRACION DE EMPRESAS,SANTANDER,Entre 4 millones y menos de 5.5 millones,0,Estrato 4,Si,No sabe,Si,No,N,No,Si,Si,Secundaria (Bachillerato) completa,alto
4,989032,20212,PSICOLOGIA,ANTIOQUIA,Entre 2.5 millones y menos de 4 millones,Entre 21 y 30 horas,Estrato 3,Si,Primaria completa,Si,Si,N,No,Si,Si,Primaria completa,medio-bajo


In [17]:
df.shape

(692500, 17)

In [18]:
df.dtypes

ID                                 int64
PERIODO                            int64
ESTU_PRGM_ACADEMICO               object
ESTU_PRGM_DEPARTAMENTO            object
ESTU_VALORMATRICULAUNIVERSIDAD    object
ESTU_HORASSEMANATRABAJA           object
FAMI_ESTRATOVIVIENDA              object
FAMI_TIENEINTERNET                object
FAMI_EDUCACIONPADRE               object
FAMI_TIENELAVADORA                object
FAMI_TIENEAUTOMOVIL               object
ESTU_PRIVADO_LIBERTAD             object
ESTU_PAGOMATRICULAPROPIO          object
FAMI_TIENECOMPUTADOR              object
FAMI_TIENEINTERNET.1              object
FAMI_EDUCACIONMADRE               object
RENDIMIENTO_GLOBAL                object
dtype: object

In [19]:
# Count the missing values in each column
df.isnull().sum()

ID                                    0
PERIODO                               0
ESTU_PRGM_ACADEMICO                   0
ESTU_PRGM_DEPARTAMENTO                0
ESTU_VALORMATRICULAUNIVERSIDAD     6287
ESTU_HORASSEMANATRABAJA           30857
FAMI_ESTRATOVIVIENDA              32137
FAMI_TIENEINTERNET                26629
FAMI_EDUCACIONPADRE               23178
FAMI_TIENELAVADORA                39773
FAMI_TIENEAUTOMOVIL               43623
ESTU_PRIVADO_LIBERTAD                 0
ESTU_PAGOMATRICULAPROPIO           6498
FAMI_TIENECOMPUTADOR              38103
FAMI_TIENEINTERNET.1              26629
FAMI_EDUCACIONMADRE               23664
RENDIMIENTO_GLOBAL                    0
dtype: int64

In [20]:
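# Share of rows whose pair of values in the two internet columns repeats an earlier row.
# A ratio this close to 1 means only a handful of distinct pairs exist.
df.duplicated(subset=['FAMI_TIENEINTERNET', 'FAMI_TIENEINTERNET.1']).sum() / df.shape[0]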

0.9999956678700361
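
A more direct way to confirm that one column duplicates another is to compare them element by element. The sketch below does this with `Series.equals` on a toy frame that reuses the two column names; the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame reusing the two column names; the values are illustrative only.
toy = pd.DataFrame({
    "FAMI_TIENEINTERNET": ["Si", "No", np.nan, "Si"],
    "FAMI_TIENEINTERNET.1": ["Si", "No", np.nan, "Si"],
})

# Series.equals is True only if both columns hold the same values in the
# same positions; NaNs in matching positions count as equal.
columns_match = toy["FAMI_TIENEINTERNET"].equals(toy["FAMI_TIENEINTERNET.1"])
print(columns_match)
```

If the check passes on the real data, dropping `FAMI_TIENEINTERNET.1` loses no information.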

### Preprocessing Utilities

In [21]:
import unicodedata
import re

def handle_tuition_price(text: str) -> str:
    """Map a tuition price range to a coarse class (Bajo, Medio, Alto or Gratis)."""
    special_cases = ['Menos de 500 mil', 'Más de 7 millones']
    matches = None
    pattern1 = r'(\d+\.?\d*)' # For special cases
    pattern2 = r'(\d+(?:\.\d+)?).* de (\d+(?:\.\d+)?)' # For other cases

    if text in special_cases:
        matches = re.search(pattern1, text)
    else:
        matches = re.search(pattern2, text)

    if matches:
        # Map the captured bounds to classes.
        # Bajo: under 2.5 million, Medio: 2.5 to under 5.5 million, Alto: 5.5 million and up
        groups = matches.groups()
        if '500' in groups or '1' in groups:
            text = 'Bajo'
            
        elif '4' in groups:
            text = 'Medio'
            
        elif '7' in groups:
            text = 'Alto'
            
    elif text == 'No pagó matrícula':
        text = 'Gratis'

    return text

def handle_work_hours(text: str) -> str:
    """Map weekly work hours to a workload class (Baja, Media or Alta)."""
    if text in ['0', 'Menos de 10 horas']:
        text = 'Baja'
    elif text == 'Entre 11 y 20 horas':
        text = 'Media'
    elif text in ['Más de 30 horas', 'Entre 21 y 30 horas']:
        text = 'Alta'
    
    return text

def norm_text(text: str) -> str:
    """Normalize text by stripping accents and other non-ASCII characters."""
    return unicodedata.normalize("NFKD", text).encode("ASCII", "ignore").decode("utf-8")

def handle_rare_values(X, col, threshold=0.1):
    """Replace values whose share of the column is below `threshold` percent with 'OTRO'."""
    
    percentages = X[col].value_counts(normalize=True) * 100
    group_fn = lambda x: 'OTRO' if percentages[x] < threshold else x

    return X[col].apply(group_fn)

def handle_parent_education(text: str) -> str:
    """Map a parent's education level to Superior, Media, Basica, Ninguna or No sabe."""
    # A completed level maps to its own class; an incomplete level maps to the class below it.
    # Superior: postgraduate, professional, technical
    # Media: secondary school (Bachillerato)
    # Basica: primary school
    # Ninguna: no education ('No sabe' is kept as its own class)
    pattern = r'.* (completa|incompleta)'
    matches = re.search(pattern, text)

    superior_education = ['Postgrado', 'profesional', 'Tecnica']

    if matches:
        if 'completa' == matches.group(1):
            # Check if at least one level is in text
            if any(level in text for level in superior_education):
                text = 'Superior'
            elif 'Secundaria' in text:
                text = 'Media'
            elif 'Primaria' in text:
                text = 'Basica'
        elif 'incompleta' == matches.group(1):
            if any(level in text for level in superior_education):
                text = 'Media'
            elif 'Secundaria' in text:
                text = 'Basica'
            elif 'Primaria' in text:
                text = 'Ninguna'

    if text == 'Postgrado':
        text = 'Superior'

    if text in ['Unknown', 'No Aplica']:
        text = 'No sabe'

    if text == 'Ninguno':
        text = 'Ninguna'

    return text

In [22]:
def _check_X(X: Union[pd.DataFrame, np.generic, np.ndarray]) -> pd.DataFrame:
    """Return X as a DataFrame, copying or converting the input passed to the encoders."""
    # TODO: raise if not an accepted data structure.
    if isinstance(X, pd.DataFrame):
        X = X.copy()
    
    elif isinstance(X, (np.ndarray, np.generic)):
        # TODO: Check the shape for 0 - 1 dims.
        X = pd.DataFrame(X)

    return X


class FrequencyEncoder(BaseEstimator, TransformerMixin):
    """Applies frequency encoding to a set of variables"""
    def __init__(self, normalize=False):
        self.freq_map = {}
        self.normalize = normalize

    def fit(self, X, y=None):
        X = _check_X(X) # Check input type
        self.freq_map = {} # Reset so refitting does not keep columns from a previous fit
        # Select categorical variable columns from X
        selected_columns = X.select_dtypes(include=["object", "category"]).columns.to_list()
        for col in selected_columns:

            if not self.normalize:
                self.freq_map[col] = X[col].value_counts().to_dict()
            else:
                self.freq_map[col] = X[col].value_counts(normalize=True).to_dict()
        
        return self

    def transform(self, X, y=None):
        X = _check_X(X) # Check input type
        # Replace original values by mapped values; categories unseen during fit become NaN.
        for col, freq_map_values in self.freq_map.items():
            X[col] = X[col].map(freq_map_values)
        
        return X

### Preprocessing Process

In [23]:
def preprocess_features(X: pd.DataFrame) -> pd.DataFrame:
    """Cleans and preprocesses the training features."""
    # Create a dataframe copy
    X = X.copy()

    # Clean data and correct dtypes
    cols_to_drop = ['FAMI_TIENEINTERNET.1', 'ID']
    X = X.drop(cols_to_drop, axis=1)
    X['PERIODO'] = X['PERIODO'].astype(str).apply(lambda text: text[:4])
        

    # Imputation
    for col in ['ESTU_VALORMATRICULAUNIVERSIDAD', 'FAMI_ESTRATOVIVIENDA', 'ESTU_HORASSEMANATRABAJA',
               'FAMI_EDUCACIONPADRE', 'FAMI_EDUCACIONMADRE']:
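        # Missing strata are labelled 'Sin Estrato'; every other column gets 'Unknown'.
        # The fill value must be chosen before filling, otherwise no NaN is left to replace.
        fill_value = 'Sin Estrato' if col == 'FAMI_ESTRATOVIVIENDA' else 'Unknown'
        X[col] = X[col].fillna(fill_value)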

    for col in ['FAMI_TIENEINTERNET', 'FAMI_TIENELAVADORA', 'FAMI_TIENEAUTOMOVIL',
               'ESTU_PRIVADO_LIBERTAD', 'ESTU_PAGOMATRICULAPROPIO', 'FAMI_TIENECOMPUTADOR']:
        col_mode = X[col].mode()[0]
        X[col] = X[col].fillna(col_mode)

    # Cleaning and normalizing text features
    for col in ['ESTU_PRGM_ACADEMICO', 'ESTU_PRGM_DEPARTAMENTO', 'FAMI_EDUCACIONPADRE', 'FAMI_EDUCACIONMADRE']:
        X[col] = X[col].apply(norm_text) # Normalize text
        if col == 'ESTU_PRGM_DEPARTAMENTO':
            X[col] = handle_rare_values(X, col, threshold=1.5)
        elif col == 'ESTU_PRGM_ACADEMICO':
            X[col] = handle_rare_values(X, col)

    # Handling features values and transforming them into better ones
    X['ESTU_VALORMATRICULAUNIVERSIDAD'] = X['ESTU_VALORMATRICULAUNIVERSIDAD'].apply(handle_tuition_price)
    X['ESTU_HORASSEMANATRABAJA'] = X['ESTU_HORASSEMANATRABAJA'].apply(handle_work_hours)
    X['FAMI_EDUCACIONMADRE'] = X['FAMI_EDUCACIONMADRE'].apply(handle_parent_education)
    X['FAMI_EDUCACIONPADRE'] = X['FAMI_EDUCACIONPADRE'].apply(handle_parent_education)

    # Binary encoding
    for col in ['FAMI_TIENEINTERNET', 'FAMI_TIENELAVADORA', 'FAMI_TIENEAUTOMOVIL',
               'ESTU_PRIVADO_LIBERTAD', 'ESTU_PAGOMATRICULAPROPIO', 'FAMI_TIENECOMPUTADOR']:
        if col == 'ESTU_PRIVADO_LIBERTAD':
            X[col] = X[col].map({'S': 1, 'N': 0})
        else:
            X[col] = X[col].map({'Si': 1, 'No': 0})
    # Ordinal encoding
    encoder = OrdinalEncoder()
    ord_cols = ['FAMI_ESTRATOVIVIENDA', 'FAMI_EDUCACIONMADRE', 'FAMI_EDUCACIONPADRE',
                'ESTU_VALORMATRICULAUNIVERSIDAD', 'ESTU_HORASSEMANATRABAJA']
    
    # Keep the original index so the concat below aligns rows correctly
    X_ord = pd.DataFrame(encoder.fit_transform(X[ord_cols]), columns=ord_cols, index=X.index)
    X = pd.concat([X.drop(columns=ord_cols), X_ord], axis=1)
    
    # Frequency encoding
    encoder = FrequencyEncoder(normalize=True)
    high_card_cols = ['ESTU_PRGM_ACADEMICO', 'ESTU_PRGM_DEPARTAMENTO']
    
    X_high_card = encoder.fit_transform(X[high_card_cols])
    X = pd.concat([X.drop(columns=high_card_cols), X_high_card], axis=1)
    
    # One-hot encoding
    X = pd.get_dummies(X, columns=['PERIODO'], dtype=int)
        
    return X

In [24]:
# Split data into features and target
X = df.drop('RENDIMIENTO_GLOBAL', axis=1)
y = df.RENDIMIENTO_GLOBAL

In [25]:
X = preprocess_features(X) # Preprocess the data

In [26]:
X.sample(10)

Unnamed: 0,FAMI_TIENEINTERNET,FAMI_TIENELAVADORA,FAMI_TIENEAUTOMOVIL,ESTU_PRIVADO_LIBERTAD,ESTU_PAGOMATRICULAPROPIO,FAMI_TIENECOMPUTADOR,FAMI_ESTRATOVIVIENDA,FAMI_EDUCACIONMADRE,FAMI_EDUCACIONPADRE,ESTU_VALORMATRICULAUNIVERSIDAD,ESTU_HORASSEMANATRABAJA,ESTU_PRGM_ACADEMICO,ESTU_PRGM_DEPARTAMENTO,PERIODO_2018,PERIODO_2019,PERIODO_2020,PERIODO_2021
435696,1,1,0,0,0,1,1.0,4.0,4.0,0.0,1.0,0.016562,0.059235,1,0,0,0
142582,1,1,0,0,1,1,0.0,1.0,1.0,1.0,0.0,0.074868,0.064387,0,1,0,0
561539,1,1,0,0,0,1,1.0,0.0,1.0,3.0,2.0,0.007613,0.017214,0,0,0,1
203896,0,1,0,0,0,1,0.0,2.0,0.0,3.0,1.0,0.076887,0.029789,0,0,0,1
16664,1,1,1,0,0,1,2.0,1.0,4.0,0.0,1.0,0.076887,0.029789,0,0,1,0
37324,1,1,1,0,0,1,3.0,4.0,4.0,0.0,1.0,0.076887,0.041629,0,0,1,0
264556,0,1,0,0,1,0,1.0,1.0,4.0,1.0,1.0,0.137743,0.0176,1,0,0,0
206059,1,0,1,0,0,1,0.0,2.0,2.0,1.0,0.0,0.005659,0.019428,0,1,0,0
352213,1,1,1,0,1,1,3.0,0.0,2.0,3.0,0.0,0.002465,0.40745,0,0,0,1
222930,0,1,1,0,0,1,0.0,2.0,2.0,1.0,1.0,0.0105,0.041629,1,0,0,0


In [27]:
y

0         medio-alto
1               bajo
2               bajo
3               alto
4         medio-bajo
             ...    
692495    medio-alto
692496          bajo
692497    medio-bajo
692498          bajo
692499          alto
Name: RENDIMIENTO_GLOBAL, Length: 692500, dtype: object

## Things to consider for the future

- We can improve the ordinal encoding by passing an explicit category order to the encoder; right now `OrdinalEncoder` assigns codes in the sorted (alphabetical) order of the labels, which does not reflect their real order.
- Create new features that could be more relevant to the model (feature engineering).
- Scale the features.
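
As a starting point for the first item, the sketch below passes an explicit order to scikit-learn's `OrdinalEncoder` through its `categories` parameter. The order in `work_order`, and in particular where `Unknown` sits, is an assumption for illustration and would need to be decided per column.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy column using the workload classes produced by handle_work_hours.
toy = pd.DataFrame({"ESTU_HORASSEMANATRABAJA": ["Media", "Baja", "Alta", "Unknown"]})

# Assumed order, lowest to highest; the position of "Unknown" is a modelling choice.
work_order = ["Unknown", "Baja", "Media", "Alta"]

# One list of categories per encoded column, in the desired code order.
encoder = OrdinalEncoder(categories=[work_order])
codes = encoder.fit_transform(toy[["ESTU_HORASSEMANATRABAJA"]]).ravel()

print(codes)
```

With an explicit order the codes grow with the workload, so a model that treats the column as numeric sees a meaningful ranking instead of an alphabetical one.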