# Preprocessing – ENADE 2022 (UFJF)

This notebook performs data preprocessing for predictive modeling. It defines
the target variable, selects relevant features and prepares the dataset for
machine learning models.

In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv("../data/raw/cursos_ufjf_enade2022.csv")
df.head()

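| Unnamed: 0 | NU_ANO | CO_CURSO | NU_ITEM_OFG | NU_ITEM_OFG_Z | NU_ITEM_OFG_X | NU_ITEM_OFG_N | NU_ITEM_OCE | NU_ITEM_OCE_Z | NU_ITEM_OCE_X | NU_ITEM_OCE_N | ... | NT_CE_D3 | CO_RS_I1 | CO_RS_I2 | CO_RS_I3 | CO_RS_I4 | CO_RS_I5 | CO_RS_I6 | CO_RS_I7 | CO_RS_I8 | CO_RS_I9 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2022 | 1105396 | 8 | 0 | 0 | 0 | 27 | 0 | 0 | 0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 2022 | 1105396 | 8 | 0 | 0 | 0 | 27 | 0 | 0 | 0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | 2022 | 1105396 | 8 | 0 | 0 | 0 | 27 | 0 | 0 | 0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 2022 | 1105396 | 8 | 0 | 0 | 0 | 27 | 0 | 0 | 0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 2022 | 1105396 | 8 | 0 | 0 | 0 | 27 | 0 | 0 | 0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |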
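
The `Unnamed: 0` column in the output above is the name pandas gives to a header cell that is empty, which usually means the CSV was written with its row index included. The sketch below shows two ways to handle it, assuming that is the cause here; the miniature CSV text is a hypothetical stand-in for the real file.

```python
import io
import pandas as pd

# Hypothetical miniature CSV whose first header cell is empty, as produced by
# DataFrame.to_csv() with the default index=True.
csv_text = ",NU_ANO,CO_CURSO\n0,2022,1105396\n1,2022,1105396\n"

# Reading it naively yields an extra "Unnamed: 0" column.
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.columns.tolist())

# Option A: treat the first column as the row index while reading.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Option B: drop the column after reading.
cleaned = raw.drop(columns=["Unnamed: 0"])
print(indexed.columns.tolist(), cleaned.columns.tolist())
```

Either option leaves the row count unchanged; it only keeps a saved index from being mistaken for a feature later on.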


## Target Variable Definition

The general ENADE score (NT_GER) is used as a proxy for student performance.
To enable a classification task, scores are discretized into three categories
(low, medium and high), using the 33rd and 66th percentiles of the observed
scores as cut points. Rows without a score keep a missing label.

In [2]:
df["NT_GER"] = pd.to_numeric(df["NT_GER"], errors="coerce")
df["NT_GER"].describe()

count    813.000000
mean      56.450062
std       16.637341
min       12.200000
25%       43.700000
50%       58.000000
75%       69.300000
max       94.500000
Name: NT_GER, dtype: float64

In [3]:
# Cut points: 33rd and 66th percentiles of the non-missing scores
q1 = df["NT_GER"].quantile(0.33)
q2 = df["NT_GER"].quantile(0.66)

def score_to_class(x):
    if pd.isna(x):
        return np.nan
    if x <= q1:
        return "low"
    elif x <= q2:
        return "medium"
    else:
        return "high"

df["y_perf"] = df["NT_GER"].apply(score_to_class)
df["y_perf"].value_counts(dropna=False)

y_perf
NaN       336
high      275
medium    270
low       268
Name: count, dtype: int64
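
As an alternative to the row-wise `apply`, the same binning can be expressed with `pd.cut`. This is a minimal sketch on hypothetical scores, not the notebook's data. It uses the same 0.33 and 0.66 quantile cut points and right-closed intervals, so a score equal to a cut point falls in the lower class, matching the `<=` comparisons above.

```python
import numpy as np
import pandas as pd

# Hypothetical scores for illustration only (not the ENADE data)
scores = pd.Series([12.2, 43.7, 58.0, 69.3, 94.5, np.nan], name="NT_GER")

# Same cut points as in the notebook; quantile() ignores NaN by default
q1 = scores.quantile(0.33)
q2 = scores.quantile(0.66)

# Intervals are (-inf, q1], (q1, q2], (q2, inf); NaN scores stay NaN
y_perf = pd.cut(
    scores,
    bins=[-np.inf, q1, q2, np.inf],
    labels=["low", "medium", "high"],
)
print(y_perf.value_counts(dropna=False))
```

The result is an ordered categorical, which preserves the low < medium < high ordering. Note that `pd.cut` requires distinct bin edges, so this form would raise an error if the two quantiles coincided.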

## Feature Selection

Candidate institutional and course-level variables are chosen for
interpretability, then filtered to those actually present in the dataset.
In this file only CO_CURSO is present; the other six candidates are absent, so
the feature set at this stage reduces to the course code.

In [4]:
candidate_features = [
    "CO_CURSO",
    "CO_IES",
    "CO_CATEGAD",
    "CO_ORGACAD",
    "CO_MODALIDADE",
    "CO_MUNIC_CURSO",
    "CO_UF_CURSO",
]

available = [c for c in candidate_features if c in df.columns]
available

['CO_CURSO']
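
Since the filter silently discards candidates that are not in the file, it can help to list the missing ones explicitly. A minimal sketch, assuming a small stand-in frame (`df_demo` is hypothetical) in place of the real `df`:

```python
import pandas as pd

# Hypothetical stand-in with the same relevant columns as the loaded data
df_demo = pd.DataFrame({"CO_CURSO": [1105396], "NT_GER": [56.4]})

candidate_features = [
    "CO_CURSO",
    "CO_IES",
    "CO_CATEGAD",
    "CO_ORGACAD",
    "CO_MODALIDADE",
    "CO_MUNIC_CURSO",
    "CO_UF_CURSO",
]

# Split the candidates into those present in the frame and those absent
available = [c for c in candidate_features if c in df_demo.columns]
missing = [c for c in candidate_features if c not in df_demo.columns]

print(f"available ({len(available)}): {available}")
print(f"missing   ({len(missing)}): {missing}")
```

Printing the missing names makes it obvious when a feature list needs to be sourced from another file or merged in before modeling.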

## Final Dataset Preparation

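The modeling table keeps the available features, the raw score (NT_GER) and the
class label (y_perf). Rows with a missing target are then removed: 336 of the
1,149 rows, leaving 813. Because y_perf is derived from NT_GER, the raw score
is kept for reference only and should not be used as a predictor. The resulting
dataset will be used as input for baseline machine learning models.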

In [5]:
df_model = df[available + ["NT_GER", "y_perf"]].copy()
df_model.shape

(1149, 3)

In [6]:
df_model = df_model.dropna(subset=["y_perf"])
df_model.isna().sum().sort_values(ascending=False)

CO_CURSO    0
NT_GER      0
y_perf      0
dtype: int64

In [7]:
df_model.to_csv("../data/processed/enade_ufjf_2022_model.csv", index=False)
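
When the saved file is later loaded for modeling, the raw score has to be separated from the predictors, because the label was computed from it. The sketch below illustrates that split on a hypothetical three-row frame (`df_model_demo`, with made-up values); treating CO_CURSO as a categorical code rather than a number is an assumption about how a downstream model would use it.

```python
import pandas as pd

# Hypothetical rows with the same three columns as the saved file
df_model_demo = pd.DataFrame(
    {
        "CO_CURSO": [1105396, 1105396, 1105397],
        "NT_GER": [40.0, 60.0, 80.0],
        "y_perf": ["low", "medium", "high"],
    }
)

# Predictors exclude both the label and the score it was derived from
X = df_model_demo.drop(columns=["NT_GER", "y_perf"])
y = df_model_demo["y_perf"]

# Course codes are identifiers, so mark them as categorical, not numeric
X["CO_CURSO"] = X["CO_CURSO"].astype("category")

print(X.dtypes)
print(y.value_counts())
```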