# Baseline Models – ENADE 2022 (UFJF)

This notebook trains baseline machine learning models to predict ENADE
performance categories using the processed dataset.

In [17]:
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import classification_report

In [18]:
df = pd.read_csv("../data/processed/enade_ufjf_2022_model.csv")
df.head()

   CO_CURSO  NT_GER y_perf
0   1105396    18.9    low
1   1105396    20.8    low
2   1105396    25.8    low
3   1105396    26.2    low
4   1105396    29.5    low


In [19]:
# Drop the target and the raw score NT_GER; CO_CURSO is the only remaining feature.
X = df.drop(columns=["y_perf", "NT_GER"])
y = df["y_perf"]

X.shape, y.value_counts()

((813, 1),
 y_perf
 high      275
 medium    270
 low       268
 Name: count, dtype: int64)
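The near-equal class counts above are consistent with a target built by splitting a score into terciles, although the notebook does not say how `y_perf` was derived. The sketch below assumes that `y_perf` is a tercile binning of `NT_GER` and uses synthetic scores, so it illustrates the idea rather than reproducing the actual preprocessing. Under that assumption, keeping `NT_GER` as a feature would leak the target, which would explain why it is dropped.

```python
import numpy as np
import pandas as pd

# Synthetic scores standing in for NT_GER (assumption: not the real data).
rng = np.random.default_rng(0)
scores = pd.Series(rng.uniform(0, 100, size=300), name="NT_GER")

# Assumed labelling rule: split the score distribution into three equal-sized bins.
y_demo = pd.qcut(scores, q=3, labels=["low", "medium", "high"])
counts = y_demo.value_counts()
print(counts)

# A label built this way is a deterministic function of the score:
# the lowest "medium" score is always above the highest "low" score.
max_low = scores[y_demo == "low"].max()
min_medium = scores[y_demo == "medium"].min()
print(max_low < min_medium)
```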

In [20]:
# Treat every remaining column (here only CO_CURSO) as categorical.
categorical_features = X.columns.tolist()

In [21]:
preprocessor = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features)
    ]
)
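The `handle_unknown="ignore"` setting matters when a course code appears at prediction time but not in the training split. A minimal sketch of that behaviour, using made-up course codes chosen only for illustration:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical course codes; 9999999 never appears in the training frame.
train_demo = pd.DataFrame({"CO_CURSO": [1105396, 1105396, 2000001]})
new_demo = pd.DataFrame({"CO_CURSO": [2000001, 9999999]})

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_demo)

# The encoder returns a sparse matrix by default; convert it for inspection.
encoded = encoder.transform(new_demo).toarray()
print(encoder.categories_)
print(encoded)
```

The known code maps to its one-hot column, while the unseen code is encoded as a row of zeros instead of raising an error. A model then scores that row using only its intercept or default split path, so predictions for unseen courses carry no course-specific information.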

In [22]:
log_reg = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("model", LogisticRegression(max_iter=1000))
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

log_reg.fit(X_train, y_train)
y_pred_lr = log_reg.predict(X_test)

print("Logistic Regression")
print(classification_report(y_test, y_pred_lr))

Logistic Regression
              precision    recall  f1-score   support

        high       0.58      0.62      0.60        55
         low       0.74      0.72      0.73        54
      medium       0.43      0.41      0.42        54

    accuracy                           0.58       163
   macro avg       0.58      0.58      0.58       163
weighted avg       0.58      0.58      0.58       163



In [23]:
tree = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("model", DecisionTreeClassifier(random_state=42))
])

tree.fit(X_train, y_train)
y_pred_tree = tree.predict(X_test)

print("Decision Tree")
print(classification_report(y_test, y_pred_tree))

Decision Tree
              precision    recall  f1-score   support

        high       0.58      0.62      0.60        55
         low       0.74      0.72      0.73        54
      medium       0.43      0.41      0.42        54

    accuracy                           0.58       163
   macro avg       0.58      0.58      0.58       163
weighted avg       0.58      0.58      0.58       163



In [24]:
rf = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("model", RandomForestClassifier(
        n_estimators=100,
        random_state=42
    ))
])

rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

print("Random Forest")
print(classification_report(y_test, y_pred_rf))

Random Forest
              precision    recall  f1-score   support

        high       0.58      0.62      0.60        55
         low       0.74      0.72      0.73        54
      medium       0.43      0.41      0.42        54

    accuracy                           0.58       163
   macro avg       0.58      0.58      0.58       163
weighted avg       0.58      0.58      0.58       163



## Model Comparison

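All three baseline models produce identical classification reports on the test
set: accuracy and macro F1 are both 0.58, with the low class predicted best
(F1 0.73) and the medium class worst (F1 0.42). This is likely because the only
feature is the one-hot encoded course code (CO_CURSO), so each model can do
little more than predict the most common class for each course, and the
non-linear capacity of the Decision Tree and Random Forest has nothing extra to
exploit. Improving on these baselines will probably require additional features
rather than tuning alone.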
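One way to put these numbers in context is to compare the same pipelines against a majority-class dummy model under cross-validation instead of a single split. The sketch below is illustrative only: it uses synthetic data with a single categorical feature (the names `X_demo`, `y_demo`, and `build_pipeline` are hypothetical), so its scores say nothing about the ENADE dataset itself.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 12 made-up courses, each with a "preferred" class
# that its students receive about 60% of the time (the rest is random noise).
rng = np.random.default_rng(42)
n = 600
courses = rng.integers(0, 12, size=n)
labels = np.array(["low", "medium", "high"])
preferred = courses % 3
noise = rng.integers(0, 3, size=n)
y_idx = np.where(rng.random(n) < 0.6, preferred, noise)

X_demo = pd.DataFrame({"CO_CURSO": courses})
y_demo = pd.Series(labels[y_idx], name="y_perf")


def build_pipeline(model):
    """One-hot encode the course code, then fit the given classifier."""
    preprocess = ColumnTransformer(
        transformers=[("cat", OneHotEncoder(handle_unknown="ignore"), ["CO_CURSO"])]
    )
    return Pipeline(steps=[("preprocess", preprocess), ("model", model)])


models = {
    "Dummy (most frequent)": DummyClassifier(strategy="most_frequent"),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Stratified folds keep the class balance similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = {
    name: cross_val_score(
        build_pipeline(model), X_demo, y_demo, cv=cv, scoring="f1_macro"
    ).mean()
    for name, model in models.items()
}

for name, score in scores.items():
    print(f"{name}: macro F1 = {score:.2f}")
```

If the three real models also score close to one another under cross-validation while all beating the dummy model, that would support the reading that the course code is informative but that the choice of model adds little on top of it.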