# Students Performance in Exams

This dataset contains information on the exam performance of 1,000 students, including variables such as gender, race/ethnicity, parental level of education, lunch type, test preparation course, and scores in math, reading, and writing. The target variable can be the score in one of these subjects or an average of the three.

https://www.kaggle.com/datasets/spscientist/students-performance-in-exams

# Data Exploration and Analysis

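In this example we pull the dataset directly from Kaggle using its API in Google Colab. The credentials cell assumes a kaggle.json API token has already been uploaded to the working directory.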

In [None]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV

In [None]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

Downloading the dataset

In [None]:
!kaggle datasets download -d spscientist/students-performance-in-exams

Dataset URL: https://www.kaggle.com/datasets/spscientist/students-performance-in-exams
License(s): unknown
students-performance-in-exams.zip: Skipping, found more recently modified local copy (use --force to force download)


Extracting the .zip file

In [None]:
!unzip students-performance-in-exams.zip

Archive:  students-performance-in-exams.zip
replace StudentsPerformance.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: StudentsPerformance.csv  


Loading the dataset into a pandas DataFrame

In [None]:
dataset = pd.read_csv('StudentsPerformance.csv')

dataset.info()  # info() prints the summary itself and returns None

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   gender                       1000 non-null   object
 1   race/ethnicity               1000 non-null   object
 2   parental level of education  1000 non-null   object
 3   lunch                        1000 non-null   object
 4   test preparation course      1000 non-null   object
 5   math score                   1000 non-null   int64 
 6   reading score                1000 non-null   int64 
 7   writing score                1000 non-null   int64 
dtypes: int64(3), object(5)
memory usage: 62.6+ KB


Exploring the data

In [None]:
# Summary statistics for the numeric variables
dataset.describe()

statistic,math score,reading score,writing score
count,1000.0,1000.0,1000.0
mean,66.089,69.169,68.054
std,15.16308,14.600192,15.195657
min,0.0,17.0,10.0
25%,57.0,59.0,57.75
50%,66.0,70.0,69.0
75%,77.0,79.0,79.0
max,100.0,100.0,100.0


In [None]:
# Summary statistics for the categorical variables
dataset.describe(include=['object'])

statistic,gender,race/ethnicity,parental level of education,lunch,test preparation course
count,1000,1000,1000,1000,1000
unique,2,5,6,2,2
top,female,group C,some college,standard,none
freq,518,319,226,645,642


In [None]:
dataset[['race/ethnicity']].value_counts()

race/ethnicity,count
group C,319
group D,262
group B,190
group E,140
group A,89


In [None]:
dataset[['parental level of education']].value_counts()

parental level of education,count
some college,226
associate's degree,222
high school,196
some high school,179
bachelor's degree,118
master's degree,59


In [None]:
dataset[['test preparation course']].value_counts()

test preparation course,count
none,642
completed,358


In [None]:
dataset[['lunch']].value_counts()

lunch,count
standard,645
free/reduced,355


# Preprocessing

In [None]:
print("Null values per column:")
dataset.isnull().sum()

Null values per column:


column,null count
gender,0
race/ethnicity,0
parental level of education,0
lunch,0
test preparation course,0
math score,0
reading score,0
writing score,0


Encoding the categorical variables

In [None]:
encoder = OneHotEncoder(drop='first', sparse_output=False)

categorical_columns = ['gender', 'race/ethnicity', 'parental level of education', 'lunch', 'test preparation course']

encoded_data = encoder.fit_transform(dataset[categorical_columns])

In [None]:
encoded_df = pd.DataFrame(encoded_data, columns=encoder.get_feature_names_out(categorical_columns))

encoded_df.head()

row,gender_male,race/ethnicity_group B,race/ethnicity_group C,race/ethnicity_group D,race/ethnicity_group E,parental level of education_bachelor's degree,parental level of education_high school,parental level of education_master's degree,parental level of education_some college,parental level of education_some high school,lunch_standard,test preparation course_none
0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0
1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0
3,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0


In [None]:
dataset_encoded = pd.concat([dataset.drop(columns=categorical_columns), encoded_df], axis=1)

dataset_encoded.head()

row,math score,reading score,writing score,gender_male,race/ethnicity_group B,race/ethnicity_group C,race/ethnicity_group D,race/ethnicity_group E,parental level of education_bachelor's degree,parental level of education_high school,parental level of education_master's degree,parental level of education_some college,parental level of education_some high school,lunch_standard,test preparation course_none
0,72,72,74,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0
1,69,90,88,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
2,90,95,93,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0
3,47,57,44,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4,76,78,75,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0


Normalizing the numeric variables

In [None]:
numeric_columns = ['reading score', 'writing score']
scaler = MinMaxScaler()
dataset_encoded[numeric_columns] = scaler.fit_transform(dataset_encoded[numeric_columns])

dataset_encoded[numeric_columns]

row,reading score,writing score
0,0.662651,0.711111
1,0.879518,0.866667
2,0.939759,0.922222
3,0.481928,0.377778
4,0.734940,0.722222
...,...,...
995,0.987952,0.944444
996,0.457831,0.500000
997,0.650602,0.611111
998,0.734940,0.744444
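
As a quick check on what MinMaxScaler computes, the sketch below applies the formula (x - min) / (max - min) by hand and compares it with the scaler. The three raw values are illustrative: 17 and 100 are the reading score minimum and maximum reported by describe() earlier, and 72 is an arbitrary score in between. The names raw, manual and scaled are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative raw scores: the observed min (17), a value in between (72),
# and the observed max (100) of the reading score column.
raw = np.array([[17.0], [72.0], [100.0]])

# Min-max scaling by hand: (x - min) / (max - min).
manual = (raw - raw.min()) / (raw.max() - raw.min())

# The same transformation through scikit-learn.
scaled = MinMaxScaler().fit_transform(raw)

print(manual.ravel())
print(np.allclose(manual, scaled))
```

A raw score of 72 maps to 55/83, roughly 0.662651, which is the first reading score value in the table above.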


Creating the target variable: math_score_class

To do this, we discretize math score into three classes:

* Compute the mean of math score.
* Define the classes, using a margin of 5 points around the mean:
  * "acima da média" (above average) for scores at least 5 points above the mean.
  * "média" (average) for scores within 5 points of the mean.
  * "abaixo da média" (below average) for scores at least 5 points below the mean.

The new column math_score_class stores these classes.

In [None]:
def discretizar_variavel_alvo(score, mean):
    if score >= mean + 5:
        return 'acima da média'
    elif score <= mean - 5:
        return 'abaixo da média'
    else:
        return 'média'
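
The function above labels one score at a time. As an alternative, the same three-way rule can be sketched in vectorized form with np.select. The helper name discretize_scores and its margin parameter are hypothetical, and the sample scores are illustrative; 66.089 is the math score mean reported by describe() earlier.

```python
import numpy as np
import pandas as pd

def discretize_scores(scores: pd.Series, mean: float, margin: float = 5.0) -> pd.Series:
    """Vectorized version of the same three-way rule (hypothetical helper)."""
    labels = np.select(
        [scores >= mean + margin, scores <= mean - margin],
        ["acima da média", "abaixo da média"],
        default="média",
    )
    return pd.Series(labels, index=scores.index, name="math_score_class")

# Illustrative scores, labelled against the dataset's math score mean.
sample = pd.Series([72, 69, 90, 47, 64])
print(discretize_scores(sample, mean=66.089).tolist())
```

On 1,000 rows the difference in speed is negligible, so the choice is mostly about readability.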

In [None]:
math_mean = dataset_encoded['math score'].mean()
print("The mean is:", math_mean)

dataset_encoded['math_score_class'] = dataset_encoded['math score'].apply(lambda x: discretizar_variavel_alvo(x, math_mean))

dataset_encoded.head(10)

The mean is: 66.089


row,math score,reading score,writing score,math_score_class,gender_male,race/ethnicity_group B,race/ethnicity_group C,race/ethnicity_group D,race/ethnicity_group E,parental level of education_bachelor's degree,parental level of education_high school,parental level of education_master's degree,parental level of education_some college,parental level of education_some high school,lunch_standard,test preparation course_none
0,72,0.662651,0.711111,acima da média,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0
1,69,0.879518,0.866667,média,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
2,90,0.939759,0.922222,acima da média,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0
3,47,0.481928,0.377778,abaixo da média,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4,76,0.73494,0.722222,acima da média,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0
5,71,0.795181,0.755556,média,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0
6,88,0.939759,0.911111,acima da média,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
7,40,0.313253,0.322222,abaixo da média,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
8,64,0.566265,0.633333,média,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
9,38,0.518072,0.444444,abaixo da média,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0


In [None]:
dataset_encoded[['math_score_class']].value_counts()

math_score_class,count
abaixo da média,366
acima da média,365
média,269


# Splitting the Dataset

In [None]:
X = dataset_encoded.drop(columns=['math score', 'math_score_class'])
y = dataset_encoded['math_score_class']


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print("Training set shape:", X_train.shape)
print("Test set shape:", X_test.shape)

Training set shape: (700, 14)
Test set shape: (300, 14)


# Model Training and Evaluation

### Model 1: Decision Tree

In [None]:
model = DecisionTreeClassifier(random_state=42)

model.fit(X_train, y_train)

Model evaluation

In [None]:
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

print(classification_report(y_test, y_pred))

Accuracy: 0.7133333333333334
                 precision    recall  f1-score   support

abaixo da média       0.79      0.77      0.78       117
 acima da média       0.78      0.81      0.79       103
          média       0.51      0.51      0.51        80

       accuracy                           0.71       300
      macro avg       0.69      0.70      0.70       300
   weighted avg       0.71      0.71      0.71       300



#### Hyperparameter Optimization

In [None]:
param_grid = {
    'max_depth': [3, 5, 10, None],          # Maximum depth of the tree
    'min_samples_split': [2, 5, 10],        # Minimum number of samples required to split a node
    'min_samples_leaf': [1, 2, 4],          # Minimum number of samples required in a leaf
    'criterion': ['gini', 'entropy']        # Criterion used to measure split quality
}
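
To get a sense of the cost of this search, the sketch below counts the candidates in the grid with ParameterGrid. The name dt_grid is a hypothetical copy of the grid above, and the fold count assumes 5-fold cross-validation, as used by the grid search for this model.

```python
from sklearn.model_selection import ParameterGrid

# Copy of the decision tree grid, kept separate so the count is self-contained.
dt_grid = {
    "max_depth": [3, 5, 10, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "criterion": ["gini", "entropy"],
}

n_candidates = len(ParameterGrid(dt_grid))  # 4 * 3 * 3 * 2 combinations
n_folds = 5  # assumption: 5-fold cross-validation
print(n_candidates, "candidates,", n_candidates * n_folds, "cross-validation fits")
```

That is 72 candidates and 360 fits, plus one final refit of the best candidate on the whole training set.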

In [None]:
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1)

grid_search.fit(X_train, y_train)

best_dt = grid_search.best_estimator_

y_pred = best_dt.predict(X_test)

In [None]:
grid_dt = classification_report(y_test, y_pred, output_dict=True)

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy with Grid Search on DT:", accuracy)
print("Best parameters found:", grid_search.best_params_)
print("Classification report:")
print(classification_report(y_test, y_pred))

Accuracy with Grid Search on DT: 0.76
Best parameters found: {'criterion': 'entropy', 'max_depth': None, 'min_samples_leaf': 2, 'min_samples_split': 10}
Classification report:
                 precision    recall  f1-score   support

abaixo da média       0.86      0.82      0.84       117
 acima da média       0.79      0.85      0.82       103
          média       0.56      0.55      0.56        80

       accuracy                           0.76       300
      macro avg       0.74      0.74      0.74       300
   weighted avg       0.76      0.76      0.76       300



### Model 2: SVM

In [None]:
model = SVC()

model.fit(X_train, y_train)

In [None]:
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

print(classification_report(y_test, y_pred))

Accuracy: 0.68
                 precision    recall  f1-score   support

abaixo da média       0.77      0.79      0.78       117
 acima da média       0.71      0.83      0.77       103
          média       0.43      0.31      0.36        80

       accuracy                           0.68       300
      macro avg       0.64      0.65      0.64       300
   weighted avg       0.66      0.68      0.67       300



#### Hyperparameter Optimization

In [None]:
param_grid = {
    'C': [0.1, 1, 10, 100],
    'kernel': ['linear', 'rbf', 'poly'],
    'gamma': ['scale', 'auto']
}
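
In SVC, gamma is the kernel coefficient for the rbf and poly kernels and has no effect on the linear kernel, so the grid above evaluates every linear candidate twice. GridSearchCV also accepts a list of dictionaries, which allows a separate sub-grid per kernel. The sketch below only counts candidates; the names full_grid and split_grid are hypothetical.

```python
from sklearn.model_selection import ParameterGrid

# The grid as written above: every kernel is crossed with every gamma.
full_grid = {
    "C": [0.1, 1, 10, 100],
    "kernel": ["linear", "rbf", "poly"],
    "gamma": ["scale", "auto"],
}

# Alternative: gamma is only searched for the kernels that use it.
split_grid = [
    {"C": [0.1, 1, 10, 100], "kernel": ["linear"]},
    {"C": [0.1, 1, 10, 100], "kernel": ["rbf", "poly"], "gamma": ["scale", "auto"]},
]

print(len(ParameterGrid(full_grid)), len(ParameterGrid(split_grid)))
```

The split grid covers the same distinct models with 20 candidates instead of 24.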


In [None]:
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1)

grid_search.fit(X_train, y_train)

best_svm = grid_search.best_estimator_

y_pred = best_svm.predict(X_test)

In [None]:
# Evaluate the model's performance
grid_svc = classification_report(y_test, y_pred, output_dict=True)

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy with Grid Search on SVM:", accuracy)
print("Best parameters found:", grid_search.best_params_)
print("Classification report:")
print(classification_report(y_test, y_pred))

Accuracy with Grid Search on SVM: 0.7866666666666666
Best parameters found: {'C': 10, 'gamma': 'scale', 'kernel': 'linear'}
Classification report:
                 precision    recall  f1-score   support

abaixo da média       0.86      0.85      0.86       117
 acima da média       0.84      0.84      0.84       103
          média       0.60      0.61      0.61        80

       accuracy                           0.79       300
      macro avg       0.77      0.77      0.77       300
   weighted avg       0.79      0.79      0.79       300



### Model 3: Random Forest

In [None]:
model = RandomForestClassifier(random_state=42)

model.fit(X_train, y_train)

In [None]:
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

print(classification_report(y_test, y_pred))

Accuracy: 0.7266666666666667
                 precision    recall  f1-score   support

abaixo da média       0.81      0.83      0.82       117
 acima da média       0.77      0.82      0.79       103
          média       0.52      0.46      0.49        80

       accuracy                           0.73       300
      macro avg       0.70      0.70      0.70       300
   weighted avg       0.72      0.73      0.72       300



#### Hyperparameter Optimization

In [None]:
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False]
}

In [None]:
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1)

grid_search.fit(X_train, y_train)

best_rf = grid_search.best_estimator_

y_pred = best_rf.predict(X_test)

In [None]:
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy with Grid Search on Random Forest:", accuracy)
print("Best parameters found:", grid_search.best_params_)
print("Classification report:")

grid_rf = classification_report(y_test, y_pred, output_dict=True)

print(classification_report(y_test, y_pred))

Accuracy with Grid Search on Random Forest: 0.7066666666666667
Best parameters found: {'bootstrap': True, 'max_depth': None, 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 50}
Classification report:
                 precision    recall  f1-score   support

abaixo da média       0.81      0.82      0.82       117
 acima da média       0.76      0.82      0.79       103
          média       0.45      0.40      0.42        80

       accuracy                           0.71       300
      macro avg       0.67      0.68      0.68       300
   weighted avg       0.70      0.71      0.70       300
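
The tuned Random Forest scores slightly lower on the test set (0.7067) than the default one (0.7267), a gap of six rows out of 300. That is possible because GridSearchCV picks the candidate with the best mean cross-validation score on the training data, and that choice is not guaranteed to be the best one on a separate test set. The sketch below shows how to read both numbers side by side. It uses synthetic data from make_classification and a deliberately small grid, so the names X_syn, y_syn, cv_acc and test_acc are hypothetical and the values it prints say nothing about the notebook's dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic three-class data standing in for the notebook's features.
X_syn, y_syn = make_classification(
    n_samples=400, n_features=10, n_informative=5, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.3, random_state=42)

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    {"n_estimators": [20, 50], "min_samples_leaf": [1, 2]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

cv_acc = search.best_score_          # mean accuracy over the 5 validation folds
test_acc = search.score(X_te, y_te)  # accuracy of the refit best model on held-out data
print(f"CV accuracy: {cv_acc:.3f}  test accuracy: {test_acc:.3f}")
```

Reporting grid_search.best_score_ next to the test accuracy for each model would make it easier to judge whether such small differences are meaningful.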



## Comparing the Results

In [None]:
results = pd.DataFrame({
    'SVM': {
        'precision': grid_svc['weighted avg']['precision'],
        'recall': grid_svc['weighted avg']['recall'],
        'f1-score': grid_svc['weighted avg']['f1-score'],
        'accuracy': grid_svc['accuracy']
    },
    'Decision Tree': {
        'precision': grid_dt['weighted avg']['precision'],
        'recall': grid_dt['weighted avg']['recall'],
        'f1-score': grid_dt['weighted avg']['f1-score'],
        'accuracy': grid_dt['accuracy']
    },
    'Random Forest': {
        'precision': grid_rf['weighted avg']['precision'],
        'recall': grid_rf['weighted avg']['recall'],
        'f1-score': grid_rf['weighted avg']['f1-score'],
        'accuracy': grid_rf['accuracy']
    }
}).T

In [None]:
results

model,precision,recall,f1-score,accuracy
SVM,0.787524,0.786667,0.787083,0.786667
Decision Tree,0.759917,0.76,0.759312,0.76
Random Forest,0.697296,0.706667,0.701195,0.706667
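
The recall and accuracy columns are identical in every row. This is expected: the table uses the weighted avg entries of the classification report, and recall weighted by class support is mathematically equal to overall accuracy. A minimal check on toy labels (hypothetical data, three classes with unequal support):

```python
from sklearn.metrics import accuracy_score, recall_score

# Toy labels: class "a" has 3 samples, "b" has 2, "c" has 1.
y_true = ["a", "a", "a", "b", "b", "c"]
y_hat = ["a", "a", "b", "b", "c", "c"]

acc = accuracy_score(y_true, y_hat)
weighted_recall = recall_score(y_true, y_hat, average="weighted")

# Both are 4 correct predictions out of 6.
print(acc, weighted_recall)
```

For a comparison that is more sensitive to the weaker "média" class, the macro avg entries of the same report could be used instead.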
