<a href="https://colab.research.google.com/github/nataliamarcoliino/prompts-recipe-to-create-a-ebook/blob/main/Pre_processamento.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Comparative Analysis of Models



- Dataset:
- Data Scientists:
    - Maria Natália
    - Fernanda Ortega
    - João
    - Juliana Pontes
    - Paulo
    - Agda Souza
    


DATA PREPROCESSING

- Handling missing data
- Encoding qualitative (categorical) variables
- Scaling quantitative (numeric) variables
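
The three steps listed above can be sketched on a small frame before they are applied to the real dataset. This is a minimal illustration, not the notebook's pipeline: the `toy` frame and its column names (`hours`, `diet`) are hypothetical, and each step is applied by hand instead of through a `ColumnTransformer`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical toy data (not taken from the notebook's dataset).
toy = pd.DataFrame({
    "hours": [1.0, np.nan, 3.0, 5.0],
    "diet": pd.Series(["Good", "Poor", np.nan, "Good"], dtype=object),
})

# 1) Missing data: median for the numeric column, mode for the categorical one.
hours = SimpleImputer(strategy="median").fit_transform(toy[["hours"]])
diet = SimpleImputer(strategy="most_frequent").fit_transform(toy[["diet"]])

# 2) Encoding: one indicator column per category (sorted: Good, Poor).
diet_encoded = OneHotEncoder().fit_transform(diet).toarray()

# 3) Scaling: map the numeric column to the [0, 1] range.
hours_scaled = MinMaxScaler().fit_transform(hours)

print(hours.ravel())         # missing value replaced by the median, 3.0
print(diet.ravel())          # missing value replaced by the mode, "Good"
print(diet_encoded)
print(hours_scaled.ravel())
```

Imputation comes first because neither the encoder nor the scaler in this sketch is given a rule for missing entries.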

In [None]:
# ============================================
# 0) Imports
# ============================================
import pandas as pd
import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, MinMaxScaler

# ============================================
# 1) Load the data
# ============================================
url = "https://raw.githubusercontent.com/omadson/datasets/main/datasets/student_habits_performance.csv"
df = pd.read_csv(url)

print("Shape:", df.shape)
print("\nPreview:")
display(df.head())

print("\nPercentage of missing values per column:")
# The "%_faltantes" column holds the percentage of missing values
display(df.isna().mean().sort_values(ascending=False).to_frame("%_faltantes") * 100)

# ============================================
# 2) Identify variable types (numeric vs. categorical)
#    - Treats 'object' and 'category' dtypes as categorical
#    - Text identifier columns (e.g. student_id) are also picked up
#      as categorical by this rule
# ============================================
# Detect numeric and categorical columns
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
categorical_cols = df.select_dtypes(include=["object", "category"]).columns.tolist()

print("\nNumeric columns:", numeric_cols)
print("Categorical columns:", categorical_cols)

# ============================================
# 3) Missing-value strategies
#    - Numeric: impute the median (robust to outliers)
#    - Categorical: impute the mode (most frequent value)
# ============================================
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    # choose ONE scaler:
    # ("scaler", StandardScaler()),         # Standardization (mean=0, std=1)
    ("scaler", MinMaxScaler())              # Alternative: rescale to [0, 1]
])

categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    # sparse_output requires scikit-learn >= 1.2
    ("onehot", OneHotEncoder(handle_unknown="ignore", sparse_output=False))
])

# ============================================
# 4) ColumnTransformer applies each pipeline to its own columns
# ============================================
preprocess = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_cols),
        ("cat", categorical_transformer, categorical_cols),
    ],
    remainder="drop"  # or "passthrough" to keep columns not listed above
)

# ============================================
# 5) Run the preprocessing
# ============================================
# Fit and transform in a single step
X_prepared = preprocess.fit_transform(df)

# Names of the new columns created by one-hot encoding
ohe = preprocess.named_transformers_["cat"].named_steps["onehot"]
cat_out_cols = ohe.get_feature_names_out(categorical_cols)

# Build the final DataFrame with column names
# (numeric columns come first, matching the transformer order above)
out_cols = numeric_cols + list(cat_out_cols)
df_processed = pd.DataFrame(X_prepared, columns=out_cols)

print("\nShape after preprocessing:", df_processed.shape)
display(df_processed.head())

# ============================================
# 6) (Optional) Check the statistics after scaling
# ============================================
print("\nStatistical summary of the scaled numeric variables:")
display(df_processed[numeric_cols].describe())

# ============================================
# 7) Save the processed dataset
# ============================================
output_path = "student_habits_performance_processed.csv"
df_processed.to_csv(output_path, index=False)
print(f"\nFile saved to: {output_path}")

Shape: (1000, 16)

Preview:


| | student_id | age | gender | study_hours_per_day | social_media_hours | netflix_hours | part_time_job | attendance_percentage | sleep_hours | diet_quality | exercise_frequency | parental_education_level | internet_quality | mental_health_rating | extracurricular_participation | exam_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | S1000 | 23 | Female | 0.0 | 1.2 | 1.1 | No | 85.0 | 8.0 | Fair | 6 | Master | Average | 8 | Yes | 56.2 |
| 1 | S1001 | 20 | Female | 6.9 | 2.8 | 2.3 | No | 97.3 | 4.6 | Good | 6 | High School | Average | 8 | No | 100.0 |
| 2 | S1002 | 21 | Male | 1.4 | 3.1 | 1.3 | No | 94.8 | 8.0 | Poor | 1 | High School | Poor | 1 | No | 34.3 |
| 3 | S1003 | 23 | Female | 1.0 | 3.9 | 1.0 | No | 71.0 | 9.2 | Poor | 4 | Master | Good | 1 | Yes | 26.8 |
| 4 | S1004 | 19 | Female | 5.0 | 4.4 | 0.5 | No | 90.9 | 4.9 | Fair | 3 | Master | Good | 1 | No | 66.4 |



Percentage of missing values per column:


| | %_faltantes |
|---|---|
| parental_education_level | 9.1 |
| student_id | 0.0 |
| gender | 0.0 |
| age | 0.0 |
| social_media_hours | 0.0 |
| netflix_hours | 0.0 |
| part_time_job | 0.0 |
| study_hours_per_day | 0.0 |
| attendance_percentage | 0.0 |
| sleep_hours | 0.0 |



Numeric columns: ['age', 'study_hours_per_day', 'social_media_hours', 'netflix_hours', 'attendance_percentage', 'sleep_hours', 'exercise_frequency', 'mental_health_rating', 'exam_score']
Categorical columns: ['student_id', 'gender', 'part_time_job', 'diet_quality', 'parental_education_level', 'internet_quality', 'extracurricular_participation']

Shape after preprocessing: (1000, 1025)


| | age | study_hours_per_day | social_media_hours | netflix_hours | attendance_percentage | sleep_hours | exercise_frequency | mental_health_rating | exam_score | student_id_S1000 | ... | diet_quality_Good | diet_quality_Poor | parental_education_level_Bachelor | parental_education_level_High School | parental_education_level_Master | internet_quality_Average | internet_quality_Good | internet_quality_Poor | extracurricular_participation_No | extracurricular_participation_Yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.857143 | 0.0 | 0.166667 | 0.203704 | 0.659091 | 0.705882 | 1.0 | 0.777778 | 0.463235 | 1.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 0.428571 | 0.831325 | 0.388889 | 0.425926 | 0.938636 | 0.205882 | 1.0 | 0.777778 | 1.0 | 0.0 | ... | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 0.571429 | 0.168675 | 0.430556 | 0.240741 | 0.881818 | 0.705882 | 0.166667 | 0.0 | 0.194853 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 3 | 0.857143 | 0.120482 | 0.541667 | 0.185185 | 0.340909 | 0.882353 | 0.666667 | 0.0 | 0.102941 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 4 | 0.285714 | 0.60241 | 0.611111 | 0.092593 | 0.793182 | 0.25 | 0.5 | 0.0 | 0.588235 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
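
The growth from 16 to 1025 columns appears to come almost entirely from `student_id`: it is stored as text, so it is treated as categorical and receives one indicator column per distinct value (`student_id_S1000` and so on). The output also shows `exam_score` among the scaled columns. The sketch below shows one way to set both aside before preprocessing. It assumes that `student_id` is only an identifier and that `exam_score` is the prediction target, neither of which the notebook states explicitly, and it runs on a hypothetical `toy` frame rather than the real data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical toy frame that mimics a few columns of the dataset.
toy = pd.DataFrame({
    "student_id": ["S1000", "S1001", "S1002", "S1003"],
    "age": [23, 20, 21, 19],
    "gender": ["Female", "Female", "Male", "Female"],
    "exam_score": [56.2, 100.0, 34.3, 66.4],
})

# Assumption: student_id is an identifier and exam_score is the target.
id_col, target_col = "student_id", "exam_score"
X = toy.drop(columns=[id_col, target_col])
y = toy[target_col]

# Split the remaining feature columns by type.
numeric_cols = X.select_dtypes(include="number").columns.tolist()
categorical_cols = [c for c in X.columns if c not in numeric_cols]

preprocess = ColumnTransformer([
    ("num", MinMaxScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])
X_prepared = preprocess.fit_transform(X)

print(X_prepared.shape)  # (4, 3): age plus two gender indicator columns
```

With the identifier removed, the number of output columns depends only on the number of categories in the genuine categorical features.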



Statistical summary of the scaled numeric variables:


| | age | study_hours_per_day | social_media_hours | netflix_hours | attendance_percentage | sleep_hours | exercise_frequency | mental_health_rating | exam_score |
|---|---|---|---|---|---|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 |
| mean | 0.499714 | 0.427723 | 0.347986 | 0.336981 | 0.639357 | 0.480897 | 0.507 | 0.493111 | 0.627469 |
| std | 0.329729 | 0.176975 | 0.162836 | 0.199096 | 0.213619 | 0.18035 | 0.337571 | 0.316389 | 0.206968 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.25 | 0.313253 | 0.236111 | 0.185185 | 0.5 | 0.352941 | 0.166667 | 0.222222 | 0.491115 |
| 50% | 0.428571 | 0.421687 | 0.347222 | 0.333333 | 0.645455 | 0.485294 | 0.5 | 0.444444 | 0.63848 |
| 75% | 0.857143 | 0.542169 | 0.458333 | 0.467593 | 0.796023 | 0.602941 | 0.833333 | 0.777778 | 0.77114 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
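
The minimum of 0 and maximum of 1 in every column are what min-max scaling produces: each value `x` becomes `(x - min) / (max - min)`. As a check against the preview, the ages 23, 20, 21 and 19 were scaled to 0.857143, 0.428571, 0.571429 and 0.285714, which is consistent with an age range of 17 to 24. That range is inferred from the outputs, not stated in the notebook. A small sketch reproducing the arithmetic on a hypothetical array with that range:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical ages spanning the inferred range 17..24 (one column, six rows).
ages = np.array([[17.0], [19.0], [20.0], [21.0], [23.0], [24.0]])

# Manual min-max scaling: (x - min) / (max - min).
manual = (ages - ages.min()) / (ages.max() - ages.min())

# The same transformation through scikit-learn.
scaled = MinMaxScaler().fit_transform(ages)

print(np.round(manual.ravel(), 6))
print(np.allclose(manual, scaled))
```

Because the scale depends on the observed minimum and maximum, a single extreme value compresses all other values toward one end of the interval, which is one reason to compare the result against `StandardScaler`.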



File saved to: student_habits_performance_processed.csv


## Reading the Dataset and Creating the Data Dictionary

## Variable Selection and Separation of Inputs and Outputs

## Data Preparation

## Cross-Validation
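
One possible shape for this step, sketched under assumptions: the preprocessing is placed inside a `Pipeline` together with the model, so that imputation, encoding and scaling are refit on the training part of each fold instead of on the full dataset. The synthetic data, the column names (`study_hours`, `diet`), the choice of `LinearRegression` and the R² metric are all hypothetical placeholders, not choices documented in the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Synthetic stand-in data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
n = 60
X = pd.DataFrame({
    "study_hours": rng.uniform(0, 8, n),
    "diet": pd.Series(rng.choice(["Fair", "Good", "Poor"], n), dtype=object),
})
y = 10 * X["study_hours"] + rng.normal(0, 1, n)

# Same per-type preprocessing idea as in the cell above.
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", MinMaxScaler()),
    ]), ["study_hours"]),
    ("cat", Pipeline([
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["diet"]),
])

# Preprocessing and model in one estimator: each fold fits both on its
# training rows only, then scores on its held-out rows.
model = Pipeline([("preprocess", preprocess), ("regressor", LinearRegression())])
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="r2")

print(scores.mean())
```

Swapping `LinearRegression` for another estimator keeps the rest of the sketch unchanged, which makes it convenient for comparing several models on the same folds.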

## Presentation of Results

## Conclusion