# Student Performance - Cleaning Recipe

This dataset was downloaded from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Student+Performance). It describes student achievement in secondary education at two Portuguese schools. The attributes include student grades and demographic, social, and school-related features, and they were collected from school reports and questionnaires. Two datasets are provided, covering performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks.

Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such a prediction is much more useful (see the source paper for more details).
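The note about G1, G2, and G3 can be illustrated with a minimal pandas sketch. The grades below are made-up toy values on the 0-20 scale, not rows from the real files, and the variable names are only illustrative. The sketch computes the pairwise correlation of the three period grades and then builds the harder setup in which G3 is predicted without G1 and G2.

```python
import pandas as pd

# Hypothetical toy data (NOT taken from the real dataset).
toy = pd.DataFrame({
    "absences": [6, 4, 10, 2, 4, 0],
    "G1": [5, 7, 15, 6, 12, 16],
    "G2": [6, 5, 14, 10, 12, 16],
    "G3": [6, 6, 15, 10, 11, 17],
})

# Pairwise Pearson correlation between the three period grades.
grade_corr = toy[["G1", "G2", "G3"]].corr()
print(grade_corr)

# Harder but more useful setup: the target is G3 and the earlier
# period grades are excluded from the features.
target = toy["G3"]
features_without_grades = toy.drop(columns=["G1", "G2", "G3"])
print(list(features_without_grades.columns))
```

On the real data, the same `corr()` call on `math_df` or `port_df` would be one way to check how strongly the period grades are related.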

## Importing the Libraries

In [1]:
# General Libraries
import pandas as pd

In [2]:
# Yeast-specific classes
from yeast import Recipe
from yeast.steps import CleanColumnNamesStep, RenameColumnsStep

## Getting the Data

In [3]:
math_df = pd.read_csv('student-mat.csv', sep=";")
port_df = pd.read_csv('student-por.csv', sep=";")
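The `sep=";"` argument matters because the files are semicolon-delimited. The sketch below shows the difference on a small in-memory sample; the sample text is invented for illustration and only mimics the first three columns.

```python
import io
import pandas as pd

# Hypothetical three-column sample in the same semicolon-delimited layout.
sample = "school;sex;age\nGP;F;18\n"

# With the default comma separator the whole line is read as one column.
wrong = pd.read_csv(io.StringIO(sample))

# With sep=";" each field becomes its own column.
right = pd.read_csv(io.StringIO(sample), sep=";")

print(wrong.shape, right.shape)
```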

In [4]:
math_df.head(1)

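|   | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | ... | famrel | freetime | goout | Dalc | Walc | health | absences | G1 | G2 | G3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GP | F | 18 | U | GT3 | A | 4 | 4 | at_home | teacher | ... | 4 | 3 | 4 | 1 | 1 | 3 | 6 | 5 | 6 | 6 |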


In [5]:
port_df.head(1)

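|   | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | ... | famrel | freetime | goout | Dalc | Walc | health | absences | G1 | G2 | G3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GP | F | 18 | U | GT3 | A | 4 | 4 | at_home | teacher | ... | 4 | 3 | 4 | 1 | 1 | 3 | 4 | 0 | 11 | 11 |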


## Cleaning the Data

### Defining the processing Recipe

In [6]:
recipe = Recipe([
    # All column names to snake case
    CleanColumnNamesStep('snake'),
    # Rename columns with better names
    RenameColumnsStep({
        'famsize': 'family_size',
        'pstatus': 'parent_status',
        'medu': 'mother_education',
        'fedu': 'father_education',
        'mjob': 'mother_job',
        'fjob': 'father_job',
        'traveltime': 'travel_time',
        'studytime': 'study_time',
        'schoolsup': 'school_support',
        'famsup': 'family_support',
        'famrel': 'family_relationship_quality',
        'freetime': 'free_time',
        'goout': 'go_out_friends',
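        # Keys must use the names produced by the snake-case step above,
        # so these two columns are referenced in lowercase
        'dalc': 'workday_alcohol_consumption',
        'walc': 'weekend_alcohol_consumption',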
    })
])
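For readers unfamiliar with Yeast, the two steps correspond roughly to the plain-pandas operations sketched below. This is an approximation under stated assumptions: `to_snake` is a hypothetical helper that only handles simple names like the ones in this dataset, and Yeast's own `CleanColumnNamesStep` may normalize more cases. The toy frame holds four columns with made-up single-row values.

```python
import re
import pandas as pd

def to_snake(name: str) -> str:
    """Hypothetical helper: replace non-alphanumeric runs with '_' and lowercase."""
    return re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_").lower()

# Toy frame with a few of the original column names.
toy = pd.DataFrame({"famsize": ["GT3"], "Pstatus": ["A"], "Dalc": [1], "G3": [6]})

# Step 1: normalize every column name.
cleaned = toy.rename(columns=to_snake)

# Step 2: rename selected columns. Because this runs after step 1,
# the keys refer to the already lowercased names.
renamed = cleaned.rename(columns={
    "famsize": "family_size",
    "pstatus": "parent_status",
    "dalc": "workday_alcohol_consumption",
})

print(list(renamed.columns))
```

The order of the steps is the point of the sketch: a rename key that still uses the original capitalization would no longer match any column after the cleanup.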

### Preparing the recipe

In [7]:
recipe = recipe.prepare(math_df)

### Baking / Executing the recipe

In [8]:
baked_math_df = recipe.bake(math_df) 
baked_port_df = recipe.bake(port_df) 
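Since one prepared recipe is applied to both files, a quick sanity check is to confirm that the two baked frames end up with identical column names. The sketch below uses small stand-in frames with only three columns; on the real data the same comparison would be run on `baked_math_df` and `baked_port_df`.

```python
import pandas as pd

# Toy stand-ins for the two baked frames (three columns only).
baked_math = pd.DataFrame({"family_size": ["GT3"], "absences": [6], "g3": [6]})
baked_port = pd.DataFrame({"family_size": ["GT3"], "absences": [4], "g3": [11]})

# Columns present in one frame but not the other (empty when schemas agree).
only_in_one = set(baked_math.columns) ^ set(baked_port.columns)

# Stricter check: same names in the same order.
same_schema = list(baked_math.columns) == list(baked_port.columns)

print(only_in_one, same_schema)
```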

In [9]:
baked_math_df.head(1)

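|   | school | sex | age | address | family_size | parent_status | mother_education | father_education | mother_job | father_job | ... | family_relationship_quality | free_time | go_out_friends | workday_alcohol_consumption | weekend_alcohol_consumption | health | absences | g1 | g2 | g3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GP | F | 18 | U | GT3 | A | 4 | 4 | at_home | teacher | ... | 4 | 3 | 4 | 1 | 1 | 3 | 6 | 5 | 6 | 6 |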


In [10]:
baked_port_df.head(1)

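|   | school | sex | age | address | family_size | parent_status | mother_education | father_education | mother_job | father_job | ... | family_relationship_quality | free_time | go_out_friends | workday_alcohol_consumption | weekend_alcohol_consumption | health | absences | g1 | g2 | g3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GP | F | 18 | U | GT3 | A | 4 | 4 | at_home | teacher | ... | 4 | 3 | 4 | 1 | 1 | 3 | 4 | 0 | 11 | 11 |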
