# Data Preprocessing for Capstone 2
## Portuguese Student Performance Data Set

The dataset contains grades and survey questions from Portuguese secondary students in mathematics and language classes.

### Imports

First, import the necessary libraries and packages.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split

Then, load the data and check datatypes. 

In [2]:
math_file = '../data/interim/math.csv'
lang_file = '../data/interim/lang.csv'

math = pd.read_csv(math_file, index_col=0)
lang = pd.read_csv(lang_file, index_col=0)

print('Math:')
print(math.info())
print('Language:')
print(lang.info())

Math:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 395 entries, 0 to 394
Data columns (total 33 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   school      395 non-null    object
 1   sex         395 non-null    object
 2   age         395 non-null    int64 
 3   address     395 non-null    object
 4   famsize     395 non-null    object
 5   Pstatus     395 non-null    object
 6   Medu        395 non-null    int64 
 7   Fedu        395 non-null    int64 
 8   Mjob        395 non-null    object
 9   Fjob        395 non-null    object
 10  reason      395 non-null    object
 11  guardian    395 non-null    object
 12  traveltime  395 non-null    int64 
 13  studytime   395 non-null    int64 
 14  failures    395 non-null    int64 
 15  schoolsup   395 non-null    object
 16  famsup      395 non-null    object
 17  paid        395 non-null    object
 18  activities  395 non-null    object
 19  nursery     395 non-null    object
 20  high

### Data Preprocessing

Using the information above and the dataset documentation (in README.md), the variables can be sorted into a few groups for processing.

#### 'True' Binaries

These variables relate to yes or no survey questions. They can be processed as 0 for 'no' and 1 for 'yes'.

- schoolsup: yes or no
- famsup: yes or no
- paid: yes or no
- activities: yes or no
- nursery: yes or no
- higher: yes or no
- internet: yes or no
- romantic: yes or no

#### Categories

##### Two categories

There are a few more variables that contain only two categories, but it makes less sense to simply transform them to 0 or 1. Instead, they can be one-hot encoded using `pd.get_dummies()`.

- school: GP or MS
- sex: M or F
- address: U or R
- famsize: LE3 or GT3
- Pstatus: T or A
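
As a minimal sketch of what one-hot encoding does to a two-category variable, the toy frame below (made-up rows, not the real data) is encoded twice: once with one column per category, and once with `drop_first=True`, an optional variant that keeps a single indicator per variable. The `dtype=int` argument is only there so the output shows 0/1 regardless of pandas version.

```python
import pandas as pd

# Toy rows standing in for two of the two-category columns (illustrative values only).
toy = pd.DataFrame({
    'school': ['GP', 'MS', 'GP'],
    'sex': ['F', 'M', 'F'],
})

# Full one-hot encoding: one indicator column per category.
full = pd.get_dummies(toy, columns=['school', 'sex'], dtype=int)

# Optional variant: drop the first level of each variable,
# leaving one indicator column per variable.
reduced = pd.get_dummies(toy, columns=['school', 'sex'], drop_first=True, dtype=int)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

With two categories the pair of columns is always complementary, so the reduced form carries the same information; which form is preferable depends on the model used later.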

##### More than two categories

These variables are also categorical but with more than two categories. For the sake of clarity they have been separated from the two-category variables above, but they can also be one-hot encoded using `pd.get_dummies()`.

- Mjob: teacher, health, services, at home, other
- Fjob: teacher, health, services, at home, other
- reason: (close to) home, reputation, course (preference), other
- guardian: mother, father, other

#### Ordinals

##### Ordinal Categories

These variables hold responses that are ordered numerically, but the distances between responses are difficult or impossible to quantify. For instance, in the case of Medu and Fedu (mother's and father's education), 0 does mean zero and 4 is higher than 3, but by how much? The same idea applies to traveltime and studytime: 4 is higher than 2, but by how much? The data has been collected on an ordinal scale, so no processing is required.

- Medu: 0 - none, 1 - up to 4th grade, 2 - 5th to 9th grade, 3 - secondary, 4 - higher education
- Fedu: 0 - none, 1 - up to 4th grade, 2 - 5th to 9th grade, 3 - secondary, 4 - higher education
- traveltime: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour
- studytime: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours

##### Likert-type Responses

For some items, students were asked to respond on a five-point scale, with 1 being very low and 5 being very high. The data was collected as ordinal variables, although Likert scales are sometimes treated as continuous for analysis purposes. Again, no processing is required for this set of variables.

1, very low to 5, very high:

- famrel
- freetime
- goout
- Dalc
- Walc
- health
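
Although these columns need no transformation, it may be worth confirming that every value falls inside its documented range. The sketch below is one way to do that; `expected_ranges` and `out_of_range` are hypothetical names introduced here, and the toy frame is synthetic rather than the real data.

```python
import pandas as pd

# Documented (low, high) ranges for the ordinal and Likert-type columns listed above.
expected_ranges = {
    'Medu': (0, 4), 'Fedu': (0, 4),
    'traveltime': (1, 4), 'studytime': (1, 4),
    'famrel': (1, 5), 'freetime': (1, 5), 'goout': (1, 5),
    'Dalc': (1, 5), 'Walc': (1, 5), 'health': (1, 5),
}

def out_of_range(df, ranges):
    """Return {column: number of values outside [low, high]} for offending columns."""
    problems = {}
    for col, (low, high) in ranges.items():
        bad = ~df[col].between(low, high)  # between() is inclusive on both ends
        if bad.any():
            problems[col] = int(bad.sum())
    return problems

# Synthetic frame: each column holds its own minimum and maximum value.
toy = pd.DataFrame({col: [low, high] for col, (low, high) in expected_ranges.items()})
clean_report = out_of_range(toy, expected_ranges)

# Introduce one invalid value to show what a violation looks like.
toy.loc[0, 'Walc'] = 7
dirty_report = out_of_range(toy, expected_ranges)

print(clean_report)
print(dirty_report)
```

On the real data the same call would be `out_of_range(math, expected_ranges)` and `out_of_range(lang, expected_ranges)`; an empty dictionary means nothing is out of range.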

#### Numeric

These variables are integers for age, failures, absences, and grades, with G3 being the final grade. For now, these variables will also be left as they are.

- age: 15 - 22
- failures: n if 1<=n<3, else 4
- absences
- G1
- G2
- G3

### 'True' Binaries

Look at the heads of the columns containing binary variables.

In [3]:
df_list=[math,lang]

binary = ['schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic']
        
print(math[binary].head())
print(lang[binary].head())

  schoolsup famsup paid activities nursery higher internet romantic
0       yes     no   no         no     yes    yes       no       no
1        no    yes   no         no      no    yes      yes       no
2       yes     no  yes         no     yes    yes      yes       no
3        no    yes  yes        yes     yes    yes      yes      yes
4        no    yes  yes         no     yes    yes       no       no
  schoolsup famsup paid activities nursery higher internet romantic
0       yes     no   no         no     yes    yes       no       no
1        no    yes   no         no      no    yes      yes       no
2       yes     no   no         no     yes    yes      yes       no
3        no    yes   no        yes     yes    yes      yes      yes
4        no    yes   no         no     yes    yes       no       no


Change yes to 1 and no to 0. Look at the heads again.

In [4]:
math[binary] = math[binary].replace({'yes': 1,'no': 0})
lang[binary] = lang[binary].replace({'yes': 1,'no': 0})

print(math[binary].head())
print(lang[binary].head())

   schoolsup  famsup  paid  activities  nursery  higher  internet  romantic
0          1       0     0           0        1       1         0         0
1          0       1     0           0        0       1         1         0
2          1       0     1           0        1       1         1         0
3          0       1     1           1        1       1         1         1
4          0       1     1           0        1       1         0         0
   schoolsup  famsup  paid  activities  nursery  higher  internet  romantic
0          1       0     0           0        1       1         0         0
1          0       1     0           0        0       1         1         0
2          1       0     0           0        1       1         1         0
3          0       1     0           1        1       1         1         1
4          0       1     0           0        1       1         0         0
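
`replace` leaves any value other than 'yes' or 'no' untouched, so a stray response would pass through silently. A stricter alternative, sketched here with a hypothetical helper `encode_yes_no` and toy data, checks the values first and raises if anything unexpected turns up.

```python
import pandas as pd

def encode_yes_no(df, columns):
    """Return a copy of df with 'yes'/'no' in the given columns mapped to 1/0.

    Raises ValueError if a column holds anything other than 'yes' or 'no'.
    """
    out = df.copy()
    for col in columns:
        unexpected = set(out[col].unique()) - {'yes', 'no'}
        if unexpected:
            shown = sorted(map(str, unexpected))
            raise ValueError(f"{col} has unexpected values: {shown}")
        out[col] = out[col].map({'yes': 1, 'no': 0})
    return out

# Toy rows, not the real survey data.
toy = pd.DataFrame({'paid': ['yes', 'no', 'no'], 'higher': ['yes', 'yes', 'no']})
encoded = encode_yes_no(toy, ['paid', 'higher'])
print(encoded)
```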


### Two Categories

Use `pd.get_dummies` to get one-hot encoded variables for two-category variables.

In [5]:
dummy = ['school', 'sex', 'address', 'famsize', 'Pstatus']

math = pd.get_dummies(data=math, columns=dummy)
lang = pd.get_dummies(data=lang, columns=dummy)

dummies = ['school_GP', 'sex_M', 'address_U', 'famsize_LE3', 'Pstatus_T', 
            'school_MS', 'sex_F', 'address_R', 'famsize_GT3', 'Pstatus_A']

math[dummies].head()

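| | school_GP | sex_M | address_U | famsize_LE3 | Pstatus_T | school_MS | sex_F | address_R | famsize_GT3 | Pstatus_A |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 |
| 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| 2 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 |
| 3 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| 4 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |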


### More than 2 categories

Do the same as above: one-hot encode the variables with more than two categories using `get_dummies`.

In [6]:
category = ['Mjob', 'Fjob', 'reason', 'guardian']

math = pd.get_dummies(data=math, columns=category)
lang = pd.get_dummies(data=lang, columns=category)

categories = ['Mjob_teacher', 'Mjob_health', 'Mjob_services', 'Mjob_at_home', 'Mjob_other',
              'Fjob_teacher', 'Fjob_health', 'Fjob_services', 'Fjob_at_home', 'Fjob_other']

math[categories].head()

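| | Mjob_teacher | Mjob_health | Mjob_services | Mjob_at_home | Mjob_other | Fjob_teacher | Fjob_health | Fjob_services | Fjob_at_home | Fjob_other |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |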


Look at the head for the reason variable (why students chose the school they attend).

In [7]:
reasons = ['reason_home', 'reason_reputation', 'reason_course', 'reason_other']

lang[reasons].head()

| | reason_home | reason_reputation | reason_course | reason_other |
|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 |
| 1 | 0 | 0 | 1 | 0 |
| 2 | 0 | 0 | 0 | 1 |
| 3 | 1 | 0 | 0 | 0 |
| 4 | 1 | 0 | 0 | 0 |


Look at the head of the columns containing information on the student's guardian.

In [8]:
guardians = ['guardian_father', 'guardian_mother', 'guardian_other']

math[guardians].head()

| | guardian_father | guardian_mother | guardian_other |
|---|---|---|---|
| 0 | 0 | 1 | 0 |
| 1 | 1 | 0 | 0 |
| 2 | 0 | 1 | 0 |
| 3 | 0 | 1 | 0 |
| 4 | 1 | 0 | 0 |
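
A quick sanity check after one-hot encoding is that every row has exactly one 1 within each group of dummy columns. The sketch below shows the idea on toy data; `one_hot_groups_valid` is a hypothetical helper, and it assumes the dummy columns follow the `prefix_value` naming that `get_dummies` produces by default.

```python
import pandas as pd

def one_hot_groups_valid(df, prefixes):
    """Map each prefix to True if its dummy columns sum to exactly 1 in every row."""
    result = {}
    for prefix in prefixes:
        cols = [c for c in df.columns if c.startswith(prefix + '_')]
        rows_ok = (df[cols].sum(axis=1) == 1).all() if cols else False
        result[prefix] = bool(rows_ok)
    return result

# Toy rows, not the real data.
raw = pd.DataFrame({
    'guardian': ['mother', 'father', 'other'],
    'reason': ['home', 'course', 'home'],
})
toy = pd.get_dummies(raw, columns=['guardian', 'reason'], dtype=int)
checks = one_hot_groups_valid(toy, ['guardian', 'reason'])

# Corrupt one row so that two guardian indicators are set at once.
broken = toy.copy()
broken.loc[0, 'guardian_father'] = 1
broken_checks = one_hot_groups_valid(broken, ['guardian', 'reason'])

print(checks)
print(broken_checks)
```

On the processed frames the prefixes would be the original column names, for example `['school', 'sex', 'address', 'famsize', 'Pstatus', 'Mjob', 'Fjob', 'reason', 'guardian']`.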


### Numeric

Look at the head of the age, absences, failures, and grades columns. This data will not be processed.

In [9]:
values = ['age', 'absences', 'failures', 'G1', 'G2', 'G3']
math[values].head()

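| | age | absences | failures | G1 | G2 | G3 |
|---|---|---|---|---|---|---|
| 0 | 18 | 6 | 0 | 5 | 6 | 6 |
| 1 | 17 | 4 | 0 | 5 | 5 | 6 |
| 2 | 15 | 10 | 3 | 7 | 8 | 10 |
| 3 | 15 | 2 | 0 | 15 | 14 | 15 |
| 4 | 16 | 4 | 0 | 6 | 10 | 10 |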


### Ordinals

Look at the head of the remaining columns: the ordinal categories and the Likert-type variables. These variables do not require processing.

In [10]:
the_rest = ['Medu', 'Fedu', 'traveltime', 'studytime', 
            'famrel', 'freetime', 'goout', 'Dalc', 
            'Walc', 'health']

lang[the_rest].head()

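| | Medu | Fedu | traveltime | studytime | famrel | freetime | goout | Dalc | Walc | health |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4 | 4 | 2 | 2 | 4 | 3 | 4 | 1 | 1 | 3 |
| 1 | 1 | 1 | 1 | 2 | 5 | 3 | 3 | 1 | 1 | 3 |
| 2 | 1 | 1 | 1 | 2 | 4 | 3 | 2 | 2 | 3 | 3 |
| 3 | 4 | 2 | 1 | 3 | 3 | 2 | 2 | 1 | 1 | 5 |
| 4 | 3 | 3 | 1 | 2 | 4 | 3 | 2 | 1 | 2 | 5 |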


### Final Checks and Prepare for Next Steps

Check the variable types after processing.

In [11]:
print('Math:')
print(math.info())
print('Language:')
print(lang.info())

Math:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 395 entries, 0 to 394
Data columns (total 51 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   age                395 non-null    int64
 1   Medu               395 non-null    int64
 2   Fedu               395 non-null    int64
 3   traveltime         395 non-null    int64
 4   studytime          395 non-null    int64
 5   failures           395 non-null    int64
 6   schoolsup          395 non-null    int64
 7   famsup             395 non-null    int64
 8   paid               395 non-null    int64
 9   activities         395 non-null    int64
 10  nursery            395 non-null    int64
 11  higher             395 non-null    int64
 12  internet           395 non-null    int64
 13  romantic           395 non-null    int64
 14  famrel             395 non-null    int64
 15  freetime           395 non-null    int64
 16  goout              395 non-null    int64
 17  Dalc      

Save a math DataFrame and a language DataFrame with the processed data. They will be kept separate because how models and results differ between subject areas is a relevant question that may be explored later.

In [12]:
math.to_csv('../data/processed/math.csv')
lang.to_csv('../data/processed/lang.csv')
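
Since `to_csv` writes the index as an unnamed first column, the files need `index_col=0` when they are read back, as in the loading step at the top. The round trip can be checked on a toy frame; the temporary directory and file name below are placeholders, not the project paths.

```python
import os
import tempfile

import pandas as pd

# Toy frame, not the real data.
toy = pd.DataFrame({'age': [18, 17], 'G3': [6, 10]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy.csv')
    toy.to_csv(path)                            # index is written as the first column
    restored = pd.read_csv(path, index_col=0)   # read that column back as the index

# Raises AssertionError if values, columns, or index labels differ.
pd.testing.assert_frame_equal(toy, restored)
print(restored)
```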

For now, however, I plan to use a combined DataFrame to model how student responses relate to their final grade.

First, create a combined DataFrame (`cmbd`), save it, check the shape, and look at a sample of the data.

In [13]:
cmbd = pd.concat([math,lang], axis=0, ignore_index=True)
cmbd.to_csv('../data/processed/cmbd.csv')

print(cmbd.shape)
cmbd.sample(10)

(1044, 51)


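| | age | Medu | Fedu | traveltime | studytime | failures | schoolsup | famsup | paid | activities | ... | Fjob_other | Fjob_services | Fjob_teacher | reason_course | reason_home | reason_other | reason_reputation | guardian_father | guardian_mother | guardian_other |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 622 | 18 | 2 | 2 | 1 | 2 | 0 | 0 | 1 | 0 | 1 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 864 | 16 | 4 | 4 | 1 | 2 | 0 | 0 | 1 | 0 | 1 | ... | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 663 | 17 | 4 | 4 | 2 | 2 | 0 | 0 | 1 | 0 | 0 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 850 | 15 | 2 | 1 | 1 | 2 | 0 | 1 | 1 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 521 | 15 | 3 | 4 | 1 | 2 | 0 | 1 | 0 | 0 | 1 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 543 | 15 | 1 | 1 | 3 | 1 | 1 | 0 | 0 | 0 | 1 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 134 | 15 | 3 | 4 | 4 | 2 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 148 | 16 | 4 | 4 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 695 | 18 | 4 | 3 | 1 | 2 | 0 | 0 | 1 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 921 | 17 | 4 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |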
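
Because `ignore_index=True` discards the original row labels, the combined frame no longer records which subject a row came from. If that becomes relevant later, one option is to tag each frame before concatenating. This is only a sketch on toy data: the `subject` column is a hypothetical addition, not part of the dataset, and it would need to be encoded or dropped before modeling.

```python
import pandas as pd

# Toy frames standing in for math and lang (illustrative values only).
math_toy = pd.DataFrame({'age': [18, 17], 'G3': [6, 6]})
lang_toy = pd.DataFrame({'age': [15, 16, 17], 'G3': [11, 12, 14]})

# assign() returns a copy with the extra column, so the inputs are not modified.
labelled = pd.concat(
    [math_toy.assign(subject='math'), lang_toy.assign(subject='lang')],
    axis=0,
    ignore_index=True,
)

print(labelled.shape)
print(labelled['subject'].value_counts().to_dict())
```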


Finally, set up the data for training. Split the data into training and test sets.

In [14]:
y = cmbd['G3']
X = cmbd.drop(['G1','G2','G3'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
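
To see what the split produces, the sketch below runs the same call on a synthetic stand-in with 1044 rows. The values are random, and `X_demo` and `y_demo` are hypothetical names, so only the shapes and the index alignment are meaningful.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as cmbd; values are random.
rng = np.random.default_rng(0)
n_rows = 1044
X_demo = pd.DataFrame({
    'age': rng.integers(15, 23, size=n_rows),
    'studytime': rng.integers(1, 5, size=n_rows),
})
y_demo = pd.Series(rng.integers(0, 21, size=n_rows), name='G3')

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

# Features and target are shuffled together, so their indexes stay aligned.
aligned = X_tr.index.equals(y_tr.index) and X_te.index.equals(y_te.index)

print(X_tr.shape, X_te.shape, aligned)
```

Fixing `random_state` makes the split reproducible across runs, which helps when comparing models on the same test set.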