# Predicting grades in Math and Portuguese courses

Predicting the final grade in Math and Portuguese language courses from family background and alcohol consumption. The dataset contains 33 attributes and is available at https://www.kaggle.com/uciml/student-alcohol-consumption

## Context

The data were obtained in a survey of students in Math and Portuguese language courses in secondary school. They contain a lot of interesting social, gender and study information about the students. You can use them for some EDA or to try to predict the students' final grades.

## Content

Attributes for both student-mat.csv (Math course) and student-por.csv (Portuguese language course) datasets:

- school - student's school (binary: 'GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira)
- sex - student's sex (binary: 'F' - female or 'M' - male)
- age - student's age (numeric: from 15 to 22)
- address - student's home address type (binary: 'U' - urban or 'R' - rural)
- famsize - family size (binary: 'LE3' - less than or equal to 3 or 'GT3' - greater than 3)
- Pstatus - parents' cohabitation status (binary: 'T' - living together or 'A' - apart)
- Medu - mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
- Fedu - father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
- Mjob - mother's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
- Fjob - father's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
- reason - reason to choose this school (nominal: close to 'home', school 'reputation', 'course' preference or 'other')
- guardian - student's guardian (nominal: 'mother', 'father' or 'other')
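- traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
- studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)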
- failures - number of past class failures (numeric: n if 1<=n<3, else 4)
- schoolsup - extra educational support (binary: yes or no)
- famsup - family educational support (binary: yes or no)
- paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
- activities - extra-curricular activities (binary: yes or no)
- nursery - attended nursery school (binary: yes or no)
- higher - wants to take higher education (binary: yes or no)
- internet - Internet access at home (binary: yes or no)
- romantic - in a romantic relationship (binary: yes or no)
- famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
- freetime - free time after school (numeric: from 1 - very low to 5 - very high)
- goout - going out with friends (numeric: from 1 - very low to 5 - very high)
- Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
- Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
- health - current health status (numeric: from 1 - very bad to 5 - very good)
- absences - number of school absences (numeric: from 0 to 93)

These grades are related to the course subject, Math or Portuguese:

1. G1 - first period grade (numeric: from 0 to 20)
2. G2 - second period grade (numeric: from 0 to 20)
3. G3 - final grade (numeric: from 0 to 20, output target)

In [1]:
import pandas as pd

## Merging the datasets
Some students are enrolled in both courses. The data are anonymized and contain no unique identifiers by which the datasets could easily be merged.  
The authors recommend merging the datasets on attributes that are identical for each student in both courses and thus identify the student. They provided an R script that merges the datasets and found 382 students who take both courses. However, in the [dataset discussion](https://www.kaggle.com/uciml/student-alcohol-consumption/discussion/26889), users question this approach because it does not use all the attributes that are common to both courses.  
The datasets were therefore merged on all attributes except the following, which are specific to a particular course:
- failures - number of past class failures
- paid - whether extra paid classes were taken
- absences - number of absences
- G1
- G2
- G3
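Because the merge relies on the shared attributes acting as a stand-in identifier, it can help to check that they are in fact unique per student. The following is a minimal sketch of such a check, not the authors' method. The frames `math_toy` and `por_toy`, the helper `shared_attributes` and the list `subject_specific` are hypothetical names, and the rows are made up purely for illustration. `validate="one_to_one"` is a pandas `merge` parameter that raises an error when the keys are duplicated in either frame.

```python
import pandas as pd

# Hypothetical toy frames standing in for student-mat.csv and student-por.csv.
# The values are invented for illustration and are not taken from the dataset.
math_toy = pd.DataFrame({
    "school": ["GP", "GP", "MS"],
    "sex": ["F", "M", "F"],
    "age": [15, 16, 17],
    "failures": [0, 1, 0],
    "paid": ["no", "yes", "no"],
    "absences": [2, 4, 0],
    "G1": [10, 12, 14],
    "G2": [11, 12, 15],
    "G3": [11, 13, 15],
})
por_toy = pd.DataFrame({
    "school": ["GP", "MS", "MS"],
    "sex": ["F", "F", "M"],
    "age": [15, 17, 18],
    "failures": [0, 0, 2],
    "paid": ["no", "no", "no"],
    "absences": [1, 3, 6],
    "G1": [12, 13, 9],
    "G2": [12, 14, 10],
    "G3": [13, 14, 10],
})

# Attributes that differ between the two courses and so cannot serve as keys.
subject_specific = ["failures", "paid", "absences", "G1", "G2", "G3"]


def shared_attributes(df, excluded):
    """Return the columns of df that are not in excluded, in original order."""
    return [col for col in df.columns if col not in excluded]


keys = shared_attributes(math_toy, subject_specific)

# Rows that share all key values cannot be told apart. In a plain merge each
# of them would match every such row in the other frame and inflate the count.
dup_math = int(math_toy.duplicated(subset=keys, keep=False).sum())
dup_por = int(por_toy.duplicated(subset=keys, keep=False).sum())

# validate="one_to_one" makes pandas raise MergeError if the keys are not
# unique in both frames, instead of silently producing extra rows.
merged = math_toy.merge(por_toy, on=keys, validate="one_to_one")

print(keys, dup_math, dup_por, len(merged))
```

On the real files, a non-zero duplicate count would mean that the merged row count can overstate the number of students taking both courses.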

In [19]:
math = pd.read_csv("data/student-mat.csv")
portuguese = pd.read_csv("data/student-por.csv")

identical_attributes = ["school",
                        "sex",
                        "age",
                        "address",
                        "famsize",
                        "Pstatus",
                        "Medu",
                        "Fedu",
                        "Mjob",
                        "Fjob",
                        "reason",
                        "guardian",
                        "traveltime",
                        "studytime",
                        "schoolsup",
                        "famsup",
                        "activities",
                        "nursery",
                        "higher",
                        "internet",
                        "romantic",
                        "famrel",
                        "freetime",
                        "goout",
                        "Dalc",
                        "Walc",
                        "health"]

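# Inner join on the shared attributes; the subject-specific columns get the
# default pandas suffixes: _x for Math and _y for Portuguese.
data = math.merge(portuguese, on=identical_attributes)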

print(data.columns.tolist())
print(data.count())

['school', 'sex', 'age', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu', 'Mjob', 'Fjob', 'reason', 'guardian', 'traveltime', 'studytime', 'failures_x', 'schoolsup', 'famsup', 'paid_x', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'famrel', 'freetime', 'goout', 'Dalc', 'Walc', 'health', 'absences_x', 'G1_x', 'G2_x', 'G3_x', 'failures_y', 'paid_y', 'absences_y', 'G1_y', 'G2_y', 'G3_y']
school        370
sex           370
age           370
address       370
famsize       370
Pstatus       370
Medu          370
Fedu          370
Mjob          370
Fjob          370
reason        370
guardian      370
traveltime    370
studytime     370
failures_x    370
schoolsup     370
famsup        370
paid_x        370
activities    370
nursery       370
higher        370
internet      370
romantic      370
famrel        370
freetime      370
goout         370
Dalc          370
Walc          370
health        370
absences_x    370
G1_x          370
G2_x          370
G3_x          370
fail