# Predicting grades in mathematics and Portuguese courses

Prediction of the final grade in mathematics and Portuguese courses based on family background and alcohol consumption. The dataset contains 33 attributes and is available at https://www.kaggle.com/uciml/student-alcohol-consumption

## Original description from Kaggle

### Context

The data were obtained in a survey of students in math and Portuguese language courses in secondary school. It contains a lot of interesting social, gender and study information about students. You can use it for some EDA or try to predict students' final grade.

### Content

Attributes for both student-mat.csv (Math course) and student-por.csv (Portuguese language course) datasets:

- school - student's school (binary: 'GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira)
- sex - student's sex (binary: 'F' - female or 'M' - male)
- age - student's age (numeric: from 15 to 22)
- address - student's home address type (binary: 'U' - urban or 'R' - rural)
- famsize - family size (binary: 'LE3' - less or equal to 3 or 'GT3' - greater than 3)
- Pstatus - parent's cohabitation status (binary: 'T' - living together or 'A' - apart)
- Medu - mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
- Fedu - father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
- Mjob - mother's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
- Fjob - father's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
- reason - reason to choose this school (nominal: close to 'home', school 'reputation', 'course' preference or 'other')
- guardian - student's guardian (nominal: 'mother', 'father' or 'other')
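- traveltime - home to school travel time (numeric: 1 - less than 15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - more than 1 hour)
- studytime - weekly study time (numeric: 1 - less than 2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - more than 10 hours)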
- failures - number of past class failures (numeric: n if 1<=n<3, else 4)
- schoolsup - extra educational support (binary: yes or no)
- famsup - family educational support (binary: yes or no)
- paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
- activities - extra-curricular activities (binary: yes or no)
- nursery - attended nursery school (binary: yes or no)
- higher - wants to take higher education (binary: yes or no)
- internet - Internet access at home (binary: yes or no)
- romantic - with a romantic relationship (binary: yes or no)
- famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
- freetime - free time after school (numeric: from 1 - very low to 5 - very high)
- goout - going out with friends (numeric: from 1 - very low to 5 - very high)
- Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
- Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
- health - current health status (numeric: from 1 - very bad to 5 - very good)
- absences - number of school absences (numeric: from 0 to 93)

These grades are related with the course subject, Math or Portuguese:

1. G1 - first period grade (numeric: from 0 to 20)
2. G2 - second period grade (numeric: from 0 to 20)
3. G3 - final grade (numeric: from 0 to 20, output target)

## Loading the data from CSV

In [66]:
import pandas as pd

math = pd.read_csv("data/student-mat.csv")
portuguese = pd.read_csv("data/student-por.csv")

## Attributes of the mathematics dataset
The mathematics dataset has 395 records.

In [67]:
math_records_count = math.count()
math_records_len = len(math)

print('Math attributes with number of records:\n')
print(math_records_count)

Math attributes with number of records:

school        395
sex           395
age           395
address       395
famsize       395
Pstatus       395
Medu          395
Fedu          395
Mjob          395
Fjob          395
reason        395
guardian      395
traveltime    395
studytime     395
failures      395
schoolsup     395
famsup        395
paid          395
activities    395
nursery       395
higher        395
internet      395
romantic      395
famrel        395
freetime      395
goout         395
Dalc          395
Walc          395
health        395
absences      395
G1            395
G2            395
G3            395
dtype: int64
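
Because `count()` reports non-null values per column, identical counts equal to the number of rows suggest that no values are missing. The following is a minimal sketch of checking this explicitly; the frame `toy` and its values are hypothetical stand-ins for the real dataset, not data from the source files.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "school": ["GP", "GP", "MS"],
    "age": [15, 16, None],   # one missing value on purpose
    "G3": [10, 12, 14],
})

n_rows = len(toy)                  # number of records
non_null = toy.count()             # non-null values per column
missing = toy.isna().sum()         # missing values per column
has_missing = bool(toy.isna().any().any())

print(f"rows: {n_rows}")
print(missing)
print(f"any missing values: {has_missing}")
```

Applied to the real DataFrames, a column whose `count()` is lower than `len()` would show up in `missing` with a non-zero value.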


## Attributes of the Portuguese dataset
The Portuguese dataset has 649 records.

In [68]:
potuguese_records_count = portuguese.count()
potuguese_records_len = len(portuguese)

print('Portuguese attributes with number of records:\n')
print(potuguese_records_count)

Portuguese attributes with number of records:

school        649
sex           649
age           649
address       649
famsize       649
Pstatus       649
Medu          649
Fedu          649
Mjob          649
Fjob          649
reason        649
guardian      649
traveltime    649
studytime     649
failures      649
schoolsup     649
famsup        649
paid          649
activities    649
nursery       649
higher        649
internet      649
romantic      649
famrel        649
freetime      649
goout         649
Dalc          649
Walc          649
health        649
absences      649
G1            649
G2            649
G3            649
dtype: int64


## Merging the datasets
Some students appear in both courses. The data are anonymized and contain no unique identifiers that would make the datasets easy to merge.  
The authors recommend merging the datasets on the attributes that are identical for a given student in both courses and therefore identify that student. They provide an R script that merges the datasets and finds 382 students who take both courses. However, in the [dataset discussion](https://www.kaggle.com/uciml/student-alcohol-consumption/discussion/26889) users question this approach, because it does not use all of the attributes that are common to both courses.  
The datasets were therefore merged on all attributes except the following, which are specific to a particular course:
- failures - number of past class failures
- paid - whether the student took extra paid classes
- absences - number of absences
- G1
- G2
- G3
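
The choice of join keys can be sketched as follows: take every column that is not course-specific, then check whether any key combination occurs more than once. Repeated combinations would be matched several times by a merge and inflate the row count. The frame `toy_math` and its values are hypothetical and use only a few of the real columns.

```python
import pandas as pd

# Hypothetical toy data with a reduced set of columns (values are made up).
toy_math = pd.DataFrame({
    "school": ["GP", "GP", "MS", "GP"],
    "sex": ["F", "M", "F", "F"],
    "age": [15, 16, 17, 15],
    "failures": [0, 1, 0, 2],
    "G3": [10, 8, 14, 11],
})

# Attributes that differ between the two courses and cannot identify a student.
subject_specific = ["failures", "paid", "absences", "G1", "G2", "G3"]

# Join keys: every remaining column, in the original column order.
keys = [c for c in toy_math.columns if c not in subject_specific]

# Number of rows whose key combination is shared with at least one other row.
dup_count = int(toy_math.duplicated(subset=keys, keep=False).sum())

print(keys)
print(f"rows with a non-unique key combination: {dup_count}")
```

In this toy frame the first and last rows share the same keys, so they cannot be told apart as two distinct students by the keys alone.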

In [69]:
identical_attributes = [
    "school",
    "sex",
    "age",
    "address",
    "famsize",
    "Pstatus",
    "Medu",
    "Fedu",
    "Mjob",
    "Fjob",
    "reason",
    "guardian",
    "traveltime",
    "studytime",
    "schoolsup",
    "famsup",
    "activities",
    "nursery",
    "higher",
    "internet",
    "romantic",
    "famrel",
    "freetime",
    "goout",
    "Dalc",
    "Walc",
    "health"
]

both_courses = math.merge(
    portuguese,
    how='outer',
    on=identical_attributes,
    suffixes=('_math', '_portuguese')
)

> - The `pandas.DataFrame.merge` function was used to merge the two DataFrames.
> - The `how='outer'` parameter keeps all rows from both DataFrames, so no data are dropped.
> - The `on=identical_attributes` parameter performs the outer join on the combination of attributes that are identical in both DataFrames. These attributes act as the join keys.
> - The `suffixes=('_math', '_portuguese')` parameter renames the remaining columns that have the same name in both DataFrames, so that each resulting column belongs to one specific course.
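
As an illustration of the parameters described above, the sketch below runs the same kind of outer merge on two hypothetical toy frames and adds `indicator=True`, a `merge` option that is not used in the notebook. It adds a `_merge` column marking each row as `left_only`, `right_only` or `both`, which allows counting the groups directly.

```python
import pandas as pd

# Hypothetical toy frames (values are made up); "school" and "age" act as join keys.
toy_math = pd.DataFrame({
    "school": ["GP", "GP", "MS"],
    "age": [15, 16, 17],
    "G3": [10, 8, 14],
})
toy_por = pd.DataFrame({
    "school": ["GP", "MS", "MS"],
    "age": [15, 17, 18],
    "G3": [12, 13, 9],
})

merged = toy_math.merge(
    toy_por,
    how="outer",                        # keep rows from both frames
    on=["school", "age"],               # join keys
    suffixes=("_math", "_portuguese"),  # applied to the overlapping G3 column
    indicator=True,                     # adds the _merge column
)

counts = merged["_merge"].value_counts()
only_math = int(counts["left_only"])
only_portuguese = int(counts["right_only"])
both = int(counts["both"])

print(merged)
print(only_math, only_portuguese, both)
```

Rows present in only one frame get `NaN` in the other frame's suffixed columns, which is why the per-course columns in a merged dataset can have fewer non-null values than the key columns.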

## Attributes of the merged dataset of both courses

In [78]:
all_records_count = both_courses.count()
both_courses_records_len = len(both_courses)

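# Assumes every key combination matches at most one row in the other dataset.
# Merged rows missing from the Portuguese dataset are math-only students, and vice versa.
only_math_records_len = both_courses_records_len - potuguese_records_len
only_potuguese_records_len = both_courses_records_len - math_records_len
only_both_records_len = both_courses_records_len - (only_math_records_len + only_potuguese_records_len)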

all_students_len = only_math_records_len + only_potuguese_records_len + only_both_records_len

print('All merged attributes with number of records:\n')
print(all_records_count)

print(f'\nNumber of students only in math class: {only_math_records_len}')
print(f'Number of students only in portuguese class: {only_potuguese_records_len}')
print(f'Number of students in both classes: {only_both_records_len}')
print(f'Number of all students: {all_students_len}')

All merged attributes with number of records:

school                 674
sex                    674
age                    674
address                674
famsize                674
Pstatus                674
Medu                   674
Fedu                   674
Mjob                   674
Fjob                   674
reason                 674
guardian               674
traveltime             674
studytime              674
failures_math          395
schoolsup              674
famsup                 674
paid_math              395
activities             674
nursery                674
higher                 674
internet               674
romantic               674
famrel                 674
freetime               674
goout                  674
Dalc                   674
Walc                   674
health                 674
absences_math          395
G1_math                395
G2_math                395
G3_math                395
failures_portuguese    649
paid_portuguese        649
absences