# Finding features strongly correlated with dropout

#### 1. Preparations

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Reading in the data
df = pd.read_csv('../data/df_cleaned.csv')
pd.set_option('display.max_columns', 47)
#df.head()

#### 2. Finding the highest absolute correlations

In [3]:
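# Pairwise correlations between all columns (pandas uses Pearson by default)
pairwise_correlations = df.corr()
# Correlation of every feature with the target; drop the target's correlation with itself by name
corr_target = pairwise_correlations["dropout"].drop("dropout").sort_values(ascending=False)
print(corr_target)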

cum.not.present                 0.258930
cum.negative.results            0.254195
sum_failed_grade                0.160554
cum.grade.F                     0.147267
cum.not.passed                  0.123601
prev.study.level.factor         0.091341
days.on.academic.leave          0.069695
nr.of.previous.unfinished       0.061592
nr.of.previous.studies.in.UT    0.020253
admission.special.conditions   -0.004943
days.as.visiting.student       -0.017513
cum.grade.E                    -0.020328
credits.cancelled.during.2w    -0.032097
normalized_score               -0.059434
nr.of.previous.finished        -0.060390
nr.of.employment.contracts     -0.068597
code.of.curriculum             -0.081637
days.studying.abroad           -0.085989
cum.grade.D                    -0.104435
on.extended.study.period       -0.113010
total_economic_support         -0.135218
cum.extracurricular.credits    -0.156286
study_period_in_years          -0.188723
debug.study.place.ID           -0.194996
year_immatricula

Above we can see the correlations between the features and the target variable. Now let's see which absolute correlations are the strongest.

In [4]:
# Take absolute values in one vectorized step and rank from strongest to weakest
corr_abs = corr_target.abs().sort_values(ascending=False)
print(corr_abs)

cum.credits.earned              0.361112
sum_passed_grade                0.330361
credits.registered              0.315927
year_exmatriculation            0.301935
cum.grade.B                     0.293743
nr.of.courses.registered        0.285358
cum.grade.A                     0.284695
active                          0.272223
cum.not.present                 0.258930
cum.passed                      0.257962
cum.negative.results            0.254195
nr.of.courses.with.any.grade    0.249691
cum.all.results                 0.237887
cum.grade.C                     0.206121
semester_current                0.204956
year_immatriculation            0.196055
debug.study.place.ID            0.194996
study_period_in_years           0.188723
sum_failed_grade                0.160554
cum.extracurricular.credits     0.156286
cum.grade.F                     0.147267
total_economic_support          0.135218
cum.not.passed                  0.123601
on.extended.study.period        0.113010
cum.grade.D     
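
The absolute ranking above no longer shows whether a feature is positively or negatively associated with dropout. As a minimal sketch, assuming pandas 1.1 or newer (which added the `key` argument to `sort_values`), the values can be ordered by magnitude while keeping their sign. The feature names and numbers in `toy_corr` are hypothetical and not taken from the dataset.

```python
import pandas as pd

# Hypothetical correlations with the target (illustrative values only)
toy_corr = pd.Series({
    "feat_a": -0.36,
    "feat_b": 0.26,
    "feat_c": -0.10,
    "feat_d": 0.02,
})

# Order by absolute value, but keep the signed values in the result
ranked = toy_corr.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```

Applied to `corr_target`, the same call would give the ordering of `corr_abs` with the direction of each relationship still visible.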

Based on the absolute correlations, the features most strongly correlated with dropout are:
1. **cum.credits.earned** - cumulative credits earned (volume of completed courses in EAP credit points)
2. **sum_passed_grade** - total number of positive results
3. **credits.registered** - total number of registered credits (EAP)
4. *year_exmatriculation* - possibly not truly relevant, because for all current students this field holds the year the dataset was downloaded
5. **cum.grade.B** - number of courses completed with grade B
6. **nr.of.courses.registered** - total number of registered courses
7. **cum.grade.A** - number of courses completed with grade A
8. *active* - whether the student is currently studying at the university; should not be relevant either
9. **cum.not.present** - number of courses with the result 'Not present'
10. **cum.passed** - number of courses with the result 'Pass'
11. **cum.negative.results** - total number of negative course results
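
Since `year_exmatriculation` and `active` are flagged above as probably not relevant, one option is to exclude them before picking the top features. The sketch below is an illustration only: `top_features` is a hypothetical helper, not part of the original notebook, and `corr_abs_example` repeats the first eight values from the printed output so that the block runs on its own.

```python
import pandas as pd

def top_features(corr_abs, exclude=(), k=5):
    """Return the k largest absolute correlations after dropping excluded names."""
    # errors="ignore" lets the call succeed even if a name is absent from the index
    kept = corr_abs.drop(labels=list(exclude), errors="ignore")
    return kept.sort_values(ascending=False).head(k)

# First eight absolute correlations copied from the output above
corr_abs_example = pd.Series({
    "cum.credits.earned": 0.361112,
    "sum_passed_grade": 0.330361,
    "credits.registered": 0.315927,
    "year_exmatriculation": 0.301935,
    "cum.grade.B": 0.293743,
    "nr.of.courses.registered": 0.285358,
    "cum.grade.A": 0.284695,
    "active": 0.272223,
})

# Columns the list above marks as questionable
suspect = ["year_exmatriculation", "active"]

selected = top_features(corr_abs_example, exclude=suspect, k=5)
print(selected)
```

In the notebook itself, the same helper could be called with the full `corr_abs` series instead of the example values.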