# Learning from Complex Data Types

## Homework 3 for the NSU course

We will work on learning from complex data types, specifically from nine tables that describe the NBA league.

This Jupyter notebook uses **relational trees** for prediction. With machine learning methods, we predict whether a basketball team plays in the NBA or not.

In [1]:
# imports
from re3py.learners.tree import DecisionTree
from re3py.learners.random_forest import RandomForest
from re3py.learners.boosting import GradientBoosting
from re3py.learners.core.heuristic import *
from re3py.data.data_and_statistics import Dataset
from re3py.ranking.ensemble_ranking import EnsembleRanking
from re3py.utilities.cross_validation import train_test_split
from sklearn.metrics import accuracy_score

import pandas as pd
import math

Let us first look at the data. The main table, TEAMS, contains 95 teams, each described by 4 attributes: team, location, name, and league, which is the target variable.

In [2]:
teams = pd.read_csv("dn3/teams.csv", sep="\t") 
teams

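| Unnamed: 0 | team | location | name | league |
|---|---|---|---|---|
| 0 | ANA | Anaheim | Amigos | A |
| 1 | AND | Anderson | Duffey Packers | N |
| 2 | ATL | Atlanta | Hawks | N |
| 3 | BA1 | Baltimore | Bullets | N |
| 4 | BAL | Baltimore | Bullets | N |
| ... | ... | ... | ... | ... |
| 90 | ID1 | Indiana | Pacers | A |
| 91 | NY1 | New York | Nets | A |
| 92 | SA1 | San Antonio | Spurs | A |
| 93 | ST2 | St. Louis | Spirits | A |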
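Since `league` is the target, it helps to check how the two classes are distributed before modelling. The sketch below does this for only the nine rows visible in the preview above, not the full 95-row table. It uses the standard library so that it runs without `dn3/teams.csv`; the inline `sample` string is an illustrative stand-in for that file.

```python
import csv
import io
from collections import Counter

# The nine rows visible in the preview above, tab-separated like teams.csv.
# This is only a stand-in for the real file, which has 95 rows.
sample = (
    "team\tlocation\tname\tleague\n"
    "ANA\tAnaheim\tAmigos\tA\n"
    "AND\tAnderson\tDuffey Packers\tN\n"
    "ATL\tAtlanta\tHawks\tN\n"
    "BA1\tBaltimore\tBullets\tN\n"
    "BAL\tBaltimore\tBullets\tN\n"
    "ID1\tIndiana\tPacers\tA\n"
    "NY1\tNew York\tNets\tA\n"
    "SA1\tSan Antonio\tSpurs\tA\n"
    "ST2\tSt. Louis\tSpirits\tA\n"
)

# Parse the tab-separated text and count teams per league label.
rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))
league_counts = Counter(row["league"] for row in rows)

for league, count in sorted(league_counts.items()):
    print(f"{league}: {count}")
```

On the full table loaded with pandas, `teams['league'].value_counts()` gives the same kind of summary.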


To use re3py, we first have to generate three files:
1. the relations in the proper format (basket_opis.txt),
2. the list of target variables (basket_ciljna.txt), and
3. the constraints on the search over tests (basket.s).

We prepared these files with the notebook Priprava_podatkov.ipynb, so we can focus on building the model.

## BUILDING THE MODEL

Let us now see how well our relational trees perform.

In [3]:
# files
descriptive = 'basket_opis.txt'
target = 'basket_ciljna.txt'
s_file = 'basket.s'

# bundle everything into our Dataset
# Settings is not imported by name above; this assumes the wildcard import exposes it
s = Settings(s_file)
d = Dataset(s_file, descriptive, target)

The assignment requires stratified sampling (by the 'league' column). The methods in the re3py package already use stratified sampling by default.
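To make the idea of stratified sampling concrete, here is a minimal sketch in plain Python: the examples are grouped by class, each group is shuffled, and the same fraction of every group goes to the test set. The function `stratified_split` and the toy label counts are illustrative assumptions; this is not re3py's implementation.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction, seed):
    """Return (train_indices, test_indices) with class shares preserved."""
    rng = random.Random(seed)

    # Group example indices by their class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set.
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Toy labels: 12 teams of class 'N' and 8 of class 'A' (made-up counts).
labels = ["N"] * 12 + ["A"] * 8
train_idx, test_idx = stratified_split(labels, test_fraction=0.25, seed=2634)

test_labels = [labels[i] for i in test_idx]
print(len(train_idx), len(test_idx))
print(test_labels.count("N"), test_labels.count("A"))
```

With a quarter of each class held out, the 12:8 class ratio of the toy labels carries over to the test set as 3:2.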

In [4]:
# pass the parameters as a dictionary so they are easier to reuse
tree_params = {'heuristic': HeuristicGini(),
               'max_number_internal_nodes': 100,
               'max_number_atom_tests': 1,
               'allowed_atom_tests': s.get_atom_tests_structured(),
               'allowed_aggregators': s.get_aggregates(),
               'minimal_examples_in_leaf': 1,
               'max_number_of_evaluated_tests_per_node': 1000,
               'max_depth': 20,
               'only_existential': False}

# split into training and test sets
delitev = 0.25
# vzorci_stratificirano(d, delitev)
d_train, d_test = train_test_split(d, test_set=delitev, random_seed=2634)

Now we can finally build and train our random forest. To save the model as well, simply uncomment the last line.

In [5]:
# random forest
print("Building the random forest:")
rf = RandomForest(5, **tree_params)
rf.build(d_train)
rf.print_model('rel_rf.txt')
# rf.save_model('rel_rf.bin')

Building the random forest:
Building tree 1
  Building node on depth 1
    Building node on depth 2
      Building node on depth 3
      Building node on depth 3
    Building node on depth 2
induce        : 0.10590100288391113
statistics    : 0.000225067138671875
test value    : 0.03173422813415527
eval split    : 0.05303335189819336
example values: 0.0028421878814697266
find values   : 0.026587724685668945
nominal tests : 0
nom. t. time  : 0
numeric tests : 160
num t. time   : 0.05219388008117676

Building tree 2
  Building node on depth 1
    Building node on depth 2
    Building node on depth 2
induce        : 0.055047035217285156
statistics    : 0.0001621246337890625
test value    : 0.014001131057739258
eval split    : 0.029992341995239258
example values: 0.0016219615936279297
find values   : 0.011121273040771484
nominal tests : 0
nom. t. time  : 0
numeric tests : 80
num t. time   : 0.02958512306213379

Building tree 3
  Building node on depth 1
    Building node on depth 2
    Building

Let us now look at the accuracy of our model.

In [6]:
rf.compute_ranking(EnsembleRanking.genie3).print_ranking("genie3_rf.txt")

true_values = [e.target_part for e in d_test.target_data]
predicted_values = [rf.predict(e) for e in d_test.target_data]

# compute the accuracy
acc = accuracy_score(true_values, predicted_values)

print('The accuracy of the random forest model is: {0}.\n'.format(acc))

The accuracy of the random forest model is: 1.0.
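Accuracy is the share of test examples whose predicted label equals the true one, so 1.0 means every team in the test set was classified correctly. The sketch below recomputes that quantity by hand on made-up labels (not the notebook's predictions) and compares it with always predicting the most frequent class, a common sanity check on small test sets. The helper `accuracy` is an illustrative name.

```python
from collections import Counter


def accuracy(true_labels, predicted_labels):
    """Fraction of positions where the prediction matches the truth."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)


# Made-up labels for illustration only.
true_labels = ["N", "N", "N", "A", "A", "N", "A", "N"]
predicted = ["N", "N", "A", "A", "A", "N", "A", "N"]

# Baseline: always predict the most frequent true class.
majority = Counter(true_labels).most_common(1)[0][0]
baseline = [majority] * len(true_labels)

print(accuracy(true_labels, predicted))
print(accuracy(true_labels, baseline))
```

A model is only informative to the extent that it beats the majority-class baseline on the same test set.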



Finally, re3py also lets us print the most important (most useful) relations.

In [7]:
# most important relations
print(rf.compute_ranking(EnsembleRanking.genie3))

Attributes
coach_season_team[Y,X] :  4.04775e-01 ; iterations: [ 8.52238e-02 ,  7.77006e-02 ,  7.77006e-02 ,  7.71902e-02 ,  8.69599e-02 ]
team_season_o_asts[X,Y]:  8.15972e-03 ; iterations: [ 8.15972e-03 ,  0.00000e+00 ,  0.00000e+00 ,  0.00000e+00 ,  0.00000e+00 ]
team_season_won[X,Y]   :  7.54438e-03 ; iterations: [ 0.00000e+00 ,  0.00000e+00 ,  0.00000e+00 ,  7.54438e-03 ,  0.00000e+00 ]

Attributes summed
coach_season_team      :  4.04775e-01 ; iterations: [ 8.52238e-02 ,  7.77006e-02 ,  7.77006e-02 ,  7.71902e-02 ,  8.69599e-02 ]
team_season_o_asts     :  8.15972e-03 ; iterations: [ 8.15972e-03 ,  0.00000e+00 ,  0.00000e+00 ,  0.00000e+00 ,  0.00000e+00 ]
team_season_won        :  7.54438e-03 ; iterations: [ 0.00000e+00 ,  0.00000e+00 ,  0.00000e+00 ,  7.54438e-03 ,  0.00000e+00 ]

Aggregators
count                  :  4.04775e-01 ; iterations: [ 8.52238e-02 ,  7.77006e-02 ,  7.77006e-02 ,  7.71902e-02 ,  8.69599e-02 ]
sum                    :  1.57041e-02 ; iterations: [ 8.15972
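In the printed ranking, each total appears to be the sum of the per-tree values listed under `iterations`, and the `sum` aggregator's score appears to equal the combined score of `team_season_o_asts` and `team_season_won`. The check below uses only numbers copied from the output above; it is a reading of that output, not a description of how re3py computes GENIE3 scores internally.

```python
import math

# Per-tree contributions copied from the "Attributes" block above.
iterations = {
    "coach_season_team": [
        8.52238e-02, 7.77006e-02, 7.77006e-02, 7.71902e-02, 8.69599e-02,
    ],
    "team_season_o_asts": [8.15972e-03, 0.0, 0.0, 0.0, 0.0],
    "team_season_won": [0.0, 0.0, 0.0, 7.54438e-03, 0.0],
}

# Totals printed next to each relation in the same block.
reported = {
    "coach_season_team": 4.04775e-01,
    "team_season_o_asts": 8.15972e-03,
    "team_season_won": 7.54438e-03,
}

totals = {name: sum(values) for name, values in iterations.items()}
for name, total in totals.items():
    # The printed values are rounded to six significant digits,
    # so compare with a small relative tolerance.
    assert math.isclose(total, reported[name], rel_tol=1e-5)
    print(f"{name}: {total:.5e}")

# Compare with the 'sum' aggregator score printed as 1.57041e-02.
sum_aggregator = totals["team_season_o_asts"] + totals["team_season_won"]
print(f"sum aggregator: {sum_aggregator:.5e}")
```

Read this way, `coach_season_team` contributes in all five trees, while the other two relations each appear in a single tree.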

It would be interesting to see what happens when we increase the number of trees in the random forest, in our case to 30.

In [8]:
# random forest
rf30 = RandomForest(30, **tree_params)
rf30.build(d_train)
rf30.print_model('rel_rf30.txt')
# rf30.save_model('rel_rf30.bin')

rf30.compute_ranking(EnsembleRanking.genie3).print_ranking("genie3_rf30.txt")

true_values30 = [e.target_part for e in d_test.target_data]
predicted_values30 = [rf30.predict(e) for e in d_test.target_data]


Building tree 1
  Building node on depth 1
    Building node on depth 2
      Building node on depth 3
      Building node on depth 3
    Building node on depth 2
induce        : 0.11794090270996094
statistics    : 0.00024437904357910156
test value    : 0.01444101333618164
eval split    : 0.07384800910949707
example values: 0.0033812522888183594
find values   : 0.008718252182006836
nominal tests : 0
nom. t. time  : 0
numeric tests : 160
num t. time   : 0.07261133193969727

Building tree 2
  Building node on depth 1
    Building node on depth 2
    Building node on depth 2
induce        : 0.04933571815490723
statistics    : 0.00015974044799804688
test value    : 0.007520437240600586
eval split    : 0.03150796890258789
example values: 0.0015094280242919922
find values   : 0.0048046112060546875
nominal tests : 0
nom. t. time  : 0
numeric tests : 80
num t. time   : 0.031090736389160156

Building tree 3
  Building node on depth 1
    Building node on depth 2
    Building node on depth 2
ind

      Building node on depth 3
        Building node on depth 4
        Building node on depth 4
          Building node on depth 5
          Building node on depth 5
      Building node on depth 3
induce        : 0.1414167881011963
statistics    : 0.0003142356872558594
test value    : 0.01513528823852539
eval split    : 0.06788992881774902
example values: 0.0029349327087402344
find values   : 0.009774923324584961
nominal tests : 0
nom. t. time  : 0
numeric tests : 320
num t. time   : 0.06635904312133789

Building tree 19
  Building node on depth 1
    Building node on depth 2
    Building node on depth 2
      Building node on depth 3
      Building node on depth 3
induce        : 0.0976400375366211
statistics    : 0.0002493858337402344
test value    : 0.01339268684387207
eval split    : 0.05428361892700195
example values: 0.002553701400756836
find values   : 0.008885622024536133
nominal tests : 0
nom. t. time  : 0
numeric tests : 160
num t. time   : 0.05318737030029297

Building tree

In [9]:
acc30 = accuracy_score(true_values30, predicted_values30)

print('The accuracy of the random forest with 30 trees is: {0}.\n'.format(acc30))

# most important attributes
print(rf30.compute_ranking(EnsembleRanking.genie3))

The accuracy of the random forest with 30 trees is: 1.0.

Attributes
coach_season_team[Y,X]  :  3.83501e-01 ; iterations: [ 1.42040e-02 ,  1.29501e-02 ,  1.29501e-02 ,  1.28650e-02 ,  1.44933e-02 ,  1.24486e-02 ,  1.24660e-02 ,  1.24486e-02 ,  1.54482e-02 ,  1.35859e-02 ,  1.11822e-02 ,  1.15909e-02 ,  1.24954e-02 ,  1.07167e-02 ,  1.47321e-02 ,  1.27141e-02 ,  1.34384e-02 ,  1.33865e-02 ,  1.16402e-02 ,  1.20242e-02 ,  1.13426e-02 ,  1.27187e-02 ,  1.23548e-02 ,  1.16910e-02 ,  1.32604e-02 ,  1.31250e-02 ,  1.36218e-02 ,  1.33745e-02 ,  1.13426e-02 ,  1.28893e-02 ]
team_season_d_3pa[X,Y]  :  6.06740e-02 ; iterations: [ 0.00000e+00 ,  0.00000e+00 ,  0.00000e+00 ,  0.00000e+00 ,  0.00000e+00 ,  2.66204e-03 ,  0.00000e+00 ,  5.09259e-03 ,  2.01823e-03 ,  0.00000e+00 ,  5.50964e-03 ,  0.00000e+00 ,  4.73373e-03 ,  3.49794e-03 ,  2.29592e-03 ,  0.00000e+00 ,  2.29134e-03 ,  0.00000e+00 ,  8.16327e-03 ,  0.00000e+00 ,  0.00000e+00 ,  2.56000e-03 ,  2.89256e-03 ,  0.00000e+00 ,  2.77253e-03 ,  6.0