# 5. Data preprocessing


Contents:
* [Missing data](#puudu)
* [Exercise 5.1](#5_1)
* [Converting data of an unsuitable type (nominal, ordinal)](#sobimatu)
* [Scaling data: normalization and standardization](#skaleeri)
* [Selecting important attributes](#atr)
* [Exercise 5.2](#5_2)

See also:
* [scikit-learn preprocessing](https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing)
* Sebastian Raschka. Python Machine Learning. Ch. 4. Building Good Training Sets: Data Preprocessing.

<a id='puudu'></a>
## Missing data

Real-world datasets often contain missing values: a respondent skips a question, historical records are incomplete, and so on. Most data mining methods cannot cope with such input. The typical remedies are to drop the affected rows or columns, or to replace the missing values with the column mean.

In [50]:
import pandas as pd
from io import StringIO  # StringIO exposes a string as a file-like object

# Build a DataFrame df containing missing values (NaN, Not a Number) using read_csv()
csv_data = """A,B,C,D
1.0, 2.0, 3.0, 4.0
5.0, 6.0,, 8.0
0.0, 11.0, 12.0,"""
df = pd.read_csv(StringIO(csv_data))
df

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0
1,5.0,6.0,,8.0
2,0.0,11.0,12.0,


### Dropping

We use the `pandas` `DataFrame` method [dropna()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html).

In [51]:
print("Missing values per column\n", df.isnull().sum(axis=0))

Missing values per column
 A    0
B    0
C    1
D    1
dtype: int64


In [52]:
# Drop the rows that contain missing values
df_clean1 = df.dropna()
df_clean1

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0


In [53]:
# Drop the columns that contain missing values
df_clean2 = df.dropna(axis=1)
df_clean2

Unnamed: 0,A,B
0,1.0,2.0
1,5.0,6.0
2,0.0,11.0


In [54]:
# Drop only the rows in which all values are missing
df.dropna(how="all")

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0
1,5.0,6.0,,8.0
2,0.0,11.0,12.0,


In [55]:
# Keep only the columns (axis=1) that have at least 3 non-missing values
df.dropna(thresh=3, axis=1)

Unnamed: 0,A,B
0,1.0,2.0
1,5.0,6.0
2,0.0,11.0
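
The `thresh` argument counts non-missing values along whichever axis is being dropped. A minimal sketch of both directions follows; the frame `df_demo` is an illustrative copy of the toy data above, and its name is an assumption made for this example.

```python
import numpy as np
import pandas as pd

# Same toy values as in the cells above, written out explicitly
df_demo = pd.DataFrame({
    "A": [1.0, 5.0, 0.0],
    "B": [2.0, 6.0, 11.0],
    "C": [3.0, np.nan, 12.0],
    "D": [4.0, 8.0, np.nan],
})

# Rows (default axis=0): keep rows with at least 4 non-missing values
rows_kept = df_demo.dropna(thresh=4)

# Columns (axis=1): keep columns with at least 3 non-missing values
cols_kept = df_demo.dropna(thresh=3, axis=1)

print(rows_kept)
print(cols_kept)
```

Only the first row is complete, so it alone survives the row-wise call; the column-wise call keeps `A` and `B`.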


In [56]:
# Drop only the rows in which the value of column B or C is missing
df.dropna(subset=['B', 'C'])

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0
2,0.0,11.0,12.0,


### Replacing (imputation)

In [57]:
# Use the SimpleImputer class to replace missing values (missing_values=np.nan) with column means (strategy="mean")
# Older scikit-learn versions provided this as sklearn.preprocessing.Imputer:
#from sklearn.preprocessing import Imputer
from sklearn.impute import SimpleImputer
import numpy as np

imr = SimpleImputer(missing_values=np.nan, strategy="mean") 
imputed_data = imr.fit_transform(df)
imputed_data

array([[ 1. ,  2. ,  3. ,  4. ],
       [ 5. ,  6. ,  7.5,  8. ],
       [ 0. , 11. , 12. ,  6. ]])

In [58]:
# The imputed result imputed_data is a NumPy array.
# Convert it to a pandas DataFrame with the same column names as the original
imputed_df = pd.DataFrame(imputed_data, columns=df.columns)
imputed_df

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0
1,5.0,6.0,7.5,8.0
2,0.0,11.0,12.0,6.0


The pandas method [fillna()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html#pandas.DataFrame.fillna) can be used here as well; in this example we pass it the vector of column means (`df.mean()`) as the replacement values. The result is equivalent to the previous one.

In [59]:
df.mean()

A    2.000000
B    6.333333
C    7.500000
D    6.000000
dtype: float64

In [60]:
df = df.fillna(df.mean())

In [61]:
df

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,4.0
1,5.0,6.0,7.5,8.0
2,0.0,11.0,12.0,6.0
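
One difference from `fillna()` is that an imputer object remembers the fill values it learned, so the same values can later be applied to other data. The sketch below illustrates this under stated assumptions: `strategy="median"` and the extra row `X_new` are illustrative choices that do not appear in the example above.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Same toy values as above, as a plain NumPy array
X = np.array([[1.0, 2.0, 3.0, 4.0],
              [5.0, 6.0, np.nan, 8.0],
              [0.0, 11.0, 12.0, np.nan]])

# fit() learns one fill value per column (here: the column median)
median_imputer = SimpleImputer(missing_values=np.nan, strategy="median")
median_imputer.fit(X)
print(median_imputer.statistics_)

# transform() applies the learned values to other data (hypothetical new row)
X_new = np.array([[np.nan, np.nan, 1.0, 2.0]])
new_filled = median_imputer.transform(X_new)
print(new_filled)
```

The missing entries of `X_new` are filled with the medians of columns A and B learned from `X`, not with statistics of `X_new` itself.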


<a id='5_1'></a>
## Exercise 5.1

Remove from the DataFrame `df_ex` below the columns in which all values are missing. Replace the remaining missing values with the column means.

<!-- Use the [UCI Horse Colic dataset](https://archive.ics.uci.edu/ml/datasets/Horse+Colic) as the basis; it can be downloaded from https://archive.ics.uci.edu/ml/machine-learning-databases/horse-colic/horse-colic.data. -->

In [62]:
csv_ex = """A,B,C,D,E
1.0, 2.0, 3.0, 24.0,
15.0, 6.0,,8.0,
0.0, 11.0, 9.0,,"""
df_ex = pd.read_csv(StringIO(csv_ex))
df_ex

Unnamed: 0,A,B,C,D,E
0,1.0,2.0,3.0,24.0,
1,15.0,6.0,,8.0,
2,0.0,11.0,9.0,,



In [63]:
df_ex = df_ex.dropna(how="all", axis=1)
df_ex

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,24.0
1,15.0,6.0,,8.0
2,0.0,11.0,9.0,


In [64]:
df_ex = df_ex.fillna(df_ex.mean())
df_ex

Unnamed: 0,A,B,C,D
0,1.0,2.0,3.0,24.0
1,15.0,6.0,6.0,8.0
2,0.0,11.0,9.0,16.0


<a id='sobimatu'></a>
## Converting data of an unsuitable type (nominal, ordinal)

* Nominal data: categories without an order. Example: blood type A, B, AB, O.
* Ordinal data: the values are ordered, but the distance between levels is not necessarily uniform across the scale. Example: like very much, like, neutral, dislike, dislike very much.

Some methods (for example the perceptron) expect numeric data, so nominal attributes with more than two values, and sometimes ordinal attributes too, may need to be converted. The example below uses the columns `veregrupp` (blood type) and `valu` (pain), with the pain levels `puudub` (none), `kerge` (mild) and `intensiivne` (intense).

In [65]:
patsiendi_df = pd.DataFrame({"veregrupp": ["A", "AB", "A", "O", "O"], 
                             "valu": ["puudub", "puudub", "kerge", "intensiivne", "kerge"]})
patsiendi_df                            

Unnamed: 0,veregrupp,valu
0,A,puudub
1,AB,puudub
2,A,kerge
3,O,intensiivne
4,O,kerge


For ordinal attributes it is often sufficient to convert the labels to an integer scale. Mapping dictionaries work well here: each key is a value on the old scale and the corresponding dictionary value is the value on the new scale. The method `DataFrame[col].map(mapping)` then converts the column.

In [66]:
valu_map = {"puudub": 0, "kerge": 1, "intensiivne": 2}
patsiendi_df["valu"] = patsiendi_df["valu"].map(valu_map)
patsiendi_df

Unnamed: 0,veregrupp,valu
0,A,0
1,AB,0
2,A,1
3,O,2
4,O,1
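
A mapping dictionary can also be inverted to recover the original labels from the integer codes. The following is a minimal sketch of that idea; the names `inverse_valu_map`, `encoded` and `decoded` are introduced here for illustration only.

```python
import pandas as pd

valu_map = {"puudub": 0, "kerge": 1, "intensiivne": 2}

# Swap keys and values to get the reverse mapping (code -> label)
inverse_valu_map = {code: label for label, code in valu_map.items()}

valu = pd.Series(["puudub", "puudub", "kerge", "intensiivne", "kerge"])
encoded = valu.map(valu_map)             # labels -> integers
decoded = encoded.map(inverse_valu_map)  # integers -> labels

print(encoded.tolist())
print(decoded.tolist())
```

Note that `map()` yields `NaN` for any label that is missing from the dictionary, so it is worth checking the result for missing values after the conversion.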


This approach is unsuitable for nominal attributes: a numeric order would be misleading because no meaningful order exists. Instead, a separate 0/1 attribute is created for each nominal category (0: the value was absent, 1: the value was present). This can be done with the class [OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) from the `sklearn` submodule `preprocessing`, which in current scikit-learn versions accepts both numeric and string categories, or with the `pandas` function [get_dummies(df)](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html), which converts the columns that contain strings.

In [67]:
patsiendi_df = pd.get_dummies(patsiendi_df)
patsiendi_df

Unnamed: 0,valu,veregrupp_A,veregrupp_AB,veregrupp_O
0,0,1,0,0
1,0,0,1,0
2,1,1,0,0
3,2,0,0,1
4,1,0,0,1
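
For comparison, here is a minimal sketch of the same encoding done with `OneHotEncoder`. Unlike `get_dummies()`, a fitted encoder can be reused on data that arrives later. The value `"B"` in `new` and the choice `handle_unknown="ignore"` are assumptions made for this illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"veregrupp": ["A", "AB", "A", "O", "O"]})

# handle_unknown="ignore" encodes categories unseen during fit() as all zeros
encoder = OneHotEncoder(handle_unknown="ignore")
train_onehot = encoder.fit_transform(train).toarray()  # default output is sparse

# "B" is a hypothetical new value that did not occur in the training data
new = pd.DataFrame({"veregrupp": ["O", "B"]})
new_onehot = encoder.transform(new).toarray()

print(list(encoder.categories_[0]))
print(train_onehot)
print(new_onehot)
```

The learned categories are stored in `encoder.categories_`, so the column layout stays the same no matter which values occur in the later data.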


When different columns need different transformations, the class [ColumnTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html#sklearn.compose.ColumnTransformer) can be used. It assigns a transformer object (which must provide `fit()` and `transform()` methods) to each column. The ColumnTransformer constructor takes a list of (name_str, transformer, column_id or list of column_ids) triples.

In [68]:
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer

patsiendi_df2 = pd.DataFrame({"veregrupp": ["A", "AB", "A", "O", "O"], 
                             "valu": ["puudub", "puudub", "kerge", "intensiivne", "kerge"]})

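# NB! By default OrdinalEncoder orders string categories alphabetically, which is effectively a nominal, not an ordinal, scale.
# The categories argument sets the order explicitly; here intensiivne -> 0, kerge -> 1, puudub -> 2.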
ct = ColumnTransformer([("int_valu", OrdinalEncoder(categories=[["intensiivne", "kerge", "puudub"]]), [1]),("binarize_veri", OneHotEncoder(), [0])])
ct.fit_transform(patsiendi_df2)

array([[2., 1., 0., 0.],
       [2., 0., 1., 0.],
       [1., 1., 0., 0.],
       [0., 0., 0., 1.],
       [1., 0., 0., 1.]])

<a id='skaleeri'></a>
## Scaling data: normalization and standardization

Most machine learning algorithms, decision trees excepted, work better when all attributes are on the same scale.
There are two fundamental approaches:
* **Normalization** to the \[0..1\] scale.
* **Standardization** to a scale where the mean is 0 and the standard deviation is 1.

The new value $x_{norm}^i$ of the element in row $i$ of the normalized column $x$:
$$ x_{norm}^i = \frac{x^i - x_{min}}{x_{max} - x_{min}}$$
$x^i$: old value; $x_{max}, x_{min}$: maximum and minimum value of column $x$.
Normalization is handled by the class [sklearn.preprocessing.MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html).

The new value $x_{std}^i$ of the element in row $i$ of the standardized column $x$:
$$ x_{std}^i = \frac{x^i - \mu_x}{\sigma_x}$$
$x^i$: old value; $\mu_x, \sigma_x$: mean and standard deviation of column $x$.
Standardization is handled by the class [sklearn.preprocessing.StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html).
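
The two formulas can be checked by hand on a single column. The sketch below applies them with NumPy to the integer-coded `valu` values and compares the results with the scikit-learn classes. One detail to keep in mind: `StandardScaler` uses the population standard deviation, which matches NumPy's default (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# The integer-coded `valu` column from the earlier example
x = np.array([0.0, 0.0, 1.0, 2.0, 1.0])

# Normalization: (x - min) / (max - min)
x_norm = (x - x.min()) / (x.max() - x.min())

# Standardization: (x - mean) / std, with the population std (ddof=0)
x_std = (x - x.mean()) / x.std()

# The scikit-learn scalers expect a 2-D array: one column, five rows
col = x.reshape(-1, 1)
sk_norm = MinMaxScaler().fit_transform(col).ravel()
sk_std = StandardScaler().fit_transform(col).ravel()

print(x_norm, np.allclose(x_norm, sk_norm))
print(x_std, np.allclose(x_std, sk_std))
```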


In [69]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler
norm_data = MinMaxScaler().fit_transform(patsiendi_df)
print(norm_data)
# Create a new DataFrame holding the normalized data
patsiendi_df_norm = pd.DataFrame(norm_data, columns=patsiendi_df.columns)
print("\n", patsiendi_df_norm)

[[0.  1.  0.  0. ]
 [0.  0.  1.  0. ]
 [0.5 1.  0.  0. ]
 [1.  0.  0.  1. ]
 [0.5 0.  0.  1. ]]

    valu  veregrupp_A  veregrupp_AB  veregrupp_O
0   0.0          1.0           0.0          0.0
1   0.0          0.0           1.0          0.0
2   0.5          1.0           0.0          0.0
3   1.0          0.0           0.0          1.0
4   0.5          0.0           0.0          1.0


In [70]:
# Alternatively, overwrite the existing DataFrame in place
# using the loc attribute with a full slice.
patsiendi_df.loc[:,:] = StandardScaler().fit_transform(patsiendi_df)
patsiendi_df

Unnamed: 0,valu,veregrupp_A,veregrupp_AB,veregrupp_O
0,-1.069045,1.224745,-0.5,-0.816497
1,-1.069045,-0.816497,2.0,-0.816497
2,0.267261,1.224745,-0.5,-0.816497
3,1.603567,-0.816497,-0.5,1.224745
4,0.267261,-0.816497,-0.5,1.224745


<a id='atr'></a>
## Selecting important attributes (*feature selection*)

With a large number of attributes, especially when the number of objects is relatively small, overfitting arises easily: an object's class can be predicted from a combination of attributes that is unique to that single object. Classifiers trained this way appear to perform well on the training data, but their prediction accuracy drops sharply on new data. Such models are also complex and generalize poorly.

It is therefore often useful to reduce the number of attributes. There are two fundamental strategies:
* Defining new attributes as combinations of existing ones, i.e. dimensionality reduction, which is covered in next week's topic (*feature extraction*).
* Selecting the most suitable of the existing attributes (*feature selection*).



The number of attributes used for prediction can often be reduced with classifier-specific methods. For logistic regression, for example, `penalty="l1"` (L1 regularization, lasso) combined with a small value of the parameter `C` (the inverse of the regularization strength) increases the number of zero weights.

https://scikit-learn.org/stable/auto_examples/linear_model/plot_logistic_l1_l2_sparsity.html

A regularization penalty is added to the cost function J(**w**) of the weight vector **w**. The penalty term depends on the type of regularization:

L1: $$ \sum_{j=1}^{m} |w_j|$$

L2 (penalizes large $w_j$ values more heavily): $$ \sum_{j=1}^{m} w_j^2$$
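
As a small worked example, the sketch below evaluates both penalty terms for one weight vector. The vector `w` is hypothetical and chosen only to show how a single large weight dominates the L2 penalty more strongly than the L1 penalty.

```python
import numpy as np

# Hypothetical weight vector with one large weight and one zero weight
w = np.array([0.5, -2.0, 0.0, 1.0])

l1_penalty = np.sum(np.abs(w))  # 0.5 + 2.0 + 0.0 + 1.0
l2_penalty = np.sum(w ** 2)     # 0.25 + 4.0 + 0.0 + 1.0

# Share of each penalty that is caused by the largest weight (-2.0)
share_l1 = abs(w[1]) / l1_penalty
share_l2 = w[1] ** 2 / l2_penalty

print(l1_penalty, l2_penalty)
print(round(share_l1, 2), round(share_l2, 2))
```

Under L2 the gain from shrinking an already small weight is tiny, while under L1 it is proportional to the reduction itself. This is the usual informal explanation for why L1 tends to drive weights exactly to zero.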




In [71]:
from sklearn.linear_model import LogisticRegression
from sklearn import datasets

iris = datasets.load_iris()
clf_general = LogisticRegression(penalty="l2", C=0.1, solver='liblinear')
clf_general.fit(iris.data, iris.target)
print("Weights found by ordinary logistic regression (penalty='l2' and C=0.1):")
print(clf_general.coef_)

clf_sparse = LogisticRegression(penalty="l1", C=0.1, solver='liblinear')
clf_sparse.fit(iris.data, iris.target)
print("\n\nWeights found by logistic regression with penalty='l1' and C=0.1:")
print(clf_sparse.coef_)


Weights found by ordinary logistic regression (penalty='l2' and C=0.1):
[[ 0.21310863  0.776912   -1.23987617 -0.55518357]
 [ 0.04423007 -0.62688956  0.29231476 -0.24855524]
 [-0.63387401 -0.5619517   0.94991857  0.78619837]]


Weights found by logistic regression with penalty='l1' and C=0.1:
[[ 0.          1.12163519 -1.34347876  0.        ]
 [ 0.         -0.38716527  0.12337301  0.        ]
 [-0.98749592  0.          1.27657229  0.        ]]


In [72]:
# Present the weights as a readable DataFrame
print("Ordinary:")
general_coef_df = pd.DataFrame(clf_general.coef_, 
                             columns=iris.feature_names,
                             index=iris.target_names)
general_coef_df

Ordinary:


Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
setosa,0.213109,0.776912,-1.239876,-0.555184
versicolor,0.04423,-0.62689,0.292315,-0.248555
virginica,-0.633874,-0.561952,0.949919,0.786198


In [73]:
# Present the weights as a readable DataFrame
print("penalty='l1' and C=0.1:")
sparse_coef_df = pd.DataFrame(clf_sparse.coef_, 
                             columns=iris.feature_names,
                             index=iris.target_names)
sparse_coef_df

penalty='l1' and C=0.1:


Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
setosa,0.0,1.121635,-1.343479,0.0
versicolor,0.0,-0.387165,0.123373,0.0
virginica,-0.987496,0.0,1.276572,0.0


The class [SelectKBest](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html) selects the $k$ best attributes according to a given scoring function. The scoring function is typically a statistical test ($\chi^2$, ANOVA F statistic, ...); related selectors such as `SelectFpr` instead choose attributes based on a false positive rate test.

In [74]:
from sklearn.feature_selection import SelectKBest, chi2

# Create a SelectKBest transformer,
# then fit it and obtain the result with fit_transform()
k_best = SelectKBest(chi2, k=2)
X_new = k_best.fit_transform(iris.data, iris.target)
print(k_best.scores_)
X_new[:10]

[ 10.81782088   3.7107283  116.31261309  67.0483602 ]


array([[1.4, 0.2],
       [1.4, 0.2],
       [1.3, 0.2],
       [1.5, 0.2],
       [1.4, 0.2],
       [1.7, 0.4],
       [1.4, 0.3],
       [1.5, 0.2],
       [1.4, 0.2],
       [1.5, 0.1]])
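
The transformed array no longer carries column names. A minimal sketch of how the names of the selected attributes could be recovered with the selector's `get_support()` mask follows; the variable names `mask` and `selected` are illustrative.

```python
from sklearn import datasets
from sklearn.feature_selection import SelectKBest, chi2

iris = datasets.load_iris()
k_best = SelectKBest(chi2, k=2).fit(iris.data, iris.target)

# get_support() returns a boolean mask over the original columns
mask = k_best.get_support()
selected = [name for name, keep in zip(iris.feature_names, mask) if keep]

print(mask)
print(selected)
```

For the iris data the two attributes with the highest $\chi^2$ scores are the petal length and petal width, in line with the scores printed above.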

Attributes can also be selected using the `feature_importances_` attribute of the random forest ensemble method [RandomForest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).

In [75]:
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=1000, random_state=0, n_jobs=-1)
forest.fit(iris.data, iris.target)
print(forest.feature_importances_)

[0.09792816 0.02518223 0.4422498  0.43463981]


In [76]:
# Print the importances together with the attribute names, sorted in descending order of importance.
for f, c in sorted(zip(forest.feature_importances_, iris.feature_names), reverse=True):
    print(c, round(f, 3))

petal length (cm) 0.442
petal width (cm) 0.435
sepal length (cm) 0.098
sepal width (cm) 0.025
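
The importances can also be turned into an actual selection step. The sketch below uses scikit-learn's `SelectFromModel` with an already fitted forest. The threshold of 0.2 and the smaller `n_estimators=100` are assumptions chosen for this illustration, not values from the example above.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

iris = datasets.load_iris()
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(iris.data, iris.target)

# Keep the attributes whose importance is at least the (hypothetical) threshold
selector = SelectFromModel(forest, threshold=0.2, prefit=True)
X_selected = selector.transform(iris.data)

selected = [name for name, keep
            in zip(iris.feature_names, selector.get_support()) if keep]
print(X_selected.shape)
print(selected)
```

A suitable threshold depends on the data; a common alternative is to rank the importances and keep a fixed number of the top attributes.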


<a id='5_2'></a>
## Exercise 5.2

a) Load the UCI [zoo](http://archive.ics.uci.edu/ml/machine-learning-databases/zoo/) dataset, familiar from earlier exercises, into a DataFrame (or use the example code given below).

Which of the three attributes `aquatic`, `legs`, `type` have more than two possible values? Which of them is a nominal attribute (i.e. its values have no meaningful order)? Why should such an attribute not be treated as a number, and why should it be replaced by $n$ binary attributes, one for each of its $n$ possible values? Which of the two attributes is a number, and what information would be lost to machine learning methods if it were replaced by $n$ binary attributes? Write the answers and reasoning in a new *Markdown* cell.

Convert the nominal attribute with more than two values to binary form using the function [pd.get_dummies(dataframe, columns=[col1, col2,..])](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html), where `columns` is the list of column names to convert. Display the converted table.

Then normalize the whole table to the [0..1] scale. Display the column `legs`.


In [77]:
# Load the data
import pandas as pd
zoo_df = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/zoo/zoo.data", 
                     header=None, index_col=0,
                     names=["animal_name", "hair", "feathers", "eggs", "milk", "airborne", "aquatic", "predator", "toothed", "backbone", "breathes", "venomous", "fins", "legs", "tail", "domestic", "catsize", "type" ])
zoo_df

animal_name,hair,feathers,eggs,milk,airborne,aquatic,predator,toothed,backbone,breathes,venomous,fins,legs,tail,domestic,catsize,type
aardvark,1,0,0,1,0,0,1,1,1,1,0,0,4,0,0,1,1
antelope,1,0,0,1,0,0,0,1,1,1,0,0,4,1,0,1,1
bass,0,0,1,0,0,1,1,1,1,0,0,1,0,1,0,0,4
bear,1,0,0,1,0,0,1,1,1,1,0,0,4,0,0,1,1
boar,1,0,0,1,0,0,1,1,1,1,0,0,4,1,0,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
wallaby,1,0,0,1,0,0,0,1,1,1,0,0,2,1,0,1,1
wasp,1,0,1,0,1,0,0,0,0,1,1,0,6,0,0,0,6
wolf,1,0,0,1,0,0,1,1,1,1,0,0,4,1,0,1,1
worm,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,7


`legs` and `type` have more than two distinct values.  
`type` has no meaningful order.  
It cannot be treated as a number, because arithmetic such as means and sums of its values has no meaningful interpretation.  
The information in `legs` would be lost if it were replaced by separate binary attributes, because `legs` also carries a genuine numeric meaning.  


In [80]:
zoo_nominal_df = pd.get_dummies(zoo_df, columns=["type"])
zoo_nominal_df

animal_name,hair,feathers,eggs,milk,airborne,aquatic,predator,toothed,backbone,breathes,...,tail,domestic,catsize,type_1,type_2,type_3,type_4,type_5,type_6,type_7
aardvark,1,0,0,1,0,0,1,1,1,1,...,0,0,1,1,0,0,0,0,0,0
antelope,1,0,0,1,0,0,0,1,1,1,...,1,0,1,1,0,0,0,0,0,0
bass,0,0,1,0,0,1,1,1,1,0,...,1,0,0,0,0,0,1,0,0,0
bear,1,0,0,1,0,0,1,1,1,1,...,0,0,1,1,0,0,0,0,0,0
boar,1,0,0,1,0,0,1,1,1,1,...,1,0,1,1,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
wallaby,1,0,0,1,0,0,0,1,1,1,...,1,0,1,1,0,0,0,0,0,0
wasp,1,0,1,0,1,0,0,0,0,1,...,0,0,0,0,0,0,0,0,1,0
wolf,1,0,0,1,0,0,1,1,1,1,...,1,0,1,1,0,0,0,0,0,0
worm,0,0,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,1


In [98]:
from sklearn.preprocessing import MinMaxScaler
zoo_norm_df = pd.DataFrame(MinMaxScaler().fit_transform(zoo_nominal_df), columns=zoo_nominal_df.columns)
zoo_norm_df["legs"]

0      0.50
1      0.50
2      0.00
3      0.50
4      0.50
       ... 
96     0.25
97     0.75
98     0.50
99     0.00
100    0.25
Name: legs, Length: 101, dtype: float64

b) For the processed dataset:
* Take the column `aquatic` as the class to predict again, and remove it from the rest of the data as the class vector (for example with the DataFrame [pop() method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.pop.html)).
* Fit a logistic regression model in which the number of nonzero weights is minimized using L1 regularization, C=0.2. Which attributes (by name) have a nonzero weight in the logistic regression model?
* Estimate attribute importance with the random forest method. Which attributes are important?

In [100]:
y = zoo_norm_df.pop("aquatic")
zoo_norm_df

Unnamed: 0,hair,feathers,eggs,milk,airborne,predator,toothed,backbone,breathes,venomous,...,tail,domestic,catsize,type_1,type_2,type_3,type_4,type_5,type_6,type_7
0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,...,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,...,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
3,1.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,...,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,...,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
96,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,...,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
97,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
98,1.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0,0.0,...,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
99,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


In [107]:
from sklearn.linear_model import LogisticRegression
logistic_reg_model = LogisticRegression(penalty="l1", C=0.2, solver='liblinear')
logistic_reg_model.fit(zoo_norm_df, y)
reg_model_df = pd.DataFrame(logistic_reg_model.coef_, columns=zoo_norm_df.columns)
reg_model_df

Unnamed: 0,hair,feathers,eggs,milk,airborne,predator,toothed,backbone,breathes,venomous,...,tail,domestic,catsize,type_1,type_2,type_3,type_4,type_5,type_6,type_7
0,-0.582058,0.0,0.0,0.0,0.0,0.373983,0.0,0.0,-0.989749,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [110]:
reg_model_df.loc[:, (reg_model_df != 0).any(axis=0)]

Unnamed: 0,hair,predator,breathes,fins
0,-0.582058,0.373983,-0.989749,0.939242


In [113]:
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=1000, random_state=0, n_jobs=-1)
forest.fit(zoo_norm_df.values, y)
print(forest.feature_importances_)

[0.05805527 0.01069933 0.04477299 0.02137722 0.03103655 0.07547241
 0.04125351 0.01253522 0.16358367 0.01648437 0.14999656 0.10932531
 0.01797138 0.01557888 0.03385294 0.02206182 0.01139779 0.01848126
 0.05975968 0.05291831 0.01523548 0.01815001]


In [120]:
for f, c in sorted(zip(forest.feature_importances_, zoo_norm_df.columns), reverse=True):
    print(f"Prop: {c:<10}, importance: {round(f, 3)}")

Prop: breathes  , importance: 0.164
Prop: fins      , importance: 0.15
Prop: legs      , importance: 0.109
Prop: predator  , importance: 0.075
Prop: type_4    , importance: 0.06
Prop: hair      , importance: 0.058
Prop: type_5    , importance: 0.053
Prop: eggs      , importance: 0.045
Prop: toothed   , importance: 0.041
Prop: catsize   , importance: 0.034
Prop: airborne  , importance: 0.031
Prop: type_1    , importance: 0.022
Prop: milk      , importance: 0.021
Prop: type_3    , importance: 0.018
Prop: type_7    , importance: 0.018
Prop: tail      , importance: 0.018
Prop: venomous  , importance: 0.016
Prop: domestic  , importance: 0.016
Prop: type_6    , importance: 0.015
Prop: backbone  , importance: 0.013
Prop: type_2    , importance: 0.011
Prop: feathers  , importance: 0.011
