# Training and Validation

## Dataset

### Raw data

In [1]:
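import aiohttp

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data"

# Top-level `async with` / `await` works here because Jupyter runs cells inside an event loop.
async with aiohttp.ClientSession() as session:
    async with session.get(url) as resp:
        resp.raise_for_status()  # fail early on an HTTP error status
        textresponse = await resp.text()

# Show the first 1000 characters of the raw file.
print(textresponse[:1000])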

18.0   8   307.0      130.0      3504.      12.0   70  1	"chevrolet chevelle malibu"
15.0   8   350.0      165.0      3693.      11.5   70  1	"buick skylark 320"
18.0   8   318.0      150.0      3436.      11.0   70  1	"plymouth satellite"
16.0   8   304.0      150.0      3433.      12.0   70  1	"amc rebel sst"
17.0   8   302.0      140.0      3449.      10.5   70  1	"ford torino"
15.0   8   429.0      198.0      4341.      10.0   70  1	"ford galaxie 500"
14.0   8   454.0      220.0      4354.       9.0   70  1	"chevrolet impala"
14.0   8   440.0      215.0      4312.       8.5   70  1	"plymouth fury iii"
14.0   8   455.0      225.0      4425.      10.0   70  1	"pontiac catalina"
15.0   8   390.0      190.0      3850.       8.5   70  1	"amc ambassador dpl"
15.0   8   383.0      170.0      3563.      10.0   70  1	"dodge challenger se"
14.0   8   340.0      160.0      3609.       8.0   70  1	"plymouth 'cuda 340"
15.0   8   400.0      150.0      3761.       9.5   70  1	"chevrolet monte ca

### Reformat

In [2]:
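import re

# Collapse runs of spaces, then turn the tab before the car name into a space,
# so that every field is separated by exactly one space.
textresponse = re.sub(' +', ' ', textresponse)
textresponse = re.sub('\t', ' ', textresponse)
print(textresponse[:1000])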

18.0 8 307.0 130.0 3504. 12.0 70 1 "chevrolet chevelle malibu"
15.0 8 350.0 165.0 3693. 11.5 70 1 "buick skylark 320"
18.0 8 318.0 150.0 3436. 11.0 70 1 "plymouth satellite"
16.0 8 304.0 150.0 3433. 12.0 70 1 "amc rebel sst"
17.0 8 302.0 140.0 3449. 10.5 70 1 "ford torino"
15.0 8 429.0 198.0 4341. 10.0 70 1 "ford galaxie 500"
14.0 8 454.0 220.0 4354. 9.0 70 1 "chevrolet impala"
14.0 8 440.0 215.0 4312. 8.5 70 1 "plymouth fury iii"
14.0 8 455.0 225.0 4425. 10.0 70 1 "pontiac catalina"
15.0 8 390.0 190.0 3850. 8.5 70 1 "amc ambassador dpl"
15.0 8 383.0 170.0 3563. 10.0 70 1 "dodge challenger se"
14.0 8 340.0 160.0 3609. 8.0 70 1 "plymouth 'cuda 340"
15.0 8 400.0 150.0 3761. 9.5 70 1 "chevrolet monte carlo"
14.0 8 455.0 225.0 3086. 10.0 70 1 "buick estate wagon (sw)"
24.0 4 113.0 95.00 2372. 15.0 70 3 "toyota corona mark ii"
22.0 6 198.0 95.00 2833. 15.5 70 1 "plymouth duster"
18.0 6 199.0 97.00 2774. 15.5 70 1 "amc hornet"
21.0 6 200.0 85.00 2587. 16.0 70 1 "ford maverick"
27.0 4 97.00 8
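
The two substitutions above can also be written as a single pattern. The following is a minimal sketch, not part of the original notebook: the helper name `normalize_whitespace` is hypothetical, and the two sample lines are copied from the raw output shown earlier. It also uses the standard `csv` module to check that the quoted car name stays one field once the file is space-delimited.

```python
import csv
import re
from io import StringIO

# Two sample lines copied from the raw output (columns padded with spaces,
# car name separated from the numbers by a tab).
sample = (
    '18.0   8   307.0      130.0      3504.      12.0   70  1\t"chevrolet chevelle malibu"\n'
    '14.0   8   340.0      160.0      3609.       8.0   70  1\t"plymouth \'cuda 340"\n'
)

def normalize_whitespace(text):
    """Collapse every run of spaces and tabs into a single space (hypothetical helper)."""
    return re.sub(r"[ \t]+", " ", text)

normalized = normalize_whitespace(sample)

# A space-delimited CSV reader treats the double-quoted name as one field,
# even though the name itself contains spaces.
rows = list(csv.reader(StringIO(normalized), delimiter=" "))

for row in rows:
    print(len(row), row[-1])
```

Each row should split into nine fields, which is the number of column names passed to pandas in the next step.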

### Loading into pandas

In [3]:
import pandas as pd
from io import StringIO

pd.set_option("display.max_columns", 7)

# The file has no header row, so the column names are supplied explicitly.
columns = ["mpg", "cylinders", "displacement", "horsepower", "weight",
           "acceleration", "year", "origin", "name"]
df = pd.read_csv(StringIO(textresponse), sep=" ", names=columns)
df

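|     | mpg  | cylinders | displacement | ... | year | origin | name                      |
|-----|------|-----------|--------------|-----|------|--------|---------------------------|
| 0   | 18.0 | 8         | 307.0        | ... | 70   | 1      | chevrolet chevelle malibu |
| 1   | 15.0 | 8         | 350.0        | ... | 70   | 1      | buick skylark 320         |
| 2   | 18.0 | 8         | 318.0        | ... | 70   | 1      | plymouth satellite        |
| 3   | 16.0 | 8         | 304.0        | ... | 70   | 1      | amc rebel sst             |
| 4   | 17.0 | 8         | 302.0        | ... | 70   | 1      | ford torino               |
| ... | ...  | ...       | ...          | ... | ...  | ...    | ...                       |
| 393 | 27.0 | 4         | 140.0        | ... | 82   | 1      | ford mustang gl           |
| 394 | 44.0 | 4         | 97.0         | ... | 82   | 2      | vw pickup                 |
| 395 | 32.0 | 4         | 135.0        | ... | 82   | 1      | dodge rampage             |
| 396 | 28.0 | 4         | 120.0        | ... | 82   | 1      | ford ranger               |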


## Continuous and categorical variables

According to the source:

http://psychology.okstate.edu/faculty/jgrice/psyc3214/Stevens_FourScales_1946.pdf

### Text data

- Nominal = categories without an order (e.g. city name)
- Ordinal = categories with an order (e.g. level of education)

### Numeric data

- Interval = no natural zero point, so differences are meaningful but ratios are not (e.g. temperature in °C)
- Ratio = has a true minimum (zero) value, so ratios are meaningful (e.g. speed)
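
The difference between the interval and ratio scales can be made concrete with temperature. The sketch below is illustrative only: the helper `celsius_to_kelvin` and the two sample temperatures are assumptions, not part of the original notebook. It shows that a difference survives the change of scale, whereas a ratio computed on the Celsius (interval) scale does not.

```python
def celsius_to_kelvin(celsius):
    """Convert a Celsius temperature (interval scale) to kelvins (ratio scale)."""
    return celsius + 273.15

t1_c, t2_c = 10.0, 20.0  # illustrative values, not taken from the dataset

# Differences are meaningful on an interval scale and survive the conversion.
diff_c = t2_c - t1_c
diff_k = celsius_to_kelvin(t2_c) - celsius_to_kelvin(t1_c)

# Ratios are meaningful only on a ratio scale: 20 °C is not "twice as hot" as 10 °C.
ratio_c = t2_c / t1_c
ratio_k = celsius_to_kelvin(t2_c) / celsius_to_kelvin(t1_c)

print("difference:", diff_c, "vs", round(diff_k, 6))
print("ratio:", ratio_c, "vs", round(ratio_k, 4))
```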

### Encoding values

One of the most common operations on data is normalization. It puts values measured on different scales onto a common scale so that they can be compared directly.

In machine learning, the Z-score is often used for this:

$z=\frac{x-\mu}{\sigma}$

where

$\mu=\frac{1}{N}\sum_{i=1}^{N}x_i$

and

$\sigma=\sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2}$
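
To make the formulas concrete, here is a minimal sketch that applies them directly with the standard library. The function name `z_scores` and the sample values are illustrative assumptions, not part of the original notebook; the sample is chosen so that the mean is 5 and the standard deviation is 2.

```python
import math

def z_scores(values):
    """Standardize values using the population mean and standard deviation."""
    n = len(values)
    mu = sum(values) / n
    # Population standard deviation: divide by n, as in the formula for sigma.
    sigma = math.sqrt(sum((x - mu) ** 2 for x in values) / n)
    return [(x - mu) / sigma for x in values]

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values: mean 5, sigma 2
z = z_scores(sample)
print(z)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

After the transformation the values have mean 0 and standard deviation 1. The sketch does not guard against a constant input, for which sigma is 0 and the division fails.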

In [5]:
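from scipy.stats import zscore

# zscore uses the population standard deviation (ddof=0) by default,
# which matches the formula for sigma.
df["mpg_z"] = zscore(df["mpg"])
df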

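|     | mpg  | cylinders | displacement | ... | origin | name                      | mpg_z     |
|-----|------|-----------|--------------|-----|--------|---------------------------|-----------|
| 0   | 18.0 | 8         | 307.0        | ... | 1      | chevrolet chevelle malibu | -0.706439 |
| 1   | 15.0 | 8         | 350.0        | ... | 1      | buick skylark 320         | -1.090751 |
| 2   | 18.0 | 8         | 318.0        | ... | 1      | plymouth satellite        | -0.706439 |
| 3   | 16.0 | 8         | 304.0        | ... | 1      | amc rebel sst             | -0.962647 |
| 4   | 17.0 | 8         | 302.0        | ... | 1      | ford torino               | -0.834543 |
| ... | ...  | ...       | ...          | ... | ...    | ...                       | ...       |
| 393 | 27.0 | 4         | 140.0        | ... | 1      | ford mustang gl           | 0.446497  |
| 394 | 44.0 | 4         | 97.0         | ... | 2      | vw pickup                 | 2.624265  |
| 395 | 32.0 | 4         | 135.0        | ... | 1      | dodge rampage             | 1.087017  |
| 396 | 28.0 | 4         | 120.0        | ... | 1      | ford ranger               | 0.574601  |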


### Encoding categorical values

In [6]:
cylinders = list(df["cylinders"].unique())
cylinders

[8, 4, 6, 3, 5]

In [None]:
import numpy as np

# Three- and five-cylinder engines are unusual, so list the cars that have them.
df[(df.cylinders == 3) | (df.cylinders == 5)]

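|     | mpg  | cylinders | displacement | ... | origin | name                | mpg_z     |
|-----|------|-----------|--------------|-----|--------|---------------------|-----------|
| 71  | 19.0 | 3         | 70.0         | ... | 3      | mazda rx2 coupe     | -0.578335 |
| 111 | 18.0 | 3         | 70.0         | ... | 3      | maxda rx3           | -0.706439 |
| 243 | 21.5 | 3         | 80.0         | ... | 3      | mazda rx-4          | -0.258075 |
| 274 | 20.3 | 5         | 131.0        | ... | 2      | audi 5000           | -0.4118   |
| 297 | 25.4 | 5         | 183.0        | ... | 2      | mercedes benz 300d  | 0.241531  |
| 327 | 36.4 | 5         | 121.0        | ... | 2      | audi 5000s (diesel) | 1.650674  |
| 334 | 23.7 | 3         | 70.0         | ... | 3      | mazda rx-7 gs       | 0.023754  |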
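
The excerpt lists the distinct cylinder counts but does not show an encoding step. One common option for a variable with only a few distinct values is one-hot encoding, in which each category becomes its own 0/1 column. The sketch below assumes that approach; `one_hot` is a hypothetical helper, and the category list is the `cylinders` output shown above.

```python
def one_hot(values, categories):
    """Return one 0/1 row per value, with a single 1 in the column of its category."""
    position = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[position[value]] = 1
        encoded.append(row)
    return encoded

# Distinct cylinder counts from the notebook output, sorted for a stable column order.
categories = sorted([8, 4, 6, 3, 5])  # [3, 4, 5, 6, 8]

# Encode three example cars with 8, 4 and 6 cylinders.
encoded = one_hot([8, 4, 6], categories)
for row in encoded:
    print(row)
```

A value that is not in `categories` raises a `KeyError`, which makes unexpected categories visible instead of silently encoding them as all zeros.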
