<a href="https://colab.research.google.com/github/kdomanski78/machine-learning-bootcamp/blob/main/supervised/01_basics/03_feature_extraction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### scikit-learn
Library website: [https://scikit-learn.org](https://scikit-learn.org)  

Documentation/User Guide: [https://scikit-learn.org/stable/user_guide.html](https://scikit-learn.org/stable/user_guide.html)

The core machine learning library for Python.

To install scikit-learn, use the command below:
```
!pip install scikit-learn
```
To upgrade scikit-learn to the latest version, use the command below:
```
!pip install --upgrade scikit-learn
```
This course was developed with version `0.22.1`.

### Contents:
1. [Importing libraries](#0)
2. [Loading data](#1)
3. [Creating a copy of the data](#2)
4. [Generating new variables](#3)
5. [Discretizing a continuous variable](#4)
6. [Feature extraction](#5)



### <a name='0'></a> Importing libraries

In [1]:
import numpy as np
import pandas as pd
import sklearn

sklearn.__version__

'1.0.1'

### My test


In [16]:
# install the package if it is missing
#! pip install wget

import wget

# ticker symbol of the security (KGH = KGHM Polska Miedź SA)
TICKER = "KGH"
# start date
START_DATE = "2000-01-01"
# end date
END_DATE = "2021-12-31"
# file in which the historical quotes for the given ticker are stored
FILENAME = "kghm.txt"

'''
To download historical stock quotes for a given security, we use data from stooq.pl. To fetch the quote history for a given ticker and date range in CSV format, request a URL of the form:

https://stooq.pl/q/d/l/?s={0}&d1={1}&d2={2}&i=d

where:

{0} - ticker symbol
{1} - start date in YYYYMMDD format
{2} - end date in YYYYMMDD format
'''
url = "https://stooq.pl/q/d/l/?s={0}&d1={1}&d2={2}&i=d".format(TICKER, START_DATE.replace("-", ""), END_DATE.replace("-", ""))
wget.download(url, FILENAME)


data_frame = pd.read_csv(FILENAME, index_col='Data',
                 parse_dates=True, usecols=['Data', 'Otwarcie', 'Najwyzszy', 'Najnizszy', 'Zamkniecie','Wolumen'],
                 na_values='nan')
# rename the column header with ticker
data_frame = data_frame.rename(columns={'Zamkniecie': TICKER})
data_frame.dropna(inplace=True)
print(data_frame)
data_frame.head()

            Otwarcie  Najwyzszy  Najnizszy       KGH  Wolumen
Data                                                         
2000-01-03    9.3384     9.4081     9.0367    9.0683   930422
2000-01-04    8.9994     8.9994     8.5583    8.6594  1267669
2000-01-05    8.4253     8.6271     8.4253    8.5269   957900
2000-01-06    8.4899     8.8988     8.4253    8.8295  1035218
2000-01-07    8.9994     9.6790     8.9994    9.4769  2489145
...              ...        ...        ...       ...      ...
2021-12-06  139.5500   141.7000   138.0000  139.5000   289507
2021-12-07  142.8000   147.8500   142.0500  147.1500   772031
2021-12-08  147.3000   148.3500   144.3500  145.4000   356574
2021-12-09  146.0000   146.0000   140.0000  141.5500   607749
2021-12-10  141.5500   143.1500   140.8000  142.2500   311939

[5495 rows x 5 columns]


            Otwarcie  Najwyzszy  Najnizszy     KGH  Wolumen
Data
2000-01-03    9.3384     9.4081     9.0367  9.0683   930422
2000-01-04    8.9994     8.9994     8.5583  8.6594  1267669
2000-01-05    8.4253     8.6271     8.4253  8.5269   957900
2000-01-06    8.4899     8.8988     8.4253  8.8295  1035218
2000-01-07    8.9994     9.6790     8.9994  9.4769  2489145
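
The URL pattern described in the cell above can be wrapped in a small helper. This is only a minimal sketch: `build_stooq_url` and `STOOQ_TEMPLATE` are hypothetical names that do not come from any library, and the function only builds the string without making a network request.

```python
from datetime import date

# URL template quoted from the cell above (stooq.pl daily CSV export)
STOOQ_TEMPLATE = "https://stooq.pl/q/d/l/?s={0}&d1={1}&d2={2}&i=d"


def build_stooq_url(ticker, start, end):
    """Hypothetical helper: turn ISO dates into YYYYMMDD and fill the template."""
    d1 = date.fromisoformat(start).strftime("%Y%m%d")
    d2 = date.fromisoformat(end).strftime("%Y%m%d")
    return STOOQ_TEMPLATE.format(ticker, d1, d2)


url = build_stooq_url("KGH", "2000-01-01", "2021-12-31")
print(url)
```

Parsing the dates with `date.fromisoformat` rejects malformed input early, which plain string replacement would silently pass through.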


### <a name='1'></a> Loading data

In [7]:
def fetch_financial_data(company='AMZN'):
    """
    This function fetches stock market quotations.
    """
    import pandas_datareader.data as web
    return web.DataReader(name=company, data_source='stooq')

df_raw = fetch_financial_data()
df_raw.head()

               Open     High      Low    Close   Volume
Date
2021-12-10  3508.34  3518.54  3410.00  3444.24  3034488
2021-12-09  3515.00  3539.39  3482.79  3483.42  2303091
2021-12-08  3523.01  3543.60  3495.01  3523.16  2262683
2021-12-07  3492.00  3549.99  3466.69  3523.29  3320536
2021-12-06  3393.00  3473.91  3338.69  3427.37  3443000


### <a name='2'></a> Creating a copy of the data

In [17]:
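df = df_raw.copy()
# copy the slice as well, so that adding columns later does not raise SettingWithCopyWarning
df = df[:5].copy()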
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 5 entries, 2021-12-10 to 2021-12-06
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Open    5 non-null      float64
 1   High    5 non-null      float64
 2   Low     5 non-null      float64
 3   Close   5 non-null      float64
 4   Volume  5 non-null      int64  
dtypes: float64(4), int64(1)
memory usage: 240.0 bytes
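
The reason for working on a copy can be illustrated with a small synthetic frame. The values below are made up for illustration and are not market data, and the names `raw` and `work` are arbitrary. Columns added to the copy leave the original untouched.

```python
import pandas as pd

# Synthetic stand-in for df_raw; the numbers are illustrative only
raw = pd.DataFrame(
    {"Close": [10.0, 11.0, 12.0]},
    index=pd.to_datetime(["2021-12-10", "2021-12-09", "2021-12-08"]),
)

work = raw[:2].copy()           # independent copy of the first two rows
work["year"] = work.index.year  # modify only the copy

print(list(raw.columns))   # the original still has a single column
print(list(work.columns))
```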


### <a name='3'></a> Generating new variables

In [18]:
df.index.month

Int64Index([12, 12, 12, 12, 12], dtype='int64', name='Date')

In [19]:
df['day'] = df.index.day
df['month'] = df.index.month
df['year'] = df.index.year
df

               Open     High      Low    Close   Volume  day  month  year
Date
2021-12-10  3508.34  3518.54  3410.00  3444.24  3034488   10     12  2021
2021-12-09  3515.00  3539.39  3482.79  3483.42  2303091    9     12  2021
2021-12-08  3523.01  3543.60  3495.01  3523.16  2262683    8     12  2021
2021-12-07  3492.00  3549.99  3466.69  3523.29  3320536    7     12  2021
2021-12-06  3393.00  3473.91  3338.69  3427.37  3443000    6     12  2021
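
Other calendar attributes of a `DatetimeIndex` can be extracted the same way. The sketch below rebuilds only the five dates shown above (without the price columns) and adds the day of the week and the quarter. The frame name `calendar` and the column names are arbitrary choices.

```python
import pandas as pd

dates = pd.to_datetime(
    ["2021-12-10", "2021-12-09", "2021-12-08", "2021-12-07", "2021-12-06"]
)
calendar = pd.DataFrame(index=dates)

# dayofweek uses Monday=0 ... Sunday=6
calendar["dayofweek"] = calendar.index.dayofweek
calendar["quarter"] = calendar.index.quarter
print(calendar)
```

Features such as the day of the week are often more informative for a model than the raw date, because they expose recurring patterns directly.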


### <a name='4'></a> Discretizing a continuous variable

In [None]:
df = pd.DataFrame(data={'height': [175., 178.5, 185., 191., 184.5, 183., 168.]})
df

   height
0   175.0
1   178.5
2   185.0
3   191.0
4   184.5
5   183.0
6   168.0


In [None]:
df['height_cat'] = pd.cut(x=df.height, bins=3)
df

   height          height_cat
0   175.0  (167.977, 175.667]
1   178.5  (175.667, 183.333]
2   185.0    (183.333, 191.0]
3   191.0    (183.333, 191.0]
4   184.5    (183.333, 191.0]
5   183.0  (175.667, 183.333]
6   168.0  (167.977, 175.667]


In [None]:
df['height_cat'] = pd.cut(x=df.height, bins=(160, 175, 180, 195))
df

   height  height_cat
0   175.0  (160, 175]
1   178.5  (175, 180]
2   185.0  (180, 195]
3   191.0  (180, 195]
4   184.5  (180, 195]
5   183.0  (180, 195]
6   168.0  (160, 175]
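
By default `pd.cut` builds right-closed intervals, which is why 175.0 falls into `(160, 175]` above. A minimal sketch of the alternative uses the `right=False` argument to make the intervals left-closed instead. The variable names are illustrative.

```python
import pandas as pd

heights = pd.Series([175., 178.5, 185., 191., 184.5, 183., 168.])
bins = (160, 175, 180, 195)

right_closed = pd.cut(x=heights, bins=bins)              # (160, 175], (175, 180], ...
left_closed = pd.cut(x=heights, bins=bins, right=False)  # [160, 175), [175, 180), ...

# 175.0 sits exactly on a bin edge, so its bin depends on `right`
print(right_closed[0], left_closed[0])
```

Values that land exactly on a bin edge are the ones to check when choosing between the two conventions.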


In [None]:
df['height_cat'] = pd.cut(x=df.height, bins=(160, 175, 180, 195), labels=['small', 'medium', 'high'])
df

   height height_cat
0   175.0      small
1   178.5     medium
2   185.0       high
3   191.0       high
4   184.5       high
5   183.0       high
6   168.0      small


In [None]:
pd.get_dummies(df, drop_first=True, prefix='height')

   height  height_medium  height_high
0   175.0              0            0
1   178.5              1            0
2   185.0              0            1
3   191.0              0            1
4   184.5              0            1
5   183.0              0            1
6   168.0              0            0
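
Since this course centers on scikit-learn, the same encoding can also be sketched with `sklearn.preprocessing.OneHotEncoder`. This is a swapped-in alternative, not what the cell above uses. The category order is passed explicitly (an assumption made here to mirror the `labels` order above), so that `drop='first'` drops `small`, like `drop_first=True` in `pd.get_dummies`.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

cats = pd.DataFrame(
    {"height_cat": ["small", "medium", "high", "high", "high", "high", "small"]}
)

# Explicit category order; without it scikit-learn sorts the categories alphabetically
encoder = OneHotEncoder(categories=[["small", "medium", "high"]], drop="first")
encoded = encoder.fit_transform(cats[["height_cat"]]).toarray()

# The two remaining columns correspond to 'medium' and 'high'
print(encoded)
```

An encoder object is convenient when the same mapping has to be reapplied to new data with `encoder.transform`.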


### <a name='5'></a> Feature extraction

In [None]:
df = pd.DataFrame(data={'lang': [['PL', 'ENG'], ['GER', 'ENG', 'PL', 'FRA'], ['RUS']]})
df

                  lang
0            [PL, ENG]
1  [GER, ENG, PL, FRA]
2                [RUS]


In [None]:
df['lang_number'] = df['lang'].apply(len)
df

                  lang  lang_number
0            [PL, ENG]            2
1  [GER, ENG, PL, FRA]            4
2                [RUS]            1


In [None]:
df['PL_flag'] = df['lang'].apply(lambda x: 1 if 'PL' in x else 0)
df

                  lang  lang_number  PL_flag
0            [PL, ENG]            2        1
1  [GER, ENG, PL, FRA]            4        1
2                [RUS]            1        0
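
The `PL_flag` idea generalizes to one flag per language. One way to sketch it is scikit-learn's `MultiLabelBinarizer`, which is a swapped-in technique here rather than something the cells above use. The `_flag` column suffix and the names `langs` and `flags` are arbitrary choices.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

langs = pd.DataFrame(
    data={"lang": [["PL", "ENG"], ["GER", "ENG", "PL", "FRA"], ["RUS"]]}
)

mlb = MultiLabelBinarizer()
flags = pd.DataFrame(
    mlb.fit_transform(langs["lang"]),            # one 0/1 column per distinct language
    columns=[f"{c}_flag" for c in mlb.classes_],  # classes_ holds the sorted labels
    index=langs.index,
)
print(flags)
```

The resulting frame can be joined back to the original with `langs.join(flags)` if the list column should be kept alongside the flags.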


In [None]:
df = pd.DataFrame(data={'website': ['wp.pl', 'onet.pl', 'google.com']})
df

      website
0       wp.pl
1     onet.pl
2  google.com


In [None]:
df.website.str.split('.', expand=True)

        0    1
0      wp   pl
1    onet   pl
2  google  com


In [None]:
new = df.website.str.split('.', expand=True)
df['portal'] = new[0]
df['extension'] = new[1]
df

      website  portal extension
0       wp.pl      wp        pl
1     onet.pl    onet        pl
2  google.com  google       com
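
Splitting on every dot produces more than two columns for hosts that contain more than one dot. A hedged sketch of one workaround uses `str.rsplit` with `n=1`, which splits only at the last dot. The extra host `bbc.co.uk` is an addition made here purely for illustration.

```python
import pandas as pd

sites = pd.DataFrame(
    data={"website": ["wp.pl", "onet.pl", "google.com", "bbc.co.uk"]}
)

# Split once, starting from the right, so only the last dot separates the extension
parts = sites["website"].str.rsplit(".", n=1, expand=True)
sites["portal"] = parts[0]
sites["extension"] = parts[1]
print(sites)
```

This always yields exactly two columns, but for a host such as `bbc.co.uk` the extension is only the last label (`uk`), so proper handling of multi-part suffixes would need a dedicated approach.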
