# Feature selection
Documentation: https://scikit-learn.org/stable/modules/feature_selection.html


Dataset for analysis: https://www.kaggle.com/datasets/clemencetravers/predict-mc-donalds-stock-price

We have stock market indicator data from which we need to predict whether the stock price (the `MCD` column) will rise or fall.

Variables:
- S&P500
- Dow Jones
- Wendy's Index
- Yum's index (Taco Bell, Pizza Hut etc.)
- Starbuck's index
- Coca's index
- Wheat index: Chicago SRW Wheat Future (ZW=F)
- Oil index: Crude oil (CL=F)
- Commodity: United States Commodity Index
- Sugar: (SB=F)
- Volatility: VXD index
- War: War in Ukraine 0 : no war, 1: war (the 20/02/2022)

In [1]:
import pandas as pd
from sklearn.feature_selection import VarianceThreshold, RFE, SelectFromModel, SequentialFeatureSelector
from sklearn.tree import DecisionTreeClassifier
import numpy as np

In [2]:
# run this cell
# if you are running the notebook from the solutions folder
# and the data frame is stored in the data folder
import os 
os.chdir('../')

In [3]:
# Load the data (semicolon-separated file with decimal commas)
df = pd.read_csv("data/stock_price.csv",sep=';',decimal=',')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1257 entries, 0 to 1256
Data columns (total 13 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   MCD        1257 non-null   int64  
 1   S&P        1257 non-null   float64
 2   DJ         1257 non-null   float64
 3   Wendy      1257 non-null   float64
 4   YUM        1257 non-null   float64
 5   Starbuck   1257 non-null   float64
 6   Coca       1257 non-null   float64
 7   Wheat      1257 non-null   float64
 8   Oil        1257 non-null   float64
 9   Commodity  1257 non-null   float64
 10  sugar      1257 non-null   float64
 11  Volatilty  1257 non-null   float64
 12  War        1257 non-null   float64
dtypes: float64(12), int64(1)
memory usage: 127.8 KB


## Variance threshold

In [13]:
var_thresh = VarianceThreshold(threshold=0.001).fit(df.drop('MCD',axis=1))

In [14]:
var_thresh.transform(df.drop('MCD',axis=1))

array([[-0.004 , -0.099 ,  1.    ],
       [ 0.0164,  0.1485,  1.    ],
       [-0.0054, -0.5238,  1.    ],
       ...,
       [-0.0019,  0.0137,  0.    ],
       [ 0.0161, -0.03  ,  0.    ],
       [ 0.0241, -0.064 ,  0.    ]])

In [15]:
var_thresh.get_feature_names_out()

array(['Oil', 'Volatilty', 'War'], dtype=object)

## RFE (Recursive Feature Elimination)

In [16]:
rfe = RFE(estimator=DecisionTreeClassifier(),n_features_to_select=4)

In [17]:
# fit
rfe.fit(df.drop('MCD',axis=1), df['MCD'])


In [18]:
# Selected features
rfe.get_feature_names_out()

array(['Coca', 'Wheat', 'sugar', 'Volatilty'], dtype=object)
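
Besides the names of the selected features, a fitted `RFE` object records the order in which the remaining ones were dropped. The sketch below uses synthetic data from `make_classification` (an assumption made so the example is self-contained; it is not the stock dataset) and fixes `random_state`, since an unseeded decision tree can give a different selection on each run:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the notebook's data: 8 features, 3 of them informative
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

rfe = RFE(
    estimator=DecisionTreeClassifier(random_state=0),
    n_features_to_select=4,
    step=1,  # drop one feature per iteration
).fit(X, y)

support = rfe.support_   # boolean mask of the kept features
ranking = rfe.ranking_   # 1 = kept; larger numbers were eliminated earlier

print(support)
print(ranking)
```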

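## Selection from a model with a given threshold
Here every feature whose importance falls below the threshold is removed in a single step, rather than one at a time as in RFE.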

In [20]:
# Base model
model = DecisionTreeClassifier().fit(df.drop('MCD',axis=1),df['MCD'])

In [21]:
# Selection; with prefit=True the already fitted tree is reused rather than retrained
selection = SelectFromModel(model, threshold=0.1, prefit=True).fit(df.drop('MCD',axis=1))

In [22]:
# data for the model
new_data = selection.transform(df.drop('MCD',axis=1))
new_data

array([[ 2.300e-03, -2.070e-02, -4.400e-03,  2.000e-02, -9.900e-02],
       [-1.500e-03,  5.000e-04,  1.430e-02,  4.600e-03,  1.485e-01],
       [ 3.000e-04,  1.460e-02, -1.800e-03, -9.500e-03, -5.238e-01],
       ...,
       [-3.200e-03,  1.510e-02,  2.100e-03, -1.900e-02,  1.370e-02],
       [-3.000e-03,  2.700e-03,  7.200e-03, -8.000e-04, -3.000e-02],
       [ 6.000e-03,  1.090e-02, -4.900e-03,  1.120e-02, -6.400e-02]])

In [23]:
# feature names
selection.get_feature_names_out()


array(['Coca', 'Wheat', 'Commodity', 'sugar', 'Volatilty'], dtype=object)
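
To see why these particular features pass, the importances of the underlying tree can be compared with the threshold. The sketch below is an illustration on synthetic data (an assumption, not the stock dataset). Unlike the cells above, it lets `SelectFromModel` fit the tree itself instead of using `prefit=True`:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the notebook's data
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

selection = SelectFromModel(
    DecisionTreeClassifier(random_state=0), threshold=0.1
).fit(X, y)

# Importances of the tree fitted inside the selector
importances = selection.estimator_.feature_importances_
mask = selection.get_support()

# A feature is kept when its importance is at least the threshold
print(importances.round(3))
print(mask)
```

Because the tree's importances sum to 1, a threshold of 0.1 can never keep more than ten features, and the number kept is not fixed in advance as it is with `RFE`.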

## Sequential selection

In [None]:
seq = SequentialFeatureSelector(estimator=DecisionTreeClassifier(),n_features_to_select=4, direction='forward')
seq.fit(df.drop('MCD',axis=1),df['MCD'])

In [25]:
seq.get_feature_names_out()

array(['YUM', 'Starbuck', 'Commodity', 'War'], dtype=object)
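
The set chosen here differs from the RFE result, which is not surprising: `SequentialFeatureSelector` ignores feature importances and instead adds (or removes) one feature at a time according to the cross-validated score of the estimator. A minimal sketch on synthetic data (an assumption, not the stock dataset) comparing the forward and backward directions:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the notebook's data
X, y = make_classification(
    n_samples=200, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
tree = DecisionTreeClassifier(random_state=0)

# Forward: start from no features and add the best one at each step
forward = SequentialFeatureSelector(
    tree, n_features_to_select=3, direction="forward", cv=5
).fit(X, y)

# Backward: start from all features and drop the least useful one at each step
backward = SequentialFeatureSelector(
    tree, n_features_to_select=3, direction="backward", cv=5
).fit(X, y)

forward_mask = forward.get_support()
backward_mask = backward.get_support()

# Cross-validated accuracy of the tree on the forward-selected subset
score = cross_val_score(tree, forward.transform(X), y, cv=5).mean()

print(forward_mask, backward_mask, round(score, 3))
```

The two directions need not agree, and both are considerably slower than the importance-based selectors because every candidate subset is evaluated with cross-validation.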