# Final project - data preparation / linear regression model

The goal of the project is to build a Python-based model that maps stock convictions to expected returns within given investment horizons.

This notebook contains:
* an analysis of the dataset we received
* the use of `yfinance` to download price information for the given period
* the use of `ta` to create financial indicators
* a linear regression model built from the available data



In [1]:
#data download
import yfinance as yf

#data preprocessing
import pandas as pd 
import numpy as np
from datetime import datetime
from datetime import timedelta 
import warnings
warnings.filterwarnings('ignore')
from tqdm import tqdm
#data save
import pickle
from sklearn.preprocessing import LabelEncoder 

#create indicators
import ta


# prepare simple model
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import  KFold,  cross_validate
from sklearn.linear_model import LogisticRegression
from numpy import mean, absolute, sqrt

# Loading the data
 * the index holds a filler phrase (a log prefix) repeated in every row
 * Date - the date on which a given company received its (weekly) return score
 * Ticker_2 and Ticker - the stock symbol
 * Category - the company's category
 * Code - an identifier whose meaning we do not know
 * value - the value of an indicator that we are supposed to use to explain the rate of return
 
## Cleaning the data
* changing the data types
* dropping the index
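The cell that follows passes six column names for seven comma-separated fields, and the log prefix ends up as the index. A more explicit variant, sketched here on two made-up lines shaped like the raw file, names the prefix column and drops it. The column name `Prefix` and the shortened prefix text are assumptions made for this illustration only.

```python
import io
import pandas as pd

# Two made-up lines in the same shape as the raw file:
# a log prefix (no commas inside), followed by six real fields.
raw = (
    "10:01:54.481 INFO demo - LIST,2004-02-11,SU,SU,Energy Minerals,GN63J3-R,0.953727\n"
    "10:01:54.481 INFO demo - LIST,2004-02-11,GGG,GGG,Producer Manufacturing,H5490W-R,0.952753\n"
)

# 'Prefix' is a hypothetical name for the leading log column.
columns = ['Prefix', 'Date', 'Ticker_2', 'Ticker', 'Category', 'Code', 'value']
demo = pd.read_csv(io.StringIO(raw), header=None, names=columns)

# Drop the prefix explicitly instead of letting it become the index.
demo = demo.drop(columns='Prefix')
print(demo.columns.tolist())
```

With this variant the frame starts with a plain integer index, so no later `reset_index` is needed.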

In [2]:
df = pd.read_csv('data_new.csv', sep=',', header = None, 
                 names = ['Date', 'Ticker_2','Ticker','Category','Code', 'value'])

df.head()

Unnamed: 0,Date,Ticker_2,Ticker,Category,Code,value
10:01:54.481 77425 [77425-thread-2] INFO a.s.m.c.ConvictionImpl - CONVICTIONLISTTOPN,2004-02-11,SU,SU,Energy Minerals,GN63J3-R,0.953727
10:01:54.481 77425 [77425-thread-2] INFO a.s.m.c.ConvictionImpl - CONVICTIONLISTTOPN,2004-02-11,GGG,GGG,Producer Manufacturing,H5490W-R,0.952753
10:01:54.481 77425 [77425-thread-2] INFO a.s.m.c.ConvictionImpl - CONVICTIONLISTTOPN,2004-02-11,WGR,WGR,Energy Minerals,V0622Q-R,0.947634
10:01:54.481 77425 [77425-thread-2] INFO a.s.m.c.ConvictionImpl - CONVICTIONLISTTOPN,2004-02-11,CWT,CWT,Utilities,GSWXLY-R,0.934181
10:01:54.481 77425 [77425-thread-2] INFO a.s.m.c.ConvictionImpl - CONVICTIONLISTTOPN,2004-02-11,BLL,BLL,Process Industries,VFT0VQ-R,0.922862


In [3]:
df.shape

(37360, 6)

In [4]:
df.isnull().sum()

Date        0
Ticker_2    0
Ticker      0
Category    0
Code        0
value       0
dtype: int64

In [5]:
df.dtypes

Date         object
Ticker_2     object
Ticker       object
Category     object
Code         object
value       float64
dtype: object

In [6]:
df = df.convert_dtypes()
df.dtypes

Date         string
Ticker_2     string
Ticker       string
Category     string
Code         string
value       Float64
dtype: object

In [7]:
df[(df.value == 0)]

Unnamed: 0,Date,Ticker_2,Ticker,Category,Code,value


In [8]:
df = df.reset_index(drop = True)
df.head()

Unnamed: 0,Date,Ticker_2,Ticker,Category,Code,value
0,2004-02-11,SU,SU,Energy Minerals,GN63J3-R,0.953727
1,2004-02-11,GGG,GGG,Producer Manufacturing,H5490W-R,0.952753
2,2004-02-11,WGR,WGR,Energy Minerals,V0622Q-R,0.947634
3,2004-02-11,CWT,CWT,Utilities,GSWXLY-R,0.934181
4,2004-02-11,BLL,BLL,Process Industries,VFT0VQ-R,0.922862


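### We want to check whether any ticker has more than one code
No, no ticker has two codes. (The cell below groups by `Category`, so strictly speaking it confirms that each ticker belongs to exactly one category.)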

In [9]:
df.groupby(['Ticker', 'Category']).count()['Ticker_2'].reset_index().groupby('Ticker').count().value_counts()

Category  Ticker_2
1         1           1834
dtype: int64
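A direct way to test the ticker-to-code relationship is to count distinct codes per ticker. The sketch below uses a small made-up frame; the rows are illustrative and the variable names are not taken from the notebook.

```python
import pandas as pd

# Illustrative rows only; in the notebook this would be the raw frame
# that still contains the 'Code' column.
demo = pd.DataFrame({
    'Ticker': ['SU', 'SU', 'GGG', 'CWT'],
    'Code': ['GN63J3-R', 'GN63J3-R', 'H5490W-R', 'GSWXLY-R'],
})

# Number of distinct codes seen for each ticker.
codes_per_ticker = demo.groupby('Ticker')['Code'].nunique()

# Tickers that violate the one-code-per-ticker assumption (empty if none).
tickers_with_many_codes = codes_per_ticker[codes_per_ticker > 1]
print(tickers_with_many_codes)
```

The same pattern with `'Category'` in place of `'Code'` reproduces the category check.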

In [10]:
print(f'Number of unique tickers/companies: {len(df.Ticker.unique())}')
print(f'Number of unique dates: {len(df.Date.unique())}')

Number of unique tickers/companies: 1834
Number of unique dates: 467


### How much data do we have per company?
On average there are about 20 rows per company, whereas there are 467 unique dates, so the coverage is very sparse.

In [11]:
df.groupby('Ticker').count()['Date'].describe()

count    1834.000000
mean       20.370774
std        21.318179
min         1.000000
25%         5.000000
50%        13.000000
75%        28.000000
max       170.000000
Name: Date, dtype: float64
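To put the sparsity in numbers: with a mean of 20.37 rows per ticker and 467 unique dates, an average company is present on roughly 20.37 / 467, or about 4.4%, of the dates. A minimal sketch of the same ratio per ticker, on made-up rows:

```python
import pandas as pd

# Illustrative rows only.
demo = pd.DataFrame({
    'Ticker': ['SU', 'SU', 'SU', 'GGG'],
    'Date': ['2004-02-11', '2004-02-25', '2004-03-10', '2004-02-11'],
})

# Rows available for each ticker.
rows_per_ticker = demo.groupby('Ticker')['Date'].count()

# Share of all unique dates on which each ticker appears.
coverage = rows_per_ticker / demo['Date'].nunique()
print(coverage)
```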

### Checking the categories
As can be seen, there is a Miscellaneous category. I keep it, even though it carries no information about the company.

In [12]:
df.Category.value_counts()

Finance                   3778
Retail Trade              3146
Producer Manufacturing    3114
Utilities                 2972
Electronic Technology     2926
Consumer Non-Durables     2887
Consumer Services         2593
Process Industries        2509
Technology Services       2480
Health Technology         1403
Consumer Durables         1396
Industrial Services       1376
Energy Minerals           1296
Distribution Services     1250
Commercial Services       1176
Transportation            1174
Non-Energy Minerals        858
Health Services            530
Communications             431
Miscellaneous               65
Name: Category, dtype: Int64

# Extending the data
As the data shows, a model cannot be built from the `X` variables alone: there is no target `Y`, and some of the variables are categorical.
We therefore decided to add each company's closing price on the given day (the rate of return is computed from it), using the Yahoo Finance API.

Missing data is a problem not only in the dataset we received but also in Yahoo Finance itself. The Yahoo Finance API has gaps of several days; we fill them by looking up the nearest available quote and assigning it to the requested date.
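The nearest-date lookup described above can also be expressed with `pandas.merge_asof`, which matches each requested date to the closest available quote within a tolerance. This is an alternative sketch, not the approach used in the notebook; the prices are made up, and tie-breaking between an equally distant earlier and later day may differ from the loop in the notebook.

```python
import pandas as pd

# Made-up quotes with a gap on 2004-02-10 and 2004-02-11.
prices = pd.DataFrame({
    'Date': pd.to_datetime(['2004-02-09', '2004-02-12']),
    'Ticker': ['SU', 'SU'],
    'Close': [13.1, 13.4],
})

# Dates for which we need a closing price.
wanted = pd.DataFrame({
    'Date': pd.to_datetime(['2004-02-11', '2004-03-10']),
    'Ticker': ['SU', 'SU'],
})

# Both frames must be sorted by the 'on' key; 'by' keeps tickers separate.
filled = pd.merge_asof(
    wanted.sort_values('Date'),
    prices.sort_values('Date'),
    on='Date',
    by='Ticker',
    direction='nearest',
    tolerance=pd.Timedelta(days=5),
)
print(filled)
```

The requested date is kept in `Date`, and a request with no quote inside the tolerance window gets a missing `Close` instead of being dropped.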

In [13]:
def str_into_dt_or_timestamp(doc:str): 
    return datetime.timestamp(datetime.strptime(doc, '%Y-%m-%d'))

def clear_df_ML_finance(data_name): 
    df = pd.read_csv(f'{data_name}.csv', sep = ',', header = None,
                     names = ['Date', 'Ticker_2', 'Ticker', 'Category', 'Code', 'Value'])
    df = df.dropna()
    df = df.convert_dtypes()
    df = df.drop('Code', axis = 1)
    df = df.drop('Ticker_2', axis=1)
    df = df.reset_index(drop = True)
    df['Timestamp'] = df.Date.apply(lambda x: str_into_dt_or_timestamp(x))
    df = df[ ~(df.Value == 0)]
    return df

In [14]:
df = clear_df_ML_finance('data_new')
df.head(10)

Unnamed: 0,Date,Ticker,Category,Value,Timestamp
0,2004-02-11,SU,Energy Minerals,0.953727,1076454000.0
1,2004-02-11,GGG,Producer Manufacturing,0.952753,1076454000.0
2,2004-02-11,WGR,Energy Minerals,0.947634,1076454000.0
3,2004-02-11,CWT,Utilities,0.934181,1076454000.0
4,2004-02-11,BLL,Process Industries,0.922862,1076454000.0
5,2004-02-11,APA,Energy Minerals,0.912117,1076454000.0
6,2004-02-11,JW.B,Consumer Services,0.906333,1076454000.0
7,2004-02-11,MATX,Transportation,0.866946,1076454000.0
8,2004-02-11,ROST,Retail Trade,0.864789,1076454000.0
9,2004-02-11,AXL,Producer Manufacturing,0.861478,1076454000.0
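
The `Timestamp` value 1076454000.0 for 2004-02-11 corresponds to midnight in a UTC+1 zone, because `datetime.timestamp` interprets a naive datetime in the machine's local time zone. If the feature should not depend on where the notebook runs, one option is to attach UTC explicitly. The function name below is hypothetical.

```python
from datetime import datetime, timezone

def date_to_utc_timestamp(doc: str) -> float:
    # Parse the date and mark it as UTC so the result is machine-independent.
    parsed = datetime.strptime(doc, '%Y-%m-%d').replace(tzinfo=timezone.utc)
    return parsed.timestamp()

print(date_to_utc_timestamp('2004-02-11'))
```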


In [16]:
def get_data_around_date(data, date, days_around = 6):
    # Find the nearest day with a quote within +/- (days_around - 1) days of `date`;
    # at each distance the later day is checked before the earlier one.
    target = datetime.strptime(date, '%Y-%m-%d')
    for i in range(days_around):
        for offset in (timedelta(days = i), -timedelta(days = i)):
            # .copy() so that relabelling the date does not write into a view of `data`
            output = data.loc[data.Date == target + offset].copy()
            if not output.empty:
                # relabel the quote with the requested date so that it merges with the project data
                output['Date'] = date
                return output[['Ticker', 'Date', 'Close']]
    return pd.DataFrame()

def collect_data(stock_code, start, end, weeks):
    df = pd.DataFrame()
    for code in stock_code:
        collect = yf.download(code, 
                              start = start, 
                              end = end, 
                              progress = False,
                              interval = "1d",
        )
        collect = collect.reset_index()
        collect['Ticker'] = code
        collect_df = collect[(collect.Date.isin(weeks))][['Ticker', 'Date', 'Close']]
        for date in weeks:
            if date not in np.array(collect_df['Date'].astype(str)):
                collect_df = pd.concat([collect_df, get_data_around_date(collect, date)])
        df = pd.concat([df, collect_df])
    df['Date'] = df['Date'].astype(str).apply(lambda x: x.split()[0])
    df = df.drop_duplicates()
    return df

min_date = df.Date.min()
max_date = df.Date.max()

yf_data = collect_data(df.Ticker.unique(), min_date, max_date, df.Date.unique())
yf_data 

Unnamed: 0,Ticker,Date,Close
0,SU,2004-02-11,13.285000
1,SU,2004-02-25,12.660000
2,SU,2004-03-10,13.465000
3,SU,2004-03-24,13.255000
4,SU,2004-04-07,13.130000
...,...,...,...
590986,AGCO,2021-12-29,115.889999
590987,AGCO,2022-01-12,123.750000
590988,AGCO,2022-01-26,116.370003
590989,AGCO,2018-12-05,58.189999


In [None]:
yf_data.to_csv('Close_data.csv', index = False)

### Conclusions
As many as 590991 rows were downloaded, because for each company (there are 1834 of them) we fetched
data for each of the unique dates (there are 467 of them); the gaps in the API data and the companies for which no quotes are available also have to be taken into account.

# Merging the downloaded data with the project data and handling missing values
After the merge, the number of rows dropped from 37360 to 30561. That is still a large dataset, so we accept the loss and work with the remaining data.
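To see which rows an inner merge would discard, a left merge with `indicator=True` can be run first. The sketch below uses made-up rows in which one ticker has no price; which tickers actually lack quotes in the real data is not assumed here.

```python
import pandas as pd

# Project data: three convictions on one date (illustrative values).
left = pd.DataFrame({
    'Ticker': ['SU', 'GGG', 'WGR'],
    'Date': ['2004-02-11'] * 3,
    'Value': [0.953727, 0.952753, 0.947634],
})

# Downloaded prices: one ticker is missing on purpose.
right = pd.DataFrame({
    'Ticker': ['SU', 'GGG'],
    'Date': ['2004-02-11'] * 2,
    'Close': [13.285, 9.388889],
})

# The '_merge' column records where each row came from.
audit = left.merge(right, on=['Ticker', 'Date'], how='left', indicator=True)

# Rows that an inner merge would have dropped.
lost = audit[audit['_merge'] == 'left_only']
print(lost[['Ticker', 'Date']])
```

Grouping `lost` by ticker or by date then shows whether the loss is concentrated in a few companies or spread over time.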

In [17]:
df = df.merge(yf_data, left_on = ['Ticker', 'Date'], right_on = ['Ticker', 'Date'], how = 'inner')
df

Unnamed: 0,Date,Ticker,Category,Value,Timestamp,Close
0,2004-02-11,SU,Energy Minerals,0.953727,1.076454e+09,13.285000
1,2004-02-11,GGG,Producer Manufacturing,0.952753,1.076454e+09,9.388889
2,2004-02-11,CWT,Utilities,0.934181,1.076454e+09,14.720000
3,2004-02-11,BLL,Process Industries,0.922862,1.076454e+09,8.095000
4,2004-02-11,APA,Energy Minerals,0.912117,1.076454e+09,39.830002
...,...,...,...,...,...,...
30556,2022-02-09,PEP,Consumer Non-Durables,0.701507,1.644361e+09,172.020004
30557,2022-02-09,SSNC,Technology Services,0.701123,1.644361e+09,80.730003
30558,2022-02-09,GEF,Process Industries,0.697954,1.644361e+09,57.880001
30559,2022-02-09,DPZ,Consumer Services,0.697741,1.644361e+09,438.730011


In [18]:
df.Ticker.value_counts().describe()

count    1374.000000
mean       22.242358
std        22.606683
min         1.000000
25%         6.000000
50%        14.000000
75%        31.750000
max       170.000000
Name: Ticker, dtype: float64

In [19]:
print(f'Number of unique tickers/companies: {len(df.Ticker.unique())}')
print(f'Number of unique dates: {len(df.Date.unique())}')

Number of unique tickers/companies: 1374
Number of unique dates: 467


In [20]:
Ticker_num = df['Ticker'].value_counts()
df = df[df['Ticker'].isin(Ticker_num[Ticker_num >= 14].index)].copy()
df

Unnamed: 0,Date,Ticker,Category,Value,Timestamp,Close
2,2004-02-11,CWT,Utilities,0.934181,1.076454e+09,14.720000
3,2004-02-11,BLL,Process Industries,0.922862,1.076454e+09,8.095000
4,2004-02-11,APA,Energy Minerals,0.912117,1.076454e+09,39.830002
5,2004-02-11,MATX,Transportation,0.866946,1.076454e+09,16.394106
6,2004-02-11,ROST,Retail Trade,0.864789,1.076454e+09,7.712500
...,...,...,...,...,...,...
30554,2022-02-09,SLGN,Process Industries,0.709506,1.644361e+09,43.750000
30556,2022-02-09,PEP,Consumer Non-Durables,0.701507,1.644361e+09,172.020004
30557,2022-02-09,SSNC,Technology Services,0.701123,1.644361e+09,80.730003
30558,2022-02-09,GEF,Process Industries,0.697954,1.644361e+09,57.880001


In [21]:
le = LabelEncoder()
df['Category'] = le.fit_transform(df['Category'])
df

Unnamed: 0,Date,Ticker,Category,Value,Timestamp,Close
2,2004-02-11,CWT,19,0.934181,1.076454e+09,14.720000
3,2004-02-11,BLL,14,0.922862,1.076454e+09,8.095000
4,2004-02-11,APA,7,0.912117,1.076454e+09,39.830002
5,2004-02-11,MATX,18,0.866946,1.076454e+09,16.394106
6,2004-02-11,ROST,16,0.864789,1.076454e+09,7.712500
...,...,...,...,...,...,...
30554,2022-02-09,SLGN,14,0.709506,1.644361e+09,43.750000
30556,2022-02-09,PEP,3,0.701507,1.644361e+09,172.020004
30557,2022-02-09,SSNC,17,0.701123,1.644361e+09,80.730003
30558,2022-02-09,GEF,14,0.697954,1.644361e+09,57.880001


In [22]:
filename = 'LabelEncoder.pickle'
pickle.dump(le, open(filename, 'wb'))

In [23]:
df.to_csv('Prepared_data.csv', index = False)

# Conclusions from the data
At this point we built the first linear regression and CatBoostRegressor models, which reached an r2 of 0.1911353% and 3% (notebooks logistic_reg.ipynb and LinearRegression.ipynb)


* [Linear Regression](https://github.com/kkwasnioch/ML_finanse_new/blob/main/logistic_reg.ipynb)
* [CatBoostRegressor](https://github.com/kkwasnioch/ML_finanse_new/blob/main/LinearRegression.ipynb)

# Further data modelling
We concluded that it is hard to build an adequate regressor from the variables above, so we extended them with financial indicators. To compute the indicators, the `ta` library needs not only the closing price but also the other fields that can be downloaded from Yahoo Finance ('Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume'), so the company data was downloaded again. We then computed the logarithmic rate of return `Y` and restricted the data to companies for which we have at least 28 observations (the code filters on `>= 28`). This is related to the construction of the financial indicators: 14 observations were not enough to compute them.

In [27]:
df.Date.max(),df.Date.min()

('2022-02-09', '2004-02-11')

In [26]:
def collect_data(stock_code, start, end):
    df = pd.DataFrame()
    for code in stock_code:
        collect = yf.download(code, 
                              start = start, 
                              end = end, 
                              progress = False,
                              interval = "1d",
        )
        collect = collect.reset_index()
        collect['Ticker'] = code
        df = pd.concat([df, collect])
    df['Date'] = df['Date'].astype(str).apply(lambda x: x.split()[0])
    df = df.drop_duplicates()
    return df

min_date = datetime.strptime(df.Date.min(),  '%Y-%m-%d') - timedelta(days = 31)
max_date = df.Date.max()

yf_data = collect_data(df.Ticker.unique(), min_date, max_date)
yf_data

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,Ticker
0,2004-01-12,12.690000,12.890000,12.590000,12.880000,8.967817,586800.0,SU
1,2004-01-13,12.880000,13.225000,12.880000,13.100000,9.120991,1518000.0,SU
2,2004-01-14,13.020000,13.195000,12.900000,13.145000,9.152327,1136000.0,SU
3,2004-01-15,13.325000,13.470000,12.840000,12.875000,8.964336,1303600.0,SU
4,2004-01-16,12.750000,12.795000,12.425000,12.495000,8.699762,1799000.0,SU
...,...,...,...,...,...,...,...,...
5483232,2022-02-02,117.639999,118.300003,113.029999,116.300003,116.118454,756600.0,AGCO
5483233,2022-02-03,116.059998,117.550003,114.470001,115.459999,115.279762,731500.0,AGCO
5483234,2022-02-04,114.870003,116.230003,112.690002,114.510002,114.331245,651700.0,AGCO
5483235,2022-02-07,115.150002,116.379997,113.080002,115.760002,115.579292,1334300.0,AGCO


In [None]:
yf_data.to_csv('Data_form_yf_2004_1_11.csv')

In [28]:
# add quotes from the period preceding our data, needed to compute Y
df = pd.concat([df,yf_data[(yf_data.Date == '2004-02-04') & (yf_data.Ticker.isin(df.Ticker.unique()))]])
df

Unnamed: 0,Date,Ticker,Category,Value,Timestamp,Close,Open,High,Low,Adj Close,Volume
2,2004-02-11,CWT,19.0,0.934181,1.076454e+09,14.720000,,,,,
3,2004-02-11,BLL,14.0,0.922862,1.076454e+09,8.095000,,,,,
4,2004-02-11,APA,7.0,0.912117,1.076454e+09,39.830002,,,,,
5,2004-02-11,MATX,18.0,0.866946,1.076454e+09,16.394106,,,,,
6,2004-02-11,ROST,16.0,0.864789,1.076454e+09,7.712500,,,,,
...,...,...,...,...,...,...,...,...,...,...,...
5290923,2004-02-04,KGC,,,,6.870000,7.120000,7.140000,6.870000,6.384343,1697500.0
5300027,2004-02-04,NEM,,,,41.540001,41.610001,42.709999,40.580002,31.331821,7119000.0
5325952,2004-02-04,AEM,,,,12.630000,12.980000,12.980000,12.610000,10.466825,1212200.0
5336776,2004-02-04,MRTN,,,,2.956444,3.102222,3.141333,2.871111,2.511834,230625.0


In [29]:
# fill in Open, High, Low, Adj Close, Volume
df = pd.merge(df[['Date', 'Ticker', 'Category', 'Value', 'Timestamp']], yf_data, left_on = ['Ticker','Date'], right_on = ['Ticker','Date'], how ='left')
df

Unnamed: 0,Date,Ticker,Category,Value,Timestamp,Open,High,Low,Close,Adj Close,Volume
0,2004-02-11,CWT,19.0,0.934181,1.076454e+09,14.650000,14.720000,14.550000,14.720000,9.047531,45800.0
1,2004-02-11,BLL,14.0,0.922862,1.076454e+09,8.047500,8.127500,8.007500,8.095000,6.904861,3648800.0
2,2004-02-11,APA,7.0,0.912117,1.076454e+09,39.549999,39.980000,39.119999,39.830002,31.821800,2063800.0
3,2004-02-11,MATX,18.0,0.866946,1.076454e+09,16.183125,16.422874,16.135176,16.394106,10.440643,442336.0
4,2004-02-11,ROST,16.0,0.864789,1.076454e+09,7.642500,7.727500,7.487500,7.712500,6.473177,5030400.0
...,...,...,...,...,...,...,...,...,...,...,...
26995,2004-02-04,KGC,,,,7.120000,7.140000,6.870000,6.870000,6.384343,1697500.0
26996,2004-02-04,NEM,,,,41.610001,42.709999,40.580002,41.540001,31.331821,7119000.0
26997,2004-02-04,AEM,,,,12.980000,12.980000,12.610000,12.630000,10.466825,1212200.0
26998,2004-02-04,MRTN,,,,3.102222,3.141333,2.871111,2.956444,2.511834,230625.0


### Creating Y
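The target is the logarithmic return between consecutive observations of the same ticker, $Y_t = \ln(C_t / C_{t-1})$, where $C_t$ is the closing price. An equivalent vectorised form, sketched below, takes the per-ticker difference of log prices. The CWT prices are taken from the output further down; the BLL price for 2004-02-04 is made up.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'Ticker': ['CWT', 'CWT', 'CWT', 'BLL', 'BLL'],
    'Date': ['2004-02-11', '2004-02-04', '2004-02-25', '2004-02-11', '2004-02-04'],
    'Close': [14.72, 13.625, 14.71, 8.095, 7.97],  # 7.97 is a made-up value
})

# Order each ticker chronologically before differencing.
demo = demo.sort_values(['Ticker', 'Date'])

# ln(C_t) - ln(C_{t-1}) within each ticker; the first row of a ticker is NaN.
demo['Y'] = np.log(demo['Close']).groupby(demo['Ticker']).diff()
print(demo)
```

Because the difference is taken between consecutive available rows, a long gap between two observations produces a return spanning the whole gap, as in the CWT rows for 2005-02-23 and 2017-08-23.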

In [31]:
df_Y = pd.DataFrame()
for i in tqdm(df.Ticker.unique()):
    df_of_specyfic_ticker = np.log(df.loc[df.Ticker == i].sort_values('Date')['Close'].pct_change()+1)
    df_Y = pd.concat([df_Y, df_of_specyfic_ticker]) 
df_Y = df_Y.rename(columns = {0:'Y'})
df = pd.merge(df, df_Y, left_index=True, right_index=True)
df

100%|███████████████████████████████████████████████████████████████████████████████| 706/706 [00:02<00:00, 309.02it/s]


Unnamed: 0,Date,Ticker,Category,Value,Timestamp,Open,High,Low,Close,Adj Close,Volume,Y
0,2004-02-11,CWT,19.0,0.934181,1.076454e+09,14.650000,14.720000,14.550000,14.720000,9.047531,45800.0,0.077301
1,2004-02-11,BLL,14.0,0.922862,1.076454e+09,8.047500,8.127500,8.007500,8.095000,6.904861,3648800.0,0.015719
2,2004-02-11,APA,7.0,0.912117,1.076454e+09,39.549999,39.980000,39.119999,39.830002,31.821800,2063800.0,0.050990
3,2004-02-11,MATX,18.0,0.866946,1.076454e+09,16.183125,16.422874,16.135176,16.394106,10.440643,442336.0,0.074671
4,2004-02-11,ROST,16.0,0.864789,1.076454e+09,7.642500,7.727500,7.487500,7.712500,6.473177,5030400.0,0.093011
...,...,...,...,...,...,...,...,...,...,...,...,...
26995,2004-02-04,KGC,,,,7.120000,7.140000,6.870000,6.870000,6.384343,1697500.0,
26996,2004-02-04,NEM,,,,41.610001,42.709999,40.580002,41.540001,31.331821,7119000.0,
26997,2004-02-04,AEM,,,,12.980000,12.980000,12.610000,12.630000,10.466825,1212200.0,
26998,2004-02-04,MRTN,,,,3.102222,3.141333,2.871111,2.956444,2.511834,230625.0,


In [32]:
df.loc[df.Ticker == 'CWT'].sort_values('Date')[['Date','Value','Close','Y']]

Unnamed: 0,Date,Value,Close,Y
26427,2004-02-04,,13.625,
0,2004-02-11,0.934181,14.72,0.077301
36,2004-02-25,0.93487,14.71,-0.00068
75,2004-03-10,0.897524,14.65,-0.004087
993,2005-01-12,0.720181,17.415001,0.172892
1027,2005-01-26,0.742207,17.775,0.020461
1077,2005-02-09,0.733815,17.004999,-0.044286
1124,2005-02-23,0.735064,17.120001,0.00674
19489,2017-08-23,0.633706,36.599998,0.759801
19541,2017-09-06,0.653632,36.849998,0.006807


In [33]:
Ticker_num = df['Ticker'].value_counts()
df = df[df['Ticker'].isin(Ticker_num[Ticker_num >= 28].index)]
df

Unnamed: 0,Date,Ticker,Category,Value,Timestamp,Open,High,Low,Close,Adj Close,Volume,Y
1,2004-02-11,BLL,14.0,0.922862,1.076454e+09,8.047500,8.127500,8.007500,8.095000,6.904861,3648800.0,0.015719
2,2004-02-11,APA,7.0,0.912117,1.076454e+09,39.549999,39.980000,39.119999,39.830002,31.821800,2063800.0,0.050990
3,2004-02-11,MATX,18.0,0.866946,1.076454e+09,16.183125,16.422874,16.135176,16.394106,10.440643,442336.0,0.074671
4,2004-02-11,ROST,16.0,0.864789,1.076454e+09,7.642500,7.727500,7.487500,7.712500,6.473177,5030400.0,0.093011
7,2004-02-11,HUBB,15.0,0.850025,1.076454e+09,39.500000,39.980000,39.400002,39.980000,24.869270,133400.0,0.020979
...,...,...,...,...,...,...,...,...,...,...,...,...
26959,2004-02-04,CME,,,,17.150000,17.150000,16.780001,16.874001,9.752662,939500.0,
26962,2004-02-04,CCL,,,,43.950001,43.959999,43.150002,43.349998,28.307062,3851000.0,
26973,2004-02-04,CCMP,,,,43.049999,43.450001,41.889999,42.509998,28.070055,950900.0,
26975,2004-02-04,ARW,,,,26.000000,26.040001,25.480000,25.700001,25.700001,406400.0,


### Creating the indicators
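Technical indicators depend on the order of the rows, and in the frame used here the 2004-02-04 quotes were appended at the end rather than placed first. It may therefore be worth sorting each ticker chronologically before computing indicators. The sketch below illustrates the idea with a two-period moving average as a stand-in for the `ta` features, so that it needs only pandas; the 2004-02-04 price is made up.

```python
import pandas as pd

# The earliest quote sits at the end, mimicking the appended 2004-02-04 rows.
demo = pd.DataFrame({
    'Ticker': ['BLL', 'BLL', 'BLL', 'BLL'],
    'Date': ['2004-02-11', '2004-02-25', '2004-03-10', '2004-02-04'],
    'Close': [8.095, 8.0125, 8.25875, 7.97],  # 7.97 is a made-up value
})

# Sort chronologically within each ticker before any rolling computation.
demo = demo.sort_values(['Ticker', 'Date']).reset_index(drop=True)

# Stand-in indicator: 2-period simple moving average per ticker.
demo['sma_2'] = demo.groupby('Ticker')['Close'].transform(
    lambda s: s.rolling(2).mean()
)
print(demo)
```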

In [36]:
%%time
df_indicators = pd.DataFrame()
for i in df['Ticker'].unique():
    
        df_indicators = pd.concat([df_indicators, 
                             ta.add_all_ta_features(df.loc[df.Ticker == i], open="Open", high="High", 
                                              low="Low", close="Close", 
                                              volume="Volume", fillna=True)])
df_indicators

Wall time: 56 s


Unnamed: 0,Date,Ticker,Category,Value,Timestamp,Open,High,Low,Close,Adj Close,...,momentum_ppo,momentum_ppo_signal,momentum_ppo_hist,momentum_pvo,momentum_pvo_signal,momentum_pvo_hist,momentum_kama,others_dr,others_dlr,others_cr
1,2004-02-11,BLL,14.0,0.922862,1.076454e+09,8.047500,8.127500,8.007500,8.095000,6.904861,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,8.095000e+00,-58.263519,0.000000,0.000000
37,2004-02-25,BLL,14.0,0.921895,1.077664e+09,8.043750,8.043750,7.977500,8.012500,6.834494,...,-0.081361,-0.016272,-0.065089,-5.351707,-1.070341,-4.281365,-3.871326e+02,-1.019153,-1.024382,-1.019153
73,2004-03-10,BLL,14.0,0.926543,1.078873e+09,8.281250,8.372500,8.211250,8.258750,7.061062,...,0.098518,0.006686,0.091832,-4.775846,-1.811442,-2.964403,3.889053e+05,3.073325,3.027044,2.022850
117,2004-03-24,BLL,14.0,0.81165,1.080083e+09,8.053750,8.166250,8.018750,8.141250,6.960603,...,0.122485,0.029846,0.092639,-4.878733,-2.424900,-2.453832,-2.761922e+08,-1.422737,-1.432955,0.571333
614,2004-09-08,BLL,14.0,0.74538,1.094594e+09,9.400000,9.575000,9.337500,9.560000,8.214259,...,1.516205,0.327117,1.189087,-2.183387,-2.376598,0.193210,1.297232e+10,17.426695,16.064408,18.097592
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26127,2021-12-15,GOOG,17.0,0.916296,1.639523e+09,2887.320068,2950.344971,2854.110107,2947.370117,2947.370117,...,10.795514,11.770535,-0.975021,-8.666743,-11.820093,3.153350,1.658897e+05,4.060572,3.980296,101.834570
26191,2021-12-29,GOOG,17.0,0.915398,1.640732e+09,2928.590088,2943.675049,2910.090088,2930.090088,2930.090088,...,10.328982,11.482225,-1.153243,-10.000914,-11.456257,1.455343,1.658897e+05,-0.586286,-0.588012,100.651241
26253,2022-01-12,GOOG,17.0,0.91044,1.641942e+09,2831.090088,2856.284912,2822.239990,2832.959961,2832.959961,...,9.563491,11.098478,-1.534987,-8.580410,-10.881088,2.300678,1.658897e+05,-3.314919,-3.371108,93.999814
26315,2022-01-26,GOOG,17.0,0.910303,1.643152e+09,2611.850098,2656.149902,2543.070068,2584.800049,2584.800049,...,8.146213,10.508025,-2.361812,-1.941481,-9.093166,7.151686,1.658897e+05,-8.759739,-9.167393,77.005936


In [54]:
df_indicators = df_indicators.drop(['others_dr','others_dlr','others_cr'], axis = 1)
df_indicators.head()

Unnamed: 0,Date,Ticker,Category,Value,Timestamp,Open,High,Low,Close,Adj Close,...,momentum_wr,momentum_ao,momentum_roc,momentum_ppo,momentum_ppo_signal,momentum_ppo_hist,momentum_pvo,momentum_pvo_signal,momentum_pvo_hist,momentum_kama
1,2004-02-11,BLL,14.0,0.922862,1076454000.0,8.0475,8.1275,8.0075,8.095,6.904861,...,-27.082787,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8.095
37,2004-02-25,BLL,14.0,0.921895,1077664000.0,8.04375,8.04375,7.9775,8.0125,6.834494,...,-76.666709,0.0,0.0,-0.081361,-0.016272,-0.065089,-5.351707,-1.070341,-4.281365,-387.1326
73,2004-03-10,BLL,14.0,0.926543,1078873000.0,8.28125,8.3725,8.21125,8.25875,7.061062,...,-28.797551,0.0,0.0,0.098518,0.006686,0.091832,-4.775846,-1.811442,-2.964403,388905.3
117,2004-03-24,BLL,14.0,0.81165,1080083000.0,8.05375,8.16625,8.01875,8.14125,6.960603,...,-58.544429,0.0,0.0,0.122485,0.029846,0.092639,-4.878733,-2.4249,-2.453832,-276192200.0
614,2004-09-08,BLL,14.0,0.74538,1094594000.0,9.4,9.575,9.3375,9.56,8.214259,...,-0.938929,0.0,0.0,1.516205,0.327117,1.189087,-2.183387,-2.376598,0.19321,12972320000.0


### Based on the [package documentation](https://technical-analysis-library-in-python.readthedocs.io/en/latest/), we remove the `others_*` columns from the indicators because:
* `others_cr` is the Cumulative Return
* `others_dr` is the Daily Return
* `others_dlr` is the Daily Log Return

**that is, different ways of computing the rate of return, which is our Y**
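
Rather than listing the three columns by hand, they can be selected by prefix, which also catches any further `others_*` column a different library version might add. A minimal sketch on a made-up frame:

```python
import pandas as pd

# Made-up values; only the column names matter here.
demo = pd.DataFrame({
    'Close': [1.0, 1.1],
    'momentum_rsi': [50.0, 55.0],
    'others_dr': [0.0, 10.0],
    'others_dlr': [0.0, 9.53],
    'others_cr': [0.0, 10.0],
})

# Columns derived from the return itself, i.e. from the target.
leaky = [c for c in demo.columns if c.startswith('others_')]

clean = demo.drop(columns=leaky)
print(clean.columns.tolist())
```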


In [55]:
df_indicators.to_csv('data_with_indicators.csv', index =False)

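# Linear regression -- initial model
The model was built on manually selected columns and evaluated with 5-fold cross-validation, computing RMSE and R2. The results are much better than on the initial data: the mean of the five `test_r2` scores is about 0.36, so the model explains roughly 36% of the variance of Y. The value 0.599 printed in the last cell is the square root of that mean, not R2 itself.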

In [118]:
df_indicators = pd.read_csv('data_with_indicators.csv')
df_indicators

Unnamed: 0,Date,Ticker,Category,Value,Timestamp,Open,High,Low,Close,Adj Close,...,momentum_wr,momentum_ao,momentum_roc,momentum_ppo,momentum_ppo_signal,momentum_ppo_hist,momentum_pvo,momentum_pvo_signal,momentum_pvo_hist,momentum_kama
0,2004-02-11,BLL,14.0,0.922862,1.076454e+09,8.047500,8.127500,8.007500,8.095000,6.904861,...,-27.082787,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,8.095000e+00
1,2004-02-25,BLL,14.0,0.921895,1.077664e+09,8.043750,8.043750,7.977500,8.012500,6.834494,...,-76.666709,0.000000,0.000000,-0.081361,-0.016272,-0.065089,-5.351707,-1.070341,-4.281365,-3.871326e+02
2,2004-03-10,BLL,14.0,0.926543,1.078873e+09,8.281250,8.372500,8.211250,8.258750,7.061062,...,-28.797551,0.000000,0.000000,0.098518,0.006686,0.091832,-4.775846,-1.811442,-2.964403,3.889053e+05
3,2004-03-24,BLL,14.0,0.811650,1.080083e+09,8.053750,8.166250,8.018750,8.141250,6.960603,...,-58.544429,0.000000,0.000000,0.122485,0.029846,0.092639,-4.878733,-2.424900,-2.453832,-2.761922e+08
4,2004-09-08,BLL,14.0,0.745380,1.094594e+09,9.400000,9.575000,9.337500,9.560000,8.214259,...,-0.938929,0.000000,0.000000,1.516205,0.327117,1.189087,-2.183387,-2.376598,0.193210,1.297232e+10
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21018,2021-12-15,GOOG,17.0,0.916296,1.639523e+09,2887.320068,2950.344971,2854.110107,2947.370117,2947.370117,...,-8.861245,585.511572,17.597515,10.795514,11.770535,-0.975021,-8.666743,-11.820093,3.153350,1.658897e+05
21019,2021-12-29,GOOG,17.0,0.915398,1.640732e+09,2928.590088,2943.675049,2910.090088,2930.090088,2930.090088,...,-12.552007,580.286009,10.918941,10.328982,11.482225,-1.153243,-10.000914,-11.456257,1.455343,1.658897e+05
21020,2022-01-12,GOOG,17.0,0.910440,1.641942e+09,2831.090088,2856.284912,2822.239990,2832.959961,2832.959961,...,-45.002265,549.580433,3.861597,9.563491,11.098478,-1.534987,-8.580410,-10.881088,2.300678,1.658897e+05
21021,2022-01-26,GOOG,17.0,0.910303,1.643152e+09,2611.850098,2656.149902,2543.070068,2584.800049,2584.800049,...,-90.715322,439.385309,-6.136633,8.146213,10.508025,-2.361812,-1.941481,-9.093166,7.151686,1.658897e+05


In [119]:
df_lr = df_indicators.dropna(subset=['Y'])
y = df_lr['Y']
df_lr = df_lr.dropna( axis = 0)
df_lr.columns

Index(['Date', 'Ticker', 'Category', 'Value', 'Timestamp', 'Open', 'High',
       'Low', 'Close', 'Adj Close', 'Volume', 'Y', 'volume_adi', 'volume_obv',
       'volume_cmf', 'volume_fi', 'volume_em', 'volume_sma_em', 'volume_vpt',
       'volume_vwap', 'volume_mfi', 'volume_nvi', 'volatility_bbm',
       'volatility_bbh', 'volatility_bbl', 'volatility_bbw', 'volatility_bbp',
       'volatility_bbhi', 'volatility_bbli', 'volatility_kcc',
       'volatility_kch', 'volatility_kcl', 'volatility_kcw', 'volatility_kcp',
       'volatility_kchi', 'volatility_kcli', 'volatility_dcl',
       'volatility_dch', 'volatility_dcm', 'volatility_dcw', 'volatility_dcp',
       'volatility_atr', 'volatility_ui', 'trend_macd', 'trend_macd_signal',
       'trend_macd_diff', 'trend_sma_fast', 'trend_sma_slow', 'trend_ema_fast',
       'trend_ema_slow', 'trend_vortex_ind_pos', 'trend_vortex_ind_neg',
       'trend_vortex_ind_diff', 'trend_trix', 'trend_mass_index', 'trend_dpo',
       'trend_kst', 'trend_k

In [60]:
# number of explanatory variables whose correlation with the target exceeds 0.30 (the True count includes Y itself)
(df_lr.corr()['Y'] > 0.30).value_counts()

False    89
True      4
Name: Y, dtype: int64

In [125]:
df_lr = df_indicators.dropna(subset=['Y'])
df_lr = df_lr.dropna(axis = 0)
y = df_lr['Y']
X = df_lr[['Category', 'Value', 'Timestamp', 'volume_adi', 'volume_obv',
       'volume_cmf', 'volume_fi', 'volume_em', 'volume_sma_em', 'volume_vpt',
       'volatility_bbh', 'volatility_bbl', 'volatility_bbw', 'volatility_bbp',
       'volatility_bbhi', 'volatility_bbli', 'volatility_kcc',
       'volatility_kch', 'volatility_kcl', 'volatility_kcw', 'volatility_kcp',
       'volatility_kchi', 'volatility_kcli', 'volatility_dcl',
       'volatility_dch', 'volatility_dcm', 'volatility_dcw', 'volatility_dcp',
       'volatility_atr', 'volatility_ui', 
           'trend_macd', 'trend_macd_signal',
       'trend_macd_diff', 'trend_sma_fast', 'trend_sma_slow', 'trend_ema_fast',
       'trend_ema_slow', 'trend_vortex_ind_pos', 'trend_vortex_ind_neg',
       'trend_vortex_ind_diff', 'trend_trix', 'trend_mass_index', 'trend_dpo',
       'trend_kst', 'trend_kst_sig', 'trend_kst_diff', 'trend_ichimoku_conv',
       'trend_ichimoku_base', 'trend_ichimoku_a', 'trend_ichimoku_b',
       'trend_stc', 'trend_adx', 'trend_adx_pos', 'trend_adx_neg', 'trend_cci',
       'trend_visual_ichimoku_a', 'trend_visual_ichimoku_b', 'trend_aroon_up',
       'trend_aroon_down', 'trend_aroon_ind', 'trend_psar_up',
       'trend_psar_down', 'trend_psar_up_indicator',
       'trend_psar_down_indicator',
           'momentum_rsi', 'momentum_stoch_rsi',
       'momentum_stoch_rsi_k', 'momentum_stoch_rsi_d', 'momentum_tsi',
       'momentum_uo', 'momentum_stoch', 'momentum_stoch_signal', 'momentum_wr',
       'momentum_ao', 'momentum_roc', 'momentum_ppo', 'momentum_ppo_signal',
       'momentum_ppo_hist', 'momentum_pvo', 'momentum_pvo_signal',
       'momentum_pvo_hist']]

In [126]:
reg = LinearRegression()
cv = KFold(n_splits=5, random_state=42, shuffle=True)
cv_results  = cross_validate(reg, X, y, scoring=('r2', 'neg_mean_squared_error'),
                         cv=cv, n_jobs=-1)
cv_results 

{'fit_time': array([0.22978163, 0.2277863 , 0.19453955, 0.20718122, 0.22080564]),
 'score_time': array([0.        , 0.        , 0.01563334, 0.0156188 , 0.        ]),
 'test_r2': array([0.38781196, 0.31994561, 0.26160015, 0.40275797, 0.42174527]),
 'test_neg_mean_squared_error': array([-0.01348895, -0.01160712, -0.01476258, -0.01346346, -0.01212141])}

In [127]:
#view RMSE
sqrt(mean(absolute(cv_results['test_neg_mean_squared_error'])))

0.11440586462694495

In [128]:
#view sqrt of the mean R2 (not R2 itself; the mean of test_r2 is about 0.359)
sqrt(mean(absolute(cv_results['test_r2'])))

0.5989759531341823
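
The square root is appropriate for turning MSE into RMSE, but R2 is already on its final scale and only needs to be averaged. The sketch below recomputes both summaries from the fold scores printed by `cross_validate` above; the variable names are chosen for this illustration.

```python
import numpy as np

# Fold scores copied from the cross_validate output above.
test_r2 = np.array([0.38781196, 0.31994561, 0.26160015, 0.40275797, 0.42174527])
test_neg_mse = np.array([-0.01348895, -0.01160712, -0.01476258,
                         -0.01346346, -0.01212141])

# R2: plain mean over the folds, no square root.
mean_r2 = test_r2.mean()

# RMSE: square root of the mean MSE (the sign is flipped because
# scikit-learn reports the negated MSE).
rmse = np.sqrt((-test_neg_mse).mean())

print(round(mean_r2, 4), round(rmse, 4))  # prints 0.3588 0.1144
```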