`In this notebook I will build the dataset and run time series feature extraction. I will then fit models on the resulting dataset. Because I do not have much RAM, I will save the datasets after these steps and do the modelling in a separate notebook.`

In [2]:
import numpy as np
import pandas as pd
import yfinance as yf

In [4]:
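def get_ts_data(ticker_list: list, start_date: str) -> pd.DataFrame:
    # Download daily price history for every ticker in the list.
    data = yf.download(ticker_list, start=start_date)
    # Month-end adjusted close -> monthly gross return (1 + percentage change).
    data_close = data["Adj Close"].resample("M").last().pct_change() + 1
    return data_close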
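
To make the transformation in `get_ts_data` concrete, here is a minimal sketch on synthetic daily prices. The price values and the `TOY` column name are made up for illustration. It groups by calendar month rather than calling `resample("M")`, because newer pandas releases deprecate the `"M"` alias in favour of `"ME"`; for data without empty months both approaches pick the same month-end values.

```python
import numpy as np
import pandas as pd

# Synthetic daily prices for one hypothetical ticker.
dates = pd.date_range("2005-01-01", "2005-03-31", freq="D")
prices = pd.DataFrame({"TOY": 50.0}, index=dates)
prices.loc[pd.Timestamp("2005-01-31"), "TOY"] = 100.0
prices.loc[pd.Timestamp("2005-02-28"), "TOY"] = 110.0
prices.loc[pd.Timestamp("2005-03-31"), "TOY"] = 99.0

# Last observation of each calendar month (stand-in for resample("M").last()).
month_end = prices.groupby(prices.index.to_period("M")).last()

# Gross return: 1.1 means +10 %, 0.9 means -10 %; the first month has no predecessor.
gross_returns = month_end.pct_change() + 1
print(gross_returns)
```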

In [3]:
df_financials = pd.read_csv("data/stock_sectors/financials.csv")
financial_tickers = df_financials["Symbol"].tolist()

In [4]:
financial_ts = get_ts_data(financial_tickers, "2005-01-01")

[*********************100%%**********************]  1003 of 1003 completed

8 Failed downloads:
['BRK.B', 'AGM.A', 'DISA', 'BNRE.A', 'CRD.A', 'LEGT', 'DYCQ', 'CRD.B']: Exception('%ticker%: No timezone found, symbol may be delisted')


In [5]:
df_technology = pd.read_csv("data/stock_sectors/technology.csv")
technology_tickers = df_technology["Symbol"].tolist()

In [6]:
# Keep only string symbols; empty cells in the CSV come through as float NaN.
technology_tickers = [t for t in technology_tickers if isinstance(t, str)]
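
A missing `Symbol` cell is read by pandas as a float `NaN`, which is why non-string entries need to be filtered out. Calling `remove` on a list while a `for` loop is iterating over it shifts the remaining items to the left, so the element right after each removed one is never visited. A small sketch with a made-up ticker list shows how that differs from building a new list:

```python
nan = float("nan")
tickers = ["AAPL", nan, nan, "MSFT"]

# Removing during iteration: the second NaN slides into the slot that was
# just visited, so the loop never looks at it.
looped = list(tickers)
for ticker in looped:
    if not isinstance(ticker, str):
        looped.remove(ticker)

# Building a new list visits every element exactly once.
filtered = [t for t in tickers if isinstance(t, str)]

print(len(looped))   # one NaN is still in the list
print(filtered)
```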

In [7]:
technology_ts = get_ts_data(technology_tickers, "2005-01-01")

[*********************100%%**********************]  786 of 786 completed


In [8]:
df_healthcare = pd.read_csv("data/stock_sectors/healthcare.csv")
healthcare_tickers = df_healthcare["Symbol"].tolist()

In [9]:
healthcare_ts = get_ts_data(healthcare_tickers, "2005-01-01")

[*********************100%%**********************]  1219 of 1219 completed

1 Failed download:
['BIO.B']: Exception('%ticker%: No price data found, symbol may be delisted (1d 2005-01-01 -> 2024-03-15)')


In [5]:
def get_rolling_ret(data, n):
    # Product of the last n gross returns = compounded n-period gross return.
    return data.rolling(n).apply(np.prod)
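
Since each monthly value is a gross return (1 + percentage change), the product over a 3-month window is the compounded 3-month gross return, i.e. the month-end price divided by the price three months earlier. The minimal sketch below checks that equivalence on made-up month-end prices. It also shows that the first three results are NaN, which matches the `[3:]` slice applied after the rolling step.

```python
import numpy as np
import pandas as pd

# Hypothetical month-end prices for a single ticker.
prices = pd.Series([100.0, 110.0, 99.0, 108.9, 120.0])

gross = prices.pct_change() + 1               # first value is NaN
rolling_3m = gross.rolling(3).apply(np.prod)  # same call as get_rolling_ret(gross, 3)

# The product of consecutive gross returns telescopes to P_t / P_{t-3}.
direct = prices / prices.shift(3)

print(pd.DataFrame({"rolling_3m": rolling_3m, "direct": direct}))
```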

In [11]:
financial_ts = get_rolling_ret(financial_ts, 3)
technology_ts = get_rolling_ret(technology_ts, 3)
healthcare_ts = get_rolling_ret(healthcare_ts, 3)

In [12]:
# The first three rows are NaN (one from pct_change, two more from the 3-month window).
financial_ts = financial_ts[3:]
technology_ts = technology_ts[3:]
healthcare_ts = healthcare_ts[3:]

In [6]:
def melt_ts_data(df: pd.DataFrame, sector_name: str) -> pd.DataFrame:
    df = df.copy()
    df.reset_index(inplace=True)
    df = df.melt(id_vars="Date", var_name="Ticker", value_name="Change")
    df["Sector"] = sector_name
    return df
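
`melt_ts_data` turns the wide frame (one column per ticker) into the long format used later for feature extraction, with one row per (date, ticker) pair. A sketch on a toy frame with two invented tickers; the function body is repeated only so the example runs on its own:

```python
import pandas as pd

def melt_ts_data(df: pd.DataFrame, sector_name: str) -> pd.DataFrame:
    # Same steps as the notebook's function.
    df = df.reset_index()
    df = df.melt(id_vars="Date", var_name="Ticker", value_name="Change")
    df["Sector"] = sector_name
    return df

# Wide input: index named "Date", one column per (invented) ticker.
wide = pd.DataFrame(
    {"AAA": [1.05, 0.98], "BBB": [1.10, 1.02]},
    index=pd.DatetimeIndex(["2005-04-30", "2005-05-31"], name="Date"),
)

long = melt_ts_data(wide, "Finance")
print(long)
```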

In [25]:
financial_ts = melt_ts_data(financial_ts, "Finance")
technology_ts = melt_ts_data(technology_ts, "Technology")
healthcare_ts = melt_ts_data(healthcare_ts, "Healthcare")

In [27]:
all_ts = pd.concat([financial_ts, technology_ts, healthcare_ts])

In [30]:
all_ts

Unnamed: 0,Date,Ticker,Change,Sector
0,2005-04-30,AACI,,Finance
1,2005-05-31,AACI,,Finance
2,2005-06-30,AACI,,Finance
3,2005-07-31,AACI,,Finance
4,2005-08-31,AACI,,Finance
...,...,...,...,...
277927,2023-11-30,ZYXI,1.189610,Healthcare
277928,2023-12-31,ZYXI,1.361250,Healthcare
277929,2024-01-31,ZYXI,1.333333,Healthcare
277930,2024-02-29,ZYXI,1.480349,Healthcare


Since we are looking at changes, it makes sense to fill the missing values with 0.

In [31]:
all_ts.fillna(0, inplace=True)
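
One point worth keeping in mind: `Change` holds gross returns (`pct_change() + 1`), so a filled value of 0 reads as a complete loss, whereas the value that means "no change" on this scale is 1. The notebook fills with 0; the sketch below only illustrates, on invented numbers, how the two choices shift a simple statistic such as the median for a ticker with a short history.

```python
import numpy as np
import pandas as pd

# Hypothetical ticker that only has data for the last two months.
change = pd.Series([np.nan, np.nan, np.nan, 1.02, 0.97])

filled_zero = change.fillna(0)  # what the notebook does
filled_one = change.fillna(1)   # neutral value for gross returns

print(filled_zero.median(), filled_one.median())
```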

Let's start with tsfresh.

In [15]:
import tsfresh

In [42]:
df_features = tsfresh.extract_features(
    all_ts.drop("Sector", axis=1),
    column_id="Ticker",
    column_sort="Date",
    default_fc_parameters=tsfresh.feature_extraction.settings.EfficientFCParameters()
)

Feature Extraction: 100%|██████████| 30/30 [00:53<00:00,  1.77s/it]


In [47]:
df_features

Unnamed: 0,Change__variance_larger_than_standard_deviation,Change__has_duplicate_max,Change__has_duplicate_min,Change__has_duplicate,Change__sum_values,Change__abs_energy,Change__mean_abs_change,Change__mean_change,Change__mean_second_derivative_central,Change__median,...,Change__fourier_entropy__bins_5,Change__fourier_entropy__bins_10,Change__fourier_entropy__bins_100,Change__permutation_entropy__dimension_3__tau_1,Change__permutation_entropy__dimension_4__tau_1,Change__permutation_entropy__dimension_5__tau_1,Change__permutation_entropy__dimension_6__tau_1,Change__permutation_entropy__dimension_7__tau_1,Change__query_similarity_count__query_None__threshold_0.0,Change__mean_n_absolute_max__number_of_maxima_7
A,0.0,0.0,0.0,0.0,237.491343,252.238261,0.089564,0.000551,-0.000162,1.046311,...,0.608606,0.971717,2.610491,1.732168,2.937143,4.119118,4.949371,5.294163,,1.353611
AACI,0.0,0.0,1.0,1.0,26.340809,26.688917,0.005178,0.004495,0.000002,0.000000,...,0.099760,0.099760,0.250131,0.418784,0.583114,0.646688,0.655318,0.657778,,1.027081
AACT,0.0,0.0,1.0,1.0,7.098699,7.198851,0.004530,0.004456,-0.000009,0.000000,...,0.099760,0.099760,0.099760,0.113480,0.142331,0.171379,0.172025,0.172677,,1.014100
AADI,0.0,0.0,1.0,1.0,65.813925,70.317652,0.077365,0.004645,0.001504,0.000000,...,0.099760,0.187157,0.540212,0.935161,1.392436,1.733054,1.864353,1.902331,,1.756573
AAMC,0.0,0.0,1.0,1.0,28.556344,52.619379,0.045621,0.003660,-0.000883,0.000000,...,0.099760,0.099760,0.250131,0.266938,0.354118,0.424417,0.495172,0.560143,,2.344726
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ZURA,0.0,0.0,1.0,1.0,7.726358,6.329994,0.009306,0.003000,-0.000172,0.000000,...,0.099760,0.099760,0.099760,0.163967,0.199123,0.199870,0.200624,0.201384,,0.841348
ZVRA,0.0,0.0,1.0,1.0,102.949610,121.294966,0.125284,0.004076,-0.001086,0.000000,...,0.099760,0.149525,0.928146,1.247146,1.895057,2.421596,2.679818,2.803428,,2.044759
ZVSA,0.0,0.0,1.0,1.0,13.386454,10.954020,0.024291,0.004014,0.000946,0.000000,...,0.087705,0.099760,0.377217,0.285775,0.426695,0.553645,0.598786,0.629418,,1.039076
ZYME,0.0,0.0,1.0,1.0,85.133038,97.508129,0.071286,0.004577,-0.000697,0.000000,...,0.099760,0.149525,0.580963,1.025424,1.559174,1.945210,2.107063,2.145863,,1.696534
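
The preview shows at least one column, `Change__query_similarity_count__query_None__threshold_0.0`, that is NaN in every displayed row. If such a column turns out to be empty for all tickers, it could be dropped before saving. A minimal pandas sketch on a stand-in frame (the values are illustrative; only the column names are taken from the output above):

```python
import numpy as np
import pandas as pd

# Small stand-in for df_features.
features = pd.DataFrame(
    {
        "Change__sum_values": [237.49, 26.34],
        "Change__median": [1.046, 0.0],
        "Change__query_similarity_count__query_None__threshold_0.0": [np.nan, np.nan],
    },
    index=["A", "AACI"],
)

# Drop only the columns in which every value is missing.
cleaned = features.dropna(axis=1, how="all")
print(cleaned.columns.tolist())
```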


In [50]:
ticker_sector = all_ts.groupby("Ticker").first()["Sector"]

In [51]:
df_features["Sector"] = df_features.index.map(ticker_sector)

We save this dataset so that it can be used later.

In [57]:
df_features.to_csv("data/stock_sectors/feature_data.csv")
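
`to_csv` writes the ticker index as an unnamed first column. When the file is loaded again in the modelling notebook, passing `index_col=0` to `pd.read_csv` restores the tickers as the index instead of producing an `Unnamed: 0` column. A sketch that uses an in-memory buffer in place of the real file, with invented values:

```python
import io
import pandas as pd

# Small stand-in for df_features (values are illustrative).
features = pd.DataFrame(
    {"Change__median": [1.046, 0.0], "Sector": ["Healthcare", "Finance"]},
    index=["A", "AACI"],
)

buffer = io.StringIO()
features.to_csv(buffer)

# Without index_col the tickers end up in a column called "Unnamed: 0".
buffer.seek(0)
naive = pd.read_csv(buffer)

# With index_col=0 the tickers become the index again.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

print(naive.columns.tolist())
print(restored.index.tolist())
```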

## Building the Time Series Dataset for Other Sectors

In [19]:
def get_featured_data(data_path: str, sector: str) -> pd.DataFrame:
    df = pd.read_csv(data_path)
    tickers = df["Symbol"].tolist()
    # Keep only string symbols; empty cells in the CSV come through as float NaN.
    tickers = [t for t in tickers if isinstance(t, str)]
    ts_df = get_ts_data(tickers, "2005-01-01")
    ts_df = get_rolling_ret(ts_df, 3)
    ts_df = ts_df[3:]
    ts_df = melt_ts_data(ts_df, sector)
    ts_df.fillna(0, inplace=True)
    features_df = tsfresh.extract_features(
        ts_df.drop("Sector", axis=1),
        column_id="Ticker",
        column_sort="Date",
        default_fc_parameters=tsfresh.feature_extraction.settings.EfficientFCParameters()
    )
    ticker_sector = ts_df.groupby("Ticker").first()["Sector"]
    features_df["Sector"] = features_df.index.map(ticker_sector)
    return features_df

In [20]:
energy_features = get_featured_data("data/stock_sectors/energy.csv", "Energy")

[*********************100%%**********************]  252 of 252 completed

1 Failed download:
['PBR.A']: Exception('%ticker%: No timezone found, symbol may be delisted')
Feature Extraction: 100%|██████████| 28/28 [00:07<00:00,  3.88it/s]


In [22]:
energy_features.to_csv("data/stock_sectors/energy_features.csv")

In [23]:
real_estate_features = get_featured_data("data/stock_sectors/real-estate.csv", "Real Estate")
real_estate_features.to_csv("data/stock_sectors/real_estate_features.csv")

[*********************100%%**********************]  261 of 261 completed
Feature Extraction: 100%|██████████| 29/29 [00:07<00:00,  4.13it/s]
