### What this notebook does

- Preprocessing
- Feature engineering
- CSV file generation

Import the libraries and load the data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import japanize_matplotlib

In [2]:
df = pd.read_csv('../stock_price.csv')

In [3]:
df.head()

Unnamed: 0,日付け,終値,始値,高値,安値,出来高,変化率 %
0,2024-08-01,156.3,159.3,159.4,156.1,79.15M,-2.56%
1,2024-07-31,160.4,158.2,160.7,158.1,173.91M,1.07%
2,2024-07-30,158.7,158.8,159.2,158.0,138.14M,-0.63%
3,2024-07-29,159.7,158.7,160.2,158.4,126.28M,1.14%
4,2024-07-26,157.9,159.3,159.6,157.9,155.08M,-0.13%


Strip the unit suffixes from the `出来高` (volume) and `変化率 %` (percent change) columns and convert both to float

In [4]:
# Strip the 'M' / 'B' suffix from the volume and convert to float
def convert_volume(volume):
    if 'M' in volume:  # M: millions
        return float(volume.replace('M', '')) * 1e6
    elif 'B' in volume:  # B: billions
        return float(volume.replace('B', '')) * 1e9
    else:
        return float(volume)

df['出来高'] = df['出来高'].apply(convert_volume)

# Strip the '%' from the percent change and convert to float
df['変化率 %'] = df['変化率 %'].str.replace('%', '', regex=False).astype(float)
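
As a quick check of the suffix handling above, the following minimal sketch applies the same idea to a few sample strings. The helper name `parse_volume` and the sample values other than `79.15M` are illustrative assumptions, not part of the notebook.

```python
def parse_volume(volume: str) -> float:
    """Convert strings such as '79.15M' or '1.2B' to a plain float."""
    multipliers = {'M': 1e6, 'B': 1e9}  # M: millions, B: billions
    suffix = volume[-1]
    if suffix in multipliers:
        return float(volume[:-1]) * multipliers[suffix]
    return float(volume)  # no suffix: already a plain number

# Hypothetical sample inputs
samples = ['79.15M', '1.2B', '500']
converted = [parse_volume(v) for v in samples]
print(converted)
```

Neither this sketch nor `convert_volume` handles missing values, so a column containing NaN would need to be cleaned first.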

Create new variables (`year`, `quarter`, `month`, `week`, `dayofweek`, `day`) from the `日付け` (date) column

In [5]:
# Convert to datetime
df['日付け'] = pd.to_datetime(df['日付け'])

df['year'] = df['日付け'].dt.year
df['quarter'] = df['日付け'].dt.quarter
df['month'] = df['日付け'].dt.month
df['week'] = df['日付け'].dt.isocalendar().week
df['dayofweek'] = df['日付け'].dt.dayofweek
df['day'] = df['日付け'].dt.day

Create new variables from the ones above, using trigonometric functions to represent their cyclical nature

In [6]:
# Add month features
df['month_cos'] = df['month'].apply(lambda x: np.cos(2 * np.pi * x / 12))
df['month_sin'] = df['month'].apply(lambda x: np.sin(2 * np.pi * x / 12))

# Add week features
df['week_cos'] = df['week'].apply(lambda x: np.cos(2 * np.pi * x / 52))
df['week_sin'] = df['week'].apply(lambda x: np.sin(2 * np.pi * x / 52))

# Add day-of-month features
df['day_cos'] = df['day'].apply(lambda x: np.cos(2 * np.pi * x / 31))
df['day_sin'] = df['day'].apply(lambda x: np.sin(2 * np.pi * x / 31))
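
The sine/cosine pair matters because the raw month number puts December (12) and January (1) eleven units apart, although they are neighbors in the calendar. A minimal sketch using only the standard library illustrates this; the helper `encode_cyclic` is a hypothetical name introduced here.

```python
import math

def encode_cyclic(value, period):
    """Map a value with the given period onto the unit circle as (cos, sin)."""
    angle = 2 * math.pi * value / period
    return math.cos(angle), math.sin(angle)

dec, jan, jun = (encode_cyclic(month, 12) for month in (12, 1, 6))

# Euclidean distance between the encoded points
dec_jan = math.dist(dec, jan)  # adjacent months
dec_jun = math.dist(dec, jun)  # opposite sides of the year
print(dec_jan, dec_jun)
```

After encoding, December sits closer to January than to June. Note that ISO week numbers can reach 53 and months differ in length, so the periods 52 and 31 used in the notebook are approximations.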

Create a new variable `median_date`: the number of days between each `日付け` value and the median date, which serves as the reference point

In [7]:
# Compute the median of the datetime column
median_timestamp = df['日付け'].astype('int64').median()
median_date = pd.to_datetime(median_timestamp)

start_date = pd.Timestamp(median_date)

df['median_date'] = df['日付け'].apply(lambda x: (x - start_date).days)
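
The same offset can be computed without a row-wise `apply`. The sketch below uses five made-up dates and pins the column to nanosecond resolution, on the assumption that the integer conversion is meant to yield nanoseconds since the epoch.

```python
import pandas as pd

# Hypothetical sample dates, pinned to nanosecond resolution
dates = pd.Series(pd.to_datetime(
    ['2024-07-29', '2024-07-30', '2024-07-31', '2024-08-01', '2024-08-02']
)).astype('datetime64[ns]')

# Median of the timestamps expressed as integer nanoseconds
median_ns = dates.astype('int64').median()
median_date = pd.Timestamp(int(median_ns))

# Vectorized day offset from the median date
days_from_median = (dates - median_date).dt.days
print(days_from_median.tolist())
```

With an even number of rows the median falls between two dates, and `.days` rounds the resulting offsets toward negative infinity.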

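Compute the difference in closing price between adjacent rows and store it in a new variable, `diff_close`. Because the rows are sorted newest first, `shift(-1)` returns the previous trading day's row, so as written the value is the previous trading day's close minus the current day's close (for example, 160.4 - 156.3 = 4.1 for 2024-08-01). To get the next day's close instead, sort by `日付け` in ascending order before shifting.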

In [8]:
df_shift = df.shift(-1)
df['diff_close'] = df_shift['終値'] - df['終値']
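
The direction of `shift` depends on the row order. A minimal sketch, using three closing prices from the `df.head()` output above and the illustrative column names `date` and `close`, shows the difference:

```python
import pandas as pd

# Newest first, like the rows shown by df.head()
prices = pd.DataFrame({
    'date': pd.to_datetime(['2024-08-01', '2024-07-31', '2024-07-30']),
    'close': [156.3, 160.4, 158.7],
})

# On newest-first rows, shift(-1) pulls up the older (previous) trading day
prev_minus_today = prices['close'].shift(-1) - prices['close']

# After sorting oldest first, shift(-1) pulls up the next trading day
asc = prices.sort_values('date').reset_index(drop=True)
asc['next_minus_today'] = asc['close'].shift(-1) - asc['close']

print(prev_minus_today.tolist())
print(asc[['date', 'next_minus_today']])
```

In both orderings the last row has no following row, so its difference is NaN.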

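Create a new variable, `increase`, set to 1 where `diff_close` is positive (the shifted close is higher than the current day's close) and 0 otherwise; ties and the row with a missing `diff_close` are also 0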

In [9]:
# Single-step assignment; avoids the chained-assignment warning that
# pandas raises for df['increase'][df['diff_close'] > 0] = 1
df['increase'] = (df['diff_close'] > 0).astype(int)
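
pandas recommends a single-step `.loc` assignment over chained indexing such as `df['col'][mask] = value`. A minimal sketch of that form on a toy frame follows; the frame and its values are illustrative.

```python
import pandas as pd

# Toy data; None becomes NaN in the float column
frame = pd.DataFrame({'diff_close': [4.1, -1.7, 1.0, None]})

frame['increase'] = 0
# Row mask and column label in one .loc call, so the original frame is updated
frame.loc[frame['diff_close'] > 0, 'increase'] = 1

print(frame)
```

A comparison with NaN evaluates to False, so the row with a missing `diff_close` keeps the default 0.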

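Compute the difference between an adjacent row's close and the current day's open and store it in a new variable, `diff_close_open`. Because the rows are sorted newest first, `shift(1)` returns the next trading day's row, so as written the value is the next trading day's close minus the current day's open (for example, 156.3 - 158.2 = -1.9 for 2024-07-31). To get the previous day's close instead, sort by `日付け` in ascending order before shifting.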

In [10]:
df_shift_past = df.shift(1)
df['diff_close_open'] = df_shift_past['終値'] - df['始値']

In [11]:
df.head()

Unnamed: 0,日付け,終値,始値,高値,安値,出来高,変化率 %,year,quarter,month,...,month_cos,month_sin,week_cos,week_sin,day_cos,day_sin,median_date,diff_close,increase,diff_close_open
0,2024-08-01,156.3,159.3,159.4,156.1,79150000.0,-2.56,2024,3,8,...,-0.5,-0.866025,-0.822984,-0.568065,0.97953,0.2012985,6868,4.1,1,
1,2024-07-31,160.4,158.2,160.7,158.1,173910000.0,1.07,2024,3,7,...,-0.866025,-0.5,-0.822984,-0.568065,1.0,-2.449294e-16,6867,-1.7,0,-1.9
2,2024-07-30,158.7,158.8,159.2,158.0,138140000.0,-0.63,2024,3,7,...,-0.866025,-0.5,-0.822984,-0.568065,0.97953,-0.2012985,6866,1.0,1,1.6
3,2024-07-29,159.7,158.7,160.2,158.4,126280000.0,1.14,2024,3,7,...,-0.866025,-0.5,-0.822984,-0.568065,0.918958,-0.3943559,6865,-1.8,0,0.0
4,2024-07-26,157.9,159.3,159.6,157.9,155080000.0,-0.13,2024,3,7,...,-0.866025,-0.5,-0.885456,-0.464723,0.528964,-0.8486443,6862,0.2,1,0.4


Add the difference between the current day's open and close (`today_price` = open minus close)

In [12]:
df['today_price'] = df['始値'] - df['終値']

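Encode the periodic structure of the time variables with trigonometric functions. This cell recomputes the week, month, and day encodings from In [6] and adds the quarter encoding.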

In [13]:
# Add week features
df['week_cos'] = df['week'].apply(lambda x: np.cos(2 * np.pi * x / 52))
df['week_sin'] = df['week'].apply(lambda x: np.sin(2 * np.pi * x / 52))

# Add month features
df['month_cos'] = df['month'].apply(lambda x: np.cos(2 * np.pi * x / 12))
df['month_sin'] = df['month'].apply(lambda x: np.sin(2 * np.pi * x / 12))

# Add quarter features
df['quarter_cos'] = df['quarter'].apply(lambda x: np.cos(2 * np.pi * x / 4))
df['quarter_sin'] = df['quarter'].apply(lambda x: np.sin(2 * np.pi * x / 4))

# Add day-of-month features
df['day_cos'] = df['day'].apply(lambda x: np.cos(2 * np.pi * x / 31))
df['day_sin'] = df['day'].apply(lambda x: np.sin(2 * np.pi * x / 31))

Check for missing values and fill them with 0

In [14]:
# Check for and handle missing values
print(df.isnull().sum())
df = df.fillna(0)

日付け                0
終値                 0
始値                 0
高値                 0
安値                 0
出来高                0
変化率 %              0
year               0
quarter            0
month              0
week               0
dayofweek          0
day                0
month_cos          0
month_sin          0
week_cos           0
week_sin           0
day_cos            0
day_sin            0
median_date        0
diff_close         1
increase           0
diff_close_open    1
today_price        0
quarter_cos        0
quarter_sin        0
dtype: int64
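
The two missing values come from the shifts: the last row has nothing after it and the first row has nothing before it. A minimal sketch on a toy frame (values are illustrative) shows how to locate those boundary rows before filling them:

```python
import numpy as np
import pandas as pd

# Toy frame with the same pattern of gaps as the two shifted columns
frame = pd.DataFrame({
    'diff_close': [4.1, -1.7, np.nan],       # last row has no following row
    'diff_close_open': [np.nan, -1.9, 1.6],  # first row has no preceding row
})

missing_per_column = frame.isnull().sum()
boundary_rows = frame[frame.isnull().any(axis=1)]  # rows with any gap

filled = frame.fillna(0)
print(missing_per_column)
print(boundary_rows)
```

After `fillna(0)`, these rows are indistinguishable from days with a true difference of zero, which may be worth keeping in mind when the columns are used as model inputs.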


Create the CSV file

In [15]:
df.to_csv('ml_before.csv', index=False)
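
CSV does not store column types, so the `日付け` column has to be parsed again when the file is read back. A minimal sketch, using an in-memory buffer in place of `ml_before.csv` and a two-row toy frame:

```python
import io
import pandas as pd

# Toy frame standing in for the notebook's df
frame = pd.DataFrame({
    '日付け': pd.to_datetime(['2024-08-01', '2024-07-31']),
    '終値': [156.3, 160.4],
})

buffer = io.StringIO()  # stands in for the file on disk
frame.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype on load
restored = pd.read_csv(buffer, parse_dates=['日付け'])
print(restored.dtypes)
```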

Final set of variables

In [16]:
df.head()

Unnamed: 0,日付け,終値,始値,高値,安値,出来高,変化率 %,year,quarter,month,...,week_sin,day_cos,day_sin,median_date,diff_close,increase,diff_close_open,today_price,quarter_cos,quarter_sin
0,2024-08-01,156.3,159.3,159.4,156.1,79150000.0,-2.56,2024,3,8,...,-0.568065,0.97953,0.2012985,6868,4.1,1,0.0,3.0,-1.83697e-16,-1.0
1,2024-07-31,160.4,158.2,160.7,158.1,173910000.0,1.07,2024,3,7,...,-0.568065,1.0,-2.449294e-16,6867,-1.7,0,-1.9,-2.2,-1.83697e-16,-1.0
2,2024-07-30,158.7,158.8,159.2,158.0,138140000.0,-0.63,2024,3,7,...,-0.568065,0.97953,-0.2012985,6866,1.0,1,1.6,0.1,-1.83697e-16,-1.0
3,2024-07-29,159.7,158.7,160.2,158.4,126280000.0,1.14,2024,3,7,...,-0.568065,0.918958,-0.3943559,6865,-1.8,0,0.0,-1.0,-1.83697e-16,-1.0
4,2024-07-26,157.9,159.3,159.6,157.9,155080000.0,-0.13,2024,3,7,...,-0.464723,0.528964,-0.8486443,6862,0.2,1,0.4,1.4,-1.83697e-16,-1.0


In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9202 entries, 0 to 9201
Data columns (total 26 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   日付け              9202 non-null   datetime64[ns]
 1   終値               9202 non-null   float64       
 2   始値               9202 non-null   float64       
 3   高値               9202 non-null   float64       
 4   安値               9202 non-null   float64       
 5   出来高              9202 non-null   float64       
 6   変化率 %            9202 non-null   float64       
 7   year             9202 non-null   int32         
 8   quarter          9202 non-null   int32         
 9   month            9202 non-null   int32         
 10  week             9202 non-null   UInt32        
 11  dayofweek        9202 non-null   int32         
 12  day              9202 non-null   int32         
 13  month_cos        9202 non-null   float64       
 14  month_sin        9202 non-null   float64