
In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

In [2]:
url = "https://raw.githubusercontent.com/pratamaridho/prediksis-saham-/main/BBCA_cleaned.csv"
df = pd.read_csv(url)

df.head()


,Date,Close,High,Low,Open,Volume
0,2004-06-08,99.696106,101.100276,98.291935,98.291935,499150000
1,2004-06-09,101.10025,102.50442,98.29191,99.69608,294290000
2,2004-06-10,101.10025,101.10025,99.69608,101.10025,165590000
3,2004-06-11,101.10025,101.10025,99.69608,99.69608,135830000
4,2004-06-14,99.696106,101.100276,98.291935,101.100276,158540000


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5280 entries, 0 to 5279
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Date    5280 non-null   object 
 1   Close   5280 non-null   float64
 2   High    5280 non-null   float64
 3   Low     5280 non-null   float64
 4   Open    5280 non-null   float64
 5   Volume  5280 non-null   int64  
dtypes: float64(4), int64(1), object(1)
memory usage: 247.6+ KB
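
The `df.info()` output lists `Date` with dtype `object`, which means the dates were read as plain strings. As a minimal sketch of one alternative, `pd.read_csv` can parse the column while loading through its `parse_dates` argument. The inline CSV text below is a stand-in built from the first rows shown by `df.head()`, so the example runs without network access; `sample_csv` and `df_sample` are illustrative names, not part of the original notebook.

```python
import io

import pandas as pd

# Stand-in for the remote file: header plus the first three rows from df.head()
sample_csv = io.StringIO(
    "Date,Close,High,Low,Open,Volume\n"
    "2004-06-08,99.696106,101.100276,98.291935,98.291935,499150000\n"
    "2004-06-09,101.10025,102.50442,98.29191,99.69608,294290000\n"
    "2004-06-10,101.10025,101.10025,99.69608,101.10025,165590000\n"
)

# parse_dates converts the Date column to a datetime dtype while reading
df_sample = pd.read_csv(sample_csv, parse_dates=["Date"])

print(df_sample.dtypes)
```

With the dates parsed at load time, a later `pd.to_datetime` call on the column becomes unnecessary.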


In [4]:
df.describe()

,Close,High,Low,Open,Volume
count,5280.0,5280.0,5280.0,5280.0,5280.0
mean,3207.568572,3237.872289,3177.297276,3207.946579,107772700.0
std,2956.070745,2981.530692,2932.456985,2956.844978,127261900.0
min,99.696106,99.696106,98.29191,98.291914,0.0
25%,690.317444,695.616469,676.532157,688.42099,49978500.0
50%,2142.572021,2166.68158,2130.235765,2142.478958,74067000.0
75%,5417.664917,5459.961518,5377.56366,5421.718171,117840400.0
max,10570.414062,10570.414456,10401.480891,10522.147296,1949960000.0
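
The summary reports a minimum `Volume` of 0, so the data contains at least one day with no recorded trades. A minimal sketch for locating such rows with a boolean mask follows. The three rows are copied from the dataset (2004-08-16 to 2004-08-18), and `vol_sample` and `zero_volume` are illustrative names. Whether such rows should be kept or dropped is left open here.

```python
import pandas as pd

# Three consecutive rows copied from the dataset; the middle one has Volume == 0
vol_sample = pd.DataFrame(
    {
        "Date": pd.to_datetime(["2004-08-16", "2004-08-17", "2004-08-18"]),
        "Close": [114.886658, 114.886658, 118.077957],
        "Volume": [143450000, 0, 236070000],
    }
)

# Boolean mask selecting the rows with no traded volume
zero_volume = vol_sample[vol_sample["Volume"] == 0]

print(zero_volume)
```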


In [5]:
df.isnull().sum()

Date,0
Close,0
High,0
Low,0
Open,0
Volume,0


In [6]:
df.duplicated().sum()

np.int64(0)

# ***Feature Engineering***

In [7]:
# Copy the two columns so that later assignments modify an independent
# DataFrame rather than a slice of df (this avoids SettingWithCopyWarning).
df_close = df[['Date', 'Close']].copy()

# Window sizes (in rows, i.e. trading days) for the short and long moving averages
window_short = 10
window_long = 50

# The first (window - 1) rows of each rolling mean are NaN
df_close['MA_10'] = df_close['Close'].rolling(window=window_short).mean()
df_close['MA_50'] = df_close['Close'].rolling(window=window_long).mean()

df_close.head(10)


,Date,Close,MA_10,MA_50
0,2004-06-08,99.696106,,
1,2004-06-09,101.10025,,
2,2004-06-10,101.10025,,
3,2004-06-11,101.10025,,
4,2004-06-14,99.696106,,
5,2004-06-15,102.504425,,
6,2004-06-16,101.10025,,
7,2004-06-17,99.696106,,
8,2004-06-18,99.696106,,
9,2004-06-21,99.696106,100.538596,
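
The first non-missing `MA_10` value can be checked by hand: a 10-row window is only complete at the tenth row, and the mean of the ten `Close` values in the table above is about 100.5386. The sketch below reproduces that arithmetic; `close_sample` is an illustrative name holding values copied from the table.

```python
import pandas as pd

# The first ten Close values from the table above
close_sample = pd.Series(
    [
        99.696106, 101.10025, 101.10025, 101.10025, 99.696106,
        102.504425, 101.10025, 99.696106, 99.696106, 99.696106,
    ]
)

# Same call as in the notebook: NaN until the window holds 10 observations
ma_10 = close_sample.rolling(window=10).mean()

print(ma_10.isna().sum())        # number of rows with an incomplete window
print(round(ma_10.iloc[-1], 4))  # first complete 10-row average
```

The same reasoning explains why `MA_50` stays empty for the first 49 rows.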


In [8]:
# Day-over-day change in the closing price
delta = df_close['Close'].diff()

# Look-back period for the RSI
period = 14

# Split the changes into gains and losses, both as non-negative magnitudes
gain = delta.clip(lower=0)
loss = delta.clip(upper=0).abs()

# com=period-1 gives a smoothing factor alpha = 1/period
avg_gain = gain.ewm(com=period-1, min_periods=period).mean()
avg_loss = loss.ewm(com=period-1, min_periods=period).mean()
rs = avg_gain / avg_loss
rsi = 100 - (100 / (1 + rs))

# assign() returns a new DataFrame, so no SettingWithCopyWarning is raised
df_close = df_close.assign(RSI=rsi)

df_close.head(20)


,Date,Close,MA_10,MA_50,RSI
0,2004-06-08,99.696106,,,
1,2004-06-09,101.10025,,,
2,2004-06-10,101.10025,,,
3,2004-06-11,101.10025,,,
4,2004-06-14,99.696106,,,
5,2004-06-15,102.504425,,,
6,2004-06-16,101.10025,,,
7,2004-06-17,99.696106,,,
8,2004-06-18,99.696106,,,
9,2004-06-21,99.696106,100.538596,,
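
The RSI steps in the cell above can be wrapped into a reusable function. This is a sketch under the same assumptions as the notebook (exponentially weighted averages with `com=period-1`); the name `compute_rsi` and the synthetic input series are illustrative and not part of the original notebook. The two toy series show the boundary cases: a strictly rising series has no losses, so RS is infinite and the RSI is 100, while a strictly falling series has no gains and yields 0.

```python
import pandas as pd


def compute_rsi(close: pd.Series, period: int = 14) -> pd.Series:
    """RSI from closing prices, mirroring the steps in the notebook cell."""
    delta = close.diff()
    gain = delta.clip(lower=0)        # positive changes, zeros elsewhere
    loss = delta.clip(upper=0).abs()  # magnitudes of negative changes
    avg_gain = gain.ewm(com=period - 1, min_periods=period).mean()
    avg_loss = loss.ewm(com=period - 1, min_periods=period).mean()
    rs = avg_gain / avg_loss
    return 100 - (100 / (1 + rs))


# Synthetic inputs (not market data): one rising and one falling series
rising = pd.Series(range(1, 31), dtype="float64")
falling = pd.Series(range(30, 0, -1), dtype="float64")

rsi_rising = compute_rsi(rising).dropna()
rsi_falling = compute_rsi(falling).dropna()

print(rsi_rising.unique(), rsi_falling.unique())
```

Because `min_periods=period` is set, the leading rows of the result are NaN until enough price changes have accumulated, which is why they are dropped before inspecting the values.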


In [9]:
# Parse the dates and use them as the index of both frames.
# assign() returns a new DataFrame, so no SettingWithCopyWarning is raised.
df_close = df_close.assign(Date=pd.to_datetime(df_close['Date'])).set_index('Date')
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')

# Keep only the dates present in both frames
common_dates = df.index.intersection(df_close.index)
df = df.loc[common_dates]
df_close = df_close.loc[common_dates]

# Day of the week as an integer: Monday=0 ... Sunday=6
df_close['day_of_week'] = df_close.index.dayofweek

# One-hot encode the weekday into a separate frame; df_close keeps day_of_week
df_encoded_days = pd.get_dummies(df_close['day_of_week'], prefix='Day')
df_final_features = df_close.join(df_encoded_days)
df_final_features = df_final_features.drop('day_of_week', axis=1)

df_close.head()


Date,Close,MA_10,MA_50,RSI,day_of_week
2004-06-08,99.696106,,,,1
2004-06-09,101.10025,,,,2
2004-06-10,101.10025,,,,3
2004-06-11,101.10025,,,,4
2004-06-14,99.696106,,,,0


In [10]:
df_clean = df_close.dropna()

In [11]:
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 5231 entries, 2004-08-16 to 2025-10-29
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Close        5231 non-null   float64
 1   MA_10        5231 non-null   float64
 2   MA_50        5231 non-null   float64
 3   RSI          5231 non-null   float64
 4   day_of_week  5231 non-null   int32  
dtypes: float64(4), int32(1)
memory usage: 224.8 KB


In [12]:
df_clean.head(20)

Date,Close,MA_10,MA_50,RSI,day_of_week
2004-08-16,114.886658,118.07794,113.063786,41.512255,0
2004-08-17,114.886658,117.599245,113.367597,41.512255,1
2004-08-18,118.077957,117.280118,113.707151,53.687885,2
2004-08-19,118.077957,116.960991,114.046706,53.687885,3
2004-08-20,116.482277,116.641857,114.354346,47.904875,4
2004-08-23,116.482277,116.641857,114.69007,47.904875,0
2004-08-24,116.482277,116.641857,114.969627,47.904875,1
2004-08-25,114.886658,116.322727,115.245355,42.22447,2
2004-08-26,114.886658,116.163165,115.549166,42.22447,3
2004-08-27,116.482277,116.163165,115.884889,49.209262,4


In [13]:
df_clean = df_close.join(df[['High', 'Low', 'Open', 'Volume']])
df_clean = df_clean.dropna()
df_clean.head(50)

Date,Close,MA_10,MA_50,RSI,day_of_week,High,Low,Open,Volume
2004-08-16,114.886658,118.07794,113.063786,41.512255,0,116.482306,114.886658,116.482306,143450000
2004-08-17,114.886658,117.599245,113.367597,41.512255,1,114.886658,114.886658,114.886658,0
2004-08-18,118.077957,117.280118,113.707151,53.687885,2,118.077957,116.482309,116.482309,236070000
2004-08-19,118.077957,116.960991,114.046706,53.687885,3,118.077957,116.482309,118.077957,54470000
2004-08-20,116.482277,116.641857,114.354346,47.904875,4,118.077925,116.482277,116.482277,6340000
2004-08-23,116.482277,116.641857,114.69007,47.904875,0,118.077925,116.482277,116.482277,34770000
2004-08-24,116.482277,116.641857,114.969627,47.904875,1,116.482277,114.886629,116.482277,343910000
2004-08-25,114.886658,116.322727,115.245355,42.22447,2,116.482306,113.29101,116.482306,20510000
2004-08-26,114.886658,116.163165,115.549166,42.22447,3,114.886658,113.29101,113.29101,26650000
2004-08-27,116.482277,116.163165,115.884889,49.209262,4,116.482277,114.886629,114.886629,152960000


In [14]:
df_clean = pd.get_dummies(df_clean, columns=['day_of_week'], prefix='Day')
df_clean.head()

Date,Close,MA_10,MA_50,RSI,High,Low,Open,Volume,Day_0,Day_1,Day_2,Day_3,Day_4
2004-08-16,114.886658,118.07794,113.063786,41.512255,116.482306,114.886658,116.482306,143450000,True,False,False,False,False
2004-08-17,114.886658,117.599245,113.367597,41.512255,114.886658,114.886658,114.886658,0,False,True,False,False,False
2004-08-18,118.077957,117.280118,113.707151,53.687885,118.077957,116.482309,116.482309,236070000,False,False,True,False,False
2004-08-19,118.077957,116.960991,114.046706,53.687885,118.077957,116.482309,118.077957,54470000,False,False,False,True,False
2004-08-20,116.482277,116.641857,114.354346,47.904875,118.077925,116.482277,116.482277,6340000,False,False,False,False,True
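
The `Day_*` columns above are boolean. If 0/1 integers are preferred (an assumption about downstream use, since the notebook does not say how the features are consumed), `pd.get_dummies` accepts a `dtype` argument. A minimal sketch on a toy frame follows; `toy` and `toy_encoded` are illustrative names, and the three rows are copied from the table above.

```python
import pandas as pd

# Three rows copied from the table above, indexed by date
toy = pd.DataFrame(
    {"Close": [114.886658, 114.886658, 118.077957]},
    index=pd.to_datetime(["2004-08-16", "2004-08-17", "2004-08-18"]),
)

# Monday=0, Tuesday=1, Wednesday=2
toy["day_of_week"] = toy.index.dayofweek

# dtype=int produces 0/1 integer columns instead of True/False
toy_encoded = pd.get_dummies(toy, columns=["day_of_week"], prefix="Day", dtype=int)

print(toy_encoded)
```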
