**Your tasks this week:**
1. Write a function to load and process a dataset with multiple features that meets the following requirements:
  - This function will allow you to specify the start date and the end date for the whole dataset as inputs.
  - This function will allow you to deal with the NaN issue in the data.
  - This function will also allow you to use different methods to split the data into train/test data; e.g. you can split it according to some specified ratio of train/test and you can specify to split it by date or randomly.
  - This function will have the option to store the downloaded data on your local machine for future use and to load the data locally to save time.
  - This function will also allow you to have an option to scale your feature columns and store the scalers in a data structure to allow future access to these scalers.
2. Most of the above requirements have already been fulfilled by the code in the project (P1). Feel free to learn from it, but you will have to explain what that code does using detailed comments (the same way we commented the code in v0.1).
3. Upload your Task 2 Report (as a PDF file) to the project Wiki before the deadline and email your project leader to let them know it is ready for viewing and feedback.

# Enable GPU

In [1]:
!nvidia-smi

Fri Aug 25 08:47:42 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   62C    P8    11W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

# Install Dependencies

In [2]:
!pip install -q numpy
!pip install -q matplotlib
!pip install -q pandas
!pip install -q tensorflow
!pip install -q scikit-learn
!pip install -q pandas-datareader
!pip install -q yfinance
!pip install --upgrade mplfinance

# Code Source Note: https://github.com/twopirllc/pandas-ta
!pip install -q pandas_ta

Collecting mplfinance
  Downloading mplfinance-0.12.10b0-py3-none-any.whl (75 kB)
Installing collected packages: mplfinance
Successfully installed mplfinance-0.12.10b0
  Preparing metadata (setup.py) ... done
  Building wheel for pandas_ta (setup.py) ... done


# Connecting Drive

In [3]:
import os
import sys
from google.colab import drive
drive.mount('/content/drive/')

# Set the working directory for the tasks
SKELETON_DIR = '/content/drive/MyDrive/stock-prediction/DataProcessing1'
os.chdir(SKELETON_DIR)

Mounted at /content/drive/


# Import Dependencies

In [4]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import pandas_ta as ta
import pandas_datareader as web
import datetime as dt
import tensorflow as tf
import yfinance as yf
import mplfinance as mpf

from sklearn.preprocessing import MinMaxScaler
from sklearn import metrics
from sklearn.model_selection import train_test_split

from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import Dense, Dropout, LSTM, InputLayer, Input, Activation
from tensorflow.keras.utils import plot_model

# Set up Variables

In [116]:
start='2015-01-01'
end='2023-08-25'
ticker='TSLA'

# Price Value
price_value = 'Close' # This can be changed to 'Open', 'Close', 'Adj Close', 'High', or 'Low'

# Split Dataset for Training/Testing
split_ratio=0.8

# Number of look back days to base the prediction
step_size = 30 # Can be changed

# Directory
DATA_DIR = os.path.join(SKELETON_DIR, "data")
PREPARED_DATA_DIR = os.path.join(SKELETON_DIR, "prepared-data")

# File Path
CSV_FILE = os.path.join(DATA_DIR, f"RawData-from-{start}to-{end}-{ticker}_stock_data.csv")
PREPARED_DATA_FILE = os.path.join(PREPARED_DATA_DIR, f"PreparedData-from-{start}to-{end}-{ticker}_stock_data.csv")
PREPARED_TRAIN = os.path.join(PREPARED_DATA_DIR, f"{ticker}_prepared_data.npz")

# ensure_directory_exists
**The function `ensure_directory_exists` takes the following parameter:**

1. `dir_path`: This parameter is a string representing the path of the directory you want to ensure exists.

**This function has the following features:**

1. **Checking and Creating Directory:** The primary purpose of this function is to ensure that a specified directory exists. It checks if the directory at the provided `dir_path` exists using the `os.path.isdir` function. If the directory does not exist, it creates the directory using the `os.mkdir` function.

In [6]:
# Double check directory
def ensure_directory_exists(dir_path):
  # If directory not exist => create
  if not os.path.isdir(dir_path):
      os.mkdir(dir_path)
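
`os.mkdir` creates only the last component of a path, so it raises `FileNotFoundError` when a parent folder is missing. The following is a minimal sketch of a variant that also creates missing parents, assuming that behaviour is wanted; the name `ensure_directory_tree` is hypothetical and not part of the project code.

```python
import os
import tempfile


def ensure_directory_tree(dir_path):
    """Create dir_path and any missing parents; do nothing if it already exists."""
    os.makedirs(dir_path, exist_ok=True)


# Usage: a nested path inside a throwaway temporary directory
base_dir = tempfile.mkdtemp()
nested_dir = os.path.join(base_dir, "stock-prediction", "data")
ensure_directory_tree(nested_dir)
ensure_directory_tree(nested_dir)  # the second call is a no-op
nested_exists = os.path.isdir(nested_dir)
```

With `exist_ok=True` the explicit `os.path.isdir` check is no longer needed, because an existing directory is not treated as an error.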

# load_data
**The function `load_data` takes several parameters:**

1. `start` and `end`: These are date values in the format 'YYYY-MM-DD' that define the time range for the financial data you want to load.

2. `ticker`: This parameter is a string representing the stock ticker symbol (e.g., 'AAPL' for Apple Inc.) for which you want to fetch the financial data.

3. `source`: This parameter is optional and defaults to 'yahoo' (Yahoo Finance). The function body does not currently use it; data is always downloaded through the `yfinance` library.

**This function has the following features:**

1. **Creating a Directory:** The function first ensures that a directory exists to hold the financial data. If the directory doesn't exist, it creates one using the path defined in the `DATA_DIR` variable.

2. **Checking for Existing Data:** The function then checks if the financial data already exists by looking for a CSV file at the path specified in the `CSV_FILE` variable. If the file is found, the function assumes the data has already been downloaded or loaded and reads it from the CSV using the Pandas library.

3. **Downloading and Saving Data:** If the CSV file containing the financial data doesn't exist, the function assumes that the data needs to be fetched. It calls `yf.download` from the `yfinance` library, which retrieves data from Yahoo Finance regardless of the `source` argument, for the given stock ticker and time range. The `progress=False` argument suppresses the progress bar during the download. The downloaded data is then saved to a new CSV file using the `to_csv` method.

In [7]:
# Load Raw Data
def load_data(start, end, ticker, source='yahoo'):
  ensure_directory_exists(DATA_DIR)

  # Check if CSV file exists
  # If exist => load
  # If not exist => download
  if os.path.exists(CSV_FILE):
      print('Loading Existing Data')
      data = pd.read_csv(CSV_FILE)
  else:
      print('Downloading Data')
      data = yf.download(ticker, start, end, progress=False)
      data.to_csv(CSV_FILE)

  return data
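
The two branches of `load_data` can return differently shaped frames: a fresh download is indexed by date, while `pd.read_csv(CSV_FILE)` returns `Date` as an ordinary column with a new integer index. The sketch below shows one way to make the cached read match the downloaded frame. The helper `load_or_fetch` and the stand-in `fake_fetch` are hypothetical, and the toy values are made up so that the example runs without network access.

```python
import os
import tempfile

import pandas as pd


def load_or_fetch(csv_path, fetch):
    """Return the cached CSV if present; otherwise call fetch() and cache the result."""
    if os.path.exists(csv_path):
        # index_col/parse_dates restore the Date index that to_csv wrote out
        return pd.read_csv(csv_path, index_col="Date", parse_dates=True)
    data = fetch()
    data.to_csv(csv_path)
    return data


def fake_fetch():
    # Hypothetical stand-in for a real download; the values are made up
    index = pd.to_datetime(["2015-01-02", "2015-01-05"])
    index.name = "Date"
    return pd.DataFrame({"Close": [1.0, 2.0]}, index=index)


csv_path = os.path.join(tempfile.mkdtemp(), "toy.csv")
first = load_or_fetch(csv_path, fake_fetch)   # cache miss: fetches and writes the CSV
second = load_or_fetch(csv_path, fake_fetch)  # cache hit: reads the CSV back
```

Passing the download step in as a function keeps the caching logic testable on its own; in the notebook that function would wrap the `yf.download` call.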

# data_validation
**The function `data_validation` takes several parameters:**

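1. `start` and `end`: These are date values in the format 'YYYY-MM-DD' that define the time range for the financial data you want to validate and process. The function body does not read them directly; the file paths it uses (`CSV_FILE` and `PREPARED_DATA_FILE`) are built from the global `start`, `end`, and `ticker` variables.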

2. `ticker`: This parameter is a string representing the stock ticker symbol (e.g., 'AAPL' for Apple Inc.) for which you intend to validate and preprocess the financial data.

**This function has the following features:**

1. **Creating a Directory:** The function first ensures that a directory exists to hold the prepared data. If the directory doesn't exist, it creates one using the path defined in the `PREPARED_DATA_DIR` variable.

2. **Checking for Existing Data:** The function then checks if prepared data already exists by looking for a CSV file at the path specified in the `PREPARED_DATA_FILE` variable. If the file is found, the function assumes the data has already been processed and loads it from the CSV using the Pandas library.

3. **Processing Raw Data:** If the prepared data CSV file doesn't exist, the function assumes that the raw data needs to be processed. It reads the raw financial data from a CSV file located at the path specified in the `CSV_FILE` variable. Then, the function applies several preprocessing steps to this raw data:

   - **Adding Indicators:** The function adds technical indicators computed from the 'Close' column with the `pandas_ta` library (imported as `ta`): a 15-period Relative Strength Index (RSI) and three Exponential Moving Averages with lengths 20, 100, and 150 (EMAF, EMAM, EMAS).

   - **Calculating Targets:** The function computes each day's 'Target' as the adjusted closing price minus the opening price, then shifts the column up by one row with `shift(-1)` so that each row holds the next trading day's movement.

   - **Creating Target Class:** The function generates a binary 'TargetClass' column based on whether the 'Target' is greater than zero, indicating a positive change.

   - **TargetNextClose:** The function creates a 'TargetNextClose' column by shifting the 'Adj Close' column one step back.

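   - **Handling Missing Data:** The function removes every row that contains a NaN (missing) value. Here these are mainly the earliest rows, where the longer moving averages are not yet defined, and the final row, whose shifted targets have no following day.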

   - **Dropping Columns:** Optionally, there are commented-out lines to drop certain columns like 'Volume', 'Close', and 'Date'. You can uncomment these lines if you want to remove these columns from the final processed data.

   - **Exporting Prepared Data:** Once all the preprocessing steps are complete, the function saves the processed data to a new CSV file using the `to_csv` method. This ensures that the next time the function is called, it can load the already processed data directly from the CSV file without repeating the preprocessing steps.

In [161]:
# Data Validation
def data_validation(start, end, ticker):
  ensure_directory_exists(PREPARED_DATA_DIR)


  if os.path.exists(PREPARED_DATA_FILE):
      print('Loading Prepared Data')
      df = pd.read_csv(PREPARED_DATA_FILE)
  else:
      print('Processing Raw Data')

      # Read Raw Data File
      df = pd.read_csv(CSV_FILE)

      # Adding indicators
      df['RSI']=ta.rsi(df.Close, length=15)
      df['EMAF']=ta.ema(df.Close, length=20)
      df['EMAM']=ta.ema(df.Close, length=100)
      df['EMAS']=ta.ema(df.Close, length=150)

      df['Target'] = df['Adj Close']-df.Open
      df['Target'] = df['Target'].shift(-1)

      df['TargetClass'] = (df['Target'] > 0).astype(int)

      df['TargetNextClose'] = df['Adj Close'].shift(-1)

      # Drop NaN issue in data
      df.dropna(inplace=True)

      # Drop Columns
      # df.drop(['Volume','Close', 'Date'], axis=1, inplace=True)

      # Export Prepared Data
      df.to_csv(PREPARED_DATA_FILE, index=False)

  return df
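
The effect of `shift(-1)` and `dropna` on the target columns is easier to see on a small frame. The sketch below repeats only the target-building steps on four made-up rows; it leaves out the `pandas_ta` indicators, and the prices are invented for illustration.

```python
import pandas as pd

# Toy prices; the numbers are made up for illustration
df = pd.DataFrame({
    "Open":      [10.0, 11.0, 12.0, 13.0],
    "Adj Close": [10.5, 10.8, 12.4, 12.9],
})

# Intraday move of each day, pulled up one row so row t holds day t+1's move
df["Target"] = (df["Adj Close"] - df["Open"]).shift(-1)
# 1 when the next day's move is positive, else 0 (NaN compares as False)
df["TargetClass"] = (df["Target"] > 0).astype(int)
# Next day's adjusted close, aligned with the current row
df["TargetNextClose"] = df["Adj Close"].shift(-1)

rows_before = len(df)
df = df.dropna()  # the last row has no "next day", so it is removed
rows_after = len(df)
```

The last row is dropped because its `Target` and `TargetNextClose` are NaN, which is the same reason the prepared data ends one trading day before the raw data.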

# split_data
**The function `split_data` takes the following parameters:**

1. `df`: This parameter is a DataFrame containing the financial data that you want to split.
   
2. `split_ratio`: This is the ratio of data to be used for training. The rest will be used for testing. For example, a `split_ratio` of 0.8 would mean 80% of the data is used for training and 20% for testing.

3. `split_by_date`: This is an optional boolean parameter (defaulting to `True`) that indicates whether you want to split the data by date or randomly.

**This function has the following features:**

1. **Splitting Data by Date:** If `split_by_date` is set to `True`, the function computes the split index as `int(len(df) * split_ratio)` and slices the DataFrame by row position: the rows before the index become `train_data` and the remaining rows become `test_data`. This is a split by date only if the rows are already sorted chronologically, as they are in the prepared data.

2. **Splitting Data Randomly:** If `split_by_date` is set to `False`, the function uses the `train_test_split` function from the scikit-learn library to randomly split the DataFrame into training and testing sets according to the specified `split_ratio`. The `random_state=42` ensures reproducibility of the random split.

3. **Printing Shapes:** After splitting, the function prints the shapes of the training and testing data, indicating how many rows and columns each set contains.


In [160]:
# Split Data by Date or Randomly
def split_data(df, split_ratio, split_by_date=True):
    if split_by_date:
        # Split by date
        train_size = int(len(df) * split_ratio)
        train_data = df.iloc[:train_size]
        test_data = df.iloc[train_size:]
    else:
        # Split Randomly
        train_data, test_data = train_test_split(df, test_size=1-split_ratio, random_state=42)

    print(f"Train Data Shape: {train_data.shape}")
    print(f"Test Data Shape: {test_data.shape}")

    return train_data, test_data
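
To make the difference between the two modes concrete, the sketch below splits ten dated toy rows both ways. It uses `DataFrame.sample` as a stand-in for scikit-learn's `train_test_split` so that the example depends only on pandas; the toy frame and variable names are assumptions made for illustration.

```python
import pandas as pd

# Ten dated rows standing in for the prepared DataFrame
df = pd.DataFrame({
    "Date": pd.date_range("2023-01-02", periods=10, freq="D"),
    "Close": range(10),
})
split_ratio = 0.8

# Chronological split: the first 80% of rows train, the rest test
train_size = int(len(df) * split_ratio)
train_by_date = df.iloc[:train_size]
test_by_date = df.iloc[train_size:]

# Random split: rows are drawn without regard to order
# (DataFrame.sample is a stand-in for train_test_split here)
train_random = df.sample(frac=split_ratio, random_state=42)
test_random = df.drop(train_random.index)

# Only the chronological split guarantees that all training dates come first
dates_ordered = train_by_date["Date"].max() < test_by_date["Date"].min()
```

Both modes produce the same set sizes; the difference is that a random split lets test rows fall between training rows in time.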

# scaler_features
**The function `scaler_features` takes the following parameters:**

1. `input_data`: This parameter represents the data that you want to scale. It can be a pandas DataFrame or Series, or a 1D or 2D numpy array.

2. `scale`: This is an optional boolean parameter (defaulting to `True`) that indicates whether you want to scale the data.

**This function has the following features:**

1. **Scaling Data:** If the `scale` parameter is set to `True`, the function creates an instance of the `MinMaxScaler` from scikit-learn. The `feature_range` parameter sets the range to which the data will be scaled (between 0 and 1).

2. **Reshaping Data:** Before scaling, the function checks if the input data has a shape of 1 dimension (i.e., it's a Series or 1D numpy array). If so, it reshapes the data into a 2D array with one column using the `.reshape(-1, 1)` method. This is necessary because scikit-learn's scaler expects a 2D input.

3. **Scaling and Transforming Data:** The function then uses the scaler to fit and transform the input data, resulting in scaled data. This scaled data is returned along with the scaler instance.

4. **Not Scaling Data:** If the `scale` parameter is set to `False`, the function simply returns the original input data as-is, without any scaling. In this case, the scaler instance returned is `None`.

In [10]:
# Scaler
def scaler_features(input_data, scale=True):
    if scale:
        scaler = MinMaxScaler(feature_range=(0, 1))

        # Reshape a Series or 1D numpy array into a single column
        # (np.asarray handles both; a numpy array has no .values attribute)
        if len(input_data.shape) == 1:
            input_data = np.asarray(input_data).reshape(-1, 1)

        scaled_data = scaler.fit_transform(input_data)
        return scaled_data, scaler
    else:
        return input_data, None
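
The arithmetic behind min-max scaling to the range 0 to 1 is `(x - min) / (max - min)`, with `min` and `max` taken from the data the scaler was fitted on. The sketch below writes that out by hand in numpy as a stand-in for `MinMaxScaler`, and keeps the fitted parameters in a dictionary keyed by column name. The dictionary layout and the helper names `scale` and `unscale` are assumptions made for illustration, not the project's API.

```python
import numpy as np

train_close = np.array([10.0, 20.0, 30.0])
test_close = np.array([25.0, 40.0])

# "Fit": remember the training minimum and maximum, keyed by column name
scalers = {"Close": {"min": train_close.min(), "max": train_close.max()}}


def scale(values, params):
    """Map values to [0, 1] relative to the fitted min and max."""
    return (values - params["min"]) / (params["max"] - params["min"])


def unscale(scaled, params):
    """Invert scale(), recovering the original units."""
    return scaled * (params["max"] - params["min"]) + params["min"]


scaled_train = scale(train_close, scalers["Close"])  # 0.0, 0.5, 1.0
scaled_test = scale(test_close, scalers["Close"])    # 0.75, 1.5 (above the training max)
restored_test = unscale(scaled_test, scalers["Close"])
```

Test values can land outside 0 to 1 when they exceed the training range. Keeping the fitted parameters around is what allows scaled predictions to be converted back to prices later.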


# create_datasets
**The function `create_datasets` takes the following parameters:**

1. `start` and `end`: These are date values in the format 'YYYY-MM-DD' that define the time range for the financial data you want to work with.

2. `ticker`: This parameter is a string representing the stock ticker symbol (e.g., 'AAPL' for Apple Inc.) for which you want to create datasets.

**This function has the following features:**

1. **Downloading or Loading Raw Data:** The function first calls the `load_data` function to download or load the raw financial data using the provided time range and stock ticker.

2. **Data Validation:** The function then calls the `data_validation` function to validate and preprocess the raw data, resulting in a processed DataFrame (`df`).

3. **Splitting Data:** The function calls the `split_data` function to split the processed DataFrame into training and testing datasets (`train_data` and `test_data`) based on the specified `split_ratio`.

4. **Defining Features and Target:** The feature columns (`feature_columns`) and the target column (`target_column`) are defined for the dataset.

5. **Preparing Train Datasets:** For the training dataset:
   
   - A feature scaler (`train_feature_scaler`) is fitted on the feature columns of the training data and used to scale them.
   
   - A target scaler (`train_target_scaler`) is fitted on the target values of the training data and used to scale them.
   
   - Sequences of features and corresponding scaled target values are created for training. These sequences are prepared for use in training a sequence-based model like an LSTM.

6. **Preparing Test Datasets:** For the testing dataset:

   - The feature scaler fitted on the training data (`train_feature_scaler`) is reused, without refitting, to scale the feature columns of the testing data.
   
   - The target scaler fitted on the training data (`train_target_scaler`) is reused in the same way to scale the target values of the testing data.
   
   - Similar to the training dataset, sequences of features and corresponding scaled target values are created for testing.

7. **Saving Prepared Train Data:** The prepared training sequences (`x_train` and `y_train`) are saved to a `.npz` file using the `np.savez` function.

8. **Returning Prepared Data and Information:** The function returns a variety of data and objects, including the raw data, the processed DataFrame, the split training and testing datasets, the feature and target scalers used for scaling, and the prepared training and testing sequences.

In [11]:
def create_datasets(start, end, ticker):
    # Download or Load Raw Data
    data = load_data(start, end, ticker)

    # Data Validation
    df = data_validation(start, end, ticker)

    # Split Data
    train_data, test_data = split_data(df, split_ratio)

    # Define features and target
    feature_columns = ['Open', 'High', 'Low', 'RSI', 'EMAF', 'EMAM', 'EMAS']
    target_column = 'TargetNextClose'

    # Preparing Train Datasets
    # Scaler for features
    scaled_data_train, train_feature_scaler = scaler_features(train_data[feature_columns])
    # Scaler for target
    scaled_target_train, train_target_scaler = scaler_features(train_data[target_column].values.reshape(-1, 1))

    x_train, y_train = [], []
    for i in range(step_size, len(scaled_data_train)):
        x_train.append(scaled_data_train[i-step_size:i])
        y_train.append(scaled_target_train[i])

    x_train, y_train = np.array(x_train), np.array(y_train)

    # Preparing Test Datasets
    # Use the feature scaler to scale the test data
    scaled_data_test = train_feature_scaler.transform(test_data[feature_columns])
    # Use the target scaler to scale the test target
    scaled_target_test = train_target_scaler.transform(test_data[target_column].values.reshape(-1, 1))

    x_test, y_test = [], []
    for i in range(step_size, len(scaled_data_test)):
        x_test.append(scaled_data_test[i-step_size:i])
        y_test.append(scaled_target_test[i])

    x_test, y_test = np.array(x_test), np.array(y_test)

    np.savez(PREPARED_TRAIN, x_train=x_train, y_train=y_train)

    return data, df, train_data, test_data, train_feature_scaler, train_target_scaler, x_train, x_test, y_train, y_test
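
The arrays saved with `np.savez` can be read back with `np.load`, using the keyword names passed at save time as keys. The sketch below shows the round trip with placeholder arrays and a temporary file; the path and array contents are assumptions made for illustration.

```python
import os
import tempfile

import numpy as np

# Small arrays shaped like (samples, step_size, features) and (samples, 1)
x_train = np.zeros((5, 30, 7))
y_train = np.ones((5, 1))

npz_path = os.path.join(tempfile.mkdtemp(), "toy_prepared_data.npz")
np.savez(npz_path, x_train=x_train, y_train=y_train)

# np.load returns a lazy archive; the keys are the keyword names passed to savez
with np.load(npz_path) as archive:
    x_loaded = archive["x_train"]
    y_loaded = archive["y_train"]
```

Only the training sequences are saved in the notebook; the test sequences could be stored in the same archive by adding two more keyword arguments.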


# Run Code

In [162]:
data, df, train_data, test_data, train_feature_scaler, train_target_scaler, x_train, x_test, y_train, y_test = create_datasets(start, end, ticker)

Loading Existing Data
Loading Prepared Data
Train Data Shape: (1620, 14)
Test Data Shape: (406, 14)


In [118]:
print("Data shapes/types:")
print("data:", type(data))
print("df:", type(df))
print("train_data:", train_data.shape)
print("test_data:", test_data.shape)
print("train_feature_scaler:", type(train_feature_scaler))
print("train_target_scaler:", type(train_target_scaler))
print("x_train:", x_train.shape)
print("x_test:", x_test.shape)
print("y_train:", y_train.shape)
print("y_test:", y_test.shape)

Data shapes/types:
data: <class 'pandas.core.frame.DataFrame'>
df: <class 'pandas.core.frame.DataFrame'>
train_data: (1620, 14)
test_data: (406, 14)
train_feature_scaler: <class 'sklearn.preprocessing._data.MinMaxScaler'>
train_target_scaler: <class 'sklearn.preprocessing._data.MinMaxScaler'>
x_train: (1590, 30, 7)
x_test: (376, 30, 7)
y_train: (1590, 1)
y_test: (376, 1)
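
The sequence shapes follow from the sliding-window loop: each sample takes `step_size` consecutive feature rows, so a split with `n` rows yields `n - step_size` samples. The sketch below reproduces the loop on placeholder arrays with the training split's dimensions; the helper name `make_windows` is hypothetical.

```python
import numpy as np


def make_windows(features, targets, step_size):
    """Pair each run of step_size feature rows with the target stored at the next index."""
    x, y = [], []
    for i in range(step_size, len(features)):
        x.append(features[i - step_size:i])  # rows i-step_size .. i-1
        y.append(targets[i])                 # target stored at row i
    return np.array(x), np.array(y)


# Placeholder arrays with the same dimensions as the training split
n_rows, n_features, step_size = 1620, 7, 30
features = np.zeros((n_rows, n_features))
targets = np.zeros((n_rows, 1))

x, y = make_windows(features, targets, step_size)
```

With 1620 rows and a 30-row window this gives 1590 samples, and the 406-row test split gives 376 in the same way.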


In [166]:
# Raw Data
print(len(data))
data.head(3)

2176


Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
0,2015-01-02,14.858,14.883333,14.217333,14.620667,14.620667,71466000
1,2015-01-05,14.303333,14.433333,13.810667,14.006,14.006,80527500
2,2015-01-06,14.004,14.28,13.614,14.085333,14.085333,93928500


In [167]:
# Raw Data
print(len(data))
data.tail(3)

2176


Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
2173,2023-08-22,240.25,240.820007,229.550003,233.190002,233.190002,130597900
2174,2023-08-23,229.339996,238.979996,229.289993,236.860001,236.860001,101077600
2175,2023-08-24,238.660004,238.919998,228.179993,230.039993,230.039993,99519300


In [168]:
# Validated Data
print(len(df))
df.head(3)

2026


Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,RSI,EMAF,EMAM,EMAS,Target,TargetClass,TargetNextClose
0,2015-08-06,16.636,17.0,15.741333,16.408667,16.408667,219357000,40.237745,17.562829,16.2451,15.20076,-0.071333,0,16.167334
1,2015-08-07,16.238667,16.248667,15.892667,16.167334,16.167334,76101000,38.5985,17.429924,16.243561,15.213562,0.199333,1,16.076
2,2015-08-10,15.876667,16.198,15.736667,16.076,16.076,62788500,37.971249,17.300979,16.240243,15.224985,0.014667,1,15.824667


In [169]:
# Validated Data
print(len(df))
df.tail(3)

2026


Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,RSI,EMAF,EMAM,EMAS,Target,TargetClass,TargetNextClose
2023,2023-08-21,221.550003,232.130005,220.580002,231.279999,231.279999,135702700,41.132038,243.068776,230.59096,223.752151,-7.059998,0,233.190002
2024,2023-08-22,240.25,240.820007,229.550003,233.190002,233.190002,130597900,42.417038,242.127941,230.642426,223.877156,7.520004,1,236.860001
2025,2023-08-23,229.339996,238.979996,229.289993,236.860001,236.860001,101077600,44.893445,241.626232,230.765546,224.049114,-8.62001,0,230.039993


In [170]:
# Train Data
print(len(train_data))
train_data.head(3)

1620


Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,RSI,EMAF,EMAM,EMAS,Target,TargetClass,TargetNextClose
0,2015-08-06,16.636,17.0,15.741333,16.408667,16.408667,219357000,40.237745,17.562829,16.2451,15.20076,-0.071333,0,16.167334
1,2015-08-07,16.238667,16.248667,15.892667,16.167334,16.167334,76101000,38.5985,17.429924,16.243561,15.213562,0.199333,1,16.076
2,2015-08-10,15.876667,16.198,15.736667,16.076,16.076,62788500,37.971249,17.300979,16.240243,15.224985,0.014667,1,15.824667


In [171]:
# Train Data
print(len(train_data))
train_data.tail(3)

1620


Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,RSI,EMAF,EMAM,EMAS,Target,TargetClass,TargetNextClose
1617,2022-01-06,359.0,362.666656,340.166656,354.899994,354.899994,90336600,51.079304,354.262754,316.084408,295.390988,-17.803314,0,342.320007
1618,2022-01-07,360.123322,360.309998,336.666656,342.320007,342.320007,84164700,47.672453,353.12535,316.603925,296.012565,19.373322,1,352.706665
1619,2022-01-10,333.333344,353.033325,326.666656,352.706665,352.706665,91815000,50.58787,353.085475,317.318831,296.76348,3.57666,1,354.799988


In [172]:
# Test Data
print(len(test_data))
test_data.head(3)

406


Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,RSI,EMAF,EMAM,EMAS,Target,TargetClass,TargetNextClose
1620,2022-01-11,351.223328,358.616669,346.273346,354.799988,354.799988,66063300,51.175269,353.248762,318.061032,297.532176,9.123322,1,368.73999
1621,2022-01-12,359.616669,371.613342,357.529999,368.73999,368.73999,83739000,54.992715,354.724117,319.064576,298.475325,-25.83667,0,343.853333
1622,2022-01-13,369.690002,371.866669,342.179993,343.853333,343.853333,97209900,47.838278,353.688804,319.555442,299.076359,9.910004,1,349.869995


In [173]:
# Test Data
print(len(test_data))
test_data.tail(3)

406


Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume,RSI,EMAF,EMAM,EMAS,Target,TargetClass,TargetNextClose
2023,2023-08-21,221.550003,232.130005,220.580002,231.279999,231.279999,135702700,41.132038,243.068776,230.59096,223.752151,-7.059998,0,233.190002
2024,2023-08-22,240.25,240.820007,229.550003,233.190002,233.190002,130597900,42.417038,242.127941,230.642426,223.877156,7.520004,1,236.860001
2025,2023-08-23,229.339996,238.979996,229.289993,236.860001,236.860001,101077600,44.893445,241.626232,230.765546,224.049114,-8.62001,0,230.039993


In [180]:
# Checking Ratio
print(f'Actual Ratio: {split_ratio}')
print(f'Train Ratio: {len(train_data)/len(df)}')
print(f'Test Ratio: {len(test_data)/len(df)}')

Actual Ratio: 0.8
Train Ratio: 0.7996051332675223
Test Ratio: 0.20039486673247778
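
The train ratio falls slightly short of 0.8 because the split index is truncated to a whole number of rows. A short worked check, using the row count reported above:

```python
n_rows = 2026       # rows in the validated DataFrame
split_ratio = 0.8

train_size = int(n_rows * split_ratio)  # 1620.8 truncates to 1620
test_size = n_rows - train_size         # 406

train_ratio = train_size / n_rows
test_ratio = test_size / n_rows
```

The two ratios still sum to 1, and the gap from the requested 0.8 is less than one row's worth of data.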


In [159]:
del data, df, train_data, test_data, train_feature_scaler, train_target_scaler, x_train, x_test, y_train, y_test