# Initial scoping

This notebook will simply train a regression algorithm to predict stock performance versus the general market.

## Framing the Problem
This project will be used to complement my personal, manual investment analysis workflow. Its outputs will be an additional source of data that I can use when making decisions about my investments.

### Model
This will be a supervised regression model. It will use numerous input features and can thus be considered a multiple regression problem. It will also predict a single value, and is thus a univariate regression problem. The model will be an offline model: it is trained first, saved locally, and then used to make predictions. The model and dataset are small enough that training can be repeated easily at negligible time cost.
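The train, save, load, and predict cycle described above can be sketched as follows. This is only an illustration under stated assumptions: the data is synthetic, ordinary least squares via NumPy stands in for whichever regression algorithm is eventually chosen, and the file name `linear_model.pkl` is hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): 200 rows, 3 input features, 1 target.
X = rng.normal(size=(200, 3))
true_coef = np.array([0.5, -1.0, 2.0])
y = X @ true_coef + 0.1 + rng.normal(scale=0.01, size=200)

# Train offline: ordinary least squares with an intercept column appended.
X_design = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# Save the fitted parameters locally (hypothetical file name).
model_path = Path(tempfile.gettempdir()) / "linear_model.pkl"
with model_path.open("wb") as f:
    pickle.dump({"coef": coef}, f)

# Later: load the saved model and predict a single value per input row.
with model_path.open("rb") as f:
    loaded = pickle.load(f)

X_new = rng.normal(size=(5, 3))
predictions = np.column_stack([X_new, np.ones(len(X_new))]) @ loaded["coef"]
```

Several input columns make this a multiple regression, and the one-dimensional `predictions` array reflects the single predicted value per row.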

### Data
The data to be used is found in the "NYSE_dataset.parquet" file. The dataset card can be seen in the root directory of this project. The following points were extracted from the dataset card; each is presented below in white before being addressed in red.

1. Dataset shifts: This dataset holds data dating from the 1980s all the way to the present day. Keep in mind that the market dynamics might have changed over time, leading to possible dataset shifts. To account for these shifts, consider dividing the dataset into a training set and a test set using a time-based split. This will ensure that the model is trained on data that is representative of the time period it will be predicting on. One could also consider performing stationarity tests on the time series data. If required, consider applying techniques to make the time series data stationary.

2. Leakage: The columns 'priceRatioRelativeToS&P_1Q', 'priceRatioRelativeToS&P_2Q', 'priceRatioRelativeToS&P_3Q', and 'priceRatioRelativeToS&P_4Q' can be considered as labels, indicating future relative price increases of the stock versus the S&P500. To prevent leakage, ensure that these columns are not used to calculate any other features and that they are not present as input features.

3. Correlated features: The dataset was built in a greedy manner by keeping as many features as possible, meaning that there may be many features that are highly correlated with each other. To prevent overfitting and reduced model performance, perform feature selection or dimensionality reduction techniques to reduce the number of features in the dataset.

4. Missing values: The dataset is completely raw and has not been processed. As such, there will certainly be NaN values that must be handled. 

5. Data types: The only columns in the dataset that are not numeric are "date", "start_date" and "period". All others are either float64 or int64.

6. Outliers: The dataset is completely raw and has not been processed. As such, there are likely to be outliers.

7. Scaling: The dataset is completely raw and has not been processed. As such, all data is unscaled.
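Items 1 and 2 above (time-based split and label leakage) can be sketched together. This is a minimal illustration, not the project's final pipeline: the helper names `time_based_split` and `split_features_and_target`, the cutoff date, and the toy frame are all assumptions, and the `date` column is assumed to be parseable by `pd.to_datetime`. Only the four label column names come from the dataset card.

```python
import numpy as np
import pandas as pd

# Label columns named in the dataset card; they must never be input features.
LABEL_COLS = [
    "priceRatioRelativeToS&P_1Q",
    "priceRatioRelativeToS&P_2Q",
    "priceRatioRelativeToS&P_3Q",
    "priceRatioRelativeToS&P_4Q",
]


def time_based_split(df, cutoff, date_col="date"):
    """Rows strictly before `cutoff` go to train, the rest to test (hypothetical helper)."""
    dates = pd.to_datetime(df[date_col])
    cutoff = pd.Timestamp(cutoff)
    return df[dates < cutoff], df[dates >= cutoff]


def split_features_and_target(df, target_col, label_cols=LABEL_COLS):
    """Drop every label column from the inputs and keep one of them as the target."""
    X = df.drop(columns=label_cols)
    y = df[target_col]
    return X, y


# Toy frame standing in for the real dataset (assumption).
toy = pd.DataFrame({
    "date": pd.to_datetime([f"{year}-01-01" for year in range(2000, 2008)]),
    "feature_a": np.arange(8.0),
    **{col: np.linspace(0.9, 1.1, 8) for col in LABEL_COLS},
})

train, test = time_based_split(toy, "2005-01-01")
X_train, y_train = split_features_and_target(train, LABEL_COLS[0])
```

Splitting by date rather than at random keeps every test row later in time than every training row, which is what allows the test score to reflect performance under possible dataset shift.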

In [1]:
import pandas as pd
import numpy as np

raw_df = pd.read_parquet('NYSE_dataset.parquet')

In [4]:
len(raw_df)

66627
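With the data loaded, the dataset-card points on missing values, data types, and correlated features could be checked with a small summary helper. The sketch below is an assumption-laden illustration: `summarize_raw_frame` and the 0.95 correlation threshold are hypothetical choices, and it is demonstrated on a toy frame rather than on `raw_df`.

```python
import numpy as np
import pandas as pd


def summarize_raw_frame(df, corr_threshold=0.95):
    """Return NaN fractions, non-numeric columns, and highly correlated column pairs."""
    # Fraction of missing values per column, worst first.
    missing_fraction = df.isna().mean().sort_values(ascending=False)

    # Columns that are not numeric (the dataset card lists date, start_date, period).
    non_numeric_cols = df.select_dtypes(exclude="number").columns.tolist()

    # Absolute pairwise correlations between numeric columns.
    corr = df.select_dtypes(include="number").corr().abs()
    # Keep only the strict upper triangle so each pair is reported once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    correlated_pairs = [
        (row, col)
        for col in upper.columns
        for row in upper.index
        if pd.notna(upper.loc[row, col]) and upper.loc[row, col] > corr_threshold
    ]
    return missing_fraction, non_numeric_cols, correlated_pairs


# Toy frame (assumption): "b" is exactly 2 * "a", and "c" has one NaN.
toy = pd.DataFrame({
    "period": ["Q1", "Q2", "Q3", "Q4"],
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],
    "c": [1.0, np.nan, -1.0, 0.5],
})

missing_fraction, non_numeric_cols, correlated_pairs = summarize_raw_frame(toy)
```

The same call could be made with `raw_df` in place of `toy`. The correlated pairs it reports are candidates for feature selection, not an automatic drop list.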