# Notebook Instructions

1. If you are new to Jupyter notebooks, please go through this introductory manual <a href='https://quantra.quantinsti.com/quantra-notebook' target="_blank">here</a>.
1. Any changes made in this notebook will be lost after you close the browser window. **You can download the notebook to save your work on your PC.**
1. Before running this notebook on your local PC:<br>
i.  You need to set up a Python environment and the relevant packages on your local PC. To do so, go through the section on "**Run Codes Locally on Your Machine**" in the course.<br>
ii. You need to **download the zip file available in the last unit** of this course. The zip file contains the data files and/or Python modules that might be required to run this notebook.

## Data Pre-processing 
In the previous notebook, you learned about creating custom input parameters that you can use in your model.  

In this notebook, you will learn about a data preprocessing step known as scaling. Also, you will learn how to divide your data into input and output datasets. These are known as independent (X) and dependent (Y) variables. The key steps involved here are:

1. [Import the Data](#import)
2. [Check and Drop NaN Values](#drop)
3. [Scale the Data](#scale)
4. [Create Independent and Dependent Variables](#xy)
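
As a quick preview of the last step, splitting a dataset into independent (X) and dependent (y) variables usually comes down to selecting columns. The sketch below is only an illustration on a toy dataframe: the column names `feature_a`, `feature_b` and `target` are hypothetical and are not the columns of `input_parameters.csv`.

```python
import pandas as pd

# Toy data; the column names are hypothetical and used only for illustration
toy = pd.DataFrame({
    "feature_a": [1.0, 2.0, 3.0, 4.0],
    "feature_b": [10.0, 20.0, 30.0, 40.0],
    "target": [1.5, 2.5, 3.5, 4.5],
})

# Independent variables (inputs): every column the model learns from
X = toy[["feature_a", "feature_b"]]

# Dependent variable (output): the column the model tries to predict
y = toy["target"]

print(X.shape, y.shape)
```

Which columns serve as inputs and which as outputs depends on what the model is meant to predict.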

In [1]:
# For data manipulation
import pandas as pd
import numpy as np

# Machine learning libraries
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

# To ignore unwanted warnings
import warnings
warnings.filterwarnings("ignore", category=UserWarning)

<a id='import'></a>
## Import the Data

We will use the dataset with the input and output parameters that we created in the previous notebook. The input data is stored in `input_parameters.csv`, which we will import here as `gold_prices` to make predictions.

In [2]:
# Read the data
gold_prices = pd.read_csv(
    '../data_modules/input_parameters.csv', index_col='Date')

<a id='drop'></a>
## Check and Drop NaN Values

We will check for any `NaN` values present in the `gold_prices` dataset using the `isna` method. Then, we will use the `sum` method to find the total number of `NaN` values.

In [3]:
# Check for NaN values
gold_prices.isna().sum()

Open      0
High      0
Low       0
Close     0
S_3       3
S_15     15
S_60     60
Corr     13
Std_U     0
Std_D     0
OD        1
OL        1
dtype: int64

As you can see, the new input parameters introduced several `NaN` values into the dataset. We will drop every row that contains a `NaN` value using `dropna`.

In [4]:
# Drop all the NaN values
gold_prices.dropna(inplace=True)

# Check for NaN values
gold_prices.isna().sum()

Open     0
High     0
Low      0
Close    0
S_3      0
S_15     0
S_60     0
Corr     0
Std_U    0
Std_D    0
OD       0
OL       0
dtype: int64

We can see that all the `NaN` values have been removed from our dataset.
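
To see what `dropna` does row by row, here is a minimal sketch on a toy dataframe. It assumes, purely for illustration, that the missing values come from a rolling-window calculation. The column `avg_3` is hypothetical, and the way the actual input parameters were computed is not shown in this notebook.

```python
import pandas as pd

# Toy price series (illustrative values only)
toy = pd.DataFrame({"Close": [10.0, 11.0, 12.0, 13.0, 14.0]})

# A 3-period rolling mean has no value until 3 observations are available,
# so the first two rows of this hypothetical column are NaN
toy["avg_3"] = toy["Close"].rolling(window=3).mean()

# Count the NaN values per column
nan_counts = toy.isna().sum()
print(nan_counts)

# dropna removes every row that has a NaN in any column
cleaned = toy.dropna()
print(len(toy), len(cleaned))
```

Because `dropna` removes a row if any of its columns is missing, the column with the most `NaN` values largely determines how many rows are lost.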

<a id='scale'></a>
## Scale the Data

Suppose the features in our dataset differ widely in magnitude. In that case, features with larger magnitudes might dominate the other features. This could prevent the regression model from learning from all of the features correctly.

To avoid this, we will standardise the dataset by centring and scaling it. Centring subtracts each feature's mean from its values, so that the mean of the feature becomes 0. Scaling divides each centred value by the feature's standard deviation, so that the standard deviation of the feature becomes 1.
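
The two steps can be checked by hand on a toy dataframe. The sketch below uses hypothetical column names and values, computes `(x - mean) / std` manually, and compares the result with scikit-learn's `StandardScaler`. Note that `StandardScaler` uses the population standard deviation (`ddof=0`), whereas the pandas `std` method defaults to `ddof=1`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy features on very different scales (hypothetical names and values)
toy = pd.DataFrame({
    "price": [1200.0, 1250.0, 1300.0, 1350.0],
    "ratio": [0.1, 0.2, 0.3, 0.4],
})

# Centre by subtracting the mean, then scale by the population standard deviation
manual = (toy - toy.mean()) / toy.std(ddof=0)

# The same transformation using StandardScaler
sk_scaled = StandardScaler().fit_transform(toy)

# Both approaches should agree up to floating-point error
print(np.allclose(manual.to_numpy(), sk_scaled))
```

After the transformation, both columns have a mean of 0 and a standard deviation of 1, so neither dominates because of its original units.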

We will use the `StandardScaler` class from `sklearn.preprocessing` to standardise our dataset.

In [5]:
# Initialise the Standard Scaler
scaler = StandardScaler()

# Scale the data in gold_prices and store it as an array in variable scaled
scaled = scaler.fit_transform(gold_prices)

# Convert data stored in scaled from array to dataframe
scaled_gold_prices = pd.DataFrame(
    scaled, index=gold_prices.index, columns=gold_prices.columns)

# Print the new dataframe
scaled_gold_prices.head()

Date,Open,High,Low,Close,S_3,S_15,S_60,Corr,Std_U,Std_D,OD,OL
2013-07-10,0.2784,0.388435,0.264553,0.247863,0.013875,0.444063,2.22376,-1.247226,1.480757,0.216553,0.30828,0.675468
2013-07-11,0.751969,0.693429,0.704618,0.747942,0.159572,0.327314,2.195139,-2.894026,-0.745596,0.730431,3.103464,4.243061
2013-07-12,0.639286,0.684325,0.681697,0.731221,0.400532,0.261003,2.171491,-2.2966,0.630713,-0.587816,-0.739927,-0.928662
2013-07-15,0.72456,0.697981,0.761153,0.738822,0.579342,0.266537,2.147346,-0.100679,-0.320568,-0.498424,0.557849,-0.06884
2013-07-16,0.828106,0.822406,0.836026,0.846741,0.743888,0.257452,2.119389,0.310616,-0.037213,-0.07392,0.677638,0.739637


Note that, in practice, the complete dataset should not be scaled before it is split into train and test datasets. The correct approach is to fit the scaler on the train data only and then use the fitted scaler to transform both the train and test sets. This avoids data leakage from the test set into the train set.
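
A minimal sketch of this approach is shown below. It uses randomly generated data in place of `gold_prices`, and the 80/20 chronological split is an assumption made only for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a feature matrix (100 rows, 3 features)
rng = np.random.default_rng(0)
data = rng.normal(loc=100.0, scale=15.0, size=(100, 3))

# Chronological split without shuffling; the 80/20 ratio is an assumption
split = int(0.8 * len(data))
train, test = data[:split], data[split:]

# Fit the scaler on the train data only
scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)

# Reuse the train mean and standard deviation to transform the test data
test_scaled = scaler.transform(test)

print(train_scaled.mean(axis=0).round(6))
print(test_scaled.mean(axis=0).round(3))
```

The scaled train set has a mean of exactly 0 per feature, while the scaled test set generally does not, because it is transformed with statistics learned from the train set alone.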

## Conclusion
In this notebook, we learned how to check for and drop `NaN` values and how to scale the data.