# Notebook Instructions

1. If you are new to Jupyter notebooks, please go through this introductory manual <a href='https://quantra.quantinsti.com/quantra-notebook' target="_blank">here</a>.
1. Any changes made in this notebook will be lost after you close the browser window. **You can download the notebook to save your work on your PC.**
1. Before running this notebook on your local PC:<br>
i.  You need to set up a Python environment and the relevant packages on your local PC. To do so, go through the section on "**Run Codes Locally on Your Machine**" in the course.<br>
ii. You need to **download the zip file available in the last unit** of this course. The zip file contains the data files and/or Python modules that might be required to run this notebook.

## Concept of Pipeline
In the previous notebooks, we covered importing data and creating input and output parameters. We also learnt about scaling, a data preprocessing method used to prepare the data for a machine learning model.

In this notebook, we will take a diversion from the GLD dataset we have been working on to learn about the concept of a pipeline. Later in this notebook, we will also test the pipeline we create on a sample dataset.

A scikit-learn pipeline executes a fixed sequence of steps in order. This is useful when the data has to pass through several processing steps before a machine learning model is trained or used to make predictions. The key steps in this notebook are:

1. [Create a Sample Dataset](#sample)
2. [Create Pipeline](#pipeline)
3. [Test Pipeline](#test)

In [1]:
# For data manipulation
import pandas as pd
import numpy as np

# Machine learning libraries
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# To ignore unwanted warnings
import warnings
warnings.filterwarnings("ignore")

<a id='sample'></a>
## Create a Sample Dataset

Let us first create a sample dataset that we will use to test our pipeline. We will create two lists, `x` and `y`, which contain close prices of Tesla and Amazon respectively.

In [2]:
# List containing close prices of Tesla (independent variable)
x = [663.90, 674.90, 628.16, 658.80, 707.73, 759.63, 758.26, 740.37, 775.00, 703.55]

# List containing close prices of Amazon (dependent variable)
y = [2151.82, 2151.14, 2082.00, 2135.50, 2221.55, 2302.93, 2404.19, 2433.68, 2510.22, 2447.00]

Next, we will split our sample data into train and test datasets. We will use the first 80% of the data for training and the remaining 20% for testing.

In [3]:
# Calculate the splitting length 
split = int(0.8 * len(x))

# Split x into train and test sets
x_train = x[:split]
x_test = x[split:]

# Split y into train and test sets
y_train = y[:split]
y_test = y[split:]
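
The same chronological split can be cross-checked with scikit-learn's `train_test_split`. This is a minimal sketch and not part of the original notebook: it assumes `shuffle=False` so that the row order is preserved, and the `_alt` variable names are hypothetical.

```python
from sklearn.model_selection import train_test_split

# Sample close prices from the notebook
x = [663.90, 674.90, 628.16, 658.80, 707.73, 759.63, 758.26, 740.37, 775.00, 703.55]
y = [2151.82, 2151.14, 2082.00, 2135.50, 2221.55, 2302.93, 2404.19, 2433.68, 2510.22, 2447.00]

# Manual split index, as in the cell above
split = int(0.8 * len(x))

# shuffle=False keeps the original order, so the last 20% becomes the test set
x_train_alt, x_test_alt, y_train_alt, y_test_alt = train_test_split(
    x, y, test_size=0.2, shuffle=False
)

# Compare the helper's output with the manual slices
print(len(x_train_alt), len(x_test_alt))
print(list(x_test_alt) == x[split:])
```

With the default `shuffle=True`, the rows would be drawn at random instead, which is usually not what you want for ordered price data.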

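Now, we will reshape our data from 1D to 2D arrays. This is because scikit-learn transformers and estimators expect the input features as a 2D array of shape `(n_samples, n_features)`. Each row in the array is a "point" you want your model to predict on. With a single feature, the required shape is `(n_samples, 1)`, which we get by reshaping with `(-1, 1)`; the `-1` tells NumPy to infer the number of rows.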

In [4]:
# Reshape training data
x_train = np.reshape(x_train, (-1, 1))
y_train = np.reshape(y_train, (-1, 1))

# Reshape testing data
x_test = np.reshape(x_test, (-1, 1))
y_test = np.reshape(y_test, (-1, 1))
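
To see what the reshape does, the short sketch below compares the shape of the training prices before and after the call. It uses the first eight Tesla prices from the sample data; the name `x_2d` is only for illustration.

```python
import numpy as np

# First eight Tesla close prices (the training portion of x)
x_train = [663.90, 674.90, 628.16, 658.80, 707.73, 759.63, 758.26, 740.37]

# -1 lets NumPy infer the number of rows; 1 fixes a single column
x_2d = np.reshape(x_train, (-1, 1))

print(np.shape(x_train))  # (8,)   one dimension, eight values
print(x_2d.shape)         # (8, 1) eight rows, one feature column
```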

<a id='pipeline'></a>
## Pipeline
As mentioned before, a pipeline allows us to execute a certain number of steps sequentially. It may contain several transformation steps followed by the final estimation step. 

This is useful when we process the data in a fixed sequence of steps. Some steps transform the data, for example, by handling missing values or standardising the features. These steps are known as transformers. Other steps predict variables by fitting an algorithm such as linear regression, random forest or support vector machine (SVM). These are called estimators.

So, in a pipeline, we first sequentially apply a list of transformers (data processing) and then a final estimator (ML model). 

The syntax below shows this structure. We store the steps, in the order they should run, as a list of `(name, object)` tuples in the variable `steps`.

Syntax: 
```python
steps = [(name_1, transform_1), (name_2, transform_2), ..., (name_n, estimator)]
Pipeline(steps)
```
We are using the following two steps in our pipeline:
1. Scale the data 
2. Fit the data using the linear regression model

In [5]:
# First we put scaling and then linear regression in the pipeline.
steps = [('scaler', StandardScaler()), 
         ('linear', LinearRegression())]

# Defining pipeline
pipeline = Pipeline(steps, verbose=True)
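
The names given in the tuples are what you later use to look up a step. The sketch below, which is an addition to the original notebook, shows the `named_steps` lookup and also scikit-learn's `make_pipeline` helper, which builds the same sequence without explicit names. The variables `scaler_step` and `auto_pipeline` are hypothetical.

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

# Same two steps as in the notebook, with explicit names
pipeline = Pipeline([('scaler', StandardScaler()),
                     ('linear', LinearRegression())])

# Look up a step by the name given in its tuple
scaler_step = pipeline.named_steps['scaler']
print(type(scaler_step).__name__)

# make_pipeline names each step after its class, in lowercase
auto_pipeline = make_pipeline(StandardScaler(), LinearRegression())
print([name for name, _ in auto_pipeline.steps])
```

Explicit names are slightly more work, but they keep later lookups readable when a pipeline has many steps.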

<a id='test'></a>
## Test Pipeline
After creating a pipeline, we will fit it on our training dataset.  

In [6]:
pipeline.fit(x_train, y_train)

[Pipeline] ............ (step 1 of 2) Processing scaler, total=   0.0s
[Pipeline] ............ (step 2 of 2) Processing linear, total=   0.0s


Pipeline(steps=[('scaler', StandardScaler()), ('linear', LinearRegression())],
         verbose=True)
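
Once the pipeline is fitted, each step holds the parameters it learned from the training data. The sketch below rebuilds the same pipeline on the training portion of the sample data and inspects those parameters; it is an illustration added to the notebook, and the names `scaler` and `linear` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Training portion of the sample data, reshaped to 2D
x_train = np.reshape(
    [663.90, 674.90, 628.16, 658.80, 707.73, 759.63, 758.26, 740.37], (-1, 1))
y_train = np.reshape(
    [2151.82, 2151.14, 2082.00, 2135.50, 2221.55, 2302.93, 2404.19, 2433.68], (-1, 1))

pipeline = Pipeline([('scaler', StandardScaler()),
                     ('linear', LinearRegression())])
pipeline.fit(x_train, y_train)

# Retrieve the fitted steps by name
scaler = pipeline.named_steps['scaler']
linear = pipeline.named_steps['linear']

# Mean and standard deviation the scaler learned from x_train
print(scaler.mean_, scaler.scale_)

# Slope and intercept of the regression fitted on the scaled x_train
print(linear.coef_, linear.intercept_)
```

Because the scaled training feature has zero mean, the fitted intercept equals the mean of `y_train`.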

Let us now predict the values of `y` by calling the `predict()` method with `x_test` as the argument.

In [7]:
pipeline.predict(x_test)

array([[2418.75009388],
       [2246.40191606]])

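The pipeline returns one predicted `y` value (Amazon close price) for each of the two rows in `x_test`. Internally, it scaled `x_test` with the mean and standard deviation learned from the training data before passing it to the linear regression model.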
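
To make the pipeline's behaviour concrete, the sketch below runs the same two steps by hand and checks that the result matches `pipeline.predict()`. It is an illustrative addition, not part of the original notebook; `manual_pred` and `pipeline_pred` are hypothetical names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Sample data and 80/20 chronological split, as in the notebook
x = [663.90, 674.90, 628.16, 658.80, 707.73, 759.63, 758.26, 740.37, 775.00, 703.55]
y = [2151.82, 2151.14, 2082.00, 2135.50, 2221.55, 2302.93, 2404.19, 2433.68, 2510.22, 2447.00]
split = int(0.8 * len(x))
x_train = np.reshape(x[:split], (-1, 1))
x_test = np.reshape(x[split:], (-1, 1))
y_train = np.reshape(y[:split], (-1, 1))
y_test = np.reshape(y[split:], (-1, 1))

# Pipeline version
pipeline = Pipeline([('scaler', StandardScaler()),
                     ('linear', LinearRegression())])
pipeline.fit(x_train, y_train)
pipeline_pred = pipeline.predict(x_test)

# Manual version: fit the scaler on the training data only,
# then reuse it (transform, not fit_transform) on the test data
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
model = LinearRegression().fit(x_train_scaled, y_train)
manual_pred = model.predict(scaler.transform(x_test))

# Both routes should give the same predictions
print(np.allclose(pipeline_pred, manual_pred))

# Actual test values next to the predicted ones
print(np.hstack([y_test, pipeline_pred]))
```

The pipeline saves you from keeping track of which object was fitted on which data, which is where manual code most often goes wrong.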

## Conclusion
In this notebook, we learned how to create a pipeline. We also tested our pipeline on a sample dataset to see its results. In the next notebook, we will use a pipeline to transform the GLD dataset and fit an estimator on it.