# Predicting Stock Prices with Linear Regression

## Recap
![Alt mindmap](https://s3-eu-west-1.amazonaws.com/gallery-prod-4f50/img/a897779b107f4e8a818316dc40f3d25a.png)
[**click here for more details**](https://sketchboard.me/RBiXS3ZIZyOe)

## Part 1: Overfitting and Underfitting

In this part, we build a polynomial linear regression model and use it to demonstrate the problems of overfitting and underfitting.
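
To make the two problems concrete before touching the course dataset, here is a minimal sketch on synthetic data. The noisy sine curve, the hold-out rule, and the helper `rmse` are all assumptions made for illustration and are not part of the exercise. The sketch fits polynomials of several degrees with `np.polyfit` and compares the error on the points used for fitting with the error on held-out points.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption): noisy samples of a sine curve on [0, 1]
x = np.sort(rng.uniform(0, 1, 40))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=x.shape)

# Hold out every fourth point for testing
test_mask = np.arange(len(x)) % 4 == 0
x_train, y_train = x[~test_mask], y[~test_mask]
x_test, y_test = x[test_mask], y[test_mask]


def rmse(y_true, y_pred):
    """Root-mean-square error between two arrays."""
    return np.sqrt(np.mean(np.square(y_pred - y_true)))


errors = {}
for degree in (1, 3, 9):
    coefs = np.polyfit(x_train, y_train, degree)  # least-squares polynomial fit
    errors[degree] = (
        rmse(y_train, np.polyval(coefs, x_train)),
        rmse(y_test, np.polyval(coefs, x_test)),
    )
    print(f"degree={degree}: train RMSE={errors[degree][0]:.3f}, "
          f"test RMSE={errors[degree][1]:.3f}")
```

A degree that is too low leaves both errors high (underfitting). As the degree grows, the training error keeps shrinking, while the test error typically stops improving and may rise again (overfitting).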

In [None]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


In [None]:
# Import the dataset overfitting_underfitting_dataset.csv
data = # Your code here

# Show first 5 rows of dataset

In [None]:
# Let's plot the dataset using plt.scatter()
plt.figure(figsize=(20, 10))

# Your code here


plt.show()

#### Train/Test split

[More details](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)
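
As a rough illustration of what `train_test_split` returns, the sketch below runs it on a toy DataFrame. The frame `toy` and its values are made up for this example. It only inspects the shapes of the four outputs.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data (assumption): ten rows with one feature column X and one target column Y
toy = pd.DataFrame({'X': range(10), 'Y': [2 * v for v in range(10)]})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy[['X']], toy['Y'], test_size=0.3, random_state=0
)

print(X_tr.shape, X_te.shape)  # (7, 1) (3, 1): the feature matrix stays two-dimensional
print(y_tr.shape, y_te.shape)  # (7,) (3,): the target is one-dimensional
print(X_tr.index.equals(y_tr.index))  # True: features and targets stay paired row by row
```

The rows are shuffled before splitting, and `random_state` makes that shuffle reproducible.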

In [None]:
# New technique: train_test_split splits arrays or matrices into random train and test subsets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data[['X']], data['Y'], test_size=0.3, random_state=68)
# Notice that we use [['X']] and ['Y'] here. Any idea why?

# We sort the training set by its index so that plt.plot() draws the curve in order
# (this relies on the rows of the dataset being ordered by X)
X_train = X_train.sort_index()
y_train = y_train.sort_index()

In [None]:
# Let's plot the training set AND test set on a graph using plt.scatter()

plt.figure(figsize=(20, 10))

# Your code here

plt.show()

### Building Polynomial Linear Regression

In [None]:
# Following the example from the lecture, let's build a polynomial linear regression model and train it
# Step 1: import LinearRegression and PolynomialFeatures
# Your code here

#### Feature engineering
Generate a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree. For example, if an input sample is two dimensional and of the form [a, b], the degree-2 polynomial features are [1, a, b, a^2, ab, b^2].

For more information [link](https://stats.stackexchange.com/a/58747)
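
The `[a, b]` example above can be checked directly. This is a minimal sketch with the arbitrary values a = 2 and b = 3, chosen only for illustration.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

sample = np.array([[2, 3]])  # one sample with a = 2, b = 3
poly = PolynomialFeatures(degree=2)
expanded = poly.fit_transform(sample)

print(expanded)  # [[1. 2. 3. 4. 6. 9.]] -> [1, a, b, a^2, ab, b^2]
```

With a single input feature, as in this exercise, the expansion of degree `d` is simply `[1, x, x^2, ..., x^d]`.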
 

In [None]:
# Step 2: Define an instance of PolynomialFeatures with degree=4 called 'poly_reg'
# Your code here


# Step 3: Fit and transform 'X_train' with 'poly_reg', then transform 'X_test' with the same fitted object
# Save the results in 'X_train_poly' and 'X_test_poly'
# Your code here


# Step 4: Define an instance of LinearRegression called 'lm'
# Your code here


# Step 5: Fit the linear regression model to the training set (X_train_poly, y_train)
# Your code here


# Step 6: Run predict() on the training set and save the result in 'y_train_predict' in order to plot your model
# Your code here


# Step 7: Predict test set and save the result in 'y_test_predict'
# Your code here


In [None]:
# If you did everything correctly, the code below will visualize the result
plt.figure(figsize=(20, 10))
plt.plot(X_train, y_train_predict)
plt.scatter(X_train, y_train)
plt.scatter(X_test, y_test)
plt.show()

### Overfitting & Underfitting demonstration

In [None]:
# Now let's write a function that does the same thing as the previous step, but takes the degree as an argument
def plot_poly_linear_regression(degree=10):
    # Your code here
    # Hint: copy the code that you did in previous step here 
    

    # Calculate the root-mean-square error (RMSE) on the training set and the test set
    train_error = np.sqrt(np.mean(np.square(y_train_predict - y_train)))
    test_error = np.sqrt(np.mean(np.square(y_test_predict - y_test)))
                          
    # Visualization
    plt.figure(figsize=(20, 10))
    plt.scatter(X_train, y_train, label='Train')
    plt.scatter(X_test, y_test, label='Test')
    
    X_train_2 = np.linspace(X_train.min(), X_train.max(), 100)
    plt.plot(X_train_2, lm.predict(poly_reg.transform(X_train_2.reshape(-1, 1))), label='Predict')
    plt.title('Train error: {:.3f}\nTest error: {:.3f}\n'.format(train_error, test_error))
    plt.legend(loc="best")
    plt.show()

In [None]:
# The code below uses your function to build an interactive graph that lets us tune the parameter 'degree'
from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets
from IPython.display import display

In [None]:
p = interactive(plot_poly_linear_regression, degree=(1, 20, 1))
display(p)


## Part 2: Working on VinGroup's stock prices

### Data Description

The dataset contains Vingroup's daily stock prices from January 3, 2017 to October 9, 2018.

 - **DATE**: Trading date, or trading session.
 - **OPEN**: Daily opening price, the price of the first trade within each trading day.
 - **CLOSE**: Daily closing price, the price at the end of the trading day.
 - **HIGH**: The highest price at which the stock traded within a day.
 - **LOW**: The lowest price at which the stock traded within a day.
 - **VOLUME**: The number of shares traded within a day.
 - **TICKER**: Security code of Vingroup on the market (VIC).
 
**CLOSE** is our target/dependent variable. We are going to build a model to predict the **Close Price**.

**Import mdates from matplotlib**

In [None]:
# we need mdates from matplotlib to work with the DATE column
import matplotlib.dates as mdates


In [None]:
# Import stock prices dataset
stock = # Your code here

# Show first 5 rows of the dataset
# Your code here


In [None]:
# Using tail() to show last 5 rows 
# Your code here


In [None]:
# Check unique values of the column TICKER
# Your code here


We see that the **TICKER** column contains only one value, `VIC`, so it carries no information about the **CLOSE** prices. We will remove the column entirely.
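
The reasoning can be tried on a toy frame first. The frame below reuses two column names from the stock dataset, but its values are made up for illustration.

```python
import pandas as pd

# Toy frame (assumption): same column names as the stock dataset, made-up values
toy = pd.DataFrame({
    'TICKER': ['VIC', 'VIC', 'VIC'],
    'CLOSE': [10.0, 10.5, 10.2],
})

print(toy['TICKER'].nunique())  # 1: every row holds the same value

# drop() returns a new DataFrame; axis=1 means "drop a column"
toy = toy.drop('TICKER', axis=1)
print(list(toy.columns))  # ['CLOSE']
```

A column with a single distinct value is constant, so a regression model cannot use it to tell one row from another.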

In [None]:
# Use drop() with axis=1 to remove the column TICKER from our dataset
# Your code here


stock.head()

In [None]:
# Use info() to show an overview of the dataset
# Your code here


The **DATE** column is stored as `object`, a generic data type that can hold anything from a `string` to an `integer`. We therefore convert **DATE** to `datetime` and then use `mdates.date2num()` to convert each `datetime` into a number, because the model can only do arithmetic on numeric features.
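
A minimal sketch of the two conversion steps on a toy frame is shown below. The three dates are made up and written in the day/month/year layout that the notebook parses.

```python
import pandas as pd
import matplotlib.dates as mdates

# Toy data (assumption): three consecutive days written as day/month/year strings
toy = pd.DataFrame({'DATE': ['03/01/2017', '04/01/2017', '05/01/2017']})

# Step 1: parse the strings into datetime values
toy['DATE'] = pd.to_datetime(toy['DATE'], format='%d/%m/%Y')

# Step 2: map each datetime to a float counting days since Matplotlib's epoch
toy['DATE2NUM'] = mdates.date2num(toy['DATE'])

day_gaps = toy['DATE2NUM'].diff().dropna().tolist()
print(day_gaps)  # consecutive days differ by 1.0
```

The absolute numbers depend on the epoch Matplotlib is configured with, but the spacing between dates is always measured in days.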

In [None]:
# Read and understand the code below. The new techniques here are the lambda function and 'apply()'
stock['DATE'] = pd.to_datetime(stock['DATE'], format='%d/%m/%Y')
stock['DATE2NUM'] = stock['DATE'].apply(lambda x: mdates.date2num(x))

In [None]:
# Let's check again using the function info()
# Your code here


In [None]:
# Let's take a look at the new column using the function head()
# Your code here


Now let's see what our stock data looks like.

In [None]:
# Run this code below
plt.figure(figsize=(20,10))
plt.xticks(rotation=45)
plt.plot(stock['DATE'], stock['CLOSE'], 'b-', label='Real value')
plt.ylabel('Close prices')
plt.title('Vingroup - VIC')
plt.grid()
plt.show()

In [None]:
# Your job now is to put the code above into a function that we can reuse later on
# The first argument is a column of dates (e.g. stock['DATE'])
# The second argument is a vector of close prices (e.g. stock['CLOSE'])
# The third argument is optional and holds the prices predicted by our model
# Note that all of the arguments should have the same length
def plot_stock(date, outcome, prediction=None):
    # your code here
    
    return
    
# The result should be the same graph as in the previous step
plot_stock(stock['DATE'],stock['CLOSE'])

### Train/Test split
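
Because the rows of a time series are ordered, the split is a cut at a fixed position rather than a random shuffle. The sketch below uses a toy series (an assumption, not the stock data) to show one detail worth knowing when making that cut: `.loc` slices by label and includes both endpoints, while `.iloc` slices by position and excludes the stop position.

```python
import numpy as np
import pandas as pd

# Toy series (assumption): ten observations already ordered by date
values = pd.Series(np.arange(10))
split = int(len(values) * 0.3)  # 3

# .loc slices by label and includes BOTH endpoints
loc_sizes = (len(values.loc[:split]), len(values.loc[split:]))
print(loc_sizes)  # (4, 7): the row labelled 3 lands in both parts

# .iloc slices by position and excludes the stop position
iloc_sizes = (len(values.iloc[:split]), len(values.iloc[split:]))
print(iloc_sizes)  # (3, 7): no overlap
```

With `.loc`, the boundary row would be used both for training and for testing unless the two slices are offset by one.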

In [None]:
# Assign the CLOSE column to the variable y
# Your code here


# Assign ['DATE2NUM'] to the variable X
# Your code here


# This is a time series dataset: the observations (rows) follow each other in date order.
# A random split with train_test_split would mix earlier and later dates, so we split by position instead.
# First, we get the number of rows and columns of the dataset
nrow , ncol = stock.shape

# Then use iloc to split by position: the first 30% of the rows form the test set, the rest the training set
split = int(nrow * 0.3)
X_test = X.iloc[:split]
X_train = X.iloc[split:]

# Do the same with the outcome y
# your code here


# To visualize the training set and test set, we need to split the DATE column into 2 parts too
# Your code here


In [None]:
# If you do everything correctly your function should be able to plot the training set and test set here
# Plot of training set
plot_stock(date_train, y_train)

In [None]:
# Plot of test set
plot_stock(date_test, y_test)

### Training Linear Regression Model

In [None]:
# Let's import LinearRegression from the scikit-learn library and train your model with X_train, y_train
# Notice that we use (date_train, y_train) to plot and (X_train, y_train) to train the model
# Your code here


# If everything is fine, you should be able to print out the coefficients (b1, b2, .. bn) and the intercept (b0)
print(lm.coef_)
print(lm.intercept_)

In [None]:
# Now we have a Linear Regression model. Let's use function predict() on X_train and save the result in predictions_train
# Your code here


# Call your plot_stock() function to compare our model's curve with y_train
plot_stock(date_train, y_train, predictions_train)

In [None]:
# Now let's see how your model predicts the test set. Save the result in a variable called 'predictions_test'
# Your code here


# Call your plot_stock() function to compare the result with y_test
plot_stock(date_test, y_test, predictions_test)

Not as good a result as you expected? Do you have any idea why?

Let's go back to the splitting data step and add more features to your training set. Let's see how that could improve your model.
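
As a starting point, here is a hedged sketch of fitting `LinearRegression` on more than one feature column. The data is synthetic, and the choice of `DATE2NUM` and `OPEN` as features is hypothetical. Which columns are appropriate for the real dataset is left for you to decide.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data (assumption): CLOSE is built as an exact linear mix of two features
toy = pd.DataFrame({
    'DATE2NUM': np.arange(50, dtype=float),
    'OPEN': rng.uniform(90, 110, 50),
})
toy['CLOSE'] = 0.5 * toy['DATE2NUM'] + 0.8 * toy['OPEN'] + 3.0

features = ['DATE2NUM', 'OPEN']  # hypothetical choice of feature columns
model = LinearRegression().fit(toy[features], toy['CLOSE'])

print(model.coef_)       # one coefficient per feature column
print(model.intercept_)  # the constant term b0
```

Since the toy target is an exact linear combination, the fitted coefficients recover the values used to build it. Real prices will not fit this cleanly.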

## Great job! Hope you enjoyed it.