# Final Project: Stock Price Prediction (Google)
## Mohammad Jahed Murad Sunny
# T00707169
Artificial Intelligence - CPSC 7373

## Part 1 - Data Preprocessing
Data preprocessing refers to the process of preparing raw data for machine learning models. The purpose of this step is to clean, transform, and prepare data in a suitable format that can be used by machine learning algorithms.

Cleaning the data involves identifying and removing irrelevant, incomplete, or inaccurate data points. For example, if some data points are missing or contain incorrect information, they can be removed or replaced with suitable values to ensure that the data is consistent and complete.

Data transformation involves converting the data into a suitable format for analysis. This step can include scaling or normalizing the data to ensure that all features are on the same scale, encoding categorical variables to enable them to be used in machine learning models, and extracting useful features from the data that can help in making predictions.

Data preparation involves splitting the data into training and testing sets to evaluate the performance of the machine learning models. This step can also involve feature selection, where only the most important features are selected to improve the accuracy of the models.





### Mounting Google Drive and downloading the data files
The first line imports the os library and the second line imports the drive module from the google.colab library. The drive.mount() function is called with the mount point '/content/drive' to make the Google Drive folder available in Colab; force_remount=True remounts the drive if it is already mounted.

The next block of code checks whether the training and testing data files already exist in the current working directory. If a file exists, it prints the message 'File exists'. Otherwise, it downloads the file with the !wget command and the file's URL.

The leading ! tells Colab to run the rest of the line as a shell command, and wget downloads a file from the internet. In this case, it downloads the files from Google Drive using the file ID provided in the URL. The -O option specifies the output file name.

In [19]:

import os
from google.colab import drive
drive.mount('/content/drive', force_remount=True)


# Download the training data
if os.path.exists('Google_Stock_Price_Train.csv'):
  print('File exists')
else:
  !wget -O Google_Stock_Price_Train.csv "https://drive.google.com/uc?export=download&id=1WQO-v_ofXWWQ7gl2B11FUzjl9okW32ul"

# Download the test data
if os.path.exists('Google_Stock_Price_Test.csv'):
  print('File exists')
else:
  !wget -O Google_Stock_Price_Test.csv "https://drive.google.com/uc?export=download&id=10rAwHmB8pSxbuJbxGERQhgZsNsHkH8Mf"


Mounted at /content/drive
File exists
File exists
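
The check-then-download pattern above can also be written in plain Python, which may be useful outside Colab where `!wget` is not available. The following is only a sketch under stated assumptions, not the method used in this notebook: `download_if_missing` is a hypothetical helper, the URL is a placeholder, and the `fetch` parameter exists so that the logic can be tried without network access.

```python
import os
import tempfile
from urllib.request import urlretrieve


def download_if_missing(path, url, fetch=urlretrieve):
    """Download `url` to `path` unless the file already exists.

    Returns True if a download was performed, False if the file was present.
    `fetch` is any callable taking (url, path); it defaults to urlretrieve.
    """
    if os.path.exists(path):
        print('File exists')
        return False
    fetch(url, path)
    return True


# Offline demonstration with a stand-in fetcher that writes a tiny CSV header.
def fake_fetch(url, path):
    with open(path, 'w') as handle:
        handle.write('Date,Open\n')


demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, 'demo.csv')
first_call = download_if_missing(demo_path, 'https://example.com/demo.csv', fake_fetch)
second_call = download_if_missing(demo_path, 'https://example.com/demo.csv', fake_fetch)
```

The first call creates the file and the second call finds it and skips the download, which mirrors the two 'File exists' lines in the output above.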


### Importing the libraries
NumPy is a library for numerical computing in Python, and it provides tools for working with arrays and matrices.

Matplotlib is a library for creating static, animated, and interactive visualizations in Python. It provides a variety of visualization tools to create line charts, scatter plots, bar charts, histograms, and more.

Pandas is a library for data manipulation and analysis in Python. It provides data structures and tools for working with structured data, such as data frames and series.

By importing these libraries, the code can use the functions and tools provided by these libraries to perform tasks such as data visualization, data manipulation, and numerical computing.


In [20]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

### Importing the training set
The first line uses Pandas (imported above as pd) to read the training data from the CSV file 'Google_Stock_Price_Train.csv' into a dataframe named 'dataset_train'.

The second line extracts the column the model is trained on. The 'iloc' indexer selects all rows and the second column (index 1) of the dataframe, which contains the opening stock price ('Open'). The slice '1:2' is used instead of the plain index '1' so that the selection stays two-dimensional. The '.values' attribute then converts this selection into a NumPy array named 'training_set' with one row per trading day and a single column.

In [21]:
dataset_train = pd.read_csv('/content/Google_Stock_Price_Train.csv')
training_set = dataset_train.iloc[:, 1:2].values
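
The difference between the slice `1:2` and the index `1` is easiest to see on a small array. The sketch below uses a hypothetical NumPy array as a stand-in for the price table; the same rule applies to the `iloc` selection above, where a column slice gives a 2D result and a single integer gives a 1D result. The 2D shape matters because scikit-learn scalers expect input of shape (n_samples, n_features).

```python
import numpy as np

# Hypothetical stand-in for the price table: 4 rows, 3 numeric columns.
table = np.arange(12, dtype=float).reshape(4, 3)

column_2d = table[:, 1:2]   # a slice keeps the column axis
column_1d = table[:, 1]     # an integer index drops the column axis

print(column_2d.shape, column_1d.shape)
```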

### Feature Scaling
The 'MinMaxScaler' class scales each feature by subtracting its minimum value and dividing by its range (maximum minus minimum). By default, it scales the data to the range (0, 1), which can be adjusted with the 'feature_range' parameter.

In this code snippet, a 'MinMaxScaler' instance is created with a feature range of (0, 1) and stored in the variable 'sc'.

The 'fit_transform' method of the 'sc' object is then applied to the 'training_set' array. It learns the minimum and maximum from the training data and returns the scaled data, which is stored in a new variable named 'training_set_scaled'.

Scaling matters because many machine learning algorithms, neural networks included, train more reliably when their inputs lie in a small, similar range. When there are several features, it also helps prevent features with large values from dominating the others.


In [22]:
from sklearn.preprocessing import MinMaxScaler
sc = MinMaxScaler(feature_range = (0, 1))
training_set_scaled = sc.fit_transform(training_set)
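
To make the scaling rule concrete, the sketch below applies the same min-max formula by hand with NumPy, and then inverts it. The prices are made-up values for illustration, not rows from the dataset, and the code is a stand-in for what the scaler computes with its default feature range rather than a copy of its implementation.

```python
import numpy as np

# Hypothetical opening prices, shaped (n_samples, 1) like training_set.
prices = np.array([[300.0], [350.0], [400.0], [500.0]])

price_min = prices.min(axis=0)
price_max = prices.max(axis=0)

# Forward transform: (x - min) / (max - min) maps the column onto [0, 1].
scaled = (prices - price_min) / (price_max - price_min)

# Inverse transform: recover the original prices from the scaled values.
restored = scaled * (price_max - price_min) + price_min

print(scaled.ravel())
```

The inverse step is the same operation that is applied to the model's predictions later in the notebook to turn them back into prices.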

### Creating a data structure with 60 timesteps and 1 output
This code snippet is used to create a supervised learning dataset from the scaled training data, which can be used to train a machine learning model for the Google stock price prediction project.

The first two lines initialize empty lists named 'x_train' and 'y_train', which will be used to store the input and output data for the supervised learning dataset.

The 'for' loop then iterates over the indices from 60 to 1257 with a step size of 1. The call range(60, 1258) excludes its upper bound, and 1258 is the length of the training set, so the loop produces 1198 training examples. The loop starts at 60 because the model uses the previous 60 stock prices to predict the next stock price.

For each iteration of the loop, the 'x_train' list is appended with a subsequence of 60 consecutive stock prices, starting at index 'i-60' and ending at index 'i-1' (the slice 'i-60:i' excludes 'i'). The 'y_train' list is appended with the stock price value at index 'i'.

Finally, the 'x_train' and 'y_train' lists are converted to NumPy arrays using the 'np.array' function and stored in their respective variables.


In [23]:
x_train = []
y_train = []
for i in range(60, 1258):
  x_train.append(training_set_scaled[i-60:i, 0])  # the 60 scaled prices before day i
  y_train.append(training_set_scaled[i, 0])       # the scaled price on day i
x_train, y_train = np.array(x_train), np.array(y_train)
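
The windowing logic is easier to follow on a tiny series. The sketch below repeats the same loop with a window of 3 instead of 60; `make_windows` and the toy data are hypothetical names introduced only for this illustration.

```python
import numpy as np


def make_windows(series, window):
    """Split a 1D series into (inputs, targets) for next-step prediction."""
    inputs, targets = [], []
    for i in range(window, len(series)):
        inputs.append(series[i - window:i])  # the `window` values before index i
        targets.append(series[i])            # the value at index i
    return np.array(inputs), np.array(targets)


toy_series = np.arange(6)   # values 0, 1, 2, 3, 4, 5
x_toy, y_toy = make_windows(toy_series, window=3)

print(x_toy)
print(y_toy)
```

A series of length n yields n minus window examples, which is why 1258 rows and a window of 60 give 1198 training examples.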


### Reshaping

The 'x_train' input data was created in the previous code snippet as a 2D array with dimensions (samples, time_steps), where 'samples' is the number of training examples (1198 here, one per loop iteration) and 'time_steps' is the length of the input sequence (in this case, 60).

However, a Keras LSTM layer expects a 3D input array with dimensions (samples, time_steps, input_dim), where 'input_dim' is the number of input features at each time step. In this case, there is only one input feature, which is the stock price value. The 'samples' axis is split into batches of 'batch_size' examples later, during training.

To reshape the 'x_train' array to the correct shape, this code snippet uses the 'np.reshape' function, which is called here with two arguments: the input array and the new shape. The new shape is specified as (x_train.shape[0], x_train.shape[1], 1), which means that the first two dimensions of the input shape remain the same, and a new third dimension of size 1 is added to represent the input feature dimension.

After this code snippet is executed, the 'x_train' array has the 3D shape (samples, time_steps, 1), that is (1198, 60, 1), which can be used as input to an RNN for training the Google stock price prediction model.

In [24]:
x_train = np.reshape(x_train,(x_train.shape[0], x_train.shape[1], 1))
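
The effect of the reshape can be checked on a placeholder array of the same size. This is a small sketch with an array of zeros standing in for the real data; it also shows that indexing with `np.newaxis` gives the same result, which is one common alternative.

```python
import numpy as np

# Placeholder with the same 2D shape as x_train before reshaping.
x_demo = np.zeros((1198, 60))

# Same call as in the notebook: add a trailing feature axis of size 1.
x_reshaped = np.reshape(x_demo, (x_demo.shape[0], x_demo.shape[1], 1))

# Equivalent alternative using np.newaxis.
x_newaxis = x_demo[..., np.newaxis]

print(x_demo.shape, x_reshaped.shape, x_newaxis.shape)
```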

## Part 2 - Building and Training the RNN

### Importing the Keras libraries and packages

'Sequential' is a class from Keras that allows us to build a model layer by layer.

'Dense' is a type of layer in Keras that represents a fully connected neural network layer.

'LSTM' is a type of layer in Keras that represents a Long Short-Term Memory (LSTM) unit, which is a type of recurrent neural network (RNN) that is well-suited for sequential data.

'Dropout' is a type of layer in Keras that applies dropout regularization to the input, which helps prevent overfitting during training.



In [25]:
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Dropout

### Initialising the RNN

A sequential model is a linear stack of layers where each layer is added sequentially. This type of model is appropriate when every layer has exactly one input and one output, as is the case with the Google stock price prediction task: a sequence of 60 prices goes in and a single predicted price comes out.

The sequential model is stored in the regressor variable and will be used to define the architecture of the LSTM-based RNN model.

In [26]:
regressor = Sequential()

### Adding the first LSTM layer and some Dropout regularisation

The add() method is used to add a new layer to the regressor model. In this case, the layer being added is an LSTM layer with 50 memory units (also known as hidden units or cells).

The return_sequences argument is set to True in order to return the full sequence of output values, rather than just the last output value, from this layer. This is necessary because the output of this layer will be fed into subsequent LSTM layers.

The input_shape argument is set to (x_train.shape[1], 1), which specifies the shape of the input data for this layer. The first dimension of the input shape (x_train.shape[1]) corresponds to the number of time steps in the input sequence, which is 60 in this case. The second dimension of the input shape (1) corresponds to the number of input features at each time step, which is 1 (the stock price value).

After the LSTM layer is added, a Dropout layer is added to the model with a dropout rate of 0.2. This helps prevent overfitting during training by randomly dropping out (i.e., setting to zero) some of the output values from the LSTM layer during each training iteration.


In [27]:
regressor.add(LSTM(units = 50, return_sequences = True, input_shape = (x_train.shape[1], 1)))
regressor.add(Dropout(0.2))
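
To show what a dropout rate of 0.2 does to a layer's outputs during training, the sketch below implements inverted dropout in NumPy. It is an illustration of the technique, not the Keras implementation: `inverted_dropout` is a hypothetical helper and the activations are all ones so that the effect is easy to read. As in Keras, the values that are kept are scaled up by 1 / (1 - rate) so that the expected sum is unchanged.

```python
import numpy as np


def inverted_dropout(activations, rate, rng):
    """Zero a random fraction `rate` of entries and rescale the rest."""
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)


rng = np.random.default_rng(0)
activations = np.ones((1000, 50))   # stand-in for LSTM outputs
dropped = inverted_dropout(activations, rate=0.2, rng=rng)

zero_fraction = np.mean(dropped == 0.0)
print(zero_fraction, dropped.max(), dropped.mean())
```

Roughly one fifth of the values become zero and the rest become 1.25, so the mean stays close to 1. At prediction time dropout is switched off and the layer passes its input through unchanged.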

### Adding a second LSTM layer and some Dropout regularisation

Similar to the first LSTM layer, the add() method is used to add a new layer to the regressor model. In this case, the layer being added is another LSTM layer with 50 memory units.

The return_sequences argument is set to True in order to return the full sequence of output values from this layer, which will be fed into the next LSTM layer.

After the second LSTM layer is added, another Dropout layer is added to the model with a dropout rate of 0.2. This helps prevent overfitting during training by randomly dropping out some of the output values from the LSTM layer during each training iteration.

In [28]:
regressor.add(LSTM(units = 50, return_sequences = True))
regressor.add(Dropout(0.2))

### Adding a third LSTM layer and some Dropout regularisation
Similar to the first two LSTM layers, the add() method is used to add a new layer to the regressor model. In this case, the layer being added is another LSTM layer with 50 memory units.

The return_sequences argument is set to True in order to return the full sequence of output values from this layer, which will be fed into the next LSTM layer.

After the third LSTM layer is added, another Dropout layer is added to the model with a dropout rate of 0.2. This helps prevent overfitting during training by randomly dropping out some of the output values from the LSTM layer during each training iteration.

In [29]:
regressor.add(LSTM(units = 50, return_sequences = True))
regressor.add(Dropout(0.2))

### Adding a fourth LSTM layer and some Dropout regularisation
Similar to the previous LSTM layers, the add() method is used to add a new layer to the regressor model. In this case, the layer being added is another LSTM layer with 50 memory units.

Since this is the final LSTM layer, the return_sequences argument is not specified and defaults to False. This means that the layer returns only its output at the last time step (a vector of 50 values), rather than the full sequence of outputs.

After the final LSTM layer is added, another Dropout layer is added to the model with a dropout rate of 0.2. This helps prevent overfitting during training by randomly dropping out some of the output values from the LSTM layer during each training iteration.

Note that after the LSTM layers and dropout layers are added to the model, a fully connected Dense layer will be added to the model in the next step.

In [30]:
regressor.add(LSTM(units = 50))
regressor.add(Dropout(0.2))

### Adding the output layer
The add() method is used to add a new layer to the regressor model. In this case, the layer being added is a Dense layer with a single output unit.

The units argument specifies the number of output units in the Dense layer. Since we want to predict a single output value (the next day's stock price), we set units to 1.

Note that since this is a regression model, rather than a classification model, no activation function is specified for the output layer. A Dense layer without an activation argument is linear, so the output can take any real value.

In [31]:
regressor.add(Dense(units = 1))
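
With all layers in place, the size of the network can be worked out by hand. The sketch below counts trainable parameters using the standard formula for an LSTM layer with biases (four gates, each with input weights, recurrent weights, and a bias) and for a Dense layer. The helper names are hypothetical, and the calculation assumes the default layer settings used in this notebook. Dropout layers have no parameters.

```python
def lstm_params(input_dim, units):
    """Trainable parameters of one LSTM layer (four gates, with biases)."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """Weights plus biases of a fully connected layer."""
    return input_dim * units + units


layer_counts = [
    lstm_params(input_dim=1, units=50),    # first LSTM: one feature per time step
    lstm_params(input_dim=50, units=50),   # second LSTM: fed by 50 units
    lstm_params(input_dim=50, units=50),   # third LSTM
    lstm_params(input_dim=50, units=50),   # fourth LSTM
    dense_params(input_dim=50, units=1),   # output layer
]
total_params = sum(layer_counts)

print(layer_counts, total_params)
```

The total of 71,051 parameters can be compared with the figure reported by regressor.summary() as a quick sanity check on the architecture.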

### Compiling the RNN

The compile() method is called on the regressor model, and we specify two arguments:

optimizer: This argument specifies the optimizer algorithm to use during training. In this case, we specify 'adam' as the optimizer, which is a popular choice for deep learning models.

loss: This argument specifies the loss function to use during training. In this case, we specify 'mean_squared_error' as the loss function, which is a commonly used loss function for regression problems.

By calling compile() on the regressor model with these arguments, we are setting up the model to be trained using the Adam optimizer algorithm and to minimize the mean squared error loss function during training.

In [32]:
regressor.compile(optimizer = 'adam', loss = 'mean_squared_error')

### Fitting the RNN to the Training set
The fit() method is called on the regressor model, and we specify four arguments:

x_train: This argument specifies the input training data, which consists of the scaled stock price values for the previous 60 days.

y_train: This argument specifies the output training data, which consists of the scaled stock price value for the current day.

epochs: This argument specifies the number of epochs, that is, complete passes over the training data. In this case, we specify 100 epochs.

batch_size: This argument specifies the number of training examples to use in each batch during training. In this case, we specify a batch size of 32.

By calling fit() on the regressor model with these arguments, we are training the model using the input and output training data that we have prepared. During training, the model will use the Adam optimizer algorithm to minimize the mean squared error loss function, and it will run for 100 epochs with a batch size of 32. After training is complete, the trained model will be able to predict the stock price for the next day based on the previous 60 days of stock price data.

In [None]:
regressor.fit(x_train, y_train, epochs = 100, batch_size = 32)

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
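
The relationship between epochs, batch size, and the number of weight updates can be checked with a little arithmetic. The sketch below assumes the 1198 training examples produced by the windowing loop and the default behaviour in which the last, smaller batch of each epoch is still used.

```python
import math

n_samples = 1258 - 60   # windows produced by range(60, 1258)
batch_size = 32
epochs = 100

steps_per_epoch = math.ceil(n_samples / batch_size)           # batches per epoch
last_batch_size = n_samples - (steps_per_epoch - 1) * batch_size
total_updates = steps_per_epoch * epochs                      # weight updates overall

print(steps_per_epoch, last_batch_size, total_updates)
```

Each epoch therefore consists of 38 batches, the last of which holds 14 examples, and the full run performs 3800 weight updates.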


## Part 3 - Making the predictions and visualising the results

### Getting the real stock price of 2017

The read_csv() function from the Pandas library is used to read the CSV file and store it as a DataFrame in the dataset_test variable.

Then, we use the iloc indexer to extract the second column (i.e., the opening prices) from the dataset_test DataFrame and store its values in the real_stock_price variable as a NumPy array.

By doing this, we are obtaining the actual stock price values for the test period, which we can use later to evaluate the performance of our trained model.

In [None]:
dataset_test = pd.read_csv('/content/Google_Stock_Price_Test.csv')
real_stock_price = dataset_test.iloc[:, 1:2].values

### Getting the predicted stock price of 2017

First, we concatenate the 'Open' columns of the training and test datasets along the vertical axis using the concat() function from the Pandas library. The resulting dataset_total contains the opening prices for the whole period covered by the two files.

Next, we extract the values of the last 80 days from dataset_total (i.e., the 60 days preceding the test period and the 20-day test period) and scale them using the same MinMaxScaler instance (sc) that was used to scale the training data.

Then, we create a list x_test of input sequences of 60 timesteps for the test period, similar to how we created x_train for the training period.

We then reshape x_test into a 3D array with shape (num_samples, 60, 1).

Next, we use the trained LSTM model regressor to predict the stock prices for the test period. We store the predicted values in the predicted_stock_price variable.

Finally, we use the inverse_transform() method of the sc instance to rescale the predicted stock prices back to their original values. This gives us the final predicted stock prices for the test period.

In [None]:
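# Stack the 'Open' columns of the training and test sets into one series
dataset_total = pd.concat((dataset_train['Open'], dataset_test['Open']), axis = 0)
# Keep the test period plus the 60 days before it
inputs = dataset_total[len(dataset_total) - len(dataset_test) - 60:].values
inputs = inputs.reshape(-1, 1)
# Reuse the scaler fitted on the training data (transform, not fit_transform)
inputs = sc.transform(inputs)
x_test = []
for i in range(60, 80):
  x_test.append(inputs[i-60:i, 0])
x_test = np.array(x_test)
x_test = np.reshape(x_test,(x_test.shape[0], x_test.shape[1], 1))
predicted_stock_price = regressor.predict(x_test)
# Convert the scaled predictions back to prices
predicted_stock_price = sc.inverse_transform(predicted_stock_price)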
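
The index arithmetic used to build the test inputs can be verified with day numbers in place of prices. The sketch below assumes 1258 training rows and 20 test rows, as described above, and uses a plain NumPy array as a stand-in for dataset_total.

```python
import numpy as np

n_train, n_test, window = 1258, 20, 60

# Stand-in for dataset_total: the index of each day instead of its price.
total = np.arange(n_train + n_test)

# Same slice as in the notebook: the test period plus the 60 days before it.
inputs = total[len(total) - n_test - window:]

# One window of 60 days for each day of the test period.
windows = np.array([inputs[i - window:i] for i in range(window, len(inputs))])

print(inputs.shape, windows.shape)
print(windows[0][0], windows[0][-1], windows[-1][-1])
```

The first window ends on day 1257, the last training day, and the last window ends on day 1276, one day before the final test day. Every prediction is therefore made from 60 real prices that precede the day being predicted.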

### Visualising the results

First, we plot the real stock prices for the test period in red color using the plot() function. The real_stock_price variable contains the actual stock prices for the test period.

Next, we plot the predicted stock prices for the same period in blue color using the plot() function. The predicted_stock_price variable contains the predicted stock prices for the test period.

Then, we add a title to the plot using the title() function, with the text "Google Stock Price Prediction".

We label the x-axis as "Time" using the xlabel() function.

We label the y-axis as "Google Stock Price" using the ylabel() function.

Finally, we display a legend for the plot using the legend() function, which shows which line represents the real stock prices and which line represents the predicted stock prices.

The show() function is then called to display the plot.

In [None]:
plt.plot(real_stock_price, color = 'red', label = 'Real Google Stock Price')
plt.plot(predicted_stock_price, color = 'blue', label = 'Predicted Google Stock Price')
plt.title('Google Stock Price Prediction')
plt.xlabel('Time')
plt.ylabel('Google Stock Price')
plt.legend()
plt.show()

# Future suggestions for this Google stock price prediction project

Evaluate the model performance: Use various evaluation metrics such as Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE) to evaluate the performance of the model.

Hyperparameter tuning: Experiment with different hyperparameters such as the number of LSTM layers, number of neurons in each layer, number of epochs, and batch size to improve the model performance.

Use additional data: Try using additional data such as news articles, economic indicators, and financial statements to improve the accuracy of the model.

Test on other stocks: Try using the same model to predict the stock prices of other companies and compare the results to see if the model is generalizable.

Use a different model architecture: Try using a different model architecture such as Convolutional Neural Networks (CNN) or a combination of LSTM and CNN to see if it improves the performance of the model.

Deploy the model: Deploy the model on a web platform to allow users to access the predictions and interact with the model.
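
For the first suggestion, evaluating the model performance, the three metrics named there can be computed in a few lines of NumPy. The following is a minimal sketch: `regression_metrics` is a hypothetical helper, and the prices are made-up values, not results from this notebook.

```python
import numpy as np


def regression_metrics(actual, predicted):
    """Return MAE, MSE, and RMSE for two arrays of the same length."""
    actual = np.asarray(actual, dtype=float).ravel()
    predicted = np.asarray(predicted, dtype=float).ravel()
    errors = predicted - actual
    mse = np.mean(errors ** 2)
    return {
        'mae': np.mean(np.abs(errors)),
        'mse': mse,
        'rmse': np.sqrt(mse),
    }


# Hypothetical prices, shaped (n, 1) like real_stock_price.
actual_demo = [[780.0], [790.0], [800.0], [810.0]]
predicted_demo = [[778.0], [794.0], [800.0], [804.0]]
metrics = regression_metrics(actual_demo, predicted_demo)

print(metrics)
```

In the notebook, the same function could be called as regression_metrics(real_stock_price, predicted_stock_price) once both arrays exist. RMSE is expressed in the same units as the stock price, which makes it the easiest of the three to interpret.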