<a href="https://colab.research.google.com/github/MatteoBettini/Stock-Market-Prediction-2020/blob/main/notebooks/Data%20exploration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Take-home Assessment

# Dataset exploration

In this section we will load and explore the dataset "**Processed_NASDAQ**", which contains several daily features of the NASDAQ Composite from 2010 to 2017. The dataset was acquired from [this repository](https://archive.ics.uci.edu/ml/datasets/CNNpred%3A+CNN-based+stock+market+prediction+using+a+diverse+set+of+variables#).

It covers features from several categories: technical indicators, futures contracts, commodity prices, major world market indices, prices of major U.S. companies, and treasury bill rates. The sources and a thorough description of the features are given in the paper "[CNNpred: CNN-based stock market prediction using a diverse set of variables](https://arxiv.org/pdf/1810.08923.pdf)".

## Imports

In [1]:
# To plot figures
%matplotlib inline
import matplotlib.pyplot as plt

import pandas as pd
import numpy as np

# To make this notebook's output stable across runs
np.random.seed(42)

## Loading the dataset

In [2]:
nasdaq_url = 'https://raw.githubusercontent.com/MatteoBettini/Stock-Market-Prediction-2020/main/stock_markets_datasets/Processed_NASDAQ.csv?token=ANHXQQK4VPBE6ABSBCHTF5K74UP3W'
dji_url = 'https://raw.githubusercontent.com/MatteoBettini/Stock-Market-Prediction-2020/main/stock_markets_datasets/Processed_DJI.csv?token=ANHXQQOXYQZSSSTVFKX6RZS74XDXA'
nyse_url = 'https://raw.githubusercontent.com/MatteoBettini/Stock-Market-Prediction-2020/main/stock_markets_datasets/Processed_NYSE.csv?token=ANHXQQNAISMPCLVLRTGNJBC74XD2C'
russel_url = 'https://raw.githubusercontent.com/MatteoBettini/Stock-Market-Prediction-2020/main/stock_markets_datasets/Processed_RUSSELL.csv?token=ANHXQQPGLBLSM3B36OLWIPC74XD3U'
s_p_url = 'https://raw.githubusercontent.com/MatteoBettini/Stock-Market-Prediction-2020/main/stock_markets_datasets/Processed_S%26P.csv?token=ANHXQQNRFS3NKP2XCF5Q5MS74XD5K'

In [14]:
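# NOTE: despite the variable name, this cell reads the NYSE file (nyse_url),
# which is why the 'Name' column in the output below is 'NYA'.
# Pass nasdaq_url instead to load the NASDAQ Composite data.
nasdaq_df = pd.read_csv(nyse_url)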

## Exploring the dataset

Now that we have loaded the dataset we can start inspecting the data.

In [15]:
nasdaq_df.head(10)

Unnamed: 0,Date,Close,Volume,mom,mom1,mom2,mom3,ROC_5,ROC_10,ROC_15,ROC_20,EMA_10,EMA_20,EMA_50,EMA_200,DTB4WK,DTB3,DTB6,DGS5,DGS10,Oil,Gold,DAAA,DBAA,GBP,JPY,CAD,CNY,AAPL,AMZN,GE,JNJ,JPM,MSFT,WFC,XOM,FCHI,FTSE,GDAXI,DJI,...,RUT,TE1,TE2,TE3,TE5,TE6,DE1,DE2,DE4,DE5,DE6,CTB3M,CTB6M,CTB1Y,Name,AUD,Brent,CAC-F,copper-F,WIT-oil,DAX-F,DJI-F,EUR,FTSE-F,gold-F,HSI-F,KOSPI-F,NASDAQ-F,GAS-F,Nikkei-F,NZD,silver-F,RUSSELL-F,S&P-F,CHF,Dollar index-F,Dollar index,wheat-F,XAG,XAU
0,2009-12-31,7184.959961,,,,,,,,,,,,,,0.04,0.06,0.2,2.69,3.85,,,5.33,6.39,,,,,,,,,,,,,,,,,...,,3.81,3.79,3.65,0.02,0.16,1.06,2.54,6.19,6.33,6.35,,,,NYA,0.35,-0.13,0.15,0.09,0.1,0.48,-1.19,-0.12,0.27,0.34,1.68,-0.07,-0.96,-2.4,0.67,0.03,0.26,-1.08,-1.0,-0.11,-0.08,-0.06,-0.48,0.3,0.39
1,2010-01-04,7326.740234,0.921723,0.019733,,,,,,,,,,,,0.05,0.08,0.18,2.65,3.85,0.02683,0.0,5.35,6.39,-0.004222,-0.004467,-0.010644,-0.001991,0.015565,-0.004609,0.02115,0.004192,0.028318,0.01542,0.012227,0.014078,0.019724,,,0.014951,...,0.023521,3.8,3.77,3.67,0.03,0.13,1.04,2.54,6.21,6.31,6.34,-0.1,-0.04386,-0.01487,NYA,1.73,2.81,1.99,1.36,2.71,0.96,1.28,0.61,1.74,2.05,-0.52,0.54,1.51,5.6,0.31,1.52,3.26,1.61,1.62,-0.57,-0.59,-0.42,3.12,3.91,2.1
2,2010-01-05,7354.870117,-0.375903,0.003839,0.019733,,,,,,,,,,,0.03,0.07,0.17,2.56,3.77,0.002699,0.00156,5.24,6.3,-0.007628,-0.009838,-0.001441,1.5e-05,0.001729,0.0059,0.005178,-0.011596,0.01937,0.000323,0.027452,0.003904,-0.000264,0.004036,-0.002718,-0.001128,...,-0.002515,3.74,3.7,3.6,0.04,0.14,1.06,2.53,6.13,6.23,6.27,-0.055556,-0.073394,-0.033962,NYA,-0.08,0.59,-0.11,0.24,0.32,-0.14,-0.04,-0.31,0.38,0.04,2.03,-0.18,-0.08,-4.2,0.47,-0.07,1.96,-0.2,0.31,0.43,0.03,0.12,-0.9,1.42,-0.12
3,2010-01-06,7377.700195,0.996234,0.003104,0.003839,0.019733,,,,,,,,,,0.03,0.06,0.15,2.6,3.85,0.016883,0.006009,5.3,6.34,0.002067,0.008418,-0.007311,0.000191,-0.015906,-0.018116,-0.005151,0.008134,0.005494,-0.006137,0.001425,0.008643,0.001186,0.001358,0.00041,0.000157,...,-0.000846,3.82,3.79,3.7,0.03,0.12,1.04,2.49,6.19,6.28,6.31,-0.117647,0.0,0.015625,NYA,0.91,1.61,0.15,2.41,1.72,-0.01,0.01,0.31,0.16,1.59,0.79,0.78,-0.36,6.6,0.19,0.56,2.15,-0.02,0.07,-0.56,-0.24,-0.17,2.62,2.25,1.77
4,2010-01-07,7393.930176,0.059932,0.0022,0.003104,0.003839,0.019733,,,,,,,,,0.02,0.05,0.16,2.62,3.85,-0.006256,0.000221,5.31,6.33,-0.005609,0.011196,0.002035,-7.3e-05,-0.001849,-0.017013,0.05178,-0.007137,0.019809,-0.0104,0.036286,-0.003142,0.001775,-0.000597,-0.002481,0.003138,...,0.006301,3.83,3.8,3.69,0.03,0.14,1.02,2.48,6.17,6.28,6.31,0.066667,0.019802,0.007692,NYA,-0.41,-0.46,0.15,-1.9,-0.63,-0.12,0.28,-0.66,0.06,-0.25,-0.6,-1.27,-0.05,-3.38,-0.09,-0.72,0.94,0.5,0.4,0.58,0.58,0.54,-1.85,0.22,-0.58
5,2010-01-08,7425.350098,-0.167168,0.004249,0.0022,0.003104,0.003839,3.345741,,,,,,,,0.02,0.05,0.15,2.57,3.83,0.001695,-0.003097,5.32,6.32,0.005656,-0.007817,-0.004062,-4.4e-05,0.006648,0.027077,0.021538,0.003438,-0.002456,0.006897,-0.009269,-0.004012,0.005054,0.001357,0.003032,0.001068,...,0.004034,3.81,3.78,3.68,0.03,0.13,1.0,2.49,6.17,6.27,6.3,-0.0625,-0.067961,-0.019084,NYA,0.88,-0.17,0.53,-0.7,0.11,0.27,0.2,0.66,0.02,0.45,0.06,0.43,0.67,-0.98,1.03,0.61,0.68,0.64,0.35,-0.98,-0.58,-0.56,2.07,1.26,0.38
6,2010-01-11,7449.049805,-0.030483,0.003192,0.004249,0.0022,0.003104,1.669359,,,,,,,,0.01,0.04,0.13,2.58,3.85,-0.002417,0.023297,5.35,6.32,0.005543,-0.00613,0.003884,-0.000147,-0.008822,-0.024041,0.009639,0.000156,-0.003357,-0.01272,-0.002079,0.01122,-0.000507,0.000705,0.000479,0.004313,...,-0.000884,3.84,3.81,3.72,0.03,0.12,0.97,2.47,6.19,6.28,6.31,-0.133333,-0.010417,0.003891,NYA,0.76,-0.49,-0.07,1.21,-0.28,-0.12,0.36,0.7,-0.03,1.1,0.84,-0.13,-0.34,-5.13,-0.49,0.64,-0.13,-1.01,0.09,-0.66,-0.64,-0.61,1.08,0.65,1.44
7,2010-01-12,7370.450195,0.108178,-0.010552,0.003192,0.004249,0.0022,0.211833,,,,,,,,0.02,0.05,0.14,2.49,3.74,-0.021202,-0.001518,5.25,6.22,0.002407,-0.010989,0.005223,1.5e-05,-0.011375,-0.022715,0.000597,0.005294,-0.023355,-0.006607,-0.025,-0.004979,-0.010645,-0.007114,-0.016141,-0.003444,...,-0.013183,3.72,3.69,3.6,0.03,0.12,0.97,2.48,6.08,6.17,6.2,0.076923,-0.031579,-0.034884,NYA,-1.2,-2.06,-1.04,-2.68,-2.1,-1.42,-0.15,-0.13,-0.59,-1.89,-1.1,0.27,-0.96,2.51,1.23,-0.47,-2.36,-0.67,-0.74,0.22,-0.05,-0.06,-6.33,-1.78,-2.19
8,2010-01-13,7430.140137,-0.11573,0.008099,-0.010552,0.003192,0.004249,0.71079,,,,,,,,0.02,0.06,0.15,2.55,3.8,-0.013987,-0.020847,5.32,6.3,0.008322,0.003876,-0.008083,2.9e-05,0.014106,0.01382,0.003578,0.006351,0.017475,0.009312,0.017806,-0.004003,0.000203,-0.004583,0.003389,0.005035,...,0.012683,3.78,3.74,3.65,0.04,0.13,0.98,2.5,6.15,6.24,6.28,0.071429,0.054348,0.024096,NYA,0.43,-1.25,0.02,1.53,-1.41,0.3,0.38,0.12,-0.43,0.66,-2.06,-1.68,0.91,2.54,-1.56,0.26,1.62,0.82,0.66,-0.15,-0.17,-0.13,-0.51,1.97,0.98
9,2010-01-14,7448.52002,-0.061184,0.002474,0.008099,-0.010552,0.003192,0.738306,,,,7376.171094,,,,0.02,0.05,0.14,2.51,3.76,-0.003892,0.009758,5.17,6.22,0.002941,-0.001816,-0.007179,2.9e-05,-0.005792,-0.013632,-0.007724,0.002001,0.009943,0.020099,0.014346,0.000144,0.003727,0.004513,0.004316,0.002788,...,0.00446,3.74,3.71,3.62,0.03,0.12,1.05,2.46,6.08,6.17,6.2,-0.066667,-0.030928,-0.015686,NYA,0.82,-0.63,0.36,-0.35,-0.33,0.33,0.33,-0.06,0.36,0.55,-0.6,1.14,0.31,-2.53,1.59,0.27,0.57,0.76,0.33,0.12,-0.13,-0.16,-1.49,0.32,0.39


Taking a peek at the first ten rows, we can already see that there are many missing values. They will be handled in the section on data transformations.
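As a quick way to quantify the missing values noted above, the following minimal sketch counts NaNs per column and per row. The frame `toy` is a made-up stand-in for the loaded dataset (its column names mirror the real ones, but its values are illustrative only); the same calls should apply unchanged to `nasdaq_df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the loaded dataset; values are illustrative, not real data.
toy = pd.DataFrame({
    "Close": [7184.96, 7326.74, 7354.87, 7377.70],
    "mom": [np.nan, 0.0197, 0.0038, 0.0031],
    "mom1": [np.nan, np.nan, 0.0197, 0.0038],
})

# Number of NaN entries in each column
missing_per_column = toy.isna().sum()
# Number of rows that contain at least one NaN
rows_with_missing = int(toy.isna().any(axis=1).sum())

print(missing_per_column[missing_per_column > 0])
print("rows with at least one NaN:", rows_with_missing)
```

In the real dataset, lagged indicators such as `mom1` or `EMA_200` naturally start with a run of NaNs, because they need several previous days before they can be computed.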

We can also see that the dataset includes only trading days, since the stock market is closed on weekends and public holidays.

We can confirm this by looking at the following rows: 16 and 17 January 2010 are absent because they fell on a weekend, and 18 January 2010 is absent because it was the U.S. federal holiday Martin Luther King Jr. Day.

This is not a problem for our machine learning pipeline, as we will translate the 'Date' feature into a categorical feature representing the day of the week.
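The day-of-week conversion and the calendar gaps described above can be sketched as follows. This is only an illustration under assumptions: `toy` reuses four dates from the dataset, and the column name `day_of_week` is a hypothetical choice, not necessarily the one used later in the pipeline.

```python
import pandas as pd

# Four consecutive rows of the 'Date' column, spanning the long weekend.
toy = pd.DataFrame({"Date": ["2010-01-14", "2010-01-15", "2010-01-19", "2010-01-20"]})

dates = pd.to_datetime(toy["Date"])
# Categorical day-of-week feature (hypothetical column name)
toy["day_of_week"] = dates.dt.day_name()
# Calendar days elapsed since the previous row; values above 1 reveal skipped days
gap_days = dates.diff().dt.days

print(toy.assign(gap_days=gap_days))
```

The gap of four days between 15 and 19 January corresponds to the weekend plus Martin Luther King Jr. Day.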

In [16]:
nasdaq_df.iloc[9:13]

Unnamed: 0,Date,Close,Volume,mom,mom1,mom2,mom3,ROC_5,ROC_10,ROC_15,ROC_20,EMA_10,EMA_20,EMA_50,EMA_200,DTB4WK,DTB3,DTB6,DGS5,DGS10,Oil,Gold,DAAA,DBAA,GBP,JPY,CAD,CNY,AAPL,AMZN,GE,JNJ,JPM,MSFT,WFC,XOM,FCHI,FTSE,GDAXI,DJI,...,RUT,TE1,TE2,TE3,TE5,TE6,DE1,DE2,DE4,DE5,DE6,CTB3M,CTB6M,CTB1Y,Name,AUD,Brent,CAC-F,copper-F,WIT-oil,DAX-F,DJI-F,EUR,FTSE-F,gold-F,HSI-F,KOSPI-F,NASDAQ-F,GAS-F,Nikkei-F,NZD,silver-F,RUSSELL-F,S&P-F,CHF,Dollar index-F,Dollar index,wheat-F,XAG,XAU
9,2010-01-14,7448.52002,-0.061184,0.002474,0.008099,-0.010552,0.003192,0.738306,,,,7376.171094,,,,0.02,0.05,0.14,2.51,3.76,-0.003892,0.009758,5.17,6.22,0.002941,-0.001816,-0.007179,2.9e-05,-0.005792,-0.013632,-0.007724,0.002001,0.009943,0.020099,0.014346,0.000144,0.003727,0.004513,0.004316,0.002788,...,0.00446,3.74,3.71,3.62,0.03,0.12,1.05,2.46,6.08,6.17,6.2,-0.066667,-0.030928,-0.015686,NYA,0.82,-0.63,0.36,-0.35,-0.33,0.33,0.33,-0.06,0.36,0.55,-0.6,1.14,0.31,-2.53,1.59,0.27,0.57,0.76,0.33,0.12,-0.13,-0.16,-1.49,0.32,0.39
10,2010-01-15,7356.790039,0.21545,-0.012315,0.002474,0.008099,-0.010552,-0.923324,2.391525,,,7372.647266,,,,0.03,0.06,0.15,2.44,3.7,-0.017517,-0.009005,5.12,6.18,-0.003921,-0.005665,0.005472,0.0,-0.016712,-0.001649,-0.015569,-0.008295,-0.0226,-0.00323,-0.03139,-0.00818,-0.015287,-0.007784,-0.018853,-0.009421,...,-0.013103,3.67,3.64,3.55,0.03,0.12,1.06,2.48,6.03,6.12,6.15,0.071429,-0.053191,-0.027888,NYA,-0.94,-0.91,-0.53,-0.62,-1.75,-1.89,-0.94,-0.81,-0.83,-1.09,-0.43,0.38,-1.38,1.84,0.73,-0.55,-1.24,-1.42,-1.14,0.64,0.77,0.77,-3.27,-1.45,-1.08
11,2010-01-19,7443.680176,-0.007124,0.011811,-0.012315,0.002474,0.008099,-0.072085,1.596071,,,7385.56234,,,,0.03,0.06,0.14,2.48,3.73,0.013084,0.004433,5.22,6.21,0.005333,0.004948,0.002624,-2.9e-05,0.044238,0.003697,0.006083,0.012237,-0.009158,0.007777,0.007123,0.002315,0.013982,0.010577,0.017105,0.010913,...,0.01754,3.7,3.67,3.59,0.03,0.11,0.99,2.48,6.07,6.15,6.18,-0.066667,0.044944,0.016393,NYA,-0.17,0.69,0.79,0.5,1.31,0.91,0.59,-0.59,0.35,0.85,0.52,0.16,1.48,-2.35,-0.74,-0.95,0.49,1.54,1.19,0.89,0.58,0.57,-2.08,0.7,0.45
12,2010-01-20,7329.830078,0.018145,-0.015295,0.011811,-0.012315,0.002474,-0.551121,-0.340455,,,7375.429202,,,,0.03,0.05,0.14,2.45,3.68,-0.019752,-0.011253,5.24,6.16,-0.003586,-7.7e-05,0.013085,7.3e-05,-0.015392,-0.014341,-0.002418,-0.00306,0.002773,-0.016399,-0.016266,-0.017901,-0.020131,-0.016742,-0.020907,-0.011401,...,-0.014696,3.65,3.63,3.54,0.02,0.11,0.92,2.48,6.02,6.11,6.13,0.0,-0.010753,-0.012097,NYA,-1.73,-1.69,-2.06,-2.65,-1.77,-2.08,-1.06,-1.4,-1.72,-2.4,-2.02,0.0,-1.2,-1.1,-0.46,-1.64,-4.89,-1.3,-1.03,1.05,1.09,1.08,-2.08,-4.69,-2.32


### Features

Each dataset described in [the paper](https://arxiv.org/pdf/1810.08923.pdf) contains 1984 entries, each representing one day of trading in the stock market. Each entry has 82 features, which are grouped as follows:

*   Primitive features
*   Technical indicators
*   Economic data
*   World stock markets
*   The exchange rate of U.S. dollar
*   Commodities
*   Big U.S. Companies
*   Futures contracts

The authors have made available five datasets, each representing a different stock market. The available markets are: S&P 500, NASDAQ Composite, Dow Jones Industrial Average, RUSSELL 2000, and NYSE Composite. In this work we will explore and analyse NASDAQ Composite, but the insights we gain will be valid for all datasets.

The primitive features and the technical indicators are unique to each dataset, while all the other features are shared across datasets.

A tabular description of the features is also reported in the following images.

<img src="https://raw.githubusercontent.com/MatteoBettini/Stock-Market-Prediction-2020/main/feature_description/feature_table_1.png?token=ANHXQQI7BINXDEPRGLZ42LS74XAH6" width="2000">
<img src="https://raw.githubusercontent.com/MatteoBettini/Stock-Market-Prediction-2020/main/feature_description/feature_table_2.png?token=ANHXQQOG2ZP4BPHXIWXNWWK74XAKY" width="2000">

By looking at the info we can see the types of the features and the number of non-null values.

In [17]:
nasdaq_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1984 entries, 0 to 1983
Data columns (total 84 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Date            1984 non-null   object 
 1   Close           1984 non-null   float64
 2   Volume          1983 non-null   float64
 3   mom             1983 non-null   float64
 4   mom1            1982 non-null   float64
 5   mom2            1981 non-null   float64
 6   mom3            1980 non-null   float64
 7   ROC_5           1979 non-null   float64
 8   ROC_10          1974 non-null   float64
 9   ROC_15          1969 non-null   float64
 10  ROC_20          1964 non-null   float64
 11  EMA_10          1975 non-null   float64
 12  EMA_20          1965 non-null   float64
 13  EMA_50          1935 non-null   float64
 14  EMA_200         1785 non-null   float64
 15  DTB4WK          1984 non-null   float64
 16  DTB3            1984 non-null   float64
 17  DTB6            1984 non-null   f

All features are floats except for Date and Name, which are strings.
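One way to check this programmatically, rather than by reading the `info()` listing, is to select the non-numeric columns. The sketch below uses a small made-up frame `toy` with the same mix of column types; on the real dataset the same call would be made on the loaded DataFrame.

```python
import pandas as pd

# Toy frame with two string columns and one float column (illustrative values).
toy = pd.DataFrame({
    "Date": ["2010-01-04", "2010-01-05"],
    "Close": [7326.74, 7354.87],
    "Name": ["NYA", "NYA"],
})

# Columns whose dtype is not numeric
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)
```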

In [None]:
def series_to_supervised(data, n_in, dropnan=False):
    """
    Frame a time series as a supervised learning dataset.
    Arguments:
        data: Sequence of observations as a list or 2D NumPy array.
        n_in: Number of lag observations as input (X).
        dropnan: Whether to drop rows with NaN values.
    Returns:
        Pandas DataFrame with the lagged inputs and a 'target' column holding
        the change of the first variable between t-1 and t.
    """
    n_vars = 1 if isinstance(data, list) else data.shape[1]
    df = pd.DataFrame(data)
    cols, names = [], []
    # input sequence (t-n, ..., t-1)
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [f'var{j + 1}(t-{i})' for j in range(n_vars)]
    # target: difference of the first column between t and t-1
    cols.append(df[0] - df[0].shift(1))
    names.append('target')

    # put it all together
    agg = pd.concat(cols, axis=1)
    agg.columns = names
    # drop rows with NaN values
    if dropnan:
        agg = agg.dropna()
    return agg

In [None]:
dataset_url = 'https://raw.githubusercontent.com/MatteoBettini/Stock-Market-Prediction-2020/main/stock_markets_datasets/Processed_NASDAQ.csv?token=ANHXQQK4VPBE6ABSBCHTF5K74UP3W'
nasdaq_df = pd.read_csv(dataset_url)
# Dataset is now stored in a pandas DataFrame
nasdaq_df.info()
# Drop the two string columns so that only the 82 numeric features remain
nasdaq_df = nasdaq_df.drop(columns=['Date','Name'])


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1984 entries, 0 to 1983
Data columns (total 84 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Date            1984 non-null   object 
 1   Close           1984 non-null   float64
 2   Volume          1983 non-null   float64
 3   mom             1983 non-null   float64
 4   mom1            1982 non-null   float64
 5   mom2            1981 non-null   float64
 6   mom3            1980 non-null   float64
 7   ROC_5           1979 non-null   float64
 8   ROC_10          1974 non-null   float64
 9   ROC_15          1969 non-null   float64
 10  ROC_20          1964 non-null   float64
 11  EMA_10          1975 non-null   float64
 12  EMA_20          1965 non-null   float64
 13  EMA_50          1935 non-null   float64
 14  EMA_200         1785 non-null   float64
 15  DTB4WK          1984 non-null   float64
 16  DTB3            1984 non-null   float64
 17  DTB6            1984 non-null   f

In [None]:
data = series_to_supervised(nasdaq_df.values, 40)

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1984 entries, 0 to 1983
Columns: 3281 entries, var1(t-40) to target
dtypes: float64(3281)
memory usage: 49.7 MB
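The column count reported above follows from the framing: 82 numeric features times 40 lags, plus one target column, gives 3281. The sketch below checks that arithmetic and reproduces the same lag-and-difference framing on a tiny made-up array, so the resulting layout is easy to inspect. The names `values`, `n_lags`, `lagged`, and `complete` are illustrative, and the code is a compact restatement of the idea rather than the notebook's function itself.

```python
import numpy as np
import pandas as pd

# 82 features x 40 lags + 1 target column
n_vars, n_in = 82, 40
assert n_vars * n_in + 1 == 3281

# Tiny example: 5 time steps, 2 variables, 2 lags (made-up values).
values = np.arange(10, dtype=float).reshape(5, 2)
df = pd.DataFrame(values)
n_lags = 2

# Lagged copies of every variable, oldest lag first
lagged = pd.concat([df.shift(i) for i in range(n_lags, 0, -1)], axis=1)
lagged.columns = [
    f"var{j + 1}(t-{i})" for i in range(n_lags, 0, -1) for j in range(df.shape[1])
]
# Target: change of the first variable between t-1 and t
lagged["target"] = df[0] - df[0].shift(1)

# Rows near the start lack a full lag history and therefore contain NaN
complete = lagged.dropna()
print(lagged.shape, complete.shape)
print(complete)
```

Because the shifted columns have no history for the earliest rows, the first `n_lags` rows contain NaN. With 40 lags on the real data, the first 40 rows are incomplete unless `dropnan=True` is passed.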
