# JPX Tokyo Stock Exchange Prediction

## Competition

### Introduction
The goal of this competition is to build a model that makes accurate predictions for the Japanese stock market.

**stock_prices.csv** The core file of interest. Includes the daily closing price for each stock and the target column.

**options.csv** Data on the status of a variety of options based on the broader market. Many options include implicit predictions of the future price of the stock market and so may be of interest even though the options are not scored directly.

**secondary_stock_prices.csv** The core dataset contains only the 2,000 most commonly traded equities, but many less liquid securities are also traded on the Tokyo market. This file contains data for those securities, which aren't scored but may be of interest for assessing the market as a whole.

**trades.csv** Aggregated summary of trading volumes from the previous business week.

**financials.csv** Results from quarterly earnings reports.

**stock_list.csv** Mapping between the SecuritiesCode and company names, plus general information about which industry the company is in.

### Data Preparation Stage

In [2]:
pip install scikit-learn

Note: you may need to restart the kernel to use updated packages.




#### First, import necessary libraries (to start)

In [3]:
import pandas as pd
import numpy as np
import random
import os
import xgboost as xgb
from tensorflow import keras

In [4]:
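main_dir = "jpx-tokyo-stock-exchange-prediction/"

# Sanity check: report when the dataset is not where the notebook expects it.
if not os.path.exists(os.path.join(main_dir, "data_specifications/options_spec.csv")):
    print("Dataset not found under", main_dir)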

In [5]:
from sklearn.model_selection import train_test_split
from os import listdir
from os.path import isfile, join

trainDir = "jpx-tokyo-stock-exchange-prediction/train_files/"
testDir = "jpx-tokyo-stock-exchange-prediction/supplemental_files/"
specDir = "jpx-tokyo-stock-exchange-prediction/data_specifications"

# List the files in each directory, reusing the directory variables defined above.
csvTrainFiles = [join(trainDir, f) for f in listdir(trainDir) if isfile(join(trainDir, f))]

# Files are listed from testDir itself (previously the listing came from the train directory).
csvTestFiles = [join(testDir, f) for f in listdir(testDir) if isfile(join(testDir, f))]

csvSpecFiles = [join(specDir, f) for f in listdir(specDir) if isfile(join(specDir, f))]
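
Because `listdir` returns files in arbitrary order, looking files up by position (for example `csvSpecFiles[0]`) can pick a different file on another machine. A minimal sketch of an alternative, indexing the CSV files by name with `pathlib`, is shown here. The helper name `index_csv_files` and the throwaway demo directory are illustrative assumptions, not part of the competition data.

```python
from pathlib import Path
import tempfile


def index_csv_files(directory):
    """Return {file stem: path} for every CSV file directly inside `directory`."""
    return {p.stem: p for p in sorted(Path(directory).glob("*.csv"))}


# Demo on a temporary directory with placeholder files (not the real dataset).
with tempfile.TemporaryDirectory() as tmp:
    for name in ("options.csv", "stock_prices.csv", "notes.txt"):
        (Path(tmp) / name).write_text("placeholder\n")
    demo_index = index_csv_files(tmp)
    demo_names = list(demo_index)  # only the CSV files, sorted by name
```

With the real data, `index_csv_files(trainDir)["options"]` would give the path to `options.csv` regardless of listing order.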

#### Now a bit of brainstorming

An immediate observation is that the `SecuritiesCode` column is shared by most of these dataframes. A simple way to get the name of a stock is therefore to look up its code in the `stock_list.csv` dataframe.
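
The lookup described above can be sketched as a left merge on `SecuritiesCode`. The two frames below are toy stand-ins with made-up values, and the `Name` and `Close` column names are assumptions about the real files that should be checked against the data specifications.

```python
import pandas as pd

# Toy stand-in for stock_prices.csv (values are made up).
prices = pd.DataFrame({
    "Date": ["2021-12-01", "2021-12-01", "2021-12-02"],
    "SecuritiesCode": [1301, 1332, 1301],
    "Close": [2950.0, 570.0, 2980.0],
})

# Toy stand-in for stock_list.csv; "Name" is an assumed column name.
stock_list = pd.DataFrame({
    "SecuritiesCode": [1301, 1332],
    "Name": ["Company A", "Company B"],
})

# Left merge keeps every price row and attaches the company name.
# validate="many_to_one" raises if a code appears twice in stock_list.
named = prices.merge(
    stock_list[["SecuritiesCode", "Name"]],
    on="SecuritiesCode",
    how="left",
    validate="many_to_one",
)
```

A left merge is used so that price rows whose code is missing from the list are kept, with `NaN` as the name, rather than silently dropped.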

The next logical step is to determine which features of the datasets matter most for making predictions. The challenge is that each dataframe has several features, all with potentially hundreds of entries.

Dimensionality reduction will be important: testing every feature individually is computationally expensive, and testing every combination of features is more expensive still. I will therefore make use of the `sklearn.feature_selection` module.

The challenge now lies in determining which feature selection method yields the best results.
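
As one possible starting point, the sketch below uses `SelectKBest` with `f_regression` from `sklearn.feature_selection` on synthetic data. The column names, the choice of `k=2`, and the scoring function are illustrative assumptions; nothing here has been run on the competition data.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data: two informative columns and two pure-noise columns.
rng = np.random.default_rng(0)
n = 500
X_demo = pd.DataFrame({
    "signal_a": rng.normal(size=n),
    "signal_b": rng.normal(size=n),
    "noise_1": rng.normal(size=n),
    "noise_2": rng.normal(size=n),
})
y_demo = (
    3.0 * X_demo["signal_a"]
    - 2.0 * X_demo["signal_b"]
    + rng.normal(scale=0.1, size=n)
)

# Score each feature against the target with a univariate F-test
# and keep the two highest-scoring columns.
selector = SelectKBest(score_func=f_regression, k=2)
selector.fit(X_demo, y_demo)
selected = X_demo.columns[selector.get_support()].tolist()
```

`f_regression` only measures linear dependence between each feature and the target. `mutual_info_regression` from the same module is an alternative scoring function that can also pick up non-linear relationships, at a higher computational cost.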


#### Map dataframes based on the stock code and then the name of that stock

#### Clean data

In [6]:
pd.read_csv(csvSpecFiles[0])

| Unnamed: 0 | Column | Sample value | Type | Addendum | Remarks |
|---|---|---|---|---|---|
| 0 | DateCode | 20170104_144122718 | string | | Unique ID for option price records |
| 1 | Date | 2017-01-04 0:00:00 | date | | Trade date and time |
| 2 | OptionsCode | 144122718 | string | | Local Securities Code (link to https://www.jpx... |
| 3 | WholeDayOpen | 0 | float | | Opening Price for Whole Trading Day |
| 4 | WholeDayHigh | 0 | float | | High Price for Whole Trading Day |
| 5 | WholeDayLow | 0 | float | | Low Price for Whole Trading Day |
| 6 | WholeDayClose | 0 | float | | Closing Price for Whole Trading Day |
| 7 | NightSessionOpen | 0 | float | | Opening Price for Night Session |
| 8 | NightSessionHigh | 0 | float | | High Price for Night Session |
| 9 | NightSessionLow | 0 | float | | Low Price for Night Session |
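
One way the specification could feed into the cleaning step is to use its `Column` and `Type` fields to coerce each data column to the declared type. The sketch below is a hypothetical helper (`coerce_to_spec`) run on a miniature spec and made-up rows. In particular, the `-` placeholder for a missing price is an assumption about how gaps might be encoded, not something confirmed by the output above.

```python
import io
import pandas as pd

# Miniature stand-in for the spec table ("Column" and "Type" as in the spec file).
spec = pd.DataFrame({
    "Column": ["DateCode", "Date", "WholeDayOpen", "NightSessionOpen"],
    "Type": ["string", "date", "float", "float"],
})

# Made-up rows; "-" is an assumed placeholder for a missing price.
raw_csv = io.StringIO(
    "DateCode,Date,WholeDayOpen,NightSessionOpen\n"
    "20170104_144122718,2017-01-04,0,-\n"
    "20170105_144122718,2017-01-05,1.5,2.0\n"
)
df = pd.read_csv(raw_csv, dtype=str)  # read everything as text first


def coerce_to_spec(frame, spec):
    """Convert columns to the types declared in the spec; bad values become NaN/NaT."""
    out = frame.copy()
    for column, kind in zip(spec["Column"], spec["Type"]):
        if column not in out.columns:
            continue
        if kind == "float":
            out[column] = pd.to_numeric(out[column], errors="coerce")
        elif kind == "date":
            out[column] = pd.to_datetime(out[column], errors="coerce")
        # "string" columns are left as text.
    return out


typed = coerce_to_spec(df, spec)
```

Coercing with `errors="coerce"` turns unparseable entries into missing values instead of raising, which keeps the column numeric for later modelling.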


### Test 1: XGBoost Regressor
Here I test the XGBoost regressor from the `xgboost` library.

In [7]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from xgboost import XGBRegressor

model = XGBRegressor(random_state=0)

# Load the options training data.
X = pd.read_csv(csvTrainFiles[csvTrainFiles.index("jpx-tokyo-stock-exchange-prediction/train_files/options.csv")])

# Use the Dividend column as the regression target.
# Note: X still contains Dividend and several non-numeric columns,
# so the fit() call fails with the ValueError shown below.
target = X.Dividend
X_test_full = pd.read_csv(csvTestFiles[0])

X_train_full, X_valid_full, y_train, y_valid = train_test_split(
    X, target, train_size=0.8, test_size=0.2, random_state=0
)

model.fit(X_train_full, y_train)


ValueError: DataFrame.dtypes for data must be int, float, bool or category.  When
categorical type is supplied, DMatrix parameter `enable_categorical` must
be set to `True`. Invalid columns:DateCode, Date, NightSessionOpen, NightSessionHigh, NightSessionLow, NightSessionClose
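
The error arises because the frame passed to `fit` still holds text columns. The target column also remains among the features, which would leak the answer to the model. A minimal sketch of one possible fix follows: split off the target, then keep only numeric and boolean columns. The helper name `split_features_target` and the toy frame are assumptions for illustration, and this has not been run against the real `options.csv`.

```python
import pandas as pd


def split_features_target(frame, target_col):
    """Separate the target and keep only numeric/bool feature columns."""
    y = frame[target_col]
    X = frame.drop(columns=[target_col])          # avoid leaking the target
    X = X.select_dtypes(include=["number", "bool"])  # drop text columns
    return X, y


# Toy frame with made-up values that mimics the mix of column types.
demo = pd.DataFrame({
    "DateCode": ["20170104_144122718", "20170105_144122718"],
    "WholeDayOpen": [0.0, 1.5],
    "NightSessionOpen": ["-", "2.0"],  # stored as text, so it is dropped
    "Dividend": [0.0, 0.0],
})
X_num, y = split_features_target(demo, "Dividend")
```

The resulting `X_num` and `y` could then be passed to `train_test_split` and `model.fit`. Dropping the text columns discards the night-session prices. To keep them, they could first be converted with `pd.to_numeric(..., errors="coerce")`, since XGBoost accepts `NaN` as a missing value.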