# Data Science Use Case: ASX Index Predictor Model - Part 2
# Written by: Ricky Chang

## 3) Feature engineering
### Given the raw dataset, what scaling and transformations can be applied to individual features to improve model fit accuracy?

Which scaler should be used to scale features? It depends on the feature distribution. A chi-square goodness-of-fit test can indicate whether the chosen transformation is suitable, by checking whether the transformed feature follows the assumed distribution.
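
As a minimal sketch of such a check, the snippet below tests samples against a fitted normal distribution using equal-probability bins and `scipy.stats.chisquare`. The helper name `chi_square_normality`, the bin count and the synthetic samples are assumptions made for illustration; they are not part of the original notebook.

```python
import numpy as np
from scipy import stats

def chi_square_normality(values, n_bins=10):
    """Chi-square goodness-of-fit test of `values` against a fitted normal."""
    values = np.asarray(values, dtype=float)
    mu, sigma = values.mean(), values.std(ddof=1)
    # Interior edges of bins that are equally probable under the fitted normal.
    quantiles = np.linspace(0, 1, n_bins + 1)[1:-1]
    interior_edges = stats.norm.ppf(quantiles, loc=mu, scale=sigma)
    # Count observations per bin; the two outer bins are open-ended.
    bin_index = np.searchsorted(interior_edges, values)
    observed = np.bincount(bin_index, minlength=n_bins)
    expected = np.full(n_bins, len(values) / n_bins)
    # Two parameters (mean, std) were estimated from the data, hence ddof=2.
    statistic, p_value = stats.chisquare(observed, expected, ddof=2)
    return statistic, p_value

rng = np.random.default_rng(0)
skewed_sample = rng.lognormal(size=2000)   # hypothetical "raw price"-like data

_, p_skewed = chi_square_normality(skewed_sample)          # raw, skewed values
_, p_logged = chi_square_normality(np.log(skewed_sample))  # after a log transform
print(f"p-value raw: {p_skewed:.3g}, p-value after log: {p_logged:.3g}")
```

A very small p-value suggests the feature is not consistent with a normal distribution, so a transformation (or a different scaler) may be worth considering before standard scaling.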

In [155]:
import pandas as pd
import tensorflow as ts
import seaborn as sns
sns.set()
import sklearn as sk
import numpy as np
import scipy
import statsmodels.api as sm
import matplotlib.pyplot as plt
import math
# apply fix to statsmodels library
from scipy import stats
stats.chisquareprob = lambda chisq, df: stats.chi2.sf(chisq,df)

### Read in the joined raw dataset

In [148]:
folder = ''
df_asset = pd.read_csv(filepath_or_buffer = folder + 'df_asset.csv', index_col=0)

The code snippet below performs these steps:
1. Creates a new feature, 'CopperGoldRatio' (copper price divided by gold price), which is regarded as a bullish economic indicator. Under the hypothesis that equity indices reflect the overall economy, CopperGoldRatio and index prices should be positively correlated.
1. Creates a new feature suffixed with '(Return %)' for each asset price. This is the first step in normalising raw prices.
1. Applies log1p, i.e. ln(1 + x), to each (Return %) feature. This non-linear transformation in effect brings the features closer to a normal distribution, provided the raw prices are lognormally distributed.
1. Drops all records containing NA values.

In [163]:
from sklearn.preprocessing import StandardScaler
#unscaled_inputs = df_asset.iloc[:,:-5]
assets = ['Gold','Silver','Iron','Copper','WTI','Brent','RBA Cash Rate']
# Take an explicit copy so new columns are not written to a slice of df_asset
unscaled_inputs = df_asset[assets].copy()
# Copper/gold price ratio as an additional feature
unscaled_inputs['CopperGoldRatio'] = unscaled_inputs['Copper'] / unscaled_inputs['Gold']
assets.append('CopperGoldRatio')
for x in assets:
    # Period-on-period change relative to the current price
    # (note: the conventional simple return divides by the previous price, shift(1))
    returns = (unscaled_inputs[x] - unscaled_inputs[x].shift(1)) / unscaled_inputs[x]
    # ln(1 + return); vectorised replacement for looping with math.log1p
    unscaled_inputs[x+' (Return %)'] = np.log1p(returns)
# The first row has no previous price, so its returns are NaN
unscaled_inputs = unscaled_inputs.dropna()
# Standard scaling is currently disabled; the unscaled features are used as-is
#asset_scaler = StandardScaler()
#asset_scaler.fit(unscaled_inputs)
#scaled_inputs = asset_scaler.transform(unscaled_inputs)
scaled_inputs = unscaled_inputs
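
To illustrate why log1p is applied to returns, here is a small sketch using hypothetical prices. It uses the conventional simple return (change divided by the previous price, as computed by `pandas.Series.pct_change`), for which `log1p` of the return equals the log of the price ratio. Note that the cell above divides by the current price instead, so the two definitions are not identical.

```python
import numpy as np
import pandas as pd

prices = pd.Series([100.0, 102.0, 99.0, 105.0])  # hypothetical prices

simple_return = prices.pct_change()               # (p_t - p_{t-1}) / p_{t-1}
log_return = np.log1p(simple_return)              # ln(1 + r_t)
log_price_ratio = np.log(prices / prices.shift(1))  # ln(p_t / p_{t-1})

comparison = pd.DataFrame(
    {"log_return": log_return, "log_price_ratio": log_price_ratio}
).dropna()  # the first row has no previous price
print(comparison)
```

If prices are lognormally distributed, these log returns are normally distributed, which is the motivation for the transformation.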


## 4) Model selection 

The first principle of model selection: a simpler model is preferable to a complex one, as long as model accuracy is not significantly compromised. Hence:

Model #1 shall be Logistic Regression.

Model #2 shall be XGBoost.

Model #3 shall be Naive Bayes.

After all models are fitted, the intention is to stack the results of these models together, to see if that results in greater predictive accuracy.
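
A minimal sketch of stacking, assuming scikit-learn's `StackingClassifier` is used as the combiner. The data is synthetic, and `GradientBoostingClassifier` is a stand-in for XGBoost so that the example only depends on scikit-learn; neither choice comes from the original notebook. The default cross-validation inside `StackingClassifier` ignores time ordering, so for time-series data this is illustrative only.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# Synthetic data: 300 rows, 4 features, binary target driven by two features.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=300)) > 0

base_models = [
    ("logistic", LogisticRegression(max_iter=1000)),
    ("boosting", GradientBoostingClassifier(random_state=0)),  # stand-in for XGBoost
    ("naive_bayes", GaussianNB()),
]
# The final estimator learns how to weight the base models' predictions.
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(),
    cv=5,
)
stack.fit(X[:200], y[:200])                     # fit on the first 200 rows
stacked_accuracy = stack.score(X[200:], y[200:])  # evaluate on the held-out rows
print(f"Stacked accuracy on held-out rows: {stacked_accuracy:.2f}")
```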

Look-ahead bias must be avoided, as it would discredit any predictions generated by the model.
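
One common way to guard against look-ahead bias is to lag the features, so that the row for day t only contains information available on day t-1. The sketch below uses hypothetical values and dates; whether lagging by one period is appropriate depends on when each price is actually observed relative to the index close.

```python
import pandas as pd

# Hypothetical daily data: one feature and the index return it should predict.
frame = pd.DataFrame(
    {
        "Copper (Return %)": [0.010, -0.020, 0.015, 0.005, -0.010],
        "XAO (Return %)": [0.004, -0.003, 0.006, -0.001, 0.002],
    },
    index=pd.date_range("2020-01-01", periods=5, freq="D"),
)

# Lag the feature by one period: the row for day t holds the value from day t-1.
lagged_features = frame[["Copper (Return %)"]].shift(1)
target = frame["XAO (Return %)"] > 0

# Drop the first row (no prior observation) and align the target on the same index.
lagged_features = lagged_features.dropna()
target = target.loc[lagged_features.index]
print(lagged_features.join(target.rename("target")))
```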

### 4.1) Logistic Regression Model
The model is fitted on all columns of scaled_inputs, that is, the raw prices together with the log (Return %) features. The target is a binary label which is:
* TRUE if XAO (Return %) is greater than 0 (buy)
* FALSE if XAO (Return %) is less than or equal to 0 (sell)

In [164]:
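from sklearn.linear_model import LogisticRegression
from sklearn import metrics
reg = LogisticRegression()
# Binary target: True when the index return is positive.
# The first row is skipped to match the row removed by dropna() above;
# this assumes dropna() removed only that first row.
target = df_asset['Return (%)'][1:] > 0
reg.fit(scaled_inputs,target.values)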



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

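#### Feature Significance
An odds ratio close to 1 (that is, a coefficient close to 0) indicates that the feature contributes little to the model and could be removed. Because the inputs are not standardised, each odds ratio is per unit of the raw feature, so values are not directly comparable across features with different scales.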

In [165]:
feature_name = unscaled_inputs.columns.values
summary_table = pd.DataFrame(columns=['Feature Name'], data = feature_name)
summary_table['Coefficient'] = np.transpose(reg.coef_)
summary_table.index = summary_table.index + 1
summary_table.loc[0] = ['Intercept', reg.intercept_[0]]
summary_table = summary_table.sort_index()
summary_table['Odds_Ratio'] = np.exp(summary_table.Coefficient)
summary_table = summary_table.sort_values('Odds_Ratio', ascending = False)
summary_table

| Index | Feature Name | Coefficient | Odds_Ratio |
|---|---|---|---|
| 12 | Copper (Return %) | 0.973334 | 2.646754 |
| 16 | CopperGoldRatio (Return %) | 0.85804 | 2.358533 |
| 10 | Silver (Return %) | 0.341925 | 1.407655 |
| 0 | Intercept | 0.317471 | 1.37365 |
| 8 | CopperGoldRatio | 0.186498 | 1.205022 |
| 9 | Gold (Return %) | 0.074643 | 1.0775 |
| 11 | Iron (Return %) | 0.02936 | 1.029795 |
| 5 | WTI | 0.003897 | 1.003904 |
| 2 | Silver | 0.002751 | 1.002755 |
| 1 | Gold | 0.000784 | 1.000785 |


In [166]:
print('The model accuracy is: '+ str(100* reg.score(scaled_inputs,target.values)) + '%')

The model accuracy is: 64.61538461538461%
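
An accuracy figure is easier to interpret next to a naive baseline. As a sketch, scikit-learn's `DummyClassifier` with `strategy="most_frequent"` always predicts the majority class; the labels below are hypothetical and are not taken from the dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels: 60% "up" days and 40% "down" days.
y = np.array([True] * 60 + [False] * 40)
X = np.zeros((len(y), 1))  # the dummy model ignores the features

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)
baseline_accuracy = baseline.score(X, y)  # share of the majority class
print(f"Baseline accuracy: {100 * baseline_accuracy:.1f}%")
```

A fitted model is only informative to the extent that it beats this baseline on data it has not seen.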


## Conclusion of Part 2
Initial fitting of Model #1 Logistic Regression is complete. 

Initial findings indicate that Copper (Return %), CopperGoldRatio (Return %) and Silver (Return %) are the most important features for predicting buy/sell signals for the All Ordinaries (XAO).

The accuracy reported above is measured on the same data used to fit the model, and it needs to improve to >80% for the model to be credible. In addition, out-of-sample validation, such as a train/test split or k-fold cross-validation, needs to be performed.
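
For time-ordered data, out-of-sample validation can be sketched with scikit-learn's `TimeSeriesSplit`, which always trains on earlier rows and tests on later ones. The data below is synthetic and the number of splits is an arbitrary assumption; this is an illustration of the technique, not the validation used in this series.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Synthetic, time-ordered data: 250 rows, 3 features, binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(250, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=250)) > 0

# Each split trains on an expanding window of past rows and tests on the next block.
splitter = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=splitter, scoring="accuracy"
)
print("Accuracy per split:", np.round(scores, 3))
print("Mean out-of-sample accuracy:", round(scores.mean(), 3))
```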