### Loading up required libraries and configurations

In [226]:
import quandl
import pandas_datareader.data as web
import datetime
import pandas as pd
import sklearn
import numpy as np
from collections import defaultdict
from IPython.display import display
import scipy as sp
from operator import methodcaller
import time

# evaluate usage
pd.options.mode.chained_assignment = None  # default='warn'

In [227]:
"""
Usage of this API key is monitored.
Please don't use this key for any other work, and don't make it available on the web by any means.
If you would like to access the same API for a different project,
please create an account at quandl.com (it is free) and generate your own API key.
"""
quandl.ApiConfig.api_key = "1513txcURR4fYyP5VDU3"

### Getting the data

#### Stock data

For general stock market data, we will use the Yahoo API, which offers reliable free data for a number of stock markets, including Bovespa.

In [228]:
def get_stock_data(symbol='VALE5.SA', start_date = '1998-1-1', end_date = '2014-12-31'):
    df = web.DataReader(symbol, 'yahoo', start_date, end_date)
    return df

In [229]:
df = get_stock_data(start_date = '1998-1-1', end_date = '2014-12-31')

#### General Market Data

For the general market data, there is excellent data made available by Banco Central do Brasil (the Brazilian central bank). The data is available through a web service, but the Quandl API makes it easier to access this same data. With a free profile we have limited access, but it is enough for the following tests.

There are over 1,300 different indicators available, from different time periods including daily, weekly, monthly, yearly, etc. At the moment we will stick with 10 relevant indicators which are available daily, and then move on to add more data as we see fit.

In [230]:
daily = {
    'Selic': 'BCB/432',
    'Exchange Rate USD Sell': 'BCB/1',
    'Exchange Rate USD Buy': 'BCB/10813',
    'BM&F Gold gramme': 'BCB/4',
    'Bovespa total volume': 'BCB/8',
    'International Reserves': 'BCB/13621',
    'Bovespa index': 'BCB/7',
    'Foreign exchange operations balance': 'BCB/13961',
    'Nasdaq index': 'BCB/7810',
    'Dow Jones index': 'BCB/7809'
}

## removed monthly indicators for now - to be added later
# monthly = {
#     'IPCA-E': 'BCB/10764',
#     'IGP-M': 'BCB/7451',
#     'IPCA-15': 'BCB/74/78',
#     'Net public debt': 'BCB/4473'
# }

In [231]:
def get_market_data(input_df, start_date, end_date):
    df = input_df.copy()
    for var, code in daily.items():
        df = pd.concat([df, quandl.get(code, start_date=start_date , end_date=end_date)], join='inner', axis=1)
        df = df.rename(columns={'Value': var})
    return df

In [232]:
df = get_market_data(df, start_date = '1998-1-1', end_date = '2014-12-31')

#### Trend indicators

The trend indicators are borrowed from the field known as technical analysis, or charting, which aims to find patterns by analyzing the trend of price and volume.

We will start with the best-known ones: Moving Averages (and the related MACD, moving average convergence divergence, indicator), Momentum, and Daily Returns.

Already included in the dataset:
* Momentum: the current price divided by the price X days earlier. It is already included in the dataset when we analyse the trend for Adj Close and all other variables.
* Daily Return: the same as momentum, but with a one-day lag.

To include:
* Moving Average: moving average over X days. Important for understanding longer-term trends.
* Bollinger Bands: bands two standard deviations above and below the moving average (roughly a 95% interval if prices were normally distributed).
* Candlestick
* Elliott Waves
* Volume/Price
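As a small illustration of the momentum and daily return definitions above, here is a sketch on invented prices. The helper name `momentum` and the toy series are assumptions for this example only; the ratio is shifted by 1 so that "no change" maps to 0, which is how the feature step later in the notebook treats it.

```python
import pandas as pd

def momentum(prices: pd.Series, n: int) -> pd.Series:
    """Relative change versus the price n rows earlier (0 means no change)."""
    return prices / prices.shift(n) - 1

# invented closing prices, for illustration only
prices = pd.Series([10.0, 11.0, 12.1, 11.0, 12.0], name="Adj Close")

daily_return = momentum(prices, 1)  # momentum with a one-row lag
momentum_3 = momentum(prices, 3)    # change over three rows

print(pd.DataFrame({"daily_return": daily_return, "momentum_3": momentum_3}))
```

The first `n` rows are NaN because there is no earlier price to compare against, which is why the notebook drops rows with missing values after building its features.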

In [233]:
# moving average and bollinger bands
def get_tech_indicators(input_df):
    df = input_df.copy()
    for n in range(10,61,10):
        df['sma'+str(n)] = df['Adj Close'].rolling(window=n, center=False).mean()
        std =df['Adj Close'].rolling(window=n, center=False).std()
        df['bb_lower'+str(n)] = df['sma'+str(n)] - (2 * std)
        df['bb_upper'+str(n)] = df['sma'+str(n)] + (2 * std)
    return df

In [234]:
df = get_tech_indicators(df)

### Creating the labels

The general approach to the stock market problem is to use non-linear regressors to predict future prices. Although it is easy to predict the price for the day ahead, as you move days further the r2_score sinks and the prediction becomes useless.

The innovative approach we will follow here is to treat this as a classification problem. To do that, we will pre-define a trading strategy, backtest it on the full dataset, and define the label as whether this trading strategy results in a successful trade or not.

##### Swing Trade

The first and simplest analysis is a swing trade strategy. We buy the stock and hold it for up to n days. If it reaches an upper boundary (+x%), we sell. If it reaches a lower boundary (-y%), we also close our position.

So the challenge is that, within the next n days, the stock needs to reach the upper boundary before it reaches the lower boundary. That counts as a successful trade. If it reaches the lower boundary first, or n days pass without reaching the upper boundary, the trade is considered a failure.

The name swing trade means we are speculating on the week/bi-week pattern, the swing, not necessarily on the longer term trend.

The parameters n, x, and y, will be optimized through a genetic algorithm to achieve the optimal trading strategy for this setup.
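The success rule described above depends on which boundary is touched first. A minimal sketch of such a first-touch check follows, assuming only daily closing prices are available (so the intraday order of events is unknown). The helper name `first_touch_label` and the toy price lists are hypothetical and not part of the notebook.

```python
def first_touch_label(prices, start, forward, profit_margin, stop_loss):
    """Return 1 if the upper boundary is reached before the lower one
    within `forward` steps after `start`, else 0."""
    entry = prices[start]
    for price in prices[start + 1 : start + 1 + forward]:
        change = price / entry - 1
        if change <= -stop_loss:
            return 0  # stopped out before the target was reached
        if change >= profit_margin:
            return 1  # target reached first
    return 0  # window expired without reaching the target

# invented closing prices, for illustration only
target_first = first_touch_label([100, 101, 105, 97, 99], 0, 3, 0.042, 0.02)
stop_first = first_touch_label([100, 97, 105], 0, 2, 0.042, 0.02)
expired = first_touch_label([100, 101, 102], 0, 2, 0.042, 0.02)
print(target_first, stop_first, expired)
```

Walking the window day by day is what makes the ordering explicit; tracking only the maximum and minimum over the window cannot tell which boundary came first.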

In [235]:
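def create_labels(input_df, forward=19, profit_margin=.042, stop_loss=.020):

    df = input_df.copy()
    # .ix was removed from pandas; read prices positionally and write labels with .iloc
    df['Label'] = np.nan
    label_col = df.columns.get_loc('Label')
    prices = df['Adj Close'].values

    for row in range(df.shape[0]-forward):

        # initialize max and min ticks
        max_uptick = 0
        min_downtick = 0

        # move all days forward
        for i in range(1,forward+1):
            delta = (prices[row+i] / prices[row])-1
            if delta > max_uptick:
                max_uptick = delta
            if delta < min_downtick:
                min_downtick = delta

        # evaluate ticks against predefined strategy parameters
        # NOTE: as written, a row is labelled 1 only when BOTH boundaries are touched
        # inside the window, and the order of the touches is not checked
        if max_uptick >= profit_margin and min_downtick <= -stop_loss:
            df.iloc[row, label_col] = 1
        else:
            df.iloc[row, label_col] = 0

    # the last `forward` rows have no label and are dropped here
    return df.dropna()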


In [236]:
df = create_labels(df)

### Rounding up the features

Now that we have the main data and the label, let's start creating our features. There is also some innovation here: we are departing from the traditional approach to stock price prediction.

The first and foremost difference is that instead of analyzing the raw data, I want to analyze the trend for each variable. So for a single day I will look at how the variable increased or decreased over the last N days.

To center at 0, for each variable I will calculate (value / value n days earlier) - 1. The number of days I will look back may also vary, and it will also depend on the trading strategy to follow. But for simplicity we will start with a round number such as 60 days for the swing trading strategy of around 10 days.

In [437]:
def create_features(input_df, base = 60):
    """ Receives a dataframe as input 
        Returns a new dataframe with ratios calculated
    """
    # avoid modifying in place
    df = input_df.copy() 
    # get all columns but the label
    cols = list(df.columns)
    if 'Label' in cols:
        cols.remove('Label')
    # create the new columns for the number of days
    for n in range(1,base+1):
        new_cols = list(map(lambda x: "{}-{}".format(x,n), cols))
        df[new_cols] = (df.loc[:, cols] / df.shift(n).loc[:, cols]) - 1
    
    # replace +inf with max and -inf with min, for each column
    for col in df.columns:
        if len(df[df[col] == np.inf]):
            df.loc[df[col] == np.inf, col] = df.loc[df[col] != np.inf, col].max()
        if len(df[df[col] == -np.inf]):
            df.loc[df[col] == -np.inf, col] = df.loc[df[col] != -np.inf, col].min()    
            
    # drop na 
    df.dropna(axis=0, inplace=True)
   
    # leave or remove original columns? for now I will leave them
    #return df.drop(cols, axis=1)
    return df

In [238]:
df = create_features(df)
df.shape

(1110, 2075)

### Understanding the label and features

Let's start by analyzing the label distribution. This will tell us a lot about the dataset, which classifier to choose, and whether we will need to use a stratified approach when splitting the dataset for testing or cross-validation.

In [239]:
# break up X and y arrays, convert to numpy array
def split_features_labels(df):
    features = [x for x in df.columns if x != "Label"]
    X = df[features].values
    y = df['Label'].values
    return X, y

In [240]:
X,y = split_features_labels(df)
X.shape, y.shape

((1110, 2074), (1110,))

In [241]:
np.bincount(y.astype(int)) / len(y)

array([ 0.78648649,  0.21351351])

As expected, only a minority of occurrences result in a successful trade under this strategy. About 21.4% of the observations have label 1 (success), while about 78.6% have label 0 (failure). A stratified approach will be required when splitting the dataset later.
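To illustrate what a stratified split does with an imbalanced label, here is a small sketch on synthetic data. The arrays `X_demo` and `y_demo` and the 20% positive share are invented for the example; only `StratifiedShuffleSplit` comes from scikit-learn.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# synthetic data: 200 rows, 20% positives (invented for illustration)
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 5))
y_demo = np.array([1] * 40 + [0] * 160)

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X_demo, y_demo))

# both shares should stay close to the overall 20%
train_share = y_demo[train_idx].mean()
test_share = y_demo[test_idx].mean()
print(train_share, test_share)
```

Note that a shuffled split mixes earlier and later rows, so for time-ordered data it may look more optimistic than a split that respects chronology.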

The next step is to take a look at the features we have. I will start by standardizing to z-scores, then check for and remove outliers, and finally analyse which features are most relevant and attempt to extract principal components.

For now, I will keep working with the full data, since I'm trying to understand the data and I'm not doing predictions at this point. I will come back to this point later to divide the dataset.

In [242]:
# scaling
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X = scaler.fit_transform(X)

In [243]:
# feature selection
from sklearn.feature_selection import SelectKBest
features = [x for x in df.columns if x != "Label"]
f_selector = SelectKBest()
f_selector.fit(X, y)
sorted(zip(features, f_selector.scores_, f_selector.pvalues_), key=lambda x: -x[1])

[('bb_lower20-1', 30.046022284100122, 5.2221671713432783e-08),
 ('bb_lower10-9', 28.488521033795937, 1.1427075168928967e-07),
 ('bb_lower10-10', 27.886973638337743, 1.5472352386519031e-07),
 ('bb_lower20-2', 27.745994818778321, 1.6612144619259018e-07),
 ('bb_lower20-3', 27.592808468381264, 1.794642299440322e-07),
 ('bb_lower10-11', 26.390812927348399, 3.2931523773031622e-07),
 ('bb_lower20-4', 25.770601851296266, 4.5071438886019719e-07),
 ('bb_lower10-8', 25.277657458099448, 5.7856795833824564e-07),
 ('bb_lower20-5', 23.579006572842172, 1.3708688897746095e-06),
 ('bb_lower10-12', 22.80523801219006, 2.0330553023174586e-06),
 ('bb_lower20-6', 21.89758032059218, 3.231011206009291e-06),
 ('bb_lower10-7', 21.560317517673433, 3.838976630175768e-06),
 ('bb_lower20-7', 20.082977005432273, 8.1849357340515636e-06),
 ('bb_lower10-13', 20.027976927870814, 8.4194296025996915e-06),
 ('bb_lower10', 19.837945350926155, 9.282887958648991e-06),
 ('Bovespa index', 19.050050158867361, 1.392316179256393e-0

Strangely enough, it shows an incredibly high importance for International Reserves features, as well as for other features related to the outside market, such as Nasdaq and Exchange Rate.

Petrobras, the company being analyzed, is heavily influenced by the price and availability of USD in the Brazilian market. Its debt is mostly in dollars, and the price of its main product, oil, is set globally. So these findings may be accurate, may be just random noise, or may be due to some mistake in the data preprocessing.

I will come back to this point later if this correlation is not confirmed by other analyses.

In [244]:
# pca conversion
# okay, I've got all these data points
from sklearn.decomposition import PCA
pca = PCA(n_components = 32)
pca.fit(X)
sum(pca.explained_variance_ratio_)

# why 32? I'm aiming at 80% explained variance, with the minimum number of components
# I just tried a few and got to the required number of components
# this could also be optimized later, aimed at performance

0.88731544329712686

In [245]:
X_transformed = pca.transform(X)

Okay, now that we've transformed the data into fewer principal components, which helps avoid the curse of dimensionality, let's take a step back and see which variables have the most impact on each of the first components. Are they the same variables that stood out in the ANOVA F-value tests conducted before?

In [246]:
# checking the variables with the largest loadings on one component
# NOTE: coef[:, 1] below is the second principal component; coef[:, 0] is the first
i = np.identity(len(features)) # identity matrix
coef = pca.transform(i)
sorted(zip(features, coef[:, 1]), key=lambda x:-x[1])

[('bb_upper40-49', 0.036522875323813653),
 ('bb_upper40-48', 0.036488543993939283),
 ('bb_upper40-50', 0.036478537133766968),
 ('bb_upper40-47', 0.036376906292397544),
 ('bb_upper40-51', 0.036360379899087893),
 ('bb_upper50-46', 0.036203562940092103),
 ('bb_upper40-46', 0.036182748688690487),
 ('bb_upper50-47', 0.036177027767711742),
 ('bb_upper50-45', 0.036172888797073526),
 ('bb_upper40-52', 0.036171185171810831),
 ('bb_upper50-48', 0.036097978384063291),
 ('bb_upper50-44', 0.036092334828553362),
 ('bb_upper50-43', 0.035960313889181944),
 ('bb_upper50-49', 0.035934612816469105),
 ('bb_upper40-53', 0.03593121824356231),
 ('bb_upper40-45', 0.03590823032493274),
 ('bb_upper50-42', 0.035748244626107192),
 ('bb_upper50-50', 0.035702158754759676),
 ('bb_upper40-54', 0.035643522757171149),
 ('bb_upper40-44', 0.035555038139763843),
 ('bb_upper50-41', 0.035451676620954636),
 ('bb_upper50-51', 0.035413986014129629),
 ('bb_upper40-55', 0.035332963375255169),
 ('bb_upper40-43', 0.035109654692400

The results hardly show anything relevant. The coefficients of the features are very similar for this principal component.
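scikit-learn also exposes the loadings directly as `pca.components_` (one row per component), which avoids projecting an identity matrix. The sketch below uses synthetic data, with invented feature names `a`, `b` and `c`, and ranks features by absolute loading, since a large negative loading matters as much as a large positive one.

```python
import numpy as np
from sklearn.decomposition import PCA

# synthetic data: "a" and "b" share a common factor, "c" is independent noise
rng = np.random.RandomState(0)
base = rng.normal(size=300)
X_demo = np.column_stack([
    base + 0.1 * rng.normal(size=300),
    base + 0.1 * rng.normal(size=300),
    0.5 * rng.normal(size=300),
])
names = ["a", "b", "c"]

pca_demo = PCA(n_components=2, random_state=0).fit(X_demo)

# row 0 of components_ holds the loadings of the first principal component
first_component = pca_demo.components_[0]
ranked = sorted(zip(names, first_component), key=lambda t: -abs(t[1]))
print(ranked)
```

When many features are near-duplicates of each other, as with lagged ratios of the same series, their loadings on a component will naturally look very similar.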

In [247]:
# I will try feature selection again, with the components
from sklearn.feature_selection import f_classif
f_selector = SelectKBest(f_classif)
f_selector.fit(X_transformed,y)
sorted(zip(f_selector.scores_, f_selector.pvalues_), key=lambda x: -x[0])

[(21.927654334497475, 3.181741053535571e-06),
 (13.05318922114113, 0.00031634456352645784),
 (11.015075335376794, 0.00093314002547307769),
 (9.9186789135457136, 0.0016798683490303575),
 (9.4965307446220972, 0.002109417769852082),
 (9.4347588327470024, 0.0021810310404220402),
 (9.2421692302449259, 0.0024205653683596448),
 (6.6485707440971797, 0.010051686178781383),
 (6.1558981869631584, 0.013244970272769462),
 (6.0524350647583658, 0.01403892205239067),
 (6.0170539442411979, 0.014321546870616764),
 (5.8419871988926495, 0.015808841528129829),
 (2.5196262563728613, 0.11272254180592774),
 (2.1239266203197755, 0.14529768743477661),
 (1.9122548399835833, 0.16699161790610159),
 (1.8746052311892394, 0.17122603365658054),
 (1.7767324378829505, 0.18282509013215742),
 (1.2140390801819667, 0.27077294895529291),
 (1.1102000845152085, 0.29226817222386225),
 (0.76024883966466994, 0.38343955401203944),
 (0.59680722149876297, 0.43996356866310637),
 (0.37282809796579797, 0.54159195689609207),
 (0.3409643

The p-values for the ANOVA F-value test tell a mixed story. In the output above, only the first 12 components have p-values below 0.05; for the rest there is no verifiable relationship with the label. Nevertheless, let's push forward and see how good a prediction we can make with this data.

### Predicting

Let's move on to prediction and see which results we can get with the work that is already done.

In earlier work I tried several classifiers, and Gradient Boosting was the most promising. It can fit complex relationships well and is one of the top-performing algorithms in wide use. Let's start with it and check the results.

In [248]:
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.ensemble import GradientBoostingClassifier as GBC
from sklearn.neighbors import KNeighborsClassifier as kNN
from sklearn.model_selection import cross_val_score
import warnings; warnings.filterwarnings("ignore")

In [249]:
# define cross validation strategy
cv = StratifiedShuffleSplit(n_splits=10, test_size=.1, random_state=42)

In [250]:
# initiate, train and evaluate classifier
clf = GBC(random_state=42)
scores = cross_val_score(clf, X, y, cv=cv, scoring='precision')
print("Precision: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Precision: 0.67 (+/- 0.12)


In [251]:
# same, with kNN
clf = kNN()
scores = cross_val_score(clf, X, y, cv=cv, scoring='precision')
print("Precision: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Precision: 0.67 (+/- 0.15)


67% is a reasonable precision, considering that only about 21.4% of the observations have the label 'Success'. But there is still a lot of variation: taking two standard deviations as an approximate 95% interval, the precision can be as low as about .52. Let's check with the principal components, to see if we can get better results.

I will try different numbers of components.

In [252]:
# components = [20,50,100,150,200,300,400,len(features)]
components = [10,20,50,100, 200, 300]
for n in components:
    # pca
    pca = PCA(n_components = n, random_state=42)
    X_transformed = pca.fit_transform(X)
    
    # predict
    clf = GBC(random_state=42)
    scores = cross_val_score(clf, X_transformed, y, cv=cv, scoring='precision')
    print("Precision {} components: {:0.2f} (+/- {:0.2f})".format(
            n, scores.mean(), scores.std() * 2))

Precision 10 components: 0.65 (+/- 0.20)
Precision 20 components: 0.65 (+/- 0.20)
Precision 50 components: 0.68 (+/- 0.15)
Precision 100 components: 0.68 (+/- 0.23)
Precision 200 components: 0.57 (+/- 0.24)
Precision 300 components: 0.53 (+/- 0.25)


In [253]:
# same, with knn
components = [10, 20, 50, 100, 200, 300]
for n in components:
    # pca
    pca = PCA(n_components = n, random_state=42)
    X_transformed = pca.fit_transform(X)
    
    # predict
    clf = kNN()
    scores = cross_val_score(clf, X_transformed, y, cv=cv, scoring='precision')
    print("Precision {} components: {:0.2f} (+/- {:0.2f})".format(
            n, scores.mean(), scores.std() * 2))

Precision 10 components: 0.65 (+/- 0.15)
Precision 20 components: 0.66 (+/- 0.15)
Precision 50 components: 0.65 (+/- 0.18)
Precision 100 components: 0.67 (+/- 0.15)
Precision 200 components: 0.67 (+/- 0.14)
Precision 300 components: 0.67 (+/- 0.14)


The results with the principal components don't seem any better. I've also tried 20 splits in the cross-validation to reduce the margin of error, and between 100 and 200 components I get a precision of around .38 and a margin of error of around .5, which is too high.
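One caveat with the runs above is that the scaler and the PCA were fitted on the full dataset before cross-validation, so each test fold influenced the transformation. A possible way to avoid that, sketched here on synthetic data from `make_classification` (the data and the `_demo` names are assumptions for illustration), is to wrap the steps in a scikit-learn pipeline so they are refitted inside every split.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic, imbalanced data standing in for the notebook's X and y
X_demo, y_demo = make_classification(
    n_samples=300, n_features=30, n_informative=5,
    weights=[0.8, 0.2], random_state=42)

# scaler and PCA are fitted on the training part of each split only
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=10, random_state=42),
    GradientBoostingClassifier(random_state=42),
)

cv_demo = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
scores_demo = cross_val_score(pipe, X_demo, y_demo, cv=cv_demo, scoring="precision")
print("Precision: {:0.2f} (+/- {:0.2f})".format(scores_demo.mean(), scores_demo.std() * 2))
```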

### Optimizing parameters

I got about 67% precision with a GBC, considering a time span of 19 days, a profit margin of 4.2% and a stop loss of 2.0%. If I can play this strategy consistently, it means profit.

Well, let's test this strategy. Let's begin by testing the classifier on a new test dataset and see if we get the same results.

In [536]:
## Steps to generate df:
# 1. get stock data
df_test = get_stock_data(symbol='PETR4.SA', start_date = '2014-7-1', end_date = '2016-12-31')
print(df_test.shape)
# 2. get market data
df_test = get_market_data(df_test, start_date = '2014-7-1', end_date = '2016-12-31')
print(df_test.shape)
# 3. get stock indicators
df_test = get_tech_indicators(df_test)
print(df_test.shape)
# 4. create features
df_test = create_features(df_test, base=60)
print(df_test.shape)
# 5. create labels
df_test = create_labels(df_test)
print(df_test.shape)
# 6. split features and labels
X_test,  y_test = split_features_labels(df_test)

(584, 6)
(543, 16)
(543, 34)
(423, 2074)
(404, 2075)


In [537]:
from sklearn.metrics import precision_score
# train classifier
# get X and y
X,y = split_features_labels(df)
# scale
scaler = StandardScaler()
X = scaler.fit_transform(X)
# fit classifier
clf = GBC(random_state=42)
clf.fit(X, y)

# predict new data points
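# scale
# note: a standard scaler should be less sensitive to outliers than a min-max scaler
# NOTE: partial_fit below updates the scaler's mean and variance with the test data,
# so test information leaks into the scaling; calling transform alone would avoid that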
scaler.partial_fit(X_test)
X_test = scaler.transform(X_test)
# predict
y_pred = clf.predict(X_test)
precision = precision_score(y_test, y_pred)
print("Precision: {:0.2f}".format(precision))

Precision: 0.55


In [538]:
print(np.bincount(y_test.astype(int)) / len(y_test))
print(np.bincount(y_pred.astype(int)) / len(y_pred))

[ 0.54207921  0.45792079]
[ 0.62376238  0.37623762]


In [539]:
predictions = pd.DataFrame(y_pred, index=df_test.index)

The precision on the test dataset is 0.55, below the cross-validated 0.67 but above the 0.46 share of positive labels in this period. Let's use this classifier to simulate a trading strategy in this period, using only this stock.

We will follow the exact same strategy defined to create the label: buy the stock and hold it for up to 19 days; close the position if the asset appreciates 4.2% or depreciates 2.0%.

In [553]:
class Operation():
    def __init__(self, price, qty, start_date, span, end_date=None):
        self.price = price
        self.qty = qty
        self.start_date = start_date
        self.end_date = end_date
        self.gain_loss = 0
        self.days_left = span
        
    def close(self, end_date, sell_price):
        self.end_date = end_date
        self.gain_loss = (sell_price / self.price) -1        
    
    def report(self):
        print("Start: {}, End: {}, Gain_loss: {:.2f}%, R$ {:.2f}, Days Left: {}".format(
                self.start_date, self.end_date, self.gain_loss*100, self.price*self.qty*self.gain_loss, self.days_left))

class Operator():
    def __init__(self, data, clf, predictions, strategy, capital=0, start_date='2015-01-01', end_date='2015-12-31'):
        self.data = data.copy()
        self.predictions = predictions
        self.clf = clf
        self.capital = capital
        self.stocks = 0.0
        self.period = pd.date_range(start=start_date, end=end_date, freq='D')
        self.operations = []
        self.strategy = strategy
    
    def run(self):
        for day in self.period:
            # needs to be a working day
            if day in self.data.index:
                # check if there any open operations that needs to be closed
                self.check_operations(day)
                # try to predict, scale features before
                # label = self.clf.predict(self.data.loc[day].drop('Label', axis=0))
                # commented above. using predictions done before
                label = self.predictions.loc[day].item()
                if label > 0:
                    if self.capital > 0:
                        self.buy(day)
                # wrap up on the last day: close only the operations still open
                if day == self.period[-1]:
                    for operation in self.operations:
                        if not operation.end_date:
                            self.sell(day, operation)
                            
    def check_operations(self, day):
        for operation in self.operations:
            span, profit, loss = self.strategy
            if not operation.end_date:
                # remove one more day
                operation.days_left -= 1
                # calc valuation
                valuation = (self.data.loc[day, 'Adj Close'] / operation.price)-1
                # sell if it reaches the target or the ends the number of days
                if valuation >= profit or valuation <= loss or operation.days_left<=0:
                    self.sell(day, operation)
    
    def buy(self, day):
        span, _, _ = self.strategy
        price = self.data.loc[day, 'Adj Close']
        qty = self.capital / price
        # update stocks and capital
        self.stocks += qty
        self.capital = 0
        # open operation
        self.operations.append(Operation(price = price, qty = qty, start_date = day, span=span))
        
    def sell(self, day, operation):
        price = self.data.loc[day, 'Adj Close']
        # update stocks and capital
        self.capital += self.stocks * price
        self.stocks = 0 
        # close operation
        operation.close(day, price)
        

In [554]:
op = Operator(df_test, clf, predictions, (19, .042, -.020), 
              capital=100000, start_date='2015-01-01', end_date='2015-12-31')

In [555]:
op.run()

In [556]:
op.capital

113567.8551427652

In [557]:
for operation in op.operations:
    operation.report()

Start: 2015-01-02 00:00:00, End: 2015-01-05 00:00:00, Gain_loss: -8.01%, R$ -8012.82, Days Left: 18
Start: 2015-01-05 00:00:00, End: 2015-01-06 00:00:00, Gain_loss: -3.25%, R$ -2991.45, Days Left: 18
Start: 2015-01-06 00:00:00, End: 2015-01-08 00:00:00, Gain_loss: 10.20%, R$ 9081.20, Days Left: 17
Start: 2015-01-08 00:00:00, End: 2015-01-12 00:00:00, Gain_loss: -2.94%, R$ -2884.62, Days Left: 17
Start: 2015-01-12 00:00:00, End: 2015-01-15 00:00:00, Gain_loss: 4.83%, R$ 4594.02, Days Left: 16
Start: 2015-01-30 00:00:00, End: 2015-02-02 00:00:00, Gain_loss: 5.87%, R$ 5855.43, Days Left: 18
Start: 2015-02-19 00:00:00, End: 2015-02-23 00:00:00, Gain_loss: -2.67%, R$ -2817.11, Days Left: 17
Start: 2015-04-13 00:00:00, End: 2015-04-15 00:00:00, Gain_loss: 8.64%, R$ 8882.98, Days Left: 17
Start: 2015-04-15 00:00:00, End: 2015-04-16 00:00:00, Gain_loss: -3.53%, R$ -3938.68, Days Left: 18
Start: 2015-04-16 00:00:00, End: 2015-04-27 00:00:00, Gain_loss: -2.18%, R$ -2346.45, Days Left: 13
Start: 

In [125]:
# ended year with 13% increase
# could be random
# not enough

### Optimizing parameters with genetic algorithm

Well, the next step is to change my strategy. In the swing trade strategy, if I set the number of days to 10, the profit margin (x) to 5% and the stop loss (y) to 5%, I have to be right more than 50% of the time to actually make money. That leaves very little room: the lower bound of my cross-validated precision is about 52%, and the precision on the test set was 0.55.
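The break-even reasoning above can be made explicit. The sketch below assumes an idealized setting that the notebook does not guarantee: every winning trade earns exactly the profit margin, every losing trade loses exactly the stop loss, and there are no transaction costs. The trade log earlier shows closes well beyond both boundaries, so real results will differ. The function names are hypothetical.

```python
def break_even_precision(profit_margin, stop_loss):
    """Share of winning trades at which the expected return per trade is zero."""
    return stop_loss / (profit_margin + stop_loss)

def expected_return(precision, profit_margin, stop_loss):
    """Expected return per trade under the idealized win/lose assumption."""
    return precision * profit_margin - (1 - precision) * stop_loss

symmetric = break_even_precision(0.05, 0.05)     # +5% / -5% boundaries
asymmetric = break_even_precision(0.042, 0.020)  # +4.2% / -2.0% boundaries
print("break-even, symmetric: {:.3f}".format(symmetric))
print("break-even, asymmetric: {:.3f}".format(asymmetric))
print("expected return at 0.55 precision: {:.4f}".format(
    expected_return(0.55, 0.042, 0.020)))
```

Under these assumptions, a wider profit margin relative to the stop loss lowers the precision needed to break even, which is one reason to search over n, x and y together.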

But I might have a better chance identifying a different variation of this strategy. So as discussed before, let's try to optimize these 3 parameters: n, x and y. We will use a genetic algorithm which will use precision score from the cross validation function as its own scoring function, to determine which variation will perpetuate. 

In order to do that, we will organize the code that creates the dataset into classes and modules, along with the Genetic Algorithm, in the attached files stock.py and strategies.py, and import them into the notebook.

In [454]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

class Backtester():

    def __init__(self, df):
        self.df = df

    def create_labels(self, forward=10, profit_margin=.05, stop_loss=.05):

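        # .ix was removed from pandas; read prices positionally and write labels with .iloc
        if 'Label' not in self.df.columns:
            self.df['Label'] = np.nan
        label_col = self.df.columns.get_loc('Label')
        prices = self.df['Adj Close'].values

        for row in range(self.df.shape[0]-forward):

            # initialize max and min ticks
            max_uptick = 0
            min_downtick = 0

            # move all days forward
            for i in range(1,forward+1):
                delta = (prices[row+i] / prices[row])-1
                if delta > max_uptick:
                    max_uptick = delta
                if delta < min_downtick:
                    min_downtick = delta

            # evaluate ticks against predefined strategy parameters
            # NOTE: a row is labelled 1 only when BOTH boundaries are touched in the window
            if max_uptick >= profit_margin and min_downtick <= -stop_loss:
                self.df.iloc[row, label_col] = 1
            else:
                self.df.iloc[row, label_col] = 0

        self.df.dropna(inplace=True)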

    def prep_data(self):
        features = [x for x in self.df.columns if x != "Label"]
        X = self.df[features].values
        y = self.df['Label'].values
        return X,y

    def score(self, X, y):
        # apply PCA
        pca = PCA(n_components = 10, random_state=42)
        X_transformed = pca.fit_transform(X)
        
        #predict
        clf = GBC(random_state=42)
        cv = StratifiedShuffleSplit(n_splits=10, test_size=.1, random_state=42)
        scores = cross_val_score(clf, X_transformed, y, cv=cv, scoring='precision')

        # return score
        return (scores.mean())

    def evaluate(self, forward, profit_margin, stop_loss):
        self.create_labels(forward=forward, profit_margin=profit_margin, stop_loss=stop_loss)		
        score = self.score(*self.prep_data())
        print("span: {}, profit_margin: {:.3f}, stop_loss: {:.3f} --  score: {:.3f}".format(
            forward, profit_margin, stop_loss, score))
        return score

In [455]:
class Strategy():

    def __init__(self, span = 7, profit_margin = .06, stop_loss = .04):
        self.span = span
        self.profit_margin = profit_margin
        self.stop_loss = stop_loss
        self.mutations = [
            self.increase_span,
            self.decrease_span,
            self.increase_stop_loss,
            self.decrease_stop_loss,
            self.increase_profit_margin,
            self.decrease_profit_margin
        ]

    def mutate(self):
        np.random.choice(self.mutations)()

    def inform_params(self):
        return self.span, self.profit_margin, self.stop_loss

    def report(self):
        print("span: {}, profit_margin: {:.3f}, stop_loss: {:.3f}".format(
            self.span, self.profit_margin, self.stop_loss))

    # add a random component to mutation
    # allow 'wild' mutations

    def increase_span(self):
        self.span += 2

    def decrease_span(self):
        self.span -= 2

    def increase_profit_margin(self):
        self.profit_margin = self.profit_margin * 1.3

    def decrease_profit_margin(self):
        self.profit_margin = self.profit_margin * .7

    def increase_stop_loss(self):
        self.stop_loss = self.stop_loss * 1.3

    def decrease_stop_loss(self):
        self.stop_loss = self.stop_loss * .7


class GA():

    def __init__(self, df):
        """ Seed 2 initial strategies and an instance of backtester """

        self.backtester = Backtester(df.copy())

        self.strategies = pd.DataFrame(np.zeros((2,2)), columns = ['strat', 'score'])
        self.strategies['strat'] = self.strategies['strat'].apply(lambda x:Strategy())
        self.strategies['score'] = self.strategies['strat'].apply(self.evaluate)


    def fit(self, cycles):
        """ Run evolution for n cycles """
        i = 0
        while i < cycles:
            self.reproduce()
            # self.select()

            i += 1

    def best_strategy(self):
        """ Sort and return top perform in available strategies """

        self.strategies =  self.strategies.sort_values(by='score', ascending=False)
        self.strategies.iloc[0, 0].report()
        print("score: {:.4f}".format(self.strategies.iloc[0, 1]))

    def evaluate(self, strat):
        """ To implement: 
            Should evaluate only for those which value is zero, to avoid the cost of re-evaluating 
        """
        return self.backtester.evaluate(*strat.inform_params())

    def reproduce(self):
        """ Create new strategy based on its parents. """

        # sort and take top two performers in the list
        parents = self.strategies.sort_values(by='score', ascending=False).iloc[:2, 0]

        # create six offsprings
        for _ in range(6):
            stratN = self.crossover(*parents)
            stratN.mutate()
            # setting with enlargement: .loc with a new integer label appends a row (.ix was removed from pandas)
            self.strategies.loc[self.strategies.shape[0]] = [stratN, self.evaluate(stratN)]

            # remove identical offspring, there is no use

    def crossover(self, stratA, stratB):
        """ Choose between parents attributes randomly. Return new strategy """

        span = np.random.choice([stratA.span, stratB.span])
        stop_loss = np.random.choice([stratA.stop_loss, stratB.stop_loss])
        profit_margin = np.random.choice([stratA.profit_margin, stratB.profit_margin])
        return Strategy(span=span, stop_loss=stop_loss, profit_margin=profit_margin)

    def select(self):
        """ Remove strategies which are bad performers 
            Define bad as scoring below 75 percent of the best """

        # define cut off score as 75% of the max score
        cut_off = self.strategies['score'].max() * .75
        # remove strategies with scores below the cut-off
        self.strategies = self.strategies[self.strategies['score'] >= cut_off]



In [456]:
# ga = GA(df)
# ga.fit(20)
# ga.best_strategy()

span: 7, profit_margin: 0.060, stop_loss: 0.040 --  score: 0.000
span: 7, profit_margin: 0.060, stop_loss: 0.040 --  score: 0.000
span: 7, profit_margin: 0.060, stop_loss: 0.052 --  score: 0.000
span: 9, profit_margin: 0.060, stop_loss: 0.040 --  score: 0.100
span: 7, profit_margin: 0.060, stop_loss: 0.028 --  score: 0.000
span: 5, profit_margin: 0.060, stop_loss: 0.040 --  score: 0.000
span: 7, profit_margin: 0.060, stop_loss: 0.052 --  score: 0.000
span: 9, profit_margin: 0.060, stop_loss: 0.040 --  score: 0.100
span: 11, profit_margin: 0.060, stop_loss: 0.040 --  score: 0.000
span: 11, profit_margin: 0.060, stop_loss: 0.040 --  score: 0.000
span: 9, profit_margin: 0.060, stop_loss: 0.052 --  score: 0.000
span: 9, profit_margin: 0.042, stop_loss: 0.040 --  score: 0.153
span: 11, profit_margin: 0.060, stop_loss: 0.040 --  score: 0.000
span: 9, profit_margin: 0.060, stop_loss: 0.052 --  score: 0.000
span: 9, profit_margin: 0.029, stop_loss: 0.040 --  score: 0.025
span: 9, profit_margin

The GA is not helping much. I will leave this as an option to optimize in the end, but for now, I will go back to the basics.

So far I have gone down the PCA route. Let's try feature selection and see what results we can get.

### Feature Selection

What is the optimal number of features?
Which classifier should I use?
Which parameters should I use?
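One way to approach the first question is to put `SelectKBest` inside a pipeline and compare the cross-validated precision for a few values of `k`. This is only a sketch on synthetic data. The candidate values of `k`, the reduced `n_estimators`, and the `_demo` and `results` names are assumptions for illustration, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.pipeline import make_pipeline

# synthetic data standing in for the notebook's features and labels
X_demo, y_demo = make_classification(
    n_samples=300, n_features=40, n_informative=5,
    weights=[0.7, 0.3], random_state=0)

cv_demo = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

results = {}
for k in (5, 10, 20, 40):
    # the selector is refitted inside each split, so test rows never guide selection
    pipe = make_pipeline(
        SelectKBest(f_classif, k=k),
        GradientBoostingClassifier(n_estimators=50, random_state=0),
    )
    scores_k = cross_val_score(pipe, X_demo, y_demo, cv=cv_demo, scoring="precision")
    results[k] = scores_k.mean()

best_k = max(results, key=results.get)
print(results, best_k)
```

The same loop could swap in other classifiers to compare them under identical splits.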

In [224]:
X.mean(), y.mean()

(-1.2942490633153009e-18, 0.29206349206349208)

In [225]:
X_test.mean(), y_test.mean()

(3.3895070900967049e-18, 0.44680851063829785)

### Removed Code