# Internal data with time courses

This notebook imports the data from the Data Preparation step for 20 tickers, with time courses of 10, 20, 30, 40, 50, 60, 70, 80 and 90 days. Training and test data are generated from this data and are then used with the following five sklearn methods:
- Support Vector Machine
- Linear Discriminant Analysis
- Gradient Boosting
- Random Forest
- KNN

# Content
 
 1. Import dependencies
 2. Load data
 3. Removing values of type NaN or inf
 4. Splitting data into training and testing
 5. Support Vector Machine
 6. Linear Discriminant Analysis
 7. Gradient Boosting
 8. Random Forest
 9. KNN
 10. Result

<hr>

# 1. Import dependencies

In [1]:
import numpy as np
import pandas as pd
import datetime
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn import model_selection
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

# 2. Load data

In [2]:
training = pd.read_csv('prepared data/Training_set-sklearn_20 Ticker.csv')

In [3]:
training.head()

Unnamed: 0.1,Unnamed: 0,Entwicklungsrate Volume t+10,Entwicklungsrate Volume t+20,Entwicklungsrate Volume t+30,Entwicklungsrate Volume t+40,Entwicklungsrate Volume t+50,Entwicklungsrate Volume t+60,Entwicklungsrate Volume t+70,Entwicklungsrate Volume t+80,Entwicklungsrate Volume t+90,Entwicklungsrate Preis t+10,Entwicklungsrate Preis t+20,Entwicklungsrate Preis t+30,Entwicklungsrate Preis t+40,Entwicklungsrate Preis t+50,Entwicklungsrate Preis t+60,Entwicklungsrate Preis t+70,Entwicklungsrate Preis t+80,Entwicklungsrate Preis t+90,Y
0,0,1.051417,0.246943,-0.115904,0.163367,0.071779,-0.288287,-0.216916,-0.270543,-0.241494,-0.170821,-0.322701,-0.343678,-0.350806,-0.352104,-0.266236,-0.252081,-0.12711,-0.05369,-1
1,1,1.087555,0.200314,0.070344,0.891826,0.27377,0.22623,-0.171043,0.096679,-0.022362,-0.169965,-0.310988,-0.356183,-0.374539,-0.356183,-0.24818,-0.242617,-0.113233,-0.02761,-1
2,2,0.184439,-0.382323,-0.309641,0.226367,-0.172212,-0.282219,-0.453389,-0.217269,-0.370241,-0.125118,-0.281185,-0.332625,-0.34045,-0.296378,-0.201242,-0.180618,-0.05101,0.015919,0
3,3,0.174965,-0.449355,-0.533133,-0.410366,-0.512335,-0.492534,-0.660134,-0.563951,-0.578607,-0.135234,-0.256988,-0.30396,-0.300822,-0.253992,-0.150355,-0.12571,0.002911,0.057287,0
4,4,1.213502,0.032456,-0.384365,-0.032889,-0.403611,-0.439291,-0.326188,-0.339761,-0.497466,-0.246617,-0.28186,-0.299864,-0.29709,-0.257006,-0.135311,-0.09234,0.02041,0.068052,0


In [4]:
training.tail()

Unnamed: 0.1,Unnamed: 0,Entwicklungsrate Volume t+10,Entwicklungsrate Volume t+20,Entwicklungsrate Volume t+30,Entwicklungsrate Volume t+40,Entwicklungsrate Volume t+50,Entwicklungsrate Volume t+60,Entwicklungsrate Volume t+70,Entwicklungsrate Volume t+80,Entwicklungsrate Volume t+90,Entwicklungsrate Preis t+10,Entwicklungsrate Preis t+20,Entwicklungsrate Preis t+30,Entwicklungsrate Preis t+40,Entwicklungsrate Preis t+50,Entwicklungsrate Preis t+60,Entwicklungsrate Preis t+70,Entwicklungsrate Preis t+80,Entwicklungsrate Preis t+90,Y
106004,1463,0.408541,0.572405,4.638633,0.607884,-0.028121,0.977924,0.147832,0.322339,0.525493,0.028053,0.020509,0.058934,0.034771,0.076379,0.028996,-0.003536,0.049505,0.067657,0
106005,1464,-0.457371,0.042873,8.111211,-0.396274,-0.122031,0.033355,-0.478116,-0.656199,0.535389,0.031761,0.024239,-0.038806,0.047164,0.103881,0.011463,0.016119,0.059343,0.109731,0
106006,1465,1.363731,3.216244,6.288118,0.969049,0.978604,1.423764,0.296843,0.599709,2.028666,0.030314,0.0373,-0.090468,0.041563,0.108585,-0.008999,0.018591,0.038484,0.117466,0
106007,1466,4.37196,1.755974,3.06365,1.012476,1.495031,1.518926,0.479805,0.345316,2.654472,0.060413,0.064307,-0.091681,0.025015,0.093215,-0.016637,0.031386,0.027611,0.134041,0
106008,1467,1.039925,0.301394,0.662747,0.275842,0.130517,0.622967,-0.053426,0.446138,4.126597,0.058342,0.036997,-0.075537,0.039962,0.097,-0.025377,0.042927,0.059765,0.136725,0


In [5]:
training.shape

(106009, 20)

# 3. Removing values of type NaN or inf

In [6]:
training.dtypes

Unnamed: 0                        int64
Entwicklungsrate Volume t+10    float64
Entwicklungsrate Volume t+20    float64
Entwicklungsrate Volume t+30    float64
Entwicklungsrate Volume t+40    float64
Entwicklungsrate Volume t+50    float64
Entwicklungsrate Volume t+60    float64
Entwicklungsrate Volume t+70    float64
Entwicklungsrate Volume t+80    float64
Entwicklungsrate Volume t+90    float64
Entwicklungsrate Preis t+10     float64
Entwicklungsrate Preis t+20     float64
Entwicklungsrate Preis t+30     float64
Entwicklungsrate Preis t+40     float64
Entwicklungsrate Preis t+50     float64
Entwicklungsrate Preis t+60     float64
Entwicklungsrate Preis t+70     float64
Entwicklungsrate Preis t+80     float64
Entwicklungsrate Preis t+90     float64
Y                                 int64
dtype: object

In [7]:
# The calculations in the Data Preparation step produced cells with "NaN" and "inf" values. Sklearn methods cannot handle
# these values, so the rows that contain them are removed.
training[training.isnull().any(axis=1)].head()

Unnamed: 0.1,Unnamed: 0,Entwicklungsrate Volume t+10,Entwicklungsrate Volume t+20,Entwicklungsrate Volume t+30,Entwicklungsrate Volume t+40,Entwicklungsrate Volume t+50,Entwicklungsrate Volume t+60,Entwicklungsrate Volume t+70,Entwicklungsrate Volume t+80,Entwicklungsrate Volume t+90,Entwicklungsrate Preis t+10,Entwicklungsrate Preis t+20,Entwicklungsrate Preis t+30,Entwicklungsrate Preis t+40,Entwicklungsrate Preis t+50,Entwicklungsrate Preis t+60,Entwicklungsrate Preis t+70,Entwicklungsrate Preis t+80,Entwicklungsrate Preis t+90,Y
69148,0,,,,,,,,,,,,,,,,,,,0
76752,49,inf,inf,inf,inf,,inf,inf,inf,inf,-0.160428,-0.093583,0.018717,0.066845,0.048128,0.066845,0.491979,0.794118,0.94385,0
76764,61,inf,inf,inf,inf,inf,inf,inf,inf,,0.14966,0.316327,0.360544,0.421769,0.29932,1.003401,1.503401,1.159864,1.312925,1
76858,155,,inf,inf,inf,inf,inf,inf,inf,inf,-0.121212,0.030303,0.089394,0.151515,0.257576,0.318182,0.419697,0.69697,0.481818,1
76887,184,inf,inf,inf,,inf,inf,inf,inf,inf,0.026667,0.16,0.16,0.24,0.453333,0.306667,0.278667,0.288,0.293333,1


In [8]:
# Treat +inf and -inf as missing values, then drop every row that contains a missing value.
# replace() returns a new DataFrame, so the result has to be assigned back to `training`.
training = training.replace([np.inf, -np.inf], np.nan).dropna(axis=0)

In [9]:
training.head()

Unnamed: 0.1,Unnamed: 0,Entwicklungsrate Volume t+10,Entwicklungsrate Volume t+20,Entwicklungsrate Volume t+30,Entwicklungsrate Volume t+40,Entwicklungsrate Volume t+50,Entwicklungsrate Volume t+60,Entwicklungsrate Volume t+70,Entwicklungsrate Volume t+80,Entwicklungsrate Volume t+90,Entwicklungsrate Preis t+10,Entwicklungsrate Preis t+20,Entwicklungsrate Preis t+30,Entwicklungsrate Preis t+40,Entwicklungsrate Preis t+50,Entwicklungsrate Preis t+60,Entwicklungsrate Preis t+70,Entwicklungsrate Preis t+80,Entwicklungsrate Preis t+90,Y
0,0,1.051417,0.246943,-0.115904,0.163367,0.071779,-0.288287,-0.216916,-0.270543,-0.241494,-0.170821,-0.322701,-0.343678,-0.350806,-0.352104,-0.266236,-0.252081,-0.12711,-0.05369,-1
1,1,1.087555,0.200314,0.070344,0.891826,0.27377,0.22623,-0.171043,0.096679,-0.022362,-0.169965,-0.310988,-0.356183,-0.374539,-0.356183,-0.24818,-0.242617,-0.113233,-0.02761,-1
2,2,0.184439,-0.382323,-0.309641,0.226367,-0.172212,-0.282219,-0.453389,-0.217269,-0.370241,-0.125118,-0.281185,-0.332625,-0.34045,-0.296378,-0.201242,-0.180618,-0.05101,0.015919,0
3,3,0.174965,-0.449355,-0.533133,-0.410366,-0.512335,-0.492534,-0.660134,-0.563951,-0.578607,-0.135234,-0.256988,-0.30396,-0.300822,-0.253992,-0.150355,-0.12571,0.002911,0.057287,0
4,4,1.213502,0.032456,-0.384365,-0.032889,-0.403611,-0.439291,-0.326188,-0.339761,-0.497466,-0.246617,-0.28186,-0.299864,-0.29709,-0.257006,-0.135311,-0.09234,0.02041,0.068052,0
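
The cleaning step above can be checked on a small example. The following is a minimal sketch on a hypothetical toy frame (the column names `volume_rate` and `price_rate` are made up for illustration and are not part of the prepared data). It shows that `replace` returns a new frame that must be assigned back, and that a finiteness check confirms the result.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the prepared training data.
df = pd.DataFrame({
    "volume_rate": [0.5, np.inf, np.nan, -0.2, -np.inf],
    "price_rate": [0.1, 0.2, 0.3, np.nan, 0.4],
    "Y": [0, 1, 0, -1, 1],
})

# replace() returns a new frame; calling dropna(inplace=True) on that
# temporary result would leave `df` itself unchanged, so assign instead.
clean = df.replace([np.inf, -np.inf], np.nan).dropna(axis=0)

# Sanity check: every remaining feature value is finite.
features = clean[["volume_rate", "price_rate"]].to_numpy()
all_finite = bool(np.isfinite(features).all())
print(clean.shape, all_finite)
```

Only the first toy row survives, because every other row contains a NaN or an infinite value in at least one column.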


# 4. Splitting data into training and testing

In [10]:
x = training[['Entwicklungsrate Preis t+10', 
              'Entwicklungsrate Preis t+20',
              'Entwicklungsrate Preis t+30',
              'Entwicklungsrate Preis t+40',
              'Entwicklungsrate Preis t+50',
              'Entwicklungsrate Preis t+60',
              'Entwicklungsrate Preis t+70',
              'Entwicklungsrate Preis t+80',
              'Entwicklungsrate Preis t+90',
              'Entwicklungsrate Volume t+10',
              'Entwicklungsrate Volume t+20',
              'Entwicklungsrate Volume t+30',
              'Entwicklungsrate Volume t+40',
              'Entwicklungsrate Volume t+50',
              'Entwicklungsrate Volume t+60',
              'Entwicklungsrate Volume t+70',
              'Entwicklungsrate Volume t+80',
              'Entwicklungsrate Volume t+90']]
y = training['Y']

In [11]:
x.shape, y.shape

((105800, 18), (105800,))

In [12]:
x.max()

Entwicklungsrate Preis t+10        1.962963
Entwicklungsrate Preis t+20        1.923077
Entwicklungsrate Preis t+30        4.428571
Entwicklungsrate Preis t+40        4.071429
Entwicklungsrate Preis t+50        5.142857
Entwicklungsrate Preis t+60        4.928572
Entwicklungsrate Preis t+70        5.118033
Entwicklungsrate Preis t+80        5.462329
Entwicklungsrate Preis t+90        4.313725
Entwicklungsrate Volume t+10     446.000000
Entwicklungsrate Volume t+20     866.000000
Entwicklungsrate Volume t+30     652.000000
Entwicklungsrate Volume t+40     932.500000
Entwicklungsrate Volume t+50     855.000000
Entwicklungsrate Volume t+60    1719.000000
Entwicklungsrate Volume t+70    1089.000000
Entwicklungsrate Volume t+80     562.000000
Entwicklungsrate Volume t+90     637.000000
dtype: float64

In [13]:
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7, random_state=42)
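
Since `Y` takes the values -1, 0 and 1, an accuracy figure is easier to interpret next to the share of the most frequent class. The sketch below uses synthetic labels (the class proportions are an assumption for illustration, not taken from the data) to show a stratified split via the `stratify` argument of `train_test_split` and a majority-class baseline via `DummyClassifier`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
# Hypothetical imbalanced labels in {-1, 0, 1}; the features are pure noise.
y_toy = rng.choice([-1, 0, 1], size=1000, p=[0.1, 0.8, 0.1])
x_toy = rng.normal(size=(1000, 4))

# stratify=y_toy keeps the class proportions nearly equal in both splits.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_toy, y_toy, train_size=0.7, random_state=42, stratify=y_toy
)

# Baseline that always predicts the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent").fit(x_tr, y_tr)
baseline_acc = baseline.score(x_te, y_te)
majority_share = float(np.mean(y_te == 0))
print(round(baseline_acc, 3), round(majority_share, 3))
```

A classifier is only informative to the extent that it beats this baseline on the same test split.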



# 5. Support Vector Machine

In [14]:
# Use a separate name so that the imported sklearn `svm` module is not overwritten.
svm_clf = svm.SVC(kernel='linear', C=100, max_iter=100000000)

In [15]:
svm_clf.fit(x_train, y_train)

SVC(C=100, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=100000000, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [16]:
svm_clf.score(x_test, y_test)

0.8540642722117202
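
The output of `x.max()` shows volume features reaching values above 1000 while the price features stay below 6. Margin-based models such as SVMs are sensitive to such scale differences, and a kernel `SVC` can be slow on roughly 100,000 rows. As one possible alternative, not the method used in this notebook, the sketch below standardizes the features and fits `LinearSVC`, which optimizes a related but not identical linear objective. The data is synthetic and all parameter values are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in data; one column is inflated to imitate the volume features.
x_toy, y_toy = make_classification(
    n_samples=600, n_features=6, n_informative=4,
    n_clusters_per_class=1, class_sep=2.0, random_state=0,
)
x_toy[:, 0] *= 1000.0

x_tr, x_te, y_tr, y_te = train_test_split(
    x_toy, y_toy, train_size=0.7, random_state=42
)

# The scaler is fitted on the training split only and reused for the test split.
scaled_svm = make_pipeline(
    StandardScaler(),
    LinearSVC(C=1.0, max_iter=10000, random_state=0),
)
scaled_svm.fit(x_tr, y_tr)
svm_acc = scaled_svm.score(x_te, y_te)
print(round(svm_acc, 3))
```

Whether this reduces the running time or changes the accuracy on the ticker data would have to be measured.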

# 6. Linear Discriminant Analysis

In [17]:
lda = LinearDiscriminantAnalysis(solver='lsqr', shrinkage='auto')

In [18]:
lda.fit(x_train, y_train)

LinearDiscriminantAnalysis(n_components=None, priors=None, shrinkage='auto',
              solver='lsqr', store_covariance=False, tol=0.0001)

In [19]:
lda.score(x_test, y_test)

0.8511972274732199

# 7. Gradient Boosting

In [20]:
scaler = MinMaxScaler()

In [21]:
num_trees = 10
# No random_state here: without shuffle=True it has no effect, and recent scikit-learn versions reject it.
kfold = model_selection.KFold(n_splits=10)
model = GradientBoostingClassifier(n_estimators=num_trees, random_state=42)
model.fit(x_train, y_train)

GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.1, loss='deviance', max_depth=3,
              max_features=None, max_leaf_nodes=None,
              min_impurity_decrease=0.0, min_impurity_split=None,
              min_samples_leaf=1, min_samples_split=2,
              min_weight_fraction_leaf=0.0, n_estimators=10,
              presort='auto', random_state=42, subsample=1.0, verbose=0,
              warm_start=False)

In [22]:
steps = [('scale', scaler), ('GB', model)]

In [23]:
pipeline = Pipeline(steps)

In [24]:
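# Mean accuracy over the 10 folds; the scaler and the model are refitted on the training part of each fold.
gb = model_selection.cross_val_score(pipeline, x_train, y_train, cv=kfold).mean()
print(gb)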

0.9026870105320011
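
`cross_val_score` returns one score per fold, so the spread across folds can be reported next to the mean. In recent scikit-learn releases, `KFold` accepts `random_state` only together with `shuffle=True`. The sketch below illustrates both points on synthetic data; the fold count and dataset are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data.
x_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=0)

# Shuffled folds; random_state makes the shuffling reproducible.
cv = KFold(n_splits=5, shuffle=True, random_state=42)

pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("GB", GradientBoostingClassifier(n_estimators=10, random_state=42)),
])

# One accuracy value per fold.
fold_scores = cross_val_score(pipe, x_toy, y_toy, cv=cv)
print(fold_scores.mean(), fold_scores.std())
```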


# 8. Random Forest

In [25]:
rf = RandomForestClassifier(n_estimators=100, random_state=42)

In [26]:
rf.fit(x_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=42, verbose=0, warm_start=False)

In [27]:
rf.score(x_test, y_test)

0.9165406427221172

# 9. KNN

In [28]:
knn = KNeighborsClassifier()

In [29]:
knn.fit(x_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

In [30]:
knn.score(x_test, y_test)

0.7981726528040327

# 10. Result

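Including time courses improved the accuracy significantly. Random Forest gives the best result, with a test accuracy of about 0.917. Compared to "internal data without time courses", this is an improvement of approx. 11 % for Random Forest. Note that the Gradient Boosting score (about 0.903) is the mean of a 10-fold cross-validation on the training split, whereas the other scores are measured on the test split, so it is not directly comparable. The running time is already very high, at approx. 3 hours for this notebook.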

In this iteration, the parameters for the sklearn methods were determined only by trial and error. Grid Search can determine the best parameters from a given set of candidates automatically. Therefore, in the 2nd iteration the best parameter input is determined with KNN and Grid Search for a selection of tickers.
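
As a rough preview of that approach, the following sketch runs `GridSearchCV` over a small KNN parameter grid on synthetic data. The candidate values, the scaling step and the dataset are assumptions for illustration and are not the settings of the 2nd iteration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data.
x_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=0)

# KNN is distance-based, so the features are scaled inside the pipeline.
pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("knn", KNeighborsClassifier()),
])

# Hypothetical candidate values; "<step>__<parameter>" addresses a pipeline step.
param_grid = {
    "knn__n_neighbors": [3, 5, 11],
    "knn__weights": ["uniform", "distance"],
}

# Every combination is evaluated with 5-fold cross-validation.
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(x_toy, y_toy)
print(search.best_params_, search.best_score_)
```

The number of model fits grows with the product of the candidate counts times the number of folds, which matters given the running times reported above.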