# Inclusion of external data per day

This notebook imports the data produced by the Data Preparation step for the ticker AAPL, including external data per day. Training and test sets are generated from this data, and the following five sklearn methods are fitted and evaluated on them:
- Support Vector Machine
- Linear Discriminant Analysis
- Gradient Boosting
- Random Forest
- KNN

# Content
 1. Import dependencies
 2. Load data
 3. Splitting data into training and test sets
 4. Support Vector Machine
 5. Linear Discriminant Analysis
 6. Gradient Boosting
 7. Random Forest
 8. KNN
 9. Result

<hr>

# 1. Import dependencies

In [1]:
import numpy as np
import pandas as pd
import datetime
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn import model_selection
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

# 2. Load data

In [2]:
# The file uses ',' as the field separator and '.' as the decimal mark, so decimal=',' is not set here
merged = pd.read_csv('prepared data/Data_Preperation_one_ticker_with_inclusion_of_external_data_per_day.csv', sep=',')

In [3]:
merged.head()

Unnamed: 0,aaplopen,aaplclose,aaplvolume,ebit,revenues,net profit,AAPL
0,-0.9295518507117048,-0.2999405971549169,-0.1046078295209929,-1.0,-1.0,-1.0,1
1,-0.3006970933348756,-0.2951164608410198,-0.4854372787672025,-1.0,-1.0,-1.0,1
2,-0.2992688411311637,-0.2854244176957954,-0.5826358871000477,-1.0,-1.0,-1.0,1
3,-0.2848925607436917,-0.2849335626074728,-0.5881772798072009,-1.0,-1.0,-1.0,1
4,-0.9929065244873378,-0.9285380647178374,-0.664703175878881,-1.0,-1.0,-1.0,1


# 3. Splitting data into training and test sets

In [4]:
x = merged[['aaplopen', 'aaplclose', 'aaplvolume', 'ebit', 'revenues', 'net profit']]
y = merged['AAPL']

In [5]:
x.shape, y.shape

((1232, 6), (1232,))

In [6]:
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7, random_state=42)
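The split sizes follow directly from `train_size=0.7` and the 1232 rows reported above. The sketch below reproduces that arithmetic under the assumption that the training share is rounded down and the remainder goes to the test set; `split_sizes` is a hypothetical helper, not part of sklearn.

```python
import math


def split_sizes(n_samples, train_size):
    """Return (n_train, n_test) for a fractional train_size.

    Assumption: the training share is rounded down and every
    remaining sample is assigned to the test set.
    """
    n_train = math.floor(train_size * n_samples)
    n_test = n_samples - n_train
    return n_train, n_test


n_train, n_test = split_sizes(1232, 0.7)
print(n_train, n_test)  # 862 370
```

A test set of 370 rows is consistent with the accuracies reported further down, which are all multiples of 1/370.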



# 4. Support Vector Machine

In [7]:
# Use a separate name for the estimator so the imported svm module is not overwritten
svc = svm.SVC(kernel='linear', C=1, max_iter=100000000)

In [8]:
svc.fit(x_train, y_train)

SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=100000000, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [9]:
svc.score(x_test, y_test)

0.7108108108108108
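For a classifier, `score` returns the mean accuracy, that is, the share of test samples whose predicted label matches the true label. The sketch below shows that computation on made-up toy labels and then converts the reported score back into a number of correct predictions. The test set size of 370 is an assumption derived from 1232 rows and `train_size=0.7`; `accuracy` is a hypothetical helper.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


# Toy labels, invented for illustration only
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]
print(accuracy(y_true, y_pred))  # 0.6

# Reported SVM score expressed as a count of correct predictions
n_test = 370  # assumption: 1232 rows minus 862 training rows
correct = round(0.7108108108108108 * n_test)
print(correct, "of", n_test)  # 263 of 370
```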

# 5. Linear Discriminant Analysis

In [10]:
lda = LinearDiscriminantAnalysis()

In [11]:
lda.fit(x_train, y_train)

LinearDiscriminantAnalysis(n_components=None, priors=None, shrinkage=None,
              solver='svd', store_covariance=False, tol=0.0001)

In [12]:
lda.score(x_test, y_test)

0.7027027027027027

# 6. Gradient Boosting

In [13]:
scaler = MinMaxScaler()

In [14]:
num_trees = 10
# random_state only takes effect together with shuffle=True; newer sklearn versions reject it without shuffling
kfold = model_selection.KFold(n_splits=10)
model = GradientBoostingClassifier(n_estimators=num_trees, random_state=42)

In [15]:
steps = [('scale', scaler), ('GB', model)]

In [16]:
pipeline = Pipeline(steps)

In [17]:
# cross_val_score returns one accuracy per fold; average them once
gb = model_selection.cross_val_score(pipeline, x_train, y_train, cv=kfold).mean()
print(gb)

0.7539160652232022
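Unlike the other sections, this figure is the mean of a 10-fold cross-validation on the training data, not an accuracy on the held-out test set, so it is not directly comparable with the other scores. The sketch below illustrates how a K-fold scheme without shuffling divides the samples into consecutive folds. The training set size of 862 is an assumption derived from the split above, and `kfold_sizes` is a hypothetical helper that mirrors the usual rule of giving the first folds one extra sample when the division is not even.

```python
def kfold_sizes(n_samples, n_splits):
    """Sizes of consecutive validation folds for an unshuffled K-fold split."""
    base, extra = divmod(n_samples, n_splits)
    # The first `extra` folds receive one additional sample
    return [base + 1 if i < extra else base for i in range(n_splits)]


sizes = kfold_sizes(862, 10)  # assumption: 862 training rows
print(sizes)  # [87, 87, 86, 86, 86, 86, 86, 86, 86, 86]

# In each round one fold is held out and the model is fitted on the rest
train_sizes = [sum(sizes) - s for s in sizes]
print(train_sizes[0], train_sizes[-1])  # 775 776
```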


# 7. Random Forest

In [18]:
rf = RandomForestClassifier(n_estimators=100, random_state=42)

In [19]:
rf.fit(x_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=42, verbose=0, warm_start=False)

In [20]:
rf.score(x_test, y_test)

0.7891891891891892
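The fitted forest above uses `bootstrap=True` with 100 trees, so each tree is trained on a sample drawn with replacement from the training data. The following sketch only illustrates that sampling step with the standard library; it is not sklearn's implementation. The training set size of 862 is an assumption, and `bootstrap_indices` is a hypothetical helper.

```python
import random


def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


rng = random.Random(42)
n_train = 862  # assumption: size of the training set
indices = bootstrap_indices(n_train, rng)

# Because of the replacement, only part of the rows appear in a given sample
unique_share = len(set(indices)) / n_train
print(len(indices), round(unique_share, 2))
```

On average roughly 63 % of the rows end up in a bootstrap sample of this kind, which is why the individual trees differ from each other.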

# 8. KNN

In [21]:
knn = KNeighborsClassifier()

In [22]:
knn.fit(x_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

In [23]:
knn.score(x_test, y_test)

0.7810810810810811
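With the defaults shown above (`n_neighbors=5`, `p=2`, `weights='uniform'`), a sample is assigned the most frequent label among its nearest training points by Euclidean distance. The sketch below shows this idea in plain Python on invented toy points. It is a simplified illustration, not sklearn's implementation, and tie handling may differ; `knn_predict` is a hypothetical helper.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=5):
    """Predict the label of `query` by an unweighted vote of its k nearest neighbours."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy data, invented for illustration only
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.9, 1.0), (1.0, 0.8), (0.8, 0.9)]
labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(points, labels, (0.15, 0.15), k=3))  # 0
print(knn_predict(points, labels, (0.9, 0.9), k=3))    # 1
```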

# 9. Result

Some comparisons of different tickers with and without external data show only marginal differences in the results of the sklearn methods. In addition, external data is often incomplete or not available at all, and researching it is very time-consuming. The inclusion of external data is therefore not worthwhile, and it is not possible for all available tickers.
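For reference, the scores reported in sections 4 to 8 of this notebook can be collected and ranked as in the sketch below. The values are copied from the cell outputs above. The Gradient Boosting value is a cross-validation mean on the training data, while the others are test set accuracies, so the ranking is only a rough orientation.

```python
# Scores copied from the outputs of sections 4 to 8
reported = {
    "Support Vector Machine": 0.7108108108108108,
    "Linear Discriminant Analysis": 0.7027027027027027,
    "Gradient Boosting (10-fold CV on training data)": 0.7539160652232022,
    "Random Forest": 0.7891891891891892,
    "KNN": 0.7810810810810811,
}

# Sort from highest to lowest score
ranking = sorted(reported.items(), key=lambda item: item[1], reverse=True)

for name, score in ranking:
    print(f"{name:<50} {score:.4f}")
```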