# Selecting features based on our academic education

We defined six groups of KPIs that, based on our academic education, should have predictive power for returns. We want to see whether the model's performance improves when the number of features is limited to these KPIs.

### Multiples
* EV/EBITDA (Enterprise Value over EBITDA)
* P/E (PE ratio)

### Carhart Model
* Price-to-book (priceToBookRatio)
* Momentum (3Y Shareholders Equity Growth (per Share))

### Income Statement
* Revenue
* Revenue Growth
* R&D to Revenue

### Balance
* Debt to Equity (debtEquityRatio)

### Investments
* Long-term investments
* Goodwill and Intangible Assets

### Additional
* ROIC
* ROE
* Inventory turnover (inventoryTurnover)
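
The grouping above can also be kept in code as a mapping from group name to column names, which makes it easy to check that every selected column exists in the data before subsetting. This is a minimal sketch, assuming the column names used for `X` later in this notebook; the `feature_groups` dict and the toy `demo_df` frame are illustrative and not part of the original analysis.

```python
import pandas as pd

# Feature groups as listed above, using the dataset's column names
feature_groups = {
    'Multiples': ['Enterprise Value over EBITDA', 'PE ratio'],
    'Carhart Model': ['priceToBookRatio', '3Y Shareholders Equity Growth (per Share)'],
    'Income Statement': ['Revenue', 'Revenue Growth', 'R&D to Revenue'],
    'Balance': ['debtEquityRatio'],
    'Investments': ['Long-term investments', 'Goodwill and Intangible Assets'],
    'Additional': ['ROIC', 'ROE', 'inventoryTurnover'],
}

# Flatten the groups into a single list of selected features
selected_features = [col for cols in feature_groups.values() for col in cols]

# Toy frame standing in for the real data (hypothetical: 'ROIC' is left out on purpose)
demo_df = pd.DataFrame(columns=[c for c in selected_features if c != 'ROIC'])

# Report any selected columns that are missing from the frame
missing = [col for col in selected_features if col not in demo_df.columns]
print(len(selected_features), 'features selected; missing:', missing)
```

With the real data, `demo_df` would be replaced by `df`, and an empty `missing` list would confirm that `df[selected_features]` is safe to use.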

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
df = pd.read_csv('cleaned_data/Cleaned Data.csv', index_col=[0])

In [3]:
# Overview of all features

for column in df.columns:
    print(column)

Revenue
Revenue Growth
Cost of Revenue
Gross Profit
R&D Expenses
SG&A Expense
Operating Expenses
Operating Income
Interest Expense
Earnings before Tax
Income Tax Expense
Net Income - Non-Controlling int
Net Income - Discontinued ops
Net Income
Preferred Dividends
Net Income Com
EPS
EPS Diluted
Weighted Average Shs Out
Weighted Average Shs Out (Dil)
Dividend per Share
Gross Margin
EBITDA Margin
EBIT Margin
Profit Margin
Free Cash Flow margin
EBITDA
EBIT
Consolidated Income
Earnings Before Tax Margin
Net Profit Margin
Cash and cash equivalents
Short-term investments
Cash and short-term investments
Receivables
Inventories
Total current assets
Property, Plant & Equipment Net
Goodwill and Intangible Assets
Long-term investments
Tax assets
Total non-current assets
Total assets
Payables
Short-term debt
Total current liabilities
Long-term debt
Total debt
Deferred revenue
Tax Liabilities
Deposit Liabilities
Total non-current liabilities
Total liabilities
Other comprehensive income
Retained earn

In [5]:
# Assign the selected features (X) and the target (y)
X = df[['Enterprise Value over EBITDA','PE ratio','priceToBookRatio','3Y Shareholders Equity Growth (per Share)', 'Revenue', 
        'Revenue Growth', 'R&D to Revenue', 'debtEquityRatio', 'Long-term investments', 'Goodwill and Intangible Assets', 
        'ROIC', 'ROE', 'inventoryTurnover']]
y = df['Signal']

In [6]:
# Create a stratified train/test split (70% train, 30% test)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)

In [7]:
# Import the decision tree classifier and fit an initial, untuned tree
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(max_depth=4)
tree.fit(X_train, y_train)

DecisionTreeClassifier(max_depth=4)

In [8]:
# Candidate values for the maximum tree depth
maxDepth = np.array([1, 2, 5, 10])

# Candidate values for the minimum number of samples required to split an internal node
minSamplesNode = np.array([2, 5, 10])

# Candidate values for the minimum number of samples required at a leaf (terminal) node
minSamplesLeaf = np.array([10, 20, 30])

# Import necessary functions
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Create k-Fold CV object
kFold = StratifiedKFold(n_splits=10)

In [9]:
from sklearn.model_selection import GridSearchCV

# Define the hyperparameter values to be tested
param_grid = {'criterion': ['gini', 'entropy'],
              'max_depth': maxDepth,
              'min_samples_split': minSamplesNode,
              'min_samples_leaf': minSamplesLeaf}

# Run brute-force grid search
gs = GridSearchCV(estimator=DecisionTreeClassifier(random_state=0),
                  param_grid=param_grid,
                  scoring='accuracy',
                  cv=kFold, n_jobs=-1)
gs = gs.fit(X_train, y_train)
print(gs.best_score_)
print(gs.best_params_)

0.5549674993530316
{'criterion': 'gini', 'max_depth': 5, 'min_samples_leaf': 10, 'min_samples_split': 2}
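
To get a feel for the cost of the brute-force search above, the number of candidate settings can be counted with scikit-learn's `ParameterGrid`. This sketch restates the grid with plain lists instead of NumPy arrays and assumes the 10 folds from `kFold`; it is an illustration, not part of the original run.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as above, written with plain lists
param_grid = {'criterion': ['gini', 'entropy'],
              'max_depth': [1, 2, 5, 10],
              'min_samples_split': [2, 5, 10],
              'min_samples_leaf': [10, 20, 30]}

# Number of hyperparameter combinations tried by the grid search
n_candidates = len(ParameterGrid(param_grid))

# Each candidate is fitted once per fold (10 stratified folds assumed)
n_fits = n_candidates * 10
print(n_candidates, 'candidates,', n_fits, 'cross-validation fits')
```

Note that scikit-learn's trees only split a node if both children can hold at least `min_samples_leaf` samples, so with leaf sizes of 10 or more the `min_samples_split` values 2, 5 and 10 should all behave the same. If that holds here, the grid contains only 24 effectively distinct settings, and larger `min_samples_split` values would be needed for that parameter to matter.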


In [10]:
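# Extract the best estimator found by the grid search
clf = gs.best_estimator_

# Refit on the training set (redundant, since GridSearchCV already refits the
# best estimator with its default refit=True, but harmless)
clf.fit(X_train, y_train)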

# Print out score on Test dataset
print('Test accuracy: {0: .4f}'.format(clf.score(X_test, y_test)))

Test accuracy:  0.5521


In [11]:
from sklearn import metrics

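# Note: these predictions come from `tree`, the initial depth-4 tree from In [7],
# not from the tuned `clf`
y_pred = tree.predict(X_test)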
print('Confusion matrix: \n', 
      metrics.confusion_matrix(y_test, y_pred))

Confusion matrix: 
 [[ 435    2  967]
 [  55    0  169]
 [ 359    1 1486]]
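
The confusion matrix is easier to read once it is turned into an accuracy, per-class recall and prediction counts. The sketch below copies the numbers from the output above. The label order Buy, Hold, Sell is an assumption based on scikit-learn's alphabetical sorting of labels, and the row sums match the class supports reported later in this notebook.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows: true class, columns: predicted class)
cm = np.array([[435, 2, 967],
               [55, 0, 169],
               [359, 1, 1486]])
labels = ['Buy', 'Hold', 'Sell']  # assumed order (alphabetical)

# Overall accuracy: correct predictions on the diagonal over all samples
accuracy = np.trace(cm) / cm.sum()

# Recall per class: diagonal entry divided by the row total
recall = np.diag(cm) / cm.sum(axis=1)

# How often each class was predicted: column totals
predicted_counts = cm.sum(axis=0)

print('Accuracy: {:.4f}'.format(accuracy))
for label, r, n in zip(labels, recall, predicted_counts):
    print('{}: recall {:.2f}, predicted {} times'.format(label, r, n))
```

The column totals show that this tree predicts Hold only three times out of 3474 test samples, so its accuracy comes almost entirely from the Buy and Sell classes.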


## Random Forest 

In [13]:
from sklearn.ensemble import RandomForestClassifier

# Create classifier object and fit it to data
forest = RandomForestClassifier(criterion='gini', random_state=0, n_jobs=-1)
forest.fit(X_train, y_train)

RandomForestClassifier(n_jobs=-1, random_state=0)
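
Because this section works with a reduced, hand-picked feature set, it can be useful to check how much each feature contributes to a fitted forest through its `feature_importances_` attribute. The sketch below runs on synthetic data from `make_classification`; the data, the `f0`...`f12` names and the `forest_demo` object are hypothetical stand-ins for `X_train`, the real column names and `forest`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for X_train / y_train (hypothetical: 3 classes, 13 features)
X_demo, y_demo = make_classification(n_samples=300, n_features=13, n_informative=5,
                                     n_classes=3, random_state=0)
feature_names = ['f{}'.format(i) for i in range(13)]

# Fit a small forest on the synthetic data
forest_demo = RandomForestClassifier(n_estimators=50, random_state=0)
forest_demo.fit(X_demo, y_demo)

# Impurity-based importances, one value per feature, summing to 1
importances = pd.Series(forest_demo.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False).head())
```

On the real data the same pattern would be `pd.Series(forest.feature_importances_, index=X_train.columns)`. Impurity-based importances can favor features with many distinct values, so they are best read as a rough ranking.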

In [14]:
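# Print test score
# Note: this call scores `clf` (the tuned decision tree), so the value below repeats
# the tree's accuracy; the forest's own accuracy is forest.score(X_test, y_test)
print('Test accuracy: {0: .4f}'.format(clf.score(X_test, y_test)))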

Test accuracy:  0.5521


In [15]:
y_pred = forest.predict(X_test)

In [17]:
# Per-class precision, recall and F1 for the random forest predictions
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

         Buy       0.49      0.39      0.43      1404
        Hold       0.00      0.00      0.00       224
        Sell       0.58      0.74      0.65      1846

    accuracy                           0.55      3474
   macro avg       0.36      0.38      0.36      3474
weighted avg       0.51      0.55      0.52      3474
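
To put the 0.55 accuracy in context, it helps to compare it with a classifier that always predicts the most frequent class. The sketch below rebuilds the test labels from the support column of the report above and scores scikit-learn's `DummyClassifier` on them. Fitting the baseline on the test labels is a simplification; since the split is stratified, the training set should have nearly the same class proportions.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Class supports taken from the classification report above
support = {'Buy': 1404, 'Hold': 224, 'Sell': 1846}
y_true = np.repeat(list(support.keys()), list(support.values()))

# DummyClassifier ignores the features, so a column of zeros is enough
X_dummy = np.zeros((len(y_true), 1))

# Baseline that always predicts the most frequent class ('Sell')
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_dummy, y_true)
baseline_accuracy = baseline.score(X_dummy, y_true)
print('Majority-class baseline accuracy: {:.4f}'.format(baseline_accuracy))
```

Always predicting Sell would be correct for 1846 of the 3474 test samples, so the reported accuracy of 0.55 is only about two percentage points above this baseline, and the Hold class is never predicted correctly.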

