# Determination of the most frequent value of the best five results

This notebook combines the five best result files by majority vote: for each ticker and date, the recommendation for action that occurs most frequently is selected. An odd number of files was chosen deliberately, because it makes ties less likely. With three possible values, ties can still occur (e.g. two votes for 0, two for 1 and one for -1). In that case 0 is always used, since 0 is the most common value at almost 80 %.
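The voting rule can be illustrated on a few hand-made vote lists. This is only a minimal sketch: the helper name `majority_vote` and the example votes are made up for illustration and are not part of the notebook's pipeline.

```python
from collections import Counter


def majority_vote(votes, tie_value=0):
    """Return the most frequent vote, or tie_value if the top count is shared."""
    ranked = Counter(votes).most_common()
    # A tie exists when the two highest counts are equal.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return tie_value
    return ranked[0][0]


# Hypothetical votes of five models for one ticker and date.
clear_majority = majority_vote([1, 1, 1, 0, 0])
tie_zero_one = majority_vote([0, 0, 1, 1, -1])
tie_without_zero = majority_vote([1, 1, -1, -1, 0])

print(clear_majority, tie_zero_one, tie_without_zero)
```

Note that under this rule a tie between 1 and -1 also resolves to 0, even though 0 received only one vote.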

In this notebook the following five result files were used:
- Random Forest: all tickers, all features, time series of seven weeks --> Score on Kaggle: 79.4 %
- Random Forest: all tickers, all features, no time series --> Score on Kaggle: 77.8 %
- Random Forest: all tickers, features open and volume, time series of one week --> Score on Kaggle: 79.2 %
- SVM: 50 tickers, all features, time series of 90 days, optimal parameter values determined with Grid Search --> Score on Kaggle: 76.5 %
- SVM: 150 tickers, all features, time series of 90 days, optimal parameter values determined with Grid Search --> Score on Kaggle: 80.1 %

# Content
1. Import dependencies
2. Load and merge the five results
3. Determination of the most frequent result
4. Result

<hr>

# 1. Import dependencies

In [1]:
import pandas as pd

# 2. Load and merge the five results

In [2]:
df1 = pd.read_csv("Predictions/Random_Forest_all_features_7_weeks.csv", index_col=[0], parse_dates=[0])
df2 = pd.read_csv("Predictions/SVM_Grid_Search_50_tickers.csv", index_col=[0], parse_dates=[0])
df3 = pd.read_csv("Predictions/Random_Forest_all_features_no_timeseries.csv", index_col=[0], parse_dates=[0])
df4 = pd.read_csv("Predictions/SVM_Grid_Search_150_tickers.csv", index_col=[0], parse_dates=[0])
df5 = pd.read_csv("Predictions/Random_Forest_open_volume_1_week.csv", index_col=[0], parse_dates=[0])

finaldf = pd.concat([df1, df2, df3, df4, df5], axis=1, join='inner')
finaldf.head()

| Id | Category | Category | Category | Category | Category |
|---|---|---|---|---|---|
| 2018-01-02:A | 0 | 0 | 0 | 0 | 0 |
| 2018-01-03:A | 0 | 0 | 0 | 0 | 0 |
| 2018-01-04:A | 0 | 0 | 0 | 0 | 0 |
| 2018-01-05:A | 0 | 0 | 0 | 0 | 0 |
| 2018-01-08:A | 0 | 0 | 0 | 0 | 0 |


In [3]:
finaldf.tail()

| Id | Category | Category | Category | Category | Category |
|---|---|---|---|---|---|
| 2018-06-25:ZBH | 1 | 1 | 0 | 0 | 0 |
| 2018-06-26:ZBH | 1 | 1 | 0 | 0 | 0 |
| 2018-06-27:ZBH | 1 | 1 | 0 | 0 | 0 |
| 2018-06-28:ZBH | 1 | 0 | 0 | 0 | 0 |
| 2018-06-29:ZBH | 1 | 0 | 0 | 0 | 0 |
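The merged frame has five columns that are all named `Category`, so it is hard to see which model produced which vote. One possible way to keep them apart is to rename each column before concatenating. The sketch below uses two tiny hand-made frames; the column names `rf_7_weeks` and `svm_50_tickers` are hypothetical labels, not names used elsewhere in the notebook.

```python
import pandas as pd

# Toy stand-ins for two prediction files (values are made up).
rf = pd.DataFrame(
    {"Category": [0, 1, 0]},
    index=pd.Index(["2018-01-02:A", "2018-01-03:A", "2018-01-04:A"], name="Id"),
)
svm = pd.DataFrame(
    {"Category": [0, -1]},
    index=pd.Index(["2018-01-02:A", "2018-01-03:A"], name="Id"),
)

# Rename each Category column so the model is visible after merging.
named = pd.concat(
    [rf["Category"].rename("rf_7_weeks"), svm["Category"].rename("svm_50_tickers")],
    axis=1,
    join="inner",  # keeps only the Ids present in every frame
)
print(named)
```

Because of `join='inner'`, the Id `2018-01-04:A`, which is missing from the second frame, does not appear in the result. Comparing `len(named)` with the length of each input is a quick way to notice such dropped rows.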


# 3. Determination of the most frequent result

In [4]:
# Count how often 0, 1 and -1 occur in each row.
num_0 = finaldf.isin([0]).sum(axis=1)
num_1 = finaldf.isin([1]).sum(axis=1)
num_n1 = finaldf.isin([-1]).sum(axis=1)

In [5]:
# Combine the counts into one DataFrame with one column per value (-1, 0, 1).
label_list = ["-1", "0", "1"]

num_occurences_df = pd.concat([num_n1, num_0, num_1], axis=1, join='inner')
num_occurences_df.columns = label_list
num_occurences_df.head()

| Id | -1 | 0 | 1 |
|---|---|---|---|
| 2018-01-02:A | 0 | 5 | 0 |
| 2018-01-03:A | 0 | 5 | 0 |
| 2018-01-04:A | 0 | 5 | 0 |
| 2018-01-05:A | 0 | 5 | 0 |
| 2018-01-08:A | 0 | 5 | 0 |
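A simple sanity check for the counting step is that the three counts in each row add up to the number of models. The following sketch reproduces the counting on a small hand-made vote table; the column names `m1` to `m5` and the votes themselves are invented for illustration.

```python
import pandas as pd

# Toy vote table: one row per Id, one column per model (values are made up).
votes = pd.DataFrame(
    [[0, 0, 0, 0, 0], [1, 1, 0, 0, -1], [1, 1, 1, 0, -1]],
    index=["2018-01-02:A", "2018-06-25:ZBH", "2018-06-26:ZBH"],
    columns=["m1", "m2", "m3", "m4", "m5"],
)

# Count the occurrences of each value per row, one column per value.
counts = pd.DataFrame(
    {str(label): votes.eq(label).sum(axis=1) for label in (-1, 0, 1)}
)

# Every vote must be one of -1, 0, 1, so each row has to sum to the model count.
assert (counts.sum(axis=1) == votes.shape[1]).all()
print(counts)
```

If the assertion fails on real data, at least one prediction file contains a value other than -1, 0 or 1, or a missing value.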


In [6]:
# Write the most common value per row to Category. If values occur equally often, 0 is predicted.
res_df = num_occurences_df.copy(deep=True)
res_df['Category'] = 0  # integer default, which also covers the tie case

for index, row in num_occurences_df.iterrows():
    max_elems = row[row == row.max()]
    if len(max_elems) == 1:
        # The column labels are strings ("-1", "0", "1"), so convert the winner to int.
        res_df.loc[index, 'Category'] = int(max_elems.index[0])

res_df = res_df.drop(label_list, axis=1)
res_df.head()

| Id | Category |
|---|---|
| 2018-01-02:A | 0 |
| 2018-01-03:A | 0 |
| 2018-01-04:A | 0 |
| 2018-01-05:A | 0 |
| 2018-01-08:A | 0 |
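Iterating row by row with `iterrows()` can be slow on large frames. As an alternative, the same rule can be sketched with vectorized pandas operations. This is an assumption about how one might restructure the step, not the method used in this notebook, and the count table below is hand-made.

```python
import pandas as pd

# Toy count table in the same layout as num_occurences_df (values are made up).
counts = pd.DataFrame(
    {"-1": [0, 2, 1, 3], "0": [5, 2, 1, 1], "1": [0, 1, 3, 1]},
    index=pd.Index(
        ["2018-01-02:A", "2018-01-03:A", "2018-01-04:A", "2018-01-05:A"], name="Id"
    ),
)

# A row is tied when more than one column reaches the row maximum.
row_max = counts.max(axis=1)
is_tie = counts.eq(row_max, axis=0).sum(axis=1) > 1

# idxmax returns the label of the first maximum; ties are overwritten with 0.
category = counts.idxmax(axis=1).astype(int).mask(is_tie, 0)
result = category.to_frame("Category")
print(result)
```

The second row of the toy table (two votes each for -1 and 0) is a tie and therefore becomes 0, matching the rule applied in the loop.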


In [7]:
# Save the result (to_csv returns None when a path is given, so no assignment is needed)
res_df.to_csv('Prediction/Most_frequent_value_of_best_five_models_second_try.csv', index=True)

# 4. Result

This combination achieves a score of 80.2 % on Kaggle.

In another experiment, the following five result files were used:
- Random Forest: all tickers, all features, time series of seven weeks --> Score on Kaggle: 79.4 %
- Random Forest: all tickers, all features, no time series --> Score on Kaggle: 77.8 %
- Random Forest: all tickers, features open and volume, time series of one week --> Score on Kaggle: 79.2 %
- KNN: all tickers, all features, time series of nine weeks --> Score on Kaggle: 74.0 %
- KNN: all tickers, all features, time series of 30 days --> Score on Kaggle: 71.5 %

These files achieve a score of 81.1 % on Kaggle, which is the best result we have achieved so far with the sklearn methods. Possibilities for further improving this score are described in the chapter "next steps".