# Predictions for all tickers

This notebook describes how training and testing were handled for all tickers. Executing the code requires a large amount of memory, so local execution was not possible and Amazon Web Services (AWS) was used instead. Since the code was not executed as a notebook on AWS, it was developed in Visual Studio Code in ".py" format. The code is stored under the name "2.1.3.2 Prediction for all tickers.py".

The code described in this notebook was executed with different sklearn methods, different features and different time courses. All settings used and their results are listed in chapter 5 ("Result"). In the example below, the Support Vector Machine is used with the following settings:
- all features (Open, Close, Low, High and Volume)
- training and testing with 50 tickers
- time courses of 10, 20, 30, 40, 50, 60, 70, 80 and 90 days
- Grid Search was used to determine the ideal values for the SVM: C = 100, gamma = 0.001 and kernel = rbf


# Content
 1. Structure of the code
 2. Explanation of function prepareTrainingData()
 3. Explanation of function train(x, y)
 4. Explanation of function classify_2018(classifier)
 5. Result

<hr>

# 1. Structure of the code

The code contains a main function in which several functions are called. The function "prepareTrainingData()" prepares the data for training and testing. For this purpose, all ticker data is read in and merged with the recommendations for action (the training labels). The function train(x, y) splits this data into a training set and a test set and fits an sklearn classifier. The function classify_2018(classifier) finally predicts the recommended actions for the trading days in 2018. At the end of the main function the results are saved in a csv file. The code for the main function is shown below:

In [None]:
def main():
    start_time = time.time()
    x,y = prepareTrainingData()
    classifier = train(x, y)
    end_time = time.time()
    elapsed_time = end_time - start_time
    print("Preparation and training took %0.1f s" % elapsed_time)

    result_str_lst = classify_2018(classifier)

    # Transfer list to DataFrame and save
    kaggle = pd.DataFrame(data=result_str_lst, columns=['Id', 'Category'])
    # to_csv returns None when a path is given, so the result is not reassigned
    kaggle.to_csv('Predictions/SVM_Grid_Search_50_tickers.csv', index=False)

if __name__ == "__main__":
    main()

The code is flexible: the time courses, the features used and the number of tickers used for training can be adjusted easily. In the following, all available features are used together with ten time offsets in steps of 10 days (0, 10, ..., 90 days), where offset 0 serves as the reference value:

In [None]:
features_time_range = [i*10 for i in range(10)]
features_used = ['Open', 'Close', 'Low', 'High', 'Volume']
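
As a quick check of what this setting produces, the following sketch evaluates the list and derives the column names that result when every offset except the first gets its own column per feature, as in "getFeatureTimeseriesDf" (chapter 2). The helper name `timeseries_column_names` is hypothetical and only used for illustration.

```python
features_time_range = [i * 10 for i in range(10)]
features_used = ['Open', 'Close', 'Low', 'High', 'Volume']


def timeseries_column_names(features, time_range):
    """Hypothetical helper: names of the time-course columns per feature.

    The first offset (0) is the reference value and gets no column of its own.
    """
    return [f"{feature}_{t}" for feature in features for t in time_range[1:]]


columns = timeseries_column_names(features_used, features_time_range)
print(features_time_range)        # the ten offsets 0, 10, ..., 90
print(len(columns), columns[:3])  # number of feature columns and the first names
```

With five features and nine non-reference offsets, each sample therefore has 45 feature columns.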

In the next chapters the functions 
- prepareTrainingData()
- train(x, y)
- classify_2018(classifier)

are explained in more detail.

# 2. Explanation of function prepareTrainingData()

"prepareTrainingData()" calls the function "getAllTickerTimeseries_GT_Df_serial(ticker_file_names, df_train_label)", which in turn calls further functions. After a function has been described, the functions it uses are described in more detail. These are marked in bold.

With **"getAllTickerTimeseries_GT_Df_serial"** (GT: Ground Truth) the prepared data is loaded. Rows with "nan" or "inf" are removed because sklearn methods cannot handle these values. The created dataframe "all_stocks_df" is divided into x and y: y contains the training labels (recommended actions for the trading day), and x contains the feature values without the date.

The function "prepareTrainingData()" looks as follows:

In [None]:
def prepareTrainingData():
    df_train_label = pd.read_csv(join(dir_path, 'labels_train.csv'), header=0, index_col=0)

    all_stocks_df = getAllTickerTimeseries_GT_Df_serial(ticker_file_names, df_train_label)
    # Treat +/-inf like NaN and drop every row that contains such a value.
    # replace() returns a new DataFrame, so the result has to be assigned.
    all_stocks_df = all_stocks_df.replace([np.inf, -np.inf], np.nan).dropna(axis=0)

    cols_x = list(all_stocks_df)
    cols_x.remove("Date")
    cols_x.remove("Y")

    x = all_stocks_df[cols_x]
    y = all_stocks_df["Y"]

    return x, y
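
The removal of rows with "nan" or "inf" can be tried out in isolation. The following is a minimal sketch on made-up toy data; the helper name `drop_non_finite_rows` and the column values are assumptions for illustration. Note that `replace` returns a new DataFrame, so the result has to be assigned.

```python
import numpy as np
import pandas as pd


def drop_non_finite_rows(df):
    """Return a copy of df without rows that contain NaN, inf or -inf."""
    return df.replace([np.inf, -np.inf], np.nan).dropna(axis=0)


# Toy data (made up): rows 1 and 3 contain values sklearn cannot handle
toy_df = pd.DataFrame({"Open_10": [0.1, np.inf, 0.3, np.nan],
                       "Y": [1, 0, 1, 0]})
clean_df = drop_non_finite_rows(toy_df)
print(clean_df)
```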

In the function "getAllTickerTimeseries_GT_Df_serial(ticker_file_names, df_train_label)" the dataframe "all_stocks_df" is built.
A certain number of tickers, in this example 50, are read in one after the other. The formatted data for each ticker is returned by the function **"getTickerTimeseries_GT_Df(file_name, df_train_label)"** and assigned to "stock_complete_df". When the data for "stock_complete_df" is complete, it is added to "all_stocks_df". If the data is not complete, "not complete" is printed. For a stock to be "complete", stock data must be available for all trading days in 2018 and for the last 90 trading days in 2017.

With the function "getAllTickerTimeseries_GT_Df_multiprocess(ticker_file_names, df_train_label)", loading has been parallelized to achieve faster loading times. However, the parallelization did not result in a significant improvement of the loading time, which is why serial loading is still used with the "getAllTickerTimeseries_GT_Df_serial(ticker_file_names, df_train_label)" function. A description of parallel loading is therefore not given here. 

The function "getAllTickerTimeseries_GT_Df_serial(ticker_file_names, df_train_label)" is shown below:

In [None]:
def getAllTickerTimeseries_GT_Df_serial(ticker_file_names, df_train_label):
    all_stocks_df = DataFrame()

    first_stock_b = True
    file_ctr = 0

    num_samples = 50
    ticker_file_names = [ticker_file_names[i] for i in np.random.randint(0, len(ticker_file_names), num_samples)]

    for file_name in ticker_file_names:
        ticker_name = file_name.replace(".csv", "")
        file_ctr += 1
        stock_complete_df = getTickerTimeseries_GT_Df(file_name, df_train_label)
        
        if stock_complete_df is not None:
            if first_stock_b:
                all_stocks_df = stock_complete_df
                first_stock_b = False
            else:
                # DataFrame.append was removed in pandas 2.0; pd.concat does the same
                all_stocks_df = pd.concat([all_stocks_df, stock_complete_df], ignore_index=True)
                all_stocks_df = all_stocks_df.sort_values('Date')
        else:
            print(ticker_name + " not complete")
    return all_stocks_df
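
Drawing random indices with replacement can select the same ticker more than once, and without a seed the selection changes between runs. If distinct tickers and reproducible runs are wanted, a sketch along the following lines may help. The function name `sample_ticker_files`, the seed value and the file names are assumptions for illustration, not part of the original script.

```python
import random


def sample_ticker_files(ticker_file_names, num_samples, seed=42):
    """Pick num_samples distinct file names; the same seed gives the same pick."""
    rng = random.Random(seed)  # local generator, global random state stays untouched
    num_samples = min(num_samples, len(ticker_file_names))
    return rng.sample(ticker_file_names, num_samples)


# Hypothetical file names standing in for the real ticker files
files = ["TICKER%d.csv" % i for i in range(200)]
subset = sample_ticker_files(files, 50)
print(len(subset), len(set(subset)))
```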

In "getTickerTimeseries_GT_Df(file_name, df_train_label)" the data of the given ticker is loaded first. The completeness of the data is checked with **"hasTickerAllDates(df_ticker)"** (which, as described above, checks the availability of all necessary stock data in 2017 and 2018). When the ticker data is complete, the timeseries are built with the function **"getTickerTimeseriesDf(df_ticker)"**. Rows with "nan" or "inf" are removed because sklearn methods cannot handle these values. The training label (recommended action) is then added to the dataframe.

In [None]:
def getTickerTimeseries_GT_Df(file_name, df_train_label):
    ticker_name = file_name.replace(".csv", "")

    print("Loading: " + file_name)

    # Load data frame
    file_str = join(ticker_path, file_name)
    df_ticker = pd.read_csv(file_str)
    df_ticker.columns = ['Date', 'Open', 'Close', 'Low', 'High', 'Volume']

    # if the ticker data is incomplete, skip this ticker (the caller receives None)
    ticker_data_valid = hasTickerAllDates(df_ticker)
    if not ticker_data_valid:
        return None

    stock_merged_df = getTickerTimeseriesDf(df_ticker)
    # Treat +/-inf like NaN and drop every row that contains such a value
    stock_merged_df = stock_merged_df.replace([np.inf, -np.inf], np.nan).dropna(axis=0)

    # Add dependent variable
    labels_df = df_train_label.loc[:, df_train_label.columns.intersection([ticker_name])]
    labels_df = labels_df.rename(index=str, columns={ticker_name: 'Y'})
    
    stock_complete_df = pd.merge(stock_merged_df, labels_df[['Y']], on='Date')
    stock_complete_df = stock_complete_df.sort_values('Date')

    return stock_complete_df

Whether all required dates are present is checked against the list "all_boersen_days", which contains all required trading days:

In [None]:
def hasTickerAllDates(ticker_df):
    for date_str in all_boersen_days:
        if date_str not in ticker_df['Date'].values:
            # one missing trading day is enough to reject the ticker
            return False
    return True
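
The same completeness check can be sketched with a set lookup, which also reports which dates are missing. The helper name `missing_dates` and the example dates are assumptions for illustration.

```python
def missing_dates(required_dates, available_dates):
    """Return the required dates that are absent, in their original order."""
    available = set(available_dates)  # set lookup instead of scanning per date
    return [d for d in required_dates if d not in available]


# Made-up example dates
required = ["2017-12-28", "2017-12-29", "2018-01-02"]
available = ["2017-12-29", "2018-01-02", "2018-01-03"]

missing = missing_dates(required, available)
is_complete = not missing
print(missing, is_complete)
```

A ticker counts as complete exactly when the returned list is empty.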

In the following function "getTickerTimeseriesDf(df_ticker)" ticker data is merged with the timeseries. Each feature is read in one after the other with the function **"getFeatureTimeseriesDf(feature, feature_df)"** and prepared accordingly. The preparation consists of normalizing values with the function **"normalizeFeatureDf(feature_df)"** and removing nan and inf values. 

In [None]:
def getTickerTimeseriesDf(df_ticker):
    stock_merged_df = DataFrame()
    first_feature_b = True

    for feature in features_used:
        feature_df = df_ticker.copy(deep=True)
        # drop everything except of the current feature and the date
        feature_df = feature_df[["Date", feature]]
        feature_df = getFeatureTimeseriesDf(feature, feature_df)
        feature_df = normalizeFeatureDf(feature_df)
        # Treat +/-inf like NaN and drop every row that contains such a value
        feature_df = feature_df.replace([np.inf, -np.inf], np.nan).dropna(axis=0)

        # merge current dataframe into overall dataframe
        if first_feature_b:
            stock_merged_df = feature_df
            first_feature_b = False
        else:
            stock_merged_df = pd.merge(stock_merged_df, feature_df, on='Date')
    return stock_merged_df

**"getFeatureTimeseriesDf(feature, feature_df)"** receives a feature name (e.g. "Open", "Volume", ...) and a dataframe that contains only a "Date" column and one feature column. It then builds a "timeseries" dataframe: the columns are the timeseries of the feature (e.g. [t0 - t-10days], [t0 - t-20days], ...) and the rows are the samples (in this case the dates) that are later used for training.

In [None]:
def getFeatureTimeseriesDf(feature, feature_df):
    timeseries_feature_df = series_to_supervised(feature, feature_df[feature].values, features_time_range)

    feature_ = feature + "_"

    for t in features_time_range:
        if not t == features_time_range[0]:
            f_0 = timeseries_feature_df[feature_ + str(features_time_range[0])]
            f_t = timeseries_feature_df[feature_ + str(t)]
            f_t = f_t - f_0
            feature_df[feature_ + str(t)] = f_t
    if len(features_time_range) > 1:
        feature_df = feature_df.drop([feature], axis=1)
    return feature_df

The "series_to_supervised(feature_name, data, time_range=[0], dropnan=True)" function makes use of the pandas DataFrame.shift() function to build the timeseries dataframe.

In [None]:
def series_to_supervised(feature_name, data, time_range=[0], dropnan=True):
    df = DataFrame(data)
    cols, names = list(), list()
    # forecast sequence (t, t+1, ... t+n)
    for i in time_range:
        cols.append(df.shift(-i))
        names += [feature_name + "_" + str(i)]
    # put it all together
    agg = concat(cols, axis=1)
    agg.columns = names
    # drop rows with NaN values
    if dropnan:
        agg.dropna(inplace=True)
    return agg
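
To make the effect of shift() concrete, the following sketch repeats the same steps on five made-up values with the small offsets 0, 1 and 2 (instead of 0, 10, ..., 90). The values and the feature name are assumptions for illustration.

```python
import pandas as pd

values = pd.DataFrame([10.0, 11.0, 12.0, 13.0, 14.0])  # made-up feature values
offsets = [0, 1, 2]

# shift(-i) moves the series up by i rows, so row r holds the value of row r + i
shifted = pd.concat([values.shift(-i) for i in offsets], axis=1)
shifted.columns = ["Open_" + str(i) for i in offsets]

# the last rows have no partner value for the larger offsets and are dropped
shifted = shifted.dropna()
print(shifted)
```

Each remaining row holds the value at its own position together with the values one and two rows further on, so the last two of the five rows are dropped.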

"MinMaxScaler" and "fit_transform" are used for normalization, so that the normalized values lie between -1 and 1:

In [None]:
def normalizeFeatureDf(feature_df):
    tmp_df = feature_df.copy(deep=True)
    tmp_df = tmp_df.drop(["Date"], axis=1)
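    scaler = MinMaxScaler(feature_range=(-1, 1))
    # fit_transform returns the scaled array; write it back explicitly instead of
    # relying on an in-place update of tmp_df.values
    tmp_df = DataFrame(scaler.fit_transform(tmp_df.values), columns=tmp_df.columns, index=tmp_df.index)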
    tmp_df["Date"] = feature_df["Date"].values
    feature_df = tmp_df

    # move the "Date" column to the front of the dataframe
    cols = list(feature_df)
    cols.insert(0, cols.pop(cols.index('Date')))
    feature_df = feature_df.loc[:, cols]
    return feature_df
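
The per-column mapping that min-max scaling to the range (-1, 1) performs can also be written out by hand. The following NumPy sketch is an illustration on made-up numbers, not the code used in the script; it assumes that every column has a maximum greater than its minimum and contains no NaN values.

```python
import numpy as np


def min_max_scale(values, low=-1.0, high=1.0):
    """Scale each column linearly so that its minimum maps to low and its maximum to high."""
    values = np.asarray(values, dtype=float)
    col_min = values.min(axis=0)
    col_max = values.max(axis=0)
    # assumption: col_max > col_min in every column, otherwise this divides by zero
    return (values - col_min) / (col_max - col_min) * (high - low) + low


toy = np.array([[0.0, 100.0],
                [5.0, 300.0],
                [10.0, 200.0]])
scaled = min_max_scale(toy)
print(scaled)
```

Because the minimum and maximum are taken per column, features on very different scales (such as prices and volume) end up in the same range.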

# 3. Explanation of function train(x, y)

This function is used to train an sklearn classifier on the data generated by the function prepareTrainingData(). In the following, training and testing are shown for the "Support Vector Machine":

In [None]:
def train(x, y):
    x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.7, test_size=0.3, random_state=42)

    # Train Classifier
    classifier = svm.SVC(kernel= 'rbf', gamma=0.001, C=100.0, max_iter=-1)
    classifier.fit(x_train, y_train)
    # Mean accuracy on the held-out 30 % test split
    test_score = classifier.score(x_test, y_test)
    print("Test accuracy: %0.3f" % test_score)

    return classifier

To optimize the parameters of the SVM, Grid Search was run with 50 tickers to determine the best settings for kernel, gamma and C. The result was:
- C = 100
- gamma = 0.001 
- kernel = rbf

The code for this is shown below: 

In [None]:
def gridSearch(x, y):
    tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                     'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]

    scores = ['precision', 'recall']

    X_train, X_test, y_train, y_test = train_test_split(x, y, train_size=0.5, test_size=0.5, random_state=42)

    for score in scores:
        print("# Tuning hyper-parameters for %s" % score)
        print()

        clf = GridSearchCV(svm.SVC(), tuned_parameters, cv=5, n_jobs=-1, scoring='%s_macro' % score)
        clf.fit(X_train, y_train)

        print("Best parameters set found on development set:")
        print()
        print(clf.best_params_)
        print()
        print("Grid scores on development set:")
        print()
        means = clf.cv_results_['mean_test_score']
        stds = clf.cv_results_['std_test_score']
        for mean, std, params in zip(means, stds, clf.cv_results_['params']):
            print("%0.3f (+/-%0.03f) for %r"
                % (mean, std * 2, params))
        print()

        print("Detailed classification report:")
        print()
        print("The model is trained on the full development set.")
        print("The scores are computed on the full evaluation set.")
        print()
        y_true, y_pred = y_test, clf.predict(X_test)
        print(classification_report(y_true, y_pred))
        print()

# 4. Explanation of function classify_2018(classifier)

The function "classify_2018(classifier)" is used to predict the recommended actions for 2018. The function receives the classifier that has been returned by train(x, y) (in this example the SVM).

For each ticker, the data is loaded first and then checked for completeness with the function "hasTickerAllDates(df_ticker)", which has already been described in chapter 2. If the data for a ticker is not completely available, "0" is predicted for each trading day in 2018. Otherwise, the feature rows for the dates to be predicted are built with the function "getTickerTimeseriesDf" already described in chapter 2, and a recommendation for action is predicted for each trading day. Trading days for which no feature row is available also default to "0". The code for the function is shown below:

In [None]:
def classify_2018(classifier):
    result_str_lst = list()

    for file_name in ticker_file_names:
        ticker_name = file_name.replace(".csv", "")

        # 1 Load data frame
        file_str = join(ticker_path, file_name)
        df_ticker = pd.read_csv(file_str)
        df_ticker.columns = ['Date', 'Open', 'Close', 'Low', 'High', 'Volume']

        ticker_data_valid = hasTickerAllDates(df_ticker)

        # if ticker is not valid, then write all zeros in the output file
        if not ticker_data_valid:
            for date_str in boersen_days_2018:
                result_str_lst.append([date_str + ":" + ticker_name, 0])
            print(ticker_name + " defaults to 0")
            continue

        ticker_ts_df = getTickerTimeseriesDf(df_ticker)
        # Treat +/-inf like NaN and drop every row that contains such a value
        ticker_ts_df = ticker_ts_df.replace([np.inf, -np.inf], np.nan).dropna(axis=0)

        selected_dates_df = DataFrame()
        selected_dates = list()
        default_dates = list()

        for date in boersen_days_2018:
            x = ticker_ts_df.loc[ticker_ts_df['Date'] == date]
            if not x.empty:
                # DataFrame.append was removed in pandas 2.0; pd.concat does the same
                selected_dates_df = pd.concat([selected_dates_df, x])
                selected_dates.append(date)
            else:
                default_dates.append(date)

        selected_dates_df = selected_dates_df.drop(['Date'], axis=1)

        prediction = classifier.predict(selected_dates_df)

        for _ in range(len(prediction) + len(default_dates)):
            if len(selected_dates) > 0:
                min_pred = min(selected_dates)
            else:
                min_pred = '9999-99-99'
            if len(default_dates) > 0:
                min_default = min(default_dates)
            else:
                min_default = '9999-99-99'

            if min_pred < min_default:
                min_pred_idx = selected_dates.index(min_pred)
                result_str_lst.append(
                    [min_pred + ":" + ticker_name, prediction[min_pred_idx]])
                del selected_dates[min_pred_idx]
                prediction = np.delete(prediction, min_pred_idx)
            else:
                min_default_idx = default_dates.index(min_default)
                result_str_lst.append(
                    [min_default + ":" + ticker_name, 0])
                del default_dates[min_default_idx]
    return result_str_lst
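
The final step, combining predicted dates and default dates into one list ordered by date, can also be sketched with a dictionary lookup. The helper name `build_result_rows` and the example values are assumptions for illustration; the sketch further assumes that the list of trading days is already sorted in ascending order.

```python
def build_result_rows(ticker_name, all_dates, predicted_dates, predictions, default=0):
    """Return one [Id, Category] row per date, using default where no prediction exists."""
    predicted = dict(zip(predicted_dates, predictions))
    return [[date + ":" + ticker_name, predicted.get(date, default)]
            for date in all_dates]


# Made-up example: no feature row was available for 2018-01-03
rows = build_result_rows(
    "ABC",
    ["2018-01-02", "2018-01-03", "2018-01-04"],
    ["2018-01-02", "2018-01-04"],
    [1, 2],
)
print(rows)
```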

# 5. Result

The following settings were used to make predictions on Kaggle:
- Random Forest: all tickers, all features, time courses of seven weeks --> Score on Kaggle: 79.4 %
- Random Forest: all tickers, all features, no time courses --> Score on Kaggle: 77.8 %
- Random Forest: all tickers, features Open and Volume, time course of one week --> Score on Kaggle: 79.2 %
- SVM: 10 tickers, all features, time courses of 90 days, no Grid Search --> Score on Kaggle: 71.5 %
- SVM: 10 tickers, all features, time courses of 90 days, optimal values were determined with Grid Search --> Score on Kaggle: 64.6 %
- SVM: 50 tickers, all features, time courses of 90 days, optimal values were determined with Grid Search --> Score on Kaggle: 76.5 %
- SVM: 150 tickers, all features, time courses of 90 days, optimal values were determined with Grid Search --> Score on Kaggle: 80.1 %
- KNN: all tickers, all features, time courses of nine weeks --> Score on Kaggle: 74.0 %
- KNN: all tickers, all features, time courses of 30 days --> Score on Kaggle: 71.5 %

From this, the following findings can be derived:
- The inclusion of time courses leads to better results. Longer time courses lead to better results than shorter ones. In order to achieve the best possible results, the ideal time courses should be determined.
- Furthermore, it can be seen that the inclusion of more tickers leads to better results. The ideal number of tickers can be determined using a learning curve.
- The results also suggest that the features are interdependent. For better results and better performance, the ideal features should be determined.
- In the SVM example with 10 tickers, the use of Grid Search leads to a deterioration in the results (64.6 % compared to 71.5 % without Grid Search). An exact analysis of the reasons is required. One possible reason is chance, since the tickers for the training are selected randomly. A fixed seed can be used to test this hypothesis. Another possible reason is that too few tickers were used in the grid search and that not all parameters were included.
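
The learning curve mentioned above could be computed roughly as follows. This is only a sketch on synthetic data: the number of training samples stands in for the number of tickers, and the data set, the training sizes and the number of folds are assumptions chosen to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

# Synthetic stand-in data; the real x and y would come from prepareTrainingData()
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

train_sizes, train_scores, test_scores = learning_curve(
    SVC(kernel="rbf", gamma=0.001, C=100.0),
    X, y,
    train_sizes=[0.2, 0.5, 1.0],  # fractions of the available training data
    cv=3,                         # 3-fold cross-validation
)

# One row per training size, one column per cross-validation fold
for size, score in zip(train_sizes, test_scores.mean(axis=1)):
    print("%4d training samples: mean CV accuracy %0.3f" % (size, score))
```

If the validation score no longer improves when more data is added, further tickers are unlikely to help.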