# ML in Finance Group Project
### Group 2: Barbara Capl, Mathias Lüthi, Pamela Matias, Stefanie Rentsch
## 1. Preparation of Datasets
This part of the code imports the data from the Wharton database, cleans it, and arranges it into attribute matrices for the subsequent feature selection.

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np

### Market Data from Wharton
The financial data from the Wharton Database is loaded. Redundant identification columns and columns in which more than 2/3 of the values are empty are dropped.

In [2]:
# Load in data from the Wharton Database
wharton = pd.read_csv('Data/WhartonData.csv', sep=',', header=0)

# Delete unusable columns (redundant identifiers first, then mostly empty columns)
# The axis is passed by keyword because recent pandas versions no longer accept it positionally
wharton = wharton.drop(columns=['NAMEENDT', 'SHRCD', 'EXCHCD', 'SICCD', 'TICKER', 'COMNAM', 'NCUSIP', 'TSYMBOL',
                                'PERMCO', 'ISSUNO', 'HEXCD', 'HSICCD'])
wharton = wharton.drop(columns=['DLAMT', 'DLPDT', 'DLSTCD', 'NEXTDT', 'HSICMG', 'HSICIG', 'DIVAMT',
                                'SHRCLS', 'ACPERM', 'ACCOMP', 'NWPERM', 'DLRETX', 'DLPRC', 'DLRET', 'NMSIND',
                                'MMCNT', 'NSDINX', 'DCLRDT', 'PAYDT', 'RCRDDT', 'DISTCD', 'FACPR', 'FACSHR',
                                'TRTSCD'])

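# Format date and permno
# rename() is used because writing into columns.values is not guaranteed to update the column index
wharton = wharton.rename(columns={wharton.columns[0]: 'permno'})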
wharton['date'] = wharton.date.astype(str).str[:4] + '-' + wharton.date.astype(str).str[4:6]

# Calculate SPREAD manually (ask minus bid, so the spread is non-negative)
wharton['SPREAD'] = wharton['ASK'] - wharton['BID']

# print(wharton.isnull().sum())
# display(wharton.head())

FileNotFoundError: File b'Data/WhartonData.csv' does not exist
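
The 2/3 rule described for this step can also be applied programmatically instead of listing the sparse columns by hand. The following is a minimal sketch on toy data; the helper `drop_sparse_columns` and its parameter `max_missing` are hypothetical names and not part of the original notebook.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df, max_missing=2 / 3):
    """Return df without the columns whose share of missing values exceeds max_missing."""
    missing_share = df.isnull().mean()  # fraction of missing values per column
    keep = missing_share[missing_share <= max_missing].index
    return df.loc[:, keep]


# Toy data (illustrative only): DLRET is 75% empty, the other columns are complete
toy = pd.DataFrame({
    'PERMNO': [1, 1, 2, 2],
    'PRC': [10.0, 11.0, 20.0, 19.0],
    'DLRET': [np.nan, np.nan, np.nan, 0.1],
})
cleaned = drop_sparse_columns(toy)
print(list(cleaned.columns))
```

A threshold-based filter keeps the rule explicit and adapts automatically if the downloaded column set changes, whereas the hard-coded list documents exactly which columns were removed.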

### Financial Ratios from Wharton
The financial ratios from the Wharton Database are loaded. The columns adate and qdate are dropped because they are not relevant, and "divyield" is converted from a percentage string to a float.

In [None]:
# Load in the financial ratios from OLAT
ratios = pd.read_csv('Data/Ratios.csv', sep=',', header=0)

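# Delete unusable columns
ratios = ratios.drop(columns=['adate', 'qdate'])

# Rename the second column to 'date' (rename() reliably updates the column index)
ratios = ratios.rename(columns={ratios.columns[1]: 'date'})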
ratios['date'] = ratios.date.str[6:] + '-' + ratios.date.str[3:5]

# Strip the percent sign from column "divyield" and divide by 100 to obtain a decimal fraction
ratios['divyield'] = ratios['divyield'].str.rstrip('%').astype('float')/100

# print(ratios.isnull().sum())
# display(ratios.head())
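
The two formatting steps for the ratios can be checked in isolation on a toy frame. This is a sketch under assumptions: the column name `public_date` and the DD/MM/YYYY layout are inferred from the string slicing used in the cell and are not confirmed by the source file.

```python
import pandas as pd

# Toy data (illustrative only); column name and date layout are assumptions
toy = pd.DataFrame({
    'public_date': ['31/01/2010', '28/02/2010'],
    'divyield': ['2.5%', None],
})

# Keep year and month only, matching the 'YYYY-MM' key built for the market data
parsed = pd.to_datetime(toy['public_date'], format='%d/%m/%Y')
toy['date'] = parsed.dt.strftime('%Y-%m')

# Strip the percent sign and rescale to a decimal fraction; missing values stay NaN
toy['divyield'] = toy['divyield'].str.rstrip('%').astype(float) / 100
print(toy[['date', 'divyield']])
```

Parsing with an explicit format raises an error on malformed dates, which plain string slicing would silently pass through.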

### Merging the Dataset

In [None]:
# Merging the two dataframes
data = pd.merge(wharton, ratios, on=['date', 'permno'])
display(data.head())
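
Because the default merge is an inner join, rows whose (date, permno) key appears in only one of the two frames are dropped without notice. A minimal sketch on toy frames shows how the `indicator` and `validate` options of `pd.merge` can be used to inspect this before merging; the toy values and the ratio column `bm` are illustrative assumptions.

```python
import pandas as pd

# Toy frames (illustrative only)
left = pd.DataFrame({
    'permno': [1, 1, 2],
    'date': ['2010-01', '2010-02', '2010-01'],
    'PRC': [10.0, 11.0, 20.0],
})
right = pd.DataFrame({
    'permno': [1, 2, 3],
    'date': ['2010-01', '2010-01', '2010-01'],
    'bm': [0.5, 0.8, 1.2],
})

# Outer merge with indicator=True labels every row by its origin;
# validate='one_to_one' raises if a key is duplicated in either frame
check = pd.merge(left, right, on=['date', 'permno'], how='outer',
                 indicator=True, validate='one_to_one')
print(check['_merge'].value_counts())

# The default inner join keeps only keys present in both frames
merged = pd.merge(left, right, on=['date', 'permno'])
print(len(merged))
```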

### Adding Additional Features

In [None]:
# Adding reporting period as seasonality attribute

# Breakdown of categorical data
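
The two steps named in this cell are not implemented in the notebook. One possible reading, shown purely as a sketch on toy data, derives month and quarter from the 'YYYY-MM' key as seasonality attributes and one-hot encodes a categorical column with `pd.get_dummies`. The column `sector` is hypothetical, and the notebook may intend a different definition of the reporting period.

```python
import pandas as pd

# Toy data (illustrative only); 'sector' is a hypothetical categorical column
toy = pd.DataFrame({
    'date': ['2010-01', '2010-04', '2010-12'],
    'sector': ['tech', 'bank', 'tech'],
})

# Month and quarter derived from the 'YYYY-MM' key as simple seasonality attributes
toy['month'] = toy['date'].str[5:7].astype(int)
toy['quarter'] = (toy['month'] - 1) // 3 + 1

# One-hot encode the categorical column into indicator columns
encoded = pd.get_dummies(toy, columns=['sector'], prefix='sector')
print(encoded.columns.tolist())
```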


### Creation of Attribute Matrices and Response Vectors

In [None]:
# Compute the gross return over each forecast period, only where both prices belong to the same permno
forecast_periods = [1, 3, 6, 12]
for i in forecast_periods:
    # np.nan instead of None keeps the return columns numeric
    data['return_' + str(i)] = np.where((data['permno'] == data['permno'].shift(i)), data['PRC'] / data['PRC'].shift(i), np.nan)

# Create a response vector and attribute matrices (other forecast periods can be added later)
data_1 = data.dropna(subset=['return_1'])
response_1 = np.where(data_1.return_1 >= 1, 1, 0)
attributes_ratios_1 = data_1.iloc[:, 28:-4]
attributes_additional_1 = data_1.iloc[:, 2:-4]
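
The return and response construction can be verified on a small example. The sketch below uses `groupby(...).shift()` as an alternative to comparing shifted permno values; this is a swapped-in technique, not the notebook's method, and it assumes the rows are sorted by permno and date. The toy prices are illustrative.

```python
import numpy as np
import pandas as pd

# Toy data (illustrative only), sorted by permno and date
toy = pd.DataFrame({
    'permno': [1, 1, 1, 2, 2],
    'date': ['2010-01', '2010-02', '2010-03', '2010-01', '2010-02'],
    'PRC': [10.0, 11.0, 9.9, 20.0, 20.0],
})

# Gross return over one period, computed within each permno;
# the first row of every permno has no predecessor and becomes NaN
toy['return_1'] = toy['PRC'] / toy.groupby('permno')['PRC'].shift(1)

# Drop rows without a return, then label gross returns of at least 1 as class 1
toy_1 = toy.dropna(subset=['return_1'])
response_1 = np.where(toy_1['return_1'] >= 1, 1, 0)
print(response_1)
```

Shifting inside each group removes the need for the explicit permno comparison, because values can never leak across securities.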