# Fund Clustering - Data Component Explanations

## Table of contents

1. Introduction
2. Data Helper methods
3. Data Reader classes
    1. SQL Data Reader class
    2. CSV Data Reader class
    3. Retrieve data from a DataReader
4. Data Catcher class
    1. Process the data
    2. Access the data
5. Data Writer class
6. Data Maker class


Last Date Modified: Dec 20th, 2020

## Change current working directory to root of repository

In [1]:
import os

old_path = os.getcwd()

print(f"Current working directory:\n\t{old_path}")

new_path = old_path[:-len('Demo')-1]
os.chdir(new_path)

print(f"\nNew working directory:\n\n\t{new_path}")

Current working directory:
	/Users/glangetasq/Library/Mobile Documents/com~apple~CloudDocs/Columbia/Classes/Fall_20/DeepLearning/FundClusteringProject/Repo/Demo

New working directory:

	/Users/glangetasq/Library/Mobile Documents/com~apple~CloudDocs/Columbia/Classes/Fall_20/DeepLearning/FundClusteringProject/Repo


## Imports

In [47]:
import numpy as np
import pandas as pd

In [2]:
import Config
import DataHelper

# 1. Introduction

The Data component was designed to keep data handling completely separate from the rest of the project. It reads the raw data, processes it, writes it, and makes fake data for testing, for different source types (SQL and CSV are currently supported).

It was also designed to be as flexible as possible, so that new datasets, processing steps, and source types can be added.

# 2. Data Helper methods

The Data Helper methods, defined in ```DataHelper/DataHelper.py```, should be the only part of the Data component accessible from outside. This keeps the internal behavior of the component separate from the modelling code and the rest of the project.

The use of the different methods will be illustrated in the following sections.

# 3. Data Reader classes

A Data Reader class reads the data from a specific source and stores it internally.

Because it is meant to be read-only, each Data Reader follows the Singleton design pattern: at most one instance of a given Data Reader class can exist at any time. This also avoids unnecessary memory use.
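One common way to get this behavior in Python is to cache the instance in `__new__`. The sketch below is only an illustration of the pattern: `SingletonReader`, `SqlReader` and `CsvReader` are hypothetical names, not the project's actual classes, and the real implementation may differ.

```python
class SingletonReader:
    """Base class: each subclass gets at most one instance (hypothetical sketch)."""

    _instances = {}  # maps each concrete class to its single instance

    def __new__(cls, *args, **kwargs):
        # Create the instance only the first time the class is called
        if cls not in SingletonReader._instances:
            SingletonReader._instances[cls] = super().__new__(cls)
        return SingletonReader._instances[cls]


class SqlReader(SingletonReader):
    pass


class CsvReader(SingletonReader):
    pass


sql_a, sql_b, csv_a = SqlReader(), SqlReader(), CsvReader()
print(sql_a is sql_b)  # True: same class, same object
print(sql_a is csv_a)  # False: one instance per reader class
```

Keeping one instance per class (rather than one instance overall) lets a SQL reader and a CSV reader coexist while still preventing the same data from being loaded twice.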

Currently, two data sources are supported: a SQL database and CSV files.

## 3.1. Read with the SQL Data Reader class

The SQL version reads the data directly from a database (that should have been properly set up beforehand). Some processing is already included to accelerate the process.

The template and request for each table are stored under ```Config/SQL/Structure/```, following the database/table hierarchy. For instance, ```Config/SQL/Structure/fund_clustering/returns.py``` contains the template and the SQL request for the table ```returns``` of the database ```fund_clustering```.

In [3]:
reader = DataHelper.get_data_reader(source='sql')

In [4]:
# Only one instance allowed at a time
hex(id(reader)), hex(id(DataHelper.get_data_reader(source='sql')))

('0x10e3d60d0', '0x10e3d60d0')

### Read the returns

In [5]:
reader.load_table(db_name='fund_clustering', table_name='returns')

### Read the morning star table

In [6]:
reader.load_table(db_name='fund_clustering', table_name='morning_star')

### Read the fund number to ticker table

In [7]:
reader.load_table(db_name='fund_clustering', table_name='ticker')

## 3.2. Read with the CSV Data Reader class

The CSV version reads CSV files. The file paths can be stored in the ```DATA_PATHS``` global variable (defined in ```Config/paths.py```), and a different path can be passed when calling the ```load_table``` method.

To keep the implementation consistent with the SQL Data Reader class, please use the same ```db_name``` and ```table_name``` when reading each file, as shown below.

In [8]:
# The SQL reader (still held in `reader`) and the CSV reader are two distinct instances
hex(id(reader)), hex(id(DataHelper.get_data_reader(source='csv')))

('0x10e3d60d0', '0x12304f9d0')

In [9]:
reader = DataHelper.get_data_reader(source='csv')

In [10]:
# Only one instance allowed at a time
hex(id(reader)), hex(id(DataHelper.get_data_reader(source='csv')))

('0x12304f9d0', '0x12304f9d0')

### Read the returns

In [11]:
returns_path = Config.DATA_PATHS['returns']
reader.load_table(db_name='fund_clustering', table_name='returns', path=returns_path)

### Read the morning star table

In [12]:
mrnstar_path = Config.DATA_PATHS['morning_star']
reader.load_table(db_name='fund_clustering', table_name='morning_star', path=mrnstar_path)

### Read the fund number to ticker table

In [13]:
ticker_path = Config.DATA_PATHS['ticker']
reader.load_table(db_name='fund_clustering', table_name='ticker', path=ticker_path)
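
The way CSV files are stored under the same `(db_name, table_name)` key as SQL tables could be sketched as follows. This is a toy version built on the standard-library `csv` module so that it runs on its own. `MiniCsvReader`, `DEFAULT_PATHS`, `get_rows`, the fallback to a default path, and the ticker values are all assumptions made up for the illustration; the project's reader returns pandas dataframes.

```python
import csv
import os
import tempfile

# Hypothetical stand-in for a mapping such as Config.DATA_PATHS
DEFAULT_PATHS = {}


class MiniCsvReader:
    """Toy reader that stores each CSV file under a (db_name, table_name) key."""

    def __init__(self):
        self._tables = {}

    def load_table(self, db_name, table_name, path=None):
        # Assumption: fall back to a default path when none is given
        path = path or DEFAULT_PATHS[table_name]
        with open(path, newline="") as f:
            self._tables[(db_name, table_name)] = list(csv.DictReader(f))

    def get_rows(self, db_name, table_name):
        return self._tables[(db_name, table_name)]


# Write a tiny CSV file (made-up tickers) so the example is self-contained
tmp_dir = tempfile.mkdtemp()
toy_ticker_path = os.path.join(tmp_dir, "ticker.csv")
with open(toy_ticker_path, "w", newline="") as f:
    f.write("fundNo,ticker\n105,AAAA\n2704,BBBB\n")
DEFAULT_PATHS["ticker"] = toy_ticker_path

csv_reader = MiniCsvReader()
csv_reader.load_table(db_name="fund_clustering", table_name="ticker")
print(csv_reader.get_rows("fund_clustering", "ticker")[0])
```

Because both readers answer to the same key, code that retrieves a table does not need to know which source it was loaded from.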

## 3.3 Retrieve the raw data from any reader

In [33]:
returns = reader.get_dataframe(db_name='fund_clustering', table_name='returns')
returns.iloc[:, 0:6].head()

|   | date | 105 | 2704 | 2706 | 2708 | 2724 |
|---|---|---|---|---|---|---|
| 0 | 2010-01-04 | 0.024129 | 0.005268 | 0.010772 | 0.014401 | 0.018217 |
| 1 | 2010-01-05 | 0.003927 | 0.00262 | 0.002664 | 0.002662 | 0.001883 |
| 2 | 2010-01-06 | 0.003911 | 0.0 | 0.000886 | 0.00177 | 0.00188 |
| 3 | 2010-01-07 | -0.001299 | 0.000871 | 0.000885 | 0.0 | 0.0 |
| 4 | 2010-01-08 | 0.006502 | 0.002611 | 0.004421 | 0.0053 | 0.005629 |


# 4. Data Catcher classes

The DataCatchers' job is to process the data from a reader into data that is usable by a model. Unfortunately, we could not come up with a better implementation than one catcher per (model, source_type) pair.

Reasons:
- The processing can differ from model to model (mostly because models need different datasets)
- The source types hold different (but equivalent) data: the fund number is named ```crsp_fundno``` in the CSV files and ```fundNo``` in SQL
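
One catcher per (source, model) pair can be pictured as a lookup table. The sketch below is hypothetical throughout (class names, `CATCHERS`, `get_catcher`, `fund_ids`); only the column names `crsp_fundno` and `fundNo` come from the text above.

```python
class ClassicCsvCatcher:
    fund_column = "crsp_fundno"  # fund identifier as named in the CSV files

    def fund_ids(self, rows):
        return [row[self.fund_column] for row in rows]


class ClassicSqlCatcher(ClassicCsvCatcher):
    fund_column = "fundNo"  # same information, named differently in SQL


# One entry per (source, model) pair
CATCHERS = {
    ("csv", "classic"): ClassicCsvCatcher,
    ("sql", "classic"): ClassicSqlCatcher,
}


def get_catcher(source, model):
    try:
        return CATCHERS[(source, model)]()
    except KeyError:
        raise ValueError(f"No catcher for source={source!r}, model={model!r}") from None


csv_rows = [{"crsp_fundno": 105}, {"crsp_fundno": 2704}]
sql_rows = [{"fundNo": 105}, {"fundNo": 2704}]

# Both catchers expose the same fund numbers despite the different column names
print(get_catcher("csv", "classic").fund_ids(csv_rows))  # [105, 2704]
print(get_catcher("sql", "classic").fund_ids(sql_rows))  # [105, 2704]
```

Adding a model or a source then means adding one class and one dictionary entry, which is the cost of this design.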

In [15]:
source_type = 'csv'
model_name = 'classic' # only classic exists for now

catcher = DataHelper.get_data_catcher(source=source_type, model=model_name)

## 4.1 Process the data

As it can take some time, the processing should be done before fitting any model. 

In [16]:
catcher.process()

Loading data...
... Finished loading data
Processing data...
... Finished processing data


## 4.2 Using the data

The order in which the data is retrieved is implemented in the semi-private method ```_pack_data()```: it is a generator function that yields what is needed for each layer, in sequence.

We can use ```unpack_data()``` to get the data of one layer at a time. For instance, for the classic model:

### First layer

In [26]:
features, asset_type = catcher.unpack_data(keys=['features', 'asset_type'])

In [27]:
features.head()

|   | cash | equity | bond | security |
|---|---|---|---|---|
| 105 | 2.3 | 97.69 | 0.0 | 0.0 |
| 2704 | -23.01 | 25.16 | 57.1 | 40.74 |
| 2706 | -10.28 | 44.5 | 40.62 | 25.18 |
| 2708 | -6.16 | 70.73 | 20.58 | 14.82 |
| 2724 | -3.57 | 93.33 | 3.77 | 6.47 |


In [28]:
asset_type

['cash', 'equity', 'bond', 'security']

### Second layer

In [29]:
returns, = catcher.unpack_data(keys=['returns'])

In [32]:
returns.iloc[:, 0:6].head()

| date | 105 | 2704 | 2706 | 2708 | 2724 | 2725 |
|---|---|---|---|---|---|---|
| 2012-01-03 | 0.014134 | 0.003419 | 0.008496 | 0.015517 | 0.02112 | 0.029126 |
| 2012-01-04 | -0.003484 | 0.0 | -0.001685 | -0.001698 | -0.002698 | -0.007719 |
| 2012-01-05 | 0.003497 | 0.000852 | -0.000844 | -0.001701 | -0.001803 | -0.013829 |
| 2012-01-06 | -0.002323 | 0.0 | 0.0 | -0.001704 | -0.003613 | -0.007011 |
| 2012-01-09 | 0.004657 | 0.001702 | 0.001689 | 0.001706 | 0.00272 | 0.001765 |
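
The layer-by-layer retrieval shown above can be mimicked with a Python generator. This is a minimal sketch under assumptions: only the method names `_pack_data` and `unpack_data` and the keys come from this document, while the class `ToyCatcher`, its internals, and the key check are guesses made for illustration.

```python
class ToyCatcher:
    def __init__(self, features, asset_type, returns):
        self._data = {"features": features, "asset_type": asset_type, "returns": returns}
        self._iterator = self._pack_data()

    def _pack_data(self):
        # Yield, in order, what each layer needs
        yield {k: self._data[k] for k in ("features", "asset_type")}  # first layer
        yield {k: self._data[k] for k in ("returns",)}                # second layer

    def unpack_data(self, keys):
        # Advance to the next layer and hand back its data in the requested order
        layer = next(self._iterator)
        if set(keys) != set(layer):
            raise KeyError(f"Expected keys {sorted(layer)}, got {sorted(keys)}")
        return tuple(layer[k] for k in keys)


toy = ToyCatcher(
    features=[[2.3, 97.69, 0.0, 0.0]],
    asset_type=["cash", "equity", "bond", "security"],
    returns=[0.014134, -0.003484],
)

toy_features, toy_asset_type = toy.unpack_data(keys=["features", "asset_type"])
toy_returns, = toy.unpack_data(keys=["returns"])
print(toy_asset_type)  # ['cash', 'equity', 'bond', 'security']
```

In this sketch the generator is consumed as layers are unpacked, so a third call to `unpack_data` raises `StopIteration`; how the real catcher resets between fits is not covered here.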


# 5. Data Writer class

For now, only the SQL output source is implemented. A CSV output could be added too.

A DataWriter's job is to output a dataframe to a SQL table. It uses the templates set in the ```Config/SQL/Structure/``` folder.

In [34]:
writer = DataHelper.get_data_writer()

In [42]:
n = 5
fake_clusters = pd.DataFrame.from_dict({
    'fundNo': range(n),
    'main_cluster': n * [1],
    'sub_cluster': range(1, n+1),
})
fake_clusters

|   | fundNo | main_cluster | sub_cluster |
|---|---|---|---|
| 0 | 0 | 1 | 1 |
| 1 | 1 | 1 | 2 |
| 2 | 2 | 1 | 3 |
| 3 | 3 | 1 | 4 |
| 4 | 4 | 1 | 5 |


In [43]:
# Careful before executing this cell uncommented: it could erase existing results

"""
writer.update_raw_data(
    db_name = 'fund_clustering', 
    table_name = 'clustering_output', 
    dataframe = fake_clusters,
    chunk_size = None # For huge dataframe can be useful to set a chunk_size as an int
)
"""
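
To illustrate what a `chunk_size` argument might do, here is a sketch that writes rows in batches. It uses the standard-library `sqlite3` module with an in-memory database as a stand-in, so nothing is erased. The project's actual database engine, the `write_in_chunks` helper, and the table schema are assumptions, not the DataWriter's real code.

```python
import sqlite3


def write_in_chunks(conn, table_name, columns, rows, chunk_size=None):
    """Insert rows into table_name, chunk_size rows at a time (all at once if None)."""
    placeholders = ", ".join("?" for _ in columns)
    # Table and column names are interpolated: only use trusted identifiers here
    sql = f"INSERT INTO {table_name} ({', '.join(columns)}) VALUES ({placeholders})"
    step = chunk_size or len(rows) or 1
    for start in range(0, len(rows), step):
        conn.executemany(sql, rows[start:start + step])
        conn.commit()  # one commit per chunk


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE clustering_output (fundNo INTEGER, main_cluster INTEGER, sub_cluster INTEGER)"
)

n_funds = 5
cluster_rows = [(i, 1, i + 1) for i in range(n_funds)]  # same values as fake_clusters
write_in_chunks(
    conn,
    "clustering_output",
    ["fundNo", "main_cluster", "sub_cluster"],
    cluster_rows,
    chunk_size=2,
)

count = conn.execute("SELECT COUNT(*) FROM clustering_output").fetchone()[0]
print(count)  # 5
```

Writing in chunks bounds the size of each statement sent to the database, which is why it can help with very large dataframes.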

# 6. Data Maker class

This class makes fake data, for use in unit testing for instance. It should be revamped when models have new data needs.

It should have the same interface as a DataCatcher so as to be used to fit a model directly.
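
Sharing an interface means a model can accept either object without knowing which one it was given. The sketch below expresses that idea with `typing.Protocol` (Python 3.8+). It assumes the shared interface is an `unpack_data` method, as described for the catchers; `DataProvider`, `FakeMaker` and `fit` are hypothetical names, not the project's API.

```python
from typing import Protocol, Sequence, runtime_checkable


@runtime_checkable
class DataProvider(Protocol):
    def unpack_data(self, keys: Sequence[str]) -> tuple: ...


class FakeMaker:
    """Toy maker serving fake data through the same method as a catcher."""

    def __init__(self):
        self._data = {"returns": [0.025, -0.030, 0.030]}

    def unpack_data(self, keys):
        return tuple(self._data[k] for k in keys)


def fit(provider: DataProvider) -> int:
    # The model relies only on the shared interface, not on the concrete class
    returns, = provider.unpack_data(keys=["returns"])
    return len(returns)


fake = FakeMaker()
print(isinstance(fake, DataProvider))  # True: structural check, no inheritance needed
print(fit(fake))  # 3
```

Note that the runtime check only verifies that the method exists, not its signature, so it complements tests rather than replacing them.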

In [48]:
maker = DataHelper.get_data_maker()

## 6.1 Add one fund at a time

In [49]:
m_days = 20
fake_returns = pd.Series(
    np.random.randint(-10, 10, m_days) / 200,
    index = pd.date_range('2020-12-01', periods=m_days)
)
fake_returns.head()

2020-12-01    0.025
2020-12-02   -0.030
2020-12-03    0.030
2020-12-04    0.025
2020-12-05   -0.005
Freq: D, dtype: float64

In [50]:
fake_morning_star_row = [
    1, # fundNo
    '2020-12-31', # date
    0, # cash
    50, # equity
    25, # bond
    25, # security
    'class 1', # lipper class name
]

In [52]:
maker.add_fake_fund(fake_morning_star_row, fake_returns);

## 6.2 Add n funds simultaneously

In [54]:
fake_morning_star_row1, fake_returns1 = fake_morning_star_row, fake_returns
fake_morning_star_row2, fake_returns2 = fake_morning_star_row, fake_returns

In [56]:
fake_funds = [
    (fake_morning_star_row1, fake_returns1),
    (fake_morning_star_row2, fake_returns2),
]

In [58]:
maker.bulk_add_fake_fund(fake_funds)

The maker can now be used to fit a ```Classic``` model.