# Data Pre-processing

Turning raw data into ingredients fit for our strategy recipe is tedious but necessary work. Here's how I cleaned and processed the data for this project.

Ultimately, we want the following series:

1. Industry Index Price Series (Quarterly)
2. Industry Index Total Return Series (Quarterly)
3. Industry Index Earnings Series (Quarterly)

In [1]:
import pandas as pd
from datetime import datetime, timedelta
from xquant.util import *

In [2]:
DATA_DIR = 'D:/Repositories/cicc/Industry Momentum + CAPE/data'  # folder holding all raw CSV files

# tickers are read as strings; date columns are parsed up front
df_div = pd.read_csv(f'{DATA_DIR}/dividends.csv', parse_dates=['announced'], dtype={'ticker': str})
df_price = pd.read_csv(f'{DATA_DIR}/price.csv', index_col=['date'], parse_dates=['date'])
df_mktcap = pd.read_csv(f'{DATA_DIR}/market_cap.csv', index_col=['date'], parse_dates=['date'])
df_members = pd.read_csv(f'{DATA_DIR}/WIND_index_members.csv', parse_dates=['included', 'excluded'])
df_map = pd.read_csv(f'{DATA_DIR}/ticker_map.csv', index_col=['key'])
df_idx = pd.read_csv(f'{DATA_DIR}/WIND_industry_index.csv', index_col=['Date'], parse_dates=['Date'])

In [3]:
# time range for the backtest
START = datetime(2010,1,1)
END = datetime(2020,12,31)

## Clean Dividend Data

In [5]:
df_div['ticker'] = df_div['ticker'].apply(add_suffix) # convert ticker symbol into standard format (e.g. 000001.SZ)

In [6]:
df_div.dropna(subset=['announced'], inplace=True)
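
The `add_suffix` helper used above comes from `xquant.util`, whose source is not shown here. As a rough illustration of what such a conversion might look like, here is a minimal sketch. The function name `add_suffix_sketch` is hypothetical, and the prefix rule (codes starting with 6 or 9 trade in Shanghai, codes starting with 0, 2 or 3 trade in Shenzhen) is an assumption based on the usual A-share convention, not the library's actual logic.

```python
def add_suffix_sketch(ticker: str) -> str:
    """Hypothetical stand-in for xquant.util.add_suffix (real implementation not shown)."""
    code = str(ticker).strip()
    if "." in code:
        # already in standard format, e.g. 000001.SZ
        return code
    code = code.zfill(6)  # restore leading zeros lost when the code was stored as a number
    # ASSUMPTION: exchange is inferred from the first digit of the six-digit code
    if code[0] in ("6", "9"):
        return code + ".SH"
    if code[0] in ("0", "2", "3"):
        return code + ".SZ"
    raise ValueError(f"cannot infer exchange for ticker {code!r}")


examples = [add_suffix_sketch(t) for t in ("1", "600000", "000001.SZ")]
print(examples)
```

Raising on an unrecognised prefix, rather than guessing, keeps bad tickers from silently flowing into later joins.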

## Clean Index Members Data

In [7]:
# map industry symbols to the actual industry names
df_members['industry'] = df_members['industry'].apply(lambda x: df_map.at[x,'value'])

In [8]:
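# if a stock is still a member of the index, set its exclusion date to a date far in the future
# (assign the result back; calling fillna(inplace=True) on a selected column may not update df_members in recent pandas)
df_members['excluded'] = df_members['excluded'].fillna(pd.Timestamp('20991231'))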

In [9]:
df_members.dropna(subset=['included'], inplace=True)
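
Filling open-ended memberships with a far-future date means a single interval test can answer "who was in the index on a given day?". The sketch below shows that idea on toy data. The helper name `members_on`, the `ticker` column, and the choice to treat the exclusion date as the first day a stock is no longer a member are all assumptions made for illustration.

```python
import pandas as pd

FAR_FUTURE = pd.Timestamp("2099-12-31")  # same sentinel idea as in the cell above


def members_on(df_members: pd.DataFrame, date) -> pd.DataFrame:
    """Return the rows whose membership interval [included, excluded) contains `date`."""
    date = pd.Timestamp(date)
    # ASSUMPTION: a stock stops being a member on its exclusion date
    mask = (df_members["included"] <= date) & (date < df_members["excluded"])
    return df_members.loc[mask]


# toy membership table (made-up rows, not taken from the project data)
toy = pd.DataFrame({
    "ticker": ["000001.SZ", "600000.SH", "000002.SZ"],
    "industry": ["Banks", "Banks", "Real Estate"],
    "included": pd.to_datetime(["2005-01-01", "2012-06-01", "2005-01-01"]),
    "excluded": pd.to_datetime(["2015-01-01", None, None]),
})
toy["excluded"] = toy["excluded"].fillna(FAR_FUTURE)

current = members_on(toy, "2016-03-31")["ticker"].tolist()
print(current)
```

Without the sentinel, the comparison against a missing exclusion date would evaluate to `False` and current members would drop out of the result.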

## Calculate Financial Metrics for an Index

Very often we need to look at certain metrics of an index, such as earnings and dividends. For a market-capitalization-weighted index with $n$ members, the index-level metric $m$ is calculated as:

$m = \sum^{n}_{i=1} w_{i} \cdot m_{i}$

where $m_{i}$ is the metric of member $i$ and $w_{i}$ is the weight of member $i$ in the index (i.e. the market cap of member $i$ divided by the total market cap of all members).
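
To make the weighting concrete, here is a minimal sketch of the formula on three made-up members. The function name `index_metric` and the numbers are purely illustrative; the project's own implementation may differ, for example in how it handles missing values.

```python
import pandas as pd


def index_metric(market_cap: pd.Series, metric: pd.Series) -> float:
    """Market-cap-weighted average of a per-member metric."""
    weights = market_cap / market_cap.sum()  # w_i = cap_i / total cap
    return float((weights * metric).sum())   # sum over members of w_i * m_i


# toy data: three members with made-up market caps and per-share metrics
caps = pd.Series({"A": 600.0, "B": 300.0, "C": 100.0})
eps = pd.Series({"A": 1.0, "B": 2.0, "C": 5.0})

value = index_metric(caps, eps)  # 0.6 * 1.0 + 0.3 * 2.0 + 0.1 * 5.0
print(round(value, 4))
```

Both series are aligned on their index labels before multiplying, so they should cover the same set of members; a member missing from either series would contribute nothing to the sum while its market cap would still dilute the other weights.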

In [10]:
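# Sanity check on a single ticker: for every quarter in the backtest window,
# print the quarter, the quarter that precedes it, and the result of
# quarter_sum over `div_per_share` for that preceding quarter.
for x in quarter_generator(START,END):
    print(x.year, x.quarter)
    # the preceding quarter; Q1 rolls back to Q4 of the previous year
    if x.quarter == 1:
        should = (x.year-1, 4)
    else:
        should = (x.year, x.quarter-1)
    print(should)

    print(quarter_sum(ticker='000001.SZ',quarter=should,df=df_div,sum_col='div_per_share',date_col='announced'))
    print('\n')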

2010 1
(2009, 4)
0.0


2010 2
(2010, 1)
0.0


2010 3
(2010, 2)
0.0


2010 4
(2010, 3)
0.0


2011 1
(2010, 4)
0.0


2011 2
(2011, 1)
0.0


2011 3
(2011, 2)
0.0


2011 4
(2011, 3)
0.0


2012 1
(2011, 4)
0.0


2012 2
(2012, 1)
0.0


2012 3
(2012, 2)
0.0


2012 4
(2012, 3)
0.0


2013 1
(2012, 4)
0.09


2013 2
(2013, 1)
0.0


2013 3
(2013, 2)
0.1315


2013 4
(2013, 3)
0.0


2014 1
(2013, 4)
0.0


2014 2
(2014, 1)
0.0


2014 3
(2014, 2)
0.152


2014 4
(2014, 3)
0.0


2015 1
(2014, 4)
0.0


2015 2
(2015, 1)
0.0


2015 3
(2015, 2)
0.1653


2015 4
(2015, 3)
0.0


2016 1
(2015, 4)
0.0


2016 2
(2016, 1)
0.0


2016 3
(2016, 2)
0.153


2016 4
(2016, 3)
0.0


2017 1
(2016, 4)
0.0


2017 2
(2017, 1)
0.0


2017 3
(2017, 2)
0.0


2017 4
(2017, 3)
0.158


2018 1
(2017, 4)
0.0


2018 2
(2018, 1)
0.0


2018 3
(2018, 2)
0.0


2018 4
(2018, 3)
0.136


2019 1
(2018, 4)
0.0


2019 2
(2019, 1)
0.0


2019 3
(2019, 2)
0.145


2019 4
(2019, 3)
0.0


2020 1
(2019, 4)
0.0


2020 2
(2020, 1)
0.0


2020 3
(2020, 2)
