# Exploring and Cleaning Returns Dataset

In this notebook, we will be exploring the Chinese equities returns dataset.

In [1]:
#!pip install pyreadr

Collecting pyreadr
  Downloading https://files.pythonhosted.org/packages/ad/fc/93a60f4eb7be983959ca6fb463130fd598c6cc0581b9dee31d1eb7630943/pyreadr-0.4.2-cp37-cp37m-macosx_10_9_x86_64.whl (250kB)
Installing collected packages: pyreadr
Successfully installed pyreadr-0.4.2


In [2]:
# Import the packages we will need
import numpy as np
import pandas as pd
import os

# Used to read .rds file
import pyreadr

In [5]:
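# Read the data. read_r returns a dict-like object of DataFrames;
# an .Rds file holds a single unnamed object, stored under the key None
data = pyreadr.read_r('../../dailyChina.Rds')
data = data[None]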

In [6]:
data.head()

,id,prc,Date,ret,lagME,industryCitic
0,1,79.614491,2007-02-01,-0.035024,29778860000.0,Banks
1,1,75.646705,2007-02-02,-0.049837,28735900000.0,Banks
2,1,71.980815,2007-02-05,-0.048461,27303780000.0,Banks
3,1,75.560448,2007-02-06,0.04973,25980620000.0,Banks
4,1,79.355722,2007-02-07,0.050228,27272640000.0,Banks


In a meeting, I was told that:
* id -- The ID of the company on the exchange
* prc -- The closing stock price of the company
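* Date -- Trading date of the observation
* ret -- Daily return for the stock
* lagME -- Market capitalization of the stock (judging by the name and the rows above, lagged by one trading day: each value equals the previous row's lagME times 1 + ret)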
* industryCitic -- Sector of the company
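
The reading of `lagME` as a one-day-lagged market capitalization can be checked against the rows printed above. A minimal sketch, assuming the five rows from `data.head()` retyped by hand (the frame `head` and the tolerance are illustrative choices, not part of the original notebook):

```python
import numpy as np
import pandas as pd

# The five rows printed by data.head(), retyped by hand for illustration
head = pd.DataFrame({
    "ret": [-0.035024, -0.049837, -0.048461, 0.049730, 0.050228],
    "lagME": [2.977886e10, 2.873590e10, 2.730378e10, 2.598062e10, 2.727264e10],
})

# If lagME is the previous day's market cap, then
# lagME[t + 1] should be close to lagME[t] * (1 + ret[t])
implied_next = (head["lagME"] * (1 + head["ret"])).iloc[:-1].to_numpy()
actual_next = head["lagME"].iloc[1:].to_numpy()

# Loose relative tolerance because the printed values are rounded
is_lagged = bool(np.allclose(implied_next, actual_next, rtol=1e-4))
print(is_lagged)
```

On these five rows the relation holds to within rounding. That is consistent with the lagged interpretation, but it does not confirm it for the full dataset.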

We will need to do some data wrangling to create a dataframe of returns indexed by date, with one column per company. Let's get started:

In [8]:
# Extract one company's data
sample = data.loc[data['id'] == '000001']
sample

,id,prc,Date,ret,lagME,industryCitic
0,000001,79.614491,2007-02-01,-0.035024,2.977886e+10,Banks
1,000001,75.646705,2007-02-02,-0.049837,2.873590e+10,Banks
2,000001,71.980815,2007-02-05,-0.048461,2.730378e+10,Banks
3,000001,75.560448,2007-02-06,0.049730,2.598062e+10,Banks
4,000001,79.355722,2007-02-07,0.050228,2.727264e+10,Banks
...,...,...,...,...,...,...
2553972,000001,13.450000,2020-04-21,0.035412,1.260415e+11,Banks
2553973,000001,13.230000,2020-04-22,-0.016357,1.305049e+11,Banks
2553974,000001,13.230000,2020-04-23,0.000000,1.283703e+11,Banks
2553975,000001,13.240000,2020-04-24,0.000756,1.283703e+11,Banks


In [9]:
# We want the Date and returns only for now
sample_returns = sample.loc[:,['Date', 'ret']].set_index('Date')
sample_returns

Date,ret
2007-02-01,-0.035024
2007-02-02,-0.049837
2007-02-05,-0.048461
2007-02-06,0.049730
2007-02-07,0.050228
...,...
2020-04-21,0.035412
2020-04-22,-0.016357
2020-04-23,0.000000
2020-04-24,0.000756


This seems simple enough. Let's put it all together now:

In [10]:
# Create a dictionary as it's easy to concatenate afterwards
returns_dict = {}

# Loop through each unique id
for ID in set(data['id'].values):

    # Same as before, applied to one id at a time
    sample = data.loc[data['id'] == ID]
    sample_returns = sample.loc[:,['Date', 'ret']].set_index('Date')
    sample_returns.columns = [ID]
    returns_dict[ID] = sample_returns.copy()

In [13]:
# Sanity check
returns_dict['002110']

Date,002110
2007-08-01,-0.067320
2007-08-02,0.025928
2007-08-03,0.013661
2007-08-06,0.037736
2007-08-07,0.033766
...,...
2020-04-21,-0.016086
2020-04-22,-0.019074
2020-04-23,-0.005556
2020-04-24,-0.016760


Looks good! Let's concatenate all the returns now into a single dataframe:

In [14]:
returns = pd.concat([x for x in returns_dict.values()], axis=1)
returns

Date,600525,300274,601928,600717,603160,600651,600160,600378,601068,600469,...,600692,002841,600592,002064,600064,002047,600699,600597,000501,600051
2007-02-01,-0.003488,,,-0.018416,,,0.019264,-0.008333,,0.091954,...,0.029279,,0.007645,,-0.006563,-0.001653,,0.003968,0.050407,0.001119
2007-02-02,-0.000269,,,-0.000938,,,-0.015464,0.021008,,-0.025000,...,-0.030635,,-0.005311,,0.000000,-0.014901,,-0.046113,0.054180,-0.032402
2007-02-05,0.002155,,,0.008451,,,0.036649,0.047325,,0.000000,...,0.011287,,0.004577,,-0.069364,0.000000,,-0.001381,0.099853,0.013857
2007-02-06,0.030099,,,0.022346,,,0.035354,-0.055010,,0.048583,...,0.037946,,0.010630,,0.006211,0.057143,,0.040111,0.029373,0.055809
2007-02-07,0.045656,,,0.029144,,,0.011382,0.035343,,0.014157,...,0.002151,,0.015778,,0.029982,-0.011129,,0.003989,0.058366,-0.007551
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2009-01-19,,,,0.007194,,-0.004049,0.010163,,,0.043546,...,0.010101,,0.013769,0.008824,0.047909,,,-0.014644,0.010490,0.004792
2009-01-20,,,,0.009184,,0.008130,-0.016097,,,0.034277,...,-0.011667,,-0.008489,-0.007289,0.013300,,,0.004246,0.015571,-0.007949
2009-01-21,,,,0.010111,,-0.006048,0.010225,,,-0.021614,...,-0.011804,,0.041096,0.036711,0.035275,,,-0.021142,-0.018739,-0.003205
2009-01-22,,,,0.008008,,0.010142,-0.008097,,,0.010309,...,0.025597,,0.009868,0.076487,-0.002377,,,0.002160,0.041667,0.040193
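
The loop and `pd.concat` above can also be written as a single reshape. A minimal sketch using `DataFrame.pivot` on a made-up stand-in for `data` (the frame `toy` and its values are assumptions for illustration, not rows from the dataset):

```python
import pandas as pd

# Tiny stand-in for `data` in long format (values are made up for illustration)
toy = pd.DataFrame({
    "id": ["000001", "000001", "000002", "000002", "000002"],
    "Date": pd.to_datetime(
        ["2007-02-02", "2007-02-01", "2007-02-01", "2007-02-02", "2007-02-05"]
    ),
    "ret": [0.01, -0.02, 0.03, 0.00, -0.01],
})

# One row per date, one column per id; missing (Date, id) pairs become NaN.
# sort_index() makes the chronological ordering explicit.
wide = toy.pivot(index="Date", columns="id", values="ret").sort_index()
print(wide)
```

`pivot` raises a `ValueError` if any (Date, id) pair appears more than once, so it also acts as a check for duplicate rows.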


This looks very promising! Let's now do a sanity check to ensure all our steps were done correctly:

In [16]:
all(data.loc[data['id'] == '002110']['ret'].values == returns['002110'].dropna().values)

False
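
A minimal sketch of a comparison that does not depend on row order or on how NaN compares, run on toy data. The helper name `column_matches` and the toy frames are hypothetical stand-ins for the check on `data` and `returns`. Whether this resolves the result on the real dataset is not known.

```python
import numpy as np
import pandas as pd

def column_matches(long_df, wide_df, stock_id):
    """Compare one id's returns in the long frame with its wide column."""
    # Index the long-format returns by date and sort chronologically
    expected = (
        long_df.loc[long_df["id"] == stock_id]
        .set_index("Date")["ret"]
        .sort_index()
    )
    # Align the wide column to exactly the dates this id has in the long frame
    actual = wide_df[stock_id].reindex(expected.index)
    # allclose compares within a small tolerance; equal_nan=True treats
    # NaN in the same position on both sides as a match
    return bool(np.allclose(expected.to_numpy(), actual.to_numpy(), equal_nan=True))

# Toy long-format data: rows out of date order, one missing return
long_df = pd.DataFrame({
    "id": ["A", "A", "A", "B"],
    "Date": pd.to_datetime(["2020-01-03", "2020-01-02", "2020-01-06", "2020-01-02"]),
    "ret": [0.01, np.nan, -0.02, 0.05],
})
wide_df = long_df.pivot(index="Date", columns="id", values="ret")

print(column_matches(long_df, wide_df, "A"))
print(column_matches(long_df, wide_df, "B"))
```

Aligning on the date index before comparing means the result reflects the values themselves, not the order in which the rows happen to be stored.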

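The check returns `False`, so the reshaped column has not been confirmed to match the original data. The mismatch may come from row order (the last rows of `returns` printed above are from January 2009, so the concatenated index does not appear to be chronological) or from missing values, and it should be resolved before relying on the result. With that caveat, let's export this to an h5 file: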

In [17]:
returns.to_hdf('returns.h5', key='returns')