### Matching Data to Datastream Historical Stock Prices, MV, and Volumes

By Xiaoran (Jason) Jia, Oct 2022

- Step 1: Establish matchable firms. Using SEDOL and CUSIP codes, create separate lists of firms that can be matched to Datastream-readable codes.
- Step 2: Query the lists (performed in Excel). This is a slow and tedious task.
- Step 3: Match the queried lists to the sample. The merging process is extremely time-consuming (unless the machine has ample processing power and RAM).
- Step 4: Match the data to benchmark index returns.
- Step 5: Perform calculations to get the three categories of measurements.

In [2]:
# Import all modules required
import pandas as pd
import numpy as np
import re
import datetime as dt
from datetime import datetime, date
from pandas.tseries.offsets import *
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', 20)

### Step 1: Establish Matchable Firms

#### 1.1 Import and modify sample file

New variables created are, for example:

- 'ann_p0' means earnings announcement date;
- 'ann_p1' means earnings announcement date plus one business day;
- 'ann_m1' means earnings announcement date minus one business day;
- 'ann_m120' means earnings announcement date minus 120 business days;
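
As a quick illustration of how these offsets behave around weekends, the sketch below applies the pandas `BusinessDay` offset to a single date that falls on a Sunday. The variable names mirror the column names but the snippet is illustrative only. Note that `BusinessDay` skips weekends but knows nothing about exchange holidays.

```python
import pandas as pd
from pandas.tseries.offsets import BusinessDay

# An announcement date that falls on a Sunday.
ann_p0 = pd.Timestamp("2019-03-31")

# Adding or subtracting business days rolls over the weekend.
ann_p1 = ann_p0 + BusinessDay()        # next business day (Monday)
ann_p2 = ann_p0 + 2 * BusinessDay()    # two business days ahead (Tuesday)
ann_m1 = ann_p0 - BusinessDay()        # previous business day (Friday)

print(ann_m1.date(), ann_p1.date(), ann_p2.date())
```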

In [44]:
# Import
ann = pd.read_sas('anndate.sas7bdat', format='sas7bdat', encoding="utf-8", )

# Modify column types and names
ann['CONM'] = ann['CONM'].astype(str)
ann['SEDOL'] = ann['SEDOL'].astype(str)
ann['GVKEY'] = ann['GVKEY'].astype(str)
ann['CUSIP'] = ann['CUSIP'].astype(str)
ann.rename(columns={'ANNDATS_ACT':'ann_p0'}, inplace=True)

# Add +1, +2 and -1, -2 columns
ann['ann_p1'] = ann['ann_p0'] + BusinessDay()
ann['ann_p2'] = ann['ann_p0'] + 2*BusinessDay()
ann['ann_m1'] = ann['ann_p0'] - BusinessDay()
ann['ann_m2'] = ann['ann_p0'] - 2*BusinessDay()

# Add -120 to -21 columns
for i in range(21, 121):
    ann[f"ann_m{i}"] = ann['ann_p0'] - i*BusinessDay()
    
# Convert the 'nan' strings produced by astype(str) back to missing values (None is the missing value for strings in pandas)
ann.replace({'nan':None}, inplace=True)

# Print some basic information
print(f"There are {len(ann.GVKEY.unique())} unique firms")
print(" ") # empty line
print(f"{len(ann.loc[(ann.SEDOL.isnull()==False) | (ann.CUSIP.isnull()==False)].GVKEY.unique())} unique firms have either CUSIP or SEDOL number")
print(f"{len(ann.loc[(ann.SEDOL.isnull()) & (ann.CUSIP.isnull())].GVKEY.unique())} unique firms have neither CUSIP nor SEDOL number")
print(" ") # empty line
print(f"{len(ann[ann.CUSIP.isnull()==False].GVKEY.unique())} unique firms have the CUSIP number")
print(f"{len(ann[ann.SEDOL.isnull()==False].GVKEY.unique())} unique firms have the SEDOL number")
print(" ") # empty line
# print("The GVKEYs for firms with neither CUSIP nor SEDOL numbers are:")
# for i in ann.loc[(ann.SEDOL.isnull()) & (ann.CUSIP.isnull())].GVKEY.unique():
#     print(i)

There are 25201 unique firms
 
25193 unique firms have either CUSIP or SEDOL number
8 unique firms have neither CUSIP nor SEDOL number
 
7950 unique firms have the CUSIP number
17250 unique firms have the SEDOL number
 


#### 1.2 The matchable firm list
Note: the following 9 lists were created in Datastream (DS) through the following process:
1. Export the unique firms with either a CUSIP or a SEDOL number to Excel.
2. Divide the firms with codes into sub-groups of fewer than 5,000 firms each, because DS does not seem to allow larger lists.
3. Query the lists separately using 'static requests', then delete the firms with no matches.
4. Save all the firms matched this way.
5. For the firms not matched, I use the following strategy: I statically query (i.e., not as a time series) all active and dead (delisted) firms via DS and extract their CUSIP codes (for firms listed in the U.S. and Canada) and SEDOL codes. Then I match the previously unmatched firms against this large list.
6. Finally, all the matchable firms (23,417 of them, as the later code shows) are allocated to the 9 lists below, ready for the DS query.

- L#XJ01
- L#XJ02
- L#XJ03
- L#XJ04
- L#XJ05
- L#XJ06
- L#XJ07
- L#XJ08
- L#XJ09
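
Item 2 of the process above (splitting the identifiers into groups of fewer than 5,000) could be scripted rather than done by hand. The following is a minimal sketch, not the procedure actually used to build these lists: the function name `chunk_codes`, the limit of 4,999, and the dummy codes are assumptions for illustration.

```python
from typing import List

def chunk_codes(codes: List[str], max_size: int = 4999) -> List[List[str]]:
    """Split a list of identifiers into consecutive groups of at most max_size."""
    return [codes[i:i + max_size] for i in range(0, len(codes), max_size)]

# Dummy identifiers standing in for the SEDOL/CUSIP codes exported to Excel.
codes = [f"CODE{i:05d}" for i in range(12000)]
groups = chunk_codes(codes)

for n, group in enumerate(groups, start=1):
    print(f"group {n}: {len(group)} codes")
```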

### Step 2: Query the lists in Datastream (via Excel add-in) - this process is performed in Excel
- For each list of firms, I query the daily stock price, daily market value, and daily trading volume of each firm from 1/1/1998 to 05/31/2021.
- The process is very time-consuming, as DS queries run very slowly through the Excel add-in.
- The result is 27 files (3 types of query multiplied by 9 lists).
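
The 27 result files can be enumerated programmatically, for example to check that none is missing before the merge. The sketch below assumes that the naming pattern used by the `read_csv` calls later in this notebook (`DS_XJ01.csv` for prices, `DS_XJ01_MV.csv` and `DS_XJ01_VO.csv` for market value and volume) holds for all nine lists; that is an assumption, since only some of the files are read in this section.

```python
from pathlib import Path

# Folder name as used in the read_csv calls; adjust to your own layout.
folder = Path("DS query results")

# Suffix per data type: price files have no suffix, MV and VO files do.
suffixes = {"price": "", "MV": "_MV", "VO": "_VO"}

expected_files = [
    folder / f"DS_XJ{k:02d}{suffix}.csv"
    for k in range(1, 10)           # lists XJ01 .. XJ09
    for suffix in suffixes.values()
]

print(len(expected_files))          # 3 data types x 9 lists
missing = [p for p in expected_files if not p.exists()]
print(f"{len(missing)} file(s) not found")
```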

### Step 3: Match the queried lists to the sample - extremely time-consuming
- In this step, I match the original file to the 27 files.
- For each list, the price, market value, and trading volume files are matched together for all the days (i.e., day 2, 1, 0, -1, -2, -21, -22, ..., -120).
- The process is extremely time-consuming given the number of merges performed.

#### 3.1 SEDOL List XJ01

In [9]:
sedol_xj01_P = pd.read_csv("DS query results\\DS_XJ01.csv")
sedol_xj01_P['SEDOL']=sedol_xj01_P['Code'].str[:7]
sedol_xj01_P.drop(['Name', 'Code', 'Sedol', 'CURRENCY'], axis=1, inplace=True)

sedol_xj01_P_long = pd.melt(sedol_xj01_P, id_vars=['SEDOL'], var_name = 'date', value_name='price')
sedol_xj01_P_long = sedol_xj01_P_long[sedol_xj01_P_long.SEDOL.isnull()==False]
sedol_xj01_P_long['date'] = pd.to_datetime(sedol_xj01_P_long['date'])
sedol_xj01_P_long.head()

Unnamed: 0,SEDOL,date,price
0,5165294,1998-01-01,6.6567
1,287580,1998-01-01,324.05
2,108120,1998-01-01,1001.0
4,798059,1998-01-01,400.0
5,3091357,1998-01-01,335.51
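
The cells for price, market value, and volume all repeat the same wide-to-long reshape. A minimal sketch of a shared helper follows, run on a tiny made-up table rather than a real Datastream export; the function name `ds_wide_to_long` and the toy values are hypothetical.

```python
import pandas as pd

def ds_wide_to_long(wide: pd.DataFrame, value_name: str) -> pd.DataFrame:
    """Reshape a wide table (one column per date) to one row per SEDOL and date."""
    long = wide.melt(id_vars=["SEDOL"], var_name="date", value_name=value_name)
    long = long[long["SEDOL"].notna()].copy()      # drop rows without an identifier
    long["date"] = pd.to_datetime(long["date"])    # column labels -> datetimes
    return long.reset_index(drop=True)

# Toy stand-in for a Datastream price export (values are invented).
wide = pd.DataFrame({
    "SEDOL": ["1111111", None, "2222222"],
    "1998-01-01": [10.0, 5.0, 20.0],
    "1998-01-02": [10.5, 5.1, 19.5],
})

price_long = ds_wide_to_long(wide, "price")
print(price_long)
```

The same helper could be called with `"MV"` or `"VO"` as `value_name` for the other two file types.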


In [17]:
sedol_xj01_MV = pd.read_csv("DS query results\\DS_XJ01_MV.csv")
sedol_xj01_MV['SEDOL']=sedol_xj01_MV['Code'].str[:7]
sedol_xj01_MV.drop(['Name', 'Code', 'CURRENCY'], axis=1, inplace=True)

sedol_xj01_MV_long = pd.melt(sedol_xj01_MV, id_vars=['SEDOL'], var_name = 'date', value_name='MV')
sedol_xj01_MV_long = sedol_xj01_MV_long[sedol_xj01_MV_long.SEDOL.isnull()==False]
sedol_xj01_MV_long['date'] = pd.to_datetime(sedol_xj01_MV_long['date'])
sedol_xj01_MV_long.head()

Unnamed: 0,SEDOL,date,MV
0,5165294,1998-01-01,246.37
1,287580,1998-01-01,17207.23
2,108120,1998-01-01,4874.87
4,798059,1998-01-01,46096.0
5,3091357,1998-01-01,30638.35


In [18]:
sedol_xj01_VO = pd.read_csv("DS query results\\DS_XJ01_VO.csv")
sedol_xj01_VO['SEDOL']=sedol_xj01_VO['Code'].str[:7]
sedol_xj01_VO.drop(['Name', 'Code', 'CURRENCY'], axis=1, inplace=True)

sedol_xj01_VO_long = pd.melt(sedol_xj01_VO, id_vars=['SEDOL'], var_name = 'date', value_name='VO')
sedol_xj01_VO_long = sedol_xj01_VO_long[sedol_xj01_VO_long.SEDOL.isnull()==False]
sedol_xj01_VO_long['date'] = pd.to_datetime(sedol_xj01_VO_long['date'])
sedol_xj01_VO_long.head()

Unnamed: 0,SEDOL,date,VO
0,5165294,1998-01-01,
1,287580,1998-01-01,
2,108120,1998-01-01,
4,798059,1998-01-01,
5,3091357,1998-01-01,


In [19]:
# Merge
sedol_xj01_long = pd.merge(sedol_xj01_P_long, sedol_xj01_MV_long, on=['SEDOL', 'date'], how='left')
sedol_xj01_long = pd.merge(sedol_xj01_long, sedol_xj01_VO_long, on=['SEDOL', 'date'], how='left')
sedol_xj01_long.head()

Unnamed: 0,SEDOL,date,price,MV,VO
0,5165294,1998-01-01,6.6567,246.37,
1,287580,1998-01-01,324.05,17207.23,
2,108120,1998-01-01,1001.0,4874.87,
3,798059,1998-01-01,400.0,46096.0,
4,3091357,1998-01-01,335.51,30638.35,
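
When combining the long tables, it can help to have pandas confirm that each (SEDOL, date) pair appears only once on each side, because duplicated keys would silently multiply rows. The sketch below uses the `validate` argument of `pd.merge` on invented data; it is a suggestion, not part of the original workflow.

```python
import pandas as pd

dates = pd.to_datetime(["1998-01-01", "1998-01-02"])
price = pd.DataFrame({"SEDOL": ["1111111"] * 2, "date": dates, "price": [10.0, 10.5]})
mv = pd.DataFrame({"SEDOL": ["1111111"] * 2, "date": dates, "MV": [100.0, 105.0]})

# validate="one_to_one" raises MergeError if a key is duplicated on either side.
merged = pd.merge(price, mv, on=["SEDOL", "date"], how="left", validate="one_to_one")
print(merged)

# A duplicated key on the right side is reported instead of inflating the result.
mv_dup = pd.concat([mv, mv.iloc[[0]]], ignore_index=True)
try:
    pd.merge(price, mv_dup, on=["SEDOL", "date"], how="left", validate="one_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True
print("duplicate detected:", duplicate_detected)
```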


In [22]:
# use for loop to merge to the sample
cols = ann.columns[6:]
ann1 = ann.copy()
for i in cols:
    ann1 = pd.merge(ann1, sedol_xj01_long, left_on=['SEDOL', i], right_on=['SEDOL', 'date'], how='left')
    ann1.rename(columns={'price':f"price_{i[4:]}", 'MV':f"mv_{i[4:]}", 'VO':f"vo_{i[4:]}"}, inplace=True)
    ann1.drop(['date'], axis=1, inplace=True)
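
The loop above performs one merge per date column (over a hundred in total), which accounts for much of the run time. One possible alternative, sketched below on invented data, is to stack the date columns into long form, merge once, and pivot back to wide. This is a suggestion rather than the method used in this notebook; the helper column `row_id` and the toy frames are assumptions, and the approach presumes each (SEDOL, date) pair is unique in the Datastream table.

```python
import pandas as pd

# Toy sample: two announcements with two window dates each.
ann_toy = pd.DataFrame({
    "GVKEY": ["001", "002"],
    "SEDOL": ["1111111", "2222222"],
    "ann_p0": pd.to_datetime(["1998-01-05", "1998-01-06"]),
    "ann_m1": pd.to_datetime(["1998-01-02", "1998-01-05"]),
})
# Toy Datastream table in long form.
ds_long = pd.DataFrame({
    "SEDOL": ["1111111", "1111111", "2222222"],
    "date": pd.to_datetime(["1998-01-02", "1998-01-05", "1998-01-06"]),
    "price": [10.0, 10.5, 20.0],
})

date_cols = ["ann_p0", "ann_m1"]
ann_toy["row_id"] = range(len(ann_toy))          # stable key for pivoting back

# 1) Stack the date columns: one row per (announcement, window day).
stacked = ann_toy.melt(id_vars=["row_id", "SEDOL"], value_vars=date_cols,
                       var_name="day", value_name="date")
stacked["day"] = stacked["day"].str[4:]          # 'ann_p0' -> 'p0'

# 2) A single merge against the Datastream table.
stacked = stacked.merge(ds_long, on=["SEDOL", "date"], how="left")

# 3) Pivot back to one row per announcement, with columns price_m1, price_p0, ...
wide = stacked.pivot(index="row_id", columns="day", values="price")
wide.columns = [f"price_{d}" for d in wide.columns]
result = ann_toy.merge(wide.reset_index(), on="row_id", how="left")
print(result)
```

The same pivot can be repeated for `MV` and `VO`, so the number of merges no longer grows with the number of window days.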

In [5]:
# Print out the matching results information
# ann1 = pd.read_pickle("ann1.pkl")
ann1_matched = ann1[(ann1.price_p0.isnull()==False) | 
                    (ann1.price_p1.isnull()==False) | 
                    (ann1.price_m1.isnull()==False)]
ann1_unmatched = ann1[(ann1.price_p0.isnull()) & 
                      (ann1.price_p1.isnull()) & 
                      (ann1.price_m1.isnull()) ]

ann1_matched_gvkeys = list(ann1_matched.GVKEY.unique())
ann1_unmatched_error = ann1_unmatched[ann1_unmatched.GVKEY.isin(ann1_matched_gvkeys)]
ann1_todelete = list(ann1_unmatched_error.GVKEY.unique())
ann1_unmatched = ann1_unmatched[ann1_unmatched.GVKEY.isin(ann1_todelete)==False]

print(f"There are {ann1_unmatched_error.shape[0]} firms unmatched that also appear in the matched list due to lost of data because of listing gaps,\n"
     "this is the process that we lose some samples. \n")
print(f"The number of matched stocks is {len(ann1_matched.GVKEY.unique())}")
print(f"The number of unmatched stocks is {len(ann1_unmatched.GVKEY.unique())}")
ann1_matched.sample(3)

There are 86 firms unmatched that also appear in the matched list due to lost of data because of listing gaps,
this is the process that we lose some samples. 

The number of matched stocks is 4722
The number of unmatched stocks is 20479


Unnamed: 0,FYEAR,GVKEY,DATADATE,CONM,SEDOL,CUSIP,ann_p0,ann_p1,ann_p2,ann_m1,...,vo_m117,price_m118,mv_m118,vo_m118,price_m119,mv_m119,vo_m119,price_m120,mv_m120,vo_m120
79232,2009.0,205083,2009-06-30,FELIX RESOURCES LTD,6575687,,2009-08-31,2009-09-01,2009-09-02,2009-08-28,...,1129.4,7.7,1512.63,870.2,7.62,1496.91,514.8,7.14,1402.62,755.9
71031,2013.0,200714,2013-12-31,VANACHAI GROUP PCL,6548924,,2014-02-21,2014-02-24,2014-02-25,2014-02-20,...,270.7,2.3,3604.11,201.1,2.3,3604.11,444.0,2.15,3369.06,314.4
88591,2002.0,210922,2003-04-30,VEGA GROUP PLC,929150,,2003-07-03,2003-07-04,2003-07-07,2003-07-02,...,,63.5,11.75,162.7,65.0,12.02,,65.0,12.02,39.7
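
The split above can be written more compactly with `notna()` across the three price columns. The sketch below reproduces the same logic on a toy frame (all values invented): a row counts as matched if any of the day 0, +1, or -1 prices is present, and unmatched rows are excluded from the carry-over set when their firm already has a matched row.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "GVKEY":    ["001", "001", "002", "003"],
    "price_p0": [10.0, np.nan, np.nan, np.nan],
    "price_p1": [10.2, np.nan, np.nan, 7.0],
    "price_m1": [9.9,  np.nan, np.nan, np.nan],
})

key_cols = ["price_p0", "price_p1", "price_m1"]
has_price = toy[key_cols].notna().any(axis=1)

matched = toy[has_price]
unmatched = toy[~has_price]

# Rows without prices whose firm is matched elsewhere (e.g. listing gaps):
# these are not passed on to the next list.
gap_rows = unmatched[unmatched["GVKEY"].isin(matched["GVKEY"])]
carry_over = unmatched[~unmatched["GVKEY"].isin(matched["GVKEY"])]

print(f"{len(gap_rows)} row(s) lost to gaps")
print(f"matched firms: {matched['GVKEY'].nunique()}")
print(f"firms carried to the next list: {carry_over['GVKEY'].nunique()}")
```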


In [24]:
ann1.to_pickle("ann1.pkl")

#### SEDOL List XJ02

In [26]:
# Price
sedol_xj02_P = pd.read_csv("DS query results\\DS_XJ02.csv")
sedol_xj02_P['SEDOL']=sedol_xj02_P['Code'].str[:7]
sedol_xj02_P.drop(['Name', 'Code', 'CURRENCY'], axis=1, inplace=True)

sedol_xj02_P_long = pd.melt(sedol_xj02_P, id_vars=['SEDOL'], var_name = 'date', value_name='price')
sedol_xj02_P_long = sedol_xj02_P_long[sedol_xj02_P_long.SEDOL.isnull()==False]
sedol_xj02_P_long['date'] = pd.to_datetime(sedol_xj02_P_long['date'])

# MV
sedol_xj02_MV = pd.read_csv("DS query results\\DS_XJ02_MV.csv")
sedol_xj02_MV['SEDOL']=sedol_xj02_MV['Code'].str[:7]
sedol_xj02_MV.drop(['Name', 'Code', 'CURRENCY'], axis=1, inplace=True)

sedol_xj02_MV_long = pd.melt(sedol_xj02_MV, id_vars=['SEDOL'], var_name = 'date', value_name='MV')
sedol_xj02_MV_long = sedol_xj02_MV_long[sedol_xj02_MV_long.SEDOL.isnull()==False]
sedol_xj02_MV_long['date'] = pd.to_datetime(sedol_xj02_MV_long['date'])

# Volume
sedol_xj02_VO = pd.read_csv("DS query results\\DS_XJ02_VO.csv")
sedol_xj02_VO['SEDOL']=sedol_xj02_VO['Code'].str[:7]
sedol_xj02_VO.drop(['Name', 'Code', 'CURRENCY'], axis=1, inplace=True)

sedol_xj02_VO_long = pd.melt(sedol_xj02_VO, id_vars=['SEDOL'], var_name = 'date', value_name='VO')
sedol_xj02_VO_long = sedol_xj02_VO_long[sedol_xj02_VO_long.SEDOL.isnull()==False]
sedol_xj02_VO_long['date'] = pd.to_datetime(sedol_xj02_VO_long['date'])

In [27]:
# Merge
sedol_xj02_long = pd.merge(sedol_xj02_P_long, sedol_xj02_MV_long, on=['SEDOL', 'date'], how='left')
sedol_xj02_long = pd.merge(sedol_xj02_long, sedol_xj02_VO_long, on=['SEDOL', 'date'], how='left')
sedol_xj02_long.head()

Unnamed: 0,SEDOL,date,price,MV,VO
0,5468346,1998-01-01,4.58,274.82,
1,5109560,1998-01-01,25.05,120.26,
2,5182282,1998-01-01,74.1361,142.06,
3,5060322,1998-01-01,8.155,24.47,
4,5970614,1998-01-01,3.66,46.87,


In [28]:
# prepare the table to match
ann1_unmatched.drop(list(ann1_unmatched.filter(regex='mv')), axis=1, inplace=True)
ann1_unmatched.drop(list(ann1_unmatched.filter(regex='price')), axis=1, inplace=True)
ann1_unmatched.drop(list(ann1_unmatched.filter(regex='vo')), axis=1, inplace=True)
ann1_unmatched.drop(list(ann1_unmatched.filter(regex='index')), axis=1, inplace=True)
ann1_unmatched.drop(list(ann1_unmatched.filter(regex='currency')), axis=1, inplace=True)
ann1_unmatched.head()
len(ann1_unmatched.GVKEY.unique())

# use for loop to merge to the sample
cols = ann.columns[6:]
ann2 = ann1_unmatched.copy()
for i in cols:
    ann2 = pd.merge(ann2, sedol_xj02_long, left_on=['SEDOL', i], right_on=['SEDOL', 'date'], how='left')
    ann2.rename(columns={'price':f"price_{i[4:]}", 'MV':f"mv_{i[4:]}", 'VO':f"vo_{i[4:]}"}, inplace=True)
    ann2.drop(['date'], axis=1, inplace=True)
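
The five `drop` calls at the top of this cell can be collapsed into a single column filter. The following sketch shows the idea on a one-row toy frame; the combined pattern simply joins the same substrings with `|`, and the column names are chosen to resemble those in the sample.

```python
import pandas as pd

toy = pd.DataFrame({
    "GVKEY": ["001"], "SEDOL": ["1111111"],
    "ann_p0": pd.to_datetime(["1998-01-05"]),
    "price_p0": [10.0], "mv_p0": [100.0], "vo_p0": [5.0],
})

# Same substrings as the separate drop calls, combined into one pattern.
pattern = "mv|price|vo|index|currency"
drop_mask = toy.columns.str.contains(pattern)

# .copy() gives an independent frame, so later in-place edits do not
# touch (or warn about) the frame it was sliced from.
cleaned = toy.loc[:, ~drop_mask].copy()
print(list(cleaned.columns))
```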

In [6]:
# Print out the matching results information
# ann2 = pd.read_pickle("ann2.pkl")
ann2_matched = ann2[(ann2.price_p0.isnull()==False) | 
                    (ann2.price_p1.isnull()==False) | 
                    (ann2.price_m1.isnull()==False)]
ann2_unmatched = ann2[(ann2.price_p0.isnull()) & 
                      (ann2.price_p1.isnull()) & 
                      (ann2.price_m1.isnull()) ]

ann2_matched_gvkeys = list(ann2_matched.GVKEY.unique())
ann2_unmatched_error = ann2_unmatched[ann2_unmatched.GVKEY.isin(ann2_matched_gvkeys)]
ann2_todelete = list(ann2_unmatched_error.GVKEY.unique())
ann2_unmatched = ann2_unmatched[ann2_unmatched.GVKEY.isin(ann2_todelete)==False]

print(f"There are {ann2_unmatched_error.shape[0]} firms unmatched that also appear in the matched list due to lost of data because of listing gaps,\n"
     "this is the process that we lose some samples. \n")
print(f"The number of matched stocks is {len(ann2_matched.GVKEY.unique())}")
print(f"The number of unmatched stocks is {len(ann2_unmatched.GVKEY.unique())}")
ann2_matched.sample(3)

There are 41 firms unmatched that also appear in the matched list due to lost of data because of listing gaps,
this is the process that we lose some samples. 

The number of matched stocks is 4249
The number of unmatched stocks is 16230


Unnamed: 0,FYEAR,GVKEY,DATADATE,CONM,SEDOL,CUSIP,ann_p0,ann_p1,ann_p2,ann_m1,...,vo_m117,price_m118,mv_m118,vo_m118,price_m119,mv_m119,vo_m119,price_m120,mv_m120,vo_m120
54542,2009.0,221646,2009-12-31,MIQUEL Y COSTAS & MIQUEL SA,4593067,,2010-02-23,2010-02-24,2010-02-25,2010-02-22,...,57.3,3.025,150.84,32.8,3.0093,150.06,98.2,3.0172,150.45,154.3
77957,2003.0,257971,2003-12-31,GESCARTAO SGPS SA,7623284,,2004-08-30,2004-08-31,2004-09-01,2004-08-27,...,58.1,9.46,189.06,29.1,9.2,183.86,3.1,9.05,180.86,18.8
54395,1999.0,221577,1999-12-31,INTERAMERICANA ENTRTENMIENTO,2224347,,2000-03-09,2000-03-10,2000-03-13,2000-03-08,...,2264.3,23.33,3934.52,68.3,23.09,3893.19,279.5,23.73,4000.65,621.1


In [30]:
ann2.to_pickle("ann2.pkl")

#### SEDOL List XJ03

In [31]:
# Price
sedol_xj03_P = pd.read_csv("DS query results\\DS_XJ03.csv")
sedol_xj03_P['SEDOL']=sedol_xj03_P['Code'].str[:7]
sedol_xj03_P.drop(['Name', 'Code', 'CURRENCY'], axis=1, inplace=True)

sedol_xj03_P_long = pd.melt(sedol_xj03_P, id_vars=['SEDOL'], var_name = 'date', value_name='price')
sedol_xj03_P_long = sedol_xj03_P_long[sedol_xj03_P_long.SEDOL.isnull()==False]
sedol_xj03_P_long['date'] = pd.to_datetime(sedol_xj03_P_long['date'])

# MV
sedol_xj03_MV = pd.read_csv("DS query results\\DS_XJ03_MV.csv")
sedol_xj03_MV['SEDOL']=sedol_xj03_MV['Code'].str[:7]
sedol_xj03_MV.drop(['Name', 'Code', 'CURRENCY'], axis=1, inplace=True)

sedol_xj03_MV_long = pd.melt(sedol_xj03_MV, id_vars=['SEDOL'], var_name = 'date', value_name='MV')
sedol_xj03_MV_long = sedol_xj03_MV_long[sedol_xj03_MV_long.SEDOL.isnull()==False]
sedol_xj03_MV_long['date'] = pd.to_datetime(sedol_xj03_MV_long['date'])

# Volume
sedol_xj03_VO = pd.read_csv("DS query results\\DS_XJ03_VO.csv")
sedol_xj03_VO['SEDOL']=sedol_xj03_VO['Code'].str[:7]
sedol_xj03_VO.drop(['Name', 'Code', 'CURRENCY'], axis=1, inplace=True)

sedol_xj03_VO_long = pd.melt(sedol_xj03_VO, id_vars=['SEDOL'], var_name = 'date', value_name='VO')
sedol_xj03_VO_long = sedol_xj03_VO_long[sedol_xj03_VO_long.SEDOL.isnull()==False]
sedol_xj03_VO_long['date'] = pd.to_datetime(sedol_xj03_VO_long['date'])

In [32]:
# Merge
sedol_xj03_long = pd.merge(sedol_xj03_P_long, sedol_xj03_MV_long, on=['SEDOL', 'date'], how='left')
sedol_xj03_long = pd.merge(sedol_xj03_long, sedol_xj03_VO_long, on=['SEDOL', 'date'], how='left')
sedol_xj03_long.head()

Unnamed: 0,SEDOL,date,price,MV,VO
0,6433990,1998-01-01,,,
1,5943401,1998-01-01,,,
2,2780322,1998-01-01,,,
3,6527451,1998-01-01,,,
4,6541774,1998-01-01,,,


In [33]:
# prepare the table to match
ann2_unmatched.drop(list(ann2_unmatched.filter(regex='mv')), axis=1, inplace=True)
ann2_unmatched.drop(list(ann2_unmatched.filter(regex='price')), axis=1, inplace=True)
ann2_unmatched.drop(list(ann2_unmatched.filter(regex='vo')), axis=1, inplace=True)
ann2_unmatched.drop(list(ann2_unmatched.filter(regex='index')), axis=1, inplace=True)
ann2_unmatched.drop(list(ann2_unmatched.filter(regex='currency')), axis=1, inplace=True)
ann2_unmatched.head()
len(ann2_unmatched.GVKEY.unique())

# use for loop to merge to the sample
cols = ann.columns[6:]
ann3 = ann2_unmatched.copy()
for i in cols:
    ann3 = pd.merge(ann3, sedol_xj03_long, left_on=['SEDOL', i], right_on=['SEDOL', 'date'], how='left')
    ann3.rename(columns={'price':f"price_{i[4:]}", 'MV':f"mv_{i[4:]}", 'VO':f"vo_{i[4:]}"}, inplace=True)
    ann3.drop(['date'], axis=1, inplace=True)

In [8]:
# Print out the matching results information
# ann3 = pd.read_pickle("ann3.pkl")
ann3_matched = ann3[(ann3.price_p0.isnull()==False) | 
                    (ann3.price_p1.isnull()==False) | 
                    (ann3.price_m1.isnull()==False)]
ann3_unmatched = ann3[(ann3.price_p0.isnull()) & 
                      (ann3.price_p1.isnull()) & 
                      (ann3.price_m1.isnull()) ]

ann3_matched_gvkeys = list(ann3_matched.GVKEY.unique())
ann3_unmatched_error = ann3_unmatched[ann3_unmatched.GVKEY.isin(ann3_matched_gvkeys)]
ann3_todelete = list(ann3_unmatched_error.GVKEY.unique())
ann3_unmatched = ann3_unmatched[ann3_unmatched.GVKEY.isin(ann3_todelete)==False]

print(f"There are {ann3_unmatched_error.shape[0]} firms unmatched that also appear in the matched list due to lost of data because of listing gaps,\n"
     "this is the process that we lose some samples. \n")
print(f"The number of matched stocks is {len(ann3_matched.GVKEY.unique())}")
print(f"The number of unmatched stocks is {len(ann3_unmatched.GVKEY.unique())}")
ann3_matched.sample(3)

There are 2 firms unmatched that also appear in the matched list due to lost of data because of listing gaps,
this is the process that we lose some samples. 

The number of matched stocks is 119
The number of unmatched stocks is 16111


Unnamed: 0,FYEAR,GVKEY,DATADATE,CONM,SEDOL,CUSIP,ann_p0,ann_p1,ann_p2,ann_m1,...,vo_m117,price_m118,mv_m118,vo_m118,price_m119,mv_m119,vo_m119,price_m120,mv_m120,vo_m120
68812,2019.0,286724,2019-12-31,SYSTEMS TECHNOLOGY INC,6553014,,2020-03-19,2020-03-20,2020-03-23,2020-03-18,...,192.3,17100.0,270692.9,130.8,17500.0,277024.9,163.5,18100.0,286522.9,
89900,2020.0,326410,2020-12-31,1&1 AG,5734672,,2021-02-15,2021-02-16,2021-02-17,2021-02-12,...,0.8,23.84,4214.07,16.7,23.68,4185.78,0.3,23.92,4228.21,0.4
65826,2015.0,282694,2015-12-31,OKOMU OIL PALM CO PLC,6230715,,2016-03-31,2016-04-01,2016-04-04,2016-03-30,...,,35.63,33987.81,,35.63,33987.81,7483.4,35.72,34073.66,2785.0


In [35]:
ann3.to_pickle("ann3.pkl")

#### SEDOL List XJ04

In [19]:
# Price
sedol_xj04_P = pd.read_csv("DS query results\\DS_XJ04.csv")
sedol_xj04_ref = sedol_xj04_P[['Code2', 'SEDOL']]
sedol_xj04_P.drop(['Name', 'Code', 'Code2', 'CURRENCY'], axis=1, inplace=True)

sedol_xj04_P_long = pd.melt(sedol_xj04_P, id_vars=['SEDOL'], var_name = 'date', value_name='price')
sedol_xj04_P_long = sedol_xj04_P_long[sedol_xj04_P_long.SEDOL.isnull()==False]
sedol_xj04_P_long['date'] = pd.to_datetime(sedol_xj04_P_long['date'])

# # MV
sedol_xj04_MV = pd.read_csv("DS query results\\DS_XJ04_MV.csv")
sedol_xj04_MV = sedol_xj04_MV[sedol_xj04_MV['Code'].isnull()==False]
sedol_xj04_MV['text_location'] = sedol_xj04_MV['Code'].str.find("(")
sedol_xj04_MV['Code2'] = sedol_xj04_MV.apply(lambda x: x['Code'][:x['text_location']], axis=1)
sedol_xj04_MV = pd.merge(sedol_xj04_MV, sedol_xj04_ref, on=['Code2'], how='left')
sedol_xj04_MV.drop(['Name', 'Code', 'Code2', 'text_location', 'CURRENCY'], axis=1, inplace=True)

sedol_xj04_MV_long = pd.melt(sedol_xj04_MV, id_vars=['SEDOL'], var_name = 'date', value_name='MV')
sedol_xj04_MV_long = sedol_xj04_MV_long[sedol_xj04_MV_long.SEDOL.isnull()==False]
sedol_xj04_MV_long['date'] = pd.to_datetime(sedol_xj04_MV_long['date'])

# # Volume
sedol_xj04_VO = pd.read_csv("DS query results\\DS_XJ04_VO.csv")
sedol_xj04_VO = sedol_xj04_VO[sedol_xj04_VO['Code'].isnull()==False]
sedol_xj04_VO['text_location'] = sedol_xj04_VO['Code'].str.find("(").astype(int)
sedol_xj04_VO['Code2'] = sedol_xj04_VO.apply(lambda x: x['Code'][:x['text_location']], axis=1)
sedol_xj04_VO = pd.merge(sedol_xj04_VO, sedol_xj04_ref, on=['Code2'], how='left')
sedol_xj04_VO.drop(['Name', 'Code', 'Code2', 'text_location', 'CURRENCY'], axis=1, inplace=True)

sedol_xj04_VO_long = pd.melt(sedol_xj04_VO, id_vars=['SEDOL'], var_name = 'date', value_name='VO')
sedol_xj04_VO_long = sedol_xj04_VO_long[sedol_xj04_VO_long.SEDOL.isnull()==False]
sedol_xj04_VO_long['date'] = pd.to_datetime(sedol_xj04_VO_long['date'])
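
For this list, the market value and volume cells above keep only the part of `Code` that precedes the first `(` before looking up the SEDOL. A vectorized variant is sketched below; the example strings are made up to resemble that pattern and are not taken from the files. Unlike slicing with the result of `str.find`, which returns -1 and therefore cuts off the last character when no parenthesis is present, splitting leaves such values unchanged.

```python
import pandas as pd

# Invented codes in the style "<code>(<datatype>)".
codes = pd.Series(["905113(MV)", "131745(VO)", "702912"])

# Keep the part before the first "(".
prefix = codes.str.split("(", n=1).str[0]
print(prefix.tolist())

# For comparison: slicing with str.find drops a character when "(" is absent.
pos = codes.str.find("(")
sliced = [c[:p] for c, p in zip(codes, pos)]
print(sliced)
```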

In [23]:
# Merge
sedol_xj04_long = pd.merge(sedol_xj04_P_long, sedol_xj04_MV_long, on=['SEDOL', 'date'], how='left')
sedol_xj04_long = pd.merge(sedol_xj04_long, sedol_xj04_VO_long, on=['SEDOL', 'date'], how='left')
sedol_xj04_long.head()

Unnamed: 0,SEDOL,date,price,MV,VO
0,BD0R0N4,1998-01-01,1.2,174.87,
1,BDGN274,1998-01-01,880.0,95681.44,
2,B28ZPV6,1998-01-01,6.51,3.81,
3,B10RZP7,1998-01-01,1149.88,17005.43,
4,B11HK39,1998-01-01,19.52,82351.69,


In [25]:
# prepare the table to match
ann3_unmatched.drop(list(ann3_unmatched.filter(regex='mv')), axis=1, inplace=True)
ann3_unmatched.drop(list(ann3_unmatched.filter(regex='price')), axis=1, inplace=True)
ann3_unmatched.drop(list(ann3_unmatched.filter(regex='vo')), axis=1, inplace=True)
ann3_unmatched.drop(list(ann3_unmatched.filter(regex='index')), axis=1, inplace=True)
ann3_unmatched.drop(list(ann3_unmatched.filter(regex='currency')), axis=1, inplace=True)
ann3_unmatched.head()
len(ann3_unmatched.GVKEY.unique())

# use for loop to merge to the sample
cols = ann.columns[6:]
ann4 = ann3_unmatched.copy()
for i in cols:
    ann4 = pd.merge(ann4, sedol_xj04_long, left_on=['SEDOL', i], right_on=['SEDOL', 'date'], how='left')
    ann4.rename(columns={'price':f"price_{i[4:]}", 'MV':f"mv_{i[4:]}", 'VO':f"vo_{i[4:]}"}, inplace=True)
    ann4.drop(['date'], axis=1, inplace=True)

In [26]:
# Print out the matching results information
# ann4 = pd.read_pickle("ann4.pkl")
ann4_matched = ann4[(ann4.price_p0.isnull()==False) | 
                    (ann4.price_p1.isnull()==False) | 
                    (ann4.price_m1.isnull()==False)]
ann4_unmatched = ann4[(ann4.price_p0.isnull()) & 
                      (ann4.price_p1.isnull()) & 
                      (ann4.price_m1.isnull()) ]

ann4_matched_gvkeys = list(ann4_matched.GVKEY.unique())
ann4_unmatched_error = ann4_unmatched[ann4_unmatched.GVKEY.isin(ann4_matched_gvkeys)]
ann4_todelete = list(ann4_unmatched_error.GVKEY.unique())
ann4_unmatched = ann4_unmatched[ann4_unmatched.GVKEY.isin(ann4_todelete)==False]

print(f"There are {ann4_unmatched_error.shape[0]} firms unmatched that also appear in the matched list due to lost of data because of listing gaps,\n"
     "this is the process that we lose some samples. \n")
print(f"The number of matched stocks is {len(ann4_matched.GVKEY.unique())}")
print(f"The number of unmatched stocks is {len(ann4_unmatched.GVKEY.unique())}")
ann4_matched.sample(3)

There are 30 firms unmatched that also appear in the matched list due to lost of data because of listing gaps,
this is the process that we lose some samples. 

The number of matched stocks is 977
The number of unmatched stocks is 15134


Unnamed: 0,FYEAR,GVKEY,DATADATE,CONM,SEDOL,CUSIP,ann_p0,ann_p1,ann_p2,ann_m1,...,vo_m117,price_m118,mv_m118,vo_m118,price_m119,mv_m119,vo_m119,price_m120,mv_m120,vo_m120
53401,2018.0,255940,2018-12-31,FULLSHARE HOLDINGS LTD,BSVXB88,,2019-03-31,2019-04-01,2019-04-02,2019-03-29,...,13652.0,3.13,61751.91,,3.13,61751.91,13447.5,3.2,63132.96,8602.2
55366,2009.0,270727,2009-12-31,JIANGXI HONGCHENG ENVIRONMEN,B01BHN3,,2010-03-22,2010-03-23,2010-03-24,2010-03-19,...,,3.47,1310.4,,3.47,1310.4,,3.47,1310.4,
39826,2011.0,142540,2011-12-31,CNOOC LTD,B00G0S5,,2012-03-28,2012-03-29,2012-03-30,2012-03-27,...,85242.8,13.28,593206.6,87816.5,13.92,621794.8,110210.0,13.72,612861.0,80225.3


In [27]:
ann4.to_pickle("ann4.pkl")

#### SEDOL List XJ05

In [28]:
# Price
sedol_xj05_P = pd.read_csv("DS query results\\DS_XJ05.csv")
sedol_xj05_ref = sedol_xj05_P[['Code', 'SEDOL']]
sedol_xj05_P.drop(['Name', 'Code', 'CURRENCY'], axis=1, inplace=True)

sedol_xj05_P_long = pd.melt(sedol_xj05_P, id_vars=['SEDOL'], var_name = 'date', value_name='price')
sedol_xj05_P_long = sedol_xj05_P_long[sedol_xj05_P_long.SEDOL.isnull()==False]
sedol_xj05_P_long['date'] = pd.to_datetime(sedol_xj05_P_long['date'])

# # MV
sedol_xj05_MV = pd.read_csv("DS query results\\DS_XJ05_MV.csv")
sedol_xj05_MV = sedol_xj05_MV[sedol_xj05_MV['Code'].isnull()==False]
sedol_xj05_MV['text_location'] = sedol_xj05_MV['Code'].str.find("(")
sedol_xj05_MV['Code'] = sedol_xj05_MV.apply(lambda x: x['Code'][:x['text_location']], axis=1)
sedol_xj05_MV = pd.merge(sedol_xj05_MV, sedol_xj05_ref, on=['Code'], how='left')
sedol_xj05_MV.drop(['Name', 'Code', 'text_location', 'CURRENCY'], axis=1, inplace=True)

sedol_xj05_MV_long = pd.melt(sedol_xj05_MV, id_vars=['SEDOL'], var_name = 'date', value_name='MV')
sedol_xj05_MV_long = sedol_xj05_MV_long[sedol_xj05_MV_long.SEDOL.isnull()==False]
sedol_xj05_MV_long['date'] = pd.to_datetime(sedol_xj05_MV_long['date'])

# # Volume
sedol_xj05_VO = pd.read_csv("DS query results\\DS_XJ05_VO.csv")
sedol_xj05_VO = sedol_xj05_VO[sedol_xj05_VO['Code'].isnull()==False]
sedol_xj05_VO['text_location'] = sedol_xj05_VO['Code'].str.find("(").astype(int)
sedol_xj05_VO['Code'] = sedol_xj05_VO.apply(lambda x: x['Code'][:x['text_location']], axis=1)
sedol_xj05_VO = pd.merge(sedol_xj05_VO, sedol_xj05_ref, on=['Code'], how='left')
sedol_xj05_VO.drop(['Name', 'Code', 'text_location', 'CURRENCY'], axis=1, inplace=True)

sedol_xj05_VO_long = pd.melt(sedol_xj05_VO, id_vars=['SEDOL'], var_name = 'date', value_name='VO')
sedol_xj05_VO_long = sedol_xj05_VO_long[sedol_xj05_VO_long.SEDOL.isnull()==False]
sedol_xj05_VO_long['date'] = pd.to_datetime(sedol_xj05_VO_long['date'])

In [29]:
# Merge
sedol_xj05_long = pd.merge(sedol_xj05_P_long, sedol_xj05_MV_long, on=['SEDOL', 'date'], how='left')
sedol_xj05_long = pd.merge(sedol_xj05_long, sedol_xj05_VO_long, on=['SEDOL', 'date'], how='left')
sedol_xj05_long.head()

Unnamed: 0,SEDOL,date,price,MV,VO
0,BG0SSL2,1998-01-01,,,
1,B9276C5,1998-01-01,,,
2,B0K2PB1,1998-01-01,,,
3,B0D01C5,1998-01-01,,,
4,B06N217,1998-01-01,,,


In [31]:
# prepare the table to match
ann4_unmatched.drop(list(ann4_unmatched.filter(regex='mv')), axis=1, inplace=True)
ann4_unmatched.drop(list(ann4_unmatched.filter(regex='price')), axis=1, inplace=True)
ann4_unmatched.drop(list(ann4_unmatched.filter(regex='vo')), axis=1, inplace=True)
ann4_unmatched.drop(list(ann4_unmatched.filter(regex='index')), axis=1, inplace=True)
ann4_unmatched.drop(list(ann4_unmatched.filter(regex='currency')), axis=1, inplace=True)
ann4_unmatched.head()
len(ann4_unmatched.GVKEY.unique())

# use for loop to merge to the sample
cols = ann.columns[6:]
ann5 = ann4_unmatched.copy()
for i in cols:
    ann5 = pd.merge(ann5, sedol_xj05_long, left_on=['SEDOL', i], right_on=['SEDOL', 'date'], how='left')
    ann5.rename(columns={'price':f"price_{i[4:]}", 'MV':f"mv_{i[4:]}", 'VO':f"vo_{i[4:]}"}, inplace=True)
    ann5.drop(['date'], axis=1, inplace=True)

In [32]:
# Print out the matching results information
# ann5 = pd.read_pickle("ann5.pkl")
ann5_matched = ann5[(ann5.price_p0.isnull()==False) | 
                    (ann5.price_p1.isnull()==False) | 
                    (ann5.price_m1.isnull()==False)]
ann5_unmatched = ann5[(ann5.price_p0.isnull()) & 
                      (ann5.price_p1.isnull()) & 
                      (ann5.price_m1.isnull()) ]

ann5_matched_gvkeys = list(ann5_matched.GVKEY.unique())
ann5_unmatched_error = ann5_unmatched[ann5_unmatched.GVKEY.isin(ann5_matched_gvkeys)]
ann5_todelete = list(ann5_unmatched_error.GVKEY.unique())
ann5_unmatched = ann5_unmatched[ann5_unmatched.GVKEY.isin(ann5_todelete)==False]

print(f"There are {ann5_unmatched_error.shape[0]} firms unmatched that also appear in the matched list due to lost of data because of listing gaps,\n"
     "this is the process that we lose some samples. \n")
print(f"The number of matched stocks is {len(ann5_matched.GVKEY.unique())}")
print(f"The number of unmatched stocks is {len(ann5_unmatched.GVKEY.unique())}")
ann5_matched.sample(3)

There are 39 firms unmatched that also appear in the matched list due to lost of data because of listing gaps,
this is the process that we lose some samples. 

The number of matched stocks is 4001
The number of unmatched stocks is 11133


Unnamed: 0,FYEAR,GVKEY,DATADATE,CONM,SEDOL,CUSIP,ann_p0,ann_p1,ann_p2,ann_m1,...,vo_m117,price_m118,mv_m118,vo_m118,price_m119,mv_m119,vo_m119,price_m120,mv_m120,vo_m120
64249,2015.0,292789,2015-12-31,FUJIAN SUNNER DEVELOPMENT CO,B4L9T62,,2016-02-27,2016-02-29,2016-03-01,2016-02-26,...,8794.1,16.33,18141.01,7420.4,15.03,16696.84,4955.7,16.0,17774.41,6817.5
77614,2019.0,319264,2019-12-31,SCAN INTER CO LTD,BVY9M36,,2020-02-17,2020-02-18,2020-02-19,2020-02-14,...,2950.1,2.56,3072.0,637.2,2.6,3120.0,507.0,2.62,3144.0,526.4
57276,2013.0,281813,2013-12-31,UNIVERSAL BIOSENSORS INC,B1L2R55,,2014-02-12,2014-02-13,2014-02-14,2014-02-11,...,187.2,0.675,117.8,36.0,0.68,118.67,31.2,0.69,120.42,


In [33]:
ann5.to_pickle("ann5.pkl")

#### SEDOL List XJ06

In [34]:
# Price
sedol_xj06_P = pd.read_csv("DS query results\\DS_XJ06.csv")
sedol_xj06_ref = sedol_xj06_P[['Code', 'SEDOL']]
sedol_xj06_P.drop(['Name', 'Code', 'CURRENCY'], axis=1, inplace=True)

sedol_xj06_P_long = pd.melt(sedol_xj06_P, id_vars=['SEDOL'], var_name = 'date', value_name='price')
sedol_xj06_P_long = sedol_xj06_P_long[sedol_xj06_P_long.SEDOL.isnull()==False]
sedol_xj06_P_long['date'] = pd.to_datetime(sedol_xj06_P_long['date'])

# # MV
sedol_xj06_MV = pd.read_csv("DS query results\\DS_XJ06_MV.csv")
sedol_xj06_MV = sedol_xj06_MV[sedol_xj06_MV['Code'].isnull()==False]
sedol_xj06_MV['text_location'] = sedol_xj06_MV['Code'].str.find("(")
sedol_xj06_MV['Code'] = sedol_xj06_MV.apply(lambda x: x['Code'][:x['text_location']], axis=1)
sedol_xj06_MV = pd.merge(sedol_xj06_MV, sedol_xj06_ref, on=['Code'], how='left')
sedol_xj06_MV.drop(['Name', 'Code', 'text_location', 'CURRENCY'], axis=1, inplace=True)

sedol_xj06_MV_long = pd.melt(sedol_xj06_MV, id_vars=['SEDOL'], var_name = 'date', value_name='MV')
sedol_xj06_MV_long = sedol_xj06_MV_long[sedol_xj06_MV_long.SEDOL.isnull()==False]
sedol_xj06_MV_long['date'] = pd.to_datetime(sedol_xj06_MV_long['date'])

# # Volume
sedol_xj06_VO = pd.read_csv("DS query results\\DS_XJ06_VO.csv")
sedol_xj06_VO = sedol_xj06_VO[sedol_xj06_VO['Code'].isnull()==False]
sedol_xj06_VO['text_location'] = sedol_xj06_VO['Code'].str.find("(").astype(int)
sedol_xj06_VO['Code'] = sedol_xj06_VO.apply(lambda x: x['Code'][:x['text_location']], axis=1)
sedol_xj06_VO = pd.merge(sedol_xj06_VO, sedol_xj06_ref, on=['Code'], how='left')
sedol_xj06_VO.drop(['Name', 'Code', 'text_location', 'CURRENCY'], axis=1, inplace=True)

sedol_xj06_VO_long = pd.melt(sedol_xj06_VO, id_vars=['SEDOL'], var_name = 'date', value_name='VO')
sedol_xj06_VO_long = sedol_xj06_VO_long[sedol_xj06_VO_long.SEDOL.isnull()==False]
sedol_xj06_VO_long['date'] = pd.to_datetime(sedol_xj06_VO_long['date'])

In [35]:
# Merge
sedol_xj06_long = pd.merge(sedol_xj06_P_long, sedol_xj06_MV_long, on=['SEDOL', 'date'], how='left')
sedol_xj06_long = pd.merge(sedol_xj06_long, sedol_xj06_VO_long, on=['SEDOL', 'date'], how='left')
sedol_xj06_long.head()

Unnamed: 0,SEDOL,date,price,MV,VO
0,3188044,1998-01-01,,,
1,5719981,1998-01-01,,,
2,6013972,1998-01-01,583.7539,603.84,
3,6080523,1998-01-01,,,
4,6107381,1998-01-01,10.4127,10.73,


In [36]:
# prepare the table to match
ann5_unmatched.drop(list(ann5_unmatched.filter(regex='mv')), axis=1, inplace=True)
ann5_unmatched.drop(list(ann5_unmatched.filter(regex='price')), axis=1, inplace=True)
ann5_unmatched.drop(list(ann5_unmatched.filter(regex='vo')), axis=1, inplace=True)
ann5_unmatched.drop(list(ann5_unmatched.filter(regex='index')), axis=1, inplace=True)
ann5_unmatched.drop(list(ann5_unmatched.filter(regex='currency')), axis=1, inplace=True)
ann5_unmatched.head()
len(ann5_unmatched.GVKEY.unique())

# use for loop to merge to the sample
cols = ann.columns[6:]
ann6 = ann5_unmatched.copy()
for i in cols:
    ann6 = pd.merge(ann6, sedol_xj06_long, left_on=['SEDOL', i], right_on=['SEDOL', 'date'], how='left')
    ann6.rename(columns={'price':f"price_{i[4:]}", 'MV':f"mv_{i[4:]}", 'VO':f"vo_{i[4:]}"}, inplace=True)
    ann6.drop(['date'], axis=1, inplace=True)
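
The loop above runs one merge per offset column, which is the slow part noted in Step 3. A possible alternative, sketched here on toy data, is to stack the offset dates into long form, merge once, and pivot back to wide. The `row_id` column and all values are hypothetical, and the sketch assumes each (SEDOL, date) pair appears only once in the price table:

```python
import pandas as pd

# Toy sample: two announcements, each with two offset-date columns.
sample = pd.DataFrame({
    "SEDOL": ["A", "B"],
    "ann_p0": pd.to_datetime(["2020-01-02", "2020-01-03"]),
    "ann_m1": pd.to_datetime(["2020-01-01", "2020-01-02"]),
})
prices = pd.DataFrame({
    "SEDOL": ["A", "A", "B"],
    "date": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"]),
    "price": [10.0, 11.0, 20.0],
})

sample["row_id"] = range(len(sample))  # hypothetical key to reassemble rows later
date_cols = ["ann_p0", "ann_m1"]

# Stack every offset date into one column, then merge with the prices once.
stacked = sample.melt(id_vars=["row_id", "SEDOL"], value_vars=date_cols,
                      var_name="offset", value_name="date")
stacked = stacked.merge(prices, on=["SEDOL", "date"], how="left")
stacked["offset"] = stacked["offset"].str[4:]  # 'ann_p0' -> 'p0'

# Pivot back to one price column per offset and attach to the sample.
wide = stacked.pivot(index="row_id", columns="offset", values="price").add_prefix("price_")
result = sample.join(wide, on="row_id")
print(result)
```

MV and volume could be carried through the same merge and pivoted in the same way.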

In [37]:
# Print out the matching results information
# ann6 = pd.read_pickle("ann6.pkl")
ann6_matched = ann6[(ann6.price_p0.isnull()==False) | 
                    (ann6.price_p1.isnull()==False) | 
                    (ann6.price_m1.isnull()==False)]
ann6_unmatched = ann6[(ann6.price_p0.isnull()) & 
                      (ann6.price_p1.isnull()) & 
                      (ann6.price_m1.isnull()) ]

ann6_matched_gvkeys = list(ann6_matched.GVKEY.unique())
ann6_unmatched_error = ann6_unmatched[ann6_unmatched.GVKEY.isin(ann6_matched_gvkeys)]
ann6_todelete = list(ann6_unmatched_error.GVKEY.unique())
ann6_unmatched = ann6_unmatched[ann6_unmatched.GVKEY.isin(ann6_todelete)==False]

print(f"There are {ann6_unmatched_error.shape[0]} unmatched observations whose firms also appear in the matched list,\n"
      "because of data lost to listing gaps; this is the step where we lose some samples.\n")
print(f"The number of matched stocks is {len(ann6_matched.GVKEY.unique())}")
print(f"The number of unmatched stocks is {len(ann6_unmatched.GVKEY.unique())}")
ann6_matched.sample(3)

There are 71 unmatched observations whose firms also appear in the matched list,
because of data lost to listing gaps; this is the step where we lose some samples.

The number of matched stocks is 2880
The number of unmatched stocks is 8253


Unnamed: 0,FYEAR,GVKEY,DATADATE,CONM,SEDOL,CUSIP,ann_p0,ann_p1,ann_p2,ann_m1,...,vo_m117,price_m118,mv_m118,vo_m118,price_m119,mv_m119,vo_m119,price_m120,mv_m120,vo_m120
58137,2011.0,296331,2011-06-30,DELTA SBD LIMITED,B55L2T5,,2011-08-22,2011-08-23,2011-08-24,2011-08-19,...,31.3,0.86,37.49,14.0,0.86,37.28,19.4,0.88,37.93,
32320,2013.0,104862,2013-12-31,USG PEOPLE NV,B1FRPV8,,2014-02-28,2014-03-03,2014-03-04,2014-02-27,...,622.9,6.3191,515.09,551.6,6.2579,510.1,654.1,6.2845,512.28,495.3
48268,2013.0,211545,2013-06-30,AUSTRALIAN VINTAGE LTD,6130677,,2013-08-22,2013-08-23,2013-08-26,2013-08-21,...,196.4,0.4389,59.64,9.1,0.4389,59.64,9.5,0.4292,58.31,21.4
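
The matched/unmatched split is repeated for every list. A minimal sketch of the same logic as a reusable function follows. The name `split_matched` and the toy data are hypothetical. The three returned frames correspond to `annN_matched`, the filtered `annN_unmatched`, and `annN_unmatched_error`:

```python
import numpy as np
import pandas as pd

def split_matched(df, price_cols=("price_p0", "price_p1", "price_m1"), firm_col="GVKEY"):
    """Split rows into matched, still-unmatched, and dropped (firm matched elsewhere)."""
    has_price = df[list(price_cols)].notna().any(axis=1)
    matched = df[has_price]
    unmatched_all = df[~has_price]
    # Unmatched rows whose firm already has at least one matched row are set aside.
    partial = unmatched_all[firm_col].isin(matched[firm_col])
    return matched, unmatched_all[~partial], unmatched_all[partial]

toy = pd.DataFrame({
    "GVKEY": ["1", "1", "2", "3"],
    "price_p0": [5.0, np.nan, np.nan, 7.0],
    "price_p1": [5.1, np.nan, np.nan, np.nan],
    "price_m1": [4.9, np.nan, np.nan, np.nan],
})
matched, unmatched, dropped = split_matched(toy)
print(len(matched), len(unmatched), len(dropped))
```

Only the firms in `unmatched` move on to the next identifier list. The rows in `dropped` are the samples lost to listing gaps.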


In [38]:
ann6.to_pickle('ann6.pkl')

#### SEDOL List XJ07

In [39]:
# Price
sedol_xj07_P = pd.read_csv("DS query results\\DS_XJ07.csv")
sedol_xj07_ref = sedol_xj07_P[['Code', 'SEDOL']]
sedol_xj07_P.drop(['Name', 'Code', 'CURRENCY'], axis=1, inplace=True)

sedol_xj07_P_long = pd.melt(sedol_xj07_P, id_vars=['SEDOL'], var_name = 'date', value_name='price')
sedol_xj07_P_long = sedol_xj07_P_long[sedol_xj07_P_long.SEDOL.isnull()==False]
sedol_xj07_P_long['date'] = pd.to_datetime(sedol_xj07_P_long['date'])

# MV
sedol_xj07_MV = pd.read_csv("DS query results\\DS_XJ07_MV.csv")
sedol_xj07_MV = sedol_xj07_MV[sedol_xj07_MV['Code'].isnull()==False]
sedol_xj07_MV['text_location'] = sedol_xj07_MV['Code'].str.find("(")
sedol_xj07_MV['Code'] = sedol_xj07_MV.apply(lambda x: x['Code'][:x['text_location']], axis=1)
sedol_xj07_MV = pd.merge(sedol_xj07_MV, sedol_xj07_ref, on=['Code'], how='left')
sedol_xj07_MV.drop(['Name', 'Code', 'text_location', 'CURRENCY'], axis=1, inplace=True)

sedol_xj07_MV_long = pd.melt(sedol_xj07_MV, id_vars=['SEDOL'], var_name = 'date', value_name='MV')
sedol_xj07_MV_long = sedol_xj07_MV_long[sedol_xj07_MV_long.SEDOL.isnull()==False]
sedol_xj07_MV_long['date'] = pd.to_datetime(sedol_xj07_MV_long['date'])

# Volume
sedol_xj07_VO = pd.read_csv("DS query results\\DS_XJ07_VO.csv")
sedol_xj07_VO = sedol_xj07_VO[sedol_xj07_VO['Code'].isnull()==False]
sedol_xj07_VO['text_location'] = sedol_xj07_VO['Code'].str.find("(").astype(int)
sedol_xj07_VO['Code'] = sedol_xj07_VO.apply(lambda x: x['Code'][:x['text_location']], axis=1)
sedol_xj07_VO = pd.merge(sedol_xj07_VO, sedol_xj07_ref, on=['Code'], how='left')
sedol_xj07_VO.drop(['Name', 'Code', 'text_location', 'CURRENCY'], axis=1, inplace=True)

sedol_xj07_VO_long = pd.melt(sedol_xj07_VO, id_vars=['SEDOL'], var_name = 'date', value_name='VO')
sedol_xj07_VO_long = sedol_xj07_VO_long[sedol_xj07_VO_long.SEDOL.isnull()==False]
sedol_xj07_VO_long['date'] = pd.to_datetime(sedol_xj07_VO_long['date'])

In [40]:
# Merge
sedol_xj07_long = pd.merge(sedol_xj07_P_long, sedol_xj07_MV_long, on=['SEDOL', 'date'], how='left')
sedol_xj07_long = pd.merge(sedol_xj07_long, sedol_xj07_VO_long, on=['SEDOL', 'date'], how='left')
sedol_xj07_long.head()

Unnamed: 0,SEDOL,date,price,MV,VO
0,B29MWZ9,1998-01-01,1894.39,11258.45,
1,BG0SSL2,1998-01-01,,,
2,B3KHXB3,1998-01-01,47.21,111.42,
3,B28TMS4,1998-01-01,144.37,187.94,
4,B032D70,1998-01-01,,,


In [41]:
# prepare the table to match
ann6_unmatched.drop(list(ann6_unmatched.filter(regex='mv')), axis=1, inplace=True)
ann6_unmatched.drop(list(ann6_unmatched.filter(regex='price')), axis=1, inplace=True)
ann6_unmatched.drop(list(ann6_unmatched.filter(regex='vo')), axis=1, inplace=True)
ann6_unmatched.drop(list(ann6_unmatched.filter(regex='index')), axis=1, inplace=True)
ann6_unmatched.drop(list(ann6_unmatched.filter(regex='currency')), axis=1, inplace=True)
ann6_unmatched.head()
len(ann6_unmatched.GVKEY.unique())

# use for loop to merge to the sample
cols = ann.columns[6:]
ann7 = ann6_unmatched.copy()
for i in cols:
    ann7 = pd.merge(ann7, sedol_xj07_long, left_on=['SEDOL', i], right_on=['SEDOL', 'date'], how='left')
    ann7.rename(columns={'price':f"price_{i[4:]}", 'MV':f"mv_{i[4:]}", 'VO':f"vo_{i[4:]}"}, inplace=True)
    ann7.drop(['date'], axis=1, inplace=True)

In [42]:
# Print out the matching results information
# ann7 = pd.read_pickle("ann7.pkl")
ann7_matched = ann7[(ann7.price_p0.isnull()==False) | 
                    (ann7.price_p1.isnull()==False) | 
                    (ann7.price_m1.isnull()==False)]
ann7_unmatched = ann7[(ann7.price_p0.isnull()) & 
                      (ann7.price_p1.isnull()) & 
                      (ann7.price_m1.isnull()) ]

ann7_matched_gvkeys = list(ann7_matched.GVKEY.unique())
ann7_unmatched_error = ann7_unmatched[ann7_unmatched.GVKEY.isin(ann7_matched_gvkeys)]
ann7_todelete = list(ann7_unmatched_error.GVKEY.unique())
ann7_unmatched = ann7_unmatched[ann7_unmatched.GVKEY.isin(ann7_todelete)==False]

print(f"There are {ann7_unmatched_error.shape[0]} unmatched observations whose firms also appear in the matched list,\n"
      "because of data lost to listing gaps; this is the step where we lose some samples.\n")
print(f"The number of matched stocks is {len(ann7_matched.GVKEY.unique())}")
print(f"The number of unmatched stocks is {len(ann7_unmatched.GVKEY.unique())}")
ann7_matched.sample(3)

There are 3 unmatched observations whose firms also appear in the matched list,
because of data lost to listing gaps; this is the step where we lose some samples.

The number of matched stocks is 114
The number of unmatched stocks is 8139


Unnamed: 0,FYEAR,GVKEY,DATADATE,CONM,SEDOL,CUSIP,ann_p0,ann_p1,ann_p2,ann_m1,...,vo_m117,price_m118,mv_m118,vo_m118,price_m119,mv_m119,vo_m119,price_m120,mv_m120,vo_m120
47858,2009.0,284307,2009-06-30,THE MAC SERVICES GROUP LTD,B1VQDR1,,2009-08-17,2009-08-18,2009-08-19,2009-08-14,...,327.6,0.8,115.79,184.6,0.79,115.07,58.7,0.78,112.89,203.7
31631,2012.0,100277,2013-04-30,ANITE PLC,B3KHXB3,,2013-07-02,2013-07-03,2013-07-04,2013-07-01,...,347.1,137.3,411.07,298.6,136.9,409.87,297.3,135.0,404.18,781.9
47436,2006.0,272306,2006-12-31,ARMORGROUP INTERNATIONAL,B049FG3,,2007-03-20,2007-03-21,2007-03-22,2007-03-19,...,19.0,54.25,28.79,0.2,54.75,29.06,114.1,54.5,28.93,10.3


In [43]:
ann7.to_pickle("ann7.pkl")

#### CUSIP List XJ08

In [51]:
cusiplist_xj08 = pd.read_csv("DS_lookup_tables\\Cusip List 2.csv")
cusiplist_xj08['CUSIP'] = cusiplist_xj08['CUSIP'].str[1:]

# Price
cusip_xj08_P = pd.read_csv("DS query results\\DS_XJ08.csv")
cusip_xj08_P = pd.merge(cusip_xj08_P, cusiplist_xj08, left_on=['Code'], right_on=['Check'], how='left')
cusip_xj08_P.drop(['Name', 'Code', 'SEDOL', 'Check','CURRENCY'], axis=1, inplace=True)

cusip_xj08_P_long = pd.melt(cusip_xj08_P, id_vars=['CUSIP'], var_name = 'date', value_name='price')
cusip_xj08_P_long = cusip_xj08_P_long[cusip_xj08_P_long.CUSIP.isnull()==False]
cusip_xj08_P_long['date'] = pd.to_datetime(cusip_xj08_P_long['date'])

# MV
cusip_xj08_MV = pd.read_csv("DS query results\\DS_XJ08_MV.csv")
cusip_xj08_MV = cusip_xj08_MV[cusip_xj08_MV['Code'].isnull()==False]
cusip_xj08_MV['text_location'] = cusip_xj08_MV['Code'].str.find("(")
cusip_xj08_MV['Code'] = cusip_xj08_MV.apply(lambda x: x['Code'][:x['text_location']], axis=1)
cusip_xj08_MV = pd.merge(cusip_xj08_MV, cusiplist_xj08, left_on=['Code'], right_on=['Check'], how='left')
cusip_xj08_MV.drop(['Name', 'Code', 'Check', 'text_location', 'CURRENCY'], axis=1, inplace=True)

cusip_xj08_MV_long = pd.melt(cusip_xj08_MV, id_vars=['CUSIP'], var_name = 'date', value_name='MV')
cusip_xj08_MV_long = cusip_xj08_MV_long[cusip_xj08_MV_long.CUSIP.isnull()==False]
cusip_xj08_MV_long['date'] = pd.to_datetime(cusip_xj08_MV_long['date'])

# Volume
cusip_xj08_VO = pd.read_csv("DS query results\\DS_XJ08_VO.csv")
cusip_xj08_VO = cusip_xj08_VO[cusip_xj08_VO['Code'].isnull()==False]
cusip_xj08_VO['text_location'] = cusip_xj08_VO['Code'].str.find("(").astype(int)
cusip_xj08_VO['Code'] = cusip_xj08_VO.apply(lambda x: x['Code'][:x['text_location']], axis=1)
cusip_xj08_VO = pd.merge(cusip_xj08_VO, cusiplist_xj08, left_on=['Code'], right_on=['Check'], how='left')
cusip_xj08_VO.drop(['Name', 'Code', 'Check', 'text_location', 'CURRENCY'], axis=1, inplace=True)

cusip_xj08_VO_long = pd.melt(cusip_xj08_VO, id_vars=['CUSIP'], var_name = 'date', value_name='VO')
cusip_xj08_VO_long = cusip_xj08_VO_long[cusip_xj08_VO_long.CUSIP.isnull()==False]
cusip_xj08_VO_long['date'] = pd.to_datetime(cusip_xj08_VO_long['date'])

In [52]:
# Merge
cusip_xj08_long = pd.merge(cusip_xj08_P_long, cusip_xj08_MV_long, on=['CUSIP', 'date'], how='left')
cusip_xj08_long = pd.merge(cusip_xj08_long, cusip_xj08_VO_long, on=['CUSIP', 'date'], how='left')
cusip_xj08_long.head()

Unnamed: 0,CUSIP,date,price,MV,VO
0,20813101,1998-01-01,21.75,302.43,
1,909914103,1998-01-01,25.31,1119.04,
2,125141101,1998-01-01,,,
3,7768104,1998-01-01,3.5,125.67,
4,2444107,1998-01-01,9.22,1625.84,


In [53]:
# prepare the table to match
ann7_unmatched.drop(list(ann7_unmatched.filter(regex='mv')), axis=1, inplace=True)
ann7_unmatched.drop(list(ann7_unmatched.filter(regex='price')), axis=1, inplace=True)
ann7_unmatched.drop(list(ann7_unmatched.filter(regex='vo')), axis=1, inplace=True)
ann7_unmatched.drop(list(ann7_unmatched.filter(regex='index')), axis=1, inplace=True)
ann7_unmatched.drop(list(ann7_unmatched.filter(regex='currency')), axis=1, inplace=True)
ann7_unmatched.head()
len(ann7_unmatched.GVKEY.unique())

# use for loop to merge to the sample
cols = ann.columns[6:]
ann8 = ann7_unmatched.copy()
for i in cols:
    ann8 = pd.merge(ann8, cusip_xj08_long, left_on=['CUSIP', i], right_on=['CUSIP', 'date'], how='left')
    ann8.rename(columns={'price':f"price_{i[4:]}", 'MV':f"mv_{i[4:]}", 'VO':f"vo_{i[4:]}"}, inplace=True)
    ann8.drop(['date'], axis=1, inplace=True)

In [54]:
# Print out the matching results information
# ann8 = pd.read_pickle("ann8.pkl")
ann8_matched = ann8[(ann8.price_p0.isnull()==False) | 
                    (ann8.price_p1.isnull()==False) | 
                    (ann8.price_m1.isnull()==False)]
ann8_unmatched = ann8[(ann8.price_p0.isnull()) & 
                      (ann8.price_p1.isnull()) & 
                      (ann8.price_m1.isnull()) ]

ann8_matched_gvkeys = list(ann8_matched.GVKEY.unique())
ann8_unmatched_error = ann8_unmatched[ann8_unmatched.GVKEY.isin(ann8_matched_gvkeys)]
ann8_todelete = list(ann8_unmatched_error.GVKEY.unique())
ann8_unmatched = ann8_unmatched[ann8_unmatched.GVKEY.isin(ann8_todelete)==False]

print(f"There are {ann8_unmatched_error.shape[0]} unmatched observations whose firms also appear in the matched list,\n"
      "because of data lost to listing gaps; this is the step where we lose some samples.\n")
print(f"The number of matched stocks is {len(ann8_matched.GVKEY.unique())}")
print(f"The number of unmatched stocks is {len(ann8_unmatched.GVKEY.unique())}")
ann8_matched.sample(3)

There are 11 unmatched observations whose firms also appear in the matched list,
because of data lost to listing gaps; this is the step where we lose some samples.

The number of matched stocks is 4300
The number of unmatched stocks is 3839


Unnamed: 0,FYEAR,GVKEY,DATADATE,CONM,SEDOL,CUSIP,ann_p0,ann_p1,ann_p2,ann_m1,...,vo_m117,price_m118,mv_m118,vo_m118,price_m119,mv_m119,vo_m119,price_m120,mv_m120,vo_m120
26339,2004.0,61706,2004-12-31,MOLECULAR DEVICES CORP,,60851C107,2005-02-10,2005-02-11,2005-02-14,2005-02-09,...,306.3,23.19,331.34,333.5,23.78,339.77,526.0,24.82,354.63,346.2
29824,1998.0,64939,1998-12-31,RWD TECHNOLOGIES INC,,74975B101,1999-01-28,1999-01-29,1999-02-01,1999-01-27,...,55.8,20.5,302.7,38.2,19.5,287.94,5.6,20.0,295.32,18.5
5467,2003.0,6116,2003-11-30,INTL SPEEDWAY CORP -CL A,,460335201,2004-01-22,2004-01-23,2004-01-26,2004-01-21,...,95.2,38.59,1080.06,66.7,38.52,1078.1,72.6,38.47,1076.84,53.1


In [65]:
ann8.to_pickle("ann8.pkl")

#### CUSIP List XJ09

In [61]:
cusiplist_xj09 = pd.read_csv("DS_lookup_tables\\Cusip List 1.csv")
cusiplist_xj09['CUSIP'] = cusiplist_xj09['CUSIP'].str[2:11]

# Price
cusip_xj09_P = pd.read_csv("DS query results\\DS_XJ09.csv")
cusip_xj09_P = pd.merge(cusip_xj09_P, cusiplist_xj09, left_on=['Code'], right_on=['Check'], how='left')
cusip_xj09_P.drop(['Name', 'Code', 'Check','CURRENCY'], axis=1, inplace=True)

cusip_xj09_P_long = pd.melt(cusip_xj09_P, id_vars=['CUSIP'], var_name = 'date', value_name='price')
cusip_xj09_P_long = cusip_xj09_P_long[cusip_xj09_P_long.CUSIP.isnull()==False]
cusip_xj09_P_long['date'] = pd.to_datetime(cusip_xj09_P_long['date'])

# MV
cusip_xj09_MV = pd.read_csv("DS query results\\DS_XJ09_MV.csv")
cusip_xj09_MV = cusip_xj09_MV[cusip_xj09_MV['Code'].isnull()==False]
cusip_xj09_MV['text_location'] = cusip_xj09_MV['Code'].str.find("(")
cusip_xj09_MV['Code'] = cusip_xj09_MV.apply(lambda x: x['Code'][:x['text_location']], axis=1)
cusip_xj09_MV = pd.merge(cusip_xj09_MV, cusiplist_xj09, left_on=['Code'], right_on=['Check'], how='left')
cusip_xj09_MV.drop(['Name', 'Code', 'Check', 'text_location', 'CURRENCY'], axis=1, inplace=True)

cusip_xj09_MV_long = pd.melt(cusip_xj09_MV, id_vars=['CUSIP'], var_name = 'date', value_name='MV')
cusip_xj09_MV_long = cusip_xj09_MV_long[cusip_xj09_MV_long.CUSIP.isnull()==False]
cusip_xj09_MV_long['date'] = pd.to_datetime(cusip_xj09_MV_long['date'])

# Volume
cusip_xj09_VO = pd.read_csv("DS query results\\DS_XJ09_VO.csv")
cusip_xj09_VO = cusip_xj09_VO[cusip_xj09_VO['Code'].isnull()==False]
cusip_xj09_VO['text_location'] = cusip_xj09_VO['Code'].str.find("(").astype(int)
cusip_xj09_VO['Code'] = cusip_xj09_VO.apply(lambda x: x['Code'][:x['text_location']], axis=1)
cusip_xj09_VO = pd.merge(cusip_xj09_VO, cusiplist_xj09, left_on=['Code'], right_on=['Check'], how='left')
cusip_xj09_VO.drop(['Name', 'Code', 'Check', 'text_location', 'CURRENCY'], axis=1, inplace=True)

cusip_xj09_VO_long = pd.melt(cusip_xj09_VO, id_vars=['CUSIP'], var_name = 'date', value_name='VO')
cusip_xj09_VO_long = cusip_xj09_VO_long[cusip_xj09_VO_long.CUSIP.isnull()==False]
cusip_xj09_VO_long['date'] = pd.to_datetime(cusip_xj09_VO_long['date'])

In [62]:
# Merge
cusip_xj09_long = pd.merge(cusip_xj09_P_long, cusip_xj09_MV_long, on=['CUSIP', 'date'], how='left')
cusip_xj09_long = pd.merge(cusip_xj09_long, cusip_xj09_VO_long, on=['CUSIP', 'date'], how='left')
cusip_xj09_long.head()

Unnamed: 0,CUSIP,date,price,MV,VO
0,000361105,1998-01-01,25.8346,710.44,
1,02376R102,1998-01-01,,,
2,723484101,1998-01-01,42.375,3589.92,
3,002824100,1998-01-01,14.6555,50099.81,
4,007903107,1998-01-01,8.875,2516.44,


In [63]:
# prepare the table to match
ann8_unmatched.drop(list(ann8_unmatched.filter(regex='mv')), axis=1, inplace=True)
ann8_unmatched.drop(list(ann8_unmatched.filter(regex='price')), axis=1, inplace=True)
ann8_unmatched.drop(list(ann8_unmatched.filter(regex='vo')), axis=1, inplace=True)
ann8_unmatched.drop(list(ann8_unmatched.filter(regex='index')), axis=1, inplace=True)
ann8_unmatched.drop(list(ann8_unmatched.filter(regex='currency')), axis=1, inplace=True)
ann8_unmatched.head()
len(ann8_unmatched.GVKEY.unique())

# use for loop to merge to the sample
cols = ann.columns[6:]
ann9 = ann8_unmatched.copy()
for i in cols:
    ann9 = pd.merge(ann9, cusip_xj09_long, left_on=['CUSIP', i], right_on=['CUSIP', 'date'], how='left')
    ann9.rename(columns={'price':f"price_{i[4:]}", 'MV':f"mv_{i[4:]}", 'VO':f"vo_{i[4:]}"}, inplace=True)
    ann9.drop(['date'], axis=1, inplace=True)

In [64]:
# Print out the matching results information
# ann9 = pd.read_pickle("ann9.pkl")
ann9_matched = ann9[(ann9.price_p0.isnull()==False) | 
                    (ann9.price_p1.isnull()==False) | 
                    (ann9.price_m1.isnull()==False)]
ann9_unmatched = ann9[(ann9.price_p0.isnull()) & 
                      (ann9.price_p1.isnull()) & 
                      (ann9.price_m1.isnull()) ]

ann9_matched_gvkeys = list(ann9_matched.GVKEY.unique())
ann9_unmatched_error = ann9_unmatched[ann9_unmatched.GVKEY.isin(ann9_matched_gvkeys)]
ann9_todelete = list(ann9_unmatched_error.GVKEY.unique())
ann9_unmatched = ann9_unmatched[ann9_unmatched.GVKEY.isin(ann9_todelete)==False]

print(f"There are {ann9_unmatched_error.shape[0]} unmatched observations whose firms also appear in the matched list,\n"
      "because of data lost to listing gaps; this is the step where we lose some samples.\n")
print(f"The number of matched stocks is {len(ann9_matched.GVKEY.unique())}")
print(f"The number of unmatched stocks is {len(ann9_unmatched.GVKEY.unique())}")
ann9_matched.sample(3)

There are 64 unmatched observations whose firms also appear in the matched list,
because of data lost to listing gaps; this is the step where we lose some samples.

The number of matched stocks is 2055
The number of unmatched stocks is 1784


Unnamed: 0,FYEAR,GVKEY,DATADATE,CONM,SEDOL,CUSIP,ann_p0,ann_p1,ann_p2,ann_m1,...,vo_m117,price_m118,mv_m118,vo_m118,price_m119,mv_m119,vo_m119,price_m120,mv_m120,vo_m120
12248,2017.0,29837,2017-12-31,BIOCRYST PHARMACEUTICALS INC,,09058V103,2018-02-27,2018-02-28,2018-03-01,2018-02-26,...,3537.1,5.15,414.4,2874.1,5.36,431.3,9707.0,5.23,420.84,3873.3
17604,2011.0,123456,2011-12-31,MAXIM POWER CORP,,57773Y209,2012-03-22,2012-03-23,2012-03-26,2012-03-21,...,,2.08,112.49,,2.08,112.49,1.9,2.22,120.07,0.9
6708,2015.0,12459,2015-12-31,NVR INC,,62944T105,2016-01-26,2016-01-27,2016-01-28,2016-01-25,...,23.2,1526.99,6201.39,20.6,1511.92,6140.19,20.9,1512.77,6143.64,25.3


In [66]:
ann9.to_pickle('ann9.pkl')

#### Stack the matched firms together

In [72]:
ann1_matched['matched_by'] = 'SEDOL'
ann2_matched['matched_by'] = 'SEDOL'
ann3_matched['matched_by'] = 'SEDOL'
ann4_matched['matched_by'] = 'SEDOL'
ann5_matched['matched_by'] = 'SEDOL'
ann6_matched['matched_by'] = 'SEDOL'
ann7_matched['matched_by'] = 'SEDOL'
ann8_matched['matched_by'] = 'CUSIP'
ann9_matched['matched_by'] = 'CUSIP'

ann_1_9_combine = pd.concat([ann1_matched, ann2_matched, ann3_matched, ann4_matched, ann5_matched, ann6_matched, 
                             ann7_matched, ann8_matched, ann9_matched])

In [79]:
ann_1_9_combine.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 158709 entries, 189 to 24174
Columns: 427 entries, FYEAR to matched_by
dtypes: datetime64[ns](106), float64(1), object(320)
memory usage: 518.2+ MB


In [76]:
ann_1_9_combine.to_pickle("ann_1_9_combine.pkl")
ann9_unmatched.to_pickle("ann9_unmatched.pkl")

In [85]:
ann.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 165516 entries, 0 to 165515
Columns: 111 entries, FYEAR to ann_m120
dtypes: datetime64[ns](106), float64(1), object(4)
memory usage: 140.2+ MB


### Step 4: Add and Match the benchmark index
- In this step, I match each firm (identified by either CUSIP or SEDOL) to the corresponding benchmark index identified by Datastream.
- As with the stock data, the benchmark index series (there are 45 of them) contain index prices from 1/1/1998 to 05/31/2021.
- The index data are matched to earnings announcement days 0, +1, +2, -1, -2, -21, -22, ..., -120.

In [123]:
index_sedol = pd.read_csv("DS_lookup_tables\\Benchmark Index Match_SEDOL.csv")
index_cusip = pd.read_csv("DS_lookup_tables\\Benchmark Index Match_CUSIP.csv")
index_sedol['Match'] = index_sedol['SEDOL'].str.zfill(7)
index_sedol = index_sedol[index_sedol.SEDOL.isnull()==False]
index_cusip['Match'] = index_cusip['ISIN CODE'].str[2:11]
index_cusip.rename(columns={'SEDOL CODE':'SEDOL'}, inplace=True)
index = pd.concat([index_sedol, index_cusip])
index.rename(columns={'Matched_by':'matched_by'}, inplace=True)
index.drop(['Type', 'ISIN CODE', 'SEDOL'], axis=1, inplace=True)
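
The two string operations in the cell above normalise the identifiers before matching. `str.zfill(7)` restores leading zeros that a 7-character SEDOL loses when stored as a number, and `str[2:11]` takes the 9-character CUSIP embedded in a US or Canadian ISIN (after the two-letter country prefix and before the final check digit). A small sketch with illustrative identifier strings, made up for this example rather than taken from the lookup files:

```python
import pandas as pd

ids = pd.DataFrame({
    "SEDOL": ["287580", "5165294"],                     # first value lost its leading zero
    "ISIN CODE": ["US0028241000", "US0079031078"],      # illustrative ISIN strings
})

ids["sedol_padded"] = ids["SEDOL"].str.zfill(7)         # left-pad to 7 characters
ids["cusip_from_isin"] = ids["ISIN CODE"].str[2:11]     # characters 3 to 11 of the ISIN
print(ids)
```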

In [124]:
index.head()

Unnamed: 0,matched_by,Index,Match
0,SEDOL,TOTMKNL,5165294
1,SEDOL,TOTMKUK,287580
2,SEDOL,TOTMKUK,108120
4,SEDOL,TOTMKUK,798059
5,SEDOL,TOTMKUK,3091357


In [125]:
ann_1_9_combine['Match'] = np.select([ann_1_9_combine['matched_by']=='SEDOL', ann_1_9_combine['matched_by']=='CUSIP'],
                             [ann_1_9_combine['SEDOL'], ann_1_9_combine['CUSIP']])

In [126]:
ann_combine = pd.merge(ann_1_9_combine, index, left_on=['Match', 'matched_by'], right_on=['Match', 'matched_by'], how='left')
ann_combine.head()

Unnamed: 0,FYEAR,GVKEY,DATADATE,CONM,SEDOL,CUSIP,ann_p0,ann_p1,ann_p2,ann_m1,...,vo_m118,price_m119,mv_m119,vo_m119,price_m120,mv_m120,vo_m120,matched_by,Match,Index
0,1998.0,1166,1998-12-31,ASM INTERNATIONAL NV,5165294,,1999-05-11,1999-05-12,1999-05-13,1999-05-10,...,360.5,4.1171,163.75,890.5,3.3861,134.67,109.8,SEDOL,5165294,TOTMKNL
1,1998.0,1166,1998-12-31,ASM INTERNATIONAL NV,5165294,,1999-05-11,1999-05-12,1999-05-13,1999-05-10,...,360.5,4.1171,163.75,890.5,3.3861,134.67,109.8,SEDOL,5165294,TOTMKNL
2,1999.0,1166,1999-12-31,ASM INTERNATIONAL NV,5165294,,2000-02-15,2000-02-16,2000-02-17,2000-02-14,...,44.1,5.8508,232.7,44.5,5.8084,231.01,58.4,SEDOL,5165294,TOTMKNL
3,1999.0,1166,1999-12-31,ASM INTERNATIONAL NV,5165294,,2000-02-15,2000-02-16,2000-02-17,2000-02-14,...,44.1,5.8508,232.7,44.5,5.8084,231.01,58.4,SEDOL,5165294,TOTMKNL
4,2000.0,1166,2000-12-31,ASM INTERNATIONAL NV,5165294,,2001-02-15,2001-02-16,2001-02-19,2001-02-14,...,325.2,24.6327,1416.45,592.5,24.2936,1396.94,255.6,SEDOL,5165294,TOTMKNL
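
The head of `ann_combine` shows repeated rows. Whether they come from repeated keys in the lookup table or from the sample itself is not established here. If the lookup table were the cause, one way to guard against it is to de-duplicate the lookup keys and ask `merge` to verify the relationship with its `validate` argument. A sketch on toy data (all values hypothetical):

```python
import pandas as pd

left = pd.DataFrame({"Match": ["5165294", "0287580"],
                     "matched_by": ["SEDOL", "SEDOL"]})
lookup = pd.DataFrame({
    "Match": ["5165294", "5165294", "0287580"],   # first key appears twice
    "matched_by": ["SEDOL", "SEDOL", "SEDOL"],
    "Index": ["TOTMKNL", "TOTMKNL", "TOTMKUK"],
})
keys = ["Match", "matched_by"]

# A left merge against duplicated keys multiplies the matching left rows.
merged_raw = left.merge(lookup, on=keys, how="left")

# De-duplicate the keys, then let pandas raise if the right side is still not unique.
lookup_unique = lookup.drop_duplicates(subset=keys)
merged = left.merge(lookup_unique, on=keys, how="left", validate="many_to_one")
print(len(merged_raw), len(merged))
```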


In [130]:
# Query and add the index data
index_PI = pd.read_csv("DS query results\\DS_INDEX_PI.csv")
index_PI.drop(['Name', 'Code'], axis=1, inplace=True)
index_PI_long = pd.melt(index_PI, id_vars=['Index'], var_name = 'date', value_name='price_index')
index_PI_long = index_PI_long[index_PI_long.Index.isnull()==False]
index_PI_long['date'] = pd.to_datetime(index_PI_long['date'])

In [131]:
index_PI_long.head()

Unnamed: 0,Index,date,price_index
0,TOTMKNL,1998-01-01,971.03
1,TOTMKUK,1998-01-01,3286.25
2,TOTMKSG,1998-01-01,237.74
3,TOTMKJP,1998-01-01,349.63
4,TOTMKLX,1998-01-01,335.19


In [134]:
# use for loop to merge to the sample
cols = ann.columns[6:]
ann_combine_temp = ann_combine.copy()
for i in cols:
    ann_combine_temp = pd.merge(ann_combine_temp, index_PI_long, left_on=['Index', i], right_on=['Index', 'date'], how='left')
    ann_combine_temp.rename(columns={'price_index':f"price_index_{i[4:]}"}, inplace=True)
    ann_combine_temp.drop(['date'], axis=1, inplace=True)

In [138]:
ann_combine_final = ann_combine_temp.copy()
ann_combine_final.to_pickle("ann_combine_final.pkl")
ann_combine_final.head()

Unnamed: 0,FYEAR,GVKEY,DATADATE,CONM,SEDOL,CUSIP,ann_p0,ann_p1,ann_p2,ann_m1,...,price_index_m111,price_index_m112,price_index_m113,price_index_m114,price_index_m115,price_index_m116,price_index_m117,price_index_m118,price_index_m119,price_index_m120
0,1998.0,1166,1998-12-31,ASM INTERNATIONAL NV,5165294,,1999-05-11,1999-05-12,1999-05-13,1999-05-10,...,1040.87,1034.35,1024.58,1011.85,1032.97,1079.51,1108.44,1107.42,1089.39,1091.12
1,1998.0,1166,1998-12-31,ASM INTERNATIONAL NV,5165294,,1999-05-11,1999-05-12,1999-05-13,1999-05-10,...,1040.87,1034.35,1024.58,1011.85,1032.97,1079.51,1108.44,1107.42,1089.39,1091.12
2,1999.0,1166,1999-12-31,ASM INTERNATIONAL NV,5165294,,2000-02-15,2000-02-16,2000-02-17,2000-02-14,...,1282.8,1290.99,1282.75,1269.43,1275.41,1277.38,1260.01,1236.31,1251.91,1251.96
3,1999.0,1166,1999-12-31,ASM INTERNATIONAL NV,5165294,,2000-02-15,2000-02-16,2000-02-17,2000-02-14,...,1282.8,1290.99,1282.75,1269.43,1275.41,1277.38,1260.01,1236.31,1251.91,1251.96
4,2000.0,1166,2000-12-31,ASM INTERNATIONAL NV,5165294,,2001-02-15,2001-02-16,2001-02-19,2001-02-14,...,1526.16,1539.53,1548.61,1539.77,1554.28,1551.95,1555.41,1560.77,1548.4,1535.45


### Step 5: Perform final calculations

Main Variables created:

- **CAR1**: Cumulative abnormal return (stock return minus the DS benchmark index return) over days -1 to +1
- **CAR2**: Cumulative abnormal return (stock return minus the DS benchmark index return) over days -2 to +2
- **abvo_pm1**: abnormal trading volume, the average trading volume over days -1 to +1 divided by the average trading volume over days -120 to -21
- **abvo_p1p0**: abnormal trading volume, the average trading volume over days 0 to +1 divided by the average trading volume over days -120 to -21
- **abretvar_overbenchmark_pm1**: the variance of daily abnormal returns over days -1 to +1 divided by the variance of daily abnormal returns over the baseline window (days -119 to -21 in the code below, because the return on day -120 would require a price on day -121)
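
Written out, the measures as the code cells below compute them are roughly as follows. The symbols are notation introduced here for illustration: $P_t$ is the stock price, $I_t$ the benchmark index price, and $VO_t$ the trading volume on business day $t$ relative to the announcement. The fixed denominators 3, 2, and 100 assume no missing values (pandas' `mean` skips missing entries), and $\operatorname{Var}$ is the sample variance, which is the pandas default.

```latex
\begin{aligned}
\mathrm{CAR1} &= \frac{P_{+1}}{P_{-1}} - \frac{I_{+1}}{I_{-1}}, \qquad
\mathrm{CAR2} = \frac{P_{+2}}{P_{-2}} - \frac{I_{+2}}{I_{-2}} \\[4pt]
\mathrm{abvo\_pm1} &= \frac{\tfrac{1}{3}\sum_{t=-1}^{+1} VO_t}{\tfrac{1}{100}\sum_{t=-120}^{-21} VO_t}, \qquad
\mathrm{abvo\_p1p0} = \frac{\tfrac{1}{2}\sum_{t=0}^{+1} VO_t}{\tfrac{1}{100}\sum_{t=-120}^{-21} VO_t} \\[4pt]
AR_t &= \frac{P_t}{P_{t-1}} - \frac{I_t}{I_{t-1}} \\[4pt]
\mathrm{abretvar\_overbenchmark\_pm1} &=
\frac{\operatorname{Var}\left(AR_{-1}, AR_{0}, AR_{+1}\right)}
     {\operatorname{Var}\left(AR_{-119}, \dots, AR_{-21}\right)}
\end{aligned}
```

Because both terms of each abnormal return are gross price ratios, the constant 1 cancels in the difference, so the result equals the difference in simple returns.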

In [33]:
# Make sure columns participating in calculations are of the right type
ann_combine_final = pd.read_pickle("ann_combine_final.pkl")
convert_type_cols = ann_combine_final.columns[111:426]

for i in convert_type_cols:
    ann_combine_final[i] = ann_combine_final[i].astype(float)

In [34]:
# calculate CAR (-1, +1)
ann_combine_final1 = ann_combine_final.copy()
ann_combine_final1['return_pm1'] = ann_combine_final1['price_p1'].div(ann_combine_final1['price_m1']).replace(np.inf, np.nan)
ann_combine_final1['return_index_pm1'] = ann_combine_final1['price_index_p1'].div(ann_combine_final1['price_index_m1']).replace(np.inf, np.nan)
ann_combine_final1['CAR1'] = ann_combine_final1['return_pm1'] - ann_combine_final1['return_index_pm1']

# calculate CAR (-2, +2)
ann_combine_final1['return_pm2'] = ann_combine_final1['price_p2'].div(ann_combine_final1['price_m2']).replace(np.inf, np.nan)
ann_combine_final1['return_index_pm2'] = ann_combine_final1['price_index_p2'].div(ann_combine_final1['price_index_m2']).replace(np.inf, np.nan)
ann_combine_final1['CAR2'] = ann_combine_final1['return_pm2'] - ann_combine_final1['return_index_pm2']
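
A worked toy example of the CAR1 calculation, using made-up prices, shows both the arithmetic and why the `replace(np.inf, np.nan)` step is there. A zero price in the denominator produces `inf`, which is converted to a missing value:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "price_m1": [10.0, 0.0],          # second row has a zero price on day -1
    "price_p1": [11.0, 5.0],
    "price_index_m1": [100.0, 100.0],
    "price_index_p1": [102.0, 101.0],
})

toy["return_pm1"] = toy["price_p1"].div(toy["price_m1"]).replace(np.inf, np.nan)
toy["return_index_pm1"] = toy["price_index_p1"].div(toy["price_index_m1"]).replace(np.inf, np.nan)
toy["CAR1"] = toy["return_pm1"] - toy["return_index_pm1"]
print(toy[["return_pm1", "return_index_pm1", "CAR1"]])
```

In the first row the stock ratio is 1.10 and the index ratio is 1.02, so CAR1 is about 0.08. The second row ends up missing.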

In [35]:
# Calculate abnormal trading volumes (-1, +1)
ann_combine_final2 = ann_combine_final1.copy()
vo_cols = ann_combine_final2.columns[list(range(128,426,3))]
ann_combine_final2['voavg_pm1'] = ann_combine_final2[['vo_p1', 'vo_p0', 'vo_m1']].mean(axis=1)
ann_combine_final2['voavg_m21m120'] = ann_combine_final2[vo_cols].mean(axis=1)
ann_combine_final2['abvo_pm1'] = ann_combine_final2['voavg_pm1'] / ann_combine_final2['voavg_m21m120']

# Calculate abnormal trading volumes (0, +1)
ann_combine_final2['voavg_p1p0'] = ann_combine_final2[['vo_p1', 'vo_p0']].mean(axis=1)
ann_combine_final2['abvo_p1p0'] = ann_combine_final2['voavg_p1p0'] / ann_combine_final2['voavg_m21m120']
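
The baseline columns above are picked by position (`range(128, 426, 3)`), which depends on the exact column order of the pickled table. Assuming those positions correspond to `vo_m21` through `vo_m120`, a name-based selection is a less fragile alternative. The sketch below uses randomly generated toy volumes:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base_cols = [f"vo_m{i}" for i in range(21, 121)]   # vo_m21 ... vo_m120

# Toy volumes for three announcements; the values are random placeholders.
toy = pd.DataFrame(rng.uniform(50, 150, size=(3, len(base_cols))), columns=base_cols)
toy["vo_m1"] = [100.0, 200.0, np.nan]
toy["vo_p0"] = [120.0, 220.0, 90.0]
toy["vo_p1"] = [110.0, 240.0, 80.0]

toy["voavg_pm1"] = toy[["vo_p1", "vo_p0", "vo_m1"]].mean(axis=1)   # NaN is skipped
toy["voavg_m21m120"] = toy[base_cols].mean(axis=1)
toy["abvo_pm1"] = toy["voavg_pm1"] / toy["voavg_m21m120"]
print(toy[["voavg_pm1", "voavg_m21m120", "abvo_pm1"]])
```

Note that `mean(axis=1)` ignores missing values, so the third row averages only the two available event-window volumes.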

In [36]:
# Glance at the new columns
ann_combine_final2[['vo_m1', 'vo_p0', 'vo_p1', 'voavg_pm1', 'voavg_m21m120', 'abvo_pm1', 'abvo_p1p0','CAR1', 'CAR2']].head(10)

Unnamed: 0,vo_m1,vo_p0,vo_p1,voavg_pm1,voavg_m21m120,abvo_pm1,abvo_p1p0,CAR1,CAR2
0,145.8,161.9,93.8,133.833333,138.28,0.967843,0.924573,6e-05,-0.0404
1,145.8,161.9,93.8,133.833333,138.28,0.967843,0.924573,6e-05,-0.0404
2,341.5,1399.9,981.8,907.733333,601.047475,1.510252,1.981291,0.154215,0.09012
3,341.5,1399.9,981.8,907.733333,601.047475,1.510252,1.981291,0.154215,0.09012
4,725.6,1013.6,666.7,801.966667,428.147423,1.873109,1.962291,0.123227,-0.0054
5,725.6,1013.6,666.7,801.966667,428.147423,1.873109,1.962291,0.123227,-0.0054
6,164.8,329.2,156.2,216.733333,420.239583,0.515738,0.577528,0.015559,-0.009681
7,164.8,329.2,156.2,216.733333,420.239583,0.515738,0.577528,0.015559,-0.009681
8,211.3,311.7,131.0,218.0,279.228866,0.780722,0.792719,-0.059362,-0.029165
9,211.3,311.7,131.0,218.0,279.228866,0.780722,0.792719,-0.059362,-0.029165


In [39]:
# Calculate daily abnormal returns and abnormal returns variance
ann_combine_final3 = ann_combine_final2.copy()

for i in range(21, 120):
    ann_combine_final3[f"return_m{i}"] = ann_combine_final3[f"price_m{i}"].div(ann_combine_final3[f"price_m{i+1}"]).replace(np.inf, np.nan)
    ann_combine_final3[f"return_index_m{i}"] = ann_combine_final3[f"price_index_m{i}"].div(ann_combine_final3[f"price_index_m{i+1}"]).replace(np.inf, np.nan)
    ann_combine_final3[f"abret_m{i}"] = ann_combine_final3[f"return_m{i}"] - ann_combine_final3[f"return_index_m{i}"]

ann_combine_final3['return_p1'] = ann_combine_final3['price_p1'].div(ann_combine_final3['price_p0']).replace(np.inf, np.nan)
ann_combine_final3['return_p0'] = ann_combine_final3['price_p0'].div(ann_combine_final3['price_m1']).replace(np.inf, np.nan)
ann_combine_final3['return_m1'] = ann_combine_final3['price_m1'].div(ann_combine_final3['price_m2']).replace(np.inf, np.nan)
ann_combine_final3['return_index_p1'] = ann_combine_final3['price_index_p1'].div(ann_combine_final3['price_index_p0']).replace(np.inf, np.nan)
ann_combine_final3['return_index_p0'] = ann_combine_final3['price_index_p0'].div(ann_combine_final3['price_index_m1']).replace(np.inf, np.nan)
ann_combine_final3['return_index_m1'] = ann_combine_final3['price_index_m1'].div(ann_combine_final3['price_index_m2']).replace(np.inf, np.nan)

ann_combine_final3['abret_p1'] = ann_combine_final3['return_p1'] - ann_combine_final3['return_index_p1']
ann_combine_final3['abret_p0'] = ann_combine_final3['return_p0'] - ann_combine_final3['return_index_p0']
ann_combine_final3['abret_m1'] = ann_combine_final3['return_m1'] - ann_combine_final3['return_index_m1']

# calculate the variance horizontally, for the abnormal returns from m1 to p1
ann_combine_final3['abret_var_pm1'] = ann_combine_final3[['abret_p1', 'abret_p0', 'abret_m1']].var(axis="columns")

# calculate the variance horizontally, for the abnormal returns from m119 to m21
# (abret_m120 does not exist because it would need the price on day -121)
varlist_benchmark = [f"abret_m{i}" for i in range(21, 120)]

ann_combine_final3['abret_var_m21m120'] = ann_combine_final3[varlist_benchmark].var(axis="columns")

# Finally, calculate the abnormal return variance measurement
ann_combine_final3['abretvar_overbenchmark_pm1'] = ann_combine_final3['abret_var_pm1'] / ann_combine_final3['abret_var_m21m120']

# some final cleaning
ann_combine_final3['abretvar_overbenchmark_pm1'] = ann_combine_final3['abretvar_overbenchmark_pm1'].replace(np.inf, np.nan)
ann_combine_final3['abvo_pm1'] = ann_combine_final3['abvo_pm1'].replace(np.inf, np.nan)
ann_combine_final3['abvo_p1p0'] = ann_combine_final3['abvo_p1p0'].replace(np.inf, np.nan)
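
To make the row-wise variance ratio concrete, here is a toy version with a three-day baseline instead of the 99 daily returns used above. All numbers are made up. The second row has a constant baseline, so its baseline variance is zero and the ratio becomes `inf` before the cleaning step:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "abret_m1": [-0.01, 0.01],
    "abret_p0": [0.03, 0.00],
    "abret_p1": [0.02, 0.00],
    "abret_m21": [0.01, 0.00],
    "abret_m22": [-0.01, 0.00],
    "abret_m23": [0.00, 0.00],
})

# Row-wise sample variance (pandas default ddof=1) over each window.
event_var = toy[["abret_m1", "abret_p0", "abret_p1"]].var(axis="columns")
base_var = toy[["abret_m21", "abret_m22", "abret_m23"]].var(axis="columns")

# Division by a zero baseline variance gives inf, which is replaced by NaN.
ratio = (event_var / base_var).replace(np.inf, np.nan)
print(ratio)
```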

In [40]:
ann_combine_final3

Unnamed: 0,FYEAR,GVKEY,DATADATE,CONM,SEDOL,CUSIP,ann_p0,ann_p1,ann_p2,ann_m1,...,return_m1,return_index_p1,return_index_p0,return_index_m1,abret_p1,abret_p0,abret_m1,abret_var_pm1,abret_var_m21m120,abretvar_overbenchmark_pm1
0,1998.0,001166,1998-12-31,ASM INTERNATIONAL NV,5165294,,1999-05-11,1999-05-12,1999-05-13,1999-05-10,...,0.951920,0.992809,0.997008,0.992793,-0.012810,0.013094,-0.040873,0.000728,0.001594,0.457141
1,1998.0,001166,1998-12-31,ASM INTERNATIONAL NV,5165294,,1999-05-11,1999-05-12,1999-05-13,1999-05-10,...,0.951920,0.992809,0.997008,0.992793,-0.012810,0.013094,-0.040873,0.000728,0.001594,0.457141
2,1999.0,001166,1999-12-31,ASM INTERNATIONAL NV,5165294,,2000-02-15,2000-02-16,2000-02-17,2000-02-14,...,0.961401,1.011412,1.002234,0.985858,0.025867,0.123679,-0.024457,0.005674,0.002764,2.052831
3,1999.0,001166,1999-12-31,ASM INTERNATIONAL NV,5165294,,2000-02-15,2000-02-16,2000-02-17,2000-02-14,...,0.961401,1.011412,1.002234,0.985858,0.025867,0.123679,-0.024457,0.005674,0.002764,2.052831
4,2000.0,001166,2000-12-31,ASM INTERNATIONAL NV,5165294,,2001-02-15,2001-02-16,2001-02-19,2001-02-14,...,0.929621,0.993127,1.003545,0.992924,-0.051480,0.185727,-0.063303,0.019737,0.002797,7.057686
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
158716,2014.0,287462,2014-12-31,FUTUREFUEL CORP,,36116M106,2015-03-12,2015-03-13,2015-03-16,2015-03-11,...,1.016732,0.994381,1.012286,0.998577,0.068328,0.023208,0.018154,0.000763,0.000484,1.576061
158717,2015.0,287462,2015-12-31,FUTUREFUEL CORP,,36116M106,2016-03-10,2016-03-11,2016-03-14,2016-03-09,...,1.017718,1.016320,0.999357,1.005169,-0.102017,-0.016071,0.012549,0.003555,0.000655,5.425074
158718,2016.0,287462,2016-12-31,FUTUREFUEL CORP,,36116M106,2017-03-16,2017-03-17,2017-03-20,2017-03-15,...,1.012257,0.999020,0.998816,1.009059,0.068862,0.004965,0.003198,0.001400,0.000580,2.411928
158719,2018.0,315318,2018-12-31,ELEMENT SOLUTIONS INC,,28618M106,2019-02-28,2019-03-01,2019-03-04,2019-02-27,...,1.033159,1.006057,0.997420,1.000074,-0.002504,-0.046407,0.033084,0.001585,0.000278,5.711057


In [41]:
final_columns = ['FYEAR', 'GVKEY', 'DATADATE', 'CONM', 'SEDOL', 'CUSIP', 'ann_p0', 'CAR1','CAR2','abvo_pm1','abvo_p1p0',
                 'abretvar_overbenchmark_pm1','matched_by','Match','Index']
final_data = ann_combine_final3[final_columns]
final_data.head()

Unnamed: 0,FYEAR,GVKEY,DATADATE,CONM,SEDOL,CUSIP,ann_p0,CAR1,CAR2,abvo_pm1,abvo_p1p0,abretvar_overbenchmark_pm1,matched_by,Match,Index
0,1998.0,1166,1998-12-31,ASM INTERNATIONAL NV,5165294,,1999-05-11,6e-05,-0.0404,0.967843,0.924573,0.457141,SEDOL,5165294,TOTMKNL
1,1998.0,1166,1998-12-31,ASM INTERNATIONAL NV,5165294,,1999-05-11,6e-05,-0.0404,0.967843,0.924573,0.457141,SEDOL,5165294,TOTMKNL
2,1999.0,1166,1999-12-31,ASM INTERNATIONAL NV,5165294,,2000-02-15,0.154215,0.09012,1.510252,1.981291,2.052831,SEDOL,5165294,TOTMKNL
3,1999.0,1166,1999-12-31,ASM INTERNATIONAL NV,5165294,,2000-02-15,0.154215,0.09012,1.510252,1.981291,2.052831,SEDOL,5165294,TOTMKNL
4,2000.0,1166,2000-12-31,ASM INTERNATIONAL NV,5165294,,2001-02-15,0.123227,-0.0054,1.873109,1.962291,7.057686,SEDOL,5165294,TOTMKNL
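
In the preview above, rows 0 and 1 (and rows 2 and 3) are identical in every displayed column. If these repeats are not intended, one possible way to count and drop them is sketched below. The key columns `GVKEY`, `FYEAR` and `ann_p0` are an assumption about what identifies one announcement; the toy frame only copies a few values from the preview.

```python
import pandas as pd

# Small stand-in for final_data, using values shown in the preview
preview = pd.DataFrame({
    "GVKEY": ["001166", "001166", "001166"],
    "FYEAR": [1998.0, 1998.0, 1999.0],
    "ann_p0": pd.to_datetime(["1999-05-11", "1999-05-11", "2000-02-15"]),
    "abretvar_overbenchmark_pm1": [0.457141, 0.457141, 2.052831],
})

# Assumed key for one firm-year announcement
key = ["GVKEY", "FYEAR", "ann_p0"]

# Count rows that repeat an earlier key, then keep the first of each
n_dupes = int(preview.duplicated(subset=key).sum())
deduped = preview.drop_duplicates(subset=key).reset_index(drop=True)

print(f"{n_dupes} duplicate row(s) dropped, {len(deduped)} remaining")
```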


In [42]:
final_data.to_pickle('final_data.pkl')
final_data.to_csv('final_data.csv')

In [43]:
final_data.to_stata('final_data.dta')
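
One caveat when reading the CSV export back: `pd.read_csv` infers column types, so string identifiers such as `GVKEY` lose their leading zeros unless the dtype is given. A minimal sketch on a toy frame with hypothetical values, using an in-memory buffer instead of a file:

```python
import io
import pandas as pd

# Toy stand-in for final_data (values are illustrative only)
toy = pd.DataFrame({"GVKEY": ["001166", "287462"], "CAR1": [0.00006, 0.154215]})

# Write to an in-memory buffer rather than to disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default read: GVKEY is parsed as an integer, dropping leading zeros
buffer.seek(0)
inferred = pd.read_csv(buffer)

# Reading GVKEY as text keeps the identifier intact
buffer.seek(0)
as_text = pd.read_csv(buffer, dtype={"GVKEY": str})

print(inferred["GVKEY"].tolist())
print(as_text["GVKEY"].tolist())
```

The pickle export keeps the original dtypes, so this only matters for the CSV file.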