### Mid-term Exam Instructions

This assignment is designed to help you build several databases for your major research projects. The code should give you a detailed picture of how to combine and clean common financial databases for analysis.

We will be looking at six databases:

1. Accounting data from annual reports
2. Stock returns data from stock exchanges
3. Analyst reports
4. CEO characteristics from proxy statements
5. Board data from proxy statements
6. Firms' patent grants from the patent office

To keep these databases manageable, I will use US data from 2007 to 2017 (roughly ten years) for illustration. This is more than enough for your research project.

These datasets are all available via WRDS; feel free to download more if you need them, and I will point out where to download each one. For our convenience, I have already downloaded them and converted them into a binary format called parquet. It lets pandas read the data more quickly and lets us select columns without reading the entire dataset.


Tips before you begin:

1) Go back through the videos from week 1, week 2 and week 3 if you get stuck on anything. All of the tools required to complete this task have already been taught, so you should be able to recognise which ones to use. If you cannot remember how to perform a certain task in pandas, go through the videos until you find the solution, or search for it on Google.

2) Google is your best friend! Search for anything you are still stuck on, and make sure you fully understand and can explain what each function you use does.

3) I want you to demonstrate your knowledge and understanding of Python. Add comments to your code to explain what you have done in each step. Present it clearly, and imagine you are showing the code to someone with absolutely no knowledge of Python. You want them to understand what you have done.

4) There will be cells that show the solutions for these tasks. Only use these if you are completely stuck, and avoid copying the solution code directly into your answer. Try to run your own code first, and keep trying until it runs properly. Once you are confident that your code works, I recommend checking it against the solutions.

5) Good luck!

Note: for any Mac users who are struggling to access Jupyter Notebook on their computer, here are the instructions to open Jupyter Notebook:

1. Download the midterm.ipynb zip from GitHub

2. Expand the zip and place the extracted file on your desktop

3. Open Terminal

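4. Type `pip install notebook` in Terminal to install Jupyter Notebook (alternatively, install the Anaconda distribution with its own installer, which includes Jupyter)

5. After the installation finishes, type `jupyter notebook` in Terminal. This will open a local web server in your browser where you can access the Jupyter Notebook.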

6. Open the midterm.ipynb notebook from your desktop and the mfin6210-master file.

In [None]:
# import libraries
import pandas as pd
import numpy as np


In [None]:
# You may need to run the code below to install some additional libraries; comment it out if you have already installed them
!pip install fastparquet pyarrow

In [None]:
# I can read the fundamental_annual data using the pd.read_parquet function
funda = pd.read_parquet('https://mfin6210.s3.amazonaws.com/fundamental_annual.pq',
                        columns=['gvkey','cusip','permno','fyear','datadate','at','xrd'])

The fundamental data has over 900 columns covering an array of accounting variables. You can look up the
variable definitions here: https://wrds-web.wharton.upenn.edu/wrds/ds/compd/funda/index.cfm?navId=83

I read only seven columns by passing the columns= parameter: gvkey and cusip (firm identifiers), permno (firm id),
fyear (fiscal year), datadate (fiscal period end date), at (total assets), and xrd (R&D expenses).

Of course, you can add more variables if you like. Look at the documentation to see what is available and grab what you want.

In [None]:
# This dataset is a panel. For illustration purposes, I will use the accounting data as the base
# and merge all other datasets into it.

In [None]:
# We want to produce a histogram showing that log assets are approximately normally distributed across firms.
# Your first task:
# 1. Write code to compute the log of the asset values
# 2. Produce a histogram of the logged values; it should look roughly like a normal distribution.

# Hints:
# A normal distribution is a smooth, symmetric, bell-shaped curve
# When taking the log, add 1 to the asset values first, because log(0) is undefined for firms with zero assets
# Create a new column called ln_at that equals log(1 + at)

# Type your code here:


In [None]:
# Running the next cell will reveal the solution; please work on the task first.
# Do not copy directly from the solutions, because I will be able to tell whether you have given this a serious attempt.
# Only look at the solution if you are absolutely stuck.

In [None]:
%load 1.py
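
As a rough illustration of the log transform in this task (not the contents of 1.py), the sketch below uses a made-up `toy` dataframe with hypothetical asset values. It bins the logged values with `np.histogram` so it runs without a plotting library; in a notebook, `toy['ln_at'].hist()` would draw the chart.

```python
import numpy as np
import pandas as pd

# Hypothetical total-asset values (not taken from the real funda data)
toy = pd.DataFrame({'at': [0.0, 5.2, 48.0, 310.5, 2750.0, 19800.0, 152000.0]})

# ln_at = log(1 + at); np.log1p keeps firms with zero assets (log1p(0) == 0)
toy['ln_at'] = np.log1p(toy['at'])

# Count how many logged values fall into each of 4 equal-width bins
counts, edges = np.histogram(toy['ln_at'], bins=4)
print(counts, edges.round(2))
```

On the real data, the raw `at` column is heavily right-skewed, which is why the histogram is drawn on the logged column instead.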

Next, I read the stock return data, which contains monthly returns for each firm. For variable descriptions, see:

https://wrds-web.wharton.upenn.edu/wrds/ds/crsp/stock_a/msf.cfm?navId=128

In [None]:
ret = pd.read_parquet('https://mfin6210.s3.amazonaws.com/stock_return.pq')

In [None]:
# Your second task:
# 1. Write code to convert the monthly returns into annual returns as a measure of return
# 2. Calculate the standard deviation of monthly returns for each year as a measure of risk

# HINT: create a new dataframe called std_ret that calculates the standard deviation of returns by grouping the
# dataframe by permno and fyear (google the pandas tool for standard deviation),
# and create a new dataframe called aret that
# calculates the annual return using the product of 1+ret. Do this by grouping the data by permno and fyear.
# Finally, combine (merge/join) return and risk into a dataframe called stock_return

# Type your code here:


In [None]:
# Once again, only use this as a last resort

In [None]:
%load 2.py
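
The grouping logic in this task can be sketched on a tiny made-up table. This is not the contents of 2.py: the `toy_ret` data is hypothetical, and subtracting 1 after compounding is an assumption about how the annual return should be expressed.

```python
import pandas as pd

# Hypothetical monthly returns for one firm; column names follow the hint above
toy_ret = pd.DataFrame({
    'permno': [10001] * 4,
    'fyear':  [2015, 2015, 2016, 2016],
    'ret':    [0.10, -0.05, 0.02, 0.03],
})

grouped = toy_ret.groupby(['permno', 'fyear'])['ret']

# Risk: standard deviation of the monthly returns within each firm-year
std_ret = grouped.std().rename('std_ret')

# Return: compound the monthly returns, prod(1 + ret) - 1 (the "- 1" is an assumption)
aret = grouped.apply(lambda r: (1 + r).prod() - 1).rename('aret')

# Combine both measures into one firm-year table keyed by permno and fyear
stock_return = pd.concat([aret, std_ret], axis=1).reset_index()
print(stock_return)
```

Both series share the same (permno, fyear) index, so `pd.concat(..., axis=1)` lines them up row by row before `reset_index()` turns the keys back into columns.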

In [None]:
# We use left merge for accounting data. We keep everything on the left because 
# accounting data is our basis. If we do inner join, we will lose more and more observations as we 
# join more datasets. So, for the completeness of the data, we will left merge datasets into our base dataset
# and deal with the missing values later
df = funda.merge(stock_return,how='left') 
# The common columns to merge on are permno and fyear,
# so I have omitted the on= parameter here
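
To see why a left merge is preferred over an inner join here, the sketch below compares the two on made-up firm-year tables. The `left` and `right` frames and their values are hypothetical, and the keys are spelled out with `on=` for clarity.

```python
import pandas as pd

# Hypothetical firm-year panels sharing the keys permno and fyear
left = pd.DataFrame({'permno': [1, 1, 2], 'fyear': [2015, 2016, 2015], 'at': [10.0, 12.0, 50.0]})
right = pd.DataFrame({'permno': [1, 2], 'fyear': [2015, 2015], 'aret': [0.05, -0.02]})

# Left merge keeps all three base rows; the unmatched firm-year gets NaN for aret
merged_left = left.merge(right, how='left', on=['permno', 'fyear'])

# Inner merge drops the unmatched firm-year entirely
merged_inner = left.merge(right, how='inner', on=['permno', 'fyear'])

print(len(merged_left), len(merged_inner))  # 3 2
```

When `on=` is omitted, pandas merges on every column name the two frames share, so it is worth checking that the only shared names are the intended keys.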

Next, we read the executive characteristics and compensation dataset

https://wrds-web.wharton.upenn.edu/wrds/ds/comp/execcomp/anncomp/index.cfm?navId=72

In [None]:
# I have only read a few columns for illustration.
# They are: gvkey (firm identifier), fyear (fiscal year), tdc1 (total compensation),
# ceoann (flags the executive who is the CEO), and becameceo (the date the CEO took the role).
executive = pd.read_parquet('https://mfin6210.s3.amazonaws.com/exec_chars.pq',
                            columns=['gvkey','fyear','tdc1','ceoann','becameceo'])

In [None]:
# Your third task:
# The executive characteristics data lists the top 5 executives in each company. Suppose we only need the data for the CEO.
# 1. Write code to subset the rows where ceoann == 'CEO' and save this subset to a dataframe called "ceos"
# 2. After keeping only CEOs, drop duplicates at the gvkey and fyear level to form a firm-year panel
# Type your code here:


In [None]:
%load 3.py
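
A minimal sketch of the filter-then-deduplicate pattern for this task, using a hypothetical `toy_exec` table rather than the real data (and not the contents of 3.py):

```python
import pandas as pd

# Hypothetical executive rows: firm '001' has two rows flagged as CEO in 2015
toy_exec = pd.DataFrame({
    'gvkey':  ['001', '001', '001', '002'],
    'fyear':  [2015, 2015, 2015, 2015],
    'tdc1':   [900.0, 5000.0, 5100.0, 3000.0],
    'ceoann': [None, 'CEO', 'CEO', 'CEO'],
})

# Keep only the rows flagged as CEO (missing flags compare as False)
ceos = toy_exec[toy_exec['ceoann'] == 'CEO']

# Keep the first row per firm-year so the result is a firm-year panel
ceos = ceos.drop_duplicates(['gvkey', 'fyear'])
print(ceos)
```

`drop_duplicates` keeps the first occurrence by default, so which CEO row survives depends on the row order of the input.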

In [None]:
# We now merge CEO characteristics back to df
df = df.merge(ceos,how='left') # this time, the merging keys are gvkey and fyear

Next, we merge director information from ISS (also called RiskMetrics)

https://wrds-web.wharton.upenn.edu/wrds/ds/riskmetrics/rmdirectors/index.cfm?navId=245

In [None]:
# Each row represents an individual director for a company (cusip) for each year (need to derive from meetingdate)
directors = pd.read_parquet('https://mfin6210.s3.amazonaws.com/directors.pq',
                            columns=['cusip','meetingdate','director_detail_id','classification'])

In [None]:
# Your fourth task:
# 1. Write code to convert meetingdate to fiscal year, if the month < 7, then it is the calendar year - 1
# if the month >= 7, it is the calendar year
# HINT: first convert meetingdate to pandas' datetime format (google it if you need to find out how to do this)
# Type your code here:


In [None]:
%load 4.py
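
The July cut-off rule described in this task can be sketched as follows. The `toy_dir` dates are made up, and this is one possible approach rather than the contents of 4.py.

```python
import numpy as np
import pandas as pd

# Hypothetical meeting dates stored as strings
toy_dir = pd.DataFrame({'meetingdate': ['2015-05-20', '2015-07-01', '2016-11-15']})

# Convert to pandas datetime so the .dt accessor is available
toy_dir['meetingdate'] = pd.to_datetime(toy_dir['meetingdate'])

year = toy_dir['meetingdate'].dt.year
month = toy_dir['meetingdate'].dt.month

# Months before July belong to the previous fiscal year
toy_dir['fyear'] = np.where(month < 7, year - 1, year)
print(toy_dir)
```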

In [None]:
# This data comes from proxy statements. Sometimes a company will switch its reporting schedule,
# so we may have duplicated reporting in some years, but these cases are rare.
# For our purposes, we need to make sure each director appears only once per year.
# Therefore, we drop duplicated directors in each firm-year
# (this is a crude way of dealing with duplicates, but since the impact is small, we will just force-drop them)
directors = directors.drop_duplicates(['cusip','fyear','director_detail_id'])

In [None]:
# Create an indicator that equals 1 if the director is an independent director and 0 otherwise
directors['independence'] = directors['classification'].str.contains('I')*1
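
# A small check of how the substring match behaves, on made-up labels.
# The label 'NI' is hypothetical (it may not occur in the real data) and is only
# there to show that str.contains('I') matches any label containing the letter I.
import pandas as pd

```python
import pandas as pd

# Hypothetical classification labels, including a missing value
classification = pd.Series(['I', 'E', 'NI', None])

# Substring match: any label containing 'I' counts; na=False maps missing labels to False
contains_i = classification.str.contains('I', na=False).astype(int)

# Exact match: only the label 'I' counts
exact_i = (classification == 'I').astype(int)

print(contains_i.tolist(), exact_i.tolist())  # [1, 0, 1, 0] [1, 0, 0, 0]
```

If the real labels never contain the letter I outside the independent category, the two approaches give the same indicator; `directors['classification'].value_counts()` is a quick way to check.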

Here, we calculate two measures:
1. Board size
2. the fraction of independent directors

In [None]:
# Your fifth task:
# 1. Write code to count the number of unique directors in each firm-year as the measure of board size
# (once again, google this if you are unsure of how to count the number of unique directors)
# 2. Write code to calculate the fraction of independent directors
# Type your code here:


In [None]:
%load 5.py
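
A sketch of the two board measures on a hypothetical `toy_board` table (not the contents of 5.py). The names `n_independent` and `frac` are illustrative only; the last step mirrors the fraction calculation in the next cell.

```python
import pandas as pd

# Hypothetical director rows: firm A has 3 directors, firm B has 2
toy_board = pd.DataFrame({
    'cusip': ['A', 'A', 'A', 'B', 'B'],
    'fyear': [2015] * 5,
    'director_detail_id': [1, 2, 3, 4, 5],
    'independence': [1, 1, 0, 0, 1],
})

grouped = toy_board.groupby(['cusip', 'fyear'])

# Board size: number of unique directors in each firm-year
boardsize = grouped['director_detail_id'].nunique().rename('boardsize')

# Number of independent directors: sum of the 0/1 indicator
n_independent = grouped['independence'].sum()

# Fraction of independent directors; the division aligns on (cusip, fyear)
frac = (n_independent / boardsize).rename('independence')
print(pd.concat([boardsize, frac], axis=1).reset_index())
```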

In [None]:
independence = (independence / boardsize).rename('independence') # calculate the fraction

In [None]:
board = pd.concat([boardsize,independence],axis=1).reset_index()

In [None]:
df = df.merge(board,how='left')

Next, we will merge the analyst report dataset, which contains analysts' EPS forecasts for each company. We will create a few measures from it:

https://wrds-web.wharton.upenn.edu/wrds/ds/ibes/det/index.cfm?navId=223

1. Analyst coverage: the number of analysts forecasting the company's EPS
2. Analyst forecast volatility: the standard deviation of analysts' forecasts, a measure of the firm's information opacity
3. Analyst forecast level: the median EPS forecast across all analysts for that firm-year

In [None]:
# Again, to make the data manageable,
# I will only keep the firm id (cusip), forecast period end date (fpedats), forecast value (value) and analyst code (analys)
# The whole data file is very big, so please only select the columns that you need
analyst = pd.read_parquet('https://mfin6210.s3.amazonaws.com/analyst_eps.pq',
                          columns=['cusip','fpedats','value','analys'])

In [None]:
# Your sixth task:
# 1. Write code to calculate coverage, forecast volatility and forecast level, create three dataframes:
# coverage, analyst_volatility, analyst_median
# HINT: group by cusip and fpedats (google how to calculate any of these values that you don't know how to do)
# Type your code here:


In [None]:
%load 6.py
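
The three analyst measures can be sketched with one groupby on made-up forecasts. This is not the contents of 6.py: the `toy_analyst` values are hypothetical, and the series names given by `rename` are assumptions.

```python
import pandas as pd

# Hypothetical EPS forecasts for two firms and one forecast period
toy_analyst = pd.DataFrame({
    'cusip':   ['12345678'] * 3 + ['87654321'] * 2,
    'fpedats': pd.to_datetime(['2015-12-31'] * 5),
    'value':   [1.0, 1.2, 1.4, 0.5, 0.7],
    'analys':  [101, 102, 103, 101, 104],
})

grouped = toy_analyst.groupby(['cusip', 'fpedats'])

# Coverage: number of distinct analysts following the firm in that period
coverage = grouped['analys'].nunique().rename('coverage')

# Forecast volatility: sample standard deviation of the forecasts
analyst_volatility = grouped['value'].std().rename('analyst_volatility')

# Forecast level: median forecast
analyst_median = grouped['value'].median().rename('analyst_median')

print(pd.concat([coverage, analyst_median, analyst_volatility], axis=1).reset_index())
```

A firm-period with a single forecast has an undefined sample standard deviation, so its volatility comes out as NaN.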

In [None]:
# The forecast is made for every reporting period

In [None]:
analyst = pd.concat([coverage,analyst_median,analyst_volatility],axis=1).reset_index()

In [None]:
# Your seventh task:
# To merge the analyst data back into df, we need to do two things,
# because the cusip in the analyst dataset only has the first 8 digits.
# Write code to:
# 1. Truncate df's cusip from 9 digits to the first 8 digits
# 2. Rename analyst's fpedats to datadate, so that pandas can merge on matching column names
# Type your code here:


In [None]:
%load 7.py
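
The key alignment for this task can be sketched on two tiny made-up frames (hypothetical `toy_df` and `toy_analyst`, not the contents of 7.py):

```python
import pandas as pd

# Hypothetical base data with 9-character cusips
toy_df = pd.DataFrame({
    'cusip': ['123456789', '876543210'],
    'datadate': pd.to_datetime(['2015-12-31', '2015-12-31']),
})

# Hypothetical analyst summary with 8-character cusips and fpedats as the date key
toy_analyst = pd.DataFrame({
    'cusip': ['12345678'],
    'fpedats': pd.to_datetime(['2015-12-31']),
    'coverage': [3],
})

# Keep the first 8 characters of the cusip string
toy_df['cusip'] = toy_df['cusip'].str[:8]

# Give the date key the same name in both frames
toy_analyst = toy_analyst.rename(columns={'fpedats': 'datadate'})

# With matching names, merge uses cusip and datadate as the keys
merged = toy_df.merge(toy_analyst, how='left')
print(merged)
```

The slice `.str[:8]` assumes cusip is stored as a string, and the merge only matches when both date columns have the same dtype.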

In [None]:
df = df.merge(analyst,how='left')

Next, we want institutional stock ownership information. WRDS already provides a nice dataset for this.

https://wrds-web.wharton.upenn.edu/wrds/ds/tfn/types/s34summary/index.cfm?navId=340

In [None]:
# read the data
io = pd.read_parquet('https://mfin6210.s3.amazonaws.com/institional_ownership.pq',
                     columns=['rdate','cusip','NumInstBlockOwners','InstOwn_HHI','InstOwn_Perc'])

In [None]:
# Your eighth task:
# Generate fyear from the rdate variable (reporting date). For simplicity, just extract the calendar year as fyear
# Calculate the average number of block holders, HHI, and ownership percentage for each firm (cusip) and fyear
# Type your code here:


In [None]:
%load 8.py
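
A sketch of the year extraction and firm-year averaging for this task, on hypothetical quarterly rows (not the contents of 8.py; `toy_io` and `toy_io_avg` are illustrative names):

```python
import pandas as pd

# Hypothetical quarterly ownership rows for one firm
toy_io = pd.DataFrame({
    'rdate': pd.to_datetime(['2015-03-31', '2015-06-30', '2016-03-31']),
    'cusip': ['12345678'] * 3,
    'NumInstBlockOwners': [2, 4, 3],
    'InstOwn_HHI': [0.10, 0.20, 0.15],
    'InstOwn_Perc': [0.60, 0.70, 0.65],
})

# Use the calendar year of the reporting date as fyear
toy_io['fyear'] = toy_io['rdate'].dt.year

# Average the three measures within each firm-year
measures = ['NumInstBlockOwners', 'InstOwn_HHI', 'InstOwn_Perc']
toy_io_avg = toy_io.groupby(['cusip', 'fyear'])[measures].mean().reset_index()
print(toy_io_avg)
```

Selecting only the three measures before `.mean()` leaves rdate out of the result, so the only columns shared with the base data are the intended merge keys.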

In [None]:
df = df.merge(io,how='left')

Finally, we will merge the number of patents granted for each firm-year as a measure of their innovation

I have already cleaned the data for us.

In [None]:
pat_count = pd.read_parquet('https://mfin6210.s3.amazonaws.com/patents.pq')

In [None]:
df = df.merge(pat_count,how='left')

In [None]:
# Checkpoint
if len(list(df))==22:
    print('Congratulations! You have successfully completed the exercise!')
else:
    print('Sorry, you did not complete the exercise')