Run the cell below:

In [None]:
import pandas as pd
import numpy as np
import yfinance as yf
import pandas_datareader as pdr

# Problem 1 
### (10 x 5 = 50 points)

In this problem, you will analyze how the correlation between future corporate investment and current operating cash flows differs between dividend payers and non-payers.

1. Load the Compustat data (the "compa.zip" file from the "data" folder) into a dataframe called ``rawcomp``. Print out the column names of ``rawcomp``.
2. Create a new dataframe called ``comp`` that contains the following variables from ``rawcomp``: 'permno','datadate','ppent' (net PP\&E),'oancf' (operating cash flows),'at' (total assets),'dvc' (cash dividends). Then drop all rows that have a missing value in any of these variables. Then keep only the rows where both total assets and net PP\&E are strictly larger than 0. Display the first two rows of ``comp``. All future instructions refer to the ``comp`` dataframe.
3. Create a new column in ``comp`` called ``year`` that extracts just the year from ``datadate``. Then sort ``comp`` by ``permno`` and ``datadate``. Then drop duplicates with respect to ``permno`` and ``year``, keeping only the last observation when a duplicate is found. Print the first two rows of this new version of ``comp``.
4. For each firm, each year, calculate the percentage change in ``ppent``. Call this new variable ``invest``. Then create a new variable called ``future_invest`` which gives us the value of ``invest`` from the following year, for each firm (assume there are no "gaps" in the data). Finally create a new variable called ``cflow`` which equals operating cash flows divided by total assets. Print a table that gives us the mean, median and standard deviation (only these 3 stats) for the ``invest``, ``future_invest``, and ``cflow`` variables.
5. Winsorize ``invest``, ``future_invest``, and ``cflow`` at the 1st and 99th percentiles. Call these winsorized variables ``w_invest``, ``w_future_invest``, and ``w_cflow`` respectively. Print a table that gives us the mean, median and standard deviation (only these 3 stats) for the ``w_invest``, ``w_future_invest``, and ``w_cflow`` variables. (Note how they differ from the non-winsorized statistics.)
6. Create a new variable called ``div_payer`` which equals 1 for observations (rows) where the cash dividend (``dvc``) is strictly larger than 0, and 0 otherwise. Print out how many times ``div_payer`` equals 1 and how many times it equals 0.
7. Every year, calculate the correlation between ``w_future_invest`` and ``w_cflow`` using only the observations from that year. Then plot this time-series of correlations over time.
8. Now do the same thing separately for dividend payers vs non-payers: i.e. every year, you should have two correlations between ``w_future_invest`` and ``w_cflow``: one which uses the data for dividend payers that year, and one which uses the data for non-payers that year. Store these correlations in a dataframe called ``ann_corrs_div`` and print its first 4 rows.
9. Reshape the ``ann_corrs_div`` dataframe to a wide format, where the correlations for dividend payers show up in a separate column from the correlations for non-payers. Call this wide dataframe ``ann_corrs_wide`` and print its first 2 rows.
10. Plot the time-series of correlations for dividend payers and the one for non-payers as two lines in the same plot.
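
The mechanics behind several of the steps above (within-firm leads, winsorizing, group-wise correlations, long-to-wide reshaping) can be sketched on a small synthetic panel. This is only an illustration under stated assumptions: the dataframe ``toy``, its columns ``x``, ``y`` and ``group``, and the helper ``winsorize`` are hypothetical names that do not come from the Compustat file, and clipping at the sample quantiles is one common way to winsorize, not necessarily the one intended for the assignment.

```python
import numpy as np
import pandas as pd


def winsorize(s: pd.Series, lower: float = 0.01, upper: float = 0.99) -> pd.Series:
    """Clip a series at its own lower/upper quantiles (hypothetical helper)."""
    lo, hi = s.quantile([lower, upper])
    return s.clip(lower=lo, upper=hi)


# Synthetic firm-year panel: 50 made-up firms, 4 years each.
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    [(f, y) for f in range(50) for y in range(2000, 2004)],
    columns=["permno", "year"],
)
toy["x"] = rng.normal(size=len(toy))
toy["y"] = 0.5 * toy["x"] + rng.normal(size=len(toy))
toy["group"] = (toy["permno"] % 2 == 0).astype(int)  # stand-in for a 0/1 flag

# Next year's value of x within each firm (assumes no gaps in the years).
toy = toy.sort_values(["permno", "year"])
toy["future_x"] = toy.groupby("permno")["x"].shift(-1)

# Winsorize at the 1st and 99th percentiles.
toy["w_x"] = winsorize(toy["x"])
toy["w_y"] = winsorize(toy["y"])

# One correlation per (year, group) cell, stored in long format.
corrs = (
    toy.groupby(["year", "group"])[["w_x", "w_y"]]
    .apply(lambda g: g["w_x"].corr(g["w_y"]))
    .rename("corr")
    .reset_index()
)

# Long to wide: one row per year, one column per group.
wide = corrs.pivot(index="year", columns="group", values="corr")
print(wide)
```

Calling ``wide.plot()`` on a frame shaped like this would draw one line per column, which is the general shape of the last step.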

In [None]:
# 1


In [None]:
# 2


In [None]:
# 3


In [None]:
# 4


In [None]:
# 5


In [None]:
# 6


In [None]:
# 7


In [None]:
# 8


In [None]:
# 9


In [None]:
# 10


# Problem 2
### (10 x 5 = 50 points)

In this problem, you will calculate value-weighted industry returns using all publicly traded firms from 1980 to 2020.

1. Load the monthly CRSP data (the ``crspm.zip`` file from the ``data`` folder) into a new dataframe called ``rawcrsp``. Print the names of the columns of ``rawcrsp``.
2. Create a new variable in ``rawcrsp`` called ``mktcap`` (market capitalization) which equals the absolute value of price (``prc``) times the number of shares (``shrout``). Display the first 2 rows of ``rawcrsp``.
3. Create a new dataframe called ``crsp`` which contains the following variables from ``rawcrsp``: 'permno','date','ret','mktcap', 'siccd'. Then drop all rows in ``crsp`` which contain any missing values in any of its columns. Display the first 2 rows of ``crsp``.
4. Create a new column in ``crsp`` called ``mdate`` which equals the ``date`` variable converted to a monthly "period" type. Print a table which contains only the minimum and maximum values of ``mdate``.
5. Create a new column in ``crsp`` called ``sector`` which equals the first digit of the SIC code (``siccd``). Print a table which shows how many observations we have for each value of ``sector``.
6. Create a new column in ``crsp`` called ``weights`` which, for every firm, every month, equals that firm's market capitalization divided by the total market capitalization of all firms in the same sector that month (think of the weights in a portfolio, where the portfolio is the sector that the firm operates in that month). Use ``.describe()`` to print out the summary statistics of the ``weights`` variable.
7. Create a new variable in ``crsp`` called ``vwret`` which equals the firm's return times its sector weight (given by ``weights``). Then, each month, for each sector, add up these ``vwret`` values of the firms in that sector, that month. This will give you the value-weighted return of each sector, each month. Store these returns in a new dataframe called ``ind_vwret``. Display the first 20 rows of ``ind_vwret``.
8. Create a new dataframe called ``ind_vwret_wide`` which is a reshaped version of ``ind_vwret``, where the returns of each sector show up in a separate column. Display the first two rows of ``ind_vwret_wide``.
9. Plot the returns for sectors 3 (manufacturing) and 6 (finance), in two subplots of the same plot. 
10. For sectors 3 and 6, compound their returns over time, and plot these two time-series of compounded returns in the same plot.
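
The weighting and compounding logic in the steps above can be sketched on a tiny synthetic dataset. This is a hedged illustration, not the solution on the CRSP file: the dataframe ``toy`` and its six made-up firms are hypothetical, and the sketch assumes ``siccd`` is stored as an integer of at most four digits, so that integer division by 1000 returns the first digit (and 0 for codes below 1000).

```python
import numpy as np
import pandas as pd

# Synthetic monthly panel: 6 made-up firms, 3 month-end dates.
rng = np.random.default_rng(1)
dates = pd.to_datetime(["2020-01-31", "2020-02-28", "2020-03-31"])
toy = pd.DataFrame(
    [(p, d) for p in range(6) for d in dates], columns=["permno", "date"]
)
sic = {0: 3571, 1: 3674, 2: 3711, 3: 6021, 4: 6211, 5: 6311}  # made-up mapping
toy["siccd"] = toy["permno"].map(sic)
toy["ret"] = rng.normal(0.01, 0.05, size=len(toy))
toy["mktcap"] = rng.uniform(100, 1000, size=len(toy))

# Monthly period and first digit of the SIC code.
toy["mdate"] = toy["date"].dt.to_period("M")
toy["sector"] = toy["siccd"] // 1000

# Weight = firm market cap / total market cap of its sector that month.
# transform("sum") returns one value per row, aligned with the original index.
sector_cap = toy.groupby(["mdate", "sector"])["mktcap"].transform("sum")
toy["weights"] = toy["mktcap"] / sector_cap

# Value-weighted sector return = sum of weight * return within (month, sector).
toy["vwret"] = toy["ret"] * toy["weights"]
ind = toy.groupby(["mdate", "sector"])["vwret"].sum().reset_index()

# Long to wide, then compound each column: prod(1 + r) - 1.
wide = ind.pivot(index="mdate", columns="sector", values="vwret")
cum = (1 + wide).cumprod() - 1
print(cum)
```

A quick sanity check on any such construction is that the weights sum to 1 within every sector-month cell.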

In [None]:
# 1


In [None]:
# 2


In [None]:
# 3


In [None]:
# 4


In [None]:
# 5


In [None]:
# 6


In [None]:
# 7


In [None]:
# 8


In [None]:
# 9


In [None]:
# 10
