Run the cell below:

In [1]:
import pandas as pd

If you have not already done so, download the "compa.zip", "crspm.zip", and "ff.zip" files from the "Lectures and Data / data" tab in D2L into a "data" folder on your computer. This folder must sit outside the folder containing this file (as explained in ``lecture00``), so that the relative path ``../data/`` used below points to it.

Load the CRSP monthly market data ("crspm.zip") into a new dataframe called ``crsp``.

In [2]:
crsp = pd.read_pickle('../data/crspm.zip')

Print out the number of rows and the number of columns of ``crsp``.

In [3]:
crsp.shape

(2553287, 10)

Print out the data types of all columns in ``crsp``.

In [4]:
crsp.dtypes

permno    float64
permco    float64
date       object
prc       float64
ret       float64
shrout    float64
shrcd     float64
exchcd    float64
siccd     float64
ticker     object
dtype: object

Replace the column names in ``crsp`` with a version of themselves that is all lower case and stripped of any leading and trailing white spaces. Then print these new column names out.

In [5]:
crsp.columns = crsp.columns.str.lower().str.strip()
crsp.columns

Index(['permno', 'permco', 'date', 'prc', 'ret', 'shrout', 'shrcd', 'exchcd',
       'siccd', 'ticker'],
      dtype='object')
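The CRSP column names were already clean, so the cell above changes nothing visible. As a minimal sketch of what the cleanup does, the toy dataframe below uses made-up, deliberately messy column names (they are not from the CRSP file):

```python
import pandas as pd

# Hypothetical dataframe with inconsistent column names (not the CRSP data)
toy = pd.DataFrame({' PERMNO ': [10000], 'Date': ['1986-02-28'], 'RET  ': [0.01]})

# Lowercase every name, then strip leading and trailing whitespace
toy.columns = toy.columns.str.lower().str.strip()

cleaned = list(toy.columns)
print(cleaned)
```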

Convert the following columns to integer (``int64``) type: 'permno','permco','shrcd','exchcd', 'siccd'. Then print all data types again to verify that this was done correctly.

In [6]:
to_convert = ['permno','permco','shrcd','exchcd', 'siccd']
crsp[to_convert] = crsp[to_convert].astype('int64')
crsp.dtypes

permno      int64
permco      int64
date       object
prc       float64
ret       float64
shrout    float64
shrcd       int64
exchcd      int64
siccd       int64
ticker     object
dtype: object
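Note that ``.astype('int64')`` only works when a column has no missing values, which appears to be the case for the five columns converted above. The sketch below uses made-up values to show what happens otherwise, and how the nullable ``'Int64'`` dtype (capital "I") can be used as a fallback:

```python
import numpy as np
import pandas as pd

# Hypothetical float column with one missing value (not from the CRSP file)
s = pd.Series([10.0, 3.0, np.nan])

# Casting to plain int64 fails because NaN has no integer representation
try:
    s.astype('int64')
    cast_failed = False
except ValueError:
    cast_failed = True

# The nullable integer dtype keeps the missing value as <NA>
nullable = s.astype('Int64')

print(cast_failed)
print(nullable)
```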

Convert ``siccd`` and ``ticker`` to ``string`` type. Then print all data types again to verify that this was done correctly.

In [7]:
crsp[['siccd','ticker']] = crsp[['siccd','ticker']].astype('string')
crsp.dtypes

permno      int64
permco      int64
date       object
prc       float64
ret       float64
shrout    float64
shrcd       int64
exchcd      int64
siccd      string
ticker     string
dtype: object

Replace ``crsp`` with a version of itself that does not have any duplicates with respect to ``permno`` and ``date`` (i.e. no two rows with the same ``permno`` and the same ``date``). Then print the number of rows and columns again to see if any rows were dropped.

In [8]:
crsp = crsp.drop_duplicates(['permno','date'], keep='last')
crsp.shape

(2553287, 10)
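The shape is unchanged, so ``crsp`` had no duplicate (``permno``, ``date``) pairs. To see what ``drop_duplicates`` does when duplicates are present, here is a small sketch on a made-up dataframe (the values are hypothetical):

```python
import pandas as pd

# Hypothetical data: the first two rows share the same permno and date
toy = pd.DataFrame({
    'permno': [10000, 10000, 10000, 10001],
    'date':   ['1986-02-28', '1986-02-28', '1986-03-31', '1986-02-28'],
    'ret':    [0.01, 0.02, 0.03, 0.04],
})

# Count rows that repeat an earlier (permno, date) pair
n_dupes = toy.duplicated(['permno', 'date']).sum()

# keep='last' retains the last row within each group of duplicates
deduped = toy.drop_duplicates(['permno', 'date'], keep='last')

print(n_dupes)
print(deduped)
```

With ``keep='last'``, the row with ``ret`` equal to 0.02 survives and the one with 0.01 is dropped.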

Replace ``crsp`` with a version of itself that does not have any missing values for ``ret``. Then print the number of rows and columns again to see if any rows were dropped.

In [9]:
crsp = crsp.loc[crsp['ret'].notnull(), :].copy()  # .copy() avoids a SettingWithCopyWarning when new columns are added later
crsp.shape

(2530665, 10)
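Comparing the two shapes shows how many rows had a missing ``ret``. The sketch below does that subtraction and, on a made-up dataframe, checks that filtering with ``.notnull()`` gives the same result as ``.dropna(subset=['ret'])``, which is one possible alternative:

```python
import numpy as np
import pandas as pd

# Row counts taken from the two shapes printed above
rows_dropped = 2553287 - 2530665

# Hypothetical data with one missing return (not from the CRSP file)
toy = pd.DataFrame({'permno': [1, 1, 2], 'ret': [0.05, np.nan, -0.02]})

# Two equivalent ways of keeping only rows with a non-missing ret
by_mask = toy.loc[toy['ret'].notnull(), :]
by_dropna = toy.dropna(subset=['ret'])

print(rows_dropped)
print(by_mask.equals(by_dropna))
```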

Create a new column called ``sic1d`` which contains only the first digit in ``siccd``. Then print the first 5 rows of ``crsp``.

In [10]:
crsp['sic1d'] = crsp['siccd'].str[0]
crsp.head()

   permno  permco        date        prc        ret  shrout  shrcd  exchcd  siccd  ticker  sic1d
1   10000    7952  1986-02-28      -3.25  -0.257143  3680.0     10       3   3990   OMFGA      3
2   10000    7952  1986-03-31    -4.4375   0.365385  3680.0     10       3   3990   OMFGA      3
3   10000    7952  1986-04-30       -4.0  -0.098592  3793.0     10       3   3990   OMFGA      3
4   10000    7952  1986-05-30  -3.109375  -0.222656  3793.0     10       3   3990   OMFGA      3
5   10000    7952  1986-06-30   -3.09375  -0.005025  3793.0     10       3   3990   OMFGA      3


Print out how many times each value of ``sic1d`` appears in ``crsp``.

In [11]:
crsp['sic1d'].value_counts()

3    643974
6    482560
2    327363
7    271156
5    250972
4    193575
1    151512
8    113159
9     64948
0     31446
Name: sic1d, dtype: Int64
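One thing to keep in mind: because ``siccd`` was an integer before it became a string, a code with fewer than four digits has no leading zero, and ``.str[0]`` then returns its first non-zero digit. The sketch below uses made-up codes to illustrate this. Padding with ``.str.zfill(4)`` is shown only as an assumption about how such codes might be handled, not as part of the exercise:

```python
import pandas as pd

# Hypothetical SIC codes stored as integers, then converted to strings
codes = pd.Series([3990, 6020, 3571, 100]).astype('string')

# First character of each code, as in the cell above
first_digit = codes.str[0]

# Assumption: pad to four digits first, so '100' is read as '0100'
padded_first = codes.str.zfill(4).str[0]

# Frequency of each first digit
counts = first_digit.value_counts()

print(first_digit.tolist())
print(padded_first.tolist())
print(counts)
```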

Create a new column called ``mktcap`` (market capitalization) which equals ``shrout`` (shares outstanding) times the absolute value of ``prc`` (use the Pandas ``.abs()`` method to calculate the absolute value). Then print the first 5 rows of ``crsp``.

In [12]:
crsp['mktcap'] = crsp['prc'].abs() * crsp['shrout']
crsp.head()

   permno  permco        date        prc        ret  shrout  shrcd  exchcd  siccd  ticker  sic1d        mktcap
1   10000    7952  1986-02-28      -3.25  -0.257143  3680.0     10       3   3990   OMFGA      3       11960.0
2   10000    7952  1986-03-31    -4.4375   0.365385  3680.0     10       3   3990   OMFGA      3       16330.0
3   10000    7952  1986-04-30       -4.0  -0.098592  3793.0     10       3   3990   OMFGA      3       15172.0
4   10000    7952  1986-05-30  -3.109375  -0.222656  3793.0     10       3   3990   OMFGA      3  11793.859375
5   10000    7952  1986-06-30   -3.09375  -0.005025  3793.0     10       3   3990   OMFGA      3   11734.59375
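The absolute value matters because CRSP reports a negative ``prc`` when the price is a bid/ask average rather than an actual closing trade; the sign is a flag, not a negative price. As a quick check, the sketch below recomputes ``mktcap`` for the first two rows shown above and contrasts it with the result obtained without ``.abs()``. If ``shrout`` is recorded in thousands of shares, as is the usual CRSP convention, then ``mktcap`` is in thousands of dollars.

```python
import pandas as pd

# prc and shrout copied from the first two rows printed above
toy = pd.DataFrame({'prc': [-3.25, -4.4375], 'shrout': [3680.0, 3680.0]})

# Market capitalization using the absolute value of the price
toy['mktcap'] = toy['prc'].abs() * toy['shrout']

# Without .abs(), the flag sign carries over and the result is negative
without_abs = toy['prc'] * toy['shrout']

print(toy)
print(without_abs.tolist())
```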


Use ``.describe()`` to summarize the ``ret`` and ``mktcap`` variables.

In [13]:
crsp[['ret','mktcap']].describe()

               ret        mktcap
count    2530665.0     2530665.0
mean    0.01181741     2177878.0
std      0.1915711    15265640.0
min        -0.9936           0.0
25%    -0.06868871       25987.5
50%            0.0      114170.0
75%     0.07326314      623703.8
max           24.0  2255969000.0
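In the summary above, the mean of ``mktcap`` is far larger than its median (the 50% row), which suggests a strongly right-skewed distribution. For variables like this, it can help to look further into the tails. The sketch below uses made-up numbers to show the ``percentiles`` argument of ``.describe()``, which is one way to do so:

```python
import pandas as pd

# Hypothetical right-skewed values (not from the CRSP file)
s = pd.Series([1.0, 2.0, 3.0, 4.0, 1000.0])

# Request the 1st and 99th percentiles; the median (50%) is always included
summary = s.describe(percentiles=[0.01, 0.99])

print(summary)
```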
