# 자계추 hw1: Create dataset

- Use `compustat_permno` and `CRSP_M`
    - Follow the SAS code and port it to Python.
- Build the final result, `Assignment1_data`
    - Sort the final result by permno/date and show the first 25 obs.
    - For the month of December in 1970, 1980, 1990, 2000, and 2010, report:
        - number of distinct permnos
        - mean/std/min/max of the monthly delisting-adjusted excess returns
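
The December report described above can be sketched as follows. This is only a minimal sketch: the column names `permno`, `date`, and `exret` are assumptions about what `Assignment1_data` will contain, and `december_report` is a hypothetical helper, not part of the SAS code.

```python
import pandas as pd

def december_report(df, years=(1970, 1980, 1990, 2000, 2010), ret_col="exret"):
    """Count distinct permnos and summarize excess returns for each December."""
    dates = pd.to_datetime(df["date"])
    mask = (dates.dt.month == 12) & dates.dt.year.isin(list(years))
    dec = df.loc[mask].assign(year=dates.dt.year[mask])
    return dec.groupby("year").agg(
        n_permno=("permno", "nunique"),
        mean=(ret_col, "mean"),
        std=(ret_col, "std"),
        min=(ret_col, "min"),
        max=(ret_col, "max"),
    )

# Toy data (made up) just to exercise the helper.
toy = pd.DataFrame({
    "permno": [1, 2, 1, 2, 1],
    "date": ["1970-12-31", "1970-12-31", "1971-01-29", "1980-12-31", "1980-12-31"],
    "exret": [0.01, 0.03, 0.05, -0.02, 0.04],
})
report = december_report(toy)
print(report)
```

Years with no December observations simply do not appear in the result, so a missing row is itself a signal worth checking against the SAS log.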


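## Guide
- Check the SAS log and confirm that each intermediate step produces the same result.
    - shape check
- The sample data is the answer key. Confirm that the final output matches it.
- Also refer to the code that ports the SAS to Python.
    - For things like summary statistics, use your own code if you already have some.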
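
One way to check the final output against the sample data is to merge the two on the key columns and count mismatches per column. The sketch below assumes both tables share the keys `permno` and `date` and that every non-key column is numeric and present in both; `compare_to_sample` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

def compare_to_sample(result, sample, keys=("permno", "date"), atol=1e-6):
    """Count, per column, the sample rows that `result` fails to reproduce."""
    keys = list(keys)
    value_cols = [c for c in sample.columns if c not in keys]
    merged = sample.merge(result, on=keys, how="left", suffixes=("_sample", "_mine"))
    counts = {}
    for col in value_cols:
        same = np.isclose(merged[f"{col}_sample"], merged[f"{col}_mine"],
                          atol=atol, equal_nan=True)
        counts[col] = int((~same).sum())
    return pd.Series(counts, name="n_mismatch")

# Toy data (made up): the second row differs.
sample = pd.DataFrame({"permno": [1, 1], "date": [1, 2], "exret": [0.10, 0.20]})
result = pd.DataFrame({"permno": [1, 1], "date": [1, 2], "exret": [0.10, 0.25]})
mismatches = compare_to_sample(result, sample)
print(mismatches)
```

Because the merge is a left join from the sample, a sample row that is missing from the result shows up as a mismatch rather than being silently dropped.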

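## Questions I asked

- long table vs wide table
    - Why use a long table, which invites all sorts of problems, instead of a wide one? If each permno can be kept unique (one column per permno), pivoting on it makes later steps such as ffill much easier.
    - The professor also mentioned going wide and then using shift. Missing dates cannot cause a mix-up; the slot is simply filled with NaN.
    - Drawbacks the professor pointed out:
        - It is inefficient from an RDBMS perspective.
        - Too many tables get created. Take that inefficiency into account as well.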
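
The missing-date issue can be illustrated with a small made-up example. In the sketch below, permno 1 has no March observation; the column name `me` and the data are assumptions for illustration only.

```python
import pandas as pd

long_df = pd.DataFrame({
    "permno": [1, 1, 1, 2, 2, 2],
    "date": pd.to_datetime(["2000-01-31", "2000-02-29", "2000-04-30",
                            "2000-01-31", "2000-02-29", "2000-03-31"]),
    "me": [10.0, 11.0, 13.0, 20.0, 21.0, 22.0],
})

# Long table: shift within permno takes the previous row, whatever its date.
# For permno 1 in April this is the February value, two months stale.
long_df["me_lag_long"] = long_df.groupby("permno")["me"].shift(1)

# Wide table: one column per permno, one row per date.
# The missing March slot for permno 1 is NaN, so the April lag is NaN too.
wide = long_df.pivot(index="date", columns="permno", values="me")
wide_lag = wide.shift(1)

print(long_df)
print(wide_lag)
```

Note that `pivot` requires each (date, permno) pair to be unique, and the wide approach only protects against gaps when the date index contains every month; if a month is absent for all permnos, the index would need to be reindexed first.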

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from pathlib import Path

In [2]:
CWD = Path('.').resolve()
DATA_DIR = CWD / 'data'

In [3]:
CRSP_M_df = pd.read_csv(DATA_DIR / 'CRSP_M.csv')
permno_df = pd.read_csv(DATA_DIR / 'compustat_permno.csv') 
sample_df = pd.read_csv(DATA_DIR / 'assignment1_sample_data.csv')

In [4]:
CRSP_M_df.columns

Index(['DATE', 'DLSTCD', 'PERMNO', 'SHRCD', 'EXCHCD', 'SICCD', 'DLRET',
       'PERMCO', 'PRC', 'VOL', 'RET', 'SHROUT', 'ALTPRC', 'rf'],
      dtype='object')

In [5]:
CRSP_M_df.shape

(2921193, 14)

In [6]:
CRSP_M_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2921193 entries, 0 to 2921192
Data columns (total 14 columns):
 #   Column  Dtype  
---  ------  -----  
 0   DATE    int64  
 1   DLSTCD  float64
 2   PERMNO  int64  
 3   SHRCD   int64  
 4   EXCHCD  int64  
 5   SICCD   float64
 6   DLRET   float64
 7   PERMCO  int64  
 8   PRC     float64
 9   VOL     float64
 10  RET     float64
 11  SHROUT  float64
 12  ALTPRC  float64
 13  rf      float64
dtypes: float64(9), int64(5)
memory usage: 312.0 MB
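
`DATE` is loaded as `int64` in `YYYYMMDD` form (for example `19610131`). If a real datetime column is wanted for month and year filtering, a conversion along these lines may help; the new column name `date` is an assumption.

```python
import pandas as pd

# Toy frame with the same integer date layout as CRSP_M_df['DATE'].
toy = pd.DataFrame({"DATE": [19610131, 20121231]})

# Convert via string so the explicit format is applied digit by digit.
toy["date"] = pd.to_datetime(toy["DATE"].astype(str), format="%Y%m%d")
print(toy.dtypes)
```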


## SAS 3

Construct BE data

### filters

In [7]:
CRSP_M_df['EXCHCD'].unique() # the filters have already been applied

array([ 1,  2,  3, 33, 32, 31], dtype=int64)

The filters are still implemented separately below.

In [8]:
# filters

filter_common_stocks = [10, 11] # SHRCD
filter_exchange = [ # EXCHCD
    1, 31, # NYSE
    2, 32, # AMEX
    3, 33, # NASDAQ
]

plots

In [9]:
# TODO: build the stock exchange composition with groupby; label NYSE, AMEX, NASDAQ, Other in a separate column
# TODO: plot it twice, once by number of stocks and once by market cap
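
A possible starting point for the TODO above is sketched here. The exchange mapping follows the `filter_exchange` comments; market cap is assumed to be `|PRC| * SHROUT` (CRSP reports a negative price when it is a bid/ask average), and `exchange_composition` is a hypothetical helper name.

```python
import pandas as pd

EXCHANGE_NAMES = {1: "NYSE", 31: "NYSE", 2: "AMEX", 32: "AMEX",
                  3: "NASDAQ", 33: "NASDAQ"}

def exchange_composition(df):
    """Return (number of stocks, market cap) tables indexed by DATE, one column per exchange."""
    out = df.assign(
        exchange=df["EXCHCD"].map(EXCHANGE_NAMES).fillna("Other"),
        me=df["PRC"].abs() * df["SHROUT"],  # assumed market-cap definition
    )
    grouped = out.groupby(["DATE", "exchange"])
    n_stocks = grouped["PERMNO"].nunique().unstack(fill_value=0)
    market_cap = grouped["me"].sum().unstack(fill_value=0.0)
    return n_stocks, market_cap

# Toy data (made up) with one unmapped exchange code.
toy = pd.DataFrame({
    "DATE": [19610131, 19610131, 19610131, 19610228],
    "PERMNO": [1, 2, 3, 1],
    "EXCHCD": [1, 3, 4, 31],
    "PRC": [10.0, -5.0, 2.0, 11.0],
    "SHROUT": [100.0, 200.0, 50.0, 100.0],
})
n_stocks, market_cap = exchange_composition(toy)
print(n_stocks)
print(market_cap)
```

Each returned table has one row per date and one column per exchange, so something like `n_stocks.plot.area()` and `market_cap.plot.area()` should give the two plots.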

In [10]:
# apply filters

CRSP_M_df = CRSP_M_df[ CRSP_M_df['SHRCD'].isin(filter_common_stocks) ]
CRSP_M_df = CRSP_M_df[ CRSP_M_df['EXCHCD'].isin(filter_exchange) ]

In [11]:
CRSP_M_df.shape

(2921193, 14)

### delisting returns

In [12]:
CRSP_M_df

Unnamed: 0,DATE,DLSTCD,PERMNO,SHRCD,EXCHCD,SICCD,DLRET,PERMCO,PRC,VOL,RET,SHROUT,ALTPRC,rf
0,19610131,,10006,10,1,3740.0,,22156,50.25,939.0,0.322368,1420.0,50.2500,0.0019
1,19610131,,10014,10,1,3710.0,,22157,4.00,395.0,0.000000,2504.0,4.0000,0.0019
2,19610131,,10030,10,1,3310.0,,22160,41.75,280.0,0.087948,1627.0,41.7500,0.0019
3,19610131,,10057,11,1,3540.0,,20020,54.00,152.0,0.142857,500.0,54.0000,0.0019
4,19610131,,10102,10,1,2810.0,,22164,79.50,480.0,0.032468,3965.0,79.5000,0.0019
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2921188,20121231,574.0,76999,11,3,7372.0,-0.765517,11056,,123365.0,,6855.0,0.3120,0.0001
2921189,20121231,580.0,93007,11,3,9999.0,-0.774834,53201,,121619.0,,57097.0,0.6307,0.0001
2921190,20121231,584.0,38790,11,2,1311.0,-0.762470,1933,,21350.0,,19048.0,0.3321,0.0001
2921191,20121231,584.0,89761,11,2,3714.0,2.520000,44123,,39636.0,,7107.0,0.3700,0.0001


In [None]:
def process_delisting_returns(row):
    DLRET = row['DLRET']
    DLSTCD = row['DLSTCD']

    loss30_codes = [500, 520] + list(range(551, 574)) + [574, 580, 584] # -30% for these codes; -100% for other values
    # TODO: stopped partway through; continue from here.
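
One way the unfinished step could be completed is with a vectorized fill rather than a row-wise function, which tends to be much faster on about 2.9 million rows. The sketch below assumes the imputation applies only where `DLRET` is missing and `DLSTCD` is present; the SAS code may impose further conditions, so this should be checked against it. `fill_delisting_returns` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Codes that get -30%, as listed in the notebook; other codes get -100%.
LOSS30_CODES = [500, 520, *range(551, 574), 574, 580, 584]

def fill_delisting_returns(df):
    """Impute DLRET where it is missing but a delisting code exists (assumed rule)."""
    needs_fill = df["DLRET"].isna() & df["DLSTCD"].notna()
    imputed = pd.Series(
        np.where(df["DLSTCD"].isin(LOSS30_CODES), -0.30, -1.00),
        index=df.index,
    )
    # Keep existing DLRET values; use the imputed value only where needed.
    return df.assign(DLRET=df["DLRET"].where(~needs_fill, imputed))

# Toy data (made up): no delisting, -30% code, other code, reported DLRET.
toy = pd.DataFrame({
    "DLSTCD": [np.nan, 574.0, 400.0, 584.0],
    "DLRET": [np.nan, np.nan, np.nan, 2.52],
})
filled = fill_delisting_returns(toy)
print(filled)
```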

## SAS 5

Construct ME and return data (delisting adjusted)
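
A common way to build a delisting-adjusted return is to compound `RET` with `DLRET` and then subtract `rf`. The sketch below is an assumption about this step, not a transcription of the SAS code: a missing leg is treated as zero unless both are missing, `rf` is assumed to be a monthly rate in the same units as `RET`, and the output column names `ret_adj` and `exret` are hypothetical.

```python
import numpy as np
import pandas as pd

def delisting_adjusted_excess_return(df):
    """Compound RET and DLRET, then subtract rf (assumed convention)."""
    ret, dlret = df["RET"], df["DLRET"]
    both_missing = ret.isna() & dlret.isna()
    compounded = (1 + ret.fillna(0)) * (1 + dlret.fillna(0)) - 1
    ret_adj = compounded.mask(both_missing)  # stay NaN when nothing is observed
    return df.assign(ret_adj=ret_adj, exret=ret_adj - df["rf"])

# Toy data (made up) covering the four missing/present combinations.
toy = pd.DataFrame({
    "RET": [0.10, np.nan, 0.05, np.nan],
    "DLRET": [np.nan, -0.30, -0.20, np.nan],
    "rf": [0.01, 0.01, 0.00, 0.01],
})
adjusted = delisting_adjusted_excess_return(toy)
print(adjusted)
```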

## SAS 6

Merge BE and ME with return data