## Class01
* Download data from WRDS and read into Python
* Use WRDS Python package

### Download data from WRDS
WRDS provides a user-friendly web interface, which makes downloading data intuitive and straightforward.

Taking a sample of CRSP data as an example, we will download CRSP monthly data from 2010 to 2017 and read the sample data into Python. 

In [1]:
import pandas as pd
import numpy as np

#### Read data

In [2]:
file_path = '/Users/ml/Google Drive/af/teaching/database/data/'
msf_raw = pd.read_csv(file_path+'msf_2010_2017.txt',sep='\t',low_memory=False)

In [3]:
msf_raw.iloc[:10,:10]

Unnamed: 0,PERMNO,date,SHRCD,EXCHCD,SICCD,NCUSIP,COMNAM,PERMCO,HSICCD,CUSIP
0,10001,20100129,11.0,2.0,4925,29269V10,ENERGY INC,7953,4925,36720410
1,10001,20100226,11.0,2.0,4925,29269V10,ENERGY INC,7953,4925,36720410
2,10001,20100331,11.0,2.0,4925,29269V10,ENERGY INC,7953,4925,36720410
3,10001,20100430,11.0,2.0,4925,29269V10,ENERGY INC,7953,4925,36720410
4,10001,20100528,11.0,2.0,4925,29269V10,ENERGY INC,7953,4925,36720410
5,10001,20100630,11.0,2.0,4925,29269V10,ENERGY INC,7953,4925,36720410
6,10001,20100730,11.0,2.0,4925,36720410,GAS NATURAL INC,7953,4925,36720410
7,10001,20100831,11.0,2.0,4925,36720410,GAS NATURAL INC,7953,4925,36720410
8,10001,20100930,11.0,2.0,4925,36720410,GAS NATURAL INC,7953,4925,36720410
9,10001,20101029,11.0,2.0,4925,36720410,GAS NATURAL INC,7953,4925,36720410
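
The `date` column in the output above is stored as numbers in YYYYMMDD form, which is awkward for filtering or resampling by time. A minimal sketch of converting it to a proper datetime column is shown below; `msf_demo` is a hypothetical two-row stand-in for `msf_raw`, built from values that appear in the output.

```python
import pandas as pd

# Hypothetical stand-in for msf_raw, using two rows shown in the output above
msf_demo = pd.DataFrame({
    "PERMNO": [10001, 10001],
    "date": [20100129, 20100226],
})

# Convert YYYYMMDD numbers to datetime64 values
msf_demo["date"] = pd.to_datetime(msf_demo["date"].astype(str), format="%Y%m%d")
print(msf_demo.dtypes)
```

The same line should work on `msf_raw['date']`, assuming the column has no missing values.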


### WRDS Python package
* You should have Python installed already.
* Use **pip** to install:

    *pip install wrds*

* Then you should be able to import this package.

[WRDS provides a guide on how to use Python with the WRDS server](https://wrds-www.wharton.upenn.edu/pages/support/wrds-cloud/holding/python-programming-wrds/)

#### Import Python package: wrds

In [4]:
import wrds

#### Build connection with WRDS server
You need to type your WRDS username and password.

In [None]:
db = wrds.Connection()

#### Check available libraries

In [6]:
db.list_libraries()[20:40]

['compb',
 'compbd',
 'compdcur',
 'compg',
 'compgd',
 'compm',
 'compmcur',
 'compnad',
 'compsamp_snapshot',
 'compseg',
 'compsegd',
 'contrib',
 'crsp',
 'crsp_a_indexes',
 'crsp_a_stock',
 'crsp_q_indexhist',
 'crspa',
 'crspq',
 'csmar',
 'csmar_financial']

#### Check available databases
Each library contains several databases, so we need to find where the required data sits within the library.

For example, let's read the CRSP monthly stock file. You may notice that there are three CRSP libraries: **crsp**, **crspa** and **crspq**. All of them include the monthly stock file and the daily stock file; the difference is the update frequency:
* crsp: daily update
* crspa: annual update
* crspq: quarterly update

> ! Note: **crspa** (annual update) does not mean that the database contains only yearly stock prices. *Annual update* means that WRDS updates the daily file and the monthly file once a year.

You have to choose the library that your institution has permission to access.

We subscribe to the annual update of CRSP, so we will read the monthly stock file from **crspa**.

In [7]:
db.list_tables('crspa')[140:150]

['mseshares',
 'msf',
 'msf62',
 'msfhdr',
 'msfhdr62',
 'msi',
 'msi62',
 'msia',
 'msib',
 'msic']

> **msf** is the monthly stock file. Then we know the path of CRSP monthly stock file: **crspa** > **msf**.
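
Slicing the table list by position, as in `[140:150]`, only works if you already know roughly where the table sits. A small sketch of filtering by name instead is given below; the `tables` list is a stand-in for the result of `db.list_tables('crspa')`, filled with the names shown in the output above.

```python
# Stand-in for db.list_tables('crspa'); these are the names shown in the output above
tables = ['mseshares', 'msf', 'msf62', 'msfhdr', 'msfhdr62',
          'msi', 'msi62', 'msia', 'msib', 'msic']

# Keep only the tables whose name starts with 'msf'
monthly_stock_tables = [name for name in tables if name.startswith('msf')]
print(monthly_stock_tables)
```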

#### Check variables
Before you read the data, you can check the number of observations and which variables the database contains.

One disadvantage compared to the web platform is that you cannot look up the variable definitions directly.

In [8]:
db.describe_table('crspa','msf')

Approximately 4464480 rows in crspa.msf.


Unnamed: 0,name,nullable,type
0,cusip,True,VARCHAR(8)
1,permno,True,DOUBLE PRECISION
2,permco,True,DOUBLE PRECISION
3,issuno,True,DOUBLE PRECISION
4,hexcd,True,DOUBLE PRECISION
5,hsiccd,True,DOUBLE PRECISION
6,date,True,DATE
7,bidlo,True,DOUBLE PRECISION
8,askhi,True,DOUBLE PRECISION
9,prc,True,DOUBLE PRECISION


#### Read data
The combination of Python and the WRDS server (powered by SQL) makes data retrieval much more flexible and efficient, although it requires basic knowledge of SQL.

Another advantage is that we do not have to separate data downloading from the subsequent data analysis. We can code the data import and the analysis in the same place.

In [9]:
sql = "select permno,cusip,date,prc,ret \
    from crspa.msf \
    where date between '2016-01-01' and '2016-06-30' \
    order by permno,date"
crsp_sample_1 = db.raw_sql(sql)

In [10]:
crsp_sample_1.head()

Unnamed: 0,permno,cusip,date,prc,ret
0,10001.0,36720410,2016-01-29,8.32,0.116779
1,10001.0,36720410,2016-02-29,7.86,-0.055288
2,10001.0,36720410,2016-03-31,7.81,-0.006361
3,10001.0,36720410,2016-04-29,7.3,-0.055698
4,10001.0,36720410,2016-05-31,7.14,-0.021918


In [11]:
sql = "select permno,cusip,date,prc,ret, abs(prc)*vol as dvol \
    from crspa.msf \
    where date between '2016-01-01' and '2016-06-30' and abs(prc)*vol>1000000000 \
    order by permno,date"
crsp_sample_2 = db.raw_sql(sql)

In [12]:
crsp_sample_2.head()

Unnamed: 0,permno,cusip,date,prc,ret,dvol
0,14593.0,03783310,2016-01-29,97.339996,-0.075242,1236778000.0
1,84398.0,78462F10,2016-01-29,193.720795,-0.049783,7192758000.0
2,84398.0,78462F10,2016-02-29,193.559998,-0.00083,5653366000.0
3,84398.0,78462F10,2016-03-31,205.520004,0.067212,4774859000.0
4,84398.0,78462F10,2016-04-29,206.330795,0.003945,3942229000.0
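
If you are new to SQL, the clauses used in the second query (a computed column, a date range, and a filter on the computed value) can be tried offline. The sketch below uses Python's built-in `sqlite3` as a stand-in for the WRDS server. The rows are invented toy values, not CRSP data, and the threshold of 400 is arbitrary. One row has a negative price: CRSP reports a negative price when it is a bid/ask average rather than a closing price, which is presumably why the queries above wrap `prc` in `abs()`.

```python
import sqlite3

# Toy rows (permno, date, prc, vol) invented for illustration; not CRSP data
rows = [
    (1, "2016-01-29", 10.0, 50.0),
    (1, "2016-02-29", -12.0, 100.0),  # negative price: bid/ask average in CRSP
    (2, "2016-01-29", 20.0, 10.0),    # dollar volume too small
    (2, "2016-07-29", 25.0, 200.0),   # outside the date window
]

con = sqlite3.connect(":memory:")
con.execute("create table msf (permno integer, date text, prc real, vol real)")
con.executemany("insert into msf values (?, ?, ?, ?)", rows)

sql = """
    select permno, date, abs(prc) * vol as dvol
    from msf
    where date between '2016-01-01' and '2016-06-30'
      and abs(prc) * vol > 400
    order by permno, date
"""
result = con.execute(sql).fetchall()
con.close()
print(result)
```

Only the two rows for permno 1 survive: the third row fails the dollar-volume filter and the fourth falls outside the date range.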


> With a Python connection to WRDS, we can also merge databases on the server without downloading data from the web interface. Try it if you are interested.
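
As a starting point for such a merge, here is a hedged sketch that joins monthly returns with the company name valid on each date. It assumes that your subscription includes a name-history table `crspa.msenames` with the columns `namedt`, `nameendt`, `comnam`, `shrcd` and `exchcd`; check this with `db.list_tables` and `db.describe_table` first. `read_msf_with_names` is a hypothetical helper name.

```python
def read_msf_with_names(db):
    """Join monthly returns with the company name valid on each date."""
    # Assumption: crspa.msenames stores name history as namedt/nameendt ranges
    sql = """
        select a.permno, a.date, a.ret, b.comnam, b.shrcd, b.exchcd
        from crspa.msf as a
        left join crspa.msenames as b
            on a.permno = b.permno
            and a.date between b.namedt and b.nameendt
        where a.date between '2016-01-01' and '2016-06-30'
        order by a.permno, a.date
    """
    return db.raw_sql(sql)

# Usage, given the connection created earlier:
# crsp_named = read_msf_with_names(db)
```

The join runs on the WRDS server, so only the merged result is transferred to your machine.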

#### Save data in local folder
You can output the data if you want to save it locally.

In [13]:
crsp_sample_2.to_csv(file_path+'python_retrieve.csv',index=False)
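
One thing to watch when reading the saved file back: identifiers such as `cusip` can start with a zero (for example `03783310` in the output above), and `pd.read_csv` may parse an all-digit column as integers and drop the leading zero. A minimal sketch of the problem and a fix follows. It uses an in-memory buffer instead of a file on disk, and the two rows reuse permno/cusip pairs shown earlier.

```python
import io
import pandas as pd

# Two permno/cusip pairs taken from the outputs above
sample = pd.DataFrame({
    "permno": [14593.0, 10001.0],
    "cusip": ["03783310", "36720410"],
})

# Write to an in-memory buffer (stands in for a CSV file on disk)
buffer = io.StringIO()
sample.to_csv(buffer, index=False)

# Default parsing turns the all-digit cusip column into integers
buffer.seek(0)
naive = pd.read_csv(buffer)

# Forcing the column to str keeps the leading zero
buffer.seek(0)
safe = pd.read_csv(buffer, dtype={"cusip": str})

print(naive["cusip"].tolist())
print(safe["cusip"].tolist())
```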