# Querying WRDS Data using Python
https://wrds-www.wharton.upenn.edu/pages/support/programming-wrds/programming-python/querying-wrds-data-python/

In [1]:
# Establish the connection
import wrds
db = wrds.Connection(wrds_username='fanjia')

Loading library list...
Done


In [9]:
# Use the help() function or the inline documentation
help(db.describe_table)

Help on method describe_table in module wrds.sql:

describe_table(library, table) method of wrds.sql.Connection instance
    Takes the library and the table and describes all the columns
      in that table.
    Includes Column Name, Column Type, Nullable?.
    
    :param library: Postgres schema name.
    :param table: Postgres table name.
    
    :rtype: pandas.DataFrame
    
    Usage::
    >>> db.describe_table('wrdssec_all', 'dforms')
                name nullable     type
          0      cik     True  VARCHAR
          1    fdate     True     DATE
          2  secdate     True     DATE
          3     form     True  VARCHAR
          4   coname     True  VARCHAR
          5    fname     True  VARCHAR



In [12]:
# First confirm that the query works on a small amount of data. There are two ways to limit the number of records (to, say, 10):

# Method 1
#db.get_table('djones', 'djdaily', columns=['date', 'dji'], obs=10)

# Method 2
db.raw_sql(
    ''' 
    SELECT date,dji 
    FROM djones.djdaily 
    LIMIT 10;
    ''', 
    date_cols=['date'])


Unnamed: 0,date,dji
0,1896-05-26,40.94
1,1896-05-27,40.58
2,1896-05-28,40.2
3,1896-05-29,40.63
4,1896-06-01,40.6
5,1896-06-02,40.04
6,1896-06-03,39.77
7,1896-06-04,39.94
8,1896-06-05,40.32
9,1896-06-08,39.81
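
Before scaling a query up, it can help to check the small sample programmatically. The sketch below does this with pandas alone. `check_sample` is a hypothetical helper, not part of the wrds module. The toy frame copies the first three rows of the output above, and `pd.to_datetime` stands in for the effect of `date_cols`.

```python
import pandas as pd

def check_sample(df, expected_cols, max_rows, date_cols=()):
    """Return a list of problems found in a small query result (empty if none)."""
    problems = []
    missing = [c for c in expected_cols if c not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    if len(df) > max_rows:
        problems.append(f"expected at most {max_rows} rows, got {len(df)}")
    for col in date_cols:
        if col in df.columns and not pd.api.types.is_datetime64_any_dtype(df[col]):
            problems.append(f"column {col!r} is not a datetime column")
    return problems

# Toy stand-in for the result of the LIMIT 10 query (first three rows shown above).
sample = pd.DataFrame({
    "date": pd.to_datetime(["1896-05-26", "1896-05-27", "1896-05-28"]),
    "dji": [40.94, 40.58, 40.20],
})
problems = check_sample(sample, ["date", "dji"], max_rows=10, date_cols=["date"])
print(problems)
```

With a live connection, the same check could be applied to the frame returned by `db.raw_sql(...)` before removing the `LIMIT` clause.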


In [21]:
######################## Inspect Library and Table Contents ########################
# You can explore the structure of the data through its metadata using the wrds module, following steps [1] to [3] below.

# Alternatively, a comprehensive list of all WRDS libraries is available at the Dataset List (https://wrds-www.wharton.upenn.edu/pages/about/data-vendors/). It lists each library with its component datasets and variables, offers a tabular preview of each dataset, and is an easy way to explore the structure of the data from a web browser.

# [1] List all available libraries at WRDS:
db.list_libraries()

# [2] Select a library to work with, and list all available datasets within that library using:
db.list_tables(library="crsp")

# [3] Select a dataset, and list all available variables (column headers) within that dataset using:
db.describe_table(library="crsp", table="msf")

# Here 'library' is a library such as crsp, as returned by step [1], and 'table' is a dataset within that library, such as msf, as returned by step [2]. Both the library and the dataset names are case-sensitive and must be all lowercase.


Approximately 4696968 rows in crsp.msf.


Unnamed: 0,name,nullable,type
0,cusip,True,VARCHAR
1,permno,True,DOUBLE PRECISION
2,permco,True,DOUBLE PRECISION
3,issuno,True,DOUBLE PRECISION
4,hexcd,True,DOUBLE PRECISION
5,hsiccd,True,DOUBLE PRECISION
6,date,True,DATE
7,bidlo,True,DOUBLE PRECISION
8,askhi,True,DOUBLE PRECISION
9,prc,True,DOUBLE PRECISION
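
One practical use of the `describe_table()` output is picking out the DATE columns to pass as `date_cols` in a later query. The sketch below works on a toy copy of a few rows from the output above. `date_column_names` is a hypothetical helper, and treating the `type` column as comparable to the string "DATE" is an assumption based on how it prints here.

```python
import pandas as pd

# Toy copy of a few rows from the describe_table() output shown above.
columns = pd.DataFrame({
    "name": ["cusip", "permno", "date", "prc"],
    "nullable": [True, True, True, True],
    "type": ["VARCHAR", "DOUBLE PRECISION", "DATE", "DOUBLE PRECISION"],
})

def date_column_names(described):
    """Return the names of columns whose type prints as DATE."""
    # astype(str) is a precaution in case the type column holds
    # type objects rather than plain strings.
    is_date = described["type"].astype(str).str.upper() == "DATE"
    return described.loc[is_date, "name"].tolist()

date_cols = date_column_names(columns)
print(date_cols)  # ['date']
```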


In [30]:
######################## Query Data ########################
# Now that you know how to query the metadata and understand the structure of the data, you are ready to query WRDS data directly. The wrds module provides several methods that are useful in gathering data:

# get_table() - fetches data by matching library and dataset, with the ability to filter using different parameters. This is the easiest method of accessing data. 
#data = db.get_table(library='djones', table='djdaily', columns=['date', 'dji'], obs=10)
#data

# raw_sql() - executes a SQL query against the specified library and dataset, allowing for highly granular data queries.
data = db.raw_sql(
    '''
    SELECT date, dji
    FROM djones.djdaily
    LIMIT 10;
    ''',
    date_cols=['date'])
data
# Notice the dot notation for the library and dataset. Unlike the other wrds methods, where library and table are specified separately, SQL queries instead use the two together to identify the data location. So, for example, a data query for the dataset msf within the library crsp would use the syntax crsp.msf, and the same goes for djones.djdaily.

# get_row_count() - returns the number of rows in a given dataset.
#data = db.get_row_count('djones', 'djdaily')
#data

Unnamed: 0,date,dji
0,1896-05-26,40.94
1,1896-05-27,40.58
2,1896-05-28,40.2
3,1896-05-29,40.63
4,1896-06-01,40.6
5,1896-06-02,40.04
6,1896-06-03,39.77
7,1896-06-04,39.94
8,1896-06-05,40.32
9,1896-06-08,39.81
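
To make the dot notation concrete, here is a minimal sketch that assembles a `library.table` name and a small SELECT statement from separate arguments, roughly mirroring what `get_table(..., obs=10)` asks for. `qualified_name` and `build_select` are hypothetical helpers, not wrds functions, and the lowercase check reflects the all-lowercase rule mentioned earlier.

```python
import re

# Assumption for this sketch: names are plain lowercase identifiers.
_IDENTIFIER = re.compile(r"^[a-z_][a-z0-9_]*$")

def qualified_name(library, table):
    """Return 'library.table', rejecting names that are not all lowercase."""
    for part in (library, table):
        if not _IDENTIFIER.match(part):
            raise ValueError(f"expected an all-lowercase identifier, got {part!r}")
    return f"{library}.{table}"

def build_select(library, table, columns=None, obs=None):
    """Build a SELECT statement, optionally limited to `obs` rows."""
    cols = ", ".join(columns) if columns else "*"
    sql = f"SELECT {cols} FROM {qualified_name(library, table)}"
    if obs is not None:
        sql += f" LIMIT {int(obs)}"
    return sql

query = build_select("djones", "djdaily", columns=["date", "dji"], obs=10)
print(query)  # SELECT date, dji FROM djones.djdaily LIMIT 10
```

The resulting string could then be passed to `db.raw_sql(query, date_cols=['date'])`.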


In [40]:
######################## Joining Data ########################
# Data from separate datasets can be joined and analyzed together. The following example joins the Compustat Fundamentals dataset (comp.funda) with Compustat's pricing dataset (comp.secm), and then queries total assets and total liabilities alongside the monthly close price and shares outstanding.

df = db.raw_sql(
    '''
    SELECT a.gvkey, a.datadate, a.tic, a.conm, 
           a.at, a.lt, b.prccm, b.cshoq 
    FROM comp.funda a join comp.secm b 
    ON a.gvkey = b.gvkey and a.datadate = b.datadate 
    WHERE a.tic = 'IBM' and a.datafmt = 'STD' and a.consol = 'C' and a.indfmt = 'INDL'
    ''',
    date_cols=["datadate"])

# The code joins the two datasets on the common gvkey identifier and date. Because comp.funda is annual, the result for IBM has one row per year: 55 observations as of 2017. Joins between large datasets can require large amounts of memory and execution time, so limit the scope of your queries to a reasonable size when performing joins.

df.head()

Unnamed: 0,gvkey,datadate,tic,conm,at,lt,prccm,cshoq
0,6066,1962-12-31,IBM,INTL BUSINESS MACHINES CORP,2112.301,731.7,389.999567,
1,6066,1963-12-31,IBM,INTL BUSINESS MACHINES CORP,2373.857,782.119,506.999353,
2,6066,1964-12-31,IBM,INTL BUSINESS MACHINES CORP,3309.152,1055.072,409.499496,
3,6066,1965-12-31,IBM,INTL BUSINESS MACHINES CORP,3744.917,1166.771,498.999146,
4,6066,1966-12-31,IBM,INTL BUSINESS MACHINES CORP,4660.777,1338.149,371.499662,
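
The same inner-join logic can be sketched offline with pandas, which also shows why an annual table joined to a monthly one on (gvkey, datadate) yields only annual rows. The two toy frames below use made-up values, not real Compustat data. The column names follow the query above.

```python
import pandas as pd

# Made-up values for illustration only; not real Compustat data.
funda = pd.DataFrame({  # annual fundamentals
    "gvkey": ["000001", "000001", "000002"],
    "datadate": pd.to_datetime(["2001-12-31", "2002-12-31", "2001-12-31"]),
    "at": [100.0, 110.0, 50.0],
})
secm = pd.DataFrame({  # monthly security data
    "gvkey": ["000001", "000001", "000001", "000002"],
    "datadate": pd.to_datetime(
        ["2001-11-30", "2001-12-31", "2002-12-31", "2001-06-30"]),
    "prccm": [9.5, 10.0, 12.0, 4.0],
})

# Equivalent of: FROM funda a JOIN secm b
#                ON a.gvkey = b.gvkey AND a.datadate = b.datadate
joined = funda.merge(secm, on=["gvkey", "datadate"], how="inner")
print(joined)
```

Only the monthly rows whose date coincides with an annual report date survive the join. Here that leaves two rows, both for gvkey 000001.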


In [41]:
######################## Parameterize Data ########################
# The raw_sql() method now also supports parameterized SQL, allowing you to pass variables or lists from elsewhere in your Python code to your SQL statement. This is useful for large lists of company codes or identifiers, or for an array of specific trading days. Here is an example where a dictionary holding a tuple of tickers is passed to a raw_sql() SQL statement:

parm = {'tickers': ('0015B', '0030B', '0032A', '0033A', '0038A')}
df = db.raw_sql(
    '''
    SELECT datadate,gvkey,cusip 
    FROM comp.funda 
    WHERE tic in %(tickers)s''', 
    params=parm)

# This allows a great deal of flexibility in your SQL queries. Common use cases include building a list of tickers, CUSIPs, etc. programmatically or from an external file; reusing the same code list across multiple queries that vary other parameters, such as the date range; or matching on specified trading days.

df



Unnamed: 0,datadate,gvkey,cusip
0,1982-10-31,002484,121579932
1,1983-10-31,002484,121579932
2,1984-10-31,002484,121579932
3,1985-10-31,002484,121579932
4,1986-10-31,002484,121579932
...,...,...,...
208,2009-12-31,179519,61847Z002
209,2010-12-31,179519,61847Z002
210,2010-12-31,179519,61847Z002
211,2011-12-31,179519,61847Z002
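
When the ticker list is built programmatically, it usually needs some cleaning before it is passed as `params`. The sketch below is one way to do that, assuming the values arrive as a list of strings. `in_list_params` is a hypothetical helper, not part of wrds. It returns a tuple because that is the form used in the example above, and it rejects an empty list because an empty `IN ()` clause is not valid SQL.

```python
def in_list_params(name, values):
    """Build a params dict for a `WHERE col IN %(name)s` clause."""
    # Strip whitespace, then drop blanks and duplicates while keeping order.
    cleaned = list(dict.fromkeys(v.strip() for v in values if v.strip()))
    if not cleaned:
        raise ValueError("at least one value is required")
    return {name: tuple(cleaned)}

# Example input with a stray space, a duplicate, and a blank entry.
raw = ["0015B", " 0030B", "0032A", "0015B", ""]
parm = in_list_params("tickers", raw)
print(parm)  # {'tickers': ('0015B', '0030B', '0032A')}
```

The resulting `parm` can be passed as `params=parm` exactly as in the cell above.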


# Example Python Data Workflow
https://wrds-www.wharton.upenn.edu/pages/support/programming-wrds/programming-python/python-example-data-workflow/