# Example Python Data Workflow 

<p class="lead">See how to combine a series of Python data queries into a useful workflow</p>
This short tutorial closely follows https://wrds-www.wharton.upenn.edu/pages/support/programming-wrds/programming-python/

## 1. Example Data Research Workflow using Python 

<div class="rich-text">
    <p>The following series of stand-alone queries represents a basic example Python workflow using WRDS data. The commands in this workflow could be run interactively or submitted via a batch job using Python in the WRDS Cloud, or run locally from your computer using a Jupyter notebook. For this example, we'll use the CRSP Daily Stock File (<strong>crsp.dsf</strong>) dataset.</p>
    <p>First, as with every Python program that intends to connect to WRDS, we must import the <strong>wrds</strong> module and make our connection:</p>
</div>

In [1]:
import wrds
db = wrds.Connection()

Loading library list...
Done


<div class="rich-text">
    <p>We must also have set up our <strong>pgpass</strong> file as described earlier in the tutorial [Querying WRDS Data using Python](Querying WRDS Data using Python.ipynb).</p>
    <div class="alert alert-block alert-warning">
<b>Note:</b> Class accounts and IPAuth / Daypass accounts are not permitted to access WRDS in this manner and will receive an error if they attempt this connection. You must have your own, dedicated WRDS account in order to access WRDS from Python.
</div>
    <p>Let's get started. The initial queries (metadata queries) have dedicated <strong>wrds</strong> module methods to give you the results you're looking for. Later queries (data queries) will use <code>raw_sql()</code> exclusively.</p>
    <p>1. To determine the <em>libraries</em> available at WRDS:</p></div>

In [5]:
db.list_libraries();

<p>2. Let's work with the CRSP library. From the above results, that's <strong>crsp</strong>. What datasets are available within this library?</p>

In [7]:
db.list_tables('crsp');

<p>3. The results show that many datasets are available within the <strong>crsp</strong> library. Let's work with the Daily Stock File (<strong>dsf</strong>) dataset, and take a look at the list of available data variables (column names) in that dataset:</p>

In [8]:
db.describe_table('crsp', 'dsf')

Approximately 94366100 rows in crsp.dsf.


Unnamed: 0,name,nullable,type
0,cusip,True,VARCHAR(8)
1,permno,True,DOUBLE PRECISION
2,permco,True,DOUBLE PRECISION
3,issuno,True,DOUBLE PRECISION
4,hexcd,True,DOUBLE PRECISION
5,hsiccd,True,DOUBLE PRECISION
6,date,True,DATE
7,bidlo,True,DOUBLE PRECISION
8,askhi,True,DOUBLE PRECISION
9,prc,True,DOUBLE PRECISION


<div class="rich-text">
    <p>4. Now that we've examined the structure of our data through its metadata, let's begin querying the data itself. First, let's take a peek at the first 5 rows of this dataset to see what we're working with. The LIMIT keyword (or the <code>obs</code> argument of <code>get_table()</code>) returns only a small sample of the data, which greatly speeds up the query when we only want a quick look.</p>
    <p>Both <code>get_table()</code> and <code>raw_sql()</code> can be used for this:</p>
</div>

In [11]:
db.get_table('crsp', 'dsf', obs=5)
db.raw_sql('select * from crsp.dsf LIMIT 5')

Unnamed: 0,cusip,permno,permco,issuno,hexcd,hsiccd,date,bidlo,askhi,prc,vol,ret,bid,ask,shrout,cfacpr,cfacshr,openprc,numtrd,retx
0,68391610,10000.0,7952.0,10396.0,3.0,3990.0,1986-01-07,2.375,2.75,-2.5625,1000.0,,,,3680.0,1.0,1.0,,,
1,68391610,10000.0,7952.0,10396.0,3.0,3990.0,1986-01-08,2.375,2.625,-2.5,12800.0,-0.02439,,,3680.0,1.0,1.0,,,-0.02439
2,68391610,10000.0,7952.0,10396.0,3.0,3990.0,1986-01-09,2.375,2.625,-2.5,1400.0,0.0,,,3680.0,1.0,1.0,,,0.0
3,68391610,10000.0,7952.0,10396.0,3.0,3990.0,1986-01-10,2.375,2.625,-2.5,8500.0,0.0,,,3680.0,1.0,1.0,,,0.0
4,68391610,10000.0,7952.0,10396.0,3.0,3990.0,1986-01-13,2.5,2.75,-2.625,5450.0,0.05,,,3680.0,1.0,1.0,,,0.05


<div class="rich-text">
    <p>5. From our results to #3 above, let's say we've decided to only work with the <strong>cusip</strong>, <strong>permno</strong>, <strong>date</strong>, <strong>bidlo</strong>, and <strong>askhi</strong> variables. We can request only those five specific columns by modifying the above like so, again using both <code>get_table()</code> and <code>raw_sql()</code>:</p>
</div>

In [14]:
db.get_table('crsp', 'dsf', columns=['cusip', 'permno', 'date', 'bidlo', 'askhi'], obs=5)
db.raw_sql('select cusip, permno, date, bidlo, askhi from crsp.dsf LIMIT 5')

Unnamed: 0,cusip,permno,date,bidlo,askhi
0,68391610,10000.0,1986-01-07,2.375,2.75
1,68391610,10000.0,1986-01-08,2.375,2.625
2,68391610,10000.0,1986-01-09,2.375,2.625
3,68391610,10000.0,1986-01-10,2.375,2.625
4,68391610,10000.0,1986-01-13,2.5,2.75
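
To make the correspondence between the two calls in steps 4 and 5 concrete, here is a minimal sketch of how the `get_table()` arguments map onto the SQL text passed to `raw_sql()`. The helper `select_statement()` is hypothetical and written only for illustration; it is not part of the `wrds` module and does not claim to mirror how the module builds its queries internally.

```python
def select_statement(library, table, columns=None, obs=None):
    """Build the SQL text that corresponds to a simple get_table() call.

    Hypothetical illustration only: identifiers are pasted in as-is,
    so this should not be used with untrusted input.
    """
    # No column list means "all columns", just like select *.
    cols = ", ".join(columns) if columns else "*"
    sql = f"select {cols} from {library}.{table}"
    # A non-negative obs plays the role of the LIMIT keyword.
    if obs is not None and obs >= 0:
        sql += f" LIMIT {int(obs)}"
    return sql


print(select_statement("crsp", "dsf", obs=5))
print(select_statement("crsp", "dsf",
                       columns=["cusip", "permno", "date", "bidlo", "askhi"],
                       obs=5))
```

The two printed statements match the `raw_sql()` strings used in steps 4 and 5.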


<div class="rich-text">
    <p>6. Run the query again, but filter by <strong>permno</strong> and limit the results to a single day:</p>
    <p>Since you are now filtering by specific values for variables, use <strong>raw_sql()</strong> exclusively. All future examples will be shown using <strong>raw_sql()</strong>.</p>
    <div class="alert alert-block alert-info">
<b>NOTE:</b> Wrap the Python string in double quotation marks and put single quotation marks around the date value inside the SQL, as shown below.
</div>
</div>

In [20]:
db.raw_sql("select cusip, permno, date, bidlo, askhi "
           "from crsp.dsf "
           "where permno in (14593, 90319, 12490, 17778) "
           "and date='2013-01-04'")

Unnamed: 0,cusip,permno,date,bidlo,askhi
0,45920010,12490.0,2013-01-04,192.779999,194.460007
1,03783310,14593.0,2013-01-04,525.828613,538.629883
2,08467010,17778.0,2013-01-04,140047.0,141003.796875
3,02079K30,90319.0,2013-01-04,727.680115,741.469971
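
When the list of **permno** values or the date comes from a variable rather than being typed by hand, it can be passed separately from the SQL text. The sketch below assumes that the installed `wrds` version accepts a `params` argument to `raw_sql()` with `%(name)s` placeholders, as described in the WRDS Python documentation. The helper names `build_quote_query()` and `fetch_quotes()` are hypothetical.

```python
def build_quote_query(permnos, date):
    """Return (sql, params) for the single-day query, using named placeholders."""
    sql = (
        "select cusip, permno, date, bidlo, askhi "
        "from crsp.dsf "
        "where permno in %(permnos)s "
        "and date = %(date)s"
    )
    # A tuple is used so the driver can expand it into an IN (...) list.
    params = {"permnos": tuple(permnos), "date": date}
    return sql, params


def fetch_quotes(db, permnos, date):
    """Run the query on an open connection (assumes raw_sql accepts params)."""
    sql, params = build_quote_query(permnos, date)
    return db.raw_sql(sql, params=params)


sql, params = build_quote_query([14593, 90319, 12490, 17778], "2013-01-04")
print(sql)
print(params)
```

With a live connection, `fetch_quotes(db, [14593, 90319, 12490, 17778], '2013-01-04')` would be expected to return the same rows as the hand-written query above, and the quoting of the date value is left to the database driver.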


<p><strong>6</strong>. Look for high <strong>askhi</strong> values: query the dates on which any of these <strong>permno</strong>s posted an Ask Price over $2,000 between 2010 and 2013. The row limit is dropped here because the query is already quite specific:</p>

In [27]:
db.raw_sql("select cusip,permno,date,bidlo,askhi "
           "from crsp.dsf "
           "where permno in (14593, 90319, 12490, 17778) "
           "and date between '2010-01-01' and '2013-12-31' "
           "and askhi > 2000")

Unnamed: 0,cusip,permno,date,bidlo,askhi
0,08467010,17778.0,2010-01-04,99201.00000,99910.000000
1,08467010,17778.0,2010-01-05,99550.00000,100001.000000
2,08467010,17778.0,2010-01-06,99500.00000,100000.000000
3,08467010,17778.0,2010-01-07,99594.00000,100000.000000
4,08467010,17778.0,2010-01-08,99700.00000,100300.000000
5,08467010,17778.0,2010-01-11,99320.00000,100750.007812
6,08467010,17778.0,2010-01-12,99350.00000,99948.992188
7,08467010,17778.0,2010-01-13,99150.00000,99948.992188
8,08467010,17778.0,2010-01-14,98920.00000,99480.000000
9,08467010,17778.0,2010-01-15,97205.00000,99362.500000


<p><strong>7</strong>. Only a single <strong>permno</strong> posted such a high Ask Price during this date range. Open the search to all <strong>permno</strong>s that have ever posted an Ask Price over $2,000 on any date (use <strong>distinct</strong> to return only one entry per matching <strong>permno</strong>). Since the query runs against all <strong>permno</strong>s and the entire date range of the dataset, it may take a little longer:</p>

In [28]:
db.raw_sql('select distinct permno from crsp.dsf where askhi > 2000')

Unnamed: 0,permno
0,84788.0
1,21709.0
2,11498.0
3,18614.0
4,15798.0
5,17881.0
6,14533.0
7,10225.0
8,12343.0
9,17814.0


<p><strong>8</strong>. Retrieve all dates for which an Ask Price over $2000 was posted, along with the <strong>permno</strong>s that posted them. This will give a list of dates that match, with an additional entry for that date if additional <strong>permno</strong>s match as well:</p>

In [29]:
db.raw_sql("select distinct date,permno "
           "from crsp.dsf "
           "where askhi > 2000 " 
           "order by date")

Unnamed: 0,date,permno
0,1926-05-03,15675.0
1,1926-07-19,18614.0
2,1926-12-23,14752.0
3,1926-12-24,14752.0
4,1927-01-07,14752.0
5,1927-01-08,14752.0
6,1933-05-26,17814.0
7,1934-07-25,11498.0
8,1935-01-30,25566.0
9,1935-02-14,10225.0


<div class="rich-text">
    <p><strong>9</strong>. Query for the highest Ask ever posted (searching only through Asks over $2,000), the date on which it was posted, and which <strong>permno</strong> posted it. Sorting by <strong>askhi</strong> in descending order and adding <strong>limit 1</strong> returns only the top row, since only the highest value is desired:</p></div>

In [33]:
db.raw_sql('select permno,askhi,date from crsp.dsf where askhi > 2000 order by askhi desc LIMIT 1')

Unnamed: 0,permno,askhi,date
0,17778.0,335900.0,2018-10-09
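
Before running a query like this against the full dataset, it can help to check its logic on a tiny local table. The sketch below is one way to do that, using Python's built-in `sqlite3` module as a stand-in: it attaches an in-memory database under the name `crsp` so the same `crsp.dsf` query text runs unchanged. The rows are copied from the outputs shown earlier in this tutorial. SQLite is not the database WRDS uses, so this only checks the ordering and filtering logic, not performance or type behavior.

```python
import sqlite3

# A few (permno, date, askhi) rows copied from the outputs shown earlier.
SAMPLE_ROWS = [
    (12490, "2013-01-04", 194.460007),
    (14593, "2013-01-04", 538.629883),
    (17778, "2013-01-04", 141003.796875),
    (90319, "2013-01-04", 741.469971),
    (17778, "2010-01-04", 99910.0),
    (17778, "2010-01-05", 100001.0),
]

TOP_ASK_SQL = (
    "select permno, askhi, date from crsp.dsf "
    "where askhi > 2000 order by askhi desc LIMIT 1"
)


def make_stand_in():
    """Create an in-memory stand-in table reachable as crsp.dsf."""
    conn = sqlite3.connect(":memory:")
    conn.execute("ATTACH DATABASE ':memory:' AS crsp")
    conn.execute("CREATE TABLE crsp.dsf (permno REAL, date TEXT, askhi REAL)")
    conn.executemany("INSERT INTO crsp.dsf VALUES (?, ?, ?)", SAMPLE_ROWS)
    return conn


conn = make_stand_in()
top_row = conn.execute(TOP_ASK_SQL).fetchone()
max_ask = conn.execute("select max(askhi) from crsp.dsf").fetchone()[0]
conn.close()

# The single row returned by "order by ... desc LIMIT 1" carries the maximum.
print(top_row, max_ask)
```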


<div class="rich-text">
    <p>This is one example of how you might approach an analytical task using Python. It begins by gathering metadata with <strong>list_libraries()</strong>, <strong>list_tables()</strong>, and <strong>describe_table()</strong> to learn about the available data structure, and then uses that information to run meaningful queries against the data itself with <strong>get_table()</strong> and <strong>raw_sql()</strong>.</p>
    <p>A common next step would be to write a batch program that combines the above one-off queries. An example might be a program that loops over each <strong>permno</strong> that has ever posted an Ask Price over $2,000 and calculates how long it stayed above that level. Or perhaps certain dates saw more high asks than others; tallying the number of high asks per date might be informative.</p>
    </div>
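
As a rough sketch of that next step, the code below summarizes (date, permno) pairs like those returned in step 8. It uses only the ten rows displayed in that output, so the numbers describe this sample and not the full dataset. The function names `summarize_high_asks()` and `high_asks_per_date()` are hypothetical, and the calendar-day span between the first and last high ask is just one possible reading of "how long".

```python
from collections import Counter, defaultdict
from datetime import date

# (date, permno) pairs copied from the step 8 output shown above.
ROWS = [
    ("1926-05-03", 15675), ("1926-07-19", 18614),
    ("1926-12-23", 14752), ("1926-12-24", 14752),
    ("1927-01-07", 14752), ("1927-01-08", 14752),
    ("1933-05-26", 17814), ("1934-07-25", 11498),
    ("1935-01-30", 25566), ("1935-02-14", 10225),
]


def _as_date(value):
    """Accept either a date object or an ISO 'YYYY-MM-DD' string."""
    return value if isinstance(value, date) else date.fromisoformat(value)


def summarize_high_asks(rows):
    """Per permno: first and last high-ask date, day count, and calendar span."""
    days_by_permno = defaultdict(list)
    for day, permno in rows:
        days_by_permno[permno].append(_as_date(day))
    summary = {}
    for permno, days in days_by_permno.items():
        first, last = min(days), max(days)
        summary[permno] = {
            "first": first,
            "last": last,
            "n_days": len(days),
            "span_days": (last - first).days,
        }
    return summary


def high_asks_per_date(rows):
    """Tally how many permnos posted a high ask on each date."""
    return Counter(_as_date(day) for day, _ in rows)


summary = summarize_high_asks(ROWS)
per_date = high_asks_per_date(ROWS)
print(summary[14752])
print(per_date.most_common(3))
```

In a real batch program the pairs would come from the query result instead of a hard-coded list, and the same tallies could also be computed in SQL with `group by`.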