<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Lecture-overview" data-toc-modified-id="Lecture-overview-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Lecture overview</a></span></li><li><span><a href="#Preliminaries" data-toc-modified-id="Preliminaries-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Preliminaries</a></span></li><li><span><a href="#Subperiod-analysis" data-toc-modified-id="Subperiod-analysis-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Subperiod analysis</a></span></li><li><span><a href="#Conditioning-on-cross-sectional-information" data-toc-modified-id="Conditioning-on-cross-sectional-information-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Conditioning on cross-sectional information</a></span></li><li><span><a href="#Conditioning-on-both-time-and-the-cross-section" data-toc-modified-id="Conditioning-on-both-time-and-the-cross-section-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Conditioning on both time and the cross-section</a></span></li><li><span><a href="#Advanced-&quot;binning&quot;-example" data-toc-modified-id="Advanced-&quot;binning&quot;-example-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Advanced "binning" example</a></span><ul class="toc-item"><li><span><a href="#Multi-dimensional-bins" data-toc-modified-id="Multi-dimensional-bins-6.1"><span class="toc-item-num">6.1&nbsp;&nbsp;</span>Multi-dimensional bins</a></span></li></ul></li></ul></div>

# Lecture overview

Loosely speaking, "conditional descriptive statistics" are statistics calculated for subsamples (subsets) of your data. The information you use to create these subsamples is referred to as the "conditioning information". In this lecture, we showcase these types of descriptive statistics using the tools learned in the previous lecture, and a panel dataset: the "compa" file, which contains accounting information for multiple firms over multiple years. 

In the previous lecture, we calculated conditional statistics in situations where the variable that determines which observations belong to which subsample already exists in the dataset. For example, we calculated average returns for each industry separately. In that example, the returns of each individual industry constitute a separate subsample of our data. The "conditioning information" that allowed us to assign each observation to a subsample was the "Industry" variable (which already existed in the dataset).

In this lecture, we focus on examples where we have to create the variable that assigns each observation to a subsample ourselves.

# Preliminaries

In [None]:
import pandas as pd
import numpy as np

Get raw data:

In [None]:
comp_raw = pd.read_pickle('../data/compa.zip')
comp_raw.dtypes

Clean it up a bit:

In [None]:
# Sort by firm identifier and date
comp_raw.sort_values(['permno','datadate'], inplace = True)
# Extract year from the date
comp_raw['year'] = pd.to_datetime(comp_raw['datadate']).dt.year
comp_raw.head(2)

Create a new dataframe with firm and year identifiers for firms with positive total assets:

In [None]:
comp = comp_raw.loc[comp_raw['at']>0, ['permno','year', 'sich']].copy()

And calculate some key variables:

In [None]:
comp['inv'] = comp_raw['capx'] / comp_raw['at']
comp['roa'] = comp_raw['ib'] / comp_raw['at']
comp['lev'] = (comp_raw['dlc'] + comp_raw['dltt']) / comp_raw['at']
comp['cash'] = comp_raw['che'] / comp_raw['at']

In [None]:
comp.describe()

**Challenge**

Winsorize the 'inv','roa','lev','cash' variables at the 1st and 99th percentiles and get full-sample summary statistics for them.

Compare the standard deviations of the winsorized variables above to the standard deviations of the un-winsorized variables:
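One possible approach to the challenge is sketched below on simulated data, since the "compa" file is not reproduced here. The dataframe ``toy`` and its random values are assumptions made for illustration only; the ``w_`` prefix is chosen to match the names stored in ``main_vars`` further down. The sketch uses ``.quantile()`` to find the cutoffs and ``.clip()`` to pull extreme values back to them.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the "comp" dataframe (simulated, heavy-tailed values)
rng = np.random.default_rng(0)
raw_vars = ['inv', 'roa', 'lev', 'cash']
toy = pd.DataFrame(rng.standard_t(df=2, size=(1000, 4)), columns=raw_vars)

# Winsorize: values below the 1st percentile are set to the 1st percentile,
# and values above the 99th percentile are set to the 99th percentile
for v in raw_vars:
    lower, upper = toy[v].quantile([0.01, 0.99])
    toy['w_' + v] = toy[v].clip(lower=lower, upper=upper)

w_vars = ['w_' + v for v in raw_vars]

# Full-sample summary statistics for the winsorized variables
print(toy[w_vars].describe())

# Standard deviations before and after winsorizing, side by side
std_compare = pd.DataFrame({'raw': toy[raw_vars].std().values,
                            'winsorized': toy[w_vars].std().values},
                           index=raw_vars)
print(std_compare)
```

Because winsorizing only moves extreme values toward the center of the distribution, the winsorized standard deviations cannot be larger than the raw ones.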

Save the names of the main variables we want to analyze into a list, so we don't have to type them up every time we use them:

In [None]:
main_vars = [ 'w_inv','w_roa','w_lev','w_cash']

# Subperiod analysis

It is often a good idea to test how the results of your analysis change depending on the time period included in the data. This type of testing is generally referred to as "subperiod analysis". We'll cover two such examples below.

In the following example, we calculate means of our key variables **each year** and plot these means to see how they have changed over time.
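A minimal sketch of the year-by-year calculation, using a small made-up panel in place of the "comp" dataframe (the ``toy`` dataframe and all of its numbers are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy panel (made-up numbers): 3 firms observed in 1998-2001
toy = pd.DataFrame({
    'permno': [1] * 4 + [2] * 4 + [3] * 4,
    'year':   [1998, 1999, 2000, 2001] * 3,
    'w_cash': [0.10, 0.12, 0.15, 0.20,
               0.05, 0.05, 0.07, 0.10,
               0.30, 0.25, 0.35, 0.30],
})

# One mean per year, taken across all firms observed in that year
yearly_means = toy.groupby('year')[['w_cash']].mean()
print(yearly_means)

# yearly_means.plot() would then draw these means against the year index
```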

As another example, we now calculate our means separately for the period before the year 2000 and the period from 2000 onward.

To do this, we need to create a new variable in our dataframe that takes one value before 2000 and a different value from 2000 onward. *What* these values are does not matter at all; they just have to differ between the pre-2000 and post-2000 eras. An easy way to do this is with the ``where`` function in the ``numpy`` package. This function works much like ``IF`` in Excel:

Syntax:
```python
numpy.where(condition, x, y)
```
When the condition is true, this returns the value x, and when it is false, it returns the value y.
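For instance, on a tiny made-up dataframe (``toy`` is a hypothetical stand-in for "comp"), the two-valued indicator could be built as follows. The labels match the ones used in the commented-out "long way" code:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one observation per year
toy = pd.DataFrame({'year': [1998, 1999, 2000, 2001]})

# Element by element: 'pre_2000' where year < 2000, 'post_2000' otherwise
toy['pre_post_2000'] = np.where(toy['year'] < 2000, 'pre_2000', 'post_2000')
print(toy)
```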

In [None]:
# The long way:
#comp['pre_post_2000'] = 'pre_2000'
#comp.loc[comp['year'] >= 2000, 'pre_post_2000'] = 'post_2000'

We can now use the "pre_post_2000" variable with ``.groupby()`` to calculate means separately in the two subperiods:
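A sketch of that step on made-up numbers (the ``toy`` dataframe is an assumption, not the lecture data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two of the main variables
toy = pd.DataFrame({
    'year':   [1998, 1999, 2000, 2001],
    'w_cash': [0.10, 0.20, 0.30, 0.50],
    'w_lev':  [0.40, 0.20, 0.30, 0.10],
})
toy['pre_post_2000'] = np.where(toy['year'] < 2000, 'pre_2000', 'post_2000')

# One row of means per subperiod
subperiod_means = toy.groupby('pre_post_2000')[['w_cash', 'w_lev']].mean()
print(subperiod_means)
```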

# Conditioning on cross-sectional information

Our panel dataset has information for many different firms, each year. We refer to the totality of the firms in our sample as the "cross-sectional" dimension of the data (as opposed to the "time" dimension).

In the examples below, we calculate means of our key variables for each sector in the economy. For each sector, we use all the data available for that sector (i.e. all years for all firms in that sector). This means we will have a single mean per sector.

In our example, separate "sectors" are identified by the first digit of the SIC (industry) code of the firm (the "sich" variable in the "comp" dataframe). So this is another example in which we have to create a new variable that specifies which observation is in which subsample (sector).

We do this by first turning "sich" into a string variable (with ``.astype('string')``) and then selecting the first character in that string (with ``.str[0]``):
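A sketch of this conversion on a few made-up SIC codes, assuming "sich" is stored as a floating-point number with some missing values (the ``toy`` dataframe is hypothetical; the column name "sic1d" is the one used later in this lecture):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: numeric SIC codes, one of them missing
toy = pd.DataFrame({'sich': [3571.0, 6020.0, 1311.0, np.nan, 3674.0]})

# 3571.0 becomes the string '3571.0', whose first character is '3';
# missing codes stay missing
toy['sic1d'] = toy['sich'].astype('string').str[0]
print(toy)
```

One thing to keep in mind with this approach: when "sich" is stored as a number, codes below 1000 carry no leading zero, so their first character is not "0".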

Let's see how many observations we have for each sector:
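For example, with a handful of made-up sector codes (``toy`` is a hypothetical stand-in for "comp"), ``.value_counts()`` tabulates the number of observations per sector, sorted from most to least frequent:

```python
import pandas as pd

# Hypothetical toy data: first-digit sector codes stored as strings
toy = pd.DataFrame({'sic1d': ['3', '6', '3', '1', '6', '3']})

sector_counts = toy['sic1d'].value_counts()
print(sector_counts)
```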

If you look up the SIC codes:

https://siccode.com/sic-code-lookup-directory

you'll see that, roughly speaking, 3 stands for manufacturing firms (though 2 does as well), and 6 stands for financial firms. So the two largest sectors represented in our sample are manufacturing and finance.

Finally, we calculate sector-level means for each of the main variables, using "sic1d" with the ``groupby`` function. Note how much these statistics differ across sectors:
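A sketch of the sector-level calculation on made-up numbers (the ``toy`` dataframe is an assumption; only two of the main variables are included to keep it short):

```python
import pandas as pd

# Hypothetical toy data: two sectors, two observations each
toy = pd.DataFrame({
    'sic1d': ['1', '1', '6', '6'],
    'w_inv': [0.20, 0.10, 0.01, 0.03],
    'w_lev': [0.30, 0.20, 0.60, 0.80],
})

# A single mean per sector, pooling all firms and years in that sector
sector_means = toy.groupby('sic1d')[['w_inv', 'w_lev']].mean()
print(sector_means)
```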

# Conditioning on both time and the cross-section

Finally, we showcase an example where we examine how summary statistics vary across groups of firms **and** over time. 

In the example below, we calculate means of our key variables for each sector in the economy, for each year separately. For each sector, we take an average over all the firms in that sector, separately for each year that the sector exists in our dataset. This means we will have a time-series of means for each sector.
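Grouping by two variables at once could look like the sketch below, which uses a made-up dataframe (``toy`` and its values are assumptions). The result has one row per (year, sector) pair:

```python
import pandas as pd

# Hypothetical toy data: two sectors observed in two years
toy = pd.DataFrame({
    'year':  [2000, 2000, 2000, 2001, 2001, 2001],
    'sic1d': ['1', '1', '6', '1', '6', '6'],
    'w_inv': [0.20, 0.10, 0.02, 0.30, 0.01, 0.03],
})

# Passing a list to groupby creates one group per (year, sector) combination
sector_year_means = toy.groupby(['year', 'sic1d'])[['w_inv']].mean()
print(sector_year_means)
```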

We now plot some of these sector-specific means to see how they have changed over time. To create these plots, we use the **unstack** function to unstack the industries so their data show up side by side (instead of on top of each other).

Note that the column labels have two components: one component that tells us which variable is being summarized, and one component that tells us which sector is being summarized:
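The reshaping and the resulting two-part column labels can be sketched on made-up data as follows (``toy``, ``means`` and ``wide`` are hypothetical names used only for this illustration):

```python
import pandas as pd

# Hypothetical toy data: two sectors, two years, two variables
toy = pd.DataFrame({
    'year':   [2000, 2000, 2001, 2001],
    'sic1d':  ['1', '6', '1', '6'],
    'w_inv':  [0.15, 0.02, 0.30, 0.02],
    'w_cash': [0.10, 0.20, 0.12, 0.25],
})
means = toy.groupby(['year', 'sic1d'])[['w_inv', 'w_cash']].mean()

# Move the sector level of the row index into the columns,
# so that each sector gets its own column for each variable
wide = means.unstack('sic1d')
print(wide)

# Each column label is a (variable, sector) pair
print(wide.columns.tolist())
```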

Let's look at the evolution of investment in particular:
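With two-part column labels, selecting on the first part returns all sectors for that variable. A sketch on made-up data (``wide`` and ``inv_wide`` are hypothetical names):

```python
import pandas as pd

# Hypothetical toy data, reshaped so sectors sit side by side
toy = pd.DataFrame({
    'year':   [2000, 2000, 2001, 2001],
    'sic1d':  ['1', '6', '1', '6'],
    'w_inv':  [0.15, 0.02, 0.30, 0.02],
    'w_cash': [0.10, 0.20, 0.12, 0.25],
})
wide = toy.groupby(['year', 'sic1d'])[['w_inv', 'w_cash']].mean().unstack('sic1d')

# Selecting the outer label keeps one column per sector
inv_wide = wide['w_inv']
print(inv_wide)
```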

Note that the column names are actually strings, not integers:

Note also that in the first few years, we have lots of missing data for the 'sich' variable, which is why we have so many "NaN" values in the table above. We use ``.dropna()`` to eliminate all the years in which we have "NaN" values:
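As a sketch, on a made-up table with one all-missing year (``inv_wide`` here is a hypothetical stand-in for the table of sector-level investment means):

```python
import numpy as np
import pandas as pd

# Hypothetical table of yearly investment means by sector; 1985 has no sector data
inv_wide = pd.DataFrame({'1': [np.nan, 0.15, 0.30],
                         '6': [np.nan, 0.02, 0.02]},
                        index=pd.Index([1985, 2000, 2001], name='year'))

# By default, dropna() removes every row (year) that contains at least one NaN
inv_clean = inv_wide.dropna()
print(inv_clean)
```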

Sector 1 is "Mining and Construction" and Sector 6 is "Financials", so it makes sense that they have drastically different levels of physical investment. To plot just those two sectors, we have to use ``.loc[]`` to extract them from the overall dataframe before we use ``.plot()``:
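A sketch of the extraction step on made-up numbers (``inv_wide`` is a hypothetical table of investment means with one column per sector). Because the column names are strings, the sectors are requested as ``'1'`` and ``'6'``, not as the integers 1 and 6:

```python
import pandas as pd

# Hypothetical table of yearly investment means, one column per sector
inv_wide = pd.DataFrame({'1': [0.15, 0.30],
                         '3': [0.08, 0.07],
                         '6': [0.02, 0.02]},
                        index=pd.Index([2000, 2001], name='year'))

# Keep all rows (years) but only the columns for sectors 1 and 6
two_sectors = inv_wide.loc[:, ['1', '6']]
print(two_sectors)

# two_sectors.plot() would then draw one line per selected sector
```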

To test yourself, see if you can tell why the line below produces the same result:

**Challenge**:

Create a similar plot to the one above, but this time for profitability (roa). Also, this time, place each sector (1 and 6) in a separate subplot. 

# Advanced "binning" example 

In many cases, our analysis requires us to split our sample into bins (groups) based on how firms rank in terms of one specific variable. Then some analysis is performed separately for each bin.

To showcase this type of subsample analysis, in the examples below, we analyze if the evolution of cash holdings over time looks different for firms with different levels of profitability.

To do this, we need to define what we mean by "different levels of profitability". One approach could be to use specific values of profitability: e.g. put all firms with ROA larger than 20\% in a "high profitability" bin, etc. However, these levels would be a bit arbitrary (why 20\% and not 25\%?).

Instead, a more common approach is to simply split firms into a number of equally sized bins (same number of firms in each bin). For example, below, we split firms into 5 equally sized "bins" based on how their profitability ranks among the rest of the profitability data (5 equally sized groups are often called "quintiles", 4 = "quartiles", 3 = "terciles", 10 = "deciles").

First, let's look again at how average cash-holdings evolve over time, when we use the full cross-section:

Now we can use the ``pd.qcut()`` function to create the 5 profitability bins.

Syntax:
```python
pandas.qcut(x, q, labels=None, retbins=False, precision=3, duplicates='raise')
```

And check that these are "equally-sized bins":
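Both steps, the binning and the size check, are sketched below on ten made-up profitability values (the series ``roa`` is an assumption; with real data that contains ties or missing values, the bins may be only approximately equal in size):

```python
import pandas as pd

# Hypothetical toy data: ten distinct ROA values
roa = pd.Series([-0.30, -0.10, -0.05, 0.00, 0.02,
                 0.04, 0.06, 0.08, 0.12, 0.25])

# Split into 5 quantile-based bins, labeled 1 (lowest ROA) to 5 (highest ROA)
roa_bins = pd.qcut(roa, q=5, labels=range(1, 6))
print(roa_bins)

# Number of observations in each bin
print(roa_bins.value_counts().sort_index())
```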

Now take a look at the trends in cash holdings, separately, for firms in different ROA bins:
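A sketch of this calculation on made-up data. To keep the example small, it uses only two bins ("low" and "high") instead of the five used in the lecture; the ``toy`` dataframe, its values, and the column name "roa_bin" are assumptions:

```python
import pandas as pd

# Hypothetical toy data: four firms per year, two years
toy = pd.DataFrame({
    'year':   [2000] * 4 + [2001] * 4,
    'w_roa':  [-0.20, -0.10, 0.10, 0.20, -0.25, -0.15, 0.05, 0.15],
    'w_cash': [0.30, 0.20, 0.10, 0.10, 0.50, 0.40, 0.10, 0.10],
})

# Bins are formed on the pooled data (all years together)
toy['roa_bin'] = pd.qcut(toy['w_roa'], q=2, labels=['low', 'high'])

# Mean cash holdings per year and bin, with the bins placed side by side
cash_by_bin = (toy.groupby(['year', 'roa_bin'], observed=True)['w_cash']
                  .mean()
                  .unstack('roa_bin'))
print(cash_by_bin)

# cash_by_bin.plot() would then draw one line per ROA bin
```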

It looks like the strong positive trend in cash holdings is only there for firms with the lowest profitability.

## Multi-dimensional bins

In the example below, we redo this analysis, but this time, to judge which firm goes into which ROA bin, we compare profitability levels only amongst firms in a given year (and we do this for all years).

To do this, we need to use the ``.transform()`` function we introduced in the last lecture. We supply ``pd.qcut`` as a parameter to ``.transform()``.  
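A sketch of what such a call could look like, on made-up data in which every firm is more profitable in 2001 than in 2000 (the ``toy`` dataframe and the column name "roa_q_pooled" are assumptions; "roa_q" is the column name used in this lecture). The pooled bins are included for comparison:

```python
import pandas as pd

# Hypothetical toy data: five firms per year, profitability shifts up in 2001
toy = pd.DataFrame({
    'year': [2000] * 5 + [2001] * 5,
    'roa':  [-0.2, -0.1, 0.0, 0.1, 0.2,
             0.3, 0.4, 0.5, 0.6, 0.7],
})

# Quintiles formed on the pooled data: firms are ranked against all firm-years
toy['roa_q_pooled'] = pd.qcut(toy['roa'], q=5, labels=range(1, 6))

# Quintiles formed within each year: firms are ranked only against
# the other firms observed in the same year
toy['roa_q'] = (toy.groupby('year')['roa']
                   .transform(lambda x: pd.qcut(x, q=5, labels=range(1, 6))))
print(toy)
```

In this toy example, the pooled bins place every 2001 observation in the top half of the distribution, while the within-year bins run from 1 to 5 in each year.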

Note that ``lambda x`` tells Python that what follows it (i.e. ``pd.qcut``) should be treated as a function of x. So the line of code above splits the "roa" data by year, then takes each year's roa data, calls it "x", and supplies it as an input to the ``pd.qcut()`` function. That function uses the roa information to split firms into quintiles (q=5) based on how their roa ranks among all other firms that year. These quintiles are labeled 1 through 5 (labels = range(1,6)) and stored in a new column called "roa_q" inside the "comp" dataframe.

Let's take a look at these quintiles, as well as the ones we created in the prior section, and the roa levels themselves:

Now we recalculate cash holding trends separately for each ROA bin, using these new bins:

This looks very similar to what we found in the prior section: the result that "firms seem to be holding a lot more cash now" holds only for firms with the lowest profitability. What do you think could account for these findings? 