# Tech Class #8 - Summary Statistics

- This file provides an overview of how to use Python to generate Summary Statistics.
- Summary Statistics are also referred to as Descriptive Statistics.
- These statistics are the typical ones that you have learned in math classes: mean (average), median (50th percentile), standard deviation, etc.

**Import the pandas package**

In [1]:
import pandas as pd

**Import data that we need**

In [2]:
df2018 = pd.read_csv("Compustat_fy2018.csv", parse_dates=['datadate'])


**Create variables (measures) we need**

In [3]:
# ROE = net income / total equity
df2018['roe'] = df2018['ni'] / df2018['teq']

# ATurn (asset turnover) = revenue / total assets
df2018['aturn'] = df2018['revt'] / df2018['at']


---
## What we know so far - the describe command
```Python
df2018.describe()
```
- This command is a quick way to return relevant summary statistics: the number of observations (count), the average (mean), the standard deviation (std), the minimum (min), the maximum (max), and the values at the 25th, 50th, and 75th percentiles.
- Because everything is in one place, it is a good first step when exploring a dataset.
- Run this way, the command covers your entire dataframe and returns summary statistics for all **numeric** variables (columns).
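As a minimal sketch of the points above, the example below runs `describe()` on a small made-up dataframe. The dataframe `toy` and its values are hypothetical stand-ins for the Compustat data, so the numbers are for illustration only.

```python
import pandas as pd

# Hypothetical toy data standing in for the Compustat file
toy = pd.DataFrame({
    "tic": ["AAA", "BBB", "CCC", "DDD"],  # text column: left out of describe() by default
    "ni": [10.0, 20.0, 30.0, 40.0],
    "teq": [100.0, 100.0, 200.0, 400.0],
})

# One table with count, mean, std, min, percentiles, and max per numeric column
summary = toy.describe()
print(summary)

# The result is itself a dataframe, so a single number can be pulled out by label
mean_ni = summary.loc["mean", "ni"]
print(mean_ni)
```

Because the result is a dataframe, you can also save it to a variable and look up individual values later, as the last two lines do.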

### Specify individual columns and individual statistics
- Remember, we can select just one column like:
```Python
df2018['roe']
```
- You can select one descriptive statistic by adding its method at the end. Here are some common ones: ```.mean()```, ```.median()```, ```.std()```, ```.quantile(<perc>)```, ```.count()```, ```.sum()```. For ```.quantile```, `<perc>` is a fraction between 0 and 1 (for example, 0.25 for the 25th percentile).
- Combining these two would get you something like:
```Python
df2018['roe'].sum()
```
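To see several of these statistics side by side, here is a small sketch on a made-up column. The series `roe` and its values are hypothetical; one missing value is included to show that these methods skip missing observations.

```python
import pandas as pd

# Hypothetical ROE values, including one missing observation
roe = pd.Series([0.10, 0.20, None, 0.15, 0.15], name="roe")

roe_count = roe.count()        # number of non-missing observations
roe_mean = roe.mean()          # average of the non-missing values
roe_median = roe.median()      # 50th percentile
roe_p50 = roe.quantile(0.50)   # same as the median; the argument is a fraction
roe_p25 = roe.quantile(0.25)   # 25th percentile

print(len(roe), roe_count)
print(roe_mean, roe_median, roe_p25)
```

Note that `len(roe)` counts every row, while `.count()` counts only the rows that have a value.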

#### From Assignment 1: filter the dataframe, then run describe on just the smaller dataframe
```Python
df2018[df2018['sic']==3711]
```
Then you can run the describe command on the smaller subset of data.

In [None]:
#Remember: you can also save the results of your filter into a new dataframe to work with later
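A minimal sketch of filtering, saving the result, and describing it is shown here. The dataframe `toy`, the name `autos`, and all values are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data: three firms in SIC 3711 and two in SIC 2834
toy = pd.DataFrame({
    "sic": [3711, 3711, 2834, 2834, 3711],
    "roe": [0.10, 0.20, 0.05, 0.15, 0.30],
})

# Save the filtered rows into a new dataframe to work with later
autos = toy[toy["sic"] == 3711].copy()

# Summary statistics for ROE within that one industry
autos_summary = autos["roe"].describe()
print(autos_summary)
```

The `.copy()` call is optional here; it makes `autos` an independent dataframe so that adding columns to it later does not trigger a pandas warning.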

### But what if you want to create an industry-adjusted measure for ALL companies in df2018?
- Doing this comparison by hand is time-consuming.
- Writing a loop to go through every SIC code is also time-consuming.

## The groupby command
- To find the average value of ROE and ATurn for every industry in our dataframe, use:
```Python
df2018[['sic','roe','aturn']].groupby('sic').mean()
```
- First, limit your dataframe to the group identifier (SIC code in this example) and to the variables you want a summary statistic for (ROE and ATurn in this case).
- Second, include the ```.groupby``` command with your group identifier.
- Third, specify the summary statistic you would like. Here are some common ones: ```.mean()```, ```.median()```, ```.std()```, ```.quantile(<perc>)```, ```.count()```, ```.sum()```
- This groupby command is "collapsing" your data: it takes your bigger dataframe and collapses it along the dimension of SIC, so the result has one row per SIC code instead of one row per company.
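The three steps above can be sketched on a small made-up dataframe. The dataframe `toy`, the name `ind_means`, and the values are hypothetical; only the pattern mirrors the command above.

```python
import pandas as pd

# Hypothetical toy data: two firms in each of two industries
toy = pd.DataFrame({
    "sic": [3711, 3711, 2834, 2834],
    "roe": [0.10, 0.20, 0.05, 0.15],
    "aturn": [1.0, 2.0, 0.5, 1.5],
})

# Step 1: keep the group identifier and the variables to summarize
# Step 2: group by the identifier
# Step 3: pick the statistic
ind_means = toy[["sic", "roe", "aturn"]].groupby("sic").mean()

# Four company rows collapse to two industry rows; sic becomes the index
print(ind_means)
```

Swapping `.mean()` for `.median()` or `.count()` in the last step gives a different statistic with the same one-row-per-industry shape.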

**As long as you save this output to a dataframe (called `df_indavgs` in the merge below), you can then merge it back into your original data.**
```Python
df_merge1 = pd.merge(df2018, df_indavgs, how="left", on="sic", suffixes=('','_indavg'))
```
- Same merge command as before.
- We have introduced a new option because the variables ```roe``` and ```aturn``` exist in both dataframes. The ```suffixes=('','_indavg')``` option leaves the names from the original dataframe unchanged and adds the text `_indavg` to the overlapping variables merged in from the secondary dataframe.
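The merge above assumes that `df_indavgs` already exists. This handout does not show how it was created, so the sketch below is one plausible way to build it, shown on made-up data. The names `toy` and `toy_indavgs` and the `.reset_index()` step are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for df2018
toy = pd.DataFrame({
    "sic": [3711, 3711, 2834, 2834],
    "roe": [0.10, 0.20, 0.05, 0.15],
    "aturn": [1.0, 2.0, 0.5, 1.5],
})

# Save the groupby output; reset_index() turns sic back into a regular column
toy_indavgs = toy[["sic", "roe", "aturn"]].groupby("sic").mean().reset_index()

# Left merge keeps every company row and attaches its industry averages
merged = pd.merge(toy, toy_indavgs, how="left", on="sic", suffixes=("", "_indavg"))

print(merged.columns.tolist())
print(merged)
```

Every company row is kept, and each one now carries its industry's average in the `roe_indavg` and `aturn_indavg` columns.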

**Now you can create your industry-adjusted measures just like any other new variable**
- Industry-adjusted means you take the individual company's value for a variable and subtract the industry average of that variable.
```Python
df_merge1['adj_roe'] = df_merge1['roe'] - df_merge1['roe_indavg']
```
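As a quick sanity check on the adjustment, the sketch below repeats the steps on made-up data and confirms that the adjusted values average to zero within each industry. The dataframe `toy` and the names `indavgs` and `merged` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data standing in for df2018
toy = pd.DataFrame({
    "sic": [3711, 3711, 2834, 2834],
    "roe": [0.10, 0.20, 0.05, 0.15],
})

# Industry averages, then merge them back onto the company rows
indavgs = toy[["sic", "roe"]].groupby("sic").mean().reset_index()
merged = pd.merge(toy, indavgs, how="left", on="sic", suffixes=("", "_indavg"))

# Industry-adjusted ROE: company value minus industry average
merged["adj_roe"] = merged["roe"] - merged["roe_indavg"]

# Within each industry, the adjusted values should average to (about) zero
check = merged.groupby("sic")["adj_roe"].mean()
print(merged)
print(check)
```

A positive `adj_roe` means the company's ROE is above its industry average, and a negative value means it is below.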