<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Preliminaries" data-toc-modified-id="Preliminaries-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preliminaries</a></span></li><li><span><a href="#Lecture-overview" data-toc-modified-id="Lecture-overview-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Lecture overview</a></span></li><li><span><a href="#The-data" data-toc-modified-id="The-data-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>The data</a></span></li><li><span><a href="#Descriptive-statistics" data-toc-modified-id="Descriptive-statistics-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Descriptive statistics</a></span></li><li><span><a href="#Linear-regression" data-toc-modified-id="Linear-regression-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Linear regression</a></span><ul class="toc-item"><li><span><a href="#The-effect-of-outliers" data-toc-modified-id="The-effect-of-outliers-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>The effect of outliers</a></span></li><li><span><a href="#Economic-significance-vs-statistical-significance" data-toc-modified-id="Economic-significance-vs-statistical-significance-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>Economic significance vs statistical significance</a></span></li><li><span><a href="#Multicollinearity" data-toc-modified-id="Multicollinearity-5.3"><span class="toc-item-num">5.3&nbsp;&nbsp;</span>Multicollinearity</a></span></li></ul></li></ul></div>

# Preliminaries

In [None]:
# Import packages
import pandas as pd
import numpy as np
import statsmodels.api as sm

# Lecture overview

To make the concepts in these lecture notes (and the next) more concrete, we will apply them to the empirical analysis of whether firm profitability is predictable. To be more specific, we will ask: 

**Which of the following firm characteristics (if any) have statistically significant predictive power over firms' profitability: the firm's cash holdings, its book leverage or its capital investments?**

In this lecture, we will start by collecting the data we need for this analysis and producing some key descriptive statistics. We will then run a regression in which future firm profitability is the dependent variable, and firm cash holdings, book leverage, and investment are the explanatory variables. Along the way, we will showcase some common practical issues that you need to be aware of any time you run a regression:
1. The effect of outliers on our regression results
2. The difference between economic magnitude and statistical significance
3. Multicollinearity (highly correlated independent variables)

In the following lecture, we will continue this analysis by tackling two other very common issues with regression analysis:
1. The potential presence of "fixed-effects" in the data
2. The issue of correlated error terms in the regression

# The data

The first step in our analysis is to decide exactly what data we will use to try to answer this question. We will use the Compustat dataset as raw data (the "compa.zip" file in the "data" folder).

We then have to be explicit about exactly how each variable in the analysis will be calculated:
1. Dependent variable:
    - roa = net income before extraordinary items (``ib``) divided by total assets (``at``)

2. Independent variables:
    - cash holdings = cash and cash equivalents (``che``) divided by total assets
    - book leverage = long-term debt (``dltt``) divided by total assets
    - capital expenditures (investment) = change in net PP&E (``ppent``) divided by total assets

Note that all the variables are scaled by some measure of size (total assets). This is to ensure that our regression results are not dominated by large firms. It also helps avoid heteroskedasticity problems (the dollar-amount data for large firms is much more volatile than the data for small firms). 

In [None]:
# Load the cleaned compa data, keeping only what we need
comp = pd.read_pickle('../data/compa.zip')
comp = comp.loc[comp['at']>0, ['permno','datadate','ib','at','che','dltt','ppent','sich']].copy()
comp = comp.sort_values(['permno','datadate'])
comp.dtypes

In [None]:
# Create main variables
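
The variable definitions listed earlier could be coded roughly as follows. This is only a sketch on a tiny made-up panel, since the Compustat file is not available here. The column names `inv` and `future_roa`, and the choice to measure "future" profitability one fiscal year ahead, are assumptions and are not taken from the lecture.

```python
import numpy as np
import pandas as pd

# Tiny made-up panel standing in for the Compustat extract (illustrative values only)
comp = pd.DataFrame({
    'permno':   [1, 1, 1, 2, 2],
    'datadate': pd.to_datetime(['2000-12-31', '2001-12-31', '2002-12-31',
                                '2000-12-31', '2001-12-31']),
    'ib':    [10.0, 12.0, 9.0, 5.0, 6.0],
    'at':    [100.0, 120.0, 150.0, 50.0, 60.0],
    'che':   [20.0, 30.0, 15.0, 5.0, 12.0],
    'dltt':  [40.0, 36.0, 60.0, 10.0, 15.0],
    'ppent': [50.0, 62.0, 77.0, 20.0, 23.0],
}).sort_values(['permno', 'datadate'])

# Ratios scaled by total assets
comp['roa'] = comp['ib'] / comp['at']
comp['cash'] = comp['che'] / comp['at']
comp['leverage'] = comp['dltt'] / comp['at']

# Investment: change in net PP&E within each firm, scaled by total assets
# (hypothetical name 'inv'; the first record of each firm has no prior value)
comp['inv'] = comp.groupby('permno')['ppent'].diff() / comp['at']

# Next record's roa for the same firm (assumes consecutive fiscal years)
comp['future_roa'] = comp.groupby('permno')['roa'].shift(-1)

print(comp[['permno', 'datadate', 'roa', 'cash', 'leverage', 'inv', 'future_roa']])
```

Grouping by `permno` before calling `diff()` and `shift()` keeps one firm's last record from leaking into the next firm's first record.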

In [None]:
# Winsorize main variables at the 1 and 99 percentiles
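
One way to winsorize is to clip each variable at its own 1st and 99th percentiles. The sketch below uses synthetic data with planted extreme values. The `w_` prefix for the winsorized columns is a naming assumption.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in data with one extreme value planted in each column
df = pd.DataFrame(rng.normal(0.05, 0.1, size=(1000, 2)), columns=['roa', 'cash'])
df.loc[0, 'roa'] = -50.0
df.loc[1, 'cash'] = 40.0

for v in ['roa', 'cash']:
    lo, hi = df[v].quantile([0.01, 0.99])
    # Values outside [lo, hi] are replaced by the nearest bound, not dropped
    df['w_' + v] = df[v].clip(lower=lo, upper=hi)

print(df[['roa', 'w_roa', 'cash', 'w_cash']].describe())
```

Because extreme values are replaced rather than deleted, the number of observations stays the same.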

In [None]:
# Add a constant (column of 1's) and save the data for next time
comp['const'] = 1
comp.to_pickle('../data/comp_clean.zip')

# Descriptive statistics

Summarize the main variables, both winsorized and unwinsorized:
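
A minimal sketch of such a summary, on synthetic data with one planted outlier (the column names `roa` and `w_roa` are assumptions):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({'roa': rng.normal(0.03, 0.2, 500)})
df.loc[0, 'roa'] = -30.0                          # planted outlier
lo, hi = df['roa'].quantile([0.01, 0.99])
df['w_roa'] = df['roa'].clip(lower=lo, upper=hi)  # winsorized version

# Transposed so each variable is a row; extra percentiles show the tails
stats = df[['roa', 'w_roa']].describe(percentiles=[0.01, 0.5, 0.99]).T
print(stats)
```

Comparing the `min`, `max` and `std` columns across the two rows shows how much a single extreme value can move the summary statistics.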

Check correlations:
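
A correlation matrix can be obtained with `DataFrame.corr()`. The sketch below uses synthetic data with a made-up negative relation between cash and roa. It is not meant to reproduce the numbers from the actual Compustat sample.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 1000
cash = rng.uniform(0, 1, n)
# Made-up relation: more cash goes with lower roa, plus noise
roa = -0.2 * cash + rng.normal(0, 0.15, n)
leverage = rng.uniform(0, 0.8, n)    # generated independently of the others
df = pd.DataFrame({'roa': roa, 'cash': cash, 'leverage': leverage})

corr = df.corr()                     # Pearson correlations by default
print(corr.round(2))
```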

We can take a look at pairwise scatter plots (to visualize these correlations) using ``pd.plotting.scatter_matrix()``:
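
A minimal sketch of the call, on synthetic data. The `Agg` backend line is only there so the example also runs outside a notebook, and the styling arguments are arbitrary choices.

```python
import matplotlib
matplotlib.use('Agg')     # non-interactive backend; not needed inside a notebook
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
df = pd.DataFrame(rng.normal(size=(300, 3)), columns=['roa', 'cash', 'leverage'])

# One scatter plot per pair of variables, histograms on the diagonal
axes = pd.plotting.scatter_matrix(df, alpha=0.3, figsize=(6, 6), diagonal='hist')
print(axes.shape)
```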

These plots help us realize that point statistics (single numbers) like the correlation between profitability and cash holdings can mask how rich the data truly is, and can make us believe that patterns in the data (the -0.32 correlation seems quite strong) are more robust than they really are. Always look at your data (plot it). Just make sure you do so after you mitigate the effect of outliers, or the plots will look very distorted.

# Linear regression

Let's use the non-winsorized data first for our baseline regression:

In [None]:
# Using non-winsorized data

## The effect of outliers

Now let's use the winsorized variables and look at the difference. Check the coefficient on the investment variable in particular.

In [None]:
# Using winsorized data

## Economic significance vs statistical significance

It is easy to use the results in the regression output above and decide (based on p-values or t-statistics) if the independent variables have a **statistically** significant relation with the dependent variable. But it is not clear whether these relations are large or small in magnitude (does investment have a large impact on future profitability? larger than leverage?). That is what we mean by **economically** significant.

To help ease the interpretation of these economic magnitudes, we generally standardize all the variables in the regression by subtracting their mean and dividing by their standard deviation (see below). After doing this, the regression coefficient on any independent variable X tells us by how many standard deviations we expect the dependent variable Y to move when X changes by one standard deviation, holding the other X variables fixed.

So after the normalization, the X variables with larger coefficients (in absolute value) have a larger economic impact on the Y variable.

In [None]:
# Create list of names we want to give to the normalized versions of these variables

In [None]:
# Create normalized variables

In [None]:
# Check that all means are 0 and all std deviations are 1
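
The three steps above (name list, normalization, check) might be sketched as follows on synthetic data. The `w_` input names are assumptions. The `n_` prefix follows the ``n_leverage`` name used later in these notes.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(6)
df = pd.DataFrame({
    'w_future_roa': rng.normal(0.03, 0.15, 500),
    'w_cash': rng.uniform(0, 1, 500),
    'w_leverage': rng.uniform(0, 0.8, 500),
})

main_vars = ['w_future_roa', 'w_cash', 'w_leverage']
norm_vars = ['n_' + v[2:] for v in main_vars]    # e.g. 'w_cash' -> 'n_cash'

# Subtract each column's mean and divide by its sample standard deviation
for v, nv in zip(main_vars, norm_vars):
    df[nv] = (df[v] - df[v].mean()) / df[v].std()

# Means should be 0 and standard deviations 1, up to rounding error
check = df[norm_vars].agg(['mean', 'std']).round(10)
print(check)
```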

In [None]:
# Using winsorized, then normalized data

## Multicollinearity

One common way that multicollinearity arises is when two or more of your independent variables (X) are very highly correlated (correlation close to 1 or -1). The usual way to deal with this issue is to calculate the correlation matrix of all the variables in your study and identify which groups of variables are highly correlated with each other. Then we simply drop all but one variable from each such group.

Below, we artificially create this problem in our example application by adding to our regression a variable that equals the leverage variable times 100. This will have a correlation of 1 with the leverage variable. However, as we'll see below, "statsmodels" will NOT give us an error. So it's up to us to make sure that we don't have this problem in our data by always looking at the correlation matrix of our data.

In [None]:
# Add variable that is collinear with leverage

In [None]:
# Run regression with multicollinearity problem

Note how the coefficient on ``n_leverage`` has changed. Also, look at Notes [2] above.

In [None]:
# Check correlations to see which variable to drop
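
One way to scan a correlation matrix for problem pairs is sketched below on synthetic data. The 0.95 cutoff and the `leverage100` name are arbitrary assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(9)
X = pd.DataFrame({'cash': rng.uniform(0, 1, 500),
                  'leverage': rng.uniform(0, 0.8, 500)})
X['leverage100'] = 100 * X['leverage']

corr = X.corr()
# Keep only the part above the diagonal so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
high = pairs[pairs.abs() > 0.95]     # pairs with very high absolute correlation
print(high)
```

Each pair flagged in `high` is a candidate for dropping one of its two variables.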

Multicollinearity can also arise when a "linear combination" (a weighted sum or difference) of some of our variables is highly correlated with another variable in the regression. To see this in action, we will add to our explanatory variables a variable called ``illiquid``, which measures the non-cash assets of the firm (divided by total assets). In this case, the sum of ``cash`` and ``illiquid`` equals 1 at all times, which is exactly another explanatory variable in our regression: the constant term.

In [None]:
# Run regression with multicollinearity problem

Again, we did not get an error, but the results above cannot be trusted. To see this, you can check Notes [2] above, but you can also print out the correlation matrix.

Again, dropping one of the problem variables ("cash" or "illiquid") would solve our problem.