# Exploratory Data Analysis: Correlations. Authored by Noah Tamminga (ntamm@umich.edu).

The purpose of this notebook will be to walkthrough various correlation analyses with a focus on lagging price across permits to see any relationship between permits and pricing. Our expectation is that if permits increase supply, we will see a negative relationship between permits and price. However, we might also encounter a situation where there is no discernable relationship or that the relationship is the opposite of our expectation due to demand driven exogenous features that need to be controlled for before we can truely understand the relationship between permits and prices.

In [3]:
#Import necessary packages and connect to google drive
import numpy as np
import pandas as pd
import altair as alt

In [4]:
#Google drive pathway needs to be edited based on user's specific pathway to shared repository
permits = pd.read_parquet('data/permits_final.parquet.gzip')
price = pd.read_parquet('data/price_final.parquet.gzip')

permits.head()

Unnamed: 0_level_0,measure,county_name,region_code,division_code,bldgs,units,value
DATE,FIPS,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
2000-01-01,1001,Autauga County,3,6,13,13,690525
2000-01-01,1081,Lee County,3,6,32,63,5558536
2000-01-01,1113,Russell County,3,6,3,4,343000
2000-01-01,1125,Tuscaloosa County,3,6,56,60,5353849
2000-01-01,2013,Aleutians East Borough,4,9,0,0,0


In [5]:
#Joining permits and price based on index
permits_price = permits.join(price, how='inner')
permits_price.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,county_name,region_code,division_code,bldgs,units,value,INVENTORY,HOMES_SOLD,SALE_PRICE,SALE_PRICE_ADJ,LIST_PRICE,LIST_PRICE_ADJ
DATE,FIPS,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
2012-01-01,1001,Autauga County,3,6,9,9,2107491,286.0,20,133250,133250,157450.0,157450.0
2012-01-01,1007,Bibb County,3,6,1,1,163303,24.0,0,186650,186650,99900.0,99900.0
2012-01-01,1009,Blount County,3,6,1,1,350000,131.0,19,68876,68876,132000.0,132000.0
2012-01-01,1021,Chilton County,3,6,2,2,254433,162.0,9,128900,128900,111200.0,111200.0
2012-01-01,1037,Coosa County,3,6,0,0,0,11.0,1,115000,115000,142900.0,142900.0


In [6]:
permits_price.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 204752 entries, (Timestamp('2012-01-01 00:00:00'), '01001') to (Timestamp('2025-03-01 00:00:00'), '56045')
Data columns (total 12 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   county_name     204752 non-null  object 
 1   region_code     204752 non-null  int64  
 2   division_code   204752 non-null  int64  
 3   bldgs           204752 non-null  int64  
 4   units           204752 non-null  int64  
 5   value           204752 non-null  int64  
 6   INVENTORY       201032 non-null  float64
 7   HOMES_SOLD      204752 non-null  int64  
 8   SALE_PRICE      204752 non-null  int64  
 9   SALE_PRICE_ADJ  204752 non-null  int64  
 10  LIST_PRICE      203876 non-null  float64
 11  LIST_PRICE_ADJ  203876 non-null  float64
dtypes: float64(3), int64(8), object(1)
memory usage: 19.6+ MB


### Lagged Correlations

To start, we will look at the correlation of the normalized results of units and sale price. Since we are using data from counties with vastly different scales in terms of units and value, we want to make sure everything is evaluated on an equal basis.

Our first analysis looks at negative correlations. We want to see the frequencies of the most common lags for negative correlations. Then, we will look at the overall correlation by time period.

In [7]:
data = permits_price.copy().reset_index()

#Since we are going to compare across counties with differences (urban, rural, etc.), we will normalize our variables of interest
data['units_norm'] = data.groupby('county_name')['units'].transform(lambda x: (x - x.mean()) / x.std())
data['SALE_PRICE_NORM'] = data.groupby('county_name')['value'].transform(lambda x: (x - x.mean()) / x.std())

data.dropna(subset=['units_norm', 'SALE_PRICE_NORM'], inplace=True)

data.head()

Unnamed: 0,DATE,FIPS,county_name,region_code,division_code,bldgs,units,value,INVENTORY,HOMES_SOLD,SALE_PRICE,SALE_PRICE_ADJ,LIST_PRICE,LIST_PRICE_ADJ,units_norm,SALE_PRICE_NORM
0,2012-01-01,1001,Autauga County,3,6,9,9,2107491,286.0,20,133250,133250,157450.0,157450.0,-0.912033,-0.938987
1,2012-01-01,1007,Bibb County,3,6,1,1,163303,24.0,0,186650,186650,99900.0,99900.0,-0.743962,-0.699833
2,2012-01-01,1009,Blount County,3,6,1,1,350000,131.0,19,68876,68876,132000.0,132000.0,-0.137492,0.285731
3,2012-01-01,1021,Chilton County,3,6,2,2,254433,162.0,9,128900,128900,111200.0,111200.0,-0.225865,-0.253555
5,2012-01-01,1073,Jefferson County,3,6,86,86,19974726,2439.0,415,100900,100900,149900.0,149900.0,-0.060667,0.106854


In [8]:
#Now that we have our norm columns, we can apply the lagged correlation by county
#https://medium.com/pythoneers/cross-correlation-and-coherence-in-time-series-analysis-how-to-uncover-relationships-between-c83a08990b2d

#This function takes in the level we want to group and the variables we want to correlate.
#It's important to note we are only looking at impact on future periods in this lagged analysis.

def get_corr_values(group_level, var1, var2, period_lag=24):

  best_corr = None
  best_lag = None

  for lag in range(-period_lag, 0):
      x = group_level[var1]
      y = group_level[var2].shift(lag)
      valid = pd.concat([x, y], axis=1).dropna() #Corr has warnings if not enough records

      #Only apply corr when we have enough records
      if len(valid) > 1:
          corr = valid[var1].corr(valid[var2])
          if pd.notnull(corr) and (best_corr is None or corr < best_corr):
              best_corr = corr
              best_lag = int(lag)

  return pd.Series(
      {
          'best_corr': best_corr,
          'best_lag': best_lag
      }
  )



corr_summary = data.groupby('county_name').apply(get_corr_values,
                                                 var1='units_norm',
                                                 var2='SALE_PRICE_NORM')

corr_summary.sort_values('best_corr', ascending=True)

Unnamed: 0_level_0,best_corr,best_lag
county_name,Unnamed: 1_level_1,Unnamed: 2_level_1
Upson County,-1.000000,-24.0
La Porte County,-1.000000,-9.0
Litchfield County,-1.000000,-9.0
Wallowa County,-1.000000,-15.0
New Haven County,-1.000000,-9.0
...,...,...
St. Bernard Parish,0.656398,-17.0
Pinal County,0.680148,-19.0
San Jacinto County,0.719807,-23.0
La Salle Parish,,


Before we start, we will do some minor data cleaning to prep the results for our visuals.

In [9]:
#Flipping axis so time lag period can be read left to right
corr_summary['best_lag'] = corr_summary['best_lag'] * -1

#Filter out NaN values
corr_summary = corr_summary[corr_summary['best_corr'].notna()]

To begin, we will look at a couple of options to display lag density or count for all counties. This will display the lag with the highest correlation found in our function applied at a county group level.

In [10]:
#Using alt chart to display lines instead of density to improve the data to ink ratio and not hide the large values
alt.Chart(corr_summary).mark_bar(size=12).encode(
    alt.X("best_lag:Q", bin=alt.Bin(step=1), axis=alt.Axis(grid=False, format='.0f'), title='Months From Observation'),
    alt.Y('count()', axis=alt.Axis(grid=False), title='Count')
).properties(
    width=800,
    height=300,
    title='Count of Best Lags'
)

In [11]:
corr_summary_36 = data.groupby('county_name').apply(get_corr_values,
                                                 var1='units_norm',
                                                 var2='SALE_PRICE_NORM',
                                                 period_lag=36)

#Flipping axis so time lag period can be read left to right
corr_summary_36['best_lag'] = corr_summary_36['best_lag'] * -1

#Filter out NaN values
corr_summary_36 = corr_summary_36[corr_summary_36['best_corr'].notna()]



In [12]:
#Using alt chart to display lines instead of density to improve the data to ink ratio and not hide the large values
alt.Chart(corr_summary_36).mark_bar(size=12).encode(
    alt.X("best_lag:Q", bin=alt.Bin(step=1), axis=alt.Axis(grid=False, format='.0f'), title='Months From Observation'),
    alt.Y('count()', axis=alt.Axis(grid=False), title='Count')
).properties(
    width=800,
    height=300,
    title='Count of Best Lags'
)

Using corr_summary or corr_summary_36, both result in greater impact in future. This pattern seems indicative of not actual improvement, but rather underlying changes in the real estate markets.

For example, since both have best lags the further out we look, we should be suspect that what we are finding is an actual relationship between permits and price. While it is unclear what might be causing such a skew in the strength of our lagged correlation, it emphasizes that we would need to better control for exogenous variables and cannot solely rely on a correlation to give us any reasonable support behind our initial expectations.


### Total correlation by time period

Next we will analyze the overall correlation lag strength by plotting the strongest lag's correlation result over time. To do this, we simply input slight modification to our get_corr_values function to group at the DATE level.

In [13]:
total_corr = data.groupby('DATE', group_keys=False).apply(get_corr_values,
                                                          var1='units_norm',
                                                          var2='SALE_PRICE_NORM').reset_index()

total_corr.sort_values('best_corr', ascending=True)

Unnamed: 0,DATE,best_corr,best_lag
67,2017-08-01,-0.094894,-12.0
38,2015-03-01,-0.091757,-24.0
0,2012-01-01,-0.088793,-16.0
119,2021-12-01,-0.087780,-12.0
27,2014-04-01,-0.083480,-12.0
...,...,...,...
144,2024-01-01,0.006128,-20.0
120,2022-01-01,0.009439,-22.0
133,2023-02-01,0.011359,-17.0
125,2022-06-01,0.013352,-14.0


In [14]:
total_corr['DATE'] = pd.to_datetime(total_corr['DATE'])

bars = alt.Chart(total_corr).mark_bar(size=1.5).encode(
    x=alt.X('yearmonth(DATE):O',
            title='Date'),
    y=alt.Y('best_corr:Q',
            title='Best Low Correlation',
            scale=alt.Scale(domain=[total_corr['best_corr'].min() * 1.1, total_corr['best_corr'].max() * 1.1])),
    color=alt.condition(
        'datum.best_corr < 0',
        alt.value('#d95f02'),
        alt.value('#1f77b4')
    )
)

dots = alt.Chart(total_corr).mark_point(filled=True, size=40).encode(
    x=alt.X('yearmonth(DATE):O',
            title=''),
    y=alt.Y('best_corr:Q'),
    color=alt.condition(
        'datum.best_corr < 0',
        alt.value('#d95f02'),
        alt.value('#1f77b4')
    )
)

chart = (bars + dots).properties(
    width=800,
    height=300,
    title='Dot Plot of Best Low Correlation by Date'
)

chart

Above we see that amongst negative correlations alone, we do not see any consistently strong relationship between permits and price at the aggregate date level. Even the strongest lagged values only result in a -0.08 to -0.10 correlation, which is not indicative of a strong relationship at all.

Next we will show the impact of including positive correlations. Which based on the results of the poor negative correlation, would likely show evidence of demand pressure even in the event of increased development. The goal here will be to explicitly show how much further analysis needs to be done in a causal framework to produce any objective results on the relationship between permits and pricing.

In [15]:
#Modifying original function to look at the absolute value of the corr rather than just the negative corr
def get_corr_values(group_level, var1, var2, period_lag=24):

  best_corr = None
  best_lag = None

  for lag in range(-period_lag, 0):
    corr = group_level[var1].corr(group_level[var2].shift(lag))
    if pd.notnull(corr) and (best_corr is None or np.abs(corr) > best_corr):
      best_corr = corr
      best_lag = int(lag)

  return pd.Series(
      {
          'best_corr': best_corr,
          'best_lag': best_lag
      }
  )

In [16]:
total_corr = data.groupby('DATE', group_keys=False).apply(get_corr_values,
                                                 var1='units_norm',
                                                 var2='SALE_PRICE_NORM').reset_index()

total_corr.sort_values('best_corr', ascending=True)

Unnamed: 0,DATE,best_corr,best_lag
70,2017-11-01,-0.064149,-1.0
155,2024-12-01,0.029721,-16.0
8,2012-09-01,0.032839,-1.0
130,2022-11-01,0.033650,-18.0
149,2024-06-01,0.034271,-4.0
...,...,...,...
72,2018-01-01,0.144340,-6.0
83,2018-12-01,0.149166,-15.0
40,2015-05-01,0.170779,-5.0
37,2015-02-01,0.173187,-7.0


In [17]:
total_corr['DATE'] = pd.to_datetime(total_corr['DATE'])

bars = alt.Chart(total_corr).mark_bar(size=1.5).encode(
    x=alt.X('yearmonth(DATE):O',
            title='Date'),
    y=alt.Y('best_corr:Q',
            title='Best Correlation',
            scale=alt.Scale(domain=[total_corr['best_corr'].min() * 1.1, total_corr['best_corr'].max() * 1.1])),
    color=alt.condition(
        'datum.best_corr < 0',
        alt.value('#d95f02'),
        alt.value('#1f77b4')
    )
)

dots = alt.Chart(total_corr).mark_point(filled=True, size=40).encode(
    x=alt.X('yearmonth(DATE):O',
            title=''),
    y=alt.Y('best_corr:Q'),
    color=alt.condition(
        'datum.best_corr < 0',
        alt.value('#d95f02'),
        alt.value('#1f77b4')
    )
)

chart = (bars + dots).properties(
    width=800,
    height=300,
    title='Dot Plot of Best Correlations by Date'
)

chart

With the strongest relationships being almost exclusively positive (i.e., increase in units leading to increase in price), we see the expected result that many factors are driving up price that need to be controlled for via causal analysis before we can obtain an understanding of permits on price.

As seen above, we have nearly all positive values with two negative values. A likely explaination could be found in simply rising prices in high demand areas (positive corr) or lowering pricing in low demand areas (negative corr) rather capturing any true impact of permits & development on pricing.

This seemingly demand driven environment would indicate the opposite kind of relationship than what we initially expected. Instead of permits and development functioning as a mechanism to reduce inequality, it appears that it may be a response to increasing demand along with various other demand driven factors.