# Zillow8 - GTO

## Add dummies for regression
________

Zillow publishes series for 3 different tiers: 
- bottom: 5th to 35th percentile
- middle: 35th to 65th percentile (“typical” home, the “flagship” ZHVI)
- top: 65th to 95th percentile

I will first test whether the middle tier captures prices affected by GTO regulation.

#### In this notebook I add dummy columns for the regression

Step 1: For each **wave**, get from GTO_table:
- date
- FIPS treated

Step 2: On prices from Zillow:
- add a column for each wave, marking treated or not treated
- add a column for each wave with periods centered around the treatment date: -2, -1, 0, 1, 2...
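
The two steps above can be sketched on a toy panel. This is only an illustration under assumptions: the toy counties, the `toy` frame and the single hard-coded wave are made up, and the event-time convention (0 is the last month-end before the effective date, so the first treated month is 1) mirrors what the check table near the end of this notebook shows.

```python
import pandas as pd

# Toy panel (hypothetical): two counties observed at six month-ends.
dates = pd.to_datetime(["2016-01-31", "2016-02-29", "2016-03-31",
                        "2016-04-30", "2016-05-31", "2016-06-30"])
toy = pd.DataFrame({
    "FIPS": ["36061"] * 6 + ["01001"] * 6,
    "Date": list(dates) * 2,
})

# Step 1: what is needed from the GTO table for one wave.
treated_fips = ["36061"]                 # counties covered by the wave
effective = pd.Timestamp("2016-03-01")   # effective date of the wave

# Step 2a: dummy = treated county AND date on/after the effective date.
is_treated = toy["FIPS"].isin(treated_fips) & (toy["Date"] >= effective)
toy["w_1"] = is_treated.astype(int)

# Step 2b: months relative to treatment; +1 so that the first treated
# month-end is 1 and the last untreated month-end is 0.
toy["t_1"] = ((toy["Date"].dt.year - effective.year) * 12
              + (toy["Date"].dt.month - effective.month) + 1)

print(toy)
```

Computing the event time from the dates themselves means it does not depend on row order or on every county having the same number of observations.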

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
from linearmodels.panel import PanelOLS
import statsmodels.api as sm
from linearmodels.panel import compare

# add save image function
%run functions/save_image_plotly.ipynb

pd.options.display.float_format = '{:.2f}'.format

Function save_plotly(figure, filename)



## Get Info from GTO table

In [2]:
# short table

GTO_table = pd.read_csv(r'.\output\GTO_table.csv', index_col=0)  # raw string: backslashes are not escape sequences
GTO_table.head(4)

Unnamed: 0,wave_1,wave_2,wave_3,wave_4,wave_5,wave_6,wave_7
date issued,2016-01-13,2016-07-22,2017-02-21,2017-08-21,2018-02-21,2018-11-15,2019-05-14
date enforcement,2016-03-01,2016-08-28,2017-02-24,2017-08-23,2018-02-24,2018-11-17,2019-05-16
36061,3000000.0,3000000.0,3000000.0,3000000.0,3000000.0,300000,300000
12086,1000000.0,1000000.0,1000000.0,1000000.0,1000000.0,300000,300000


In [3]:
# long table - workhorse
GTO = pd.read_csv(r'.\output\GTO_table_long.csv',dtype={'FIPS': object,  'threshold':'Int64'})
GTO = GTO[GTO.threshold.notna()]
GTO

Unnamed: 0,FIPS,wave,threshold,effective,date issued,date enforcement
0,36061,wave_1,3000000,2016-03-01,2016-01-13,2016-03-01
1,12086,wave_1,1000000,2016-03-01,2016-01-13,2016-03-01
22,36061,wave_2,3000000,2016-09-01,2016-07-22,2016-08-28
23,12086,wave_2,1000000,2016-09-01,2016-07-22,2016-08-28
24,48029,wave_2,500000,2016-09-01,2016-07-22,2016-08-28
...,...,...,...,...,...,...
149,32003,wave_7,300000,2019-06-01,2019-05-14,2019-05-16
150,53033,wave_7,300000,2019-06-01,2019-05-14,2019-05-16
151,25025,wave_7,300000,2019-06-01,2019-05-14,2019-05-16
152,25017,wave_7,300000,2019-06-01,2019-05-14,2019-05-16


In [4]:
# wave dates
GTO_dates = pd.read_csv(r'.\output\GTO_dates.csv', index_col=0, dtype=object)
GTO_dates

Unnamed: 0,date issued,date enforcement,effective
wave_1,2016-01-13,2016-03-01,2016-03-01
wave_2,2016-07-22,2016-08-28,2016-09-01
wave_3,2017-02-21,2017-02-24,2017-03-01
wave_4,2017-08-21,2017-08-23,2017-09-01
wave_5,2018-02-21,2018-02-24,2018-03-01
wave_6,2018-11-15,2018-11-17,2018-01-01
wave_7,2019-05-14,2019-05-16,2019-06-01


## Work on Zillow clean data

### Step 1 - Load data

In [5]:
# Load clean price data (FIPS as strings; prices as 'Int64', an integer dtype that accepts NaN)
df_prices_all = (pd.read_csv('../output/df_long_allCountyPrices.csv', 
                             dtype={'FIPS': object, 'price':'Int64'}))

df_prices_top = (pd.read_csv('../output/df_long_TOPCountyPrices.csv',
                            dtype={'FIPS': object, 'price':'Int64'}))
                            
df_sales = (pd.read_csv('../output/df_long_SaleCountsCounty.csv', 
                       dtype={'FIPS': object}))

### Step 2 - Restrict time span 
    
There is a trade-off between a longer history and the number of complete series. I decided to start 2 years before the first treatment (2014). I drop dates before this initial date and then drop FIPS for which the time series is incomplete.
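
To make the trade-off concrete, here is a minimal sketch that counts how many counties keep a complete price series for each candidate start date. The data and the helper `complete_counties` are hypothetical; dates are compared as ISO strings, as in the rest of this notebook.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): three counties, some with missing early prices.
dates = ["2013-01-31", "2013-07-31", "2014-01-31", "2014-07-31"]
toy = pd.DataFrame({
    "FIPS": ["A"] * 4 + ["B"] * 4 + ["C"] * 4,
    "Date": dates * 3,
    "price": [100, 101, 102, 103,
              np.nan, 201, 202, 203,
              np.nan, np.nan, np.nan, 303],
})

def complete_counties(df, start_date):
    """Number of counties with no missing price on or after start_date."""
    window = df[df["Date"] >= start_date]
    incomplete = window.loc[window["price"].isna(), "FIPS"].unique()
    return window.loc[~window["FIPS"].isin(incomplete), "FIPS"].nunique()

# A later start date keeps more counties but gives a shorter pre-treatment history.
counts = {start: complete_counties(toy, start) for start in dates}
print(counts)
```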

In [6]:
# use copy
df = df_prices_all.copy()

## restrict time span, then drop FIPS that still have missing prices within it
start_date = '2014-01-31'
df = df[df.Date>=start_date]
FIPS_incomplete = df.FIPS[df.price.isna()].unique()
df = df[~df.FIPS.isin(FIPS_incomplete)]
df

Unnamed: 0,RegionName,State,FIPS,Date,price,logprice,pct_change
477624,Nantucket County,MA,25019,2014-01-31,1202768,14.00,0.00
477625,San Francisco County,CA,06075,2014-01-31,949971,13.76,0.01
477626,San Mateo County,CA,06081,2014-01-31,898062,13.71,0.01
477627,Pitkin County,CO,08097,2014-01-31,939684,13.75,0.01
477628,Santa Clara County,CA,06085,2014-01-31,846662,13.65,0.01
...,...,...,...,...,...,...,...
690843,Allendale County,SC,45005,2020-03-31,35684,10.48,0.00
690844,Ontonagon County,MI,26131,2020-03-31,35014,10.46,-0.00
690845,Tillman County,OK,40141,2020-03-31,33816,10.43,0.00
690847,Phillips County,AR,05107,2020-03-31,33247,10.41,0.00


### Step 3 - Merge with "TOP prices" and "sales" tables

#### prepare names and formats to merge

In [7]:
# add suffix
(df.rename(columns={'price':'price_mid', 
                    'logprice':'logprice_mid', 
                    'pct_change':'price_pct_mid'},
                    inplace=True))

In [8]:
# change col names of top prices table before merge
(df_prices_top.rename(columns={'price':'price_TOP', 
                               'logprice':'logprice_TOP', 
                               'pct_change':'price_pct_TOP'},
                                inplace=True))

In [9]:
# SALES table: date format is YYYY-MM. Rename its date column to Date2 and add a column
# with the same format to the main table, to use as the merge key

# add column to main df with YEAR-MONTH only (without DAY) to match the sales table date format
df['Date2'] = df.Date.str.slice(stop=7)
df_sales.rename(columns={'Date':'Date2'}, inplace=True)

#### Merge

In [10]:
# merge top prices
df = pd.merge(df,df_prices_top.iloc[:,2:], on=['FIPS', 'Date'], how='left' )

# merge sales
df = pd.merge(df,df_sales.iloc[:,2:], on=['FIPS', 'Date2'], how='left' )

In [11]:
df.drop('Date2', axis=1,inplace=True)
df

Unnamed: 0,RegionName,State,FIPS,Date,price_mid,logprice_mid,price_pct_mid,price_TOP,logprice_TOP,price_pct_TOP,counts,logcounts,pct_change
0,Nantucket County,MA,25019,2014-01-31,1202768,14.00,0.00,2481178,14.72,0.00,19.00,2.94,-0.27
1,San Francisco County,CA,06075,2014-01-31,949971,13.76,0.01,1519727,14.23,0.01,367.00,5.91,-0.39
2,San Mateo County,CA,06081,2014-01-31,898062,13.71,0.01,1526532,14.24,0.01,442.00,6.09,-0.19
3,Pitkin County,CO,08097,2014-01-31,939684,13.75,0.01,3762774,15.14,0.01,26.00,3.26,-0.47
4,Santa Clara County,CA,06085,2014-01-31,846662,13.65,0.01,1405253,14.16,0.01,886.00,6.79,-0.29
...,...,...,...,...,...,...,...,...,...,...,...,...,...
199570,Allendale County,SC,45005,2020-03-31,35684,10.48,0.00,71596,11.18,-0.00,,,0.00
199571,Ontonagon County,MI,26131,2020-03-31,35014,10.46,-0.00,66739,11.11,-0.01,2.00,0.69,1.00
199572,Tillman County,OK,40141,2020-03-31,33816,10.43,0.00,82852,11.32,0.00,2.00,0.69,0.00
199573,Phillips County,AR,05107,2020-03-31,33247,10.41,0.00,80264,11.29,0.00,1.00,0.00,-0.94


#### Missing data on sales
Notice that:
- many county series are absent, 3 of which happen to be treated counties in NY
- many series for other counties are partially incomplete too

Sales data series are based on the closing date recorded on the county deed. For more recent data, Zillow "nowcasts" the number of sales, using historical latency. In sum, take care when using recent sales data.
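
One way to see which counties are affected, rather than only how many, is to compute the share of missing sales observations per county after the left merge. This is a sketch on made-up data; the `prices` and `sales` frames and the `label` helper are hypothetical stand-ins for the real tables.

```python
import pandas as pd

# Toy tables (hypothetical): county C has no sales data, county B has a gap.
prices = pd.DataFrame({"FIPS": ["A", "A", "B", "B", "C", "C"],
                       "Date2": ["2014-01", "2014-02"] * 3})
sales = pd.DataFrame({"FIPS": ["A", "A", "B"],
                      "Date2": ["2014-01", "2014-02", "2014-01"],
                      "counts": [10, 12, 5]})

# Left merge keeps every price row; unmatched rows get NaN in 'counts'.
merged = prices.merge(sales, on=["FIPS", "Date2"], how="left")

# Share of months without sales data, per county.
share_missing = merged["counts"].isna().groupby(merged["FIPS"]).mean()

def label(share):
    """Classify a county by how much of its sales series is missing."""
    if share == 0:
        return "complete"
    if share == 1:
        return "all missing"
    return "partly missing"

coverage = share_missing.map(label)
print(coverage)
```

Filtering `coverage` on the treated FIPS would then show directly which treated counties cannot be used in a sales regression.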

In [12]:
no_sales_data = len(set(df.FIPS) -  set(df_sales.FIPS))
n_series = len(df.FIPS.unique())
incomplete_series = len(df.FIPS[df.counts.isna()].unique())

pd.DataFrame.from_dict({'Total ':n_series,
                '1+ obs. miss.':incomplete_series,
                'all obs. miss.':no_sales_data}, orient ='index', columns= ['number of FIPS'])

Unnamed: 0,number of FIPS
Total,2661
1+ obs. miss.,1380
all obs. miss.,211


______
### <font color=red>Step 4 - Add dummies column</font>

<font color=red>I kept counties already treated in previous waves. Is this correct, or should I keep each county only in its first wave? Could keeping them capture the effect of confirming beliefs that GTOs were not going to be lifted?
    
* 1st wave: high-end Manhattan, Miami
* 2nd wave: extended NYC, + scattered states
* 4th wave: Hawaii and WIRE TRANSFERS
* 6th wave: threshold to 300,000 </font>

* 0: not treated (or before treatment)
* 1: treated county, from the wave's effective date onward

In [13]:
col_init = {}  # per wave: first observed period relative to treatment (0 = last month before treatment)

# loop over all waves
for wave in GTO.wave.unique():
    
    # PART I: DUMMY COLUMNS
    #------------------------------------------------------------------------
    # boolean masks for FIPS treated and date after treatment
    bol_FIPS_wave = df.FIPS.isin(GTO.FIPS[GTO.wave==wave])
    bol_date_wave = df.Date >= GTO_dates.loc[wave,'effective']
    bol_treatment = bol_FIPS_wave & bol_date_wave 

    col_name = 'w_' + str(wave)[-1]
    # create dummy column
    df[col_name] = 0   # populate all with zero
    df.loc[bol_treatment, col_name] = 1  # single .loc call (no chained assignment): one if treated
    
    #######
    
    # PART II: BUILD DICTIONARY waves x number of periods until treatment
    #------------------------------------------------------------------------
    # take advantage of the dummy column built above to find the treatment position
    bol_FIPS_wave_treated1 = (df.FIPS == GTO.FIPS[GTO.wave==wave].iloc[0]) # first treated FIPS only

    # number of periods observed (I will keep it like this if we want "unbalanced panel")
    n_obs = bol_FIPS_wave_treated1.sum()

    # number of periods treated
    n_treat = df.loc[bol_FIPS_wave_treated1, col_name].sum()

    # first observed period relative to treatment
    col_init[wave] = -(n_obs-n_treat-1)



In [14]:
# check treated cells got 1
df[bol_treatment] # yes, it works

Unnamed: 0,RegionName,State,FIPS,Date,price_mid,logprice_mid,price_pct_mid,price_TOP,logprice_TOP,price_pct_TOP,counts,logcounts,pct_change,w_1,w_2,w_3,w_4,w_5,w_6,w_7
172966,San Francisco County,CA,06075,2019-06-30,1399620,14.15,-0.00,2097136,14.56,-0.00,430.00,6.06,-0.34,0,1,1,1,1,1,1
172967,San Mateo County,CA,06081,2019-06-30,1355695,14.12,-0.00,2219397,14.61,-0.00,642.00,6.46,-0.08,0,1,1,1,1,1,1
172969,Santa Clara County,CA,06085,2019-06-30,1189639,13.99,-0.01,2032043,14.52,-0.01,1452.00,7.28,-0.15,0,1,1,1,1,1,1
172971,New York County,NY,36061,2019-06-30,1040443,13.86,-0.00,1950856,14.48,-0.00,1203.00,7.09,0.30,1,1,1,1,1,1,1
172985,Honolulu County,HI,15003,2019-06-30,706861,13.47,-0.00,1067209,13.88,-0.00,859.00,6.76,-0.02,0,0,0,1,1,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
197215,Broward County,FL,12011,2020-03-31,287440,12.57,0.00,470623,13.06,0.00,,,0.00,0,1,1,1,1,1,1
197358,Cook County,IL,17031,2020-03-31,246806,12.42,0.00,429231,12.97,-0.00,4368.00,8.38,-0.02,0,0,0,0,0,1,1
197468,Tarrant County,TX,48439,2020-03-31,229287,12.34,0.00,354004,12.78,0.00,,,0.00,0,0,0,0,0,1,1
197521,Dallas County,TX,48113,2020-03-31,221970,12.31,0.00,404081,12.91,0.00,,,0.00,0,0,0,0,0,1,1


### <font color=red>  Step 5 - dummies only for 1st time treated</font>

<font color=red> For each wave, add dummies only for counties treated for the first time (later periods are considered a "continuation"), except for the waves below, where all counties are included because the treatment changed:
    
* 4th wave: wire transfers 
* 6th wave: lower threshold </font>
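
As a sketch of this rule, the snippet below tracks every county seen in any earlier wave, so that "first time" means never treated before. Note the assumption: this is a stricter variant than comparing only with the immediately preceding wave, and the wave contents and the names `fips_by_wave` and `changed_waves` are hypothetical.

```python
# Hypothetical wave -> treated FIPS lists (not the real GTO table).
fips_by_wave = {
    "wave_1": ["36061", "12086"],
    "wave_2": ["36061", "12086", "48029"],
    "wave_3": ["36061", "12086", "48029"],
    "wave_4": ["36061", "12086", "48029", "15003"],
}
# Waves where the treatment itself changed: keep all counties.
changed_waves = {"wave_4"}

seen = set()          # every county treated in any earlier wave
new_treatment = {}
for wave, fips in fips_by_wave.items():
    if wave in changed_waves:
        new_treatment[wave] = sorted(fips)
    else:
        new_treatment[wave] = sorted(set(fips) - seen)
    seen.update(fips)

print(new_treatment)
```

The first wave needs no special case here because `seen` starts empty, and waves with no new counties simply get an empty list.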

In [15]:
## create list of FIPS treated for the first time 
## change of treatment also considered as "first time"

FIPS = []
FIPS_first_time = []
wave_new_treatment = dict()

# list of FIPS treated per wave
for wave in GTO.wave.unique():
    FIPS.append(list(GTO.FIPS[GTO.wave==wave]))
    
for i in range(len(FIPS)):
    if i in [0,3,5]: # keep ALL FIPS for wave 1, wave 4 and wave 6   (TODO: I think wave 2 is missing here)
        wave_new_treatment['wave_' + str(i+1)] = list(FIPS[i])
    else: # keep only FIPS not treated in the previous wave
        wave_new_treatment['wave_' + str(i+1)] = list(set(FIPS[i])-set(FIPS[i-1]))

In [16]:
# loop over all waves
for wave in wave_new_treatment:
    
    # PART III: DUMMY FOR NEW TREATMENT ONLY
    #------------------------------------------------------------------------
    
    # add dummy column only when the wave has newly treated counties
    if wave_new_treatment[wave]:
    
        # boolean masks for FIPS treated and date after treatment
        bol_FIPS_wave = df.FIPS.isin(wave_new_treatment[wave])
        bol_date_wave = df.Date >= GTO_dates.loc[wave,'effective']
        bol_treatment = bol_FIPS_wave & bol_date_wave 

        col_name = 'ww_' + str(wave)[-1]
        # create dummy column
        df[col_name] = 0   # populate all with zero
        df.loc[bol_treatment, col_name] = 1  # single .loc call (no chained assignment): one if treated



In [17]:
df[df.FIPS=='36061'] # yes, it works

Unnamed: 0,RegionName,State,FIPS,Date,price_mid,logprice_mid,price_pct_mid,price_TOP,logprice_TOP,price_pct_TOP,...,w_2,w_3,w_4,w_5,w_6,w_7,ww_1,ww_2,ww_4,ww_6
6,New York County,NY,36061,2014-01-31,858948,13.66,0.01,1764128,14.38,0.01,...,0,0,0,0,0,0,0,0,0,0
2667,New York County,NY,36061,2014-02-28,867685,13.67,0.01,1779495,14.39,0.01,...,0,0,0,0,0,0,0,0,0,0
5328,New York County,NY,36061,2014-03-31,879214,13.69,0.01,1800970,14.40,0.01,...,0,0,0,0,0,0,0,0,0,0
7989,New York County,NY,36061,2014-04-30,889454,13.70,0.01,1822060,14.42,0.01,...,0,0,0,0,0,0,0,0,0,0
10650,New York County,NY,36061,2014-05-31,901455,13.71,0.01,1845234,14.43,0.01,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
186276,New York County,NY,36061,2019-11-30,1009455,13.82,-0.00,1890977,14.45,-0.00,...,1,1,1,1,1,1,1,0,1,1
188937,New York County,NY,36061,2019-12-31,1008778,13.82,-0.00,1888177,14.45,-0.00,...,1,1,1,1,1,1,1,0,1,1
191598,New York County,NY,36061,2020-01-31,1007546,13.82,-0.00,1884160,14.45,-0.00,...,1,1,1,1,1,1,1,0,1,1
194259,New York County,NY,36061,2020-02-29,1005255,13.82,-0.00,1878032,14.45,-0.00,...,1,1,1,1,1,1,1,0,1,1


______
### Step 6 - Add column with periods centered at treatment = 0

In [18]:
# set multiindex 
df = df.sort_values(by=['FIPS', 'Date']).set_index(['FIPS','Date'])
df

FIPS,Date,RegionName,State,price_mid,logprice_mid,price_pct_mid,price_TOP,logprice_TOP,price_pct_TOP,counts,logcounts,...,w_2,w_3,w_4,w_5,w_6,w_7,ww_1,ww_2,ww_4,ww_6
01001,2014-01-31,Autauga County,AL,140007,11.85,0.00,217585,12.29,0.00,,,...,0,0,0,0,0,0,0,0,0,0
01001,2014-02-28,Autauga County,AL,140229,11.85,0.00,217863,12.29,0.00,,,...,0,0,0,0,0,0,0,0,0,0
01001,2014-03-31,Autauga County,AL,140479,11.85,0.00,218157,12.29,0.00,,,...,0,0,0,0,0,0,0,0,0,0
01001,2014-04-30,Autauga County,AL,140779,11.85,0.00,218551,12.29,0.00,,,...,0,0,0,0,0,0,0,0,0,0
01001,2014-05-31,Autauga County,AL,141139,11.86,0.00,219047,12.30,0.00,,,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
56045,2019-11-30,Weston County,WY,183719,12.12,0.00,273364,12.52,0.00,6.00,1.79,...,0,0,0,0,0,0,0,0,0,0
56045,2019-12-31,Weston County,WY,183927,12.12,0.00,273331,12.52,-0.00,4.00,1.39,...,0,0,0,0,0,0,0,0,0,0
56045,2020-01-31,Weston County,WY,184089,12.12,0.00,273177,12.52,-0.00,3.00,1.10,...,0,0,0,0,0,0,0,0,0,0
56045,2020-02-29,Weston County,WY,184259,12.12,0.00,273281,12.52,0.00,1.00,0.00,...,0,0,0,0,0,0,0,0,0,0


In [19]:
n_FIPS = len(df.index.get_level_values(0).unique())
n_FIPS # 2661 counties

2661

In [20]:
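# n_obs (months per county) comes from the dummy loop above; tiling the same range
# over all counties assumes a balanced panel sorted by FIPS and Date
for wave in GTO.wave.unique():
    df['t_' + wave[-1]] = list(range(col_init[wave],n_obs+col_init[wave])) * n_FIPS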
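
Because the same range of periods is repeated for every county, the assignment only lines up when each county has exactly `n_obs` rows. A minimal sanity check on toy data, assuming a hypothetical starting offset of -1 and the same `(FIPS, Date)` index as above:

```python
import pandas as pd

# Toy panel (hypothetical): 2 counties x 4 months, indexed like the notebook.
idx = pd.MultiIndex.from_product(
    [["01001", "36061"],
     ["2016-01-31", "2016-02-29", "2016-03-31", "2016-04-30"]],
    names=["FIPS", "Date"],
)
toy = pd.DataFrame({"w_1": [0, 0, 0, 0, 0, 0, 1, 1]}, index=idx)

n_obs = 4     # months per county
start = -1    # hypothetical offset: first row is one period before t = 0
toy["t_1"] = list(range(start, n_obs + start)) * 2

# Check 1: every county must have exactly n_obs rows for the tiling to be valid.
sizes = toy.groupby(level="FIPS").size()
balanced = bool((sizes == n_obs).all())

# Check 2: for treated counties, the first month with dummy = 1 should be t = 1.
first_treated_t = toy.loc[toy["w_1"] == 1].groupby(level="FIPS")["t_1"].min()

print(balanced)
print(first_treated_t)
```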


______
### Step 7 -  Check if columns are correctly assigned

In [21]:
# check treated cells got 1: example NY county (Manhattan), around treatment dates
## It works.... 

idx = pd.IndexSlice # to use multiindex slice from a list or pandasSeries
df.loc[idx['36061','2016-01-31':'2018-01-31'], :].iloc[:,-18:]

FIPS,Date,w_1,w_2,w_3,w_4,w_5,w_6,w_7,ww_1,ww_2,ww_4,ww_6,t_1,t_2,t_3,t_4,t_5,t_6,t_7
36061,2016-01-31,0,0,0,0,0,0,0,0,0,0,0,-1,-7,-13,-19,-25,-23,-40
36061,2016-02-29,0,0,0,0,0,0,0,0,0,0,0,0,-6,-12,-18,-24,-22,-39
36061,2016-03-31,1,0,0,0,0,0,0,1,0,0,0,1,-5,-11,-17,-23,-21,-38
36061,2016-04-30,1,0,0,0,0,0,0,1,0,0,0,2,-4,-10,-16,-22,-20,-37
36061,2016-05-31,1,0,0,0,0,0,0,1,0,0,0,3,-3,-9,-15,-21,-19,-36
36061,2016-06-30,1,0,0,0,0,0,0,1,0,0,0,4,-2,-8,-14,-20,-18,-35
36061,2016-07-31,1,0,0,0,0,0,0,1,0,0,0,5,-1,-7,-13,-19,-17,-34
36061,2016-08-31,1,0,0,0,0,0,0,1,0,0,0,6,0,-6,-12,-18,-16,-33
36061,2016-09-30,1,1,0,0,0,0,0,1,0,0,0,7,1,-5,-11,-17,-15,-32
36061,2016-10-31,1,1,0,0,0,0,0,1,0,0,0,8,2,-4,-10,-16,-14,-31


In [22]:
# df.index = df.index.set_levels([df.index.levels[0], pd.to_datetime(df.index.levels[1])])

### Step 8 - Save to local

In [23]:
# save local
df.to_csv('../output/tbl_reg.csv')