# Difference-in-Differences Model in R

By: Traci Lim

This notebook uses a difference-in-differences (DiD) model to estimate the causal effect of the opening of Singapore's Downtown Line (DTL). We seek a (hopefully) unbiased estimate of the effect of a policy or treatment on a dependent variable, here the average price per sqm of resale flats. In this case, the treatment is the opening of the Downtown Line.

We check visually that the parallel trends assumption holds, using the plot in the ```4_DiD_plot.ipynb``` file.

The treatment group consists of flats within a 1 km geodesic radius of the nearest DTL station. The control group consists of flats outside that 1 km radius. Both groups cover transactions from 2010 onwards.

This notebook was done after the submission of the time-sensitive case study.

---

In [6]:
within_1km_dt_station_df <- read.csv("data/within_1km_dt_station_df.csv", header = TRUE, stringsAsFactors = FALSE)
not_within_1km_dt_station_df <- read.csv("data/not_within_1km_dt_station_df.csv", header = TRUE, stringsAsFactors = FALSE)

In [4]:
head(not_within_1km_dt_station_df)

X,block,flat_model,flat_type,floor_area_sqm,lease_commence_date,month,resale_price,storey_range,street_name,...,multistorey_carpark,precinct_pavilion,year,distance_to_city,year_month,distance_to_nearest_downtown_station,price_per_sqm,num_months_from_jan_2010,avg_price_per_sqm,log_avg_price_per_sqm
0,208,New Generation,3 ROOM,73,1976,1,304000,10 TO 12,ANG MO KIO AVE 1,...,N,N,2010,7.546926,2010-1,5.649688,4164.384,1,3668.978,8.207668
1,220,Adjoined flat,5 ROOM,134,1977,1,455000,07 TO 09,ANG MO KIO AVE 1,...,N,N,2010,7.489863,2010-1,5.53563,3395.522,1,3668.978,8.207668
2,319,New Generation,4 ROOM,98,1977,1,410000,04 TO 06,ANG MO KIO AVE 1,...,N,N,2010,7.68831,2010-1,5.5171,4183.673,1,3668.978,8.207668
3,319,New Generation,3 ROOM,73,1977,1,307000,04 TO 06,ANG MO KIO AVE 1,...,N,N,2010,7.68831,2010-1,5.5171,4205.479,1,3668.978,8.207668
4,306,Standard,5 ROOM,123,1977,1,505000,01 TO 03,ANG MO KIO AVE 1,...,N,N,2010,7.687628,2010-1,5.667862,4105.691,1,3668.978,8.207668
5,330,New Generation,3 ROOM,68,1981,1,269000,10 TO 12,ANG MO KIO AVE 1,...,,,2010,7.581604,2010-1,5.065413,3955.882,1,3668.978,8.207668


In [7]:
within_1km_dt_station_df$treated <- 1
not_within_1km_dt_station_df$treated <- 0

In [8]:
within_1km_dt_station_df$time <- ifelse(within_1km_dt_station_df$num_months_from_jan_2010 >= 48, 1, 0)
not_within_1km_dt_station_df$time <- ifelse(not_within_1km_dt_station_df$num_months_from_jan_2010 >= 48, 1, 0)

In [10]:
within_1km_dt_station_df$did <- within_1km_dt_station_df$treated * within_1km_dt_station_df$time
not_within_1km_dt_station_df$did <- not_within_1km_dt_station_df$treated * not_within_1km_dt_station_df$time
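
The cutoff of 48 used for ```time``` can be checked with a little arithmetic. The sketch below assumes that ```num_months_from_jan_2010``` is a 1-based month index (January 2010 = 1, as the ```head()``` output suggests) and that stage 1 of the DTL opened in December 2013. ```month_index``` and ```toy``` are hypothetical names used only for illustration; they are not part of the original notebook.

```r
# Hypothetical helper: 1-based month count starting at January 2010.
month_index <- function(year, month) {
  (year - 2010) * 12 + month
}

# Assumed opening month of DTL stage 1: December 2013.
cutoff <- month_index(2013, 12)

# Toy data: transactions before, at, and after the cutoff (values made up).
toy <- data.frame(
  num_months_from_jan_2010 = c(month_index(2010, 1), cutoff, month_index(2015, 6)),
  treated = c(1, 1, 0)
)

# Same construction as in the cells above, written with as.integer().
toy$time <- as.integer(toy$num_months_from_jan_2010 >= cutoff)
toy$did  <- toy$treated * toy$time
print(toy)
```

Under these assumptions, a transaction dated December 2013 is counted as post-treatment, and ```did``` is 1 only for treated flats sold on or after the cutoff.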

In [11]:
head(not_within_1km_dt_station_df)

X,block,flat_model,flat_type,floor_area_sqm,lease_commence_date,month,resale_price,storey_range,street_name,...,distance_to_city,year_month,distance_to_nearest_downtown_station,price_per_sqm,num_months_from_jan_2010,avg_price_per_sqm,log_avg_price_per_sqm,treated,time,did
0,208,New Generation,3 ROOM,73,1976,1,304000,10 TO 12,ANG MO KIO AVE 1,...,7.546926,2010-1,5.649688,4164.384,1,3668.978,8.207668,0,0,0
1,220,Adjoined flat,5 ROOM,134,1977,1,455000,07 TO 09,ANG MO KIO AVE 1,...,7.489863,2010-1,5.53563,3395.522,1,3668.978,8.207668,0,0,0
2,319,New Generation,4 ROOM,98,1977,1,410000,04 TO 06,ANG MO KIO AVE 1,...,7.68831,2010-1,5.5171,4183.673,1,3668.978,8.207668,0,0,0
3,319,New Generation,3 ROOM,73,1977,1,307000,04 TO 06,ANG MO KIO AVE 1,...,7.68831,2010-1,5.5171,4205.479,1,3668.978,8.207668,0,0,0
4,306,Standard,5 ROOM,123,1977,1,505000,01 TO 03,ANG MO KIO AVE 1,...,7.687628,2010-1,5.667862,4105.691,1,3668.978,8.207668,0,0,0
5,330,New Generation,3 ROOM,68,1981,1,269000,10 TO 12,ANG MO KIO AVE 1,...,7.581604,2010-1,5.065413,3955.882,1,3668.978,8.207668,0,0,0


In [12]:
df <- rbind(within_1km_dt_station_df, not_within_1km_dt_station_df)

In [14]:
dim(df)

In [19]:
str(df)

'data.frame':	685644 obs. of  35 variables:
 $ X                                   : int  0 1 2 3 4 5 6 7 8 9 ...
 $ block                               : chr  "522" "520" "519" "519" ...
 $ flat_model                          : chr  "New Generation" "New Generation" "New Generation" "New Generation" ...
 $ flat_type                           : chr  "3 ROOM" "3 ROOM" "3 ROOM" "3 ROOM" ...
 $ floor_area_sqm                      : num  67 67 67 67 92 67 92 68 92 119 ...
 $ lease_commence_date                 : int  1979 1979 1979 1979 1979 1979 1979 1980 1980 1978 ...
 $ month                               : int  1 1 1 1 1 1 1 1 1 1 ...
 $ resale_price                        : num  236000 245000 249000 242000 339000 252000 355000 260000 370000 503000 ...
 $ storey_range                        : chr  "01 TO 03" "07 TO 09" "04 TO 06" "10 TO 12" ...
 $ street_name                         : chr  "BEDOK NTH AVE 1" "BEDOK NTH AVE 1" "BEDOK NTH AVE 1" "BEDOK NTH AVE 1" ...
 $ town              

### Estimating the DiD estimator

In [25]:
didreg_1 <- lm(log_avg_price_per_sqm ~ treated + time + did, data = df)
summary(didreg_1)


Call:
lm(formula = log_avg_price_per_sqm ~ treated + time + did, data = df)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.187719 -0.022983 -0.007261  0.052923  0.149679 

Coefficients:
             Estimate Std. Error  t value Pr(>|t|)    
(Intercept) 8.3901195  0.0001393 60214.24   <2e-16 ***
treated     0.0105795  0.0003263    32.43   <2e-16 ***
time        0.0278074  0.0002050   135.63   <2e-16 ***
did         0.0388501  0.0004766    81.52   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.07642 on 685640 degrees of freedom
Multiple R-squared:  0.07719,	Adjusted R-squared:  0.07718 
F-statistic: 1.912e+04 on 3 and 685640 DF,  p-value: < 2.2e-16


The regression above uses log-average-price-per-sqm as the dependent variable. The average price per sqm is the mean price per sqm of all flats transacted in a given year-month.
The treatment variable ```treated``` is 1 if the flat falls within a 1 km radius of the nearest DTL station, and 0 otherwise.
```time``` is a time dummy equal to 1 if a transaction occurs on or after the start of stage 1 (start of DTL operations), and 0 otherwise.
The interaction variable ```did``` captures the treatment effect of stage 1 on log-average-price-per-sqm.

The coefficient on ```did``` is the difference-in-differences estimator. It is positive (0.0389) and statistically significant (p < 2e-16): the opening of the DTL is associated with an increase of roughly 3.9% in the average price per sqm of resale flats near a station, relative to the control group. The coefficient on ```time``` (0.0278) is also positive, which can be interpreted as follows: the average price per sqm of flats in the control group was about 2.8% higher after the opening of the DTL than before, a change that is common to both groups and unrelated to proximity to a DTL station.
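
In a regression with only ```treated```, ```time``` and their interaction, the ```did``` coefficient equals the difference between the before/after change in the treated group and the before/after change in the control group. The sketch below illustrates this on simulated data; all values are made up, and only the column names ```treated```, ```time``` and ```did``` follow the notebook.

```r
set.seed(1)

# Simulated toy data; values are illustrative, not from the resale dataset.
n <- 2000
toy <- data.frame(
  treated = rbinom(n, 1, 0.3),
  time    = rbinom(n, 1, 0.5)
)
toy$did <- toy$treated * toy$time

# Outcome with a built-in treatment effect of 0.04 on the log scale.
toy$y <- 8.39 + 0.01 * toy$treated + 0.03 * toy$time +
  0.04 * toy$did + rnorm(n, sd = 0.05)

# Cell means for the four treated-by-time groups.
m <- tapply(toy$y, list(toy$treated, toy$time), mean)

# (treated after - treated before) - (control after - control before)
did_by_hand <- (m["1", "1"] - m["1", "0"]) - (m["0", "1"] - m["0", "0"])

# The same quantity, estimated as the coefficient on `did`.
fit <- lm(y ~ treated + time + did, data = toy)
did_by_lm <- unname(coef(fit)["did"])

# Percentage effect implied by a log-scale coefficient.
pct_effect <- 100 * (exp(did_by_lm) - 1)

print(c(by_hand = did_by_hand, by_lm = did_by_lm, pct = pct_effect))
```

The two estimates agree up to floating-point error. For small coefficients, ```100 * (exp(b) - 1)``` is close to ```100 * b```, which is why a coefficient of 0.0389 is read as roughly 3.9%.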

A natural extension is a model that controls for housing-specific attributes (such as unit area, floor, property type, lease type) and location-related amenities (such as distance to the CBD, distance to the nearest school, distance to the nearest MRT station, etc.). Geodesic distance is also an imperfect measure of accessibility, as some flats can face physical constraints, such as a water body or an expressway blocking the fastest route to the station.
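
A minimal sketch of what a specification with controls might look like is shown below. It is not the notebook's model. It runs on simulated data, borrows column names that appear in the ```head()``` output (```floor_area_sqm```, ```flat_type```, ```lease_commence_date```, ```distance_to_city```), and assumes a transaction-level outcome, ```log_price_per_sqm```, in place of the monthly average. The names ```toy``` and ```didreg_2``` are hypothetical.

```r
set.seed(42)

# Simulated stand-in for `df`; column names follow the notebook,
# but every value below is made up for illustration.
n <- 1000
toy <- data.frame(
  treated             = rbinom(n, 1, 0.3),
  time                = rbinom(n, 1, 0.5),
  floor_area_sqm      = round(runif(n, 60, 130)),
  flat_type           = sample(c("3 ROOM", "4 ROOM", "5 ROOM"), n, replace = TRUE),
  lease_commence_date = sample(1975:2005, n, replace = TRUE),
  distance_to_city    = runif(n, 2, 15)
)

# Assumed outcome: transaction-level log price per sqm (not the monthly average).
toy$log_price_per_sqm <- 8.4 + 0.01 * toy$treated + 0.03 * toy$time +
  0.04 * toy$treated * toy$time -
  0.02 * toy$distance_to_city +
  0.003 * (toy$lease_commence_date - 1975) +
  rnorm(n, sd = 0.05)

# `treated * time` expands to treated + time + treated:time,
# so the interaction term plays the role of `did`.
didreg_2 <- lm(
  log_price_per_sqm ~ treated * time + floor_area_sqm + factor(flat_type) +
    lease_commence_date + distance_to_city,
  data = toy
)
print(coef(summary(didreg_2))["treated:time", ])
```

Writing the interaction inside the formula avoids maintaining a separate ```did``` column. Replacing geodesic distance with a route-based distance would require an external routing source and is not sketched here.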