# Do Duplexes Sell for Less per Square Foot than Single Family Homes?

## Hypotheses
Our null hypothesis is that duplexes do not sell for less per square foot than single family homes.
Our alternative hypothesis is that they do sell for less.

We will test at the 95% confidence level, so we set our alpha to .05.

#### Possible Errors:
If we make a type 1 error, we would claim that duplexes sell for less per square foot, when in reality they do not.

On the other hand, if we make a type 2 error, we would fail to detect that duplexes sell for less per square foot, when in fact they do.

In [150]:
# First we import the libraries we will be using.
import scipy.stats as stats
import statsmodels.stats.power as power
import pandas as pd
import numpy as np

## Load the data

In [151]:
salespath = r'C:\Users\caell\flatiron\projects\phase_2_project\phase_2_project_chicago-sf-seattle-ds-082420\data\EXTR_RPSale.csv'
parcelpath = r'C:\Users\caell\flatiron\projects\phase_2_project\phase_2_project_chicago-sf-seattle-ds-082420\data\EXTR_Parcel.csv'
residentialpath = r'C:\Users\caell\flatiron\projects\phase_2_project\phase_2_project_chicago-sf-seattle-ds-082420\data\EXTR_ResBldg.csv'
sales = pd.read_csv(salespath, encoding = 'ISO-8859-1')
parcels = pd.read_csv(parcelpath, encoding = 'ISO-8859-1')
residences = pd.read_csv(residentialpath, encoding = 'ISO-8859-1')


## Prepare the data

In [152]:
sales['pin'] = sales.Major.astype(str) + sales.Minor.astype(str)
sales.set_index('pin', inplace = True)
sales = sales[sales['DocumentDate'].astype(str).str.endswith('2019')]
parcels['pin'] = parcels.Major.astype(str) + parcels.Minor.astype(str)
parcels.set_index('pin', inplace = True)
residences['pin'] = residences.Major.astype(str) + residences.Minor.astype(str)
residences.set_index('pin', inplace = True)
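
One thing to watch when building `pin` this way: if `Major` and `Minor` are read as integers, leading zeros are lost and two different parcels can concatenate to the same string. Below is a minimal sketch of a zero-padded key. The widths (6 digits for `Major`, 4 for `Minor`) and the helper name `make_pin` are assumptions for illustration and should be checked against the data source's documentation.

```python
import pandas as pd

def make_pin(df, major_width=6, minor_width=4):
    """Build a fixed-width parcel key from the Major and Minor columns.

    The default widths are assumptions, not taken from the source data.
    """
    major = df['Major'].astype(str).str.zfill(major_width)
    minor = df['Minor'].astype(str).str.zfill(minor_width)
    return major + minor

# Hypothetical rows: naive concatenation gives '12345' for both parcels.
demo = pd.DataFrame({'Major': [12, 123], 'Minor': [345, 45]})
naive = demo.Major.astype(str) + demo.Minor.astype(str)
padded = make_pin(demo)

print(naive.tolist())
print(padded.tolist())
```

With padding, the two hypothetical parcels keep distinct keys, so joins on `pin` cannot silently match the wrong rows.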


## Join the tables and extract the features we want to compare.

In [153]:
duplexs = parcels[parcels['PresentUse'] == 3]
duplexs = pd.DataFrame(duplexs['SqFtLot'])
duplexs = duplexs.join(sales, how = 'inner')
duplexs = duplexs[['SqFtLot','SalePrice']]
duplexs = duplexs.join(residences, how='inner')
duplexs = duplexs[['SalePrice','SqFtTotLiving']]
duplexs = duplexs[duplexs['SalePrice'] > 0]
duplexs['cost_per_sqft'] = duplexs.SalePrice / duplexs.SqFtTotLiving

duplexs.head()

pin,SalePrice,SqFtTotLiving,cost_per_sqft
1025049027,760000,2260,336.283186
1025049153,875000,2000,437.5
1070050,1040000,1650,630.30303
1080030,675000,1650,409.090909
114200690,17255000,2080,8295.673077


In [154]:
singlefamily = parcels[parcels['PresentUse'] == 2]
singlefamily = pd.DataFrame(singlefamily['SqFtLot'])
singlefamily = singlefamily.join(sales, how = 'inner')
singlefamily = singlefamily[['SqFtLot','SalePrice']]
singlefamily = singlefamily.join(residences, how='inner')
singlefamily = singlefamily[['SalePrice','SqFtTotLiving']]
singlefamily = singlefamily[singlefamily['SalePrice'] > 0]
singlefamily['cost_per_sqft'] = singlefamily.SalePrice / singlefamily.SqFtTotLiving

singlefamily.head()

pin,SalePrice,SqFtTotLiving,cost_per_sqft
100023,520000,1100,472.727273
10004,813000,1720,472.674419
10012025,245600,1060,231.698113
10012045,425000,1310,324.427481
10030120,615000,2390,257.322176
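
The duplex and single family cells repeat the same steps with a different `PresentUse` code, so the pipeline could be wrapped in one function. The sketch below is one possible refactor, not the notebook's own code. The function name `price_per_sqft` and the tiny demo frames are hypothetical, and all three frames are assumed to share the same `pin` index.

```python
import pandas as pd

def price_per_sqft(parcels, sales, residences, present_use):
    """Sale price per square foot of living space for one PresentUse code."""
    # Keep only parcels with the requested use code.
    subset = parcels.loc[parcels['PresentUse'] == present_use, ['PresentUse']]
    # Inner joins on the shared 'pin' index drop parcels with no sale or building record.
    joined = (subset
              .join(sales[['SalePrice']], how='inner')
              .join(residences[['SqFtTotLiving']], how='inner'))
    # Discard zero-price records, as the cells above do.
    joined = joined.loc[joined['SalePrice'] > 0, ['SalePrice', 'SqFtTotLiving']].copy()
    joined['cost_per_sqft'] = joined['SalePrice'] / joined['SqFtTotLiving']
    return joined

# Hypothetical demo data, indexed by pin.
idx = pd.Index(['a', 'b', 'c'], name='pin')
demo_parcels = pd.DataFrame({'PresentUse': [3, 2, 3]}, index=idx)
demo_sales = pd.DataFrame({'SalePrice': [600000, 500000, 0]}, index=idx)
demo_residences = pd.DataFrame({'SqFtTotLiving': [2000, 1000, 1500]}, index=idx)

duplex_demo = price_per_sqft(demo_parcels, demo_sales, demo_residences, present_use=3)
print(duplex_demo)
```

Calling the same function with `present_use=2` would produce the single family table, which keeps the two groups from drifting apart if the preparation steps change.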


### Single family vs duplex sample size and sample means

In [155]:
print(f'{len(duplexs)} duplexes were sold in 2019, and {len(singlefamily)} single family homes were sold.')

print(f'The mean cost per sqft of our samples for single family homes is {singlefamily.cost_per_sqft.mean()}')
print(f'The mean cost per sqft of our samples for duplexes is {duplexs.cost_per_sqft.mean()}')

282 duplexes were sold in 2019, and 25653 single family homes were sold.
The mean cost per sqft of our samples for single family homes is 380.84841644179767
The mean cost per sqft of our samples for duplexes is 569.2470483466892


## Testing for statistical significance.

We will use a two-sample, one-tailed Welch's t-test to determine whether the difference in means is statistically significant. The critical t value tells us that we need a test statistic below -1.645 to conclude at the 95% confidence level that duplexes sell for less per square foot than single family homes. Equivalently, we are looking for a one-tailed p-value of .05 or less.

In [156]:
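# Lower-tail critical value at alpha = .05. With samples this large the t
# distribution is nearly normal, so the exact choice of df barely matters.
tcrit = stats.t.ppf(q=.05,df = len(duplexs) + len(singlefamily)-1)
# Welch's t-test (equal_var = False) does not assume equal variances.
tstat = stats.ttest_ind(duplexs['cost_per_sqft'],singlefamily['cost_per_sqft'],  equal_var = False)
# ttest_ind returns a two-sided p-value. Halving it gives the one-tailed
# p-value for the tail the statistic falls in, so for our lower-tailed
# alternative it is only valid when the statistic is negative.
print(f'critical stat is {tcrit}, stat is {tstat[0]} with a pvalue of {tstat[1]/2}')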

critical stat is -1.6449123847177736, stat is 1.692928658479199 with a pvalue of 0.045787490954498744


## We cannot reject the null hypothesis

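Our critical value, the threshold for concluding that duplexes sell for less per square foot than single family homes, is about -1.645. The test statistic would need to fall below it for us to reject the null hypothesis. In fact the test statistic is positive, about 1.69. The p-value of about 0.046 printed above is half of the two-sided p-value, which describes the upper tail where the statistic actually lies; for our lower-tailed alternative the p-value is about 1 - 0.046 = 0.954. We cannot reject our null hypothesis.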
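
A one-tailed Welch's test can also be requested directly, which avoids adjusting the two-sided p-value by hand. The sketch below uses synthetic data rather than the sales data, and it assumes SciPy 1.6 or later, where `ttest_ind` accepts the `alternative` argument. The names `group_a` and `group_b` are hypothetical stand-ins for the two `cost_per_sqft` columns.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)
# Synthetic stand-ins for the two cost_per_sqft columns (not the real data).
group_a = rng.normal(loc=110, scale=40, size=300)
group_b = rng.normal(loc=100, scale=20, size=3000)

# Default two-sided Welch's t-test.
two_sided = stats.ttest_ind(group_a, group_b, equal_var=False)
# alternative='less' tests H1: mean(group_a) < mean(group_b).
one_sided = stats.ttest_ind(group_a, group_b, equal_var=False, alternative='less')

print(two_sided.statistic, two_sided.pvalue)
print(one_sided.statistic, one_sided.pvalue)
```

When the statistic is positive, as it is for this synthetic sample, the lower-tailed p-value equals one minus half the two-sided p-value, so it is close to 1 rather than close to 0.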

### What are the chances we are wrong?
Let's check the power of our test: the chance that we would detect a lower average price per square foot for duplexes if it were really there.

In [157]:
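# Cohen's d for cost per square foot, using the pooled standard deviation.
n_duplex = len(duplexs)
n_single = len(singlefamily)
pooled_var = ((n_duplex - 1) * duplexs['cost_per_sqft'].var()
              + (n_single - 1) * singlefamily['cost_per_sqft'].var()) / (n_duplex + n_single - 2)
effect = (duplexs['cost_per_sqft'].mean() - singlefamily['cost_per_sqft'].mean()) / np.sqrt(pooled_var)
# alpha is the significance level (.05), and ratio is nobs2 / nobs1.
power.tt_ind_solve_power(effect_size = effect,
                         nobs1 = n_duplex,
                         alpha = .05,
                         ratio = n_single / n_duplex,
                         alternative = 'smaller')

Because the duplexes in our sample averaged more per square foot than the single family homes, the observed effect size is positive, and the power of a lower-tailed test at a positive effect size is always below alpha. The observed effect therefore cannot tell us how likely a type 2 error is. Power answers a different question: if duplexes truly sold for less by some specified amount, how often would a test with our sample sizes detect it?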
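
As a rough illustration, power can be computed for a hypothesized effect size instead of the observed one. The sketch below reuses the sample sizes reported earlier and picks Cohen's d = -0.2 purely as an example of a small effect; that value and the helper name `power_to_detect` are assumptions, not results from the data. Note that `tt_ind_solve_power` models a pooled-variance t-test, so for Welch's test this is only an approximation.

```python
import statsmodels.stats.power as power

n_duplex, n_single = 282, 25653   # sample sizes reported above
alpha = 0.05                      # significance level, not the confidence level

def power_to_detect(effect_size):
    """Power of a lower-tailed two-sample t-test for a given Cohen's d."""
    return power.tt_ind_solve_power(effect_size=effect_size,
                                    nobs1=n_duplex,
                                    alpha=alpha,
                                    ratio=n_single / n_duplex,  # nobs2 / nobs1
                                    alternative='smaller')

# d = -0.2 is a hypothetical "small" effect, chosen only for illustration.
power_small_negative = power_to_detect(-0.2)
# A positive effect lies in the wrong direction for a lower-tailed test.
power_positive = power_to_detect(0.2)

print(power_small_negative, power_positive)
```

A high power for a plausible negative effect, combined with a failure to reject the null, is what would support the claim that a type 2 error is unlikely.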

#### It's likely that duplexes do not sell for less per square foot than single family homes.

## Next steps

We recommend testing whether duplexes in fact sell for more per square foot than single family homes, which our results suggest may be the case.
