## AI Hack 21

## Boston Housing Market Challenge

**Challenge Description**

First studied by Harrison and Rubinfeld (1978), the Boston Housing dataset has been used extensively to test new machine learning models against existing benchmarks. It is a small dataset with only 506 observations, but it is inherently interesting because it supports a wide range of questions. For this challenge, you are free to pose your own questions and are encouraged to use alternative datasets.


You have two options for this challenge. You can either use the housing prices as a response variable and conduct a regression analysis, or focus on the nitrous oxide level and perform a classification task. **You only need to choose one of these two options for your study.** You need not be comprehensive; the depth of the analysis is more important than the breadth of the questions posed. You can base your analysis on the available datasets as well as any supplementary datasets you may find.

You may explore one of the sample questions below, or come up with your own variation. Creativity in formulating your own questions is strongly encouraged, although this should not compromise the depth, precision and rigour of your analysis, which will be key performance indicators during assessment.

**Data**

The original Boston housing dataset can be accessed via the sklearn API:

In [3]:
from sklearn import datasets
import pandas as pd

# load_boston was removed in scikit-learn 1.2, so this cell needs an older release.
boston_load = datasets.load_boston()
boston = pd.DataFrame(boston_load.data, columns=boston_load.feature_names)
boston['MEDV'] = boston_load.target  # median home value, the regression target

A corrected version with town names and spatial information is also available [Ref 1]. It is augmented with the longitude and latitude of the observations and corrected for the censoring error. The censoring error refers to the fact that, in the original dataset, the house price is capped at USD 50,000: any higher value is recorded as USD 50,000 (see the description of the corrected dataset in [Ref 1], and the _Note_ section of the page for the original data in [Ref 2]).
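
Before modelling, it may help to flag which observations sit at the cap and which were revised in the corrected data. The sketch below is only an illustration: it uses a few `MEDV`/`CMEDV` pairs, the last row is hypothetical (added to show a capped value), and the column names `censored` and `revised` are made up for the example. It assumes prices are recorded in units of USD 1000, so the cap appears as 50.0.

```python
import pandas as pd

CAP = 50.0  # assumed: prices are in USD 1000s, so USD 50,000 appears as 50.0

# Small illustrative frame; the last row is hypothetical, added to show a capped value.
sample = pd.DataFrame({
    "MEDV": [24.0, 21.6, 11.9, 50.0],
    "CMEDV": [24.0, 21.6, 19.0, 50.0],
})

# Observations at the cap: the true value may be higher than what is recorded.
sample["censored"] = sample["MEDV"] >= CAP

# Observations whose corrected value differs from the original one.
sample["revised"] = sample["MEDV"] != sample["CMEDV"]

print(sample)
print("rows at the cap:", int(sample["censored"].sum()))
```

Rows flagged as `censored` could be dropped, kept with an indicator variable, or handled with a censored-regression model; which of these is appropriate depends on the question you pose.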


References:

[Ref 1] https://nowosad.github.io/spData/reference/boston.html

[Ref 2] https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html

**Sample Questions**:

Regression:

    [1]. Amongst the given attributes, can you identify any interesting quantities that are correlated with the house price, and uncover their statistical associations? Can you explain your findings?

    [2]. Does the house price show any spatial heterogeneity across the towns?
    You may find the Python library *geopandas* and *GeoJSON* files or shapefiles helpful for visualisation.

    [3]. A recent study in the Journal of the American Statistical Association (JASA) examined the causal effects of geographical boundaries on house prices in New York City. Can you conduct a similar study?


Classification:

    [1]. Is the nitrous oxide level correlated with any of the explanatory variables? Is this relationship causal?
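
Because NOX is a continuous measurement, a classification task needs class labels to be derived from it first. One possible approach, sketched here as an assumption rather than a prescribed method, is to threshold at the sample median. The values are the NOX readings visible in the data preview further down; `threshold` and `high_nox` are names chosen for this example.

```python
import numpy as np

# NOX readings copied from the rows shown in the data preview;
# a real analysis would use the full column instead.
nox = np.array([0.538, 0.469, 0.469, 0.458, 0.458, 0.573])

# Assumed cut-off: the sample median. Other choices (quantiles,
# a regulatory limit) would give a different class balance.
threshold = np.median(nox)

# 1 = strictly above the threshold, 0 = at or below it.
high_nox = (nox > threshold).astype(int)

print("threshold:", threshold)
print("labels:", high_nox.tolist())
```

Whatever rule you choose, it is worth reporting the resulting class balance, since a badly placed threshold can make accuracy uninformative.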




**You are reminded that you only need to choose one of regression or classification for your analysis.**


**Tips and Suggestions**:

The following workflow may help:

[1]. Clean the dataset
    - Impute missing data (if any)
    - Do some feature engineering, e.g. PCA or t-SNE
    
[2]. Visualise linear correlations.
    - Can you find any? Compute correlation coefficients
    
[3]. Visualise target distribution

[4]. Regress/classify on all features
    - Report in-sample and out-of-sample loss
    - Report estimated model parameters
    - Is spatial adjacency key information too? How would you add it to your model?
    
[5]. Do feature selection/model selection

[6]. Gather your conclusions. What are their implications for the economy and for policy making?
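
Steps [2] and [4] of this workflow could be sketched as follows. This is a minimal illustration on synthetic data (an assumption, so that it runs without the CSV files); the variable names, the 75/25 split, and the use of plain least squares are choices made for the example, not requirements of the challenge.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data (assumption for illustration):
# two features and a target that depends linearly on them plus noise.
n = 200
X = rng.normal(size=(n, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Step [2]: Pearson correlation of each feature with the target.
corr = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])

# Step [4]: hold out the last 25% of rows and fit on the rest.
# The synthetic rows are already in random order; real data should be shuffled first.
split = int(0.75 * n)
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]


def add_intercept(features):
    """Prepend a column of ones so the model has an intercept term."""
    return np.column_stack([np.ones(len(features)), features])


# Ordinary least squares via numpy's least-squares solver.
beta, *_ = np.linalg.lstsq(add_intercept(X_train), y_train, rcond=None)


def mse(features, target):
    """Mean squared error of the fitted linear model on the given data."""
    residuals = target - add_intercept(features) @ beta
    return float(np.mean(residuals ** 2))


in_sample_loss = mse(X_train, y_train)
out_of_sample_loss = mse(X_test, y_test)

print("correlations:", corr.round(3))
print("estimated parameters:", beta.round(3))
print("in-sample MSE:", round(in_sample_loss, 3))
print("out-of-sample MSE:", round(out_of_sample_loss, 3))
```

A large gap between the two losses is a sign of overfitting and is a natural motivation for the feature and model selection in step [5].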

In [109]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from csv import reader
from scipy.optimize import curve_fit
import scipy as sp
import geopandas as gdp

In [60]:
df = pd.read_csv('boston.csv', delimiter=',')
df2 = pd.read_csv('boston_corrected.csv', delimiter=',')
print(df)  # pandas abbreviates long frames to the first and last five rows

        crim    zn  indus  chas    nox     rm   age     dis  rad  tax  \
0    0.00632  18.0   2.31     0  0.538  6.575  65.2  4.0900    1  296   
1    0.02731   0.0   7.07     0  0.469  6.421  78.9  4.9671    2  242   
2    0.02729   0.0   7.07     0  0.469  7.185  61.1  4.9671    2  242   
3    0.03237   0.0   2.18     0  0.458  6.998  45.8  6.0622    3  222   
4    0.06905   0.0   2.18     0  0.458  7.147  54.2  6.0622    3  222   
..       ...   ...    ...   ...    ...    ...   ...     ...  ...  ...   
501  0.06263   0.0  11.93     0  0.573  6.593  69.1  2.4786    1  273   
502  0.04527   0.0  11.93     0  0.573  6.120  76.7  2.2875    1  273   
503  0.06076   0.0  11.93     0  0.573  6.976  91.0  2.1675    1  273   
504  0.10959   0.0  11.93     0  0.573  6.794  89.3  2.3889    1  273   
505  0.04741   0.0  11.93     0  0.573  6.030  80.8  2.5050    1  273   

     ptratio       b  lstat  medv  
0       15.3  396.90   4.98  24.0  
1       17.8  396.90 

In [62]:
print(df2)

           TOWN  TOWNNO  TRACT      LON      LAT  MEDV  CMEDV     CRIM    ZN  \
0        Nahant       0   2011 -70.9550  42.2550  24.0   24.0  0.00632  18.0   
1    Swampscott       1   2021 -70.9500  42.2875  21.6   21.6  0.02731   0.0   
2    Swampscott       1   2022 -70.9360  42.2830  34.7   34.7  0.02729   0.0   
3    Marblehead       2   2031 -70.9280  42.2930  33.4   33.4  0.03237   0.0   
4    Marblehead       2   2032 -70.9220  42.2980  36.2   36.2  0.06905   0.0   
..          ...     ...    ...      ...      ...   ...    ...      ...   ...   
501    Winthrop      91   1801 -70.9860  42.2312  22.4   22.4  0.06263   0.0   
502    Winthrop      91   1802 -70.9910  42.2275  20.6   20.6  0.04527   0.0   
503    Winthrop      91   1803 -70.9948  42.2260  23.9   23.9  0.06076   0.0   
504    Winthrop      91   1804 -70.9875  42.2240  22.0   22.0  0.10959   0.0   
505    Winthrop      91   1805 -70.9825  42.2210  11.9   19.0  0.04741   0.0   

     INDUS  CHAS    NOX     RM   AGE   

In [64]:
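# Take an independent copy so that columns added later do not trigger pandas' SettingWithCopyWarning.
df2_MO = df2[['TOWN', 'TOWNNO', 'LON', 'LAT', 'CRIM']].copy()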

In [66]:
df2_MO.TOWN.value_counts()

Cambridge            30
Boston Savin Hill    23
Lynn                 22
Boston Roxbury       19
Newton               18
                     ..
Norfolk               1
Dover                 1
Hull                  1
Millis                1
Manchester            1
Name: TOWN, Length: 92, dtype: int64

In [65]:
df2_MO.TOWNNO.value_counts()

28    30
83    23
4     22
82    19
40    18
      ..
49     1
50     1
51     1
52     1
0      1
Name: TOWNNO, Length: 92, dtype: int64
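
Since many towns contribute several tracts, a per-town summary may be a useful starting point for the spatial heterogeneity question. The sketch below uses a few rows copied from the preview of the corrected dataset; the aggregate column names (`n_tracts`, `mean_cmedv`, `lon`, `lat`) are hypothetical, and averaging tract coordinates is only a rough stand-in for a town's location.

```python
import pandas as pd

# Rows copied from the preview of the corrected dataset (a small subset, for illustration).
rows = pd.DataFrame({
    "TOWN": ["Nahant", "Swampscott", "Swampscott", "Marblehead", "Marblehead"],
    "LON": [-70.9550, -70.9500, -70.9360, -70.9280, -70.9220],
    "LAT": [42.2550, 42.2875, 42.2830, 42.2930, 42.2980],
    "CMEDV": [24.0, 21.6, 34.7, 33.4, 36.2],
})

# One row per town: tract count, mean corrected price, and average tract coordinates.
by_town = rows.groupby("TOWN").agg(
    n_tracts=("CMEDV", "size"),
    mean_cmedv=("CMEDV", "mean"),
    lon=("LON", "mean"),
    lat=("LAT", "mean"),
)

print(by_town)
```

Towns with a single tract have no within-town variation, so any comparison across towns should take `n_tracts` into account.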

In [113]:
from shapely.geometry import Point

FileNotFoundError: Could not find module 'C:\Users\user\anaconda3\Library\bin\geos_c.dll' (or one of its dependencies). Try using the full path with constructor syntax.

In [105]:
geometry= [Point(xy) for xy in zip(df2_MO.LON, df2_MO.LAT)]

NameError: name 'Point' is not defined
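
The `FileNotFoundError` above points to a missing GEOS library (`geos_c.dll`), which is an installation problem rather than a bug in the notebook code, and the `NameError` follows from the failed import. Until the environment is repaired, point geometries can still be prepared as plain GeoJSON using only the standard library. This is a workaround sketch, not a replacement for geopandas; the coordinates are copied from the first rows of the corrected dataset preview, and the variable names are chosen for the example.

```python
import json

# Town names and longitude/latitude pairs copied from the dataset preview.
towns = ["Nahant", "Swampscott", "Swampscott"]
lons = [-70.9550, -70.9500, -70.9360]
lats = [42.2550, 42.2875, 42.2830]

# GeoJSON stores a point as [longitude, latitude], in that order.
features = [
    {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
        "properties": {"TOWN": town},
    }
    for town, lon, lat in zip(towns, lons, lats)
]
collection = {"type": "FeatureCollection", "features": features}

# Serialise to text; this string could be written to a .geojson file.
geojson_text = json.dumps(collection, indent=2)
print(geojson_text[:200])
```

Once shapely imports correctly, the original `Point(xy)` list comprehension should work as written.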