# Project - EDA with Pandas Using the Boston Housing Data

## Introduction

In this section you've learned a lot about importing, cleaning up, analysing (using descriptive statistics) and visualizing data. In this more free-form project you'll get a chance to practice all of these skills with the Boston Housing data set, which contains housing values in suburbs of Boston. The Boston Housing Data is commonly used by aspiring data scientists.

## Objectives
You will be able to:
* Show mastery of the content covered in this section

# Goals

Use your data munging and visualization skills to conduct an exploratory analysis of the dataset below. At minimum, this should include:

* Loading the data (which is stored in the file train.csv)
* Using built-in Python functions to explore measures of centrality and dispersion for at least 3 variables
* Creating *meaningful* subsets of the data with selection operations such as `.loc`, `.iloc` or related operations. Explain why you chose these subsets, and do this for 3 possible 2-way splits. State how you think the 2 measures of centrality and/or dispersion might differ between the subsets. Examples of potential splits:
    - Create 2 new dataframes based on your existing data, where one contains all the properties next to the Charles river, and the other one contains properties that aren't.
    - Create 2 new dataframes based on a certain split for crime rate.
* Using histograms and scatterplots to see whether you observe differences between the subsets of the data. Make sure to use subplots so it is easy to compare the relationships.

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib notebook

In [7]:
# Load data (assumes train.csv sits in the current working directory)

df = pd.read_csv('train.csv')
df.head()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 333 entries, 0 to 332
Data columns (total 15 columns):
ID         333 non-null int64
crim       333 non-null float64
zn         333 non-null float64
indus      333 non-null float64
chas       333 non-null int64
nox        333 non-null float64
rm         333 non-null float64
age        333 non-null float64
dis        333 non-null float64
rad        333 non-null int64
tax        333 non-null int64
ptratio    333 non-null float64
black      333 non-null float64
lstat      333 non-null float64
medv       333 non-null float64
dtypes: float64(11), int64(4)
memory usage: 39.1 KB


In [9]:
df.describe()

statistic,ID,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,black,lstat,medv
count,333.0,333.0,333.0,333.0,333.0,333.0,333.0,333.0,333.0,333.0,333.0,333.0,333.0,333.0,333.0
mean,250.951952,3.360341,10.689189,11.293483,0.06006,0.557144,6.265619,68.226426,3.709934,9.633634,409.279279,18.448048,359.466096,12.515435,22.768769
std,147.859438,7.352272,22.674762,6.998123,0.237956,0.114955,0.703952,28.133344,1.981123,8.742174,170.841988,2.151821,86.584567,7.067781,9.173468
min,1.0,0.00632,0.0,0.74,0.0,0.385,3.561,6.0,1.1296,1.0,188.0,12.6,3.5,1.73,5.0
25%,123.0,0.07896,0.0,5.13,0.0,0.453,5.884,45.4,2.1224,4.0,279.0,17.4,376.73,7.18,17.4
50%,244.0,0.26169,0.0,9.9,0.0,0.538,6.202,76.7,3.0923,5.0,330.0,19.0,392.05,10.97,21.6
75%,377.0,3.67822,12.5,18.1,0.0,0.631,6.595,93.8,5.1167,24.0,666.0,20.2,396.24,16.42,25.0
max,506.0,73.5341,100.0,27.74,1.0,0.871,8.725,100.0,10.7103,24.0,711.0,21.2,396.9,37.97,50.0


The crime rate per capita has a mean of 3.36 and a median of 0.26. A mean this far above the median suggests a right-skewed distribution, with a few towns reporting very high crime rates (the maximum is 73.53). The median is probably more representative in this case.

The rm series is interesting because the mean is 6.27, with a minimum of 3.56 and a maximum of 8.73. This means that in all of the towns we are observing, the average number of rooms per dwelling is above 3.

The pupil-teacher ratio is interesting as well. The minimum is 12.6, meaning there are about 12.6 students per teacher in that town. The maximum is 21.2, which means about 21 students per teacher.
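To make the mean-versus-median point concrete, here is a minimal sketch on made-up numbers. The `crim_toy` series is hypothetical (it is not taken from train.csv); it only mimics the pattern above, where most values are small and one is extreme. The same calls should work on `df["crim"]`, `df["rm"]` or `df["ptratio"]`.

```python
import pandas as pd

# Hypothetical toy values, NOT from train.csv: mostly small, one extreme.
crim_toy = pd.Series([0.05, 0.1, 0.2, 0.3, 0.5, 1.0, 40.0], name="crim")

summary = {
    # Measures of centrality
    "mean": crim_toy.mean(),
    "median": crim_toy.median(),
    # Measures of dispersion
    "std": crim_toy.std(),
    "iqr": crim_toy.quantile(0.75) - crim_toy.quantile(0.25),
    # Positive skew means a long right tail
    "skew": crim_toy.skew(),
}

for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

When the mean sits well above the median and the skew is positive, the median and the interquartile range are usually the safer summaries to report.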

In [27]:
# Properties that bound the Charles River (chas == 1) vs. those that don't (chas == 0)
df1 = df.loc[df["chas"] == 1]
df2 = df.loc[df["chas"] == 0]

df1.describe()

statistic,ID,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,black,lstat,medv
count,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0,20.0
mean,255.6,2.163972,8.5,12.33,1.0,0.593595,6.57775,75.815,3.06954,9.9,394.55,17.385,380.681,11.118,30.175
std,75.913663,2.885734,21.830688,6.505255,0.0,0.146237,0.814341,22.808638,1.343724,8.534142,171.795005,2.22906,21.661541,7.198281,12.362204
min,143.0,0.01501,0.0,1.21,1.0,0.401,5.403,24.8,1.1296,1.0,198.0,13.6,321.02,2.96,13.4
25%,211.25,0.200377,0.0,6.2,1.0,0.489,6.11125,58.325,2.04125,4.75,276.75,14.85,377.565,5.0075,21.7
50%,236.0,0.57207,0.0,12.24,1.0,0.5285,6.3225,86.0,3.08005,5.0,307.0,17.4,390.58,9.735,26.05
75%,302.25,3.453213,0.0,18.1,1.0,0.6935,6.91325,92.925,4.0952,12.0,468.75,19.0,395.24,14.775,37.9
max,373.0,8.98296,90.0,19.58,1.0,0.871,8.375,100.0,5.885,24.0,666.0,20.2,396.9,26.82,50.0


In [14]:
df2.describe()

statistic,ID,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,black,lstat,medv
count,313.0,313.0,313.0,313.0,313.0,313.0,313.0,313.0,313.0,313.0,313.0,313.0,313.0,313.0,313.0
mean,250.654952,3.436787,10.829073,11.227252,0.0,0.554815,6.245674,67.741534,3.750853,9.616613,410.220447,18.515974,358.110511,12.604728,22.295527
std,151.365227,7.544289,22.754198,7.032974,0.0,0.112555,0.693026,28.400933,2.009606,8.768386,171.014186,2.132487,88.984202,7.061663,8.746397
min,1.0,0.00632,0.0,0.74,0.0,0.385,3.561,6.0,1.137,1.0,188.0,12.6,3.5,1.73,5.0
25%,118.0,0.07875,0.0,4.95,0.0,0.453,5.879,43.7,2.1329,4.0,281.0,17.4,376.7,7.22,17.2
50%,245.0,0.24522,0.0,9.69,0.0,0.538,6.185,76.5,3.0923,5.0,330.0,19.1,392.23,11.22,21.2
75%,387.0,3.67822,12.5,18.1,0.0,0.624,6.563,93.8,5.2146,24.0,666.0,20.2,396.33,16.44,24.7
max,506.0,73.5341,100.0,27.74,0.0,0.871,8.725,100.0,10.7103,24.0,711.0,21.2,396.9,37.97,50.0
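The two `describe()` tables for the river split are hard to compare by eye, so one option is to put the statistics of both subsets side by side with `groupby`. This is only a sketch: the `toy` frame below uses the same column names as train.csv, but its values are made up. Note also that the riverside subset has only 20 rows against 313, so any difference between the groups should be read with caution.

```python
import pandas as pd

# Hypothetical toy frame: column names mirror train.csv, values are made up.
toy = pd.DataFrame({
    "chas": [1, 1, 0, 0, 0, 0],
    "medv": [30.0, 26.0, 21.0, 22.5, 19.0, 50.0],
    "rm": [6.6, 6.3, 6.2, 6.1, 5.9, 8.0],
})

# One row per value of chas, one column per (variable, statistic) pair.
by_river = toy.groupby("chas")[["medv", "rm"]].agg(["count", "mean", "median", "std"])
print(by_river)
```

On the real data, replacing `toy` with `df` should give the same layout, with the `chas == 0` and `chas == 1` rows stacked directly on top of each other.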


In [16]:
# Split on the black index at roughly its overall mean (359.47)
df_bl = df.loc[df["black"] >= 359]
df_bl.describe()

statistic,ID,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,black,lstat,medv
count,274.0,274.0,274.0,274.0,274.0,274.0,274.0,274.0,274.0,274.0,274.0,274.0,274.0,274.0,274.0
mean,235.507299,2.16057,12.406934,10.059489,0.062044,0.535858,6.358814,64.427737,3.999166,8.251825,378.485401,18.373358,390.508686,11.301569,24.065328
std,144.136617,4.956756,23.73286,6.825214,0.241676,0.10553,0.67437,28.258299,1.987091,7.88136,159.387133,2.074735,8.306807,6.519188,9.04536
min,1.0,0.00632,0.0,0.74,0.0,0.389,4.138,6.0,1.137,1.0,188.0,13.0,359.29,1.73,5.0
25%,108.25,0.066232,0.0,4.49,0.0,0.448,5.9635,40.325,2.371575,4.0,270.0,17.4,388.45,6.5375,19.1
50%,230.5,0.1694,0.0,7.87,0.0,0.5125,6.247,69.65,3.58495,5.0,307.0,18.6,393.68,9.77,22.45
75%,343.75,0.954332,20.0,18.1,0.0,0.605,6.6455,91.15,5.401,8.0,432.0,20.2,396.9,14.635,26.975
max,506.0,38.3518,100.0,27.74,1.0,0.871,8.725,100.0,10.7103,24.0,711.0,21.2,396.9,37.97,50.0


In [17]:
# Complement of the split above: black index below 359
df_nbl = df.loc[df["black"] < 359]
df_nbl.describe()


statistic,ID,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,black,lstat,medv
count,59.0,59.0,59.0,59.0,59.0,59.0,59.0,59.0,59.0,59.0,59.0,59.0,59.0,59.0,59.0
mean,322.677966,8.932164,2.711864,17.024237,0.050847,0.656,5.832814,85.867797,2.366717,16.050847,552.288136,18.794915,215.302203,18.152712,16.747458
std,144.899185,12.465835,14.601666,4.524048,0.221572,0.105552,0.681489,19.677682,1.284722,9.705,149.034369,2.4688,129.978795,6.827225,7.194882
min,19.0,0.01965,0.0,1.52,0.0,0.385,3.561,9.8,1.1296,1.0,241.0,12.6,3.5,5.5,7.0
25%,167.0,1.97473,0.0,18.1,0.0,0.597,5.5105,84.2,1.75085,5.0,403.0,18.0,88.45,13.13,11.75
50%,409.0,4.75237,0.0,18.1,0.0,0.659,5.936,94.0,2.0635,24.0,666.0,20.2,261.95,17.15,15.6
75%,434.5,11.24205,0.0,18.84,0.0,0.713,6.24,97.3,2.45665,24.0,666.0,20.2,336.515,23.265,20.0
max,491.0,73.5341,80.0,27.74,1.0,0.871,7.107,100.0,9.0892,24.0,711.0,21.2,356.99,36.98,50.0
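The goals ask for three 2-way splits, and a split on crime rate is one of the suggested examples. A possible way to do it is sketched below. Everything here is an assumption for illustration: the `toy` values are made up, and cutting at the median (rather than the mean) is just one reasonable choice, picked because crim looks strongly skewed in the summary statistics.

```python
import pandas as pd

# Hypothetical toy frame: column names mirror train.csv, values are made up.
toy = pd.DataFrame({
    "crim": [0.02, 0.08, 0.25, 0.30, 3.7, 9.0],
    "medv": [33.0, 24.0, 22.0, 21.0, 15.0, 10.0],
})

# Cut at the median so both subsets have a similar number of rows.
threshold = toy["crim"].median()
low_crime = toy.loc[toy["crim"] <= threshold]
high_crime = toy.loc[toy["crim"] > threshold]

# Compare one measure of centrality and one of dispersion per subset.
for label, subset in [("low crime", low_crime), ("high crime", high_crime)]:
    print(label, "median medv:", subset["medv"].median(), "std medv:", subset["medv"].std())
```

Because the two masks are complements of each other, every row lands in exactly one subset.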


In [23]:
pd.plotting.scatter_matrix(df)

In [24]:
colormap = ("skyblue", "salmon")
plt.figure()
pd.plotting.parallel_coordinates(df, "chas", color = colormap)
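One thing to keep in mind with parallel coordinates is that all columns share a single y-axis, so a column like tax (values in the hundreds) will flatten a column like nox (values below 1). A common workaround is to min-max scale each feature to the 0 to 1 range first. The sketch below assumes a made-up `toy` frame with column names borrowed from train.csv; the scaling step is a suggestion, not something the original analysis did.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy frame: column names mirror train.csv, values are made up.
toy = pd.DataFrame({
    "chas": [0, 0, 1, 1],
    "nox": [0.45, 0.60, 0.50, 0.70],
    "tax": [280.0, 666.0, 300.0, 400.0],
    "medv": [21.0, 15.0, 30.0, 26.0],
})

# Min-max scale every feature column, leaving the class column untouched.
features = toy.drop(columns="chas")
scaled = (features - features.min()) / (features.max() - features.min())
scaled["chas"] = toy["chas"]

fig, ax = plt.subplots()
pd.plotting.parallel_coordinates(scaled, "chas", color=("skyblue", "salmon"), ax=ax)
```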


In [34]:
# figure = plt.figure()
# ax1 = figure.add_subplot(121)
# ax2 = figure.add_subplot(122)



In [29]:
df2.plot()
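The goals call for histograms and scatterplots arranged in subplots so the subsets can be compared directly. Here is a minimal sketch of one possible layout: a 2x2 grid with one column per subset, histograms on the top row and scatterplots on the bottom row. The data is synthetic (generated with a fixed seed, with an invented relationship between rm and medv), and the names `toy`, `river` and `no_river` are hypothetical stand-ins for `df`, `df1` and `df2`.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data: values are generated, not from train.csv.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "chas": rng.integers(0, 2, size=n),
    "rm": rng.normal(6.3, 0.7, size=n),
})
toy["medv"] = 9 * toy["rm"] - 34 + rng.normal(0, 4, size=n)  # invented relationship

river = toy.loc[toy["chas"] == 1]
no_river = toy.loc[toy["chas"] == 0]

# Sharing axes within each row keeps the two subsets on the same scale.
fig, axes = plt.subplots(2, 2, figsize=(10, 8), sharex="row", sharey="row")
for col, (label, subset) in enumerate([("chas == 1", river), ("chas == 0", no_river)]):
    axes[0, col].hist(subset["medv"], bins=15)
    axes[0, col].set_title(f"medv histogram ({label})")
    axes[0, col].set_xlabel("medv")
    axes[1, col].scatter(subset["rm"], subset["medv"], s=10)
    axes[1, col].set_title(f"rm vs medv ({label})")
    axes[1, col].set_xlabel("rm")
    axes[1, col].set_ylabel("medv")
fig.tight_layout()
```

With very unequal group sizes, as in the real river split, shared count axes can make the smaller group's histogram look flat. Passing `density=True` to `hist` is one way around that.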


# Variable Descriptions

This data frame contains the following columns:

#### crim  
per capita crime rate by town.

#### zn  
proportion of residential land zoned for lots over 25,000 sq.ft.

#### indus  
proportion of non-retail business acres per town.

#### chas  
Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

#### nox  
nitrogen oxides concentration (parts per 10 million).

#### rm  
average number of rooms per dwelling.

#### age  
proportion of owner-occupied units built prior to 1940.

#### dis  
weighted mean of distances to five Boston employment centres.

#### rad  
index of accessibility to radial highways.

#### tax  
full-value property-tax rate per $10,000.

#### ptratio  
pupil-teacher ratio by town.

#### black  
1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.

#### lstat  
lower status of the population (percent).

#### medv  
median value of owner-occupied homes in $1000s.
  
  
  
## Source
Harrison, D. and Rubinfeld, D.L. (1978) Hedonic prices and the demand for clean air. J. Environ. Economics and Management 5, 81–102.

Belsley D.A., Kuh, E. and Welsch, R.E. (1980) Regression Diagnostics. Identifying Influential Data and Sources of Collinearity. New York: Wiley.

## Summary

Congratulations, you've completed your first "freeform" exploratory data analysis of a popular data set!