# Project - EDA with Pandas Using the Boston Housing Data

## Introduction

In this section, you've learned a lot about importing, cleaning up, analyzing (using descriptive statistics), and visualizing data. In this more free-form project, you'll practice all of these skills with the Boston Housing dataset, which contains housing values in the suburbs of Boston. It is widely used as a practice dataset by aspiring data scientists.

## Objectives

You will be able to:

* Perform a full exploratory data analysis process to gain insight about a dataset 

## Goals

Use your data munging and visualization skills to conduct an exploratory analysis of the dataset below. At a minimum, this should include:

* Loading the data (which is stored in the file `'train.csv'`) 
* Exploring measures of centrality and dispersion for at least 3 variables, using built-in Python functions or pandas methods
* Creating *meaningful* subsets of the data using selection operations like `.loc`, `.iloc`, or related operations. Explain why you chose each subset, and do this for three possible 2-way splits. State how you expect measures of centrality and/or dispersion to differ between the two subsets of each split. Examples of potential splits:
    - Create two new DataFrames based on your existing data, where one contains all the properties next to the Charles River, and the other contains the properties that aren't
    - Create two new DataFrames based on a certain split for crime rate
* Using histograms and scatter plots to see whether you observe differences between the subsets of the data. Make sure to use subplots so it is easy to compare the relationships.

## Variable Descriptions

This DataFrame contains the following columns:

- `crim`: per capita crime rate by town  
- `zn`: proportion of residential land zoned for lots over 25,000 sq.ft  
- `indus`: proportion of non-retail business acres per town   
- `chas`: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)  
- `nox`: nitrogen oxide concentration (parts per 10 million)   
- `rm`: average number of rooms per dwelling   
- `age`: proportion of owner-occupied units built prior to 1940  
- `dis`: weighted mean of distances to five Boston employment centers   
- `rad`: index of accessibility to radial highways   
- `tax`: full-value property-tax rate per \$10,000   
- `ptratio`: pupil-teacher ratio by town    
- `b`: 1000(Bk - 0.63)^2 where Bk is the proportion of African American individuals by town   
- `lstat`: lower status of the population (percent)   
- `medv`: median value of owner-occupied homes in \$1000s
  
    
## Source

Harrison, D. and Rubinfeld, D.L. (1978) Hedonic prices and the demand for clean air. J. Environ. Economics and Management 5, 81–102.

Belsley D.A., Kuh, E. and Welsch, R.E. (1980) Regression Diagnostics. Identifying Influential Data and Sources of Collinearity. New York: Wiley.


In [38]:
import pandas as pd
df = pd.read_csv('housing.csv')
df.info()
print(df.shape)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
CRIM       506 non-null float64
ZN         506 non-null float64
INDUS      506 non-null float64
CHAS       506 non-null int64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
RAD        506 non-null int64
TAX        506 non-null int64
PTRATIO    506 non-null float64
B          506 non-null float64
LSTAT      506 non-null float64
MEDV       506 non-null float64
dtypes: float64(11), int64(3)
memory usage: 55.5 KB
(506, 14)
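The variable descriptions above use lowercase names (`crim`, `medv`, ...), while `df.info()` reports uppercase column names. If you prefer the names to match the descriptions, one option is to lowercase them after loading. The snippet below is a minimal sketch that rebuilds a one-row frame from the first row shown by `df.head(10)` so it runs without the CSV file; `df_example` and `df_lower` are illustrative names, not part of the original notebook.

```python
import pandas as pd

# Column names as reported by df.info(); the single row is copied from
# the first row of the df.head(10) output in this notebook.
columns = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
           "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV"]
row = [0.00632, 18.0, 2.31, 0, 0.538, 6.575, 65.2,
       4.09, 1, 296, 15.3, 396.9, 4.98, 24.0]
df_example = pd.DataFrame([row], columns=columns)

# rename() accepts a callable and returns a new DataFrame;
# df_example itself keeps its uppercase names.
df_lower = df_example.rename(columns=str.lower)

print(df_lower.columns.tolist())
```

With the full dataset, the same call would be `df.rename(columns=str.lower)`.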


In [31]:
df.head(10)

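|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|------|----|-------|------|-----|----|-----|-----|-----|-----|---------|---|-------|------|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.9 | 5.33 | 36.2 |
| 5 | 0.02985 | 0.0 | 2.18 | 0 | 0.458 | 6.43 | 58.7 | 6.0622 | 3 | 222 | 18.7 | 394.12 | 5.21 | 28.7 |
| 6 | 0.08829 | 12.5 | 7.87 | 0 | 0.524 | 6.012 | 66.6 | 5.5605 | 5 | 311 | 15.2 | 395.6 | 12.43 | 22.9 |
| 7 | 0.14455 | 12.5 | 7.87 | 0 | 0.524 | 6.172 | 96.1 | 5.9505 | 5 | 311 | 15.2 | 396.9 | 19.15 | 27.1 |
| 8 | 0.21124 | 12.5 | 7.87 | 0 | 0.524 | 5.631 | 100.0 | 6.0821 | 5 | 311 | 15.2 | 386.63 | 29.93 | 16.5 |
| 9 | 0.17004 | 12.5 | 7.87 | 0 | 0.524 | 6.004 | 85.9 | 6.5921 | 5 | 311 | 15.2 | 386.71 | 17.1 | 18.9 |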


In [21]:
print(df['RM'].mean())
print(df['NOX'].median())
print(df['PTRATIO'].mean())

6.284634387351779
0.5379999999999999
18.455533596837945
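The cell above covers centrality only (two means and a median), while the Goals also ask for dispersion. The following is a minimal sketch of how both could be computed with pandas. So that it runs without the CSV file, it uses only the `RM`, `NOX`, and `PTRATIO` values from the ten rows shown by `df.head(10)`; `sample` and the other variable names are illustrative, and the printed numbers will differ from those for the full 506-row dataset.

```python
import pandas as pd

# Values copied from the first ten rows shown by df.head(10);
# with the full file you would use df[["RM", "NOX", "PTRATIO"]] instead.
sample = pd.DataFrame({
    "RM": [6.575, 6.421, 7.185, 6.998, 7.147, 6.43, 6.012, 6.172, 5.631, 6.004],
    "NOX": [0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, 0.524, 0.524],
    "PTRATIO": [15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2],
})

# Centrality: mean and median for each column.
centrality = sample.agg(["mean", "median"])

# Dispersion: standard deviation and variance (pandas defaults to the
# sample versions, ddof=1), plus the minimum and maximum.
dispersion = sample.agg(["std", "var", "min", "max"])

# Range: distance between the largest and smallest value.
value_range = sample.max() - sample.min()

# Interquartile range: distance between the 75th and 25th percentiles.
iqr = sample.quantile(0.75) - sample.quantile(0.25)

print(centrality)
print(dispersion)
print(value_range)
print(iqr)
```

`sample.describe()` reports most of these statistics in a single call, which can be a convenient starting point.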


In [25]:
CRIM_NOCHA = df.loc[df['CHAS'] < 1, ['CRIM']]  # Crime rate for tracts that do not bound the Charles River (CHAS == 0)

In [23]:
CRIM_CHA = df.loc[df['CHAS'] >= 1, ['CRIM']]  # Crime rate for tracts that bound the Charles River (CHAS == 1)

In [24]:
CRIM_CHA.mean()

CRIM    1.85167
dtype: float64

In [26]:
CRIM_NOCHA.mean()

CRIM    3.744447
dtype: float64
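The two boolean-mask selections above can also be summarized in one step with `groupby`, which makes it easier to compare several statistics per subset side by side. This is a sketch on a small toy frame whose values are made up for illustration and are not taken from the dataset; `toy`, `crim_river`, `crim_no_river`, and `summary` are hypothetical names.

```python
import pandas as pd

# Toy values for illustration only; they are NOT taken from the dataset.
toy = pd.DataFrame({
    "CHAS": [0, 0, 0, 1, 1],
    "CRIM": [0.1, 0.3, 0.5, 0.2, 0.4],
})

# Boolean-mask split, in the same style as the cells above.
crim_river = toy.loc[toy["CHAS"] == 1, "CRIM"]
crim_no_river = toy.loc[toy["CHAS"] == 0, "CRIM"]

# Equivalent one-step summary: one row per CHAS value,
# one column per statistic.
summary = toy.groupby("CHAS")["CRIM"].agg(["count", "mean", "median", "std"])

print(summary)
```

On the real data, replacing `toy` with `df` would give the per-group count, mean, median, and standard deviation of `CRIM`, so centrality and dispersion can be compared for the same split.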

In [30]:
room_smallerzone = df.loc[df['ZN'] < 1, ['RM']]  # RM for towns where ZN (land zoned for lots over 25,000 sq.ft) is below 1
room_smallerzone.mean()

RM    6.147922
dtype: float64

In [29]:
room_biggerzone = df.loc[df['ZN'] >= 1, ['RM']]  # RM for towns where ZN (land zoned for lots over 25,000 sq.ft) is at least 1
room_biggerzone.mean()

RM    6.664164
dtype: float64

In [None]:
#In areas where a larger proportion of residential land was zoned for lots over 25,000 sq.ft, the number of rooms per house tended to be larger.
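The Goals suggest a split on crime rate as another 2-way split. One possible approach, sketched below, is to split at the median of `CRIM` so that both subsets have similar sizes. The threshold choice is an assumption made for illustration, not something prescribed by the project. The data are the `CRIM` and `MEDV` values from the ten rows shown by `df.head(10)`, so the results will not match those for the full dataset.

```python
import pandas as pd

# CRIM and MEDV copied from the first ten rows shown by df.head(10).
sample = pd.DataFrame({
    "CRIM": [0.00632, 0.02731, 0.02729, 0.03237, 0.06905,
             0.02985, 0.08829, 0.14455, 0.21124, 0.17004],
    "MEDV": [24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9],
})

# Assumed threshold: the median crime rate of the rows at hand.
threshold = sample["CRIM"].median()

# Two complementary subsets; every row lands in exactly one of them.
low_crime = sample.loc[sample["CRIM"] <= threshold]
high_crime = sample.loc[sample["CRIM"] > threshold]

# describe() reports both centrality (mean, 50%) and dispersion (std, quartiles).
print(low_crime["MEDV"].describe())
print(high_crime["MEDV"].describe())
```

Other thresholds, such as a fixed crime rate or an upper quartile, would work the same way; the median is used here only because it keeps the two groups balanced.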

In [37]:
df.loc[df['CHAS'] < 1]

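|   | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|------|----|-------|------|-----|----|-----|-----|-----|-----|---------|---|-------|------|
| 0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1 | 296 | 15.3 | 396.90 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.90 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.90 | 5.33 | 36.2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 501 | 0.06263 | 0.0 | 11.93 | 0 | 0.573 | 6.593 | 69.1 | 2.4786 | 1 | 273 | 21.0 | 391.99 | 9.67 | 22.4 |
| 502 | 0.04527 | 0.0 | 11.93 | 0 | 0.573 | 6.120 | 76.7 | 2.2875 | 1 | 273 | 21.0 | 396.90 | 9.08 | 20.6 |
| 503 | 0.06076 | 0.0 | 11.93 | 0 | 0.573 | 6.976 | 91.0 | 2.1675 | 1 | 273 | 21.0 | 396.90 | 5.64 | 23.9 |
| 504 | 0.10959 | 0.0 | 11.93 | 0 | 0.573 | 6.794 | 89.3 | 2.3889 | 1 | 273 | 21.0 | 393.45 | 6.48 | 22.0 |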


In [39]:
%matplotlib notebook
import matplotlib.pyplot as plt

In [41]:
df.plot()


In [44]:
pd.plotting.scatter_matrix(df, figsize = (13,8))


In [54]:
plt.scatter(df['CRIM'], df['B'])
plt.xlabel('CRIM')  # per capita crime rate by town
plt.ylabel('B')
plt.show()
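The Goals ask for histograms and scatter plots arranged as subplots, so that subsets can be compared side by side. A minimal sketch of one such layout follows. It assumes a median split on `CRIM` (an illustrative choice) and uses the `CRIM`, `RM`, and `MEDV` values from the ten rows shown by `df.head(10)` so that it runs without the CSV file; the names `sample`, `subsets`, `fig`, and `axes` are hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd

# CRIM, RM and MEDV copied from the first ten rows shown by df.head(10).
sample = pd.DataFrame({
    "CRIM": [0.00632, 0.02731, 0.02729, 0.03237, 0.06905,
             0.02985, 0.08829, 0.14455, 0.21124, 0.17004],
    "RM": [6.575, 6.421, 7.185, 6.998, 7.147, 6.43, 6.012, 6.172, 5.631, 6.004],
    "MEDV": [24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9],
})

# Assumed 2-way split: at or below vs. above the median crime rate.
threshold = sample["CRIM"].median()
subsets = {
    "CRIM <= median": sample[sample["CRIM"] <= threshold],
    "CRIM > median": sample[sample["CRIM"] > threshold],
}

# One column per subset: histograms on the top row, scatter plots on the bottom.
# Sharing axes within each row keeps the two subsets on the same scale.
fig, axes = plt.subplots(2, 2, figsize=(10, 8), sharex="row", sharey="row")
for col, (label, subset) in enumerate(subsets.items()):
    axes[0, col].hist(subset["MEDV"], bins=5)
    axes[0, col].set_title(label)
    axes[0, col].set_xlabel("MEDV")
    axes[1, col].scatter(subset["RM"], subset["MEDV"])
    axes[1, col].set_xlabel("RM")
    axes[1, col].set_ylabel("MEDV")
fig.tight_layout()
```

In a script, call `plt.show()` afterwards to display the figure; in a notebook the figure is rendered by the active backend.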

## Summary

Congratulations, you've completed your first "free form" exploratory data analysis of a popular dataset!