# Project - EDA with Pandas Using the Boston Housing Data

## Introduction

In this section, you've learned a lot about importing, cleaning up, analyzing (using descriptive statistics), and visualizing data. In this more free-form project, you'll get a chance to practice all of these skills with the Boston Housing data set, which contains housing values in suburbs of Boston. The Boston Housing Data is commonly used by aspiring data scientists.

## Objectives

You will be able to:

* Load csv files using Pandas
* Find variables with high correlation
* Create box plots
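
The last two objectives could be approached roughly as follows. This is a minimal sketch, not a reference solution: `sample` is a stand-in built from a few rows of `train.csv` so the snippet runs on its own, and `top_correlations` is a hypothetical helper name, not part of pandas.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs as a script; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Stand-in sample (a few rows copied from train.csv).
# In the project itself, use the full frame: pd.read_csv('train.csv')
sample = pd.DataFrame({
    'crim':  [0.00632, 0.02731, 0.03237, 0.06905, 0.08829, 0.62976, 3.32105, 1.41385],
    'rm':    [6.575, 6.421, 6.998, 7.147, 6.012, 5.949, 5.403, 6.129],
    'lstat': [4.98, 9.14, 2.94, 5.33, 12.43, 8.26, 26.82, 15.12],
    'medv':  [24.0, 21.6, 33.4, 36.2, 22.9, 20.4, 13.4, 17.0],
})


def top_correlations(frame, n=3):
    """Return the n column pairs with the largest absolute Pearson correlation."""
    corr = frame.corr().abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    return pairs.sort_values(ascending=False).head(n)


print(top_correlations(sample))

# One box plot per column, on separate axes because the columns are on different scales
fig, axes = plt.subplots(1, len(sample.columns), figsize=(12, 3))
for ax, col in zip(axes, sample.columns):
    sample.boxplot(column=col, ax=ax)
fig.tight_layout()
```

With only eight rows the correlations themselves mean little; the point is the pattern of calls (`corr`, `boxplot`), which should carry over to the full data frame.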

## Goals

Use your data munging and visualization skills to conduct an exploratory analysis of the dataset below. At a minimum, this should include:

* Loading the data (which is stored in the file `train.csv`)
* Use built-in Python functions to explore measures of centrality and dispersion for at least 3 variables
* Create *meaningful* subsets of the data with `.loc`, `.iloc`, or related selection operations. Explain why you chose each subset, and do this for 3 possible 2-way splits. State how you expect the measures of centrality and/or dispersion to differ between the two subsets of each split. Examples of potential splits:
    - Create 2 new dataframes based on your existing data, where one contains all the properties next to the Charles River and the other contains the properties that aren't.
    - Create 2 new dataframes based on a certain split for crime rate.
* Next, use histograms and scatterplots to see whether you observe differences for the subsets of the data. Make sure to use subplots so it is easy to compare the relationships.
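
As an illustration of the subsetting step, here is a sketch of one possible 2-way split and the comparison it enables. The split on `rm` is a hypothetical example, and `sample` is a stand-in built from a few rows of `train.csv`; neither is prescribed by the project.

```python
import pandas as pd

# Stand-in sample (rows copied from train.csv); use the full data frame in the project
sample = pd.DataFrame({
    'rm':   [6.575, 6.421, 6.998, 7.147, 6.012, 5.949, 6.096, 5.403],
    'medv': [24.0, 21.6, 33.4, 36.2, 22.9, 20.4, 18.2, 13.4],
})

# Hypothetical 2-way split: tracts with more rooms per dwelling than the median vs. the rest
rm_median = sample['rm'].median()
more_rooms = sample.loc[sample['rm'] > rm_median, :]
fewer_rooms = sample.loc[sample['rm'] <= rm_median, :]

# Compare one measure of centrality (median) and one of dispersion (std) across the subsets
comparison = pd.DataFrame({
    'more_rooms': more_rooms['medv'].agg(['median', 'std']),
    'fewer_rooms': fewer_rooms['medv'].agg(['median', 'std']),
})
print(comparison)
```

Writing down the expected direction of the difference before running the comparison (here, a higher median `medv` for the larger dwellings) makes it easier to notice when the data disagrees with the hypothesis.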

## Variable Descriptions

This data frame contains the following columns:

#### crim  
per capita crime rate by town.

#### zn  
proportion of residential land zoned for lots over 25,000 sq.ft.

#### indus  
proportion of non-retail business acres per town.

#### chas  
Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

#### nox  
nitrogen oxides concentration (parts per 10 million).

#### rm  
average number of rooms per dwelling.

#### age  
proportion of owner-occupied units built prior to 1940.

#### dis  
weighted mean of distances to five Boston employment centers.

#### rad  
index of accessibility to radial highways.

#### tax  
full-value property-tax rate per $10,000.

#### ptratio  
pupil-teacher ratio by town.

#### black  
1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.

#### lstat  
lower status of the population (percent).

#### medv  
median value of owner-occupied homes in $1000s.
  
  
  
## Source

Harrison, D. and Rubinfeld, D.L. (1978) Hedonic prices and the demand for clean air. J. Environ. Economics and Management 5, 81–102.

Belsley D.A., Kuh, E. and Welsch, R.E. (1980) Regression Diagnostics. Identifying Influential Data and Sources of Collinearity. New York: Wiley.

## Summary

Congratulations, you've completed your first "freeform" exploratory data analysis of a popular data set!

In [1]:
import pandas as pd
pd.read_csv('train.csv')

Unnamed: 0,ID,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,black,lstat,medv
0,1,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.0900,1,296,15.3,396.90,4.98,24.0
1,2,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.90,9.14,21.6
2,4,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
3,5,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.90,5.33,36.2
4,7,0.08829,12.5,7.87,0,0.524,6.012,66.6,5.5605,5,311,15.2,395.60,12.43,22.9
5,11,0.22489,12.5,7.87,0,0.524,6.377,94.3,6.3467,5,311,15.2,392.52,20.45,15.0
6,12,0.11747,12.5,7.87,0,0.524,6.009,82.9,6.2267,5,311,15.2,396.90,13.27,18.9
7,13,0.09378,12.5,7.87,0,0.524,5.889,39.0,5.4509,5,311,15.2,390.50,15.71,21.7
8,14,0.62976,0.0,8.14,0,0.538,5.949,61.8,4.7075,4,307,21.0,396.90,8.26,20.4
9,15,0.63796,0.0,8.14,0,0.538,6.096,84.5,4.4619,4,307,21.0,380.02,10.26,18.2


In [2]:
df = pd.read_csv('train.csv')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 333 entries, 0 to 332
Data columns (total 15 columns):
ID         333 non-null int64
crim       333 non-null float64
zn         333 non-null float64
indus      333 non-null float64
chas       333 non-null int64
nox        333 non-null float64
rm         333 non-null float64
age        333 non-null float64
dis        333 non-null float64
rad        333 non-null int64
tax        333 non-null int64
ptratio    333 non-null float64
black      333 non-null float64
lstat      333 non-null float64
medv       333 non-null float64
dtypes: float64(11), int64(4)
memory usage: 39.1 KB


In [4]:
df.describe()

Unnamed: 0,ID,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,black,lstat,medv
count,333.0,333.0,333.0,333.0,333.0,333.0,333.0,333.0,333.0,333.0,333.0,333.0,333.0,333.0,333.0
mean,250.951952,3.360341,10.689189,11.293483,0.06006,0.557144,6.265619,68.226426,3.709934,9.633634,409.279279,18.448048,359.466096,12.515435,22.768769
std,147.859438,7.352272,22.674762,6.998123,0.237956,0.114955,0.703952,28.133344,1.981123,8.742174,170.841988,2.151821,86.584567,7.067781,9.173468
min,1.0,0.00632,0.0,0.74,0.0,0.385,3.561,6.0,1.1296,1.0,188.0,12.6,3.5,1.73,5.0
25%,123.0,0.07896,0.0,5.13,0.0,0.453,5.884,45.4,2.1224,4.0,279.0,17.4,376.73,7.18,17.4
50%,244.0,0.26169,0.0,9.9,0.0,0.538,6.202,76.7,3.0923,5.0,330.0,19.0,392.05,10.97,21.6
75%,377.0,3.67822,12.5,18.1,0.0,0.631,6.595,93.8,5.1167,24.0,666.0,20.2,396.24,16.42,25.0
max,506.0,73.5341,100.0,27.74,1.0,0.871,8.725,100.0,10.7103,24.0,711.0,21.2,396.9,37.97,50.0


In [11]:
df['crim'].mean()

3.360341471471471

In [12]:
df['indus'].mean()

11.293483483483483

In [13]:
df['nox'].mean()

0.5571441441441441

In [14]:
df['crim'].median()

0.26169000000000003

In [15]:
df['indus'].median()

9.9

In [16]:
df['nox'].median()

0.5379999999999999

In [17]:
df['crim'].std()

7.352271836781107

In [18]:
df['indus'].std()

6.998123104477312

In [19]:
df['nox'].std()

0.11495450830289293
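
The nine separate calls above can be collapsed into a single table. The following is a sketch under the assumption that the same three columns are of interest; `sample` is a stand-in built from a few rows of `train.csv` so the snippet runs on its own.

```python
import pandas as pd

# Stand-in sample (rows copied from train.csv); in the notebook this would be df
sample = pd.DataFrame({
    'crim':  [0.00632, 0.02731, 0.03237, 0.06905, 0.08829, 0.62976, 0.63796],
    'indus': [2.31, 7.07, 2.18, 2.18, 7.87, 8.14, 8.14],
    'nox':   [0.538, 0.469, 0.458, 0.458, 0.524, 0.538, 0.538],
})

# One row per statistic, one column per variable
summary = sample[['crim', 'indus', 'nox']].agg(['mean', 'median', 'std'])
print(summary)
```

Seeing mean and median side by side makes skew easy to spot: in the full data the mean of `crim` (about 3.36) is far above its median (about 0.26), which suggests a long right tail.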

In [25]:
# Properties whose tract bounds the Charles River (chas == 1)
properties_next_charles = df.loc[df['chas'] == 1, :]
properties_next_charles

Unnamed: 0,ID,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,black,lstat,medv
97,143,3.32105,0.0,19.58,1,0.871,5.403,100.0,1.3216,5,403,14.7,396.9,26.82,13.4
104,155,1.41385,0.0,19.58,1,0.871,6.129,96.0,1.7494,5,403,14.7,321.02,15.12,17.0
108,161,1.27346,0.0,19.58,1,0.605,6.25,92.6,1.7984,5,403,14.7,338.92,5.5,27.0
110,164,1.51902,0.0,19.58,1,0.605,8.375,93.9,2.162,5,403,14.7,388.45,3.32,50.0
145,209,0.13587,0.0,10.59,1,0.489,6.064,59.1,4.2392,4,277,18.6,381.32,14.66,24.4
146,212,0.37578,0.0,10.59,1,0.489,5.404,88.6,3.665,4,277,18.6,395.24,23.98,19.3
149,217,0.0456,0.0,13.89,1,0.55,5.888,56.0,3.1121,5,276,16.4,392.8,13.51,23.3
150,222,0.40771,0.0,6.2,1,0.507,6.164,91.3,3.048,8,307,17.4,395.24,21.46,21.7
151,223,0.62356,0.0,6.2,1,0.507,6.879,77.7,3.2721,8,307,17.4,390.39,9.93,27.5
161,235,0.44791,0.0,6.2,1,0.507,6.726,66.5,3.6519,8,307,17.4,360.2,8.05,29.0


In [26]:
# Properties whose tract does not bound the Charles River (chas == 0)
properties_not_charles = df.loc[df['chas'] == 0, :]
properties_not_charles

Unnamed: 0,ID,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,black,lstat,medv
0,1,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.0900,1,296,15.3,396.90,4.98,24.0
1,2,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.90,9.14,21.6
2,4,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
3,5,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.90,5.33,36.2
4,7,0.08829,12.5,7.87,0,0.524,6.012,66.6,5.5605,5,311,15.2,395.60,12.43,22.9
5,11,0.22489,12.5,7.87,0,0.524,6.377,94.3,6.3467,5,311,15.2,392.52,20.45,15.0
6,12,0.11747,12.5,7.87,0,0.524,6.009,82.9,6.2267,5,311,15.2,396.90,13.27,18.9
7,13,0.09378,12.5,7.87,0,0.524,5.889,39.0,5.4509,5,311,15.2,390.50,15.71,21.7
8,14,0.62976,0.0,8.14,0,0.538,5.949,61.8,4.7075,4,307,21.0,396.90,8.26,20.4
9,15,0.63796,0.0,8.14,0,0.538,6.096,84.5,4.4619,4,307,21.0,380.02,10.26,18.2


In [27]:
properties_next_charles.mean()

ID         255.600000
crim         2.163972
zn           8.500000
indus       12.330000
chas         1.000000
nox          0.593595
rm           6.577750
age         75.815000
dis          3.069540
rad          9.900000
tax        394.550000
ptratio     17.385000
black      380.681000
lstat       11.118000
medv        30.175000
dtype: float64

In [28]:
properties_not_charles.mean()

ID         250.654952
crim         3.436787
zn          10.829073
indus       11.227252
chas         0.000000
nox          0.554815
rm           6.245674
age         67.741534
dis          3.750853
rad          9.616613
tax        410.220447
ptratio     18.515974
black      358.110511
lstat       12.604728
medv        22.295527
dtype: float64

In [29]:
properties_next_charles.mean() / properties_not_charles.mean()

ID         1.019729
crim       0.629650
zn         0.784924
indus      1.098221
chas            inf
nox        1.069897
rm         1.053169
age        1.119180
dis        0.818358
rad        1.029468
tax        0.961800
ptratio    0.938919
black      1.063027
lstat      0.882050
medv       1.353410
dtype: float64
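
The `inf` for `chas` comes from dividing 1 by 0, and the `ID` ratio has no interpretation because `ID` is only a row label. One way to avoid both, sketched here on a stand-in `sample` built from a few rows of `train.csv`, is to drop `ID` and group by `chas` so the split variable becomes the index instead of a column.

```python
import pandas as pd

# Stand-in sample (rows copied from train.csv); in the notebook this would be df
sample = pd.DataFrame({
    'ID':   [1, 2, 4, 143, 155, 161],
    'crim': [0.00632, 0.02731, 0.03237, 3.32105, 1.41385, 1.27346],
    'chas': [0, 0, 0, 1, 1, 1],
    'medv': [24.0, 21.6, 33.4, 13.4, 17.0, 27.0],
})

# ID is a row label and chas is the grouping key, so neither belongs in the comparison
means = sample.drop(columns='ID').groupby('chas').mean()

# Ratio of river-adjacent means to the rest, one value per remaining variable
ratio = means.loc[1] / means.loc[0]
print(ratio)
```

The numbers from this six-row sample are not meaningful in themselves; on the full data frame the same two lines should reproduce the ratios shown above, minus the `ID` and `chas` rows.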

In [34]:
# Tracts with a per capita crime rate above the median
high_crime_rate = df.loc[df['crim'] > df['crim'].median(), :]
high_crime_rate

Unnamed: 0,ID,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,black,lstat,medv
8,14,0.62976,0.0,8.14,0,0.538,5.949,61.8,4.7075,4,307,21.0,396.90,8.26,20.4
9,15,0.63796,0.0,8.14,0,0.538,6.096,84.5,4.4619,4,307,21.0,380.02,10.26,18.2
10,16,0.62739,0.0,8.14,0,0.538,5.834,56.5,4.4986,4,307,21.0,395.62,8.47,19.9
11,17,1.05393,0.0,8.14,0,0.538,5.935,29.3,4.4986,4,307,21.0,386.85,6.58,23.1
12,19,0.80271,0.0,8.14,0,0.538,5.456,36.6,3.7965,4,307,21.0,288.99,11.69,20.2
13,21,1.25179,0.0,8.14,0,0.538,5.570,98.1,3.7979,4,307,21.0,376.57,21.02,13.6
14,22,0.85204,0.0,8.14,0,0.538,5.965,89.2,4.0123,4,307,21.0,392.53,13.83,19.6
15,23,1.23247,0.0,8.14,0,0.538,6.142,91.7,3.9769,4,307,21.0,396.90,18.72,15.2
16,24,0.98843,0.0,8.14,0,0.538,5.813,100.0,4.0952,4,307,21.0,394.54,19.88,14.5
17,28,0.95577,0.0,8.14,0,0.538,6.047,88.8,4.4534,4,307,21.0,306.38,17.28,14.8


In [35]:
# Tracts with a per capita crime rate at or below the median
low_crime_rate = df.loc[df['crim'] <= df['crim'].median(), :]
low_crime_rate

Unnamed: 0,ID,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,black,lstat,medv
0,1,0.00632,18.0,2.31,0,0.5380,6.575,65.2,4.0900,1,296,15.3,396.90,4.98,24.0
1,2,0.02731,0.0,7.07,0,0.4690,6.421,78.9,4.9671,2,242,17.8,396.90,9.14,21.6
2,4,0.03237,0.0,2.18,0,0.4580,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
3,5,0.06905,0.0,2.18,0,0.4580,7.147,54.2,6.0622,3,222,18.7,396.90,5.33,36.2
4,7,0.08829,12.5,7.87,0,0.5240,6.012,66.6,5.5605,5,311,15.2,395.60,12.43,22.9
5,11,0.22489,12.5,7.87,0,0.5240,6.377,94.3,6.3467,5,311,15.2,392.52,20.45,15.0
6,12,0.11747,12.5,7.87,0,0.5240,6.009,82.9,6.2267,5,311,15.2,396.90,13.27,18.9
7,13,0.09378,12.5,7.87,0,0.5240,5.889,39.0,5.4509,5,311,15.2,390.50,15.71,21.7
21,39,0.17505,0.0,5.96,0,0.4990,5.966,30.2,3.8473,5,279,19.2,393.43,10.13,24.7
22,40,0.02763,75.0,2.95,0,0.4280,6.595,21.8,5.4011,3,252,18.3,395.63,4.32,30.8


In [36]:
high_crime_rate.mean() / low_crime_rate.mean()

ID          1.605832
crim       69.231407
zn          0.061162
indus       2.111285
chas        2.347390
nox         1.344393
rm          0.958813
age         1.657752
dis         0.515212
rad         3.664490
tax         1.680636
ptratio     1.052604
black       0.848878
lstat       1.695279
medv        0.830421
dtype: float64

In [37]:
high_crime_rate.mean()

ID         309.512048
crim         6.644374
zn           1.228916
indus       15.343735
chas         0.084337
nox          0.639271
rm           6.133488
age         85.174699
dis          2.520529
rad         15.162651
tax        513.590361
ptratio     18.922289
black      330.003434
lstat       15.756145
medv        20.653614
dtype: float64

In [38]:
low_crime_rate.mean()

ID         192.742515
crim         0.095973
zn          20.092814
indus        7.267485
chas         0.035928
nox          0.475509
rm           6.396958
age         51.379641
dis          4.892216
rad          4.137725
tax        305.592814
ptratio     17.976647
black      388.752335
lstat        9.294132
medv        24.871257
dtype: float64
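
The same median split can also be expressed as a label column, which keeps both subsets in one frame and makes it easy to compare medians and spread rather than means alone. This is a sketch of an alternative, not the approach used above; it relies on `pd.qcut` with two quantile bins, and `sample` is a stand-in built from a few rows of `train.csv`.

```python
import pandas as pd

# Stand-in sample (rows copied from train.csv); in the notebook this would be df
sample = pd.DataFrame({
    'crim': [0.00632, 0.02731, 0.03237, 0.06905, 0.62976, 0.63796, 1.05393, 1.25179],
    'medv': [24.0, 21.6, 33.4, 36.2, 20.4, 18.2, 23.1, 13.6],
})

# Two quantile bins split at the median: 'low' is crim <= median, 'high' is crim > median
crime_level = pd.qcut(sample['crim'], q=2, labels=['low', 'high'])

# Group size, centrality, and dispersion of medv for each crime level
by_crime = sample.groupby(crime_level, observed=True)['medv'].agg(['count', 'median', 'std'])
print(by_crime)
```

Because `crim` is strongly right-skewed, the median is arguably a more robust summary than the mean for this kind of comparison.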

In [40]:
import matplotlib.pyplot as plt
%matplotlib notebook
plt.style.use('ggplot')

In [None]:
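import numpy as np  # needed for np.random below
# Scratch example of three random walks indexed by day; `data` is overwritten by later cells
data = pd.DataFrame({'A': np.random.randn(365).cumsum(),
                     'B': np.random.randn(365).cumsum() + 25,
                     'C': np.random.randn(365).cumsum() - 25},
                    index=pd.date_range('1/1/2018', periods=365))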

In [43]:
properties_crime_not_chls = properties_not_charles['crim']
properties_crime_not_chls

0       0.00632
1       0.02731
2       0.03237
3       0.06905
4       0.08829
5       0.22489
6       0.11747
7       0.09378
8       0.62976
9       0.63796
10      0.62739
11      1.05393
12      0.80271
13      1.25179
14      0.85204
15      1.23247
16      0.98843
17      0.95577
18      1.13081
19      1.35472
20      1.61282
21      0.17505
22      0.02763
23      0.03359
24      0.14150
25      0.15936
26      0.12269
27      0.17142
28      0.18836
29      0.22927
         ...   
303     7.83932
304     3.16360
305     3.77498
306     4.42228
307    15.57570
308    13.07510
309     4.03841
310     3.56868
311     8.05579
312     4.87141
313    15.02340
314    10.23300
315    14.33370
316     5.82401
317     5.70818
318     2.81838
319     2.37857
320     5.69175
321     4.83567
322     0.15086
323     0.20746
324     0.10574
325     0.11132
326     0.17331
327     0.26838
328     0.17783
329     0.06263
330     0.04527
331     0.06076
332     0.04741
Name: crim, Length: 313, dtype: float64

In [44]:
properties_next_charles_crime = properties_next_charles['crim']
properties_next_charles_crime

97     3.32105
104    1.41385
108    1.27346
110    1.51902
145    0.13587
146    0.37578
149    0.04560
150    0.40771
151    0.62356
161    0.44791
163    0.52058
184    0.22188
185    0.05644
188    0.06129
189    0.01501
234    8.98296
235    3.84970
236    5.20177
244    6.53876
246    8.26725
Name: crim, dtype: float64

In [53]:
data = pd.DataFrame({'A':properties_next_charles_crime,
                    'B':properties_crime_not_chls})
data.head()

Unnamed: 0,A,B
0,,0.00632
1,,0.02731
2,,0.03237
3,,0.06905
4,,0.08829
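
The `NaN` values in column `A` are expected: when a data frame is built from two Series, pandas aligns them on their index, and the two subsets share no row labels. A small sketch with made-up variable names (`next_river`, `not_river`) and a handful of values from the output above shows the effect.

```python
import pandas as pd

# Two Series with disjoint indexes, as produced by splitting one frame into two subsets
next_river = pd.Series([3.32105, 1.41385], index=[97, 104], name='crim')
not_river = pd.Series([0.00632, 0.02731, 0.03237], index=[0, 1, 2], name='crim')

# The constructor takes the union of both indexes and fills the gaps with NaN
aligned = pd.DataFrame({'A': next_river, 'B': not_river})
print(aligned)
```

Each column therefore holds only its own subset's values, with `NaN` everywhere else, which is why `head()` shows an empty `A` column for the first rows.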


In [63]:
data.plot.hist(alpha = 0.3, bins = 20)

<matplotlib.axes._subplots.AxesSubplot at 0x1ce49822048>

In [69]:
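# Median home value for the high-crime (A) and low-crime (B) subsets.
# Rows are aligned on the original index, so each column is NaN for rows of the other subset.
data = pd.DataFrame({'A': high_crime_rate['medv'],
                     'B': low_crime_rate['medv']})
data.plot.hist(alpha=0.3, bins=35)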

<matplotlib.axes._subplots.AxesSubplot at 0x1ce4a53a630>
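
The goals ask for subplots so that the subsets are easy to compare. The overlaid histograms above could be laid out side by side, with a row of scatterplots underneath, along the following lines. This is a sketch under assumptions: `sample` is a stand-in built from a few rows of `train.csv`, and the choice of `lstat` for the scatterplot is only an example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs as a script; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Stand-in sample (rows copied from train.csv); in the notebook this would be df
sample = pd.DataFrame({
    'crim':  [0.00632, 0.02731, 0.03237, 0.06905, 0.62976, 0.63796, 1.05393, 1.25179],
    'lstat': [4.98, 9.14, 2.94, 5.33, 8.26, 10.26, 6.58, 21.02],
    'medv':  [24.0, 21.6, 33.4, 36.2, 20.4, 18.2, 23.1, 13.6],
})

median_crime = sample['crim'].median()
subsets = {
    'high crime': sample.loc[sample['crim'] > median_crime, :],
    'low crime': sample.loc[sample['crim'] <= median_crime, :],
}

# Common bin edges so the two histograms are directly comparable
bins = np.linspace(sample['medv'].min(), sample['medv'].max(), 11)

# Top row: histograms of medv; bottom row: medv against lstat. Axes are shared within each row.
fig, axes = plt.subplots(2, 2, figsize=(10, 8), sharex='row', sharey='row')
for col, (label, subset) in enumerate(subsets.items()):
    axes[0, col].hist(subset['medv'], bins=bins)
    axes[0, col].set_title(f'medv, {label}')
    axes[1, col].scatter(subset['lstat'], subset['medv'])
    axes[1, col].set_xlabel('lstat')
    axes[1, col].set_ylabel('medv')
    axes[1, col].set_title(f'medv vs. lstat, {label}')
fig.tight_layout()
```

Sharing axes within a row keeps the scales identical across the two subsets, so differences in location and spread are visible without reading the tick labels.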