# Project - EDA with Pandas Using the Boston Housing Data

## Introduction

In this section, you've learned a lot about importing, cleaning up, analyzing (using descriptive statistics), and visualizing data. In this more free-form project, you'll get a chance to practice all of these skills with the Boston Housing data set, which contains housing values in suburbs of Boston. The Boston Housing Data is commonly used by aspiring data scientists.

## Objectives

You will be able to:

* Load CSV files using Pandas
* Find variables with high correlation
* Create box plots

# Goals

Use your data munging and visualization skills to conduct an exploratory analysis of the dataset below. At minimum, this should include:

* Loading the data (which is stored in the file train.csv)
* Use built-in Python functions to explore measures of centrality and dispersion for at least 3 variables
* Create *meaningful* subsets of the data using `.loc`, `.iloc`, or related selection operations. Explain why you chose each subset, and do this for 3 possible 2-way splits. State how you think the 2 measures of centrality and/or dispersion might differ for each subset of the data. Examples of potential splits:
    - Create 2 new DataFrames based on your existing data, where one contains all the properties next to the Charles River and the other contains the properties that aren't.
    - Create 2 new DataFrames based on a certain split for crime rate.
* Next, use histograms and scatterplots to see whether you observe differences for the subsets of the data. Make sure to use subplots so it is easy to compare the relationships.

In [4]:
# Loading the data (which is stored in the file train.csv)
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
df = pd.read_csv("train.csv")
df.head()

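|   | ID | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat | medv |
|---|----|------|----|-------|------|-----|----|-----|-----|-----|-----|---------|-------|-------|------|
| 0 | 1 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1 | 296 | 15.3 | 396.90 | 4.98 | 24.0 |
| 1 | 2 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.90 | 9.14 | 21.6 |
| 2 | 4 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 | 33.4 |
| 3 | 5 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.90 | 5.33 | 36.2 |
| 4 | 7 | 0.08829 | 12.5 | 7.87 | 0 | 0.524 | 6.012 | 66.6 | 5.5605 | 5 | 311 | 15.2 | 395.60 | 12.43 | 22.9 |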
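
With the data loaded, one of the objectives listed earlier is finding variables with high correlation. The following is a minimal sketch of one way to approach that, not the lesson's prescribed method. It rebuilds three columns from the five rows displayed above so that it runs on its own; the helper name `strongest_pairs` and the 0.7 threshold are assumptions chosen for illustration. On the real data you would pass `df` instead of `sample`.

```python
import numpy as np
import pandas as pd

# Three columns copied from the five rows of df.head() shown above,
# so the sketch runs without train.csv.
sample = pd.DataFrame({
    "rm": [6.575, 6.421, 6.998, 7.147, 6.012],
    "lstat": [4.98, 9.14, 2.94, 5.33, 12.43],
    "medv": [24.0, 21.6, 33.4, 36.2, 22.9],
})


def strongest_pairs(frame, threshold=0.7):
    """Return column pairs whose absolute Pearson correlation is >= threshold.

    Hypothetical helper; the name and the default threshold are illustrative.
    """
    corr = frame.select_dtypes("number").corr()
    # Keep only the upper triangle so each pair appears once and the
    # diagonal (every column against itself) is excluded.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    strong = pairs[pairs.abs() >= threshold]
    # Order from strongest to weakest relationship, ignoring the sign.
    order = strong.abs().sort_values(ascending=False).index
    return strong.reindex(order)


strong = strongest_pairs(sample)
print(strong)
```

Five rows are far too few to draw conclusions from; the point is only the pattern of computing `corr()` once and then filtering the pairs.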


In [8]:
# Use built-in Python functions to explore measures of centrality and dispersion for at least 3 variables
# Variables explored here: rm, tax, medv
display(df["rm"].describe())
display(df["tax"].describe())
display(df["medv"].describe())

count    333.000000
mean       6.265619
std        0.703952
min        3.561000
25%        5.884000
50%        6.202000
75%        6.595000
max        8.725000
Name: rm, dtype: float64

count    333.000000
mean     409.279279
std      170.841988
min      188.000000
25%      279.000000
50%      330.000000
75%      666.000000
max      711.000000
Name: tax, dtype: float64

count    333.000000
mean      22.768769
std        9.173468
min        5.000000
25%       17.400000
50%       21.600000
75%       25.000000
max       50.000000
Name: medv, dtype: float64
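
`describe()` bundles several statistics together. To look at individual measures of centrality (mean, median) and dispersion (standard deviation, interquartile range, range) one at a time, something like the following sketch could be used. It relies only on the five `rm` values visible in the `df.head()` output so that it is self-contained; the helper name `summarize` is an assumption, and on the real data you would pass `df["rm"]` instead.

```python
import pandas as pd

# The five rm values from the df.head() output, used as a stand-in for df["rm"].
rm_sample = pd.Series([6.575, 6.421, 6.998, 7.147, 6.012], name="rm")


def summarize(series):
    """Collect a few measures of centrality and dispersion for one variable.

    Hypothetical helper written for this sketch.
    """
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    return {
        "mean": series.mean(),      # centrality
        "median": series.median(),  # centrality, robust to outliers
        "std": series.std(),        # dispersion (sample standard deviation)
        "iqr": q3 - q1,             # dispersion, robust to outliers
        "range": series.max() - series.min(),
    }


summary = summarize(rm_sample)
for name, value in summary.items():
    print(f"{name:>6}: {value:.3f}")
```

Comparing the mean with the median, and the standard deviation with the IQR, gives a quick hint about skew and outliers in a variable.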

In [9]:
# Create meaningful subsets of the data using .loc, .iloc, or related selection operations.
#     Explain why you chose each subset, and do this for 3 possible 2-way splits.
#     State how you think the 2 measures of centrality and/or dispersion might differ for each subset of the data.
#     Examples of potential splits:
#         Create 2 new DataFrames based on your existing data,
#         where one contains all the properties next to the Charles River
#         and the other contains the properties that aren't.
#         Create 2 new DataFrames based on a certain split for crime rate.

# Split 1: chas == 1 marks tracts that bound the Charles River
by_river = df.loc[df["chas"] == 1]
away_river = df.loc[df["chas"] != 1]

Separating properties by whether they are near the river: predictions.
I chose to separate the data into properties near the river and properties away from it because most towns near a river originally formed to use it as a resource, and as the city grew, new properties were located farther from it. This means the age of the properties should decrease the farther from the river you get. I expect most variables to decrease with distance from the river, except the proportion of residential land zoned (`zn`), because more residential land is available the farther you are from the city. The median value of homes (`medv`) should increase as well.
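
One hedged way to check predictions like these is to compute the same statistics for both sides of the split and read them side by side. The sketch below uses a small made-up frame (the values are invented and carry no evidential weight, since none of the rows shown in `df.head()` border the river); on the real data you would group `df` by `chas` in the same way.

```python
import pandas as pd

# Made-up values for illustration only; they are NOT taken from train.csv.
toy = pd.DataFrame({
    "chas": [0, 0, 0, 1, 1],
    "age": [40.0, 55.0, 70.0, 80.0, 90.0],
    "medv": [20.0, 22.0, 24.0, 30.0, 34.0],
})

# Grouping by chas produces the same two subsets as the .loc selections
# (chas == 1 and chas != 1) and summarizes both in one table.
comparison = toy.groupby("chas")[["age", "medv"]].agg(["mean", "median", "std"])
print(comparison)
```

Each row of `comparison` is one side of the split, so a prediction such as "age is higher near the river" becomes a direct comparison of two cells.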

In [5]:
# Split the crime rate (crim) by median home value (medv), using 21 as the threshold
low_medv_crim = df.loc[df["medv"] <= 21, "crim"]
high_medv_crim = df.loc[df["medv"] > 21, "crim"]

Separating homes by the median value of owner-occupied homes into above average (21) and below average, and then looking at the per capita crime rate in each group, shows that homes with a higher median value are in towns reporting a lower per capita crime rate. Homes with a lower median value are mostly in towns reporting similar crime rates, but the data has a larger spread, which indicates that some of these homes are in higher-crime areas.
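
The split described above (group homes by median value, then compare the crime rate in each group) can be sketched as follows. This is an illustration under stated assumptions rather than the analysis itself: it uses only the `crim` and `medv` values from the five rows of `df.head()`, and it takes the sample's own median as the cut-off instead of the fixed value 21.

```python
import pandas as pd

# crim and medv copied from the five rows of df.head() shown earlier.
sample = pd.DataFrame({
    "crim": [0.00632, 0.02731, 0.03237, 0.06905, 0.08829],
    "medv": [24.0, 21.6, 33.4, 36.2, 22.9],
})

# Assumption: use the sample's own median as the cut-off instead of a fixed number.
threshold = sample["medv"].median()

# Boolean mask on medv selects the rows; "crim" selects the column to compare.
low_medv_crim = sample.loc[sample["medv"] <= threshold, "crim"]
high_medv_crim = sample.loc[sample["medv"] > threshold, "crim"]

for label, crim in [("low medv", low_medv_crim), ("high medv", high_medv_crim)]:
    print(f"{label}: n={len(crim)}, mean={crim.mean():.5f}, std={crim.std():.5f}")
```

Deriving the threshold from the data keeps the two groups roughly balanced, which makes their means and spreads easier to compare.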

In [6]:
# Split on age: the proportion of owner-occupied units built prior to 1940
lower_age_tax = df.loc[df["age"] <= 50]
higher_age_tax = df.loc[df["age"] > 50]

#comment
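
Box plots are listed among the objectives and suit this kind of two-way split well. Below is a rough sketch that compares `tax` across the two age groups. The frame is made up for illustration (the numbers are not from `train.csv`), and the choice of `tax` as the compared column is an assumption based on the variable names above.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values for illustration only; they are NOT taken from train.csv.
toy = pd.DataFrame({
    "age": [20.0, 35.0, 45.0, 48.0, 60.0, 75.0, 88.0, 95.0],
    "tax": [220, 250, 240, 300, 310, 400, 666, 666],
})

lower_age_tax = toy.loc[toy["age"] <= 50, "tax"]
higher_age_tax = toy.loc[toy["age"] > 50, "tax"]

fig, ax = plt.subplots(figsize=(6, 4))
# One box per subset on shared axes, so medians and spreads line up.
boxes = ax.boxplot([lower_age_tax.to_numpy(), higher_age_tax.to_numpy()])
ax.set_xticklabels(["age <= 50", "age > 50"])
ax.set_ylabel("tax")
ax.set_title("Property-tax rate by age group (toy data)")
```

In a notebook with `%matplotlib inline` the figure appears when the cell finishes; in a plain script you would call `plt.show()` afterwards.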

In [None]:
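# Scatter plot of rooms (rm) against median value (medv); the column choices are illustrative
df.plot.scatter('rm', 'medv',
                c='crim',       # color each point by crime rate
                s=df['age'],    # size each point by the age variable
                colormap='viridis');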

In [None]:
# Define a new figure with matplotlib's plt.figure() function. Set the size of the figure space
new_figure = plt.figure(figsize=(10,4))

# Add a subplot to the figure - a new axes
ax = new_figure.add_subplot(121)

# Add a second subplot to the figure - a new axes
ax2 = new_figure.add_subplot(122)

# Generate a line plot on first axes
ax.plot([1, 4, 6, 8], [10, 15, 27, 32], color='lightblue', linewidth=3, linestyle = '-.')

# Draw a scatter plot on 2nd axes
ax2.scatter([0.5, 2.2, 4.2, 6.5], [21, 19, 9, 26], color='red', marker='o')

# Set the limits of x and y for first axes
ax.set_xlim(0, 9)
ax.set_ylim(5, 35)

# Set the limits of x and y for 2nd axes
ax2.set_xlim(0, 9)
ax2.set_ylim(5, 35)

# Show the plot
plt.show()
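
The same subplot idea can be applied to the subsets themselves, as the goals ask: histograms and scatter plots drawn side by side for each half of a split. The sketch below is one possible layout, not the required one. It generates a synthetic stand-in for `train.csv` (random numbers, labeled as such) so that it runs on its own; with the real data, the two entries of `subsets` would be `by_river` and `away_river`.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for train.csv: random values, for illustration only.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "chas": rng.integers(0, 2, size=200),
    "rm": rng.normal(6.3, 0.7, size=200),
})
toy["medv"] = 5 * toy["rm"] - 9 + rng.normal(0, 3, size=200)

subsets = {
    "by river (chas == 1)": toy.loc[toy["chas"] == 1],
    "away from river (chas != 1)": toy.loc[toy["chas"] != 1],
}

# Row 0: histograms of medv; row 1: scatter plots of rm vs. medv.
# sharex/sharey="row" keeps the scales identical within each row.
fig, axes = plt.subplots(2, 2, figsize=(10, 8), sharex="row", sharey="row")
for col, (title, subset) in enumerate(subsets.items()):
    axes[0, col].hist(subset["medv"], bins=15)
    axes[0, col].set_title(title)
    axes[0, col].set_xlabel("medv")
    axes[1, col].scatter(subset["rm"], subset["medv"], alpha=0.6)
    axes[1, col].set_xlabel("rm")
    axes[1, col].set_ylabel("medv")
fig.tight_layout()
```

Sharing axes within a row matters here: without it, each panel autoscales and two very different distributions can look deceptively alike.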

# Variable Descriptions

This data frame contains the following columns:

#### crim  
per capita crime rate by town.

#### zn  
proportion of residential land zoned for lots over 25,000 sq.ft.

#### indus  
proportion of non-retail business acres per town.

#### chas  
Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

#### nox  
nitrogen oxides concentration (parts per 10 million).

#### rm  
average number of rooms per dwelling.

#### age  
proportion of owner-occupied units built prior to 1940.

#### dis  
weighted mean of distances to five Boston employment centres.

#### rad  
index of accessibility to radial highways.

#### tax  
full-value property-tax rate per $10,000.

#### ptratio  
pupil-teacher ratio by town.

#### black  
1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.

#### lstat  
lower status of the population (percent).

#### medv  
median value of owner-occupied homes in $1000s.
  
  
  
## Source

Harrison, D. and Rubinfeld, D.L. (1978) Hedonic prices and the demand for clean air. J. Environ. Economics and Management 5, 81–102.

Belsley D.A., Kuh, E. and Welsch, R.E. (1980) Regression Diagnostics. Identifying Influential Data and Sources of Collinearity. New York: Wiley.

## Summary

Congratulations, you've completed your first "freeform" exploratory data analysis of a popular data set!