# Project - EDA with Pandas Using the Ames Housing Data

## Introduction

In this section, you've learned a lot about importing, cleaning up, analyzing (using descriptive statistics), and visualizing data. In this more free-form project, you'll get a chance to practice all of these skills with the Ames Housing dataset, which describes residential properties sold in Ames, Iowa.

## Objectives

You will be able to:

* Perform a full exploratory data analysis process to gain insight about a dataset 

## Goals

Use your data munging and visualization skills to conduct an exploratory analysis of the dataset below. At a minimum, this should include:

* Load the data (stored in the file ``ames_train.csv``)
* Use built-in Python functions to explore measures of centrality and dispersion for at least 3 variables
* Create *meaningful* subsets of the data using selection operations like `.loc`, `.iloc`, or related operations. Explain why you chose each subset, and do this for three possible 2-way splits. State how you expect the measures of centrality and/or dispersion to differ between the two subsets of each split.
* Next, use histograms and scatter plots to see whether you observe differences for the subsets of the data. Make sure to use subplots so it is easy to compare the relationships.
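
As a starting point for the second and third goals, the sketch below computes centrality and dispersion for three variables and builds one 2-way split with `.loc`. The `toy` DataFrame and its values are made up for illustration (only the column names come from the dataset), and the 1990 cutoff is an arbitrary assumption rather than a recommended threshold.

```python
import pandas as pd

# Hypothetical toy rows; the real data would come from
# pd.read_csv('ames_train.csv'). Only the column names are real.
toy = pd.DataFrame({
    'SalePrice': [200000, 180000, 220000, 140000, 250000, 150000],
    'LotArea': [8000, 9500, 11000, 9000, 14000, 12000],
    'OverallCond': [5, 8, 5, 5, 5, 6],
    'YearBuilt': [2003, 1976, 2001, 1915, 2000, 1993],
})

# Centrality (mean, median) and dispersion (std) for three variables
summary = toy[['SalePrice', 'LotArea', 'OverallCond']].agg(['mean', 'median', 'std'])
print(summary)

# One possible 2-way split: built before 1990 vs. 1990 or later
older = toy.loc[toy['YearBuilt'] < 1990]
newer = toy.loc[toy['YearBuilt'] >= 1990]

# Compare a measure of centrality across the two subsets
print('older median:', older['SalePrice'].median())
print('newer median:', newer['SalePrice'].median())
```

Writing down the expected direction of the difference before running the comparison makes it easier to tell whether a split is actually informative.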

## Variable Descriptions
Look in ``data_description.txt`` for a full description of all variables.

A preview of some of the columns:

**MSZoning**: Identifies the general zoning classification of the sale.
		
       A     Agriculture
       C     Commercial
       FV    Floating Village Residential
       I     Industrial
       RH    Residential High Density
       RL    Residential Low Density
       RP    Residential Low Density Park
       RM    Residential Medium Density

**OverallCond**: Rates the overall condition of the house

       10    Very Excellent
       9     Excellent
       8     Very Good
       7     Good
       6     Above Average
       5     Average
       4     Below Average
       3     Fair
       2     Poor
       1     Very Poor

**KitchenQual**: Kitchen quality

       Ex    Excellent
       Gd    Good
       TA    Typical/Average
       Fa    Fair
       Po    Poor

**YrSold**: Year Sold (YYYY)

**SalePrice**: Sale price of the house in dollars
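
Codes such as those for **KitchenQual** are stored as plain strings, so pandas does not know that `Ex` ranks above `Gd`. One possible way to encode that ranking is an ordered categorical dtype. The sketch below assumes the worst-to-best order implied by the list above; the sample values are hypothetical.

```python
import pandas as pd

# Quality codes from the description above, ordered worst to best
quality_order = ['Po', 'Fa', 'TA', 'Gd', 'Ex']
quality_dtype = pd.CategoricalDtype(categories=quality_order, ordered=True)

# Hypothetical sample values; real ones would come from df['KitchenQual']
kitchen = pd.Series(['Gd', 'TA', 'Ex', 'Fa', 'TA'], name='KitchenQual')
kitchen_ordered = kitchen.astype(quality_dtype)

# With an ordered dtype, comparisons and min/max follow the quality scale
print(kitchen_ordered.min(), kitchen_ordered.max())
at_least_good = kitchen_ordered >= 'Gd'
print(at_least_good.sum())
```

A boolean mask like `at_least_good` can then serve as one of the 2-way splits, for example kitchens rated Good or better versus all others.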

In [13]:
# Let's get started importing the necessary libraries
import pandas as pd
%matplotlib notebook
import matplotlib.pyplot as plt

In [14]:
ls

 Volume in drive C is OS
 Volume Serial Number is 0219-1F16

 Directory of C:\Users\mdouk\flatiron\module01\section04\Project EDA with Pandas\dsc-project-eda-with-pandas-online-ds-sp-000

07/28/2020  12:10 PM    <DIR>          .
07/28/2020  12:10 PM    <DIR>          ..
07/23/2020  01:02 PM                66 .gitignore
07/23/2020  01:09 PM    <DIR>          .ipynb_checkpoints
07/23/2020  01:02 PM                96 .learn
07/23/2020  01:02 PM           452,865 ames_test.csv
07/23/2020  01:02 PM           462,137 ames_train.csv
07/23/2020  01:02 PM             1,846 CONTRIBUTING.md
07/23/2020  01:02 PM            13,893 data_description.txt
07/28/2020  12:10 PM           289,792 index.ipynb
07/23/2020  01:02 PM             1,354 LICENSE.md
07/23/2020  01:02 PM             3,039 README.md
07/23/2020  01:02 PM             3,779 submission_example.csv
07/23/2020  01:02 PM            13,875 test.csv
07/23/2020  01:02 PM            28,316 train.csv
              12 File(s)      1,271,058 byte

In [15]:
# Loading the data
df = pd.read_csv('ames_train.csv')
df.head()

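|   | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |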


In [16]:
# Investigate the Data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id               1460 non-null int64
MSSubClass       1460 non-null int64
MSZoning         1460 non-null object
LotFrontage      1201 non-null float64
LotArea          1460 non-null int64
Street           1460 non-null object
Alley            91 non-null object
LotShape         1460 non-null object
LandContour      1460 non-null object
Utilities        1460 non-null object
LotConfig        1460 non-null object
LandSlope        1460 non-null object
Neighborhood     1460 non-null object
Condition1       1460 non-null object
Condition2       1460 non-null object
BldgType         1460 non-null object
HouseStyle       1460 non-null object
OverallQual      1460 non-null int64
OverallCond      1460 non-null int64
YearBuilt        1460 non-null int64
YearRemodAdd     1460 non-null int64
RoofStyle        1460 non-null object
RoofMatl         1460 non-null object
Exterior1st      1460 non-n
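
The `info()` listing shows that some columns have fewer than 1460 non-null entries (for example `LotFrontage` with 1201 and `Alley` with 91). A compact way to rank columns by missingness is sketched below on a hypothetical toy frame; on the real data the same calls would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in two columns
toy = pd.DataFrame({
    'LotFrontage': [65.0, np.nan, 68.0, 60.0],
    'Alley': [np.nan, np.nan, 'Grvl', np.nan],
    'LotArea': [8450, 9600, 11250, 9550],
})

# Count and share of missing values per column
missing = toy.isna().sum()
missing_share = toy.isna().mean()
report = pd.DataFrame({'missing': missing, 'share': missing_share})

# Keep only columns that have gaps, worst first
report = report[report['missing'] > 0].sort_values('missing', ascending=False)
print(report)
```

Columns with a high share of missing values deserve a look in ``data_description.txt`` before they are used in any summary statistic.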

In [17]:
df.describe()

|   | Id | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | ... | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1460.0 | 1460.0 | 1201.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1452.0 | 1460.0 | ... | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 |
| mean | 730.5 | 56.89726 | 70.049958 | 10516.828082 | 6.099315 | 5.575342 | 1971.267808 | 1984.865753 | 103.685262 | 443.639726 | ... | 94.244521 | 46.660274 | 21.95411 | 3.409589 | 15.060959 | 2.758904 | 43.489041 | 6.321918 | 2007.815753 | 180921.19589 |
| std | 421.610009 | 42.300571 | 24.284752 | 9981.264932 | 1.382997 | 1.112799 | 30.202904 | 20.645407 | 181.066207 | 456.098091 | ... | 125.338794 | 66.256028 | 61.119149 | 29.317331 | 55.757415 | 40.177307 | 496.123024 | 2.703626 | 1.328095 | 79442.502883 |
| min | 1.0 | 20.0 | 21.0 | 1300.0 | 1.0 | 1.0 | 1872.0 | 1950.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2006.0 | 34900.0 |
| 25% | 365.75 | 20.0 | 59.0 | 7553.5 | 5.0 | 5.0 | 1954.0 | 1967.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.0 | 2007.0 | 129975.0 |
| 50% | 730.5 | 50.0 | 69.0 | 9478.5 | 6.0 | 5.0 | 1973.0 | 1994.0 | 0.0 | 383.5 | ... | 0.0 | 25.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | 2008.0 | 163000.0 |
| 75% | 1095.25 | 70.0 | 80.0 | 11601.5 | 7.0 | 6.0 | 2000.0 | 2004.0 | 166.0 | 712.25 | ... | 168.0 | 68.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 8.0 | 2009.0 | 214000.0 |
| max | 1460.0 | 190.0 | 313.0 | 215245.0 | 10.0 | 9.0 | 2010.0 | 2010.0 | 1600.0 | 5644.0 | ... | 857.0 | 547.0 | 552.0 | 508.0 | 480.0 | 738.0 | 15500.0 | 12.0 | 2010.0 | 755000.0 |


In [18]:
# Investigating Distributions using scatter_matrix
pd.plotting.scatter_matrix(df[['LotArea', 'PoolArea', 'SalePrice', 'YrSold', 'YearBuilt']], figsize=(10, 10))


In [21]:
# Create a plot that shows the SalePrice distribution
fig, ax = plt.subplots(figsize=(10, 7))
ax.hist(df['SalePrice'], bins='auto')
ax.set_title('Distribution of Sales Prices')
ax.set_xlabel('Sales Price')
ax.set_ylabel('Number of Houses')
ax.axvline(df['SalePrice'].mean(), color='red')
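
The red line marks the mean, which in the `describe()` output (about 180,921) sits above the median (163,000). That gap is typical of a right-skewed distribution, where a few expensive houses pull the mean upward. The sketch below illustrates the effect on hypothetical prices; the values are invented and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical prices with one expensive house pulling the mean upward
prices = pd.Series([100000, 120000, 130000, 150000, 400000], name='SalePrice')

mean_price = prices.mean()
median_price = prices.median()
print(mean_price, median_price)

# Positive skewness indicates a longer right tail
print(prices.skew())
```

On the real histogram, a second call such as `ax.axvline(df['SalePrice'].median(), color='black', linestyle='--')` would show both measures side by side.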


In [23]:
# Create a plot that shows the LotArea Distribution
fig, ax = plt.subplots(figsize=(10, 7))

ax.hist(df['LotArea'], bins='auto');
ax.set_title('Distribution of Lot Sizes')
ax.set_xlabel('Lot Size (sq ft)')
ax.set_ylabel('Number of Houses')
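
The histogram above covers all houses at once. To compare subsets, as the Goals ask, the same plot can be drawn once per subset on shared axes. The following is a minimal sketch on hypothetical toy data; the 1990 cutoff, the bin count, and the `Agg` backend line (only needed when running outside a notebook) are assumptions.

```python
import matplotlib
matplotlib.use('Agg')  # headless backend for running outside a notebook; omit in Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy rows; only the column names come from the dataset
toy = pd.DataFrame({
    'LotArea': [8000, 9500, 11000, 9000, 14000, 12000],
    'YearBuilt': [2003, 1976, 2001, 1915, 2000, 1993],
})
older = toy.loc[toy['YearBuilt'] < 1990, 'LotArea']
newer = toy.loc[toy['YearBuilt'] >= 1990, 'LotArea']

# Shared axes keep both panels on the same scale for a fair comparison
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12, 5), sharex=True, sharey=True)
panels = zip(axes, [older, newer], ['Built before 1990', 'Built 1990 or later'])
for ax, data, label in panels:
    ax.hist(data, bins=5)
    ax.set_title(label)
    ax.set_xlabel('Lot Size')
axes[0].set_ylabel('Number of Houses')
```

Sharing both axes is a deliberate choice here: without it, each panel rescales independently and differences in spread are easy to misread.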


In [30]:
df['OverallCond']

0       5
1       8
2       5
3       5
4       5
       ..
1455    5
1456    6
1457    9
1458    6
1459    6
Name: OverallCond, Length: 1460, dtype: int64

In [32]:
# Create a plot that shows the Distribution of the overall house condition
fig, ax = plt.subplots(figsize=(10, 7))
ax.hist(df['OverallCond'], bins='auto')
ax.set_title('Distribution of Overall Condition of Houses on a Scale 1-10')
ax.set_xlabel('Condition of House')
ax.set_ylabel('Number of Houses')
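
Because `OverallCond` takes only the integer values 1 through 10, automatically chosen histogram bins may not line up with the ratings. One alternative is to count each rating explicitly, sketched here on hypothetical values.

```python
import pandas as pd

# Hypothetical ratings; real ones would come from df['OverallCond']
cond = pd.Series([5, 8, 5, 5, 6, 9, 6, 5], name='OverallCond')

# Count houses per rating and include ratings that never occur
counts = cond.value_counts().reindex(range(1, 11), fill_value=0)
print(counts)
```

Passing the result to `counts.plot(kind='bar', ax=ax)` would then draw exactly one bar per rating.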


In [34]:
# Create a Box Plot for SalePrice
fig, ax = plt.subplots(figsize=(10, 7))
ax.boxplot(df['SalePrice'])
ax.set_ylabel('House Price ($)')
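
By default, matplotlib's `boxplot` extends each whisker to the most extreme data point within 1.5 times the interquartile range (IQR) of the box and draws anything beyond that as an individual point. The arithmetic below reproduces those limits from the `SalePrice` quartiles reported by `df.describe()` earlier in this notebook.

```python
# Quartiles of SalePrice as reported by df.describe()
q1 = 129975.0
q3 = 214000.0

# Interquartile range and the 1.5 * IQR limits used by the default box plot
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
print(iqr, lower_fence, upper_fence)  # prints 84025.0 3937.5 340037.5
```

The minimum sale price (34,900) lies above the lower limit, while the maximum (755,000) lies far above the upper limit, so all flagged points appear on the expensive side.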


In [37]:
# Perform an Exploration of home values by age
df['age'] = df['YrSold'] - df['YearBuilt']
df['decades'] = df.age // 10
to_plot = df.groupby('decades').SalePrice.mean()
fig, ax = plt.subplots(figsize=(10, 8))
to_plot.plot(kind='barh', ax=ax)
plt.ylabel('House Age in Decades')
plt.xlabel('Average Sale Price of Homes')
plt.title('Average Home Values by Home Age')
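
A group mean can be misleading when a decade contains only a handful of houses. Reporting the group size and the median next to the mean is one way to check this. The sketch below uses hypothetical rows with the same `decades` and `SalePrice` column names.

```python
import pandas as pd

# Hypothetical rows; decade 9 contains a single house
toy = pd.DataFrame({
    'decades': [0, 0, 0, 1, 1, 9],
    'SalePrice': [250000, 230000, 240000, 180000, 160000, 300000],
})

# Mean, median, and group size per decade
by_decade = toy.groupby('decades')['SalePrice'].agg(['mean', 'median', 'count'])
print(by_decade)
```

In this toy example the oldest decade has the highest mean, but the `count` column shows that it rests on one sale only.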


## Summary

Congratulations, you've completed your first "free form" exploratory data analysis of a popular dataset!