# Project - EDA with Pandas Using the Ames Housing Data

## Introduction

In this section, you've learned a lot about importing, cleaning up, analyzing (using descriptive statistics), and visualizing data. In this more free-form project, you'll get a chance to practice all of these skills with the Ames Housing dataset, which contains records of residential home sales in Ames, Iowa.

## Objectives

You will be able to:

* Perform a full exploratory data analysis process to gain insight about a dataset 

## Goals

Use your data munging and visualization skills to conduct an exploratory analysis of the dataset below. At a minimum, this should include:

* Loading the data (stored in the file ``ames_train.csv``)
* Using built-in Python functions to explore measures of centrality and dispersion for at least 3 variables
* Creating *meaningful* subsets of the data using selection operations like `.loc`, `.iloc`, or related operations. Explain why you chose each subset, and do this for three possible 2-way splits. State how you expect the measures of centrality and/or dispersion to differ between the subsets of each split.
* Using histograms and scatter plots to check whether you observe differences between the subsets of the data. Make sure to use subplots so it is easy to compare the relationships.
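
As a starting point for the second and third goals, here is a minimal sketch of computing centrality and dispersion and making one two-way split. The `toy` frame and its values are invented for illustration; only the column names come from the dataset, and splitting on `OverallCond > 5` is just one possible choice.

```python
import pandas as pd

# Hypothetical toy rows: column names match the dataset, values are made up
toy = pd.DataFrame({
    "SalePrice": [120000, 150000, 180000, 210000, 400000, 95000],
    "LotArea": [8000, 9500, 11000, 7000, 20000, 6000],
    "OverallCond": [5, 6, 5, 7, 9, 3],
})

# Centrality (mean, median) and dispersion (std, IQR) for three variables
summary = toy.agg(["mean", "median", "std"]).T
summary["iqr"] = toy.quantile(0.75) - toy.quantile(0.25)
print(summary)

# One two-way split with .loc: above-average condition vs. average or worse
good_cond = toy.loc[toy["OverallCond"] > 5]
other_cond = toy.loc[toy["OverallCond"] <= 5]
print(good_cond["SalePrice"].median(), other_cond["SalePrice"].median())
```

With the real data, replace `toy` with the DataFrame loaded from ``ames_train.csv`` and write down what difference you expect before looking at the numbers.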

## Variable Descriptions
Look in ``data_description.txt`` for a full description of all variables.

A preview of some of the columns:

**MSZoning**: Identifies the general zoning classification of the sale.
		
       A    Agriculture
       C    Commercial
       FV   Floating Village Residential
       I    Industrial
       RH   Residential High Density
       RL   Residential Low Density
       RP   Residential Low Density Park
       RM   Residential Medium Density

**OverallCond**: Rates the overall condition of the house

       10   Very Excellent
       9    Excellent
       8    Very Good
       7    Good
       6    Above Average
       5    Average
       4    Below Average
       3    Fair
       2    Poor
       1    Very Poor

**KitchenQual**: Kitchen quality

       Ex   Excellent
       Gd   Good
       TA   Typical/Average
       Fa   Fair
       Po   Poor

**YrSold**: Year Sold (YYYY)

**SalePrice**: Sale price of the house in dollars
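
Codes such as those for `KitchenQual` have a natural order that plain strings do not capture. One way to encode it, sketched here on a hypothetical sample of codes rather than real rows, is an ordered categorical dtype. The name `quality_order` is an assumption introduced for this example.

```python
import pandas as pd

# Hypothetical sample of KitchenQual codes (not real rows from the dataset)
kitchen = pd.Series(["TA", "Gd", "Ex", "Fa", "TA", "Gd"], name="KitchenQual")

# Order the codes from worst to best, following the description above
quality_order = ["Po", "Fa", "TA", "Gd", "Ex"]
kitchen = kitchen.astype(pd.CategoricalDtype(quality_order, ordered=True))

# Ordered categoricals support comparisons and min/max
at_least_good = kitchen >= "Gd"
print(at_least_good.sum())           # number of kitchens rated Gd or Ex
print(kitchen.min(), kitchen.max())  # worst and best rating in the sample
```

A boolean mask like `at_least_good` can then serve as one of the two-way splits.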

In [156]:
# Let's get started by importing the necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib notebook

In [134]:
# Loading the data
df = pd.read_csv('ames_train.csv')

In [135]:
# Investigate the Data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallC
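
The Non-Null Count column shows that some columns, such as `LotFrontage` (1201 of 1460) and `Alley` (91 of 1460), contain missing values. A quick way to tabulate this is sketched below on a small invented frame; only the column names are taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: values are invented, only the column names are real
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, np.nan],
    "Alley": [np.nan, np.nan, "Grvl", np.nan],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Count and share of missing values per column, largest first
missing = toy.isna().sum().sort_values(ascending=False)
missing_share = (missing / len(toy)).round(2)
print(pd.DataFrame({"missing": missing, "share": missing_share}))
```

Knowing which columns are mostly empty helps when deciding which variables to summarize or split on.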

In [136]:
# Investigating distributions using scatter_matrix
# Restrict to numeric columns; recent pandas versions raise an error if corr() sees strings
corrMatrix = df.select_dtypes(include='number').corr()
salePriceCorr = corrMatrix['SalePrice'].sort_values()
print(salePriceCorr.head())
print(salePriceCorr.tail())
# Keep the five variables most positively correlated with SalePrice (SalePrice itself included)
selectedVariables = salePriceCorr.iloc[-5:].index.values
print("Selected Variables: ", selectedVariables)
df_selected = df.loc[:, selectedVariables]
pd.plotting.scatter_matrix(df_selected);

KitchenAbvGr    -0.135907
EnclosedPorch   -0.128578
MSSubClass      -0.084284
OverallCond     -0.077856
YrSold          -0.028923
Name: SalePrice, dtype: float64
GarageArea     0.623431
GarageCars     0.640409
GrLivArea      0.708624
OverallQual    0.790982
SalePrice      1.000000
Name: SalePrice, dtype: float64
Selected Variables:  ['GarageArea' 'GarageCars' 'GrLivArea' 'OverallQual' 'SalePrice']


In [167]:
# Create a plot that shows the SalePrice distribution
plt.hist(df_selected['SalePrice'], bins='auto');
plt.xlabel("SalePrice");
plt.show();

In [173]:
# Create a plot that shows the LotArea Distribution
plt.hist(df['LotArea'], bins='auto');
plt.xlabel("LotArea");
plt.xlim(0,100000)
plt.show();

In [174]:
# Create a plot that shows the Distribution of the overall house condition
plt.hist(df['OverallCond'], bins='auto')
plt.xlabel("OverallCond");

In [175]:
# Create a Box Plot for SalePrice
df['SalePrice'].plot.box()
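
By default, a matplotlib box plot draws whiskers out to 1.5 times the interquartile range (IQR) beyond the quartiles and marks anything further out as an individual point. The same fences can be computed numerically, as in this minimal sketch on invented prices:

```python
import pandas as pd

# Hypothetical prices, invented for illustration
prices = pd.Series(
    [95000, 120000, 150000, 180000, 210000, 230000, 755000],
    name="SalePrice",
)

# Quartiles and interquartile range
q1, q3 = prices.quantile([0.25, 0.75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles, matching the box plot default
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside the fences are the points a box plot would draw individually
outliers = prices[(prices < lower) | (prices > upper)]
print(outliers)
```

Applied to `df['SalePrice']`, this gives a count of the high-priced homes that appear as separate points above the upper whisker.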

In [179]:
# Explore home values by age, using YearBuilt as a proxy for age
df.plot.scatter('YearBuilt', 'SalePrice', color='violet')
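
To compare a relationship like this across a two-way split, as the goals ask, the two subsets can be drawn in side-by-side subplots with shared axes. The sketch below uses invented rows, and the split on `KitchenQual` is only one possible choice; `subsets` and `is_good` are names introduced for this example.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy rows: column names match the dataset, values are made up
toy = pd.DataFrame({
    "YearBuilt": [1950, 1965, 1980, 1995, 2005, 2008],
    "SalePrice": [110000, 125000, 160000, 190000, 260000, 310000],
    "KitchenQual": ["TA", "TA", "Gd", "TA", "Gd", "Ex"],
})

# Two-way split: kitchens rated Good or Excellent vs. all others
is_good = toy["KitchenQual"].isin(["Gd", "Ex"])
subsets = {"Gd/Ex kitchen": toy.loc[is_good], "Other kitchen": toy.loc[~is_good]}

# Shared axes make the two panels directly comparable
fig, axes = plt.subplots(1, 2, sharex=True, sharey=True, figsize=(9, 4))
for ax, (label, subset) in zip(axes, subsets.items()):
    ax.scatter(subset["YearBuilt"], subset["SalePrice"])
    ax.set_title(label)
    ax.set_xlabel("YearBuilt")
axes[0].set_ylabel("SalePrice")
fig.tight_layout()
```

The same layout works for histograms by swapping `ax.scatter(...)` for `ax.hist(subset["SalePrice"])`.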

## Summary

Congratulations, you've completed your first "free form" exploratory data analysis of a popular dataset!