### 1. Imports

In [1]:
# Import the pandas library - conventionally imported as "pd"
import pandas as pd
import requests

### 2. Load Data

[Kaggle](https://www.kaggle.com/) is an amazing platform for data science, owned by Google since its acquisition in 2017.

Kaggle hosts thousands of datasets. For these tutorials I'll be using the [Ames Housing](https://www.kaggle.com/c/house-prices-advanced-regression-techniques/data) dataset, which can be downloaded from [Kaggle](https://www.kaggle.com/) after creating an account. It can also be downloaded from [amstat.org](https://www.amstat.org/) using the link in the code cell below.

This dataset was compiled by Dean De Cock for use in data science education. It's an incredible alternative for data scientists looking for a modernized and expanded version of the often-cited Boston Housing dataset.

* We use `pd.read_excel` to read the data from an .xls file _(the "old" Excel format)_ straight into a dataframe. Note that pandas needs the `xlrd` package installed to read .xls files.

* We drop the `Order` index column and the `PID` column, because pandas autogenerates an integer index and we won't be needing `PID`.

* We also remove all spaces from the column names. Spaces in column names could cause issues later, so stripping them now is generally helpful.

* The `df.head()` method returns the first five rows of a dataframe, and Jupyter renders the result nicely as HTML.

In [2]:
# Read the .xls file directly from its URL (requires the xlrd package)
df = pd.read_excel('http://jse.amstat.org/v19n3/decock/AmesHousing.xls')

df = df.drop(columns=['PID', 'Order'])

# We don't want spaces in column headers
df.columns = [x.replace(' ', '') for x in df.columns]

df.head()

Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,20,RL,141.0,31770,Pave,,IR1,Lvl,AllPub,Corner,...,0,,,,0,5,2010,WD,Normal,215000
1,20,RH,80.0,11622,Pave,,Reg,Lvl,AllPub,Inside,...,0,,MnPrv,,0,6,2010,WD,Normal,105000
2,20,RL,81.0,14267,Pave,,IR1,Lvl,AllPub,Corner,...,0,,,Gar2,12500,6,2010,WD,Normal,172000
3,20,RL,93.0,11160,Pave,,Reg,Lvl,AllPub,Corner,...,0,,,,0,4,2010,WD,Normal,244000
4,60,RL,74.0,13830,Pave,,IR1,Lvl,AllPub,Inside,...,0,,MnPrv,,0,3,2010,WD,Normal,189900
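
The two clean-up steps can also be tried without downloading anything. The snippet below is a minimal sketch on a toy dataframe: the values are made up, and only the column names mimic the raw Ames file. It uses the vectorized `str.replace` on the column index, which should give the same result as the list comprehension in the cell above.

```python
import pandas as pd

# Toy frame with made-up values; only the column names mimic the raw Ames file.
toy = pd.DataFrame({
    "Order": [1, 2],
    "PID": [1001, 1002],
    "MS SubClass": [20, 60],
    "Lot Area": [9000, 12000],
})

# Drop the two identifier columns.
toy = toy.drop(columns=["Order", "PID"])

# Remove every space from the remaining column names.
toy.columns = toy.columns.str.replace(" ", "", regex=False)

print(list(toy.columns))
```

Either style works; the list comprehension is plain Python, while the `.str` accessor keeps everything inside pandas.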


### 3. Examine Data

On the amstat website there's a _DataDocumentation.txt_ file which contains information about each feature/column in the dataset. We'll probably use this a few times. 

I've printed the first 2000 characters of the documentation file below; we'll come back to it later.

In [3]:
# Download the documentation file; .text gives the response body as a string
req = requests.get('http://jse.amstat.org/v19n3/decock/DataDocumentation.txt').text

print(req[0:2000])

NAME: AmesHousing.txt
TYPE: Population
SIZE: 2930 observations, 82 variables
ARTICLE TITLE: Ames Iowa: Alternative to the Boston Housing Data Set

DESCRIPTIVE ABSTRACT: Data set contains information from the Ames Assessors Office used in computing assessed values for individual residential properties sold in Ames, IA from 2006 to 2010.

SOURCES: 
Ames, Iowa Assessors Office 

VARIABLE DESCRIPTIONS:
Tab characters are used to separate variables in the data file. The data has 82 columns which include 23 nominal, 23 ordinal, 14 discrete, and 20 continuous variables (and 2 additional observation identifiers).

Order (Discrete): Observation number

PID (Nominal): Parcel identification number  - can be used with city web site for parcel review. 

MS SubClass (Nominal): Identifies the type of dwelling involved in the sale.	

       020	1-STORY 1946 & NEWER ALL STYLES
       030	1-STORY 1945 & OLDER
       040	1-STORY W/FINISHED ATTIC ALL AGES
       045	1-1/2 STORY - UNFINISHED ALL AGES
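
Since the documentation is plain text, a single variable's entry can be pulled out of it with a regular expression. The sketch below is only a rough idea: `describe_variable` and `HEADER` are hypothetical names, and the pattern assumes every entry starts with a `Name (Type):` line as in the excerpt above, which may not hold for the whole file. The `sample` string is a shortened copy of that excerpt so the snippet runs on its own.

```python
import re

# Entry headers look like "MS SubClass (Nominal): ..." in the excerpt above.
# This pattern is inferred from that excerpt only (an assumption).
HEADER = re.compile(
    r"^(?P<name>\S[^()\n]*?) \((?:Nominal|Ordinal|Discrete|Continuous)\):",
    re.MULTILINE,
)

def describe_variable(doc_text, name):
    """Return the documentation entry for `name`, or None if it is not found."""
    headers = list(HEADER.finditer(doc_text))
    for i, match in enumerate(headers):
        if match.group("name").strip() == name:
            # The entry runs until the next header, or to the end of the text.
            end = headers[i + 1].start() if i + 1 < len(headers) else len(doc_text)
            return doc_text[match.start():end].strip()
    return None

# Shortened copy of the excerpt printed above.
sample = (
    "Order (Discrete): Observation number\n"
    "\n"
    "PID (Nominal): Parcel identification number\n"
    "\n"
    "MS SubClass (Nominal): Identifies the type of dwelling involved in the sale.\n"
    "\n"
    "       020\t1-STORY 1946 & NEWER ALL STYLES\n"
    "       030\t1-STORY 1945 & OLDER\n"
)

print(describe_variable(sample, "MS SubClass"))
```

Keep in mind that the documentation uses the original names with spaces (`MS SubClass`), while the dataframe columns had their spaces removed (`MSSubClass`).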
   