### Exploratory Data Analysis 



The HOUSES dataset contains a collection of recent real estate listings in and around San Luis Obispo county. The dataset is provided as a CSV file and contains the following fields:

1. MLS: Multiple listing service number for the house (unique ID).
2. Location: city/town where the house is located. Most locations are in San Luis Obispo county and northern Santa Barbara county (Santa Maria-Orcutt, Lompoc, Guadelupe, Los Alamos), but there are some out-of-area locations as well.
3. Price: the most recent listing price of the house (in dollars).
4. Bedrooms: number of bedrooms.
5. Bathrooms: number of bathrooms.
6. Size: size of the house in square feet.
7. Price/SQ.Ft: price of the house per square foot.
8. Status: type of sale. Three types are represented in the dataset: Short Sale, Foreclosure and Regular.

Let's import the libraries that we will be using later.

In [None]:
import numpy as np  # conventional alias; avoids a wildcard import that shadows built-ins such as sum, min and max
import pandas as pd 

Let's load the dataset into a pandas dataframe and have a look at the headers.

In [None]:
df = pd.read_csv('data.csv', sep=',', on_bad_lines='skip')  # read file as a dataframe, skipping malformed rows (pandas >= 1.3)
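
In pandas 1.3 and later, tolerant parsing is controlled by `on_bad_lines` (the older `error_bad_lines` flag was deprecated in 1.3 and removed in 2.0). The sketch below shows the effect on a small in-memory CSV; the CSV text and the name `demo` are made up for illustration and are not part of the HOUSES data.

```python
import io

import pandas as pd

# Hypothetical CSV text: the second data row has one field too many.
raw = io.StringIO(
    "MLS,Price,Bedrooms\n"
    "1,100000,3\n"
    "2,200000,4,EXTRA\n"
    "3,300000,2\n"
)

# on_bad_lines="skip" drops rows with too many fields instead of raising an error.
demo = pd.read_csv(raw, on_bad_lines="skip")
print(demo)
```

Silently skipping rows can hide data problems, so it is worth comparing the number of rows read against the number of lines in the file.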

Let's take a look at the first 2 rows of the dataframe.

In [None]:
df.head(2)

Examine the provided columns. Does the datatype that pandas inferred for each column make sense? Include your code and/or comments below.

In [None]:
#TODO -- Done
print(df.MLS.head(2))
print(df.Location.head(2))
print(df.Price.head(2))
print(df.Bedrooms.head(2))
print(df.Bathrooms.head(2))
print(df.Size.head(2))
print(df['Price/SQ.Ft'].head(2))
print(df.Status.head(2))
# Yes, the inferred datatype of each column makes sense: whole-number columns have dtype int64, decimal columns have float64, and string columns have dtype object.
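
A more compact way to review inferred types is the `dtypes` attribute, which lists every column at once. This is a minimal sketch on a hypothetical two-row frame named `sample`; the values are invented and only mimic the shape of the HOUSES columns.

```python
import pandas as pd

# Hypothetical rows; the values are made up for illustration.
sample = pd.DataFrame({
    "MLS": [1001, 1002],
    "Price": [250000.0, 410000.0],
    "Bedrooms": [3, 4],
    "Status": ["Regular", "Short Sale"],
})

# One line per column with its inferred dtype.
print(sample.dtypes)

# Names of the columns pandas treats as numeric.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
print(numeric_cols)
```

On the real dataframe the same two calls would be `df.dtypes` and `df.select_dtypes(include="number")`.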

Next, let's look at a specific column or feature in the dataframe.
Based on the provided dataset, what are the distinct numbers of bedrooms and bathrooms? Hint: Use the unique function https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.unique.html

In [None]:
# TODO -- Done
print("Bedrooms", pd.unique(df.Bedrooms))
print("Bathrooms", pd.unique(df.Bathrooms))
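
Besides listing the distinct values, it can help to see how often each one occurs. The sketch below uses `value_counts` on a made-up `bedrooms` series (not the real column) and sorts the distinct values for readability.

```python
import pandas as pd

# Hypothetical bedroom counts for six listings.
bedrooms = pd.Series([3, 4, 3, 2, 4, 3], name="Bedrooms")

# Distinct values, sorted (pd.unique returns them in order of first appearance).
distinct = sorted(pd.unique(bedrooms))
print(distinct)

# Number of listings per distinct value, ordered by the value itself.
counts = bedrooms.value_counts().sort_index()
print(counts)
```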

What if we want to drop a column from the dataframe, like the 'Location' column? Hint: Use the drop function https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html

In [None]:
# TODO -- Done
df = df.drop(columns=['Location'])
df.head()

Let's rename the first column. 

Hint: A Google search for 'python pandas dataframe rename' points you at this documentation 
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html

In [None]:
print("Before rename", df.columns)
#TODO -- Done

df = df.rename(columns={"MLS": "notMLS"})
print("After rename", df.columns)
df.head(2)

What is the max, min, mean/avg, and standard deviation of the column 'Bedrooms'?

In [None]:
# TODO -- Done
print("Max", df.Bedrooms.max())
print("Min", df.Bedrooms.min())
print("Mean", df.Bedrooms.mean())
print("Average", df.Bedrooms.sum()/df.Bedrooms.size)
print("Standard Deviation", df.Bedrooms.std())
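
The same statistics can be obtained in one call with `describe`. The sketch below runs on a made-up series and also illustrates two details worth knowing: `std` defaults to the sample standard deviation (`ddof=1`), and `sum() / size` differs from `mean()` when values are missing, because `size` counts NaN entries while `mean` skips them.

```python
import pandas as pd

# Hypothetical bedroom counts.
bedrooms = pd.Series([2, 3, 3, 4], name="Bedrooms")

# count, mean, std, min, quartiles and max in a single call.
summary = bedrooms.describe()
print(summary)

sample_std = bedrooms.std()            # divides by n - 1 (pandas default)
population_std = bedrooms.std(ddof=0)  # divides by n

# With a missing value, mean() skips NaN but size still counts the row.
with_gap = pd.Series([2.0, 4.0, None])
mean_skipping_nan = with_gap.mean()
sum_over_size = with_gap.sum() / with_gap.size
print(mean_skipping_nan, sum_over_size)
```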

Plot the distribution of 'Price/SQ.Ft' using matplotlib.

In [None]:
import matplotlib.pyplot as plt

# plot histogram 
n, bins, patches = plt.hist(df['Price/SQ.Ft'], 10, facecolor='green')
plt.show()
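
The counts and bin edges that `plt.hist` returns can also be computed without drawing anything, using `numpy.histogram`. This is a sketch on invented values (not the real 'Price/SQ.Ft' column), with 4 bins chosen only to keep the arithmetic easy to follow.

```python
import numpy as np

# Hypothetical price-per-square-foot values.
values = np.array([100, 120, 150, 200, 210, 300, 310, 320, 400, 500])

# Split the range [100, 500] into 4 equal-width bins and count values per bin.
counts, edges = np.histogram(values, bins=4)
print(counts)  # number of values in each bin
print(edges)   # bin boundaries; the last bin includes its right edge
```

Inspecting the raw counts is a quick way to check whether the chosen number of bins hides or exaggerates features of the distribution.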

One of the best ways to inspect data is to visualize it. One way to do this is with a scatter plot, which puts one feature along the x-axis and another along the y-axis, and draws a dot for each data point.

Since it's difficult to visualize more than 2 or 3 features at once, one possibility is to use a pair plot that looks at all possible pairs of features. The pair plot shows the interaction of each pair of features in order to reveal any correlation between them.

In [None]:
# import the scatter_matrix functionality (it lives in pandas.plotting in current pandas)
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt

print(df.shape)
x = df.iloc[:, [1, 2, 3, 4, 5]]  # extract only a subset of columns from the dataframe (by position)
y = x.dropna(thresh=5)  # keep rows with at least 5 non-NaN values, i.e. drop rows with any NaN in these 5 columns
a = scatter_matrix(y, alpha=0.05, figsize=(5, 5), diagonal='hist')  # plot the cleaned subset
plt.show()
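
A numeric companion to the pair plot is the pairwise correlation matrix from `DataFrame.corr`, which computes Pearson correlation by default. The sketch below uses a hypothetical four-row frame in which `Size` is an exact linear function of `Price`; the values are invented for illustration.

```python
import pandas as pd

# Hypothetical listings: Size grows linearly with Price, Bedrooms shrinks.
sample = pd.DataFrame({
    "Price": [200000, 300000, 400000, 500000],
    "Size": [1000, 1500, 2000, 2500],
    "Bedrooms": [4, 3, 3, 2],
})

# Pearson correlation between every pair of numeric columns.
corr = sample.corr()
print(corr.round(3))
```

Values near 1 or -1 indicate a strong linear relationship; values near 0 indicate little linear relationship, although a non-linear one may still show up in the scatter plots.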

In [None]:
# Let's plot the Price vs Size of the homes

fig=plt.figure()
plt.scatter(df.Price, df.Size)
axis = fig.gca() #get current axis
axis.set_title('Price vs Size')
axis.set_xlabel('Price')
axis.set_ylabel('Size')
fig.canvas.draw()

What do the visualizations and statistics we have observed tell you so far? Are there any other interesting statistics or visualizations you think might be helpful? Include your comments and code below.

In [None]:
# TODO

## Categorical Encoding
If we have categorical or continuous variables that we would like to encode as discrete integer values (like 0, 1, 2, ...), we can use several tricks in pandas to do this.

In [None]:
# Approach 1 - Pandas makes it easy to directly replace the text values with their numeric equivalents by using replace.

newValues = {"Status": {"Foreclosure": 1, "Short Sale": 2, "Regular": 3}}
df2 = df.replace(newValues, inplace=False)  # returns a new dataframe; df itself is unchanged
df2.head()

In [None]:
# Approach 2 - Another approach to encoding categorical values is to use a technique called label encoding.
# Label encoding is simply converting each value in a column to a number.

# One trick you can use in pandas is to convert a column to a category, then use those category 
# values for your label encoding. 

df["Status"] = df["Status"].astype('category')
df.dtypes

# Then you can assign the encoded variable to a new column using the cat.codes accessor.
df["Status_cat"] = df["Status"].cat.codes
df.head()
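
It helps to know which integer each label receives. For string labels converted with `astype('category')`, the categories are sorted, and `cat.codes` gives each value its position in that sorted list. The sketch below shows this on a made-up `status` series.

```python
import pandas as pd

# Hypothetical status values.
status = pd.Series(
    ["Regular", "Short Sale", "Foreclosure", "Regular"]
).astype("category")

# Code -> label lookup, taken from the sorted category list.
mapping = dict(enumerate(status.cat.categories))
print(mapping)

# Integer code assigned to each row.
codes = status.cat.codes.tolist()
print(codes)
```

Note that this numbering is alphabetical, so it differs from the hand-picked mapping used in Approach 1.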

In [None]:
"""Approach 3 - Label encoding has the advantage that it is straightforward but it has the 
   disadvantage that the numeric values can be “misinterpreted” by the algorithms. For example, 
   the value of 1 is obviously less than the value of 3 but does that really correspond to the data set in real life?
   For example, is "Foreclosure" =1 closer to "Short Sale" =2 compared to "Regular" =3?

   A common alternative approach is called one-hot encoding. The basic strategy is to convert each category value 
   into a new column and assign a 1 or 0 (True/False) value to that column. This has the benefit of not weighting 
   a value improperly, but it does have the downside of adding more columns to the data set.

   Pandas supports this feature using get_dummies. This function is named this way because it creates 
   dummy/indicator variables (aka 1 or 0)."""

pd.get_dummies(df, columns=["Status"], prefix=["new"]).head()

# basically, it creates 3 new columns (one for each unique value in the column) with the prefix "new_"
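
To see the resulting layout concretely, here is a sketch of `get_dummies` on a hypothetical three-row frame (invented values). Each row ends up with exactly one indicator column set.

```python
import pandas as pd

# Hypothetical listings, one per status type.
sample = pd.DataFrame({
    "Price": [250000, 410000, 180000],
    "Status": ["Regular", "Short Sale", "Foreclosure"],
})

# Replace Status with one indicator column per distinct value.
encoded = pd.get_dummies(sample, columns=["Status"], prefix=["new"])
print(encoded.columns.tolist())

# Every row has exactly one indicator switched on.
row_totals = encoded.filter(like="new_").sum(axis=1).tolist()
print(row_totals)
```

Depending on the pandas version, the indicator columns may be stored as booleans or as small integers; either way they sum to 1 per row.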

# Submission Instructions






Upload the .ipynb notebook to iLearn.

Have the TA check your lab to obtain credit.