#### Chapter 4

# Housing Prices

#### One Hot Encoding
The book covers one way of performing one hot encoding. Here is another. First let's create a DataFrame. 

In [3]:
import pandas as pd
quad = pd.read_csv('https://raw.githubusercontent.com/zacharski/machine-learning/master/data/quad.csv')
quad = quad.set_index('Day')
quad

Day,Outlook,Temperature,Humidity,Wind,Fly Quad?
1,Sunny,Hot,High,Weak,No
2,Sunny,Hot,High,Strong,No
3,Overcast,Hot,High,Weak,Yes
4,Rain,Mild,High,Weak,Yes
5,Rain,Cool,Normal,Weak,Yes
6,Rain,Cool,Normal,Strong,No
7,Overcast,Cool,Normal,Strong,Yes
8,Sunny,Mild,High,Weak,No
9,Sunny,Cool,Normal,Weak,Yes
10,Rain,Mild,Normal,Weak,Yes


Here is how to one hot encode the Outlook column:


In [5]:
# first one hot encode the column
one_hot = pd.get_dummies(quad['Outlook'])
# drop the original Outlook column
quad = quad.drop('Outlook', axis=1)
# join the one hot columns to the quad dataframe
quad = quad.join(one_hot)
quad

Day,Temperature,Humidity,Wind,Fly Quad?,Overcast,Rain,Sunny
1,Hot,High,Weak,No,0,0,1
2,Hot,High,Strong,No,0,0,1
3,Hot,High,Weak,Yes,1,0,0
4,Mild,High,Weak,Yes,0,1,0
5,Cool,Normal,Weak,Yes,0,1,0
6,Cool,Normal,Strong,No,0,1,0
7,Cool,Normal,Strong,Yes,1,0,0
8,Mild,High,Weak,No,0,0,1
9,Cool,Normal,Weak,Yes,0,0,1
10,Mild,Normal,Weak,Yes,0,1,0
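
The three steps above (encode, drop, join) can also be collapsed into a single `pd.get_dummies` call by passing the whole DataFrame with `columns=`. The sketch below is only an illustration: `quad_small` is a hypothetical stand-in built from the first four rows of the table rather than the downloaded file, and `dtype=int` is passed so the output shows 0/1 instead of booleans on newer pandas versions.

```python
import pandas as pd

# Hypothetical stand-in for the quad data: first four rows, two columns only
quad_small = pd.DataFrame({
    'Day': [1, 2, 3, 4],
    'Outlook': ['Sunny', 'Sunny', 'Overcast', 'Rain'],
    'Wind': ['Weak', 'Strong', 'Weak', 'Weak'],
}).set_index('Day')

# columns= picks the column to encode; the original column is dropped automatically.
# An empty prefix and separator keep the bare category names (Overcast, Rain, Sunny).
encoded = pd.get_dummies(quad_small, columns=['Outlook'],
                         prefix='', prefix_sep='', dtype=int)
print(encoded)
```

Leaving `prefix` at its default would instead produce names such as `Outlook_Sunny`, which is often safer when several columns share category values.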


#### Wonkiness
Sometimes the original datafile has the same type of data in multiple columns. For example...

Title | Genre 1 | Genre 2
 :--: | :---: | :---: 
 Mission: Impossible - Fallout | Action | Drama
 Mama Mia: Here We Go Again | Comedy | Musical
 Ant-Man and The Wasp | Action | Comedy
 BlacKkKlansman | Drama | Comedy
 
 
 When we one-hot encode this we get something like
 
 Title | Genre1 Action | Genre1 Comedy | Genre1 Drama | Genre2 Drama | Genre2 Musical | Genre2 Comedy
  :--: | :--: | :--: | :--: | :--: | :--: | :--: 
  Mission: Impossible - Fallout | 1 | 0 | 0 | 1 | 0 | 0
  Mama Mia: Here We Go Again  | 0 | 1 | 0 | 0 | 1 | 0
  Ant-Man and The Wasp | 1 | 0 | 0 | 0 | 0 | 1
  BlacKkKlansman | 0 | 0 | 1 | 0 | 0 | 1
  
  But this probably isn't what we want. Instead, this would be a better representation:
  
  Title | Action | Comedy | Drama | Musical
  :---: | :---: | :---: | :---: | :---:
  Mission: Impossible - Fallout | 1 | 0 | 1 | 0
  Mama Mia: Here We Go Again  | 0 | 1 | 0 | 1
  Ant-Man and The Wasp | 1 | 1 | 0 | 0
  BlacKkKlansman | 0 | 1 | 1 | 0
  
  Let's see how we might do this in code:
  

In [8]:
df   = pd.DataFrame({'Title': ['Mission: Impossible - Fallout', 'Mama Mia: Here We Go Again', 
                               'Ant-Man and The Wasp', 'BlacKkKlansman' ],
                    'Genre1': ['Action', 'Comedy', 'Action', 'Drama'],
                    'Genre2': ['Drama', 'Musical', 'Comedy', 'Comedy']})
df

,Title,Genre1,Genre2
0,Mission: Impossible - Fallout,Action,Drama
1,Mama Mia: Here We Go Again,Comedy,Musical
2,Ant-Man and The Wasp,Action,Comedy
3,BlacKkKlansman,Drama,Comedy


In [9]:
one_hot_1 = pd.get_dummies(df['Genre1'])
one_hot_2 = pd.get_dummies(df['Genre2'])

In [28]:
# find the column names that the two one hot frames share
s1 = set(one_hot_1.columns.values)
s2 = set(one_hot_2.columns.values)
intersect = s1 & s2
only_s1 = s1 - intersect
only_s2 = s2 - intersect
# logical OR the shared columns, so a genre listed in either column is marked once
logical_or = one_hot_1[list(intersect)] | one_hot_2[list(intersect)]
# then combine the unshared columns with the OR-ed shared columns
combined = pd.concat([one_hot_1[list(only_s1)], logical_or, one_hot_2[list(only_s2)]], axis=1)
combined

# now drop the two original columns and add the one hot encoded columns
df = df.drop(['Genre1', 'Genre2'], axis=1)
df = df.join(combined)
df

,Title,Action,Drama,Comedy,Musical
0,Mission: Impossible - Fallout,1,1,0,0
1,Mama Mia: Here We Go Again,0,0,1,1
2,Ant-Man and The Wasp,1,0,1,0
3,BlacKkKlansman,0,1,1,0
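
The set-based approach above handles two columns. As a rough sketch of one way to generalize it to any number of columns, each one hot frame can be reindexed to the full list of categories and the frames summed, with the result clipped to 1 so a genre that appears in more than one column still counts once. The names `genre_columns`, `dummies`, `all_genres`, and `result` are introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({'Title': ['Mission: Impossible - Fallout', 'Mama Mia: Here We Go Again',
                             'Ant-Man and The Wasp', 'BlacKkKlansman'],
                   'Genre1': ['Action', 'Comedy', 'Action', 'Drama'],
                   'Genre2': ['Drama', 'Musical', 'Comedy', 'Comedy']})

genre_columns = ['Genre1', 'Genre2']

# one hot encode each genre column separately (dtype=int gives 0/1 rather than booleans)
dummies = [pd.get_dummies(df[col], dtype=int) for col in genre_columns]

# every category that occurs in any of the columns, in alphabetical order
all_genres = sorted(set().union(*(d.columns for d in dummies)))

# align each frame to the full category list, add them up, and cap the counts at 1
combined = sum(d.reindex(columns=all_genres, fill_value=0) for d in dummies).clip(upper=1)

# replace the original genre columns with the combined encoding
result = df.drop(columns=genre_columns).join(combined)
print(result)
```

Because `all_genres` is sorted, the column order is deterministic, whereas building column lists from Python sets can give a different order from run to run.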



In [26]:
df.join(combined)

,Title,Genre1,Genre2,Action,Drama,Comedy,Musical
0,Mission: Impossible - Fallout,Action,Drama,1,1,0,0
1,Mama Mia: Here We Go Again,Comedy,Musical,0,0,1,1
2,Ant-Man and The Wasp,Action,Comedy,1,0,1,0
3,BlacKkKlansman,Drama,Comedy,0,1,1,0


# The task: Predict Housing Prices
Your task is to create a regression model that predicts house prices. The data and a description of the columns are available at `https://github.com/zacharski/machine-learning/tree/master/data/housePrices`. You can load the data into a Pandas DataFrame with:

In [30]:
df = pd.read_csv('https://raw.githubusercontent.com/zacharski/machine-learning/master/data/housePrices/train.csv')

Minimally, your model should be trained on the following columns:

In [32]:
# Bedroom and Kitchen appear in the CSV header as BedroomAbvGr and KitchenAbvGr
numericColumns = ['LotFrontage', 'LotArea', 'OverallQual', 'OverallCond', '1stFlrSF', '2ndFlrSF', 'GrLivArea',
                 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr']
categoryColumns = ['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 
                   'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'BldgType', 
                   'HouseStyle', 'RoofStyle', 'RoofMatl']

# Using multicolumns is optional
multicolumns = [['Condition1', 'Condition2'], ['Exterior1st', 'Exterior2nd']]

You are free to use more columns than these. Also, you may need to process some of the columns.
Here are the requirements:

### 1. You are to compare Lasso Regression and Ridge Regression
### 2. You should use 10-fold cross validation and score using negative mean squared error.
A description of this is in the book, but here is an example:


In [None]:
from sklearn.model_selection import cross_val_score  # lasso_reg, X, and y must already be defined
scores = cross_val_score(lasso_reg, X, y, scoring="neg_mean_squared_error", cv=10)
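
To make requirements 1 and 2 concrete, here is a minimal sketch of comparing the two models with 10-fold cross validation. It runs on synthetic data from `make_regression`, not on the housing data, and the `alpha=1.0` values are just the scikit-learn defaults used as placeholders, so the printed numbers say nothing about house prices.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the prepared housing features and prices
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

results = {}
for name, model in [('lasso', Lasso(alpha=1.0)), ('ridge', Ridge(alpha=1.0))]:
    # one negative mean squared error per fold, so 10 numbers in total
    scores = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=10)
    results[name] = scores.mean()
    print(name, results[name])
```

Scores are negated errors, so the model whose mean is closer to zero is the better one.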

### 3. Drop any data rows that contain NaN in a column.
Once you do this you should have around 1200 rows.
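
One possible shape for the preparation step is sketched below on a tiny made-up frame: select the columns, drop rows with NaN, then one hot encode the category columns. The values in `toy` are invented for illustration, and `SalePrice` as the target column name is an assumption about the data file, so check it against the column description before relying on it.

```python
import numpy as np
import pandas as pd

# Made-up toy rows; the real data comes from train.csv
toy = pd.DataFrame({
    'LotFrontage': [65.0, np.nan, 68.0, 60.0],
    'LotArea': [8450, 9600, 11250, 9550],
    'Street': ['Pave', 'Pave', 'Grvl', 'Pave'],
    'SalePrice': [208500, 181500, 223500, 140000],  # assumed target column name
})
numeric = ['LotFrontage', 'LotArea']
category = ['Street']

# keep only the columns of interest, then drop every row that has a NaN in any of them
subset = toy[numeric + category + ['SalePrice']].dropna()

# one hot encode the category columns; numeric columns pass through unchanged
X = pd.get_dummies(subset[numeric + category], columns=category, dtype=int)
y = subset['SalePrice']
print(X.shape, y.shape)
```

Dropping rows after selecting columns matters: a NaN in a column you never use would otherwise remove a row unnecessarily.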

# Bonus
You will get at least a 10% bonus if the mean of the 10 negative mean squared error scores is greater than (that is, closer to zero than) -1,040,000.