<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Acquire-the-data" data-toc-modified-id="Acquire-the-data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Acquire the data</a></span></li><li><span><a href="#Explore-the-data" data-toc-modified-id="Explore-the-data-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Explore the data</a></span><ul class="toc-item"><li><span><a href="#Investigate-target-variable" data-toc-modified-id="Investigate-target-variable-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Investigate target variable</a></span></li><li><span><a href="#Investigate-Numeric-Features" data-toc-modified-id="Investigate-Numeric-Features-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Investigate Numeric Features</a></span></li><li><span><a href="#Investigate-non-numeric-Features" data-toc-modified-id="Investigate-non-numeric-Features-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Investigate non-numeric Features</a></span></li></ul></li><li><span><a href="#Transforming-and-engineering-features" data-toc-modified-id="Transforming-and-engineering-features-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Transforming and engineering features</a></span><ul class="toc-item"><li><span><a href="#One-hot-encoding-of-categorical-variables" data-toc-modified-id="One-hot-encoding-of-categorical-variables-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>One-hot encoding of categorical variables</a></span></li><li><span><a href="#Deal-with-null-values" data-toc-modified-id="Deal-with-null-values-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Deal with null values</a></span></li></ul></li><li><span><a href="#Build-Linear-Model" data-toc-modified-id="Build-Linear-Model-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Build Linear Model</a></span><ul class="toc-item"><li><span><a href="#Evaluate-model" data-toc-modified-id="Evaluate-model-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Evaluate 
model</a></span></li></ul></li><li><span><a href="#Make-a-submission" data-toc-modified-id="Make-a-submission-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Make a submission</a></span></li></ul></div>

# House Prices: Advanced Regression Techniques 

We’ll follow these steps to a successful Kaggle Competition submission:

- Acquire the data
- Explore the data
- Engineer and transform the features and the target variable
- Build a model
- Make and submit predictions

## Acquire the data

In [None]:
# Import libraries

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
plt.style.use(style='ggplot')
%matplotlib inline

We will first look at the `train.csv` data. After we’ve trained a model, we’ll make predictions using the `test.csv` data.

In [None]:
# Read in csv
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

In [None]:
# Check size of data
print("Train data shape:", train.shape)
print("Test data shape:", test.shape)

The test data does not include the target column, `SalePrice`.

In [None]:
# Preview data
train.head()

## Explore the data

### Investigate target variable

In [None]:
train['SalePrice'].describe()

The average sale price of a house in our dataset is close to USD 180,000, with the middle half of the values (between the 25th and 75th percentiles) falling within roughly the USD 130,000 to USD 215,000 range.

Next, we’ll check for skewness, which is a measure of the shape of the distribution of values.

In [None]:
print("Skew is:", train['SalePrice'].skew())
plt.hist(train.SalePrice, color='blue')
plt.show()

Notice that the distribution has a longer tail on the right. The distribution is positively skewed.

Now we use `np.log()` to transform `SalePrice`, calculate the skewness a second time, and re-plot the data. A skew value closer to 0 means that the distribution is more symmetric.

In [None]:
target = np.log(train['SalePrice'])
print("Skew is:", target.skew())
plt.hist(target, color='blue')
plt.show()

We can see visually that the data better resembles a normal distribution.
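To see the effect of the log transform in isolation, here is a minimal sketch on synthetic data. The log-normal sample and the names `prices`, `raw_skew` and `log_skew` are assumptions made for illustration; they are not the Kaggle data.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed "prices" drawn from a log-normal distribution.
# These values are made up and only illustrate the shape of the problem.
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=5000))

raw_skew = prices.skew()          # positive: long tail on the right
log_skew = np.log(prices).skew()  # much closer to 0 after the transform

print("Skew before log:", round(raw_skew, 3))
print("Skew after log: ", round(log_skew, 3))
```

The same comparison on `train['SalePrice']` is what the two cells above perform.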

### Investigate Numeric Features

In [None]:
# get numeric features
numeric_features = train.select_dtypes(include=[np.number])

In [None]:
numeric_features.corr()
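The full correlation matrix is large, so it can help to look only at how each feature correlates with the target. The sketch below uses a small made-up frame, `toy`, in place of `numeric_features`; the column names exist in the dataset, but the values are invented.

```python
import pandas as pd

# Toy stand-in for `numeric_features`; the numbers are made up.
toy = pd.DataFrame({
    "OverallQual": [5, 6, 7, 8, 9, 4],
    "GrLivArea":   [1200, 1500, 1700, 2100, 2600, 900],
    "YrSold":      [2008, 2007, 2008, 2006, 2009, 2010],
    "SalePrice":   [140000, 175000, 210000, 260000, 340000, 110000],
})

# Take the SalePrice column of the correlation matrix and sort it,
# so the features most positively correlated with the target come first.
corr_with_target = toy.corr()["SalePrice"].sort_values(ascending=False)
print(corr_with_target)
```

On the real data, the equivalent would be `numeric_features.corr()['SalePrice'].sort_values(ascending=False)`.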

### Investigate non-numeric Features

In [None]:
categoricals = train.select_dtypes(exclude=[np.number])
categoricals.describe()

The `count` row indicates the number of non-null observations, while `unique` counts the number of distinct values. `top` is the most commonly occurring value, and `freq` shows how often that top value occurs.

For many of these features, we might want to use one-hot encoding to make use of the information for modeling.

## Transforming and engineering features

### One-hot encoding of categorical variables

When transforming features, it’s important to remember that any transformations that you’ve applied to the training data before fitting the model must be applied to the test data.

Our model expects that the shape of the features from the train set match those from the test set. This means that any feature engineering that occurred while working on the train data should be applied again on the test set.
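One way the shapes can drift apart is when a category appears in the train set but not in the test set. The sketch below shows one possible way to realign the columns after one-hot encoding. The frames `toy_train` and `toy_test` and their category labels are made up for illustration.

```python
import pandas as pd

# Toy frames; the category labels are invented for this example.
toy_train = pd.DataFrame({"Neighborhood": ["A", "B", "C", "A"]})
toy_test = pd.DataFrame({"Neighborhood": ["A", "B", "B"]})  # "C" never occurs

train_enc = pd.get_dummies(toy_train["Neighborhood"], dtype=int)
test_enc = pd.get_dummies(toy_test["Neighborhood"], dtype=int)

# Give the test frame exactly the training columns, in the same order.
# Categories missing from the test data become all-zero columns.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns))
print(list(test_enc.columns))
```

After the `reindex`, both encoded frames have the same columns, so a model fitted on one can predict on the other.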

Consider the `Street` data, which indicates whether there is `Gravel` or `Paved` road access to the property.

In [None]:
train['Street'].value_counts()

Our model needs numerical data, so we will use one-hot encoding to transform the data into a Boolean column.

In [None]:
# Encode each data set from its own `Street` column
train['enc_street'] = pd.get_dummies(train['Street'], drop_first=True)
test['enc_street'] = pd.get_dummies(test['Street'], drop_first=True)

Let’s try engineering another feature. 

We’ll look at `SaleCondition` by constructing and plotting a pivot table.

In [None]:
condition_pivot = train.pivot_table(index='SaleCondition', values='SalePrice', aggfunc='median')
condition_pivot.plot(kind='bar', color='blue')
plt.xlabel('Sale Condition')
plt.ylabel('Median Sale Price')
plt.xticks(rotation=0)
plt.show()

Notice that `Partial` has a significantly higher median sale price than the others. We will encode this as a new feature. We select all of the houses where `SaleCondition` is equal to `Partial` and assign the value `1`; otherwise we assign `0`.

In [None]:
def encode(x):
    return 1 if x == 'Partial' else 0

train['enc_condition'] = train.SaleCondition.apply(encode)
test['enc_condition'] = test.SaleCondition.apply(encode)
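The same flag can also be built without `apply`, by comparing the whole column at once. This is a small sketch on a made-up Series, `conditions`, showing that both approaches give the same result.

```python
import pandas as pd

# A few example labels; the Series itself is made up for illustration.
conditions = pd.Series(["Normal", "Partial", "Abnorml", "Partial"])

def encode(x):
    return 1 if x == 'Partial' else 0

via_apply = conditions.apply(encode)

# Vectorized alternative: compare the column, then cast True/False to 1/0.
vectorized = (conditions == "Partial").astype(int)

print(via_apply.tolist())
print(vectorized.tolist())
```

Either form works here. The vectorized comparison tends to be faster on large frames, while `apply` is easier to extend to more elaborate rules.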

### Deal with null values

In [None]:
# Get number of Null Values in each column
train.isna().sum().sort_values(ascending = False)[:22]

The documentation can help us understand the missing values. In the case of `PoolQC`, the column refers to pool quality. Pool quality is `NaN` when `PoolArea` is 0, that is, when there is no pool.
We can find a similar relationship between many of the Garage-related columns.

We’ll fill the missing values by interpolation. By default, `DataFrame.interpolate()` fills each gap linearly from the neighboring non-missing values in the same column, which makes this simple.

This is a quick and simple method of dealing with missing values, and might not lead to the best performance of the model on new data. Handling missing values is an important part of the modeling process, where creativity and insight can make a big difference. 
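To make the behavior concrete, here is a small sketch on made-up Series. It compares default linear interpolation with filling by the column mean, and shows that a missing value at the very start is left in place, which is why a `dropna()` after `interpolate()` can still be useful.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, 7.0])

# Default linear interpolation: each gap is filled from its neighbors.
linear = s.interpolate()

# Alternative: replace every gap with the mean of the known values.
mean_filled = s.fillna(s.mean())

# A gap at the start has no earlier neighbor, so it stays NaN by default.
leading = pd.Series([np.nan, 2.0, np.nan, 4.0]).interpolate()

print(linear.tolist())
print(mean_filled.round(3).tolist())
print(leading.tolist())
```

Which strategy is better depends on the column; for unordered rows like house sales, neighboring rows carry no special meaning, so neither should be treated as more than a quick default.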

In [None]:
data = train.select_dtypes(include=[np.number]).interpolate().dropna()

In [None]:
# check
data.isna().any().any()

## Build Linear Model

In [None]:
data.head()

In [None]:
# Take the target from `data` so that X and y always cover the same rows
y = np.log(data['SalePrice'])
X = data.drop(['Id', 'SalePrice'], axis = 1)

In [None]:
# Train-test-split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=.33)

In [None]:
# Create Linear Regression model
from sklearn import linear_model
linreg = linear_model.LinearRegression()

Next, we need to fit the model using `X_train` and `y_train`. The `linreg.fit()` method will fit the linear regression on the features and target variable that we pass.

In [None]:
# Fit the model
model = linreg.fit(X_train, y_train)

### Evaluate model

Kaggle will evaluate our submission using root-mean-squared error (RMSE). We’ll also look at the r-squared value, a measure of how close the data are to the fitted regression line. On the data used to fit the model it takes a value between 0 and 1, with 1 meaning that all of the variance in the target is explained by the features; on held-out data it can be negative when the model does worse than always predicting the mean. In general, a higher r-squared value means a better fit.

The `model.score()` method returns the r-squared value by default.

In [None]:
print("R^2 is: \n", model.score(X_test, y_test))

This means that our features explain approximately 86% of the variance in our target variable.
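For reference, r-squared can be written out by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The helper `r_squared` and the numbers below are assumptions made for illustration; they are not output from our model.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return 1.0 - ss_res / ss_tot

y_true = [12.0, 12.5, 11.8, 12.2]  # made-up log prices
good = [12.1, 12.4, 11.9, 12.2]    # close to the true values
bad = [11.0, 13.5, 13.0, 10.5]     # worse than predicting the mean

r2_good = r_squared(y_true, good)
r2_bad = r_squared(y_true, bad)
print(r2_good, r2_bad)
```

The second set of predictions gives a negative score, which shows that r-squared on held-out data is not bounded below by 0.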

Next, we’ll consider RMSE. To do so, use the model we have built to make predictions on the held-out test split, `X_test`.

In [None]:
y_hat_test = model.predict(X_test)

The `model.predict()` method returns an array of predictions given a set of predictors. Use `model.predict()` after fitting the model.

In [None]:
y_hat_test[:5]

In [None]:
y_test[:5]

In [None]:
test_residuals = y_hat_test - y_test

The `mean_squared_error` function takes two arrays and calculates the mean squared error (MSE); taking its square root gives the RMSE.

In [None]:
# Compute MSE and RMSE for the held-out test split
from sklearn.metrics import mean_squared_error

test_mse = mean_squared_error(y_test, y_hat_test)
test_rmse = np.sqrt(test_mse)  # RMSE is the square root of MSE

test_rmse
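The relationship between MSE and RMSE is easy to check by hand with NumPy. The helper `rmse` and the arrays below are made up for illustration. Because our target is the logarithm of the price, an RMSE computed this way is on the log scale rather than in dollars.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-squared error: square root of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

y_true = np.array([12.0, 12.5, 11.8, 12.2])  # made-up log prices
y_pred = np.array([12.1, 12.4, 11.9, 12.2])  # made-up predictions

mse_value = np.mean((y_true - y_pred) ** 2)
rmse_value = rmse(y_true, y_pred)

print("MSE: ", mse_value)
print("RMSE:", rmse_value)
```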

In [None]:
# Plot
actual_values = y_test
plt.scatter(y_hat_test, actual_values, alpha=.7,
            color='b') #alpha helps to show overlapping data
plt.xlabel('Predicted Price')
plt.ylabel('Actual Price')
plt.title('Linear Regression Model')
plt.show()

## Make a submission

We’ll need to create a `csv` that contains the predicted `SalePrice` for each observation in the `test.csv` dataset.

The first column must contain the ID from the test data.

In [None]:
submission = pd.DataFrame()
submission['Id'] = test['Id']

Now, select the features from the test data for the model as we did above.


In [None]:
# Select numeric features and drop ID, interpolate for missing values
feats = test.select_dtypes(
        include=[np.number]).drop(['Id'], axis=1).interpolate()

In [None]:
# Generate predictions
predictions = model.predict(feats)

Now we’ll transform the predictions to the correct form. Remember that to reverse `log()` we do `exp()`.
So we will apply `np.exp()` to our predictions because we took the logarithm of the target earlier.

In [None]:
# Take exponential to reverse log
final_predictions = np.exp(predictions)
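As a quick sanity check that `np.exp()` undoes `np.log()`, here is a round trip on a few made-up prices (the array `prices` is an assumption for illustration).

```python
import numpy as np

prices = np.array([100000.0, 180000.0, 250000.0])  # made-up prices

log_prices = np.log(prices)     # what the model is trained to predict
recovered = np.exp(log_prices)  # back on the original dollar scale

print(np.allclose(recovered, prices))
```

The recovered values match the originals up to floating-point rounding, so predictions made on the log scale can be converted back to prices the same way.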

In [None]:
# Define SalePrice in our submission DataFrame
submission['SalePrice'] = final_predictions

In [None]:
# Final check
submission.head()

Once we’re confident that we’ve got the data arranged in the proper format, we can export to a `.csv` file as Kaggle expects. We pass `index=False` because pandas would otherwise write the DataFrame’s index as an extra column.

In [None]:
submission.to_csv('submission1.csv', index=False)
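To see what `index=False` changes, the sketch below writes a small made-up frame, `toy_submission`, to in-memory buffers instead of files and compares the header lines. The IDs and prices are invented for illustration.

```python
import io
import pandas as pd

# Made-up rows in the same two-column layout as our submission.
toy_submission = pd.DataFrame({
    "Id": [1461, 1462, 1463],
    "SalePrice": [120000.0, 155000.0, 180000.0],
})

without_index = io.StringIO()
toy_submission.to_csv(without_index, index=False)

with_index = io.StringIO()
toy_submission.to_csv(with_index)  # the index is written by default

header_without = without_index.getvalue().splitlines()[0]
header_with = with_index.getvalue().splitlines()[0]

print(header_without)
print(header_with)
```

With the index included, the file gains an unnamed leading column, which no longer matches the two-column `Id`, `SalePrice` layout.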

We’ve created a file called `submission1.csv` in our working directory that conforms to the correct format. Let's submit!