<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Assignment-#1:-Regression" data-toc-modified-id="Assignment-#1:-Regression-0.1">Assignment #1: Regression</a></span></li><li><span><a href="#Learning-Outcomes" data-toc-modified-id="Learning-Outcomes-0.2">Learning Outcomes</a></span></li></ul></li><li><span><a href="#Data-Science-Workflow:" data-toc-modified-id="Data-Science-Workflow:-1">Data Science Workflow:</a></span><ul class="toc-item"><li><span><a href="#1)-Ask:-" data-toc-modified-id="1)-Ask:--1.1">1) Ask: </a></span></li><li><span><a href="#2)-Acquire" data-toc-modified-id="2)-Acquire-1.2">2) Acquire</a></span></li><li><span><a href="#3)-Process" data-toc-modified-id="3)-Process-1.3">3) Process</a></span></li><li><span><a href="#4)-Model" data-toc-modified-id="4)-Model-1.4">4) Model</a></span></li><li><span><a href="#Model-Evaluation" data-toc-modified-id="Model-Evaluation-1.5">Model Evaluation</a></span></li><li><span><a href="#5)-Deliver" data-toc-modified-id="5)-Deliver-1.6">5) Deliver</a></span></li><li><span><a href="#Helpful-Hints" data-toc-modified-id="Helpful-Hints-1.7">Helpful Hints</a></span></li></ul></li></ul></div>

Assignment #1: Regression
----

<center><img src="https://i.imgur.com/s888wFw.png" width="75%"/></center>

Learning Outcomes
------

__By the end of this assignment, you should be able to__:

- Apply the Data Science workflow to a real-world problem.
- Fit a regression model with scikit-learn.

Data Science Workflow:
======

1. Ask
2. Acquire
3. Process
4. Model
5. Deliver 

1) Ask: 
----

Can a linear regression model learn to predict the price of Airbnb rentals in San Francisco?

2) Acquire
----

The data was sourced from [Airbnb](https://www.airbnb.com/) and converted to a .csv file for you.

In [1]:
reset -fs

In [2]:
import numpy as np
import pandas as pd

In [3]:
path = "../../data/regression/"

# Load train data - train data has potential features and price as target
data_train_raw = pd.read_csv(path+"train.csv", low_memory=False)
assert "price" in data_train_raw.columns

target_raw = data_train_raw.pop('price')
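# Convert `price` from a string with dollar signs and commas to a float.
# regex=False treats '$' as a literal character, not as the end-of-string anchor.
y_train = np.array(target_raw.str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(float))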

# Keep track of the number of instances
n_rows_input = data_train_raw.shape[0]

3) Process
-----

In [None]:
def process_data(data):
    """Feature engineering & feature selection
    The input parameter and return value have the same name. All processing should happen in place.
    """

    # YOUR CODE HERE
    raise NotImplementedError()
    
    return data

In [None]:
"""
2 points
Test code for the 'process_data' function. 
This cell should NOT give any errors when it is run.
"""

data_train = process_data(data_train_raw)

# Double check the type is still a pd.DataFrame
assert type(data_train) == pd.core.frame.DataFrame
# Double check that no rows were dropped. Dropping rows will break test data performance.
assert data_train.shape[0] == n_rows_input 

In [None]:
# Convert from pandas.DataFrame (tabular) to numpy.array (matrix)
X_train = np.array(data_train)
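
As a rough illustration of the constraints in this section (processing happens in place, no rows are dropped, and the result converts cleanly to a numeric matrix), a minimal sketch on a toy frame might look like the following. The function name `process_toy_data` and the columns `bedrooms` and `room_type` are hypothetical, chosen only for illustration; they are not claimed to match the assignment data.

```python
import numpy as np
import pandas as pd


def process_toy_data(data):
    """Toy in-place processing: impute instead of dropping rows."""
    # Fill missing numeric values with the column median so no row is lost
    data["bedrooms"] = data["bedrooms"].fillna(data["bedrooms"].median())

    # Encode a (hypothetical) categorical column as a 0/1 indicator
    data["is_entire_home"] = (data["room_type"] == "Entire home/apt").astype(int)

    # Remove the non-numeric column in place so the frame converts to a matrix
    data.drop(columns=["room_type"], inplace=True)
    return data


toy = pd.DataFrame({
    "bedrooms": [1.0, np.nan, 3.0],
    "room_type": ["Entire home/apt", "Private room", "Entire home/apt"],
})
n_rows_before = toy.shape[0]

toy_processed = process_toy_data(toy)
X_toy = np.array(toy_processed)  # all columns are numeric at this point
```

Imputing rather than dropping matters here because the test set must yield one prediction per row.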

4) Model
-----

Only use scikit-learn's [Generalized Linear Models (GLM)](https://scikit-learn.org/stable/modules/linear_model.html). No other models are allowed.

No automatic hyperparameter search is allowed in the final submitted code, so no model with a `CV` suffix (for example, `RidgeCV`) may be used.
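
To make the restriction concrete, here is a small sketch on synthetic data. It fits a regularized estimator from `sklearn.linear_model` with a hand-picked hyperparameter and checks the class name for a `CV` suffix. `Ridge` with `alpha=1.0` is only an example choice, not a recommendation, and the helper `has_cv_suffix` is a hypothetical name introduced for illustration.

```python
import numpy as np
from sklearn import linear_model

# Synthetic data, made up for this example
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + 10.0

# Allowed: a linear model whose hyperparameter is set by hand
allowed = linear_model.Ridge(alpha=1.0)
allowed.fit(X_demo, y_demo)


def has_cv_suffix(model):
    """Return True when the estimator's class name ends in 'CV'."""
    return type(model).__name__.endswith("CV")


print(has_cv_suffix(allowed))                 # hand-tuned model
print(has_cv_suffix(linear_model.RidgeCV()))  # searches over alphas internally
```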

In [None]:
from sklearn import linear_model

In [None]:
# Fit simple LR
lm = linear_model.LinearRegression() # TODO: Replace with your choice of algorithm and hyperparameters 
lm.fit(X_train, y_train) # Train model

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
"""
1 point
Test for valid model type.
This cell should NOT give any errors when it is run.
"""

assert str(type(lm)).split(".")[1] == 'linear_model'
assert str(type(lm)).split(".")[-1].count('CV') == 0

Model Evaluation
------

__Root Mean Squared Logarithmic Error__

$$ \text{RMSLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\log(p_i + 1) - \log(a_i + 1)\right)^2}$$

$p_i$ = predicted price for instance $i$

$a_i$ = actual price for instance $i$

$n$ = number of instances
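
A small worked example may help make the formula concrete. The numbers below are made up. The clipping step at the end is an assumption about one possible guard, not part of the assignment: a linear model can output negative prices, and $\log(p_i + 1)$ is undefined for $p_i \le -1$.

```python
import numpy as np

actual = np.array([100.0, 200.0, 50.0])
predicted = np.array([110.0, 180.0, 50.0])

# np.log1p(x) computes log(x + 1)
log_diff = np.log1p(predicted) - np.log1p(actual)
rmsle_demo = np.sqrt(np.mean(log_diff ** 2))
print(f"{rmsle_demo:.4f} RMSLE on the toy data")

# A perfect prediction gives an error of exactly zero
rmsle_perfect = np.sqrt(np.mean((np.log1p(actual) - np.log1p(actual)) ** 2))

# One possible guard against negative predictions: clip at zero first
raw_predictions = np.array([-20.0, 180.0, 50.0])
clipped = np.clip(raw_predictions, 0.0, None)
```

Because the metric works on log differences, it penalizes relative errors: a prediction that is 10% off counts about the same for a cheap listing as for an expensive one.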

In [None]:
def rmlse(target_true, target_predicted):
    "Root Mean Squared Logarithmic Error"
    
    assert len(target_true) == len(target_predicted), "True and predicted targets need to be the same length"

    log_diff = np.log(target_predicted + 1) - np.log(target_true + 1)
    return np.sqrt(np.mean(np.power(log_diff, 2)))

In [None]:
rmlse_value = rmlse(y_train, lm.predict(X_train))
print(f"{rmlse_value:.4f} rmlse on training set")

__Median Absolute Error__

$$\text{MedAE} = \text{median}(\mid p_1 - a_1 \mid, \ldots, \mid p_n - a_n \mid)$$
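
As a quick check of the definition, the sketch below computes the median absolute error by hand on made-up numbers and compares it with scikit-learn's `median_absolute_error`. The mean absolute error is included only to show how differently the two metrics react to a single large miss.

```python
import numpy as np
from sklearn import metrics

actual = np.array([100.0, 200.0, 50.0, 1000.0])
predicted = np.array([110.0, 180.0, 50.0, 400.0])

# Absolute errors are [10, 20, 0, 600]
abs_errors = np.abs(predicted - actual)

# Median of the sorted errors [0, 10, 20, 600] is (10 + 20) / 2 = 15
medae_manual = np.median(abs_errors)
medae_sklearn = metrics.median_absolute_error(actual, predicted)

# The mean is dominated by the one large error: 630 / 4 = 157.5
mae = metrics.mean_absolute_error(actual, predicted)

print(medae_manual, medae_sklearn, mae)
```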

In [None]:
from sklearn import metrics

medae_value = metrics.median_absolute_error(y_train, lm.predict(X_train))
print(f"{medae_value:.4f} medae on training set")

5) Deliver
-----

How well can your model predict the test data?

In [None]:
# Load test data - test data only has potential features, no price as target
data_test_raw = pd.read_csv(path+"test.csv", low_memory=False)
assert "price" not in data_test_raw.columns

n_rows_input = data_test_raw.shape[0]

In [None]:
"""
2 points
Test code for the 'process_data' function. 
This cell should NOT give any errors when it is run.
"""

data_test = process_data(data_test_raw)

# Double check the type is still a pd.DataFrame
assert type(data_test) == pd.core.frame.DataFrame

# Double check that no rows were dropped. Dropping rows will break the test predictions.
assert data_test.shape[0] == n_rows_input 

In [None]:
# Performance on median benchmark (mock test dataset)
# Sanity check that your model can generate predictions in the correct form for the actual test dataset.
# Your performance on the median data is not a good indicator of performance on the actual test dataset.

X_test = np.array(data_test)

test_solutions = pd.read_csv(path+'median_benchmark.csv') 
y_median = test_solutions['median_price']

medae_value = metrics.median_absolute_error(y_median, lm.predict(X_test))
print(f"{medae_value:.4f} medae on median test set")

In [None]:
"""
1 point for medae less than 100 on test dataset.
Tests are hidden since you don't have access to test targets.
"""



In [None]:
"""
1 point for medae less than 90 on test dataset.
Tests are hidden since you don't have access to test targets.
"""



In [None]:
"""
1 point for medae less than 80 on test dataset.
Tests are hidden since you don't have access to test targets.
"""



In [None]:
"""
1 point for medae less than 70 on test dataset.
Tests are hidden since you don't have access to test targets.
"""



In [None]:
"""
1 point for medae less than 60 on test dataset.
Tests are hidden since you don't have access to test targets.
"""



Helpful Hints
------

- The primary deliverables are the `process_data` function and a single linear model, aka `lm`. There should be no other code in the notebook, including visualizations or hyperparameter search code.
- Suggested steps (deliverable / nondeliverable)
    1. Exploratory data analysis (EDA) (nondeliverable)
    2. Process data to create features (deliverable)
    3. Search for model, hyperparameters, parameters (nondeliverable)
    4. Select a single algorithm with a single set of hyperparameters (deliverable)
- There are __no__ tricks to doing well on the test dataset. This assignment requires fundamental machine learning skills:
    - Domain and data knowledge
    - Feature engineering and feature selection
    - Model selection and hyperparameter tuning
- Domain knowledge is always important. Check out [Airbnb San Francisco listings](https://www.airbnb.com/a/San-Francisco--California--United-States) to understand the data.
- Use your EDA skills. Create a separate notebook for EDA.
- Clean data is better than dirty data.
- Think deeply about the data. What potential issues will arise with modeling this data with a linear regression algorithm?
- High-quality regression results often require hand-tuning of features and model hyperparameters.
- Academic Integrity reminder - Use only course notes and package (scikit-learn, pandas, …) documentation for this assignment. You should __not__ be googling for help. You should be working independently.
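
The nondeliverable search step in the suggested steps above could be sketched, in a separate notebook, as a manual hold-out comparison. Everything in this example is an assumption made for illustration: the synthetic data, the choice of `Ridge`, the alpha grid, and the 80/20 split. Only the single selected model would go into the submitted notebook.

```python
import numpy as np
from sklearn import linear_model, metrics
from sklearn.model_selection import train_test_split

# Synthetic stand-in for processed features and prices
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 0.0, 1.0, 0.5]) + rng.normal(scale=0.5, size=200) + 100.0

# Hold out 20% of the rows to estimate performance on unseen data
X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Try each candidate hyperparameter by hand and record the validation MedAE
results = {}
for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:
    candidate = linear_model.Ridge(alpha=alpha).fit(X_fit, y_fit)
    results[alpha] = metrics.median_absolute_error(y_val, candidate.predict(X_val))

best_alpha = min(results, key=results.get)
print(f"best alpha on the hold-out split: {best_alpha}")

# The deliverable is then one model with one fixed set of hyperparameters
lm_demo = linear_model.Ridge(alpha=best_alpha).fit(X, y)
```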


----