#### Copyright 2019 Google LLC.

In [0]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# WIP Multiple Features


So far we have only fit linear regressions with a single feature. It is also possible to use multiple features to predict a target value.

Let's see the use of multiple features in action.

## Overview

### Learning Objectives

  * Regression with multiple features

### Prerequisites

* Intermediate Python
* Intermediate Pandas
* Linear Regression with scikit-learn

### Estimated Duration

60 minutes

### Grading Criteria

Each exercise is worth 3 points. The rubric for calculating those points is:

| Points | Description |
|--------|-------------|
| 0      | No attempt at exercise |
| 1      | Attempted exercise, but code does not run |
| 2      | Attempted exercise, code runs, but produces incorrect answer |
| 3      | Exercise completed successfully |

There are no graded exercises in this Colab, so there are 0 points available.

## Data

First we'll load the [Boston dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html) and convert it into a `DataFrame`.

The Boston dataset is a tiny dataset that comes packaged with scikit-learn. It is one of many [toy datasets](https://scikit-learn.org/stable/datasets/index.html) that can be useful for experimenting with machine learning. Note that `load_boston` was deprecated in scikit-learn 1.0 and removed in 1.2, so the code below requires an older version of the library.

The loaded data stores the feature data and the target data separately. In the example below we combine them into a single `DataFrame`. This isn't strictly required, but it makes tasks like shuffling the data and creating correlation matrices easier, so we do it here.

In [0]:
import pandas as pd
import numpy as np

from sklearn.datasets import load_boston

boston = load_boston()

df = pd.DataFrame(data=np.c_[boston.data, boston.target], columns=np.append(boston.feature_names, 'TARGET'))

df
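
To illustrate why a combined `DataFrame` is convenient, here is a minimal sketch of shuffling rows and checking correlations against the target. It uses made-up data rather than the Boston dataset; the names `toy_df`, `F1`, and `F2` are assumptions chosen only for this illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: two features and a target (not the Boston dataset).
rng = np.random.default_rng(seed=0)
features = rng.normal(size=(100, 2))
target = 3.0 * features[:, 0] + rng.normal(scale=0.1, size=100)

toy_df = pd.DataFrame(
    data=np.c_[features, target],
    columns=['F1', 'F2', 'TARGET'],
)

# Shuffle the rows. Features and target stay aligned because they share a row.
shuffled_df = toy_df.sample(frac=1, random_state=0).reset_index(drop=True)

# Correlation of each feature with the target, highest value first.
target_corr = toy_df.corr()['TARGET'].drop('TARGET').sort_values(ascending=False)
print(target_corr)
```

Because the target here is built almost entirely from `F1`, that column should show a correlation close to 1, while `F2` should be close to 0.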

## LinearRegression

Next we train a `LinearRegression` model on more than one feature. It is as simple as passing `fit` a `DataFrame` with more than one feature column, which we select here with a list of column names.

In [0]:
from sklearn.linear_model import LinearRegression

features = ['CRIM', 'ZN']

lin_reg = LinearRegression()
lin_reg.fit(df[features], df['TARGET'])
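
A fitted model with several features learns one coefficient per feature plus a single intercept. The sketch below shows this on made-up, noise-free data (an assumption for illustration, not the Boston dataset), where the true relationship is known in advance.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: y = 2*x1 - 3*x2 + 5, with no noise.
rng = np.random.default_rng(seed=0)
X = rng.normal(size=(50, 2))
y = 2.0 * X[:, 0] - 3.0 * X[:, 1] + 5.0

model = LinearRegression()
model.fit(X, y)

# coef_ holds one weight per feature column; intercept_ is a single value.
print(model.coef_)
print(model.intercept_)
```

With no noise in the data, the learned coefficients and intercept should match the values used to generate `y` up to floating-point error.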

We can then check the [R-squared score](https://en.wikipedia.org/wiki/Coefficient_of_determination) for the model across the entire data set.

In [0]:
lin_reg.score(df[features], df['TARGET'])
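
For reference, the R-squared score is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand on made-up data (an assumption for illustration) and compares it with what `score` returns.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical noisy data with two features.
rng = np.random.default_rng(seed=0)
X = rng.normal(size=(200, 2))
y = 1.5 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200)

model = LinearRegression().fit(X, y)
predictions = model.predict(X)

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y - predictions) ** 2)   # squared error left after fitting
ss_tot = np.sum((y - y.mean()) ** 2)      # squared error of always predicting the mean
manual_r2 = 1.0 - ss_res / ss_tot

print(manual_r2, model.score(X, y))
```

The two printed values should agree, since `score` on a regressor computes this same quantity.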

That is all there is to using multiple features in a linear regression.

A few notes about this Colab:

* We didn't split out test data when training the model. Typically this is bad practice, but the goal of this Colab was to show the use of multiple features, not how to properly build a model.
* Training cost does not scale linearly with the number of features. Behind the scenes, scikit-learn solves the least-squares problem using Singular Value Decomposition (SVD). For $n$ samples and $p$ features this costs roughly $O(n p^2)$ time, so doubling the number of features roughly quadruples the processing time.
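
As a point of comparison for the first note, here is a minimal sketch of holding out test data with `train_test_split`. The data is made up for illustration, and the 80/20 split and `random_state=0` are arbitrary assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical noisy data with two features.
rng = np.random.default_rng(seed=0)
X = rng.normal(size=(200, 2))
y = 1.5 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Hold out 20% of the rows so the model is scored on data it has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)

train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)
print(train_r2, test_r2)
```

The score on the held-out rows is usually the more honest estimate of how the model would perform on new data.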

# Exercises

## Exercise 1

scikit-learn has a [diabetes dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_diabetes.html#sklearn.datasets.load_diabetes) that works well for regression. Load this dataset into a `DataFrame` and take a few minutes to experiment with different groupings of feature columns. Note the fit for each grouping. When you are done, save the model with the best-fitting features.

### Student Solution

In [0]:
import numpy as np
import pandas as pd

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

diabetes = load_diabetes()

# Combine the features and the target into a single DataFrame.
all_data = np.c_[diabetes.data, diabetes.target]
all_names = diabetes.feature_names + ['target']
diabetes_df = pd.DataFrame(data=all_data, columns=all_names)

# Fit and score a model on the chosen grouping of features.
my_features = ["age", "sex", "bmi", "s3", "s5"]
lr = LinearRegression()
lr.fit(diabetes_df[my_features], diabetes_df["target"])
lr.score(diabetes_df[my_features], diabetes_df["target"])

In [0]:
diabetes_df.corr()

### Answer Key

**Solution**

In [0]:
# TODO

**Validation**

In [0]:
# TODO