In [1]:
%matplotlib inline

## Regression

## Importing data for supervised learning

In this notebook, you will work with Gapminder data that we have consolidated into one CSV file, available as './data/gapminder_tidy.csv'. Specifically, your goal will be to use this data to predict the life expectancy in a given country based on features such as the country's GDP, fertility rate, and population.

Since the target variable here is quantitative, this is a regression problem. To begin, you will fit a linear regression with just one feature: 'fertility', which is the average number of children a woman in a given country gives birth to. In later exercises, you will use all the features to build regression models.

Before that, however, you need to import the data and get it into the form needed by scikit-learn. This involves creating feature and target variable arrays. Furthermore, since you are going to use only one feature to begin with, you need to do some reshaping using NumPy's .reshape() method. Don't worry too much about this reshaping right now, but it is something you will have to do occasionally when working with scikit-learn so it is useful to practice.
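
As a small illustration of what the reshaping does, the sketch below uses a made-up one-dimensional array (`fertility_demo` is a hypothetical name with invented values, not the Gapminder column). scikit-learn expects the feature array to be two-dimensional, with one row per sample and one column per feature.

```python
import numpy as np

# Made-up values standing in for a single feature column
fertility_demo = np.array([2.1, 1.8, 5.4, 3.0])
print(fertility_demo.shape)   # one-dimensional: (4,)

# -1 tells NumPy to infer the number of rows; 1 fixes a single column
fertility_2d = fertility_demo.reshape(-1, 1)
print(fertility_2d.shape)     # two-dimensional: (4, 1)
```

The values themselves are unchanged; only the shape of the array differs.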

**Task assignment**

* Import numpy and pandas under their standard aliases (`np` and `pd`).
* Read the file './data/gapminder_tidy.csv' into a DataFrame `df` using the `read_csv()` function.
* Create array `X` for the 'fertility' feature and array `y` for the 'life' target variable.
* Reshape both arrays using the `.reshape()` method, passing in `-1` and `1`.

In [None]:
# Import numpy and pandas
____
____

# Read the CSV file into a DataFrame: df
df = ____ 
df = df.fillna(0)

# Create arrays for features and target variable
y = ____
X = ____

# Print the dimensions of X and y before reshaping
print("Dimensions of y before reshaping: {}".format(y.shape))
print("Dimensions of X before reshaping: {}".format(X.shape))

# Reshape X and y
y = ____
X = ____

# Print the dimensions of X and y after reshaping
print("Dimensions of y after reshaping: {}".format(y.shape))
print("Dimensions of X after reshaping: {}".format(X.shape))


## Exploring the Gapminder data

As always, it is important to explore your data before building models. The heatmap below shows the correlation between the different features of the Gapminder dataset, which has been pre-loaded into a DataFrame as `df` and is available for exploration in the IPython Shell. Green cells indicate positive correlation, while red cells indicate negative correlation. Take a moment to explore this: which features are positively correlated with life, and which ones are negatively correlated? Does this match your intuition?

Then, in the IPython Shell, explore the DataFrame using pandas methods such as `.info()`, `.describe()`, `.head()`.
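
For orientation, here is a minimal sketch of these exploration methods on a tiny, made-up DataFrame. The name `toy` and all of its values are invented for illustration; only the column names mimic the Gapminder layout.

```python
import pandas as pd

# Made-up values; not taken from the Gapminder file
toy = pd.DataFrame({
    "fertility": [1.5, 2.0, 4.5, 6.0],
    "life": [80.0, 76.0, 62.0, 55.0],
    "GDP": [40000.0, 25000.0, 5000.0, 1500.0],
})

toy.info()             # column names, dtypes, and non-null counts
print(toy.head(2))     # first two rows
print(toy.describe())  # count, mean, std, min, quartiles, and max per column

# Pairwise Pearson correlation between the columns
corr = toy.corr()
print(corr.loc["fertility", "life"])
```

In this toy data, fertility rises as life expectancy falls, so the printed correlation is negative.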

In case you are curious, the heatmap was generated using Seaborn's heatmap function and the following line of code, where `df.corr()` computes the pairwise correlation between columns:

In [None]:
from IPython.display import Image
%matplotlib inline

import seaborn as sns
sns.heatmap(df.corr(), square=True, cmap='RdYlGn')

**Task assignment**

Once you have a feel for the data, consider the statements below and select the one that is not true.

* The DataFrame has 139 samples (or rows) and 9 columns.
* Life and fertility are negatively correlated.
* The mean of life is 64.0786.
* Fertility is of type float64.
* GDP and life are positively correlated.

## Fit & predict for regression

Now, you will fit a linear regression and predict life expectancy using just one feature. In this exercise, you will use the 'fertility' feature of the Gapminder dataset. Since the goal is to predict life expectancy, the target variable here is 'life'. The array for the target variable has been pre-loaded as y and the array for 'fertility' has been pre-loaded as X.

A scatter plot with 'fertility' on the x-axis and 'life' on the y-axis has been generated. As you can see, there is a strong negative correlation, so a linear regression should be able to capture this trend. Your job is to fit a linear regression and then predict the life expectancy, overlaying these predicted values on the plot to generate a regression line. You will also compute and print the $R^2$ score using scikit-learn's `.score()` method.
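
To make the $R^2$ score concrete, the following sketch fits a regression to synthetic data and checks that `.score()` agrees with the usual definition $R^2 = 1 - SS_{res} / SS_{tot}$. The arrays `X_demo` and `y_demo`, and the coefficients used to generate them, are made up for illustration and are not the Gapminder data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic single-feature data with a negative linear trend plus noise
rng = np.random.default_rng(0)
X_demo = rng.uniform(1, 7, size=(100, 1))
y_demo = 80 - 4 * X_demo[:, 0] + rng.normal(0, 2, size=100)

model = LinearRegression().fit(X_demo, y_demo)
y_hat = model.predict(X_demo)

# R^2 by hand: 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_demo - y_hat) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, model.score(X_demo, y_demo))
```

Both printed numbers should agree up to floating-point error.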

**Task assignment**

* Import `LinearRegression` from `sklearn.linear_model`.
* Create a `LinearRegression` regressor called `reg`.
* Set up the prediction space to range from the minimum to the maximum of `X`. This has been done for you.
* Fit the regressor to the data (`X` and `y`) and compute its predictions using the `.predict()` method and the `prediction_space` array.
* Compute and print the $R^2$ score using the `.score()` method.
* Overlay the plot with your linear regression line.

In [None]:
import matplotlib.pyplot as plt
# Import LinearRegression
____

# Create the regressor: reg
reg = ____

# Create the prediction space
prediction_space = np.linspace(min(X), max(X)).reshape(-1,1)

# Fit the model to the data
____

# Compute predictions over the prediction space: y_pred
y_pred = ____

# Print R^2 
print(reg.score(____, ____))

# Plot regression line
plt.plot(prediction_space, y_pred, color='black', linewidth=3)
plt.show()

## Train/test split for regression

As you learned earlier, train and test sets are vital to ensure that your supervised learning model is able to generalize well to new data. This was true for classification models, and is equally true for linear regression models.

In this exercise, you will split the Gapminder dataset into training and testing sets, and then fit and predict a linear regression over all features. In addition to computing the $R^2$ score, you will also compute the Root Mean Squared Error (RMSE), which is another commonly used metric to evaluate regression models. The feature array `X` and target variable array `y` have been pre-loaded for you from the DataFrame `df`.
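
The effect of a 70/30 split can be sketched on a small made-up array. The names `X_demo`, `y_demo`, `X_tr`, `X_te`, `y_tr`, and `y_te` are hypothetical and chosen so they do not clash with the exercise variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up samples with two features each
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Hold out 30% of the rows for testing; random_state makes the shuffle repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

print(X_tr.shape, X_te.shape)
print(y_tr.shape, y_te.shape)
```

Every sample ends up in exactly one of the two sets, and rows of the feature array stay aligned with their targets.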

**Task assignment**

* Import `LinearRegression` from `sklearn.linear_model`, `mean_squared_error` from `sklearn.metrics`, and `train_test_split` from `sklearn.model_selection`.
* Using `X` and `y`, create training and test sets such that 30% is used for testing and 70% for training. Use a random state of 42.
* Create a linear regression regressor called `reg_all`, fit it to the training set, and evaluate it on the test set.
* Compute and print the $R^2$ score using the `.score()` method on the test set.
* Compute and print the RMSE. To do this, first compute the Mean Squared Error using the `mean_squared_error()` function with the arguments `y_test` and `y_pred`, and then take its square root using `np.sqrt()`.
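
The RMSE recipe from the last step can be checked on a few hand-picked numbers. The arrays `y_true_demo` and `y_pred_demo` are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up targets and predictions; the errors are 1, 0, and -2
y_true_demo = np.array([3.0, 5.0, 7.0])
y_pred_demo = np.array([2.0, 5.0, 9.0])

# Mean of the squared errors: (1 + 0 + 4) / 3
mse_demo = mean_squared_error(y_true_demo, y_pred_demo)

# RMSE is the square root of the MSE, in the same units as the target
rmse_demo = np.sqrt(mse_demo)

# The same quantity computed with NumPy only
rmse_manual = np.sqrt(np.mean((y_true_demo - y_pred_demo) ** 2))

print(mse_demo, rmse_demo, rmse_manual)
```

Because the RMSE is expressed in the units of the target, it is often easier to interpret than the MSE.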

In [None]:
# Import necessary modules
____
____
____


# Create training and test sets
X_train, X_test, y_train, y_test = ____(____, ____, test_size = ____, random_state=____)

# Create the regressor: reg_all
reg_all = ____

# Fit the regressor to the training data
____

# Predict on the test data: y_pred
y_pred = ____

# Compute and print R^2 and RMSE
r2 = ____
print("R^2: {}".format(r2))
rmse = np.sqrt(____)
print("Root Mean Squared Error: {}".format(rmse))

## 5-fold cross-validation

Cross-validation is a vital step in evaluating a model. It maximizes the amount of data used to train the model because, over the course of the procedure, every observation is used for both training and testing.
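
One way to see this is to look at the index splits directly. The sketch below uses scikit-learn's `KFold` on ten made-up samples (`X_demo` is a hypothetical array) and counts how often each sample lands in a training or test fold.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten made-up samples, identified by their row index
X_demo = np.arange(10).reshape(-1, 1)

test_counts = np.zeros(10, dtype=int)
train_counts = np.zeros(10, dtype=int)

# Five folds: each iteration holds out a different fifth of the rows
for train_idx, test_idx in KFold(n_splits=5).split(X_demo):
    train_counts[train_idx] += 1
    test_counts[test_idx] += 1
    print("train:", train_idx, "test:", test_idx)

print(test_counts)   # how often each sample was held out
print(train_counts)  # how often each sample was used for training
```

With five folds, each sample is held out once and used for training in the other four rounds.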

In this exercise, you will practice 5-fold cross-validation on the Gapminder data. By default, scikit-learn's `cross_val_score()` function uses $R^2$ as the metric of choice for regression. Since you are performing 5-fold cross-validation, the function will return 5 scores. Your job is to compute these 5 scores and then take their average.

The DataFrame has been loaded as `df` and split into the feature/target variable arrays `X` and `y`. The modules `pandas` and `numpy` have been imported as `pd` and `np`, respectively.
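
As a quick check of the default metric, the following sketch runs `cross_val_score()` on synthetic data twice: once with the default scoring and once with `scoring="r2"` set explicitly. The arrays `X_demo` and `y_demo` and the generating coefficients are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data: three features, linear signal plus noise
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(60, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=60)

# Default scoring falls back to the estimator's own .score(), which is R^2 for regressors
scores_default = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5)

# Requesting R^2 explicitly should give the same five numbers
scores_r2 = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5, scoring="r2")

print(scores_default)
print(scores_r2)
```

One score is returned per fold, so `cv=5` yields an array of length five.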

**Task assignment**

* Import `LinearRegression` from `sklearn.linear_model` and `cross_val_score` from `sklearn.model_selection`.
* Create a linear regression regressor called `reg`.
* Use the `cross_val_score()` function to perform 5-fold cross-validation on `X` and `y`.
* Compute and print the average cross-validation score. You can use NumPy's `mean()` function to compute the average.

In [None]:
# Import the necessary modules
____
____

# Create a linear regression object: reg
reg = ____

# Compute 5-fold cross-validation scores: cv_scores
cv_scores = ____

# Print the 5-fold cross-validation scores
print(____)

print("Average 5-Fold CV Score: {}".format(____(____)))

## K-Fold CV comparison

Cross-validation is essential, but keep in mind that the more folds you use, the more computationally expensive it becomes. In this exercise, you will explore this for yourself. Your job is to perform 3-fold cross-validation and then 10-fold cross-validation on the Gapminder dataset.

You can use `%timeit` to compare how long 3-fold CV takes with how long 10-fold CV takes by executing the following line once with `cv=3` and once with `cv=10`:

`%timeit cross_val_score(reg, X, y, cv = ____)`
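
Outside IPython, where the `%timeit` magic is not available, a similar comparison can be sketched with the standard library's `timeit` module. The data here is synthetic, and the names `X_demo`, `y_demo`, `reg_demo`, and `timings` are hypothetical.

```python
import timeit

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data, used only to have something to time
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo.sum(axis=1) + rng.normal(scale=0.5, size=200)
reg_demo = LinearRegression()

# Total wall-clock time for 20 repetitions of each setting
timings = {}
for k in (3, 10):
    timings[k] = timeit.timeit(
        lambda: cross_val_score(reg_demo, X_demo, y_demo, cv=k), number=20
    )
    print("cv={}: {:.4f} s for 20 runs".format(k, timings[k]))
```

The absolute numbers depend on the machine, so only the relative difference between the two settings is meaningful.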

**Task assignment**

* Import `LinearRegression` from `sklearn.linear_model` and `cross_val_score` from `sklearn.model_selection`.
* Create a linear regression regressor called `reg`.
* Perform 3-fold CV and then 10-fold CV. Compare the resulting mean scores.

In [None]:
# Import necessary modules
____
____

# Create a linear regression object: reg
reg = ____

# Perform 3-fold CV
cvscores_3 = ____
print(np.mean(cvscores_3))

# Perform 10-fold CV
cvscores_10 = ____
print(np.mean(cvscores_10))

# Now time the 3-fold and 10-fold CV and compare the results

## Melbourne housing dataset - Kaggle

We will now switch to a dataset available on Kaggle: the Melbourne housing dataset. As before, start by loading the dataset using the pandas `read_csv()` function. The dataset is available in "data/melb_data.csv". We will try to predict the sale price based on a set of input variables.

**Task assignment**
* Print the list of columns in the dataset to find the name of the prediction target.
* Save the target column to a new variable `y`.

In [None]:
# Create prediction target y

**Task assignment**
* Now create a DataFrame `X` holding the predictive features. Use the following set of columns:
  * `LotArea`
  * `YearBuilt`
  * `1stFlrSF`
  * `2ndFlrSF`
  * `FullBath`
  * `BedroomAbvGr`
  * `TotRmsAbvGrd`

Do this by first creating a list of features, and then using that list to create the dataframe that you'll use to fit the model. 
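
The pattern of selecting columns with a list can be sketched on a tiny, made-up DataFrame. The name `houses`, its column names, and its values are invented for illustration and do not come from the housing file.

```python
import pandas as pd

# Made-up housing records with hypothetical column names
houses = pd.DataFrame({
    "rooms": [2, 3, 4],
    "area": [60.0, 85.0, 120.0],
    "year": [1990, 2005, 2015],
    "price": [300000, 450000, 700000],
})

# Inspect the available columns to locate the target
print(houses.columns.tolist())

# A single column label returns a Series (the target)
target_demo = houses["price"]

# A list of column labels returns a DataFrame (the features)
demo_features = ["rooms", "area"]
features_demo = houses[demo_features]

print(features_demo.shape, target_demo.shape)
```

Keeping the feature names in a list makes it easy to add or drop columns later without touching the selection code.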

In [None]:
# Create the list of features below
feature_names = ___

# Select data corresponding to features in feature_names
X = ____

Take a look at your data before building a model for it.

**Task assignment**

* Import `LinearRegression` from `sklearn.linear_model`, `mean_squared_error` from `sklearn.metrics`, and `train_test_split` from `sklearn.model_selection`.
* Using `X` and `y`, create training and test sets such that 30% is used for testing and 70% for training. Use a random state of 42.
* Create a linear regression regressor called `reg_all`, fit it to the training set, and evaluate it on the test set.
* Compute and print the $R^2$ score using the `.score()` method on the test set.
* Compute and print the RMSE. To do this, first compute the Mean Squared Error using the `mean_squared_error()` function with the arguments `y_test` and `y_pred`, and then take its square root using `np.sqrt()`.

Now let's compare that with a Decision Tree regressor model. 

**Task assignment**
* Import `DecisionTreeRegressor` from `sklearn.tree`.
* Create a new `DecisionTreeRegressor` model.
* Fit the model to your `X_train`, `y_train` data.
* Evaluate the performance ($R^2$ and RMSE) on the test set.
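
The comparison workflow can be sketched on synthetic data as follows. The arrays, the generating coefficients, and the `results` dictionary are assumptions made for illustration; the scores say nothing about how the two models behave on the housing data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data with a linear signal plus noise
rng = np.random.default_rng(3)
X_demo = rng.uniform(0, 10, size=(300, 2))
y_demo = 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + rng.normal(scale=1.0, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# Fit both models on the same split and collect (R^2, RMSE) on the test set
results = {}
models = [
    ("linear", LinearRegression()),
    ("tree", DecisionTreeRegressor(random_state=42)),
]
for name, model in models:
    model.fit(X_tr, y_tr)
    y_hat = model.predict(X_te)
    r2 = model.score(X_te, y_te)
    rmse = np.sqrt(mean_squared_error(y_te, y_hat))
    results[name] = (r2, rmse)
    print("{}: R^2 = {:.3f}, RMSE = {:.3f}".format(name, r2, rmse))
```

Evaluating both models on the same held-out split keeps the comparison fair.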