# Train/test split for regression

#### EXERCISE:
As you learned in Chapter 1, splitting your data into training and test sets is vital for checking that a supervised learning model generalizes well to new data. This was true for classification models, and it is equally true for linear regression models.

In this exercise, you will split the Gapminder dataset into training and testing sets, then fit a linear regression over <strong>all</strong> features and use it to predict on the test set. In addition to computing the $R^2$ score, you will also compute the Root Mean Squared Error (RMSE), another commonly used metric for evaluating regression models. The feature array <code>X</code> and target variable array <code>y</code> have been pre-loaded for you from the DataFrame <code>df</code>.
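
For reference, the two metrics have the following standard definitions. The symbols are introduced here for illustration and do not appear in the exercise itself: $n$ is the number of test observations, $y_i$ an observed target value, $\hat{y}_i$ the corresponding prediction, and $\bar{y}$ the mean of the observed values.

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2},
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
$$

RMSE is expressed in the same units as the target, whereas $R^2$ is unitless and measures the share of the target's variance that the model accounts for.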

#### INSTRUCTIONS:
* Import <code>LinearRegression</code> from <code>sklearn.linear_model</code>, <code>mean_squared_error</code> from <code>sklearn.metrics</code>, and <code>train_test_split</code> from <code>sklearn.model_selection</code>.
* Using <code>X</code> and <code>y</code>, create training and test sets such that 30% is used for testing and 70% for training. Use a random state of <code>42</code>.
* Create a linear regression model called <code>reg_all</code>, fit it to the training set, and use it to predict on the test set, storing the predictions as <code>y_pred</code>.
* Compute and print the $R^2$ score using the <code>.score()</code> method on the test set.
* Compute and print the RMSE. To do this, first compute the Mean Squared Error using the <code>mean_squared_error()</code> function with the arguments <code>y_test</code> and <code>y_pred</code>, and then take its square root using <code>np.sqrt()</code>.
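
To make the last two steps concrete, here is a minimal sketch that computes both metrics by hand with NumPy. The arrays `y_true_demo` and `y_pred_demo` are made-up illustrative values, not Gapminder data, and the variable names are assumptions chosen for this example.

```python
import numpy as np

# Made-up observed and predicted values, for illustration only.
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.5, 5.5, 7.0, 8.0])

# Mean Squared Error: the average of the squared residuals.
residuals = y_true_demo - y_pred_demo
mse_demo = np.mean(residuals ** 2)

# RMSE: the square root of the MSE, in the same units as the target.
rmse_demo = np.sqrt(mse_demo)

# R^2: one minus the ratio of residual to total sum of squares.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)
r2_demo = 1 - ss_res / ss_tot

print("RMSE: {}".format(rmse_demo))
print("R^2: {}".format(r2_demo))
```

With these values the RMSE is roughly 0.612 and $R^2$ is roughly 0.925. In the exercise, <code>mean_squared_error()</code> and <code>.score()</code> perform the same calculations on the test set.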

#### SCRIPT.PY:

import numpy as np
import pandas as pd
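# Load the Gapminder data.
# Note: X here holds only the "fertility" column, so the scores printed
# below are for a single-feature model rather than for all features.
df = pd.read_csv("gapminder.csv")
y = df["life"].values.reshape(-1, 1)
X = df["fertility"].values.reshape(-1, 1)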

# Import necessary modules
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create the regressor: reg_all
reg_all = LinearRegression()

# Fit the regressor to the training data
reg_all.fit(X_train, y_train)

# Predict on the test data: y_pred
y_pred = reg_all.predict(X_test)

# Compute and print R^2 and RMSE
print("R^2: {}".format(reg_all.score(X_test, y_test)))
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("Root Mean Squared Error: {}".format(rmse))


R^2: 0.7298987360907494
Root Mean Squared Error: 4.194027914110243
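
The script above builds <code>X</code> from the <code>fertility</code> column alone. As a rough sketch of how a single-feature fit compares with a fit on every feature, the example below uses a small synthetic DataFrame. Everything in it is an assumption made for illustration: the column names, the generated values, and the helper `fit_and_score` are hypothetical, and the numbers it prints say nothing about the real Gapminder data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dataset (made-up values, hypothetical columns).
rng = np.random.default_rng(0)
n = 100
df_demo = pd.DataFrame({
    "fertility": rng.uniform(1, 7, n),
    "GDP": rng.uniform(500, 50_000, n),
    "child_mortality": rng.uniform(2, 150, n),
})
df_demo["life"] = (
    80
    - 3 * df_demo["fertility"]
    + 0.0002 * df_demo["GDP"]
    - 0.05 * df_demo["child_mortality"]
    + rng.normal(0, 1, n)
)

# Split the rows once so both models share the same 70/30 partition.
train_df, test_df = train_test_split(df_demo, test_size=0.3, random_state=42)


def fit_and_score(feature_cols):
    """Fit on the training rows; return (R^2, RMSE) on the test rows."""
    reg = LinearRegression()
    reg.fit(train_df[feature_cols].values, train_df["life"].values)
    X_test_demo = test_df[feature_cols].values
    y_test_demo = test_df["life"].values
    y_pred_demo = reg.predict(X_test_demo)
    r2 = reg.score(X_test_demo, y_test_demo)
    rmse = np.sqrt(mean_squared_error(y_test_demo, y_pred_demo))
    return r2, rmse


# Every column except the target is treated as a feature.
all_features = [c for c in df_demo.columns if c != "life"]

r2_one, rmse_one = fit_and_score(["fertility"])
r2_all, rmse_all = fit_and_score(all_features)

print("fertility only -> R^2: {}, RMSE: {}".format(r2_one, rmse_one))
print("all features   -> R^2: {}, RMSE: {}".format(r2_all, rmse_all))
```

On the real data, an all-features <code>X</code> could be built the same way, by dropping the target column from <code>df</code>. Any non-numeric columns would need to be removed or encoded first.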
