# Intro To Data Science With Linear Regression

Linear regression is one of the fundamental building blocks of data science and analytics. If you venture into data science, it will most likely be the first model you're taught.

First, import the necessary libraries to run the notebook. Press `Shift + Enter` to run the cell below.

In [51]:
# Note: load_boston was removed in scikit-learn 1.2, so this import needs an older release.
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
import pandas as pd

Load the Boston housing dataset, which ships with scikit-learn. The goal is to predict the housing price (the target) from the other columns (the features).

In [53]:
boston = load_boston()

Next, separate the data into features and target. Assign `y` first, because the second line rebinds `boston` to a DataFrame that no longer has a `target` attribute:

`y = boston.target`

`boston = pd.DataFrame(boston.data)`

In [54]:
y = boston.target

boston = pd.DataFrame(boston.data)


Preview the dataset using the following code. The `head` method returns the first 5 rows of a DataFrame.

`boston.head()`

In [55]:
boston.head()

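| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 |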


The columns are labeled only with integer positions, not names. This happens with some datasets. If you have a data dictionary, you can use it to label the columns. For now, add this line to the cell below and call the `head` method on the DataFrame again.

`boston.columns = ['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis', 'rad', 'tax', 'ptratio', 'black', 'lstat']`

In [56]:
boston.columns = ['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis', 'rad', 'tax', 'ptratio', 'black', 'lstat']

boston.head()

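| | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 |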


Now that the data is labeled, we have a better sense of what each column means.

To reiterate, we'll predict the housing price using all of these columns (features).
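
As a side note, the labeling step can also be done when the DataFrame is created, by passing `columns=` to the constructor. The sketch below assumes a small random array as a stand-in for `boston.data`; the names `data`, `feature_names`, and `df` are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `boston.data`: 3 rows x 13 feature columns.
rng = np.random.default_rng(0)
data = rng.random((3, 13))

# Column names from the notebook, in the same order.
feature_names = ['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age',
                 'dis', 'rad', 'tax', 'ptratio', 'black', 'lstat']

# Passing `columns=` labels the DataFrame at construction time,
# so no separate `df.columns = ...` assignment is needed.
df = pd.DataFrame(data, columns=feature_names)

print(df.shape)
print(list(df.columns[:3]))
```

Either approach gives the same result; assigning `columns` afterwards, as the notebook does, is handy when the DataFrame already exists.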

## Building the Linear Regression Model 

Now that the data is in the right format, we can begin to build the linear regression model.

First, we're going to split the data. A common practice in data science is to divide the data into two subsets.

The first is the *training* set. Fitting a model to data is referred to as "training", hence the name.

The second is the *test* set. The model never sees it during training; we use it to make predictions and evaluate how well the model performs on new data. The cell below holds out 20% of the rows for testing (`test_size=0.20`) and fixes `random_state` so that the split is reproducible.

In [39]:
X_train, X_test, y_train, y_test = train_test_split(boston, y, test_size=0.20, random_state=42)
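
To see what the split produces, here is a minimal sketch on synthetic data. It assumes an array with the same shape as the Boston features (506 rows, 13 columns); `X_demo`, `y_demo`, and the other names are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the Boston features (506 x 13).
# Row i starts with the value 13 * i, and its target is i, so we can check
# later that features and targets stay paired after shuffling.
X_demo = np.arange(506 * 13, dtype=float).reshape(506, 13)
y_demo = np.arange(506, dtype=float)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42
)

# Roughly 80% of the rows go to training and 20% to testing.
print(X_tr.shape, X_te.shape)

# Rows are shuffled, but each feature row keeps its own target.
print(np.array_equal(X_tr[:, 0], 13 * y_tr))

# Repeating the call with the same random_state reproduces the same split.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42
)
print(np.array_equal(X_te, X_te2))
```

Changing `random_state` would produce a different shuffle, which in turn shifts the score somewhat.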

In [40]:
model = LinearRegression()

In [41]:
model.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
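
With default settings, `fit` estimates an intercept and one coefficient per feature by ordinary least squares. The sketch below illustrates this on synthetic data with known coefficients and compares the result against a direct least-squares solve in NumPy. The data and the names `X_demo`, `true_coef`, and `demo_model` are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 4 + 2*x0 - 1*x1 + 0.5*x2 + small noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
true_coef = np.array([2.0, -1.0, 0.5])
y_demo = X_demo @ true_coef + 4.0 + rng.normal(scale=0.1, size=200)

demo_model = LinearRegression().fit(X_demo, y_demo)

# Least squares by hand: prepend a column of ones for the intercept,
# then solve for the coefficients that minimize the squared error.
A = np.column_stack([np.ones(len(X_demo)), X_demo])
beta, *_ = np.linalg.lstsq(A, y_demo, rcond=None)

print("sklearn:", demo_model.intercept_, demo_model.coef_)
print("lstsq:  ", beta[0], beta[1:])
```

The two approaches should agree to within floating-point error, and both should land close to the coefficients used to generate the data.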

In [43]:
predictions = model.predict(X_test)
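
For a fitted linear regression, `predict` amounts to multiplying each feature by its learned coefficient, summing, and adding the intercept. Here is a small sketch with made-up numbers (`X_demo`, `y_demo`, and `X_new` are hypothetical) that checks this against `predict`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny made-up training set: 4 rows, 2 features.
X_demo = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0], [4.0, 3.0]])
y_demo = np.array([3.0, 2.5, 4.0, 6.5])
demo_model = LinearRegression().fit(X_demo, y_demo)

# Two new rows the model has not seen.
X_new = np.array([[5.0, 2.0], [0.0, 1.0]])

by_predict = demo_model.predict(X_new)
# The same computation written out: features times coefficients, plus intercept.
by_hand = X_new @ demo_model.coef_ + demo_model.intercept_

print(by_predict)
print(by_hand)
```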

In [44]:
r2_score(y_test, predictions)

0.66848257539716704
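
The score above is the coefficient of determination, R². A value of about 0.67 means the model accounts for roughly 67% of the variance in the test-set prices. The sketch below computes R² by hand as one minus the ratio of the residual sum of squares to the total sum of squares, and compares it with `r2_score`. The arrays `y_true` and `y_pred` are made-up values for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up true values and predictions.
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 8.0])

# Residual sum of squares: error left over after the model's predictions.
ss_res = np.sum((y_true - y_pred) ** 2)
# Total sum of squares: error of always predicting the mean of y_true.
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual)
print(r2_score(y_true, y_pred))

# Baseline: predicting the mean for every row scores exactly 0.
baseline = np.full_like(y_true, y_true.mean())
print(r2_score(y_true, baseline))
```

A perfect model scores 1, always predicting the mean scores 0, and a model that does worse than the mean gets a negative score.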