<h2> Import Libraries</h2>

In [1]:
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

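# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this import only works with older scikit-learn versions
from sklearn.datasets import load_boston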

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression

## Load the Data
The Boston house-price dataset is one of the datasets bundled with scikit-learn, so no file has to be downloaded from an external website. The code below loads the Boston dataset.

In [2]:
data = load_boston()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target
df.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,target
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33,36.2


<h2> Remove or Impute Missing Values</h2>
Most models cannot be fit on data that contains null values, so missing entries (almost) always have to be removed or imputed first. Always check how many samples have missing values and in which columns.

In [3]:
# Look at the shape of the dataframe
df.shape

(506, 14)

In [4]:
# There are no missing values in the dataset
df.isnull().sum()

CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
target     0
dtype: int64
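
This dataset has nothing to remove or impute, but as a minimal sketch of what either option could look like, the example below uses a hypothetical toy DataFrame (the values and the name `toy` are made up for illustration and are not part of the Boston data). It shows row removal with `dropna` and median imputation with `fillna`; other strategies may suit other data better.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing entries (not from the Boston dataset)
toy = pd.DataFrame({'RM': [6.5, np.nan, 7.1, 6.0],
                    'LSTAT': [5.0, 9.1, np.nan, 3.0]})

# Count missing values per column, as in the cell above
missing_counts = toy.isnull().sum()

# Option 1: remove every row that has at least one missing value
dropped = toy.dropna()

# Option 2: impute each missing value with the median of its column
imputed = toy.fillna(toy.median())
```

Dropping rows is the simplest choice but throws data away; imputing keeps every sample at the cost of inventing plausible values.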

<h2> Arrange Data into Features Matrix and Target Vector </h2>
What we are predicting is the continuous column "target", which is the median value of owner-occupied homes in thousands of dollars ($1000s).

In [5]:
X = df.loc[:, ['RM', 'LSTAT', 'PTRATIO']]

In [6]:
y = df.loc[:, 'target']
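
The two cells above differ in one detail that is easy to miss: selecting with a list of labels returns a two-dimensional DataFrame (the features matrix), while selecting with a single label returns a one-dimensional Series (the target vector). A small sketch on a hypothetical three-row frame (the name `toy` and its values are illustrative only):

```python
import pandas as pd

# Hypothetical toy frame, used only to illustrate the two selection styles
toy = pd.DataFrame({'RM': [6.5, 6.4, 7.2],
                    'LSTAT': [5.0, 9.1, 4.0],
                    'target': [24.0, 21.6, 34.7]})

X_toy = toy.loc[:, ['RM', 'LSTAT']]   # list of labels -> DataFrame (2-D)
y_toy = toy.loc[:, 'target']          # single label -> Series (1-D)

X_shape = X_toy.shape   # (n_samples, n_features)
y_shape = y_toy.shape   # (n_samples,)
```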

## Splitting Data into Training and Test Sets


In [7]:
# random_state=2 fixes the shuffle so the split is reproducible;
# with no test_size given, 25% of the rows go to the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)
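
To make the behavior of `train_test_split` concrete, here is a minimal sketch on hypothetical toy data (20 made-up samples; all `*_toy`, `Xa_*` and `Xb_*` names are illustrative). It checks the default 75/25 split sizes, that a fixed `random_state` reproduces the same split, and that features and target keep matching row labels.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 samples, 2 features
X_toy = pd.DataFrame({'a': np.arange(20), 'b': np.arange(20) * 2.0})
y_toy = pd.Series(np.arange(20) * 10, name='target')

# With no test_size given, 25% of the rows go to the test set
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    X_toy, y_toy, random_state=2)

# Repeating the call with the same random_state gives the same split
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    X_toy, y_toy, random_state=2)

sizes = (len(Xa_train), len(Xa_test))
same_split = Xa_train.index.equals(Xb_train.index)

# Features and target stay aligned because both keep the original row labels
aligned = Xa_train.index.equals(ya_train.index)
```

Because the original row labels survive the split, the later cells can attach `y_train` and `y_test` back onto the feature frames by index.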

## Train Test Split Visualization

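pandas supports conditional formatting of DataFrames through its Styler API (`DataFrame.style`): https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html. The cells below use it to color-code which rows ended up in the training set and which in the test set.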
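
As a minimal sketch of how this kind of formatting works, the example below styles a hypothetical toy frame (the names `toy`, `color_by_split` and `render` are illustrative, not part of pandas). With `axis=None`, `Styler.apply` passes the whole DataFrame to the function and expects back a DataFrame of the same shape whose cells are CSS strings.

```python
import pandas as pd

# Hypothetical toy frame; the column names are illustrative only
toy = pd.DataFrame({'feature': [1.0, 2.0, 3.0],
                    'split': ['train', 'test', 'train']})


def color_by_split(frame):
    """Return a same-shaped DataFrame of CSS strings, one per cell."""
    css = pd.DataFrame('', index=frame.index, columns=frame.columns)
    css.loc[frame['split'] == 'train', :] = 'background-color: #40E0D0'
    css.loc[frame['split'] == 'test', :] = 'background-color: #00FFFF'
    return css


def render(frame):
    """Build the Styler; axis=None hands the whole frame to the function."""
    return frame.style.apply(color_by_split, axis=None)


# The CSS frame can be inspected directly, without rendering anything
css = color_by_split(toy)
```

In a notebook, putting `render(toy)` on the last line of a cell displays the colored table. Note that `DataFrame.style` relies on the optional jinja2 dependency.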

In [8]:
# Rebuild the splits as standalone DataFrames before adding columns to them
X_train = pd.DataFrame(X_train, columns=['RM', 'LSTAT', 'PTRATIO'])

X_test = pd.DataFrame(X_test, columns=['RM', 'LSTAT', 'PTRATIO'])

In [9]:
X_train['split'] = 'train'
X_test['split'] = 'test'

In [10]:
X_train

Unnamed: 0,RM,LSTAT,PTRATIO,split
312,6.023,11.72,18.4,train
328,5.868,9.97,16.9,train
251,6.438,3.59,19.1,train
205,5.891,10.87,18.6,train
231,7.412,5.25,17.4,train
...,...,...,...,...
22,6.142,18.72,21.0,train
72,6.065,5.52,19.2,train
493,5.707,12.01,19.2,train
15,5.834,8.47,21.0,train


In [11]:
X_train['target'] = y_train
X_test['target'] = y_test

In [12]:
fullDF = pd.concat([X_train, X_test], axis = 0, ignore_index=False)

In [13]:
fullDF.head(10)

Unnamed: 0,RM,LSTAT,PTRATIO,split,target
312,6.023,11.72,18.4,train,19.4
328,5.868,9.97,16.9,train,19.3
251,6.438,3.59,19.1,train,24.8
205,5.891,10.87,18.6,train,22.6
231,7.412,5.25,17.4,train,31.7
114,6.254,10.45,17.8,train,18.5
437,6.152,26.45,20.2,train,8.7
68,5.594,13.09,18.9,train,17.4
214,5.412,29.55,18.6,train,23.7
377,6.794,21.24,20.2,train,13.3


In [14]:
len(fullDF.index)

506

In [15]:
len(np.unique(fullDF.index))

506
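
The two cells above compare the number of rows with the number of distinct row labels to confirm that concatenating with `ignore_index=False` produced no duplicate labels. A more direct check, sketched here on hypothetical toy frames (`a`, `b`, `kept` and `clash` are illustrative names), is the `is_unique` attribute of the index:

```python
import pandas as pd

# Hypothetical toy frames with disjoint row labels, like a train/test split
a = pd.DataFrame({'v': [1, 2]}, index=[0, 1])
b = pd.DataFrame({'v': [3, 4]}, index=[2, 3])

# Disjoint labels stay unique after concatenation
kept = pd.concat([a, b], axis=0, ignore_index=False)

# Concatenating a frame with itself repeats every label
clash = pd.concat([a, a], axis=0, ignore_index=False)

kept_unique = kept.index.is_unique
clash_unique = clash.index.is_unique
```

Unique labels matter here because the styling function later looks rows up by label to decide their color.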

In [16]:
fullDFsplit = fullDF.copy()
fullDF = fullDF.drop(columns = ['split'])

In [17]:

def highlight_color(s, fullDFsplit):
    '''
    Return a DataFrame of CSS strings that colors each cell by the split
    its row belongs to: turquoise/cyan for train/test features and
    gold/yellow for train/test targets.
    '''

    # Start from empty styles and align the split labels with the rows of s
    colorDF = pd.DataFrame('', index=s.index, columns=s.columns)
    split = fullDFsplit.loc[s.index, 'split']
    features = ['RM', 'LSTAT', 'PTRATIO']

    colorDF.loc[split == 'train', features] = 'background-color: #40E0D0'
    colorDF.loc[split == 'test', features] = 'background-color: #00FFFF'
    colorDF.loc[split == 'train', ['target']] = 'background-color: #FFD700'
    colorDF.loc[split == 'test', ['target']] = 'background-color: #FFFF00'
    return colorDF


# axis=None passes the whole (sliced) DataFrame to highlight_color at once
temp = fullDF.sort_index().loc[0:9, :].style.apply(lambda x: highlight_color(x, fullDFsplit[['split']]), axis=None)
temp.set_properties(**{'border-color': 'black',
                       'border': '1px solid black'})

Unnamed: 0,RM,LSTAT,PTRATIO,target
0,6.575,4.98,15.3,24.0
1,6.421,9.14,17.8,21.6
2,7.185,4.03,17.8,34.7
3,6.998,2.94,18.7,33.4
4,7.147,5.33,18.7,36.2
5,6.43,5.21,18.7,28.7
6,6.012,12.43,15.2,22.9
7,6.172,19.15,15.2,27.1
8,5.631,29.93,15.2,16.5
9,6.004,17.1,15.2,18.9


<h3>Train test split key</h3>

In [18]:
# Train test split key
temp = pd.DataFrame(data = [['X_train','X_test','y_train','y_test']]).T
temp

Unnamed: 0,0
0,X_train
1,X_test
2,y_train
3,y_test


In [19]:
def highlight_mini(s):
    '''
    Return a DataFrame of CSS strings that colors each row of the key
    with the color used for that piece of the split.
    '''

    # Start from empty styles, then fill in one color per row
    colorDF = pd.DataFrame('', index=s.index, columns=s.columns)

    # train features
    colorDF.loc[0, [0]] = 'background-color: #40E0D0'

    # test features
    colorDF.loc[1, [0]] = 'background-color: #00FFFF'

    # train target
    colorDF.loc[2, [0]] = 'background-color: #FFD700'

    # test target
    colorDF.loc[3, [0]] = 'background-color: #FFFF00'

    return colorDF


temp2 = temp.sort_index().style.apply(highlight_mini, axis=None)
temp2.set_properties(**{'border-color': 'black',
                        'border': '1px solid black'})

Unnamed: 0,0
0,X_train
1,X_test
2,y_train
3,y_test


After that, I was lazy and used PowerPoint to make the final graphic.