# Preliminary Steps

## Analysis of WestRoxbury.csv

Description of the variables in the West Roxbury (Boston) Home Value dataset:

- *TOTAL VALUE*: Total assessed value for property, in thousands of USD
- *TAX*: Tax bill amount based on total assessed value multiplied by the tax rate, in USD
- *LOT SQFT*: Total lot size of parcel (ft²)
- *YR BUILT*: Year the property was built
- *GROSS AREA*: Gross floor area
- *LIVING AREA*: Total living area for residential properties (ft²)
- *FLOORS*: Number of floors
- *ROOMS*: Total number of rooms
- *BEDROOMS*: Total number of bedrooms
- *FULL BATH*: Total number of full baths
- *HALF BATH*: Total number of half baths
- *KITCHEN*: Total number of kitchens
- *FIREPLACE*: Total number of fireplaces
- *REMODEL*: When the house was remodeled (recent/old/none)

Package imports and data load:

In [1]:
# Import required packages
import math
from typing import List, Tuple
import numpy as np
import pandas as pd

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

def print_values(l: List[Tuple]):
    """Receive a list of 2-tuples of the form (desc, vals) and print each pair."""
    for desc, vals in l:
        print(desc + ":")
        print(vals)
        print("\n")

def regression_summary(y_true: pd.DataFrame, y_pred: pd.DataFrame):
    """Print regression performance metrics.
    Function adapted from https://github.com/gedeck/dmba/blob/master/src/dmba/metric.py

    Input:
        y_true: actual values
        y_pred: predicted values
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    y_res = y_true - y_pred

    metrics = [
        ('Mean Error (ME)', sum(y_res) / len(y_res)),
        ('Root Mean Squared Error (RMSE)', math.sqrt(mean_squared_error(y_true, y_pred))),
        ('Mean Absolute Error (MAE)', sum(abs(y_res)) / len(y_res)),
    ]
    if all(yt != 0 for yt in y_true):
        metrics.extend([
            ('Mean Percentage Error (MPE)', 100 * sum(y_res / y_true) / len(y_res)),
            ('Mean Absolute Percentage Error (MAPE)', 100 * sum(abs(y_res / y_true) / len(y_res))),
        ])
    fmt1 = '{{:>{}}} : {{:.4f}}'.format(max(len(m[0]) for m in metrics))
    print('\nRegression statistics\n')
    for metric, value in metrics:
        print(fmt1.format(metric, value))

In [2]:
# Load data
housing_df = pd.read_csv("../datasets/WestRoxbury.csv")
info = [("DataFrame Dimension", housing_df.shape),
        ("First five rows", housing_df.head())]
print_values(info)

DataFrame Dimension:
(5802, 14)


First five rows:
   TOTAL VALUE    TAX  LOT SQFT   YR BUILT  GROSS AREA   LIVING AREA  FLOORS   \
0         344.2  4330       9965      1880         2436         1352      2.0   
1         412.6  5190       6590      1945         3108         1976      2.0   
2         330.1  4152       7500      1890         2294         1371      2.0   
3         498.6  6272      13773      1957         5032         2608      1.0   
4         331.5  4170       5000      1910         2370         1438      2.0   

   ROOMS  BEDROOMS   FULL BATH  HALF BATH  KITCHEN  FIREPLACE REMODEL  
0      6          3          1          1        1          0    None  
1     10          4          2          1        1          0  Recent  
2      8          4          1          1        1          0    None  
3      9          5          1          1        1          1    None  
4      7          3          2          0        1          0    None  




Renaming the columns:

In [3]:
housing_df.columns = [s.strip().replace(" ", "_").lower() for s in housing_df.columns]
print(housing_df.columns)

Index(['total_value', 'tax', 'lot_sqft', 'yr_built', 'gross_area',
       'living_area', 'floors', 'rooms', 'bedrooms', 'full_bath', 'half_bath',
       'kitchen', 'fireplace', 'remodel'],
      dtype='object')


Showing some rows with `loc` and `iloc`:

In [4]:
# loc[a:b] gives rows a to b, inclusive
info = list()
info.append(("loc", housing_df.loc[0:3]))

# iloc[a:b] gives rows a to b-1
info.append(("iloc", housing_df.iloc[0:4]))

print_values(info)

loc:
   total_value   tax  lot_sqft  yr_built  gross_area  living_area  floors  \
0        344.2  4330      9965      1880        2436         1352     2.0   
1        412.6  5190      6590      1945        3108         1976     2.0   
2        330.1  4152      7500      1890        2294         1371     2.0   
3        498.6  6272     13773      1957        5032         2608     1.0   

   rooms  bedrooms  full_bath  half_bath  kitchen  fireplace remodel  
0      6         3          1          1        1          0    None  
1     10         4          2          1        1          0  Recent  
2      8         4          1          1        1          0    None  
3      9         5          1          1        1          1    None  


iloc:
   total_value   tax  lot_sqft  yr_built  gross_area  living_area  floors  \
0        344.2  4330      9965      1880        2436         1352     2.0   
1        412.6  5190      6590      1945        3108         1976     2.0   
2        330.1 
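
The two calls above return the same rows only because the data frame has the default index 0, 1, 2, and so on. The following is a minimal sketch, using a hypothetical toy frame with a non-default index, of how `loc` (label-based, end inclusive) and `iloc` (position-based, end exclusive) diverge:

```python
import pandas as pd

# Hypothetical toy frame whose index labels are 10, 20, 30, 40
toy = pd.DataFrame({"total_value": [344.2, 412.6, 330.1, 498.6]},
                   index=[10, 20, 30, 40])

by_label = toy.loc[10:30]    # labels 10 through 30, end label included
by_position = toy.iloc[0:2]  # positions 0 and 1, end position excluded

print(by_label)
print(by_position)
```
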

Different ways to show the first 3 values for `total_value`:

In [5]:
print(housing_df["total_value"][0:3])
print(housing_df.iloc[0:3]["total_value"])
print(housing_df.iloc[0:3].total_value) # only when the column name does not have spaces

0    344.2
1    412.6
2    330.1
Name: total_value, dtype: float64
0    344.2
1    412.6
2    330.1
Name: total_value, dtype: float64
0    344.2
1    412.6
2    330.1
Name: total_value, dtype: float64


Show the third row (index 2) of the first ten columns:

In [6]:
print(housing_df.iloc[2][0:10])
print(housing_df.iloc[2, 0:10])
print(housing_df.iloc[2:3, 0:10]) # use a slice to return a data frame

total_value    330.1
tax             4152
lot_sqft        7500
yr_built        1890
gross_area      2294
living_area     1371
floors           2.0
rooms              8
bedrooms           4
full_bath          1
Name: 2, dtype: object
total_value    330.1
tax             4152
lot_sqft        7500
yr_built        1890
gross_area      2294
living_area     1371
floors           2.0
rooms              8
bedrooms           4
full_bath          1
Name: 2, dtype: object
   total_value   tax  lot_sqft  yr_built  gross_area  living_area  floors  \
2        330.1  4152      7500      1890        2294         1371     2.0   

   rooms  bedrooms  full_bath  
2      8         4          1  
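
The difference between the first two calls and the third is the return type. A small sketch on a hypothetical toy frame:

```python
import pandas as pd

# Hypothetical toy frame with two of the housing columns
toy = pd.DataFrame({"total_value": [344.2, 412.6, 330.1],
                    "tax": [4330, 5190, 4152]})

row_as_series = toy.iloc[2]   # a single integer position returns a Series
row_as_frame = toy.iloc[2:3]  # a slice returns a DataFrame with one row

print(type(row_as_series).__name__, row_as_series.shape)
print(type(row_as_frame).__name__, row_as_frame.shape)
```
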


Concatenating columns:

In [7]:
# Use pd.concat to combine non-consecutive columns into a new data frame.
# The axis argument specifies the dimension along which the
# concatenation happens, 0=rows, 1=columns.
pd.concat([housing_df.iloc[4:6, 0:2], housing_df.iloc[4:6, 4:6]],
          axis=1)

|   | total_value | tax | gross_area | living_area |
|---|---|---|---|---|
| 4 | 331.5 | 4170 | 2370 | 1438 |
| 5 | 337.4 | 4244 | 2124 | 1060 |


Specifying a full column:

In [8]:
(housing_df.iloc[:,0:1],
 housing_df.total_value)

(      total_value
 0           344.2
 1           412.6
 2           330.1
 3           498.6
 4           331.5
 ...           ...
 5797        404.8
 5798        407.9
 5799        406.5
 5800        308.7
 5801        447.6
 
 [5802 rows x 1 columns],
 0       344.2
 1       412.6
 2       330.1
 3       498.6
 4       331.5
         ...  
 5797    404.8
 5798    407.9
 5799    406.5
 5800    308.7
 5801    447.6
 Name: total_value, Length: 5802, dtype: float64)

Descriptive Statistics:

In [9]:
print("Number of rows: ", len(housing_df.total_value)) # show length of the total_value column
print("Mean of 'total_value': ", housing_df.total_value.mean()) # show mean of column
housing_df.describe() # show summary statistics for each column

Number of rows:  5802
Mean of 'total_value':  392.6857149258877


|   | total_value | tax | lot_sqft | yr_built | gross_area | living_area | floors | rooms | bedrooms | full_bath | half_bath | kitchen | fireplace |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 5802.0 | 5802.0 | 5802.0 | 5802.0 | 5802.0 | 5802.0 | 5802.0 | 5802.0 | 5802.0 | 5802.0 | 5802.0 | 5802.0 | 5802.0 |
| mean | 392.685715 | 4939.485867 | 6278.083764 | 1936.744916 | 2924.842123 | 1657.065322 | 1.68373 | 6.994829 | 3.230093 | 1.296794 | 0.613926 | 1.01534 | 0.739917 |
| std | 99.177414 | 1247.649118 | 2669.707974 | 35.98991 | 883.984726 | 540.456726 | 0.444884 | 1.437657 | 0.846607 | 0.52204 | 0.533839 | 0.12291 | 0.565108 |
| min | 105.0 | 1320.0 | 997.0 | 0.0 | 821.0 | 504.0 | 1.0 | 3.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 25% | 325.125 | 4089.5 | 4772.0 | 1920.0 | 2347.0 | 1308.0 | 1.0 | 6.0 | 3.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 50% | 375.9 | 4728.0 | 5683.0 | 1935.0 | 2700.0 | 1548.5 | 2.0 | 7.0 | 3.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 75% | 438.775 | 5519.5 | 7022.25 | 1955.0 | 3239.0 | 1873.75 | 2.0 | 8.0 | 4.0 | 2.0 | 1.0 | 1.0 | 1.0 |
| max | 1217.8 | 15319.0 | 46411.0 | 2011.0 | 8154.0 | 5289.0 | 3.0 | 14.0 | 9.0 | 5.0 | 3.0 | 2.0 | 4.0 |


## Sampling in `pandas`

In [10]:
# random sample of 5 observations
print(housing_df.sample(5))

# oversample houses with over 10 rooms
weights = [0.9 if rooms > 10 else 0.01 for rooms in housing_df.rooms]
print(housing_df.sample(5, weights=weights))

      total_value   tax  lot_sqft  yr_built  gross_area  living_area  floors  \
5655        389.6  4901      4851      1938        3020         1555     2.5   
5716        376.2  4732      6110      1940        2810         1768     2.0   
2319        304.3  3828      4615      1928        2224          925     1.0   
105         344.6  4335      8151      1983        4532         2056     1.0   
458         198.4  2495      3228      1920        1755         1006     1.5   

      rooms  bedrooms  full_bath  half_bath  kitchen  fireplace remodel  
5655      8         4          1          1        1          1     Old  
5716      8         4          1          1        1          1    None  
2319      5         2          1          0        1          1    None  
105       5         2          1          2        1          1    None  
458       6         3          1          0        1          0    None  
      total_value   tax  lot_sqft  yr_built  gross_area  living_area  floor
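
The weights passed to `sample` do not need to sum to 1, because pandas rescales them, so only their ratios matter. As a rough sketch on hypothetical toy data (2 large houses out of 100), this is how the weights 0.9 and 0.01 change the chance that a single draw is a large house:

```python
import pandas as pd

# Hypothetical toy data: 2 of 100 houses have more than 10 rooms
toy = pd.DataFrame({"rooms": [12, 11] + [6] * 98})

weights = [0.9 if rooms > 10 else 0.01 for rooms in toy.rooms]

# Probability that one weighted draw picks a house with more than 10 rooms
total = sum(weights)
prob_large = sum(w for w, r in zip(weights, toy.rooms) if r > 10) / total

print(f"Share of large houses in the data: {(toy.rooms > 10).mean():.2f}")
print(f"Chance that a single draw is a large house: {prob_large:.2f}")

# random_state is fixed here only to make the sketch reproducible
sample = toy.sample(5, weights=weights, random_state=1)
print(sample)
```
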

## Preprocessing and Cleaning the Data

In [11]:
info = list()

# Column names
info.append(("Columns", housing_df.columns))

# Remodel needs to be converted to a categorical variable
housing_df.remodel = housing_df.remodel.astype("category")

# Show the categories
info.append(("Categories", housing_df.remodel.cat.categories))
# Check type of converted variable
info.append(("Column type", housing_df.remodel.dtype))

print_values(info)

Columns:
Index(['total_value', 'tax', 'lot_sqft', 'yr_built', 'gross_area',
       'living_area', 'floors', 'rooms', 'bedrooms', 'full_bath', 'half_bath',
       'kitchen', 'fireplace', 'remodel'],
      dtype='object')


Categories:
Index(['None', 'Old', 'Recent'], dtype='object')


Column type:
category




To create binary dummies (indicators):

In [12]:
# Use drop_first=True to drop the first dummy variable
housing_df = pd.get_dummies(housing_df, prefix_sep="_", drop_first=True)
print(housing_df.columns)

Index(['total_value', 'tax', 'lot_sqft', 'yr_built', 'gross_area',
       'living_area', 'floors', 'rooms', 'bedrooms', 'full_bath', 'half_bath',
       'kitchen', 'fireplace', 'remodel_Old', 'remodel_Recent'],
      dtype='object')


In [13]:
print(housing_df.loc[:, ["remodel_Old", "remodel_Recent"]].head())

   remodel_Old  remodel_Recent
0            0               0
1            0               1
2            0               0
3            0               0
4            0               0
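
With `drop_first=True`, the first category (here `None`) becomes the reference level: a row with zeros in every remaining dummy column belongs to it. A minimal sketch on a hypothetical toy column:

```python
import pandas as pd

# Hypothetical toy column with the same three categories as `remodel`
toy = pd.DataFrame({"remodel": ["None", "Recent", "Old", "None"]})
toy["remodel"] = toy["remodel"].astype("category")

all_dummies = pd.get_dummies(toy, prefix_sep="_")
reduced = pd.get_dummies(toy, prefix_sep="_", drop_first=True)

print(list(all_dummies.columns))  # one column per category
print(list(reduced.columns))      # the first category is dropped
print(reduced)                    # rows 0 and 3 are zero (or False) everywhere
```

Depending on the pandas version, the dummy columns hold either 0/1 integers or booleans.
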


## Imputing Missing Data

In [14]:
# To illustrate missing data procedures, we first convert a few entries for
# bedrooms to NA’s. Then we impute these missing values using the median of the
# remaining values.

print("Number of rows with valid 'bedrooms' values before setting to NAN: ",
      housing_df.bedrooms.count())

# set 'bedrooms' to NaN in 10 random rows, then drop the rows with missing values
missing_rows = housing_df.sample(10).index
housing_df.loc[missing_rows, "bedrooms"] = np.nan
reduced_df = housing_df.dropna()
print("Number of rows after removing rows with missing values: ",
      len(reduced_df.bedrooms))

# replace the missing values using the median of the remaining values
median_bedrooms = housing_df.bedrooms.median()
housing_df.bedrooms = housing_df.bedrooms.fillna(value=median_bedrooms)
print("Number of rows with valid 'bedrooms' values after filling NA values: ",
      housing_df.bedrooms.count())

Number of rows with valid 'bedrooms' values before setting to NAN:  5802
Number of rows after removing rows with missing values:  5792
Number of rows with valid 'bedrooms' values after filling NA values:  5802
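
The trade-off between dropping and imputing can be seen on a small scale. The sketch below uses a hypothetical seven-row column with two missing entries; `median()` skips `NaN` by default, so the imputed value comes from the observed entries only:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with two missing entries
toy = pd.DataFrame({"bedrooms": [3, 4, np.nan, 2, 3, np.nan, 5]})

# Median of the five observed values (NaN entries are skipped)
median_bedrooms = toy.bedrooms.median()

dropped = toy.dropna()                                               # loses two rows
imputed = toy.assign(bedrooms=toy.bedrooms.fillna(median_bedrooms))  # keeps all rows

print("Median of observed values:", median_bedrooms)
print("Rows after dropna:", len(dropped))
print("Rows after imputation:", len(imputed))
```
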


## Normalizing and rescaling data

In [15]:
df = housing_df.copy()

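# Normalizing (standardizing) a data frame
# pandas (std() uses the sample standard deviation, ddof=1):
norm_df = (housing_df - housing_df.mean()) / housing_df.std()

# scikit-learn (StandardScaler uses the population standard deviation, ddof=0):
# the result of the transformation is a numpy array, so we convert it into a data frame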
scaler = StandardScaler()
norm_df = pd.DataFrame(scaler.fit_transform(housing_df),
                       index=housing_df.index,
                       columns=housing_df.columns)

# Rescaling a data frame
# pandas:
norm_df = (housing_df - housing_df.min()) / (housing_df.max() - housing_df.min())

# scikit-learn:
scaler = MinMaxScaler()
norm_df = pd.DataFrame(scaler.fit_transform(housing_df),
                       index=housing_df.index,
                       columns=housing_df.columns)
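
The pandas and scikit-learn standardizations above are not numerically identical: `DataFrame.std()` divides by n - 1 by default, while `StandardScaler` divides by n. A small sketch on a hypothetical five-row column makes the difference visible (with 5802 rows it is negligible):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy column
toy = pd.DataFrame({"living_area": [1352.0, 1976.0, 1371.0, 2608.0, 1438.0]})

z_sample = (toy - toy.mean()) / toy.std()            # sample std (ddof=1)
z_population = (toy - toy.mean()) / toy.std(ddof=0)  # population std (ddof=0)
z_sklearn = StandardScaler().fit_transform(toy)      # numpy array

print("ddof=0 matches StandardScaler:", np.allclose(z_population.to_numpy(), z_sklearn))
print("ddof=1 matches StandardScaler:", np.allclose(z_sample.to_numpy(), z_sklearn))
```
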

# Predictive Power and Overfitting

## Partitioning data into training, validation and test sets

In [16]:
# random_state is set to a defined value to get the same partitions when re-running
# the code
# training: 60 %
# validation (test): 40%

train_data, valid_data = train_test_split(housing_df, test_size=0.40, random_state=1)
print("Training:   ", train_data.shape)
print("Validation: ", valid_data.shape)

# training: 50%
# validation: 30%
# test: 20%
train_data, temp = train_test_split(housing_df, test_size=0.50, random_state=1)
valid_data, test_data = train_test_split(temp, test_size=0.40, random_state=1)
print("Training:   ", train_data.shape)
print("Validation: ", valid_data.shape)
print("Test:       ", test_data.shape)

Training:    (3481, 15)
Validation:  (2321, 15)
Training:    (2901, 15)
Validation:  (1740, 15)
Test:        (1161, 15)
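
The second split is applied to the remaining 50% of the records, so `test_size=0.40` there corresponds to 0.40 × 50% = 20% of the full data, leaving 30% for validation. A minimal sketch with 1000 stand-in records:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rows = np.arange(1000)  # stand-in for 1000 records

# First split: 50% training, 50% held back
train, temp = train_test_split(rows, test_size=0.50, random_state=1)
# Second split of the held-back half: 60% validation, 40% test
valid, test = train_test_split(temp, test_size=0.40, random_state=1)

for name, part in [("train", train), ("valid", valid), ("test", test)]:
    print(f"{name}: {len(part)} rows ({len(part) / len(rows):.0%})")
```
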


# Building a Predictive Model

We have already loaded the `West Roxbury` dataset and done some data manipulation. Let's now create a list of predictors and the outcome.

Note:

TAX might be a very good predictor of home value in a numerical sense, but would it be useful if we wanted to apply our model to homes whose assessed value might not be known? For this reason, we will exclude TAX from the analysis.

It is also useful to check for outliers that might be errors. Let's look at some of them:

 - Floors

In [17]:
housing_df.floors.value_counts()

2.0    3415
1.0    1505
1.5     773
2.5     105
3.0       4
Name: floors, dtype: int64

 - Rooms

In [18]:
housing_df.rooms.value_counts()

7     1769
6     1669
8      936
5      578
9      450
10     200
4       71
11      66
12      45
13      10
14       5
3        3
Name: rooms, dtype: int64

 - Bedrooms

In [19]:
housing_df.bedrooms.value_counts()

3.0    3243
4.0    1348
2.0     817
5.0     256
6.0      90
1.0      30
7.0      14
8.0       3
9.0       1
Name: bedrooms, dtype: int64

 - Full Bath

In [20]:
housing_df.full_bath.value_counts()

1    4249
2    1399
3     140
4      13
5       1
Name: full_bath, dtype: int64

 - Half Bath

In [21]:
housing_df.half_bath.value_counts()

1    3287
0    2378
2     136
3       1
Name: half_bath, dtype: int64

 - Kitchen

In [22]:
housing_df.kitchen.value_counts()

1    5713
2      89
Name: kitchen, dtype: int64

 - Fireplace

In [23]:
housing_df.fireplace.value_counts()

1    3658
0    1842
2     275
3      23
4       4
Name: fireplace, dtype: int64

Now creating the predictors and outcome:

In [24]:
# create list of predictors and outcome
exclude_cols = ("total_value", "tax")
predictors = [s for s in housing_df.columns if s not in exclude_cols]
outcome = "total_value"

Let's partition the data into training and validation sets:

In [25]:
# Partition data
X = housing_df[predictors]
y = housing_df[outcome]

train_X, valid_X, train_y, valid_y = train_test_split(X, y, test_size=0.40, random_state=1)

And train the model. In this case, it is a multiple linear regression: we want to predict the value of a house in `West Roxbury` on the basis of all the other predictors (except TAX).

In [26]:
model = LinearRegression()
model.fit(train_X, train_y)

LinearRegression()
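
After fitting, the estimated intercept and coefficients are available as `intercept_` and `coef_`. The sketch below uses hypothetical noise-free toy data (not the housing data), so the fit should recover the values used to generate it:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy predictors; y is an exact linear function of them
rng = np.random.default_rng(1)
X = pd.DataFrame({"x1": rng.normal(size=50), "x2": rng.normal(size=50)})
y = 10 + 2 * X["x1"] - 3 * X["x2"]

toy_model = LinearRegression().fit(X, y)

print("intercept:", round(toy_model.intercept_, 4))
print(pd.Series(toy_model.coef_, index=X.columns).round(4))
```
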

Model prediction on training set:

In [27]:
train_pred = model.predict(train_X)
train_results = pd.DataFrame({
    "total_value": train_y,
    "predicted": train_pred,
    "residual": train_y - train_pred,
})
train_results.head()

|   | total_value | predicted | residual |
|---|---|---|---|
| 2024 | 392.0 | 387.692778 | 4.307222 |
| 5140 | 476.3 | 430.840645 | 45.459355 |
| 5259 | 367.4 | 384.030436 | -16.630436 |
| 421 | 350.3 | 368.998307 | -18.698307 |
| 1401 | 348.1 | 315.004776 | 33.095224 |


And validation set:

In [28]:
valid_pred = model.predict(valid_X)
valid_results = pd.DataFrame({
    "total_value": valid_y,
    "predicted": valid_pred,
    "residual": valid_y - valid_pred,
})
valid_results.head()

|   | total_value | predicted | residual |
|---|---|---|---|
| 1822 | 462.0 | 406.878702 | 55.121298 |
| 1998 | 370.4 | 362.834535 | 7.565465 |
| 5126 | 407.4 | 390.278439 | 17.121561 |
| 808 | 316.1 | 382.470841 | -66.370841 |
| 4034 | 393.2 | 434.273851 | -41.073851 |


Let's look at the model's performance:

In [29]:
# training set
regression_summary(y_true=train_y, y_pred=train_pred)

# validation set
regression_summary(y_true=valid_y, y_pred=valid_pred)


Regression statistics

                      Mean Error (ME) : 0.0000
       Root Mean Squared Error (RMSE) : 43.0348
            Mean Absolute Error (MAE) : 32.6066
          Mean Percentage Error (MPE) : -1.1118
Mean Absolute Percentage Error (MAPE) : 8.4885

Regression statistics

                      Mean Error (ME) : -0.1483
       Root Mean Squared Error (RMSE) : 42.7208
            Mean Absolute Error (MAE) : 31.9569
          Mean Percentage Error (MPE) : -1.0895
Mean Absolute Percentage Error (MAPE) : 8.3256


The first metric is the mean error (ME), simply the average of the residuals (errors). In both cases, it is quite small relative to the units of TOTAL VALUE, indicating that, on balance, predictions average about right: our predictions are “unbiased”. (On the training set, an ME of zero is guaranteed for a least-squares regression with an intercept.) Of course, this simply means that the positive and negative errors balance out. It tells us nothing about how large these errors are.

The root-mean-squared error (RMSE) is more informative of the error magnitude: it takes the square root of the average squared error, so it gives an idea of the typical error (whether positive or negative) in the same scale as that used for the original outcome variable. The RMSE for the validation data (\$42.7K), which the model is seeing for the first time in making these predictions, is in the same range as for the training data (\$43.0K), which were used in training the model. Normally, we expect the validation set error to be higher than for the training set.
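
To make the contrast between ME and RMSE concrete, here is a small sketch with hypothetical residuals whose positive and negative errors cancel exactly. The ME is zero even though every individual prediction is off by at least 25 (thousand USD):

```python
import numpy as np

# Hypothetical residuals (actual minus predicted), in thousands of USD
residuals = np.array([40.0, -40.0, 25.0, -25.0])

me = residuals.mean()                    # positive and negative errors cancel
rmse = np.sqrt((residuals ** 2).mean())  # squaring removes the sign first
mae = np.abs(residuals).mean()

print(f"ME:   {me:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"MAE:  {mae:.2f}")
```
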

Simulating a model deployed for three new records:

In [30]:
new_data = pd.DataFrame({
    "lot_sqft": [4200, 6444, 5035],
    "yr_built": [1960, 1940, 1925],
    "gross_area": [2670, 2886, 3264],
    "living_area": [1710, 1474, 1523],
    "floors": [2.0, 1.5, 1.9],
    "rooms": [10, 6, 6],
    "bedrooms": [4, 3, 2],
    "full_bath": [1, 1, 1],
    "half_bath": [1, 1, 0],
    "kitchen": [1, 1, 1],
    "fireplace": [1, 1, 0],
    "remodel_Old": [0, 0, 0],
    "remodel_Recent": [0, 0, 1],
})

print(new_data)
print("Predictions: ", model.predict(new_data))

   lot_sqft  yr_built  gross_area  living_area  floors  rooms  bedrooms  \
0      4200      1960        2670         1710     2.0     10         4   
1      6444      1940        2886         1474     1.5      6         3   
2      5035      1925        3264         1523     1.9      6         2   

   full_bath  half_bath  kitchen  fireplace  remodel_Old  remodel_Recent  
0          1          1        1          1            0               0  
1          1          1        1          1            0               0  
2          1          0        1          0            0               1  
Predictions:  [384.45922888 378.09089303 385.8814195 ]


The model is applied to new data to predict TOTAL VALUE for homes where this value is unknown. Predicting the output value for new records is called scoring. For predictive tasks, scoring produces predicted numerical values. For classification tasks, scoring produces classes and/or propensities.
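
When scoring, the new records must contain the same predictor columns, in the same order, as the training data. One defensive pattern, sketched here on hypothetical toy data (the names `toy_X`, `toy_y` and `toy_new` are illustrative), is to reorder the new records by the training columns before calling `predict`:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy training data
toy_X = pd.DataFrame({"living_area": [1000, 1500, 2000, 2500],
                      "rooms": [5, 6, 8, 9]})
toy_y = pd.Series([250.0, 330.0, 420.0, 500.0])
scoring_model = LinearRegression().fit(toy_X, toy_y)

# New records arrive with the columns in a different order
toy_new = pd.DataFrame({"rooms": [7], "living_area": [1800]})

# Reorder to the column order used at training time before scoring
scored = scoring_model.predict(toy_new[toy_X.columns])
print(scored)
```
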