# Module 1.1: Machine Learning Workflow Fundamentals

## 🎯 Learning Objectives

By the end of this module, you will be able to:
1. **Load and explore** datasets using R's data manipulation functions
2. **Apply sampling techniques** for handling large datasets and class imbalance
3. **Preprocess data** including handling missing values and creating dummy variables
4. **Partition data** properly to avoid overfitting
5. **Build and evaluate** predictive models using best practices

---

## 📊 Why These Skills Matter in Business

Machine learning is transforming how businesses make decisions. But the **quality of your model depends heavily on how you prepare your data**. Consider these real-world scenarios:

| Business Problem | ML Skill Needed | This Module |
|------------------|-----------------|-------------|
| "Which customers will churn?" | Handling imbalanced classes | Part 2: Sampling |
| "How much is this property worth?" | Building regression models | Part 5: Modeling |
| "Is this transaction fraudulent?" | Proper train/test splits | Part 4: Partitioning |
| "What drives customer satisfaction?" | Feature engineering | Part 3: Preprocessing |

### The Data Science Workflow

```
┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   COLLECT   │ →  │   EXPLORE   │ →  │  PREPROCESS │ →  │   MODEL     │ →  │  EVALUATE   │
│    Data     │    │    Data     │    │    Data     │    │   Build     │    │   & Deploy  │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘
      ↑                                     │                                      │
      └─────────────────────────────────────┴──────────────────────────────────────┘
                                    Iterate and Improve
```

**Key Insight**: A commonly cited estimate is that data scientists spend **60-80% of their time** on data preparation (Parts 1-4). The modeling itself (Part 5) is often the easy part!

## Setup: Installing Required Packages

We'll use the `mlba` package (Machine Learning for Business Analytics) which contains datasets and helper functions.

In [8]:
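# Install mlba package if not already installed
if (!require(mlba)) {
  # Assumes devtools is available; if not, run install.packages("devtools") first
  devtools::install_github("gedeck/mlba/mlba", force=TRUE)
  library(mlba)  # load the freshly installed package
}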

# Disable scientific notation for easier reading
options(scipen=999)

Loading required package: mlba

Loading required package: caret

Loading required package: ggplot2

Loading required package: lattice

Loading required package: forecast

Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo 



---

## Part 1: Preliminary Steps

### Loading and Looking at the Data in R

The first step in any analysis is loading your data and understanding its structure. We'll use the **West Roxbury Housing** dataset, which contains property assessment data.

In [9]:
# Load data from mlba package
housing.df <- mlba::WestRoxbury

# Find the dimension of data frame (rows x columns)
dim(housing.df)

# Show the first six rows
head(housing.df)

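| | TOTAL.VALUE | TAX | LOT.SQFT | YR.BUILT | GROSS.AREA | LIVING.AREA | FLOORS | ROOMS | BEDROOMS | FULL.BATH | HALF.BATH | KITCHEN | FIREPLACE | REMODEL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<dbl>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<dbl>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<chr>` |
| 1 | 344.2 | 4330 | 9965 | 1880 | 2436 | 1352 | 2 | 6 | 3 | 1 | 1 | 1 | 0 | None |
| 2 | 412.6 | 5190 | 6590 | 1945 | 3108 | 1976 | 2 | 10 | 4 | 2 | 1 | 1 | 0 | Recent |
| 3 | 330.1 | 4152 | 7500 | 1890 | 2294 | 1371 | 2 | 8 | 4 | 1 | 1 | 1 | 0 | None |
| 4 | 498.6 | 6272 | 13773 | 1957 | 5032 | 2608 | 1 | 9 | 5 | 1 | 1 | 1 | 1 | None |
| 5 | 331.5 | 4170 | 5000 | 1910 | 2370 | 1438 | 2 | 7 | 3 | 2 | 0 | 1 | 0 | None |
| 6 | 337.4 | 4244 | 5142 | 1950 | 2124 | 1060 | 1 | 6 | 3 | 1 | 0 | 1 | 1 | Old |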


### 📋 Understanding the Data Structure

- **dim()**: Returns number of rows and columns
- **head()**: Shows first 6 rows (use `head(df, n)` for n rows)
- **View()**: Opens interactive data viewer (works in RStudio)

### Subsetting Data: Multiple Approaches

R provides flexible ways to access subsets of data using `[row, column]` notation.

In [10]:
# Practice showing different subsets of the data

# Show the first 10 rows of the first column only
housing.df[1:10, 1]

# Show the first 10 rows of ALL columns
housing.df[1:10, ]

# Show the fifth row of the first 10 columns
housing.df[5, 1:10]

# Show the fifth row of specific columns (1, 2, 4, 8, 9, 10)
housing.df[5, c(1:2, 4, 8:10)]

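| | TOTAL.VALUE | TAX | LOT.SQFT | YR.BUILT | GROSS.AREA | LIVING.AREA | FLOORS | ROOMS | BEDROOMS | FULL.BATH | HALF.BATH | KITCHEN | FIREPLACE | REMODEL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<dbl>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<dbl>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<chr>` |
| 1 | 344.2 | 4330 | 9965 | 1880 | 2436 | 1352 | 2 | 6 | 3 | 1 | 1 | 1 | 0 | None |
| 2 | 412.6 | 5190 | 6590 | 1945 | 3108 | 1976 | 2 | 10 | 4 | 2 | 1 | 1 | 0 | Recent |
| 3 | 330.1 | 4152 | 7500 | 1890 | 2294 | 1371 | 2 | 8 | 4 | 1 | 1 | 1 | 0 | None |
| 4 | 498.6 | 6272 | 13773 | 1957 | 5032 | 2608 | 1 | 9 | 5 | 1 | 1 | 1 | 1 | None |
| 5 | 331.5 | 4170 | 5000 | 1910 | 2370 | 1438 | 2 | 7 | 3 | 2 | 0 | 1 | 0 | None |
| 6 | 337.4 | 4244 | 5142 | 1950 | 2124 | 1060 | 1 | 6 | 3 | 1 | 0 | 1 | 1 | Old |
| 7 | 359.4 | 4521 | 5000 | 1954 | 3220 | 1916 | 2 | 7 | 3 | 1 | 1 | 1 | 0 | None |
| 8 | 320.4 | 4030 | 10000 | 1950 | 2208 | 1200 | 1 | 6 | 3 | 1 | 0 | 1 | 0 | None |
| 9 | 333.5 | 4195 | 6835 | 1958 | 2582 | 1092 | 1 | 5 | 3 | 1 | 0 | 1 | 1 | Recent |
| 10 | 409.4 | 5150 | 5093 | 1900 | 4818 | 2992 | 2 | 8 | 4 | 2 | 0 | 1 | 0 | None |

| | TOTAL.VALUE | TAX | LOT.SQFT | YR.BUILT | GROSS.AREA | LIVING.AREA | FLOORS | ROOMS | BEDROOMS | FULL.BATH |
|---|---|---|---|---|---|---|---|---|---|---|
| | `<dbl>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<dbl>` | `<int>` | `<int>` | `<int>` |
| 5 | 331.5 | 4170 | 5000 | 1910 | 2370 | 1438 | 2 | 7 | 3 | 2 |

| | TOTAL.VALUE | TAX | YR.BUILT | ROOMS | BEDROOMS | FULL.BATH |
|---|---|---|---|---|---|---|
| | `<dbl>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` |
| 5 | 331.5 | 4170 | 1910 | 7 | 3 | 2 |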


In [11]:
# Accessing columns by name using $ notation

# Show the first 10 values of the first column (TOTAL.VALUE)
housing.df$TOTAL.VALUE[1:10]

# Find the length of the column
length(housing.df$TOTAL.VALUE)

# Find the mean of the column
mean(housing.df$TOTAL.VALUE)

# Find summary statistics for ALL columns
summary(housing.df)

  TOTAL.VALUE          TAX           LOT.SQFT        YR.BUILT      GROSS.AREA  
 Min.   : 105.0   Min.   : 1320   Min.   :  997   Min.   :   0   Min.   : 821  
 1st Qu.: 325.1   1st Qu.: 4090   1st Qu.: 4772   1st Qu.:1920   1st Qu.:2347  
 Median : 375.9   Median : 4728   Median : 5683   Median :1935   Median :2700  
 Mean   : 392.7   Mean   : 4939   Mean   : 6278   Mean   :1937   Mean   :2925  
 3rd Qu.: 438.8   3rd Qu.: 5520   3rd Qu.: 7022   3rd Qu.:1955   3rd Qu.:3239  
 Max.   :1217.8   Max.   :15319   Max.   :46411   Max.   :2011   Max.   :8154  
  LIVING.AREA       FLOORS          ROOMS           BEDROOMS      FULL.BATH    
 Min.   : 504   Min.   :1.000   Min.   : 3.000   Min.   :1.00   Min.   :1.000  
 1st Qu.:1308   1st Qu.:1.000   1st Qu.: 6.000   1st Qu.:3.00   1st Qu.:1.000  
 Median :1548   Median :2.000   Median : 7.000   Median :3.00   Median :1.000  
 Mean   :1657   Mean   :1.684   Mean   : 6.995   Mean   :3.23   Mean   :1.297  
 3rd Qu.:1874   3rd Qu.:2.000   3rd Qu.:

### 📋 Key Subsetting Syntax

| Syntax | Description |
|--------|-------------|
| `df[1:10, ]` | First 10 rows, all columns |
| `df[, 1:5]` | All rows, first 5 columns |
| `df[5, 3]` | Single cell (row 5, column 3) |
| `df$column` | Access column by name |
| `df[, c(1,3,5)]` | Specific columns (1, 3, 5) |
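
Two details of `[row, column]` indexing often surprise newcomers: selecting a single column by position returns a plain vector unless `drop=FALSE` is given, and rows can be selected with a logical condition. The sketch below uses a small made-up data frame (`toy.df` is hypothetical, not part of `mlba`) so it runs without any packages.

```r
# Hypothetical toy data frame, used only for illustration
toy.df <- data.frame(
  value = c(344.2, 412.6, 330.1, 498.6),
  rooms = c(6, 10, 8, 9),
  remodel = c("None", "Recent", "None", "None")
)

# A single column selected by position collapses to a vector
v <- toy.df[, 1]
is.data.frame(v)

# drop=FALSE keeps the result as a one-column data frame
d <- toy.df[, 1, drop=FALSE]
is.data.frame(d)

# Logical row selection: rows with more than 7 rooms, two named columns
big <- toy.df[toy.df$rooms > 7, c("value", "rooms")]
big
```

Selecting columns by name, as in the last line, tends to be more robust than positions when the column order may change.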

---

## Part 2: Sampling from a Database

### Why Sampling Matters
- **Large datasets**: Can't always process millions of rows
- **Class imbalance**: Some categories may be underrepresented
- **Exploratory analysis**: Quick insights from representative samples

### Random Sampling

In [12]:
housing.df <- mlba::WestRoxbury

# Random sample of 5 observations
set.seed(42)  # For reproducibility
s <- sample(row.names(housing.df), 5)
housing.df[s,]

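| | TOTAL.VALUE | TAX | LOT.SQFT | YR.BUILT | GROSS.AREA | LIVING.AREA | FLOORS | ROOMS | BEDROOMS | FULL.BATH | HALF.BATH | KITCHEN | FIREPLACE | REMODEL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<dbl>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<dbl>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<chr>` |
| 2609 | 410.9 | 5169 | 4700 | 1968 | 2822 | 1586 | 2 | 12 | 3 | 2 | 1 | 1 | 1 | None |
| 4069 | 418.3 | 5262 | 7000 | 1912 | 4006 | 2430 | 2 | 8 | 4 | 1 | 1 | 1 | 0 | None |
| 2369 | 354.7 | 4462 | 5385 | 1925 | 2878 | 1248 | 1 | 6 | 2 | 1 | 0 | 1 | 1 | None |
| 5273 | 357.6 | 4498 | 5000 | 1930 | 2549 | 1632 | 2 | 7 | 3 | 1 | 0 | 1 | 1 | None |
| 1098 | 317.7 | 3996 | 4976 | 1948 | 1872 | 1248 | 2 | 6 | 3 | 1 | 0 | 1 | 0 | None |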


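### Weighted Sampling

Sometimes we want to **oversample** rare but important cases. For example, houses with more than 10 rooms might be rare but valuable for analysis. Giving such rows a larger sampling weight makes them more likely to be drawn. This is weighted sampling rather than true stratified sampling, which would draw a fixed number of rows from each group.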

In [13]:
# Oversample houses with over 10 rooms
# prob= takes relative weights, not probabilities: a large house gets
# 90 times the sampling weight of an ordinary one (0.9 vs 0.01)
set.seed(42)
s <- sample(row.names(housing.df), 5, prob=ifelse(housing.df$ROOMS>10, 0.9, 0.01))
housing.df[s,]

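| | TOTAL.VALUE | TAX | LOT.SQFT | YR.BUILT | GROSS.AREA | LIVING.AREA | FLOORS | ROOMS | BEDROOMS | FULL.BATH | HALF.BATH | KITCHEN | FIREPLACE | REMODEL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<dbl>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<dbl>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<chr>` |
| 4354 | 490.1 | 6165 | 6735 | 1915 | 4385 | 2439 | 2.0 | 7 | 3 | 1 | 1 | 1 | 1 | None |
| 4733 | 404.7 | 5091 | 6250 | 1930 | 3299 | 1802 | 2.0 | 7 | 3 | 1 | 0 | 1 | 0 | Recent |
| 1838 | 409.5 | 5151 | 4864 | 1950 | 3998 | 2176 | 1.5 | 11 | 4 | 2 | 1 | 2 | 1 | None |
| 2932 | 534.2 | 6720 | 11700 | 1880 | 4521 | 2642 | 2.0 | 10 | 4 | 1 | 1 | 1 | 1 | None |
| 3548 | 349.3 | 4394 | 5792 | 1890 | 3000 | 2032 | 2.0 | 12 | 3 | 2 | 0 | 1 | 0 | None |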
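
The values passed to `prob=` are relative weights; `sample()` rescales them internally. As a rough sketch, the hypothetical `rooms` vector below (made-up values, not the housing data) shows how much of the total weight the large houses receive, which is their chance of being picked on the first draw.

```r
# Hypothetical room counts: 2 large houses among 10
rooms <- c(6, 7, 12, 6, 8, 5, 11, 7, 6, 9)

# Same weighting rule as the cell above
w <- ifelse(rooms > 10, 0.9, 0.01)

# Rescale the weights so they sum to 1
p <- w / sum(w)

# Probability that the first draw is a large house
p.large <- sum(p[rooms > 10])
round(p.large, 3)
```

In the full dataset ordinary houses vastly outnumber large ones, so their small weights still add up. That is why the weighted sample above still includes houses with 7 and 10 rooms.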


### Rebalancing Classes with Upsampling

When one class is much smaller than others (imbalanced data), models may ignore the minority class. **Upsampling** duplicates minority class observations to balance the dataset.

In [14]:
# Check current class distribution
housing.df$REMODEL <- factor(housing.df$REMODEL)
cat("Original distribution:\n")
table(housing.df$REMODEL)

# Upsample to balance classes
upsampled.df <- caret::upSample(housing.df, housing.df$REMODEL, list=TRUE)$x
cat("\nAfter upsampling:\n")
table(upsampled.df$REMODEL)

Original distribution:



  None    Old Recent 
  4346    581    875 


After upsampling:



  None    Old Recent 
  4346   4346   4346 

### 📋 Interpreting the Rebalancing

- **Before upsampling**: Classes have different sizes (4346 / 581 / 875, imbalanced)
- **After upsampling**: All classes have 4346 rows, matching the largest class

**Business impact**: 
- Prevents model from ignoring rare but important cases
- Critical for fraud detection, churn prediction, rare disease diagnosis
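
Conceptually, upsampling draws rows with replacement from each smaller class until it matches the largest one. The base-R sketch below illustrates that idea on a hypothetical data frame. It is not `caret::upSample()`'s actual implementation, and `upsample_sketch` is a made-up helper name.

```r
# Hypothetical imbalanced data: 6 "None", 2 "Old", 1 "Recent"
toy.df <- data.frame(
  id = 1:9,
  remodel = factor(c(rep("None", 6), rep("Old", 2), "Recent"))
)

upsample_sketch <- function(df, class.col) {
  groups <- split(df, df[[class.col]])
  target <- max(sapply(groups, nrow))   # size of the largest class
  resampled <- lapply(groups, function(g) {
    # Keep every original row, then add random duplicates up to target
    extra <- sample(seq_len(nrow(g)), target - nrow(g), replace=TRUE)
    g[c(seq_len(nrow(g)), extra), , drop=FALSE]
  })
  do.call(rbind, resampled)
}

set.seed(42)
balanced.df <- upsample_sketch(toy.df, "remodel")
table(balanced.df$remodel)
```

In a real project, rebalancing is normally applied to the training partition only. Upsampling before the split would place copies of the same row in both training and holdout data and make the holdout metrics look better than they are.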

---

## Part 3: Preprocessing and Cleaning the Data

### Types of Variables

Understanding variable types is crucial:
- **Numeric**: Continuous values (price, age, temperature)
- **Factor/Categorical**: Discrete categories (color, region, yes/no)
- **Character**: Text strings

In [15]:
library(tidyverse)

# Get overview of data structure
housing.df <- mlba::WestRoxbury
str(housing.df)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ lubridate 1.9.4     ✔ tibble    3.3.0
✔ purrr     1.2.0     ✔ tidyr     1.3.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
✖ purrr::lift()   masks caret::lift()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to

'data.frame':	5802 obs. of  14 variables:
 $ TOTAL.VALUE: num  344 413 330 499 332 ...
 $ TAX        : int  4330 5190 4152 6272 4170 4244 4521 4030 4195 5150 ...
 $ LOT.SQFT   : int  9965 6590 7500 13773 5000 5142 5000 10000 6835 5093 ...
 $ YR.BUILT   : int  1880 1945 1890 1957 1910 1950 1954 1950 1958 1900 ...
 $ GROSS.AREA : int  2436 3108 2294 5032 2370 2124 3220 2208 2582 4818 ...
 $ LIVING.AREA: int  1352 1976 1371 2608 1438 1060 1916 1200 1092 2992 ...
 $ FLOORS     : num  2 2 2 1 2 1 2 1 1 2 ...
 $ ROOMS      : int  6 10 8 9 7 6 7 6 5 8 ...
 $ BEDROOMS   : int  3 4 4 5 3 3 3 3 3 4 ...
 $ FULL.BATH  : int  1 2 1 1 2 1 1 1 1 2 ...
 $ HALF.BATH  : int  1 1 1 1 0 0 1 0 0 0 ...
 $ KITCHEN    : int  1 1 1 1 1 1 1 1 1 1 ...
 $ FIREPLACE  : int  0 0 0 1 0 1 0 0 1 0 ...
 $ REMODEL    : chr  "None" "Recent" "None" "None" ...


In [16]:
# Make REMODEL a factor variable (categorical)
housing.df$REMODEL <- factor(housing.df$REMODEL)

# Check the result
str(housing.df$REMODEL)

# Show factor's categories (levels)
levels(housing.df$REMODEL)

 Factor w/ 3 levels "None","Old","Recent": 1 3 1 1 1 2 1 1 3 1 ...


### 📋 Interpreting `str()` Output

The `str()` function shows the **structure** of your data:
- `num`: Numeric variable (continuous)
- `int`: Integer variable (whole numbers)
- `chr`: Character variable (text strings)
- `Factor`: Categorical variable with defined levels

**Why factors matter**: R treats factors specially in statistical models, automatically creating appropriate contrasts and dummy variables.
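
To see what this automatic coding looks like, base R's `contrasts()` displays the 0/1 coding a factor will receive in a model, and `relevel()` changes which level acts as the reference. The vector below is hypothetical sample data, not the housing dataset.

```r
# Hypothetical REMODEL values
remodel <- factor(c("None", "Recent", "None", "Old", "None"))

# Default treatment coding: the first level ("None") is the reference,
# so it has no column of its own
contrasts(remodel)

# Make "Old" the reference category instead
remodel2 <- relevel(remodel, ref="Old")
levels(remodel2)
```

The choice of reference level does not change a linear model's predictions, only how its coefficients are read: each one is a difference from the reference category.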

### Converting Variables to Factors

When R imports data, it may not recognize categorical variables; `REMODEL`, for example, arrives as `chr`. We must convert such columns explicitly with `factor()`, as the cell above does.

### Using Tidyverse Pipes for Clean Code

The `%>%` (pipe) operator makes code more readable by chaining operations.

In [17]:
# Load and preprocess data in one statement
# The %>% operator passes the result as the first argument to the next function
housing.df <- mlba::WestRoxbury %>%
  mutate(REMODEL=factor(REMODEL))

str(housing.df$REMODEL)

 Factor w/ 3 levels "None","Old","Recent": 1 3 1 1 1 2 1 1 3 1 ...


### Handling Categorical Variables: Dummy Encoding

Many machine learning algorithms require **numeric inputs**. We convert categorical variables to **dummy variables** (0/1 indicators).

Example: `REMODEL` with levels [None, Old, Recent] becomes:
- `REMODEL_Old`: 1 if Old, 0 otherwise
- `REMODEL_Recent`: 1 if Recent, 0 otherwise
- (None is the reference category, indicated by both = 0)
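
The same encoding can be reproduced in base R with `model.matrix()`, shown here as a cross-check on a hypothetical data frame. Note that the column names it produces (`remodelOld`) differ from the `fastDummies` naming (`REMODEL_Old`).

```r
# Hypothetical data, one row per level plus a repeat of the reference
toy.df <- data.frame(remodel = factor(c("None", "Old", "Recent", "None")))

# model.matrix adds an intercept column; drop it to keep only the dummies
dummies <- model.matrix(~ remodel, data=toy.df)[, -1]
dummies
```

Rows whose level is `None` have 0 in both columns, which is exactly the reference-category pattern described above.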

In [19]:
library(fastDummies)
library(tidyverse)

housing.df <- dummy_cols(mlba::WestRoxbury,
                 remove_selected_columns=TRUE,  # remove the original column
                 remove_first_dummy=TRUE)       # removes the first dummy (reference)

# Show first 2 rows to see the dummy columns
housing.df %>% head(2)

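| | TOTAL.VALUE | TAX | LOT.SQFT | YR.BUILT | GROSS.AREA | LIVING.AREA | FLOORS | ROOMS | BEDROOMS | FULL.BATH | HALF.BATH | KITCHEN | FIREPLACE | REMODEL_Old | REMODEL_Recent |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<dbl>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<dbl>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` |
| 1 | 344.2 | 4330 | 9965 | 1880 | 2436 | 1352 | 2 | 6 | 3 | 1 | 1 | 1 | 0 | 0 | 0 |
| 2 | 412.6 | 5190 | 6590 | 1945 | 3108 | 1976 | 2 | 10 | 4 | 2 | 1 | 1 | 0 | 0 | 1 |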


### Handling Missing Values

Missing values (`NA`) can break analyses. Common strategies:
1. **Remove rows** with missing values (loses data)
2. **Impute** with mean, median, or mode (preserves data)
3. **Flag** missing values as a separate category
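
Strategy 3 is not demonstrated in the cells of this module, so here is a minimal base-R sketch on hypothetical data: record a 0/1 indicator of where values were missing before imputing, so a model can still use the fact that a value was absent. The column name `bedrooms.missing` is an illustrative choice.

```r
# Hypothetical bedroom counts with two missing entries
toy.df <- data.frame(bedrooms = c(3, NA, 4, 2, NA, 3))

# Flag first, so the information survives imputation
toy.df$bedrooms.missing <- as.integer(is.na(toy.df$bedrooms))

# Then impute with the median of the observed values
med <- median(toy.df$bedrooms, na.rm=TRUE)
toy.df$bedrooms[is.na(toy.df$bedrooms)] <- med

toy.df
```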

In [20]:
housing.df <- mlba::WestRoxbury

# Simulate missing data: convert some BEDROOMS entries to NA
set.seed(1)
rows.to.missing <- sample(row.names(housing.df), 10)
housing.df[rows.to.missing,]$BEDROOMS <- NA

# Check the result - now we have 10 NA's
cat("Summary with missing values:\n")
summary(housing.df$BEDROOMS)

Summary with missing values:


   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
   1.00    3.00    3.00    3.23    4.00    9.00      10 

In [21]:
# Impute missing values using the median
# use na.rm=TRUE to ignore NA when computing the median
housing.df <- housing.df %>%
  replace_na(list(BEDROOMS=median(housing.df$BEDROOMS, na.rm=TRUE)))

# Verify - no more NA's
cat("Summary after imputation:\n")
summary(housing.df$BEDROOMS)

Summary after imputation:


   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   3.000   3.000   3.229   4.000   9.000 

### 📋 Why Median for Imputation?

- **Mean**: Sensitive to outliers (a $10M house skews the average)
- **Median**: Robust to outliers (middle value is stable)
- **Mode**: Best for categorical variables

**Business rule**: Document your imputation strategy! Stakeholders need to know how missing data was handled.
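
A quick numeric check of the mean-versus-median point, using made-up home values (in thousands of dollars):

```r
# Hypothetical home values in thousands of dollars
values <- c(320, 340, 355, 370, 390)
with.outlier <- c(values, 10000)   # add one $10M property

c(mean=mean(values), median=median(values))
c(mean=mean(with.outlier), median=median(with.outlier))
```

With the single outlier added, the mean jumps from 355 to 1962.5 while the median only moves from 355 to 362.5.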

---

## Part 4: Predictive Power and Overfitting

### The Overfitting Problem

**Overfitting** occurs when a model learns the training data TOO well, including noise and random patterns that don't generalize.

| Training Accuracy | Test Accuracy | Diagnosis |
|-------------------|---------------|------------|
| 95% | 93% | Good generalization ✓ |
| 99% | 70% | Overfitting ✗ |
| 65% | 63% | Underfitting (too simple) |
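
The pattern in the table can be explored with a small simulation. The sketch below is an illustration on synthetic data (a noisy straight line), not the housing data; `fit_and_score` and the polynomial degrees are arbitrary choices made for the example.

```r
set.seed(1)

# Hypothetical data: a straight-line signal plus noise
n <- 40
x <- runif(n, 0, 10)
y <- 2 * x + rnorm(n, sd=3)
sim.df <- data.frame(x=x, y=y)

# Hold out every other row for testing
sim.train <- sim.df[seq(1, n, by=2), ]
sim.test  <- sim.df[seq(2, n, by=2), ]

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

fit_and_score <- function(degree) {
  fit <- lm(y ~ poly(x, degree), data=sim.train)
  c(degree=degree,
    train=rmse(sim.train$y, fitted(fit)),
    test=rmse(sim.test$y, predict(fit, newdata=sim.test)))
}

# A simple model vs. a very flexible one
rbind(fit_and_score(1), fit_and_score(10))
```

The training RMSE can only go down as the degree rises, because the flexible model contains the simple one as a special case. Whether the test RMSE rises depends on the sample, which is exactly why it has to be measured on rows the model has not seen.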

### Solution: Data Partitions

Split your data into:
- **Training set**: Build the model
- **Validation set**: Tune parameters
- **Holdout/Test set**: Final evaluation (NEVER used during training)

### Holdout Partition (60% Train / 40% Test)

In [22]:
housing.df <- mlba::WestRoxbury %>%
  mutate(REMODEL=factor(REMODEL))

# Set seed for reproducibility
set.seed(1)

# Randomly sample 60% of row IDs for training
train.rows <- sample(rownames(housing.df), nrow(housing.df)*0.6)

# Collect training rows
train.df <- housing.df[train.rows, ]

# Remaining 40% for holdout
holdout.rows <- setdiff(rownames(housing.df), train.rows)
holdout.df <- housing.df[holdout.rows, ]

cat("Training set:", nrow(train.df), "rows\n")
cat("Holdout set:", nrow(holdout.df), "rows\n")

Training set: 3481 rows
Holdout set: 2321 rows


### Three-Way Partition (50% Train / 30% Validation / 20% Test)

In [23]:
set.seed(1)

# 50% for training
train.rows <- sample(rownames(housing.df), nrow(housing.df)*0.5)

# 30% for validation (from remaining rows)
valid.rows <- sample(setdiff(rownames(housing.df), train.rows),
              nrow(housing.df)*0.3)

# Remaining 20% for holdout
holdout.rows <- setdiff(rownames(housing.df), union(train.rows, valid.rows))

# Create data frames
train.df <- housing.df[train.rows, ]
valid.df <- housing.df[valid.rows, ]
holdout.df <- housing.df[holdout.rows, ]

cat("Training set:", nrow(train.df), "rows\n")
cat("Validation set:", nrow(valid.df), "rows\n")
cat("Holdout set:", nrow(holdout.df), "rows\n")

Training set: 2901 rows
Validation set: 1740 rows
Holdout set: 1161 rows
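
A useful sanity check after any manual split is that the partitions are disjoint and together cover every row. The sketch below repeats the same sampling pattern on a hypothetical 100-row data frame and asserts both properties with `stopifnot()`. The sizes are rounded explicitly here, an addition to the pattern above, to avoid surprises from floating-point arithmetic.

```r
set.seed(1)
toy.df <- data.frame(id=1:100, y=rnorm(100))   # hypothetical data
ids <- rownames(toy.df)
n <- length(ids)

train.ids   <- sample(ids, round(n * 0.5))
valid.ids   <- sample(setdiff(ids, train.ids), round(n * 0.3))
holdout.ids <- setdiff(ids, union(train.ids, valid.ids))

# No row appears in two partitions
stopifnot(length(intersect(train.ids, valid.ids)) == 0,
          length(intersect(train.ids, holdout.ids)) == 0,
          length(intersect(valid.ids, holdout.ids)) == 0)

# Every row is used exactly once
stopifnot(setequal(c(train.ids, valid.ids, holdout.ids), ids))

c(train=length(train.ids), valid=length(valid.ids), holdout=length(holdout.ids))
```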


### Using caret for Stratified Partitioning

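For a numeric target such as `TOTAL.VALUE`, `caret::createDataPartition()` groups the values into quantile-based bins and samples within each bin, so the target has a similar distribution in each partition.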

In [None]:
set.seed(1)

# createDataPartition ensures stratified sampling based on TOTAL.VALUE
idx <- caret::createDataPartition(housing.df$TOTAL.VALUE, p=0.6, list=FALSE)
train.df <- housing.df[idx, ]
holdout.df <- housing.df[-idx, ]

cat("Training set:", nrow(train.df), "rows\n")
cat("Holdout set:", nrow(holdout.df), "rows\n")
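
For intuition, stratifying on a numeric target amounts to binning the values and sampling the same fraction inside every bin. The base-R sketch below illustrates that idea on simulated data. It is an approximation for teaching purposes, not `caret`'s implementation, and `stratified_idx` is a hypothetical helper.

```r
set.seed(1)
toy.df <- data.frame(y=rexp(200, rate=1/400))   # skewed, like house prices

stratified_idx <- function(y, p, bins=5) {
  # Assign each value to a quantile-based bin
  breaks <- quantile(y, probs=seq(0, 1, length.out=bins + 1))
  bin <- cut(y, breaks=breaks, include.lowest=TRUE)
  # Sample a fraction p of the row positions inside every bin
  picked <- lapply(split(seq_along(y), bin), function(rows) {
    rows[sample.int(length(rows), round(length(rows) * p))]
  })
  sort(unlist(picked, use.names=FALSE))
}

idx <- stratified_idx(toy.df$y, p=0.6)
strat.train <- toy.df[idx, , drop=FALSE]
strat.holdout <- toy.df[-idx, , drop=FALSE]

# The two partitions should have similar distributions
rbind(train=summary(strat.train$y), holdout=summary(strat.holdout$y))
```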

---

## Part 5: Building a Predictive Model

### Complete Modeling Workflow

Let's put it all together:
1. Load and preprocess data
2. Create train/test split
3. Build model on training data
4. Evaluate on holdout data
5. Make predictions on new data

In [24]:
library(tidyverse)
library(mlba)
library(fastDummies)

# Step 1: Load and preprocess data
housing.df <- mlba::WestRoxbury %>%
  drop_na() %>%                              # Remove rows with missing values
  select(-TAX) %>%                           # Remove TAX column
  mutate(REMODEL=factor(REMODEL)) %>%        # Convert to factor
  dummy_cols(select_columns=c('REMODEL'),    # Create dummy variables
             remove_selected_columns=TRUE, 
             remove_first_dummy=TRUE)

# Step 2: Create train/test split
set.seed(1)
idx <- caret::createDataPartition(housing.df$TOTAL.VALUE, p=0.6, list=FALSE)
train.df <- housing.df[idx, ]
holdout.df <- housing.df[-idx, ]

cat("Data prepared. Training:", nrow(train.df), "| Holdout:", nrow(holdout.df))

Data prepared. Training: 3483 | Holdout: 2319

### Step 1-2: Data Preparation and Partitioning

In [25]:
# Step 3: Build linear regression model
reg <- lm(TOTAL.VALUE ~ ., data=train.df)

# Training set results
train.res <- data.frame(
  actual=train.df$TOTAL.VALUE, 
  predicted=reg$fitted.values,
  residuals=reg$residuals
)

cat("Training set predictions (first 6 rows):\n")
head(train.res)

Training set predictions (first 6 rows):


| | actual | predicted | residuals |
|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | 344.2 | 384.4206 | -40.220638 |
| 4 | 498.6 | 546.4628 | -47.862759 |
| 5 | 331.5 | 347.917 | -16.417031 |
| 12 | 344.5 | 380.4297 | -35.929727 |
| 13 | 315.5 | 313.1879 | 2.312083 |
| 15 | 326.2 | 345.3751 | -19.175064 |


### 📋 Understanding the Preprocessing Pipeline

The code above demonstrates a **complete preprocessing chain**:

| Function | What It Does | Why |
|----------|--------------|-----|
| `drop_na()` | Removes rows with missing values | Ensures clean data |
| `select(-TAX)` | Removes the TAX column | TAX is a near-constant multiple of TOTAL.VALUE, so keeping it would leak the target |
| `mutate(REMODEL=factor(REMODEL))` | Converts the column type | Prepares for dummy encoding |
| `dummy_cols()` | Creates dummy variables | ML-ready numeric features |

### Step 3: Building the Model

The model fitted above with `lm()` is a **linear regression**, a standard starting point for predictive modeling. The next cell applies it to the holdout data:

In [26]:
# Step 4: Evaluate on holdout data
pred <- predict(reg, newdata=holdout.df)

holdout.res <- data.frame(
  actual=holdout.df$TOTAL.VALUE, 
  predicted=pred,
  residuals=holdout.df$TOTAL.VALUE - pred
)

cat("Holdout set predictions (first 6 rows):\n")
head(holdout.res)

Holdout set predictions (first 6 rows):


| | actual | predicted | residuals |
|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<dbl>` |
| 2 | 412.6 | 460.2777 | -47.677744 |
| 3 | 330.1 | 359.392 | -29.291958 |
| 6 | 337.4 | 290.0277 | 47.372303 |
| 7 | 359.4 | 402.5332 | -43.133242 |
| 8 | 320.4 | 314.0683 | 6.331652 |
| 9 | 333.5 | 339.8206 | -6.320582 |


### 📋 Understanding the Results Tables

Both the training and holdout tables show:
- **actual**: True property value from the data
- **predicted**: Model's estimate (fitted values on the training set, `predict()` output on the holdout set)
- **residuals**: actual - predicted (the error)

**Key insight**: Residuals should be small and scattered randomly around zero. Large residuals indicate properties the model struggles to value.

### Step 4: Evaluating on Holdout Data

The critical test is how the model performs on **data it has never seen**, which is what the holdout predictions above measure.

### Comparing Training vs Holdout Performance

Key metrics for regression:
- **ME (Mean Error)**: Average error (should be near 0)
- **RMSE (Root Mean Squared Error)**: Typical size of a prediction error, in the units of the target (here, thousands of dollars)
- **MAE (Mean Absolute Error)**: Average absolute deviation
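
Written out, with residual $e_i = \text{actual}_i - \text{predicted}_i$ over $n$ rows, the three metrics are $\text{ME} = \frac{1}{n}\sum e_i$, $\text{MAE} = \frac{1}{n}\sum |e_i|$ and $\text{RMSE} = \sqrt{\frac{1}{n}\sum e_i^2}$. The base-R sketch below computes them by hand for three hypothetical properties, as a cross-check on what `caret`'s `RMSE()` and `MAE()` report.

```r
# Worked example with three hypothetical properties (values in thousands)
actual    <- c(300, 500, 700)
predicted <- c(290, 500, 720)
err <- actual - predicted          # residuals: 10, 0, -20

me   <- mean(err)                  # signed errors can cancel out
mae  <- mean(abs(err))             # average size of the error
rmse <- sqrt(mean(err^2))          # penalizes large errors more heavily

round(c(ME=me, MAE=mae, RMSE=rmse), 3)
```

RMSE is never smaller than MAE. A large gap between the two suggests that a few big misses dominate the error.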

In [27]:
library(caret)

# Compute metrics on training set
cat("=== Training Set Metrics ===\n")
data.frame(
    ME = round(mean(train.res$residuals), 5),
    RMSE = RMSE(pred=train.res$predicted, obs=train.res$actual),
    MAE = MAE(pred=train.res$predicted, obs=train.res$actual)
)

=== Training Set Metrics ===


| ME | RMSE | MAE |
|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` |
| 0 | 42.14665 | 31.98717 |


In [28]:
# Compute metrics on holdout set
cat("=== Holdout Set Metrics ===\n")
data.frame(
    ME = round(mean(holdout.res$residuals), 5),
    RMSE = RMSE(pred=holdout.res$predicted, obs=holdout.res$actual),
    MAE = MAE(pred=holdout.res$predicted, obs=holdout.res$actual)
)

=== Holdout Set Metrics ===


| ME | RMSE | MAE |
|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` |
| -1.04237 | 43.90381 | 33.05476 |


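### 📋 Interpreting Model Performance

**Compare RMSE between training and holdout:**
- Similar values → Model generalizes well ✓
- Holdout much higher → Overfitting ✗
- Both high → Model is too simple (underfitting)

**RMSE interpretation**: "Our predictions are typically off by about [RMSE] from actual values." Because `TOTAL.VALUE` is in thousands of dollars, the holdout RMSE of 43.9 corresponds to roughly $43,900. (The plain average of the absolute errors is the MAE, not the RMSE.)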
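
Applying the first comparison to the two metric tables above takes one line of arithmetic. The threshold for "much higher" is a judgment call, so the snippet only computes the gap.

```r
# RMSE values reported in the two metric tables above
rmse.train   <- 42.14665
rmse.holdout <- 43.90381

# Relative increase from training to holdout, in percent
gap.pct <- (rmse.holdout / rmse.train - 1) * 100
round(gap.pct, 1)
```

The holdout RMSE is about 4.2% higher than the training RMSE, a small gap that is consistent with the model generalizing reasonably well.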

---

## Part 6: Making Predictions on New Data

Once your model is validated, you can use it to predict values for new observations.

In [29]:
# Create sample new data (from original dataset for demonstration)
housing.df <- mlba::WestRoxbury

# Drop column 1, TOTAL.VALUE (we're predicting it). TAX is still present,
# but predict() ignores columns that the model was not fitted on.
new.data <- housing.df[100:102, -1] %>%
  mutate(REMODEL=factor(REMODEL, levels=c("None", "Old", "Recent"))) %>%
  dummy_cols(select_columns=c('REMODEL'),
           remove_selected_columns=TRUE, 
           remove_first_dummy=TRUE)

cat("New data to predict:\n")
new.data

New data to predict:


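| TAX | LOT.SQFT | YR.BUILT | GROSS.AREA | LIVING.AREA | FLOORS | ROOMS | BEDROOMS | FULL.BATH | HALF.BATH | KITCHEN | FIREPLACE | REMODEL_Old | REMODEL_Recent |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<dbl>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` |
| 3818 | 4200 | 1960 | 2670 | 1710 | 2.0 | 10 | 4 | 1 | 1 | 1 | 1 | 0 | 0 |
| 3791 | 6444 | 1940 | 2886 | 1474 | 1.5 | 6 | 3 | 1 | 1 | 1 | 1 | 0 | 0 |
| 4275 | 5035 | 1925 | 3264 | 1523 | 1.0 | 6 | 2 | 1 | 0 | 1 | 0 | 0 | 1 |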


In [30]:
# Make predictions
pred <- predict(reg, newdata = new.data)

cat("\nPredicted property values:\n")
pred


Predicted property values:


### 📋 Interpreting Predictions on New Data

**What the output means:**
- Each predicted value is the model's estimate of `TOTAL.VALUE` for that property
- These predictions are based on the property characteristics (LIVING.AREA, BEDROOMS, etc.)
- You can now use these for pricing recommendations, risk assessment, or business decisions
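
For decisions such as risk assessment, a point estimate is often more useful with a range around it. `predict()` on an `lm` model can return prediction intervals via `interval="prediction"`. The sketch below uses R's built-in `mtcars` data as a stand-in for the housing model, so the variables (`mpg`, `wt`, `hp`) and the new rows are illustrative only.

```r
# Built-in mtcars data stands in for the housing data here
fit <- lm(mpg ~ wt + hp, data=mtcars)

# Two hypothetical new observations
new.cars <- data.frame(wt=c(2.5, 3.5), hp=c(100, 180))

# Point predictions plus 95% prediction intervals
pred.int <- predict(fit, newdata=new.cars, interval="prediction", level=0.95)
pred.int
```

The result has one row per new observation with columns `fit`, `lwr` and `upr`. The intervals rely on the usual linear-regression assumptions, so they are a guide rather than a guarantee.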

### 🏢 Production Considerations

When deploying models to production, consider:

| Factor | Question to Ask |
|--------|-----------------|
| **Monitoring** | How will you track prediction accuracy over time? |
| **Retraining** | When will you update the model with new data? |
| **Fallback** | What happens if the model fails or produces outliers? |
| **Explainability** | Can you explain why the model made a specific prediction? |
| **Bias** | Does the model treat all groups fairly? |
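
As one small piece of the fallback question, a scoring script can save the fitted model, reload it, and refuse to predict when a required predictor is missing. This is a minimal sketch under stated assumptions: the model is fitted on built-in `mtcars` data as a stand-in, the file path is temporary, and `incoming` is a hypothetical batch of new rows.

```r
fit <- lm(mpg ~ wt + hp, data=mtcars)   # stand-in model on built-in data

# Persist the fitted model and load it back, as a scoring job might
path <- tempfile(fileext=".rds")
saveRDS(fit, path)
loaded <- readRDS(path)

# Guard: refuse to score data that lacks a predictor the model needs
needed <- all.vars(delete.response(terms(loaded)))
incoming <- data.frame(wt=3.0, hp=120)
missing.cols <- setdiff(needed, names(incoming))
stopifnot(length(missing.cols) == 0)

predict(loaded, newdata=incoming)
```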

---

## Summary: Key Takeaways

### Data Exploration
| Task | Function |
|------|----------|
| Dimensions | `dim(df)` |
| Structure | `str(df)` |
| Summary stats | `summary(df)` |
| First rows | `head(df)` |

### Data Preprocessing
| Task | Function |
|------|----------|
| Create factor | `factor(x)` |
| Dummy variables | `dummy_cols()` |
| Handle NA | `replace_na()`, `drop_na()` |
| Rebalance classes | `caret::upSample()` |

### Model Building
| Task | Function |
|------|----------|
| Data partition | `caret::createDataPartition()` |
| Linear regression | `lm(y ~ ., data)` |
| Predict | `predict(model, newdata)` |
| Evaluate | `RMSE()`, `MAE()` |

### Best Practices
1. **Always set.seed()** for reproducibility
2. **Never judge a model on training data alone** - use the holdout set
3. **Document preprocessing steps** for stakeholders
4. **Check for overfitting** by comparing train vs test performance

---

**Next Steps**: Apply these techniques to your own datasets!