# Prompt

Using a dataset of your own, explore the data utilizing multiple cross-validation techniques. Choose the most appropriate cross-validation technique for your data.

In your initial post, describe your data, state which cross-validation technique you used, and explain your rationale for deciding on which cross-validation technique was the most appropriate for your specific dataset.

# Imports

In [1]:
import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV, KFold
from sklearn.utils import shuffle
from sklearn.feature_selection import SequentialFeatureSelector
import plotly.express as px

np.random.seed(1234)

In [2]:
warnings.filterwarnings("ignore")
pd.set_option("display.max_columns", None)
mpl.rcParams.update({"axes.grid": True})

# Data Load

In [6]:
df = pd.read_csv("./data/housing.csv")
df.dropna(inplace=True)
# Drop rows at or above 500,000. Open question: are these values
# placeholders for missing data, or a hard cap applied to the data?
df = df.query("median_house_value < 500e3")
display(df.head())
display(df.info())

,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY


<class 'pandas.core.frame.DataFrame'>
Index: 19448 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           19448 non-null  float64
 1   latitude            19448 non-null  float64
 2   housing_median_age  19448 non-null  float64
 3   total_rooms         19448 non-null  float64
 4   total_bedrooms      19448 non-null  float64
 5   population          19448 non-null  float64
 6   households          19448 non-null  float64
 7   median_income       19448 non-null  float64
 8   median_house_value  19448 non-null  float64
 9   ocean_proximity     19448 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB


None

# Problem Statement and Model Setup

## Problem Statement

The dataset contains the target feature, median_house_value, and we want to build a model that predicts it from the other features.

## Model Setup

We know from a prior exercise that the ocean_proximity column was of limited value, and that a 2nd-order polynomial gave a good balance between training-set and development-set performance. Therefore, we will drop ocean_proximity and establish a pipeline of:
- Polynomial features
- Standardization
- Ridge Regression

The hyperparameter of interest is the alpha parameter (regularization strength) of the ridge regression.

We also saw that polynomial models (higher-order ones especially) performed worse unless the data was scaled before the polynomial feature transformation, so we will scale the data once up front.

Finally, we will randomly shuffle the data once and split it three ways into training, development, and test sets.

### Defining Target and Regression Features

In [8]:
target_feature = "median_house_value"
numeric_features = df.columns[df.dtypes != "object"].to_list()
numeric_features.remove(target_feature)

display(numeric_features)

['longitude',
 'latitude',
 'housing_median_age',
 'total_rooms',
 'total_bedrooms',
 'population',
 'households',
 'median_income']

### Standardize
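
The Model Setup notes say the data is scaled once up front, before the polynomial expansion. A minimal sketch of that step follows. It runs on a small synthetic frame rather than the housing file, and the names `demo`, `scaler`, and `scaled` are illustrative assumptions, not the notebook's own variables.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1234)

# Synthetic stand-in for two numeric housing columns (assumed ranges).
demo = pd.DataFrame(
    {
        "median_income": rng.uniform(0.5, 15.0, size=200),
        "total_rooms": rng.uniform(100.0, 8000.0, size=200),
    }
)

# Fit the scaler once and keep the column names and index intact.
scaler = StandardScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(demo), columns=demo.columns, index=demo.index
)

# Each column now has (approximately) zero mean and unit variance.
print(scaled.mean().round(6).to_dict())
print(scaled.std(ddof=0).round(6).to_dict())
```

Fitting the scaler on every row lets development and test rows influence the scaling statistics. A stricter variant would fit on the training rows only and reuse that scaler for the other two sets.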

### Shuffle and Split
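
The setup calls for one random shuffle followed by a three-way split. The sketch below shows one way to do that with `shuffle` and two calls to `train_test_split`. The 60/20/20 proportions and the toy arrays are assumptions, since the notebook does not state the split sizes.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

# Toy data: row i is [2i, 2i + 1] and its target is i, so alignment is easy to check.
X = np.arange(200).reshape(100, 2).astype(float)
y = np.arange(100, dtype=float)

# Shuffle features and target together, one time.
X_shuf, y_shuf = shuffle(X, y, random_state=1234)

# Carve off 20% as the test set, then 25% of the remainder as the
# development set, giving 60/20/20 overall (assumed proportions).
X_rest, X_test, y_rest, y_test = train_test_split(
    X_shuf, y_shuf, test_size=0.2, shuffle=False
)
X_train, X_dev, y_train, y_dev = train_test_split(
    X_rest, y_rest, test_size=0.25, shuffle=False
)

print(len(X_train), len(X_dev), len(X_test))
```

Passing `shuffle=False` to `train_test_split` keeps the single up-front shuffle as the only source of randomness.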

### Build Pipeline
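
The pipeline described under Model Setup (polynomial features, standardization, ridge regression) could be assembled roughly as follows. The helper name `make_poly_ridge`, the step names, and the synthetic data are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler


def make_poly_ridge(alpha=1.0, degree=2):
    """Hypothetical helper: polynomial expansion -> scaling -> ridge."""
    return Pipeline(
        [
            ("poly", PolynomialFeatures(degree=degree, include_bias=False)),
            ("scale", StandardScaler()),
            ("ridge", Ridge(alpha=alpha)),
        ]
    )


# Synthetic regression problem with one quadratic term.
rng = np.random.default_rng(1234)
X = rng.normal(size=(300, 3))
y = 2.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.1, size=300)

model = make_poly_ridge(alpha=1.0).fit(X, y)

# Number of columns produced by the degree-2 expansion of 3 inputs.
n_terms = model.named_steps["poly"].n_output_features_
print(n_terms)
```

Naming the final step `ridge` means the hyperparameter can later be addressed as `ridge__alpha` in a grid search. With the eight numeric features listed above, a degree-2 expansion without the bias column yields 44 terms.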

# Cross Validation with Holdout
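
As a rough illustration of holdout validation for the ridge alpha, the sketch below fits one model per candidate alpha on a training slice and scores it on a single development slice. The data is synthetic, and the alpha grid and slice sizes are assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(1234)
X = rng.normal(size=(400, 3))
y = 2.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.5, size=400)

# Rows are already in random order, so fixed slices act as the split.
X_train, y_train = X[:300], y[:300]
X_dev, y_dev = X[300:], y[300:]

alphas = [0.01, 0.1, 1.0, 10.0, 100.0]  # assumed candidate grid
dev_mse = {}
for alpha in alphas:
    model = Pipeline(
        [
            ("poly", PolynomialFeatures(degree=2, include_bias=False)),
            ("scale", StandardScaler()),
            ("ridge", Ridge(alpha=alpha)),
        ]
    )
    model.fit(X_train, y_train)
    # Score each candidate on the development set only.
    dev_mse[alpha] = mean_squared_error(y_dev, model.predict(X_dev))

best_alpha = min(dev_mse, key=dev_mse.get)
print(best_alpha, dev_mse[best_alpha])
```

Holdout needs only one fit per alpha, but the chosen alpha depends on which rows happened to land in the development slice.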

# Cross Validation with K-Fold
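
A minimal sketch of K-fold cross-validation over the ridge alpha, using the `GridSearchCV` and `KFold` classes imported at the top of the notebook. The five folds, the alpha grid, and the synthetic data are assumptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(1234)
X = rng.normal(size=(300, 3))
y = 2.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.5, size=300)

pipe = Pipeline(
    [
        ("poly", PolynomialFeatures(degree=2, include_bias=False)),
        ("scale", StandardScaler()),
        ("ridge", Ridge()),
    ]
)

alphas = [0.01, 0.1, 1.0, 10.0, 100.0]  # assumed candidate grid
search = GridSearchCV(
    estimator=pipe,
    param_grid={"ridge__alpha": alphas},
    scoring="neg_mean_squared_error",  # sklearn maximizes, hence the sign
    cv=KFold(n_splits=5, shuffle=True, random_state=1234),
)
search.fit(X, y)

best_alpha = search.best_params_["ridge__alpha"]
cv_mse = -search.best_score_  # flip the sign back to a plain MSE
print(best_alpha, cv_mse)
```

Every row serves as validation data exactly once, so the score for each alpha is averaged over five different splits rather than resting on a single one.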

# Cross Validation with Leave One Out
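
Leave-one-out is K-fold with one fold per row. The sketch below uses `LeaveOneOut` and `cross_val_score`, neither of which appears in the notebook's import cell, so treat both as assumptions about how this section might be implemented. The data is a small synthetic sample, and the fixed alpha is an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(1234)
X = rng.normal(size=(60, 2))
y = 2.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.5, size=60)

pipe = Pipeline(
    [
        ("poly", PolynomialFeatures(degree=2, include_bias=False)),
        ("scale", StandardScaler()),
        ("ridge", Ridge(alpha=1.0)),  # alpha fixed here for brevity
    ]
)

loo = LeaveOneOut()
# One fit per row; each score is the negated squared error on the held-out row.
scores = cross_val_score(pipe, X, y, cv=loo, scoring="neg_mean_squared_error")
loo_mse = -scores.mean()

print(len(scores), loo_mse)
```

The metric has to be defined for a single held-out row, which is why this uses squared error rather than R². On the 19,448 rows kept here, leave-one-out would need 19,448 fits for each alpha candidate.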

# Comparisons

# Conclusions