# **First Model**

_John Andrew Dixon_

---

**Setup**

In [13]:
# Import all necessary libraries

# For easily handling data
import pandas as pd

# For creating a train-test split
from sklearn.model_selection import train_test_split

# For scaling and one-hot encoding
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# For selecting and transforming columns
from sklearn.compose import make_column_selector, make_column_transformer

# For creating pipelines
from sklearn.pipeline import make_pipeline

In [14]:
# Remote URL to the dataset
url = "https://docs.google.com/spreadsheets/d/e/2PACX-1vT3R_qvIyzGAYylk0aIdTFxAtcxLdjBtfEGfSyAI1PfOnr0YpN_QjbbH1j5OScoYNcoyOjY_c9tQQ0H/pub?output=csv"
# Load the data and preview five random rows to verify it loaded
df = pd.read_csv(url)
df.sample(5)

| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 157 | 18 | male | 25.175 | 0 | yes | northeast | 15518.18025 |
| 1175 | 22 | female | 27.1 | 0 | no | southwest | 2154.361 |
| 894 | 62 | male | 32.11 | 0 | no | northeast | 13555.0049 |
| 862 | 55 | female | 33.535 | 2 | no | northwest | 12269.68865 |
| 1082 | 38 | male | 19.95 | 1 | no | northwest | 5855.9025 |


**Numeric**: `age`, `bmi`, `children`

**Ordinal**: None

**Nominal**: `sex`, `smoker`, `region`

---

## **Tasks**

> **Question**: _How well can the additional charges be predicted based on the age, sex, BMI, number of children, smoking habit, and region of the patient?_ 

### **Create a preprocessing object, such as a column transformer or pipeline, that will:**
* Ordinal encode any ordinal features
* One-hot encode any nominal features
* Scale any numeric features

In [15]:
# Create the feature matrix
X = df.drop(columns="charges")
# Create the target vector
y = df["charges"]

In [16]:
# Create the train-test split (80% train, 20% test, fixed seed for reproducibility)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [17]:
# Scaler for numeric features
scaler = StandardScaler()
# One-hot encoder for nominal features
one_hot_encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)

In [None]:
# Column selector for numeric features
numeric_selector = make_column_selector(dtype_include="number")
# Column selector for nominal features
nominal_selector = make_column_selector(dtype_include="object")

In [None]:
# Tuple for numeric features
numeric_tuple = (scaler, numeric_selector)
# Tuple for nominal features
nominal_tuple = (one_hot_encoder, nominal_selector)

In [None]:
# ColumnTransformer as a preprocessor object
preprocessor = make_column_transformer(
    numeric_tuple,
    nominal_tuple,
    remainder="passthrough",
    verbose_feature_names_out=False,
)

### **Instantiate a linear regression model**
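
A minimal sketch of this step, assuming scikit-learn's `LinearRegression` with its default settings. The variable name `lin_reg` is an assumption for illustration and does not come from this notebook.

```python
# For fitting an ordinary least squares linear model
from sklearn.linear_model import LinearRegression

# Instantiate the model with default settings (an intercept is fitted by default)
lin_reg = LinearRegression()
```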

### **Create a model pipeline with your preprocessor first and linear regression model last.**

### **Fit the modeling pipeline on the training data.**

### **Evaluate the model performance on both the training set and the test set using the R-squared score.**