<a href="https://colab.research.google.com/github/pkro/tensorflow_cert_training/blob/main/colab_notebooks/01b_a_larger_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## A larger example

We'll use the medical insurance cost dataset published on [Kaggle](https://www.kaggle.com/) and load it from a CSV copy hosted on GitHub.

[Description](https://www.kaggle.com/datasets/mirichoi0218/insurance)

Dataset: https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/insurance.csv

### Preparing the dataset

In [None]:
import tensorflow as tf
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
insurance = pd.read_csv("https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/insurance.csv")
insurance

,age,sex,bmi,children,smoker,region,charges
0,19,female,27.900,0,yes,southwest,16884.92400
1,18,male,33.770,1,no,southeast,1725.55230
2,28,male,33.000,3,no,southeast,4449.46200
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.880,0,no,northwest,3866.85520
...,...,...,...,...,...,...,...
1333,50,male,30.970,3,no,northwest,10600.54830
1334,18,female,31.920,0,no,northeast,2205.98080
1335,18,female,36.850,0,no,southeast,1629.83350
1336,21,female,25.800,0,no,southwest,2007.94500


The dependent variable - the one we're trying to predict - is "charges".

The others are the independent variables (= predictors / features / covariates).


In [None]:
# Convert text columns into numbers / one-hot encode categorical text variables

# create a new dataframe with one-hot encoded sex
dummies = pd.get_dummies(insurance['sex'], prefix='sex')
dummies

,sex_female,sex_male
0,1,0
1,0,1
2,0,1
3,0,1
4,0,1
...,...,...
1333,0,1
1334,1,0
1335,1,0
1336,1,0
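
Depending on the pandas version, `get_dummies` may return boolean columns (`True`/`False`) rather than the 0/1 integers shown above; from pandas 2.0 onwards the default dtype is `bool`. A minimal sketch of requesting integers explicitly through the `dtype` parameter, using a small made-up frame (`toy`, not the real data):

```python
import pandas as pd

# Tiny made-up stand-in for the insurance data (illustration only)
toy = pd.DataFrame({"sex": ["female", "male", "male"]})

# dtype=int asks for 0/1 integer columns regardless of the pandas default
dummies_int = pd.get_dummies(toy["sex"], prefix="sex", dtype=int)
print(dummies_int)
```

Numeric 0/1 columns are usually the more convenient form when the frame is later converted to tensors.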


In [None]:
# Concat the data
insurance = pd.concat([insurance, dummies], axis='columns')
insurance

,age,sex,bmi,children,smoker,region,charges,sex_female,sex_male
0,19,female,27.900,0,yes,southwest,16884.92400,1,0
1,18,male,33.770,1,no,southeast,1725.55230,0,1
2,28,male,33.000,3,no,southeast,4449.46200,0,1
3,33,male,22.705,0,no,northwest,21984.47061,0,1
4,32,male,28.880,0,no,northwest,3866.85520,0,1
...,...,...,...,...,...,...,...,...,...
1333,50,male,30.970,3,no,northwest,10600.54830,0,1
1334,18,female,31.920,0,no,northeast,2205.98080,1,0
1335,18,female,36.850,0,no,southeast,1629.83350,1,0
1336,21,female,25.800,0,no,southwest,2007.94500,1,0


In [None]:
# delete the original text column
insurance = insurance.drop(['sex'], axis='columns') # returns a new dataframe (the next example drops in place instead)

In [None]:
# this can be done in one go too, here for "smoker" and "region"
# note the double brackets [[...]]: the inner list selects several columns
# when supplying multiple columns, pandas automatically adds prefixes based on
# the original columns' names
dummies = pd.get_dummies(insurance[['smoker', 'region']]) # get_dummies returns a new dataframe, so no .copy() is needed
dummies

,smoker_no,smoker_yes,region_northeast,region_northwest,region_southeast,region_southwest
0,0,1,0,0,0,1
1,1,0,0,0,1,0
2,1,0,0,0,1,0
3,1,0,0,1,0,0
4,1,0,0,1,0,0
...,...,...,...,...,...,...
1333,1,0,0,1,0,0
1334,1,0,1,0,0,0
1335,1,0,0,0,1,0
1336,1,0,0,0,0,1


In [None]:
# concat
insurance = pd.concat([insurance, dummies], axis="columns")
# and delete old columns
insurance.drop(['smoker', 'region'], axis="columns", inplace=True) # can be done inplace

insurance

,age,bmi,children,charges,sex_female,sex_male,smoker_no,smoker_yes,region_northeast,region_northwest,region_southeast,region_southwest
0,19,27.900,0,16884.92400,1,0,0,1,0,0,0,1
1,18,33.770,1,1725.55230,0,1,1,0,0,0,1,0
2,28,33.000,3,4449.46200,0,1,1,0,0,0,1,0
3,33,22.705,0,21984.47061,0,1,1,0,0,1,0,0
4,32,28.880,0,3866.85520,0,1,1,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
1333,50,30.970,3,10600.54830,0,1,1,0,0,1,0,0
1334,18,31.920,0,2205.98080,1,0,1,0,1,0,0,0
1335,18,36.850,0,1629.83350,1,0,1,0,0,0,1,0
1336,21,25.800,0,2007.94500,1,0,1,0,0,0,0,1
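
`get_dummies` also accepts a whole DataFrame. It then encodes every text column, leaves the numeric columns untouched and omits the original text columns from the result, which would replace the separate concat and drop steps above. A sketch on a small made-up frame (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Made-up rows with a mix of numeric and text columns (illustration only)
toy = pd.DataFrame({
    "age": [19, 18, 28],
    "sex": ["female", "male", "male"],
    "smoker": ["yes", "no", "no"],
    "charges": [100.0, 200.0, 300.0],
})

# Encode all text columns in one call; dtype=int gives 0/1 instead of booleans
encoded = pd.get_dummies(toy, dtype=int)
print(encoded.columns.tolist())
```

The step-by-step version above is still useful for seeing what happens to each column.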


### Build a regression model

In [None]:
# Create X and y values (features and labels)

# Split into training and test set

# Create and train model
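
The three steps outlined above could be filled in roughly as follows. This is only a sketch: it runs on a small synthetic frame (`data`, whose values and `charges` formula are invented for illustration) so that it is self-contained; in the notebook, the encoded `insurance` frame would take its place. The 80/20 split, the layer sizes, the learning rate and the epoch count are arbitrary assumptions, not tuned choices.

```python
import numpy as np
import pandas as pd
import tensorflow as tf

# Synthetic stand-in for the encoded insurance frame (illustration only)
rng = np.random.default_rng(42)
n = 200
data = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "bmi": rng.uniform(18.0, 40.0, n),
    "children": rng.integers(0, 4, n),
    "smoker_yes": rng.integers(0, 2, n),
})
data["charges"] = (250.0 * data["age"]
                   + 20000.0 * data["smoker_yes"]
                   + rng.normal(0.0, 1000.0, n))

# Create X and y values (features and labels)
X = data.drop("charges", axis="columns")
y = data["charges"]

# Split into training and test set (80/20) by sampling row indices
train_idx = X.sample(frac=0.8, random_state=42).index
X_train, y_train = X.loc[train_idx], y.loc[train_idx]
X_test, y_test = X.drop(train_idx), y.drop(train_idx)

# Keras expects numeric arrays, so convert the frames to float32
X_train_np = X_train.to_numpy(dtype="float32")
y_train_np = y_train.to_numpy(dtype="float32")
X_test_np = X_test.to_numpy(dtype="float32")
y_test_np = y_test.to_numpy(dtype="float32")

# Create and train model: one hidden layer, one output unit for regression
tf.random.set_seed(42)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(loss="mae",
              optimizer=tf.keras.optimizers.Adam(learning_rate=0.01),
              metrics=["mae"])
history = model.fit(X_train_np, y_train_np, epochs=5, verbose=0)

# Evaluate on the held-out rows
loss, mae = model.evaluate(X_test_np, y_test_np, verbose=0)
print(f"Test MAE: {mae:.2f}")
```

Mean absolute error is used here because it is in the same unit as `charges`, which makes the result easy to interpret; five epochs are far too few for a good fit and only keep the sketch quick to run.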