# Applied Artificial Intelligence - Lab 2

Luca van Straaten - 18073611

## Preparation

First I uninstalled anaconda by:
- removing the lines from my .config/fish/config.fish
- removing the path from fish
- `brew uninstall anaconda`

Then I installed miniconda by:

```fish
brew install miniconda
```

That was easy.

Uh, it did not install the newest version of conda, so I had to update it with:

```fish
conda update -n base -c defaults conda
```

**Maybe make a pull request to the brew formula? - [if I ever feel like it]™**

Then I recreated the environment with Python 3.10:

```fish
conda create -n aai_lab python=3.10
conda activate aai_lab
conda install tensorflow notebook pandas matplotlib numpy
conda install scikit-learn
conda install -c conda-forge nb_conda_kernels
```

**System information**: 2018 Intel Core i7 13-inch MacBookPro15,2, 16GB RAM, 512GB SSD, macOS Ventura 13.0 (22A380), kernel 22.1.0

This file, along with the rest of the labs, is tracked in a git repository on GitHub: [lab 2](https://github.com/lucanatorvs/Applied_Artificial_Intelligence_Lab/blob/main/2/lab2.ipynb)

## Exercise 1 - A quick overview of a machine learning project

The end goal of this lab is to train a linear regression model to make an insurance cost prediction. Write the necessary code and answer the questions below.

### 1

Use the CSV file from Blackboard, Med_insurance.csv [2], from last week. It contains medical information and insurance costs: 1338 rows of data with the columns age, gender, BMI, children, smoker, region and insurance charges. Read this CSV file with the pandas library into a variable called `insurance_data`.

In [None]:
import pandas as pd

insurance_data = pd.read_csv("../Med_insurance.csv")

### 2

Convert a categorical variable of your choice into dummy/indicator variables using the
pandas function pandas.get_dummies() and combine the result with the numerical columns
of the insurance_data


In [None]:
# dtype=int keeps the indicator column numeric (newer pandas versions default to bool)
insurance = pd.get_dummies(insurance_data, columns=["smoker"], drop_first=True, dtype=int)

# now only keep the numeric columns
insurance = insurance.select_dtypes(include=["number"])

# print the first 5 rows of the data
insurance.head()
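
To see what the encoding step does, here is a minimal sketch on a made-up three-row table. The column names mirror the lab's CSV, but the values are hypothetical and not taken from the real data.

```python
import pandas as pd

# Hypothetical toy data (values are made up for illustration).
toy = pd.DataFrame(
    {
        "age": [19, 33, 45],
        "smoker": ["yes", "no", "yes"],
        "region": ["southwest", "northwest", "southeast"],
    }
)

# drop_first=True keeps a single indicator column ("smoker_yes")
# instead of two redundant ones ("smoker_no" and "smoker_yes").
# dtype=int forces 0/1 integers instead of booleans.
encoded = pd.get_dummies(toy, columns=["smoker"], drop_first=True, dtype=int)

# Keeping only numeric columns drops the untouched text column "region".
numeric_only = encoded.select_dtypes(include=["number"])
print(numeric_only)
```

Only `age` and `smoker_yes` survive, so every other categorical column would have to be encoded the same way before it could be used as a feature.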

### 3

Create a test set which is 20 % of the whole data set using a pure random sampling approach.

**Question**: Why do you need a test set when training a model?

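**Answer**: To estimate how well the model generalizes. Evaluating on data the model has not seen during training reveals overfitting that the training error alone would hide.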


In [None]:
from sklearn.model_selection import train_test_split

# a fixed integer seed makes the random split reproducible
train_set, test_set = train_test_split(insurance, test_size=0.2, random_state=42)

# Print the number of rows in the train and test set
print("Number of rows in the train set: ", len(train_set))
print("Number of rows in the test set: ", len(test_set))
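
As a quick check of how the random split behaves, the sketch below runs `train_test_split` twice on a hypothetical 100-row frame (a stand-in, not the insurance data). Note that `random_state` expects the seed value itself, for example an integer.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 rows, 2 columns.
data = pd.DataFrame({"x": np.arange(100), "y": np.arange(100) * 2})

# The same integer seed gives the same shuffle on every call.
train_a, test_a = train_test_split(data, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(data, test_size=0.2, random_state=42)

print(len(train_a), len(test_a))  # 80 20
print(test_a.index.equals(test_b.index))  # True
```

With `random_state=None` the split would change on every run, which makes results hard to compare between notebook sessions.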

## Exercise 2 - Preparing the data for Machine Learning algorithms

In the upcoming steps you will prepare the data that will be used to train a machine learning model

### 1

If you found missing values in the data, add the missing entries for the respective column(s) using the imputer transform (including only numerical attributes) in the Scikit-Learn SimpleImputer class. Use “median” as strategy. Make sure to train the imputer only on the training set.

#### A

Impute the training set


In [None]:
# find the rows for missing data in the train set and print them and save the index
indexes = train_set[train_set.isnull().any(axis=1)].index
print(train_set.loc[indexes])

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")

# fit the imputer to the train set
imputer.fit(train_set)

# transform the train set (this returns a NumPy array)
X = imputer.transform(train_set)

# wrap the result in a DataFrame again, keeping the column names and index
train_set_fixed = pd.DataFrame(X, columns=train_set.columns, index=train_set.index)

print(train_set_fixed.loc[indexes])

# print the first 5 rows of the data
train_set_fixed.head()

#### B

We also need to impute the test data:

The imputer is not fitted again here. The medians learned from the training set are reused, so nothing from the test data influences the preprocessing.


In [None]:
# find the rows for missing data in the test_set and print them and save the index
indexes = test_set[test_set.isnull().any(axis=1)].index
print(test_set.loc[indexes])

# transform the test_set with the imputer that was fitted on the train set
X = imputer.transform(test_set)

# wrap the result in a DataFrame again, keeping the column names and index
test_set_fixed = pd.DataFrame(X, columns=test_set.columns, index=test_set.index)

print(test_set_fixed.loc[indexes])

# print the first 5 rows of the data
test_set_fixed.head()
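
The exercise asks to train the imputer only on the training set. A minimal sketch of that idea, using a hypothetical one-column `bmi` frame with made-up values, could look like this:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value in each set.
train = pd.DataFrame({"bmi": [20.0, 30.0, np.nan, 40.0]})
test = pd.DataFrame({"bmi": [np.nan, 25.0]})

imputer = SimpleImputer(strategy="median")

# fit() learns the median from the training data only (NaN is ignored).
imputer.fit(train)

# transform() fills the gaps in both sets with that training median.
train_filled = imputer.transform(train)
test_filled = imputer.transform(test)

print(imputer.statistics_)  # [30.]
print(test_filled.ravel())  # [30. 25.]
```

The missing test value becomes 30, the training median, and not 25, which is what a refit on the test set would have produced.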

### 2

Perform feature scaling on all numerical attributes
using Scikit transform StandardScaler. Again, fit the scaler on the training data only.

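**Question**: Explain what problem or problems the feature scaling resolves.

**Answer**: The features live on very different scales (for example age versus BMI versus a 0/1 smoker flag). Many algorithms, especially those based on distances or gradient descent, are then dominated by the large-valued features and converge slowly. Standardizing gives every feature zero mean and unit variance, so they contribute comparably. For a plain least-squares Linear Regression the predictions stay the same, but the coefficients become comparable between features.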

In [None]:
# Perform feature scaling on all numerical attributes
# using Scikit transform StandardScaler. Again, fit the scaler on the training data only.
# we will use the same fitted scaler for the test set, so we make a pipeline

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline(
    [
        ("imputer", SimpleImputer(strategy="median")),
        ("std_scaler", StandardScaler()),
    ]
)

# separate the features from the target, so "charges" does not leak into the inputs
train_features = train_set_fixed.drop(columns=["charges"])
test_features = test_set_fixed.drop(columns=["charges"])

# fit the pipeline on the train set and transform it
train_set_prepared = num_pipeline.fit_transform(train_features)

# only transform the test set, reusing the statistics learned from the train set
test_set_prepared = num_pipeline.transform(test_features)

# print the first 5 rows of the data
print(train_set_prepared[:5])
print(test_set_prepared[:5])
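
To illustrate what `StandardScaler` computes, here is a small sketch with hypothetical age and BMI values (made up, not from the CSV). The scaler is fitted on the training rows and only applied to the test row.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical rows: [age, bmi]
train = np.array([[20.0, 22.0], [40.0, 27.0], [60.0, 32.0]])
test = np.array([[40.0, 27.0]])

scaler = StandardScaler()

# fit_transform() learns mean and standard deviation per column from train.
train_scaled = scaler.fit_transform(train)

# transform() reuses those training statistics for the test row.
test_scaled = scaler.transform(test)

print(scaler.mean_)               # per-column training means
print(train_scaled.mean(axis=0))  # close to 0 for every column
print(train_scaled.std(axis=0))   # close to 1 for every column
print(test_scaled)                # the test row equals the training mean, so it maps to zeros
```

Each value is transformed as (x - mean) / std, using the mean and standard deviation of the training column.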


Save the preprocessed data into a variable `insurance_data_prepared`.


In [None]:
insurance_data_prepared = train_set_prepared

## Exercise 3 - Training a model

In the following steps you will select and train a machine learning model.

### 4

Now your data is ready to be used in training a machine learning model. Use it to train a Linear Regression model that can predict insurance charges. See the Training and Evaluation on the Training Set section on page 72 in the book [1].

**Question**: What is the difference between supervised and unsupervised learning? Is a Linear Regression supervised or unsupervised?

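**Answer**: Supervised learning trains on a dataset that includes the correct answer (label) for every example. Unsupervised learning works on unlabeled data and has to find structure on its own. Linear Regression is supervised: it is fitted on examples with known insurance charges.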

**Question**: What is the difference between regression and classification?

**Answer**: Regression predicts a continuous numeric value (such as insurance charges), while classification predicts a discrete class label (such as smoker or non-smoker).


In [None]:
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()

# fit the model to the train set
lin_reg.fit(insurance_data_prepared, train_set["charges"])

# print the intercept and coefficients
print("Intercept: ", lin_reg.intercept_)
print("Coefficients: ", lin_reg.coef_)
print("Number of coefficients: ", len(lin_reg.coef_))
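
Because the model is fitted on standardized features, the printed coefficients are expressed per standard deviation of each feature. The sketch below shows this on hypothetical noise-free data with known weights. The data, the weights and the use of `make_pipeline` are my own assumptions for illustration, not part of the lab.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: y = 3*x0 - 2*x1 + 5, without noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0

# Scaling and regression chained in one estimator.
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X, y)
predictions = model.predict(X)

# make_pipeline names each step after its lowercased class name.
scaler = model.named_steps["standardscaler"]
lin = model.named_steps["linearregression"]

# Dividing by the feature standard deviations converts the coefficients
# back to the original units, which recovers the weights 3 and -2.
original_unit_coef = lin.coef_ / scaler.scale_
print(original_unit_coef)
```

A large coefficient on a scaled feature therefore means "a large effect per standard deviation", which makes the features easier to compare with each other.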


### 5

Test your trained model with the test data and find out the root mean squared error and
mean absolute error.


In [None]:
# Test your trained model with the test data and find out the root mean squared error and
# mean absolute error.

from sklearn.metrics import mean_squared_error, mean_absolute_error
import numpy as np

insurance_data_predictions = lin_reg.predict(test_set_prepared)

lin_mse = mean_squared_error(test_set["charges"], insurance_data_predictions)
lin_rmse = np.sqrt(lin_mse)
lin_mae = mean_absolute_error(test_set["charges"], insurance_data_predictions)

print("Root Mean Squared Error: ", lin_rmse)
print("Mean Absolute Error: ", lin_mae)
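
To make the two metrics concrete, here is a small worked example with four hypothetical values (made up, not model output). It computes RMSE and MAE by hand and compares them with the scikit-learn functions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical true values and predictions.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 330.0, 400.0])

errors = y_pred - y_true  # [10, -10, 30, 0]

# RMSE: square the errors, average them, take the square root.
# sqrt((100 + 100 + 900 + 0) / 4) = sqrt(275)
rmse_manual = np.sqrt(np.mean(errors**2))

# MAE: average of the absolute errors, (10 + 10 + 30 + 0) / 4 = 12.5
mae_manual = np.mean(np.abs(errors))

rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))
mae_sklearn = mean_absolute_error(y_true, y_pred)

print(rmse_manual, rmse_sklearn)
print(mae_manual, mae_sklearn)
```

The RMSE (about 16.6) is larger than the MAE (12.5) because squaring gives the single error of 30 more weight. A big gap between the two usually points at a few large outlier errors.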