# IS 470 Lab 7: SVM and Neural Network

---

## Part 1. SVM and Neural Network for Numeric Prediction
<br>
In order for a health insurance company to make money, it needs to collect more in yearly premiums than it spends on medical care for its beneficiaries. As a result, insurers invest a great deal of time and money in developing models that accurately forecast medical expenses for the insured population.<br>
<br>
Medical expenses are difficult to estimate because the most costly conditions are rare and seemingly random. Still, some conditions are more prevalent for certain segments of the population. For instance, lung cancer is more likely among smokers than non-smokers, and heart disease may be more likely among the obese.<br>
<br>
The goal of this analysis is to use patient data to estimate the average medical care expenses for such population segments. These estimates can be used to create actuarial tables that set the price of yearly premiums higher or lower, depending on the expected treatment costs.<br>
<br>
The insurance data set has 1338 observations of 7 variables.
<br>
We will use this file to predict the medical expenses.
<br>
<br>
VARIABLE DESCRIPTIONS:<br>
age:	      age in years<br>
sex:	      gender<br>
bmi:	      body mass index<br>
children:	how many children do they have?<br>
smoker:	  do they smoke?<br>
region:	  geographic region<br>
expenses:	yearly medical expenses<br>

Target variable: **expenses**

### Upload and clean data

In [None]:
# Mounting Google Drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Import libraries
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

In [None]:
# Read data
insurance = pd.read_csv("/content/drive/MyDrive/IS470_data/insurance.csv")
insurance

In [None]:
# Show the first rows of the data frame
insurance.head()

In [None]:
# Examine variable type
insurance.dtypes

In [None]:
# Change categorical variables to "category"
insurance['sex'] = insurance['sex'].astype('category')
insurance['smoker'] = insurance['smoker'].astype('category')
insurance['region'] = insurance['region'].astype('category')

In [None]:
# Examine variable type
insurance.dtypes

In [None]:
# Data exploration: some examples
# Histogram of insurance expenses
snsplot = sns.histplot(x='expenses', data = insurance)
snsplot.set_title("Histogram of expenses in the insurance data set")

In [None]:
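# Explore relationships among all numeric variables: correlation matrix
# numeric_only=True skips the categorical columns, which pandas 2.x no longer drops automatically
insurance.corr(numeric_only=True)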

### Partition the data set
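Before the split, the categorical columns are converted to dummy variables with `drop_first=True`. As a minimal sketch of what that option does, the toy frame below uses hypothetical values (not rows from `insurance.csv`); the names `demo` and `demo_encoded` are illustrative only.

```python
import pandas as pd

# Toy frame with hypothetical values, for illustration only
demo = pd.DataFrame({
    "age": [19, 33, 45],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "northwest"],
})

# drop_first=True drops the first (alphabetically sorted) level of each
# encoded column, so a variable with k levels becomes k - 1 dummy columns
demo_encoded = pd.get_dummies(demo, columns=["smoker", "region"], drop_first=True)
print(list(demo_encoded.columns))
```

The dropped levels (here `no` and `northwest`) act as the baseline that the remaining dummy columns are measured against.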

In [None]:
# Create dummy variables
insurance = pd.get_dummies(insurance, columns=['sex','smoker','region'], drop_first=True)
insurance

In [None]:
# Partition the data
target = insurance['expenses']
predictors = insurance.drop(['expenses'],axis=1)
predictors_train_insurance, predictors_test_insurance, target_train_insurance, target_test_insurance = train_test_split(predictors, target, test_size=0.3, random_state=0)
print(predictors_train_insurance.shape, predictors_test_insurance.shape, target_train_insurance.shape, target_test_insurance.shape)

In [None]:
# Examine the distribution of target variable for training data set
snsplot = sns.histplot(data = target_train_insurance)
snsplot.set_title("Histogram of expenses in the training data set")

In [None]:
# Examine the distribution of target variable for testing data set
snsplot = sns.histplot(data = target_test_insurance)
snsplot.set_title("Histogram of expenses in the testing data set")

### SVM model
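The cells in this section follow a fit, predict, evaluate pattern for three values of `C`. The sketch below shows one possible shape of that pattern on synthetic data; the arrays `X_demo` and `y_demo` are made up for illustration and are not the insurance data, so the numbers it prints say nothing about the lab results.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Synthetic stand-in data (hypothetical, not the insurance file)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] + 2.0 * X_demo[:, 1] + rng.normal(0, 1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

results = {}
for c in (1.0, 10.0, 100.0):
    model = SVR(C=c)              # all other settings left at their defaults
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    mae = mean_absolute_error(y_test, pred)
    rmse = mean_squared_error(y_test, pred) ** 0.5  # RMSE is the square root of MSE
    results[c] = (mae, rmse)
    print(f"C={c}: MAE={mae:.2f}, RMSE={rmse:.2f}")
```

Taking the square root of `mean_squared_error` by hand avoids relying on keyword arguments that differ between scikit-learn versions.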

In [None]:
# Build an SVM model with default setting (C = 1.0)


In [None]:
# Make predictions on testing data


In [None]:
# Examine the evaluation results on testing data: MAE and RMSE


In [None]:
# Build an SVM model with C = 10.0


In [None]:
# Make predictions on testing data


In [None]:
# Examine the evaluation results on testing data: MAE and RMSE


In [None]:
# Build an SVM model with C = 100.0


In [None]:
# Make predictions on testing data


In [None]:
# Examine the evaluation results on testing data: MAE and RMSE


Q1. Which C value provides the best performance?<br>


Q2. How does the cost parameter C impact SVM model performance?<br>

Q3. Assume that you will lose each dollar your model’s prediction misses due to an over-estimation or under-estimation. Which evaluation metric should you use?<br>


Q4. Assume that the penalty for an erroneous prediction increases with the difference between the actual and predicted values. Which evaluation metric should you use?<br>
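
When thinking about Q3 and Q4, it may help to see how the two metrics respond to the same total error spread differently. The following is a small worked sketch with made-up numbers (the lists and helper names are hypothetical); scikit-learn's `mean_absolute_error` and the square root of `mean_squared_error` compute the same quantities.

```python
import math

def mae(actual, predicted):
    # Mean of the absolute errors
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    # Square root of the mean of the squared errors
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

actual = [100, 200, 300, 400]
pred_even = [110, 190, 310, 390]    # four errors of 10 each
pred_spike = [100, 200, 300, 440]   # one error of 40, the rest exact

mae_even, rmse_even = mae(actual, pred_even), rmse(actual, pred_even)
mae_spike, rmse_spike = mae(actual, pred_spike), rmse(actual, pred_spike)

print(mae_even, rmse_even)     # 10.0 10.0
print(mae_spike, rmse_spike)   # 10.0 20.0
```

Both prediction sets miss by 40 in total, so their MAE is identical, while the single large miss doubles the RMSE.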


### MLP model
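In scikit-learn, the hidden-layer layout is given as a tuple through `hidden_layer_sizes`, one entry per hidden layer. The sketch below illustrates this on synthetic data (`X_demo` and `y_demo` are made up, not the insurance data); `max_iter=2000` is an assumption added here so the toy model has room to converge and is not a setting the lab asks for.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Synthetic stand-in data (hypothetical, not the insurance file)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] + 2.0 * X_demo[:, 1] + rng.normal(0, 1, size=200)

# Two hidden layers: 16 nodes in the first, 8 in the second
mlp_demo = MLPRegressor(hidden_layer_sizes=(16, 8), random_state=1, max_iter=2000)
mlp_demo.fit(X_demo, y_demo)

# One weight matrix per connection between consecutive layers:
# input -> hidden 1, hidden 1 -> hidden 2, hidden 2 -> output
layer_shapes = [w.shape for w in mlp_demo.coefs_]
print(layer_shapes)
```

A three-layer layout would be written the same way with a three-entry tuple, for example `(8, 4, 4)`.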

In [None]:
# Build an MLP model with two hidden layers: 16 hidden nodes in the first layer and 8 hidden nodes in the second layer. Set random_state=1.


In [None]:
# Make predictions on testing data


In [None]:
# Examine the evaluation results on testing data: MAE and RMSE


In [None]:
# Build an MLP model with three hidden layers: 8 hidden nodes in the first layer, 4 hidden nodes in the second layer, and 4 hidden nodes in the third layer. Set random_state=1.


In [None]:
# Make predictions on testing data


In [None]:
# Examine the evaluation results on testing data: MAE and RMSE


## Part 2. SVM and Neural Network for Classification
<br>
This data set contains information about cars purchased at auction.
<br>
We will use this file to predict the quality of buying decisions and visualize decision processes.
<br>
<br>
VARIABLE DESCRIPTIONS:<br>
Auction: Auction provider at which the vehicle was purchased<br>
Color: Vehicle Color<br>
IsBadBuy: Identifies if the kicked vehicle was an avoidable purchase<br>
MMRCurrentAuctionAveragePrice: Acquisition price for this vehicle in average condition as of current day<br>
Size: The size category of the vehicle (Compact, SUV, etc.)<br>
TopThreeAmericanName: Identifies if the manufacturer is one of the top three American manufacturers<br>
VehBCost: Acquisition cost paid for the vehicle at time of purchase<br>
VehicleAge: Years elapsed since the vehicle's year of manufacture<br>
VehOdo: The vehicle's odometer reading<br>
WarrantyCost: Warranty price (term = 36 months and mileage = 36K)<br>
WheelType: The vehicle wheel type description (Alloy, Covers)<br>


Target variable: **IsBadBuy**

### Upload and clean data

In [None]:
# Import libraries
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from matplotlib import pyplot as plt
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import classification_report

In [None]:
# Read data
carAuction = pd.read_csv("/content/drive/MyDrive/IS470_data/carAuction.csv")
carAuction

In [None]:
# Show the first rows of the data frame
carAuction.head()

In [None]:
# Examine variable type
carAuction.dtypes

In [None]:
# Change categorical variables to "category"
carAuction['Auction'] = carAuction['Auction'].astype('category')
carAuction['Color'] = carAuction['Color'].astype('category')
carAuction['IsBadBuy'] = carAuction['IsBadBuy'].astype('category')
carAuction['Size'] = carAuction['Size'].astype('category')
carAuction['TopThreeAmericanName'] = carAuction['TopThreeAmericanName'].astype('category')
carAuction['WheelType'] = carAuction['WheelType'].astype('category')

In [None]:
# Examine variable type
carAuction.dtypes

### Partition the data set

In [None]:
# Create dummy variables
carAuction = pd.get_dummies(carAuction, columns=['Auction','Color','Size','TopThreeAmericanName','WheelType'], drop_first=True)
carAuction

In [None]:
# Partition the data
target = carAuction['IsBadBuy']
predictors = carAuction.drop(['IsBadBuy'],axis=1)
predictors_train_car, predictors_test_car, target_train_car, target_test_car = train_test_split(predictors, target, test_size=0.3, random_state=0)
print(predictors_train_car.shape, predictors_test_car.shape, target_train_car.shape, target_test_car.shape)

In [None]:
# Examine the proportion of target variable for training data set
print(target_train_car.value_counts(normalize=True))

In [None]:
# Examine the proportion of target variable for testing data set
print(target_test_car.value_counts(normalize=True))

### SVM model
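For classification, the same fit, predict, evaluate pattern applies, with a confusion matrix and a classification report in place of MAE and RMSE. The sketch below is one possible version on synthetic data generated with `make_classification` (a stand-in, not the carAuction data); it uses the `confusion_matrix` function, which returns the counts as an array rather than a plot.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, classification_report

# Synthetic two-class data (hypothetical, not the carAuction file)
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

svc_demo = SVC()                 # default setting, C = 1.0
svc_demo.fit(X_train, y_train)
pred = svc_demo.predict(X_test)

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_test, pred, labels=[0, 1])
print(cm)

# Accuracy plus per-class precision, recall, and f1-score
print(classification_report(y_test, pred))
```

Reading the per-class rows of the report matters when one class is much rarer than the other, because overall accuracy alone can hide poor recall on the rare class.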

In [None]:
# Build an SVM model with default setting (C = 1.0)


In [None]:
# Make predictions on testing data


In [None]:
# Examine the evaluation results on testing data: confusion_matrix


In [None]:
# Examine the evaluation results on testing data: accuracy, precision, recall, and f1-score


### MLP model
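`MLPClassifier` takes the same `hidden_layer_sizes` tuple as the regressor. As a rough sketch on synthetic data (again a stand-in for carAuction, with `max_iter=2000` added as an assumption so the toy model can converge), a two-layer network could be built and scored like this:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report

# Synthetic two-class data (hypothetical, not the carAuction file)
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# Two hidden layers: 16 nodes in the first, 10 in the second
mlp_clf_demo = MLPClassifier(hidden_layer_sizes=(16, 10), random_state=1, max_iter=2000)
mlp_clf_demo.fit(X_train, y_train)
pred = mlp_clf_demo.predict(X_test)

# Shapes of the weight matrices feeding the two hidden layers
hidden_shapes = [w.shape for w in mlp_clf_demo.coefs_[:2]]
print(hidden_shapes)

print(classification_report(y_test, pred))
test_accuracy = mlp_clf_demo.score(X_test, y_test)  # share of correct predictions
```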

In [None]:
# Build an MLP model with two hidden layers: 16 hidden nodes in the first layer and 10 hidden nodes in the second layer. Set random_state=1.


In [None]:
# Make predictions on testing data


In [None]:
# Examine the evaluation results on testing data: confusion_matrix


In [None]:
# Examine the evaluation results on testing data: accuracy, precision, recall, and f1-score


Q5. Which model has better performance on the carAuction data, SVM or neural network? Why?<br>


In [None]:
!jupyter nbconvert --to html "/content/drive/MyDrive/IS470_lab/IS470_lab7.ipynb"