Welcome to the Neural Network Lab! You will apply your newly acquired knowledge of neural networks to a regression problem. You will work with the auto insurance company data from week 2, where your goal is to predict the **total claim amount** of a customer based on their features (income, gender, region, etc.).

In [19]:
### Steps to follow:

1. **Load the Data**  
   Load the auto insurance dataset. Ensure the dataset is clean and ready for analysis.

In [17]:
import pandas as pd

file_path = r"C:\Users\Gunay\Documents\GitHub\NN\marketing_customer_analysis_clean.csv"

df = pd.read_csv(file_path)
df

Unnamed: 0,unnamed:_0,customer,state,customer_lifetime_value,response,coverage,education,effective_to_date,employmentstatus,gender,...,number_of_policies,policy_type,policy,renew_offer_type,sales_channel,total_claim_amount,vehicle_class,vehicle_size,vehicle_type,month
0,0,DK49336,Arizona,4809.216960,No,Basic,College,2011-02-18,Employed,M,...,9,Corporate Auto,Corporate L3,Offer3,Agent,292.800000,Four-Door Car,Medsize,A,2
1,1,KX64629,California,2228.525238,No,Basic,College,2011-01-18,Unemployed,F,...,1,Personal Auto,Personal L3,Offer4,Call Center,744.924331,Four-Door Car,Medsize,A,1
2,2,LZ68649,Washington,14947.917300,No,Basic,Bachelor,2011-02-10,Employed,M,...,2,Personal Auto,Personal L3,Offer3,Call Center,480.000000,SUV,Medsize,A,2
3,3,XL78013,Oregon,22332.439460,Yes,Extended,College,2011-01-11,Employed,M,...,2,Corporate Auto,Corporate L3,Offer2,Branch,484.013411,Four-Door Car,Medsize,A,1
4,4,QA50777,Oregon,9025.067525,No,Premium,Bachelor,2011-01-17,Medical Leave,F,...,7,Personal Auto,Personal L2,Offer1,Branch,707.925645,Four-Door Car,Medsize,A,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10905,10905,FE99816,Nevada,15563.369440,No,Premium,Bachelor,2011-01-19,Unemployed,F,...,7,Personal Auto,Personal L1,Offer3,Web,1214.400000,Luxury Car,Medsize,A,1
10906,10906,KX53892,Oregon,5259.444853,No,Basic,College,2011-01-06,Employed,F,...,6,Personal Auto,Personal L3,Offer2,Branch,273.018929,Four-Door Car,Medsize,A,1
10907,10907,TL39050,Arizona,23893.304100,No,Extended,Bachelor,2011-02-06,Employed,F,...,2,Corporate Auto,Corporate L3,Offer1,Web,381.306996,Luxury SUV,Medsize,A,2
10908,10908,WA60547,California,11971.977650,No,Premium,College,2011-02-13,Employed,F,...,6,Personal Auto,Personal L1,Offer1,Branch,618.288849,SUV,Medsize,A,2


In [45]:
df.columns

Index(['unnamed:_0', 'customer', 'state', 'customer_lifetime_value',
       'response', 'coverage', 'education', 'effective_to_date',
       'employmentstatus', 'gender', 'income', 'location_code',
       'marital_status', 'monthly_premium_auto', 'months_since_last_claim',
       'months_since_policy_inception', 'number_of_open_complaints',
       'number_of_policies', 'policy_type', 'policy', 'renew_offer_type',
       'sales_channel', 'total_claim_amount', 'vehicle_class', 'vehicle_size',
       'vehicle_type', 'month'],
      dtype='object')

In [61]:
df["policy"].unique()

array(['Corporate L3', 'Personal L3', 'Personal L2', 'Corporate L2',
       'Personal L1', 'Special L1', 'Corporate L1', 'Special L3',
       'Special L2'], dtype=object)

In [63]:
df["gender"].unique()

array(['M', 'F'], dtype=object)

In [71]:
df["state"].unique()

array(['Arizona', 'California', 'Washington', 'Oregon', 'Nevada'],
      dtype=object)

In [75]:
df["vehicle_size"].unique()

array(['Medsize', 'Small', 'Large'], dtype=object)

In [77]:
df["education"].unique()

array(['College', 'Bachelor', 'High School or Below', 'Doctor', 'Master'],
      dtype=object)

In [73]:
df["policy_type"].unique()

array(['Corporate Auto', 'Personal Auto', 'Special Auto'], dtype=object)

In [133]:
df["coverage"].unique()

array(['Basic', 'Extended', 'Premium'], dtype=object)

In [83]:
df["marital_status"].unique()

array(['Married', 'Single', 'Divorced'], dtype=object)

In [25]:
# Check the shape of the dataset
print(f"Dataset Shape: {df.shape}")

Dataset Shape: (10910, 27)


In [27]:
# Check for missing values
print(df.isnull().sum())

unnamed:_0                       0
customer                         0
state                            0
customer_lifetime_value          0
response                         0
coverage                         0
education                        0
effective_to_date                0
employmentstatus                 0
gender                           0
income                           0
location_code                    0
marital_status                   0
monthly_premium_auto             0
months_since_last_claim          0
months_since_policy_inception    0
number_of_open_complaints        0
number_of_policies               0
policy_type                      0
policy                           0
renew_offer_type                 0
sales_channel                    0
total_claim_amount               0
vehicle_class                    0
vehicle_size                     0
vehicle_type                     0
month                            0
dtype: int64


In [31]:
# View column data types
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10910 entries, 0 to 10909
Data columns (total 27 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   unnamed:_0                     10910 non-null  int64  
 1   customer                       10910 non-null  object 
 2   state                          10910 non-null  object 
 3   customer_lifetime_value        10910 non-null  float64
 4   response                       10910 non-null  object 
 5   coverage                       10910 non-null  object 
 6   education                      10910 non-null  object 
 7   effective_to_date              10910 non-null  object 
 8   employmentstatus               10910 non-null  object 
 9   gender                         10910 non-null  object 
 10  income                         10910 non-null  int64  
 11  location_code                  10910 non-null  object 
 12  marital_status                 10910 non-null 

2. **Preprocessing**  
   Complete the following preprocessing steps:
   - **Split the data** into training and testing sets.
   - **Encode categorical variables** such as gender and region (use one-hot encoding or label encoding).
   - **Scale numerical variables** like income and claim amounts to ensure smooth model training (using techniques like standardization or normalization).
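
The three preprocessing steps above can also be expressed as a single scikit-learn transformer. The following is only a minimal sketch on a made-up toy frame, not the lab's required solution: the column subset, the values, and the names `toy` and `preprocess` are assumptions chosen for illustration. Fitting on the training split only and using `handle_unknown="ignore"` keeps the train and test matrices aligned even when a category appears in only one of them.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the insurance data (values are made up for illustration)
toy = pd.DataFrame({
    "income": [48000, 0, 22000, 61000, 35000, 79000, 15000, 52000],
    "gender": ["M", "F", "M", "M", "F", "F", "M", "F"],
    "state": ["Arizona", "California", "Washington", "Oregon",
              "Oregon", "Nevada", "Arizona", "California"],
    "total_claim_amount": [292.8, 744.9, 480.0, 484.0, 707.9, 1214.4, 273.0, 381.3],
})

# 1. Split first, so the scaler and encoder never see the test rows
X = toy.drop(columns=["total_claim_amount"])
y = toy["total_claim_amount"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

# 2. + 3. Scale numeric columns and one-hot encode categorical ones
preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), ["income"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["gender", "state"]),
    ],
    sparse_threshold=0,  # always return a dense NumPy array
)

X_tr_enc = preprocess.fit_transform(X_tr)  # learn statistics on train only
X_te_enc = preprocess.transform(X_te)      # reuse them on test
print(X_tr_enc.shape, X_te_enc.shape)
```

Because the same fitted transformer produces both matrices, no separate column-alignment step is needed afterwards.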

In [643]:
from sklearn.model_selection import train_test_split

# Features (X) and target (y) split
X = df.drop(columns=['total_claim_amount']) 
y = df['total_claim_amount']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set shape: {X_train.shape}")
print(f"Testing set shape: {X_test.shape}")

Training set shape: (8728, 26)
Testing set shape: (2182, 26)


In [645]:
import pandas as pd

# Categorical columns to one-hot encode
categorical_cols = [
    "state", "effective_to_date", "education", "gender", "vehicle_size",
    "coverage", "sales_channel", "response", "location_code",
    "vehicle_class", "employmentstatus", "marital_status",
]

# Apply one-hot encoding
X_train = pd.get_dummies(X_train, columns=categorical_cols, drop_first=True)
X_test = pd.get_dummies(X_test, columns=categorical_cols, drop_first=True)

# Ensure both datasets have the same columns after encoding
X_train, X_test = X_train.align(X_test, join='left', axis=1, fill_value=0)

In [647]:
# Check number of columns in X_train and X_test
print(f"Number of columns in X_train: {X_train.shape[1]}")
print(f"Number of columns in X_test: {X_test.shape[1]}")

Number of columns in X_train: 102
Number of columns in X_test: 102


In [649]:
from sklearn.preprocessing import StandardScaler

# Select numerical columns
numerical_cols = ["income", "customer_lifetime_value", "month", "number_of_policies", "monthly_premium_auto"]  

# Apply standardization
scaler = StandardScaler()
X_train[numerical_cols] = scaler.fit_transform(X_train[numerical_cols])
X_test[numerical_cols] = scaler.transform(X_test[numerical_cols])

In [653]:
# Check number of columns in X_train and X_test
print(f"Number of columns in X_train: {X_train.shape[1]}")
print(f"Number of columns in X_test: {X_test.shape[1]}")

Number of columns in X_train: 102
Number of columns in X_test: 102


3. **Create the Neural Network Class**  
   Implement a neural network model using `PyTorch`. The model should:
   - Take the customer features as input.
   - Output the predicted total claim amount.

In [655]:
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim

In [657]:
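# NOTE: `features` and `target` are not defined anywhere in this notebook, and the
# dtypes printed in the next cell (MedInc, HouseAge, ...) are not insurance columns.
# This call replaces the preprocessed insurance split from step 2, so every later
# cell works on that other dataset rather than on `total_claim_amount`.
X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=42)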

In [659]:
# Ensure all columns are numeric
X_train = X_train.apply(pd.to_numeric, errors='coerce')
X_test = X_test.apply(pd.to_numeric, errors='coerce')

# Now check data types again
print(X_train.dtypes)
print(X_test.dtypes)

MedInc        float64
HouseAge      float64
AveRooms      float64
AveBedrms     float64
Population    float64
AveOccup      float64
Latitude      float64
Longitude     float64
dtype: object
MedInc        float64
HouseAge      float64
AveRooms      float64
AveBedrms     float64
Population    float64
AveOccup      float64
Latitude      float64
Longitude     float64
dtype: object


In [661]:
import numpy as np
import torch

# Extract the feature values as NumPy arrays (no scaling is applied in this cell)
X_train_scaled = X_train.values
X_test_scaled = X_test.values

# Convert the features to float32 PyTorch tensors
X_train_tensor = torch.tensor(X_train_scaled, dtype=torch.float32)
X_test_tensor = torch.tensor(X_test_scaled, dtype=torch.float32)

# Convert the targets to NumPy arrays
y_train = np.array(y_train)
y_test = np.array(y_test)

# Convert the targets to tensors of shape (n_samples, 1) to match the model's output
y_train_tensor = torch.tensor(y_train, dtype=torch.float32).view(-1, 1)
y_test_tensor = torch.tensor(y_test, dtype=torch.float32).view(-1, 1)

In [663]:
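class InsuranceClaimModel(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        # A ReLU between the two linear layers makes the model nonlinear;
        # without it the stack collapses to a single linear map.
        self.network = nn.Sequential(
            nn.Linear(input_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, output_size),
        )

    def forward(self, x):
        # Predict the target from a batch of feature rows
        return self.network(x)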

In [665]:
import torch.nn as nn

class SimpleNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        # Two linear layers with no activation in between, so the model as a
        # whole is still a linear function of its input.
        self.network = nn.Sequential(
            nn.Linear(input_size, hidden_size),
            nn.Linear(hidden_size, output_size),
        )

    def forward(self, x):
        # Predict the target from a batch of feature rows
        return self.network(x)

In [667]:
input_size = X_train_tensor.shape[1]  # number of feature columns (8 for the tensors above)
hidden_size = 12
output_size = 1

model = SimpleNN(input_size, hidden_size, output_size)

4. **Train the Network**  
   Train the neural network on the training dataset. Use appropriate loss functions (e.g., Mean Squared Error) and optimizers (e.g., Adam, SGD). Tune hyperparameters like learning rate and epochs.
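
As one way to experiment with these hyperparameters, the sketch below trains a small network in mini-batches with `DataLoader`. It is an illustration under stated assumptions, not the lab's reference solution: the data is synthetic, and the layer sizes, learning rate, batch size, and epoch count are arbitrary starting points rather than tuned values.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Synthetic regression data standing in for the preprocessed features
X = torch.randn(256, 5)
true_w = torch.tensor([[1.5], [-2.0], [0.5], [0.0], [3.0]])
y = X @ true_w + 0.1 * torch.randn(256, 1)

# Mini-batches of 32 rows, reshuffled every epoch
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

net = nn.Sequential(nn.Linear(5, 16), nn.ReLU(), nn.Linear(16, 1))
criterion = nn.MSELoss()
optimizer = optim.Adam(net.parameters(), lr=0.01)

# Loss before any training, for reference
with torch.no_grad():
    initial_loss = criterion(net(X), y).item()

for epoch in range(50):
    net.train()
    for xb, yb in loader:
        optimizer.zero_grad()          # clear gradients from the previous step
        loss = criterion(net(xb), yb)  # forward pass on one mini-batch
        loss.backward()                # backpropagate
        optimizer.step()               # update the weights

# Loss on the full data after training
net.eval()
with torch.no_grad():
    final_loss = criterion(net(X), y).item()
print(f"initial MSE {initial_loss:.3f} -> final MSE {final_loss:.3f}")
```

Changing `lr`, `batch_size`, or the number of epochs and rerunning gives a quick feel for how each one affects convergence.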

In [670]:
criterion = nn.MSELoss()  # mean squared error for regression
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Train with full-batch gradient descent: every step uses the whole training set
epochs = 1000
for epoch in range(epochs):
    model.train()  # Set the model to training mode

    # Forward pass
    outputs = model(X_train_tensor)
    loss = criterion(outputs, y_train_tensor)

    # Backward pass and optimization
    optimizer.zero_grad()
    loss.backward()  
    optimizer.step() 

    if (epoch+1) % 50 == 0:
        print(f'Epoch [{epoch+1}/{epochs}], Loss: {loss.item():.4f}')

Epoch [50/1000], Loss: 2712.1086
Epoch [100/1000], Loss: 21.8489
Epoch [150/1000], Loss: 4.2302
Epoch [200/1000], Loss: 3.9131
Epoch [250/1000], Loss: 3.6844
Epoch [300/1000], Loss: 3.4578
Epoch [350/1000], Loss: 3.2396
Epoch [400/1000], Loss: 3.0340
Epoch [450/1000], Loss: 2.8439
Epoch [500/1000], Loss: 2.6711
Epoch [550/1000], Loss: 2.5161
Epoch [600/1000], Loss: 2.3789
Epoch [650/1000], Loss: 2.2590
Epoch [700/1000], Loss: 2.1552
Epoch [750/1000], Loss: 2.0662
Epoch [800/1000], Loss: 1.9907
Epoch [850/1000], Loss: 1.9270
Epoch [900/1000], Loss: 1.8737
Epoch [950/1000], Loss: 1.8293
Epoch [1000/1000], Loss: 1.7925


5. **Evaluate the Model**  
   Evaluate the neural network performance on the testing dataset. Use evaluation metrics like **Mean Squared Error (MSE)** or **R-squared** to assess model accuracy.
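
A possible shape for this step is sketched below. The helper name `evaluate_regression` and the demo model are hypothetical; the helper switches the model to evaluation mode, disables gradient tracking, and computes MSE and R-squared directly from the predictions.

```python
import torch
import torch.nn as nn

def evaluate_regression(model, X, y):
    """Return MSE and R-squared of a PyTorch regression model on (X, y)."""
    model.eval()           # evaluation mode (matters for dropout / batch norm)
    with torch.no_grad():  # no gradients are needed for evaluation
        pred = model(X)
    mse = torch.mean((pred - y) ** 2).item()
    ss_res = torch.sum((y - pred) ** 2)      # residual sum of squares
    ss_tot = torch.sum((y - y.mean()) ** 2)  # total sum of squares
    r2 = (1 - ss_res / ss_tot).item()
    return {"mse": mse, "r2": r2}

# Hypothetical demo: a linear layer with hand-set weights
torch.manual_seed(0)
demo_model = nn.Linear(2, 1)
with torch.no_grad():
    demo_model.weight.copy_(torch.tensor([[2.0, -1.0]]))
    demo_model.bias.fill_(0.5)

X_demo = torch.randn(100, 2)
y_exact = demo_model(X_demo).detach()          # targets the model fits exactly
y_noisy = y_exact + torch.randn_like(y_exact)  # targets with added noise

exact_metrics = evaluate_regression(demo_model, X_demo, y_exact)
noisy_metrics = evaluate_regression(demo_model, X_demo, y_noisy)
print(exact_metrics, noisy_metrics)
```

In this notebook the helper would be called with the trained network, `X_test_tensor`, and `y_test_tensor`.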

6. **Linear Regression Model**  
   - Train a simple **Linear Regression** model on the same dataset.
   - Evaluate the linear regression model using the same metrics as the neural network.

In [672]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

In [674]:
# Create a Linear Regression model (named `lin_reg` so it does not overwrite the neural network stored in `model`)
lin_reg = LinearRegression()

# Train the model using the training data
lin_reg.fit(X_train, y_train)

In [676]:
# Predict on training data
y_train_pred = lin_reg.predict(X_train)

# Predict on testing data
y_test_pred = lin_reg.predict(X_test)

In [678]:
# Calculate Mean Squared Error
mse = mean_squared_error(y_test, y_test_pred)
print(f'Mean Squared Error: {mse}')

# Calculate R-squared
r2 = r2_score(y_test, y_test_pred)
print(f'R-squared: {r2}')

Mean Squared Error: 0.5411287478470683
R-squared: 0.5910509795491354


7. **Compare the Performances**  
   - Compare the performance of the neural network and linear regression models.
   - Which model would you choose for this task? Justify your answer based on the performance metrics and potential trade-offs.
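
One way to make the comparison concrete is to score both models' test predictions with the same metrics and put them in one table. The sketch below assumes a hypothetical helper `compare_models` and uses made-up predictions; in the lab, the dictionary would hold the test-set predictions of the neural network and of the linear regression.

```python
import pandas as pd
from sklearn.metrics import mean_squared_error, r2_score

def compare_models(y_true, predictions):
    """Build a metrics table from a dict of {model name: predicted values}."""
    rows = {
        name: {
            "mse": mean_squared_error(y_true, y_pred),
            "r2": r2_score(y_true, y_pred),
        }
        for name, y_pred in predictions.items()
    }
    # One row per model, best (lowest) MSE first
    return pd.DataFrame.from_dict(rows, orient="index").sort_values("mse")

# Made-up targets and predictions for illustration only
y_true = [1.0, 2.0, 3.0, 4.0]
table = compare_models(y_true, {
    "mean_baseline": [2.5, 2.5, 2.5, 2.5],  # always predicts the mean
    "perfect": [1.0, 2.0, 3.0, 4.0],        # matches the targets exactly
})
print(table)
```

Including a mean-prediction baseline, as here, shows at a glance whether either model beats the trivial predictor (R-squared of 0).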

In [None]:
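# The comparison is incomplete: the network was never evaluated on the test set.
# Its training MSE after 1000 epochs (1.79) is still well above the linear regression's
# test MSE (0.54, R-squared 0.59), so on the evidence shown the linear regression is the
# better choice. Reducing the network's loss further should be considered.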