<small>### Problem Statement
Estimate individual medical insurance expenses based on factors such as age, BMI, smoking status, and other health-related variables.

### Data Description
**Dataset Filename:** `data_insurance.csv`

#### Attributes Information:
1. **age:** Age of the primary policyholder.
2. **sex:** Gender of the insurance policyholder (female or male).
3. **bmi:** Body Mass Index, computed as weight divided by the square of height (kg/m²) and used to categorize body weight; ideally between 18.5 and 24.9.
4. **children:** Number of dependents covered by the health insurance.
5. **smoker:** Whether the person smokes (yes or no).
6. **region:** The geographical area in the U.S. where the policyholder resides (northeast, southeast, southwest, northwest).

**Target:**
- **charges:** Medical expenses billed to the insurance for the individual.</small>
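
As a quick illustration of the `bmi` attribute, the sketch below computes BMI from weight and height. The function name `body_mass_index` and the sample values are hypothetical; the dataset already ships the computed `bmi` column.

```python
def body_mass_index(weight_kg, height_m):
    """Return BMI in kg/m^2: weight divided by the square of height."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m ** 2

# Hypothetical person: 70 kg and 1.75 m tall
example_bmi = body_mass_index(70.0, 1.75)
print(round(example_bmi, 2))
```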

In [17]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import copy
import math
import random

#### <small> Reading the data
Here we read the data from the CSV file and store it in a pandas DataFrame.</small>

In [19]:
filepath = 'data_insurance.csv'

df = pd.read_csv(filepath)

###  <small>Explore the data
We print the first few rows of the dataset, its summary statistics, and the data type of each column.</small>

In [23]:
print("First few rows of the dataset:")
print(df.head())

print("\nSummary statistics:")
print(df.describe())

print("\nData types of each column:")
print(df.dtypes)

First few rows of the dataset:
   age     sex     bmi  children smoker     region      charges
0   19  female  27.900         0    yes  southwest  16884.92400
1   18    male  33.770         1     no  southeast   1725.55230
2   28    male  33.000         3     no  southeast   4449.46200
3   33    male  22.705         0     no  northwest  21984.47061
4   32    male  28.880         0     no  northwest   3866.85520

Summary statistics:
               age          bmi     children       charges
count  1338.000000  1338.000000  1338.000000   1338.000000
mean     39.207025    30.663397     1.094918  13270.422265
std      14.049960     6.098187     1.205493  12110.011237
min      18.000000    15.960000     0.000000   1121.873900
25%      27.000000    26.296250     0.000000   4740.287150
50%      39.000000    30.400000     1.000000   9382.033000
75%      51.000000    34.693750     2.000000  16639.912515
max      64.000000    53.130000     5.000000  63770.428010

Data types of each column:
age  

###  <small>Handling missing data
1. Option 1: Fill missing values with a specific value (e.g., mean)
2. Option 2: Drop rows/columns with missing values</small>

In [24]:
missing_values = df.isnull().sum()
print("\nMissing values in each column:")
print(missing_values)

if missing_values.sum() > 0:
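    # Fill numeric columns with their means; non-numeric columns are left unchanged
    df.fillna(df.mean(numeric_only=True), inplace=True)
    print("\nMissing values were found; numeric columns have been filled with their column means.")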

    #df.dropna(inplace=True)
    #print("\nMissing values were found and have been dropped.")
    
    print("\nMissing values after handling:")
    print(df.isnull().sum())
else:
    print("\nNo missing values found in the dataset.")


Missing values in each column:
age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64

No missing values found in the dataset.


### <small> Converting Categorical Variables
We convert non-numeric features like 'sex', 'smoker' and 'region' into numerical values using encoding techniques.

Since 'sex' and 'smoker' are binary, we use **label encoding** on them. \
'region' takes four values, so we use **one-hot encoding**, dropping the first category to avoid a redundant column.</small>

In [25]:
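# Label-encode the binary columns: male -> 1, female -> 0; smoker yes -> 1, no -> 0
df['sex'] = df['sex'].apply(lambda x: 1 if x == 'male' else 0)
df['smoker'] = df['smoker'].apply(lambda x: 1 if x == 'yes' else 0)

# One-hot encode 'region'; drop_first=True drops 'northeast', which becomes the baseline
df = pd.get_dummies(df, columns=['region'], drop_first=True)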
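
To make the two encodings concrete, here is a minimal sketch on a toy frame. The three rows are illustrative and not taken from the dataset, and `dtype=int` is an assumption added so the indicator columns print as 0/1 rather than booleans.

```python
import pandas as pd

# Toy frame with the same three categorical columns (illustrative values only)
toy = pd.DataFrame({
    "sex": ["female", "male", "male"],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "northeast"],
})

# Label encoding: map each binary category to 0 or 1
toy["sex"] = toy["sex"].map({"female": 0, "male": 1})
toy["smoker"] = toy["smoker"].map({"no": 0, "yes": 1})

# One-hot encoding: one indicator column per region, minus the first
# category in sorted order ('northeast'), which is implied when all are 0
toy_encoded = pd.get_dummies(toy, columns=["region"], drop_first=True, dtype=int)
print(toy_encoded)
```

A row with zeros in every `region_*` column belongs to the dropped baseline category.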

### <small> Feature Scaling
We standardize the features so that they share a common scale: zero mean and unit standard deviation.
This helps gradient descent converge faster.
For each feature, we subtract the mean and divide by the standard deviation.</small>

In [26]:
target_column = 'charges'
df_standardized = df.copy()

for column in df_standardized.columns:
    if column != target_column:
        mean = df_standardized[column].mean()
        std = df_standardized[column].std()
        df_standardized[column] = (df_standardized[column] - mean) / std

df_standardized.head()

         age        sex        bmi   children     smoker      charges  region_northwest  region_southeast  region_southwest
0  -1.438227  -1.010141  -0.453151  -0.908274    1.96985    16884.924         -0.566206         -0.611095          1.764821
1  -1.509401   0.989221   0.509431  -0.078738  -0.507273    1725.5523         -0.566206          1.635183         -0.566206
2  -0.797655   0.989221   0.383164   1.580335  -0.507273     4449.462         -0.566206          1.635183         -0.566206
3  -0.441782   0.989221  -1.305043  -0.908274  -0.507273  21984.47061          1.764821         -0.611095         -0.566206
4  -0.512957   0.989221  -0.292447  -0.908274  -0.507273    3866.8552          1.764821         -0.611095         -0.566206
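
The standardization step can be checked on a small array. The sketch below uses a hypothetical helper, `standardize`, and the five ages from the first rows printed earlier; it is a sanity check, not part of the original pipeline.

```python
import numpy as np

def standardize(values):
    """Return z-scores: subtract the mean, divide by the sample standard deviation."""
    values = np.asarray(values, dtype=float)
    # ddof=1 matches the default of pandas' Series.std(); NumPy's own default is ddof=0
    return (values - values.mean()) / values.std(ddof=1)

ages = [19, 18, 28, 33, 32]  # ages from the first five rows of the dataset
z = standardize(ages)

# After standardization the mean is (numerically) 0 and the sample std is 1
print(z.mean(), z.std(ddof=1))
```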


###  <small>Train-Test Split
We split the data into training and testing sets for evaluating the model's performance.</small>

In [29]:
from sklearn.model_selection import train_test_split

X = df_standardized.drop(columns=[target_column]).values
y = df_standardized[target_column].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

ModuleNotFoundError: No module named 'sklearn'
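
The import above fails when scikit-learn is not installed. One way around it, sketched here under that assumption, is a NumPy-only split. The helper name `train_test_split_numpy` and the demo arrays are hypothetical, and the selected rows will not match those chosen by scikit-learn's `random_state=42`.

```python
import numpy as np

def train_test_split_numpy(X, y, test_size=0.2, seed=42):
    """Shuffle row indices and split X and y into train and test parts."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    n_test = int(np.ceil(len(X) * test_size))  # test share, rounded up
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Demo on a tiny array: 10 rows, 2 features, with X[i, 0] == 2 * y[i]
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)
X_tr, X_te, y_tr, y_te = train_test_split_numpy(X_demo, y_demo)
print(X_tr.shape, X_te.shape)
```

With the notebook's data, the same call would be `train_test_split_numpy(X, y, test_size=0.2, seed=42)`.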

<small>### Gradient Descent Implementation
We implement gradient descent for linear regression.</small>
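
For reference, the quantities computed by `compute_cost` and `gradient_descent` below can be written as follows, where $X$ is the $m \times n$ design matrix including the intercept column, $\theta$ the parameter vector, and $\alpha$ the learning rate (the symbols are notation introduced here; the code uses `X`, `theta`, and `learning_rate`):

$$
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( x_i^\top \theta - y_i \right)^2
$$

$$
\nabla_\theta J(\theta) = \frac{1}{m} X^\top \left( X\theta - y \right)
$$

$$
\theta \leftarrow \theta - \alpha \, \nabla_\theta J(\theta)
$$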

In [221]:
def compute_cost(X, y, theta):
    m = len(y)
    predictions = X.dot(theta)
    cost = (1/(2*m)) * np.sum(np.square(predictions - y))
    return cost

def gradient_descent(X, y, theta, learning_rate, iterations):
    m = len(y)
    cost_history = []

    for i in range(iterations):
        predictions = X.dot(theta)
        errors = predictions - y
        gradients = (1/m) * X.T.dot(errors)
        theta = theta - learning_rate * gradients  # rebind instead of mutating the caller's array in place

        cost = compute_cost(X, y, theta)
        cost_history.append(cost)

        if i % 100 == 0:
            print(f"Iteration {i}: Cost {cost}")

    return theta, cost_history

def add_intercept(X):
    intercept = np.ones((X.shape[0], 1))
    return np.hstack((intercept, X))

# Prepare the data with an intercept
X_train_intercept = add_intercept(X_train)

# Initialize parameters
theta = np.zeros(X_train_intercept.shape[1])
iterations = 1000
learning_rate = 0.01

# Perform gradient descent
theta, cost_history = gradient_descent(X_train_intercept, y_train, theta, learning_rate, iterations)

theta
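
A useful sanity check for gradient descent is to compare its result with the closed-form least-squares solution. The sketch below does this on synthetic data (not the insurance data); `true_theta`, the noise level, the step size, and the iteration count are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic design matrix: intercept column plus 3 roughly standardized features
X_syn = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 3))])
true_theta = np.array([2.0, -1.0, 0.5, 3.0])  # illustrative ground truth
y_syn = X_syn @ true_theta + rng.normal(scale=0.1, size=200)

# Batch gradient descent with the same update rule as above
theta_gd = np.zeros(X_syn.shape[1])
for _ in range(5000):
    gradient = X_syn.T @ (X_syn @ theta_gd - y_syn) / len(y_syn)
    theta_gd = theta_gd - 0.05 * gradient

# Closed-form least-squares solution for comparison
theta_ls, *_ = np.linalg.lstsq(X_syn, y_syn, rcond=None)

print("max abs difference:", np.max(np.abs(theta_gd - theta_ls)))
```

If the two solutions disagree noticeably, the learning rate or the number of iterations is the first thing to revisit.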

<small>### Evaluating the Model
We evaluate the model by computing the same cost (half the mean squared error) on the test set.</small>

In [223]:
# Prepare test data with intercept
X_test_intercept = add_intercept(X_test)

# Compute cost on the test set
test_cost = compute_cost(X_test_intercept, y_test, theta)
print(f"Test set cost: {test_cost}")

Test set cost: 0.2640510469543897
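
The cost is half the mean squared error, so RMSE equals the square root of twice the cost. For a more interpretable summary, RMSE and R² could be reported alongside it. The sketch below uses a hypothetical helper, `regression_metrics`, on made-up numbers rather than the notebook's predictions.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (rmse, r2) for two 1-D arrays of equal length."""
    residuals = y_true - y_pred
    rmse = np.sqrt(np.mean(residuals ** 2))
    ss_res = np.sum(residuals ** 2)                   # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total variation
    return rmse, 1.0 - ss_res / ss_tot

# Made-up values: three exact predictions and one that is off by 2
y_true_demo = np.array([1.0, 2.0, 3.0, 4.0])
y_pred_demo = np.array([1.0, 2.0, 3.0, 6.0])
rmse_demo, r2_demo = regression_metrics(y_true_demo, y_pred_demo)
print(rmse_demo, r2_demo)
```

With the notebook's variables, the call would be `regression_metrics(y_test, X_test_intercept.dot(theta))`.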


<small>### Visualizing the Cost History
We plot the cost history to observe the convergence of gradient descent.</small>

In [224]:
# Use the length of cost_history so the x-axis always matches the recorded costs
plt.plot(range(len(cost_history)), cost_history, 'b-', linewidth=2)
plt.xlabel('Iterations')
plt.ylabel('Cost')
plt.title('Cost Function Convergence')
plt.show()