
**Baseline Model**

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.feature_selection import SelectKBest, f_regression

# Load the dataset
df = pd.read_csv('insurance.csv')

# Encode categorical variables
le = LabelEncoder()
df['sex'] = le.fit_transform(df['sex'])
df['smoker'] = le.fit_transform(df['smoker'])
df['region'] = le.fit_transform(df['region'])

df.head(10)

# Split the dataset into training and testing sets
X = df.drop('charges', axis=1)
y = df['charges']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Select the top k features using F-test
selector = SelectKBest(f_regression, k=4)
X_train_selected = selector.fit_transform(X_train_scaled, y_train)
X_test_selected = selector.transform(X_test_scaled)

# Train a linear regression model
lr = LinearRegression()
lr.fit(X_train_selected, y_train)

# Predict on the test set
y_pred = lr.predict(X_test_selected)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
print(f"RMSE: {rmse:.2f}")
print(f"R^2: {r2:.2f}")

RMSE: 5829.38
R^2: 0.78


**Final Model**

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Load the dataset
df = pd.read_csv('insurance.csv')

# Encode categorical variables
le = LabelEncoder()
df['sex'] = le.fit_transform(df['sex'])
df['smoker'] = le.fit_transform(df['smoker'])
df['region'] = le.fit_transform(df['region'])

# Split the dataset into training and testing sets
X = df.drop('charges', axis=1)
y = df['charges']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train a random forest regressor
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train_scaled, y_train)

# Predict on the test set
y_pred = rf.predict(X_test_scaled)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
print(f"RMSE: {rmse:.2f}")
print(f"R^2: {r2:.2f}")

RMSE: 4571.50
R^2: 0.87


**Report**

**Task Description:**

The task is to build a predictive model that estimates an individual's medical insurance charges from several features. The dataset records each individual's age, sex, BMI, number of children, smoking status, and region of residence. The overall goal is a model that can predict the insurance charges for new individuals from these features.

**Dataset description:**

The dataset used for this task is the medical insurance cost dataset. It contains 1388 instances and 7 columns: the six features listed above plus the target, charges. The features are a mix of categorical values (sex, smoker, region) and numerical values (age, BMI, children).

**Challenges:**

The main challenge is the complexity of the relationships between the features and the target variable. It is therefore important to use appropriate feature selection techniques to identify the most relevant features and keep the model as simple as possible. We also want to prevent overfitting, which can be reduced with regularization and cross-validation.
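As a rough illustration of the regularization and cross-validation mentioned here, the sketch below scores a Ridge regression with 5-fold cross-validation. The synthetic data, the choice of Ridge, and `alpha=1.0` are assumptions made for the example; none of them come from the notebook's models.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption): 200 rows, 6 features, linear signal plus noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Ridge adds an L2 penalty on the coefficients; alpha=1.0 is an illustrative value.
model = Ridge(alpha=1.0)

# 5-fold cross-validation: each fold is held out once while the other four train the model.
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
mean_r2 = scores.mean()

print(f"R^2 per fold: {np.round(scores, 3)}")
print(f"Mean R^2: {mean_r2:.3f}")
```

A large gap between the training score and the cross-validated score would be one sign of overfitting.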

**Data Split:**

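The code above splits the dataset into two sets, training (80%) and testing (20%), using train_test_split with test_size=0.2 and random_state=42. The training set is used to fit the model, and the testing set is used to evaluate it on unseen data. A third, validation set would be used to tune the model's hyperparameters; the code above does not hold one out.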
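A separate validation set can be carved out by calling `train_test_split` twice. The sketch below uses placeholder arrays and a 60/20/20 ratio; both are assumptions for illustration, not values taken from the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data (assumption): 500 rows with 2 features each.
X = np.arange(1000).reshape(500, 2)
y = np.arange(500)

# Step 1: hold out 20% of all rows as the test set.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 2: hold out 25% of the remaining 80% as the validation set (20% of the total).
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42
)

print(len(X_train), len(X_val), len(X_test))  # 300 100 100
```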

**Parameter Tuning:**

Parameter tuning is an important step in building this model because it lets us select the hyperparameters with which the model performs best. Our baseline model reaches an R^2 of 0.78 and an RMSE of 5829.38 (in the units of charges). Both values improve with the final model, a random forest with n_estimators=100.
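One way such tuning could be done is a cross-validated grid search with `GridSearchCV`. The sketch below runs on synthetic data, and the grid values are illustrative assumptions rather than settings used in the notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption): a nonlinear target built from 6 features.
rng = np.random.default_rng(42)
X = rng.normal(size=(150, 6))
y = X[:, 0] ** 2 + 2.0 * X[:, 1] + rng.normal(scale=0.3, size=150)

# Hypothetical search grid; these values are for illustration only.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

# Every combination in the grid is scored with 3-fold cross-validation.
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y)

best_params = search.best_params_
best_rmse = -search.best_score_  # the scorer returns negative RMSE, so flip the sign

print(f"Best parameters: {best_params}")
print(f"Cross-validated RMSE: {best_rmse:.3f}")
```

Because the search cross-validates on the data it is given, it would normally be fit on the training set only, leaving the test set untouched for the final evaluation.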

**Performance metrics:**

Mean Squared Error (MSE): measures the average squared difference between the predicted and actual values. Lower values indicate better performance.

Root Mean Squared Error (RMSE): measures the square root of the average squared difference between the predicted and actual values. Lower values indicate better performance.

R^2 Score: measures the proportion of variance in the target variable that is explained by the model. Higher values indicate better performance.
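To make the three definitions concrete, here is a small worked example that computes each metric by hand. The four actual and predicted values are made up for illustration.

```python
import math

# Hypothetical actual and predicted values (not taken from the dataset).
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 37.0]
n = len(y_true)

# MSE: the average of the squared errors.
squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
mse = sum(squared_errors) / n

# RMSE: the square root of the MSE, expressed in the same units as the target.
rmse = math.sqrt(mse)

# R^2: 1 minus the ratio of the residual sum of squares to the total sum of squares.
mean_true = sum(y_true) / n
ss_res = sum(squared_errors)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(f"MSE: {mse:.2f}")    # MSE: 6.50
print(f"RMSE: {rmse:.2f}")  # RMSE: 2.55
print(f"R^2: {r2:.3f}")     # R^2: 0.948
```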

**Model Comparison:**

Overall, the random forest regressor performs much better than the linear regression model in both RMSE and R^2. The random forest regressor has an RMSE of 4571.50 and an R^2 of 0.87, whereas the linear regression model has an RMSE of 5829.38 and an R^2 of 0.78. The random forest performs better because it can capture nonlinear relationships between the features and the target variable, whereas linear regression assumes that the relationship is linear. The random forest also copes better with the label-encoded categorical variables, because tree splits can separate individual category codes, whereas a linear model treats the integer codes for region as an ordered numeric scale. The random forest regressor is therefore the better model for predicting insurance charges in this problem.
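On the handling of categorical variables: one alternative to label encoding, not used in the notebook, is one-hot encoding, which avoids giving the region codes an artificial order in a linear model. The sketch below applies `pd.get_dummies` to a few hypothetical rows; the values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical rows shaped like part of insurance.csv (values are made up).
df = pd.DataFrame({
    "age": [19, 33, 45, 52],
    "smoker": ["yes", "no", "no", "yes"],
    "region": ["southwest", "southeast", "northwest", "northeast"],
})

# Binary column: a single 0/1 indicator is enough.
df["smoker"] = (df["smoker"] == "yes").astype(int)

# Multi-level column: one indicator per region, dropping the first level
# so the indicators are not redundant with the intercept.
encoded = pd.get_dummies(df, columns=["region"], drop_first=True)

print(encoded.columns.tolist())
```

Whether this would narrow the gap between the two models here is not something the notebook tests.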