## Key Takeaways from the Model:
#### -Age and BMI are strong predictors of insurance charges: older individuals and those with higher BMIs are expected to have higher charges.
#### -Number of children has a smaller positive effect.
#### -Sex (being male) has almost no effect on charges in this model; its coefficient is close to zero.
#### -Smoking has the largest impact, with smokers incurring substantially higher charges.

# -----------------------------------------------------------------

##### In this project we will create a ridge regression model to understand the relationships between the features of our dataset and insurance charges.

##### We will start by importing the dataset into a pandas DataFrame:

In [2]:
import pandas as pd

# Specify the file path
file_path = '/Users/harshpatel/Desktop/codingproject/regression/expenses.csv'

# Load the CSV into a DataFrame
df = pd.read_csv(file_path)

# Display the first few rows of the DataFrame
print(df.head())

   age     sex     bmi  children smoker      charges
0   19  female  27.900         0    yes  16884.92400
1   18    male  33.770         1     no   1725.55230
2   28    male  33.000         3     no   4449.46200
3   33    male  22.705         0     no  21984.47061
4   32    male  28.880         0     no   3866.85520


##### Now I will check whether there are any missing values in our dataset:

In [3]:
# Count missing values per column. The output shows there are none.
df.isnull().sum()

age         0
sex         0
bmi         0
children    0
smoker      0
charges     0
dtype: int64

##### Since there are no missing values to clean, I will proceed. The next steps are to import the libraries we will use and to create dummy variables for our categorical variables (sex and smoker).

In [4]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler


In [5]:
# There are two categorical variables in our dataset, sex and smoker, and both are binary. We will convert each to a dummy variable.
df = pd.get_dummies(df, columns=['sex', 'smoker'], drop_first=True)
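
The effect of `drop_first=True` can be illustrated on a small frame. This is a minimal sketch with made-up rows (only the column names mirror the dataset). The `dtype=int` argument is an addition that is not in the notebook: it forces 0/1 integer indicators, since recent pandas versions return booleans by default.

```python
import pandas as pd

# Toy frame with hypothetical values; only the column names mirror the dataset.
toy = pd.DataFrame({
    "sex": ["female", "male", "male"],
    "smoker": ["yes", "no", "no"],
    "bmi": [27.9, 33.77, 33.0],
})

# drop_first=True drops the first level of each encoded column
# ("female" and "no"), leaving one indicator per binary variable.
# dtype=int is an assumption added here to get 0/1 integers instead of booleans.
encoded = pd.get_dummies(toy, columns=["sex", "smoker"], drop_first=True, dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

With one level dropped, `sex_male = 1` means male and `0` means female, and `smoker_yes = 1` means smoker. Keeping both levels of a binary variable would make the two indicator columns perfectly collinear.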


##### Now that the libraries are imported and the dummy variables are created, the next step is to check for multicollinearity using variance inflation factors (VIFs):

In [6]:
# Define features (X) and target (y) variables
X = df.drop('charges', axis=1)  # Features (all columns except 'charges')
y = df['charges']  # Target variable (insurance charges)


#Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#Standardize the feature variables
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

#Check VIFs for multicollinearity among the features. A high VIF means a feature is strongly correlated with the other features, which makes the coefficient estimates less reliable.
from statsmodels.stats.outliers_influence import variance_inflation_factor
import pandas as pd

#Compute the VIFs on the standardized training features
vif = pd.DataFrame()
vif['Feature'] = X_train.columns
vif['VIF'] = [variance_inflation_factor(X_train_scaled, i) for i in range(X_train_scaled.shape[1])]

# Display the VIF values
print(vif)


      Feature       VIF
0         age  1.021030
1         bmi  1.014626
2    children  1.004466
3    sex_male  1.005621
4  smoker_yes  1.007995
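
For intuition, the VIF of feature j is 1 / (1 - R²_j), where R²_j comes from regressing feature j on all the other features. The sketch below recomputes that quantity by hand with NumPy on synthetic data. `vif_manual` is a hypothetical helper, not part of statsmodels, and unlike `variance_inflation_factor` it adds an intercept to each auxiliary regression, so the two should agree only when the features are centered (as they are after standardization).

```python
import numpy as np

def vif_manual(X):
    """Return VIF_j = 1 / (1 - R_j^2) for each column of a 2-D array X."""
    X = np.asarray(X, dtype=float)
    vifs = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns plus an intercept.
        design = np.column_stack([np.ones(len(others)), others])
        beta, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ beta
        r2 = 1.0 - residuals.var() / target.var()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Synthetic features (hypothetical, not the insurance data).
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)            # independent of a
c = a + 0.1 * rng.normal(size=500)  # nearly a copy of a

vif_independent = vif_manual(np.column_stack([a, b]))
vif_collinear = vif_manual(np.column_stack([a, b, c]))
print(vif_independent)  # both values close to 1
print(vif_collinear)    # a and c get very large values
```

Values near 1, like those in the table above, indicate that each feature is almost uncorrelated with the rest. A common rule of thumb treats values above roughly 5 to 10 as a sign of problematic multicollinearity.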


#### Now the model is ready to be fitted and tested:

In [7]:
#We will use a Ridge model. Its L2 penalty shrinks the coefficients, which helps limit overfitting and keeps the estimates stable when features are correlated (although the VIFs show little multicollinearity here).
ridge = Ridge(alpha=0.05)  # alpha set to 0.05; other alphas were also tested.
ridge.fit(X_train_scaled, y_train)

#Creating predictions using the model. 
y_pred = ridge.predict(X_test_scaled)



#Storing predicted values into dataframe
results_df = X_test.copy()  # Copy the test features
results_df['Actual'] = y_test  # Add the actual target values
results_df['Predicted'] = y_pred  # Add the predicted target values

#Add a column showing the percent difference between predicted and actual charges
results_df['Percent Difference'] = ((results_df['Predicted'] - results_df['Actual']) / results_df['Actual']) * 100



# Print the first few rows to see a sample of the output
print(results_df.head())

      age     bmi  children  sex_male  smoker_yes       Actual     Predicted  \
764    45  25.175         2         0           0   9095.06825   8555.003063   
887    36  30.020         0         0           0   5272.17580   6973.844831   
890    64  26.885         0         0           1  29330.98315  36797.399178   
1293   46  25.745         3         1           0   9301.89355   9418.107438   
259    19  31.920         0         1           1  33750.29180  26871.066378   

      Percent Difference  
764            -5.938000  
887            32.276409  
890            25.455731  
1293            1.249357  
259           -20.382714  
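
What the ridge fit does can be sketched in closed form. For centered features, minimizing ||y - Xw||² + alpha·||w||² gives w = (XᵀX + alpha·I)⁻¹ Xᵀ(y - mean(y)). The code below is an illustration on synthetic data, not the notebook's model: `ridge_closed_form` is a hypothetical helper, and it assumes the columns of `X` already have mean zero (true after standardization).

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Ridge coefficients and intercept, assuming the columns of X have mean zero."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    gram = X.T @ X + alpha * np.eye(X.shape[1])
    coef = np.linalg.solve(gram, X.T @ (y - y.mean()))
    intercept = y.mean()  # valid only because X is centered
    return coef, intercept

# Synthetic, standardized features with known coefficients (hypothetical data).
rng = np.random.default_rng(42)
X_raw = rng.normal(size=(200, 3))
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
y_toy = 5.0 + X_std @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

alphas = (0.0, 0.05, 10.0, 1000.0)
norms = []
for alpha in alphas:
    coef, intercept = ridge_closed_form(X_std, y_toy, alpha)
    norms.append(float(np.linalg.norm(coef)))
    print(alpha, np.round(coef, 3), round(norms[-1], 3))
```

The coefficient norm shrinks as alpha grows, and alpha = 0 reproduces ordinary least squares. On standardized data the diagonal of XᵀX equals the number of rows, so a penalty as small as 0.05 is expected to leave the coefficients very close to the least-squares solution.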


#### Let's look at the model's performance. I want to avoid overfitting, since an overfit model will not be as reliable on new data. An R² of 0.75-0.90 is good for our purposes.

In [8]:
#Checking model performance. About 78% of the variance in charges on the test set is explained by the age, bmi, children, sex and smoker predictors.
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse}')


from sklearn.metrics import r2_score
r2 = r2_score(y_test, y_pred)
print(f'R²: {r2}')

Mean Squared Error: 33979563.00870515
R²: 0.7811282405840095
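
Both metrics follow from two sums of squares: MSE = SS_res / n and R² = 1 - SS_res / SS_tot. The sketch below computes them by hand on a few made-up numbers (`mse_and_r2` is a hypothetical helper) and also takes the square root of the reported MSE, which puts the error back into the units of `charges`.

```python
import numpy as np

def mse_and_r2(y_true, y_hat):
    """Return (MSE, R^2) computed from the residual and total sums of squares."""
    y_true = np.asarray(y_true, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y_true - y_hat) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return ss_res / len(y_true), 1.0 - ss_res / ss_tot

# Hypothetical toy values, not taken from the dataset.
toy_mse, toy_r2 = mse_and_r2([10.0, 20.0, 30.0, 40.0], [12.0, 18.0, 33.0, 37.0])
print(toy_mse, toy_r2)

# Root mean squared error implied by the MSE printed above.
reported_rmse = 33979563.00870515 ** 0.5
print(round(reported_rmse, 1))
```

The reported MSE corresponds to an RMSE of roughly 5,829, so a typical prediction error is on the order of a few thousand in the units of `charges`.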


#### Now let's take a look at the coefficients in our model. This will help us understand the relationships between the predictors and our target variable, insurance charges:

In [11]:
coefficients = ridge.coef_

coeff_df = pd.DataFrame({
    'Feature': X_train.columns,
    'Coefficient': coefficients
})

# Display the coefficients
print(coeff_df)

#See key takeaways in the first cell for interpretation of coefficients

      Feature  Coefficient
0         age  3616.103798
1         bmi  1978.413305
2    children   519.283904
3    sex_male    -3.942320
4  smoker_yes  9559.144012
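
Because the features were standardized before fitting, each coefficient above is the expected change in charges for a one-standard-deviation increase in that feature, not for a one-unit increase. Dividing a coefficient by the feature's training standard deviation converts it back to original units. The sketch below shows the idea with NumPy on a made-up, exactly linear relationship. The numbers are hypothetical, and plain least squares stands in for ridge to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(1)
age = rng.uniform(18, 64, size=300)  # hypothetical ages
charges = 1000.0 + 250.0 * age       # exact linear toy relationship

# Standardize the way StandardScaler does (population std, ddof=0).
mean, std = age.mean(), age.std()
age_std = (age - mean) / std

# Least-squares slope on the standardized feature: charges per 1 SD of age.
coef_std = np.sum(age_std * (charges - charges.mean())) / np.sum(age_std ** 2)

# Back to original units: charges per additional year of age.
coef_per_year = coef_std / std
intercept = charges.mean() - coef_per_year * mean
print(coef_per_year, intercept)
```

In the notebook the analogous step would be `ridge.coef_ / scaler.scale_`, where `scale_` holds the per-feature standard deviations learned by `StandardScaler`. For a 0/1 dummy such as `smoker_yes`, the rescaled value would then be the estimated difference in charges between smokers and non-smokers.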
