# Data Science - K-Means

### In this advanced data science topic, we will explore how the K-means algorithm works. It builds on concepts taught in module 3, such as pandas and DataFrames. For this project we will use a dataset from Kaggle to group insurance charges based on patient demographics.

### Link to dataset: https://www.kaggle.com/code/janiobachmann/patient-charges-clustering-and-regression

## Import Stuff

In [None]:
# Install and import all the necessary packages

%pip install scikit-learn
%pip install pandas
%pip install seaborn


In [None]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


## Load data

### First thing we will do is load the data into pandas and look at its structure. Let's take a look at the first few rows using ```df.head()``` to get a look at what we have.

### We also want to run ```df.info()``` to get some metadata, which saves us from counting rows and columns by hand.

### Attribute Information
1. Age - The age of the patient. (Numeric)
2. Sex - The gender of the patient. (Categorical)
3. BMI - The BMI of the patient. (Numeric)
4. Children - The number of children the patient has. (Numeric)
5. Smoker - Whether the patient smokes. (Categorical)
6. Region - Which region of the US the patient lives in. (Categorical)
7. Charges - How much the patient was charged. (Numeric)

In [None]:
# Import the dataset
df = pd.read_csv('insurance.csv', dtype={'sex':'string', 'smoker':'string','region':'string'})

# Display the first few rows of the dataset
df.head()

In [None]:
df.info()

## Let's explore the data a little more by visualizing it 

In [None]:
# Create a subset of the dataframe. 
# We do this by passing a list of these column names to the dataframe df. 
subset = df[['age', 'bmi', 'children', 'charges']]

# Now we're going to create a "pair plot" of this subset. Pair plots are a great way to visualize relationships 
# between different pairings of these variables. In a pair plot, the diagonal elements show the histogram of the 
# data for that particular variable, and the off-diagonal elements show scatter plots of one variable versus another
sns.pairplot(subset)

# Let's take a look!
plt.show()

In [None]:
corr = subset.corr() # This line computes the correlation matrix of the DataFrame.
                 # It calculates the Pearson correlation coefficient for each pair of numerical columns.
                 # The subset only contains numerical columns, so every column is included.

# Generate a mask for the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool)) # Here, we create a mask for the upper triangle of the correlation matrix.
                                               # This is done because the matrix is symmetric, i.e., the lower triangle is a mirror
                                               # image of the upper triangle. Thus, showing both would be redundant.
                                               # You don't technically need to do this, but it's a nice trick...

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)

# Draw the heatmap with the mask
# Look at the seaborn documentation for details on all of the arguments.
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0, square=True, linewidths=.5, cbar_kws={"shrink": .5}, annot=True)

plt.title('Correlation Matrix Heatmap')
plt.show()
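
### To see what the upper-triangle mask does, here is a minimal sketch on a tiny made-up matrix. The names corr_demo, mask_demo and visible are hypothetical and only used for illustration. Cells where the mask is True are the ones the heatmap hides.

```python
import numpy as np

# A small symmetric matrix standing in for a correlation matrix (made-up values)
corr_demo = np.array([
    [1.0, 0.2, 0.5],
    [0.2, 1.0, 0.1],
    [0.5, 0.1, 1.0],
])

# True marks the cells that would be hidden: the diagonal and everything above it
mask_demo = np.triu(np.ones_like(corr_demo, dtype=bool))
print(mask_demo)

# Only the cells below the diagonal stay visible
visible = corr_demo[~mask_demo]
print(visible)
```

### Each pair of variables still appears exactly once in the visible cells, so no information is lost by hiding the rest.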

### We can get more information out of the data, like trends, by splitting the data by smoker status. Passing the smoker column as the ```hue``` of a pair plot draws both groups on the same axes in different colors.

In [None]:
# Let's pair plot the data again, this time coloring smokers and non-smokers differently
smoker_subset = df[['age', 'bmi', 'children', 'charges', 'smoker']]
sns.pairplot(smoker_subset, hue='smoker')
plt.show()

In [None]:
# Create a function that matches BMI to the medical weight definition
def get_weight_condition(bmi):
    # Each branch only runs if the earlier checks failed,
    # so the lower bound of each range is already guaranteed
    if bmi < 18.5:
        return 'Underweight'
    elif bmi < 25:
        return 'Normal Weight'
    elif bmi < 30:
        return 'Overweight'
    else:
        return 'Obese'

In [None]:
# Create a new column in the DataFrame of the weight conditions
df['weight_condition'] = df['bmi'].apply(get_weight_condition)
df.head()
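
### As an alternative to applying a Python function row by row, pandas can bin a numeric column directly. The sketch below uses a few made-up BMI values (bmi_demo is a hypothetical name, not part of the dataset) and assumes the same cut-offs as get_weight_condition.

```python
import numpy as np
import pandas as pd

# Made-up BMI values chosen to sit on and around the category boundaries
bmi_demo = pd.Series([17.0, 18.5, 24.9, 25.0, 29.9, 30.0, 41.2])

# right=False makes each bin include its left edge and exclude its right edge,
# which matches the comparisons in get_weight_condition
conditions = pd.cut(
    bmi_demo,
    bins=[-np.inf, 18.5, 25, 30, np.inf],
    labels=['Underweight', 'Normal Weight', 'Overweight', 'Obese'],
    right=False,
)
print(conditions.tolist())
```

### Both approaches should give the same labels. The binned version returns a categorical column, which keeps the categories in a meaningful order.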

In [None]:
# Create a pair plot colored by weight condition
bmi_subset = df[['age', 'bmi', 'children', 'charges', 'weight_condition']]
sns.pairplot(bmi_subset, hue='weight_condition')
plt.show()

### Now that we have seen the data, let's build a simple K-means model that we can visualize. We will run it on two columns, BMI and charges, so the result can be drawn as a scatter plot.

### K-means is an unsupervised algorithm, which means the data does not come with labels. Instead, the algorithm looks for structure in the data. Specifically, it tries to split the data into $k$ groups. This can be really helpful when you want to find collections of observations that share similar characteristics.

### The algorithm starts by creating $k$ points, or "centers", which represent the $k$ groups. The goal is to find the best place for these points, such that the data points in each group share similar characteristics. Each data point is assigned to whichever center it is closest to. The algorithm is iterative: the $k$ centers are moved repeatedly to see if the grouping improves. Each new center is the mean of all the data points currently assigned to its group. It may be easier to explain with the pictures shown below.

![kmeans.png](attachment:kmeans.png)
Image from https://stanford.edu/~cpiech/cs221/handouts/kmeans.html

### In this image, (a) shows the sample dataset. In (b), two centers are randomly created. Then, in (c), each data point is assigned to the group of whichever center it is closest to. In (d), the new centers are computed as the means of the assigned groups. The data points are then reassigned to the closest center, and the process repeats until the centers stop moving. The final grouping can depend on where the centers started.
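
### The assign-then-update loop described above can be sketched in a few lines of NumPy. This is a toy version for illustration only, not how scikit-learn implements it. The function name simple_kmeans, its parameters, and the synthetic blobs are all assumptions made for this example.

```python
import numpy as np

def simple_kmeans(points, k, n_iters=100, seed=0):
    """Toy K-means: returns (centers, labels) for a 2-D array of points."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points picked at random
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Assignment step: distance from every point to every center
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each center to the mean of its assigned points
        # (a center with no points keeps its previous position)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Stop once the centers no longer move
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Two synthetic, clearly separated blobs (made-up data)
rng = np.random.default_rng(1)
blob_a = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[10, 10], scale=0.5, size=(50, 2))
points = np.vstack([blob_a, blob_b])

centers, labels = simple_kmeans(points, k=2)
print(centers)
```

### On data this cleanly separated, the two centers should end up near the middle of each blob after only a few iterations.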

In [None]:
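# Optional: limit the number of threads. On some setups (such as Windows with MKL),
# scikit-learn warns about a KMeans memory leak unless OMP_NUM_THREADS is set.
# You may be able to skip these two lines.
import os
os.environ["OMP_NUM_THREADS"] = '6'
from sklearn.cluster import KMeans

# Create a K-Means model on the two columns we want to cluster
bmi_subset = df[['bmi','charges']]
# n_clusters is k; n_init=10 reruns the algorithm from 10 different starting centers and keeps the best run
kmeans = KMeans(n_clusters=3, n_init=10)
kmeans.fit(bmi_subset)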

In [None]:
# Look at where the final centers are and which data points got assigned to each one
fig = plt.figure(figsize=(12,8))

# Color each data point by its cluster label, then mark the cluster centers with a black "x"
plt.scatter(bmi_subset.values[:,0], bmi_subset.values[:,1], c=kmeans.labels_, cmap='Set1', s=25)
plt.scatter(kmeans.cluster_centers_[:,0] ,kmeans.cluster_centers_[:,1], color='black', marker="x", s=250)
plt.xlabel('BMI')
plt.ylabel('Charges')
plt.title("Kmeans Clustering \n Finding Unknown Groups in the Population", fontsize=16)
plt.show()
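
### K-means measures closeness with Euclidean distance, so a column with large numbers (charges, in the thousands) can outweigh a column with small numbers (BMI, in the tens). One common option, not used in this notebook, is to standardize the columns before clustering. The sketch below assumes synthetic stand-in data with roughly similar scales. The names demo, scaled, kmeans_scaled and labels_scaled are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the two columns (made-up numbers, not the real dataset)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'bmi': rng.normal(30, 6, size=200),
    'charges': rng.normal(13000, 12000, size=200).clip(min=1000),
})

# Rescale each column to mean 0 and standard deviation 1
scaled = StandardScaler().fit_transform(demo)

# Cluster the rescaled data; random_state makes the run repeatable
kmeans_scaled = KMeans(n_clusters=3, n_init=10, random_state=0)
labels_scaled = kmeans_scaled.fit_predict(scaled)

print(scaled.mean(axis=0).round(2), scaled.std(axis=0).round(2))
```

### Whether scaling gives more useful groups depends on the question being asked, so it is worth comparing the clusters with and without it.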

In [None]:
# This function creates a K-means model with k clusters for a given two-column DataFrame
# It's the same process we just did above
def kmeans_cluster(k, bmi_subset):
    kmeans = KMeans(n_clusters=k, n_init=10)
    kmeans.fit(bmi_subset)

    fig = plt.figure(figsize=(12,8))

    # Color each data point by its cluster label, then mark the cluster centers with a black "x"
    plt.scatter(bmi_subset.values[:,0], bmi_subset.values[:,1], c=kmeans.labels_, cmap='Set1', s=25)
    plt.scatter(kmeans.cluster_centers_[:,0] ,kmeans.cluster_centers_[:,1], color='black', marker="x", s=250)
    # Label the axes with the names of the two columns being plotted
    plt.xlabel(bmi_subset.columns[0])
    plt.ylabel(bmi_subset.columns[1])
    plt.title("Kmeans Clustering \n Finding Unknown Groups in the Population", fontsize=16)
    plt.show()

## Play around with $k$ to see if you can find any real trends.

In [None]:

kmeans_cluster(5, bmi_subset)
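
### Besides eyeballing the plots, one common heuristic for choosing $k$ is the "elbow" method: fit the model for several values of $k$ and look at how the inertia (the sum of squared distances from each point to its center) drops. The sketch below is an illustration on synthetic data generated with make_blobs, not the insurance data. The names X_demo and inertias are hypothetical.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data generated around 4 centers (not the insurance data)
X_demo, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    # inertia_ is the sum of squared distances from each point to its assigned center
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 1))
```

### The inertia tends to keep shrinking as $k$ grows, so the point of interest is where the improvement levels off rather than where the inertia is smallest.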