## Week 10 Lab (K Means)
### COSC 3337 Dr. Rizk


## About The Data
We'll be using the Customer Dataset from Kaggle for this lab, but feel free to follow along with your own dataset. The dataset contains the following attributes:

* CustomerID
* Genre
* Age
* Annual_Income_(k$)
* Spending_Score

Our goal is to group/cluster these customers.

## About K Means
K Means Clustering is an unsupervised learning algorithm that groups observations by similarity. Unsupervised learning means that there is no outcome to be predicted; the algorithm only tries to find patterns in the data. In K Means clustering, we have to specify the number of clusters we want the data to be grouped into. The algorithm randomly assigns each observation to a cluster and finds the centroid of each cluster. It then iterates through two steps:

1. Reassign each data point to the cluster whose centroid is closest.
2. Calculate the new centroid of each cluster.

These two steps are repeated until the within-cluster variation cannot be reduced any further. The within-cluster variation is calculated as the sum of the squared Euclidean distances between the data points and their respective cluster centroids. Refer back to the lecture video or slides for more detail on K Means.
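To make the two-step loop described above concrete, here is a minimal NumPy sketch. It is an illustration only, not the implementation used later in the lab: the function names `assign_labels` and `kmeans_sketch` and the toy data are hypothetical, and the sketch starts from k randomly chosen data points as centroids (a common variant) rather than from a random cluster assignment.

```python
import numpy as np

def assign_labels(X, centroids):
    """Return, for every row of X, the index of the nearest centroid."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans_sketch(X, k, n_iters=100, seed=0):
    """Toy K Means. Returns (labels, centroids, within-cluster variation)."""
    rng = np.random.default_rng(seed)
    # Assumption: initialize with k distinct data points as centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iters):
        # Step 1: reassign each point to the cluster with the closest centroid.
        labels = assign_labels(X, centroids)
        # Step 2: recompute each centroid as the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving, so the loop has converged
        centroids = new_centroids
    labels = assign_labels(X, centroids)
    # Within-cluster variation: sum of squared distances to assigned centroids.
    wcss = float(((X - centroids[labels]) ** 2).sum())
    return labels, centroids, wcss

# Made-up toy data: six points along a line that form two obvious groups.
toy_X = np.array([[0.0, 0.0], [0.1, 0.1], [0.2, 0.2],
                  [10.0, 10.0], [10.1, 10.1], [10.2, 10.2]])
labels, centroids, wcss = kmeans_sketch(toy_X, k=2)
print(labels, wcss)
```

On real data the result can depend on the starting centroids, which is why library implementations typically run several initializations and keep the best one.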

### Implementation
Because K Means is used more for finding patterns in our data, we'll skip the data exploration portion, but you're welcome to explore this data or your own if working with a different dataset.



In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
from matplotlib import rcParams
rcParams['figure.figsize'] = 15, 5
sns.set_style('darkgrid')

Let's first load the data into a pandas DataFrame. We'll use the CustomerID column as our index_col for this DataFrame.

In [3]:
customer_df = pd.read_csv('customers.csv', index_col='CustomerID')
customer_df.head()

             Genre  Age  Annual_Income_(k$)  Spending_Score
CustomerID
1             Male   19                  15              39
2             Male   21                  15              81
3           Female   20                  16               6
4           Female   23                  16              77
5           Female   31                  17              40


In [5]:
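# Make sure 'Genre' is stored as a string
customer_df['Genre'] = customer_df['Genre'].astype(str)

# Set 'Genre' as the index. This replaces CustomerID as the index
# and leaves only the three numeric columns.
customer_df.set_index('Genre', inplace=True)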

# Make a copy of the DataFrame
customer_df_copy = customer_df.copy()

# Display the first few rows of the DataFrame
print(customer_df_copy.head())

        Age  Annual_Income_(k$)  Spending_Score
Genre                                          
Male     19                  15              39
Male     21                  15              81
Female   20                  16               6
Female   23                  16              77
Female   31                  17              40


Calling **.info()**, we see that there are no missing values in this dataset: there are 200 entries in total and 200 non-null entries in each column.

In [6]:
customer_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 200 entries, Male to Male
Data columns (total 3 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   Age                 200 non-null    int64
 1   Annual_Income_(k$)  200 non-null    int64
 2   Spending_Score      200 non-null    int64
dtypes: int64(3)
memory usage: 6.2+ KB


In [7]:
customer_df_copy.info()

<class 'pandas.core.frame.DataFrame'>
Index: 200 entries, Male to Male
Data columns (total 3 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   Age                 200 non-null    int64
 1   Annual_Income_(k$)  200 non-null    int64
 2   Spending_Score      200 non-null    int64
dtypes: int64(3)
memory usage: 6.2+ KB
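The non-null counts reported by **.info()** can also be checked directly by counting missing values per column. The sketch below uses a small made-up DataFrame (`toy_df` is hypothetical, with one value deliberately set to NaN) so that the check has something to find; on the lab data you would call the same methods on `customer_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing Age value.
toy_df = pd.DataFrame({
    'Age': [19, 21, np.nan, 23],
    'Annual_Income_(k$)': [15, 15, 16, 16],
    'Spending_Score': [39, 81, 6, 77],
})

# Number of missing values in each column.
missing_per_column = toy_df.isna().sum()
# True if any cell anywhere in the DataFrame is missing.
has_missing = bool(toy_df.isna().any().any())

print(missing_per_column)
print(has_missing)
```

If every count is 0, the DataFrame has no missing values, which matches what the 200 non-null entries per column tell us above.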


In [8]:
customer_df.describe()

              Age  Annual_Income_(k$)  Spending_Score
count  200.000000          200.000000      200.000000
mean    38.850000           60.560000       50.200000
std     13.969007           26.264721       25.823522
min     18.000000           15.000000        1.000000
25%     28.750000           41.500000       34.750000
50%     36.000000           61.500000       50.000000
75%     49.000000           78.000000       73.000000
max     70.000000          137.000000       99.000000


In [10]:
from sklearn.preprocessing import MinMaxScaler
# Scale the numeric columns so that their values fall within [0, 1]
# Initialize the MinMaxScaler
scaler = MinMaxScaler()

# select only numeric columns for scaling
numeric_columns = customer_df_copy.select_dtypes(include=['number']).columns

# apply the scaler to the numeric columns
customer_df_copy[numeric_columns] = scaler.fit_transform(customer_df_copy[numeric_columns])

# display the summary statistics of the scaled data
print(customer_df_copy.describe())

              Age  Annual_Income_(k$)  Spending_Score
count  200.000000          200.000000      200.000000
mean     0.400962            0.373443        0.502041
std      0.268635            0.215285        0.263505
min      0.000000            0.000000        0.000000
25%      0.206731            0.217213        0.344388
50%      0.346154            0.381148        0.500000
75%      0.596154            0.516393        0.734694
max      1.000000            1.000000        1.000000
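With its default settings, `MinMaxScaler` rescales each column independently as (x - min) / (max - min), which is why every column above now has a minimum of 0 and a maximum of 1. The sketch below applies that formula by hand to the five Age values shown in the `head()` output. Note the assumption: the minimum and maximum here come from these five values only, so the results differ from the scaled values of the full 200-row column.

```python
import numpy as np

# The first five Age values from the head() output above.
ages = np.array([19, 21, 20, 23, 31], dtype=float)

# Min-max scaling: shift by the minimum, then divide by the range.
age_min, age_max = ages.min(), ages.max()
scaled_ages = (ages - age_min) / (age_max - age_min)

print(scaled_ages)
```

The smallest value maps to 0, the largest to 1, and everything else lands proportionally in between.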


To ensure that we don't have any duplicates, we can call **.drop_duplicates(inplace=True)** on our DataFrame.

In [11]:
customer_df.drop_duplicates(inplace=True)
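Before dropping rows in place, it can be useful to count how many rows would be removed. The sketch below does this on a small made-up DataFrame (`toy_df` is hypothetical). It also illustrates one point worth keeping in mind for this lab: `duplicated()` and `drop_duplicates()` compare column values only and ignore the index, so with `Genre` as the index, two customers with identical Age, income, and spending score count as duplicates even if their genders differ.

```python
import pandas as pd

# Hypothetical toy data: the first two rows have identical column values
# but different index labels.
toy_df = pd.DataFrame(
    {
        'Age': [19, 19, 23],
        'Annual_Income_(k$)': [15, 15, 16],
        'Spending_Score': [39, 39, 77],
    },
    index=pd.Index(['Male', 'Female', 'Male'], name='Genre'),
)

# Count rows that repeat an earlier row (the index is not compared).
n_duplicates = int(toy_df.duplicated().sum())

# Non-inplace variant: returns a new DataFrame and leaves toy_df unchanged.
deduped_df = toy_df.drop_duplicates()

print(n_duplicates)
print(deduped_df)
```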

So that we can visualize our clusters at the end of this lab, we'll work with only 2 variables (spending score and income). However, you're free to use more than 2 variables if you're working with your own dataset.

In [18]:
# Check the columns of the DataFrame
print(customer_df.head())

# Select the two features by name rather than by position, so the
# selection does not depend on how many columns the DataFrame has
X = customer_df[['Annual_Income_(k$)', 'Spending_Score']].values

# Display the first few rows of the selected data
print(X[:5])


        Age  Annual_Income_(k$)  Spending_Score
Genre                                          
Male     19                  15              39
Male     21                  15              81
Female   20                  16               6
Female   23                  16              77
Female   31                  17              40
[[15 39]
 [15 81]
 [16  6]
 [16 77]
 [17 40]]
