Up to this point, we've been working with machine learning to predict values. These values can be whether a particular student will be admitted to a scholar program, whether a patient has heart disease, etc. In these scenarios, we're working with supervised machine learning. In supervised machine learning, the dataset contains a target variable that we're trying to predict. As the name suggests, we can supervise our model's performance since it's possible to objectively verify if its outputs are correct.

In this course, we'll be working with unsupervised machine learning. When working with unsupervised algorithms, we have an unlabelled dataset, which means we do not have a target variable that we'll try to predict. In fact, the goal is not to predict anything, but to find patterns in the data. As there's no target variable, we can't supervise the algorithm by objectively telling whether or not the outputs are correct. Therefore, it's up to the data scientist to analyze the outputs and understand the pattern the algorithm found in the data.

![](https://s3.amazonaws.com/dq-content/741/1.1-m741.svg)

Common unsupervised machine learning types:

- Clustering: the process of segmenting the dataset into groups based on the patterns found in the data. Used to segment customers and products, for example.
- Association: the goal is to find patterns between the variables, not the entries. It's frequently used for market basket analysis, for instance.
- Anomaly detection: this kind of algorithm tries to identify data points that deviate sharply from the pattern of the rest of the dataset. Frequently used for fraud detection.

## K-Means 
In this course, we'll focus on the use of unsupervised machine learning for clustering with the K-means algorithm. The K-means algorithm is an iterative algorithm designed to split a dataset into a number of clusters set by the user. In other words, the K-means algorithm helps us split our population into a given number of groups. The number of clusters is called K.

As an iterative algorithm, K-means is based on repeating the same process over and over again for a determined number of times or until it reaches a determined stopping condition. For K-means, the algorithm randomly chooses K points to be the centers of the clusters. These points are called the clusters' centroids. K is set by the user. Then, an iterative process begins where each iteration is made of the following steps:

1. Calculate the Euclidean distance from each data point to each centroid.
2. Assign each data point to the cluster of the closest centroid.
3. Calculate new centroids using the mean of the data points in each cluster.

The algorithm will then run until a maximum number of iterations is reached or until the centroids no longer change.
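The steps above can be sketched compactly with NumPy. This is only a minimal illustration under stated assumptions, not the implementation built later in this lesson: the function name `kmeans_sketch`, its parameters, and the toy data are all hypothetical, and an empty cluster simply keeps its previous centroid.

```python
import numpy as np

def kmeans_sketch(points, k, max_iter=100, seed=0):
    """Minimal K-means loop on an (n, d) array; hypothetical helper for illustration."""
    rng = np.random.default_rng(seed)
    # Randomly pick k distinct data points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 1: Euclidean distance from every point to every centroid, shape (n, k).
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        # Step 2: assign each point to its closest centroid.
        labels = distances.argmin(axis=1)
        # Step 3: new centroid = mean of the points in each cluster
        # (assumption: an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy data: two well-separated groups (made-up values).
toy_points = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
                       [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]])
toy_labels, toy_centroids = kmeans_sketch(toy_points, k=2)
print(toy_labels)
print(toy_centroids)
```

The loop has two exits, matching the two stopping conditions: `max_iter` caps the number of iterations, and the `np.allclose` check ends the loop early when the centroids stop moving.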

The following animation shows this process visually.

![](https://s3.amazonaws.com/dq-content/741/2.1-m741.gif)

Note that, in each iteration, new centroids (the big dots) are calculated and then the data points are reassigned to the cluster of their closest centroid.

## Read in Data
During this course, we'll use a dataset containing information about customers of a mall. The goal is to use the data to segment the customers into groups.

The dataset contains the following columns:

- CustomerID: a unique identifier for each customer.
- Gender: the gender of the customer.
- Age: the customer's age in years.
- Annual Income: the customer's annual income in thousands of dollars.
- Spending Score: a score based on customer shopping patterns. Goes from 1 to 100.

In [None]:
import pandas as pd

customers = pd.read_csv("mall_customers.csv")

# print(customers.head())
print(customers.describe())

## Initialize Centroids

Now that we've become familiar with the dataset, we'll start to build our own clustering algorithm. This will be a simple version of K-means intended to help us understand the concepts and mechanics behind the actual algorithm.

We'll follow the steps listed earlier to segment the customers and visualize the segmentation. To make the visualization easier, the clustering will be performed using only two clusters and two variables, Age and Spending Score, which makes it possible to plot the result in a two-dimensional chart.

The first step is to randomly initialize the centroids. Then we'll need to save the coordinates of each centroid in order to later compare to each data point.
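As a small sketch of the random initialization, the snippet below draws two rows from a toy DataFrame. The values and the names `toy_customers` and `toy_centroids` are made up for illustration; `random_state` is an optional argument of `DataFrame.sample()` that makes the draw repeatable, which can help when comparing runs.

```python
import pandas as pd

# Made-up values standing in for the Age and Spending Score columns.
toy_customers = pd.DataFrame({
    "Age": [19, 21, 35, 52, 64],
    "Spending Score": [39, 81, 6, 77, 40],
})

# Pick two distinct rows at random; random_state makes the draw repeatable.
toy_centroids = toy_customers.sample(n=2, random_state=42)

# The coordinates of a centroid can be read by position with .iloc.
first_age = toy_centroids.iloc[0, 0]
first_score = toy_centroids.iloc[0, 1]
print(toy_centroids)
print(first_age, first_score)
```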

Let's do this!

### Instructions
1. Create a DataFrame keeping only the columns used for the clustering alongside the customer identifier and assign back to customers.

2. Use the DataFrame.sample() method to randomly select two data points and assign the resulting DataFrame to centroids.

3. Write a function, fetch_coordinates, that fetches the centroid coordinates from a two-by-two DataFrame.

    - The function should return the four values in the given DataFrame in order (from left to right and from top to bottom).
    - Call the function on centroids and assign the result to: age_centroid_1, score_centroid_1, age_centroid_2, and score_centroid_2.

4. Create a scatter plot of the Age and Spending Score.

5. In the same axes, create a scatter plot from the centroids DataFrame.

    - Use a different color and size to highlight the centroids

In [None]:
cols_to_keep = ['Age', 'Spending Score']

customers = customers[cols_to_keep]

centroids = customers.sample(2)

def fetch_coordinates(centroids):
    # Read the four values left to right, top to bottom
    c_1 = centroids.iloc[0, 0]
    c_2 = centroids.iloc[0, 1]
    c_3 = centroids.iloc[1, 0]
    c_4 = centroids.iloc[1, 1]

    return c_1, c_2, c_3, c_4
    
age_centroid_1, score_centroid_1, age_centroid_2, score_centroid_2 = fetch_coordinates(centroids)

# print(centroids)
  
import matplotlib.pyplot as plt

plt.scatter(x = customers["Age"], y = customers["Spending Score"])
# Larger red markers make the two centroids stand out
plt.scatter(x = centroids["Age"], y = centroids["Spending Score"], color = "red", s = 100)
plt.show()

## Distances Between the Points

Now that we've initialized the first pair of centroids, we need to calculate the (Euclidean) distance from each customer to each of the centroids.

Here is the formula for the distance between two points in a two-dimensional space, which is the space we're working with:

![](https://s3.amazonaws.com/dq-content/741/5.1-m741.svg)


The formula squares the difference between the corresponding coordinates of the two points, adds the squares together, and takes the square root of the sum.

Pictorially, the result is the length of the line that connects the two points.

This formula can be easily translated into Python. For instance, to calculate the distance between the two centroids in new columns of the centroids DataFrame, all we need to do is this:

```python
centroids['dist_centroid_1'] = np.sqrt((centroids['Age'] - age_centroid_1)**2 + (centroids['Spending Score'] - score_centroid_1)**2)
centroids['dist_centroid_2'] = np.sqrt((centroids['Age'] - age_centroid_2)**2 + (centroids['Spending Score'] - score_centroid_2)**2)
```

This would result in the following:

| CustomerID | Age | Spending Score | dist_centroid_1 | dist_centroid_2 |
|-|-|-|-|-|
| 198 | 32 | 74 | 0 | 42.8 |
| 110 | 66 | 48 | 42.8 | 0 |

Note that the distance from each centroid to itself is zero.
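As a quick check of the 42.8 in the table, the arithmetic can be worked through by hand. The snippet below is just a sketch using the two centroid coordinates shown in the table; the variable names are illustrative.

```python
import numpy as np

# Centroid coordinates taken from the table: (Age, Spending Score).
centroid_1 = np.array([32, 74])
centroid_2 = np.array([66, 48])

# Squared differences per coordinate: (32 - 66)**2 = 1156 and (74 - 48)**2 = 676.
squared_diffs = (centroid_1 - centroid_2) ** 2

# The square root of their sum, sqrt(1832), is the Euclidean distance.
distance = float(np.sqrt(squared_diffs.sum()))
print(round(distance, 1))  # 42.8
```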

This formula will be used to calculate the distance from every blue dot to both of the black dots in the scatter plot below:

![](https://s3.amazonaws.com/dq-content/741/scatter.png)

### Instructions
The libraries NumPy and pandas have already been imported.

1. Write a function called calculate_distance that

    - Receives as arguments, in order:
        - A row of customers
        - One of the coordinates of the centroid
        - The other coordinate of the centroid
    - Returns the Euclidean distance from the row to the given centroid

2. Create a new column in customers called dist_centroid_1 by calculating the Euclidean distance from every point to the centroid (age_centroid_1, score_centroid_1).

3. Create an analogous column relative to the second centroid.

4. Inspect customers.

In [None]:
customers = pd.read_csv('mall_customers.csv')

cols_to_keep = ['Age', 'Spending Score']

customers = customers[cols_to_keep].copy()

centroids = customers.sample(2)

def fetch_coordinates(df):
    age_centroid_1 = df.iloc[0, 0]
    score_centroid_1 = df.iloc[0, 1]
    age_centroid_2 = df.iloc[1, 0]
    score_centroid_2 = df.iloc[1, 1]
    return age_centroid_1, score_centroid_1, age_centroid_2, score_centroid_2

age_centroid_1, score_centroid_1, age_centroid_2, score_centroid_2 = fetch_coordinates(centroids)



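def calculate_distance(row, centroid_x, centroid_y):
    # Squared difference between the point and the centroid along each axis
    # (row.iloc[0] is Age, row.iloc[1] is Spending Score)
    x_squared = (row.iloc[0] - centroid_x)**2
    y_squared = (row.iloc[1] - centroid_y)**2
    distance = np.sqrt(x_squared + y_squared)
    return distance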


customers['dist_centroid_1'] = customers.apply(calculate_distance, args=(age_centroid_1, score_centroid_1), axis=1)
customers['dist_centroid_2'] = customers.apply(calculate_distance, args=(age_centroid_2, score_centroid_2), axis=1)
    
print(customers.head())

## Assigning Clusters

At this point, we have the distance from each customer to both of the clusters' centroids. Therefore, all we need to do is to assign each customer to the cluster with the closer centroid.
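One way to express this assignment is a vectorized comparison with `np.where`. The sketch below uses made-up distances in a toy DataFrame; the column names follow this lesson, and sending ties to cluster 2 is an arbitrary choice made here for illustration.

```python
import numpy as np
import pandas as pd

# Made-up distances; the column names follow the ones used in this lesson.
toy = pd.DataFrame({
    "dist_centroid_1": [3.0, 12.5, 7.1, 20.0],
    "dist_centroid_2": [9.4, 2.2, 7.1, 5.0],
})

# 1 where centroid 1 is strictly closer, otherwise 2 (ties go to cluster 2).
toy["cluster"] = np.where(toy["dist_centroid_1"] < toy["dist_centroid_2"], 1, 2)
print(toy)
```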

Once we have that, we can visualize how the first split went by creating a scatter plot. However, in this plot, we want to use different colors for different clusters so we can actually see the difference.

A couple of screens ago, we used the `plt.scatter()` function for plotting. Although `matplotlib` is great, some particular tricks can be a bit hard to implement. Creating a scatter plot with different colors is one of them.

The seaborn library is another great visualization tool, and it provides an easier way to implement what we need. The `seaborn.scatterplot()` function is very similar to `plt.scatter()`, with the following differences:

1. The `x` and `y` parameters can receive just the column names rather than the columns themselves, which means that instead of `df['column_name']`, we pass `'column_name'`.
2. There's a `data` parameter that receives the DataFrame in which the columns are contained.
3. There's the optional `hue` parameter. This parameter names the column whose values determine the color of each point.

This function will be very useful for visualizing the clusters we created.
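To make these parameters concrete, here is a minimal sketch on a toy DataFrame. The data and the `cluster` labels are made up for illustration; only `data`, `x`, `y`, `hue`, and `palette` from the description above are used.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up points with a hypothetical 'cluster' label, for illustration only.
toy = pd.DataFrame({
    "Age": [19, 23, 31, 45, 52, 60],
    "Spending Score": [80, 75, 90, 20, 35, 15],
    "cluster": [1, 1, 1, 2, 2, 2],
})

# Column names go to x, y and hue; the DataFrame itself goes to data.
ax = sns.scatterplot(data=toy, x="Age", y="Spending Score",
                     hue="cluster", palette="tab10")
ax.set_title("Toy points colored by cluster")
```

The function returns the matplotlib axes it drew on, so a second `sns.scatterplot()` call made right after it draws on the same axes. In a script, `plt.show()` displays the figure.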

### Instructions
In this exercise, we'll create a function to tie together some of the work we've done so far.

1. Write a function called calculate_distance_assign_clusters that:

    - Receives both the customers and centroids DataFrames as inputs.
    - Uses the fetch_coordinates function to generate the coordinates.
    - Uses the calculate_distance function to calculate the distance from the centroids.
    - Creates a cluster column in the customers DataFrame.
        - This column should contain 1 if that customer is closer to centroid 1, or 2 if it's closer to centroid 2.
    - Returns the customers DataFrame

2. Call the function we just created on customers and assign the result back to it.

3. Create a scatter plot for the Age and Spending Score columns from customers using the seaborn.scatterplot() function. Pass the 'cluster' column to the hue parameter.
    - Use the palette parameter of the function to set a color palette to be used. The tab10 palette is a good choice.

4. Create the same plot, but now for the centroids DataFrame.

    - We don't have to use hue and palette.
    - Make the centroids bigger using the s (size) parameter.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns


def calculate_distance_assign_clusters(customers, centroids):
    
    age_centroid_1, score_centroid_1, age_centroid_2, score_centroid_2 = fetch_coordinates(centroids)
    
    customers['dist_centroid_1'] = customers.apply(calculate_distance, args=(age_centroid_1, score_centroid_1), axis=1)
    
    customers['dist_centroid_2'] = customers.apply(calculate_distance, args=(age_centroid_2, score_centroid_2), axis=1)
    
    
    cluster = lambda row: 1 if row['dist_centroid_1'] < row['dist_centroid_2'] else 2 
    customers['cluster'] = customers.apply(cluster, axis=1)
            
    return customers


customers = calculate_distance_assign_clusters(customers, centroids)

# Pass column names to x, y and hue; the DataFrame goes in data
sns.scatterplot(data = customers, x = "Age", y = "Spending Score", hue = "cluster", palette = "tab10")
sns.scatterplot(data = centroids, x = "Age", y = "Spending Score", color = "red", s = 100)
                
plt.show()

## Creating New Clusters 

We have our first cluster split. However, the K-means algorithm consists of multiple iterations until the centroids converge to the mean of their clusters.

We'll perform the next step in this iteration by creating new centroids and assigning clusters for a second time.

As the name of the algorithm suggests, each new centroid is calculated as the mean of the data points in its cluster. In our case, that means one mean for each of the two clusters.

Once we have the Age and Spending Score coordinates from the new centroids, the process will repeat:

1. Calculate the distance from each customer to the new centroids.
2. Assign each customer to a new cluster based on the new distances.
3. Visualize the new clusters.
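The centroid update can be sketched on a toy DataFrame with `groupby()`. The values below are made up, and `reset_index(drop=True)` is used here as a shortcut for resetting the index and then dropping the `cluster` column.

```python
import pandas as pd

# Made-up customers that already carry a cluster label.
toy = pd.DataFrame({
    "Age": [20, 30, 40, 60],
    "Spending Score": [80, 60, 30, 10],
    "cluster": [1, 1, 2, 2],
})

# Mean of each coordinate within each cluster; note the list inside the brackets.
toy_new_centroids = (
    toy.groupby("cluster")[["Age", "Spending Score"]]
    .mean()
    .reset_index(drop=True)
)
print(toy_new_centroids)
```

Cluster 1 averages to Age 25 and Spending Score 70, and cluster 2 to Age 50 and Spending Score 20, so the result is again a two-by-two DataFrame of centroid coordinates.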

### Instructions
1. Use the DataFrame.groupby() method to get the mean Age and Spending Score by cluster and assign the result to new_centroids.

2. Reset the index on new_centroids.

3. Drop the cluster column from new_centroids.

4. Using calculate_distance_assign_clusters, reassign clusters to each row in customers using new_centroids.

5. Create the same scatter plots as in the last screen to visualize the new centroids and new clusters.

Did the clusters change significantly? What about the centroids? Are they more centralized in their own cluster? Did the algorithm get better?

In [None]:
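# Mean Age and Spending Score per cluster; a list (double brackets) selects both columns
new_centroids = customers.groupby('cluster')[['Age', 'Spending Score']].mean().reset_index()
new_centroids.drop('cluster', axis=1, inplace=True)

# Recompute distances and reassign clusters using the new centroids
customers = calculate_distance_assign_clusters(customers, new_centroids)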

sns.scatterplot(x='Age', y='Spending Score', hue='cluster', palette='tab10', data=customers, s=50)
sns.scatterplot(x='Age', y='Spending Score', color='black', data=new_centroids, s=100)
plt.show()

## Wrapping in a Function

We have so far built an algorithm that performs two iterations and splits the dataset into two clusters.

Before we move on, let's first wrap everything we've done so far inside a single, consolidated function. Then we'll be able to build from this and develop a more complex algorithm.

### Instructions
This function should receive the customers DataFrame as an argument, then do the entire process we have developed in this lesson.

1. Write a function called create_clusters and in its body:

    - Initialize two random centroids.
    - Use the calculate_distance_assign_clusters function to get the centroids' coordinates, calculate distances, and assign clusters.
    - Create new centroids using groupby().
    - Drop the cluster column from the DataFrame containing the new centroids.
    - Recalculate the distances to each centroid and reassign clusters using calculate_distance_assign_clusters with the new centroids.
    - Return the cluster column.

2. Call the create_clusters function passing the customers DataFrame as argument. Assign the result to clusters.

In [None]:
cols_to_keep = ['Age', 'Spending Score']

customers = customers[cols_to_keep].copy()

def create_clusters(df):
    
    centroids = df.sample(2)
    
    calculate_distance_assign_clusters(df, centroids)
    
    # A list (double brackets) is needed to select several columns from a groupby
    new_centroids = df.groupby('cluster')[['Age', 'Spending Score']].mean().reset_index()
    new_centroids.drop('cluster', axis=1, inplace=True)

    calculate_distance_assign_clusters(df, new_centroids)
    
    return df["cluster"]

clusters = create_clusters(customers)