# Segmenting Customers

You have been asked to review how the customer ratings data looks when modeled with 3 and 4 clusters.

Using the information contained in this notebook, apply the K-means algorithm to the service_ratings data with both 3 and 4 clusters to segment the customers.

In [40]:
# Import the modules
import pandas as pd
from pathlib import Path
import hvplot.pandas

# Import the K-means algorithm
from sklearn.cluster import KMeans

In [41]:
# Read in the CSV file as a Pandas DataFrame
service_ratings_df = pd.read_csv(
    Path("service_ratings.csv")
)

# Review the DataFrame
service_ratings_df.head()

,mobile_app_rating,personal_banker_rating
0,3.5,2.4
1,3.65,3.14
2,2.9,2.75
3,2.93,3.36
4,2.89,2.62


In [42]:
# Visualize a scatter plot of the data
service_ratings_df.hvplot.scatter(
    x="mobile_app_rating", 
    y="personal_banker_rating"
)

## Run the K-means model with 3 clusters

In [43]:
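# Create and initialize the K-means model instance for 3 clusters
# Set the random_state variable to 1
# Set n_init explicitly (10 is the default in this scikit-learn version) to avoid the n_init FutureWarning
model = KMeans(n_clusters=3, n_init=10, random_state=1)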


# Print the model
model

In [44]:
# Fit the data to the instance of the model
model.fit(service_ratings_df)


In [45]:
# Make predictions about the data clusters using the trained model
predictions = model.predict(service_ratings_df)

# Print the predictions
predictions

array([0, 2, 1, 1, 1, 1, 2, 2, 1, 1, 2, 0, 2, 2, 0, 2, 1, 2, 1, 1, 2, 2,
       1, 2, 1, 0, 1, 2, 1, 0, 2, 1, 2, 0, 2, 1, 1, 1, 2, 1, 1, 2, 2, 2,
       2, 1, 1, 2, 2, 2, 1, 2, 2, 2, 2, 2, 1, 0, 1, 2, 1, 2, 1, 2, 1, 1,
       0, 0, 1, 2, 0, 1, 1, 2, 1, 2, 0, 1, 2, 2, 2, 0, 1, 2, 2, 2, 1, 2,
       0, 1, 2, 1, 2, 2, 0, 2, 0, 2, 1, 2, 0, 1, 2, 1, 2, 2, 1, 1, 2, 2,
       1, 2, 2, 2, 0, 2, 1, 1, 1, 2, 0, 2, 1, 1, 1, 1, 2, 2, 1, 1, 1, 2,
       2, 2, 1, 0, 1, 2, 2, 2, 2, 1, 1, 2, 2, 2, 1, 0, 2, 1, 1, 1, 0, 2,
       2, 2, 0, 2, 2, 2, 1, 2, 0, 1, 2, 1, 1, 1, 1, 0, 2, 1, 0, 2, 2, 1,
       2, 2, 2, 1, 2, 0, 2], dtype=int32)

In [46]:
# Create a copy of the DataFrame and name it service_ratings_predictions_df
service_ratings_predictions_df = service_ratings_df.copy()

# Add a column to the DataFrame that contains the customer_segment information
service_ratings_predictions_df['customer_segment_3'] = predictions

# Review the DataFrame
service_ratings_predictions_df

,mobile_app_rating,personal_banker_rating,customer_segment_3
0,3.50,2.40,0
1,3.65,3.14,2
2,2.90,2.75,1
3,2.93,3.36,1
4,2.89,2.62,1
...,...,...,...
178,3.44,3.00,2
179,2.40,2.80,1
180,3.25,2.88,2
181,3.50,2.40,0


In [47]:
# Plot the data points, colored by their 3-cluster customer segment
service_ratings_predictions_df.hvplot.scatter(
    x="mobile_app_rating", 
    y="personal_banker_rating", 
    by="customer_segment_3"
)

## Run the K-means model with 4 clusters

In [48]:
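# Create and initialize the K-means model instance for 4 clusters
# Set the random_state variable to 1
# Set n_init explicitly (10 is the default in this scikit-learn version) to avoid the n_init FutureWarning
model = KMeans(n_clusters=4, n_init=10, random_state=1)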


# Print the model
model

In [49]:
# Fit the data to the instance of the model
model.fit(service_ratings_df)


In [50]:
# Make predictions about the data clusters using the trained model
predictions = model.predict(service_ratings_df)

# Print the predictions
predictions

array([1, 3, 2, 2, 2, 2, 3, 3, 2, 0, 2, 2, 3, 2, 1, 3, 2, 3, 2, 2, 3, 3,
       2, 3, 2, 1, 0, 3, 0, 1, 3, 2, 3, 1, 3, 2, 2, 2, 3, 0, 2, 3, 3, 2,
       3, 2, 2, 3, 3, 3, 2, 3, 3, 3, 2, 3, 2, 1, 2, 2, 2, 3, 2, 3, 2, 2,
       1, 1, 2, 2, 1, 0, 2, 3, 0, 3, 1, 2, 3, 3, 3, 1, 0, 3, 3, 2, 0, 3,
       1, 0, 3, 2, 2, 3, 1, 3, 1, 3, 2, 2, 1, 2, 3, 2, 3, 3, 0, 0, 3, 3,
       2, 2, 3, 3, 1, 3, 0, 0, 2, 3, 1, 2, 2, 2, 2, 2, 3, 3, 2, 2, 2, 3,
       3, 3, 2, 1, 2, 3, 3, 2, 3, 2, 2, 3, 3, 3, 2, 2, 3, 2, 0, 2, 1, 3,
       3, 3, 1, 3, 3, 3, 2, 3, 1, 2, 3, 2, 2, 2, 0, 1, 3, 2, 1, 2, 3, 0,
       3, 3, 3, 2, 2, 1, 3], dtype=int32)

In [51]:
# Add a column to the service_ratings_predictions_df DataFrame that contains the customer_segment information
service_ratings_predictions_df['customer_segment_4'] = predictions

# Review the DataFrame
service_ratings_predictions_df

,mobile_app_rating,personal_banker_rating,customer_segment_3,customer_segment_4
0,3.50,2.40,0,1
1,3.65,3.14,2,3
2,2.90,2.75,1,2
3,2.93,3.36,1,2
4,2.89,2.62,1,2
...,...,...,...,...
178,3.44,3.00,2,3
179,2.40,2.80,1,2
180,3.25,2.88,2,2
181,3.50,2.40,0,1


In [52]:
# Plot the data points, colored by their 4-cluster customer segment
service_ratings_predictions_df.hvplot.scatter(
    x="mobile_app_rating", 
    y="personal_banker_rating", 
    by="customer_segment_4"
)
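
The two scatter plots can be supplemented with numeric summaries. The following is a minimal sketch, not part of the original exercise: it fits K-means with 3 and 4 clusters and records each model's inertia (the within-cluster sum of squared distances) and silhouette score. Because `service_ratings.csv` is not reproduced here, the sketch uses randomly generated ratings as a stand-in; the names `ratings` and `scores` and the sample size of 200 are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the two rating columns (assumption: NOT the real data)
rng = np.random.default_rng(1)
ratings = rng.uniform(1.0, 5.0, size=(200, 2))

# Fit one model per cluster count and collect two quality measures
scores = {}
for k in (3, 4):
    model = KMeans(n_clusters=k, n_init=10, random_state=1)
    labels = model.fit_predict(ratings)
    scores[k] = {
        "inertia": model.inertia_,  # within-cluster sum of squared distances
        "silhouette": silhouette_score(ratings, labels),  # ranges from -1 to 1
    }

for k, result in scores.items():
    print(k, result)
```

To apply this to the notebook's data, `ratings` would be replaced with `service_ratings_df`. Inertia generally falls as clusters are added, so the silhouette score is often the more useful of the two when deciding between 3 and 4 clusters.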

## Answer the following question

**Question:** Can any additional information be gleaned from the customer segmentation data when 3 and 4 clusters are applied?

**Answer:** As you add more clusters, the centroids are recalculated, which may cause some data points to change their grouping compared with a model that uses fewer or more clusters.
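
One way to see which points change groups is to cross-tabulate the two sets of labels. The sketch below is illustrative only and assumes a synthetic stand-in for the ratings data, since the CSV file is not reproduced here; `ratings_df`, `segments_df`, and `segment_shift` are hypothetical names, and the 200 random rows are not the real customer ratings.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in with the same two columns (assumption: NOT the real data)
rng = np.random.default_rng(1)
ratings_df = pd.DataFrame({
    "mobile_app_rating": rng.uniform(1.0, 5.0, size=200).round(2),
    "personal_banker_rating": rng.uniform(1.0, 5.0, size=200).round(2),
})

# Fit K-means with 3 and 4 clusters and store each set of labels
segments_df = ratings_df.copy()
for k in (3, 4):
    model = KMeans(n_clusters=k, n_init=10, random_state=1)
    segments_df[f"customer_segment_{k}"] = model.fit_predict(ratings_df)

# Rows: 3-cluster segment; columns: 4-cluster segment; cells: customer counts
segment_shift = pd.crosstab(
    segments_df["customer_segment_3"],
    segments_df["customer_segment_4"],
)
print(segment_shift)
```

A row whose counts are spread across several columns marks a 3-cluster segment that was split when the fourth cluster was added. With the notebook's data, the same `pd.crosstab` call could be run directly on the `customer_segment_3` and `customer_segment_4` columns of `service_ratings_predictions_df`.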