### Exploring Customer Segmentation


<center>
    <img src = images/segments.jpeg>
</center>


In this activity, you are tasked with profiling customer groups for a large telecommunications company. The data provided contains information on customers' purchasing and usage behavior with the telecom products. Your goal is to use PCA and clustering to segment these customers into meaningful groups and report back your findings.

Because these results need to be interpretable, it is important to keep the number of clusters reasonable.  Think about how you might represent some of the non-numeric features so that they can be included in your segmentation models.  You are to report back your approach and findings to the class.  Be specific about what features were used and how you interpret the resulting clusters.

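One common way to bring non-numeric features into a segmentation model is to encode them as 0/1 indicator columns. The sketch below is illustrative only: the toy rows are made up, and only the column names are borrowed from the telecom data.

```python
import pandas as pd

# Hypothetical toy frame: values are invented, not taken from the dataset.
toy = pd.DataFrame({
    "Contract": ["Month-to-Month", "One Year", "Two Year", "One Year"],
    "Paperless Billing": ["Yes", "No", "Yes", "Yes"],
    "Monthly Charge": [41.2, 83.9, 99.3, 79.6],
})

# Binary Yes/No columns map directly to 1/0.
toy["Paperless Billing"] = toy["Paperless Billing"].map({"Yes": 1, "No": 0})

# Multi-level categories become one indicator column per level.
encoded = pd.get_dummies(toy, columns=["Contract"], dtype=int)
print(encoded.columns.tolist())
```

Indicator columns can then be scaled and passed to PCA alongside the numeric features, although high-cardinality columns such as `City` would add many sparse dimensions and may be better left out.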
In [18]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.cluster import KMeans, DBSCAN
from sklearn.decomposition import PCA

In [2]:
df = pd.read_csv('try-it_6.2_starter/data/telco_churn_data.csv')

In [3]:
df.head()

| | Customer ID | Referred a Friend | Number of Referrals | Tenure in Months | Offer | Phone Service | Avg Monthly Long Distance Charges | Multiple Lines | Internet Service | Internet Type | ... | Latitude | Longitude | Population | Churn Value | CLTV | Churn Category | Churn Reason | Total Customer Svc Requests | Product/Service Issues Reported | Customer Satisfaction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8779-QRDMV | No | 0 | 1 | NaN | No | 0.0 | No | Yes | Fiber Optic | ... | 34.02381 | -118.156582 | 68701 | 1 | 5433 | Competitor | Competitor offered more data | 5 | 0 | NaN |
| 1 | 7495-OOKFY | Yes | 1 | 8 | Offer E | Yes | 48.85 | Yes | Yes | Cable | ... | 34.044271 | -118.185237 | 55668 | 1 | 5302 | Competitor | Competitor made better offer | 5 | 0 | NaN |
| 2 | 1658-BYGOY | No | 0 | 18 | Offer D | Yes | 11.33 | Yes | Yes | Fiber Optic | ... | 34.108833 | -118.229715 | 47534 | 1 | 3179 | Competitor | Competitor made better offer | 1 | 0 | NaN |
| 3 | 4598-XLKNJ | Yes | 1 | 25 | Offer C | Yes | 19.76 | No | Yes | Fiber Optic | ... | 33.936291 | -118.332639 | 27778 | 1 | 5337 | Dissatisfaction | Limited range of services | 1 | 1 | 2.0 |
| 4 | 4846-WHAFZ | Yes | 1 | 37 | Offer C | Yes | 6.33 | Yes | Yes | Cable | ... | 33.972119 | -118.020188 | 26265 | 1 | 2793 | Price | Extra data charges | 1 | 0 | 2.0 |


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 46 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Customer ID                        7043 non-null   object 
 1   Referred a Friend                  7043 non-null   object 
 2   Number of Referrals                7043 non-null   int64  
 3   Tenure in Months                   7043 non-null   int64  
 4   Offer                              3166 non-null   object 
 5   Phone Service                      7043 non-null   object 
 6   Avg Monthly Long Distance Charges  7043 non-null   float64
 7   Multiple Lines                     7043 non-null   object 
 8   Internet Service                   7043 non-null   object 
 9   Internet Type                      5517 non-null   object 
 10  Avg Monthly GB Download            7043 non-null   int64  
 11  Online Security                    7043 non-null   objec

In [5]:
df.describe()

| | Number of Referrals | Tenure in Months | Avg Monthly Long Distance Charges | Avg Monthly GB Download | Monthly Charge | Total Regular Charges | Total Refunds | Total Extra Data Charges | Total Long Distance Charges | Age | Number of Dependents | Zip Code | Latitude | Longitude | Population | Churn Value | CLTV | Total Customer Svc Requests | Product/Service Issues Reported | Customer Satisfaction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 7043.0 | 7043.0 | 7043.0 | 7043.0 | 7043.0 | 7043.0 | 7043.0 | 7043.0 | 7043.0 | 7043.0 | 7043.0 | 7043.0 | 7043.0 | 7043.0 | 7043.0 | 7043.0 | 7043.0 | 7043.0 | 7043.0 | 1834.0 |
| mean | 1.951867 | 32.386767 | 22.958954 | 21.11089 | 65.5388 | 2280.381264 | 1.962182 | 278.499225 | 749.099262 | 46.509726 | 0.468692 | 93486.070567 | 36.197455 | -119.756684 | 22139.603294 | 0.26537 | 4400.295755 | 1.338776 | 0.308107 | 3.005453 |
| std | 3.001199 | 24.542061 | 15.448113 | 20.948471 | 30.606805 | 2266.220462 | 7.902614 | 685.039625 | 846.660055 | 16.750352 | 0.962802 | 1856.767505 | 2.468929 | 2.154425 | 21152.392837 | 0.441561 | 1183.057152 | 1.430471 | 0.717514 | 1.256938 |
| min | 0.0 | 1.0 | 0.0 | 0.0 | 18.25 | 18.8 | 0.0 | 0.0 | 0.0 | 19.0 | 0.0 | 90001.0 | 32.555828 | -124.301372 | 11.0 | 0.0 | 2003.0 | 0.0 | 0.0 | 1.0 |
| 25% | 0.0 | 9.0 | 9.21 | 3.0 | 35.89 | 400.15 | 0.0 | 0.0 | 70.545 | 32.0 | 0.0 | 92101.0 | 33.990646 | -121.78809 | 2344.0 | 0.0 | 3469.0 | 0.0 | 0.0 | 2.0 |
| 50% | 0.0 | 29.0 | 22.89 | 17.0 | 71.968 | 1394.55 | 0.0 | 0.0 | 401.44 | 46.0 | 0.0 | 93518.0 | 36.205465 | -119.595293 | 17554.0 | 0.0 | 4527.0 | 1.0 | 0.0 | 3.0 |
| 75% | 3.0 | 55.0 | 36.395 | 28.0 | 90.65 | 3786.6 | 0.0 | 182.62 | 1191.1 | 60.0 | 0.0 | 95329.0 | 38.161321 | -117.969795 | 36125.0 | 1.0 | 5380.5 | 2.0 | 0.0 | 4.0 |
| max | 11.0 | 72.0 | 49.99 | 94.0 | 123.084 | 8684.8 | 49.79 | 6477.0 | 3564.72 | 80.0 | 9.0 | 96150.0 | 41.962127 | -114.192901 | 105285.0 | 1.0 | 6500.0 | 9.0 | 6.0 | 5.0 |

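The `count` row of the summary reports only 1834 non-null `Customer Satisfaction` values against 7043 rows, so it helps to quantify missingness before choosing features. A minimal sketch on a made-up toy frame (the values are illustrative, not from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one mostly-missing column.
toy = pd.DataFrame({
    "Tenure in Months": [1, 8, 18, 25],
    "Customer Satisfaction": [np.nan, np.nan, np.nan, 2.0],
})

# Share of missing values per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Applied to the full frame, the same expression would flag columns that are too sparse to cluster on without imputation.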

### Preparing the Data

In [9]:
object_cols = df.select_dtypes('object').columns.tolist()
object_cols

['Customer ID',
 'Referred a Friend',
 'Offer',
 'Phone Service',
 'Multiple Lines',
 'Internet Service',
 'Internet Type',
 'Online Security',
 'Online Backup',
 'Device Protection Plan',
 'Premium Tech Support',
 'Streaming TV',
 'Streaming Movies',
 'Streaming Music',
 'Unlimited Data',
 'Contract',
 'Paperless Billing',
 'Payment Method',
 'Gender',
 'Under 30',
 'Senior Citizen',
 'Married',
 'Dependents',
 'City',
 'Churn Category',
 'Churn Reason']

### Dropping the object columns

In [10]:
df_numeric = df.drop(object_cols, axis = 1)
df_numeric.head()

| | Number of Referrals | Tenure in Months | Avg Monthly Long Distance Charges | Avg Monthly GB Download | Monthly Charge | Total Regular Charges | Total Refunds | Total Extra Data Charges | Total Long Distance Charges | Age | Number of Dependents | Zip Code | Latitude | Longitude | Population | Churn Value | CLTV | Total Customer Svc Requests | Product/Service Issues Reported | Customer Satisfaction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0.0 | 9 | 41.236 | 39.65 | 0.0 | 0.0 | 0.0 | 78 | 0 | 90022 | 34.02381 | -118.156582 | 68701 | 1 | 5433 | 5 | 0 | NaN |
| 1 | 1 | 8 | 48.85 | 19 | 83.876 | 633.3 | 0.0 | 120.0 | 390.8 | 74 | 1 | 90063 | 34.044271 | -118.185237 | 55668 | 1 | 5302 | 5 | 0 | NaN |
| 2 | 0 | 18 | 11.33 | 57 | 99.268 | 1752.55 | 45.61 | 0.0 | 203.94 | 71 | 3 | 90065 | 34.108833 | -118.229715 | 47534 | 1 | 3179 | 1 | 0 | NaN |
| 3 | 1 | 25 | 19.76 | 13 | 102.44 | 2514.5 | 13.43 | 327.0 | 494.0 | 78 | 1 | 90303 | 33.936291 | -118.332639 | 27778 | 1 | 5337 | 1 | 1 | 2.0 |
| 4 | 1 | 37 | 6.33 | 15 | 79.56 | 2868.15 | 0.0 | 430.0 | 234.21 | 80 | 1 | 90602 | 33.972119 | -118.020188 | 26265 | 1 | 2793 | 1 | 0 | 2.0 |


### Dropping the non-informative columns

In [11]:
non_info = ['Zip Code','Latitude','Longitude','Customer Satisfaction','Churn Value']

In [12]:
drop_cols = object_cols + non_info
drop_cols

['Customer ID',
 'Referred a Friend',
 'Offer',
 'Phone Service',
 'Multiple Lines',
 'Internet Service',
 'Internet Type',
 'Online Security',
 'Online Backup',
 'Device Protection Plan',
 'Premium Tech Support',
 'Streaming TV',
 'Streaming Movies',
 'Streaming Music',
 'Unlimited Data',
 'Contract',
 'Paperless Billing',
 'Payment Method',
 'Gender',
 'Under 30',
 'Senior Citizen',
 'Married',
 'Dependents',
 'City',
 'Churn Category',
 'Churn Reason',
 'Zip Code',
 'Latitude',
 'Longitude',
 'Customer Satisfaction',
 'Churn Value']

In [13]:
df_clean = df.drop(drop_cols, axis = 1)
df_clean.head()

| | Number of Referrals | Tenure in Months | Avg Monthly Long Distance Charges | Avg Monthly GB Download | Monthly Charge | Total Regular Charges | Total Refunds | Total Extra Data Charges | Total Long Distance Charges | Age | Number of Dependents | Population | CLTV | Total Customer Svc Requests | Product/Service Issues Reported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0.0 | 9 | 41.236 | 39.65 | 0.0 | 0.0 | 0.0 | 78 | 0 | 68701 | 5433 | 5 | 0 |
| 1 | 1 | 8 | 48.85 | 19 | 83.876 | 633.3 | 0.0 | 120.0 | 390.8 | 74 | 1 | 55668 | 5302 | 5 | 0 |
| 2 | 0 | 18 | 11.33 | 57 | 99.268 | 1752.55 | 45.61 | 0.0 | 203.94 | 71 | 3 | 47534 | 3179 | 1 | 0 |
| 3 | 1 | 25 | 19.76 | 13 | 102.44 | 2514.5 | 13.43 | 327.0 | 494.0 | 78 | 1 | 27778 | 5337 | 1 | 1 |
| 4 | 1 | 37 | 6.33 | 15 | 79.56 | 2868.15 | 0.0 | 430.0 | 234.21 | 80 | 1 | 26265 | 2793 | 1 | 0 |


In [14]:
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 15 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Number of Referrals                7043 non-null   int64  
 1   Tenure in Months                   7043 non-null   int64  
 2   Avg Monthly Long Distance Charges  7043 non-null   float64
 3   Avg Monthly GB Download            7043 non-null   int64  
 4   Monthly Charge                     7043 non-null   float64
 5   Total Regular Charges              7043 non-null   float64
 6   Total Refunds                      7043 non-null   float64
 7   Total Extra Data Charges           7043 non-null   float64
 8   Total Long Distance Charges        7043 non-null   float64
 9   Age                                7043 non-null   int64  
 10  Number of Dependents               7043 non-null   int64  
 11  Population                         7043 non-null   int64

### Scaling the Data

In [15]:
# Standardize each column to z-scores; pandas .std() uses the sample standard deviation (ddof=1)
df_scaled = (df_clean - df_clean.mean()) / df_clean.std()
df_scaled.head()

| | Number of Referrals | Tenure in Months | Avg Monthly Long Distance Charges | Avg Monthly GB Download | Monthly Charge | Total Regular Charges | Total Refunds | Total Extra Data Charges | Total Long Distance Charges | Age | Number of Dependents | Population | CLTV | Total Customer Svc Requests | Product/Service Issues Reported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.650362 | -1.278897 | -1.486198 | -0.578128 | -0.794033 | -0.988753 | -0.248295 | -0.406545 | -0.88477 | 1.879977 | -0.4868 | 2.201235 | 0.872912 | 2.559453 | -0.42941 |
| 1 | -0.317162 | -0.993672 | 1.676001 | -0.100766 | 0.599122 | -0.726797 | -0.248295 | -0.231372 | -0.423191 | 1.641176 | 0.551835 | 1.585088 | 0.762181 | 2.559453 | -0.42941 |
| 2 | -0.650362 | -0.586209 | -0.752775 | 1.713209 | 1.102016 | -0.232913 | 5.523212 | -0.406545 | -0.643894 | 1.462075 | 2.629105 | 1.200545 | -1.032322 | -0.236828 | -0.42941 |
| 3 | -0.317162 | -0.300984 | -0.207077 | -0.387183 | 1.205653 | 0.103308 | 1.451142 | 0.0708 | -0.301301 | 1.879977 | 0.551835 | 0.266561 | 0.791766 | -0.236828 | 0.964292 |
| 4 | -0.317162 | 0.187973 | -1.076439 | -0.291711 | 0.458107 | 0.259361 | -0.248295 | 0.221156 | -0.608142 | 1.999377 | 0.551835 | 0.195032 | -1.358595 | -0.236828 | -0.42941 |

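The manual z-score above divides by the pandas sample standard deviation (`ddof=1`), whereas scikit-learn's `StandardScaler` divides by the population standard deviation (`ddof=0`). The sketch below compares the two on a five-row toy frame (the values are copied from the `df_clean.head()` output and serve only as an illustration); on 7043 rows the difference is negligible.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame for illustration; two columns from the first five rows shown earlier.
toy = pd.DataFrame({
    "Tenure in Months": [1.0, 8.0, 18.0, 25.0, 37.0],
    "Monthly Charge": [41.236, 83.876, 99.268, 102.44, 79.56],
})

manual = (toy - toy.mean()) / toy.std()            # sample std (ddof=1)
manual_pop = (toy - toy.mean()) / toy.std(ddof=0)  # population std (ddof=0)
sk = StandardScaler().fit_transform(toy)           # returns a NumPy array

# StandardScaler matches the ddof=0 version, not the pandas default.
print(np.allclose(manual_pop.to_numpy(), sk))
print(np.allclose(manual.to_numpy(), sk))
```

Either convention is fine for PCA and KMeans as long as it is applied consistently; `StandardScaler` has the practical advantage that the fitted scaler can be reused on new data.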

## PCA

In [19]:
pca = PCA(n_components = 3, random_state = 42)
pca

In [20]:
components = pca.fit_transform(df_scaled)
components

array([[-2.36079276,  0.63262569,  1.15284615],
       [-0.76191211,  0.96019258,  1.67950134],
       [-0.26736245,  0.74589693, -1.23909589],
       ...,
       [ 5.77525106,  2.12148196, -2.35291507],
       [-1.94810441, -1.04201939, -2.12452624],
       [ 2.35573268,  0.86893308,  1.92865413]], shape=(7043, 3))

In [21]:
pca.explained_variance_ratio_

array([0.22236947, 0.11723835, 0.1041562 ])

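To interpret what each principal component represents, one option is to inspect the per-feature weights stored in `components_`. The sketch below uses synthetic data with hypothetical column names borrowed from the dataset; the resulting weights say nothing about the real telecom data.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Synthetic data: two strongly correlated columns plus one independent column.
rng = np.random.default_rng(42)
tenure = rng.normal(size=200)
toy = pd.DataFrame({
    "Tenure in Months": tenure,
    "Total Regular Charges": tenure + 0.1 * rng.normal(size=200),
    "Age": rng.normal(size=200),
})
toy = (toy - toy.mean()) / toy.std()

pca_toy = PCA(n_components=2, random_state=42).fit(toy)

# Rows are features, columns are components; large absolute weights
# indicate the features that drive each component.
weights = pd.DataFrame(
    pca_toy.components_.T, index=toy.columns, columns=["PC1", "PC2"]
)
print(weights.round(2))
```

With the notebook's objects, the analogous call would build the frame from `pca.components_.T` and `df_scaled.columns`.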
In [22]:
pca.explained_variance_ratio_.sum()

np.float64(0.44376402555520816)

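Three components retain about 44% of the variance here, so it may be worth checking how many components a given variance target would require. A minimal sketch on synthetic data, assuming an 80% target chosen purely for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic standardized data with one correlated pair of columns.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
X[:, 1] = X[:, 0] + 0.2 * rng.normal(size=300)
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Fit with all components, then accumulate the explained variance ratios.
full = PCA().fit(X)
cumulative = np.cumsum(full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the target.
target = 0.80  # illustrative threshold, not a recommendation
n_needed = int(np.searchsorted(cumulative, target) + 1)
print(cumulative.round(3), n_needed)
```

There is a trade-off: more components preserve more variance, but fewer components are easier to plot and to explain to an audience.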
## KMeans

In [23]:
kmeans = KMeans(n_clusters = 3, random_state = 42).fit(components)
kmeans
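The choice of three clusters can be checked against alternatives, and the resulting groups can be profiled by their feature means. The sketch below does both on synthetic blobs; the silhouette score is one possible criterion (swapped in here as an assumption, not the notebook's method), and `feature_a` / `feature_b` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three well-separated blobs in two dimensions.
rng = np.random.default_rng(42)
centers = np.array([[0, 0], [10, 10], [-10, 10]])
points = np.vstack([c + rng.normal(size=(100, 2)) for c in centers])

# Score a small range of k; higher silhouette means tighter, better-separated clusters.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(points)
    scores[k] = silhouette_score(points, labels)
best_k = max(scores, key=scores.get)

# Refit with the chosen k and summarize each cluster by its feature means.
final = KMeans(n_clusters=best_k, n_init=10, random_state=42).fit(points)
profile = (
    pd.DataFrame(points, columns=["feature_a", "feature_b"])
    .assign(cluster=final.labels_)
    .groupby("cluster")
    .mean()
)
print(best_k)
print(profile.round(2))
```

For the telecom data, the same `groupby("cluster").mean()` step applied to the unscaled `df_clean` columns (with `kmeans.labels_` attached) would describe each segment in original units such as months and dollars.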