# Lab 1: Visualization and Data Preprocessing
## Telco Customer Churn
#### Evan Adams, Lijju Mathew, Sid Swarupananda

##### Business Understanding
Customer churn goes by many names: customer attrition, customer turnover, customer defection, etc. They all refer to the loss of customers, whether by choice (voluntary) or through circumstances such as relocation (involuntary). Our team's goal is to predict which customers are at risk of churning. Our analysis helps the Telco company design activities and create strategies to retain customers, their main asset, and minimize churn. The Telco Customer Churn dataset is from [Kaggle](https://www.kaggle.com/blastchar/telco-customer-churn). The dataset has 7,043 entries (customers) and 21 attributes that capture information such as demographics and customer relationship history. Each customer entry contains several features and an attribute stating whether the customer has churned. To better understand the data we will first load it with pandas and explore it with the help of some basic commands.

##### Data Understanding
The data contains 7,043 customer entries consisting of 21 attributes. They are detailed below:
- **customerID**: Customer ID number
- **gender**: Customer's gender (male/female)
- **SeniorCitizen**: Senior citizen status (1/0)
- **Partner**: Does the customer live with a partner (Yes/No)
- **Dependents**: Does the customer have dependents (Yes/No)
- **tenure**: Number of months the customer has been with the telco (int)
- **PhoneService**: Does the customer have phone service (Yes/No)
- **MultipleLines**: Does the customer have multiple phone lines (Yes/No/No phone service)
- **InternetService**: Customer's internet type (DSL/Fiber optic/No)
- **OnlineSecurity**: Telco provided online security (Yes/No/No internet service)
- **OnlineBackup**: Does the customer have online backups (Yes/No/No internet service)
- **DeviceProtection**: Does the customer have device protection (Yes/No/No internet service)
- **TechSupport**: Does the telco provide tech support to the customer (Yes/No/No internet service)
- **StreamingTV**: Does the customer have streaming TV services (Yes/No/No internet service)
- **StreamingMovies**: Does the customer have streaming movie services (Yes/No/No internet service)
- **Contract**: Customer's contract term (Month-to-month/One year/Two year)
- **PaperlessBilling**: Does the customer use paperless billing (Yes/No)
- **PaymentMethod**: How does the customer pay their bill (Electronic check/Mailed check/Bank transfer (auto)/Credit card (auto))
- **MonthlyCharges**: The customer's monthly bill (float)
- **TotalCharges**: Total amount charged to the customer over their tenure with the telco (float, stored as text in the raw file)
- **Churn**: Has the customer left the telco (Yes/No)

In [1]:
# import packages
import pandas as pd

In [2]:
# import dataset
df = pd.read_csv('https://raw.githubusercontent.com/lijjumathew/MSDS-Machine-Learning-1-Project/master/dataset/Telco-Customer-Churn.csv')

Calling `df.info()` on our dataset gives us the name of each attribute, how many entries there are, how many of them are not null, and what their data type is. We can see that our data is divided into three types: object, int, and float. The object attributes are categorical, the int attributes hold whole-number values, and the float attribute is a continuous value that can contain decimals.

The `SeniorCitizen` attribute is of type `int`, but it is used as a binary indicator of whether the customer is a senior citizen. We will change its values to Yes/No to be consistent with the rest of the `object` type attributes. This leaves `tenure` as the only `int` attribute in the dataset.

The `TotalCharges` attribute is of type `object`, but since it holds dollar amounts it should be a `float`. We convert it with the pandas function `pd.to_numeric`.

In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   object 
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


In [14]:
# convert TotalCharges to float; blank entries cannot be parsed and become NaN
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors='coerce')
# replace the resulting NaN values with the mean value of TotalCharges
df["TotalCharges"] = df["TotalCharges"].fillna(df["TotalCharges"].mean())
# convert SeniorCitizen from (1/0) to (Yes/No), replacing values in that column only
df["SeniorCitizen"] = df["SeniorCitizen"].replace({0: "No", 1: "Yes"})

##### Data Quality
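No attribute contains null values when the data is first loaded. However, the `TotalCharges` attribute holds eleven empty entries, which is why pandas reads it as `object`. These become missing values once the column is converted to `float`, and we fill them in with the mean of the `TotalCharges` column.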
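The distinction between an empty entry and a null value can be illustrated with a minimal sketch. The `toy` DataFrame below is made up for illustration and is not the real dataset, and it assumes the empty entries are whitespace-only strings. It shows that `isnull()` does not flag such entries until the column has been coerced to a numeric type.

```python
import pandas as pd

# Toy stand-in for the real data; these values are invented for illustration.
toy = pd.DataFrame({
    "tenure": [0, 12, 24, 0],
    "TotalCharges": [" ", "350.5", "1200.0", " "],
})

# Blank strings are still strings, so isnull() reports nothing missing.
null_count = int(toy["TotalCharges"].isnull().sum())

# Count entries that are empty once surrounding whitespace is stripped.
blank_count = int(toy["TotalCharges"].str.strip().eq("").sum())

# Coercion turns unparseable entries into NaN and yields a float column.
converted = pd.to_numeric(toy["TotalCharges"], errors="coerce")
nan_count = int(converted.isnull().sum())

print(null_count, blank_count, nan_count, converted.dtype)
```

Running the same two counts on the real `TotalCharges` column before conversion is one way to confirm how many entries are affected.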

##### Simple Statistics
The dataset has three continuous variables to analyze. Using the `df.describe()` function we can make the following observations:
- Half of the customers have been with the company for 29 months or more (the median tenure is 29 months)
- The average monthly bill is about \$65
- The average revenue generated per customer is about \$2,283

In [15]:
df.describe()

| | tenure | MonthlyCharges | TotalCharges |
|---|---|---|---|
| count | 7043.0 | 7043.0 | 7043.0 |
| mean | 32.371149 | 64.761692 | 2283.300441 |
| std | 24.559481 | 30.090047 | 2265.000258 |
| min | 0.0 | 18.25 | 18.8 |
| 25% | 9.0 | 35.5 | 402.225 |
| 50% | 29.0 | 70.35 | 1400.55 |
| 75% | 55.0 | 89.85 | 3786.6 |
| max | 72.0 | 118.75 | 8684.8 |
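As a small aside on reading this output, the `50%` row of `describe()` is the median, which is where the "half of the customers" statement comes from. The sketch below uses a made-up `toy` DataFrame (not the real data) to show that the two agree.

```python
import pandas as pd

# Toy tenure values, invented for illustration only.
toy = pd.DataFrame({"tenure": [1, 9, 29, 55, 72]})

# describe() reports the 25th, 50th, and 75th percentiles by default.
summary = toy.describe()
median_from_describe = summary.loc["50%", "tenure"]

# The 50th percentile is the same value that median() returns.
median_direct = toy["tenure"].median()

print(median_from_describe, median_direct)
```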


##### Visualize Attributes
Visualization of data is the simplest form of analysis that allows us to examine how each variable relates to the churn rate. Our major takeaways are as follows:

- The churn percent is almost equal in the case of males and females
- The percent of churn is higher in the case of senior citizens
- The churn rate is higher for customers who have phone service
- Customers with Partners and Dependents have a lower churn rate compared to those who do not
- Customers who pay by electronic check have a higher churn rate compared to other payment methods
- Customers with no internet service have a low churn rate
- Churn rate is much higher in customers with Fiber Optic internet services
- Customers who do not have services like OnlineSecurity, OnlineBackup, and TechSupport have a higher churn rate than those who do
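Takeaways like these can be checked numerically as well as visually. The following is a minimal sketch, not the code used to produce our plots: `churn_rate_by` is a hypothetical helper, and the `toy` DataFrame is made-up data that only mimics the column names of the real dataset.

```python
import pandas as pd

# Toy data with the same column names as the real dataset; values are invented.
toy = pd.DataFrame({
    "InternetService": ["Fiber optic", "Fiber optic", "DSL", "DSL", "No", "No"],
    "Churn": ["Yes", "Yes", "No", "Yes", "No", "No"],
})

def churn_rate_by(frame, column):
    """Return the percentage of churned customers for each value of `column`."""
    churned = frame["Churn"].eq("Yes")  # True where the customer churned
    # The mean of a boolean Series is the fraction of True values per group.
    return churned.groupby(frame[column]).mean().mul(100).round(1)

rates = churn_rate_by(toy, "InternetService")
print(rates)
```

Calling the helper with other attribute names, such as `"PaymentMethod"` or `"TechSupport"`, on the real DataFrame would give the per-category churn percentages behind each bullet above.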