# Telco Customer Churn Data Analysis and Visualization

## Data Dictionary:

1. **`CustomerID`**: A unique ID that identifies each customer


2. **`Gender`**: The customer’s gender
    * Male
    * Female


3. **`Senior Citizen`**: Indicates if the customer is 65 or older
    * Yes (1)
    * No (0)


4. **`Partner`**: Indicates if the customer is married
    * Yes
    * No
    

5. **`Dependents`**: Indicates if the customer lives with any dependents (Dependents could be children, parents, grandparents, etc.) 
    * Yes
    * No


6. **`Tenure`**: Indicates the total number of months that the customer has been with the company.


7. **`Phone Service`**: Indicates if the customer subscribes to home phone service with the company
    * Yes
    * No
   

8. **`Multiple Lines`**: Indicates if the customer subscribes to multiple telephone lines with the company
    * Yes
    * No
    * No phone service


9. **`Internet Service`**: Indicates if the customer subscribes to Internet service with the company
    * No
    * DSL
    * Fiber Optic


10. **`Online Security`**: Indicates if the customer subscribes to an additional online security service provided by the company
    * Yes
    * No
    * No internet service
    

11. **`Online Backup`**: Indicates if the customer subscribes to an additional online backup service provided by the company
    * Yes
    * No
    * No internet service


12. **`Device Protection Plan`**: Indicates if the customer subscribes to an additional device protection plan for their Internet equipment provided by the company
    * Yes
    * No
    * No internet service


13. **`Premium Tech Support`**: Indicates if the customer subscribes to an additional technical support plan from the company with reduced wait times
    * Yes
    * No
    * No internet service


14. **`Streaming TV`**: Indicates if the customer uses their Internet service to stream television programming from a third-party provider. The company does not charge an additional fee for this service.
    * Yes
    * No
    * No internet service


15. **`Streaming Movies`**: Indicates if the customer uses their Internet service to stream movies from a third-party provider. The company does not charge an additional fee for this service.
    * Yes
    * No
    * No internet service


16. **`Contract`**: Indicates the customer’s current contract type
    * Month-to-Month
    * One year
    * Two year


17. **`Paperless Billing`**: Indicates if the customer has chosen paperless billing
    * Yes
    * No
    
    
18. **`Payment Method`**: Indicates how the customer pays their bill
    * Electronic check
    * Mailed check
    * Bank transfer (automatic)
    * Credit card (automatic)
    
    
19. **`Monthly Charge`**: Indicates the customer’s current total monthly charge for all their services from the company.


20. **`Total Charges`**: Indicates the customer’s total charges, calculated to the end of the quarter specified above.


21. **`Churn`**: Yes = the customer left the company this quarter. No = the customer remained with the company. Directly related to Churn Value.
    * Yes
    * No
    

Source:
https://community.ibm.com/community/user/businessanalytics/blogs/steven-macko/2019/07/11/telco-customer-churn-1113


In [None]:
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)

import matplotlib.pyplot as plt
import seaborn as sns

import cufflinks as cf
cf.go_offline()
cf.set_config_file(offline=False, world_readable=True)

import plotly
import plotly.express as px
import plotly.graph_objs as go
import plotly.offline as py
from plotly.offline import iplot
from plotly.subplots import make_subplots
import plotly.figure_factory as ff

import missingno as msno

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
df = pd.read_csv("/kaggle/input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv")
df = df.drop('customerID',axis=1)
df.head()

## Data Quick Review

In [None]:
# Check missing value

df.isna().sum()

**There are currently no null values in the data set. However, we need to examine this output in a little more detail so that it is not misleading.**

In [None]:
df.info()

> The data consists of 20 columns and 7043 rows. 
> At first glance, there are no empty values in this data set. 
> However, the variable `TotalCharges` is of object type, so we need to convert it to a numeric type.

In [None]:
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
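
The conversion above is what exposes the hidden missing values: with `errors='coerce'`, any entry that cannot be parsed as a number becomes `NaN` instead of raising an error. The following is a minimal sketch on a made-up series (the values are illustrative and not taken from the data set), assuming the problem entries are blank or whitespace-only strings:

```python
import pandas as pd

# Hypothetical raw values: two valid numbers, one whitespace-only string, one empty string
raw = pd.Series(["29.85", " ", "1889.5", ""])

# Unparseable entries are turned into NaN rather than raising a ValueError
converted = pd.to_numeric(raw, errors="coerce")

n_missing = int(converted.isna().sum())
print(converted)
print("Missing after conversion:", n_missing)
```

This is why `df.isna().sum()` reports no missing values before the conversion: a blank string is a valid object value and only becomes `NaN` once the column is numeric.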

In [None]:
# Missing values check

def missing_values(df):
        # Total missing values
        mis_val = df.isnull().sum()
        
        # Percentage
        mis_val_percent = 100 * df.isnull().sum() / len(df)
        
        # Make a table with the results
        mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
    
        mis_val_table_ren_columns = mis_val_table.rename(
        columns = {0 : 'Missing Values', 1 : '% of Total Values'})
        
        # Sort the table by percentage of missing descending
        mis_val_table_ren_columns = mis_val_table_ren_columns[
            mis_val_table_ren_columns.iloc[:,1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
        
        # Print some summary information
        print ("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"      
            "There are " + str(mis_val_table_ren_columns.shape[0]) +
              " columns that have missing values.")
        
        # Return the dataframe with missing information
        return mis_val_table_ren_columns
    
missing_values_table = missing_values(df)
missing_values_table
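
For comparison, the same table can be built more compactly. This is only a sketch of an alternative, not a replacement for the function above; `missing_summary` and the toy data frame are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd

def missing_summary(frame):
    """Return count and percentage of missing values for columns that have any."""
    summary = pd.DataFrame({
        "Missing Values": frame.isna().sum(),
        # isna().mean() is the fraction of missing rows per column
        "% of Total Values": (frame.isna().mean() * 100).round(1),
    })
    summary = summary[summary["Missing Values"] > 0]
    return summary.sort_values("% of Total Values", ascending=False)

# Toy data: column "a" has 2 of 4 values missing, "c" has 1 of 4, "b" has none
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": ["x", "y", "z", "w"],
    "c": [np.nan, 2.0, 3.0, 4.0],
})
result = missing_summary(toy)
print(result)
```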

In [None]:
msno.matrix(df)

In [None]:
# Fill the empty values in the TotalCharges variable by multiplying the tenure and MonthlyCharges values

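# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which recent pandas versions warn about and which may not modify df
df['TotalCharges'] = df['TotalCharges'].fillna(df['tenure'] * df['MonthlyCharges'])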
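
To see what this fill does, here is a small sketch on made-up rows (the numbers are illustrative only). Missing totals are replaced by `tenure * MonthlyCharges`, so a customer with a tenure of 0 months gets a total of 0, and existing totals are left untouched. The sketch assigns the result back to the column rather than relying on `inplace=True` on a column selection.

```python
import numpy as np
import pandas as pd

# Toy rows: the first and third totals are missing
toy = pd.DataFrame({
    "tenure": [0, 2, 10],
    "MonthlyCharges": [52.55, 70.0, 20.0],
    "TotalCharges": [np.nan, 141.3, np.nan],
})

# fillna aligns on the index, so each gap receives the estimate from its own row
estimate = toy["tenure"] * toy["MonthlyCharges"]
toy["TotalCharges"] = toy["TotalCharges"].fillna(estimate)

filled = toy["TotalCharges"].tolist()
print(filled)
```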

In [None]:
# Check cardinality (number of unique values per column)

df.nunique()

**There is no cardinality problem in categorical variables.**
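
One way to make this check explicit is to flag non-numeric columns whose number of unique values exceeds a threshold. The sketch below is an assumption about how such a check could look; `high_cardinality_columns`, the threshold of 10 and the toy data are all hypothetical.

```python
import pandas as pd

def high_cardinality_columns(frame, max_unique=10):
    """Return names of non-numeric columns with more than max_unique distinct values."""
    non_numeric = frame.select_dtypes(exclude=["number"])
    unique_counts = non_numeric.nunique()
    return unique_counts[unique_counts > max_unique].index.tolist()

# Toy data: "Contract" has 3 distinct values, "free_text_id" has 12
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year"] * 4,
    "free_text_id": [f"id-{i}" for i in range(12)],
    "tenure": range(12),
})
flagged = high_cardinality_columns(toy)
print(flagged)
```

Numeric columns such as `tenure` are excluded on purpose, since many distinct values are expected for continuous variables.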

In [None]:
# Let's review categorical and numerical values one last time

def filter_categorical_numeric_columns(dataframe):
    categorical_columns = dataframe.select_dtypes(include=['object', 'category']).columns
    numeric_columns = dataframe.select_dtypes(include=['number']).columns
    return categorical_columns, numeric_columns

# Filter categorical and numeric variables
categorical_cols, numeric_cols = filter_categorical_numeric_columns(df)

print("Categorical columns:")
print(categorical_cols)
print("\nNumeric columns:")
print(numeric_cols)

In [None]:
# Let's see the unique values of the categorical features.

for feature in df[categorical_cols]:
        print(f'{feature}: {df[feature].unique()}')

**In the `OnlineSecurity`, `OnlineBackup`, `DeviceProtection`, `TechSupport`, `StreamingTV` and `StreamingMovies` variables, 'No' and 'No internet service' both mean that the customer does not have the service. The same holds for 'No' and 'No phone service' in `MultipleLines`. We merge these values so that each of these variables contains only 'Yes' and 'No'.**

In [None]:
df['MultipleLines'] = df['MultipleLines'].replace('No phone service','No')

columns_to_replace = ['OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies']

for column in columns_to_replace:
    df[column] = df[column].replace('No internet service', 'No')
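
The same merge can also be written as a single `replace` call over several columns. This is a sketch on a toy frame with only two of the service columns; the data is made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "MultipleLines": ["No phone service", "Yes", "No"],
    "OnlineSecurity": ["No internet service", "No", "Yes"],
    "Contract": ["Two year", "One year", "Month-to-month"],
})

service_columns = ["MultipleLines", "OnlineSecurity"]

# Map both "no service" labels to plain "No" in the selected columns only
toy[service_columns] = toy[service_columns].replace(
    {"No phone service": "No", "No internet service": "No"}
)

# Check which values remain in each merged column
remaining = {col: sorted(toy[col].unique().tolist()) for col in service_columns}
print(remaining)
```

Merging does not lose the information about who has no internet or phone service, because `InternetService` and `PhoneService` still record it.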

## Visualization

In [None]:
# matplotlib, seaborn and plotly.express are already imported in the first cell
import warnings
warnings.filterwarnings('ignore')

churn_counts= df['Churn'].value_counts()
fig2 = px.pie(names= churn_counts.keys(), values= churn_counts.values, title='Churn Distribution')
fig2.show()

* 26.5% of customers churned. 
* It can be seen that this data set is unbalanced. 
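
The class shares behind the pie chart can be read directly from `value_counts(normalize=True)`. The sketch below uses a made-up series with 30% churn rather than the real data, and also shows the accuracy a model would reach by always predicting the majority class, which is a useful reference point for unbalanced data.

```python
import pandas as pd

# Toy target: 7 customers stayed, 3 churned
churn = pd.Series(["No"] * 7 + ["Yes"] * 3, name="Churn")

# Share of each class
shares = churn.value_counts(normalize=True)

# Accuracy of always predicting the most frequent class
majority_baseline = shares.max()

print(shares)
print(f"Majority-class baseline accuracy: {majority_baseline:.1%}")
```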

In [None]:
def analyze_category(df, category_column, target_column='Churn'):
    '''Show the distribution of a categorical variable, its relationship with the
    target variable, and the churn probability for each of its values.'''
    # Value Counts Pie Chart
    category_counts = df[category_column].value_counts()
    fig = px.pie(names=category_counts.index, values=category_counts.values, title=f'{category_column} Distribution')
    fig.show()

    # Churn Probabilities: percentage of 'Yes' within each category value.
    # .loc keeps the lookup label-based, which matters for integer categories such as SeniorCitizen.
    churn_rates = (df[target_column] == 'Yes').groupby(df[category_column]).mean() * 100
    for category_value in df[category_column].unique():
        print(f"A {category_value} customer has a {churn_rates.loc[category_value]:.2f}% probability of churn")

    # Histogram
    fig = px.histogram(df, x=category_column, color=target_column, width=400, height=400)
    fig.show()

    # Grouping
    grouped_data = df.groupby([category_column, target_column]).size().reset_index(name='count')

    # Bar Chart
    plt.figure(figsize=(10, 6))
    sns.barplot(data=grouped_data, x=category_column, y='count', hue=target_column)
    plt.title(f'Number of people with or without churn by {category_column} type')
    plt.xlabel(category_column)
    plt.ylabel('Count')
    plt.show()
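
The churn probabilities printed by this function are row-wise percentages of a contingency table. As a cross-check, the same numbers can be obtained with `pd.crosstab`. This is a sketch on toy data, not output from the real data set.

```python
import pandas as pd

# Toy data: 2 of 4 month-to-month customers churned, 1 of 4 two-year customers churned
toy = pd.DataFrame({
    "Contract": ["Month-to-month"] * 4 + ["Two year"] * 4,
    "Churn": ["Yes", "Yes", "No", "No", "No", "No", "No", "Yes"],
})

# normalize="index" makes each row sum to 1, so the values are per-category shares
rates = pd.crosstab(toy["Contract"], toy["Churn"], normalize="index") * 100
print(rates.round(2))
```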

### Gender

In [None]:
df["gender"].value_counts()

In [None]:
analyze_category(df, 'gender')

### SeniorCitizen

In [None]:
df["SeniorCitizen"].value_counts()

In [None]:
analyze_category(df, 'SeniorCitizen')

### Partner

In [None]:
df["Partner"].value_counts()

In [None]:
analyze_category(df, 'Partner')

### Dependents

In [None]:
df["Dependents"].value_counts()

In [None]:
analyze_category(df, 'Dependents')

### PhoneService

In [None]:
df["PhoneService"].value_counts()

In [None]:
analyze_category(df, 'PhoneService')

### MultipleLines

In [None]:
df["MultipleLines"].value_counts()

In [None]:
analyze_category(df, 'MultipleLines')

### InternetService

In [None]:
df["InternetService"].value_counts()

In [None]:
analyze_category(df, 'InternetService')

### OnlineSecurity

In [None]:
df["OnlineSecurity"].value_counts()

In [None]:
analyze_category(df, 'OnlineSecurity')

### OnlineBackup

In [None]:
df["OnlineBackup"].value_counts()

In [None]:
analyze_category(df, 'OnlineBackup')

### DeviceProtection

In [None]:
df["DeviceProtection"].value_counts()

In [None]:
analyze_category(df, 'DeviceProtection')

### TechSupport

In [None]:
df["TechSupport"].value_counts()

In [None]:
analyze_category(df, 'TechSupport')

### StreamingTV

In [None]:
df["StreamingTV"].value_counts()

In [None]:
analyze_category(df, 'StreamingTV')

### StreamingMovies

In [None]:
df["StreamingMovies"].value_counts()

In [None]:
analyze_category(df, 'StreamingMovies')

### Contract

In [None]:
df["Contract"].value_counts()

In [None]:
analyze_category(df, 'Contract')

### PaperlessBilling

In [None]:
analyze_category(df, 'PaperlessBilling')

### PaymentMethod

In [None]:
analyze_category(df, 'PaymentMethod')

### Tenure & MonthlyCharges & TotalCharges

In [None]:
df[['tenure', 'MonthlyCharges', 'TotalCharges']].iplot(kind='histogram',subplots=True,bins=50)

## Important Takeaways

**`Gender`** : 
* The data set consists of 49.5% female and 50.5% male customers. 
* There is no noticeable difference between genders in terms of churn probability.

**`SeniorCitizen`** :
* The vast majority of customers (83.8%) are under the age of 65. 
* However, customers aged 65 and over are 1.76 times more likely to churn than younger customers.

**`Partner`** :
* Married and unmarried customers are almost evenly represented in the dataset. 
* The churn probability of married customers is 19.66%, while it is 32.96% for unmarried customers.

**`Dependents`** :
* Customers without dependents are 2 times more likely to churn than those with dependents. 
* However, this needs to be explored further, given that 70% of customers have no dependents.

**`PhoneService`** : 
* About 90% of customers subscribe to the company's telephone service. 
* Subscribers and non-subscribers have almost the same churn probability.

**`MultipleLines`** : 
* The difference in churn rate between customers with and without multiple phone lines is very small.

**`InternetService`** : 
* Fiber Optic users constitute 44% of the data set. 
* Fiber optic users have a 42% churn probability, while DSL users have a 19% probability and others have a 7.4% probability.
* Fiber optic customers are 2.2 times more likely to churn than DSL customers and 5.6 times more likely than customers without an internet subscription.

**`OnlineSecurity`** : 
* A customer who has the company's online security service is about 2.14 times less likely to leave than a customer who does not. 

**`OnlineBackup`** : 
* There is no noticeable difference in the churn rates of customers with or without an additional online backup system provided by the company.

**`DeviceProtection`** : 
* Customers who do not subscribe to this plan are 1.3 times more likely to churn than others.

**`TechSupport`** : 
* The churn probability of customers without a technical support plan is 31.19%, compared with 15.17% for customers who have one. 
* In other words, customers without a technical support plan churn about twice as often.

**`StreamingTV - StreamingMovies`** : 
* StreamingTV and StreamingMovies data do not provide significant information about the likelihood of churn. 
* Churn rates are close to each other.

**`Contract`** : 
* The churn rate of customers with month-to-month contracts is 3.8 times higher than that of customers with one-year contracts and 15 times higher than that of customers with two-year contracts. 
* Customers with two-year contracts appear to be markedly more loyal.

**`PaperlessBilling`** : 
* Customers who choose paperless billing churn about twice as often as those who do not.

**`PaymentMethod`** : 
* Electronic check is the most common payment method; the other three methods have similar numbers of customers. 
* Customers who pay by electronic check churn 2.3 to 3 times more often than customers using the other methods.
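
The "times more likely" figures in these takeaways are ratios of two churn rates. As a minimal sketch, the helper below (`churn_ratio` is a hypothetical name) recomputes two of them from the percentages quoted above for `Partner` and `TechSupport`.

```python
def churn_ratio(rate_a, rate_b):
    """Return how many times larger churn rate rate_a is than rate_b."""
    return rate_a / rate_b

# Percentages taken from the takeaways above
partner_ratio = churn_ratio(32.96, 19.66)  # unmarried vs. married
tech_ratio = churn_ratio(31.19, 15.17)     # without vs. with a tech support plan

print(f"Partner: {partner_ratio:.2f}x, TechSupport: {tech_ratio:.2f}x")
```

A ratio describes association only. It does not show that a feature such as contract type causes churn, since the groups may differ in other ways.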