# TELCO CHURN ANALYSIS

## Hypothesis

Null Hypothesis: There is no relationship between the monthly charges and the churn of customers.

Alternate Hypothesis: There is a relationship between the monthly charges and the churn of customers.
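
One way to examine this hypothesis is to compare the mean monthly charge of churned and retained customers. The sketch below runs a two-sided permutation test on made-up numbers. The column names `MonthlyCharges` and `Churn`, the `permutation_p_value` helper and the 0.05 threshold are assumptions for illustration; this is not necessarily the test used later in the analysis.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; not taken from the Telco datasets.
toy = pd.DataFrame({
    'MonthlyCharges': [80, 95, 70, 99, 85, 90, 20, 35, 50, 25, 45, 30, 40, 55],
    'Churn': ['Yes'] * 6 + ['No'] * 8,
})


def permutation_p_value(values, is_churned, n_permutations=5000, seed=0):
    """Two-sided permutation test on the difference in group means."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    is_churned = np.asarray(is_churned, dtype=bool)
    observed = values[is_churned].mean() - values[~is_churned].mean()
    hits = 0
    for _ in range(n_permutations):
        # Shuffle the churn labels to simulate "no relationship".
        shuffled = rng.permutation(is_churned)
        diff = values[shuffled].mean() - values[~shuffled].mean()
        if abs(diff) >= abs(observed):
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return observed, (hits + 1) / (n_permutations + 1)


observed_diff, p_value = permutation_p_value(
    toy['MonthlyCharges'], toy['Churn'] == 'Yes'
)

# Assumed significance level of 0.05.
reject_null = p_value < 0.05
```

A small p-value would suggest rejecting the null hypothesis; a permutation test is used here only because it needs nothing beyond NumPy.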

## Analytical Questions

1. What is the overall churn rate of the telecommunication network?
2. What is the average monthly spending of churned customers compared to non-churned customers?
3. What percentage of the top 100 most charged customers churned?
4. What is the churn rate of customers without online security?
5. What is the churn rate of customers without online backup?
6. What is the churn rate of customers without device protection?
7. What is the churn rate of customers without tech support?
8. How does the combined absence of online security, online backup, device protection and tech support contribute to churn?
9. Does the length of a customer's contract affect their likelihood of churn?
10. How does the customer's tenure affect their likelihood of churn?
11. How many male customers who have partners and dependents with high monthly charges churned?
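
Several of these questions reduce to the same few pandas patterns. The sketch below illustrates questions 1 to 3 on a toy frame; the column names `MonthlyCharges` and `Churn` and all values are assumptions for illustration, and the top-100 cut-off is shrunk to the top 2 so that it fits the toy data.

```python
import pandas as pd

# Toy data for illustration only; column names are assumed.
toy = pd.DataFrame({
    'MonthlyCharges': [70.0, 90.0, 30.0, 50.0, 40.0],
    'Churn': ['Yes', 'Yes', 'No', 'No', 'No'],
})

# Question 1: share of customers whose Churn value is 'Yes'.
churn_rate = (toy['Churn'] == 'Yes').mean()

# Question 2: average monthly charge per churn group.
mean_charges = toy.groupby('Churn')['MonthlyCharges'].mean()

# Question 3 pattern: churn share among the n highest-charged customers.
top_n = toy.nlargest(2, 'MonthlyCharges')
top_n_churn_rate = (top_n['Churn'] == 'Yes').mean()
```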

In [61]:
# Importing the needed packages.

import pyodbc
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

warnings.filterwarnings('ignore')

This analysis uses three datasets stored in three different places. The first dataset is in a SQL Server database and is accessed remotely. The second dataset is an Excel file on OneDrive; it was downloaded using the access link provided. The third dataset is a CSV file in a GitHub repository, which was cloned to the local machine using the link provided.

The first and last datasets have been identified as the train datasets, while the second dataset has been identified as the test dataset. The train datasets will be assessed, merged and used to build the models. The test dataset will be used to test the models independently.

## Accessing the first dataset from SQL database

In [62]:
# Defining the connection variables: the server to connect to, the database, the username and the password.
server = 'dap-projects-database.database.windows.net'
database = 'dapDB'
username = 'dataAnalyst_LP2'
password = 'A3g@3kR$2y'

# The connection string is an f-string that combines the variables above to establish a connection to the server.
connection_string = f"DRIVER={{SQL Server}};SERVER={server};DATABASE={database};UID={username};PWD={password}"


In [64]:
# Passing the connection string to the connect method of the pyodbc library.

# NB: This connects to the server and might take a few seconds to complete.
# Check your internet connection if it takes longer than expected.

connection = pyodbc.connect(connection_string)

Error: ('HY000', '[HY000] [Microsoft][ODBC SQL Server Driver][SQL Server]Reason: Login failed due to client TLS version being less than minimal TLS version allowed by the server. (47072) (SQLDriverConnect); [HY000] [Microsoft][ODBC SQL Server Driver][SQL Server]Reason: Login failed due to client TLS version being less than minimal TLS version allowed by the server. (47072)')
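
The error above reports that the client's TLS version is below the minimum the server allows. A commonly suggested remedy is to switch from the legacy `SQL Server` driver to a newer Microsoft ODBC driver, if one is installed. The sketch below assumes `ODBC Driver 17 for SQL Server` is available on the machine and reads the credentials from hypothetical environment variables (`TELCO_DB_USER`, `TELCO_DB_PASSWORD`) instead of hard-coding them. `build_connection_string` is an illustrative helper, not part of pyodbc.

```python
import os


def build_connection_string(server, database, username, password,
                            driver='ODBC Driver 17 for SQL Server'):
    """Assemble an ODBC connection string; the default driver name is an assumption."""
    return (
        f"DRIVER={{{driver}}};SERVER={server};DATABASE={database};"
        f"UID={username};PWD={password}"
    )


connection_string = build_connection_string(
    server='dap-projects-database.database.windows.net',
    database='dapDB',
    # Hypothetical environment variable names; empty string if unset.
    username=os.environ.get('TELCO_DB_USER', ''),
    password=os.environ.get('TELCO_DB_PASSWORD', ''),
)

# Connecting is left commented out because it needs pyodbc, the driver
# and network access:
# import pyodbc
# print(pyodbc.drivers())  # lists the ODBC drivers installed locally
# connection = pyodbc.connect(connection_string)
```

Whether this resolves the login failure depends on the drivers installed locally; `pyodbc.drivers()` shows which names are valid on a given machine.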

In [None]:
# The SQL query used to get the dataset is shown below. Note that you will not have permission to insert, delete
# or update this database table. dbo.LP2_Telco_churn_first_3000 is the name of the table. The dbo in front of the name
# is the schema; dbo is the default schema in Microsoft SQL Server.

query = "Select * from dbo.LP2_Telco_churn_first_3000"
df1 = pd.read_sql(query, connection)

df1.head()

## Accessing the last dataset

In [None]:
# Loading the last dataset.

df3 = pd.read_csv('LP2_Telco-churn-last-2000.csv')

In [None]:
# Displaying the first five rows of the last dataset.

df3.head()

In [None]:
# Inspecting the columns of the first dataset.

df1.columns

In [None]:
# Inspecting the columns of the last dataset.

df3.columns

In [None]:
# Since both datasets have the same columns, they can be concatenated row-wise to form the train dataset.

train_data = pd.concat([df1, df3])

# Saving the train dataset to a new csv file.
train_data.to_csv('Train-Data.csv')

train_data.head()
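
By default, `pd.concat` keeps each frame's own row labels, so the combined frame can contain repeated index values, and `to_csv` writes that index as an extra column. The sketch below shows the effect on two toy frames standing in for `df1` and `df3` (the values are made up) and one possible way to avoid it with `ignore_index=True` and `index=False`.

```python
import pandas as pd

# Toy frames standing in for df1 and df3; values are made up.
first = pd.DataFrame({'customerID': ['a', 'b'], 'tenure': [1, 5]})
last = pd.DataFrame({'customerID': ['c', 'd'], 'tenure': [12, 3]})

# Default concat keeps each frame's own row labels, so they repeat.
stacked = pd.concat([first, last])
print(stacked.index.tolist())

# ignore_index=True relabels the rows 0..n-1.
combined = pd.concat([first, last], ignore_index=True)
print(combined.index.tolist())

# index=False leaves the row labels out of the CSV output.
csv_text = combined.to_csv(index=False)
```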

## Accessing the second dataset (test dataset)

In [None]:
# Loading the test dataset.

test_data = pd.read_excel('Telco-churn-second-2000.xlsx')
test_data.head()

## EDA

In [None]:
# Checking the number of rows and columns on the train dataset.

train_data.shape

In [None]:
# Checking the number of rows and columns on the test dataset.

test_data.shape

The train dataset has 5043 rows and 21 columns, while the test dataset has 2000 rows and 20 columns. Let's identify the column in the train dataset that is absent from the test dataset.

In [None]:
# Inspecting the columns of the train dataset.

train_data.columns

In [None]:
# Inspecting the columns of the test dataset.

test_data.columns

As can be seen, the test dataset has the same columns as the train dataset, with the exception of the churn column, which is found only in the train dataset. This is expected: the churn column records whether each customer churned, and it is the information used to build the best ML model. The test dataset does not need this column; instead, the model is run on the test dataset to check its ability to predict whether a customer will churn or not.
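
The same comparison can be made programmatically rather than by reading the two column listings. A minimal sketch on toy frames follows; the column subsets are made up, and `Index.difference` returns the labels present in one index but not in the other.

```python
import pandas as pd

# Toy frames with made-up column subsets.
train = pd.DataFrame(columns=['customerID', 'tenure', 'Churn'])
test = pd.DataFrame(columns=['customerID', 'tenure'])

# Columns present in one frame but not in the other.
only_in_train = train.columns.difference(test.columns).tolist()
only_in_test = test.columns.difference(train.columns).tolist()

print(only_in_train, only_in_test)
```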

In [None]:
# Checking the statistical data on the train dataset.

train_data.describe()

In [None]:
# Checking the statistical data on the test dataset.

test_data.describe()

In [None]:
# Checking the datatypes and the presence of missing values on the train dataset.

train_data.info()

In [None]:
# Checking the datatypes and the presence of missing values on the test dataset.

test_data.info()

Note that the datatype of the 'SeniorCitizen' column is an object on the train data but an integer on the test data.
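
Datatype mismatches like this one can be listed for all shared columns at once instead of comparing the two `info()` outputs by eye. The sketch below uses toy frames with made-up values; only the `SeniorCitizen` column name comes from the note above, and `tenure` is an assumed second column.

```python
import pandas as pd

# Toy frames: SeniorCitizen is object in one and integer in the other.
train = pd.DataFrame({
    'SeniorCitizen': pd.Series([True, 0], dtype='object'),
    'tenure': [1, 24],
})
test = pd.DataFrame({'SeniorCitizen': [1, 0], 'tenure': [3, 40]})

# Put the dtypes of the shared columns side by side.
shared = train.columns.intersection(test.columns)
dtypes = pd.DataFrame({
    'train': train[shared].dtypes.astype(str),
    'test': test[shared].dtypes.astype(str),
})

# Keep only the columns whose dtypes differ.
mismatched = dtypes[dtypes['train'] != dtypes['test']]
print(mismatched)
```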

In [None]:
# Confirming the number of cells with missing values on each column of the train dataset.

train_data.isna().sum()

The 'MultipleLines' column has 269 missing values. The 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV' and 'StreamingMovies' columns all have 651 missing values. This needs to be evaluated further to find out if these missing values are exactly on the same rows. The 'TotalCharges' column has 5 missing values while the 'Churn' column has 1 missing value.
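
One possible way to check whether the missing values fall on exactly the same rows is to compare the number of rows where any of the columns is missing with the number where all of them are missing. If the two counts are equal, the gaps coincide. The sketch below uses a toy frame with made-up values and only three of the columns named above.

```python
import pandas as pd

# Toy frame: rows 1 and 3 are missing in every column.
toy = pd.DataFrame({
    'OnlineSecurity': ['Yes', None, 'No', None],
    'OnlineBackup': ['No', None, 'Yes', None],
    'TechSupport': ['No', None, 'No', None],
})
cols = ['OnlineSecurity', 'OnlineBackup', 'TechSupport']

missing = toy[cols].isna()
rows_any = int(missing.any(axis=1).sum())  # rows with at least one gap
rows_all = int(missing.all(axis=1).sum())  # rows where every column is a gap

# Equal counts mean the gaps sit on the same rows.
same_rows = rows_any == rows_all
```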

In [None]:
# Confirming that there are no missing values on the test dataset.

test_data.isna().sum()

There are no missing values on the test dataset.

In [None]:
# Checking for the presence of duplicates on the train dataset.

train_data.duplicated().sum()

There are no duplicate rows on the train dataset.

In [None]:
# Checking for the presence of duplicates on the test dataset.

test_data.duplicated().sum()

There are no duplicate rows on the test dataset.

## Data Transformation

In [None]:
# Investigating the columns on the train dataset.

train_data.columns
for column in train_data.columns:
    print('column: {} - unique value: {}'.format(column, train_data[column].unique()))

In [None]:
# Replacing True with 'Yes' and replacing False, None, 'No internet service' and 'No phone service' with 'No'
# in the train dataset.

train_data = train_data.replace({
    True: 'Yes',
    False: 'No',
    None: 'No',
    'No internet service': 'No',
    'No phone service': 'No'
}, inplace = False)

In [None]:
# Confirming that the changes have been effected.

train_data.columns
for column in train_data.columns:
    print('column: {} - unique value: {}'.format(column, train_data[column].unique()))

In [None]:
# Checking for missing values.

train_data.isna().sum()

In [None]:
# Investigating the columns on the test dataset.

test_data.columns
for column in test_data.columns:
    print('column: {} - unique value: {}'.format(column, test_data[column].unique()))

In [None]:
# Replacing 1 with 'Yes' and replacing 0, 'No internet service' and 'No phone service' with 'No' in the test dataset.

test_data = test_data.replace({
    1: 'Yes',
    0: 'No',
    'No internet service': 'No',
    'No phone service': 'No'
}, inplace = False)
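
Because the replacement above is applied to the whole DataFrame, every numeric column containing 0 or 1 is affected, not only the 0/1 flag column. The sketch below shows the effect on a toy frame and a column-scoped alternative using `Series.map`. That a column such as `tenure` holds the values 0 or 1 in the real test dataset is an assumption, and the toy values are made up.

```python
import pandas as pd

# Toy frame: tenure happens to contain the value 1.
toy = pd.DataFrame({'SeniorCitizen': [0, 1], 'tenure': [1, 24]})

# Frame-wide replacement also rewrites the tenure value 1.
frame_wide = toy.replace({1: 'Yes', 0: 'No'})
print(frame_wide['tenure'].tolist())

# Column-scoped mapping leaves the other columns untouched.
scoped = toy.copy()
scoped['SeniorCitizen'] = scoped['SeniorCitizen'].map({1: 'Yes', 0: 'No'})
print(scoped['tenure'].tolist())
```

Checking the unique values of the numeric columns after the replacement is a quick way to confirm whether this applies to the real data.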

In [None]:
# Confirming that the changes have been effected.

test_data.columns
for column in test_data.columns:
    print('column: {} - unique value: {}'.format(column, test_data[column].unique()))

In [None]:
# Checking for missing values.

test_data.isna().sum()

Total Charges