# IART Project - Supervised Learning

### [Sample Telco Customer Churn Dataset](https://www.kaggle.com/datasets/easonlai/sample-telco-customer-churn-dataset)

 Developed by:  
 Carlos Gomes – up201906622  
 Domingos Santos – up201906680  
 Filipe Pinto – up201907747  

 

## Table of contents

1. [Introduction](#Introduction)

2. [Specification](#Specification)

3. [Required libraries](#Required-libraries)

4. [The problem domain](#The-problem-domain)

5. [Step 1: Answering the question](#Step-1:-Answering-the-question)

6. [Step 2: Checking the data](#Step-2:-Checking-the-data)

7. [Step 3: Tidying the data](#Step-3:-Tidying-the-data)

    - [Bonus: Testing our data](#Bonus:-Testing-our-data)

8. [Step 4: Exploratory analysis](#Step-4:-Exploratory-analysis)

9. [Step 5: Classification](#Step-5:-Classification)

    - [Cross-validation](#Cross-validation)

    - [Parameter tuning](#Parameter-tuning)

10. [Step 6: Reproducibility](#Step-6:-Reproducibility)

11. [Conclusions](#Conclusions)

12. [Further reading](#Further-reading)

13. [Acknowledgements](#Acknowledgements)

## Introduction
[[ go back to the top ]](#Table-of-contents)    

The goal of this supervised learning project is to classify examples of the concept under analysis using several learning algorithms, and then to compare those algorithms using appropriate evaluation metrics and the time each one takes to train and test. We apply this process to the [Sample Telco Customer Churn Dataset](https://www.kaggle.com/datasets/easonlai/sample-telco-customer-churn-dataset).

## Specification
[[ go back to the top ]](#Table-of-contents)  

Retaining customers is important for a company that wants to maintain or increase its profit, so being able to predict customer behaviour can be very useful.

Given a dataset with information about telco customers, we want to predict whether a customer will churn.

In other words, the main goal of this project is to predict whether a customer will stop buying the telco's products or services.

## Required libraries

[[ go back to the top ]](#Table-of-contents)

If you don't have Python on your computer, you can use the [Anaconda Python distribution](http://continuum.io/downloads) to install most of the Python packages you need. Anaconda provides a simple double-click installer for your convenience.

This notebook uses several Python packages that come standard with the Anaconda Python distribution. The primary libraries that we'll be using are:

* **NumPy**: Provides a fast numerical array structure and helper functions.
* **pandas**: Provides a DataFrame structure to store data in memory and work with it easily and efficiently.
* **scikit-learn**: The essential Machine Learning package in Python.
* **matplotlib**: Basic plotting library in Python; most other Python plotting libraries are built on top of it.
* **Seaborn**: Advanced statistical plotting library.

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
#TODO
#from sklearn.[x] import [y]

In [4]:
# This line tells the notebook to show plots inside of the notebook
%matplotlib inline

## Data Pre-processing

[[ go back to the top ]](#Table-of-contents)  

There are two datasets available: we call the larger one (7011 entries) the main dataset and the smaller one (21 entries) the test dataset.
Before implementing any method we need to analyse and pre-process the data.
We first load the data. Then we print some information about the dataset, such as summary statistics and possible missing or repeated values, to get used to how the information is organized and to decide what our next step should be. The list below describes each column in the dataset.

Column Description:
- customerID: A unique ID that identifies each customer.
- gender: The customer’s gender: Male (1), Female (0).
- SeniorCitizen: Indicates if the customer is 65 or older: No (0), Yes (1).
- Partner: Service contract is resold by the partner: No (0), Yes (1).
- Dependents: Indicates if the customer lives with any dependents: No (0), Yes (1).
- Tenure: Indicates the total amount of months that the customer has been with the company.
- PhoneService: Indicates if the customer subscribes to home phone service with the company: No (0), Yes (1).
- MultipleLines: Indicates if the customer subscribes to multiple telephone lines with the company: No (0), Yes (1).
- InternetService: Indicates if the customer subscribes to Internet service with the company: No (0), DSL (1), Fiber optic (2).
- OnlineSecurity: Indicates if the customer subscribes to an additional online security service provided by the company: No (0), Yes (1), NA (2).
- OnlineBackup: Indicates if the customer subscribes to an additional online backup service provided by the company: No (0), Yes (1), NA (2).
- DeviceProtection: Indicates if the customer subscribes to an additional device protection plan for their Internet equipment provided by the company: No (0), Yes (1), NA (2).
- TechSupport: Indicates if the customer subscribes to an additional technical support plan from the company with reduced wait times: No (0), Yes (1), NA (2).
- StreamingTV: Indicates if the customer uses their Internet service to stream television programming from a third party provider: No (0), Yes (1), NA (2). The company does not charge an additional fee for this service.
- StreamingMovies: Indicates if the customer uses their Internet service to stream movies from a third party provider: No (0), Yes (1), NA (2). The company does not charge an additional fee for this service.
- Contract: Indicates the customer’s current contract type: Month-to-Month (0), One Year (1), Two Year (2).
- PaperlessBilling: Indicates if the customer has chosen paperless billing: No (0), Yes (1).
- PaymentMethod: Indicates how the customer pays their bill: Bank transfer - automatic (0), Credit card - automatic (1), Electronic cheque (2), Mailed cheque (3).
- MonthlyCharges: Indicates the customer’s current total monthly charge for all their services from the company.
- TotalCharges: Indicates the customer’s total charges.
- Churn: Indicates if the customer churned: No (0), Yes (1).
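
Because most columns are stored as integer codes, it can help to turn them back into labels when reading tables or plots. The snippet below is a minimal sketch of one way to do that: the mappings are copied from the column descriptions above, while the `LABELS` dictionary, the `decode_columns` helper and the example rows are hypothetical names and values introduced only for illustration.

```python
import pandas as pd

# Code-to-label mappings copied from the column descriptions above.
LABELS = {
    "InternetService": {0: "No", 1: "DSL", 2: "Fiber optic"},
    "Contract": {0: "Month-to-Month", 1: "One Year", 2: "Two Year"},
    "PaymentMethod": {
        0: "Bank transfer (automatic)",
        1: "Credit card (automatic)",
        2: "Electronic cheque",
        3: "Mailed cheque",
    },
}


def decode_columns(df, labels=LABELS):
    """Return a copy of df with the coded columns replaced by text labels."""
    decoded = df.copy()
    for column, mapping in labels.items():
        if column in decoded.columns:
            decoded[column] = decoded[column].map(mapping)
    return decoded


# Hypothetical rows, for illustration only (not taken from the dataset).
example = pd.DataFrame({
    "InternetService": [2, 0, 1],
    "Contract": [0, 2, 1],
    "PaymentMethod": [3, 2, 0],
})
decoded_example = decode_columns(example)
print(decoded_example)
```

The helper works on a copy, so the integer-coded frame stays available for the learning algorithms.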



In [21]:
# Loading Data
data = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn_R2_Test.csv')
data.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,Tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,1122-JWTJW,1,0,1,1,1,1,0,2,0,...,0,0,0,0,0,1,3,70.65,70.65,1
1,9710-NJERN,0,0,0,0,39,1,0,0,2,...,2,2,2,2,2,0,3,20.15,826.0,0
2,9837-FWLCH,1,0,1,1,12,1,0,0,2,...,2,2,2,2,0,1,2,19.2,239.0,0
3,1699-HPSBG,1,0,0,0,12,1,0,1,0,...,0,1,1,0,1,1,2,59.8,727.8,1
4,7203-OYKCT,1,0,0,0,72,1,1,2,0,...,1,0,1,1,1,1,2,104.95,7544.3,0


In [36]:
# Data information (number of non-null values and data type)

data.info()
print("\n-----------------------------------------------")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21 entries, 0 to 20
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        21 non-null     object 
 1   gender            21 non-null     int64  
 2   SeniorCitizen     21 non-null     int64  
 3   Partner           21 non-null     int64  
 4   Dependents        21 non-null     int64  
 5   Tenure            21 non-null     int64  
 6   PhoneService      21 non-null     int64  
 7   MultipleLines     21 non-null     int64  
 8   InternetService   21 non-null     int64  
 9   OnlineSecurity    21 non-null     int64  
 10  OnlineBackup      21 non-null     int64  
 11  DeviceProtection  21 non-null     int64  
 12  TechSupport       21 non-null     int64  
 13  StreamingTV       21 non-null     int64  
 14  StreamingMovies   21 non-null     int64  
 15  Contract          21 non-null     int64  
 16  PaperlessBilling  21 non-null     int64  
 17 

In [38]:
# Number of null values (for each column and total)

data.isna().sum()

customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
Tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64

In [39]:
# Repeated values
data['customerID'].unique().size

21

As we can see, there are no missing values in our dataset, and there are no duplicated customerID values, because the number of unique IDs (21) equals the number of rows.
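
The same check can be written more explicitly with `Series.is_unique` and `DataFrame.duplicated`, which also report which IDs repeat. This is only a sketch: the `duplicate_report` helper and the three example rows are hypothetical and are not part of the dataset.

```python
import pandas as pd


def duplicate_report(df, key="customerID"):
    """Summarise repeated keys and fully duplicated rows in df."""
    repeated = df.loc[df[key].duplicated(), key].unique()
    return {
        "key_is_unique": bool(df[key].is_unique),
        "duplicated_keys": sorted(repeated),
        "duplicated_rows": int(df.duplicated().sum()),
    }


# Hypothetical frame in which one ID appears twice (illustration only).
example = pd.DataFrame({
    "customerID": ["0001-AAAAA", "0002-BBBBB", "0001-AAAAA"],
    "Tenure": [1, 39, 12],
})
report = duplicate_report(example)
print(report)
```

On the real data one would call `duplicate_report(data)` and expect `key_is_unique` to be `True`.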

In [40]:
data.describe()

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,Tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
count,21.0,21.0,21.0,21.0,21.0,21.0,21.0,21.0,21.0,21.0,21.0,21.0,21.0,21.0,21.0,21.0,21.0,21.0,21.0,21.0
mean,0.52381,0.190476,0.333333,0.142857,30.809524,0.904762,0.380952,1.238095,0.571429,0.714286,0.714286,0.666667,0.857143,0.714286,0.52381,0.761905,1.571429,64.652381,2176.571429,0.238095
std,0.511766,0.402374,0.483046,0.358569,25.910652,0.300793,0.497613,0.768424,0.810643,0.783764,0.783764,0.795822,0.727029,0.783764,0.749603,0.436436,1.121224,27.9369,2351.107515,0.436436
min,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,19.2,39.25,0.0
25%,0.0,0.0,0.0,0.0,12.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,44.4,403.35,0.0
50%,1.0,0.0,0.0,0.0,19.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,2.0,69.5,1419.4,0.0
75%,1.0,0.0,1.0,0.0,55.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,84.8,3316.1,0.0
max,1.0,1.0,1.0,1.0,72.0,1.0,1.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,2.0,1.0,3.0,104.95,7544.3,1.0
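
Since `Churn` is coded 0/1, its mean in the summary above (0.238) is the share of churners, which for 21 rows corresponds to 5 customers. The sketch below shows one way to read the class balance directly; the `class_balance` helper is a hypothetical name, and the label series is built by hand to reproduce the 5-of-21 split rather than loaded from the file.

```python
import pandas as pd


def class_balance(labels):
    """Return the count and share of each class in a label column."""
    counts = labels.value_counts().sort_index()
    shares = (counts / counts.sum()).round(3)
    return pd.DataFrame({"count": counts, "share": shares})


# Hand-built labels matching the 5-of-21 split (row order is arbitrary).
churn = pd.Series([1] * 5 + [0] * 16, name="Churn")
balance = class_balance(churn)
print(balance)
```

An imbalance like this is worth keeping in mind when choosing evaluation metrics, because accuracy alone can look good for a model that always predicts "no churn".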


In [41]:
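# na_values=[2] makes read_csv treat every 2, in every column, as missing (NaN).
# That is why NaNs show up now even though isna() found none before. It also blanks
# legitimate codes such as InternetService = 2 (Fiber optic) and Contract = 2 (Two Year).
data = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn_R2_Test.csv', na_values=[2])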

In [42]:
data.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,Tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,1122-JWTJW,1,0,1,1,1.0,1,0,,0.0,...,0.0,0.0,0.0,0.0,0.0,1,3.0,70.65,70.65,1
1,9710-NJERN,0,0,0,0,39.0,1,0,0.0,,...,,,,,,0,3.0,20.15,826.0,0
2,9837-FWLCH,1,0,1,1,12.0,1,0,0.0,,...,,,,,0.0,1,,19.2,239.0,0
3,1699-HPSBG,1,0,0,0,12.0,1,0,1.0,0.0,...,0.0,1.0,1.0,0.0,1.0,1,,59.8,727.8,1
4,7203-OYKCT,1,0,0,0,72.0,1,1,,0.0,...,1.0,0.0,1.0,1.0,1.0,1,,104.95,7544.3,0
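
Passing `na_values=[2]` applies to every column, so legitimate codes are blanked too, as the `NaN` under InternetService in the first row shows (it was 2, Fiber optic, before). `read_csv` also accepts a dict that limits the NA marker to named columns. The sketch below assumes that only the six service columns described with "NA (2)" should be affected; the `NA_CODED_COLUMNS` list and the in-memory CSV are hypothetical stand-ins for the real file.

```python
import io

import pandas as pd

# Columns whose description lists "NA (2)" as a possible value.
NA_CODED_COLUMNS = [
    "OnlineSecurity", "OnlineBackup", "DeviceProtection",
    "TechSupport", "StreamingTV", "StreamingMovies",
]

# Hypothetical three-row CSV standing in for the real file (illustration only).
csv_text = """customerID,InternetService,OnlineSecurity,Contract
0001-AAAAA,2,0,0
0002-BBBBB,0,2,2
0003-CCCCC,1,1,1
"""

# Read only the header first so the dict mentions just the columns present.
header = pd.read_csv(io.StringIO(csv_text), nrows=0).columns
na_map = {column: [2] for column in NA_CODED_COLUMNS if column in header}

# With a dict, 2 becomes NaN only in the listed columns.
clean = pd.read_csv(io.StringIO(csv_text), na_values=na_map)
print(clean)
```

With this approach InternetService and Contract keep their value 2, while the 2 in OnlineSecurity is read as missing.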


In [7]:
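# Pairwise scatter plots of the numeric columns, coloured by the target column.
# DataFrame.dropna() is the correct method name, and the target here is 'Churn'
# (this dataset has no 'class' column).
sb.pairplot(data.dropna(), hue='Churn');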