<p style="color: red; font-size: 16pt; font-weight: bold; text-align:center;">Change the name of this notebook before you edit!</p>

# Telecom Data

Source: https://www.kaggle.com/code/manishpuraswani/telecom-data-lr/input

In [1]:
! ls -lh /data/IFI8410/telecom/

total 1.2M
-rw-r--r--. 1 pmolnar pmolnar    480K Oct 23  2023 churn_data.csv
-rw-r--r--. 1 pmolnar pmolnar    185K Oct 23  2023 customer_data.csv
-rw-r--r--. 1 pmolnar pmolnar    456K Oct 23  2023 internet_data.csv
-rw-r--r--. 1 pmolnar united2024  162 Jan 30  2024 README.md


# Setup

In [2]:
%reload_ext autoreload
%autoreload 2

import sys
import os
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load data

In [3]:
customer_df = pd.read_csv('/data/IFI8410/telecom/customer_data.csv')
print(f"Number of customer records: {customer_df.shape[0]:,}")

Number of customer records: 7,042


In [4]:
internet_df = pd.read_csv('/data/IFI8410/telecom/internet_data.csv')
print(f"Number of internet records: {internet_df.shape[0]:,}")

Number of internet records: 7,042


In [5]:
churn_df = pd.read_csv('/data/IFI8410/telecom/churn_data.csv')
print(f"Number of churn records: {churn_df.shape[0]:,}")

Number of churn records: 7,042


## What does the data look like?

In [6]:
customer_df.head()

,customerID,gender,SeniorCitizen,Partner,Dependents
0,7590-VHVEG,Female,0,Yes,No
1,5575-GNVDE,Male,0,No,No
2,3668-QPYBK,Male,0,No,No
3,7795-CFOCW,Male,0,No,No
4,9237-HQITU,Female,0,No,No


In [7]:
internet_df.head()

,customerID,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies
0,7590-VHVEG,No phone service,DSL,No,Yes,No,No,No,No
1,5575-GNVDE,No,DSL,Yes,No,Yes,No,No,No
2,3668-QPYBK,No,DSL,Yes,Yes,No,No,No,No
3,7795-CFOCW,No phone service,DSL,Yes,No,Yes,Yes,No,No
4,9237-HQITU,No,Fiber optic,No,No,No,No,No,No


In [8]:
churn_df.head()

,customerID,tenure,PhoneService,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,1,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,34,Yes,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,2,Yes,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,45,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,2,Yes,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


## More Details

In [9]:
customer_df.dtypes

customerID       object
gender           object
SeniorCitizen     int64
Partner          object
Dependents       object
dtype: object

In [10]:
internet_df.dtypes

customerID          object
MultipleLines       object
InternetService     object
OnlineSecurity      object
OnlineBackup        object
DeviceProtection    object
TechSupport         object
StreamingTV         object
StreamingMovies     object
dtype: object

In [11]:
churn_df.dtypes

customerID           object
tenure                int64
PhoneService         object
Contract             object
PaperlessBilling     object
PaymentMethod        object
MonthlyCharges      float64
TotalCharges         object
Churn                object
dtype: object

The data needs some cleaning:

`TotalCharges` is stored as a string (`object`), but it should be a numerical value like `MonthlyCharges`.
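One possible way to do this conversion is `pd.to_numeric` with `errors="coerce"`. The sketch below uses a small made-up table (`toy_df` and its values are assumptions, not rows from `churn_df`) and assumes that the column is an `object` because some entries cannot be parsed as numbers.

```python
import pandas as pd

# Toy stand-in for churn_df; the IDs and values are made up for illustration.
toy_df = pd.DataFrame({
    "customerID": ["A-1", "B-2", "C-3"],
    "TotalCharges": ["29.85", " ", "1889.5"],
})

# errors="coerce" turns anything that cannot be parsed into NaN instead of raising.
toy_df["TotalCharges_num"] = pd.to_numeric(toy_df["TotalCharges"], errors="coerce")

# Rows that failed to parse; worth inspecting before dropping or imputing them.
bad_rows = toy_df[toy_df["TotalCharges_num"].isna()]

print(toy_df.dtypes)
print(bad_rows)
```

Keeping the converted values in a separate column makes it easy to see which original strings were the problem.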

In [12]:
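# Join the internet and churn features on the shared customer key.
# Note: reset_index() without drop=True keeps the old index as an extra
# 'index' column, which is why the result has 18 columns instead of 17.
jdf = pd.merge(internet_df, churn_df, on='customerID', how='inner') \
    .reset_index()
print(jdf.shape)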

(7042, 18)
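The inner join keeps all 7,042 rows, which suggests that every `customerID` appears in both tables. A minimal sketch of how one might check this explicitly, and how `merge()` relates to `join()`, is shown below. The tables `left` and `right` are made-up examples, not the notebook's data.

```python
import pandas as pd

# Toy tables with made-up IDs: "C-3" is missing on the right, "D-4" on the left.
left = pd.DataFrame({"customerID": ["A-1", "B-2", "C-3"],
                     "InternetService": ["DSL", "Fiber optic", "No"]})
right = pd.DataFrame({"customerID": ["A-1", "B-2", "D-4"],
                      "Contract": ["Month-to-month", "One year", "Two year"]})

# Inner join: only keys present in both tables survive.
inner = pd.merge(left, right, on="customerID", how="inner")

# Outer join: all keys survive; indicator=True adds a '_merge' column
# that says whether a row came from 'left_only', 'right_only', or 'both'.
outer = pd.merge(left, right, on="customerID", how="outer", indicator=True)
unmatched = outer[outer["_merge"] != "both"]

# join() matches on the index, so the key has to be moved into the index first.
joined = left.set_index("customerID").join(right.set_index("customerID"), how="inner")

print(inner.shape, outer.shape, joined.shape)
print(unmatched)
```

If `unmatched` is empty for the real tables, the inner and outer joins return the same rows.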


Let's compare features from the two tables, `InternetService` (from `internet_df`) and `PaymentMethod` (from `churn_df`):

In [13]:
jdf.groupby(['InternetService', 'PaymentMethod'])['customerID'].count()

InternetService  PaymentMethod            
DSL              Bank transfer (automatic)     566
                 Credit card (automatic)       594
                 Electronic check              648
                 Mailed check                  613
Fiber optic      Bank transfer (automatic)     645
                 Credit card (automatic)       597
                 Electronic check             1595
                 Mailed check                  258
No               Bank transfer (automatic)     332
                 Credit card (automatic)       331
                 Electronic check              122
                 Mailed check                  741
Name: customerID, dtype: int64

In [14]:
pd.pivot_table(jdf, 
               index='PaymentMethod', 
               columns='InternetService', 
               values='customerID',
               aggfunc='count')

PaymentMethod / InternetService,DSL,Fiber optic,No
Bank transfer (automatic),566,645,332
Credit card (automatic),594,597,331
Electronic check,648,1595,122
Mailed check,613,258,741
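The grouped counts and the pivot table hold the same numbers, once in long form and once in wide form. The sketch below, on a small made-up table (`toy` is an assumption, not `jdf`), shows how the long form can be reshaped with `unstack()`, and how `pd.crosstab` can give column-wise shares, which are often easier to compare than raw counts.

```python
import pandas as pd

# Toy stand-in for jdf; all rows are made up for illustration.
toy = pd.DataFrame({
    "customerID": ["A-1", "B-2", "C-3", "D-4", "E-5"],
    "InternetService": ["DSL", "DSL", "Fiber optic", "Fiber optic", "No"],
    "PaymentMethod": ["Mailed check", "Electronic check", "Electronic check",
                      "Electronic check", "Mailed check"],
})

# Long form: one count per (PaymentMethod, InternetService) pair.
long_counts = toy.groupby(["PaymentMethod", "InternetService"])["customerID"].count()

# Wide form: move InternetService into the columns; missing pairs become 0.
wide_counts = long_counts.unstack("InternetService", fill_value=0)

# crosstab builds the same wide table directly from the two columns.
xtab = pd.crosstab(toy["PaymentMethod"], toy["InternetService"])

# normalize="columns" gives the share of each payment method within a service type.
shares = pd.crosstab(toy["PaymentMethod"], toy["InternetService"], normalize="columns")

print(wide_counts)
print(shares.round(2))
```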


## Deep Dive:
- Selecting rows and columns
- Joining tables vs extending/concatenating
    - inner, outer
    - merge() vs join()
- Adding new columns, copy sub-table
- Loading and saving data (why `index=None`?)
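
Several of these topics fit into one short workflow: select rows and columns, copy the sub-table, add a column, then save and reload. The sketch below uses a made-up table (`toy`, `short_tenure`, and `ApproxTotal` are hypothetical names) and writes to an in-memory string instead of a file. It uses `index=False`, the documented spelling; `index=None` is falsy and should have the same effect.

```python
import io
import pandas as pd

# Toy stand-in for churn_df; all values are made up for illustration.
toy = pd.DataFrame({
    "customerID": ["A-1", "B-2", "C-3"],
    "tenure": [1, 34, 2],
    "MonthlyCharges": [29.85, 56.95, 53.85],
    "Churn": ["No", "No", "Yes"],
})

# Select rows with a boolean mask and columns by label.
# .copy() makes an independent sub-table, so edits do not touch `toy`.
short_tenure = toy.loc[toy["tenure"] < 12,
                       ["customerID", "tenure", "MonthlyCharges"]].copy()

# Add a new column computed from existing ones.
short_tenure["ApproxTotal"] = short_tenure["tenure"] * short_tenure["MonthlyCharges"]

# Save once with the row index (the default) and once without it.
csv_with_index = short_tenure.to_csv()
csv_without_index = short_tenure.to_csv(index=False)

# Reload both versions; the saved index comes back as an extra, unnamed column.
reloaded_with = pd.read_csv(io.StringIO(csv_with_index))
reloaded_without = pd.read_csv(io.StringIO(csv_without_index))

print(reloaded_with.columns.tolist())
print(reloaded_without.columns.tolist())
```

Without `index=False`, every save and reload cycle adds another `Unnamed: 0` column holding the old row numbers, which is usually not wanted.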
