<p style="color: red; font-size: 16pt; font-weight: bold; text-align:center;">Change the name of this notebook before you edit!</p>

# Telecom Data

Source: https://www.kaggle.com/code/manishpuraswani/telecom-data-lr/input

In [1]:
! ls -lh /data/IFI8410/telecom/

total 1.2M
-rw-rw-r--. 1 pmolnar ifi8410_instructor 480K Oct 23  2023 churn_data.csv
-rw-rw-r--. 1 pmolnar ifi8410_instructor 185K Oct 23  2023 customer_data.csv
-rw-rw-r--. 1 pmolnar ifi8410_instructor 456K Oct 23  2023 internet_data.csv
-rw-rw-r--. 1 pmolnar ifi8410_instructor  162 Jan 30  2024 README.md


# Setup

In [2]:
%reload_ext autoreload
%autoreload 2

import sys
import os
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load data

In [3]:
customer_df = pd.read_csv('/data/IFI8410/telecom/customer_data.csv')
print(f"Number of customer records: {customer_df.shape[0]:,}")

Number of customer records: 7,042


In [4]:
internet_df = pd.read_csv('/data/IFI8410/telecom/internet_data.csv')
print(f"Number of internet records: {internet_df.shape[0]:,}")

Number of internet records: 7,042


In [5]:
churn_df = pd.read_csv('/data/IFI8410/telecom/churn_data.csv')
print(f"Number of churn records: {churn_df.shape[0]:,}")

Number of churn records: 7,042


## What does the data look like?

In [6]:
customer_df.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents
0,7590-VHVEG,Female,0,Yes,No
1,5575-GNVDE,Male,0,No,No
2,3668-QPYBK,Male,0,No,No
3,7795-CFOCW,Male,0,No,No
4,9237-HQITU,Female,0,No,No


In [7]:
internet_df.head()

Unnamed: 0,customerID,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies
0,7590-VHVEG,No phone service,DSL,No,Yes,No,No,No,No
1,5575-GNVDE,No,DSL,Yes,No,Yes,No,No,No
2,3668-QPYBK,No,DSL,Yes,Yes,No,No,No,No
3,7795-CFOCW,No phone service,DSL,Yes,No,Yes,Yes,No,No
4,9237-HQITU,No,Fiber optic,No,No,No,No,No,No


In [8]:
churn_df.head()

Unnamed: 0,customerID,tenure,PhoneService,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,1,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,34,Yes,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,2,Yes,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,45,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,2,Yes,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


In [9]:
customer_df['gender'].unique()

array(['Female', 'Male'], dtype=object)

## More Details

In [10]:
customer_df.dtypes

customerID       object
gender           object
SeniorCitizen     int64
Partner          object
Dependents       object
dtype: object

In [11]:
internet_df.dtypes

customerID          object
MultipleLines       object
InternetService     object
OnlineSecurity      object
OnlineBackup        object
DeviceProtection    object
TechSupport         object
StreamingTV         object
StreamingMovies     object
dtype: object

In [12]:
churn_df.dtypes

customerID           object
tenure                int64
PhoneService         object
Contract             object
PaperlessBilling     object
PaymentMethod        object
MonthlyCharges      float64
TotalCharges         object
Churn                object
dtype: object

The data needs some cleaning: `TotalCharges` is stored as strings (dtype `object`) but should be a numerical column.
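One possible way to convert such a column is `pd.to_numeric` with `errors="coerce"`. The sketch below uses a small hypothetical Series, not the real file, and assumes the non-numeric entries are blank strings; anything that cannot be parsed becomes `NaN` and can be inspected afterwards.

```python
import pandas as pd

# Hypothetical sample mimicking a text column with one blank entry
total_charges = pd.Series(["29.85", "1889.5", " ", "108.15"], name="TotalCharges")

# errors="coerce" turns unparseable entries into NaN instead of raising
cleaned = pd.to_numeric(total_charges, errors="coerce")

print(cleaned.dtype)         # numeric dtype after conversion
print(cleaned.isna().sum())  # how many entries could not be parsed
```

On the real table the same call would be applied to `churn_df['TotalCharges']`, followed by a look at the rows that turned into `NaN` before deciding how to handle them.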

In [13]:
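# merge() already returns a fresh RangeIndex, so no reset_index() is needed
# (it would only add a redundant 'index' column)
jdf = pd.merge(internet_df, churn_df, on='customerID', how='inner')
print(jdf.shape)

(7042, 17)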


Let's compare features from two tables:

In [14]:
jdf.groupby(['InternetService', 'PaymentMethod'])['customerID'].count()

InternetService  PaymentMethod            
DSL              Bank transfer (automatic)     566
                 Credit card (automatic)       594
                 Electronic check              648
                 Mailed check                  613
Fiber optic      Bank transfer (automatic)     645
                 Credit card (automatic)       597
                 Electronic check             1595
                 Mailed check                  258
No               Bank transfer (automatic)     332
                 Credit card (automatic)       331
                 Electronic check              122
                 Mailed check                  741
Name: customerID, dtype: int64

In [15]:
pd.pivot_table(jdf, 
               index='PaymentMethod', 
               columns='InternetService', 
               values='customerID',
               aggfunc='count')

InternetService,DSL,Fiber optic,No
PaymentMethod,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Bank transfer (automatic),566,645,332
Credit card (automatic),594,597,331
Electronic check,648,1595,122
Mailed check,613,258,741
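The same kind of two-way count table can also be sketched with `pd.crosstab`, which additionally offers a `normalize` option for shares. The tiny frame below is hypothetical and only stands in for the merged table; `crosstab` reports 0 for combinations that never occur, whereas a count pivot table would leave them as `NaN` unless `fill_value` is set.

```python
import pandas as pd

# Hypothetical miniature of the merged table (not the real data)
mini = pd.DataFrame({
    "InternetService": ["DSL", "DSL", "Fiber optic", "Fiber optic", "No"],
    "PaymentMethod": ["Mailed check", "Electronic check", "Electronic check",
                      "Electronic check", "Mailed check"],
})

# Absolute counts per (PaymentMethod, InternetService) combination
counts = pd.crosstab(mini["PaymentMethod"], mini["InternetService"])

# Share of each payment method within every internet service (columns sum to 1)
shares = pd.crosstab(mini["PaymentMethod"], mini["InternetService"],
                     normalize="columns")

print(counts)
print(shares)
```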


## Deep Dive:
- Selecting rows and columns
- Joining tables vs extending/concatenating
    - inner, outer
    - merge() vs join()
- Adding new columns, copy sub-table
- Loading and saving data (why `index=None`?)
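A minimal sketch of the join and saving topics listed above, using two small hypothetical tables rather than the telecom files. It contrasts inner and outer merges, `merge()` on a column with `join()` on the index, `concat()` for stacking, and shows what happens to the index on a CSV round trip. The code uses `index=False`, the documented form; `index=None` is often seen to the same effect because it is also falsy.

```python
import io
import pandas as pd

left = pd.DataFrame({"customerID": ["A", "B", "C"], "tenure": [1, 34, 2]})
right = pd.DataFrame({"customerID": ["B", "C", "D"], "Churn": ["No", "Yes", "No"]})

# inner keeps only keys found in both tables; outer keeps every key, filling gaps with NaN
inner = pd.merge(left, right, on="customerID", how="inner")
outer = pd.merge(left, right, on="customerID", how="outer")

# join() aligns on the index, so the key column is moved into the index first
joined = left.set_index("customerID").join(right.set_index("customerID"), how="inner")

# concat() stacks rows without matching any keys
stacked = pd.concat([left, left], ignore_index=True)

# Writing the index adds an unnamed first column that reappears as "Unnamed: 0"
buf_with = io.StringIO()
left.to_csv(buf_with)
buf_with.seek(0)
reloaded_with = pd.read_csv(buf_with)

# Dropping the index on save gives a clean round trip
buf_without = io.StringIO()
left.to_csv(buf_without, index=False)
buf_without.seek(0)
reloaded_without = pd.read_csv(buf_without)

print(inner.shape, outer.shape, stacked.shape)
print(list(reloaded_with.columns))
print(list(reloaded_without.columns))
```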


## Arithmetic with pandas DataFrames

In [16]:
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list("abcd"))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list("abcde"))

In [17]:
df1.add(df2, fill_value=0)

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,4.0
1,9.0,11.0,13.0,15.0,9.0
2,18.0,20.0,22.0,24.0,14.0
3,15.0,16.0,17.0,18.0,19.0


In [18]:
df2.add(df1, fill_value=0)

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,4.0
1,9.0,11.0,13.0,15.0,9.0
2,18.0,20.0,22.0,24.0,14.0
3,15.0,16.0,17.0,18.0,19.0
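To see what `fill_value=0` changes, the sketch below compares the plain `+` operator with `.add(..., fill_value=0)` on the same two frames. With `+`, every cell that exists in only one frame becomes `NaN`; with `fill_value=0`, the missing side is treated as 0. A cell missing from both frames would still come out as `NaN`.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list("abcd"))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list("abcde"))

# Plain addition aligns on index and columns; non-overlapping cells become NaN
plain = df1 + df2

# fill_value=0 substitutes 0 for the side that is missing before adding
filled = df1.add(df2, fill_value=0)

print(plain.isna().sum().sum())   # number of NaN cells with the + operator
print(filled.isna().sum().sum())  # number of NaN cells with fill_value=0
```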


In [19]:
frame = pd.DataFrame(np.random.standard_normal((4, 3)),
                     columns=list("bde"),
                     index=["Utah", "Ohio", "Texas", "Oregon"])

In [20]:
frame

Unnamed: 0,b,d,e
Utah,-0.055889,-0.28446,0.117213
Ohio,-0.621866,2.615302,-1.933263
Texas,0.488745,0.470201,-0.134121
Oregon,0.267694,-0.265068,-0.299252


## Descriptive Statistics with pandas DataFrames

https://sparkbyexamples.com/pandas/calculate-summary-statistics-in-pandas/

https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/06_calculate_statistics.html

### Applying the .describe() method on DataFrame with numerical data

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html

In [21]:
frame.describe()

Unnamed: 0,b,d,e
count,4.0,4.0,4.0
mean,0.019671,0.633994,-0.562356
std,0.48264,1.366782,0.92984
min,-0.621866,-0.28446,-1.933263
25%,-0.197383,-0.269916,-0.707754
50%,0.105902,0.102566,-0.216686
75%,0.322957,1.006476,-0.071287
max,0.488745,2.615302,0.117213


### Calculating other statistical measures

In [22]:
frame['mean'] = frame.apply('mean', axis='columns')
frame

Unnamed: 0,b,d,e,mean
Utah,-0.055889,-0.28446,0.117213,-0.074379
Ohio,-0.621866,2.615302,-1.933263,0.020058
Texas,0.488745,0.470201,-0.134121,0.274942
Oregon,0.267694,-0.265068,-0.299252,-0.098875


In [23]:
frame['b'].mean()

0.01967094498311696

In [24]:
frame['b'].corr(frame['d'])

-0.7587208919372784

In [25]:
frame['b'].corr(frame['e'])

0.8127760137088692

In [26]:
frame['b'].cov(frame['d'])

-0.5005006724140456
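Correlation and covariance are tied together: the Pearson correlation is the covariance divided by the product of the two standard deviations. The sketch below checks this on illustrative random data with a fixed seed (the seed and the frame name `demo` are assumptions for the example only).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, illustrative data only
demo = pd.DataFrame(rng.standard_normal((50, 2)), columns=["b", "d"])

cov_bd = demo["b"].cov(demo["d"])    # sample covariance (ddof=1)
corr_bd = demo["b"].corr(demo["d"])  # Pearson correlation

# Rebuild the correlation from the covariance and the sample standard deviations
corr_from_cov = cov_bd / (demo["b"].std() * demo["d"].std())

print(np.isclose(corr_bd, corr_from_cov))
```

The numbers printed earlier fit this relation as well: -0.5005 / (0.48264 × 1.366782) is approximately -0.7587.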

In [27]:
def f1(x):
    return x.max() - x.min()

In [28]:
frame.apply(f1)

b       1.110610
d       2.899762
e       2.050475
mean    0.373817
dtype: float64

In [29]:
frame2 = frame.copy()

In [30]:
frame2['max_min'] = frame2.apply(f1, axis="columns")

In [31]:
frame2

Unnamed: 0,b,d,e,mean,max_min
Utah,-0.055889,-0.28446,0.117213,-0.074379,0.401673
Ohio,-0.621866,2.615302,-1.933263,0.020058,4.548565
Texas,0.488745,0.470201,-0.134121,0.274942,0.622865
Oregon,0.267694,-0.265068,-0.299252,-0.098875,0.566946


In [32]:
frame['max_min'] = frame.apply(f1, axis="columns")

In [33]:
frame

Unnamed: 0,b,d,e,mean,max_min
Utah,-0.055889,-0.28446,0.117213,-0.074379,0.401673
Ohio,-0.621866,2.615302,-1.933263,0.020058,4.548565
Texas,0.488745,0.470201,-0.134121,0.274942,0.622865
Oregon,0.267694,-0.265068,-0.299252,-0.098875,0.566946
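One thing to watch in the cells above: once `mean` has been added as a column, later calls to `apply` treat it like any other column. That is harmless for max minus min, because a row mean always lies between the row minimum and maximum, but other derived columns could distort the result. A sketch that restricts the computation to the original value columns (the names `value_cols` and `value_range` are hypothetical):

```python
import pandas as pd

frame = pd.DataFrame(
    {"b": [1.0, 4.0], "d": [2.0, 0.0], "e": [3.0, 5.0]},
    index=["Utah", "Ohio"],
)

def value_range(x):
    # Spread between the largest and smallest value
    return x.max() - x.min()

value_cols = ["b", "d", "e"]  # only the measured columns, not derived ones

# Default axis: the function receives each column, giving one value per column
per_column = frame[value_cols].apply(value_range)

# axis="columns": the function receives each row, giving one value per row
frame["mean"] = frame[value_cols].mean(axis="columns")
frame["max_min"] = frame[value_cols].apply(value_range, axis="columns")

print(per_column)
print(frame)
```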


### Applying .describe() method on DataFrame with categorical features

https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html

In [55]:
catframe = pd.DataFrame({"a": ["Yes", "Yes", "No", "No", "Yes", "Yes"], 
                         "b": ["Good", "Better", "Bad", "Bad", "Worse", "Good"], 
                         "c": ["Claude", "Maria", "Maria", "George", "Luisa", "Kurt"], 
                         })
catframe.describe()

Unnamed: 0,a,b,c
count,6,6,6
unique,2,4,5
top,Yes,Good,Maria
freq,4,2,2


In [40]:
catframe.mode()

Unnamed: 0,a,b,c
0,Yes,Bad,Maria
1,,Good,


In [41]:
catframe.value_counts()

a    b       c     
No   Bad     George    1
             Maria     1
Yes  Better  Maria     1
     Good    Claude    1
             Kurt      1
     Worse   Luisa     1
Name: count, dtype: int64
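Called on a whole DataFrame, `value_counts()` counts unique combinations of all columns, which is why every count above is 1. For per-column frequencies, or for combinations of just a few columns, something like the following sketch may be closer to what is wanted (`per_column` and `pairs` are names chosen for this example):

```python
import pandas as pd

catframe = pd.DataFrame({"a": ["Yes", "Yes", "No", "No", "Yes", "Yes"],
                         "b": ["Good", "Better", "Bad", "Bad", "Worse", "Good"],
                         "c": ["Claude", "Maria", "Maria", "George", "Luisa", "Kurt"],
                         })

# Frequencies of each value, one Series per column
per_column = {col: catframe[col].value_counts() for col in catframe.columns}

# Frequencies of value combinations for a chosen subset of columns
pairs = catframe.value_counts(subset=["a", "b"])

print(per_column["a"])
print(pairs)
```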

### Creating a custom .describe() method with the .agg() (aggregate) method

https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html

In [42]:
from functools import partial

In [43]:
tsdf = pd.DataFrame(
    np.random.randn(1000, 3),
    columns=["A", "B", "C"],
    index=pd.date_range("1/1/2000", periods=1000),
)

In [44]:
tsdf.describe(percentiles=[0.05, 0.25, 0.75, 0.95])

Unnamed: 0,A,B,C
count,1000.0,1000.0,1000.0
mean,-0.030197,-0.021609,0.004424
std,1.048531,0.988639,1.005783
min,-3.645911,-2.43032,-3.843037
5%,-1.742221,-1.667552,-1.611298
25%,-0.744874,-0.704218,-0.699555
50%,-0.016654,-0.037644,0.030053
75%,0.649187,0.644884,0.719921
95%,1.662401,1.65965,1.612643
max,3.957197,3.374284,3.414863


In [46]:
q_25 = partial(pd.Series.quantile, q=0.25)
q_25.__name__ = "25%"

q_75 = partial(pd.Series.quantile, q=0.75)
q_75.__name__ = "75%"

def na_func(series):
    return series.isna().sum()

def na_percent(series):
    # Fraction of missing values; divide by the full length because count() skips NaN
    return na_func(series) / len(series)

def cardinality(series):
    return series.nunique()     

In [47]:
# Numerical data:
tsdf.agg(["count", na_percent, cardinality, "min", q_25, "mean", "median", q_75, "max", "std"])

Unnamed: 0,A,B,C
count,1000.0,1000.0,1000.0
na_percent,0.0,0.0,0.0
cardinality,1000.0,1000.0,1000.0
min,-3.645911,-2.43032,-3.843037
25%,-0.744874,-0.704218,-0.699555
mean,-0.030197,-0.021609,0.004424
median,-0.016654,-0.037644,0.030053
75%,0.649187,0.644884,0.719921
max,3.957197,3.374284,3.414863
std,1.048531,0.988639,1.005783
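The random frame above has no missing values, so the missing-value row is all zeros. The sketch below runs a similar custom aggregation on a small hypothetical frame that does contain `NaN`. It assumes the share of missing values should be measured against the total number of rows, which `isna().mean()` does directly, and it uses plain named functions instead of `partial` so that the row labels come from the function names.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing values
df = pd.DataFrame({"A": [1.0, np.nan, 3.0, 4.0],
                   "B": [np.nan, np.nan, 2.0, 2.0]})

def na_fraction(series):
    # Share of missing entries relative to all rows
    return series.isna().mean()

def q_25(series):
    # First quartile of the non-missing values
    return series.quantile(0.25)

summary = df.agg(["count", na_fraction, q_25, "max"])
print(summary)
```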


In [48]:
def mode_1st(series):
    return series.value_counts().sort_values(ascending=False).index[0]
 
def mode_1st_freq(series):
    mode = mode_1st(series)
    return series[series == mode].count()
    
def mode_1st_percent(series):
    return mode_1st_freq(series) / series.count()
    
def mode_2nd(series):
    return series.value_counts().sort_values(ascending=False).index[1]
 
def mode_2nd_freq(series):
    mode = mode_2nd(series)
    return series[series == mode].count()   
 
def mode_2nd_percent(series):
    return mode_2nd_freq(series) / series.count()    

In [49]:
# Categorical data:
catframe.agg(["count", na_percent, cardinality, 
              mode_1st, mode_1st_freq, mode_1st_percent, 
              mode_2nd, mode_2nd_freq, mode_2nd_percent])

Unnamed: 0,a,b,c
count,6,6,6
na_percent,0.0,0.0,0.0
cardinality,2,4,5
mode_1st,Yes,Good,Maria
mode_1st_freq,4,2,2
mode_1st_percent,0.666667,0.333333,0.333333
mode_2nd,No,Bad,Claude
mode_2nd_freq,2,2,1
mode_2nd_percent,0.333333,0.333333,0.166667


### Other available stats packages

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.describe.html#scipy.stats.describe

https://www.statsmodels.org/stable/generated/statsmodels.stats.descriptivestats.Description.html#statsmodels.stats.descriptivestats.Description

In [50]:
from scipy import stats

stats.describe(tsdf)

DescribeResult(nobs=1000, minmax=(array([-3.64591131, -2.43032006, -3.84303678]), array([3.9571969 , 3.37428437, 3.41486307])), mean=array([-0.0301975 , -0.02160931,  0.00442364]), variance=array([1.09941722, 0.97740767, 1.01159897]), skewness=array([ 0.02583128,  0.16174955, -0.08232011]), kurtosis=array([ 0.0593336 , -0.00525772,  0.00547354]))

In [57]:
from statsmodels.stats.descriptivestats import Description

Description(tsdf).numeric_statistics

('nobs',
 'missing',
 'mean',
 'std_err',
 'ci',
 'std',
 'iqr',
 'iqr_normal',
 'mad',
 'mad_normal',
 'coef_var',
 'range',
 'max',
 'min',
 'skew',
 'kurtosis',
 'jarque_bera',
 'mode',
 'median',
 'percentiles')

In [58]:
Description(
    tsdf, 
    stats=[
        'nobs', 'missing', 'distinct', 'min', 'mean', 'median', 'max', 'std',  
        'skew', 'kurtosis', 'iqr', 'percentiles', 'mode', 
    ], 
    numeric=True, 
    categorical=False, 
    alpha=0.05, 
    use_t=False, 
    percentiles=(1, 5, 10, 25, 50, 75, 90, 95, 99),
).summary()

0,1,2,3
nobs,1000.0,1000.0,1000.0
missing,0.0,0.0,0.0
min,-3.645911309275139,-2.4303200634015405,-3.8430367757874224
mean,-0.0301974957552539,-0.0216093139570072,0.0044236370453146
median,-0.0166535472219713,-0.0376443394084529,0.0300533160975775
max,3.957196904643556,3.3742843741031177,3.414863074079527
std,1.0485309798749074,0.9886393028315296,1.0057827624141855
skew,0.0258312786078056,0.161749549022368,-0.082320114612837
kurtosis,3.0593336032129432,2.994742282297342,3.005473537542445
iqr,1.3940610514694636,1.3491019953107453,1.4194759270879969
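The kurtosis values in this table are exactly 3 higher than those from `scipy.stats.describe` earlier (for example 3.0593 versus 0.0593). That pattern is consistent with one result being excess (Fisher) kurtosis, where a normal distribution scores 0, and the other being plain (Pearson) kurtosis, where it scores 3. The sketch below shows the two conventions with `scipy.stats.kurtosis` on illustrative seeded data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed, illustrative data only
x = rng.standard_normal(1000)

excess = stats.kurtosis(x, fisher=True)   # Fisher definition: normal -> 0
plain = stats.kurtosis(x, fisher=False)   # Pearson definition: normal -> 3

print(excess, plain, plain - excess)
```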


In [67]:
catnumframe = pd.DataFrame({"a": ["Yes", "Yes", "No", "No", "Yes", "Yes"], 
                            "b": ["Good", "Better", "Bad", "Bad", "Worse", "Good"], 
                            "c": ["Claude", "Maria", "Maria", "George", "Luisa", "Kurt"], 
                            "d": [0.1, 0.343, 0.56, -0.74, 0.89, -0.12]
                            })

Description(
    catnumframe, 
    numeric=True, 
    categorical=True,
).categorical_statistics

('nobs', 'missing', 'distinct', 'top', 'freq')

In [70]:
Description(
    catnumframe, 
    stats=[
        'nobs', 'missing', 'distinct', 'mode', 'top', 'freq',
    ], 
    numeric=True, 
    categorical=True,
    ntop=5
).summary()

0,1
nobs,6.0
missing,0.0
mode,-0.74
mode_freq,0.1666666666666666
