<p style="color: red; font-size: 16pt; font-weight: bold; text-align:center;">Change the name of this notebook before you edit!</p>

# Telecom Data

Source: https://www.kaggle.com/code/manishpuraswani/telecom-data-lr/input

In [1]:
! ls -lh /data/IFI8410/telecom/

ls: /data/IFI8410/telecom/: No such file or directory


# Setup

In [2]:
%reload_ext autoreload
%autoreload 2

import sys
import os
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load data

In [3]:
customer_df = pd.read_csv('/data/IFI8410/telecom/customer_data.csv')
print(f"Number of customer records: {customer_df.shape[0]:,}")

FileNotFoundError: [Errno 2] No such file or directory: '/data/IFI8410/telecom/customer_data.csv'

In [None]:
internet_df = pd.read_csv('/data/IFI8410/telecom/internet_data.csv')
print(f"Number of internet records: {internet_df.shape[0]:,}")

In [None]:
churn_df = pd.read_csv('/data/IFI8410/telecom/churn_data.csv')
print(f"Number of churn records: {churn_df.shape[0]:,}")

## What does the data look like?

In [None]:
customer_df.head()

In [None]:
internet_df.head()

In [None]:
churn_df.head()

In [None]:
customer_df['gender'].unique()

## More Details

In [None]:
customer_df.dtypes

In [None]:
internet_df.dtypes

In [None]:
churn_df.dtypes

The data needs some cleaning before analysis:

`TotalCharges` is stored as a string (`object` dtype) but should be a numerical value.
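
A minimal sketch of that conversion, using a small made-up frame in place of the table that holds `TotalCharges` (the sample values, including the blank entry, are assumptions). `pd.to_numeric` with `errors="coerce"` turns entries that cannot be parsed into `NaN` instead of raising an error.

```python
import pandas as pd

# Hypothetical sample; the IDs and charges are made up for illustration.
sample_df = pd.DataFrame({
    "customerID": ["A1", "B2", "C3"],
    "TotalCharges": ["29.85", " ", "1889.5"],
})

# errors="coerce" converts unparseable entries (here the blank string) to NaN.
sample_df["TotalCharges"] = pd.to_numeric(sample_df["TotalCharges"], errors="coerce")

print(sample_df.dtypes)
print(f"Unparseable values: {sample_df['TotalCharges'].isna().sum()}")
```

Counting the resulting `NaN` values is a quick way to see how many rows need a decision (drop, fill, or investigate).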

In [None]:
jdf = pd.merge(internet_df, churn_df, on='customerID', how='inner') \
    .reset_index()
print(jdf.shape)

Let's compare features from two tables:

In [None]:
jdf.groupby(['InternetService', 'PaymentMethod'])['customerID'].count()

In [None]:
pd.pivot_table(jdf, 
               index='PaymentMethod', 
               columns='InternetService', 
               values='customerID',
               aggfunc='count')
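
The two cells above produce the same counts in different layouts. As a rough sketch on made-up data (the toy frame `toy` is an assumption, not the real `jdf`), unstacking the grouped counts gives the same wide table as `pd.pivot_table`:

```python
import pandas as pd

# Hypothetical toy data standing in for jdf; the values are made up.
toy = pd.DataFrame({
    "customerID": ["c1", "c2", "c3", "c4", "c5"],
    "InternetService": ["DSL", "DSL", "Fiber optic", "No", "Fiber optic"],
    "PaymentMethod": ["Mailed check", "Electronic check", "Electronic check",
                      "Mailed check", "Electronic check"],
})

# Long format: one row per (InternetService, PaymentMethod) combination.
long_counts = toy.groupby(["InternetService", "PaymentMethod"])["customerID"].count()

# Wide format: unstack moves InternetService from the row index to the columns.
wide_counts = long_counts.unstack("InternetService")

# The same wide table built directly with pivot_table.
pivot = pd.pivot_table(toy, index="PaymentMethod", columns="InternetService",
                       values="customerID", aggfunc="count")

print(wide_counts)
print(pivot)
```

Combinations that never occur show up as `NaN` in both wide tables.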

## Deep Dive:
- Selecting rows and columns
- Joining tables vs extending/concatenating
    - inner, outer
    - merge() vs join()
- Adding new columns, copy sub-table
- Loading and saving data (why `index=None` ?)
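
The topics in this list can be sketched on two small made-up tables (`left` and `right` are hypothetical, not the telecom files). The example writes to an in-memory buffer instead of a file and uses the documented boolean form `index=False` when saving.

```python
import io
import pandas as pd

# Hypothetical toy tables; the IDs and values are made up.
left = pd.DataFrame({"customerID": ["a", "b", "c"], "tenure": [1, 12, 24]})
right = pd.DataFrame({"customerID": ["b", "c", "d"], "Churn": ["No", "Yes", "No"]})

# Selecting rows and columns: boolean mask for rows, label list for columns.
long_tenure = left.loc[left["tenure"] > 6, ["customerID", "tenure"]]

# inner keeps only IDs present in both tables; outer keeps all IDs and fills gaps with NaN.
inner = pd.merge(left, right, on="customerID", how="inner")
outer = pd.merge(left, right, on="customerID", how="outer")

# join() aligns on the index, so the key is moved into the index first.
joined = left.set_index("customerID").join(right.set_index("customerID"), how="inner")

# concat() stacks tables on top of each other instead of matching keys.
stacked = pd.concat([left, left], ignore_index=True)

# Adding a column to a sub-table: copy() first so the original is not affected.
sub = left[left["tenure"] > 6].copy()
sub["tenure_years"] = sub["tenure"] / 12

# Saving without the index avoids an extra unnamed column when the data is read back.
buffer = io.StringIO()
inner.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded)
```

If the index were written out, `read_csv` would return it as an additional column unless `index_col` is set, which is the usual reason for dropping it on save.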


## Arithmetic with pandas DataFrames

In [None]:
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list("abcd"))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list("abcde"))

In [None]:
df1.add(df2, fill_value=0)

In [None]:
df2.add(df1, fill_value=0)
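
To see what `fill_value=0` changes, here is a short comparison with the plain `+` operator, rebuilding the same two frames so the example is self-contained. With `+`, any cell missing from either frame becomes `NaN`; with `fill_value=0`, a cell missing from one frame is treated as 0 before adding.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list("abcd"))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list("abcde"))

# Plain addition: row 3 and column "e" exist only in df2, so they become NaN.
plain = df1 + df2

# fill_value=0 substitutes 0 for cells that exist in only one of the frames.
filled = df1.add(df2, fill_value=0)

print(plain)
print(filled)
print("NaN cells:", plain.isna().sum().sum(), "vs", filled.isna().sum().sum())
```

Because addition is commutative, `df1.add(df2, fill_value=0)` and `df2.add(df1, fill_value=0)` give the same values.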

In [None]:
frame = pd.DataFrame(np.random.standard_normal((4, 3)),
                     columns=list("bde"),
                     index=["Utah", "Ohio", "Texas", "Oregon"])

In [None]:
frame

## Descriptive Statistics with pandas DataFrames

https://sparkbyexamples.com/pandas/calculate-summary-statistics-in-pandas/

https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/06_calculate_statistics.html

### Applying the .describe() method on DataFrame with numerical data

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html

In [None]:
frame.describe()

### Calculating other statistical measures

In [None]:
frame['mean'] = frame.apply('mean', axis='columns')
frame

In [None]:
frame['b'].mean()

In [None]:
frame['b'].corr(frame['d'])

In [None]:
frame['b'].corr(frame['e'])

In [None]:
frame['b'].cov(frame['d'])

In [None]:
def f1(x):
    return x.max() - x.min()

In [None]:
frame.apply(f1)

In [None]:
frame2 = frame.copy()

In [None]:
frame2['max_min'] = frame2.apply(f1, axis="columns")

In [None]:
frame2

In [None]:
frame['max_min'] = frame.apply(f1, axis="columns")

In [None]:
frame

### Applying .describe() method on DataFrame with categorical features

https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html

In [None]:
categorical = pd.DataFrame({"a": ["Yes", "Yes", "No", "No", "Yes", "Yes"], 
                            "b": ["Good", "Better", "Bad", "Bad", "Worse", "Good"], 
                            "c": ["Claude", "Maria", "Maria", "George", "Luisa", "Kurt"], 
                            })
categorical.describe()

In [None]:
categorical.mode()
categorical.value_counts()
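
A short sketch of what these two calls return, rebuilding the same frame so it runs on its own. Note that a notebook cell displays only its last expression, and that `DataFrame.value_counts()` counts unique rows (combinations across all columns), not the values within each column.

```python
import pandas as pd

categorical = pd.DataFrame({"a": ["Yes", "Yes", "No", "No", "Yes", "Yes"],
                            "b": ["Good", "Better", "Bad", "Bad", "Worse", "Good"],
                            "c": ["Claude", "Maria", "Maria", "George", "Luisa", "Kurt"]})

# mode() returns a DataFrame: one row per tied mode, padded with NaN where a column has fewer ties.
modes = categorical.mode()

# DataFrame.value_counts() counts unique rows; every row here is distinct.
row_counts = categorical.value_counts()

# Per-column counts need Series.value_counts() on each column.
per_column = {col: categorical[col].value_counts() for col in categorical.columns}

print(modes)
print(row_counts)
print(per_column["b"])
```

Column `b` has two tied modes ("Bad" and "Good"), which is why `mode()` returns two rows.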

### Creating a custom .describe() method with the .agg() (aggregate) method

https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html

In [None]:
from functools import partial

In [None]:
tsdf = pd.DataFrame(
    np.random.randn(1000, 3),
    columns=["A", "B", "C"],
    index=pd.date_range("1/1/2000", periods=1000),
)

In [None]:
tsdf.describe(percentiles=[0.05, 0.25, 0.75, 0.95])

In [None]:
q_25 = partial(pd.Series.quantile, q=0.25)
q_25.__name__ = "25%"

q_75 = partial(pd.Series.quantile, q=0.75)
q_75.__name__ = "75%"

def na_func(series):
    return series.isna().sum()

def na_percent(series):
    # Share of missing values among all rows; series.count() would exclude the NaNs.
    return na_func(series) / len(series)

def cardinality(series):
    return series.nunique()     

In [None]:
# Numerical data:
tsdf.agg(["count", na_percent, cardinality, "min", q_25, "mean", "median", q_75, "max", "std"])
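
For a missing-value ratio, the choice of denominator matters: `len(series)` counts all rows, while `series.count()` counts only non-missing values. A quick check on a made-up series (the values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Made-up series with two missing values out of four rows.
s = pd.Series([1.0, np.nan, 3.0, np.nan])

n_missing = s.isna().sum()

# Missing values relative to all rows.
share_of_rows = n_missing / len(s)

# Missing values relative to the non-missing values only.
ratio_to_present = n_missing / s.count()

print(share_of_rows, ratio_to_present)
```

`tsdf` is generated without missing values, so both versions return 0 there; the difference only shows up on data with gaps.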

In [None]:
def mode_1st(series):
    return series.value_counts().sort_values(ascending=False).index[0]
 
def mode_1st_freq(series):
    mode = mode_1st(series)
    return series[series == mode].count()
    
def mode_1st_percent(series):
    return mode_1st_freq(series) / series.count()
    
def mode_2nd(series):
    return series.value_counts().sort_values(ascending=False).index[1]
 
def mode_2nd_freq(series):
    mode = mode_2nd(series)
    return series[series == mode].count()   
 
def mode_2nd_percent(series):
    return mode_2nd_freq(series) / series.count()    

In [None]:
# Categorical data:
categorical.agg(["count", na_percent, cardinality, 
                 mode_1st, mode_1st_freq, mode_1st_percent, 
                 mode_2nd, mode_2nd_freq, mode_2nd_percent])

### Other available stats packages

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.describe.html#scipy.stats.describe

In [None]:
from scipy import stats
stats.describe(tsdf)

In [None]:
# scipy.stats.describe expects numeric input, so this call is expected to fail on string columns.
stats.describe(categorical)