# Table of Contents

1. [Imports and definitions](#imports-and-definitions)
2. [Exploration](#exploration)
   - [core_data](#core_data)
     - [General overview](#general-overview)
     - [Customers portfolio analysis](#customers-portfolio-analysis)
     - [Upselling distribution](#upselling-distribution)
     - [Data checks](#data-checks)
   - [usage_info](#usage_info)
     - [General overview](#general-overview-1)
     - [Usage analysis](#usage-analysis)
     - [Data checks](#data-checks-1)
   - [customer_interactions](#customer_interactions)
     - [General overview](#general-overview-2)
     - [Interactions analysis](#interactions-analysis)
     - [Data checks](#data-checks-2)
3. [Conclusion](#conclusion)

---

# Imports and definitions

In [1]:
import pickle
from pathlib import Path

import polars as pl
import plotly.express as px

_ = pl.Config.set_tbl_cols(None)
_ = pl.Config.set_fmt_str_lengths(500)
_ = pl.Config.set_fmt_float("full")

In [2]:
base_dir = Path('/workspaces/data-scientist-at-magenta')
code_dir = base_dir / 'notebooks'
data_dir = code_dir / "data"
raw_dir = data_dir / "raw"
features_dir = data_dir / 'features'
train_dir = data_dir / 'train'

In [3]:
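def load_artifact(targ_file: str) -> pl.DataFrame:
    """Load a pickled pandas object from raw_dir and convert it to a polars DataFrame."""
    targ_path = raw_dir / targ_file

    if not targ_path.exists():
        raise FileNotFoundError(f'Artifact {targ_file} not found in {raw_dir}')

    # pickle.load can execute arbitrary code: only unpickle trusted files
    with open(targ_path, 'rb') as fp:
        artifact = pickle.load(fp)

    return pl.from_pandas(artifact)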
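
As a minimal, library-free sketch of the load pattern above (existence check, then unpickle), the snippet below writes a small placeholder object to a temporary directory and reads it back. The helper name `load_pickle` and the sample payload are illustrative assumptions, not part of the notebook; the real artifacts are additionally converted with `pl.from_pandas`.

```python
import pickle
import tempfile
from pathlib import Path


def load_pickle(directory: Path, name: str):
    """Hypothetical helper: return the unpickled object stored at directory/name."""
    path = directory / name
    if not path.exists():
        raise FileNotFoundError(f"Artifact {name} not found in {directory}")
    with open(path, "rb") as fp:
        return pickle.load(fp)


# Placeholder payload standing in for a raw artifact (assumption for illustration)
sample = {"rating_account_id": [289094, 677626], "gross_mrc": [70.0, 5.0]}

with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    with open(tmp_dir / "core_data", "wb") as fp:
        pickle.dump(sample, fp)

    # Existing artifact: the object comes back unchanged
    loaded = load_pickle(tmp_dir, "core_data")

    # Missing artifact: the helper raises instead of failing later
    try:
        load_pickle(tmp_dir, "missing_artifact")
        missing_raised = False
    except FileNotFoundError:
        missing_raised = True
```

Failing early with `FileNotFoundError` keeps a wrong file name from surfacing as a less readable error further down the notebook.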

`core_data`

| Feature Name           | Description                                                  |
|------------------------|--------------------------------------------------------------|
| rating_account_id      | Unique identifier for the contract account                    |
| customer_id            | Unique identifier for the customer                           |
| age                    | Age of the customer **in years**                                       |
| contract_lifetime_days | Total duration of the customer contract in days              |
| remaining_binding_days | Number of days left in the contract binding period (the usual binding period is 2 years); **a positive value means that the customer is still in the binding period** |
| has_special_offer      | Indicates if the customer has a special offer      |
| is_magenta1_customer   | Indicates if the customer is part of the Magenta1 loyalty program    |
| available_gb           | Amount of mobile data included in the current tariff (in GB); usually an integer (there are no plans with 15.4 GB, for example)       |
| gross_mrc              | Gross monthly recurring charge (in euros)                    |
| smartphone_brand       | Brand of the customer’s smartphone                           |
| has_done_upselling     | Whether the customer has already done an upsell in the last 3 years      |


`usage_info`

| Feature Name           | Description                                                  |
|------------------------|--------------------------------------------------------------|
| rating_account_id      | Unique identifier for the contract account                    |
| billed_period_month_d  | Billing period (monthly)                                     |
| has_used_roaming       | Indicates if roaming was used during the period            |
| used_gb                | Amount of mobile data used in the billing period (in GB)     |


`customer_interactions`

| Feature Name   | Description                                                              |
|----------------|--------------------------------------------------------------------------|
| customer_id    | Unique identifier for the customer                                       |
| type_subtype   | Category and subtype of the interaction (e.g., tariff change, billing)   |
| n              | Number of interactions of this type in the last 6 months                                |
| days_since_last| Number of days since the last interaction of this type                   |


# Exploration

## `core_data`


| Feature Name           | Description                                                  |
|------------------------|--------------------------------------------------------------|
| rating_account_id      | Unique identifier for the contract account                    |
| customer_id            | Unique identifier for the customer                           |
| age                    | Age of the customer **in years**                                       |
| contract_lifetime_days | Total duration of the customer contract in days              |
| remaining_binding_days | Number of days left in the contract binding period (the usual binding period is 2 years); **a positive value means that the customer is still in the binding period** |
| has_special_offer      | Indicates if the customer has a special offer      |
| is_magenta1_customer   | Indicates if the customer is part of the Magenta1 loyalty program    |
| available_gb           | Amount of mobile data included in the current tariff         |
| gross_mrc              | Gross monthly recurring charge (in euros)                    |
| smartphone_brand       | Brand of the customer’s smartphone                           |
| has_done_upselling     | Whether the customer has already done an upsell in the last 3 years      |

### General overview

In [11]:
core_data = load_artifact('core_data')

In [12]:
core_data.head()

rating_account_id,customer_id,age,contract_lifetime_days,remaining_binding_days,has_special_offer,is_magenta1_customer,available_gb,gross_mrc,smartphone_brand,has_done_upselling
i64,str,i64,i64,i64,i64,i64,i64,f64,str,i64
289094,"""4.161115""",36,878,325,0,0,20.0,70.0,"""iPhone""",0
677626,"""2.429976""",34,998,614,0,0,0.0,5.0,"""Samsung""",0
769928,"""3.875044""",36,37,-26,0,1,50.0,16.94,"""Samsung""",0
873260,"""4.649933""",50,503,-149,0,1,20.0,30.2,"""iPhone""",1
109774,"""3.851059""",47,331,-328,1,1,,46.12,"""Samsung""",0


In [13]:
core_data.schema

Schema([('rating_account_id', Int64),
        ('customer_id', String),
        ('age', Int64),
        ('contract_lifetime_days', Int64),
        ('remaining_binding_days', Int64),
        ('has_special_offer', Int64),
        ('is_magenta1_customer', Int64),
        ('available_gb', Int64),
        ('gross_mrc', Float64),
        ('smartphone_brand', String),
        ('has_done_upselling', Int64)])

In [26]:
core_data = core_data.with_columns(
    pl.col('has_done_upselling').cast(pl.Boolean),
    pl.col('has_special_offer').cast(pl.Boolean),
    pl.col('is_magenta1_customer').cast(pl.Boolean)
)

core_data = core_data.with_columns(
    (pl.col('contract_lifetime_days') + pl.col('remaining_binding_days')).alias('contract_binding_days'),
    (pl.col('contract_lifetime_days') / (pl.col('contract_lifetime_days') + pl.col('remaining_binding_days'))).alias('completion_rate'),
    (pl.col('gross_mrc') / pl.col('available_gb')).alias('cost_per_gb')
)

core_data

rating_account_id,customer_id,age,contract_lifetime_days,remaining_binding_days,has_special_offer,is_magenta1_customer,available_gb,gross_mrc,smartphone_brand,has_done_upselling,contract_binding_days,completion_rate,cost_per_gb
i64,str,i64,i64,i64,bool,bool,i64,f64,str,bool,i64,f64,f64
289094,"""4.161115""",36,878,325,false,false,20,70,"""iPhone""",false,1203,0.7298420615128844,3.5
677626,"""2.429976""",34,998,614,false,false,0,5,"""Samsung""",false,1612,0.6191066997518611,inf
769928,"""3.875044""",36,37,-26,false,true,50,16.94,"""Samsung""",false,11,3.3636363636363638,0.33880000000000005
873260,"""4.649933""",50,503,-149,false,true,20,30.2,"""iPhone""",true,354,1.42090395480226,1.51
109774,"""3.851059""",47,331,-328,true,true,,46.12,"""Samsung""",false,3,110.33333333333333,
…,…,…,…,…,…,…,…,…,…,…,…,…,…
502283,"""5.605022""",88,1573,-576,false,false,10,34.18,"""Samsung""",false,997,1.5777331995987964,3.418
618421,"""2.862063""",85,1138,412,true,false,40,50.1,"""iPhone""",false,1550,0.7341935483870968,1.2525
104422,"""2.414264""",79,1709,-494,false,false,10,12.96,"""Samsung""",false,1215,1.4065843621399177,1.296
642380,"""3.619106""",84,1592,403,false,false,10,56.73,"""Samsung""",false,1995,0.7979949874686717,5.673
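
The derived columns can be read off the rows above: `completion_rate` is `contract_lifetime_days / (contract_lifetime_days + remaining_binding_days)` and `cost_per_gb` is `gross_mrc / available_gb`. The plain-Python sketch below reproduces those values and shows one possible way to keep zero-GB or missing tariffs from producing `inf`. Returning `None` in those cases is an assumption of this sketch, not what the notebook does, and the function names are reused here only for illustration.

```python
from typing import Optional


def completion_rate(lifetime_days: int, remaining_days: int) -> Optional[float]:
    """Share of the (lifetime + remaining) span already elapsed; None if the span is 0."""
    span = lifetime_days + remaining_days
    return lifetime_days / span if span != 0 else None


def cost_per_gb(gross_mrc: float, available_gb: Optional[int]) -> Optional[float]:
    """Euros per included GB; None when the volume is missing or zero (sketch assumption)."""
    if available_gb is None or available_gb == 0:
        return None
    return gross_mrc / available_gb


# Rows taken from the output above: (lifetime, remaining, gross_mrc, available_gb)
rows = [
    (878, 325, 70.0, 20),      # still bound: rate below 1
    (998, 614, 5.0, 0),        # zero-GB tariff: no cost per GB
    (331, -328, 46.12, None),  # binding exceeded: rate far above 1
]
for lifetime, remaining, mrc, gb in rows:
    print(completion_rate(lifetime, remaining), cost_per_gb(mrc, gb))
```

Rates above 1 appear whenever `remaining_binding_days` is negative, which is why the statistics that follow restrict `completion_rate` to values below 1.0.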


In [27]:
n_customers = core_data.select(pl.col('customer_id')).unique().shape[0]
n_contracts = core_data.select(pl.col('rating_account_id')).unique().shape[0]

n_unique_smartphone_brands = core_data.select(pl.col('smartphone_brand')).n_unique()

n_special_offer = core_data.filter(pl.col('has_special_offer')).shape[0]

n_magenta1 = core_data.filter(pl.col('is_magenta1_customer')).shape[0]

n_special_offer_and_magenta1 = core_data.filter(
    pl.col('has_special_offer') & pl.col('is_magenta1_customer')
).shape[0]

min_lifetime = core_data.select(pl.col('contract_lifetime_days')).min().item()
max_lifetime = core_data.select(pl.col('contract_lifetime_days')).max().item()
avg_lifetime = core_data.select(pl.col('contract_lifetime_days')).mean().item()

count_still_bindings = core_data.filter(pl.col('remaining_binding_days') > 0).shape[0]
min_still_bindings = core_data.filter(pl.col('remaining_binding_days') > 0).select(pl.col('remaining_binding_days')).min().item()
max_still_bindings = core_data.filter(pl.col('remaining_binding_days') > 0).select(pl.col('remaining_binding_days')).max().item()
avg_still_bindings = core_data.filter(pl.col('remaining_binding_days') > 0).select(pl.col('remaining_binding_days')).mean().item()

count_not_bound = core_data.filter(pl.col('remaining_binding_days') < 0).shape[0]
min_not_bound = core_data.filter(pl.col('remaining_binding_days') < 0).select(pl.col('remaining_binding_days')).min().item()
max_not_bound = core_data.filter(pl.col('remaining_binding_days') < 0).select(pl.col('remaining_binding_days')).max().item()
avg_not_bound = core_data.filter(pl.col('remaining_binding_days') < 0).select(pl.col('remaining_binding_days')).mean().item()

min_completion_rate = core_data.filter(pl.col('completion_rate') < 1.0).select(pl.col('completion_rate')).min().item()
max_completion_rate = core_data.filter(pl.col('completion_rate') < 1.0).select(pl.col('completion_rate')).max().item()
avg_completion_rate = core_data.filter(pl.col('completion_rate') < 1.0).select(pl.col('completion_rate')).mean().item()

min_spending = core_data.select(pl.col('gross_mrc')).min().item()
max_spending = core_data.select(pl.col('gross_mrc')).max().item()
avg_spending = core_data.select(pl.col('gross_mrc')).mean().item()

min_data = core_data.select(pl.col('available_gb')).min().item()
max_data = core_data.select(pl.col('available_gb')).max().item()
avg_data = core_data.select(pl.col('available_gb')).mean().item()

min_cost_per_gb = core_data.select(pl.col('cost_per_gb')).min().item()
max_cost_per_gb = core_data.filter(~pl.col('cost_per_gb').is_infinite()).select(pl.col('cost_per_gb')).max().item()
avg_cost_per_gb = core_data.filter(~pl.col('cost_per_gb').is_infinite()).select(pl.col('cost_per_gb')).mean().item()



print(f'The core_data has {core_data.shape[0]} rows and {core_data.shape[1]} columns.\n')

print(f'The core_data has {n_customers} unique customers.')
print(f'The core_data has {n_contracts} unique contracts.\n')

print(f'The core_data has {n_unique_smartphone_brands} unique smartphone brands.\n')

print(f'The core_data has {n_special_offer}({n_special_offer / n_contracts * 100:.2f}%) contracts with special offers.\n')

print(f'The core_data has {n_magenta1}({n_magenta1 / n_contracts * 100:.2f}%) Magenta1 contracts.\n')

print(f'The core_data has {n_special_offer_and_magenta1}({n_special_offer_and_magenta1 / n_contracts * 100:.2f}%) contracts with both special offers and Magenta1.\n')

print(f'The core_data has a contract lifetime between {min_lifetime} and {max_lifetime} days with an average of {avg_lifetime:.2f} days.\n')

print(f'The core_data has a remaining binding days between {min_still_bindings} and {max_still_bindings} days with an average of {avg_still_bindings:.3f} days for still bindings.')
print(f'The core_data has {count_still_bindings}({count_still_bindings / n_contracts * 100:.2f}%) contracts with still bindings.\n')

print(f'The core_data has exceeded contract days between {abs(max_not_bound)} and {abs(min_not_bound)} days with an average of {abs(avg_not_bound):.3f} days for not bound contracts.')
print(f'The core_data has {count_not_bound}({count_not_bound / n_contracts * 100:.2f}%) contracts with exceeded bindings.\n')

print(f'The core_data has a completion rate between {min_completion_rate * 100:.2f}% and {max_completion_rate * 100:.2f}% with an average of {avg_completion_rate * 100:.2f}%.\n')

print(f'The core_data has a spending between {min_spending:.2f} euros and {max_spending:.2f} euros with an average of {avg_spending:.2f} euros.\n')

print(f'The core_data has a data volume between {min_data:.2f} GB and {max_data:.2f} GB with an average of {avg_data:.2f} GB.')
print(f'The core_data has a cost per GB between {min_cost_per_gb:.2f} euros and {max_cost_per_gb:.2f} euros with an average of {avg_cost_per_gb:.2f} euros.\n')

The core_data has 100000 rows and 14 columns.

The core_data has 58495 unique customers.
The core_data has 100000 unique contracts.

The core_data has 5 unique smartphone brands.

The core_data has 29953(29.95%) contracts with special offers.

The core_data has 29929(29.93%) Magenta1 contracts.

The core_data has 8974(8.97%) contracts with both special offers and Magenta1.

The core_data has a contract lifetime between 7 and 1825 days with an average of 778.61 days.

The core_data has a remaining binding days between 1 and 730 days with an average of 275.906 days for still bindings.
The core_data has 49938(49.94%) contracts with still bindings.

The core_data has exceeded contract days between 1 and 730 days with an average of 275.910 days for not bound contracts.
The core_data has 49863(49.86%) contracts with exceeded bindings.

The core_data has a completion rate between 50.03% and 99.94% with an average of 73.19%.

The core_data has a spending between 5.00 euros and 70.00 euros with

Finding customer segments for the target models

In [54]:
fig = px.histogram(core_data, x="age", nbins=40, title="Distribution of Age")
fig.update_layout(xaxis_title="Age", yaxis_title="Count")
fig.show()

In [55]:
# Define bin edges and labels
bins = [18, 25, 35, 45, 55, 65, 75, 85, 95, 101]  # upper edges are exclusive; 101 keeps age 100 in "95-100"
labels = [
    "18-24", "25-34", "35-44", "45-54", "55-64",
    "65-74", "75-84", "85-94", "95-100"
]

binned_age_upselling = core_data.with_columns(
        pl.when(pl.col("age") < 18).then(pl.lit('18-'))
        .otherwise(pl.lit('+100'))
        .alias("bin")
)

for i in range(len(bins)-1):
    binned_age_upselling = binned_age_upselling.with_columns(
        pl.when(pl.col("age").is_between(bins[i], bins[i+1], closed="left")).then(pl.lit(labels[i]))
        .otherwise(pl.col('bin'))
        .alias("bin")
)
    
# Group by bin and aggregate
binned_age_upselling = binned_age_upselling.group_by("bin").agg(
    pl.len().alias("count"),
    (pl.col("has_done_upselling") == True).sum().alias("positive"),
    (pl.col("has_done_upselling") == False).sum().alias("negative"),
).sort('bin')

fig = px.bar(binned_age_upselling, x="bin", y=['positive', 'negative'], title='Distribution of upselling customers by age')
fig.show()
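
The chained `when/otherwise` loop assigns each age to a half-open bin `[lower, upper)`. The same mapping can be sketched without a DataFrame using `bisect`. The function name `age_bin` is hypothetical, and the last edge is set to 101 here (an assumption) so that age 100 falls in the `95-100` label rather than in the overflow bucket.

```python
from bisect import bisect_right

# Lower edges are inclusive, upper edges exclusive
EDGES = [18, 25, 35, 45, 55, 65, 75, 85, 95, 101]
LABELS = ["18-24", "25-34", "35-44", "45-54", "55-64",
          "65-74", "75-84", "85-94", "95-100"]


def age_bin(age: int) -> str:
    """Hypothetical helper: map an age to its bin label."""
    idx = bisect_right(EDGES, age) - 1
    if idx < 0:
        return "18-"    # below the first edge
    if idx >= len(LABELS):
        return "+100"   # at or above the last edge
    return LABELS[idx]


print([age_bin(a) for a in (17, 18, 24, 25, 94, 100, 101)])
```

With half-open bins, the final edge has to sit one past the largest value that the last label is meant to cover.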

In [59]:
fig = px.histogram(core_data, x="contract_lifetime_days", nbins=300, title="Distribution of Contract Lifetime Days")
fig.update_layout(xaxis_title="Days", yaxis_title="Count")
fig.show()

In [86]:
bins = list(range(10, 1861, 100))
labels = [f"{str(bins[i]).zfill(4)}-{str(bins[i+1]-1).zfill(4)}" for i in range(len(bins)-1)]

binned_lifetime_upselling = core_data.with_columns(
        pl.when(pl.col("contract_lifetime_days") < 10).then(pl.lit('10-'))
        .otherwise(pl.lit('1860+'))
        .alias("bin")
)

for i in range(len(bins)-1):
    binned_lifetime_upselling = binned_lifetime_upselling.with_columns(
        pl.when(pl.col("contract_lifetime_days").is_between(bins[i], bins[i+1], closed="left")).then(pl.lit(labels[i]))
        .otherwise(pl.col('bin'))
        .alias("bin")
)
    
# Group by bin and aggregate
binned_lifetime_upselling = binned_lifetime_upselling.group_by("bin").agg(
    pl.len().alias("count"),
    (pl.col("has_done_upselling") == True).sum().alias("positive"),
    (pl.col("has_done_upselling") == False).sum().alias("negative"),
).sort('bin')

fig = px.bar(binned_lifetime_upselling, x="bin", y=['positive', 'negative'], title='Distribution of upselling customers by contract lifetime days')
fig.show()
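
Because `bin` is a string column, `.sort('bin')` orders the labels lexicographically rather than numerically. Zero-padding the labels, as the `zfill` call above does, keeps the two orders aligned; unpadded labels such as those in the table that follows sort as `10-59`, `1010-1059`, `110-159`. A small sketch of the effect on an illustrative list of labels (the helper `pad` is hypothetical):

```python
labels = ["60-109", "110-159", "10-59", "1010-1059"]

# Plain string sort: compared character by character, so "1010" < "110"
lexicographic = sorted(labels)


def pad(label: str, width: int = 4) -> str:
    """Hypothetical helper: zero-pad both ends of a 'low-high' label."""
    low, high = label.split("-")
    return f"{low.zfill(width)}-{high.zfill(width)}"


# Padded labels sort in numeric order
padded = sorted(pad(label) for label in labels)

print(lexicographic)
print(padded)
```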

In [84]:
binned_lifetime_upselling

bin,count,positive,negative
str,u32,u32,u32
"""10-""",209,8,201
"""10-59""",3491,220,3271
"""1010-1059""",3438,260,3178
"""1060-1109""",2962,222,2740
"""110-159""",3420,168,3252
…,…,…,…
"""760-809""",3508,270,3238
"""810-859""",3518,268,3250
"""860-909""",3406,261,3145
"""910-959""",3366,263,3103


In [80]:
fig = px.bar(core_data.group_by('available_gb').len(), x="available_gb", y='len', title="Distribution of Available Data (GB)")
fig.update_layout(xaxis_title="GB", yaxis_title="Count")
fig.show()

In [81]:

binned_gb_upselling = core_data.group_by('available_gb').agg(
    pl.len().alias("count"),
    (pl.col("has_done_upselling") == True).sum().alias("positive"),
    (pl.col("has_done_upselling") == False).sum().alias("negative"),
).sort('available_gb')

fig = px.bar(binned_gb_upselling, x="available_gb", y=['positive', 'negative'], title='Distribution of upselling customers by available data')
fig.show()

### Customers portfolio analysis

In [28]:
customer_portfolio_value = (
    core_data
    .group_by('customer_id')
    .agg([
        pl.col('rating_account_id').count().alias('portfolio_size'),
        
        pl.col('gross_mrc').sum().alias('total_value'),
        pl.col('gross_mrc').mean().alias('avg_per_contract'),
        
        pl.col('has_done_upselling').max().alias('any_upselling'),
        pl.col('has_done_upselling').sum().alias('count_upselling'),
        (pl.col('has_done_upselling').sum() / pl.col('rating_account_id').count() * 100).round(2).alias('rate_upselling'),
        
        pl.col('age').min().alias('min_age'),
        pl.col('age').max().alias('max_age'),
        pl.col('age').mean().alias('avg_age'),
        pl.col('age').std().alias('age_std'),

        pl.col('contract_lifetime_days').mean().alias('avg_contract_lifetime'),
        pl.col('contract_lifetime_days').max().alias('max_contract_lifetime'),

        (pl.col('remaining_binding_days') < 0).sum().alias('count_completed_contracts'),

        pl.col('completion_rate').filter(pl.col('completion_rate') < 1.0).mean().alias('avg_completion_rate'),
        pl.col('completion_rate').filter(pl.col('completion_rate') < 1.0).min().alias('min_completion_rate'),
        pl.col('completion_rate').filter(pl.col('completion_rate') < 1.0).max().alias('max_completion_rate'),
        
        pl.col('available_gb').sum().alias('total_data_value'),
        pl.col('available_gb').mean().alias('avg_data_value'),

        pl.col('cost_per_gb').filter(~pl.col('cost_per_gb').is_infinite()).mean().alias('avg_cost_per_gb'),

        pl.col('is_magenta1_customer').any().alias('any_magenta1_customer'),

        pl.col('has_special_offer').any().alias('any_special_offer'),

        pl.col('smartphone_brand').n_unique().alias('count_smartphone_brands'),
    ])
)


portfolio_summary = (
    customer_portfolio_value
    .group_by('portfolio_size')
    .agg([
        pl.col('customer_id').n_unique().alias('total_customers'),

        (pl.col('customer_id').n_unique() / n_customers * 100).round(2).alias('percent_customers'),

        # Percentage of customers who have done at least one upselling (within this portfolio size)
        (pl.col('any_upselling').mean() * 100).round(2).alias('customers_with_upselling_percent'),
                
        # Average number of upselling contracts per customer
        pl.col('count_upselling').filter(pl.col('count_upselling') > 0).mean().round(2).alias('avg_upselling_contracts_per_customer'),

        pl.col('total_value').mean().round(2).alias('avg_total_spend'),
        pl.col('avg_per_contract').mean().round(2).alias('avg_per_contract'),
        
        pl.col('avg_age').mean().round(2).alias('avg_age_per_contract'),
        
        pl.col('avg_contract_lifetime').mean().round(2).alias('avg_contract_lifetime_days'),
        pl.col('max_contract_lifetime').max().alias('max_contract_lifetime_days'),

        pl.col('count_completed_contracts').mean().round(2).alias('avg_completed_contracts'),

        (pl.col('avg_completion_rate') * 100).mean().round(2).alias('avg_completion_rate'),

        pl.col('total_data_value').mean().round(2).alias('avg_total_data'),
        pl.col('avg_data_value').mean().round(2).alias('avg_data_per_contract'),

        pl.col('avg_cost_per_gb').mean().round(2).alias('avg_cost_per_gb'),

        (pl.col('any_magenta1_customer').mean() * 100).round(2).alias('magenta1_percent'),

        (pl.col('any_special_offer').mean() * 100).round(2).alias('specialoffer_percent'),

        pl.col('count_smartphone_brands').mean().round(2).alias('avg_count_smartphone_brands'),
        pl.col('count_smartphone_brands').max().alias('max_count_smartphone_brands'),
    ])
    .sort('portfolio_size', descending=False)
)

portfolio_summary = portfolio_summary.rename({
    'portfolio_size': 'Portfolio Size',

    'total_customers': 'Total Customers',
    'percent_customers': 'Customers Pool Percentage (%)',

    'avg_total_spend': 'Avg Total Spend (€)',
    'avg_per_contract': 'Avg per Contract (€)',

    'avg_age_per_contract': 'Avg Age per Contract (years)',

    'avg_contract_lifetime_days': 'Avg Contract Lifetime (days)',
    'max_contract_lifetime_days': 'Max Contract Lifetime (days)',

    'avg_completed_contracts': 'Avg Completed Contracts',
    'avg_completion_rate': 'Avg Completion Rate (%)',

    'avg_total_data': 'Avg Total Data (GB)',
    'avg_data_per_contract': 'Avg Data per Contract (GB)',

    'avg_cost_per_gb': 'Avg Cost per GB (€)',

    'magenta1_percent': 'Magenta1 Percentage (%)',

    'specialoffer_percent': 'Special Offer Percentage (%)',

    'avg_count_smartphone_brands': 'Avg Count Smartphone Brands',
    'max_count_smartphone_brands': 'Max Count Smartphone Brands',

    'customers_with_upselling_percent': 'Customers with Upselling (%)',
    'avg_upselling_contracts_per_customer': 'Avg Upselling Contracts per Customer',  # Averaged only over customers that have at least one upselling contract
})

portfolio_summary

Portfolio Size,Total Customers,Customers Pool Percentage (%),Customers with Upselling (%),Avg Upselling Contracts per Customer,Avg Total Spend (€),Avg per Contract (€),Avg Age per Contract (years),Avg Contract Lifetime (days),Max Contract Lifetime (days),Avg Completed Contracts,Avg Completion Rate (%),Avg Total Data (GB),Avg Data per Contract (GB),Avg Cost per GB (€),Magenta1 Percentage (%),Special Offer Percentage (%),Avg Count Smartphone Brands,Max Count Smartphone Brands
u32,u32,f64,f64,f64,f64,f64,f64,f64,i64,f64,f64,f64,f64,f64,f64,f64,f64,u32
1,30574,52.27,6.88,1.0,37.63,37.63,43.91,779.29,1825,0.5,73.08,21.43,25.01,1.73,29.62,30.18,1.0,1
2,17945,30.68,13.9,1.04,74.85,37.43,43.88,776.39,1825,1.0,73.24,43.03,25.11,1.71,51.04,51.41,1.68,2
3,7200,12.31,19.75,1.07,112.12,37.37,43.77,779.98,1825,1.5,73.33,64.01,24.74,1.72,66.06,64.72,2.14,3
4,2100,3.59,24.24,1.13,152.05,38.01,43.59,776.77,1825,1.97,73.06,86.03,25.23,1.73,76.95,75.38,2.46,4
5,552,0.94,34.06,1.18,188.5,37.7,44.07,787.92,1825,2.5,73.04,109.96,25.27,1.69,82.97,81.34,2.67,5
6,96,0.16,23.96,1.17,224.34,37.39,44.2,805.08,1823,2.96,73.29,129.69,25.01,1.68,85.42,90.62,2.89,4
7,24,0.04,33.33,1.12,277.81,39.69,44.21,840.54,1822,3.58,72.43,152.5,24.49,2.02,95.83,95.83,2.96,4
8,4,0.01,75.0,1.67,272.47,34.06,42.38,568.25,1663,5.5,70.21,162.5,26.92,1.36,100.0,100.0,3.0,3
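
The summary is built in two steps: contracts are first aggregated per customer, then customers are aggregated per portfolio size. The sketch below mirrors that with plain dictionaries on made-up rows (the data and variable names are illustrative assumptions). As with the notebook's `count_upselling > 0` filter, the average number of upsold contracts is taken only over customers with at least one upsell.

```python
from collections import defaultdict
from statistics import mean

# (customer_id, gross_mrc, has_done_upselling): made-up rows for illustration
contracts = [
    ("a", 30.0, False),
    ("b", 20.0, True), ("b", 40.0, False),
    ("c", 10.0, True), ("c", 50.0, True),
    ("d", 25.0, False), ("d", 35.0, False),
]

# Step 1: one record per customer
per_customer = defaultdict(lambda: {"size": 0, "total": 0.0, "upsold": 0})
for customer_id, mrc, upsold in contracts:
    record = per_customer[customer_id]
    record["size"] += 1
    record["total"] += mrc
    record["upsold"] += int(upsold)

# Step 2: group the customer records by portfolio size
by_size = defaultdict(list)
for record in per_customer.values():
    by_size[record["size"]].append(record)

summary = {}
for size, records in sorted(by_size.items()):
    upsellers = [r["upsold"] for r in records if r["upsold"] > 0]
    summary[size] = {
        "total_customers": len(records),
        "customers_with_upselling_percent": round(100 * len(upsellers) / len(records), 2),
        "avg_upselling_contracts": round(mean(upsellers), 2) if upsellers else None,
        "avg_total_spend": round(mean(r["total"] for r in records), 2),
    }

print(summary)
```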


### Upselling distribution

In [29]:
n_upselling = core_data.filter(pl.col('has_done_upselling')).shape[0]

n_upselling_per_customer = core_data.group_by('customer_id').agg(pl.col('has_done_upselling').sum())
n_upselling_customers = n_upselling_per_customer.filter(pl.col('has_done_upselling') > 0).shape[0]
n_more1_upselling_customers = n_upselling_per_customer.filter(pl.col('has_done_upselling') > 1).shape[0]
avg_upselling_per_customer = n_upselling_per_customer.select(pl.col('has_done_upselling')).mean().item()
avg_upselling_per_upselling_customer = n_upselling_per_customer.filter(pl.col('has_done_upselling') > 0).select(pl.col('has_done_upselling')).mean().item()
max_upselling_per_customer = n_upselling_per_customer.select(pl.col('has_done_upselling')).max().item()

upselling_contracts = core_data.filter(pl.col('has_done_upselling'))
not_upselling_contracts = core_data.filter(~pl.col('has_done_upselling'))

min_age_upselling = upselling_contracts.select(pl.col('age')).min().item()
max_age_upselling = upselling_contracts.select(pl.col('age')).max().item()
avg_age_upselling = upselling_contracts.select(pl.col('age')).mean().item()

min_age_not_upselling = not_upselling_contracts.select(pl.col('age')).min().item()
max_age_not_upselling = not_upselling_contracts.select(pl.col('age')).max().item()
avg_age_not_upselling = not_upselling_contracts.select(pl.col('age')).mean().item()

count_not_bound_upselling = upselling_contracts.filter(pl.col('remaining_binding_days') < 0).shape[0]
count_bound_upselling = upselling_contracts.filter(pl.col('remaining_binding_days') > 0).shape[0]
avg_binding_upselling = upselling_contracts.filter(pl.col('remaining_binding_days') > 0).select(pl.col('remaining_binding_days')).mean().item()
avg_exceeded_binding_upselling = upselling_contracts.filter(pl.col('remaining_binding_days') < 0).select(pl.col('remaining_binding_days')).mean().item()

avg_completion_rate_upselling = upselling_contracts.filter(pl.col('completion_rate') < 1.0).select(pl.col('completion_rate')).mean().item()

avg_data_upselling = upselling_contracts.select(pl.col('available_gb')).mean().item()
avg_cost_per_gb_upselling = upselling_contracts.filter(~pl.col('cost_per_gb').is_infinite()).select(pl.col('cost_per_gb')).mean().item()
avg_data_not_upselling = not_upselling_contracts.select(pl.col('available_gb')).mean().item()
avg_cost_per_gb_not_upselling = not_upselling_contracts.filter(~pl.col('cost_per_gb').is_infinite()).select(pl.col('cost_per_gb')).mean().item()

avg_spending_upselling = upselling_contracts.select(pl.col('gross_mrc')).mean().item()
avg_spending_not_upselling = not_upselling_contracts.select(pl.col('gross_mrc')).mean().item()

count_special_offer_upselling = upselling_contracts.filter(pl.col('has_special_offer')).shape[0]
count_special_offer_not_upselling = not_upselling_contracts.filter(pl.col('has_special_offer')).shape[0]
count_not_special_offer_upselling = upselling_contracts.filter(~pl.col('has_special_offer')).shape[0]
count_not_special_offer_not_upselling = not_upselling_contracts.filter(~pl.col('has_special_offer')).shape[0]

count_magenta1_upselling = upselling_contracts.filter(pl.col('is_magenta1_customer')).shape[0]
count_not_magenta1_upselling = upselling_contracts.filter(~pl.col('is_magenta1_customer')).shape[0]
count_magenta1_not_upselling = not_upselling_contracts.filter(pl.col('is_magenta1_customer')).shape[0]
count_not_magenta1_not_upselling = not_upselling_contracts.filter(~pl.col('is_magenta1_customer')).shape[0]


print(f'The core_data has {n_upselling}({n_upselling / n_contracts * 100:.2f}%) upsold contracts.\n')

print(f'The core_data has {n_upselling_customers}({n_upselling_customers / n_customers * 100:.2f}%) customers with at least one upselling contract.')
print(f'The core_data has {n_more1_upselling_customers}({n_more1_upselling_customers / n_customers * 100:.2f}%) customers with more than one upselling contract.')
print(f'The core_data has an average of {avg_upselling_per_upselling_customer:.2f} upsold contracts between customers with upsold contracts.')
print(f'The core_data has an average of {avg_upselling_per_customer:.2f} upsold contracts per customer with a maximum of {max_upselling_per_customer} upsold contracts per customer.\n')

print(f'The core_data has an average age of {avg_age_upselling:.2f} years for customers with upsold contracts.')
print(f'The core_data has an age between {min_age_upselling} and {max_age_upselling} years for customers with upsold contracts.\n')

print(f'The core_data has an average age of {avg_age_not_upselling:.2f} years for customers without upsold contracts.')
print(f'The core_data has an age between {min_age_not_upselling} and {max_age_not_upselling} years for customers without upsold contracts.\n')

print(f'The core_data has {count_bound_upselling}({count_bound_upselling / n_upselling * 100:.2f}%) upsold contracts with still bindings.')
print(f'The core_data has an average of {avg_binding_upselling:.2f} days of still bindings for upsold contracts.\n')

print(f'The core_data has {count_not_bound_upselling}({count_not_bound_upselling / n_upselling * 100:.2f}%) upsold contracts with exceeded bindings.')
print(f'The core_data has an average of {abs(avg_exceeded_binding_upselling):.2f} days of exceeded bindings for upsold contracts.\n')

print(f'The core_data has an average completion rate of {avg_completion_rate_upselling * 100:.2f}% for customers with upsold contracts.\n')

print(f'The core_data has an average data volume of {avg_data_upselling:.2f} GB for upsold contracts.')
print(f'The core_data has an average cost per GB of {avg_cost_per_gb_upselling:.2f} euros for upsold contracts.\n')

print(f'The core_data has an average data volume of {avg_data_not_upselling:.2f} GB for customers without upsold contracts.')
print(f'The core_data has an average cost per GB of {avg_cost_per_gb_not_upselling:.2f} euros for customers without upsold contracts.\n')

print(f'The core_data has an average spending of {avg_spending_upselling:.2f} euros for upsold contracts.')
print(f'The core_data has an average spending of {avg_spending_not_upselling:.2f} euros for customers without upsold contracts.\n')

print(f'The core_data has {count_special_offer_upselling}({count_special_offer_upselling / n_upselling * 100:.2f}%) upsold contracts with special offers.')
print(f'The core_data has {count_not_special_offer_upselling}({count_not_special_offer_upselling / n_upselling * 100:.2f}%) upsold contracts without special offers.')
print(f'The core_data has {count_not_special_offer_not_upselling}({count_not_special_offer_not_upselling / (n_contracts - n_upselling) * 100:.2f}%) non-upsold contracts without special offers.')
print(f'The core_data has {count_special_offer_not_upselling}({count_special_offer_not_upselling / (n_contracts - n_upselling) * 100:.2f}%) non-upsold contracts with special offers.\n')

print(f'The core_data has {count_magenta1_upselling}({count_magenta1_upselling / n_upselling * 100:.2f}%) upsold contracts with Magenta1.')
print(f'The core_data has {count_not_magenta1_upselling}({count_not_magenta1_upselling / n_upselling * 100:.2f}%) upsold contracts without Magenta1.')
print(f'The core_data has {count_not_magenta1_not_upselling}({count_not_magenta1_not_upselling / (n_contracts - n_upselling) * 100:.2f}%) non-upsold contracts without Magenta1.')
print(f'The core_data has {count_magenta1_not_upselling}({count_magenta1_not_upselling / (n_contracts - n_upselling) * 100:.2f}%) non-upsold contracts with Magenta1.\n')

The core_data has 7049(7.05%) upsold contracts.

The core_data has 6750(11.54%) customers with at least one upselling contract.
The core_data has 287(0.49%) customers with more than one upselling contract.
The core_data has an average of 1.04 upsold contracts between customers with upsold contracts.
The core_data has an average of 0.12 upsold contracts per customer with a maximum of 3 upsold contracts per customer.

The core_data has an average age of 41.78 years for customers with upsold contracts.
The core_data has an age between 18 and 100 years for customers with upsold contracts.

The core_data has an average age of 44.00 years for customers without upsold contracts.
The core_data has an age between 18 and 100 years for customers without upsold contracts.

The core_data has 3239(45.95%) upsold contracts with still bindings.
The core_data has an average of 277.86 days of still bindings for upsold contracts.

The core_data has 3802(53.94%) upsold contracts with exceeded bindings.
Th

### Data checks

In [30]:
null_counts_core_data = core_data.select([
    pl.col(col).is_null().sum().alias(f'{col}_nulls') for col in core_data.columns
])

nan_counts_core_data = core_data.select([
    pl.col(col).is_nan().sum().alias(f'{col}_nans')
    for col, dtype in zip(core_data.columns, core_data.dtypes)
    if dtype in [pl.Float32, pl.Float64]
])

inf_counts_core_data = core_data.select([
    ((pl.col(col) == float('inf')) | (pl.col(col) == float('-inf'))).sum().alias(f'{col}_infs')
    for col, dtype in zip(core_data.columns, core_data.dtypes)
    if dtype in [pl.Float32, pl.Float64]
])

In [31]:
print('Null counts per column:')
null_counts_core_data

Null counts per column:


rating_account_id_nulls,customer_id_nulls,age_nulls,contract_lifetime_days_nulls,remaining_binding_days_nulls,has_special_offer_nulls,is_magenta1_customer_nulls,available_gb_nulls,gross_mrc_nulls,smartphone_brand_nulls,has_done_upselling_nulls,contract_binding_days_nulls,completion_rate_nulls,cost_per_gb_nulls
u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32
0,0,0,0,0,0,0,14148,0,0,0,0,0,14148


In [32]:
print('NaN counts per float column:')
nan_counts_core_data

NaN counts per float column:


gross_mrc_nans,completion_rate_nans,cost_per_gb_nans
u32,u32,u32
0,0,0


In [33]:
print('Infinite counts per float column:')
inf_counts_core_data

Infinite counts per float column:


gross_mrc_infs,completion_rate_infs,cost_per_gb_infs
u32,u32,u32
0,0,14288


---

## `usage_info`

| Feature Name           | Description                                                  |
|------------------------|--------------------------------------------------------------|
| rating_account_id      | Unique identifier for the contract account                    |
| billed_period_month_d  | Billing period (monthly)                                     |
| has_used_roaming       | Indicates if roaming was used during the period            |
| used_gb                | Amount of mobile data used in the billing period (in GB)     |


### General overview

In [34]:
usage_info = load_artifact('usage_info')

In [35]:
usage_info.head()

rating_account_id,billed_period_month_d,has_used_roaming,used_gb
i64,str,i64,f64
289094,"""2024-04-01""",0,0.8
289094,"""2024-05-01""",0,0.2
289094,"""2024-06-01""",1,0.1
289094,"""2024-07-01""",0,0.0
677626,"""2024-04-01""",0,0.8


In [36]:
usage_info.schema

Schema([('rating_account_id', Int64),
        ('billed_period_month_d', String),
        ('has_used_roaming', Int64),
        ('used_gb', Float64)])

In [37]:
usage_info = usage_info.with_columns(
    pl.col('has_used_roaming').cast(pl.Boolean),
    pl.col('billed_period_month_d').cast(pl.Date)
)

In [38]:
n_contracts_usage_info = usage_info.select(pl.col('rating_account_id')).n_unique()

min_billing_period = usage_info.select(pl.col('billed_period_month_d')).min().item()
max_billing_period = usage_info.select(pl.col('billed_period_month_d')).max().item()

n_billing_periods = usage_info.select(pl.col('billed_period_month_d')).n_unique()

n_billing_periods_per_contract = usage_info.group_by('rating_account_id').agg(pl.col('billed_period_month_d').n_unique().alias('n_billing_periods'))
complete_billing_periods = n_billing_periods_per_contract.filter(pl.col('n_billing_periods') == n_billing_periods).shape[0]

print(f'The usage_info has {n_contracts_usage_info} unique contracts.\n')

print(f'The usage_info has billing periods between {min_billing_period} and {max_billing_period}.\n')

print(f'The usage_info has {complete_billing_periods} ({complete_billing_periods / n_contracts * 100:.2f}%) of contracts with full usage information\n')

The usage_info has 100000 unique contracts.

The usage_info has billing periods between 2024-04-01 and 2024-07-01.

The usage_info has 100000 (100.00%) of contracts with full usage information



### Usage analysis

In [39]:
usage_per_month = (
    usage_info
    .group_by('billed_period_month_d')
    .agg([
        pl.col('used_gb').sum().round(2).alias('total_used_gb'),
        pl.col('used_gb').mean().round(2).alias('avg_used_gb'),
        pl.col('used_gb').std().round(2).alias('std_used_gb'),
        pl.col('used_gb').quantile(0.5).round(2).alias('median_used_gb'),
        pl.col('used_gb').max().round(2).alias('max_used_gb'),
        pl.col('used_gb').min().round(2).alias('min_used_gb'),

        pl.col('has_used_roaming').sum().cast(pl.Int64).alias('count_roaming_used'),
    ])
).sort('billed_period_month_d', descending=False)

usage_per_month = usage_per_month.with_columns(
    (pl.col('total_used_gb') - pl.col('total_used_gb').shift(1)).round(2).alias('total_used_gb_delta1'),
    (pl.col('total_used_gb') - pl.col('total_used_gb').shift(2)).round(2).alias('total_used_gb_delta2'),
    (pl.col('total_used_gb') - pl.col('total_used_gb').shift(3)).round(2).alias('total_used_gb_delta3'),

    (pl.col('count_roaming_used') - pl.col('count_roaming_used').shift(1)).alias('count_roaming_used_delta1'),
    (pl.col('count_roaming_used') - pl.col('count_roaming_used').shift(2)).alias('count_roaming_used_delta2'),
    (pl.col('count_roaming_used') - pl.col('count_roaming_used').shift(3)).alias('count_roaming_used_delta3'),
)

usage_per_month = usage_per_month.rename({
    'billed_period_month_d': 'Billed Period (Month)',
    
    'total_used_gb': 'Total Used GB',
    'avg_used_gb': 'Avg Used GB',
    'max_used_gb': 'Max Used GB',
    'min_used_gb': 'Min Used GB',
    'std_used_gb': 'Std Used GB',
    'median_used_gb': 'Median Used GB',
    'total_used_gb_delta1': 'Total Used GB Delta previous 1 month',
    'total_used_gb_delta2': 'Total Used GB Delta previous 2 months',
    'total_used_gb_delta3': 'Total Used GB Delta previous 3 months',
    
    'count_roaming_used': 'Count Roaming Used',
    'count_roaming_used_delta1': 'Count Roaming Used Delta previous 1 month',
    'count_roaming_used_delta2': 'Count Roaming Used Delta previous 2 months',
    'count_roaming_used_delta3': 'Count Roaming Used Delta previous 3 months',
})

usage_per_month.select(
    pl.col('Billed Period (Month)'),
    pl.col('Total Used GB'),
    pl.col('Avg Used GB'),
    pl.col('Std Used GB'),
    pl.col('Min Used GB'),
    pl.col('Median Used GB'),
    pl.col('Max Used GB'),
    pl.col('Total Used GB Delta previous 1 month'),
    pl.col('Total Used GB Delta previous 2 months'),
    pl.col('Total Used GB Delta previous 3 months'),
    pl.col('Count Roaming Used'),
    pl.col('Count Roaming Used Delta previous 1 month'),
    pl.col('Count Roaming Used Delta previous 2 months'),
    pl.col('Count Roaming Used Delta previous 3 months')
)

Billed Period (Month),Total Used GB,Avg Used GB,Std Used GB,Min Used GB,Median Used GB,Max Used GB,Total Used GB Delta previous 1 month,Total Used GB Delta previous 2 months,Total Used GB Delta previous 3 months,Count Roaming Used,Count Roaming Used Delta previous 1 month,Count Roaming Used Delta previous 2 months,Count Roaming Used Delta previous 3 months
date,f64,f64,f64,f64,f64,f64,f64,f64,f64,i64,i64,i64,i64
2024-04-01,1402806.0,14.03,18.69,0,5,70,,,,30001,,,
2024-05-01,1396414.1,13.96,18.61,0,5,70,-6391.9,,,30083,82.0,,
2024-06-01,1404015.3,14.04,18.7,0,5,70,7601.2,1209.3,,30012,-71.0,11.0,
2024-07-01,1399001.3,13.99,18.66,0,5,70,-5014.0,2587.2,-3804.7,29765,-247.0,-318.0,-236.0


### Data checks

In [40]:
null_counts_usage_info = usage_info.select([
    pl.col(col).is_null().sum().alias(f'{col}_nulls') for col in usage_info.columns
])

nan_counts_usage_info = usage_info.select([
    pl.col(col).is_nan().sum().alias(f'{col}_nans')
    for col, dtype in zip(usage_info.columns, usage_info.dtypes)
    if dtype in [pl.Float32, pl.Float64, pl.Int64, pl.Int32]
])

inf_counts_usage_info = usage_info.select([
    ((pl.col(col) == float('inf')) | (pl.col(col) == float('-inf'))).sum().alias(f'{col}_infs')
    for col, dtype in zip(usage_info.columns, usage_info.dtypes)
    if dtype in [pl.Float32, pl.Float64, pl.Int64, pl.Int32]
])

In [41]:
print('Null counts per column:')
null_counts_usage_info

Null counts per column:


rating_account_id_nulls,billed_period_month_d_nulls,has_used_roaming_nulls,used_gb_nulls
u32,u32,u32,u32
0,0,0,0


In [42]:
print('NaN counts per float column:')
nan_counts_usage_info

NaN counts per float column:


rating_account_id_nans,used_gb_nans
u32,u32
0,0


In [43]:
print('Infinite counts per float column:')
inf_counts_usage_info

Infinite counts per float column:


rating_account_id_infs,used_gb_infs
u32,u32
0,0


---

## `customer_interactions`

| Feature Name   | Description                                                              |
|----------------|--------------------------------------------------------------------------|
| customer_id    | Unique identifier for the customer                                       |
| type_subtype   | Category and subtype of the interaction (e.g., tariff change, billing)   |
| n              | Number of interactions of this type in the last 6 months                                |
| days_since_last| Number of days since the last interaction of this type                   |


### General overview

In [44]:
customer_interactions = load_artifact('customer_interactions')

In [45]:
customer_interactions.head()

customer_id,type_subtype,n,days_since_last
str,str,i64,i64
"""5.563689""","""rechnungsanfragen""",1,116
"""2.928257""","""produkte&services-tarifdetails""",1,146
"""3.468993""","""produkte&services-tarifdetails""",1,50
"""4.481640""","""prolongation""",1,46
"""1.750352""","""rechnungsanfragen""",1,177


In [46]:
customer_interactions.schema

Schema([('customer_id', String),
        ('type_subtype', String),
        ('n', Int64),
        ('days_since_last', Int64)])

In [47]:
n_customers_interactions = customer_interactions.select(pl.col('customer_id')).n_unique()


n_types_interactions_per_customers = customer_interactions.group_by('customer_id').agg(pl.col('type_subtype').n_unique().alias('n_types_interactions'))
avg_n_types_interactions_per_customer = n_types_interactions_per_customers.select(pl.col('n_types_interactions')).mean().item()
max_n_types_interactions_per_customer = n_types_interactions_per_customers.select(pl.col('n_types_interactions')).max().item()

multiple_same_interactions_per_customer = customer_interactions.group_by('customer_id', 'type_subtype').len()
n_customers_multiple_same_interactions = multiple_same_interactions_per_customer.filter(pl.col('len') > 1).shape[0]

types_interactions = customer_interactions.select(pl.col('type_subtype')).unique()

print(f'The customer_interactions has {n_customers_interactions} ({n_customers_interactions / n_customers * 100:.2f}% of whole customers pool) unique customers.\n')

print(f'The customer_interactions has an average of {avg_n_types_interactions_per_customer:.2f} unique interaction types per customer with a maximum of {max_n_types_interactions_per_customer} unique interaction types.\n')

print(f'The customer_interactions has {n_customers_multiple_same_interactions} ({n_customers_multiple_same_interactions / n_customers_interactions * 100:.2f}%) customers with multiple same interaction types.\n')

print(f'The customer_interactions has {types_interactions.shape[0]} unique interaction types: {types_interactions.select(pl.col("type_subtype")).to_series().to_list()}.\n')

The customer_interactions has 42095 (71.96% of whole customers pool) unique customers.

The customer_interactions has an average of 1.50 unique interaction types per customer with a maximum of 3 unique interaction types.

The customer_interactions has 0 (0.00%) customers with multiple same interaction types.

The customer_interactions has 4 unique interaction types: ['produkte&services-tarifwechsel', 'produkte&services-tarifdetails', 'rechnungsanfragen', 'prolongation'].



### Interactions analysis

In [48]:
interactions_distribution = (
    customer_interactions
    .group_by('type_subtype')
    .agg([
        pl.col('customer_id').n_unique().alias('n_customers'),

        pl.col('n').sum().alias('total_interactions_in_last_6_months'),
        pl.col('n').min().round(2).alias('min_interactions_per_customer_in_last_6_months'),
        pl.col('n').mean().round(2).alias('avg_interactions_per_customer_in_last_6_months'),
        pl.col('n').median().round(2).alias('median_interactions_per_customer_in_last_6_months'),
        pl.col('n').max().round(2).alias('max_interactions_per_customer_in_last_6_months'),
        

        pl.col('days_since_last').mean().round(2).alias('avg_days_since_last_interaction'),
        pl.col('days_since_last').max().alias('max_days_since_last_interaction'),
        pl.col('days_since_last').min().alias('min_days_since_last_interaction'),
    ])
)

interactions_distribution = interactions_distribution.rename({
    'type_subtype': 'Interaction Type',
    
    'n_customers': 'Number of Customers', 

    'total_interactions_in_last_6_months': 'Total Interactions in Last 6 Months',
    'avg_interactions_per_customer_in_last_6_months': 'Avg Interactions per Customer in Last 6 Months',
    'median_interactions_per_customer_in_last_6_months': 'Median Interactions per Customer in Last 6 Months',
    'max_interactions_per_customer_in_last_6_months': 'Max Interactions per Customer in Last 6 Months',
    'min_interactions_per_customer_in_last_6_months': 'Min Interactions per Customer in Last 6 Months',

    'avg_days_since_last_interaction': 'Avg Days Since Last Interaction',
    'max_days_since_last_interaction': 'Max Days Since Last Interaction',
    'min_days_since_last_interaction': 'Min Days Since Last Interaction',
})

interactions_distribution

Interaction Type,Number of Customers,Total Interactions in Last 6 Months,Min Interactions per Customer in Last 6 Months,Avg Interactions per Customer in Last 6 Months,Median Interactions per Customer in Last 6 Months,Max Interactions per Customer in Last 6 Months,Avg Days Since Last Interaction,Max Days Since Last Interaction,Min Days Since Last Interaction
str,u32,i64,i64,f64,f64,i64,f64,i64,i64
"""prolongation""",15837,30507,1,1.93,1,10,89.82,180,0
"""produkte&services-tarifdetails""",15759,30693,1,1.95,2,10,90.18,180,0
"""produkte&services-tarifwechsel""",15811,30661,1,1.94,2,10,91.02,180,0
"""rechnungsanfragen""",15811,30493,1,1.93,1,10,90.06,180,0


### Data checks

In [49]:
null_counts_customer_interactions = customer_interactions.select([
    pl.col(col).is_null().sum().alias(f'{col}_nulls') for col in customer_interactions.columns
])

nan_counts_customer_interactions = customer_interactions.select([
    pl.col(col).is_nan().sum().alias(f'{col}_nans')
    for col, dtype in zip(customer_interactions.columns, customer_interactions.dtypes)
    if dtype in [pl.Float32, pl.Float64, pl.Int64, pl.Int32]
])

inf_counts_customer_interactions = customer_interactions.select([
    ((pl.col(col) == float('inf')) | (pl.col(col) == float('-inf'))).sum().alias(f'{col}_infs')
    for col, dtype in zip(customer_interactions.columns, customer_interactions.dtypes)
    if dtype in [pl.Float32, pl.Float64, pl.Int64, pl.Int32]
])

In [50]:
print('Null counts per column:')
null_counts_customer_interactions

Null counts per column:


customer_id_nulls,type_subtype_nulls,n_nulls,days_since_last_nulls
u32,u32,u32,u32
0,0,0,0


In [51]:
print('NaN counts per float column:')
nan_counts_customer_interactions

NaN counts per float column:


n_nans,days_since_last_nans
u32,u32
0,0


In [52]:
print("Infinite counts per float column:")
inf_counts_customer_interactions

Infinite counts per float column:


n_infs,days_since_last_infs
u32,u32
0,0


---

# Conclusion

## Data Interpretation

This dataset represents a snapshot extracted at a specific point in time, capturing the complete active contract portfolio for each customer.<br>
The key insight is that a single `customer_id` can hold several active contracts at the same time, each identified by a unique `rating_account_id`. This is supported by the fact that every contract has usage data for the complete four-month observation period (April-July 2024).<br>
The `age` attribute appears to describe the individual user of each contract rather than the account holder, which is consistent with the smartphone brand being assigned per contract.

This structure reflects two primary real-world scenarios:
- **Corporate Accounts**: Businesses managing multiple employee lines, where each contract represents a different employee.

- **Family Plans**: Household accounts where the primary account holder manages contracts for multiple family members.

## Summary

This analysis shows that the data consists of **100000 contracts** from **58495 unique customers**. The most important insight is that only **7.05% of the contracts are upsold** and 11.54% of customers hold at least one upsold contract. The target is therefore heavily **imbalanced**.

The `core_data` dataset captures a multi-contract customer structure where 4.74% of the customers manage 4+ contracts, with an average of **1.71 contracts per customer**.

This structure represents both corporate accounts managing employee lines and family plans with multiple household members.

**Only 71.96% of the customer base had at least one customer service interaction**, with a balanced distribution across the four interaction types (prolongation, tariff changes, billing inquiries, and tariff details). The remaining 28.04% of customers have no recorded interaction, and this gap must be handled explicitly.
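
The headline figures in this summary follow from the counts printed earlier in the notebook. As a small worked check, the sketch below recomputes them in plain Python. The variable names are chosen here for illustration, and the negative-to-positive ratio is an extra figure added only to quantify the imbalance.

```python
# Counts as reported in the notebook outputs.
n_contracts = 100_000
n_customers = 58_495
n_upsold_contracts = 7_049
n_customers_with_upsell = 6_750
n_customers_with_interactions = 42_095

summary = {
    "upsold_contract_rate_pct": round(100 * n_upsold_contracts / n_contracts, 2),
    "customers_with_upsell_pct": round(100 * n_customers_with_upsell / n_customers, 2),
    "avg_contracts_per_customer": round(n_contracts / n_customers, 2),
    "customers_with_interactions_pct": round(
        100 * n_customers_with_interactions / n_customers, 2
    ),
    "customers_without_interactions_pct": round(
        100 * (n_customers - n_customers_with_interactions) / n_customers, 2
    ),
    # Non-upsold contracts per upsold contract (illustrative imbalance measure).
    "negatives_per_positive": round(
        (n_contracts - n_upsold_contracts) / n_upsold_contracts, 2
    ),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```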

## Data Quality

The primary data quality issue involves 14148 contracts (14.15%) with null `available_gb` values; `cost_per_gb` is null for the same number of contracts and additionally contains 14288 infinite values. Both require a deliberate strategy, such as imputation or removal, chosen according to the business context.

## Target

The data models scenarios in which:

- Each contract has independent upgrade potential based on its specific circumstances
- Contract lifecycle timing matters (binding periods, contract age)
- Service-level features are contract-specific (data allowances, pricing tiers)

**Contract-level targeting** is therefore the right approach because:

- The business logic treats each contract as having independent upgrade potential
- Contract-specific features (binding periods, data allowances, pricing) are the key drivers

## Data Integration

Given the commercial and modeling advantages of contract-level targeting, the optimal data integration strategy involves:

- **Usage Data Aggregation**: Transform the monthly usage records into contract-level features by creating aggregated statistics across the 4-month observation period, including average usage, usage trends, rolling averages, and period-over-period deltas.

- **Customer Interaction Mapping**: Transform customer interaction data by pivoting interaction categories (e.g., tariff changes, billing inquiries) into separate feature columns, creating dedicated metrics for each interaction type. These customer-level interaction aggregations are then propagated to all contracts belonging to the same customer. This approach may introduce some noise, since interactions may be contract-specific rather than customer-wide, and it requires appropriate null handling for customers without recorded interactions.