<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Setup" data-toc-modified-id="Setup-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Setup</a></span><ul class="toc-item"><li><span><a href="#Imports" data-toc-modified-id="Imports-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Imports</a></span></li><li><span><a href="#File-Config" data-toc-modified-id="File-Config-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>File Config</a></span></li><li><span><a href="#Simulation-Config" data-toc-modified-id="Simulation-Config-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Simulation Config</a></span></li></ul></li><li><span><a href="#Generate-Dataset" data-toc-modified-id="Generate-Dataset-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Generate Dataset</a></span></li><li><span><a href="#Save-Sample-or-Read-In" data-toc-modified-id="Save-Sample-or-Read-In-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Save Sample or Read In</a></span></li><li><span><a href="#Run-Model" data-toc-modified-id="Run-Model-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Run Model</a></span></li><li><span><a href="#Results" data-toc-modified-id="Results-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Results</a></span><ul class="toc-item"><li><span><a href="#Customers-Join" data-toc-modified-id="Customers-Join-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Customers Join</a></span></li><li><span><a href="#Site-Join" data-toc-modified-id="Site-Join-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>Site Join</a></span></li><li><span><a href="#GMM" data-toc-modified-id="GMM-5.3"><span class="toc-item-num">5.3&nbsp;&nbsp;</span>GMM</a></span></li></ul></li><li><span><a href="#Performance" data-toc-modified-id="Performance-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Performance</a></span></li><li><span><a href="#Viz" data-toc-modified-id="Viz-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Viz</a></span></li></ul></div>

# Generate nNPS data #


This notebook creates a dummy dataset for modelling network NPS (nNPS). It simulates interactions between customers and the mobile network based on empirical distributions. The interactions form a bipartite graph with one set of nodes for sites, $S_N$, and one set of nodes for customers, $C_K$. To generate data from the simulation we follow these steps:

1. for each site sample an average daily site performance KPI $x_N$,
2. for each site sample a number of daily surveyed customer interactions $D_N$,
3. for each of the $D_N$ connections of a site sample a random customer id from $C_K$,
4. for each customer $C_K$ determine p(*hadBadInteraction*),
5. for each customer $C_K$ add noise to determine p(*isDetractor*).
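
The five steps above can be read as a small generative model. The following is only a sketch inferred from the code cells in this notebook; the symbols $z_n$, $h_k$, $y_k$, $e_k$, $\lambda$, $\epsilon$ and $\mathcal{S}_k$ are notation introduced here for illustration.

$$
\begin{aligned}
z_n &\sim \mathrm{Categorical}(\pi_0, \pi_1), \qquad n = 1, \dots, N \\
x_n \mid z_n &\sim \mathcal{N}(\mu_{z_n}, \gamma_{z_n}^{-1}) \\
D_n &\sim \mathrm{Poisson}(\lambda) \\
h_k &= \bigvee_{n \in \mathcal{S}_k} z_n \\
y_k &= h_k \oplus e_k, \qquad e_k \sim \mathrm{Bernoulli}(\epsilon)
\end{aligned}
$$

Here $\pi_k$, $\mu_k$ and $\gamma_k$ correspond to `weight_class_k`, `true_mean_k` and `true_precision_k`, $\lambda$ to `avg_subs_per_site`, $\epsilon$ to `p_error`, and $\mathcal{S}_k$ is the set of daily sites that customer $k$ connected to. $h_k$ plays the role of *hadBadInteraction* and $y_k$ of *isDetractor*. Note that the notebook flips a fixed fraction `p_error` of each group rather than drawing $e_k$ independently, so the last line is an idealisation.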

## Setup

### Imports

In [None]:
%matplotlib notebook

import csv
import os
import json
import pandas as pd
import numpy as np
import scipy.stats as stats
import plotly.express as px
import plotly.graph_objects as go
import matplotlib.pyplot as plt

from sklearn.metrics import classification_report, auc, confusion_matrix, roc_curve
from math import floor

from pathlib import Path

In [None]:
# create the data directory if it does not exist yet
Path('./data').mkdir(exist_ok=True)

### File Config

In [None]:
data_dir = Path('./data/')

dataset_filename = 'dataset.csv'
dataset_path = data_dir/dataset_filename

responses_filename = 'responses.csv'
responses_path = data_dir/responses_filename

kpis_filename = 'kpis.csv'
kpis_path = data_dir/kpis_filename

customer_results_filename = 'customer-results.csv'
customer_results_path = data_dir/customer_results_filename

sites_results_filename = 'sites-results.csv'
sites_results_path = data_dir/sites_results_filename

gmm_results_filename = 'gmm-results.csv'
gmm_results_path = data_dir/gmm_results_filename

### Simulation Config

In [None]:
num_customers = 100
avg_subs_per_site = 3 # mean of the Poisson distribution per site per day
num_days = 20
num_sites = 10

weight_class_0 = 0.9
weight_class_1 = 0.1

p_error = 0.2 # flip probability of the binary symmetric channel (BSC) used to add noise

resample = True
viz = True

## Generate Dataset

In [None]:
N = num_days*num_sites # number of daily sites (i.e., days*sites)

true_mean_0 = 0
true_mean_1 = 10

true_precision_0 = 0.05
true_precision_1 = 0.05

def sample(component):
    if component == 0:
        return np.random.normal(true_mean_0, np.sqrt(1 / true_precision_0), 1)[0]
    if component == 1:
        return np.random.normal(true_mean_1, np.sqrt(1 / true_precision_1), 1)[0]

# comp 0 = good, comp 1 = bad
mask = np.random.choice([0, 1], N, p=[weight_class_0, weight_class_1])
kpi_data = [sample(i) for i in mask]
df_kpi = pd.DataFrame(kpi_data, columns=['x'])
df_kpi['comp'] = mask
df_kpi.shape
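
As a cross-check on the cell above, the same two-component Gaussian mixture can be sampled in vectorised form. This is a minimal sketch, not the notebook's method: the `rng` seed and the array names are assumptions made for illustration, and the parameter values simply mirror the config above.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only to make the sketch repeatable

# Assumed values, mirroring the simulation config in this notebook
weights = np.array([0.9, 0.1])       # weight_class_0, weight_class_1
means = np.array([0.0, 10.0])        # true_mean_0, true_mean_1
precisions = np.array([0.05, 0.05])  # true_precision_0, true_precision_1
n_daily_sites = 200                  # num_days * num_sites

# Pick a component per daily site (0 = good, 1 = bad)
component = rng.choice(2, size=n_daily_sites, p=weights)

# Draw each KPI from its component; standard deviation = sqrt(1 / precision)
std = np.sqrt(1.0 / precisions)
x = rng.normal(means[component], std[component])

print(x.shape, std)
```

With a precision of 0.05 the standard deviation is about 4.47, so the good and bad components (means 0 and 10) overlap noticeably.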

For each site, sample a daily degree.

In [None]:
# Note: this is the daily site degree distribution.
D = np.random.poisson(avg_subs_per_site, N)
D.shape

In [None]:
# plt.hist(D, 100, alpha=0.4, color='gray')
# plt.grid()
# plt.xlabel('Degree distribution')
# plt.title('Daily site degree distribution')
# plt.show()

We have N daily sites (one per site per day). Each daily site `i` is connected to `D[i]` distinct customers, drawn uniformly at random.

In [None]:
customer_ids = list(range(num_customers))
daily_site_ids = list(range(N))
cust_site_dict = {}

for i in daily_site_ids:
    num_customer_connections = D[i]
    np.random.shuffle(customer_ids)
    ids = customer_ids[:num_customer_connections]
    for j in ids:
        # append this daily site to the customer's list, creating the list on first use
        cust_site_dict.setdefault(j, []).append(i)
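
The allocation can also be written with sampling without replacement instead of shuffling the full id list. This is a hedged sketch under assumed values (the seed, `num_customers = 100` and 200 daily sites); `cust_sites` is a hypothetical name, not one used elsewhere in this notebook.

```python
from collections import defaultdict

import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for repeatability
num_customers = 100
D = rng.poisson(3, size=200)    # daily degree of each daily site

cust_sites = defaultdict(list)
for site_id, degree in enumerate(D):
    # a daily site cannot connect to more customers than exist
    k = min(int(degree), num_customers)
    # draw k distinct customers uniformly at random
    for cust_id in rng.choice(num_customers, size=k, replace=False):
        cust_sites[int(cust_id)].append(site_id)

total_edges = sum(len(sites) for sites in cust_sites.values())
print(len(cust_sites), total_edges)
```

Because each daily site draws distinct customers, no customer is linked to the same daily site twice, which is what the commented-out uniqueness test checks. Customers that are never drawn do not appear in the dictionary at all.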

In [None]:
# test uniqueness
# for k, v in cust_site_dict.items():
#     if len(v) != len(set(v)):
#         print(k)

In [None]:
df_temp = pd.DataFrame.from_dict(cust_site_dict, orient='index')

In [None]:
df_temp.sort_index(inplace=True)

In [None]:
kpis = []
comp = []
hadBadInteraction = []

for index, row in df_temp.iterrows():
    x = row.values
    ids = x[~np.isnan(x)].astype(int)
    kpis.append(df_kpi.loc[ids, 'x'].values.round(2).tolist())
    comp.append(df_kpi.loc[ids, 'comp'].values)
    # deterministic OR logic
    hadBadInteraction.append(np.any(df_kpi.loc[ids, 'comp']))

In [None]:
# y = list(map(lambda x: 1 if x else 0, hadBadInteraction))
# plt.hist(y, 10)
# plt.grid()
# plt.xlabel('$p(hadBadInteraction)$')
# plt.show()

In [None]:
df_temp['kpis'] = kpis
df_temp['component'] = comp
df_temp['hadBadInteraction'] = hadBadInteraction

In [None]:
df_customers = df_temp.loc[:, :'kpis'].drop(['kpis'], axis=1)

In [None]:
df_customers = df_customers.apply(lambda x: list(x[~np.isnan(x)].astype(int)), axis=1)

In [None]:
df_customers = pd.DataFrame(df_customers, columns=['temp'])

In [None]:
df_customers['site_ids'] = [','.join(map(str, l)) for l in df_customers['temp']]

In [None]:
df_temp.shape, df_customers.shape

In [None]:
df_customers['hadBadInteraction'] = df_temp['hadBadInteraction'].astype(int)
df_customers['isDetractor'] = df_customers['hadBadInteraction']

# hadBadSiteInt, isDetractor, p(isDetractor|hadBadSiteInt)
# 0              0            1 - p(error)
# 0              1            p(error)
# 1              0            p(error)
# 1              1            1 - p(error)

mask = df_customers[df_customers['hadBadInteraction'] == 0].sample(frac=p_error).index.values
df_customers.loc[mask, 'isDetractor'] = 1
mask = df_customers[df_customers['hadBadInteraction'] == 1].sample(frac=p_error).index.values
df_customers.loc[mask, 'isDetractor'] = 0
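
The cell above flips an exact fraction `p_error` of each group. A textbook binary symmetric channel instead flips every label independently with probability `p_error`. The sketch below shows that variant for comparison; `binary_symmetric_channel` is a hypothetical helper and the input labels are random placeholders, not notebook data.

```python
import numpy as np

def binary_symmetric_channel(bits, p_flip, rng):
    """Flip each 0/1 entry independently with probability p_flip."""
    bits = np.asarray(bits, dtype=int)
    flips = rng.random(bits.shape) < p_flip
    return np.where(flips, 1 - bits, bits)

rng = np.random.default_rng(0)                 # seed chosen only for repeatability
had_bad = rng.integers(0, 2, size=10_000)      # placeholder hadBadInteraction labels
is_detractor = binary_symmetric_channel(had_bad, 0.2, rng)

observed_flip_rate = float(np.mean(is_detractor != had_bad))
print(observed_flip_rate)
```

With independent flips the observed error rate only approaches `p_error` as the number of customers grows, whereas the fraction-based approach in the notebook fixes it (up to rounding) for every sample.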

## Save Sample or Read In

Create the dataset.csv file, which has one row per customer. Each row holds a jagged (variable-length) list of the daily site ids that customer connected to.

In [None]:
if resample:
    config = {'num_customers': num_customers,
              'avg_subs_per_site': avg_subs_per_site,
              'num_days': num_days,
              'num_sites': num_sites,
              'weight_class_0': weight_class_0,
              'weight_class_1': weight_class_1}
    with open(data_dir/'config.json', 'w') as fp:
        json.dump(config, fp)
    df_customers.to_pickle(data_dir/'df_customers.pck')
    df_kpi.to_pickle(data_dir/'df_kpi.pck')
    df_temp.to_pickle(data_dir/'df_temp.pck')

In [None]:
with open(data_dir/'config.json', 'r') as fp:
    config = json.load(fp)
num_customers = config['num_customers']
avg_subs_per_site = config['avg_subs_per_site']
num_days = config['num_days']
num_sites = config['num_sites']

weight_class_0 = config['weight_class_0']
weight_class_1 = config['weight_class_1']

In [None]:
# Don't quite trust the dropna here: for some reason subscribers get dropped,
# causing a misalignment when joining the results
df_customers = pd.read_pickle(data_dir/'df_customers.pck').\
dropna().\
reset_index(drop=True)

df_customers.\
site_ids.\
str.\
replace(',', ';').\
to_csv(dataset_path, 
       sep=',', 
       quoting=csv.QUOTE_NONE,
       header=False, 
       index=False)
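
To make the jagged row format concrete: each line of dataset.csv is a `;`-separated list of daily site ids for one customer. The sketch below round-trips a few made-up rows (the ids are placeholders) to show how a consumer might parse them.

```python
# Made-up daily site ids for three customers
site_lists = [[3, 17, 42], [5], [0, 199]]

# Serialise: one line per customer, ids joined with ';'
lines = [';'.join(map(str, ids)) for ids in site_lists]

# Parse: split each line back into a list of integers
parsed = [[int(token) for token in line.split(';')] for line in lines]

print(lines)
```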


Create the kpis.csv file, which has one row per daily site, i.e. `num_days * num_sites` rows.

In [None]:
df_kpi = pd.read_pickle(data_dir/'df_kpi.pck')
df_kpi['x'].to_csv(kpis_path, 
#                    decimal = ',',
                   sep=',', 
                   header=False, 
                   index=False)

Create the responses.csv file, which has one row per customer.

In [None]:
df_customers['isDetractor'].\
to_csv(responses_path, 
       sep=',', 
       header=False, 
       index=False)

## Run Model

In [None]:
dotnet_cmd = 'dotnet run --project ../model-v1/ '
args = f'../notebooks/{data_dir}/ {dataset_filename} {responses_filename} {kpis_filename} {num_days} {num_sites}'
cmd = dotnet_cmd + args
print(cmd)

In [None]:
!{cmd}

## Results

### Customers Join

In [None]:
df_customers_results = pd.read_csv(customer_results_path, 
                                   header=None, 
                                   sep=';',
                                   names=['hadBadSiteInter'])

df_customers_results.hadBadSiteInter = df_customers_results.\
hadBadSiteInter.astype(str).\
str.replace(',','.').\
astype(float)
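
The comma-to-dot replacement above suggests the model may write decimal commas. If that is the case, `pd.read_csv` can parse them directly through its `decimal` parameter. A minimal sketch on made-up values, assuming one number per line as in the results files:

```python
import io

import pandas as pd

raw = io.StringIO("0,12\n0,98\n0,5\n")  # made-up values for illustration

df = pd.read_csv(raw,
                 header=None,
                 sep=';',
                 decimal=',',            # treat ',' as the decimal separator
                 names=['hadBadSiteInter'])

print(df.dtypes)
```

This only applies when every value uses the same separator; the string replacement in the notebook works for either convention.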

In [None]:
assert df_customers_results.shape[0] == df_customers.shape[0], 'need to have the same number of customers'

In [None]:
df_customers_results = pd.concat([df_customers, 
                                  df_customers_results], 
                                 axis=1).\
reset_index().\
rename(columns={'index':'cust_id'})

In [None]:
assert df_customers_results.isna().sum().sum() == 0, "There shouldn't be missing values here"

In [None]:
# df_customers_results.sample(10)

### Site Join

In [None]:
df_sites_results = pd.read_csv(sites_results_path, 
                               sep=';',
                               header=None, 
                               names=['hadBadPerf'])
df_sites_results.hadBadPerf = df_sites_results.hadBadPerf.astype(str).str.replace(',','.').astype(float)

In [None]:
assert df_kpi.shape[0] == df_sites_results.shape[0], 'need to have the same number of sites'

In [None]:
df_kpi.shape, df_sites_results.shape

In [None]:
df_sites_results = pd.concat([df_kpi, df_sites_results], axis=1).\
reset_index().rename(columns={'index':'site_id'})

In [None]:
assert df_sites_results.isna().sum().sum() == 0, "There shouldn't be missing values here"

### GMM

In [None]:
df_gmm_results = pd.read_csv(gmm_results_path, header=None, names=['mean', 'precision'], sep=';')
df_gmm_results['mean'] = df_gmm_results['mean'].astype(str).str.replace(',','.').astype(float)
df_gmm_results['precision'] = df_gmm_results['precision'].astype(str).str.replace(',','.').astype(float)

In [None]:
df_gmm_results

In [None]:
# fig, ax = plt.subplots(2, 1, sharex=True)
# ax[0].hist(df_sites_results['x'], 100, alpha=0.4, color='gray', label='$x_{kpi}$')
# ax[0].grid()
# ax[0].legend()

# m0 = df_gmm_results.loc[0, 'mean']
# p0 = df_gmm_results.loc[0, 'precision']

# sigma = np.sqrt(1 / p0)
# xx = np.linspace(m0 - 6 * sigma, m0 + 6 * sigma, 500)
# ax[1].plot(xx, stats.norm.pdf(xx, m0, sigma), label='Good', color='green')

# m1 = df_gmm_results.loc[1, 'mean']
# p1 = df_gmm_results.loc[1, 'precision']

# sigma = np.sqrt(1 / p1)
# xx = np.linspace(m1 - 6 * sigma, m1 + 6 * sigma, 500)
# ax[1].plot(xx, stats.norm.pdf(xx, m1, sigma), label='Bad', color='red')
# ax[1].grid()

# # TODO: need to add the weight distribution posterior (default = 0.5 for now)
# w = 0.5
# w0 = 1 - w
# w1 = w
# xx = np.linspace(df_sites_results['x'].min(), df_sites_results['x'].max(), 500)
# ax[1].plot(xx, (w0 * stats.norm.pdf(xx, m0, np.sqrt(1 / p0))) + \
#                (w1 * stats.norm.pdf(xx, m1, np.sqrt(1 / p1))), linestyle='--', color='black', alpha=0.3, label=r'$\sum_{k} \pi_k \mathcal{N}$($\mu_k$, $\gamma_{k}^{-1}$)')

# ax[1].legend()

# plt.show()

## Performance

In [None]:
cust_threshold = 0.5
cust_y_true = list(df_customers_results.isDetractor)
cust_y_pred = [int(v > cust_threshold) for v in list(df_customers_results.hadBadSiteInter)]
target_names = ['promoters', 'detractors']
print(confusion_matrix(cust_y_true, cust_y_pred))
print(classification_report(cust_y_true, cust_y_pred, target_names=target_names))

In [None]:
site_threshold = 0.5
site_y_true = list(df_sites_results.comp)
site_y_pred = [int(v > site_threshold) for v in list(df_sites_results.hadBadPerf)]
target_names = ['good', 'bad']
print(confusion_matrix(site_y_true, site_y_pred))
print(classification_report(site_y_true, site_y_pred, target_names=target_names))

In [None]:
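# pass the raw scores, not the thresholded labels (needs roc_curve from sklearn.metrics)
# fpr, tpr, _ = roc_curve(site_y_true, df_sites_results.hadBadPerf)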
# roc_auc = auc(fpr, tpr)
# plt.plot(fpr, tpr)
# plt.plot([0, 1], [0, 1], color='black', linestyle='--')
# plt.xlim([-0.01, 1.0])
# plt.ylim([0.0, 1.05])
# plt.title("AUC: {0:.3f}".format(roc_auc))
# plt.xlabel('False Positive Rate')
# plt.ylabel('True Positive Rate')
# plt.grid()
# plt.show()
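
A ROC curve is normally computed from continuous scores, so that every threshold contributes a point. The sketch below illustrates this on a tiny made-up example; the labels and scores are placeholders, not results from this notebook.

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

y_true = np.array([0, 0, 1, 1])            # made-up ground truth (0 = good, 1 = bad)
y_score = np.array([0.1, 0.4, 0.35, 0.8])  # made-up posterior probabilities

# roc_curve sweeps over the distinct score values as thresholds
fpr, tpr, thresholds = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)

print(f"AUC: {roc_auc:.3f}")
```

For the site results, the corresponding inputs would be `df_sites_results.comp` as labels and `df_sites_results.hadBadPerf` as scores.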

## Viz

In [None]:
cust_y = -1*(num_days)
site_y = 1
num_customers = df_customers_results.shape[0]

# centre the customer row relative to the site row
x_offset = (num_sites - num_customers) / 2
    
nnps_colors = list(px.colors.diverging.RdYlGn_r)
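
The plotting cell below indexes the palette with `int(value*10)`, which assumes values in $[0, 1]$ and a palette of at least 11 colours. A defensive variant could clamp the value first. `color_index` is a hypothetical helper and the default palette length of 11 is an assumption.

```python
def color_index(p, n_colors=11):
    """Map a probability in [0, 1] to a palette index, clamping out-of-range input."""
    p = min(max(float(p), 0.0), 1.0)                 # clamp to [0, 1]
    return min(int(p * (n_colors - 1)), n_colors - 1)

print([color_index(p) for p in (0.0, 0.55, 1.0, 1.2)])
```

It could be used as `nnps_colors[color_index(bad_perf, len(nnps_colors))]` so that the index always stays within the palette.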

In [None]:
if viz:
    fig = go.Figure()

    # Plot Sites
    all_site_ids = list(df_sites_results.site_id.values)
    all_bad_perf = list(df_sites_results.hadBadPerf.values)
    for day in range(0, num_days):
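        adj_site_ids = all_site_ids[num_sites*day:num_sites*(day+1)]
        site_ids = [s%num_sites for s in adj_site_ids]
        # Slice this day's predictions; zipping against the full list would
        # reuse day 0's values for every day.
        day_bad_perf = all_bad_perf[num_sites*day:num_sites*(day+1)]
        i = 0
        for bad_perf, site_id in zip(day_bad_perf, site_ids):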
            site_df = df_sites_results[df_sites_results.site_id == site_id + num_sites*day]
            site_x = site_df.x.values[0]
            site_comp = site_df.comp.values[0]
            fig.add_trace(
                go.Scatter(x=[site_id], 
                           y=[site_y*(day+1)], 
                           mode='markers',
                           showlegend=i==0,
                           name=f'Sites on day {day}',
                           legendgroup=f'day_{day}',
                           marker=dict(color=nnps_colors[int(bad_perf*10)]),
                           hovertemplate = 'Site: %{x} <br>Day: %{y} <br>had bad perf:' + 
                           f'{bad_perf}'+
                           f'<br>x = {site_x}<br>comp = {site_comp}'))
            i = i+1
    for i, cust in df_customers_results.iterrows():
        # Plot Customers
        x = cust.cust_id
        fig.add_trace(go.Scatter(x=[x+x_offset], 
                                 y=[cust_y], 
                                 mode='markers', 
                                 marker=dict(color=nnps_colors[int(cust.hadBadSiteInter*10)]),
                                 legendgroup=f'cust_{x}',
                                 hovertemplate = f'Customer: {x}'+
                                 f'<br>had bad site interactions: {cust.hadBadSiteInter}'+
                                 f'<br>isDetractor: {cust.isDetractor}',
                                 text = cust.hadBadSiteInter,
                                 name=f'Customer {x}'))
        # Plot Connections
        connected_sites = cust.temp
        for site in connected_sites:
            site_id = site%num_sites
            site_day = floor(site/num_sites)
            fig.add_trace(go.Scatter(x=[x+x_offset, site_id],
                                     showlegend=False,
                                     legendgroup=f'cust_{x}',
                                     opacity=0.1,
                                     line={'color':'#aaa', 'width':1},
                                     mode='lines',
                                     y=[cust_y, (site_day+1)*site_y]))
    fig.update_xaxes(showgrid=False, zeroline=False, range=[min(x_offset, 0)-1, max(num_sites, num_customers/1.8)+1], showticklabels=False)
    fig.update_yaxes(showgrid=False, zeroline=False, range=[min(cust_y, site_y)-1, num_days+1], showticklabels=False)
    fig.update_layout(template='plotly_dark', height=600)
    fig.show()