# Churn Prediction for a Telco Company Using Google Vertex AI

This notebook is a starter kit for implementing a churn analytics pipeline with the AutoML capabilities of Vertex AI. It shows best practices for using AutoML to model churn, and how to use the resulting models to gain insights that help reduce the number of churned users. The notebook consists of the following steps:

1. Initialization and authentication
2. Useful functions
3. Load data
4. Short-term churn prediction - a model for predicting churn over a short time horizon, such as the next month. It shows **WHO** is likely to churn. Models of this type use the most recent behavioral data to predict the next user event.
5. Long-term churn prediction - a model used to reveal **WHEN** users are more likely to churn, which helps the business retain them in time. This model uses the sequence of events from a user's recent past to predict their future behaviour, usually within a predefined future time horizon.
6. Uplift modeling for selecting treatment - a model that shows **HOW** we may prevent users from churning with the treatments best suited to them. Uplift models help the business match users with the right treatment and can be used to optimise costs.
7. Identification of triggers - shows **WHAT** is triggering our users to churn. This type of analysis can show the business where the bottlenecks of the system are.

The models from this notebook should be implemented together by following these steps:
- step 1: Use account and interaction data to train the long-term and short-term models, which reveal the users with a higher probability of churning from the services.
- step 2: Implement the uplift model as shown in this notebook to estimate the best strategy for retaining each individual user, and use the output of the model to optimise the company's retention budget.
- step 3: Use user generated data to extract the churn triggers in the system.

The notebook is intended to be run in Google Colaboratory (Colab).

## Initialization and authentication

In this section, we perform basic initialization and set some variables.

In [None]:
#
# Reinstall google-cloud-aiplatform
#

!pip3 uninstall -y google-cloud-aiplatform
!pip3 install google-cloud-aiplatform
 
import IPython
 
app = IPython.Application.instance()
app.kernel.do_shutdown(True)

In [39]:
!pip install waterfallcharts
!pip install pyyaml
!pip install spacy
!python -m spacy download en_core_web_md



In [None]:
#
# Authentication
#

import sys

if "google.colab" in sys.modules:
    from google.colab import auth

    auth.authenticate_user()

In [None]:
#
# Pull starting kit folder from github repo
#

TOKEN = '[CHANGE TO YOUR GITHUB TOKEN]'
!mkdir rnd-gcp-starter-kits && \
  cd rnd-gcp-starter-kits && \
  git init && \
  git remote add -f origin  https://{TOKEN}@github.com/griddynamics/rnd-gcp-starter-kits/ && \
  git config core.sparseCheckout true && \
  echo churn-prevention-vertex-ai >> .git/info/sparse-checkout && \
  git pull origin master

In [None]:
#
# Load configuration file
#

import yaml

with open('rnd-gcp-starter-kits/churn-prevention-vertex-ai/config.yml') as config_file:
    config = yaml.safe_load(config_file)
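
The cells in this part of the notebook read a number of nested keys from `config.yml`. The sketch below checks a configuration dictionary for a few of those key paths before any cloud call is made. The key names are taken from the code in this notebook; the helper `missing_keys` and all values in `example_config` are hypothetical placeholders, and the list covers only a subset of the keys used.

```python
# A subset of the key paths that the notebook cells read from config.yml.
REQUIRED_KEYS = [
    ("gcp", "region"),
    ("gcp", "project_id"),
    ("gcp", "bucket_name"),
    ("data", "input_data", "file_name_short_term_train"),
    ("artifacts", "model_id", "short_term_churn_model"),
]


def missing_keys(cfg, required=REQUIRED_KEYS):
    """Return the dotted key paths from `required` that are absent in `cfg`."""
    missing = []
    for path in required:
        node = cfg
        for key in path:
            if not isinstance(node, dict) or key not in node:
                missing.append(".".join(path))
                break
            node = node[key]
    return missing


# Placeholder configuration; every value here is hypothetical.
example_config = {
    "gcp": {"region": "us-central1", "project_id": "my-project"},
    "data": {"input_data": {"file_name_short_term_train": "train.csv"}},
    # A model id of None means "train a new model" in the cells below.
    "artifacts": {"model_id": {"short_term_churn_model": None}},
}

print(missing_keys(example_config))
```

A key whose value is `None` still counts as present, which matches how the training cells treat an empty model id.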

In [36]:
#
# Set some global variables with your project details and other information:
#

from datetime import datetime

TIMESTAMP = datetime.now().strftime("%Y_%m_%d_%H%M%S")

REGION = config['gcp']['region']
PROJECT_ID = config['gcp']['project_id']
BUCKET_NAME = config['gcp']['bucket_name']

print(TIMESTAMP)

2022_11_01_102625


In [38]:
#
# Create project, bucket and upload files 
#

# # check if project and bucket exist if no create it
# find_project = !gcloud projects list --filter $PROJECT_ID
# if 'Listed 0 items.' in find_project:
#    !gcloud projects create {PROJECT_ID}

# # set project
# !gcloud config set project {PROJECT_ID}

# # check if bucket exist
# from google.cloud import storage
#
# client = storage.Client()
# try:
#     bucket = client.get_bucket(BUCKET_NAME)
# except:
#     !gsutil mb gs://{BUCKET_NAME}

# # transfer files to bucket
# for key, file in config['data']['input_data'].items():
#     if 'file_name' in key:
#         !gsutil cp data/{file} gs://{BUCKET_NAME}/
#     elif 'file_path' in key:
#         !gsutil cp {file} gs://{BUCKET_NAME}/
#     print(f"Copied {file} to {BUCKET_NAME}")

In [None]:
#
# Initialize AI platform:
#

from google.cloud import aiplatform
import os

aiplatform.init(project=PROJECT_ID, location=REGION, staging_bucket=f"gs://{BUCKET_NAME}")
os.environ["GCLOUD_PROJECT"] = PROJECT_ID 

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

## Useful functions

In [None]:
#
# Function used to cosmetically format plots
#

def format_plots():

    import seaborn as sns
    import matplotlib.pyplot as plt

    sns.set(
        font='serif',
        rc={
          'axes.axisbelow': False,
          'axes.edgecolor': 'lightgrey',
          'axes.facecolor': 'None',
          'axes.grid': False,
          'axes.labelcolor': 'dimgrey',
          'axes.spines.right': False,
          'axes.spines.top': False,
          'figure.facecolor': 'white',
          'lines.solid_capstyle': 'round',
          'patch.edgecolor': 'w',
          'patch.force_edgecolor': True,
          'text.color': 'black',
          'xtick.bottom': False,
          'xtick.color': 'dimgrey',
          'xtick.direction': 'out',
          'xtick.top': False,
          'ytick.color': 'dimgrey',
          'ytick.direction': 'out',
          'ytick.left': False,
          'ytick.right': False}
    )
    sns.set_context(
        "notebook", 
        rc={
          'font.size':14,
          'axes.titlesize':14,
          'axes.labelsize':14}
    )

    plt.rcParams["figure.figsize"] = [10, 8]
    plt.rcParams["figure.autolayout"] = True
    plt.rcParams['axes.prop_cycle'] = plt.cycler(color=[
        '#2CBDFE', '#47DBCD', '#F3A0F2', '#9D2EC5', '#661D98', '#F5B14C'
    ])
    sns.despine(left=True, bottom=True)

format_plots()

In [None]:
#
# Read files from bucket
#

def bucket_to_bytes(file_name, bucket_name=BUCKET_NAME):

    from google.cloud import storage
    from io import BytesIO

    if f"gs://{bucket_name}/" in file_name:
        file_name = file_name.replace(f"gs://{bucket_name}/", "")

    # Get the blob
    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    blob = bucket.blob(file_name)

    # Download the blob contents into an in-memory buffer
    contents = blob.download_as_bytes()
    data = BytesIO(contents)
    return data

In [None]:
#
# Read csv file as a Pandas DataFrame from the bucket
#

def read_csv_from_bucket(gcs_path, bucket_name=BUCKET_NAME):

    import pandas as pd
    csv_data = bucket_to_bytes(gcs_path, bucket_name)
    return pd.read_csv(csv_data)

In [None]:
#
# Write csv file to the bucket
#

def write_csv_to_bucket(csv_object, gcs_path, bucket_name=BUCKET_NAME):

    from google.cloud import storage

    storage_client = storage.Client()
    bucket = storage_client.bucket(bucket_name)
    bucket.blob(gcs_path).upload_from_string(csv_object, 'text/csv')
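
The helpers above can be tried out without touching Cloud Storage. The following sketch reproduces the same idea locally: it strips the `gs://<bucket>/` prefix the way `bucket_to_bytes` does, builds the kind of CSV string that `write_csv_to_bucket` expects, and parses it back from a `BytesIO` buffer. The function `to_blob_name`, the bucket name and the file name are hypothetical, and the standard `csv` module is used instead of pandas only to keep the example dependency-free.

```python
import csv
from io import BytesIO, StringIO


def to_blob_name(path, bucket_name):
    """Drop a leading 'gs://<bucket>/' so that only the object name remains."""
    prefix = f"gs://{bucket_name}/"
    return path[len(prefix):] if path.startswith(prefix) else path


# Hypothetical bucket and object names, used only for illustration.
blob_name = to_blob_name("gs://my-bucket/data/account.csv", "my-bucket")

# write_csv_to_bucket expects the CSV content as a single string ...
buffer = StringIO()
writer = csv.writer(buffer)
writer.writerow(["uuid", "churn"])
writer.writerow(["user-1", 0])
csv_text = buffer.getvalue()

# ... while bucket_to_bytes returns a BytesIO that a CSV reader can parse.
data = BytesIO(csv_text.encode("utf-8"))
rows = list(csv.reader(StringIO(data.read().decode("utf-8"))))

print(blob_name)
print(rows)
```

With pandas, the string passed to `write_csv_to_bucket` would typically come from `df.to_csv(index=False)`.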

In [None]:
#
# Evaluation of Vertex AI classification model
#

def model_evaluation(model, cm_cols=['retain', 'churn'], highlight_feature=None):

    import pandas as pd
    import matplotlib.pyplot as plt

    # Note: if the model has several evaluations, the last one listed is used
    for evaluation in model.list_model_evaluations():

        info_dict = evaluation.to_dict()
        info_metrics = info_dict['metrics']
        info_features = info_dict['modelExplanation']
        confusion_matrix = info_metrics['confusionMatrix']
        au_prc = info_metrics['auPrc']
        au_roc = info_metrics['auRoc']
        log_loss = info_metrics['logLoss']

    info_dict['cm_df'] = pd.DataFrame(confusion_matrix['rows'], 
        index=cm_cols, columns=cm_cols)
    TP = confusion_matrix['rows'][1][1]
    FP = confusion_matrix['rows'][0][1]
    FN = confusion_matrix['rows'][1][0]
    precision = round(TP / (FP + TP), 2)
    recall = round(TP / (FN + TP), 2)

    print("Model metrics:\n")
    print(f" - Area under Precision-Recall Curve: {au_prc}")
    print(f" - Area under Receiver Operating Characteristic Curve: {au_roc}")
    print(f" - Log-loss: {log_loss}")
    print(f" - Precision: {precision}")
    print(f" - Recall: {recall}")

    print("\nConfusion matrix:\n")
    print(info_dict['cm_df'])

    feature_importance = info_features['meanAttributions'][0]['featureAttributions']
    sorted_feature_importance = sorted(
        feature_importance.items(), key=lambda x: x[1], reverse=False)
    names = [feature for feature, _ in sorted_feature_importance]
    values = [value for _, value in sorted_feature_importance]

    plt.barh(range(len(feature_importance)), values, tick_label=names)
    if highlight_feature in names:
        plt.gca().get_yticklabels()[int(names.index(highlight_feature))].set_color("red")
    plt.title("Feature importance")
    plt.show()
    
    return info_dict
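
To make the precision and recall calculation in `model_evaluation` concrete, here is a small worked example on a made-up confusion matrix. It assumes the same layout as the function: rows are true classes, columns are predicted classes, and the order is `[retain, churn]`. The counts are invented for illustration and do not come from any trained model.

```python
# Hypothetical confusion matrix: rows = true class, columns = predicted class,
# both in the order [retain, churn].
rows = [
    [700, 40],   # true retain: 700 predicted retain, 40 predicted churn
    [60, 200],   # true churn:   60 predicted retain, 200 predicted churn
]

TP = rows[1][1]  # churners correctly flagged
FP = rows[0][1]  # retained users wrongly flagged as churners
FN = rows[1][0]  # churners that were missed

precision = round(TP / (FP + TP), 2)
recall = round(TP / (FN + TP), 2)

print(f"precision={precision}, recall={recall}")
```

If the class order in the evaluation differs from `[retain, churn]`, the indices for `TP`, `FP` and `FN` have to be swapped accordingly.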

In [None]:
#
# Optimise threshold classification by using costs
#

def optimisation_by_cost_matrix(thresholds, f_p_cost, f_n_cost, p_c):

    import matplotlib.pyplot as plt

    # Cost per threshold: each false positive costs f_p_cost, each false
    # negative costs f_n_cost weighted by the probability p_c of keeping the user.
    # Count fields can be missing from an entry, in which case 0 is used.
    costs = []
    for metric in thresholds:
        false_negative = int(metric.get('falseNegativeCount', 0))
        false_positive = int(metric.get('falsePositiveCount', 0))
        costs.append(f_p_cost*false_positive + f_n_cost*false_negative*p_c)

    # Use the reported confidence thresholds as the x axis
    threshold_values = [m.get('confidenceThreshold', 0.0) for m in thresholds]
    best_threshold = threshold_values[costs.index(min(costs))]
    print(f"Minimum costs of {min(costs)}$ are obtained for "
          f"{best_threshold} threshold.")

    fig, ax = plt.subplots(figsize=(7,4))
    ax.plot(threshold_values, costs)
    ax.scatter(best_threshold, min(costs), c='r', linewidths=25)
    ax.set(xlabel='threshold', ylabel='cost [$]',
          title='Cost VS threshold')
    ax.grid()

    plt.show()
    return best_threshold
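
The threshold search can be checked on a handful of made-up entries before it is run against a real evaluation. The sketch below assumes entries shaped like the ones the function reads (`confidenceThreshold`, `falsePositiveCount`, `falseNegativeCount`), with counts given as strings and some fields missing, which is what the `int(...)` conversion and the missing-key handling in the function suggest. The helper `cheapest_threshold` and all numbers are hypothetical.

```python
# Hypothetical entries; fields may be missing, so the counts default to 0.
confidence_metrics = [
    {"falsePositiveCount": "90"},
    {"confidenceThreshold": 0.25, "falsePositiveCount": "40", "falseNegativeCount": "5"},
    {"confidenceThreshold": 0.5, "falsePositiveCount": "15", "falseNegativeCount": "20"},
    {"confidenceThreshold": 0.75, "falsePositiveCount": "4", "falseNegativeCount": "45"},
    {"confidenceThreshold": 1.0, "falseNegativeCount": "80"},
]


def cheapest_threshold(metrics, f_p_cost, f_n_cost, p_c):
    """Return (threshold, cost) of the entry with the lowest total cost."""
    costs = [
        f_p_cost * int(m.get("falsePositiveCount", 0))
        + f_n_cost * int(m.get("falseNegativeCount", 0)) * p_c
        for m in metrics
    ]
    best = costs.index(min(costs))
    return metrics[best].get("confidenceThreshold", 0.0), costs[best]


threshold, cost = cheapest_threshold(
    confidence_metrics, f_p_cost=10, f_n_cost=40, p_c=0.3)
print(threshold, cost)
```

Low thresholds are dominated by false positive costs and high thresholds by false negative costs, so the minimum usually sits somewhere in between.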

In [None]:
#
# Get predictions in batch mode
#

def get_df_from_batch_predict(
    input_file,
    output_folder,
    model_name,
    bucket_name,
    model_object=None,
    model_path=None,
):
    import pandas as pd
    
    if model_path is None:

        if model_object is None:
            raise ValueError(
                "If model_path is None, then model_object must be a model instance.")
            
        print(f"Batch predictions with {model_name} model.")
        
        batch_predict_job = model_object.batch_predict(
          gcs_source=input_file,
          instances_format="csv",
          gcs_destination_prefix=output_folder,
          predictions_format="csv",
          job_display_name=f"job-batch_predict-{model_name}-{TIMESTAMP}",
          sync=True
        )
        batch_predict_job.wait()
        list_files = batch_predict_job.iter_outputs()
        print(batch_predict_job.output_info.gcs_output_directory)

    else:
        
        from google.cloud import storage
        
        storage_client = storage.Client()
        bucket = storage_client.get_bucket(bucket_name)
        list_files = list(bucket.list_blobs(prefix=model_path))

    df_list = []
    for row in list_files:
        df_list.append(read_csv_from_bucket(
            row.name, bucket_name=bucket_name))

    return pd.concat(df_list)

In [None]:
#
# Plot time-series of prediction churn periods 
#

def plot_uuid_ts_with_periods(
    uuid, 
    df_predict_interaction, 
    df_lt_result, 
    thresholds,
    map_int_to_date):
    
    import matplotlib.pyplot as plt
    import pandas as pd
    from dateutil.relativedelta import relativedelta
  
    print(f"User id: {uuid}")

    fig_size = (10, 15)
    fig, axs = plt.subplots(ncols=1, nrows=4, figsize=fig_size, 
                            gridspec_kw={'height_ratios': [1, 1, 1, 2]})
    for i, col in enumerate(['internet_min', 'phone_min', 'number_customer_service_calls']):

        df_tmp = df_predict_interaction.loc[(df_predict_interaction['uuid']==uuid)].iloc[-7:]

        df_tmp.index = pd.to_datetime(df_tmp['timestamp'])
        df_tmp_pred = df_lt_result.loc[(df_lt_result['uuid']==uuid)]
        df_tmp_pred.index = pd.to_datetime(df_tmp_pred['timestamp'])

        df_tmp_pred = df_lt_result.loc[(df_lt_result['timestamp']=='2021-10-01')&(df_lt_result['uuid']==uuid)
          ].sort_values(by='time_horizon')[['time_horizon', 'churn', 'churn_1_scores', col]]
        df_tmp_pred = df_tmp_pred.replace({"time_horizon": map_int_to_date})
        df_tmp_pred.index = pd.to_datetime(df_tmp_pred['time_horizon'])

        df_tmp.loc[df_tmp['uuid']==uuid, [col]].plot(ax=axs[i])
        x_true_neg = df_tmp_pred.loc[(df_tmp_pred['churn']==0)].index
        x_true_pos = df_tmp_pred.loc[(df_tmp_pred['churn']==1)].index


        # x_pred_neg = df_tmp_pred.loc[(df_tmp_pred['churn_1_scores']<=.5)].index
        # x_pred_pos = df_tmp_pred.loc[(df_tmp_pred['churn_1_scores']>=.5)].index

        # max_int = max(df_tmp[col].max(), 1)

        if len(x_true_pos)>0:
            x_true_pos = sorted([min(list(x_true_pos)) - relativedelta(months=1)] + list(x_true_pos))
            x_true_pos = [list(x_true_pos)[0] + relativedelta(months=1)] + list(x_true_pos)
            y = df_tmp.loc[df_tmp.index.isin(x_true_pos), col].values
            if len(y)< len(x_true_pos):
                x_true_pos = x_true_pos[1:]
            axs[i].fill_between(x_true_pos, y, color='red', alpha=.2, label='true churn period')

        if len(x_true_neg)>0:
            x_true_neg = sorted([min(list(x_true_neg)) - relativedelta(months=1)] + list(x_true_neg))
            
            y = df_tmp.loc[df_tmp.index.isin(x_true_neg), col].values
            if len(y)< len(x_true_neg):
                x_true_neg = x_true_neg[1:]
            axs[i].fill_between(x_true_neg, y, color='blue', alpha=.2, label='true retention period')
        axs[i].legend(loc='upper left')
        axs[i].set_title(f"True values for \nperiods of {col} ts feature")
        axs[i].set_xlabel('Months')
        axs[i].set_ylabel('min' if col != 'number_customer_service_calls' else 'count')
    
    if len(df_tmp_pred.index) >= len(map_int_to_date):
        axs[3].bar(
            df_tmp_pred.index, 
            df_tmp_pred['churn_1_scores'].values,
            width = 10,
            label = 'churn probability',
            )
        axs[3].step(
            df_tmp_pred.index.values, 
            list(thresholds.values()), 
            'k',
            where="mid",
            linestyle='--',
            linewidth=2,
            label = 'threshold'
            )

    axs[3].set_title(f"Risk of churning")
    axs[3].legend(loc='center left')
    plt.show()

## Load data

In many cases, user data is very sensitive, and for compliance reasons it cannot be used as-is in this notebook. Moreover, since the main purpose of this starter kit is to show best practices for implementing a churn prediction methodology with Vertex AI, real data is not crucial for demonstration purposes, so we decided to proceed with synthetically generated data. Three types of data were generated for this exercise:

 - Account data - data that is usually stored in data warehouses and consists of user socio-demographics, contract duration, type of payment, etc.
 - Interaction data - data that is usually stored in a transactional database and consists of user generated events, such as payment transactions, usage of the system over time, logs, etc.
 - User generated data - data that is usually stored in a data lake and consists of unstructured data, such as emails, call transcriptions, chats, etc.

### Account data

In [None]:
#
# Read account data
#

CSV_TRAIN_ACCOUNT = os.path.join("gs://", BUCKET_NAME, config['data']['input_data']['file_name_short_term_train'])
CSV_PREDICT_ACCOUNT = os.path.join("gs://", BUCKET_NAME, config['data']['input_data']['file_name_short_term_predict'])
DATASET_ACCOUNT = aiplatform.TabularDataset.create(
    display_name="account_data_train", gcs_source=[CSV_TRAIN_ACCOUNT])

In [None]:
df_train_account = read_csv_from_bucket(CSV_TRAIN_ACCOUNT)
df_predict_account = read_csv_from_bucket(CSV_PREDICT_ACCOUNT)
df_train_account.head()

*   `uuid` - unique user id
*   `gender` - Male/Female
*   `tenure` - number of consecutive months that the user has been subscribed to a service
*   `phone_services` - 0 or 1, shows whether the user uses phone services
*   `internet_services` - 0 or 1, shows whether the user uses internet services
*   `contract_duration` - short/long type of contract
*   `payment_method` - email/mail/automatic type of payment
*   `number_customer_service_calls` - total number of calls to customer services
*   `phone_min` - total duration of phone calls in the last month
*   `internet_min` - total duration of internet usage in the last month
*   `phone_monthly_charges` - phone call bill for the last month
*   `internet_monthly_charges` - internet bill for the last month
*   `avg_monthly_bill` - average bill during the user lifetime
*   `trigger_price` - 0 or 1, shows whether price triggered the user to churn
*   `trigger_quality` - 0 or 1, shows whether quality of service triggered the user to churn
*   `trigger_external` - 0 or 1, shows whether an external factor triggered the user to churn
*   `churn` - 0 or 1, shows whether the user churned
*   `treatment` - none/discount/upgrade package/free device, the treatment applied in order to retain the user
*   `churn_after_treatment` - 0 or 1, shows whether the user was retained after treatment

Churn datasets are usually imbalanced, meaning that there are more retained users than churned ones. This is the case in the current dataset as well, with 26% of churned users. It is one of the points that we need to address during the training phase.
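
A short illustration of why the imbalance matters: with 26% churners, a model that always predicts "retain" already reaches 74% accuracy while finding no churners at all. The labels below are synthetic and only reproduce the 26% share quoted above; they are not the notebook's dataset.

```python
from collections import Counter

# Hypothetical labels with the same share of churners as quoted above (26%).
labels = [1] * 26 + [0] * 74

counts = Counter(labels)
churn_rate = counts[1] / len(labels)

# A trivial model that always predicts "retain" (0):
majority_accuracy = counts[0] / len(labels)   # all retained users are "correct"
majority_recall = 0 / counts[1]               # no churner is ever detected

print(f"churn rate: {churn_rate:.2f}")
print(f"accuracy of always predicting retain: {majority_accuracy:.2f}")
print(f"recall of always predicting retain: {majority_recall:.2f}")
```

This is why accuracy alone is a poor guide here and why metrics that focus on the churn class are preferred.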

In [None]:
ax = df_train_account.replace(
        {'churn': {0: "Retain users", 1: "Churn users"}}
    ).groupby('churn')['churn'].count().plot(
        kind="pie", autopct='%1.1f%%', shadow=True, explode=[0.05, 0.05], 
        legend=True, title='Ratio between churn and retain users', 
        ylabel='', labeldistance=None
    )
    
ax.legend(bbox_to_anchor=(1, 1), loc='upper left', prop={'size': 15})

plt.show()

### Interaction data

In [None]:
CSV_TRAIN_INTERACTION = os.path.join("gs://", BUCKET_NAME, config['data']['input_data']['file_name_long_term_v1_train'])
CSV_PREDICT_INTERACTION = os.path.join("gs://", BUCKET_NAME, config['data']['input_data']['file_name_long_term_v1_predict'])
DATASET_INTERACTION = aiplatform.TimeSeriesDataset.create(
  display_name="timeseries_data_train", gcs_source=[CSV_TRAIN_INTERACTION])

In [None]:
df_train_interaction = read_csv_from_bucket(CSV_TRAIN_INTERACTION)
df_predict_interaction = read_csv_from_bucket(CSV_PREDICT_INTERACTION)
df_train_interaction.head()

*   `uuid` - unique user id
*   `timestamp` - time when record is generated
*   `phone_min` - total duration of phone calls during time window of one month
*   `internet_min` - total duration of internet usage during time window of one month
*   `number_customer_service_calls` - total number of calls to the customer services during time window of one month


### Combined dataset

In this part we load data obtained by merging the account and interaction data, combined so that it can be used for training tabular models. We introduced several new features, such as internet_min_lag1, internet_min_lag2, internet_min_lag3, etc., which hold data from previous periods, and a feature time_horizon that determines for which future period we want to make a prediction.
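
The notebook loads the combined file ready-made, so the exact preprocessing is not shown here. The sketch below is one plausible way, under stated assumptions, to derive lag features and a `time_horizon` column from a monthly series for a single user. The function `build_lagged_rows`, the number of lags, the horizons and the usage values are all hypothetical.

```python
def build_lagged_rows(history, n_lags=3, horizons=(1, 2)):
    """Build tabular rows from one user's monthly values (oldest first).

    One row is produced per (reference month, horizon) pair. Each row holds
    the value of the reference month, the n_lags values before it, and the
    horizon (in months ahead) for which a churn label would be attached.
    """
    rows = []
    for t in range(n_lags, len(history)):
        for horizon in horizons:
            row = {"internet_min": history[t], "time_horizon": horizon}
            for lag in range(1, n_lags + 1):
                row[f"internet_min_lag{lag}"] = history[t - lag]
            rows.append(row)
    return rows


# Hypothetical monthly internet usage (minutes) for a single user.
usage = [100, 120, 90, 80, 60]
rows = build_lagged_rows(usage)

print(len(rows))
print(rows[0])
```

With pandas, the same lag columns are commonly built with `groupby('uuid')[column].shift(k)` on data sorted by timestamp.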

In [None]:
CSV_TRAIN_ACC_INTER = os.path.join("gs://", BUCKET_NAME, config['data']['input_data']['file_name_combined_data_train'])
CSV_PREDICT_ACC_INTER = os.path.join("gs://", BUCKET_NAME, config['data']['input_data']['file_name_combined_data_predict'])
DATASET_ACC_INTER = aiplatform.TabularDataset.create(
    display_name="account_interaction_data_train", gcs_source=[CSV_TRAIN_ACC_INTER])

In [None]:
df_train_acc_inter = read_csv_from_bucket(CSV_TRAIN_ACC_INTER)
df_predict_acc_inter = read_csv_from_bucket(CSV_PREDICT_ACC_INTER)
df_train_acc_inter.head()

In [None]:
CSV_TRAIN_ACC_INTER_V2 = os.path.join("gs://", BUCKET_NAME, config['data']['input_data']['file_name_long_term_v2_train'])
CSV_PREDICT_ACC_INTER_V2 = os.path.join("gs://", BUCKET_NAME, config['data']['input_data']['file_name_long_term_v2_predict'])
DATASET_ACC_INTER_V2 = aiplatform.TimeSeriesDataset.create(
  display_name="timeseries_data_train_forecast", gcs_source=[CSV_TRAIN_ACC_INTER_V2])

In [None]:
df_train_acc_inter_v2 = read_csv_from_bucket(CSV_TRAIN_ACC_INTER_V2)
df_predict_acc_inter_v2 = read_csv_from_bucket(CSV_PREDICT_ACC_INTER_V2)
df_train_acc_inter_v2.head()

## Short-term churn prediction

### Train model

As mentioned previously, the dataset used for training this model is imbalanced: more users are retained in the system than churn. Therefore, we use the 'maximize-au-prc' objective function. According to the Vertex AI documentation, this objective maximizes the area under the precision-recall curve and is intended for cases where we want to optimize for the less common class (in this case, churned users).
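
To give a feel for what the area under the precision-recall curve rewards, the sketch below computes average precision, a common step-wise estimate of that area, for two hypothetical sets of scores. Vertex AI computes its own `auPrc` value during evaluation; the exact estimator it uses is not covered here, so this is only an illustration of the concept. The function `average_precision`, the labels and the scores are made up, and ties between scores are not treated specially.

```python
def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve (average precision)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(labels)
    true_pos = 0
    area = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_pos += 1
            # Each positive adds 1/n_pos of recall at the current precision.
            area += (true_pos / rank) / n_pos
    return area


# Hypothetical labels (1 = churn) and two hypothetical models.
labels = [1, 0, 1, 0, 0, 0, 1, 0]
good_scores = [0.9, 0.2, 0.8, 0.3, 0.1, 0.4, 0.7, 0.2]  # churners ranked first
poor_scores = [0.2, 0.9, 0.3, 0.8, 0.7, 0.6, 0.1, 0.5]  # churners ranked last

ap_good = average_precision(labels, good_scores)
ap_poor = average_precision(labels, poor_scores)
print(f"good model: {ap_good:.3f}, poor model: {ap_poor:.3f}")
```

The metric depends only on how well the rare positive class is ranked, which is why it suits imbalanced churn data better than accuracy.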

In [None]:
exclude = ['uuid']
target_column = 'churn'
multi_target_columns = ['trigger_price', 'trigger_quality', 'trigger_external']
treatments = ['churn_after_treatment', 'treatment']
cols = [col for col in df_train_account.columns 
        if col not in exclude+[target_column]+multi_target_columns+treatments]
categorical_cols = df_train_account.loc[:, cols].select_dtypes(exclude="number").columns
numerical_cols = df_train_account.loc[:, cols].select_dtypes(include="number").columns

COLUMN_SPEC = {}
for col_i in numerical_cols:
  COLUMN_SPEC[col_i] = "numeric"
for col_i in categorical_cols:
  COLUMN_SPEC[col_i] = "categorical"

In [None]:
model_st_id = config['artifacts']['model_id']['short_term_churn_model']
model_st_display_name = "aggregate-churn-prediction-model"

if model_st_id is None:
  job_tabular = aiplatform.AutoMLTabularTrainingJob(
      display_name=f'job_train_model-{model_st_display_name}-{TIMESTAMP}',
      optimization_prediction_type='classification',
      optimization_objective='maximize-au-prc',
      column_specs=COLUMN_SPEC
  )

  model_st = job_tabular.run(
      dataset = DATASET_ACCOUNT,
      target_column = target_column,
      training_fraction_split = 0.7,
      validation_fraction_split = 0.15,
      test_fraction_split = 0.15,
      budget_milli_node_hours=1000,  # 1000 milli node hours = 1 node hour
      model_display_name=model_st_display_name,
  )
else:
  model_st = aiplatform.Model(model_st_id)

### Analyse model

In [None]:
info_eval = model_evaluation(model_st)

Feature importance reveals which features contribute the most to classifying churners. We therefore use it to take a closer look at the distributions of the most prominent ones (average monthly bills, internet charges and internet minutes). In addition, the graph shows that, compared to other features, gender, contract duration, and phone and internet services play only a small role in identifying churners.

In [None]:
fig, axs = plt.subplots(ncols=1, nrows=3, figsize=(10, 10))
axs[0].set_title("Compare features of churns and retain users")
axs[0].hist(
    [df_train_account.loc[df_train_account['churn']==1, 'internet_monthly_charges'].tolist(),
     df_train_account.loc[df_train_account['churn']==0, 'internet_monthly_charges'].tolist()], 
     label=['churn','retain'],
     density=True)
axs[0].set_xlabel("internet monthly charges [$]")
axs[0].legend(loc='upper right')
axs[1].hist(
    [df_train_account.loc[df_train_account['churn']==1, 'internet_min'].tolist(),
     df_train_account.loc[df_train_account['churn']==0, 'internet_min'].tolist()], 
     label=['churn','retain'],
     density=True)
axs[1].set_xlabel("internet usage [min]")
axs[2].hist(
    [df_train_account.loc[df_train_account['churn']==1, 'tenure'].tolist(),
     df_train_account.loc[df_train_account['churn']==0, 'tenure'].tolist()], 
     label=['churn','retain'],
     density=True)
axs[2].set_xlabel("user tenure [months]")
plt.show()

This type of analysis is out of the scope of this notebook, but it could be extended to define churn metrics. Such metrics are valuable to the business as a measure of how retained users use the product compared to those who churned. Moreover, they could be used to route likely churners to customer support or to a treatment that converts them back to the services.

### Optimisation by cost matrix

In general, better precision comes at the expense of recall, and vice versa. Moreover, different businesses have different sensitivities when it comes to choosing between recall and precision. Defining the balance between these two metrics is therefore an important decision, because it determines what threshold we should use and how it will affect our costs. One of the best practices in this case is to use a cost matrix.

In order to illustrate this, we introduce artificial costs as follows:
- Recall
  - replacing a false negative user with a new one costs 40 dollars on average,
  - the probability of keeping a false negative user, had we classified them as a churner, is 0.3.
  - Therefore, the expected cost of not predicting a churned user is 12 dollars per case.
- Precision
  - Wrongly treating a retained user as a churner costs the company 10 dollars per case, through one-time down-sell promotions.
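
The arithmetic behind these numbers can be written out explicitly. The sketch below derives the expected cost of a false negative, the total cost for a given number of errors, and the break-even churn probability above which treating a user is cheaper than not treating them. It assumes that only false positives and false negatives carry a cost and that model scores can be read as calibrated probabilities; both are simplifying assumptions, and the function `total_cost` is hypothetical.

```python
acquisition_cost = 40      # cost of replacing a lost user with a new one
prob_of_converting = 0.3   # chance of keeping a churner who gets treated
down_sell_cost = 10        # cost of treating a user who would have stayed

# Expected loss from missing a churner: 40 * 0.3 = 12 dollars per case.
expected_fn_cost = acquisition_cost * prob_of_converting
fp_cost = down_sell_cost


def total_cost(false_positives, false_negatives):
    """Total cost of a set of classification errors under the costs above."""
    return fp_cost * false_positives + expected_fn_cost * false_negatives


# Treating a user with churn probability p pays off when
#   p * expected_fn_cost > (1 - p) * fp_cost,
# which gives p > fp_cost / (fp_cost + expected_fn_cost).
break_even = fp_cost / (fp_cost + expected_fn_cost)

print(f"expected cost per false negative: {expected_fn_cost:.2f}")
print(f"cost of 15 FP and 20 FN: {total_cost(15, 20):.2f}")
print(f"break-even churn probability: {break_even:.3f}")
```

Because a missed churner is slightly more expensive than a wrongly treated user, the break-even point falls a little below 0.5.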

In [None]:
import numpy as np
import pandas as pd


# illustrative values
down_sell_cost = 10
acquisition_cost = 40
prob_of_converting = 0.3


expected_return_TN = df_train_account.loc[df_train_account.churn==0].avg_monthly_bill.mean()
expected_return_TP = prob_of_converting * (
    df_train_account.loc[df_train_account.churn==1].avg_monthly_bill.mean() - 
    down_sell_cost)
expected_cost_FP = -1* down_sell_cost
expected_cost_FN = -1* (acquisition_cost - down_sell_cost) * prob_of_converting
cost_matrix = pd.DataFrame(
    np.array([
                        [expected_return_TN, expected_cost_FP], 
                        [expected_cost_FN, expected_return_TP]
]).astype(int), index=['retain', 'churn'], columns=['retain', 'churn'])
cost_matrix.style.apply(
    lambda x: ['background-color: red' if v<=0 else "" for v in x]
)

In [None]:
best_threshold = optimisation_by_cost_matrix(
    info_eval['metrics']['confidenceMetrics'], 
    down_sell_cost, 
    acquisition_cost, 
    prob_of_converting)

## Long-term churn prediction

## Version 1: AutoMLTabular model

This section of the notebook addresses the problem of identifying churners before it is too late. We show a model that is capable of providing estimates several months before users unsubscribe from services. For this purpose, we use synthetically generated time series data to train a time-aware model with the AutoML Tabular (Vertex AI) service.

Furthermore, we show that short-term models are better balanced and more sensitive to churn behaviour. However, their major downside is that, after they make predictions, the business does not have much time to act, and in many cases users have already made a definite decision to churn. This is where long-term predictions shine: they give us the ability to act in time by predicting the probability of churning over a mid-term future period (6 months in this case).

In general, we would like to keep both models: the short-term model because it is more accurate and reveals actual user intentions, and the long-term model because it is better at identifying possible future churners.

### Train model

In [None]:
exclude = ['uuid', 'timestamp']
target_column = 'churn'
multi_target_columns = ['trigger_price', 'trigger_quality', 'trigger_external']
treatments = ['churn_after_treatment', 'treatment']
cols = [col for col in df_train_acc_inter.columns 
        if col not in exclude+[target_column]+multi_target_columns+treatments]
categorical_cols = df_train_acc_inter.loc[:, cols].select_dtypes(exclude="number").columns
numerical_cols = df_train_acc_inter.loc[:, cols].select_dtypes(include="number").columns

COLUMN_SPEC = {}
for col_i in numerical_cols:
  COLUMN_SPEC[col_i] = "numeric"
for col_i in categorical_cols:
  COLUMN_SPEC[col_i] = "categorical"

In [None]:
model_lt_id = config['artifacts']['model_id']['long_term_churn_model_v1']
model_lt_display_name = "event-churn-prediction-model"

if model_lt_id is None:
  job_tabular = aiplatform.AutoMLTabularTrainingJob(
      display_name=f'job_train_model-{model_lt_display_name}-{TIMESTAMP}',
      optimization_prediction_type='classification',
      optimization_objective='maximize-au-prc',
      column_specs=COLUMN_SPEC
  )

  model_lt = job_tabular.run(
      dataset = DATASET_ACC_INTER,
      target_column = target_column,
      training_fraction_split = 0.7,
      validation_fraction_split = 0.15,
      test_fraction_split = 0.15,
      budget_milli_node_hours=1000,  # 1000 milli node hours = 1 node hour
      model_display_name=model_lt_display_name,
  )
else:
  model_lt = aiplatform.Model(model_lt_id)

### Analyse model

In [None]:
info_eval = model_evaluation(model_lt, highlight_feature='time_horizon')

In [None]:
DATA_LT_FOLDER = config['artifacts']['model_path']['long_term_churn_model_v1_folder']
model_lt_results = config['data']['output_data']['file_path_long_term_churn_model_v1']
gsc_output_folder = os.path.join("gs://", BUCKET_NAME, DATA_LT_FOLDER, f"predictions-st-{TIMESTAMP}")

df_lt_result = get_df_from_batch_predict(
    CSV_PREDICT_ACC_INTER, 
    gsc_output_folder, 
    model_lt_display_name,
    BUCKET_NAME,
    model_lt,
    model_lt_results)

df_lt_result['predicted_churn'] = 0
df_lt_result.loc[
    df_lt_result['churn_0_scores']<=.5, 'predicted_churn'] = 1

In [None]:
threshodls_by_time_horizont = {}
tresholds = np.linspace(0,1, 101)
for time_horizont in range(1,7):
    df_tmp = df_lt_result.loc[df_lt_result['time_horizon']==time_horizont, ['churn_1_scores', 'churn']]
    costs = []
    for th in tresholds:
        FP = df_tmp.loc[(df_tmp['churn']==0)& (df_tmp['churn_1_scores']>=th)].shape[0]
        FN = df_tmp.loc[(df_tmp['churn']==1)& (df_tmp['churn_1_scores']<th)].shape[0]
        costs.append(
            down_sell_cost*FP + 
            acquisition_cost*FN*prob_of_converting
        )
    threshodls_by_time_horizont[time_horizont] = tresholds[int(costs.index(min(costs)))]

In [None]:
map_dates = {
    1: '2021-10-01',
    2: '2021-11-01',
    3: '2021-12-01',
    4: '2022-01-01',
    5: '2022-02-01',
    6: '2022-03-01'
}

In [None]:
plot_uuid_ts_with_periods(
        '8a43ef1e-0756-11ed-a65f-0242ac1c0002', 
        df_predict_interaction, 
        df_lt_result, 
        threshodls_by_time_horizont,
        map_dates
    )

In [None]:
plot_uuid_ts_with_periods(
        '8a4721ca-0756-11ed-a65f-0242ac1c0002', 
        df_predict_interaction, 
        df_lt_result, 
        threshodls_by_time_horizont,
        map_dates
    )

## Version 2: AutoMLForecasting model

In implementing the sequence model, we faced two major challenges.

1.   The first was how to transform the classification problem into a regression problem. The churn labels in the data represent the revealed probability that a user will churn, so there are only two values: 0 when the user is retained and 1 when the user churns. By interpreting the label as a churn probability, the problem is reframed as a regression one. Therefore, we may use the inverse sigmoid function to transform the labels into values that are more easily "digested" by a time series model.

> $$ x_{regression} =  \dfrac{1}{1+e^{-x}} $$

2.   The second was that the dataset becomes even more imbalanced, because churn is a rare event within each time series. We address this by smoothing the values near a churn event towards 0.5, a probability that carries no information; the underlying justification is that we are uncertain whether the user had already decided to churn at that moment in time. In addition, we trim long historical non-churn sequences to reduce the number of non-churn events.
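
The label transformation from the first point can be sketched as follows. This is a minimal, dependency-free illustration and not the preprocessing code used to build the dataset: the helper names `to_regression_target` and `to_probability` and the clipping constant `eps` are assumptions introduced here, needed because hard 0/1 labels have no finite logit.

```python
import math

def to_regression_target(p, eps=0.01):
    """Map a churn probability to an unbounded regression target (logit).

    eps is an assumed clipping constant: labels 0 and 1 have no finite
    logit, so they are pulled slightly inside the (0, 1) interval first.
    """
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))

def to_probability(x):
    """Inverse of the logit: map a regression value back to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Retain (0), uninformative (0.5) and churn (1) labels, transformed and mapped back.
for label in (0.0, 0.5, 1.0):
    target = to_regression_target(label)
    print(label, round(target, 3), round(to_probability(target), 3))
```

The uninformative value 0.5 maps to a regression target of 0, which is why it is a natural choice for the smoothed values around a churn event.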



### Train model

In [None]:
target_column = 'churn_regr'
time_column = 'timestamp'
time_series_identifier_column = 'uuid'
available_at_forecast_columns = [
  'timestamp', 'phone_duration', 'internet_duration',
  'no_service_calls']
time_series_attribute_columns = [
  'gender', 'phone_services', 'internet_services',
  'contract_duration', 'payment_method', 'avg_monthly_bill']
cols = [col for col in df_train_acc_inter_v2.columns if col not in 
        [time_column, time_series_identifier_column]]
categorical_cols = df_train_acc_inter_v2.loc[:, cols].select_dtypes(exclude="number").columns
numerical_cols = df_train_acc_inter_v2.loc[:, cols].select_dtypes(include="number").columns

# column_specs maps each column name to its data type
COLUMN_SPEC = {
    time_column: 'timestamp',
    target_column: 'numeric'
}
for col_i in numerical_cols:
  COLUMN_SPEC[col_i] = "numeric"
for col_i in categorical_cols:
  COLUMN_SPEC[col_i] = "categorical"

forecast_horizon = 3
context_window = 6
model_ts_display_name = "timeseries-churn-prediction-model-regr"

In [None]:
model_ts_id = config['artifacts']['model_id']['long_term_churn_model_v2']
if model_ts_id is None:
  job_long_term = aiplatform.AutoMLForecastingTrainingJob(
      display_name=f'job_train_model-{model_ts_display_name}-{TIMESTAMP}',
      optimization_objective='minimize-rmse',  # alternatives: 'minimize-mae', 'minimize-quantile-loss'
      column_specs = COLUMN_SPEC,
  )

  # This will take around an hour to run
  model_ts = job_long_term.run(
      dataset=DATASET_ACC_INTER_V2,
      target_column=target_column,
      time_column=time_column,
      time_series_identifier_column=time_series_identifier_column,
      available_at_forecast_columns=available_at_forecast_columns,
      unavailable_at_forecast_columns=[target_column],
      time_series_attribute_columns=time_series_attribute_columns,
      forecast_horizon=forecast_horizon,
      context_window=context_window,
      data_granularity_unit="month",
      data_granularity_count=1,
      weight_column=None,
      budget_milli_node_hours=1000,
      model_display_name=model_ts_display_name, 
      predefined_split_column_name=None
  )
else:
  model_ts = aiplatform.Model(model_ts_id)

In [None]:
print("Model evaluation performance: ")
for model_ts_evaluation in model_ts.list_model_evaluations():
  info_dict = model_ts_evaluation.to_dict()
  for metric, value in info_dict['metrics'].items():
    print(f"  - Model {metric}: {value}")

In [None]:
DATA_ACC_INTER_FOLDER_V2 = config['artifacts']['model_path']['long_term_churn_model_v2_folder']
model_lt_results = config['data']['output_data']['file_path_long_term_churn_model_v2']
gsc_output_folder = os.path.join("gs://", BUCKET_NAME, DATA_ACC_INTER_FOLDER_V2, f"predictions-{TIMESTAMP}")

df_ts_result = get_df_from_batch_predict(
    CSV_PREDICT_ACC_INTER_V2, 
    gsc_output_folder, 
    model_ts_display_name,
    BUCKET_NAME,
    model_ts,
    model_lt_results)

# merge true label
df_ts_result = df_ts_result.merge(
    df_predict_acc_inter_v2[['timestamp', 'uuid', 'true_label']], 
    on=['timestamp', 'uuid'])

# convert label and predict to prob values from 0 to 1
df_ts_result['predicted_churn'] = 1/(1+np.exp(-df_ts_result['predicted_churn_regr']))
df_ts_result['true_label'] = 1/(1+np.exp(-df_ts_result['true_label']))

df_ts_result.sort_values(by=['uuid', 'timestamp'], inplace=True)

## Compare Sequence (V2) with Tabular (V1) approach

In [None]:
cm_sequence_model = pd.crosstab(
    df_ts_result['true_label'].round(0), 
    df_ts_result['predicted_churn'].round(0), 
    rownames=['Actual'], 
    colnames=['Predicted'], 
    margins=False
    ).rename({0: 'retain', 1:'churn'}
    ).rename({0: 'retain', 1:'churn'}, axis=1)

precision_forecaster = round(
    cm_sequence_model.loc['churn', 'churn'] / 
    cm_sequence_model.loc[:, 'churn'].sum(), 2)
recall_forecaster = round(
    cm_sequence_model.loc['churn', 'churn'] / 
    cm_sequence_model.loc['churn', :].sum(), 2)

cm_sequence_model

In [None]:
cm_tabular_model = pd.crosstab(
    df_lt_result.loc[df_lt_result['time_horizon']==6, 'churn'].reset_index(drop=True), 
    df_lt_result.loc[df_lt_result['time_horizon']==6,'predicted_churn'].reset_index(drop=True),
    rownames=['Actual'], 
    colnames=['Predicted'], 
    margins=False
    ).rename({0: 'retain', 1:'churn'}
    ).rename({0: 'retain', 1:'churn'}, axis=1)

precision_tabular = round(
    cm_tabular_model.loc['churn', 'churn'] / 
    cm_tabular_model.loc[:, 'churn'].sum(), 2)
recall_tabular = round(
    cm_tabular_model.loc['churn', 'churn'] / 
    cm_tabular_model.loc['churn', :].sum(), 2)

cm_tabular_model

In [None]:
print("** Comparison of two approaches\n")
print(f"Precision:\n    \
Binary Classification {precision_tabular}\
\n    Sequential {precision_forecaster}\n\n")
print(f"Recall:\n    \
Binary Classification {recall_tabular}\
\n    Sequential {recall_forecaster}")

The comparison shows that the sequence model trained with AutoML Forecasting yields better results than the model trained with AutoML Tabular: the sequence model reaches a precision of 0.67 and a recall of 1.00, while the binary classification approach reaches a precision of 0.51 and a recall of 0.69. One possible reason is that the tabular model was trained on a more imbalanced dataset, because it predicts values for six months in advance, while the sequence model predicts values only for the 6th month in the future. Nevertheless, this exercise shows that it is worth trying both approaches to evaluate which one brings better results.

## Uplift modeling

An uplift model predicts how much gain in probability we obtain if we offer users a treatment, compared to the scenario without treatment. For this case, we once again use the AutoMLTabular (Vertex AI) service. As in the previous cases, the data is imbalanced, with a smaller number of users who are retained after a treatment is offered. The final goal of this section is to find the treatments that generate the highest gain in the probability of retaining each individual user.

In [None]:
COLUMN_SPEC.update({'treatment':'categorical'})
target_column = 'churn_after_treatment'

### Train model

In [None]:
model_uplift_id = config['artifacts']['model_id']['uplift_model']
model_uplift_name = "uplift-churn-prediction-model"

if model_uplift_id is None:
  job_tabular = aiplatform.AutoMLTabularTrainingJob(
      display_name=f'job_train_model-{model_uplift_name}-{TIMESTAMP}',
      optimization_prediction_type='classification',
      optimization_objective='maximize-au-prc',
      column_specs=COLUMN_SPEC
  )

  model_uplift = job_tabular.run(
      dataset = DATASET_ACCOUNT,
      target_column = target_column,
      training_fraction_split = 0.7,
      validation_fraction_split = 0.15,
      test_fraction_split = 0.15,
      budget_milli_node_hours=1000,
      model_display_name=model_uplift_name,
  )
else:
  model_uplift = aiplatform.Model(model_uplift_id)

### Analyse model

In [None]:
_ = model_evaluation(model_uplift, highlight_feature='treatment')

The feature importance graph shows that the treatment feature is one of the most important ones. The model is therefore relatively sensitive to this feature, which allows us to use it to estimate the uplift gains of different scenarios.

### Make predictions

To find the best strategies, we calculate the gain in the probability of retaining a user by subtracting the probability in the no-treatment case from the probability in the treatment scenario. Hence, we first populate the treatment feature with None values to obtain predictions for the no-treatment scenario, and then repeat the same for each strategy, this time populating the treatment feature with the name of the relevant treatment. 
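
The subtraction described above can be illustrated for a single user with a small dependency-free sketch. The function name `uplift_per_treatment` and the scores are hypothetical, chosen only for illustration; they are not model output.

```python
def uplift_per_treatment(retention_scores, baseline="none"):
    """Return the gain in retention probability of each treatment over the baseline.

    retention_scores maps a treatment name to the predicted probability of
    retention for one user; `baseline` is the key of the no-treatment score.
    """
    base = retention_scores[baseline]
    return {
        name: score - base
        for name, score in retention_scores.items()
        if name != baseline
    }

# Hypothetical scores for one user (illustrative values only).
scores = {"none": 0.40, "discount": 0.55, "free_device": 0.70, "upg_packet": 0.35}
gains = uplift_per_treatment(scores)

# The treatment with the highest positive gain is the candidate for this user.
best = max(gains, key=gains.get)
print(best, round(gains[best], 2))
```

A negative gain, as for `upg_packet` in this toy example, means the treatment is predicted to do worse than offering nothing.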

In [None]:
# for name in ('none', 'discount', 'free_device', 'upg_packet'):
#     df = read_csv_from_bucket(CSV_PREDICT_ACCOUNT)
#     df['treatment'] = name
#     write_csv_to_bucket(df.to_csv(index=False), f"account_predict_uplift_{name}.csv")

In [None]:
model_uplift_paths_outputs = config['data']['output_data']['files_paths_uplift_model']

dfs_uplift_result_list = []

for name in model_uplift_paths_outputs.keys():

    model_uplift_path = model_uplift_paths_outputs[name]

    csv_file = os.path.join(
        "gs://", BUCKET_NAME, config['data']['input_data']['files_names_uplift_model'][name])

    DATA_UPLIFT_FOLDER = f"mn-model-uplift-{name}-output"
    gsc_output_folder = os.path.join("gs://", BUCKET_NAME, DATA_UPLIFT_FOLDER, f"predictions-uplift-{TIMESTAMP}")

    df_uplift_result = get_df_from_batch_predict(
        csv_file, 
        gsc_output_folder, 
        model_uplift_name,
        BUCKET_NAME,
        model_uplift,
        model_uplift_path)

    dfs_uplift_result_list.append(df_uplift_result)

df_uplift = pd.concat(dfs_uplift_result_list)

In [None]:
# for each user extract information about probability of retain after each treatment
df_treatment = df_uplift.groupby(
    ['uuid', 'treatment'], as_index=False)['churn_after_treatment_0_scores'].first().rename(
    columns={'churn_after_treatment_0_scores': 'treatment_prob_of_retention'})

# extract no treatment cases
df_no_treatment = df_uplift.loc[df_uplift['treatment'].isna(), ['uuid', 'churn_after_treatment_0_scores']].rename(
    columns={'churn_after_treatment_0_scores': 'no_treatment_prob_of_retention'})

# compare treatment with no treatment cases
df_compare_treatment = df_treatment.merge(df_no_treatment, on=['uuid'])
df_compare_treatment['difference'] = df_compare_treatment['treatment_prob_of_retention'] - \
    df_compare_treatment['no_treatment_prob_of_retention']

df_uplift_uid = df_uplift.loc[:, 
      ['uuid', 'treatment', 'churn_after_treatment_0_scores']
    ].fillna('None').sort_values(
      ['treatment', 'uuid']
    ).pivot_table(
      index = ['uuid'], 
      columns = 'treatment', 
      values = 'churn_after_treatment_0_scores'
    )

# compute uplift between adjacent treatment columns of the table, walking from
# the last column back to the second one (the first column keeps its absolute score)
for col_next, col_prev in zip(df_uplift_uid.columns[:0:-1], df_uplift_uid.columns[-2::-1]):
    df_uplift_uid[col_next] = df_uplift_uid[col_next] - df_uplift_uid[col_prev]

Select the best treatment for each user:
- First, filter out users who probably do not need any treatment because they will almost certainly be retained.
- Second, filter out users who are a lost cause, that is, who probably cannot be retained or would be too expensive to retain.

In [None]:
df_highest_gain_treatment = df_compare_treatment.sort_values(
        by=['uuid', 'difference'], ascending=False
    ).loc[
        lambda df: df['difference']>=0
    ].drop_duplicates(['uuid'])

df_highest_gain_treatment.loc[
    lambda df: (df['no_treatment_prob_of_retention'] <0.95) & 
    (df['treatment_prob_of_retention']>0.2), ['uuid','treatment']].sample(10)

In [None]:
import waterfall_chart

uuid = df_highest_gain_treatment.iloc[0, 0] 
waterfall_chart.plot(df_uplift_uid.loc[uuid].index, 
                      df_uplift_uid.loc[uuid].values)
plt.show()

### Budget optimization

Finally, if the budget for each treatment is constrained, we can use MIP (mixed integer programming) to optimise spending against the expected gain from the uplift outcomes. This is out of the scope of this notebook, but once all predictions are available we may consider optimising the following setup:


$$
\max \sum_{j}\sum_{i} X_{j,i} \cdot P_{j,i}
$$

subject to:

> $$B_{discount} \geq \sum_{i} X_{discount,i} \cdot costTreatments_{discount}$$
> $$B_{upgrade} \geq \sum_{i} X_{upgrade,i} \cdot costTreatments_{upgrade}$$
> $$B_{freedevice} \geq \sum_{i} X_{freedevice,i} \cdot costTreatments_{freedevice}$$
> $$X_{discount,i} + X_{freedevice,i} + X_{upgrade,i} + X_{none,i} = 1$$
> $$P_{j,i} \in [0, 1]$$
> $$X_{j,i} \in \{ 0,1 \}$$
> $$j \in \{ freedevice, upgrade, discount, none \}$$
> $$i \in \{ 1, \dots, numberOfUsers \}$$





where


- $X_{j,i}$ - binary decision variable, equal to 1 if treatment j is assigned to user i
- $P_{j,i}$ - probability that user i does not churn if we implement treatment j
- $costTreatments_j$ - cost of implementing treatment j
- $B_j$ - budget for treatment j
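
To make the setup concrete, the sketch below solves a toy instance by exhaustive search. It is not a MIP solver and only works for a handful of users; a realistic instance would need a dedicated solver. The function name `best_assignment` and all probabilities, costs and budgets are hypothetical values chosen for illustration.

```python
from itertools import product

TREATMENTS = ("none", "discount", "upgrade", "freedevice")

def best_assignment(probs, cost, budget):
    """Try every assignment of one treatment per user and keep the best feasible one.

    probs[i][j] : retention probability of user i under treatment j
    cost[j]     : cost of giving treatment j to one user ("none" costs 0)
    budget[j]   : total budget available for treatment j
    """
    best_value, best_plan = float("-inf"), None
    for plan in product(TREATMENTS, repeat=len(probs)):
        # total spend per treatment for this candidate plan
        spend = {j: 0.0 for j in TREATMENTS}
        for j in plan:
            spend[j] += cost[j]
        if any(spend[j] > budget[j] for j in TREATMENTS if j != "none"):
            continue  # violates a budget constraint
        # objective: sum of retention probabilities under the chosen treatments
        value = sum(probs[i][j] for i, j in enumerate(plan))
        if value > best_value:
            best_value, best_plan = value, plan
    return best_value, best_plan

# Hypothetical inputs: three users, each budget covers one treatment of its kind.
probs = [
    {"none": 0.40, "discount": 0.55, "upgrade": 0.50, "freedevice": 0.70},
    {"none": 0.80, "discount": 0.85, "upgrade": 0.82, "freedevice": 0.90},
    {"none": 0.20, "discount": 0.30, "upgrade": 0.60, "freedevice": 0.65},
]
cost = {"none": 0, "discount": 10, "upgrade": 20, "freedevice": 50}
budget = {"discount": 15, "upgrade": 25, "freedevice": 60}

value, plan = best_assignment(probs, cost, budget)
print(plan, round(value, 2))
```

The search grows as 4 to the power of the number of users, which is why a MIP formulation is the practical choice once the number of users is more than a few.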

## Advanced: Identification of triggers

Identifying churn triggers in a system can bring numerous benefits, such as determining bottlenecks in the user funnel and detecting counterintuitive parts of the product, a steep learning curve, unresponsive customer support, or poor product quality. In other words, churn analytics can identify the parts of the system that trigger churn and need to be improved. Here we summarize an approach for revealing users' triggers in a system through sentiment models.

### Sentiment model

Creating a synthetic user-generated dataset with enough data to train the model is a challenging problem in itself, so we decided to use the publicly available “Crowdflower Claritin-Twitter” dataset for this purpose. The logic is similar to transfer learning, except that we reuse a dataset from a similar context rather than the weights of a model.

In [None]:
CSV_SENTIMENT_FILE =  config['data']['input_data']['file_path_sentiment_train']
SENTIMENT_MAX = 4

In [None]:
model_sentiment_id = config['artifacts']['model_id']['sentiment_model']
model_sentiment_name = "sentiment-claritin-demo-model"

if model_sentiment_id is None:

    sentiment_dataset = aiplatform.TextDataset.create(
        display_name="Crowdflower Claritin-Twitter" + "_" + TIMESTAMP,
        gcs_source=[CSV_SENTIMENT_FILE],
        import_schema_uri=aiplatform.schema.dataset.ioformat.text.sentiment,
    )

    job = aiplatform.AutoMLTextTrainingJob(
      display_name=model_sentiment_name+"-" + TIMESTAMP,
      prediction_type="sentiment",
      sentiment_max=SENTIMENT_MAX,
    )

    model_sentiment = job.run(
      dataset=sentiment_dataset,
      model_display_name=model_sentiment_name+"-" + TIMESTAMP,
      training_fraction_split=0.8,
      validation_fraction_split=0.1,
      test_fraction_split=0.1,
    )
else:
    model_sentiment = aiplatform.Model(model_sentiment_id)

Once the model is trained, we can use it to separate good from bad sentiment. For this purpose, we use synthetically generated emails written by users to the customer support of an imaginary telco company.

In [None]:
input_file = os.path.join("gs://", BUCKET_NAME, config['data']['input_data']['file_name_sentiment_predict'])
output_file = config['data']['output_data']['file_path_sentiment']

In [None]:
if not output_file:
    DATA_SENTIMENT_FOLDER = config['artifacts']['model_path']['sentiment_model_folder']
    gsc_output_folder = os.path.join("gs://", BUCKET_NAME, DATA_SENTIMENT_FOLDER, f"predictions-sentiment-{TIMESTAMP}")


    batch_predict_job = model_sentiment.batch_predict(
              gcs_source=input_file,
              instances_format="jsonl",
              gcs_destination_prefix=gsc_output_folder,
              predictions_format="jsonl",
              job_display_name=f"job-batch_predict-{model_sentiment_name}-{TIMESTAMP}",
              sync=True
            )
    batch_predict_job.wait()

    for row in batch_predict_job.iter_outputs():
        output_file = row.name


In [None]:
json_data = bucket_to_bytes(output_file, bucket_name=BUCKET_NAME)
df_sentiment_output = pd.read_json(path_or_buf=json_data, lines=True)
df_sentiment_output['prediction'] = df_sentiment_output['prediction'].apply(lambda x: x['sentiment'])

for i in range(df_sentiment_output.shape[0]):
    txt_data = bucket_to_bytes(file_name=df_sentiment_output.iloc[i][0]['content'])
    df_sentiment_output.iloc[i, 0] = txt_data.read().decode('UTF-8')

The selected emails with bad sentiment are the following:

In [None]:
df_sentimental_select = df_sentiment_output.loc[lambda df: df['prediction']>2, ['instance']]
for i in range(df_sentimental_select.shape[0]):
    print(f"Email {i}: ", df_sentimental_select.iloc[i, 0])

Next, we label the bad-sentiment emails by comparing the vector representations of the words in each email with the vector representations of the triggers. In general, we should use more sophisticated metrics and filter out stop words for this purpose, but for the sake of simplicity we only count words whose similarity to a trigger word is higher than 0.65.
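
The thresholding idea can be sketched without any NLP library, assuming cosine similarity as the similarity measure between word vectors. The function names and the two-dimensional vectors below are hypothetical and serve only to show the mechanics; real word vectors have hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def trigger_score(word_vectors, trigger_vector, threshold=0.65):
    """Mean similarity of the words that exceed the threshold, or 0.0 if none do."""
    similarities = (cosine_similarity(w, trigger_vector) for w in word_vectors)
    hits = [s for s in similarities if s > threshold]
    return sum(hits) / len(hits) if hits else 0.0

# Toy vectors: the first two words are close to the trigger, the third is unrelated.
trigger = (1.0, 0.0)
email_words = [(1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
print(round(trigger_score(email_words, trigger), 2))
```

Only words above the threshold contribute, so an email with no related words receives a score of 0 for that trigger.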

In [None]:
import spacy

nlp = spacy.load('en_core_web_md')
triggers = ['price', 'quality']
trigger_tokens = nlp(' '.join(triggers))
similarity_threshold = 0.65

# work on a copy so that adding columns does not modify df_sentiment_output
df_sentimental_select = df_sentimental_select.copy()

for trigger in trigger_tokens:
    df_sentimental_select[trigger.text] = 0.0
    for i in range(df_sentimental_select.shape[0]):
        # reset the counters for every email so that scores do not carry over between emails
        score = 0
        words = 0
        tokens = nlp(df_sentimental_select.iloc[i, 0])

        for token in tokens:
            similarity_score = token.similarity(trigger)
            if similarity_score > similarity_threshold:
                score += similarity_score
                words += 1
        # mean similarity of the matching words, or 0 when no word matched
        df_sentimental_select.loc[df_sentimental_select.index[i], trigger.text] = round(score/words, 2) if words > 0 else 0.

# normalise the trigger scores per email; the remainder goes to the 'external' column
df_sentimental_select[triggers] = df_sentimental_select[triggers].div(df_sentimental_select[triggers].sum(axis=1), axis=0).fillna(0).round(2)
df_sentimental_select['external'] = 1 - df_sentimental_select[triggers].sum(axis=1)

The outcome of the labeling is the following:

In [None]:
df_sentimental_select