* ## [1) The problem](#TheProblem)

    * #### Goal
    
* ## [2) The Data](#TheData)
    * ### [(a) Clear overview of your data](#DataOverview)
    
        ##### Beverage Machine data

        ##### Beverage Mapping data

        ##### Beverage Classification data
    
        ##### Placement Tickets data

        ##### Telemetry data

    * ### [(b) Plan to manage and process the data](#ManageData)
    
        ##### Beverage Machine data features and the Beverage Classification data features

        ##### Placement Tickets data features
        
        ##### Telemetry data features

        ##### Missing data
        
        ##### Preparation of the data in order to execute some EDA

* ## [3) Preparation of the data](#prep)         
    * ### [(a) Details of preparation](#det)
    
        #### Beverage Machine data preparation

        #### Placement Tickets data preparation

        #### Telemetry data preparation

        #### Data summary
        
    * ### [(b) Save the data](#save)


## 1) The problem <a class="anchor" id="TheProblem"></a>

The main business is a full service for beverage machines, including:

    beverage machines placed at a customer’s site (rented or loaned),
    
    the beverage ingredients (coffee beans, soluble coffee, juice, etc.) delivered to the customers,
    
    and the management of any issues and repairs.

This is similar to office printers, where one company places the printer and also manages the ink supply and any issues.

We have a high churn rate for the beverage machines rented or loaned in our business. The goal is to reduce it by predicting which customers are most likely to churn and then trying to retain them.

The goal is to use Machine Learning to predict which machines are at risk of churn by calculating a churn likelihood.

The 'churned' machines are those that are definitively removed from their installation point, which lowers the number of deployed machines dispensing beverage cups.

A churned machine generates a one-time cost for removal and replacement, plus a variable cost for depreciation and storage while a new location is found.

The Installation Point refers to the customer location where the machine is installed. A customer can have one or several Installation Points. A machine can be replaced by a new machine on the same Installation Point. The idea is to detect when we lose an Installation Point, meaning that a machine distributing beverage cups has churned.

When a machine is replaced on an Installation Point, we have kept the customer, which is why we focus on the Installation Point rather than on the machine's Serial ID.
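
The definition above can be illustrated with a small sketch. It assumes hypothetical column names (`snapshot_month`, `installation_point`, `serial_id`) and made-up rows, and it treats an Installation Point as churned when it appears in one monthly snapshot but not in the next. A machine swap on the same Installation Point is therefore not counted as churn. This is one possible reading of the definition, not the notebook's final implementation.

```python
import pandas as pd

# Hypothetical monthly snapshots: one row per deployed machine per month.
snapshots = pd.DataFrame({
    "snapshot_month": ["2023-01", "2023-01", "2023-02", "2023-02", "2023-03"],
    "installation_point": ["IP1", "IP2", "IP1", "IP2", "IP1"],
    "serial_id": ["A", "B", "C", "B", "C"],  # IP1 swaps machine A for C in February
})

def churned_installation_points(df, previous_month, current_month):
    """Return installation points present in previous_month but absent in current_month."""
    before = set(df.loc[df["snapshot_month"] == previous_month, "installation_point"])
    after = set(df.loc[df["snapshot_month"] == current_month, "installation_point"])
    return sorted(before - after)

# The machine swap on IP1 is not churn; the disappearance of IP2 in March is.
feb_churn = churned_installation_points(snapshots, "2023-01", "2023-02")
mar_churn = churned_installation_points(snapshots, "2023-02", "2023-03")
print(feb_churn, mar_churn)
```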

Two proposals could be used:

    Proposal 1 : We keep all the Installation Point data available and do not aggregate the monthly data of the machines.
    
    Proposal 2 : We aggregate the data of the same Installation Point over several months.
    
Example Proposal 1:

    InstPoint     Month of snapshot     ID       Churn      Age in Month      
    Inst.   1     Jan                   1        No         20
    Inst.   2     Jan                   2        No         48
    Inst.   3     Jan                   3        No         69
    Inst.   4     Jan                   4        Yes        45
    
    Inst.   1     Feb                   5        No         21
    Inst.   2     Feb                   6        No         49
    Inst.   3     Feb                   7        Yes        70
    Inst.   5     Feb                   8        No         25
    
    Inst.   1     Mar                   9        No         22
    Inst.   2     Mar                   10       No         50
    Inst.   5     Mar                   11       No         26
    Inst.   6     Mar                   12       No         30
    Inst.   7     Mar                   13       No         42
    Inst.   8     Mar                   14       No         7
    
    
Example Proposal 2:

    Inst.   #     Latest month snap     ID       Churn      Age in Month       data available since (month)     
    Inst.   1     Mar                   1        No         22                 3
    Inst.   2     Mar                   2        No         50                 3
    Inst.   3     Feb                   3        Yes        70                 2
    Inst.   4     Jan                   4        Yes        45                 1
    Inst.   5     Mar                   5        No         26                 2
    Inst.   6     Mar                   6        No         30                 1
    Inst.   7     Mar                   7        No         42                 1
    Inst.   8     Mar                   8        No         7                  1
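
The step from the monthly rows of Example Proposal 1 to one row per Installation Point can be sketched with a pandas `groupby`. The column names (`inst_point`, `month`, `churn`, `age_months`, `months_of_data`) are hypothetical, and the aggregation rules (latest month, churned if any month is flagged, latest age, number of monthly rows) are an assumption about what Proposal 2 intends.

```python
import pandas as pd

# Rows copied from Example Proposal 1 (one row per installation point per month).
rows = [
    (1, "Jan", "No", 20), (2, "Jan", "No", 48), (3, "Jan", "No", 69), (4, "Jan", "Yes", 45),
    (1, "Feb", "No", 21), (2, "Feb", "No", 49), (3, "Feb", "Yes", 70), (5, "Feb", "No", 25),
    (1, "Mar", "No", 22), (2, "Mar", "No", 50), (5, "Mar", "No", 26), (6, "Mar", "No", 30),
    (7, "Mar", "No", 42), (8, "Mar", "No", 7),
]
monthly = pd.DataFrame(rows, columns=["inst_point", "month", "churn", "age_months"])

# Sort chronologically so that "last" really is the latest snapshot.
month_order = {"Jan": 1, "Feb": 2, "Mar": 3}
monthly["month_num"] = monthly["month"].map(month_order)
monthly = monthly.sort_values(["inst_point", "month_num"])

aggregated = monthly.groupby("inst_point").agg(
    latest_month=("month", "last"),
    churn=("churn", lambda s: "Yes" if (s == "Yes").any() else "No"),
    age_months=("age_months", "last"),
    months_of_data=("month", "count"),
).reset_index()
print(aggregated)
```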

I am currently missing all the Installation Points that churned before January, so the data only contains the current park of Installation Points, i.e. the survivors. I therefore need to be careful about survivorship bias.

To treat this as a time series problem, it would be better to have more data.

Data availability also differs between Sales Organisations: one may have data available from January and another only from March.

The idea behind the first proposal was to predict a monthly churn rate, i.e. the number of churned Installation Points divided by the total number of Installation Points in that month. However, with only 10 months of data it is not the best solution.
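
As a worked illustration of the monthly churn rate, the sketch below applies the ratio to the churn flags of Example Proposal 1. The column names are hypothetical and the data is only the toy example, not the real park.

```python
import pandas as pd

# Churn flags per installation point and month, copied from Example Proposal 1.
monthly = pd.DataFrame({
    "month": ["Jan"] * 4 + ["Feb"] * 4 + ["Mar"] * 6,
    "churn": ["No", "No", "No", "Yes", "No", "No", "Yes", "No"] + ["No"] * 6,
})

# Monthly churn rate = churned installation points / all installation points in that month.
churn_rate = (
    monthly.assign(churned=monthly["churn"].eq("Yes"))
    .groupby("month", sort=False)["churned"]
    .mean()
)
print(churn_rate)
```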

With the second proposal, I predict from the features whether the machine has churned or not. Its advantage is that we can work without the time dimension and focus only on the features.

### a) Goal <a class="anchor" id="Goal"></a>

By giving managers the customer installation points with the highest churn likelihood, we enable them to take action to retain more of them.

This will help to retain more customer installation points and increase the company's deployed beverage machine park.

Also, less churn implies higher efficiency per machine (less time in storage) and lower installation and removal costs.

### TO DO LIST
- Add the Sales Org ID to the Vendon data and to the Incident tickets, then use a mapping to create a Serial-SalesOrg key that links them to the main data.
- Add acquisition cost and book value from the ERP?

## 2) The data <a class="anchor" id="TheData"></a>

### (a) Clear overview of your data  <a class="anchor" id="DataOverview"></a>

pip install matplotlib

In [1]:
import pandas as pd
import os
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt

import datetime as dt
from datetime import datetime

import pickle
#Install brokenaxes
#!pip install brokenaxes

from config import (
    BeverageMachine22_df,
    BeverageMachine23_df,
    BeverageMachine24_df,
    BeverageMachine25_df,
    BevMap_df,
    BeverageClassification_df,
    Placement_df,
    np_churn_consumption2,
    np_churn_consumption,
    Visitsdf,
    PhoneCallsdf,
    IncidentTicketdf,
    PakistanSales,
    MalaysiaSales,
    RussiaSalesData,
    SouthAfricaSales,
    SingaporeSales,
    MktActions,
    IndiaSales
)

# Specify the file path
file_path_output = r'C:\Users\msalomo\OneDrive - NESTLE\Certificate Machine Learning and Data\Churn Project\Notebook output'

# Date when the data was extracted
ChurnDate2=datetime(2023,7,31)

# The range from when I want to have the details about Telemetry data
TelemetryDateRangeStart = '2020-01-01'

In [2]:
#V2
from snowflake.snowpark import Session, DataFrame

from snowflake.snowpark.functions import Column 
from snowflake.snowpark.exceptions import SnowparkSQLException, SnowparkDataframeException
import logging

connection_parameters = {
"account": "nestleprd.west-europe.azure",
"user": "mirko.salomon@nestle.com",
"role": "EDW_READER",  # optional
"warehouse": "EDW_QVW",  # optional
"database": "EDW",  # optional
"schema": "PRS",  # optional
"authenticator": "externalbrowser"
 }  


sf_session = Session.builder.configs(connection_parameters).create()




In [3]:
#Placements tickets

#V2
try:
    dp_sql = """SELECT SERIAL_ID, SERVICE_CATEGORY_DESCRIPTION, INCIDENT_CATEGORY_DESCRIPTION
    FROM EDW.PRS.C4C_NETPLACEMENTS_V"""

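    # Snowpark evaluates lazily, so the actions that hit the warehouse run inside the try block
    df_data_products_config = sf_session.sql(dp_sql)
    df_data_products_config.show()
    pandas_df_NetPlacement = df_data_products_config.toPandas()

except SnowparkSQLException as e:
    logging.error('Exception in function---[ get_data_products() ] - ' + str(e))
    raise  # stop here instead of continuing with a stale or missing DataFrame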

------------------------------------------------------------------------------------
|"SERIAL_ID"  |"SERVICE_CATEGORY_DESCRIPTION"  |"INCIDENT_CATEGORY_DESCRIPTION"    |
------------------------------------------------------------------------------------
|16O0039201   |Removal                         |Removal / Data Fix                 |
|20O0020536   |Removal                         |Low throughput                     |
|18E0020607   |Removal                         |Removal / Data Fix                 |
|20O0021313   |Removal                         |Trial / Demo /Food Show            |
|10051281     |Removal                         |Low throughput                     |
|10043415     |Removal                         |Low throughput                     |
|23O0013360   |Installation                    |New Customer / Installation Point  |
|SGBMB07904   |Installation                    |New Customer / Installation Point  |
|SGBMB12165   |Replacement                     |Maintenance & Rep

In [4]:
#Commercial Visits

try:
    dp_sql = """SELECT

                V1.VISIT_TYPE_DESCRIPTION,
                              
                V1.END_DATE_IN_LOCAL_TIME_ZONE,
                
                V4.DESCRIPTION,
                
                V1.SALESORG,
                
                V3.DESCRIPTION AS ACTIVITY_LIFE_CYCLE_STATUS_DESCRIPTION,
                
                V1.VISIT_ID,
                
                V1.ACCOUNT_ID
                
                FROM
                
                EDW.PRS.C4C_VISIT_ACTIVITY_REPORT_V V1
                
                INNER JOIN EDW.PRS.C4C_SS_VISIT_HEADER_V V2 ON V1.VISIT_ID = V2.VISIT_ID
                
                INNER JOIN EDW.PRS.C4C_SS_ORG_UNIT_MASTER_DATA_V O ON O.ORG_UNIT = V2.ORGANIZATIONAL_UNIT
                
                INNER JOIN EDW.PRS.C4C_SS_ORG_UNIT_MASTER_DATA_V SO ON SO.ORG_UNIT_ID = V2.SALESORG
                
                INNER JOIN EDW.PRS.C4C_TEXT_ACTIVITY_LIFE_CYCLE_STATUS_V V3 ON V3.CODE = V2.ACTIVITY_LIFE_CYCLE_STATUS
                
                LEFT JOIN EDW.PRS.C4C_TEXT_VISIT_RESULT_V V4 ON V1.RESULT = V4.CODE
                
                WHERE ACTIVITY_LIFE_CYCLE_STATUS_DESCRIPTION = 'Completed'"""

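    # Snowpark evaluates lazily, so the actions that hit the warehouse run inside the try block
    df_data_products_config = sf_session.sql(dp_sql)
    df_data_products_config.show()
    pandas_df_Sales_Visits = df_data_products_config.toPandas()

except SnowparkSQLException as e:
    logging.error('Exception in function---[ get_data_products() ] - ' + str(e))
    raise  # stop here instead of continuing with a stale or missing DataFrame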


-------------------------------------------------------------------------------------------------------------------------------------------------------------------
|"VISIT_TYPE_DESCRIPTION"     |"END_DATE_IN_LOCAL_TIME_ZONE"  |"DESCRIPTION"  |"SALESORG"  |"ACTIVITY_LIFE_CYCLE_STATUS_DESCRIPTION"  |"VISIT_ID"  |"ACCOUNT_ID"  |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
|Discovery/Customer Planning  |2024-10-15                     |Objective Met  |AU11        |Completed                                 |2990017     |1310064       |
|Service Visit                |2024-07-08                     |NULL           |SE16        |Completed                                 |2910068     |2374064       |
|Service Visit                |2024-01-17                     |NULL           |MY34        |Completed                                 |2768575     |3791510       |
|Service Visit  

In [5]:
#Commercial Phone calls
try:
    dp_sql = """select Distinct
                
                T1.ACTIVITY_NAME as Activity_Name,
                
                T1.ACCOUNT_ID as Account_Name,
                
                T1.ACTIVITY_OWNER_NAME as Activity_Owner,
                
                T1.ACTIVITY_LIFE_CYCLE_STATUS as Activity_Life_Cycle_Status,
                
                T1.PHONE_CALL_ID as Phone_Call_ID,
                
                T1.OBJECTIVE_PHONE_CALL as Objective_Phone_Call,
                
                T1.SALES_ORGANIZATION_ID as Sales_Organization,
                
                date(T1.END_DATE_IN_LOCAL_TIME_ZONE) as End_Date_in_Local_Time_Zone,
                
                date(T1.START_DATE_IN_LOCAL_TIME_ZONE) as Start_Date_in_Local_Time_Zone,
                
                to_varchar(T1.END_DATE_IN_LOCAL_TIME_ZONE,'yyyy - mm') as PeriodEnd,
                
                T1.ACTIVITY_OWNER_ID as ee
                
                from "EDW"."PRS"."C4C_PHONE_CALLS_ACTIVITY_V" T1"""

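    # Snowpark evaluates lazily, so the actions that hit the warehouse run inside the try block
    df_data_products_config = sf_session.sql(dp_sql)
    df_data_products_config.show()
    pandas_df_Phone_Calls = df_data_products_config.toPandas()

except SnowparkSQLException as e:
    logging.error('Exception in function---[ get_data_products() ] - ' + str(e))
    raise  # stop here instead of continuing with a stale or missing DataFrame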

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|"ACTIVITY_NAME"                       |"ACCOUNT_NAME"  |"ACTIVITY_OWNER"  |"ACTIVITY_LIFE_CYCLE_STATUS"  |"PHONE_CALL_ID"  |"OBJECTIVE_PHONE_CALL"                              |"SALES_ORGANIZATION"  |"END_DATE_IN_LOCAL_TIME_ZONE"  |"START_DATE_IN_LOCAL_TIME_ZONE"  |"PERIODEND"  |"EE"   |
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|2024-08-29- Perumbavoor Tyres Call 1  |9477011         |Jadala Aishwarya  |Completed                     |1371651          |NULL 

In [6]:
#transform the 'COMPLETION_SLA_MET' column from boolean (True/False) to integer (1/0) in your SQL query

try:
    dp_sql = """SELECT COMPLETION_DATE, INCIDENT_CATEGORY_DESCRIPTION, SERIAL_ID, COMPLETION_SLA_MET
    FROM EDW.PRS.C4C_REPAIR_TICKETS_KPI_V"""
    #dp_sql = """SELECT COMPLETION_DATE, INCIDENT_CATEGORY_DESCRIPTION, SERIAL_ID,
     #  CASE WHEN COMPLETION_SLA_MET = 'True' THEN 1 ELSE 0 END AS COMPLETION_SLA_MET
    #FROM EDW.PRS.C4C_REPAIR_TICKETS_KPI_V"""

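    # Snowpark evaluates lazily, so the actions that hit the warehouse run inside the try block
    df_data_products_config = sf_session.sql(dp_sql)
    df_data_products_config.show()
    pandas_df_Repair = df_data_products_config.toPandas()

except SnowparkSQLException as e:
    logging.error('Exception in function---[ get_data_products() ] - ' + str(e))
    raise  # stop here instead of continuing with a stale or missing DataFrame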

--------------------------------------------------------------------------------------------
|"COMPLETION_DATE"  |"INCIDENT_CATEGORY_DESCRIPTION"  |"SERIAL_ID"  |"COMPLETION_SLA_MET"  |
--------------------------------------------------------------------------------------------
|2022-11-10         |2.c Hydraulic Leaking            |20O0047368   |False                 |
|2022-11-03         |1.a Ingredient Calibration       |ID18415      |True                  |
|2022-03-17         |6 Electronics (PCBs)             |ID28731      |False                 |
|2020-12-28         |1.d Ingredient Other             |MYBMB19882   |True                  |
|2020-07-24         |17 Miscellaneous                 |174444191    |True                  |
|2021-12-17         |3.a Door Display/Touchscreen     |ID11156      |False                 |
|2022-02-04         |1.a Ingredient Calibration       |17O0040516   |False                 |
|2021-12-14         |2.g Hydraulic Other              |15O0036793   |F

In [7]:
#V2
# Date when the data was extracted
import calendar

#Algorithm that gives the last day of the past month as Churn Date
CurrentDate = datetime.today()

shift_year = 0
shift_month = 1

if (CurrentDate.month == 1):
    shift_year = 1
    shift_month = -11

new_date = calendar.monthrange(CurrentDate.year - shift_year,CurrentDate.month - shift_month)
ChurnDate2 = datetime(CurrentDate.year - shift_year,CurrentDate.month - shift_month,new_date[1])

print(ChurnDate2)

2025-01-31 00:00:00
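
The same month-end date can also be obtained without the January special case, by stepping back one day from the first day of the current month. A minimal sketch using only the standard library; `last_day_of_previous_month` is a hypothetical helper name, not something defined in this notebook.

```python
from datetime import datetime, timedelta

def last_day_of_previous_month(today):
    """Return midnight on the last day of the month before `today`."""
    first_of_month = today.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    return first_of_month - timedelta(days=1)

feb_case = last_day_of_previous_month(datetime(2025, 2, 14))
jan_case = last_day_of_previous_month(datetime(2025, 1, 5))   # rolls back into the previous year
leap_case = last_day_of_previous_month(datetime(2024, 3, 1))  # leap-year February
print(feb_case, jan_case, leap_case)
```

Subtracting a `timedelta` lets the calendar arithmetic handle year boundaries and month lengths, so no `shift_year` or `shift_month` bookkeeping is needed.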


In [8]:
#### When was the last time we had telemetry data?
# Should be the same as ChurnDate2

#TelemetryDate = ChurnDate2
#TelemetryDate = datetime(2020,8,31)

#PakistanLastUpdate = datetime(2021,5,31)
#PakistanDateRangeStart = datetime(2020,7,31)

#VendonDateRangeStart = TelemetryDate
#VendonLastUpdate = datetime(2021,9,30)

The data has been anonymized

#### Below is a list of my datasets:

#### 1.	Beverage Machine data
    - The Beverage Machine data is maintained by the Service Manager of each Sales Organisation (a Sales Organisation usually corresponds to a country). I can create a report to extract the data to Excel from a database maintained by an external provider.
    - The database only keeps the latest state of each machine, so I take a monthly snapshot of the data to capture the changes.
    - This data describes the current state of the Beverage Machine park.
    - More and more Sales Organisations are going to be managed by this system, so the number of machines managed is increasing.

#### 2.	Beverage Mapping data
    - Beverage Mapping data is maintained in an Excel file by a colleague. I ask him to upload this mapping whenever I find new machines in the consolidated Beverage Machine data.
    - The goal of the file is to link the Beverage Machine data to the Beverage Classification data.
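
Finding machines that are not yet covered by the mapping can be sketched with a left merge and the `indicator` flag. The column names and values below (`Machine Model ID`, `Classification Key`) are hypothetical placeholders; the real files use their own headers.

```python
import pandas as pd

# Hypothetical extracts: machine data on the left, mapping file on the right.
machines = pd.DataFrame({
    "Serial ID": ["S1", "S2", "S3"],
    "Machine Model ID": [100, 200, 300],
})
mapping = pd.DataFrame({
    "Machine Model ID": [100, 200],
    "Classification Key": ["CLASS-A", "CLASS-B"],
})

# indicator=True adds a "_merge" column telling where each row found a match.
merged = machines.merge(mapping, on="Machine Model ID", how="left", indicator=True)

# Models present in the machine data but missing from the mapping file.
unmapped_models = (
    merged.loc[merged["_merge"] == "left_only", "Machine Model ID"].unique().tolist()
)
print(unmapped_models)
```

The resulting list is what would be sent to the colleague who maintains the mapping.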

#### 3.	Beverage Classification data
    - Beverage Classification data is maintained in a SharePoint file by a colleague.
    - This file provides more technical details and features of the Beverage Machines.
    
#### 4.	Placement Tickets data
    - The Placement Tickets data is maintained by the Service Manager of each Sales Organisation. I can create a report to extract the data to Excel from a database maintained by an external provider.
    - This data provides details of the placements and of some incident tickets of the Beverage Machines.
    - Sometimes the tickets are not created by the Service Manager, and some markets do not enter this data in the database, so only a minority of machines have it.

#### 5.	Telemetry data
    - A new project was launched recently, and some machines are now equipped with telemetry.
    - This data is stored by the telemetry provider. I asked an external colleague who manages the relationship with the telemetry provider to share the data he could obtain from his requests.
    - Very few machines are equipped with telemetry so far.
    - The number of machines connected to telemetry is going to increase in the future.
    - This is not the definitive data: my colleague could not provide the final data this month. A data lake is being built to make the data easier to access in the future.


#### 6.	Visits data

#### 7. Phone Calls data

#### 8. Repair tickets data

### Beverage Machine data
Below is an extract of the Beverage Machine data, which contains the details of the Beverage Machines.

The 2021 data is no longer needed, so the cell below was turned into Markdown.

BeverageMachine_df = pd.read_csv(r"C:\Users\msalomo\OneDrive - NESTLE\Certificate Machine Learning and Data\Churn Project\Data\C4CTAUpload.csv")
BeverageMachine_df.head()

#### Additional beverage data

In [9]:
###From 2022 onwards
#memory issues with 2022 to 2025
#BeverageMachine22_df = pd.read_csv(BeverageMachine22_df)
#BeverageMachine22_df.head()

In [10]:
#BeverageMachine22_df.info()

Not needed anymore
### 2020 data
Bev_add2 = pd.read_excel(r"C:\Users\msalomo\OneDrive - NESTLE\Certificate Machine Learning and Data\Churn Project\Data\C4CTAUpload23.csv")
Bev_add2.head()

In [11]:
##2023 data

#BeverageMachine23_df =  pd.read_csv(r"C:\Users\msalomo\OneDrive - NESTLE\Certificate Machine Learning and Data\Churn Project\Data\C4CTAUpload23.csv")
BeverageMachine23_df =  pd.read_csv(BeverageMachine23_df)
BeverageMachine23_df.head()



Unnamed: 0.1,Unnamed: 0,Sales Organisation,User Status Last Changed On,Product [Machine Model],Product ID [Machine Model ID],Range Brand,Machine Status Groupings,User Status,Depreciation Start,Serial ID,...,Industry (Account ID),Industry Code 1 (Account ID),Account ABC Classification (EC ID),Industry (EC ID),Industry Code 1 (EC ID),Parent Installation Point ID,Registered Product Category (Registered Product ID),Sales Org ID (Installation Point),SAP Material Line Code [Machine Model ID],Calendar Date
0,0,NP Bosnia & Herzegovina,44447,NESCAFE ALEGRIA A630 H3A2W HW BP BM,90045171,ALEGRIA,Idle,To be Assigned,42430,16E0009895,...,1101 Exclusive,110101 Distribution Center,Not assigned,Not assigned,Not assigned,#,Trade Asset w/ Fixed Asset,BA10,90045171,2023-01-31
1,1,NP Bosnia & Herzegovina,44447,NESCAFE ALEGRIA A630 H3A2W HW BP BM,90045171,ALEGRIA,Idle,To be Assigned,42522,16E0014757,...,0618 Distributors OOH,061802 Non Exclusive,06 Out of Home,0203 Petrol Station,020399 Not classified,1046515,Trade Asset w/ Fixed Asset,BA10,90045171,2023-01-31
2,2,NP Bosnia & Herzegovina,44447,NESCAFE ALEGRIA A630 H3A2W HW BP BM,90045171,ALEGRIA,Idle,To be Assigned,42614,16E0021271,...,0618 Distributors OOH,061802 Non Exclusive,Not assigned,Not assigned,Not assigned,#,Trade Asset w/ Fixed Asset,BA10,90045171,2023-01-31
3,3,NP Bosnia & Herzegovina,44447,NESCAFE ALEGRIA A630 H3A2W HW BP BM,90045171,ALEGRIA,Idle,To be Assigned,42705,16E0021245,...,1101 Exclusive,110101 Distribution Center,Not assigned,Not assigned,Not assigned,#,Trade Asset w/ Fixed Asset,BA10,90045171,2023-01-31
4,4,NP Bosnia & Herzegovina,44447,NESCAFE ALEGRIA A630 H3A2W HW BP BM,90045171,ALEGRIA,Idle,To be Assigned,42736,16E0021249,...,1101 Exclusive,110101 Distribution Center,Not assigned,Not assigned,Not assigned,#,Trade Asset w/ Fixed Asset,BA10,90045171,2023-01-31


In [12]:
##2024 data
BeverageMachine24_df =  pd.read_csv(BeverageMachine24_df)

BeverageMachine25_df = pd.read_csv(BeverageMachine25_df)

BeverageMachine25_df.head()



Unnamed: 0.1,Unnamed: 0,Sales Organisation,User Status Last Changed On,Product [Machine Model],Product ID [Machine Model ID],Range Brand,Machine Status Groupings,User Status,Depreciation Start,Serial ID,...,Industry (Account ID),Industry Code 1 (Account ID),Account ABC Classification (EC ID),Industry (EC ID),Industry Code 1 (EC ID),Parent Installation Point ID,Registered Product Category (Registered Product ID),Sales Org ID (Installation Point),SAP Material Line Code [Machine Model ID],Calendar Date
0,0,SHL NESTLE PROD SERV,44736.0,Accessories BM,100056187,OTHERS-R/L/N,IDLE,TO BE ASSIGNED,,22O0017701,...,1102 Non-Exclusive,110203 Head Office Loc/Reg,Not assigned,Not assigned,Not assigned,,Trade Asset w/ Fixed Asset,CN26,90076029,2025-01-31
1,1,SHL NESTLE PROD SERV,44736.0,Accessories BM,100056187,OTHERS-R/L/N,IDLE,TO BE ASSIGNED,,22O0017864,...,1102 Non-Exclusive,110203 Head Office Loc/Reg,Not assigned,Not assigned,Not assigned,,Trade Asset w/ Fixed Asset,CN26,90076029,2025-01-31
2,2,SHL NESTLE PROD SERV,44736.0,Accessories BM,100056187,OTHERS-R/L/N,IDLE,TO BE ASSIGNED,,22O0017729,...,1102 Non-Exclusive,110203 Head Office Loc/Reg,Not assigned,Not assigned,Not assigned,,Trade Asset w/ Fixed Asset,CN26,90076029,2025-01-31
3,3,SHL NESTLE PROD SERV,44736.0,Accessories BM,100056187,OTHERS-R/L/N,IDLE,TO BE ASSIGNED,,22O0017852,...,1102 Non-Exclusive,110203 Head Office Loc/Reg,Not assigned,Not assigned,Not assigned,,Trade Asset w/ Fixed Asset,CN26,90076029,2025-01-31
4,4,SHL NESTLE PROD SERV,44736.0,Accessories BM,100056187,OTHERS-R/L/N,IDLE,TO BE ASSIGNED,,22O0017759,...,1102 Non-Exclusive,110203 Head Office Loc/Reg,Not assigned,Not assigned,Not assigned,,Trade Asset w/ Fixed Asset,CN26,90076029,2025-01-31


In [13]:
#BeverageMachine_df = pd.concat([BeverageMachine24_df, BeverageMachine22_df], ignore_index=True) 
BeverageMachine_df = pd.concat([BeverageMachine24_df, BeverageMachine23_df], ignore_index=True) 
#BeverageMachine_df = pd.concat([BeverageMachine_df, BeverageMachine23_df], ignore_index=True) 
BeverageMachine_df = pd.concat([BeverageMachine_df, BeverageMachine25_df], ignore_index=True) 

BeverageMachine_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7625638 entries, 0 to 7625637
Data columns (total 37 columns):
 #   Column                                               Dtype 
---  ------                                               ----- 
 0   Unnamed: 0                                           int64 
 1   Sales Organisation                                   object
 2   User Status Last Changed On                          object
 3   Product [Machine Model]                              object
 4   Product ID [Machine Model ID]                        int64 
 5   Range Brand                                          object
 6   Machine Status Groupings                             object
 7   User Status                                          object
 8   Depreciation Start                                   object
 9   Serial ID                                            object
 10  Manufacturer Number                                  object
 11  Equipment Number                     

In [14]:
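def convert_column_to_int(df, column_name):
    # Coerce values that cannot be parsed as numbers to NaN
    df[column_name] = pd.to_numeric(df[column_name], errors='coerce')
    # Use the nullable Int64 dtype so missing IDs stay <NA>;
    # errors='ignore' leaves the column unchanged if the cast is not possible
    df[column_name] = df[column_name].astype(pd.Int64Dtype(), errors='ignore')
    return df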

BeverageMachine_df = convert_column_to_int(BeverageMachine_df, 'EC ID')
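
To show what the conversion does to an ID column, here is a small sketch on made-up values. The sample strings are hypothetical; the point is that unparseable entries and missing values both end up as `<NA>` while valid IDs become whole numbers.

```python
import pandas as pd

# Hypothetical raw ID column: numeric strings, a placeholder and a missing value.
raw = pd.Series(["1046515", "2374064", "#", None])

# Step 1: anything that is not a number becomes NaN (the column is float64 at this point).
as_numbers = pd.to_numeric(raw, errors="coerce")

# Step 2: the nullable Int64 dtype keeps integers as integers and NaN as <NA>.
as_int = as_numbers.astype("Int64")
print(as_int)
```

Without the nullable dtype the column would stay float, and a later `astype(str)` would produce values such as `1046515.0`.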

In [15]:
# TODO: delete once we move to the Snowflake monthly snapshot.
# Snowflake delivers these values in upper case, which breaks the filter that keeps only "Deployed" machines.
BeverageMachine_df['Machine Status Groupings'] = BeverageMachine_df['Machine Status Groupings'].replace({'DEPLOYED': 'Deployed', 'IDLE': 'Idle', 'OTHER': 'Other'})

# Snowflake also uses upper-case values for 'User Status'. Normalising them prevents the switch to
# Snowflake data from being read as a status change for machines that have not churned.
# To be deleted once all the data comes from Snowflake.
BeverageMachine_df['User Status'] = BeverageMachine_df['User Status'].replace({'IN PREPARATION': 'In Preparation', 'TO BE ASSIGNED': 'To be Assigned', 'TO BE DESTROYED': 'To be Destroyed', 'IN REPAIR': 'In Repair', 'INSTALLED': 'Installed', 'UNDER INSTALLATION': 'Under Installation', 'MISSING': 'Missing', 'STATUS TO BE CORRECTED IN ERP': 'Status to be corrected in ERP', 'TO BE REMOVED': 'To be Removed'})
BeverageMachine_df['TA Usage Indicator'] = BeverageMachine_df['TA Usage Indicator'].replace({'Monthly Rental': '5 Monthly Rental', '': 'Not assigned'})
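
An alternative to listing every upper-case variant by hand is to build the lookup from the canonical labels and match case-insensitively. This is only a sketch on a toy Series; the `canonical` list repeats the three status groupings used above, and unknown or missing values are deliberately left untouched.

```python
import pandas as pd

# Canonical spellings; the lookup key is the upper-case form of each label.
canonical = ["Deployed", "Idle", "Other"]
lookup = {label.upper(): label for label in canonical}

status = pd.Series(["DEPLOYED", "Idle", "OTHER", "Unknown", None])

# Map case-insensitively, then fall back to the original value where no match exists.
normalised = status.str.upper().map(lookup).fillna(status)
print(normalised.tolist())
```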

In [16]:
#Keep the deployed
#BeverageMachine_df= BeverageMachine_df.loc[BeverageMachine_df['Machine Status Groupings']=="DEPLOYED"]

In [17]:
count = BeverageMachine_df[(BeverageMachine_df['Sales Organisation'] == 'Nestlé Russia') & (BeverageMachine_df['Machine Status Groupings'] == 'Deployed')].shape[0]
print("Number of rows with 'Sales Organisation' as 'Nestlé Russia' and 'Machine Status Groupings' as 'Deployed':", count)

Number of rows with 'Sales Organisation' as 'Nestlé Russia' and 'Machine Status Groupings' as 'Deployed': 414177


In [18]:
BeverageMachine_df = BeverageMachine_df.loc[BeverageMachine_df['Machine Status Groupings']=="Deployed"]

The manufacturer serial number can be identical for two different machines in different countries, so I create a key, Key_ManufacturerID_SalesOrg, that combines it with the Sales Organisation.

Key_ManufacturerID_SalesOrg is used to merge the local sales data from each market with the main data.

import pandas as pd

# Create a new column 'Key_ManufacturerID_SalesOrg' with initial values from 'Manufacturer Number' and 'Sales Organisation'
BeverageMachine_df['Key_ManufacturerID_SalesOrg'] = BeverageMachine_df['Manufacturer Number'].astype(str) + BeverageMachine_df['Sales Organisation']



# Conditionally update 'Key_ManufacturerID_SalesOrg' column if it is a specific Sales Organisation
specific_market = 'Nestlé Russia'  # Replace with the name of your specific market
# for Russia use Account ID instead of Manufacturer Number
BeverageMachine_df.loc[BeverageMachine_df['Sales Organisation'] == specific_market, 'Key_ManufacturerID_SalesOrg'] = BeverageMachine_df['Account ID'].astype(str) + BeverageMachine_df['Sales Organisation']


In [19]:
# Create a new column 'Key_ManufacturerID_SalesOrg' with initial values from 'Manufacturer Number' and 'Sales Organisation'
BeverageMachine_df['Key_ManufacturerID_SalesOrg'] = BeverageMachine_df['Manufacturer Number'].astype(str) + BeverageMachine_df['Sales Organisation']

#Account ID should be of type "String"
BeverageMachine_df['Account ID'] = BeverageMachine_df['Account ID'].astype(str)
BeverageMachine_df['Serial ID'] = BeverageMachine_df['Serial ID'].astype(str)
BeverageMachine_df['EC ID'] = BeverageMachine_df['EC ID'].astype(str)

# Conditionally update 'Key_ManufacturerID_SalesOrg' column if it is a specific market
specific_market = 'Nestle South Africa'

specific_market2 = ['Nestlé India', 'Pakistan'] 


BeverageMachine_df['Key_ManufacturerID_SalesOrg'] = BeverageMachine_df.apply(
    lambda row: str(row['Serial ID']) + row['Sales Organisation'] if row['Sales Organisation'] in specific_market2 else (
        row['Account ID'] + row['Sales Organisation'] if row['Sales Organisation'] == specific_market else row['Key_ManufacturerID_SalesOrg']
    ), axis=1
)
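
The row-wise `apply` above can be slow on several million rows. A vectorised sketch of the same rule, using `Series.mask` on a toy frame, is shown below. The sample rows are made up, and `serial_markets` and `account_market` are hypothetical names standing in for `specific_market2` and `specific_market`.

```python
import pandas as pd

# Toy rows with made-up IDs; all ID columns are already strings.
df = pd.DataFrame({
    "Sales Organisation": ["Nestlé India", "Nestle South Africa", "Malaysia"],
    "Serial ID": ["S1", "S2", "S3"],
    "Account ID": ["A1", "A2", "A3"],
    "Manufacturer Number": ["M1", "M2", "M3"],
})

serial_markets = ["Nestlé India", "Pakistan"]   # key built from the Serial ID
account_market = "Nestle South Africa"          # key built from the Account ID
org = df["Sales Organisation"]

# Default key, then overwrite only the rows of the special markets.
key = df["Manufacturer Number"] + org
key = key.mask(org.eq(account_market), df["Account ID"] + org)
key = key.mask(org.isin(serial_markets), df["Serial ID"] + org)
df["Key_ManufacturerID_SalesOrg"] = key
print(df["Key_ManufacturerID_SalesOrg"].tolist())
```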



# Update 'Key_ManufacturerID_SalesOrg' column based on specific markets
BeverageMachine_df['Key_ManufacturerID_SalesOrg'] = BeverageMachine_df.apply(
    lambda row: row['Account ID'] + row['Sales Organisation'] if row['Sales Organisation'] == specific_market else (
        str(row['Serial ID']) + row['Sales Organisation'] if row['Sales Organisation'] in specific_market2 else row['Key_ManufacturerID_SalesOrg']
    ), axis=1
)

BeverageMachine_df['Key_ManufacturerID_SalesOrg'] = BeverageMachine_df['Manufacturer Number'].astype(str) +  BeverageMachine_df['Sales Organisation'] 

Serial ID should be a string: I had issues with mixed types for the same Serial ID.

In [20]:
BeverageMachine_df['Serial ID'] = BeverageMachine_df['Serial ID'].astype('str')

BeverageMachine_df['Parent Installation Point ID'] = BeverageMachine_df['Parent Installation Point ID'].astype('str')

In [21]:
BeverageMachine_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5006354 entries, 0 to 7625637
Data columns (total 38 columns):
 #   Column                                               Dtype 
---  ------                                               ----- 
 0   Unnamed: 0                                           int64 
 1   Sales Organisation                                   object
 2   User Status Last Changed On                          object
 3   Product [Machine Model]                              object
 4   Product ID [Machine Model ID]                        int64 
 5   Range Brand                                          object
 6   Machine Status Groupings                             object
 7   User Status                                          object
 8   Depreciation Start                                   object
 9   Serial ID                                            object
 10  Manufacturer Number                                  object
 11  Equipment Number                     

In [22]:
BeverageMachine_df = BeverageMachine_df.query("`Product [Machine Model]` != 'Vendon Telemetry Device – vBox BM'")

In [23]:
BeverageMachine_df['Sales Organisation'].unique()

array(['Nestlé UAE', 'NP Bosnia & Herzegovina', 'Néstlé Bahrain',
       'NESTLE PROD SERV - CN19', 'SHL NESTLE PROD SERV',
       'Nestlé Denmark', 'Nestlé Finland', 'Nestle Hong Kong',
       'Indonesia', 'JP Japan Sales',
       'Kuwait General Operational Manager', 'NP North Macedonia',
       'Malaysia', 'Nestle New Zealand', 'Nestlé PH', 'Nestlé Qatar',
       'Nestlé Russia', 'Nestlé Slovak Republic', 'Nestle Turkiye Gida',
       'Nestle South Africa', 'Singapore', 'Nestle Australia Ltd',
       'NP-Bulgaria', 'NESTLE PROD SERV - CN17',
       'NESTLE PROD SERV - CN20', 'Nestlé Czech', 'Nestle UK',
       'Nestlé India', 'Néstlé Jordania', 'Nestle Kenya Ltd',
       'Néstlé Lebanon', 'Nestle Prd Mauritius Ltd', 'NP-Netherlands',
       'Nestlé Norway', 'Oman - Business Manager UAE & Oman', 'Pakistan',
       'NP Serbia, Kosovo, Montenegro', 'Néstlé Saudi Arabia',
       'Nestle Sweden', 'Thailand', 'Nestlé Taiwan', 'NP-Belgilux',
       'NP-France', 'Nestlé Italy IT35 OOH', 'Ne

Removed some markets from analysis

In [None]:
BeverageMachine_df2 = BeverageMachine_df.copy()

# List of China sales organizations to be excluded first as they use another system
china_excluded_sales_orgs = [
    'NESTLE PROD SERV - CN17',
    'NESTLE PROD SERV - CN19',
    'NESTLE PROD SERV - CN20'
]

# List of sales organizations to be excluded
excluded_sales_orgs = [
    'NP-Bulgaria',
    'Nestlé Italy IT35 OOH',
    'NP-Netherlands',
    'Nestle Turkiye Gida',
    'Baltics Sales Organization',
    'Nestle Poland',
    'Nestle Romania S.R.L'
]

# Filter out the rows where 'Sales Organisation' is in the China excluded list first
BeverageMachine_df2 = BeverageMachine_df2[~BeverageMachine_df2['Sales Organisation'].isin(china_excluded_sales_orgs)]

# Filter out the rows where 'Sales Organisation' is in the excluded list
BeverageMachine_df2 = BeverageMachine_df2[~BeverageMachine_df2['Sales Organisation'].isin(excluded_sales_orgs)]

# Display the filtered DataFrame
print(BeverageMachine_df2)

BeverageMachine_df2.head()

Unnamed: 0.1,Unnamed: 0,Sales Organisation,User Status Last Changed On,Product [Machine Model],Product ID [Machine Model ID],Range Brand,Machine Status Groupings,User Status,Depreciation Start,Serial ID,...,Industry Code 1 (Account ID),Account ABC Classification (EC ID),Industry (EC ID),Industry Code 1 (EC ID),Parent Installation Point ID,Registered Product Category (Registered Product ID),Sales Org ID (Installation Point),SAP Material Line Code [Machine Model ID],Calendar Date,Key_ManufacturerID_SalesOrg
0,0,Nestlé UAE,43992,NESCAFE MILANO MTS60E H4E1R2W HW Tki BM,90068903,MILANO,Deployed,Installed,43191.0,174544849,...,061406 PMO:Petrol Stations,06 Out of Home,0614 Convenience OOH,061406 PMO:Petrol Stations,1498958,Trade Asset w/ Fixed Asset,AE12,90068903,2024-01-31,20174544849Nestlé UAE
1,1,Nestlé UAE,43992,NESCAFE MILANO MTS60E H4E1R2W HW Tki BM,90068903,MILANO,Deployed,Installed,43191.0,174544851,...,060501 Office Leasing Ctr,06 Out of Home,0605 Business/Industry,060501 Office Leasing Ctr,IP-8930,Trade Asset w/ Fixed Asset,AE12,90068903,2024-01-31,20174544851Nestlé UAE
2,2,Nestlé UAE,43992,NESCAFE MILANO MTS60E H4E1R2W HW Tki BM,90068903,MILANO,Deployed,Installed,43191.0,174544855,...,060501 Office Leasing Ctr,06 Out of Home,0605 Business/Industry,060501 Office Leasing Ctr,IP-9059,Trade Asset w/ Fixed Asset,AE12,90068903,2024-01-31,20174544855Nestlé UAE
3,3,Nestlé UAE,43992,NESCAFE MILANO MTS60E H4E1R2W HW Tki BM,90068903,MILANO,Deployed,Installed,43282.0,182026936,...,060501 Office Leasing Ctr,06 Out of Home,0605 Business/Industry,060501 Office Leasing Ctr,IP-9146,Trade Asset w/ Fixed Asset,AE12,90068903,2024-01-31,20182026936Nestlé UAE
4,4,Nestlé UAE,43992,NESCAFE MILANO MTS60E H4E1R2W HW Tki BM,90068903,MILANO,Deployed,Installed,43313.0,182026920,...,060503 Remote Site Company,06 Out of Home,0605 Business/Industry,060503 Remote Site Company,IP-9006,Trade Asset w/ Fixed Asset,AE12,90068903,2024-01-31,20182026920Nestlé UAE
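
A quick check after filtering can confirm that no excluded market remains, which helps catch stale cell outputs. The sketch below uses a toy frame and shortened exclusion lists; both are assumptions for illustration.

```python
import pandas as pd

toy_machines = pd.DataFrame({
    "Sales Organisation": [
        "Nestlé UAE", "NESTLE PROD SERV - CN17", "NP-Bulgaria", "Nestle UK",
    ]
})

china_excluded = ["NESTLE PROD SERV - CN17"]
other_excluded = ["NP-Bulgaria"]

# Combine both lists and filter in a single pass
excluded = set(china_excluded) | set(other_excluded)
kept = toy_machines[~toy_machines["Sales Organisation"].isin(excluded)]

# Guard: none of the excluded markets should survive the filter
remaining = set(kept["Sales Organisation"].unique())
assert remaining.isdisjoint(excluded)
print(sorted(remaining))
```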


In [25]:
BeverageMachine_df2['Sales Organisation'].unique()

array(['Nestlé UAE', 'NP Bosnia & Herzegovina', 'Néstlé Bahrain',
       'SHL NESTLE PROD SERV', 'Nestlé Denmark', 'Nestlé Finland',
       'Nestle Hong Kong', 'Indonesia', 'JP Japan Sales',
       'Kuwait General Operational Manager', 'NP North Macedonia',
       'Malaysia', 'Nestle New Zealand', 'Nestlé PH', 'Nestlé Qatar',
       'Nestlé Russia', 'Nestlé Slovak Republic', 'Nestle Turkiye Gida',
       'Nestle South Africa', 'Singapore', 'Nestle Australia Ltd',
       'NP-Bulgaria', 'Nestlé Czech', 'Nestle UK', 'Nestlé India',
       'Néstlé Jordania', 'Nestle Kenya Ltd', 'Néstlé Lebanon',
       'Nestle Prd Mauritius Ltd', 'NP-Netherlands', 'Nestlé Norway',
       'Oman - Business Manager UAE & Oman', 'Pakistan',
       'NP Serbia, Kosovo, Montenegro', 'Néstlé Saudi Arabia',
       'Nestle Sweden', 'Thailand', 'Nestlé Taiwan', 'NP-Belgilux',
       'NP-France', 'Nestlé Italy IT35 OOH', 'Nestle Ireland',
       'Nestle Romania S.R.L', 'Nestle Poland',
       'Baltics Sales Organizati

In [26]:
BeverageMachine_df = BeverageMachine_df2

In [27]:
BeverageMachine_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4288720 entries, 0 to 7625637
Data columns (total 38 columns):
 #   Column                                               Dtype 
---  ------                                               ----- 
 0   Unnamed: 0                                           int64 
 1   Sales Organisation                                   object
 2   User Status Last Changed On                          object
 3   Product [Machine Model]                              object
 4   Product ID [Machine Model ID]                        int64 
 5   Range Brand                                          object
 6   Machine Status Groupings                             object
 7   User Status                                          object
 8   Depreciation Start                                   object
 9   Serial ID                                            object
 10  Manufacturer Number                                  object
 11  Equipment Number                     

### Beverage Mapping data

In [28]:
# (A) Load the Beverage Mapping data
#BevMap_df = pd.read_csv(r"C:\Users\msalomo\OneDrive - NESTLE\Certificate Machine Learning and Data\Churn Project\Data\SBU-11 NESTLE PRO. Translation.csv")
BevMap_df = pd.read_csv(BevMap_df)

BevMap_df['ID Model Code']=BevMap_df['ID Model Code'].astype(str)
BevMap_df.head()

Unnamed: 0,Brand Name,Description,ID Model Code,Source,Model,Revised,Modified,Modified By
0,Accolade,Accolade 12oz,ACC-FPD-12z,BMB,N&W Astro Accolade,Yes,6/14/2023 10:07 AM,"Baeza,Jordi,CH-ORBE"
1,Accolade,Accolade 9oz,ACC-FPD- 9z,BMB,N&W Astro Accolade,Yes,6/14/2023 10:07 AM,"Baeza,Jordi,CH-ORBE"
2,ALEGRIA,Chest Freezer NP PK BM,100069870,C4C,Accessories,Yes,6/14/2023 10:07 AM,"Baeza,Jordi,CH-ORBE"
3,ALEGRIA,Chiller SAX 250 NP BM PK,100069872,C4C,Others,Yes,6/14/2023 10:07 AM,"Baeza,Jordi,CH-ORBE"
4,ALEGRIA,Chiller SAX 400 NP BM PK,100069869,C4C,Others,Yes,6/14/2023 10:07 AM,"Baeza,Jordi,CH-ORBE"


In [29]:
BevMap_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2801 entries, 0 to 2800
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Brand Name     2801 non-null   object
 1   Description    2801 non-null   object
 2   ID Model Code  2801 non-null   object
 3   Source         2801 non-null   object
 4   Model          2801 non-null   object
 5   Revised        2801 non-null   object
 6   Modified       2801 non-null   object
 7   Modified By    2801 non-null   object
dtypes: object(8)
memory usage: 175.2+ KB


### Beverage Classification data

In [30]:
# (A) Load the Beverage Classification data
BeverageClassification_df = pd.read_csv(BeverageClassification_df)

BeverageClassification_df.head()

Unnamed: 0,Model,Model Vendor,Model Category,Global Projects,System Brands,Solution Brands,Model Group,Generation,Product,Ingredient Format,...,PSL,TAA & TAR Ownership,TAA & TAR,SC & Planning,Production,IM,Sustainability LCA Ownership,Sustainability LCA,Vendon Compatible,Technical Capacity
0,4Swiss Roma A10 PRO,Others,Mainstream B2C,#-N/A,Branded others,Branded Others,Other,Legacy,Pure R&G,Pure R&G,...,Validated,Market,Not Done,Market,Discontinued,Market,Market,Not Done,,20
1,Accessories,Generic,Other,#-N/A,Branded others,Non-Branded,Other,Legacy,#-Unknown,Other,...,Validated,Market,Not Done,Market,Discontinued,Market,Market,Not Done,,0
2,Alegria V-Café 140,Crem,Hot Liquid,Alegria,Nescafé Alegria,Nescafé Alegria,NA Legacy,Gen. 1,Hot Liquid,Liquid,...,Mandatory,Region,Not Done,Region,Active,Region,Region,Not Done,,0
3,Alegria V-Café 2120,Crem,Hot Liquid,Alegria,Nescafé Alegria,Nescafé Alegria,NA Legacy,Gen. 1,Hot Liquid,Liquid,...,Mandatory,Region,Not Done,Region,Active,Region,Region,Not Done,,0
4,Alegria V-Café 4500,NP Beverages,Hot Liquid,Alegria,Nescafé Alegria,Nescafé Alegria,NA Legacy,Gen. 1,Hot Liquid,Liquid,...,Validated,Market,Not Done,Market,Discontinued,Market,Market,Not Done,,0


In [31]:
BeverageClassification_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 599 entries, 0 to 598
Data columns (total 30 columns):
 #   Column                        Non-Null Count  Dtype 
---  ------                        --------------  ----- 
 0   Model                         598 non-null    object
 1   Model Vendor                  599 non-null    object
 2   Model Category                599 non-null    object
 3   Global Projects               599 non-null    object
 4   System Brands                 599 non-null    object
 5   Solution Brands               599 non-null    object
 6   Model Group                   599 non-null    object
 7   Generation                    599 non-null    object
 8   Product                       599 non-null    object
 9   Ingredient Format             599 non-null    object
 10  Model Category 2              599 non-null    object
 11  Machine Type                  599 non-null    object
 12  Beverage Temperature          599 non-null    object
 13  Positionning        

### Placement Tickets data

pip install openpyxl

In [32]:
# Load the Placement Tickets data
#Placement_df = pd.read_excel(r"C:\Users\msalomo\OneDrive - NESTLE\Certificate Machine Learning and Data\Churn Project\Data\Net Placements.xlsx")
#Placement_df.tail()

In [33]:
#Placement_df.info()

In [34]:
#selected_columns = ['Serial ID', 'Service Category', 'INCIDENT_CATEGORY_DESCRIPTION']
Placement_df = pandas_df_NetPlacement
# Perform operations on the selected DataFrame as needed
Placement_df=Placement_df.rename(columns={"SERIAL_ID": "Serial ID", "SERVICE_CATEGORY_DESCRIPTION": "Service Category"})
#Placement_df=Placement_df[selected_columns]
Placement_df.head()

Unnamed: 0,Serial ID,Service Category,INCIDENT_CATEGORY_DESCRIPTION
0,14E0003095,Installation,Unknown/Other
1,Y105033436,Removal,Unknown/Other
2,12E0012560,Removal,Site closure
3,16E0024039,Removal,Site closure
4,14E0020262,Removal,Seasonal Removal


# Read the Excel file into a pandas DataFrame and filter columns
file_path = Placement_df
selected_columns = ['Serial ID', 'Service Category', 'INCIDENT_CATEGORY_DESCRIPTION']
Placement_df = pd.read_excel(file_path, usecols=selected_columns)
# Perform operations on the selected DataFrame as needed
print(Placement_df)

### Telemetry data

Get the data from the Telemetry Data Lake URL

In [35]:
url = "https://queryenginelandingprod.blob.core.windows.net/shared/np/churn/np_churn_historical_consumption_by_product_group.csv?sp=r&st=2023-04-10T06:40:59Z&se=2050-04-10T14:40:59Z&spr=https&sv=2021-12-02&sr=b&sig=d%2Fn5C%2FWWksWDI%2FiiZEqwz5mOaw2jAqkW9DHUOSz6R7Q%3D"
np_churn_consumption2 =pd.read_csv(url)
np_churn_consumption2

Unnamed: 0,date,serial,sap_serial,quantity,salesorg,machine_id,product_group
0,2022-02-28,20150404801,,13,ESAR,126694,CAPPUCCINO
1,2022-02-28,20202102671,70010010331,201,MENA,4158,CAPPUCCINO
2,2022-02-28,20202319122,MYBMB35137,493,Malaysia,137198,ICED LATTE
3,2022-02-28,3400000017927/E0110021908140,90074212,29,Turkey,5730,FLAVOURED LATTE
4,2022-02-28,20205135631,70010011034,58,MENA,5529,AMERICANO
...,...,...,...,...,...,...,...
5001629,2025-01-31,2012 50 44 310,1551,1,Dominican Republic,6181648,RISTRETTO
5001630,2025-01-31,10297426,10297426,1,Ecuador,6849445,ICED COFFEE
5001631,2025-01-31,20202017312,3801,1,Dominican Republic,1134981,AMERICANO
5001632,2025-01-31,20192431620,3576,2,Dominican Republic,11616877,CHOCOLATE


In [36]:
#url= "https://queryenginelandingstag.blob.core.windows.net/shared/np/churn/np_churn_historical_consumption.csv?sp=r&st=2022-09-02T07:17:17Z&se=2050-09-02T15:17:17Z&spr=https&sv=2021-06-08&sr=b&sig=hiIpKctZ%2BlxXwR9E%2BVReK1TnsQqZrcayCYu%2BZaCynlw%3D"
url = "https://queryenginelandingprod.blob.core.windows.net/shared/np/churn/np_churn_historical_consumption.csv?sp=r&st=2022-11-29T12:22:43Z&se=2050-11-29T20:22:43Z&spr=https&sv=2021-06-08&sr=b&sig=JZE599UA3foRsJ6ZbOHW6M0nWexxLc3JCB49gJ%2B2faU%3D"
np_churn_consumption =pd.read_csv(url)
np_churn_consumption

Unnamed: 0,date,serial,sap_serial,quantity,salesorg,machine_id
0,2022-02-28,20202319122,MYBMB35137,2416,Malaysia,137198
1,2022-02-28,3400000017927/E0110021908140,90074212,1664,Turkey,5730
2,2022-02-28,20202016662,70010010418,1243,MENA,5754
3,2022-02-28,20200911666,70010062795,2511,Pakistan,94762
4,2022-02-28,20202016668,70010010424,784,MENA,2469
...,...,...,...,...,...,...
686838,2025-01-31,3400000110999,70010235595,1,Malaysia,11335959
686839,2025-01-31,20202420115,,1,MENA,11492753
686840,2025-01-31,Unknown,,1,ADC,10371013
686841,2025-01-31,121000,6631046,1,MENA,3163


In [37]:
np_churn_consumption = np_churn_consumption.rename(columns={"date": "Month", "salesorg": "SalesOrg"}).reset_index()
np_churn_consumption.head()

Unnamed: 0,index,Month,serial,sap_serial,quantity,SalesOrg,machine_id
0,0,2022-02-28,20202319122,MYBMB35137,2416,Malaysia,137198
1,1,2022-02-28,3400000017927/E0110021908140,90074212,1664,Turkey,5730
2,2,2022-02-28,20202016662,70010010418,1243,MENA,5754
3,3,2022-02-28,20200911666,70010062795,2511,Pakistan,94762
4,4,2022-02-28,20202016668,70010010424,784,MENA,2469


In [38]:
np_churn_consumption=np_churn_consumption.drop(columns=['sap_serial','index', 'machine_id'])
np_churn_consumption.head()

Unnamed: 0,Month,serial,quantity,SalesOrg
0,2022-02-28,20202319122,2416,Malaysia
1,2022-02-28,3400000017927/E0110021908140,1664,Turkey
2,2022-02-28,20202016662,1243,MENA
3,2022-02-28,20200911666,2511,Pakistan
4,2022-02-28,20202016668,784,MENA


In [39]:
Telemetry_df = np_churn_consumption

# Load the Telemetry data
Telemetry_df = pd.read_excel(r"C:\Users\msalomo\OneDrive - NESTLE\Certificate Machine Learning and Data\Churn Project\Data\Telemetry2021.xlsx")
Telemetry_df.tail()

In [40]:
Telemetry_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 686843 entries, 0 to 686842
Data columns (total 4 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   Month     686843 non-null  object
 1   serial    672150 non-null  object
 2   quantity  686843 non-null  int64 
 3   SalesOrg  686843 non-null  object
dtypes: int64(1), object(3)
memory usage: 21.0+ MB


# Load the Telemetry data
Telemetry_add = pd.read_excel(r"C:\Users\msalomo\OneDrive - NESTLE\Certificate Machine Learning and Data\Churn Project\Data\Telemetry2022.xlsx")
Telemetry_add.tail()

# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
Telemetry_df = pd.concat([Telemetry_df, Telemetry_add])
Telemetry_df.info()

Telemetry_df.tail()

I will keep only the telemetry data within the date range, i.e. the telemetry data dated on or after "TelemetryDateRangeStart".

#Telemetry_df1 = Telemetry_df.loc[Telemetry_df['Month']>=TelemetryDateRangeStart]
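
Since `Month` is loaded as text, one way to make the date filter robust is to convert it to datetime first. This is a minimal sketch on toy data; the cut-off value given to `TelemetryDateRangeStart` here is a made-up example.

```python
import pandas as pd

toy_telemetry = pd.DataFrame({
    "Month": ["2022-02-28", "2023-06-30", "2025-01-31"],
    "serial": ["A", "A", "B"],
    "quantity": [10, 20, 30],
})

# Hypothetical cut-off date, for illustration only
TelemetryDateRangeStart = pd.Timestamp("2023-01-01")

# Convert the text dates so the comparison is done on real timestamps
toy_telemetry["Month"] = pd.to_datetime(toy_telemetry["Month"])
in_range = toy_telemetry.loc[toy_telemetry["Month"] >= TelemetryDateRangeStart]
print(in_range)
```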

I will aggregate the cup sales for each machine by the 'serial' feature, which corresponds to the 'Manufacturer Number' feature in the Beverage Machine data.

In [41]:
# Total cup sales per machine (the 'axis' argument of groupby is deprecated)
Telemetry_aggSales = Telemetry_df.groupby('serial')['quantity'].sum()
Telemetry_aggSales_df = Telemetry_aggSales.to_frame().reset_index()
Telemetry_aggSales_df

Unnamed: 0,serial,quantity
0,\t 20221813844,9566
1,#N/D,2409
2,'20202016602,52397
3,'Y20231619331,27277
4,**,545
...,...,...
43929,Х580BGS230203370094,2469
43930,Х580BGS230203370097,9444
43931,Х580BGS230203370098,14562
43932,Х580BGS230203370105,35791


In [42]:
# Sum only the numeric 'quantity' column per month and machine;
# selecting it explicitly keeps the text column 'SalesOrg' out of the sum
Telemetry_df1 = Telemetry_df.groupby(['Month', 'serial'], as_index=False)['quantity'].sum()

Telemetry_df1


Unnamed: 0,Month,serial,quantity
0,2022-02-28,'20202016602,1390
1,2022-02-28,000103535,1070
2,2022-02-28,000103550,1298
3,2022-02-28,00086126,883
4,2022-02-28,00094680,1043
...,...,...,...
658365,2025-01-31,Х580BGS230203370094,257
658366,2025-01-31,Х580BGS230203370097,710
658367,2025-01-31,Х580BGS230203370098,1071
658368,2025-01-31,Х580BGS230203370105,2582


from dateutil.relativedelta import relativedelta

one_month = TelemetryDate + relativedelta(months=-1)
three_months = TelemetryDate + relativedelta(months=-3)
six_months = TelemetryDate + relativedelta(months=-6)

Telemetry_df_one_month = Telemetry_df1.loc[Telemetry_df1['Month']>one_month]
Telemetry_df_three_months = Telemetry_df1.loc[Telemetry_df1['Month']>three_months]
Telemetry_df_six_months = Telemetry_df1.loc[Telemetry_df1['Month']>six_months]

TODO: why does the earlier version use `.count()`

Telemetry_aggSales_one_month_avg = Telemetry_df_one_month['quantity'].groupby(Telemetry_df_one_month['serial'], axis=0).count()

rather than `.sum()`?

Telemetry_aggSales_one_month_avg = Telemetry_df_one_month['quantity'].groupby(Telemetry_df_one_month['serial'], axis=0).sum()
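
To make the difference in the TODO concrete, here is a small sketch on toy data: `count()` returns how many rows each machine has, while `sum()` returns how many cups it sold. The values are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "serial": ["A", "A", "B"],
    "quantity": [100, 50, 30],
})

by_serial = toy.groupby("serial")["quantity"]
counts = by_serial.count()  # number of records per machine
sums = by_serial.sum()      # total cups per machine

print(counts.to_dict())
print(sums.to_dict())
```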

In [43]:
Telemetry_df_one_month = Telemetry_df1.sort_values('Month').groupby('serial').agg({'quantity' : lambda x: x.tail(1).sum()})

Telemetry_df_three_months = Telemetry_df1.sort_values('Month').groupby('serial').agg({'quantity' : lambda x: x.tail(3).sum()/3})

Telemetry_df_six_months = Telemetry_df1.sort_values('Month').groupby('serial').agg({'quantity' : lambda x: x.tail(6).sum()/6})

Telemetry_df_one_month = Telemetry_df_one_month.rename(columns={"quantity": "one_month_avg"}).reset_index()

Telemetry_df_three_months = Telemetry_df_three_months.rename(columns={"quantity": "three_months_avg"}).reset_index()

Telemetry_df_six_months = Telemetry_df_six_months.rename(columns={"quantity": "six_months_avg"}).reset_index()
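
One property of the `tail(n).sum()/n` pattern is that a machine with fewer than `n` months of data is still divided by `n`, which treats the missing months as zero sales. The sketch below contrasts this with `tail(n).mean()` on toy data; which behaviour is preferable depends on how new machines should be handled, so this is an illustration rather than a recommendation.

```python
import pandas as pd

monthly = pd.DataFrame({
    "Month": ["2024-11-30", "2024-12-31", "2025-01-31", "2025-01-31"],
    "serial": ["A", "A", "A", "B"],   # machine B has only one month of data
    "quantity": [300, 600, 900, 900],
})

grouped = monthly.sort_values("Month").groupby("serial")["quantity"]

# Pattern used in the notebook: always divide by 3
fixed_divisor = grouped.agg(lambda x: x.tail(3).sum() / 3)

# Alternative: average over the months that actually exist
available_mean = grouped.agg(lambda x: x.tail(3).mean())

print(fixed_divisor.to_dict())
print(available_mean.to_dict())
```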

Telemetry_aggSales_one_month_avg = Telemetry_df_one_month['quantity'].groupby(Telemetry_df_one_month['serial'], axis=0).count()
Telemetry_aggSales_one_month_avg = Telemetry_aggSales_one_month_avg.to_frame().reset_index()
Telemetry_aggSales_one_month_avg = Telemetry_aggSales_one_month_avg.rename(columns={"quantity": "one_Month_avg"})

Telemetry_aggSales_three_months_avg = Telemetry_df_three_months['quantity'].groupby(Telemetry_df_three_months['serial'], axis=0).count()
Telemetry_aggSales_three_months_avg = Telemetry_aggSales_three_months_avg.to_frame().reset_index()
Telemetry_aggSales_three_months_avg = Telemetry_aggSales_three_months_avg.rename(columns={"quantity": "three_months_avg"})

Telemetry_aggSales_six_months_avg = Telemetry_df_six_months['quantity'].groupby(Telemetry_df_six_months['serial'], axis=0).count()
Telemetry_aggSales_six_months_avg = Telemetry_aggSales_six_months_avg.to_frame().reset_index()
Telemetry_aggSales_six_months_avg = Telemetry_aggSales_six_months_avg.rename(columns={"quantity": "six_months_avg"})

Telemetry_aggSales_three_months_avg

Telemetry_aggSales_three_months_avg['three_months_avg'] = Telemetry_aggSales_three_months_avg['three_months_avg'].apply(lambda x: x/3)

Telemetry_aggSales_six_months_avg['six_months_avg'] = Telemetry_aggSales_six_months_avg['six_months_avg'].apply(lambda x: x/6)

Telemetry_aggSales_three_months_avg 

I used a 'left' join instead of an 'inner' join because I want to keep every machine that has telemetry data.
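
As a small illustration of that choice on toy data: a left join keeps every row of the left table and leaves gaps for machines missing on the right, which `fillna(0)` then turns into zeros, whereas an inner join would drop those machines. The frames below are made up.

```python
import pandas as pd

totals = pd.DataFrame({"serial": ["A", "B", "C"], "quantity": [10, 20, 30]})
recent = pd.DataFrame({"serial": ["A", "B"], "one_month_avg": [1, 2]})

left = pd.merge(totals, recent, how="left", on="serial")
inner = pd.merge(totals, recent, how="inner", on="serial")

# Machine C has no recent data: NaN after the left join, 0 after fillna
left_filled = left.fillna(0)

print(len(left), len(inner))
print(left_filled)
```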

In [44]:
Telemetry_aggSales_df

Unnamed: 0,serial,quantity
0,\t 20221813844,9566
1,#N/D,2409
2,'20202016602,52397
3,'Y20231619331,27277
4,**,545
...,...,...
43929,Х580BGS230203370094,2469
43930,Х580BGS230203370097,9444
43931,Х580BGS230203370098,14562
43932,Х580BGS230203370105,35791


In [45]:
Telemetry_df_one_month

Unnamed: 0,serial,one_month_avg
0,\t 20221813844,1243
1,#N/D,6
2,'20202016602,1401
3,'Y20231619331,1844
4,**,350
...,...,...
43929,Х580BGS230203370094,257
43930,Х580BGS230203370097,710
43931,Х580BGS230203370098,1071
43932,Х580BGS230203370105,2582


In [46]:
Telemetry_aggSales_df1 = pd.merge(Telemetry_aggSales_df, Telemetry_df_one_month, how='left', left_on = ['serial'], right_on = ['serial'])
Telemetry_aggSales_df1.head()

Unnamed: 0,serial,quantity,one_month_avg
0,\t 20221813844,9566,1243
1,#N/D,2409,6
2,'20202016602,52397,1401
3,'Y20231619331,27277,1844
4,**,545,350


In [47]:
Telemetry_aggSales_df2 = pd.merge(Telemetry_aggSales_df1, Telemetry_df_three_months, how='left', left_on = ['serial'], right_on = ['serial'])
Telemetry_aggSales_df3 = pd.merge(Telemetry_aggSales_df2, Telemetry_df_six_months, how='left', left_on = ['serial'], right_on = ['serial'])
Telemetry_aggSales_df3 = Telemetry_aggSales_df3.fillna(0)
Telemetry_aggSales_df3

Unnamed: 0,serial,quantity,one_month_avg,three_months_avg,six_months_avg
0,\t 20221813844,9566,1243,2214.333333,1594.333333
1,#N/D,2409,6,516.333333,401.500000
2,'20202016602,52397,1401,2447.000000,2536.166667
3,'Y20231619331,27277,1844,1924.666667,2104.666667
4,**,545,350,181.666667,90.833333
...,...,...,...,...,...
43929,Х580BGS230203370094,2469,257,329.666667,321.833333
43930,Х580BGS230203370097,9444,710,820.000000,774.166667
43931,Х580BGS230203370098,14562,1071,1286.666667,1285.333333
43932,Х580BGS230203370105,35791,2582,3041.666667,3123.333333


In [48]:
Telemetry_aggSales_df3['serial'] = Telemetry_aggSales_df3['serial'].astype(str)

In [49]:
Telemetry_aggSales_df3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 43934 entries, 0 to 43933
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   serial            43934 non-null  object 
 1   quantity          43934 non-null  int64  
 2   one_month_avg     43934 non-null  int64  
 3   three_months_avg  43934 non-null  float64
 4   six_months_avg    43934 non-null  float64
dtypes: float64(2), int64(2), object(1)
memory usage: 2.0+ MB


### 6. Visits data

# Load the Visits data
Visitsdf = pd.read_excel(Visitsdf)
Visitsdf.head()

In [50]:
#Renaming the columns to align with the old data
Visitsdf = pandas_df_Sales_Visits
Visitsdf=Visitsdf.rename(columns={'END_DATE_IN_LOCAL_TIME_ZONE': 'End Date in Local Time Zone', "DESCRIPTION": "Result", 'ACTIVITY_LIFE_CYCLE_STATUS_DESCRIPTION': 'Activity Life Cycle Status', 'VISIT_ID': 'Visit', 'ACCOUNT_ID': 'Account ID.Account ID Level 01.Key'})
Visitsdf.head()

Unnamed: 0,VISIT_TYPE_DESCRIPTION,End Date in Local Time Zone,Result,SALESORG,Activity Life Cycle Status,Visit,Account ID.Account ID Level 01.Key
0,Discovery/Customer Planning,2024-10-15,Objective Met,AU11,Completed,2990017,1310064
1,Service Visit,2024-07-08,,SE16,Completed,2910068,2374064
2,Service Visit,2024-01-17,,MY34,Completed,2768575,3791510
3,Service Visit,2024-02-06,,MY34,Completed,2785408,3791510
4,Grow/Business Development,2024-08-27,,PL10,Completed,2952812,8995134


### 7. Phone Calls data

# Load the Phone Calls data
PhoneCallsdf = pd.read_excel(PhoneCallsdf, dtype={'Account Name': str})
PhoneCallsdf.head()

In [51]:
#Renaming the columns to align with the old data
PhoneCallsdf = pandas_df_Phone_Calls
PhoneCallsdf["ACCOUNT_NAME"]= PhoneCallsdf["ACCOUNT_NAME"].astype(str)
PhoneCallsdf= PhoneCallsdf.rename(columns={"ACCOUNT_NAME": "Account Name"})
PhoneCallsdf.head()

Unnamed: 0,ACTIVITY_NAME,Account Name,ACTIVITY_OWNER,ACTIVITY_LIFE_CYCLE_STATUS,PHONE_CALL_ID,OBJECTIVE_PHONE_CALL,SALES_ORGANIZATION,END_DATE_IN_LOCAL_TIME_ZONE,START_DATE_IN_LOCAL_TIME_ZONE,PERIODEND,EE
0,2024-05-21- Malá Itálie pizzerie s.r.o,9223247,Milan Svoboda,Completed,1327588,,CZ11,2024-05-20,2024-05-20,2024 - 05,4826
1,2025-02-06- Hotel Call 3,9870085,Faruq Khan,Completed,1448411,,IN14,2025-02-07,2025-02-07,2025 - 02,12496
2,2025-02-13- Education academy Call 1,9885185,Gaurav Mishra,Completed,1451304,,IN14,2025-02-14,2025-02-14,2025 - 02,11273
3,2025-02-13- Das Call 1,9884739,Sandeep Gupta,Completed,1451143,,IN14,2025-02-13,2025-02-13,2025 - 02,16062
4,2025-02-06- New satyanarayan mithai bhandar Ca...,9871856,Akash Chawariya,Completed,1448810,,IN14,2025-02-08,2025-02-08,2025 - 02,10608


In [52]:

PhoneCallsdf = PhoneCallsdf.rename(columns={'ACTIVITY_LIFE_CYCLE_STATUS': 'Activity Life Cycle Status','END_DATE_IN_LOCAL_TIME_ZONE': 'End Date in Local Time Zone'})


### 8. Incident Tickets data

IncidentTicketdf = pd.read_excel(r"C:\Users\msalomo\OneDrive - NESTLE\Certificate Machine Learning and Data\Churn Project\Data\Incident tickets.xlsx")
IncidentTicketdf.head()

In [53]:
IncidentTicketdf = pandas_df_Repair
IncidentTicketdf.head()

Unnamed: 0,COMPLETION_DATE,INCIDENT_CATEGORY_DESCRIPTION,SERIAL_ID,COMPLETION_SLA_MET
0,2022-05-24,1.d Ingredient Other,ZA8345,False
1,2022-10-31,5.c Dispensing Area Other,ID15217,False
2,2022-09-20,2.e Hydraulic Cooling/Freezing,16O0039291,False
3,2022-11-01,14 Accessory problem(external pump..),ID20273,True
4,2022-12-23,14 Accessory problem(external pump..),ID12789,True


In [54]:
IncidentTicketdf= IncidentTicketdf.rename(columns={"COMPLETION_DATE": "Completion Date_2", "INCIDENT_CATEGORY_DESCRIPTION": "Incident Category", "SERIAL_ID": "Serial ID"})

### 9. Market specific data

UK stopped providing their service data

### 10. LOCAL DATA

Sales & Telemetron & Vendon 2021 Sales

* PakistanSales: the serial number and the manufacturer number are the same.
* RussiaSalesData: uses the manufacturer number.
* Vendon data: uses the manufacturer number.



In [55]:
PakistanSales = pd.read_excel(PakistanSales)
MalaysiaSales = pd.read_excel(MalaysiaSales)

# Drop the 'Serial' column
MalaysiaSales.drop('Serial', axis=1, inplace=True)
# Rename the 'Serial Manufacturer' column to 'Serial'
MalaysiaSales.rename(columns={'Serial Manufacturer': 'Serial'}, inplace=True)

In [56]:
PakistanSales.head()

Unnamed: 0,Serial,quantity,Month
0,20O0014321,2512.8206,2021-01-01
1,7010054243,8488.0412,2021-01-01
2,7010055066,91133.6902,2021-01-01
3,7010045635,91133.6902,2021-01-01
4,7010058209,91133.6902,2021-01-01


In [57]:
PakistanSales['Serial'] = PakistanSales['Serial'].astype(str)
MalaysiaSales['Serial'] = MalaysiaSales['Serial'].astype(str)

In [58]:
RussiaSalesData = pd.read_excel(RussiaSalesData)

IndiaSalesData = pd.read_excel(IndiaSales)

In [59]:
# Keep only the desired columns
desired_columns = ["Serial ID", "Total NNS", "Sales Month"]
# .copy() so that later column assignments do not act on a view of the original frame
IndiaSalesData = IndiaSalesData[desired_columns].copy()
IndiaSalesData.tail()

Unnamed: 0,Serial ID,Total NNS,Sales Month
240803,24O0046002,0.0,2024-11-30
240804,24O0046003,0.0,2024-11-30
240805,24O0046004,0.0,2024-11-30
240806,24O0046005,0.0,2024-11-30
240807,24O0046006,0.0,2024-11-30


In [60]:
RussiaSalesData.tail()

Unnamed: 0,Date,Machine Manufacturer Serial Number,ПРОДАЖИ (NPS)
447985,2024-12-31,20123731398,-4466.44
447986,2024-12-31,20160607018,-6764.73
447987,2024-12-31,20182838906,-9948.56
447988,2024-12-31,20170908353,-320240.48
447989,2024-12-31,20162020036,-751511.34


In [61]:
RussiaSalesData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 447990 entries, 0 to 447989
Data columns (total 3 columns):
 #   Column                              Non-Null Count   Dtype         
---  ------                              --------------   -----         
 0   Date                                447990 non-null  datetime64[ns]
 1   Machine Manufacturer Serial Number  447984 non-null  object        
 2   ПРОДАЖИ (NPS)                       447935 non-null  float64       
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 10.3+ MB


In [62]:
SouthAfricaSales = pd.read_excel(SouthAfricaSales)

In [63]:
SouthAfricaSales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5880 entries, 0 to 5879
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   AccountID  5880 non-null   int64         
 1   quantity   5880 non-null   float64       
 2   Month      5880 non-null   datetime64[ns]
dtypes: datetime64[ns](1), float64(1), int64(1)
memory usage: 137.9 KB


In [64]:
SingaporeSales = pd.read_excel(SingaporeSales)

In [65]:
SingaporeSales['Month'] = pd.to_datetime(SingaporeSales['Month'])
SingaporeSales['Serial ID'] = SingaporeSales['Serial ID'].astype(str)
IndiaSalesData['Serial ID'] = IndiaSalesData['Serial ID'].astype(str)

IndiaSalesData['Month'] = pd.to_datetime(IndiaSalesData['Sales Month'])
                                         
# Drop the "Sales Month" column
IndiaSalesData.drop('Sales Month', axis=1, inplace=True)



In [66]:
SingaporeSales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 142531 entries, 0 to 142530
Data columns (total 6 columns):
 #   Column               Non-Null Count   Dtype         
---  ------               --------------   -----         
 0   Serial ID            142531 non-null  object        
 1   Month                142531 non-null  datetime64[ns]
 2   Sales                136879 non-null  float64       
 3   Ship to              142531 non-null  object        
 4   Account ID           142530 non-null  float64       
 5   Manufacturer Number  142531 non-null  object        
dtypes: datetime64[ns](1), float64(2), object(3)
memory usage: 6.5+ MB


pip install pandasql

In [67]:
# Work on the Singapore sales DataFrame (df1 is an alias, not a copy)
df1 = SingaporeSales

print("Original DataFrame")
display(df1)

# Map the old column names to SQL-friendly names (no spaces).
# The mapping is not called 'dict', to avoid shadowing the built-in.
column_renames = {'Serial ID': 'Serial_ID',
                  'Ship to': 'Ship_to',
                  'Account ID': 'Account_ID',
                  'Manufacturer Number': 'Manufacturer_Number'}

print("\nAfter rename")
df1.rename(columns=column_renames, inplace=True)

# Display the DataFrame after renaming the columns
display(df1)

Original DataFrame


Unnamed: 0,Serial ID,Month,Sales,Ship to,Account ID,Manufacturer Number
0,SGBMB03059,2023-01-31,0.00,30885489704,3981172.0,20092213179
1,SGBMB04056,2023-01-31,,27835PA00101,3981872.0,20092414016
2,SGBMB03049,2023-01-31,0.00,30885489184,3981521.0,20092414029
3,SGBMB03772,2023-01-31,2412.35,27835KEP008A01,3982306.0,20094625024
4,SGBMB03804,2023-01-31,0.00,30885489344,3981360.0,20094625056
...,...,...,...,...,...,...
142526,24O0047574,2025-01-31,166980.20,58447875844787,9730498.0,24471901
142527,24O0047576,2025-01-31,1212.22,6767280N-2410015,9751119.0,24471903
142528,24O0047577,2025-01-31,166980.20,58447875844787,9772986.0,24471904
142529,24O0047578,2025-01-31,166980.20,58447875844787,9800079.0,24471905



After rename


Unnamed: 0,Serial_ID,Month,Sales,Ship_to,Account_ID,Manufacturer_Number
0,SGBMB03059,2023-01-31,0.00,30885489704,3981172.0,20092213179
1,SGBMB04056,2023-01-31,,27835PA00101,3981872.0,20092414016
2,SGBMB03049,2023-01-31,0.00,30885489184,3981521.0,20092414029
3,SGBMB03772,2023-01-31,2412.35,27835KEP008A01,3982306.0,20094625024
4,SGBMB03804,2023-01-31,0.00,30885489344,3981360.0,20094625056
...,...,...,...,...,...,...
142526,24O0047574,2025-01-31,166980.20,58447875844787,9730498.0,24471901
142527,24O0047576,2025-01-31,1212.22,6767280N-2410015,9751119.0,24471903
142528,24O0047577,2025-01-31,166980.20,58447875844787,9772986.0,24471904
142529,24O0047578,2025-01-31,166980.20,58447875844787,9800079.0,24471905
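
When every rename only swaps spaces for underscores, the mapping can also be derived from the column names themselves. This is a sketch on an empty toy frame with the same column names, offered as an alternative rather than the approach used above.

```python
import pandas as pd

toy_cols = pd.DataFrame(
    columns=["Serial ID", "Month", "Sales", "Ship to",
             "Account ID", "Manufacturer Number"]
)

# Replace spaces with underscores in all column names at once
toy_cols.columns = toy_cols.columns.str.replace(" ", "_", regex=False)
print(list(toy_cols.columns))
```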


In [68]:
import pandas as pd
import sqlite3

# reuse the renamed Singapore sales DataFrame
df = df1

# create an in-memory SQLite database
conn = sqlite3.connect(':memory:')

# write the DataFrame to the database
df.to_sql('my_table', con=conn)

# define the SQL query
query = '''
SELECT Month, Sales AS quantity, Ship_to, Manufacturer_Number,
       COUNT(Manufacturer_Number) OVER (PARTITION BY Month, Ship_to) AS Manufacturer_Count,
       (Sales / COUNT(Manufacturer_Number) OVER (PARTITION BY Month, Ship_to)) AS Sales_perMachine
FROM my_table
'''

# run the query using pandas
result = pd.read_sql_query(query, conn)

# print the result
print(result)

                      Month  quantity         Ship_to Manufacturer_Number  \
0       2023-01-31 00:00:00      0.00      2215122122         20103018840   
1       2023-01-31 00:00:00      0.00      2215122122         20103018859   
2       2023-01-31 00:00:00      0.00      2215122122         20102917864   
3       2023-01-31 00:00:00      0.00      2215122122         20102917875   
4       2023-01-31 00:00:00      0.00      2215122122         20103923848   
...                     ...       ...             ...                 ...   
142526  2025-01-31 00:00:00      0.00  69338116933811         20141213801   
142527  2025-01-31 00:00:00      0.00  69338116933811         20141213823   
142528  2025-01-31 00:00:00   1065.40  69768386976838         20224845590   
142529  2025-01-31 00:00:00    146.88   7534867231846         20223933588   
142530  2025-01-31 00:00:00    146.88   7534867231846         20242120272   

        Manufacturer_Count  Sales_perMachine  
0                      204  
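
The window query above can also be expressed directly in pandas. The sketch below is only an illustration on a made-up toy frame (the values and the name `toy` are assumptions, not project data): it counts the machines per `Month` and `Ship_to` with `groupby(...).transform("count")` and spreads the sales evenly over them, which is what the SQL `COUNT(...) OVER (PARTITION BY ...)` does.

```python
import pandas as pd

# Toy stand-in for the renamed Singapore frame (values are made up).
toy = pd.DataFrame({
    "Month": ["2023-01-31", "2023-01-31", "2023-01-31", "2023-02-28"],
    "Ship_to": ["A", "A", "B", "A"],
    "Manufacturer_Number": ["m1", "m2", "m3", "m1"],
    "Sales": [100.0, 100.0, 30.0, 50.0],
})

# Number of machines per (Month, Ship_to), broadcast back to every row,
# like COUNT(Manufacturer_Number) OVER (PARTITION BY Month, Ship_to).
toy["Manufacturer_Count"] = (
    toy.groupby(["Month", "Ship_to"])["Manufacturer_Number"].transform("count")
)

# Sales divided evenly over the machines sharing the same ship-to in that month.
toy["Sales_perMachine"] = toy["Sales"] / toy["Manufacturer_Count"]
```

This avoids the round trip through SQLite, at the cost of being less familiar to readers who think in SQL; both approaches should give the same per-machine values.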

In [69]:
SingaporeSales = result
SingaporeSales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 142531 entries, 0 to 142530
Data columns (total 6 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   Month                142531 non-null  object 
 1   quantity             136879 non-null  float64
 2   Ship_to              142531 non-null  object 
 3   Manufacturer_Number  142531 non-null  object 
 4   Manufacturer_Count   142531 non-null  int64  
 5   Sales_perMachine     136879 non-null  float64
dtypes: float64(2), int64(1), object(3)
memory usage: 6.5+ MB


In [70]:
# Keep the raw sales as 'quantityold' and use the per-machine sales as 'quantity'
SingaporeSales.rename(columns={'Manufacturer_Number': 'Manufacturer Number',
                               'quantity': 'quantityold',
                               'Sales_perMachine': 'quantity'}, inplace=True)




In [71]:
IndiaSalesData.rename(columns={'Total NNS': 'quantity'}, inplace=True)

In [72]:
RussiaSalesData.rename(columns={'Machine Manufacturer Serial Number': 'Serial', 'ПРОДАЖИ (NPS)': 'quantity'}, inplace=True)
RussiaSalesData

Unnamed: 0,Date,Serial,quantity
0,2023-01-31,20172526481,10374.8300
1,2023-01-31,20182128683,0.0000
2,2023-01-31,15297DU18090720333,0.0000
3,2023-01-31,20180912309,0.0000
4,2023-01-31,20170908199,5404.2677
...,...,...,...
447985,2024-12-31,20123731398,-4466.4400
447986,2024-12-31,20160607018,-6764.7300
447987,2024-12-31,20182838906,-9948.5600
447988,2024-12-31,20170908353,-320240.4800


In [73]:
RussiaSalesData['quantity'] = RussiaSalesData['quantity'].astype(float)
RussiaSalesData['Serial'] = RussiaSalesData['Serial'].astype(str)

In [74]:
RussiaSalesData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 447990 entries, 0 to 447989
Data columns (total 3 columns):
 #   Column    Non-Null Count   Dtype         
---  ------    --------------   -----         
 0   Date      447990 non-null  datetime64[ns]
 1   Serial    447990 non-null  object        
 2   quantity  447935 non-null  float64       
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 10.3+ MB


TelemetronData = pd.read_excel(r"C:\Users\msalomo\OneDrive - NESTLE\Certificate Machine Learning and Data\Churn Project\Data\Telemetron Data.xlsx")

TelemetronData.rename(columns={'Machine serial': 'serial', 'Total': 'quantity'}, inplace=True)

In [75]:
PakistanSales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 287041 entries, 0 to 287040
Data columns (total 3 columns):
 #   Column    Non-Null Count   Dtype         
---  ------    --------------   -----         
 0   Serial    287041 non-null  object        
 1   quantity  287041 non-null  float64       
 2   Month     287041 non-null  datetime64[ns]
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 6.6+ MB


VendonData = pd.read_excel(r"C:\Users\msalomo\Churn Project\Data\Telemetry2021.xlsx")

VendonData.head()

In [76]:
# Perform the aggregation on 'quantity' grouped by 'Serial'
Pakistan_aggSales = PakistanSales.groupby('Serial')['quantity'].sum().reset_index()

# Perform the aggregation on 'quantity' grouped by 'Serial'
Malaysia_aggSales = MalaysiaSales['quantity'].groupby(MalaysiaSales['Serial']).sum().reset_index()

India_aggSales = IndiaSalesData['quantity'].groupby(IndiaSalesData['Serial ID']).sum().reset_index()

# Print
print(India_aggSales)

      Serial ID       quantity
0      10010737   93785.068531
1      10010738   62524.483809
2      10010770       0.000000
3      10010772       0.000000
4      10010776   44331.245300
...         ...            ...
24230  CWIP1220  123114.126170
24231  CWIP1221       0.000000
24232  CWIP1222       0.000000
24233  CWIP1223    3050.932203
24234  CWIP1224  210454.575476

[24235 rows x 2 columns]


Telemetron_agg = TelemetronData['quantity'].groupby(TelemetronData['serial'], axis=0).sum()
Telemetron_agg = Telemetron_agg.reset_index()
Telemetron_agg

In [77]:
# Total 'quantity' per 'Serial' over the whole period
RussiaSalesData_agg = RussiaSalesData.groupby('Serial')['quantity'].sum().reset_index()
RussiaSalesData_agg

Unnamed: 0,Serial,quantity
0,#,3.666028e+07
1,#Н/Д,9.457055e+03
2,00001428-0011,0.000000e+00
3,00001429-0001,2.587372e+04
4,00001429-0004,0.000000e+00
...,...,...
42794,ХК0114,0.000000e+00
42795,ХК0115,0.000000e+00
42796,ХК0116,0.000000e+00
42797,ХК0120,0.000000e+00


In [78]:
# Total 'quantity' per 'AccountID' over the whole period
SouthAfrica_aggSales = SouthAfricaSales.groupby('AccountID')['quantity'].sum().reset_index()
SouthAfrica_aggSales

Unnamed: 0,AccountID,quantity
0,365014,1.28
1,365018,-2039.00
2,366680,1068.01
3,366935,13691.11
4,367418,184597.82
...,...,...
650,7317156,7284.40
651,7340045,12686.06
652,7341159,3852.80
653,7347170,25374.10


In [79]:
# Total 'quantity' per 'Manufacturer Number' over the whole period
Singapore_aggSales = SingaporeSales.groupby('Manufacturer Number')['quantity'].sum().reset_index()
Singapore_aggSales

Unnamed: 0,Manufacturer Number,quantity
0,141982300000383201,5158.605000
1,141982300001083201,5389.800000
2,141982300003083201,18889.012298
3,141982300003283201,3900.230000
4,141992300000483201,5186.680000
...,...,...
6669,ZEBA0072,30484.776439
6670,ZEBA0073,17973.045567
6671,ZEBA0074,21128.741822
6672,ZEBA0075,22971.646956


#Vendon_agg = VendonData['quantity'].groupby(VendonData['serial'], axis=0).sum()

Vendon_agg = (VendonData.sort_values('Month')
    .groupby(["serial"])
                      .agg({'SalesOrg' : lambda s: s.values[-1],
                            'quantity' : 'sum'}))

Vendon_agg = Vendon_agg.reset_index()
Vendon_agg

PakistanSales = PakistanSales.loc[PakistanSales['Month']>=PakistanDateRangeStart]

VendonData = VendonData.loc[VendonData['Month']>=VendonDateRangeStart]

I will aggregate the cup sales of each machine by its serial number (the feature called 'serial'), which corresponds to the feature 'Manufacturer Number' in the Beverage Machine data.
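
As a rough illustration of this aggregation, the sketch below sums the sales per month and machine, keeps the last N months of each machine, and divides by N. The helper name `trailing_monthly_avg` and the toy data are hypothetical. Like the `x.tail(n).sum() / n` pattern used in this notebook, it divides by N even when a machine has fewer than N months of history.

```python
import pandas as pd

def trailing_monthly_avg(sales, key, n_months, month_col="Month", value_col="quantity"):
    """Average of the last n_months monthly totals per key (hypothetical helper)."""
    # One row per (month, machine) with the summed quantity.
    monthly = sales.groupby([month_col, key], as_index=False)[value_col].sum()
    # Keep the n most recent months of each machine.
    last_n = monthly.sort_values(month_col).groupby(key).tail(n_months)
    # Divide by n_months even if fewer months are available.
    return (last_n.groupby(key)[value_col].sum() / n_months).reset_index()

# Toy data: machine A has four months of sales, machine B only one.
toy_sales = pd.DataFrame({
    "Month": pd.to_datetime(
        ["2024-01-31", "2024-02-29", "2024-03-31", "2024-04-30", "2024-03-31"]
    ),
    "Serial": ["A", "A", "A", "A", "B"],
    "quantity": [10.0, 20.0, 30.0, 60.0, 9.0],
})

three_months = trailing_monthly_avg(toy_sales, "Serial", 3)
```

For machine B the three-month value is 9 / 3 = 3, not 9, so a recently installed machine looks like a low seller. Whether that is desirable for churn prediction is a modelling choice.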

In [80]:

PakistanSales_df1 = PakistanSales.groupby(['Month', 'Serial']).sum()
PakistanSales_df1 = PakistanSales_df1.reset_index()


PakistanSales_one_month = PakistanSales_df1.sort_values('Month').groupby('Serial').agg({'quantity' : lambda x: x.tail(1).sum()})

PakistanSales_three_months = PakistanSales_df1.sort_values('Month').groupby('Serial').agg({'quantity' : lambda x: x.tail(3).sum()/3})

PakistanSales_six_months = PakistanSales_df1.sort_values('Month').groupby('Serial').agg({'quantity' : lambda x: x.tail(6).sum()/6})

# Group by 'Month' and 'Serial' and calculate the sum
MalaysiaSales_df1 = MalaysiaSales.groupby(['Month', 'Serial']).sum().reset_index()
# Calculate the sum of 'quantity' for the latest month for each 'Serial'
MalaysiaSales_one_month = MalaysiaSales_df1.sort_values('Month').groupby('Serial').agg({'quantity': lambda x: x.tail(1).sum()})
# Calculate the average of 'quantity' for the last three months for each 'Serial'
MalaysiaSales_three_months = MalaysiaSales_df1.sort_values('Month').groupby('Serial').agg({'quantity': lambda x: x.tail(3).sum() / 3})
# Calculate the average of 'quantity' for the last six months for each 'Serial'
MalaysiaSales_six_months = MalaysiaSales_df1.sort_values('Month').groupby('Serial').agg({'quantity': lambda x: x.tail(6).sum() / 6})

# Group by 'Month' and 'Serial ID' and calculate the sum
IndiaSales_df1 = IndiaSalesData.groupby(['Month', 'Serial ID']).sum().reset_index()

# Calculate the sum of 'quantity' for the latest month for each 'Serial ID'
IndiaSales_one_month = IndiaSales_df1.sort_values('Month').groupby('Serial ID').agg({'quantity': lambda x: x.tail(1).sum()})

# Calculate the average of 'quantity' for the last three months for each 'Serial ID'
IndiaSales_three_months = IndiaSales_df1.sort_values('Month').groupby('Serial ID').agg({'quantity': lambda x: x.tail(3).sum() / 3})

# Calculate the average of 'quantity' for the last six months for each 'Serial ID'
IndiaSales_six_months = IndiaSales_df1.sort_values('Month').groupby('Serial ID').agg({'quantity': lambda x: x.tail(6).sum() / 6})

# Print the resulting DataFrames
print(IndiaSales_one_month)

              quantity
Serial ID             
10010737      0.000000
10010738      0.000000
10010770      0.000000
10010772      0.000000
10010776   5066.458812
...                ...
CWIP1220   9809.949214
CWIP1221      0.000000
CWIP1222      0.000000
CWIP1223    338.983051
CWIP1224   1694.915254

[24235 rows x 1 columns]


In [81]:
PakistanSales_one_month = PakistanSales_one_month.reset_index()
PakistanSales_three_months = PakistanSales_three_months.reset_index()
PakistanSales_six_months = PakistanSales_six_months.reset_index()

# Reset the index for 'MalaysiaSales_one_month'
MalaysiaSales_one_month = MalaysiaSales_one_month.reset_index()
# Reset the index for 'MalaysiaSales_three_months'
MalaysiaSales_three_months = MalaysiaSales_three_months.reset_index()
# Reset the index for 'MalaysiaSales_six_months'
MalaysiaSales_six_months = MalaysiaSales_six_months.reset_index()

# Reset the index for 'IndiaSales_one_month'
IndiaSales_one_month = IndiaSales_one_month.reset_index()

# Reset the index for 'IndiaSales_three_months'
IndiaSales_three_months = IndiaSales_three_months.reset_index()

# Reset the index for 'IndiaSales_six_months'
IndiaSales_six_months = IndiaSales_six_months.reset_index()

# Print the 'IndiaSales_three_months' DataFrame
print(IndiaSales_three_months)

      Serial ID     quantity
0      10010737     0.000000
1      10010738     0.000000
2      10010770     0.000000
3      10010772     0.000000
4      10010776  3887.710884
...         ...          ...
24230  CWIP1220  7898.797140
24231  CWIP1221     0.000000
24232  CWIP1222     0.000000
24233  CWIP1223   338.983051
24234  CWIP1224  4181.708785

[24235 rows x 2 columns]


from dateutil.relativedelta import relativedelta

one_month_pak = PakistanLastUpdate + relativedelta(months=-1)
three_months_pak = PakistanLastUpdate + relativedelta(months=-3)
six_months_pak = PakistanLastUpdate + relativedelta(months=-6)

PakistanSales_one_month = PakistanSales.loc[PakistanSales['Month']>one_month_pak]
PakistanSales_three_months = PakistanSales.loc[PakistanSales['Month']>three_months_pak]
PakistanSales_six_months = PakistanSales.loc[PakistanSales['Month']>six_months_pak]

VendonData_one_month = VendonData.loc[VendonData['Month']>one_month_pak]
VendonData_three_months = VendonData.loc[VendonData['Month']>three_months_pak]
VendonData_six_months = VendonData.loc[VendonData['Month']>six_months_pak]

PakistanSales_one_month.loc[PakistanSales_one_month['serial'] != '70010058920']

PakistanSales_one_month_avg = PakistanSales_one_month['quantity'].groupby(PakistanSales_one_month['serial'], axis=0).sum()

PakistanSales_one_month_avg = PakistanSales_one_month_avg.to_frame().reset_index()
PakistanSales_one_month_avg = PakistanSales_one_month_avg.rename(columns={"quantity": "one_Month_avg"})

In [82]:
#PakistanSales_one_month_avg = PakistanSales_one_month['quantity'].groupby(PakistanSales_one_month['serial'], axis=0).sum()
#PakistanSales_one_month_avg = PakistanSales_one_month_avg.to_frame().reset_index()
PakistanSales_one_month_avg = PakistanSales_one_month.rename(columns={"quantity": "Sales_one_Month_avg"})

#PakistanSales_three_months_avg = PakistanSales_three_months['quantity'].groupby(PakistanSales_three_months['serial'], axis=0).sum()
#PakistanSales_three_months_avg = PakistanSales_three_months_avg.to_frame().reset_index()
PakistanSales_three_months_avg = PakistanSales_three_months.rename(columns={"quantity": "Sales_three_months_avg"})

#PakistanSales_six_months_avg = PakistanSales_six_months['quantity'].groupby(PakistanSales_six_months['serial'], axis=0).sum()
#PakistanSales_six_months_avg = PakistanSales_six_months_avg.to_frame().reset_index()
PakistanSales_six_months_avg = PakistanSales_six_months.rename(columns={"quantity": "Sales_six_months_avg"})

PakistanSales_three_months_avg

# Rename the column in 'MalaysiaSales_one_month'
MalaysiaSales_one_month_avg = MalaysiaSales_one_month.rename(columns={"quantity": "Sales_one_Month_avg"})
# Rename the column in 'MalaysiaSales_three_months'
MalaysiaSales_three_months_avg = MalaysiaSales_three_months.rename(columns={"quantity": "Sales_three_months_avg"})
# Rename the column in 'MalaysiaSales_six_months'
MalaysiaSales_six_months_avg = MalaysiaSales_six_months.rename(columns={"quantity": "Sales_six_months_avg"})

# Rename the column in 'IndiaSales_one_month'
IndiaSales_one_month_avg = IndiaSales_one_month.rename(columns={"quantity": "Sales_one_Month_avg"})

# Rename the column in 'IndiaSales_three_months'
IndiaSales_three_months_avg = IndiaSales_three_months.rename(columns={"quantity": "Sales_three_months_avg"})

# Rename the column in 'IndiaSales_six_months'
IndiaSales_six_months_avg = IndiaSales_six_months.rename(columns={"quantity": "Sales_six_months_avg"})

# Print the 'IndiaSales_three_months_avg' DataFrame
print(IndiaSales_three_months_avg)

      Serial ID  Sales_three_months_avg
0      10010737                0.000000
1      10010738                0.000000
2      10010770                0.000000
3      10010772                0.000000
4      10010776             3887.710884
...         ...                     ...
24230  CWIP1220             7898.797140
24231  CWIP1221                0.000000
24232  CWIP1222                0.000000
24233  CWIP1223              338.983051
24234  CWIP1224             4181.708785

[24235 rows x 2 columns]


In [83]:
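# Group by 'Month' and 'AccountID' and calculate the sum
SouthAfricaSales_df1 = SouthAfricaSales.groupby(['Month', 'AccountID']).sum()
# Reset the index of the grouped result (not of the raw frame)
SouthAfricaSales_df1 = SouthAfricaSales_df1.reset_index()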


SouthAfricaSales_one_month = SouthAfricaSales_df1.sort_values('Month').groupby('AccountID').agg({'quantity' : lambda x: x.tail(1).sum()})

SouthAfricaSales_three_months = SouthAfricaSales_df1.sort_values('Month').groupby('AccountID').agg({'quantity' : lambda x: x.tail(3).sum()/3})

SouthAfricaSales_six_months = SouthAfricaSales_df1.sort_values('Month').groupby('AccountID').agg({'quantity' : lambda x: x.tail(6).sum()/6})


SouthAfricaSales_one_month = SouthAfricaSales_one_month.reset_index()
SouthAfricaSales_three_months = SouthAfricaSales_three_months.reset_index()
SouthAfricaSales_six_months = SouthAfricaSales_six_months.reset_index()
SouthAfricaSales_three_months


SouthAfricaSales_one_month_avg = SouthAfricaSales_one_month.rename(columns={"quantity": "Sales_one_Month_avg"})


SouthAfricaSales_three_months_avg = SouthAfricaSales_three_months.rename(columns={"quantity": "Sales_three_months_avg"})


SouthAfricaSales_six_months_avg = SouthAfricaSales_six_months.rename(columns={"quantity": "Sales_six_months_avg"})

SouthAfricaSales_three_months_avg

Unnamed: 0,AccountID,Sales_three_months_avg
0,365014,0.086667
1,365018,-693.216667
2,366680,356.003333
3,366935,4563.703333
4,367418,6821.496667
...,...,...
650,7317156,2428.133333
651,7340045,4228.686667
652,7341159,1284.266667
653,7347170,8458.033333


In [84]:
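# Group by 'Month' and 'Manufacturer Number' and calculate the sum
SingaporeSales_df1 = SingaporeSales.groupby(['Month', 'Manufacturer Number']).sum()
# Reset the index of the grouped result (not of the raw frame)
SingaporeSales_df1 = SingaporeSales_df1.reset_index()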


SingaporeSales_one_month = SingaporeSales_df1.sort_values('Month').groupby('Manufacturer Number').agg({'quantity' : lambda x: x.tail(1).sum()})

SingaporeSales_three_months = SingaporeSales_df1.sort_values('Month').groupby('Manufacturer Number').agg({'quantity' : lambda x: x.tail(3).sum()/3})

SingaporeSales_six_months = SingaporeSales_df1.sort_values('Month').groupby('Manufacturer Number').agg({'quantity' : lambda x: x.tail(6).sum()/6})


SingaporeSales_one_month = SingaporeSales_one_month.reset_index()
SingaporeSales_three_months = SingaporeSales_three_months.reset_index()
SingaporeSales_six_months = SingaporeSales_six_months.reset_index()
SingaporeSales_three_months


SingaporeSales_one_month_avg = SingaporeSales_one_month.rename(columns={"quantity": "Sales_one_Month_avg"})

SingaporeSales_three_months_avg = SingaporeSales_three_months.rename(columns={"quantity": "Sales_three_months_avg"})

SingaporeSales_six_months_avg = SingaporeSales_six_months.rename(columns={"quantity": "Sales_six_months_avg"})




TelemetronData_df1 = TelemetronData.groupby(['Month', 'serial']).sum()
TelemetronData_df1 = TelemetronData_df1.reset_index()

TelemetronData_one_month = TelemetronData_df1.sort_values('Month').groupby('serial').agg({'quantity' : lambda x: x.tail(1).sum()})

TelemetronData_three_months = TelemetronData_df1.sort_values('Month').groupby('serial').agg({'quantity' : lambda x: x.tail(3).sum()/3})

TelemetronData_six_months = TelemetronData_df1.sort_values('Month').groupby('serial').agg({'quantity' : lambda x: x.tail(6).sum()/6})

TelemetronData_one_month = TelemetronData_one_month.reset_index()
TelemetronData_three_months = TelemetronData_three_months.reset_index()
TelemetronData_six_months = TelemetronData_six_months.reset_index()
TelemetronData_three_months

#TelemetronData_one_month_avg = TelemetronData_one_month['quantity'].groupby(TelemetronData_one_month['serial'], axis=0).sum()
#TelemetronData_one_month_avg = TelemetronData_one_month_avg.to_frame().reset_index()
TelemetronData_one_month_avg = TelemetronData_one_month.rename(columns={"quantity": "one_month_avg"})

#TelemetronData_three_months_avg = TelemetronData_three_months['quantity'].groupby(TelemetronData_three_months['serial'], axis=0).sum()
#TelemetronData_three_months_avg = TelemetronData_three_months_avg.to_frame().reset_index()
TelemetronData_three_months_avg = TelemetronData_three_months.rename(columns={"quantity": "three_months_avg"})

#TelemetronData_six_months_avg = TelemetronData_six_months['quantity'].groupby(TelemetronData_six_months['serial'], axis=0).sum()
#TelemetronData_six_months_avg = TelemetronData_six_months_avg.to_frame().reset_index()
TelemetronData_six_months_avg = TelemetronData_six_months.rename(columns={"quantity": "six_months_avg"})

TelemetronData_three_months_avg

In [85]:
RussiaSalesData_df1 = RussiaSalesData.groupby(['Date', 'Serial']).sum()
RussiaSalesData_df1 = RussiaSalesData_df1.reset_index()

RussiaSalesData_one_month = RussiaSalesData_df1.sort_values('Date').groupby('Serial').agg({'quantity' : lambda x: x.tail(1).sum()})

RussiaSalesData_three_months = RussiaSalesData_df1.sort_values('Date').groupby('Serial').agg({'quantity' : lambda x: x.tail(3).sum()/3})

RussiaSalesData_six_months = RussiaSalesData_df1.sort_values('Date').groupby('Serial').agg({'quantity' : lambda x: x.tail(6).sum()/6})

RussiaSalesData_one_month = RussiaSalesData_one_month.reset_index()
RussiaSalesData_three_months = RussiaSalesData_three_months.reset_index()
RussiaSalesData_six_months = RussiaSalesData_six_months.reset_index()

RussiaSalesData_one_month_avg = RussiaSalesData_one_month.rename(columns={"quantity": "Sales_one_Month_avg"})

RussiaSalesData_three_months_avg = RussiaSalesData_three_months.rename(columns={"quantity": "Sales_three_months_avg"})

RussiaSalesData_six_months_avg = RussiaSalesData_six_months.rename(columns={"quantity": "Sales_six_months_avg"})



TelemetronData_one_month_avg = TelemetronData_one_month['quantity'].groupby(TelemetronData_one_month['serial'], axis=0).sum()
TelemetronData_one_month_avg = TelemetronData_one_month_avg.to_frame().reset_index()
TelemetronData_one_month_avg = TelemetronData_one_month_avg.rename(columns={"quantity": "one_Month_avg"})

TelemetronData_three_months_avg = TelemetronData_three_months['quantity'].groupby(TelemetronData_three_months['serial'], axis=0).sum()
TelemetronData_three_months_avg = TelemetronData_three_months_avg.to_frame().reset_index()
TelemetronData_three_months_avg = TelemetronData_three_months_avg.rename(columns={"quantity": "three_months_avg"})

TelemetronData_six_months_avg = TelemetronData_six_months['quantity'].groupby(TelemetronData_six_months['serial'], axis=0).sum()
TelemetronData_six_months_avg = TelemetronData_six_months_avg.to_frame().reset_index()
TelemetronData_six_months_avg = TelemetronData_six_months_avg.rename(columns={"quantity": "six_months_avg"})

TelemetronData_three_months_avg

VendonData_one_month_avg = VendonData_one_month['quantity'].groupby(VendonData_one_month['serial'], axis=0).sum()
VendonData_one_month_avg = VendonData_one_month_avg.to_frame().reset_index()
VendonData_one_month_avg = VendonData_one_month_avg.rename(columns={"quantity": "one_Month_avg"})

VendonData_three_months_avg = VendonData_three_months['quantity'].groupby(VendonData_three_months['serial'], axis=0).sum()
VendonData_three_months_avg = VendonData_three_months_avg.to_frame().reset_index()
VendonData_three_months_avg = VendonData_three_months_avg.rename(columns={"quantity": "three_months_avg"})

VendonData_six_months_avg = VendonData_six_months['quantity'].groupby(VendonData_six_months['serial'], axis=0).sum()
VendonData_six_months_avg = VendonData_six_months_avg.to_frame().reset_index()
VendonData_six_months_avg = VendonData_six_months_avg.rename(columns={"quantity": "six_months_avg"})

VendonData_three_months_avg

#already done with change of code
PakistanSales_three_months_avg['three_months_avg'] = PakistanSales_three_months_avg['Sales_three_months_avg'].apply(lambda x: x/3)

PakistanSales_six_months_avg['six_months_avg'] = PakistanSales_six_months_avg['Sales_six_months_avg'].apply(lambda x: x/6)


In [86]:
# Merge the DataFrames based on the 'Serial' column
PakistanSales_df = pd.merge(Pakistan_aggSales, PakistanSales_one_month_avg, how='left', left_on='Serial', right_on='Serial')
# Print the head of the merged DataFrame
print(PakistanSales_df.head())

        Serial     quantity  Sales_one_Month_avg
0  10010063319   60245.1569             15412.08
1   2000014136  689624.6980             58078.43
2   2000014290   42356.1536              5000.00
3   2000014292  104124.6992             17440.00
4   2000014293  415646.1472             38800.00


In [87]:
PakistanSales_df2 = pd.merge(PakistanSales_df, PakistanSales_three_months_avg, how='left', left_on = ['Serial'], right_on = ['Serial'])
PakistanSales_df3 = pd.merge(PakistanSales_df2, PakistanSales_six_months_avg, how='left', left_on = ['Serial'], right_on = ['Serial'])
PakistanSales_df3 = PakistanSales_df3.fillna(0)

# Merge the DataFrames based on the 'Serial' column
MalaysiaSales_df1 = pd.merge(Malaysia_aggSales, MalaysiaSales_one_month_avg, how='left', left_on='Serial', right_on='Serial')
MalaysiaSales_df2 = pd.merge(MalaysiaSales_df1, MalaysiaSales_three_months_avg, how='left', left_on='Serial', right_on='Serial')
MalaysiaSales_df3 = pd.merge(MalaysiaSales_df2, MalaysiaSales_six_months_avg, how='left', left_on='Serial', right_on='Serial')
# Fill any missing values with 0
MalaysiaSales_df3 = MalaysiaSales_df3.fillna(0)


# Merge the DataFrames based on the 'Serial ID' column
IndiaSales_df1 = pd.merge(India_aggSales, IndiaSales_one_month_avg, how='left', left_on='Serial ID', right_on='Serial ID')
IndiaSales_df2 = pd.merge(IndiaSales_df1, IndiaSales_three_months_avg, how='left', left_on='Serial ID', right_on='Serial ID')
IndiaSales_df3 = pd.merge(IndiaSales_df2, IndiaSales_six_months_avg, how='left', left_on='Serial ID', right_on='Serial ID')

# Fill any missing values with 0
IndiaSales_df3 = IndiaSales_df3.fillna(0)

# Print the head of the merged DataFrame
print(IndiaSales_df3.head(30))

   Serial ID       quantity  Sales_one_Month_avg  Sales_three_months_avg  \
0   10010737   93785.068531             0.000000                0.000000   
1   10010738   62524.483809             0.000000                0.000000   
2   10010770       0.000000             0.000000                0.000000   
3   10010772       0.000000             0.000000                0.000000   
4   10010776   44331.245300          5066.458812             3887.710884   
5   10010778       0.000000             0.000000                0.000000   
6   10010799   36924.953375          3573.395337             3176.351411   
7   10010800   36924.953375          3573.395337             3176.351411   
8   10011102       0.000000             0.000000                0.000000   
9   10011124   19621.713884          1367.430500             1845.463909   
10  10011126  158960.003263             0.000000            12577.454160   
11  10011135       0.000000             0.000000                0.000000   
12  10011145

In [88]:
IndiaSales_df3.info() 

<class 'pandas.core.frame.DataFrame'>
Int64Index: 24235 entries, 0 to 24234
Data columns (total 5 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Serial ID               24235 non-null  object 
 1   quantity                24235 non-null  float64
 2   Sales_one_Month_avg     24235 non-null  float64
 3   Sales_three_months_avg  24235 non-null  float64
 4   Sales_six_months_avg    24235 non-null  float64
dtypes: float64(4), object(1)
memory usage: 1.1+ MB
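
The same three-step merge is repeated for every market. A minimal sketch of how it could be factored out is shown below; the helper name `left_merge_all` and the toy frames are hypothetical, and it assumes that every frame shares the same key column.

```python
from functools import reduce

import pandas as pd

def left_merge_all(frames, key):
    """Left-merge a list of frames on `key`, then fill the gaps with 0 (hypothetical helper)."""
    merged = reduce(
        lambda left, right: pd.merge(left, right, how="left", on=key),
        frames,
    )
    return merged.fillna(0)

# Toy frames standing in for the total, one-month and three-month tables.
total = pd.DataFrame({"Serial": ["A", "B"], "quantity": [100.0, 40.0]})
one = pd.DataFrame({"Serial": ["A"], "Sales_one_Month_avg": [10.0]})
three = pd.DataFrame({"Serial": ["A", "B"], "Sales_three_months_avg": [12.0, 5.0]})

features = left_merge_all([total, one, three], "Serial")
```

Because the first frame drives the left join, machines that are missing from a later frame keep their row and get 0 for that feature, which is the same behaviour as the chained merges followed by `fillna(0)`.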


#TelemetronData_three_months_avg['three_months_avg'] = TelemetronData_three_months_avg['three_months_avg'].apply(lambda x: x/3)

#TelemetronData_six_months_avg['six_months_avg'] = TelemetronData_six_months_avg['six_months_avg'].apply(lambda x: x/6)

TelemetronData_df = pd.merge(Telemetron_agg, TelemetronData_one_month_avg, how='left', left_on = ['serial'], right_on = ['serial'])
TelemetronData_df.head()

TelemetronData_df2 = pd.merge(TelemetronData_df, TelemetronData_three_months_avg, how='left', left_on = ['serial'], right_on = ['serial'])
TelemetronData_df3 = pd.merge(TelemetronData_df2, TelemetronData_six_months_avg, how='left', left_on = ['serial'], right_on = ['serial'])
TelemetronData_df3 = TelemetronData_df3.fillna(0)


In [89]:
RussiaSalesData_df = pd.merge(RussiaSalesData_agg, RussiaSalesData_one_month_avg, how='left', left_on = ['Serial'], right_on = ['Serial'])
RussiaSalesData_df.head()

RussiaSalesData_df2 = pd.merge(RussiaSalesData_df, RussiaSalesData_three_months_avg, how='left', left_on = ['Serial'], right_on = ['Serial'])
RussiaSalesData_df3 = pd.merge(RussiaSalesData_df2, RussiaSalesData_six_months_avg, how='left', left_on = ['Serial'], right_on = ['Serial'])
RussiaSalesData_df3 = RussiaSalesData_df3.fillna(0)


VendonData_three_months_avg['three_months_avg'] = VendonData_three_months_avg['three_months_avg'].apply(lambda x: x/3)

VendonData_six_months_avg['six_months_avg'] = VendonData_six_months_avg['six_months_avg'].apply(lambda x: x/6)

VendonData_df = pd.merge(Vendon_agg, VendonData_one_month_avg, how='left', left_on = ['serial'], right_on = ['serial'])

VendonData_df2 = pd.merge(VendonData_df, VendonData_three_months_avg, how='left', left_on = ['serial'], right_on = ['serial'])
VendonData_df3 = pd.merge(VendonData_df2, VendonData_six_months_avg, how='left', left_on = ['serial'], right_on = ['serial'])
VendonData_df3 = VendonData_df3.fillna(0)
VendonData_df3

Add a key that combines the manufacturer number and the sales organization
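
A small sketch of how such a key can be built is given below. The helper name `add_market_key` is hypothetical and the serial number is only an example value. The point of the key is that the same serial number in two markets ends up with two different keys.

```python
import pandas as pd

def add_market_key(df, serial_col, sales_org):
    """Append the sales organization to the serial number (hypothetical helper)."""
    out = df.copy()
    out["KeyManufNo_SalesOrg"] = out[serial_col].astype(str) + sales_org
    return out

# The same serial number in two markets gives two distinct keys.
pak = add_market_key(pd.DataFrame({"Serial": ["2000014136"]}), "Serial", "Pakistan")
sgp = add_market_key(pd.DataFrame({"Serial": ["2000014136"]}), "Serial", "Singapore")

# Caveat: a numeric ID stored as float keeps its '.0' when cast to str.
float_ids = add_market_key(pd.DataFrame({"Serial": [365014.0]}), "Serial", "Pakistan")
```

If an identifier column was read as float, it is probably safer to convert it to an integer or a clean string before building the key. Otherwise the key will not match the one built from the Beverage Machine data.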

In [90]:
SouthAfricaSales_df = pd.merge(SouthAfrica_aggSales, SouthAfricaSales_one_month_avg, how='left', left_on = ['AccountID'], right_on = ['AccountID'])
SouthAfricaSales_df.head()

SouthAfricaSales_df2 = pd.merge(SouthAfricaSales_df, SouthAfricaSales_three_months_avg, how='left', left_on = ['AccountID'], right_on = ['AccountID'])
SouthAfricaSales_df3 = pd.merge(SouthAfricaSales_df2, SouthAfricaSales_six_months_avg, how='left', left_on = ['AccountID'], right_on = ['AccountID'])
SouthAfricaSales_df3 = SouthAfricaSales_df3.fillna(0)
SouthAfricaSales_df3.head()

Unnamed: 0,AccountID,quantity,Sales_one_Month_avg,Sales_three_months_avg,Sales_six_months_avg
0,365014,1.28,0.02,0.086667,0.116667
1,365018,-2039.0,0.2,-693.216667,-346.58
2,366680,1068.01,1068.01,356.003333,178.001667
3,366935,13691.11,8376.59,4563.703333,2281.851667
4,367418,184597.82,6095.63,6821.496667,7681.706667


In [91]:
SingaporeSales_df = pd.merge(Singapore_aggSales, SingaporeSales_one_month_avg, how='left', left_on = ['Manufacturer Number'], right_on = ['Manufacturer Number'])
SingaporeSales_df2 = pd.merge(SingaporeSales_df, SingaporeSales_three_months_avg, how='left', left_on = ['Manufacturer Number'], right_on = ['Manufacturer Number'])
SingaporeSales_df3 = pd.merge(SingaporeSales_df2, SingaporeSales_six_months_avg, how='left', left_on = ['Manufacturer Number'], right_on = ['Manufacturer Number'])
SingaporeSales_df3 = SingaporeSales_df3.fillna(0)
SingaporeSales_df3.head(30)

Unnamed: 0,Manufacturer Number,quantity,Sales_one_Month_avg,Sales_three_months_avg,Sales_six_months_avg
0,141982300000383201,5158.605,216.005,112.618333,147.675833
1,141982300001083201,5389.8,290.0,132.853333,196.265833
2,141982300003083201,18889.012298,885.32,490.555,782.200833
3,141982300003283201,3900.23,135.93,45.31,69.675
4,141992300000483201,5186.68,175.32,556.02,380.28
5,141992300000783201,34871.232953,1997.012353,668.628966,1155.639871
6,141992300000983201,2111.53,43.83,29.22,69.433333
7,141992300001583201,9497.62,540.44,412.086667,400.8
8,141992300001683201,780.71,95.075,72.158333,82.6325
9,141992300003283201,9119.48,262.98,1566.433333,856.266667


In [92]:
PakistanSales_df3['KeyManufNo_SalesOrg'] = PakistanSales_df3['Serial'].astype(str) + 'Pakistan' 

# Create the new column by combining 'Serial' with 'Malaysia'
MalaysiaSales_df3['KeyManufNo_SalesOrg'] = MalaysiaSales_df3['Serial'].astype(str) + 'Malaysia'

# Create the new column by combining 'Serial ID' with 'India'
IndiaSales_df3['KeyManufNo_SalesOrg'] = IndiaSales_df3['Serial ID'].astype(str) + 'Nestlé India'

Not used yet in Vendon to differentiate markets

TelemetronData_df3['KeyManufNo_SalesOrg'] = TelemetronData_df3['serial'].astype(str) + 'Nestlé Russia'

In [93]:
RussiaSalesData_df3['KeyManufNo_SalesOrg'] = RussiaSalesData_df3['Serial'].astype(str) + 'Nestlé Russia'


In [94]:
SouthAfricaSales_df3['KeyManufNo_SalesOrg'] = SouthAfricaSales_df3['AccountID'].astype(str) + 'Nestle South Africa' 

Rename the South Africa 'AccountID' column to 'Serial', since the work to obtain the AccountID was already done earlier


In [95]:
SouthAfricaSales_df4 = SouthAfricaSales_df3.rename(columns = {'AccountID':'Serial'})

In [96]:
SingaporeSales_df3['KeyManufNo_SalesOrg'] = SingaporeSales_df3['Manufacturer Number'].astype(str) + 'Singapore'

In [97]:
SingaporeSales_df4 = SingaporeSales_df3.rename(columns = {'Manufacturer Number':'Serial'})

IndiaSales_df4 = IndiaSales_df3.rename(columns = {'Serial ID':'Serial'})

In [98]:
IndiaSales_df4.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 24235 entries, 0 to 24234
Data columns (total 6 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Serial                  24235 non-null  object 
 1   quantity                24235 non-null  float64
 2   Sales_one_Month_avg     24235 non-null  float64
 3   Sales_three_months_avg  24235 non-null  float64
 4   Sales_six_months_avg    24235 non-null  float64
 5   KeyManufNo_SalesOrg     24235 non-null  object 
dtypes: float64(4), object(2)
memory usage: 1.3+ MB


VendonData_df3['KeyManufNo_SalesOrg'] = VendonData_df3['serial'].astype(str) + VendonData_df3['SalesOrg'].astype(str)

VendonData_df3=VendonData_df3.drop(columns=['SalesOrg'])
VendonData_df3.head()

In [99]:
Concat_Sales = pd.concat([RussiaSalesData_df3, PakistanSales_df3, SouthAfricaSales_df4, SingaporeSales_df4, MalaysiaSales_df3, IndiaSales_df4])
Concat_Sales.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 107869 entries, 0 to 24234
Data columns (total 6 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   Serial                  107869 non-null  object 
 1   quantity                107869 non-null  float64
 2   Sales_one_Month_avg     107869 non-null  float64
 3   Sales_three_months_avg  107869 non-null  float64
 4   Sales_six_months_avg    107869 non-null  float64
 5   KeyManufNo_SalesOrg     107869 non-null  object 
dtypes: float64(4), object(2)
memory usage: 5.8+ MB


In [100]:
Concat_Sales['(lst_mth-6mth)/6mth'] = Concat_Sales.apply(lambda x: 0 if x['Sales_six_months_avg'] <= 0 else (x['Sales_one_Month_avg']-x['Sales_six_months_avg'])/x['Sales_six_months_avg'], axis=1)

Concat_Sales['3mth-6mth)/6mth'] = Concat_Sales.apply(lambda x: 0 if x['Sales_six_months_avg'] <= 0 else (x['Sales_three_months_avg']-x['Sales_six_months_avg'])/x['Sales_six_months_avg'], axis=1)
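
The same two ratios can also be computed without a row-wise `apply`, which tends to be slow on a frame of this size. The following is a minimal sketch on toy data. `relative_change` is a hypothetical helper name, and the column names simply mirror the ones used above.

```python
import pandas as pd

def relative_change(recent, baseline):
    """(recent - baseline) / baseline, forced to 0 where baseline <= 0."""
    safe = baseline.where(baseline > 0)      # NaN where the baseline is <= 0
    change = (recent - safe) / safe          # NaN on those rows, no division by zero
    return change.mask(baseline <= 0, 0.0)   # same fallback value as the lambda

# Toy rows standing in for Concat_Sales
sales = pd.DataFrame({
    "Sales_one_Month_avg":    [1500.0, 0.0, 300.0],
    "Sales_three_months_avg": [4000.0, 0.0, 300.0],
    "Sales_six_months_avg":   [6000.0, 0.0, 250.0],
})
sales["(lst_mth-6mth)/6mth"] = relative_change(
    sales["Sales_one_Month_avg"], sales["Sales_six_months_avg"])
sales["3mth-6mth)/6mth"] = relative_change(
    sales["Sales_three_months_avg"], sales["Sales_six_months_avg"])
print(sales)
```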

In [101]:
Concat_Sales.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 107869 entries, 0 to 24234
Data columns (total 8 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   Serial                  107869 non-null  object 
 1   quantity                107869 non-null  float64
 2   Sales_one_Month_avg     107869 non-null  float64
 3   Sales_three_months_avg  107869 non-null  float64
 4   Sales_six_months_avg    107869 non-null  float64
 5   KeyManufNo_SalesOrg     107869 non-null  object 
 6   (lst_mth-6mth)/6mth     107869 non-null  float64
 7   3mth-6mth)/6mth         107869 non-null  float64
dtypes: float64(6), object(2)
memory usage: 7.4+ MB


In [102]:
Concat_Sales.tail(5)

Unnamed: 0,Serial,quantity,Sales_one_Month_avg,Sales_three_months_avg,Sales_six_months_avg,KeyManufNo_SalesOrg,(lst_mth-6mth)/6mth,3mth-6mth)/6mth
24230,CWIP1220,123114.12617,9809.949214,7898.79714,7815.174835,CWIP1220Nestlé India,0.255244,0.0107
24231,CWIP1221,0.0,0.0,0.0,0.0,CWIP1221Nestlé India,0.0,0.0
24232,CWIP1222,0.0,0.0,0.0,0.0,CWIP1222Nestlé India,0.0,0.0
24233,CWIP1223,3050.932203,338.983051,338.983051,282.488701,CWIP1223Nestlé India,0.199988,0.199988
24234,CWIP1224,210454.575476,1694.915254,4181.708785,6081.570367,CWIP1224Nestlé India,-0.721303,-0.312397


The serial must be cast to string; otherwise it cannot be merged correctly with the manufacturer number.

TelemetronData_df3['serial'] = TelemetronData_df3['serial'].astype(str)

BeverageMachine7_wTickets_df['Manufacturer Number'] = BeverageMachine7_wTickets_df['Manufacturer Number'].astype(str)

w=aaaf.loc[aaaf['Manufacturer Number']=='20172526377']
w

In [103]:
#Concat_Telemetry = pd.concat([TelemetronData_df3, Telemetry_aggSales_df3])
Concat_Telemetry = Telemetry_aggSales_df3
Concat_Telemetry.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 43934 entries, 0 to 43933
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   serial            43934 non-null  object 
 1   quantity          43934 non-null  int64  
 2   one_month_avg     43934 non-null  int64  
 3   three_months_avg  43934 non-null  float64
 4   six_months_avg    43934 non-null  float64
dtypes: float64(2), int64(2), object(1)
memory usage: 2.0+ MB


In [104]:
Concat_Telemetry['(lst_mth-6mth)/6mth'] = Concat_Telemetry.apply(lambda x: 0 if x['six_months_avg'] <= 0 else (x['one_month_avg']-x['six_months_avg'])/x['six_months_avg'], axis=1)

Concat_Telemetry['3mth-6mth)/6mth'] = Concat_Telemetry.apply(lambda x: 0 if x['six_months_avg'] <= 0 else (x['three_months_avg']-x['six_months_avg'])/x['six_months_avg'], axis=1)

## 10. Market Actions data

In [105]:
##Market Actions listed
MktActions = pd.read_excel(MktActions)
MktActions.head()

Unnamed: 0,Month,Serial ID,Sales Organisation,Parent Installation Point ID,Actions,Comments,CA Comments
0,2021-11-30,34F6401007,Nestle UK,7326,Other,CA Feedback Required,
1,2021-11-30,16E0031901,Nestle UK,11955,Removal planned,,
2,2021-11-30,17E0020640,Nestle UK,8151,Removal planned,,
3,2021-11-30,10238090,Nestle UK,IP-11722,Removal planned,,
4,2021-11-30,101810133,Nestle UK,4915,Other,CA Feedback Required,


In [106]:
MktActions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1742 entries, 0 to 1741
Data columns (total 7 columns):
 #   Column                        Non-Null Count  Dtype         
---  ------                        --------------  -----         
 0   Month                         1742 non-null   datetime64[ns]
 1   Serial ID                     1742 non-null   object        
 2   Sales Organisation            1742 non-null   object        
 3   Parent Installation Point ID  1742 non-null   object        
 4   Actions                       1741 non-null   object        
 5   Comments                      1121 non-null   object        
 6   CA Comments                   29 non-null     object        
dtypes: datetime64[ns](1), object(6)
memory usage: 95.4+ KB


In [107]:
#add key serial + sales org?
#One hot encoding
def preprocess_MktActions(df):
    # Work on a copy
    df = df.copy()

    nomi_vars = ['Actions']
    
    dummy_columns = nomi_vars
        
    df = pd.get_dummies(df, columns=dummy_columns)

    return df

MktActions_prep = preprocess_MktActions(MktActions)
MktActions_prep.head()

Unnamed: 0,Month,Serial ID,Sales Organisation,Parent Installation Point ID,Comments,CA Comments,Actions_Churn risk reason unknown,Actions_Data corrected,Actions_Downgrade machine installed,Actions_Lack of data discipline,...,Actions_Removed,Actions_Reviewed and no action Required,Actions_Reviewed and no actions required,Actions_Seasonal Machine,Actions_Telemetry installed,Actions_Upgrade machine installed,Actions_Visit completed,Actions_Visit/Call planned,Actions_removed,Actions_tagging update
0,2021-11-30,34F6401007,Nestle UK,7326,CA Feedback Required,,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2021-11-30,16E0031901,Nestle UK,11955,,,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,2021-11-30,17E0020640,Nestle UK,8151,,,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,2021-11-30,10238090,Nestle UK,IP-11722,,,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,2021-11-30,101810133,Nestle UK,4915,CA Feedback Required,,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
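
The encoded columns contain near-duplicates that differ only in case or wording, for example `Actions_Removed` / `Actions_removed` and `Actions_Reviewed and no action Required` / `Actions_Reviewed and no actions required`. One possible cleanup is to normalise the labels before encoding. The sketch below uses toy data; `normalise_actions` and the alias table are assumptions, and which labels really mean the same action is a business decision.

```python
import pandas as pd

# Hypothetical alias table: only an obvious spelling variant is merged here.
ACTION_ALIASES = {
    "reviewed and no actions required": "reviewed and no action required",
}

def normalise_actions(series):
    """Trim, lower-case and apply the alias table to an action label column."""
    cleaned = series.str.strip().str.lower()
    return cleaned.replace(ACTION_ALIASES)

actions = pd.DataFrame({
    "Serial ID": ["A1", "A2", "A3", "A4"],
    "Actions": ["Removed", "removed",
                "Reviewed and no action Required",
                "Reviewed and no actions required"],
})
actions["Actions"] = normalise_actions(actions["Actions"])
actions_encoded = pd.get_dummies(actions, columns=["Actions"], dtype=int)
print(actions_encoded.columns.tolist())
```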


In [108]:
MktActions_prep.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1742 entries, 0 to 1741
Data columns (total 27 columns):
 #   Column                                    Non-Null Count  Dtype         
---  ------                                    --------------  -----         
 0   Month                                     1742 non-null   datetime64[ns]
 1   Serial ID                                 1742 non-null   object        
 2   Sales Organisation                        1742 non-null   object        
 3   Parent Installation Point ID              1742 non-null   object        
 4   Comments                                  1121 non-null   object        
 5   CA Comments                               29 non-null     object        
 6   Actions_Churn risk reason unknown         1742 non-null   uint8         
 7   Actions_Data corrected                    1742 non-null   uint8         
 8   Actions_Downgrade machine installed       1742 non-null   uint8         
 9   Actions_Lack of data disciplin

In [109]:
MktActions_prep2=MktActions_prep.drop(columns=['Sales Organisation','Parent Installation Point ID', 'Month'])
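# Sum the one-hot action columns per machine; numeric_only=True skips the free-text comment columns
MktActions_prep3 = MktActions_prep2.groupby(['Serial ID']).sum(numeric_only=True)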
MktActions_prep3.head()



Serial ID,Actions_Churn risk reason unknown,Actions_Data corrected,Actions_Downgrade machine installed,Actions_Lack of data discipline,Actions_New contract,Actions_Other,Actions_Out of order,Actions_Phone Call completed,Actions_Plan for removal,Actions_Removal Plan,...,Actions_Removed,Actions_Reviewed and no action Required,Actions_Reviewed and no actions required,Actions_Seasonal Machine,Actions_Telemetry installed,Actions_Upgrade machine installed,Actions_Visit completed,Actions_Visit/Call planned,Actions_removed,Actions_tagging update
24606,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1895151,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10238090,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10238091,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
10238092,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0


In [110]:
MktActions_prep3.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1430 entries, 24606 to T487736
Data columns (total 21 columns):
 #   Column                                    Non-Null Count  Dtype
---  ------                                    --------------  -----
 0   Actions_Churn risk reason unknown         1430 non-null   uint8
 1   Actions_Data corrected                    1430 non-null   uint8
 2   Actions_Downgrade machine installed       1430 non-null   uint8
 3   Actions_Lack of data discipline           1430 non-null   uint8
 4   Actions_New contract                      1430 non-null   uint8
 5   Actions_Other                             1430 non-null   uint8
 6   Actions_Out of order                      1430 non-null   uint8
 7   Actions_Phone Call completed              1430 non-null   uint8
 8   Actions_Plan for removal                  1430 non-null   uint8
 9   Actions_Removal Plan                      1430 non-null   uint8
 10  Actions_Removal planned                   1430 non-null   

### (b) Plan to manage and process the data <a class="anchor" id="ManageData"></a>

I will extract the data in Excel or CSV format and load it into Python.

The different files can then be merged together.

The data is checked monthly and was designed so that the datasets can be linked together.

Columns useful to link the datasets together:

    'Product ID [Machine Model ID]'

    'Manufacturer Number'

    'BMB/C4C Model code'

    'M1'
    
    'Manufacture Serial Number'
    
    'Serial ID'

I need to find a way to get one row per machine per month for the telemetry data and the Placement Tickets data.

The main idea behind using telemetry data here is to check whether there is, for example, a relationship between churn and the number of cups sold.
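
One way to reach one row per machine per month is to bucket each record into its calendar month and aggregate per machine and month. The sketch below does this for a toy ticket log. The frame and the output column names `n_tickets` and `last_visit` are assumptions for illustration; the same pattern would apply to telemetry sales with a sum instead of a count.

```python
import pandas as pd

# Toy ticket log: several tickets per machine, possibly in the same month
tickets = pd.DataFrame({
    "Serial ID": ["M1", "M1", "M1", "M2"],
    "Completion Date": pd.to_datetime(
        ["2020-07-03", "2020-07-21", "2020-08-10", "2020-08-15"]),
})

# Bucket each ticket into its calendar month, then aggregate per machine and month
tickets["Month"] = tickets["Completion Date"].dt.to_period("M")
monthly_tickets = (
    tickets.groupby(["Serial ID", "Month"])
    .agg(n_tickets=("Completion Date", "count"),
         last_visit=("Completion Date", "max"))
    .reset_index()
)
print(monthly_tickets)
```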

I will not use all the features. Below are the features I plan to use for the two biggest datasets:

19 columns from Beverage Machine data :

['Serial ID', 'Sales Organisation', 'Machine Status Groupings', 'User Status', 'TA Contract Installation Date', 'Depreciation Start',
'Position', 'TA Contract Start Date', 'TA Contract End Date', 'TA Usage Indicator',
'Account ABC Classification (Account ID)', 'Industry (Account ID)', 'Industry Code 1 (Account ID)',  'Account ABC Classification (EC ID)', 'Industry (EC ID)',
'Industry Code 1 (EC ID)', 'Parent Installation Point ID', 'Registered Product Category (Registered Product ID)', 'Calendar Date']

14 columns coming from the Beverage Classification data :

['Model', 'Model Vendor', 'Model Category', 'Model Group', 'Beverage Temperature',
'System Brands', 'Ingredient Format', 'Machine Type', 'Positionning', 'Generation',
'Blueprint Throughput', 'IP Ownership', 'Trading Partner', 'G/R/M TB']

### Beverage Machine data features and the Beverage Classification data features

##### Serial ID                                              
	Unique per machine; used to link to the Placement Tickets data

##### Sales Organisation                                     
	Usually a Sales Organisation corresponds to a Country

##### Product ID [Machine Model ID]                          
	Code that allows us to link it to the intermediary mapping table which contains all the details for each machine

##### Machine Status Groupings                               
	Shows whether a machine is:
		Deployed
		Idle
		Other

##### User Status                                            
	More detailed than status groupings

##### Depreciation Start                                     
	Date when the machine started to dispense cups

##### Manufacturer Number                                    
	Code that allows us to link to the telemetry data


##### Position                                               
	Tells whether a machine is:
		RENT,
		Sale,
		Loan,
		Demo,
		etc.

##### TA Contract Installation Date
    Date when the machine was installed at this location. It differs from the depreciation start because a machine may already have dispensed cups at another Installation Point before being installed here.

##### TA Contract Start Date                                 
	Date when the contract started
    
##### TA Contract End Date                                  
	Date when the contract ended
    
##### TA Usage Indicator                                     
	Can take several values:
		5 Monthly Rental
		Not assigned
		Trial / Evaluation
		7 Annual / Periodic

##### Account ABC Classification (Account ID)
	Can help identify which channel the account belongs to

##### Industry (Account ID)
	Can help identify which channel the account belongs to

##### Industry Code 1 (Account ID)
	Can help identify which channel the account belongs to

##### Account ABC Classification (EC ID)
	Can help identify which channel the end customer belongs to

##### Industry (EC ID)
	Can help identify which channel the end customer belongs to

##### Industry Code 1 (EC ID)
	Can help identify which channel the end customer belongs to

##### Parent Installation Point ID                           
	Helps identify whether a machine is still deployed at the same location with the same customer. This is the Installation Point ID mentioned earlier.

##### Registered Product Category (Registered Product ID)    
	Details of the category within our group

##### Calendar Date                                          
	Date when we extracted the data of the machine
    
##### BMB/C4C Model code                                     
	Code that links the intermediary mapping table to the beverage machine data

##### M1                                                     
	Name of the harmonized model; used to link the intermediary mapping to the mapping file with unique models
    
##### Model                                                  
	Name of the harmonized model; used to link the intermediary mapping to the mapping file with unique models

##### Model Vendor                                           
	Name of the vendor of the coffee machine

##### Model Category                                         
	Category of the model
		
##### Model Group                                            
	Group of the model

##### Beverage Temperature                                   
	Temperature of the beverage

##### System Brands                                         
	Brand internal classification

##### Ingredient Format                                     
	Format of the ingredient

##### Machine Type                                           
	Type of Machine

##### Positionning                                           
	Positioning of the machine

##### Generation                                             
	Generation of the machine

##### Blueprint Throughput                                   
    Type of throughput

##### IP Ownership                                           
    Ownership type

##### Trading Partner                                        
	Type of Trading Partner

##### G/R/M TB                                               
	How it is managed by the market 

Columns that are not used because they do not help explain the model:

##### Unused columns: 32

User Status Last Changed On                            
Product [Machine Model]                                
	Name of the machine Model 
Range Brand                                           
	Brand of the model
    
EC ID                                                  
    Identifies the end customer; some customers have more than one machine.
    
	Could be transformed into a number-of-machines feature for this customer

Equipment Number                                       
Asset Number                                           
TA Contract Number                                   
Account ID                                          
Ship To ID                                     
EC Name                                           
Sales Org ID (Installation Point)                  
Model Harmonized                                    	
Comments                                          
Source                                              
Global Projects                                    
	Machines that are part of a project:
		Roastelier
		Alegria
		Nitro
		Milano
		EZCare
		Express
		CoolPro
Toolbox                                               
Non-Toolbox Reason                 
Product                                       
Type.                              
Machines Models (Harmonized)                     
Solution Brands                 
Toolbox 2019                                       
Toolbox 2018                                     
Toolbox 2017                                        
Trade Assets                                      
Active for Procurement (2017)                       
Idle Available Stock Type                          
Modified                                          
Modified By                                     
Created                                            
Item Type                                       
Path

### Placement Tickets data features

##### Service Category
    Tells whether the machine was:
        Installed
        Removed
        Replaced

##### Completion Date
    Date when the ticket was completed. We will aggregate on the number of tickets and use this date to get the day of the last visit.

##### Incident Category
    Reason for the ticket, with details about the incident

##### Serial ID
    Used to link to the Beverage Machine data

### Telemetry data features

##### quantity 
    Sales quantity

##### serial 	
    ID that allows us to map to the manufacturer number of the beverage machines

##### columns not used :

Month

    Month of the sales

stockId

    Each machine has a button linked to an ID. By mapping this ID to the related product we can know which type of cup was sold. However, the mapping does not yet work for every machine, so the Product column might be wrong.

Column1 	

    unknown Id

Averages 	

    unknown average

inactive 	

    unknown column

machine_id2 

    unknown Id

Product

    Type of cup sold (the mapping is not yet ready for every machine)

We will use only the sales quantity and the serial, which links to the Beverage Machine data. The other columns are either not useful or do not meet minimum data-accuracy requirements.

### Missing data

In [111]:
# TA Contract Installation Date
BevMachMissingInstDate = BeverageMachine_df.loc[BeverageMachine_df['TA Contract Installation Date']=='#']['TA Contract Installation Date'].count()
TotBevMach = BeverageMachine_df['Serial ID'].count()

# TA Contract Start Date
BevMachMissingStartDate = BeverageMachine_df.loc[BeverageMachine_df['TA Contract Start Date']=='#']['TA Contract Start Date'].count()

# TA Contract End Date 
BevMachMissingEndDate = BeverageMachine_df.loc[BeverageMachine_df['TA Contract End Date']=='#']['TA Contract End Date'].count()

# Depreciation Start
BevMachMissingDepStartDate = BeverageMachine_df.loc[BeverageMachine_df['Depreciation Start']=='#']['Depreciation Start'].count()


print('Beverage machines missing Installation Date : ', BevMachMissingInstDate, ', which corresponds to ', 100*round(BevMachMissingInstDate/TotBevMach,2), '%')
print('Beverage machines missing Start Date : ', BevMachMissingStartDate, ', which corresponds to ', 100*round(BevMachMissingStartDate/TotBevMach,2), '%')
print('Beverage machines missing End Date : ', BevMachMissingEndDate, ', which corresponds to ', 100*round(BevMachMissingEndDate/TotBevMach,2), '%')
print('Beverage machines missing Depreciation Start Date : ', BevMachMissingDepStartDate, ', which corresponds to ', 100*round(BevMachMissingDepStartDate/TotBevMach,2), '%')


Beverage machines missing Installation Date :  1942519 , which corresponds to  45.0 %
Beverage machines missing Start Date :  1942519 , which corresponds to  45.0 %
Beverage machines missing End Date :  1942511 , which corresponds to  45.0 %
Beverage machines missing Depreciation Start Date :  4 , which corresponds to  0.0 %
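
The four counts above repeat the same pattern, so they could also be produced in one pass. The following is a minimal sketch on toy data, assuming `'#'` is the placeholder for a missing date as in the cell above. `placeholder_share` is a hypothetical helper, and it divides by the total number of rows rather than by the non-null `Serial ID` count.

```python
import pandas as pd

def placeholder_share(df, columns, placeholder="#"):
    """Count rows holding the placeholder in each column and express it as a share of all rows."""
    counts = (df[columns] == placeholder).sum()
    share = (100 * counts / len(df)).round(1)
    return pd.DataFrame({"missing": counts, "share_pct": share})

# Toy frame standing in for BeverageMachine_df
machines = pd.DataFrame({
    "TA Contract Start Date": ["#", "2019-01-01", "#", "2020-05-01"],
    "Depreciation Start": ["2018-01-01", "2019-01-01", "2019-06-01", "#"],
})
missing_summary = placeholder_share(
    machines, ["TA Contract Start Date", "Depreciation Start"])
print(missing_summary)
```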


##### Telemetry data
Even though the number of beverage machines equipped with telemetry is increasing, the amount of data available is still low, so it should be seen as a complement.

In August 2020, only around 200 of our roughly 60'000 beverage machines had telemetry data and were already in the new system from which we got the Beverage Machine data.


##### Placement Tickets data

27'318 beverage machines do not have any Placement Tickets.


##### Date features missing

We see that the date is sometimes not filled in for Installation Date, Start Date and End Date (about 45% of the rows each, according to the counts above).

#### Visits data
A visit is linked to an account. The machine's "Account ID" can be linked to the visit's "Account ID.Account ID Level 01.Key", possibly combined with the Sales Org into a key in case the ID is unique only within a market.

#### Phone Calls data
A phone call is linked to an account. Link the "Account Name" of the phone call to the "Account ID" of the machine.

## Preparation of the data<a class="anchor" id="prep"></a>

### a) Details of preparation<a class="anchor" id="det"></a>

#### Beverage Machine data preparation

The goal is to get the latest date on which each Serial ID appears.

If a machine's latest date is earlier than the latest snapshot date, then the machine has churned.

We will look at the maximum date per installation point, because when we lose an installation point we lose the customer.

A machine can be reallocated to another customer.

Keep only the latest month of data.
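
The churn rule described above can be sketched on a toy set of monthly snapshots as follows. The frame and the names `latest_snapshot`, `last_seen` and `churn_flag` are assumptions for illustration; the column names match the Beverage Machine data.

```python
import pandas as pd

# Toy monthly snapshots: IP-1 disappears after July, IP-2 is still present in August
snapshots = pd.DataFrame({
    "Parent Installation Point ID": ["IP-1", "IP-1", "IP-2", "IP-2", "IP-2"],
    "Calendar Date": pd.to_datetime(
        ["2020-06-30", "2020-07-31", "2020-06-30", "2020-07-31", "2020-08-31"]),
})

# Latest snapshot overall, and latest snapshot in which each installation point appears
latest_snapshot = snapshots["Calendar Date"].max()
last_seen = snapshots.groupby("Parent Installation Point ID")["Calendar Date"].max()

# An installation point is flagged as churned when it is absent from the latest snapshot
churn_flag = (last_seen < latest_snapshot).rename("churned")
print(churn_flag)
```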



In [112]:
BeverageMachine_df['Calendar Date'] = pd.to_datetime(BeverageMachine_df['Calendar Date'], errors =  'coerce')


In [113]:
BeverageMachine_df.tail()

Unnamed: 0.1,Unnamed: 0,Sales Organisation,User Status Last Changed On,Product [Machine Model],Product ID [Machine Model ID],Range Brand,Machine Status Groupings,User Status,Depreciation Start,Serial ID,...,Industry Code 1 (Account ID),Account ABC Classification (EC ID),Industry (EC ID),Industry Code 1 (EC ID),Parent Installation Point ID,Registered Product Category (Registered Product ID),Sales Org ID (Installation Point),SAP Material Line Code [Machine Model ID],Calendar Date,Key_ManufacturerID_SalesOrg
7625633,307827,Nestlé Austria,45574.0,NESCAFE ALEGRIA FTP30 v1.0 BM,100023190,ALEGRIA,Deployed,Installed,44166.0,20E0008660,...,061302 AVM: Free Standing,06 Out of Home,0613 Vending,061302 AVM: Free Standing,IP-43103,Trade Asset w/ Fixed Asset,AT10,90073039,2025-01-31,204429950Nestlé Austria
7625634,307828,Nestlé Austria,45574.0,NESCAFE ALEGRIA A630 H3A2W HW BP BM,90045171,ALEGRIA,Deployed,Installed,42795.0,16E0027165,...,061302 AVM: Free Standing,06 Out of Home,0613 Vending,061302 AVM: Free Standing,IP-42812,Trade Asset w/ Fixed Asset,AT10,90045171,2025-01-31,164336633Nestlé Austria
7625635,307829,Nestlé Austria,45574.0,NESCAFE ALEGRIA High Cap 4 can 230V,100070657,ALEGRIA,Deployed,Installed,43160.0,16E0019941,...,061302 AVM: Free Standing,06 Out of Home,0613 Vending,061302 AVM: Free Standing,IP-42551,Trade Asset w/ Fixed Asset,AT10,90061864,2025-01-31,1606101279Nestlé Austria
7625636,307830,Nestlé Austria,45574.0,NESCAFE ALEGRIA A860 H4A2W HW BP BM,100023582,ALEGRIA,Deployed,Installed,42430.0,151616861,...,061302 AVM: Free Standing,06 Out of Home,0613 Vending,061302 AVM: Free Standing,IP-43255,Trade Asset w/ Fixed Asset,AT10,90045118,2025-01-31,151616861Nestlé Austria
7625637,307831,Nestlé Austria,45574.0,NESCAFE ALEGRIA CoffeeBrewer 6/30Piccolo,100070737,ALEGRIA,Deployed,Installed,41609.0,132020210,...,061302 AVM: Free Standing,06 Out of Home,0613 Vending,061302 AVM: Free Standing,IP-42529,Trade Asset w/ Fixed Asset,AT10,90049909,2025-01-31,132020210Nestlé Austria


BeverageMachine_df1 = BeverageMachine_df.copy()
BeverageMachine_df1 = BeverageMachine_df1.groupby(['Parent Installation Point ID'])


In [114]:
BeverageMachine_df1 = BeverageMachine_df.copy()
BeverageMachine_df1['Product ID [Machine Model ID]'] = BeverageMachine_df1['Product ID [Machine Model ID]'].astype(str)
#BeverageMachine_df2 = BeverageMachine_df1.groupby(['Parent Installation Point ID']).agg({'Calendar Date' : [np.min, np.max]})

#BeverageMachine_df1['Calendar Date2'] = BeverageMachine_df1['Calendar Date']

#BeverageMachine_df2 = BeverageMachine_df1.groupby(['Parent Installation Point ID']).agg({'Calendar Date' : 'min', 'Calendar Date2' : 'max'})

In [115]:
BeverageMachine_df2 = pd.merge(BeverageMachine_df1, BevMap_df, how='left', left_on = ['Product ID [Machine Model ID]'], right_on = ['ID Model Code'])
BeverageClassification1_df = BeverageClassification_df.drop_duplicates(['Model'])
BeverageMachine_df3 = pd.merge(BeverageMachine_df2, BeverageClassification1_df, how='left', left_on = ['Model'], right_on = ['Model']) 

# Function to filter the DataFrame in chunks to handle MemoryError
def filter_dataframe_in_chunks(df, chunk_size, query):
    chunks = []  # List to store processed chunks

    try:
        # Process the DataFrame in chunks
        for start in range(0, len(df), chunk_size):
            chunk = df.iloc[start:start + chunk_size]
            # Filter out rows based on the query
            filtered_chunk = chunk.query(query)
            chunks.append(filtered_chunk)
        
        # Concatenate all the filtered chunks into a single DataFrame
        filtered_df = pd.concat(chunks, ignore_index=True)
        return filtered_df
    except MemoryError:
        print("MemoryError: Unable to process the DataFrame due to memory constraints.")
        return None

# Define the chunk size and query
chunk_size = 10000  # Define the chunk size
query = "`Model` != 'Accessories'"

# Assuming BeverageMachine_df3 is already loaded as a DataFrame
# Filter the DataFrame using the function
BeverageMachine_df3_filtered = filter_dataframe_in_chunks(BeverageMachine_df3, chunk_size, query)

# Display the first few rows of the filtered DataFrame if it was created successfully
if BeverageMachine_df3_filtered is not None:
    print("Filtered DataFrame created successfully.")
    print(BeverageMachine_df3_filtered.head())
else:
    print("Failed to create the filtered DataFrame.")

pip install dask

import dask.dataframe as dd

# Load the DataFrame using Dask
ddf = dd.from_pandas(BeverageMachine_df3, npartitions=10)

# Filter the DataFrame
ddf_filtered = ddf[ddf['Model'] != 'Accessories']

# Compute the filtered DataFrame
BeverageMachine_df3_filtered = ddf_filtered.compute()

# Display the first few rows of the filtered DataFrame
print("Filtered DataFrame created successfully.")
print(BeverageMachine_df3_filtered.head())

In [116]:
BeverageMachine_df3 = BeverageMachine_df3.query("`Model` != 'Accessories'")

In [117]:
BeverageMachine_df3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3965611 entries, 0 to 4288719
Data columns (total 75 columns):
 #   Column                                               Dtype         
---  ------                                               -----         
 0   Unnamed: 0                                           int64         
 1   Sales Organisation                                   object        
 2   User Status Last Changed On                          object        
 3   Product [Machine Model]                              object        
 4   Product ID [Machine Model ID]                        object        
 5   Range Brand                                          object        
 6   Machine Status Groupings                             object        
 7   User Status                                          object        
 8   Depreciation Start                                   object        
 9   Serial ID                                            object        
 10  Manufa

In [118]:
# Get the distinct values for the column "TA Usage Indicator"
distinct_values = BeverageMachine_df3["TA Usage Indicator"].unique()

# Print the distinct values
print(distinct_values)

# Get the distinct values for the column "Position"
distinct_values2 = BeverageMachine_df3["Position"].unique()

# Print the distinct values
print(distinct_values2)

['Not assigned' '5 Monthly Rental' 'Trial / Evaluation'
 '7 Annual / Periodic' nan 'Yearly Rental' 'Trial' 'Weekly Rental']
['LOAN' '#' 'EVAL' 'RENT' 'DEMO' 'SALE' '21H0' '22O0' '23O0' 'SOLD' 'FOL'
 'SCRP' '1000' 'EXPE' '11' '1' nan 'TEST' 'LOA' 'CUST' 'FREE' 'SC']


In [119]:
# List of allowed values
allowed_values = ["DEMO", "EVAL", "FOL", "LOAN", "RENT", "SALE", "SCRP", "SOLD", "TEST"]

# Replace values not in the allowed list with "#"
BeverageMachine_df3.loc[~BeverageMachine_df3["Position"].isin(allowed_values), "Position"] = "#"

# Print the updated values of the "Position" column
print(BeverageMachine_df3["Position"].unique())

# Replace "0" and NaN with "Not assigned" in the column "TA Usage Indicator"
# (assign the result back instead of using inplace=True on a column selection)
BeverageMachine_df3["TA Usage Indicator"] = BeverageMachine_df3["TA Usage Indicator"].replace({"0": "Not assigned", np.nan: "Not assigned"})

# Print the updated DataFrame for the "TA Usage Indicator" column
print(BeverageMachine_df3["TA Usage Indicator"].unique())

['LOAN' '#' 'EVAL' 'RENT' 'DEMO' 'SALE' 'SOLD' 'FOL' 'SCRP' 'TEST']
['Not assigned' '5 Monthly Rental' 'Trial / Evaluation'
 '7 Annual / Periodic' 'Yearly Rental' 'Trial' 'Weekly Rental']


Another way to get the min and max date, based on this example: "I wanted to create a new data frame where I can get min value in the column Numb if my string in the column Word is ab and max value if my string is bc for each Date."

s=df.groupby(['Date','Word']).Numb.agg(['min','max'])

s['number']=np.where(s.index.get_level_values(1)=='ab',s.min(1),s.max(1))

df11 =BeverageMachine_df.copy()
df22 = df11.reset_index()
df22.loc[df22.groupby('Parent Installation Point ID')['Calendar Date'].idxmin()]
df22.info()

In [120]:
# Sort the dataFrame by 'Calendar Date' and then remove duplicates :
BM_Maxdate_IPID2 = BeverageMachine_df3.sort_values('Calendar Date', ascending=False).drop_duplicates(['Parent Installation Point ID'])
BM_Maxdate_IPID2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 270778 entries, 4288719 to 2377162
Data columns (total 75 columns):
 #   Column                                               Non-Null Count   Dtype         
---  ------                                               --------------   -----         
 0   Unnamed: 0                                           270778 non-null  int64         
 1   Sales Organisation                                   270778 non-null  object        
 2   User Status Last Changed On                          269837 non-null  object        
 3   Product [Machine Model]                              270778 non-null  object        
 4   Product ID [Machine Model ID]                        270778 non-null  object        
 5   Range Brand                                          270778 non-null  object        
 6   Machine Status Groupings                             270778 non-null  object        
 7   User Status                                          270778 non-nul
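
The two approaches used above (`idxmin`/`idxmax` per group, and sorting followed by `drop_duplicates`) can be compared on a toy frame. This is only a minimal sketch: the column names mirror the notebook, but the rows are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: two installation points with monthly snapshots
toy = pd.DataFrame({
    "Parent Installation Point ID": ["A", "A", "B", "B", "B"],
    "Calendar Date": pd.to_datetime(
        ["2024-11-30", "2024-12-31", "2024-10-31", "2025-01-31", "2024-12-31"]
    ),
    "Serial ID": ["s1", "s1", "s2", "s2", "s2"],
})

# Approach 1: sort by date (newest first) and keep the first row of each group
latest_sorted = (
    toy.sort_values("Calendar Date", ascending=False)
       .drop_duplicates(["Parent Installation Point ID"])
)

# Approach 2: look up the row label of the maximum date in each group
latest_idx = toy.loc[
    toy.groupby("Parent Installation Point ID")["Calendar Date"].idxmax()
]

# Both approaches keep the same rows (compared after sorting by index)
same_rows = latest_sorted.sort_index().equals(latest_idx.sort_index())
print(same_rows)
```

When several rows of a group share the same latest date, the two approaches may keep different rows, so the choice matters only in that case.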

The columns used to link the datasets must have the same type in both datasets. A merge on a key stored as a string on one side and as a number on the other might not match the rows properly.
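
A minimal sketch of the idea, using hypothetical toy tables (the names `machines`, `mapping` and the column `Product ID` are placeholders, not the notebook's real data). Depending on the pandas version, merging an integer key with a string key either fails or matches nothing, so the key is cast to `str` on both sides first:

```python
import pandas as pd

# Hypothetical toy tables: the same IDs stored as int on one side and str on the other
machines = pd.DataFrame({"Product ID": [101, 102], "Serial ID": ["s1", "s2"]})
mapping = pd.DataFrame({"Product ID": ["101", "102"], "Harmonized Model": ["M-A", "M-B"]})

# Align the key type on both sides before merging
machines["Product ID"] = machines["Product ID"].astype(str)
mapping["Product ID"] = mapping["Product ID"].astype(str)

# indicator=True adds a '_merge' column showing which rows found a match
merged = machines.merge(mapping, on="Product ID", how="left", indicator=True)
print(merged["_merge"].value_counts())
```

Checking the `_merge` column after each merge is a cheap way to spot keys that silently failed to match.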

In [121]:
BeverageMachine1_df = BM_Maxdate_IPID2.copy()
BeverageMachine1_df['Product ID [Machine Model ID]']=BeverageMachine1_df['Product ID [Machine Model ID]'].astype(str)

count = BeverageMachine1_df[BeverageMachine1_df['Sales Organisation'] == 'Nestlé Russia'].shape[0]
print("Number of rows with 'Sales Organisation' as 'Nestlé Russia':", count)

count = BeverageMachine1_df[(BeverageMachine1_df['Sales Organisation'] == 'Nestlé Russia') & (BeverageMachine1_df['Machine Status Groupings'] == 'Deployed')].shape[0]
print("Number of rows with 'Sales Organisation' as 'Nestlé Russia' and 'Machine Status Groupings' as 'Deployed':", count)

# The status is stored as 'Deployed' elsewhere in this notebook, so compare case-insensitively
deployed_count = (BeverageMachine1_df['Machine Status Groupings'].str.upper() == 'DEPLOYED').sum()
print("Number of machines with status 'DEPLOYED':", deployed_count)

In [122]:
# Load the repair tickets. 'COMPLETION_SLA_MET' is read as a boolean (True/False) here;
# the commented-out query below shows how to transform it to an integer (1/0) in the SQL query

try:
    dp_sql = """SELECT COMPLETION_DATE, INCIDENT_CATEGORY_DESCRIPTION, SERIAL_ID, COMPLETION_SLA_MET
    FROM EDW.PRS.C4C_REPAIR_TICKETS_KPI_V"""
    #dp_sql = """SELECT COMPLETION_DATE, INCIDENT_CATEGORY_DESCRIPTION, SERIAL_ID,
     #  CASE WHEN COMPLETION_SLA_MET = 'True' THEN 1 ELSE 0 END AS COMPLETION_SLA_MET
    #FROM EDW.PRS.C4C_REPAIR_TICKETS_KPI_V"""

    df_data_products_config = sf_session.sql(dp_sql)

except SnowparkSQLException as e:
    logging.error('Exception in function---[ get_data_products() ] - ' + str(e))
    # Re-raise so the following lines do not run with df_data_products_config undefined
    raise


df_data_products_config.show()

pandas_df_Repair = df_data_products_config.toPandas()

--------------------------------------------------------------------------------------------
|"COMPLETION_DATE"  |"INCIDENT_CATEGORY_DESCRIPTION"  |"SERIAL_ID"  |"COMPLETION_SLA_MET"  |
--------------------------------------------------------------------------------------------
|2022-11-10         |2.c Hydraulic Leaking            |20O0047368   |False                 |
|2022-11-03         |1.a Ingredient Calibration       |ID18415      |True                  |
|2022-03-17         |6 Electronics (PCBs)             |ID28731      |False                 |
|2020-12-28         |1.d Ingredient Other             |MYBMB19882   |True                  |
|2020-07-24         |17 Miscellaneous                 |174444191    |True                  |
|2021-12-17         |3.a Door Display/Touchscreen     |ID11156      |False                 |
|2022-02-04         |1.a Ingredient Calibration       |17O0040516   |False                 |
|2021-12-14         |2.g Hydraulic Other              |15O0036793   |F

BeverageMachine1_df = BeverageMachine1_df.loc[BeverageMachine1_df['Machine Status Groupings']=="Deployed"]

Merge the Beverage Machine data with the Beverage Mapping data to get the related "Harmonized Model" of the Beverage Classification data, then merge the Beverage Machine data with the Beverage Classification data.

A cleaning step is needed first: keep only the machines with the 'Parent Installation Point ID' filled and remove duplicates on that column. 'Serial ID' is only filtered for missing values ("#"), not deduplicated.

In [123]:
BeverageMachine4_df = BeverageMachine1_df.loc[BeverageMachine1_df['Parent Installation Point ID']!="#"].drop_duplicates(['Parent Installation Point ID'])

In [124]:
BeverageMachine4_df = BeverageMachine4_df.loc[BeverageMachine4_df['Serial ID']!="#"]


In [125]:
BeverageMachine4_df.columns

Index(['Unnamed: 0', 'Sales Organisation', 'User Status Last Changed On',
       'Product [Machine Model]', 'Product ID [Machine Model ID]',
       'Range Brand', 'Machine Status Groupings', 'User Status',
       'Depreciation Start', 'Serial ID', 'Manufacturer Number',
       'Equipment Number', 'Asset Number', 'Position', 'TA Contract Number',
       'TA Contract Installation Date', 'TA Contract Start Date',
       'TA Contract End Date', 'TA Usage Indicator', 'Account ID',
       'Ship To ID', 'EC ID', 'EC Name', 'City', 'State', 'Postal Code',
       'Account ABC Classification (Account ID)', 'Industry (Account ID)',
       'Industry Code 1 (Account ID)', 'Account ABC Classification (EC ID)',
       'Industry (EC ID)', 'Industry Code 1 (EC ID)',
       'Parent Installation Point ID',
       'Registered Product Category (Registered Product ID)',
       'Sales Org ID (Installation Point)',
       'SAP Material Line Code [Machine Model ID]', 'Calendar Date',
       'Key_ManufacturerID

In [126]:
                   
                    
BeverageMachine5_df = BeverageMachine4_df[['Serial ID', 'Sales Organisation', 'Machine Status Groupings', 'User Status', 
                    'TA Contract Installation Date', 'Depreciation Start', 'Manufacturer Number', 'Position', 
                    'TA Contract Start Date', 'TA Contract End Date', 'TA Usage Indicator',
                    'Account ID',
                    'EC ID', 'EC Name', 'Account ABC Classification (Account ID)', 'Industry (Account ID)', 
                    'Industry Code 1 (Account ID)', 'Account ABC Classification (EC ID)', 
                    'Industry (EC ID)', 'Industry Code 1 (EC ID)', 'Parent Installation Point ID', 
                    'Registered Product Category (Registered Product ID)', 
                    'Model', 'Model Vendor', 'Model Category', 'Model Group', 
                    'Beverage Temperature', 'System Brands', 'Ingredient Format', 
                    'Machine Type', 'Positionning', 'Generation', 'Blueprint Throughput', 
                    'IP Ownership', 'Calendar Date', 'Key_ManufacturerID_SalesOrg', 'City', 'State', 'Postal Code']]
BeverageMachine5_df.head()

Unnamed: 0,Serial ID,Sales Organisation,Machine Status Groupings,User Status,TA Contract Installation Date,Depreciation Start,Manufacturer Number,Position,TA Contract Start Date,TA Contract End Date,...,Machine Type,Positionning,Generation,Blueprint Throughput,IP Ownership,Calendar Date,Key_ManufacturerID_SalesOrg,City,State,Postal Code
4288719,132020210,Nestlé Austria,Deployed,Installed,44347.0,41609.0,132020210,RENT,44347.0,46173.0,...,Table Tops,Mainstream,Gen. 1,Low,Proprietary,2025-01-31,132020210Nestlé Austria,Wien,Vienna,1090.0
4178692,19O0046537,Nestlé PH,Deployed,Installed,,44621.0,T479748,#,,,...,Table Tops,Mainstream,Gen. 2,Medium,Exclusive,2025-01-31,T479748Nestlé PH,Pagadian City,Zamboanga Peninsula,9200.0
4178690,23O0024982,Nestlé PH,Deployed,Installed,,45200.0,3142073491,#,,,...,Table Tops,Mainstream,Gen. 2,Medium,Exclusive,2025-01-31,3142073491Nestlé PH,Las Pinas,National Capital Reg,1747.0
4178689,HK10020559,Nestle Hong Kong,Deployed,Installed,,44197.0,20202922905,#,,,...,Table Tops,Mainstream,Gen. 2,Low,Proprietary,2025-01-31,20202922905Nestle Hong Kong,Hong Kong,Hong Kong Island,
4178688,4410837,NP-Hungary,Deployed,Installed,,42186.0,4410837,#,,,...,Table Tops,Mainstream,Legacy,Low,Non-Proprietary,2025-01-31,4410837NP-Hungary,Szentendre,Pest,2000.0


If the latest 'Calendar Date' of a machine is earlier than 'ChurnDate2', the machine is flagged as churned; otherwise it has not churned.

In [127]:
# BeverageMachine5_df is a selection of BeverageMachine4_df; copy it first to avoid SettingWithCopyWarning
BeverageMachine5_df = BeverageMachine5_df.copy()
BeverageMachine5_df['Calendar Date'] = pd.to_datetime(BeverageMachine5_df['Calendar Date'])
BeverageMachine5_df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 270764 entries, 4288719 to 2377162
Data columns (total 39 columns):
 #   Column                                               Non-Null Count   Dtype         
---  ------                                               --------------   -----         
 0   Serial ID                                            270764 non-null  object        
 1   Sales Organisation                                   270764 non-null  object        
 2   Machine Status Groupings                             270764 non-null  object        
 3   User Status                                          270764 non-null  object        
 4   TA Contract Installation Date                        102087 non-null  object        
 5   Depreciation Start                                   250852 non-null  object        
 6   Manufacturer Number                                  259887 non-null  object        
 7   Position                                             270764 non-nul

In [128]:
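# Preview of the churn condition; the result is not stored here, the 'Churn' column is created below
np.where(BeverageMachine5_df['Calendar Date']< ChurnDate2, True, False)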
BeverageMachine5_df.head()

Unnamed: 0,Serial ID,Sales Organisation,Machine Status Groupings,User Status,TA Contract Installation Date,Depreciation Start,Manufacturer Number,Position,TA Contract Start Date,TA Contract End Date,...,Machine Type,Positionning,Generation,Blueprint Throughput,IP Ownership,Calendar Date,Key_ManufacturerID_SalesOrg,City,State,Postal Code
4288719,132020210,Nestlé Austria,Deployed,Installed,44347.0,41609.0,132020210,RENT,44347.0,46173.0,...,Table Tops,Mainstream,Gen. 1,Low,Proprietary,2025-01-31,132020210Nestlé Austria,Wien,Vienna,1090.0
4178692,19O0046537,Nestlé PH,Deployed,Installed,,44621.0,T479748,#,,,...,Table Tops,Mainstream,Gen. 2,Medium,Exclusive,2025-01-31,T479748Nestlé PH,Pagadian City,Zamboanga Peninsula,9200.0
4178690,23O0024982,Nestlé PH,Deployed,Installed,,45200.0,3142073491,#,,,...,Table Tops,Mainstream,Gen. 2,Medium,Exclusive,2025-01-31,3142073491Nestlé PH,Las Pinas,National Capital Reg,1747.0
4178689,HK10020559,Nestle Hong Kong,Deployed,Installed,,44197.0,20202922905,#,,,...,Table Tops,Mainstream,Gen. 2,Low,Proprietary,2025-01-31,20202922905Nestle Hong Kong,Hong Kong,Hong Kong Island,
4178688,4410837,NP-Hungary,Deployed,Installed,,42186.0,4410837,#,,,...,Table Tops,Mainstream,Legacy,Low,Non-Proprietary,2025-01-31,4410837NP-Hungary,Szentendre,Pest,2000.0


In [129]:
columnwithfalse = False
BeverageMachine6_df=BeverageMachine5_df.copy()
BeverageMachine6_df['Churn'] = columnwithfalse
BeverageMachine6_df.head()

Unnamed: 0,Serial ID,Sales Organisation,Machine Status Groupings,User Status,TA Contract Installation Date,Depreciation Start,Manufacturer Number,Position,TA Contract Start Date,TA Contract End Date,...,Positionning,Generation,Blueprint Throughput,IP Ownership,Calendar Date,Key_ManufacturerID_SalesOrg,City,State,Postal Code,Churn
4288719,132020210,Nestlé Austria,Deployed,Installed,44347.0,41609.0,132020210,RENT,44347.0,46173.0,...,Mainstream,Gen. 1,Low,Proprietary,2025-01-31,132020210Nestlé Austria,Wien,Vienna,1090.0,False
4178692,19O0046537,Nestlé PH,Deployed,Installed,,44621.0,T479748,#,,,...,Mainstream,Gen. 2,Medium,Exclusive,2025-01-31,T479748Nestlé PH,Pagadian City,Zamboanga Peninsula,9200.0,False
4178690,23O0024982,Nestlé PH,Deployed,Installed,,45200.0,3142073491,#,,,...,Mainstream,Gen. 2,Medium,Exclusive,2025-01-31,3142073491Nestlé PH,Las Pinas,National Capital Reg,1747.0,False
4178689,HK10020559,Nestle Hong Kong,Deployed,Installed,,44197.0,20202922905,#,,,...,Mainstream,Gen. 2,Low,Proprietary,2025-01-31,20202922905Nestle Hong Kong,Hong Kong,Hong Kong Island,,False
4178688,4410837,NP-Hungary,Deployed,Installed,,42186.0,4410837,#,,,...,Mainstream,Legacy,Low,Non-Proprietary,2025-01-31,4410837NP-Hungary,Szentendre,Pest,2000.0,False


In [130]:
#BeverageMachine6_df['Churn'] = np.where((BeverageMachine5_df['Calendar Date_x']<BeverageMachine5_df['Calendar Date_y'])|
#                                (BeverageMachine5_df['Calendar Date_x'] == ChurnDate), False, True)

# The comparison already returns booleans, so np.where(..., True, False) is not needed
BeverageMachine6_df['Churn'] = BeverageMachine6_df['Calendar Date'] < ChurnDate2
BeverageMachine6_df.loc[BeverageMachine6_df['Churn']==True].head()

Unnamed: 0,Serial ID,Sales Organisation,Machine Status Groupings,User Status,TA Contract Installation Date,Depreciation Start,Manufacturer Number,Position,TA Contract Start Date,TA Contract End Date,...,Positionning,Generation,Blueprint Throughput,IP Ownership,Calendar Date,Key_ManufacturerID_SalesOrg,City,State,Postal Code,Churn
2112645,MYBMB31850,Malaysia,Deployed,Installed,,43551.0,1810010196,#,,,...,Mainstream,Gen. 1,Medium,Propr. Comp.,2024-12-31,1810010196Malaysia,PJ,Selangor,47500,True
2112637,MYBMB21900,Malaysia,Deployed,Installed,,41703.0,1311040236,#,,,...,Mainstream,Gen. 1,Medium,Propr. Comp.,2024-12-31,1311040236Malaysia,Kajang,Selangor,43000,True
2112662,MYBMB15350,Malaysia,Deployed,Installed,,36418.0,CA1249OW,#,,,...,Mainstream,Legacy,Medium,Non-Proprietary,2024-12-31,CA1249OWMalaysia,Kuching,Sarawak,93100,True
2112736,MYBMB14786,Malaysia,Deployed,Installed,,35431.0,A617GX,#,,,...,Mainstream,Legacy,Medium,Non-Proprietary,2024-12-31,A617GXMalaysia,Kuching,Sarawak,93100,True
2112713,MYBMB10614,Malaysia,Deployed,Installed,,36526.0,A937LY,#,,,...,Mainstream,Legacy,Medium,Non-Proprietary,2024-12-31,A937LYMalaysia,Kuching,Sarawak,93100,True
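
The labelling rule applied in this cell can be sketched on a toy frame. The reference date below is a hypothetical stand-in for `ChurnDate2`, whose actual value is defined elsewhere in the notebook, and the rows are made up:

```python
import pandas as pd

# Hypothetical reference date; in the notebook this role is played by ChurnDate2
churn_date = pd.Timestamp("2025-01-31")

toy = pd.DataFrame({
    "Serial ID": ["s1", "s2", "s3"],
    "Calendar Date": pd.to_datetime(["2025-01-31", "2024-12-31", "2024-06-30"]),
})

# A machine whose latest snapshot is older than the reference date is flagged as churned
toy["Churn"] = toy["Calendar Date"] < churn_date
print(toy)
```

Because the comparison is strict, a machine whose latest snapshot falls exactly on the reference date is not flagged.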


In [131]:
BeverageMachine6_df.loc[BeverageMachine6_df['Churn']==False].head()

Unnamed: 0,Serial ID,Sales Organisation,Machine Status Groupings,User Status,TA Contract Installation Date,Depreciation Start,Manufacturer Number,Position,TA Contract Start Date,TA Contract End Date,...,Positionning,Generation,Blueprint Throughput,IP Ownership,Calendar Date,Key_ManufacturerID_SalesOrg,City,State,Postal Code,Churn
4288719,132020210,Nestlé Austria,Deployed,Installed,44347.0,41609.0,132020210,RENT,44347.0,46173.0,...,Mainstream,Gen. 1,Low,Proprietary,2025-01-31,132020210Nestlé Austria,Wien,Vienna,1090.0,False
4178692,19O0046537,Nestlé PH,Deployed,Installed,,44621.0,T479748,#,,,...,Mainstream,Gen. 2,Medium,Exclusive,2025-01-31,T479748Nestlé PH,Pagadian City,Zamboanga Peninsula,9200.0,False
4178690,23O0024982,Nestlé PH,Deployed,Installed,,45200.0,3142073491,#,,,...,Mainstream,Gen. 2,Medium,Exclusive,2025-01-31,3142073491Nestlé PH,Las Pinas,National Capital Reg,1747.0,False
4178689,HK10020559,Nestle Hong Kong,Deployed,Installed,,44197.0,20202922905,#,,,...,Mainstream,Gen. 2,Low,Proprietary,2025-01-31,20202922905Nestle Hong Kong,Hong Kong,Hong Kong Island,,False
4178688,4410837,NP-Hungary,Deployed,Installed,,42186.0,4410837,#,,,...,Mainstream,Legacy,Low,Non-Proprietary,2025-01-31,4410837NP-Hungary,Szentendre,Pest,2000.0,False


Check the data types and convert any column that does not have the correct type.

In [132]:
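# 'Serial ID' has the object dtype, so compare it as text rather than with an integer
e = BeverageMachine6_df.loc[BeverageMachine6_df['Serial ID'].astype(str)=='7010054129']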
e.iloc[:20,9:40]

Unnamed: 0,TA Contract End Date,TA Usage Indicator,Account ID,EC ID,EC Name,Account ABC Classification (Account ID),Industry (Account ID),Industry Code 1 (Account ID),Account ABC Classification (EC ID),Industry (EC ID),...,Positionning,Generation,Blueprint Throughput,IP Ownership,Calendar Date,Key_ManufacturerID_SalesOrg,City,State,Postal Code,Churn


In [133]:
BeverageMachine6_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 270764 entries, 4288719 to 2377162
Data columns (total 40 columns):
 #   Column                                               Non-Null Count   Dtype         
---  ------                                               --------------   -----         
 0   Serial ID                                            270764 non-null  object        
 1   Sales Organisation                                   270764 non-null  object        
 2   Machine Status Groupings                             270764 non-null  object        
 3   User Status                                          270764 non-null  object        
 4   TA Contract Installation Date                        102087 non-null  object        
 5   Depreciation Start                                   250852 non-null  object        
 6   Manufacturer Number                                  259887 non-null  object        
 7   Position                                             270764 non-nul

Some date features have the object dtype; I want them as integers. Missing or non-numeric values are set to 0.

In [134]:
# Date features
Date_Features = ['TA Contract Installation Date', 'Depreciation Start',  'TA Contract Start Date', 
                 'TA Contract End Date']

BeverageMachine7_df= BeverageMachine6_df.copy()

for x in Date_Features:
    BeverageMachine7_df[x] = pd.to_numeric(BeverageMachine7_df[x], errors='coerce').fillna(0).astype(int)
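
The values in these columns (for example 44347.0) look like Excel serial day numbers; that reading is an assumption, not something the data itself confirms. If it holds, a sketch like the following turns them into real dates and keeps missing values as `NaT` instead of 0. The sample series is hypothetical.

```python
import pandas as pd

# Hypothetical sample mixing serial numbers, a missing value and the "#" placeholder
raw = pd.Series(["44347.0", None, "#", "41609.0"], name="TA Contract Start Date")

# Non-numeric entries become NaN instead of raising an error
serial = pd.to_numeric(raw, errors="coerce")

# Assumption: Excel 1900 date system, where day 0 corresponds to 1899-12-30
as_date = pd.to_datetime(serial, unit="D", origin="1899-12-30")
print(as_date)
```

Keeping `NaT` rather than 0 avoids treating a missing contract date as a very old one when date differences are computed later.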

In [135]:
BeverageMachine7_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 270764 entries, 4288719 to 2377162
Data columns (total 40 columns):
 #   Column                                               Non-Null Count   Dtype         
---  ------                                               --------------   -----         
 0   Serial ID                                            270764 non-null  object        
 1   Sales Organisation                                   270764 non-null  object        
 2   Machine Status Groupings                             270764 non-null  object        
 3   User Status                                          270764 non-null  object        
 4   TA Contract Installation Date                        270764 non-null  int32         
 5   Depreciation Start                                   270764 non-null  int32         
 6   Manufacturer Number                                  259887 non-null  object        
 7   Position                                             270764 non-nul

#### Placement Tickets data preparation

In order to merge Placement Tickets data with Beverage Machine data I need to perform some preparations of the data.

I would like to have one row per manufacturer serial number and month.

Remove the "Removal" tickets because they nearly give away whether the machine has churned.
I kept "Seasonal Removal" because it marks a special case, and a similar machine might not churn if it has no Seasonal Removal.
Whether it should be removed too is still to be decided.

In [136]:
Placement_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 558551 entries, 0 to 558550
Data columns (total 3 columns):
 #   Column                         Non-Null Count   Dtype 
---  ------                         --------------   ----- 
 0   Serial ID                      558118 non-null  object
 1   Service Category               558551 non-null  object
 2   INCIDENT_CATEGORY_DESCRIPTION  555544 non-null  object
dtypes: object(3)
memory usage: 12.8+ MB


In [137]:
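# Drop the removal tickets (the category appears with and without a trailing dot)
table1 = Placement_df.loc[Placement_df['Service Category']!="Removal"]
table1 = table1.loc[table1['Service Category']!="Removal."]
# Select the "Seasonal Removal" tickets separately so they can be added back
table2 = Placement_df.loc[Placement_df['INCIDENT_CATEGORY_DESCRIPTION']=="Seasonal Removal"]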

In [138]:
table1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 332062 entries, 0 to 558549
Data columns (total 3 columns):
 #   Column                         Non-Null Count   Dtype 
---  ------                         --------------   ----- 
 0   Serial ID                      331778 non-null  object
 1   Service Category               332062 non-null  object
 2   INCIDENT_CATEGORY_DESCRIPTION  329759 non-null  object
dtypes: object(3)
memory usage: 10.1+ MB


In [139]:
Placement_df_wo_rem = pd.concat([table1,table2])
Placement_df_wo_rem.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 338888 entries, 0 to 558492
Data columns (total 3 columns):
 #   Column                         Non-Null Count   Dtype 
---  ------                         --------------   ----- 
 0   Serial ID                      338592 non-null  object
 1   Service Category               338888 non-null  object
 2   INCIDENT_CATEGORY_DESCRIPTION  336585 non-null  object
dtypes: object(3)
memory usage: 10.3+ MB
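
The same selection can be written with a single boolean mask instead of two tables and a `concat`. This is a sketch on hypothetical tickets, not a drop-in replacement: it selects each row at most once, whereas the `concat` above would duplicate a "Seasonal Removal" ticket whose service category is not a removal.

```python
import pandas as pd

# Hypothetical tickets with the two columns used for filtering
tickets = pd.DataFrame({
    "Serial ID": ["s1", "s2", "s3", "s4"],
    "Service Category": ["Installation", "Removal", "Removal.", "Removal"],
    "INCIDENT_CATEGORY_DESCRIPTION": [
        "New Customer / Installation Point", "Customer relocation",
        "Unknown/Other", "Seasonal Removal",
    ],
})

is_removal = tickets["Service Category"].isin(["Removal", "Removal."])
is_seasonal = tickets["INCIDENT_CATEGORY_DESCRIPTION"] == "Seasonal Removal"

# Keep every non-removal ticket plus the seasonal removals
kept = tickets.loc[~is_removal | is_seasonal]
print(kept["Serial ID"].tolist())
```

The mask also preserves the original row order, which the concatenation does not.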


In [140]:
from xlrd.xldate import xldate_as_tuple
from dateutil.relativedelta import relativedelta

Placement_df_prep = Placement_df_wo_rem[['Serial ID', 'Service Category','INCIDENT_CATEGORY_DESCRIPTION']].copy()

Placement_df_prep.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 338888 entries, 0 to 558492
Data columns (total 3 columns):
 #   Column                         Non-Null Count   Dtype 
---  ------                         --------------   ----- 
 0   Serial ID                      338592 non-null  object
 1   Service Category               338888 non-null  object
 2   INCIDENT_CATEGORY_DESCRIPTION  336585 non-null  object
dtypes: object(3)
memory usage: 10.3+ MB


In [141]:
Placement_df_prep['Serial ID'] = Placement_df_prep['Serial ID'].astype('str')
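
One thing to watch with this cast: the `info()` output above shows fewer non-null 'Serial ID' values than rows, and in the pandas 1.x line this notebook appears to use (the outputs show `Int64Index`), `astype('str')` turns a missing value into the literal text 'nan', which would then be grouped like a real serial number. A possible precaution, sketched on hypothetical values, is to drop the missing IDs before casting:

```python
import numpy as np
import pandas as pd

# Hypothetical serial IDs with one missing value
serials = pd.Series(["14E0003095", np.nan, "ID21682"], dtype=object)

# Cast without any cleaning; inspect how the missing value is rendered
as_text = serials.astype(str)
print(as_text.tolist())

# Dropping the missing values first keeps only real serial IDs
clean = serials.dropna().astype(str)
print(clean.tolist())
```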

In [142]:
def preprocess_f(df):
    # Work on a copy
    df = df.copy()

    nomi_vars = ['Service Category','INCIDENT_CATEGORY_DESCRIPTION']
                
    # Some columns could be also ordinal features but we will keep them as nominal features for the moment
    ##ordi_vars = ['Positionning', 'Generation',]
    
    dummy_columns = nomi_vars
        
    df = pd.get_dummies(df, columns=dummy_columns)

    return df

Placement_df_prep2 = preprocess_f(Placement_df_prep)
Placement_df_prep2.head()

Unnamed: 0,Serial ID,Service Category_Installation,Service Category_Removal,Service Category_Replacement,INCIDENT_CATEGORY_DESCRIPTION_Customer relocation,INCIDENT_CATEGORY_DESCRIPTION_Exchange / Replacement Sales,INCIDENT_CATEGORY_DESCRIPTION_Install/Data Fix,INCIDENT_CATEGORY_DESCRIPTION_Key Account Test Installation,INCIDENT_CATEGORY_DESCRIPTION_Maintenance & Repair,INCIDENT_CATEGORY_DESCRIPTION_N/A,INCIDENT_CATEGORY_DESCRIPTION_New Customer / Installation Point,INCIDENT_CATEGORY_DESCRIPTION_Removal / Data Fix,INCIDENT_CATEGORY_DESCRIPTION_Renew,INCIDENT_CATEGORY_DESCRIPTION_Seasonal Installation,INCIDENT_CATEGORY_DESCRIPTION_Seasonal Removal,INCIDENT_CATEGORY_DESCRIPTION_Trial / Demo /Food Show,INCIDENT_CATEGORY_DESCRIPTION_Unknown/Other,INCIDENT_CATEGORY_DESCRIPTION_Upgrade/Downgrade
0,14E0003095,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
7,24A0098570,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
8,Y7070563,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
9,ID21682,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
10,20O0020187,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0


In [143]:
Placement_df_prep2.columns

Index(['Serial ID', 'Service Category_Installation',
       'Service Category_Removal', 'Service Category_Replacement',
       'INCIDENT_CATEGORY_DESCRIPTION_Customer relocation',
       'INCIDENT_CATEGORY_DESCRIPTION_Exchange / Replacement Sales',
       'INCIDENT_CATEGORY_DESCRIPTION_Install/Data Fix',
       'INCIDENT_CATEGORY_DESCRIPTION_Key Account Test Installation',
       'INCIDENT_CATEGORY_DESCRIPTION_Maintenance & Repair',
       'INCIDENT_CATEGORY_DESCRIPTION_N/A',
       'INCIDENT_CATEGORY_DESCRIPTION_New Customer / Installation Point',
       'INCIDENT_CATEGORY_DESCRIPTION_Removal / Data Fix',
       'INCIDENT_CATEGORY_DESCRIPTION_Renew',
       'INCIDENT_CATEGORY_DESCRIPTION_Seasonal Installation',
       'INCIDENT_CATEGORY_DESCRIPTION_Seasonal Removal',
       'INCIDENT_CATEGORY_DESCRIPTION_Trial / Demo /Food Show',
       'INCIDENT_CATEGORY_DESCRIPTION_Unknown/Other',
       'INCIDENT_CATEGORY_DESCRIPTION_Upgrade/Downgrade'],
      dtype='object')

In [144]:
# Columns kept from the placement tickets: the key followed by the one-hot ticket columns
TicketsColumnsList = ['Serial ID', 'Service Category_Installation',
       'Service Category_Removal', 'Service Category_Replacement',
       'INCIDENT_CATEGORY_DESCRIPTION_Customer relocation',
       'INCIDENT_CATEGORY_DESCRIPTION_Exchange / Replacement Sales',
       'INCIDENT_CATEGORY_DESCRIPTION_Install/Data Fix',
       'INCIDENT_CATEGORY_DESCRIPTION_Key Account Test Installation',
       'INCIDENT_CATEGORY_DESCRIPTION_Maintenance & Repair',
       'INCIDENT_CATEGORY_DESCRIPTION_New Customer / Installation Point',
       'INCIDENT_CATEGORY_DESCRIPTION_Renew',
       'INCIDENT_CATEGORY_DESCRIPTION_Seasonal Installation',
       'INCIDENT_CATEGORY_DESCRIPTION_Seasonal Removal',
       'INCIDENT_CATEGORY_DESCRIPTION_Trial / Demo /Food Show',
       'INCIDENT_CATEGORY_DESCRIPTION_Unknown/Other',
       'INCIDENT_CATEGORY_DESCRIPTION_Upgrade/Downgrade']

# Sum the one-hot columns per machine. 'Serial ID' is the grouping key,
# so it is left out of the summed columns (TicketsColumnsList[1:]).
Placement_df_prep3 = Placement_df_prep2.groupby("Serial ID")[TicketsColumnsList[1:]].sum()

Placement_df_prep3.head()


Serial ID,Service Category_Installation,Service Category_Removal,Service Category_Replacement,INCIDENT_CATEGORY_DESCRIPTION_Customer relocation,INCIDENT_CATEGORY_DESCRIPTION_Exchange / Replacement Sales,INCIDENT_CATEGORY_DESCRIPTION_Install/Data Fix,INCIDENT_CATEGORY_DESCRIPTION_Key Account Test Installation,INCIDENT_CATEGORY_DESCRIPTION_Maintenance & Repair,INCIDENT_CATEGORY_DESCRIPTION_New Customer / Installation Point,INCIDENT_CATEGORY_DESCRIPTION_Renew,INCIDENT_CATEGORY_DESCRIPTION_Seasonal Installation,INCIDENT_CATEGORY_DESCRIPTION_Seasonal Removal,INCIDENT_CATEGORY_DESCRIPTION_Trial / Demo /Food Show,INCIDENT_CATEGORY_DESCRIPTION_Unknown/Other,INCIDENT_CATEGORY_DESCRIPTION_Upgrade/Downgrade
.102313088,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0
.4390764,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0
.6470438,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0
.8061053,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0
10000013,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
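
The encode-then-aggregate step can be seen end to end on a toy example. This is a minimal sketch with hypothetical tickets: one-hot encoding gives one indicator column per category, and summing the indicators per 'Serial ID' turns them into ticket counts, one row per machine.

```python
import pandas as pd

# Hypothetical tickets: machine s1 has two installation tickets, s2 has one replacement
tickets = pd.DataFrame({
    "Serial ID": ["s1", "s1", "s2"],
    "Service Category": ["Installation", "Installation", "Replacement"],
})

# One indicator column per category value
dummies = pd.get_dummies(tickets, columns=["Service Category"], dtype=int)

# Summing the indicators per machine turns them into ticket counts
counts = dummies.groupby("Serial ID").sum().reset_index()
print(counts)
```

This is why values above 1 can appear in the aggregated table: they count how many tickets of that type a machine has.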


In [145]:
# Specify the filename
filename = 'TicketsColumnsList.p'

# Combine the file path and filename
file_path_with_filename = os.path.join(file_path_output, filename)

# Save the list into a pickle file
with open(file_path_with_filename, 'wb') as file:
    pickle.dump(TicketsColumnsList, file)
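
For completeness, a sketch of how such a pickled list could be read back later, for example in a modelling notebook. The temporary folder and the shortened list are hypothetical stand-ins for `file_path_output` and `TicketsColumnsList`:

```python
import os
import pickle
import tempfile

# Shortened, hypothetical version of the column list
columns = ["Serial ID", "Service Category_Installation"]

# A temporary folder stands in for the notebook's file_path_output
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "TicketsColumnsList.p")

    # Save the list into a pickle file
    with open(path, "wb") as file:
        pickle.dump(columns, file)

    # Load it back and check that nothing changed
    with open(path, "rb") as file:
        restored = pickle.load(file)

print(restored == columns)
```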

In [146]:
Placement_df_prep3.columns

Index(['Service Category_Installation', 'Service Category_Removal',
       'Service Category_Replacement',
       'INCIDENT_CATEGORY_DESCRIPTION_Customer relocation',
       'INCIDENT_CATEGORY_DESCRIPTION_Exchange / Replacement Sales',
       'INCIDENT_CATEGORY_DESCRIPTION_Install/Data Fix',
       'INCIDENT_CATEGORY_DESCRIPTION_Key Account Test Installation',
       'INCIDENT_CATEGORY_DESCRIPTION_Maintenance & Repair',
       'INCIDENT_CATEGORY_DESCRIPTION_New Customer / Installation Point',
       'INCIDENT_CATEGORY_DESCRIPTION_Renew',
       'INCIDENT_CATEGORY_DESCRIPTION_Seasonal Installation',
       'INCIDENT_CATEGORY_DESCRIPTION_Seasonal Removal',
       'INCIDENT_CATEGORY_DESCRIPTION_Trial / Demo /Food Show',
       'INCIDENT_CATEGORY_DESCRIPTION_Unknown/Other',
       'INCIDENT_CATEGORY_DESCRIPTION_Upgrade/Downgrade'],
      dtype='object')

In [147]:
Placement_df_prep3.info()

<class 'pandas.core.frame.DataFrame'>
Index: 176602 entries, .102313088 to ZAR1222
Data columns (total 15 columns):
 #   Column                                                           Non-Null Count   Dtype
---  ------                                                           --------------   -----
 0   Service Category_Installation                                    176602 non-null  uint8
 1   Service Category_Removal                                         176602 non-null  uint8
 2   Service Category_Replacement                                     176602 non-null  uint8
 3   INCIDENT_CATEGORY_DESCRIPTION_Customer relocation                176602 non-null  uint8
 4   INCIDENT_CATEGORY_DESCRIPTION_Exchange / Replacement Sales       176602 non-null  uint8
 5   INCIDENT_CATEGORY_DESCRIPTION_Install/Data Fix                   176602 non-null  uint8
 6   INCIDENT_CATEGORY_DESCRIPTION_Key Account Test Installation      176602 non-null  uint8
 7   INCIDENT_CATEGORY_DESCRIPTION_Maintenance 

In [148]:
Placement_df_prep5 = Placement_df_prep3.reset_index()

In [149]:
Placement_df_prep5.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 176602 entries, 0 to 176601
Data columns (total 16 columns):
 #   Column                                                           Non-Null Count   Dtype 
---  ------                                                           --------------   ----- 
 0   Serial ID                                                        176602 non-null  object
 1   Service Category_Installation                                    176602 non-null  uint8 
 2   Service Category_Removal                                         176602 non-null  uint8 
 3   Service Category_Replacement                                     176602 non-null  uint8 
 4   INCIDENT_CATEGORY_DESCRIPTION_Customer relocation                176602 non-null  uint8 
 5   INCIDENT_CATEGORY_DESCRIPTION_Exchange / Replacement Sales       176602 non-null  uint8 
 6   INCIDENT_CATEGORY_DESCRIPTION_Install/Data Fix                   176602 non-null  uint8 
 7   INCIDENT_CATEGORY_DESCRIPTION_Key Acco

In [150]:
BeverageMachine7_df.columns

Index(['Serial ID', 'Sales Organisation', 'Machine Status Groupings',
       'User Status', 'TA Contract Installation Date', 'Depreciation Start',
       'Manufacturer Number', 'Position', 'TA Contract Start Date',
       'TA Contract End Date', 'TA Usage Indicator', 'Account ID', 'EC ID',
       'EC Name', 'Account ABC Classification (Account ID)',
       'Industry (Account ID)', 'Industry Code 1 (Account ID)',
       'Account ABC Classification (EC ID)', 'Industry (EC ID)',
       'Industry Code 1 (EC ID)', 'Parent Installation Point ID',
       'Registered Product Category (Registered Product ID)', 'Model',
       'Model Vendor', 'Model Category', 'Model Group', 'Beverage Temperature',
       'System Brands', 'Ingredient Format', 'Machine Type', 'Positionning',
       'Generation', 'Blueprint Throughput', 'IP Ownership', 'Calendar Date',
       'Key_ManufacturerID_SalesOrg', 'City', 'State', 'Postal Code', 'Churn'],
      dtype='object')

In [151]:
Placement_df_prep5.loc[Placement_df_prep5['Serial ID']=='#']

Unnamed: 0,Serial ID,Service Category_Installation,Service Category_Removal,Service Category_Replacement,INCIDENT_CATEGORY_DESCRIPTION_Customer relocation,INCIDENT_CATEGORY_DESCRIPTION_Exchange / Replacement Sales,INCIDENT_CATEGORY_DESCRIPTION_Install/Data Fix,INCIDENT_CATEGORY_DESCRIPTION_Key Account Test Installation,INCIDENT_CATEGORY_DESCRIPTION_Maintenance & Repair,INCIDENT_CATEGORY_DESCRIPTION_New Customer / Installation Point,INCIDENT_CATEGORY_DESCRIPTION_Renew,INCIDENT_CATEGORY_DESCRIPTION_Seasonal Installation,INCIDENT_CATEGORY_DESCRIPTION_Seasonal Removal,INCIDENT_CATEGORY_DESCRIPTION_Trial / Demo /Food Show,INCIDENT_CATEGORY_DESCRIPTION_Unknown/Other,INCIDENT_CATEGORY_DESCRIPTION_Upgrade/Downgrade


Remove Placement tickets with 'Serial ID' == '#'

In [152]:
Placement_df_prep6 = Placement_df_prep5.loc[Placement_df_prep5['Serial ID']!='#']

In [153]:
Placement_df_prep6.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 176602 entries, 0 to 176601
Data columns (total 16 columns):
 #   Column                                                           Non-Null Count   Dtype 
---  ------                                                           --------------   ----- 
 0   Serial ID                                                        176602 non-null  object
 1   Service Category_Installation                                    176602 non-null  uint8 
 2   Service Category_Removal                                         176602 non-null  uint8 
 3   Service Category_Replacement                                     176602 non-null  uint8 
 4   INCIDENT_CATEGORY_DESCRIPTION_Customer relocation                176602 non-null  uint8 
 5   INCIDENT_CATEGORY_DESCRIPTION_Exchange / Replacement Sales       176602 non-null  uint8 
 6   INCIDENT_CATEGORY_DESCRIPTION_Install/Data Fix                   176602 non-null  uint8 
 7   INCIDENT_CATEGORY_DESCRIPTION_Key Acco

In [154]:
#Placement_df_prep6['Serial ID'] = Placement_df_prep6['Serial ID'].astype('str')

In [155]:
Placement_df_prep6 = Placement_df_prep6.reset_index()

Placement_df_prep6=Placement_df_prep6.drop(columns=['index'])
Placement_df_prep6

Unnamed: 0,Serial ID,Service Category_Installation,Service Category_Removal,Service Category_Replacement,INCIDENT_CATEGORY_DESCRIPTION_Customer relocation,INCIDENT_CATEGORY_DESCRIPTION_Exchange / Replacement Sales,INCIDENT_CATEGORY_DESCRIPTION_Install/Data Fix,INCIDENT_CATEGORY_DESCRIPTION_Key Account Test Installation,INCIDENT_CATEGORY_DESCRIPTION_Maintenance & Repair,INCIDENT_CATEGORY_DESCRIPTION_New Customer / Installation Point,INCIDENT_CATEGORY_DESCRIPTION_Renew,INCIDENT_CATEGORY_DESCRIPTION_Seasonal Installation,INCIDENT_CATEGORY_DESCRIPTION_Seasonal Removal,INCIDENT_CATEGORY_DESCRIPTION_Trial / Demo /Food Show,INCIDENT_CATEGORY_DESCRIPTION_Unknown/Other,INCIDENT_CATEGORY_DESCRIPTION_Upgrade/Downgrade
0,.102313088,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0
1,.4390764,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0
2,.6470438,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0
3,.8061053,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0
4,10000013,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
176597,ZAB2022048,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0
176598,ZAB2022049,4,0,0,0,0,0,0,0,2,0,0,0,2,0,0
176599,ZAB2022050,2,0,0,0,0,0,0,0,2,0,0,0,0,0,0
176600,ZAG0054,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0


Next, I link the Beverage Machine data with the Placement Ticket data.

In [156]:
BeverageMachine7_wTickets_df = pd.merge(BeverageMachine7_df, Placement_df_prep6, how='left', left_on = ['Serial ID'], right_on = ['Serial ID'])

f=BeverageMachine7_wTickets_df.loc[BeverageMachine7_wTickets_df['Serial ID']=='7010054129']
f.iloc[:20,20:50]
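
To check how many machines actually found a ticket row, the same kind of left join can be run with the `indicator` and `validate` options of `pd.merge`. This is only a minimal sketch on toy frames: `machines` and `tickets` are hypothetical stand-ins for `BeverageMachine7_df` and `Placement_df_prep6`, and the values are made up.

```python
import pandas as pd

# Toy stand-ins for the machine table and the aggregated ticket table (hypothetical data)
machines = pd.DataFrame({"Serial ID": ["A1", "B2", "C3"], "Churn": [0, 1, 0]})
tickets = pd.DataFrame({"Serial ID": ["A1", "C3"],
                        "Service Category_Installation": [1, 2]})

# validate="m:1" raises if a Serial ID appears more than once in the ticket table;
# indicator=True adds a "_merge" column telling where each row came from
merged = pd.merge(machines, tickets, how="left", on="Serial ID",
                  validate="m:1", indicator=True)

# Number of machines with a matching ticket row
n_matched = int((merged["_merge"] == "both").sum())
print(n_matched)
```

Counting the `both` rows gives the number of machines with tickets without having to inspect the NaN pattern by eye.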

In [157]:
BeverageMachine7_wTickets_df

Unnamed: 0,Serial ID,Sales Organisation,Machine Status Groupings,User Status,TA Contract Installation Date,Depreciation Start,Manufacturer Number,Position,TA Contract Start Date,TA Contract End Date,...,INCIDENT_CATEGORY_DESCRIPTION_Install/Data Fix,INCIDENT_CATEGORY_DESCRIPTION_Key Account Test Installation,INCIDENT_CATEGORY_DESCRIPTION_Maintenance & Repair,INCIDENT_CATEGORY_DESCRIPTION_New Customer / Installation Point,INCIDENT_CATEGORY_DESCRIPTION_Renew,INCIDENT_CATEGORY_DESCRIPTION_Seasonal Installation,INCIDENT_CATEGORY_DESCRIPTION_Seasonal Removal,INCIDENT_CATEGORY_DESCRIPTION_Trial / Demo /Food Show,INCIDENT_CATEGORY_DESCRIPTION_Unknown/Other,INCIDENT_CATEGORY_DESCRIPTION_Upgrade/Downgrade
0,132020210,Nestlé Austria,Deployed,Installed,44347,41609,132020210,RENT,44347,46173,...,,,,,,,,,,
1,19O0046537,Nestlé PH,Deployed,Installed,0,44621,T479748,#,0,0,...,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0
2,23O0024982,Nestlé PH,Deployed,Installed,0,45200,3142073491,#,0,0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
3,HK10020559,Nestle Hong Kong,Deployed,Installed,0,44197,20202922905,#,0,0,...,,,,,,,,,,
4,4410837,NP-Hungary,Deployed,Installed,0,42186,4410837,#,0,0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
270759,7010049071,Pakistan,Deployed,In Repair,0,42886,7010049071,#,0,0,...,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0
270760,7010042493,Pakistan,Deployed,In Repair,0,42309,#,#,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
270761,7010059106,Pakistan,Deployed,In Repair,0,43800,7010059106,#,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
270762,7010053961,Pakistan,Deployed,In Repair,0,43101,#,#,0,0,...,,,,,,,,,,


In [158]:
BeverageMachine7_wTickets_df=BeverageMachine7_wTickets_df.fillna(0)
BeverageMachine7_wTickets_df.head()

Unnamed: 0,Serial ID,Sales Organisation,Machine Status Groupings,User Status,TA Contract Installation Date,Depreciation Start,Manufacturer Number,Position,TA Contract Start Date,TA Contract End Date,...,INCIDENT_CATEGORY_DESCRIPTION_Install/Data Fix,INCIDENT_CATEGORY_DESCRIPTION_Key Account Test Installation,INCIDENT_CATEGORY_DESCRIPTION_Maintenance & Repair,INCIDENT_CATEGORY_DESCRIPTION_New Customer / Installation Point,INCIDENT_CATEGORY_DESCRIPTION_Renew,INCIDENT_CATEGORY_DESCRIPTION_Seasonal Installation,INCIDENT_CATEGORY_DESCRIPTION_Seasonal Removal,INCIDENT_CATEGORY_DESCRIPTION_Trial / Demo /Food Show,INCIDENT_CATEGORY_DESCRIPTION_Unknown/Other,INCIDENT_CATEGORY_DESCRIPTION_Upgrade/Downgrade
0,132020210,Nestlé Austria,Deployed,Installed,44347,41609,132020210,RENT,44347,46173,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,19O0046537,Nestlé PH,Deployed,Installed,0,44621,T479748,#,0,0,...,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0
2,23O0024982,Nestlé PH,Deployed,Installed,0,45200,3142073491,#,0,0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
3,HK10020559,Nestle Hong Kong,Deployed,Installed,0,44197,20202922905,#,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,4410837,NP-Hungary,Deployed,Installed,0,42186,4410837,#,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Even though only around 2000 machines have tickets, BeverageMachine7_wTickets_df can still be used, and we will see whether it improves the model.
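
Note that `fillna(0)` on the whole frame also replaces missing values in every non-ticket column. If that is not wanted, the fill can be restricted to the ticket count columns. A minimal sketch, assuming the ticket columns can be recognised by the two prefixes visible in the output above (the toy frame and its `City` values are made up):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Serial ID": ["A1", "B2"],
    "City": ["Vienna", np.nan],                      # missing attribute, unrelated to tickets
    "Service Category_Installation": [1.0, np.nan],  # NaN = no ticket matched
    "INCIDENT_CATEGORY_DESCRIPTION_Renew": [0.0, np.nan],
})

# Select the ticket count columns by their prefixes
ticket_cols = [c for c in df.columns
               if c.startswith(("Service Category_", "INCIDENT_CATEGORY_DESCRIPTION_"))]

# Fill and cast only those columns; the other columns keep their NaN
df[ticket_cols] = df[ticket_cols].fillna(0).astype("int64")
```

With this variant a machine without tickets gets zero counts, while a missing city or date stays visibly missing.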

In [159]:
SO_Tickets =['prstzr pnstrpzcp ztd', 'prstzr nk', 'prstzr prw zrpzppd', 'ppkcstpp']

BeverageMachine7_wTicketsOnly_df = Placement_df_prep6

BeverageMachine7_wTicketsOnly_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 176602 entries, 0 to 176601
Data columns (total 16 columns):
 #   Column                                                           Non-Null Count   Dtype 
---  ------                                                           --------------   ----- 
 0   Serial ID                                                        176602 non-null  object
 1   Service Category_Installation                                    176602 non-null  uint8 
 2   Service Category_Removal                                         176602 non-null  uint8 
 3   Service Category_Replacement                                     176602 non-null  uint8 
 4   INCIDENT_CATEGORY_DESCRIPTION_Customer relocation                176602 non-null  uint8 
 5   INCIDENT_CATEGORY_DESCRIPTION_Exchange / Replacement Sales       176602 non-null  uint8 
 6   INCIDENT_CATEGORY_DESCRIPTION_Install/Data Fix                   176602 non-null  uint8 
 7   INCIDENT_CATEGORY_DESCRIPTION_Key Acco

#### Telemetry data preparation

Let's see what we can get using only the machines that have Telemetry data.

In [160]:
BeverageMachine7_wTelemetry = pd.merge(BeverageMachine7_df, Telemetry_aggSales, how='inner', left_on = ['Manufacturer Number'], right_on = ['serial'])
BeverageMachine7_wTelemetry.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 23309 entries, 0 to 23308
Data columns (total 41 columns):
 #   Column                                               Non-Null Count  Dtype         
---  ------                                               --------------  -----         
 0   Serial ID                                            23309 non-null  object        
 1   Sales Organisation                                   23309 non-null  object        
 2   Machine Status Groupings                             23309 non-null  object        
 3   User Status                                          23309 non-null  object        
 4   TA Contract Installation Date                        23309 non-null  int32         
 5   Depreciation Start                                   23309 non-null  int32         
 6   Manufacturer Number                                  23309 non-null  object        
 7   Position                                             23309 non-null  object        
 

I only have 218 machines matching a Telemetry Kit. This is clearly not enough to train a Machine Learning model to predict churn.
We should at least combine it with the Beverage Machine data if we want to use it.
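
The row count reported by `info()` and the number of distinct machines can differ if the telemetry table holds several rows per serial. The sketch below shows how both figures could be checked. It uses toy data, the `cups` column is hypothetical, and whether `Telemetry_aggSales` really has several rows per serial is an assumption.

```python
import pandas as pd

machines = pd.DataFrame({"Serial ID": ["S1", "S2", "S3"],
                         "Manufacturer Number": ["M1", "M2", "#"]})
# One serial can appear on several telemetry rows (assumption for this example)
telemetry = pd.DataFrame({"serial": ["M1", "M1", "M1", "M9"],
                          "cups": [10, 12, 8, 5]})

joined = pd.merge(machines, telemetry, how="inner",
                  left_on="Manufacturer Number", right_on="serial")

n_rows = len(joined)                         # rows after the join
n_machines = joined["Serial ID"].nunique()   # distinct machines with telemetry
print(n_rows, n_machines)
```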

#### Visits data preparation

In [161]:
Visitsdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 206181 entries, 0 to 206180
Data columns (total 7 columns):
 #   Column                              Non-Null Count   Dtype 
---  ------                              --------------   ----- 
 0   VISIT_TYPE_DESCRIPTION              152314 non-null  object
 1   End Date in Local Time Zone         206181 non-null  object
 2   Result                              14112 non-null   object
 3   SALESORG                            206181 non-null  object
 4   Activity Life Cycle Status          206181 non-null  object
 5   Visit                               206181 non-null  object
 6   Account ID.Account ID Level 01.Key  171740 non-null  object
dtypes: object(7)
memory usage: 11.0+ MB


In [162]:
Visitsdf.columns

Index(['VISIT_TYPE_DESCRIPTION', 'End Date in Local Time Zone', 'Result',
       'SALESORG', 'Activity Life Cycle Status', 'Visit',
       'Account ID.Account ID Level 01.Key'],
      dtype='object')

In [163]:
Visitsdf1 = Visitsdf[['End Date in Local Time Zone', 'Result', 'Activity Life Cycle Status', 'Visit', 'Account ID.Account ID Level 01.Key']]

Remove visits with no account ID

In [164]:
Visitsdf1 = Visitsdf1.loc[Visitsdf1['Account ID.Account ID Level 01.Key']!="#"]
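
The comparison with `"#"` removes the placeholder but keeps rows whose account ID is missing (NaN), because `NaN != "#"` evaluates to `True`. Those rows only disappear later, since `groupby` drops NaN keys by default. A small sketch with made-up values showing how both cases could be filtered at once:

```python
import pandas as pd
import numpy as np

key = "Account ID.Account ID Level 01.Key"
visits = pd.DataFrame({key: ["1310064", "#", np.nan, "2374064"],
                       "Visit": ["2990017", "2910068", "2768575", "2785408"]})

# The plain comparison keeps the row with a missing key
only_hash_removed = visits.loc[visits[key] != "#"]

# Combining both conditions drops the placeholder and the missing keys
clean = visits.loc[visits[key].notna() & (visits[key] != "#")]
print(len(only_hash_removed), len(clean))
```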


In [165]:
Visitsdf1['Result'] = Visitsdf1['Result'].fillna('Not assigned')


I do not have the Sales Org ID in TA, and I think that Account IDs are unique, so I am not building the key "KeySOAccID" yet.

In [166]:
#Visitsdf1['KeySOAccID'] = Visitsdf1['Sales Organization'] + Visitsdf1['Account ID.Account ID Level 01.Key'].map(str) 

In [167]:
def preprocess_visits(df):
    # Work on a copy
    df = df.copy()

    nomi_vars = ['Result', 'Activity Life Cycle Status']
    
    dummy_columns = nomi_vars
        
    df = pd.get_dummies(df, columns=dummy_columns)

    return df

Visitsdf_prep = preprocess_visits(Visitsdf1)
Visitsdf_prep.head()

Unnamed: 0,End Date in Local Time Zone,Visit,Account ID.Account ID Level 01.Key,Result_Incomplete Selling Call,Result_Not assigned,Result_Objective Met,Result_Objective Partially Met,Result_Requires Further Follow-up,Result_Unsuccessful Selling Call,Activity Life Cycle Status_Completed
0,2024-10-15,2990017,1310064,0,0,1,0,0,0,1
1,2024-07-08,2910068,2374064,0,1,0,0,0,0,1
2,2024-01-17,2768575,3791510,0,1,0,0,0,0,1
3,2024-02-06,2785408,3791510,0,1,0,0,0,0,1
4,2024-08-27,2952812,8995134,0,1,0,0,0,0,1


Aggregate the columns per Account ID and keep the last visit date

In [168]:
Visitsdf_prep.columns

Index(['End Date in Local Time Zone', 'Visit',
       'Account ID.Account ID Level 01.Key', 'Result_Incomplete Selling Call',
       'Result_Not assigned', 'Result_Objective Met',
       'Result_Objective Partially Met', 'Result_Requires Further Follow-up',
       'Result_Unsuccessful Selling Call',
       'Activity Life Cycle Status_Completed'],
      dtype='object')

In [169]:
Visitsdf_prep.iloc[0]['End Date in Local Time Zone']

datetime.date(2024, 10, 15)

In [170]:
Visitsdf_prep['End Date in Local Time Zone'] = Visitsdf_prep['End Date in Local Time Zone'].apply(str)


Visitsdf_prep.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 206181 entries, 0 to 206180
Data columns (total 10 columns):
 #   Column                                Non-Null Count   Dtype 
---  ------                                --------------   ----- 
 0   End Date in Local Time Zone           206181 non-null  object
 1   Visit                                 206181 non-null  object
 2   Account ID.Account ID Level 01.Key    171740 non-null  object
 3   Result_Incomplete Selling Call        206181 non-null  uint8 
 4   Result_Not assigned                   206181 non-null  uint8 
 5   Result_Objective Met                  206181 non-null  uint8 
 6   Result_Objective Partially Met        206181 non-null  uint8 
 7   Result_Requires Further Follow-up     206181 non-null  uint8 
 8   Result_Unsuccessful Selling Call      206181 non-null  uint8 
 9   Activity Life Cycle Status_Completed  206181 non-null  uint8 
dtypes: object(3), uint8(7)
memory usage: 7.7+ MB


In [171]:
#pip install dateparser

In [172]:
#import dateparser

Visitsdf_prep2 =Visitsdf_prep.copy()
Visitsdf_prep2['End Date in Local Time Zone'] = pd.to_datetime(Visitsdf_prep2['End Date in Local Time Zone'])
#Visitsdf_prep2['End Date in Local Time Zone'] = Visitsdf_prep2['End Date in Local Time Zone'].apply(lambda x: dateparser.parse(x))

In [173]:
Visitsdf_prep2

Unnamed: 0,End Date in Local Time Zone,Visit,Account ID.Account ID Level 01.Key,Result_Incomplete Selling Call,Result_Not assigned,Result_Objective Met,Result_Objective Partially Met,Result_Requires Further Follow-up,Result_Unsuccessful Selling Call,Activity Life Cycle Status_Completed
0,2024-10-15,2990017,1310064,0,0,1,0,0,0,1
1,2024-07-08,2910068,2374064,0,1,0,0,0,0,1
2,2024-01-17,2768575,3791510,0,1,0,0,0,0,1
3,2024-02-06,2785408,3791510,0,1,0,0,0,0,1
4,2024-08-27,2952812,8995134,0,1,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...
206176,2024-04-29,2854383,,0,1,0,0,0,0,1
206177,2024-12-09,3038271,,0,1,0,0,0,0,1
206178,2024-04-23,2848127,,0,1,0,0,0,0,1
206179,2024-02-22,2801026,,0,1,0,0,0,0,1


pd.to_datetime(Visitsdf_prep2['End Date in Local Time Zone'])

In [174]:
Visitsdf_prep2.sort_values('End Date in Local Time Zone', ascending = True)

Unnamed: 0,End Date in Local Time Zone,Visit,Account ID.Account ID Level 01.Key,Result_Incomplete Selling Call,Result_Not assigned,Result_Objective Met,Result_Objective Partially Met,Result_Requires Further Follow-up,Result_Unsuccessful Selling Call,Activity Life Cycle Status_Completed
111986,2019-04-26,2895029,1094391,0,1,0,0,0,0,1
163661,2019-04-30,2895025,1094391,0,1,0,0,0,0,1
163660,2019-04-30,2895026,1094391,0,1,0,0,0,0,1
8854,2019-04-30,2895027,1094391,0,1,0,0,0,0,1
198531,2020-08-05,2759749,1822439,0,1,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...
115420,2025-12-16,3046441,8428488,0,1,0,0,0,0,1
195419,2025-12-16,3046432,9283864,0,1,0,0,0,0,1
114267,2025-12-16,3046420,9070260,0,1,0,0,0,0,1
189510,2029-04-29,2854625,5629131,0,1,0,0,0,0,1


In [175]:
Visitsdf_prep2 = Visitsdf_prep2.sort_values('End Date in Local Time Zone')
Visitsdf_prep2.head()

Unnamed: 0,End Date in Local Time Zone,Visit,Account ID.Account ID Level 01.Key,Result_Incomplete Selling Call,Result_Not assigned,Result_Objective Met,Result_Objective Partially Met,Result_Requires Further Follow-up,Result_Unsuccessful Selling Call,Activity Life Cycle Status_Completed
111986,2019-04-26,2895029,1094391,0,1,0,0,0,0,1
163661,2019-04-30,2895025,1094391,0,1,0,0,0,0,1
163660,2019-04-30,2895026,1094391,0,1,0,0,0,0,1
8854,2019-04-30,2895027,1094391,0,1,0,0,0,0,1
198531,2020-08-05,2759749,1822439,0,1,0,0,0,0,1


In [176]:
Visitsdf_prep3 = (Visitsdf_prep2.sort_values('End Date in Local Time Zone')
    .groupby(["Account ID.Account ID Level 01.Key"])
                      .agg({
        'End Date in Local Time Zone': lambda s: s.values[-1],
        'Result_Incomplete Selling Call' : 'sum',
        'Result_Not assigned' : 'sum', 
        'Result_Objective Met' : 'sum',
       'Result_Objective Partially Met' : 'sum', 'Result_Requires Further Follow-up' : 'sum',
       'Result_Unsuccessful Selling Call' : 'sum',
       #'Activity Life Cycle Status_Canceled' : 'sum',
       'Activity Life Cycle Status_Completed' : 'sum',
       #'Activity Life Cycle Status_In Process' : 'sum',
       #'Activity Life Cycle Status_Open' : 'sum'
    })
)
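
Because the rows are sorted by date before grouping, taking the last value per group gives the latest date. An alternative that does not depend on row order is to aggregate the date with `max`. A minimal sketch using named aggregation on toy data (the column names `Acc_ID`, `End Date`, `Completed`, `last_visit` and `n_completed` are chosen for this example only):

```python
import pandas as pd

visits = pd.DataFrame({
    "Acc_ID": ["A", "A", "B"],
    "End Date": pd.to_datetime(["2024-03-01", "2024-01-15", "2024-02-10"]),
    "Completed": [1, 1, 1],
})

summary = (
    visits.groupby("Acc_ID", as_index=False)
          .agg(last_visit=("End Date", "max"),     # latest date, independent of row order
               n_completed=("Completed", "sum"))   # number of completed visits
)
print(summary)
```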

In [177]:
Visitsdf_prep3.head()

Account ID.Account ID Level 01.Key,End Date in Local Time Zone,Result_Incomplete Selling Call,Result_Not assigned,Result_Objective Met,Result_Objective Partially Met,Result_Requires Further Follow-up,Result_Unsuccessful Selling Call,Activity Life Cycle Status_Completed
1000058,2025-01-28,0,6,0,0,0,0,6
1000096,2024-04-10,0,2,0,0,0,0,2
1000216,2024-07-04,0,14,0,0,0,0,14
1000222,2024-07-17,0,2,0,0,0,0,2
1000256,2024-08-21,0,6,0,0,0,0,6


In [178]:
Visitsdf_prep4 = Visitsdf_prep3.copy()
Visitsdf_prep4.reset_index(inplace=True)

In [179]:
Visitsdf_prep4['Last_visit_diff_months'] = ChurnDate2 - Visitsdf_prep4['End Date in Local Time Zone']

Visitsdf_prep4['Last_visit_diff_months'] = Visitsdf_prep4['Last_visit_diff_months']/np.timedelta64(1,'M')
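
Dividing by `np.timedelta64(1,'M')` relies on an average month length, and newer pandas versions may reject the month unit as ambiguous. A version-independent sketch is to divide the number of days by the average Gregorian month length. The reference date below is an assumed value used only for illustration, not necessarily the notebook's `ChurnDate2`.

```python
import pandas as pd

# Assumed reference date, for illustration only
reference_date = pd.Timestamp("2025-01-31")
last_visit = pd.Series(pd.to_datetime(["2025-01-28", "2024-04-10"]))

# Average Gregorian month length in days (365.2425 / 12)
DAYS_PER_MONTH = 365.2425 / 12

# .dt.days gives whole days; dividing by the average month length gives fractional months
diff_months = (reference_date - last_visit).dt.days / DAYS_PER_MONTH
print(diff_months.round(6).tolist())
```

A negative result means the last visit lies after the reference date.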

In [180]:
Visitsdf_prep4.head()

Unnamed: 0,Account ID.Account ID Level 01.Key,End Date in Local Time Zone,Result_Incomplete Selling Call,Result_Not assigned,Result_Objective Met,Result_Objective Partially Met,Result_Requires Further Follow-up,Result_Unsuccessful Selling Call,Activity Life Cycle Status_Completed,Last_visit_diff_months
0,1000058,2025-01-28,0,6,0,0,0,0,6,0.098565
1,1000096,2024-04-10,0,2,0,0,0,0,2,9.725046
2,1000216,2024-07-04,0,14,0,0,0,0,14,6.932381
3,1000222,2024-07-17,0,2,0,0,0,0,2,6.505267
4,1000256,2024-08-21,0,6,0,0,0,0,6,5.355346


In [181]:
Visitsdf_wVisits = Visitsdf_prep4.copy()
Visitsdf_wVisits.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54873 entries, 0 to 54872
Data columns (total 10 columns):
 #   Column                                Non-Null Count  Dtype         
---  ------                                --------------  -----         
 0   Account ID.Account ID Level 01.Key    54873 non-null  object        
 1   End Date in Local Time Zone           54873 non-null  datetime64[ns]
 2   Result_Incomplete Selling Call        54873 non-null  uint8         
 3   Result_Not assigned                   54873 non-null  uint64        
 4   Result_Objective Met                  54873 non-null  uint8         
 5   Result_Objective Partially Met        54873 non-null  uint8         
 6   Result_Requires Further Follow-up     54873 non-null  uint8         
 7   Result_Unsuccessful Selling Call      54873 non-null  uint8         
 8   Activity Life Cycle Status_Completed  54873 non-null  uint64        
 9   Last_visit_diff_months                54873 non-null  float64       
dty

df['Reported_Date'] = pd.to_datetime(df['Reported_Date'], format='%m/%d/%Y')
df['Process Date'] = pd.to_datetime(df['Process Date'], format='%m/%d/%Y')

df = (
    df
    .sort_values('Process Date')
    .groupby('ID', as_index=False)
    .agg({
        'Total': 'sum',
        'Process Date': lambda s: s.values[-1]
    })
)

'Activity Owner', 'Visit Description', 'Sales Unit (Hierarchy)' might be useful, but with one-hot encoding I would have too many columns.

In [182]:
Visitsdf_wVisits2 = Visitsdf_wVisits.reset_index() 
Visitsdf_wVisits = Visitsdf_wVisits2.rename(columns={"Account ID.Account ID Level 01.Key":"Acc_ID"})
Visitsdf_wVisits

Unnamed: 0,index,Acc_ID,End Date in Local Time Zone,Result_Incomplete Selling Call,Result_Not assigned,Result_Objective Met,Result_Objective Partially Met,Result_Requires Further Follow-up,Result_Unsuccessful Selling Call,Activity Life Cycle Status_Completed,Last_visit_diff_months
0,0,1000058,2025-01-28,0,6,0,0,0,0,6,0.098565
1,1,1000096,2024-04-10,0,2,0,0,0,0,2,9.725046
2,2,1000216,2024-07-04,0,14,0,0,0,0,14,6.932381
3,3,1000222,2024-07-17,0,2,0,0,0,0,2,6.505267
4,4,1000256,2024-08-21,0,6,0,0,0,0,6,5.355346
...,...,...,...,...,...,...,...,...,...,...,...
54868,54868,9886445,2025-02-14,0,1,0,0,0,0,1,-0.459968
54869,54869,9886449,2025-02-13,0,1,0,0,0,0,1,-0.427113
54870,54870,9886451,2025-02-13,0,1,0,0,0,0,1,-0.427113
54871,54871,9886654,2025-02-13,0,1,0,0,0,0,1,-0.427113


left2 = pd.DataFrame(
    {"A": ["A0", "A1", "A2", "C1"], "B": ["B0", "B1", "B2", "B2"], "K1": [1938031, 1938031, 2, 3]}, index=["K0", "K1", "K2", "K2"]
)
left2

In [183]:
Visitsdf_wVisits.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54873 entries, 0 to 54872
Data columns (total 11 columns):
 #   Column                                Non-Null Count  Dtype         
---  ------                                --------------  -----         
 0   index                                 54873 non-null  int64         
 1   Acc_ID                                54873 non-null  object        
 2   End Date in Local Time Zone           54873 non-null  datetime64[ns]
 3   Result_Incomplete Selling Call        54873 non-null  uint8         
 4   Result_Not assigned                   54873 non-null  uint64        
 5   Result_Objective Met                  54873 non-null  uint8         
 6   Result_Objective Partially Met        54873 non-null  uint8         
 7   Result_Requires Further Follow-up     54873 non-null  uint8         
 8   Result_Unsuccessful Selling Call      54873 non-null  uint8         
 9   Activity Life Cycle Status_Completed  54873 non-null  uint64        
 10

In [184]:

Visitsdf_wVisits['Acc_ID'] = Visitsdf_wVisits['Acc_ID'].astype(str)
Visitsdf_wVisits['#Visits completed'] = Visitsdf_wVisits['Activity Life Cycle Status_Completed']
Visitsdf_wVisits = Visitsdf_wVisits.drop(['Activity Life Cycle Status_Completed'], axis = 1)
Visitsdf_wVisits.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54873 entries, 0 to 54872
Data columns (total 11 columns):
 #   Column                             Non-Null Count  Dtype         
---  ------                             --------------  -----         
 0   index                              54873 non-null  int64         
 1   Acc_ID                             54873 non-null  object        
 2   End Date in Local Time Zone        54873 non-null  datetime64[ns]
 3   Result_Incomplete Selling Call     54873 non-null  uint8         
 4   Result_Not assigned                54873 non-null  uint64        
 5   Result_Objective Met               54873 non-null  uint8         
 6   Result_Objective Partially Met     54873 non-null  uint8         
 7   Result_Requires Further Follow-up  54873 non-null  uint8         
 8   Result_Unsuccessful Selling Call   54873 non-null  uint8         
 9   Last_visit_diff_months             54873 non-null  float64       
 10  #Visits completed                 

In [185]:
a=Visitsdf_wVisits.loc[Visitsdf_wVisits['Acc_ID']=='1938031']
a

Unnamed: 0,index,Acc_ID,End Date in Local Time Zone,Result_Incomplete Selling Call,Result_Not assigned,Result_Objective Met,Result_Objective Partially Met,Result_Requires Further Follow-up,Result_Unsuccessful Selling Call,Last_visit_diff_months,#Visits completed


result = pd.merge(left2, Visitsdf_wVisits, how='left', left_on = ['K1'], right_on = ['Acc_ID']) 
result
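
In the test merge above, `K1` holds integers while `Acc_ID` was cast to `str`. pandas typically refuses to merge an integer key with a string key, so the key types need to be aligned first. A minimal sketch on toy frames (`n_visits` is a hypothetical column):

```python
import pandas as pd

left = pd.DataFrame({"A": ["A0", "A1"], "K1": [1938031, 2]})            # integer key
right = pd.DataFrame({"Acc_ID": ["1938031", "3"], "n_visits": [4, 1]})  # string key

# Align the key dtype before merging
left["K1"] = left["K1"].astype(str)

result = pd.merge(left, right, how="left", left_on="K1", right_on="Acc_ID")
print(result)
```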

#### Phone calls data preparation

In [186]:
PhoneCallsdf.head()

Unnamed: 0,ACTIVITY_NAME,Account Name,ACTIVITY_OWNER,Activity Life Cycle Status,PHONE_CALL_ID,OBJECTIVE_PHONE_CALL,SALES_ORGANIZATION,End Date in Local Time Zone,START_DATE_IN_LOCAL_TIME_ZONE,PERIODEND,EE
0,2024-05-21- Malá Itálie pizzerie s.r.o,9223247,Milan Svoboda,Completed,1327588,,CZ11,2024-05-20,2024-05-20,2024 - 05,4826
1,2025-02-06- Hotel Call 3,9870085,Faruq Khan,Completed,1448411,,IN14,2025-02-07,2025-02-07,2025 - 02,12496
2,2025-02-13- Education academy Call 1,9885185,Gaurav Mishra,Completed,1451304,,IN14,2025-02-14,2025-02-14,2025 - 02,11273
3,2025-02-13- Das Call 1,9884739,Sandeep Gupta,Completed,1451143,,IN14,2025-02-13,2025-02-13,2025 - 02,16062
4,2025-02-06- New satyanarayan mithai bhandar Ca...,9871856,Akash Chawariya,Completed,1448810,,IN14,2025-02-08,2025-02-08,2025 - 02,10608


In [187]:
PhoneCallsdf.columns

Index(['ACTIVITY_NAME', 'Account Name', 'ACTIVITY_OWNER',
       'Activity Life Cycle Status', 'PHONE_CALL_ID', 'OBJECTIVE_PHONE_CALL',
       'SALES_ORGANIZATION', 'End Date in Local Time Zone',
       'START_DATE_IN_LOCAL_TIME_ZONE', 'PERIODEND', 'EE'],
      dtype='object')

'Activity Owner',
 'Objective (Phone Call)' -> too much free text and too many reasons
 'Phone Call ID' -> not needed

In [188]:
PhoneCallsdf1 = PhoneCallsdf[['Account Name', 'Activity Life Cycle Status', 'End Date in Local Time Zone']]

In [189]:
def preprocess_calls(df):
    # Work on a copy
    df = df.copy()

    nomi_vars = ['Activity Life Cycle Status']
    
    dummy_columns = nomi_vars
        
    df = pd.get_dummies(df, columns=dummy_columns)

    return df

PhoneCallsdf_prep = preprocess_calls(PhoneCallsdf1)
PhoneCallsdf_prep.head()

Unnamed: 0,Account Name,End Date in Local Time Zone,Activity Life Cycle Status_Canceled,Activity Life Cycle Status_Completed,Activity Life Cycle Status_In Process,Activity Life Cycle Status_Open
0,9223247,2024-05-20,0,1,0,0
1,9870085,2025-02-07,0,1,0,0
2,9885185,2025-02-14,0,1,0,0
3,9884739,2025-02-13,0,1,0,0
4,9871856,2025-02-08,0,1,0,0


Remove phone calls without an account ID

In [190]:
PhoneCallsdf_prep1 = PhoneCallsdf_prep.loc[PhoneCallsdf_prep['Account Name']!="#"]


In [191]:
PhoneCallsdf_prep1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 177010 entries, 0 to 177009
Data columns (total 6 columns):
 #   Column                                 Non-Null Count   Dtype 
---  ------                                 --------------   ----- 
 0   Account Name                           177010 non-null  object
 1   End Date in Local Time Zone            177010 non-null  object
 2   Activity Life Cycle Status_Canceled    177010 non-null  uint8 
 3   Activity Life Cycle Status_Completed   177010 non-null  uint8 
 4   Activity Life Cycle Status_In Process  177010 non-null  uint8 
 5   Activity Life Cycle Status_Open        177010 non-null  uint8 
dtypes: object(2), uint8(4)
memory usage: 4.7+ MB


Remove calls dated after the end of the churn reference year (ChurnDate2.year)

In [192]:
Churndate2_year = ChurnDate2.year

In [193]:
PhoneCallsdf_prep1['End Date in Local Time Zone'] = pd.to_datetime(PhoneCallsdf_prep1['End Date in Local Time Zone'], errors = 'coerce')


In [194]:
PhoneCallsdf_prep1 = PhoneCallsdf_prep1.loc[PhoneCallsdf_prep1['End Date in Local Time Zone'] < dt.datetime(Churndate2_year+1,1,1)]
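
This filter also drops rows whose date could not be parsed: `errors='coerce'` turns them into `NaT`, and any comparison against `NaT` is `False`. A small sketch with made-up dates and an assumed reference year:

```python
import datetime as dt
import pandas as pd

calls = pd.DataFrame({"End Date": ["2024-05-20", "2029-04-29", "not a date"]})

# Invalid strings become NaT instead of raising
calls["End Date"] = pd.to_datetime(calls["End Date"], errors="coerce")

reference_year = 2025  # assumed year of the churn reference date
cutoff = dt.datetime(reference_year + 1, 1, 1)

# Comparisons against NaT are False, so unparseable dates are dropped
# together with the far-future ones
kept = calls.loc[calls["End Date"] < cutoff]
print(len(kept))
```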

In [195]:
PhoneCallsdf_prep1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 177000 entries, 0 to 177009
Data columns (total 6 columns):
 #   Column                                 Non-Null Count   Dtype         
---  ------                                 --------------   -----         
 0   Account Name                           177000 non-null  object        
 1   End Date in Local Time Zone            177000 non-null  datetime64[ns]
 2   Activity Life Cycle Status_Canceled    177000 non-null  uint8         
 3   Activity Life Cycle Status_Completed   177000 non-null  uint8         
 4   Activity Life Cycle Status_In Process  177000 non-null  uint8         
 5   Activity Life Cycle Status_Open        177000 non-null  uint8         
dtypes: datetime64[ns](1), object(1), uint8(4)
memory usage: 4.7+ MB


In [196]:
PhoneCallsdf_prep1 = PhoneCallsdf_prep1.sort_values('End Date in Local Time Zone')

In [197]:
PhoneCallsdf_prep2 = (PhoneCallsdf_prep1.sort_values('End Date in Local Time Zone')
    .groupby(["Account Name"])
                      .agg({
        'End Date in Local Time Zone': lambda s: s.values[-1],
        'Activity Life Cycle Status_Completed' : 'sum'}))


In [198]:
PhoneCallsdf_prep2

Account Name,End Date in Local Time Zone,Activity Life Cycle Status_Completed
1000058,2025-03-18,1
1000096,2024-05-31,4
1000216,2024-11-21,61
1000222,2024-11-25,4
1000256,2024-03-01,3
...,...,...
9889326,2025-02-16,1
9889336,2025-02-16,1
9889343,2025-02-16,1
9889354,2025-02-16,1


In [199]:
PhoneCallsdf_prep3 = PhoneCallsdf_prep2.copy()
PhoneCallsdf_prep3.reset_index()

Unnamed: 0,Account Name,End Date in Local Time Zone,Activity Life Cycle Status_Completed
0,1000058,2025-03-18,1
1,1000096,2024-05-31,4
2,1000216,2024-11-21,61
3,1000222,2024-11-25,4
4,1000256,2024-03-01,3
...,...,...,...
81801,9889326,2025-02-16,1
81802,9889336,2025-02-16,1
81803,9889343,2025-02-16,1
81804,9889354,2025-02-16,1
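
`reset_index()` returns a new DataFrame and does not modify the original, so the cell above only displays the flattened table while `PhoneCallsdf_prep3` keeps `Account Name` as its index. A minimal sketch on toy data (`agg`, `flat` and `n_calls` are names chosen for this example):

```python
import pandas as pd

agg = pd.DataFrame({"n_calls": [1, 4]},
                   index=pd.Index(["1000058", "1000096"], name="Account Name"))

agg.reset_index()            # returns a new frame; `agg` itself is unchanged
still_indexed = "Account Name" not in agg.columns

flat = agg.reset_index()     # assign the result to keep the key as a regular column
print(still_indexed, list(flat.columns))
```

Assigning the result is only needed if the account ID is later used as a merge column rather than as an index.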


In [200]:
PhoneCallsdf_prep3['Last_call_diff_months'] = ChurnDate2 - PhoneCallsdf_prep3['End Date in Local Time Zone']

PhoneCallsdf_prep3['Last_call_diff_months'] = PhoneCallsdf_prep3['Last_call_diff_months']/np.timedelta64(1,'M')

In [201]:
PhoneCallsdf_prep3.info()

<class 'pandas.core.frame.DataFrame'>
Index: 81806 entries, 1000058 to None
Data columns (total 3 columns):
 #   Column                                Non-Null Count  Dtype         
---  ------                                --------------  -----         
 0   End Date in Local Time Zone           81806 non-null  datetime64[ns]
 1   Activity Life Cycle Status_Completed  81806 non-null  uint64        
 2   Last_call_diff_months                 81806 non-null  float64       
dtypes: datetime64[ns](1), float64(1), uint64(1)
memory usage: 2.5+ MB


In [202]:
PhoneCallsdf_prep3 = PhoneCallsdf_prep3.copy()
PhoneCallsdf_prep3.reset_index()

Unnamed: 0,Account Name,End Date in Local Time Zone,Activity Life Cycle Status_Completed,Last_call_diff_months
0,1000058,2025-03-18,1,-1.511325
1,1000096,2024-05-31,4,8.049447
2,1000216,2024-11-21,61,2.332697
3,1000222,2024-11-25,4,2.201277
4,1000256,2024-03-01,3,11.039241
...,...,...,...,...
81801,9889326,2025-02-16,1,-0.525678
81802,9889336,2025-02-16,1,-0.525678
81803,9889343,2025-02-16,1,-0.525678
81804,9889354,2025-02-16,1,-0.525678


In [203]:
PhoneCallsdf_prep3['#Calls Completed'] = PhoneCallsdf_prep3['Activity Life Cycle Status_Completed']
PhoneCallsdf_prep3 = PhoneCallsdf_prep3.drop(['Activity Life Cycle Status_Completed'], axis = 1)
PhoneCallsdf_prep3.head()

Account Name,End Date in Local Time Zone,Last_call_diff_months,#Calls Completed
1000058,2025-03-18,-1.511325,1
1000096,2024-05-31,8.049447,4
1000216,2024-11-21,2.332697,61
1000222,2024-11-25,2.201277,4
1000256,2024-03-01,11.039241,3


#### Incident Ticket preparation

In [204]:
IncidentTicketdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 685189 entries, 0 to 685188
Data columns (total 4 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   Completion Date_2   682261 non-null  object
 1   Incident Category   685189 non-null  object
 2   Serial ID           685189 non-null  object
 3   COMPLETION_SLA_MET  683732 non-null  object
dtypes: object(4)
memory usage: 20.9+ MB


In [205]:
IncidentTicketdf.columns

Index(['Completion Date_2', 'Incident Category', 'Serial ID',
       'COMPLETION_SLA_MET'],
      dtype='object')

Maybe I will do a delta between "Completion Date_2" and "Reported On"
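
Such a delta could be computed as below. This is a sketch only: the "Reported On" column is not part of the data loaded above, so it is an assumption here, and `resolution_days` is a hypothetical feature name.

```python
import pandas as pd

tickets = pd.DataFrame({
    "Reported On": ["2022-05-20", "2022-10-31"],        # assumed column, not in the loaded data
    "Completion Date_2": ["2022-05-24", "2022-10-31"],
})
for col in ["Reported On", "Completion Date_2"]:
    tickets[col] = pd.to_datetime(tickets[col], errors="coerce")

# Resolution time in whole days
tickets["resolution_days"] = (tickets["Completion Date_2"] - tickets["Reported On"]).dt.days
print(tickets["resolution_days"].tolist())
```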

Removed:
'AuxFix' -> 'AuxTime'
'Completion SLA Met' -> 'SLAMet'
'SLAMet' -> 'SLA MET?' 
'Service Technician' not clear and a lot of data

In [206]:
IncidentTicketdf3 = IncidentTicketdf

IncidentTicketdf2 = IncidentTicketdf1.copy()
#IncidentTicketdf2['Completion Date_2'] = IncidentTicketdf2['Completion Date_2'].apply(str)
#IncidentTicketdf2['Completion Date_2'] = IncidentTicketdf2['Completion Date_2'].apply(lambda x: dateparser.parse(x))

In [207]:
IncidentTicketdf3['Completion Date_2'] = pd.to_datetime(IncidentTicketdf3['Completion Date_2'], errors = 'coerce')
#IncidentTicketdf3['Reported On'] = pd.to_datetime(IncidentTicketdf3['REPORTED_ON'], errors = 'coerce')

IncidentTicketdf3['Completion Date_2'] = IncidentTicketdf3['Completion Date_2'].fillna(dt.datetime(2000,1,1))
#IncidentTicketdf3['Reported On'] = IncidentTicketdf3['Reported On'].fillna(dt.datetime(2000,1,1))
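
Replacing missing completion dates with the sentinel 2000-01-01 makes them indistinguishable from real dates in later steps. One option is to store a flag before filling. A minimal sketch with made-up rows (`completion_date_missing` is a hypothetical column name):

```python
import pandas as pd

tickets = pd.DataFrame({"Completion Date_2": ["2022-05-24", None, "2022-10-31"]})
tickets["Completion Date_2"] = pd.to_datetime(tickets["Completion Date_2"], errors="coerce")

# Remember which rows had no usable date before overwriting them
tickets["completion_date_missing"] = tickets["Completion Date_2"].isna()

# Same sentinel as in the cell above
tickets["Completion Date_2"] = tickets["Completion Date_2"].fillna(pd.Timestamp(2000, 1, 1))
print(tickets)
```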

In [208]:
IncidentTicketdf3.head()

Unnamed: 0,Completion Date_2,Incident Category,Serial ID,COMPLETION_SLA_MET
0,2022-05-24,1.d Ingredient Other,ZA8345,False
1,2022-10-31,5.c Dispensing Area Other,ID15217,False
2,2022-09-20,2.e Hydraulic Cooling/Freezing,16O0039291,False
3,2022-11-01,14 Accessory problem(external pump..),ID20273,True
4,2022-12-23,14 Accessory problem(external pump..),ID12789,True


In [209]:
IncidentTicketdf3 = IncidentTicketdf3.loc[IncidentTicketdf3['Serial ID']!="#"]

In [210]:
def preprocess_InciTickets(df):
    # Work on a copy
    df = df.copy()

    nomi_vars = ['Incident Category', 'COMPLETION_SLA_MET']
    
    dummy_columns = nomi_vars
        
    df = pd.get_dummies(df, columns=dummy_columns)

    return df

IncidentTicketdf_prep = preprocess_InciTickets(IncidentTicketdf3)
IncidentTicketdf_prep.head()

Unnamed: 0,Completion Date_2,Serial ID,Incident Category_1.a Ingredient Calibration,Incident Category_1.b Ingredient Dispensing,Incident Category_1.c Ingredient Dripping,Incident Category_1.d Ingredient Other,Incident Category_10 Abnormal smell,Incident Category_11 Electrical power,Incident Category_12 Water supply issue,Incident Category_13 Connectivity (modem),...,Incident Category_6 Electronics (PCBs),Incident Category_7 Wire/Harness,Incident Category_8 Software/Firmware,Incident Category_9 Abnormal noise,Incident Category_Low throughput,Incident Category_N/A,Incident Category_Requested by Customer,Incident Category_Scheduled,COMPLETION_SLA_MET_False,COMPLETION_SLA_MET_True
0,2022-05-24,ZA8345,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
1,2022-10-31,ID15217,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
2,2022-09-20,16O0039291,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
3,2022-11-01,ID20273,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,2022-12-23,ID12789,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


In [211]:
IncidentTicketdf_prep.columns

Index(['Completion Date_2', 'Serial ID',
       'Incident Category_1.a Ingredient Calibration',
       'Incident Category_1.b Ingredient Dispensing',
       'Incident Category_1.c Ingredient Dripping',
       'Incident Category_1.d Ingredient Other',
       'Incident Category_10 Abnormal smell',
       'Incident Category_11 Electrical power',
       'Incident Category_12 Water supply issue',
       'Incident Category_13 Connectivity (modem)',
       'Incident Category_14 Accessory problem(external pump..)',
       'Incident Category_15 Return with parts',
       'Incident Category_16 Operator mishandling(improper fill..)',
       'Incident Category_17 Miscellaneous', 'Incident Category_18 N/A',
       'Incident Category_2.a Hydraulic Calibration',
       'Incident Category_2.b Hydraulic Dispensing',
       'Incident Category_2.c Hydraulic Leaking',
       'Incident Category_2.d Hydraulic Heating',
       'Incident Category_2.e Hydraulic Cooling/Freezing',
       'Incident Category_2.f 

In [212]:
IncidentTicketdf_prep = IncidentTicketdf_prep.sort_values('Completion Date_2')

I will not use 'Reported On': because I aggregate the tickets per machine, I no longer need to compute a delta.

In [213]:
IncidentTicketdf_prep2 = (
    IncidentTicketdf_prep.sort_values('Completion Date_2')
    .groupby(["Serial ID"])
    .agg({
        'Completion Date_2': lambda s: s.values[-1],
        'Incident Category_1.a Ingredient Calibration': 'sum',
        'Incident Category_1.b Ingredient Dispensing': 'sum',
        'Incident Category_1.c Ingredient Dripping': 'sum',
        'Incident Category_1.d Ingredient Other': 'sum',
        'Incident Category_10 Abnormal smell': 'sum',
        'Incident Category_11 Electrical power': 'sum',
        'Incident Category_12 Water supply issue': 'sum',
        'Incident Category_13 Connectivity (modem)': 'sum',
        'Incident Category_14 Accessory problem(external pump..)': 'sum',
        'Incident Category_15 Return with parts': 'sum',
        'Incident Category_16 Operator mishandling(improper fill..)': 'sum',
        'Incident Category_17 Miscellaneous': 'sum',
        'Incident Category_18 N/A': 'sum',
        'Incident Category_2.a Hydraulic Calibration': 'sum',
        'Incident Category_2.b Hydraulic Dispensing': 'sum',
        'Incident Category_2.c Hydraulic Leaking': 'sum',
        'Incident Category_2.d Hydraulic Heating': 'sum',
        'Incident Category_2.e Hydraulic Cooling/Freezing': 'sum',
        'Incident Category_2.f Hydraulic Filling': 'sum',
        'Incident Category_2.g Hydraulic Other': 'sum',
        'Incident Category_3.a Door Display/Touchscreen': 'sum',
        'Incident Category_3.b Door Menu buttons': 'sum',
        'Incident Category_3.c Door Detection': 'sum',
        'Incident Category_3.d Door Key/Key switch': 'sum',
        'Incident Category_3.e Door Other': 'sum',
        'Incident Category_4.a Reconst. Area In-cup quality/Recipes': 'sum',
        'Incident Category_4.b Reconstitution Area Mixing system': 'sum',
        'Incident Category_4.c Reconstitution Area Other': 'sum',
        'Incident Category_5.a Disp. Area Manifold/Distribution': 'sum',
        'Incident Category_5.b Dispensing Area Drip Tray': 'sum',
        'Incident Category_5.c Dispensing Area Other': 'sum',
        'Incident Category_6 Electronics (PCBs)': 'sum',
        'Incident Category_7 Wire/Harness': 'sum',
        'Incident Category_8 Software/Firmware': 'sum',
        'Incident Category_9 Abnormal noise': 'sum',
        'COMPLETION_SLA_MET_False': 'sum',
        'COMPLETION_SLA_MET_True': 'sum',
    })
)
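
The one-hot encoding followed by a per-machine aggregation can be sketched on a toy table. This is a simplified variant, not the notebook's exact code: it builds the aggregation dictionary from the dummy column names instead of listing them by hand (so it keeps every category, whereas the explicit dictionary above leaves some out), and it uses `'max'` for the date instead of taking the last value after sorting. The data and the names `tickets`, `cat_cols` and `per_machine` are made up for the illustration.

```python
import pandas as pd

# Toy tickets: machine A1 has two "Leaking" tickets, B2 has one "Heating" ticket
tickets = pd.DataFrame({
    "Serial ID": ["A1", "A1", "B2"],
    "Completion Date_2": pd.to_datetime(["2022-01-10", "2022-03-05", "2022-02-01"]),
    "Incident Category": ["Leaking", "Leaking", "Heating"],
})

# One indicator column per category
dummies = pd.get_dummies(tickets, columns=["Incident Category"], dtype=int)
cat_cols = [c for c in dummies.columns if c.startswith("Incident Category_")]

# Most recent completion date, plus the number of tickets per category
agg_spec = {"Completion Date_2": "max", **{c: "sum" for c in cat_cols}}
per_machine = dummies.groupby("Serial ID").agg(agg_spec)
print(per_machine)
```

Summing the indicator columns gives, for each machine, the number of tickets in each category.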


In [214]:
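# Months elapsed between the reference date and the last incident ticket
IncidentTicketdf_prep2['Last_InTick_diff_months'] = ChurnDate2 - IncidentTicketdf_prep2['Completion Date_2']

# One month is taken as 30.436875 days (365.2425 / 12); recent pandas versions reject np.timedelta64(1,'M') here
IncidentTicketdf_prep2['Last_InTick_diff_months'] = IncidentTicketdf_prep2['Last_InTick_diff_months']/pd.Timedelta(days=30.436875)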
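
As a worked example of the month difference, here is a minimal sketch on two dates. It assumes an average month of 365.2425 / 12 = 30.436875 days, which is consistent with the values printed in this notebook. `ref_date` is a hypothetical stand-in for `ChurnDate2`, whose actual value is not shown in this section.

```python
import pandas as pd

# Average month length in days (assumption: Gregorian year of 365.2425 days)
AVG_DAYS_PER_MONTH = 365.2425 / 12

# Hypothetical reference date standing in for ChurnDate2
ref_date = pd.Timestamp("2025-01-31")
last_ticket = pd.Series(pd.to_datetime(["2023-02-20", "2025-01-31"]))

# Timedelta divided by a fixed-length Timedelta gives a float number of months
diff_months = (ref_date - last_ticket) / pd.Timedelta(days=AVG_DAYS_PER_MONTH)
print(diff_months.round(2).tolist())
```

Dividing by a fixed-length `pd.Timedelta` avoids the ambiguous `'M'` unit.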

In [215]:
IncidentTicketdf_prep2.info()

<class 'pandas.core.frame.DataFrame'>
Index: 83579 entries, 100100250 to ZA978
Data columns (total 39 columns):
 #   Column                                                      Non-Null Count  Dtype         
---  ------                                                      --------------  -----         
 0   Completion Date_2                                           83579 non-null  datetime64[ns]
 1   Incident Category_1.a Ingredient Calibration                83579 non-null  uint8         
 2   Incident Category_1.b Ingredient Dispensing                 83579 non-null  uint8         
 3   Incident Category_1.c Ingredient Dripping                   83579 non-null  uint8         
 4   Incident Category_1.d Ingredient Other                      83579 non-null  uint8         
 5   Incident Category_10 Abnormal smell                         83579 non-null  uint8         
 6   Incident Category_11 Electrical power                       83579 non-null  uint8         
 7   Incident Category_1

In [216]:
IncidentTicketdf_prep2 = IncidentTicketdf_prep2.reset_index()
IncidentTicketdf_prep2.head()

Unnamed: 0,Serial ID,Completion Date_2,Incident Category_1.a Ingredient Calibration,Incident Category_1.b Ingredient Dispensing,Incident Category_1.c Ingredient Dripping,Incident Category_1.d Ingredient Other,Incident Category_10 Abnormal smell,Incident Category_11 Electrical power,Incident Category_12 Water supply issue,Incident Category_13 Connectivity (modem),...,Incident Category_5.a Disp. Area Manifold/Distribution,Incident Category_5.b Dispensing Area Drip Tray,Incident Category_5.c Dispensing Area Other,Incident Category_6 Electronics (PCBs),Incident Category_7 Wire/Harness,Incident Category_8 Software/Firmware,Incident Category_9 Abnormal noise,COMPLETION_SLA_MET_False,COMPLETION_SLA_MET_True,Last_InTick_diff_months
0,100100250,2023-02-20,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,23.359823
1,100100251,2022-01-13,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,36.600341
2,100100276,2021-08-01,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,1,42.021397
3,100100293,2022-11-16,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,26.513891
4,100100296,2021-07-10,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,42.744204


#### Data with all
Let's see what we can get if we include Telemetry, sales and Tickets

In [217]:

BeverageMachine7_wTickets_df['Manufacturer Number'] = BeverageMachine7_wTickets_df['Manufacturer Number'].astype('str')
Concat_Telemetry['serial'] = Concat_Telemetry['serial'].astype('str')

In [218]:
BeverageMachine7_wTickets_wTelemetry_df = pd.merge(BeverageMachine7_wTickets_df, Concat_Telemetry, how='left', left_on = ['Manufacturer Number'], right_on = ['serial'])

BeverageMachine7_wTickets_wTelemetry_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 270764 entries, 0 to 270763
Data columns (total 62 columns):
 #   Column                                                           Non-Null Count   Dtype         
---  ------                                                           --------------   -----         
 0   Serial ID                                                        270764 non-null  object        
 1   Sales Organisation                                               270764 non-null  object        
 2   Machine Status Groupings                                         270764 non-null  object        
 3   User Status                                                      270764 non-null  object        
 4   TA Contract Installation Date                                    270764 non-null  int32         
 5   Depreciation Start                                               270764 non-null  int32         
 6   Manufacturer Number                                              270

In [219]:
BeverageMachine7_wTickets_wTelemetry_df = pd.merge(BeverageMachine7_wTickets_df, Concat_Telemetry, how='left', left_on = ['Manufacturer Number'], right_on = ['serial'])
BeverageMachine7_wTickets_wTelemetry_df=BeverageMachine7_wTickets_wTelemetry_df.fillna(0)
BeverageMachine7_wTickets_wTelemetry_df["quantity"] = BeverageMachine7_wTickets_wTelemetry_df["quantity"].astype(int)
BeverageMachine7_wTickets_wTelemetry_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 270764 entries, 0 to 270763
Data columns (total 62 columns):
 #   Column                                                           Non-Null Count   Dtype         
---  ------                                                           --------------   -----         
 0   Serial ID                                                        270764 non-null  object        
 1   Sales Organisation                                               270764 non-null  object        
 2   Machine Status Groupings                                         270764 non-null  object        
 3   User Status                                                      270764 non-null  object        
 4   TA Contract Installation Date                                    270764 non-null  int32         
 5   Depreciation Start                                               270764 non-null  int32         
 6   Manufacturer Number                                              270

In [220]:
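# Check that the merge key is unique: the row count must match Concat_Sales.info() in the next cell
A = Concat_Sales.drop_duplicates(subset='KeyManufNo_SalesOrg')
A.info()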

<class 'pandas.core.frame.DataFrame'>
Int64Index: 107869 entries, 0 to 24234
Data columns (total 8 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   Serial                  107869 non-null  object 
 1   quantity                107869 non-null  float64
 2   Sales_one_Month_avg     107869 non-null  float64
 3   Sales_three_months_avg  107869 non-null  float64
 4   Sales_six_months_avg    107869 non-null  float64
 5   KeyManufNo_SalesOrg     107869 non-null  object 
 6   (lst_mth-6mth)/6mth     107869 non-null  float64
 7   3mth-6mth)/6mth         107869 non-null  float64
dtypes: float64(6), object(2)
memory usage: 7.4+ MB


In [221]:
Concat_Sales.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 107869 entries, 0 to 24234
Data columns (total 8 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   Serial                  107869 non-null  object 
 1   quantity                107869 non-null  float64
 2   Sales_one_Month_avg     107869 non-null  float64
 3   Sales_three_months_avg  107869 non-null  float64
 4   Sales_six_months_avg    107869 non-null  float64
 5   KeyManufNo_SalesOrg     107869 non-null  object 
 6   (lst_mth-6mth)/6mth     107869 non-null  float64
 7   3mth-6mth)/6mth         107869 non-null  float64
dtypes: float64(6), object(2)
memory usage: 7.4+ MB
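
The two outputs above show the same row count with and without `drop_duplicates`, which suggests that `KeyManufNo_SalesOrg` is unique. The same check can be made part of the merge itself. The sketch below uses toy frames and hypothetical column names (`Key`, `KeySales`); `validate` and `indicator` are standard `pd.merge` parameters.

```python
import pandas as pd

# Toy stand-ins for the machine table and the sales table
machines = pd.DataFrame({"Key": ["K1", "K2", "K3"],
                         "Sales Organisation": ["X", "Y", "Z"]})
sales = pd.DataFrame({"KeySales": ["K1", "K3"], "quantity": [120.0, 45.0]})

# Explicit uniqueness check on the right-hand key
key_is_unique = not sales["KeySales"].duplicated().any()

# validate raises MergeError if the right key is not unique;
# indicator adds a "_merge" column showing which rows found a match
merged = pd.merge(machines, sales, how="left",
                  left_on="Key", right_on="KeySales",
                  validate="many_to_one", indicator=True)
n_unmatched = int((merged["_merge"] == "left_only").sum())
print(len(merged), n_unmatched)
```

With a unique right-hand key, a left merge cannot add rows, which matches the constant 270764 entries in the merged frames of this section.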


In [222]:
BeverageMachine7_wTickets_wTelemetry_wSales_df = pd.merge(BeverageMachine7_wTickets_wTelemetry_df, Concat_Sales, how='left', left_on = ['Key_ManufacturerID_SalesOrg'], right_on = ['KeyManufNo_SalesOrg'])
BeverageMachine7_wTickets_wTelemetry_wSales_df = BeverageMachine7_wTickets_wTelemetry_wSales_df.fillna(0)
BeverageMachine7_wTickets_wTelemetry_wSales_df["quantity_y"] = BeverageMachine7_wTickets_wTelemetry_wSales_df["quantity_y"].astype(int)
BeverageMachine7_wTickets_wTelemetry_wSales_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 270764 entries, 0 to 270763
Data columns (total 70 columns):
 #   Column                                                           Non-Null Count   Dtype         
---  ------                                                           --------------   -----         
 0   Serial ID                                                        270764 non-null  object        
 1   Sales Organisation                                               270764 non-null  object        
 2   Machine Status Groupings                                         270764 non-null  object        
 3   User Status                                                      270764 non-null  object        
 4   TA Contract Installation Date                                    270764 non-null  int32         
 5   Depreciation Start                                               270764 non-null  int32         
 6   Manufacturer Number                                              270

In [223]:
BeverageMachine7_wTickets_wTelemetry_wSales_df['EC ID'] = BeverageMachine7_wTickets_wTelemetry_wSales_df['EC ID'].astype('str')
Visitsdf_wVisits['Acc_ID'] = Visitsdf_wVisits['Acc_ID'].astype('str')

In [224]:
BeverageMachine7_wTickets_wTelemetry_wSales_wVisits_df = pd.merge(BeverageMachine7_wTickets_wTelemetry_wSales_df, Visitsdf_wVisits, how='left', left_on = ['EC ID'], right_on = ['Acc_ID'])
BeverageMachine7_wTickets_wTelemetry_wSales_wVisits_df['End Date in Local Time Zone'] = pd.to_datetime(BeverageMachine7_wTickets_wTelemetry_wSales_wVisits_df['End Date in Local Time Zone'])
BeverageMachine7_wTickets_wTelemetry_wSales_wVisits_df['End Date in Local Time Zone'] = BeverageMachine7_wTickets_wTelemetry_wSales_wVisits_df['End Date in Local Time Zone'].fillna(dt.datetime(2000,1,1))
BeverageMachine7_wTickets_wTelemetry_wSales_wVisits_df = BeverageMachine7_wTickets_wTelemetry_wSales_wVisits_df.fillna(0)
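
The order of the fills in the cell above matters: the date column receives its sentinel first, so the blanket `fillna(0)` finds nothing left to fill there. A more defensive variant, shown here only as a sketch on toy data, restricts the zero fill to numeric columns so that date and text columns are never touched. The toy frame and the names `date_col` and `num_cols` are assumptions made for the illustration.

```python
import pandas as pd

# Toy merged frame: the second machine had no matching visit
merged = pd.DataFrame({
    "EC ID": ["1", "2"],
    "End Date in Local Time Zone": pd.to_datetime(["2024-06-01", None]),
    "#Visits completed": [3.0, None],
})

# 1) Date column: fill missing values with the sentinel date
date_col = "End Date in Local Time Zone"
merged[date_col] = merged[date_col].fillna(pd.Timestamp("2000-01-01"))

# 2) Numeric columns only: fill missing values with 0
num_cols = merged.select_dtypes(include="number").columns
merged[num_cols] = merged[num_cols].fillna(0)
print(merged.dtypes)
```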


In [225]:
BeverageMachine7_wTickets_wTelemetry_wSales_wVisits_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 270764 entries, 0 to 270763
Data columns (total 81 columns):
 #   Column                                                           Non-Null Count   Dtype         
---  ------                                                           --------------   -----         
 0   Serial ID                                                        270764 non-null  object        
 1   Sales Organisation                                               270764 non-null  object        
 2   Machine Status Groupings                                         270764 non-null  object        
 3   User Status                                                      270764 non-null  object        
 4   TA Contract Installation Date                                    270764 non-null  int32         
 5   Depreciation Start                                               270764 non-null  int32         
 6   Manufacturer Number                                              270

In [226]:
BeverageMachine7_wTickets_wTelemetry_wSales_wVisits_df.head()

Unnamed: 0,Serial ID,Sales Organisation,Machine Status Groupings,User Status,TA Contract Installation Date,Depreciation Start,Manufacturer Number,Position,TA Contract Start Date,TA Contract End Date,...,Acc_ID,End Date in Local Time Zone,Result_Incomplete Selling Call,Result_Not assigned,Result_Objective Met,Result_Objective Partially Met,Result_Requires Further Follow-up,Result_Unsuccessful Selling Call,Last_visit_diff_months,#Visits completed
0,132020210,Nestlé Austria,Deployed,Installed,44347,41609,132020210,RENT,44347,46173,...,0,2000-01-01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,19O0046537,Nestlé PH,Deployed,Installed,0,44621,T479748,#,0,0,...,0,2000-01-01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,23O0024982,Nestlé PH,Deployed,Installed,0,45200,3142073491,#,0,0,...,0,2000-01-01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,HK10020559,Nestle Hong Kong,Deployed,Installed,0,44197,20202922905,#,0,0,...,0,2000-01-01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,4410837,NP-Hungary,Deployed,Installed,0,42186,4410837,#,0,0,...,0,2000-01-01,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [227]:
PhoneCallsdf_prep3 = PhoneCallsdf_prep3.reset_index()

In [228]:
PhoneCallsdf_prep3['Account Name'] = PhoneCallsdf_prep3['Account Name'].astype('str')

In [229]:
BeverageMachine7_wTickets_wTelemetry_wSales_wVisits_wCalls_df = pd.merge(BeverageMachine7_wTickets_wTelemetry_wSales_wVisits_df, PhoneCallsdf_prep3, how='left', left_on = ['EC ID'], right_on = ['Account Name'])
BeverageMachine7_wTickets_wTelemetry_wSales_wVisits_wCalls_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 270764 entries, 0 to 270763
Data columns (total 85 columns):
 #   Column                                                           Non-Null Count   Dtype         
---  ------                                                           --------------   -----         
 0   Serial ID                                                        270764 non-null  object        
 1   Sales Organisation                                               270764 non-null  object        
 2   Machine Status Groupings                                         270764 non-null  object        
 3   User Status                                                      270764 non-null  object        
 4   TA Contract Installation Date                                    270764 non-null  int32         
 5   Depreciation Start                                               270764 non-null  int32         
 6   Manufacturer Number                                              270

In [230]:
BeverageMachine7_wTickets_wTelemetry_wSales_wVisits_wCalls_df['End Date in Local Time Zone_y'] = pd.to_datetime(BeverageMachine7_wTickets_wTelemetry_wSales_wVisits_wCalls_df['End Date in Local Time Zone_y'])

BeverageMachine7_wTickets_wTelemetry_wSales_wVisits_wCalls_df['End Date in Local Time Zone_y'] = BeverageMachine7_wTickets_wTelemetry_wSales_wVisits_wCalls_df['End Date in Local Time Zone_y'].fillna(dt.datetime(2000,1,1))
BeverageMachine7_wTickets_wTelemetry_wSales_wVisits_wCalls_df = BeverageMachine7_wTickets_wTelemetry_wSales_wVisits_wCalls_df.fillna(0)

In [231]:
BeverageMachine7_wTickets_wTelemetry_wSales_wVisits_wCalls_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 270764 entries, 0 to 270763
Data columns (total 85 columns):
 #   Column                                                           Non-Null Count   Dtype         
---  ------                                                           --------------   -----         
 0   Serial ID                                                        270764 non-null  object        
 1   Sales Organisation                                               270764 non-null  object        
 2   Machine Status Groupings                                         270764 non-null  object        
 3   User Status                                                      270764 non-null  object        
 4   TA Contract Installation Date                                    270764 non-null  int32         
 5   Depreciation Start                                               270764 non-null  int32         
 6   Manufacturer Number                                              270

In [232]:
IncidentTicketdf_prep2['Serial ID'] = IncidentTicketdf_prep2['Serial ID'].astype('str')

In [233]:
IncidentTicketdf_prep2['Serial ID'].info()

<class 'pandas.core.series.Series'>
RangeIndex: 83579 entries, 0 to 83578
Series name: Serial ID
Non-Null Count  Dtype 
--------------  ----- 
83579 non-null  object
dtypes: object(1)
memory usage: 653.1+ KB


In [234]:
BeverageMachine_all_df = pd.merge(BeverageMachine7_wTickets_wTelemetry_wSales_wVisits_wCalls_df, IncidentTicketdf_prep2, how='left', left_on = ['Serial ID'], right_on = ['Serial ID'])
BeverageMachine_all_df['Completion Date_2'] = pd.to_datetime(BeverageMachine_all_df['Completion Date_2'])
BeverageMachine_all_df['Completion Date_2'] = BeverageMachine_all_df['Completion Date_2'].fillna(dt.datetime(2000,1,1))
BeverageMachine_all_df = BeverageMachine_all_df.fillna(0)


In [235]:
MktActions_prep3 = MktActions_prep3.reset_index()
MktActions_prep3

Unnamed: 0,Serial ID,Actions_Churn risk reason unknown,Actions_Data corrected,Actions_Downgrade machine installed,Actions_Lack of data discipline,Actions_New contract,Actions_Other,Actions_Out of order,Actions_Phone Call completed,Actions_Plan for removal,...,Actions_Removed,Actions_Reviewed and no action Required,Actions_Reviewed and no actions required,Actions_Seasonal Machine,Actions_Telemetry installed,Actions_Upgrade machine installed,Actions_Visit completed,Actions_Visit/Call planned,Actions_removed,Actions_tagging update
0,24606,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1895151,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,10238090,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,10238091,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
4,10238092,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1425,T348033,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
1426,T348040,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
1427,T348082,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
1428,T392705,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0


In [236]:
MktActions_prep3['Serial ID'] = MktActions_prep3['Serial ID'].astype('str')

In [237]:

BeverageMachine_all_df2 = pd.merge(BeverageMachine_all_df, MktActions_prep3, how='left', left_on = ['Serial ID'], right_on = ['Serial ID'])
#BeverageMachine_all_df['Completion Date_2'] = pd.to_datetime(BeverageMachine_all_df['Completion Date_2'])
#BeverageMachine_all_df['Completion Date_2'] = BeverageMachine_all_df['Completion Date_2'].fillna(dt.datetime(2000,1,1))
BeverageMachine_all_df2 = BeverageMachine_all_df2.fillna(0)

BeverageMachine_all_df = BeverageMachine_all_df2

UKService_prep2 = UKService_prep2.reset_index()

BeverageMachine_all_df2 = pd.merge(BeverageMachine_all_df, UKService_prep2, how='left', left_on = ['Key_ManufacturerID_SalesOrg'], right_on = ['Key_ManufacturerID_SalesOrg'])
BeverageMachine_all_df2['Month'] = pd.to_datetime(BeverageMachine_all_df2['Month'])
BeverageMachine_all_df2['Month'] = BeverageMachine_all_df2['Month'].fillna(dt.datetime(2000,1,1))
BeverageMachine_all_df2 = BeverageMachine_all_df2.fillna(0)


In [238]:
BeverageMachine_all_df = BeverageMachine_all_df2
a = BeverageMachine_all_df.iloc[:,100:]
a.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 270764 entries, 0 to 270763
Data columns (total 45 columns):
 #   Column                                                      Non-Null Count   Dtype  
---  ------                                                      --------------   -----  
 0   Incident Category_2.b Hydraulic Dispensing                  270764 non-null  float64
 1   Incident Category_2.c Hydraulic Leaking                     270764 non-null  float64
 2   Incident Category_2.d Hydraulic Heating                     270764 non-null  float64
 3   Incident Category_2.e Hydraulic Cooling/Freezing            270764 non-null  float64
 4   Incident Category_2.f Hydraulic Filling                     270764 non-null  float64
 5   Incident Category_2.g Hydraulic Other                       270764 non-null  float64
 6   Incident Category_3.a Door Display/Touchscreen              270764 non-null  float64
 7   Incident Category_3.b Door Menu buttons                     270764 non-nul

# TODO: remove this filter once these markets have enough data

BeverageMachine_all_df2 = BeverageMachine_all_df.copy()

# Sales Organisation with more than one month of data
SO = ['Nestle Sweden',  'Nestlé Czech', 'Nestlé Denmark', 'Nestlé Finland', 'Nestlé Norway', 'Nestlé Slovak Republic']

# Drop every machine that belongs to one of the Sales Organisations listed in SO
BeverageMachine_all_df2 = BeverageMachine_all_df2.loc[~BeverageMachine_all_df2['Sales Organisation'].isin(SO)]
BeverageMachine_all_df = BeverageMachine_all_df2

# Specify the filename
filename = 'TelemetryColumnsList.p'

# Combine the file path and filename
file_path_with_filename = os.path.join(file_path_output, filename)

# Save the list into a pickle file
with open(file_path_with_filename, 'wb') as file:
    pickle.dump(TelemetryColumnsList, file)

#### Data summary

I now have four datasets:

"BeverageMachine7_df" 

    All the beverage machine data, without ticket data.

    This is our main dataset. It will be used to train and test our models because it covers all the machines.

"BeverageMachine7_wTickets_df" 

    The main data joined with the ticket data. When a machine has no tickets, the ticket columns are filled with 0.
    
    As only around 2000 machines have tickets, we will apply it to the model that performed best on the main data to see whether the ticket data brings better results.

"BeverageMachine7_wTicketsOnly_df" 

    Only the machines that have tickets.
    
    Only useful for EDA.

"BeverageMachine7_wTickets_wTelemetry_df"

    We will apply it to the model that performed best on the main data to see whether it brings better results than the main data alone or the main data with tickets. If it does not improve the results significantly, we will not use it, because collecting Telemetry data takes a lot of time.
    Later, more machines will have Telemetry and a data lake will be created, which will make the data easier to get.

### Save the data<a class="anchor" id="save"></a>

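I chose to save the data as pickle files because a pickle preserves a pandas DataFrame as it is (column dtypes such as datetimes, and the index), which makes it a convenient way to pass the data to another notebook.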
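
To illustrate why pickle is convenient here, the sketch below writes a toy frame to both a pickle and a CSV file in a temporary folder and compares the dtypes after reloading. The frame and file names are made up. Note that pickle files should only be loaded from trusted sources.

```python
import os
import tempfile

import pandas as pd

# Toy frame with a datetime column
df = pd.DataFrame({
    "Serial ID": ["ZA8345", "ID15217"],
    "Completion Date_2": pd.to_datetime(["2022-05-24", "2022-10-31"]),
})

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, "demo.p")
    csv_path = os.path.join(tmp_dir, "demo.csv")
    df.to_pickle(pkl_path)
    df.to_csv(csv_path, index=False)
    from_pickle = pd.read_pickle(pkl_path)
    from_csv = pd.read_csv(csv_path)  # no parse_dates: dates come back as text

pickle_keeps_dates = pd.api.types.is_datetime64_any_dtype(from_pickle["Completion Date_2"])
csv_keeps_dates = pd.api.types.is_datetime64_any_dtype(from_csv["Completion Date_2"])
print(pickle_keeps_dates, csv_keeps_dates)
```

With CSV, the date columns would need `parse_dates` (or a later `pd.to_datetime`) in every notebook that reads the file.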

##### BeverageMachine7_df

In [239]:
BeverageMachine_all_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 270764 entries, 0 to 270763
Columns: 145 entries, Serial ID to Actions_tagging update
dtypes: bool(1), datetime64[ns](4), float64(95), int32(6), object(39)
memory usage: 293.6+ MB


In [240]:
BeverageMachine_all_df.iloc[0:10, 68:90]

Unnamed: 0,(lst_mth-6mth)/6mth_y,3mth-6mth)/6mth_y,index,Acc_ID,End Date in Local Time Zone_x,Result_Incomplete Selling Call,Result_Not assigned,Result_Objective Met,Result_Objective Partially Met,Result_Requires Further Follow-up,...,#Visits completed,Account Name,End Date in Local Time Zone_y,Last_call_diff_months,#Calls Completed,Completion Date_2,Incident Category_1.a Ingredient Calibration,Incident Category_1.b Ingredient Dispensing,Incident Category_1.c Ingredient Dripping,Incident Category_1.d Ingredient Other
0,0.0,0.0,0.0,0,2000-01-01,0.0,0.0,0.0,0.0,0.0,...,0.0,0,2000-01-01,0.0,0.0,2000-01-01,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0,2000-01-01,0.0,0.0,0.0,0.0,0.0,...,0.0,0,2000-01-01,0.0,0.0,2000-01-01,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0,2000-01-01,0.0,0.0,0.0,0.0,0.0,...,0.0,0,2000-01-01,0.0,0.0,2000-01-01,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0,2000-01-01,0.0,0.0,0.0,0.0,0.0,...,0.0,0,2000-01-01,0.0,0.0,2000-01-01,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0,2000-01-01,0.0,0.0,0.0,0.0,0.0,...,0.0,0,2000-01-01,0.0,0.0,2000-01-01,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0,2000-01-01,0.0,0.0,0.0,0.0,0.0,...,0.0,0,2000-01-01,0.0,0.0,2024-12-19,0.0,0.0,0.0,0.0
6,-0.624471,-0.874824,0.0,0,2000-01-01,0.0,0.0,0.0,0.0,0.0,...,0.0,0,2000-01-01,0.0,0.0,2021-03-09,0.0,1.0,0.0,0.0
7,0.0,0.0,0.0,0,2000-01-01,0.0,0.0,0.0,0.0,0.0,...,0.0,0,2000-01-01,0.0,0.0,2000-01-01,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0,2000-01-01,0.0,0.0,0.0,0.0,0.0,...,0.0,0,2000-01-01,0.0,0.0,2000-01-01,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0,2000-01-01,0.0,0.0,0.0,0.0,0.0,...,0.0,0,2000-01-01,0.0,0.0,2000-01-01,0.0,0.0,0.0,0.0


In [241]:
# Specify the filename
filename = 'BM_noTickets.p'

# Combine the file path and filename
file_path_with_filename = os.path.join(file_path_output, filename)

# Save the DataFrame into a pickle file
with open(file_path_with_filename, 'wb') as file:
    pickle.dump(BeverageMachine7_df, file)

In [242]:
# Specify the filename
filename = 'BM_noTickets.p'

# Combine the file path and filename
file_path_with_filename = os.path.join(file_path_output, filename)

# Load the pickle file
with open(file_path_with_filename, 'rb') as file:
    BM_noTickets = pickle.load(file)

Quick test to check that the data can be reopened in another notebook.
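
The save and reload cells in this section all follow the same pattern, so they could be wrapped in two small helpers. This is only a sketch: `save_pickle` and `load_pickle` are hypothetical names that do not exist in the notebook, and the demo uses a temporary folder in place of `file_path_output`.

```python
import os
import pickle
import tempfile

import pandas as pd


def save_pickle(obj, folder, filename):
    """Pickle obj into folder/filename and return the full path."""
    path = os.path.join(folder, filename)
    with open(path, "wb") as file:
        pickle.dump(obj, file)
    return path


def load_pickle(folder, filename):
    """Load and return the object pickled in folder/filename."""
    with open(os.path.join(folder, filename), "rb") as file:
        return pickle.load(file)


# Round-trip check on a toy frame
demo_df = pd.DataFrame({"Serial ID": ["A1", "B2"], "quantity": [4, 61]})
with tempfile.TemporaryDirectory() as demo_folder:
    save_pickle(demo_df, demo_folder, "demo.p")
    reloaded = load_pickle(demo_folder, "demo.p")

round_trip_ok = reloaded.equals(demo_df)
print(round_trip_ok)
```

A call such as `save_pickle(BeverageMachine7_df, file_path_output, 'BM_noTickets.p')` would then replace each three-step cell.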

##### BeverageMachine7_wTickets_df

In [243]:
# Specify the filename
filename = 'BM_wTickets.p'

# Combine the file path and filename
file_path_with_filename = os.path.join(file_path_output, filename)

# Save the DataFrame into a pickle file
with open(file_path_with_filename, 'wb') as file:
    pickle.dump(BeverageMachine7_wTickets_df, file)

##### BeverageMachine7_wTicketsOnly_df

In [244]:
# Specify the filename
filename = 'BM_wTicketsOnly.p'

# Combine the file path and filename
file_path_with_filename = os.path.join(file_path_output, filename)

# Save the DataFrame into a pickle file
with open(file_path_with_filename, 'wb') as file:
    pickle.dump(BeverageMachine7_wTicketsOnly_df, file)

##### BeverageMachine7_wTickets_wTelemetry_df

In [245]:
# Specify the filename
filename = 'BeverageMachine7_wTickets_wTelemetry_df.p'

# Combine the file path and filename
file_path_with_filename = os.path.join(file_path_output, filename)

# Save the DataFrame into a pickle file
with open(file_path_with_filename, 'wb') as file:
    pickle.dump(BeverageMachine7_wTickets_wTelemetry_df, file)

##### Other dataframe needed for the second preparation step later

In [246]:
# Specify the filename
filename = 'IncidentTicketdf.p'

# Combine the file path and filename
file_path_with_filename = os.path.join(file_path_output, filename)

# Save the DataFrame into a pickle file
with open(file_path_with_filename, 'wb') as file:
    pickle.dump(IncidentTicketdf_prep2, file)

In [247]:
# Specify the filename
filename = 'TelemetryAggregated_df.p'

# Combine the file path and filename
file_path_with_filename = os.path.join(file_path_output, filename)

# Save the aggregated Telemetry data into a pickle file
with open(file_path_with_filename, 'wb') as file:
    pickle.dump(Telemetry_aggSales, file)

#### All data with placements, telemetry, visits, phone calls and incident tickets

In [248]:
# Specify the filename
filename = 'BeverageMachine_all_df2.p'

# Combine the file path and filename
file_path_with_filename = os.path.join(file_path_output, filename)

# Save the DataFrame into a pickle file
with open(file_path_with_filename, 'wb') as file:
    pickle.dump(BeverageMachine_all_df, file)

In [249]:
BeverageMachine_all_df.to_csv(r'C:\Users\msalomo\predictions-BevData.csv', index = False, header=True)