* ## [1) The problem](#TheProblem)

    * #### Goal
    
* ## [2) The Data](#TheData)
    * ### [(a) Clear overview of your data](#DataOverview)
    
        ##### Beverage Machine data

        ##### Beverage Mapping data

        ##### Beverage Classification data
    
        ##### Placement Tickets data

        ##### Telemetry data

    * ### [(b) Plan to manage and process the data](#ManageData)
    
        ##### Beverage Machine data features and the Beverage Classification data features

        ##### Placement Tickets data features
        
        ##### Telemetry data features

        ##### Missing data
        
        ##### Preparation of the data in order to execute some EDA

* ## [3) Preparation of the data](#prep)         
    * ### [(a) Details of preparation](#det)
    
        #### Beverage Machine data preparation

        #### Placement Tickets data preparation

        #### Telemetry data preparation

        #### Data summary
        
    * ### [(b) Save the data](#save)


## 1) The problem <a class="anchor" id="TheProblem"></a>

The main business is a full service for beverage machine including :

    beverage machines placed at a customer’s place (rented or loaned), 
    
    the beverage ingredients (coffee beans, soluble coffee, juice, etc.) delivered to the customers 
    
    and the management of any issue and repair.

A little bit like the printers in companies where a printing machine is placed and the ink and the issues are also managed by the same company.

We have high churn rate of the beverage machine rented/loaned in our business and the goal is to reduce the churn rate by predicting which customer is more likely to churn and try to retain these customers.

The goal is to use Machine Learning in order to predict which machine is at risk of churn by calculating a churn likelihood.

The 'churned' machines are the machines that are definitively removed from their installation point thus resulting in a lower number of machines deployed dispensing beverage cups.

A churned machine generates a one-time cost for removal and replacement and a variable cost for depreciation and storage whilst a new location is found.

The Installation Point is referring to a customer's point where the machine is installed. A customer can have one or several Installation Point. A machine can be replaced by a new Machine on the same Installation Point. The Idea is to look when we lose an Installation Point, meaning that a machine distributing beverage cups has churned.

A Machine can be replaced on an Installation Point and it means we have kept the customer, so that is why we focus on the Installation Point rather than the Machine's Serial ID.

Two proposals could be used:

    Proposal 1 : We keep all the Installation Point data available and we do not aggregate the monthly data of the machines
    
    Proposal 2 : We aggregate the data of the same Installation Point over several month.
    
Example Proposal 1:

    InstPoint     Month of snapshot     ID       Churn      Age in Month      
    Inst.   1     Jan                   1        No         20
    Inst.   2     Jan                   2        No         48
    Inst.   3     Jan                   3        No         69
    Inst.   4     Jan                   4        Yes        45
    
    Inst.   1     Feb                   5        No         21
    Inst.   2     Feb                   6        No         49
    Inst.   3     Feb                   7        Yes        70
    Inst.   5     Feb                   8        No         25
    
    Inst.   1     Mar                   9        No         22
    Inst.   2     Mar                   10       No         50
    Inst.   5     Mar                   11       No         26
    Inst.   6     Mar                   12       No         30
    Inst.   7     Mar                   13       No         42
    Inst.   8     Mar                   14       No         7
    
    
Example Proposal 2:

    Inst.   #     Latest month snap     ID       Churn      Age in Month       data available since (month)     
    Inst.   1     Mar                   1        No         22                 3
    Inst.   2     Mar                   2        No         50                 3
    Inst.   3     Feb                   3        Yes        70                 1
    Inst.   4     Jan                   4        Yes        45                 2
    Inst.   5     Mar                   5        No         26                 2
    Inst.   6     Mar                   6        No         30                 1
    Inst.   7     Mar                   7        No         42                 1
    Inst.   8     Mar                   8        No         7                  1

I am currently missing all the Installation Point before January who have churned, therefore, the data is only having the current Park of Installation Point, only having the survivors, so I need to be careful of the Survivorship bias.

In order to make a Time Series problem it would be better to have more data.

I can have data from one Sales Organisation available in January and for another Sales Organisation in March.

The idea behind the first proposal was to predict a monthly churn rate, the monthly churn rate is the number of churn over the total. However with 10 month of data it is not the best solution.

With the second solution I will predict based on features if the machine has churned or not. The advantage of the second solution is that we can work without the time dimension and focus only on the features to make a prediction if the machine has churned or not.

### a) Goal <a class="anchor" id="Goal"></a>

By giving the customer's installation point with the highest churn likelihood to the managers, they can take action in order to retain more customer's installation point.

This will help to retain more customer's installation points and increase the company's deployed beverage machine park.

Also, less churn implies higher efficiency per machine (less time in the deposit) and lower cost for installation removal.

### TO DO LIST
Add Sales Org ID to vendon data and to Incident tickets and with a mapping create a key Serial-SalesOrg to link to the main data
Add acquisition cost and book value from ERP?

## 2) The data <a class="anchor" id="TheData"></a>

### (a) Clear overview of your data  <a class="anchor" id="DataOverview"></a>

pip install matplotlib

In [1]:
import pandas as pd
import os
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt

import datetime as dt
from datetime import datetime

import pickle
#Install brokenaxes
#!pip install brokenaxes

# Specify the file path
file_path_output = r'C:\Users\msalomo\OneDrive - NESTLE\Certificate Machine Learning and Data\Churn Project\Notebook output'

# Date when the data was extracted
ChurnDate2=datetime(2023,7,31)

# The range from when I want to have the details about Telemetry data
TelemetryDateRangeStart = '2020-01-01'

In [2]:
# Date when the data was extracted
import calendar

#Algorithm that gives the last day of the past month as Churn Date
CurrentDate = datetime.today()

shift_year = 0
shift_month = 1

if (CurrentDate.month == 1):
    shift_year = 1
    shift_month = -11

new_date = calendar.monthrange(CurrentDate.year - shift_year,CurrentDate.month - shift_month)
ChurnDate2 = datetime(CurrentDate.year - shift_year,CurrentDate.month - shift_month,new_date[1])

print(ChurnDate2)

2024-03-31 00:00:00


In [3]:
####Whe is the last time we had telemetry data
# Should be the same as Churn Date2

#TelemetryDate = ChurnDate2
#TelemetryDate = datetime(2020,8,31)

#PakistanLastUpdate = datetime(2021,5,31)
#PakistanDateRangeStart = datetime(2020,7,31)
#TelemetronLastUpdate = datetime(2021,7,31)
#TelemetronDateRangeStart = datetime(2021,1,31)
#VendonDateRangeStart = TelemetryDate
#VendonLastUpdate = datetime(2021,9,30)

The data has been anonymized

#### Below is a list of my datasets:

#### 1.	Beverage Machine data
    - The Beverage machine data is maintained by the Service manager of each Sales Organisation (usually a Sales Organisation corresponds to a country) and I can create a report to extract the data in excel from a database maintained by an external provider.
    - The database only keeps the latest state of the machine, therefore, I take a monthly snapshot of the data to capture the changes. 
    - This data provides details about the Beverage Machines park situation.
    - More and more Sales Organisations are going to be managed by this system, so the number of machines managed is increasing.

#### 2.	Beverage Mapping data
    - Beverage Mapping data is maintained in an Excel file by a colleague, I ask him to upload this mapping whenever I find new machines in the consolidated Beverage Machine data.
    - The goal of the file is to link the Beverage machine data to the Beverage Classification data.

#### 3.	Beverage Classification data
    - Beverage Classification data is maintained in a SharePoint file by a colleague.
    - This file is to get more technical details and features of the Beverage Machines.
    
#### 4.	Placement Tickets data
    - The Placement Tickets data is maintained by the Service manager of each Sales org and I can create a report to extract the data in excel from a database maintained by an external provider.
    - This data provides details of the placements and some incidents tickets of the Beverage Machines.
    - Sometimes the tickets are not done by the Service manager and some market does not fill this data inside the database, so only a minority of machines have this data.

#### 5.	Telemetry data
    - A new project has been launched not very long ago and some machine are equipped with telemetry data.
    - This data is stored by the telemetry provider and I asked an external colleague managing the relationship with the telemetry provider to share with me the data he could get from his requests.
    - Very few machines are equipped with telemetry data.
    - The number of machines connected with Telemetry is going to increase in the future.
    - This is not the definitive data, I have asked my colleague, but he could not provide me the final data this month, a data lake is being built in order to access the data more easily in the future


#### 6.	Visits data

#### 7. Phone Calls data

#### 8. Repair tickets data

### Beverage Machine data
Below you can find an extract of the Beverage Machine data which contains the details of the Beverage Machines

No need to use 2021 data so I turned it to Markdown

BeverageMachine_df = pd.read_csv(r"C:\Users\msalomo\OneDrive - NESTLE\Certificate Machine Learning and Data\Churn Project\Data\C4CTAUpload.csv")
BeverageMachine_df.head()

#### Additionnal beverage data

In [4]:
###From 2022 onwards
BeverageMachine22_df = pd.read_csv(r"C:\Users\msalomo\OneDrive - NESTLE\Certificate Machine Learning and Data\Churn Project\Data\C4CTAUpload22.csv")
BeverageMachine22_df.head()

  BeverageMachine22_df = pd.read_csv(r"C:\Users\msalomo\OneDrive - NESTLE\Certificate Machine Learning and Data\Churn Project\Data\C4CTAUpload22.csv")


Unnamed: 0.1,Unnamed: 0,Sales Organisation,User Status Last Changed On,Product [Machine Model],Product ID [Machine Model ID],Range Brand,Machine Status Groupings,User Status,Depreciation Start,Serial ID,...,Industry (Account ID),Industry Code 1 (Account ID),Account ABC Classification (EC ID),Industry (EC ID),Industry Code 1 (EC ID),Parent Installation Point ID,Registered Product Category (Registered Product ID),Sales Org ID (Installation Point),SAP Material Line Code [Machine Model ID],Calendar Date
0,0,Kuwait General Operational Manager,43985,NESCAFE MILANO MTS60E H4E1R2W HW Tki BM,90068903,MILANO,Deployed,Installed,0,184658239,...,0801 Nestle Companies,080107 Nestle Middle East,08 Export,0801 Nestle Companies,080107 Nestle Middle East,980519,Trade Asset w/ Fixed Asset,KW10,90068903,2022-01-31
1,1,Kuwait General Operational Manager,43985,NESCAFE MILANO MTS60E H4E1R2W HW Tki BM,90068903,MILANO,Idle,To be Assigned,0,184658259,...,0102 Hypermarket,010299 Not classified,Not assigned,Not assigned,Not assigned,#,Trade Asset w/ Fixed Asset,KW10,90068903,2022-01-31
2,2,Kuwait General Operational Manager,43985,NESCAFE ALEGRIA FTP30 v1.0 BM,100023190,ALEGRIA,Deployed,Installed,0,195061606,...,0605 Business/Industry,060501 Office Leasing Ctr,06 Out of Home,0605 Business/Industry,060501 Office Leasing Ctr,1015364,Trade Asset w/ Fixed Asset,KW10,90073039,2022-01-31
3,3,Kuwait General Operational Manager,43985,NESCAFE ALEGRIA FTP30 v1.0 BM,100023190,ALEGRIA,Deployed,Installed,44013,195061605,...,0605 Business/Industry,060501 Office Leasing Ctr,06 Out of Home,0605 Business/Industry,Not assigned,666056,Trade Asset w/ Fixed Asset,KW10,90073039,2022-01-31
4,4,Kuwait General Operational Manager,43985,EZ Care Mini-Duo BM,100023377,OTHERS-R/L/N,Deployed,Installed,39295,T070572,...,0605 Business/Industry,060501 Office Leasing Ctr,06 Out of Home,0601 Full Service Rest's,Not assigned,667092,Trade Asset w/ Fixed Asset,KW10,90045690,2022-01-31


In [5]:
BeverageMachine22_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3425600 entries, 0 to 3425599
Data columns (total 37 columns):
 #   Column                                               Dtype 
---  ------                                               ----- 
 0   Unnamed: 0                                           int64 
 1   Sales Organisation                                   object
 2   User Status Last Changed On                          object
 3   Product [Machine Model]                              object
 4   Product ID [Machine Model ID]                        int64 
 5   Range Brand                                          object
 6   Machine Status Groupings                             object
 7   User Status                                          object
 8   Depreciation Start                                   object
 9   Serial ID                                            object
 10  Manufacturer Number                                  object
 11  Equipment Number                     

Not needed anymore
### 2020 data
Bev_add2 = pd.read_excel(r"C:\Users\msalomo\OneDrive - NESTLE\Certificate Machine Learning and Data\Churn Project\Data\C4CTAUpload23.csv")
Bev_add2.head()

In [6]:
##2023 data
BeverageMachine23_df =  pd.read_csv(r"C:\Users\msalomo\OneDrive - NESTLE\Certificate Machine Learning and Data\Churn Project\Data\C4CTAUpload23.csv")
BeverageMachine23_df.head()

  BeverageMachine23_df =  pd.read_csv(r"C:\Users\msalomo\OneDrive - NESTLE\Certificate Machine Learning and Data\Churn Project\Data\C4CTAUpload23.csv")


Unnamed: 0.1,Unnamed: 0,Sales Organisation,User Status Last Changed On,Product [Machine Model],Product ID [Machine Model ID],Range Brand,Machine Status Groupings,User Status,Depreciation Start,Serial ID,...,Industry (Account ID),Industry Code 1 (Account ID),Account ABC Classification (EC ID),Industry (EC ID),Industry Code 1 (EC ID),Parent Installation Point ID,Registered Product Category (Registered Product ID),Sales Org ID (Installation Point),SAP Material Line Code [Machine Model ID],Calendar Date
0,0,NP Bosnia & Herzegovina,44447,NESCAFE ALEGRIA A630 H3A2W HW BP BM,90045171,ALEGRIA,Idle,To be Assigned,42430,16E0009895,...,1101 Exclusive,110101 Distribution Center,Not assigned,Not assigned,Not assigned,#,Trade Asset w/ Fixed Asset,BA10,90045171,2023-01-31
1,1,NP Bosnia & Herzegovina,44447,NESCAFE ALEGRIA A630 H3A2W HW BP BM,90045171,ALEGRIA,Idle,To be Assigned,42522,16E0014757,...,0618 Distributors OOH,061802 Non Exclusive,06 Out of Home,0203 Petrol Station,020399 Not classified,1046515,Trade Asset w/ Fixed Asset,BA10,90045171,2023-01-31
2,2,NP Bosnia & Herzegovina,44447,NESCAFE ALEGRIA A630 H3A2W HW BP BM,90045171,ALEGRIA,Idle,To be Assigned,42614,16E0021271,...,0618 Distributors OOH,061802 Non Exclusive,Not assigned,Not assigned,Not assigned,#,Trade Asset w/ Fixed Asset,BA10,90045171,2023-01-31
3,3,NP Bosnia & Herzegovina,44447,NESCAFE ALEGRIA A630 H3A2W HW BP BM,90045171,ALEGRIA,Idle,To be Assigned,42705,16E0021245,...,1101 Exclusive,110101 Distribution Center,Not assigned,Not assigned,Not assigned,#,Trade Asset w/ Fixed Asset,BA10,90045171,2023-01-31
4,4,NP Bosnia & Herzegovina,44447,NESCAFE ALEGRIA A630 H3A2W HW BP BM,90045171,ALEGRIA,Idle,To be Assigned,42736,16E0021249,...,1101 Exclusive,110101 Distribution Center,Not assigned,Not assigned,Not assigned,#,Trade Asset w/ Fixed Asset,BA10,90045171,2023-01-31


In [7]:
##2024 data
BeverageMachine24_df =  pd.read_csv(r"C:\Users\msalomo\OneDrive - NESTLE\Certificate Machine Learning and Data\Churn Project\Data\C4CTAUpload24.csv")
BeverageMachine24_df.head()

  BeverageMachine24_df =  pd.read_csv(r"C:\Users\msalomo\OneDrive - NESTLE\Certificate Machine Learning and Data\Churn Project\Data\C4CTAUpload24.csv")


Unnamed: 0.1,Unnamed: 0,Sales Organisation,User Status Last Changed On,Product [Machine Model],Product ID [Machine Model ID],Range Brand,Machine Status Groupings,User Status,Depreciation Start,Serial ID,...,Industry (Account ID),Industry Code 1 (Account ID),Account ABC Classification (EC ID),Industry (EC ID),Industry Code 1 (EC ID),Parent Installation Point ID,Registered Product Category (Registered Product ID),Sales Org ID (Installation Point),SAP Material Line Code [Machine Model ID],Calendar Date
0,0,Nestlé UAE,43992,NESCAFE MILANO MTS60E H4E1R2W HW Tki BM,90068903,MILANO,Deployed,Installed,43191.0,174544849,...,0614 Convenience OOH,061406 PMO:Petrol Stations,06 Out of Home,0614 Convenience OOH,061406 PMO:Petrol Stations,1498958,Trade Asset w/ Fixed Asset,AE12,90068903,2024-01-31
1,1,Nestlé UAE,43992,NESCAFE MILANO MTS60E H4E1R2W HW Tki BM,90068903,MILANO,Deployed,Installed,43191.0,174544851,...,0605 Business/Industry,060501 Office Leasing Ctr,06 Out of Home,0605 Business/Industry,060501 Office Leasing Ctr,IP-8930,Trade Asset w/ Fixed Asset,AE12,90068903,2024-01-31
2,2,Nestlé UAE,43992,NESCAFE MILANO MTS60E H4E1R2W HW Tki BM,90068903,MILANO,Deployed,Installed,43191.0,174544855,...,0605 Business/Industry,060501 Office Leasing Ctr,06 Out of Home,0605 Business/Industry,060501 Office Leasing Ctr,IP-9059,Trade Asset w/ Fixed Asset,AE12,90068903,2024-01-31
3,3,Nestlé UAE,43992,NESCAFE MILANO MTS60E H4E1R2W HW Tki BM,90068903,MILANO,Deployed,Installed,43282.0,182026936,...,0605 Business/Industry,060501 Office Leasing Ctr,06 Out of Home,0605 Business/Industry,060501 Office Leasing Ctr,IP-9146,Trade Asset w/ Fixed Asset,AE12,90068903,2024-01-31
4,4,Nestlé UAE,43992,NESCAFE MILANO MTS60E H4E1R2W HW Tki BM,90068903,MILANO,Deployed,Installed,43313.0,182026920,...,0605 Business/Industry,060503 Remote Site Company,06 Out of Home,0605 Business/Industry,060503 Remote Site Company,IP-9006,Trade Asset w/ Fixed Asset,AE12,90068903,2024-01-31


In [8]:
#BeverageMachine_df = BeverageMachine_df.append(Bev_add)
#BeverageMachine_df = BeverageMachine_df.append(Bev_add2)
BeverageMachine_df = BeverageMachine24_df.append(BeverageMachine22_df)
BeverageMachine_df = BeverageMachine_df.append(BeverageMachine23_df)
BeverageMachine_df.info()

  BeverageMachine_df = BeverageMachine24_df.append(BeverageMachine22_df)


  BeverageMachine_df = BeverageMachine_df.append(BeverageMachine23_df)


<class 'pandas.core.frame.DataFrame'>
Int64Index: 7564504 entries, 0 to 3201315
Data columns (total 37 columns):
 #   Column                                               Dtype 
---  ------                                               ----- 
 0   Unnamed: 0                                           int64 
 1   Sales Organisation                                   object
 2   User Status Last Changed On                          object
 3   Product [Machine Model]                              object
 4   Product ID [Machine Model ID]                        int64 
 5   Range Brand                                          object
 6   Machine Status Groupings                             object
 7   User Status                                          object
 8   Depreciation Start                                   object
 9   Serial ID                                            object
 10  Manufacturer Number                                  object
 11  Equipment Number                     

Manufacturer Serial number can be the same for two different machine in different countries let's create a key Key_ManufacturerID_SalesOrg

Key_ManufacturerID_SalesOrg will be used for merging local sales data from market with the main data

import pandas as pd

# Create a new column 'Key_ManufacturerID_SalesOrg' with initial values from 'Manufacturer Number' and 'Sales Organisation'
BeverageMachine_df['Key_ManufacturerID_SalesOrg'] = BeverageMachine_df['Manufacturer Number'].astype(str) + BeverageMachine_df['Sales Organisation']



# Conditionally update 'Key_ManufacturerID_SalesOrg' column if it is a specific Sales Organisation
specific_market = 'Nestlé Russia'  # Replace with the name of your specific market
# for Russia use Account ID instead of Manufacturer Number
BeverageMachine_df.loc[BeverageMachine_df['Sales Organisation'] == specific_market, 'Key_ManufacturerID_SalesOrg'] = BeverageMachine_df['Account ID'].astype(str) + BeverageMachine_df['Sales Organisation']


In [9]:
# Create a new column 'Key_ManufacturerID_SalesOrg' with initial values from 'Manufacturer Number' and 'Sales Organisation'
BeverageMachine_df['Key_ManufacturerID_SalesOrg'] = BeverageMachine_df['Manufacturer Number'].astype(str) + BeverageMachine_df['Sales Organisation']

#Account ID should be of type "String"
BeverageMachine_df['Account ID'] = BeverageMachine_df['Account ID'].astype(str)

# Conditionally update 'Key_ManufacturerID_SalesOrg' column if it is a specific market
specific_market = 'Nestle South Africa' # Replace with the name of your specific market
# for South Africa use Account ID instead of Manufacturer Number
BeverageMachine_df['Key_ManufacturerID_SalesOrg'] = BeverageMachine_df.apply(lambda row: row['Account ID'] + row['Sales Organisation'] if row['Sales Organisation'] == specific_market else row['Key_ManufacturerID_SalesOrg'], axis=1)


BeverageMachine_df['Key_ManufacturerID_SalesOrg'] = BeverageMachine_df['Manufacturer Number'].astype(str) +  BeverageMachine_df['Sales Organisation'] 

Serial Id should be a string had issue with mix type for same serial ID

In [10]:
BeverageMachine_df['Serial ID'] = BeverageMachine_df['Serial ID'].astype('str')

BeverageMachine_df['Parent Installation Point ID'] = BeverageMachine_df['Parent Installation Point ID'].astype('str')

In [11]:
BeverageMachine_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7564504 entries, 0 to 3201315
Data columns (total 38 columns):
 #   Column                                               Dtype 
---  ------                                               ----- 
 0   Unnamed: 0                                           int64 
 1   Sales Organisation                                   object
 2   User Status Last Changed On                          object
 3   Product [Machine Model]                              object
 4   Product ID [Machine Model ID]                        int64 
 5   Range Brand                                          object
 6   Machine Status Groupings                             object
 7   User Status                                          object
 8   Depreciation Start                                   object
 9   Serial ID                                            object
 10  Manufacturer Number                                  object
 11  Equipment Number                     

In [12]:
BeverageMachine_df = BeverageMachine_df.query("`Product [Machine Model]` != 'Vendon Telemetry Device – vBox BM'")

In [13]:
BeverageMachine_df['Sales Organisation'].unique()

array(['Nestlé UAE', 'NP Bosnia & Herzegovina', 'Néstlé Bahrain',
       'NESTLE PROD SERV - CN19', 'SHL NESTLE PROD SERV',
       'Nestlé Denmark', 'Nestlé Finland', 'Nestle Hong Kong',
       'Indonesia', 'JP Japan Sales',
       'Kuwait General Operational Manager', 'NP North Macedonia',
       'Malaysia', 'Nestle New Zealand', 'Nestlé PH', 'Nestlé Qatar',
       'Nestlé Russia', 'Nestlé Slovak Republic', 'Nestle Turkiye Gida',
       'Nestle South Africa', 'Singapore', 'Nestle Australia Ltd',
       'NP-Bulgaria', 'NESTLE PROD SERV - CN17',
       'NESTLE PROD SERV - CN20', 'Nestlé Czech', 'Nestle UK',
       'NP Croatia, Slovenia', 'Nestlé India', 'Néstlé Jordania',
       'Nestle Kenya Ltd', 'Néstlé Lebanon', 'Nestle Prd Mauritius Ltd',
       'NP-Netherlands', 'Nestlé Norway',
       'Oman - Business Manager UAE & Oman', 'Pakistan',
       'NP Serbia, Kosovo, Montenegro', 'Néstlé Saudi Arabia',
       'Nestle Sweden', 'Thailand', 'Nestlé Taiwan', 'NP-Belgilux',
       'NP-France

Removed some markets from analysis

In [14]:
BeverageMachine_df2 = BeverageMachine_df.copy()
BeverageMachine_df2=BeverageMachine_df2.loc[BeverageMachine_df2['Sales Organisation']!='NESTLE PROD SERV - CN17']
BeverageMachine_df2=BeverageMachine_df2.loc[BeverageMachine_df2['Sales Organisation']!='NESTLE PROD SERV - CN19']
BeverageMachine_df2=BeverageMachine_df2.loc[BeverageMachine_df2['Sales Organisation']!='NESTLE PROD SERV - CN20']
#BeverageMachine_df2=BeverageMachine_df2.loc[BeverageMachine_df2['Sales Organisation']!='SHL NESTLE PROD SERV']
#BeverageMachine_df2=BeverageMachine_df2.loc[BeverageMachine_df2['Sales Organisation']!='Nestlé Taiwan']
#BeverageMachine_df2=BeverageMachine_df2.loc[BeverageMachine_df2['Sales Organisation']!='NP-Netherlands']


#BeverageMachine_df2 = BeverageMachine_df2.drop(BeverageMachine_df2.loc[BeverageMachine_df2['Sales Organisation']=='Nestlé India'].index, inplace=True)
#BeverageMachine_df2 = BeverageMachine_df2.drop(BeverageMachine_df2.loc[BeverageMachine_df2['Sales Organisation']=='NESTLE PROD SERV - CN17'].index, inplace=True)
BeverageMachine_df2.head()

Unnamed: 0.1,Unnamed: 0,Sales Organisation,User Status Last Changed On,Product [Machine Model],Product ID [Machine Model ID],Range Brand,Machine Status Groupings,User Status,Depreciation Start,Serial ID,...,Industry Code 1 (Account ID),Account ABC Classification (EC ID),Industry (EC ID),Industry Code 1 (EC ID),Parent Installation Point ID,Registered Product Category (Registered Product ID),Sales Org ID (Installation Point),SAP Material Line Code [Machine Model ID],Calendar Date,Key_ManufacturerID_SalesOrg
0,0,Nestlé UAE,43992,NESCAFE MILANO MTS60E H4E1R2W HW Tki BM,90068903,MILANO,Deployed,Installed,43191.0,174544849,...,061406 PMO:Petrol Stations,06 Out of Home,0614 Convenience OOH,061406 PMO:Petrol Stations,1498958,Trade Asset w/ Fixed Asset,AE12,90068903,2024-01-31,20174544849Nestlé UAE
1,1,Nestlé UAE,43992,NESCAFE MILANO MTS60E H4E1R2W HW Tki BM,90068903,MILANO,Deployed,Installed,43191.0,174544851,...,060501 Office Leasing Ctr,06 Out of Home,0605 Business/Industry,060501 Office Leasing Ctr,IP-8930,Trade Asset w/ Fixed Asset,AE12,90068903,2024-01-31,20174544851Nestlé UAE
2,2,Nestlé UAE,43992,NESCAFE MILANO MTS60E H4E1R2W HW Tki BM,90068903,MILANO,Deployed,Installed,43191.0,174544855,...,060501 Office Leasing Ctr,06 Out of Home,0605 Business/Industry,060501 Office Leasing Ctr,IP-9059,Trade Asset w/ Fixed Asset,AE12,90068903,2024-01-31,20174544855Nestlé UAE
3,3,Nestlé UAE,43992,NESCAFE MILANO MTS60E H4E1R2W HW Tki BM,90068903,MILANO,Deployed,Installed,43282.0,182026936,...,060501 Office Leasing Ctr,06 Out of Home,0605 Business/Industry,060501 Office Leasing Ctr,IP-9146,Trade Asset w/ Fixed Asset,AE12,90068903,2024-01-31,20182026936Nestlé UAE
4,4,Nestlé UAE,43992,NESCAFE MILANO MTS60E H4E1R2W HW Tki BM,90068903,MILANO,Deployed,Installed,43313.0,182026920,...,060503 Remote Site Company,06 Out of Home,0605 Business/Industry,060503 Remote Site Company,IP-9006,Trade Asset w/ Fixed Asset,AE12,90068903,2024-01-31,20182026920Nestlé UAE


In [15]:
BeverageMachine_df2['Sales Organisation'].unique()

array(['Nestlé UAE', 'NP Bosnia & Herzegovina', 'Néstlé Bahrain',
       'SHL NESTLE PROD SERV', 'Nestlé Denmark', 'Nestlé Finland',
       'Nestle Hong Kong', 'Indonesia', 'JP Japan Sales',
       'Kuwait General Operational Manager', 'NP North Macedonia',
       'Malaysia', 'Nestle New Zealand', 'Nestlé PH', 'Nestlé Qatar',
       'Nestlé Russia', 'Nestlé Slovak Republic', 'Nestle Turkiye Gida',
       'Nestle South Africa', 'Singapore', 'Nestle Australia Ltd',
       'NP-Bulgaria', 'Nestlé Czech', 'Nestle UK', 'NP Croatia, Slovenia',
       'Nestlé India', 'Néstlé Jordania', 'Nestle Kenya Ltd',
       'Néstlé Lebanon', 'Nestle Prd Mauritius Ltd', 'NP-Netherlands',
       'Nestlé Norway', 'Oman - Business Manager UAE & Oman', 'Pakistan',
       'NP Serbia, Kosovo, Montenegro', 'Néstlé Saudi Arabia',
       'Nestle Sweden', 'Thailand', 'Nestlé Taiwan', 'NP-Belgilux',
       'NP-France', 'NP Croatia, Slovenia-HR11', 'Nestlé Italy IT35 OOH'],
      dtype=object)

In [16]:
BeverageMachine_df = BeverageMachine_df2

In [17]:
BeverageMachine_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6453678 entries, 0 to 3201315
Data columns (total 38 columns):
 #   Column                                               Dtype 
---  ------                                               ----- 
 0   Unnamed: 0                                           int64 
 1   Sales Organisation                                   object
 2   User Status Last Changed On                          object
 3   Product [Machine Model]                              object
 4   Product ID [Machine Model ID]                        int64 
 5   Range Brand                                          object
 6   Machine Status Groupings                             object
 7   User Status                                          object
 8   Depreciation Start                                   object
 9   Serial ID                                            object
 10  Manufacturer Number                                  object
 11  Equipment Number                     

### Beverage Mapping data

In [18]:
# (A) Load the Beverage Mapping data
BevMap_df = pd.read_csv(r"C:\Users\msalomo\OneDrive - NESTLE\Certificate Machine Learning and Data\Churn Project\Data\SBU-11 NESTLE PRO. Translation.csv")
BevMap_df['ID Model Code']=BevMap_df['ID Model Code'].astype(str)
BevMap_df.head()

Unnamed: 0,Brand Name,Description,ID Model Code,Source,Model,Revised,Modified,Modified By
0,Accolade,Accolade 12oz,ACC-FPD-12z,BMB,N&W Astro Accolade,Yes,06/14/2023 10:07 AM,"Baeza,Jordi,CH-ORBE"
1,Accolade,Accolade 9oz,ACC-FPD- 9z,BMB,N&W Astro Accolade,Yes,06/14/2023 10:07 AM,"Baeza,Jordi,CH-ORBE"
2,ALEGRIA,Chest Freezer NP PK BM,100069870,C4C,Accessories,Yes,06/14/2023 10:07 AM,"Baeza,Jordi,CH-ORBE"
3,ALEGRIA,Chiller SAX 250 NP BM PK,100069872,C4C,Others,Yes,06/14/2023 10:07 AM,"Baeza,Jordi,CH-ORBE"
4,ALEGRIA,Chiller SAX 400 NP BM PK,100069869,C4C,Others,Yes,06/14/2023 10:07 AM,"Baeza,Jordi,CH-ORBE"


In [19]:
BevMap_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2432 entries, 0 to 2431
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Brand Name     2432 non-null   object
 1   Description    2432 non-null   object
 2   ID Model Code  2432 non-null   object
 3   Source         2432 non-null   object
 4   Model          2432 non-null   object
 5   Revised        2432 non-null   object
 6   Modified       2432 non-null   object
 7   Modified By    2432 non-null   object
dtypes: object(8)
memory usage: 152.1+ KB


### Beverage Classification data

In [20]:
# (A) Load the Beverage Classification data
BeverageClassification_df = pd.read_csv(r"C:\Users\msalomo\OneDrive - NESTLE\Certificate Machine Learning and Data\Churn Project\Data\SBU-11 NESTLE PRO. Models.csv")
BeverageClassification_df.head()

Unnamed: 0,Model,Model Vendor,Model Category,Global Projects,System Brands,Solution Brands,Model Group,Generation,Product,Ingredient Format,...,PSL,TAA & TAR Ownership,TAA & TAR,SC & Planning,Production,IM,Sustainability LCA Ownership,Sustainability LCA,Vendon Compatible,Technical Capacity
0,4Swiss Roma A10 PRO,Others,Mainstream B2C,%23-N/A,Branded others,Branded Others,Other,Legacy,Pure R&G,Pure R&G,...,Validated,Market,Not Done,Market,Discontinued,Market,Market,Not Done,,20
1,Accessories,Generic,Other,%23-N/A,Branded others,Non-Branded,Other,Legacy,%23-Unknown,Other,...,Validated,Market,Not Done,Market,Discontinued,Market,Market,Not Done,No,0
2,Alegria V-Café 140,Crem,Hot Liquid,Alegria,Nescafé Alegria,Nescafé Alegria,NA Legacy,Gen. 1,Hot Liquid,Liquid,...,Mandatory,Region,Not Done,Region,Active,Region,Region,Not Done,,0
3,Alegria V-Café 2120,Crem,Hot Liquid,Alegria,Nescafé Alegria,Nescafé Alegria,NA Legacy,Gen. 1,Hot Liquid,Liquid,...,Mandatory,Region,Not Done,Region,Active,Region,Region,Not Done,,0
4,Alegria V-Café 4500,NP Beverages,Hot Liquid,Alegria,Nescafé Alegria,Nescafé Alegria,NA Legacy,Gen. 1,Hot Liquid,Liquid,...,Validated,Market,Not Done,Market,Discontinued,Market,Market,Not Done,,0


In [21]:
BeverageClassification_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 565 entries, 0 to 564
Data columns (total 28 columns):
 #   Column                        Non-Null Count  Dtype 
---  ------                        --------------  ----- 
 0   Model                         564 non-null    object
 1   Model Vendor                  565 non-null    object
 2   Model Category                565 non-null    object
 3   Global Projects               565 non-null    object
 4   System Brands                 565 non-null    object
 5   Solution Brands               565 non-null    object
 6   Model Group                   565 non-null    object
 7   Generation                    565 non-null    object
 8   Product                       565 non-null    object
 9   Ingredient Format             565 non-null    object
 10  Model Category 2              565 non-null    object
 11  Machine Type                  565 non-null    object
 12  Beverage Temperature          565 non-null    object
 13  Positionning        

### Placement Tickets data

In [22]:
pip install openpyxl




[notice] A new release of pip is available: 23.1.2 -> 24.0
[notice] To update, run: python.exe -m pip install --upgrade pip





In [23]:
# Load the Placement Tickets data
#Placement_df = pd.read_excel(r"C:\Users\msalomo\OneDrive - NESTLE\Certificate Machine Learning and Data\Churn Project\Data\Net Placements.xlsx")
#Placement_df.tail()

In [24]:
#Placement_df.info()

In [25]:
# Read the Excel file into a pandas DataFrame and filter columns
file_path = r"C:\Users\msalomo\OneDrive - NESTLE\Certificate Machine Learning and Data\Churn Project\Data\Net Placements.xlsx"
selected_columns = ['Serial ID', 'Service Category', 'INCIDENT_CATEGORY_DESCRIPTION']
Placement_df = pd.read_excel(file_path, usecols=selected_columns)
# Perform operations on the selected DataFrame as needed
print(Placement_df)

       Service Category INCIDENT_CATEGORY_DESCRIPTION   Serial ID
0               Removal                Low throughput    10033350
1               Removal                Low throughput    10039905
2               Removal                Low throughput    10051892
3               Removal                Low throughput    10041863
4               Removal                Low throughput    10048413
...                 ...                           ...         ...
303637          Removal               End of contract    13000529
303638          Removal               End of contract    13001374
303639          Removal               End of contract    19000159
303640          Removal               End of contract  20O0017708
303641          Removal               End of contract  21O0020584

[303642 rows x 3 columns]


### Telemetry data

Get data from URL for Telemetry Data Lake

In [26]:
url = "https://queryenginelandingprod.blob.core.windows.net/shared/np/churn/np_churn_historical_consumption_by_product_group.csv?sp=r&st=2023-04-10T06:40:59Z&se=2050-04-10T14:40:59Z&spr=https&sv=2021-12-02&sr=b&sig=d%2Fn5C%2FWWksWDI%2FiiZEqwz5mOaw2jAqkW9DHUOSz6R7Q%3D"
np_churn_consumption2 =pd.read_csv(url)
np_churn_consumption2

Unnamed: 0,date,serial,sap_serial,quantity,salesorg,machine_id,product_group
0,2021-04-30,20193844106,19E0014923,48,UKI,56,ESPRESSO
1,2021-04-30,20192330649,19E0011261,1,UKI,4570,ESPRESSO
2,2021-04-30,20195061997,,23,ESAR,6207,WHITE COFFEE
3,2021-04-30,20192024074,,40,ESAR,6853,WHITE COFFEE
4,2021-04-30,20193140136,19E0013999,5,UKI,1679,ESPRESSO
...,...,...,...,...,...,...,...
3024316,2022-09-30,20161313610,,1,Russia,1169537,FLAVOURED CAPPUCCINO
3024317,2022-09-30,20174241708,,1,Russia,1175667,CHOCOLATE
3024318,2022-09-30,20170705803,,1,Russia,1171414,CHOCOLATE
3024319,2022-09-30,20162726340,,1,Russia,1169533,MOCHA


In [27]:
#url= "https://queryenginelandingstag.blob.core.windows.net/shared/np/churn/np_churn_historical_consumption.csv?sp=r&st=2022-09-02T07:17:17Z&se=2050-09-02T15:17:17Z&spr=https&sv=2021-06-08&sr=b&sig=hiIpKctZ%2BlxXwR9E%2BVReK1TnsQqZrcayCYu%2BZaCynlw%3D"
url = "https://queryenginelandingprod.blob.core.windows.net/shared/np/churn/np_churn_historical_consumption.csv?sp=r&st=2022-11-29T12:22:43Z&se=2050-11-29T20:22:43Z&spr=https&sv=2021-06-08&sr=b&sig=JZE599UA3foRsJ6ZbOHW6M0nWexxLc3JCB49gJ%2B2faU%3D"
np_churn_consumption =pd.read_csv(url)
np_churn_consumption

Unnamed: 0,date,serial,sap_serial,quantity,salesorg,machine_id
0,2021-04-30,20195061994,,777,ESAR,2068
1,2021-04-30,20151111313,513344,1171,Brazil,6135
2,2021-04-30,20132424364,509864,960,Brazil,3048
3,2021-04-30,20132424558,510058,206,Brazil,6923
4,2021-04-30,20204632034,,35,Czech Republic,5999
...,...,...,...,...,...,...
423396,2024-02-29,20213824593,70010003293,1,Vietnam,494943
423397,2024-02-29,20231619328,20231619328,1,France,3650826
423398,2024-03-31,20214933569,70010060796,2,Poland,658341
423399,2024-03-31,20204833180,70010001980,2,MENA,1129988


In [28]:
np_churn_consumption = np_churn_consumption.rename(columns={"date": "Month", "salesorg": "SalesOrg"}).reset_index()
np_churn_consumption.head()

Unnamed: 0,index,Month,serial,sap_serial,quantity,SalesOrg,machine_id
0,0,2021-04-30,20195061994,,777,ESAR,2068
1,1,2021-04-30,20151111313,513344.0,1171,Brazil,6135
2,2,2021-04-30,20132424364,509864.0,960,Brazil,3048
3,3,2021-04-30,20132424558,510058.0,206,Brazil,6923
4,4,2021-04-30,20204632034,,35,Czech Republic,5999


In [29]:
np_churn_consumption=np_churn_consumption.drop(columns=['sap_serial','index', 'machine_id'])
np_churn_consumption.head()

Unnamed: 0,Month,serial,quantity,SalesOrg
0,2021-04-30,20195061994,777,ESAR
1,2021-04-30,20151111313,1171,Brazil
2,2021-04-30,20132424364,960,Brazil
3,2021-04-30,20132424558,206,Brazil
4,2021-04-30,20204632034,35,Czech Republic


In [30]:
Telemetry_df = np_churn_consumption

# Load the Telemetry data
Telemetry_df = pd.read_excel(r"C:\Users\msalomo\OneDrive - NESTLE\Certificate Machine Learning and Data\Churn Project\Data\Telemetry2021.xlsx")
Telemetry_df.tail()

In [31]:
Telemetry_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 423401 entries, 0 to 423400
Data columns (total 4 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   Month     423401 non-null  object
 1   serial    411501 non-null  object
 2   quantity  423401 non-null  int64 
 3   SalesOrg  423383 non-null  object
dtypes: int64(1), object(3)
memory usage: 12.9+ MB


# Load the Telemetry data
Telemetry_add = pd.read_excel(r"C:\Users\msalomo\OneDrive - NESTLE\Certificate Machine Learning and Data\Churn Project\Data\Telemetry2022.xlsx")
Telemetry_add.tail()

Telemetry_df = Telemetry_df.append(Telemetry_add)
Telemetry_df.info()

Telemetry_df.tail()

I will only keep the telemetry data in the date range, so it is only the telemtry data starting after "TelemetryDateRangeStart"

#Telemetry_df1 = Telemetry_df.loc[Telemetry_df['Month']>=TelemetryDateRangeStart]

I will aggregate the number of Cup Sales for each Machine by the feature called 'serial' which corresponds to the feature 'Manufacturer Number' in the Beverage Machine data

In [32]:
Telemetry_aggSales = Telemetry_df['quantity'].groupby(Telemetry_df['serial'], axis=0).sum()
Telemetry_aggSales_df = Telemetry_aggSales.to_frame().reset_index()
Telemetry_aggSales_df

Unnamed: 0,serial,quantity
0,'20202016602,64242
1,'Y20231619331,6884
2,.,82347
3,00000000000,987
4,0000000141981,7
...,...,...
30263,Х580BGS230203370085,2211
30264,Х580BGS230203370088,8907
30265,Х580BGS230203370097,2192
30266,Х580BGS230203370098,1648


In [33]:
Telemetry_df1 = Telemetry_df.groupby(['Month', 'serial']).sum()
# df.groupby(['col5', 'col2']).size()
#['quantity']
Telemetry_df1 = Telemetry_df1.reset_index()

Telemetry_df1

  Telemetry_df1 = Telemetry_df.groupby(['Month', 'serial']).sum()


Unnamed: 0,Month,serial,quantity
0,2021-04-30,'20202016602,423
1,2021-04-30,000103535,1508
2,2021-04-30,000103550,1161
3,2021-04-30,00089366,155
4,2021-04-30,00094264,232
...,...,...,...
401547,2024-03-31,Х580BGS230203370081,412
401548,2024-03-31,Х580BGS230203370085,486
401549,2024-03-31,Х580BGS230203370088,2452
401550,2024-03-31,Х580BGS230203370097,626


from dateutil.relativedelta import relativedelta

one_month = TelemetryDate + relativedelta(months=-1)
three_months = TelemetryDate + relativedelta(months=-3)
six_months = TelemetryDate + relativedelta(months=-6)

Telemetry_df_one_month = Telemetry_df1.loc[Telemetry_df1['Month']>one_month]
Telemetry_df_three_months = Telemetry_df1.loc[Telemetry_df1['Month']>three_months]
Telemetry_df_six_months = Telemetry_df1.loc[Telemetry_df1['Month']>six_months]

TODO
why "Telemetry_aggSales_one_month_avg = Telemetry_df_one_month['quantity'].groupby(Telemetry_df_one_month['serial'], axis=0).count()"

not this?
Telemetry_aggSales_one_month_avg = Telemetry_df_one_month['quantity'].groupby(Telemetry_df_one_month['serial'], axis=0).sum()

In [34]:
Telemetry_df_one_month = Telemetry_df1.sort_values('Month').groupby('serial').agg({'quantity' : lambda x: x.tail(1).sum()})

Telemetry_df_three_months = Telemetry_df1.sort_values('Month').groupby('serial').agg({'quantity' : lambda x: x.tail(3).sum()/3})

Telemetry_df_six_months = Telemetry_df1.sort_values('Month').groupby('serial').agg({'quantity' : lambda x: x.tail(6).sum()/6})

Telemetry_df_one_month = Telemetry_df_one_month.rename(columns={"quantity": "one_month_avg"}).reset_index()

Telemetry_df_three_months = Telemetry_df_three_months.rename(columns={"quantity": "three_months_avg"}).reset_index()

Telemetry_df_six_months = Telemetry_df_six_months.rename(columns={"quantity": "six_months_avg"}).reset_index()

Telemetry_aggSales_one_month_avg = Telemetry_df_one_month['quantity'].groupby(Telemetry_df_one_month['serial'], axis=0).count()
Telemetry_aggSales_one_month_avg = Telemetry_aggSales_one_month_avg.to_frame().reset_index()
Telemetry_aggSales_one_month_avg = Telemetry_aggSales_one_month_avg.rename(columns={"quantity": "one_Month_avg"})

Telemetry_aggSales_three_months_avg = Telemetry_df_three_months['quantity'].groupby(Telemetry_df_three_months['serial'], axis=0).count()
Telemetry_aggSales_three_months_avg = Telemetry_aggSales_three_months_avg.to_frame().reset_index()
Telemetry_aggSales_three_months_avg = Telemetry_aggSales_three_months_avg.rename(columns={"quantity": "three_months_avg"})

Telemetry_aggSales_six_months_avg = Telemetry_df_six_months['quantity'].groupby(Telemetry_df_six_months['serial'], axis=0).count()
Telemetry_aggSales_six_months_avg = Telemetry_aggSales_six_months_avg.to_frame().reset_index()
Telemetry_aggSales_six_months_avg = Telemetry_aggSales_six_months_avg.rename(columns={"quantity": "six_months_avg"})

Telemetry_aggSales_three_months_avg

Telemetry_aggSales_three_months_avg['three_months_avg'] = Telemetry_aggSales_three_months_avg['three_months_avg'].apply(lambda x: x/3)

Telemetry_aggSales_six_months_avg['six_months_avg'] = Telemetry_aggSales_six_months_avg['six_months_avg'].apply(lambda x: x/6)

Telemetry_aggSales_three_months_avg 

I used 'left' instead of 'inner' because I want all the machines that had data

In [35]:
Telemetry_aggSales_df

Unnamed: 0,serial,quantity
0,'20202016602,64242
1,'Y20231619331,6884
2,.,82347
3,00000000000,987
4,0000000141981,7
...,...,...
30263,Х580BGS230203370085,2211
30264,Х580BGS230203370088,8907
30265,Х580BGS230203370097,2192
30266,Х580BGS230203370098,1648


In [36]:
Telemetry_df_one_month

Unnamed: 0,serial,one_month_avg
0,'20202016602,1401
1,'Y20231619331,1240
2,.,8291
3,00000000000,34
4,0000000141981,5
...,...,...
30263,Х580BGS230203370085,486
30264,Х580BGS230203370088,2452
30265,Х580BGS230203370097,626
30266,Х580BGS230203370098,663


In [37]:
Telemetry_aggSales_df1 = pd.merge(Telemetry_aggSales_df, Telemetry_df_one_month, how='left', left_on = ['serial'], right_on = ['serial'])
Telemetry_aggSales_df1.head()

Unnamed: 0,serial,quantity,one_month_avg
0,'20202016602,64242,1401
1,'Y20231619331,6884,1240
2,.,82347,8291
3,00000000000,987,34
4,0000000141981,7,5


In [38]:
Telemetry_aggSales_df2 = pd.merge(Telemetry_aggSales_df1, Telemetry_df_three_months, how='left', left_on = ['serial'], right_on = ['serial'])
Telemetry_aggSales_df3 = pd.merge(Telemetry_aggSales_df2, Telemetry_df_six_months, how='left', left_on = ['serial'], right_on = ['serial'])
Telemetry_aggSales_df3 = Telemetry_aggSales_df3.fillna(0)
Telemetry_aggSales_df3

Unnamed: 0,serial,quantity,one_month_avg,three_months_avg,six_months_avg
0,'20202016602,64242,1401,2447.000000,2536.166667
1,'Y20231619331,6884,1240,1350.666667,1147.333333
2,.,82347,8291,8127.333333,6710.833333
3,00000000000,987,34,329.000000,164.500000
4,0000000141981,7,5,2.333333,1.166667
...,...,...,...,...,...
30263,Х580BGS230203370085,2211,486,591.666667,368.500000
30264,Х580BGS230203370088,8907,2452,2272.333333,1484.500000
30265,Х580BGS230203370097,2192,626,473.333333,365.333333
30266,Х580BGS230203370098,1648,663,485.333333,274.666667


In [39]:
Telemetry_aggSales_df3['serial'] = Telemetry_aggSales_df3['serial'].astype(str)

In [40]:
Telemetry_aggSales_df3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30268 entries, 0 to 30267
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   serial            30268 non-null  object 
 1   quantity          30268 non-null  int64  
 2   one_month_avg     30268 non-null  int64  
 3   three_months_avg  30268 non-null  float64
 4   six_months_avg    30268 non-null  float64
dtypes: float64(2), int64(2), object(1)
memory usage: 1.4+ MB


### 6. Visits data

In [41]:
# Load the Visits data
#Visitsdf = pd.read_excel(os.path.join('C:', 'Users', 'msalomo', 'Churn Project', 'Data', 'Sales Visits.xlsx'))

Visitsdf = pd.read_excel(r"C:\Users\msalomo\OneDrive - NESTLE\Certificate Machine Learning and Data\Churn Project\Data\Sales Visits.xlsx")
Visitsdf.head()

  warn(msg)


Unnamed: 0,Month,Year,Period,Counter_visits_completed,Cummulative,Cummulative_Final,Cummulative Graph,Occurence Balancing,Activity Owner,Activity Owner ID,...,Sales Unit (Hierarchy),Sales Unit (Hierarchy) ID,Activity Life Cycle Status id,Activity Life Cycle Status,Counter_visits,Visit Description,Visit,Account ID.Account ID Level 01,Account ID.Account ID Level 01.Key,Index
0,4,2023,2023 - 04,1,75 173,21 127,4,RU3A751732023 - 04,ESR NP_Екатеринбург_Север_esr,6482.0,...,Market OOO Ekaterinburg,NPRU100012164,3,Completed,1,Аудит,2322418,"ООО Уралойл ул. Ленина 1, ББ",1938062,2124
1,8,2023,2023 - 08,1,166 262,24 821,7,RU3A1662622023 - 08,ESR NP_Екатеринбург_Север_esr,6482.0,...,Market OOO Ekaterinburg,NPRU100012164,3,Completed,1,Аудит,2541707,"ООО Уралойл ул. Ленина 1, ББ",1938062,2125
2,6,2023,2023 - 06,1,120 635,22 440,5,RU3A1206352023 - 06,ESR NP_Екатеринбург_Север_esr,6482.0,...,Market OOO Ekaterinburg,NPRU100012164,3,Completed,1,Аудит,2409667,"Шаурма, ул. Рябова 3а",1936072,3463
3,11,2023,2023 - 11,1,239 379,26 318,9,RU3A2393792023 - 11,ESR NP_Екатеринбург_Север_esr,6482.0,...,Market OOO Ekaterinburg,NPRU100012164,3,Completed,1,Аудит,2685903,"Шаурма, ул. Рябова 3а",1936072,3466
4,6,2023,2023 - 06,1,120 635,22 440,5,RU3A1206352023 - 06,ESR NP_Екатеринбург_Север_esr,6482.0,...,Market OOO Ekaterinburg,NPRU100012164,3,Completed,1,Аудит,2418486,АЗС № 66449,1937417,3753


### 7. Phone Calls data

In [42]:
# Load the Visits data
#PhoneCallsdf = pd.read_excel(os.path.join('C:', 'Users', 'msalomo', 'Churn Project', 'Data', 'Completed Phone Calls.xlsx'))

PhoneCallsdf = pd.read_excel(r"C:\Users\msalomo\OneDrive - NESTLE\Certificate Machine Learning and Data\Churn Project\Data\Completed Phone Calls.xlsx", dtype={'Account Name': str})
PhoneCallsdf.head()

Unnamed: 0,Activity Name,Account Name,Activity Owner,Activity Life Cycle Status,Phone Call ID,Objective (Phone Call),Sales Organization,End Date in Local Time Zone,Start Date in Local Time Zone,PeriodEnd,ee
0,2023-03-21- Residence Call 2,7316409,Jadala Aishwarya,Completed,1076101,,IN14,"mercredi, 22 mars 2023","mercredi, 22 mars 2023",2023 - 03,4473.0
1,2023-03-22- Bazar Call 1,7323610,Jadala Aishwarya,Completed,1076189,,IN14,"mercredi, 22 mars 2023","mercredi, 22 mars 2023",2023 - 03,4473.0
2,2023-03-21- Yes Call 1,7317829,Jadala Aishwarya,Completed,1075454,,IN14,"mercredi, 22 mars 2023","mercredi, 22 mars 2023",2023 - 03,4473.0
3,2023-03-22- College Call 1,7323619,Jadala Aishwarya,Completed,1076183,,IN14,"mercredi, 22 mars 2023","mercredi, 22 mars 2023",2023 - 03,4473.0
4,2023-03-21- Yatri Nivas Hotel Call 2,7316945,Jadala Aishwarya,Completed,1076356,,IN14,"mercredi, 22 mars 2023","mercredi, 22 mars 2023",2023 - 03,4473.0


### 8. Incident Tickets data

In [44]:
IncidentTicketdf = pd.read_excel(r"C:\Users\msalomo\OneDrive - NESTLE\Certificate Machine Learning and Data\Churn Project\Data\Incident tickets.xlsx")
IncidentTicketdf.head()

Unnamed: 0,Index,SLAMet,YearMonth,Period,NextDateAux,NextDateAux2,AuxTime,TimeFrom,Next CreatedDatevar,TimeTo,...,SUB_TICKET_SALES_ORGANIZATION_ID,SUB_TICKET_SALES_ORGANIZATION_DESCRIPTION,ITEM_TARGET_INSTALLATION_POINT,WORK_PROGRESS,DATE_OF_LAST_MOVEMENT,COMPLETION_DUE_DATE,SERVICE_CATEGORY,SERVICE_CATEGORY_DESCRIPTION,SALESORG,COUNTER
0,2192-07-14 00:00:00,0,202110.0,2021 - 10,-738067.0,"vendredi, 18 mai 1979",Yes,20,2021-10-25,9999,...,BG10,Nestle Bulgaria A.D.,,6,NaT,"mardi, 5 octobre 2021",CA_5,Repair,BG10,1
1,2194-03-24 00:00:00,0,202110.0,2021 - 10,-738069.0,"mercredi, 16 mai 1979",No,1,2021-10-08,9999,...,BG10,Nestle Bulgaria A.D.,,6,NaT,"vendredi, 8 octobre 2021",CA_5,Repair,BG10,1
2,2194-11-06 00:00:00,0,202110.0,2021 - 10,-738069.0,"mercredi, 16 mai 1979",No,5,2021-10-12,9999,...,BG10,Nestle Bulgaria A.D.,,6,NaT,"mercredi, 6 octobre 2021",CA_5,Repair,BG10,1
3,2194-12-10 00:00:00,0,202110.0,2021 - 10,-738069.0,"mercredi, 16 mai 1979",Yes,344,2022-09-16,9999,...,BG10,Nestle Bulgaria A.D.,,6,NaT,"vendredi, 8 octobre 2021",CA_5,Repair,BG10,1
4,2195-06-01 00:00:00,0,202110.0,2021 - 10,-738070.0,"mardi, 15 mai 1979",Yes,893,2024-03-19,9999,...,BG10,Nestle Bulgaria A.D.,,6,NaT,"samedi, 9 octobre 2021",CA_5,Repair,BG10,1


### 9. Market specific data

UK stopped providing their service data

# Load the Visits data
#PhoneCallsdf = pd.read_excel(os.path.join('C:', 'Users', 'msalomo', 'Churn Project', 'Data', 'Completed Phone Calls.xlsx'))

UKService = pd.read_excel(r"C:\Users\msalomo\OneDrive - NESTLE\Certificate Machine Learning and Data\Churn Project\Data\UK Service data 202103.xlsx")


UKService['Key_ManufacturerID_SalesOrg'] = UKService['Serial N'].astype(str) + "Nestle UK"

UKService.head()

def preprocess_UKService(df):
    # Work on a copy
    df = df.copy()

    nomi_vars = ['Fault Codes', 'FTF']
    
    dummy_columns = nomi_vars
        
    df = pd.get_dummies(df, columns=dummy_columns)

    return df

UKService_prep = preprocess_UKService(UKService)
UKService_prep.head()

Quick check on a serial with 7 entries

UKService_prepX=UKService_prep.loc[UKService_prep['Serial N'] =='0010237915']

UKService_prepX

UKService_prep.columns

UKService_prep2 = (UKService_prep.sort_values('Month')
    .groupby(["Key_ManufacturerID_SalesOrg"])
                      .agg({'Month': lambda s: s.values[-1], 
                            'Serial N': lambda s: s.values[-1], 
                            'Minutes': 'mean', 
                            'Fault Codes_Blocked ingredients' : 'sum', 
                            'Fault Codes_Boiler fault' : 'sum',
                            'Fault Codes_Booked' : 'sum', 
                            'Fault Codes_Brewer issue' : 'sum',
                            'Fault Codes_Canister' : 'sum', 
                            'Fault Codes_Card reader install' : 'sum',
                            'Fault Codes_Card reader removal' : 'sum',
                            'Fault Codes_Change drink size' : 'sum',
                            'Fault Codes_Cleaning / Hygiene kit ' : 'sum',
                            'Fault Codes_Coinmech' : 'sum',
                            'Fault Codes_Decomission' : 'sum', 
                            'Fault Codes_Delivery' : 'sum',
                            'Fault Codes_Display issue' : 'sum', 
                            'Fault Codes_Door' : 'sum',
                            'Fault Codes_Drink Strength/ taste' : 'sum', 
                            'Fault Codes_Faulty door' : 'sum',
                            'Fault Codes_Filters ' : 'sum', 
                            'Fault Codes_Fridge fault' : 'sum', 
                            'Fault Codes_Leak' : 'sum',
                            'Fault Codes_Machine empty no ingredients' : 'sum', 
                            'Fault Codes_Measures' : 'sum',
                            'Fault Codes_Motor' : 'sum', 
                            'Fault Codes_No Fault Found' : 'sum',
                            'Fault Codes_No power Socket' : 'sum', 
                            'Fault Codes_Not dispensing drinks' : 'sum',
                            'Fault Codes_Not heating' : 'sum', 
                            'Fault Codes_Other' : 'sum',
                            'Fault Codes_Other (derial number check etc)' : 'sum', 
                            'Fault Codes_PPM' : 'sum',
                            'Fault Codes_Power / CPU Machine' : 'sum', 
                            'Fault Codes_Price increase' : 'sum',
                            'Fault Codes_Pump / Valve  internal' : 'sum',
                            'Fault Codes_Pump / Water External' : 'sum', 
                            'Fault Codes_Telemetry' : 'sum',
                            'Fault Codes_Training' : 'sum', 
                            'Fault Codes_consumable' : 'sum',
                            'Fault Codes_machine Install' : 'sum', 
                            'Fault Codes_workshop' : 'sum', 
                            'FTF_1' : 'sum', 
                            'FTF_2' : 'sum',
                            'FTF_3' : 'sum', 
                            'FTF_4' : 'sum'                  
    })
)

UKService_prep2

### 10. LOCAL DATA

Sales & Telemetron & Vendon 2021 Sales

PakistanSales both Serial no and manuf no are the same
RussiaSalesData uses Manuf no

Vendon data uses manuf no

telemetron uses manuf no

In [45]:
PakistanSales = pd.read_excel(r"C:\Users\msalomo\OneDrive - NESTLE\Certificate Machine Learning and Data\Churn Project\Data\Sales Data Pakistan.xlsx")
MalaysiaSales = pd.read_excel(r"C:\Users\msalomo\OneDrive - NESTLE\Certificate Machine Learning and Data\Churn Project\Data\Sales_Data_Malaysia.xlsx")

# Drop the 'Serial' column
MalaysiaSales.drop('Serial', axis=1, inplace=True)
# Rename the 'Serial Manufacturer' column to 'Serial'
MalaysiaSales.rename(columns={'Serial Manufacturer': 'Serial'}, inplace=True)

In [46]:
PakistanSales.head()

Unnamed: 0,Serial,quantity,Month
0,20O0014321,2512.8206,2021-01-01
1,7010054243,8488.0412,2021-01-01
2,7010055066,91133.6902,2021-01-01
3,7010045635,91133.6902,2021-01-01
4,7010058209,91133.6902,2021-01-01


In [47]:
PakistanSales['Serial'] = PakistanSales['Serial'].astype(str)
MalaysiaSales['Serial'] = MalaysiaSales['Serial'].astype(str)

In [48]:
RussiaSalesData = pd.read_excel(r"C:\Users\msalomo\OneDrive - NESTLE\Certificate Machine Learning and Data\Churn Project\Data\Sales data Russia.xlsx")

In [49]:
RussiaSalesData.tail()

Unnamed: 0,Date,Machine Manufacturer Serial Number,ПРОДАЖИ (NPS)
864004,2024-03-31,20182128784,0.0
864005,2024-03-31,20173032554,0.0
864006,2024-03-31,20170908163,0.0
864007,2024-03-31,15297DU17072840691,0.0
864008,2024-03-31,15297DU17072840705,0.0


In [50]:
RussiaSalesData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 864009 entries, 0 to 864008
Data columns (total 3 columns):
 #   Column                              Non-Null Count   Dtype         
---  ------                              --------------   -----         
 0   Date                                864009 non-null  datetime64[ns]
 1   Machine Manufacturer Serial Number  864003 non-null  object        
 2   ПРОДАЖИ (NPS)                       863971 non-null  float64       
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 19.8+ MB


In [51]:
SouthAfricaSales = pd.read_excel(r"C:\Users\msalomo\OneDrive - NESTLE\Certificate Machine Learning and Data\Churn Project\Data\Sales_Data_South_Africa - consumables.xlsx")

In [52]:
SouthAfricaSales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3862 entries, 0 to 3861
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   AccountID  3862 non-null   int64         
 1   quantity   3862 non-null   float64       
 2   Month      3862 non-null   datetime64[ns]
dtypes: datetime64[ns](1), float64(1), int64(1)
memory usage: 90.6 KB


In [53]:
SingaporeSales = pd.read_excel(r"C:\Users\msalomo\OneDrive - NESTLE\Certificate Machine Learning and Data\Churn Project\Data\Sales_Data_Singapore.xlsx")

In [54]:
SingaporeSales['Month'] = pd.to_datetime(SingaporeSales['Month'])
SingaporeSales['Serial ID'] = SingaporeSales['Serial ID'].astype(str)

  SingaporeSales['Month'] = pd.to_datetime(SingaporeSales['Month'])


In [55]:
SingaporeSales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85606 entries, 0 to 85605
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   Serial ID            85606 non-null  object        
 1   Month                85606 non-null  datetime64[ns]
 2   Sales                79954 non-null  float64       
 3   Ship to              85606 non-null  object        
 4   Account ID           85606 non-null  int64         
 5   Manufacturer Number  85606 non-null  object        
dtypes: datetime64[ns](1), float64(1), int64(1), object(3)
memory usage: 3.9+ MB


pip install pandasql

In [56]:
# create data frame
df1 = SingaporeSales
 
print("Original DataFrame")
 
# print original data frame
display(df1)
 
# create a dictionary
# key = old name
# value = new name
dict = {'Serial ID': 'Serial_ID',
        'Month' : 'Month',
        'Sales' : 'Sales',
        'Ship to': 'Ship_to',
       'Account ID' : 'Account_ID',
       'Manufacturer Number' : 'Manufacturer_Number'}
 
print("\nAfter rename")
# call rename () method
df1.rename(columns=dict,
          inplace=True)
 
# print Data frame after rename columns
display(df1)

Original DataFrame


Unnamed: 0,Serial ID,Month,Sales,Ship to,Account ID,Manufacturer Number
0,SGBMB03059,2023-01-31,0.00,30885489704,3981172,20092213179
1,SGBMB04056,2023-01-31,,27835PA00101,3981872,20092414016
2,SGBMB03049,2023-01-31,0.00,30885489184,3981521,20092414029
3,SGBMB03772,2023-01-31,2412.35,27835KEP008A01,3982306,20094625024
4,SGBMB03804,2023-01-31,0.00,30885489344,3981360,20094625056
...,...,...,...,...,...,...
85601,23O0043567,2024-03-31,8601.96,280317190225,8410298,20234650036
85602,23O0043569,2024-03-31,8601.96,280317190225,8410298,20234650038
85603,23O0047359,2024-03-31,490.44,6767280N-2403003,9054652,EFSIN23120001
85604,23O0047344,2024-03-31,0.00,6750157O5969,8921214,3400000263847



After rename


Unnamed: 0,Serial_ID,Month,Sales,Ship_to,Account_ID,Manufacturer_Number
0,SGBMB03059,2023-01-31,0.00,30885489704,3981172,20092213179
1,SGBMB04056,2023-01-31,,27835PA00101,3981872,20092414016
2,SGBMB03049,2023-01-31,0.00,30885489184,3981521,20092414029
3,SGBMB03772,2023-01-31,2412.35,27835KEP008A01,3982306,20094625024
4,SGBMB03804,2023-01-31,0.00,30885489344,3981360,20094625056
...,...,...,...,...,...,...
85601,23O0043567,2024-03-31,8601.96,280317190225,8410298,20234650036
85602,23O0043569,2024-03-31,8601.96,280317190225,8410298,20234650038
85603,23O0047359,2024-03-31,490.44,6767280N-2403003,9054652,EFSIN23120001
85604,23O0047344,2024-03-31,0.00,6750157O5969,8921214,3400000263847


In [57]:
import pandas as pd
import sqlite3

# create a sample DataFrame
df = df1

# create an in-memory SQLite database
conn = sqlite3.connect(':memory:')

# write the DataFrame to the database
df.to_sql('my_table', con=conn)

# define the SQL query
query = '''
SELECT Month, Sales AS quantity, Ship_to, Manufacturer_Number,
       COUNT(Manufacturer_Number) OVER (PARTITION BY Month, Ship_to) AS Manufacturer_Count,
       (Sales / COUNT(Manufacturer_Number) OVER (PARTITION BY Month, Ship_to)) AS Sales_perMachine
FROM my_table
'''

# run the query using pandas
result = pd.read_sql_query(query, conn)

# print the result
print(result)

                     Month  quantity         Ship_to Manufacturer_Number  \
0      2023-01-31 00:00:00       0.0      2215122122         20103018840   
1      2023-01-31 00:00:00       0.0      2215122122         20103018859   
2      2023-01-31 00:00:00       0.0      2215122122         20102917864   
3      2023-01-31 00:00:00       0.0      2215122122         20102917875   
4      2023-01-31 00:00:00       0.0      2215122122         20103923848   
...                    ...       ...             ...                 ...   
85601  2024-03-31 00:00:00       0.0  69338116933811         20141213799   
85602  2024-03-31 00:00:00       0.0  69338116933811         20141213800   
85603  2024-03-31 00:00:00       0.0  69338116933811         20141213801   
85604  2024-03-31 00:00:00       0.0  69338116933811         20141213823   
85605  2024-03-31 00:00:00       0.0  69768386976838         20224845590   

       Manufacturer_Count  Sales_perMachine  
0                     204               0

In [58]:
SingaporeSales = result
SingaporeSales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85606 entries, 0 to 85605
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Month                85606 non-null  object 
 1   quantity             79954 non-null  float64
 2   Ship_to              85606 non-null  object 
 3   Manufacturer_Number  85606 non-null  object 
 4   Manufacturer_Count   85606 non-null  int64  
 5   Sales_perMachine     79954 non-null  float64
dtypes: float64(2), int64(1), object(3)
memory usage: 3.9+ MB


In [59]:
SingaporeSales.rename(columns={'Manufacturer_Number': 'Manufacturer Number'}, inplace=True)
SingaporeSales.rename(columns={'quantity': 'quantityold'}, inplace=True)
SingaporeSales.rename(columns={'Sales_perMachine': 'quantity'}, inplace=True)

In [60]:
SingaporeSales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85606 entries, 0 to 85605
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Month                85606 non-null  object 
 1   quantityold          79954 non-null  float64
 2   Ship_to              85606 non-null  object 
 3   Manufacturer Number  85606 non-null  object 
 4   Manufacturer_Count   85606 non-null  int64  
 5   quantity             79954 non-null  float64
dtypes: float64(2), int64(1), object(3)
memory usage: 3.9+ MB


In [61]:
RussiaSalesData.rename(columns={'Machine Manufacturer Serial Number': 'Serial', 'ПРОДАЖИ (NPS)': 'quantity'}, inplace=True)
RussiaSalesData

Unnamed: 0,Date,Serial,quantity
0,2021-01-31,4228,0.0
1,2021-01-31,5419,0.0
2,2021-01-31,5477,0.0
3,2021-01-31,420090,0.0
4,2021-01-31,420283,0.0
...,...,...,...
864004,2024-03-31,20182128784,0.0
864005,2024-03-31,20173032554,0.0
864006,2024-03-31,20170908163,0.0
864007,2024-03-31,15297DU17072840691,0.0


In [62]:
RussiaSalesData['quantity'] = RussiaSalesData['quantity'].astype(float)
RussiaSalesData['Serial'] = RussiaSalesData['Serial'].astype(str)

In [63]:
RussiaSalesData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 864009 entries, 0 to 864008
Data columns (total 3 columns):
 #   Column    Non-Null Count   Dtype         
---  ------    --------------   -----         
 0   Date      864009 non-null  datetime64[ns]
 1   Serial    864009 non-null  object        
 2   quantity  863971 non-null  float64       
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 19.8+ MB


In [64]:
TelemetronData = pd.read_excel(r"C:\Users\msalomo\OneDrive - NESTLE\Certificate Machine Learning and Data\Churn Project\Data\Telemetron Data.xlsx")

In [65]:
TelemetronData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51329 entries, 0 to 51328
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   Month           51329 non-null  datetime64[ns]
 1   Machine serial  51329 non-null  int64         
 2   Total           51329 non-null  int64         
dtypes: datetime64[ns](1), int64(2)
memory usage: 1.2 MB


In [66]:
TelemetronData.rename(columns={'Machine serial': 'serial', 'Total': 'quantity'}, inplace=True)

In [67]:
TelemetronData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51329 entries, 0 to 51328
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   Month     51329 non-null  datetime64[ns]
 1   serial    51329 non-null  int64         
 2   quantity  51329 non-null  int64         
dtypes: datetime64[ns](1), int64(2)
memory usage: 1.2 MB


In [68]:
PakistanSales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 231279 entries, 0 to 231278
Data columns (total 3 columns):
 #   Column    Non-Null Count   Dtype         
---  ------    --------------   -----         
 0   Serial    231279 non-null  object        
 1   quantity  231279 non-null  float64       
 2   Month     231279 non-null  datetime64[ns]
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 5.3+ MB


VendonData = pd.read_excel(r"C:\Users\msalomo\Churn Project\Data\Telemetry2021.xlsx")

VendonData.head()

In [69]:
Pakistan_aggSales = PakistanSales['quantity'].groupby(PakistanSales['Serial'], axis=0).sum()
Pakistan_aggSales = Pakistan_aggSales.reset_index()

# Perform the aggregation on 'quantity' grouped by 'Serial'
Malaysia_aggSales = MalaysiaSales['quantity'].groupby(MalaysiaSales['Serial']).sum().reset_index()
# Print the 'Malaysia_aggSales' DataFrame
print(Malaysia_aggSales)

           Serial     quantity
0      1005010001   128.367330
1      1005010006   742.657798
2      1005010007   742.657798
3      1005010009  2748.936909
4      1005010011  6442.373203
...           ...          ...
15977     T573424  1104.270000
15978     T573425  2423.145633
15979     T573426   359.400000
15980     T573427  4612.954324
15981  VZCA 00001  5386.958367

[15982 rows x 2 columns]


In [70]:
Telemetron_agg = TelemetronData['quantity'].groupby(TelemetronData['serial'], axis=0).sum()
Telemetron_agg = Telemetron_agg.reset_index()
Telemetron_agg

Unnamed: 0,serial,quantity
0,2018223011,63409
1,20104730813,640
2,20110907777,3351
3,20122320353,20686
4,20122320803,8324
...,...,...
2491,20204732782,10338
2492,20204732783,11597
2493,200180404900,33575
2494,201751552475,111485


In [71]:
RussiaSalesData_agg = RussiaSalesData['quantity'].groupby(RussiaSalesData['Serial'], axis=0).sum()
RussiaSalesData_agg = RussiaSalesData_agg.reset_index()
RussiaSalesData_agg

Unnamed: 0,Serial,quantity
0,00001428-0011,0.00
1,00001429-0001,51785.57
2,00001429-0004,0.00
3,00001429-0006,0.00
4,00001429-0007,0.00
...,...,...
30384,ХК 0115,0.00
30385,ХК 0116,0.00
30386,ХК 0118,0.00
30387,ХК 0120,0.00


In [72]:
SouthAfrica_aggSales = SouthAfricaSales['quantity'].groupby(SouthAfricaSales['AccountID'], axis=0).sum()
SouthAfrica_aggSales = SouthAfrica_aggSales.reset_index()
SouthAfrica_aggSales

Unnamed: 0,AccountID,quantity
0,365014,0.63
1,365018,40.53
2,366680,1068.01
3,366935,13691.11
4,367418,144338.28
...,...,...
574,7136798,155547.18
575,7141430,6503.40
576,7145255,683460.98
577,7149375,972822.83


In [73]:
Singapore_aggSales = SingaporeSales['quantity'].groupby(SingaporeSales['Manufacturer Number'], axis=0).sum()
Singapore_aggSales = Singapore_aggSales.reset_index()
Singapore_aggSales

Unnamed: 0,Manufacturer Number,quantity
0,141982300000383201,3419.030000
1,141982300001083201,3200.850000
2,141982300003083201,12183.988922
3,141982300003283201,3208.420000
4,141992300000483201,2571.080000
...,...,...
6302,ZEBA0072,19373.090169
6303,ZEBA0073,12183.988922
6304,ZEBA0074,15559.990009
6305,ZEBA0075,17402.895143


#Vendon_agg = VendonData['quantity'].groupby(VendonData['serial'], axis=0).sum()

Vendon_agg = (VendonData.sort_values('Month')
    .groupby(["serial"])
                      .agg({'SalesOrg' : lambda s: s.values[-1],
                            'quantity' : 'sum'}))

Vendon_agg = Vendon_agg.reset_index()
Vendon_agg

PakistanSales = PakistanSales.loc[PakistanSales['Month']>=PakistanDateRangeStart]

TelemetronData = TelemetronData.loc[TelemetronData['Month']>=TelemetronDateRangeStart]

VendonData = VendonData.loc[VendonData['Month']>=VendonDateRangeStart]

I will aggregate the number of Cup Sales for each Machine by the feature called 'serial' which corresponds to the feature 'Manufacturer Number' in the Beverage Machine data

In [74]:
#df.sort_values('date').groupby('id').tail(1)
#df.sort_values('date').groupby('id').apply(lambda x: x.tail(1))
#df.groupby('Type').apply(lambda x: x.tail(3).mean())

PakistanSales_df1 = PakistanSales.groupby(['Month', 'Serial']).sum()
PakistanSales_df1 = PakistanSales_df1.reset_index()


PakistanSales_one_month = PakistanSales_df1.sort_values('Month').groupby('Serial').agg({'quantity' : lambda x: x.tail(1).sum()})

PakistanSales_three_months = PakistanSales_df1.sort_values('Month').groupby('Serial').agg({'quantity' : lambda x: x.tail(3).sum()/3})

PakistanSales_six_months = PakistanSales_df1.sort_values('Month').groupby('Serial').agg({'quantity' : lambda x: x.tail(6).sum()/6})

# Group by 'Month' and 'Serial' and calculate the sum
MalaysiaSales_df1 = MalaysiaSales.groupby(['Month', 'Serial']).sum().reset_index()
# Calculate the sum of 'quantity' for the latest month for each 'Serial'
MalaysiaSales_one_month = MalaysiaSales_df1.sort_values('Month').groupby('Serial').agg({'quantity': lambda x: x.tail(1).sum()})
# Calculate the average of 'quantity' for the last three months for each 'Serial'
MalaysiaSales_three_months = MalaysiaSales_df1.sort_values('Month').groupby('Serial').agg({'quantity': lambda x: x.tail(3).sum() / 3})
# Calculate the average of 'quantity' for the last six months for each 'Serial'
MalaysiaSales_six_months = MalaysiaSales_df1.sort_values('Month').groupby('Serial').agg({'quantity': lambda x: x.tail(6).sum() / 6})
# Print the resulting DataFrames
print(MalaysiaSales_one_month)


               quantity
Serial                 
1005010001     0.000000
1005010006     0.000000
1005010007     0.000000
1005010009   417.780483
1005010011  1151.370455
...                 ...
T573424        0.000000
T573425      374.281441
T573426        0.000000
T573427      294.736667
VZCA 00001   828.944922

[15982 rows x 1 columns]


In [75]:
PakistanSales_one_month = PakistanSales_one_month.reset_index()
PakistanSales_three_months = PakistanSales_three_months.reset_index()
PakistanSales_six_months = PakistanSales_six_months.reset_index()

# Reset the index for 'MalaysiaSales_one_month'
MalaysiaSales_one_month = MalaysiaSales_one_month.reset_index()
# Reset the index for 'MalaysiaSales_three_months'
MalaysiaSales_three_months = MalaysiaSales_three_months.reset_index()
# Reset the index for 'MalaysiaSales_six_months'
MalaysiaSales_six_months = MalaysiaSales_six_months.reset_index()
# Print the 'MalaysiaSales_three_months' DataFrame
print(MalaysiaSales_three_months)

           Serial     quantity
0      1005010001     6.391393
1      1005010006     0.000000
2      1005010007     0.000000
3      1005010009   444.384206
4      1005010011  1539.484848
...           ...          ...
15977     T573424    39.450000
15978     T573425   388.879459
15979     T573426     0.000000
15980     T573427   704.707014
15981  VZCA 00001   917.998386

[15982 rows x 2 columns]


from dateutil.relativedelta import relativedelta

one_month_pak = PakistanLastUpdate + relativedelta(months=-1)
three_months_pak = PakistanLastUpdate + relativedelta(months=-3)
six_months_pak = PakistanLastUpdate + relativedelta(months=-6)

PakistanSales_one_month = PakistanSales.loc[PakistanSales['Month']>one_month_pak]
PakistanSales_three_months = PakistanSales.loc[PakistanSales['Month']>three_months_pak]
PakistanSales_six_months = PakistanSales.loc[PakistanSales['Month']>six_months_pak]

TelemetronData_one_month = TelemetronData.loc[TelemetronData['Month']>one_month_pak]
TelemetronData_three_months = TelemetronData.loc[TelemetronData['Month']>three_months_pak]
TelemetronData_six_months = TelemetronData.loc[TelemetronData['Month']>six_months_pak]

VendonData_one_month = VendonData.loc[VendonData['Month']>one_month_pak]
VendonData_three_months = VendonData.loc[VendonData['Month']>three_months_pak]
VendonData_six_months = VendonData.loc[VendonData['Month']>six_months_pak]

In [76]:
PakistanSales_one_month

Unnamed: 0,Serial,quantity
0,10010063319,15412.0800
1,2000014136,58078.4300
2,2000014290,5000.0000
3,2000014292,17440.0000
4,2000014293,38800.0000
...,...,...
15145,7010070073,27537.3339
15146,7010070077,9979.3956
15147,7010070112,36690.0000
15148,7010070113,32800.0000


PakistanSales_one_month.loc[PakistanSales_one_month['serial'] != '70010058920']

PakistanSales_one_month_avg = PakistanSales_one_month['quantity'].groupby(PakistanSales_one_month['serial'], axis=0).sum()

PakistanSales_one_month_avg = PakistanSales_one_month_avg.to_frame().reset_index()
PakistanSales_one_month_avg = PakistanSales_one_month_avg.rename(columns={"quantity": "one_Month_avg"})

In [77]:
#PakistanSales_one_month_avg = PakistanSales_one_month['quantity'].groupby(PakistanSales_one_month['serial'], axis=0).sum()
#PakistanSales_one_month_avg = PakistanSales_one_month_avg.to_frame().reset_index()
PakistanSales_one_month_avg = PakistanSales_one_month.rename(columns={"quantity": "Sales_one_Month_avg"})

#PakistanSales_three_months_avg = PakistanSales_three_months['quantity'].groupby(PakistanSales_three_months['serial'], axis=0).sum()
#PakistanSales_three_months_avg = PakistanSales_three_months_avg.to_frame().reset_index()
PakistanSales_three_months_avg = PakistanSales_three_months.rename(columns={"quantity": "Sales_three_months_avg"})

#PakistanSales_six_months_avg = PakistanSales_six_months['quantity'].groupby(PakistanSales_six_months['serial'], axis=0).sum()
#PakistanSales_six_months_avg = PakistanSales_six_months_avg.to_frame().reset_index()
PakistanSales_six_months_avg = PakistanSales_six_months.rename(columns={"quantity": "Sales_six_months_avg"})

PakistanSales_three_months_avg

# Rename the column in 'MalaysiaSales_one_month'
MalaysiaSales_one_month_avg = MalaysiaSales_one_month.rename(columns={"quantity": "Sales_one_Month_avg"})
# Rename the column in 'MalaysiaSales_three_months'
MalaysiaSales_three_months_avg = MalaysiaSales_three_months.rename(columns={"quantity": "Sales_three_months_avg"})
# Rename the column in 'MalaysiaSales_six_months'
MalaysiaSales_six_months_avg = MalaysiaSales_six_months.rename(columns={"quantity": "Sales_six_months_avg"})
# Print the 'MalaysiaSales_three_months_avg' DataFrame
print(MalaysiaSales_three_months_avg)

           Serial  Sales_three_months_avg
0      1005010001                6.391393
1      1005010006                0.000000
2      1005010007                0.000000
3      1005010009              444.384206
4      1005010011             1539.484848
...           ...                     ...
15977     T573424               39.450000
15978     T573425              388.879459
15979     T573426                0.000000
15980     T573427              704.707014
15981  VZCA 00001              917.998386

[15982 rows x 2 columns]


In [78]:
SouthAfrica_aggSales

Unnamed: 0,AccountID,quantity
0,365014,0.63
1,365018,40.53
2,366680,1068.01
3,366935,13691.11
4,367418,144338.28
...,...,...
574,7136798,155547.18
575,7141430,6503.40
576,7145255,683460.98
577,7149375,972822.83


In [79]:
SouthAfricaSales_df1 = SouthAfricaSales.groupby(['Month', 'AccountID']).sum()
SouthAfricaSales_df1 = SouthAfricaSales.reset_index()


SouthAfricaSales_one_month = SouthAfricaSales_df1.sort_values('Month').groupby('AccountID').agg({'quantity' : lambda x: x.tail(1).sum()})

SouthAfricaSales_three_months = SouthAfricaSales_df1.sort_values('Month').groupby('AccountID').agg({'quantity' : lambda x: x.tail(3).sum()/3})

SouthAfricaSales_six_months = SouthAfricaSales_df1.sort_values('Month').groupby('AccountID').agg({'quantity' : lambda x: x.tail(6).sum()/6})


SouthAfricaSales_one_month = SouthAfricaSales_one_month.reset_index()
SouthAfricaSales_three_months = SouthAfricaSales_three_months.reset_index()
SouthAfricaSales_six_months = SouthAfricaSales_six_months.reset_index()
SouthAfricaSales_three_months


SouthAfricaSales_one_month_avg = SouthAfricaSales_one_month.rename(columns={"quantity": "Sales_one_Month_avg"})


SouthAfricaSales_three_months_avg = SouthAfricaSales_three_months.rename(columns={"quantity": "Sales_three_months_avg"})


SouthAfricaSales_six_months_avg = SouthAfricaSales_six_months.rename(columns={"quantity": "Sales_six_months_avg"})

SouthAfricaSales_three_months_avg

Unnamed: 0,AccountID,Sales_three_months_avg
0,365014,0.026667
1,365018,13.376667
2,366680,356.003333
3,366935,4563.703333
4,367418,11818.086667
...,...,...
574,7136798,10700.113333
575,7141430,2167.800000
576,7145255,46244.286667
577,7149375,67403.756667


In [80]:
SingaporeSales_df1 = SingaporeSales.groupby(['Month', 'Manufacturer Number']).sum()
SingaporeSales_df1 = SingaporeSales.reset_index()


SingaporeSales_one_month = SingaporeSales_df1.sort_values('Month').groupby('Manufacturer Number').agg({'quantity' : lambda x: x.tail(1).sum()})

SingaporeSales_three_months = SingaporeSales_df1.sort_values('Month').groupby('Manufacturer Number').agg({'quantity' : lambda x: x.tail(3).sum()/3})

SingaporeSales_six_months = SingaporeSales_df1.sort_values('Month').groupby('Manufacturer Number').agg({'quantity' : lambda x: x.tail(6).sum()/6})


SingaporeSales_one_month = SingaporeSales_one_month.reset_index()
SingaporeSales_three_months = SingaporeSales_three_months.reset_index()
SingaporeSales_six_months = SingaporeSales_six_months.reset_index()
SingaporeSales_three_months


SingaporeSales_one_month_avg = SingaporeSales_one_month.rename(columns={"quantity": "Sales_one_Month_avg"})

SingaporeSales_three_months_avg = SingaporeSales_three_months.rename(columns={"quantity": "Sales_three_months_avg"})

SingaporeSales_six_months_avg = SingaporeSales_six_months.rename(columns={"quantity": "Sales_six_months_avg"})

SingaporeSales_three_months_avg

  SingaporeSales_df1 = SingaporeSales.groupby(['Month', 'Manufacturer Number']).sum()


Unnamed: 0,Manufacturer Number,Sales_three_months_avg
0,141982300000383201,219.810000
1,141982300001083201,165.150000
2,141982300003083201,711.387249
3,141982300003283201,174.733333
4,141992300000483201,222.613333
...,...,...
6302,ZEBA0072,1187.497178
6303,ZEBA0073,711.387249
6304,ZEBA0074,711.387249
6305,ZEBA0075,711.387249


In [81]:
TelemetronData_df1 = TelemetronData.groupby(['Month', 'serial']).sum()
TelemetronData_df1 = TelemetronData_df1.reset_index()

TelemetronData_one_month = TelemetronData_df1.sort_values('Month').groupby('serial').agg({'quantity' : lambda x: x.tail(1).sum()})

TelemetronData_three_months = TelemetronData_df1.sort_values('Month').groupby('serial').agg({'quantity' : lambda x: x.tail(3).sum()/3})

TelemetronData_six_months = TelemetronData_df1.sort_values('Month').groupby('serial').agg({'quantity' : lambda x: x.tail(6).sum()/6})

TelemetronData_one_month = TelemetronData_one_month.reset_index()
TelemetronData_three_months = TelemetronData_three_months.reset_index()
TelemetronData_six_months = TelemetronData_six_months.reset_index()
TelemetronData_three_months

Unnamed: 0,serial,quantity
0,2018223011,2796.000000
1,20104730813,43.000000
2,20110907777,197.333333
3,20122320353,877.333333
4,20122320803,1074.666667
...,...,...
2491,20204732782,571.000000
2492,20204732783,600.666667
2493,200180404900,1613.333333
2494,201751552475,4815.333333


In [82]:
#TelemetronData_one_month_avg = TelemetronData_one_month['quantity'].groupby(TelemetronData_one_month['serial'], axis=0).sum()
#TelemetronData_one_month_avg = TelemetronData_one_month_avg.to_frame().reset_index()
TelemetronData_one_month_avg = TelemetronData_one_month.rename(columns={"quantity": "one_month_avg"})

#TelemetronData_three_months_avg = TelemetronData_three_months['quantity'].groupby(TelemetronData_three_months['serial'], axis=0).sum()
#TelemetronData_three_months_avg = TelemetronData_three_months_avg.to_frame().reset_index()
TelemetronData_three_months_avg = TelemetronData_three_months.rename(columns={"quantity": "three_months_avg"})

#TelemetronData_six_months_avg = TelemetronData_six_months['quantity'].groupby(TelemetronData_six_months['serial'], axis=0).sum()
#TelemetronData_six_months_avg = TelemetronData_six_months_avg.to_frame().reset_index()
TelemetronData_six_months_avg = TelemetronData_six_months.rename(columns={"quantity": "six_months_avg"})

TelemetronData_three_months_avg

Unnamed: 0,serial,three_months_avg
0,2018223011,2796.000000
1,20104730813,43.000000
2,20110907777,197.333333
3,20122320353,877.333333
4,20122320803,1074.666667
...,...,...
2491,20204732782,571.000000
2492,20204732783,600.666667
2493,200180404900,1613.333333
2494,201751552475,4815.333333


In [83]:
RussiaSalesData_df1 = RussiaSalesData.groupby(['Date', 'Serial']).sum()
RussiaSalesData_df1 = RussiaSalesData_df1.reset_index()

RussiaSalesData_one_month = RussiaSalesData_df1.sort_values('Date').groupby('Serial').agg({'quantity' : lambda x: x.tail(1).sum()})

RussiaSalesData_three_months = RussiaSalesData_df1.sort_values('Date').groupby('Serial').agg({'quantity' : lambda x: x.tail(3).sum()/3})

RussiaSalesData_six_months = RussiaSalesData_df1.sort_values('Date').groupby('Serial').agg({'quantity' : lambda x: x.tail(6).sum()/6})

RussiaSalesData_one_month = RussiaSalesData_one_month.reset_index()
RussiaSalesData_three_months = RussiaSalesData_three_months.reset_index()
RussiaSalesData_six_months = RussiaSalesData_six_months.reset_index()

RussiaSalesData_one_month_avg = RussiaSalesData_one_month.rename(columns={"quantity": "Sales_one_Month_avg"})

RussiaSalesData_three_months_avg = RussiaSalesData_three_months.rename(columns={"quantity": "Sales_three_months_avg"})

RussiaSalesData_six_months_avg = RussiaSalesData_six_months.rename(columns={"quantity": "Sales_six_months_avg"})



TelemetronData_one_month_avg = TelemetronData_one_month['quantity'].groupby(TelemetronData_one_month['serial'], axis=0).sum()
TelemetronData_one_month_avg = TelemetronData_one_month_avg.to_frame().reset_index()
TelemetronData_one_month_avg = TelemetronData_one_month_avg.rename(columns={"quantity": "one_Month_avg"})

TelemetronData_three_months_avg = TelemetronData_three_months['quantity'].groupby(TelemetronData_three_months['serial'], axis=0).sum()
TelemetronData_three_months_avg = TelemetronData_three_months_avg.to_frame().reset_index()
TelemetronData_three_months_avg = TelemetronData_three_months_avg.rename(columns={"quantity": "three_months_avg"})

TelemetronData_six_months_avg = TelemetronData_six_months['quantity'].groupby(TelemetronData_six_months['serial'], axis=0).sum()
TelemetronData_six_months_avg = TelemetronData_six_months_avg.to_frame().reset_index()
TelemetronData_six_months_avg = TelemetronData_six_months_avg.rename(columns={"quantity": "six_months_avg"})

TelemetronData_three_months_avg

VendonData_one_month_avg = VendonData_one_month['quantity'].groupby(VendonData_one_month['serial'], axis=0).sum()
VendonData_one_month_avg = VendonData_one_month_avg.to_frame().reset_index()
VendonData_one_month_avg = VendonData_one_month_avg.rename(columns={"quantity": "one_Month_avg"})

VendonData_three_months_avg = VendonData_three_months['quantity'].groupby(VendonData_three_months['serial'], axis=0).sum()
VendonData_three_months_avg = VendonData_three_months_avg.to_frame().reset_index()
VendonData_three_months_avg = VendonData_three_months_avg.rename(columns={"quantity": "three_months_avg"})

VendonData_six_months_avg = VendonData_six_months['quantity'].groupby(VendonData_six_months['serial'], axis=0).sum()
VendonData_six_months_avg = VendonData_six_months_avg.to_frame().reset_index()
VendonData_six_months_avg = VendonData_six_months_avg.rename(columns={"quantity": "six_months_avg"})

VendonData_three_months_avg

#already done with change of code
PakistanSales_three_months_avg['three_months_avg'] = PakistanSales_three_months_avg['Sales_three_months_avg'].apply(lambda x: x/3)

PakistanSales_six_months_avg['six_months_avg'] = PakistanSales_six_months_avg['Sales_six_months_avg'].apply(lambda x: x/6)


In [84]:
PakistanSales_one_month_avg.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15150 entries, 0 to 15149
Data columns (total 2 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Serial               15150 non-null  object 
 1   Sales_one_Month_avg  15150 non-null  float64
dtypes: float64(1), object(1)
memory usage: 236.8+ KB


In [85]:
Pakistan_aggSales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15150 entries, 0 to 15149
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Serial    15150 non-null  object 
 1   quantity  15150 non-null  float64
dtypes: float64(1), object(1)
memory usage: 236.8+ KB


In [86]:
PakistanSales_df = pd.merge(Pakistan_aggSales, PakistanSales_one_month_avg, how='left', left_on = ['Serial'], right_on = ['Serial'])


# Merge the DataFrames based on the 'Serial' column
PakistanSales_df = pd.merge(Pakistan_aggSales, PakistanSales_one_month_avg, how='left', left_on='Serial', right_on='Serial')
# Print the head of the merged DataFrame
print(PakistanSales_df.head())

        Serial     quantity  Sales_one_Month_avg
0  10010063319   60245.1569             15412.08
1   2000014136  689624.6980             58078.43
2   2000014290   42356.1536              5000.00
3   2000014292  104124.6992             17440.00
4   2000014293  415646.1472             38800.00


In [87]:
PakistanSales_df2 = pd.merge(PakistanSales_df, PakistanSales_three_months_avg, how='left', left_on = ['Serial'], right_on = ['Serial'])
PakistanSales_df3 = pd.merge(PakistanSales_df2, PakistanSales_six_months_avg, how='left', left_on = ['Serial'], right_on = ['Serial'])
PakistanSales_df3 = PakistanSales_df3.fillna(0)

# Merge the DataFrames based on the 'Serial' column
MalaysiaSales_df1 = pd.merge(Malaysia_aggSales, MalaysiaSales_one_month_avg, how='left', left_on='Serial', right_on='Serial')
MalaysiaSales_df2 = pd.merge(MalaysiaSales_df1, MalaysiaSales_three_months_avg, how='left', left_on='Serial', right_on='Serial')
MalaysiaSales_df3 = pd.merge(MalaysiaSales_df2, MalaysiaSales_six_months_avg, how='left', left_on='Serial', right_on='Serial')
# Fill any missing values with 0
MalaysiaSales_df3 = MalaysiaSales_df3.fillna(0)
# Print the head of the merged DataFrame
print(MalaysiaSales_df3.head(30))

        Serial      quantity  Sales_one_Month_avg  Sales_three_months_avg  \
0   1005010001    128.367330             0.000000                6.391393   
1   1005010006    742.657798             0.000000                0.000000   
2   1005010007    742.657798             0.000000                0.000000   
3   1005010009   2748.936909           417.780483              444.384206   
4   1005010011   6442.373203          1151.370455             1539.484848   
5   1005010014   6331.466127           788.729630             1022.703660   
6   1005010015  18670.952147          3429.041811             3428.241404   
7   1005010016   3019.859071           433.838675              476.872336   
8   1005010017    742.657798             0.000000                0.000000   
9   1005010018   4494.520465           686.039247              715.374827   
10  1005010021  29407.272455          4303.907273             3853.037576   
11  1005010022   4923.650000          1353.300000             1039.633333   

In [88]:
MalaysiaSales_df3.info() 

<class 'pandas.core.frame.DataFrame'>
Int64Index: 15982 entries, 0 to 15981
Data columns (total 5 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Serial                  15982 non-null  object 
 1   quantity                15982 non-null  float64
 2   Sales_one_Month_avg     15982 non-null  float64
 3   Sales_three_months_avg  15982 non-null  float64
 4   Sales_six_months_avg    15982 non-null  float64
dtypes: float64(4), object(1)
memory usage: 749.2+ KB


In [89]:
#TelemetronData_three_months_avg['three_months_avg'] = TelemetronData_three_months_avg['three_months_avg'].apply(lambda x: x/3)

#TelemetronData_six_months_avg['six_months_avg'] = TelemetronData_six_months_avg['six_months_avg'].apply(lambda x: x/6)

TelemetronData_df = pd.merge(Telemetron_agg, TelemetronData_one_month_avg, how='left', left_on = ['serial'], right_on = ['serial'])
TelemetronData_df.head()

TelemetronData_df2 = pd.merge(TelemetronData_df, TelemetronData_three_months_avg, how='left', left_on = ['serial'], right_on = ['serial'])
TelemetronData_df3 = pd.merge(TelemetronData_df2, TelemetronData_six_months_avg, how='left', left_on = ['serial'], right_on = ['serial'])
TelemetronData_df3 = TelemetronData_df3.fillna(0)


In [90]:
RussiaSalesData_df = pd.merge(RussiaSalesData_agg, RussiaSalesData_one_month_avg, how='left', left_on = ['Serial'], right_on = ['Serial'])
RussiaSalesData_df.head()

RussiaSalesData_df2 = pd.merge(RussiaSalesData_df, RussiaSalesData_three_months_avg, how='left', left_on = ['Serial'], right_on = ['Serial'])
RussiaSalesData_df3 = pd.merge(RussiaSalesData_df2, RussiaSalesData_six_months_avg, how='left', left_on = ['Serial'], right_on = ['Serial'])
RussiaSalesData_df3 = RussiaSalesData_df3.fillna(0)


VendonData_three_months_avg['three_months_avg'] = VendonData_three_months_avg['three_months_avg'].apply(lambda x: x/3)

VendonData_six_months_avg['six_months_avg'] = VendonData_six_months_avg['six_months_avg'].apply(lambda x: x/6)

VendonData_df = pd.merge(Vendon_agg, VendonData_one_month_avg, how='left', left_on = ['serial'], right_on = ['serial'])

VendonData_df2 = pd.merge(VendonData_df, VendonData_three_months_avg, how='left', left_on = ['serial'], right_on = ['serial'])
VendonData_df3 = pd.merge(VendonData_df2, VendonData_six_months_avg, how='left', left_on = ['serial'], right_on = ['serial'])
VendonData_df3 = VendonData_df3.fillna(0)
VendonData_df3

Add Key manuf no and Sales org

In [91]:
SouthAfricaSales_df = pd.merge(SouthAfrica_aggSales, SouthAfricaSales_one_month_avg, how='left', left_on = ['AccountID'], right_on = ['AccountID'])
SouthAfricaSales_df.head()

SouthAfricaSales_df2 = pd.merge(SouthAfricaSales_df, SouthAfricaSales_three_months_avg, how='left', left_on = ['AccountID'], right_on = ['AccountID'])
SouthAfricaSales_df3 = pd.merge(SouthAfricaSales_df2, SouthAfricaSales_six_months_avg, how='left', left_on = ['AccountID'], right_on = ['AccountID'])
SouthAfricaSales_df3 = SouthAfricaSales_df3.fillna(0)
SouthAfricaSales_df3.head()

Unnamed: 0,AccountID,quantity,Sales_one_Month_avg,Sales_three_months_avg,Sales_six_months_avg
0,365014,0.63,0.05,0.026667,0.038333
1,365018,40.53,0.05,13.376667,6.73
2,366680,1068.01,1068.01,356.003333,178.001667
3,366935,13691.11,8376.59,4563.703333,2281.851667
4,367418,144338.28,5830.7,11818.086667,10983.813333


In [92]:
SingaporeSales_df = pd.merge(Singapore_aggSales, SingaporeSales_one_month_avg, how='left', left_on = ['Manufacturer Number'], right_on = ['Manufacturer Number'])
SingaporeSales_df2 = pd.merge(SingaporeSales_df, SingaporeSales_three_months_avg, how='left', left_on = ['Manufacturer Number'], right_on = ['Manufacturer Number'])
SingaporeSales_df3 = pd.merge(SingaporeSales_df2, SingaporeSales_six_months_avg, how='left', left_on = ['Manufacturer Number'], right_on = ['Manufacturer Number'])
SingaporeSales_df3 = SingaporeSales_df3.fillna(0)
SingaporeSales_df3.head(30)

Unnamed: 0,Manufacturer Number,quantity,Sales_one_Month_avg,Sales_three_months_avg,Sales_six_months_avg
0,141982300000383201,3419.03,152.33,219.81,192.6125
1,141982300001083201,3200.85,165.92,165.15,200.775
2,141982300003083201,12183.988922,554.280404,711.387249,767.414209
3,141982300003283201,3208.42,0.0,174.733333,149.686667
4,141992300000483201,2571.08,250.44,222.613333,194.783333
5,141992300000783201,21445.077058,1619.672727,1560.793567,1572.896429
6,141992300000983201,1614.22,80.39,67.613333,83.728333
7,141992300001583201,6109.08,166.96,289.686667,407.483333
8,141992300001683201,780.71,95.075,72.158333,82.6325
9,141992300003283201,2871.56,0.0,0.0,166.953333


In [93]:
PakistanSales_df3['KeyManufNo_SalesOrg'] = PakistanSales_df3['Serial'].astype(str) + 'Pakistan' 

# Create the new column by combining 'Serial' with 'Malaysia'
MalaysiaSales_df3['KeyManufNo_SalesOrg'] = MalaysiaSales_df3['Serial'].astype(str) + 'Malaysia'

Not used yet in Vendon to differentiate markets

TelemetronData_df3['KeyManufNo_SalesOrg'] = TelemetronData_df3['serial'].astype(str) + 'Nestlé Russia'

In [94]:
RussiaSalesData_df3['KeyManufNo_SalesOrg'] = RussiaSalesData_df3['Serial'].astype(str) + 'Nestlé Russia'


In [95]:
SouthAfricaSales_df3['KeyManufNo_SalesOrg'] = SouthAfricaSales_df3['AccountID'].astype(str) + 'Nestle South Africa' 

Rename the accountID column from South Africa as we already did the work to get the accountID


In [96]:
SouthAfricaSales_df4 = SouthAfricaSales_df3.rename(columns = {'AccountID':'Serial'})

In [97]:
SingaporeSales_df3['KeyManufNo_SalesOrg'] = SingaporeSales_df3['Manufacturer Number'].astype(str) + 'Singapore'

In [98]:
SingaporeSales_df4 = SingaporeSales_df3.rename(columns = {'Manufacturer Number':'Serial'})

In [99]:
SingaporeSales_df4.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6307 entries, 0 to 6306
Data columns (total 6 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Serial                  6307 non-null   object 
 1   quantity                6307 non-null   float64
 2   Sales_one_Month_avg     6307 non-null   float64
 3   Sales_three_months_avg  6307 non-null   float64
 4   Sales_six_months_avg    6307 non-null   float64
 5   KeyManufNo_SalesOrg     6307 non-null   object 
dtypes: float64(4), object(2)
memory usage: 344.9+ KB


VendonData_df3['KeyManufNo_SalesOrg'] = VendonData_df3['serial'].astype(str) + VendonData_df3['SalesOrg'].astype(str)

VendonData_df3=VendonData_df3.drop(columns=['SalesOrg'])
VendonData_df3.head()

In [100]:
Concat_Sales = pd.concat([RussiaSalesData_df3, PakistanSales_df3, SouthAfricaSales_df4, SingaporeSales_df4, MalaysiaSales_df3])
Concat_Sales.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 68407 entries, 0 to 15981
Data columns (total 6 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Serial                  68407 non-null  object 
 1   quantity                68407 non-null  float64
 2   Sales_one_Month_avg     68407 non-null  float64
 3   Sales_three_months_avg  68407 non-null  float64
 4   Sales_six_months_avg    68407 non-null  float64
 5   KeyManufNo_SalesOrg     68407 non-null  object 
dtypes: float64(4), object(2)
memory usage: 3.7+ MB


In [101]:
Concat_Sales['(lst_mth-6mth)/6mth'] = Concat_Sales.apply(lambda x: 0 if x['Sales_six_months_avg'] <= 0 else (x['Sales_one_Month_avg']-x['Sales_six_months_avg'])/x['Sales_six_months_avg'], axis=1)

Concat_Sales['3mth-6mth)/6mth'] = Concat_Sales.apply(lambda x: 0 if x['Sales_six_months_avg'] <= 0 else (x['Sales_three_months_avg']-x['Sales_six_months_avg'])/x['Sales_six_months_avg'], axis=1)

In [102]:
Concat_Sales.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 68407 entries, 0 to 15981
Data columns (total 8 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Serial                  68407 non-null  object 
 1   quantity                68407 non-null  float64
 2   Sales_one_Month_avg     68407 non-null  float64
 3   Sales_three_months_avg  68407 non-null  float64
 4   Sales_six_months_avg    68407 non-null  float64
 5   KeyManufNo_SalesOrg     68407 non-null  object 
 6   (lst_mth-6mth)/6mth     68407 non-null  float64
 7   3mth-6mth)/6mth         68407 non-null  float64
dtypes: float64(6), object(2)
memory usage: 4.7+ MB


In [103]:
Concat_Sales.tail(5)

Unnamed: 0,Serial,quantity,Sales_one_Month_avg,Sales_three_months_avg,Sales_six_months_avg,KeyManufNo_SalesOrg,(lst_mth-6mth)/6mth,3mth-6mth)/6mth
15977,T573424,1104.27,0.0,39.45,184.045,T573424Malaysia,-1.0,-0.78565
15978,T573425,2423.145633,374.281441,388.879459,403.857606,T573425Malaysia,-0.073234,-0.037088
15979,T573426,359.4,0.0,0.0,59.9,T573426Malaysia,-1.0,-1.0
15980,T573427,4612.954324,294.736667,704.707014,768.825721,T573427Malaysia,-0.61664,-0.083398
15981,VZCA 00001,5386.958367,828.944922,917.998386,897.826395,VZCA 00001Malaysia,-0.07672,0.022468


Need to change the type otherwise cannot merge correctly with manuf number

In [104]:
TelemetronData_df3['serial'] = TelemetronData_df3['serial'].astype(str)

BeverageMachine7_wTickets_df['Manufacturer Number'] = BeverageMachine7_wTickets_df['Manufacturer Number'].astype(str)

d = TelemetronData_df3.loc[TelemetronData_df3['serial']==20172526377]
d

w=aaaf.loc[aaaf['Manufacturer Number']=='20172526377']
w

aaaf = pd.merge(BeverageMachine7_wTickets_df, TelemetronData_df3, how='left', left_on = ['Manufacturer Number'], right_on = ['serial'])

aaaf.info()

In [105]:
TelemetronData_df3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2496 entries, 0 to 2495
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   serial            2496 non-null   object 
 1   quantity          2496 non-null   int64  
 2   one_month_avg     2496 non-null   int64  
 3   three_months_avg  2496 non-null   float64
 4   six_months_avg    2496 non-null   float64
dtypes: float64(2), int64(2), object(1)
memory usage: 117.0+ KB


In [106]:
Concat_Telemetry = pd.concat([TelemetronData_df3, Telemetry_aggSales_df3])
Concat_Telemetry.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 32764 entries, 0 to 30267
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   serial            32764 non-null  object 
 1   quantity          32764 non-null  int64  
 2   one_month_avg     32764 non-null  int64  
 3   three_months_avg  32764 non-null  float64
 4   six_months_avg    32764 non-null  float64
dtypes: float64(2), int64(2), object(1)
memory usage: 1.5+ MB


In [107]:
Concat_Telemetry['(lst_mth-6mth)/6mth'] = Concat_Telemetry.apply(lambda x: 0 if x['six_months_avg'] <= 0 else (x['one_month_avg']-x['six_months_avg'])/x['six_months_avg'], axis=1)

Concat_Telemetry['3mth-6mth)/6mth'] = Concat_Telemetry.apply(lambda x: 0 if x['six_months_avg'] <= 0 else (x['three_months_avg']-x['six_months_avg'])/x['six_months_avg'], axis=1)

## 10. Market Actions data

In [108]:
##Market Actions listed
MktActions = pd.read_excel(r"C:\Users\msalomo\OneDrive - NESTLE\Certificate Machine Learning and Data\Churn Project\Data\Actions taken listed.xlsx")
MktActions.head()

Unnamed: 0,Month,Serial ID,Sales Organisation,Parent Installation Point ID,Actions,Actions linked to churn predictions,Comments,CA Comments,Actions proposed by SBU
0,2021-11-30,34F6401007,Nestle UK,7326,Other,Yes,CA Feedback Required,,No action Yet
1,2021-11-30,16E0031901,Nestle UK,11955,Removal planned,Yes,,,
2,2021-11-30,17E0020640,Nestle UK,8151,Removal planned,Yes,,,
3,2021-11-30,10238090,Nestle UK,IP-11722,Removal planned,Yes,,,
4,2021-11-30,101810133,Nestle UK,4915,Other,Yes,CA Feedback Required,,


In [109]:
MktActions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1151 entries, 0 to 1150
Data columns (total 9 columns):
 #   Column                               Non-Null Count  Dtype         
---  ------                               --------------  -----         
 0   Month                                1151 non-null   datetime64[ns]
 1   Serial ID                            1151 non-null   object        
 2   Sales Organisation                   1151 non-null   object        
 3   Parent Installation Point ID         1151 non-null   object        
 4   Actions                              1150 non-null   object        
 5   Actions linked to churn predictions  1151 non-null   object        
 6   Comments                             837 non-null    object        
 7   CA Comments                          29 non-null     object        
 8   Actions proposed by SBU              13 non-null     object        
dtypes: datetime64[ns](1), object(8)
memory usage: 81.1+ KB


In [110]:
#add key serial + sales org?
#One hot encoding
def preprocess_MktActions(df):
    # Work on a copy
    df = df.copy()

    nomi_vars = ['Actions']
    
    dummy_columns = nomi_vars
        
    df = pd.get_dummies(df, columns=dummy_columns)

    return df

MktActions_prep = preprocess_MktActions(MktActions)
MktActions_prep.head()

Unnamed: 0,Month,Serial ID,Sales Organisation,Parent Installation Point ID,Actions linked to churn predictions,Comments,CA Comments,Actions proposed by SBU,Actions_Churn risk reason unknown,Actions_Data corrected,...,Actions_Removed,Actions_Reviewed and no action Required,Actions_Reviewed and no actions required,Actions_Seasonal Machine,Actions_Telemetry installed,Actions_Upgrade machine installed,Actions_Visit completed,Actions_Visit/Call planned,Actions_removed,Actions_tagging update
0,2021-11-30,34F6401007,Nestle UK,7326,Yes,CA Feedback Required,,No action Yet,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2021-11-30,16E0031901,Nestle UK,11955,Yes,,,,0,0,...,0,0,0,0,0,0,0,0,0,0
2,2021-11-30,17E0020640,Nestle UK,8151,Yes,,,,0,0,...,0,0,0,0,0,0,0,0,0,0
3,2021-11-30,10238090,Nestle UK,IP-11722,Yes,,,,0,0,...,0,0,0,0,0,0,0,0,0,0
4,2021-11-30,101810133,Nestle UK,4915,Yes,CA Feedback Required,,,0,0,...,0,0,0,0,0,0,0,0,0,0


In [111]:
MktActions_prep.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1151 entries, 0 to 1150
Data columns (total 28 columns):
 #   Column                                    Non-Null Count  Dtype         
---  ------                                    --------------  -----         
 0   Month                                     1151 non-null   datetime64[ns]
 1   Serial ID                                 1151 non-null   object        
 2   Sales Organisation                        1151 non-null   object        
 3   Parent Installation Point ID              1151 non-null   object        
 4   Actions linked to churn predictions       1151 non-null   object        
 5   Comments                                  837 non-null    object        
 6   CA Comments                               29 non-null     object        
 7   Actions proposed by SBU                   13 non-null     object        
 8   Actions_Churn risk reason unknown         1151 non-null   uint8         
 9   Actions_Data corrected        

In [112]:
MktActions_prep2=MktActions_prep.drop(columns=['Sales Organisation','Parent Installation Point ID', 'Month'])
MktActions_prep3 = MktActions_prep2.groupby(['Serial ID']).sum()
MktActions_prep3.head()

  MktActions_prep3 = MktActions_prep2.groupby(['Serial ID']).sum()


Unnamed: 0_level_0,Actions_Churn risk reason unknown,Actions_Data corrected,Actions_Downgrade machine installed,Actions_Lack of data discipline,Actions_New contract,Actions_Other,Actions_Out of order,Actions_Phone Call completed,Actions_Removal Plan,Actions_Removal planned,Actions_Removed,Actions_Reviewed and no action Required,Actions_Reviewed and no actions required,Actions_Seasonal Machine,Actions_Telemetry installed,Actions_Upgrade machine installed,Actions_Visit completed,Actions_Visit/Call planned,Actions_removed,Actions_tagging update
Serial ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1
24606,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1895151,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
10238090,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
10238091,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
10238092,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0


In [113]:
MktActions_prep3.info()

<class 'pandas.core.frame.DataFrame'>
Index: 930 entries, 24606 to EM9933
Data columns (total 20 columns):
 #   Column                                    Non-Null Count  Dtype
---  ------                                    --------------  -----
 0   Actions_Churn risk reason unknown         930 non-null    uint8
 1   Actions_Data corrected                    930 non-null    uint8
 2   Actions_Downgrade machine installed       930 non-null    uint8
 3   Actions_Lack of data discipline           930 non-null    uint8
 4   Actions_New contract                      930 non-null    uint8
 5   Actions_Other                             930 non-null    uint8
 6   Actions_Out of order                      930 non-null    uint8
 7   Actions_Phone Call completed              930 non-null    uint8
 8   Actions_Removal Plan                      930 non-null    uint8
 9   Actions_Removal planned                   930 non-null    uint8
 10  Actions_Removed                           930 non-null    ui

### (b) Plan to manage and process the data <a class="anchor" id="ManageData"></a>

I will extract the data into excel or csv format and upload it to python.

I can merge the data of the different files together

The data is checked monthly and has been created to be linked together

Columns useful to link the datasets together :

    'Product ID [Machine Model ID]'

    'Manufacturer Number'

    'BMB/C4C Model code'

    'M1'
    
    'Manufacture Serial Number'
    
    'Serial ID'

I need to find a way to have one line per machine per month for telemetry data and Placement Tickets

The main idea behind the use of telemetry data here is to check if we can see for example a relation between churn and a the number of cup sales.

I will not use all the features. Below are the features I am planning to use for the two biggest dataset :

19 columns from Beverage Machine data :

['Serial ID', 'Sales Organisation', 'Machine Status Groupings', 'User Status', 'TA Contract Installation Date', 'Depreciation Start',
'Position', 'TA Contract Start Date', 'TA Contract End Date', 'TA Usage Indicator',
'Account ABC Classification (Account ID)', 'Industry (Account ID)', 'Industry Code 1 (Account ID)',  'Account ABC Classification (EC ID)', 'Industry (EC ID)',
'Industry Code 1 (EC ID)', 'Parent Installation Point ID', 'Registered Product Category (Registered Product ID)', 'Calendar Date']

14 columns coming from the Beverage Classification data :

['Model', 'Model Vendor', 'Model Category', 'Model Group', 'Beverage Temperature',
'System Brands', 'Ingredient Format', 'Machine Type', 'Positionning', 'Generation',
'Blueprint Throughput', 'IP Ownership', 'Trading Partner', 'G/R/M TB']

### Beverage Machine data features and the Beverage Classification data features

##### Serial ID                                              
	Unique per machine and allows to link to the Tickets placements

##### Sales Organisation                                     
	Usually a Sales Organisation corresponds to a Country

##### Product ID [Machine Model ID]                          
	Code that allows us to link it to the intermediary mapping table which contains all the details for each machine

##### Machine Status Groupings                               
	Status of the Machine shows if a machine is :
		Deployed
		Idle
		Other

##### User Status                                            
	More detailed than status groupings

##### Depreciation Start                                     
	Date when the machine started to display cup

##### Manufacturer Number                                    
	Code that allows us to link to the telemetry data


##### Position                                               
	Can tell if a machine is a:
		RENT,
		Sale,
		Loan,
		Demo,
		etc.,
##### TA Contract Installation Date
    Date when the machine was installed, different than depreciation start because a machine can be installed but could have already dispensed cups in another Installation Point

##### TA Contract Start Date                                 
	Date when the contract started
    
##### TA Contract End Date                                  
	Date when the contract ended
    
##### TA Usage Indicator                                     
	Can have several usage:
		5 Monthly Rental
		Not assigned
		Trial / Evaluation
		7 Annual / Periodic

##### Account ABC Classification (Account ID)                
	Can help to identify in which Channel is the Account
    
##### Industry (Account ID)                                  
	Can help to identify in which Channel is the Account
    
##### Industry Code 1 (Account ID)                           
	Can help to identify in which Channel is the Account

##### Account ABC Classification (EC ID)                     
	Can help to identify in which Channel is the End Customer
    
##### Industry (EC ID)                                       
	Can help to identify in which Channel is the End Customer
    
##### Industry Code 1 (EC ID)                                
	Can help to identify in which Channel is the End Customer

##### Parent Installation Point ID                           
	Help to identify if a machine is still deployed in the same location by the same customer, it is the Installation Point ID we were talking before.

##### Registered Product Category (Registered Product ID)    
	Details of the category within our group

##### Calendar Date                                          
	Date when we extracted the data of the machine
    
##### BMB/C4C Model code                                     
	Code that allows to link the intermediary mapping table to the beverage machine data

##### M1                                                     
	Name of the harmonized model and used to link the intermediary mapping to the mapping file with unique model
    
##### Model                                                  
	Name of the harmonized model and used to link the intermediary mapping to the mapping file with unique model

##### Model Vendor                                           
	Name of the vendor of the coffee machine

##### Model Category                                         
	Category of the model
		
##### Model Group                                            
	Group of the model

##### Beverage Temperature                                   
	Temperature of the beverage

##### System Brands                                         
	Brand internal classification

##### Ingredient Format                                     
	Format of the ingredient

##### Machine Type                                           
	Type of Machine

##### Positionning                                           
	Positionning of the machine

##### Generation                                             
	Generation of the machine

##### Blueprint Throughput                                   
    Type of throughput

##### IP Ownership                                           
    Ownership type

##### Trading Partner                                        
	Type of Trading Partner

##### G/R/M TB                                               
	How it is managed by the market 

Useless data not really explaining the model :

##### not used columns : 32

User Status Last Changed On                            
Product [Machine Model]                                
	Name of the machine Model 
Range Brand                                           
	Brand of the model
    
EC ID                                                  
    We can identify the end customer with this number, some can have more than one machine
    
	Can be transformed into #Machine for this customer

Equipment Number                                       
Asset Number                                           
TA Contract Number                                   
Account ID                                          
Ship To ID                                     
EC Name                                           
Sales Org ID (Installation Point)                  
Model Harmonized                                    	
Comments                                          
Source                                              
Global Projects                                    
	Machine that are part of a project :
		Roastelier
		Alegria
		Nitro
		Milano
		EZCare
		Express
		CoolPro
Toolbox                                               
Non-Toolbox Reason                 
Product                                       
Type.                              
Machines Models (Harmonized)                     
Solution Brands                 
Toolbox 2019                                       
Toolbox 2018                                     
Toolbox 2017                                        
Trade Assets                                      
Active for Procurement (2017)                       
Idle Available Stock Type                          
Modified                                          
Modified By                                     
Created                                            
Item Type                                       
Path

### Placement Tickets data features

##### Service Category
    Tell if the machine was :
        Installed
        Removed
        Replaced

##### Completion Date
    Date when the Ticket was done, we will not use it since we will aggregate on the number of tickets without the time dimension

##### Incident Category
    Reason of the Ticket, details about the incident or ticket

##### Serial ID
    In order to link to the Beverage Machine data

### Telemetry data features

##### quantity 
    Sales quantity

##### serial 	
    ID that allows us to map a to the manufacturer number of the beverage machines

##### columns not used :

Month

    Month of the sales

stockId

    Each machines has a button linked to an ID and by mapping this ID to the related product when can know which type of cup was sold, yet the machines is not working for every machines, so the columns product might be wrong

Column1 	

    unknown Id

Averages 	

    unknown average

inactive 	

    unknown column

machine_id2 

    unknown Id

Product

    type of cup sold (mapping is not ready for every machines yet)

We will use only the sales quantity and the serial to link to the Beverage Machine data. The other columns are either not useful or not satisfying minimum requirements on accuracy of data (bad data)

### Missing data

In [114]:
# TA Contract Installation Date
BevMachMissingInstDate = BeverageMachine_df.loc[BeverageMachine_df['TA Contract Installation Date']=='#']['TA Contract Installation Date'].count()
TotBevMach = BeverageMachine_df['Serial ID'].count()

# TA Contract Start Date
BevMachMissingStartDate = BeverageMachine_df.loc[BeverageMachine_df['TA Contract Start Date']=='#']['TA Contract Start Date'].count()

# TA Contract End Date 
BevMachMissingEndDate = BeverageMachine_df.loc[BeverageMachine_df['TA Contract End Date']=='#']['TA Contract End Date'].count()

# Depreciation Start
BevMachMissingDepStartDate = BeverageMachine_df.loc[BeverageMachine_df['Depreciation Start']=='#']['Depreciation Start'].count()


print('Beverage machines missing Installation Date : ', BevMachMissingInstDate, ', which corresponds to ', 100*round(BevMachMissingInstDate/TotBevMach,2), '%')
print('Beverage machines missing Start Date : ', BevMachMissingStartDate, ', which corresponds to ', 100*round(BevMachMissingStartDate/TotBevMach,2), '%')
print('Beverage machines missing End Date : ', BevMachMissingEndDate, ', which corresponds to ', 100*round(BevMachMissingEndDate/TotBevMach,2), '%')
print('Beverage machines missing Depreciation Start Date : ', BevMachMissingDepStartDate, ', which corresponds to ', 100*round(BevMachMissingDepStartDate/TotBevMach,2), '%')


Beverage machines missing Installation Date :  5209196 , which corresponds to  81.0 %
Beverage machines missing Start Date :  5209196 , which corresponds to  81.0 %
Beverage machines missing End Date :  5209190 , which corresponds to  81.0 %
Beverage machines missing Depreciation Start Date :  3 , which corresponds to  0.0 %


##### Telemetry data
Even if the number of beverage machines equiped with telemetry data is increasing the data available is still low and should be seen as a complement.

In August 2020 only around 200 beverage machines have telemetry data and are already in the new system from which we got Beverage Machine data and we have around 60'000 beverage machines.


##### Placement Tickets data

27'318 beverage machines does not provide any Placement tickets


##### Date features missing

We see that sometimes the date is not filled for Installation Date, Start Date and End Date

#### Visits data
A visit is linked to an account and a machine "Account ID" can be linked to a visit "Account ID.Account ID Level 01.Key" maybe a key with the Sales Org in case it is unique only by market

#### Phone Calls data
A phone Call is linked to an account. Link "Account Name" from phone call with "Account ID" of the machine.

## Preparation of the data<a class="anchor" id="prep"></a>

### a) Details of preparation<a class="anchor" id="det"></a>

#### Beverage Machine data preparation

The goal is to get the actual maximal date of each Serial ID

If a machine has a maximal date that is lower than (or not equal to) the latest snapshot date, then the machine has churned.

We will look at the max date per installation point because when we lose an installation point we lose the customer. 

A machine can be realocated to another customer.

Keep only the latest month of data



In [115]:
BeverageMachine_df['Calendar Date'] = pd.to_datetime(BeverageMachine_df['Calendar Date'], errors =  'coerce')

In [116]:
BeverageMachine_df.tail()

Unnamed: 0.1,Unnamed: 0,Sales Organisation,User Status Last Changed On,Product [Machine Model],Product ID [Machine Model ID],Range Brand,Machine Status Groupings,User Status,Depreciation Start,Serial ID,...,Industry Code 1 (Account ID),Account ABC Classification (EC ID),Industry (EC ID),Industry Code 1 (EC ID),Parent Installation Point ID,Registered Product Category (Registered Product ID),Sales Org ID (Installation Point),SAP Material Line Code [Machine Model ID],Calendar Date,Key_ManufacturerID_SalesOrg
3201311,3201311,Nestle South Africa,44460,NESCAFE MILANO 8/60 H6 AR,100118541,MILANO,Deployed,Installed,40626,ZA4188,...,060502 Factory,06 Out of Home,0605 Business/Industry,060501 Office Leasing Ctr,1079926,Trade Asset w/ Fixed Asset,ZA10,90083851,2023-12-31,3897301Nestle South Africa
3201312,3201312,Nestle South Africa,44460,NESCAFE MILANO 8/60 H6 AR,100118541,MILANO,Deployed,Installed,40695,ZA5014,...,060501 Office Leasing Ctr,06 Out of Home,0605 Business/Industry,060501 Office Leasing Ctr,1080068,Trade Asset w/ Fixed Asset,ZA10,90083851,2023-12-31,3896790Nestle South Africa
3201313,3201313,Nestle South Africa,44460,NESCAFE MILANO 8/60 H6 AR,100118541,MILANO,Deployed,Installed,41334,ZA4631,...,060501 Office Leasing Ctr,06 Out of Home,0605 Business/Industry,060501 Office Leasing Ctr,1080064,Trade Asset w/ Fixed Asset,ZA10,90083851,2023-12-31,3896790Nestle South Africa
3201314,3201314,Nestle South Africa,44460,NESCAFE ALEGRIA Base Cabinet,100118550,ALEGRIA,Deployed,Installed,42005,ZA14050,...,060501 Office Leasing Ctr,06 Out of Home,0605 Business/Industry,060501 Office Leasing Ctr,1080336,Trade Asset Accessory,ZA10,90083852,2023-12-31,3896790Nestle South Africa
3201315,3201315,Nestle South Africa,44460,NESCAFE ALEGRIA Base Cabinet,100118550,ALEGRIA,Deployed,Installed,42370,ZA14115,...,060501 Office Leasing Ctr,06 Out of Home,0605 Business/Industry,060501 Office Leasing Ctr,1079951,Trade Asset Accessory,ZA10,90083852,2023-12-31,3896944Nestle South Africa


BeverageMachine_df1 = BeverageMachine_df.copy()
BeverageMachine_df1 = BeverageMachine_df1.groupby(['Parent Installation Point ID'])


In [117]:
BeverageMachine_df1 = BeverageMachine_df.copy()
BeverageMachine_df1['Product ID [Machine Model ID]'] = BeverageMachine_df1['Product ID [Machine Model ID]'].astype(str)
#BeverageMachine_df2 = BeverageMachine_df1.groupby(['Parent Installation Point ID']).agg({'Calendar Date' : [np.min, np.max]})

#BeverageMachine_df1['Calendar Date2'] = BeverageMachine_df1['Calendar Date']

#BeverageMachine_df2 = BeverageMachine_df1.groupby(['Parent Installation Point ID']).agg({'Calendar Date' : 'min', 'Calendar Date2' : 'max'})

In [118]:
BeverageMachine_df2 = pd.merge(BeverageMachine_df1, BevMap_df, how='left', left_on = ['Product ID [Machine Model ID]'], right_on = ['ID Model Code'])
BeverageClassification1_df = BeverageClassification_df.drop_duplicates(['Model'])
BeverageMachine_df3 = pd.merge(BeverageMachine_df2, BeverageClassification1_df, how='left', left_on = ['Model'], right_on = ['Model']) 

In [119]:
BeverageMachine_df3 = BeverageMachine_df3.query("`Model` != 'Accessories'")

In [120]:
BeverageMachine_df3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5601614 entries, 0 to 6453675
Data columns (total 73 columns):
 #   Column                                               Dtype         
---  ------                                               -----         
 0   Unnamed: 0                                           int64         
 1   Sales Organisation                                   object        
 2   User Status Last Changed On                          object        
 3   Product [Machine Model]                              object        
 4   Product ID [Machine Model ID]                        object        
 5   Range Brand                                          object        
 6   Machine Status Groupings                             object        
 7   User Status                                          object        
 8   Depreciation Start                                   object        
 9   Serial ID                                            object        
 10  Manufa

Another way to get min and max date e.g. "I wanted to create a new data frame where I can get min value in the column Numb if my string in the column Word is ab and max value if my string is bc for each Date. " :

s=df.groupby(['Date','Word']).Numb.agg(['min','max'])

s['number']=np.where(s.index.get_level_values(1)=='ab',s.min(1),s.max(1))

df11 =BeverageMachine_df.copy()
df22 = df11.reset_index()
df22.loc[df22.groupby('Parent Installation Point ID')['Calendar Date'].idxmin()]
df22.info()

In [121]:
# Sort the dataFrame by 'Calendar Date' and then remove duplicates :
BM_Maxdate_IPID2 = BeverageMachine_df3.sort_values('Calendar Date', ascending=False).drop_duplicates(['Parent Installation Point ID'])
BM_Maxdate_IPID2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 246599 entries, 525824 to 927207
Data columns (total 73 columns):
 #   Column                                               Non-Null Count   Dtype         
---  ------                                               --------------   -----         
 0   Unnamed: 0                                           246599 non-null  int64         
 1   Sales Organisation                                   246599 non-null  object        
 2   User Status Last Changed On                          246592 non-null  object        
 3   Product [Machine Model]                              246599 non-null  object        
 4   Product ID [Machine Model ID]                        246599 non-null  object        
 5   Range Brand                                          246599 non-null  object        
 6   Machine Status Groupings                             246599 non-null  object        
 7   User Status                                          246599 non-null 

The columns allowing to link datasets should have the same format otherwise it might not work properly if one has a string format and the other a numerical format  

In [122]:
BeverageMachine1_df = BM_Maxdate_IPID2
BeverageMachine1_df['Product ID [Machine Model ID]']=BeverageMachine1_df['Product ID [Machine Model ID]'].astype(str)

In [123]:
BeverageMachine1_df = BeverageMachine1_df.loc[BeverageMachine1_df['Machine Status Groupings']=="Deployed"]

Merge the Beverage Machine data with the Beverage Mapping in order to get the related "Harmonized Model" of the "Beverage Machine Classification data" and later merge together the "Beverage Machine data" with the "Beverage Classification data"

We should do a cleaning step in order to keep only the machine having the 'Parent Installation Point ID' filled and remove duplicates, but not for 'Serial ID'

In [124]:
BeverageMachine4_df = BeverageMachine1_df.loc[BeverageMachine1_df['Parent Installation Point ID']!="#"].drop_duplicates(['Parent Installation Point ID'])

In [125]:
BeverageMachine4_df = BeverageMachine4_df.loc[BeverageMachine4_df['Serial ID']!="#"]


In [126]:
BeverageMachine4_df.columns

Index(['Unnamed: 0', 'Sales Organisation', 'User Status Last Changed On',
       'Product [Machine Model]', 'Product ID [Machine Model ID]',
       'Range Brand', 'Machine Status Groupings', 'User Status',
       'Depreciation Start', 'Serial ID', 'Manufacturer Number',
       'Equipment Number', 'Asset Number', 'Position', 'TA Contract Number',
       'TA Contract Installation Date', 'TA Contract Start Date',
       'TA Contract End Date', 'TA Usage Indicator', 'Account ID',
       'Ship To ID', 'EC ID', 'EC Name', 'City', 'State', 'Postal Code',
       'Account ABC Classification (Account ID)', 'Industry (Account ID)',
       'Industry Code 1 (Account ID)', 'Account ABC Classification (EC ID)',
       'Industry (EC ID)', 'Industry Code 1 (EC ID)',
       'Parent Installation Point ID',
       'Registered Product Category (Registered Product ID)',
       'Sales Org ID (Installation Point)',
       'SAP Material Line Code [Machine Model ID]', 'Calendar Date',
       'Key_ManufacturerID

In [127]:
                   
                    
BeverageMachine5_df = BeverageMachine4_df[['Serial ID', 'Sales Organisation', 'Machine Status Groupings', 'User Status', 
                    'TA Contract Installation Date', 'Depreciation Start', 'Manufacturer Number', 'Position', 
                    'TA Contract Start Date', 'TA Contract End Date', 'TA Usage Indicator',
                    'Account ID',
                    'EC ID', 'EC Name', 'Account ABC Classification (Account ID)', 'Industry (Account ID)', 
                    'Industry Code 1 (Account ID)', 'Account ABC Classification (EC ID)', 
                    'Industry (EC ID)', 'Industry Code 1 (EC ID)', 'Parent Installation Point ID', 
                    'Registered Product Category (Registered Product ID)', 
                    'Model', 'Model Vendor', 'Model Category', 'Model Group', 
                    'Beverage Temperature', 'System Brands', 'Ingredient Format', 
                    'Machine Type', 'Positionning', 'Generation', 'Blueprint Throughput', 
                    'IP Ownership', 'Calendar Date', 'Key_ManufacturerID_SalesOrg', 'City', 'State', 'Postal Code']]
BeverageMachine5_df.head()

Unnamed: 0,Serial ID,Sales Organisation,Machine Status Groupings,User Status,TA Contract Installation Date,Depreciation Start,Manufacturer Number,Position,TA Contract Start Date,TA Contract End Date,...,Machine Type,Positionning,Generation,Blueprint Throughput,IP Ownership,Calendar Date,Key_ManufacturerID_SalesOrg,City,State,Postal Code
525824,19O0017079,Nestle Australia Ltd,Deployed,Installed,#,43739,20192228627_VB_23O0037199,LOAN,#,#,...,Table Tops,Mainstream,Gen. 2,Low,Proprietary,2024-03-31,20192228627_VB_23O0037199Nestle Australia Ltd,Thornleigh,New South Wales,2120
622584,22O0023824,Nestlé India,Deployed,Installed,#,44896,SP3.1-06904,LOAN,#,#,...,Table Tops,Entry,Legacy,Low,Non-Proprietary,2024-03-31,SP3.1-06904Nestlé India,Patna,Bihar,800001
622595,22O0023864,Nestlé India,Deployed,Installed,#,44896,22O0023864,LOAN,#,#,...,Table Tops,Entry,Legacy,Low,Non-Proprietary,2024-03-31,22O0023864Nestlé India,kolkata,West Bengal,700091
622594,22O0023862,Nestlé India,Deployed,Installed,#,44896,22O0023862,LOAN,#,#,...,Table Tops,Entry,Legacy,Low,Non-Proprietary,2024-03-31,22O0023862Nestlé India,Kolkata,West Bengal,700091
622593,22O0023856,Nestlé India,Deployed,Installed,#,44896,22O0023856,LOAN,#,#,...,Table Tops,Entry,Legacy,Low,Non-Proprietary,2024-03-31,22O0023856Nestlé India,"Kolkata,New Town",West Bengal,700156


If 'Calendar Date' is smaller than the 'ChurnDate2' it means that it has not churned

In [128]:
BeverageMachine5_df['Calendar Date'] = pd.to_datetime(BeverageMachine5_df['Calendar Date'])
BeverageMachine5_df.info()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  BeverageMachine5_df['Calendar Date'] = pd.to_datetime(BeverageMachine5_df['Calendar Date'])


<class 'pandas.core.frame.DataFrame'>
Int64Index: 217282 entries, 525824 to 927207
Data columns (total 39 columns):
 #   Column                                               Non-Null Count   Dtype         
---  ------                                               --------------   -----         
 0   Serial ID                                            217282 non-null  object        
 1   Sales Organisation                                   217282 non-null  object        
 2   Machine Status Groupings                             217282 non-null  object        
 3   User Status                                          217282 non-null  object        
 4   TA Contract Installation Date                        217259 non-null  object        
 5   Depreciation Start                                   217064 non-null  object        
 6   Manufacturer Number                                  217280 non-null  object        
 7   Position                                             217282 non-null 

In [129]:
np.where(BeverageMachine5_df['Calendar Date']< ChurnDate2, True, False)
BeverageMachine5_df.head()

Unnamed: 0,Serial ID,Sales Organisation,Machine Status Groupings,User Status,TA Contract Installation Date,Depreciation Start,Manufacturer Number,Position,TA Contract Start Date,TA Contract End Date,...,Machine Type,Positionning,Generation,Blueprint Throughput,IP Ownership,Calendar Date,Key_ManufacturerID_SalesOrg,City,State,Postal Code
525824,19O0017079,Nestle Australia Ltd,Deployed,Installed,#,43739,20192228627_VB_23O0037199,LOAN,#,#,...,Table Tops,Mainstream,Gen. 2,Low,Proprietary,2024-03-31,20192228627_VB_23O0037199Nestle Australia Ltd,Thornleigh,New South Wales,2120
622584,22O0023824,Nestlé India,Deployed,Installed,#,44896,SP3.1-06904,LOAN,#,#,...,Table Tops,Entry,Legacy,Low,Non-Proprietary,2024-03-31,SP3.1-06904Nestlé India,Patna,Bihar,800001
622595,22O0023864,Nestlé India,Deployed,Installed,#,44896,22O0023864,LOAN,#,#,...,Table Tops,Entry,Legacy,Low,Non-Proprietary,2024-03-31,22O0023864Nestlé India,kolkata,West Bengal,700091
622594,22O0023862,Nestlé India,Deployed,Installed,#,44896,22O0023862,LOAN,#,#,...,Table Tops,Entry,Legacy,Low,Non-Proprietary,2024-03-31,22O0023862Nestlé India,Kolkata,West Bengal,700091
622593,22O0023856,Nestlé India,Deployed,Installed,#,44896,22O0023856,LOAN,#,#,...,Table Tops,Entry,Legacy,Low,Non-Proprietary,2024-03-31,22O0023856Nestlé India,"Kolkata,New Town",West Bengal,700156


In [130]:
columnwithfalse = False
BeverageMachine6_df=BeverageMachine5_df.copy()
BeverageMachine6_df['Churn'] = columnwithfalse
BeverageMachine6_df.head()

Unnamed: 0,Serial ID,Sales Organisation,Machine Status Groupings,User Status,TA Contract Installation Date,Depreciation Start,Manufacturer Number,Position,TA Contract Start Date,TA Contract End Date,...,Positionning,Generation,Blueprint Throughput,IP Ownership,Calendar Date,Key_ManufacturerID_SalesOrg,City,State,Postal Code,Churn
525824,19O0017079,Nestle Australia Ltd,Deployed,Installed,#,43739,20192228627_VB_23O0037199,LOAN,#,#,...,Mainstream,Gen. 2,Low,Proprietary,2024-03-31,20192228627_VB_23O0037199Nestle Australia Ltd,Thornleigh,New South Wales,2120,False
622584,22O0023824,Nestlé India,Deployed,Installed,#,44896,SP3.1-06904,LOAN,#,#,...,Entry,Legacy,Low,Non-Proprietary,2024-03-31,SP3.1-06904Nestlé India,Patna,Bihar,800001,False
622595,22O0023864,Nestlé India,Deployed,Installed,#,44896,22O0023864,LOAN,#,#,...,Entry,Legacy,Low,Non-Proprietary,2024-03-31,22O0023864Nestlé India,kolkata,West Bengal,700091,False
622594,22O0023862,Nestlé India,Deployed,Installed,#,44896,22O0023862,LOAN,#,#,...,Entry,Legacy,Low,Non-Proprietary,2024-03-31,22O0023862Nestlé India,Kolkata,West Bengal,700091,False
622593,22O0023856,Nestlé India,Deployed,Installed,#,44896,22O0023856,LOAN,#,#,...,Entry,Legacy,Low,Non-Proprietary,2024-03-31,22O0023856Nestlé India,"Kolkata,New Town",West Bengal,700156,False


In [131]:
#BeverageMachine6_df['Churn'] = np.where((BeverageMachine5_df['Calendar Date_x']<BeverageMachine5_df['Calendar Date_y'])|
#                                (BeverageMachine5_df['Calendar Date_x'] == ChurnDate), False, True)

BeverageMachine6_df['Churn'] = np.where(BeverageMachine5_df['Calendar Date'] < ChurnDate2, True, False)
BeverageMachine6_df.loc[BeverageMachine6_df['Churn']==True].head()

Unnamed: 0,Serial ID,Sales Organisation,Machine Status Groupings,User Status,TA Contract Installation Date,Depreciation Start,Manufacturer Number,Position,TA Contract Start Date,TA Contract End Date,...,Positionning,Generation,Blueprint Throughput,IP Ownership,Calendar Date,Key_ManufacturerID_SalesOrg,City,State,Postal Code,Churn
440187,SGBMB12452,Singapore,Deployed,Installed,#,43191,20174647120,#,#,#,...,Premium,Gen. 2,Medium,Proprietary,2024-02-29,20174647120Singapore,#,SG/Not assigned,529482,True
440224,SGBMB12608,Singapore,Deployed,Installed,#,43282,20181217518,#,#,#,...,Premium,Gen. 2,Medium,Proprietary,2024-02-29,20181217518Singapore,#,SG/Not assigned,718827,True
440125,SGBMB12069,Singapore,Deployed,Installed,#,42826,20164235456,#,#,#,...,Premium,Gen. 2,Medium,Proprietary,2024-02-29,20164235456Singapore,#,SG/Not assigned,609921,True
440127,SGBMB12194,Singapore,Deployed,Installed,#,42826,20165041036,#,#,#,...,Premium,Gen. 2,Medium,Proprietary,2024-02-29,20165041036Singapore,#,SG/Not assigned,609921,True
440874,SGBMB07569,Singapore,Deployed,Installed,#,41061,20120604450,#,#,#,...,Mainstream,Gen. 1,Medium,Proprietary,2024-02-29,20120604450Singapore,#,SG/Not assigned,179434,True


In [132]:
BeverageMachine6_df.loc[BeverageMachine6_df['Churn']==False].head()

Unnamed: 0,Serial ID,Sales Organisation,Machine Status Groupings,User Status,TA Contract Installation Date,Depreciation Start,Manufacturer Number,Position,TA Contract Start Date,TA Contract End Date,...,Positionning,Generation,Blueprint Throughput,IP Ownership,Calendar Date,Key_ManufacturerID_SalesOrg,City,State,Postal Code,Churn
525824,19O0017079,Nestle Australia Ltd,Deployed,Installed,#,43739,20192228627_VB_23O0037199,LOAN,#,#,...,Mainstream,Gen. 2,Low,Proprietary,2024-03-31,20192228627_VB_23O0037199Nestle Australia Ltd,Thornleigh,New South Wales,2120,False
622584,22O0023824,Nestlé India,Deployed,Installed,#,44896,SP3.1-06904,LOAN,#,#,...,Entry,Legacy,Low,Non-Proprietary,2024-03-31,SP3.1-06904Nestlé India,Patna,Bihar,800001,False
622595,22O0023864,Nestlé India,Deployed,Installed,#,44896,22O0023864,LOAN,#,#,...,Entry,Legacy,Low,Non-Proprietary,2024-03-31,22O0023864Nestlé India,kolkata,West Bengal,700091,False
622594,22O0023862,Nestlé India,Deployed,Installed,#,44896,22O0023862,LOAN,#,#,...,Entry,Legacy,Low,Non-Proprietary,2024-03-31,22O0023862Nestlé India,Kolkata,West Bengal,700091,False
622593,22O0023856,Nestlé India,Deployed,Installed,#,44896,22O0023856,LOAN,#,#,...,Entry,Legacy,Low,Non-Proprietary,2024-03-31,22O0023856Nestlé India,"Kolkata,New Town",West Bengal,700156,False


Check the data and modify it if it is not the correct type

In [133]:
e = BeverageMachine6_df.loc[BeverageMachine6_df['Serial ID']==7010054129]
e.iloc[:20,9:40]

Unnamed: 0,TA Contract End Date,TA Usage Indicator,Account ID,EC ID,EC Name,Account ABC Classification (Account ID),Industry (Account ID),Industry Code 1 (Account ID),Account ABC Classification (EC ID),Industry (EC ID),...,Positionning,Generation,Blueprint Throughput,IP Ownership,Calendar Date,Key_ManufacturerID_SalesOrg,City,State,Postal Code,Churn


In [134]:
BeverageMachine6_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 217282 entries, 525824 to 927207
Data columns (total 40 columns):
 #   Column                                               Non-Null Count   Dtype         
---  ------                                               --------------   -----         
 0   Serial ID                                            217282 non-null  object        
 1   Sales Organisation                                   217282 non-null  object        
 2   Machine Status Groupings                             217282 non-null  object        
 3   User Status                                          217282 non-null  object        
 4   TA Contract Installation Date                        217259 non-null  object        
 5   Depreciation Start                                   217064 non-null  object        
 6   Manufacturer Number                                  217280 non-null  object        
 7   Position                                             217282 non-null 

I want some date features to be integer instead of non-null object

In [135]:
# Date features
Date_Features = ['TA Contract Installation Date', 'Depreciation Start',  'TA Contract Start Date', 
                 'TA Contract End Date']

BeverageMachine7_df= BeverageMachine6_df.copy()

for x in Date_Features:
    BeverageMachine7_df[x] = pd.to_numeric(BeverageMachine7_df[x], errors='coerce').fillna(0).astype(int)

In [136]:
BeverageMachine7_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 217282 entries, 525824 to 927207
Data columns (total 40 columns):
 #   Column                                               Non-Null Count   Dtype         
---  ------                                               --------------   -----         
 0   Serial ID                                            217282 non-null  object        
 1   Sales Organisation                                   217282 non-null  object        
 2   Machine Status Groupings                             217282 non-null  object        
 3   User Status                                          217282 non-null  object        
 4   TA Contract Installation Date                        217282 non-null  int32         
 5   Depreciation Start                                   217282 non-null  int32         
 6   Manufacturer Number                                  217280 non-null  object        
 7   Position                                             217282 non-null 

#### Placement Tickets data preparation

In order to merge Placement Tickets data with Beverage Machine data I need to perform some preparations of the data.

I would like to have one row per Manufacture Serial Number and Month

Remove "Removal Ticket" because it is nearly like giving the information if the machine has churned. 
To be decided maybe I should remove it too.
I just kept "Seasonal Removal" because it helps to understand that it is a special case and a similar machine might not churn if it is not a Seasonal Removal

In [137]:
Placement_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303642 entries, 0 to 303641
Data columns (total 3 columns):
 #   Column                         Non-Null Count   Dtype 
---  ------                         --------------   ----- 
 0   Service Category               303642 non-null  object
 1   INCIDENT_CATEGORY_DESCRIPTION  302337 non-null  object
 2   Serial ID                      303550 non-null  object
dtypes: object(3)
memory usage: 6.9+ MB


In [138]:
table1 = Placement_df.loc[Placement_df['Service Category']!="Removal"]
table1 = table1.loc[table1['Service Category']!="Removal."]
table2 = Placement_df.loc[Placement_df['INCIDENT_CATEGORY_DESCRIPTION']=="Seasonal Removal"]

In [139]:
table1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 183332 entries, 551 to 301472
Data columns (total 3 columns):
 #   Column                         Non-Null Count   Dtype 
---  ------                         --------------   ----- 
 0   Service Category               183332 non-null  object
 1   INCIDENT_CATEGORY_DESCRIPTION  182358 non-null  object
 2   Serial ID                      183281 non-null  object
dtypes: object(3)
memory usage: 5.6+ MB


In [140]:
Placement_df_wo_rem = pd.concat([table1,table2])
Placement_df_wo_rem.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 187294 entries, 551 to 301385
Data columns (total 3 columns):
 #   Column                         Non-Null Count   Dtype 
---  ------                         --------------   ----- 
 0   Service Category               187294 non-null  object
 1   INCIDENT_CATEGORY_DESCRIPTION  186320 non-null  object
 2   Serial ID                      187240 non-null  object
dtypes: object(3)
memory usage: 5.7+ MB


In [141]:
from xlrd.xldate import xldate_as_tuple
from dateutil.relativedelta import relativedelta

Placement_df_prep = Placement_df_wo_rem[['Serial ID', 'Service Category','INCIDENT_CATEGORY_DESCRIPTION']].copy()

Placement_df_prep.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 187294 entries, 551 to 301385
Data columns (total 3 columns):
 #   Column                         Non-Null Count   Dtype 
---  ------                         --------------   ----- 
 0   Serial ID                      187240 non-null  object
 1   Service Category               187294 non-null  object
 2   INCIDENT_CATEGORY_DESCRIPTION  186320 non-null  object
dtypes: object(3)
memory usage: 5.7+ MB


In [142]:
Placement_df_prep['Serial ID'] = Placement_df_prep['Serial ID'].astype('str')

In [143]:
def preprocess_f(df):
    # Work on a copy
    df = df.copy()

    nomi_vars = ['Service Category','INCIDENT_CATEGORY_DESCRIPTION']
                
    # Some columns could be also ordinal features but we will keep them as nominal features for the moment
    ##ordi_vars = ['Positionning', 'Generation',]
    
    dummy_columns = nomi_vars
        
    df = pd.get_dummies(df, columns=dummy_columns)

    return df

Placement_df_prep2 = preprocess_f(Placement_df_prep)
Placement_df_prep2.head()

Unnamed: 0,Serial ID,Service Category_Installation,Service Category_Removal,Service Category_Replacement,INCIDENT_CATEGORY_DESCRIPTION_Customer relocation,INCIDENT_CATEGORY_DESCRIPTION_Exchange / Replacement Sales,INCIDENT_CATEGORY_DESCRIPTION_Install/Data Fix,INCIDENT_CATEGORY_DESCRIPTION_Key Account Test Installation,INCIDENT_CATEGORY_DESCRIPTION_Maintenance & Repair,INCIDENT_CATEGORY_DESCRIPTION_New Customer / Installation Point,INCIDENT_CATEGORY_DESCRIPTION_Removal / Data Fix,INCIDENT_CATEGORY_DESCRIPTION_Renew,INCIDENT_CATEGORY_DESCRIPTION_Seasonal Installation,INCIDENT_CATEGORY_DESCRIPTION_Seasonal Removal,INCIDENT_CATEGORY_DESCRIPTION_Trial / Demo /Food Show,INCIDENT_CATEGORY_DESCRIPTION_Unknown/Other,INCIDENT_CATEGORY_DESCRIPTION_Upgrade/Downgrade
551,10043045,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0
552,10048419,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0
2417,10051301,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1
4426,10056376,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0
11743,10039999,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0


In [144]:
Placement_df_prep2.columns

Index(['Serial ID', 'Service Category_Installation',
       'Service Category_Removal', 'Service Category_Replacement',
       'INCIDENT_CATEGORY_DESCRIPTION_Customer relocation',
       'INCIDENT_CATEGORY_DESCRIPTION_Exchange / Replacement Sales',
       'INCIDENT_CATEGORY_DESCRIPTION_Install/Data Fix',
       'INCIDENT_CATEGORY_DESCRIPTION_Key Account Test Installation',
       'INCIDENT_CATEGORY_DESCRIPTION_Maintenance & Repair',
       'INCIDENT_CATEGORY_DESCRIPTION_New Customer / Installation Point',
       'INCIDENT_CATEGORY_DESCRIPTION_Removal / Data Fix',
       'INCIDENT_CATEGORY_DESCRIPTION_Renew',
       'INCIDENT_CATEGORY_DESCRIPTION_Seasonal Installation',
       'INCIDENT_CATEGORY_DESCRIPTION_Seasonal Removal',
       'INCIDENT_CATEGORY_DESCRIPTION_Trial / Demo /Food Show',
       'INCIDENT_CATEGORY_DESCRIPTION_Unknown/Other',
       'INCIDENT_CATEGORY_DESCRIPTION_Upgrade/Downgrade'],
      dtype='object')

In [145]:
Placement_df_prep3 = Placement_df_prep2.groupby(["Serial ID"])\
[['Serial ID', 'Service Category_Installation',
       'Service Category_Removal', 'Service Category_Replacement',
       'INCIDENT_CATEGORY_DESCRIPTION_Customer relocation',
       'INCIDENT_CATEGORY_DESCRIPTION_Exchange / Replacement Sales',
       'INCIDENT_CATEGORY_DESCRIPTION_Install/Data Fix',
       'INCIDENT_CATEGORY_DESCRIPTION_Key Account Test Installation',
       'INCIDENT_CATEGORY_DESCRIPTION_Maintenance & Repair',
       'INCIDENT_CATEGORY_DESCRIPTION_New Customer / Installation Point',
       'INCIDENT_CATEGORY_DESCRIPTION_Removal / Data Fix',
       'INCIDENT_CATEGORY_DESCRIPTION_Renew',
       'INCIDENT_CATEGORY_DESCRIPTION_Seasonal Installation',
       'INCIDENT_CATEGORY_DESCRIPTION_Seasonal Removal',
       'INCIDENT_CATEGORY_DESCRIPTION_Trial / Demo /Food Show',
       'INCIDENT_CATEGORY_DESCRIPTION_Unknown/Other',
       'INCIDENT_CATEGORY_DESCRIPTION_Upgrade/Downgrade']].sum()


TicketsColumnsList = ['Serial ID', 'Service Category_Installation',
       'Service Category_Removal', 'Service Category_Replacement',
       'INCIDENT_CATEGORY_DESCRIPTION_Customer relocation',
       'INCIDENT_CATEGORY_DESCRIPTION_Exchange / Replacement Sales',
       'INCIDENT_CATEGORY_DESCRIPTION_Install/Data Fix',
       'INCIDENT_CATEGORY_DESCRIPTION_Key Account Test Installation',
       'INCIDENT_CATEGORY_DESCRIPTION_Maintenance & Repair',
       'INCIDENT_CATEGORY_DESCRIPTION_New Customer / Installation Point',
       'INCIDENT_CATEGORY_DESCRIPTION_Removal / Data Fix',
       'INCIDENT_CATEGORY_DESCRIPTION_Renew',
       'INCIDENT_CATEGORY_DESCRIPTION_Seasonal Installation',
       'INCIDENT_CATEGORY_DESCRIPTION_Seasonal Removal',
       'INCIDENT_CATEGORY_DESCRIPTION_Trial / Demo /Food Show',
       'INCIDENT_CATEGORY_DESCRIPTION_Unknown/Other',
       'INCIDENT_CATEGORY_DESCRIPTION_Upgrade/Downgrade']

Placement_df_prep3.head()

  'INCIDENT_CATEGORY_DESCRIPTION_Upgrade/Downgrade']].sum()


Unnamed: 0_level_0,Service Category_Installation,Service Category_Removal,Service Category_Replacement,INCIDENT_CATEGORY_DESCRIPTION_Customer relocation,INCIDENT_CATEGORY_DESCRIPTION_Exchange / Replacement Sales,INCIDENT_CATEGORY_DESCRIPTION_Install/Data Fix,INCIDENT_CATEGORY_DESCRIPTION_Key Account Test Installation,INCIDENT_CATEGORY_DESCRIPTION_Maintenance & Repair,INCIDENT_CATEGORY_DESCRIPTION_New Customer / Installation Point,INCIDENT_CATEGORY_DESCRIPTION_Removal / Data Fix,INCIDENT_CATEGORY_DESCRIPTION_Renew,INCIDENT_CATEGORY_DESCRIPTION_Seasonal Installation,INCIDENT_CATEGORY_DESCRIPTION_Seasonal Removal,INCIDENT_CATEGORY_DESCRIPTION_Trial / Demo /Food Show,INCIDENT_CATEGORY_DESCRIPTION_Unknown/Other,INCIDENT_CATEGORY_DESCRIPTION_Upgrade/Downgrade
Serial ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
0.102313088,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
0.4390764,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
100100125.0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1
100100249.0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0
100100250.0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1


In [146]:
# Specify the filename
filename = 'TicketsColumnsList.p'

# Combine the file path and filename
file_path_with_filename = os.path.join(file_path_output, filename)

# Save the list into a pickle file
with open(file_path_with_filename, 'wb') as file:
    pickle.dump(TicketsColumnsList, file)

In [147]:
Placement_df_prep3.columns

Index(['Service Category_Installation', 'Service Category_Removal',
       'Service Category_Replacement',
       'INCIDENT_CATEGORY_DESCRIPTION_Customer relocation',
       'INCIDENT_CATEGORY_DESCRIPTION_Exchange / Replacement Sales',
       'INCIDENT_CATEGORY_DESCRIPTION_Install/Data Fix',
       'INCIDENT_CATEGORY_DESCRIPTION_Key Account Test Installation',
       'INCIDENT_CATEGORY_DESCRIPTION_Maintenance & Repair',
       'INCIDENT_CATEGORY_DESCRIPTION_New Customer / Installation Point',
       'INCIDENT_CATEGORY_DESCRIPTION_Removal / Data Fix',
       'INCIDENT_CATEGORY_DESCRIPTION_Renew',
       'INCIDENT_CATEGORY_DESCRIPTION_Seasonal Installation',
       'INCIDENT_CATEGORY_DESCRIPTION_Seasonal Removal',
       'INCIDENT_CATEGORY_DESCRIPTION_Trial / Demo /Food Show',
       'INCIDENT_CATEGORY_DESCRIPTION_Unknown/Other',
       'INCIDENT_CATEGORY_DESCRIPTION_Upgrade/Downgrade'],
      dtype='object')

In [148]:
Placement_df_prep3.info()

<class 'pandas.core.frame.DataFrame'>
Index: 122173 entries, .102313088 to nan
Data columns (total 16 columns):
 #   Column                                                           Non-Null Count   Dtype
---  ------                                                           --------------   -----
 0   Service Category_Installation                                    122173 non-null  uint8
 1   Service Category_Removal                                         122173 non-null  uint8
 2   Service Category_Replacement                                     122173 non-null  uint8
 3   INCIDENT_CATEGORY_DESCRIPTION_Customer relocation                122173 non-null  uint8
 4   INCIDENT_CATEGORY_DESCRIPTION_Exchange / Replacement Sales       122173 non-null  uint8
 5   INCIDENT_CATEGORY_DESCRIPTION_Install/Data Fix                   122173 non-null  uint8
 6   INCIDENT_CATEGORY_DESCRIPTION_Key Account Test Installation      122173 non-null  uint8
 7   INCIDENT_CATEGORY_DESCRIPTION_Maintenance & Re

In [149]:
Placement_df_prep5 = Placement_df_prep3.reset_index()

In [150]:
Placement_df_prep5.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 122173 entries, 0 to 122172
Data columns (total 17 columns):
 #   Column                                                           Non-Null Count   Dtype 
---  ------                                                           --------------   ----- 
 0   Serial ID                                                        122173 non-null  object
 1   Service Category_Installation                                    122173 non-null  uint8 
 2   Service Category_Removal                                         122173 non-null  uint8 
 3   Service Category_Replacement                                     122173 non-null  uint8 
 4   INCIDENT_CATEGORY_DESCRIPTION_Customer relocation                122173 non-null  uint8 
 5   INCIDENT_CATEGORY_DESCRIPTION_Exchange / Replacement Sales       122173 non-null  uint8 
 6   INCIDENT_CATEGORY_DESCRIPTION_Install/Data Fix                   122173 non-null  uint8 
 7   INCIDENT_CATEGORY_DESCRIPTION_Key Acco

In [151]:
BeverageMachine7_df.columns

Index(['Serial ID', 'Sales Organisation', 'Machine Status Groupings',
       'User Status', 'TA Contract Installation Date', 'Depreciation Start',
       'Manufacturer Number', 'Position', 'TA Contract Start Date',
       'TA Contract End Date', 'TA Usage Indicator', 'Account ID', 'EC ID',
       'EC Name', 'Account ABC Classification (Account ID)',
       'Industry (Account ID)', 'Industry Code 1 (Account ID)',
       'Account ABC Classification (EC ID)', 'Industry (EC ID)',
       'Industry Code 1 (EC ID)', 'Parent Installation Point ID',
       'Registered Product Category (Registered Product ID)', 'Model',
       'Model Vendor', 'Model Category', 'Model Group', 'Beverage Temperature',
       'System Brands', 'Ingredient Format', 'Machine Type', 'Positionning',
       'Generation', 'Blueprint Throughput', 'IP Ownership', 'Calendar Date',
       'Key_ManufacturerID_SalesOrg', 'City', 'State', 'Postal Code', 'Churn'],
      dtype='object')

In [152]:
Placement_df_prep5.loc[Placement_df_prep5['Serial ID']=='#']

Unnamed: 0,Serial ID,Service Category_Installation,Service Category_Removal,Service Category_Replacement,INCIDENT_CATEGORY_DESCRIPTION_Customer relocation,INCIDENT_CATEGORY_DESCRIPTION_Exchange / Replacement Sales,INCIDENT_CATEGORY_DESCRIPTION_Install/Data Fix,INCIDENT_CATEGORY_DESCRIPTION_Key Account Test Installation,INCIDENT_CATEGORY_DESCRIPTION_Maintenance & Repair,INCIDENT_CATEGORY_DESCRIPTION_New Customer / Installation Point,INCIDENT_CATEGORY_DESCRIPTION_Removal / Data Fix,INCIDENT_CATEGORY_DESCRIPTION_Renew,INCIDENT_CATEGORY_DESCRIPTION_Seasonal Installation,INCIDENT_CATEGORY_DESCRIPTION_Seasonal Removal,INCIDENT_CATEGORY_DESCRIPTION_Trial / Demo /Food Show,INCIDENT_CATEGORY_DESCRIPTION_Unknown/Other,INCIDENT_CATEGORY_DESCRIPTION_Upgrade/Downgrade


Remove Placement tickets with 'Serial ID' == '#'

In [153]:
Placement_df_prep6 = Placement_df_prep5.loc[Placement_df_prep5['Serial ID']!='#']

In [154]:
Placement_df_prep6.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 122173 entries, 0 to 122172
Data columns (total 17 columns):
 #   Column                                                           Non-Null Count   Dtype 
---  ------                                                           --------------   ----- 
 0   Serial ID                                                        122173 non-null  object
 1   Service Category_Installation                                    122173 non-null  uint8 
 2   Service Category_Removal                                         122173 non-null  uint8 
 3   Service Category_Replacement                                     122173 non-null  uint8 
 4   INCIDENT_CATEGORY_DESCRIPTION_Customer relocation                122173 non-null  uint8 
 5   INCIDENT_CATEGORY_DESCRIPTION_Exchange / Replacement Sales       122173 non-null  uint8 
 6   INCIDENT_CATEGORY_DESCRIPTION_Install/Data Fix                   122173 non-null  uint8 
 7   INCIDENT_CATEGORY_DESCRIPTION_Key Acco

In [155]:
#Placement_df_prep6['Serial ID'] = Placement_df_prep6['Serial ID'].astype('str')

In [156]:
Placement_df_prep6 = Placement_df_prep6.reset_index()

Placement_df_prep6=Placement_df_prep6.drop(columns=['index'])
Placement_df_prep6

Unnamed: 0,Serial ID,Service Category_Installation,Service Category_Removal,Service Category_Replacement,INCIDENT_CATEGORY_DESCRIPTION_Customer relocation,INCIDENT_CATEGORY_DESCRIPTION_Exchange / Replacement Sales,INCIDENT_CATEGORY_DESCRIPTION_Install/Data Fix,INCIDENT_CATEGORY_DESCRIPTION_Key Account Test Installation,INCIDENT_CATEGORY_DESCRIPTION_Maintenance & Repair,INCIDENT_CATEGORY_DESCRIPTION_New Customer / Installation Point,INCIDENT_CATEGORY_DESCRIPTION_Removal / Data Fix,INCIDENT_CATEGORY_DESCRIPTION_Renew,INCIDENT_CATEGORY_DESCRIPTION_Seasonal Installation,INCIDENT_CATEGORY_DESCRIPTION_Seasonal Removal,INCIDENT_CATEGORY_DESCRIPTION_Trial / Demo /Food Show,INCIDENT_CATEGORY_DESCRIPTION_Unknown/Other,INCIDENT_CATEGORY_DESCRIPTION_Upgrade/Downgrade
0,.102313088,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
1,.4390764,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
2,100100125,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1
3,100100249,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0
4,100100250,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
122168,ZAB2022049,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
122169,ZAB2022050,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
122170,ZAG0054,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0
122171,ZAR1222,1,0,1,0,1,0,0,0,1,0,0,0,0,0,0,0


I will link both data together Beverage Machine data and Placement Ticket

In [157]:
BeverageMachine7_wTickets_df = pd.merge(BeverageMachine7_df, Placement_df_prep6, how='left', left_on = ['Serial ID'], right_on = ['Serial ID'])

f=BeverageMachine7_wTickets_df.loc[BeverageMachine7_wTickets_df['Serial ID']=='7010054129']
f.iloc[:20,20:50]

In [158]:
BeverageMachine7_wTickets_df

Unnamed: 0,Serial ID,Sales Organisation,Machine Status Groupings,User Status,TA Contract Installation Date,Depreciation Start,Manufacturer Number,Position,TA Contract Start Date,TA Contract End Date,...,INCIDENT_CATEGORY_DESCRIPTION_Key Account Test Installation,INCIDENT_CATEGORY_DESCRIPTION_Maintenance & Repair,INCIDENT_CATEGORY_DESCRIPTION_New Customer / Installation Point,INCIDENT_CATEGORY_DESCRIPTION_Removal / Data Fix,INCIDENT_CATEGORY_DESCRIPTION_Renew,INCIDENT_CATEGORY_DESCRIPTION_Seasonal Installation,INCIDENT_CATEGORY_DESCRIPTION_Seasonal Removal,INCIDENT_CATEGORY_DESCRIPTION_Trial / Demo /Food Show,INCIDENT_CATEGORY_DESCRIPTION_Unknown/Other,INCIDENT_CATEGORY_DESCRIPTION_Upgrade/Downgrade
0,19O0017079,Nestle Australia Ltd,Deployed,Installed,0,43739,20192228627_VB_23O0037199,LOAN,0,0,...,,,,,,,,,,
1,22O0023824,Nestlé India,Deployed,Installed,0,44896,SP3.1-06904,LOAN,0,0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,22O0023864,Nestlé India,Deployed,Installed,0,44896,22O0023864,LOAN,0,0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,22O0023862,Nestlé India,Deployed,Installed,0,44896,22O0023862,LOAN,0,0,...,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,22O0023856,Nestlé India,Deployed,Installed,0,44896,22O0023856,LOAN,0,0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
217277,20O0017858,Singapore,Deployed,Installed,0,44166,2008020009,#,0,0,...,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
217278,20O0017859,Singapore,Deployed,Installed,0,44166,2008020010,#,0,0,...,1.0,1.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
217279,20O0017862,Singapore,Deployed,Installed,0,44166,2008020013,#,0,0,...,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
217280,20O0017861,Singapore,Deployed,Installed,0,44166,2008020012,#,0,0,...,1.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [159]:
BeverageMachine7_wTickets_df=BeverageMachine7_wTickets_df.fillna(0)
BeverageMachine7_wTickets_df.head()

Unnamed: 0,Serial ID,Sales Organisation,Machine Status Groupings,User Status,TA Contract Installation Date,Depreciation Start,Manufacturer Number,Position,TA Contract Start Date,TA Contract End Date,...,INCIDENT_CATEGORY_DESCRIPTION_Key Account Test Installation,INCIDENT_CATEGORY_DESCRIPTION_Maintenance & Repair,INCIDENT_CATEGORY_DESCRIPTION_New Customer / Installation Point,INCIDENT_CATEGORY_DESCRIPTION_Removal / Data Fix,INCIDENT_CATEGORY_DESCRIPTION_Renew,INCIDENT_CATEGORY_DESCRIPTION_Seasonal Installation,INCIDENT_CATEGORY_DESCRIPTION_Seasonal Removal,INCIDENT_CATEGORY_DESCRIPTION_Trial / Demo /Food Show,INCIDENT_CATEGORY_DESCRIPTION_Unknown/Other,INCIDENT_CATEGORY_DESCRIPTION_Upgrade/Downgrade
0,19O0017079,Nestle Australia Ltd,Deployed,Installed,0,43739,20192228627_VB_23O0037199,LOAN,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,22O0023824,Nestlé India,Deployed,Installed,0,44896,SP3.1-06904,LOAN,0,0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,22O0023864,Nestlé India,Deployed,Installed,0,44896,22O0023864,LOAN,0,0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,22O0023862,Nestlé India,Deployed,Installed,0,44896,22O0023862,LOAN,0,0,...,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,22O0023856,Nestlé India,Deployed,Installed,0,44896,22O0023856,LOAN,0,0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Even if we have only around 2000 machines having tickets, BeverageMachine7_wTickets_df can be used and we will see if it can improve the model.

In [160]:
SO_Tickets =['prstzr pnstrpzcp ztd', 'prstzr nk', 'prstzr prw zrpzppd', 'ppkcstpp']

BeverageMachine7_wTicketsOnly_df = Placement_df_prep6

BeverageMachine7_wTicketsOnly_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 122173 entries, 0 to 122172
Data columns (total 17 columns):
 #   Column                                                           Non-Null Count   Dtype 
---  ------                                                           --------------   ----- 
 0   Serial ID                                                        122173 non-null  object
 1   Service Category_Installation                                    122173 non-null  uint8 
 2   Service Category_Removal                                         122173 non-null  uint8 
 3   Service Category_Replacement                                     122173 non-null  uint8 
 4   INCIDENT_CATEGORY_DESCRIPTION_Customer relocation                122173 non-null  uint8 
 5   INCIDENT_CATEGORY_DESCRIPTION_Exchange / Replacement Sales       122173 non-null  uint8 
 6   INCIDENT_CATEGORY_DESCRIPTION_Install/Data Fix                   122173 non-null  uint8 
 7   INCIDENT_CATEGORY_DESCRIPTION_Key Acco

#### Telemetry data preparation

Let's see what we can get with only machines having Telemetry data

In [161]:
BeverageMachine7_wTelemetry = pd.merge(BeverageMachine7_df, Telemetry_aggSales, how='inner', left_on = ['Manufacturer Number'], right_on = ['serial'])
BeverageMachine7_wTelemetry.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 13695 entries, 0 to 13694
Data columns (total 41 columns):
 #   Column                                               Non-Null Count  Dtype         
---  ------                                               --------------  -----         
 0   Serial ID                                            13695 non-null  object        
 1   Sales Organisation                                   13695 non-null  object        
 2   Machine Status Groupings                             13695 non-null  object        
 3   User Status                                          13695 non-null  object        
 4   TA Contract Installation Date                        13695 non-null  int32         
 5   Depreciation Start                                   13695 non-null  int32         
 6   Manufacturer Number                                  13695 non-null  object        
 7   Position                                             13695 non-null  object        
 

I only have 218 machines matching a Telemetry Kit. This is clearly not enough in order to apply Machine Learning model to predict churn.
We should at least combine it with the Beverage Machine data if we want to use it.

#### Visits data preparation

In [162]:
Visitsdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 581230 entries, 0 to 581229
Data columns (total 24 columns):
 #   Column                              Non-Null Count   Dtype         
---  ------                              --------------   -----         
 0   Month                               581230 non-null  int64         
 1   Year                                581230 non-null  int64         
 2   Period                              581230 non-null  object        
 3   Counter_visits_completed            581230 non-null  int64         
 4   Cummulative                         581230 non-null  object        
 5   Cummulative_Final                   581230 non-null  object        
 6   Cummulative Graph                   579206 non-null  object        
 7   Occurence Balancing                 581230 non-null  object        
 8   Activity Owner                      581208 non-null  object        
 9   Activity Owner ID                   581208 non-null  float64       
 10  End Date

In [163]:
Visitsdf.columns

Index(['Month', 'Year', 'Period', 'Counter_visits_completed', 'Cummulative',
       'Cummulative_Final', 'Cummulative Graph', 'Occurence Balancing',
       'Activity Owner', 'Activity Owner ID', 'End Date in Local Time Zone',
       'Result', 'Sales Org Desc', 'Sales Organization',
       'Sales Unit (Hierarchy)', 'Sales Unit (Hierarchy) ID',
       'Activity Life Cycle Status id', 'Activity Life Cycle Status',
       'Counter_visits', 'Visit Description', 'Visit',
       'Account ID.Account ID Level 01', 'Account ID.Account ID Level 01.Key',
       'Index'],
      dtype='object')

In [164]:
Visitsdf1 = Visitsdf[['End Date in Local Time Zone', 'Result', 'Activity Life Cycle Status', 'Visit', 'Account ID.Account ID Level 01.Key']]

Remove visits with no account id

In [165]:
Visitsdf1 = Visitsdf1.loc[Visitsdf1['Account ID.Account ID Level 01.Key']!="#"]

I do not have the Sales org ID in TA and I think that Account ID are unique I am not doing the key "KeySOAccID" yet.

In [166]:
#Visitsdf1['KeySOAccID'] = Visitsdf1['Sales Organization'] + Visitsdf1['Account ID.Account ID Level 01.Key'].map(str) 

In [167]:
def preprocess_visits(df):
    # Work on a copy
    df = df.copy()

    nomi_vars = ['Result', 'Activity Life Cycle Status']
    
    dummy_columns = nomi_vars
        
    df = pd.get_dummies(df, columns=dummy_columns)

    return df

Visitsdf_prep = preprocess_visits(Visitsdf1)
Visitsdf_prep.head()

Unnamed: 0,End Date in Local Time Zone,Visit,Account ID.Account ID Level 01.Key,Result_Incomplete Selling Call,Result_Not assigned,Result_Objective Met,Result_Objective Partially Met,Result_Requires Further Follow-up,Result_Unsuccessful Selling Call,Activity Life Cycle Status_Canceled,Activity Life Cycle Status_Completed,Activity Life Cycle Status_In Process,Activity Life Cycle Status_Open
0,2023-04-26,2322418,1938062,1,0,0,0,0,0,0,1,0,0
1,2023-08-25,2541707,1938062,0,0,1,0,0,0,0,1,0,0
2,2023-06-01,2409667,1936072,0,0,0,1,0,0,0,1,0,0
3,2023-11-23,2685903,1936072,0,0,0,0,0,1,0,1,0,0
4,2023-06-22,2418486,1937417,0,0,0,0,1,0,0,1,0,0


Summarize the column based on the Account ID and keep the last visit date

In [168]:
Visitsdf_prep.columns

Index(['End Date in Local Time Zone', 'Visit',
       'Account ID.Account ID Level 01.Key', 'Result_Incomplete Selling Call',
       'Result_Not assigned', 'Result_Objective Met',
       'Result_Objective Partially Met', 'Result_Requires Further Follow-up',
       'Result_Unsuccessful Selling Call',
       'Activity Life Cycle Status_Canceled',
       'Activity Life Cycle Status_Completed',
       'Activity Life Cycle Status_In Process',
       'Activity Life Cycle Status_Open'],
      dtype='object')

In [169]:
Visitsdf_prep.iloc[0]['End Date in Local Time Zone']

Timestamp('2023-04-26 00:00:00')

In [170]:
Visitsdf_prep['End Date in Local Time Zone'] = Visitsdf_prep['End Date in Local Time Zone'].apply(str)


Visitsdf_prep.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 581230 entries, 0 to 581229
Data columns (total 13 columns):
 #   Column                                 Non-Null Count   Dtype 
---  ------                                 --------------   ----- 
 0   End Date in Local Time Zone            581230 non-null  object
 1   Visit                                  581230 non-null  int64 
 2   Account ID.Account ID Level 01.Key     535030 non-null  object
 3   Result_Incomplete Selling Call         581230 non-null  uint8 
 4   Result_Not assigned                    581230 non-null  uint8 
 5   Result_Objective Met                   581230 non-null  uint8 
 6   Result_Objective Partially Met         581230 non-null  uint8 
 7   Result_Requires Further Follow-up      581230 non-null  uint8 
 8   Result_Unsuccessful Selling Call       581230 non-null  uint8 
 9   Activity Life Cycle Status_Canceled    581230 non-null  uint8 
 10  Activity Life Cycle Status_Completed   581230 non-null  uint8 
 11  

In [171]:
#pip install dateparser

In [172]:
#import dateparser

Visitsdf_prep2 =Visitsdf_prep.copy()
Visitsdf_prep2['End Date in Local Time Zone'] = pd.to_datetime(Visitsdf_prep2['End Date in Local Time Zone'])
#Visitsdf_prep2['End Date in Local Time Zone'] = Visitsdf_prep2['End Date in Local Time Zone'].apply(lambda x: dateparser.parse(x))

In [173]:
Visitsdf_prep2

Unnamed: 0,End Date in Local Time Zone,Visit,Account ID.Account ID Level 01.Key,Result_Incomplete Selling Call,Result_Not assigned,Result_Objective Met,Result_Objective Partially Met,Result_Requires Further Follow-up,Result_Unsuccessful Selling Call,Activity Life Cycle Status_Canceled,Activity Life Cycle Status_Completed,Activity Life Cycle Status_In Process,Activity Life Cycle Status_Open
0,2023-04-26,2322418,1938062,1,0,0,0,0,0,0,1,0,0
1,2023-08-25,2541707,1938062,0,0,1,0,0,0,0,1,0,0
2,2023-06-01,2409667,1936072,0,0,0,1,0,0,0,1,0,0
3,2023-11-23,2685903,1936072,0,0,0,0,0,1,0,1,0,0
4,2023-06-22,2418486,1937417,0,0,0,0,1,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
581225,2023-05-29,2372953,,0,1,0,0,0,0,1,0,0,0
581226,2023-10-10,2624297,,0,1,0,0,0,0,1,0,0,0
581227,2023-04-27,2319862,,0,1,0,0,0,0,1,0,0,0
581228,2023-11-28,2706162,,0,1,0,0,0,0,1,0,0,0


pd.to_datetime(Visitsdf_prep2['End Date in Local Time Zone'])

In [174]:
Visitsdf_prep2.sort_values('End Date in Local Time Zone', ascending = True)

Unnamed: 0,End Date in Local Time Zone,Visit,Account ID.Account ID Level 01.Key,Result_Incomplete Selling Call,Result_Not assigned,Result_Objective Met,Result_Objective Partially Met,Result_Requires Further Follow-up,Result_Unsuccessful Selling Call,Activity Life Cycle Status_Canceled,Activity Life Cycle Status_Completed,Activity Life Cycle Status_In Process,Activity Life Cycle Status_Open
262292,2010-07-29,2406815,3914214,0,1,0,0,0,0,0,0,0,1
262279,2010-08-26,2406233,3556508,0,1,0,0,0,0,0,0,0,1
262253,2011-07-12,2406832,2757927,0,1,0,0,0,0,0,0,0,1
262257,2012-07-26,2450208,,0,1,0,0,0,0,0,0,0,1
262278,2012-08-24,2406231,3315972,0,1,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
425181,2027-06-07,2396817,7729244,0,1,0,0,0,0,0,1,0,0
577926,2028-03-21,2822355,,0,1,0,0,0,0,0,0,0,1
577444,2028-03-27,2804343,,0,1,0,0,0,0,0,0,0,1
397774,2028-07-28,2484908,1809624,0,1,0,0,0,0,0,1,0,0


In [175]:
Visitsdf_prep2 = Visitsdf_prep2.sort_values('End Date in Local Time Zone')
Visitsdf_prep2.head()

Unnamed: 0,End Date in Local Time Zone,Visit,Account ID.Account ID Level 01.Key,Result_Incomplete Selling Call,Result_Not assigned,Result_Objective Met,Result_Objective Partially Met,Result_Requires Further Follow-up,Result_Unsuccessful Selling Call,Activity Life Cycle Status_Canceled,Activity Life Cycle Status_Completed,Activity Life Cycle Status_In Process,Activity Life Cycle Status_Open
262292,2010-07-29,2406815,3914214.0,0,1,0,0,0,0,0,0,0,1
262279,2010-08-26,2406233,3556508.0,0,1,0,0,0,0,0,0,0,1
262253,2011-07-12,2406832,2757927.0,0,1,0,0,0,0,0,0,0,1
262257,2012-07-26,2450208,,0,1,0,0,0,0,0,0,0,1
262278,2012-08-24,2406231,3315972.0,0,1,0,0,0,0,0,0,0,1


In [176]:
Visitsdf_prep3 = (Visitsdf_prep2.sort_values('End Date in Local Time Zone')
    .groupby(["Account ID.Account ID Level 01.Key"])
                      .agg({
        'End Date in Local Time Zone': lambda s: s.values[-1],
        'Result_Incomplete Selling Call' : 'sum',
        'Result_Not assigned' : 'sum', 
        'Result_Objective Met' : 'sum',
       'Result_Objective Partially Met' : 'sum', 'Result_Requires Further Follow-up' : 'sum',
       'Result_Unsuccessful Selling Call' : 'sum',
       'Activity Life Cycle Status_Canceled' : 'sum',
       'Activity Life Cycle Status_Completed' : 'sum',
       'Activity Life Cycle Status_In Process' : 'sum',
       'Activity Life Cycle Status_Open' : 'sum'
    })
)

In [177]:
Visitsdf_prep3.head()

Unnamed: 0_level_0,End Date in Local Time Zone,Result_Incomplete Selling Call,Result_Not assigned,Result_Objective Met,Result_Objective Partially Met,Result_Requires Further Follow-up,Result_Unsuccessful Selling Call,Activity Life Cycle Status_Canceled,Activity Life Cycle Status_Completed,Activity Life Cycle Status_In Process,Activity Life Cycle Status_Open
Account ID.Account ID Level 01.Key,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
1000058,2024-02-15,0,6,0,0,0,0,0,6,0,0
1000216,2024-03-20,0,16,0,0,0,0,0,16,0,0
1000256,2024-11-03,0,25,0,0,0,0,2,22,0,1
1000278,2023-03-16,0,1,0,0,0,0,1,0,0,0
1000280,2024-03-07,0,2,0,0,0,0,1,1,0,0


In [178]:
Visitsdf_prep4 = Visitsdf_prep3.copy()
Visitsdf_prep4.reset_index()

Unnamed: 0,Account ID.Account ID Level 01.Key,End Date in Local Time Zone,Result_Incomplete Selling Call,Result_Not assigned,Result_Objective Met,Result_Objective Partially Met,Result_Requires Further Follow-up,Result_Unsuccessful Selling Call,Activity Life Cycle Status_Canceled,Activity Life Cycle Status_Completed,Activity Life Cycle Status_In Process,Activity Life Cycle Status_Open
0,1000058,2024-02-15,0,6,0,0,0,0,0,6,0,0
1,1000216,2024-03-20,0,16,0,0,0,0,0,16,0,0
2,1000256,2024-11-03,0,25,0,0,0,0,2,22,0,1
3,1000278,2023-03-16,0,1,0,0,0,0,1,0,0,0
4,1000280,2024-03-07,0,2,0,0,0,0,1,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
95811,9102838,2024-03-05,0,1,0,0,0,0,0,1,0,0
95812,9102840,2024-04-05,0,1,0,0,0,0,0,0,1,0
95813,9102914,2024-04-03,0,1,0,0,0,0,0,0,1,0
95814,9103080,2024-03-12,0,1,0,0,0,0,0,0,0,1


In [179]:
Visitsdf_prep4['Last_visit_diff_months'] = ChurnDate2 - Visitsdf_prep4['End Date in Local Time Zone']

Visitsdf_prep4['Last_visit_diff_months'] = Visitsdf_prep4['Last_visit_diff_months']/np.timedelta64(1,'M')

In [180]:
Visitsdf_prep4.head()

Unnamed: 0_level_0,End Date in Local Time Zone,Result_Incomplete Selling Call,Result_Not assigned,Result_Objective Met,Result_Objective Partially Met,Result_Requires Further Follow-up,Result_Unsuccessful Selling Call,Activity Life Cycle Status_Canceled,Activity Life Cycle Status_Completed,Activity Life Cycle Status_In Process,Activity Life Cycle Status_Open,Last_visit_diff_months
Account ID.Account ID Level 01.Key,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
1000058,2024-02-15,0,6,0,0,0,0,0,6,0,0,1.47847
1000216,2024-03-20,0,16,0,0,0,0,0,16,0,0,0.361404
1000256,2024-11-03,0,25,0,0,0,0,2,22,0,1,-7.12951
1000278,2023-03-16,0,1,0,0,0,0,1,0,0,0,12.517711
1000280,2024-03-07,0,2,0,0,0,0,1,1,0,0,0.788517


In [181]:
Visitsdf_wVisits = Visitsdf_prep4.copy()
Visitsdf_wVisits.info()

<class 'pandas.core.frame.DataFrame'>
Index: 95816 entries, 1000058 to CO_TW10
Data columns (total 12 columns):
 #   Column                                 Non-Null Count  Dtype         
---  ------                                 --------------  -----         
 0   End Date in Local Time Zone            95816 non-null  datetime64[ns]
 1   Result_Incomplete Selling Call         95816 non-null  uint8         
 2   Result_Not assigned                    95816 non-null  uint64        
 3   Result_Objective Met                   95816 non-null  uint8         
 4   Result_Objective Partially Met         95816 non-null  uint8         
 5   Result_Requires Further Follow-up      95816 non-null  uint8         
 6   Result_Unsuccessful Selling Call       95816 non-null  uint8         
 7   Activity Life Cycle Status_Canceled    95816 non-null  uint8         
 8   Activity Life Cycle Status_Completed   95816 non-null  uint64        
 9   Activity Life Cycle Status_In Process  95816 non-null  uin

df['Reported_Date'] = pd.to_datetime(df['Reported_Date'], format='%m/%d/%Y')
df['Process Date'] = pd.to_datetime(df['Process Date'], format='%m/%d/%Y')

df = (
    df
    .sort_values('Process Date')
    .groupby('ID', as_index=False)
    .agg({
        'Total': 'sum',
        'Process Date': lambda s: s.values[-1]
    })
)

'Activity Owner', 'Visit Description', 'Sales Unit (Hierarchy)' might be useful but with one hot encoding I would have too many columns

In [182]:
Visitsdf_wVisits2 = Visitsdf_wVisits.reset_index() 
Visitsdf_wVisits = Visitsdf_wVisits2.rename(columns={"Account ID.Account ID Level 01.Key":"Acc_ID"})
Visitsdf_wVisits

Unnamed: 0,Acc_ID,End Date in Local Time Zone,Result_Incomplete Selling Call,Result_Not assigned,Result_Objective Met,Result_Objective Partially Met,Result_Requires Further Follow-up,Result_Unsuccessful Selling Call,Activity Life Cycle Status_Canceled,Activity Life Cycle Status_Completed,Activity Life Cycle Status_In Process,Activity Life Cycle Status_Open,Last_visit_diff_months
0,1000058,2024-02-15,0,6,0,0,0,0,0,6,0,0,1.478470
1,1000216,2024-03-20,0,16,0,0,0,0,0,16,0,0,0.361404
2,1000256,2024-11-03,0,25,0,0,0,0,2,22,0,1,-7.129510
3,1000278,2023-03-16,0,1,0,0,0,0,1,0,0,0,12.517711
4,1000280,2024-03-07,0,2,0,0,0,0,1,1,0,0,0.788517
...,...,...,...,...,...,...,...,...,...,...,...,...,...
95811,9102838,2024-03-05,0,1,0,0,0,0,0,1,0,0,0.854227
95812,9102840,2024-04-05,0,1,0,0,0,0,0,0,1,0,-0.164274
95813,9102914,2024-04-03,0,1,0,0,0,0,0,0,1,0,-0.098565
95814,9103080,2024-03-12,0,1,0,0,0,0,0,0,0,1,0.624243


left2 = pd.DataFrame(
    {"A": ["A0", "A1", "A2", "C1"], "B": ["B0", "B1", "B2", "B2"], "K1": [1938031, 1938031, 2, 3]}, index=["K0", "K1", "K2", "K2"]
)
left2

In [183]:
Visitsdf_wVisits.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 95816 entries, 0 to 95815
Data columns (total 13 columns):
 #   Column                                 Non-Null Count  Dtype         
---  ------                                 --------------  -----         
 0   Acc_ID                                 95816 non-null  object        
 1   End Date in Local Time Zone            95816 non-null  datetime64[ns]
 2   Result_Incomplete Selling Call         95816 non-null  uint8         
 3   Result_Not assigned                    95816 non-null  uint64        
 4   Result_Objective Met                   95816 non-null  uint8         
 5   Result_Objective Partially Met         95816 non-null  uint8         
 6   Result_Requires Further Follow-up      95816 non-null  uint8         
 7   Result_Unsuccessful Selling Call       95816 non-null  uint8         
 8   Activity Life Cycle Status_Canceled    95816 non-null  uint8         
 9   Activity Life Cycle Status_Completed   95816 non-null  uint64

In [184]:

Visitsdf_wVisits['Acc_ID'] = Visitsdf_wVisits['Acc_ID'].astype(str)
Visitsdf_wVisits['#Visits completed'] = Visitsdf_wVisits['Activity Life Cycle Status_Completed']
Visitsdf_wVisits = Visitsdf_wVisits.drop(['Activity Life Cycle Status_Completed'], axis = 1)
Visitsdf_wVisits.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 95816 entries, 0 to 95815
Data columns (total 13 columns):
 #   Column                                 Non-Null Count  Dtype         
---  ------                                 --------------  -----         
 0   Acc_ID                                 95816 non-null  object        
 1   End Date in Local Time Zone            95816 non-null  datetime64[ns]
 2   Result_Incomplete Selling Call         95816 non-null  uint8         
 3   Result_Not assigned                    95816 non-null  uint64        
 4   Result_Objective Met                   95816 non-null  uint8         
 5   Result_Objective Partially Met         95816 non-null  uint8         
 6   Result_Requires Further Follow-up      95816 non-null  uint8         
 7   Result_Unsuccessful Selling Call       95816 non-null  uint8         
 8   Activity Life Cycle Status_Canceled    95816 non-null  uint8         
 9   Activity Life Cycle Status_In Process  95816 non-null  uint8 

In [185]:
a=Visitsdf_wVisits.loc[Visitsdf_wVisits['Acc_ID']=='1938031']
a

Unnamed: 0,Acc_ID,End Date in Local Time Zone,Result_Incomplete Selling Call,Result_Not assigned,Result_Objective Met,Result_Objective Partially Met,Result_Requires Further Follow-up,Result_Unsuccessful Selling Call,Activity Life Cycle Status_Canceled,Activity Life Cycle Status_In Process,Activity Life Cycle Status_Open,Last_visit_diff_months,#Visits completed
16738,1938031,2023-06-20,0,24,0,0,0,0,0,0,0,9.363642,24


result = pd.merge(left2, Visitsdf_wVisits, how='left', left_on = ['K1'], right_on = ['Acc_ID']) 
result

#### Phone calls data preparation

In [186]:
PhoneCallsdf.head()

Unnamed: 0,Activity Name,Account Name,Activity Owner,Activity Life Cycle Status,Phone Call ID,Objective (Phone Call),Sales Organization,End Date in Local Time Zone,Start Date in Local Time Zone,PeriodEnd,ee
0,2023-03-21- Residence Call 2,7316409,Jadala Aishwarya,Completed,1076101,,IN14,"mercredi, 22 mars 2023","mercredi, 22 mars 2023",2023 - 03,4473.0
1,2023-03-22- Bazar Call 1,7323610,Jadala Aishwarya,Completed,1076189,,IN14,"mercredi, 22 mars 2023","mercredi, 22 mars 2023",2023 - 03,4473.0
2,2023-03-21- Yes Call 1,7317829,Jadala Aishwarya,Completed,1075454,,IN14,"mercredi, 22 mars 2023","mercredi, 22 mars 2023",2023 - 03,4473.0
3,2023-03-22- College Call 1,7323619,Jadala Aishwarya,Completed,1076183,,IN14,"mercredi, 22 mars 2023","mercredi, 22 mars 2023",2023 - 03,4473.0
4,2023-03-21- Yatri Nivas Hotel Call 2,7316945,Jadala Aishwarya,Completed,1076356,,IN14,"mercredi, 22 mars 2023","mercredi, 22 mars 2023",2023 - 03,4473.0


In [187]:
PhoneCallsdf.columns

Index(['Activity Name', 'Account Name', 'Activity Owner',
       'Activity Life Cycle Status', 'Phone Call ID', 'Objective (Phone Call)',
       'Sales Organization', 'End Date in Local Time Zone',
       'Start Date in Local Time Zone', 'PeriodEnd', 'ee'],
      dtype='object')

'Activity Owner',
 'Objective (Phone Call)' -> to much text freedom and too many reasons
 'Phone Call ID' -> not needed

In [188]:
PhoneCallsdf1 = PhoneCallsdf[['Account Name', 'Activity Life Cycle Status', 'End Date in Local Time Zone']]

In [189]:
def preprocess_calls(df):
    # Work on a copy
    df = df.copy()

    nomi_vars = ['Activity Life Cycle Status']
    
    dummy_columns = nomi_vars
        
    df = pd.get_dummies(df, columns=dummy_columns)

    return df

PhoneCallsdf_prep = preprocess_calls(PhoneCallsdf1)
PhoneCallsdf_prep.head()

Unnamed: 0,Account Name,End Date in Local Time Zone,Activity Life Cycle Status_Canceled,Activity Life Cycle Status_Completed,Activity Life Cycle Status_In Process,Activity Life Cycle Status_Open
0,7316409,"mercredi, 22 mars 2023",0,1,0,0
1,7323610,"mercredi, 22 mars 2023",0,1,0,0
2,7317829,"mercredi, 22 mars 2023",0,1,0,0
3,7323619,"mercredi, 22 mars 2023",0,1,0,0
4,7316945,"mercredi, 22 mars 2023",0,1,0,0


Remove phone calls without an account ID

In [190]:
PhoneCallsdf_prep1 = PhoneCallsdf_prep.loc[PhoneCallsdf_prep['Account Name']!="#"]


In [191]:
PhoneCallsdf_prep1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 276697 entries, 0 to 276696
Data columns (total 6 columns):
 #   Column                                 Non-Null Count   Dtype 
---  ------                                 --------------   ----- 
 0   Account Name                           275675 non-null  object
 1   End Date in Local Time Zone            276697 non-null  object
 2   Activity Life Cycle Status_Canceled    276697 non-null  uint8 
 3   Activity Life Cycle Status_Completed   276697 non-null  uint8 
 4   Activity Life Cycle Status_In Process  276697 non-null  uint8 
 5   Activity Life Cycle Status_Open        276697 non-null  uint8 
dtypes: object(2), uint8(4)
memory usage: 7.4+ MB


Remove date greater than next year

In [192]:
Churndate2_year = ChurnDate2.year

In [193]:
PhoneCallsdf_prep1['End Date in Local Time Zone'] = pd.to_datetime(PhoneCallsdf_prep1['End Date in Local Time Zone'], errors = 'coerce')


In [194]:
PhoneCallsdf_prep1 = PhoneCallsdf_prep1.loc[PhoneCallsdf_prep1['End Date in Local Time Zone'] < dt.datetime(Churndate2_year+1,1,1)]

In [195]:
PhoneCallsdf_prep1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 0 entries
Data columns (total 6 columns):
 #   Column                                 Non-Null Count  Dtype         
---  ------                                 --------------  -----         
 0   Account Name                           0 non-null      object        
 1   End Date in Local Time Zone            0 non-null      datetime64[ns]
 2   Activity Life Cycle Status_Canceled    0 non-null      uint8         
 3   Activity Life Cycle Status_Completed   0 non-null      uint8         
 4   Activity Life Cycle Status_In Process  0 non-null      uint8         
 5   Activity Life Cycle Status_Open        0 non-null      uint8         
dtypes: datetime64[ns](1), object(1), uint8(4)
memory usage: 0.0+ bytes


In [196]:
PhoneCallsdf_prep1 = PhoneCallsdf_prep1.sort_values('End Date in Local Time Zone')

In [197]:
PhoneCallsdf_prep2 = (PhoneCallsdf_prep1.sort_values('End Date in Local Time Zone')
    .groupby(["Account Name"])
                      .agg({
        'End Date in Local Time Zone': lambda s: s.values[-1],
        'Activity Life Cycle Status_Completed' : 'sum'}))


In [198]:
PhoneCallsdf_prep2

Unnamed: 0_level_0,End Date in Local Time Zone,Activity Life Cycle Status_Completed
Account Name,Unnamed: 1_level_1,Unnamed: 2_level_1


In [199]:
PhoneCallsdf_prep3 = PhoneCallsdf_prep2.copy()
PhoneCallsdf_prep3.reset_index()

Unnamed: 0,Account Name,End Date in Local Time Zone,Activity Life Cycle Status_Completed


In [200]:
PhoneCallsdf_prep3['Last_call_diff_months'] = ChurnDate2 - PhoneCallsdf_prep3['End Date in Local Time Zone']

PhoneCallsdf_prep3['Last_call_diff_months'] = PhoneCallsdf_prep3['Last_call_diff_months']/np.timedelta64(1,'M')

In [201]:
PhoneCallsdf_prep3.info()

<class 'pandas.core.frame.DataFrame'>
Index: 0 entries
Data columns (total 3 columns):
 #   Column                                Non-Null Count  Dtype         
---  ------                                --------------  -----         
 0   End Date in Local Time Zone           0 non-null      datetime64[ns]
 1   Activity Life Cycle Status_Completed  0 non-null      uint8         
 2   Last_call_diff_months                 0 non-null      float64       
dtypes: datetime64[ns](1), float64(1), uint8(1)
memory usage: 0.0+ bytes


In [202]:
PhoneCallsdf_prep3 = PhoneCallsdf_prep3.copy()
PhoneCallsdf_prep3.reset_index()

Unnamed: 0,Account Name,End Date in Local Time Zone,Activity Life Cycle Status_Completed,Last_call_diff_months


In [203]:
PhoneCallsdf_prep3['#Calls Completed'] = PhoneCallsdf_prep3['Activity Life Cycle Status_Completed']
PhoneCallsdf_prep3 = PhoneCallsdf_prep3.drop(['Activity Life Cycle Status_Completed'], axis = 1)
PhoneCallsdf_prep3.head()

Unnamed: 0_level_0,End Date in Local Time Zone,Last_call_diff_months,#Calls Completed
Account Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1


#### Incident Ticket preparation

In [204]:
IncidentTicketdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 223904 entries, 0 to 223903
Data columns (total 70 columns):
 #   Column                                     Non-Null Count   Dtype         
---  ------                                     --------------   -----         
 0   Index                                      223904 non-null  object        
 1   SLAMet                                     223904 non-null  int64         
 2   YearMonth                                  223007 non-null  float64       
 3   Period                                     223007 non-null  object        
 4   NextDateAux                                223007 non-null  float64       
 5   NextDateAux2                               223007 non-null  object        
 6   AuxTime                                    223904 non-null  object        
 7   TimeFrom                                   223904 non-null  int64         
 8   Next CreatedDatevar                        151018 non-null  datetime64[ns]
 9   Time

In [205]:
IncidentTicketdf.columns

Index(['Index', 'SLAMet', 'YearMonth', 'Period', 'NextDateAux', 'NextDateAux2',
       'AuxTime', 'TimeFrom', 'Next CreatedDatevar', 'TimeTo',
       'Previous CreatedDatevar', 'AuxFix', 'SLA MET?', 'AuxTimeUS',
       'Main Ticket ID', 'Main Ticket', 'MAIN_TICKET_COMPLETION_DATE',
       'Sub Ticket ID', 'Sub Ticket', 'REPORTED_ON', 'SOLVED_VIA_PHONE',
       'STATUS', 'STATUS_DESCRIPTION', 'PROCESSING', 'PROCESSING_DESCRIPTION',
       'SERVICE_TECHNICIAN', 'Completion Date_2', 'Completion SLA Met',
       'PRODUCT_DESCRIPTION', 'PRODUCT_ID', 'Serial ID',
       'PRIORITY_DESCRIPTION', 'EC_ID', 'EC_NAME', 'EC_HOUSENUMBER',
       'EC_STREET', 'EC_CIY', 'EC_STATE', 'EC_POSTALCODE',
       'INCIDENT_CATEGORY_ID', 'Incident Category',
       'MANUFACTURER_SERIAL_NUMBER', 'MAIN_ADDRESS', 'ACCOUNT_ID',
       'ACCOUNT_DESCRIPTION', 'ACCOUNT_POSTAL_CODE', 'SHIP_TO_ID',
       'SHIP_TO_NAME', 'SHIP_TO_POSTAL_CODE', 'ECRESPONSIBLE_SALES_ID',
       'ECRESPONSIBLE_SALES_DESCRIPTION', 'ACCOUNT

Maybe I will do a delta between "Completion Date_2" and "Reported On"

Removed:
'AuxFix' -> 'AuxTime'
'Completion SLA Met' -> 'SLAMet'
'SLAMet' -> 'SLA MET?' 
'Service Technician' not clear and a lot of data

In [206]:
print(IncidentTicketdf['Completion SLA Met'])

0         0.0
1         1.0
2         0.0
3         1.0
4         1.0
         ... 
223899    1.0
223900    1.0
223901    1.0
223902    1.0
223903    1.0
Name: Completion SLA Met, Length: 223904, dtype: float64


In [207]:
IncidentTicketdf1 = IncidentTicketdf[['Completion Date_2', 'Incident Category', 
                    'REPORTED_ON', 'Serial ID', 'Completion SLA Met', 'AuxTime']]
IncidentTicketdf1.rename(columns={'Completion SLA Met': 'SLA MET?'}, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  IncidentTicketdf1.rename(columns={'Completion SLA Met': 'SLA MET?'}, inplace=True)


In [208]:
IncidentTicketdf1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 223904 entries, 0 to 223903
Data columns (total 6 columns):
 #   Column             Non-Null Count   Dtype         
---  ------             --------------   -----         
 0   Completion Date_2  223007 non-null  object        
 1   Incident Category  223904 non-null  object        
 2   REPORTED_ON        223903 non-null  datetime64[ns]
 3   Serial ID          223904 non-null  object        
 4   SLA MET?           223367 non-null  float64       
 5   AuxTime            223904 non-null  object        
dtypes: datetime64[ns](1), float64(1), object(4)
memory usage: 10.2+ MB


In [209]:
a = IncidentTicketdf1.loc[IncidentTicketdf1['Serial ID']==7010054129]
#a= IncidentTicketdf1.loc[IncidentTicketdf1['Serial ID']=='MYBMB20838']
a

Unnamed: 0,Completion Date_2,Incident Category,REPORTED_ON,Serial ID,SLA MET?,AuxTime
37815,"mardi, 25 janvier 2022",1.d Ingredient Other,2022-01-25 16:13:32,7010054129,1.0,Yes


In [210]:
IncidentTicketdf2 = IncidentTicketdf1.copy()
#IncidentTicketdf2['Completion Date_2'] = IncidentTicketdf2['Completion Date_2'].apply(str)
#IncidentTicketdf2['Completion Date_2'] = IncidentTicketdf2['Completion Date_2'].apply(lambda x: dateparser.parse(x))

In [211]:
IncidentTicketdf3 = IncidentTicketdf2.copy()
IncidentTicketdf3['Completion Date_2'] = pd.to_datetime(IncidentTicketdf3['Completion Date_2'], errors = 'coerce')
IncidentTicketdf3['Reported On'] = pd.to_datetime(IncidentTicketdf3['REPORTED_ON'], errors = 'coerce')

IncidentTicketdf3['Completion Date_2'] = IncidentTicketdf3['Completion Date_2'].fillna(dt.datetime(2000,1,1))
IncidentTicketdf3['Reported On'] = IncidentTicketdf3['Reported On'].fillna(dt.datetime(2000,1,1))

In [212]:
IncidentTicketdf3 = IncidentTicketdf3.loc[IncidentTicketdf3['Serial ID']!="#"]

In [213]:
def preprocess_InciTickets(df):
    # Work on a copy
    df = df.copy()

    nomi_vars = ['Incident Category', 'SLA MET?', 'AuxTime']
    
    dummy_columns = nomi_vars
        
    df = pd.get_dummies(df, columns=dummy_columns)

    return df

IncidentTicketdf_prep = preprocess_InciTickets(IncidentTicketdf3)
IncidentTicketdf_prep.head()

Unnamed: 0,Completion Date_2,REPORTED_ON,Serial ID,Reported On,Incident Category_1.a Ingredient Calibration,Incident Category_1.b Ingredient Dispensing,Incident Category_1.c Ingredient Dripping,Incident Category_1.d Ingredient Other,Incident Category_10 Abnormal smell,Incident Category_11 Electrical power,...,Incident Category_7 Wire/Harness,Incident Category_8 Software/Firmware,Incident Category_9 Abnormal noise,Incident Category_Low throughput,Incident Category_Requested by Customer,Incident Category_Scheduled,SLA MET?_0.0,SLA MET?_1.0,AuxTime_No,AuxTime_Yes
0,2000-01-01,2021-10-03 10:33:40,18E0017587,2021-10-03 10:33:40,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,1
1,2000-01-01,2021-10-06 08:10:50,16E0023488,2021-10-06 08:10:50,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,1,0
2,2000-01-01,2021-10-04 16:43:38,Y101709203,2021-10-04 16:43:38,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,1,0
3,2000-01-01,2021-10-06 05:35:47,Y105133933,2021-10-06 05:35:47,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,1
4,2000-01-01,2021-10-07 10:17:57,16E0023588,2021-10-07 10:17:57,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,1


In [214]:
IncidentTicketdf_prep.columns

Index(['Completion Date_2', 'REPORTED_ON', 'Serial ID', 'Reported On',
       'Incident Category_1.a Ingredient Calibration',
       'Incident Category_1.b Ingredient Dispensing',
       'Incident Category_1.c Ingredient Dripping',
       'Incident Category_1.d Ingredient Other',
       'Incident Category_10 Abnormal smell',
       'Incident Category_11 Electrical power',
       'Incident Category_12 Water supply issue',
       'Incident Category_13 Connectivity (modem)',
       'Incident Category_14 Accessory problem(external pump..)',
       'Incident Category_15 Return with parts',
       'Incident Category_16 Operator mishandling(improper fill..)',
       'Incident Category_17 Miscellaneous', 'Incident Category_18 N/A',
       'Incident Category_2.a Hydraulic Calibration',
       'Incident Category_2.b Hydraulic Dispensing',
       'Incident Category_2.c Hydraulic Leaking',
       'Incident Category_2.d Hydraulic Heating',
       'Incident Category_2.e Hydraulic Cooling/Freezing',


In [215]:
IncidentTicketdf_prep = IncidentTicketdf_prep.sort_values('Completion Date_2')

I will not use 'Reported On' because I aggreagate and I do not want to make a delta anymore

In [216]:
IncidentTicketdf_prep2 = (IncidentTicketdf_prep.sort_values('Completion Date_2')
    .groupby(["Serial ID"])
                      .agg({'Completion Date_2' : lambda s: s.values[-1], 
       'Incident Category_1.a Ingredient Calibration' : 'sum',
       'Incident Category_1.b Ingredient Dispensing' : 'sum',
       'Incident Category_1.c Ingredient Dripping' : 'sum',
       'Incident Category_1.d Ingredient Other' : 'sum',
       'Incident Category_10 Abnormal smell' : 'sum',
       'Incident Category_11 Electrical power' : 'sum',
       'Incident Category_12 Water supply issue' : 'sum',
       'Incident Category_13 Connectivity (modem)' : 'sum',
       'Incident Category_14 Accessory problem(external pump..)' : 'sum',
       'Incident Category_15 Return with parts' : 'sum',
       'Incident Category_16 Operator mishandling(improper fill..)' : 'sum',
       'Incident Category_17 Miscellaneous': 'sum',
                            'Incident Category_18 N/A' : 'sum',
       'Incident Category_2.a Hydraulic Calibration' : 'sum',
       'Incident Category_2.b Hydraulic Dispensing' : 'sum',
       'Incident Category_2.c Hydraulic Leaking' : 'sum',
       'Incident Category_2.d Hydraulic Heating': 'sum',
       'Incident Category_2.e Hydraulic Cooling/Freezing': 'sum',
       'Incident Category_2.f Hydraulic Filling': 'sum',
       'Incident Category_2.g Hydraulic Other': 'sum',
       'Incident Category_3.a Door Display/Touchscreen': 'sum',
       'Incident Category_3.b Door Menu buttons': 'sum',
       'Incident Category_3.c Door Detection': 'sum',
       'Incident Category_3.d Door Key/Key switch': 'sum',
       'Incident Category_3.e Door Other': 'sum',
       'Incident Category_4.a Reconst. Area In-cup quality/Recipes': 'sum',
       'Incident Category_4.b Reconstitution Area Mixing system': 'sum',
       'Incident Category_4.c Reconstitution Area Other': 'sum',
       'Incident Category_5.a Disp. Area Manifold/Distribution': 'sum',
       'Incident Category_5.b Dispensing Area Drip Tray': 'sum',
       'Incident Category_5.c Dispensing Area Other': 'sum',
       'Incident Category_6 Electronics (PCBs)': 'sum',
       'Incident Category_7 Wire/Harness': 'sum',
       'Incident Category_8 Software/Firmware': 'sum',
       'Incident Category_9 Abnormal noise': 'sum',
                            'SLA MET?_0.0': 'sum',
                            'SLA MET?_1.0': 'sum',
                            'AuxTime_No': 'sum', 
                            'AuxTime_Yes': 'sum'})
)


In [217]:
IncidentTicketdf_prep2['Last_InTick_diff_months'] = ChurnDate2 - IncidentTicketdf_prep2['Completion Date_2']

IncidentTicketdf_prep2['Last_InTick_diff_months'] = IncidentTicketdf_prep2['Last_InTick_diff_months']/np.timedelta64(1,'M')

In [218]:
IncidentTicketdf_prep2.info()

<class 'pandas.core.frame.DataFrame'>
Index: 70890 entries, 210 to ZA978
Data columns (total 41 columns):
 #   Column                                                      Non-Null Count  Dtype         
---  ------                                                      --------------  -----         
 0   Completion Date_2                                           70890 non-null  datetime64[ns]
 1   Incident Category_1.a Ingredient Calibration                70890 non-null  uint8         
 2   Incident Category_1.b Ingredient Dispensing                 70890 non-null  uint8         
 3   Incident Category_1.c Ingredient Dripping                   70890 non-null  uint8         
 4   Incident Category_1.d Ingredient Other                      70890 non-null  uint8         
 5   Incident Category_10 Abnormal smell                         70890 non-null  uint8         
 6   Incident Category_11 Electrical power                       70890 non-null  uint8         
 7   Incident Category_12 Wate

In [219]:
IncidentTicketdf_prep2 = IncidentTicketdf_prep2.reset_index()
IncidentTicketdf_prep2.head()

Unnamed: 0,Serial ID,Completion Date_2,Incident Category_1.a Ingredient Calibration,Incident Category_1.b Ingredient Dispensing,Incident Category_1.c Ingredient Dripping,Incident Category_1.d Ingredient Other,Incident Category_10 Abnormal smell,Incident Category_11 Electrical power,Incident Category_12 Water supply issue,Incident Category_13 Connectivity (modem),...,Incident Category_5.c Dispensing Area Other,Incident Category_6 Electronics (PCBs),Incident Category_7 Wire/Harness,Incident Category_8 Software/Firmware,Incident Category_9 Abnormal noise,SLA MET?_0.0,SLA MET?_1.0,AuxTime_No,AuxTime_Yes,Last_InTick_diff_months
0,210,2000-01-01,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,1,290.962853
1,1003,2000-01-01,1,0,0,0,0,0,0,0,...,0,0,0,0,0,2,0,0,2,290.962853
2,1012,2000-01-01,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,1,290.962853
3,2100,2000-01-01,0,1,0,0,0,0,0,0,...,0,0,0,0,0,1,1,0,2,290.962853
4,21000,2000-01-01,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,1,290.962853


#### Data with all
Let's see what we can get if we include Telemetry, sales and Tickets

In [220]:

BeverageMachine7_wTickets_df['Manufacturer Number'] = BeverageMachine7_wTickets_df['Manufacturer Number'].astype('str')
Concat_Telemetry['serial'] = Concat_Telemetry['serial'].astype('str')

In [221]:
BeverageMachine7_wTickets_wTelemetry_df = pd.merge(BeverageMachine7_wTickets_df, Concat_Telemetry, how='left', left_on = ['Manufacturer Number'], right_on = ['serial'])

BeverageMachine7_wTickets_wTelemetry_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 217678 entries, 0 to 217677
Data columns (total 63 columns):
 #   Column                                                           Non-Null Count   Dtype         
---  ------                                                           --------------   -----         
 0   Serial ID                                                        217678 non-null  object        
 1   Sales Organisation                                               217678 non-null  object        
 2   Machine Status Groupings                                         217678 non-null  object        
 3   User Status                                                      217678 non-null  object        
 4   TA Contract Installation Date                                    217678 non-null  int32         
 5   Depreciation Start                                               217678 non-null  int32         
 6   Manufacturer Number                                              217

In [222]:
BeverageMachine7_wTickets_wTelemetry_df = pd.merge(BeverageMachine7_wTickets_df, Concat_Telemetry, how='left', left_on = ['Manufacturer Number'], right_on = ['serial'])
BeverageMachine7_wTickets_wTelemetry_df=BeverageMachine7_wTickets_wTelemetry_df.fillna(0)
BeverageMachine7_wTickets_wTelemetry_df["quantity"] = BeverageMachine7_wTickets_wTelemetry_df["quantity"].astype(int)
BeverageMachine7_wTickets_wTelemetry_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 217678 entries, 0 to 217677
Data columns (total 63 columns):
 #   Column                                                           Non-Null Count   Dtype         
---  ------                                                           --------------   -----         
 0   Serial ID                                                        217678 non-null  object        
 1   Sales Organisation                                               217678 non-null  object        
 2   Machine Status Groupings                                         217678 non-null  object        
 3   User Status                                                      217678 non-null  object        
 4   TA Contract Installation Date                                    217678 non-null  int32         
 5   Depreciation Start                                               217678 non-null  int32         
 6   Manufacturer Number                                              217

In [223]:
b = Concat_Sales[Concat_Sales['KeyManufNo_SalesOrg']=='20161919205' + 'Nestlé Russia']
b

Unnamed: 0,Serial,quantity,Sales_one_Month_avg,Sales_three_months_avg,Sales_six_months_avg,KeyManufNo_SalesOrg,(lst_mth-6mth)/6mth,3mth-6mth)/6mth
12805,20161919205,98702.52,9536.4367,8257.8411,12048.058333,20161919205Nestlé Russia,-0.208467,-0.314592


In [224]:
A= Concat_Sales.drop_duplicates(subset= 'KeyManufNo_SalesOrg')
A.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 68407 entries, 0 to 15981
Data columns (total 8 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Serial                  68407 non-null  object 
 1   quantity                68407 non-null  float64
 2   Sales_one_Month_avg     68407 non-null  float64
 3   Sales_three_months_avg  68407 non-null  float64
 4   Sales_six_months_avg    68407 non-null  float64
 5   KeyManufNo_SalesOrg     68407 non-null  object 
 6   (lst_mth-6mth)/6mth     68407 non-null  float64
 7   3mth-6mth)/6mth         68407 non-null  float64
dtypes: float64(6), object(2)
memory usage: 4.7+ MB


In [225]:
Concat_Sales.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 68407 entries, 0 to 15981
Data columns (total 8 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Serial                  68407 non-null  object 
 1   quantity                68407 non-null  float64
 2   Sales_one_Month_avg     68407 non-null  float64
 3   Sales_three_months_avg  68407 non-null  float64
 4   Sales_six_months_avg    68407 non-null  float64
 5   KeyManufNo_SalesOrg     68407 non-null  object 
 6   (lst_mth-6mth)/6mth     68407 non-null  float64
 7   3mth-6mth)/6mth         68407 non-null  float64
dtypes: float64(6), object(2)
memory usage: 4.7+ MB


In [226]:
BeverageMachine7_wTickets_wTelemetry_wSales_df = pd.merge(BeverageMachine7_wTickets_wTelemetry_df, Concat_Sales, how='left', left_on = ['Key_ManufacturerID_SalesOrg'], right_on = ['KeyManufNo_SalesOrg'])
BeverageMachine7_wTickets_wTelemetry_wSales_df = BeverageMachine7_wTickets_wTelemetry_wSales_df.fillna(0)
BeverageMachine7_wTickets_wTelemetry_wSales_df["quantity_y"] = BeverageMachine7_wTickets_wTelemetry_wSales_df["quantity_y"].astype(int)
BeverageMachine7_wTickets_wTelemetry_wSales_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 217678 entries, 0 to 217677
Data columns (total 71 columns):
 #   Column                                                           Non-Null Count   Dtype         
---  ------                                                           --------------   -----         
 0   Serial ID                                                        217678 non-null  object        
 1   Sales Organisation                                               217678 non-null  object        
 2   Machine Status Groupings                                         217678 non-null  object        
 3   User Status                                                      217678 non-null  object        
 4   TA Contract Installation Date                                    217678 non-null  int32         
 5   Depreciation Start                                               217678 non-null  int32         
 6   Manufacturer Number                                              217

In [227]:
BeverageMachine7_wTickets_wTelemetry_wSales_df['EC ID'] = BeverageMachine7_wTickets_wTelemetry_wSales_df['EC ID'].astype('str')
Visitsdf_wVisits['Acc_ID'] = Visitsdf_wVisits['Acc_ID'].astype('str')

In [228]:
BeverageMachine7_wTickets_wTelemetry_wSales_wVisits_df = pd.merge(BeverageMachine7_wTickets_wTelemetry_wSales_df, Visitsdf_wVisits, how='left', left_on = ['EC ID'], right_on = ['Acc_ID'])
BeverageMachine7_wTickets_wTelemetry_wSales_wVisits_df['End Date in Local Time Zone'] = pd.to_datetime(BeverageMachine7_wTickets_wTelemetry_wSales_wVisits_df['End Date in Local Time Zone'])
BeverageMachine7_wTickets_wTelemetry_wSales_wVisits_df['End Date in Local Time Zone'] = BeverageMachine7_wTickets_wTelemetry_wSales_wVisits_df['End Date in Local Time Zone'].fillna(dt.datetime(2000,1,1))
BeverageMachine7_wTickets_wTelemetry_wSales_wVisits_df = BeverageMachine7_wTickets_wTelemetry_wSales_wVisits_df.fillna(0)


In [229]:
BeverageMachine7_wTickets_wTelemetry_wSales_wVisits_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 217678 entries, 0 to 217677
Data columns (total 84 columns):
 #   Column                                                           Non-Null Count   Dtype         
---  ------                                                           --------------   -----         
 0   Serial ID                                                        217678 non-null  object        
 1   Sales Organisation                                               217678 non-null  object        
 2   Machine Status Groupings                                         217678 non-null  object        
 3   User Status                                                      217678 non-null  object        
 4   TA Contract Installation Date                                    217678 non-null  int32         
 5   Depreciation Start                                               217678 non-null  int32         
 6   Manufacturer Number                                              217

In [230]:
BeverageMachine7_wTickets_wTelemetry_wSales_wVisits_df.head()

Unnamed: 0,Serial ID,Sales Organisation,Machine Status Groupings,User Status,TA Contract Installation Date,Depreciation Start,Manufacturer Number,Position,TA Contract Start Date,TA Contract End Date,...,Result_Not assigned,Result_Objective Met,Result_Objective Partially Met,Result_Requires Further Follow-up,Result_Unsuccessful Selling Call,Activity Life Cycle Status_Canceled,Activity Life Cycle Status_In Process,Activity Life Cycle Status_Open,Last_visit_diff_months,#Visits completed
0,19O0017079,Nestle Australia Ltd,Deployed,Installed,0,43739,20192228627_VB_23O0037199,LOAN,0,0,...,4.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.577034,3.0
1,22O0023824,Nestlé India,Deployed,Installed,0,44896,SP3.1-06904,LOAN,0,0,...,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7.425204,2.0
2,22O0023864,Nestlé India,Deployed,Installed,0,44896,22O0023864,LOAN,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,22O0023862,Nestlé India,Deployed,Installed,0,44896,22O0023862,LOAN,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,22O0023856,Nestlé India,Deployed,Installed,0,44896,22O0023856,LOAN,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [231]:
PhoneCallsdf_prep3 = PhoneCallsdf_prep3.reset_index()

In [232]:
PhoneCallsdf_prep3['Account Name'] = PhoneCallsdf_prep3['Account Name'].astype('str')

In [233]:
BeverageMachine7_wTickets_wTelemetry_wSales_wVisits_wCalls_df = pd.merge(BeverageMachine7_wTickets_wTelemetry_wSales_wVisits_df, PhoneCallsdf_prep3, how='left', left_on = ['EC ID'], right_on = ['Account Name'])
BeverageMachine7_wTickets_wTelemetry_wSales_wVisits_wCalls_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 217678 entries, 0 to 217677
Data columns (total 88 columns):
 #   Column                                                           Non-Null Count   Dtype         
---  ------                                                           --------------   -----         
 0   Serial ID                                                        217678 non-null  object        
 1   Sales Organisation                                               217678 non-null  object        
 2   Machine Status Groupings                                         217678 non-null  object        
 3   User Status                                                      217678 non-null  object        
 4   TA Contract Installation Date                                    217678 non-null  int32         
 5   Depreciation Start                                               217678 non-null  int32         
 6   Manufacturer Number                                              217

In [234]:
BeverageMachine7_wTickets_wTelemetry_wSales_wVisits_wCalls_df['End Date in Local Time Zone_y'] = pd.to_datetime(BeverageMachine7_wTickets_wTelemetry_wSales_wVisits_wCalls_df['End Date in Local Time Zone_y'])

BeverageMachine7_wTickets_wTelemetry_wSales_wVisits_wCalls_df['End Date in Local Time Zone_y'] = BeverageMachine7_wTickets_wTelemetry_wSales_wVisits_wCalls_df['End Date in Local Time Zone_y'].fillna(dt.datetime(2000,1,1))
BeverageMachine7_wTickets_wTelemetry_wSales_wVisits_wCalls_df = BeverageMachine7_wTickets_wTelemetry_wSales_wVisits_wCalls_df.fillna(0)

In [235]:
BeverageMachine7_wTickets_wTelemetry_wSales_wVisits_wCalls_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 217678 entries, 0 to 217677
Data columns (total 88 columns):
 #   Column                                                           Non-Null Count   Dtype         
---  ------                                                           --------------   -----         
 0   Serial ID                                                        217678 non-null  object        
 1   Sales Organisation                                               217678 non-null  object        
 2   Machine Status Groupings                                         217678 non-null  object        
 3   User Status                                                      217678 non-null  object        
 4   TA Contract Installation Date                                    217678 non-null  int32         
 5   Depreciation Start                                               217678 non-null  int32         
 6   Manufacturer Number                                              217

In [236]:
IncidentTicketdf_prep2['Serial ID'] = IncidentTicketdf_prep2['Serial ID'].astype('str')

In [237]:
BeverageMachine_all_df = pd.merge(BeverageMachine7_wTickets_wTelemetry_wSales_wVisits_wCalls_df, IncidentTicketdf_prep2, how='left', left_on = ['Serial ID'], right_on = ['Serial ID'])
BeverageMachine_all_df['Completion Date_2'] = pd.to_datetime(BeverageMachine_all_df['Completion Date_2'])
BeverageMachine_all_df['Completion Date_2'] = BeverageMachine_all_df['Completion Date_2'].fillna(dt.datetime(2000,1,1))
BeverageMachine_all_df = BeverageMachine_all_df.fillna(0)


In [238]:
f=BeverageMachine_all_df.loc[BeverageMachine_all_df['Serial ID']=='7010054129']
f.iloc[:20,80:100]              

Unnamed: 0,Activity Life Cycle Status_In Process,Activity Life Cycle Status_Open,Last_visit_diff_months,#Visits completed,Account Name,End Date in Local Time Zone_y,Last_call_diff_months,#Calls Completed,Completion Date_2,Incident Category_1.a Ingredient Calibration,Incident Category_1.b Ingredient Dispensing,Incident Category_1.c Ingredient Dripping,Incident Category_1.d Ingredient Other,Incident Category_10 Abnormal smell,Incident Category_11 Electrical power,Incident Category_12 Water supply issue,Incident Category_13 Connectivity (modem),Incident Category_14 Accessory problem(external pump..),Incident Category_15 Return with parts,Incident Category_16 Operator mishandling(improper fill..)
127336,0.0,0.0,8.575125,2.0,0,2000-01-01,0.0,0.0,2000-01-01,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
163559,0.0,0.0,0.0,0.0,0,2000-01-01,0.0,0.0,2000-01-01,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
205272,0.0,0.0,12.944824,1.0,0,2000-01-01,0.0,0.0,2000-01-01,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [239]:
MktActions_prep3 = MktActions_prep3.reset_index()
MktActions_prep3

Unnamed: 0,Serial ID,Actions_Churn risk reason unknown,Actions_Data corrected,Actions_Downgrade machine installed,Actions_Lack of data discipline,Actions_New contract,Actions_Other,Actions_Out of order,Actions_Phone Call completed,Actions_Removal Plan,...,Actions_Removed,Actions_Reviewed and no action Required,Actions_Reviewed and no actions required,Actions_Seasonal Machine,Actions_Telemetry installed,Actions_Upgrade machine installed,Actions_Visit completed,Actions_Visit/Call planned,Actions_removed,Actions_tagging update
0,24606,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1895151,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,10238090,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,10238091,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
4,10238092,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
925,22O0021800,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
926,22O0021869,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
927,34F6401007,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
928,EM10023,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [240]:
MktActions_prep3['Serial ID'] = MktActions_prep3['Serial ID'].astype('str')

In [241]:

BeverageMachine_all_df2 = pd.merge(BeverageMachine_all_df, MktActions_prep3, how='left', left_on = ['Serial ID'], right_on = ['Serial ID'])
#BeverageMachine_all_df['Completion Date_2'] = pd.to_datetime(BeverageMachine_all_df['Completion Date_2'])
#BeverageMachine_all_df['Completion Date_2'] = BeverageMachine_all_df['Completion Date_2'].fillna(dt.datetime(2000,1,1))
BeverageMachine_all_df2 = BeverageMachine_all_df2.fillna(0)

BeverageMachine_all_df = BeverageMachine_all_df2

UKService_prep2 = UKService_prep2.reset_index()

BeverageMachine_all_df2 = pd.merge(BeverageMachine_all_df, UKService_prep2, how='left', left_on = ['Key_ManufacturerID_SalesOrg'], right_on = ['Key_ManufacturerID_SalesOrg'])
BeverageMachine_all_df2['Month'] = pd.to_datetime(BeverageMachine_all_df2['Month'])
BeverageMachine_all_df2['Month'] = BeverageMachine_all_df2['Month'].fillna(dt.datetime(2000,1,1))
BeverageMachine_all_df2 = BeverageMachine_all_df2.fillna(0)


In [242]:
BeverageMachine_all_df = BeverageMachine_all_df2
a = BeverageMachine_all_df.iloc[:,100:]
a.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 217678 entries, 0 to 217677
Data columns (total 49 columns):
 #   Column                                                      Non-Null Count   Dtype  
---  ------                                                      --------------   -----  
 0   Incident Category_17 Miscellaneous                          217678 non-null  float64
 1   Incident Category_18 N/A                                    217678 non-null  float64
 2   Incident Category_2.a Hydraulic Calibration                 217678 non-null  float64
 3   Incident Category_2.b Hydraulic Dispensing                  217678 non-null  float64
 4   Incident Category_2.c Hydraulic Leaking                     217678 non-null  float64
 5   Incident Category_2.d Hydraulic Heating                     217678 non-null  float64
 6   Incident Category_2.e Hydraulic Cooling/Freezing            217678 non-null  float64
 7   Incident Category_2.f Hydraulic Filling                     217678 non-nul

###TODO Remove when market have enough data

BeverageMachine_all_df2 = BeverageMachine_all_df.copy()

# Sales Organisation with more than one month of data
SO = ['Nestle Sweden',  'Nestlé Czech', 'Nestlé Denmark', 'Nestlé Finland', 'Nestlé Norway', 'Nestlé Slovak Republic']

#BeverageMachine_all_df3 =  pd.DataFrame([])

for i in SO:
    BeverageMachine_all_df2 = BeverageMachine_all_df2.loc[BeverageMachine_all_df2['Sales Organisation'] != i]
BeverageMachine_all_df2.head()
BeverageMachine_all_df = BeverageMachine_all_df2

# Specify the filename
filename = 'TelemetryColumnsList.p'

# Combine the file path and filename
file_path_with_filename = os.path.join(file_path_output, filename)

# Save the list into a pickle file
with open(file_path_with_filename, 'wb') as file:
    pickle.dump(TelemetryColumnsList, file)

#### Data summary

I now have four datasets :

"BeverageMachine7_df" 

    which is all the data of the beverage machines without ticket data

    This data will be our main data and it will be used to Train and test our models because we have data for all the machines

"BeverageMachine7_wTickets_df" 

    which is with the Ticket data and when there is no tickets for a machine we fill with 0
    
    As we only have around 2000 machines having tickets we will use it on the model that performed better with main data to see if it can bring better results with Telemetry data

"BeverageMachine7_wTicketsOnly_df" 

    which is only the data of the machines having Tickets
    
    Only useful for EDA

"BeverageMachine7_wTickets_wTelemetry_df"

    We will use it on the model that performed better with main data to see if it can bring better results than the Main data or the Main data with tickets. If it does not improve significantly the results we will not use it  because it takes a lot of time to get Telemetry data.
    Later, more machines will have Telemetry and a data lake will be created and it will br easier to get the data.

### Save the data<a class="anchor" id="save"></a>

I choose to save the data into a pickle file because it is a good way to transfer a pandas dataframe

##### BeverageMachine7_df

In [243]:
BeverageMachine_all_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 217678 entries, 0 to 217677
Columns: 149 entries, Serial ID to Actions_tagging update
dtypes: bool(1), datetime64[ns](4), float64(99), int32(6), int64(1), object(38)
memory usage: 242.7+ MB


In [244]:
BeverageMachine_all_df.iloc[0:10, 68:90]

Unnamed: 0,KeyManufNo_SalesOrg,(lst_mth-6mth)/6mth_y,3mth-6mth)/6mth_y,Acc_ID,End Date in Local Time Zone_x,Result_Incomplete Selling Call,Result_Not assigned,Result_Objective Met,Result_Objective Partially Met,Result_Requires Further Follow-up,...,Activity Life Cycle Status_In Process,Activity Life Cycle Status_Open,Last_visit_diff_months,#Visits completed,Account Name,End Date in Local Time Zone_y,Last_call_diff_months,#Calls Completed,Completion Date_2,Incident Category_1.a Ingredient Calibration
0,0,0.0,0.0,2285747,2024-02-12,0.0,4.0,0.0,0.0,0.0,...,0.0,1.0,1.577034,3.0,0,2000-01-01,0.0,0.0,2000-01-01,0.0
1,0,0.0,0.0,6900365,2023-08-18,0.0,2.0,0.0,0.0,0.0,...,0.0,0.0,7.425204,2.0,0,2000-01-01,0.0,0.0,2000-01-01,0.0
2,0,0.0,0.0,0,2000-01-01,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0,2000-01-01,0.0,0.0,2000-01-01,0.0
3,0,0.0,0.0,0,2000-01-01,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0,2000-01-01,0.0,0.0,2000-01-01,0.0
4,0,0.0,0.0,0,2000-01-01,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0,2000-01-01,0.0,0.0,2000-01-01,0.0
5,0,0.0,0.0,3657148,2023-05-02,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,10.973531,1.0,0,2000-01-01,0.0,0.0,2000-01-01,0.0
6,0,0.0,0.0,0,2000-01-01,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0,2000-01-01,0.0,0.0,2000-01-01,0.0
7,0,0.0,0.0,6900259,2023-06-20,0.0,2.0,0.0,0.0,0.0,...,0.0,0.0,9.363642,2.0,0,2000-01-01,0.0,0.0,2000-01-01,0.0
8,0,0.0,0.0,5499777,2023-07-27,0.0,11.0,0.0,0.0,0.0,...,0.0,0.0,8.148011,11.0,0,2000-01-01,0.0,0.0,2000-01-01,0.0
9,0,0.0,0.0,0,2000-01-01,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0,2000-01-01,0.0,0.0,2000-01-01,0.0


In [245]:
# Specify the filename
filename = 'BM_noTickets.p'

# Combine the file path and filename
file_path_with_filename = os.path.join(file_path_output, filename)

# Save the DataFrame into a pickle file
with open(file_path_with_filename, 'wb') as file:
    pickle.dump(BeverageMachine7_df, file)

In [246]:
# Specify the filename
filename = 'BM_noTickets.p'

# Combine the file path and filename
file_path_with_filename = os.path.join(file_path_output, filename)

# Load the pickle file
with open(file_path_with_filename, 'rb') as file:
    BM_noTickets = pickle.load(file)

Quick test to see if I am able to reopen the data in another Notebook

In [247]:
a=BeverageMachine_all_df.loc[BeverageMachine_all_df['EC ID']=='1056184']
a.iloc[0:10,:86]

Unnamed: 0,Serial ID,Sales Organisation,Machine Status Groupings,User Status,TA Contract Installation Date,Depreciation Start,Manufacturer Number,Position,TA Contract Start Date,TA Contract End Date,...,Result_Objective Partially Met,Result_Requires Further Follow-up,Result_Unsuccessful Selling Call,Activity Life Cycle Status_Canceled,Activity Life Cycle Status_In Process,Activity Life Cycle Status_Open,Last_visit_diff_months,#Visits completed,Account Name,End Date in Local Time Zone_y
72058,21E0003520,Nestle UK,Deployed,In Repair,44515,44531,20212718834,RENT,44515,2958465,...,0.0,0.0,0.0,0.0,0.0,0.0,9.265077,6.0,0,2000-01-01
72072,21E0003516,Nestle UK,Deployed,Installed,44515,44531,20212718830,RENT,44515,2958465,...,0.0,0.0,0.0,0.0,0.0,0.0,9.265077,6.0,0,2000-01-01
72079,21E0003522,Nestle UK,Deployed,Installed,44515,44531,20212718836,RENT,44515,2958465,...,0.0,0.0,0.0,0.0,0.0,0.0,9.265077,6.0,0,2000-01-01
72080,21E0003521,Nestle UK,Deployed,Installed,44515,44531,20212718835,RENT,44515,2958465,...,0.0,0.0,0.0,0.0,0.0,0.0,9.265077,6.0,0,2000-01-01
72081,21E0003519,Nestle UK,Deployed,Installed,44518,44531,20212718833,RENT,44518,2958465,...,0.0,0.0,0.0,0.0,0.0,0.0,9.265077,6.0,0,2000-01-01
72083,21E0003517,Nestle UK,Deployed,Installed,44518,44531,20212718831,RENT,44518,2958465,...,0.0,0.0,0.0,0.0,0.0,0.0,9.265077,6.0,0,2000-01-01
72291,20E0006016,Nestle UK,Deployed,Installed,44518,44531,20203123949,RENT,44518,2958465,...,0.0,0.0,0.0,0.0,0.0,0.0,9.265077,6.0,0,2000-01-01
72292,20E0006003,Nestle UK,Deployed,Installed,44515,44531,20203123936,RENT,44515,2958465,...,0.0,0.0,0.0,0.0,0.0,0.0,9.265077,6.0,0,2000-01-01
72293,20E0005995,Nestle UK,Deployed,Installed,44515,44531,20203123928,RENT,44515,2958465,...,0.0,0.0,0.0,0.0,0.0,0.0,9.265077,6.0,0,2000-01-01
72296,20E0006031,Nestle UK,Deployed,In Repair,44518,44531,20203123964,RENT,44518,2958465,...,0.0,0.0,0.0,0.0,0.0,0.0,9.265077,6.0,0,2000-01-01


##### BeverageMachine7_wTickets_df

In [248]:
# Specify the filename
filename = 'BM_wTickets.p'

# Combine the file path and filename
file_path_with_filename = os.path.join(file_path_output, filename)

# Save the DataFrame into a pickle file
with open(file_path_with_filename, 'wb') as file:
    pickle.dump(BeverageMachine7_wTickets_df, file)

##### BeverageMachine7_wTicketsOnly_df

In [249]:
# Specify the filename
filename = 'BM_wTicketsOnly.p'

# Combine the file path and filename
file_path_with_filename = os.path.join(file_path_output, filename)

# Save the DataFrame into a pickle file
with open(file_path_with_filename, 'wb') as file:
    pickle.dump(BeverageMachine7_wTicketsOnly_df, file)

##### BeverageMachine7_wTickets_wTelemetry_df

In [250]:
# Specify the filename
filename = 'BeverageMachine7_wTickets_wTelemetry_df.p'

# Combine the file path and filename
file_path_with_filename = os.path.join(file_path_output, filename)

# Save the DataFrame into a pickle file
with open(file_path_with_filename, 'wb') as file:
    pickle.dump(BeverageMachine7_wTickets_wTelemetry_df, file)

##### Other dataframe needed for the second preparation step later

In [251]:
# Specify the filename
filename = 'IncidentTicketdf.p'

# Combine the file path and filename
file_path_with_filename = os.path.join(file_path_output, filename)

# Save the list into a pickle file
with open(file_path_with_filename, 'wb') as file:
    pickle.dump(IncidentTicketdf_prep2, file)

In [252]:
# Specify the filename
filename = 'TelemetryAggregated_df.p'

# Combine the file path and filename
file_path_with_filename = os.path.join(file_path_output, filename)

# Save the list into a pickle file
with open(file_path_with_filename, 'wb') as file:
    pickle.dump(Telemetry_aggSales, file)

#### All data with placements, telemetry, visits, phone calls, Incidents tickets

In [253]:
# Specify the filename
filename = 'BeverageMachine_all_df2.p'

# Combine the file path and filename
file_path_with_filename = os.path.join(file_path_output, filename)

# Save the DataFrame into a pickle file
with open(file_path_with_filename, 'wb') as file:
    pickle.dump(BeverageMachine_all_df, file)

In [254]:
BeverageMachine_all_df.to_csv(r'C:\Users\msalomo\predictions-BevData.csv', index = False, header=True)