* ## [1) The problem](#TheProblem)

    * #### Goal
    
* ## [2) The Data](#TheData)
    * ### [(a) Clear overview of your data](#DataOverview)
    
        ##### Beverage Machine data

        ##### Beverage Mapping data

        ##### Beverage Classification data
    
        ##### Placement Tickets data

        ##### Telemetry data

    * ### [(b) Plan to manage and process the data](#ManageData)
    
        ##### Beverage Machine data features and the Beverage Classification data features

        ##### Placement Tickets data features
        
        ##### Telemetry data features

        ##### Missing data
        
        ##### Preparation of the data in order to execute some EDA

* ## [3) Preparation of the data](#prep)         
    * ### [(a) Details of preparation](#det)
    
        #### Beverage Machine data preparation

        #### Placement Tickets data preparation

        #### Telemetry data preparation

        #### Data summary
        
    * ### [(b) Save the data](#save)


## 1) The problem <a class="anchor" id="TheProblem"></a>

The main business is a full service for beverage machine including :

    beverage machines placed at a customer’s place (rented or loaned), 
    
    the beverage ingredients (coffee beans, soluble coffee, juice, etc.) delivered to the customers 
    
    and the management of any issue and repair.

A little bit like the printers in companies where a printing machine is placed and the ink and the issues are also managed by the same company.

We have high churn rate of the beverage machine rented/loaned in our business and the goal is to reduce the churn rate by predicting which customer is more likely to churn and try to retain these customers.

The goal is to use Machine Learning in order to predict which machine is at risk of churn by calculating a churn likelihood.

The 'churned' machines are the machines that are definitively removed from their installation point thus resulting in a lower number of machines deployed dispensing beverage cups.

A churned machine generates a one-time cost for removal and replacement and a variable cost for depreciation and storage whilst a new location is found.

The Installation Point is referring to a customer's point where the machine is installed. A customer can have one or several Installation Point. A machine can be replaced by a new Machine on the same Installation Point. The Idea is to look when we lose an Installation Point, meaning that a machine distributing beverage cups has churned.

A Machine can be replaced on an Installation Point and it means we have kept the customer, so that is why we focus on the Installation Point rather than the Machine's Serial ID.

Two proposals could be used:

    Proposal 1 : We keep all the Installation Point data available and we do not aggregate the monthly data of the machines
    
    Proposal 2 : We aggregate the data of the same Installation Point over several month.
    
Example Proposal 1:

    InstPoint     Month of snapshot     ID       Churn      Age in Month      
    Inst.   1     Jan                   1        No         20
    Inst.   2     Jan                   2        No         48
    Inst.   3     Jan                   3        No         69
    Inst.   4     Jan                   4        Yes        45
    
    Inst.   1     Feb                   5        No         21
    Inst.   2     Feb                   6        No         49
    Inst.   3     Feb                   7        Yes        70
    Inst.   5     Feb                   8        No         25
    
    Inst.   1     Mar                   9        No         22
    Inst.   2     Mar                   10       No         50
    Inst.   5     Mar                   11       No         26
    Inst.   6     Mar                   12       No         30
    Inst.   7     Mar                   13       No         42
    Inst.   8     Mar                   14       No         7
    
    
Example Proposal 2:

    Inst.   #     Latest month snap     ID       Churn      Age in Month       data available since (month)     
    Inst.   1     Mar                   1        No         22                 3
    Inst.   2     Mar                   2        No         50                 3
    Inst.   3     Feb                   3        Yes        70                 1
    Inst.   4     Jan                   4        Yes        45                 2
    Inst.   5     Mar                   5        No         26                 2
    Inst.   6     Mar                   6        No         30                 1
    Inst.   7     Mar                   7        No         42                 1
    Inst.   8     Mar                   8        No         7                  1

I am currently missing all the Installation Point before January who have churned, therefore, the data is only having the current Park of Installation Point, only having the survivors, so I need to be careful of the Survivorship bias.

In order to make a Time Series problem it would be better to have more data.

I can have data from one Sales Organisation available in January and for another Sales Organisation in March.

The idea behind the first proposal was to predict a monthly churn rate, the monthly churn rate is the number of churn over the total. However with 10 month of data it is not the best solution.

With the second solution I will predict based on features if the machine has churned or not. The advantage of the second solution is that we can work without the time dimension and focus only on the features to make a prediction if the machine has churned or not.

### a) Goal <a class="anchor" id="Goal"></a>

By giving the customer's installation point with the highest churn likelihood to the managers, they can take action in order to retain more customer's installation point.

This will help to retain more customer's installation points and increase the company's deployed beverage machine park.

Also, less churn implies higher efficiency per machine (less time in the deposit) and lower cost for installation removal.

### TO DO LIST
Add Sales Org ID to vendon data and to Incident tickets and with a mapping create a key Serial-SalesOrg to link to the main data
Add acquisition cost and book value from ERP?

## 2) The data <a class="anchor" id="TheData"></a>

### (a) Clear overview of your data  <a class="anchor" id="DataOverview"></a>

pip install matplotlib

In [1]:
import pandas as pd
import os
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt

import datetime as dt
from datetime import datetime

import pickle
#Install brokenaxes
#!pip install brokenaxes

from config import (
    BeverageMachine22_df,
    BeverageMachine23_df,
    BeverageMachine24_df,
    BevMap_df,
    BeverageClassification_df,
    Placement_df,
    np_churn_consumption2,
    np_churn_consumption,
    Visitsdf,
    PhoneCallsdf,
    IncidentTicketdf,
    PakistanSales,
    MalaysiaSales,
    RussiaSalesData,
    SouthAfricaSales,
    SingaporeSales,
    MktActions
)

# Specify the file path
file_path_output = r'C:\Users\msalomo\OneDrive - NESTLE\Certificate Machine Learning and Data\Churn Project\Notebook output'

# Date when the data was extracted
ChurnDate2=datetime(2023,7,31)

# The range from when I want to have the details about Telemetry data
TelemetryDateRangeStart = '2020-01-01'

In [2]:
# Date when the data was extracted
import calendar

#Algorithm that gives the last day of the past month as Churn Date
CurrentDate = datetime.today()

shift_year = 0
shift_month = 1

if (CurrentDate.month == 1):
    shift_year = 1
    shift_month = -11

new_date = calendar.monthrange(CurrentDate.year - shift_year,CurrentDate.month - shift_month)
ChurnDate2 = datetime(CurrentDate.year - shift_year,CurrentDate.month - shift_month,new_date[1])

print(ChurnDate2)

2024-04-30 00:00:00


In [3]:
####Whe is the last time we had telemetry data
# Should be the same as Churn Date2

#TelemetryDate = ChurnDate2
#TelemetryDate = datetime(2020,8,31)

#PakistanLastUpdate = datetime(2021,5,31)
#PakistanDateRangeStart = datetime(2020,7,31)

#VendonDateRangeStart = TelemetryDate
#VendonLastUpdate = datetime(2021,9,30)

The data has been anonymized

#### Below is a list of my datasets:

#### 1.	Beverage Machine data
    - The Beverage machine data is maintained by the Service manager of each Sales Organisation (usually a Sales Organisation corresponds to a country) and I can create a report to extract the data in excel from a database maintained by an external provider.
    - The database only keeps the latest state of the machine, therefore, I take a monthly snapshot of the data to capture the changes. 
    - This data provides details about the Beverage Machines park situation.
    - More and more Sales Organisations are going to be managed by this system, so the number of machines managed is increasing.

#### 2.	Beverage Mapping data
    - Beverage Mapping data is maintained in an Excel file by a colleague, I ask him to upload this mapping whenever I find new machines in the consolidated Beverage Machine data.
    - The goal of the file is to link the Beverage machine data to the Beverage Classification data.

#### 3.	Beverage Classification data
    - Beverage Classification data is maintained in a SharePoint file by a colleague.
    - This file is to get more technical details and features of the Beverage Machines.
    
#### 4.	Placement Tickets data
    - The Placement Tickets data is maintained by the Service manager of each Sales org and I can create a report to extract the data in excel from a database maintained by an external provider.
    - This data provides details of the placements and some incidents tickets of the Beverage Machines.
    - Sometimes the tickets are not done by the Service manager and some market does not fill this data inside the database, so only a minority of machines have this data.

#### 5.	Telemetry data
    - A new project has been launched not very long ago and some machine are equipped with telemetry data.
    - This data is stored by the telemetry provider and I asked an external colleague managing the relationship with the telemetry provider to share with me the data he could get from his requests.
    - Very few machines are equipped with telemetry data.
    - The number of machines connected with Telemetry is going to increase in the future.
    - This is not the definitive data, I have asked my colleague, but he could not provide me the final data this month, a data lake is being built in order to access the data more easily in the future


#### 6.	Visits data

#### 7. Phone Calls data

#### 8. Repair tickets data

### Beverage Machine data
Below you can find an extract of the Beverage Machine data which contains the details of the Beverage Machines

No need to use 2021 data so I turned it to Markdown

BeverageMachine_df = pd.read_csv(r"C:\Users\msalomo\OneDrive - NESTLE\Certificate Machine Learning and Data\Churn Project\Data\C4CTAUpload.csv")
BeverageMachine_df.head()

#### Additionnal beverage data

In [4]:
###From 2022 onwards
#BeverageMachine22_df = pd.read_csv(r"C:\Users\msalomo\OneDrive - NESTLE\Certificate Machine Learning and Data\Churn Project\Data\C4CTAUpload22.csv")

BeverageMachine22_df = pd.read_csv(BeverageMachine22_df)
BeverageMachine22_df.head()

  BeverageMachine22_df = pd.read_csv(BeverageMachine22_df)


Unnamed: 0.1,Unnamed: 0,Sales Organisation,User Status Last Changed On,Product [Machine Model],Product ID [Machine Model ID],Range Brand,Machine Status Groupings,User Status,Depreciation Start,Serial ID,...,Industry (Account ID),Industry Code 1 (Account ID),Account ABC Classification (EC ID),Industry (EC ID),Industry Code 1 (EC ID),Parent Installation Point ID,Registered Product Category (Registered Product ID),Sales Org ID (Installation Point),SAP Material Line Code [Machine Model ID],Calendar Date
0,0,Kuwait General Operational Manager,43985,NESCAFE MILANO MTS60E H4E1R2W HW Tki BM,90068903,MILANO,Deployed,Installed,0,184658239,...,0801 Nestle Companies,080107 Nestle Middle East,08 Export,0801 Nestle Companies,080107 Nestle Middle East,980519,Trade Asset w/ Fixed Asset,KW10,90068903,2022-01-31
1,1,Kuwait General Operational Manager,43985,NESCAFE MILANO MTS60E H4E1R2W HW Tki BM,90068903,MILANO,Idle,To be Assigned,0,184658259,...,0102 Hypermarket,010299 Not classified,Not assigned,Not assigned,Not assigned,#,Trade Asset w/ Fixed Asset,KW10,90068903,2022-01-31
2,2,Kuwait General Operational Manager,43985,NESCAFE ALEGRIA FTP30 v1.0 BM,100023190,ALEGRIA,Deployed,Installed,0,195061606,...,0605 Business/Industry,060501 Office Leasing Ctr,06 Out of Home,0605 Business/Industry,060501 Office Leasing Ctr,1015364,Trade Asset w/ Fixed Asset,KW10,90073039,2022-01-31
3,3,Kuwait General Operational Manager,43985,NESCAFE ALEGRIA FTP30 v1.0 BM,100023190,ALEGRIA,Deployed,Installed,44013,195061605,...,0605 Business/Industry,060501 Office Leasing Ctr,06 Out of Home,0605 Business/Industry,Not assigned,666056,Trade Asset w/ Fixed Asset,KW10,90073039,2022-01-31
4,4,Kuwait General Operational Manager,43985,EZ Care Mini-Duo BM,100023377,OTHERS-R/L/N,Deployed,Installed,39295,T070572,...,0605 Business/Industry,060501 Office Leasing Ctr,06 Out of Home,0601 Full Service Rest's,Not assigned,667092,Trade Asset w/ Fixed Asset,KW10,90045690,2022-01-31


In [5]:
BeverageMachine22_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3425600 entries, 0 to 3425599
Data columns (total 37 columns):
 #   Column                                               Dtype 
---  ------                                               ----- 
 0   Unnamed: 0                                           int64 
 1   Sales Organisation                                   object
 2   User Status Last Changed On                          object
 3   Product [Machine Model]                              object
 4   Product ID [Machine Model ID]                        int64 
 5   Range Brand                                          object
 6   Machine Status Groupings                             object
 7   User Status                                          object
 8   Depreciation Start                                   object
 9   Serial ID                                            object
 10  Manufacturer Number                                  object
 11  Equipment Number                     

Not needed anymore
### 2020 data
Bev_add2 = pd.read_excel(r"C:\Users\msalomo\OneDrive - NESTLE\Certificate Machine Learning and Data\Churn Project\Data\C4CTAUpload23.csv")
Bev_add2.head()

In [6]:
##2023 data

#BeverageMachine23_df =  pd.read_csv(r"C:\Users\msalomo\OneDrive - NESTLE\Certificate Machine Learning and Data\Churn Project\Data\C4CTAUpload23.csv")
BeverageMachine23_df =  pd.read_csv(BeverageMachine23_df)
BeverageMachine23_df.head()

  BeverageMachine23_df =  pd.read_csv(BeverageMachine23_df)


Unnamed: 0.1,Unnamed: 0,Sales Organisation,User Status Last Changed On,Product [Machine Model],Product ID [Machine Model ID],Range Brand,Machine Status Groupings,User Status,Depreciation Start,Serial ID,...,Industry (Account ID),Industry Code 1 (Account ID),Account ABC Classification (EC ID),Industry (EC ID),Industry Code 1 (EC ID),Parent Installation Point ID,Registered Product Category (Registered Product ID),Sales Org ID (Installation Point),SAP Material Line Code [Machine Model ID],Calendar Date
0,0,NP Bosnia & Herzegovina,44447,NESCAFE ALEGRIA A630 H3A2W HW BP BM,90045171,ALEGRIA,Idle,To be Assigned,42430,16E0009895,...,1101 Exclusive,110101 Distribution Center,Not assigned,Not assigned,Not assigned,#,Trade Asset w/ Fixed Asset,BA10,90045171,2023-01-31
1,1,NP Bosnia & Herzegovina,44447,NESCAFE ALEGRIA A630 H3A2W HW BP BM,90045171,ALEGRIA,Idle,To be Assigned,42522,16E0014757,...,0618 Distributors OOH,061802 Non Exclusive,06 Out of Home,0203 Petrol Station,020399 Not classified,1046515,Trade Asset w/ Fixed Asset,BA10,90045171,2023-01-31
2,2,NP Bosnia & Herzegovina,44447,NESCAFE ALEGRIA A630 H3A2W HW BP BM,90045171,ALEGRIA,Idle,To be Assigned,42614,16E0021271,...,0618 Distributors OOH,061802 Non Exclusive,Not assigned,Not assigned,Not assigned,#,Trade Asset w/ Fixed Asset,BA10,90045171,2023-01-31
3,3,NP Bosnia & Herzegovina,44447,NESCAFE ALEGRIA A630 H3A2W HW BP BM,90045171,ALEGRIA,Idle,To be Assigned,42705,16E0021245,...,1101 Exclusive,110101 Distribution Center,Not assigned,Not assigned,Not assigned,#,Trade Asset w/ Fixed Asset,BA10,90045171,2023-01-31
4,4,NP Bosnia & Herzegovina,44447,NESCAFE ALEGRIA A630 H3A2W HW BP BM,90045171,ALEGRIA,Idle,To be Assigned,42736,16E0021249,...,1101 Exclusive,110101 Distribution Center,Not assigned,Not assigned,Not assigned,#,Trade Asset w/ Fixed Asset,BA10,90045171,2023-01-31


In [7]:
##2024 data
#BeverageMachine24_df =  pd.read_csv(r"C:\Users\msalomo\OneDrive - NESTLE\Certificate Machine Learning and Data\Churn Project\Data\C4CTAUpload24.csv")
BeverageMachine24_df =  pd.read_csv(BeverageMachine24_df)
BeverageMachine24_df.head()

  BeverageMachine24_df =  pd.read_csv(BeverageMachine24_df)


Unnamed: 0.1,Unnamed: 0,Sales Organisation,User Status Last Changed On,Product [Machine Model],Product ID [Machine Model ID],Range Brand,Machine Status Groupings,User Status,Depreciation Start,Serial ID,...,Industry (Account ID),Industry Code 1 (Account ID),Account ABC Classification (EC ID),Industry (EC ID),Industry Code 1 (EC ID),Parent Installation Point ID,Registered Product Category (Registered Product ID),Sales Org ID (Installation Point),SAP Material Line Code [Machine Model ID],Calendar Date
0,0,Nestlé UAE,43992,NESCAFE MILANO MTS60E H4E1R2W HW Tki BM,90068903,MILANO,Deployed,Installed,43191.0,174544849,...,0614 Convenience OOH,061406 PMO:Petrol Stations,06 Out of Home,0614 Convenience OOH,061406 PMO:Petrol Stations,1498958,Trade Asset w/ Fixed Asset,AE12,90068903,2024-01-31
1,1,Nestlé UAE,43992,NESCAFE MILANO MTS60E H4E1R2W HW Tki BM,90068903,MILANO,Deployed,Installed,43191.0,174544851,...,0605 Business/Industry,060501 Office Leasing Ctr,06 Out of Home,0605 Business/Industry,060501 Office Leasing Ctr,IP-8930,Trade Asset w/ Fixed Asset,AE12,90068903,2024-01-31
2,2,Nestlé UAE,43992,NESCAFE MILANO MTS60E H4E1R2W HW Tki BM,90068903,MILANO,Deployed,Installed,43191.0,174544855,...,0605 Business/Industry,060501 Office Leasing Ctr,06 Out of Home,0605 Business/Industry,060501 Office Leasing Ctr,IP-9059,Trade Asset w/ Fixed Asset,AE12,90068903,2024-01-31
3,3,Nestlé UAE,43992,NESCAFE MILANO MTS60E H4E1R2W HW Tki BM,90068903,MILANO,Deployed,Installed,43282.0,182026936,...,0605 Business/Industry,060501 Office Leasing Ctr,06 Out of Home,0605 Business/Industry,060501 Office Leasing Ctr,IP-9146,Trade Asset w/ Fixed Asset,AE12,90068903,2024-01-31
4,4,Nestlé UAE,43992,NESCAFE MILANO MTS60E H4E1R2W HW Tki BM,90068903,MILANO,Deployed,Installed,43313.0,182026920,...,0605 Business/Industry,060503 Remote Site Company,06 Out of Home,0605 Business/Industry,060503 Remote Site Company,IP-9006,Trade Asset w/ Fixed Asset,AE12,90068903,2024-01-31


In [8]:
BeverageMachine_df = pd.concat([BeverageMachine24_df, BeverageMachine22_df], ignore_index=True) 
BeverageMachine_df = pd.concat([BeverageMachine_df, BeverageMachine23_df], ignore_index=True) 

#BeverageMachine_df = BeverageMachine24_df.append(BeverageMachine22_df)
#BeverageMachine_df = BeverageMachine_df.append(BeverageMachine23_df)
BeverageMachine_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7899389 entries, 0 to 7899388
Data columns (total 37 columns):
 #   Column                                               Dtype 
---  ------                                               ----- 
 0   Unnamed: 0                                           int64 
 1   Sales Organisation                                   object
 2   User Status Last Changed On                          object
 3   Product [Machine Model]                              object
 4   Product ID [Machine Model ID]                        int64 
 5   Range Brand                                          object
 6   Machine Status Groupings                             object
 7   User Status                                          object
 8   Depreciation Start                                   object
 9   Serial ID                                            object
 10  Manufacturer Number                                  object
 11  Equipment Number                     

Manufacturer Serial number can be the same for two different machine in different countries let's create a key Key_ManufacturerID_SalesOrg

Key_ManufacturerID_SalesOrg will be used for merging local sales data from market with the main data

import pandas as pd

# Create a new column 'Key_ManufacturerID_SalesOrg' with initial values from 'Manufacturer Number' and 'Sales Organisation'
BeverageMachine_df['Key_ManufacturerID_SalesOrg'] = BeverageMachine_df['Manufacturer Number'].astype(str) + BeverageMachine_df['Sales Organisation']



# Conditionally update 'Key_ManufacturerID_SalesOrg' column if it is a specific Sales Organisation
specific_market = 'Nestlé Russia'  # Replace with the name of your specific market
# for Russia use Account ID instead of Manufacturer Number
BeverageMachine_df.loc[BeverageMachine_df['Sales Organisation'] == specific_market, 'Key_ManufacturerID_SalesOrg'] = BeverageMachine_df['Account ID'].astype(str) + BeverageMachine_df['Sales Organisation']


In [9]:
# Create a new column 'Key_ManufacturerID_SalesOrg' with initial values from 'Manufacturer Number' and 'Sales Organisation'
BeverageMachine_df['Key_ManufacturerID_SalesOrg'] = BeverageMachine_df['Manufacturer Number'].astype(str) + BeverageMachine_df['Sales Organisation']

#Account ID should be of type "String"
BeverageMachine_df['Account ID'] = BeverageMachine_df['Account ID'].astype(str)

# Conditionally update 'Key_ManufacturerID_SalesOrg' column if it is a specific market
specific_market = 'Nestle South Africa' # Replace with the name of your specific market
# for South Africa use Account ID instead of Manufacturer Number
BeverageMachine_df['Key_ManufacturerID_SalesOrg'] = BeverageMachine_df.apply(lambda row: row['Account ID'] + row['Sales Organisation'] if row['Sales Organisation'] == specific_market else row['Key_ManufacturerID_SalesOrg'], axis=1)


BeverageMachine_df['Key_ManufacturerID_SalesOrg'] = BeverageMachine_df['Manufacturer Number'].astype(str) +  BeverageMachine_df['Sales Organisation'] 

Serial Id should be a string had issue with mix type for same serial ID

In [10]:
BeverageMachine_df['Serial ID'] = BeverageMachine_df['Serial ID'].astype('str')

BeverageMachine_df['Parent Installation Point ID'] = BeverageMachine_df['Parent Installation Point ID'].astype('str')

In [11]:
BeverageMachine_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7899389 entries, 0 to 7899388
Data columns (total 38 columns):
 #   Column                                               Dtype 
---  ------                                               ----- 
 0   Unnamed: 0                                           int64 
 1   Sales Organisation                                   object
 2   User Status Last Changed On                          object
 3   Product [Machine Model]                              object
 4   Product ID [Machine Model ID]                        int64 
 5   Range Brand                                          object
 6   Machine Status Groupings                             object
 7   User Status                                          object
 8   Depreciation Start                                   object
 9   Serial ID                                            object
 10  Manufacturer Number                                  object
 11  Equipment Number                     

In [12]:
BeverageMachine_df = BeverageMachine_df.query("`Product [Machine Model]` != 'Vendon Telemetry Device – vBox BM'")

In [13]:
BeverageMachine_df['Sales Organisation'].unique()

array(['Nestlé UAE', 'NP Bosnia & Herzegovina', 'Néstlé Bahrain',
       'NESTLE PROD SERV - CN19', 'SHL NESTLE PROD SERV',
       'Nestlé Denmark', 'Nestlé Finland', 'Nestle Hong Kong',
       'Indonesia', 'JP Japan Sales',
       'Kuwait General Operational Manager', 'NP North Macedonia',
       'Malaysia', 'Nestle New Zealand', 'Nestlé PH', 'Nestlé Qatar',
       'Nestlé Russia', 'Nestlé Slovak Republic', 'Nestle Turkiye Gida',
       'Nestle South Africa', 'Singapore', 'Nestle Australia Ltd',
       'NP-Bulgaria', 'NESTLE PROD SERV - CN17',
       'NESTLE PROD SERV - CN20', 'Nestlé Czech', 'Nestle UK',
       'NP Croatia, Slovenia', 'Nestlé India', 'Néstlé Jordania',
       'Nestle Kenya Ltd', 'Néstlé Lebanon', 'Nestle Prd Mauritius Ltd',
       'NP-Netherlands', 'Nestlé Norway',
       'Oman - Business Manager UAE & Oman', 'Pakistan',
       'NP Serbia, Kosovo, Montenegro', 'Néstlé Saudi Arabia',
       'Nestle Sweden', 'Thailand', 'Nestlé Taiwan', 'NP-Belgilux',
       'NP-France

Removed some markets from analysis

In [14]:
BeverageMachine_df2 = BeverageMachine_df.copy()
BeverageMachine_df2=BeverageMachine_df2.loc[BeverageMachine_df2['Sales Organisation']!='NESTLE PROD SERV - CN17']
BeverageMachine_df2=BeverageMachine_df2.loc[BeverageMachine_df2['Sales Organisation']!='NESTLE PROD SERV - CN19']
BeverageMachine_df2=BeverageMachine_df2.loc[BeverageMachine_df2['Sales Organisation']!='NESTLE PROD SERV - CN20']
#BeverageMachine_df2=BeverageMachine_df2.loc[BeverageMachine_df2['Sales Organisation']!='SHL NESTLE PROD SERV']
#BeverageMachine_df2=BeverageMachine_df2.loc[BeverageMachine_df2['Sales Organisation']!='Nestlé Taiwan']
#BeverageMachine_df2=BeverageMachine_df2.loc[BeverageMachine_df2['Sales Organisation']!='NP-Netherlands']


#BeverageMachine_df2 = BeverageMachine_df2.drop(BeverageMachine_df2.loc[BeverageMachine_df2['Sales Organisation']=='Nestlé India'].index, inplace=True)
#BeverageMachine_df2 = BeverageMachine_df2.drop(BeverageMachine_df2.loc[BeverageMachine_df2['Sales Organisation']=='NESTLE PROD SERV - CN17'].index, inplace=True)
BeverageMachine_df2.head()

Unnamed: 0.1,Unnamed: 0,Sales Organisation,User Status Last Changed On,Product [Machine Model],Product ID [Machine Model ID],Range Brand,Machine Status Groupings,User Status,Depreciation Start,Serial ID,...,Industry Code 1 (Account ID),Account ABC Classification (EC ID),Industry (EC ID),Industry Code 1 (EC ID),Parent Installation Point ID,Registered Product Category (Registered Product ID),Sales Org ID (Installation Point),SAP Material Line Code [Machine Model ID],Calendar Date,Key_ManufacturerID_SalesOrg
0,0,Nestlé UAE,43992,NESCAFE MILANO MTS60E H4E1R2W HW Tki BM,90068903,MILANO,Deployed,Installed,43191.0,174544849,...,061406 PMO:Petrol Stations,06 Out of Home,0614 Convenience OOH,061406 PMO:Petrol Stations,1498958,Trade Asset w/ Fixed Asset,AE12,90068903,2024-01-31,20174544849Nestlé UAE
1,1,Nestlé UAE,43992,NESCAFE MILANO MTS60E H4E1R2W HW Tki BM,90068903,MILANO,Deployed,Installed,43191.0,174544851,...,060501 Office Leasing Ctr,06 Out of Home,0605 Business/Industry,060501 Office Leasing Ctr,IP-8930,Trade Asset w/ Fixed Asset,AE12,90068903,2024-01-31,20174544851Nestlé UAE
2,2,Nestlé UAE,43992,NESCAFE MILANO MTS60E H4E1R2W HW Tki BM,90068903,MILANO,Deployed,Installed,43191.0,174544855,...,060501 Office Leasing Ctr,06 Out of Home,0605 Business/Industry,060501 Office Leasing Ctr,IP-9059,Trade Asset w/ Fixed Asset,AE12,90068903,2024-01-31,20174544855Nestlé UAE
3,3,Nestlé UAE,43992,NESCAFE MILANO MTS60E H4E1R2W HW Tki BM,90068903,MILANO,Deployed,Installed,43282.0,182026936,...,060501 Office Leasing Ctr,06 Out of Home,0605 Business/Industry,060501 Office Leasing Ctr,IP-9146,Trade Asset w/ Fixed Asset,AE12,90068903,2024-01-31,20182026936Nestlé UAE
4,4,Nestlé UAE,43992,NESCAFE MILANO MTS60E H4E1R2W HW Tki BM,90068903,MILANO,Deployed,Installed,43313.0,182026920,...,060503 Remote Site Company,06 Out of Home,0605 Business/Industry,060503 Remote Site Company,IP-9006,Trade Asset w/ Fixed Asset,AE12,90068903,2024-01-31,20182026920Nestlé UAE


In [15]:
BeverageMachine_df2['Sales Organisation'].unique()

array(['Nestlé UAE', 'NP Bosnia & Herzegovina', 'Néstlé Bahrain',
       'SHL NESTLE PROD SERV', 'Nestlé Denmark', 'Nestlé Finland',
       'Nestle Hong Kong', 'Indonesia', 'JP Japan Sales',
       'Kuwait General Operational Manager', 'NP North Macedonia',
       'Malaysia', 'Nestle New Zealand', 'Nestlé PH', 'Nestlé Qatar',
       'Nestlé Russia', 'Nestlé Slovak Republic', 'Nestle Turkiye Gida',
       'Nestle South Africa', 'Singapore', 'Nestle Australia Ltd',
       'NP-Bulgaria', 'Nestlé Czech', 'Nestle UK', 'NP Croatia, Slovenia',
       'Nestlé India', 'Néstlé Jordania', 'Nestle Kenya Ltd',
       'Néstlé Lebanon', 'Nestle Prd Mauritius Ltd', 'NP-Netherlands',
       'Nestlé Norway', 'Oman - Business Manager UAE & Oman', 'Pakistan',
       'NP Serbia, Kosovo, Montenegro', 'Néstlé Saudi Arabia',
       'Nestle Sweden', 'Thailand', 'Nestlé Taiwan', 'NP-Belgilux',
       'NP-France', 'NP Croatia, Slovenia-HR11', 'Nestlé Italy IT35 OOH',
       'Nestle Ireland', 'Nestle Romania S.R.

In [16]:
BeverageMachine_df = BeverageMachine_df2

In [17]:
BeverageMachine_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6737852 entries, 0 to 7899388
Data columns (total 38 columns):
 #   Column                                               Dtype 
---  ------                                               ----- 
 0   Unnamed: 0                                           int64 
 1   Sales Organisation                                   object
 2   User Status Last Changed On                          object
 3   Product [Machine Model]                              object
 4   Product ID [Machine Model ID]                        int64 
 5   Range Brand                                          object
 6   Machine Status Groupings                             object
 7   User Status                                          object
 8   Depreciation Start                                   object
 9   Serial ID                                            object
 10  Manufacturer Number                                  object
 11  Equipment Number                     

### Beverage Mapping data

In [18]:
# (A) Load the Beverage Mapping data
#BevMap_df = pd.read_csv(r"C:\Users\msalomo\OneDrive - NESTLE\Certificate Machine Learning and Data\Churn Project\Data\SBU-11 NESTLE PRO. Translation.csv")
BevMap_df = pd.read_csv(BevMap_df)

BevMap_df['ID Model Code']=BevMap_df['ID Model Code'].astype(str)
BevMap_df.head()

Unnamed: 0,Brand Name,Description,ID Model Code,Source,Model,Revised,Modified,Modified By
0,Accolade,Accolade 12oz,ACC-FPD-12z,BMB,N&W Astro Accolade,Yes,06/14/2023 10:07 AM,"Baeza,Jordi,CH-ORBE"
1,Accolade,Accolade 9oz,ACC-FPD- 9z,BMB,N&W Astro Accolade,Yes,06/14/2023 10:07 AM,"Baeza,Jordi,CH-ORBE"
2,ALEGRIA,Chest Freezer NP PK BM,100069870,C4C,Accessories,Yes,06/14/2023 10:07 AM,"Baeza,Jordi,CH-ORBE"
3,ALEGRIA,Chiller SAX 250 NP BM PK,100069872,C4C,Others,Yes,06/14/2023 10:07 AM,"Baeza,Jordi,CH-ORBE"
4,ALEGRIA,Chiller SAX 400 NP BM PK,100069869,C4C,Others,Yes,06/14/2023 10:07 AM,"Baeza,Jordi,CH-ORBE"


In [19]:
BevMap_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2527 entries, 0 to 2526
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Brand Name     2527 non-null   object
 1   Description    2527 non-null   object
 2   ID Model Code  2527 non-null   object
 3   Source         2527 non-null   object
 4   Model          2527 non-null   object
 5   Revised        2527 non-null   object
 6   Modified       2527 non-null   object
 7   Modified By    2527 non-null   object
dtypes: object(8)
memory usage: 158.1+ KB


### Beverage Classification data

In [20]:
# (A) Load the Beverage Classification data
BeverageClassification_df = pd.read_csv(BeverageClassification_df)

BeverageClassification_df.head()

Unnamed: 0,Model,Model Vendor,Model Category,Global Projects,System Brands,Solution Brands,Model Group,Generation,Product,Ingredient Format,...,PSL,TAA & TAR Ownership,TAA & TAR,SC & Planning,Production,IM,Sustainability LCA Ownership,Sustainability LCA,Vendon Compatible,Technical Capacity
0,4Swiss Roma A10 PRO,Others,Mainstream B2C,%23-N/A,Branded others,Branded Others,Other,Legacy,Pure R&G,Pure R&G,...,Validated,Market,Not Done,Market,Discontinued,Market,Market,Not Done,,20
1,Accessories,Generic,Other,%23-N/A,Branded others,Non-Branded,Other,Legacy,%23-Unknown,Other,...,Validated,Market,Not Done,Market,Discontinued,Market,Market,Not Done,No,0
2,Alegria V-Café 140,Crem,Hot Liquid,Alegria,Nescafé Alegria,Nescafé Alegria,NA Legacy,Gen. 1,Hot Liquid,Liquid,...,Mandatory,Region,Not Done,Region,Active,Region,Region,Not Done,,0
3,Alegria V-Café 2120,Crem,Hot Liquid,Alegria,Nescafé Alegria,Nescafé Alegria,NA Legacy,Gen. 1,Hot Liquid,Liquid,...,Mandatory,Region,Not Done,Region,Active,Region,Region,Not Done,,0
4,Alegria V-Café 4500,NP Beverages,Hot Liquid,Alegria,Nescafé Alegria,Nescafé Alegria,NA Legacy,Gen. 1,Hot Liquid,Liquid,...,Validated,Market,Not Done,Market,Discontinued,Market,Market,Not Done,,0


In [21]:
BeverageClassification_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 565 entries, 0 to 564
Data columns (total 28 columns):
 #   Column                        Non-Null Count  Dtype 
---  ------                        --------------  ----- 
 0   Model                         564 non-null    object
 1   Model Vendor                  565 non-null    object
 2   Model Category                565 non-null    object
 3   Global Projects               565 non-null    object
 4   System Brands                 565 non-null    object
 5   Solution Brands               565 non-null    object
 6   Model Group                   565 non-null    object
 7   Generation                    565 non-null    object
 8   Product                       565 non-null    object
 9   Ingredient Format             565 non-null    object
 10  Model Category 2              565 non-null    object
 11  Machine Type                  565 non-null    object
 12  Beverage Temperature          565 non-null    object
 13  Positionning        

### Placement Tickets data

In [22]:
pip install openpyxl

Note: you may need to restart the kernel to use updated packages.



[notice] A new release of pip is available: 23.1.2 -> 24.0
[notice] To update, run: python.exe -m pip install --upgrade pip


In [23]:
# Load the Placement Tickets data
#Placement_df = pd.read_excel(r"C:\Users\msalomo\OneDrive - NESTLE\Certificate Machine Learning and Data\Churn Project\Data\Net Placements.xlsx")
#Placement_df.tail()

In [24]:
#Placement_df.info()

In [25]:
# Read the Excel file into a pandas DataFrame and filter columns
file_path = Placement_df
selected_columns = ['Serial ID', 'Service Category', 'INCIDENT_CATEGORY_DESCRIPTION']
Placement_df = pd.read_excel(file_path, usecols=selected_columns)
# Perform operations on the selected DataFrame as needed
print(Placement_df)

       Service Category INCIDENT_CATEGORY_DESCRIPTION   Serial ID
0               Removal                Low throughput    10043325
1               Removal                Low throughput    10047056
2               Removal                Low throughput    10047050
3               Removal                Low throughput    10050922
4               Removal                Low throughput    10051909
...                 ...                           ...         ...
311779          Removal               End of contract  20O0015708
311780          Removal               End of contract    13000727
311781          Removal               End of contract    19000062
311782          Removal               End of contract    17000066
311783          Removal               End of contract    18000122

[311784 rows x 3 columns]


### Telemetry data

Get data from URL for Telemetry Data Lake

In [26]:
url = "https://queryenginelandingprod.blob.core.windows.net/shared/np/churn/np_churn_historical_consumption_by_product_group.csv?sp=r&st=2023-04-10T06:40:59Z&se=2050-04-10T14:40:59Z&spr=https&sv=2021-12-02&sr=b&sig=d%2Fn5C%2FWWksWDI%2FiiZEqwz5mOaw2jAqkW9DHUOSz6R7Q%3D"
np_churn_consumption2 =pd.read_csv(url)
np_churn_consumption2

Unnamed: 0,date,serial,sap_serial,quantity,salesorg,machine_id,product_group
0,2021-05-31,20114333086,,6,ESAR,415,CHOCOLATE
1,2021-05-31,20172931414,,21,UKI,6523,CAPPUCCINO
2,2021-05-31,20114332941,,95,ESAR,2126,CAPPUCCINO
3,2021-05-31,20101206601,,193,ESAR,4784,CHOCOLATE
4,2021-05-31,20153732083,,283,ESAR,1836,CHOCOLATE
...,...,...,...,...,...,...,...
3290637,2023-03-31,20181013381,,1,Russia,1177616,Unknown
3290638,2023-03-31,20180912310,,1,Russia,1173053,HOT WATER
3290639,2023-03-31,20172526546,,1,Russia,1170238,MOCHA
3290640,2023-03-31,20204732772,,1,Russia,1170336,FLAT WHITE


In [27]:
#url= "https://queryenginelandingstag.blob.core.windows.net/shared/np/churn/np_churn_historical_consumption.csv?sp=r&st=2022-09-02T07:17:17Z&se=2050-09-02T15:17:17Z&spr=https&sv=2021-06-08&sr=b&sig=hiIpKctZ%2BlxXwR9E%2BVReK1TnsQqZrcayCYu%2BZaCynlw%3D"
url = "https://queryenginelandingprod.blob.core.windows.net/shared/np/churn/np_churn_historical_consumption.csv?sp=r&st=2022-11-29T12:22:43Z&se=2050-11-29T20:22:43Z&spr=https&sv=2021-06-08&sr=b&sig=JZE599UA3foRsJ6ZbOHW6M0nWexxLc3JCB49gJ%2B2faU%3D"
np_churn_consumption =pd.read_csv(url)
np_churn_consumption

Unnamed: 0,date,serial,sap_serial,quantity,salesorg,machine_id
0,2021-05-31,20121412504,,347,ESAR,6334
1,2021-05-31,20172931407,,26,UKI,4816
2,2021-05-31,20112619791,,3505,ESAR,4882
3,2021-05-31,20101206590,,1234,ESAR,2077
4,2021-05-31,20104024878,,213,ESAR,1351
...,...,...,...,...,...,...
457064,2024-04-30,20150404880,,1,ESAR,137052
457065,2024-04-30,,,1,Russia,4906462
457066,2024-04-30,10275427,,2,USA,3296657
457067,2024-04-30,3400000071579,1.5,1,MENA,1110


In [28]:
np_churn_consumption = np_churn_consumption.rename(columns={"date": "Month", "salesorg": "SalesOrg"}).reset_index()
np_churn_consumption.head()

Unnamed: 0,index,Month,serial,sap_serial,quantity,SalesOrg,machine_id
0,0,2021-05-31,20121412504,,347,ESAR,6334
1,1,2021-05-31,20172931407,,26,UKI,4816
2,2,2021-05-31,20112619791,,3505,ESAR,4882
3,3,2021-05-31,20101206590,,1234,ESAR,2077
4,4,2021-05-31,20104024878,,213,ESAR,1351


In [29]:
np_churn_consumption=np_churn_consumption.drop(columns=['sap_serial','index', 'machine_id'])
np_churn_consumption.head()

Unnamed: 0,Month,serial,quantity,SalesOrg
0,2021-05-31,20121412504,347,ESAR
1,2021-05-31,20172931407,26,UKI
2,2021-05-31,20112619791,3505,ESAR
3,2021-05-31,20101206590,1234,ESAR
4,2021-05-31,20104024878,213,ESAR


In [30]:
Telemetry_df = np_churn_consumption

# Load the Telemetry data
Telemetry_df = pd.read_excel(r"C:\Users\msalomo\OneDrive - NESTLE\Certificate Machine Learning and Data\Churn Project\Data\Telemetry2021.xlsx")
Telemetry_df.tail()

In [31]:
Telemetry_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 457069 entries, 0 to 457068
Data columns (total 4 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   Month     457069 non-null  object
 1   serial    444701 non-null  object
 2   quantity  457069 non-null  int64 
 3   SalesOrg  457052 non-null  object
dtypes: int64(1), object(3)
memory usage: 13.9+ MB


# Load the Telemetry data
Telemetry_add = pd.read_excel(r"C:\Users\msalomo\OneDrive - NESTLE\Certificate Machine Learning and Data\Churn Project\Data\Telemetry2022.xlsx")
Telemetry_add.tail()

Telemetry_df = Telemetry_df.append(Telemetry_add)
Telemetry_df.info()

Telemetry_df.tail()

I will only keep the telemetry data in the date range, so it is only the telemtry data starting after "TelemetryDateRangeStart"

#Telemetry_df1 = Telemetry_df.loc[Telemetry_df['Month']>=TelemetryDateRangeStart]

I will aggregate the number of Cup Sales for each Machine by the feature called 'serial' which corresponds to the feature 'Manufacturer Number' in the Beverage Machine data

In [32]:
Telemetry_aggSales = Telemetry_df['quantity'].groupby(Telemetry_df['serial'], axis=0).sum()
Telemetry_aggSales_df = Telemetry_aggSales.to_frame().reset_index()
Telemetry_aggSales_df

Unnamed: 0,serial,quantity
0,'20202016602,63782
1,'Y20231619331,8308
2,.,82605
3,0,5
4,00000000000,1705
...,...,...
31468,Х580BGS230203370085,2785
31469,Х580BGS230203370088,11326
31470,Х580BGS230203370097,2943
31471,Х580BGS230203370098,2848


In [33]:
Telemetry_df1 = Telemetry_df.groupby(['Month', 'serial']).sum()
# df.groupby(['col5', 'col2']).size()
#['quantity']
Telemetry_df1 = Telemetry_df1.reset_index()

Telemetry_df1

  Telemetry_df1 = Telemetry_df.groupby(['Month', 'serial']).sum()


Unnamed: 0,Month,serial,quantity
0,2021-05-31,'20202016602,610
1,2021-05-31,000103535,1230
2,2021-05-31,000103550,1434
3,2021-05-31,00089366,75
4,2021-05-31,00094264,276
...,...,...,...
434037,2024-04-30,Х580BGS230203370085,574
434038,2024-04-30,Х580BGS230203370088,2419
434039,2024-04-30,Х580BGS230203370097,751
434040,2024-04-30,Х580BGS230203370098,1200


from dateutil.relativedelta import relativedelta

one_month = TelemetryDate + relativedelta(months=-1)
three_months = TelemetryDate + relativedelta(months=-3)
six_months = TelemetryDate + relativedelta(months=-6)

Telemetry_df_one_month = Telemetry_df1.loc[Telemetry_df1['Month']>one_month]
Telemetry_df_three_months = Telemetry_df1.loc[Telemetry_df1['Month']>three_months]
Telemetry_df_six_months = Telemetry_df1.loc[Telemetry_df1['Month']>six_months]

TODO
why "Telemetry_aggSales_one_month_avg = Telemetry_df_one_month['quantity'].groupby(Telemetry_df_one_month['serial'], axis=0).count()"

not this?
Telemetry_aggSales_one_month_avg = Telemetry_df_one_month['quantity'].groupby(Telemetry_df_one_month['serial'], axis=0).sum()

In [34]:
Telemetry_df_one_month = Telemetry_df1.sort_values('Month').groupby('serial').agg({'quantity' : lambda x: x.tail(1).sum()})

Telemetry_df_three_months = Telemetry_df1.sort_values('Month').groupby('serial').agg({'quantity' : lambda x: x.tail(3).sum()/3})

Telemetry_df_six_months = Telemetry_df1.sort_values('Month').groupby('serial').agg({'quantity' : lambda x: x.tail(6).sum()/6})

Telemetry_df_one_month = Telemetry_df_one_month.rename(columns={"quantity": "one_month_avg"}).reset_index()

Telemetry_df_three_months = Telemetry_df_three_months.rename(columns={"quantity": "three_months_avg"}).reset_index()

Telemetry_df_six_months = Telemetry_df_six_months.rename(columns={"quantity": "six_months_avg"}).reset_index()

Telemetry_aggSales_one_month_avg = Telemetry_df_one_month['quantity'].groupby(Telemetry_df_one_month['serial'], axis=0).count()
Telemetry_aggSales_one_month_avg = Telemetry_aggSales_one_month_avg.to_frame().reset_index()
Telemetry_aggSales_one_month_avg = Telemetry_aggSales_one_month_avg.rename(columns={"quantity": "one_Month_avg"})

Telemetry_aggSales_three_months_avg = Telemetry_df_three_months['quantity'].groupby(Telemetry_df_three_months['serial'], axis=0).count()
Telemetry_aggSales_three_months_avg = Telemetry_aggSales_three_months_avg.to_frame().reset_index()
Telemetry_aggSales_three_months_avg = Telemetry_aggSales_three_months_avg.rename(columns={"quantity": "three_months_avg"})

Telemetry_aggSales_six_months_avg = Telemetry_df_six_months['quantity'].groupby(Telemetry_df_six_months['serial'], axis=0).count()
Telemetry_aggSales_six_months_avg = Telemetry_aggSales_six_months_avg.to_frame().reset_index()
Telemetry_aggSales_six_months_avg = Telemetry_aggSales_six_months_avg.rename(columns={"quantity": "six_months_avg"})

Telemetry_aggSales_three_months_avg

Telemetry_aggSales_three_months_avg['three_months_avg'] = Telemetry_aggSales_three_months_avg['three_months_avg'].apply(lambda x: x/3)

Telemetry_aggSales_six_months_avg['six_months_avg'] = Telemetry_aggSales_six_months_avg['six_months_avg'].apply(lambda x: x/6)

Telemetry_aggSales_three_months_avg 

I used 'left' instead of 'inner' because I want all the machines that had data

In [35]:
Telemetry_aggSales_df

Unnamed: 0,serial,quantity
0,'20202016602,63782
1,'Y20231619331,8308
2,.,82605
3,0,5
4,00000000000,1705
...,...,...
31468,Х580BGS230203370085,2785
31469,Х580BGS230203370088,11326
31470,Х580BGS230203370097,2943
31471,Х580BGS230203370098,2848


In [36]:
Telemetry_df_one_month

Unnamed: 0,serial,one_month_avg
0,'20202016602,1401
1,'Y20231619331,1424
2,.,10016
3,0,5
4,00000000000,718
...,...,...
31468,Х580BGS230203370085,574
31469,Х580BGS230203370088,2419
31470,Х580BGS230203370097,751
31471,Х580BGS230203370098,1200


In [37]:
Telemetry_aggSales_df1 = pd.merge(Telemetry_aggSales_df, Telemetry_df_one_month, how='left', left_on = ['serial'], right_on = ['serial'])
Telemetry_aggSales_df1.head()

Unnamed: 0,serial,quantity,one_month_avg
0,'20202016602,63782,1401
1,'Y20231619331,8308,1424
2,.,82605,10016
3,0,5,5
4,00000000000,1705,718


In [38]:
Telemetry_aggSales_df2 = pd.merge(Telemetry_aggSales_df1, Telemetry_df_three_months, how='left', left_on = ['serial'], right_on = ['serial'])
Telemetry_aggSales_df3 = pd.merge(Telemetry_aggSales_df2, Telemetry_df_six_months, how='left', left_on = ['serial'], right_on = ['serial'])
Telemetry_aggSales_df3 = Telemetry_aggSales_df3.fillna(0)
Telemetry_aggSales_df3

Unnamed: 0,serial,quantity,one_month_avg,three_months_avg,six_months_avg
0,'20202016602,63782,1401,2447.000000,2536.166667
1,'Y20231619331,8308,1424,1318.000000,1384.666667
2,.,82605,10016,8500.333333,6966.666667
3,0,5,5,1.666667,0.833333
4,00000000000,1705,718,403.000000,284.166667
...,...,...,...,...,...
31468,Х580BGS230203370085,2785,574,477.333333,464.166667
31469,Х580BGS230203370088,11326,2419,2418.666667,1887.666667
31470,Х580BGS230203370097,2943,751,505.666667,490.500000
31471,Х580BGS230203370098,2848,1200,743.666667,474.666667


In [39]:
Telemetry_aggSales_df3['serial'] = Telemetry_aggSales_df3['serial'].astype(str)

In [40]:
Telemetry_aggSales_df3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 31473 entries, 0 to 31472
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   serial            31473 non-null  object 
 1   quantity          31473 non-null  int64  
 2   one_month_avg     31473 non-null  int64  
 3   three_months_avg  31473 non-null  float64
 4   six_months_avg    31473 non-null  float64
dtypes: float64(2), int64(2), object(1)
memory usage: 1.4+ MB


### 6. Visits data

In [41]:
# Load the Visits data
Visitsdf = pd.read_excel(Visitsdf)
Visitsdf.head()

Unnamed: 0,Month,Year,Period,Counter_visits_completed,Cummulative,Cummulative_Final,Cummulative Graph,Occurence Balancing,Activity Owner,Activity Owner ID,...,Sales Unit (Hierarchy) ID,Activity Life Cycle Status id,Activity Life Cycle Status,Counter_visits,Visit Description,Visit,Account ID.Account ID Level 01,Account ID.Account ID Level 01.Key,Result,Index
0,1,2023,2023 - 01,1,15 469,15 317,1,RU3A154692023 - 01,ESR NP_Екатеринбург_Север_esr,6482.0,...,NPRU100012164,3,Completed,1,Аудит,2147753,"Пекарня «Хлебница», пр. Космонавтов, 84",7042010,,1018
1,3,2023,2023 - 03,1,54 046,20 822,3,RU3A540462023 - 03,ESR NP_Екатеринбург_Север_esr,6482.0,...,NPRU100012164,3,Completed,1,Аудит,2276730,"Пекарня «Хлебница», пр. Космонавтов, 84",7042010,,1022
2,3,2023,2023 - 03,1,54 046,20 822,3,RU3A540462023 - 03,ESR NP_Екатеринбург_Север_esr,6482.0,...,NPRU100012164,3,Completed,1,Аудит,2272298,"Пекарня «Хлебница», пр. Космонавтов, 84",7042010,,1023
3,1,2023,2023 - 01,1,15 469,15 317,1,RU3A154692023 - 01,ESR NP_Екатеринбург_Север_esr,6482.0,...,NPRU100012164,3,Completed,1,Аудит,2147752,"Пекарня «Хлебница», пр. Космонавтов, 45а",7042268,,1463
4,7,2023,2023 - 07,1,141 441,20 806,7,RU3A1414412023 - 07,ESR NP_Екатеринбург_Север_esr,6482.0,...,NPRU100012164,3,Completed,1,Аудит,2477044,"Шаурма, ул. Предельная, 63",6978596,,2363


### 7. Phone Calls data

In [42]:
# Load the Visits data
PhoneCallsdf = pd.read_excel(PhoneCallsdf, dtype={'Account Name': str})
PhoneCallsdf.head()

Unnamed: 0,Activity Name,Account Name,Activity Owner,Activity Life Cycle Status,Phone Call ID,Objective (Phone Call),Sales Organization,End Date in Local Time Zone,Start Date in Local Time Zone,PeriodEnd,ee
0,2023-03-22- P e Call 1,7323225,Jadala Aishwarya,Completed,1075973,,IN14,"mercredi, 22 mars 2023","mercredi, 22 mars 2023",2023 - 03,4473.0
1,2023-03-21- Residence Call 2,7316409,Jadala Aishwarya,Completed,1076101,,IN14,"mercredi, 22 mars 2023","mercredi, 22 mars 2023",2023 - 03,4473.0
2,2023-03-21- No Call 1,7318215,Jadala Aishwarya,Completed,1075664,,IN14,"mercredi, 22 mars 2023","mercredi, 22 mars 2023",2023 - 03,4473.0
3,2023-03-21- Yes Call 1,7317829,Jadala Aishwarya,Completed,1075454,,IN14,"mercredi, 22 mars 2023","mercredi, 22 mars 2023",2023 - 03,4473.0
4,2023-03-21- bluepal Call 2,7316130,Jadala Aishwarya,Completed,1075797,,IN14,"mercredi, 22 mars 2023","mercredi, 22 mars 2023",2023 - 03,4473.0


### 8. Incident Tickets data

In [43]:
IncidentTicketdf = pd.read_excel(r"C:\Users\msalomo\OneDrive - NESTLE\Certificate Machine Learning and Data\Churn Project\Data\Incident tickets.xlsx")
IncidentTicketdf.head()

Unnamed: 0,Index,SLAMet,YearMonth,Period,NextDateAux,NextDateAux2,AuxTime,TimeFrom,Next CreatedDatevar,TimeTo,...,SUB_TICKET_SALES_ORGANIZATION_ID,SUB_TICKET_SALES_ORGANIZATION_DESCRIPTION,ITEM_TARGET_INSTALLATION_POINT,WORK_PROGRESS,DATE_OF_LAST_MOVEMENT,COMPLETION_DUE_DATE,SERVICE_CATEGORY,SERVICE_CATEGORY_DESCRIPTION,SALESORG,COUNTER
0,2182-03-24 00:00:00,0,202110.0,2021 - 10,-738067.0,"vendredi, 18 mai 1979",Yes,20,2021-10-25,9999,...,BG10,Nestle Bulgaria A.D.,,6,NaT,"mardi, 5 octobre 2021",CA_5,Repair,BG10,1
1,2183-10-25 00:00:00,0,202110.0,2021 - 10,-738069.0,"mercredi, 16 mai 1979",Yes,344,2022-09-16,9999,...,BG10,Nestle Bulgaria A.D.,,6,NaT,"vendredi, 8 octobre 2021",CA_5,Repair,BG10,1
2,2183-11-19 00:00:00,0,202110.0,2021 - 10,-738069.0,"mercredi, 16 mai 1979",No,1,2021-10-08,9999,...,BG10,Nestle Bulgaria A.D.,,6,NaT,"vendredi, 8 octobre 2021",CA_5,Repair,BG10,1
3,2184-04-02 00:00:00,0,202110.0,2021 - 10,-738069.0,"mercredi, 16 mai 1979",No,5,2021-10-12,9999,...,BG10,Nestle Bulgaria A.D.,,6,NaT,"mercredi, 6 octobre 2021",CA_5,Repair,BG10,1
4,2184-08-05 00:00:00,0,202110.0,2021 - 10,-738070.0,"mardi, 15 mai 1979",Yes,34,2021-11-11,9999,...,BG10,Nestle Bulgaria A.D.,,6,NaT,"samedi, 9 octobre 2021",CA_5,Repair,BG10,1


### 9. Market specific data

UK stopped providing their service data

# Load the Visits data
#PhoneCallsdf = pd.read_excel(os.path.join('C:', 'Users', 'msalomo', 'Churn Project', 'Data', 'Completed Phone Calls.xlsx'))

UKService = pd.read_excel(r"C:\Users\msalomo\OneDrive - NESTLE\Certificate Machine Learning and Data\Churn Project\Data\UK Service data 202103.xlsx")


UKService['Key_ManufacturerID_SalesOrg'] = UKService['Serial N'].astype(str) + "Nestle UK"

UKService.head()

def preprocess_UKService(df):
    # Work on a copy
    df = df.copy()

    nomi_vars = ['Fault Codes', 'FTF']
    
    dummy_columns = nomi_vars
        
    df = pd.get_dummies(df, columns=dummy_columns)

    return df

UKService_prep = preprocess_UKService(UKService)
UKService_prep.head()

Quick check on a serial with 7 entries

UKService_prepX=UKService_prep.loc[UKService_prep['Serial N'] =='0010237915']

UKService_prepX

UKService_prep.columns

UKService_prep2 = (UKService_prep.sort_values('Month')
    .groupby(["Key_ManufacturerID_SalesOrg"])
                      .agg({'Month': lambda s: s.values[-1], 
                            'Serial N': lambda s: s.values[-1], 
                            'Minutes': 'mean', 
                            'Fault Codes_Blocked ingredients' : 'sum', 
                            'Fault Codes_Boiler fault' : 'sum',
                            'Fault Codes_Booked' : 'sum', 
                            'Fault Codes_Brewer issue' : 'sum',
                            'Fault Codes_Canister' : 'sum', 
                            'Fault Codes_Card reader install' : 'sum',
                            'Fault Codes_Card reader removal' : 'sum',
                            'Fault Codes_Change drink size' : 'sum',
                            'Fault Codes_Cleaning / Hygiene kit ' : 'sum',
                            'Fault Codes_Coinmech' : 'sum',
                            'Fault Codes_Decomission' : 'sum', 
                            'Fault Codes_Delivery' : 'sum',
                            'Fault Codes_Display issue' : 'sum', 
                            'Fault Codes_Door' : 'sum',
                            'Fault Codes_Drink Strength/ taste' : 'sum', 
                            'Fault Codes_Faulty door' : 'sum',
                            'Fault Codes_Filters ' : 'sum', 
                            'Fault Codes_Fridge fault' : 'sum', 
                            'Fault Codes_Leak' : 'sum',
                            'Fault Codes_Machine empty no ingredients' : 'sum', 
                            'Fault Codes_Measures' : 'sum',
                            'Fault Codes_Motor' : 'sum', 
                            'Fault Codes_No Fault Found' : 'sum',
                            'Fault Codes_No power Socket' : 'sum', 
                            'Fault Codes_Not dispensing drinks' : 'sum',
                            'Fault Codes_Not heating' : 'sum', 
                            'Fault Codes_Other' : 'sum',
                            'Fault Codes_Other (derial number check etc)' : 'sum', 
                            'Fault Codes_PPM' : 'sum',
                            'Fault Codes_Power / CPU Machine' : 'sum', 
                            'Fault Codes_Price increase' : 'sum',
                            'Fault Codes_Pump / Valve  internal' : 'sum',
                            'Fault Codes_Pump / Water External' : 'sum', 
                            'Fault Codes_Telemetry' : 'sum',
                            'Fault Codes_Training' : 'sum', 
                            'Fault Codes_consumable' : 'sum',
                            'Fault Codes_machine Install' : 'sum', 
                            'Fault Codes_workshop' : 'sum', 
                            'FTF_1' : 'sum', 
                            'FTF_2' : 'sum',
                            'FTF_3' : 'sum', 
                            'FTF_4' : 'sum'                  
    })
)

UKService_prep2

### 10. LOCAL DATA

Sales & Telemetron & Vendon 2021 Sales

PakistanSales both Serial no and manuf no are the same
RussiaSalesData uses Manuf no

Vendon data uses manuf no



In [44]:
PakistanSales = pd.read_excel(PakistanSales)
MalaysiaSales = pd.read_excel(MalaysiaSales)

# Drop the 'Serial' column
MalaysiaSales.drop('Serial', axis=1, inplace=True)
# Rename the 'Serial Manufacturer' column to 'Serial'
MalaysiaSales.rename(columns={'Serial Manufacturer': 'Serial'}, inplace=True)

In [45]:
PakistanSales.head()

Unnamed: 0,Serial,quantity,Month
0,20O0014321,2512.8206,2021-01-01
1,7010054243,8488.0412,2021-01-01
2,7010055066,91133.6902,2021-01-01
3,7010045635,91133.6902,2021-01-01
4,7010058209,91133.6902,2021-01-01


In [46]:
PakistanSales['Serial'] = PakistanSales['Serial'].astype(str)
MalaysiaSales['Serial'] = MalaysiaSales['Serial'].astype(str)

In [47]:
RussiaSalesData = pd.read_excel(RussiaSalesData)

In [48]:
RussiaSalesData.tail()

Unnamed: 0,Date,Machine Manufacturer Serial Number,ПРОДАЖИ (NPS)
882513,2024-04-30,20182128784,0.0
882514,2024-04-30,20173032554,0.0
882515,2024-04-30,20170908163,0.0
882516,2024-04-30,15297DU17072840691,0.0
882517,2024-04-30,15297DU17072840705,0.0


In [49]:
RussiaSalesData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 882518 entries, 0 to 882517
Data columns (total 3 columns):
 #   Column                              Non-Null Count   Dtype         
---  ------                              --------------   -----         
 0   Date                                882518 non-null  datetime64[ns]
 1   Machine Manufacturer Serial Number  882512 non-null  object        
 2   ПРОДАЖИ (NPS)                       882474 non-null  float64       
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 20.2+ MB


In [50]:
SouthAfricaSales = pd.read_excel(SouthAfricaSales)

In [51]:
SouthAfricaSales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4073 entries, 0 to 4072
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   AccountID  4073 non-null   int64         
 1   quantity   4073 non-null   float64       
 2   Month      4073 non-null   datetime64[ns]
dtypes: datetime64[ns](1), float64(1), int64(1)
memory usage: 95.6 KB


In [52]:
SingaporeSales = pd.read_excel(SingaporeSales)

In [53]:
SingaporeSales['Month'] = pd.to_datetime(SingaporeSales['Month'])
SingaporeSales['Serial ID'] = SingaporeSales['Serial ID'].astype(str)

  SingaporeSales['Month'] = pd.to_datetime(SingaporeSales['Month'])


In [54]:
SingaporeSales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 91400 entries, 0 to 91399
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   Serial ID            91400 non-null  object        
 1   Month                91400 non-null  datetime64[ns]
 2   Sales                85748 non-null  float64       
 3   Ship to              91400 non-null  object        
 4   Account ID           91400 non-null  int64         
 5   Manufacturer Number  91400 non-null  object        
dtypes: datetime64[ns](1), float64(1), int64(1), object(3)
memory usage: 4.2+ MB


pip install pandasql

In [55]:
# create data frame
df1 = SingaporeSales
 
print("Original DataFrame")
 
# print original data frame
display(df1)
 
# create a dictionary
# key = old name
# value = new name
dict = {'Serial ID': 'Serial_ID',
        'Month' : 'Month',
        'Sales' : 'Sales',
        'Ship to': 'Ship_to',
       'Account ID' : 'Account_ID',
       'Manufacturer Number' : 'Manufacturer_Number'}
 
print("\nAfter rename")
# call rename () method
df1.rename(columns=dict,
          inplace=True)
 
# print Data frame after rename columns
display(df1)

Original DataFrame


Unnamed: 0,Serial ID,Month,Sales,Ship to,Account ID,Manufacturer Number
0,SGBMB03059,2023-01-31,0.00,30885489704,3981172,20092213179
1,SGBMB04056,2023-01-31,,27835PA00101,3981872,20092414016
2,SGBMB03049,2023-01-31,0.00,30885489184,3981521,20092414029
3,SGBMB03772,2023-01-31,2412.35,27835KEP008A01,3982306,20094625024
4,SGBMB03804,2023-01-31,0.00,30885489344,3981360,20094625056
...,...,...,...,...,...,...
91395,23O0043568,2024-04-30,1653.72,280317190225,8410298,20234650037
91396,23O0043569,2024-04-30,1653.72,280317190225,8410298,20234650038
91397,23O0047359,2024-04-30,0.00,6767280N-2403003,9054652,EFSIN23120001
91398,23O0047344,2024-04-30,896.40,6750157O5969,8921214,3400000263847



After rename


Unnamed: 0,Serial_ID,Month,Sales,Ship_to,Account_ID,Manufacturer_Number
0,SGBMB03059,2023-01-31,0.00,30885489704,3981172,20092213179
1,SGBMB04056,2023-01-31,,27835PA00101,3981872,20092414016
2,SGBMB03049,2023-01-31,0.00,30885489184,3981521,20092414029
3,SGBMB03772,2023-01-31,2412.35,27835KEP008A01,3982306,20094625024
4,SGBMB03804,2023-01-31,0.00,30885489344,3981360,20094625056
...,...,...,...,...,...,...
91395,23O0043568,2024-04-30,1653.72,280317190225,8410298,20234650037
91396,23O0043569,2024-04-30,1653.72,280317190225,8410298,20234650038
91397,23O0047359,2024-04-30,0.00,6767280N-2403003,9054652,EFSIN23120001
91398,23O0047344,2024-04-30,896.40,6750157O5969,8921214,3400000263847


In [56]:
import pandas as pd
import sqlite3

# create a sample DataFrame
df = df1

# create an in-memory SQLite database
conn = sqlite3.connect(':memory:')

# write the DataFrame to the database
df.to_sql('my_table', con=conn)

# define the SQL query
query = '''
SELECT Month, Sales AS quantity, Ship_to, Manufacturer_Number,
       COUNT(Manufacturer_Number) OVER (PARTITION BY Month, Ship_to) AS Manufacturer_Count,
       (Sales / COUNT(Manufacturer_Number) OVER (PARTITION BY Month, Ship_to)) AS Sales_perMachine
FROM my_table
'''

# run the query using pandas
result = pd.read_sql_query(query, conn)

# print the result
print(result)

                     Month  quantity         Ship_to Manufacturer_Number  \
0      2023-01-31 00:00:00       0.0      2215122122         20103018840   
1      2023-01-31 00:00:00       0.0      2215122122         20103018859   
2      2023-01-31 00:00:00       0.0      2215122122         20102917864   
3      2023-01-31 00:00:00       0.0      2215122122         20102917875   
4      2023-01-31 00:00:00       0.0      2215122122         20103923848   
...                    ...       ...             ...                 ...   
91395  2024-04-30 00:00:00       0.0  69338116933811         20141213799   
91396  2024-04-30 00:00:00       0.0  69338116933811         20141213800   
91397  2024-04-30 00:00:00       0.0  69338116933811         20141213801   
91398  2024-04-30 00:00:00       0.0  69338116933811         20141213823   
91399  2024-04-30 00:00:00       0.0  69768386976838         20224845590   

       Manufacturer_Count  Sales_perMachine  
0                     204               0

In [57]:
SingaporeSales = result
SingaporeSales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 91400 entries, 0 to 91399
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Month                91400 non-null  object 
 1   quantity             85748 non-null  float64
 2   Ship_to              91400 non-null  object 
 3   Manufacturer_Number  91400 non-null  object 
 4   Manufacturer_Count   91400 non-null  int64  
 5   Sales_perMachine     85748 non-null  float64
dtypes: float64(2), int64(1), object(3)
memory usage: 4.2+ MB


In [58]:
SingaporeSales.rename(columns={'Manufacturer_Number': 'Manufacturer Number'}, inplace=True)
SingaporeSales.rename(columns={'quantity': 'quantityold'}, inplace=True)
SingaporeSales.rename(columns={'Sales_perMachine': 'quantity'}, inplace=True)

In [59]:
SingaporeSales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 91400 entries, 0 to 91399
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Month                91400 non-null  object 
 1   quantityold          85748 non-null  float64
 2   Ship_to              91400 non-null  object 
 3   Manufacturer Number  91400 non-null  object 
 4   Manufacturer_Count   91400 non-null  int64  
 5   quantity             85748 non-null  float64
dtypes: float64(2), int64(1), object(3)
memory usage: 4.2+ MB


In [60]:
RussiaSalesData.rename(columns={'Machine Manufacturer Serial Number': 'Serial', 'ПРОДАЖИ (NPS)': 'quantity'}, inplace=True)
RussiaSalesData

Unnamed: 0,Date,Serial,quantity
0,2021-01-31,4228,0.0
1,2021-01-31,5419,0.0
2,2021-01-31,5477,0.0
3,2021-01-31,420090,0.0
4,2021-01-31,420283,0.0
...,...,...,...
882513,2024-04-30,20182128784,0.0
882514,2024-04-30,20173032554,0.0
882515,2024-04-30,20170908163,0.0
882516,2024-04-30,15297DU17072840691,0.0


In [61]:
RussiaSalesData['quantity'] = RussiaSalesData['quantity'].astype(float)
RussiaSalesData['Serial'] = RussiaSalesData['Serial'].astype(str)

In [62]:
RussiaSalesData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 882518 entries, 0 to 882517
Data columns (total 3 columns):
 #   Column    Non-Null Count   Dtype         
---  ------    --------------   -----         
 0   Date      882518 non-null  datetime64[ns]
 1   Serial    882518 non-null  object        
 2   quantity  882474 non-null  float64       
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 20.2+ MB


TelemetronData = pd.read_excel(r"C:\Users\msalomo\OneDrive - NESTLE\Certificate Machine Learning and Data\Churn Project\Data\Telemetron Data.xlsx")

TelemetronData.rename(columns={'Machine serial': 'serial', 'Total': 'quantity'}, inplace=True)

In [63]:
PakistanSales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 236516 entries, 0 to 236515
Data columns (total 3 columns):
 #   Column    Non-Null Count   Dtype         
---  ------    --------------   -----         
 0   Serial    236516 non-null  object        
 1   quantity  236516 non-null  float64       
 2   Month     236516 non-null  datetime64[ns]
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 5.4+ MB


VendonData = pd.read_excel(r"C:\Users\msalomo\Churn Project\Data\Telemetry2021.xlsx")

VendonData.head()

In [64]:
Pakistan_aggSales = PakistanSales['quantity'].groupby(PakistanSales['Serial'], axis=0).sum()
Pakistan_aggSales = Pakistan_aggSales.reset_index()

# Perform the aggregation on 'quantity' grouped by 'Serial'
Malaysia_aggSales = MalaysiaSales['quantity'].groupby(MalaysiaSales['Serial']).sum().reset_index()
# Print the 'Malaysia_aggSales' DataFrame
print(Malaysia_aggSales)

           Serial     quantity
0      1005010001   128.367330
1      1005010006   742.657798
2      1005010007   742.657798
3      1005010009  2748.936909
4      1005010011  6442.373203
...           ...          ...
15977     T573424  1104.270000
15978     T573425  2423.145633
15979     T573426   359.400000
15980     T573427  4612.954324
15981  VZCA 00001  5386.958367

[15982 rows x 2 columns]


Telemetron_agg = TelemetronData['quantity'].groupby(TelemetronData['serial'], axis=0).sum()
Telemetron_agg = Telemetron_agg.reset_index()
Telemetron_agg

In [65]:
RussiaSalesData_agg = RussiaSalesData['quantity'].groupby(RussiaSalesData['Serial'], axis=0).sum()
RussiaSalesData_agg = RussiaSalesData_agg.reset_index()
RussiaSalesData_agg

Unnamed: 0,Serial,quantity
0,00001428-0011,0.00
1,00001429-0001,51785.57
2,00001429-0004,0.00
3,00001429-0006,0.00
4,00001429-0007,0.00
...,...,...
30479,ХК 0115,0.00
30480,ХК 0116,0.00
30481,ХК 0118,0.00
30482,ХК 0120,0.00


In [66]:
SouthAfrica_aggSales = SouthAfricaSales['quantity'].groupby(SouthAfricaSales['AccountID'], axis=0).sum()
SouthAfrica_aggSales = SouthAfrica_aggSales.reset_index()
SouthAfrica_aggSales

Unnamed: 0,AccountID,quantity
0,365014,0.82
1,365018,40.53
2,366680,1068.01
3,366935,13691.11
4,367418,144338.28
...,...,...
585,7136798,165575.49
586,7141430,6503.40
587,7145255,774992.42
588,7149375,965814.83


In [67]:
Singapore_aggSales = SingaporeSales['quantity'].groupby(SingaporeSales['Manufacturer Number'], axis=0).sum()
Singapore_aggSales = Singapore_aggSales.reset_index()
Singapore_aggSales

Unnamed: 0,Manufacturer Number,quantity
0,141982300000383201,3691.780000
1,141982300001083201,3383.430000
2,141982300003083201,12672.237298
3,141982300003283201,3208.420000
4,141992300000483201,2571.080000
...,...,...
6361,ZEBA0072,20698.834875
6362,ZEBA0073,12672.237298
6363,ZEBA0074,16048.238385
6364,ZEBA0075,17891.143519


#Vendon_agg = VendonData['quantity'].groupby(VendonData['serial'], axis=0).sum()

Vendon_agg = (VendonData.sort_values('Month')
    .groupby(["serial"])
                      .agg({'SalesOrg' : lambda s: s.values[-1],
                            'quantity' : 'sum'}))

Vendon_agg = Vendon_agg.reset_index()
Vendon_agg

PakistanSales = PakistanSales.loc[PakistanSales['Month']>=PakistanDateRangeStart]

VendonData = VendonData.loc[VendonData['Month']>=VendonDateRangeStart]

I will aggregate the number of Cup Sales for each Machine by the feature called 'serial' which corresponds to the feature 'Manufacturer Number' in the Beverage Machine data

In [68]:
#df.sort_values('date').groupby('id').tail(1)
#df.sort_values('date').groupby('id').apply(lambda x: x.tail(1))
#df.groupby('Type').apply(lambda x: x.tail(3).mean())

PakistanSales_df1 = PakistanSales.groupby(['Month', 'Serial']).sum()
PakistanSales_df1 = PakistanSales_df1.reset_index()


PakistanSales_one_month = PakistanSales_df1.sort_values('Month').groupby('Serial').agg({'quantity' : lambda x: x.tail(1).sum()})

PakistanSales_three_months = PakistanSales_df1.sort_values('Month').groupby('Serial').agg({'quantity' : lambda x: x.tail(3).sum()/3})

PakistanSales_six_months = PakistanSales_df1.sort_values('Month').groupby('Serial').agg({'quantity' : lambda x: x.tail(6).sum()/6})

# Group by 'Month' and 'Serial' and calculate the sum
MalaysiaSales_df1 = MalaysiaSales.groupby(['Month', 'Serial']).sum().reset_index()
# Calculate the sum of 'quantity' for the latest month for each 'Serial'
MalaysiaSales_one_month = MalaysiaSales_df1.sort_values('Month').groupby('Serial').agg({'quantity': lambda x: x.tail(1).sum()})
# Calculate the average of 'quantity' for the last three months for each 'Serial'
MalaysiaSales_three_months = MalaysiaSales_df1.sort_values('Month').groupby('Serial').agg({'quantity': lambda x: x.tail(3).sum() / 3})
# Calculate the average of 'quantity' for the last six months for each 'Serial'
MalaysiaSales_six_months = MalaysiaSales_df1.sort_values('Month').groupby('Serial').agg({'quantity': lambda x: x.tail(6).sum() / 6})
# Print the resulting DataFrames
print(MalaysiaSales_one_month)


               quantity
Serial                 
1005010001     0.000000
1005010006     0.000000
1005010007     0.000000
1005010009   417.780483
1005010011  1151.370455
...                 ...
T573424        0.000000
T573425      374.281441
T573426        0.000000
T573427      294.736667
VZCA 00001   828.944922

[15982 rows x 1 columns]


In [69]:
PakistanSales_one_month = PakistanSales_one_month.reset_index()
PakistanSales_three_months = PakistanSales_three_months.reset_index()
PakistanSales_six_months = PakistanSales_six_months.reset_index()

# Reset the index for 'MalaysiaSales_one_month'
MalaysiaSales_one_month = MalaysiaSales_one_month.reset_index()
# Reset the index for 'MalaysiaSales_three_months'
MalaysiaSales_three_months = MalaysiaSales_three_months.reset_index()
# Reset the index for 'MalaysiaSales_six_months'
MalaysiaSales_six_months = MalaysiaSales_six_months.reset_index()
# Print the 'MalaysiaSales_three_months' DataFrame
print(MalaysiaSales_three_months)

           Serial     quantity
0      1005010001     6.391393
1      1005010006     0.000000
2      1005010007     0.000000
3      1005010009   444.384206
4      1005010011  1539.484848
...           ...          ...
15977     T573424    39.450000
15978     T573425   388.879459
15979     T573426     0.000000
15980     T573427   704.707014
15981  VZCA 00001   917.998386

[15982 rows x 2 columns]


from dateutil.relativedelta import relativedelta

one_month_pak = PakistanLastUpdate + relativedelta(months=-1)
three_months_pak = PakistanLastUpdate + relativedelta(months=-3)
six_months_pak = PakistanLastUpdate + relativedelta(months=-6)

PakistanSales_one_month = PakistanSales.loc[PakistanSales['Month']>one_month_pak]
PakistanSales_three_months = PakistanSales.loc[PakistanSales['Month']>three_months_pak]
PakistanSales_six_months = PakistanSales.loc[PakistanSales['Month']>six_months_pak]

VendonData_one_month = VendonData.loc[VendonData['Month']>one_month_pak]
VendonData_three_months = VendonData.loc[VendonData['Month']>three_months_pak]
VendonData_six_months = VendonData.loc[VendonData['Month']>six_months_pak]

In [70]:
PakistanSales_one_month

Unnamed: 0,Serial,quantity
0,10010063319,15412.0800
1,2000014136,58078.4300
2,2000014290,5000.0000
3,2000014292,17440.0000
4,2000014293,38800.0000
...,...,...
15145,7010070073,27537.3339
15146,7010070077,9979.3956
15147,7010070112,36690.0000
15148,7010070113,32800.0000


PakistanSales_one_month.loc[PakistanSales_one_month['serial'] != '70010058920']

PakistanSales_one_month_avg = PakistanSales_one_month['quantity'].groupby(PakistanSales_one_month['serial'], axis=0).sum()

PakistanSales_one_month_avg = PakistanSales_one_month_avg.to_frame().reset_index()
PakistanSales_one_month_avg = PakistanSales_one_month_avg.rename(columns={"quantity": "one_Month_avg"})

In [71]:
#PakistanSales_one_month_avg = PakistanSales_one_month['quantity'].groupby(PakistanSales_one_month['serial'], axis=0).sum()
#PakistanSales_one_month_avg = PakistanSales_one_month_avg.to_frame().reset_index()
PakistanSales_one_month_avg = PakistanSales_one_month.rename(columns={"quantity": "Sales_one_Month_avg"})

#PakistanSales_three_months_avg = PakistanSales_three_months['quantity'].groupby(PakistanSales_three_months['serial'], axis=0).sum()
#PakistanSales_three_months_avg = PakistanSales_three_months_avg.to_frame().reset_index()
PakistanSales_three_months_avg = PakistanSales_three_months.rename(columns={"quantity": "Sales_three_months_avg"})

#PakistanSales_six_months_avg = PakistanSales_six_months['quantity'].groupby(PakistanSales_six_months['serial'], axis=0).sum()
#PakistanSales_six_months_avg = PakistanSales_six_months_avg.to_frame().reset_index()
PakistanSales_six_months_avg = PakistanSales_six_months.rename(columns={"quantity": "Sales_six_months_avg"})

PakistanSales_three_months_avg

# Rename the column in 'MalaysiaSales_one_month'
MalaysiaSales_one_month_avg = MalaysiaSales_one_month.rename(columns={"quantity": "Sales_one_Month_avg"})
# Rename the column in 'MalaysiaSales_three_months'
MalaysiaSales_three_months_avg = MalaysiaSales_three_months.rename(columns={"quantity": "Sales_three_months_avg"})
# Rename the column in 'MalaysiaSales_six_months'
MalaysiaSales_six_months_avg = MalaysiaSales_six_months.rename(columns={"quantity": "Sales_six_months_avg"})
# Print the 'MalaysiaSales_three_months_avg' DataFrame
print(MalaysiaSales_three_months_avg)

           Serial  Sales_three_months_avg
0      1005010001                6.391393
1      1005010006                0.000000
2      1005010007                0.000000
3      1005010009              444.384206
4      1005010011             1539.484848
...           ...                     ...
15977     T573424               39.450000
15978     T573425              388.879459
15979     T573426                0.000000
15980     T573427              704.707014
15981  VZCA 00001              917.998386

[15982 rows x 2 columns]


In [72]:
SouthAfrica_aggSales

Unnamed: 0,AccountID,quantity
0,365014,0.82
1,365018,40.53
2,366680,1068.01
3,366935,13691.11
4,367418,144338.28
...,...,...
585,7136798,165575.49
586,7141430,6503.40
587,7145255,774992.42
588,7149375,965814.83


In [73]:
SouthAfricaSales_df1 = SouthAfricaSales.groupby(['Month', 'AccountID']).sum()
SouthAfricaSales_df1 = SouthAfricaSales.reset_index()


SouthAfricaSales_one_month = SouthAfricaSales_df1.sort_values('Month').groupby('AccountID').agg({'quantity' : lambda x: x.tail(1).sum()})

SouthAfricaSales_three_months = SouthAfricaSales_df1.sort_values('Month').groupby('AccountID').agg({'quantity' : lambda x: x.tail(3).sum()/3})

SouthAfricaSales_six_months = SouthAfricaSales_df1.sort_values('Month').groupby('AccountID').agg({'quantity' : lambda x: x.tail(6).sum()/6})


SouthAfricaSales_one_month = SouthAfricaSales_one_month.reset_index()
SouthAfricaSales_three_months = SouthAfricaSales_three_months.reset_index()
SouthAfricaSales_six_months = SouthAfricaSales_six_months.reset_index()
SouthAfricaSales_three_months


SouthAfricaSales_one_month_avg = SouthAfricaSales_one_month.rename(columns={"quantity": "Sales_one_Month_avg"})


SouthAfricaSales_three_months_avg = SouthAfricaSales_three_months.rename(columns={"quantity": "Sales_three_months_avg"})


SouthAfricaSales_six_months_avg = SouthAfricaSales_six_months.rename(columns={"quantity": "Sales_six_months_avg"})

SouthAfricaSales_three_months_avg

Unnamed: 0,AccountID,Sales_three_months_avg
0,365014,0.083333
1,365018,13.376667
2,366680,356.003333
3,366935,4563.703333
4,367418,11818.086667
...,...,...
585,7136798,11913.196667
586,7141430,2167.800000
587,7145255,61264.896667
588,7149375,33526.356667


In [74]:
SingaporeSales_df1 = SingaporeSales.groupby(['Month', 'Manufacturer Number']).sum()
SingaporeSales_df1 = SingaporeSales.reset_index()


SingaporeSales_one_month = SingaporeSales_df1.sort_values('Month').groupby('Manufacturer Number').agg({'quantity' : lambda x: x.tail(1).sum()})

SingaporeSales_three_months = SingaporeSales_df1.sort_values('Month').groupby('Manufacturer Number').agg({'quantity' : lambda x: x.tail(3).sum()/3})

SingaporeSales_six_months = SingaporeSales_df1.sort_values('Month').groupby('Manufacturer Number').agg({'quantity' : lambda x: x.tail(6).sum()/6})


SingaporeSales_one_month = SingaporeSales_one_month.reset_index()
SingaporeSales_three_months = SingaporeSales_three_months.reset_index()
SingaporeSales_six_months = SingaporeSales_six_months.reset_index()
SingaporeSales_three_months


SingaporeSales_one_month_avg = SingaporeSales_one_month.rename(columns={"quantity": "Sales_one_Month_avg"})

SingaporeSales_three_months_avg = SingaporeSales_three_months.rename(columns={"quantity": "Sales_three_months_avg"})

SingaporeSales_six_months_avg = SingaporeSales_six_months.rename(columns={"quantity": "Sales_six_months_avg"})

SingaporeSales_three_months_avg

  SingaporeSales_df1 = SingaporeSales.groupby(['Month', 'Manufacturer Number']).sum()


Unnamed: 0,Manufacturer Number,Sales_three_months_avg
0,141982300000383201,247.150000
1,141982300001083201,210.726667
2,141982300003083201,591.731587
3,141982300003283201,83.480000
4,141992300000483201,83.480000
...,...,...
6361,ZEBA0072,1147.863488
6362,ZEBA0073,591.731587
6363,ZEBA0074,591.731587
6364,ZEBA0075,591.731587


TelemetronData_df1 = TelemetronData.groupby(['Month', 'serial']).sum()
TelemetronData_df1 = TelemetronData_df1.reset_index()

TelemetronData_one_month = TelemetronData_df1.sort_values('Month').groupby('serial').agg({'quantity' : lambda x: x.tail(1).sum()})

TelemetronData_three_months = TelemetronData_df1.sort_values('Month').groupby('serial').agg({'quantity' : lambda x: x.tail(3).sum()/3})

TelemetronData_six_months = TelemetronData_df1.sort_values('Month').groupby('serial').agg({'quantity' : lambda x: x.tail(6).sum()/6})

TelemetronData_one_month = TelemetronData_one_month.reset_index()
TelemetronData_three_months = TelemetronData_three_months.reset_index()
TelemetronData_six_months = TelemetronData_six_months.reset_index()
TelemetronData_three_months

#TelemetronData_one_month_avg = TelemetronData_one_month['quantity'].groupby(TelemetronData_one_month['serial'], axis=0).sum()
#TelemetronData_one_month_avg = TelemetronData_one_month_avg.to_frame().reset_index()
TelemetronData_one_month_avg = TelemetronData_one_month.rename(columns={"quantity": "one_month_avg"})

#TelemetronData_three_months_avg = TelemetronData_three_months['quantity'].groupby(TelemetronData_three_months['serial'], axis=0).sum()
#TelemetronData_three_months_avg = TelemetronData_three_months_avg.to_frame().reset_index()
TelemetronData_three_months_avg = TelemetronData_three_months.rename(columns={"quantity": "three_months_avg"})

#TelemetronData_six_months_avg = TelemetronData_six_months['quantity'].groupby(TelemetronData_six_months['serial'], axis=0).sum()
#TelemetronData_six_months_avg = TelemetronData_six_months_avg.to_frame().reset_index()
TelemetronData_six_months_avg = TelemetronData_six_months.rename(columns={"quantity": "six_months_avg"})

TelemetronData_three_months_avg

In [75]:
RussiaSalesData_df1 = RussiaSalesData.groupby(['Date', 'Serial']).sum()
RussiaSalesData_df1 = RussiaSalesData_df1.reset_index()

RussiaSalesData_one_month = RussiaSalesData_df1.sort_values('Date').groupby('Serial').agg({'quantity' : lambda x: x.tail(1).sum()})

RussiaSalesData_three_months = RussiaSalesData_df1.sort_values('Date').groupby('Serial').agg({'quantity' : lambda x: x.tail(3).sum()/3})

RussiaSalesData_six_months = RussiaSalesData_df1.sort_values('Date').groupby('Serial').agg({'quantity' : lambda x: x.tail(6).sum()/6})

RussiaSalesData_one_month = RussiaSalesData_one_month.reset_index()
RussiaSalesData_three_months = RussiaSalesData_three_months.reset_index()
RussiaSalesData_six_months = RussiaSalesData_six_months.reset_index()

RussiaSalesData_one_month_avg = RussiaSalesData_one_month.rename(columns={"quantity": "Sales_one_Month_avg"})

RussiaSalesData_three_months_avg = RussiaSalesData_three_months.rename(columns={"quantity": "Sales_three_months_avg"})

RussiaSalesData_six_months_avg = RussiaSalesData_six_months.rename(columns={"quantity": "Sales_six_months_avg"})



TelemetronData_one_month_avg = TelemetronData_one_month['quantity'].groupby(TelemetronData_one_month['serial'], axis=0).sum()
TelemetronData_one_month_avg = TelemetronData_one_month_avg.to_frame().reset_index()
TelemetronData_one_month_avg = TelemetronData_one_month_avg.rename(columns={"quantity": "one_Month_avg"})

TelemetronData_three_months_avg = TelemetronData_three_months['quantity'].groupby(TelemetronData_three_months['serial'], axis=0).sum()
TelemetronData_three_months_avg = TelemetronData_three_months_avg.to_frame().reset_index()
TelemetronData_three_months_avg = TelemetronData_three_months_avg.rename(columns={"quantity": "three_months_avg"})

TelemetronData_six_months_avg = TelemetronData_six_months['quantity'].groupby(TelemetronData_six_months['serial'], axis=0).sum()
TelemetronData_six_months_avg = TelemetronData_six_months_avg.to_frame().reset_index()
TelemetronData_six_months_avg = TelemetronData_six_months_avg.rename(columns={"quantity": "six_months_avg"})

TelemetronData_three_months_avg

VendonData_one_month_avg = VendonData_one_month['quantity'].groupby(VendonData_one_month['serial'], axis=0).sum()
VendonData_one_month_avg = VendonData_one_month_avg.to_frame().reset_index()
VendonData_one_month_avg = VendonData_one_month_avg.rename(columns={"quantity": "one_Month_avg"})

VendonData_three_months_avg = VendonData_three_months['quantity'].groupby(VendonData_three_months['serial'], axis=0).sum()
VendonData_three_months_avg = VendonData_three_months_avg.to_frame().reset_index()
VendonData_three_months_avg = VendonData_three_months_avg.rename(columns={"quantity": "three_months_avg"})

VendonData_six_months_avg = VendonData_six_months['quantity'].groupby(VendonData_six_months['serial'], axis=0).sum()
VendonData_six_months_avg = VendonData_six_months_avg.to_frame().reset_index()
VendonData_six_months_avg = VendonData_six_months_avg.rename(columns={"quantity": "six_months_avg"})

VendonData_three_months_avg

#already done with change of code
PakistanSales_three_months_avg['three_months_avg'] = PakistanSales_three_months_avg['Sales_three_months_avg'].apply(lambda x: x/3)

PakistanSales_six_months_avg['six_months_avg'] = PakistanSales_six_months_avg['Sales_six_months_avg'].apply(lambda x: x/6)


In [76]:
PakistanSales_one_month_avg.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15150 entries, 0 to 15149
Data columns (total 2 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Serial               15150 non-null  object 
 1   Sales_one_Month_avg  15150 non-null  float64
dtypes: float64(1), object(1)
memory usage: 236.8+ KB


In [77]:
Pakistan_aggSales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15150 entries, 0 to 15149
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Serial    15150 non-null  object 
 1   quantity  15150 non-null  float64
dtypes: float64(1), object(1)
memory usage: 236.8+ KB


In [78]:
PakistanSales_df = pd.merge(Pakistan_aggSales, PakistanSales_one_month_avg, how='left', left_on = ['Serial'], right_on = ['Serial'])


# Merge the DataFrames based on the 'Serial' column
PakistanSales_df = pd.merge(Pakistan_aggSales, PakistanSales_one_month_avg, how='left', left_on='Serial', right_on='Serial')
# Print the head of the merged DataFrame
print(PakistanSales_df.head())

        Serial     quantity  Sales_one_Month_avg
0  10010063319   60245.1569             15412.08
1   2000014136  689624.6980             58078.43
2   2000014290   42356.1536              5000.00
3   2000014292  104124.6992             17440.00
4   2000014293  415646.1472             38800.00


In [79]:
PakistanSales_df2 = pd.merge(PakistanSales_df, PakistanSales_three_months_avg, how='left', left_on = ['Serial'], right_on = ['Serial'])
PakistanSales_df3 = pd.merge(PakistanSales_df2, PakistanSales_six_months_avg, how='left', left_on = ['Serial'], right_on = ['Serial'])
PakistanSales_df3 = PakistanSales_df3.fillna(0)

# Merge the DataFrames based on the 'Serial' column
MalaysiaSales_df1 = pd.merge(Malaysia_aggSales, MalaysiaSales_one_month_avg, how='left', left_on='Serial', right_on='Serial')
MalaysiaSales_df2 = pd.merge(MalaysiaSales_df1, MalaysiaSales_three_months_avg, how='left', left_on='Serial', right_on='Serial')
MalaysiaSales_df3 = pd.merge(MalaysiaSales_df2, MalaysiaSales_six_months_avg, how='left', left_on='Serial', right_on='Serial')
# Fill any missing values with 0
MalaysiaSales_df3 = MalaysiaSales_df3.fillna(0)
# Print the head of the merged DataFrame
print(MalaysiaSales_df3.head(30))

        Serial      quantity  Sales_one_Month_avg  Sales_three_months_avg  \
0   1005010001    128.367330             0.000000                6.391393   
1   1005010006    742.657798             0.000000                0.000000   
2   1005010007    742.657798             0.000000                0.000000   
3   1005010009   2748.936909           417.780483              444.384206   
4   1005010011   6442.373203          1151.370455             1539.484848   
5   1005010014   6331.466127           788.729630             1022.703660   
6   1005010015  18670.952147          3429.041811             3428.241404   
7   1005010016   3019.859071           433.838675              476.872336   
8   1005010017    742.657798             0.000000                0.000000   
9   1005010018   4494.520465           686.039247              715.374827   
10  1005010021  29407.272455          4303.907273             3853.037576   
11  1005010022   4923.650000          1353.300000             1039.633333   

In [80]:
MalaysiaSales_df3.info() 

<class 'pandas.core.frame.DataFrame'>
Int64Index: 15982 entries, 0 to 15981
Data columns (total 5 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Serial                  15982 non-null  object 
 1   quantity                15982 non-null  float64
 2   Sales_one_Month_avg     15982 non-null  float64
 3   Sales_three_months_avg  15982 non-null  float64
 4   Sales_six_months_avg    15982 non-null  float64
dtypes: float64(4), object(1)
memory usage: 749.2+ KB


#TelemetronData_three_months_avg['three_months_avg'] = TelemetronData_three_months_avg['three_months_avg'].apply(lambda x: x/3)

#TelemetronData_six_months_avg['six_months_avg'] = TelemetronData_six_months_avg['six_months_avg'].apply(lambda x: x/6)

TelemetronData_df = pd.merge(Telemetron_agg, TelemetronData_one_month_avg, how='left', left_on = ['serial'], right_on = ['serial'])
TelemetronData_df.head()

TelemetronData_df2 = pd.merge(TelemetronData_df, TelemetronData_three_months_avg, how='left', left_on = ['serial'], right_on = ['serial'])
TelemetronData_df3 = pd.merge(TelemetronData_df2, TelemetronData_six_months_avg, how='left', left_on = ['serial'], right_on = ['serial'])
TelemetronData_df3 = TelemetronData_df3.fillna(0)


In [81]:
RussiaSalesData_df = pd.merge(RussiaSalesData_agg, RussiaSalesData_one_month_avg, how='left', left_on = ['Serial'], right_on = ['Serial'])
RussiaSalesData_df.head()

RussiaSalesData_df2 = pd.merge(RussiaSalesData_df, RussiaSalesData_three_months_avg, how='left', left_on = ['Serial'], right_on = ['Serial'])
RussiaSalesData_df3 = pd.merge(RussiaSalesData_df2, RussiaSalesData_six_months_avg, how='left', left_on = ['Serial'], right_on = ['Serial'])
RussiaSalesData_df3 = RussiaSalesData_df3.fillna(0)


VendonData_three_months_avg['three_months_avg'] = VendonData_three_months_avg['three_months_avg'].apply(lambda x: x/3)

VendonData_six_months_avg['six_months_avg'] = VendonData_six_months_avg['six_months_avg'].apply(lambda x: x/6)

VendonData_df = pd.merge(Vendon_agg, VendonData_one_month_avg, how='left', left_on = ['serial'], right_on = ['serial'])

VendonData_df2 = pd.merge(VendonData_df, VendonData_three_months_avg, how='left', left_on = ['serial'], right_on = ['serial'])
VendonData_df3 = pd.merge(VendonData_df2, VendonData_six_months_avg, how='left', left_on = ['serial'], right_on = ['serial'])
VendonData_df3 = VendonData_df3.fillna(0)
VendonData_df3

Add Key manuf no and Sales org

In [82]:
SouthAfricaSales_df = pd.merge(SouthAfrica_aggSales, SouthAfricaSales_one_month_avg, how='left', left_on = ['AccountID'], right_on = ['AccountID'])
SouthAfricaSales_df.head()

SouthAfricaSales_df2 = pd.merge(SouthAfricaSales_df, SouthAfricaSales_three_months_avg, how='left', left_on = ['AccountID'], right_on = ['AccountID'])
SouthAfricaSales_df3 = pd.merge(SouthAfricaSales_df2, SouthAfricaSales_six_months_avg, how='left', left_on = ['AccountID'], right_on = ['AccountID'])
SouthAfricaSales_df3 = SouthAfricaSales_df3.fillna(0)
SouthAfricaSales_df3.head()

Unnamed: 0,AccountID,quantity,Sales_one_Month_avg,Sales_three_months_avg,Sales_six_months_avg
0,365014,0.82,0.19,0.083333,0.056667
1,365018,40.53,0.05,13.376667,6.73
2,366680,1068.01,1068.01,356.003333,178.001667
3,366935,13691.11,8376.59,4563.703333,2281.851667
4,367418,144338.28,5830.7,11818.086667,10983.813333


In [83]:
SingaporeSales_df = pd.merge(Singapore_aggSales, SingaporeSales_one_month_avg, how='left', left_on = ['Manufacturer Number'], right_on = ['Manufacturer Number'])
SingaporeSales_df2 = pd.merge(SingaporeSales_df, SingaporeSales_three_months_avg, how='left', left_on = ['Manufacturer Number'], right_on = ['Manufacturer Number'])
SingaporeSales_df3 = pd.merge(SingaporeSales_df2, SingaporeSales_six_months_avg, how='left', left_on = ['Manufacturer Number'], right_on = ['Manufacturer Number'])
SingaporeSales_df3 = SingaporeSales_df3.fillna(0)
SingaporeSales_df3.head(30)

Unnamed: 0,Manufacturer Number,quantity,Sales_one_Month_avg,Sales_three_months_avg,Sales_six_months_avg
0,141982300000383201,3691.78,272.75,247.15,201.344167
1,141982300001083201,3383.43,182.58,210.726667,208.716667
2,141982300003083201,12672.237298,488.248376,591.731587,705.98585
3,141982300003283201,3208.42,0.0,83.48,127.426667
4,141992300000483201,2571.08,0.0,83.48,161.393333
5,141992300000783201,22751.700695,1306.623636,1579.519725,1568.842368
6,141992300000983201,1694.93,80.71,67.613333,90.326667
7,141992300001583201,6109.08,0.0,230.226667,360.166667
8,141992300001683201,780.71,95.075,72.158333,82.6325
9,141992300003283201,3205.48,333.92,111.306667,155.826667


In [84]:
PakistanSales_df3['KeyManufNo_SalesOrg'] = PakistanSales_df3['Serial'].astype(str) + 'Pakistan' 

# Create the new column by combining 'Serial' with 'Malaysia'
MalaysiaSales_df3['KeyManufNo_SalesOrg'] = MalaysiaSales_df3['Serial'].astype(str) + 'Malaysia'

Not used yet in Vendon to differentiate markets

TelemetronData_df3['KeyManufNo_SalesOrg'] = TelemetronData_df3['serial'].astype(str) + 'Nestlé Russia'

In [85]:
RussiaSalesData_df3['KeyManufNo_SalesOrg'] = RussiaSalesData_df3['Serial'].astype(str) + 'Nestlé Russia'


In [86]:
SouthAfricaSales_df3['KeyManufNo_SalesOrg'] = SouthAfricaSales_df3['AccountID'].astype(str) + 'Nestle South Africa' 

Rename the accountID column from South Africa as we already did the work to get the accountID


In [87]:
SouthAfricaSales_df4 = SouthAfricaSales_df3.rename(columns = {'AccountID':'Serial'})

In [88]:
SingaporeSales_df3['KeyManufNo_SalesOrg'] = SingaporeSales_df3['Manufacturer Number'].astype(str) + 'Singapore'

In [89]:
SingaporeSales_df4 = SingaporeSales_df3.rename(columns = {'Manufacturer Number':'Serial'})

In [90]:
SingaporeSales_df4.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6366 entries, 0 to 6365
Data columns (total 6 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Serial                  6366 non-null   object 
 1   quantity                6366 non-null   float64
 2   Sales_one_Month_avg     6366 non-null   float64
 3   Sales_three_months_avg  6366 non-null   float64
 4   Sales_six_months_avg    6366 non-null   float64
 5   KeyManufNo_SalesOrg     6366 non-null   object 
dtypes: float64(4), object(2)
memory usage: 348.1+ KB


VendonData_df3['KeyManufNo_SalesOrg'] = VendonData_df3['serial'].astype(str) + VendonData_df3['SalesOrg'].astype(str)

VendonData_df3=VendonData_df3.drop(columns=['SalesOrg'])
VendonData_df3.head()

In [91]:
Concat_Sales = pd.concat([RussiaSalesData_df3, PakistanSales_df3, SouthAfricaSales_df4, SingaporeSales_df4, MalaysiaSales_df3])
Concat_Sales.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 68572 entries, 0 to 15981
Data columns (total 6 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Serial                  68572 non-null  object 
 1   quantity                68572 non-null  float64
 2   Sales_one_Month_avg     68572 non-null  float64
 3   Sales_three_months_avg  68572 non-null  float64
 4   Sales_six_months_avg    68572 non-null  float64
 5   KeyManufNo_SalesOrg     68572 non-null  object 
dtypes: float64(4), object(2)
memory usage: 3.7+ MB


In [92]:
Concat_Sales['(lst_mth-6mth)/6mth'] = Concat_Sales.apply(lambda x: 0 if x['Sales_six_months_avg'] <= 0 else (x['Sales_one_Month_avg']-x['Sales_six_months_avg'])/x['Sales_six_months_avg'], axis=1)

Concat_Sales['3mth-6mth)/6mth'] = Concat_Sales.apply(lambda x: 0 if x['Sales_six_months_avg'] <= 0 else (x['Sales_three_months_avg']-x['Sales_six_months_avg'])/x['Sales_six_months_avg'], axis=1)

In [93]:
Concat_Sales.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 68572 entries, 0 to 15981
Data columns (total 8 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Serial                  68572 non-null  object 
 1   quantity                68572 non-null  float64
 2   Sales_one_Month_avg     68572 non-null  float64
 3   Sales_three_months_avg  68572 non-null  float64
 4   Sales_six_months_avg    68572 non-null  float64
 5   KeyManufNo_SalesOrg     68572 non-null  object 
 6   (lst_mth-6mth)/6mth     68572 non-null  float64
 7   3mth-6mth)/6mth         68572 non-null  float64
dtypes: float64(6), object(2)
memory usage: 4.7+ MB


In [94]:
Concat_Sales.tail(5)

Unnamed: 0,Serial,quantity,Sales_one_Month_avg,Sales_three_months_avg,Sales_six_months_avg,KeyManufNo_SalesOrg,(lst_mth-6mth)/6mth,3mth-6mth)/6mth
15977,T573424,1104.27,0.0,39.45,184.045,T573424Malaysia,-1.0,-0.78565
15978,T573425,2423.145633,374.281441,388.879459,403.857606,T573425Malaysia,-0.073234,-0.037088
15979,T573426,359.4,0.0,0.0,59.9,T573426Malaysia,-1.0,-1.0
15980,T573427,4612.954324,294.736667,704.707014,768.825721,T573427Malaysia,-0.61664,-0.083398
15981,VZCA 00001,5386.958367,828.944922,917.998386,897.826395,VZCA 00001Malaysia,-0.07672,0.022468


Need to change the type otherwise cannot merge correctly with manuf number

TelemetronData_df3['serial'] = TelemetronData_df3['serial'].astype(str)

BeverageMachine7_wTickets_df['Manufacturer Number'] = BeverageMachine7_wTickets_df['Manufacturer Number'].astype(str)

w=aaaf.loc[aaaf['Manufacturer Number']=='20172526377']
w

In [95]:
#Concat_Telemetry = pd.concat([TelemetronData_df3, Telemetry_aggSales_df3])
Concat_Telemetry = Telemetry_aggSales_df3
Concat_Telemetry.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 31473 entries, 0 to 31472
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   serial            31473 non-null  object 
 1   quantity          31473 non-null  int64  
 2   one_month_avg     31473 non-null  int64  
 3   three_months_avg  31473 non-null  float64
 4   six_months_avg    31473 non-null  float64
dtypes: float64(2), int64(2), object(1)
memory usage: 1.4+ MB


In [96]:
Concat_Telemetry['(lst_mth-6mth)/6mth'] = Concat_Telemetry.apply(lambda x: 0 if x['six_months_avg'] <= 0 else (x['one_month_avg']-x['six_months_avg'])/x['six_months_avg'], axis=1)

Concat_Telemetry['3mth-6mth)/6mth'] = Concat_Telemetry.apply(lambda x: 0 if x['six_months_avg'] <= 0 else (x['three_months_avg']-x['six_months_avg'])/x['six_months_avg'], axis=1)

## 10. Market Actions data

In [97]:
##Market Actions listed
MktActions = pd.read_excel(MktActions)
MktActions.head()

Unnamed: 0,Month,Serial ID,Sales Organisation,Parent Installation Point ID,Actions,Actions linked to churn predictions,Comments,CA Comments,Actions proposed by SBU
0,2021-11-30,34F6401007,Nestle UK,7326,Other,Yes,CA Feedback Required,,No action Yet
1,2021-11-30,16E0031901,Nestle UK,11955,Removal planned,Yes,,,
2,2021-11-30,17E0020640,Nestle UK,8151,Removal planned,Yes,,,
3,2021-11-30,10238090,Nestle UK,IP-11722,Removal planned,Yes,,,
4,2021-11-30,101810133,Nestle UK,4915,Other,Yes,CA Feedback Required,,


In [98]:
MktActions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1151 entries, 0 to 1150
Data columns (total 9 columns):
 #   Column                               Non-Null Count  Dtype         
---  ------                               --------------  -----         
 0   Month                                1151 non-null   datetime64[ns]
 1   Serial ID                            1151 non-null   object        
 2   Sales Organisation                   1151 non-null   object        
 3   Parent Installation Point ID         1151 non-null   object        
 4   Actions                              1150 non-null   object        
 5   Actions linked to churn predictions  1151 non-null   object        
 6   Comments                             837 non-null    object        
 7   CA Comments                          29 non-null     object        
 8   Actions proposed by SBU              13 non-null     object        
dtypes: datetime64[ns](1), object(8)
memory usage: 81.1+ KB


In [99]:
#add key serial + sales org?
#One hot encoding
def preprocess_MktActions(df):
    # Work on a copy
    df = df.copy()

    nomi_vars = ['Actions']
    
    dummy_columns = nomi_vars
        
    df = pd.get_dummies(df, columns=dummy_columns)

    return df

MktActions_prep = preprocess_MktActions(MktActions)
MktActions_prep.head()

Unnamed: 0,Month,Serial ID,Sales Organisation,Parent Installation Point ID,Actions linked to churn predictions,Comments,CA Comments,Actions proposed by SBU,Actions_Churn risk reason unknown,Actions_Data corrected,...,Actions_Removed,Actions_Reviewed and no action Required,Actions_Reviewed and no actions required,Actions_Seasonal Machine,Actions_Telemetry installed,Actions_Upgrade machine installed,Actions_Visit completed,Actions_Visit/Call planned,Actions_removed,Actions_tagging update
0,2021-11-30,34F6401007,Nestle UK,7326,Yes,CA Feedback Required,,No action Yet,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2021-11-30,16E0031901,Nestle UK,11955,Yes,,,,0,0,...,0,0,0,0,0,0,0,0,0,0
2,2021-11-30,17E0020640,Nestle UK,8151,Yes,,,,0,0,...,0,0,0,0,0,0,0,0,0,0
3,2021-11-30,10238090,Nestle UK,IP-11722,Yes,,,,0,0,...,0,0,0,0,0,0,0,0,0,0
4,2021-11-30,101810133,Nestle UK,4915,Yes,CA Feedback Required,,,0,0,...,0,0,0,0,0,0,0,0,0,0


In [100]:
MktActions_prep.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1151 entries, 0 to 1150
Data columns (total 28 columns):
 #   Column                                    Non-Null Count  Dtype         
---  ------                                    --------------  -----         
 0   Month                                     1151 non-null   datetime64[ns]
 1   Serial ID                                 1151 non-null   object        
 2   Sales Organisation                        1151 non-null   object        
 3   Parent Installation Point ID              1151 non-null   object        
 4   Actions linked to churn predictions       1151 non-null   object        
 5   Comments                                  837 non-null    object        
 6   CA Comments                               29 non-null     object        
 7   Actions proposed by SBU                   13 non-null     object        
 8   Actions_Churn risk reason unknown         1151 non-null   uint8         
 9   Actions_Data corrected        

In [101]:
MktActions_prep2=MktActions_prep.drop(columns=['Sales Organisation','Parent Installation Point ID', 'Month'])
MktActions_prep3 = MktActions_prep2.groupby(['Serial ID']).sum()
MktActions_prep3.head()

  MktActions_prep3 = MktActions_prep2.groupby(['Serial ID']).sum()


Unnamed: 0_level_0,Actions_Churn risk reason unknown,Actions_Data corrected,Actions_Downgrade machine installed,Actions_Lack of data discipline,Actions_New contract,Actions_Other,Actions_Out of order,Actions_Phone Call completed,Actions_Removal Plan,Actions_Removal planned,Actions_Removed,Actions_Reviewed and no action Required,Actions_Reviewed and no actions required,Actions_Seasonal Machine,Actions_Telemetry installed,Actions_Upgrade machine installed,Actions_Visit completed,Actions_Visit/Call planned,Actions_removed,Actions_tagging update
Serial ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1
24606,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1895151,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
10238090,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
10238091,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
10238092,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0


In [102]:
MktActions_prep3.info()

<class 'pandas.core.frame.DataFrame'>
Index: 930 entries, 24606 to EM9933
Data columns (total 20 columns):
 #   Column                                    Non-Null Count  Dtype
---  ------                                    --------------  -----
 0   Actions_Churn risk reason unknown         930 non-null    uint8
 1   Actions_Data corrected                    930 non-null    uint8
 2   Actions_Downgrade machine installed       930 non-null    uint8
 3   Actions_Lack of data discipline           930 non-null    uint8
 4   Actions_New contract                      930 non-null    uint8
 5   Actions_Other                             930 non-null    uint8
 6   Actions_Out of order                      930 non-null    uint8
 7   Actions_Phone Call completed              930 non-null    uint8
 8   Actions_Removal Plan                      930 non-null    uint8
 9   Actions_Removal planned                   930 non-null    uint8
 10  Actions_Removed                           930 non-null    ui

### (b) Plan to manage and process the data <a class="anchor" id="ManageData"></a>

I will extract the data into excel or csv format and upload it to python.

I can merge the data of the different files together

The data is checked monthly and has been created to be linked together

Columns useful to link the datasets together :

    'Product ID [Machine Model ID]'

    'Manufacturer Number'

    'BMB/C4C Model code'

    'M1'
    
    'Manufacture Serial Number'
    
    'Serial ID'

I need to find a way to have one line per machine per month for telemetry data and Placement Tickets

The main idea behind the use of telemetry data here is to check if we can see for example a relation between churn and a the number of cup sales.

I will not use all the features. Below are the features I am planning to use for the two biggest dataset :

19 columns from Beverage Machine data :

['Serial ID', 'Sales Organisation', 'Machine Status Groupings', 'User Status', 'TA Contract Installation Date', 'Depreciation Start',
'Position', 'TA Contract Start Date', 'TA Contract End Date', 'TA Usage Indicator',
'Account ABC Classification (Account ID)', 'Industry (Account ID)', 'Industry Code 1 (Account ID)',  'Account ABC Classification (EC ID)', 'Industry (EC ID)',
'Industry Code 1 (EC ID)', 'Parent Installation Point ID', 'Registered Product Category (Registered Product ID)', 'Calendar Date']

14 columns coming from the Beverage Classification data :

['Model', 'Model Vendor', 'Model Category', 'Model Group', 'Beverage Temperature',
'System Brands', 'Ingredient Format', 'Machine Type', 'Positionning', 'Generation',
'Blueprint Throughput', 'IP Ownership', 'Trading Partner', 'G/R/M TB']

### Beverage Machine data features and the Beverage Classification data features

##### Serial ID                                              
	Unique per machine and allows to link to the Tickets placements

##### Sales Organisation                                     
	Usually a Sales Organisation corresponds to a Country

##### Product ID [Machine Model ID]                          
	Code that allows us to link it to the intermediary mapping table which contains all the details for each machine

##### Machine Status Groupings                               
	Status of the Machine shows if a machine is :
		Deployed
		Idle
		Other

##### User Status                                            
	More detailed than status groupings

##### Depreciation Start                                     
	Date when the machine started to display cup

##### Manufacturer Number                                    
	Code that allows us to link to the telemetry data


##### Position                                               
	Can tell if a machine is a:
		RENT,
		Sale,
		Loan,
		Demo,
		etc.,
##### TA Contract Installation Date
    Date when the machine was installed, different than depreciation start because a machine can be installed but could have already dispensed cups in another Installation Point

##### TA Contract Start Date                                 
	Date when the contract started
    
##### TA Contract End Date                                  
	Date when the contract ended
    
##### TA Usage Indicator                                     
	Can have several usage:
		5 Monthly Rental
		Not assigned
		Trial / Evaluation
		7 Annual / Periodic

##### Account ABC Classification (Account ID)                
	Can help to identify in which Channel is the Account
    
##### Industry (Account ID)                                  
	Can help to identify in which Channel is the Account
    
##### Industry Code 1 (Account ID)                           
	Can help to identify in which Channel is the Account

##### Account ABC Classification (EC ID)                     
	Can help to identify in which Channel is the End Customer
    
##### Industry (EC ID)                                       
	Can help to identify in which Channel is the End Customer
    
##### Industry Code 1 (EC ID)                                
	Can help to identify in which Channel is the End Customer

##### Parent Installation Point ID                           
	Help to identify if a machine is still deployed in the same location by the same customer, it is the Installation Point ID we were talking before.

##### Registered Product Category (Registered Product ID)    
	Details of the category within our group

##### Calendar Date                                          
	Date when we extracted the data of the machine
    
##### BMB/C4C Model code                                     
	Code that allows to link the intermediary mapping table to the beverage machine data

##### M1                                                     
	Name of the harmonized model and used to link the intermediary mapping to the mapping file with unique model
    
##### Model                                                  
	Name of the harmonized model and used to link the intermediary mapping to the mapping file with unique model

##### Model Vendor                                           
	Name of the vendor of the coffee machine

##### Model Category                                         
	Category of the model
		
##### Model Group                                            
	Group of the model

##### Beverage Temperature                                   
	Temperature of the beverage

##### System Brands                                         
	Brand internal classification

##### Ingredient Format                                     
	Format of the ingredient

##### Machine Type                                           
	Type of Machine

##### Positionning                                           
	Positionning of the machine

##### Generation                                             
	Generation of the machine

##### Blueprint Throughput                                   
    Type of throughput

##### IP Ownership                                           
    Ownership type

##### Trading Partner                                        
	Type of Trading Partner

##### G/R/M TB                                               
	How it is managed by the market 

Useless data not really explaining the model :

##### not used columns : 32

User Status Last Changed On                            
Product [Machine Model]                                
	Name of the machine Model 
Range Brand                                           
	Brand of the model
    
EC ID                                                  
    We can identify the end customer with this number, some can have more than one machine
    
	Can be transformed into #Machine for this customer

Equipment Number                                       
Asset Number                                           
TA Contract Number                                   
Account ID                                          
Ship To ID                                     
EC Name                                           
Sales Org ID (Installation Point)                  
Model Harmonized                                    	
Comments                                          
Source                                              
Global Projects                                    
	Machine that are part of a project :
		Roastelier
		Alegria
		Nitro
		Milano
		EZCare
		Express
		CoolPro
Toolbox                                               
Non-Toolbox Reason                 
Product                                       
Type.                              
Machines Models (Harmonized)                     
Solution Brands                 
Toolbox 2019                                       
Toolbox 2018                                     
Toolbox 2017                                        
Trade Assets                                      
Active for Procurement (2017)                       
Idle Available Stock Type                          
Modified                                          
Modified By                                     
Created                                            
Item Type                                       
Path

### Placement Tickets data features

##### Service Category
    Tell if the machine was :
        Installed
        Removed
        Replaced

##### Completion Date
    Date when the Ticket was done, we will not use it since we will aggregate on the number of tickets without the time dimension

##### Incident Category
    Reason of the Ticket, details about the incident or ticket

##### Serial ID
    In order to link to the Beverage Machine data

### Telemetry data features

##### quantity 
    Sales quantity

##### serial 	
    ID that allows us to map a to the manufacturer number of the beverage machines

##### columns not used :

Month

    Month of the sales

stockId

    Each machines has a button linked to an ID and by mapping this ID to the related product when can know which type of cup was sold, yet the machines is not working for every machines, so the columns product might be wrong

Column1 	

    unknown Id

Averages 	

    unknown average

inactive 	

    unknown column

machine_id2 

    unknown Id

Product

    type of cup sold (mapping is not ready for every machines yet)

We will use only the sales quantity and the serial to link to the Beverage Machine data. The other columns are either not useful or not satisfying minimum requirements on accuracy of data (bad data)

### Missing data

In [103]:
# TA Contract Installation Date
BevMachMissingInstDate = BeverageMachine_df.loc[BeverageMachine_df['TA Contract Installation Date']=='#']['TA Contract Installation Date'].count()
TotBevMach = BeverageMachine_df['Serial ID'].count()

# TA Contract Start Date
BevMachMissingStartDate = BeverageMachine_df.loc[BeverageMachine_df['TA Contract Start Date']=='#']['TA Contract Start Date'].count()

# TA Contract End Date 
BevMachMissingEndDate = BeverageMachine_df.loc[BeverageMachine_df['TA Contract End Date']=='#']['TA Contract End Date'].count()

# Depreciation Start
BevMachMissingDepStartDate = BeverageMachine_df.loc[BeverageMachine_df['Depreciation Start']=='#']['Depreciation Start'].count()


print('Beverage machines missing Installation Date : ', BevMachMissingInstDate, ', which corresponds to ', 100*round(BevMachMissingInstDate/TotBevMach,2), '%')
print('Beverage machines missing Start Date : ', BevMachMissingStartDate, ', which corresponds to ', 100*round(BevMachMissingStartDate/TotBevMach,2), '%')
print('Beverage machines missing End Date : ', BevMachMissingEndDate, ', which corresponds to ', 100*round(BevMachMissingEndDate/TotBevMach,2), '%')
print('Beverage machines missing Depreciation Start Date : ', BevMachMissingDepStartDate, ', which corresponds to ', 100*round(BevMachMissingDepStartDate/TotBevMach,2), '%')


Beverage machines missing Installation Date :  5426386 , which corresponds to  81.0 %
Beverage machines missing Start Date :  5426386 , which corresponds to  81.0 %
Beverage machines missing End Date :  5426378 , which corresponds to  81.0 %
Beverage machines missing Depreciation Start Date :  4 , which corresponds to  0.0 %


##### Telemetry data
Even if the number of beverage machines equiped with telemetry data is increasing the data available is still low and should be seen as a complement.

In August 2020 only around 200 beverage machines have telemetry data and are already in the new system from which we got Beverage Machine data and we have around 60'000 beverage machines.


##### Placement Tickets data

27'318 beverage machines does not provide any Placement tickets


##### Date features missing

We see that sometimes the date is not filled for Installation Date, Start Date and End Date

#### Visits data
A visit is linked to an account and a machine "Account ID" can be linked to a visit "Account ID.Account ID Level 01.Key" maybe a key with the Sales Org in case it is unique only by market

#### Phone Calls data
A phone Call is linked to an account. Link "Account Name" from phone call with "Account ID" of the machine.

## Preparation of the data<a class="anchor" id="prep"></a>

### a) Details of preparation<a class="anchor" id="det"></a>

#### Beverage Machine data preparation

The goal is to get the actual maximal date of each Serial ID

If a machine has a maximal date that is lower than (or not equal to) the latest snapshot date, then the machine has churned.

We will look at the max date per installation point because when we lose an installation point we lose the customer. 

A machine can be realocated to another customer.

Keep only the latest month of data



In [104]:
BeverageMachine_df['Calendar Date'] = pd.to_datetime(BeverageMachine_df['Calendar Date'], errors =  'coerce')

In [105]:
BeverageMachine_df.tail()

Unnamed: 0.1,Unnamed: 0,Sales Organisation,User Status Last Changed On,Product [Machine Model],Product ID [Machine Model ID],Range Brand,Machine Status Groupings,User Status,Depreciation Start,Serial ID,...,Industry Code 1 (Account ID),Account ABC Classification (EC ID),Industry (EC ID),Industry Code 1 (EC ID),Parent Installation Point ID,Registered Product Category (Registered Product ID),Sales Org ID (Installation Point),SAP Material Line Code [Machine Model ID],Calendar Date,Key_ManufacturerID_SalesOrg
7899384,3201311,Nestle South Africa,44460,NESCAFE MILANO 8/60 H6 AR,100118541,MILANO,Deployed,Installed,40626,ZA4188,...,060502 Factory,06 Out of Home,0605 Business/Industry,060501 Office Leasing Ctr,1079926,Trade Asset w/ Fixed Asset,ZA10,90083851,2023-12-31,3897301Nestle South Africa
7899385,3201312,Nestle South Africa,44460,NESCAFE MILANO 8/60 H6 AR,100118541,MILANO,Deployed,Installed,40695,ZA5014,...,060501 Office Leasing Ctr,06 Out of Home,0605 Business/Industry,060501 Office Leasing Ctr,1080068,Trade Asset w/ Fixed Asset,ZA10,90083851,2023-12-31,3896790Nestle South Africa
7899386,3201313,Nestle South Africa,44460,NESCAFE MILANO 8/60 H6 AR,100118541,MILANO,Deployed,Installed,41334,ZA4631,...,060501 Office Leasing Ctr,06 Out of Home,0605 Business/Industry,060501 Office Leasing Ctr,1080064,Trade Asset w/ Fixed Asset,ZA10,90083851,2023-12-31,3896790Nestle South Africa
7899387,3201314,Nestle South Africa,44460,NESCAFE ALEGRIA Base Cabinet,100118550,ALEGRIA,Deployed,Installed,42005,ZA14050,...,060501 Office Leasing Ctr,06 Out of Home,0605 Business/Industry,060501 Office Leasing Ctr,1080336,Trade Asset Accessory,ZA10,90083852,2023-12-31,3896790Nestle South Africa
7899388,3201315,Nestle South Africa,44460,NESCAFE ALEGRIA Base Cabinet,100118550,ALEGRIA,Deployed,Installed,42370,ZA14115,...,060501 Office Leasing Ctr,06 Out of Home,0605 Business/Industry,060501 Office Leasing Ctr,1079951,Trade Asset Accessory,ZA10,90083852,2023-12-31,3896944Nestle South Africa


BeverageMachine_df1 = BeverageMachine_df.copy()
BeverageMachine_df1 = BeverageMachine_df1.groupby(['Parent Installation Point ID'])


In [106]:
BeverageMachine_df1 = BeverageMachine_df.copy()
BeverageMachine_df1['Product ID [Machine Model ID]'] = BeverageMachine_df1['Product ID [Machine Model ID]'].astype(str)
#BeverageMachine_df2 = BeverageMachine_df1.groupby(['Parent Installation Point ID']).agg({'Calendar Date' : [np.min, np.max]})

#BeverageMachine_df1['Calendar Date2'] = BeverageMachine_df1['Calendar Date']

#BeverageMachine_df2 = BeverageMachine_df1.groupby(['Parent Installation Point ID']).agg({'Calendar Date' : 'min', 'Calendar Date2' : 'max'})

In [107]:
BeverageMachine_df2 = pd.merge(BeverageMachine_df1, BevMap_df, how='left', left_on = ['Product ID [Machine Model ID]'], right_on = ['ID Model Code'])
BeverageClassification1_df = BeverageClassification_df.drop_duplicates(['Model'])
BeverageMachine_df3 = pd.merge(BeverageMachine_df2, BeverageClassification1_df, how='left', left_on = ['Model'], right_on = ['Model']) 

In [108]:
BeverageMachine_df3 = BeverageMachine_df3.query("`Model` != 'Accessories'")

In [109]:
BeverageMachine_df3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5844059 entries, 0 to 6737849
Data columns (total 73 columns):
 #   Column                                               Dtype         
---  ------                                               -----         
 0   Unnamed: 0                                           int64         
 1   Sales Organisation                                   object        
 2   User Status Last Changed On                          object        
 3   Product [Machine Model]                              object        
 4   Product ID [Machine Model ID]                        object        
 5   Range Brand                                          object        
 6   Machine Status Groupings                             object        
 7   User Status                                          object        
 8   Depreciation Start                                   object        
 9   Serial ID                                            object        
 10  Manufa

Another way to get min and max date e.g. "I wanted to create a new data frame where I can get min value in the column Numb if my string in the column Word is ab and max value if my string is bc for each Date. " :

s=df.groupby(['Date','Word']).Numb.agg(['min','max'])

s['number']=np.where(s.index.get_level_values(1)=='ab',s.min(1),s.max(1))

df11 =BeverageMachine_df.copy()
df22 = df11.reset_index()
df22.loc[df22.groupby('Parent Installation Point ID')['Calendar Date'].idxmin()]
df22.info()

In [110]:
# Sort the dataFrame by 'Calendar Date' and then remove duplicates :
BM_Maxdate_IPID2 = BeverageMachine_df3.sort_values('Calendar Date', ascending=False).drop_duplicates(['Parent Installation Point ID'])
BM_Maxdate_IPID2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 253965 entries, 813777 to 1211381
Data columns (total 73 columns):
 #   Column                                               Non-Null Count   Dtype         
---  ------                                               --------------   -----         
 0   Unnamed: 0                                           253965 non-null  int64         
 1   Sales Organisation                                   253965 non-null  object        
 2   User Status Last Changed On                          253958 non-null  object        
 3   Product [Machine Model]                              253965 non-null  object        
 4   Product ID [Machine Model ID]                        253965 non-null  object        
 5   Range Brand                                          253965 non-null  object        
 6   Machine Status Groupings                             253965 non-null  object        
 7   User Status                                          253965 non-null

The columns allowing to link datasets should have the same format otherwise it might not work properly if one has a string format and the other a numerical format  

In [111]:
BeverageMachine1_df = BM_Maxdate_IPID2
BeverageMachine1_df['Product ID [Machine Model ID]']=BeverageMachine1_df['Product ID [Machine Model ID]'].astype(str)

In [112]:
count = BeverageMachine1_df[BeverageMachine1_df['Sales Organisation'] == 'Nestlé Russia'].shape[0]
print("Number of rows with 'Sales Organisation' as 'Nestlé Russia':", count)

count = BeverageMachine1_df[(BeverageMachine1_df['Sales Organisation'] == 'Nestlé Russia') & (BeverageMachine1_df['Machine Status Groupings'] == 'Deployed')].shape[0]
print("Number of rows with 'Sales Organisation' as 'Nestlé Russia' and 'Machine Status Groupings' as 'Deployed':", count)

Number of rows with 'Sales Organisation' as 'Nestlé Russia': 26794
Number of rows with 'Sales Organisation' as 'Nestlé Russia' and 'Machine Status Groupings' as 'Deployed': 10935


In [113]:
deployed_count = BeverageMachine1_df[BeverageMachine1_df['Machine Status Groupings'] == 'DEPLOYED'].shape[0]
print("Number of machines with status 'DEPLOYED':", deployed_count)

Number of machines with status 'DEPLOYED': 20159


In [114]:
#TODELETE
#Snowflake values are different and in Upper case, it creates a problem when I filter out machines that are not "Deployed"
BeverageMachine1_df['Machine Status Groupings'] = BeverageMachine1_df['Machine Status Groupings'].replace({'DEPLOYED': 'Deployed', 'IDLE': 'Idle', 'OTHER': 'Other'})

#Snowflake has a different Upper letters value and for one month of data the algorithm can know if the data comes from Snowflake and adapt the algorithm to know these machines have not churned, to be deleted once all the data comes from Snowflake
BeverageMachine1_df['User Status'] = BeverageMachine1_df['User Status'].replace({'IN PREPARATION': 'In Preparation', 'TO BE ASSIGNED': 'To be Assigned', 'TO BE DESTROYED': 'To be Destroyed', 'IN REPAIR': 'In Repair', 'INSTALLED': 'Installed', 'UNDER INSTALLATION': 'Under Installation', 'MISSING': 'Missing', 'STATUS TO BE CORRECTED IN ERP': 'Status to be corrected in ERP', 'TO BE REMOVED': 'To be Removed'})
BeverageMachine1_df['TA Usage Indicator'] = BeverageMachine1_df['TA Usage Indicator'].replace({'Monthly Rental': '5 Monthly Rental', '': 'Not assigned'})


In [115]:
BeverageMachine1_df = BeverageMachine1_df.loc[BeverageMachine1_df['Machine Status Groupings']=="Deployed"]

Merge the Beverage Machine data with the Beverage Mapping in order to get the related "Harmonized Model" of the "Beverage Machine Classification data" and later merge together the "Beverage Machine data" with the "Beverage Classification data"

We should do a cleaning step in order to keep only the machine having the 'Parent Installation Point ID' filled and remove duplicates, but not for 'Serial ID'

In [116]:
BeverageMachine4_df = BeverageMachine1_df.loc[BeverageMachine1_df['Parent Installation Point ID']!="#"].drop_duplicates(['Parent Installation Point ID'])

In [117]:
BeverageMachine4_df = BeverageMachine4_df.loc[BeverageMachine4_df['Serial ID']!="#"]


In [118]:
BeverageMachine4_df.columns

Index(['Unnamed: 0', 'Sales Organisation', 'User Status Last Changed On',
       'Product [Machine Model]', 'Product ID [Machine Model ID]',
       'Range Brand', 'Machine Status Groupings', 'User Status',
       'Depreciation Start', 'Serial ID', 'Manufacturer Number',
       'Equipment Number', 'Asset Number', 'Position', 'TA Contract Number',
       'TA Contract Installation Date', 'TA Contract Start Date',
       'TA Contract End Date', 'TA Usage Indicator', 'Account ID',
       'Ship To ID', 'EC ID', 'EC Name', 'City', 'State', 'Postal Code',
       'Account ABC Classification (Account ID)', 'Industry (Account ID)',
       'Industry Code 1 (Account ID)', 'Account ABC Classification (EC ID)',
       'Industry (EC ID)', 'Industry Code 1 (EC ID)',
       'Parent Installation Point ID',
       'Registered Product Category (Registered Product ID)',
       'Sales Org ID (Installation Point)',
       'SAP Material Line Code [Machine Model ID]', 'Calendar Date',
       'Key_ManufacturerID

In [119]:
                   
                    
BeverageMachine5_df = BeverageMachine4_df[['Serial ID', 'Sales Organisation', 'Machine Status Groupings', 'User Status', 
                    'TA Contract Installation Date', 'Depreciation Start', 'Manufacturer Number', 'Position', 
                    'TA Contract Start Date', 'TA Contract End Date', 'TA Usage Indicator',
                    'Account ID',
                    'EC ID', 'EC Name', 'Account ABC Classification (Account ID)', 'Industry (Account ID)', 
                    'Industry Code 1 (Account ID)', 'Account ABC Classification (EC ID)', 
                    'Industry (EC ID)', 'Industry Code 1 (EC ID)', 'Parent Installation Point ID', 
                    'Registered Product Category (Registered Product ID)', 
                    'Model', 'Model Vendor', 'Model Category', 'Model Group', 
                    'Beverage Temperature', 'System Brands', 'Ingredient Format', 
                    'Machine Type', 'Positionning', 'Generation', 'Blueprint Throughput', 
                    'IP Ownership', 'Calendar Date', 'Key_ManufacturerID_SalesOrg', 'City', 'State', 'Postal Code']]
BeverageMachine5_df.head()

Unnamed: 0,Serial ID,Sales Organisation,Machine Status Groupings,User Status,TA Contract Installation Date,Depreciation Start,Manufacturer Number,Position,TA Contract Start Date,TA Contract End Date,...,Machine Type,Positionning,Generation,Blueprint Throughput,IP Ownership,Calendar Date,Key_ManufacturerID_SalesOrg,City,State,Postal Code
896110,153933605,Néstlé Jordania,Deployed,Installed,#,42339,20153933605,LOAN,#,#,...,Table Tops,Mainstream,Gen. 1,Medium,Proprietary,2024-04-30,20153933605Néstlé Jordania,Amman,JO/Not assigned,11885
896122,160708192,Néstlé Jordania,Deployed,Installed,#,42491,20160708192,LOAN,#,#,...,Table Tops,Mainstream,Gen. 1,Medium,Proprietary,2024-04-30,20160708192Néstlé Jordania,Amman,JO/Not assigned,0
896121,160708189,Néstlé Jordania,Deployed,Installed,#,42491,20160708189,LOAN,#,#,...,Table Tops,Mainstream,Gen. 1,Medium,Proprietary,2024-04-30,20160708189Néstlé Jordania,مادبا,JO/Not assigned,#
896120,153933611,Néstlé Jordania,Deployed,Installed,#,42461,20153933611,LOAN,#,#,...,Table Tops,Mainstream,Gen. 1,Medium,Proprietary,2024-04-30,20153933611Néstlé Jordania,عمان,JO/Not assigned,#
896119,153933610,Néstlé Jordania,Deployed,Installed,#,42461,20153933610,LOAN,#,#,...,Table Tops,Mainstream,Gen. 1,Medium,Proprietary,2024-04-30,20153933610Néstlé Jordania,Amman,JO/Not assigned,0


If 'Calendar Date' is smaller than the 'ChurnDate2' it means that it has not churned

In [120]:
BeverageMachine5_df['Calendar Date'] = pd.to_datetime(BeverageMachine5_df['Calendar Date'])
BeverageMachine5_df.info()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  BeverageMachine5_df['Calendar Date'] = pd.to_datetime(BeverageMachine5_df['Calendar Date'])


<class 'pandas.core.frame.DataFrame'>
Int64Index: 239400 entries, 896110 to 1211381
Data columns (total 39 columns):
 #   Column                                               Non-Null Count   Dtype         
---  ------                                               --------------   -----         
 0   Serial ID                                            239400 non-null  object        
 1   Sales Organisation                                   239400 non-null  object        
 2   Machine Status Groupings                             239400 non-null  object        
 3   User Status                                          239400 non-null  object        
 4   TA Contract Installation Date                        236317 non-null  object        
 5   Depreciation Start                                   239003 non-null  object        
 6   Manufacturer Number                                  239388 non-null  object        
 7   Position                                             238555 non-null

In [121]:
np.where(BeverageMachine5_df['Calendar Date']< ChurnDate2, True, False)
BeverageMachine5_df.head()

Unnamed: 0,Serial ID,Sales Organisation,Machine Status Groupings,User Status,TA Contract Installation Date,Depreciation Start,Manufacturer Number,Position,TA Contract Start Date,TA Contract End Date,...,Machine Type,Positionning,Generation,Blueprint Throughput,IP Ownership,Calendar Date,Key_ManufacturerID_SalesOrg,City,State,Postal Code
896110,153933605,Néstlé Jordania,Deployed,Installed,#,42339,20153933605,LOAN,#,#,...,Table Tops,Mainstream,Gen. 1,Medium,Proprietary,2024-04-30,20153933605Néstlé Jordania,Amman,JO/Not assigned,11885
896122,160708192,Néstlé Jordania,Deployed,Installed,#,42491,20160708192,LOAN,#,#,...,Table Tops,Mainstream,Gen. 1,Medium,Proprietary,2024-04-30,20160708192Néstlé Jordania,Amman,JO/Not assigned,0
896121,160708189,Néstlé Jordania,Deployed,Installed,#,42491,20160708189,LOAN,#,#,...,Table Tops,Mainstream,Gen. 1,Medium,Proprietary,2024-04-30,20160708189Néstlé Jordania,مادبا,JO/Not assigned,#
896120,153933611,Néstlé Jordania,Deployed,Installed,#,42461,20153933611,LOAN,#,#,...,Table Tops,Mainstream,Gen. 1,Medium,Proprietary,2024-04-30,20153933611Néstlé Jordania,عمان,JO/Not assigned,#
896119,153933610,Néstlé Jordania,Deployed,Installed,#,42461,20153933610,LOAN,#,#,...,Table Tops,Mainstream,Gen. 1,Medium,Proprietary,2024-04-30,20153933610Néstlé Jordania,Amman,JO/Not assigned,0


In [122]:
columnwithfalse = False
BeverageMachine6_df=BeverageMachine5_df.copy()
BeverageMachine6_df['Churn'] = columnwithfalse
BeverageMachine6_df.head()

Unnamed: 0,Serial ID,Sales Organisation,Machine Status Groupings,User Status,TA Contract Installation Date,Depreciation Start,Manufacturer Number,Position,TA Contract Start Date,TA Contract End Date,...,Positionning,Generation,Blueprint Throughput,IP Ownership,Calendar Date,Key_ManufacturerID_SalesOrg,City,State,Postal Code,Churn
896110,153933605,Néstlé Jordania,Deployed,Installed,#,42339,20153933605,LOAN,#,#,...,Mainstream,Gen. 1,Medium,Proprietary,2024-04-30,20153933605Néstlé Jordania,Amman,JO/Not assigned,11885,False
896122,160708192,Néstlé Jordania,Deployed,Installed,#,42491,20160708192,LOAN,#,#,...,Mainstream,Gen. 1,Medium,Proprietary,2024-04-30,20160708192Néstlé Jordania,Amman,JO/Not assigned,0,False
896121,160708189,Néstlé Jordania,Deployed,Installed,#,42491,20160708189,LOAN,#,#,...,Mainstream,Gen. 1,Medium,Proprietary,2024-04-30,20160708189Néstlé Jordania,مادبا,JO/Not assigned,#,False
896120,153933611,Néstlé Jordania,Deployed,Installed,#,42461,20153933611,LOAN,#,#,...,Mainstream,Gen. 1,Medium,Proprietary,2024-04-30,20153933611Néstlé Jordania,عمان,JO/Not assigned,#,False
896119,153933610,Néstlé Jordania,Deployed,Installed,#,42461,20153933610,LOAN,#,#,...,Mainstream,Gen. 1,Medium,Proprietary,2024-04-30,20153933610Néstlé Jordania,Amman,JO/Not assigned,0,False


In [123]:
#BeverageMachine6_df['Churn'] = np.where((BeverageMachine5_df['Calendar Date_x']<BeverageMachine5_df['Calendar Date_y'])|
#                                (BeverageMachine5_df['Calendar Date_x'] == ChurnDate), False, True)

BeverageMachine6_df['Churn'] = np.where(BeverageMachine5_df['Calendar Date'] < ChurnDate2, True, False)
BeverageMachine6_df.loc[BeverageMachine6_df['Churn']==True].head()

Unnamed: 0,Serial ID,Sales Organisation,Machine Status Groupings,User Status,TA Contract Installation Date,Depreciation Start,Manufacturer Number,Position,TA Contract Start Date,TA Contract End Date,...,Positionning,Generation,Blueprint Throughput,IP Ownership,Calendar Date,Key_ManufacturerID_SalesOrg,City,State,Postal Code,Churn
631179,1116000895,Nestlé Italy IT35 OOH,Deployed,Installed,44748,43252,1116000895,RENT,44748,2958465,...,Mainstream,Gen. 1,Medium,Proprietary,2024-03-31,1116000895Nestlé Italy IT35 OOH,Vieste,Foggia,71019,True
631178,1016000435,Nestlé Italy IT35 OOH,Deployed,Installed,#,43252,1016000435,LOAN,#,#,...,Mainstream,Gen. 1,Medium,Proprietary,2024-03-31,1016000435Nestlé Italy IT35 OOH,Peschici,Foggia,71010,True
631209,16E0011742,Nestlé Italy IT35 OOH,Deployed,To be Removed,#,42552,20161212885,LOAN,#,#,...,Mainstream,Gen. 1,Medium,Proprietary,2024-03-31,20161212885Nestlé Italy IT35 OOH,Siracusa,Siracusa,96100,True
631213,17E0029086,Nestlé Italy IT35 OOH,Deployed,To be Removed,#,43252,20174039830,LOAN,#,#,...,Mainstream,Gen. 1,Medium,Proprietary,2024-03-31,20174039830Nestlé Italy IT35 OOH,Figline E Incisa Valdarno,Firenze,50063,True
631212,16E0014809,Nestlé Italy IT35 OOH,Deployed,To be Removed,#,42614,20161817983,LOAN,#,#,...,Mainstream,Gen. 1,Medium,Proprietary,2024-03-31,20161817983Nestlé Italy IT35 OOH,Siracusa,Siracusa,96100,True


In [124]:
BeverageMachine6_df.loc[BeverageMachine6_df['Churn']==False].head()

Unnamed: 0,Serial ID,Sales Organisation,Machine Status Groupings,User Status,TA Contract Installation Date,Depreciation Start,Manufacturer Number,Position,TA Contract Start Date,TA Contract End Date,...,Positionning,Generation,Blueprint Throughput,IP Ownership,Calendar Date,Key_ManufacturerID_SalesOrg,City,State,Postal Code,Churn
896110,153933605,Néstlé Jordania,Deployed,Installed,#,42339,20153933605,LOAN,#,#,...,Mainstream,Gen. 1,Medium,Proprietary,2024-04-30,20153933605Néstlé Jordania,Amman,JO/Not assigned,11885,False
896122,160708192,Néstlé Jordania,Deployed,Installed,#,42491,20160708192,LOAN,#,#,...,Mainstream,Gen. 1,Medium,Proprietary,2024-04-30,20160708192Néstlé Jordania,Amman,JO/Not assigned,0,False
896121,160708189,Néstlé Jordania,Deployed,Installed,#,42491,20160708189,LOAN,#,#,...,Mainstream,Gen. 1,Medium,Proprietary,2024-04-30,20160708189Néstlé Jordania,مادبا,JO/Not assigned,#,False
896120,153933611,Néstlé Jordania,Deployed,Installed,#,42461,20153933611,LOAN,#,#,...,Mainstream,Gen. 1,Medium,Proprietary,2024-04-30,20153933611Néstlé Jordania,عمان,JO/Not assigned,#,False
896119,153933610,Néstlé Jordania,Deployed,Installed,#,42461,20153933610,LOAN,#,#,...,Mainstream,Gen. 1,Medium,Proprietary,2024-04-30,20153933610Néstlé Jordania,Amman,JO/Not assigned,0,False


Check the data and modify it if it is not the correct type

In [125]:
e = BeverageMachine6_df.loc[BeverageMachine6_df['Serial ID']==7010054129]
e.iloc[:20,9:40]

Unnamed: 0,TA Contract End Date,TA Usage Indicator,Account ID,EC ID,EC Name,Account ABC Classification (Account ID),Industry (Account ID),Industry Code 1 (Account ID),Account ABC Classification (EC ID),Industry (EC ID),...,Positionning,Generation,Blueprint Throughput,IP Ownership,Calendar Date,Key_ManufacturerID_SalesOrg,City,State,Postal Code,Churn


In [126]:
BeverageMachine6_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 239400 entries, 896110 to 1211381
Data columns (total 40 columns):
 #   Column                                               Non-Null Count   Dtype         
---  ------                                               --------------   -----         
 0   Serial ID                                            239400 non-null  object        
 1   Sales Organisation                                   239400 non-null  object        
 2   Machine Status Groupings                             239400 non-null  object        
 3   User Status                                          239400 non-null  object        
 4   TA Contract Installation Date                        236317 non-null  object        
 5   Depreciation Start                                   239003 non-null  object        
 6   Manufacturer Number                                  239388 non-null  object        
 7   Position                                             238555 non-null

I want some date features to be integer instead of non-null object

In [127]:
# Date features
Date_Features = ['TA Contract Installation Date', 'Depreciation Start',  'TA Contract Start Date', 
                 'TA Contract End Date']

BeverageMachine7_df= BeverageMachine6_df.copy()

for x in Date_Features:
    BeverageMachine7_df[x] = pd.to_numeric(BeverageMachine7_df[x], errors='coerce').fillna(0).astype(int)

In [128]:
BeverageMachine7_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 239400 entries, 896110 to 1211381
Data columns (total 40 columns):
 #   Column                                               Non-Null Count   Dtype         
---  ------                                               --------------   -----         
 0   Serial ID                                            239400 non-null  object        
 1   Sales Organisation                                   239400 non-null  object        
 2   Machine Status Groupings                             239400 non-null  object        
 3   User Status                                          239400 non-null  object        
 4   TA Contract Installation Date                        239400 non-null  int32         
 5   Depreciation Start                                   239400 non-null  int32         
 6   Manufacturer Number                                  239388 non-null  object        
 7   Position                                             238555 non-null

#### Placement Tickets data preparation

In order to merge Placement Tickets data with Beverage Machine data I need to perform some preparations of the data.

I would like to have one row per Manufacture Serial Number and Month

Remove "Removal Ticket" because it is nearly like giving the information if the machine has churned. 
To be decided maybe I should remove it too.
I just kept "Seasonal Removal" because it helps to understand that it is a special case and a similar machine might not churn if it is not a Seasonal Removal

In [129]:
Placement_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 311784 entries, 0 to 311783
Data columns (total 3 columns):
 #   Column                         Non-Null Count   Dtype 
---  ------                         --------------   ----- 
 0   Service Category               311784 non-null  object
 1   INCIDENT_CATEGORY_DESCRIPTION  310479 non-null  object
 2   Serial ID                      311610 non-null  object
dtypes: object(3)
memory usage: 7.1+ MB


In [130]:
table1 = Placement_df.loc[Placement_df['Service Category']!="Removal"]
table1 = table1.loc[table1['Service Category']!="Removal."]
table2 = Placement_df.loc[Placement_df['INCIDENT_CATEGORY_DESCRIPTION']=="Seasonal Removal"]

In [131]:
table1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 188263 entries, 593 to 309434
Data columns (total 3 columns):
 #   Column                         Non-Null Count   Dtype 
---  ------                         --------------   ----- 
 0   Service Category               188263 non-null  object
 1   INCIDENT_CATEGORY_DESCRIPTION  187289 non-null  object
 2   Serial ID                      188157 non-null  object
dtypes: object(3)
memory usage: 5.7+ MB


In [132]:
Placement_df_wo_rem = pd.concat([table1,table2])
Placement_df_wo_rem.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 192351 entries, 593 to 309388
Data columns (total 3 columns):
 #   Column                         Non-Null Count   Dtype 
---  ------                         --------------   ----- 
 0   Service Category               192351 non-null  object
 1   INCIDENT_CATEGORY_DESCRIPTION  191377 non-null  object
 2   Serial ID                      192241 non-null  object
dtypes: object(3)
memory usage: 5.9+ MB


In [133]:
from xlrd.xldate import xldate_as_tuple
from dateutil.relativedelta import relativedelta

Placement_df_prep = Placement_df_wo_rem[['Serial ID', 'Service Category','INCIDENT_CATEGORY_DESCRIPTION']].copy()

Placement_df_prep.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 192351 entries, 593 to 309388
Data columns (total 3 columns):
 #   Column                         Non-Null Count   Dtype 
---  ------                         --------------   ----- 
 0   Serial ID                      192241 non-null  object
 1   Service Category               192351 non-null  object
 2   INCIDENT_CATEGORY_DESCRIPTION  191377 non-null  object
dtypes: object(3)
memory usage: 5.9+ MB


In [134]:
Placement_df_prep['Serial ID'] = Placement_df_prep['Serial ID'].astype('str')

In [135]:
def preprocess_f(df):
    # Work on a copy
    df = df.copy()

    nomi_vars = ['Service Category','INCIDENT_CATEGORY_DESCRIPTION']
                
    # Some columns could be also ordinal features but we will keep them as nominal features for the moment
    ##ordi_vars = ['Positionning', 'Generation',]
    
    dummy_columns = nomi_vars
        
    df = pd.get_dummies(df, columns=dummy_columns)

    return df

Placement_df_prep2 = preprocess_f(Placement_df_prep)
Placement_df_prep2.head()

Unnamed: 0,Serial ID,Service Category_Installation,Service Category_Removal,Service Category_Replacement,INCIDENT_CATEGORY_DESCRIPTION_Customer relocation,INCIDENT_CATEGORY_DESCRIPTION_Exchange / Replacement Sales,INCIDENT_CATEGORY_DESCRIPTION_Install/Data Fix,INCIDENT_CATEGORY_DESCRIPTION_Key Account Test Installation,INCIDENT_CATEGORY_DESCRIPTION_Maintenance & Repair,INCIDENT_CATEGORY_DESCRIPTION_New Customer / Installation Point,INCIDENT_CATEGORY_DESCRIPTION_Removal / Data Fix,INCIDENT_CATEGORY_DESCRIPTION_Renew,INCIDENT_CATEGORY_DESCRIPTION_Seasonal Installation,INCIDENT_CATEGORY_DESCRIPTION_Seasonal Removal,INCIDENT_CATEGORY_DESCRIPTION_Trial / Demo /Food Show,INCIDENT_CATEGORY_DESCRIPTION_Unknown/Other,INCIDENT_CATEGORY_DESCRIPTION_Upgrade/Downgrade
593,10048419,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0
594,10043045,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0
2542,10051301,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1
4618,10056376,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0
9294,10047527,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0


In [136]:
Placement_df_prep2.columns

Index(['Serial ID', 'Service Category_Installation',
       'Service Category_Removal', 'Service Category_Replacement',
       'INCIDENT_CATEGORY_DESCRIPTION_Customer relocation',
       'INCIDENT_CATEGORY_DESCRIPTION_Exchange / Replacement Sales',
       'INCIDENT_CATEGORY_DESCRIPTION_Install/Data Fix',
       'INCIDENT_CATEGORY_DESCRIPTION_Key Account Test Installation',
       'INCIDENT_CATEGORY_DESCRIPTION_Maintenance & Repair',
       'INCIDENT_CATEGORY_DESCRIPTION_New Customer / Installation Point',
       'INCIDENT_CATEGORY_DESCRIPTION_Removal / Data Fix',
       'INCIDENT_CATEGORY_DESCRIPTION_Renew',
       'INCIDENT_CATEGORY_DESCRIPTION_Seasonal Installation',
       'INCIDENT_CATEGORY_DESCRIPTION_Seasonal Removal',
       'INCIDENT_CATEGORY_DESCRIPTION_Trial / Demo /Food Show',
       'INCIDENT_CATEGORY_DESCRIPTION_Unknown/Other',
       'INCIDENT_CATEGORY_DESCRIPTION_Upgrade/Downgrade'],
      dtype='object')

In [137]:
Placement_df_prep3 = Placement_df_prep2.groupby(["Serial ID"])\
[['Serial ID', 'Service Category_Installation',
       'Service Category_Removal', 'Service Category_Replacement',
       'INCIDENT_CATEGORY_DESCRIPTION_Customer relocation',
       'INCIDENT_CATEGORY_DESCRIPTION_Exchange / Replacement Sales',
       'INCIDENT_CATEGORY_DESCRIPTION_Install/Data Fix',
       'INCIDENT_CATEGORY_DESCRIPTION_Key Account Test Installation',
       'INCIDENT_CATEGORY_DESCRIPTION_Maintenance & Repair',
       'INCIDENT_CATEGORY_DESCRIPTION_New Customer / Installation Point',
       'INCIDENT_CATEGORY_DESCRIPTION_Removal / Data Fix',
       'INCIDENT_CATEGORY_DESCRIPTION_Renew',
       'INCIDENT_CATEGORY_DESCRIPTION_Seasonal Installation',
       'INCIDENT_CATEGORY_DESCRIPTION_Seasonal Removal',
       'INCIDENT_CATEGORY_DESCRIPTION_Trial / Demo /Food Show',
       'INCIDENT_CATEGORY_DESCRIPTION_Unknown/Other',
       'INCIDENT_CATEGORY_DESCRIPTION_Upgrade/Downgrade']].sum()


TicketsColumnsList = ['Serial ID', 'Service Category_Installation',
       'Service Category_Removal', 'Service Category_Replacement',
       'INCIDENT_CATEGORY_DESCRIPTION_Customer relocation',
       'INCIDENT_CATEGORY_DESCRIPTION_Exchange / Replacement Sales',
       'INCIDENT_CATEGORY_DESCRIPTION_Install/Data Fix',
       'INCIDENT_CATEGORY_DESCRIPTION_Key Account Test Installation',
       'INCIDENT_CATEGORY_DESCRIPTION_Maintenance & Repair',
       'INCIDENT_CATEGORY_DESCRIPTION_New Customer / Installation Point',
       'INCIDENT_CATEGORY_DESCRIPTION_Removal / Data Fix',
       'INCIDENT_CATEGORY_DESCRIPTION_Renew',
       'INCIDENT_CATEGORY_DESCRIPTION_Seasonal Installation',
       'INCIDENT_CATEGORY_DESCRIPTION_Seasonal Removal',
       'INCIDENT_CATEGORY_DESCRIPTION_Trial / Demo /Food Show',
       'INCIDENT_CATEGORY_DESCRIPTION_Unknown/Other',
       'INCIDENT_CATEGORY_DESCRIPTION_Upgrade/Downgrade']

Placement_df_prep3.head()

  'INCIDENT_CATEGORY_DESCRIPTION_Upgrade/Downgrade']].sum()


Unnamed: 0_level_0,Service Category_Installation,Service Category_Removal,Service Category_Replacement,INCIDENT_CATEGORY_DESCRIPTION_Customer relocation,INCIDENT_CATEGORY_DESCRIPTION_Exchange / Replacement Sales,INCIDENT_CATEGORY_DESCRIPTION_Install/Data Fix,INCIDENT_CATEGORY_DESCRIPTION_Key Account Test Installation,INCIDENT_CATEGORY_DESCRIPTION_Maintenance & Repair,INCIDENT_CATEGORY_DESCRIPTION_New Customer / Installation Point,INCIDENT_CATEGORY_DESCRIPTION_Removal / Data Fix,INCIDENT_CATEGORY_DESCRIPTION_Renew,INCIDENT_CATEGORY_DESCRIPTION_Seasonal Installation,INCIDENT_CATEGORY_DESCRIPTION_Seasonal Removal,INCIDENT_CATEGORY_DESCRIPTION_Trial / Demo /Food Show,INCIDENT_CATEGORY_DESCRIPTION_Unknown/Other,INCIDENT_CATEGORY_DESCRIPTION_Upgrade/Downgrade
Serial ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
0.102313088,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
0.4390764,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
100100125.0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1
100100249.0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0
100100250.0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1


In [138]:
# Specify the filename
filename = 'TicketsColumnsList.p'

# Combine the file path and filename
file_path_with_filename = os.path.join(file_path_output, filename)

# Save the list into a pickle file
with open(file_path_with_filename, 'wb') as file:
    pickle.dump(TicketsColumnsList, file)

In [139]:
Placement_df_prep3.columns

Index(['Service Category_Installation', 'Service Category_Removal',
       'Service Category_Replacement',
       'INCIDENT_CATEGORY_DESCRIPTION_Customer relocation',
       'INCIDENT_CATEGORY_DESCRIPTION_Exchange / Replacement Sales',
       'INCIDENT_CATEGORY_DESCRIPTION_Install/Data Fix',
       'INCIDENT_CATEGORY_DESCRIPTION_Key Account Test Installation',
       'INCIDENT_CATEGORY_DESCRIPTION_Maintenance & Repair',
       'INCIDENT_CATEGORY_DESCRIPTION_New Customer / Installation Point',
       'INCIDENT_CATEGORY_DESCRIPTION_Removal / Data Fix',
       'INCIDENT_CATEGORY_DESCRIPTION_Renew',
       'INCIDENT_CATEGORY_DESCRIPTION_Seasonal Installation',
       'INCIDENT_CATEGORY_DESCRIPTION_Seasonal Removal',
       'INCIDENT_CATEGORY_DESCRIPTION_Trial / Demo /Food Show',
       'INCIDENT_CATEGORY_DESCRIPTION_Unknown/Other',
       'INCIDENT_CATEGORY_DESCRIPTION_Upgrade/Downgrade'],
      dtype='object')

In [140]:
Placement_df_prep3.info()

<class 'pandas.core.frame.DataFrame'>
Index: 124702 entries, .102313088 to nan
Data columns (total 16 columns):
 #   Column                                                           Non-Null Count   Dtype
---  ------                                                           --------------   -----
 0   Service Category_Installation                                    124702 non-null  uint8
 1   Service Category_Removal                                         124702 non-null  uint8
 2   Service Category_Replacement                                     124702 non-null  uint8
 3   INCIDENT_CATEGORY_DESCRIPTION_Customer relocation                124702 non-null  uint8
 4   INCIDENT_CATEGORY_DESCRIPTION_Exchange / Replacement Sales       124702 non-null  uint8
 5   INCIDENT_CATEGORY_DESCRIPTION_Install/Data Fix                   124702 non-null  uint8
 6   INCIDENT_CATEGORY_DESCRIPTION_Key Account Test Installation      124702 non-null  uint8
 7   INCIDENT_CATEGORY_DESCRIPTION_Maintenance & Re

In [141]:
Placement_df_prep5 = Placement_df_prep3.reset_index()

In [142]:
Placement_df_prep5.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 124702 entries, 0 to 124701
Data columns (total 17 columns):
 #   Column                                                           Non-Null Count   Dtype 
---  ------                                                           --------------   ----- 
 0   Serial ID                                                        124702 non-null  object
 1   Service Category_Installation                                    124702 non-null  uint8 
 2   Service Category_Removal                                         124702 non-null  uint8 
 3   Service Category_Replacement                                     124702 non-null  uint8 
 4   INCIDENT_CATEGORY_DESCRIPTION_Customer relocation                124702 non-null  uint8 
 5   INCIDENT_CATEGORY_DESCRIPTION_Exchange / Replacement Sales       124702 non-null  uint8 
 6   INCIDENT_CATEGORY_DESCRIPTION_Install/Data Fix                   124702 non-null  uint8 
 7   INCIDENT_CATEGORY_DESCRIPTION_Key Acco

In [143]:
BeverageMachine7_df.columns

Index(['Serial ID', 'Sales Organisation', 'Machine Status Groupings',
       'User Status', 'TA Contract Installation Date', 'Depreciation Start',
       'Manufacturer Number', 'Position', 'TA Contract Start Date',
       'TA Contract End Date', 'TA Usage Indicator', 'Account ID', 'EC ID',
       'EC Name', 'Account ABC Classification (Account ID)',
       'Industry (Account ID)', 'Industry Code 1 (Account ID)',
       'Account ABC Classification (EC ID)', 'Industry (EC ID)',
       'Industry Code 1 (EC ID)', 'Parent Installation Point ID',
       'Registered Product Category (Registered Product ID)', 'Model',
       'Model Vendor', 'Model Category', 'Model Group', 'Beverage Temperature',
       'System Brands', 'Ingredient Format', 'Machine Type', 'Positionning',
       'Generation', 'Blueprint Throughput', 'IP Ownership', 'Calendar Date',
       'Key_ManufacturerID_SalesOrg', 'City', 'State', 'Postal Code', 'Churn'],
      dtype='object')

In [144]:
Placement_df_prep5.loc[Placement_df_prep5['Serial ID']=='#']

Unnamed: 0,Serial ID,Service Category_Installation,Service Category_Removal,Service Category_Replacement,INCIDENT_CATEGORY_DESCRIPTION_Customer relocation,INCIDENT_CATEGORY_DESCRIPTION_Exchange / Replacement Sales,INCIDENT_CATEGORY_DESCRIPTION_Install/Data Fix,INCIDENT_CATEGORY_DESCRIPTION_Key Account Test Installation,INCIDENT_CATEGORY_DESCRIPTION_Maintenance & Repair,INCIDENT_CATEGORY_DESCRIPTION_New Customer / Installation Point,INCIDENT_CATEGORY_DESCRIPTION_Removal / Data Fix,INCIDENT_CATEGORY_DESCRIPTION_Renew,INCIDENT_CATEGORY_DESCRIPTION_Seasonal Installation,INCIDENT_CATEGORY_DESCRIPTION_Seasonal Removal,INCIDENT_CATEGORY_DESCRIPTION_Trial / Demo /Food Show,INCIDENT_CATEGORY_DESCRIPTION_Unknown/Other,INCIDENT_CATEGORY_DESCRIPTION_Upgrade/Downgrade


Remove Placement tickets with 'Serial ID' == '#'

In [145]:
Placement_df_prep6 = Placement_df_prep5.loc[Placement_df_prep5['Serial ID']!='#']

In [146]:
Placement_df_prep6.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 124702 entries, 0 to 124701
Data columns (total 17 columns):
 #   Column                                                           Non-Null Count   Dtype 
---  ------                                                           --------------   ----- 
 0   Serial ID                                                        124702 non-null  object
 1   Service Category_Installation                                    124702 non-null  uint8 
 2   Service Category_Removal                                         124702 non-null  uint8 
 3   Service Category_Replacement                                     124702 non-null  uint8 
 4   INCIDENT_CATEGORY_DESCRIPTION_Customer relocation                124702 non-null  uint8 
 5   INCIDENT_CATEGORY_DESCRIPTION_Exchange / Replacement Sales       124702 non-null  uint8 
 6   INCIDENT_CATEGORY_DESCRIPTION_Install/Data Fix                   124702 non-null  uint8 
 7   INCIDENT_CATEGORY_DESCRIPTION_Key Acco

In [147]:
#Placement_df_prep6['Serial ID'] = Placement_df_prep6['Serial ID'].astype('str')

In [148]:
Placement_df_prep6 = Placement_df_prep6.reset_index()

Placement_df_prep6=Placement_df_prep6.drop(columns=['index'])
Placement_df_prep6

Unnamed: 0,Serial ID,Service Category_Installation,Service Category_Removal,Service Category_Replacement,INCIDENT_CATEGORY_DESCRIPTION_Customer relocation,INCIDENT_CATEGORY_DESCRIPTION_Exchange / Replacement Sales,INCIDENT_CATEGORY_DESCRIPTION_Install/Data Fix,INCIDENT_CATEGORY_DESCRIPTION_Key Account Test Installation,INCIDENT_CATEGORY_DESCRIPTION_Maintenance & Repair,INCIDENT_CATEGORY_DESCRIPTION_New Customer / Installation Point,INCIDENT_CATEGORY_DESCRIPTION_Removal / Data Fix,INCIDENT_CATEGORY_DESCRIPTION_Renew,INCIDENT_CATEGORY_DESCRIPTION_Seasonal Installation,INCIDENT_CATEGORY_DESCRIPTION_Seasonal Removal,INCIDENT_CATEGORY_DESCRIPTION_Trial / Demo /Food Show,INCIDENT_CATEGORY_DESCRIPTION_Unknown/Other,INCIDENT_CATEGORY_DESCRIPTION_Upgrade/Downgrade
0,.102313088,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
1,.4390764,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
2,100100125,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1
3,100100249,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0
4,100100250,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
124697,ZAB2022049,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
124698,ZAB2022050,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
124699,ZAG0054,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0
124700,ZAR1222,1,0,1,0,1,0,0,0,1,0,0,0,0,0,0,0


I will link both data together Beverage Machine data and Placement Ticket

In [149]:
BeverageMachine7_wTickets_df = pd.merge(BeverageMachine7_df, Placement_df_prep6, how='left', left_on = ['Serial ID'], right_on = ['Serial ID'])

f=BeverageMachine7_wTickets_df.loc[BeverageMachine7_wTickets_df['Serial ID']=='7010054129']
f.iloc[:20,20:50]

In [150]:
BeverageMachine7_wTickets_df

Unnamed: 0,Serial ID,Sales Organisation,Machine Status Groupings,User Status,TA Contract Installation Date,Depreciation Start,Manufacturer Number,Position,TA Contract Start Date,TA Contract End Date,...,INCIDENT_CATEGORY_DESCRIPTION_Key Account Test Installation,INCIDENT_CATEGORY_DESCRIPTION_Maintenance & Repair,INCIDENT_CATEGORY_DESCRIPTION_New Customer / Installation Point,INCIDENT_CATEGORY_DESCRIPTION_Removal / Data Fix,INCIDENT_CATEGORY_DESCRIPTION_Renew,INCIDENT_CATEGORY_DESCRIPTION_Seasonal Installation,INCIDENT_CATEGORY_DESCRIPTION_Seasonal Removal,INCIDENT_CATEGORY_DESCRIPTION_Trial / Demo /Food Show,INCIDENT_CATEGORY_DESCRIPTION_Unknown/Other,INCIDENT_CATEGORY_DESCRIPTION_Upgrade/Downgrade
0,153933605,Néstlé Jordania,Deployed,Installed,0,42339,20153933605,LOAN,0,0,...,,,,,,,,,,
1,160708192,Néstlé Jordania,Deployed,Installed,0,42491,20160708192,LOAN,0,0,...,,,,,,,,,,
2,160708189,Néstlé Jordania,Deployed,Installed,0,42491,20160708189,LOAN,0,0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,153933611,Néstlé Jordania,Deployed,Installed,0,42461,20153933611,LOAN,0,0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,153933610,Néstlé Jordania,Deployed,Installed,0,42461,20153933610,LOAN,0,0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
239395,20O0017858,Singapore,Deployed,Installed,0,44166,2008020009,#,0,0,...,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
239396,20O0017859,Singapore,Deployed,Installed,0,44166,2008020010,#,0,0,...,1.0,1.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
239397,20O0017862,Singapore,Deployed,Installed,0,44166,2008020013,#,0,0,...,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
239398,20O0017861,Singapore,Deployed,Installed,0,44166,2008020012,#,0,0,...,1.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [151]:
BeverageMachine7_wTickets_df=BeverageMachine7_wTickets_df.fillna(0)
BeverageMachine7_wTickets_df.head()

Unnamed: 0,Serial ID,Sales Organisation,Machine Status Groupings,User Status,TA Contract Installation Date,Depreciation Start,Manufacturer Number,Position,TA Contract Start Date,TA Contract End Date,...,INCIDENT_CATEGORY_DESCRIPTION_Key Account Test Installation,INCIDENT_CATEGORY_DESCRIPTION_Maintenance & Repair,INCIDENT_CATEGORY_DESCRIPTION_New Customer / Installation Point,INCIDENT_CATEGORY_DESCRIPTION_Removal / Data Fix,INCIDENT_CATEGORY_DESCRIPTION_Renew,INCIDENT_CATEGORY_DESCRIPTION_Seasonal Installation,INCIDENT_CATEGORY_DESCRIPTION_Seasonal Removal,INCIDENT_CATEGORY_DESCRIPTION_Trial / Demo /Food Show,INCIDENT_CATEGORY_DESCRIPTION_Unknown/Other,INCIDENT_CATEGORY_DESCRIPTION_Upgrade/Downgrade
0,153933605,Néstlé Jordania,Deployed,Installed,0,42339,20153933605,LOAN,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,160708192,Néstlé Jordania,Deployed,Installed,0,42491,20160708192,LOAN,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,160708189,Néstlé Jordania,Deployed,Installed,0,42491,20160708189,LOAN,0,0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,153933611,Néstlé Jordania,Deployed,Installed,0,42461,20153933611,LOAN,0,0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,153933610,Néstlé Jordania,Deployed,Installed,0,42461,20153933610,LOAN,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Even if we have only around 2000 machines having tickets, BeverageMachine7_wTickets_df can be used and we will see if it can improve the model.

In [152]:
SO_Tickets =['prstzr pnstrpzcp ztd', 'prstzr nk', 'prstzr prw zrpzppd', 'ppkcstpp']

BeverageMachine7_wTicketsOnly_df = Placement_df_prep6

BeverageMachine7_wTicketsOnly_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 124702 entries, 0 to 124701
Data columns (total 17 columns):
 #   Column                                                           Non-Null Count   Dtype 
---  ------                                                           --------------   ----- 
 0   Serial ID                                                        124702 non-null  object
 1   Service Category_Installation                                    124702 non-null  uint8 
 2   Service Category_Removal                                         124702 non-null  uint8 
 3   Service Category_Replacement                                     124702 non-null  uint8 
 4   INCIDENT_CATEGORY_DESCRIPTION_Customer relocation                124702 non-null  uint8 
 5   INCIDENT_CATEGORY_DESCRIPTION_Exchange / Replacement Sales       124702 non-null  uint8 
 6   INCIDENT_CATEGORY_DESCRIPTION_Install/Data Fix                   124702 non-null  uint8 
 7   INCIDENT_CATEGORY_DESCRIPTION_Key Acco

#### Telemetry data preparation

Let's see what we can get with only machines having Telemetry data

In [153]:
BeverageMachine7_wTelemetry = pd.merge(BeverageMachine7_df, Telemetry_aggSales, how='inner', left_on = ['Manufacturer Number'], right_on = ['serial'])
BeverageMachine7_wTelemetry.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 17453 entries, 0 to 17452
Data columns (total 41 columns):
 #   Column                                               Non-Null Count  Dtype         
---  ------                                               --------------  -----         
 0   Serial ID                                            17453 non-null  object        
 1   Sales Organisation                                   17453 non-null  object        
 2   Machine Status Groupings                             17453 non-null  object        
 3   User Status                                          17453 non-null  object        
 4   TA Contract Installation Date                        17453 non-null  int32         
 5   Depreciation Start                                   17453 non-null  int32         
 6   Manufacturer Number                                  17453 non-null  object        
 7   Position                                             17449 non-null  object        
 

I only have 218 machines matching a Telemetry Kit. This is clearly not enough in order to apply Machine Learning model to predict churn.
We should at least combine it with the Beverage Machine data if we want to use it.

#### Visits data preparation

In [154]:
Visitsdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 600578 entries, 0 to 600577
Data columns (total 25 columns):
 #   Column                              Non-Null Count   Dtype         
---  ------                              --------------   -----         
 0   Month                               600578 non-null  int64         
 1   Year                                600578 non-null  int64         
 2   Period                              600578 non-null  object        
 3   Counter_visits_completed            600578 non-null  int64         
 4   Cummulative                         600578 non-null  object        
 5   Cummulative_Final                   600578 non-null  object        
 6   Cummulative Graph                   598495 non-null  object        
 7   Occurence Balancing                 600578 non-null  object        
 8   Activity Owner                      600556 non-null  object        
 9   Activity Owner ID                   600556 non-null  float64       
 10  End Date

In [155]:
Visitsdf.columns

Index(['Month', 'Year', 'Period', 'Counter_visits_completed', 'Cummulative',
       'Cummulative_Final', 'Cummulative Graph', 'Occurence Balancing',
       'Activity Owner', 'Activity Owner ID', 'End Date in Local Time Zone',
       'Result_ID', 'Sales Org Desc', 'Sales Organization',
       'Sales Unit (Hierarchy)', 'Sales Unit (Hierarchy) ID',
       'Activity Life Cycle Status id', 'Activity Life Cycle Status',
       'Counter_visits', 'Visit Description', 'Visit',
       'Account ID.Account ID Level 01', 'Account ID.Account ID Level 01.Key',
       'Result', 'Index'],
      dtype='object')

In [156]:
Visitsdf1 = Visitsdf[['End Date in Local Time Zone', 'Result', 'Activity Life Cycle Status', 'Visit', 'Account ID.Account ID Level 01.Key']]

Remove visits with no account id

In [157]:
Visitsdf1 = Visitsdf1.loc[Visitsdf1['Account ID.Account ID Level 01.Key']!="#"]


In [170]:
Visitsdf1['Result'] = Visitsdf1['Result'].fillna('Not assigned')


I do not have the Sales org ID in TA and I think that Account ID are unique I am not doing the key "KeySOAccID" yet.

In [171]:
#Visitsdf1['KeySOAccID'] = Visitsdf1['Sales Organization'] + Visitsdf1['Account ID.Account ID Level 01.Key'].map(str) 

In [172]:
def preprocess_visits(df):
    # Work on a copy
    df = df.copy()

    nomi_vars = ['Result', 'Activity Life Cycle Status']
    
    dummy_columns = nomi_vars
        
    df = pd.get_dummies(df, columns=dummy_columns)

    return df

Visitsdf_prep = preprocess_visits(Visitsdf1)
Visitsdf_prep.head()

Unnamed: 0,End Date in Local Time Zone,Visit,Account ID.Account ID Level 01.Key,Result_Incomplete Selling Call,Result_Not assigned,Result_Objective Met,Result_Objective Partially Met,Result_Requires Further Follow-up,Result_Unsuccessful Selling Call,Activity Life Cycle Status_Canceled,Activity Life Cycle Status_Completed,Activity Life Cycle Status_In Process,Activity Life Cycle Status_Open
0,2023-01-13,2147753,7042010,0,1,0,0,0,0,0,1,0,0
1,2023-03-24,2276730,7042010,0,1,0,0,0,0,0,1,0,0
2,2023-03-10,2272298,7042010,0,1,0,0,0,0,0,1,0,0
3,2023-01-13,2147752,7042268,0,1,0,0,0,0,0,1,0,0
4,2023-07-24,2477044,6978596,0,1,0,0,0,0,0,1,0,0


Summarize the column based on the Account ID and keep the last visit date

In [173]:
Visitsdf_prep.columns

Index(['End Date in Local Time Zone', 'Visit',
       'Account ID.Account ID Level 01.Key', 'Result_Incomplete Selling Call',
       'Result_Not assigned', 'Result_Objective Met',
       'Result_Objective Partially Met', 'Result_Requires Further Follow-up',
       'Result_Unsuccessful Selling Call',
       'Activity Life Cycle Status_Canceled',
       'Activity Life Cycle Status_Completed',
       'Activity Life Cycle Status_In Process',
       'Activity Life Cycle Status_Open'],
      dtype='object')

In [174]:
Visitsdf_prep.iloc[0]['End Date in Local Time Zone']

Timestamp('2023-01-13 00:00:00')

In [175]:
Visitsdf_prep['End Date in Local Time Zone'] = Visitsdf_prep['End Date in Local Time Zone'].apply(str)


Visitsdf_prep.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 600578 entries, 0 to 600577
Data columns (total 13 columns):
 #   Column                                 Non-Null Count   Dtype 
---  ------                                 --------------   ----- 
 0   End Date in Local Time Zone            600578 non-null  object
 1   Visit                                  600578 non-null  int64 
 2   Account ID.Account ID Level 01.Key     551558 non-null  object
 3   Result_Incomplete Selling Call         600578 non-null  uint8 
 4   Result_Not assigned                    600578 non-null  uint8 
 5   Result_Objective Met                   600578 non-null  uint8 
 6   Result_Objective Partially Met         600578 non-null  uint8 
 7   Result_Requires Further Follow-up      600578 non-null  uint8 
 8   Result_Unsuccessful Selling Call       600578 non-null  uint8 
 9   Activity Life Cycle Status_Canceled    600578 non-null  uint8 
 10  Activity Life Cycle Status_Completed   600578 non-null  uint8 
 11  

In [176]:
#pip install dateparser

In [177]:
#import dateparser

Visitsdf_prep2 =Visitsdf_prep.copy()
Visitsdf_prep2['End Date in Local Time Zone'] = pd.to_datetime(Visitsdf_prep2['End Date in Local Time Zone'])
#Visitsdf_prep2['End Date in Local Time Zone'] = Visitsdf_prep2['End Date in Local Time Zone'].apply(lambda x: dateparser.parse(x))

In [178]:
Visitsdf_prep2

Unnamed: 0,End Date in Local Time Zone,Visit,Account ID.Account ID Level 01.Key,Result_Incomplete Selling Call,Result_Not assigned,Result_Objective Met,Result_Objective Partially Met,Result_Requires Further Follow-up,Result_Unsuccessful Selling Call,Activity Life Cycle Status_Canceled,Activity Life Cycle Status_Completed,Activity Life Cycle Status_In Process,Activity Life Cycle Status_Open
0,2023-01-13,2147753,7042010,0,1,0,0,0,0,0,1,0,0
1,2023-03-24,2276730,7042010,0,1,0,0,0,0,0,1,0,0
2,2023-03-10,2272298,7042010,0,1,0,0,0,0,0,1,0,0
3,2023-01-13,2147752,7042268,0,1,0,0,0,0,0,1,0,0
4,2023-07-24,2477044,6978596,0,1,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
600573,2024-03-20,2819289,,0,0,0,0,1,0,0,1,0,0
600574,2024-03-21,2822203,,0,0,0,0,1,0,0,1,0,0
600575,2024-03-05,2794343,,0,0,0,0,1,0,0,1,0,0
600576,2023-11-29,2706998,,0,0,0,0,1,0,0,1,0,0


pd.to_datetime(Visitsdf_prep2['End Date in Local Time Zone'])

In [179]:
Visitsdf_prep2.sort_values('End Date in Local Time Zone', ascending = True)

Unnamed: 0,End Date in Local Time Zone,Visit,Account ID.Account ID Level 01.Key,Result_Incomplete Selling Call,Result_Not assigned,Result_Objective Met,Result_Objective Partially Met,Result_Requires Further Follow-up,Result_Unsuccessful Selling Call,Activity Life Cycle Status_Canceled,Activity Life Cycle Status_Completed,Activity Life Cycle Status_In Process,Activity Life Cycle Status_Open
261832,2010-07-29,2406815,3914214,0,1,0,0,0,0,0,0,0,1
261795,2010-08-26,2406233,3556508,0,1,0,0,0,0,0,0,0,1
261774,2011-07-12,2406832,2757927,0,1,0,0,0,0,0,0,0,1
261779,2012-07-26,2450208,,0,1,0,0,0,0,0,0,0,1
261799,2012-08-24,2406231,3315972,0,1,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
572852,2028-03-21,2822355,,0,1,0,0,0,0,0,0,0,1
572313,2028-03-27,2804343,,0,1,0,0,0,0,0,0,0,1
371901,2028-07-28,2484908,1809624,0,1,0,0,0,0,0,1,0,0
456420,2029-04-29,2854625,5629131,0,1,0,0,0,0,0,1,0,0


In [180]:
Visitsdf_prep2 = Visitsdf_prep2.sort_values('End Date in Local Time Zone')
Visitsdf_prep2.head()

Unnamed: 0,End Date in Local Time Zone,Visit,Account ID.Account ID Level 01.Key,Result_Incomplete Selling Call,Result_Not assigned,Result_Objective Met,Result_Objective Partially Met,Result_Requires Further Follow-up,Result_Unsuccessful Selling Call,Activity Life Cycle Status_Canceled,Activity Life Cycle Status_Completed,Activity Life Cycle Status_In Process,Activity Life Cycle Status_Open
261832,2010-07-29,2406815,3914214.0,0,1,0,0,0,0,0,0,0,1
261795,2010-08-26,2406233,3556508.0,0,1,0,0,0,0,0,0,0,1
261774,2011-07-12,2406832,2757927.0,0,1,0,0,0,0,0,0,0,1
261779,2012-07-26,2450208,,0,1,0,0,0,0,0,0,0,1
261799,2012-08-24,2406231,3315972.0,0,1,0,0,0,0,0,0,0,1


In [181]:
Visitsdf_prep3 = (Visitsdf_prep2.sort_values('End Date in Local Time Zone')
    .groupby(["Account ID.Account ID Level 01.Key"])
                      .agg({
        'End Date in Local Time Zone': lambda s: s.values[-1],
        'Result_Incomplete Selling Call' : 'sum',
        'Result_Not assigned' : 'sum', 
        'Result_Objective Met' : 'sum',
       'Result_Objective Partially Met' : 'sum', 'Result_Requires Further Follow-up' : 'sum',
       'Result_Unsuccessful Selling Call' : 'sum',
       'Activity Life Cycle Status_Canceled' : 'sum',
       'Activity Life Cycle Status_Completed' : 'sum',
       'Activity Life Cycle Status_In Process' : 'sum',
       'Activity Life Cycle Status_Open' : 'sum'
    })
)

In [182]:
Visitsdf_prep3.head()

Unnamed: 0_level_0,End Date in Local Time Zone,Result_Incomplete Selling Call,Result_Not assigned,Result_Objective Met,Result_Objective Partially Met,Result_Requires Further Follow-up,Result_Unsuccessful Selling Call,Activity Life Cycle Status_Canceled,Activity Life Cycle Status_Completed,Activity Life Cycle Status_In Process,Activity Life Cycle Status_Open
Account ID.Account ID Level 01.Key,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
1000058,2024-05-01,0,7,0,0,0,0,0,7,0,0
1000096,2024-04-10,0,2,0,0,0,0,0,2,0,0
1000216,2024-04-25,0,19,0,0,0,0,0,19,0,0
1000256,2024-11-03,0,25,0,0,0,0,2,22,0,1
1000278,2023-03-16,0,1,0,0,0,0,1,0,0,0


In [183]:
Visitsdf_prep4 = Visitsdf_prep3.copy()
Visitsdf_prep4.reset_index()

Unnamed: 0,Account ID.Account ID Level 01.Key,End Date in Local Time Zone,Result_Incomplete Selling Call,Result_Not assigned,Result_Objective Met,Result_Objective Partially Met,Result_Requires Further Follow-up,Result_Unsuccessful Selling Call,Activity Life Cycle Status_Canceled,Activity Life Cycle Status_Completed,Activity Life Cycle Status_In Process,Activity Life Cycle Status_Open
0,1000058,2024-05-01,0,7,0,0,0,0,0,7,0,0
1,1000096,2024-04-10,0,2,0,0,0,0,0,2,0,0
2,1000216,2024-04-25,0,19,0,0,0,0,0,19,0,0
3,1000256,2024-11-03,0,25,0,0,0,0,2,22,0,1
4,1000278,2023-03-16,0,1,0,0,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
98432,9184374,2024-04-15,0,1,0,0,0,0,0,1,0,0
98433,9184379,2024-04-29,0,0,1,0,0,0,0,1,0,0
98434,9184618,2024-05-02,0,1,0,0,0,0,0,0,0,1
98435,9186694,2024-05-02,0,1,0,0,0,0,0,0,0,1


In [184]:
Visitsdf_prep4['Last_visit_diff_months'] = ChurnDate2 - Visitsdf_prep4['End Date in Local Time Zone']

Visitsdf_prep4['Last_visit_diff_months'] = Visitsdf_prep4['Last_visit_diff_months']/np.timedelta64(1,'M')

In [185]:
Visitsdf_prep4.head()

Unnamed: 0_level_0,End Date in Local Time Zone,Result_Incomplete Selling Call,Result_Not assigned,Result_Objective Met,Result_Objective Partially Met,Result_Requires Further Follow-up,Result_Unsuccessful Selling Call,Activity Life Cycle Status_Canceled,Activity Life Cycle Status_Completed,Activity Life Cycle Status_In Process,Activity Life Cycle Status_Open,Last_visit_diff_months
Account ID.Account ID Level 01.Key,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
1000058,2024-05-01,0,7,0,0,0,0,0,7,0,0,-0.032855
1000096,2024-04-10,0,2,0,0,0,0,0,2,0,0,0.657098
1000216,2024-04-25,0,19,0,0,0,0,0,19,0,0,0.164274
1000256,2024-11-03,0,25,0,0,0,0,2,22,0,1,-6.143863
1000278,2023-03-16,0,1,0,0,0,0,1,0,0,0,13.503357


In [186]:
Visitsdf_wVisits = Visitsdf_prep4.copy()
Visitsdf_wVisits.info()

<class 'pandas.core.frame.DataFrame'>
Index: 98437 entries, 1000058 to CO_TW10
Data columns (total 12 columns):
 #   Column                                 Non-Null Count  Dtype         
---  ------                                 --------------  -----         
 0   End Date in Local Time Zone            98437 non-null  datetime64[ns]
 1   Result_Incomplete Selling Call         98437 non-null  uint8         
 2   Result_Not assigned                    98437 non-null  uint64        
 3   Result_Objective Met                   98437 non-null  uint8         
 4   Result_Objective Partially Met         98437 non-null  uint8         
 5   Result_Requires Further Follow-up      98437 non-null  uint8         
 6   Result_Unsuccessful Selling Call       98437 non-null  uint8         
 7   Activity Life Cycle Status_Canceled    98437 non-null  uint8         
 8   Activity Life Cycle Status_Completed   98437 non-null  uint64        
 9   Activity Life Cycle Status_In Process  98437 non-null  uin

df['Reported_Date'] = pd.to_datetime(df['Reported_Date'], format='%m/%d/%Y')
df['Process Date'] = pd.to_datetime(df['Process Date'], format='%m/%d/%Y')

df = (
    df
    .sort_values('Process Date')
    .groupby('ID', as_index=False)
    .agg({
        'Total': 'sum',
        'Process Date': lambda s: s.values[-1]
    })
)

'Activity Owner', 'Visit Description', 'Sales Unit (Hierarchy)' might be useful but with one hot encoding I would have too many columns

In [187]:
Visitsdf_wVisits2 = Visitsdf_wVisits.reset_index() 
Visitsdf_wVisits = Visitsdf_wVisits2.rename(columns={"Account ID.Account ID Level 01.Key":"Acc_ID"})
Visitsdf_wVisits

Unnamed: 0,Acc_ID,End Date in Local Time Zone,Result_Incomplete Selling Call,Result_Not assigned,Result_Objective Met,Result_Objective Partially Met,Result_Requires Further Follow-up,Result_Unsuccessful Selling Call,Activity Life Cycle Status_Canceled,Activity Life Cycle Status_Completed,Activity Life Cycle Status_In Process,Activity Life Cycle Status_Open,Last_visit_diff_months
0,1000058,2024-05-01,0,7,0,0,0,0,0,7,0,0,-0.032855
1,1000096,2024-04-10,0,2,0,0,0,0,0,2,0,0,0.657098
2,1000216,2024-04-25,0,19,0,0,0,0,0,19,0,0,0.164274
3,1000256,2024-11-03,0,25,0,0,0,0,2,22,0,1,-6.143863
4,1000278,2023-03-16,0,1,0,0,0,0,1,0,0,0,13.503357
...,...,...,...,...,...,...,...,...,...,...,...,...,...
98432,9184374,2024-04-15,0,1,0,0,0,0,0,1,0,0,0.492823
98433,9184379,2024-04-29,0,0,1,0,0,0,0,1,0,0,0.032855
98434,9184618,2024-05-02,0,1,0,0,0,0,0,0,0,1,-0.065710
98435,9186694,2024-05-02,0,1,0,0,0,0,0,0,0,1,-0.065710


left2 = pd.DataFrame(
    {"A": ["A0", "A1", "A2", "C1"], "B": ["B0", "B1", "B2", "B2"], "K1": [1938031, 1938031, 2, 3]}, index=["K0", "K1", "K2", "K2"]
)
left2

In [188]:
Visitsdf_wVisits.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 98437 entries, 0 to 98436
Data columns (total 13 columns):
 #   Column                                 Non-Null Count  Dtype         
---  ------                                 --------------  -----         
 0   Acc_ID                                 98437 non-null  object        
 1   End Date in Local Time Zone            98437 non-null  datetime64[ns]
 2   Result_Incomplete Selling Call         98437 non-null  uint8         
 3   Result_Not assigned                    98437 non-null  uint64        
 4   Result_Objective Met                   98437 non-null  uint8         
 5   Result_Objective Partially Met         98437 non-null  uint8         
 6   Result_Requires Further Follow-up      98437 non-null  uint8         
 7   Result_Unsuccessful Selling Call       98437 non-null  uint8         
 8   Activity Life Cycle Status_Canceled    98437 non-null  uint8         
 9   Activity Life Cycle Status_Completed   98437 non-null  uint64

In [189]:

Visitsdf_wVisits['Acc_ID'] = Visitsdf_wVisits['Acc_ID'].astype(str)
Visitsdf_wVisits['#Visits completed'] = Visitsdf_wVisits['Activity Life Cycle Status_Completed']
Visitsdf_wVisits = Visitsdf_wVisits.drop(['Activity Life Cycle Status_Completed'], axis = 1)
Visitsdf_wVisits.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 98437 entries, 0 to 98436
Data columns (total 13 columns):
 #   Column                                 Non-Null Count  Dtype         
---  ------                                 --------------  -----         
 0   Acc_ID                                 98437 non-null  object        
 1   End Date in Local Time Zone            98437 non-null  datetime64[ns]
 2   Result_Incomplete Selling Call         98437 non-null  uint8         
 3   Result_Not assigned                    98437 non-null  uint64        
 4   Result_Objective Met                   98437 non-null  uint8         
 5   Result_Objective Partially Met         98437 non-null  uint8         
 6   Result_Requires Further Follow-up      98437 non-null  uint8         
 7   Result_Unsuccessful Selling Call       98437 non-null  uint8         
 8   Activity Life Cycle Status_Canceled    98437 non-null  uint8         
 9   Activity Life Cycle Status_In Process  98437 non-null  uint8 

In [190]:
a=Visitsdf_wVisits.loc[Visitsdf_wVisits['Acc_ID']=='1938031']
a

Unnamed: 0,Acc_ID,End Date in Local Time Zone,Result_Incomplete Selling Call,Result_Not assigned,Result_Objective Met,Result_Objective Partially Met,Result_Requires Further Follow-up,Result_Unsuccessful Selling Call,Activity Life Cycle Status_Canceled,Activity Life Cycle Status_In Process,Activity Life Cycle Status_Open,Last_visit_diff_months,#Visits completed
17099,1938031,2023-06-20,0,24,0,0,0,0,0,0,0,10.349288,24


result = pd.merge(left2, Visitsdf_wVisits, how='left', left_on = ['K1'], right_on = ['Acc_ID']) 
result

#### Phone calls data preparation

In [191]:
PhoneCallsdf.head()

Unnamed: 0,Activity Name,Account Name,Activity Owner,Activity Life Cycle Status,Phone Call ID,Objective (Phone Call),Sales Organization,End Date in Local Time Zone,Start Date in Local Time Zone,PeriodEnd,ee
0,2023-03-22- P e Call 1,7323225,Jadala Aishwarya,Completed,1075973,,IN14,"mercredi, 22 mars 2023","mercredi, 22 mars 2023",2023 - 03,4473.0
1,2023-03-21- Residence Call 2,7316409,Jadala Aishwarya,Completed,1076101,,IN14,"mercredi, 22 mars 2023","mercredi, 22 mars 2023",2023 - 03,4473.0
2,2023-03-21- No Call 1,7318215,Jadala Aishwarya,Completed,1075664,,IN14,"mercredi, 22 mars 2023","mercredi, 22 mars 2023",2023 - 03,4473.0
3,2023-03-21- Yes Call 1,7317829,Jadala Aishwarya,Completed,1075454,,IN14,"mercredi, 22 mars 2023","mercredi, 22 mars 2023",2023 - 03,4473.0
4,2023-03-21- bluepal Call 2,7316130,Jadala Aishwarya,Completed,1075797,,IN14,"mercredi, 22 mars 2023","mercredi, 22 mars 2023",2023 - 03,4473.0


In [192]:
PhoneCallsdf.columns

Index(['Activity Name', 'Account Name', 'Activity Owner',
       'Activity Life Cycle Status', 'Phone Call ID', 'Objective (Phone Call)',
       'Sales Organization', 'End Date in Local Time Zone',
       'Start Date in Local Time Zone', 'PeriodEnd', 'ee'],
      dtype='object')

'Activity Owner',
 'Objective (Phone Call)' -> to much text freedom and too many reasons
 'Phone Call ID' -> not needed

In [193]:
PhoneCallsdf1 = PhoneCallsdf[['Account Name', 'Activity Life Cycle Status', 'End Date in Local Time Zone']]

In [194]:
def preprocess_calls(df):
    # Work on a copy
    df = df.copy()

    nomi_vars = ['Activity Life Cycle Status']
    
    dummy_columns = nomi_vars
        
    df = pd.get_dummies(df, columns=dummy_columns)

    return df

PhoneCallsdf_prep = preprocess_calls(PhoneCallsdf1)
PhoneCallsdf_prep.head()

Unnamed: 0,Account Name,End Date in Local Time Zone,Activity Life Cycle Status_Canceled,Activity Life Cycle Status_Completed,Activity Life Cycle Status_In Process,Activity Life Cycle Status_Open
0,7323225,"mercredi, 22 mars 2023",0,1,0,0
1,7316409,"mercredi, 22 mars 2023",0,1,0,0
2,7318215,"mercredi, 22 mars 2023",0,1,0,0
3,7317829,"mercredi, 22 mars 2023",0,1,0,0
4,7316130,"mercredi, 22 mars 2023",0,1,0,0


Remove phone calls without an account ID

In [195]:
PhoneCallsdf_prep1 = PhoneCallsdf_prep.loc[PhoneCallsdf_prep['Account Name']!="#"]


In [196]:
PhoneCallsdf_prep1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 294654 entries, 0 to 294653
Data columns (total 6 columns):
 #   Column                                 Non-Null Count   Dtype 
---  ------                                 --------------   ----- 
 0   Account Name                           293505 non-null  object
 1   End Date in Local Time Zone            294654 non-null  object
 2   Activity Life Cycle Status_Canceled    294654 non-null  uint8 
 3   Activity Life Cycle Status_Completed   294654 non-null  uint8 
 4   Activity Life Cycle Status_In Process  294654 non-null  uint8 
 5   Activity Life Cycle Status_Open        294654 non-null  uint8 
dtypes: object(2), uint8(4)
memory usage: 7.9+ MB


Remove date greater than next year

In [197]:
Churndate2_year = ChurnDate2.year

In [198]:
PhoneCallsdf_prep1['End Date in Local Time Zone'] = pd.to_datetime(PhoneCallsdf_prep1['End Date in Local Time Zone'], errors = 'coerce')


In [199]:
PhoneCallsdf_prep1 = PhoneCallsdf_prep1.loc[PhoneCallsdf_prep1['End Date in Local Time Zone'] < dt.datetime(Churndate2_year+1,1,1)]

In [200]:
PhoneCallsdf_prep1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 0 entries
Data columns (total 6 columns):
 #   Column                                 Non-Null Count  Dtype         
---  ------                                 --------------  -----         
 0   Account Name                           0 non-null      object        
 1   End Date in Local Time Zone            0 non-null      datetime64[ns]
 2   Activity Life Cycle Status_Canceled    0 non-null      uint8         
 3   Activity Life Cycle Status_Completed   0 non-null      uint8         
 4   Activity Life Cycle Status_In Process  0 non-null      uint8         
 5   Activity Life Cycle Status_Open        0 non-null      uint8         
dtypes: datetime64[ns](1), object(1), uint8(4)
memory usage: 0.0+ bytes


In [201]:
PhoneCallsdf_prep1 = PhoneCallsdf_prep1.sort_values('End Date in Local Time Zone')

In [202]:
PhoneCallsdf_prep2 = (PhoneCallsdf_prep1.sort_values('End Date in Local Time Zone')
    .groupby(["Account Name"])
                      .agg({
        'End Date in Local Time Zone': lambda s: s.values[-1],
        'Activity Life Cycle Status_Completed' : 'sum'}))


In [203]:
PhoneCallsdf_prep2

Unnamed: 0_level_0,End Date in Local Time Zone,Activity Life Cycle Status_Completed
Account Name,Unnamed: 1_level_1,Unnamed: 2_level_1


In [204]:
PhoneCallsdf_prep3 = PhoneCallsdf_prep2.copy()
PhoneCallsdf_prep3.reset_index()

Unnamed: 0,Account Name,End Date in Local Time Zone,Activity Life Cycle Status_Completed


In [205]:
PhoneCallsdf_prep3['Last_call_diff_months'] = ChurnDate2 - PhoneCallsdf_prep3['End Date in Local Time Zone']

PhoneCallsdf_prep3['Last_call_diff_months'] = PhoneCallsdf_prep3['Last_call_diff_months']/np.timedelta64(1,'M')

In [206]:
PhoneCallsdf_prep3.info()

<class 'pandas.core.frame.DataFrame'>
Index: 0 entries
Data columns (total 3 columns):
 #   Column                                Non-Null Count  Dtype         
---  ------                                --------------  -----         
 0   End Date in Local Time Zone           0 non-null      datetime64[ns]
 1   Activity Life Cycle Status_Completed  0 non-null      uint8         
 2   Last_call_diff_months                 0 non-null      float64       
dtypes: datetime64[ns](1), float64(1), uint8(1)
memory usage: 0.0+ bytes


In [207]:
PhoneCallsdf_prep3 = PhoneCallsdf_prep3.copy()
PhoneCallsdf_prep3.reset_index()

Unnamed: 0,Account Name,End Date in Local Time Zone,Activity Life Cycle Status_Completed,Last_call_diff_months


In [208]:
PhoneCallsdf_prep3['#Calls Completed'] = PhoneCallsdf_prep3['Activity Life Cycle Status_Completed']
PhoneCallsdf_prep3 = PhoneCallsdf_prep3.drop(['Activity Life Cycle Status_Completed'], axis = 1)
PhoneCallsdf_prep3.head()

Unnamed: 0_level_0,End Date in Local Time Zone,Last_call_diff_months,#Calls Completed
Account Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1


#### Incident Ticket preparation

In [209]:
IncidentTicketdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 232733 entries, 0 to 232732
Data columns (total 70 columns):
 #   Column                                     Non-Null Count   Dtype         
---  ------                                     --------------   -----         
 0   Index                                      232733 non-null  object        
 1   SLAMet                                     232733 non-null  int64         
 2   YearMonth                                  231836 non-null  float64       
 3   Period                                     231836 non-null  object        
 4   NextDateAux                                231836 non-null  float64       
 5   NextDateAux2                               231836 non-null  object        
 6   AuxTime                                    232733 non-null  object        
 7   TimeFrom                                   232733 non-null  int64         
 8   Next CreatedDatevar                        158225 non-null  datetime64[ns]
 9   Time

In [210]:
IncidentTicketdf.columns

Index(['Index', 'SLAMet', 'YearMonth', 'Period', 'NextDateAux', 'NextDateAux2',
       'AuxTime', 'TimeFrom', 'Next CreatedDatevar', 'TimeTo',
       'Previous CreatedDatevar', 'AuxFix', 'SLA MET?', 'AuxTimeUS',
       'Main Ticket ID', 'Main Ticket', 'MAIN_TICKET_COMPLETION_DATE',
       'Sub Ticket ID', 'Sub Ticket', 'REPORTED_ON', 'SOLVED_VIA_PHONE',
       'STATUS', 'STATUS_DESCRIPTION', 'PROCESSING', 'PROCESSING_DESCRIPTION',
       'SERVICE_TECHNICIAN', 'Completion Date_2', 'Completion SLA Met',
       'PRODUCT_DESCRIPTION', 'PRODUCT_ID', 'Serial ID',
       'PRIORITY_DESCRIPTION', 'EC_ID', 'EC_NAME', 'EC_HOUSENUMBER',
       'EC_STREET', 'EC_CIY', 'EC_STATE', 'EC_POSTALCODE',
       'INCIDENT_CATEGORY_ID', 'Incident Category',
       'MANUFACTURER_SERIAL_NUMBER', 'MAIN_ADDRESS', 'ACCOUNT_ID',
       'ACCOUNT_DESCRIPTION', 'ACCOUNT_POSTAL_CODE', 'SHIP_TO_ID',
       'SHIP_TO_NAME', 'SHIP_TO_POSTAL_CODE', 'ECRESPONSIBLE_SALES_ID',
       'ECRESPONSIBLE_SALES_DESCRIPTION', 'ACCOUNT

Maybe I will do a delta between "Completion Date_2" and "Reported On"

Removed:
'AuxFix' -> 'AuxTime'
'Completion SLA Met' -> 'SLAMet'
'SLAMet' -> 'SLA MET?' 
'Service Technician' not clear and a lot of data

In [211]:
print(IncidentTicketdf['Completion SLA Met'])

0         0.0
1         1.0
2         1.0
3         0.0
4         1.0
         ... 
232728    1.0
232729    1.0
232730    1.0
232731    1.0
232732    1.0
Name: Completion SLA Met, Length: 232733, dtype: float64


In [212]:
IncidentTicketdf1 = IncidentTicketdf[['Completion Date_2', 'Incident Category', 
                    'REPORTED_ON', 'Serial ID', 'Completion SLA Met', 'AuxTime']]
IncidentTicketdf1.rename(columns={'Completion SLA Met': 'SLA MET?'}, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  IncidentTicketdf1.rename(columns={'Completion SLA Met': 'SLA MET?'}, inplace=True)


In [213]:
IncidentTicketdf1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 232733 entries, 0 to 232732
Data columns (total 6 columns):
 #   Column             Non-Null Count   Dtype         
---  ------             --------------   -----         
 0   Completion Date_2  231836 non-null  object        
 1   Incident Category  232733 non-null  object        
 2   REPORTED_ON        232712 non-null  datetime64[ns]
 3   Serial ID          232733 non-null  object        
 4   SLA MET?           232196 non-null  float64       
 5   AuxTime            232733 non-null  object        
dtypes: datetime64[ns](1), float64(1), object(4)
memory usage: 10.7+ MB


In [214]:
a = IncidentTicketdf1.loc[IncidentTicketdf1['Serial ID']==7010054129]
#a= IncidentTicketdf1.loc[IncidentTicketdf1['Serial ID']=='MYBMB20838']
a

Unnamed: 0,Completion Date_2,Incident Category,REPORTED_ON,Serial ID,SLA MET?,AuxTime
39573,"mardi, 25 janvier 2022",1.d Ingredient Other,2022-01-25 16:13:32,7010054129,1.0,Yes


In [215]:
IncidentTicketdf2 = IncidentTicketdf1.copy()
#IncidentTicketdf2['Completion Date_2'] = IncidentTicketdf2['Completion Date_2'].apply(str)
#IncidentTicketdf2['Completion Date_2'] = IncidentTicketdf2['Completion Date_2'].apply(lambda x: dateparser.parse(x))

In [216]:
IncidentTicketdf3 = IncidentTicketdf2.copy()
IncidentTicketdf3['Completion Date_2'] = pd.to_datetime(IncidentTicketdf3['Completion Date_2'], errors = 'coerce')
IncidentTicketdf3['Reported On'] = pd.to_datetime(IncidentTicketdf3['REPORTED_ON'], errors = 'coerce')

IncidentTicketdf3['Completion Date_2'] = IncidentTicketdf3['Completion Date_2'].fillna(dt.datetime(2000,1,1))
IncidentTicketdf3['Reported On'] = IncidentTicketdf3['Reported On'].fillna(dt.datetime(2000,1,1))

In [217]:
IncidentTicketdf3 = IncidentTicketdf3.loc[IncidentTicketdf3['Serial ID']!="#"]

In [218]:
def preprocess_InciTickets(df):
    # Work on a copy
    df = df.copy()

    nomi_vars = ['Incident Category', 'SLA MET?', 'AuxTime']
    
    dummy_columns = nomi_vars
        
    df = pd.get_dummies(df, columns=dummy_columns)

    return df

IncidentTicketdf_prep = preprocess_InciTickets(IncidentTicketdf3)
IncidentTicketdf_prep.head()

Unnamed: 0,Completion Date_2,REPORTED_ON,Serial ID,Reported On,Incident Category_1.a Ingredient Calibration,Incident Category_1.b Ingredient Dispensing,Incident Category_1.c Ingredient Dripping,Incident Category_1.d Ingredient Other,Incident Category_10 Abnormal smell,Incident Category_11 Electrical power,...,Incident Category_7 Wire/Harness,Incident Category_8 Software/Firmware,Incident Category_9 Abnormal noise,Incident Category_Low throughput,Incident Category_Requested by Customer,Incident Category_Scheduled,SLA MET?_0.0,SLA MET?_1.0,AuxTime_No,AuxTime_Yes
0,2000-01-01,2021-10-03 10:33:40,18E0017587,2021-10-03 10:33:40,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,1
1,2000-01-01,2021-10-06 05:35:47,Y105133933,2021-10-06 05:35:47,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,1
2,2000-01-01,2021-10-06 08:10:50,16E0023488,2021-10-06 08:10:50,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,1,0
3,2000-01-01,2021-10-04 16:43:38,Y101709203,2021-10-04 16:43:38,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,1,0
4,2000-01-01,2021-10-07 08:02:15,18E0014610,2021-10-07 08:02:15,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,1


In [219]:
IncidentTicketdf_prep.columns

Index(['Completion Date_2', 'REPORTED_ON', 'Serial ID', 'Reported On',
       'Incident Category_1.a Ingredient Calibration',
       'Incident Category_1.b Ingredient Dispensing',
       'Incident Category_1.c Ingredient Dripping',
       'Incident Category_1.d Ingredient Other',
       'Incident Category_10 Abnormal smell',
       'Incident Category_11 Electrical power',
       'Incident Category_12 Water supply issue',
       'Incident Category_13 Connectivity (modem)',
       'Incident Category_14 Accessory problem(external pump..)',
       'Incident Category_15 Return with parts',
       'Incident Category_16 Operator mishandling(improper fill..)',
       'Incident Category_17 Miscellaneous', 'Incident Category_18 N/A',
       'Incident Category_2.a Hydraulic Calibration',
       'Incident Category_2.b Hydraulic Dispensing',
       'Incident Category_2.c Hydraulic Leaking',
       'Incident Category_2.d Hydraulic Heating',
       'Incident Category_2.e Hydraulic Cooling/Freezing',


In [220]:
IncidentTicketdf_prep = IncidentTicketdf_prep.sort_values('Completion Date_2')

I will not use 'Reported On' because I aggreagate and I do not want to make a delta anymore

In [221]:
IncidentTicketdf_prep2 = (IncidentTicketdf_prep.sort_values('Completion Date_2')
    .groupby(["Serial ID"])
                      .agg({'Completion Date_2' : lambda s: s.values[-1], 
       'Incident Category_1.a Ingredient Calibration' : 'sum',
       'Incident Category_1.b Ingredient Dispensing' : 'sum',
       'Incident Category_1.c Ingredient Dripping' : 'sum',
       'Incident Category_1.d Ingredient Other' : 'sum',
       'Incident Category_10 Abnormal smell' : 'sum',
       'Incident Category_11 Electrical power' : 'sum',
       'Incident Category_12 Water supply issue' : 'sum',
       'Incident Category_13 Connectivity (modem)' : 'sum',
       'Incident Category_14 Accessory problem(external pump..)' : 'sum',
       'Incident Category_15 Return with parts' : 'sum',
       'Incident Category_16 Operator mishandling(improper fill..)' : 'sum',
       'Incident Category_17 Miscellaneous': 'sum',
                            'Incident Category_18 N/A' : 'sum',
       'Incident Category_2.a Hydraulic Calibration' : 'sum',
       'Incident Category_2.b Hydraulic Dispensing' : 'sum',
       'Incident Category_2.c Hydraulic Leaking' : 'sum',
       'Incident Category_2.d Hydraulic Heating': 'sum',
       'Incident Category_2.e Hydraulic Cooling/Freezing': 'sum',
       'Incident Category_2.f Hydraulic Filling': 'sum',
       'Incident Category_2.g Hydraulic Other': 'sum',
       'Incident Category_3.a Door Display/Touchscreen': 'sum',
       'Incident Category_3.b Door Menu buttons': 'sum',
       'Incident Category_3.c Door Detection': 'sum',
       'Incident Category_3.d Door Key/Key switch': 'sum',
       'Incident Category_3.e Door Other': 'sum',
       'Incident Category_4.a Reconst. Area In-cup quality/Recipes': 'sum',
       'Incident Category_4.b Reconstitution Area Mixing system': 'sum',
       'Incident Category_4.c Reconstitution Area Other': 'sum',
       'Incident Category_5.a Disp. Area Manifold/Distribution': 'sum',
       'Incident Category_5.b Dispensing Area Drip Tray': 'sum',
       'Incident Category_5.c Dispensing Area Other': 'sum',
       'Incident Category_6 Electronics (PCBs)': 'sum',
       'Incident Category_7 Wire/Harness': 'sum',
       'Incident Category_8 Software/Firmware': 'sum',
       'Incident Category_9 Abnormal noise': 'sum',
                            'SLA MET?_0.0': 'sum',
                            'SLA MET?_1.0': 'sum',
                            'AuxTime_No': 'sum', 
                            'AuxTime_Yes': 'sum'})
)


In [222]:
IncidentTicketdf_prep2['Last_InTick_diff_months'] = ChurnDate2 - IncidentTicketdf_prep2['Completion Date_2']

IncidentTicketdf_prep2['Last_InTick_diff_months'] = IncidentTicketdf_prep2['Last_InTick_diff_months']/np.timedelta64(1,'M')

In [223]:
IncidentTicketdf_prep2.info()

<class 'pandas.core.frame.DataFrame'>
Index: 72460 entries, 210 to ZA978
Data columns (total 41 columns):
 #   Column                                                      Non-Null Count  Dtype         
---  ------                                                      --------------  -----         
 0   Completion Date_2                                           72460 non-null  datetime64[ns]
 1   Incident Category_1.a Ingredient Calibration                72460 non-null  uint8         
 2   Incident Category_1.b Ingredient Dispensing                 72460 non-null  uint8         
 3   Incident Category_1.c Ingredient Dripping                   72460 non-null  uint8         
 4   Incident Category_1.d Ingredient Other                      72460 non-null  uint8         
 5   Incident Category_10 Abnormal smell                         72460 non-null  uint8         
 6   Incident Category_11 Electrical power                       72460 non-null  uint8         
 7   Incident Category_12 Wate

In [224]:
IncidentTicketdf_prep2 = IncidentTicketdf_prep2.reset_index()
IncidentTicketdf_prep2.head()

Unnamed: 0,Serial ID,Completion Date_2,Incident Category_1.a Ingredient Calibration,Incident Category_1.b Ingredient Dispensing,Incident Category_1.c Ingredient Dripping,Incident Category_1.d Ingredient Other,Incident Category_10 Abnormal smell,Incident Category_11 Electrical power,Incident Category_12 Water supply issue,Incident Category_13 Connectivity (modem),...,Incident Category_5.c Dispensing Area Other,Incident Category_6 Electronics (PCBs),Incident Category_7 Wire/Harness,Incident Category_8 Software/Firmware,Incident Category_9 Abnormal noise,SLA MET?_0.0,SLA MET?_1.0,AuxTime_No,AuxTime_Yes,Last_InTick_diff_months
0,210,2000-01-01,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,1,291.9485
1,1003,2000-01-01,1,0,0,0,0,0,0,0,...,0,0,0,0,0,2,0,0,2,291.9485
2,1012,2000-01-01,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,1,291.9485
3,2100,2000-01-01,0,1,0,0,0,0,0,0,...,0,0,0,0,0,1,1,0,2,291.9485
4,21000,2000-01-01,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,1,291.9485


#### Data with all
Let's see what we can get if we include Telemetry, sales and Tickets

In [225]:

BeverageMachine7_wTickets_df['Manufacturer Number'] = BeverageMachine7_wTickets_df['Manufacturer Number'].astype('str')
Concat_Telemetry['serial'] = Concat_Telemetry['serial'].astype('str')

In [226]:
BeverageMachine7_wTickets_wTelemetry_df = pd.merge(BeverageMachine7_wTickets_df, Concat_Telemetry, how='left', left_on = ['Manufacturer Number'], right_on = ['serial'])

BeverageMachine7_wTickets_wTelemetry_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 239400 entries, 0 to 239399
Data columns (total 63 columns):
 #   Column                                                           Non-Null Count   Dtype         
---  ------                                                           --------------   -----         
 0   Serial ID                                                        239400 non-null  object        
 1   Sales Organisation                                               239400 non-null  object        
 2   Machine Status Groupings                                         239400 non-null  object        
 3   User Status                                                      239400 non-null  object        
 4   TA Contract Installation Date                                    239400 non-null  int32         
 5   Depreciation Start                                               239400 non-null  int32         
 6   Manufacturer Number                                              239

In [227]:
BeverageMachine7_wTickets_wTelemetry_df = pd.merge(BeverageMachine7_wTickets_df, Concat_Telemetry, how='left', left_on = ['Manufacturer Number'], right_on = ['serial'])
BeverageMachine7_wTickets_wTelemetry_df=BeverageMachine7_wTickets_wTelemetry_df.fillna(0)
BeverageMachine7_wTickets_wTelemetry_df["quantity"] = BeverageMachine7_wTickets_wTelemetry_df["quantity"].astype(int)
BeverageMachine7_wTickets_wTelemetry_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 239400 entries, 0 to 239399
Data columns (total 63 columns):
 #   Column                                                           Non-Null Count   Dtype         
---  ------                                                           --------------   -----         
 0   Serial ID                                                        239400 non-null  object        
 1   Sales Organisation                                               239400 non-null  object        
 2   Machine Status Groupings                                         239400 non-null  object        
 3   User Status                                                      239400 non-null  object        
 4   TA Contract Installation Date                                    239400 non-null  int32         
 5   Depreciation Start                                               239400 non-null  int32         
 6   Manufacturer Number                                              239

In [228]:
b = Concat_Sales[Concat_Sales['KeyManufNo_SalesOrg']=='20161919205' + 'Nestlé Russia']
b

Unnamed: 0,Serial,quantity,Sales_one_Month_avg,Sales_three_months_avg,Sales_six_months_avg,KeyManufNo_SalesOrg,(lst_mth-6mth)/6mth,3mth-6mth)/6mth
12807,20161919205,103309.3267,4606.8067,7969.195567,9316.851117,20161919205Nestlé Russia,-0.50554,-0.144647


In [229]:
A= Concat_Sales.drop_duplicates(subset= 'KeyManufNo_SalesOrg')
A.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 68572 entries, 0 to 15981
Data columns (total 8 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Serial                  68572 non-null  object 
 1   quantity                68572 non-null  float64
 2   Sales_one_Month_avg     68572 non-null  float64
 3   Sales_three_months_avg  68572 non-null  float64
 4   Sales_six_months_avg    68572 non-null  float64
 5   KeyManufNo_SalesOrg     68572 non-null  object 
 6   (lst_mth-6mth)/6mth     68572 non-null  float64
 7   3mth-6mth)/6mth         68572 non-null  float64
dtypes: float64(6), object(2)
memory usage: 4.7+ MB


In [230]:
Concat_Sales.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 68572 entries, 0 to 15981
Data columns (total 8 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Serial                  68572 non-null  object 
 1   quantity                68572 non-null  float64
 2   Sales_one_Month_avg     68572 non-null  float64
 3   Sales_three_months_avg  68572 non-null  float64
 4   Sales_six_months_avg    68572 non-null  float64
 5   KeyManufNo_SalesOrg     68572 non-null  object 
 6   (lst_mth-6mth)/6mth     68572 non-null  float64
 7   3mth-6mth)/6mth         68572 non-null  float64
dtypes: float64(6), object(2)
memory usage: 4.7+ MB


In [231]:
BeverageMachine7_wTickets_wTelemetry_wSales_df = pd.merge(BeverageMachine7_wTickets_wTelemetry_df, Concat_Sales, how='left', left_on = ['Key_ManufacturerID_SalesOrg'], right_on = ['KeyManufNo_SalesOrg'])
BeverageMachine7_wTickets_wTelemetry_wSales_df = BeverageMachine7_wTickets_wTelemetry_wSales_df.fillna(0)
BeverageMachine7_wTickets_wTelemetry_wSales_df["quantity_y"] = BeverageMachine7_wTickets_wTelemetry_wSales_df["quantity_y"].astype(int)
BeverageMachine7_wTickets_wTelemetry_wSales_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 239400 entries, 0 to 239399
Data columns (total 71 columns):
 #   Column                                                           Non-Null Count   Dtype         
---  ------                                                           --------------   -----         
 0   Serial ID                                                        239400 non-null  object        
 1   Sales Organisation                                               239400 non-null  object        
 2   Machine Status Groupings                                         239400 non-null  object        
 3   User Status                                                      239400 non-null  object        
 4   TA Contract Installation Date                                    239400 non-null  int32         
 5   Depreciation Start                                               239400 non-null  int32         
 6   Manufacturer Number                                              239

In [232]:
BeverageMachine7_wTickets_wTelemetry_wSales_df['EC ID'] = BeverageMachine7_wTickets_wTelemetry_wSales_df['EC ID'].astype('str')
Visitsdf_wVisits['Acc_ID'] = Visitsdf_wVisits['Acc_ID'].astype('str')

In [233]:
BeverageMachine7_wTickets_wTelemetry_wSales_wVisits_df = pd.merge(BeverageMachine7_wTickets_wTelemetry_wSales_df, Visitsdf_wVisits, how='left', left_on = ['EC ID'], right_on = ['Acc_ID'])
BeverageMachine7_wTickets_wTelemetry_wSales_wVisits_df['End Date in Local Time Zone'] = pd.to_datetime(BeverageMachine7_wTickets_wTelemetry_wSales_wVisits_df['End Date in Local Time Zone'])
BeverageMachine7_wTickets_wTelemetry_wSales_wVisits_df['End Date in Local Time Zone'] = BeverageMachine7_wTickets_wTelemetry_wSales_wVisits_df['End Date in Local Time Zone'].fillna(dt.datetime(2000,1,1))
BeverageMachine7_wTickets_wTelemetry_wSales_wVisits_df = BeverageMachine7_wTickets_wTelemetry_wSales_wVisits_df.fillna(0)


In [234]:
BeverageMachine7_wTickets_wTelemetry_wSales_wVisits_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 239400 entries, 0 to 239399
Data columns (total 84 columns):
 #   Column                                                           Non-Null Count   Dtype         
---  ------                                                           --------------   -----         
 0   Serial ID                                                        239400 non-null  object        
 1   Sales Organisation                                               239400 non-null  object        
 2   Machine Status Groupings                                         239400 non-null  object        
 3   User Status                                                      239400 non-null  object        
 4   TA Contract Installation Date                                    239400 non-null  int32         
 5   Depreciation Start                                               239400 non-null  int32         
 6   Manufacturer Number                                              239

In [235]:
BeverageMachine7_wTickets_wTelemetry_wSales_wVisits_df.head()

Unnamed: 0,Serial ID,Sales Organisation,Machine Status Groupings,User Status,TA Contract Installation Date,Depreciation Start,Manufacturer Number,Position,TA Contract Start Date,TA Contract End Date,...,Result_Not assigned,Result_Objective Met,Result_Objective Partially Met,Result_Requires Further Follow-up,Result_Unsuccessful Selling Call,Activity Life Cycle Status_Canceled,Activity Life Cycle Status_In Process,Activity Life Cycle Status_Open,Last_visit_diff_months,#Visits completed
0,153933605,Néstlé Jordania,Deployed,Installed,0,42339,20153933605,LOAN,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,160708192,Néstlé Jordania,Deployed,Installed,0,42491,20160708192,LOAN,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,160708189,Néstlé Jordania,Deployed,Installed,0,42491,20160708189,LOAN,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,153933611,Néstlé Jordania,Deployed,Installed,0,42461,20153933611,LOAN,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,153933610,Néstlé Jordania,Deployed,Installed,0,42461,20153933610,LOAN,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [236]:
PhoneCallsdf_prep3 = PhoneCallsdf_prep3.reset_index()

In [237]:
PhoneCallsdf_prep3['Account Name'] = PhoneCallsdf_prep3['Account Name'].astype('str')

In [238]:
BeverageMachine7_wTickets_wTelemetry_wSales_wVisits_wCalls_df = pd.merge(BeverageMachine7_wTickets_wTelemetry_wSales_wVisits_df, PhoneCallsdf_prep3, how='left', left_on = ['EC ID'], right_on = ['Account Name'])
BeverageMachine7_wTickets_wTelemetry_wSales_wVisits_wCalls_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 239400 entries, 0 to 239399
Data columns (total 88 columns):
 #   Column                                                           Non-Null Count   Dtype         
---  ------                                                           --------------   -----         
 0   Serial ID                                                        239400 non-null  object        
 1   Sales Organisation                                               239400 non-null  object        
 2   Machine Status Groupings                                         239400 non-null  object        
 3   User Status                                                      239400 non-null  object        
 4   TA Contract Installation Date                                    239400 non-null  int32         
 5   Depreciation Start                                               239400 non-null  int32         
 6   Manufacturer Number                                              239

In [239]:
BeverageMachine7_wTickets_wTelemetry_wSales_wVisits_wCalls_df['End Date in Local Time Zone_y'] = pd.to_datetime(BeverageMachine7_wTickets_wTelemetry_wSales_wVisits_wCalls_df['End Date in Local Time Zone_y'])

BeverageMachine7_wTickets_wTelemetry_wSales_wVisits_wCalls_df['End Date in Local Time Zone_y'] = BeverageMachine7_wTickets_wTelemetry_wSales_wVisits_wCalls_df['End Date in Local Time Zone_y'].fillna(dt.datetime(2000,1,1))
BeverageMachine7_wTickets_wTelemetry_wSales_wVisits_wCalls_df = BeverageMachine7_wTickets_wTelemetry_wSales_wVisits_wCalls_df.fillna(0)

In [240]:
BeverageMachine7_wTickets_wTelemetry_wSales_wVisits_wCalls_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 239400 entries, 0 to 239399
Data columns (total 88 columns):
 #   Column                                                           Non-Null Count   Dtype         
---  ------                                                           --------------   -----         
 0   Serial ID                                                        239400 non-null  object        
 1   Sales Organisation                                               239400 non-null  object        
 2   Machine Status Groupings                                         239400 non-null  object        
 3   User Status                                                      239400 non-null  object        
 4   TA Contract Installation Date                                    239400 non-null  int32         
 5   Depreciation Start                                               239400 non-null  int32         
 6   Manufacturer Number                                              239

In [241]:
IncidentTicketdf_prep2['Serial ID'] = IncidentTicketdf_prep2['Serial ID'].astype('str')

In [242]:
BeverageMachine_all_df = pd.merge(BeverageMachine7_wTickets_wTelemetry_wSales_wVisits_wCalls_df, IncidentTicketdf_prep2, how='left', left_on = ['Serial ID'], right_on = ['Serial ID'])
BeverageMachine_all_df['Completion Date_2'] = pd.to_datetime(BeverageMachine_all_df['Completion Date_2'])
BeverageMachine_all_df['Completion Date_2'] = BeverageMachine_all_df['Completion Date_2'].fillna(dt.datetime(2000,1,1))
BeverageMachine_all_df = BeverageMachine_all_df.fillna(0)


In [243]:
f=BeverageMachine_all_df.loc[BeverageMachine_all_df['Serial ID']=='7010054129']
f.iloc[:20,80:100]              

Unnamed: 0,Activity Life Cycle Status_In Process,Activity Life Cycle Status_Open,Last_visit_diff_months,#Visits completed,Account Name,End Date in Local Time Zone_y,Last_call_diff_months,#Calls Completed,Completion Date_2,Incident Category_1.a Ingredient Calibration,Incident Category_1.b Ingredient Dispensing,Incident Category_1.c Ingredient Dripping,Incident Category_1.d Ingredient Other,Incident Category_10 Abnormal smell,Incident Category_11 Electrical power,Incident Category_12 Water supply issue,Incident Category_13 Connectivity (modem),Incident Category_14 Accessory problem(external pump..),Incident Category_15 Return with parts,Incident Category_16 Operator mishandling(improper fill..)
115923,0.0,0.0,9.560771,2.0,0,2000-01-01,0.0,0.0,2000-01-01,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
185541,0.0,0.0,0.0,0.0,0,2000-01-01,0.0,0.0,2000-01-01,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
227072,0.0,0.0,13.930471,1.0,0,2000-01-01,0.0,0.0,2000-01-01,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [244]:
MktActions_prep3 = MktActions_prep3.reset_index()
MktActions_prep3

Unnamed: 0,Serial ID,Actions_Churn risk reason unknown,Actions_Data corrected,Actions_Downgrade machine installed,Actions_Lack of data discipline,Actions_New contract,Actions_Other,Actions_Out of order,Actions_Phone Call completed,Actions_Removal Plan,...,Actions_Removed,Actions_Reviewed and no action Required,Actions_Reviewed and no actions required,Actions_Seasonal Machine,Actions_Telemetry installed,Actions_Upgrade machine installed,Actions_Visit completed,Actions_Visit/Call planned,Actions_removed,Actions_tagging update
0,24606,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1895151,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,10238090,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,10238091,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
4,10238092,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
925,22O0021800,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
926,22O0021869,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
927,34F6401007,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
928,EM10023,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [245]:
MktActions_prep3['Serial ID'] = MktActions_prep3['Serial ID'].astype('str')

In [246]:

BeverageMachine_all_df2 = pd.merge(BeverageMachine_all_df, MktActions_prep3, how='left', left_on = ['Serial ID'], right_on = ['Serial ID'])
#BeverageMachine_all_df['Completion Date_2'] = pd.to_datetime(BeverageMachine_all_df['Completion Date_2'])
#BeverageMachine_all_df['Completion Date_2'] = BeverageMachine_all_df['Completion Date_2'].fillna(dt.datetime(2000,1,1))
BeverageMachine_all_df2 = BeverageMachine_all_df2.fillna(0)

BeverageMachine_all_df = BeverageMachine_all_df2

UKService_prep2 = UKService_prep2.reset_index()

BeverageMachine_all_df2 = pd.merge(BeverageMachine_all_df, UKService_prep2, how='left', left_on = ['Key_ManufacturerID_SalesOrg'], right_on = ['Key_ManufacturerID_SalesOrg'])
BeverageMachine_all_df2['Month'] = pd.to_datetime(BeverageMachine_all_df2['Month'])
BeverageMachine_all_df2['Month'] = BeverageMachine_all_df2['Month'].fillna(dt.datetime(2000,1,1))
BeverageMachine_all_df2 = BeverageMachine_all_df2.fillna(0)


In [247]:
BeverageMachine_all_df = BeverageMachine_all_df2
a = BeverageMachine_all_df.iloc[:,100:]
a.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 239400 entries, 0 to 239399
Data columns (total 49 columns):
 #   Column                                                      Non-Null Count   Dtype  
---  ------                                                      --------------   -----  
 0   Incident Category_17 Miscellaneous                          239400 non-null  float64
 1   Incident Category_18 N/A                                    239400 non-null  float64
 2   Incident Category_2.a Hydraulic Calibration                 239400 non-null  float64
 3   Incident Category_2.b Hydraulic Dispensing                  239400 non-null  float64
 4   Incident Category_2.c Hydraulic Leaking                     239400 non-null  float64
 5   Incident Category_2.d Hydraulic Heating                     239400 non-null  float64
 6   Incident Category_2.e Hydraulic Cooling/Freezing            239400 non-null  float64
 7   Incident Category_2.f Hydraulic Filling                     239400 non-nul

###TODO Remove when market have enough data

BeverageMachine_all_df2 = BeverageMachine_all_df.copy()

# Sales Organisation with more than one month of data
SO = ['Nestle Sweden',  'Nestlé Czech', 'Nestlé Denmark', 'Nestlé Finland', 'Nestlé Norway', 'Nestlé Slovak Republic']

#BeverageMachine_all_df3 =  pd.DataFrame([])

for i in SO:
    BeverageMachine_all_df2 = BeverageMachine_all_df2.loc[BeverageMachine_all_df2['Sales Organisation'] != i]
BeverageMachine_all_df2.head()
BeverageMachine_all_df = BeverageMachine_all_df2

# Specify the filename
filename = 'TelemetryColumnsList.p'

# Combine the file path and filename
file_path_with_filename = os.path.join(file_path_output, filename)

# Save the list into a pickle file
with open(file_path_with_filename, 'wb') as file:
    pickle.dump(TelemetryColumnsList, file)

#### Data summary

I now have four datasets :

"BeverageMachine7_df" 

    which is all the data of the beverage machines without ticket data

    This data will be our main data and it will be used to Train and test our models because we have data for all the machines

"BeverageMachine7_wTickets_df" 

    which is with the Ticket data and when there is no tickets for a machine we fill with 0
    
    As we only have around 2000 machines having tickets we will use it on the model that performed better with main data to see if it can bring better results with Telemetry data

"BeverageMachine7_wTicketsOnly_df" 

    which is only the data of the machines having Tickets
    
    Only useful for EDA

"BeverageMachine7_wTickets_wTelemetry_df"

    We will use it on the model that performed better with main data to see if it can bring better results than the Main data or the Main data with tickets. If it does not improve significantly the results we will not use it  because it takes a lot of time to get Telemetry data.
    Later, more machines will have Telemetry and a data lake will be created and it will br easier to get the data.

### Save the data<a class="anchor" id="save"></a>

I choose to save the data into a pickle file because it is a good way to transfer a pandas dataframe

##### BeverageMachine7_df

In [248]:
BeverageMachine_all_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 239400 entries, 0 to 239399
Columns: 149 entries, Serial ID to Actions_tagging update
dtypes: bool(1), datetime64[ns](4), float64(99), int32(6), int64(1), object(38)
memory usage: 266.9+ MB


In [249]:
BeverageMachine_all_df.iloc[0:10, 68:90]

Unnamed: 0,KeyManufNo_SalesOrg,(lst_mth-6mth)/6mth_y,3mth-6mth)/6mth_y,Acc_ID,End Date in Local Time Zone_x,Result_Incomplete Selling Call,Result_Not assigned,Result_Objective Met,Result_Objective Partially Met,Result_Requires Further Follow-up,...,Activity Life Cycle Status_In Process,Activity Life Cycle Status_Open,Last_visit_diff_months,#Visits completed,Account Name,End Date in Local Time Zone_y,Last_call_diff_months,#Calls Completed,Completion Date_2,Incident Category_1.a Ingredient Calibration
0,0,0.0,0.0,0,2000-01-01,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0,2000-01-01,0.0,0.0,2000-01-01,1.0
1,0,0.0,0.0,0,2000-01-01,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0,2000-01-01,0.0,0.0,2000-01-01,0.0
2,0,0.0,0.0,0,2000-01-01,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0,2000-01-01,0.0,0.0,2000-01-01,2.0
3,0,0.0,0.0,0,2000-01-01,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0,2000-01-01,0.0,0.0,2000-01-01,1.0
4,0,0.0,0.0,0,2000-01-01,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0,2000-01-01,0.0,0.0,2000-01-01,0.0
5,0,0.0,0.0,0,2000-01-01,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0,2000-01-01,0.0,0.0,2000-01-01,0.0
6,0,0.0,0.0,0,2000-01-01,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0,2000-01-01,0.0,0.0,2000-01-01,0.0
7,0,0.0,0.0,0,2000-01-01,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0,2000-01-01,0.0,0.0,2000-01-01,0.0
8,0,0.0,0.0,0,2000-01-01,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0,2000-01-01,0.0,0.0,2000-01-01,0.0
9,0,0.0,0.0,0,2000-01-01,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0,2000-01-01,0.0,0.0,2000-01-01,0.0


In [250]:
# Specify the filename
filename = 'BM_noTickets.p'

# Combine the file path and filename
file_path_with_filename = os.path.join(file_path_output, filename)

# Save the DataFrame into a pickle file
with open(file_path_with_filename, 'wb') as file:
    pickle.dump(BeverageMachine7_df, file)

In [251]:
# Specify the filename
filename = 'BM_noTickets.p'

# Combine the file path and filename
file_path_with_filename = os.path.join(file_path_output, filename)

# Load the pickle file
with open(file_path_with_filename, 'rb') as file:
    BM_noTickets = pickle.load(file)

Quick test to see if I am able to reopen the data in another Notebook

In [252]:
a=BeverageMachine_all_df.loc[BeverageMachine_all_df['EC ID']=='1056184']
a.iloc[0:10,:86]

Unnamed: 0,Serial ID,Sales Organisation,Machine Status Groupings,User Status,TA Contract Installation Date,Depreciation Start,Manufacturer Number,Position,TA Contract Start Date,TA Contract End Date,...,Result_Objective Partially Met,Result_Requires Further Follow-up,Result_Unsuccessful Selling Call,Activity Life Cycle Status_Canceled,Activity Life Cycle Status_In Process,Activity Life Cycle Status_Open,Last_visit_diff_months,#Visits completed,Account Name,End Date in Local Time Zone_y
69461,20E0006031,Nestle UK,Deployed,Installed,44518,44531,20203123964,RENT,44518,2958465,...,0.0,0.0,0.0,0.0,0.0,0.0,10.250724,6.0,0,2000-01-01
69464,21E0003520,Nestle UK,Deployed,Installed,44515,44531,20212718834,RENT,44515,2958465,...,0.0,0.0,0.0,0.0,0.0,0.0,10.250724,6.0,0,2000-01-01
69782,20E0005995,Nestle UK,Deployed,Installed,44515,44531,20203123928,RENT,44515,2958465,...,0.0,0.0,0.0,0.0,0.0,0.0,10.250724,6.0,0,2000-01-01
69792,20E0006016,Nestle UK,Deployed,Installed,44518,44531,20203123949,RENT,44518,2958465,...,0.0,0.0,0.0,0.0,0.0,0.0,10.250724,6.0,0,2000-01-01
69793,20E0006003,Nestle UK,Deployed,Installed,44515,44531,20203123936,RENT,44515,2958465,...,0.0,0.0,0.0,0.0,0.0,0.0,10.250724,6.0,0,2000-01-01
69820,20E0006025,Nestle UK,Deployed,Installed,44589,44621,20203123958,RENT,44589,2958465,...,0.0,0.0,0.0,0.0,0.0,0.0,10.250724,6.0,0,2000-01-01
69874,21E0003516,Nestle UK,Deployed,Installed,44515,44531,20212718830,RENT,44515,2958465,...,0.0,0.0,0.0,0.0,0.0,0.0,10.250724,6.0,0,2000-01-01
69875,21E0003517,Nestle UK,Deployed,Installed,44518,44531,20212718831,RENT,44518,2958465,...,0.0,0.0,0.0,0.0,0.0,0.0,10.250724,6.0,0,2000-01-01
69891,21E0003519,Nestle UK,Deployed,Installed,44518,44531,20212718833,RENT,44518,2958465,...,0.0,0.0,0.0,0.0,0.0,0.0,10.250724,6.0,0,2000-01-01
70099,21E0003522,Nestle UK,Deployed,Installed,44515,44531,20212718836,RENT,44515,2958465,...,0.0,0.0,0.0,0.0,0.0,0.0,10.250724,6.0,0,2000-01-01


##### BeverageMachine7_wTickets_df

In [253]:
# Specify the filename
filename = 'BM_wTickets.p'

# Combine the file path and filename
file_path_with_filename = os.path.join(file_path_output, filename)

# Save the DataFrame into a pickle file
with open(file_path_with_filename, 'wb') as file:
    pickle.dump(BeverageMachine7_wTickets_df, file)

##### BeverageMachine7_wTicketsOnly_df

In [254]:
# Specify the filename
filename = 'BM_wTicketsOnly.p'

# Combine the file path and filename
file_path_with_filename = os.path.join(file_path_output, filename)

# Save the DataFrame into a pickle file
with open(file_path_with_filename, 'wb') as file:
    pickle.dump(BeverageMachine7_wTicketsOnly_df, file)

##### BeverageMachine7_wTickets_wTelemetry_df

In [255]:
# Specify the filename
filename = 'BeverageMachine7_wTickets_wTelemetry_df.p'

# Combine the file path and filename
file_path_with_filename = os.path.join(file_path_output, filename)

# Save the DataFrame into a pickle file
with open(file_path_with_filename, 'wb') as file:
    pickle.dump(BeverageMachine7_wTickets_wTelemetry_df, file)

##### Other dataframe needed for the second preparation step later

In [256]:
# Specify the filename
filename = 'IncidentTicketdf.p'

# Combine the file path and filename
file_path_with_filename = os.path.join(file_path_output, filename)

# Save the list into a pickle file
with open(file_path_with_filename, 'wb') as file:
    pickle.dump(IncidentTicketdf_prep2, file)

In [257]:
# Specify the filename
filename = 'TelemetryAggregated_df.p'

# Combine the file path and filename
file_path_with_filename = os.path.join(file_path_output, filename)

# Save the list into a pickle file
with open(file_path_with_filename, 'wb') as file:
    pickle.dump(Telemetry_aggSales, file)

#### All data with placements, telemetry, visits, phone calls, Incidents tickets

In [258]:
# Specify the filename
filename = 'BeverageMachine_all_df2.p'

# Combine the file path and filename
file_path_with_filename = os.path.join(file_path_output, filename)

# Save the DataFrame into a pickle file
with open(file_path_with_filename, 'wb') as file:
    pickle.dump(BeverageMachine_all_df, file)

In [259]:
BeverageMachine_all_df.to_csv(r'C:\Users\msalomo\predictions-BevData.csv', index = False, header=True)