## DATA MIGRATION CASE

Based on this dataset, decide which data should be carried over from the origin system to the future data system. First understand what the data describes and what the needs are, then clean and analyze the dataset and explain the process.

------

## What information do we need to migrate to the future system and what data is not essential? 

To decide this, we need to answer the following questions:

- What is the purpose of the future system? What do we want to analyze or extract from this data in the future? Will this data have to be compared or merged with data from other systems?
- How will this data be used? For example, do we want to know who the biggest client is and which machines they usually use? Do we want to track manufacturing time for the most popular machines?
- Are we evaluating manufacturing time to see if we can improve it?

Once we know this, we can tell which data is not needed or only overcomplicates the dataset, and which data would be nice to have (for example, calculated fields derived from what we already have).

Assumption: why we need this data and what we will be using it for.

Important data according to our assumption: 

## Let's analyze all the dataset step by step: 

- Check for null values, duplicates, unique values, and possible errors in format
- Substitute null values or other unnecessary data, and explain why (e.g. Activo fijo could be boolean)
- Fix errors or change values that could lead to confusion
- See if we can group data to simplify the dataset
- See if we can add calculated fields to get more relevant information
- Remove columns that we won't need in the future
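
The first checks in this list can be bundled into one small profiling helper. The sketch below is only an illustration and not part of the original notebook: `profile_columns` and the toy `sample` frame are hypothetical names and values.

```python
import pandas as pd


def profile_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Summarise dtype, null count, null share and distinct values per column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "nulls": df.isnull().sum(),
        "null_pct": (df.isnull().mean() * 100).round(1),
        "unique": df.nunique(dropna=True),
    })


# Toy data standing in for the real dataset (hypothetical values).
sample = pd.DataFrame({
    "Equipo": [1, 2, 3, 4],
    "Flota": ["C", "C", None, "N"],
    "Activo fijo": [None, None, None, "J01"],
})

profile = profile_columns(sample)
duplicate_rows = int(sample.duplicated().sum())

print(profile)
print("fully duplicated rows:", duplicate_rows)
```

Running the same helper on the real dataset before and after cleaning would give a compact way to compare the two states.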

## Other options?

- SQL
- Review assumptions?



-------

# General exploration

## Import and visualize data

In [120]:
import pandas as pd

In [121]:
origin_dataset = pd.read_csv("/Users/martafillolbruguera/Documents/Data_projects/practice_Case/dataset.csv")

In [122]:
origin_dataset.head()

Unnamed: 0,Número de serie,Equipo,Número-identificación técnica,Grupo planificación,Enviar a parte,Flota,Activo fijo,Año de construcción,Brand name,Creado el,...,Modificado por,Fe.puesta servicio,Fecha de última orden,Inic.garantía clte.,intervalo,Mes de construcción,País de fabricación,Pto.tbjo.responsable,Status de usuario,Tipo de equipo
0,H2X992W15465,1132732,3,E82,33925845,C,,2008.0,LINDE,12/6/2012,...,HK57F5,,11/11/2022,15/9/2008,Contrapesada térmica,5.0,DE,KXE-6188,AVLB,L
1,H2X995S19125,1207034,,E12,39380933,C,,2005.0,,18/6/2012,...,HK57F5,,13/7/2018,14/12/2011,Contrapesada eléctrica,5.0,DE,KXE-6639,AVLB PINA,L
2,W4X979W12995,1290040,1204 PMP EX,E82,33927373,C,,2008.0,LINDE,18/12/2012,...,CKMAGKNK,29/5/2008,19/9/2023,,Apilador,6.0,,KXE-6156,AWIN,L
3,H2X994R11511,1306179,,,33922280,N,,2004.0,LINDE,18/12/2012,...,HK57F5,30/4/2004,24/5/2023,,Contrapesada térmica,5.0,,,AVLB,L
4,G5X997P11599,1655177,,E11,33923892,C,,2003.0,,23/9/2014,...,HK57F5,,1/4/2020,15/7/2003,Contrapesada eléctrica,6.0,GB,KXE-6263,AVLB PINA,L


In [123]:
#First, let's look at the columns we are working with.
#Some column names could be more readable, such as "Enviar a parte", "Creado el" and "intervalo". Each name should make clear what the column contains, so it is not confused with something else if we need to merge with other data in the future.
#The manufacturing date is split into two columns, one for the year and one for the month. Is there a reason for this division? Can we create a single column with the manufacturing date?
#The serial number seems to need to be 12 characters long, so it would be good to check that for all items.
#Do we need to calculate extra columns? Time from manufacturing to service? Warranty end date? Status PINA?
#Check the data types so the columns can be used for analysis later.
#It would be interesting to analyze later by brand, country and status.


origin_dataset.head()

Unnamed: 0,Número de serie,Equipo,Número-identificación técnica,Grupo planificación,Enviar a parte,Flota,Activo fijo,Año de construcción,Brand name,Creado el,...,Modificado por,Fe.puesta servicio,Fecha de última orden,Inic.garantía clte.,intervalo,Mes de construcción,País de fabricación,Pto.tbjo.responsable,Status de usuario,Tipo de equipo
0,H2X992W15465,1132732,3,E82,33925845,C,,2008.0,LINDE,12/6/2012,...,HK57F5,,11/11/2022,15/9/2008,Contrapesada térmica,5.0,DE,KXE-6188,AVLB,L
1,H2X995S19125,1207034,,E12,39380933,C,,2005.0,,18/6/2012,...,HK57F5,,13/7/2018,14/12/2011,Contrapesada eléctrica,5.0,DE,KXE-6639,AVLB PINA,L
2,W4X979W12995,1290040,1204 PMP EX,E82,33927373,C,,2008.0,LINDE,18/12/2012,...,CKMAGKNK,29/5/2008,19/9/2023,,Apilador,6.0,,KXE-6156,AWIN,L
3,H2X994R11511,1306179,,,33922280,N,,2004.0,LINDE,18/12/2012,...,HK57F5,30/4/2004,24/5/2023,,Contrapesada térmica,5.0,,,AVLB,L
4,G5X997P11599,1655177,,E11,33923892,C,,2003.0,,23/9/2014,...,HK57F5,,1/4/2020,15/7/2003,Contrapesada eléctrica,6.0,GB,KXE-6263,AVLB PINA,L


In [124]:
origin_dataset.columns

Index(['Número de serie', 'Equipo', 'Número-identificación técnica',
       'Grupo planificación', 'Enviar a parte', 'Flota', 'Activo fijo',
       'Año de construcción', 'Brand name', 'Creado el',
       'Denominación de garantía de cliente', 'Modificado el',
       'Modificado por', 'Fe.puesta servicio', 'Fecha de última orden',
       'Inic.garantía clte.', 'intervalo', 'Mes de construcción',
       'País de fabricación', 'Pto.tbjo.responsable', 'Status de usuario',
       'Tipo de equipo'],
      dtype='object')

In [125]:
origin_dataset.shape

#check shape of table (rows, columns)

(75564, 22)

In [126]:
origin_dataset.describe()

#not relevant in this case

Unnamed: 0,Equipo,Enviar a parte,Año de construcción,Mes de construcción
count,75564.0,75564.0,73369.0,71685.0
mean,3778983.0,34037050.0,2015.357113,5.317458
std,828046.6,2435792.0,8.611414,3.835214
min,1132732.0,2.0,199.0,0.0
25%,3222933.0,33925870.0,2013.0,2.0
50%,3256996.0,33937470.0,2016.0,5.0
75%,4280855.0,33944430.0,2019.0,9.0
max,6122680.0,73953050.0,2202.0,12.0
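
Although the summary is set aside as not relevant here, its `min` and `max` rows do show values that cannot be right: a construction year of 199 and of 2202, and a construction month of 0. A minimal sketch for flagging such rows follows; the lower bound of 1950 and the toy data are assumptions, not values taken from the dataset.

```python
import pandas as pd

# Toy rows mimicking the extremes seen in describe() (hypothetical data).
sample = pd.DataFrame({
    "Año de construcción": [2008.0, 199.0, 2202.0, None],
    "Mes de construcción": [5.0, 0.0, 12.0, 3.0],
})

MIN_YEAR = 1950                      # assumed lower bound for a plausible year
MAX_YEAR = pd.Timestamp.now().year   # a machine cannot be built in the future

year = sample["Año de construcción"]
month = sample["Mes de construcción"]

# between() is False for NaN, so combine with notna() to flag only real values.
bad_year = year.notna() & ~year.between(MIN_YEAR, MAX_YEAR)
bad_month = month.notna() & ~month.between(1, 12)

print(sample[bad_year | bad_month])
```

Rows flagged this way could be reviewed or set to null before the year and month are combined into a date.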


In [127]:
origin_dataset.info()

#info() prints its report and returns None, so there is nothing to wrap in a DataFrame
#it shows the data type and the number of non-null values of each column
#we can already see some odd dtypes (dates are not stored as datetime, Equipo is an integer, month is a float), so we will fix these when cleaning.
#Some columns are more relevant: Equipo, Flota, Tipo de equipo, Número de serie. I would drop the rows that have null values in more than one of these.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75564 entries, 0 to 75563
Data columns (total 22 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   Número de serie                      50634 non-null  object 
 1   Equipo                               75564 non-null  int64  
 2   Número-identificación técnica        5817 non-null   object 
 3   Grupo planificación                  13431 non-null  object 
 4   Enviar a parte                       75564 non-null  int64  
 5   Flota                                75558 non-null  object 
 6   Activo fijo                          1800 non-null   object 
 7   Año de construcción                  73369 non-null  float64
 8   Brand name                           70433 non-null  object 
 9   Creado el                            75564 non-null  object 
 10  Denominación de garantía de cliente  11980 non-null  object 
 11  Modificado el               

In [128]:
origin_dataset.isnull().sum()

#check for null values. The only columns without any are Equipo, Enviar a parte, Creado el and Tipo de equipo, so we will treat these as the most relevant
#Activo fijo has the most null values, so we will check this column later to decide whether we need it or can substitute the null values.

Número de serie                        24930
Equipo                                     0
Número-identificación técnica          69747
Grupo planificación                    62133
Enviar a parte                             0
Flota                                      6
Activo fijo                            73764
Año de construcción                     2195
Brand name                              5131
Creado el                                  0
Denominación de garantía de cliente    63584
Modificado el                          11598
Modificado por                         11598
Fe.puesta servicio                     65823
Fecha de última orden                  62740
Inic.garantía clte.                    63582
intervalo                              64313
Mes de construcción                     3879
País de fabricación                    66021
Pto.tbjo.responsable                   62228
Status de usuario                      61245
Tipo de equipo                             0
dtype: int

In [129]:
origin_dataset["Equipo"].value_counts()

#equipo are all unique values and non-null

1132732    1
3880535    1
3882621    1
3882562    1
3882310    1
          ..
3230089    1
3230088    1
3230087    1
3230086    1
6122680    1
Name: Equipo, Length: 75564, dtype: int64

In [130]:
origin_dataset["Número de serie"].value_counts()
#check how often each serial number appears.
#Número de serie should be a unique identifier, but some values appear several times, which means those rows differ in other columns.

#I would drop the duplicated rows if they don't carry any other relevant data in other columns.

2,99E+52         17
2991552111259    12
2991552111252    11
7991142115255    11
2991552111555    11
                 ..
9521122111499     1
9521122111492     1
9521122111911     1
9521142111421     1
W45565N11195      1
Name: Número de serie, Length: 33667, dtype: int64
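
The most frequent value, `2,99E+52`, looks like a serial number that a spreadsheet converted to scientific notation, in which case the original identifier is probably lost for those rows. Under that assumption, the sketch below separates such values from genuine repeats. The regular expression and the toy data are illustrative only.

```python
import pandas as pd

serials = pd.Series(
    ["H2X992W15465", "2,99E+52", "2,99E+52", "2991552111259", "2991552111259", None],
    name="Número de serie",
)

# Matches strings such as "2,99E+52" or "2.99E+52"; nulls count as no match.
sci_notation = serials.str.fullmatch(r"\d+[.,]\d+E\+?\d+", na=False)

# keep=False marks every member of a repeated group; nulls are excluded.
repeated = serials.notna() & serials.duplicated(keep=False)

print("scientific-notation values:", int(sci_notation.sum()))
print("repeated, well-formed values:", int((repeated & ~sci_notation).sum()))
```

Keeping the two groups apart matters because repeated well-formed serials may be real duplicates to merge, while the scientific-notation ones cannot be trusted as identifiers at all.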

---------

### Conclusions from this exploration:

- We could rename some columns so their content is clear at a glance.
- The data types are not consistent; we will change some columns to make them easier to use in the future.
- The año/mes columns could be combined, but they contain many nulls, so drop the rows where both are empty and create a new column called "manufacturing date".
- Activo fijo has a lot of null values; it could perhaps be transformed to boolean.
- We could create another column showing how long the warranty has been active, because the start date alone says little.
- Before dropping columns:
  - The most relevant fields could be Número de serie, Equipo, Flota, Tipo de equipo, Enviar a parte, Brand name and construction date. We will drop rows that have null values in more than one of these.

- Columns to drop:  
    Modificado por: doesn't seem relevant for analysis or for migration to the future system.   
    Pto.tbjo.responsable: internal, doesn't seem relevant to migrate, and has many nulls.   
    Grupo planificación: also internal, seems like something that could change over time, and has many nulls.   
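
The rule stated in these conclusions, dropping rows that are missing more than one of the most relevant fields, can be expressed by counting nulls per row. This is a sketch on toy data; the choice of `key_columns` here is a shortened, assumed version of the list above.

```python
import pandas as pd

# Assumed subset of the "most relevant" fields, for illustration only.
key_columns = ["Número de serie", "Equipo", "Flota", "Tipo de equipo"]

sample = pd.DataFrame({
    "Número de serie": ["H2X992W15465", None, None],
    "Equipo": [1132732, 1207034, 1290040],
    "Flota": ["C", "C", None],
    "Tipo de equipo": ["L", "L", "G"],
})

# Count missing key fields per row and keep rows with at most one missing.
missing_keys = sample[key_columns].isnull().sum(axis=1)
kept = sample[missing_keys <= 1]

print(kept)
```

The threshold is easy to change, so the effect of a stricter or looser rule on the row count can be checked before committing to it.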


------

## General changes

### Updating names

In [131]:
clean_dataset = pd.read_csv("/Users/martafillolbruguera/Documents/Data_projects/practice_Case/dataset.csv")

clean_dataset.head()

Unnamed: 0,Número de serie,Equipo,Número-identificación técnica,Grupo planificación,Enviar a parte,Flota,Activo fijo,Año de construcción,Brand name,Creado el,...,Modificado por,Fe.puesta servicio,Fecha de última orden,Inic.garantía clte.,intervalo,Mes de construcción,País de fabricación,Pto.tbjo.responsable,Status de usuario,Tipo de equipo
0,H2X992W15465,1132732,3,E82,33925845,C,,2008.0,LINDE,12/6/2012,...,HK57F5,,11/11/2022,15/9/2008,Contrapesada térmica,5.0,DE,KXE-6188,AVLB,L
1,H2X995S19125,1207034,,E12,39380933,C,,2005.0,,18/6/2012,...,HK57F5,,13/7/2018,14/12/2011,Contrapesada eléctrica,5.0,DE,KXE-6639,AVLB PINA,L
2,W4X979W12995,1290040,1204 PMP EX,E82,33927373,C,,2008.0,LINDE,18/12/2012,...,CKMAGKNK,29/5/2008,19/9/2023,,Apilador,6.0,,KXE-6156,AWIN,L
3,H2X994R11511,1306179,,,33922280,N,,2004.0,LINDE,18/12/2012,...,HK57F5,30/4/2004,24/5/2023,,Contrapesada térmica,5.0,,,AVLB,L
4,G5X997P11599,1655177,,E11,33923892,C,,2003.0,,23/9/2014,...,HK57F5,,1/4/2020,15/7/2003,Contrapesada eléctrica,6.0,GB,KXE-6263,AVLB PINA,L


In [132]:
#avoid naming the variable "list", which would shadow the Python built-in
columns = clean_dataset.columns

for column in columns:
    print(column)

Número de serie
Equipo
Número-identificación técnica
Grupo planificación
Enviar a parte
Flota
Activo fijo
Año de construcción
Brand name
Creado el
Denominación de garantía de cliente
Modificado el
Modificado por
Fe.puesta servicio
Fecha de última orden
Inic.garantía clte.
intervalo
Mes de construcción
País de fabricación
Pto.tbjo.responsable
Status de usuario
Tipo de equipo


In [133]:
# Mapping of old column names to new column names

column_mapping = {
    "Número de serie": "serial_number",
    "Equipo": "equipment_id",
    "Número-identificación técnica": "client_id",
    "Grupo planificación": "group",
    "Enviar a parte": "client_code",
    "Flota": "fleet_type",
    "Activo fijo": "fixed_asset",
    "Año de construcción": "construction_year",
    "Brand name": "brand_name",
    "Creado el": "created_date",
    "Denominación de garantía de cliente": "warranty_type",
    "Modificado el": "last_modified",
    "Modificado por": "modified_by",
    "Fe.puesta servicio": "service_start_date",
    "Fecha de última orden": "last_service_date",
    "Inic.garantía clte.": "warranty_start",
    "intervalo": "equipment_type_name",
    "Mes de construcción": "construction_month",
    "País de fabricación": "country",
    "Pto.tbjo.responsable": "technician",
    "Status de usuario": "user_status",
    "Tipo de equipo": "equipment_type"
}

clean_dataset.reset_index(drop=True, inplace=True)

clean_dataset.rename(columns=column_mapping, inplace=True)


In [134]:
clean_dataset.head()

Unnamed: 0,serial_number,equipment_id,client_id,group,client_code,fleet_type,fixed_asset,construction_year,brand_name,created_date,...,modified_by,service_start_date,last_service_date,warranty_start,equipment_type_name,construction_month,country,technician,user_status,equipment_type
0,H2X992W15465,1132732,3,E82,33925845,C,,2008.0,LINDE,12/6/2012,...,HK57F5,,11/11/2022,15/9/2008,Contrapesada térmica,5.0,DE,KXE-6188,AVLB,L
1,H2X995S19125,1207034,,E12,39380933,C,,2005.0,,18/6/2012,...,HK57F5,,13/7/2018,14/12/2011,Contrapesada eléctrica,5.0,DE,KXE-6639,AVLB PINA,L
2,W4X979W12995,1290040,1204 PMP EX,E82,33927373,C,,2008.0,LINDE,18/12/2012,...,CKMAGKNK,29/5/2008,19/9/2023,,Apilador,6.0,,KXE-6156,AWIN,L
3,H2X994R11511,1306179,,,33922280,N,,2004.0,LINDE,18/12/2012,...,HK57F5,30/4/2004,24/5/2023,,Contrapesada térmica,5.0,,,AVLB,L
4,G5X997P11599,1655177,,E11,33923892,C,,2003.0,,23/9/2014,...,HK57F5,,1/4/2020,15/7/2003,Contrapesada eléctrica,6.0,GB,KXE-6263,AVLB PINA,L


### Changing dtypes

In [135]:
clean_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75564 entries, 0 to 75563
Data columns (total 22 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   serial_number        50634 non-null  object 
 1   equipment_id         75564 non-null  int64  
 2   client_id            5817 non-null   object 
 3   group                13431 non-null  object 
 4   client_code          75564 non-null  int64  
 5   fleet_type           75558 non-null  object 
 6   fixed_asset          1800 non-null   object 
 7   construction_year    73369 non-null  float64
 8   brand_name           70433 non-null  object 
 9   created_date         75564 non-null  object 
 10  warranty_type        11980 non-null  object 
 11  last_modified        63966 non-null  object 
 12  modified_by          63966 non-null  object 
 13  service_start_date   9741 non-null   object 
 14  last_service_date    12824 non-null  object 
 15  warranty_start       11982 non-null 

In [136]:
#for now, convert the date columns to datetime; the construction month and year will be changed from float to int afterwards so they can be combined later

# created_date
# last_modified
# service_start_date
# last_service_date
# warranty_start

In [137]:
date_columns = ["created_date", "last_modified", "service_start_date", "last_service_date", "warranty_start"]

#errors="coerce" turns any value that does not match the format into NaT
for column in date_columns:
    clean_dataset[column] = pd.to_datetime(clean_dataset[column], format="%d/%m/%Y", errors="coerce")

In [138]:
clean_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75564 entries, 0 to 75563
Data columns (total 22 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   serial_number        50634 non-null  object        
 1   equipment_id         75564 non-null  int64         
 2   client_id            5817 non-null   object        
 3   group                13431 non-null  object        
 4   client_code          75564 non-null  int64         
 5   fleet_type           75558 non-null  object        
 6   fixed_asset          1800 non-null   object        
 7   construction_year    73369 non-null  float64       
 8   brand_name           70433 non-null  object        
 9   created_date         75564 non-null  datetime64[ns]
 10  warranty_type        11980 non-null  object        
 11  last_modified        63966 non-null  datetime64[ns]
 12  modified_by          63966 non-null  object        
 13  service_start_date   9741 non-n

In [139]:
# review changes
clean_dataset.head()

Unnamed: 0,serial_number,equipment_id,client_id,group,client_code,fleet_type,fixed_asset,construction_year,brand_name,created_date,...,modified_by,service_start_date,last_service_date,warranty_start,equipment_type_name,construction_month,country,technician,user_status,equipment_type
0,H2X992W15465,1132732,3,E82,33925845,C,,2008.0,LINDE,2012-06-12,...,HK57F5,NaT,2022-11-11,2008-09-15,Contrapesada térmica,5.0,DE,KXE-6188,AVLB,L
1,H2X995S19125,1207034,,E12,39380933,C,,2005.0,,2012-06-18,...,HK57F5,NaT,2018-07-13,2011-12-14,Contrapesada eléctrica,5.0,DE,KXE-6639,AVLB PINA,L
2,W4X979W12995,1290040,1204 PMP EX,E82,33927373,C,,2008.0,LINDE,2012-12-18,...,CKMAGKNK,2008-05-29,2023-09-19,NaT,Apilador,6.0,,KXE-6156,AWIN,L
3,H2X994R11511,1306179,,,33922280,N,,2004.0,LINDE,2012-12-18,...,HK57F5,2004-04-30,2023-05-24,NaT,Contrapesada térmica,5.0,,,AVLB,L
4,G5X997P11599,1655177,,E11,33923892,C,,2003.0,,2014-09-23,...,HK57F5,NaT,2020-04-01,2003-07-15,Contrapesada eléctrica,6.0,GB,KXE-6263,AVLB PINA,L
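
With the date columns parsed, the warranty age suggested in the exploration conclusions becomes a simple subtraction. The sketch below uses toy data; the column name `warranty_age_days` and the fixed `reference_date` are assumptions chosen for illustration.

```python
import pandas as pd

sample = pd.DataFrame({
    "warranty_start": pd.to_datetime(
        ["15/9/2008", "14/12/2011", None], format="%d/%m/%Y"
    ),
})

# Assumed cut-off date; in practice this could be the migration date.
reference_date = pd.Timestamp("2024-04-01")

# Rows without a warranty start stay null instead of getting a made-up age.
sample["warranty_age_days"] = (reference_date - sample["warranty_start"]).dt.days

print(sample)
```

A fixed reference date keeps the result reproducible, whereas using today's date would change the column every time the notebook runs.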


In [140]:
#check that the null counts are the same as before the conversion

print(clean_dataset.isnull().sum())

origin_dataset.isnull().sum()


serial_number          24930
equipment_id               0
client_id              69747
group                  62133
client_code                0
fleet_type                 6
fixed_asset            73764
construction_year       2195
brand_name              5131
created_date               0
warranty_type          63584
last_modified          11598
modified_by            11598
service_start_date     65823
last_service_date      62740
warranty_start         63582
equipment_type_name    64313
construction_month      3879
country                66021
technician             62228
user_status            61245
equipment_type             0
dtype: int64


Número de serie                        24930
Equipo                                     0
Número-identificación técnica          69747
Grupo planificación                    62133
Enviar a parte                             0
Flota                                      6
Activo fijo                            73764
Año de construcción                     2195
Brand name                              5131
Creado el                                  0
Denominación de garantía de cliente    63584
Modificado el                          11598
Modificado por                         11598
Fe.puesta servicio                     65823
Fecha de última orden                  62740
Inic.garantía clte.                    63582
intervalo                              64313
Mes de construcción                     3879
País de fabricación                    66021
Pto.tbjo.responsable                   62228
Status de usuario                      61245
Tipo de equipo                             0
dtype: int

In [141]:
#change month and year from float to int. Null values can't be converted directly to int, so we fill them first: missing months with 1 (January) and missing years with 2016 (the median year shown by describe()).

clean_dataset.construction_month = clean_dataset.construction_month.fillna(1).astype(int)
clean_dataset.construction_year = clean_dataset.construction_year.fillna(2016).astype(int)
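
#Alternative sketch (not what this notebook does): filling the gaps makes imputed
#rows indistinguishable from machines really built in January 2016. The pandas
#nullable integer dtype "Int64" stores whole numbers and keeps missing values as <NA>.
#The toy frame below is hypothetical.

```python
import pandas as pd

sample = pd.DataFrame({
    "construction_year": [2008.0, None, 2015.0],
    "construction_month": [5.0, 6.0, None],
})

# "Int64" (capital I) is the nullable extension dtype, unlike NumPy's "int64".
for column in ["construction_year", "construction_month"]:
    sample[column] = sample[column].astype("Int64")

print(sample.dtypes)
print(sample.isnull().sum())
```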

In [142]:
#check that the change has been made correctly

clean_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75564 entries, 0 to 75563
Data columns (total 22 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   serial_number        50634 non-null  object        
 1   equipment_id         75564 non-null  int64         
 2   client_id            5817 non-null   object        
 3   group                13431 non-null  object        
 4   client_code          75564 non-null  int64         
 5   fleet_type           75558 non-null  object        
 6   fixed_asset          1800 non-null   object        
 7   construction_year    75564 non-null  int64         
 8   brand_name           70433 non-null  object        
 9   created_date         75564 non-null  datetime64[ns]
 10  warranty_type        11980 non-null  object        
 11  last_modified        63966 non-null  datetime64[ns]
 12  modified_by          63966 non-null  object        
 13  service_start_date   9741 non-n

In [143]:
clean_dataset.isnull().sum()

#now the columns construction year and month have 0 null values.

serial_number          24930
equipment_id               0
client_id              69747
group                  62133
client_code                0
fleet_type                 6
fixed_asset            73764
construction_year          0
brand_name              5131
created_date               0
warranty_type          63584
last_modified          11598
modified_by            11598
service_start_date     65823
last_service_date      62740
warranty_start         63582
equipment_type_name    64313
construction_month         0
country                66021
technician             62228
user_status            61245
equipment_type             0
dtype: int64

In [144]:
clean_dataset.construction_month.value_counts()

1     11416
0      8791
7      6864
2      6589
5      6294
9      5608
6      4873
12     4647
4      4588
11     4575
3      4316
10     3913
8      3090
Name: construction_month, dtype: int64

In [145]:
#some month values are 0; we will change them to January (1), as we did with the nulls before

# Replace 0 with 1 in the "construction_month" column
clean_dataset['construction_month'] = clean_dataset['construction_month'].replace(0, 1)

# Verify the changes
print(clean_dataset['construction_month'].value_counts())

1     20207
7      6864
2      6589
5      6294
9      5608
6      4873
12     4647
4      4588
11     4575
3      4316
10     3913
8      3090
Name: construction_month, dtype: int64
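
With year and month both stored as integers, the single manufacturing date proposed in the conclusions can be assembled with `pd.to_datetime`, which accepts a frame with `year`, `month` and `day` columns. This is a sketch on toy data: the column name `manufacturing_date` and fixing the day to 1 are assumptions, and implausible years such as 199 should be handled before this step.

```python
import pandas as pd

sample = pd.DataFrame({
    "construction_year": [2008, 2016, 2024],
    "construction_month": [5, 1, 3],
})

# to_datetime assembles dates from columns named year, month and day.
parts = pd.DataFrame({
    "year": sample["construction_year"],
    "month": sample["construction_month"],
    "day": 1,  # assumed: the day is unknown, so use the first of the month
})
sample["manufacturing_date"] = pd.to_datetime(parts)

print(sample)
```

Note that rows whose year and month were both imputed would all land on 2016-01-01, so it may be worth keeping a flag that marks imputed dates.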


-----

## Cleaning by column

In [146]:
#avoid naming the variable "list", which would shadow the Python built-in
columns = clean_dataset.columns

for column in columns:
    print("### " + column)

### serial_number
### equipment_id
### client_id
### group
### client_code
### fleet_type
### fixed_asset
### construction_year
### brand_name
### created_date
### warranty_type
### last_modified
### modified_by
### service_start_date
### last_service_date
### warranty_start
### equipment_type_name
### construction_month
### country
### technician
### user_status
### equipment_type


### serial_number

- Verify that LINDE serial numbers have 12 characters
- Unique? 
- Missing values
- Dtype

In [147]:

clean_dataset.serial_number.info()

#dtype is object (string), which makes sense since the serial number is alphanumeric

<class 'pandas.core.series.Series'>
RangeIndex: 75564 entries, 0 to 75563
Series name: serial_number
Non-Null Count  Dtype 
--------------  ----- 
50634 non-null  object
dtypes: object(1)
memory usage: 590.5+ KB


In [148]:
#verify that LINDE serial numbers have 12 characters

linde_brand = clean_dataset[clean_dataset.brand_name == "LINDE"]
linde_brand

Unnamed: 0,serial_number,equipment_id,client_id,group,client_code,fleet_type,fixed_asset,construction_year,brand_name,created_date,...,modified_by,service_start_date,last_service_date,warranty_start,equipment_type_name,construction_month,country,technician,user_status,equipment_type
0,H2X992W15465,1132732,3,E82,33925845,C,,2008,LINDE,2012-06-12,...,HK57F5,NaT,2022-11-11,2008-09-15,Contrapesada térmica,5,DE,KXE-6188,AVLB,L
2,W4X979W12995,1290040,1204 PMP EX,E82,33927373,C,,2008,LINDE,2012-12-18,...,CKMAGKNK,2008-05-29,2023-09-19,NaT,Apilador,6,,KXE-6156,AWIN,L
3,H2X994R11511,1306179,,,33922280,N,,2004,LINDE,2012-12-18,...,HK57F5,2004-04-30,2023-05-24,NaT,Contrapesada térmica,5,,,AVLB,L
5,W4X595F14922,2139958,,E11,33933808,C,,2015,LINDE,2015-10-14,...,HK57F5,NaT,2021-02-10,2015-10-05,Transpaleta eléctrica,10,FR,KXE-6626,ONOD PINA,L
7,H2X926C19625,2205572,,E12,39327943,C,,2012,LINDE,2015-12-05,...,CKMAGKNK,2020-10-08,2024-03-14,2020-12-18,Contrapesada eléctrica,5,DE,KXE-6157,SOLD RSVD,L
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
75514,W41559C15595,6106796,A7,E81,33933227,C,,2016,LINDE,2024-03-20,...,,NaT,2024-03-07,NaT,Transpaleta eléctrica,1,,KXE-6215,AVLB,L
75517,W45565N11199,6108537,,E11,33934507,R,,2024,LINDE,2024-03-21,...,A0063827,2024-04-02,NaT,2024-04-02,Apilador,3,FR,LE196666,AVLB,L
75522,996F12155991,6109448,,E11,39303353,C,,2016,LINDE,2024-03-22,...,KS411A,NaT,NaT,NaT,,1,,KXE-6642,AVLB,L
75526,W45552F19256,6109504,,E12,33933702,C,,2016,LINDE,2024-03-22,...,,NaT,NaT,NaT,,1,,KXE-6646,AVLB,L


In [149]:
#We have 11502 rows with Linde brand elements. Now check character length

# Check rows where the length of elements in the column is NOT 12
invalid_length = linde_brand["serial_number"].str.len() != 12

# Display rows with invalid lengths with a boolean mask
rows_with_invalid_length = linde_brand[invalid_length]
rows_with_invalid_length

Unnamed: 0,serial_number,equipment_id,client_id,group,client_code,fleet_type,fixed_asset,construction_year,brand_name,created_date,...,modified_by,service_start_date,last_service_date,warranty_start,equipment_type_name,construction_month,country,technician,user_status,equipment_type
61,E1L71GTE1119242,2981089,,E11,33930535,C,,2016,LINDE,2017-11-29,...,HK57F5,2019-05-30,2020-04-06,NaT,,1,,LE616666,SOTR PINA,C
348,HLI1416595,2984170,,E14,33933050,R,J01186,2017,LINDE,2017-11-29,...,CKMAGKNK,NaT,2024-01-12,2017-08-22,Transpaleta manual,8,,KXE-6278,NORE,L
393,GRSSALVESEN,2984446,,E11,33929798,C,,2016,LINDE,2017-11-29,...,HK57F5,NaT,2020-02-05,2016-02-02,,1,,KXE-6191,AVLB PINA,C
3233,UFW212792,3013111,,E11,33923308,C,,2017,LINDE,2017-11-29,...,HK57F5,2021-06-15,2022-01-28,2021-06-15,Retráctil,1,SE,KXE-6626,SOLD PINA,L
3454,W4X595F15472OLD,3014862,,,55,C,,2015,LINDE,2017-11-30,...,HK57F5,2018-10-05,NaT,2015-03-26,Transpaleta eléctrica,1,FR,,SOLD PINA,L
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
75156,,6045292,,,39354898,C,,2006,LINDE,2024-02-07,...,,NaT,NaT,NaT,,1,,,,G
75301,,6074424,,,32384905,C,,1996,LINDE,2024-02-27,...,,NaT,NaT,NaT,,1,,,,G
75323,BAT.H2X922D11122,6076355,,E12,33937707,C,,2016,LINDE,2024-02-28,...,KS411A,NaT,NaT,NaT,,1,,KXE-6646,AVLB,L
75429,,6087932,,,33940808,C,,1998,LINDE,2024-03-07,...,,NaT,NaT,NaT,,1,,,,G


In [150]:
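#777 LINDE rows fail the 12-character check, so I would drop these.
#Note that this includes LINDE rows with a null serial number, because NaN != 12 evaluates to True (669 of the 777, since the null count falls from 24930 to 24261 afterwards).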

invalid_rows = (clean_dataset["brand_name"] == "LINDE") & (clean_dataset["serial_number"].str.len() != 12)
clean_dataset = clean_dataset[~invalid_rows]

print(clean_dataset)

      serial_number  equipment_id    client_id group  client_code fleet_type  \
0      H2X992W15465       1132732            3   E82     33925845          C   
1      H2X995S19125       1207034          NaN   E12     39380933          C   
2      W4X979W12995       1290040  1204 PMP EX   E82     33927373          C   
3      H2X994R11511       1306179          NaN   NaN     33922280          N   
4      G5X997P11599       1655177          NaN   E11     33923892          C   
...             ...           ...          ...   ...          ...        ...   
75559           NaN       6122641          NaN   NaN     39333323          C   
75560           NaN       6122642          NaN   NaN     39333323          C   
75561           NaN       6122677          NaN   NaN     39333323          C   
75562           NaN       6122678          NaN   NaN     39333323          C   
75563           NaN       6122680          NaN   NaN     39333323          C   

      fixed_asset  construction_year br

In [151]:
#now we have 74787 rows, 777 less than before

clean_dataset.shape

(74787, 22)
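
Because `.str.len()` returns `NaN` for a missing serial number and `NaN != 12` evaluates to `True`, a length mask of this kind also catches rows with no serial number at all. The sketch below reports the two cases separately; the data is a toy example and the variable names are illustrative.

```python
import pandas as pd

sample = pd.DataFrame({
    "brand_name": ["LINDE", "LINDE", "LINDE", "OTHER"],
    "serial_number": ["H2X992W15465", "HLI1416595", None, "ABC"],
})

is_linde = sample["brand_name"] == "LINDE"
length = sample["serial_number"].str.len()

# Split the failures into "no serial at all" and "serial with the wrong length".
missing_serial = is_linde & sample["serial_number"].isnull()
wrong_length = is_linde & sample["serial_number"].notna() & (length != 12)

print("LINDE rows without serial number:", int(missing_serial.sum()))
print("LINDE rows with wrong length:", int(wrong_length.sum()))
```

Reporting the counts separately makes it clear how many rows are removed for a malformed serial number and how many simply lack one.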

In [152]:
#we still have null values. I would drop them unless we can use equipment id as a unique identifier. So let's check first equipment ID unique values and then come back to this to see if we have still too many nulls.

clean_dataset.serial_number.isnull().sum()

24261

In [153]:
#since serial number is critical, I would drop the rows that have nulls (24261 of 74787 rows, about 32%)

clean_dataset = clean_dataset.dropna(subset=["serial_number"])

In [154]:
clean_dataset.shape

(50526, 22)

### equipment_id

- see if we have unique values
- missing values
- dtype

In [155]:
clean_dataset.equipment_id.isnull().sum()

0

In [156]:
clean_dataset.equipment_id.is_unique

True

In [157]:
#perfect, all 50526 remaining equipment ids are unique (each Equipo value appeared exactly once in the original dataset) and there are 0 nulls, so we don't have to drop anything.

### client_id

- see if we have unique values
- missing values
- dtype


In [158]:
clean_dataset.client_id.isnull().sum()

44787

In [159]:
clean_dataset.client_id.value_counts()

2                      78
3                      77
1                      72
SUPERSOL BELLAVISTA    68
SUPERSOL GETAFE        63
                       ..
PALENCIA                1
SALAMANCA               1
24v/225ah               1
CON POCISIONADOR        1
muelle palas largas     1
Name: client_id, Length: 2953, dtype: int64

In [160]:
clean_dataset.client_id.info()

<class 'pandas.core.series.Series'>
Int64Index: 50526 entries, 0 to 75548
Series name: client_id
Non-Null Count  Dtype 
--------------  ----- 
5739 non-null   object
dtypes: object(1)
memory usage: 789.5+ KB


In [161]:
#we could drop this column: about 89% of its values are null (44787 of 50526) and the rest are free text (locations, battery specs), so it is not reliable as an id. It reflects how the client identifies the machine, so we might not need it in future systems.

clean_dataset = clean_dataset.drop(columns=["client_id"])
clean_dataset.head()

Unnamed: 0,serial_number,equipment_id,group,client_code,fleet_type,fixed_asset,construction_year,brand_name,created_date,warranty_type,...,modified_by,service_start_date,last_service_date,warranty_start,equipment_type_name,construction_month,country,technician,user_status,equipment_type
0,H2X992W15465,1132732,E82,33925845,C,,2008,LINDE,2012-06-12,,...,HK57F5,NaT,2022-11-11,2008-09-15,Contrapesada térmica,5,DE,KXE-6188,AVLB,L
1,H2X995S19125,1207034,E12,39380933,C,,2005,,2012-06-18,,...,HK57F5,NaT,2018-07-13,2011-12-14,Contrapesada eléctrica,5,DE,KXE-6639,AVLB PINA,L
2,W4X979W12995,1290040,E82,33927373,C,,2008,LINDE,2012-12-18,,...,CKMAGKNK,2008-05-29,2023-09-19,NaT,Apilador,6,,KXE-6156,AWIN,L
3,H2X994R11511,1306179,,33922280,N,,2004,LINDE,2012-12-18,,...,HK57F5,2004-04-30,2023-05-24,NaT,Contrapesada térmica,5,,,AVLB,L
4,G5X997P11599,1655177,E11,33923892,C,,2003,,2014-09-23,12 Mon./2000 h,...,HK57F5,NaT,2020-04-01,2003-07-15,Contrapesada eléctrica,6,GB,KXE-6263,AVLB PINA,L


### group

This column could also be dropped; I don't see how it would be relevant in the future.

In [162]:
clean_dataset.group.isnull().sum()

37178

In [163]:
clean_dataset.group.value_counts()

E11    5446
E12    2648
E81    2271
E14    1740
E13     646
E82     596
E83       1
Name: group, dtype: int64

In [164]:
clean_dataset.group.unique()


array(['E82', 'E12', nan, 'E11', 'E13', 'E14', 'E81', 'E83'], dtype=object)

In [165]:
#for now maybe let's fillna with "unknown"

clean_dataset.group = clean_dataset.group.fillna("unknown")
clean_dataset.group.isnull().sum()

0

In [166]:
clean_dataset.group.value_counts()

unknown    37178
E11         5446
E12         2648
E81         2271
E14         1740
E13          646
E82          596
E83            1
Name: group, dtype: int64

### client_code

Source column: "Enviar a parte" (ship-to party).

This is a relevant column

In [167]:
clean_dataset.client_code.isnull().sum()

0

In [168]:
clean_dataset.client_code.unique

<bound method Series.unique of 0        33925845
1        39380933
2        33927373
3        33922280
4        33923892
           ...   
75532    33934592
75533    33934592
75541    33920279
75542    33923302
75548    33934507
Name: client_code, Length: 50526, dtype: int64>

In [169]:
clean_dataset.shape

(50526, 21)

In [170]:
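#there are no null client codes. Note that .unique was called without parentheses above, so the output is the bound method rather than the list of distinct values. The codes are not unique per row (33934592 appears at both index 75532 and 75533), which is plausible if one client owns several machines.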


In [171]:
clean_dataset.client_code.info()

<class 'pandas.core.series.Series'>
Int64Index: 50526 entries, 0 to 75548
Series name: client_code
Non-Null Count  Dtype
--------------  -----
50526 non-null  int64
dtypes: int64(1)
memory usage: 789.5 KB
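
To check uniqueness directly, `Series.is_unique`, `Series.nunique()` and `Series.duplicated()` can be used. A minimal sketch on made-up client codes (the values are illustrative and not taken from the full dataset):

```python
import pandas as pd

codes = pd.Series(
    [33925845, 39380933, 33934592, 33934592, 33920279], name="client_code"
)

all_unique = codes.is_unique                      # True only if no code repeats
distinct_codes = codes.nunique()                  # number of different codes
repeated = codes[codes.duplicated(keep=False)]    # every row whose code repeats
print(all_unique, distinct_codes, repeated.tolist())
```

If repeated codes are expected (one client, several machines), the column works as a foreign key to a client table rather than as a row identifier.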


### fleet_type





In [172]:
clean_dataset.fleet_type.isnull().sum()

6

In [173]:
clean_dataset.fleet_type.unique()

array(['C', 'N', 'U', 'R', 'D', nan], dtype=object)

In [174]:
#since there are only 6 nulls and it's a relevant field, I will drop those rows

clean_dataset = clean_dataset.dropna(subset=["fleet_type"])

In [175]:
clean_dataset.fleet_type.isnull().sum()

0

In [176]:
clean_dataset.shape

(50520, 21)

### Activo fijo (fixed asset)

Fixed asset. When the machine belongs to one of the company's fleets, it has an associated asset number.

Convert to boolean, 0/1

In [177]:
clean_dataset.fixed_asset.isnull().sum()

48723

In [178]:
clean_dataset.fixed_asset.unique()

array([nan, 'J06271', 'J05127', ..., 'J096J3', 'J096J2', 'J096JJ'],
      dtype=object)

In [179]:
clean_dataset.fixed_asset.value_counts()

J05127    2
J01211    2
J072J6    1
J07JJ9    1
J07378    1
         ..
J08856    1
J0391J    1
J08953    1
J0J921    1
J096JJ    1
Name: fixed_asset, Length: 1795, dtype: int64

In [180]:
#not sure if I would drop it: it is an internal classification, so its use in other systems is unclear, and there are only 1795 distinct asset numbers.

#for now let's fillna with not applicable in case it's not a fixed asset

clean_dataset.fixed_asset = clean_dataset.fixed_asset.fillna("N/A")

In [181]:
clean_dataset.fixed_asset.isnull().sum()

0
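
The note at the top of this section mentions converting the field to a boolean. One way to do that, sketched on made-up rows, is to derive a flag from the nulls before they are filled. The column name `has_fixed_asset` is a hypothetical choice, not something defined in the source data.

```python
import pandas as pd

fleet = pd.DataFrame({"fixed_asset": ["J06271", None, "J05127", None]})

# Derive the flag while missing values are still null, i.e. before any fillna.
fleet["has_fixed_asset"] = fleet["fixed_asset"].notna()
print(fleet)
```

The asset number itself can then be kept or dropped independently of the flag.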

### Año de construcción (construction year)

Year the equipment was manufactured.

Converted to int, we will group later

In [182]:
clean_dataset.construction_year.isnull().sum()

0

In [183]:
clean_dataset.construction_year.unique()

array([2008, 2005, 2004, 2003, 2015, 2012, 2016, 2017, 2013, 2014, 2000,
       2001, 2007, 1999, 2002, 2006, 2010, 2011, 2009, 2019, 1998, 1993,
       1995, 1997, 1991, 1989, 1990, 1992, 1994, 1984, 1986, 1987, 1988,
       1900, 1996, 1985, 2018, 2023, 2020, 2022, 2021, 2024,  199])

In [184]:
# some years are clearly wrong (199, and 1900 looks like a placeholder); the filter below removes only rows with 0 or 199

wrong_year = clean_dataset[(clean_dataset.construction_year == 0) | (clean_dataset.construction_year == 199)]
wrong_year.construction_year.value_counts()

199    1
Name: construction_year, dtype: int64

In [185]:
clean_dataset = clean_dataset[~((clean_dataset.construction_year == 0) | (clean_dataset.construction_year == 199))]
clean_dataset.construction_year.unique()

array([2008, 2005, 2004, 2003, 2015, 2012, 2016, 2017, 2013, 2014, 2000,
       2001, 2007, 1999, 2002, 2006, 2010, 2011, 2009, 2019, 1998, 1993,
       1995, 1997, 1991, 1989, 1990, 1992, 1994, 1984, 1986, 1987, 1988,
       1900, 1996, 1985, 2018, 2023, 2020, 2022, 2021, 2024])

In [186]:
clean_dataset.shape

(50519, 21)

In [187]:
clean_dataset.construction_year.isnull().sum()

0
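
Instead of listing each bad value, implausible years can be caught with a range check. A minimal sketch, assuming a plausible window of 1950 to 2024; both bounds are assumptions and should be agreed with the business.

```python
import pandas as pd

MIN_YEAR = 1950  # assumed lower bound, adjust to the real fleet history
MAX_YEAR = 2024  # assumed upper bound (latest year seen in the data)

years = pd.Series([2008, 199, 1900, 2024, 0, 1985], name="construction_year")

# between() is inclusive on both ends by default.
plausible = years.between(MIN_YEAR, MAX_YEAR)
suspicious_years = years[~plausible]
print(suspicious_years.tolist())
```

With these bounds the check would also flag 1900, which the equality filter above leaves in place.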

### Brand name

In [188]:
clean_dataset.brand_name.isnull().sum()

5123

In [189]:
#ok we have some null values, let's see what the others are

clean_dataset.brand_name.unique()

array(['LINDE', nan, 'Others', 'KÄRCHER', 'Still / OM Pimespo', 'FENWICK',
       'BT !!OUT OF DATE!!', 'Not defined', 'Toyota / BT', 'VOLVO',
       'NISSAN', 'JUNGHEINRICH', 'CLARK', 'DAEWOO', 'YALE', 'CATERPILLAR',
       'MITSUBISHI', 'HYSTER', 'CROWN', 'KOMATSU', 'CESAB', 'KALMAR',
       'ATLET', 'STEINBOCK', 'MANITOU', 'LUGLI', 'TCM', 'BOLZONI-AURAMO',
       'CASCADE', 'HYUNDAI', 'TOYOTA !!OUT OF DATE!!', 'FRANKEL',
       'SCANIA'], dtype=object)

In [190]:
clean_dataset.brand_name.value_counts()

LINDE                     10725
Toyota / BT               10174
JUNGHEINRICH               5694
Still / OM Pimespo         5193
Others                     4385
HYSTER                     2407
NISSAN                     2333
CATERPILLAR                1063
MITSUBISHI                  940
YALE                        726
CROWN                       526
KOMATSU                     365
CLARK                       294
CESAB                       240
DAEWOO                      117
ATLET                       116
KALMAR                       36
TOYOTA !!OUT OF DATE!!       11
LUGLI                        11
STEINBOCK                    10
BT !!OUT OF DATE!!            8
Not defined                   4
MANITOU                       3
CASCADE                       3
KÄRCHER                       2
FENWICK                       2
TCM                           2
HYUNDAI                       2
BOLZONI-AURAMO                1
VOLVO                         1
FRANKEL                       1
SCANIA  

In [191]:
#I see that there is already a "Not defined" brand, so let's fill the nulls with that value
clean_dataset.brand_name = clean_dataset.brand_name.fillna("Not defined")
clean_dataset.brand_name.isnull().sum()

0

In [192]:
clean_dataset.brand_name.value_counts()

LINDE                     10725
Toyota / BT               10174
JUNGHEINRICH               5694
Still / OM Pimespo         5193
Not defined                5127
Others                     4385
HYSTER                     2407
NISSAN                     2333
CATERPILLAR                1063
MITSUBISHI                  940
YALE                        726
CROWN                       526
KOMATSU                     365
CLARK                       294
CESAB                       240
DAEWOO                      117
ATLET                       116
KALMAR                       36
TOYOTA !!OUT OF DATE!!       11
LUGLI                        11
STEINBOCK                    10
BT !!OUT OF DATE!!            8
MANITOU                       3
CASCADE                       3
KÄRCHER                       2
TCM                           2
HYUNDAI                       2
FENWICK                       2
VOLVO                         1
BOLZONI-AURAMO                1
FRANKEL                       1
SCANIA  

### created_date

In [193]:
clean_dataset.created_date.isnull().sum()

0

In [194]:
clean_dataset.created_date.unique

<bound method Series.unique of 0       2012-06-12
1       2012-06-18
2       2012-12-18
3       2012-12-18
4       2014-09-23
           ...    
75532   2024-03-22
75533   2024-03-22
75541   2024-03-25
75542   2024-03-25
75548   2024-03-29
Name: created_date, Length: 50519, dtype: datetime64[ns]>

In [195]:
clean_dataset.shape

(50519, 21)

### warranty_type

Customer warranty type.

Think about what use this could have in the future system.

In [196]:
clean_dataset.warranty_type.isnull().sum()

38573

In [197]:
clean_dataset.warranty_type.value_counts()

12 Mon./2000 h     11703
12mon/900hrs          84
6 mon/500h            61
12mon/650hrs          51
3 Mon./300 h          29
24 Mon./3000 h        11
12mon/1000hrs          6
6 mon. /900 hrs        1
Name: warranty_type, dtype: int64

In [198]:
clean_dataset.warranty_type.info()

<class 'pandas.core.series.Series'>
Int64Index: 50519 entries, 0 to 75548
Series name: warranty_type
Non-Null Count  Dtype 
--------------  ----- 
11946 non-null  object
dtypes: object(1)
memory usage: 789.4+ KB


In [199]:
#not sure how useful this is, for now fillna

clean_dataset.warranty_type = clean_dataset.warranty_type.fillna("N/A")

In [200]:
clean_dataset.warranty_type.isnull().sum()

0
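
The warranty labels above mix several spellings of the same pattern (months / hours). If the field is kept, one option is to split it into two numeric values. The sketch below is an assumption about how the labels are structured, based only on the eight variants listed; `parse_warranty` is a hypothetical helper.

```python
import re
from typing import Optional, Tuple

import pandas as pd

# Matches e.g. "12 Mon./2000 h", "12mon/900hrs", "6 mon. /900 hrs".
WARRANTY_PATTERN = re.compile(r"(\d+)\s*mon\.?\s*/\s*(\d+)\s*h", re.IGNORECASE)


def parse_warranty(label: str) -> Optional[Tuple[int, int]]:
    """Return (months, hours), or None when the label does not match."""
    match = WARRANTY_PATTERN.search(label)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


labels = pd.Series(["12 Mon./2000 h", "12mon/900hrs", "6 mon. /900 hrs", "N/A"])
parsed = labels.map(parse_warranty)
print(parsed.tolist())
```

Labels that do not match, such as the "N/A" placeholder, come back as missing rather than raising an error.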

### last_modified

Since we have assumed that we only migrate business-related data, I think this is one of the columns that will become redundant in the future system, so I would drop it.

In [201]:
#drop the last modified column

clean_dataset.drop(columns=["last_modified"], inplace=True)
clean_dataset.head()

Unnamed: 0,serial_number,equipment_id,group,client_code,fleet_type,fixed_asset,construction_year,brand_name,created_date,warranty_type,modified_by,service_start_date,last_service_date,warranty_start,equipment_type_name,construction_month,country,technician,user_status,equipment_type
0,H2X992W15465,1132732,E82,33925845,C,,2008,LINDE,2012-06-12,,HK57F5,NaT,2022-11-11,2008-09-15,Contrapesada térmica,5,DE,KXE-6188,AVLB,L
1,H2X995S19125,1207034,E12,39380933,C,,2005,Not defined,2012-06-18,,HK57F5,NaT,2018-07-13,2011-12-14,Contrapesada eléctrica,5,DE,KXE-6639,AVLB PINA,L
2,W4X979W12995,1290040,E82,33927373,C,,2008,LINDE,2012-12-18,,CKMAGKNK,2008-05-29,2023-09-19,NaT,Apilador,6,,KXE-6156,AWIN,L
3,H2X994R11511,1306179,unknown,33922280,N,,2004,LINDE,2012-12-18,,HK57F5,2004-04-30,2023-05-24,NaT,Contrapesada térmica,5,,,AVLB,L
4,G5X997P11599,1655177,E11,33923892,C,,2003,Not defined,2014-09-23,12 Mon./2000 h,HK57F5,NaT,2020-04-01,2003-07-15,Contrapesada eléctrica,6,GB,KXE-6263,AVLB PINA,L


In [202]:
clean_dataset.columns

Index(['serial_number', 'equipment_id', 'group', 'client_code', 'fleet_type',
       'fixed_asset', 'construction_year', 'brand_name', 'created_date',
       'warranty_type', 'modified_by', 'service_start_date',
       'last_service_date', 'warranty_start', 'equipment_type_name',
       'construction_month', 'country', 'technician', 'user_status',
       'equipment_type'],
      dtype='object')

### modified_by

Same as before, we don't need this information in the future system. 

In [203]:
clean_dataset = clean_dataset.drop(columns=["modified_by"])
clean_dataset.head()

Unnamed: 0,serial_number,equipment_id,group,client_code,fleet_type,fixed_asset,construction_year,brand_name,created_date,warranty_type,service_start_date,last_service_date,warranty_start,equipment_type_name,construction_month,country,technician,user_status,equipment_type
0,H2X992W15465,1132732,E82,33925845,C,,2008,LINDE,2012-06-12,,NaT,2022-11-11,2008-09-15,Contrapesada térmica,5,DE,KXE-6188,AVLB,L
1,H2X995S19125,1207034,E12,39380933,C,,2005,Not defined,2012-06-18,,NaT,2018-07-13,2011-12-14,Contrapesada eléctrica,5,DE,KXE-6639,AVLB PINA,L
2,W4X979W12995,1290040,E82,33927373,C,,2008,LINDE,2012-12-18,,2008-05-29,2023-09-19,NaT,Apilador,6,,KXE-6156,AWIN,L
3,H2X994R11511,1306179,unknown,33922280,N,,2004,LINDE,2012-12-18,,2004-04-30,2023-05-24,NaT,Contrapesada térmica,5,,,AVLB,L
4,G5X997P11599,1655177,E11,33923892,C,,2003,Not defined,2014-09-23,12 Mon./2000 h,NaT,2020-04-01,2003-07-15,Contrapesada eléctrica,6,GB,KXE-6263,AVLB PINA,L


In [204]:
clean_dataset.shape

(50519, 19)

### service_start_date

This is relevant for business, so let's see if we can keep it

In [205]:
clean_dataset.service_start_date.isnull().sum()

40818

In [206]:
clean_dataset.shape

(50519, 19)

In [207]:
#about 81% of the values are null (40818 of 50519). If we keep this column, it would be to relate it to the construction date and the last service date. construction_year has no nulls, so let's see how many rows with a service start date also have a last service date

# Filter rows where both date columns are non-null
filtered_dataset = clean_dataset[clean_dataset['service_start_date'].notna() & clean_dataset['last_service_date'].notna()]

# Display the filtered dataset
filtered_dataset.shape

(8843, 19)

In [208]:
filtered_dataset_na = clean_dataset[clean_dataset['service_start_date'].notna() & clean_dataset['last_service_date'].isna()]

filtered_dataset_na.shape

(858, 19)

In [209]:
#Every row has a construction year (no nulls), so each service start date can be paired with one. We are not going to drop these rows, as at least for them we might be able to do some comparisons for manufacturing time.
#Maybe we could drop all rows without a service start date, but we need to look at the rest of the data first in case those rows hold other important fields that we want to keep.
#Looking at last service date too: 8843 rows have both dates, and 858 rows have a service start date but no last service date. Let's check the opposite case next.
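
The four combinations of missing and present values in two columns can be counted in a single step with `pd.crosstab`, instead of one filter per case. A minimal sketch on made-up dates:

```python
import pandas as pd

dates = pd.DataFrame(
    {
        "service_start_date": pd.to_datetime(
            ["2008-05-29", "2004-04-30", "2024-03-25", None, None, None]
        ),
        "last_service_date": pd.to_datetime(
            ["2023-09-19", None, None, "2018-07-13", "2020-04-01", "2022-11-11"]
        ),
    }
)

# Rows: is service_start_date missing? Columns: is last_service_date missing?
overlap = pd.crosstab(
    dates["service_start_date"].isna(),
    dates["last_service_date"].isna(),
    rownames=["start_missing"],
    colnames=["last_missing"],
)
print(overlap)
```

Each cell is the number of rows for that combination, so the counts always add up to the total row count.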

### last_service_date

same, could be interesting to compare with service start date and manufacturing date

In [210]:
clean_dataset.last_service_date.isnull().sum()

37765

In [211]:
#we also have some nulls, let's check the same as before

# Filter rows with no null values in both columns
filtered_dataset_ls = clean_dataset[clean_dataset['last_service_date'].notna() & clean_dataset['service_start_date'].notna()]

# Display the filtered dataset
filtered_dataset_ls.shape

(8843, 19)

In [212]:
# Filter rows with a null service start date and a non-null last service date
filtered_dataset_ls_na = clean_dataset[clean_dataset['last_service_date'].notna() & clean_dataset['service_start_date'].isna()]

# Display the filtered dataset
filtered_dataset_ls_na.shape

(3911, 19)

In [213]:
#seeing this case I would delete all the rows that have null values for Service start dates and not null for last service date, as there is no point in having the last service if we can't compare to anything

clean_dataset = clean_dataset[~((clean_dataset.service_start_date.isna()) & (clean_dataset.last_service_date.notna()))]


In [214]:
clean_dataset.last_service_date.isnull().sum()

#we still have nulls in last service date, let's see if we have nulls as well in service start date

# Filter rows with null values in both columns
filtered_dataset_ls = clean_dataset[clean_dataset['last_service_date'].isna() & clean_dataset['service_start_date'].isna()]

filtered_dataset_ls.shape

(36907, 19)

In [215]:
#there are many nulls in both last service and service start date, but I can't drop these rows because they might contain other relevant information.
# Let's keep these columns and fill null values with N/A for now
clean_dataset.last_service_date = clean_dataset.last_service_date.fillna("N/A")
clean_dataset.service_start_date = clean_dataset.service_start_date.fillna("N/A")
print(clean_dataset.last_service_date.isnull().sum())
print(clean_dataset.service_start_date.isnull().sum())

0
0
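
An alternative to the string placeholder (not what this notebook does) is to leave missing dates as `NaT`. The columns then keep their datetime dtype and date arithmetic still works. A minimal sketch on made-up rows; `service_span_days` is a hypothetical calculated field.

```python
import pandas as pd

services = pd.DataFrame(
    {
        "service_start_date": pd.to_datetime(["2008-05-29", None, "2024-03-25"]),
        "last_service_date": pd.to_datetime(["2008-06-08", "2018-07-13", None]),
    }
)

# NaT propagates through the subtraction, so incomplete rows stay missing.
services["service_span_days"] = (
    services["last_service_date"] - services["service_start_date"]
).dt.days
print(services)
```

Rows with only one of the two dates get a missing span instead of a misleading number.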


In [216]:
clean_dataset.construction_year.unique()

array([2008, 2004, 2012, 2016, 2017, 2000, 1999, 2002, 2006, 2014, 2013,
       2007, 2015, 2003, 2010, 2011, 2005, 2009, 2001, 2019, 1998, 1995,
       2018, 2023, 2020, 2022, 2021, 1992, 1996, 1990, 1985, 1991, 2024])

### warranty_start

Warranty start date.

We might want to compare this with the service start date

In [217]:
clean_dataset.warranty_start.isnull().sum()

37906

In [218]:
clean_dataset.warranty_start.unique

<bound method Series.unique of 2              NaT
3              NaT
7       2020-12-18
8       2016-10-25
9       2007-03-20
           ...    
75532          NaT
75533          NaT
75541          NaT
75542          NaT
75548   2024-04-03
Name: warranty_start, Length: 46608, dtype: datetime64[ns]>

In [219]:

clean_dataset.warranty_start.info()

<class 'pandas.core.series.Series'>
Int64Index: 46608 entries, 2 to 75548
Series name: warranty_start
Non-Null Count  Dtype         
--------------  -----         
8702 non-null   datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 728.2 KB


In [220]:

filter_warranty = clean_dataset[(clean_dataset.warranty_start.notna()) & (clean_dataset.service_start_date == "N/A")]
filter_warranty

Unnamed: 0,serial_number,equipment_id,group,client_code,fleet_type,fixed_asset,construction_year,brand_name,created_date,warranty_type,service_start_date,last_service_date,warranty_start,equipment_type_name,construction_month,country,technician,user_status,equipment_type
129,F25572D11112,2981948,E13,50,C,,2013,Not defined,2017-11-29,12 Mon./2000 h,,,2013-12-18,,1,FR,LE616666,AVLB PINA,L
130,F25572D11222,2981956,E13,50,C,,2013,Not defined,2017-11-29,12 Mon./2000 h,,,2013-12-18,,1,FR,LE616666,AVLB PINA,L
136,F25572E11952,2982032,E11,50,C,,2014,Not defined,2017-11-29,12 Mon./2000 h,,,2014-02-26,,1,FR,KXE-6641,AVLB PINA,L
259,BA556797,2983751,E13,33934593,R,J05866,2015,Not defined,2017-11-29,12 Mon./2000 h,,,2015-08-11,,8,,LE616666,AVLB PINA,V
269,CK149592292,2983810,E13,33933802,R,J01126,2016,Not defined,2017-11-29,12 Mon./2000 h,,,2016-11-18,,11,,LE616666,HIRE PINA,V
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5549,222226,3050764,unknown,39373829,C,,2016,Not defined,2017-11-30,12 Mon./2000 h,,,1987-11-06,,1,,,AVLB PINA,V
5552,225759,3050794,unknown,39373829,C,,2016,Not defined,2017-11-30,12 Mon./2000 h,,,1988-11-30,,1,,,AVLB PINA,V
5553,225754,3050795,unknown,39373829,C,,2016,Not defined,2017-11-30,12 Mon./2000 h,,,1988-11-30,,1,,,AVLB PINA,V
5580,92616-45,3051436,E13,33937324,C,,2016,Not defined,2017-11-30,12 Mon./2000 h,,,2017-03-24,,1,,LE616666,AVLB PINA,V


In [221]:

filter_warranty_isna= clean_dataset[clean_dataset.warranty_start.isna() & clean_dataset.service_start_date.isna()]
filter_warranty_isna

Unnamed: 0,serial_number,equipment_id,group,client_code,fleet_type,fixed_asset,construction_year,brand_name,created_date,warranty_type,service_start_date,last_service_date,warranty_start,equipment_type_name,construction_month,country,technician,user_status,equipment_type
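
A placeholder string such as "N/A" is not a missing value for pandas, so `.isna()` does not match it. A minimal sketch of the difference, using made-up values:

```python
import pandas as pd

# Object column where missing dates were already replaced by a placeholder.
service_start = pd.Series(["N/A", "2008-05-29", "N/A"], name="service_start_date")

matches_isna = int(service_start.isna().sum())
matches_placeholder = int((service_start == "N/A").sum())
print(matches_isna, matches_placeholder)
```

Once a column has been filled this way, any later null check on it has to compare against the placeholder instead.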


In [222]:
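#the filter above returns no rows, but that does not tell us how the nulls overlap: the service_start_date nulls were already replaced with "N/A", so .isna() no longer matches anything in that column. To count how many of the 37906 warranty_start nulls also lack a service start date, compare service_start_date against "N/A" instead.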

#let's compare with warranty_type too

filter_warranty_isna_type = clean_dataset[(clean_dataset.warranty_start.isna()) & (clean_dataset.warranty_type == "N/A")]
filter_warranty_isna_type


Unnamed: 0,serial_number,equipment_id,group,client_code,fleet_type,fixed_asset,construction_year,brand_name,created_date,warranty_type,service_start_date,last_service_date,warranty_start,equipment_type_name,construction_month,country,technician,user_status,equipment_type
2,W4X979W12995,1290040,E82,33927373,C,,2008,LINDE,2012-12-18,,2008-05-29 00:00:00,2023-09-19 00:00:00,NaT,Apilador,6,,KXE-6156,AWIN,L
3,H2X994R11511,1306179,unknown,33922280,N,,2004,LINDE,2012-12-18,,2004-04-30 00:00:00,2023-05-24 00:00:00,NaT,Contrapesada térmica,5,,,AVLB,L
144,F25572E15252,2982144,E11,33930535,C,,2014,LINDE,2017-11-29,,2022-10-20 00:00:00,2022-07-27 00:00:00,NaT,Preparador de pedidos vertical,1,FR,KXE-6647,SOTR,L
362,TM55CIN115151,2984215,E11,33934593,R,J0J769,2014,Not defined,2017-11-29,,,,NaT,,9,,LE616666,AWIN PINA,V
384,15/9715/5-12,2984332,E11,33934593,R,J01166,2017,Not defined,2017-11-29,,,,NaT,,6,,LE616666,AWIN PINA,V
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
75531,22421261,6109680,unknown,33934592,U,,2016,Not defined,2024-03-22,,,,NaT,,1,,,AVLB,V
75532,91149257,6109690,unknown,33934592,U,,2016,Not defined,2024-03-22,,,,NaT,,1,,,AVLB,V
75533,29516965,6109768,unknown,33934592,U,,2016,Not defined,2024-03-22,,,,NaT,,1,,,AVLB,V
75541,HLI5172622,6111696,E11,33920279,C,,2016,Not defined,2024-03-25,,2024-03-25 00:00:00,,NaT,,1,,LE196666,SOLD,V


In [223]:
#it seems that for all warranty start nulls we also have warranty_type N/A so let's fill the null values with N/A too in warranty_start, later we see if we can remove it

#let's fillna with N/A

clean_dataset.warranty_start = clean_dataset.warranty_start.fillna("N/A")
clean_dataset.warranty_start.value_counts()


N/A                    37906
2023-05-15 00:00:00       51
2022-09-27 00:00:00       50
2021-03-24 00:00:00       47
2021-03-01 00:00:00       43
                       ...  
2010-11-03 00:00:00        1
2017-06-07 00:00:00        1
2016-12-22 00:00:00        1
2016-12-15 00:00:00        1
2019-09-17 00:00:00        1
Name: warranty_start, Length: 2169, dtype: int64

### equipment_type_name

let's see if this is relevant

In [224]:
clean_dataset.equipment_type_name.isnull().sum()

38034

In [225]:
clean_dataset.equipment_type_name.value_counts()

Contrapesada eléctrica            2428
Transpaleta eléctrica             2106
Apilador                          1135
Contrapesada térmica              1066
Transpaleta manual                 830
Retráctil                          465
Preparador de pedidos              264
Preparador de pedidos vertical      82
Carretilla Combi                    61
Tractor de arrastre                 60
Transpaleta de tijera               37
Batería                             16
Otros                               16
Trilateral (VNA)                     5
Remolque                             3
Name: equipment_type_name, dtype: int64

In [226]:
clean_dataset.equipment_type_name.info()

<class 'pandas.core.series.Series'>
Int64Index: 46608 entries, 2 to 75548
Series name: equipment_type_name
Non-Null Count  Dtype 
--------------  ----- 
8574 non-null   object
dtypes: object(1)
memory usage: 728.2+ KB


In [227]:
#like before, let's check if the nulls in this field have nulls in other relevant fields to see if we can remove some rows.
#we could compare it to equipment type but it has 0 null values

#let's check serial_number


filter_equipment = clean_dataset[clean_dataset.equipment_type_name.isna() & clean_dataset.serial_number.notna()]
filter_equipment

Unnamed: 0,serial_number,equipment_id,group,client_code,fleet_type,fixed_asset,construction_year,brand_name,created_date,warranty_type,service_start_date,last_service_date,warranty_start,equipment_type_name,construction_month,country,technician,user_status,equipment_type
11,AJC62569,2979975,E11,33925229,C,,2016,Not defined,2017-11-29,12 Mon./2000 h,2023-06-06 00:00:00,2023-04-26 00:00:00,2010-06-30 00:00:00,,1,,KXE-6236,SOLD,C
15,51996,2980140,E13,39383374,C,,2017,KÄRCHER,2017-11-29,12 Mon./2000 h,2021-06-07 00:00:00,,2014-09-30 00:00:00,,11,,LE616666,SOLD PINA,V
16,BID14195552114297,2980142,E13,33924295,C,,2017,Others,2017-11-29,12 Mon./2000 h,2021-05-25 00:00:00,,2014-09-30 00:00:00,,11,,LE616666,SOTR PINA,V
53,DTA-6999,2980849,E12,33957850,C,,2016,Others,2017-11-29,12 Mon./2000 h,2022-07-06 00:00:00,2023-08-10 00:00:00,2022-07-06 00:00:00,,1,,KXE-6227,SOLD,L
129,F25572D11112,2981948,E13,50,C,,2013,Not defined,2017-11-29,12 Mon./2000 h,,,2013-12-18 00:00:00,,1,FR,LE616666,AVLB PINA,L
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
75531,22421261,6109680,unknown,33934592,U,,2016,Not defined,2024-03-22,,,,,,1,,,AVLB,V
75532,91149257,6109690,unknown,33934592,U,,2016,Not defined,2024-03-22,,,,,,1,,,AVLB,V
75533,29516965,6109768,unknown,33934592,U,,2016,Not defined,2024-03-22,,,,,,1,,,AVLB,V
75541,HLI5172622,6111696,E11,33920279,C,,2016,Not defined,2024-03-25,,2024-03-25 00:00:00,,,,1,,LE196666,SOLD,V


In [228]:
#okay, so all 38034 null values in equipment_type_name actually have a value in serial_number
#that makes sense since we deleted the rows without serial numbers
#let's fill nulls with N/A

clean_dataset.equipment_type_name = clean_dataset.equipment_type_name.fillna("N/A")
clean_dataset.equipment_type_name.isnull().sum()

0

### construction_month

In [229]:
clean_dataset.construction_month.isna().sum()

0

In [230]:
clean_dataset.construction_month.unique()

array([ 6,  5,  9,  1, 11,  7,  3,  4,  2,  8, 10, 12])

### country

In [231]:
clean_dataset.country.isna().sum()

39162

In [232]:
clean_dataset.shape

(46608, 19)

In [233]:
clean_dataset.country.unique()

array([nan, 'DE', 'CN', 'GB', 'CZ', 'IT', 'FR', 'PL', 'AM', 'ES'],
      dtype=object)

### technician

Pto.tbjo.responsable: the technician assigned to the equipment.

This seems like a field that could go out of date and become redundant for future systems

In [234]:
clean_dataset.technician.isna().sum()

37071

In [235]:
clean_dataset.technician.value_counts()

LE196666    2296
LE616666     787
KXE-6278     219
KXE-6642     214
KXE-6614     206
            ... 
KXE-6126       1
KXE-6115       1
KXE-6189       1
KXE-6672       1
KXE-6218       1
Name: technician, Length: 100, dtype: int64

In [236]:
clean_dataset.technician.unique()

array(['KXE-6156', nan, 'KXE-6157', 'LE616666', 'KXE-6236', 'KXE-6626',
       'KXE-6633', 'KXE-6659', 'KXE-6227', 'KXE-6664', 'KXE-6642',
       'KXE-6663', 'KXE-6149', 'KXE-6215', 'KXE-6629', 'KXE-6653',
       'KXE-6641', 'KXE-6269', 'KXE-6171', 'KXE-6647', 'KXE-6658',
       'KXE-6263', 'KXE-6644', 'KXE-6614', 'KXE-6212', 'KXE-6624',
       'KXE-6617', 'KXE-6223', 'KXE-6649', 'KXE-6656', 'KXE-6635',
       'KXE-6276', 'KXE-6278', 'KXE-6646', 'KXE-6619', 'KXE-6266',
       'KXE-6657', 'KXE-6654', 'KXE-6655', 'KXE-6191', 'KXE-6143',
       'LE686666', 'KXE-6634', 'KXE-6228', 'KXE-6639', 'KXE-6148',
       'KXE-6146', 'KXE-6666', 'KXE-6187', 'KXE-6271', 'KXE-6621',
       'KXE-6618', 'KXE-6264', 'KXE-6686', 'KXE-6183', 'KXE-6175',
       'KXE-6178', 'KXE-6126', 'KXE-6142', 'KXE-6256', 'KXE-6196',
       'KXE-6238', 'KXE-6158', 'KXE-6683', 'KXE-6622', 'KXE-6166',
       'KXE-6268', 'KXE-6672', 'KXE-6677', 'KXE-6189', 'KXE-6144',
       'KXE-6231', 'KXE-6115', 'LE196666', 'KXE-6188', 'K

In [237]:
#we only have 100 unique technician ids, and 37071 values are null

### user_status

User status. The only thing to take into account is the PINA status, which means the equipment has gone two years without being serviced.



In [238]:
clean_dataset.user_status.isna().sum()

36246

In [239]:
clean_dataset.user_status.unique()

array(['AWIN', 'AVLB', 'SOLD RSVD', 'SOTR', 'SOTR PINA', 'SOLD',
       'SOLD PINA', 'SOTR PINA RSVD', 'SOTR RSVD', 'AVLB PINA', 'HIRE',
       'SOTR AVLB PINA RESH', 'AVLB PINA RESH', 'HIRE PINA', 'AWIN PINA',
       'NORE PINA', 'AVLB PINA RSVD', 'SCRA PINA', 'SOLD PINA RESH',
       'SOTR RESH', 'SOTR AVLB', 'SOTR PINA RESH', 'SOLD RESH',
       'SOTR RESH RSVD', 'AWIN RESH', 'AWIP', 'AWIN RSVD',
       'SOLD PINA RSVD', 'HIRE RESH', 'NORE', 'AWIP RESH', 'AVLB RESH',
       'AVLB RSVD', 'SOTR AVLB PINA', 'SOTR AVLB RSVD', 'AWIN PINA RESH',
       'STOL PINA', nan, 'NOKN PINA', 'AWIP RSVD', 'SCRA',
       'AWIP HIRE RESH', 'NORE RESH', 'SOTR FTRA PINA'], dtype=object)

In [240]:
#since it seems we only care about the PINA status, I will create another column called "is_inactive" that is True when user_status contains PINA and False otherwise (null statuses count as False)

clean_dataset['is_inactive'] = clean_dataset['user_status'].str.contains('PINA', case=False, na=False)

clean_dataset

Unnamed: 0,serial_number,equipment_id,group,client_code,fleet_type,fixed_asset,construction_year,brand_name,created_date,warranty_type,service_start_date,last_service_date,warranty_start,equipment_type_name,construction_month,country,technician,user_status,equipment_type,is_inactive
2,W4X979W12995,1290040,E82,33927373,C,,2008,LINDE,2012-12-18,,2008-05-29 00:00:00,2023-09-19 00:00:00,,Apilador,6,,KXE-6156,AWIN,L,False
3,H2X994R11511,1306179,unknown,33922280,N,,2004,LINDE,2012-12-18,,2004-04-30 00:00:00,2023-05-24 00:00:00,,Contrapesada térmica,5,,,AVLB,L,False
7,H2X926C19625,2205572,E12,39327943,C,,2012,LINDE,2015-12-05,12mon/900hrs,2020-10-08 00:00:00,2024-03-14 00:00:00,2020-12-18 00:00:00,Contrapesada eléctrica,5,DE,KXE-6157,SOLD RSVD,L,False
8,W45552G19154,2576292,unknown,383,C,,2016,LINDE,2016-09-23,12 Mon./2000 h,2022-09-21 00:00:00,2023-06-27 00:00:00,2016-10-25 00:00:00,Transpaleta eléctrica,9,,,SOTR,L,False
9,AE512955,2979962,E11,33923773,C,,2016,Others,2017-11-29,12 Mon./2000 h,2019-04-16 00:00:00,2019-08-19 00:00:00,2007-03-20 00:00:00,Batería,1,,LE616666,SOTR PINA,C,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
75532,91149257,6109690,unknown,33934592,U,,2016,Not defined,2024-03-22,,,,,,1,,,AVLB,V,False
75533,29516965,6109768,unknown,33934592,U,,2016,Not defined,2024-03-22,,,,,,1,,,AVLB,V,False
75541,HLI5172622,6111696,E11,33920279,C,,2016,Not defined,2024-03-25,,2024-03-25 00:00:00,,,,1,,LE196666,SOLD,V,False
75542,HLI5172659,6112185,E11,33923302,C,,2016,Not defined,2024-03-25,,2024-03-25 00:00:00,,,,1,,LE196666,SOLD,V,False


In [241]:
#now we can drop the user status column

clean_dataset = clean_dataset.drop(columns=['user_status'])

In [242]:
clean_dataset.head()

Unnamed: 0,serial_number,equipment_id,group,client_code,fleet_type,fixed_asset,construction_year,brand_name,created_date,warranty_type,service_start_date,last_service_date,warranty_start,equipment_type_name,construction_month,country,technician,equipment_type,is_inactive
2,W4X979W12995,1290040,E82,33927373,C,,2008,LINDE,2012-12-18,,2008-05-29 00:00:00,2023-09-19 00:00:00,,Apilador,6,,KXE-6156,L,False
3,H2X994R11511,1306179,unknown,33922280,N,,2004,LINDE,2012-12-18,,2004-04-30 00:00:00,2023-05-24 00:00:00,,Contrapesada térmica,5,,,L,False
7,H2X926C19625,2205572,E12,39327943,C,,2012,LINDE,2015-12-05,12mon/900hrs,2020-10-08 00:00:00,2024-03-14 00:00:00,2020-12-18 00:00:00,Contrapesada eléctrica,5,DE,KXE-6157,L,False
8,W45552G19154,2576292,unknown,383,C,,2016,LINDE,2016-09-23,12 Mon./2000 h,2022-09-21 00:00:00,2023-06-27 00:00:00,2016-10-25 00:00:00,Transpaleta eléctrica,9,,,L,False
9,AE512955,2979962,E11,33923773,C,,2016,Others,2017-11-29,12 Mon./2000 h,2019-04-16 00:00:00,2019-08-19 00:00:00,2007-03-20 00:00:00,Batería,1,,LE616666,C,True


In [243]:
clean_dataset.is_inactive.isnull().sum()

0

In [244]:
clean_dataset.is_inactive.info()

<class 'pandas.core.series.Series'>
Int64Index: 46608 entries, 2 to 75548
Series name: is_inactive
Non-Null Count  Dtype
--------------  -----
46608 non-null  bool 
dtypes: bool(1)
memory usage: 409.6 KB
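
Because `user_status` holds several space-separated codes, a substring match would also flag any code that merely contains the letters "PINA". A stricter variant, sketched below, checks whole codes. "PINAX" is a made-up code used only to show the difference; no such code appears in the status list above.

```python
import pandas as pd

status = pd.Series(["AVLB PINA", "SOTR", None, "SOTR AVLB PINA RESH", "PINAX"])

# Substring match, as used above: True wherever the text "PINA" occurs.
by_substring = status.str.contains("PINA", case=False, na=False)

# Token match: True only when "PINA" is one of the space-separated codes.
by_token = status.fillna("").str.split().apply(lambda codes: "PINA" in codes)
print(by_substring.tolist(), by_token.tolist())
```

With the status codes currently in the data both approaches should agree; the token match only matters if new codes are introduced later.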


### equipment_type

This is one of the relevant ones

Equipment type.

C: competitor machine / D: dummy machine / G: demo machine / L: Linde MH machine / O: other equipment / V: isolated components

In [245]:
clean_dataset.equipment_type.isnull().sum()

0

In [246]:
clean_dataset.equipment_type.unique()

array(['L', 'C', 'V', 'O', 'G'], dtype=object)

In [247]:
clean_dataset[(clean_dataset.equipment_type == "L") & (clean_dataset.brand_name == "Not defined")]

Unnamed: 0,serial_number,equipment_id,group,client_code,fleet_type,fixed_asset,construction_year,brand_name,created_date,warranty_type,service_start_date,last_service_date,warranty_start,equipment_type_name,construction_month,country,technician,equipment_type,is_inactive
87,E5X959T11997,2981314,E11,33923773,C,,2006,Not defined,2017-11-29,12 Mon./2000 h,2019-12-23 00:00:00,2019-06-11 00:00:00,2006-11-24 00:00:00,Contrapesada térmica,1,GB,KXE-6629,L,True
129,F25572D11112,2981948,E13,50,C,,2013,Not defined,2017-11-29,12 Mon./2000 h,,,2013-12-18 00:00:00,,1,FR,LE616666,L,True
130,F25572D11222,2981956,E13,50,C,,2013,Not defined,2017-11-29,12 Mon./2000 h,,,2013-12-18 00:00:00,,1,FR,LE616666,L,True
136,F25572E11952,2982032,E11,50,C,,2014,Not defined,2017-11-29,12 Mon./2000 h,,,2014-02-26 00:00:00,,1,FR,KXE-6641,L,True
142,F25572E15255,2982140,E11,33292724,C,,2014,Not defined,2017-11-29,12 Mon./2000 h,2022-03-08 00:00:00,2021-10-13 00:00:00,2014-06-25 00:00:00,Preparador de pedidos vertical,1,FR,KXE-6647,L,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
75037,G5X527W51597,6025209,E81,33933233,C,,2016,Not defined,2024-01-24,,,,,Tractor de arrastre,1,GB,KXE-6144,L,False
75179,H25254N11145,6050111,E11,33934507,R,J09560,2024,Not defined,2024-02-10,12 Mon./2000 h,2024-02-21 00:00:00,2024-02-20 00:00:00,2024-02-21 00:00:00,Contrapesada eléctrica,1,DE,LE196666,L,False
75194,W45555N15119,6058317,E11,33934507,R,J09561,2024,Not defined,2024-02-15,12 Mon./2000 h,2024-02-21 00:00:00,2024-03-13 00:00:00,2024-02-21 00:00:00,Transpaleta eléctrica,2,FR,LE196666,L,False
75273,H25254N11151,6068378,E11,33934507,R,J09596,2024,Not defined,2024-02-22,12 Mon./2000 h,2024-03-05 00:00:00,2024-03-04 00:00:00,2024-03-05 00:00:00,Contrapesada eléctrica,2,DE,LE196666,L,False


In [248]:
clean_dataset.brand_name.value_counts()

Toyota / BT               10164
LINDE                      8587
JUNGHEINRICH               5651
Still / OM Pimespo         5183
Others                     4340
Not defined                3486
HYSTER                     2407
NISSAN                     2324
CATERPILLAR                1059
MITSUBISHI                  939
YALE                        725
CROWN                       525
KOMATSU                     364
CLARK                       293
CESAB                       240
DAEWOO                      117
ATLET                       116
KALMAR                       36
LUGLI                        11
TOYOTA !!OUT OF DATE!!       11
STEINBOCK                    10
BT !!OUT OF DATE!!            7
CASCADE                       3
KÄRCHER                       2
FENWICK                       2
MANITOU                       2
TCM                           2
BOLZONI-AURAMO                1
HYUNDAI                       1
Name: brand_name, dtype: int64

In [249]:
# Looking at equipment_type, we may be able to fill some "Not defined" values in the "brand_name" column, since we know that type "L" equipment is Linde.

# Update "brand_name" from "Not defined" to "LINDE" where "equipment_type" is "L"
clean_dataset.loc[
    (clean_dataset['equipment_type'] == "L") & (clean_dataset['brand_name'] == "Not defined"),
    'brand_name'
] = "LINDE"
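
Before applying an overwrite like this one, it can help to count how many rows the mask touches and to look at which brands already appear under each equipment type. In the notebook the counts move by 453 rows (Not defined goes from 3486 to 3033 and LINDE from 8587 to 9040), which matches the mask. Below is a minimal sketch on a small toy frame; the `toy` data is made up for illustration and is not taken from the dataset.

```python
import pandas as pd

# Toy frame standing in for clean_dataset (illustrative values only)
toy = pd.DataFrame({
    "equipment_type": ["L", "L", "L", "C", "V", "L"],
    "brand_name": ["LINDE", "Not defined", "LINDE", "Others", "Not defined", "Not defined"],
})

# Cross-tabulate type against brand to see which brands show up under type "L"
type_vs_brand = pd.crosstab(toy["equipment_type"], toy["brand_name"])

# Rows the update would touch: type "L" with an undefined brand
mask = (toy["equipment_type"] == "L") & (toy["brand_name"] == "Not defined")
rows_to_update = int(mask.sum())

# Apply the same kind of update as in the cell above
toy.loc[mask, "brand_name"] = "LINDE"
```

Rows of other equipment types keep their "Not defined" value, so the fill stays limited to the case we are confident about.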


In [250]:
clean_dataset.brand_name.value_counts()

Toyota / BT               10164
LINDE                      9040
JUNGHEINRICH               5651
Still / OM Pimespo         5183
Others                     4340
Not defined                3033
HYSTER                     2407
NISSAN                     2324
CATERPILLAR                1059
MITSUBISHI                  939
YALE                        725
CROWN                       525
KOMATSU                     364
CLARK                       293
CESAB                       240
DAEWOO                      117
ATLET                       116
KALMAR                       36
LUGLI                        11
TOYOTA !!OUT OF DATE!!       11
STEINBOCK                    10
BT !!OUT OF DATE!!            7
CASCADE                       3
KÄRCHER                       2
FENWICK                       2
MANITOU                       2
TCM                           2
BOLZONI-AURAMO                1
HYUNDAI                       1
Name: brand_name, dtype: int64
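
The brand list still contains two legacy labels, "TOYOTA !!OUT OF DATE!!" and "BT !!OUT OF DATE!!". One possible way to simplify the column is to fold them into an existing group. The sketch below assumes they belong under "Toyota / BT"; that mapping is a guess that would need to be confirmed with the data owner, and `legacy_brand_map` is a hypothetical name.

```python
import pandas as pd

# Small sample of brand labels as they appear in the value counts above
brands = pd.Series(
    ["Toyota / BT", "TOYOTA !!OUT OF DATE!!", "BT !!OUT OF DATE!!", "LINDE", "HYSTER"],
    name="brand_name",
)

# Hypothetical mapping: assumes the legacy labels belong to the "Toyota / BT" group
legacy_brand_map = {
    "TOYOTA !!OUT OF DATE!!": "Toyota / BT",
    "BT !!OUT OF DATE!!": "Toyota / BT",
}

# Replace only the labels listed in the mapping; all other values are left untouched
normalized = brands.replace(legacy_brand_map)
```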

-----

## Review

In [251]:
# Extra drops: remove columns that we won't need in the future system


In [252]:
clean_dataset = clean_dataset.drop(columns=["created_date", "technician", "group", "country"])

In [253]:
clean_dataset.head()

Unnamed: 0,serial_number,equipment_id,client_code,fleet_type,fixed_asset,construction_year,brand_name,warranty_type,service_start_date,last_service_date,warranty_start,equipment_type_name,construction_month,equipment_type,is_inactive
2,W4X979W12995,1290040,33927373,C,,2008,LINDE,,2008-05-29 00:00:00,2023-09-19 00:00:00,,Apilador,6,L,False
3,H2X994R11511,1306179,33922280,N,,2004,LINDE,,2004-04-30 00:00:00,2023-05-24 00:00:00,,Contrapesada térmica,5,L,False
7,H2X926C19625,2205572,39327943,C,,2012,LINDE,12mon/900hrs,2020-10-08 00:00:00,2024-03-14 00:00:00,2020-12-18 00:00:00,Contrapesada eléctrica,5,L,False
8,W45552G19154,2576292,383,C,,2016,LINDE,12 Mon./2000 h,2022-09-21 00:00:00,2023-06-27 00:00:00,2016-10-25 00:00:00,Transpaleta eléctrica,9,L,False
9,AE512955,2979962,33923773,C,,2016,Others,12 Mon./2000 h,2019-04-16 00:00:00,2019-08-19 00:00:00,2007-03-20 00:00:00,Batería,1,C,True


In [254]:
clean_dataset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 46608 entries, 2 to 75548
Data columns (total 15 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   serial_number        46608 non-null  object
 1   equipment_id         46608 non-null  int64 
 2   client_code          46608 non-null  int64 
 3   fleet_type           46608 non-null  object
 4   fixed_asset          46608 non-null  object
 5   construction_year    46608 non-null  int64 
 6   brand_name           46608 non-null  object
 7   warranty_type        46608 non-null  object
 8   service_start_date   46608 non-null  object
 9   last_service_date    46608 non-null  object
 10  warranty_start       46608 non-null  object
 11  equipment_type_name  46608 non-null  object
 12  construction_month   46608 non-null  int64 
 13  equipment_type       46608 non-null  object
 14  is_inactive          46608 non-null  bool  
dtypes: bool(1), int64(4), object(10)
memory usage: 5.4+ M
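
The info() output shows that service_start_date, last_service_date and warranty_start are still stored as object rather than as dates. A possible next step is to convert them with `pd.to_datetime`. The sketch below uses a toy frame with made-up values and assumes the `YYYY-MM-DD HH:MM:SS` format seen in the head() output; anything that does not parse becomes NaT.

```python
import pandas as pd

# Toy frame with the same column names; values are illustrative only
toy = pd.DataFrame({
    "service_start_date": ["2008-05-29 00:00:00", "", "2020-10-08 00:00:00"],
    "last_service_date": ["2023-09-19 00:00:00", "2023-05-24 00:00:00", "not a date"],
})

date_columns = ["service_start_date", "last_service_date"]

for column in date_columns:
    # errors="coerce" turns blanks and unparseable strings into NaT instead of raising
    toy[column] = pd.to_datetime(toy[column], format="%Y-%m-%d %H:%M:%S", errors="coerce")

# Count how many values per column could not be parsed
unparsed = toy[date_columns].isna().sum()
```

Counting the NaT values afterwards gives a quick view of how many rows really have no date.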

## Add calculated fields + re-order


Fields to add:

- Manufacturing date (month + year), stored as `construction_date`

In [255]:
# To simplify, let's add a single manufacturing date column (construction_date) that combines month and year. The null values that these columns had were already filled with the most common ones (month = 7 and year = 2016).


In [256]:
clean_dataset.construction_month.value_counts()

1     7428
7     4682
9     4280
5     4033
2     3944
11    3712
3     3694
6     3588
4     3504
10    2973
12    2719
8     2051
Name: construction_month, dtype: int64

In [257]:
print(clean_dataset.construction_month.isnull().sum())
print(clean_dataset.construction_month.unique())

0
[ 6  5  9  1 11  7  3  4  2  8 10 12]


In [258]:
print(clean_dataset.construction_year.isnull().sum())
print(clean_dataset.construction_year.unique())

0
[2008 2004 2012 2016 2017 2000 1999 2002 2006 2014 2013 2007 2015 2003
 2010 2011 2005 2009 2001 2019 1998 1995 2018 2023 2020 2022 2021 1992
 1996 1990 1985 1991 2024]


In [259]:

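# Build a "YYYY-MM-01" string from year and month, then parse it as a datetime.
# The day is fixed to the 1st because only month and year are known.
# fillna(1) is only a safeguard: the checks above show no nulls in construction_month.
clean_dataset["construction_date"] = pd.to_datetime(
    clean_dataset["construction_year"].astype("Int64").astype(str) + "-" +
    clean_dataset["construction_month"].fillna(1).astype("Int64").astype(str).str.zfill(2) + "-01",
    errors="coerce"  # any string that cannot be parsed becomes NaT
)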
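
As an alternative to string concatenation, `pd.to_datetime` also accepts a DataFrame whose columns are named `year`, `month` and `day`. The sketch below shows that variant on a toy frame; the values are made up, and the day is fixed to 1 as in the cell above.

```python
import pandas as pd

# Toy frame with the same column names as clean_dataset (illustrative values)
toy = pd.DataFrame({
    "construction_year": [2008, 2016, 2024],
    "construction_month": [6, 1, 12],
})

# to_datetime can assemble dates from columns named year, month and day
parts = pd.DataFrame({
    "year": toy["construction_year"],
    "month": toy["construction_month"],
    "day": 1,  # only month and year are known, so use the first day of the month
})

toy["construction_date"] = pd.to_datetime(parts)
```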

In [260]:
clean_dataset.construction_date.isnull().sum()

0

In [261]:
clean_dataset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 46608 entries, 2 to 75548
Data columns (total 16 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   serial_number        46608 non-null  object        
 1   equipment_id         46608 non-null  int64         
 2   client_code          46608 non-null  int64         
 3   fleet_type           46608 non-null  object        
 4   fixed_asset          46608 non-null  object        
 5   construction_year    46608 non-null  int64         
 6   brand_name           46608 non-null  object        
 7   warranty_type        46608 non-null  object        
 8   service_start_date   46608 non-null  object        
 9   last_service_date    46608 non-null  object        
 10  warranty_start       46608 non-null  object        
 11  equipment_type_name  46608 non-null  object        
 12  construction_month   46608 non-null  int64         
 13  equipment_type       46608 non-

In [262]:
origin_dataset.head()

Unnamed: 0,Número de serie,Equipo,Número-identificación técnica,Grupo planificación,Enviar a parte,Flota,Activo fijo,Año de construcción,Brand name,Creado el,...,Modificado por,Fe.puesta servicio,Fecha de última orden,Inic.garantía clte.,intervalo,Mes de construcción,País de fabricación,Pto.tbjo.responsable,Status de usuario,Tipo de equipo
0,H2X992W15465,1132732,3,E82,33925845,C,,2008.0,LINDE,12/6/2012,...,HK57F5,,11/11/2022,15/9/2008,Contrapesada térmica,5.0,DE,KXE-6188,AVLB,L
1,H2X995S19125,1207034,,E12,39380933,C,,2005.0,,18/6/2012,...,HK57F5,,13/7/2018,14/12/2011,Contrapesada eléctrica,5.0,DE,KXE-6639,AVLB PINA,L
2,W4X979W12995,1290040,1204 PMP EX,E82,33927373,C,,2008.0,LINDE,18/12/2012,...,CKMAGKNK,29/5/2008,19/9/2023,,Apilador,6.0,,KXE-6156,AWIN,L
3,H2X994R11511,1306179,,,33922280,N,,2004.0,LINDE,18/12/2012,...,HK57F5,30/4/2004,24/5/2023,,Contrapesada térmica,5.0,,,AVLB,L
4,G5X997P11599,1655177,,E11,33923892,C,,2003.0,,23/9/2014,...,HK57F5,,1/4/2020,15/7/2003,Contrapesada eléctrica,6.0,GB,KXE-6263,AVLB PINA,L


In [263]:
clean_dataset.isnull().sum()

serial_number          0
equipment_id           0
client_code            0
fleet_type             0
fixed_asset            0
construction_year      0
brand_name             0
warranty_type          0
service_start_date     0
last_service_date      0
warranty_start         0
equipment_type_name    0
construction_month     0
equipment_type         0
is_inactive            0
construction_date      0
dtype: int64
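
A result of zero nulls does not necessarily mean every field carries real information, because filled-in placeholders are not counted by isnull(). The sketch below counts blank strings and "Not defined" values per column on a toy frame; the set of placeholders is an assumption and should be adjusted to whatever was actually used to fill the nulls.

```python
import pandas as pd

# Toy frame with made-up values for two of the text columns
toy = pd.DataFrame({
    "fixed_asset": ["", "J09560", "  "],
    "warranty_type": ["12 Mon./2000 h", "", "Not defined"],
})

# Hypothetical set of values that stand in for missing data
placeholders = {"", "Not defined"}
text_columns = ["fixed_asset", "warranty_type"]

# Strip whitespace first so that "  " is treated the same as ""
placeholder_counts = toy[text_columns].apply(
    lambda col: col.str.strip().isin(placeholders).sum()
)
```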

In [264]:
clean_dataset.last_service_date.isnull().sum()

0

In [265]:
origin_dataset.shape

(75564, 22)

In [266]:
clean_dataset.shape

(46608, 16)
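
To put the two shapes side by side, the sketch below computes how many rows were removed during cleaning and what share of the original rows remains. It uses only the two shape tuples printed above.

```python
# Shapes as printed by the two cells above
origin_shape = (75564, 22)
clean_shape = (46608, 16)

# Rows dropped during cleaning
rows_removed = origin_shape[0] - clean_shape[0]

# Share of the original rows that are kept, as a percentage with one decimal
row_retention_pct = round(clean_shape[0] / origin_shape[0] * 100, 1)

print(f"Rows removed: {rows_removed}")
print(f"Rows kept: {row_retention_pct}%")
```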

In [268]:
clean_dataset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 46608 entries, 2 to 75548
Data columns (total 16 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   serial_number        46608 non-null  object        
 1   equipment_id         46608 non-null  int64         
 2   client_code          46608 non-null  int64         
 3   fleet_type           46608 non-null  object        
 4   fixed_asset          46608 non-null  object        
 5   construction_year    46608 non-null  int64         
 6   brand_name           46608 non-null  object        
 7   warranty_type        46608 non-null  object        
 8   service_start_date   46608 non-null  object        
 9   last_service_date    46608 non-null  object        
 10  warranty_start       46608 non-null  object        
 11  equipment_type_name  46608 non-null  object        
 12  construction_month   46608 non-null  int64         
 13  equipment_type       46608 non-

In [269]:
clean_dataset = clean_dataset.drop(columns=["construction_year", "construction_month"])

In [270]:
clean_dataset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 46608 entries, 2 to 75548
Data columns (total 14 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   serial_number        46608 non-null  object        
 1   equipment_id         46608 non-null  int64         
 2   client_code          46608 non-null  int64         
 3   fleet_type           46608 non-null  object        
 4   fixed_asset          46608 non-null  object        
 5   brand_name           46608 non-null  object        
 6   warranty_type        46608 non-null  object        
 7   service_start_date   46608 non-null  object        
 8   last_service_date    46608 non-null  object        
 9   warranty_start       46608 non-null  object        
 10  equipment_type_name  46608 non-null  object        
 11  equipment_type       46608 non-null  object        
 12  is_inactive          46608 non-null  bool          
 13  construction_date    46608 non-

In [271]:
clean_dataset.shape

(46608, 14)
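
For the re-order step, one option is to list the columns that should come first and keep the rest in their current order. The sketch below does this on an empty toy frame; the chosen order in `preferred_first` is only an example, not a decision taken in this notebook.

```python
import pandas as pd

# Empty toy frame with a few of the final column names
toy = pd.DataFrame(
    columns=["serial_number", "brand_name", "is_inactive", "construction_date", "equipment_id"]
)

# Hypothetical choice of columns to place first
preferred_first = ["equipment_id", "serial_number", "construction_date"]

# Keep every other column in its existing relative order
remaining = [column for column in toy.columns if column not in preferred_first]

toy = toy[preferred_first + remaining]
```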