## Analyzing and cleaning the data in the Production.Product.csv file stored in the RAW folder
### At the end of this notebook, the data will be loaded into the REFINED folder in Parquet format

In [1]:
# importing the required libraries
import pandas as pd
import numpy as np

In [2]:
# creating the variable with the path of the file we are going to load
path_product = "gs://bike-factory-datalake/01.RAW/Production.Product.csv"

In [3]:
# loading the CSV file (the fields are separated by semicolons)
product = pd.read_csv(path_product, sep=';')

In [4]:
# creating the pandas DataFrame (read_csv already returns one, so this call is redundant but harmless)
df_product = pd.DataFrame(product)

In [5]:
df_product.dtypes

ProductID                  int64
Name                      object
ProductNumber             object
MakeFlag                   int64
FinishedGoodsFlag          int64
Color                     object
SafetyStockLevel           int64
ReorderPoint               int64
StandardCost              object
ListPrice                 object
Size                      object
SizeUnitMeasureCode       object
WeightUnitMeasureCode     object
Weight                   float64
DaysToManufacture          int64
ProductLine               object
Class                     object
Style                     object
ProductSubcategoryID     float64
ProductModelID           float64
SellStartDate             object
SellEndDate               object
DiscontinuedDate         float64
rowguid                   object
ModifiedDate              object
dtype: object

In [6]:
# checking the first rows of the DF
df_product.head()

Unnamed: 0,ProductID,Name,ProductNumber,MakeFlag,FinishedGoodsFlag,Color,SafetyStockLevel,ReorderPoint,StandardCost,ListPrice,...,ProductLine,Class,Style,ProductSubcategoryID,ProductModelID,SellStartDate,SellEndDate,DiscontinuedDate,rowguid,ModifiedDate
0,1,Adjustable Race,AR-5381,0,0,,1000,750,0,0,...,,,,,,2008-04-30 00:00:00.000,,,694215B7-08F7-4C0D-ACB1-D734BA44C0C8,2014-02-08 10:01:36.827
1,2,Bearing Ball,BA-8327,0,0,,1000,750,0,0,...,,,,,,2008-04-30 00:00:00.000,,,58AE3C20-4F3A-4749-A7D4-D568806CC537,2014-02-08 10:01:36.827
2,3,BB Ball Bearing,BE-2349,1,0,,800,600,0,0,...,,,,,,2008-04-30 00:00:00.000,,,9C21AED2-5BFA-4F18-BCB8-F11638DC2E4E,2014-02-08 10:01:36.827
3,4,Headset Ball Bearings,BE-2908,0,0,,800,600,0,0,...,,,,,,2008-04-30 00:00:00.000,,,ECFED6CB-51FF-49B5-B06C-7D8AC834DB8B,2014-02-08 10:01:36.827
4,316,Blade,BL-2036,1,0,,800,600,0,0,...,,,,,,2008-04-30 00:00:00.000,,,E73E9750-603B-4131-89F5-3DD15ED5FF80,2014-02-08 10:01:36.827


In [7]:
# Collecting basic information about the DF
# The output of this command is the number of rows and columns, as (rows, columns)
df_product.shape

(504, 25)

In [8]:
# checking the index of the DF
df_product.index

RangeIndex(start=0, stop=504, step=1)

In [9]:
# Checking the columns of the DF
df_product.columns

Index(['ProductID', 'Name', 'ProductNumber', 'MakeFlag', 'FinishedGoodsFlag',
       'Color', 'SafetyStockLevel', 'ReorderPoint', 'StandardCost',
       'ListPrice', 'Size', 'SizeUnitMeasureCode', 'WeightUnitMeasureCode',
       'Weight', 'DaysToManufacture', 'ProductLine', 'Class', 'Style',
       'ProductSubcategoryID', 'ProductModelID', 'SellStartDate',
       'SellEndDate', 'DiscontinuedDate', 'rowguid', 'ModifiedDate'],
      dtype='object')

In [10]:
# Counting the non-null values in each column
df_product.count()

ProductID                504
Name                     504
ProductNumber            504
MakeFlag                 504
FinishedGoodsFlag        504
Color                    256
SafetyStockLevel         504
ReorderPoint             504
StandardCost             504
ListPrice                504
Size                     211
SizeUnitMeasureCode      176
WeightUnitMeasureCode    205
Weight                   205
DaysToManufacture        504
ProductLine              278
Class                    247
Style                    211
ProductSubcategoryID     295
ProductModelID           295
SellStartDate            504
SellEndDate               98
DiscontinuedDate           0
rowguid                  504
ModifiedDate             504
dtype: int64

In [11]:
# Counting the null values in each column
df_product.isnull().sum()

ProductID                  0
Name                       0
ProductNumber              0
MakeFlag                   0
FinishedGoodsFlag          0
Color                    248
SafetyStockLevel           0
ReorderPoint               0
StandardCost               0
ListPrice                  0
Size                     293
SizeUnitMeasureCode      328
WeightUnitMeasureCode    299
Weight                   299
DaysToManufacture          0
ProductLine              226
Class                    257
Style                    293
ProductSubcategoryID     209
ProductModelID           209
SellStartDate              0
SellEndDate              406
DiscontinuedDate         504
rowguid                    0
ModifiedDate               0
dtype: int64
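
The raw counts above are easier to compare as percentages. The sketch below shows one way to get the share of nulls per column. It runs on a small hypothetical DataFrame named `toy` (not the real file); `null_share` is a name chosen only for this illustration.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical), standing in for df_product
toy = pd.DataFrame({
    "ProductID": [1, 2, 3, 4],
    "Color": ["Black", None, None, "Silver"],
    "DiscontinuedDate": [np.nan] * 4,
})

# isna() yields booleans; the mean of each boolean column is its fraction of nulls
null_share = toy.isna().mean().mul(100).round(1)

# Columns sorted from most to least incomplete
print(null_share.sort_values(ascending=False))
```

A column at 100% is an obvious candidate for removal, while partially filled columns call for a case-by-case decision.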

As explained in the previous notebook, null fields must be analyzed from the point of view of the business problem before deciding what to do with them.

We have the following options for handling null fields: drop the whole column, drop the rows that contain the null field, or replace the null with another value.
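
The three options can be sketched in pandas as follows. This is a minimal illustration on a hypothetical three-row DataFrame (`toy`), not on the real data. The result names (`without_column`, `without_rows`, `filled`) are made up for the example.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical), not the real Production.Product file
toy = pd.DataFrame({
    "ProductID": [1, 2, 3],
    "Color": ["Black", None, "Silver"],
    "DiscontinuedDate": [np.nan, np.nan, np.nan],
})

# Option 1: drop the whole column
without_column = toy.drop(columns=["DiscontinuedDate"])

# Option 2: drop the rows that have a null in a given column
without_rows = toy.dropna(subset=["Color"])

# Option 3: replace the nulls with a placeholder value
filled = toy.fillna({"Color": "N/I"})

print(without_column.shape)        # (3, 2)
print(without_rows.shape)          # (2, 3)
print(filled["Color"].tolist())    # ['Black', 'N/I', 'Silver']
```

Each call returns a new DataFrame, so `toy` itself is left unchanged.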

Below we evaluate each case:

The columns Color, Size, SizeUnitMeasureCode, WeightUnitMeasureCode, ProductLine, Class and Style are all string fields. Although we do not need them to answer the proposed business problem, we will fill their null values with "N/I" (from "Não informado", that is, not informed).

The Weight column is of type float, so we will fill its null values with 0.0. An alternative would be to fill them with the mean of the reported weights, but the proposed problem does not require it.
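
For reference, the mean-based alternative mentioned above could look like the sketch below. The values in `weights` are hypothetical. The comparison at the end shows why the choice matters if the column is ever aggregated.

```python
import numpy as np
import pandas as pd

# Hypothetical weights with two missing values
weights = pd.Series([2.0, np.nan, 4.0, np.nan], name="Weight")

# mean() skips NaN by default, so only the reported weights are averaged
mean_weight = weights.mean()

filled_mean = weights.fillna(mean_weight)
filled_zero = weights.fillna(0.0)

print(filled_mean.tolist())   # [2.0, 3.0, 4.0, 3.0]
print(filled_zero.mean())     # 1.5
```

Filling with 0.0 pulls the average down from 3.0 to 1.5, while filling with the mean preserves it. Since Weight is not used in this analysis, either choice is acceptable here.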

The columns ProductSubcategoryID and ProductModelID were inferred by pandas as float (the nulls prevent a plain integer dtype), but they hold only whole numbers and no arithmetic will be performed on them, so we will fill their null values with "N/I" and convert them to string.

SellEndDate was inferred as object, but it actually holds dates. We will change its type and fill the null values with a placeholder date.

The DiscontinuedDate column is null in every row, so we will drop it entirely, since it carries no useful data.

In [12]:
# Replacing the null values of the columns Color, Size, SizeUnitMeasureCode, WeightUnitMeasureCode, ProductLine, Class and Style with "N/I"
df_product.Color = df_product.Color.fillna('N/I')
df_product.Size = df_product.Size.fillna('N/I')
df_product.SizeUnitMeasureCode = df_product.SizeUnitMeasureCode.fillna('N/I')
df_product.WeightUnitMeasureCode = df_product.WeightUnitMeasureCode.fillna('N/I')
df_product.ProductLine = df_product.ProductLine.fillna('N/I')
df_product.Class = df_product.Class.fillna('N/I')
df_product.Style = df_product.Style.fillna('N/I')
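
The same fill can be written more compactly by selecting all the text columns at once. The sketch below assumes a hypothetical DataFrame `toy` and a list `text_columns` introduced only for this example.

```python
import pandas as pd

# Toy data (hypothetical) with two text columns and one numeric column
toy = pd.DataFrame({
    "Color": ["Black", None],
    "Size": [None, "M"],
    "Weight": [1.5, None],
})

# Only the listed columns are filled; Weight keeps its null
text_columns = ["Color", "Size"]
toy[text_columns] = toy[text_columns].fillna("N/I")

print(toy["Color"].tolist())   # ['Black', 'N/I']
print(toy["Size"].tolist())    # ['N/I', 'M']
```

Keeping the column names in a list makes it harder to miss one and easier to extend the treatment later.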

In [13]:
# Filling the null values of the Weight column with 0.0
df_product.Weight = df_product.Weight.fillna(0.0)

In [14]:
# Filling the null values of ProductSubcategoryID and ProductModelID with "N/I" and converting both columns to string
df_product.ProductSubcategoryID = df_product.ProductSubcategoryID.fillna('N/I')
df_product.ProductModelID = df_product.ProductModelID.fillna('N/I')
df_product.ProductSubcategoryID = df_product.ProductSubcategoryID.astype(str)
df_product.ProductModelID = df_product.ProductModelID.astype(str)
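
One detail worth checking: because the values were read as floats, converting them directly to string keeps the decimal part (for example `'14.0'`). If plain integer-like IDs are preferred, one possible variation is to go through the nullable `Int64` dtype first. This is a sketch on a hypothetical Series `ids`, not the approach used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical IDs, read as float because of the missing value
ids = pd.Series([14.0, np.nan, 2.0], name="ProductSubcategoryID")

# Same steps as the cell above: the float representation is kept
as_in_cell = ids.fillna("N/I").astype(str)
print(as_in_cell.tolist())   # ['14.0', 'N/I', '2.0']

# Variation: nullable integer first, then string, then fill the nulls
clean = ids.astype("Int64").astype("string").fillna("N/I")
print(clean.tolist())        # ['14', 'N/I', '2']
```

The `Int64` cast only works when every non-null value is a whole number, which is what the text above states for these two columns.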

In [15]:
# Converting the SellEndDate column to datetime and filling the null values with the date this notebook was run (2022-04-01).
# Parsing before filling means every string being parsed has the same format.
df_product.SellEndDate = pd.to_datetime(df_product.SellEndDate)
df_product.SellEndDate = df_product.SellEndDate.fillna(pd.Timestamp('2022-04-01'))
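
The sketch below shows, on a hypothetical Series `raw`, what happens to nulls during date parsing. `pd.to_datetime` turns them into `NaT`, which can then be filled with a `Timestamp`. Parsing first and filling afterwards avoids mixing two different string formats in the same column.

```python
import pandas as pd

# Hypothetical values in the same format as the source file
raw = pd.Series(
    ["2012-05-29 00:00:00.000", None, "2013-05-29 00:00:00.000"],
    name="SellEndDate",
)

# Nulls become NaT (the datetime equivalent of NaN)
parsed = pd.to_datetime(raw)
print(parsed.isna().sum())            # 1

# NaT is then replaced by the placeholder date
filled = parsed.fillna(pd.Timestamp("2022-04-01"))
print(filled.dt.year.tolist())        # [2012, 2022, 2013]
```

Keep in mind that a placeholder date is indistinguishable from a real one afterwards, so any analysis using SellEndDate needs to treat 2022-04-01 as "no end date".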

In [16]:
# Dropping the 'DiscontinuedDate' column because all 504 of its rows are null
df_product = df_product.drop('DiscontinuedDate',axis=1)
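
When the goal is to remove every column that is entirely null, without naming each one, `dropna` with `axis=1` and `how="all"` does that in one step. This is a sketch on a hypothetical DataFrame `toy`; the names `all_null` and `trimmed` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one fully null column, one partially null column
toy = pd.DataFrame({
    "ProductID": [1, 2],
    "DiscontinuedDate": [np.nan, np.nan],
    "Weight": [np.nan, 2.5],
})

# Listing the fully null columns first makes the removal explicit
all_null = [column for column in toy.columns if toy[column].isna().all()]
print(all_null)                      # ['DiscontinuedDate']

# how="all" drops a column only if every value in it is null
trimmed = toy.dropna(axis=1, how="all")
print(trimmed.columns.tolist())      # ['ProductID', 'Weight']
```

Weight survives because it has at least one value. Dropping by name, as in the cell above, remains the clearer option when only one known column is involved.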

In [17]:
# Checking whether any null values remain
df_product.isnull().sum()

ProductID                0
Name                     0
ProductNumber            0
MakeFlag                 0
FinishedGoodsFlag        0
Color                    0
SafetyStockLevel         0
ReorderPoint             0
StandardCost             0
ListPrice                0
Size                     0
SizeUnitMeasureCode      0
WeightUnitMeasureCode    0
Weight                   0
DaysToManufacture        0
ProductLine              0
Class                    0
Style                    0
ProductSubcategoryID     0
ProductModelID           0
SellStartDate            0
SellEndDate              0
rowguid                  0
ModifiedDate             0
dtype: int64

In [18]:
# Checking the types
df_product.dtypes

ProductID                         int64
Name                             object
ProductNumber                    object
MakeFlag                          int64
FinishedGoodsFlag                 int64
Color                            object
SafetyStockLevel                  int64
ReorderPoint                      int64
StandardCost                     object
ListPrice                        object
Size                             object
SizeUnitMeasureCode              object
WeightUnitMeasureCode            object
Weight                          float64
DaysToManufacture                 int64
ProductLine                      object
Class                            object
Style                            object
ProductSubcategoryID             object
ProductModelID                   object
SellStartDate                    object
SellEndDate              datetime64[ns]
rowguid                          object
ModifiedDate                     object
dtype: object

The SellStartDate column is still stored as a string, so we need to convert it to a date type.

In [19]:
df_product.SellStartDate = pd.to_datetime(df_product.SellStartDate)
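
If other date-like columns ever need the same treatment (ModifiedDate, for instance, is also listed as object in the output above), the conversion can be applied in a loop. The sketch below uses a hypothetical one-row DataFrame `toy` and a list `date_columns` chosen for the example. It is not a step this notebook performs.

```python
import pandas as pd

# Toy data (hypothetical) using the same timestamp format as the source file
toy = pd.DataFrame({
    "SellStartDate": ["2008-04-30 00:00:00.000"],
    "ModifiedDate": ["2014-02-08 10:01:36.827"],
})

# Converting each listed column from string to datetime
date_columns = ["SellStartDate", "ModifiedDate"]
for column in date_columns:
    toy[column] = pd.to_datetime(toy[column])

print(toy.dtypes)
print(toy["ModifiedDate"].dt.year.tolist())   # [2014]
```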

In [20]:
df_product.dtypes

ProductID                         int64
Name                             object
ProductNumber                    object
MakeFlag                          int64
FinishedGoodsFlag                 int64
Color                            object
SafetyStockLevel                  int64
ReorderPoint                      int64
StandardCost                     object
ListPrice                        object
Size                             object
SizeUnitMeasureCode              object
WeightUnitMeasureCode            object
Weight                          float64
DaysToManufacture                 int64
ProductLine                      object
Class                            object
Style                            object
ProductSubcategoryID             object
ProductModelID                   object
SellStartDate            datetime64[ns]
SellEndDate              datetime64[ns]
rowguid                          object
ModifiedDate                     object
dtype: object

In [21]:
# Checking how our DataFrame looks after the changes
df_product.head(10)

Unnamed: 0,ProductID,Name,ProductNumber,MakeFlag,FinishedGoodsFlag,Color,SafetyStockLevel,ReorderPoint,StandardCost,ListPrice,...,DaysToManufacture,ProductLine,Class,Style,ProductSubcategoryID,ProductModelID,SellStartDate,SellEndDate,rowguid,ModifiedDate
0,1,Adjustable Race,AR-5381,0,0,N/I,1000,750,0,0,...,0,N/I,N/I,N/I,N/I,N/I,2008-04-30,2022-04-01,694215B7-08F7-4C0D-ACB1-D734BA44C0C8,2014-02-08 10:01:36.827
1,2,Bearing Ball,BA-8327,0,0,N/I,1000,750,0,0,...,0,N/I,N/I,N/I,N/I,N/I,2008-04-30,2022-04-01,58AE3C20-4F3A-4749-A7D4-D568806CC537,2014-02-08 10:01:36.827
2,3,BB Ball Bearing,BE-2349,1,0,N/I,800,600,0,0,...,1,N/I,N/I,N/I,N/I,N/I,2008-04-30,2022-04-01,9C21AED2-5BFA-4F18-BCB8-F11638DC2E4E,2014-02-08 10:01:36.827
3,4,Headset Ball Bearings,BE-2908,0,0,N/I,800,600,0,0,...,0,N/I,N/I,N/I,N/I,N/I,2008-04-30,2022-04-01,ECFED6CB-51FF-49B5-B06C-7D8AC834DB8B,2014-02-08 10:01:36.827
4,316,Blade,BL-2036,1,0,N/I,800,600,0,0,...,1,N/I,N/I,N/I,N/I,N/I,2008-04-30,2022-04-01,E73E9750-603B-4131-89F5-3DD15ED5FF80,2014-02-08 10:01:36.827
5,317,LL Crankarm,CA-5965,0,0,Black,500,375,0,0,...,0,N/I,L,N/I,N/I,N/I,2008-04-30,2022-04-01,3C9D10B7-A6B2-4774-9963-C19DCEE72FEA,2014-02-08 10:01:36.827
6,318,ML Crankarm,CA-6738,0,0,Black,500,375,0,0,...,0,N/I,M,N/I,N/I,N/I,2008-04-30,2022-04-01,EABB9A92-FA07-4EAB-8955-F0517B4A4CA7,2014-02-08 10:01:36.827
7,319,HL Crankarm,CA-7457,0,0,Black,500,375,0,0,...,0,N/I,N/I,N/I,N/I,N/I,2008-04-30,2022-04-01,7D3FD384-4F29-484B-86FA-4206E276FE58,2014-02-08 10:01:36.827
8,320,Chainring Bolts,CB-2903,0,0,Silver,1000,750,0,0,...,0,N/I,N/I,N/I,N/I,N/I,2008-04-30,2022-04-01,7BE38E48-B7D6-4486-888E-F53C26735101,2014-02-08 10:01:36.827
9,321,Chainring Nut,CN-6137,0,0,Silver,1000,750,0,0,...,0,N/I,N/I,N/I,N/I,N/I,2008-04-30,2022-04-01,3314B1D7-EF69-4431-B6DD-DC75268BD5DF,2014-02-08 10:01:36.827


Finally, with all our columns cleaned, we will save the DF to the REFINED folder of our lake as a Parquet file, a columnar format well suited to analytical workloads.

In [22]:
df_product.to_parquet('gs://bike-factory-datalake/02.REFINED/product.parquet')