# Granularity

Granularity can be understood as the level of detail of a table, that is, the level of detail at which the data is stored in the database. The granularity **must** be present in every row of a table.
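The requirement that the granularity be present in every row can be checked mechanically. The snippet below is a minimal sketch on a hand-made frame; the `products` data and the choice of `product_name` as the grain column are assumptions made only for illustration.

```python
import pandas as pd

# Hand-made example data (not from the project); one row lacks its product name.
products = pd.DataFrame({
    "product_name": ["jeans slim", None, "jeans urban"],
    "product_price": [29.99, 89.99, 39.99],
})

# Assumed grain column(s) for this example.
grain = ["product_name"]

# Rows where any grain column is missing cannot be placed at the table's level of detail.
rows_missing_grain = products[products[grain].isna().any(axis=1)]

print(len(rows_missing_grain))  # 1
```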

#### Table 1

In the table below, the granularity is the product name, since this attribute determines the number of rows in the table. The price is just an attribute of the granularity.

| product_category | product_name | product_price |
|------------------|--------------|---------------| 
| jeans | jeans slim | 29.99 | 
| jeans | jeans flare | 89.99 | 
| jeans | jeans urban | 39.99 | 
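One way to confirm this reading with pandas is to check which column identifies each row. This is a minimal sketch that rebuilds Table 1 by hand; the variable names are illustrative.

```python
import pandas as pd

# Table 1 rebuilt by hand for illustration.
table_1 = pd.DataFrame({
    "product_category": ["jeans", "jeans", "jeans"],
    "product_name": ["jeans slim", "jeans flare", "jeans urban"],
    "product_price": [29.99, 89.99, 39.99],
})

# A column can act as the grain only if its values identify each row.
name_is_grain = table_1["product_name"].is_unique
category_is_grain = table_1["product_category"].is_unique

print(name_is_grain)      # True
print(category_is_grain)  # False
```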

#### Table 2

In the table below, by contrast, the granularity is defined by the color, because the color is what determines the number of rows in the table.

| product_category | product_name | product_price | color_name | 
|------------------|--------------|---------------|------------|
| jeans | jeans slim | 29.99 | black | 
| jeans | jeans slim | 29.99 | white | 
| jeans | jeans slim | 29.99 | blue | 
| jeans | jeans flare | 89.99 | white | 
| jeans | jeans urban | 39.99 | blue | 
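Read strictly, color is what Table 2 adds to the level of detail: each row is one product name and color pair, and neither column identifies a row on its own. The sketch below checks this on a hand-built copy of Table 2; `identifies_rows` is a hypothetical helper written for this example, not part of pandas.

```python
import pandas as pd

# Table 2 rebuilt by hand for illustration.
table_2 = pd.DataFrame({
    "product_category": ["jeans"] * 5,
    "product_name": ["jeans slim", "jeans slim", "jeans slim", "jeans flare", "jeans urban"],
    "product_price": [29.99, 29.99, 29.99, 89.99, 39.99],
    "color_name": ["black", "white", "blue", "white", "blue"],
})


def identifies_rows(df, columns):
    """Hypothetical helper: True when no two rows share the same values in `columns`."""
    return not df.duplicated(subset=columns).any()


name_only = identifies_rows(table_2, ["product_name"])
color_only = identifies_rows(table_2, ["color_name"])
name_and_color = identifies_rows(table_2, ["product_name", "color_name"])

print(name_only)       # False: "jeans slim" appears three times
print(color_only)      # False: "white" and "blue" each appear twice
print(name_and_color)  # True
```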

#### Table 3

In Table 3, the granularity is defined by the product size, because this attribute determines the number of rows in the table.

| product_category | product_name | product_price | color_name | product_size |
|------------------|--------------|---------------|------------|--------------|
| jeans | jeans slim | 29.99 | black | 32 |
| jeans | jeans slim | 29.99 | black | 34 |
| jeans | jeans slim | 29.99 | black | 36 |
| jeans | jeans slim | 29.99 | white | 32 |
| jeans | jeans slim | 29.99 | blue | 32 |
| jeans | jeans flare | 89.99 | white | 34 | 
| jeans | jeans urban | 39.99 | blue | 32 |


Note that if the table above were presented as follows,

| product_category | product_name | product_price | color_name |
|------------------|--------------|---------------|------------|
| jeans | jeans slim | 29.99 | black |
| jeans | jeans slim | 29.99 | black |
| jeans | jeans slim | 29.99 | black |
| jeans | jeans slim | 29.99 | white |
| jeans | jeans slim | 29.99 | blue |
| jeans | jeans flare | 89.99 | white |
| jeans | jeans urban | 39.99 | blue |

we could easily delete the first two rows, taking them for duplicates. In reality, the table is missing its most important piece of data: the granularity.
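The effect can be reproduced with pandas. This minimal sketch rebuilds Table 3 by hand, drops `product_size`, and counts how many rows pandas would then flag as duplicates.

```python
import pandas as pd

# Table 3 rebuilt by hand for illustration.
table_3 = pd.DataFrame({
    "product_category": ["jeans"] * 7,
    "product_name": ["jeans slim"] * 5 + ["jeans flare", "jeans urban"],
    "product_price": [29.99] * 5 + [89.99, 39.99],
    "color_name": ["black", "black", "black", "white", "blue", "white", "blue"],
    "product_size": [32, 34, 36, 32, 32, 34, 32],
})

# With the size column present, no row repeats another.
full_duplicates = int(table_3.duplicated().sum())

# Without the size column, two of the three black "jeans slim" rows look like duplicates.
without_size = table_3.drop(columns="product_size")
apparent_duplicates = int(without_size.duplicated().sum())

print(full_duplicates)      # 0
print(apparent_duplicates)  # 2
```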

* Before deleting any supposedly duplicated row from a table, check whether its granularity matches that of another row. If it does, the row is in fact a duplicate.
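In pandas, this check can be expressed by comparing rows on the grain columns only. The snippet is a sketch on hand-made rows; the `grain` list is an assumption for this example and has to be chosen per table.

```python
import pandas as pd

# Hand-made rows for illustration; the third row repeats the grain of the first.
rows = pd.DataFrame({
    "product_name": ["jeans slim", "jeans slim", "jeans slim", "jeans slim"],
    "color_name": ["black", "black", "black", "white"],
    "product_size": [32, 34, 32, 32],
})

# Assumed grain for this example: one row per product, color and size.
grain = ["product_name", "color_name", "product_size"]

# A row is a real duplicate only if an earlier row has the same grain.
is_duplicate = rows.duplicated(subset=grain, keep="first")
deduplicated = rows[~is_duplicate]

print(is_duplicate.tolist())  # [False, False, True, False]
print(len(deduplicated))      # 3
```

Passing `subset=grain` keeps the comparison at the table's level of detail, so rows that differ only in size are not removed.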

# Data cleaning

In [3]:
import pandas as pd
import numpy as np

In [4]:
url = "https://raw.githubusercontent.com/lucasquemelli/Star_Jeans/main/webscraping/data_clean.csv"
data_clean = pd.read_csv(url)

In [6]:
data_clean.head(30)

| Unnamed: 0.1 | Unnamed: 0 | id | product_name | product_type | price | datetime | style_id | color_id | color | Fit | Composition | cotton | polyester | spandex |
|--------------|------------|----|--------------|--------------|-------|----------|----------|----------|-------|-----|-------------|--------|-----------|---------|
| 0 | 0 | 811993040 | regular_jeans | men_jeans_regular | 29.99 | 2022-01-27 19:43:16 | 811993 | 40 | black_washed_out | regular_fit | Cotton 98%, Spandex 2% | 0.98 | 0.21 | 0.02 |
| 1 | 1 | 811993040 | regular_jeans | men_jeans_regular | 29.99 | 2022-01-27 19:43:16 | 811993 | 40 | denim_blue | regular_fit | Cotton 98%, Spandex 2% | 0.98 | 0.21 | 0.02 |
| 2 | 2 | 811993040 | regular_jeans | men_jeans_regular | 29.99 | 2022-01-27 19:43:16 | 811993 | 40 | dark_denim_blue | regular_fit | Cotton 98%, Spandex 2% | 0.98 | 0.21 | 0.02 |
| 3 | 3 | 811993040 | regular_jeans | men_jeans_regular | 29.99 | 2022-01-27 19:43:16 | 811993 | 40 | graphite_gray | regular_fit | Cotton 98%, Spandex 2% | 0.98 | 0.21 | 0.02 |
| 4 | 4 | 811993040 | regular_jeans | men_jeans_regular | 29.99 | 2022-01-27 19:43:16 | 811993 | 40 | light_denim_blue | regular_fit | Cotton 98%, Spandex 2% | 0.98 | 0.21 | 0.02 |
| 5 | 5 | 811993040 | regular_jeans | men_jeans_regular | 29.99 | 2022-01-27 19:43:16 | 811993 | 40 | black | regular_fit | Cotton 98%, Spandex 2% | 0.98 | 0.21 | 0.02 |
| 6 | 6 | 811993040 | regular_jeans | men_jeans_regular | 29.99 | 2022-01-27 19:43:16 | 811993 | 40 | cream | regular_fit | Cotton 98%, Spandex 2% | 0.98 | 0.21 | 0.02 |
| 7 | 7 | 811993040 | regular_jeans | men_jeans_regular | 29.99 | 2022-01-27 19:43:16 | 811993 | 40 | denim_blue | regular_fit | Cotton 98%, Spandex 2% | 0.98 | 0.2 | 0.02 |
| 8 | 8 | 811993040 | regular_jeans | men_jeans_regular | 29.99 | 2022-01-27 19:43:16 | 811993 | 40 | light_blue | regular_fit | Cotton 98%, Spandex 2% | 0.98 | 0.2 | 0.02 |
| 9 | 9 | 811993040 | regular_jeans | men_jeans_regular | 29.99 | 2022-01-27 19:43:16 | 811993 | 40 | denim_blue | regular_fit | Cotton 98%, Spandex 2% | 0.98 | 0.2 | 0.02 |


In [7]:
data_clean[data_clean['id'] == 811993040].head()

| Unnamed: 0.1 | Unnamed: 0 | id | product_name | product_type | price | datetime | style_id | color_id | color | Fit | Composition | cotton | polyester | spandex |
|--------------|------------|----|--------------|--------------|-------|----------|----------|----------|-------|-----|-------------|--------|-----------|---------|
| 0 | 0 | 811993040 | regular_jeans | men_jeans_regular | 29.99 | 2022-01-27 19:43:16 | 811993 | 40 | black_washed_out | regular_fit | Cotton 98%, Spandex 2% | 0.98 | 0.21 | 0.02 |
| 1 | 1 | 811993040 | regular_jeans | men_jeans_regular | 29.99 | 2022-01-27 19:43:16 | 811993 | 40 | denim_blue | regular_fit | Cotton 98%, Spandex 2% | 0.98 | 0.21 | 0.02 |
| 2 | 2 | 811993040 | regular_jeans | men_jeans_regular | 29.99 | 2022-01-27 19:43:16 | 811993 | 40 | dark_denim_blue | regular_fit | Cotton 98%, Spandex 2% | 0.98 | 0.21 | 0.02 |
| 3 | 3 | 811993040 | regular_jeans | men_jeans_regular | 29.99 | 2022-01-27 19:43:16 | 811993 | 40 | graphite_gray | regular_fit | Cotton 98%, Spandex 2% | 0.98 | 0.21 | 0.02 |
| 4 | 4 | 811993040 | regular_jeans | men_jeans_regular | 29.99 | 2022-01-27 19:43:16 | 811993 | 40 | light_denim_blue | regular_fit | Cotton 98%, Spandex 2% | 0.98 | 0.21 | 0.02 |


In [9]:
data_aux = data_clean[data_clean['id'] == 811993040]
data_aux.shape

(30, 14)

In [10]:
data_aux.apply(lambda x: len(x.unique()))

Unnamed: 0      30
id               1
product_name     1
product_type     1
price            1
datetime         1
style_id         1
color_id         1
color           11
Fit              1
Composition      1
cotton           1
polyester        3
spandex          1
dtype: int64
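In the output above, a count of 1 means the column is constant for this `id`, so only columns with more than one distinct value can tell its rows apart. The sketch below lists such columns on a small hand-made frame; `sample` is illustrative data shaped like a few columns of `data_aux`, not the real dataset.

```python
import pandas as pd

# Hand-made stand-in for a few columns of data_aux (not the real data).
sample = pd.DataFrame({
    "id": [811993040] * 4,
    "price": [29.99] * 4,
    "color": ["black", "denim_blue", "cream", "denim_blue"],
    "polyester": [0.21, 0.21, 0.21, 0.2],
})

# Same per-column distinct count as in the cell above.
distinct_counts = sample.apply(lambda x: len(x.unique()))

# Columns that vary within this id are the candidates for explaining the row count.
varying_columns = distinct_counts[distinct_counts > 1].index.tolist()

print(varying_columns)  # ['color', 'polyester']
```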