# Python DS-ao-Dev

## Business Problem

**Star Jeans Company**

- Eduardo and Marcelo are two Brazilians, friends and business partners. After several successful ventures, they are planning to enter the US fashion market with an e-commerce business model.

- The initial idea is to enter the market with a single product for a specific audience: jeans for men. The goal is to keep operating costs low and scale as they acquire customers.

- However, even with the entry product and the audience defined, the two partners have no experience in the fashion market, so they do not know how to define basics such as price, the type of pants, and the material used to make each piece.

- So the two partners hired a Data Science consultancy to answer the following questions:
    1. What is the best selling price for the pants?
    2. How many types of pants, and in which colors, should the initial product have?
    3. Which raw materials are needed to make the pants?
    
- Star Jeans' main competitors are H&M and Macy's, both of which sell in the US market.

## Solution Planning (Input-Output-Tasks)

**Business Question**

- What is the best price for the jeans?

1. Input:
    1. Data sources
        - H&M website: https://www2.hm.com/en_us/men/products/jeans.html
        - Macy's website: https://www.macys.com/shop/mens-clothing/mens-jeans
    2. Tools
        - Python 3.8.0
        - Web scraping libraries ( BS4, Selenium )
        - PyCharm
        - Jupyter Notebook ( analysis and prototyping )
        - Cron job, Airflow
        - Streamlit
    
2. Output:
    1. The answer to the question.
        - Median of the competitors' prices.
    2. Delivery format
        - Table or chart
    3. Delivery location
        - Streamlit app
    
3. Tasks:
    1. Step by step to build the median or mean calculation
        - Compute the median price by product, type, and color
    2. Define the delivery format ( visualization, table, sentence )
        - Bar chart with the median product price by type and color over the last 30 days
        - Table with the following columns: id | product_name | product_type | product_color | produ
        - Schema definition: columns and their types
        - Definition of the storage infrastructure ( SQLITE3 )
        - ETL design ( extraction, transformation, and load scripts )
        - Scheduling plan for the scripts ( dependencies between the scripts )
        - Build the visualizations
        - Deliver the final product
    3. Decide the delivery location ( Power BI, Telegram, email, Streamlit, intranet )
        - Streamlit app
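
The storage and median tasks above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the project's actual ETL: the table name `showroom`, the columns `product_price` and `scrapy_datetime` (the column list in the plan is cut off after `product_color`), and every row are hypothetical.

```python
import sqlite3
from collections import defaultdict
from statistics import median

# Hypothetical schema: only id, product_name, product_type and product_color
# come from the plan; product_price and scrapy_datetime are assumptions.
SCHEMA = """
CREATE TABLE showroom (
    id              INTEGER PRIMARY KEY,
    product_name    TEXT,
    product_type    TEXT,
    product_color   TEXT,
    product_price   REAL,
    scrapy_datetime TEXT
)
"""

# Made-up rows for illustration only; these are not scraped values.
ROWS = [
    ("slim_jeans", "slim_fit", "black", 19.99, "2021-12-13 13:18:24"),
    ("slim_jeans", "slim_fit", "black", 29.99, "2021-12-14 13:18:24"),
    ("slim_jeans", "slim_fit", "black", 39.99, "2021-12-15 13:18:24"),
    ("skinny_jeans", "skinny_fit", "white", 39.99, "2021-12-13 13:18:24"),
    ("skinny_jeans", "skinny_fit", "white", 49.99, "2021-12-14 13:18:24"),
]


def median_price_by_type_and_color(conn):
    """Return {(product_type, product_color): median price}."""
    prices = defaultdict(list)
    query = "SELECT product_type, product_color, product_price FROM showroom"
    for product_type, product_color, price in conn.execute(query):
        prices[(product_type, product_color)].append(price)
    return {key: median(values) for key, values in prices.items()}


conn = sqlite3.connect(":memory:")  # in-memory database, nothing is written to disk
conn.execute(SCHEMA)
conn.executemany(
    "INSERT INTO showroom (product_name, product_type, product_color,"
    " product_price, scrapy_datetime) VALUES (?, ?, ?, ?, ?)",
    ROWS,
)
medians = median_price_by_type_and_color(conn)
conn.close()

for (product_type, product_color), value in medians.items():
    print(product_type, product_color, round(value, 2))
```

The median is computed in Python because SQLite has no built-in median aggregate; grouping by `(product_type, product_color)` mirrors the granularity named in the task list.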

## Business Models

“How you plan to make money”, Michael Lewis

“A business model describes the rationale of how an organization creates, delivers, and captures value”, Alexander Osterwalder

- E-commerce:
    1. Revenue: Sales of a product.
    2. Examples: Lojas Riachuelo, Submarino, Magazine Luiza, etc.
        
- Software as a Service ( SaaS ):
    1. Revenue: Monthly or annual subscription, for usage or per user.
    2. Examples: Looker, Asana, Gmail, Salesforce.
    
- Service:
    1. Revenue: Services billed by time or by project.
    2. Examples: Sul América, Porto Seguro, Mapfre.
    
- Mobile App:
    1. Revenue: Sale of upgrades.
    2. Examples: Wildlife, Ubisoft, mobile games.
    
- Media Site:
    1. Revenue: Charging per click or per view of a given ad.
    2. Examples: Facebook, Google, UOL, G1, etc.
    
- Marketplace:
    1. Revenue: Fee on each transaction between the two sides of the market (for example, passenger and driver).
    2. Examples: Uber, Ifood, 99, Elo7, Submarino.

## E-commerce Metrics

- **Growth Metrics**:
    1. Market share percentage
    2. Number of new customers
- **Revenue Metrics**:
    1. Number of sales
    2. Average ticket
    3. LTV ( Lifetime Value )
    4. Average recency
    5. Average basket size
    6. Average markup
- **Cost Metrics**:
    1. CAC ( Customer Acquisition Cost )
    2. Average discount
    3. Production cost
    4. Return rate
    5. Fixed costs ( payroll, office, software )
    6. Taxes
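
As a worked example, three of the metrics above can be computed with simple arithmetic. The definitions used here are one common convention, and every figure is hypothetical; none of them come from the Star Jeans case.

```python
# All figures below are hypothetical and only illustrate the arithmetic.
revenue = 12_000.00          # total sales revenue in the period (USD)
orders = 300                 # number of orders in the period
marketing_spend = 1_500.00   # customer acquisition spend in the period (USD)
new_customers = 100          # customers acquired in the period
unit_price = 40.00           # selling price of one pair of jeans (USD)
unit_cost = 25.00            # production cost of one pair of jeans (USD)

average_ticket = revenue / orders               # revenue per order
cac = marketing_spend / new_customers           # cost to acquire one customer
markup = (unit_price - unit_cost) / unit_cost   # margin over production cost

print(f"average ticket: {average_ticket:.2f}")
print(f"CAC: {cac:.2f}")
print(f"markup: {markup:.0%}")
```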

In [None]:
from IPython.display import Image
Image(filename='/home/marxcerqueira/repos/Data-Science-Projects/pa005_insiders_clustering/pa005_marx_cerqueira/reports/figures/mapa_metricas_e_commerce.png')

# Imports

In [1]:
import re
import numpy as np
import pandas as pd

In [2]:
pwd

'/home/marxcerqueira/repos/python-ds-ao-dev'

# Loading Data (Web Scraping)

In [3]:
data = pd.read_csv( '/home/marxcerqueira/repos/python-ds-ao-dev/products_hm.csv' )

In [4]:
data.isna().sum()

Unnamed: 0             0
product_id             0
product_category       0
product_name           0
product_price          0
scrapy_datetime        0
style_id               0
color_id               0
color_name             1
Fit                    1
Composition            1
Size                 611
Product safety      1548
dtype: int64

In [6]:
#drop unnecessary column
data = data.drop('Unnamed: 0', axis = 1)

#rename columns
data = data.rename(columns = {'Fit': 'fit', 'Composition': 'composition','Size': 'size' ,'Product safety': 'product_safety'} )

In [7]:
data.head()

Unnamed: 0,product_id,product_category,product_name,product_price,scrapy_datetime,style_id,color_id,color_name,fit,composition,size,product_safety
0,985197001,men_jeans_slim,Slim Jeans,$ 19.99,2021-12-13 13:18:24,985197,1,Midnight blue,Slim fit,Pocket lining: Cotton 100%,"The model is 189cm/6'2"" and wears a size 32/32",
1,985197001,men_jeans_slim,Slim Jeans,$ 19.99,2021-12-13 13:18:24,985197,1,Midnight blue,Slim fit,"Shell: Cotton 98%, Spandex 2%","The model is 189cm/6'2"" and wears a size 32/32",
2,985197001,men_jeans_slim,Slim Jeans,$ 19.99,2021-12-13 13:18:24,985197,1,Denim blue,Slim fit,Pocket lining: Cotton 100%,"The model is 189cm/6'2"" and wears a size 32/32",
3,985197001,men_jeans_slim,Slim Jeans,$ 19.99,2021-12-13 13:18:24,985197,1,Denim blue,Slim fit,"Shell: Cotton 98%, Spandex 2%","The model is 189cm/6'2"" and wears a size 32/32",
4,985197001,men_jeans_slim,Slim Jeans,$ 19.99,2021-12-13 13:18:24,985197,1,Dark denim blue,Slim fit,Pocket lining: Cotton 100%,"The model is 189cm/6'2"" and wears a size 32/32",


In [8]:
data['color_name'].isna().sum()

1

In [9]:
aux = data[data['color_name'].isna()]
aux.head()

Unnamed: 0,product_id,product_category,product_name,product_price,scrapy_datetime,style_id,color_id,color_name,fit,composition,size,product_safety
963,1028865001,men_jeans_relaxed,Relaxed Jeans with Embroidery Detail,$ 49.99,2021-12-13 13:18:24,1028865,1,,,,,


In [25]:
data.dtypes

product_id           int64
product_category    object
product_name        object
product_price       object
scrapy_datetime     object
style_id             int64
color_id             int64
color_name          object
fit                 object
composition         object
size                object
product_safety      object
dtype: object
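
The dtypes above show `product_price` and `scrapy_datetime` stored as plain strings. The sketch below shows the same type coercion on a single record in plain Python, independent of the DataFrame; `coerce_row` is a hypothetical helper, and the sample values follow the format shown in `data.head()`.

```python
from datetime import datetime


def coerce_row(row):
    """Return a copy of a scraped row with price and timestamp typed."""
    typed = dict(row)  # copy, so the raw record is left untouched
    # '$ 19.99' -> 19.99
    typed["product_price"] = float(row["product_price"].replace("$", "").strip())
    # '2021-12-13 13:18:24' -> datetime object
    typed["scrapy_datetime"] = datetime.strptime(
        row["scrapy_datetime"], "%Y-%m-%d %H:%M:%S"
    )
    return typed


raw = {
    "product_id": 985197001,
    "product_price": "$ 19.99",
    "scrapy_datetime": "2021-12-13 13:18:24",
}
typed = coerce_row(raw)
print(typed["product_price"], typed["scrapy_datetime"].year)
```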

In [10]:
data.isnull().sum()

product_id             0
product_category       0
product_name           0
product_price          0
scrapy_datetime        0
style_id               0
color_id               0
color_name             1
fit                    1
composition            1
size                 611
product_safety      1548
dtype: int64

In [11]:
data['composition'].unique()

array(['Pocket lining: Cotton 100%', 'Shell: Cotton 98%, Spandex 2%',
       'Shell: Cotton 99%, Spandex 1%',
       'Pocket lining: Polyester 65%, Cotton 35%',
       'Cotton 98%, Spandex 2%', 'Lining: Polyester 100%',
       'Cotton 99%, Spandex 1%',
       'Pocket lining: Polyester 63%, Cotton 37%',
       'Cotton 89%, Polyester 10%, Spandex 1%',
       'Cotton 93%, Polyester 6%, Spandex 1%', 'Cotton 100%', nan,
       'Shell: Cotton 90%, Elasterell-P 8%, Spandex 2%',
       'Cotton 90%, Elasterell-P 8%, Spandex 2%',
       'Cotton 79%, Polyester 20%, Spandex 1%',
       'Cotton 77%, Polyester 21%, Spandex 2%'], dtype=object)
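
Some of the values above carry a garment-part prefix (`Pocket lining:`, `Shell:`, `Lining:`) and some do not. A small sketch of separating the prefix from the fiber list, in plain Python; `split_part` is a hypothetical helper, not part of the notebook.

```python
import re

# 'Pocket lining: Cotton 100%' -> part 'Pocket lining', fibers 'Cotton 100%'
# 'Cotton 98%, Spandex 2%'     -> part None (no prefix), fibers unchanged
PART_PREFIX = re.compile(r"^\s*([^:]+):\s*(.*)$")


def split_part(text):
    """Split an optional 'Part:' prefix from a composition string."""
    match = PART_PREFIX.match(text)
    if match:
        return match.group(1), match.group(2)
    return None, text


print(split_part("Pocket lining: Polyester 65%, Cotton 35%"))
print(split_part("Cotton 98%, Spandex 2%"))
```

Entries whose part is `None` are the ones that survive a filter that drops every prefixed row.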

In [12]:
#load data
data = pd.read_csv( '/home/marxcerqueira/repos/python-ds-ao-dev/products_hm.csv' )
#drop unnecessary column
data = data.drop('Unnamed: 0', axis = 1)

#rename columns
data = data.rename(columns = {'Fit': 'fit', 'Composition': 'composition','Size': 'size' ,'Product safety': 'product_safety'} )

#####______#####
#product name
data['product_name'] = data['product_name'].apply(lambda x: x.replace(' ', '_').lower())

#product price
data['product_price'] = data['product_price'].apply(lambda x: x.replace('$', '')).astype(float)

#scrapy datetime
data['scrapy_datetime'] = pd.to_datetime(data['scrapy_datetime'], format = '%Y-%m-%d %H:%M:%S')

#style_id
#color_id
#color name
data['color_name'] = data['color_name'].apply(lambda x: x.replace(' ', '_').replace('/', '_').lower() if pd.notnull(x) else x)

#fit
data['fit'] = data['fit'].apply(lambda x: x.replace(' ', '_').lower() if pd.notnull(x) else x)

#composition


#====  size  ======
#size number
data['size_number'] = data['size'].apply(lambda x: re.search(r'\d{3}cm', x).group(0) if pd.notnull(x) else x)
data['size_number'] = data['size_number'].apply(lambda x: re.search(r'\d+', x).group(0) if pd.notnull(x) else x) #group(0) returns the whole match

#size model 
data['size_model'] = data['size'].str.extract(r'(\d+/\d+)') #.str applies the regex to every row of the column

#drop the raw size and product safety columns
data = data.drop(columns = ['size', 'product_safety'])

#composition
data = data[~data['composition'].str.contains('Pocket lining:', na = False)] #na = False treats missing values as non-matches, so those rows are kept
data = data[~data['composition'].str.contains('Shell:', na = False)]
data = data[~data['composition'].str.contains('Lining:', na = False)] 

#break composition by comma
df1 = data['composition'].str.split(',', expand = True)

# cotton / polyester / spandex / elasterell
# create an empty dataframe as a reference to organize the wanted columns and
# then concatenate it with the main dataframe; it must have the same length as the 'data' dataframe

df_ref = pd.DataFrame(index = np.arange(len(data)), columns = ['cotton', 'polyester', 'spandex', 'elasterell'])

# #cotton
# df_cotton = df1[0]
# df_cotton.name = 'cotton'

# df_ref = pd.concat([df_ref, df_cotton], axis = 1)
# df_ref = df_ref.iloc[:, ~df_ref.columns.duplicated(keep='last')] #drop duplicated columns, keeping the last occurrence

# #polyester
# df_polyester = df1[1]
# df_polyester.name = 'polyester'

# df_ref = pd.concat([df_ref, df_polyester], axis = 1)
# df_ref = df_ref.iloc[:, ~df_ref.columns.duplicated(keep='last')] #drop duplicated columns, keeping the last occurrence

# #spandex
# df_spandex = df1[1]
# df_spandex.name = 'spandex'

# df_ref = pd.concat([df_ref, df_spandex], axis = 1)
# df_ref = df_ref.iloc[:, ~df_ref.columns.duplicated(keep='last')] #drop duplicated columns, keeping the last occurrence

# #elasterell
# df_spandex2  = df1[2]
# df_spandex2.name = 'spandex'

# df_ref = pd.concat([df_ref, df_spandex2 ], axis = 1)
# df_ref = df_ref.iloc[:, ~df_ref.columns.duplicated(keep='last')] #drop duplicated columns, keeping the last occurrence

# #final join
# data = pd.concat([data,  df_ref], axis = 1)

# # format composition data
# data['cotton']     = data['cotton'].apply(lambda x: int(re.search('\d+', x).group(0))/100 if pd.notnull(x) else x)
# data['polyester']  = data['polyester'].apply( lambda x: int( re.search( '\d+', x).group(0) ) / 100 if pd.notnull( x ) else x )
# data['spandex']   = data['spandex'].apply( lambda x: int( re.search( '\d+', x ).group(0) ) / 100 if pd.notnull( x ) else x )
# data['elasterell'] = data['elasterell'].apply( lambda x: int( re.search('\d+',x ).group(0) ) / 100 if pd.notnull( x ) else x )

data.head(20)

Unnamed: 0,product_id,product_category,product_name,product_price,scrapy_datetime,style_id,color_id,color_name,fit,composition,size_number,size_model
72,690449043,men_jeans_ripped,skinny_jeans,39.99,2021-12-13 13:18:24,690449,43,light_denim_blue_trashed,skinny_fit,"Cotton 98%, Spandex 2%",187.0,32/32
74,690449043,men_jeans_ripped,skinny_jeans,39.99,2021-12-13 13:18:24,690449,43,denim_blue,skinny_fit,"Cotton 98%, Spandex 2%",187.0,32/32
76,690449043,men_jeans_ripped,skinny_jeans,39.99,2021-12-13 13:18:24,690449,43,black_washed,skinny_fit,"Cotton 98%, Spandex 2%",187.0,32/32
78,690449043,men_jeans_ripped,skinny_jeans,39.99,2021-12-13 13:18:24,690449,43,light_denim_blue,skinny_fit,"Cotton 98%, Spandex 2%",187.0,32/32
80,690449043,men_jeans_ripped,skinny_jeans,39.99,2021-12-13 13:18:24,690449,43,black_washed_out,skinny_fit,"Cotton 98%, Spandex 2%",187.0,32/32
82,690449043,men_jeans_ripped,skinny_jeans,39.99,2021-12-13 13:18:24,690449,43,white,skinny_fit,"Cotton 98%, Spandex 2%",187.0,32/32
84,690449043,men_jeans_ripped,skinny_jeans,39.99,2021-12-13 13:18:24,690449,43,black_washed,skinny_fit,"Cotton 98%, Spandex 2%",187.0,32/32
86,690449043,men_jeans_ripped,skinny_jeans,39.99,2021-12-13 13:18:24,690449,43,dark_denim_blue_trashed,skinny_fit,"Cotton 98%, Spandex 2%",187.0,32/32
88,690449043,men_jeans_ripped,skinny_jeans,39.99,2021-12-13 13:18:24,690449,43,black_trashed,skinny_fit,"Cotton 98%, Spandex 2%",187.0,32/32
90,690449043,men_jeans_ripped,skinny_jeans,39.99,2021-12-13 13:18:24,690449,43,dark_blue_trashed,skinny_fit,"Cotton 98%, Spandex 2%",187.0,32/32
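
The size parsing in the cell above calls `.group(0)` directly on the result of `re.search`, which would raise an error for a non-null string that contains no height. A more defensive sketch in plain Python follows; `parse_size` is a hypothetical helper, and the sample sentence is the one shown in the `size` column earlier.

```python
import re

# Patterns for the sentence seen in the `size` column, e.g.
# 'The model is 189cm/6\'2" and wears a size 32/32'.
HEIGHT_CM = re.compile(r"(\d{3})cm")
SIZE_WORN = re.compile(r"size (\d+/\d+)")


def parse_size(text):
    """Return (height_cm, size_model); each is None when not found."""
    if not isinstance(text, str):
        return None, None  # covers NaN/None cells
    height = HEIGHT_CM.search(text)
    worn = SIZE_WORN.search(text)
    return (
        int(height.group(1)) if height else None,
        worn.group(1) if worn else None,
    )


sample = 'The model is 189cm/6\'2" and wears a size 32/32'
print(parse_size(sample))
print(parse_size(float("nan")))
```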


In [82]:
data['composition'].unique()

array(['Cotton 98%, Spandex 2%', 'Cotton 99%, Spandex 1%',
       'Cotton 89%, Polyester 10%, Spandex 1%',
       'Cotton 93%, Polyester 6%, Spandex 1%', 'Cotton 100%', nan,
       'Cotton 90%, Elasterell-P 8%, Spandex 2%',
       'Cotton 79%, Polyester 20%, Spandex 1%',
       'Cotton 77%, Polyester 21%, Spandex 2%'], dtype=object)

In [99]:
df_ref.head() 

Unnamed: 0,cotton,polyester,spandex,elasterell
0,,,,
1,,,,
2,,,,
3,,,,
4,,,,


In [94]:
df1.head()

Unnamed: 0,0,1,2
72,Cotton 98%,Spandex 2%,
74,Cotton 98%,Spandex 2%,
76,Cotton 98%,Spandex 2%,
78,Cotton 98%,Spandex 2%,
80,Cotton 98%,Spandex 2%,


In [98]:
df1[1].unique()

array([' Spandex 2%', ' Spandex 1%', ' Polyester 10%', ' Polyester 6%',
       None, nan, ' Elasterell-P 8%', ' Polyester 20%', ' Polyester 21%'],
      dtype=object)

In [88]:
df1.head()

Unnamed: 0,0,1,2
72,Cotton 98%,Spandex 2%,
74,Cotton 98%,Spandex 2%,
76,Cotton 98%,Spandex 2%,
78,Cotton 98%,Spandex 2%,
80,Cotton 98%,Spandex 2%,


In [63]:
data['composition'].unique()

array(['Cotton 98%, Spandex 2%', 'Cotton 99%, Spandex 1%',
       'Cotton 89%, Polyester 10%, Spandex 1%',
       'Cotton 93%, Polyester 6%, Spandex 1%', 'Cotton 100%', nan,
       'Cotton 90%, Elasterell-P 8%, Spandex 2%',
       'Cotton 79%, Polyester 20%, Spandex 1%',
       'Cotton 77%, Polyester 21%, Spandex 2%'], dtype=object)

In [59]:
df_ref.head()

Unnamed: 0,elastano,cotton,polyester,elastane,elasterell
0,,,,,
1,,,,,
2,,,,,
3,,,,,
4,,,,,


In [51]:
df1[0].unique()

array(['Cotton 98%', 'Cotton 99%', 'Cotton 89%', 'Cotton 93%',
       'Cotton 100%', nan, 'Cotton 90%', 'Cotton 79%', 'Cotton 77%'],
      dtype=object)

In [50]:
df1[2].unique()

array([None, ' Spandex 1%', nan, ' Spandex 2%'], dtype=object)

In [52]:
# cotton / polyester / elastane / elasterell
# create an empty dataframe as a reference to organize the wanted columns and
# then concatenate it with the main dataframe; it must have the same length as the 'data' dataframe

df_ref = pd.DataFrame(index = np.arange(len(data)), columns = ['cotton', 'polyester', 'elastano', 'elasterell'])

#cotton
df_cotton = df1[0]
df_cotton.name = 'cotton'

df_ref = pd.concat([df_ref, df_cotton], axis = 1)
df_ref = df_ref.iloc[:, ~df_ref.columns.duplicated(keep='last')] #drop duplicated columns, keeping the last occurrence

#polyester
df_polyester = df1[0]
df_polyester.name = 'polyester'

df_ref = pd.concat([df_ref, df_polyester], axis = 1)
df_ref = df_ref.iloc[:, ~df_ref.columns.duplicated(keep='last')] #drop duplicated columns, keeping the last occurrence

#elastane
df_elastane = df1[0]
df_elastane.name = 'elastane'

df_ref = pd.concat([df_ref, df_elastane], axis = 1)
df_ref = df_ref.iloc[:, ~df_ref.columns.duplicated(keep='last')] #drop duplicated columns, keeping the last occurrence

#elasterell
df_elasterell = df1[0]
df_elasterell.name = 'elasterell'

df_ref = pd.concat([df_ref, df_elasterell], axis = 1)
df_ref = df_ref.iloc[:, ~df_ref.columns.duplicated(keep='last')] #drop duplicated columns, keeping the last occurrence

#final join
pd.concat([data, df_ref], axis = 1)

# format composition data
data['cotton']     = data['cotton'].apply(lambda x: int(re.search('\d+', x).group(0))/100 if pd.notnull(x) else x)
data['polyester']  = data['polyester'].apply( lambda x: int( re.search( '\d+', x).group(0) ) / 100 if pd.notnull( x ) else x )
data['elastane']   = data['elastane'].apply( lambda x: int( re.search( '\d+', x ).group(0) ) / 100 if pd.notnull( x ) else x )
data['elasterell'] = data['elasterell'].apply( lambda x: int( re.search('\d+',x ).group(0) ) / 100 if pd.notnull( x ) else x )

data.head()

TypeError: concat() got multiple values for argument 'axis'

In [37]:
df_ref.head()

Unnamed: 0,cotton,polyester,elastano,elasterell
0,,,,
1,,,,
2,,,,
3,,,,
4,,,,


In [35]:
df_ref.shape

(436, 4)

In [33]:
df1.head()

Unnamed: 0,0,1,2
72,Cotton 98%,Spandex 2%,
74,Cotton 98%,Spandex 2%,
76,Cotton 98%,Spandex 2%,
78,Cotton 98%,Spandex 2%,
80,Cotton 98%,Spandex 2%,


In [29]:
data['composition'].unique()

array(['Cotton 98%, Spandex 2%', 'Cotton 99%, Spandex 1%',
       'Cotton 89%, Polyester 10%, Spandex 1%',
       'Cotton 93%, Polyester 6%, Spandex 1%', 'Cotton 100%', nan,
       'Cotton 90%, Elasterell-P 8%, Spandex 2%',
       'Cotton 79%, Polyester 20%, Spandex 1%',
       'Cotton 77%, Polyester 21%, Spandex 2%'], dtype=object)

In [30]:
data[['product_id', 'composition']].sample(10) 

Unnamed: 0,product_id,composition
1087,1004476003,"Cotton 90%, Elasterell-P 8%, Spandex 2%"
703,427159006,"Cotton 93%, Polyester 6%, Spandex 1%"
479,811993036,"Cotton 99%, Spandex 1%"
110,690449043,"Cotton 98%, Spandex 2%"
215,690449051,"Cotton 98%, Spandex 2%"
94,690449043,"Cotton 98%, Spandex 2%"
1365,1004476004,"Cotton 90%, Elasterell-P 8%, Spandex 2%"
1098,1004199001,"Cotton 99%, Spandex 1%"
482,811993036,"Cotton 99%, Spandex 1%"
1535,1004199003,"Cotton 99%, Spandex 1%"


In [None]:
df = data.copy()

In [None]:
cols = ['product_id','product_category','product_name','product_price','scrapy_datetime','style_id','color_id','color_name','fit','composition','size', 'product_safety']

In [None]:
# check for NA values
data.isna().sum()

In [None]:
data['product_category'].iloc[471:]

In [None]:
# check product_category NA rows
data[data['product_category'].isna()]

In [None]:
# check unique values in product category
data['product_category'].unique()

In [None]:
data.info()

In [None]:
data['composition'].unique()

In [None]:
data.columns

In [None]:
data = pd.read_csv( '/home/marxcerqueira/repos/python-ds-ao-dev/products_hm.csv' )
data = data.drop(['Unnamed: 0'], axis = 1)
data.columns = cols

# product id
data['product_id'] = data['product_id'].astype(int)

# product category
data = data.dropna(subset = ['product_category'])
# product name
data['product_name'] = data['product_name'].apply(lambda x: x.replace(' ', '_').lower())
 
# product price
data['product_price'] = data['product_price'].apply(lambda x: x.replace('$ ', '')).astype(float)
 
# scrapy datetime
data['scrapy_datetime'] = pd.to_datetime(data['scrapy_datetime'], format = '%Y-%m-%d %H:%M:%S')

# style id

# color id
 
# color name
data['color_name'] = data['color_name'].apply(lambda x: x.replace(' ','_').lower() if pd.notnull(x) else x)

# fit
data['fit'] = data['fit'].apply(lambda x: x.replace(' ', '_').lower() if pd.notnull(x) else x)

# size
data['size_number'] = data['size'].apply( lambda x: re.search( r'\d{3}cm', x ).group(0) if pd.notnull( x ) else x )
data['size_number'] = data['size_number'].apply(lambda x: re.search(r'\d+', x).group(0) if pd.notnull(x) else x)

# size model
data['size_model'] = data['size'].str.extract(r'(\d+/\d+)')

# composition
data = data[~data['composition'].str.contains('Pocket lining:', na = False)]
data = data[~data['composition'].str.contains('Lining:', na = False)]
data = data[~data['composition'].str.contains('Shell:', na = False)]

# drop duplicates
data = data.drop_duplicates( subset = ['product_id', 'product_category', 'product_name', 'product_price',
                                       'scrapy_datetime', 'style_id', 'color_id', 'color_name', 'fit'], keep='last' )

# reset index after dropping a row
data = data.reset_index(drop = True)

#break composition by comma
df1 = data['composition'].str.split(',', expand = True)

# cotton / polyester / elastane / elasterell
df_ref = pd.DataFrame( index = np.arange( len(data) ), columns = ['cotton', 'polyester', 'elastane', 'elasterell'])

# cotton
df_cotton = df1[0]
df_cotton.name = 'cotton'

df_ref = pd.concat([df_ref, df_cotton], axis = 1)
df_ref = df_ref.iloc[:, ~df_ref.columns.duplicated(keep='last')] #drop duplicated columns, keeping the last occurrence
df_ref['cotton'] = df_ref['cotton'].fillna('Cotton 0%') 

# polyester
df_polyester = df1.loc[df1[1].str.contains('Polyester', na=True), 1]
df_polyester.name = 'polyester'

df_ref =pd.concat([df_ref, df_polyester], axis = 1)
df_ref = df_ref.iloc[:, ~df_ref.columns.duplicated(keep='last')]
df_ref['polyester'] = df_ref['polyester'].fillna('Polyester 0%') 


# elastane (the US site labels this fiber "Spandex")
df_elastane = df1.loc[df1[1].str.contains('Elastane|Spandex', na=True), 1]

# combine elastane from both columns 1 and 2
df_elastane = df_elastane.combine_first(df1[2])
df_elastane.name = 'elastane'

df_ref = pd.concat([df_ref, df_elastane], axis = 1)
df_ref = df_ref.iloc[:, ~df_ref.columns.duplicated(keep ='last')]
df_ref['elastane'] = df_ref['elastane'].fillna('Elastane 0%') 

# elasterell
df_elasterell = df1.loc[df1[1].str.contains( 'Elasterell', na=True ), 1]
df_elasterell.name = 'elasterell'

df_ref = pd.concat( [df_ref, df_elasterell], axis=1 )
df_ref = df_ref.iloc[:, ~df_ref.columns.duplicated( keep='last') ]
df_ref['elasterell'] = df_ref['elasterell'].fillna('Elasterell-P 0%') 

# final join
data = pd.concat([data, df_ref], axis = 1)
   
# format composition data
data['cotton'] = data['cotton'].apply(lambda x: int(re.search(r'\d+', x).group(0))/100 if pd.notnull(x) else x)
data['polyester'] = data['polyester'].apply( lambda x: int( re.search( r'\d+', x).group(0) ) / 100 if pd.notnull( x ) else x )
data['elastane'] = data['elastane'].apply( lambda x: int( re.search( r'\d+', x ).group(0) ) / 100 if pd.notnull( x ) else x )
data['elasterell'] = data['elasterell'].apply( lambda x: int( re.search(r'\d+',x ).group(0) ) / 100 if pd.notnull( x ) else x )

# Drop columns
data = data.drop(['size', 'product_safety', 'composition'], axis = 1)

# Drop duplicates
data = data.drop_duplicates()
# data.head()
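
The cell above assigns fibers according to their position after the comma split. As an alternative sketch, not the notebook's method, each fiber can be matched by name, which does not depend on the order in which the site lists them. `parse_composition` and the `FIBERS` tuple are hypothetical; the sample strings are values seen in `data['composition'].unique()`.

```python
import re

# Fibers seen in the scraped compositions; the US site says "Spandex"
# where the notebook's column is named `elastane`.
FIBERS = ("cotton", "polyester", "spandex", "elasterell")
FIBER_SHARE = re.compile(r"([A-Za-z-]+)\s+(\d+)%")


def parse_composition(text):
    """Map each known fiber to its share (0.0 to 1.0), matched by name."""
    shares = dict.fromkeys(FIBERS, 0.0)
    if not isinstance(text, str):
        return shares  # missing composition: every share stays at 0.0
    for name, percent in FIBER_SHARE.findall(text):
        key = name.lower().split("-")[0]  # 'Elasterell-P' -> 'elasterell'
        if key in shares:
            shares[key] = int(percent) / 100
    return shares


print(parse_composition("Cotton 98%, Spandex 2%"))
print(parse_composition("Cotton 90%, Elasterell-P 8%, Spandex 2%"))
```

Fibers absent from a string default to `0.0`, matching the `fillna('... 0%')` convention used in the cell.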

In [None]:
# save new raw dataset after cleaning it
data.to_csv('products_hm_cleaned.csv', index = False) # index = False avoids an extra 'Unnamed: 0' column when the file is read back

In [None]:
data.head()

In [None]:
data.sample(20)

In [None]:
df_polyester.unique()

In [None]:
df1.head()

In [None]:
df_ref.isna().sum()

In [None]:
df_elastane.unique()

In [None]:
df_elastane[df_elastane.isna()]

In [None]:
data.head(20)

In [None]:
data.shape

In [None]:
df_aux = data[data['product_id'] == 720504001].sort_values('color_name')

df_aux.shape

In [None]:
# check the number of unique colors; the granularity will probably be per color.
# There are 9 colors but 30 items, so the rows are probably duplicated
df_aux.apply(lambda x: len(x.unique()))
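
When one product and color appears on several rows, the idea of keeping only the last row per key can be sketched in plain Python. `drop_duplicates_keep_last` is a hypothetical helper and the rows are made up for illustration; the key fields are a subset of those passed to `drop_duplicates` in the cleaning cell.

```python
def drop_duplicates_keep_last(rows, key_fields):
    """Keep the last row seen for each combination of key_fields."""
    last_by_key = {}
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        last_by_key[key] = row  # later rows overwrite earlier ones
    return list(last_by_key.values())


# Made-up rows: the first two share the same product_id and color_name.
rows = [
    {"product_id": 1, "color_name": "black", "composition": "Cotton 100%"},
    {"product_id": 1, "color_name": "black", "composition": "Cotton 98%, Spandex 2%"},
    {"product_id": 1, "color_name": "white", "composition": "Cotton 100%"},
]
unique_rows = drop_duplicates_keep_last(rows, ["product_id", "color_name"])
print(len(unique_rows))
```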

In [None]:
df1.head(50)

In [None]:
df_ref.isna().sum()

In [None]:
df1[0].unique()

In [None]:
df1[1].unique()

In [None]:
# check the positions of NA rows
df1[df1[0].isna()]

In [None]:
df1[0].unique()

In [None]:
data.loc[475, :]

In [None]:
data[data['product_id'] == 753512001].head()