# Python DS-ao-Dev

## Business Problem

**Star Jeans Company**

- Eduardo and Marcelo are two Brazilian friends and business partners. After several successful ventures, they plan to enter the US fashion market with an e-commerce business model.

- The initial idea is to enter the market with a single product for a specific audience: jeans for men. The goal is to keep operating costs low and to scale as they acquire customers.

- However, even with the entry product and the audience defined, the two partners have no experience in the fashion market, so they do not know how to define basics such as price, the type of pants, and the material used to make each piece.

- The two partners therefore hired a Data Science consultancy to answer the following questions:
    1. What is the best selling price for the pants?
    2. How many types of pants, and in which colors, should the initial product include?
    3. Which raw materials are needed to make the pants?
    
- Star Jeans' main competitors in the American market are H&M and Macy's.

## Solution Planning (Input-Output-Tasks)

**Business Question**

- What is the best selling price for the jeans?

1. Input:
    1. Data sources
        - H&M website: https://www2.hm.com/en_us/men/products/jeans.html
        - Macy's website: https://www.macys.com/shop/mens-clothing/mens-jeans
    2. Tools
        - Python 3.8.0
        - Web scraping libraries (BS4, Selenium)
        - PyCharm
        - Jupyter Notebook (analysis and prototyping)
        - Cron job, Airflow
        - Streamlit
    
2. Output:
    1. The answer to the question
        - Median of the competitors' prices.
    2. Delivery format
        - Table or chart
    3. Delivery location
        - Streamlit app
    
3. Tasks:
    1. Step by step to build the median or mean calculation
        - Compute the median by product, type, and color
    2. Define the delivery format (visualization, table, sentence)
        - Bar chart with the median product price by type and color for the last 30 days
        - Table with the following columns: id | product_name | product_type | product_color | produ
        - Schema definition: columns and their types
        - Define the storage infrastructure (SQLite3)
        - ETL design (extraction, transformation, and load scripts)
        - Scheduling plan for the scripts (dependencies between scripts)
        - Build the visualizations
        - Deliver the final product
    3. Decide the delivery location (Power BI, Telegram, email, Streamlit, intranet)
        - Streamlit app
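
The median calculation listed in the tasks can be sketched with a pandas `groupby`. This is a minimal illustration, not the project's final code: the rows are made-up sample values, and the column names follow the table columns listed in the plan.

```python
import pandas as pd

# Hypothetical sample rows; real values would come from the scraped data.
prices = pd.DataFrame(
    {
        "product_type": ["skinny", "skinny", "skinny", "slim", "slim"],
        "product_color": ["black", "black", "blue", "black", "black"],
        "product_price": [39.99, 29.99, 49.99, 19.99, 24.99],
    }
)

# Median price for each (type, color) pair.
median_price = (
    prices.groupby(["product_type", "product_color"], as_index=False)["product_price"]
    .median()
    .rename(columns={"product_price": "median_price"})
)

print(median_price)
```

The resulting frame has one row per type and color, which is the shape a bar chart or a Streamlit table would need.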

## Business Models

“How do you plan to make money?”, Michael Lewis

“A business model describes the rationale of how an organization creates, delivers, and captures value”, Alexander Osterwalder

- E-commerce:
    1. Revenue: Sales of a product.
    2. Examples: Lojas Riachuelo, Submarino, Magazine Luiza, etc.
        
- Software as a Service (SaaS):
    1. Revenue: Monthly or annual subscription, charged for usage or per user.
    2. Examples: Looker, Asana, Gmail, Salesforce.
    
- Service:
    1. Revenue: Services billed by time or by project.
    2. Examples: Sul América, Porto Seguro, Mapfre.
    
- Mobile App:
    1. Revenue: Sale of upgrades.
    2. Examples: Wildlife, Ubisoft, mobile games.
    
- Media Site:
    1. Revenue: Charges per click or per view of a given ad.
    2. Examples: Facebook, Google, UOL, G1, etc.
    
- Marketplace:
    1. Revenue: Fee on the transaction between the two sides, for example passenger and driver.
    2. Examples: Uber, Ifood, 99, Elo7, Submarino.

## E-commerce Metrics

- **Growth Metrics**:
    1. Market share percentage
    2. Number of new customers
- **Revenue Metrics**:
    1. Number of sales
    2. Average ticket
    3. LTV (Lifetime Value)
    4. Average recency
    5. Average basket size
    6. Average markup
- **Cost Metrics**:
    1. CAC (Customer Acquisition Cost)
    2. Average discount
    3. Production cost
    4. Return rate
    5. Fixed costs (payroll, office, software)
    6. Taxes
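
The list above names the metrics without giving formulas. As a small worked example under commonly used definitions (an assumption, since the source does not spell them out), three of them can be computed from aggregate figures. All numbers below are hypothetical.

```python
# Hypothetical monthly figures, for illustration only.
revenue = 12_000.00         # total sales revenue in the period
orders = 300                # number of orders placed
items_sold = 450            # total units across all orders
marketing_spend = 2_500.00  # acquisition spend in the period
new_customers = 100         # customers acquired in the period

average_ticket = revenue / orders          # revenue per order
average_basket_size = items_sold / orders  # items per order
cac = marketing_spend / new_customers      # cost to acquire one customer

print(f"Average ticket: {average_ticket:.2f}")
print(f"Average basket size: {average_basket_size:.2f}")
print(f"CAC: {cac:.2f}")
```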

In [None]:
from IPython.display import Image
Image(filename='/home/marxcerqueira/repos/Data-Science-Projects/pa005_insiders_clustering/pa005_marx_cerqueira/reports/figures/mapa_metricas_e_commerce.png')

# Imports

In [1]:
import pandas as pd
import numpy as np

from datetime import datetime

from bs4 import BeautifulSoup
import requests

# Loading Data (Web Scraping)

## Beautiful Soup - Practice I

In [None]:
# Fields to extract:
# id
# product_name
# product_type
# product_color
# composition
# price

In [17]:
url = 'https://www2.hm.com/en_us/men/products/jeans.html'

headers = {'user-agent': 'my-app/0.0.1'}

page = requests.get(url, headers = headers)

In [18]:
soup = BeautifulSoup(page.text, 'html.parser')

In [19]:
products = soup.find('ul', class_= 'products-listing small')

In [20]:
product_list = products.find_all('article', class_= 'hm-product-item')
len(product_list)

36

In [21]:
# test: product_id of a single item
product_list[4].get('data-articlecode')

'0636207010'

In [None]:
product_list

In [22]:
product_list = products.find_all('article', class_ = 'hm-product-item')

# product id
product_id = [p.get('data-articlecode') for p in product_list]

# product_category
product_category = [p.get('data-category') for p in product_list]

In [23]:
# product name
product_list = products.find_all('a', class_ = 'link')
product_name = [p.get_text() for p in product_list]

In [24]:
#price
product_list = products.find_all('span', class_ = 'price regular')
product_price = [p.get_text() for p in product_list]
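
The cells above collect IDs, names, and prices with separate `find_all` calls, which relies on every list having the same length and order. As a sketch of an alternative, the snippet below reads all fields from each `article` element in one pass, so the values of a row always come from the same product. The HTML string is hypothetical markup that only mimics the class names and attributes used above; the live page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup mimicking the selectors used in the cells above.
html = """
<ul class="products-listing small">
  <li><article class="hm-product-item" data-articlecode="0690449022" data-category="men_jeans_skinny">
    <a class="link" href="#">Skinny Jeans</a>
    <span class="price regular">$ 39.99</span>
  </article></li>
  <li><article class="hm-product-item" data-articlecode="0636207006" data-category="men_jeans_slim">
    <a class="link" href="#">Slim Jeans</a>
    <span class="price regular">$ 19.99</span>
  </article></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
listing = soup.find("ul", class_="products-listing small")
items = listing.find_all("article", class_="hm-product-item")

# One dictionary per product, built from a single <article> element.
rows = [
    {
        "product_id": item.get("data-articlecode"),
        "product_category": item.get("data-category"),
        "product_name": item.find("a", class_="link").get_text(strip=True),
        "product_price": item.find("span", class_="price regular").get_text(strip=True),
    }
    for item in items
]

print(rows)
```

A list of dictionaries like `rows` can be passed directly to `pd.DataFrame(rows)`.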

In [None]:
# product color

In [None]:
# product composition

In [25]:
data = pd.DataFrame([product_id, product_category, product_name, product_price]).T
data.columns = ['product_id', 'product_category', 'product_name', 'product_price']

# scrape timestamp
data['scrapy_datetime'] = datetime.now().strftime('%Y-%m-%d %H:%M:%S')

In [26]:
data.head()

| | product_id | product_category | product_name | product_price | scrapy_datetime |
|---|---|---|---|---|---|
| 0 | 690449022 | men_jeans_skinny | Skinny Jeans | $ 39.99 | 2021-06-01 08:57:37 |
| 1 | 636207006 | men_jeans_slim | Slim Jeans | $ 19.99 | 2021-06-01 08:57:37 |
| 2 | 427159006 | men_jeans_skinny | Trashed Skinny Jeans | $ 39.99 | 2021-06-01 08:57:37 |
| 3 | 730863005 | men_jeans_skinny | Skinny Jeans | $ 29.99 | 2021-06-01 08:57:37 |
| 4 | 636207010 | men_jeans_slim | Slim Jeans | $ 19.99 | 2021-06-01 08:57:37 |


## Beautiful Soup - Practice II

In [None]:
url = 'https://www2.hm.com/en_us/men/products/jeans.html'

headers = {'user-agent': 'my-app/0.0.1'}

page = requests.get(url, headers = headers)

In [None]:
soup = BeautifulSoup(page.text, 'html.parser')

In [None]:
# get all the items
total_item = soup.find_all('h2', class_ = 'load-more-heading')[0].get('data-total')
total_item

In [None]:
# get the number of pages (pagination)
page_number = np.ceil(int(total_item)/36) #np.ceil rounds up the number
page_number

In [None]:
url02 = url + '?page-size=' + str(int(page_number*36))
url02
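
The pagination arithmetic above can be wrapped in a small helper that uses only the standard library. This is a sketch: the function name `build_full_listing_url` and the total of 95 items are hypothetical, and the page size of 36 is taken from the listing length seen earlier.

```python
import math
from urllib.parse import urlencode


def build_full_listing_url(base_url: str, total_items: int, page_size: int = 36) -> str:
    """Return a URL whose page-size covers every item, rounded up to whole pages."""
    pages = math.ceil(total_items / page_size)  # round up so the last partial page is included
    return f"{base_url}?{urlencode({'page-size': pages * page_size})}"


# Hypothetical total of 95 items: 3 pages of 36, so page-size becomes 108.
full_url = build_full_listing_url("https://www2.hm.com/en_us/men/products/jeans.html", 95)
print(full_url)
```

Using `math.ceil` on integers avoids the float result that `np.ceil` returns, so no extra `int()` cast is needed when building the URL.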

## Beautiful Soup - Practice III

### One product

In [27]:
url = 'https://www2.hm.com/en_us/productpage.0636207010.html'

headers = {'user-agent': 'my-app/0.0.1'}

page = requests.get(url, headers = headers)

In [28]:
soup = BeautifulSoup(page.text, 'html.parser')

In [29]:
# color name
product_list = soup.find_all( 'a', class_='filter-option miniature' )
color_name = [p.get( 'data-color' ) for p in product_list]

# product id
product_id = [p.get( 'data-articlecode' ) for p in product_list]

df_color = pd.DataFrame( [product_id, color_name] ).T
df_color.columns = ['product_id', 'color_name']

# generate style id + color id
df_color['style_id'] = df_color['product_id'].apply( lambda x: x[:-3] )
df_color['color_id'] = df_color['product_id'].apply( lambda x: x[-3:] )
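
The slicing above treats the article code as a string: everything but the last three characters is the style, and the last three characters are the color. The sketch below shows the same split as a plain function and why the code should stay a string. The helper name `split_article_code` is hypothetical, and the 10-character width used to restore a code is an assumption based on the example IDs in this document.

```python
from typing import Tuple


def split_article_code(article_code: str) -> Tuple[str, str]:
    """Split an article code into (style_id, color_id) using the last three characters."""
    code = str(article_code)
    return code[:-3], code[-3:]


style_id, color_id = split_article_code("0636207010")
print(style_id, color_id)

# Converting the code to an integer drops the leading zero.
as_int = int("0636207010")
print(as_int)

# Assumption: codes are 10 characters wide, so zfill can restore the original string.
restored = str(as_int).zfill(10)
print(split_article_code(restored))
```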

In [30]:
df_color

| | product_id | color_name | style_id | color_id |
|---|---|---|---|---|
| 0 | 636207001 | Dark denim blue | 636207 | 1 |
| 1 | 636207002 | Dark gray denim | 636207 | 2 |
| 2 | 636207004 | Denim blue | 636207 | 4 |
| 3 | 636207005 | Gray | 636207 | 5 |
| 4 | 636207006 | Black | 636207 | 6 |
| 5 | 636207011 | Midnight blue | 636207 | 11 |
| 6 | 636207014 | Dark gray | 636207 | 14 |
| 7 | 636207015 | Denim blue | 636207 | 15 |
| 8 | 636207017 | White | 636207 | 17 |
| 9 | 636207019 | Pale denim blue | 636207 | 19 |


In [31]:
# composition
product_composition_list = soup.find_all( 'div', class_='pdp-description-list-item' )
product_composition = [list( filter( None, p.get_text().split( '\n' ) ) ) for p in product_composition_list]

# rename dataframe
df_composition = pd.DataFrame(product_composition).T
df_composition.columns = df_composition.iloc[0]

# delete first row and forward-fill missing values
df_composition = df_composition.iloc[1:].ffill()

# generate style id + color id
df_composition['style_id'] = df_composition['Art. No.'].apply(lambda x: x[:-3])
df_composition['color_id'] = df_composition['Art. No.'].apply(lambda x: x[-3:])

# merge data color + composition

data_sku = pd.merge( df_color, df_composition[['style_id', 'Fit', 'Composition']], how = 'left', on = 'style_id')
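
Because the merge key is `style_id` alone, every color row is paired with every composition row of the same style. The sketch below reproduces that effect on a tiny hand-made sample and shows one possible way to keep a single composition row per style before merging. The filter on "lining:" is an assumption made for this illustration, not a step the cell above performs.

```python
import pandas as pd

# Hand-made sample shaped like df_color and df_composition above.
df_color = pd.DataFrame(
    {
        "product_id": ["0636207001", "0636207002"],
        "color_name": ["Dark denim blue", "Dark gray denim"],
        "style_id": ["0636207", "0636207"],
    }
)
df_composition = pd.DataFrame(
    {
        "style_id": ["0636207", "0636207"],
        "Fit": ["Slim fit", "Slim fit"],
        "Composition": ["Cotton 88%, Polyester 10%, Elastane 2%", "Pocket lining: Cotton 100%"],
    }
)

# Left merge on style_id: 2 colors x 2 composition rows.
data_sku = pd.merge(df_color, df_composition, how="left", on="style_id")
print(len(data_sku))

# Keep only the main fabric row, then let pandas check the many-to-one assumption.
main_only = df_composition[~df_composition["Composition"].str.contains("lining:", case=False)]
data_sku_main = pd.merge(df_color, main_only, how="left", on="style_id", validate="many_to_one")
print(len(data_sku_main))
```

The `validate` argument makes `pd.merge` raise an error if the right-hand side has more than one row per key, which surfaces unexpected duplication early.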

In [38]:
df_composition

| | Fit | Composition | Art. No. | style_id | color_id |
|---|---|---|---|---|---|
| 1 | Slim fit | Cotton 88%, Polyester 10%, Elastane 2% | 636207010 | 636207 | 10 |
| 2 | Slim fit | Pocket lining: Cotton 100% | 636207010 | 636207 | 10 |


In [32]:
data_sku.head()

| | product_id | color_name | style_id | color_id | Fit | Composition |
|---|---|---|---|---|---|---|
| 0 | 636207001 | Dark denim blue | 636207 | 1 | Slim fit | Cotton 88%, Polyester 10%, Elastane 2% |
| 1 | 636207001 | Dark denim blue | 636207 | 1 | Slim fit | Pocket lining: Cotton 100% |
| 2 | 636207002 | Dark gray denim | 636207 | 2 | Slim fit | Cotton 88%, Polyester 10%, Elastane 2% |
| 3 | 636207002 | Dark gray denim | 636207 | 2 | Slim fit | Pocket lining: Cotton 100% |
| 4 | 636207004 | Denim blue | 636207 | 4 | Slim fit | Cotton 88%, Polyester 10%, Elastane 2% |


### Multiple products

In [42]:
headers = {'user-agent': 'my-app/0.0.1'}

#empty dataframe
df_details = pd.DataFrame()

# unique columns for all products
aux = []

cols = ['Art. No.', 'Composition', 'Fit', 'Product safety', 'Size']
df_pattern = pd.DataFrame(columns = cols)

In [37]:
df_pattern

Unnamed: 0,Art. No.,Composition,Fit,Product safety,Size


In [49]:
for i in range(len(data)):
    # API Request
    url = 'https://www2.hm.com/en_us/productpage.' + data.loc[i, 'product_id'] + '.html'
    page = requests.get(url, headers = headers)
    
    # Beautiful Soup object
    soup = BeautifulSoup(page.text, 'html.parser')
    
    # ==================== color name =================================
    # color name
    product_list = soup.find_all( 'a', class_='filter-option miniature' )
    color_name = [p.get( 'data-color' ) for p in product_list]
    
    # product id
    product_id = [p.get( 'data-articlecode' ) for p in product_list]
    
    df_color = pd.DataFrame( [product_id, color_name] ).T
    df_color.columns = ['product_id', 'color_name']
    
    # generate style id + color id
    df_color['style_id'] = df_color['product_id'].apply( lambda x: x[:-3] )
    df_color['color_id'] = df_color['product_id'].apply( lambda x: x[-3:] )
    
    # ==================== composition =================================
    product_composition_list = soup.find_all( 'div', class_='pdp-description-list-item' )
    product_composition = [list( filter( None, p.get_text().split( '\n' ) ) ) for p in product_composition_list]
    
    # rename dataframe
    df_composition = pd.DataFrame(product_composition).T
    df_composition.columns = df_composition.iloc[0]
    
    # delete first row and forward-fill missing values
    df_composition = df_composition.iloc[1:].ffill()
    
    # guarantee the same number of columns
    df_composition = pd.concat( [df_pattern, df_composition], axis=0 )
    
    # generate style id + color id
    df_composition['style_id'] = df_composition['Art. No.'].apply(lambda x: x[:-3])
    df_composition['color_id'] = df_composition['Art. No.'].apply(lambda x: x[-3:])
    
    aux = aux + df_composition.columns.tolist()
    
    # merge data color + composition
    
    data_sku = pd.merge( df_color, df_composition[['style_id','Fit','Composition', 'Size','Product safety']], how = 'left', on = 'style_id')
    
    # all details products
    df_details = pd.concat([df_details, data_sku], axis = 0)
    
# Join Showroom data + details
data['style_id'] = data['product_id'].apply( lambda x: x[:-3] )
data['color_id'] = data['product_id'].apply( lambda x: x[-3:] )

data_raw = pd.merge( data, df_details[['style_id','color_name','Fit','Composition', 'Size','Product safety']], how = 'left', on = 'style_id')
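
Inside the loop, concatenating with the empty `df_pattern` makes sure every product ends up with the same detail columns, even when a product page lacks one of them. A possible alternative, sketched below on a hand-made row, is `DataFrame.reindex`, which adds the missing columns as NaN. Note the difference: `reindex` also drops any column not listed in `cols`, whereas the concatenation keeps extra columns.

```python
import pandas as pd

cols = ["Art. No.", "Composition", "Fit", "Product safety", "Size"]

# Hand-made example of a product page that has no "Size" or "Product safety" entry.
df_composition = pd.DataFrame(
    {
        "Fit": ["Slim fit"],
        "Composition": ["Cotton 88%, Polyester 10%, Elastane 2%"],
        "Art. No.": ["0636207010"],
    }
)

# Align to the expected columns; missing ones are filled with NaN.
aligned = df_composition.reindex(columns=cols)
print(aligned)
```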

In [50]:
data_raw

| | product_id | product_category | product_name | product_price | scrapy_datetime | style_id | color_id | color_name | Fit | Composition | Size | Product safety |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0690449022 | men_jeans_skinny | Skinny Jeans | $ 39.99 | 2021-06-01 08:57:37 | 0690449 | 022 | Light denim blue/trashed | Skinny fit | Lining: Polyester 100% | The model is 184cm/6'0" and wears a size 31/32 | |
| 1 | 0690449022 | men_jeans_skinny | Skinny Jeans | $ 39.99 | 2021-06-01 08:57:37 | 0690449 | 022 | Light denim blue/trashed | Skinny fit | Cotton 98%, Elastane 2% | The model is 184cm/6'0" and wears a size 31/32 | |
| 2 | 0690449022 | men_jeans_skinny | Skinny Jeans | $ 39.99 | 2021-06-01 08:57:37 | 0690449 | 022 | Denim blue | Skinny fit | Lining: Polyester 100% | The model is 184cm/6'0" and wears a size 31/32 | |
| 3 | 0690449022 | men_jeans_skinny | Skinny Jeans | $ 39.99 | 2021-06-01 08:57:37 | 0690449 | 022 | Denim blue | Skinny fit | Cotton 98%, Elastane 2% | The model is 184cm/6'0" and wears a size 31/32 | |
| 4 | 0690449022 | men_jeans_skinny | Skinny Jeans | $ 39.99 | 2021-06-01 08:57:37 | 0690449 | 022 | Black/washed | Skinny fit | Lining: Polyester 100% | The model is 184cm/6'0" and wears a size 31/32 | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1856 | 0814620001 | men_jeans_skinny | Freefit® Skinny Jeans | $ 49.99 | 2021-06-01 08:57:37 | 0814620 | 001 | Denim blue | Skinny fit | Cotton 90%, Elasterell-P 8%, Elastane 2% | | |
| 1857 | 0814620001 | men_jeans_skinny | Freefit® Skinny Jeans | $ 49.99 | 2021-06-01 08:57:37 | 0814620 | 001 | Dark blue | Skinny fit | Cotton 90%, Elasterell-P 8%, Elastane 2% | The model is 186cm/6'1" and wears a size 31/30 | |
| 1858 | 0814620001 | men_jeans_skinny | Freefit® Skinny Jeans | $ 49.99 | 2021-06-01 08:57:37 | 0814620 | 001 | Light denim blue | Skinny fit | Cotton 90%, Elasterell-P 8%, Elastane 2% | The model is 186cm/6'1" and wears a size 31/30 | |
| 1859 | 0814620001 | men_jeans_skinny | Freefit® Skinny Jeans | $ 49.99 | 2021-06-01 08:57:37 | 0814620 | 001 | Blue-gray | Skinny fit | Cotton 90%, Elasterell-P 8%, Elastane 2% | The model is 186cm/6'1" and wears a size 31/30 | |


# Loading Data (Web Scraping)

In [1]:
import re
import numpy as np
import pandas as pd

In [None]:
data = pd.read_csv( '/Users/meigarom.lopes/repos/python-ds-ao-dev/Module01/products_hm.csv' )

# product id
data = data.dropna( subset=['product_id'] )
data['product_id'] = data['product_id'].astype( int )

# product name
data['product_name'] = data['product_name'].apply( lambda x: x.replace( ' ', '_' ).lower() )

# product price
data['product_price'] = data['product_price'].apply( lambda x: x.replace( '$ ', '' ) ).astype( float )

# scrapy datetime
data['scrapy_datetime'] = pd.to_datetime( data['scrapy_datetime'], format='%Y-%m-%d %H:%M:%S' )

# style id
data['style_id'] = data['style_id'].astype( int )

# color id
data['color_id'] = data['color_id'].astype( int )

# color name
data['color_name'] = data['color_name'].apply( lambda x: x.replace( ' ', '_' ).replace( '/', '_' ).lower() if pd.notnull( x ) else x )

# fit
data['fit'] = data['fit'].apply( lambda x: x.replace( ' ', '_' ).lower() if pd.notnull( x ) else x )

# size number: model height in cm, e.g. '184cm' -> '184'
data['size_number'] = data['size'].apply( lambda x: re.search( r'\d{3}cm', x ).group(0) if pd.notnull( x ) else x )
data['size_number'] = data['size_number'].apply( lambda x: re.search( r'\d+', x ).group(0) if pd.notnull( x ) else x )

# size model: size worn by the model, e.g. '31/32'
data['size_model'] = data['size'].str.extract( r'(\d+/\d+)' )

# composition: drop lining and shell rows, keep the main fabric
data = data[~data['composition'].str.contains( 'Pocket lining:', na=False )]
data = data[~data['composition'].str.contains( 'Lining:', na=False )]
data = data[~data['composition'].str.contains( 'Shell:', na=False )]

# drop duplicates
data = data.drop_duplicates( subset=['product_id', 'product_category', 'product_name', 'product_price',
                                     'scrapy_datetime', 'style_id', 'color_id', 'color_name', 'fit'], keep='last' )

# reset index
data = data.reset_index( drop=True )

# break composition by comma
df1 = data['composition'].str.split( ',', expand=True )

# cotton | polyester | elastane | elasterell
df_ref = pd.DataFrame( index=np.arange( len( data ) ), columns=['cotton', 'polyester', 'elastane', 'elasterell'] )

# cotton
df_cotton = df1[0]
df_cotton.name = 'cotton'
df_ref = pd.concat( [df_ref, df_cotton], axis=1 )
df_ref = df_ref.iloc[:, ~df_ref.columns.duplicated( keep='last' )]
df_ref['cotton'] = df_ref['cotton'].fillna( 'Cotton 0%' )

# polyester
df_polyester = df1.loc[df1[1].str.contains( 'Polyester', na=True ), 1]
df_polyester.name = 'polyester'
df_ref = pd.concat( [df_ref, df_polyester], axis=1 )
df_ref = df_ref.iloc[:, ~df_ref.columns.duplicated( keep='last' )]
df_ref['polyester'] = df_ref['polyester'].fillna( 'Polyester 0%' )

# elastane
df_elastane = df1.loc[df1[1].str.contains( 'Elastane', na=True ), 1]
df_elastane.name = 'elastane'

# combine elastane from both columns 1 and 2
df_elastane = df_elastane.combine_first( df1[2] )
df_ref = pd.concat( [df_ref, df_elastane], axis=1 )
df_ref = df_ref.iloc[:, ~df_ref.columns.duplicated( keep='last' )]
df_ref['elastane'] = df_ref['elastane'].fillna( 'Elastane 0%' )

# elasterell
df_elasterell = df1.loc[df1[1].str.contains( 'Elasterell', na=True ), 1]
df_elasterell.name = 'elasterell'
df_ref = pd.concat( [df_ref, df_elasterell], axis=1 )
df_ref = df_ref.iloc[:, ~df_ref.columns.duplicated( keep='last' )]
df_ref['elasterell'] = df_ref['elasterell'].fillna( 'Elasterell-P 0%' )

# final join
data = pd.concat( [data, df_ref], axis=1 )

# format composition data: 'Cotton 88%' -> 0.88
data['cotton'] = data['cotton'].apply( lambda x: int( re.search( r'\d+', x ).group(0) ) / 100 if pd.notnull( x ) else x )
data['polyester'] = data['polyester'].apply( lambda x: int( re.search( r'\d+', x ).group(0) ) / 100 if pd.notnull( x ) else x )
data['elastane'] = data['elastane'].apply( lambda x: int( re.search( r'\d+', x ).group(0) ) / 100 if pd.notnull( x ) else x )
data['elasterell'] = data['elasterell'].apply( lambda x: int( re.search( r'\d+', x ).group(0) ) / 100 if pd.notnull( x ) else x )

# drop columns
data = data.drop( columns=['size', 'product safety', 'composition'] )

# drop duplicates
data = data.drop_duplicates()
data.shape
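
The composition step above splits the text by comma and then locates each fiber by column position. As a compact alternative sketch, not the method used in the cell above, a single regular expression can read every "name percent" pair from one composition string. The helper name `parse_composition` is hypothetical, and the example assumes lining and shell rows have already been filtered out, as the cleaning code does.

```python
import re
from typing import Dict

# Fibers tracked by the cleaning code above.
FIBERS = ("cotton", "polyester", "elastane", "elasterell")


def parse_composition(text: str) -> Dict[str, float]:
    """Return the share of each tracked fiber as a fraction between 0 and 1."""
    shares = {fiber: 0.0 for fiber in FIBERS}
    for name, pct in re.findall(r"([A-Za-z-]+)\s+(\d+)%", text):
        key = name.lower().split("-")[0]  # "Elasterell-P" -> "elasterell"
        if key in shares:
            shares[key] = int(pct) / 100
    return shares


example = parse_composition("Cotton 90%, Elasterell-P 8%, Elastane 2%")
print(example)
```

With pandas, `data['composition'].apply(parse_composition).apply(pd.Series)` would expand the dictionaries into one column per fiber.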
        