# Python DS-ao-Dev

## Business Problem



**Star Jeans Company**

- Eduardo and Marcelo are two Brazilians, friends and business partners. After several successful ventures, they plan to enter the US fashion market with an e-commerce business model.

- The initial idea is to enter the market with a single product for a specific audience: jeans for men. The goal is to keep operating costs low and scale as they win customers.

- However, even with the entry product and the audience defined, the two partners have no experience in the fashion market, so they cannot define basics such as the price, the type of pants, and the material used to make each piece.

- So the two partners hired a Data Science consultancy to answer the following questions:
    1. What is the best selling price for the pants?
    2. How many types of pants, and in which colors, should the initial product line have?
    3. Which raw materials are needed to make the pants?
    
- The main competitors of Star Jeans in the US market are H&M and Macy's.

## Solution Planning (Input-Output-Tasks)





**Business Question**

- What is the best selling price for the jeans?

1. Input:
    1. Data sources
        - H&M website: https://www2.hm.com/en_us/men/products/jeans.html
        - Macy's website: https://www.macys.com/shop/mens-clothing/mens-jeans
    2. Tools
        - Python 3.8.0
        - Web scraping libraries (BS4, Selenium)
        - PyCharm
        - Jupyter Notebook (analysis and prototyping)
        - Cron job, Airflow
        - Streamlit
    
2. Output:
    1. The answer to the question
        - Median price of the competitors' products.
    2. Delivery format
        - Table or chart
    3. Delivery location
        - Streamlit app
    
3. Tasks:
    1. Step by step to compute the median or mean
        - Compute the median price by product, type, and color
    2. Define the delivery format (visualization, table, sentence)
        - Bar chart with the median product price by type and color over the last 30 days
        - Table with the following columns: id | product_name | product_type | product_color | produ
        - Define the schema: columns and their types
        - Define the storage infrastructure (SQLite3)
        - Design the ETL (extraction, transformation, and load scripts)
        - Plan the script scheduling (dependencies between scripts)
        - Build the visualizations
        - Deliver the final product
    3. Decide the delivery location (Power BI, Telegram, email, Streamlit, intranet)
        - Streamlit app
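
The median calculation listed in the tasks above can be sketched with a pandas `groupby`. This is only an illustration: the rows below are made-up values, and the column names `product_type`, `product_color` and `product_price` are assumptions based on the planned table, not the final schema.

```python
import pandas as pd

# Hypothetical sample of scraped prices (values invented for illustration).
prices = pd.DataFrame(
    {
        "product_type": ["regular", "regular", "regular", "slim", "slim", "slim"],
        "product_color": ["black", "black", "blue", "blue", "blue", "blue"],
        "product_price": [19.99, 29.99, 39.99, 24.99, 29.99, 49.99],
    }
)

# One median per (type, color) pair, returned as a flat table.
median_price = (
    prices.groupby(["product_type", "product_color"], as_index=False)["product_price"]
    .median()
    .rename(columns={"product_price": "median_price"})
)

print(median_price)
```

The median is less sensitive than the mean to a few unusually cheap or expensive items, which is one reason to prefer it for a price benchmark.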


## Business Models



“How you plan to make money”, Michael Lewis

“A business model describes the rationale of how an organization creates, delivers, and captures value”, Alexander Osterwalder

- E-commerce:
    1. Revenue: Product sales.
    2. Examples: Lojas Riachuelo, Submarino, Magazine Luiza, etc.
        
- Software as a Service (SaaS):
    1. Revenue: Monthly or annual subscription, per usage or per user.
    2. Examples: Looker, Asana, Gmail, Salesforce.
    
- Service:
    1. Revenue: Services billed by time or by project.
    2. Examples: Sul América, Porto Seguro, Mapfre.
    
- Mobile App:
    1. Revenue: Sale of upgrades.
    2. Examples: Wildlife, Ubisoft, mobile games.
    
- Media Site:
    1. Revenue: Charging per click or per view of an ad.
    2. Examples: Facebook, Google, UOL, G1, etc.
    
- Marketplace:
    1. Revenue: Fee on each transaction between the two sides, such as passenger and driver.
    2. Examples: Uber, iFood, 99, Elo7, Submarino.


## E-commerce Metrics

- **Growth Metrics**:
    1. Market share percentage
    2. Number of new customers
- **Revenue Metrics**:
    1. Number of sales
    2. Average ticket
    3. LTV (Lifetime Value)
    4. Average recency
    5. Average basket size
    6. Average markup
- **Cost Metrics**:
    1. CAC (Customer Acquisition Cost)
    2. Average discount
    3. Production cost
    4. Return rate
    5. Fixed costs (payroll, office, software)
    6. Taxes
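
A few of the metrics above reduce to simple ratios. The sketch below uses common textbook definitions (average ticket as revenue per order, CAC as acquisition spend per new customer, markup as margin over cost). Companies often define these differently, and every number here is invented for illustration.

```python
def average_ticket(revenue: float, orders: int) -> float:
    """Revenue divided by the number of orders."""
    return revenue / orders


def cac(acquisition_spend: float, new_customers: int) -> float:
    """Acquisition spend divided by the number of new customers."""
    return acquisition_spend / new_customers


def markup(unit_price: float, unit_cost: float) -> float:
    """Margin expressed as a fraction of the unit cost."""
    return (unit_price - unit_cost) / unit_cost


# Made-up monthly figures.
ticket = average_ticket(revenue=50_000.0, orders=1_000)
acquisition_cost = cac(acquisition_spend=8_000.0, new_customers=400)
jeans_markup = markup(unit_price=40.0, unit_cost=25.0)

print(ticket, acquisition_cost, jeans_markup)
```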

# Imports

In [5]:
import pandas as pd
import numpy as np
import seaborn               as sns
import re

from datetime import datetime

from bs4 import BeautifulSoup
import requests
import matplotlib.pyplot     as plt

from IPython.display         import HTML, Image, display

In [2]:
# helper function
def jupyter_settings():
    %matplotlib inline
    %pylab inline
    
    plt.style.use( 'bmh' )
    plt.rcParams['figure.figsize'] = [25, 12]
    plt.rcParams['font.size'] = 24
    
    display( HTML( '<style>.container { width:100% !important; }</style>') )
    pd.options.display.max_columns = None
    pd.options.display.max_rows = None
    pd.set_option( 'display.expand_frame_repr', False )
    
    sns.set()

In [3]:
jupyter_settings()

Populating the interactive namespace from numpy and matplotlib


`%matplotlib` prevents importing * from pylab and numpy
  warn("pylab import has clobbered these variables: %s"  % clobbered +


# Data Gathering

## Showcase

In [6]:
# parameters
headers = {'user-agent': 'my-app/0.0.1'}

# URL
url = 'https://www2.hm.com/en_us/men/products/jeans.html'

#request to URL
page = requests.get(url, headers = headers)

# BeautifulSoup object
soup = BeautifulSoup(page.text, 'html.parser')

# ============== Product Data ===============
#website showcase
products = soup.find('ul', class_= 'products-listing small') # find returns a single element (there is only one UL, the showcase); find_all returns a list

#list comprehension to get all products id and products category from the first page of 
product_list = products.find_all('article', class_ = 'hm-product-item')

# product id
product_id = [p.get('data-articlecode') for p in product_list]

# product_category
product_category = [p.get('data-category') for p in product_list]

# product name
product_list = products.find_all('a', class_ = 'link')
product_name = [p.get_text() for p in product_list]

#price
product_list = soup.find_all('span', class_ = 'price regular')
product_price = [p.get_text() for p in product_list]

data = pd.DataFrame([product_id, product_category, product_name, product_price]).T
data.columns = ['product_id', 'product_category', 'product_name', 'product_price']

# scrape timestamp
data['scrapy_datetime'] = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
data.head()

| | product_id | product_category | product_name | product_price | scrapy_datetime |
|---|---|---|---|---|---|
| 0 | 1008549001 | men_jeans_regular | Regular Jeans | $ 19.99 | 2022-02-14 23:13:21 |
| 1 | 811993040 | men_jeans_regular | Regular Jeans | $ 29.99 | 2022-02-14 23:13:21 |
| 2 | 1013317006 | men_jeans_regular | Hybrid Regular Tapered Joggers | $ 39.99 | 2022-02-14 23:13:21 |
| 3 | 875105016 | men_jeans_relaxed | Relaxed Jeans | $ 29.99 | 2022-02-14 23:13:21 |
| 4 | 875105018 | men_jeans_relaxed | Relaxed Jeans | $ 29.99 | 2022-02-14 23:13:21 |
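
The difference between `find` and `find_all` used in the showcase cell can be tried offline on a small HTML string. The markup below is a hypothetical, simplified imitation of the class names used above, not the real H&M page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the class names used in the scraping cell.
html = """
<ul class="products-listing small">
  <li><article class="hm-product-item" data-articlecode="1008549001" data-category="men_jeans_regular">
    <a class="link" href="#">Regular Jeans</a>
    <span class="price regular">$ 19.99</span>
  </article></li>
  <li><article class="hm-product-item" data-articlecode="0875105016" data-category="men_jeans_relaxed">
    <a class="link" href="#">Relaxed Jeans</a>
    <span class="price regular">$ 29.99</span>
  </article></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# find returns the first matching tag (or None when nothing matches).
showcase = soup.find("ul", class_="products-listing small")

# find_all returns a list-like collection of every matching tag.
items = showcase.find_all("article", class_="hm-product-item")

ids = [item.get("data-articlecode") for item in items]
names = [a.get_text(strip=True) for a in showcase.find_all("a", class_="link")]

print(ids, names)
```

Because `find` can return `None`, a production script would probably check the result before calling `find_all` on it, since the page layout may change.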


In [7]:
data.shape

(36, 5)
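
The planning section names SQLite3 as the storage layer. A minimal sketch of the load step, using only the standard library, might look like the following. The table name `showcase`, the column types and the sample rows are assumptions for illustration, not the schema the project settled on.

```python
import sqlite3

# Hypothetical rows shaped like the showcase table above.
rows = [
    ("1008549001", "men_jeans_regular", "Regular Jeans", 19.99, "2022-02-14 23:13:21"),
    ("0811993040", "men_jeans_regular", "Regular Jeans", 29.99, "2022-02-14 23:13:21"),
]

# ':memory:' keeps the example self-contained; a file path would persist the data.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS showcase (
        product_id       TEXT,
        product_category TEXT,
        product_name     TEXT,
        product_price    REAL,
        scrapy_datetime  TEXT
    )
    """
)

# Parameterized insert: one tuple per row.
conn.executemany("INSERT INTO showcase VALUES (?, ?, ?, ?, ?)", rows)
conn.commit()

stored = conn.execute(
    "SELECT product_id, product_price FROM showcase ORDER BY product_price"
).fetchall()
conn.close()

print(stored)
```

Storing `product_id` as `TEXT` keeps leading zeros such as `0811993040`, which the product page URLs rely on. With a DataFrame at hand, `DataFrame.to_sql` is an alternative to the explicit `executemany`.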

## By products

In [19]:
# empty  dataframe
df_compositions = pd.DataFrame()

# unique columns for all products
aux = []

cols = ['Art. No.', 'Composition', 'Fit', 'Product safety', 'Size', 'More sustainable materials']
df_pattern = pd.DataFrame(columns = cols)

for i in range(len(data)):
    # API Request
    url = 'https://www2.hm.com/en_us/productpage.' + data.loc[i, 'product_id'] + '.html'

    print('Product: {}'.format(url))
    
    page = requests.get(url, headers = headers)
    
    # Beautiful Soup object
    soup = BeautifulSoup(page.text, 'html.parser')
    
    # ==================== color name =================================
    product_list = soup.find_all( 'a', class_='filter-option miniature active' ) + soup.find_all( 'a', class_='filter-option miniature' )
    
    # color name
    color_name = [p.get( 'data-color' ) for p in product_list]
    
    # product id
    product_id = [p.get( 'data-articlecode' ) for p in product_list]
    
    df_color = pd.DataFrame( [product_id, color_name] ).T
    df_color.columns = ['product_id', 'color_name']
    
    for j in range(len(df_color)): #go through all colors and collect each composition 
        # API Request
        url = 'https://www2.hm.com/en_us/productpage.' + df_color.loc[j, 'product_id'] + '.html'
    
        print('Color: {}'.format(url))

        page = requests.get(url, headers = headers)

        # Beautiful Soup object
        soup = BeautifulSoup(page.text, 'html.parser')
        
        # ============ Product Name =============
        product_name = soup.find_all('h1', class_ = 'primary product-item-headline')
#       product_name = [ p.get_text() for p in product_name]
        product_name = product_name[0].get_text()
    
        # ============ Product Price =============
        product_price = soup.find_all('div', class_ = 'primary-row product-item-price')
        # escape the dot so that it only matches a decimal point
        product_price = re.findall(r'\d+\.?\d+', product_price[0].get_text())[0]     
        
#         df_product_name_price = pd.DataFrame([product_name, product_price]).T
        
        # =================== composition =====================
        product_composition_list = soup.find_all( 'div', class_='pdp-description-list-item' )
        product_composition = [list( filter( None, p.get_text().split( '\n' ) ) ) for p in product_composition_list]

        # create composition dataframe
        df_composition = pd.DataFrame(product_composition).T
        df_composition.columns = df_composition.iloc[0]

        # delete first row and forward-fill missing values
        df_composition = df_composition.iloc[1:].ffill()

        # remove the 'Pocket lining: ', 'Shell: ' and 'Lining: ' prefixes (literal matches)
        df_composition['Composition'] = df_composition['Composition'].str.replace('Pocket lining: ', '', regex = False)
        df_composition['Composition'] = df_composition['Composition'].str.replace('Shell: ', '', regex = False)
        df_composition['Composition'] = df_composition['Composition'].str.replace('Lining: ', '', regex = False)

        # guarantee the same number of columns
        df_composition = pd.concat( [df_pattern, df_composition], axis=0 )

        #rename columns
        df_composition.columns = ['product_id','composition','fit','product_safety', 'size', 'sustainable_materials']

        #create columns product name and product price
        df_composition['product_name'] = product_name
        df_composition['product_price'] = product_price
        
        #keep new columns if it shows up
        aux = aux + df_composition.columns.tolist() #to guarantee we have all columns of composition unique values

        # merge data color + composition
        df_composition = pd.merge( df_composition, df_color, how = 'left', on = 'product_id')

        # all products
        df_compositions = pd.concat([df_compositions, df_composition], axis = 0)
    
# split product_id into style_id (all but the last three characters) and color_id (last three characters)
df_compositions['style_id'] = df_compositions['product_id'].apply( lambda x: x[:-3] )
df_compositions['color_id'] = df_compositions['product_id'].apply( lambda x: x[-3:] )

# scrape timestamp
df_compositions['scrapy_datetime'] = datetime.now().strftime('%Y-%m-%d %H:%M:%S')



Product: https://www2.hm.com/en_us/productpage.1008549001.html
Color: https://www2.hm.com/en_us/productpage.1008549001.html
Color: https://www2.hm.com/en_us/productpage.1008549002.html
Color: https://www2.hm.com/en_us/productpage.1008549004.html
Color: https://www2.hm.com/en_us/productpage.1008549006.html
Color: https://www2.hm.com/en_us/productpage.1008549008.html
Product: https://www2.hm.com/en_us/productpage.0811993040.html
Color: https://www2.hm.com/en_us/productpage.0811993040.html
Color: https://www2.hm.com/en_us/productpage.0811993001.html
Color: https://www2.hm.com/en_us/productpage.0811993002.html
Color: https://www2.hm.com/en_us/productpage.0811993003.html
Color: https://www2.hm.com/en_us/productpage.0811993006.html
Color: https://www2.hm.com/en_us/productpage.0811993007.html
Color: https://www2.hm.com/en_us/productpage.0811993021.html
Color: https://www2.hm.com/en_us/productpage.0811993022.html
Color: https://www2.hm.com/en_us/productpage.0811993024.html
Color: https://www2.

Color: https://www2.hm.com/en_us/productpage.0690449040.html
Color: https://www2.hm.com/en_us/productpage.0690449046.html
Color: https://www2.hm.com/en_us/productpage.0690449051.html
Color: https://www2.hm.com/en_us/productpage.0690449056.html
Product: https://www2.hm.com/en_us/productpage.1008549006.html
Color: https://www2.hm.com/en_us/productpage.1008549006.html
Color: https://www2.hm.com/en_us/productpage.1008549001.html
Color: https://www2.hm.com/en_us/productpage.1008549002.html
Color: https://www2.hm.com/en_us/productpage.1008549004.html
Color: https://www2.hm.com/en_us/productpage.1008549008.html
Product: https://www2.hm.com/en_us/productpage.0690449056.html
Color: https://www2.hm.com/en_us/productpage.0690449056.html
Color: https://www2.hm.com/en_us/productpage.0690449001.html
Color: https://www2.hm.com/en_us/productpage.0690449002.html
Color: https://www2.hm.com/en_us/productpage.0690449006.html
Color: https://www2.hm.com/en_us/productpage.0690449007.html
Color: https://www2.

Product: https://www2.hm.com/en_us/productpage.0690449051.html
Color: https://www2.hm.com/en_us/productpage.0690449051.html
Color: https://www2.hm.com/en_us/productpage.0690449001.html
Color: https://www2.hm.com/en_us/productpage.0690449002.html
Color: https://www2.hm.com/en_us/productpage.0690449006.html
Color: https://www2.hm.com/en_us/productpage.0690449007.html
Color: https://www2.hm.com/en_us/productpage.0690449009.html
Color: https://www2.hm.com/en_us/productpage.0690449011.html
Color: https://www2.hm.com/en_us/productpage.0690449013.html
Color: https://www2.hm.com/en_us/productpage.0690449021.html
Color: https://www2.hm.com/en_us/productpage.0690449022.html
Color: https://www2.hm.com/en_us/productpage.0690449024.html
Color: https://www2.hm.com/en_us/productpage.0690449028.html
Color: https://www2.hm.com/en_us/productpage.0690449035.html
Color: https://www2.hm.com/en_us/productpage.0690449036.html
Color: https://www2.hm.com/en_us/productpage.0690449040.html
Color: https://www2.hm
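
The price extraction inside the loop relies on a regular expression. The sketch below is a hypothetical helper, not part of the notebook, that shows one way to make it more tolerant: the decimal part is optional, and a missing price returns `None` instead of raising an `IndexError`.

```python
import re

# Integer part, then an optional decimal part with an escaped dot.
PRICE_PATTERN = re.compile(r"\d+(?:\.\d+)?")


def extract_price(text):
    """Return the first number found in `text` as a float, or None."""
    match = PRICE_PATTERN.search(text)
    return float(match.group(0)) if match else None


print(extract_price("\n      $ 19.99\n"))
print(extract_price("$ 5"))
print(extract_price("sold out"))

# For comparison, a pattern that requires at least two digits misses "$ 5".
print(re.findall(r"\d+.?\d+", "$ 5"))
```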

In [None]:
len(df_compositions['product_id'].unique())

In [None]:
df_product_category = data[['product_id', 'product_category']].drop_duplicates()
df_product_category

In [None]:
df_compositions.head(15)

In [None]:
df_color.head(25)

In [None]:
df_composition.head()

In [None]:
aux

In [24]:
df_compositions.head()

| | product_id | composition | fit | product_safety | size | sustainable_materials | product_name | product_price | color_name | style_id | color_id | scrapy_datetime |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1008549001 | Cotton 98%, Spandex 2% | Regular fit | | | Recycled cotton 20% | `\n\t\t\t\t\t\t\t Regular Jeans` | 19.99 | Light denim blue | 1008549 | 1 | 2022-02-14 23:36:13 |
| 1 | 1008549001 | Polyester 65%, Cotton 35% | Regular fit | | | Recycled cotton 20% | `\n\t\t\t\t\t\t\t Regular Jeans` | 19.99 | Light denim blue | 1008549 | 1 | 2022-02-14 23:36:13 |
| 0 | 1008549002 | Polyester 65%, Cotton 35% | Regular fit | | The model is 185cm/6'1" and wears a size 31/32 | Recycled cotton 20% | `\n\t\t\t\t\t\t\t Regular Jeans` | 19.99 | Denim blue | 1008549 | 2 | 2022-02-14 23:36:13 |
| 1 | 1008549002 | Cotton 99%, Spandex 1% | Regular fit | | The model is 185cm/6'1" and wears a size 31/32 | Recycled cotton 20% | `\n\t\t\t\t\t\t\t Regular Jeans` | 19.99 | Denim blue | 1008549 | 2 | 2022-02-14 23:36:13 |
| 0 | 1008549004 | Cotton 99%, Spandex 1% | Regular fit | | The model is 180cm/5'11" and wears a size 31/32 | Recycled cotton 20% | `\n\t\t\t\t\t\t\t Regular Jeans` | 19.99 | Dark blue | 1008549 | 4 | 2022-02-14 23:36:13 |


In [18]:
df_compositions.columns

Index(['product_id', 'composition', 'fit', 'product_safety', 'size',
       'sustainable_materials', 'product_name', 'product_price', 'color_name',
       'style_id', 'color_id', 'scrapy_datetime'],
      dtype='object')
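
The `style_id` and `color_id` columns come from slicing `product_id`. A vectorized sketch with the `.str` accessor is shown below, assuming the ids are stored as strings. The three ids are taken from the URLs printed by the scraping loop.

```python
import pandas as pd

# Product ids as strings, so leading zeros are preserved.
product_id = pd.Series(["1008549001", "0811993040", "0690449056"], name="product_id")

# Everything except the last three characters identifies the style.
style_id = product_id.str[:-3]

# The last three characters identify the color.
color_id = product_id.str[-3:]

print(pd.DataFrame({"product_id": product_id, "style_id": style_id, "color_id": color_id}))
```

If the ids were parsed as integers, the leading zeros would be lost and the slices would no longer line up, so keeping the column as text is the safer choice.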

In [38]:
df_data.head()

| | product_id | composition | fit | product_safety | size | sustainable_materials | product_name | product_price | color_name | style_id | color_id | scrapy_datetime |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1008549001 | Cotton 98%, Spandex 2% | Regular fit | | | Recycled cotton 20% | regular jeans | 19.99 | Light denim blue | 1008549 | 1 | 2022-02-14 23:36:13 |
| 1 | 1008549001 | Polyester 65%, Cotton 35% | Regular fit | | | Recycled cotton 20% | regular jeans | 19.99 | Light denim blue | 1008549 | 1 | 2022-02-14 23:36:13 |
| 0 | 1008549002 | Polyester 65%, Cotton 35% | Regular fit | | The model is 185cm/6'1" and wears a size 31/32 | Recycled cotton 20% | regular jeans | 19.99 | Denim blue | 1008549 | 2 | 2022-02-14 23:36:13 |
| 1 | 1008549002 | Cotton 99%, Spandex 1% | Regular fit | | The model is 185cm/6'1" and wears a size 31/32 | Recycled cotton 20% | regular jeans | 19.99 | Denim blue | 1008549 | 2 | 2022-02-14 23:36:13 |
| 0 | 1008549004 | Cotton 99%, Spandex 1% | Regular fit | | The model is 180cm/5'11" and wears a size 31/32 | Recycled cotton 20% | regular jeans | 19.99 | Dark blue | 1008549 | 4 | 2022-02-14 23:36:13 |
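
The `size` column mixes the model's height and the size worn in one sentence. One possible way to split it, assuming the sentences always follow the pattern seen above, is `Series.str.extract`. The names `size_number` and `size_model` follow the commented-out cleaning code in this notebook.

```python
import pandas as pd

# Sample values copied from the table above, plus a missing entry.
size = pd.Series(
    [
        "The model is 185cm/6'1\" and wears a size 31/32",
        "The model is 180cm/5'11\" and wears a size 31/32",
        None,
    ],
    name="size",
)

# Three digits followed by 'cm' give the model height in centimeters.
size_number = size.str.extract(r"(\d{3})cm", expand=False).astype("float")

# Digits, a slash, digits: the size the model wears (waist/length).
size_model = size.str.extract(r"(\d+/\d+)", expand=False)

print(pd.DataFrame({"size_number": size_number, "size_model": size_model}))
```

Missing sentences stay missing in both derived columns, so no explicit null check is needed.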


In [45]:
df_data['product_name'].unique()


array(['  regular jeans', '  hybrid regular tapered joggers',
       '  relaxed jeans', '  loose jeans', '  slim jeans',
       '  skinny jeans', '  skinny cropped jeans',
       '  trashed skinny jeans'], dtype=object)
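
The names above still carry leading spaces, because only `\n` and `\t` are removed during cleaning. A possible alternative, sketched here on made-up raw strings, is to strip all surrounding whitespace in one step and then normalize the inner spaces.

```python
import pandas as pd

# Raw names shaped like the scraped values (second value is hypothetical).
raw = pd.Series(["\n\t\t\t\t\t\t\t Regular Jeans", "\n\t\t Trashed Skinny Jeans  "])

clean = (
    raw.str.strip()                           # drop newlines, tabs and spaces at both ends
    .str.lower()                              # lowercase for consistent grouping
    .str.replace(r"\s+", "_", regex=True)     # collapse inner whitespace to underscores
)

print(clean.tolist())
```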

## Data Cleaning

In [37]:
# product id: drop rows without an id; copy to avoid SettingWithCopyWarning
df_data = df_compositions.dropna(subset = ['product_id']).copy()

# product name: remove newlines and tabs, then lowercase

df_data['product_name'] = df_data['product_name'].apply(lambda x: x.replace('\n', '').lower())
df_data['product_name'] = df_data['product_name'].apply(lambda x: x.replace('\t', '').lower())
# df_data['product_name'] = df_data['product_name'].apply(lambda x: x.replace(' ', '_').lower())

# #product price
# df_data['product_price'] = df_compositions['product_price'].apply(lambda x: x.replace('$', '')).astype(float)

#color name
# df_data['color_name'] = df_compositions['color_name'].apply(lambda x: x.replace(' ', '_').replace('/', '_').lower() if pd.notnull(x) else x)

# #fit
# df_data['fit'] = df_data['fit'].apply(lambda x: x.replace(' ', '_').lower() if pd.notnull(x) else x)

# #composition


# #====  size  ======
# #size number
# df_data['size_number'] = df_data['size'].apply(lambda x: re.search('\d{3}cm', x).group(0) if pd.notnull(x) else x)
# df_data['size_number'] = df_data['size_number'].apply(lambda x: re.search('\d+', x).group(0) if pd.notnull(x) else x) #group(0) locates the whole match expression

# #size model 
# df_data['size_model'] = df_data['size'].str.extract('(\d+/\\d+)') # .str applies the method element-wise; extract is only available through the .str accessor

# # #product safety


# # ====================  composition =====================
# #break composition by comma
# df1 = df_data['composition'].str.split(',', expand = True)#.reset_index(drop=True)

# # # cotton / polyester / spandex / elasterell
# # # create an empty reference dataframe to organize the wanted columns and
# # # then concatenate it with the main dataframe; it must have the same length as the 'data' dataframe

# # df_ref = pd.DataFrame(index = np.arange(len(data)), columns = ['cotton', 'polyester', 'spandex', 'elasterell'])

# # # #cotton
# # df_cotton = df1[0]
# # df_cotton.name = 'cotton'

# # # # df_ref = pd.merge(df_ref, df_cotton, left_index=True, right_index=True)

# # df_ref = pd.concat([df_ref, df_cotton], axis = 1)
# # df_ref = df_ref.iloc[:, ~df_ref.columns.duplicated( keep = 'last')] #drop duplicated columns, keeping the last occurrence
# # df_ref['cotton'] = df_ref['cotton'].fillna('Cotton 0%')

# # # # #polyester
# # df_polyester = df1.loc[df1[1].str.contains('Polyester', na = True), 1]
# # df_polyester.name = 'polyester'

# # df_ref = pd.concat([df_ref, df_polyester], axis = 1)
# # df_ref = df_ref.iloc[:, ~df_ref.columns.duplicated(keep='last')]
# # df_ref['polyester'] = df_ref['polyester'].fillna('polyester 0%')


# # # # #spandex
# # df_spandex = df1.loc[df1[1].str.contains('Spandex', na = True), 1]
# # df_spandex.name = 'spandex'

# # # df_spandex2 = df1.loc[df1[2].str.contains('Spandex', na = True), 2]
# # # df_spandex2.name = 'spandex'

# # #other solution
# # df_spandex = df_spandex.combine_first(df1[2])

# # df_ref = pd.concat([df_ref, df_spandex], axis = 1)
# # # df_ref = pd.concat([df_ref, df_spandex2], axis = 1)
# # df_ref = df_ref.iloc[:, ~df_ref.columns.duplicated(keep='last')]
# # df_ref['spandex'] = df_ref['spandex'].fillna('spandex 0%')

# # # # #elasterell
# # df_elasterell = df1.loc[df1[1].str.contains('Elasterell', na = True), 1]
# # df_elasterell.name = 'elasterell'

# # df_ref = pd.concat([df_ref, df_elasterell], axis = 1)
# # df_ref = df_ref.iloc[:, ~df_ref.columns.duplicated(keep='last')]
# # df_ref['elasterell'] = df_ref['elasterell'].fillna('elasterell 0%')

# # # # #final join
# # data = pd.concat([data, df_ref], axis = 1)

# # # #format composition data
# # data['cotton'] = data['cotton'].apply(lambda x: int(re.search('\d+', x).group(0))/100 if pd.notnull(x) else x)
# # data['polyester'] = data['polyester'].apply(lambda x: int(re.search('\d+', x).group(0))/100 if pd.notnull(x) else x)
# # data['spandex'] = data['spandex'].apply(lambda x: int(re.search('\d+', x).group(0))/100 if pd.notnull(x) else x)
# # data['elasterell'] = data['elasterell'].apply(lambda x: int(re.search('\d+', x).group(0))/100 if pd.notnull(x) else x)


# # #drop columns
# # data = data.drop(columns = ['size', 'product_safety','composition'], axis = 1)

# # #drop duplicates

# # data = data.drop_duplicates()

# # data.head(25)
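
The commented-out cell above sketches a split of `composition` into one column per material. A more compact alternative, offered only as a sketch and not as the notebook's method, extracts each percentage with a regular expression. The helper `material_share` is a hypothetical name, and the sample strings are copied from the tables earlier in this notebook.

```python
import re

import pandas as pd

composition = pd.Series(
    [
        "Cotton 98%, Spandex 2%",
        "Polyester 65%, Cotton 35%",
        "Cotton 99%, Spandex 1%",
        None,
    ],
    name="composition",
)

MATERIALS = ["cotton", "polyester", "spandex", "elasterell"]


def material_share(series, material):
    """Share of `material` as a fraction; 0.0 if absent, NaN if composition is missing."""
    pct = series.str.extract(rf"{material}\s+(\d+)%", flags=re.IGNORECASE, expand=False)
    share = pct.astype("float") / 100
    # A material that is not listed counts as 0; a missing composition stays NaN.
    return share.fillna(0.0).where(series.notna())


shares = pd.DataFrame({m: material_share(composition, m) for m in MATERIALS})
print(shares)
```

Unlike the positional approach, this does not depend on the order in which materials appear in the string, so "Polyester 65%, Cotton 35%" and "Cotton 35%, Polyester 65%" would produce the same row.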