# <center> The Business Problem

Ana and Júlia are two Brazilians, friends and business partners. After several
successful ventures, they are planning to enter the US fashion market with an
e-commerce business model.

The initial idea is to enter the market with a single product for a specific audience: in this case,
jeans for men. The goal is to keep operating costs low and to scale as they
win customers.

However, even with the entry product and the audience defined, the two partners have no experience
in the fashion market and therefore cannot yet define basics such as the price, the type of pants, and
the material used to manufacture each piece.

The two partners therefore hired a Data Science consultancy to answer the following
questions:

1. What is the best selling price for the pants?

2. How many types of pants, and in which colors, should the initial product include?

3. Which raw materials are needed to manufacture the pants?

Start Jeans's main competitors in the US market are H&M and Macy's.

##  <center> HTML Data Extraction

In [1]:
import math
import requests

import pandas    as pd
import numpy     as np

from datetime   import  datetime
from bs4        import  BeautifulSoup

pd.set_option('display.max_rows', None)

In [2]:
url = 'https://www2.hm.com/en_us/men/products/jeans.html?page-size=108'

headers = {'User-Agent': 'Mozilla/5.0'}

page = requests.get (url, headers = headers)

soup = BeautifulSoup(page.text, 'html.parser')

products = soup.find('ul', class_ = 'products-listing small')

product_list = products.find_all( 'article', class_ = 'hm-product-item')
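
The listing-page parsing in this cell can be tried offline on a small inline snippet. The sketch below assumes markup shaped like the selectors used in the notebook (a `ul` with class `products-listing small` that contains `article` elements with class `hm-product-item`). The HTML string, the second article code, and the `extract_articles` helper are hypothetical, and the live page may differ or change. The helper also guards against `soup.find` returning `None`, which happens when the container is not present in the response.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mirrors the selectors used in the notebook.
SAMPLE_HTML = """
<ul class="products-listing small">
  <li><article class="hm-product-item" data-articlecode="985197001" data-category="men_jeans_slim"></article></li>
  <li><article class="hm-product-item" data-articlecode="690449022" data-category="men_jeans_ripped"></article></li>
</ul>
"""

def extract_articles(html):
    """Return (article code, category) pairs, or [] if the listing is absent."""
    soup = BeautifulSoup(html, 'html.parser')
    products = soup.find('ul', class_='products-listing small')
    if products is None:  # layout changed or the request was blocked
        return []
    items = products.find_all('article', class_='hm-product-item')
    return [(a.get('data-articlecode'), a.get('data-category')) for a in items]

pairs = extract_articles(SAMPLE_HTML)
empty = extract_articles('<html><body>no listing</body></html>')
```

Returning an empty list instead of raising keeps the later `DataFrame` construction from failing with an `AttributeError` on `None`.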

In [3]:
#id
product_id = [p.get ('data-articlecode') for p in product_list]

#category
product_cat = [p.get ('data-category') for p in product_list]

#name
product_list = products.find_all ('a', class_='link')
product_name = [p.get_text() for p in product_list]

#price
product_list = products.find_all ('span', class_ = 'price regular')
product_price = [p.get_text() for p in product_list]
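
The prices collected here are plain text (for example `$ 19.99`), so they need converting before any price analysis. A minimal sketch, assuming US-style formatting with `.` as the decimal separator and `,` as the thousands separator; `parse_price` is a hypothetical helper, not part of the notebook, and the second and third sample strings are made up:

```python
import re

def parse_price(text):
    """Convert a scraped price such as '$ 19.99' to a float; None if no number is found."""
    match = re.search(r'\d+(?:\.\d+)?', text.replace(',', ''))
    return float(match.group()) if match else None

prices = [parse_price(p) for p in ['$ 19.99', '\n$ 1,299.00\n', 'Sold out']]
```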

In [4]:
data = pd.DataFrame ([product_id, product_cat, product_name, product_price ]).T
data.columns = ['product_id', 'product_cat', 'product_name', 'product_price']

#scrape datetime (%M = minutes, %S = seconds)

data ['scrapy_datetime'] = datetime.now ().strftime ('%Y-%m-%d %H:%M:%S')
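
In `strftime`, `%M` is the minute and `%S` is the second, so a pattern containing `%H:%S` silently drops the minutes. A minimal sketch with a fixed, made-up timestamp:

```python
from datetime import datetime

# Fixed, made-up timestamp so the result is reproducible.
ts = datetime(2021, 12, 9, 11, 42, 35)

hour_second = ts.strftime('%Y-%m-%d %H:%S')   # hour and seconds only: minutes are lost
full = ts.strftime('%Y-%m-%d %H:%M:%S')       # hour, minutes and seconds
```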

# <center> Practice III and IV

In [9]:
headers = {'User-Agent': 'Mozilla/5.0'}

#empty dataframe

df_details = pd.DataFrame()

#unique columns for all products
aux = []

#all cols
cols = ['Art. No.', 'Composition', 'Fit', 'More sustainable materials', 'Size']
df_pattern = pd.DataFrame (columns = cols)

for i in range (len (data)):

    # API Requests

    url = 'https://www2.hm.com/en_us/productpage.' + data.loc[i, 'product_id'] + '.html'

    page = requests.get (url, headers = headers)

    #BeautifulSoup

    soup = BeautifulSoup(page.text, 'html.parser')
 
    ##################################### color name #######################################
    
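    product_list = soup.find_all( 'a', class_='filter-option miniature' )
    color_name = [p.get( 'data-color' ) for p in product_list]

    # product id of each color variant on this page (assumes the swatch links carry 'data-articlecode');
    # the listing-level product_id list has a different length and would misalign the colors
    color_product_id = [p.get( 'data-articlecode' ) for p in product_list]

    df_color = pd.DataFrame( [color_product_id, color_name] ).T
    df_color.columns = ['product_id', 'color_name']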
    
    # generate style id + color id
    df_color['style_id'] = df_color['product_id'].apply( lambda x: x[:-3] )
    df_color['color_id'] = df_color['product_id'].apply( lambda x: x[-3:] )
    
    ###################################### composition #######################################

    product_composition_list = soup.find_all('div', class_ = 'pdp-description-list-item')

    product_composition = [list(filter(None, p.get_text().split('\n'))) for p in product_composition_list]

    # Rename Dataframe

    df_composition = pd.DataFrame (product_composition).T
    df_composition.columns = df_composition.iloc[0]

    #delete first row and forward-fill missing values

    df_composition = df_composition.iloc[1:].ffill()
    
    #guarantee the same number of columns
    
    df_composition = pd.concat([df_pattern, df_composition], axis = 0)

    #generate style id + color id
    df_composition['style_id'] = df_composition['Art. No.'].apply( lambda x: x[:-3])
    df_composition['color_id'] = df_composition['Art. No.'].apply( lambda x: x[-3:])
    
    aux = aux + df_composition.columns.tolist()

    ######################################### Merge #######################################

    data_sku = pd.merge (df_color, df_composition[['style_id', 'Fit', 'Composition', 'More sustainable materials', 'Size']], how = 'left', on = 'style_id' )
    
    df_details = pd.concat([df_details, data_sku], axis = 0)

# build the final table once, after the loop has collected every product's details
data['style_id'] = data['product_id'].apply( lambda x: x[:-3] )

data['color_id'] = data['product_id'].apply( lambda x: x[-3:] )

data_raw = pd.merge( data, df_details[['style_id', 'color_name', 'Fit','Composition', 'Size', 'More sustainable materials']],how='left', on='style_id' )
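
The `style_id` and `color_id` columns come from slicing the article code: everything except the last three characters is the style, and the last three characters are the color. A minimal sketch of that split, using a hypothetical `split_article_code` helper. It keeps both parts as strings, on the assumption that leading zeros in the color code matter:

```python
def split_article_code(code):
    """Split an article code into (style_id, color_id); the last 3 characters are the color."""
    code = str(code)
    return code[:-3], code[-3:]

style_id, color_id = split_article_code('985197001')

# Converting to int drops the leading zeros, so '001' and 1 no longer match as join keys.
as_int = int(color_id)
```

Keeping the IDs as strings (for example by passing `dtype=str` when reading the data back from a CSV file) avoids that mismatch.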


In [10]:
data_raw.head()

| | product_id | product_cat | product_name | product_price | scrapy_datetime | style_id | color_id | color_name | Fit | Composition | Size | More sustainable materials |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 985197001 | men_jeans_slim | Slim Jeans | $ 19.99 | 2021-12-09 11:35 | 985197 | 1 | Midnight blue | Slim fit | Pocket lining: Cotton 100% | The model is 189cm/6'2" and wears a size 32/32 | |
| 1 | 985197001 | men_jeans_slim | Slim Jeans | $ 19.99 | 2021-12-09 11:35 | 985197 | 1 | Midnight blue | Slim fit | Shell: Cotton 98%, Spandex 2% | The model is 189cm/6'2" and wears a size 32/32 | |
| 2 | 985197001 | men_jeans_slim | Slim Jeans | $ 19.99 | 2021-12-09 11:35 | 985197 | 1 | | Slim fit | Pocket lining: Cotton 100% | The model is 189cm/6'2" and wears a size 32/32 | |
| 3 | 985197001 | men_jeans_slim | Slim Jeans | $ 19.99 | 2021-12-09 11:35 | 985197 | 1 | | Slim fit | Shell: Cotton 98%, Spandex 2% | The model is 189cm/6'2" and wears a size 32/32 | |
| 4 | 985197001 | men_jeans_slim | Slim Jeans | $ 19.99 | 2021-12-09 11:35 | 985197 | 1 | | Slim fit | Pocket lining: Cotton 100% | The model is 189cm/6'2" and wears a size 32/32 | |
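
The same product appears on several rows in this output. One likely contributor is the final merge on `style_id` alone: every listing row is paired with every detail row that shares the style, whatever its color. The sketch below reproduces the effect on made-up miniature versions of `data` and `df_details` (the second color and its name are invented for illustration) and compares it with a merge on both `style_id` and `color_id`. It is an illustration of the join behavior, not a confirmed diagnosis of the notebook's output.

```python
import pandas as pd

# Made-up miniature versions of `data` and `df_details`.
data = pd.DataFrame({
    'product_id': ['985197001', '985197002'],
    'style_id': ['985197', '985197'],
    'color_id': ['001', '002'],
})
df_details = pd.DataFrame({
    'style_id': ['985197'] * 4,
    'color_id': ['001', '001', '002', '002'],
    'color_name': ['Midnight blue', 'Midnight blue', 'Black', 'Black'],
    'Composition': ['Pocket lining: Cotton 100%', 'Shell: Cotton 98%, Spandex 2%'] * 2,
})

# Joining on style_id alone pairs each listing row with all 4 detail rows of the style.
by_style = pd.merge(data, df_details, how='left', on='style_id')

# Joining on both keys keeps only the detail rows of the matching color.
by_sku = pd.merge(data, df_details.drop_duplicates(), how='left',
                  on=['style_id', 'color_id'])

n_by_style, n_by_sku = len(by_style), len(by_sku)
```

With two listing rows and four detail rows, the style-only join yields eight rows, while the two-key join yields four (one per color and composition line).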
