# Libraries

In [1]:
import pandas as pd
import numpy as np
import requests
import re

from bs4 import BeautifulSoup
from datetime import datetime

# 1. Project Plan

In this section, we planned how to answer the questions asked by the partners using the *SAPE* method. *SAPE* was created by a Brazilian data scientist, Meigarom Lopes, to organize the strategies for solving business problems. The closest English equivalent of *SAPE* is *OPI*: output, process and input.

#### 1. What is the best jeans sale price?

Output

1. How to answer the question.

- Median price of the products on the competitors' websites.

2. Format

- Table or chart.

3. Means of delivery.

- Streamlit App.

Process

1. Steps to calculate the answer.

- Price median per category, type and color.

2. What we will use to create the table and the chart.

- Simulation using Google Sheets.

3. What the final product will look like.

- A dashboard built as a Streamlit app and published to Heroku (cloud environment).

Input

1. H&M: https://www2.hm.com/en_us/men/products/jeans.html

2. Macys: https://www.macys.com/shop/mens-clothing/mens-jeans
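
The process step in this plan (price median per type and color) could be computed with a pandas `groupby`. This is only a minimal sketch: the rows below are hypothetical placeholders rather than scraped data, and it assumes a cleaned table with the columns `product_type`, `color` and a numeric `price`.

```python
import pandas as pd

# Hypothetical sample rows; real values would come from the scraped competitor data.
sample = pd.DataFrame({
    'product_type': ['men_jeans_regular', 'men_jeans_regular',
                     'men_jeans_regular', 'men_jeans_slim'],
    'color': ['Denim blue', 'Denim blue', 'Black', 'Black'],
    'price': [19.99, 29.99, 24.99, 19.99],
})

# Median price for each (product_type, color) pair.
median_price = (
    sample.groupby(['product_type', 'color'], as_index=False)['price']
    .median()
    .rename(columns={'price': 'median_price'})
)
print(median_price)
```

The resulting table has one row per type and color combination, which is the shape a dashboard table or chart would need.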

#### 2. How many different types of jeans and colors should we choose?

This question can be answered with the same plan used for the previous question.

#### 3. What raw materials should we choose to make the jeans?  

We can answer this question with the same procedure as in the first plan, additionally collecting the composition of each product from the websites.

# 2. Data Collection 

In this section, we collected the attributes defined in the previous section from the websites of the Star Jeans competitors: **H&M** and **Macys**.

## 2.1. H&M

### 2.1.1. Collection of id, product name, product type/category, price and datetime. 

All attributes collected in this section come from the men's jeans showcase page on the H&M website, except datetime, which records when the data was scraped.

In [2]:
def get_webpage(url01, headers):
    page = requests.get(url01, headers=headers)
    soup = BeautifulSoup(page.text, 'html.parser')

    return soup
    
def collect_data(soup):    
    products = soup.find('ul', 'products-listing small')
    
    product_id_category = products.find_all('article', 'hm-product-item')
    product_name = products.find_all('a', 'link')
    product_price = products.find_all('span', 'price regular')
    
    return product_id_category, product_name, product_price

def get_id(product_id_category):
    product_id = [p.get('data-articlecode') for p in product_id_category]
    
    return product_id

def get_category(product_id_category):
    product_category = [p.get('data-category') for p in product_id_category]
    
    return product_category

def get_name(product_name):
    product_name = [p.get_text() for p in product_name]
    
    return product_name

def get_price(product_price):
    product_price = [p.get_text() for p in product_price]

    return product_price

def create_dataframe(product_id, product_name, product_category, product_price):
    data = pd.DataFrame([product_id, product_name, product_category, product_price]).T
    data.columns = ['id', 'product_name', 'product_type', 'price']
    
    return data

def set_datetime(data):
    # Record the scrape timestamp on every row
    data['datetime'] = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    
    return data

if __name__ == '__main__':
    url01 = "https://www2.hm.com/en_us/men/products/jeans.html?sort=stock&image-size=small&image=model&offset=0&page-size=72"
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
    
    soup = get_webpage(url01, headers)
    
    product_id_category, product_name, product_price = collect_data(soup)
    
    product_id = get_id(product_id_category)
    
    product_category = get_category(product_id_category)
    
    product_name = get_name(product_name)
    
    product_price = get_price(product_price)
    
    data = create_dataframe(product_id, product_name, product_category, product_price)
    
    # Pass the dataframe explicitly instead of relying on a global variable
    data = set_datetime(data)

In [3]:
data

Unnamed: 0,id,product_name,product_type,price,datetime
0,1008549001,Regular Jeans,men_jeans_regular,$ 19.99,2022-01-27 15:28:38
1,1008549004,Regular Jeans,men_jeans_regular,$ 19.99,2022-01-27 15:28:38
2,0875105018,Relaxed Jeans,men_jeans_relaxed,$ 29.99,2022-01-27 15:28:38
3,0979945001,Loose Jeans,men_jeans_loose,$ 29.99,2022-01-27 15:28:38
4,0875105016,Relaxed Jeans,men_jeans_relaxed,$ 29.99,2022-01-27 15:28:38
...,...,...,...,...,...
66,0974202002,Regular Denim Joggers,men_jeans_loose,$ 29.99,2022-01-27 15:28:38
67,0985197004,Slim Jeans,men_jeans_slim,$ 19.99,2022-01-27 15:28:38
68,0993887002,Hybrid Regular Denim Joggers,men_jeans_regular,$ 44.99,2022-01-27 15:28:38
69,0927964002,Regular Tapered Crop Jeans,men_jeans_regular,$ 19.99,2022-01-27 15:28:38
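
The `price` column in this table is still text (for example `$ 19.99`), so it has to be converted to a number before any median can be computed. A minimal sketch of that conversion with the standard `re` module follows; the helper name `parse_price` is hypothetical, and treating a comma as a thousands separator is an assumption about the price format.

```python
import re

def parse_price(text):
    """Convert a price label such as '$ 19.99' to a float; return None if no number is found."""
    # Assumption: a comma, if present, is a thousands separator.
    match = re.search(r'\d+(?:\.\d+)?', text.replace(',', ''))
    return float(match.group()) if match else None

prices = [parse_price(p) for p in ['$ 19.99', '$ 29.99', '$ 44.99']]
print(prices)
```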


### 2.1.2. Collection of color, fit, composition, more sustainable materials and size

These attributes cannot be scraped from the showcase because they are listed only on each product's own page. Instead, we opened each product page and collected its attributes.

#### 2.1.2.1. Pagination number

We calculate the pagination number, that is, the number of showcase pages needed to list every available product.

In [4]:
def number_products(soup):
    total_itens = soup.find_all('h2', 'load-more-heading')[0].get('data-total')
    print(f"The total number of products is: {total_itens}.")
    
    return total_itens

def pagination(total_itens):
    products_per_page = 36
    # Round up so that a partially filled last page is still counted
    pagination_number = np.ceil(int(total_itens)/products_per_page)
    print(f"The pagination number is: {pagination_number}.")
    
    return pagination_number

if __name__ == '__main__':
    total_itens = number_products(soup)
    
    pagination_number = pagination(total_itens)

The total number of products is: 71.
The pagination number is: 2.0.
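
The page-count arithmetic can be checked in isolation. The sketch below is a simplified illustration, assuming 36 products per page as in the cell above; the helper names `page_count` and `full_listing_url` are hypothetical, and the URL keeps only the `page-size` query parameter. Ceiling division is used so that a partially filled last page is counted.

```python
import math

PRODUCTS_PER_PAGE = 36  # page size assumed from the pagination cell

def page_count(total_items, per_page=PRODUCTS_PER_PAGE):
    """Number of pages needed to show total_items, counting a partial last page."""
    return math.ceil(int(total_items) / per_page)

def full_listing_url(base_url, total_items, per_page=PRODUCTS_PER_PAGE):
    """Build a showcase URL whose page-size covers every product."""
    return f"{base_url}?page-size={page_count(total_items, per_page) * per_page}"

print(page_count('71'))  # total read from the page as a string
print(page_count(80))    # rounding to nearest would give 2 here and miss 8 products
print(full_listing_url('https://www2.hm.com/en_us/men/products/jeans.html', 71))
```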


#### 2.1.2.2. Available product attributes on website

This section checks which attributes are available for the products on the H&M website.

In [18]:
def attributes_evaluation():
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}

    # Unique columns for all products
    aux = []

    for i in range(len(data)):

        # HTTP request for the product page
        # the headers content is standard
        url = "https://www2.hm.com/en_us/productpage." + data.loc[i, 'id'] + ".html"

        page = requests.get(url, headers=headers)

        #Beautiful Soup object
        soup = BeautifulSoup(page.text, 'html.parser')

        # ============================= Color =========================

        #product list
        product_list = soup.find_all('a', 'filter-option miniature')

        #color
        product_color = [p.get('data-color') for p in product_list] 

        #id
        product_id = [p.get('data-articlecode') for p in product_list]

        #dataframe
        df_color = pd.DataFrame([product_id, product_color]).T
        df_color.columns = ['id', 'color']

        #generate style id + color id
        df_color['style_id'] = df_color['id'].apply(lambda x: x[:-3])
        df_color['color_id'] = df_color['id'].apply(lambda x: x[-3:])

        # ============================ Composition =====================

        # Product list
        product_composition_list = soup.find_all('div', 'pdp-description-list-item')

        # Composition
        product_composition = [list( filter( None, p.get_text().split('\n') ) ) for p in product_composition_list]


        # dataframe
        df_composition = pd.DataFrame(product_composition).T

        # Columns name

        df_composition.columns = df_composition.iloc[0]

        # Filling None/NA values (forward fill; fillna(method='ffill') is deprecated in recent pandas)
        df_composition = df_composition.iloc[1:].ffill()

        # ========== Columns we want
        aux = aux + df_composition.columns.tolist()

    return print(set(aux))

In [19]:
attributes_evaluation()

{'Size', 'Art. No.', 'Fit', 'More sustainable materials', 'Composition'}
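
The reshaping applied to each product page (split every description item into a label followed by its values, transpose, promote the first row to column names, then forward-fill the gaps) can be sketched on a single product. The values are those of H&M article 1008549001 as they appear in this notebook; the nested list stands in for what the scraper extracts from the page, so no request is made here.

```python
import pandas as pd

# One inner list per description item: the label first, then its values.
product_composition = [
    ['Fit', 'Regular fit'],
    ['Composition', 'Shell: Cotton 98%, Spandex 2%',
     'Pocket lining: Polyester 65%, Cotton 35%'],
    ['More sustainable materials', 'Recycled cotton 20%'],
    ['Art. No.', '1008549001'],
]

# Transpose so that each description item becomes a column.
df_composition = pd.DataFrame(product_composition).T

# The first row holds the labels: use it as the header, then drop it.
df_composition.columns = df_composition.iloc[0]
df_composition = df_composition.iloc[1:].ffill()

print(df_composition)
```

Forward-filling copies the fit and the article number onto the second composition row, so every row can later be linked back to its product.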


#### 2.1.2.3. Data collection

In this section, we collected the data based on the attributes found in the previous section.

In [28]:
def data_collection():

    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}

    #empty dataframe
    df_final = pd.DataFrame()

    # All columns found on website
    cols = ['Art. No.', 'Composition', 'Fit', 'More sustainable materials', 'Size']
    df_pattern = pd.DataFrame(columns=cols)

    for i in range(len(data)):

        # HTTP request for the product page
        # the headers content is standard
        url02 = "https://www2.hm.com/en_us/productpage." + data.loc[i, 'id'] + ".html"+ "?page-size=" + str(int(pagination_number*36))

        page = requests.get(url02, headers=headers)

        #Beautiful Soup object
        soup = BeautifulSoup(page.text, 'html.parser')

        # ============================= Color =========================

        #product list
        product_list = soup.find_all('a', role='radio')

        #color
        product_color = [p.get('data-color') for p in product_list] 

        #id
        product_id = [p.get('data-articlecode') for p in product_list]

        #dataframe
        df_color = pd.DataFrame([product_id, product_color]).T
        df_color.columns = ['id', 'color']

        #generate style id + color id
        df_color['style_id'] = df_color['id'].apply(lambda x: x[:-3])
        df_color['color_id'] = df_color['id'].apply(lambda x: x[-3:])

        # ============================ Composition =====================

        # Product list
        product_composition_list = soup.find_all('div', 'pdp-description-list-item')

        # Composition
        product_composition = [list( filter( None, p.get_text().split('\n') ) ) for p in product_composition_list]


        # dataframe
        df_composition = pd.DataFrame(product_composition).T

        # Columns name
        print(df_composition)
        df_composition.columns = df_composition.iloc[0]

        # Filling None/NA values (forward fill; fillna(method='ffill') is deprecated in recent pandas)
        df_composition = df_composition.iloc[1:].ffill()

        # The same number of columns (pattern)
        df_composition = pd.concat( [df_pattern, df_composition] )

        # Generate Style ID + Color ID
        # All values, but the last three values
        df_composition['style_id'] = df_composition['Art. No.'].apply(lambda x: x[:-3])
        df_composition['color_id'] = df_composition['Art. No.'].apply(lambda x: x[-3:])

        # ======================= Merging color + composition ==========================
        data_merge = pd.merge(df_color, df_composition[['style_id', 'Fit', 'Composition', 'More sustainable materials', 'Size']], how='left', on='style_id')

        # ======================= Concatenate ==========================================
        df_final = pd.concat( [df_final, data_merge], axis=0 )
        
    return df_final

def data_merge(data, df_final):
    # Creating style_id + color_id
    data['style_id'] = data['id'].apply(lambda x: x[:-3])
    data['color_id'] = data['id'].apply(lambda x: x[-3:])

    data_raw = pd.merge( data, df_final[['color', 'style_id', 'Fit', 'Composition', 'More sustainable materials', 'Size']], how='left', on='style_id')
    
    return data_raw

def save_csv(data_raw):
    data_raw.to_csv("data_raw.csv")
    
    return None
    
if __name__ == '__main__':
    df_final = data_collection()
    
    data_raw = data_merge(data, df_final)
    
    save_csv(data_raw)

             0                                         1  \
0          Fit                               Composition   
1  Regular fit             Shell: Cotton 98%, Spandex 2%   
2         None  Pocket lining: Polyester 65%, Cotton 35%   

                            2           3  
0  More sustainable materials    Art. No.  
1         Recycled cotton 20%  1008549001  
2                        None        None  
                                                0            1  \
0                                            Size          Fit   
1  The model is 189cm/6'2" and wears a size 32/32  Regular fit   
2                                            None         None   

                                          2                           3  \
0                               Composition  More sustainable materials   
1             Shell: Cotton 99%, Spandex 1%         Recycled cotton 20%   
2  Pocket lining: Polyester 63%, Cotton 37%                        None   

...

                                                0         1  \
0                                            Size       Fit   
1  The model is 188cm/6'2" and wears a size 31/32  Slim fit   
2                                            None      None   

                                          2           3  
0                               Composition    Art. No.  
1  Pocket lining: Polyester 65%, Cotton 35%  1024256007  
2             Shell: Cotton 99%, Spandex 1%        None  
                                             0           1  \
0                                         Size         Fit   
1  The model is 182cm/6'0" and wears a size 31  Skinny fit   
2                                         None        None   

                                          2           3  
0                               Composition    Art. No.  
1  Pocket lining: Polyester 65%, Cotton 35%  1004199005  
2             Shell: Cotton 99%, Spandex 1%        None  
...

          0                           1           2
0       Fit                 Composition    Art. No.
1  Slim fit  Pocket lining: Cotton 100%  0938875007
2      None      Cotton 99%, Spandex 1%        None
                                                0           1  \
0                                            Size         Fit   
1  The model is 187cm/6'2" and wears a size 32/32  Skinny fit   
2                                            None        None   

                                          2           3  
0                               Composition    Art. No.  
1  Pocket lining: Polyester 65%, Cotton 35%  0985159007  
2             Shell: Cotton 99%, Spandex 1%        None  
                                                0            1  \
0                                            Size          Fit   
1  The model is 189cm/6'2" and wears a size 31/32  Relaxed fit   
2                                            None         None   
...

            0                       1                           2           3
0         Fit             Composition  More sustainable materials    Art. No.
1  Skinny fit  Cotton 98%, Spandex 2%         Recycled cotton 21%  0730863033
                                                0         1  \
0                                            Size       Fit   
1  The model is 188cm/6'2" and wears a size 31/32  Slim fit   
2                                            None      None   

                                                2           3  
0                                     Composition    Art. No.  
1        Pocket lining: Polyester 80%, Cotton 20%  1008110003  
2  Shell: Cotton 89%, Elasterell-P 9%, Spandex 2%        None  
                                                0         1  \
0                                            Size       Fit   
1  The model is 189cm/6'2" and wears a size 32/32  Slim fit   
2                                            None      None   
...

In [21]:
df_final

Unnamed: 0,id,color,style_id,color_id,Fit,Composition
0,1008549002,Denim blue,1008549,002,Regular fit,"Shell: Cotton 98%, Spandex 2%"
1,1008549002,Denim blue,1008549,002,Regular fit,"Pocket lining: Polyester 65%, Cotton 35%"
2,1008549004,Dark blue,1008549,004,Regular fit,"Shell: Cotton 98%, Spandex 2%"
3,1008549004,Dark blue,1008549,004,Regular fit,"Pocket lining: Polyester 65%, Cotton 35%"
4,1008549006,Black,1008549,006,Regular fit,"Shell: Cotton 98%, Spandex 2%"
...,...,...,...,...,...,...
7,0927964005,White,0927964,005,Regular fit,"Pocket lining: Polyester 65%, Cotton 35%"
0,0865734002,Denim blue,0865734,002,Relaxed fit,Cotton 100%
1,0865734005,Light denim blue,0865734,005,Relaxed fit,Cotton 100%
2,0865734006,Gray,0865734,006,Relaxed fit,Cotton 100%
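
The `style_id` and `color_id` columns in this table come from splitting the article number: the last three digits identify the color and the remaining digits identify the style. A minimal sketch of that split, with a hypothetical helper name `split_article_code`:

```python
def split_article_code(article_code):
    """Split an article number into (style_id, color_id); the last three digits are the color."""
    return article_code[:-3], article_code[-3:]

for code in ['1008549002', '1008549004', '0865734002']:
    style_id, color_id = split_article_code(code)
    print(code, style_id, color_id)
```

Keeping the article numbers as strings matters here: converting them to integers would drop the leading zero of codes such as `0865734002` and shift the split.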
