### Initial Comment

This script accesses the Continente website and retrieves the data stored in specific tags of its pages.

Using the requests and bs4 libraries, we accessed the HTML code of the different webpages and retrieved the name and the price of the products listed there.

The data obtained this way was then converted into strings and manipulated to get the product name and price in a readable format.
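
The string clean-up described above can be sketched with the standard library alone. The helper name `clean_price_text` and the sample markup are assumptions made for illustration; the real markup on the website may differ.

```python
import re

def clean_price_text(raw_html):
    """Turn a price fragment such as '<span>€ 6,99</span>' into a float.

    Hypothetical helper: the name and the sample markup are illustrative.
    """
    text = re.sub(r"<[^>]+>", "", raw_html)   # drop the HTML tags
    text = re.sub(r"[\s€]", "", text)         # drop whitespace and the euro sign
    text = text.replace(",", ".")             # decimal comma -> decimal point
    return float(text)

sample = '<span class="ct-price-formatted">\n€ 6,99\n</span>'
price = clean_price_text(sample)
print(price)
```

Converting to `float` at this point is a design choice of the sketch: it makes malformed prices fail early instead of surfacing later in the analysis.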

Based on the input Excel file _Categories.xlsx_, the code loops through the product names and searches for each name on the products webpage of the Continente website.
 - the group opted to create a new url from the "base url" of the webpage, replacing the search term in it with the product names from this file.
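
As a minimal sketch of this url construction, the query parameters used in this notebook can be assembled with `urllib.parse.urlencode`. The helper name `build_search_url` is hypothetical, and unlike plain string concatenation this variant also encodes spaces in the product name.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.continente.pt/pesquisa/"

def build_search_url(search_term):
    """Build the search url for one product name (hypothetical helper)."""
    params = {
        "q": search_term,              # the product name from Categories.xlsx
        "start": 0,
        "srule": "search-relevance",   # sort the results by relevance
        "pmin": 0.01,
        "prefn1": "brand",
        "prefv1": "Continente",        # restrict the results to the main brand
    }
    return BASE_URL + "?" + urlencode(params)

url = build_search_url("papel higienico")
print(url)
```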

When accessing a url, the data is stored in a dataframe. This dataframe is then filtered using the function data_cleaning, which removes all results that are out of scope for that search.
 - for example, when searching for meat products it is usual to have animal food in the search results, which is out of the scope of the search, so these items are removed

After the out-of-scope results are removed, the products are narrowed down to the top 10, selected according to the search preferences that each supermarket marks as most relevant.

These individual product dataframes were then concatenated into one dataframe that stores the information of this specific supermarket.

Summary statistics are then computed for each product: the minimum, maximum and average price, as well as the number of items that were passed onto the final dataframe for that product.
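
A minimal sketch of these summary statistics, using pandas named aggregation on made-up toy data. The output column names (`average_price`, `min_price`, `max_price`, `product_count`) are assumptions chosen for illustration.

```python
import pandas as pd

# Toy data standing in for the concatenated supermarket dataframe
toy = pd.DataFrame({
    "Product Object": ["peru", "peru", "frango"],
    "Product Price": [6.99, 8.49, 4.50],
})

# One row per product, one flat column per statistic
summary = toy.groupby("Product Object").agg(
    average_price=("Product Price", "mean"),
    min_price=("Product Price", "min"),
    max_price=("Product Price", "max"),
    product_count=("Product Price", "count"),
)
print(summary)
```

Named aggregation yields flat column names, which can be easier to export than a two-level column header.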

These data are then used in the next steps of the project.

In [None]:
import requests
from bs4 import BeautifulSoup
import re # For string transformation
import numpy as np
import pandas as pd
import unidecode

In [2]:
# Read the initial excel file with all product names to be searched and all the string exceptions
data = pd.read_excel("Categories.xlsx")
data_subset = data.iloc[:,:1] # All rows, and only the first column

## String transformation to remove all uppercase and all accents (text normalisation)
data_subset = data_subset.applymap(lambda value: unidecode.unidecode(value).lower() if isinstance(value, str) else value)
data_subset.head()

Unnamed: 0,Product
0,peru
1,frango
2,bacalhau
3,cebola
4,batata


### Explanation Comment
The function below retrieves the data from Continente's product web page.

Based on the information on the page, it creates a dataframe with all the Product Names and the respective Product Prices on that page.
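
To illustrate the idea without a network request, the sketch below parses a small hand-written HTML fragment. The fragment is an assumption: it only imitates the class names used in this notebook, and the live page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hand-written HTML fragment; it only imitates the class names used in this notebook
sample_html = """
<div><a class="ct-tile--description">Bife de Peru</a>
<span class="ct-price-formatted">€ 6,99</span></div>
<div><a class="ct-tile--description">Strogonoff de Peru</a>
<span class="ct-price-formatted">€ 7,99</span></div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# get_text(strip=True) returns the tag's text without the surrounding markup
names = [tag.get_text(strip=True) for tag in soup.find_all("a", class_="ct-tile--description")]
prices = [tag.get_text(strip=True) for tag in soup.find_all("span", class_="ct-price-formatted")]

# Pair each name with its price; zip stops at the shorter list
rows = list(zip(names, prices))
print(rows)
```

Using `get_text` is an alternative to converting each tag to a string and stripping the markup with a regular expression.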

In [3]:
def get_name_price(url):
    
    #1. Get web data into the notebook
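    page = requests.get(url)
    if page.status_code != 200:
        # The page could not be retrieved (e.g. 404), so return an empty dataframe with the expected columns
        print('This page could not be retrieved (status code ' + str(page.status_code) + ').')
        return pd.DataFrame(columns = ["Product Name", "Product Price"])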
    
    soup = BeautifulSoup(page.text, 'html.parser')
    
    
    product_prices = soup.find_all('span', class_ = 'ct-price-formatted') # get price based on Continente's website design
    product_names = soup.find_all('a', class_ = 'ct-tile--description') # get name based on Continente's website design
    
    number_products_price = len(product_prices)
    number_products_name = len(product_names)
    
    if number_products_price != number_products_name:
        # The number of product names is different from the number of product prices
        print("The number of prices and names on this webpage is different, please check this page manually")
        
    product_name_list = []
    product_price_list = []
    
    # Look up the price wrappers and the unit prices once, instead of on every iteration
    price_wrappers = soup.find_all('div', class_ = 'prices-wrapper') # get price based on Continente's website design
    unit_prices = soup.find_all('span', class_ = 'ct-price-value') # get price based on Continente's website design
    
    # Iterate only over the products that have both a name and a price
    for i in range(min(number_products_price, number_products_name)):
        product_name = str(product_names[i])
        product_price = str(product_prices[i])
    
        if "/un" in str(price_wrappers[i]):
            # Meaning that the product is sold by the unit
            product_price = str(unit_prices[i])
    
    
        # 2. String formatting
        # 2.1. Transformation of Product Name:
        product_name = re.sub('<[^>]+>', '', product_name)
        
        # 2.2. Transformation of Product Price
        product_price = re.sub('<[^>]+>', '', product_price)
        product_price = re.sub('\n', '', product_price)
        product_price = re.sub(' ', '', product_price)
        product_price = re.sub('€', '', product_price)
        product_price = re.sub(',', '.', product_price)
        
        
        # 3. Append product name and price to the correspondent lists
        product_name_list.append(product_name)
        product_price_list.append(product_price)
        
    # 4. Create a dataframe to store all the data
    output_df = pd.DataFrame(list(zip(product_name_list,product_price_list)), columns = ["Product Name", "Product Price"])
    return output_df

### Explanation Comment
The **_function_** below defines the data cleaning process based on the "forbidden strings" for each product type.
 - Please be aware that the position of the column that holds the forbidden strings in the original dataframe is hard coded and is currently column number = 2. If the original data set changes, this position needs to be adapted.

This function removes user-defined strings that may confound the products to be analysed and keeps the top 10 products for each search object.
 - e.g. if we search for "peru" we may get results for animal food that contains turkey, and also other meat products made from turkey. These can be defined up front to avoid including them in the final dataframe
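
The filtering idea can be sketched on toy data as follows. The helper name `drop_forbidden`, the sample product names and the pattern `"racao|cao"` are all made up for illustration, and the case-insensitive match is an assumption of this sketch.

```python
import pandas as pd

# Made-up search results for the term "peru"
results = pd.DataFrame({"Product Name": [
    "Bife de Peru", "Racao para Cao com Peru", "Escalopes de Peru",
]})

def drop_forbidden(frame, pattern):
    """Remove rows whose name matches the pattern (hypothetical helper)."""
    if pd.isna(pattern) or pattern == "":
        # An empty pattern matches every string, so skip the filter instead
        return frame
    mask = frame["Product Name"].str.contains(pattern, case=False, regex=True)
    return frame[~mask]

kept = drop_forbidden(results, "racao|cao")
print(kept["Product Name"].tolist())
```

The explicit check for a missing or empty pattern matters because `str.contains("")` is true for every name, so negating it would discard all rows.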

In [4]:
def data_cleaning(url_data, index, original_data = data):
    
    strings_to_remove = original_data.iloc[index, 2] # get the set of strings to remove (hard-coded column position)
    
    if pd.isna(strings_to_remove):
        # The forbidden strings field is empty, so there is nothing to filter out
        # (an empty pattern would match every product name and remove all rows)
        filtered_data = url_data
    else:
        # Remove the products whose name matches the forbidden strings (pandas treats them as a regular expression)
        filtered_data = url_data[~url_data["Product Name"].str.contains(strings_to_remove)]
    
    data_10 = filtered_data.iloc[:10,:] # Keep the top 10 products, sorted by relevance and from the main brand

    return data_10

#product_data = data_cleaning(url_data = all_data, index = 2, original_data = data)

url_data = all_data
index = 2
original_data = data



Step-by-step
data_10 = url_data.iloc[:10,:]
strings_to_remove = original_data.iloc[index, 2]

if pd.isna(strings_to_remove):
    # Meaning that the forbidden strings field is empty
    strings_to_remove = ""
    print("ok")
else:
    print("no")
pd.isna(strings_to_remove)

### Explanation Comment
The loop below iterates over all the products that we've defined in the initial Excel file.

The **search_url** object has a fixed structure that defines the website url to be searched.
 - Using this url, the function **get_name_price** is called to retrieve the data from the web
 - The data is then cleaned to keep the top 10 products for each search term that fit the description

The output of each iteration is a dataframe with all relevant product names and prices for the product being searched
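
The accumulate-then-concatenate pattern can be sketched without any network access. `fake_fetch` is a hypothetical stand-in for the web request, and the rows it returns are made up.

```python
import pandas as pd

def fake_fetch(term):
    """Stand-in for the web request: returns two made-up rows per term."""
    return pd.DataFrame({
        "Product Name": [f"{term} A", f"{term} B"],
        "Product Price": ["1.00", "2.00"],
    })

frames = []
for term in ["peru", "frango"]:
    page_data = fake_fetch(term)
    page_data["Product Object"] = term       # tag every row with its search term
    frames.append(page_data.iloc[:10, :])    # keep at most the top 10 rows

# One concatenation at the end instead of one per iteration
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

Collecting the frames in a list avoids special-casing the first iteration and copies the data only once.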

In [5]:
# Collect the cleaned results of every product listed in the initial excel file
product_frames = []

for index, row in data_subset.iterrows():
    search_object = row["Product"]
    search_url = "https://www.continente.pt/pesquisa/?q="+str(search_object)+"&start=0&srule=search-relevance&pmin=0.01&prefn1=brand&prefv1=Continente"

    ## Call the function to retrieve the data from the web
    all_data = get_name_price(search_url)
    if not isinstance(all_data, pd.DataFrame):
        # The page could not be retrieved, so skip this product
        continue

    all_data["Supermarket"] = "Continente" # Adding Supermarket to the dataframe
    all_data["Category Object"] = data.iloc[index, 1] # Adding Category to the dataframe
    all_data["Product Object"] = data.iloc[index, 0] # Adding Product to the dataframe

    # Clean the results obtained so that we can have only the results that are of interest to us
    product_data = data_cleaning(url_data = all_data, index = index, original_data = data)
    product_frames.append(product_data)

# Concatenate the individual product dataframes into one dataframe for this supermarket
aggregate_data = pd.concat(product_frames)


In [6]:
aggregate_data

Unnamed: 0,Product Name,Product Price,Supermarket,Category Object,Product Object
1,Bife de Peru,6.99,Continente,Talho,Perú
3,Escalopes de Peru Extrafino,7.99,Continente,Talho,Perú
4,Escalopes de Peru Extrafinos,8.49,Continente,Talho,Perú
8,Jardineira de Peru,8.99,Continente,Talho,Perú
9,Strogonoff de Peru,7.99,Continente,Talho,Perú
...,...,...,...,...,...
6,Iogurte Líquido Pinacolada,1.39,Continente,Laticinios,Iogurtes
7,Iogurte Líquido Manga,1.86,Continente,Laticinios,Iogurtes
8,"Iogurte Aroma Tutti Frutti, Morango, Limão, Fr...",1.13,Continente,Laticinios,Iogurtes
9,Iogurte Líquido Frutos Vermelhos,1.86,Continente,Laticinios,Iogurtes


### Comment
Further transformations, namely how to aggregate the prices and how many products to use, should be discussed so that we all work with the same standardised version of the dataframe.

In [7]:
aggregate_data["Product Price"]=aggregate_data["Product Price"].replace(',', '.', regex=True)

In [8]:
aggregate_data["Product Price"] = aggregate_data["Product Price"].astype(float)

In [9]:
print(aggregate_data.dtypes)

Product Name        object
Product Price      float64
Supermarket         object
Category Object     object
Product Object      object
dtype: object


In [10]:
final_df = aggregate_data.groupby(['Supermarket','Category Object','Product Object']).agg({'Product Price': [("Average Per Product", "mean"),("Min Price per Product", "min"),("Max Price per Product", "max") ,("Count of Products", "count")]})
final_df

Supermarket,Category Object,Product Object,Average Per Product,Min Price per Product,Max Price per Product,Count of Products
Continente,Bebidas,Agua Garrafão,0.165,0.13,0.2,2
Continente,Casa,Detergente Loiça,2.358,0.79,4.58,10
Continente,Casa,Detergente WC,0.6,0.6,0.6,1
Continente,Casa,Detergente chao,0.6,0.6,0.6,1
Continente,Casa,Papel higienico,0.23,0.02,0.56,10
Continente,Casa,Sacos do Lixo,0.069,0.02,0.13,10
Continente,Charcutaria,Fiambre,10.214,6.45,13.3,10
Continente,Congelados,Douradinhos,5.53,5.53,5.53,1
Continente,Congelados,Ervilhas,3.758,0.93,10.38,10
Continente,Enlatados,Feijao,2.196,1.63,4.38,10


In [11]:
final_df.to_excel('Continente_ProductData.xlsx')