# Web scraping

In [108]:
import pandas as pd

# Imports for the translation with IBM Watson
import requests
import os
import json

# Imports googletrans, an unofficial client for Google Translate
from googletrans import Translator

In [2]:
supermarkets = {
    'aldi_north': pd.read_csv('data/products-info-german/cleaned_aldi_products.txt'),
    'aldi_south': pd.read_csv('data/products-info-german/cleaned_aldisued_products.txt'),
    'edeka': pd.read_csv('data/products-info-german/cleaned_edeka_products.txt'),
    'kaufland': pd.read_csv('data/products-info-german/cleaned_kaufland_products.txt'),
    'lidl': pd.read_csv('data/products-info-german/cleaned_lidl_products.txt'),
    'rewe': pd.read_csv('data/products-info-german/cleaned_rewe_products.txt')
}

Now that we have the product info for the six main supermarkets in Berlin, we explore the data before using it.

In [3]:
supermarkets['rewe']

Unnamed: 0.1,Unnamed: 0,Name,Price,Unit,Packet Size,Supermarket,Comparable Price
0,0,Ja! Briespitze 60% Fett,1.19,100 Gramm,Packung 200 Gramm,rewe,0.600
1,1,REWE Beste Wahl Brie 45% Fett,0.69,100 Gramm,Packung 100 Gramm,rewe,0.690
2,2,REWE Beste Wahl Ziegenfrischkäse Mousse 73%,1.99,100 Gramm,Becher 125 Gramm,rewe,1.590
3,3,Rewe Brie 45% Fett,0.79,100 Gramm,Packung 100 Gramm,rewe,0.790
4,4,Nutella,3.79,100 Gramm,Glas 750 Gramm,rewe,0.505
...,...,...,...,...,...,...,...
6238,6238,ja! Kartoffelringe Paprika,0.99,100 Gramm,Beutel 100 Gramm,rewe,0.990
6239,6239,ja! Schoko Rosinen,0.99,100 Gramm,Beutel 200 Gramm,rewe,0.500
6240,6240,Ja! Erdnüsse Paprika,0.81,100 Gramm,Beutel 200 Gramm,rewe,0.410
6241,6241,ja! Orange Cake,0.99,100 Gramm,Packung 300 Gramm,rewe,0.330


In [12]:
supermarkets['rewe'].dtypes

Name                 object
Price               float64
Unit                 object
Packet Size          object
Supermarket          object
Comparable Price    float64
dtype: object

In [13]:
supermarkets['rewe']['Name'].value_counts()

Pfanner Mango Fair 100%  Fair                                                1
Weihenstephan Haltbare Alpenmilch 3,5% Fett                                  1
Nordbrand Pina Colada 15 % Vol.                                              1
Meica Eisbeinfleisch in Aspik                                                1
Ehrmann Almighurt nach Herzenslust Mandarinen Käsekuchen & Knusperwaffeln    1
                                                                            ..
ja! Forellenfilet natur                                                      1
Seeberger Aprikosen extra Natural Power                                      1
Ja! Spül und Haushaltstuch 6 Stück                                           1
Rewe Backpapier Zuschnitte, 20 Stück                                         1
Vogtlandweide Haltbare Vollmilch 3,5%                                        1
Name: Name, Length: 6194, dtype: int64

In [14]:
for market in supermarkets:
    # Drop rows with a duplicated 'Name', keeping the first occurrence
    supermarkets[market] = supermarkets[market].drop_duplicates('Name')
    # Drop the leftover index columns ('Unnamed: 0', 'Unnamed: 0.1') if present
    supermarkets[market] = supermarkets[market].drop(columns=['Unnamed: 0', 'Unnamed: 0.1'], errors='ignore')
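
Note that `drop_duplicates` keeps the original index labels, so the index has gaps afterwards (in the output further down, the last label is 6242 while the length is 6194). A minimal sketch on made-up toy data, showing the gap and one way to renumber the rows with `reset_index`; whether renumbering is wanted here is an assumption, not something the notebook does.

```python
import pandas as pd

# Toy data standing in for one supermarket's products (values are illustrative)
products = pd.DataFrame({
    'Name': ['Nutella', 'Nutella', 'Coca Cola'],
    'Price': [3.79, 3.79, 0.99],
})

# The second 'Nutella' row is dropped; the surviving rows keep their old labels
deduplicated = products.drop_duplicates('Name')
print(list(deduplicated.index))

# reset_index(drop=True) renumbers the rows 0..n-1 and discards the old labels
renumbered = deduplicated.reset_index(drop=True)
print(list(renumbered.index))
```

Positional loops such as `range(len(df))` combined with label-based access only work reliably on the renumbered frame.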

In [15]:
supermarkets['rewe']['Name'].str.split(',')

0                           [Ja! Briespitze 60% Fett]
1                     [REWE Beste Wahl Brie 45% Fett]
2       [REWE Beste Wahl Ziegenfrischkäse Mousse 73%]
3                                [Rewe Brie 45% Fett]
4                                           [Nutella]
                            ...                      
6238                     [ja! Kartoffelringe Paprika]
6239                             [ja! Schoko Rosinen]
6240                           [Ja! Erdnüsse Paprika]
6241                                [ja! Orange Cake]
6242           [ja! Kalifornische Pistazien geröstet]
Name: Name, Length: 6194, dtype: object

Once the data is clean, we translate it from German to English.

## 2. Translate

We are going to use the googletrans module to translate our .csv files. First, we try the IBM Watson Translate API as an alternative.

### 2.1 IBM Watson Translate API

In [122]:
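# os.getenv expects the NAME of an environment variable, not the key itself.
# 'IBM_WATSON_API_KEY' is an assumed name; export it before starting the notebook.
api_key = os.getenv("IBM_WATSON_API_KEY")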
url = "https://api.eu-de.language-translator.watson.cloud.ibm.com/instances/58781bbf-1a05-4c1d-a4ee-338e692fecee/v3/translate?version=2018-05-01"
headers = {"Content-Type": "application/json"}
auth = ("apikey", api_key)

In [123]:
test = 'Pfanner Mango Fair 100%  Fair'
data = {"text":[test],"model_id":"de-en"}

In [124]:
json.dumps(data)

'{"text": ["Pfanner Mango Fair 100%  Fair"], "model_id": "de-en"}'

In [125]:
r = requests.post(headers=headers, auth=auth, url=url, data=json.dumps(data))

In [126]:
r.json()

{'code': 401, 'error': 'Unauthorized'}

In [127]:
print(auth)

('apikey', '<redacted>')
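
`os.getenv` takes the name of an environment variable and returns `None` when that variable is not set, so it is worth failing early instead of sending a request without a key. The following is a minimal sketch under assumptions: the variable name `IBM_WATSON_API_KEY` and the helper `build_translate_request` are hypothetical, and the payload simply mirrors the cells above.

```python
import json
import os

# Hypothetical variable name; set it in the shell before starting the notebook
API_KEY_ENV_VAR = "IBM_WATSON_API_KEY"

def build_translate_request(text, url, env_var=API_KEY_ENV_VAR):
    """Return keyword arguments for requests.post, failing early if the key is missing."""
    api_key = os.getenv(env_var)
    if not api_key:
        raise RuntimeError(f"Environment variable {env_var} is not set")
    return {
        "url": url,
        "headers": {"Content-Type": "application/json"},
        "auth": ("apikey", api_key),
        "data": json.dumps({"text": [text], "model_id": "de-en"}),
    }

# Usage (requires the requests package and network access):
# r = requests.post(**build_translate_request('Pfanner Mango Fair 100%  Fair', url))
```

Keeping the key out of the notebook also means it never appears in a printed output.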


In [16]:
# Instantiates a client
translator = Translator()

In [20]:
# Example to see how it works
# 1 - translator.translate('Hello').text

In [None]:
# 1 - word.origin # .origin -> prints the original word

In [None]:
# 1 - word[0].text # .text -> prints the translated word

In order to translate the whole csv file, we will have to go through each cell and translate it. We only need to translate 'Name', 'Unit' and 'Packet Size'.

In [None]:
# Understand how the method works
# 1 - translator.translate(supermarkets['rewe']['Name'][1], dest='en').text

The method also accepts lists as input. However, we loop over the rows one by one so that we can catch the rows that raise errors.

In [None]:
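src = 'auto'  # Source language (auto-detect)
dest = 'en'   # Destination language

# Iterate over a snapshot of the keys: adding the '-en' entries while
# iterating over the dict itself would raise a RuntimeError
for market in list(supermarkets):
    # Create a new dataframe for the English version
    translated = supermarkets[market].copy()

    # Loop through the first 5 rows as a test; use translated.index for all rows.
    # Index labels are used because drop_duplicates leaves gaps in the index.
    for row in translated.index[:5]:
        name = translated.loc[row, 'Name']
        try:
            translated.loc[row, 'Name'] = translator.translate(name, dest=dest, src=src).text
        except Exception:
            # Print the names that could not be translated
            print(name)

    supermarkets[f'{market}-en'] = translated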
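
The row-by-row idea can be isolated from the translation service itself. This is a minimal sketch, assuming a hypothetical helper `translate_rows` that receives the translation function as a parameter; the toy lookup table stands in for the real API so the example runs offline.

```python
def translate_rows(names, translate_fn):
    """Translate each name separately and collect the ones that fail."""
    translated, failed = {}, []
    for name in names:
        try:
            translated[name] = translate_fn(name)
        except Exception:
            # Keep going: one bad row should not abort the whole run
            failed.append(name)
    return translated, failed

# Stand-in translator (an assumption, for illustration only): a tiny lookup
# table that raises KeyError for unknown names, mimicking a failed API call
toy_dictionary = {'Erdnüsse': 'Peanuts', 'Vollmilch': 'Whole milk'}
translated, failed = translate_rows(
    ['Erdnüsse', 'Backpapier', 'Vollmilch'], toy_dictionary.__getitem__
)
print(translated)
print(failed)
```

With googletrans, the function passed in could be `lambda name: translator.translate(name, src='auto', dest='en').text`, and the `failed` list can be retried or translated manually later.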

In [None]:
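def translate(market):
    # We need to translate 'Name', 'Unit', 'Packet Size'; missing values are left as they are
    translated = supermarkets[market].copy()
    for column in ['Name', 'Unit', 'Packet Size']:
        translated[column] = translated[column].dropna().map(lambda text: translator.translate(text, dest='en').text)
    return translated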

If it does not work, possible workarounds:
- translate it manually
- connect to the official Google API
- IBM class

## 3. Next Steps

- Create a dictionary with one dataframe per retailer, each containing unique ingredients
- [v0] - Build an algorithm that takes one ingredient and returns the % coverage and total price for each supermarket
- [v1] - Build an algorithm that takes a list of ingredients and returns the % coverage and total price for each supermarket
- [v2] - Use fuzzywuzzy to match the input (e.g. 'potato') with the ingredients by similarity.
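
The [v2] idea of similarity matching can be sketched with the standard library's `difflib`, used here as a stand-in so the example has no extra dependency (that swap is an assumption; a dedicated fuzzy-matching library exposes its own scoring functions). The helper name `closest_names` and the `cutoff` value are illustrative.

```python
import difflib

def closest_names(query, names, n=3, cutoff=0.6):
    """Return up to n product names most similar to query, ignoring case."""
    # Map lowercased names back to their original spelling
    lowered = {name.lower(): name for name in names}
    matches = difflib.get_close_matches(query.lower(), list(lowered), n=n, cutoff=cutoff)
    return [lowered[m] for m in matches]

# Product names taken from the tables in this notebook
names = ['Nutella', 'Coca Cola', 'ja! Orange Cake']
matches = closest_names('coca-cola', names)
print(matches)
```

Note that this only compares spellings: an English query such as 'potato' would still need the translated product names to match anything.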

### 3.2 [v0] algorithm
- [v0] - Build an algorithm that takes one ingredient and returns the % coverage and total price for each supermarket

In [82]:
# Look up the price of a given ingredient in one supermarket
supermarkets['rewe'][supermarkets['rewe']['Name'] == 'Pfanner Mango Fair 100%  Fair']['Price'].values[0]

1.29

In [88]:
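def search(ingredient: str):
    # Collect one result string per supermarket
    result = []
    for market in supermarkets:
        a = supermarkets[market]
        try:
            # Price of the first product whose name matches exactly
            price = a[a['Name'] == ingredient]['Price'].values[0]
            result.append(f'{market}: {price}')
        except IndexError:
            # No product with that exact name in this supermarket
            result.append(f'{market}: Not found')
    return result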

# Try it with 'Coca Cola' because it is available in every supermarket
# Try with 'Pfanner Mango Fair 100%  Fair'

In [85]:
# Testing the function
search('Pfanner Mango Fair 100%  Fair')

['aldi_north: Not found',
 'aldi_south: Not found',
 'edeka: Not found',
 'kaufland: 1.89',
 'lidl: Not found',
 'rewe: 1.29']

In [86]:
# Testing the function
search('Coca Cola')

['aldi_north: 2.89',
 'aldi_south: 0.99',
 'edeka: 1.39',
 'kaufland: 11.4',
 'lidl: 0.99',
 'rewe: 0.99']

We can see that there are some inconsistencies in the 'Coca Cola' prices. Let's find out what the problem is.

In [80]:
supermarkets['kaufland'][supermarkets['kaufland']['Name'] == 'Coca Cola']

Unnamed: 0,Name,Price,Unit,Packet Size,Supermarket,Comparable Price
15,Coca Cola,11.4,100 ml,"Kasten 12 L, 12 x 1l",kaufland,0.095
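
The row above explains the outlier: the Kaufland entry is a crate of 12 x 1 l, so its `Price` is not comparable with a single bottle, while its `Comparable Price` (0.095 per 100 ml) is. A minimal sketch of a lookup that returns `Comparable Price` instead; the helper `search_comparable` is hypothetical, and the toy frames reuse only rows shown in this notebook.

```python
import pandas as pd

# Toy stand-in for the `supermarkets` dictionary, built from rows shown above
toy_markets = {
    'kaufland': pd.DataFrame({
        'Name': ['Coca Cola'], 'Price': [11.4], 'Comparable Price': [0.095],
    }),
    'rewe': pd.DataFrame({
        'Name': ['Nutella'], 'Price': [3.79], 'Comparable Price': [0.505],
    }),
}

def search_comparable(ingredient, markets):
    """Return the comparable price per market, or None where the name is not found."""
    result = {}
    for market, products in markets.items():
        matches = products.loc[products['Name'] == ingredient, 'Comparable Price']
        result[market] = float(matches.iloc[0]) if not matches.empty else None
    return result

print(search_comparable('Coca Cola', toy_markets))
```

Returning a dictionary rather than formatted strings keeps the values numeric, which makes later sorting or summing straightforward.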


### 3.3 [v1] algorithm
- [v1] - Build an algorithm that takes a list of ingredients and returns the % coverage and total price for each supermarket

In [100]:
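def search_list(ingredients: list):
    # Collect one result string per supermarket and ingredient
    result = []

    for market in supermarkets:

        a = supermarkets[market]

        for ing in ingredients:
            try:
                # Price of the first product whose name matches exactly
                price = a[a['Name'] == ing]['Price'].values[0]
                result.append(f'{market}: {price} for the {ing}')
            except IndexError:
                # No product with that exact name in this supermarket
                result.append(f'{market}: Not found')

    return result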

In [101]:
search_list(['Coca Cola', 'Pfanner Mango Fair 100%  Fair'])

['aldi_north: 2.89 for the Coca Cola',
 'aldi_north: Not found',
 'aldi_south: 0.99 for the Coca Cola',
 'aldi_south: Not found',
 'edeka: 1.39 for the Coca Cola',
 'edeka: Not found',
 'kaufland: 11.4 for the Coca Cola',
 'kaufland: 1.89 for the Pfanner Mango Fair 100%  Fair',
 'lidl: 0.99 for the Coca Cola',
 'lidl: Not found',
 'rewe: 0.99 for the Coca Cola',
 'rewe: 1.29 for the Pfanner Mango Fair 100%  Fair']
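
The output above lists individual prices but not yet the % coverage and total price that [v1] aims for. A minimal sketch of that aggregation, assuming exact name matches and a hypothetical helper `coverage_and_total`; the toy frames reuse prices from the output above for three of the supermarkets.

```python
import pandas as pd

# Toy stand-in for the `supermarkets` dictionary, using prices from the output above
toy_markets = {
    'kaufland': pd.DataFrame({
        'Name': ['Coca Cola', 'Pfanner Mango Fair 100%  Fair'], 'Price': [11.4, 1.89],
    }),
    'lidl': pd.DataFrame({'Name': ['Coca Cola'], 'Price': [0.99]}),
    'rewe': pd.DataFrame({
        'Name': ['Coca Cola', 'Pfanner Mango Fair 100%  Fair'], 'Price': [0.99, 1.29],
    }),
}

def coverage_and_total(ingredients, markets):
    """Return % of ingredients found and the summed price of those found, per market."""
    if not ingredients:
        raise ValueError('ingredients must not be empty')
    summary = {}
    for market, products in markets.items():
        # One price per product name (first occurrence wins)
        prices = products.drop_duplicates('Name').set_index('Name')['Price']
        found = [ing for ing in ingredients if ing in prices.index]
        summary[market] = {
            'coverage_pct': 100 * len(found) / len(ingredients),
            'total_price': round(float(prices.loc[found].sum()), 2),
        }
    return summary

summary = coverage_and_total(['Coca Cola', 'Pfanner Mango Fair 100%  Fair'], toy_markets)
print(summary)
```

The total only covers the ingredients that were found, so it should be read together with the coverage figure; the package-size issue noted for Kaufland still applies to these raw prices.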

## Clean data

- Translate data to English
- Isolate the package size
- Check values
- Normalize Unit

- Create a unique list with unique products of all different markets
- Analyse it
- Conclude how to approach the solution

There are inconsistencies in the package sizes. Let's normalize prices to a common reference quantity of 100 (ml, g, ...). To do so, we extract the package size from 'Packet Size' and compute: comparable price = price / package size * 100.
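
A minimal sketch of this normalization, assuming the 'Packet Size' strings look like the ones shown in this notebook ('Glas 750 Gramm', 'Kasten 12 L, 12 x 1l'). The helper names, the unit table and the regular expression are illustrative, and other supermarkets may use formats this does not cover.

```python
import re

# Conversion factors to the base units g and ml (assumed set of unit spellings)
UNIT_FACTORS = {'g': 1, 'gramm': 1, 'kg': 1000, 'ml': 1, 'l': 1000, 'liter': 1000}

# First number followed by a known unit, e.g. '750 Gramm' or '12 L'
QUANTITY = re.compile(r'(\d+(?:[.,]\d+)?)\s*(gramm|kg|g|ml|liter|l)\b', re.IGNORECASE)

def package_size(packet_size):
    """Return the package size in g or ml, or None if no quantity is recognized."""
    match = QUANTITY.search(packet_size)
    if match is None:
        return None
    amount = float(match.group(1).replace(',', '.'))  # German decimal comma
    return amount * UNIT_FACTORS[match.group(2).lower()]

def price_per_100(price, packet_size):
    """Return price / package size * 100, or None if the size is unknown."""
    size = package_size(packet_size)
    return None if not size else price / size * 100

print(round(price_per_100(3.79, 'Glas 750 Gramm'), 3))
print(round(price_per_100(11.4, 'Kasten 12 L, 12 x 1l'), 3))
```

For the two rows used here, the results can be checked against the existing 'Comparable Price' column (Nutella at REWE, Coca Cola at Kaufland). Packages sold by count, such as '6 Stück', return `None` and need separate handling.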