# Supermarkets data cleaning

We import the data from https://github.com/kklichowski/Third-Project, a repository by an Ironhack graduate who scraped product data from the 6 main supermarkets in Berlin. The goal of this document is to import that data and clean it for our model.

In [1]:
import pandas as pd
import numpy as np
import pickle

## Clean data

- Translate data to English
- Isolate the package size
- Check values
- Normalize units

- Create a unique list with unique products of all different markets
- Analyse it
- Conclude how to approach the solution

### 1. Import data and explore

In [2]:
# Import the data from Excel, one file per supermarket
supermarkets_en = {
    'aldinorth': pd.read_excel('data/products-en/aldinorth-products-en.xls', index_col=0),
    'aldisouth': pd.read_excel('data/products-en/aldisouth-products-en.xls', index_col=0),
    'edeka': pd.read_excel('data/products-en/edeka-products-en.xls', index_col=0),
    'kaufland': pd.read_excel('data/products-en/kaufland-products-en.xls', index_col=0),
    'lidl': pd.read_excel('data/products-en/lidl-products-en.xls', index_col=0),
    'rewe': pd.read_excel('data/products-en/rewe-products-en.xls', index_col=0)
}

In [3]:
supermarkets_en['aldisouth']

Unnamed: 0,Name,Price,Unit,Pack size,Supermarket,Comparable Price,Unnamed: 7
0,Almare cream herring fillets and cream sauce,1.35,100 gram,Bowl 400 grams,Aldi south,0.33799999999999997,
1,Coke,0.99,100 ml,Pet bottle 1.25 L,demandado aldi,0.079,
2,Puré de pudín Desira semolina,0.35,100 gram,Cup 175 grams,Aldi south,0.2,
3,Nutella,3.89,100 gram,"""Glass 880 grams + 80 g free""",Aldi south,0.442,
4,Landvogt Original Schwäbische Maultaschen with...,1.29,100 gram,Pack of 360 grams,Aldi south,0.36,
...,...,...,...,...,...,...,...
3579,Obersteirische Molkerei / Aldi Nord Austrian m...,0.89,100 gram,Cup 400 grams,Aldi south,0.22,
3580,Ofterdinger Aldi Nord carrot salad,6.99,100 ml,Bottle of glass 0.7 L,Aldi south,0.999,
3581,Mümmelmann Jagdbitter herbal liqueur,4.99,100 ml,Bottle of glass 0.7 L,Aldi south,0.713,
3582,KÜR Basic Shampoo Walnut,0.65,100 ml,Bottle of plastic 500 ml,Aldi south,0.13,


In [4]:
supermarkets_en['edeka'].dtypes

Name                object
Price               object
Unit                object
Pack Size           object
Supermarket         object
Comparable Price    object
dtype: object

For the moment, we are going to focus on the columns 'Name' and 'Price'.

In [5]:
supermarkets_en['edeka']['Name'].value_counts()

Danone orchard peach passion fruit                                                   2
Alete Milder Apple Juice after the 4th month                                         2
Thomy mustard sweeter                                                                2
Dr. Oetker Vitalis Yofibra Classic, 2 x 135 g                                        2
Maggi Meisterklasse mushroom sauce low in fat                                        2
                                                                                    ..
Minus L Choco Cappuccino                                                             1
Pepsi Cola Classic contains caffeine                                                 1
Milka Amavel Mousse au Praline                                                       1
Bols Peppermint Green Liqueur 24% Vol.                                               1
Hipp 3 follow-on milk organic double economy pack after the 10th month, 2 x 500 g    1
Name: Name, Length: 5014, dtype: int64

### 2. Clean data
#### 2.1 Remove duplicates

In [6]:
# Removing duplicate products
supermarkets_en['edeka'] = supermarkets_en['edeka'].drop_duplicates('Name')

In [7]:
supermarkets_en['edeka']['Name'].value_counts()

Danone Family Yogurt Cherry, 4 x 125 g                                               1
Children Bueno                                                                       1
Escal Flammkuchen Original Alsatian                                                  1
Libby's Peaches Half Fruit                                                           1
Landliebe yogurt with exquisite strawberries, 3.8% fat                               1
                                                                                    ..
Schwartau Extra Pineapple                                                            1
Seitenbacher Muesli 311 special mix                                                  1
Dr. Oetker Brandteig Garant                                                          1
Campari Bitter Aperitif, 25% vol.                                                    1
Hipp 3 follow-on milk organic double economy pack after the 10th month, 2 x 500 g    1
Name: Name, Length: 5014, dtype: int64
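Note that `value_counts()` returns one row per distinct name, so its length (5014) is the same before and after deduplication. A more direct check is to count repeated names with `duplicated`. The sketch below uses a small made-up DataFrame (the rows are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy data standing in for one supermarket's table (values are made up)
toy = pd.DataFrame({
    'Name': ['Coke', 'Nutella', 'Coke', 'Thomy mustard'],
    'Price': [0.99, 3.89, 1.09, 1.29],
})

# Number of rows whose Name already appeared earlier in the table
n_duplicates = toy['Name'].duplicated().sum()

# drop_duplicates keeps the first occurrence of each Name by default
deduped = toy.drop_duplicates('Name')

print(n_duplicates)
print(len(toy), len(deduped))
```

Because the first occurrence is kept, the price that survives for a repeated product depends on row order; that may or may not be the desired rule for this data.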

In [8]:
# Removing duplicate products for each DataFrame
for market in supermarkets_en:
    supermarkets_en[market] = supermarkets_en[market].drop_duplicates('Name')

In [9]:
supermarkets_en['rewe']['Name'].value_counts()

Iglo Vegetable Ideas Pan-Vegetable Italian Pan    1
Zimbo Thuringian grilled sausage                  1
Dr. Oetker cake glaze light                       1
Funny fresh crisp Hungarian                       1
Heinersdorfer cake plum crumble                   1
                                                 ..
Grafschafter Goldsaft Beet Syrup                  1
Tekrum Decor on Ice Premium ice cream cones       1
REWE Best choice Mousse au chocolat               1
Yes! Quark 40% fat                                1
Haribo fruit snails                               1
Name: Name, Length: 5049, dtype: int64

#### 2.2 Price to numeric

In [10]:
# Inspect the distribution of values in the Price column
supermarkets_en['edeka']['Price'].value_counts()

0.99     376
1.99     354
2.99     202
1.49     198
1.29     179
        ... 
4.98       1
0.57       1
7.98       1
0.58       1
27.95      1
Name: Price, Length: 241, dtype: int64

In [11]:
# Let's check if there are null values
supermarkets_en['edeka']['Price'].isna().sum()

0

In [12]:
supermarkets_en['aldisouth']['Price'].head(50)

0         1.35
1         0.99
2         0.35
3         3.89
4         1.29
5         1.69
6         2.99
7         3.55
8         1.29
9         3.59
10        2.69
11        0.53
12        4.49
13        2.29
14        1.79
15        0.35
16        1.89
17        0.59
18        2.99
19        0.79
20        1.59
21        0.79
22        1.39
23        1.49
24        1.99
25        1.39
26        1.99
27         1.9
28        1.99
29        0.49
30        0.59
31        1.99
32        1.79
33        0.49
34       12.99
35        1.69
36        1.49
37        2.95
38        1.35
39        3.79
40        1.29
41        1.39
42        2.29
43        6.49
44        0.95
45    100 gram
46    100 gram
47    100 gram
48    100 gram
49        0.85
Name: Price, dtype: object

In [13]:
# Count how many Price values contain the string 'gram'
supermarkets_en['edeka']['Price'].str.contains('gram').sum()

82
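Matching on the substring 'gram' only catches one kind of bad value. A more general check, sketched here on a made-up Series, is to let `pd.to_numeric(..., errors='coerce')` turn anything unparseable into NaN and then separate those entries from values that were already missing:

```python
import pandas as pd

# Made-up Price column mixing numbers, a misplaced unit string and a missing value
prices = pd.Series([1.35, '0.99', '100 gram', None, 2.99], dtype=object)

# Unparseable entries become NaN instead of raising an error
converted = pd.to_numeric(prices, errors='coerce')

# Entries that were present in the raw data but could not be parsed as numbers
non_numeric = converted.isna() & prices.notna()

print(prices[non_numeric].tolist())
print(converted.dtype)
```

Inspecting `prices[non_numeric]` before dropping anything shows exactly which raw values will be lost in the conversion.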

In [14]:
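# Drop rows with a missing Price; reassign so the change reaches the DataFrame itself
supermarkets_en['rewe'] = supermarkets_en['rewe'].dropna(subset=['Price'])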

In [15]:
# To numeric for each DataFrame
for market in supermarkets_en:

    # Work on an explicit copy so the assignments below do not trigger SettingWithCopyWarning
    df = supermarkets_en[market].copy()

    # Drop the columns we do not use (column names differ slightly between files, hence errors='ignore')
    df = df.drop(columns=['Supermarket', 'Comparable Price', 'Unit', 'Pack Size', 'Pack size', 'Comparable price', 'Unnamed: 7'], errors='ignore')

    # Price to numeric; values that cannot be parsed, such as '100 gram', become NaN
    df['Price'] = pd.to_numeric(df['Price'], errors='coerce')

    # Drop the rows with NaN values and store the result back in the dictionary
    supermarkets_en[market] = df.dropna()


In [16]:
supermarkets_en['rewe'].dtypes

Name      object
Price    float64
dtype: object

#### 2.3 Drop columns that we do not need

In [17]:
supermarkets_en['edeka'].head()

Unnamed: 0,Name,Price
0,Coke,1.39
1,Nutella,1.77
2,Becel Gold 70% fat,1.49
3,Iglo fish fingers,2.99
4,Good & Cheap Landgasthof Goulash Pan,1.99


In [18]:
supermarkets_en['rewe'].head()

Unnamed: 0,Name,Price
0,Yes! Bread tip 60% fat,1.19
1,REWE Best Choice Brie 45% fat,0.69
2,REWE Best Choice Goat Cream Cheese Mousse 73%,1.99
3,Rewe Brie 45% fat,0.79
4,Nutella,3.79


In [19]:
supermarkets_en['rewe'].dtypes

Name      object
Price    float64
dtype: object

In [20]:
supermarkets_en['rewe'].isna().sum()

Name     0
Price    0
dtype: int64

In [21]:
supermarkets_en['aldisouth']

Unnamed: 0,Name,Price
0,Almare cream herring fillets and cream sauce,1.35
1,Coke,0.99
2,Puré de pudín Desira semolina,0.35
3,Nutella,3.89
4,Landvogt Original Schwäbische Maultaschen with...,1.29
...,...,...
3579,Obersteirische Molkerei / Aldi Nord Austrian m...,0.89
3580,Ofterdinger Aldi Nord carrot salad,6.99
3581,Mümmelmann Jagdbitter herbal liqueur,4.99
3582,KÜR Basic Shampoo Walnut,0.65


For now, we are going to proceed with this data. Possible future improvements:
- Normalize the sizes of the products
- Add the rest of the supermarkets
- Add more columns
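As a starting point for normalizing product sizes, the sketch below parses strings like those shown in the 'Pack size' column earlier ("Bowl 400 grams", "Pet bottle 1.25 L") into an amount in grams or millilitres. The function name `parse_pack_size`, the regular expression and the list of unit spellings are assumptions made for illustration; the real column likely contains formats this does not cover, and the column would have to be kept rather than dropped during cleaning.

```python
import re

# Conversion to a base unit: grams for weight, millilitres for volume.
# The spellings are based on the sample rows above; real data may contain others.
UNIT_FACTORS = {
    'g': ('g', 1), 'gram': ('g', 1), 'grams': ('g', 1),
    'kg': ('g', 1000),
    'ml': ('ml', 1),
    'l': ('ml', 1000),
}

# A number (optionally with decimals) followed by one of the known units
PATTERN = re.compile(r'(\d+(?:\.\d+)?)\s*(grams?|kg|g|ml|l)\b', re.IGNORECASE)


def parse_pack_size(text):
    """Return (amount, base_unit) for the first quantity found, or None."""
    match = PATTERN.search(text)
    if match is None:
        return None
    amount = float(match.group(1))
    base_unit, factor = UNIT_FACTORS[match.group(2).lower()]
    return amount * factor, base_unit


for sample in ['Bowl 400 grams', 'Pet bottle 1.25 L', 'Bottle of plastic 500 ml']:
    print(sample, '->', parse_pack_size(sample))
```

Only the first quantity is used, so a promotional string such as "Glass 880 grams + 80 g free" is read as 880 g; whether the free amount should be added is a modelling decision left open here.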

### 3. Export data

In [22]:
for market in supermarkets_en:
    supermarkets_en[market].to_pickle(f'data/products-clean/{market}-products-clean.pkl')
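To confirm that an export can be read back unchanged, a round trip through `pd.read_pickle` is a quick sanity check. The sketch below uses a made-up two-row table and a temporary directory instead of the real `data/products-clean/` folder, which has to exist before `to_pickle` is called:

```python
import os
import tempfile

import pandas as pd

# Made-up cleaned table with the same two columns as the exported data
clean = pd.DataFrame({'Name': ['Coke', 'Nutella'], 'Price': [0.99, 3.89]})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name following the pattern used in the export loop
    path = os.path.join(tmp_dir, 'example-products-clean.pkl')
    clean.to_pickle(path)
    restored = pd.read_pickle(path)

# The restored table should match the original in values, dtypes and index
print(restored.equals(clean))
print(restored.dtypes)
```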