# Homework lesson4

Due 23 October 2019

In this dataset: https://raw.githubusercontent.com/fspot/INFMDI-721/master/lesson5/products.csv, each row corresponds to a food product put up for sale by a user.

Objective: clean the dataset.

We would like a column of prices unified in euros. Problem: the currency is not specified for every product, so the missing currencies will have to be "guessed" from the user's IP address.
The "infos" column lists the ingredients present in the product. We would prefer one bool column per ingredient, indicating whether or not the product contains that ingredient.
Here is a list of APIs that may be useful: https://github.com/public-apis/public-apis (but you may use others if you wish).

In [1]:
import requests
import re
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
df = pd.read_csv('../../fspot-infmdi-721/lesson5/products.csv', sep=';')

### Get the approximate currency from IP

In [3]:
def getCurrencyFromIp(ip):
    # Only query the API for strings shaped like a dotted IPv4 address
    if isinstance(ip, str) and re.fullmatch(r'(\d{1,3}\.){3}\d{1,3}', ip):
        response = requests.get('http://ip-api.com/json/%s?fields=status,currency,country,city' % ip)
        if response.status_code == 200:
            reverseGeoInfo = response.json()
            # The API reports lookup failures through the 'status' field
            if reverseGeoInfo['status'] == 'success':
                return reverseGeoInfo['currency']
    return None

Test

In [4]:
getCurrencyFromIp('666.666.666.666'), getCurrencyFromIp('26.191.237.49'), getCurrencyFromIp('58.90.204.239'), getCurrencyFromIp('226.52.32.70'), getCurrencyFromIp('nope')

(None, 'USD', 'JPY', None, None)
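
The regular expression above only checks the shape of the string, so addresses such as '666.666.666.666' still cost a network round trip before being rejected. A minimal sketch of an offline pre-check, using the standard library `ipaddress` module, could filter them out first. The helper name `is_public_ipv4` is hypothetical and not part of the original notebook.

```python
import ipaddress


def is_public_ipv4(ip):
    """Return True only for a valid, publicly routable, non-multicast IPv4 string."""
    if not isinstance(ip, str):
        return False
    try:
        address = ipaddress.IPv4Address(ip)
    except ipaddress.AddressValueError:
        # Raised for malformed input such as 'nope' or octets above 255
        return False
    # Loopback, reserved and private ranges are not global; multicast is excluded explicitly
    return address.is_global and not address.is_multicast
```

Loopback addresses (127.0.0.0/8), the reserved 240.0.0.0/4 block and multicast addresses (224.0.0.0/4) are rejected by this check, which may explain why rows such as 127.197.204.119 or 240.177.79.234 end up without a currency.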

Apply

In [5]:
df['currency'] = df['ip_address'].apply(getCurrencyFromIp)
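
Applying the lookup row by row issues one HTTP request per row, even when the same address appears several times, and free geolocation endpoints are usually rate limited. A possible sketch, assuming repeated addresses exist in the data, is to resolve each distinct value only once. The helper `lookup_unique` is a hypothetical name introduced for illustration.

```python
def lookup_unique(values, lookup):
    """Call `lookup` once per distinct value and return the results in input order."""
    cache = {}
    results = []
    for value in values:
        if value not in cache:
            # First time this value is seen: perform the (possibly slow) lookup
            cache[value] = lookup(value)
        results.append(cache[value])
    return results
```

With the function defined earlier, this could be used as `df['currency'] = lookup_unique(df['ip_address'], getCurrencyFromIp)`.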

In [6]:
df.head(10)

Unnamed: 0,username,ip_address,product,price,infos,currency
0,ldrover0,666.666.666.666,Clam - Cherrystone,712.8,May contain sugar,
1,kizakov1,nope,Soup - Campbells Bean Medley,379.26,Contains peanut and fish,
2,abromet2,240.177.79.234,Island Oasis - Lemonade,305.96,Ingredients: mustard and fish,
3,kkarolowski3,26.191.237.49,"Water - Mineral, Natural",350.15,Contains gluten,USD
4,mbuckney4,58.90.204.239,Radish - Pickled,949.79,"May contain sugar, egg and fish",JPY
5,bsnozzwell5,226.52.32.70,Oil - Sesame,354.33 MGA,Ingredients: sugar and milk,
6,afairholme6,127.197.204.119,Chicken - Tenderloin,484.83,May contain sugar,
7,avowdon7,189.169.17.54,Dc Hikiage Hira Huba,111.56,Contains sugar,MXN
8,epridham8,187.129.113.105,Dried Figs,88.05,"Ingredients: sugar, milk and fish",MXN
9,tkendrew9,22.32.234.215,Pop - Club Soda Can,861.25,"May contain peanut, sugar, milk and fish",USD


In [7]:
# Some prices embed the currency code, e.g. '354.33 MGA'
entriesWithCurrency = df['price'].str.contains(r' \w+')
df.loc[entriesWithCurrency, ['currency']] = df.loc[entriesWithCurrency, 'price'].str.extract(r' (\w+)$').to_numpy()
# Note: '^(\d+)' keeps only the integer part of the amount ('354.33 MGA' becomes 354)
df.loc[entriesWithCurrency, ['price']] = df.loc[entriesWithCurrency, 'price'].str.extract(r'^(\d+)').to_numpy()
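
As an alternative to the two separate extractions, a single pattern can capture the amount (decimals included) and the optional currency code in one pass. This is only a sketch: the name `split_price` is hypothetical, and it assumes currency codes are three letters, as in the rows shown in this notebook.

```python
import re

# Amount with optional decimals, optionally followed by a three-letter code
PRICE_PATTERN = re.compile(r'^\s*(\d+(?:\.\d+)?)(?:\s+([A-Za-z]{3}))?\s*$')


def split_price(raw):
    """Split a raw price such as '354.33 MGA' into (amount, currency code or None)."""
    match = PRICE_PATTERN.match(str(raw))
    if match is None:
        # Unparseable input: report both parts as missing
        return None, None
    amount, code = match.groups()
    return float(amount), (code.upper() if code else None)
```

For example, `split_price('354.33 MGA')` keeps the full amount 354.33 together with the code 'MGA', while a plain '712.8' yields no currency code.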

In [8]:
df['price'] = pd.to_numeric(df.price)

In [9]:
# isna().sum() counts missing values; count() only counts non-missing ones
print('Items without a price :', df['price'].isna().sum())
print('Items with a currency :', df['currency'].notna().sum())

Items without a price : 0
Items with a currency : 50


### Get the Euro rates and compute prices in Euro

In [21]:
ratesToEuro = requests.get('https://api.exchangerate-api.com/v4/latest/EUR').json()['rates']
ratesToEuro

{'EUR': 1,
 'AED': 4.094042,
 'ARS': 65.21342,
 'AUD': 1.623419,
 'BGN': 1.955823,
 'BRL': 4.58862,
 'BSD': 1.114952,
 'CAD': 1.458436,
 'CHF': 1.100472,
 'CLP': 807.974322,
 'CNY': 7.883909,
 'COP': 3841.380952,
 'CZK': 25.578905,
 'DKK': 7.470757,
 'DOP': 58.711063,
 'EGP': 18.014515,
 'FJD': 2.438087,
 'GBP': 0.860991,
 'GTQ': 8.663838,
 'HKD': 8.735032,
 'HRK': 7.438175,
 'HUF': 329.84622,
 'IDR': 15753.663889,
 'ILS': 3.942027,
 'INR': 79.010008,
 'ISK': 139.237966,
 'JPY': 120.920177,
 'KRW': 1305.44019,
 'KZT': 433.704301,
 'MXN': 21.294264,
 'MYR': 4.662728,
 'NOK': 10.178967,
 'NZD': 1.7386,
 'PAB': 1.114952,
 'PEN': 3.724189,
 'PHP': 56.998617,
 'PKR': 173.855603,
 'PLN': 4.278141,
 'PYG': 7333.545455,
 'RON': 4.7593,
 'RUB': 70.977655,
 'SAR': 4.179473,
 'SEK': 10.731358,
 'SGD': 1.517153,
 'THB': 33.743422,
 'TRY': 6.493398,
 'TWD': 34.056081,
 'UAH': 27.70302,
 'USD': 1.113788,
 'UYU': 41.64636,
 'VND': 26037.5,
 'ZAR': 16.352391}

In [11]:
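def convertToEuro(price, currency):
    # ratesToEuro holds units of each currency per 1 EUR, hence the division
    if currency in ratesToEuro:
        return price / ratesToEuro[currency]
    # Sentinel for a missing or unsupported currency
    return -1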

In [12]:
convertToEuro(949.79, 'JPY')

7.854685823028525
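
The -1 sentinel can be mistaken for a real price and would skew sums or means computed on the euro column. A possible variant, sketched below under the assumption that missing conversions should stay missing, returns NaN instead, which pandas aggregations skip by default. The name `convert_to_euro` and its explicit `rates` parameter are hypothetical.

```python
import math


def convert_to_euro(price, currency, rates):
    """Convert `price` to euros; `rates` maps a currency code to units per 1 EUR."""
    rate = rates.get(currency)
    if rate is None:
        # Unknown or missing currency: keep the converted price missing
        return math.nan
    return price / rate
```

With the JPY rate listed above, 949.79 / 120.920177 gives about 7.85 euros, matching the result of the original function.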

In [13]:
df['price_euro'] = df[['price', 'currency']].apply(lambda row: convertToEuro(row['price'], row['currency']), axis=1)

In [14]:
df.head(10)

Unnamed: 0,username,ip_address,product,price,infos,currency,price_euro
0,ldrover0,666.666.666.666,Clam - Cherrystone,712.8,May contain sugar,,-1.0
1,kizakov1,nope,Soup - Campbells Bean Medley,379.26,Contains peanut and fish,,-1.0
2,abromet2,240.177.79.234,Island Oasis - Lemonade,305.96,Ingredients: mustard and fish,,-1.0
3,kkarolowski3,26.191.237.49,"Water - Mineral, Natural",350.15,Contains gluten,USD,314.377601
4,mbuckney4,58.90.204.239,Radish - Pickled,949.79,"May contain sugar, egg and fish",JPY,7.854686
5,bsnozzwell5,226.52.32.70,Oil - Sesame,354.0,Ingredients: sugar and milk,MGA,-1.0
6,afairholme6,127.197.204.119,Chicken - Tenderloin,484.83,May contain sugar,,-1.0
7,avowdon7,189.169.17.54,Dc Hikiage Hira Huba,111.56,Contains sugar,MXN,5.23897
8,epridham8,187.129.113.105,Dried Figs,88.05,"Ingredients: sugar, milk and fish",MXN,4.134916
9,tkendrew9,22.32.234.215,Pop - Club Soda Can,861.25,"May contain peanut, sugar, milk and fish",USD,773.262057


### Extracting the list of ingredients

#### Cleanup and split the lists of ingredients

In [15]:
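# regex=True is needed for the alternation on pandas >= 2.0, where str.replace is literal by default
df['ingredients'] = df['infos'].str.replace('Ingredients:|Contains|May contain', '', regex=True) \
    .str.replace(' and ', ',').str.replace(' ', '')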
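
Removing every space works for the single-word ingredients seen in this dataset, but it would fuse a multi-word ingredient into one token. A slightly more defensive sketch, written in plain Python and assuming the three prefixes above are the only ones, splits on commas and on the standalone word "and". The name `parse_ingredients` is hypothetical.

```python
import re

# Leading phrase that introduces the ingredient list
PREFIX = re.compile(r'^(?:Ingredients:|Contains|May contain)\s*')


def parse_ingredients(infos):
    """Turn 'May contain sugar, egg and fish' into ['sugar', 'egg', 'fish']."""
    without_prefix = PREFIX.sub('', infos.strip())
    # Split on commas or on the whole word 'and' (so 'candy' is left intact)
    parts = re.split(r',|\band\b', without_prefix)
    return [part.strip() for part in parts if part.strip()]
```

It could be applied with `df['infos'].apply(parse_ingredients)`, which yields one list per row.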

#### One hot encoding of the ingredients

In [19]:
df = df.join(df.ingredients.str.get_dummies(','))
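
`get_dummies` produces 0/1 integer columns, whereas the assignment asks for bool columns. To make the encoding step explicit, here is a small pure-Python sketch of the same idea that yields booleans directly. The name `one_hot` is hypothetical, and the input is assumed to be one list of ingredient names per product.

```python
def one_hot(ingredient_lists):
    """Map each list of ingredients to a dict of booleans over the full vocabulary."""
    # The vocabulary is every distinct ingredient seen, sorted for stable column order
    vocabulary = sorted({name for names in ingredient_lists for name in names})
    return [
        {name: (name in names) for name in vocabulary}
        for names in ingredient_lists
    ]
```

In pandas itself, converting the integer indicator columns with `.astype(bool)` should give the same result.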

In [20]:
df.head(10)

Unnamed: 0,username,ip_address,product,price,infos,currency,price_euro,ingredients,egg,fish,gluten,milk,mustard,peanut,soja,sugar
0,ldrover0,666.666.666.666,Clam - Cherrystone,712.8,May contain sugar,,-1.0,sugar,0,0,0,0,0,0,0,1
1,kizakov1,nope,Soup - Campbells Bean Medley,379.26,Contains peanut and fish,,-1.0,"peanut,fish",0,1,0,0,0,1,0,0
2,abromet2,240.177.79.234,Island Oasis - Lemonade,305.96,Ingredients: mustard and fish,,-1.0,"mustard,fish",0,1,0,0,1,0,0,0
3,kkarolowski3,26.191.237.49,"Water - Mineral, Natural",350.15,Contains gluten,USD,314.377601,gluten,0,0,1,0,0,0,0,0
4,mbuckney4,58.90.204.239,Radish - Pickled,949.79,"May contain sugar, egg and fish",JPY,7.854686,"sugar,egg,fish",1,1,0,0,0,0,0,1
5,bsnozzwell5,226.52.32.70,Oil - Sesame,354.0,Ingredients: sugar and milk,MGA,-1.0,"sugar,milk",0,0,0,1,0,0,0,1
6,afairholme6,127.197.204.119,Chicken - Tenderloin,484.83,May contain sugar,,-1.0,sugar,0,0,0,0,0,0,0,1
7,avowdon7,189.169.17.54,Dc Hikiage Hira Huba,111.56,Contains sugar,MXN,5.23897,sugar,0,0,0,0,0,0,0,1
8,epridham8,187.129.113.105,Dried Figs,88.05,"Ingredients: sugar, milk and fish",MXN,4.134916,"sugar,milk,fish",0,1,0,1,0,0,0,1
9,tkendrew9,22.32.234.215,Pop - Club Soda Can,861.25,"May contain peanut, sugar, milk and fish",USD,773.262057,"peanut,sugar,milk,fish",0,1,0,1,0,1,0,1
