## Homework, lesson 4

In this dataset: https://raw.githubusercontent.com/fspot/INFMDI-721/master/lesson5/products.csv,
each row corresponds to a food product put up for sale by a user.

Goal: clean the dataset.

- We would like a column of prices unified in euros. Problem: the currency is not given for every product, so the missing currencies have to be "guessed" from the user's IP address.
- The "infos" column lists ingredients present in the product. We would rather have one bool column per ingredient, indicating whether or not the product contains that ingredient.

Here is a list of APIs that may be useful: https://github.com/public-apis/public-apis (but you may use others if you prefer).


In [1]:
# import needed modules
import requests
import numpy as np
import pandas as pd

# for regex
import re

In [2]:
# load csv file into pandas DataFrame
products = pd.read_csv('products.csv', sep=';')

products.head()

Unnamed: 0,username,ip_address,product,price,infos
0,ldrover0,666.666.666.666,Clam - Cherrystone,712.8,May contain sugar
1,kizakov1,nope,Soup - Campbells Bean Medley,379.26,Contains peanut and fish
2,abromet2,240.177.79.234,Island Oasis - Lemonade,305.96,Ingredients: mustard and fish
3,kkarolowski3,26.191.237.49,"Water - Mineral, Natural",350.15,Contains gluten
4,mbuckney4,58.90.204.239,Radish - Pickled,949.79,"May contain sugar, egg and fish"


In [3]:
print(products.shape)

products.dtypes

(200, 5)


username      object
ip_address    object
product       object
price         object
infos         object
dtype: object

In [4]:
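def get_geolocation_from_ip(ip):
    """
    Return the geolocation JSON for an IP address, or None if the lookup fails

    We use https://ipgeolocationapi.com/
    """
    url = "https://api.ipgeolocationapi.com/geolocate/{}".format(ip)
    try:
        response = requests.get(url, timeout=10)
    except requests.RequestException:
        # network error or timeout: treat the location as unknown
        return None

    # a Response is truthy only when its status code is below 400
    if response:
        return response.json()
    return None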
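
Several addresses in the dataset are not usable at all ("nope", "666.666.666.666"), so it can help to filter them locally before spending an API call. The sketch below uses the standard `ipaddress` module; the helper name `is_public_ip` is an assumption of this example, not part of the notebook.

```python
import ipaddress


def is_public_ip(value):
    """Return True only for a valid, globally routable IP address."""
    try:
        return ipaddress.ip_address(value).is_global
    except ValueError:
        # not an IP address at all, e.g. "nope" or "666.666.666.666"
        return False


# sample values taken from the first rows of the dataset
for ip in ["666.666.666.666", "nope", "240.177.79.234", "26.191.237.49"]:
    print(ip, is_public_ip(ip))
```

Reserved ranges such as 240.0.0.0/4 are rejected as well, which is consistent with such rows getting no currency from the geolocation service.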

In [5]:
# build 'currency_code' column (one API call per row; None when the lookup fails)
geolocations = [get_geolocation_from_ip(ip) for ip in products['ip_address']]
products['currency_code'] = [geo.get('currency_code') if geo else None
                             for geo in geolocations]

products.head()

Unnamed: 0,username,ip_address,product,price,infos,currency_code
0,ldrover0,666.666.666.666,Clam - Cherrystone,712.8,May contain sugar,
1,kizakov1,nope,Soup - Campbells Bean Medley,379.26,Contains peanut and fish,
2,abromet2,240.177.79.234,Island Oasis - Lemonade,305.96,Ingredients: mustard and fish,
3,kkarolowski3,26.191.237.49,"Water - Mineral, Natural",350.15,Contains gluten,USD
4,mbuckney4,58.90.204.239,Radish - Pickled,949.79,"May contain sugar, egg and fish",JPY
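
The column above is built with one HTTP request per row. If the same address appears several times, a small cache avoids repeating the call. This is a minimal sketch: `currency_codes` and `fake_lookup` are hypothetical names, and the fake lookup only stands in for the real API so that the example runs offline.

```python
def currency_codes(ips, lookup):
    """Return one currency code (or None) per IP, calling `lookup` once per distinct IP."""
    cache = {}
    codes = []
    for ip in ips:
        if ip not in cache:
            cache[ip] = lookup(ip)
        data = cache[ip]
        codes.append(data.get("currency_code") if data else None)
    return codes


# stand-in for the real API call (hypothetical data)
def fake_lookup(ip):
    known = {"26.191.237.49": {"currency_code": "USD"}}
    return known.get(ip)


print(currency_codes(["26.191.237.49", "nope", "26.191.237.49"], fake_lookup))
# ['USD', None, 'USD']
```

With the notebook's own function, the call would presumably read `currency_codes(products['ip_address'], get_geolocation_from_ip)`.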


In [6]:
# exchange rates with EUR as the base:
# rates[c] is the amount of currency c that is worth 1 EUR
exchange_rate_to_eur = requests.get(
    "https://api.exchangerate-api.com/v4/latest/EUR").json()
#exchange_rate_to_eur = requests.get("https://api.ratesapi.io/api/latest").json()
#exchange_rate_to_eur = requests.get("https://api.exchangeratesapi.io/latest").json()

# build 'rate_to_euro' column (None when the currency is unknown)
products['rate_to_euro'] = [exchange_rate_to_eur['rates'].get(c)
                            for c in products['currency_code']]

products.head()

Unnamed: 0,username,ip_address,product,price,infos,currency_code,rate_to_euro
0,ldrover0,666.666.666.666,Clam - Cherrystone,712.8,May contain sugar,,
1,kizakov1,nope,Soup - Campbells Bean Medley,379.26,Contains peanut and fish,,
2,abromet2,240.177.79.234,Island Oasis - Lemonade,305.96,Ingredients: mustard and fish,,
3,kkarolowski3,26.191.237.49,"Water - Mineral, Natural",350.15,Contains gluten,USD,1.114136
4,mbuckney4,58.90.204.239,Radish - Pickled,949.79,"May contain sugar, egg and fish",JPY,120.932224
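
Because the rates are quoted per euro (1 EUR = 1.114136 USD in the output above), converting a foreign price to euros means dividing by the rate. The sketch below shows this on a toy frame and uses `Series.map` as a possible alternative to the list comprehension; the frame `toy` and the hard-coded `rates` dict are assumptions made for the example.

```python
import pandas as pd

# sample rates copied from the output above (units of each currency per 1 EUR)
rates = {"USD": 1.114136, "JPY": 120.932224}

toy = pd.DataFrame({
    "price": [712.8, 350.15, 949.79],
    "currency_code": [None, "USD", "JPY"],
})

# Series.map looks each code up in the dict; unknown or missing codes become NaN
toy["rate_to_euro"] = toy["currency_code"].map(rates)

# the rates are quoted per euro, so a foreign price is divided by its rate
toy["price_in_euro"] = toy["price"] / toy["rate_to_euro"]
print(toy)
```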


In [7]:
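# Work on a copy
df = products.copy()

# convert price to float (unparsable values become NaN)
df['price'] = pd.to_numeric(df['price'], errors='coerce')

# drop rows with a missing value, including prices that failed to parse
df = df.dropna()

# create a new column with price in euros
# (rate_to_euro is quoted per euro, so we divide)
df['price_in_euro'] = df['price'] / df['rate_to_euro']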

df.head()

Unnamed: 0,username,ip_address,product,price,infos,currency_code,rate_to_euro,price_in_euro
3,kkarolowski3,26.191.237.49,"Water - Mineral, Natural",350.15,Contains gluten,USD,1.114136,314.279406
4,mbuckney4,58.90.204.239,Radish - Pickled,949.79,"May contain sugar, egg and fish",JPY,120.932224,7.853903
7,avowdon7,189.169.17.54,Dc Hikiage Hira Huba,111.56,Contains sugar,MXN,21.351934,5.224819
8,epridham8,187.129.113.105,Dried Figs,88.05,"Ingredients: sugar, milk and fish",MXN,21.351934,4.123748
9,tkendrew9,22.32.234.215,Pop - Club Soda Can,861.25,"May contain peanut, sugar, milk and fish",USD,1.114136,773.020529
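
The `price` column is loaded as `object`, which suggests that at least one value is not a plain number. As a small illustration of what `errors='coerce'` does in that case, here is a sketch on made-up values (the string "n/a" is hypothetical, not taken from the dataset):

```python
import pandas as pd

# hypothetical raw values: two clean prices and one that cannot be parsed
raw = pd.Series(["712.8", "379.26", "n/a"])

prices = pd.to_numeric(raw, errors="coerce")
print(prices.dtype)         # float64
print(prices.isna().sum())  # 1
```

A value that cannot be parsed becomes NaN rather than raising an error, so a `dropna()` only removes such rows if it runs after the conversion.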


In [38]:
# Make the list of ingredients
df2 = df.copy()

# pattern matching every non-alphanumeric character
non_alnum = re.compile('[^a-zA-Z0-9]')

# Create new column from 'infos'
# 1. convert to lower case
# 2. replace punctuation with spaces
# 3. split into a list of words (split() already drops empty strings)
df2['info_as_list'] = df2.infos \
    .str.lower() \
    .apply(lambda text: non_alnum.sub(" ", text)) \
    .str.split()

# get the set of unique words
results = set()
for words in df2['info_as_list']:
    results.update(words)
ingredients = list(results)

# remove non-ingredient words from the list; compare whole words so that
# a word merely containing a stop word (such as "and") is left intact
stop = {'may', 'contains', 'and', 'contain', 'ingredients'}
ingredients = [w for w in ingredients if w not in stop]

print(ingredients)

['milk', 'egg', 'peanut', 'sugar', 'soja', 'mustard', 'fish', 'gluten']
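
The same extraction can be checked on a single string without pandas. The following sketch assumes the stop words listed in the cell above; the function name `extract_ingredients` is introduced here for illustration only.

```python
import re

# words that describe the sentence, not the product
STOP_WORDS = {"may", "contain", "contains", "and", "ingredients"}


def extract_ingredients(infos):
    """Lower-case the text, keep alphanumeric words, and drop the stop words."""
    words = re.findall(r"[a-z0-9]+", infos.lower())
    return [w for w in words if w not in STOP_WORDS]


print(extract_ingredients("May contain sugar, egg and fish"))
# ['sugar', 'egg', 'fish']
```

Comparing whole words against a set, rather than substituting a regex inside each word, keeps an ingredient intact even when it happens to contain a stop word.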


In [37]:
for ingredient in ingredients:
    df2[ingredient] = df2.info_as_list.apply(lambda cell: ingredient in cell)
    
df2.head()

Unnamed: 0,username,ip_address,product,price,infos,currency_code,rate_to_euro,price_in_euro,info_as_list,milk,egg,peanut,sugar,soja,mustard,fish,gluten
3,kkarolowski3,26.191.237.49,"Water - Mineral, Natural",350.15,Contains gluten,USD,1.114136,314.279406,"[contains, gluten]",False,False,False,False,False,False,False,True
4,mbuckney4,58.90.204.239,Radish - Pickled,949.79,"May contain sugar, egg and fish",JPY,120.932224,7.853903,"[may, contain, sugar, egg, and, fish]",False,True,False,True,False,False,True,False
7,avowdon7,189.169.17.54,Dc Hikiage Hira Huba,111.56,Contains sugar,MXN,21.351934,5.224819,"[contains, sugar]",False,False,False,True,False,False,False,False
8,epridham8,187.129.113.105,Dried Figs,88.05,"Ingredients: sugar, milk and fish",MXN,21.351934,4.123748,"[ingredients, sugar, milk, and, fish]",True,False,False,True,False,False,True,False
9,tkendrew9,22.32.234.215,Pop - Club Soda Can,861.25,"May contain peanut, sugar, milk and fish",USD,1.114136,773.020529,"[may, contain, peanut, sugar, milk, and, fish]",True,False,True,True,False,False,True,False
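
As a possible alternative to the loop, pandas can build all the boolean columns in one step with `Series.str.get_dummies`. This is only a sketch on toy data: the series `words` is hypothetical and is assumed to hold ingredient lists with the stop words already removed (otherwise words such as "may" would get a column too).

```python
import pandas as pd

# toy stand-in for the cleaned ingredient lists (hypothetical rows)
words = pd.Series([["gluten"], ["sugar", "egg", "fish"], ["sugar"]])

# join each list with a separator, then let get_dummies split it back into columns
flags = words.str.join("|").str.get_dummies(sep="|").astype(bool)
print(flags)
```

The resulting frame shares the index of `words`, so it could be attached to the original frame with `pd.concat([...], axis=1)`.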
