# Web Scraping

#### We scrape the ingredient listings on INCI Beauty to collect each ingredient's name, its INCI Beauty rating, and its functions

Libraries

In [2]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import string
import re
# from pprint import pprint
# from lxml import html
# from lxml.html import fromstring
# import urllib.request
# from urllib.request import urlopen
# import random
# import scrapy

In [2]:
# Base URL used to build every ingredient listing page we need
url = 'https://incibeauty.com/ingredients/'

In [3]:
len(url)

35

We create a function that returns the uppercase alphabet as a list.

In [4]:
def listAlphabet():
    return list(string.ascii_uppercase)

In [5]:
# Store the alphabet list under its own name so the function is not overwritten
alphabet_pages = listAlphabet()

# Append '1': that page lists the ingredients whose names begin with a digit
alphabet_pages.append('1')

In [6]:
# Concatenate the base URL with each page key to get every listing URL
urls_to_extract = [url + page for page in alphabet_pages]

In [7]:
# 26 letters plus the '1' page: 27 links in total
len(urls_to_extract)

27

From the ingredient listings, we want the name, link, and effect (the rating color) of each ingredient.

In [86]:
def getting_links_effect_and_ing_names(url):
    html = requests.get(url).content
    # "html" lets BeautifulSoup choose among the installed HTML parsers
    soup = BeautifulSoup(html, "html")
    table = soup.find("table", attrs={"class": "table table-striped table-striped-reverse table-inci"})
    # Pair each ingredient link with a table row and a rating icon, in document order
    return zip(table.find_all('a', {'class': 'color-inherit'}),
               table.find_all("tr"),
               table.find_all('img', {'class': 'fleur'}))
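
The function above relies on `zip`, which stops at the shortest input. If a row ever lacks a link or a rating icon, the three lists drift out of alignment without any error. The sketch below illustrates that pitfall on made-up data and shows one possible guard; `zip_aligned` and the sample values are hypothetical, not part of the original notebook.

```python
def zip_aligned(*columns):
    """Zip columns, but refuse to silently truncate when their lengths differ."""
    lengths = {len(column) for column in columns}
    if len(lengths) > 1:
        raise ValueError(f"column lengths differ: {sorted(lengths)}")
    return list(zip(*columns))


# Hypothetical sample: three links, but one row had no rating icon.
links = ["/ingredients/a", "/ingredients/b", "/ingredients/c"]
ratings = ["vert", "orange"]

# Plain zip keeps only two pairs, and nothing says which link lost its rating.
silently_truncated = list(zip(links, ratings))
print(silently_truncated)

# The guarded version reports the mismatch instead.
try:
    zip_aligned(links, ratings)
except ValueError as error:
    print("mismatch detected:", error)
```

To use it here, each `find_all` result would need to be collected into a list before the length check.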

We use a list comprehension to build a list of tuples holding each ingredient's name, link, and effect.

In [87]:
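table_data_ingredients = [
    # item['src'][32:-4] drops the first 32 characters (the URL prefix) and the
    # last 4 (the file extension), leaving the rating color
    (td.text.replace('\n', '').strip(), link['href'], item['src'][32:-4])
    for url in urls_to_extract
    for link, tr, item in getting_links_effect_and_ing_names(url)
    for td in tr.find_all("td")
    if 'href' in link.attrs
    # Keep only non-empty cells written entirely in upper case (the ingredient names)
    if td.text.replace('\n', '').strip()
    if td.text.replace('\n', '').strip() == td.text.replace('\n', '').strip().upper()
]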
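
The slice `[32:-4]` only works while the image URL keeps the same prefix length and a three-letter extension. As a hedged alternative, the sketch below takes the rating from the file name of the image path. The helper name `rating_from_src` and the example URLs are hypothetical, and it assumes the rating is the image's file name without its extension, as the values `vert`, `jaune`, and `orange` suggest.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def rating_from_src(src):
    """Return the file name of an image URL without directory or extension."""
    path = urlparse(src).path          # drops scheme, host, and query string
    return PurePosixPath(path).stem    # "vert.png" -> "vert"


# Hypothetical URLs, used only to show the behaviour.
print(rating_from_src("https://example.com/img/fleurs/vert.png"))
print(rating_from_src("https://example.com/a/b/orange.svg?v=3"))
```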

We now create the dataframe that will hold all the ingredient data.

In [91]:
df = pd.DataFrame(table_data_ingredients, columns=['Names', 'Links', 'Effect'])

In [92]:
df.describe()

| | Names | Links | Effect |
|---|---|---|---|
| count | 13004 | 13004 | 13004 |
| unique | 13004 | 12993 | 4 |
| top | STEARYL DIHYDROXYPROPYLDIMONIUM OLIGOSACCHARIDES | https://incibeauty.com/ingredients/16499-disod... | vert |
| freq | 1 | 2 | 9409 |


In [94]:
df[df.duplicated(subset='Links', keep=False)]

| | Names | Links | Effect |
|---|---|---|---|
| 648 | AMMONIUM LACTATE | https://incibeauty.com/ingredients/2557-ammoni... | vert |
| 649 | E328 | https://incibeauty.com/ingredients/2557-ammoni... | vert |
| 834 | ARACHIDYL ALCOHOL (AND) BEHENYL ALCOHOL (AND) ... | https://incibeauty.com/ingredients/25430-arach... | jaune |
| 835 | MONTANOV 202 | https://incibeauty.com/ingredients/25430-arach... | jaune |
| 2196 | CELLULOSE ACETATE BUTYRATE | https://incibeauty.com/ingredients/17422-cellu... | vert |
| 2197 | CAB | https://incibeauty.com/ingredients/17422-cellu... | vert |
| 3916 | DISODIUM LAURETH SULFOSUCCINATE | https://incibeauty.com/ingredients/16499-disod... | orange |
| 3917 | DLS | https://incibeauty.com/ingredients/16499-disod... | orange |
| 3975 | DISUBSTITUTED ALANINAMIDE | https://incibeauty.com/ingredients/44656-disub... | vert |
| 3976 | DSAA | https://incibeauty.com/ingredients/44656-disub... | vert |


There are fewer unique links than names because some ingredients are also listed under an acronym or alternative name that shares the link of the full name.

We now move on to getting the function(s) of each ingredient.

In [138]:
def usage_ing(url):
    html = requests.get(url).content
    soup = BeautifulSoup(html, "html")
    # usage_content = soup.find_all('ul', {'class': 'fonctions-inci'})
    # Every <i> element on the page is taken as a function label
    usage_content = soup.find_all('i')
    # Remove ' :' from each label and drop labels that end up empty
    list_usages = [item.text.replace(' :', '') for item in usage_content if item.text.replace(' :', '')]
    return list_usages

In [140]:
# One HTTP request per row (13,004 in total), so this cell takes a long time to run
df['Usages'] = df['Links'].apply(usage_ing)
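
Because some rows share a link, a few pages are downloaded more than once. A minimal sketch of one way to fetch each distinct link only once, with an optional pause between requests, is shown below. `fetch_unique` and its parameters are hypothetical names introduced for illustration; the fetching function is passed in so the sketch can be tried without network access.

```python
import time


def fetch_unique(links, fetch, delay_seconds=0.0):
    """Call fetch once per distinct link and return one result per input link."""
    cache = {}
    for link in links:
        if link not in cache:
            # Pause between real requests, but not before the first one.
            if cache and delay_seconds:
                time.sleep(delay_seconds)
            cache[link] = fetch(link)
    return [cache[link] for link in links]


# Stand-in for a real page fetcher, used only for this demonstration.
def fake_fetch(link):
    return link.upper()


print(fetch_unique(["a", "b", "a"], fake_fetch))
```

In the notebook this could be called as `df['Usages'] = fetch_unique(df['Links'], usage_ing, delay_seconds=0.5)`, where the half-second delay is an arbitrary choice.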

Let's save our dataframe to a CSV file.

In [148]:
df.to_csv('incibeauty_ingredients.csv', index=False)

In [4]:
# Note: the file was written to the working directory above; this read expects it under ./Data/
data = pd.read_csv('./Data/incibeauty_ingredients.csv')

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13004 entries, 0 to 13003
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Names   13004 non-null  object
 1   Links   13004 non-null  object
 2   Effect  13004 non-null  object
 3   Usages  13004 non-null  object
dtypes: object(4)
memory usage: 406.5+ KB
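
The `Usages` column held Python lists before saving, but a CSV file stores each list as its text representation, so after `read_csv` every value is a string such as `"['A', 'B']"`. The sketch below shows one way to turn such a string back into a list with the standard library. `parse_usages` is a hypothetical helper and the sample labels are made up.

```python
import ast


def parse_usages(text):
    """Convert the text form of a list of strings back into a real list."""
    value = ast.literal_eval(text)  # safely evaluates Python literals only
    if not isinstance(value, list):
        raise ValueError(f"expected a list, got {type(value).__name__}")
    return value


# Made-up sample value in the form a CSV round trip produces.
print(parse_usages("['Emollient', 'Humectant']"))
```

It could be applied with `data['Usages'] = data['Usages'].apply(parse_usages)`.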
