# Web Scraping

#### We scrape the facial care products listed on Beauté Test to collect their names, ingredients, sizes, prices, the skin types they suit best, and any other features that may be of interest.

Libraries

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

We first go through the listing pages, then visit each product page to collect its ingredients and other useful information.

In [2]:
bt_hp = 'https://www.beaute-test.com/'

In [3]:
links_listing_day_cream = [bt_hp + 'cremes_jour.php?page=' + str(i) for i in range(1, 65)]

In [4]:
links_listing_night_cream = [bt_hp + 'cremes_nuit.php?page=' + str(i) for i in range(1, 29)]

In [5]:
links_listing_day_and_night_cream = [bt_hp + 'cremes_jour_nuit.php?page=' + str(i) for i in range(1, 82)]

In [6]:
links_listing_face_and_body_cream = [bt_hp + 'visage_et_corps.php?page=' + str(i) for i in range(1, 20)]

In [7]:
links_listing_eye_cream = [bt_hp + 'contours_des_yeux.php?page=' + str(i) for i in range(1, 49)]

In [8]:
links_listing_serum = [bt_hp + 'serums.php?page=' + str(i) for i in range(1, 82)]

In [9]:
links_listing_exfoliation = [bt_hp + 'gommages_visage.php?page=' + str(i) for i in range(1, 29)]

In [10]:
links_listing_mask = [bt_hp + 'masques_de_beaute.php?page=' + str(i) for i in range(1, 80)]

In [11]:
links_listing_lip_balms = [bt_hp + 'baumes_levres.php?page=' + str(i) for i in range(1, 45)]

In [12]:
links_listing_specific_skin_treatments = [bt_hp + 'soins_specifiques.php?page=' + str(i) for i in range(1, 63)]

In [13]:
links_listing_make_up_removers = [bt_hp + 'demaquillants.php?page=' + str(i) for i in range(1, 95)]

In [14]:
links_listing_eye_make_up_removers = [bt_hp + 'demaquillants_yeux.php?page=' + str(i) for i in range(1, 10)]

In [15]:
links_listing_lotions = [bt_hp + 'lotions.php?page=' + str(i) for i in range(1, 46)]

In [16]:
cosmetics_links = [links 
                   for sublist in [links_listing_day_cream, links_listing_night_cream, links_listing_day_and_night_cream,
                                  links_listing_face_and_body_cream, links_listing_eye_cream, links_listing_serum,
                                  links_listing_exfoliation, links_listing_mask, links_listing_lip_balms,
                                  links_listing_specific_skin_treatments, links_listing_make_up_removers,
                                  links_listing_eye_make_up_removers, links_listing_lotions]
                   for links in sublist]
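
The listing lists above differ only in their page slug and their last page number, so the same flat list could be generated from a single mapping. The following is a minimal sketch of that idea, not the notebook's own code: `last_page_by_category` and `build_listing_links` are hypothetical names, and the page counts are simply copied from the ranges used above.

```python
# Base URL as defined in the notebook.
bt_hp = 'https://www.beaute-test.com/'

# Listing slug -> last page number, copied from the ranges used above
# (range(65)[1:] covers pages 1 to 64, and so on).
last_page_by_category = {
    'cremes_jour': 64,
    'cremes_nuit': 28,
    'cremes_jour_nuit': 81,
    'visage_et_corps': 19,
    'contours_des_yeux': 48,
    'serums': 81,
    'gommages_visage': 28,
    'masques_de_beaute': 79,
    'baumes_levres': 44,
    'soins_specifiques': 62,
    'demaquillants': 94,
    'demaquillants_yeux': 9,
    'lotions': 44,
}


def build_listing_links(base_url, last_pages):
    """Return every listing-page URL, category by category."""
    return [f'{base_url}{slug}.php?page={page}'
            for slug, last_page in last_pages.items()
            for page in range(1, last_page + 1)]


cosmetics_links = build_listing_links(bt_hp, last_page_by_category)
```

Keeping the page counts in one place makes them easier to update when the site adds or removes listing pages.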

We define a function that collects the basic information available on each listing page.

In [17]:
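def getting_cosmetics_global_info(url):
    """Return one (name cell, link cell, brand, price, capacity) tuple per product row."""
    html = requests.get(url).content
    # Name the parser explicitly so the result does not depend on which parsers are installed.
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", attrs={"class": "bt__table"})
    media_content = table.find_all('div', {'class': 'bt__media__content'})
    brand = table.find_all('td', {'class': 'bt__table__item__brand'})
    price = table.find_all('td', {'class': 'bt__table__item__price'})
    capacity = table.find_all('td', {'class': 'bt__table__item__cont'})
    # media_content appears twice on purpose: the caller reads the product name
    # from the first copy and the product link from the second.
    return zip(media_content, media_content, brand, price, capacity)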
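
Because the function returns a `zip`, the columns are paired by position and the result stops at the shortest column. If one product row lacked, say, a price cell, rows would be dropped or shifted without any error. The sketch below illustrates that behavior with made-up strings standing in for the `find_all()` results; `columns_aligned` is a hypothetical helper, not part of the notebook.

```python
from itertools import zip_longest

# Made-up cell texts standing in for the find_all() results; the price list
# is one item short, as it would be if one row had no price cell.
names = ['Product A', 'Product B', 'Product C']
brands = ['Brand A', 'Brand B', 'Brand C']
prices = ['11.08 €', '5.49 €']

# zip() stops at the shortest input, so the third product silently disappears.
truncated = list(zip(names, brands, prices))

# zip_longest() keeps every row and pads the short column at the end with None.
# This reveals the mismatch but cannot tell which row the gap belongs to.
padded = list(zip_longest(names, brands, prices, fillvalue=None))


def columns_aligned(*columns):
    """Return True when all columns have the same number of cells."""
    return len({len(column) for column in columns}) <= 1
```

A check such as `columns_aligned(media_content, brand, price, capacity)` before zipping would flag pages where the columns do not line up.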

In [19]:
cosmetics_global_info = [(name.getText().strip(),
                        bt_hp + link.find('a').get('href'),
                        # Listing slug (e.g. 'cremes_jour'): the part of the URL after bt_hp, up to the first dot.
                        url[len(bt_hp):].partition('.')[0],
                        brand.getText().strip(),
                        price.getText().strip().replace('\xa0?',' €'),
                        capacity.getText().strip().replace('\xa0', ' '))
                        for url in cosmetics_links
                        for name, link, brand, price, capacity in getting_cosmetics_global_info(url)]
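
The product type is derived from the listing URL by keeping what follows the base address up to the first dot. An alternative that does not depend on the length of the base URL is to parse the URL with the standard library. This is only a sketch of that alternative; `category_from_url` is a hypothetical helper name.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def category_from_url(url):
    """Return the listing slug, e.g. 'cremes_jour' for '.../cremes_jour.php?page=3'."""
    # urlparse() isolates the path ('/cremes_jour.php'); .stem drops the
    # leading slash and the '.php' extension.
    return PurePosixPath(urlparse(url).path).stem
```

For the listing URLs built above, this should give the same value as the slicing used in the comprehension.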

We now create the dataframe that holds the data collected from the listing pages.

In [20]:
df = pd.DataFrame(cosmetics_global_info, columns=['Names', 'Links', 'Types', 'Brands', 'Prices', 'Capacities'])

In [21]:
df.head()

|   | Names | Links | Types | Brands | Prices | Capacities |
|---|---|---|---|---|---|---|
| 0 | Soin Revolumisant Intense Anti-Âge Jour - Revi... | https://www.beaute-test.com/soin-revolumisant-... | cremes_jour | L'Oréal Paris | 11.08 € | 50 ml |
| 1 | Crème de Jour Anti-Rides Q10 - Cien | https://www.beaute-test.com/day_cream_q10_-_ci... | cremes_jour | Lidl | 12.00 € | 50 ml |
| 2 | BB Crème | https://www.beaute-test.com/soin_miracle_perfe... | cremes_jour | Garnier | 5.49 € | 50 ml |
| 3 | BB Skin Detox Fluid SPF 25 | https://www.beaute-test.com/bb-skin-detox-flui... | cremes_jour | Clarins | 26.90 € | 45 ml |
| 4 | Soin Global Anti-Rides Jour - Lift+ Algo Rétinol | https://www.beaute-test.com/soin-global-anti-r... | cremes_jour | Diadermine | 12.05 € | 50 ml |
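
The Prices and Capacities columns are stored as text such as `11.08 €` and `50 ml`. If numeric values are needed later, they could be extracted with the `re` module imported at the top. The helpers below are a hedged sketch: `parse_price` and `parse_capacity` are hypothetical names, and the patterns assume values shaped like the ones in the preview above.

```python
import re


def parse_price(text):
    """Return the amount in a string like '11.08 €' as a float, or None if no number is found."""
    match = re.search(r'\d+(?:[.,]\d+)?', text)
    return float(match.group().replace(',', '.')) if match else None


def parse_capacity(text):
    """Split a string like '50 ml' into (50.0, 'ml'); return (None, None) if it does not match."""
    match = re.fullmatch(r'\s*(\d+(?:[.,]\d+)?)\s*([^\d\s]+)\s*', text)
    if not match:
        return None, None
    return float(match.group(1).replace(',', '.')), match.group(2)
```

Applied to the dataframe, this could look like `df['Prices'].map(parse_price)`, leaving the original text columns untouched.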


Let's now store this dataframe in a CSV file.

In [22]:
df.to_csv('./Data/beautetest_facial_care_products.csv', index=False)
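
Writing the file fails if the `./Data` directory does not exist yet. One way to guard against that, sketched here with the standard library only, is to create the parent directory before saving. `ensure_parent_dir` is a hypothetical helper, and the demonstration uses a temporary directory instead of the notebook's real output folder.

```python
from pathlib import Path
import tempfile


def ensure_parent_dir(file_path):
    """Create the parent directory of file_path if needed and return the path."""
    path = Path(file_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path


# Demonstrated in a temporary directory so the sketch has no side effects.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = ensure_parent_dir(Path(tmp) / 'Data' / 'beautetest_facial_care_products.csv')
    parent_exists = csv_path.parent.is_dir()
    # In the notebook, the next step would be: df.to_csv(csv_path, index=False)
```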