# Ceneo Scraper

## Components of a single opinion

|Component|Selector|Variable|
|--------|--------|--------|
|opinion ID|["data-entry-id"]|opinion_id|
|opinion’s author|span.user-post__author-name|author|
|author’s recommendation|span.user-post__author-recomendation > em|recommendation|
|score expressed in number of stars|span.user-post__score-count|score|
|opinion’s content|div.user-post__text|content|
|list of product advantages|div.review-feature__title--positives ~ div.review-feature__item|pros|
|list of product disadvantages|div.review-feature__title--negatives ~ div.review-feature__item|cons|
|number of users who found the opinion helpful|button.vote-yes > span|helpful|
|number of users who found the opinion unhelpful|button.vote-no > span|unhelpful|
|publishing date|span.user-post__published > time:nth-child(1)["datetime"]|publish_date|
|purchase date|span.user-post__published > time:nth-child(2)["datetime"]|buy_date|
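
To see how the selectors in the table behave, here is a minimal sketch run against a hand-written HTML fragment. The fragment is an assumption made up for illustration (only the class names come from the table), so Ceneo's real markup may differ. It shows that the general sibling combinator `~` picks up every `review-feature__item` that follows the matching title within the same parent, and that attributes such as `data-entry-id` and `datetime` are read with subscript access.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: class names follow the table, the structure is assumed.
html = """
<div class="js_product-review" data-entry-id="123">
  <span class="user-post__published">
    <time datetime="2023-01-10 12:00:00">10 stycznia</time>
    <time datetime="2023-01-02 09:30:00">2 stycznia</time>
  </span>
  <div class="review-feature__col">
    <div class="review-feature__title--positives">Zalety</div>
    <div class="review-feature__item">cena</div>
    <div class="review-feature__item">jakość</div>
  </div>
  <div class="review-feature__col">
    <div class="review-feature__title--negatives">Wady</div>
    <div class="review-feature__item">hałas</div>
  </div>
</div>
"""

opinion = BeautifulSoup(html, "html.parser").select_one("div.js_product-review")

# Attribute of the opinion tag itself.
opinion_id = opinion["data-entry-id"]

# "~" matches the items that come after the title among its siblings.
pros = [
    tag.text.strip()
    for tag in opinion.select("div.review-feature__title--positives ~ div.review-feature__item")
]
cons = [
    tag.text.strip()
    for tag in opinion.select("div.review-feature__title--negatives ~ div.review-feature__item")
]

# nth-child counts element children only, so whitespace between tags is ignored.
publish_date = opinion.select_one("span.user-post__published > time:nth-child(1)")["datetime"]
buy_date = opinion.select_one("span.user-post__published > time:nth-child(2)")["datetime"]

print(opinion_id, pros, cons, publish_date, buy_date)
```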

## Imports

In [69]:
import os
import json
import requests
from bs4 import BeautifulSoup

## Definition of the extraction function

In [70]:
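def extract_content(ancestor, selector=None, attribute=None, return_list=False):
    """Extract text or an attribute value from `ancestor` or from its descendants."""
    if selector:
        if return_list:
            # Collect a value from every tag matching the selector.
            if attribute:
                return [tag[attribute].strip() for tag in ancestor.select(selector)]
            return [tag.text.strip() for tag in ancestor.select(selector)]
        if attribute:
            try:
                return ancestor.select_one(selector)[attribute].strip()
            except (TypeError, KeyError):
                # No matching tag (None is not subscriptable) or the attribute is missing.
                return None
        try:
            return ancestor.select_one(selector).text.strip()
        except AttributeError:
            # No tag matches the selector.
            return None
    # Without a selector, read directly from the ancestor tag itself.
    if attribute:
        return ancestor[attribute]
    return ancestor.text.strip()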

## Dictionary with the opinion structure

In [71]:
selectors = {
    "opinion_id": (None, "data-entry-id"),
    "author": ("span.user-post__author-name",),
    "recommendation": ("span.user-post__author-recomendation > em",),
    "score": ("span.user-post__score-count",),
    "content": ("div.user-post__text",),
    "pros": ("div.review-feature__title--positives ~ div.review-feature__item", None, True),
    "cons": ("div.review-feature__title--negatives ~ div.review-feature__item", None, True),
    "helpful": ("button.vote-yes > span",),
    "unhelpful": ("button.vote-no > span",),
    "publish_date": ("span.user-post__published > time:nth-child(1)", "datetime"),
    "buy_date": ("span.user-post__published > time:nth-child(2)", "datetime")
}
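
Each value in `selectors` is a tuple of positional arguments that `extract_content` receives after the ancestor tag. The sketch below illustrates how `*value` unpacking maps the tuple items onto parameters. `show_arguments` is a hypothetical stand-in that mirrors the signature of `extract_content` and only reports what it received.

```python
def show_arguments(ancestor, selector=None, attribute=None, return_list=False):
    # Hypothetical stand-in with the same signature as extract_content.
    return {"selector": selector, "attribute": attribute, "return_list": return_list}


# Three entries copied from the selectors dictionary, one per tuple length.
examples = {
    "opinion_id": (None, "data-entry-id"),
    "author": ("span.user-post__author-name",),
    "pros": ("div.review-feature__title--positives ~ div.review-feature__item", None, True),
}

# "<tag>" stands in for the BeautifulSoup tag of a single opinion.
resolved = {key: show_arguments("<tag>", *value) for key, value in examples.items()}

for key, arguments in resolved.items():
    print(key, arguments)
```

A one-element tuple therefore leaves `attribute` and `return_list` at their defaults, while `(None, "data-entry-id")` skips the selector and reads the attribute from the opinion tag itself.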

## Transformation function

In [72]:
def score(value: str) -> float:
    # Convert a rating such as "4,5/5" into a fraction of the maximum score.
    s = value.split("/")
    return float(s[0].replace(",", ".")) / float(s[1])
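
As a worked example, a rating of `"4,5/5"` becomes 4.5 / 5 = 0.9. The snippet repeats the function so that it runs on its own. The sample strings are assumptions that follow the `value/maximum` format with a decimal comma.

```python
def score(value: str) -> float:
    # Split "4,5/5" into the awarded value and the maximum value.
    numerator, denominator = value.split("/")
    # Replace the decimal comma with a dot before converting to float.
    return float(numerator.replace(",", ".")) / float(denominator)


print(score("4,5/5"))  # 0.9
print(score("3/5"))    # 0.6
```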

## Dictionary with transformations

In [73]:
transformations = {
    "recommendation": lambda r: True if r == "Polecam" else False if r == "Nie polecam" else None,
    "score": score,
    "helpful": int,
    "unhelpful": int
}
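
Because a missing element is extracted as `None`, and `int(None)` raises a `TypeError`, it may be safer to apply a transformation only when a value is present. The following is a minimal sketch of that idea. `apply_transformations` and the sample record are hypothetical, and the `score` entry is left out to keep the example short.

```python
transformations = {
    "recommendation": lambda r: True if r == "Polecam" else False if r == "Nie polecam" else None,
    "helpful": int,
    "unhelpful": int,
}


def apply_transformations(record, transformations):
    # Hypothetical helper: returns a new dict and transforms only non-None fields.
    result = dict(record)
    for key, transform in transformations.items():
        if result.get(key) is not None:
            result[key] = transform(result[key])
    return result


# Made-up raw values, as they would come out of the extraction step.
raw = {"author": "jan", "recommendation": "Polecam", "helpful": "12", "unhelpful": None}
clean = apply_transformations(raw, transformations)
print(clean)
```

Fields without an entry in `transformations`, such as `author`, pass through unchanged.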

## URL of the first page with opinions

In [74]:
product_id = "108290707"
url = f"https://www.ceneo.pl/{product_id}#tab=reviews"
all_opinions = []

## Extract all opinions about the product

In [75]:
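while url:
    response = requests.get(url)
    # Stop with an HTTPError on a 4xx/5xx response instead of parsing an error page.
    response.raise_for_status()
    page_dom = BeautifulSoup(response.text, "html.parser")
    opinions = page_dom.select("div.js_product-review")
    for opinion in opinions:
        single_opinion = {
            key: extract_content(opinion, *value)
            for key, value in selectors.items()
        }
        for key, transform in transformations.items():
            # Skip fields that were not found on the page (extract_content returned None).
            if single_opinion[key] is not None:
                single_opinion[key] = transform(single_opinion[key])
        all_opinions.append(single_opinion)
    # Follow the "next page" link; stop when the last page has none.
    next_href = extract_content(page_dom, "a.pagination__next", "href")
    url = "https://www.ceneo.pl" + next_href if next_href else None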
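
The pagination step can also be isolated into a small function that is easy to check without network access. This is only a sketch: `next_page_url` is a hypothetical helper, the `href` value in the sample markup is made up, and `urljoin` from the standard library is swapped in for plain string concatenation.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.ceneo.pl"


def next_page_url(page_dom, base_url=BASE_URL):
    # Hypothetical helper: absolute URL of the next page, or None on the last page.
    link = page_dom.select_one("a.pagination__next")
    if link is None or not link.get("href"):
        return None
    return urljoin(base_url, link["href"])


# Made-up markup for a page that has a next page and for the last page.
with_next = BeautifulSoup(
    '<a class="pagination__next" href="/108290707/opinie-2">Następna</a>', "html.parser"
)
last_page = BeautifulSoup("<div>no pagination link</div>", "html.parser")

print(next_page_url(with_next))
print(next_page_url(last_page))
```

Returning `None` explicitly makes the end of the `while` loop depend on a plain truthiness check rather than on an exception.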

## Save opinions to a JSON file

In [76]:
os.makedirs("opinions", exist_ok=True)
with open(f"opinions/{product_id}.json","w",encoding="UTF-8") as jf:
    json.dump(all_opinions, jf, indent=4, ensure_ascii=False)
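
The `ensure_ascii=False` argument keeps Polish characters readable in the output file instead of escaping them as `\uXXXX` sequences. Here is a small round-trip sketch. The sample record is made up, and a temporary directory is used so that nothing is written to the `opinions` folder.

```python
import json
import os
import tempfile

# Made-up record containing non-ASCII characters.
sample = [{"opinion_id": "123", "content": "Świetna jakość", "pros": ["cena"]}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.json")
    with open(path, "w", encoding="UTF-8") as jf:
        json.dump(sample, jf, indent=4, ensure_ascii=False)
    with open(path, encoding="UTF-8") as jf:
        raw_text = jf.read()

# Parsing the saved text gives back the original data.
loaded = json.loads(raw_text)

# With the default ensure_ascii=True the same text would be escaped.
escaped_text = json.dumps(sample)

print(raw_text)
print(escaped_text)
```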
