### Links Scraper

This notebook scrapes the [website](https://www.giustizia.it/giustizia/page/it/statistiche) of the Italian Ministry of Justice and retrieves the available information on the number of inmates held in Italian detention centers. The scraper searches for entries containing the string `Detenuti italiani e stranieri presenti e capienze per istituto` and collects the matching links in a `csv` file stored in `outputs/clean/bulletines_links.csv`.
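
As an illustration of what is kept for each matching entry, here is a minimal sketch that uses only the standard library. The helper `parse_entry` and the sample link text are hypothetical: the exact wording of the link text on the site is an assumption, based on the `aggiornamento al` marker that the scraper splits on.

```python
import re

TO_SEARCH = "Detenuti italiani e stranieri presenti e capienze per istituto"

def parse_entry(href, text):
    """Return [content_id, last_update, href], or None if the entry is not relevant."""
    if TO_SEARCH not in text:
        return None
    match = re.search(r"contentId=(\w+)", href)
    if match is None:
        return None  # No contentId in the link, so nothing to store
    # Everything after "aggiornamento al" is treated as the date of the last update
    last_update = text.split("aggiornamento al")[-1].strip()
    return [match.group(1), last_update, href]

# Hypothetical sample entry (the link text format is assumed, not taken from the site)
sample_href = "/giustizia/it/mg_1_14_1.page?contentId=SST365607"
sample_text = TO_SEARCH + " - aggiornamento al 31 dicembre 2021"

row = parse_entry(sample_href, sample_text)
print(row)
```

Each row produced this way corresponds to one line of the final `csv`: the content id, the date string, and the link.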

In [1]:
from playwright.async_api import async_playwright
import asyncio
from bs4 import BeautifulSoup
import pandas as pd
from time import sleep
import re

In [2]:
url = 'https://www.giustizia.it/giustizia/page/it/statistiche'
to_search = 'Detenuti italiani e stranieri presenti e capienze per istituto'

In [None]:
# "Hey, open up a browser"
playwright = await async_playwright().start()
browser = await playwright.chromium.launch(headless=False)

# Create a new browser window
page = await browser.new_page()
print("Opening up the browser...")

# Tell it to go to this page
await page.goto(url)
print(f"Going to {url}")

await page.wait_for_timeout(2000)

search_input = page.locator('form#searchForm input[aria-label="Cerca"]')

await search_input.fill(to_search)
await page.wait_for_timeout(2000)
await search_input.press('Enter')
print(f"Searching for {to_search}")

# Wait for the results to load
await page.wait_for_selector('ol.resultVivisimo', timeout=5000)

# Data storage
data = []
n = 1


while True:
    # Give the page time to finish rendering before grabbing its HTML
    await page.wait_for_timeout(5000)
    content = await page.content()
    soup = BeautifulSoup(content, 'html.parser')
    links = soup.find_all('a', href=True)

    # Filter and extract the relevant links
    filtered_links = [
        link for link in links
        if "contentId" in link['href'] and to_search in link.get_text()
    ]

    for link in filtered_links:
        href = link['href']
        text = link.get_text(strip=True)
        match = re.search(r"contentId=(\w+)", href)
        if match is None:
            continue  # Skip links whose contentId cannot be parsed
        content_id = match.group(1)
        last_update = text.split("aggiornamento al")[-1].strip()
        data.append([content_id, last_update, href])
    print(f'got link from page {n}')
    print(f"Total number of links: {len(data)}")
    print("##################")
    n = n+1

    next_button = await page.query_selector('img[alt="Vai alla pagina successiva"]')  # Select the image by alt text
    await page.wait_for_timeout(5000)

    # Adding a pause to give the page some time
    await page.wait_for_timeout(5000)  # Wait for 5 seconds before checking for the button

    if next_button:
        print('next button found')
        # Check if the button is visible and can be clicked
        is_visible = await next_button.is_visible()
        is_enabled = await next_button.is_enabled()  # Check if the button is enabled

        if is_visible and is_enabled:
            print(f"Clicking on page {n}")
            # Click the "Next" button
            await next_button.click()
            await page.wait_for_selector('ol.resultVivisimo', timeout=5000)  # Additional wait time for the new page to load
        else:
            print('button not visible or enabled')
            break  # If the button is not visible or not enabled, break the loop
    else:
        print('button not found')
        break  # If no next button, break the loop

# Finally close the browser after everything is done
await browser.close()

In [None]:
# Create a DataFrame and convert dates
df = pd.DataFrame(data, columns=["ID", "Ultimo aggiornamento", "Hyperlink"])

# Month mapping for conversion to datetime
month_mapping = {
    "Gennaio": "January", "gennaio": "January",
    "Febbraio": "February", "febbraio": "February",
    "Marzo": "March", "marzo": "March",
    "Aprile": "April", "aprile": "April",
    "Maggio": "May", "maggio": "May",
    "Giugno": "June", "giugno": "June",
    "Luglio": "July", "luglio": "July",
    "Agosto": "August", "agosto": "August",
    "Settembre": "September", "settembre": "September",
    "Ottobre": "October", "ottobre": "October",
    "Novembre": "November", "novembre": "November",
    "Dicembre": "December", "dicembre": "December"
}

# Month names are plain strings, so a literal (non-regex) replacement is enough
for italian, english in month_mapping.items():
    df['Ultimo aggiornamento'] = df['Ultimo aggiornamento'].str.replace(italian, english, regex=False)

df['Ultimo aggiornamento'] = pd.to_datetime(df['Ultimo aggiornamento'], format='%d %B %Y')
df = df.sort_values(by='Ultimo aggiornamento', ascending=False)

df.head()
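
The date conversion above goes through English month names so that `pd.to_datetime` can parse them. As an alternative, here is a minimal sketch that maps Italian month names directly to month numbers using only the standard library. The helper `parse_italian_date` is hypothetical, and it assumes every string has the form `day month year` with a plain numeric day, as the `'%d %B %Y'` format in this notebook implies.

```python
from datetime import date

# Italian month names mapped to month numbers (lookups are done in lowercase)
ITALIAN_MONTHS = {
    "gennaio": 1, "febbraio": 2, "marzo": 3, "aprile": 4,
    "maggio": 5, "giugno": 6, "luglio": 7, "agosto": 8,
    "settembre": 9, "ottobre": 10, "novembre": 11, "dicembre": 12,
}

def parse_italian_date(text):
    """Parse a string such as '31 dicembre 2021' into a datetime.date."""
    day, month_name, year = text.split()
    return date(int(year), ITALIAN_MONTHS[month_name.lower()], int(day))

print(parse_italian_date("31 dicembre 2021"))
print(parse_italian_date("30 Aprile 2020"))
```

Lowercasing the month name before the lookup removes the need to list both the capitalized and lowercase spellings.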

In [None]:
df['Ultimo aggiornamento'].dt.year.value_counts().sort_index()

In [6]:
# Save the df as csv
df.to_csv('../outputs/clean/bulletines_links.csv', index=False, encoding='UTF-8')

After reviewing the file, we found a pair of duplicate entries. Datasets [SST365607](https://www.giustizia.it/giustizia/it/mg_1_14_1.page?contentId=SST365607) and [SST360932](https://www.giustizia.it/giustizia/it/mg_1_14_1.page?contentId=SST360932) contain the same data, from December 2021. October 2021, by contrast, is missing (it used to be stored with id [SST352771](https://www.giustizia.it/giustizia/en/mg_1_14_1.page?contentId=SST352771)).
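
A check of this kind could be automated along the following lines. This is a minimal sketch under stated assumptions, not the procedure used for the review: the helpers `duplicate_dates` and `missing_months` are hypothetical, the ids `SST000001` and `SST000002` are placeholders, and the exact days of the month are assumed.

```python
from collections import Counter
from datetime import date

def duplicate_dates(rows):
    """Return the dates that appear under more than one id."""
    counts = Counter(d for _, d in rows)
    return sorted(d for d, c in counts.items() if c > 1)

def missing_months(rows):
    """Return (year, month) pairs absent between the first and last date."""
    months = {(d.year, d.month) for _, d in rows}
    if not months:
        return []
    year, month = min(months)
    last = max(months)
    missing = []
    while (year, month) <= last:
        if (year, month) not in months:
            missing.append((year, month))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return missing

# Illustrative rows: the two real ids share a date, the other ids are placeholders
rows = [
    ("SST365607", date(2021, 12, 31)),
    ("SST360932", date(2021, 12, 31)),
    ("SST000002", date(2021, 11, 30)),
    ("SST000001", date(2021, 9, 30)),
]

print(duplicate_dates(rows))
print(missing_months(rows))
```

On a `DataFrame`, the same idea maps onto `df['Ultimo aggiornamento'].duplicated()` for the repeated dates.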

In [7]:
# Remove one duplicate (SST365607 = SST360932)
duplicate_id = 'SST365607'
df = df[df['ID'] != duplicate_id]

In [8]:
# Save the df as csv
df.to_csv('../outputs/clean/bulletines_links.csv', index=False, encoding='UTF-8')