## Scraping *B'Tselem (בצלם)*'s Data on the Ethnic Cleansing of Palestine

::: {.callout-tip title="Video and Notebook Links"}

* <a href='https://georgetown.hosted.panopto.com/Panopto/Pages/Viewer.aspx?id=f866cbea-62c7-4bdf-abd9-b098005e6d26' target='_blank'>Click here to access the accompanying **video** for this writeup</a>
* Since Selenium doesn't work very well with Colab, there is no "Open in Colab" link for this writeup, so instead you can <a href='https://github.com/jpowerj/dsan5000/blob/main/writeups/selenium/Web_Scraping_With_Selenium.ipynb' target='_blank'>click here to view/download the `.ipynb` file from GitHub</a>, in case you'd like to run or modify the code yourself.
* Just as the [data-cleaning writeup](../eda-seaborn/THOR_EDA_with_Seaborn.ipynb){target='_blank'} linked to <a href='https://jjacobs.me/mdb/chomsky-hr2/' target='_blank'>Volume II of *The Political Economy of Human Rights*</a> to provide the full context for that dataset, here I'll link to Ilan Pappe's <a href='https://jjacobs.me/mdb/ethnic-cleansing-of-palestine/' target='_blank'>*The Ethnic Cleansing of Palestine*</a>, which provides context for this dataset (e.g., for why I chose the phrase "ethnic cleansing" of Palestine for the title here, over other possible descriptors).

:::

What happens if we try to scrape data from the Israeli human rights organization B'Tselem (בצלם), by just straightforwardly making a GET request using the `requests` library in Python?

In [72]:
# Python built-in libraries
import time

# 3rd-party libraries
import requests
from bs4 import BeautifulSoup

In [73]:
data_url = "https://statistics.btselem.org/en/all-fatalities/by-date-of-incident/pal-by-israel-sec/all?section=overall&tab=overview&ageSensor=%220%2C5%22"

In [74]:
response = requests.get(data_url)
print(response.text)

<!doctype html><html lang="en"><head><meta charset="utf-8"><meta http-equiv="X-UA-Compatible" content="IE=edge"><meta name="viewport" content="width=device-width,initial-scale=1"><link rel="icon" href="/favicon.ico"><title>Btselem</title><link rel="preconnect" href="https://fonts.gstatic.com"><link href="https://fonts.googleapis.com/css2?family=Rubik:wght@400;700&family=Tajawal:wght@400;700&display=swap" rel="stylesheet"><link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/@mdi/font@latest/css/materialdesignicons.min.css"><script src="https://code.jquery.com/jquery-3.2.1.min.js"></script><script src="/assets/jsmaps/jsmaps-libs.js"></script><script src="/assets/jsmaps/jsmaps-panzoom.js"></script><script src="/assets/jsmaps/jsmaps.min.js"></script><meta property="og:title" content="Database on fatalities and house demolitions">
<meta property="og:description" content="Database on fatalities and house demolitions in the Occupied Territories and Israel in the context of the conflict">

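The response contains only the page's shell (metadata, stylesheets, and `<script>` tags), not the records themselves, since those are inserted into the page by JavaScript after it loads. So, it seems we'll have to finally bite the bullet and use **Selenium** to scrape this data, since Selenium drives a real browser and therefore allows us to scrape even pages where data is **dynamically generated** using JavaScript.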
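One quick way to confirm that the records are absent from a static response is to search it for the class name that wraps each record in the rendered page (`v-list-item__content`, found with the Firefox Inspector later in this writeup). The sketch below runs that check on a shortened stand-in for the response; the `static_shell` string and the helper name `contains_data_boxes` are made up for illustration.

```python
def contains_data_boxes(html_text, box_class="v-list-item__content"):
    """Return True if the HTML text mentions the class used for each data box."""
    return box_class in html_text


# Shortened, hypothetical stand-in for response.text: page metadata only, no records
static_shell = (
    '<!doctype html><html lang="en"><head><title>Btselem</title></head>'
    "<body></body></html>"
)

print(contains_data_boxes(static_shell))
```

Calling `contains_data_boxes(response.text)` applies the same check to the real response; if it returns `False`, the records are being added by JavaScript after the initial page load.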

In [75]:
# Uncomment and run the following line to install selenium, if you're using pip rather than conda
#!pip install selenium



Then, if you have both Selenium and the Firefox **driver** for your operating system installed, the following test code will pop up a new externally-controlled Firefox window and automatically navigate to the Selenium homepage at `https://selenium.dev`:

In [76]:
from selenium import webdriver

browser = webdriver.Firefox()
browser.get('https://selenium.dev/')

Now let's try with the actual site:

In [114]:
browser = webdriver.Firefox()
browser.get(data_url)
time.sleep(0.1)
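The fixed `time.sleep(0.1)` above is a guess at how long the page needs before its content appears. A more robust option is to poll until the elements we care about exist. Selenium ships its own `WebDriverWait` helper for this; the sketch below is a dependency-free stand-in, where `wait_for_elements` and `FakeBrowser` are hypothetical names and the fake class exists only so the example runs without opening Firefox.

```python
import time


def wait_for_elements(browser_obj, css_selector, timeout=10.0, poll=0.25):
    """Poll the page until at least one element matches css_selector.

    Returns the number of matching elements, or 0 if the timeout expires.
    """
    deadline = time.monotonic() + timeout
    while True:
        # Count matching elements via JavaScript running in the browser
        count = browser_obj.execute_script(
            "return document.querySelectorAll(arguments[0]).length;", css_selector
        )
        if count > 0:
            return count
        if time.monotonic() >= deadline:
            return 0
        time.sleep(poll)


class FakeBrowser:
    """Stand-in for a Selenium driver: reports 0 matches twice, then 3."""

    def __init__(self):
        self.counts = [0, 0, 3]

    def execute_script(self, script, *args):
        # Return the next scripted count, repeating the last one forever
        return self.counts.pop(0) if len(self.counts) > 1 else self.counts[0]


found = wait_for_elements(FakeBrowser(), "div.v-list-item__content", poll=0)
print(found)
```

With a real driver, the call would be `wait_for_elements(browser, "div.v-list-item__content")`.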

In [106]:
# Helper functions, executing lines of JavaScript code in the browser to help with scraping
def get_scroll_position(browser_obj):
    cur_y_pos = browser_obj.execute_script("return window.pageYOffset")
    return cur_y_pos
def scroll_to_bottom(browser_obj):
    browser_obj.execute_script("window.scrollTo(0,document.body.scrollHeight);")

In [100]:
# And use these helper functions
# Get initial y position
cur_y_pos = get_scroll_position(browser)
print(f"y position at start of code block: {cur_y_pos}")
# Perform a scroll
scroll_to_bottom(browser)
# Check whether we have scrolled more
y_pos_after_scroll = get_scroll_position(browser)
print(f"y position after scrolling: {y_pos_after_scroll}")

y position at start of code block: 69165.5
y position after scrolling: 69165.5


In [116]:
# Get initial y position
prev_y_pos = get_scroll_position(browser)
print(f"Initial y position: {prev_y_pos}")
y_pos_after_scroll = None
done_scrolling = False
while not done_scrolling:
    time.sleep(0.5)
    scroll_to_bottom(browser)
    # Check whether we have scrolled more
    y_pos_after_scroll = get_scroll_position(browser)
    print(f"Y position after scroll: {y_pos_after_scroll}")
    if y_pos_after_scroll == prev_y_pos:
        done_scrolling = True
    else:
        prev_y_pos = y_pos_after_scroll
final_y_pos = get_scroll_position(browser)
print(f"y position, now that we're done scrolling: {final_y_pos}")

Initial y position: 12023
Y position after scroll: 13805
Y position after scroll: 20066
Y position after scroll: 26381
Y position after scroll: 32430
Y position after scroll: 38536
Y position after scroll: 44268
Y position after scroll: 50087
Y position after scroll: 56155
Y position after scroll: 62404
Y position after scroll: 68259
Y position after scroll: 69165.5
Y position after scroll: 69165.5
y position, now that we're done scrolling: 69165.5
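The loop above can be wrapped in a reusable function. This is a minimal sketch under a few assumptions: the name `scroll_until_stable`, the `max_scrolls` safety cap (so a page that never stops growing cannot loop forever), and the `FakeBrowser` stand-in are all illustrative additions, not part of the original notebook.

```python
import time


def scroll_until_stable(browser_obj, pause=0.5, max_scrolls=200):
    """Scroll to the bottom repeatedly until the y position stops changing.

    Returns the list of y positions observed after each scroll.
    """
    positions = []
    prev_y = browser_obj.execute_script("return window.pageYOffset")
    for _ in range(max_scrolls):
        time.sleep(pause)  # give newly requested content time to load
        browser_obj.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        cur_y = browser_obj.execute_script("return window.pageYOffset")
        positions.append(cur_y)
        if cur_y == prev_y:
            break  # no movement, so nothing new was loaded
        prev_y = cur_y
    return positions


class FakeBrowser:
    """Stand-in for a Selenium driver whose page grows twice, then stops."""

    def __init__(self):
        self.heights = [1000, 2000, 3000]  # hypothetical page heights
        self.y = 0

    def execute_script(self, script, *args):
        if script.startswith("return"):
            return self.y
        # A scroll: jump to the current bottom, then let the page grow
        self.y = self.heights[0]
        if len(self.heights) > 1:
            self.heights.pop(0)
        return None


observed = scroll_until_stable(FakeBrowser(), pause=0)
print(observed)
```

As in the real output, the final position appears twice: the repeated value is what signals that scrolling has finished.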


In [117]:
# Store the DOM (the html code of the page *after* elements have been dynamically
# loaded by JavaScript) in a variable
dom_content = browser.page_source

In [54]:
# And save this JavaScript-generated html code to our hard drive, using open() and .write()
with open("scraped_content.html", 'w', encoding='utf-8') as outfile:
    outfile.write(dom_content)


Now we can load this `.html` file locally (without even needing to access the web or use `requests` or anything), and parse it using `BeautifulSoup`:

In [119]:
with open("scraped_content.html", 'r', encoding='utf-8') as infile:
    raw_html = infile.read()
raw_html



In [56]:
# We can use BeautifulSoup to parse this Selenium-scraped html code just like we
# used it to parse "static" html code obtained using the requests library.
# Naming the parser explicitly avoids BeautifulSoup's "no parser specified" warning
soup = BeautifulSoup(raw_html, "html.parser")

In [121]:
# Get objects representing the 334 `<div>` elements on the page which have
# class="v-list-item__content" (the class we found for each data box using
# the inspector in Firefox)
content_elts = soup.find_all("div", {'class': 'v-list-item__content'})
len(content_elts)

334

Now that we have these 334 individual data "boxes", we loop through them and extract the individual pieces of data within each one: the child's name, the textual description of their murder, and the name of the military operation in which they were killed (when one is listed):

In [122]:
content_elts = soup.find_all("div", {'class': 'v-list-item__content'})
all_data = []
for cur_elt in content_elts:
    elt_headline = cur_elt.find("div", {'class': 'headline'})
    cur_name = elt_headline.text
    elt_main_text = cur_elt.find("p").text
    #print(elt_main_text)
    elt_badge = cur_elt.find("span")
    badge_text = ""
    if elt_badge is not None:
        print(f"Badge content: {elt_badge.text}")
        badge_text = elt_badge.text
    cur_data = {
        'name': cur_name,
        'text': elt_main_text,
        'badge_text': badge_text
    }
    all_data.append(cur_data)
    

Badge content:  Military operation: Shield and Arrow 
Badge content:  Military operation: Breaking Dawn 
Badge content:  Military operation: Breaking Dawn 
Badge content:  Military operation: Guardian of the Walls 
Badge content:  Military operation: Guardian of the Walls 
Badge content:  Military operation: Guardian of the Walls 
Badge content:  Military operation: Guardian of the Walls 
Badge content:  Military operation: Guardian of the Walls 
Badge content:  Military operation: Guardian of the Walls 
Badge content:  Military operation: Guardian of the Walls 
Badge content:  Military operation: Guardian of the Walls 
Badge content:  Military operation: Guardian of the Walls 
Badge content:  Military operation: Guardian of the Walls 
Badge content:  Military operation: Guardian of the Walls 
Badge content:  Military operation: Guardian of the Walls 
Badge content:  Military operation: Guardian of the Walls 
Badge content:  Military operation: Guardian of the Walls 
Badge content:  Mi
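The badge strings printed above carry leading and trailing spaces plus a repeated "Military operation:" label. If only the operation name is wanted, a small helper can tidy them before they go into the dictionary. This is a sketch; `clean_badge_text` is a hypothetical helper name, not something from the original notebook.

```python
def clean_badge_text(badge_text, prefix="Military operation:"):
    """Strip surrounding whitespace and the leading label from a badge string."""
    cleaned = badge_text.strip()
    if cleaned.startswith(prefix):
        cleaned = cleaned[len(prefix):].strip()
    return cleaned


print(clean_badge_text(" Military operation: Breaking Dawn "))
print(repr(clean_badge_text("")))
```

Records without a badge keep an empty string, so the resulting column stays consistent.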

Now, since at the end of each loop iteration we created a **Python dictionary** with keys `name`, `text`, and `badge_text`, and added it as an element of `all_data`, we can look at any of the elements within `all_data` and know that it will have the name, text, and military operation info for that child:

In [124]:
all_data[0]

{'name': 'Muhammad Haitham Ibrahim Tamimi',
 'text': '2-year-old resident of a-Nabi Saleh, Ramallah and al-Bira District, injured on June 1, 2023, in a-Nabi Saleh, Ramallah and al-Bira District,  live ammunition, and died on June 5, 2023.  Additional information: Wounded in the head by soldiers’ gunfire together with his father, who was wounded in the shoulder, when the two were sitting in their car waiting to depart for a family visit.  [More Info] ',
 'badge_text': ''}

The main reason to create these Python dictionaries within each for loop iteration, however, is that this format is immediately "recognized" by Pandas, so that we can pass our list of dictionaries `all_data` directly into the `pd.DataFrame()` constructor to generate a full `DataFrame` object containing all of our data, with the correct headers based on the dictionary keys:

In [125]:
import pandas as pd

In [126]:
scraped_data_df = pd.DataFrame(all_data)
scraped_data_df

| | name | text | badge_text |
|---|---|---|---|
| 0 | Muhammad Haitham Ibrahim Tamimi | 2-year-old resident of a-Nabi Saleh, Ramallah ... | |
| 1 | Hajar Khalil Salah al-Bahtini | 4 year old resident of Gaza city, Gaza Distric... | Military operation: Shield and Arrow |
| 2 | Jamil Najm a-Din Jamil Nijem | 3 year old resident of Jabalya R.C., North Gaz... | Military operation: Breaking Dawn |
| 3 | Alaa 'Abdallah Riyad Qadum | 5 year old resident of Gaza city, Gaza Distric... | Military operation: Breaking Dawn |
| 4 | Qusai Sameh Fawaz al-Kolak | Under 1 year old resident of Gaza city, Gaza D... | Military operation: Guardian of the Walls |
| ... | ... | ... | ... |
| 329 | 'Azzam Samir a-Sha'bi | 3 year old resident of Nablus, killed on April... | Military operation: Defensive Shield |
| 330 | Riham Abu Taha | 4 year old resident of Tall a-Sultan Camp, Raf... | |
| 331 | Shaimaa 'Imad al-Masri | 4 year old resident of Ramallah, Ramallah and ... | |
| 332 | Burhan al-Haymuni | 3 year old resident of Hebron, killed on Decem... | |


And finally, just like any other Pandas `DataFrame` object, we can use Pandas' `to_csv()` function to save this structured dataset to our hard drive:

In [127]:
scraped_data_df.to_csv("parsed_data.csv")
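One detail worth knowing about `to_csv()`: by default it also writes the row index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read again. Passing `index=False` avoids that. The sketch below shows the difference using an in-memory buffer and hypothetical placeholder records (not the scraped data) with the same keys as `all_data`.

```python
import io

import pandas as pd

# Hypothetical placeholder records with the same keys as all_data
example_rows = [
    {"name": "Person A", "text": "Example description", "badge_text": "Example badge"},
    {"name": "Person B", "text": "Another description", "badge_text": "Another badge"},
]
example_df = pd.DataFrame(example_rows)

# Default: the row index is written as an unnamed first column
with_index = io.StringIO()
example_df.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False: only the three data columns are written
without_index = io.StringIO()
example_df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```

Whether to keep the index is a judgment call; for scraped records where row order carries no meaning, dropping it usually gives a cleaner file.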

## Scraping Amnesty International's Countries Page

[This page](https://www.amnesty.org/en/countries/) has a somewhat different structure, so we'll have to interact with it in a different way from just scrolling. For example, right when we load the page we see an "Accept Cookies" dialog, and once we're past that it looks like we'll need to click each individual letter ("A", "B", "C", etc.) to get the links to Amnesty's pages on countries starting with that letter.

*(We'll see that, at the end of the day, we actually don't need to do all this clicking, but I'm including this part so that you know how to **programmatically click things in a page** using Selenium)*

In [222]:
# Open a fresh browser window; we navigate to the Amnesty page in the cells below
browser = webdriver.Firefox()

In [223]:
amnesty_url = "https://www.amnesty.org/en/countries/"

In [224]:
browser.get(amnesty_url)

Using the "By" syntax in Selenium: the `By` object that we import in the next cell allows us to **tell Selenium *how* we're selecting elements on the page**: for example, by an element's ID value, by the text of a link, or by some other property of the text/button/link/heading/etc.

In [225]:
from selenium.webdriver.common.by import By

So, first we tell Selenium that we'd like to select elements by their **id** property, and then we specifically select the **Accept Cookies** button, which we found (using the Inspector panel in Firefox) has the id value `"ccc-notify-accept"`:

In [226]:
accept_button = browser.find_element(By.ID, 'ccc-notify-accept')

In [227]:
accept_button.click()

Now that we've clicked the "Accept Cookies" button, here we'll start with some code that will allow us to click the "Z" button at the top of the page, revealing the countries in Amnesty's database which start with the letter "Z":

In [228]:
Z_button = browser.find_element(By.XPATH, "//button[text()='Z']")

In [229]:
Z_button.click()
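The same pattern extends to every letter. Below is a minimal sketch, assuming each letter button matches the XPath used for "Z" above; `letter_button_xpath`, `click_each_letter`, and the fake classes are hypothetical names, and the fakes exist only so the example runs without a browser.

```python
import string
import time


def letter_button_xpath(letter):
    """Build the XPath used above for the 'Z' button, for any letter."""
    return f"//button[text()='{letter}']"


def click_each_letter(browser_obj, letters=string.ascii_uppercase, pause=0.2):
    """Click the button for each letter in turn; return the letters clicked."""
    clicked = []
    for letter in letters:
        # "xpath" is the string value behind Selenium's By.XPATH constant
        button = browser_obj.find_element("xpath", letter_button_xpath(letter))
        button.click()
        clicked.append(letter)
        time.sleep(pause)  # brief pause so the page can update
    return clicked


class FakeButton:
    """Stand-in for a Selenium element; clicking does nothing."""

    def click(self):
        pass


class FakeBrowser:
    """Stand-in for a Selenium driver that records each lookup."""

    def __init__(self):
        self.queries = []

    def find_element(self, by, value):
        self.queries.append((by, value))
        return FakeButton()


fake = FakeBrowser()
clicked_letters = click_each_letter(fake, letters="ABC", pause=0)
print(clicked_letters)
print(fake.queries[0])
```

With a real driver, `find_element` raises `NoSuchElementException` for a letter that has no button, so the `letters` argument should list only the letters actually shown on the page.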

However (as I talk about more in the video), the way this webpage is designed actually makes it easier for us to scrape: it turns out that **all of the links for all of the countries** are loaded into the page right when it first loads, and then all of the links **except** those for the currently-selected letter are **hidden** using the CSS value `display: none`.

What this means is that we don't even need to click on the letters programmatically: we can just get all of the `<a>` elements which exist (even if hidden) within the `<div>` element on the page containing the list of countries:

In [230]:
country_link_container = browser.find_element(By.CLASS_NAME, "listContainer")

In [231]:
country_links = country_link_container.find_elements(By.TAG_NAME, "a")

And the following code tells us that Amnesty lists 157 countries in total:

In [232]:
len(country_links)

157

In [241]:
# Pick one link to inspect (despite the variable name, this is the link at index 31)
first_country_link = country_links[31]

In [242]:
first_country_link.get_property("href")

'https://www.amnesty.org/en/location/europe-and-central-asia/croatia/'

Here, however, the following two code blocks show that there's a bit of a tricky approach we have to take if we want to get the text content of elements which are **hidden** on the actual displayed page in your browser. In the following cell, we see that just accessing the `.text` attribute on a country link will **not** produce the name of the country when the country's link is hidden:

In [243]:
first_country_link.text

''

**But**, we can use a trick here: even elements which are not visible on the page still have a property called `textContent`, so if we ask for **this** (using `get_attribute()`, rather than the simpler `.text` attribute), we can in fact get the text that **would** be displayed **if** the element were not hidden:

In [244]:
first_country_link.get_attribute("textContent")

'Croatia'
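An alternative, if working from a snapshot of the page is acceptable, is to hand `browser.page_source` to BeautifulSoup as we did for the B'Tselem data. BeautifulSoup parses the markup without applying any CSS, so `.text` returns the link text whether or not the element is hidden. The sketch below uses hypothetical, simplified markup (the `example.org` URLs and country names are placeholders) rather than the real Amnesty page.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: one visible link and one hidden via display: none
sample_html = """
<div class="listContainer">
  <a href="https://example.org/visible/">Visible Country</a>
  <a href="https://example.org/hidden/" style="display: none">Hidden Country</a>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
container = sample_soup.find("div", {"class": "listContainer"})
# .text ignores CSS entirely, so hidden links are read like any other
link_names = [a.text for a in container.find_all("a")]
print(link_names)
```

With the live page, the first step would be `BeautifulSoup(browser.page_source, "html.parser")`.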

And now, just like before, we make sure to create a **Python dictionary** at the end of each loop iteration, containing all of the data we've extracted from the element we were looking at in that iteration, so that it will be easy to create a Pandas `DataFrame` once the loop has finished running:

In [246]:
all_data = []
for cl in country_links:
    country_url = cl.get_property('href')
    country_name = cl.get_attribute('textContent')
    country_data = {
        'name': country_name,
        'url': country_url
    }
    all_data.append(country_data)


So here we note that creating these dictionary objects in each loop iteration makes it easy to view individual records:

In [247]:
all_data[50]

{'name': 'Gambia',
 'url': 'https://www.amnesty.org/en/location/africa/west-and-central-africa/gambia/'}

In [200]:
[(cl.get_property("href"), cl.get_attribute("textContent")) for cl in country_links]

[('https://www.amnesty.org/en/location/asia-and-the-pacific/south-asia/afghanistan/',
  'Afghanistan'),
 ('https://www.amnesty.org/en/location/europe-and-central-asia/albania/',
  'Albania'),
 ('https://www.amnesty.org/en/location/middle-east-and-north-africa/algeria/',
  'Algeria'),
 ('https://www.amnesty.org/en/location/europe-and-central-asia/andorra/',
  'Andorra'),
 ('https://www.amnesty.org/en/location/africa/southern-africa/angola/',
  'Angola'),
 ('https://www.amnesty.org/en/location/americas/south-america/argentina/',
  'Argentina'),
 ('https://www.amnesty.org/en/location/europe-and-central-asia/armenia/',
  'Armenia'),
 ('https://www.amnesty.org/en/location/asia-and-the-pacific/south-east-asia-and-the-pacific/australia/',
  'Australia'),
 ('https://www.amnesty.org/en/location/europe-and-central-asia/austria/',
  'Austria'),
 ('https://www.amnesty.org/en/location/europe-and-central-asia/azerbaijan/',
  'Azerbaijan'),
 ('https://www.amnesty.org/en/location/middle-east-and-north

It also makes it easy to "plug" the list of dictionary objects into the `pd.DataFrame()` constructor to quickly create a well-structured dataset:

In [248]:
country_df = pd.DataFrame(all_data)
country_df

| | name | url |
|---|---|---|
| 0 | Afghanistan | https://www.amnesty.org/en/location/asia-and-t... |
| 1 | Albania | https://www.amnesty.org/en/location/europe-and... |
| 2 | Algeria | https://www.amnesty.org/en/location/middle-eas... |
| 3 | Andorra | https://www.amnesty.org/en/location/europe-and... |
| 4 | Angola | https://www.amnesty.org/en/location/africa/sou... |
| ... | ... | ... |
| 152 | Venezuela | https://www.amnesty.org/en/location/americas/s... |
| 153 | Viet Nam | https://www.amnesty.org/en/location/asia-and-t... |
| 154 | Yemen | https://www.amnesty.org/en/location/middle-eas... |
| 155 | Zambia | https://www.amnesty.org/en/location/africa/sou... |


Which we could then save just like we saved the B'Tselem data before, using Pandas' `.to_csv()` function:

In [None]:
country_df.to_csv("amnesty_countries.csv")