## SAGEDbias Scraping Tutorial
This SAGEDbias tutorial shows how to **scrape relevant sentences** using the `Scraper` in the SAGEDbias library. The scraped material can help you **create a dataset** for training stereotype detection models. The tutorial covers each step in detail, from importing the necessary classes to scraping content. You will first learn to add keywords manually, then locate and scrape Wikipedia pages. The tutorial then covers two optional methods for expanding keywords and one optional method for scraping from any source using local files.

For more information, see the paper:
[SAGED: A Holistic Bias-Benchmarking Pipeline for Language Models with Customisable Fairness Calibration](https://arxiv.org/abs/2409.11149)

### Step 1: Install the SAGEDbias Library and Import
To start, install the SAGEDbias library with `pip`. If you haven't installed it yet, run the following cell:

In [1]:
!pip install SAGEDbias==0.0.7



At the beginning of your notebook, import the required classes. The first import can take some time because extra packages are downloaded:

In [2]:
from saged import SAGEDData, SourceFinder, Scraper

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ProgU\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ProgU\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Step 2: Add Keywords to the 'Keywords' Instance
To use SAGED, you need a data instance that holds information about the category and domain you're interested in. In this tutorial, we're interested in British people under the domain "nationalities":

In [3]:
domain = "nationalities"
category = "British people"
keywords_data = SAGEDData.create_data(domain, category, "keywords")

Next, add keywords to your `keywords_data` instance. The scraper uses them to identify sentences that contain these keywords:

In [4]:
keywords_to_add = ["Brit", "UK"]
for keyword in keywords_to_add:
    keywords_data.add(keyword=keyword)

You can inspect the keywords in a readable format using `keywords_data.show(data_tier="keywords")`.

In [5]:
keywords_data.show(data_tier="keywords")

Category: British people, Domain: nationalities
  Keywords: Brit, UK


Alternatively, you can access the full JSON data, including meta-information, with `keywords_data.data`:

In [6]:
print(keywords_data.data)

[{'category': 'British people', 'domain': 'nationalities', 'keywords': {'Brit': {'keyword_type': 'sub-concepts', 'keyword_provider': 'manual', 'targeted_source': [{'source_tag': 'default', 'source_type': 'unknown', 'source_specification': []}], 'scrap_mode': 'in_page', 'scrap_shared_area': 'Yes'}, 'UK': {'keyword_type': 'sub-concepts', 'keyword_provider': 'manual', 'targeted_source': [{'source_tag': 'default', 'source_type': 'unknown', 'source_specification': []}], 'scrap_mode': 'in_page', 'scrap_shared_area': 'Yes'}}}]
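
The printed structure is a list with one record per category, and each record maps keywords to their metadata. A minimal sketch of walking that structure with plain Python follows. The `data` literal is a trimmed copy of the output above, and `summarize_keywords` is a hypothetical helper, not part of SAGEDbias.

```python
# Trimmed literal copy of the structure printed above (not fetched from the library).
data = [{
    "category": "British people",
    "domain": "nationalities",
    "keywords": {
        "Brit": {"keyword_type": "sub-concepts", "keyword_provider": "manual",
                 "scrap_mode": "in_page", "scrap_shared_area": "Yes"},
        "UK": {"keyword_type": "sub-concepts", "keyword_provider": "manual",
               "scrap_mode": "in_page", "scrap_shared_area": "Yes"},
    },
}]


def summarize_keywords(records):
    """Return (category, keyword, provider) rows from a list of category records."""
    rows = []
    for record in records:
        for keyword, meta in record["keywords"].items():
            rows.append((record["category"], keyword, meta["keyword_provider"]))
    return rows


rows = summarize_keywords(data)
for row in rows:
    print(row)
```

With a real instance you would pass `keywords_data.data` instead of the literal.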


### Step 3: Instantiate the SourceFinder Class to Find Wikipedia URLs Related to the Keyword
Once you have populated `keywords_data`, it's time to create a `SourceFinder` instance, which will use the keywords to find relevant sources:

In [7]:
source_finder = SourceFinder(keywords_data)

The next step is to find relevant Wikipedia pages that match the keywords you've specified. Use `top_n` to control how many relevant links embedded in the main wiki page the `SourceFinder` extracts, and `scrape_backlinks` to set how many pages that link to the main wiki page are considered:

In [8]:
top_n = 2
scrape_backlinks = 2

# Search Wikipedia for related pages based on the keywords
wiki_sources = source_finder.find_scrape_urls_on_wiki(top_n=top_n, scrape_backlinks=scrape_backlinks)

Searching Wikipedia for topic: British people
Found Wikipedia page: British people
Searching similar forelinks for British people


Depth 1/1: 100%|██████████| 2/2 [00:01<00:00,  1.06it/s]


Searching similar backlinks for British people


Depth 1/1: 100%|██████████| 2/2 [00:01<00:00,  1.05it/s]


In [9]:
wiki_sources.show(data_tier="source_finder")

Category: British people, Domain: nationalities
  Sources: ['https://en.wikipedia.org/wiki/British_people', 'https://en.wikipedia.org/wiki/British_national_identity', 'https://en.wikipedia.org/wiki/British_Americans']


In [10]:
print(wiki_sources.data)

[{'category': 'British people', 'domain': 'nationalities', 'keywords': {'Brit': {'keyword_type': 'sub-concepts', 'keyword_provider': 'manual', 'targeted_source': [{'source_tag': 'default', 'source_type': 'unknown', 'source_specification': []}], 'scrap_mode': 'in_page', 'scrap_shared_area': 'Yes'}, 'UK': {'keyword_type': 'sub-concepts', 'keyword_provider': 'manual', 'targeted_source': [{'source_tag': 'default', 'source_type': 'unknown', 'source_specification': []}], 'scrap_mode': 'in_page', 'scrap_shared_area': 'Yes'}}, 'category_shared_source': [{'source_tag': 'default', 'source_type': 'wiki_urls', 'source_specification': ['https://en.wikipedia.org/wiki/British_people', 'https://en.wikipedia.org/wiki/British_national_identity', 'https://en.wikipedia.org/wiki/British_Americans']}]}]
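
After source finding, the record gains a `category_shared_source` list holding the URLs. The sketch below pulls the URLs out of a trimmed literal copy of that output. `shared_sources` is a hypothetical helper name, and the assumption that several source entries may coexist in the list is mine, not something the tutorial states.

```python
# Trimmed literal copy of the record printed above.
record = {
    "category": "British people",
    "category_shared_source": [{
        "source_tag": "default",
        "source_type": "wiki_urls",
        "source_specification": [
            "https://en.wikipedia.org/wiki/British_people",
            "https://en.wikipedia.org/wiki/British_national_identity",
            "https://en.wikipedia.org/wiki/British_Americans",
        ],
    }],
}


def shared_sources(record, source_type):
    """Collect every source of the given type, tolerating a missing key."""
    found = []
    for source in record.get("category_shared_source", []):
        if source["source_type"] == source_type:
            found.extend(source["source_specification"])
    return found


wiki_urls = shared_sources(record, "wiki_urls")
print(len(wiki_urls), "URLs found")
```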


### Step 4: Scrape the Wikipedia Pages
Once you have a list of Wikipedia URLs, the next step is to use the `Scraper` class to scrape content from those URLs:

In [11]:
# Initialize the Scraper instance using the 'wiki_sources' SAGEDData instance
scraper = Scraper(wiki_sources)

# Scrape sentences from Wikipedia pages
scraper.scrape_in_page_for_wiki_with_buffer_files()
scraped_sentences_data = scraper.scraped_sentence_to_saged_data()

Scraping through URL:   0%|          | 0/3 [00:00<?, ?url/s]
Scraping in page:   0%|          | 0/2 [00:00<?, ?keyword/s][A
Scraping in page:  50%|█████     | 1/2 [00:15<00:15, 15.65s/keyword][A
Scraping in page: 100%|██████████| 2/2 [00:28<00:00, 14.02s/keyword][A
Scraping through URL:  33%|███▎      | 1/3 [00:28<00:56, 28.06s/url]
Scraping in page:   0%|          | 0/2 [00:00<?, ?keyword/s][A
Scraping in page:  50%|█████     | 1/2 [00:05<00:05,  5.22s/keyword][A
Scraping in page: 100%|██████████| 2/2 [00:11<00:00,  5.98s/keyword][A
Scraping through URL:  67%|██████▋   | 2/3 [00:40<00:18, 18.59s/url]
Scraping in page:   0%|          | 0/2 [00:00<?, ?keyword/s][A
Scraping in page:  50%|█████     | 1/2 [00:07<00:07,  7.42s/keyword][A
Scraping in page: 100%|██████████| 2/2 [00:14<00:00,  7.25s/keyword][A
Scraping through URL: 100%|██████████| 3/3 [00:54<00:00, 18.18s/url]


In [12]:
scraped_sentences_data.show(data_tier="scraped_sentences")

Category: British people, Domain: nationalities
  Sources: ['https://en.wikipedia.org/wiki/British_people', 'https://en.wikipedia.org/wiki/British_national_identity', 'https://en.wikipedia.org/wiki/British_Americans']
  Keyword 'Brit' sentences: ["The BRIT Awards are the British Phonographic Industry's annual awards for both international and British popular music.", "British, brit'ish, adj. of Britain or the Commonwealth.", "Briton, brit'ὁn, n. one of the early inhabitants of Britain: a native of Great Britain."]
  Keyword 'UK' sentences: ["The BRIT Awards are the British Phonographic Industry's annual awards for both international and British popular music.", "British, brit'ish, adj. of Britain or the Commonwealth.", "Briton, brit'ὁn, n. one of the early inhabitants of Britain: a native of Great Britain.", 'It also refers to citizens of the former British Empire, who settled in the country prior to 1973, and hold neither UK citizenship nor nationality.', 'The population of the UK sta
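
To make the idea of keyword-based sentence scraping concrete, here is a minimal sketch that splits text into sentences and keeps those containing a keyword. It is an illustration only and not the `Scraper` implementation. The naive splitter, the case-insensitive whole-word match, the helper name `sentences_with_keyword`, and the sample text are all assumptions.

```python
import re


def sentences_with_keyword(text, keyword):
    """Return sentences in `text` containing `keyword` as a whole word (case-insensitive)."""
    # Naive splitter: break after '.', '!' or '?' when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    return [sentence for sentence in sentences if pattern.search(sentence)]


# Made-up sample text for the demonstration.
sample = (
    "The population of the UK is diverse. "
    "Many Britons live abroad. "
    "The BRIT Awards are held annually."
)
uk_hits = sentences_with_keyword(sample, "UK")
brit_hits = sentences_with_keyword(sample, "Brit")
print(uk_hits)
print(brit_hits)
```

Whole-word matching means "Brit" does not match "Britons" here; a substring match would be the looser alternative.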

In [13]:
print(scraped_sentences_data.data)

[{'category': 'British people', 'domain': 'nationalities', 'keywords': {'Brit': {'keyword_type': 'sub-concepts', 'keyword_provider': 'manual', 'targeted_source': [{'source_tag': 'default', 'source_type': 'unknown', 'source_specification': []}], 'scrap_mode': 'in_page', 'scrap_shared_area': 'Yes', 'scraped_sentences': [("The BRIT Awards are the British Phonographic Industry's annual awards for both international and British popular music.", 'default'), ("British, brit'ish, adj. of Britain or the Commonwealth.", 'default'), ("Briton, brit'ὁn, n. one of the early inhabitants of Britain: a native of Great Britain.", 'default')]}, 'UK': {'keyword_type': 'sub-concepts', 'keyword_provider': 'manual', 'targeted_source': [{'source_tag': 'default', 'source_type': 'unknown', 'source_specification': []}], 'scrap_mode': 'in_page', 'scrap_shared_area': 'Yes', 'scraped_sentences': [("The BRIT Awards are the British Phonographic Industry's annual awards for both international and British popular music
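
In the data above, each keyword stores its `scraped_sentences` as `(sentence, source_tag)` tuples, and the same sentence can appear under more than one keyword. When building a dataset you may want each sentence only once. The sketch below shows one way to do that on a trimmed literal copy of the output; `unique_sentences` is a hypothetical helper, not a SAGEDbias function.

```python
# Trimmed literal copy of the scraped record printed above.
record = {
    "category": "British people",
    "keywords": {
        "Brit": {"scraped_sentences": [
            ("The BRIT Awards are the British Phonographic Industry's annual awards "
             "for both international and British popular music.", "default"),
        ]},
        "UK": {"scraped_sentences": [
            ("The BRIT Awards are the British Phonographic Industry's annual awards "
             "for both international and British popular music.", "default"),
            ("It also refers to citizens of the former British Empire, who settled in "
             "the country prior to 1973, and hold neither UK citizenship nor nationality.",
             "default"),
        ]},
    },
}


def unique_sentences(record):
    """Map each distinct sentence to its source tag and the keywords that matched it."""
    seen = {}
    for keyword, meta in record["keywords"].items():
        for sentence, source_tag in meta.get("scraped_sentences", []):
            entry = seen.setdefault(sentence, {"source_tag": source_tag, "keywords": []})
            entry["keywords"].append(keyword)
    return seen


deduped = unique_sentences(record)
for sentence, info in deduped.items():
    print(info["keywords"], sentence[:40])
```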

### Optional Step 1: Find Similar Keywords Using Embeddings of Wikipedia
You can also use the `KeywordFinder` class with the `find_keywords_by_embedding_on_wiki` method to find keywords related to the main category word:

In [14]:
from saged import KeywordFinder
keyword_finder = KeywordFinder(category, domain)
keyword_finder.find_keywords_by_embedding_on_wiki(n_keywords=5)
keywords_data_embeddings = keyword_finder.keywords_to_saged_data()
keywords_data_embeddings.show(data_tier="keywords")

Initiating the embedding model...


Batches:   0%|          | 0/85 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Calculating similarities: 100%|██████████| 2718/2718 [00:00<00:00, 14580.71it/s]

Category: British people, Domain: nationalities
  Keywords: uk, brit, england, yorkshire, people
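
The tutorial does not show how `find_keywords_by_embedding_on_wiki` ranks candidates, although the "Calculating similarities" log line suggests a similarity ranking. As a rough illustration of that general idea, the sketch below ranks candidate words by cosine similarity to a category vector. The toy three-dimensional vectors, the choice of cosine similarity, and the helper names are all assumptions; a real embedding model produces much larger vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_keywords(category_vec, candidates, n_keywords):
    """Return the n candidate words whose vectors are most similar to the category."""
    ranked = sorted(
        candidates,
        key=lambda word: cosine(category_vec, candidates[word]),
        reverse=True,
    )
    return ranked[:n_keywords]


# Toy vectors invented for the demonstration.
category_vec = [1.0, 0.0, 0.0]
candidates = {
    "uk": [0.9, 0.1, 0.0],
    "england": [0.7, 0.7, 0.0],
    "banana": [0.0, 0.0, 1.0],
}
best = top_keywords(category_vec, candidates, n_keywords=2)
print(best)
```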





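### Optional Step 2: Find Similar Keywords Using a Custom LLM Function
You can also use the `KeywordFinder` class with the `find_keywords_by_llm_inquiries` method to find keywords related to the main category word. The dummy function below returns a fixed string, so every run fails and only the category name is kept as a keyword; replace it with a call to your own LLM: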

In [15]:
def your_generation_function(keyword):
    '''This is the LLM generation function to drive the keyword finder.'''
    return 'Dummy'

keyword_finder.find_keywords_by_llm_inquiries(generation_function=your_generation_function, n_keywords=5, n_run=5)
keywords_data_llm = keyword_finder.keywords_to_saged_data()
keywords_data_llm.show(data_tier="keywords")

finding keywords by LLM: 100%|██████████| 5/5 [00:00<?, ?run/s]

Invocation failed at iteration 0: invalid syntax (<string>, line 0)
Invocation failed at iteration 1: invalid syntax (<string>, line 0)
Invocation failed at iteration 2: invalid syntax (<string>, line 0)
Invocation failed at iteration 3: invalid syntax (<string>, line 0)
Invocation failed at iteration 4: invalid syntax (<string>, line 0)
final_set
{'British people'}
summary
['British people']
Category: British people, Domain: nationalities
  Keywords: British people
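
Before wiring in a real model, it can help to see what text the library passes to your function. The sketch below is a stand-in generation function that records each prompt it receives and returns a canned reply. The list-like reply format is an assumption, since the tutorial does not document what `find_keywords_by_llm_inquiries` expects back, and the example prompt is hypothetical.

```python
received_prompts = []


def recording_generation_function(prompt):
    """Stand-in generation function: record the prompt and return a canned reply."""
    received_prompts.append(prompt)
    # Assumed reply format: a list-like string. The tutorial does not confirm
    # which format the library parses, so adjust this after inspecting the prompts.
    return '["Briton", "English", "Scottish"]'


# Hypothetical prompt, used only to exercise the function outside the library.
reply = recording_generation_function("List keywords related to British people.")
print(received_prompts)
print(reply)
```

Passing `recording_generation_function` as `generation_function` and then printing `received_prompts` would show the instructions your LLM needs to follow.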





### Optional Step 3: Use Local Files for Scraping

Replace `directory_path` with the local directory that holds your files. The code below checks whether the directory exists and creates it if it does not:

In [16]:
import os 
directory_path = "data/customized/local_files/uk"  
if not os.path.exists(directory_path):
    os.makedirs(directory_path)
    print(f"The directory '{directory_path}' did not exist and was created.")
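
An equivalent check can be written with `pathlib`, which also makes it easy to print paths with forward slashes on every platform. This is a sketch of an alternative, not a required change; `ensure_directory` is a hypothetical helper, and the demonstration runs in a temporary directory so it leaves no files behind.

```python
import tempfile
from pathlib import Path


def ensure_directory(path):
    """Create `path` (and any parents) if missing; return True if it was created."""
    directory = Path(path)
    existed = directory.is_dir()
    directory.mkdir(parents=True, exist_ok=True)
    return not existed


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "data" / "customized" / "local_files" / "uk"
    created_first = ensure_directory(target)    # directory is new
    created_second = ensure_directory(target)   # directory already exists
    # as_posix() renders the path with forward slashes, including on Windows.
    posix_path = (target / "converted_document.txt").as_posix()

print(created_first, created_second)
```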

Use `docling` to convert the intended webpages to text, then save the converted text as a `.txt` file under the specified directory:

In [17]:
# !pip install docling
!pip install pydantic==2.7.0



In [18]:
from docling.document_converter import DocumentConverter

source = "https://www.gov.uk/apply-citizenship-born-uk/print"
converter = DocumentConverter()
result = converter.convert(source)
converted_text = result.document.export_to_text()

output_file_path = os.path.join(directory_path, "converted_document.txt")
with open(output_file_path, "w", encoding="utf-8") as text_file:
    text_file.write(converted_text)
print(f"Converted document saved to '{output_file_path}'.")

Converted document saved to 'data/customized/local_files/uk\converted_document.txt'.


In [19]:
print(converted_text)

Cookies on GOV.UK

We use some essential cookies to make this website work.

We’d like to set additional cookies to understand how you use GOV.UK, remember your settings and improve government services.

We also use cookies set by other sites to help us deliver content from their services.

You have accepted additional cookies. You can change your cookie settings at any time.

You have rejected additional cookies. You can change your cookie settings at any time.

Navigation menu

Services and information

 Benefits

 Births, death, marriages and care

 Business and self-employed

 Childcare and parenting

 Citizenship and living in the UK

 Crime, justice and the law

 Disabled people

 Driving and transport

 Education and learning

 Employing people

 Environment and countryside

 Housing and local services

 Money and tax

 Passports, travel and living abroad

 Visas and immigration

 Working, jobs and pensions

Government activity

 Departments
Departments, agencies and public bodi
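
As the output shows, the converted text starts with cookie banners and navigation menus, which the scraper would otherwise treat as sentences. One optional cleanup is to drop such lines before saving the file. The sketch below is a simple heuristic with assumed thresholds, not part of SAGEDbias or docling; `keep_sentence_like_lines` is a hypothetical helper, and the last line of `sample` is an invented body sentence.

```python
def keep_sentence_like_lines(text, min_words=6, banned_terms=("cookies",)):
    """Keep lines that look like full sentences and mention no banned term."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if any(term in stripped.lower() for term in banned_terms):
            continue  # cookie banner text
        if len(stripped.split()) < min_words or stripped[-1] not in ".!?":
            continue  # menu items and headings
        kept.append(stripped)
    return "\n".join(kept)


sample = """Cookies on GOV.UK

We use some essential cookies to make this website work.

Navigation menu

 Benefits

You can apply for citizenship if you were born in the UK.
"""
cleaned = keep_sentence_like_lines(sample)
print(cleaned)
```

The heuristic can discard legitimate short sentences, so check a sample of the dropped lines before relying on it.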

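Use the `find_scrape_paths_local` method to locate text files in the directory. Make sure you create a new `SourceFinder` (and later a new `Scraper`) from the keyword data you want to use; here it is `keywords_data_embeddings` from Optional Step 1: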

In [20]:
source_finder = SourceFinder(keywords_data_embeddings)
local_sources = source_finder.find_scrape_paths_local(directory_path)
local_sources.show(data_tier="source_finder")

Category: British people, Domain: nationalities
  Sources: ['data/customized/local_files/uk/converted_document.txt']


Initialize a `Scraper` instance and use the `scrape_local_with_buffer_files` method to scrape from the file:

In [21]:
scraper = Scraper(local_sources)
scraper.scrape_local_with_buffer_files()
scraped_sentences_data = scraper.scraped_sentence_to_saged_data()
scraped_sentences_data.show(data_tier="scraped_sentences")

Scraping through loacal files:   0%|          | 0/1 [00:00<?, ?file/s]
Scraping in page: 100%|██████████| 5/5 [00:00<00:00, 683.78keyword/s]
Scraping through loacal files: 100%|██████████| 1/1 [00:00<00:00, 99.72file/s]

Category: British people, Domain: nationalities
  Sources: ['data/customized/local_files/uk/converted_document.txt']
  Keyword 'uk' sentences: ['Cookies on GOV.UK  We use some essential cookies to make this website work.', 'We’d like to set additional cookies to understand how you use GOV.UK, remember your settings and improve government services.', 'Navigation menu  Services and information   Benefits   Births, death, marriages and care   Business and self-employed   Childcare and parenting   Citizenship and living in the UK   Crime, justice and the law   Disabled people   Driving and transport   Education and learning   Employing people   Environment and countryside   Housing and local services   Money and tax   Passports, travel and living abroad   Visas and immigration   Working, jobs and pensions  Government activity   Departments Departments, agencies and public bodies   News News stories, speeches, letters and notices   Guidance and regulation Detailed guidance, regulations and 




### Summary and Working Directions
This tutorial showed how to use the **SAGEDbias** library to define topics, locate relevant sources, and extract content. The key steps were configuring data instances, identifying Wikipedia URLs, and scraping content from them. It also demonstrated how to expand keyword lists and how to scrape from local files. This workflow gives you a foundation for using SAGEDbias to collect bias-related sentence data.

To create a dataset for training stereotype detection classifiers, consider the following approaches:

1. Identify sources, such as **books** and **websites**, that contain stereotypical texts, and scrape content directly from them.
2. Use prompt engineering, fine-tuning, or other techniques to build models that generate stereotypical texts for scraping. For example, [gpt2-EMGSD](https://huggingface.co/holistic-ai/gpt2-EMGSD) on Hugging Face is a GPT-2 model trained on half of the EMGSD dataset.
3. Use the **Assembler** and **Generator** modules in SAGEDbias to generate stereotypical sentence continuations or question responses using AI models.
4. Consider exploring the definition of stereotypes and defending a particular interpretation. For example, refer to [Defining Stereotypes and Stereotyping](https://academic.oup.com/book/39792/chapter-abstract/339890364?redirectedFrom=fulltext&login=false) for a detailed discussion on the topic.

If you have questions or require further clarification about these steps, don't hesitate to reach out.