# Extracting FAQs from an HTML page

## Parsing HTML data

In [2]:
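from bs4 import BeautifulSoup

with open("data/FAQ_red_cross_blood_services.html", "r", encoding="utf-8") as fp:
    html_text = fp.read()

# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html_text, "html.parser")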

In [3]:
soup

<!DOCTYPE HTML>
<html lang="en">
<head>
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
<title>Questions About Donating Blood | Red Cross Blood Services</title>
<meta content="base" name="template"/>
<!--// Meta Tags //-->
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<!-- Open Graph -->
<meta name="tags"/>
<meta content="https://www.redcrossblood.org/faq.html" property="og:url"/>
<meta content="Frequently Asked Questions" property="og:title"/>
<meta content="Find out how often you can donate blood and answers to more frequently asked questions about the blood donation process with American Red Cross blood services." property="og:description"/>
<meta content="https://www.redcross.org/content/dam/redcrossblood/social-media-images/Masks_Facebook.jpg.img.jpeg" property="og:image"/>
<meta content="American Red Cross" property="og:site"/>
<meta content="website" property="og:type"/>
<meta con

## Extracting the FAQ JSON from an HTML element

In [4]:
data_faq_json = soup.find(id="faq-data").get_attribute_list('data-faq-json')
data_faq_json

['{"filters": [{"name":"Donating Blood","slug":"donating-blood","subfilters": [{"name": "Blood Donation Process","slug": "donating-blood-blood-donation-process"},{"name": "Platelet Donations","slug": "donating-blood-platelets"},{"name": "Sickle Cell Trait Screening","slug": "donating-blood-sickle-cell-trait-screen"}]},{"name":"Eligibility","slug":"eligibility","subfilters": [{"name": "Medications and Vaccinations","slug": "eligibility-medications"},{"name": "General Health Considerations","slug": "eligibility-health"},{"name": "Travel Outside the U.S., Immigration","slug": "eligibility-travel"},{"name": "Medical Conditions that Affect Eligibility","slug": "eligibility-medicalconditions"},{"name": "Medical Treatments","slug": "eligibility-medicaltreatments"},{"name": "Personal Information","slug": "eligibility-personal-information"},{"name": "Sexually Transmitted Diseases","slug": "eligibility-stds"}]},{"name":"Hosting a blood drive","slug":"hosting-a-blood-drive","subfilters": [{"name"
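The lookup above can be wrapped in a small helper that fails loudly when the element or attribute is missing. This is a minimal sketch, assuming a hypothetical miniature page (`SAMPLE_HTML`) in place of the real file; `extract_data_attribute` is an illustrative name, not part of BeautifulSoup. It uses `Tag.get`, which returns this attribute as a plain string rather than the one-element list produced by `get_attribute_list`.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page; the real page stores a much larger payload.
SAMPLE_HTML = """
<div id="faq-data" data-faq-json='{"filters": [], "faqs": [{"title": "Q1"}]}'></div>
"""


def extract_data_attribute(html, element_id, attribute):
    """Return the attribute value as a string, or raise if it cannot be found."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find(id=element_id)
    if tag is None:
        raise ValueError(f"no element with id={element_id!r}")
    value = tag.get(attribute)
    if value is None:
        raise ValueError(f"element {element_id!r} has no attribute {attribute!r}")
    return value


raw_json = extract_data_attribute(SAMPLE_HTML, "faq-data", "data-faq-json")
```

Raising early here keeps a missing `id` from surfacing later as an `AttributeError` on `None`.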

In [5]:
import json
data_faq_dict = json.loads(data_faq_json[0], strict=False)
data_faq_dict

{'filters': [{'name': 'Donating Blood',
   'slug': 'donating-blood',
   'subfilters': [{'name': 'Blood Donation Process',
     'slug': 'donating-blood-blood-donation-process'},
    {'name': 'Platelet Donations', 'slug': 'donating-blood-platelets'},
    {'name': 'Sickle Cell Trait Screening',
     'slug': 'donating-blood-sickle-cell-trait-screen'}]},
  {'name': 'Eligibility',
   'slug': 'eligibility',
   'subfilters': [{'name': 'Medications and Vaccinations',
     'slug': 'eligibility-medications'},
    {'name': 'General Health Considerations', 'slug': 'eligibility-health'},
    {'name': 'Travel Outside the U.S., Immigration',
     'slug': 'eligibility-travel'},
    {'name': 'Medical Conditions that Affect Eligibility',
     'slug': 'eligibility-medicalconditions'},
    {'name': 'Medical Treatments', 'slug': 'eligibility-medicaltreatments'},
    {'name': 'Personal Information',
     'slug': 'eligibility-personal-information'},
    {'name': 'Sexually Transmitted Diseases', 'slug': 'eligi
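The `strict=False` argument tells `json.loads` to accept raw control characters (such as literal newlines or tabs) inside string values. The notebook does not say why it is needed; a plausible reason, and only an assumption, is that the embedded HTML descriptions contain unescaped line breaks. The sketch below uses a hypothetical payload to show the difference.

```python
import json

# Hypothetical payload: the "\n" below is a real newline character inside
# the JSON string value, which strict JSON parsing rejects.
raw = '{"description": "<p>first</p>\n<p>second</p>"}'

try:
    json.loads(raw)
    strict_ok = True
except json.JSONDecodeError:
    strict_ok = False

# With strict=False, control characters are allowed inside strings.
lenient = json.loads(raw, strict=False)
```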

In [6]:
data_faq_dict['faqs']

[{'title': 'How does the blood donation process work?',
  'description': '<p>Donating&nbsp;blood is a simple thing to do, but can make a big difference in the lives of others. The donation process from the time you arrive until the time you leave takes about an hour.&nbsp;The donation itself is only about 8-10 minutes on average.&nbsp;The steps in the process are:</p>\n<p>Registration</p>\n<ol style=list-style-position: inside;>\n<li>You will complete donor registration, which includes information such as your name, address, phone number, and donor identification number (if you have one).</li>\n<li>You will be asked to show a donor card, driver’s license or two other forms of ID.</li>\n</ol>\n<p>Health History and Mini Physical</p>\n<ol style=list-style-position: inside;>\n<li>You&nbsp;will answer some&nbsp;questions&nbsp;during a private and confidential interview about your health history and the places you have traveled.</li>\n<li>You will have your temperature, hemoglobin, blood pr

## Removing unwanted information and formatting text

In [7]:
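# Strip HTML from each description and keep only the category name (dict union needs Python 3.9+)
faqs = [
    item | {"description": BeautifulSoup(item["description"], "html.parser").text,
            "category": item["category"]["name"]}
    for item in data_faq_dict["faqs"]
]
faqs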

[{'title': 'How does the blood donation process work?',
  'description': 'Donating\xa0blood is a simple thing to do, but can make a big difference in the lives of others. The donation process from the time you arrive until the time you leave takes about an hour.\xa0The donation itself is only about 8-10 minutes on average.\xa0The steps in the process are:\nRegistration\n\nYou will complete donor registration, which includes information such as your name, address, phone number, and donor identification number (if you have one).\nYou will be asked to show a donor card, driver’s license or two other forms of ID.\n\nHealth History and Mini Physical\n\nYou\xa0will answer some\xa0questions\xa0during a private and confidential interview about your health history and the places you have traveled.\nYou will have your temperature, hemoglobin, blood pressure and pulse checked.\n\nDonation\n\nWe\xa0will cleanse an area on your arm and insert a brand–new, sterile needle for the blood draw. This fee
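The cleaned descriptions still contain non-breaking spaces (`\xa0`) and runs of blank lines. If plain whitespace is preferred, an optional extra pass could look like the sketch below. `normalize_text` and `clean_record` are hypothetical helper names, and the sample record is made up for illustration. The `{**record, ...}` form merges dictionaries like the `|` operator but also works on Python versions before 3.9.

```python
import re


def normalize_text(text):
    """Replace non-breaking spaces, squeeze spaces, and collapse blank lines."""
    text = text.replace("\xa0", " ")
    text = re.sub(r"[ \t]+", " ", text)     # runs of spaces or tabs become one space
    text = re.sub(r"\n\s*\n", "\n", text)   # blank lines become a single newline
    return text.strip()


def clean_record(record):
    """Return a copy of the record with a normalized description."""
    return {**record, "description": normalize_text(record["description"])}


# Hypothetical record shaped like the entries in `faqs`.
sample = {
    "title": "How does it work?",
    "description": "Donating\xa0blood is simple.\nRegistration\n\nYou will complete.\n",
    "category": "Donating Blood",
}
cleaned = clean_record(sample)
```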

In [8]:
with open("data/faq.json", "w", encoding="utf-8") as fp:
    json.dump({"faqs": faqs}, fp, indent=4)
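A quick way to check the export is to read the file back and compare it with the in-memory data. The sketch below does this in a temporary directory with a hypothetical one-record list (`sample_faqs`) so that it runs on its own. It also passes `ensure_ascii=False`, an optional choice not used in the cell above, which writes characters such as `’` as-is instead of as `\u2019` escapes; both forms load back to the same data.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical stand-in for the `faqs` list built earlier.
sample_faqs = [
    {
        "title": "What ID do I need?",
        "description": "A donor card or driver’s license.",
        "category": "Donating Blood",
    }
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "faq.json"

    # ensure_ascii=False keeps non-ASCII characters readable in the file.
    with open(path, "w", encoding="utf-8") as fp:
        json.dump({"faqs": sample_faqs}, fp, indent=4, ensure_ascii=False)

    raw_text = path.read_text(encoding="utf-8")

    with open(path, "r", encoding="utf-8") as fp:
        reloaded = json.load(fp)

round_trip_ok = reloaded == {"faqs": sample_faqs}
```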