# NSMQ - Kwame AI Project

###### Title: A Script for Parsing HTML (Using URLs) into Sections and Paragraphs
###### By: Ernest Samuel, Team member; Data preprocessing Team
###### Date: 24-06-2023
#### Description:

This script contains five (5) main functions, plus a small helper, remove_items(a, b), which removes the items of one list from another:

1. unique(array): Removes duplicates from processed data. It accepts a list of items and returns it without duplicates, keeping the first occurrence of each item.

2. extract_rawTable_of_content(link, homePage): Extracts the table of contents of the textbook. 'link' is the URL of the target website up to and including the last forward slash "/", and 'homePage' is the string after that slash, which identifies the landing page of the book (for example, "1-introduction").

3. extract_url(link, pageList, maxNmber, char): Selects entries from the table of contents and concatenates each one with the link to form the URL of the desired page.

   * link: As described for the function above.

   * pageList: A list containing the table of contents extracted by the previous function.

   * maxNmber: The maximum index number in the table of contents of the desired textbook.

   * char (optional): A list of the non-numeric first characters of table-of-contents entries that should also be included.

4. extract_url_content(url, file_name): Takes the URL of a textbook page and the name of the textbook, extracts and structures the data from that page, and returns it. The code that saved the result under the file name is currently commented out. NB: file_name is optional.

5. extract_textbook(url_list, textbook_name): Iterates over the list of URLs generated by function 3, passes each one to function 4 to extract its contents, and then saves the result as a JSON file named after the textbook ('textbook_name').
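
As a rough illustration of how functions 1 and 3 behave, the sketch below re-implements them in compact form and runs them on a small hand-written table of contents. The page slugs in `sample_toc` are hypothetical and no request is made to the website; only the base link is taken from the comments in the script.

```python
def unique(array):
    # dict.fromkeys keeps the first occurrence of each item, in order
    return list(dict.fromkeys(array))


def extract_url(link, pageList, maxNmber, char=None):
    # Compact re-implementation for illustration: keep the entries that
    # start with a number from 1 to maxNmber or with one of the
    # strings in char
    prefixes = tuple(str(n) for n in range(1, maxNmber + 1)) + tuple(char or [])
    return unique(link + item for item in pageList if item.startswith(prefixes))


link = "https://openstax.org/books/university-physics-volume-3/pages/"

# Hypothetical table-of-contents entries, including one duplicate
sample_toc = [
    "preface",
    "1-introduction",
    "1-1-the-propagation-of-light",
    "1-introduction",
    "a-units",
    "index",
]

urls = extract_url(link, sample_toc, maxNmber=10, char=["a"])
for u in urls:
    print(u)
```

With these inputs, "preface" and "index" are dropped because they start with neither a number nor "a", and the duplicate "1-introduction" is kept only once.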




### Extract the URL for each page of the textbook

In [1]:

#-----------Function: Remove duplicates-------------------------#

def unique(array):
    # This function takes a list and removes duplicates,
    # keeping the first occurrence of each item
    return list(dict.fromkeys(array))


def remove_items(a, b):
    # This function removes a subset from a superset of a list.
    # a = superset, b = subset
    a = [item for item in a if item not in b]
    return a

#---------------------------------------------------------------#

#-----------Function: Extract table of contents -----------------#
import requests
from bs4 import BeautifulSoup

import os
from urllib.parse import urljoin
import json

def extract_rawTable_of_content(link, homePage):

    # --> homePage = first landing page of the online view of the textbook (e.g. "1-introduction")
    # ---> link = URL of the site excluding the landing page (e.g. "https://openstax.org/books/university-physics-volume-3/pages/")

    website_link = link + homePage
    url_list = []

    # Send a GET request to the website
    response = requests.get(website_link)
    if response.status_code == 200:
        # Parse the HTML content using BeautifulSoup
        soup = BeautifulSoup(response.content, 'html.parser')

        # Take the first <div> on the page as the table-of-contents
        # container (no class filter is applied)
        table_of_contents_div = soup.find('div')

        if table_of_contents_div:
            # Find all the <a> tags within the table of contents div
            a_tags = table_of_contents_div.find_all('a')

            # Extract the href attribute from each <a> tag and store it in the list
            for a_tag in a_tags:
                href = a_tag.get('href')
                if href:  # skip <a> tags that have no href
                    url_list.append(href)

        else:
            print("Table of contents div not found on the website.")
    else:
        print(f"Request failed with status code {response.status_code}.")

    return unique(url_list)
#----------------------------------------------------------------#

#----------- Extract URL for each page of the textbook -----------#

def extract_url(link, pageList, maxNmber, char=None):
    # This function keeps the table-of-contents entries whose first
    # number or letter is one of the required indexes
    #------------------------------------------------------------------------#

    #-> maxNmber = maximum index number in the table of contents
    #--> char = letters or strings used as indexes in the table of contents
    # --> pageList = a list containing the landing pages of all needed chapters (e.g. ["1-introduction", 'chapter-2' .. ])
    # ---> link = URL of the site excluding the landing page (e.g. "https://openstax.org/books/university-physics-volume-3/pages/")

    #------------------------------------------------------------------------#
    url = []
    list_pages = list(range(1, maxNmber + 1))
    if char:  # a None default avoids sharing one mutable list between calls
        list_pages.extend(char)

    for item in pageList:
        for value in list_pages:

            if item.startswith(str(value)):
                url.append(link + item)

    return unique(url)


### Extract contents from a URL


In [5]:

def extract_url_content(url, file_name=''):
    # Send a GET request to the URL
    response = requests.get(url)

    # Parse the HTML content
    soup = BeautifulSoup(response.content, "html.parser")

    # Use the first <div> on the page as the main content section
    main_content = soup.find("div")


    if main_content is None:
        print("Unable to find the main content section")
        return

    # Find the first title on the website
    titles = main_content.find_all(["h1", "h2", "h3", "h4", "h5"])
    first_title = None
    for title in titles:
        if title.text.strip():
            first_title = title.text.strip()
            first_title = str(file_name)+'-'+ first_title
            break
    file_name = first_title if first_title else str(file_name)+'-'+"content"

    # List to store the content
    content_list = []

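    #----------------------------
    # Title of the chapter (taken from the <title> tag in <head>)
    Heading = {}
    head = soup.find('head')
    title_tags = head.find_all('title') if head else []

    for title_tag in title_tags:
        name = title_tag.text.strip()
        Heading["Title"] = name

    # Collect every paragraph in <body>; paragraphs that also occur
    # inside a <section> are filtered out at the end of the function
    body = soup.find('body')
    paras = []
    pp = body.find_all('p') if body else []

    for p in pp:
        paras.append(p.text.strip())

    #---------------------------------------------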

    # Find all the sections in the main content
    sections = main_content.find_all("section")

    # Set to store unique section identifiers
    section_identifiers = set()

    # Iterate over each section
    sec = [] # collects every paragraph found inside a section (used later to find paragraphs outside sections)
    for section in sections:
        section_data = {}

        # Extract section identifier
        section_id = section.get("id")
        section_class = section.get("class")
        section_uuid_key = section.get("data-uuid-key")
        section_data_type = section.get("data-type")
        section_class_tuple = tuple(section_class) if section_class is not None else ()
        section_identifier = (section_id, section_class_tuple, section_uuid_key, section_data_type)

        # Skip if section identifier is already encountered
        if section_identifier in section_identifiers:
            continue

        # Add section identifier to the set
        section_identifiers.add(section_identifier)

        # Extract section title
        #------------------------
        # Fallback: the first heading on the page, used when the
        # section has no heading of its own
        subtitle = soup.find(['h3','h4','h2','h1'])
        #subtitle = soup.find('h3')
        #----------------
        title = section.find(["h1", "h2", "h3", "h4", "h5"])
        if title:
            section_data["title"] = title.text.strip()
        elif subtitle:
            section_data["title"] = subtitle.text.strip()
        else:
            section_data["title"] = ""

        # Extract section paragraphs
        paragraphs = section.find_all(["p", "span"])
        section_data["Section"] = []
        
        for paragraph in paragraphs:
            paragraph_text = paragraph.text.strip()
            if paragraph_text:
                section_data["Section"].append(paragraph_text)
                sec.append(paragraph_text)


        # Extract list items
        lists = section.find_all("ul")
        section_data["lists"] = []
        for ul in lists:
            list_items = ul.find_all("li")
            section_data["lists"].append([li.text.strip() for li in list_items])

        # Extract figures
        figures = section.find_all("div", {"class": "os-figure"})
        section_data["figures"] = []
        for figure in figures:
            figure_data = {}
            img = figure.find("img")
            # if img:
            #     image_url = urljoin(url, img["src"])
            #     figure_data["image"] = image_url
            # section_data["figures"].append(figure_data)

#------------------------------------------------------------#
            if img and "src" in img.attrs:
                image_url = urljoin(url, img["src"])
                figure_data["image"] = image_url

            caption = figure.find("figcaption")
            if caption:
                figure_data["caption"] = caption.text
            
            section_data["figures"].append(figure_data)




#------------------------------------------------------------#
        # Extract tables
        tables = section.find_all("table")
        section_data["tables"] = []
        for table in tables:
            table_data = []
            rows = table.find_all("tr")
            for row in rows:
                cells = row.find_all(["th", "td"])  # include header cells as well as data cells
                table_data.append([cell.text.strip() for cell in cells])
            section_data["tables"].append(table_data)

        content_list.append(section_data)

    # Keep only the paragraphs that are not part of any section
    Heading["Paragraphs_Not_in_Sections"] = remove_items(paras, sec)

    # # Get the current working directory and create the file path for JSON
    # script_dir = os.getcwd()
    # json_path = os.path.join(script_dir, f"{file_name}.json")

    # # Save the content as JSON
    # with open(json_path, "w") as file:
    #     json.dump(content_list, file, indent=4)

    # # Print the JSON content
    # with open(json_path, "r") as file:
    #     print(file.read())

    # Note: if you print the returned value, the original formatting is retained,
    # but in the JSON file the formatting is lost.


    # Heading holds the title of the chapter and all text that is not inside a section
    return Heading, content_list


### Extract all URLs in the textbook and save as one file

In [4]:

def extract_textbook(url_list, textbook_name):

    # List to store the content
    content_list = []
    page_data = {}
    pages = 0
    for url in url_list:
        page_content=[extract_url_content(url)]
    
        page_data['Page '+str(pages)] = page_content
        pages +=1
        
    content_list.append(page_data)

    # Get the current working directory and create the file path for JSON
    script_dir = os.getcwd()
    json_path = os.path.join(script_dir, f"{textbook_name}.json")

    # Save the content as JSON
    with open(json_path, "w") as file:
        json.dump(content_list, file, indent=4)

    # Print the JSON content
    # with open(json_path, "r") as file:
    #     print(file.read())
    
    return content_list

### Test Code

In [6]:


# Read the CSV file containing the links to the textbooks
import pandas as pd
openStax = pd.read_csv('OpenStax Textbooks - Sheet2.csv')

# Select which textbook to extract
openStax = openStax.iloc[[1]]

# For each book, extract the URL content.
for BookName, urls in zip(openStax['BOOKS'], openStax['URL']):

    print(BookName)
    # Initialize the first page of the online view
    landpage = 'preface'
    site = str(urls)

    # Extract the table of contents
    pageList = extract_rawTable_of_content(site, landpage)

    # Initialize the first index of the required contents from the table of contents
    maxNmber = 10 # maximum numerical index in the table of contents
    char = ['a', 'b','c', 'd', 'e', 'f', 'g', 'i']


    # Extract the list of URLs to be parsed for website scraping (this is used in the htmlProcessing)
    URLs = extract_url(site, pageList, maxNmber, char)

    # Name of the textbook
    textbook = str(BookName)
    # Extract the textbook
    cont = extract_textbook(URLs, textbook)

College Physics 2e


In [8]:
cont

[{'Page 0': [({'Title': 'Ch. 1 Introduction to Science and the Realm of Physics, Physical Quantities, and Units - College Physics 2e | OpenStax',
     'Paragraphs_Not_in_Sections': ['What is your first reaction when you hear the word “physics”? Did you imagine working through difficult equations or memorizing formulas that seem to have no real use in life outside the physics classroom? Many people come to the subject of physics with a bit of fear. But as you begin your exploration of this broad-ranging subject, you may soon come to realize that physics plays a much larger role in your life than you first thought, no matter your life goals or career choice.',
      'Consider the Veil Nebula, a cloud of heated dust and gas located about 2,400 light years from Earth (a light year is the distance light travels in one year, or approximately 9.5 trillion kilometers). The unique structure is the ongoing result of a supernova that occurred 8,000 years ago. The shock wave from the explosion is 


## Structure of the stored data

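- The data is stored page by page, following the website's table of contents, in dictionary format with the keys 'Page 0', 'Page 1', and so on.

- For each page, the data stores the page title and the paragraphs found outside any section, followed by a list of sections. Each section has a title ("title"), its paragraphs ("Section"), and its lists, figures and tables ("lists", "figures", "tables").

- "Section" contains a list of the paragraphs (the text of the `p` and `span` elements) in that section of the textbook.

- "figures" contains, for each figure identified in a section, the link to its image and its caption when these are available.

- "tables" contains all tabular data identified in the section.

- "lists" contains all unordered (`ul`) lists found in a section; ordered lists are not collected separately.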
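
The nesting described above can be sketched as follows. The values are placeholders invented for illustration; only the key names and the nesting mirror what the script produces. `iter_sections` is a hypothetical helper, not part of the script, that walks the structure after it has been loaded back from JSON (where the `(Heading, content_list)` tuple becomes a two-element list).

```python
import json

# Placeholder data: key names and nesting follow the script,
# the values are made up
stored = [{
    "Page 0": [(
        {
            "Title": "Example chapter title",
            "Paragraphs_Not_in_Sections": ["Intro paragraph."],
        },
        [{
            "title": "Example section",
            "Section": ["First paragraph.", "Second paragraph."],
            "lists": [["item one", "item two"]],
            "figures": [{"image": "https://example.org/fig1.png",
                         "caption": "Figure 1"}],
            "tables": [[["a", "b"], ["1", "2"]]],
        }],
    )],
}]

# A JSON round trip turns the tuple into a list
loaded = json.loads(json.dumps(stored))


def iter_sections(data):
    """Yield (page name, page title, section dict) for every section."""
    for book in data:
        for page_name, page_content in book.items():
            for entry in page_content:
                if entry is None:  # page whose content could not be found
                    continue
                heading, sections = entry
                for section in sections:
                    yield page_name, heading.get("Title", ""), section


rows = [
    (page, title, section["title"], len(section["Section"]))
    for page, title, section in iter_sections(loaded)
]
print(rows)
```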

## Observations

- If the data is stored in JSON format, some special characters, including maths equations, are written in an encoded format.

- However, if we print the content without storing it in JSON format, the data is returned exactly as it appears in the textbook.
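
The encoded characters are most likely the `\uXXXX` escapes that Python's `json.dump` writes for non-ASCII characters by default (`ensure_ascii=True`). This is an assumption about the cause, since the stored file is not shown here. Under that assumption, a minimal sketch of keeping the characters readable is to pass `ensure_ascii=False` and open the file with UTF-8 encoding. The sample strings and the temporary file name below are made up for illustration.

```python
import json
import os
import tempfile

# Made-up sample containing curly quotes and a Greek letter
sample = {"text": "“physics”", "equation": "Δx = v·t"}

escaped = json.dumps(sample)                       # default: non-ASCII is escaped
readable = json.dumps(sample, ensure_ascii=False)  # characters are kept as-is

print(escaped)
print(readable)

# Writing the readable form to disk needs an explicit UTF-8 encoding
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.json")
    with open(path, "w", encoding="utf-8") as file:
        json.dump(sample, file, indent=4, ensure_ascii=False)
    with open(path, encoding="utf-8") as file:
        restored = json.load(file)
```

Both forms decode to the same data when loaded with `json.load`, so the escaped file is not corrupted; it is only harder to read in a text editor.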