# POV: You are a Policy Analyst at a Federal Agency. A new Executive Order about AI has been released, and your agency needs to know the action items and the timeline for each.

1) Find all the items associated with 'use case', 'safety', 'rights'.
2) Extract any action with a deadline.
Demos
3) Produce a summary of the document.
4) Produce a summary based on a specific topic.

Link:
- https://www.whitehouse.gov/wp-content/uploads/2024/03/M-24-10-Advancing-Governance-Innovation-and-Risk-Management-for-Agency-Use-of-Artificial-Intelligence.pdf

In [1]:
import re
import requests
from bs4 import BeautifulSoup

from IPython.display import HTML
import matplotlib.pyplot as plt
import PyPDF2

import warnings
warnings.filterwarnings('ignore')

In [2]:
def read_pdf_with_pypdf2(pdf_path):
    # Open the PDF file in binary mode
    with open(pdf_path, 'rb') as file:
        # Create a PDF reader object
        pdf_reader = PyPDF2.PdfReader(file)
        
        # Initialize a variable to hold the text
        full_text = ""
        
        # Loop through each page in the PDF
        for page in pdf_reader.pages:
            text = page.extract_text()  # Extract text from the page
            if text:  # Skip pages with no extractable text
                full_text += text + "\n"  # Append the page text to full_text
    
    # Return the extracted text
    return full_text

# Path to your PDF file
pdf_path = './M-24-10-Advancing-Governance-Innovation-and-Risk-Management-for-Agency-Use-of-Artificial-Intelligence.pdf'

# Call the read_pdf_with_pypdf2 function and print the result
text = read_pdf_with_pypdf2(pdf_path)
print(text[:100])

EXECUTIVE OFFICE OF THE PRESIDENT  
OFFICE OF MANAGEMENT AND BUDGET  
WASHINGTON, D.C. 20503  
 
 
T
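
The extracted text keeps the PDF's trailing spaces, doubled spaces, and runs of blank lines, which can make later phrase matching miss terms. As a minimal sketch, assuming that collapsing whitespace does not lose anything you need, a hypothetical helper named `normalize_whitespace` (not part of the original notebook) could clean the text before further processing:

```python
import re

def normalize_whitespace(raw_text):
    """Collapse runs of spaces/tabs and trim each line (hypothetical helper)."""
    lines = []
    for line in raw_text.splitlines():
        # Replace any run of spaces or tabs with a single space, then trim the line
        cleaned = re.sub(r'[ \t]+', ' ', line).strip()
        lines.append(cleaned)
    # Collapse three or more consecutive newlines into a single blank line
    return re.sub(r'\n{3,}', '\n\n', '\n'.join(lines))

# Illustrative input shaped like the first lines of the extracted text
sample = "EXECUTIVE OFFICE OF THE PRESIDENT  \nOFFICE OF MANAGEMENT AND BUDGET  \n \n \nT"
print(normalize_whitespace(sample))
```

Whether to apply this before sentence tokenization is a judgment call: it makes matching more forgiving, but it also discards layout cues such as indentation.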


In [4]:
import nltk
from nltk.tokenize import sent_tokenize

# Find the portions of the text that mention a specific topic

In [5]:
def search_and_return_sentences(document, search_term):
    # Tokenize the document into sentences
    sentences = sent_tokenize(document)

    # Use regular expression to find sentences containing the search term
    matching_sentences = [sentence for sentence in sentences if re.search(r'\b{}\b'.format(re.escape(search_term)), sentence, flags=re.IGNORECASE)]

    return matching_sentences

In [6]:
search_term = "safety"
result_sentences = search_and_return_sentences(text, search_term)

# Display the sentences containing the search term
for sentence in result_sentences:
    print(sentence)
    print('\n')

Consistent with  the AI in Government Act  of 2020,1 the Advancing American AI Act,2 and 
Executive Order 14110 on the Safe, Secure, and Trustworthy Development and Use of Artificial 
Intelligence , this memorandum directs agencies to advance AI governance  and innovation  while 
managing risks from the use of  AI in the Federal Government , particularly those affecting the 
rights and safety  of the public .3  
 
1.


As such, this memorandum establishes new agency 
requirements  and guidance  for AI governance , innovation , and risk management, including 
through specific minimum risk management practices for  uses of AI that impact the rights and 
safety of the public .


Instead, it establishes new requirements  and recommendations 
that, both independently and collectively,  address the specific  risks from relying  on AI to inform 
or carry out  agency decisions and actions, particularly when such reliance impacts  the rights and 
safety of the public.


To address these  risks 
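
The first task lists three terms ('use case', 'safety', 'rights'), so it may be convenient to run the same whole-word search for all of them in one pass. The sketch below is one possible way to do that, not the notebook's own method. It makes two assumptions: a naive regex splitter stands in for `sent_tokenize` so the block runs without NLTK data, and multi-word terms are allowed to match across any amount of whitespace, because the extracted text contains doubled spaces. The names `split_sentences` and `search_terms_in_document` are hypothetical, and the sample text is invented for illustration.

```python
import re

def split_sentences(document):
    # Naive stand-in for nltk's sent_tokenize: split after ., ! or ? followed by whitespace
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', document) if s.strip()]

def search_terms_in_document(document, terms):
    """Return a dict mapping each term to the sentences that contain it."""
    sentences = split_sentences(document)
    results = {}
    for term in terms:
        # Escape each word, then allow any run of whitespace between words
        flexible = r'\s+'.join(re.escape(part) for part in term.split())
        pattern = re.compile(r'\b{}\b'.format(flexible), flags=re.IGNORECASE)
        results[term] = [s for s in sentences if pattern.search(s)]
    return results

# Invented sample text, not a quotation from the memorandum
sample = ("Each agency must report every AI use  case annually. "
          "Some systems affect the rights and safety of the public. "
          "This sentence is unrelated.")

matches = search_terms_in_document(sample, ['use case', 'safety', 'rights'])
for term, found in matches.items():
    print(term, '->', len(found), 'sentence(s)')
```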

# Find the portions of the text that mention a specific topic and highlight the topic

In [7]:
def highlight_search_term(sentence, search_term):
    # Use HTML to highlight the search term in the sentence.
    # \g<0> re-inserts the matched text, so the original capitalization is preserved.
    highlighted_sentence = re.sub(r'\b{}\b'.format(re.escape(search_term)),
                                  r'<span style="color: red; font-weight: bold;">\g<0></span>',
                                  sentence, flags=re.IGNORECASE)
    return highlighted_sentence

In [8]:
def search_and_return_highlighted_sentences(document, search_term):
    # Tokenize the document into sentences
    sentences = sent_tokenize(document)

    # Filter and highlight the sentences that contain the search term
    matching_sentences = [highlight_search_term(sentence, search_term) for sentence in sentences if re.search(r'\b{}\b'.format(re.escape(search_term)), sentence, flags=re.IGNORECASE)]

    # Display the highlighted sentences as HTML
    display(HTML('<br><br>'.join(matching_sentences)))

In [9]:
search_term = "safety"
search_and_return_highlighted_sentences(text, search_term)

In [10]:
search_term = "rights"
search_and_return_highlighted_sentences(text, search_term)

# Find time-sensitive information

In [11]:
def extract_day_values(sentence):
    # Use regular expression to extract day values from the sentence
    matches = re.findall(r'\b(\d+)\s*days?\b', sentence, flags=re.IGNORECASE)
    return [int(match) for match in matches]
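
The pattern above only captures deadlines written as a number of days. If the document also states deadlines as calendar dates, a complementary extractor might look like the sketch below. It assumes dates are written in the US style "Month D, YYYY"; the name `extract_calendar_dates` and the example sentences are hypothetical and not taken from the memorandum.

```python
import re
from datetime import datetime

MONTHS = ('January|February|March|April|May|June|July|August|'
          'September|October|November|December')
# Matches dates such as "December 1, 2024" (assumed format)
DATE_PATTERN = re.compile(r'\b(' + MONTHS + r')\s+(\d{1,2}),\s*(\d{4})\b')

def extract_calendar_dates(sentence):
    """Return a list of datetime.date objects found in the sentence."""
    dates = []
    for month, day, year in DATE_PATTERN.findall(sentence):
        try:
            parsed = datetime.strptime(f'{month} {day} {year}', '%B %d %Y')
        except ValueError:
            # Skip impossible dates such as "February 30, 2024"
            continue
        dates.append(parsed.date())
    return dates

# Invented example sentence
print(extract_calendar_dates("Agencies must comply by December 1, 2024."))
```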

In [12]:
def highlight_additional_term(sentence, additional_term):
    # Use HTML to highlight the additional term in the sentence.
    # \g<0> re-inserts the matched text, so the original capitalization is preserved.
    highlighted_sentence = re.sub(r'\b{}\b'.format(re.escape(additional_term)),
                                  r'<span style="color: blue; font-weight: bold;">\g<0></span>',
                                  sentence, flags=re.IGNORECASE)
    return highlighted_sentence

In [13]:
def search_and_return_highlighted_time_mandates(document, additional_term=None):
    # Tokenize the document into sentences
    sentences = sent_tokenize(document)

    # Extract and store day values along with sentences
    sentences_with_days = {}
    for sentence in sentences:
        day_values = extract_day_values(sentence)
        if day_values:
            for day_value in day_values:
                if day_value not in sentences_with_days:
                    sentences_with_days[day_value] = []
                sentences_with_days[day_value].append(sentence)
    # Sort day values in ascending order
    sorted_days = sorted(sentences_with_days.keys())

    # Display sentences for each day value
    for day_value in sorted_days:
        # Show the number of days in bold
        display(HTML(f'<span style="font-weight: bold;">{day_value} day mandates:</span>'))

        # Match the same phrasings that extract_day_values accepts ("1 day", "60  days")
        day_pattern = r'\b{}\s*days?\b'.format(day_value)

        # Display the sentences that contain the specified number of days
        for sentence in sentences_with_days[day_value]:
            highlighted_sentence = re.sub(day_pattern,
                                          r'<span style="color: red; font-weight: bold;">\g<0></span>',
                                          sentence, flags=re.IGNORECASE)
            if additional_term:
                highlighted_sentence = highlight_additional_term(highlighted_sentence, additional_term)
            display(HTML(highlighted_sentence))

In [14]:
search_and_return_highlighted_time_mandates(text, 'use case')

In [15]:
search_and_return_highlighted_time_mandates(text, 'safety')
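
Since the goal is a timeline, it can help to turn each day count into a calendar date. The sketch below assumes every "N days" clock starts on a single start date that you supply; in practice some mandates may count from a different event, so each sentence should be read before trusting the result. The function name `deadlines_from_day_counts` is hypothetical, and the start date used here is an assumed issuance date that should be confirmed against the memorandum itself.

```python
from datetime import date, timedelta

def deadlines_from_day_counts(day_counts, start_date):
    """Map each distinct day count to start_date + that many days."""
    return {days: start_date + timedelta(days=days)
            for days in sorted(set(day_counts))}

# Assumed issuance date, for illustration only; verify before relying on it
assumed_start = date(2024, 3, 28)

timeline = deadlines_from_day_counts([180, 60, 60], assumed_start)
for days, due in timeline.items():
    print(f'{days} days -> {due.isoformat()}')
```

The day counts passed in could come from `extract_day_values`, applied to each sentence of the document.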

# Summarize the document as a whole, or summarize the results for a specific keyword

In [16]:
# https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [17]:
def extractive_summarization_with_keyword(document, keyword=None, num_sentences=3):
    # Tokenize the document into sentences
    sentences = sent_tokenize(document)

    # Create TF-IDF matrix
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(sentences)

    if keyword:
        # Score each sentence by its cosine similarity to the keyword vector
        keyword_vector = vectorizer.transform([keyword])
        similarity_scores = cosine_similarity(tfidf_matrix, keyword_vector)
        keyword_scores = similarity_scores.flatten()
        ranked_sentences = [(score, sentence) for score, sentence in zip(keyword_scores, sentences)]
        
    else:
        # Calculate cosine similarity between every pair of sentences
        similarity_matrix = cosine_similarity(tfidf_matrix, tfidf_matrix)
        # Score each sentence by its total similarity to all sentences
        # (a simple centrality measure; this is not PageRank)
        scores = similarity_matrix.sum(axis=1)
        ranked_sentences = [(score, sentence) for score, sentence in zip(scores, sentences)]
    
    
    ranked_sentences.sort(reverse=True)

    # Select the top N sentences for the summary
    summary_sentences = [sentence for _, sentence in ranked_sentences[:num_sentences]]
    summary = ' '.join(summary_sentences)

    return summary
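
The function above joins the selected sentences in score order, so the summary can jump around the document. One possible refinement, shown here as a sketch rather than as part of the original notebook, is to pick the top-scoring sentences and then emit them in their original order. The helper name `top_sentences_in_document_order` is hypothetical; it takes a list of sentences and a parallel list of scores, however those scores were computed.

```python
def top_sentences_in_document_order(sentences, scores, num_sentences=3):
    """Select the highest-scoring sentences, then restore document order."""
    # Rank sentence positions from highest to lowest score
    ranked_positions = sorted(range(len(sentences)),
                              key=lambda i: scores[i], reverse=True)
    # Keep the top N positions and sort them back into document order
    kept_positions = sorted(ranked_positions[:num_sentences])
    return [sentences[i] for i in kept_positions]

# Toy data: four placeholder sentences with made-up scores
example_sentences = ['first', 'second', 'third', 'fourth']
example_scores = [0.1, 0.9, 0.3, 0.8]
print(' '.join(top_sentences_in_document_order(example_sentences, example_scores, 2)))
```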


In [18]:
# Set the keyword for the search
keyword_to_search = "use case"

In [19]:
# Set the number of sentences for the summary
num_summary_sentences = 10

# Perform extractive summarization with keyword search
summary = extractive_summarization_with_keyword(text, keyword_to_search, num_sentences=num_summary_sentences)

In [20]:
# Display the summary
print("\nSummary (related to the keyword '{}'):\n".format(keyword_to_search), summary)


Summary (related to the keyword 'use case'):
 AI Use Case Inventories . v. Reporting on AI Use Case s Not Subject to  Inventory . the agency’s plans to effectively govern its use of AI, including through its Chief AI 
Officer, AI Governance Boards, and improvements to its AI use case inventory;  
 
iv. This 
memorandum de scribes the roles, responsibilities, seniority, position, and reporting structures for 
agency CAIOs, including expanded reporting through agency AI use case inventories . Where people interact with a service relying on the  
AI and are likely to be impacted by the AI , agencies must also provide reasonable 
and timely notice44 about the use of the AI and a means to directly access any 
public documentation  about it in the use case inventory . Beginning 
with the use case inventory for 2024, agencies will be required , as applicable,  to identify 
which use cases are safety-impacting and rights-impacting A I and report additional detail 
on the risks—including risks