<table width=100%>
  <tr>
    <td width=170px> <img src="../images/logo.png" alt="Oxford Logo" height="160" width="160"></td>
    <td width=220px style="font-size: 35px; text-align:left"> 
      <table>
        <tr><td style="font-size: 30px; text-align:left"> Universal Digital Identity: <br>NLP Analysis of Government Initiatives</td></tr>
      </table>   
    </td>
    <td width=150>
 <table>
 <tr><td>  Matthew Comb (2910648)</td></tr>
  <tr><td>matthew.comb@linacre.ox.ac.uk</td></tr>
 <tr><td>  Dr Andrew Martin</td></tr>
  <tr><td>andrew.martin@kellogg.ox.ac.uk</td></tr>
  </table>
  </td></tr>
</table>  


# Abstract <a class="anchor" id="research-abstract"></a>

Governments globally recognise digital identity as essential for a thriving digital economy. By implementing digital identity systems, they strive to streamline citizen services, more effectively combat fraud, improve regulatory oversight, ensure accessibility for marginalised groups, and cut costs linked to conventional paper-based methods. This comprehensive approach highlights some of the many benefits of digital identity in modern governance and economic systems.

Implementing digital identity systems is, however, a complex challenge. The field, marked by over 6000 commercial patents filed worldwide, is now fragmented, and struggles with the significant challenge of interoperability between digital identity implementations. Currently, the landscape of digital identity is transforming owing to two key developments: the European Union's eIDAS regulatory reform, which aims to standardise electronic identification to enhance trust in online transactions, and the World Wide Web Consortium's Verifiable Credentials framework, which focuses on a decentralised approach that prioritises secure, private, and user-controlled digital identity verification. These initiatives involve a combination of centralised regulation and decentralised technology and bolster the prospects for a more unified approach to digital identity.

This study has two primary objectives. First, it employs natural language processing to analyse the strategies and developmental stages of key government digital identity programmes. Second, it identifies key patterns within these programmes that are vital for establishing a universally relevant digital identity framework. By considering various digital identity paradigms and frameworks, the research seeks to deepen the understanding of the digital identity field and highlight effective practices that could facilitate the broader adoption of a comprehensive digital identity model.

# Table of Contents:

* [Abstract](#research-abstract)
* [Introduction](#introduction)
* [Research Questions](#research-questions)
* [Data Collection](#data-collection)
* [Initialisation](#initialisation)
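* [Helper Functions](#helper-functions)
* [Sample - Denmark](#sample-usecase)
* [Term Metrics](#term-metrics)
* [Clustering](#clustering)
* [PDF Routines](#pdf-routines)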



# Introduction <a class="anchor" id="introduction"></a>

Digital identity has become a critical component of modern governance, enabling governments to deliver services, enforce regulations, and support economic growth. Research into this domain often requires analyzing data from government websites, which serve as rich repositories of policies, frameworks, and updates related to digital identity systems. This Jupyter notebook provides a practical guide for data mining government websites to extract and analyze information relevant to digital identity.

The sample code in this notebook demonstrates how to:

1. **Identify and Scrape Relevant Web Pages**:
   - Use web scraping libraries like `requests` and `BeautifulSoup` to extract content from government websites.
   - Handle dynamic content using `Selenium` if necessary.

2. **Preprocess and Clean the Data**:
   - Remove unwanted elements like advertisements, navigation bars, and extraneous formatting.
   - Structure the data for analysis by extracting meaningful sections, such as headers, paragraphs, and tables.

3. **Perform Natural Language Processing (NLP)**:
   - Tokenize, clean, and preprocess textual data for further analysis.
   - Apply topic modeling, keyword extraction, or sentiment analysis to identify trends in digital identity discourse.

4. **Analyze and Visualize Findings**:
   - Summarize extracted data using word clouds, frequency distributions, or network diagrams.
   - Highlight key terms, themes, and patterns that emerge from the analysis.
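
The four steps above can be sketched end to end using only the Python standard library. This is a minimal illustration rather than the notebook's pipeline: `html.parser` stands in for `BeautifulSoup`, and `VisibleTextParser`, `page_to_term_counts`, and the sample HTML are hypothetical names and data introduced here.

```python
from collections import Counter
from html.parser import HTMLParser
import re


class VisibleTextParser(HTMLParser):
    """Collect text that sits outside tags which rarely hold body content."""

    SKIPPED = {"script", "style", "nav", "header", "footer"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def page_to_term_counts(html):
    # Steps 1-2: parse the page and keep only the visible body text
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    # Step 3: lowercase and tokenise on runs of letters
    tokens = re.findall(r"[a-z]+", " ".join(parser.parts).lower())
    # Step 4: summarise as term frequencies
    return Counter(tokens)


# Hypothetical page used only for illustration
sample_html = """
<html><head><style>p {color: red}</style></head>
<body><nav><a href="/">Home</a></nav>
<main><h1>Digital identity</h1>
<p>Digital identity supports public services.</p></main>
</body></html>
"""
counts = page_to_term_counts(sample_html)
print(counts.most_common(3))
```

The navigation link and the stylesheet are excluded from the counts, which mirrors the intent of the cleaning step even though the notebook itself uses a density-based extractor.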

This notebook is designed for researchers interested in studying government policies and practices on digital identity. The methods demonstrated here can be adapted to other domains requiring web data mining. By the end of this tutorial, you will have a foundational understanding of how to gather, process, and analyze data from government websites to support your research.

> **Note**: Ensure you comply with the terms of service and legal guidelines of the websites you scrape. Some sites may prohibit automated scraping or require explicit permission.
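
One way to check part of this programmatically is to consult a site's `robots.txt` before crawling. The sketch below uses the standard library's `urllib.robotparser` on an in-memory rule set; the rules, the `research-crawler` user agent, and the `example.org` URLs are assumptions made for illustration. Against a live site, `set_url()` followed by `read()` would load the real file instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, supplied as a list of lines
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 2",
]

parser = RobotFileParser()
parser.parse(robots_lines)

user_agent = "research-crawler"  # assumed name for this crawler
print(parser.can_fetch(user_agent, "https://example.org/policy/"))
print(parser.can_fetch(user_agent, "https://example.org/private/report"))
print(parser.crawl_delay(user_agent))
```

A `robots.txt` check does not replace reading a site's terms of service; it only covers the crawling rules the site publishes for automated agents.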

### Please Note

Results from this sample code may not match the findings in related journal publications. This is because the sample code does not include the processing of PDF files linked on the webpages, which are integral to the full data mining package used in the research. Additionally, webpages are subject to constant updates over time, leading to potential variations in the extracted content.


# Research Questions <a class="anchor" id="research-questions"></a>

1. **Are governments aligned in their approach to digital identity solutions?**

2. **What is the maturity of leading government digital identity programs?**

3. **What published information do governments prioritize in their digital identity approach?**

4. **What patterns have emerged in these developments that could pave the way for a stable and universal digital identity?**


## Initialisation <a class="anchor" id="initialisation"></a>

In [1]:
# import python data analysis library
import pandas as pd

# pd.set_option('max_columns', 120)
# pd.set_option('max_rows', 20)

In [2]:
# import scientific platform package
import numpy as np

In [3]:
# HTML imports
from IPython.display import HTML, clear_output
import requests

In [4]:
# Beautiful Soup
from bs4 import BeautifulSoup
from bs4 import Tag

In [5]:
# Set debug flag
debug = False

In [6]:
# urllib
from urllib.parse import urlparse, urljoin, urlunparse
from urllib.request import urlopen

In [7]:
#system etc
import time
import os
import sys
import re
import unicodedata
import csv
import datetime
from collections import Counter

In [8]:
# Clustering packages
import sklearn
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import cosine_distances
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import KMeans
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from scipy.cluster.hierarchy import ward, fcluster

## Helper Functions <a class="anchor" id="helper-functions"></a>

In [17]:
# Helper function to select data per time period
def select_period(df, start_year, end_year, id):
    new = df[(df.Year.astype(int) >= start_year) & (df.Year.astype(int) <= end_year)].copy()
    new['ruleset'] = id
    return new

In [18]:
# Helper function to remove query string from url e.g. https://some.domain.com/page?querystring
def remove_query_string(url):
    # Parse the URL into its components
    parsed_url = urlparse(url)

    # Create a modified version of the parsed URL without the query string
    modified_url = parsed_url._replace(query='')

    # Convert the modified URL back to a string
    cleaned_url = urlunparse(modified_url)

    return cleaned_url
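
`remove_query_string` keeps any `#fragment`, so `page#a` and `page#b` would be recorded as two different visited URLs. If that matters, one possible variant drops the fragment as well. `strip_query_and_fragment` is a hypothetical name, not part of the notebook.

```python
from urllib.parse import urlparse, urlunparse


def strip_query_and_fragment(url):
    # Blank out both the query string and the fragment before rebuilding the URL
    parsed = urlparse(url)
    return urlunparse(parsed._replace(query="", fragment=""))


cleaned = strip_query_and_fragment("https://some.domain.com/page?lang=en#top")
print(cleaned)
```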

In [19]:
# Helper function to determine whether a URL contains a known file extension, i.e. points to a file rather than a page
def contains_substring(text):
    substrings = [".pdf", ".xls", ".doc"]
    for substring in substrings:
        if substring in text:
            return True
    return False
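
Because `contains_substring` matches anywhere in the string, it is case-sensitive and can also fire on a directory name that happens to contain `.doc`. A stricter check, sketched here as an assumption rather than the method used in the research, compares the suffix of the URL path. `looks_like_file` and `FILE_SUFFIXES` are hypothetical names.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Assumed list of file types to exclude from page crawling
FILE_SUFFIXES = {".pdf", ".xls", ".xlsx", ".doc", ".docx"}


def looks_like_file(url):
    # Only the path is inspected, so query strings do not interfere
    path = urlparse(url).path
    return PurePosixPath(path).suffix.lower() in FILE_SUFFIXES


print(looks_like_file("https://example.org/report.PDF?download=1"))
print(looks_like_file("https://example.org/about.documents/"))
```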

In [20]:
# Helper function to determine if the url is allowed according to given valid host list
def allow_url(url, allowed_hosts):
    # allowed_hosts is a comma-separated string of URL prefixes
    hosts = allowed_hosts.split(",")
    # Relative links (starting with "/") are always allowed
    return url.startswith("/") or any(url.startswith(host) for host in hosts)
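
`allow_url` accepts every link that begins with `/`, whichever section of the site it points to. A possible alternative, shown as a sketch, resolves the link against the page it was found on and then applies the prefix test to the absolute URL. `is_allowed` is a hypothetical helper, and the `/news/` path is invented for the example.

```python
from urllib.parse import urljoin


def is_allowed(link, page_url, allowed_prefixes):
    # Resolve relative links first, then match against the allowed prefixes
    absolute = urljoin(page_url, link)
    return absolute.startswith(tuple(allowed_prefixes))


prefixes = ["https://en.digst.dk/strategy/", "https://en.digst.dk/policy/"]
page = "https://en.digst.dk/strategy/"
print(is_allowed("/policy/the-danish-digital-journey/", page, prefixes))
print(is_allowed("/news/", page, prefixes))
```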

In [21]:
def write_scrape_line(text, indent_level):

    # Define the ANSI escape codes for different colors
    YELLOW = '\033[33m'
    WHITE = '\033[37m'

    indent_size = 2
    indent = " "
    if indent_level > 0:
        indent = " " * (1 + (indent_level * indent_size))

    if text != "STOP":
        current_time = datetime.datetime.now().strftime("%y-%m-%d %H:%M:%S")
        line = f"{WHITE}{current_time}{indent}{YELLOW}{text}"
        print(line)

In [22]:
def get_urls_webscrape(website_url, allowed_hosts):
    visited_urls = []
    starting_url = website_url
    starting_path = urlparse(starting_url).path
    crawl(starting_url, visited_urls, "", starting_path, allowed_hosts)
    return visited_urls

In [23]:
def clean_text(text):
    if debug: print("cleaning: " + text)
    # Remove unwanted characters
    #text = re.sub(r"[^\w\s]", "", text)
    #text = text.lower()
    # Normalize white space
    text = re.sub(r"\s+", " ", text)
    text = text.strip()
    # Convert the text to ASCII and remove diacritical marks
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("utf-8")
    # print(text)
    return text
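
The ASCII conversion in `clean_text` has a side effect worth knowing when reading the results: characters with no NFKD decomposition are dropped rather than transliterated. The sketch below reproduces the two steps in a hypothetical `normalise` function to show the effect on accented letters, typographic apostrophes, and the Danish letter ø.

```python
import re
import unicodedata


def normalise(text):
    # Collapse whitespace, then decompose and discard anything non-ASCII
    text = re.sub(r"\s+", " ", text).strip()
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")


print(normalise("  Borger.dk \n er  \u00e9n  portal  "))  # accented e keeps its base letter
print(normalise("Denmark\u2019s eID"))                      # typographic apostrophe is removed
print(normalise("K\u00f8benhavn"))                          # the letter o-with-stroke is removed
```

This is why possessives in the scraped sample text can appear without an apostrophe.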

In [24]:
def get_text_url(url):
    
    # Make a GET request to the URL
    response = requests.get(url, headers = {'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36'}, timeout = 30)
  
    # Check if the response is successful
    if response.status_code != 200:
        return None
    # Parse the HTML content
    soup = BeautifulSoup(response.content, "html.parser")
    # Extract the text from the HTML content
    text = " ".join(text.strip() for text in soup.stripped_strings)
    cleaned_text = clean_text(text)
    return cleaned_text

In [25]:
def has_style(tag):
    return tag.has_attr('style')

def has_class(tag):
    return tag.has_attr('class')

In [26]:
# A soup cleaning function to remove empty tags
def clean(soup):
    if soup.name == 'br' or soup.name == 'img' or soup.name == 'p' or soup.name == 'div':
        return
    try:
        ll = 0
        for j in soup.strings:
            ll += len(j.replace('\n', ''))
        if ll == 0:
            if debug: print("decomposing")
            if isinstance(soup, Tag):
                soup.decompose()
        else:
            for child in soup.children:
                clean(child)
    except Exception as e:
        print(e)
        pass

In [27]:
def dfs(soup, v):
    if soup.name == 'a' or soup.name == 'br':
        return
    try:

        lt = len(soup.get_text())
        ls = len(str(soup))

        if isinstance(soup, Tag):
            a = soup.find_all('a')
        else:
            a = []
        
        at = 0

        for j in a:
            at += len(j.get_text())
        lvt = lt - at
        v.append((soup, lt / ls * lvt))

        if isinstance(soup, Tag):
            for child in soup.children:
                dfs(child, v)
    except Exception as e:
        print(e)
        pass

In [28]:
def extract(soup, text_only = True, remove_img = True):

    filt = ['script', 'noscript', 'style', 'embed', 'label', 'form', 'input', 'iframe', 'head', 'meta', 'link', 'object', 'aside', 'channel']

    if (remove_img):
        filt.append('img')

    for ff in filt:
        for i in soup.find_all(ff):
            if isinstance(i, Tag):
                i.decompose()

    for tag in soup.find_all(has_style):
        del tag['style']
    for tag in soup.find_all(has_class):
        del tag['class']

    trimmed_text = soup.get_text().strip()
    if (trimmed_text == ""):
        print ("empty empty")
        return "", 0

    clean(soup)

    LVT = len(soup.get_text())
    for i in soup.find_all('a'):
        LVT -= len(i.get_text())
    v = []

    dfs(soup, v)

    mij = 0

    # print(v)
    for i in range(len(v)):
        if v[i][1] > v[mij][1]:
            mij = i

    if text_only:
        res = v[mij][0].get_text()
    else:
        res = str(v[mij][0])

    # Guard against pages whose text sits entirely inside links (LVT == 0)
    return res, (v[mij][1] / LVT if LVT else 0)
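
Reading `dfs` and `extract` together, the block selection appears to reduce to a text-density score. The notation below is introduced here for illustration and is not part of the original code: $T(n)$ is the length of the text in node $n$ (`lt`), $M(n)$ the length of its serialised markup (`ls`), and $A(n)$ the total length of link text inside it (`at`).

$$
s(n) = \frac{T(n)}{M(n)}\,\bigl(T(n) - A(n)\bigr), \qquad n^{*} = \arg\max_{n} s(n), \qquad \text{ratio} = \frac{s(n^{*})}{T(\text{root}) - A(\text{root})}
$$

The first factor favours nodes that are mostly text rather than markup, and the second favours long runs of text that are not link labels, so navigation menus score low. `extract` returns the content of $n^{*}$ together with the ratio.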

In [29]:
def safe_list_get (l, idx, default):
    try:
        return l[idx]
    except IndexError:
        return default

In [30]:
def crawl(url, visited_urls, base_url, starting_path, allowed_hosts):
    if debug: print("crawl()")
    # Parse the URL to extract the base URL and path
    parsed_url = urlparse(url)
    base_url = "{0.scheme}://{0.netloc}".format(parsed_url)
    path = parsed_url.path

    if (remove_query_string(url) in visited_urls) or (((starting_path not in url) or (not path.startswith(starting_path))) and (allow_url(url, allowed_hosts) == False)):
        # print("Exit")
        return

    if (not "mailto" in url) and (not "&url=" in url):
        url2 = url
        if len(starting_path) > 1:
            write_scrape_line("Found Page: " + url2.replace(starting_path, "/../"), 2)
        else:
            write_scrape_line("Found Page: " + url2, 2)
     
    # time.sleep(.2)
    # Make a GET request to the URL
    response = requests.get(url, headers = {'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36'}, timeout = 30)
    # Check if the response is successful
    # print(response.status_code)
    if response.status_code != 200:
        return
    # Parse the HTML content
    soup = BeautifulSoup(response.content, "html.parser")
    
    # Find all the links in the HTML content
    links = [link.get("href") for link in soup.find_all("a")]
    
    # Add the URL to the visited URLs
    visited_urls.append(remove_query_string(url))

    # Recursively crawl the links that contain the base URL
    for link in links:
        try:
            if (not contains_substring(link)):   
                full_url = urljoin(base_url, link)
                if debug: print("URL: " + full_url)
                if (link.startswith(starting_path) or link.startswith(url) or allow_url(full_url, allowed_hosts)) and full_url not in visited_urls:
                    if debug: print("crawling")
                    crawl(full_url, visited_urls, base_url, starting_path, allowed_hosts)
                # else:
                    # print("Discarding URL: " + full_url)
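        except Exception:
            # Typically reached when an <a> tag has no href (link is None);
            # errors raised while crawling the linked page also end up here
            print("Link null in WebScraperHelper")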
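
`crawl` recurses once per followed link, so on a large site the call stack can approach Python's default recursion limit of 1000 frames. One possible alternative, sketched here under assumptions, is an iterative breadth-first traversal with a queue. `crawl_breadth_first`, `get_links`, and the in-memory `site` dictionary are hypothetical; in the notebook, `get_links` would wrap the `requests` and `BeautifulSoup` calls.

```python
from collections import deque


def crawl_breadth_first(start_url, get_links, max_pages=100):
    visited = []               # pages in the order they were processed
    seen = {start_url}         # everything already queued, for O(1) lookups
    queue = deque([start_url])
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical link graph standing in for a website
site = {
    "/a": ["/b", "/c"],
    "/b": ["/a", "/d"],
    "/c": ["/d"],
    "/d": [],
}
order = crawl_breadth_first("/a", lambda url: site.get(url, []))
print(order)
```

Keeping a `seen` set alongside the ordered list avoids the linear `in` test on `visited_urls`, and `max_pages` gives an explicit stopping point.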

In [31]:
def get_text_webscrape(urls, inner=''):
    # Collect the cleaned main text of every URL and join the pages with spaces
    texts = []
    for url in urls:
        print(url)
        # Make a GET request to the URL
        response = requests.get(url, headers = {'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36'}, timeout = 30)
        # Skip pages that do not load successfully
        if debug: print(response.status_code)
        if response.status_code != 200:
            continue
        # Parse the HTML content
        soup = BeautifulSoup(response.content, "html.parser")
        try:
            if inner != '':
                # Restrict extraction to the first matching element, e.g. <main>
                soup = soup.find(inner)
            extracted = extract(soup)
            texts.append(clean_text(safe_list_get(extracted, 0, "")))
        except Exception as err:
            # soup is None when the page has no `inner` element
            print(f"error extracting {url}: {err}")
    return " ".join(t for t in texts if t)

In [32]:
def get_term_frequencies_ordered(strings):
    """
    Given an array of strings, return a list of (term, frequency) pairs,
    ordered by frequency in descending order.

    :param strings: list of strings
    :return: list of tuples (term, frequency) sorted in descending frequency
    """
    freq_counter = Counter()

    # Iterate over each string in the input
    for s in strings:
        # Tokenize by splitting on whitespace (you could also lowercase if desired)
        tokens = s.split()
        freq_counter.update(tokens)

    # Sort by frequency (descending)
    term_freq_list = sorted(freq_counter.items(), key=lambda x: x[1], reverse=True)
    
    return term_freq_list
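
`get_term_frequencies_ordered` counts tokens exactly as they appear, so `Data` and `data` would be separate terms. In the Denmark sample this is harmless because the sentences are lowercased before counting, but for other inputs a case-folding option may help. The sketch below is one way to add it; `term_frequencies` and its `lowercase` parameter are hypothetical, and `Counter.most_common()` replaces the explicit sort.

```python
from collections import Counter


def term_frequencies(strings, lowercase=True):
    counter = Counter()
    for s in strings:
        # Optionally fold case before splitting on whitespace
        tokens = s.lower().split() if lowercase else s.split()
        counter.update(tokens)
    # most_common() returns (term, count) pairs in descending count order
    return counter.most_common()


print(term_frequencies(["Data authority data", "authority Data"]))
```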


In [33]:
def get_pdf_links_webscrape(keyword, pages):
    # Initialize an empty list to store the PDF links
    pdf_links = []
    
    # Initialize the starting index
    start = 0
    counter = 0
    # Iterate over the specified number of pages
    for i in range(pages):
        counter = start

        # Build the URL with the keyword and starting index
        url = f"https://arxiv.org/search/?searchtype=all&query={keyword}&abstracts=show&size=50&order=-announced_date_first&start={start}"
        
        # Make a GET request to the URL
        response = requests.get(url, headers = {'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36'}, timeout = 30)
        
        # Check if the response is successful
        if response.status_code != 200:
            break
        
        # Parse the HTML content
        soup = BeautifulSoup(response.content, "html.parser")
        
        # Find all the <li> items with class "arxiv-result"
        li_items = soup.find_all("li", class_="arxiv-result")
        
        # Iterate over the <li> items
        for li in li_items:
            
            # Find the first <a> tag with the text "pdf"
            a_tag = li.find("a", string="pdf")
            
            # Check if the <a> tag was found
            if a_tag:
                # Get the href attribute of the <a> tag
                pdf_link = a_tag.get("href")
                
                # Add the PDF link to the list
                pdf_links.append(pdf_link + ".pdf")
            
            # Logging code 
            counter += 1    
            write_scrape_line(f"collecting academia... {counter}/{pages * 50}", 1)
            time.sleep(.1)
        

        # Increment the starting index
        start += 50

    clear_output(wait=True)
    return pdf_links

## Sample - Denmark <a class="anchor" id="sample-usecase"></a>

The following is a working example of one of the country sets used in this work. Because the research relied on a distributed data mining platform, this sample code omits some of its features, including the extraction of text from PDF files linked on the crawled webpages. This omission changes the results significantly, as do updates made to the website over time.

In [51]:
website = "https://en.digst.dk/systems/mitid/"
allowed_hosts = "https://en.digst.dk/strategy/,https://en.digst.dk/policy/,https://en.digst.dk/digital-governance/,https://en.digst.dk/digital-services/,https://en.digst.dk/systems/,https://www.mitid.dk/en-gb/"
urls = get_urls_webscrape(website, allowed_hosts)

24-12-23 20:09:07     Found Page: https://en.digst.dk/../
24-12-23 20:09:09     Found Page: https://en.digst.dk/policy/international-cooperation/
24-12-23 20:09:10     Found Page: https://en.digst.dk/strategy/
24-12-23 20:09:11     Found Page: https://en.digst.dk/strategy/the-national-strategy-for-digitalisation/
24-12-23 20:09:12     Found Page: https://en.digst.dk/strategy/the-joint-government-digital-strategy/
24-12-23 20:09:14     Found Page: https://en.digst.dk/strategy/the-danish-national-strategy-for-cyber-and-information-security/
24-12-23 20:09:15     Found Page: https://en.digst.dk/strategy/the-danish-national-strategy-for-artificial-intelligence/
24-12-23 20:09:16     Found Page: https://en.digst.dk/policy/
24-12-23 20:09:18     Found Page: https://en.digst.dk/policy/the-danish-digital-journey/
24-12-23 20:09:19     Found Page: https://en.digst.dk/policy/government-digital-aca

In [53]:
# Print the urls found while crawling the website
print(urls[:20])

['https://en.digst.dk/systems/mitid/', 'https://en.digst.dk/policy/international-cooperation/', 'https://en.digst.dk/strategy/', 'https://en.digst.dk/strategy/the-national-strategy-for-digitalisation/', 'https://en.digst.dk/strategy/the-joint-government-digital-strategy/', 'https://en.digst.dk/strategy/the-danish-national-strategy-for-cyber-and-information-security/', 'https://en.digst.dk/strategy/the-danish-national-strategy-for-artificial-intelligence/', 'https://en.digst.dk/policy/', 'https://en.digst.dk/policy/the-danish-digital-journey/', 'https://en.digst.dk/policy/government-digital-academy/', 'https://en.digst.dk/digital-governance/', 'https://en.digst.dk/digital-governance/digital-architecture/', 'https://en.digst.dk/digital-governance/data/', 'https://en.digst.dk/digital-governance/information-security-in-danish-authorities/', 'https://en.digst.dk/digital-governance/data-ethics-in-business/', 'https://en.digst.dk/digital-governance/new-technologies/', 'https://en.digst.dk/dig

In [55]:
text = []
for url in urls:
    text.append(get_text_webscrape([url], 'main'))

https://en.digst.dk/systems/mitid/
https://en.digst.dk/policy/international-cooperation/
https://en.digst.dk/strategy/
https://en.digst.dk/strategy/the-national-strategy-for-digitalisation/
https://en.digst.dk/strategy/the-joint-government-digital-strategy/
https://en.digst.dk/strategy/the-danish-national-strategy-for-cyber-and-information-security/
https://en.digst.dk/strategy/the-danish-national-strategy-for-artificial-intelligence/
https://en.digst.dk/policy/
https://en.digst.dk/policy/the-danish-digital-journey/
https://en.digst.dk/policy/government-digital-academy/
https://en.digst.dk/digital-governance/
https://en.digst.dk/digital-governance/digital-architecture/
https://en.digst.dk/digital-governance/data/
https://en.digst.dk/digital-governance/information-security-in-danish-authorities/
https://en.digst.dk/digital-governance/data-ethics-in-business/
https://en.digst.dk/digital-governance/new-technologies/
https://en.digst.dk/digital-services/
https://en.digst.dk/digital-service

In [57]:
text[0]

"MitID MitID (the Danish National eID) is Denmark's digital ID that residents will use to access their public self-service solutions. eID is the key to digital Denmark. Today, more than 90 percent of the population uses their national eID in situations where it is essential to document ones identity electronically. eID enhances the scope of communication between residents and the public sector, and helps the public sector to offer better services to residents and businesses. It allows for residents to access their public services 24 hours a day. The switch to digital-first was enabled by the rollout of Denmarks second-generation national eID (NemID) in 2010. This served as a communal login for public and private self-service solutions and online banking. Digital solutions are renewed or replaced over time. That happens because of security requirements and new technology. In 2022 Denmark's third generation eID, MitID, was introduced. This new eID satisfies the latest security requiremen

In [59]:
def text_to_sentences(text: str) -> list[str]:
    """
    Takes a text string as input and returns a list of sentence strings, all in lowercase.
    """
    # Convert the text to lowercase
    lowercase_text = text.lower().strip()
    
    # Split the text into sentences by ., ?, or !
    potential_sentences = re.split(r'[.?!]+', lowercase_text)
    
    # Strip whitespace from each sentence and filter out empty ones
    sentences = [sentence.strip() for sentence in potential_sentences if sentence.strip()]
    
    return sentences
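
Splitting on every `.`, `?`, or `!` also cuts through decimals and version numbers, so `2.0` becomes two fragments. A slightly more conservative rule, offered as a sketch and not as the method used in the research, splits only where the punctuation is followed by whitespace. `split_sentences` is a hypothetical name, and abbreviations such as `e.g.` would still be split.

```python
import re


def split_sentences(text):
    # Split after ., ? or ! only when whitespace follows
    parts = re.split(r"(?<=[.?!])\s+", text.strip())
    # Drop the trailing punctuation, lowercase, and discard empty pieces
    cleaned = (part.rstrip(".?!").strip().lower() for part in parts)
    return [sentence for sentence in cleaned if sentence]


result = split_sentences("Version 2.0 launched in 2022. Is it secure? Yes!")
print(result)
```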

In [61]:
chunks = []
for page in text:
    chunks = chunks + text_to_sentences(page)

In [63]:
print(chunks[:5])

["mitid mitid (the danish national eid) is denmark's digital id that residents will use to access their public self-service solutions", 'eid is the key to digital denmark', 'today, more than 90 percent of the population uses their national eid in situations where it is essential to document ones identity electronically', 'eid enhances the scope of communication between residents and the public sector, and helps the public sector to offer better services to residents and businesses', 'it allows for residents to access their public services 24 hours a day']


In [65]:
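# word_tokenize, pos_tag and WordNetLemmatizer each rely on NLTK data packages,
# which can be fetched once with nltk.download(); package names depend on the NLTK version
import nltk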
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

def extract_lemmatized_nouns(sentences):
    """
    Takes a list of sentences (strings) as input and returns a list of strings,
    where each output string contains only the lemmatized nouns from the corresponding input sentence.
    """

    def get_wordnet_pos(pos_tag):
        """
        Convert the part-of-speech tag from NLTK's 'pos_tag' to a format
        that the WordNetLemmatizer can understand (e.g., wordnet.NOUN).
        """
        if pos_tag.startswith('J'):
            return wordnet.ADJ
        elif pos_tag.startswith('V'):
            return wordnet.VERB
        elif pos_tag.startswith('N'):
            return wordnet.NOUN
        elif pos_tag.startswith('R'):
            return wordnet.ADV
        else:
            return None

    lemmatizer = WordNetLemmatizer()
    noun_sentences = []

    for sentence in sentences:
        # Tokenize the sentence
        tokens = word_tokenize(sentence)

        # Get the POS tags for each token
        pos_tags = nltk.pos_tag(tokens)

        # Collect only lemmatized nouns
        lemmatized_nouns = []
        for (word, tag) in pos_tags:
            wordnet_pos = get_wordnet_pos(tag)
            # Check if it's a noun
            if wordnet_pos == wordnet.NOUN:
                # Lemmatize the noun (lowercase to keep consistent)
                lemma = lemmatizer.lemmatize(word.lower(), pos=wordnet_pos)
                lemmatized_nouns.append(lemma)
        
        # Join the lemmatized nouns into a single string
        noun_sentences.append(" ".join(lemmatized_nouns))

    return noun_sentences

In [67]:
lemmings = extract_lemmatized_nouns(chunks)
print(lemmings[:5])

['mitid mitid eid denmark id resident access solution', 'eid key denmark', 'today percent population eid situation one identity', 'eid scope communication resident public sector public sector service resident business', 'resident access service hour day']


## Term Metrics <a class="anchor" id="term-metrics"></a>

In [69]:
freq = get_term_frequencies_ordered(lemmings)
print(freq[:20])

[('data', 268), ('government', 174), ('authority', 147), ('post', 137), ('sector', 129), ('information', 128), ('app', 125), ('denmark', 113), ('agency', 110), ('service', 100), ('dk', 100), ('business', 86), ('citizen', 84), ('public', 82), ('strategy', 79), ('security', 72), ('health', 68), ('right', 68), ('number', 66), ('card', 66)]


## Clustering <a class="anchor" id="clustering"></a>

In [71]:
# LDA 4 clustering routines----------------------------------------
# Global variables
global_topic_keywords_lda = []
global_count_matrix_lda = None
global_vectorizer_lda = None
global_lda_model_lda = None

def cluster_sentences_lda(sentences, n_topics=5, num_keywords=3):
    global global_topic_keywords_lda, global_count_matrix_lda, global_vectorizer_lda, global_lda_model_lda
    global_topic_keywords_lda = []
    global_vectorizer_lda = CountVectorizer(stop_words='english')
    global_count_matrix_lda = global_vectorizer_lda.fit_transform(sentences)

    global_lda_model_lda = LatentDirichletAllocation(n_components=n_topics, random_state=42)
    global_lda_model_lda.fit(global_count_matrix_lda)

    keywords = []
    terms = global_vectorizer_lda.get_feature_names_out()
    for topic_idx, topic in enumerate(global_lda_model_lda.components_):
        keywords = [terms[i] for i in topic.argsort()[-num_keywords:][::-1]]
        global_topic_keywords_lda.append(keywords)

    return global_topic_keywords_lda

def find_cluster_for_sentence_lda(sentence):
    global global_lda_model_lda, global_vectorizer_lda

    if global_lda_model_lda is None or not global_vectorizer_lda:
        return None, None

    sentence_vector = global_vectorizer_lda.transform([sentence])
    topic_distribution = global_lda_model_lda.transform(sentence_vector)
    most_likely_topic = topic_distribution[0].argmax()

    # Using the max probability for the topic as similarity
    similarity = topic_distribution[0][most_likely_topic]

    arr_str = []
    arr_str.append(str(most_likely_topic))
    arr_str.append(str(similarity))
    return arr_str

def nlp_preprocess_lda_file(num_clusters, num_keywords, file_path):
    # Read one sentence per line, then cluster exactly as nlp_preprocess_lda does
    with open(file_path, "r", encoding="utf-8") as file:
        lines = file.readlines()
    return nlp_preprocess_lda(num_clusters, num_keywords, lines)

def nlp_preprocess_lda(num_clusters, num_keywords, string_array):
    sample_texts = []

    # Process each string in the input array
    for line in string_array:
        # Convert to ASCII, dropping diacritics and other non-ASCII characters
        text = unicodedata.normalize("NFKD", line).encode("ascii", "ignore").decode("utf-8")
        # strip() also removes trailing newline characters
        sample_texts.append(text.strip())

    # Fit the LDA model and collect the top keywords for each topic
    topic_keywords = cluster_sentences_lda(sample_texts, num_clusters, num_keywords)

    # Convert each topic's keyword list to its string representation
    return [str(keywords) for keywords in topic_keywords]

In [73]:
clusters = nlp_preprocess_lda(5, 10, lemmings)
print(clusters)

["['data', 'right', 'protection', 'processing', 'information', 'agency', 'case', 'digst', 'number', 'authority']", "['data', 'service', 'eid', 'self', 'company', 'country', 'eu', 'gateway', 'ethic', 'mitid']", "['post', 'dk', 'authority', 'citizen', 'business', 'borger', 'resident', 'sector', 'people', 'solution']", "['app', 'health', 'card', 'insurance', 'number', 'phone', 'information', 'licence', 'lifeindenmark', 'user']", "['government', 'sector', 'agency', 'strategy', 'security', 'denmark', 'public', 'information', 'service', 'authority']"]


## PDF Routines <a class="anchor" id="pdf-routines"></a>

In [248]:

def extract_pdf_links(url):
    pdf_links = []
    
    try:
        response = requests.get(url, headers = {'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36'}, timeout = 30)
  
        response.raise_for_status()  # Check for request errors
        soup = BeautifulSoup(response.content, 'html.parser')
        
        for link in soup.find_all('a', href=True):
            href = link['href']
            if re.search(r'\.pdf$', href, re.I):
                # urljoin resolves relative and root-relative links against the page URL
                # and leaves absolute links unchanged
                pdf_links.append(urljoin(url, href))
    
    except requests.exceptions.RequestException as e:
        print(f"Error: {e}")
    
    return pdf_links

def get_text_webscrape2(urls):

    text = ""
    total = len(urls)
    counter = 0
    for url in urls:

        # Make a GET request to the URL
        response = requests.get(url, headers = {'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36'}, timeout = 30)
          
        # Check if the response is successful
        if response.status_code != 200:
            continue
        
        # Parse the HTML content
        soup = BeautifulSoup(response.content, "html.parser")
        
        # Extract the text from the HTML content
        page_text = soup.get_text()
        
        # Add the text to the overall text
        text += page_text

        # Log code
        counter += 1
        log("C", "1", counter, total, "scraping text...")
        time.sleep(.2)
        
    
    cleaned_text = clean_text(text)
    
    # print(cleaned_text)
    return cleaned_text



def get_pdf_links_webscrape(keyword, pages):
    # Initialize an empty list to store the PDF links
    pdf_links = []
    
    # Initialize the starting index
    start = 0
    counter = 0
    # Iterate over the specified number of pages
    for i in range(pages):
        counter = start

        # Build the URL with the keyword and starting index
        url = f"https://arxiv.org/search/?searchtype=all&query={keyword}&abstracts=show&size=50&order=-announced_date_first&start={start}"
        
        # Make a GET request to the URL
        response = requests.get(url, headers = {'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36'}, timeout = 30)
        
        # Check if the response is successful
        if response.status_code != 200:
            break
        
        # Parse the HTML content
        soup = BeautifulSoup(response.content, "html.parser")
        
        # Find all the <li> items with class "arxiv-result"
        li_items = soup.find_all("li", class_="arxiv-result")
        
        # Iterate over the <li> items
        for li in li_items:
            
            # Find the first <a> tag whose text is "pdf"
            # (string= is the current name of BeautifulSoup's older text= argument)
            a_tag = li.find("a", string="pdf")
            
            # Check if the <a> tag was found
            if a_tag:
                # Get the href attribute of the <a> tag
                pdf_link = a_tag.get("href")
                
                # Add the PDF link to the list
                pdf_links.append(pdf_link + ".pdf")
            
            # Logging code 
            counter += 1    
            log("B", "1", counter, (pages * 50), "collecting academia...")
            time.sleep(.1)
        

        # Increment the starting index
        start += 50

    clear_output(wait=True)
    return pdf_links



def download_pdf(pdf_url, label):
    time.sleep(.2)
    folder = "./Files/Downloads/" + label + "/"

    # Create the target directory if it does not exist yet
    os.makedirs(folder, exist_ok=True)


    # Extract the file name from the URL
    file_name = pdf_url.split("/")[-1]
    
    # Build the full file path
    file_path = os.path.join(folder, file_name)


    if not os.path.isfile(file_path):
        # Make a GET request to the URL
        response = requests.get(pdf_url, headers = {'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36'}, timeout = 30)
     
        # Check if the response is successful
        if response.status_code != 200:
            return
    
        # Write the content to the file
        with open(file_path, "wb") as f:
            f.write(response.content)
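
`get_pdf_links_webscrape` above places the keyword straight into the arXiv query string, so a keyword containing spaces or characters such as `&` would produce a malformed URL. A minimal sketch of building the same search URL with the standard library's `urlencode`; `build_arxiv_search_url`, `ARXIV_SEARCH` and `PAGE_SIZE` are hypothetical names introduced here, and the parameters are copied from the URL used above:

```python
from urllib.parse import urlencode

ARXIV_SEARCH = "https://arxiv.org/search/"
PAGE_SIZE = 50  # same page size as the query string used above

def build_arxiv_search_url(keyword, page):
    """Build the search URL for a zero-based result page."""
    params = {
        "searchtype": "all",
        "query": keyword,          # urlencode escapes spaces and reserved characters
        "abstracts": "show",
        "size": PAGE_SIZE,
        "order": "-announced_date_first",
        "start": page * PAGE_SIZE,  # offset of the first result on this page
    }
    return ARXIV_SEARCH + "?" + urlencode(params)

print(build_arxiv_search_url("digital identity", 1))
```

Deriving `start` from the page number also removes the need to track a separate running offset inside the loop.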
    







## Residual Code<a class="anchor" id="residual-code"></a>

In [102]:
headers = {
    'Access-Control-Allow-Origin': '*',
    'Access-Control-Allow-Methods': 'GET',
    'Access-Control-Allow-Headers': 'Content-Type',
    'Access-Control-Max-Age': '3600',
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'
    }

In [103]:
"""Scrape metadata from target URL."""
import requests
from bs4 import BeautifulSoup
import pprint
import urllib

def get_download_url(url):
    """Scrape target URL for metadata."""
    headers = {
        'Access-Control-Allow-Origin': '*',
        'Access-Control-Allow-Methods': 'GET',
        'Access-Control-Allow-Headers': 'Content-Type',
        'Access-Control-Max-Age': '3600',
        'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'
    }
    
    r = requests.get(url, headers=headers)
    html = BeautifulSoup(r.content, 'html.parser')
    print(html)
    return get_pdf(html)
    
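def download_file(url, filename):
    """Download the resource at url and save it as filename."""
    # urlretrieve lives in urllib.request in Python 3; imported here so the helper is self-contained
    from urllib.request import urlretrieve
    urlretrieve(url, filename)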

def scrape_page_metadata(url):
    """Scrape target URL for metadata."""
    headers = {
        'Access-Control-Allow-Origin': '*',
        'Access-Control-Allow-Methods': 'GET',
        'Access-Control-Allow-Headers': 'Content-Type',
        'Access-Control-Max-Age': '3600',
        'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'
    }
    
    r = requests.get(url, headers=headers)
    html = BeautifulSoup(r.content, 'html.parser')
    metadata = {
        'title': get_title(html),
        'description': get_description(html),
        'pdf': get_pdf(html),
        'image': get_image(html),
        'favicon': get_favicon(html, url),
        'sitename': get_site_name(html, url),
        'color': get_theme_color(html),
        'url': url
        }

    return metadata


def get_title(html):
    """Scrape page title."""
    title = None
    if html.title and html.title.string:
        title = html.title.string
    elif html.find("meta", property="og:title"):
        title = html.find("meta", property="og:title").get('content')
    elif html.find("meta", property="twitter:title"):
        title = html.find("meta", property="twitter:title").get('content')
    elif html.find("h1"):
        title = html.find("h1").string
    return title

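def get_pdf(html):
    """Scrape the PDF URL from the citation_pdf_url meta tag, or return None."""
    tag = html.find("meta", {"name": "citation_pdf_url"})
    return tag.get('content') if tag else None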

def get_description(html):
    """Scrape page description."""
    description = None
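    # The standard description tag is <meta name="description">, not property="description"
    if html.find("meta", attrs={"name": "description"}):
        description = html.find("meta", attrs={"name": "description"}).get('content')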
    elif html.find("meta", property="og:description"):
        description = html.find("meta", property="og:description").get('content')
    elif html.find("meta", property="twitter:description"):
        description = html.find("meta", property="twitter:description").get('content')
    elif html.find("p"):
        description = html.find("p").contents
    return description


def get_image(html):
    """Scrape share image."""
    image = None
    if html.find("meta", property="image"):
        image = html.find("meta", property="image").get('content')
    elif html.find("meta", property="og:image"):
        image = html.find("meta", property="og:image").get('content')
    elif html.find("meta", property="twitter:image"):
        image = html.find("meta", property="twitter:image").get('content')
    elif html.find("img", src=True):
        image = html.find_all("img")[0].get('src')
    return image


def get_site_name(html, url):
    """Scrape site name."""
    if html.find("meta", property="og:site_name"):
        site_name = html.find("meta", property="og:site_name").get('content')
    elif html.find("meta", property='twitter:title'):
        site_name = html.find("meta", property="twitter:title").get('content')
    else:
        site_name = url.split('//')[1]
        return site_name.split('/')[0].rsplit('.')[1].capitalize()
    return site_name


def get_favicon(html, url):
    """Scrape favicon."""
    if html.find("link", attrs={"rel": "icon"}):
        favicon = html.find("link", attrs={"rel": "icon"}).get('href')
    elif html.find("link", attrs={"rel": "shortcut icon"}):
        favicon = html.find("link", attrs={"rel": "shortcut icon"}).get('href')
    else:
        favicon = f'{url.rstrip("/")}/favicon.ico'
    return favicon


def get_theme_color(html):
    """Scrape brand color."""
    # theme-color is declared as <meta name="theme-color">, so match on name
    if html.find("meta", attrs={"name": "theme-color"}):
        color = html.find("meta", attrs={"name": "theme-color"}).get('content')
        return color
    return None
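
The getters above all follow the same pattern: try several meta tags in order of preference and return the first one present. A minimal sketch of that fallback lookup, written against the standard library's `html.parser` rather than BeautifulSoup so that it runs without extra packages; `MetaCollector` and `first_meta` are hypothetical names, not part of the notebook. It treats `name=` and `property=` as interchangeable keys, because standard tags such as `description` use `name` while Open Graph tags use `property`:

```python
from html.parser import HTMLParser

class MetaCollector(HTMLParser):
    """Collect the content of every <meta> tag, keyed by its property or name."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        key = attributes.get("property") or attributes.get("name")
        content = attributes.get("content")
        if key and content is not None:
            # Keep the first occurrence if a key is declared twice
            self.meta.setdefault(key.lower(), content)

def first_meta(html_text, candidates):
    """Return the content of the first candidate key found, or None."""
    collector = MetaCollector()
    collector.feed(html_text)
    for key in candidates:
        if key in collector.meta:
            return collector.meta[key]
    return None

page = (
    '<html><head>'
    '<meta name="description" content="Patent summary">'
    '<meta property="og:title" content="Digital identity" />'
    '</head><body></body></html>'
)
print(first_meta(page, ["og:description", "description"]))
print(first_meta(page, ["twitter:title", "og:title"]))
```

With a helper of this kind, each getter reduces to one call with its own list of candidate keys.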

In [104]:
import requests
import json
import urllib.request
from pathlib import Path
import pdfplumber
from sklearn.feature_extraction.text import CountVectorizer
import re
def pre_process(text):
    
    # lowercase
    text=text.lower()
    
    #remove tags
    text=re.sub("</?.*?>"," <> ",text)
    
    # remove special characters and digits
    text=re.sub("(\\d|\\W)+"," ",text)
    
    return text

def get_stop_words(stop_file_path):
    """load stop words """
    
    with open(stop_file_path, 'r', encoding="utf-8") as f:
        stopwords = f.readlines()
        stop_set = set(m.strip() for m in stopwords)
        return frozenset(stop_set)
    
def sort_coo(coo_matrix):
    tuples = zip(coo_matrix.col, coo_matrix.data)
    return sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)

def extract_topn_from_vector(feature_names, sorted_items, topn=10):
    """get the feature names and tf-idf score of top n items"""
    
    #use only topn items from vector
    sorted_items = sorted_items[:topn]

    score_vals = []
    feature_vals = []

    for idx, score in sorted_items:
        #keep track of feature name and its corresponding score
        score_vals.append(round(score, 3))
        feature_vals.append(feature_names[idx])

    #create a tuples of feature,score
    #results = zip(feature_vals,score_vals)
    results= {}
    for idx in range(len(feature_vals)):
        results[feature_vals[idx]]=score_vals[idx]
    
    return results

from bs4 import BeautifulSoup
pp = pprint.PrettyPrinter(indent=4)
for index, row in microsoft.iterrows():
    url = row['Result']
    print(url)
    #print(get_download_url(url))
    data = scrape_page_metadata(url)
    
    pdfURL = data["pdf"]
    print(pdfURL)
    
    path = "./PDF/" + row['Assignee']
    Path(path).mkdir(parents=True, exist_ok=True)
    
    path = "./PDF/" + row['Assignee'] + '/' + row['ID'] + ".pdf"
    print(path)
    urllib.request.urlretrieve(pdfURL, path)
    
    pages = []
    fulltext = ""
    count = 0
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            # extract_text() may yield None for a page without a text layer
            text = page.extract_text() or ""
            text = pre_process(text)
            
            pages.append(text)
            count += 1
            # only text after the first 15 pages is added to fulltext
            if count > 15:
                fulltext = fulltext + text

        #load a set of stop words
        stopwords = get_stop_words("./Files/stopwords.txt")

        #get the text column 
        #docs=df_idf['text'].tolist()

        #create a vocabulary of words, 
        #max_df=1 is an integer count, so terms that appear on more than one page are ignored, 
        #eliminate stop words (passed as a list, the type documented by scikit-learn)
        cv = CountVectorizer(max_df = 1, stop_words = list(stopwords), max_features = 10000)
        word_count_vector = cv.fit_transform(pages)  
            
        #get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
        feature_names = cv.get_feature_names_out()
        #print(feature_names)
        
        from multi_rake import Rake

        rake = Rake()
        rake.max_words = 100000

        keywords = rake.apply(fulltext)
        
        print(fulltext)

        print(keywords[:10])

    break

https://patents.google.com/patent/US8104074B2/en
https://patentimages.storage.googleapis.com/57/c5/c0/8164ac58ecb706/US8104074.pdf
./PDF/Microsoft Corporation/US-8104074-B2.pdf
us b negotiation the ability for the various parties of the digi tive embodiments one or more other specifications can be tal identity system to make agreements regarding mutu used to facilitate communications between the various sub ally acceptable technologies claims and other require systems in system ments in example embodiments principal relying party encapsulation the ability to exchange requirements and and identity provider can each utilize one or more a claims in a technology neutral way between parties computer systems each computer system includes one or subsystems and more of volatile and non volatile computer readable media transformation the ability to translate claims between computer readable media includes storage media as well as technologies and semantically removable and non removable media i

