# Data Extraction
Often, we want to extract or scrape data from online resources or PDFs. The purpose of this project is to generate concise summaries of long documents or reports. Data extraction is the critical first step: it obtains the text and converts it into a form that the Python program can read.

## Extracting Text from an Online Article

In [2]:
#installations
#!pip install requests beautifulsoup4

In [1]:
#imports
import requests
from bs4 import BeautifulSoup

In [2]:
#Function to extract text from a web page
def extract_text_from_url(url):
    
    #send a request to fetch the web page (the timeout avoids waiting forever on an unresponsive server)
    response = requests.get(url, timeout=10)

    #check whether the request was successful
    if response.status_code == 200:
        #parse the page content using BeautifulSoup
        soup = BeautifulSoup(response.content, 'html.parser')

        #extract the main content from the article (usually inside <p> tags)
        paragraphs = soup.find_all('p')

        #combine the text from all paragraphs
        article_text = ' '.join([p.get_text() for p in paragraphs])
        
        #returning the text from the article as a string
        return article_text

    else:
        return "Failed to retrieve the webpage."


#Example usage
url = "https://medium.com/enterprise-rag/dynamic-expert-feedback-kg-rag-as-a-two-way-retrieval-memory-system-vs-one-way-vector-system-63b28c674336"

article_text = extract_text_from_url(url)

print(article_text)

Sign up Sign in Sign up Sign in Chia Jeng Yang Follow WhyHow.AI -- 1 Listen Share We have a structural problem in AI. While LLM and AI systems are getting better and better, one problem still stands out in particular — context injection. Context can come from: Implementations of Vector-Only RAG systems today are a ‘One-Way’ street. The common implementation pattern is a multi-agent system on top of a vector database so natural language queries can retrieve the right information for the LLM to construct an answer. This data typically from documents and databases can tell us what the data is, but not how the data should be used. For example, if we discover that the knowledge base contains an outdated fact i.e. “That manual should only be used for customers in Asia”, that information or expert knowledge cannot be easily considered by a Vector RAG system, the way that humans can. This type of problem space around expert knowledge infra is what we’re building up to. RAG is just the starting

In [3]:
article_text

'Sign up Sign in Sign up Sign in Chia Jeng Yang Follow WhyHow.AI -- 1 Listen Share We have a structural problem in AI. While LLM and AI systems are getting better and better, one problem still stands out in particular — context injection. Context can come from: Implementations of Vector-Only RAG systems today are a ‘One-Way’ street. The common implementation pattern is a multi-agent system on top of a vector database so natural language queries can retrieve the right information for the LLM to construct an answer. This data typically from documents and databases can tell us what the data is, but not how the data should be used. For example, if we discover that the knowledge base contains an outdated fact i.e. “That manual should only be used for customers in Asia”, that information or expert knowledge cannot be easily considered by a Vector RAG system, the way that humans can. This type of problem space around expert knowledge infra is what we’re building up to. RAG is just the startin
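
The output above begins with navigation residue ("Sign up Sign in ... Listen Share") because the function collects every `<p>` tag on the page. One possible refinement is to keep only the paragraphs nested inside an `<article>` element. The sketch below is an illustration under assumptions, not part of the original notebook: it assumes the page wraps its main body in `<article>` (not verified for the URL above), the names `ArticleParagraphParser` and `extract_article_paragraphs` are hypothetical, and it uses the standard library's `html.parser` rather than BeautifulSoup so that it runs without extra installs.

```python
from html.parser import HTMLParser


class ArticleParagraphParser(HTMLParser):
    """Collect text from <p> tags that sit inside an <article> element."""

    def __init__(self):
        super().__init__()
        self.article_depth = 0      # greater than 0 while inside <article>
        self.in_paragraph = False   # True while inside a <p> within an article
        self.paragraphs = []        # finished paragraph strings
        self._current = []          # text pieces of the paragraph being read

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.article_depth += 1
        elif tag == "p" and self.article_depth > 0:
            self.in_paragraph = True
            self._current = []

    def handle_endtag(self, tag):
        if tag == "article" and self.article_depth > 0:
            self.article_depth -= 1
        elif tag == "p" and self.in_paragraph:
            self.in_paragraph = False
            text = "".join(self._current).strip()
            if text:
                self.paragraphs.append(text)

    def handle_data(self, data):
        # Text inside nested tags such as <b> is kept, because only <p> toggles the flag
        if self.in_paragraph:
            self._current.append(data)


def extract_article_paragraphs(html):
    """Return the text of all <p> tags found inside <article>, joined by spaces."""
    parser = ArticleParagraphParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.paragraphs)


# Small made-up page: the first and last <p> are outside <article> and are skipped
sample_html = (
    "<html><body><p>Sign up</p>"
    "<article><p>First <b>point</b>.</p><p>Second point.</p></article>"
    "<p>Footer</p></body></html>"
)
print(extract_article_paragraphs(sample_html))
```

If a page has no `<article>` element, this sketch returns an empty string, so falling back to the all-paragraphs approach used in `extract_text_from_url` would be a reasonable default.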

## Extracting Text from a PDF Document

In [10]:
#installations
#!pip install PyPDF2 pdfplumber

In [5]:
#imports
import pdfplumber

In [6]:
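#Function to extract text from a PDF using pdfplumber
def extract_text_from_pdf_with_plumber(pdf_file_path):
    with pdfplumber.open(pdf_file_path) as pdf:
        pdf_text = ""
        for page in pdf.pages:
            #extract_text() may return None in some pdfplumber versions when a page
            #has no extractable text (e.g. a scanned image), so fall back to ""
            pdf_text += page.extract_text() or ""
        return pdf_text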

   

In [7]:
#Example usage
pdf_file_path = "pan_2024.pdf"
pdf_text = extract_text_from_pdf_with_plumber(pdf_file_path)
print(pdf_text)

Overview of PAN 2024:
Multi-Author Writing Style Analysis,
Multilingual Text Detoxification,
Oppositional Thinking Analysis, and
Generative AI Authorship Verification
Condensed Lab Overview
Abinew Ali Ayele,1 Nikolay Babakov,2 Janek Bevendorff,3
Xavier Bonet Casals,4 Berta Chulvi,5 Daryna Dementieva,6 Ashaf Elnagar,7
Dayne Freitag,8 Maik Fröbe,9 Damir Korenčić,10 Maximilian Mayerl,11
Daniil Moskovskiy,12 Animesh Mukherjee,13 Alexander Panchenko,12
Martin Potthast,14 Francisco Rangel,15 Naquee Rizwan,13 Paolo Rosso,5,16
Florian Schneider,1 Alisa Smirnova,17 Efstathios Stamatatos,18
Elisei Stakovskii, Benno Stein,19 Mariona Taulé,4 Dmitry Ustalov,20
Xintong Wang,1 Matti Wiegmann,19 Seid Muhie Yimam,1 and Eva Zangerle21
1 Universität Hamburg, Germany, 2 Universidade de Santiago de Compostela, Spain
3 Leipzig University, Germany, 4 Universitat de Barcelona, Spain, 5 Univ. Politècnica
de València, Spain, 6 Technical University of Munich, Germany, 7 University of
Sharjah, United Arab Emirate

In [10]:
len(pdf_text)

76663
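
The extracted PDF text keeps the line breaks of the page layout, so sentences are split mid-line (for example, "Univ. Politècnica" and "de València" land on separate lines above). Before summarization it may help to normalize this. The following is a minimal sketch of one way to do it with the standard library's `re` module; the function name `clean_extracted_text` is hypothetical, and the rules (rejoin hyphenated line breaks, treat blank lines as paragraph breaks, collapse other whitespace) are assumptions that may need tuning for a given document.

```python
import re


def clean_extracted_text(text):
    """Normalize whitespace in text extracted from a PDF or web page."""
    # Rejoin words hyphenated across a line break: "Over-\nview" -> "Overview"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Treat blank lines as paragraph breaks
    paragraphs = re.split(r"\n\s*\n", text)
    # Inside each paragraph, collapse line breaks and repeated spaces to one space
    cleaned = [re.sub(r"\s+", " ", p).strip() for p in paragraphs]
    # Drop empty paragraphs and separate the rest with a single blank line
    return "\n\n".join(p for p in cleaned if p)


# Made-up input that mimics the layout line breaks seen in the PDF output
raw = "Multi-Author Writing\nStyle  Analysis,\n\nCondensed Lab\nOver-\nview"
print(clean_extracted_text(raw))
```

One caveat: the hyphen rule also joins genuine compounds that happen to break at the hyphen ("Multi-" at the end of a line followed by "Author" would become "MultiAuthor"), so it trades a few such errors for fewer broken words overall.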