# Data Extraction
We often want to extract or scrape data from online resources or PDFs. The purpose of this project is to generate concise summaries of long documents or reports. This first step, data extraction, is critical: it obtains the text and converts it into a form that the Python program can read.

## Extracting Text from an Online Article

In [2]:
#installations
#!pip install requests beautifulsoup4

In [4]:
#imports
import requests
from bs4 import BeautifulSoup

In [5]:
#Function to extract text from a web page
def extract_text_from_url(url):
    
    #send a request to fetch the web page (the timeout stops the call from hanging indefinitely)
    response = requests.get(url, timeout=10)

    #check whether the request was successful
    if response.status_code == 200:
        #parse the page content using BeautifulSoup
        soup = BeautifulSoup(response.content, 'html.parser')

        #extract the main content from the article (usually inside <p> tags)
        paragraphs = soup.find_all('p')

        #combine the text from all paragraphs
        article_text = ' '.join([p.get_text() for p in paragraphs])
        
        #returning the text from the article as a string
        return article_text

    else:
        return "Failed to retrieve the webpage."


#Example usage
url = "https://www.ctvnews.ca/lifestyle/influencer-couple-denies-leaving-kids-alone-on-cruise-1.7044439"

article_text = extract_text_from_url(url)

print(article_text)


	For most people, dinner on a cruise ship is a time to relax. 
	But when influencer couple Abby and Matt Howard decided to kick back with a dinner à deux, they ended up kicking up a storm. 
	The Arizona couple, who have 5.3 million followers on Tiktok, as well as 1.3 million and 677,000 Instagram followers individually, have had to furiously deny that they left their children alone on a cruise ship while enjoying their couple’s dinner. 
	Abby Howard posted over the weekend on Instagram that on a seven-day cruise, the couple had taken their kids – aged one and two – to dinner for the first five nights, but “it became apparent that they weren’t enjoying it and therefore we weren’t either.” 
	“So then we switched our dinner time to after their bedtime and FaceTimed the monitors while we ate,” she posted. 
	“And that worked out muchhhh (sic) better for everyone!” 
	Many followers understood her post as saying that they left the kids unattended in the cabin in order to go to dinner. When H
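
The printed text above still carries tabs and trailing spaces left over from the page's HTML. Below is a minimal sketch of a cleanup step that could run before summarization. It assumes only the standard `re` module, and `clean_extracted_text` is a hypothetical helper name, not part of the original notebook.

```python
import re


def clean_extracted_text(text):
    """Collapse whitespace runs and trim both ends of scraped text."""
    # Replace any run of spaces, tabs, or newlines with a single space
    collapsed = re.sub(r"\s+", " ", text)
    # Drop the whitespace left at the start and end of the string
    return collapsed.strip()


# Example usage with a fragment of the output shown above
sample = "\tFor most people, dinner on a cruise ship is a time to relax. \n\tBut when influencer couple"
print(clean_extracted_text(sample))
```

Collapsing all whitespace also removes paragraph breaks, so this is only suitable when the downstream step does not need them.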

## Extracting Text from a PDF Document

In [10]:
#installations
#!pip install pdfplumber

In [11]:
#imports
import pdfplumber

In [12]:
#Function to extract text from a PDF using pdfplumber
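def extract_text_from_pdf_with_plumber(pdf_file_path):
    with pdfplumber.open(pdf_file_path) as pdf:
        pdf_text = ""
        for page in pdf.pages:
            #extract_text() may return None for a page with no extractable text
            page_text = page.extract_text() or ""
            #keep a line break between consecutive pages
            pdf_text += page_text + "\n"
        return pdf_text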

   

In [None]:
#Example usage
pdf_file_path = "sample.pdf"
pdf_text = extract_text_from_pdf_with_plumber(pdf_file_path)
print(pdf_text)
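
Some PDF pages, such as scanned images, contain no extractable text. Depending on the pdfplumber version, `extract_text()` may then yield `None` or an empty string. The sketch below shows one way to join per-page text while skipping such pages. `combine_page_texts` and `extract_pdf_text_by_page` are hypothetical helper names, and the blank-line separator is an assumption rather than a requirement.

```python
def combine_page_texts(page_texts, separator="\n\n"):
    """Join per-page strings, skipping pages that have no text."""
    kept = []
    for text in page_texts:
        # Treat None and whitespace-only strings as empty pages
        if text and text.strip():
            kept.append(text.strip())
    return separator.join(kept)


def extract_pdf_text_by_page(pdf_file_path):
    # Imported here so combine_page_texts can be used without pdfplumber installed
    import pdfplumber

    with pdfplumber.open(pdf_file_path) as pdf:
        # The generator is consumed while the PDF is still open
        return combine_page_texts(page.extract_text() for page in pdf.pages)
```

Keeping the joining logic separate from the file handling makes it possible to check it with plain strings, without needing a PDF on disk.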