<a href="https://colab.research.google.com/github/isys5002-itp/ISYS5002-2023-Semester1/blob/main/2023_working_with_web.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Download File/Data from the Web

## Using `urllib`
https://realpython.com/urllib-request/

The docs [urllib.request.urlretrieve](https://docs.python.org/3/library/urllib.request.html#urllib.request.urlretrieve) state:

> The following functions and classes are ported from the Python 2 module urllib (as opposed to urllib2). They might become deprecated at some point in the future.



In [None]:
from urllib import request

# download the page and save it to a local file (legacy interface, see the note above)
request.urlretrieve('https://en.wikipedia.org/wiki/Python_(programming_language)', "myFileURL.txt")
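
Because the docs flag `urlretrieve` as a legacy interface, a possible alternative is to combine `urllib.request.urlopen` with `shutil.copyfileobj`. The sketch below is one way to do that, not the only one; the helper name `download_to_file` is made up for illustration and is not part of `urllib`.

```python
import shutil
from urllib.request import urlopen

def download_to_file(url, filename):
    """Copy the body found at `url` into `filename` and return the filename.

    `download_to_file` is a hypothetical helper, not a urllib function.
    """
    # urlopen returns a file-like response; copyfileobj streams it in chunks
    with urlopen(url) as response, open(filename, "wb") as out_file:
        shutil.copyfileobj(response, out_file)
    return filename
```

Calling `download_to_file('https://en.wikipedia.org/wiki/Python_(programming_language)', 'myFileURL.txt')` should save the same page as the `urlretrieve` call.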


## Using `requests`

https://www.w3schools.com/python/module_requests.asp

https://www.geeksforgeeks.org/downloading-files-web-using-python/


In [None]:
!pip install requests

In [None]:
# importing the requests library
import requests

site_url = "https://en.wikipedia.org/wiki/Python_(programming_language)"

# send a GET request and save the server's reply as a response object
response = requests.get(site_url)
response.raise_for_status()  # stop here if the server returned an error status

# print the raw body of the response (bytes)
print(response.content)

# 'wb' because response.content is bytes
with open('myFileRequests.txt', 'wb') as file:
  file.write(response.content)
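
The `'wb'` mode matters because `response.content` holds raw bytes, while `response.text` holds the same body decoded to a string. The sketch below illustrates the difference without touching the network; `body` is a made-up stand-in for `response.content`, and the file names are placeholders.

```python
import os
import tempfile

# stand-in for response.content (an assumption for illustration, not real page data)
body = "<p>Café</p>".encode("utf-8")

folder = tempfile.mkdtemp()
bytes_path = os.path.join(folder, "page_bytes.html")
text_path = os.path.join(folder, "page_text.html")

# bytes are written with 'wb', exactly as received
with open(bytes_path, "wb") as file:
    file.write(body)

# a str needs text mode and an explicit encoding
with open(text_path, "w", encoding="utf-8") as file:
    file.write(body.decode("utf-8"))

# compare the two files byte for byte
with open(bytes_path, "rb") as a, open(text_path, "rb") as b:
    same = a.read() == b.read()
print(same)
```

Writing bytes in text mode (`'w'`) raises a `TypeError`, which is why the cell above opens the file with `'wb'`.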


## Using `wget`

In [None]:
!pip install wget

In [None]:
import wget

# get a web page
site_url ='https://en.wikipedia.org/wiki/Python_(programming_language)'

# get a pdf file
#site_url = 'https://link.springer.com/content/pdf/10.1007/s11306-019-1588-0.pdf?pdf=button%20sticky'

file_name = wget.download(site_url)
print(site_url)

https://en.wikipedia.org/wiki/Python_(programming_language)


# Text Summarising (Review)
1. Installing Hugging Face Transformers
2. Building a summarisation pipeline
3. Run the model/pipeline to summarise text
4. Investigate ways to reuse the pipeline

> [Hugging Face Transformers](https://huggingface.co/docs/transformers/index) provides free, state-of-the-art pre-trained machine learning models for processing text, images, audio and video. See the project website for more information.




In [None]:
# Install Dependencies
!pip install transformers -q

# Let's make it a function: a function that takes an 'article' and returns a summary

In [None]:
# import libraries
from transformers import pipeline

def summarise(article):
  summary_pipeline = pipeline("summarization", model="facebook/bart-large-cnn") # load the summarisation pipeline
  summary = summary_pipeline(article, max_length=100, min_length=50) # run the summariser pipeline
  text = summary[0]['summary_text'] # get the first result, then the value for the key 'summary_text'
  return text
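
One way to reuse the pipeline (step 4 in the list above) is to load it once and cache it, so repeated calls skip the slow model load. The following is a minimal sketch using `functools.lru_cache`; the names `get_summary_pipeline` and `summarise_cached` are hypothetical, and the `transformers` import sits inside the loader so that nothing is loaded until the first call.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def get_summary_pipeline(model_name="facebook/bart-large-cnn"):
    """Load the summarisation pipeline once; later calls return the cached object."""
    from transformers import pipeline  # imported lazily, on first use
    return pipeline("summarization", model=model_name)

def summarise_cached(article, max_length=100, min_length=50):
    """Summarise `article` with the cached pipeline (hypothetical helper)."""
    summariser = get_summary_pipeline()
    summary = summariser(article, max_length=max_length, min_length=min_length)
    return summary[0]["summary_text"]
```

With this arrangement only the first call to `summarise_cached` pays the cost of loading the model.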
  

In [None]:
#call the summarise function
some_text = '''
A lack of transparency and reporting standards in the scientific community has led to increasing and widespread
concerns relating to reproduction and integrity of results. As an omics science, which generates vast amounts of data and
relies heavily on data science for deriving biological meaning, metabolomics is highly vulnerable to irreproducibility. The
metabolomics community has made substantial efforts to align with FAIR data standards by promoting open data formats,
data repositories, online spectral libraries, and metabolite databases. Open data analysis platforms also exist; however,
they tend to be inflexible and rely on the user to adequately report their methods and results. To enable FAIR data science
in metabolomics, methods and results need to be transparently disseminated in a manner that is rapid, reusable, and fully
integrated with the published work. To ensure broad use within the community such a framework also needs to be inclusive
and intuitive for both computational novices and experts alike
'''
from pprint import pprint
pprint(summarise(some_text))

# Extracting PDF using PyMuPDF (review)

Refer to the text_summerizer notebook for the steps.

In [None]:
#Get the PDF file
!wget https://link.springer.com/content/pdf/10.1007/s11306-019-1588-0.pdf

In [None]:
!pip install PyMuPDF

In [None]:
#Extract text from pdf

import fitz  # this is pymupdf

file_name = "/content/s11306-019-1588-0.pdf"

with fitz.open(file_name) as doc:
    text = ""
    for page in doc:        
        text += page.get_text("text")

#print(text)

In [None]:
# the BART summariser accepts at most 1024 tokens (not words), so pass only the first 1000 characters
pprint(summarise(text[:1000])) # call the summarise(article) function defined above
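
Slicing to the first 1000 characters discards the rest of the document. A possible way to cover more of it is to split the text into chunks and summarise each chunk in turn. The helper below, `chunk_words`, is a hypothetical name, and the 400-word chunk size is an assumption: word counts are only a rough proxy for the model's token limit.

```python
def chunk_words(text, max_words=400):
    """Split `text` into a list of chunks of at most `max_words` words each."""
    words = text.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]

sample = "one two three four five"
print(chunk_words(sample, max_words=2))  # ['one two', 'three four', 'five']
```

Each chunk could then be passed to `summarise`, for example `[summarise(chunk) for chunk in chunk_words(text)]`.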

# Web Scraping & text summarisation

## Scrape text from webpage and summarise text.
Let's use the pipeline to summarise a web page.

After a Google search and a look at a few online articles and YouTube videos, I settled on this page: [2 Ways to Extract Text From HTML Using Python](https://computersciencehub.io/python/extract-text-from-html-using-python/)

https://computersciencehub.io/python/extract-text-from-html-using-python/

https://www.geeksforgeeks.org/implementing-web-scraping-python-beautiful-soup/

https://www.edureka.co/blog/web-scraping-with-python/

In [None]:
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

# get webpage
req = Request("https://en.wikipedia.org/wiki/Python_(programming_language)")
html_page = urlopen(req)

soup = BeautifulSoup(html_page, features="html.parser")

# remove all 'script' and 'style' elements
for script in soup(["script", "style"]):
    script.extract()    # rip it out

# get text
text = soup.get_text()

pprint(text)
print('\n\n\n')

# break into lines and remove leading and trailing space on each
lines = [line.strip() for line in text.splitlines()]

# remove empty lines
lines = [x for x in lines if x]

# combine into one body of text
text = ' '.join(lines)
# split into words
text = text.split()
# get first 400 words
#text = text[:400]
# join words into text
text = ' '.join(text)


pprint(text)
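
The split-and-join steps in this cell all normalise whitespace. As a sketch, the same clean-up can be done in one step, because `str.split()` with no arguments splits on any run of whitespace and drops empty strings. The helper name `normalise_whitespace` and its `max_words` parameter are assumptions for illustration.

```python
def normalise_whitespace(text, max_words=None):
    """Collapse runs of whitespace to single spaces; optionally keep the first `max_words` words."""
    words = text.split()
    if max_words is not None:
        words = words[:max_words]
    return " ".join(words)

raw = "  Python \n\n is a   high-level\n language  "
print(normalise_whitespace(raw))               # Python is a high-level language
print(normalise_whitespace(raw, max_words=2))  # Python is
```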

In [None]:
pprint(summarise(text[:1000])) # summarise the first 1000 characters of the page text

## Let's make it a function.

In [None]:
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

def summarise_webpage(URL):
  '''Summarise the first 400 words of a web page.'''
  # get webpage
  req = Request(URL)
  html_page = urlopen(req)
  soup = BeautifulSoup(html_page, features="html.parser")

  # remove all 'script' and 'style' elements
  for script in soup(["script", "style"]):
      script.extract()    # rip it out

  text = soup.get_text() # get text
  lines = [line.strip() for line in text.splitlines()] # break into lines and strip each one
  lines = [x for x in lines if x] # remove empty lines
  text = ' '.join(lines) # combine into one body of text
  text = text.split() # split into words
  text = text[:400] # get first 400 words
  text = ' '.join(text) # join words into text

  return summarise(text)



In [None]:
text = summarise_webpage("https://en.wikipedia.org/wiki/Python_(programming_language)")
print(text)
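
Some servers reject requests that carry `urllib`'s default `User-Agent` header. If `urlopen` raises an HTTP error for a page that opens fine in a browser, one option is to pass a `headers` dictionary to `Request`. The sketch below only builds the request and does not contact the server; the helper name `build_request` and the agent string are placeholders.

```python
from urllib.request import Request

def build_request(url, user_agent="my-notebook/0.1"):
    """Return a Request for `url` that carries an explicit User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})

req = build_request("https://en.wikipedia.org/wiki/Python_(programming_language)")
# Request stores header names in capitalised form, i.e. "User-agent"
print(req.get_header("User-agent"))
```

The returned object can be passed to `urlopen` in place of the plain `Request(URL)` used in `summarise_webpage`.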