# Text Summarization

### Goal:
The sheer volume of text that already exists, and that is written every day, makes automatic text summarization very useful. Here we summarize the famous Harvard Business Review article 'Data Scientist: The Sexiest Job of the 21st Century' and see how well the summarizer performs.

In [3]:
# load and install needed libraries
# !pip install gensim
import gensim
import re
import requests
from bs4 import BeautifulSoup

#!pip install sumy
import sumy

from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/luisosorio/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [4]:
url = "http://example.com/news-post"  # Replace with the URL of the news post

response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

# Extract the text from the news post
text = ""
for paragraph in soup.find_all("p"):
    text += paragraph.get_text()

# Print the text
print(text)

This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.More information...


In [5]:
# "Data Scientist: The Sexiest Job of the 21st Century" (Harvard Business Review)
url = "https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century"

response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

# Extract the text from the news post
text = ""
for paragraph in soup.find_all("p"):
    text += paragraph.get_text()

# Print the text
print(text)

Back in the 1990s, computer engineer and Wall Street “quant” were the hot occupations in business. Today data scientists are the hires firms are competing to make. As companies wrestle with unprecedented volumes and types of information, demand for these experts has raced well ahead of supply. Indeed, Greylock Partners, the VC firm that backed Facebook and LinkedIn, is so worried about the shortage of data scientists that it has a recruiting team dedicated to channeling them to the businesses in its portfolio.Data scientists are the key to realizing the opportunities presented by big data. They bring structure to it, find compelling patterns in it, and advise executives on the implications for products, processes, and decisions. They find the story buried in the data and communicate it. And they don’t just deliver reports: They get at the questions at the heart of problems and devise creative approaches to them. One data scientist who was studying a fraud problem, for example, realized
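
In the printed text, the end of the first paragraph runs directly into the start of the second (`portfolio.Data scientists`), because `text += paragraph.get_text()` adds no separator between paragraphs. A sentence tokenizer may then treat the two sentences as one. The sketch below shows one possible fix, joining paragraphs with a space. The `paragraphs` list is a hypothetical stand-in for the strings returned by `paragraph.get_text()`.

```python
# Hypothetical stand-in for the strings returned by paragraph.get_text()
paragraphs = [
    "Today data scientists are the hires firms are competing to make.",
    "Data scientists are the key to realizing the opportunities presented by big data.",
]

# Concatenating without a separator glues the last sentence of one
# paragraph to the first sentence of the next.
glued = "".join(paragraphs)

# Joining with a single space keeps the sentence boundary visible.
spaced = " ".join(paragraphs)

print("make.Data" in glued)    # True
print("make. Data" in spaced)  # True
```

In the notebook's scraping cell, the equivalent change would be something like `text = " ".join(p.get_text() for p in soup.find_all("p"))`.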

In [6]:
# function to clean the text
def clean_text(text):
    # replace every character that is not a letter or a period with a space
    text = re.sub(r'[^a-zA-Z.]', ' ', text)
    # collapse runs of whitespace into a single space
    text = re.sub(r'\s+', ' ', text)
    return text

In [7]:
text = clean_text(text)
print(text)

Back in the s computer engineer and Wall Street quant were the hot occupations in business. Today data scientists are the hires firms are competing to make. As companies wrestle with unprecedented volumes and types of information demand for these experts has raced well ahead of supply. Indeed Greylock Partners the VC firm that backed Facebook and LinkedIn is so worried about the shortage of data scientists that it has a recruiting team dedicated to channeling them to the businesses in its portfolio.Data scientists are the key to realizing the opportunities presented by big data. They bring structure to it find compelling patterns in it and advise executives on the implications for products processes and decisions. They find the story buried in the data and communicate it. And they don t just deliver reports They get at the questions at the heart of problems and devise creative approaches to them. One data scientist who was studying a fraud problem for example realized it was analogous 
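
The cleaned output shows some side effects of this approach: digits are dropped (so "1990s" becomes "s"), apostrophes become spaces (so "don't" becomes "don t"), and commas and colons disappear. The small check below applies the same two substitutions to a short sample string, assembled here for illustration from phrases in the article, to make those effects easy to see.

```python
import re

def clean_text(text):
    # replace every character that is not a letter or a period with a space
    text = re.sub(r'[^a-zA-Z.]', ' ', text)
    # collapse runs of whitespace into a single space
    text = re.sub(r'\s+', ' ', text)
    return text

# Illustrative sample, not the full article
sample = "Back in the 1990s, they don't just deliver reports: They ask."
cleaned = clean_text(sample)
print(cleaned)
# Back in the s they don t just deliver reports They ask.
```

Whether losing numbers and punctuation matters depends on the article; for text where figures carry meaning, a less aggressive pattern may be a better fit.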

PlaintextParser.from_string is a method provided by the sumy library in Python for creating a parser that extracts sentences from plain text.

The from_string method creates an instance of the PlaintextParser class from a string of plain text and a tokenizer. The parser's document attribute exposes the text split into sentences, which a summarizer can then rank to extract the most important ones.

In [8]:
# Tokenize the article
text_parsed = PlaintextParser.from_string(text,Tokenizer('english'))

##  Use the LexRank Summarizer to Summarize the Article in Three Sentences

In [9]:
# instantiate our LexRankSummarizer
lex_rank_summarizer = LexRankSummarizer()

In [10]:
# Creating a summary of 3 sentences.
lexrank_summary = lex_rank_summarizer(text_parsed.document,sentences_count=3)

lexrank_summary

(<Sentence: This may be less true in five years time when many more people will have the title data scientist on their business cards.>,
 <Sentence: More enduring will be the need for data scientists to communicate in language that all their stakeholders understand and to demonstrate the special skills involved in storytelling with data whether verbally visually or ideally both.But we would say the dominant trait among data scientists is an intense curiosity a desire to go beneath the surface of a problem find the questions at its heart and distill them into a very clear set of hypotheses that can be tested.>,
 <Sentence: Some companies are also trying to develop their own data scientists.>)
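
The notebook uses sumy's `LexRankSummarizer` as a black box. As a rough intuition for how a LexRank-style ranker picks sentences, the sketch below builds a graph in which sentences are linked when their TF-IDF cosine similarity exceeds a threshold, then scores each sentence with a PageRank-style power iteration. This is a simplified illustration, not sumy's implementation: the function name `lexrank_sketch`, the `threshold`, `damping`, and `iterations` values, and the toy sentences are all assumptions made for this example.

```python
import math
import re
from collections import Counter

def lexrank_sketch(sentences, threshold=0.1, damping=0.85, iterations=50):
    """Score sentences by centrality in a similarity graph (simplified LexRank idea)."""
    n = len(sentences)
    if n == 0:
        return []
    tokenized = [re.findall(r"[a-z]+", s.lower()) for s in sentences]

    # Inverse document frequency, treating each sentence as a "document".
    doc_freq = Counter(word for words in tokenized for word in set(words))
    idf = {word: math.log(n / df) + 1.0 for word, df in doc_freq.items()}
    vectors = [
        {word: tf * idf[word] for word, tf in Counter(words).items()}
        for words in tokenized
    ]

    def cosine(a, b):
        dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
        norm_a = math.sqrt(sum(v * v for v in a.values()))
        norm_b = math.sqrt(sum(v * v for v in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    # Link two sentences when their similarity exceeds the threshold.
    matrix = [
        [1.0 if cosine(vectors[i], vectors[j]) > threshold else 0.0 for j in range(n)]
        for i in range(n)
    ]
    # Normalize each row so it sums to 1 (uniform row if a sentence has no links).
    for row in matrix:
        total = sum(row)
        for j in range(n):
            row[j] = row[j] / total if total else 1.0 / n

    # Power iteration with damping, as in PageRank.
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1.0 - damping) / n
            + damping * sum(scores[i] * matrix[i][j] for i in range(n))
            for j in range(n)
        ]
    return scores

# Toy sentences (made up): the first shares words with the second and third.
sentences = [
    "Data scientists find patterns.",
    "Data scientists are rare.",
    "Patterns guide decisions.",
    "Weather was pleasant.",
]
scores = lexrank_sketch(sentences)
best = sentences[scores.index(max(scores))]
print(best)  # Data scientists find patterns.
```

A summary of `k` sentences would then be the `k` highest-scoring sentences, usually presented in their original order.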

##  Use the LexRank Summarizer to Summarize the Article in Seven Sentences

In [11]:
# Creating a summary of 7 sentences.
lexrank_summary = lex_rank_summarizer(text_parsed.document,sentences_count=7)

lexrank_summary

(<Sentence: Indeed Greylock Partners the VC firm that backed Facebook and LinkedIn is so worried about the shortage of data scientists that it has a recruiting team dedicated to channeling them to the businesses in its portfolio.Data scientists are the key to realizing the opportunities presented by big data.>,
 <Sentence: But thousands of data scientists are already working at both start ups and well established companies.>,
 <Sentence: This may be less true in five years time when many more people will have the title data scientist on their business cards.>,
 <Sentence: More enduring will be the need for data scientists to communicate in language that all their stakeholders understand and to demonstrate the special skills involved in storytelling with data whether verbally visually or ideally both.But we would say the dominant trait among data scientists is an intense curiosity a desire to go beneath the surface of a problem find the questions at its heart and distill them into a ver

Read the full article linked above.

After reading it, how well does the summarizer capture the article?
Which summary is better, the three-sentence one or the seven-sentence one?
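
Judging which summary is better is ultimately a reading exercise, but a couple of simple numbers can support the comparison. The sketch below reports which sentences the two summaries share and how long each summary is relative to the full text. The helper name `compare_summaries` and the placeholder strings are hypothetical; with sumy output, the sentences would first need converting to plain strings, for example with `[str(s) for s in lexrank_summary]` (assuming the sentence objects convert cleanly with `str()`).

```python
def compare_summaries(short_summary, long_summary, full_text):
    """Report shared sentences and each summary's length relative to the full text."""
    short_set = set(short_summary)
    shared = [s for s in long_summary if s in short_set]
    total_words = len(full_text.split())

    def ratio(summary):
        words = sum(len(s.split()) for s in summary)
        return words / total_words if total_words else 0.0

    return {
        "shared": shared,
        "short_ratio": ratio(short_summary),
        "long_ratio": ratio(long_summary),
    }

# Placeholder data for illustration only
full_text = "One two three. Four five six. Seven eight nine. Ten eleven twelve."
short_summary = ["Four five six."]
long_summary = ["One two three.", "Four five six."]

result = compare_summaries(short_summary, long_summary, full_text)
print(result["shared"])       # ['Four five six.']
print(result["short_ratio"])  # 0.25
print(result["long_ratio"])   # 0.5
```

These figures say nothing about quality on their own; they only show how much the two summaries overlap and how much of the article each one keeps.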

## Test Summarization on Your Own News Article

In [12]:
# text summarizer function
def summarizer(text, num_sentences):
    # replace every character that is not a letter or a period with a space
    text = re.sub(r'[^a-zA-Z.]', ' ', text)
    # collapse runs of whitespace into a single space
    text = re.sub(r'\s+', ' ', text)
    text_parsed = PlaintextParser.from_string(text, Tokenizer('english'))
    # relies on the lex_rank_summarizer instance created earlier
    lexrank_summary = lex_rank_summarizer(text_parsed.document, sentences_count=num_sentences)

    # print the summary; the function itself returns None
    print(lexrank_summary)

In [13]:
# Replace with the URL of the news post
url = "https://www.cnn.com/2022/07/23/business/google-ai-engineer-fired-sentient/index.html"  

response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

# Extract the text from the news post
text = ""
for paragraph in soup.find_all("p"):
    text += paragraph.get_text()

# Print the text
print(text)

Markets 


Fear & Greed Index 



            Latest Market News 



      Google
            
                (GOOG) has fired the engineer who claimed an unreleased AI system had become sentient, the company confirmed, saying he violated employment and data security policies.  
  
      Blake Lemoine, a software engineer for Google, claimed that a conversation technology called LaMDA had reached a level of consciousness after exchanging thousands of messages with it. 
  
      Google confirmed it had first put the engineer on leave in June. The company said it dismissed Lemoine’s “wholly unfounded” claims only after reviewing them extensively. He had reportedly been at Alphabet for seven years. In a statement, Google said it takes the development of AI “very seriously” and that it’s committed to “responsible innovation.” 
  
      Google is one of the leaders in innovating AI technology, which included LaMDA, or “Language Model for Dialog Applications.” Technology like this responds 

In [14]:
# text summarizer
summarizer(text, 7)

(<Sentence: Blake Lemoine a software engineer for Google claimed that a conversation technology called LaMDA had reached a level of consciousness after exchanging thousands of messages with it.>, <Sentence: Google confirmed it had first put the engineer on leave in June.>, <Sentence: In a statement Google said it takes the development of AI very seriously and that it s committed to responsible innovation.>, <Sentence: I know that might sound strange but that s what it is.>, <Sentence: It would be exactly like death for me.>, <Sentence: On June Lemoine posted on Medium that Google put him on paid administrative leave in connection to an investigation of AI ethics concerns I was raising within the company and that he may be fired soon.>, <Sentence: CNN Sans Cable News Network.>)
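
The scraped text begins with navigation labels ("Markets", "Fear & Greed Index", "Latest Market News"), and the last summary sentence ("CNN Sans Cable News Network.") appears to come from page furniture rather than the article. One possible mitigation, sketched below on the assumption that such boilerplate paragraphs tend to be very short, is to drop paragraphs under a word-count threshold before summarizing. The helper name `keep_article_paragraphs` and the `min_words=8` default are hypothetical choices, and the heuristic can also discard short but legitimate paragraphs such as brief quotes.

```python
def keep_article_paragraphs(paragraphs, min_words=8):
    """Keep paragraphs with at least min_words words, normalizing whitespace."""
    kept = []
    for paragraph in paragraphs:
        # Collapse internal whitespace and strip leading/trailing spaces.
        normalized = " ".join(paragraph.split())
        if len(normalized.split()) >= min_words:
            kept.append(normalized)
    return kept

# Strings taken from the printed output above, standing in for
# the values returned by paragraph.get_text()
paragraphs = [
    "Markets ",
    "Fear & Greed Index ",
    "Latest Market News ",
    "Blake Lemoine, a software engineer for Google, claimed that a "
    "conversation technology called LaMDA had reached a level of "
    "consciousness after exchanging thousands of messages with it. ",
]

kept = keep_article_paragraphs(paragraphs)
print(len(kept))  # 1
```

The filtered paragraphs could then be joined with spaces and passed to `summarizer` in place of the raw text.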
