
## Text Summarization

Text summarization matters in machine learning and natural language processing for several reasons:

1. **Information Retrieval:** Text summarization helps users quickly grasp the key points of a large document, making it easier to decide whether to read it in full. This is particularly valuable when readers face large volumes of text, such as news articles, research papers, or social media posts.

2. **Time Efficiency:** Summarization algorithms can produce summaries much faster than humans can read and condense long texts. This saves time and lets users focus on the most relevant content.

3. **Content Extraction:** Text summarization can automatically extract essential information from a document, enabling applications like content recommendation, keyword extraction, and topic modeling.

4. **Content Generation:** Summarization models can be used to generate concise, coherent, and informative summaries for various purposes, such as creating abstracts for research papers, news article headlines, or social media post previews.

5. **Multilingual Support:** Text summarization can be applied to texts in multiple languages, making it a valuable tool for global communication and information retrieval.

6. **Personalization:** Summarization can be personalized to individual preferences. Machine learning models can learn from user feedback to generate summaries that align more closely with a user's interests and priorities.

7. **Scalability:** As the volume of digital content continues to grow, automated summarization becomes crucial for scaling information processing and retrieval. Machine learning-based summarization models can adapt and handle large volumes of text efficiently.

8. **Legal and Compliance:** In legal and regulatory contexts, automated summarization can help organizations review contracts, policies, and legal documents to ensure compliance and identify critical clauses or information.

9. **Search Engine Optimization (SEO):** Summarized content can be used to create concise and engaging snippets for search engine results, improving the discoverability of web content.

10. **Content Creation:** Summarization can be integrated into content creation tools, helping authors and content creators generate concise and informative content more efficiently.

Overall, text summarization is an essential component of machine learning and natural language processing, enabling efficient information retrieval, content extraction, and content generation across a wide range of applications and industries. It plays a critical role in handling the ever-increasing amount of textual data available in the digital age.
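
The content-extraction idea from point 3 can be illustrated with a minimal frequency-based extractive summarizer. This is only a sketch under simple assumptions, not the method used in the exercise that follows, which relies on a pretrained model. The function name `extractive_summary`, the naive sentence splitter, and the average-word-frequency score are all hypothetical choices made for illustration.

```python
import re
from collections import Counter

def extractive_summary(text, n_sentences=1):
    """Return the n highest-scoring sentences, kept in their original order."""
    # Naive sentence split: break after ., ! or ? followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Word frequencies over the whole text (lowercased)
    freq = Counter(re.findall(r"\w+", text.lower()))

    def score(sentence):
        words = re.findall(r"\w+", sentence.lower())
        # Average frequency, so long sentences are not favored automatically
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    # Pick the top-scoring sentence indices, then restore document order
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    top = sorted(ranked[:n_sentences])
    return " ".join(sentences[i] for i in top)

# Toy input (illustrative only)
text = (
    "Hinton studies neural networks. "
    "Neural networks learn from data. "
    "He was born in England."
)
print(extractive_summary(text, n_sentences=1))
```

An extractive approach like this only copies sentences from the input, whereas the pretrained model used in the exercise generates new text.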

---
Exercise:

Now, as a data scientist with NLP expertise, you are asked to build a model that summarizes text in Spanish. Your stakeholders will give you an article, and your model should return a summary of it.

In [None]:
#!pip install requests beautifulsoup4



In [None]:
#!pip install googletrans==4.0.0-rc1



In [None]:
# Uncomment and run this cell if you're on Colab or Kaggle
#!git clone https://github.com/nlp-with-transformers/notebooks.git
#%cd notebooks
#from install import *
#install_requirements()
##hide
#from utils import *
#setup_chapter()
#!pip install transformers

fatal: destination path 'notebooks' already exists and is not an empty directory.
/content/notebooks/notebooks
⏳ Installing base requirements ...
✅ Base requirements installed!
⏳ Installing Git LFS ...
✅ Git LFS installed!
No GPU was detected! This notebook can be *very* slow without a GPU 🐢
Go to Runtime > Change runtime type and select a GPU hardware accelerator.
Using transformers v4.16.2
Using datasets v1.16.1


In [None]:
#hide_output

from transformers import pipeline

classifier = pipeline("text-classification")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)



In [None]:
summarizer = pipeline("summarization")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 (https://huggingface.co/sshleifer/distilbart-cnn-12-6)



In [None]:
import requests
import warnings
from bs4 import BeautifulSoup
from googletrans import Translator

def traducir_texto(texto, idioma_destino='es'):
    """Translate `texto` into the target language (Spanish by default)."""
    try:
        translator = Translator()
        traduccion = translator.translate(texto, dest=idioma_destino)
        return traduccion.text
    except Exception as e:
        # Warn and fall back to the untranslated text, so that an error
        # message never ends up inside the article or the summary
        warnings.warn(f"Translation failed: {e}")
        return texto

# URL of the article
url = "https://time.com/collection/time100-ai/6309026/geoffrey-hinton/"

# Send an HTTP request to fetch the page content
response = requests.get(url)

# Check whether the request succeeded
if response.status_code == 200:
    # Parse the page HTML with BeautifulSoup (once, instead of once per step)
    soup = BeautifulSoup(response.text, "html.parser")

    # Find the article body (inspect the page HTML to find the right structure)
    article_content = soup.find("div", {"class": "article-content"})

    if article_content is None:
        print("Article container not found; check the page structure")
    else:
        # Extract the text of every paragraph once and reuse it below
        paragraphs = [p.get_text() for p in article_content.find_all("p")]

        # Print the original article text
        print("Original")
        print()
        print("".join(p + "\n" for p in paragraphs))
        print()

        # Print the article translated paragraph by paragraph
        print("Traducido")
        print()
        print("".join(traducir_texto(p) + "\n" for p in paragraphs))
        print()

        # Summarize each paragraph in English, then translate the summary
        article_text = ""
        for p in paragraphs:
            summary = summarizer(p, max_length=20, min_length=5, clean_up_tokenization_spaces=True, no_repeat_ngram_size=10)[0]['summary_text']
            article_text += traducir_texto(summary) + "\n"

        # Print the translated summary
        print("Resumen")
        print()
        print(article_text)
else:
    print("Error fetching the page:", response.status_code)

Original

Over the course of February, Geoffrey Hinton, one of the most influential AI researchers of the past 50 years, had a “slow eureka moment.”
Hinton, 76, has spent his career trying to build AI systems that model the human brain, mostly in academia before joining Google in 2013. He had always believed that the brain was better than the machines that he and others were building, and that by making them more like the brain, they would improve. But in February, he realized “the digital intelligence we’ve got now may be better than the brain already. It’s just not scaled up quite as big.” 
Developers around the world are currently racing to build the biggest AI systems that they can. Given the current rate at which AI companies are increasing the size of models, it could be less than five years until AI systems have 100 trillion connections—roughly as many as there are between neurons in the human brain.
Alarmed, Hinton left his post as VP and engineering fellow in May and gave a fl

Your max_length is set to 20, but you input_length is only 18. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=9)


Resumen

Geoffrey Hinton es uno de los investigadores de IA más influyentes de los últimos 50 años.
Hinton, de 76 años, ha pasado su carrera tratando de construir sistemas de IA que modelen el
Los desarrolladores de todo el mundo actualmente están corriendo para construir los sistemas de IA más grandes que puedan.
Hinton dejó su puesto como vicepresidente y becario de ingeniería en mayo y dio una oleada de
Nacido y criado en Inglaterra, Hinton proviene de una larga línea de luminarias.
Como estudiante universitario de la Universidad de Cambridge, Hinton probó una variedad de materias antes de graduarse con un
En la década de 1970, la inteligencia artificial estaba pasando por un período de entusiasmo amortiguado ahora
Hinton recibió el Premio Turing 2018 por su investigación en la Universidad de Toronto.
En 2012, Hinton y dos de sus estudiantes de posgrado, Alex Krizhevs
Hinton eligió Google sobre el último postor, Baidu, después de una semana.
Hinton ha sido fundamental en el desarrol
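
The warning in the output above appears because `max_length=20` is larger than the input length of a short paragraph (18 tokens); the message itself suggests roughly half the input length (`max_length=9`). One way to avoid it is to choose the limits per paragraph, sketched below. The helper name `choose_summary_lengths` and the halving rule are assumptions for illustration, and the whitespace word count is only a rough stand-in for the model's own tokenizer.

```python
def choose_summary_lengths(n_input_tokens, max_cap=20, min_floor=5):
    """Pick (max_length, min_length) for one paragraph.

    Hypothetical helper: caps max_length at half the input length, following
    the example in the warning, and never lets min_length exceed max_length.
    """
    max_length = max(1, min(max_cap, n_input_tokens // 2))
    min_length = min(min_floor, max_length)
    return max_length, min_length

# Rough token count by whitespace (assumption: a tokenizer would be more precise)
paragraph = "Hinton left Google in May."
n_tokens = len(paragraph.split())
max_length, min_length = choose_summary_lengths(n_tokens)
print(max_length, min_length)
```

The two values could then be passed as `max_length` and `min_length` in the `summarizer(...)` call for each paragraph, in place of the fixed 20 and 5.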