<a href="https://colab.research.google.com/github/inuwamobarak/document-summarization/blob/main/Longest_Wikipedia_article_summarised_to_10_sentences.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Document Summarization on Wikipedia Articles Using Python

**Introduction**

Document summarization has become a vital task for individuals and businesses that need to cut through bulky documents. Summarization condenses a document into a shorter version while retaining its original message, which reduces the time and effort needed to read it.

## Importing/Installing Dependencies

In [None]:
# Backend
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger') 

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [None]:
pip install beautifulsoup4

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
pip install lxml

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
# Core Libraries
import bs4 as bs
import urllib.request
import re

# Indirect requirements
import pandas as pd
import matplotlib.pyplot as plt
import io
import unicodedata
import numpy as np
import string

## Fetching Articles from Wikipedia

Before loading the data into the project, it is worth noting that document summarization can be done in different ways depending on the overall objective. This is explained in detail in the accompanying article.

In [None]:
# Scraping the data and loading it from the URL

wikipedia_article = urllib.request.urlopen('https://en.wikipedia.org/wiki/History_of_Poland_(1945%E2%80%931989)')  # Open the URL of the Wikipedia article on the history of Poland (1945–1989)
article = wikipedia_article.read() # Load the raw content of the article, including unwanted characters and tags

Find details on the urllib.request library here: https://docs.python.org/3/library/urllib.request.html
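
A network call can fail or hang, and some servers expect a descriptive User-Agent header. The sketch below shows one way to wrap the fetch step with an explicit header, a timeout, and basic error handling. The helper names `build_request` and `fetch_html`, the User-Agent string, and the 10-second timeout are illustrative assumptions, not part of the original notebook.

```python
import urllib.request

# Hypothetical User-Agent string; replace it with one that identifies your project.
USER_AGENT = "document-summarization-demo/0.1"

def build_request(url):
    """Return a Request object that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})

def fetch_html(url, timeout=10):
    """Download the raw bytes of a page, or return None if the request fails."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
            return response.read()
    except OSError as error:
        # URLError, HTTPError and socket timeouts all derive from OSError.
        print(f"Could not fetch {url}: {error}")
        return None

# Usage (needs network access):
# article = fetch_html('https://en.wikipedia.org/wiki/History_of_Poland_(1945%E2%80%931989)')
```

Returning `None` on failure lets the rest of the notebook check for a missing article before parsing it.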

## Preprocessing of the Data

The next vital step is to remove unwanted content so that the article text is as meaningful as possible. This is our data preprocessing stage.

In [None]:
parsed_article = bs.BeautifulSoup(article,'lxml') # BeautifulSoup lxml allows us to parse HTML and XML files

paragraphs = parsed_article.find_all('p') # Reads the <p> </p> tags in the article

article_text = ""

for p in paragraphs:
    article_text += p.text

In [None]:
# Viewing content with symbols
article_text



In [None]:
# Dropping unwanted characters and spaces

article_text = re.sub(r'\[[0-9]*\]', ' ', article_text)
article_text = re.sub(r'\s+', ' ', article_text)
formatted_article_text = re.sub('[^a-zA-Z]', ' ', article_text )
formatted_article_text = re.sub(r'\s+', ' ', formatted_article_text)
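
To see what these substitutions do, the sketch below applies the same four steps to a short sample string. The `clean_article` helper and the sample text are illustrative assumptions. Two side effects are worth knowing: the citation pattern only matches digits, so lettered note markers such as `[a1]` survive, and the letters-only step also replaces non-ASCII letters (for example the `ł` in "Gomułka") with spaces.

```python
import re

def clean_article(text):
    """Apply the same four substitutions as the cell above to any string."""
    text = re.sub(r'\[[0-9]*\]', ' ', text)        # drop numeric citation markers such as [12]
    text = re.sub(r'\s+', ' ', text)               # collapse runs of whitespace
    letters_only = re.sub('[^a-zA-Z]', ' ', text)  # keep ASCII letters only
    letters_only = re.sub(r'\s+', ' ', letters_only)
    return text, letters_only

sample = "Poland joined the bloc.[12]  It lasted\nuntil 1989.[3]"
article_text_demo, formatted_demo = clean_article(sample)
print(article_text_demo.strip())   # Poland joined the bloc. It lasted until 1989.
print(formatted_demo.strip())      # Poland joined the bloc It lasted until

# Hypothetical broader pattern (an assumption, not in the original notebook)
# that also removes lettered note markers such as [a1]:
print(re.sub(r'\[[a-z]*[0-9]*\]', '', "living,[a1] were marred"))  # living, were marred
```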

In [None]:
# Viewing processed data without symbols
formatted_article_text



## Performing Text Tokenization

In [None]:
sentence_list = nltk.sent_tokenize(article_text) # Using NLTK and Punkt to generate tokens

In [None]:
sentence_list[:3] # Viewing few sentences

[' Timeline The history of Poland from 1945 to 1989 spans the period of Marxist–Leninist regime in Poland after the end of World War II.',
 'These years, while featuring general industrialization, urbanization and many improvements in the standard of living,[a1] were marred by early Stalinist repressions, social unrest, political strife and severe economic difficulties.',
 'Near the end of World War II, the advancing Soviet Red Army, along with the Polish Armed Forces in the East, pushed out the Nazi German forces from occupied Poland.']
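
Sentences are tokenized from `article_text` rather than `formatted_article_text` because the letters-only version no longer contains the punctuation that marks sentence boundaries. The sketch below illustrates this with a naive regex splitter, used here only as a dependency-free stand-in. It is not how Punkt works internally, and it will mis-split abbreviations such as "e.g." that Punkt is designed to handle.

```python
import re

def naive_sentence_split(text):
    """Split after '.', '!' or '?' followed by whitespace (a rough stand-in for Punkt)."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

with_punctuation = "Stalin died in 1953. A thaw followed! Did it last?"
letters_only = "Stalin died in A thaw followed Did it last"

print(naive_sentence_split(with_punctuation))
# ['Stalin died in 1953.', 'A thaw followed!', 'Did it last?']
print(naive_sentence_split(letters_only))
# ['Stalin died in A thaw followed Did it last']
```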

## Weighting the Frequency of Words

In [None]:
stopwords = nltk.corpus.stopwords.words('english') # Loading the English list; you can change to other languages as required

# Counting how often each word occurs
word_frequencies = {}
for word in nltk.word_tokenize(formatted_article_text):
    if word not in stopwords: # Skip stop words
        if word not in word_frequencies.keys():
            word_frequencies[word] = 1
        else:
            word_frequencies[word] += 1

In [None]:
maximum_frequncy = max(word_frequencies.values()) # Number of occurrences of the most frequent word

In [None]:
maximum_frequncy

265

In [None]:
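# Caution: max() on a dict compares its keys, so this returns the alphabetically last word, not the most frequent one.
# For the word with the highest count, use max(word_frequencies, key=word_frequencies.get).
most_frequent_word = max(word_frequencies)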

In [None]:
most_frequent_word

'zone'

In [None]:
# Normalizing each count by the highest count, so every weight falls between 0 and 1
for word in word_frequencies.keys():
    word_frequencies[word] = (word_frequencies[word]/maximum_frequncy)
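
The weighting step divides each word's count by the count of the most frequent word, so the top word gets a weight of 1.0. The sketch below shows the same idea with `collections.Counter`, whose `most_common` method returns the word with the highest count directly. It is a minimal illustration under stated assumptions: `DEMO_STOPWORDS` is a tiny made-up list rather than NLTK's, words are split on whitespace instead of with `nltk.word_tokenize`, and the text is lowercased before counting, which the notebook's cell does not do.

```python
from collections import Counter

# Tiny illustrative stop-word set (an assumption); the notebook uses NLTK's English list.
DEMO_STOPWORDS = {"the", "of", "in", "and", "was"}

def weighted_frequencies(text, stopwords=DEMO_STOPWORDS):
    """Count non-stop words case-insensitively and scale counts into (0, 1]."""
    words = [w for w in text.lower().split() if w not in stopwords]
    counts = Counter(words)
    if not counts:
        return {}, None
    top_word, top_count = counts.most_common(1)[0]
    weights = {word: count / top_count for word, count in counts.items()}
    return weights, top_word

weights, top_word = weighted_frequencies("The party and the state The party ruled Poland")
print(top_word)           # party
print(weights["party"])   # 1.0
print(weights["poland"])  # 0.5
```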

## Finding the Score of Sentences

In [None]:
# We use the word frequencies to measure the value of a sentence
sentence_scores = {}
for sent in sentence_list: # Iterates over the sentences that still contain punctuation
    for word in nltk.word_tokenize(sent.lower()):
        if word in word_frequencies.keys(): # Stop words are not in word_frequencies, so they are ignored
            if len(sent.split(' ')) < 32: # Only score sentences with fewer than 32 words, since the summary should be short
                if sent not in sentence_scores.keys():
                    sentence_scores[sent] = word_frequencies[word]
                else:
                    sentence_scores[sent] += word_frequencies[word]
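
A sentence's score is the sum of the weights of the words it contains. Note that the cell above looks up lowercased tokens, while `word_frequencies` was built from text that keeps its capitalization, so a word such as "Poland" only contributes if its lowercase form also occurs in the article. The sketch below assumes the weights are keyed by lowercase words so that both sides agree. The `score_sentences` helper, the demo data, and the crude punctuation trim (used in place of `nltk.word_tokenize`) are illustrative assumptions.

```python
def score_sentences(sentences, word_weights, max_words=32):
    """Sum the weights of known words in each sentence shorter than max_words."""
    scores = {}
    for sentence in sentences:
        if len(sentence.split(' ')) >= max_words:
            continue  # skip long sentences, as the cell above does
        for word in sentence.lower().split():
            word = word.strip('.,;:!?()[]"\'')  # crude trim instead of nltk.word_tokenize
            if word in word_weights:
                scores[sentence] = scores.get(sentence, 0) + word_weights[word]
    return scores

demo_weights = {"party": 1.0, "poland": 0.5, "state": 0.5}  # keys are lowercase
demo_sentences = ["The party ruled Poland.", "Nothing relevant here."]
print(score_sentences(demo_sentences, demo_weights))
# {'The party ruled Poland.': 1.5}
```

As in the notebook, a sentence with no known words never enters the dictionary, so it cannot be selected for the summary.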

In [None]:
# Viewing the value of each sentence
sentence_scores

{' Timeline The history of Poland from 1945 to 1989 spans the period of Marxist–Leninist regime in Poland after the end of World War II.': 0.41509433962264153,
 'These years, while featuring general industrialization, urbanization and many improvements in the standard of living,[a1] were marred by early Stalinist repressions, social unrest, political strife and severe economic difficulties.': 1.230188679245283,
 'Near the end of World War II, the advancing Soviet Red Army, along with the Polish Armed Forces in the East, pushed out the Nazi German forces from occupied Poland.': 0.3132075471698113,
 'In February 1945, the Yalta Conference sanctioned the formation of a provisional government of Poland from a compromise coalition, until postwar elections.': 0.490566037735849,
 'Joseph Stalin, the leader of the Soviet Union, manipulated the implementation of that ruling.': 0.23018867924528302,
 'A practically communist-controlled Provisional Government of National Unity was formed in Warsaw

## Extracting the Article Summary

In [None]:
# Making the final summary
number_of_sentence_to_summarize_to = 10

import heapq # Heap queue (priority queue) module; nlargest returns the n highest-scoring sentences
summary_sentences = heapq.nlargest(number_of_sentence_to_summarize_to, sentence_scores, key=sentence_scores.get)

summary = ' '.join(summary_sentences)
print(summary)

Like his predecessors, Kania made promises that the regime could not fulfill because the authorities were still trapped by the contradiction: if they followed economic necessity, they would generate political instability. Nomenklatura members were appointed by the party and exercised political control in all spheres of public life, for example economic development, industry management, or education. Gierek government's growing difficulties led also to increased dependence on the Soviet Union, including tight economic cooperation and displays of submissiveness not seen under Gomułka's rule. These years, while featuring general industrialization, urbanization and many improvements in the standard of living,[a1] were marred by early Stalinist repressions, social unrest, political strife and severe economic difficulties. People of decidedly anticommunist or anti-PZPR orientations constituted a relatively small minority within the First Solidarity organization, which accommodated one millio
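
`heapq.nlargest` returns the selected sentences ordered by score, highest first, which is why the summary above does not follow the order of the article. The sketch below shows the selection on a small dictionary and, as an optional variation that is not part of the original notebook, restores document order afterwards. The `top_sentences` helper and the demo scores are illustrative assumptions.

```python
import heapq

def top_sentences(sentence_scores, n, keep_document_order=False):
    """Pick the n highest-scoring sentences; optionally restore their original order."""
    best = heapq.nlargest(n, sentence_scores, key=sentence_scores.get)
    if keep_document_order:
        # Dicts preserve insertion order (Python 3.7+), so a sentence's position
        # in the dict reflects its position in the article.
        position = {sentence: index for index, sentence in enumerate(sentence_scores)}
        best.sort(key=position.get)
    return best

demo_scores = {"First sentence.": 0.9, "Second sentence.": 0.4, "Third sentence.": 1.2}
print(top_sentences(demo_scores, 2))
# ['Third sentence.', 'First sentence.']
print(top_sentences(demo_scores, 2, keep_document_order=True))
# ['First sentence.', 'Third sentence.']
```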

**Conclusion**

In this project, we explained the task of text summarization using Python, NLTK, and a few helper libraries, and summarized a Wikipedia article into 10 sentences. Document summarization can be used in diverse scenarios, and you can adjust the project to various use cases. The two things to adjust are the input (the source of the data) and the output (the number of sentences in the summary). All the code for this project is available in the GitHub repo listed below.

**References**:

GitHub: https://github.com/inuwamobarak/document-summarization
History of Poland (1945–1989). (2023, April 2). In Wikipedia. https://en.wikipedia.org/wiki/History_of_Poland_(1945%E2%80%931989)