<a href="https://colab.research.google.com/github/inuwamobarak/document-summarization/blob/main/Wikipedia_Earth_Article_Summarization_Using%C2%A0Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Document Summarization on Wikipedia Articles Using Python

## Importing/Installing Dependencies

In [1]:
# Backend
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger') 

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [2]:
pip install beautifulsoup4

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [3]:
pip install lxml

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [4]:
# Core Libraries
import bs4 as bs
import urllib.request
import re

# Indirect requirements
import pandas as pd
import matplotlib.pyplot as plt
import io
import unicodedata
import numpy as np
import string

## Fetching Articles from Wikipedia

In [23]:
# Scraping the data and loading it from the URL

wikipedia_article = urllib.request.urlopen('https://en.wikipedia.org/wiki/Earth')  # Open the Wikipedia article on Earth
article = wikipedia_article.read() # Raw HTML bytes, still containing all tags and markup

Find details on the urllib.request library here: https://docs.python.org/3/library/urllib.request.html
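
Some servers reject requests that do not identify themselves, so it can help to send an explicit User-Agent header and to close the connection once the page has been read. The sketch below is one way to do that with the standard library only; the helper names `build_request` and `fetch_html` and the agent string `summarizer-demo/0.1` are illustrative assumptions, not part of the original notebook.

```python
import urllib.request


def build_request(url, user_agent="summarizer-demo/0.1"):
    """Return a Request carrying an explicit User-Agent header (hypothetical helper)."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url, timeout=10):
    """Download the page and return its raw bytes; the connection is closed on exit."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read()


# Building the request does not touch the network; fetch_html() does.
demo_request = build_request("https://en.wikipedia.org/wiki/Earth")
```

Calling `fetch_html('https://en.wikipedia.org/wiki/Earth')` would then return the same kind of bytes object as `article` above.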

## Preprocessing of the Data

In [None]:
parsed_article = bs.BeautifulSoup(article,'lxml') # Parse the HTML with BeautifulSoup, using lxml as the parser

paragraphs = parsed_article.find_all('p') # Collect every <p> element in the article

article_text = ""

for p in paragraphs:
    article_text += p.text # Append the paragraph's text content, without tags

In [24]:
# Viewing content with symbols
article_text

'\nLand: 148940000\xa0km2 (57510000\xa0sq\xa0mi) – 29.2%\nEarth is the third planet from the Sun and the only place known in the universe where life has originated and found habitability. While Earth may not contain the largest volumes of water in the Solar System, only Earth sustains liquid surface water, extending over 70.8% of the Earth with its ocean, making Earth an ocean world. Earth\'s polar regions currently retain most of all other water with large sheets of ice covering ocean and land, dwarfing Earth\'s groundwater, lakes, rivers and atmospheric water. Land, consisting of continents and islands, extends over 29.2% of the Earth and is widely covered by vegetation. Below Earth\'s surface material lies Earth\'s crust consisting of several slowly moving tectonic plates, which interact to produce mountain ranges, volcanoes, and earthquakes. Earth\'s liquid outer core generates a magnetic field that shapes the magnetosphere of Earth, largely deflecting destructive solar winds and c

In [25]:
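# Dropping unwanted characters and spaces

article_text = re.sub(r'\[[0-9]*\]', ' ', article_text) # Remove citation markers such as [1]
article_text = re.sub(r'\s+', ' ', article_text) # Collapse runs of whitespace into a single space
formatted_article_text = re.sub(r'[^a-zA-Z]', ' ', article_text) # Keep letters only; digits and punctuation become spaces
formatted_article_text = re.sub(r'\s+', ' ', formatted_article_text) # Collapse the extra spaces this leaves behind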
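
To see what each substitution does, the same four steps can be applied to a short made-up string. This is only a sketch: the function name `clean_text` and the sample sentence are assumptions for illustration, while the regular expressions are the ones used in the cell above.

```python
import re


def clean_text(text):
    """Apply the notebook's four substitutions and return both versions of the text."""
    text = re.sub(r'\[[0-9]*\]', ' ', text)       # remove citation markers like [1]
    text = re.sub(r'\s+', ' ', text)              # collapse whitespace (including newlines)
    letters_only = re.sub(r'[^a-zA-Z]', ' ', text)  # replace everything but letters
    letters_only = re.sub(r'\s+', ' ', letters_only)
    return text, letters_only


sample = "Earth is round.[1]  It orbits\nthe Sun.[23]"
cleaned, letters_only = clean_text(sample)
print(repr(cleaned))       # 'Earth is round. It orbits the Sun. '
print(repr(letters_only))  # 'Earth is round It orbits the Sun '
```

The first result keeps punctuation, which the sentence tokenizer needs; the second keeps only letters, which is convenient for counting words.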

In [48]:
# Viewing processed data without symbols
formatted_article_text

' Land km sq mi Earth is the third planet from the Sun and the only place known in the universe where life has originated and found habitability While Earth may not contain the largest volumes of water in the Solar System only Earth sustains liquid surface water extending over of the Earth with its ocean making Earth an ocean world Earth s polar regions currently retain most of all other water with large sheets of ice covering ocean and land dwarfing Earth s groundwater lakes rivers and atmospheric water Land consisting of continents and islands extends over of the Earth and is widely covered by vegetation Below Earth s surface material lies Earth s crust consisting of several slowly moving tectonic plates which interact to produce mountain ranges volcanoes and earthquakes Earth s liquid outer core generates a magnetic field that shapes the magnetosphere of Earth largely deflecting destructive solar winds and cosmic radiation Earth has an atmosphere which sustains Earth s surface condi

## Performing Text Tokenization

In [26]:
sentence_list = nltk.sent_tokenize(article_text) # Split the text into sentences with NLTK's Punkt tokenizer

In [38]:
sentence_list[:3] # Viewing the first three sentences

[' Land: 148940000 km2 (57510000 sq mi) – 29.2% Earth is the third planet from the Sun and the only place known in the universe where life has originated and found habitability.',
 'While Earth may not contain the largest volumes of water in the Solar System, only Earth sustains liquid surface water, extending over 70.8% of the Earth with its ocean, making Earth an ocean world.',
 "Earth's polar regions currently retain most of all other water with large sheets of ice covering ocean and land, dwarfing Earth's groundwater, lakes, rivers and atmospheric water."]

## Weighting the Frequency of Words

In [53]:
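stopwords = nltk.corpus.stopwords.words('english') # Loading the English list; change to another language as required

# Counting occurrences of each word that is not a stop word
# Note: words keep their original case here, so 'Earth' and 'earth' are counted separately,
# and capitalized stop words such as 'The' are not filtered out by the lowercase stop-word list
word_frequencies = {}
for word in nltk.word_tokenize(formatted_article_text):
    if word not in stopwords: # Skip stop words
        if word not in word_frequencies:
            word_frequencies[word] = 1
        else:
            word_frequencies[word] += 1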
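
The loop above compares tokens in their original case, while NLTK's English stop-word list is lowercase and the scoring step later lowercases every token. One possible variant, sketched under assumptions, lowercases the tokens before counting and uses `collections.Counter`. To stay self-contained it splits on whitespace instead of calling `nltk.word_tokenize`, and `DEMO_STOPWORDS` is a hypothetical stand-in for the real stop-word list.

```python
from collections import Counter

# Hypothetical mini stop-word list, standing in for nltk.corpus.stopwords.words('english')
DEMO_STOPWORDS = {"the", "is", "of", "and", "a"}


def count_words(text, stopwords=DEMO_STOPWORDS):
    """Count lowercase, whitespace-separated tokens that are not stop words."""
    tokens = text.lower().split()
    return Counter(token for token in tokens if token not in stopwords)


demo_counts = count_words("The Earth is a planet and the Earth orbits the Sun")
print(demo_counts.most_common(1))  # [('earth', 2)]
```

Swapping this in for the notebook's loop would change the recorded counts, so the outputs shown in the following cells would no longer match exactly.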

In [54]:
maximum_frequncy = max(word_frequencies.values()) # Number of occurrences of the most frequent word

In [55]:
maximum_frequncy

255

In [56]:
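most_frequent_word = max(word_frequencies) # Caution: max() on a dict compares its keys, so this is the alphabetically last word; max(word_frequencies, key=word_frequencies.get) gives the most frequent one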

In [57]:
most_frequent_word

'zone'
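
Calling `max()` directly on a dictionary compares its keys, which for strings means alphabetical order; passing `key=` ranks the keys by their values instead. A small sketch with made-up counts (`demo_frequencies` is illustrative, not data from the article):

```python
demo_frequencies = {"earth": 5, "zone": 1, "water": 3}

# Without a key function, the keys themselves are compared as strings.
alphabetically_last = max(demo_frequencies)

# With key=dict.get, each key is ranked by its count.
most_frequent = max(demo_frequencies, key=demo_frequencies.get)

print(alphabetically_last)  # zone
print(most_frequent)        # earth
```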

In [None]:
# Normalizing: divide each count by the highest count, so every weight falls between 0 and 1
for word in word_frequencies.keys():
    word_frequencies[word] = (word_frequencies[word]/maximum_frequncy)
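
As a worked example of this normalization, each count is divided by the largest count, so the most frequent word gets weight 1.0 and every other word a fraction of it. The helper name `normalize` and the sample counts are assumptions for illustration; unlike the loop above, this sketch returns a new dictionary rather than updating in place.

```python
def normalize(frequencies):
    """Return a new dict with every count divided by the largest count."""
    peak = max(frequencies.values())
    return {word: count / peak for word, count in frequencies.items()}


demo_weights = normalize({"earth": 4, "water": 2, "zone": 1})
print(demo_weights)  # {'earth': 1.0, 'water': 0.5, 'zone': 0.25}
```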

## Finding the Score of Sentences

In [12]:
# We use the word frequencies to score each sentence
sentence_scores = {}
for sent in sentence_list: # Sentences come from the text that still contains punctuation and digits
    for word in nltk.word_tokenize(sent.lower()): # Tokens are lowercased before the lookup
        if word in word_frequencies: # Lowercased stop words are not keys of word_frequencies, so they are skipped
            if len(sent.split(' ')) < 32: # Only score sentences with fewer than 32 words; the summary should be short
                if sent not in sentence_scores:
                    sentence_scores[sent] = word_frequencies[word]
                else:
                    sentence_scores[sent] += word_frequencies[word]
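
The scoring rule can be summarized as: a sentence's score is the sum of the weights of its words, and sentences of 32 or more words are ignored. The sketch below restates that rule as a function on toy data. It is an approximation, not the notebook's exact code: it splits on whitespace and strips trailing punctuation instead of using `nltk.word_tokenize`, and `score_sentences`, `demo_weights`, and `demo_sentences` are hypothetical names.

```python
def score_sentences(sentences, weights, max_words=32):
    """Sum the weights of known words per sentence, skipping long sentences."""
    scores = {}
    for sentence in sentences:
        if len(sentence.split(' ')) >= max_words:
            continue  # too long for a short summary
        for token in sentence.lower().split():
            token = token.strip('.,;:!?')  # crude stand-in for a real tokenizer
            if token in weights:
                scores[sentence] = scores.get(sentence, 0) + weights[token]
    return scores


demo_weights = {"earth": 1.0, "water": 0.5}
demo_sentences = [
    "Earth has water.",
    "Mars is dry.",
    "Water covers Earth and water moves.",
]
demo_scores = score_sentences(demo_sentences, demo_weights)
print(demo_scores)
```

A sentence with no weighted words, such as "Mars is dry." here, never enters the dictionary, which matches the behavior of the loop above.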

In [41]:
# Viewing the score of each sentence
sentence_scores

{"Earth's polar regions currently retain most of all other water with large sheets of ice covering ocean and land, dwarfing Earth's groundwater, lakes, rivers and atmospheric water.": 1.0549019607843135,
 'Land, consisting of continents and islands, extends over 29.2% of the Earth and is widely covered by vegetation.': 0.28235294117647053,
 "Below Earth's surface material lies Earth's crust consisting of several slowly moving tectonic plates, which interact to produce mountain ranges, volcanoes, and earthquakes.": 0.6705882352941176,
 "Earth's liquid outer core generates a magnetic field that shapes the magnetosphere of Earth, largely deflecting destructive solar winds and cosmic radiation.": 0.4627450980392156,
 "Earth has an atmosphere, which sustains Earth's surface conditions and protects it from most meteoroids and UV-light at entry.": 0.48235294117647054,
 'It has a composition of primarily nitrogen and oxygen.': 0.09019607843137255,
 'Water vapor is widely present in the atmosph

## Extracting the Article Summary

In [47]:
# Making the final summary
number_of_sentence_to_summarize_to = 10

import heapq # Heap queue (priority queue) module; nlargest returns the n highest-scoring items
summary_sentences = heapq.nlargest(number_of_sentence_to_summarize_to, sentence_scores, key=sentence_scores.get)

summary = ' '.join(summary_sentences)
print(summary)

70.8% or 361.13 million km2 (139.43 million sq mi) of Earth's surface consists of the interconnected ocean, making it Earth's global ocean or world ocean. Solar System planets with considerable atmospheres do partly host atmospheric water vapor, but they lack surface conditions for stable surface water. Earth's polar regions currently retain most of all other water with large sheets of ice covering ocean and land, dwarfing Earth's groundwater, lakes, rivers and atmospheric water. Earth's surface topography consists mostly of the topography of the ocean surface and to a lesser extend of the terrain of Earth's crust above sea level. Regarding the surface distribution of land and water, Earth can be divided into an oceans-focused water hemisphere and a landmasses-focused land hemisphere.
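
`heapq.nlargest` returns the selected sentences ordered by score, highest first, so the summary may not follow the order of the article. If reading order matters, one option is to select the top sentences and then restore their original order. The following is a sketch on toy data; `top_sentences_in_order` and the sample sentences are assumptions, not part of the notebook.

```python
import heapq


def top_sentences_in_order(sentences, scores, n):
    """Pick the n best-scoring sentences, then restore document order."""
    best = set(heapq.nlargest(n, scores, key=scores.get))
    return [sentence for sentence in sentences if sentence in best]


demo_sentences = ["A.", "B.", "C.", "D."]
demo_scores = {"A.": 0.2, "B.": 0.7, "C.": 0.1, "D.": 0.9}

ranked = heapq.nlargest(2, demo_scores, key=demo_scores.get)
ordered = top_sentences_in_order(demo_sentences, demo_scores, 2)

print(ranked)   # ['D.', 'B.']
print(ordered)  # ['B.', 'D.']
```

In the notebook, `sentence_list` and `sentence_scores` would take the place of the demo inputs.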
