## Scrape Wikipedia Article

In [1]:
import bs4 as bs
import urllib.request
import ssl

url = 'https://en.wikipedia.org/wiki/Natural_language_processing'

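# create_default_context() verifies certificates; a bare ssl.SSLContext() is deprecated and does not
gcontext = ssl.create_default_context()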
scraped_data = urllib.request.urlopen(url, context=gcontext)
article = scraped_data.read()
parsed_article = bs.BeautifulSoup(article, 'lxml')
paragraphs = parsed_article.find_all('p')

# Concatenate the text of every <p> element into one string
text = "".join(p.text for p in paragraphs)

## Preprocess and format the data
* Remove square-bracketed citation markers such as [12]
* Remove special characters and digits
* Remove extra spaces

In [2]:
import re

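# Drop citation markers such as [12]; this version keeps punctuation for sentence splitting
text_pp = re.sub(r'\[[0-9]*\]', ' ', text)
# Keep letters only and lowercase, so words match the lowercase stop words and sentence tokens
text_pp_formatted = re.sub(r'[^a-zA-Z]', ' ', text_pp).lower()
# Collapse runs of whitespace into a single space
text_pp_formatted = re.sub(r'\s+', ' ', text_pp_formatted)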
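
The three cleaning steps can be checked on a short string before running them on the whole article. The following is a minimal sketch using the same regular expressions; the `preprocess` helper and the sample sentence are hypothetical and only serve as an illustration.

```python
import re

def preprocess(raw):
    """Return (sentence_text, word_text) using the same regexes as the cell above."""
    # Step 1: replace citation markers such as [1] with a space
    no_citations = re.sub(r'\[[0-9]*\]', ' ', raw)
    # Step 2: replace everything that is not a letter with a space
    letters_only = re.sub(r'[^a-zA-Z]', ' ', no_citations)
    # Step 3: collapse whitespace runs and trim the ends
    word_text = re.sub(r'\s+', ' ', letters_only).strip()
    return no_citations, word_text

# Hypothetical sample, not taken from the scraped article
sample = "NLP began in the 1950s.[1] It has grown since 2010!"
sentence_text, word_text = preprocess(sample)
print(sentence_text)
print(word_text)
```

Note that removing digits leaves fragments behind: "1950s" is reduced to a stray "s". For frequency counting this is usually harmless, but it may be worth filtering out one-letter tokens.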

## Tokenize the text into sentences and words

In [3]:
import nltk

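# Both tokenizers need the NLTK tokenizer data to be downloaded first (see nltk.download)
# Sentences come from the text that still has punctuation
sentences = nltk.sent_tokenize(text_pp)
# Words come from the letters-only text
words = nltk.word_tokenize(text_pp_formatted)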

## Find the Word Frequencies

In [4]:
# A set gives constant-time membership tests; needs the NLTK 'stopwords' corpus
stop_words = set(nltk.corpus.stopwords.words('english'))

# Count every word that is not a stop word
word_frequencies = {}
for word in words:
    if word not in stop_words:
        word_frequencies[word] = word_frequencies.get(word, 0) + 1

## Find the Weighted Word Frequencies

In [5]:
max_frequency = max(word_frequencies.values())

# Scale each count by the largest count, so the most frequent word has weight 1.0
weighted_word_frequencies = {
    word: count / max_frequency for word, count in word_frequencies.items()
}
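
The counting and normalisation steps can also be written with `collections.Counter` from the standard library. This is an alternative sketch, not the notebook's own code; the `weighted_frequencies` helper, the toy token list and the toy stop-word set are hypothetical.

```python
from collections import Counter

def weighted_frequencies(tokens, stop_words):
    """Count non-stop-word tokens and scale each count by the largest count."""
    counts = Counter(t for t in tokens if t not in stop_words)
    if not counts:
        return {}  # avoid calling max() on an empty sequence
    top = max(counts.values())
    return {word: n / top for word, n in counts.items()}

# Hypothetical toy input
toy_tokens = ['language', 'models', 'process', 'language',
              'and', 'language', 'data', 'models']
toy_stop_words = {'and'}
weights = weighted_frequencies(toy_tokens, toy_stop_words)
print(weights)
```

Here "language" occurs three times and gets weight 1.0, while "models" occurs twice and gets 2/3. The empty-input guard matters because `max()` raises `ValueError` when every token is a stop word.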

## Find the Sentence Scores

In [6]:
sentence_scores = {}

for sentence in sentences:
    # Skip long sentences so the summary stays concise
    if len(sentence.split(' ')) >= 30:
        continue
    # A sentence's score is the sum of the weighted frequencies of its words
    for word in nltk.word_tokenize(sentence.lower()):
        if word in weighted_word_frequencies:
            sentence_scores[sentence] = (
                sentence_scores.get(sentence, 0) + weighted_word_frequencies[word]
            )
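
To see how the scoring behaves, it helps to run it on a few hand-made sentences. The sketch below assumes whitespace splitting in place of `nltk.word_tokenize` so that it runs without NLTK data; `score_sentences`, the toy weights and the toy sentences are all hypothetical.

```python
import heapq

def score_sentences(sentences, weights, max_words=30):
    """Sum the word weights of each sentence shorter than max_words words."""
    scores = {}
    for sentence in sentences:
        # Assumption: plain whitespace splitting stands in for nltk.word_tokenize
        tokens = sentence.lower().split()
        if len(tokens) >= max_words:
            continue
        score = sum(weights.get(tok.strip('.,;:!?'), 0.0) for tok in tokens)
        if score > 0:  # sentences without any weighted word are left out
            scores[sentence] = score
    return scores

# Hypothetical toy data
toy_weights = {'language': 1.0, 'models': 0.5, 'data': 0.25}
toy_sentences = [
    'Language models process language.',
    'Data is everywhere.',
    'The weather is nice.',
]
toy_scores = score_sentences(toy_sentences, toy_weights)
best = heapq.nlargest(1, toy_scores, key=toy_scores.get)
print(toy_scores)
print(best)
```

The first sentence scores 2.5 (1.0 + 0.5 + 1.0), the second 0.25, and the third is dropped because none of its words carry a weight. Because scores are plain sums, sentences with many frequent words are favoured, which is why a length cap is applied.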

## Select the top sentences by descending score

In [7]:
import heapq

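# Pick the 7 highest-scoring sentences, best first
top_sentences = heapq.nlargest(7, sentence_scores, key=sentence_scores.get)
# Join them in score order (not in their original article order)
summary = ' '.join(top_sentences)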

print(summary)

Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation. Starting in the late 1980s, however, there was a revolution in natural language processing with the introduction of machine learning algorithms for language processing. Systems based on machine-learning algorithms have many advantages over hand-produced rules:
The following is a list of some of the most commonly researched tasks in natural language processing. Since the so-called "statistical revolution"   in the late 1980s and mid-1990s, much natural language processing research has relied heavily on machine learning. Little further research in machine translation was conducted until the late 1980s when the first statistical machine translation systems were developed. Up to the 1980s, most natural language processing systems were based on complex sets of hand-written rules. Some of the earliest-used machine learning algorithms, such as decis