In [1]:
import bs4 as bs
import urllib.request
import re
import nltk
import heapq

In [2]:
scraped_data = urllib.request.urlopen('https://en.wikipedia.org/wiki/Natural_language_processing')
article = scraped_data.read()

parsed_article = bs.BeautifulSoup(article,'lxml')

paragraphs = parsed_article.find_all('p')

# Concatenate the text of every <p> element into one string
article_text = "".join(p.text for p in paragraphs)

In [3]:
article_text

'Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.  The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.\nChallenges in natural language processing frequently involve speech recognition, natural-language understanding, and natural-language generation.\nNatural language processing has its roots in the 1950s. Already in 1950, Alan Turing published an article titled "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a criterion of intelligence, though at the time that was not articul

# Text Preprocessing

In [4]:
# Remove bracketed citation markers such as [1], then collapse runs of whitespace
article_text = re.sub(r'\[[0-9]*\]', ' ', article_text)
article_text = re.sub(r'\s+', ' ', article_text)

In [5]:
# Keep letters only (used for word counts); article_text keeps punctuation for sentence splitting
format_article_text = re.sub(r'[^a-zA-Z]', ' ', article_text)
format_article_text = re.sub(r'\s+', ' ', format_article_text)
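
The two cleanup passes above can be tried on a small string without fetching the article. This is a minimal sketch: `clean_text` and the sample sentence are hypothetical names and data introduced for illustration, not part of the notebook.

```python
import re

def clean_text(raw):
    """Apply the same two regex passes used above (hypothetical helper)."""
    # Pass 1: drop citation markers like [12] and collapse whitespace.
    sentence_text = re.sub(r'\[[0-9]*\]', ' ', raw)
    sentence_text = re.sub(r'\s+', ' ', sentence_text)
    # Pass 2: replace every non-letter with a space, then collapse again.
    word_text = re.sub(r'[^a-zA-Z]', ' ', sentence_text)
    word_text = re.sub(r'\s+', ' ', word_text)
    return sentence_text, word_text

sample = "NLP began in the 1950s.[1]\nIt is   widely used today.[23]"
sentence_text, word_text = clean_text(sample)
print(sentence_text)
print(word_text)
```

Note that the letters-only pass turns `1950s` into a stray `s`, so a few one-letter tokens can end up in the word counts.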

In [6]:
sentence_list = nltk.sent_tokenize(article_text)

In [7]:
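from nltk.corpus import stopwords

In [8]:
# Use a separate name so the imported `stopwords` corpus reader is not shadowed;
# a set makes the membership test below fast
stop_words = set(stopwords.words('english'))

## Calculate word frequency

In [9]:
# Count each non-stopword token. Tokens keep their original case here, so
# 'Natural' and 'natural' are counted separately, and capitalized stopwords
# such as 'The' are not filtered by the lowercase stopword list.
word_freq = {}
for word in nltk.word_tokenize(format_article_text):
    if word not in stop_words:
        word_freq[word] = word_freq.get(word, 0) + 1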

In [10]:
word_freq

{'Natural': 2,
 'language': 28,
 'processing': 16,
 'NLP': 17,
 'subfield': 1,
 'linguistics': 9,
 'computer': 4,
 'science': 3,
 'artificial': 2,
 'intelligence': 3,
 'concerned': 1,
 'interactions': 1,
 'computers': 2,
 'human': 2,
 'particular': 1,
 'program': 1,
 'process': 2,
 'analyze': 2,
 'large': 3,
 'amounts': 1,
 'natural': 18,
 'data': 5,
 'The': 7,
 'goal': 1,
 'capable': 1,
 'understanding': 4,
 'contents': 1,
 'documents': 4,
 'including': 1,
 'contextual': 1,
 'nuances': 1,
 'within': 1,
 'technology': 1,
 'accurately': 1,
 'extract': 1,
 'information': 1,
 'insights': 1,
 'contained': 1,
 'well': 2,
 'categorize': 1,
 'organize': 1,
 'Challenges': 1,
 'frequently': 2,
 'involve': 2,
 'speech': 5,
 'recognition': 2,
 'generation': 2,
 'roots': 1,
 'Already': 1,
 'Alan': 1,
 'Turing': 2,
 'published': 1,
 'article': 1,
 'titled': 1,
 'Computing': 1,
 'Machinery': 1,
 'Intelligence': 1,
 'proposed': 3,
 'called': 2,
 'test': 2,
 'criterion': 1,
 'though': 1,
 'time': 1,
 

In [11]:
max_freq = max(word_freq.values())

In [12]:
# Normalize counts to the range (0, 1] by dividing by the highest count
for word in word_freq:
    word_freq[word] = word_freq[word] / max_freq
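
The counting and normalization steps can be checked on a toy token list. This is a sketch only: `weighted_frequencies`, the token list, and the one-word stopword set are hypothetical, and `collections.Counter` stands in for the manual dictionary used in the notebook.

```python
from collections import Counter

def weighted_frequencies(tokens, stop_words):
    """Count non-stopword tokens and divide each count by the maximum count."""
    counts = Counter(t for t in tokens if t not in stop_words)
    if not counts:
        return {}
    max_count = max(counts.values())
    return {word: count / max_count for word, count in counts.items()}

# Hypothetical toy input, not taken from the article
tokens = "language models process language and model language".split()
weights = weighted_frequencies(tokens, stop_words={"and"})
print(weights)
```

The most frequent word always receives weight 1.0, and every other word receives a fraction of that.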

## Score sentences by weighted word frequency

In [13]:
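sentence_scores = {}
for sent in sentence_list:
    # Only sentences shorter than 30 words are eligible for the summary
    if len(sent.split(' ')) >= 30:
        continue
    # Tokens are lowercased here, so only lowercase keys of word_freq can match
    for word in nltk.word_tokenize(sent.lower()):
        if word in word_freq:
            # A sentence's score is the sum of the weights of its known words
            sentence_scores[sent] = sentence_scores.get(sent, 0) + word_freq[word]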
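
The scoring rule can be illustrated without NLTK. In this sketch, `score_sentences`, the weights, and the sentences are hypothetical, and a plain `str.split` with punctuation stripping is an assumed stand-in for `nltk.word_tokenize`.

```python
def score_sentences(sentences, weights, max_words=30):
    """Sum the weights of known words in each sufficiently short sentence."""
    scores = {}
    for sent in sentences:
        # Skip long sentences, mirroring the 30-word limit used above
        if len(sent.split(' ')) >= max_words:
            continue
        for token in sent.lower().split():
            word = token.strip('.,;:!?"\'()')
            if word in weights:
                scores[sent] = scores.get(sent, 0.0) + weights[word]
    return scores

# Hypothetical weights and sentences for illustration
weights = {"language": 1.0, "speech": 0.5}
sentences = ["Language and speech.", "Nothing relevant here.", "Language, language!"]
scores = score_sentences(sentences, weights)
print(scores)
```

Sentences with no weighted words never enter the dictionary, and a word that repeats in a sentence contributes its weight each time.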

In [14]:
sentence_scores

{'The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them.': 1.6785714285714286,
 'The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.': 0.6428571428571428,
 'Challenges in natural language processing frequently involve speech recognition, natural-language understanding, and natural-language generation.': 2.821428571428572,
 'Natural language processing has its roots in the 1950s.': 2.25,
 'The proposed test includes a task that involves the automated interpretation and generation of natural language.': 2.107142857142857,
 'Up to the 1980s, most natural language processing systems were based on complex sets of hand-written rules.': 3.071428571428572,
 'Starting in the late 1980s, however, there was a revolution in natural language processing with the introduction of machine learning algorithms for langu

In [15]:

# Pick the 10 highest-scoring sentences; they come back in descending score order
summary_sentences = heapq.nlargest(10, sentence_scores, key=sentence_scores.get)

summary = ' '.join(summary_sentences)
summary

'Starting in the late 1980s, however, there was a revolution in natural language processing with the introduction of machine learning algorithms for language processing. In the 2010s, representation learning and deep neural network-style machine learning methods became widespread in natural language processing. That popularity was due partly to a flurry of results showing that such techniques can achieve state-of-the-art results in many natural language tasks, e.g., in language modeling and parsing. Up to the 1980s, most natural language processing systems were based on complex sets of hand-written rules. The cache language models upon which many speech recognition systems now rely are examples of such statistical models. Challenges in natural language processing frequently involve speech recognition, natural-language understanding, and natural-language generation. The following is a list of some of the most commonly researched tasks in natural language processing. Though natural langu
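
Because `heapq.nlargest` returns sentences ordered by score, the summary above does not follow the article's order. A minimal sketch of one possible alternative, using hypothetical scores, selects the top sentences and then restores document order by relying on dictionary insertion order:

```python
import heapq

# Hypothetical scores; keys are inserted in document order
scores = {"First.": 1.9, "Second.": 0.4, "Third.": 2.5, "Fourth.": 1.1}

# Top 2 sentences, highest score first (same call pattern as above)
top = heapq.nlargest(2, scores, key=scores.get)

# Re-sort the selected sentences into their original document order
top_set = set(top)
ordered = [sent for sent in scores if sent in top_set]

print(top)
print(' '.join(ordered))
```

Whether score order or document order reads better depends on the text; this only shows that both are available from the same scores.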