#### Text Summarization
url : https://medium.com/themenyouwanttobe/text-summarization-96079bf23e83

github : https://github.com/themenyouwanttobe/Text--Summarization/blob/master/Text-Summarization.py

Find top n important sentences from ANN Wikipedia Article using word frequency.

#### 1. Importing the libraries/packages

In [24]:
import bs4 as bs
import urllib.request
import re
import nltk
import heapq

In [25]:
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/limhyesu/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/limhyesu/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

#### 2. Getting the data 

In [26]:
data = urllib.request.urlopen("https://en.wikipedia.org/wiki/Artificial_neural_network").read()

In [27]:
soup = bs.BeautifulSoup(data,'lxml')

In [28]:
#data

In [29]:
#soup

In [30]:
text = ""
for paragraph in soup.find_all('p'):
    text += paragraph.text
# articles in Wikipedia are written under <p> tag.    

In [31]:
#text

#### 3. Data Cleaning

In [32]:
text = re.sub(r'\[[0-9]*\]',' ',text) # remove all the references in the text which is denoted by [1], [2] etc
text = re.sub(r'\s+',' ',text) # remove all the extra spaces with single space

In [33]:
#text

In [34]:
clean_text = text.lower() # convert into lower case
clean_text = re.sub(r'\W',' ',clean_text) # remove all the extra punctuation, digis, extra spaces
clean_text = re.sub(r'\d',' ',clean_text)
clean_text = re.sub(r'\s+',' ',clean_text)

In [35]:
sentences = nltk.sent_tokenize(text) # break the all big text into sentences using sent_tokenize()

In [36]:
sentences

[' Artificial neural networks (ANN) or connectionist systems are computing systems vaguely inspired by the biological neural networks that constitute animal brains.',
 'The neural network itself is not an algorithm, but rather a framework for many different machine learning algorithms to work together and process complex data inputs.',
 'Such systems "learn" to perform tasks by considering examples, generally without being programmed with any task-specific rules.',
 'For example, in image recognition, they might learn to identify images that contain cats by analyzing example images that have been manually labeled as "cat" or "no cat" and using the results to identify cats in other images.',
 'They do this without any prior knowledge about cats, for example, that they have fur, tails, whiskers and cat-like faces.',
 'Instead, they automatically generate identifying characteristics from the learning material that they process.',
 'An ANN is based on a collection of connected units or nod

In [37]:
#clean_text

In [38]:
stop_words = nltk.corpus.stopwords.words('english')

In [39]:
#stop_words

#### 4. Building the Histogram

In [40]:
word2count = {} # dictionary

In [41]:
for word in nltk.word_tokenize(clean_text):
    if word not in stop_words:
        if word not in word2count.keys():
            word2count[word] = 1
        else:
            word2count[word] += 1

In [42]:
# Weighted Histogram

for key in word2count.keys():
    word2count[key]=word2count[key]/max(word2count.values())

#### 5. Calculating the Sentence score

In [43]:
# Calculate the score

sent2score = {}
for sentence in sentences:
    for word in nltk.word_tokenize(sentence.lower()):
        if word in word2count.keys():
            if len(sentence.split(' '))<30:
                if sentence not in sent2score.keys():
                    sent2score[sentence]=word2count[word]
                else:
                    sent2score[sentence]+=word2count[word]

In [44]:
sent2score

{' Artificial neural networks (ANN) or connectionist systems are computing systems vaguely inspired by the biological neural networks that constitute animal brains.': 4.7283141029958555,
 '(2006) proposed learning a high-level representation using successive layers of binary or real-valued latent variables with a restricted Boltzmann machine to model each layer.': 2.378834161869876,
 ', s n ∈ S {\\displaystyle \\textstyle {s_{1},...,s_{n}}\\in S} and actions a 1 , .': 0.5102040816326531,
 'A DBN can be used to generatively pre-train a DNN by using the learned DBN weights as the initial DNN weights.': 1.2310799319727892,
 'A DNN can be discriminatively trained with the standard backpropagation algorithm.': 0.6490221088435375,
 'A LAMSTAR neural network may serve as a dynamic neural network in spatial or time domains or both.': 3.0697278911564623,
 'A central claim of artificial neural networks is therefore that it embodies some new and powerful general principle for processing informati

#### 6. Find out the best sentences

In [45]:
best_sentences = heapq.nlargest(5, sent2score, key=sent2score.get)

In [46]:
for sentences in best_sentences:
    print(sentences)

Minimizing this cost using gradient descent for the class of neural networks called multilayer perceptrons (MLP), produces the backpropagation algorithm for training neural networks.
Large memory storage and retrieval neural networks (LAMSTAR) are fast deep learning neural networks of many layers that can use many filters simultaneously.
Between 2009 and 2012, recurrent neural networks and deep feedforward neural networks developed in Schmidhuber's research group won eight international competitions in pattern recognition and machine learning.
The motivation behind Artificial neural networks is not necessarily to strictly replicate neural function, but to use biological neural networks as an inspiration.
Alternatives to backpropagation include Extreme Learning Machines, "No-prop" networks, training without backtracking, "weightless" networks, and non-connectionist neural networks.
