# Text Summarization Using NLTK

### Importing Libraries and Performing Web Scraping to Obtain Input Data

In [49]:
import bs4 as bs
import urllib.request
import re
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
import heapq

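# Download the raw HTML of the Wikipedia article
scraped_data = urllib.request.urlopen('https://en.wikipedia.org/wiki/Unsupervised_learning')
article = scraped_data.read()

# Parse the HTML (the 'lxml' parser requires the lxml package to be installed)
parsed_article = bs.BeautifulSoup(article, 'lxml')

# Collect every <p> element
paragraphs = parsed_article.find_all('p')

# Concatenate the paragraph text into a single string
article = ""

for p in paragraphs:
    article += p.text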

### Data Cleaning

In [50]:
# Removing citation markers such as [1] and collapsing extra whitespace
article = re.sub(r'\[[0-9]*\]', ' ', article)
article = re.sub(r'\s+', ' ', article)

# Removing special characters and digits (this copy is used only for word frequencies)
formatted_article = re.sub('[^a-zA-Z]', ' ', article)
formatted_article = re.sub(r'\s+', ' ', formatted_article)
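
To see what the two cleaning passes do, here is a minimal sketch on a made-up string. The sample text and the names `sample`, `cleaned`, and `letters_only` are illustrative assumptions and do not appear in the notebook.

```python
import re

# Made-up input with citation markers, extra spaces, a digit and punctuation
sample = "Clustering[1] groups   similar points.[23] It needs 0 labels!"

# Pass 1: drop citation markers such as [1] and collapse runs of whitespace
cleaned = re.sub(r'\[[0-9]*\]', ' ', sample)
cleaned = re.sub(r'\s+', ' ', cleaned)

# Pass 2: replace everything that is not a letter, then collapse whitespace again
letters_only = re.sub('[^a-zA-Z]', ' ', cleaned)
letters_only = re.sub(r'\s+', ' ', letters_only)

print(cleaned)       # punctuation kept, so sentences can still be split
print(letters_only)  # letters only, suitable for counting words
```

The first result keeps sentence punctuation, which is why the notebook tokenizes sentences from `article` and counts words from `formatted_article`.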

### Finding the Weighted Frequency of Words

In [51]:
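# Run these once to download the required NLTK data:
#nltk.download('stopwords')
#nltk.download('punkt')
stopwords = nltk.corpus.stopwords.words('english')
word_frequencies = {}
# Lower-case the text so words match the lower-case stopword list
# and the lower-cased tokens used later when scoring sentences
for word in nltk.word_tokenize(formatted_article.lower()):
    if word not in stopwords:
        if word not in word_frequencies:
            word_frequencies[word] = 1
        else:
            word_frequencies[word] += 1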


# Find the maximum frequency
maximum_frequency = max(word_frequencies.values())

# Divide each count by the maximum so the most frequent word has weight 1.0
for word in word_frequencies:
    word_frequencies[word] = round(word_frequencies[word] / maximum_frequency, 2)
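
The weighting step can be checked by hand on a small input. The sketch below uses `collections.Counter` in place of the manual dictionary updates; the token list, the stopword set, and the names `tokens`, `stop`, `counts`, and `weights` are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical token list and stopword set, for illustration only
tokens = ["learning", "data", "learning", "the", "model", "data", "learning"]
stop = {"the"}

# Count every token that is not a stopword
counts = Counter(t for t in tokens if t not in stop)
max_count = max(counts.values())

# Divide every count by the largest one so the most frequent word scores 1.0
weights = {word: round(count / max_count, 2) for word, count in counts.items()}
print(weights)
```

Here "learning" occurs three times and gets weight 1.0, while "data" (two occurrences) and "model" (one) get 0.67 and 0.33.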

### Converting the Entire Article to Sentences

In [52]:
#Converting text to Sentences
sentence_list = nltk.sent_tokenize(article)

### Finding the Sentence Score

In [57]:
sentence_scores = {}
for sent in sentence_list:
    # Skip long sentences (40 or more space-separated words)
    if len(sent.split(' ')) >= 40:
        continue
    # A sentence's score is the sum of the weights of its known words
    for word in nltk.word_tokenize(sent.lower()):
        if word in word_frequencies:
            sentence_scores[sent] = sentence_scores.get(sent, 0) + word_frequencies[word]
                    
#print(sentence_scores)
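
As a rough illustration of the scoring rule, the sketch below sums word weights per sentence on toy data. The weights, the sentences, and the whitespace-based tokenization (standing in for `nltk.word_tokenize` so the example needs no downloads) are assumptions made for this example.

```python
# Hypothetical weights and sentences, for illustration only
weights = {"learning": 1.0, "data": 0.67, "model": 0.33}
sentences = ["Learning needs data.", "The model is small.", "Nothing relevant here."]

scores = {}
for sent in sentences:
    # Whitespace split with punctuation stripped, a stand-in for nltk.word_tokenize
    for word in sent.lower().split():
        word = word.strip(".,!?")
        if word in weights:
            # Add the word's weight to the running total for this sentence
            scores[sent] = scores.get(sent, 0) + weights[word]

print(scores)
```

A sentence with no weighted words never enters the dictionary, so it cannot be selected for the summary.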

### Printing the Summary

In [63]:
#Printing top 10 sentences having highest sentence score
summary_sentences = heapq.nlargest(10, sentence_scores, key=sentence_scores.get)

summary = ' '.join(summary_sentences)
print(summary)
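
`heapq.nlargest` returns the selected sentences ordered by score, not by their position in the article. The sketch below shows this on hypothetical scores and, as an optional variation that the notebook does not use, puts the selected sentences back in document order. It assumes the dictionary was filled in document order, which holds for the scoring loop because Python dictionaries preserve insertion order.

```python
import heapq

# Hypothetical scores keyed by sentence, inserted in document order
scores = {"First.": 1.2, "Second.": 0.4, "Third.": 1.7, "Fourth.": 0.9}

# Pick the two highest-scoring sentences (highest score first)
top = heapq.nlargest(2, scores, key=scores.get)

# Optional: keep the same sentences but restore document order
in_order = [s for s in scores if s in top]

print(' '.join(top))
print(' '.join(in_order))
```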

A highly practical example of latent variable models in machine learning is the topic modeling which is a statistical model for generating the words (observed variables) in the document based on the topic (latent variable) of the document. In contrast to supervised learning that usually makes use of human-labeled data, unsupervised learning, also known as self-organization allows for modeling of probability densities over inputs. Some of the most common algorithms used in unsupervised learning include: 1) Clustering (2) Anomaly detection (3) Neural Networks (4) Approaches for learning latent variable models. A central application of unsupervised learning is in the field of density estimation in statistics, though unsupervised learning encompasses many other domains involving summarizing and explaining data features. Unsupervised learning is a type of machine learning that looks for previously undetected patterns in a data set with no pre-existing labels and with a minimum of human supe