#### ANN article data와 같은 방법으로 Automatic Summarization article data에 대해 진행
article : https://en.wikipedia.org/wiki/Automatic_summarization#Aided_summarization
    

#### 1. Importing libraries/packages

In [34]:
import bs4 as bs
import urllib.request
import re
import nltk
import heapq

In [35]:
# download stopwords and puncutaion
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/limhyesu/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/limhyesu/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

#### 2. Extract the data

In [36]:
# open url
data = urllib.request.urlopen("https://en.wikipedia.org/wiki/Automatic_summarization#Aided_summarization").read()

In [37]:
#data

In [38]:
# lxml : library for processing XML and HTML in Python
# use BeautifulSoup library to parse the document and extract the text in a beautiful manner.
soup = bs.BeautifulSoup(data, 'lxml')

In [39]:
#soup

**XML(Extensible Markup Language)** : 일반 텍스트 유니코드 기반 메타 언어로 태그 언어를 정의하기 위한 언어. 데이터 조작, 구성, 변환 및 쿼리를 위한 다양한 기술에 대한 액세스를 제공함.

**XML vs HTML**

XML :  데이터 교환을 위한 구조정의가 목적. 사용자가 태그를 정의할 수 있음. 환경 제한 없음.

HTML : 데이터 표현이 목적. 정해진 태그를 가지고 표현하는 것. 인터넷 웹 환경에서 작동되는 언어.

In [40]:
# getting article contents from the page.
# articles in Wikipedia are written under <p> tag. It can be different in other pages.
text = ""

# find_all(name, attrs, recursive, string, limit, **kwargs) : 해당 조건에 맞는 모든 태그들을 가져온다.
for paragraph in soup.find_all('p'):
    text += paragraph.text # .text : str 객체 반환

In [41]:
text

'Automatic summarization is the process of shortening a text document with software, in order to create a summary with the major points of the original document. Technologies that can make a coherent summary take into account variables such as length, writing style and syntax.\nAutomatic data summarization is part of machine learning and data mining. The main idea of summarization is to find a subset of data which contains the "information" of the entire set. Such techniques are widely used in industry today. Search engines are an example; others include summarization of documents, image collections and videos. Document summarization tries to create a representative summary or abstract of the entire document, by finding the most informative sentences, while in image summarization the system finds the most representative and important (i.e. salient) images. For surveillance videos, one might want to extract the important events from the uneventful context.\nThere are two general approac

#### 3. Data Cleaning

**re.sub(pattern, repl, string)** : string에서 pattern과 매치하는 텍스트를 repl로 치환한다.

정규식 문자열 앞에 r문자를 선행하면 이 정규식은 Raw String 규칙에 의하여 백슬래시 2개 대신 1개만 써도 두 개를 쓴 것과 동일한 의미를 갖게 된다. 백슬래시를 사용하지 않는 정규식이라면 r의 유무에 상관없이 동일한 정규식이 될 것이다.

In [42]:
text = re.sub(r'\[[0-9]*\]',' ',text) # [1], [2] 등으로 표시되어 있는 참조 부분 제거 하고 공백으로 대체.
text = re.sub(r'\s+', ' ', text) # 한 개 이상의 공백을 한 개의 공백으로 치환

In [43]:
text

'Automatic summarization is the process of shortening a text document with software, in order to create a summary with the major points of the original document. Technologies that can make a coherent summary take into account variables such as length, writing style and syntax. Automatic data summarization is part of machine learning and data mining. The main idea of summarization is to find a subset of data which contains the "information" of the entire set. Such techniques are widely used in industry today. Search engines are an example; others include summarization of documents, image collections and videos. Document summarization tries to create a representative summary or abstract of the entire document, by finding the most informative sentences, while in image summarization the system finds the most representative and important (i.e. salient) images. For surveillance videos, one might want to extract the important events from the uneventful context. There are two general approache

In [44]:
clean_text = text.lower()
clean_text = re.sub(r'\W', ' ', clean_text) # 문자+숫자가 아닌 문자를 공백으로 치환
clean_text = re.sub(r'\d', ' ', clean_text) # 숫자를 공백으로 치환
clean_text = re.sub(r'\s+', ' ', clean_text) # 한 개 이상의 공백을 한 개의 공백으로 치환

In [45]:
sentences = nltk.sent_tokenize(text) # text를 문장 단위로 tokenize 하여 tuple 반환

stop_words = nltk.corpus.stopwords.words('english') # nltk에 지정된 영어 stopwords

In [56]:
sentences

['Automatic summarization is the process of shortening a text document with software, in order to create a summary with the major points of the original document.',
 'Technologies that can make a coherent summary take into account variables such as length, writing style and syntax.',
 'Automatic data summarization is part of machine learning and data mining.',
 'The main idea of summarization is to find a subset of data which contains the "information" of the entire set.',
 'Such techniques are widely used in industry today.',
 'Search engines are an example; others include summarization of documents, image collections and videos.',
 'Document summarization tries to create a representative summary or abstract of the entire document, by finding the most informative sentences, while in image summarization the system finds the most representative and important (i.e.',
 'salient) images.',
 'For surveillance videos, one might want to extract the important events from the uneventful context

In [47]:
stop_words

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

#### 4. Build the histogram

In [48]:
word2count = {} # create an empty dictionary. key : value. word : count

# clean_text에 있는 모든 word에 대해, stop_words에 속하지 않는 word를 histogram으로 만듬.
# word2count에 없다면 1로 만들어주고, 이미 포함되어 있다면 1을 더함.

for word in nltk.word_tokenize(clean_text):
    if word not in stop_words: 
        if word not in word2count.keys():
            word2count[word] = 1
        else:
            word2count[word] += 1

In [49]:
# calculated the weighted histogram

for key in word2count.keys():
    word2count[key] = word2count[key]/max(word2count.values())

In [50]:
word2count

{'people': 0.011494252873563218,
 'predict': 0.011494252873563218,
 'comparing': 0.011494252873563218,
 'allow': 0.011494252873563218,
 'rules': 0.011494252873563218,
 'discuss': 0.011494252873563218,
 'entropy': 0.022988505747126436,
 'systems': 0.16091954022988506,
 'selecting': 0.05747126436781609,
 'linguistic': 0.011494252873563218,
 'upper': 0.011494252873563218,
 'duplicate': 0.011494252873563218,
 'brief': 0.011494252873563218,
 'response': 0.011494252873563218,
 'keyword': 0.011494252873563218,
 'without': 0.05747126436781609,
 'applying': 0.04597701149425287,
 'et': 0.034482758620689655,
 'good': 0.034482758620689655,
 'files': 0.011494252873563218,
 'hub': 0.011494252873563218,
 'internal': 0.011494252873563218,
 'eigenvector': 0.022988505747126436,
 'uses': 0.04597701149425287,
 'risk': 0.011494252873563218,
 'less': 0.011494252873563218,
 'field': 0.011494252873563218,
 'focused': 0.022988505747126436,
 'mmr': 0.022988505747126436,
 'useful': 0.034482758620689655,
 'simpli

#### 5. Calculating the Sentence score

In [51]:
sent2score = {} # create empty dictionary

# text를 setn_tokenize한 sentences에 대해서 score를 메김.
# sentences에 있는 각 sentence에 대하여 word_tokenize한다.
# 각 word가 word2count.keys()에 포함되어 있고 해당 sentence의 단어 수가 30보다 적은 경우 
# 그 sentence에 score를 메긴다.
# 만약 word가 속한 sentence가 sent2score.keys()에 없다면 (즉, word가 sentence의 첫 단어라면) 
# sent2score[sentence]를 word2count[word]로 초기화, 있다면 word2count[word] 추가.

for sentence in sentences:
    for word in nltk.word_tokenize(sentence.lower()):
        if word in word2count.keys():
            if len(sentence.split(' ')) < 30:
                if sentence not in sent2score.keys():
                    sent2score[sentence] = word2count[word]
                else:
                    sent2score[sentence] += word2count[word]

In [52]:
sent2score

{'"Natural" and "processing" would also be linked because they would both appear in the same string of N words.': 3.2493506493506494,
 'A high level of overlap should indicate a high level of shared concepts between the two summaries.': 1.3180512016718913,
 'A more principled way to estimate sentence importance is using random walks and eigenvector centrality.': 1.8953474179640504,
 'A post- processing step is then applied to merge adjacent instances of these T unigrams.': 1.2884870876250187,
 'A promising line in document summarization is adaptive document/text summarization.': 3.1426630840423946,
 'A random walk on this graph will have a stationary distribution that assigns large probabilities to the terms in the centers of the clusters.': 1.4716562173458723,
 'A related application is summarizing news articles.': 1.481060606060606,
 'A summary in this context is useful to show the most representative images of results in an image collection exploration system.': 3.367918170402957,
 

#### 6. Find out the best sentences

In [57]:
best_sentences = heapq.nlargest(5, sent2score, key=sent2score.get)

In [58]:
for sentences in best_sentences:
    print(sentences)

Thus, to get ranked highly and placed in a summary, a sentence must be similar to many sentences that are in turn also similar to many other sentences.
For example, in document summarization, one would like the summary to cover all important and relevant concepts in the document.
Automatic summarization is the process of shortening a text document with software, in order to create a summary with the major points of the original document.
At a very high level, summarization algorithms try to find subsets of objects (like set of sentences, or a set of images), which cover information of the entire set.
The unsupervised approach to summarization is also quite similar in spirit to unsupervised keyphrase extraction and gets around the issue of costly training data.
