# **Extraction Method**

# Install necessary dependencies

In [1]:
import nltk
import numpy as np
import pandas as pd
nltk.download('punkt')
nltk.download('stopwords') 

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

# Get Text Document

We summarize a description of Skyrim, a very popular role-playing game (RPG)
published by Bethesda Softworks.

In [2]:
DOCUMENT = """
The Elder Scrolls V: Skyrim is an action role-playing video game developed by Bethesda Game Studios and published by Bethesda Softworks.

It is the fifth main installment in The Elder Scrolls series, following The Elder Scrolls IV: Oblivion.
"""

In [3]:
import re

DOCUMENT = re.sub(r'\n|\r', ' ', DOCUMENT)
#DOCUMENT = re.sub(r' +', ' ', DOCUMENT)
DOCUMENT = DOCUMENT.strip()

In [4]:
print(DOCUMENT)

The Elder Scrolls V: Skyrim is an action role-playing video game developed by Bethesda Game Studios and published by Bethesda Softworks.  It is the fifth main installment in The Elder Scrolls series, following The Elder Scrolls IV: Oblivion.


# Summarization with Gensim

Gensim’s summarization module offers a pretty straightforward implementation of
document summarization. We start by splitting the document into sentences with NLTK.

In [5]:
sentences = nltk.sent_tokenize(DOCUMENT)
len(sentences)

2

In [6]:
sentences = nltk.sent_tokenize(DOCUMENT)
sentences

['The Elder Scrolls V: Skyrim is an action role-playing video game developed by Bethesda Game Studios and published by Bethesda Softworks.',
 'It is the fifth main installment in The Elder Scrolls series, following The Elder Scrolls IV: Oblivion.']

# Basic Text pre-processing

In [7]:
stop_words = nltk.corpus.stopwords.words('english')

def normalize_document(doc):
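    # remove special characters, then lower-case and trim surrounding whitespace
    # (flags must be passed by keyword; the fourth positional argument is count)
    doc = re.sub(r'[^a-zA-Z\s]', '', doc, flags=re.I|re.A)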
    doc = doc.lower()
    doc = doc.strip()
    # tokenize document
    tokens = nltk.word_tokenize(doc)
    # filter stopwords out of document
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # re-create document from filtered tokens
    doc = ' '.join(filtered_tokens)
    return doc

# apply normalize_document element-wise to a list or array of documents
normalize_corpus = np.vectorize(normalize_document)
norm_sentences = normalize_corpus(sentences)
norm_sentences[:3]

array(['elder scrolls v skyrim action roleplaying video game developed bethesda game studios published bethesda softworks',
       'fifth main installment elder scrolls series following elder scrolls iv oblivion'],
      dtype='<U113')
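
One detail of `re.sub` matters for normalization functions like the one above: its fourth positional parameter is `count`, not `flags`, so flags need to be passed by keyword. The short sketch below illustrates the difference on a made-up input string (the `text` value is hypothetical and only there for demonstration).

```python
import re

# Hypothetical input: 300 letters interleaved with 300 digits.
text = "a1" * 300

# Flags passed positionally land in the `count` slot.
# int(re.I | re.A) == 258, so only the first 258 matches are removed.
# (Newer Python versions may also emit a DeprecationWarning here.)
by_position = re.sub(r'[^a-zA-Z\s]', '', text, re.I | re.A)

# Flags passed by keyword: every non-letter character is removed.
by_keyword = re.sub(r'[^a-zA-Z\s]', '', text, flags=re.I | re.A)

print(sum(ch.isdigit() for ch in by_position))  # digits left behind
print(sum(ch.isdigit() for ch in by_keyword))   # no digits left
```

On short sentences like ours the two calls give the same result, which is why the problem is easy to miss; it only shows up once a document contains more special characters than the accidental `count` value.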

# Text Representation with Feature Engineering

We will be vectorizing our normalized sentences using the TF-IDF feature engineering
scheme. We keep things simple and don’t filter out any words based on document
frequency. But feel free to try that out and maybe even leverage n-grams as features.

In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer

tv = TfidfVectorizer(min_df=0., max_df=1., use_idf=True)
dt_matrix = tv.fit_transform(norm_sentences)
dt_matrix = dt_matrix.toarray()

# get_feature_names() was removed in scikit-learn 1.2
vocab = tv.get_feature_names_out()
td_matrix = dt_matrix.T
print(td_matrix.shape)
pd.DataFrame(np.round(td_matrix, 2), index=vocab).head(10)

(19, 2)




                0     1
action       0.24  0.00
bethesda     0.48  0.00
developed    0.24  0.00
elder        0.17  0.43
fifth        0.00  0.30
following    0.00  0.30
game         0.48  0.00
installment  0.00  0.30
iv           0.00  0.30
main         0.00  0.30
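
To see where the numbers in the table above come from, here is a minimal pure-Python sketch that recomputes a few of them by hand. It assumes scikit-learn's default settings for `TfidfVectorizer`: a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, L2 normalization of each sentence vector, and a tokenizer that drops single-character tokens such as `v`. The helper `tfidf_vector` and the other variable names are introduced here for illustration only.

```python
import math
from collections import Counter

# The two normalized sentences from the preprocessing step.
docs = [
    "elder scrolls v skyrim action roleplaying video game developed "
    "bethesda game studios published bethesda softworks",
    "fifth main installment elder scrolls series following elder scrolls "
    "iv oblivion",
]

# Mimic the default token pattern, which keeps tokens of 2+ characters.
tokenized = [[t for t in d.split() if len(t) >= 2] for d in docs]
vocab_manual = sorted({t for toks in tokenized for t in toks})
n_docs = len(tokenized)

# Document frequency and smoothed IDF: ln((1 + n) / (1 + df)) + 1
doc_freq = Counter(t for toks in tokenized for t in set(toks))
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab_manual}

def tfidf_vector(tokens):
    """Raw term count times IDF, scaled to unit L2 length."""
    counts = Counter(tokens)
    raw = {t: counts[t] * idf[t] for t in vocab_manual}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {t: v / norm for t, v in raw.items()}

vectors = [tfidf_vector(toks) for toks in tokenized]

print(len(vocab_manual))  # 19 terms, matching the matrix shape
for term in ("action", "bethesda", "elder", "fifth"):
    print(term, [round(v[term], 2) for v in vectors])
```

The term `elder` appears in both sentences, so its IDF is the minimum value of 1, yet it still scores 0.43 in the second sentence because it occurs twice there and that sentence has fewer terms overall.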

