## EXTRACTIVE SUMMARY

There are basically two types of summary: extractive and abstractive


**Extractive summary** is the type of summary that involves no paraphrasing. No usage of words or sentences outside the summary is used. 

**Abstractive summary** is the type of summary we use regularly in schools or whenever we need to summarize a document. synonyms, paraphrasing and usage of words or sentences to caputure the intent of the original words and sentences are used.

**How does a computer know which sentences are important to our document ?**

**How does a computer understand our sentences to make summary out of it ?**

Computers can be used for both type of summary. **Extractive summary** depends on sentence score while **abstractive summary** uses machine learning.


***This notebook is a walkthrough with proper documentation of how to summarize documents using extractive summary***

In [2]:
import nltk
import re
import numpy as np
from nltk.corpus import stopwords # to obtain stopwords
from nltk import sent_tokenize, word_tokenize # to tokenize sentences
from sklearn.feature_extraction.text import TfidfVectorizer # to compute sentence score
from nltk.stem import PorterStemmer # for stemming
from nltk import pos_tag # for part of speech tagging
from nltk import WordNetLemmatizer # for word lemmatization

**I have a document in my directory that I'll use to perform the summary**

In [3]:
with open('test result.txt', 'r') as file:
    doc = file.read()

**The files are read in form of a string. To get the best out of our document, it needs to be in the proper form for preprocessing: hence the need to tokenize our sentences.**

*The document when it is read.....*

In [6]:
print('The document is read in a string: ')
print()
doc

The document is read in a string: 



"# personality type: The test indicates that i'm an Advocate. I really agree with this considering my approach to life's situation and people. I look forward to providing solution to issues not for selfish purposes. Needful to say, I can be very pragmatic in my thinking, introvert, inspiring and convincing\n\n#thinking style: this result shows that i'm more pragmatic in my thinking, offering logical and approachable way to solve issues and view situations beyond theories.\n\n#communicative style: this result indicates that I'm suppoertive and cautious, but more supportive. Also I am reserved\n\n#Thinking style: the test result shows that i am an OPTIMIZER. one of my approach to issues that defines me as an Advocate is concrete steps to implementing solutions, which as an optimizer, I adopt in my thinking,finding ways to be more productive, efficient and organized.\n\n#team role test: For every time I get the chance to participate in a team work, I tend to discover I am able to express 

### Step 1

***Sentence tokenization***

In [7]:
tokenized_sent = sent_tokenize(doc)
print(tokenized_sent)

["# personality type: The test indicates that i'm an Advocate.", "I really agree with this considering my approach to life's situation and people.", 'I look forward to providing solution to issues not for selfish purposes.', "Needful to say, I can be very pragmatic in my thinking, introvert, inspiring and convincing\n\n#thinking style: this result shows that i'm more pragmatic in my thinking, offering logical and approachable way to solve issues and view situations beyond theories.", "#communicative style: this result indicates that I'm suppoertive and cautious, but more supportive.", 'Also I am reserved\n\n#Thinking style: the test result shows that i am an OPTIMIZER.', 'one of my approach to issues that defines me as an Advocate is concrete steps to implementing solutions, which as an optimizer, I adopt in my thinking,finding ways to be more productive, efficient and organized.', '#team role test: For every time I get the chance to participate in a team work, I tend to discover I am 

The document has been tokenized into sentences. This means that, we divided the document into **tokens** of **sentences** where a sentence ends with a **fullstop**

#### Text preprocessing

1. Removal of special characters

In [42]:
def regex(doc):
    tokenize_sent = sent_tokenize(doc)
    regex = r'[^a-zA-Z0-9\s]'
    text_ = [re.sub(regex, '', i) for i in tokenize_sent]
    return text_

text = regex(doc)

In [44]:
text

['Komolehin Israel\nGroup 5\n personality type The test indicates that im an Advocate',
 'I really agree with this considering my approach to lifes situation and people',
 'I look forward to providing solution to issues not for selfish purposes',
 'Needful to say i can be very pragmatic in my thinking introvert inspiring and convincing\n\nthinking style this result shows that im more pragmatic in my thinking offering logical and approachable way to solve issues and view situations beyond theories',
 'communicative style this result indicates that Im suppoertive and cautious but more supportive',
 'Also I am reserved\n\nThinking style the test result shows that i am an OPTIMIZER',
 'one of my approach to issues that defines me as an Advocate is concrete steps to implementing solutions which as an optimizer I adopt in my thinkingfinding ways to be more productive efficient and organized',
 'team role test For every time I get the chance to participate in a team work I tend to discover I 

In [20]:
regex = r'[^a-zA-Z0-9\s]'
#re.sub(regex, '', text)
    

In [21]:
text_ = [re.sub(regex, '', i) for i in tokenize_sent]
text_

['Komolehin Israel\nGroup 5\n personality type The test indicates that im an Advocate',
 'I really agree with this considering my approach to lifes situation and people',
 'I look forward to providing solution to issues not for selfish purposes',
 'Needful to say i can be very pragmatic in my thinking introvert inspiring and convincing\n\nthinking style this result shows that im more pragmatic in my thinking offering logical and approachable way to solve issues and view situations beyond theories',
 'communicative style this result indicates that Im suppoertive and cautious but more supportive',
 'Also I am reserved\n\nThinking style the test result shows that i am an OPTIMIZER',
 'one of my approach to issues that defines me as an Advocate is concrete steps to implementing solutions which as an optimizer I adopt in my thinkingfinding ways to be more productive efficient and organized',
 'team role test For every time I get the chance to participate in a team work I tend to discover I 

In [405]:
def text_preprocessing(text):
    tokens = [word for sent in sent_tokenize(text) for word in word_tokenize(sent)]
    tokens = [word.lower() for word in tokens]
    new_tokens = [i for i in tokens if i not in stopwords.words('english')]
    tokens = [word for word in new_tokens if len(word) >= 3]
    stemmer = PorterStemmer()

    tokens = [stemmer.stem(word) for word in tokens]
    
    tagged_corpus = pos_tag(tokens)

    Noun_tags = ['NN', 'NNP', 'NNPS', 'NNS']
    verb_tags = ['VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ']
    lemmatizer = WordNetLemmatizer()
    
    def prac_lemmatize(token, tag):
        if tag in Noun_tags:
            return lemmatizer.lemmatize(token, 'n')
        elif tag in verb_tags:
            return lemmatizer.lemmatize(token, 'v')
        else:
            return lemmatizer.lemmatize(token, 'n')
    lemmatized_text = ' '.join([prac_lemmatize(token, tag) for token, tag in tagged_corpus])
    
    return lemmatized_text 

In [406]:
final_text = np.array([text_preprocessing(i) for i in text_])

In [407]:
final_text

array(['komolehin israel group person type test indic advoc',
       'realli agre consid approach life situat peopl',
       'look forward provid solut issu selfish purpos',
       'need say pragmat think introvert inspir convinc think style result show pragmat think offer logic approach way solv issu view situat beyond theori',
       'commun style result indic suppoert cautiou support',
       'also reserv think style test result show optim',
       'one approach issu defin advoc concret step implement solut optim adopt thinkingfind way product effici organ',
       'team role test everi time get chanc particip team work tend discov abl express good interperson skill basic soft skill yet learn',
       'also show commuun effect resolv conflict team mate team role pie chart chart show execut innov characteris eager get work organ do',
       'also creativ gener team strong imagin desir origin defin innov also advoc',
       'zodiac indic jester', 'particularli joker jester',
       'w

In [408]:
vectorizer = TfidfVectorizer(strip_accents = 'unicode', norm = 'l2')

In [409]:
matrix = vectorizer.fit_transform(final_text).todense()

In [410]:
matrix

matrix([[0.        , 0.        , 0.29424196, ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        ...,
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.65533981],
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ],
        [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
         0.        ]])

#### row sum

In [411]:
sent_score = matrix.sum(axis=1)
sent_score.shape

(13, 1)

In [412]:
sent_score_total = sent_score.sum()

In [413]:
average = sent_score_total/sent_score.shape[0]

In [414]:
summary = []

In [415]:
for i in range(sent_score.shape[0]):
    if sent_score[i] >= average:
        summary.append(tokenize_sent[i])

In [416]:
summary

["Needful to say, i can be very pragmatic in my thinking, introvert, inspiring and convincing\n\n#thinking style: this result shows that i'm more pragmatic in my thinking, offering logical and approachable way to solve issues and view situations beyond theories.",
 'one of my approach to issues that defines me as an Advocate is concrete steps to implementing solutions, which as an optimizer, I adopt in my thinking,finding ways to be more productive, efficient and organized.',
 '#team role test: For every time I get the chance to participate in a team work, I tend to discover I am able to express a good interpersonal skills and some basic soft skills, yet learning.',
 "It also show that I can commuunicate effectively and resolve conflict with team mates\n\n#team role pie chart: this chart shows that I'm more of an executive and innovator characterised by my eagerness to get the work organized and done.",
 'Also as the creative generator of a team, strong imagination and desire to be ori

In [422]:
sum3 = ''.join(summary)

In [428]:
file = open('summary.pdf', 'wb') 
PdfFileWriter.write()

TypeError: write() missing 2 required positional arguments: 'self' and 'stream'

In [420]:
help(PdfFileWriter.write)

Help on function write in module PyPDF2.pdf:

write(self, stream)
    Writes the collection of pages added to this object out as a PDF file.
    
    :param stream: An object to write the file to.  The object must support
        the write method and the tell method, similar to a file object.



In [331]:
''.join(tokenize_sent)

1606