# Summarizer

There are two styles of summarization: extractive and abstractive. Extractive summarization picks important sentences as-is and puts them in order to form the final summary. Abstractive summarization, on the other hand, paraphrases the important information to create a crisp summary in its own words. **While abstractive summaries are more useful**, they are harder to create and evaluate. Hence, as a first step, we focus on extractive summarization, which picks the most important sentences and arranges them into a structured summary. Once this task is done correctly, we can focus on abstractive summarization.[1]

# Table of Contents - Extractive Summarizer
1. [Frequency Method](#freq)
2. [OpenNyai's Extractive Summarizer for Indian Court Judgements](#opennyai)


## Extractive Summarizer
### Frequency Method <a class="anchor" id="freq"></a>
1. Find the frequency of every word in the text.
2. Split the text into sentences and assign each sentence a value based on the words it contains.
3. Keep the sentences that contain more high-frequency words in the summary.
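
The three steps above can be sketched as one small function. This is a minimal sketch rather than the notebook's exact code: it uses a regular-expression tokenizer and a tiny hand-picked stopword set (both assumptions, chosen so the snippet runs without NLTK data), and the names `frequency_summary` and `threshold` are illustrative.

```python
import re
from collections import Counter

# Tiny illustrative stopword set (an assumption; the notebook uses NLTK's English list).
STOPWORDS = {"a", "an", "and", "for", "in", "is", "it", "my", "of", "the", "to"}


def frequency_summary(text, threshold=1.2):
    """Keep sentences whose word-frequency score exceeds threshold * average score."""
    # Naive sentence split on '.', '!' or '?' followed by whitespace (an assumption).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    def tokens(s):
        return [w for w in re.findall(r"[a-z0-9']+", s.lower()) if w not in STOPWORDS]

    # Step 1: frequency of every non-stopword in the whole text.
    freq = Counter(tokens(text))
    # Step 2: score each sentence by summing the frequencies of its words.
    scores = [sum(freq[w] for w in tokens(s)) for s in sentences]
    if not scores:
        return ""
    average = sum(scores) / len(scores)
    # Step 3: keep the sentences that score well above average, in original order.
    return " ".join(s for s, sc in zip(sentences, scores) if sc > threshold * average)
```

Calling `frequency_summary("Cats sleep. Cats chase mice and cats climb trees. Dogs bark.")` keeps only the middle sentence, because the repeated word "cats" pushes its score well above the average.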

In [5]:
import pandas as pd
import numpy as np
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords

In [11]:
text = "my name is shubham kumar shukla. It is my pleasure to got opportunity to write article for xyz related to nlp"

In [9]:
stopwords1 = set(stopwords.words("english"))

In [14]:
words = word_tokenize(text)

In [16]:
freqTable = {}
for word in words:
    word = word.lower()
    if word in stopwords1:
        continue
    
    if word in freqTable:
        freqTable[word] += 1
    else:
        freqTable[word] = 1

In [17]:
freqTable
# Step-1 Completed

{'name': 1,
 'shubham': 1,
 'kumar': 1,
 'shukla': 1,
 '.': 1,
 'pleasure': 1,
 'got': 1,
 'opportunity': 1,
 'write': 1,
 'article': 1,
 'xyz': 1,
 'related': 1,
 'nlp': 1}
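
Note that the table above counts the full stop `'.'` as a word, because `word_tokenize` emits punctuation as separate tokens. A hedged sketch of one way to skip such tokens follows; the helper name `build_freq_table` is illustrative and not part of the notebook.

```python
from collections import Counter


def build_freq_table(words, stopwords):
    """Count lower-cased tokens, skipping stopwords and pure-punctuation tokens."""
    return Counter(
        w.lower()
        for w in words
        # Keep a token only if it has at least one letter or digit.
        if w.lower() not in stopwords and any(ch.isalnum() for ch in w)
    )
```

With this filter the `'.'` entry disappears, so punctuation no longer adds to a sentence's score.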

In [19]:
sentences = sent_tokenize(text)
sentences


['my name is shubham kumar shukla.',
 'It is my pleasure to got opportunity to write article for xyz related to nlp']

In [20]:
sentenceValue = {}
for sentence in sentences:
    for word, freq in freqTable.items():
        if word in sentence.lower():
            if sentence in sentenceValue:
                sentenceValue[sentence] += freq
            else:
                sentenceValue[sentence] = freq
sentenceValue
# Step 2 Completed

{'my name is shubham kumar shukla.': 5,
 'It is my pleasure to got opportunity to write article for xyz related to nlp': 8}
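
One caveat with the loop above: `word in sentence.lower()` is a substring test, so a table entry such as `art` would also match inside `article`. A minimal sketch of whole-word matching is shown below; the function name `score_sentences` is an assumption for illustration, and, unlike the loop above, it also records sentences whose score is 0.

```python
import re


def score_sentences(sentences, freq_table):
    """Score each sentence by the table frequencies of its distinct whole words."""
    scores = {}
    for sentence in sentences:
        # Split into whole words so 'art' cannot match inside 'article'.
        tokens = set(re.findall(r"\w+", sentence.lower()))
        scores[sentence] = sum(freq_table.get(tok, 0) for tok in tokens)
    return scores
```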

In [24]:
sumValues = 0
for sentence in sentenceValue:
    sumValues += sentenceValue[sentence]
average = int(sumValues / len(sentenceValue))
average

6

In [26]:
summary = ''
for sentence in sentences:
    # A sentence is selected for the summary if its value is more than 1.2 * average
    if (sentence in sentenceValue) and (sentenceValue[sentence] > (1.2 * average)):
        # Separate consecutive sentences with a single space
        summary += (" " if summary else "") + sentence

summary

'It is my pleasure to got opportunity to write article for xyz related to nlp'

Clearly, this is not the best method: short but important statements can be left out of the summary. There are other, AI-based ways to select sentences as well.
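
To illustrate why short statements lose out: the raw score is a sum, so longer sentences accumulate more points. One common adjustment, which is not part of the notebook above, is to divide each score by the number of counted words. The sketch below assumes sentences are already tokenized; `normalized_scores` is a hypothetical helper.

```python
def normalized_scores(sentence_tokens, freq_table):
    """Map each sentence to its average word frequency (score / counted words)."""
    result = {}
    for sentence, tokens in sentence_tokens.items():
        # Only words present in the frequency table contribute to the score.
        counted = [t for t in tokens if t in freq_table]
        total = sum(freq_table[t] for t in counted)
        result[sentence] = total / len(counted) if counted else 0.0
    return result
```

With this normalization, a two-word sentence containing a frequent term can outrank a longer sentence that has the same raw total.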

### OpenNyai's Extractive Summarizer for Indian Court Judgements <a class="anchor" id="opennyai"></a>

The summary has 5 sections: **Facts summary, Arguments summary, Issue summary, Analysis summary and Decision summary.** A summary is created for each section. The [rhetorical roles](https://github.com/d-saikrishna/NLP/blob/master/Summarization/Rhetorical%20Roles.ipynb) assigned to each sentence in the judgement help with this, but not all rhetorical roles are treated alike. "Issues" and "Decision" as written in the original judgement are very crisp and rich in information, so these sentences are added directly to their section summaries without any evaluation. The "Preamble" is important for setting the context of the case and is also copied directly into the main summary.[1]
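
The split between roles that are copied verbatim and roles that still need ranking can be sketched as follows. This is only an illustration of the idea: the label strings and the function `split_by_treatment` are assumptions, not the library's internal names.

```python
# Roles whose sentences are copied verbatim (label strings are illustrative).
PASS_THROUGH_ROLES = {"PREAMBLE", "ISSUE", "DECISION"}


def split_by_treatment(labelled_sentences):
    """Separate sentences copied as-is from those that still need ranking."""
    copied, to_rank = {}, {}
    for sentence, role in labelled_sentences:
        bucket = copied if role in PASS_THROUGH_ROLES else to_rank
        # Group sentences under their role, preserving document order.
        bucket.setdefault(role, []).append(sentence)
    return copied, to_rank
```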

For the remaining rhetorical roles, sentences are ranked in descending order of importance as predicted by the AI model, and the top-ranked sentences are added to their section summaries. The AI model was trained on head-notes written for 10440 Supreme Court judgements. More about this AI model is [here](https://github.com/Legal-NLP-EkStep/judgment_extractive_summarizer)[3]. In brief, the importance of a sentence is determined not only by the words it contains (as in the frequency method or word-based AI models) but also by the rhetorical role of the sentence.
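
The ranking step can be sketched as a simple top-fraction selection. The scores here stand in for the model's predicted importance; the function `top_sentences`, its `fraction` parameter, and the choice to restore document order afterwards are all assumptions made for illustration.

```python
def top_sentences(scored_sentences, fraction=0.3):
    """Keep the highest-scoring fraction of sentences, restored to document order."""
    if not scored_sentences:
        return []
    keep = max(1, round(len(scored_sentences) * fraction))
    # Rank by score (descending), remembering each sentence's original position.
    ranked = sorted(enumerate(scored_sentences), key=lambda item: item[1][1], reverse=True)
    # Put the selected sentences back in their original order.
    chosen = sorted(ranked[:keep], key=lambda item: item[0])
    return [sentence for _, (sentence, _score) in chosen]
```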

**Thus, the Summarizer model needs the output of the Rhetorical Role model as input. Hence, the Rhetorical Role prediction model must run before the Summarizer model.**

In [28]:
from opennyai import Pipeline
from opennyai.utils import Data
import urllib

In [27]:
#Sample court judgements
text1 = open('SampleTexts/sample_judgment1.txt').read()
text2 = open('SampleTexts/sample_judgment2.txt').read()

# you can also load your text files directly into this
texts_to_process = [text1, text2]

# create a Data object for preprocessing before running the ML models
data = Data(texts_to_process, preprocessing_nlp_model='en_core_web_trf')

ℹ Pre-processing will happen on CPU!


In [29]:
# If you have access to GPU then set this to True else False
use_gpu = False

In [30]:
pipeline = Pipeline(components=['Rhetorical_Role', 'Summarizer'], use_gpu=use_gpu, verbose=True,
                    summarizer_summary_length=0.0)

results = pipeline(data)

json_result_doc_1 = results[0]
summaries_doc_1 = results[0]['summary']

ℹ Loading Rhetorical Role...
ℹ Rhetorical Roles will use CPU!


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


ℹ Loading Extractive summarizer...
ℹ Extractive Summarizer will use CPU!


Downloading: "https://storage.googleapis.com/indianlegalbert/OPEN_SOURCED_FILES/Extractive_summarization/model/model_headnotes/model.pt" to /home/krishna/.opennyai/extractivesummarizer/model.pt


  0%|          | 0.00/476M [00:00<?, ?B/s]

100%|██████████████████████████████████████| 433/433 [00:00<00:00, 241957.58B/s]
100%|██████████████████████████| 440473133/440473133 [31:33<00:00, 232668.53B/s]


ℹ Preprocessing rhetorical role model input!!!


100%|█████████████████████████████████████████████| 2/2 [00:16<00:00,  8.34s/it]
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


ℹ Processing documents with rhetorical role model!!!


100%|█████████████████████████████████████████████| 2/2 [00:28<00:00, 14.34s/it]


ℹ Processing documents with extractive summarizer model!!!


100%|█████████████████████████████████████████████| 2/2 [00:17<00:00,  8.86s/it]


In [33]:
summaries_doc_1.keys()

dict_keys(['PREAMBLE', 'facts', 'arguments', 'ANALYSIS', 'decision'])

In [34]:
summaries_doc_1['facts']

'The work was to be completed within one year that is before 15 th December 1972.\nThe respondent thereafter filed a suit being O.S. No. 206 of 1989 before the Court of Civil Judge (Senior Division), Bhubaneswar (hereinafter referred to as the trial court) under Section 20 of the Arbitration Act, 1940 (for short, the 1940 Act) seeking reference of the dispute to arbitration.\nBy order of the trial court dated 14th February 1990, the suit was decreed in favour of the respondent and he was directed to file the original F2 agreement in the court for referring the dispute to arbitration.\nHowever, the respondent did not file the original F2 agreement as directed.\nIn the meantime, the 1940 Act was repealed and the Arbitration and Conciliation Act, 1996 (for short, the 1996 Act) came into force.\n4. The respondent thereafter filed an application in the disposed of suit before the trial court, praying for appointment of an arbitrator under the provisions of the 1996 Act.\nThe respondent ther

In [37]:
json_result_doc_2 = results[1]
summaries_doc_2 = results[1]['summary']
summaries_doc_2['issue']

'Point(s) for consideration:-\n\n5) The only point for consideration before this Court is whether in the present facts and circumstances of the case, the appellant has made out a case for grant of bail or not?\n'
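
To read a whole summary at once, the per-section dictionary can be joined into one text. This is a small sketch that assumes every value in the summary dictionary is a plain string, as in the two outputs shown above; `format_summary` is an illustrative helper, not part of `opennyai`.

```python
def format_summary(summaries):
    """Render a {section: text} dict as one string with a heading per section."""
    parts = []
    for section, text in summaries.items():
        if text.strip():  # skip empty sections
            parts.append(f"== {section.upper()} ==\n{text.strip()}")
    return "\n\n".join(parts)
```

For example, `print(format_summary(summaries_doc_2))` would print each non-empty section under its own heading.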

# References:
1. [OpenNyAI Summarizer documentation](https://opennyai.readthedocs.io/en/latest/summariser/summariser.html)
2. [Topcoder article on Frequency method](https://www.topcoder.com/thrive/articles/text-summarization-in-nlp)
3. [Ek-Step Extractive Summarizer](https://github.com/Legal-NLP-EkStep/judgment_extractive_summarizer)