# Introduction

In [1]:
# Warnings
import warnings
warnings.filterwarnings('ignore')

# BEGIN: fix Python or Notebook SSL CERTIFICATE_VERIFY_FAILED
import os, ssl
if (not os.environ.get('PYTHONHTTPSVERIFY', '') and getattr(ssl, '_create_unverified_context', None)):
    ssl._create_default_https_context = ssl._create_unverified_context
# END: fix Python or Notebook SSL CERTIFICATE_VERIFY_FAILED

## Installing prerequisite libraries
* https://pypi.org/project/bert-extractive-summarizer/

In [2]:
# sacremoses is required by XLMTokenizer; torch backs return_tensors='pt'
!pip install sumy transformers sentencepiece sacremoses torch



### Import libraries

In [3]:
from sumy.nlp.tokenizers import Tokenizer
from sumy.parsers.plaintext import PlaintextParser
from sumy.summarizers.lex_rank import LexRankSummarizer

In [4]:
content = "Text_Summarize_Text/content.txt"

output_sentences_count = 10

with open(content, "r", encoding="utf-8") as f: # open(r'C:\Users\...site_1.html', "r") as f:
    article = f.read()  
    
# article

In [5]:
my_parser = PlaintextParser.from_string(article, Tokenizer('english'))

# Creating a summary of `output_sentences_count` sentences.
lex_rank_summarizer = LexRankSummarizer()
lexrank_summary = lex_rank_summarizer(my_parser.document, sentences_count=output_sentences_count)

# Printing the summary
for sentence in lexrank_summary:
    print(sentence)

This leads to situations in which “people consent to the collection, use, and disclosure of their personal data when it is not in their self-interest to do so” (Solove 2013, p. 1895).
Not only this: the long-term storage of big data means that situations may arise in which there is a desire to use it for purposes that may not be remotely connected to those for which it was gathered (and for which consent was given) in the first place.
What alternative measures might be taken to ensure that genuinely informed consent is obtained from those who give up their data for all the purposes to which it might be subsequently put?
If big data might be used in such a way as to turn a profit, then what regulation should exist around its use, in terms both of its commercial exploitation, and of trade in the data itself?
It has yet to be properly held to account by either ethicists or legislators, and it would appear that the longer this situation continues, the more this business is likely to mushro
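
The idea behind the LexRank summary above can be sketched in plain Python: sentences become nodes in a similarity graph, and a PageRank-style power iteration ranks them by centrality. This is a simplified illustration, not sumy's implementation. It assumes plain term-frequency cosine similarity (the published algorithm weights terms by IDF), and the names `tokenize`, `cosine`, `lexrank_scores`, and the `threshold`/`damping` defaults are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase a sentence and split it into word tokens."""
    return re.findall(r"[a-z']+", sentence.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def lexrank_scores(sentences, threshold=0.1, damping=0.85, iterations=50):
    """Score sentences by centrality in a thresholded similarity graph."""
    bags = [Counter(tokenize(s)) for s in sentences]
    n = len(bags)
    # Link two sentences when their similarity exceeds the threshold
    # (each non-empty sentence also links to itself).
    adj = [
        [1.0 if cosine(bags[i], bags[j]) > threshold else 0.0 for j in range(n)]
        for i in range(n)
    ]
    # Row-normalise so every row is a probability distribution.
    for row in adj:
        total = sum(row)
        for j in range(n):
            row[j] = row[j] / total if total else 1.0 / n
    # Power iteration with damping, as in PageRank.
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(scores[i] * adj[i][j] for i in range(n))
            for j in range(n)
        ]
    return scores


example = [
    "Big data raises privacy concerns.",
    "Privacy concerns are growing.",
    "Big data is everywhere.",
    "Cats sleep all day.",
]
ranked = sorted(zip(lexrank_scores(example), example), reverse=True)
print(ranked[0][1])
```

The sentence that shares vocabulary with the most other sentences ends up with the highest score, which is why LexRank tends to pick sentences that restate the document's recurring themes.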

## LSA (Latent semantic analysis)

In [6]:
from sumy.summarizers.lsa import LsaSummarizer

# creating the summarizer
lsa_summarizer = LsaSummarizer()
lsa_summary = lsa_summarizer(my_parser.document, sentences_count = output_sentences_count)

# Printing the summary
for sentence in lsa_summary:
    print(sentence)

It further suggests that insights from the religious domain might be of considerable value in developing these new approaches to big data.
Extremely large datasets are not simply quantitatively different to smaller ones: their sheer scale brings about a “step change”, making them qualitatively distinct, too.
In practice, when data has been anonymized (or “de-identified”) it may well be possible to break that anonymity by triangulating between multiple datasets.
To illustrate this, consider the data which is routinely stored whenever individuals give samples as part of a medical procedure.
Guidelines and recommendations are a good start, but more is surely required to ensure that all data science practitioners act in ethically responsible ways.
Now, although a science based on the manipulation of data might appear “objective” to an outsider, we have seen that this is, in fact, very far from being the case, and that data scientists are required to use considerable personal skill and judg

## Luhn

Luhn's summarization algorithm builds on the same idea as TF-IDF (Term Frequency-Inverse Document Frequency): it scores each sentence by the significant, frequently occurring words it contains.

In [7]:
from sumy.summarizers.luhn import LuhnSummarizer

# Creating the summarizer
luhn_summarizer = LuhnSummarizer()
luhn_summary = luhn_summarizer(my_parser.document, sentences_count=output_sentences_count)

# Printing the summary
for sentence in luhn_summary:
    print(sentence)

In practice, however, it has been observed that this is a process which is geared more towards limiting the liabilities of those harvesting the data rather than genuinely informing the data subjects: “the parties gathering the data typically attempt to minimize the ability of the person about whom the data is being gathered to comprehend the scope of data, and its usage, through a mixture of sharp design and obscure legal jargon” (Wilbanks 2014, p. 235).
The practical difficulties of managing privacy and generating informed consent have been summed up as: (1) people do not read privacy policies; (2) if people read them, they do not understand them; (3) if people read and understand them, they often lack enough background knowledge to make an informed choice; and (4) if people read them, understand them, and can make an informed choice, their choice may be skewed by various decision-making difficulties.
However, given that not every disaster may be foreseen, and that those with nefariou
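
The frequency-based scoring behind Luhn's method can be sketched in plain Python. This is a minimal illustration under stated assumptions, not sumy's implementation: plain term frequency stands in for a full TF-IDF weighting, each sentence is treated as a single cluster running from its first to its last significant word, and the `STOPWORDS` set, `top_n` parameter, and function names are hypothetical.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for the example.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it",
             "that", "on", "for", "are", "all"}


def words(text):
    """Lowercase text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def luhn_scores(sentences, top_n=3):
    """Score each sentence as (significant words)^2 / cluster length."""
    counts = Counter(
        w for s in sentences for w in words(s) if w not in STOPWORDS
    )
    # The top_n most frequent non-stop words count as "significant".
    significant = {w for w, _ in counts.most_common(top_n)}
    scores = []
    for sentence in sentences:
        flags = [w in significant for w in words(sentence)]
        if True not in flags:
            scores.append(0.0)
            continue
        # Cluster = span from the first to the last significant word.
        first = flags.index(True)
        last = len(flags) - 1 - flags[::-1].index(True)
        cluster = flags[first:last + 1]
        scores.append(sum(cluster) ** 2 / len(cluster))
    return scores


example = [
    "Big data needs consent for big data use.",
    "Consent is rarely informed.",
    "Cats sleep all day.",
]
for score, sentence in zip(luhn_scores(example), example):
    print(round(score, 2), sentence)
```

Squaring the count of significant words rewards sentences where those words are dense, which explains why the Luhn output above favours long sentences packed with the article's recurring terms.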

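## KL-Sum

Another extractive method is the KL-Sum algorithm, which greedily adds sentences so that the summary's word distribution stays as close as possible, in terms of KL divergence, to that of the full document.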

In [8]:
from sumy.summarizers.kl import KLSummarizer
kl_summarizer = KLSummarizer()
kl_summary = kl_summarizer(my_parser.document, sentences_count = output_sentences_count)

# Printing the summary
for sentence in kl_summary:
    print(sentence)

All of these concerns are ongoing.
The distinctiveness of big data goes beyond straightforward issues of size.
3.
What concerns are specific to the big data context?
We may immediately note that anyone who engages with services which make use of computers surrenders data to those who run those computers—whether they are consciously aware of it or not.
Ownership Are data property?
Should it be the person to whom it relates, or the organisation which has gathered it?
Should the data be freely available, to assist the researchers?
Attention needs to be paid not only to the analysis of data, but to the presentation of the fruits of that analysis.
The End of Science?
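
The greedy selection behind KL-Sum can be sketched in plain Python: at each step, add the sentence that keeps the summary's word distribution closest to the document's. This is a simplified illustration, not sumy's implementation. The `epsilon` smoothing for words missing from the summary and the names `distribution`, `kl_divergence`, and `kl_sum` are assumptions made for the example.

```python
import math
import re
from collections import Counter


def words(text):
    """Lowercase text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def distribution(tokens):
    """Relative frequency of each token."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def kl_divergence(p, q, epsilon=1e-6):
    """KL(p || q); words absent from q are smoothed with epsilon."""
    return sum(pw * math.log(pw / q.get(w, epsilon)) for w, pw in p.items())


def kl_sum(sentences, sentences_count):
    """Greedily pick sentences that minimise KL(document || summary)."""
    doc_dist = distribution([w for s in sentences for w in words(s)])
    chosen, chosen_tokens = [], []
    candidates = list(range(len(sentences)))
    while candidates and len(chosen) < sentences_count:
        best = min(
            candidates,
            key=lambda i: kl_divergence(
                doc_dist, distribution(chosen_tokens + words(sentences[i]))
            ),
        )
        chosen.append(best)
        chosen_tokens += words(sentences[best])
        candidates.remove(best)
    # Return the picks in document order.
    return [sentences[i] for i in sorted(chosen)]


example = ["big data privacy consent", "big data", "cats sleep"]
print(kl_sum(example, 2))
```

Because the objective rewards covering the document's vocabulary, the method readily selects short, topically diverse sentences, which matches the fragmentary headings and questions in the KL-Sum output above.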


## Summarization with T5 Transformers

In [9]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

my_model = T5ForConditionalGeneration.from_pretrained('t5-small')
tokenizer = T5Tokenizer.from_pretrained('t5-small')

# T5 is a text-to-text model: the "summarize: " prefix selects the summarization task.
# truncation=True makes the cut-off explicit; t5-small was trained on inputs of up to 512 tokens.
input_ids = tokenizer.encode("summarize: " + article, return_tensors='pt', max_length=512, truncation=True)
# Without an explicit max_length, generate() stops after about 20 tokens.
summary_ids = my_model.generate(input_ids, max_length=150, num_beams=4, early_stopping=True)

# skip_special_tokens drops <pad> and sentinel tokens from the decoded text
t5_summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(t5_summary)


## GPT-2 Transformers

In [10]:
# Importing model and tokenizer
from transformers import GPT2Tokenizer, GPT2LMHeadModel

# Instantiating the model and tokenizer with gpt-2
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Encoding text to get input ids & pass them to model.generate()
# truncation=True makes the cut-off at max_length explicit
inputs = tokenizer(article, return_tensors='pt', max_length=750, truncation=True)
# max_new_tokens bounds the continuation (the default total length of 20 is shorter
# than the prompt); GPT-2 has no pad token, so the EOS token is reused for padding
summary_ids = model.generate(
    inputs['input_ids'],
    attention_mask=inputs['attention_mask'],
    max_new_tokens=100,
    pad_token_id=tokenizer.eos_token_id,
)

# The decoded text starts with the prompt, followed by the generated continuation
GPT_summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(GPT_summary)


As any technology develops it might be expected that its increasing capabilities give rise to a succession of ethical issues, and computer technology is certainly no exception to this (Tavani 2013a, pp. 6 ff.). Concerns which have been raised in the past include the deskilling of the workforce, increased unemployment, and the health, stress, and isolation of workers, together with issues relating to the storage of personal data in the form of databases (Barbour 1992, pp. 146 ff.). A further concern is the implication of computers in broadening divisions between rich and poor, through the opening up of imbalances between those who have access to computer facilities and the benefits which they bring, and those who do not (Tavani 2013a, p. 305; Barbour 1992, p. 156). All of these concerns are ongoing.

In recent years the increasingly widespread use of computers in all walks of life, from the PCs and smartphones that many consumers use on a daily basis to the supercomputers used in resear
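
The transformer cells cap the input with `max_length`, so everything past that limit is ignored. One possible workaround, sketched here under assumptions, is to split the article into overlapping chunks and summarize each chunk separately. The helper `chunk_words` and its `max_words`/`overlap` defaults are hypothetical, and word counts only approximate token counts, so the limit should be set conservatively.

```python
def chunk_words(text, max_words=400, overlap=50):
    """Split text into overlapping chunks of at most max_words words."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")
    tokens = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_words]))
        # Stop once the current chunk reaches the end of the text.
        if start + max_words >= len(tokens):
            break
    return chunks


sample = " ".join(str(i) for i in range(10))
for chunk in chunk_words(sample, max_words=4, overlap=1):
    print(chunk)
```

Each chunk could then be passed to the tokenizer in place of `article`, with the per-chunk summaries joined afterwards. The overlap keeps a sentence that straddles a boundary from being lost entirely.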

## XLM Transformers

In [11]:
# Importing model and tokenizer
from transformers import XLMWithLMHeadModel, XLMTokenizer

# Instantiating the model and tokenizer 
tokenizer = XLMTokenizer.from_pretrained('xlm-mlm-en-2048')
model = XLMWithLMHeadModel.from_pretrained('xlm-mlm-en-2048')

# Encoding text to get input ids & pass them to model.generate()
# XLMTokenizer has no `batch_encode_articles` method; call the tokenizer directly.
# Truncating below XLM's 512-position limit leaves room for the generated tokens.
inputs = tokenizer(article, return_tensors='pt', max_length=480, truncation=True)
summary_ids = model.generate(inputs['input_ids'], max_new_tokens=20)

# Decode and print the summary
XLM_summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(XLM_summary)

Some weights of XLMWithLMHeadModel were not initialized from the model checkpoint at xlm-mlm-en-2048 and are newly initialized: ['transformer.position_ids']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.