<a href="https://colab.research.google.com/github/harshbansal7/abstractive-summarization-using-T5/blob/master/abstractive_summarization_using_t5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [3]:
import pandas as pd
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize
from sklearn.model_selection import train_test_split

# Load the data into a DataFrame
df = pd.read_csv("/content/abstractive-summarization-using-T5/datasets/train.csv")
df_val = pd.read_csv("/content/abstractive-summarization-using-T5/datasets/val.csv")


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [4]:
def prepare_frame(frame):
    """Rename the columns to source_text/target_text and prepend the T5 task prefix."""
    frame = frame.rename(columns={"Abstract":"source_text", "RHS":"target_text"})
    # Selecting the two columns also discards 'FileName'; copy() avoids modifying a view
    frame = frame[['source_text', 'target_text']].copy()
    frame['source_text'] = "summarize: " + frame['source_text']
    return frame

df = prepare_frame(df)
df_val = prepare_frame(df_val)

# Only the last expression in a cell is displayed
df_val

,source_text,target_text
0,summarize: Human face can be seen as a soft t...,We model the deformation of the human face due...
1,summarize: In this paper we use a numerical p...,Bifurcation and postbifurcation of inflated hy...
2,summarize: Modularisation product platforms p...,Existing methods in modular product family dev...
3,summarize: In order to investigate the micros...,A DRX model of FGH96 of IFW process is establi...
4,summarize: An efficient approach is proposed ...,Propose a pragmatic approach for simulating co...
...,...,...
95,summarize: This paper proposes a strategy for...,Efficient strategy for GPU computing of FGFEA ...
96,summarize: A family of spatial beam finite el...,We analyse fixed pole approach in geometricall...
97,summarize: A new adaptive multiscale method i...,A new adaptive multiscale method AMM is develo...
98,summarize: A nonlocal extension of the damage...,A new nonlocal damage plasticity model has bee...


In [5]:
%%capture
!pip install --upgrade simplet5

In [6]:
from simplet5 import SimpleT5
model = SimpleT5()

INFO:pytorch_lightning.utilities.seed:Global seed set to 42


In [7]:
# load (supports t5, mt5, byT5 models)
# model.from_pretrained("t5","t5-base")

# model.train(train_df=df, # pandas dataframe with 2 columns: source_text & target_text
#             eval_df=df_val, # pandas dataframe with 2 columns: source_text & target_text
#             source_max_token_len = 512, 
#             target_max_token_len = 100,
#             batch_size = 8,
#             max_epochs = 10,
#             use_gpu = True,
#             outputdir = "/content",
#             early_stopping_patience_epochs = 0,
#             precision = 32
# )

In [8]:
# import shutil
# shutil.make_archive('model-archive', 'zip', '/content/simplet5-epoch-9-train-loss-1.3233-val-loss-2.5556')

In [9]:
model.load_model("t5","/content/abstractive-summarization-using-T5/simplet5-epoch-9-train-loss-1.3675-val-loss-2.6217", use_gpu=True)

In [10]:
%%capture
!pip install pytextrank
!python -m spacy download en_core_web_sm
!pip install --upgrade scipy networkx

In [11]:
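import spacy
import pytextrank

# Build the spaCy pipeline once; reloading the model on every call is slow when looping over a dataset
en_nlp = spacy.load("en_core_web_sm")
en_nlp.add_pipe("textrank", last=True)

def extract_important_sentences(text, limit_phrases=15, limit_sentences=5):
    """Return the top-ranked TextRank sentences of `text`, each followed by a space."""
    doc = en_nlp(text)
    tr = doc._.textrank
    summary = ""
    for sent in tr.summary(limit_phrases=limit_phrases, limit_sentences=limit_sentences):
        summary += sent.text + " "
    return summary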
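The cell above delegates the sentence selection to pytextrank. To give a feel for the underlying idea, here is a minimal pure-Python sketch of a simplified, sentence-level TextRank: sentences are graph nodes, edges are weighted by word overlap, and a PageRank-style iteration scores each sentence. This is not pytextrank's own procedure (which ranks phrases first and then scores sentences against them); the helper names `split_sentences`, `similarity`, `rank_sentences` and `top_sentences` are hypothetical and exist only for this illustration.

```python
import re
from itertools import combinations

def split_sentences(text):
    # Naive splitter: break after ., ! or ? followed by whitespace
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

def similarity(a, b):
    # Jaccard overlap between the word sets of two sentences
    wa = set(re.findall(r'\w+', a.lower()))
    wb = set(re.findall(r'\w+', b.lower()))
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)

def rank_sentences(sentences, damping=0.85, iterations=50):
    # Weighted PageRank by power iteration over the sentence similarity graph
    n = len(sentences)
    weights = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        weights[i][j] = weights[j][i] = similarity(sentences[i], sentences[j])
    out_weight = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            incoming = sum(weights[j][i] / out_weight[j] * scores[j]
                           for j in range(n) if out_weight[j] > 0)
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores

def top_sentences(text, limit=2):
    # Keep the `limit` best-scoring sentences, returned in document order
    sentences = split_sentences(text)
    if not sentences:
        return []
    scores = rank_sentences(sentences)
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:limit]
    return [sentences[i] for i in sorted(best)]

sample = ("Thin films coat bulk materials. "
          "Thin films change surface properties of bulk materials. "
          "Bananas are yellow.")
print(top_sentences(sample, limit=2))
```

The unrelated third sentence shares no words with the others, so it receives only the baseline score and is dropped.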

In [12]:
def create_summaries(text):

    print("ACTUAL ABSTRACT - " + text)
    print("\nLength of Abstract = " + str(len(text.split())))
    sumtext = "summarize: " + text
    actual_text_prediction = model.predict(sumtext)[0]
    print("\nDIRECT SUMMARIZATION USING T5 - " + actual_text_prediction)
    print("\nLength of Summary = " + str(len(actual_text_prediction.split())))

    newtext = extract_important_sentences(text, 20, 6)
    newtext = "summarize: " + newtext
    extractive_text_prediction = model.predict(newtext)[0]
    print("\nSUMMARIZATION AFTER EXTRACTIVE USING T5 - " + extractive_text_prediction)
    print("\nLength of Summary = " + str(len(extractive_text_prediction.split())))
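`create_summaries` prints its results, which is convenient for inspection but hard to reuse. A possible variant, sketched below, returns both predictions instead. The function name `summarize_two_ways` and its `predict`/`extract` parameters are hypothetical; the stubs stand in for the real model and extractor so the sketch runs on its own.

```python
def summarize_two_ways(text, predict, extract, prefix="summarize: "):
    """Return the direct and the extract-then-summarize prediction for one text.

    predict: callable taking a prompt and returning a list of strings
    extract: callable taking a text and returning a shortened text
    """
    direct = predict(prefix + text)[0]
    condensed = extract(text)
    after_extractive = predict(prefix + condensed)[0]
    return {"direct": direct, "after_extractive": after_extractive}

# Stubs for illustration only; they are NOT the T5 model or TextRank
def fake_predict(prompt):
    return [prompt.upper()]

def fake_extract(text):
    return text.split(".")[0] + "."

result = summarize_two_ways("First sentence. Second sentence.", fake_predict, fake_extract)
print(result["direct"])
print(result["after_extractive"])
```

In this notebook the call would presumably look like `summarize_two_ways(text, model.predict, lambda t: extract_important_sentences(t, 20, 6))`.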

In [13]:
text = """Since their early discovery, thin films have quickly found industrial uses, including in ornamental and optical applications. The range of applications for thin film technology has expanded to the point where nearly every industrial sector now uses it to impart specific physical and chemical properties to the surface of bulk materials. This expansion has been aided by the development of vacuum technology and electric power facilities. Recently, the most technologically sophisticated applications, such microelectronics and biomedicine, have been made possible by the ability to adjust the film properties by the change of the microstructure via the deposition parameters used in a particular deposition procedure. Despite such remarkable advancements, the relationship between all phases of the manufacture of thin films, specifically deposition parameters-morphology and characteristics, is not entirely precise. The development of complex models for an accurate prediction of film properties has been hampered, among other things, by the lack of characterization techniques suited for probing films with thickness less than a single atomic layer and a lack of knowledge of the physics. Additionally, there are still certain challenges with the mass production of advanced structures, such as quantum wells and wires, as well as a relatively high cost for their deposition. Thin film technology will be more competitive for cutting-edge technological applications once these obstacles are removed."""

In [14]:
create_summaries(text)

ACTUAL ABSTRACT - Since their early discovery, thin films have quickly found industrial uses, including in ornamental and optical applications. The range of applications for thin film technology has expanded to the point where nearly every industrial sector now uses it to impart specific physical and chemical properties to the surface of bulk materials. This expansion has been aided by the development of vacuum technology and electric power facilities. Recently, the most technologically sophisticated applications, such microelectronics and biomedicine, have been made possible by the ability to adjust the film properties by the change of the microstructure via the deposition parameters used in a particular deposition procedure. Despite such remarkable advancements, the relationship between all phases of the manufacture of thin films, specifically deposition parameters-morphology and characteristics, is not entirely precise. The development of complex models for an accurate prediction of

In [15]:
# Create an empty dataframe
pred_df = pd.DataFrame(columns=["target_text", "predicted_text", "predicted_after_extractive"])

# Reload the validation set so that source_text does not yet carry the "summarize: " prefix
df_val = pd.read_csv("/content/abstractive-summarization-using-T5/datasets/val.csv")
df_val = df_val.drop(columns='FileName')
df_val = df_val.rename(columns={"Abstract":"source_text", "RHS":"target_text"})
df_val = df_val[['source_text', 'target_text']]

# Iterate over the validation dataset
for i, row in df_val.iterrows():
    # Make a prediction for the current row
    pred_text = model.predict("summarize: " + row["source_text"])
    newtext = extract_important_sentences(row["source_text"], 20, 6)
    newtext = "summarize: " + newtext
    pred_text2 = model.predict(newtext)
    # Add the prediction and the target text to the new dataframe
    pred_df.loc[i] = [row["target_text"], pred_text[0], pred_text2[0]]
    
pred_df

Token indices sequence length is longer than the specified maximum sequence length for this model (572 > 512). Running this sequence through the model will result in indexing errors


,target_text,predicted_text,predicted_after_extractive
0,We model the deformation of the human face due...,3 D muscle structures are embedded inside a fa...,Muscle forces are decomposed into discrete poi...
1,Bifurcation and postbifurcation of inflated hy...,We analyse bifurcation and postbifurcation of ...,We model arterial wall tissue with this class ...
2,Existing methods in modular product family dev...,We analyse a case in which the aim was rationa...,Modularisation product platforms product famil...
3,A DRX model of FGH96 of IFW process is establi...,The microstructure evolution of FGH96 ring par...,The microstructure evolution of FGH96 ring par...
4,Propose a pragmatic approach for simulating co...,We propose a novel approach for industrial par...,The eXtended Finite Element Method is used to ...
...,...,...,...
95,Efficient strategy for GPU computing of FGFEA ...,A matrix free GPU implementation of Fixed Grid...,We propose a strategy for the efficient implem...
96,We analyse fixed pole approach in geometricall...,A family of spatial beam finite elements based...,A family of spatial beam finite elements based...
97,A new adaptive multiscale method AMM is develo...,A new adaptive multiscale method is developed ...,Macroscopic nodes are placed uniformly along e...
98,A new nonlocal damage plasticity model has bee...,A nonlocal extension of the damage plasticity ...,The influence of mesh size on the structural r...
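The tokenizer warning above reports an input of 572 tokens against a 512-token maximum. One way to find which rows are affected is to count tokens per input before predicting. The sketch below is tokenizer-agnostic: `find_overlong` and `whitespace_count` are hypothetical helpers, and the whitespace counter is only a stand-in that undercounts compared with T5's subword tokenizer. With a Hugging Face tokenizer at hand, something like `lambda t: len(tokenizer.encode(t))` could be passed instead (an assumption, not run here).

```python
def find_overlong(texts, count_tokens, max_tokens=512):
    """Return (index, token_count) for every text longer than max_tokens."""
    overlong = []
    for index, text in enumerate(texts):
        n_tokens = count_tokens(text)
        if n_tokens > max_tokens:
            overlong.append((index, n_tokens))
    return overlong

def whitespace_count(text):
    # Illustration only: real subword token counts are usually higher
    return len(text.split())

samples = ["short text", "word " * 600]
print(find_overlong(samples, whitespace_count))  # prints [(1, 600)]
```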


In [16]:
!pip install py-rouge

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting py-rouge
  Downloading py_rouge-1.1-py3-none-any.whl (56 kB)
     56.8/56.8 KB 8.7 MB/s eta 0:00:00
Installing collected packages: py-rouge
Successfully installed py-rouge-1.1


In [17]:
import rouge

def prepare_results(metric, p, r, f):
    """Format precision, recall and F1 of one ROUGE metric as percentages."""
    return '\t{}:\t{}: {:5.2f}\t{}: {:5.2f}\t{}: {:5.2f}'.format(metric, 'P', 100.0 * p, 'R', 100.0 * r, 'F1', 100.0 * f)

def report_rouge(title, hypotheses, references):
    """Print aggregated ROUGE scores for one set of predictions."""
    print(title + "\n")
    # Every hypothesis has a single reference here, so 'Avg' and 'Best' report the same numbers
    for aggregator in ['Avg', 'Best']:
        print('Evaluation with {}'.format(aggregator))

        evaluator = rouge.Rouge(metrics=['rouge-n', 'rouge-l', 'rouge-w'],
                               max_n=2,
                               limit_length=True,
                               length_limit=100,
                               length_limit_type='words',
                               apply_avg=aggregator == 'Avg',
                               apply_best=aggregator == 'Best',
                               alpha=0.5, # Default F1_score
                               weight_factor=1.2,
                               stemming=True)

        scores = evaluator.get_scores(hypotheses, references)

        # With an aggregator set, each metric maps to a single dict with keys 'p', 'r' and 'f'
        for metric, results in sorted(scores.items(), key=lambda x: x[0]):
            print(prepare_results(metric, results['p'], results['r'], results['f']))
        print()

all_references = pred_df['target_text'].tolist()

report_rouge("Results on Predictions Directly using T5",
             pred_df['predicted_text'].tolist(), all_references)
report_rouge("Results on Predictions from Extractive + T5",
             pred_df['predicted_after_extractive'].tolist(), all_references)

Results on Predictions Directly using T5

Evaluation with Avg
	rouge-1:	P: 43.38	R: 50.34	F1: 45.51
	rouge-2:	P: 20.10	R: 23.33	F1: 21.05
	rouge-l:	P: 37.03	R: 41.94	F1: 38.65
	rouge-w:	P: 23.51	R: 12.43	F1: 15.81

Evaluation with Best
	rouge-1:	P: 43.38	R: 50.34	F1: 45.51
	rouge-2:	P: 20.10	R: 23.33	F1: 21.05
	rouge-l:	P: 37.03	R: 41.94	F1: 38.65
	rouge-w:	P: 23.51	R: 12.43	F1: 15.81

Results on Predictions from Extractive + T5

Evaluation with Avg
	rouge-1:	P: 41.71	R: 45.47	F1: 42.46
	rouge-2:	P: 17.51	R: 19.24	F1: 17.87
	rouge-l:	P: 33.49	R: 35.98	F1: 34.08
	rouge-w:	P: 21.19	R: 10.49	F1: 13.64

Evaluation with Best
	rouge-1:	P: 41.71	R: 45.47	F1: 42.46
	rouge-2:	P: 17.51	R: 19.24	F1: 17.87
	rouge-l:	P: 33.49	R: 35.98	F1: 34.08
	rouge-w:	P: 21.19	R: 10.49	F1: 13.64
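To make the P, R and F1 columns above more concrete, here is a minimal sketch of ROUGE-1 for a single hypothesis/reference pair: precision is the share of hypothesis words found in the reference, recall is the share of reference words found in the hypothesis. It is a simplification and not py-rouge's implementation, since it skips stemming, the 100-word length limit and proper tokenization; `rouge_1` is a hypothetical helper name.

```python
from collections import Counter

def rouge_1(hypothesis, reference):
    """Unigram-overlap precision, recall and F1 between two strings."""
    hyp_counts = Counter(hypothesis.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: a word counts at most as often as it occurs in both texts
    overlap = sum((hyp_counts & ref_counts).values())
    precision = overlap / max(sum(hyp_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f = rouge_1("a new multiscale method is developed",
                  "a new adaptive multiscale method")
print("P: {:5.2f}  R: {:5.2f}  F1: {:5.2f}".format(100 * p, 100 * r, 100 * f))
```

Here four of the six hypothesis words and four of the five reference words match, so precision is 4/6 and recall is 4/5.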



**A Few Examples**

In [18]:
create_summaries("""The study objective is to contemplate the effectiveness of COVID-19 on the air pollution of Indian territory from January 2020 to April 2020. We have executed data from European Space Agency (ESA) and CPCB online portal for air quality data dissemination. The Sentinel e 5 P satellite images elucidate that the Air quality of Indian territory has been improved significantly during COVID-19. Mumbai and Delhi are one of the most populated cities. These two cities have observed a substantial decrease in Nitrogen Dioxide (40e50%) compared to the same period last year. It suggests that the emergence of COVID-19 has been proved to a necessary evil as being advantageous for mitigating air pollution on Indian territory during the lock-down. The study found a significant decline in Nitrogen Dioxide in reputed states of India, i.e., Delhi and Mumbai. Moreover, a faded track of Nitrogen Dioxide can be seen at the Maritime route in the Indian Ocean. An upsurge in the environmental quality of India will also be beneficial for its neighbor countries, i.e., China, Pakistan, Iran, and Afghanistan.""")

ACTUAL ABSTRACT - The study objective is to contemplate the effectiveness of COVID-19 on the air pollution of Indian territory from January 2020 to April 2020. We have executed data from European Space Agency (ESA) and CPCB online portal for air quality data dissemination. The Sentinel e 5 P satellite images elucidate that the Air quality of Indian territory has been improved significantly during COVID-19. Mumbai and Delhi are one of the most populated cities. These two cities have observed a substantial decrease in Nitrogen Dioxide (40e50%) compared to the same period last year. It suggests that the emergence of COVID-19 has been proved to a necessary evil as being advantageous for mitigating air pollution on Indian territory during the lock-down. The study found a significant decline in Nitrogen Dioxide in reputed states of India, i.e., Delhi and Mumbai. Moreover, a faded track of Nitrogen Dioxide can be seen at the Maritime route in the Indian Ocean. An upsurge in the environmental

In [19]:
create_summaries("""The aim of the study is to introduce some approach which might help in improving daily temperature of data. Weather is a natural a phenomenon for which forecasting is a great challenge today. Weather parameters such as Rainfall, Relative Humidity , Wind Speed, Air Temperature are highly non-linear and complex phenomena, which include mathematical simulation and modeling for its correct forecasting. Weather Forecasting is use to simplify the purpose of knowledge and tools that are used for the state of atmosphere at a given place. The prediction is becoming more complicated due to changing weather condition. There are different software and types are available for Time Series forecasting. Our aim is to analyze the parameters and do the comparison of some strategies in predicting these temperatures. Here we tend to analyze the data of given parameters and notice the prediction for few period using the strategy of Autoregressive Integrated Moving Average (ARIMA) and Exponential Smoothing (ETS).The data from meteorological centers are taken for comparison of methods using packages such as ggplot2, forecast, time Date in R and automatic prediction strategies are available within the package applied for modeling with ARIMA and ETS methods. On basis of accuracy we tend to attempt the simplest Methodology. Our model will compare on basis of MAE, MASE, MAPE AND RMSE. The identification of model will chromatic inspection of both the ACF and PACF to hypothesize many possible models will estimated by selection criteria AIC, AICc and BIC.""")

ACTUAL ABSTRACT - The aim of the study is to introduce some approach which might help in improving daily temperature of data. Weather is a natural a phenomenon for which forecasting is a great challenge today. Weather parameters such as Rainfall, Relative Humidity , Wind Speed, Air Temperature are highly non-linear and complex phenomena, which include mathematical simulation and modeling for its correct forecasting. Weather Forecasting is use to simplify the purpose of knowledge and tools that are used for the state of atmosphere at a given place. The prediction is becoming more complicated due to changing weather condition. There are different software and types are available for Time Series forecasting. Our aim is to analyze the parameters and do the comparison of some strategies in predicting these temperatures. Here we tend to analyze the data of given parameters and notice the prediction for few period using the strategy of Autoregressive Integrated Moving Average (ARIMA) and Expo

In [20]:
create_summaries("""This paper explores the concept of economies and diseconomies of scale in the production process. Economies of scale refer to cost advantages that a firm experience as it increases its level of output, while diseconomies of scale refer to the increased costs that a firm experience as it increases its level of output. The paper provides a comprehensive examination of the different types of economies and diseconomies of scale, including internal and external economies and diseconomies of scale. The paper also discusses the various factors that can affect economies and diseconomies of scale and provides insights on how firms can effectively navigate these challenges. The paper concludes by highlighting the importance of considering economies and diseconomies of scale in the production process and the impact it can have on the overall efficiency and profitability of a firm.""")

ACTUAL ABSTRACT - This paper explores the concept of economies and diseconomies of scale in the production process. Economies of scale refer to cost advantages that a firm experience as it increases its level of output, while diseconomies of scale refer to the increased costs that a firm experience as it increases its level of output. The paper provides a comprehensive examination of the different types of economies and diseconomies of scale, including internal and external economies and diseconomies of scale. The paper also discusses the various factors that can affect economies and diseconomies of scale and provides insights on how firms can effectively navigate these challenges. The paper concludes by highlighting the importance of considering economies and diseconomies of scale in the production process and the impact it can have on the overall efficiency and profitability of a firm.

Length of Abstract = 137

DIRECT SUMMARIZATION USING T5 - The paper explores the concept of econom

Summarizing Examples

In [21]:
create_summaries("""Let's use two hypothetical retail businesses as an example and compare them. One of them is a major company by the name of Malwart, while the other is a tiny neighbourhood shop by the name of Bob's Sporting Goods. Bob handles all of his distribution and inventory management manually and alone. Malwart maintains its distribution in the meanwhile using sophisticated software created only for them. It should come as no surprise that Malwart manages his inventory and distribution considerably more effectively and productively than Bob does. However, because his business is too small and he cannot afford to spend so much money, Bob is unable to invest in similar software.""")

ACTUAL ABSTRACT - Let's use two hypothetical retail businesses as an example and compare them. One of them is a major company by the name of Malwart, while the other is a tiny neighbourhood shop by the name of Bob's Sporting Goods. Bob handles all of his distribution and inventory management manually and alone. Malwart maintains its distribution in the meanwhile using sophisticated software created only for them. It should come as no surprise that Malwart manages his inventory and distribution considerably more effectively and productively than Bob does. However, because his business is too small and he cannot afford to spend so much money, Bob is unable to invest in similar software.

Length of Abstract = 110

DIRECT SUMMARIZATION USING T5 - We compare two retail businesses as examples. Malwart is a major company while Bob is a small neighbourhood shop. Bob can't afford to spend so much money on software for his business. Malwart manages his distribution and inventory considerably mor

Summarizing generic text paragraphs

In [22]:
create_summaries("""Holi is the festival of colors celebrated with our loved ones. It is one of the biggest festivals in our country which comes every year during March. Children, adults and even the elder citizens take part in the fun and preparations of Holi for three days starting from a full moon day. People from all religions play Holi by exchanging sweets, gujiya, thandai and splashing colors on each other. Water guns and water balloons are also used by children during the Holidays.""")

ACTUAL ABSTRACT - Holi is the festival of colors celebrated with our loved ones. It is one of the biggest festivals in our country which comes every year during March. Children, adults and even the elder citizens take part in the fun and preparations of Holi for three days starting from a full moon day. People from all religions play Holi by exchanging sweets, gujiya, thandai and splashing colors on each other. Water guns and water balloons are also used by children during the Holidays.

Length of Abstract = 82

DIRECT SUMMARIZATION USING T5 - Children and adults take part in the fun and preparations of Holi for three days starting from a full moon day. People from all religions play Holi by exchanging sweets, gujiya, thandai and splashing colors on each other. Water guns and water balloons are also used by children during the Holidays.

Length of Summary = 51

SUMMARIZATION AFTER EXTRACTIVE USING T5 - Holi is the festival of colors celebrated with our loved ones. People from all relig