# Prediction phase for text summarization

In this notebook, we load one of the text summarization models that we trained in the previous notebook.

Then we apply the model to its corresponding test dataset (which we also load in this notebook) to generate the predicted summaries.


First of all, we need to install some libraries and import the packages used in this notebook:


In [None]:
# Only run this cell if you are on Google Colab
!pip install ohmeow-blurr
!pip install nlp


In [2]:
import nlp
from fastai.text.all import *
from transformers import *

from blurr.data.all import *
from blurr.modeling.all import *


## Loading the model and the test dataset


Let us start by choosing which model to load. We have trained several models for different datasets. In the next cell, please choose one of the datasets:

In [3]:
datasetInfo = [['cnn_dailymail', '3.0.0', 'article', 'highlights'],
               ['gigaword', '1.2.0', 'document', 'summary'],
               ['xsum', '1.1.0', 'document', 'summary'],
               ['reddit','1.0.0', 'content', 'summary'],

               ['biomrc', 'biomrc_large_A', 'abstract','answer'],
               ['biomrc', 'biomrc_large_B', 'abstract','title'],
               ['emotion', '0.0.0','text','label']]
#                ['biomrc', 'biomrc_small_A', 'abstract','answer'],
#                ['biomrc', 'biomrc_small_B', 'abstract','answer'],
#                ['biomrc', 'biomrc_tiny_A', 'abstract','answer'],
#                ['biomrc', 'biomrc_tiny_B', 'abstract','answer']]

#Please select the dataset by its index in datasetInfo
numDataset = 2

nameDataset=datasetInfo[numDataset][0]      #name of the dataset
versionDataset=datasetInfo[numDataset][1]   #version of the dataset
text_field=datasetInfo[numDataset][2]       #name of the field that contains the input texts
summary_field=datasetInfo[numDataset][3]    #name of the field that contains the summaries

print("Prediction for ", nameDataset)

Prediction for  xsum
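
Selecting a dataset by position is easy to get wrong if the list is ever reordered. As a minimal sketch, the same information could be keyed by name instead. `DatasetConfig` and `DATASETS` are hypothetical names introduced only for illustration, and only three of the rows above are repeated here:

```python
from typing import NamedTuple


class DatasetConfig(NamedTuple):
    """Hypothetical container for one row of datasetInfo."""
    name: str           # name of the dataset
    version: str        # version (or configuration) of the dataset
    text_field: str     # field that contains the input texts
    summary_field: str  # field that contains the summaries


# Same values as the corresponding rows of datasetInfo.
DATASETS = {
    cfg.name: cfg
    for cfg in [
        DatasetConfig('cnn_dailymail', '3.0.0', 'article', 'highlights'),
        DatasetConfig('gigaword', '1.2.0', 'document', 'summary'),
        DatasetConfig('xsum', '1.1.0', 'document', 'summary'),
    ]
}

cfg = DATASETS['xsum']
print("Prediction for", cfg.name)
```

Note that `biomrc` appears more than once in the list with different versions, so a name-only key would collide for it; a key such as `(name, version)` would be needed to cover those rows.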


### Loading the test dataset
We only need to load the test split of the selected dataset:

In [9]:
import pandas as pd 

print('This may take some minutes...')
test_data = nlp.load_dataset(nameDataset, versionDataset, split='test')
print('size of the test dataset {} = {}'.format(nameDataset,len(test_data)) )

#we load it into a pandas dataframe
df_test = pd.DataFrame(test_data)


This may take some minutes...


Using custom data configuration 1.1.0


size of the test dataset xsum = 11333


### Loading the model
If you are running on Google Colab, you need to mount your Google Drive first:


In [5]:
if 'google.colab' in str(get_ipython()):
    from google.colab import drive
    drive.mount('/content/drive')
    root_colab='drive/My Drive/Colab Notebooks/'
    root=root_colab+'NLPwithDL/TextSummarization/'
else:
    print('Not running on CoLab')
    root='./'



Mounted at /content/drive


In [6]:
path_model=root+'models/'+nameDataset+'.pkl'
model = load_learner(fname=path_model)
print('{} model was loaded'.format(path_model))


drive/My Drive/Colab Notebooks/NLPwithDL/TextSummarization/models/xsum.pkl model was loaded
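
With plain string concatenation, a wrong `root` or a missing file only shows up as an error raised inside `load_learner`. The following is a small sketch, assuming the same `models/<dataset>.pkl` layout as above; `model_path` is a hypothetical helper (not part of fastai or blurr) that builds the path with `pathlib` and checks that the file exists before loading:

```python
from pathlib import Path


def model_path(root, dataset_name):
    """Hypothetical helper: return root/models/<dataset_name>.pkl, or raise if it is missing."""
    path = Path(root) / 'models' / f'{dataset_name}.pkl'
    if not path.is_file():
        raise FileNotFoundError(f'No exported model found at {path}')
    return path
```

It could then be used as `load_learner(fname=model_path(root, nameDataset))`, on the assumption that `load_learner` accepts a `Path` as well as a string.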


### Predicting...

We now apply the model to each text in the test dataset.
We also save a CSV file (named after the dataset) containing the input texts, their gold summaries, and their predicted summaries:

In [None]:

import time
start_time = time.time()

input_texts = []        #list to save the input texts
gold_summaries = []     #list to save the summaries from the test dataset
predicted_summaries = []    #list to save the summaries created by the model


total = len(test_data)

#This loop traverses all texts from the test dataset


for index, row in df_test.iterrows():
    print(row[text_field], row[summary_field])
    
    #save the input text
    input_texts.append(row[text_field])
    #save its corresponding summary
    gold_summaries.append(row[summary_field])

    print("Predicting summary for {} / {} ".format(index,total))
    
    #now we use the model to generate the summary
    predicted = model.blurr_summarize(row[text_field])
    #we save the generated summary
    predicted_summaries.append(predicted[0])

    #elapsed time since the start of the loop, in minutes and in seconds
    elapsed = time.time() - start_time
    print("--- time ---" , (elapsed/60)," min --- ", elapsed,' sec ---') 

#Now, we create a dataframe with these three lists and save it into a csv

data={"input_text": input_texts, "gold_summary": gold_summaries, "predicted_summary" : predicted_summaries}
#data = {"newSummaries":predicted_summaries,'originalSummaries':gold_summaries,'fullTexts':input_text}

#we create a dataframe to save the input
df = pd.DataFrame( data , columns = ['input_text', 'gold_summary', 'predicted_summary'])
#df = pd.DataFrame( data , columns = ["newSummaries","originalSummaries","fullTexts"])

#The CSV file that stores the predictions is named after the dataset:
path_predictions=root+'outputs/'+nameDataset+'.csv' #name of the file to save them
import os
os.makedirs(root+'outputs/', exist_ok=True) #make sure the output folder exists
df.to_csv(path_predictions) #we save the dataframe into the csv
print('generated summaries were saved into {}'.format(path_predictions))


The London trio are up for best UK act and best album, as well as getting two nominations in the best song category."We got told like this morning 'Oh I think you're nominated'", said Dappy."And I was like 'Oh yeah, which one?' And now we've got nominated for four awards. I mean, wow!"Bandmate Fazer added: "We thought it's best of us to come down and mingle with everyone and say hello to the cameras. And now we find we've got four nominations."The band have two shots at the best song prize, getting the nod for their Tynchy Stryder collaboration Number One, and single Strong Again.Their album Uncle B will also go up against records by the likes of Beyonce and Kanye West.N-Dubz picked up the best newcomer Mobo in 2007, but female member Tulisa said they wouldn't be too disappointed if they didn't win this time around."At the end of the day we're grateful to be where we are in our careers."If it don't happen then it don't happen - live to fight another day and keep on making albums and hi
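
The cell above writes the CSV only after the whole loop has finished, so an interrupted session (for example, a Colab timeout) would lose every prediction made so far. As a hedged sketch of one alternative, the rows could be written as they are produced. `predict_to_csv` and its `summarize` parameter are hypothetical; `summarize` stands in for whatever call returns the predicted summary for a text, and the standard `csv` module is used here instead of pandas:

```python
import csv


def predict_to_csv(rows, summarize, out_path):
    """Write one CSV row per example as soon as its summary is generated.

    rows: iterable of (input_text, gold_summary) pairs.
    summarize: callable mapping a text to its predicted summary
               (hypothetical stand-in for the model call).
    Returns the number of examples written.
    """
    count = 0
    with open(out_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['input_text', 'gold_summary', 'predicted_summary'])
        for text, gold in rows:
            writer.writerow([text, gold, summarize(text)])
            f.flush()  # empty Python's buffer so earlier rows survive an interruption
            count += 1
    return count
```

In this notebook the pairs could come from `zip(df_test[text_field], df_test[summary_field])`, with a small wrapper around the model as `summarize`.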