In [None]:
#Mount drive
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [None]:
%cd /content/drive/MyDrive/table_text_summarization_package/

/content/drive/MyDrive/table_text_summarization_package


In [None]:
!pip install -r requirements.txt
!pip install --upgrade accelerate

## **Imports and Initializations**

In [None]:
from table_text_summarization import Summarizer
from helper import get_evaluation_metrics
import os
import pandas as pd
import json

# **Table-To-Insights**

This notebook demonstrates how to use the summarization module from Tiger-NLP 🐯 to fine-tune a model and run inference for the table-to-insight task.

*   **Table-to-Insights**: The model generates an insight/summary for a table that has been converted into free-flowing text and passed as the input sequence.

We will see how to load and preprocess the dataset for the task and how to use the Trainer API to train a model on it. We will also demonstrate how to run inference on a dataset in case you already have a trained model.

The trainer/inference module is based on the PyTorch framework and can use a GPU-accelerated machine for training and inference.



In [None]:
# Initialize a Summarizer object for the table-to-insight task.
# summary_type="table" triggers the pre-processing functions required for tabular data.
model = Summarizer(summary_type="table")

## **Preprocessing the data**

The preprocessing step converts the tables/dataframes stored in CSV or Excel files into free-flowing text and saves the result as JSON for model training and inference.

 - **Example folder structure and file type of raw data**
```
|-- train_folder/
|   |-- table_1.csv
|   |-- table_1.jsonl
|   |-- table_2.csv
|   |-- table_2.jsonl
|    ...
|    ...
|   |-- table_n.csv
|   |-- table_n.jsonl
```

  -> **.csv** - Contains the table in structured format, which is processed into flattened text.

  -> **.jsonl** - Contains two fields: `summary` and `highlighted_cells`.
  e.g.
```
{"summary": "actual summary of the data", "highlighted_cells": [[1,2],[2,2],[3,2]]}
```
   - `summary` is the actual summary of the table, which is used for model training.
   - `highlighted_cells` lists the cells that the summary covers, or the cells to summarize during inference.

   In the above example, `highlighted_cells: [[1,2],[2,2],[3,2]]` tells the pre-process function to select only the data in rows 1, 2 and 3 of column 2, and `summary: "actual summary of the data"` holds the actual summary of these three cells.
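
The cell selection described above can be sketched in plain Python. This is only an illustration: `select_cells` and `flatten_cells` are hypothetical helpers, not functions of the package, and the sketch assumes that each `[row, column]` pair is 0-based and counts data rows only (the header row is excluded). The tag layout of the flattened text is loosely modeled on the `<table> <cell> ... <header> ...` output shown later in this notebook and may differ from what `pre_process` actually writes.

```python
def select_cells(header, rows, highlighted_cells):
    """Return (column name, value) pairs for each [row, column] index pair."""
    return [(header[col], rows[row][col]) for row, col in highlighted_cells]


def flatten_cells(cells):
    """Join the selected cells into one free-flowing text sequence."""
    parts = [f"<cell> {value} <header> {name} </header> </cell>" for name, value in cells]
    return "<table> " + " ".join(parts) + " </table>"


# Toy table; the values are made up for illustration.
header = ["origin", "destination", "volume"]
rows = [
    ["A", "B", "10"],
    ["A", "C", "20"],
    ["B", "C", "30"],
    ["C", "A", "40"],
]

# Rows 1, 2 and 3 of column 2, as in the example above.
cells = select_cells(header, rows, [[1, 2], [2, 2], [3, 2]])
print(flatten_cells(cells))
```

Keeping selection and flattening separate makes it easy to inspect which cells were picked before they are turned into text.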




In [None]:
# model.pre_process() converts tables into flattened text
# `data_path` is the folder path of the data
# `data_type` specifies which split the data belongs to ('train', 'test' or 'validation')
model.pre_process(data_path='data/train_folder', data_type='train')

actual summary
[[1, 0], [1, 1], [1, 2]]
Executed Lattice...
Pre-processing done...


In [None]:
model.pre_process(data_path="data/test_folder",data_type='test')

Executed Lattice...
Pre-processing done...


In [None]:
model.pre_process(data_path="data/validation_folder",data_type='validation')

validation data table summary
[[1, 0], [1, 1], [1, 2], [1, 3], [1, 4]]
Executed Lattice...
Pre-processing done...


## **Fine-tuning the model**

Now that our data is ready, we can download the pretrained model and fine-tune it. Since our task is of the sequence-to-sequence kind, the `AutoModelForSeq2SeqLM` class is used internally. Like with the tokenizer, the `from_pretrained` method downloads and caches the model for us.

We can use any pretrained model from Hugging Face that has a Seq2Seq (encoder-decoder) architecture, e.g. T5 or BART.

Below are some important points:

* The user should provide the training file path and, optionally, the validation file path in **json/csv format**. The text column name should be **"text"** and the summary column name should be **"summary"**.

* The user should also provide the output path where all the model results will be saved.

* If the model name is not specified, the default model (**facebook/bart-large-cnn**) is used.

* If the model is T5, specify **model_type = t5** so that the T5 params are applied; otherwise use **model_type = others**.

* Default model params are used for the specified architecture. We can pass [**kwargs](https://huggingface.co/transformers/v3.0.2/main_classes/trainer.html#trainingarguments) to the train function to override the default model params.

* The user should set **train_prediction = True** to generate predictions on the training data and **val_prediction = True** to generate predictions on the validation data.

* **Train/validation predictions** and **eval metrics** are stored in the **prediction folder** inside the output folder path.
```
|--output_path/
|    |--prediction folder/
|       |-- train folder/
|          |--train_data_benchmarks.csv
|          |--train_generated_predictions.jsonl
|       |--validation folder/
|          |--validation_data_benchmarks.csv
|          |--validation_generated_predictions.jsonl
```


For more information, refer to [HuggingFace Summarization](https://github.com/huggingface/transformers/tree/main/examples/pytorch/summarization).

**NOTE**: When **output_path='table'**, the fine-tuned model is saved in the **'table/training'** folder.
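
Before starting a long training run, it can help to confirm that the processed records carry the column names listed above. A minimal sketch, assuming the records have already been loaded into a list of dictionaries; `missing_columns` is a hypothetical helper and not part of the package.

```python
def missing_columns(records, required=("text", "summary")):
    """Return the sorted required column names absent from at least one record."""
    missing = set()
    for record in records:
        missing.update(name for name in required if name not in record)
    return sorted(missing)


# Made-up records for illustration.
good = [{"text": "<table> ...", "summary": "actual summary"}]
bad = [{"text": "<table> ...", "Summary": "wrong key casing"}]

print(missing_columns(good))  # []
print(missing_columns(bad))   # ['summary']
```

For inference-only data, where no reference summary is needed, the same check could be called with `required=("text",)`.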

In [None]:
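# model.train() is used for fine-tuning
# `train_data_path` : path to the training file (json/csv)
# `output_path` : folder where all the model results are saved
# `model_name` : pretrained model name (default: facebook/bart-large-cnn)
# `model_type` : "t5" for T5 models, "others" otherwise
# `train_prediction` : set to True to generate predictions on the training data
# `val_prediction` : set to True to generate predictions on the validation data
# `valn_path` : path to the validation file (optional)
# **kwargs : additional arguments for the model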

# Using trainer function with necessary params
model.train(train_data_path="data/train_folder/processed/train_data.json",output_path="table", model_name="t5-small",model_type="t5",valn_path="data/validation_folder/processed/validation_data.json")

# Using trainer function with additional params
# model.train(train_data_path="data/train_folder/processed/train_data.json",output_path="table", model_name="t5-small",model_type="t5",valn_path="data/validation_folder/processed/validation_data.json",train_prediction=True,val_prediction=True,per_device_train_batch_size=2,learning_rate=5e-5,max_train_samples=50)

Executed
Model Name: t5-small
validation
training started
train prediction
val prediction


## **Inferencing the model**


### **Using test data**

See the up-to-date list of [available models](https://huggingface.co/models?pipeline_tag=summarization)

Below are some important points:

* The user should provide the test path in **json/csv format**. The text column name should be **"text"**.

* The user should also provide the output path where all the model results will be saved.

* We can pass a pretrained or fine-tuned model path. If the model name is not specified, the default model (**facebook/bart-large-cnn**) is used.

* If the model is T5, specify **model_type = t5** so that the T5 params are applied; otherwise use **model_type = others**.

* We can pass [**kwargs](https://huggingface.co/transformers/v3.0.2/main_classes/trainer.html#trainingarguments) to the predict function to override the default model params.

* **Train/validation/test predictions** and **eval metrics** are stored in the **prediction folder** inside the output folder path. `test_data_benchmarks.csv` is optional: it is generated only if the test data contains the actual summary.
```
|--output_path/
|    |--prediction folder/
|       |-- train folder/
|          |--train_data_benchmarks.csv
|          |--train_generated_predictions.jsonl
|       |--validation folder/
|          |--validation_data_benchmarks.csv
|          |--validation_generated_predictions.jsonl
|       |--test folder/
|          |--test_data_benchmarks.csv
|          |--test_generated_predictions.jsonl
```
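
A short sketch for checking which prediction artifacts exist after a run. It assumes the files sit directly in the folder passed in and follow the `<split>_generated_predictions.jsonl` / `<split>_data_benchmarks.csv` naming shown in the tree above; `list_artifacts` is a hypothetical helper, and the actual folder names written by the package may differ.

```python
from pathlib import Path


def list_artifacts(folder, split):
    """Map each expected file name for a split to whether it exists in folder."""
    names = [f"{split}_generated_predictions.jsonl", f"{split}_data_benchmarks.csv"]
    return {name: (Path(folder) / name).is_file() for name in names}


# Example: the test predictions folder used later in this notebook.
for name, found in list_artifacts("table/prediction/test", "test").items():
    print(name, "found" if found else "missing")
```

A missing `test_data_benchmarks.csv` is not necessarily an error, since benchmarks need reference summaries.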

#### **Pretrained model**

In [None]:
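# model.predict() is used for prediction
# `test_path` : path to the test file (json/csv)
# `output_path` : folder where all the model results are saved
# `model_name` : pretrained model name or fine-tuned model path
# `model_type` : "t5" for T5 models, "others" otherwise
# **kwargs : additional arguments for the model

# Initialize a fresh object for inference with the pretrained model so that the params are reset.
# Params modified for fine-tuning (learning rate etc.) are not carried over.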
pretrain_model = Summarizer(summary_type="table")

# Using predict function with necessary params
pretrain_model.predict(test_path="data/test_folder/processed/test_data.json",output_path="table/prediction/test", model_name="t5-small",model_type="t5")

# Using predict function with additional params
# pretrain_model.predict(test_path="data/test_folder/processed/test_data.json",output_path="table/prediction/test",model_name="t5-small",model_type="t5",per_device_train_batch_size=2,learning_rate=5e-5,max_train_samples=50)




#### **Finetuned model**

In [None]:
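# model.predict() is used for prediction
# `test_path` : path to the test file (json/csv)
# `output_path` : folder where all the model results are saved
# `model_name` : pretrained model name or fine-tuned model path
# `model_type` : "t5" for T5 models, "others" otherwise
# **kwargs : additional arguments for the model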

# Using predict function with necessary params for finetuned model
model.predict(test_path="data/test_folder/processed/test_data.json",output_path="table/prediction/test", model_name="table/training",model_type="t5")

# Using predict function with additional params
# model.predict(test_path="data/test_folder/processed/test_data.json",output_path="table/prediction/test",model_name="table/training",model_type="t5",per_device_train_batch_size=2,learning_rate=5e-5,max_train_samples=50)


### **Using single context**

Here we pass a context string as input, and the model generates a summary for that context.

See the up-to-date list of [available models](https://huggingface.co/models?pipeline_tag=summarization)

Below are some important points:

* The user should provide the context as a string.

* We can pass a pretrained or fine-tuned model path. If the model name is not specified, the default model (**facebook/bart-large-cnn**) is used.

* If the model is T5, specify **model_type = t5** so that the T5 params are applied; otherwise use **model_type = others**.

We can pass [**kwargs](https://huggingface.co/transformers/v3.0.2/main_classes/trainer.html#trainingarguments) to the predict function to override the default model params.
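
As a rough picture of how keyword overrides of this kind usually behave in Python, the sketch below merges caller-supplied arguments over a set of defaults. The default values and the `resolve_params` helper are hypothetical and chosen only for illustration; they are not the package's internals.

```python
# Hypothetical defaults, chosen only for illustration.
DEFAULT_PARAMS = {"min_length": 10, "max_length": 60}


def resolve_params(**kwargs):
    """Return the defaults with any caller-supplied keyword arguments applied on top."""
    return {**DEFAULT_PARAMS, **kwargs}


print(resolve_params())                             # defaults only
print(resolve_params(min_length=5, max_length=20))  # both defaults overridden
```

Building a new dictionary on each call leaves the defaults untouched, so one call's overrides do not leak into the next.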

In [None]:
# Adjacent string literals are joined as-is, so each piece ends with a space.
context = (
    "The tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building, "
    "and the tallest structure in Paris. Its base is square, measuring 125 metres (410 ft) on each side. "
    "During its construction, the Eiffel Tower surpassed the Washington Monument to become the tallest man-made structure "
    "in the world, a title it held for 41 years until the Chrysler Building in New York City was finished in 1930. "
    "It was the first structure to reach a height of 300 metres. Due to the addition of a broadcasting aerial at the top "
    "of the tower in 1957, it is now taller than the Chrysler Building by 5.2 metres (17 ft). Excluding transmitters, "
    "the Eiffel Tower is the second tallest free-standing structure in France after the Millau Viaduct."
)

#### **Pretrained model**

In [None]:
# model.predict() is used for prediction
# `context` : input text to summarize, as a string
# `model_name` : pretrained model name or fine-tuned model path
# **kwargs : additional arguments for the model

# Initialize a fresh object for inference with the pretrained model so that the params are reset.
# Params modified for fine-tuning (learning rate etc.) are not carried over.
pretrain_model = Summarizer(summary_type="table")

# Using predict function with necessary params
result=pretrain_model.predict(context=context,model_name="t5-small",model_type="t5")

# Using predict function with additional params
# result=pretrain_model.predict(context=context,model_name="t5-small",model_type="t5",learning_rate=5e-5,min_length=5, max_length=20)

#### **Finetuned model**

In [None]:
# model.predict() is used for prediction
# `context` : input text to summarize, as a string
# `model_name` : pretrained model name or fine-tuned model path
# **kwargs : additional arguments for the model

# Using predict function with necessary params
result=model.predict(context=context,model_name="table/training",model_type="t5")

# Using predict function with additional params
# result=model.predict(context=context,model_name="table/training",model_type="t5",learning_rate=5e-5,min_length=5, max_length=20)

## **Evaluating Model**

We can use classical generative-text evaluation metrics such as **BLEU**, **ROUGE** and **Semantic Similarity** scores to benchmark the model.
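
To make the scores easier to interpret, here is a simplified sketch of ROUGE-1 as a unigram-overlap F1 score. It uses lowercased whitespace tokenization and is not necessarily how `get_evaluation_metrics` computes its scores; `rouge1_f1` is a hypothetical name and the sentences are made up.

```python
from collections import Counter


def rouge1_f1(reference, prediction):
    """Unigram-overlap F1 between a reference and a predicted summary."""
    ref_counts = Counter(reference.lower().split())
    pred_counts = Counter(prediction.lower().split())
    # Clipped overlap: each token counts at most as often as it appears in both texts.
    overlap = sum((ref_counts & pred_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Two of the four words match on each side, so precision = recall = 0.5.
print(rouge1_f1("revenue grew last quarter", "revenue grew this year"))  # 0.5
```

A score close to 0 means the prediction shares almost no words with the reference summary.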

In [None]:
# specify the prediction data path
path = "table/prediction/test/"
# read the predictions file; json.load expects the file to hold a single JSON document
with open(os.path.join(path, "test_generated_predictions.jsonl")) as f:
    pred_df = pd.DataFrame(json.load(f))
pred_df.head(2)

Unnamed: 0,text,actual_summary,predicted_summary,model_name,summary_type,data_type
0,<table> <cell> CHIPS_ DC SUMMIT IL <header> de...,validation data table summary,cell> CHIPS_ DC SUMMIT IL header> destination ...,t5-small,table,train
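
The cell above relies on `json.load`, which parses a file holding a single JSON document. If a predictions file is instead written in JSON Lines form (one object per line), `json.load` raises a `JSONDecodeError` once it reaches the second object. A hedged sketch that handles both layouts; `read_json_records` is a hypothetical helper, not part of the package.

```python
import json


def read_json_records(file_path):
    """Load records from a file holding either one JSON document or JSON Lines."""
    with open(file_path, encoding="utf-8") as f:
        content = f.read()
    try:
        return json.loads(content)
    except json.JSONDecodeError:
        # Fall back to JSON Lines: parse every non-empty line on its own.
        return [json.loads(line) for line in content.splitlines() if line.strip()]
```

The result can be passed to `pd.DataFrame(...)` in the same way as before. Note that a JSON Lines file with exactly one line parses as a single document, so it comes back as one dictionary rather than a list.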


In [None]:
# get_evaluation_metrics() function is used to generate benchmarking scores on different metrics
# `actuals` : Actual text or reference
# `predicted` : Generated text or predictions

bleu, rouge_one, rouge_l, semantic, scores_df = get_evaluation_metrics(
    actuals=pred_df['actual_summary'].tolist(),
    predicted=pred_df['predicted_summary'].tolist(),
)
scores_df.to_csv(os.path.join(path,"metric_scores.csv"))
print(bleu,rouge_one,rouge_l,semantic)

0.0 0.05263157894736842 0.05263157894736842 0.15557736158370972
