## Translation + Summarization Combined

This notebook combines two Czech-to-English translation models with two summarization models: the output of each translation model is passed to the summarizers.

## Installation

**IMPORTANT: Download the `val_data_set.txt.rtf` file from the `roBERTaSummarization` folder of the GitHub repo before running this notebook.**

In [None]:
!gdown '1ogNYfR6Xql88vZA_sP6aQrQCTgPARgnd&confirm=t' #get Zac model from google drive

Downloading...
From: https://drive.google.com/uc?id=1ogNYfR6Xql88vZA_sP6aQrQCTgPARgnd&confirm=t
To: /content/ZAC_RNN_translator.tar.gz
100% 60.2M/60.2M [00:00<00:00, 147MB/s]


In [None]:
!gdown '12jWWU39omv_1sGwMkqimAT4tkRciJwDE&confirm=t' #get Oriana's model from google drive

Downloading...
From: https://drive.google.com/uc?id=12jWWU39omv_1sGwMkqimAT4tkRciJwDE&confirm=t
To: /content/translator.tar.gz
100% 118M/118M [00:00<00:00, 177MB/s] 


In [None]:
#Zachary
!pip install "tensorflow-text>=2.10"
!pip install einops

#Oriana
!apt install --allow-change-held-packages libcudnn8=8.1.0.77-1+cuda11.2
!pip install -q -U tensorflow-text
!pip install matplotlib


#Anas
!git clone https://github.com/google/seq2seq.git
!pip install -e seq2seq
!pip install dill==0.3.4
!pip install datasets==1.0.2
!rm -f seq2seq_trainer.py
!wget https://raw.githubusercontent.com/huggingface/transformers/master/examples/seq2seq/seq2seq_trainer.py

#Dagar
!pip install transformers


## Imports

In [None]:
#Oriana
import logging
import time
import numpy as np
import matplotlib.pyplot as plt
import tensorflow_datasets as tfds
import tensorflow as tf
import tensorflow_text
import shutil



#Zachary
import numpy as np
import typing
from typing import Any, Tuple
import einops
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import tensorflow as tf
import tensorflow_text as tf_text
import pathlib

#Anas
from transformers import RobertaTokenizerFast
from transformers import EncoderDecoderModel
from transformers import Seq2SeqTrainer
from transformers import TrainingArguments
from dataclasses import dataclass, field
from typing import Optional


#Dagar
import torch
import json 
from transformers import T5Tokenizer, T5ForConditionalGeneration, T5Config
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

## Model 1 (NMT with a Transformer and Keras) (Oriana)

Extract the `translator.tar.gz` archive:

In [None]:
!tar -xzvf /content/translator.tar.gz

._translator
tar: Ignoring unknown extended header keyword 'LIBARCHIVE.xattr.com.apple.quarantine'
tar: Ignoring unknown extended header keyword 'LIBARCHIVE.xattr.com.apple.macl'
translator/
translator/._.DS_Store
tar: Ignoring unknown extended header keyword 'SCHILY.fflags'
tar: Ignoring unknown extended header keyword 'LIBARCHIVE.xattr.com.apple.quarantine'
tar: Ignoring unknown extended header keyword 'LIBARCHIVE.xattr.com.apple.FinderInfo'
translator/.DS_Store
translator/._variables
tar: Ignoring unknown extended header keyword 'LIBARCHIVE.xattr.com.apple.quarantine'
translator/variables/
translator/._saved_model.pb
tar: Ignoring unknown extended header keyword 'LIBARCHIVE.xattr.com.apple.quarantine'
translator/saved_model.pb
translator/._assets
tar: Ignoring unknown extended header keyword 'LIBARCHIVE.xattr.com.apple.quarantine'
translator/assets/
translator/assets/._en_vocab.txt
tar: Ignoring unknown extended header keyword 'LIBARCHIVE.xattr.com.apple.quarantine'
translator/asset
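The `._*` and `.DS_Store` entries in the listing above are macOS metadata files that the saved model does not need. As a minimal sketch, the archive could instead be extracted from Python while skipping those entries. The helper name `extract_without_macos_metadata` is hypothetical and not part of the original notebook.

```python
import tarfile
from pathlib import Path


def extract_without_macos_metadata(archive_path, dest_dir):
    """Extract a .tar.gz archive, skipping AppleDouble (._*) and .DS_Store entries.

    Hypothetical helper for illustration; returns the names of extracted members.
    """
    with tarfile.open(archive_path, "r:gz") as archive:
        wanted = [
            member
            for member in archive.getmembers()
            if not Path(member.name).name.startswith("._")
            and Path(member.name).name != ".DS_Store"
        ]
        archive.extractall(path=dest_dir, members=wanted)
    return [member.name for member in wanted]
```

Under these assumptions, `extract_without_macos_metadata("/content/translator.tar.gz", ".")` would leave only the `translator/` model files in the working directory.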

### Testing



Input:         : Ale mělo by to být stabilní, dokud nedokončíme vstřikování.

Prediction     : but it should be stable until we complete the vaccine .

Ground truth   : But the site should remain stable until we finish the infusion.


In [None]:
Transformer = tf.saved_model.load('./translator')

In [None]:
Transformer('Kde je toaleta').numpy()

b'where toilets are'

In [None]:
Transformer('Ale mělo by to být stabilní, dokud nedokončíme vstřikování.').numpy()

b'but it should be stable until we complete the vaccine .'

In [None]:
print(Transformer('Ale mělo by to být stabilní, dokud nedokončíme vstřikování.'))

tf.Tensor(b'but it should be stable until we complete the vaccine .', shape=(), dtype=string)
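As the outputs above show, the model returns a scalar string tensor whose `.numpy()` value is a `bytes` object, so it has to be decoded before it can be written to a text file. A small sketch of that conversion follows; `to_text` is a hypothetical helper name, and it only assumes that the prediction either is `bytes` or exposes a `.numpy()` method.

```python
def to_text(prediction, encoding="utf-8"):
    """Return a Python str from bytes or from an object exposing .numpy().

    Hypothetical helper for illustration only.
    """
    # TensorFlow string tensors expose .numpy(), which yields a bytes object.
    if hasattr(prediction, "numpy"):
        prediction = prediction.numpy()
    if isinstance(prediction, bytes):
        return prediction.decode(encoding)
    return str(prediction)
```

With the model loaded, `to_text(Transformer('Kde je toaleta'))` would give a plain string instead of `b'...'`.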


In [None]:
sentence = 'Ale mělo by to být stabilní, dokud nedokončíme vstřikování.'
ground_truth = 'But the site should remain stable until we finish the infusion.'

In [None]:
print(f'{"Input:":15s}: {sentence}')
print(f'{"Prediction":15s}: {Transformer(sentence).numpy().decode("utf-8")}')
print(f'{"Ground truth":15s}: {ground_truth}')

Input:         : Ale mělo by to být stabilní, dokud nedokončíme vstřikování.
Prediction     : but it should be stable until we complete the vaccine .
Ground truth   : But the site should remain stable until we finish the infusion.
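To put a rough number on how close the prediction is to the ground truth, one option is clipped unigram precision: the share of predicted words that also occur in the reference. This is only an illustrative sketch, not the metric used to train or evaluate the model, and `unigram_precision` is a hypothetical name.

```python
import re
from collections import Counter


def unigram_precision(prediction, reference):
    """Fraction of predicted word tokens that also appear in the reference.

    Matches are clipped by the reference counts, so a repeated word in the
    prediction is only credited as often as it occurs in the reference.
    """
    pred_tokens = re.findall(r"\w+", prediction.lower())
    ref_counts = Counter(re.findall(r"\w+", reference.lower()))
    if not pred_tokens:
        return 0.0
    matched = sum(
        min(count, ref_counts[token])
        for token, count in Counter(pred_tokens).items()
    )
    return matched / len(pred_tokens)


prediction = "but it should be stable until we complete the vaccine ."
reference = "But the site should remain stable until we finish the infusion."
print(unigram_precision(prediction, reference))  # 6 of 10 predicted words match: 0.6
```

A word-overlap score like this ignores word order and meaning, so it would not flag that "vaccine" is a wrong rendering of "infusion" any more strongly than any other mismatch.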


In [None]:
text = """Ale mělo by to být stabilní, dokud nedokončíme vstřikování.
Ale mělo by to být stabilní, dokud nedokončíme vstřikování.
Ale mělo by to být stabilní, dokud nedokončíme vstřikování.
Ale mělo by to být stabilní, dokud nedokončíme vstřikování."""

# split into lines, keeping the line endings as file.readlines() would
lines = text.splitlines(keepends=True)

print("Czech Lines:\n")
for l in lines:
  print(l)

print("\nEnglish Lines:\n")
for l in lines:
  print(Transformer(l).numpy().decode("utf-8"))

print("\nEnglish Lines w/ End Lines:\n")
for l in lines:
  print(f'{Transformer(l).numpy().decode("utf-8")}\n')

# build one string with a newline in front of every translated line
joined_lines = ''
print("\nJoining English Lines:\n")
for l in lines:
  string = str(Transformer(l).numpy().decode("utf-8"))
  joined_lines += '\n'
  joined_lines += string
print(joined_lines)

print("\nJoining List:\n")
joined = ''.join(joined_lines)  # joined_lines is already a single string
print(joined)

Czech Lines:

Ale mělo by to být stabilní, dokud nedokončíme vstřikování.

Ale mělo by to být stabilní, dokud nedokončíme vstřikování.

Ale mělo by to být stabilní, dokud nedokončíme vstřikování.

Ale mělo by to být stabilní, dokud nedokončíme vstřikování.

English Lines:

but it should be stable until we complete the vaccine .
but it should be stable until we complete the vaccine .
but it should be stable until we complete the vaccine .
but it should be stable until we complete the vaccine .

English Lines w/ End Lines:

but it should be stable until we complete the vaccine .

but it should be stable until we complete the vaccine .

but it should be stable until we complete the vaccine .

but it should be stable until we complete the vaccine .


Joining English Lines:


but it should be stable until we complete the vaccine .
but it should be stable until we complete the vaccine .
but it should be stable until we complete the vaccine .
but it should be stable until we complete the vacc

### Translation Model 1 Runner

In [None]:
Transformer = tf.saved_model.load('/content/translator')

In [None]:
# First parameter - the path to the file that needs to be translated
# Second parameter - the path the output is written to
# only this function will be called by main()

def Translation_Model_1_Runner(input_filepath, output_filepath):

  # read the input file as a list of lines
  lines = Translation_Model_1_Read_File(input_filepath)

  print("Making Predications...")

  # translate each line and decode the prediction from bytes to str
  translated_lines = [Transformer(l).numpy().decode("utf-8") for l in lines]

  # combine the translated lines, one per line
  translated_text = '\n'.join(translated_lines)

  # write function
  write_to_file(output_filepath, translated_text)

In [None]:
def Translation_Model_1_Read_File(filepath):
  # read the Czech input as UTF-8 and return it as a list of lines
  with open(filepath, 'r', encoding='utf-8') as text:
    lines = text.readlines()
  return lines
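Note that `readlines()` keeps the trailing newline on every line and also returns blank lines, all of which are then passed to the model. A possible variant, sketched here under the assumption that blank lines carry no content worth translating, strips each line and drops the empty ones. `read_nonempty_lines` is a hypothetical helper, not part of the notebook.

```python
def read_nonempty_lines(filepath, encoding="utf-8"):
    """Read a text file and return its non-blank lines without line endings."""
    with open(filepath, "r", encoding=encoding) as handle:
        return [line.strip() for line in handle if line.strip()]
```

Whether stripping is desirable depends on how the saved model was trained, so treat this as an option to try rather than a required change.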

## Model 2 (RNN model with Attention) (Zachary)

In [None]:
!tar -xvpf /content/ZAC_RNN_translator.tar.gz #make sure the file path is correct

ZAC_RNN_translator/
ZAC_RNN_translator/fingerprint.pb
ZAC_RNN_translator/variables/
ZAC_RNN_translator/variables/variables.data-00000-of-00001
ZAC_RNN_translator/variables/variables.index
ZAC_RNN_translator/assets/
ZAC_RNN_translator/saved_model.pb


In [None]:
RNN_model = tf.saved_model.load('/content/ZAC_RNN_translator') #make sure the file path is correct

### Testing outputs

In [None]:
inputs = [
    'Je tady opravdu zima.', # "It's really cold here."
    'Tohle je můj život.', # "This is my life."
    'V jeho pokoji je nepořádek.' # "His room is a mess"
]

In [None]:
%%time
for t in inputs:
  print(RNN_model.translate([t])[0].numpy().decode())

print()

its really cold here 
this is my life [UNK] 
in his room is clean up in your room 

CPU times: user 2.96 s, sys: 636 ms, total: 3.59 s
Wall time: 5.71 s


In [None]:
#the first parameter is the path to the file that needs to be translated
#the second parameter is the path the output is written to
#a write function has been provided
#only this function will be called by main()
def Translation_Model_2_Runner(input_filepath, output_filepath):
  trans = ''

  #read the input file as a list of lines
  text = Translation_Model_2_Read_File(input_filepath)

  #translate one line at a time and append the decoded result
  for t in text:
    trans += RNN_model.translate([t])[0].numpy().decode()

  write_to_file(output_filepath, trans)



In [None]:
def Translation_Model_2_Read_File(filepath):
  #read the Czech input as UTF-8 and return it as a list of lines
  with open(filepath, "r", encoding="utf-8") as file1:
    lines = file1.readlines()
  return lines
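The two translation runners combine their per-line results differently: Model 1 joins them with newlines, while Model 2 concatenates them with no separator. A sketch of a shared joining step is shown below. Both `join_translations` and the stand-in `fake_translate` are hypothetical; in the notebook the translate callable would wrap `RNN_model.translate([line])[0].numpy().decode()` or the Model 1 equivalent.

```python
def join_translations(lines, translate, separator="\n"):
    """Translate each non-blank line and join the results with `separator`."""
    translated = [translate(line).strip() for line in lines if line.strip()]
    return separator.join(translated)


def fake_translate(line):
    # Stand-in for a real model call; upper-cases the text and, like the
    # RNN output shown earlier, leaves a trailing space.
    return line.upper() + " "


print(join_translations(["ahoj\n", "\n", "svete\n"], fake_translate))
```

Passing the translate function as an argument keeps the joining logic independent of which model produced the text.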

## Model 3 (Abstractive roBERTa model) (Anas)

In [None]:
model=EncoderDecoderModel.from_pretrained("./checkpoint-6432")
tokenizer=RobertaTokenizerFast.from_pretrained("roberta-base")

**IMPORTANT: The next cell needs the `val_data_set.txt.rtf` file from the `roBERTaSummarization` folder of the GitHub repo.**

In [None]:
val_dataset = open('val_data_set.txt.rtf', 'r')

In [None]:
#the first parameter is the path to the file that needs to be summarized
#the second parameter is the path the output is written to
#a write function has been provided
#only this function will be called by main()
def Summarization_Model_1_Runner(input_filepath, output_filepath):
  #use the GPU when one is available, otherwise fall back to the CPU
  device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
  model.to(device)

  #get text from the input_filepath
  text = Summarization_Model_1_Read_File(input_filepath)

  #get rid of the newline char and preprocess
  preprocess_text = text.strip().replace("\n","")
  roBERTa_Text = "summarize: "+preprocess_text

  #tokenize, truncating to the 512-token input limit of roberta-base
  tokenized_text = tokenizer(roBERTa_Text,
                             truncation=True,
                             max_length=512,
                             return_tensors="pt").to(device)

  #summarize (warmup_steps and eval_steps are training arguments, not generation arguments)
  output_ids = model.generate(tokenized_text.input_ids,
                              attention_mask=tokenized_text.attention_mask,
                              min_length=8,
                              max_length=40)

  #decode; batch_decode returns a list, so take the single summary string
  summarized_text = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]

  #write to file
  write_to_file(output_filepath, summarized_text)

In [None]:
def Summarization_Model_1_Read_File(filepath):
  #return the whole file as a single string
  with open(filepath, "r", encoding="utf-8") as file1:
    text = file1.read()
  return text
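One detail worth checking in the summarization preprocessing: `replace("\n", "")` removes newlines without inserting a space, so the last word of one translated line is glued to the first word of the next. The sketch below contrasts that with collapsing all whitespace into single spaces. `flatten_whitespace` is a hypothetical helper and the sample text is made up for illustration.

```python
def flatten_whitespace(text):
    """Collapse newlines and repeated spaces into single spaces."""
    return " ".join(text.split())


sample = "we complete the vaccine .\nbut it should be stable"

glued = sample.strip().replace("\n", "")   # what the runners currently do
spaced = flatten_whitespace(sample)        # keeps a space between lines

print(glued)   # we complete the vaccine .but it should be stable
print(spaced)  # we complete the vaccine . but it should be stable
```

Glued tokens such as `.but` can be split differently by the tokenizer, so this may affect summary quality; that effect has not been measured here.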

## Model 4 (T5-small model) (Dagar)

In [None]:
def Summarization_Model_2_Runner(input_filepath, output_filepath):
  #get the custom trained model and pretrained tokenizer
  model = AutoModelForSeq2SeqLM.from_pretrained("Dagar/t5-small-science-papers-NIPS")
  tokenizer = AutoTokenizer.from_pretrained('t5-small')
  device = torch.device('cpu')
  model.to(device)

  #get text from the input_filepath
  text = Summarization_Model_2_Read_File(input_filepath)

  #get rid of the newline char and add the T5 task prefix
  preprocess_text = text.strip().replace("\n","")
  t5_prepared_Text = "summarize: "+preprocess_text

  tokenized_text = tokenizer.encode(t5_prepared_Text, return_tensors="pt").to(device)

  #summarize with beam search
  summary_ids = model.generate(tokenized_text,
                                    num_beams=4,
                                    no_repeat_ngram_size=2,
                                    min_length=30,
                                    max_length=128,
                                    early_stopping=True)

  #decode the best beam into a string
  summarized_text = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

  #write to file
  write_to_file(output_filepath, summarized_text)

In [None]:
def Summarization_Model_2_Read_File(filepath):
  #return the whole file as a single string
  with open(filepath, "r", encoding="utf-8") as file1:
    text = file1.read()
  return text

## Write Function

In [None]:
#writes text to a file, replacing any existing content
#text needs to be a string
def write_to_file(filepath, text):
  with open(filepath, "w", encoding="utf-8") as file1:
    file1.write(text)
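Because `write_to_file` expects a string and an existing target directory, a slightly more defensive variant can be useful while debugging the runners. The sketch below is an assumption-laden alternative, not the notebook's function: `write_text_utf8` is a hypothetical name, it creates missing parent directories, and it rejects non-string input such as the list returned by `batch_decode`.

```python
from pathlib import Path


def write_text_utf8(filepath, text):
    """Write `text` to `filepath` as UTF-8, creating parent directories."""
    if not isinstance(text, str):
        raise TypeError(f"text must be a str, got {type(text).__name__}")
    path = Path(filepath)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return path
```

With this variant, a call such as `write_text_utf8("./Oriana/Translation_1.txt", translated_text)` would not depend on the `mkdir` cell having been run first.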

## Runner

In [None]:
!mkdir -p Oriana Zachary Anas Dagar

In [None]:
def main(input_filepath):
  Translation_Model_1_Runner(input_filepath, "./Oriana/Translation_1.txt")

  Translation_Model_2_Runner(input_filepath, "./Zachary/Translation_2.txt")

  Summarization_Model_1_Runner("./Oriana/Translation_1.txt", "./Anas/Summarization_1_1.txt")
  Summarization_Model_1_Runner("./Zachary/Translation_2.txt", "./Anas/Summarization_1_2.txt")

  Summarization_Model_2_Runner("./Oriana/Translation_1.txt", "./Dagar/Summarization_2_1.txt")
  Summarization_Model_2_Runner("./Zachary/Translation_2.txt", "./Dagar/Summarization_2_2.txt")

  print("DONE!")

In [None]:
!gdown '1NJ_m72cSJL0ioGWXH_gee5kPzGtejKDN&confirm=t' #this gives you a small test paper

Downloading...
From: https://drive.google.com/uc?id=1NJ_m72cSJL0ioGWXH_gee5kPzGtejKDN&confirm=t
To: /content/test_paper.txt
100% 4.69k/4.69k [00:00<00:00, 8.12MB/s]


In [None]:
main("/content/test_paper.txt") #IMPORTANT - you need to give it the path of the paper!!!

Making Predications...


Token indices sequence length is longer than the specified maximum sequence length for this model (623 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (623 > 512). Running this sequence through the model will result in indexing errors


DONE!
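The length warnings above appear because the translated paper tokenizes to more than the 512 tokens the tokenizer reports as its maximum. One common way to handle long inputs, sketched here as an assumption rather than something the notebook does, is to split the token IDs into chunks of at most 512 and summarize each chunk separately. `chunk_token_ids` is a hypothetical helper that works on a plain list of IDs and does not re-add the `summarize:` prefix or special tokens to each chunk.

```python
def chunk_token_ids(token_ids, max_length=512):
    """Split a list of token IDs into consecutive chunks of at most max_length."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    return [
        token_ids[start:start + max_length]
        for start in range(0, len(token_ids), max_length)
    ]


# A 623-token input, as in the warning, becomes two chunks.
chunks = chunk_token_ids(list(range(623)))
print([len(chunk) for chunk in chunks])  # [512, 111]
```

Splitting on token boundaries can cut a sentence in half, so splitting on sentence boundaries first may give better summaries.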


**IMPORTANT: The `main_2` function below is a backup that leaves out Anas's model, in case the `val_data_set.txt.rtf` file could not be downloaded.**

In [None]:
def main_2(input_filepath): #backup without Anas's model, in case val_data_set.txt.rtf could not be downloaded
  Translation_Model_1_Runner(input_filepath, "./Oriana/Translation_1.txt")

  Translation_Model_2_Runner(input_filepath, "./Zachary/Translation_2.txt")


  Summarization_Model_2_Runner("./Oriana/Translation_1.txt", "./Dagar/Summarization_2_1.txt")
  Summarization_Model_2_Runner("./Zachary/Translation_2.txt", "./Dagar/Summarization_2_2.txt")

  print("DONE!")
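Rather than choosing between the two entry points by hand, the choice could be made by checking whether the validation file is present. This is a minimal sketch under that assumption; `pick_entry_point` is a hypothetical helper and the default path simply mirrors the file name used earlier in the notebook.

```python
import os


def pick_entry_point(val_data_path="val_data_set.txt.rtf"):
    """Return 'main' when the validation file exists, otherwise 'main_2'."""
    return "main" if os.path.isfile(val_data_path) else "main_2"
```

In the notebook this could be used as `globals()[pick_entry_point()]("/content/test_paper.txt")`, although calling the chosen function explicitly is easier to read.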

In [None]:
main_2("/content/test_paper.txt")

Making Predications...


Token indices sequence length is longer than the specified maximum sequence length for this model (643 > 512). Running this sequence through the model will result in indexing errors


DONE!


## Clean Up

In [None]:
!rm -r Oriana
!rm -r Zachary
!rm -r Anas
!rm -r Dagar

In [None]:
shutil.rmtree('./translator')