# **Experimentation V2**

### **Create Paper Data**

In [1]:
# Import the paperscraper module
from paperscraper import paperscraper

# Create an instance of the class
scraper = paperscraper('2024-10-04')

# Split the "Attention Is All You Need" (AIAYN) paper into sections using the html_subdivide function
AIAYN_sections = scraper.html_subdivide('https://arxiv.org/html/1706.03762v7')

### **Summarisation**

In [2]:
# Import os
import os

# Import the FLAN-T5 model and tokenizer
os.environ['HF_HOME'] = r'C:\Users\josha\AppData\Local\Temp'        # Set cache directory for HuggingFace
from transformers import T5Tokenizer, T5ForConditionalGeneration    # Using Google FLAN-T5

# Create an instance of the tokenizer
tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-base")

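# Create an instance of the model
# Note: T5 defines its own pad token, so the pad_token_id override below is not required;
# setting it to the EOS id is what triggers the attention-mask warning during generation
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-base", pad_token_id=tokenizer.eos_token_id)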

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


In [3]:
# Create summarisation function
def section_summarizer(text):
    # Define prompt
    prompt = f'''You are an AI assistant capable of summarizing academic content on AI and Machine Learning for non-academic readers. Your summaries are given in the third-person, and are about 2-3 sentences in length.
    Summarize the following abstract:
    {text}'''

    # Encode prompt
    input_ids = tokenizer.encode(prompt, return_tensors='pt')

    # Generate response
    output = model.generate(
        input_ids,                  # Encoded prompt
        max_length = 500,           # Maximum length of the generated sequence, in tokens
        num_beams = 5,              # Decode with beam search, keeping 5 candidate sequences
        no_repeat_ngram_size = 2,   # Prevents any two-token sequence from appearing more than once
        early_stopping = False,     # Beam search stops via a heuristic, not as soon as num_beams candidates are complete
    )

    # Return the decoded output
    return tokenizer.decode(output[0], skip_special_tokens=True)
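
The function above passes only `input_ids` to `generate`. Below is a minimal sketch of a variant that also passes the attention mask and truncates the prompt to the model's input limit. It assumes the `tokenizer` and `model` objects from the previous cell are passed in as arguments. The name `summarize_with_mask`, the shortened prompt, and the `max_input_tokens` parameter are hypothetical and introduced here for illustration only.

```python
def summarize_with_mask(text, tokenizer, model, max_input_tokens=512):
    """Hypothetical variant of section_summarizer that passes an attention mask."""
    # Shortened prompt used for this sketch only (an assumption, not the notebook's prompt)
    prompt = (
        "Summarize the following section in 2-3 sentences "
        "for non-academic readers:\n" + text
    )

    # Calling the tokenizer directly returns both input_ids and attention_mask;
    # truncation=True cuts the encoded prompt at max_input_tokens
    inputs = tokenizer(
        prompt,
        return_tensors="pt",
        truncation=True,
        max_length=max_input_tokens,
    )

    # Same generation settings as section_summarizer, plus the attention mask
    output = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=500,
        num_beams=5,
        no_repeat_ngram_size=2,
    )

    return tokenizer.decode(output[0], skip_special_tokens=True)
```

Truncation silently drops everything past the limit, so for long sections the summary would only reflect the opening part of the text.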

In [5]:
# Summarise every section of the paper, keyed by section title
summarised_sections = {}

for section, content in AIAYN_sections.items():
    summarised_sections[section] = section_summarizer(content)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Token indices sequence length is longer than the specified maximum sequence length for this model (6070 > 512). Running this sequence through the model will result in indexing errors
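
The second warning reports a 6070-token input against a 512-token limit. One way to stay under the limit is to split the token ids into overlapping windows and summarise each window separately. The notebook itself does not do this. Below is a minimal sketch of the windowing step in plain Python, where `chunk_token_ids`, `window`, and `overlap` are hypothetical names chosen for illustration.

```python
def chunk_token_ids(token_ids, window=512, overlap=32):
    """Split a list of token ids into overlapping windows of at most `window` ids."""
    if window <= 0:
        raise ValueError("window must be positive")
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")

    step = window - overlap  # How far each new window advances
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + window])
        # Stop once the current window already reaches the end of the input
        if start + window >= len(token_ids):
            break
    return chunks


# Stand-in for a 6070-token section; real ids would come from the tokenizer
ids = list(range(6070))
chunks = chunk_token_ids(ids)
print(len(chunks), max(len(c) for c in chunks))
```

In practice the prompt's instruction text also counts towards the limit, so the window would need to be smaller than 512 to leave room for it.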


In [6]:
summarised_sections

{'Abstract': 'We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2.',
 'Introduction': 'We propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output.',
 'Background': 'The Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution.',
 'Model Architecture': 'We propose an encoder-decoder architecture based on a multi-head self-attention and point-wise fully connected feed-forward networks.',
 'Encoder and Decoder Stacks': 'We employ a residual connection around each of the two sub-layers, followed by 