<a href="https://colab.research.google.com/github/mapcrafter2048/Literature-Review-Generator-ML-17/blob/main/sliding_window1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
pip install datasets

Collecting datasets
  Downloading datasets-2.20.0-py3-none-any.whl.metadata (19 kB)
Collecting pyarrow>=15.0.0 (from datasets)
  Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting requests>=2.32.2 (from datasets)
  Downloading requests-2.32.3-py3-none-any.whl.metadata (4.6 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.4.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Downloading datasets-2.20.0-py3-none-any.whl (547 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m547.8/547.8 kB[0m [31m14.1 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading dill-0.3.8-py3-none-any.whl (116 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m11.8 MB/s

In [2]:
import logging
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, DataCollatorForSeq2Seq, AdamW
from datasets import load_dataset
from tqdm import tqdm

# Set logging level
logging.basicConfig(level=logging.ERROR)

# Constants
DATASET_SIZE = 2400000
SAMPLE_FRACTION = 0.004
TRAIN_TEST_SPLIT = 0.1
MAX_INPUT_LENGTH = 512  # Use 512 for T5
MIN_TARGET_LENGTH = 5
MAX_TARGET_LENGTH = 128
BATCH_SIZE = 8
MAX_EPOCHS = 2
MODEL_CHECKPOINT = "t5-small"
SLIDING_WINDOW_OVERLAP = 128  # Adjust as needed

In [5]:


# Load dataset
data = load_dataset("scillm/scientific_papers-archive")

# Downsample the dataset to 0.4%
sample_size = int(DATASET_SIZE * SAMPLE_FRACTION)
data = data.shuffle(seed=42)
sampled_data = data['train'].train_test_split(train_size=sample_size, seed=42)['train']

# Split the dataset into train and test
train_test_data = sampled_data.train_test_split(test_size=TRAIN_TEST_SPLIT, seed=42)
train_data = train_test_data['train']
test_data = train_test_data['test']

# Load model and tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_CHECKPOINT)
tokenizer = AutoTokenizer.from_pretrained(MODEL_CHECKPOINT)
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

# Sliding window approach for long documents
def sliding_window(text, max_length, overlap):
    tokens = tokenizer.encode(text, truncation=False)
    segments = []
    start = 0
    while start < len(tokens):
        end = min(start + max_length, len(tokens))
        segment = tokens[start:end]
        segments.append(segment)
        if end == len(tokens):
            break
        start += max_length - overlap
    return segments

# Preprocess function
def preprocess_function(examples):
    inputs = ["summarize: " + doc for doc in examples["input"]]
    all_input_ids = []
    all_attention_masks = []
    all_labels = []

    for input_text in inputs:
        # The tokenizer is called outside the sliding_window function
        encoded_segments = tokenizer(input_text, max_length=MAX_INPUT_LENGTH, truncation=True, padding="max_length", return_tensors="np")
        # Convert the encoded segments to a list to ensure correct unpacking
        all_input_ids.append(encoded_segments["input_ids"].tolist())
        all_attention_masks.append(encoded_segments["attention_mask"].tolist())

    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples["output"], max_length=MAX_TARGET_LENGTH, truncation=True, padding="max_length", return_tensors="np")
        # Append each label individually to match the structure of input_ids and attention_masks
        for label in labels["input_ids"]:
            all_labels.append(label.tolist())

    model_inputs = {
        "input_ids": all_input_ids,
        "attention_mask": all_attention_masks,
        "labels": all_labels,
    }
    return model_inputs
# Apply preprocessing
tokenized_train = train_data.map(preprocess_function, batched=True, remove_columns=["input", "output"])
tokenized_test = test_data.map(preprocess_function, batched=True, remove_columns=["input", "output"])

# Fix column lengths
def fix_column_length(dataset, column_name, expected_length):
    def fix_length(example):
        example[column_name] = example[column_name] + [0] * (expected_length - len(example[column_name]))
        return example

    return dataset.map(lambda example: fix_length(example), batched=True)

train_data_fixed = fix_column_length(tokenized_train, "input_ids", MAX_INPUT_LENGTH)
test_data_fixed = fix_column_length(tokenized_test, "input_ids", MAX_INPUT_LENGTH)

# Custom Dataset class
class CustomDataset(Dataset):
    def __init__(self, data):
        self.input_ids = torch.tensor(data["input_ids"])
        self.attention_mask = torch.tensor(data["attention_mask"])
        self.labels = torch.tensor(data["labels"])

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return {
            "input_ids": self.input_ids[idx],
            "attention_mask": self.attention_mask[idx],
            "labels": self.labels[idx]
        }

train_dataset = CustomDataset(train_data_fixed)
test_dataset = CustomDataset(test_data_fixed)

# DataLoader creation
def create_dataloader(dataset, batch_size):
    return DataLoader(dataset, batch_size=batch_size, collate_fn=data_collator)

train_dataloader = create_dataloader(train_dataset, BATCH_SIZE)
test_dataloader = create_dataloader(test_dataset, BATCH_SIZE)

# Setup for training
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
optimizer = AdamW(model.parameters(), lr=5e-5)

# Training loop
def train():
    model.train()
    total_loss = 0
    for batch in tqdm(train_dataloader, desc="Training"):
        optimizer.zero_grad()
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["labels"].to(device)

        # Ensure labels are not empty
        if labels.shape[0] == 0:
            continue

        # Reshape input_ids to have two dimensions
        input_ids = input_ids.view(input_ids.size(0), -1)

        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
    return total_loss / len(train_dataloader)

# Evaluation loop
def evaluate():
    model.eval()
    total_loss = 0
    with torch.no_grad():
        for batch in tqdm(test_dataloader, desc="Evaluating"):
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["labels"].to(device)

            # Ensure labels are not empty
            if labels.shape[0] == 0:
                continue

            # Reshape input_ids to have two dimensions
            input_ids = input_ids.view(input_ids.size(0), -1)

            outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss

            total_loss += loss.item()
    return total_loss / len(test_dataloader)

# Training and evaluation
for epoch in range(MAX_EPOCHS):
    print(f"Epoch {epoch + 1}/{MAX_EPOCHS}")
    train_loss = train()
    print(f"Training Loss: {train_loss:.4f}")

    eval_loss = evaluate()
    print(f"Evaluation Loss: {eval_loss:.4f}")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading readme:   0%|          | 0.00/8.27k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/4.65G [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/2.89G [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/141M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/162M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/141M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/164M [00:00<?, ?B/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating validation split: 0 examples [00:00, ? examples/s]

Generating test split: 0 examples [00:00, ? examples/s]

config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/242M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/2.32k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

Map:   0%|          | 0/8640 [00:00<?, ? examples/s]



Map:   0%|          | 0/960 [00:00<?, ? examples/s]

Map:   0%|          | 0/8640 [00:00<?, ? examples/s]

Map:   0%|          | 0/960 [00:00<?, ? examples/s]



Epoch 1/2


  batch["labels"] = torch.tensor(batch["labels"], dtype=torch.int64)
Training: 100%|██████████| 1080/1080 [06:13<00:00,  2.89it/s]


Training Loss: 2.7964


Evaluating: 100%|██████████| 120/120 [00:14<00:00,  8.23it/s]


Evaluation Loss: 2.3616
Epoch 2/2


Training: 100%|██████████| 1080/1080 [06:19<00:00,  2.85it/s]


Training Loss: 2.4484


Evaluating: 100%|██████████| 120/120 [00:14<00:00,  8.21it/s]

Evaluation Loss: 2.2726





In [9]:
def generate_summary(text, model, tokenizer, max_input_length=512, max_target_length=128):
    """
    Generates a summary for a given text using the specified model and tokenizer.
    Args:
        text (str): The input text to summarize.
        model (transformers.PreTrainedModel): The pre-trained model for summarization.
        tokenizer (transformers.PreTrainedTokenizer): The tokenizer for the model.
        max_input_length (int): Maximum input length for the model.
        max_target_length (int): Maximum output length for the generated summary.
    Returns:
        str: The generated summary.
    """
    # Split the text into segments using a sliding window approach
    def sliding_window(text, max_length, overlap):
        tokens = tokenizer.encode(text, truncation=False)
        segments = []
        start = 0
        while start < len(tokens):
            end = min(start + max_length, len(tokens))
            segment = tokens[start:end]
            decoded_segment = tokenizer.decode(segment, skip_special_tokens=True)
            segments.append(decoded_segment)
            if end == len(tokens):
                break
            start += max_length - overlap
        return segments

    # Process the text using the sliding window function
    segments = sliding_window(text, max_input_length, SLIDING_WINDOW_OVERLAP)

    # Encode the segments
    encoded_segments = tokenizer(
        segments,
        max_length=max_input_length,
        truncation=True,
        padding="max_length",
        return_tensors="pt"
    )

    # Move tensors to the appropriate device
    input_ids = encoded_segments["input_ids"].to(device)
    attention_mask = encoded_segments["attention_mask"].to(device)

    # Generate summaries for each segment
    model.eval()
    with torch.no_grad():
        outputs = model.generate(
            input_ids=input_ids,
            attention_mask=attention_mask,
            max_length=max_target_length,
            num_beams=4,
            early_stopping=True
        )

    # Decode the outputs to text
    decoded_summaries = tokenizer.batch_decode(outputs, skip_special_tokens=True)

    # Combine the segment summaries into a single summary
    summary = ' '.join(decoded_summaries)
    return summary

# Example usage
text = """The quest for more efficient and sustainable energy storage solutions is pivotal in advancing the transition to renewable energy systems. This research paper delves into the latest advancements in lithium-ion battery technologies, focusing on enhancing energy storage efficiency and addressing current limitations. By integrating experimental studies with theoretical modeling, this study offers a comprehensive examination of novel materials, design innovations, and performance metrics that contribute to the improvement of lithium-ion battery systems.

The research begins with an overview of the fundamental principles of lithium-ion battery technology, including the roles of cathodes, anodes, electrolytes, and separators in energy storage and release processes. Despite their widespread use, conventional lithium-ion batteries face challenges such as limited energy density, safety concerns, and degradation over time. This study addresses these issues by investigating several innovative approaches to improving battery performance and longevity.

One significant area of focus is the development of advanced cathode materials. The study explores the potential of high-capacity cathode materials such as lithium-rich layered oxides and high-voltage spinel structures. Experimental results demonstrate that these new materials can increase the energy density of lithium-ion batteries by up to 25%, compared to traditional cathode materials. The research also examines the impact of these materials on battery cycle life and stability, revealing that they offer improved performance under high-stress conditions, thereby enhancing the overall reliability of the battery systems.

Another critical aspect of the study is the optimization of anode materials. Silicon-based anodes are highlighted as a promising alternative to conventional graphite anodes due to their higher theoretical capacity. However, silicon anodes suffer from significant volume expansion and mechanical degradation during cycling. The research investigates various strategies to mitigate these issues, including the use of silicon-carbon composites and novel binder materials. The findings indicate that these advanced anode materials can significantly enhance the capacity and cycle stability of lithium-ion batteries, achieving a capacity increase of up to 30% while maintaining stability over extended charge-discharge cycles.

The paper also addresses the development of advanced electrolytes and separators. Solid-state electrolytes are examined as a safer and more efficient alternative to conventional liquid electrolytes. The study presents new solid-state electrolyte formulations with improved ionic conductivity and thermal stability. Experimental data show that these solid-state electrolytes can enhance battery safety by reducing the risk of leakage and thermal runaway while also improving overall energy efficiency. Additionally, the research explores the integration of advanced separators that enhance the stability and performance of the battery cells, contributing to longer battery life and higher energy density.

To provide a holistic view of the advancements in lithium-ion battery technology, the study includes a detailed analysis of battery management systems (BMS). The research highlights the importance of BMS in optimizing battery performance, balancing cell voltages, and ensuring safety. The study proposes novel algorithms and software enhancements for BMS that improve the accuracy of state-of-charge (SOC) and state-of-health (SOH) estimations, leading to more efficient battery utilization and extended service life.

The paper also considers the environmental and economic implications of adopting advanced lithium-ion battery technologies. By improving energy density and extending battery life, these innovations can reduce the overall cost of energy storage systems and decrease the environmental impact associated with battery production and disposal. The study emphasizes the importance of developing sustainable and cost-effective manufacturing processes for new battery materials and technologies.

In conclusion, this research paper offers a comprehensive overview of recent advancements in lithium-ion battery technologies, focusing on the enhancement of energy storage efficiency through novel materials, design improvements, and system optimizations. The findings reveal significant potential for increasing energy density, improving cycle stability, and enhancing safety in lithium-ion batteries. The study underscores the importance of continued research and development in this field to address current limitations and support the transition to renewable energy systems. As the demand for efficient and sustainable energy storage solutions grows, the insights provided by this research will play a crucial role in advancing battery technologies and supporting broader efforts towards a sustainable energy future.

The paper calls for further investigation into emerging technologies and materials, as well as the development of scalable manufacturing processes to bring these innovations to market. By addressing both technical and practical challenges, the research aims to contribute to the development of next-generation energy storage solutions that meet the demands of modern energy systems and contribute to a cleaner, more sustainable world."""
summary = generate_summary(text, model, tokenizer)
print(summary)


Token indices sequence length is longer than the specified maximum sequence length for this model (973 > 512). Running this sequence through the model will result in indexing errors


-based lithium-ion battery technology is a key step in advancing the transition to renewable energy systems. The research focuses on the development of advanced electrolytes and separators. The research focuses on the development of advanced solid-state electrolytes and separators. The research explores the potential of high-capacity cathode materials such as lithium-rich layered oxides and high-voltage spinel structures. The research focuses on the development of advanced electrolytes and separators. batteries. The research focuses on the development of advanced solid-state electrolytes and separators that enhance battery safety by reducing the risk of leakage and thermal runaway while also improving overall energy efficiency. The research highlights the importance of developing sustainable and cost-effective manufacturing processes for new lithium-ion battery technologies. The research highlights the importance of developing sustainable and cost-effective manufacturing processes for 