This notebook focuses on generating summaries by using BART large CNN model (https://huggingface.co/facebook/bart-large-cnn) from the impact sentences extracted earlier. It produces two distinct summaries: one highlighting structural impacts and the other detailing community impacts. The summarized results will be available in .txt format.

Please ensure to paste the input_path, which is the location of 0_input.ipynb file, at the start of this notebook. This step is the only requirement to load all necessary information for the execution of the code.

Recommended Google Colab Runtime Type: A100 GPU (preferred) or V100 GPU, as this notebook involves running machine learning models.

In [1]:
# Input file path (must navigate at the beginning of each file)
input_path = "/content/drive/My Drive/ImpactDataMining/Turkiye_Earthquake/Result"

All the below sections automatically retrieve data from the 0_input.ipynb file, as well as results from previous notebooks in this series. The code is designed to run using this information, so no further edits are required beyond this point.

In [2]:
!pip install transformers
import os
import json
import torch
import math

from google.colab import drive
from transformers import pipeline, BartTokenizer, BartForConditionalGeneration



In [3]:
import time

start_time = time.time()

In [4]:
def current_path():
  print("Current working directory")
  print(os.getcwd())
  print()

current_path()
drive.mount('/content/drive')
os.chdir(input_path)
current_path()

Current working directory
/content

Mounted at /content/drive
Current working directory
/content/drive/My Drive/ImpactDataMining/Turkiye_Earthquake/Result



In [5]:
with open('0_input.json', 'r') as file:
    data = json.load(file)
    result_path = data['result_path']
    overlap_tokens = data['overlap_tokens']

In [6]:
os.makedirs(result_path, exist_ok=True)
os.chdir(result_path)
current_path()

Current working directory
/content/drive/My Drive/ImpactDataMining/Turkiye_Earthquake/Result



In [7]:
with open('2b_results.json', 'r') as file:
    data = json.load(file)
    sent_nested = data['sent_nested']
    sent_struct_nested = data['sent_struct_nested']
    sent_comm_nested = data['sent_comm_nested']

In [8]:
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-mnli')
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

config.json:   0%|          | 0.00/1.58k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.63G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/363 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/1.15k [00:00<?, ?B/s]

In [9]:
def summary_chunk(text):
  result = []
  for n in text:
    if len(n) > 1:
      para = ' '.join([sent for sent in n])
      if len(para.split()) > 30:
        inputs = tokenizer(para, max_length=1024, return_tensors='pt', truncation=True)
        inputs.to(device)

        length_of_inputs = inputs['input_ids'].shape[1]
        summary_ids = model.generate(inputs['input_ids'], max_length=length_of_inputs, min_length=30,
                           do_sample=False, early_stopping=True, num_beams=4)
        result.append([tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summary_ids][0])
      else:
        result.append(para)
    else:
      result.append(n[0])
  return result

In [10]:
def sliding_window(text, max_tokens=1024, overlap_tokens=200):
  para = text
  para_len = [len(n.split()) for n in para]
  chunks_len = []; chunks = []

  running_sum_len = 0; running_sum_text = '';
  overlap = []; overlap_1 = [];

  for count, text in zip(para_len, para):
    if running_sum_len + count > max_tokens:
      chunks_len.append(running_sum_len)
      chunks.append(running_sum_text)

      overlap_count = 0; overlap_text = '';
      for prev_elem, prev_text in zip(reversed(overlap), reversed(overlap_1)):
        if overlap_count + prev_elem < overlap_tokens:
          overlap_count = overlap_count + prev_elem
          overlap_text = ' '.join([overlap_text, prev_text])
        else:
          break

      running_sum_len = overlap_count + count
      running_sum_text = ' '.join([overlap_text, text])

      overlap = []; overlap_1 = []
    else:
      running_sum_len = running_sum_len + count
      running_sum_text = ' '.join([running_sum_text, text])

      overlap.append(count)
      overlap_1.append(text)

  if running_sum_len:
    chunks_len.append(running_sum_len)
    chunks.append(running_sum_text)
  return chunks

def summarize_with_bart(chunk):
  summary = []
  if len(chunk.split()) > 30:
    inputs = tokenizer(chunk, max_length=1024, return_tensors='pt', truncation=True)
    inputs.to(device)

    inputs_len = inputs['input_ids'].shape[1]
    summary_ids = model.generate(inputs['input_ids'], max_length=inputs_len, min_length=512,
                  do_sample=False, early_stopping=True, num_beams=4)
    summary.append([tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False) for g in summary_ids][0])
  else:
    summary.append(chunk)
  return summary

def hierarchical_summarize(text, max_tokens=1024, overlap_tokens=200):
    chunks = sliding_window(text, max_tokens=max_tokens, overlap_tokens=overlap_tokens)
    count = 0
    summaries = []
    for chunk in chunks:
        summary = summarize_with_bart(chunk)
        summaries.append(summary[0])
    aggregated_summary = " ".join([n for n in summaries])

    if len(aggregated_summary.split()) > max_tokens:
        return hierarchical_summarize(summaries, max_tokens=max_tokens)
    else:
        return aggregated_summary

In [11]:
result_struct_summ = summary_chunk(sent_struct_nested)
summ_struct = hierarchical_summarize(result_struct_summ, overlap_tokens=overlap_tokens)

print('Structural impact summary:')
summ_struct

Structural impact summary:


"As of February 19, 2023, the number of reported completely and partially collapsed buildings was 28,362. 75,717 buildings and 306,563 dwellings were either collapsed or severely damaged. Around half of the buildings in the affected regions of Türikiye were constructed before 2000. Damage to personal property is expected to be significant. Unfortunately, insured losses may be only around $1 billion (USD) due to low insurance coverage in the region. The impact of the quake on buildings is clearly dependent on location and location of the location. This does not mean that buildings at this location may be able to withstand this shaking without collapse, while sustaining moderate to heavy damage. The most vulnerable buildings are those constructed after 2000 and considered to be vulnerable to earthquake events. The 1999 Kocaeli and Duzce earthquakes, which resulted in significant building damage and collapses, led to major changes in seismic design. There is evidence that even post-2000 a

In [12]:
result_comm_summ = summary_chunk(sent_comm_nested)
summ_comm = hierarchical_summarize(result_comm_summ, overlap_tokens=overlap_tokens)

print('Community impact summary:')
summ_comm



Community impact summary:


'As of March 8, the total official death toll due to these earthquakes was reported to be 45,968 confirmed deaths in Türkiye and 7,259 deaths in Syria. More than 100,000 people were reported as injured. The earthquake sequence resulted in a very large number of fatalities and injuries. Extreme Event Solutions at Verisk predicted that the economic losses and industry-insured losses due to the earthquake sequence in T Turkey will likely exceed $20 billion (USD) and $1 billion ( USD), respectively (Verisk 2023). Reportedly, hundreds of shipping containers were ablaze. In Southern T Turkey, members of commerce chambers, exchanges, and industrial zones have opted to halt their production to provide aid to survivors. In particular, gas supply was halted in Kahramanmaraş, Gaziantep, and Hatay provinces. On February 7, the Ministry of Health reported that injured people from Iskenderun were transferred to Mersin City Hospital in ambulances. Ninety-eight wounded patients were transferred the da

In [13]:
# Saving results to a .txt file
with open('3_structural_impact_summary.txt', 'w') as file:
    file.write(summ_struct)
with open('3_community_impact_summary.txt', 'w') as file:
    file.write(summ_comm)

In [14]:
end_time = time.time()
execution_time = end_time - start_time

print("Execution time:", execution_time, "seconds")

Execution time: 289.4751958847046 seconds
