This notebook cleans and preprocesses the text extracted from the document and splits it into sentences. It removes or cleans elements such as URLs, footnotes, and the titles of figures and tables. To include all sections of the document in the analysis, run this notebook directly after 0_input.ipynb. To exclude certain sections, first use the 1a_preprocessing notebook to identify them, then add the chosen headings to the drop_headings variable in 0_input.ipynb and run that notebook again. This notebook will then process the text without the omitted sections and print an updated table of contents so that you can verify which sections are included in the analysis.

Paste the input_path, which is the location of the 0_input.ipynb file, at the start of this notebook. This is the only input needed to load all the information required to run the code.

At the end of this notebook, the processed sentences extracted from paragraphs and tables are stored in the variables text_body and text_table, respectively, and saved to 1_results.json and 1_results.xlsx.

Recommended Google Colab Runtime Type: CPU, as this notebook does not involve running machine learning models.

In [None]:
# Input path (must be set at the beginning of each notebook)
input_path = "/content/drive/My Drive/ImpactDataMining/Turkiye_Earthquake/Result"

All sections below automatically retrieve data from the 0_input.ipynb file and the results of previous notebooks in this series, so no further edits are required beyond this point.

In [None]:
!pip install python-docx
!pip install anytree

import docx
import nltk; nltk.download(['punkt', 'averaged_perceptron_tagger'])
import os
import json
import re
import numpy as np
import pandas as pd

from google.colab import drive
from anytree import Node, RenderTree, search



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [None]:
import time
start_time = time.time()

In [None]:
def current_path():
  print("Current working directory")
  print(os.getcwd())
  print()

current_path()
drive.mount('/content/drive')
os.chdir(input_path)
current_path()

Current working directory
/content/drive/MyDrive/ResilienceDataMining/Turkiye_Earthquake/Result

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Current working directory
/content/drive/My Drive/ResilienceDataMining/Turkiye_Earthquake/Result



In [None]:
with open('0_input.json', 'r') as file:
    data = json.load(file)
    data_path = data['data_path']
    result_path = data['result_path']
    file_name = data['file_name']
    drop_headings = data['drop_headings']
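
The cell above assumes that 0_input.json contains four keys. A minimal sketch of a loader that reports missing keys before any later cell fails is shown below; `load_config` and `REQUIRED_KEYS` are hypothetical names introduced for illustration and are not part of the original notebooks.

```python
import json

# Keys that this notebook reads from 0_input.json (taken from the cell above)
REQUIRED_KEYS = ("data_path", "result_path", "file_name", "drop_headings")

def load_config(path):
    """Load the settings file and report any missing keys up front."""
    with open(path, "r") as file:
        config = json.load(file)
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"Missing keys in {path}: {missing}")
    return config
```

Calling `load_config('0_input.json')` would return the same dictionary as `json.load`, but with an explicit error message if 0_input.ipynb was not run to completion.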

In [None]:
drive.mount('/content/drive')
os.chdir(data_path)
current_path()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Current working directory
/content/drive/My Drive/ResilienceDataMining/Turkiye_Earthquake/Data



In [None]:
doc = docx.Document(file_name)

names = []
for para in doc.paragraphs:
  names.append(para.style.name)

text = []
for para in doc.paragraphs:
  text.append(para.text)

text_table = []
for table in doc.tables:
  table_data = []
  for row in table.rows:
    row_data = []
    for cell in row.cells:
        row_data.append(cell.text)
    table_data.append(row_data)
  text_table.append(table_data)

In [None]:
drive.mount('/content/drive')
os.chdir(result_path)
current_path()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Current working directory
/content/drive/My Drive/ResilienceDataMining/Turkiye_Earthquake/Result



In [None]:
def remove_headings(root, heading):
  node_to_remove = search.find_by_attr(root, name="name", value=heading)

  # List to store all nodes that will be removed
  removed_nodes = []

  if node_to_remove:
      # Collect the node itself and all of its descendants
      removed_nodes.append(node_to_remove)
      removed_nodes.extend(node_to_remove.descendants)

      # Detach the node, which also removes all of its descendants from the tree
      node_to_remove.parent = None
  return [n.name for n in removed_nodes]

def remove_headings_idx(removed_nodes, headings, headings_idx):
  idx_omit = []
  for i, n in enumerate(removed_nodes):
    idx_omit.extend([headings_idx[i1] for i1, n1 in enumerate(headings) if n1 == n])
    if i == len(removed_nodes)-1:
      if n == headings[-1]:
        idx_omit.append(len(text))
      else:
        idx_omit.extend([headings_idx[i1+1] for i1, n1 in enumerate(headings) if n1 == n])
  return idx_omit
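
The two helpers above amount to finding the paragraph range covered by a dropped heading: from the heading itself up to the next heading of the same or a higher level. A simplified, tree-free sketch of that idea follows; `section_span` and the `heading_levels` mapping are assumptions made for illustration and do not reproduce the anytree-based logic exactly.

```python
def section_span(heading_levels, start, n_paragraphs):
    """Return the (start, end) paragraph range covered by the heading at `start`.

    heading_levels maps a paragraph index to its heading level (1, 2, ...).
    The section ends at the next heading of the same or a higher level,
    or at the end of the document if there is none.
    """
    level = heading_levels[start]
    for idx in sorted(heading_levels):
        if idx > start and heading_levels[idx] <= level:
            return start, idx
    return start, n_paragraphs

# Hypothetical document: headings at paragraphs 0, 3, 6 and 9, with 12 paragraphs in total
levels = {0: 1, 3: 2, 6: 2, 9: 1}
print(section_span(levels, 3, 12))  # (3, 6): a level-2 section ends at the next level-2 heading
print(section_span(levels, 0, 12))  # (0, 9): a level-1 section includes its subsections
```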

In [None]:
if len(drop_headings) > 0:
  headings_tree = [(names[i], text[i]) for i, n in enumerate(names) if n.startswith('Heading')]
  headings = [text[i] for i, n in enumerate(names) if n.startswith('Heading')]
  headings_idx = [i for i, n in enumerate(names) if n.startswith('Heading')]

  # Root node
  root = Node("Document")

  # Keep track of the last node for each heading level
  last_nodes = {0: root}

  for level, title in headings_tree:
    level_num = int(level.split(' ')[-1])

    # Reset the last nodes for levels greater than the current level
    for i in range(level_num + 1, max(last_nodes.keys()) + 1):
      if i in last_nodes:
        del last_nodes[i]

    # Find the correct parent by looking at the last node at the previous level or above
    parent = None
    for i in range(level_num - 1, -1, -1):
      parent = last_nodes.get(i)
      if parent:
        break

    # Create the new node and store it as the last node for its level
    node = Node(title, parent=parent)
    last_nodes[level_num] = node

  drop_idx = []
  for n in drop_headings:
    removed_nodes = remove_headings(root, n)
    idx = remove_headings_idx(removed_nodes, headings, headings_idx)
    if len(idx) > 2:
      idx = [idx[0], idx[-1]]
    drop_idx.append(idx)

  # Print the tree
  print('UPDATED TABLE OF CONTENTS')
  for pre, _, node in RenderTree(root):
    print(f"{pre}{node.name}")

  # Delete from the highest indices first so that earlier deletions do not shift later ranges
  for n in sorted(drop_idx, reverse=True):
    for k in range(len(n)-1):
      start_idx = n[k]
      end_idx = n[k+1]
      del text[start_idx:end_idx]
      del names[start_idx:end_idx]

UPDATED TABLE OF CONTENTS
Document
└── EXECUTIVE SUMMARY
    ├── Introduction
    │   ├── Casualties and Injuries
    │   ├── Economic Losses
    │   │   ├── Direct Losses Due to Structural Damage
    │   │   └── Losses Due to Impacts on Supply Chains
    │   ├── Other Societal Impacts
    │   ├── Official Response
    │   └── Report Scope
    ├── Seismic Hazard and Recorded Ground Motions
    │   ├── Tectonic Setting of Türkiye
    │   ├── 2023 Mw 7.7 and Mw 7.6 Earthquakes Features
    │   │   ├── Mw 7.7 Event
    │   │   └── Mw 7.6 Event
    │   ├── Evolution of Seismic Zonation and Current Seismic Hazard Maps
    │   │   ├── Türkiye
    │   │   └── Syria
    │   ├── Recorded Ground Motions
    │   │   ├── Türkiye
    │   │   └── Syria
    │   └── Response Spectra
    │       ├── Mw 7.7 Event (at 01:17 UTC)
    │       └── Mw 7.6 Event (at 10:24 UTC)
    ├── Local Codes and Construction Practices
    │   ├── Türkiye
    │   └── Syria
    │       ├── Code Development
    │       └── 

In [None]:
text_normal = [text[i] for i, n in enumerate(names) if n.startswith("Normal")]
text_normal = [n for n in text_normal if n != '' and n != '\t']

# footnotes handling
fnote_char_idx = []; fnote_num_idx = [];
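for i, n in enumerate(text_normal):
  # A footnote starts with a letter or number marker followed by a capital, a digit, an opening quote, or a URL
  if re.findall(r'^[a-z]\s*(?:[A-Z0-9\u201c\u2018]|https?)', n):
    fnote_char_idx.append(i)
  if re.findall(r'^\d+\s*(?:[A-Z0-9\u201c\u2018]|https?)', n):
    fnote_num_idx.append(i)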
fnote_num = [int(re.match(r'^\d+',text_normal[i]).group()) for i in fnote_num_idx]
fnote_char = [re.match(r'^[a-z]',text_normal[i]).group() for i in fnote_char_idx]

result = []; result_idx_num = [];
for num, idx in zip(fnote_num, fnote_num_idx):
    if not result:
        if num == 1:
            result.append(num)
            result_idx_num.append(idx)
    elif num == result[-1] + 1:
        result.append(num)
        result_idx_num.append(idx)
    else:
        continue
result = []; result_idx_char = [];
for num, idx in zip(fnote_char, fnote_char_idx):
    if not result:
        if num == "a":
            result.append("a")
            result_idx_char.append(idx)
    elif num == chr(ord(result[-1]) + 1):
        result.append(num)
        result_idx_char.append(idx)
    else:
        continue

fnote_idx = list(set(result_idx_num+result_idx_char))
text_fnote = [text_normal[i] for i in fnote_idx]
fnote_idx = [text.index(n) for n in text_fnote]
text = [text[i] for i in range(len(text)) if i not in fnote_idx]
names = [names[i] for i in range(len(names)) if i not in fnote_idx]
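
The footnote filter above keeps only those candidates whose markers form an unbroken sequence (1, 2, 3, ... or a, b, c, ...), which discards ordinary paragraphs that merely start with a number. A minimal sketch of that sequence check for numeric markers is given below; `consecutive_run` is a hypothetical helper name, not a function from the notebook.

```python
def consecutive_run(labels, first=1):
    """Return the positions of labels that continue the sequence first, first+1, ..."""
    kept = []
    expected = first
    for pos, label in enumerate(labels):
        if label == expected:
            kept.append(pos)
            expected += 1
    return kept

# Leading numbers found in candidate paragraphs; 2023, 7 and 10 do not fit the sequence
print(consecutive_run([2023, 1, 7, 2, 3, 10]))  # [1, 3, 4]
```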

In [None]:
# URL handling: strip http(s) links from every paragraph
for i, n in enumerate(text):
  text[i] = re.sub(r'https?://\S+', '', n)

In [None]:
# Table, Figure titles handling
idx_heading = [i for i, n in enumerate(names) if n.startswith("Heading")]
b = [n for n in range(len(text)) if n not in idx_heading]
text_body = [text[i] for i in b]
text_body = [n for n in text_body if n != ""]

idx_fig_table = [];
for i, n in enumerate(text_body):
  pattern = r'^(?:Table|Figure)\s(?:\d|[A-Z])+(?:\.\d+)*(?:\.|:)\s'
  match = re.search(pattern,n)
  if match:
    if i != 0:
      match1 = re.search(r'\(a\).*\t', text_body[i-1])
      if match1:
        idx_fig_table.extend(list(range(i-1,i+1)))
      else:
        idx_fig_table.append(i)
    else:
      idx_fig_table.append(i)

# Drop caption paragraphs by position (robust to duplicate paragraphs)
text_body = [n for i, n in enumerate(text_body) if i not in idx_fig_table]
for i, n in enumerate(text_body):
  # Replace only the trailing colon with a period, leaving other colons untouched
  if n.strip().endswith(":"):
    text_body[i] = re.sub(r":\s*$", ".", n)

# Headings handling: drop run-in headings of the form "Noun phrase: ..." that do not end with a period
idx_1 = []
for i, n in enumerate(text_body):
  if ":" in n and not n.strip().endswith('.'):
    t = n[:n.index(":")]
    text_tokenized = nltk.word_tokenize(t)
    text_tokenized = [w for w in text_tokenized if re.findall(r'\w+', w)]
    text_pos_tagged = nltk.pos_tag(text_tokenized)
    # Treat the paragraph as a heading only if every token before the colon is a noun
    if all(re.match('NN', tag) for _, tag in text_pos_tagged):
      idx_1.append(i)
idx = [i for i in range(len(text_body)) if i not in idx_1]
text_body = [text_body[i] for i in idx]
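
To see which paragraphs the caption pattern in the cell above catches, the short sketch below applies the same regular expression to a few made-up strings. `is_caption` and the sample texts are illustrative assumptions only.

```python
import re

# Same pattern as in the cell above: "Table"/"Figure", a label, then "." or ":" and whitespace
CAPTION = re.compile(r'^(?:Table|Figure)\s(?:\d|[A-Z])+(?:\.\d+)*(?:\.|:)\s')

def is_caption(paragraph):
    """True if the paragraph starts like 'Figure 2.3. ' or 'Table A1: '."""
    return CAPTION.search(paragraph) is not None

for sample in ["Figure 2.3. Damage map of the region",
               "Table A1: Recording stations",
               "Figure 4 shows the recorded ground motions."]:
    print(is_caption(sample), "|", sample)
```

The third sample is an ordinary sentence that mentions a figure, so it is not treated as a caption because no period or colon follows the figure number.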

In [None]:
idx_1 = [i for i, n in enumerate(text_body) if not n.strip().endswith(".")]
temp = []; temp_last = idx_1[-1] if idx_1 else None;  # guard against documents with no open fragments
for i in range(len(idx_1)-1):
    if idx_1[i+1] - idx_1[i] == 1:
        temp.append(idx_1[i+1])

# Abbreviation cleanup (e.g., et al., Fig., No., deg.) and a closing period after a trailing parenthesis
for i, para in enumerate(text_body):
  text_body[i] = re.sub(r'e\.g\.\s+', 'e.g.', para)
  text_body[i] = re.sub(r'et al\.\s+', 'et al.', text_body[i])
  text_body[i] = re.sub(r'Fig\.\s+', 'Fig.', text_body[i])
  text_body[i] = re.sub(r'No\.\s+', 'No.', text_body[i])
  text_body[i] = re.sub(r'deg\.', 'degree', text_body[i])
  text_body[i] = re.sub(r'\)$', ').', text_body[i])

In [None]:
idx_1 = [i for i in idx_1 if i not in temp]
idx_2 = []
for i in range(len(idx_1)-1):
  for k in range(idx_1[i],idx_1[i+1]):
    if text_body[k].strip().endswith("."):
      idx_2.append(k)
      break
if len(idx_2) < len(idx_1):
  idx_2.append(temp_last)

for n in range(len(idx_1)):
    text_body[idx_1[n]] = " ".join(text_body[idx_1[n]:idx_2[n]+1])

for n in range(len(idx_1))[::-1]:
    del text_body[idx_1[n]+1:idx_2[n]+1]
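
The three cells above merge paragraph fragments that do not end with a period (for example, a lead-in followed by list items) into the text that follows, up to the next paragraph that ends with a period. The sketch below is a simplified single-pass version of that idea, not the notebook's exact index-based algorithm; `merge_fragments` is a hypothetical name.

```python
def merge_fragments(paragraphs):
    """Join each run of paragraphs up to and including the next one that ends with a period."""
    merged, buffer = [], []
    for para in paragraphs:
        buffer.append(para.strip())
        if para.strip().endswith("."):
            merged.append(" ".join(buffer))
            buffer = []
    if buffer:  # trailing fragments without a closing period are kept as one item
        merged.append(" ".join(buffer))
    return merged

print(merge_fragments(["Damage was observed in", "three districts.", "Roads reopened."]))
# ['Damage was observed in three districts.', 'Roads reopened.']
```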

In [None]:
def table_format(table):
  caption_pattern = r'^(?:Table|Figure)\s(?:\d|[A-Z])+(?:\.\d+)*(?:\.|:)\s'
  url_pattern = r'https?://\S+'
  for i, line in enumerate(table):
    for i1, col in enumerate(line):
      if re.search(caption_pattern, col):
        # Blank out cells that hold a table or figure caption
        table[i][i1] = ""
      else:
        # Strip URLs from all other cells
        table[i][i1] = re.sub(url_pattern, '', col)

  # Drop rows whose cells are all empty (rebuild the list instead of popping while iterating)
  table = [line for line in table if not all(col == "" for col in line)]

  try:
    return table_format1(table)
  except IndexError:
    return table

def table_format1(table):
  ta_idx = []
  for i, line in enumerate(table):
    if line[0] != "":
      ta_idx.append(1)
    else:
      ta_idx.append(0)

  if (ta_idx[0] == 0 and 0 not in ta_idx[1:]) or 0 not in ta_idx:
    return table
  else:
    idx = []
    count = 0; zeros = 0
    for i, item in enumerate(ta_idx):
      if item == 1:
        count = count + 1
      else:
        zeros = zeros + 1
        if count > 0 and (i == len(ta_idx) - 1 or ta_idx[i+1] == 1):
          count = count +  zeros
          idx.append(count)
          count = 0; zeros = 0
    if count > 0:
      idx.append(count)

    table = [list(x) for x in zip(*table)]
    for i, n in enumerate(table):
      result = []
      start = 0
      for i1 in idx:
        end = start + i1
        result.append(" ".join(n[start:end]))
        start = end
      table[i] = result
    return [list(x) for x in zip(*table)]

def is_number(string):
    if string.isnumeric():  # Check if the string represents an integer
        return True
    else:
        try:
            float(string)  # Try to convert the string to a float
            return True
        except ValueError:  # If the conversion fails, it's not a number
            return False

def combine_table(tables):
  # Join cells that continue across a table break: last row of one table + first row of the next
  for t1, t2 in zip(tables[:-1], tables[1:]):
    if len(t1[-1]) == len(t2[0]):
      for i, (c1, c2) in enumerate(zip(t1[-1][1:], t2[0][1:])): # starting from 2nd column
        if not c1.endswith(".") and not is_number(c1):
          t1[-1][i+1] = " ".join([c1, c2])
          t2[0][i+1] = ""

  # Drop rows that are now entirely empty (rebuild each table instead of popping while iterating)
  for i, t in enumerate(tables):
    tables[i] = [l for l in t if not all(c == "" for c in l)]
  return tables

def remove_short_lists(lst):
    # If the element is not a list, return it as is.
    if not isinstance(lst, list):
        return lst

    # Process the list elements
    result = []
    for item in lst:
        # Recursively apply the function to sublists
        if isinstance(item, list):
            processed_item = remove_short_lists(item)
            # Add the item to the result only if it is a non-empty list
            if isinstance(processed_item, list) and len(processed_item) > 0:
                result.append(processed_item)
        else:
            # Directly add non-list items
            result.append(item)
    return result
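
In table_format1 above, a row whose first cell is empty is treated as a continuation of the rows before it and is merged into them column by column. A simplified sketch of that continuation-row idea follows. It merges each such row only into the row directly above and skips empty strings when joining, so it is an illustration under those assumptions rather than an equivalent of table_format1; `merge_continuation_rows` is a hypothetical name.

```python
def merge_continuation_rows(rows):
    """Merge every row whose first cell is empty into the row directly above it."""
    merged = []
    for row in rows:
        if merged and row and row[0] == "":
            prev = merged[-1]
            # Join cell by cell, skipping empty strings to avoid stray spaces
            merged[-1] = [" ".join(part for part in (a, b) if part)
                          for a, b in zip(prev, row)]
        else:
            merged.append(list(row))
    return merged

table = [["Item", "Description"],
         ["Roof", "Clay tiles"],
         ["", "partially collapsed"]]
print(merge_continuation_rows(table))
# [['Item', 'Description'], ['Roof', 'Clay tiles partially collapsed']]
```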

In [None]:
for i, t in enumerate(text_table):
  text_table[i] = table_format(t)

text_table = [t for t in text_table if t]
for t in text_table:
  for i, c  in enumerate(t):
    t[i] = [r for r in c if len(r) > 5]
text_table = remove_short_lists(text_table)

text_table = combine_table(text_table)
text_table = list(dict.fromkeys([c for t in text_table for r in t for c in r]))

In [None]:
def identify_sentence(sent):
  # Keep the tagged sentence only if at least one token carries a verb tag (VB*);
  # otherwise return '' so that the caller can discard it
  for _, tag in sent:
    if re.search('VB.*', tag):
      return sent
  return ''

def identify_sentence2(sent):
  try:
    match_VB = re.search('VB.', sent[0][1])
    if match_VB:
      return identify_sentence(sent[1:])
    else:
      return identify_sentence(sent)
  except IndexError:
    pass

def data_sentences(data):
  n = len(data);
  sent_tokenized = [None]*n; sent_pos_tagged = [None]*n; sent = [None]*n;
  idx = [None]*n; result = [None]*n;
  sentences = [nltk.sent_tokenize(p) for p in data]
  for i, n in enumerate(sentences):
    sentences[i] = [k for k in n if len(k.split())>=4]
  for i, n in enumerate(sentences):
    sent_tokenized[i] =  [nltk.word_tokenize(k) for k in n]
  for i, n in enumerate(sent_tokenized):
    sent_pos_tagged[i] =  [nltk.pos_tag(k) for k in n]
  for i, n in enumerate(sent_pos_tagged):
    sent[i] =  [identify_sentence2(k) for k in n]
  for i, n in enumerate(sent):
    idx[i] = [0 if k == '' else 1 for k in n]
  for i, (a, b) in enumerate(zip(sentences, idx)):
    result[i] = [s for s, d in zip(a, b) if d == 1]
  result = [n for n in result if n != []]
  idx = [list(range(len(n))) for n in result]
  result = [k for n in result for k in n]
  return result, idx
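
The functions above keep a candidate only if its part-of-speech tags include a verb, after ignoring a leading inflected verb form. The sketch below shows that rule on hand-written (token, tag) pairs so that it runs without NLTK; the tags are written by hand for illustration and are not actual `nltk.pos_tag` output, and `keep_sentence` is a hypothetical name.

```python
import re

def keep_sentence(tagged):
    """Return True if the tagged tokens contain a verb beyond a leading inflected verb."""
    # 'VB.' matches VBD, VBG, VBN, VBP and VBZ, but not the bare VB tag
    skip_first = bool(tagged) and re.search("VB.", tagged[0][1]) is not None
    rest = tagged[1:] if skip_first else tagged
    return any(tag.startswith("VB") for _, tag in rest)

print(keep_sentence([("Roads", "NNS"), ("were", "VBD"), ("closed", "VBN")]))   # True
print(keep_sentence([("Critical", "JJ"), ("infrastructure", "NN")]))           # False
print(keep_sentence([("Damaged", "VBN"), ("buildings", "NNS"), ("in", "IN"), ("Hatay", "NNP")]))  # False
```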

In [None]:
text_body, idx_body = data_sentences(text_body)
text_table, idx_table = data_sentences(text_table)

for i, n in enumerate(text_table):
      pattern_1 = r'^\n'
      pattern_2 = r'\([a-z]\)\s*'
      pattern_3 = r'\n'

      n = re.sub(pattern_1, '', n)
      n = re.sub(pattern_2, '', n)
      text_table[i] = re.sub(pattern_3, ' - ', n)

In [None]:
text_body

['An Mw 7.8 earthquake occurred at a depth of 17.9 km and with epicenter coordinates 37.174°N 37.032°E near the city of Nurdağı in the Gaziantep province of Türkiye at about 4:17 am local time on February 6, 2023.',
 'Due to the shallow depth of the earthquake and a bilateral rupture towards the southwest and the northeast with an area of approximately 100 km × 75 km, the earthquake impacted 10 provinces in Türkiye and several others in Syria, resulting in significant casualties due to the collapse of many buildings.',
 'Because of a combination of forward directivity, basin effects, and site amplification, very large ground shaking (up to 1.3g Peak Ground Acceleration (PGA) and 170 cm/s Peak Ground Velocity (PGV)) was recorded in Hatay.',
 'The response spectra of several of the recorded ground motions considerably exceeded the Maximum Considered Earthquake (MCE) levels for certain period ranges.',
 'This earthquake was followed by many aftershocks, including several larger than magni

In [None]:
text_table

['Joint Report Section Leads:',
 'For a full listing of this report and all other StEER products (briefings, reports and datasets) with full citation information and DOIs, please visit the StEER website:  -  -  - For a full listing of over 300 different earthquakes occurring in more than 50 countries during the last 70 years that the EERI’s Learning From Earthquakes program, including this and other Virtual Earthquake Reconnaissance Team (VERT) reports, datasets, and publications, please visit the EERI LFE website:',
 'Light Detection and Ranging',
 'Major intensity event - Succession of events - Joint/compounding hazards',
 'Sufficiently populated areas to create measurable impact - Communities with a history of recovery - Noteworthy code or construction practices - Critical infrastructure - Under-documented structure classes - Instrumented structures',
 'Availability/interest of members - Sufficient media/social media coverage of the event, including the potential to automate the min

In [None]:
# Saving results to a JSON file
with open('1_results.json', 'w') as file:
    json.dump({
        'text_body': text_body, 'text_table': text_table,
        'idx_body': idx_body, 'idx_table': idx_table
        }, file)

# Saving results to an excel file
df1 = pd.DataFrame(text_body)
df2 = pd.DataFrame(text_table)

with pd.ExcelWriter('1_results.xlsx', engine='openpyxl') as writer:
    df1.to_excel(writer, sheet_name='Paragraph text', index=False, header=False)
    df2.to_excel(writer, sheet_name='Table text', index=False, header=False)
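
Because text_body is saved as a flat list, the paragraph grouping has to be recovered from idx_body, which holds one list of within-paragraph positions for each paragraph that kept at least one sentence. A minimal sketch of how a later notebook might rebuild the grouping is shown below; `regroup` is a hypothetical helper that is not defined anywhere in this series.

```python
def regroup(sentences, idx):
    """Rebuild paragraph-level groups from a flat sentence list and per-paragraph positions."""
    groups, pos = [], 0
    for positions in idx:
        groups.append(sentences[pos:pos + len(positions)])
        pos += len(positions)
    return groups

# Toy data in the same shape as text_body / idx_body
print(regroup(["s1", "s2", "s3"], [[0, 1], [0]]))  # [['s1', 's2'], ['s3']]
```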

In [None]:
end_time = time.time()
execution_time = end_time - start_time

print("Execution time:", execution_time, "seconds")

Execution time: 9.569961786270142 seconds
