# Processing Subtitles Dataset and Creating JSON Objects: From Raw Text to Structured Data

## Downloading and Extracting Data
First, we download the French OpenSubtitles v2018 XML archive (about 5.9 GB) from OPUS, rename the downloaded file to `fr.zip`, and extract its contents.

In [None]:
!wget https://opus.nlpl.eu/download.php?f=OpenSubtitles/v2018/xml/fr.zip

--2023-07-30 12:27:05--  https://opus.nlpl.eu/download.php?f=OpenSubtitles/v2018/xml/fr.zip
Resolving opus.nlpl.eu (opus.nlpl.eu)... 193.166.25.9
Connecting to opus.nlpl.eu (opus.nlpl.eu)|193.166.25.9|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://object.pouta.csc.fi/OPUS-OpenSubtitles/v2018/xml/fr.zip [following]
--2023-07-30 12:27:06--  https://object.pouta.csc.fi/OPUS-OpenSubtitles/v2018/xml/fr.zip
Resolving object.pouta.csc.fi (object.pouta.csc.fi)... 86.50.254.18, 86.50.254.19
Connecting to object.pouta.csc.fi (object.pouta.csc.fi)|86.50.254.18|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6311975458 (5.9G) [application/zip]
Saving to: ‘download.php?f=OpenSubtitles%2Fv2018%2Fxml%2Ffr.zip’


2023-07-30 12:31:46 (21.6 MB/s) - ‘download.php?f=OpenSubtitles%2Fv2018%2Fxml%2Ffr.zip’ saved [6311975458/6311975458]



In [None]:
!mv /content/download.php?f=OpenSubtitles%2Fv2018%2Fxml%2Ffr.zip /content/fr.zip

In [None]:
!unzip fr.zip
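
Because the archive is large, it can help to check how many files it holds and how much disk space they need before running `unzip`. The following is a minimal sketch using Python's standard `zipfile` module. The helper name `summarize_zip` and the tiny demo archive are hypothetical and only illustrate the idea; on the real data you would point it at `/content/fr.zip`.

```python
import os
import tempfile
import zipfile

def summarize_zip(zip_path, limit=5):
    """Return member count, total uncompressed size, and the first few names."""
    with zipfile.ZipFile(zip_path) as archive:
        infos = archive.infolist()
        return {
            "members": len(infos),
            "uncompressed_bytes": sum(info.file_size for info in infos),
            "first_names": [info.filename for info in infos[:limit]],
        }

# Demo on a throwaway archive (hypothetical content, not the real dataset).
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = os.path.join(tmp, "demo.zip")
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("OpenSubtitles/xml/fr/2008/1031415/4528771.xml", "<document/>")
    summary = summarize_zip(demo_zip)

print(summary)
```

Reading the member list only touches the archive's central directory, so this check should be quick even for a multi-gigabyte file.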

## Creating JSON Objects

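This cell defines two Python functions. `create_json_objects` takes a list of lines and slides a nine-line window over it: each run of nine consecutive lines becomes the `context`, and the line that follows becomes the `response` (no object is created when that line is empty). The `knowledge` field is always an empty string. `save_json_objects_to_file` writes the resulting list to a UTF-8 encoded JSON file.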

In [None]:
import json

def create_json_objects(lines_list):
    json_objects = []
    context_lines = []
    knowledge = ""

    for line in lines_list:
        # Until we have 9 context lines, keep adding lines to the context
        if len(context_lines) < 9:
            context_lines.append(line.strip())
        else:
            # Create the JSON object for the current set of lines
            response_line = line.strip()
            if response_line:
                json_object = {
                    "context": context_lines.copy(),
                    "knowledge": knowledge,
                    "response": response_line
                }
                json_objects.append(json_object)

            # Slide the window: drop the oldest context line and append the current one
            context_lines = context_lines[1:] + [line.strip()]

    return json_objects

def save_json_objects_to_file(json_objects, output_file):
    with open(output_file, 'w', encoding='utf-8') as file:
        json.dump(json_objects, file, ensure_ascii=False)
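
To make the sliding window concrete, here is a small self-contained sketch. It uses a compact, index-based function, `sliding_examples` (a hypothetical name, not part of the notebook), that is intended to produce the same objects as `create_json_objects`, and runs it on eleven made-up lines.

```python
def sliding_examples(lines, window=9):
    """Pair every run of `window` lines with the line that follows it."""
    stripped = [line.strip() for line in lines]
    examples = []
    for i in range(window, len(stripped)):
        response = stripped[i]
        if response:  # empty responses are skipped, but the window still moves on
            examples.append({
                "context": stripped[i - window:i],
                "knowledge": "",
                "response": response,
            })
    return examples

# Eleven made-up lines give two examples: lines 0-8 -> 9 and lines 1-9 -> 10.
demo_lines = [f"line {i}" for i in range(11)]
examples = sliding_examples(demo_lines)

print(len(examples))
print(examples[0]["response"], "|", examples[1]["context"][0])
```

Each subtitle line therefore appears in up to nine contexts plus one response, which is worth keeping in mind when estimating the size of the output.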

## Processing XML Files and Creating JSON

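Here we walk the extracted folder, parse each XML file, join the `<w>` (word) tokens of every `<s>` (sentence) element into a line of text, remove the space before punctuation marks, and build JSON objects from those lines. Each input file yields one JSON file. Progress is tracked with the tqdm library, and files that fail to parse are reported and skipped.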

In [None]:
import os
import xml.etree.ElementTree as ET
from tqdm import tqdm

folder_path = "/content/OpenSubtitles/xml/fr"
output_folder = "/content/new_dataset"
# Make sure the output folder exists before writing to it
os.makedirs(output_folder, exist_ok=True)
total_files = sum(len(files) for _, _, files in os.walk(folder_path))
with tqdm(total=total_files, desc="Processing files") as pbar:
    for root, dirs, files in os.walk(folder_path):
        for file in files:
            try:
                # Load XML file
                file_path = os.path.join(root, file)
                tree = ET.parse(file_path)
                root_element = tree.getroot()

                # # Remove time tags
                # for time_tag in root_element.iter("time"):
                #     time_tag.clear()

                # Join the word tokens of each sentence (<s>) into one line,
                # skipping <w> tags that have no text
                lines = []
                for s_tag in root_element.iter("s"):
                    line = " ".join(w_tag.text for w_tag in s_tag.findall("w") if w_tag.text)
                    lines.append(line)

                # Remove the space before punctuation marks and strip leading dialogue dashes
                lines = [line.replace(" .", ".")
                         .replace(" ,", ",")
                         .replace(" !", "!")
                         .replace(" ?", "?")
                         .replace(" :", ":")
                         .replace(" ;", ";")
                         .lstrip("- ") for line in lines]

                # Build context/response JSON objects from the cleaned lines
                json_objects = create_json_objects(lines)
                # new file name
                directories = file_path.split(os.sep)
                new_file = '-'.join(directories[-3:]).replace('.xml', '.json')
                # new file path
                output_path = os.path.join(output_folder, new_file)
                # write to file
                save_json_objects_to_file(json_objects, output_path)

            except Exception as e:
                print(f"Exception occurred while processing file: {file_path}")
                print(f"Exception details: {str(e)}")
                # Remove the file if an exception is raised
                # os.remove(file_path)

            # Update the progress bar
            pbar.update(1)

Processing files:  44%|████▍     | 56221/127204 [1:10:53<1:49:48, 10.77it/s]

Exception occurred while processing file: /content/OpenSubtitles/xml/fr/2008/1031415/4528771.xml
Exception details: not well-formed (invalid token): line 4135, column 30


Processing files:  79%|███████▉  | 101060/127204 [2:07:22<23:10, 18.80it/s]

Exception occurred while processing file: /content/OpenSubtitles/xml/fr/2006/798028/4555239.xml
Exception details: not well-formed (invalid token): line 3807, column 36


Processing files: 100%|██████████| 127204/127204 [2:40:48<00:00, 13.18it/s]
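
The per-file steps of the cell above can be tried on a small in-memory sample. This is a minimal sketch: the XML snippet is hand-written and only assumes the `<s>`/`<w>` layout that the loop relies on, and the helper names `extract_lines` and `output_name` are hypothetical.

```python
import os
import xml.etree.ElementTree as ET

# Hand-written sample (an assumption about the file layout, not real data).
SAMPLE_XML = """<document>
  <s id="1"><time id="T1S" value="00:00:01,000" /><w id="1.1">-</w><w id="1.2">Bonjour</w><w id="1.3">,</w><w id="1.4">Marie</w><w id="1.5">!</w></s>
  <s id="2"><w id="2.1">Ça</w><w id="2.2">va</w><w id="2.3">?</w><w id="2.4" /></s>
</document>"""

def extract_lines(xml_text):
    """Join the <w> tokens of each <s> element and tidy the punctuation."""
    root = ET.fromstring(xml_text)
    lines = []
    for s_tag in root.iter("s"):
        # Skip <w> tags without text so that join() never receives None.
        line = " ".join(w.text for w in s_tag.findall("w") if w.text)
        for mark in ".,!?:;":
            line = line.replace(" " + mark, mark)
        lines.append(line.lstrip("- "))
    return lines

def output_name(file_path):
    """Build the JSON file name from the last three path components."""
    parts = file_path.split(os.sep)
    return "-".join(parts[-3:]).replace(".xml", ".json")

lines = extract_lines(SAMPLE_XML)
name = output_name(os.path.join("OpenSubtitles", "xml", "fr", "2008", "1031415", "4528771.xml"))

print(lines)  # ['Bonjour, Marie!', 'Ça va?']
print(name)   # 2008-1031415-4528771.json
```

Including the two parent folder names in the output name keeps files from different folders apart once they are all written into a single flat directory.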


In [None]:
!du -sh /content/new_dataset

40G	/content/new_dataset


## Cleaning Up and Archiving

In this part, we delete the extracted XML files, compress the newly created JSON files into a ZIP archive, and copy the archive to a Google Drive folder.

In [None]:
!rm -rf /content/OpenSubtitles

In [None]:
import shutil

# Folder containing the JSON files, and the ZIP archive to create from it
folder_path = '/content/new_dataset'
output_zip_file = '/content/new_dataset.zip'

shutil.make_archive(output_zip_file[:-4], 'zip', folder_path)

'/content/new_dataset.zip'

In [None]:
import shutil

# Copy the archive to Google Drive; the destination folder must already exist
source_file = '/content/new_dataset.zip'
destination_folder = '/content/drive/MyDrive/new_dataset'

shutil.copy(source_file, destination_folder)

'/content/drive/MyDrive/new_dataset/new_dataset.zip'
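
The archive and copy steps can also be combined into one function. The sketch below is hypothetical (`archive_and_copy` is not part of the notebook) and runs on temporary folders so it can be tried without touching the real data. It additionally creates the destination folder, which the cells above assume is already present.

```python
import os
import shutil
import tempfile
import zipfile

def archive_and_copy(folder, zip_base, dest_dir):
    """Zip `folder` into `zip_base + '.zip'` and copy the archive into `dest_dir`."""
    archive_path = shutil.make_archive(zip_base, "zip", folder)
    os.makedirs(dest_dir, exist_ok=True)  # create the destination if it is missing
    return shutil.copy(archive_path, dest_dir)

# Demo on temporary folders standing in for /content and Google Drive.
with tempfile.TemporaryDirectory() as tmp:
    data_dir = os.path.join(tmp, "new_dataset")
    os.makedirs(data_dir)
    with open(os.path.join(data_dir, "sample.json"), "w", encoding="utf-8") as f:
        f.write("[]")

    copied = archive_and_copy(
        data_dir,
        os.path.join(tmp, "new_dataset"),
        os.path.join(tmp, "drive", "new_dataset"),
    )
    copied_name = os.path.basename(copied)
    with zipfile.ZipFile(copied) as archive:
        members = archive.namelist()

print(copied_name, members)
```

`shutil.make_archive` takes the archive name without its extension, which is why the notebook passes `output_zip_file[:-4]`.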