# Data Extraction

In the end, I used a raw .txt file of "Origins of the World" by H.G. Wells to train my knowledge graph. I originally wanted to use the ArXiv dataset because it has lots of scientific language. However, I could not deploy scipy in Python 3.12 (to be sorted later), so I decided to go for a history corpus rather than a scientific one.

In any case, I used Ray to extract the ArXiv data from a JSON file and process it into a text file, and I thought this code was worth keeping. I also wrote code to unzip a 7-zip archive and merge its contents into a single text file.

### Ray & ArXiv Dataset

In [1]:
import ray
import pandas as pd

# Initialize Ray
ray.init()

# Load the JSON Lines file with pandas (lines=True expects one JSON record per line);
# the resulting DataFrame is split into batches for Ray further down
df = pd.read_json('../data/raw/arxiv-metadata-oai-snapshot.json', lines=True)

# Select the relevant columns: title, authors, and abstract
selected_df = df[['title', 'authors', 'abstract']]


2024-09-24 13:17:29,977	INFO util.py:154 -- Missing packages: ['ipywidgets']. Run `pip install -U ipywidgets`, then restart the notebook server for rich notebook output.
2024-09-24 13:17:37,591	INFO worker.py:1786 -- Started a local Ray instance.


In [4]:

# Define a function to process each batch of data
@ray.remote
def process_batch(batch):
    # Combine title, authors, and abstract into one unstructured text entry per row
    def format_row(row):
        return (
            f"Title: {row['title']}\n"
            f"Authors: {row['authors']}\n"
            f"Abstract: {row['abstract']}\n\n"
        )

    return batch.apply(format_row, axis=1).tolist()  # Return as a list of strings

# Split the data into smaller batches (adjust the batch size as needed)
batch_size = 1000
batches = [selected_df[i:i+batch_size] for i in range(0, selected_df.shape[0], batch_size)]

# Use Ray to process each batch in parallel
results = ray.get([process_batch.remote(batch) for batch in batches])
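
The list comprehension above builds every batch before any task is submitted. As a minimal sketch of a lazy alternative, the generator below yields one batch at a time. `iter_batches` is a hypothetical helper (not part of Ray or pandas), and it assumes an iterable of records such as a list of dicts, not a DataFrame passed in directly.

```python
from itertools import islice


def iter_batches(items, batch_size):
    """Yield successive lists of at most batch_size items from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        # islice pulls the next batch_size items without copying the whole input
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # input exhausted
        yield batch


# Usage: five records split into batches of two
for batch in iter_batches(["r1", "r2", "r3", "r4", "r5"], 2):
    print(len(batch), batch)
```

The last batch is simply shorter when the input length is not a multiple of `batch_size`, which matches the slicing behaviour used in the cell above.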



In [5]:

# Combine the results into a single list of text entries
combined_results = [item for sublist in results if isinstance(sublist, list) for item in sublist]

# Write the combined text to a file
with open('../data/processed/arxiv_combined_text.txt', 'w', encoding='utf-8') as f:
    for entry in combined_results:
        f.write(entry)

print("Combined text has been saved to 'arxiv_combined_text.txt'")

Combined text has been saved to 'arxiv_combined_text.txt'
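
For comparison, here is a minimal sketch of the same JSON-to-text conversion done as a single streaming pass with only the standard library, which avoids loading the whole snapshot into memory. It assumes one JSON record per line (as `lines=True` implies). `format_record`, `iter_entries`, and `jsonl_to_text` are hypothetical names, and substituting an empty string for a missing field is my assumption, not something the cells above do.

```python
import json


def format_record(record):
    """Render one record in the Title/Authors/Abstract layout used above."""
    return (
        f"Title: {record.get('title', '')}\n"
        f"Authors: {record.get('authors', '')}\n"
        f"Abstract: {record.get('abstract', '')}\n\n"
    )


def iter_entries(lines):
    """Turn an iterable of JSON lines into formatted text entries, skipping blanks."""
    for line in lines:
        line = line.strip()
        if line:
            yield format_record(json.loads(line))


def jsonl_to_text(jsonl_path, output_path):
    """Stream a JSON Lines file into a text file, one record at a time."""
    with open(jsonl_path, 'r', encoding='utf-8') as src, \
         open(output_path, 'w', encoding='utf-8') as dst:
        dst.writelines(iter_entries(src))
```

This trades Ray's parallelism for a constant memory footprint. For plain string formatting like this, the work is likely dominated by reading and parsing the file, though I have not benchmarked either version.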


### Unzip 7-zip File

In [4]:
import py7zr
import os

def extract_7z_to_txt(archive_path, output_txt_path):
    # Create a working directory for the extracted files (it is not deleted afterwards)
    extraction_dir = 'extracted_files'
    os.makedirs(extraction_dir, exist_ok=True)

    with py7zr.SevenZipFile(archive_path, mode='r') as archive:
        archive.extractall(path=extraction_dir)  # Extract everything into the working directory

    # Write the contents of all extracted files into a single text file
    with open(output_txt_path, 'w', encoding='utf-8') as outfile:
        for root, _, files in os.walk(extraction_dir):
            for filename in files:
                file_path = os.path.join(root, filename)
                try:
                    with open(file_path, 'r', encoding='utf-8') as infile:
                        outfile.write(infile.read() + '\n')
                except UnicodeDecodeError:
                    # If UTF-8 fails, read the raw bytes and decode as ISO-8859-1,
                    # which maps every byte to a character and therefore cannot fail
                    with open(file_path, 'rb') as infile:
                        outfile.write(infile.read().decode('ISO-8859-1') + '\n')

# Usage
extract_7z_to_txt('../data/raw/wikipedia-aa-html.7z', 'output.txt')
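
The function above leaves `extracted_files` on disk after it runs. A minimal sketch of a variant that extracts into a temporary directory and cleans up afterwards is shown below. `decode_bytes`, `merge_text_files`, and `extract_7z_to_txt_tmp` are hypothetical names, and sorting the files for a deterministic merge order is my addition, not something the original does.

```python
import os
import tempfile


def decode_bytes(data):
    """Decode as UTF-8, falling back to ISO-8859-1 (which accepts any byte)."""
    try:
        return data.decode('utf-8')
    except UnicodeDecodeError:
        return data.decode('ISO-8859-1')


def merge_text_files(src_dir, output_txt_path):
    """Concatenate every file under src_dir into one text file; return the file count."""
    count = 0
    with open(output_txt_path, 'w', encoding='utf-8') as outfile:
        for root, dirs, files in os.walk(src_dir):
            dirs.sort()  # visit subdirectories in a deterministic order
            for filename in sorted(files):
                with open(os.path.join(root, filename), 'rb') as infile:
                    outfile.write(decode_bytes(infile.read()) + '\n')
                count += 1
    return count


def extract_7z_to_txt_tmp(archive_path, output_txt_path):
    """Extract into a temporary directory that is removed afterwards, then merge."""
    import py7zr  # third-party; imported lazily so the helpers above work without it

    with tempfile.TemporaryDirectory() as extraction_dir:
        with py7zr.SevenZipFile(archive_path, mode='r') as archive:
            archive.extractall(path=extraction_dir)
        return merge_text_files(extraction_dir, output_txt_path)
```

The output path should sit outside the directory being merged, otherwise the walk could pick up the file it is writing to.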
