# 01 Data Cleaning

This notebook is for loading, cleaning, and chunking the Spanish text files from the `data/` folder. It will cover steps such as:
- Loading raw text data.
- Preprocessing text (e.g., lowercasing, removing special characters, tokenization).
- Splitting texts into manageable chunks for analysis and model training.

## Initial Data Quality Assessment

This section will detail the initial assessment of the raw text data, identifying common issues such as:

- **Encoding Issues**: Detecting and handling incorrect character encodings.
- **Special Characters and Punctuation**: Identifying and deciding on strategies for handling non-standard characters, excessive punctuation, or symbols.
- **Whitespace and Line Breaks**: Addressing inconsistent spacing, multiple line breaks, or leading/trailing whitespace.
- **Missing or Corrupted Data**: Identifying any sections of text that are incomplete or unreadable.
- **Structural Inconsistencies**: Noting any variations in text structure that might affect parsing or chunking.
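
As an illustration of the encoding check, here is a minimal sketch that tries a short list of candidate encodings in order and reports which one succeeded. The helper name `decode_with_fallback` and the candidate list are assumptions for illustration, not part of this notebook's pipeline. Because `latin-1` accepts any byte sequence, it works only as a last resort.

```python
def decode_with_fallback(data: bytes, encodings=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding) for the first candidate encoding that decodes `data`."""
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            # This candidate cannot represent the bytes; try the next one
            continue
    raise ValueError("None of the candidate encodings could decode the data")


# Usage: bytes written as cp1252 are not valid UTF-8, so the second candidate is used
text, used = decode_with_fallback("señor".encode("cp1252"))
print(used, text)
```

Reading the file in binary mode (`open(path, 'rb')`) and passing the bytes to such a helper records which encoding was actually used, instead of failing on the first invalid byte.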

## Proposed Remediation Steps

Based on the initial assessment, this section will outline the planned steps to clean and preprocess the data:

1.  **Standardize Encoding**: Convert all text to a consistent encoding (e.g., UTF-8).
2.  **Normalize Text**:
    *   Convert all text to lowercase.
    *   Remove unwanted special characters, numbers, and excessive punctuation.
    *   Handle contractions and expand abbreviations if necessary.
3.  **Clean Whitespace**:
    *   Remove extra spaces, tabs, and newlines.
    *   Strip leading and trailing whitespace from lines and paragraphs.
4.  **Tokenization**: Break down the text into words or sentences.
5.  **Stop Word Removal**: Eliminate common words that do not carry significant meaning (e.g., "el", "la", "de", "que" in Spanish).
6.  **Lemmatization/Stemming**: Reduce words to their base or root form to reduce vocabulary size and improve consistency.
7.  **Chunking Strategy**: Define and implement a strategy for splitting the cleaned text into appropriate-sized chunks for downstream tasks.
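
Steps 2 to 5 above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions: the function names and the tiny `SPANISH_STOP_WORDS` set are hypothetical, and a real run would likely use a fuller stop-word list. Lemmatization is left out because it needs an external library.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny stop-word list used only for this illustration
SPANISH_STOP_WORDS = {"el", "la", "los", "las", "de", "que", "y", "a", "en", "un", "una"}


def normalize_text(text: str) -> str:
    """Lowercase, replace characters outside Spanish letters and basic punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFC", text).lower()
    # Anything that is not a Spanish letter, basic punctuation, or whitespace becomes a space
    text = re.sub(r"[^a-záéíóúüñ.,;:!?¡¿\s]", " ", text)
    # Collapse runs of spaces, tabs, and newlines into a single space
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text: str) -> list:
    """Split normalized text into word tokens (letters only)."""
    return re.findall(r"[a-záéíóúüñ]+", text)


def remove_stop_words(tokens, stop_words=SPANISH_STOP_WORDS):
    """Drop tokens that appear in the stop-word set."""
    return [token for token in tokens if token not in stop_words]


sample = "  Todas  las familias\n\nfelices (1877) se parecen;  "
normalized = normalize_text(sample)
print(normalized)
print(remove_stop_words(tokenize(normalized)))
```

Whether lowercasing and stop-word removal are desirable depends on the downstream task; for language-model training, for example, one may prefer to keep the original casing and function words.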

In [32]:
import os
import re
import pandas as pd

# Define the path to the raw text data
#data_file_path = '../data/Guerra y paz - Lev Nikolaievich Tolstoi.txt'
#data_file_path = '../data/Crimen y castigo (trad. F. Oter - Fiodor Dostoyevski.txt'
data_file_path = '../data/Anna Karenina - Tolstoi, Lev N_.txt'
#data_file_path = '../data/Los hermanos Karamazov - Fiodor M. Dostoievski.txt'

# Check if the file exists
if not os.path.exists(data_file_path):
    raise FileNotFoundError(f"Data file not found at {data_file_path}")

# Load the raw text data
with open(data_file_path, 'r', encoding='utf-8') as file:
    raw_text = file.read()

print(f"Successfully loaded data from {data_file_path}")
print(f"Total length of raw text: {len(raw_text)} characters")
print(f"First 500 characters:\n{raw_text[:500]}...")


Successfully loaded data from ../data/Anna Karenina - Tolstoi, Lev N_.txt
Total length of raw text: 2000809 characters
First 500 characters:
Primera parte





I



Todas las familias felices se parecen; las desdichadas lo son cada una a su modo.

Todo estaba patas arriba en casa de los Oblonski. Enterada de que su marido tenía una relación con la antigua institutriz francesa de sus hijos, le había anunciado que no podía seguir viviendo con él bajo el mismo techo. Esa situación, que se prolongaba ya por tres días, era dolorosa no sólo para el matrimonio, sino también para los demás miembros de la familia y la servidumbre. Tanto unos ...


### Data Profiling

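Next, we profile the text to identify specific data quality issues. The output shown under this cell comes from its most recent execution (In [43]), which ran after the cleaning cells further down, so its counts describe the cleaned text.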

In [43]:

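# 1. Encoding Issues
# The file was decoded strictly as UTF-8 at load time, so invalid bytes would already have raised an error.
# Here we look for the Unicode replacement character (U+FFFD), which signals text that was damaged upstream.
print("\n--- Encoding Check ---")
replacement_char_count = raw_text.count('\ufffd')
if replacement_char_count == 0:
    print("Text appears to be valid UTF-8.")
else:
    print(f"Encoding issue detected: {replacement_char_count} replacement characters (U+FFFD) found")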

# 2. Special Characters and Punctuation
print("\n--- Special Characters and Punctuation Check ---")
non_alphanumeric_pattern = r'[^a-zA-Z0-9áéíóúüñÁÉÍÓÚÜÑ.,;!?:\s]'
special_chars = re.findall(non_alphanumeric_pattern, raw_text)
unique_special_chars = sorted(list(set(special_chars)))

print(f"Number of unique special characters (excluding common punctuation): {len(unique_special_chars)}")
if unique_special_chars:
    print(f"Unique special characters: {', '.join(unique_special_chars[:20])}{'...' if len(unique_special_chars) > 20 else ''}")

# 3. Whitespace and Line Breaks
print("\n--- Whitespace and Line Break Check ---")
multiple_spaces = re.findall(r' {2,}', raw_text)
multiple_newlines = re.findall(r'\n{2,}', raw_text)
footnotes = re.findall(r'\[\d+\]', raw_text)
leading_trailing_whitespace_lines = [line for line in raw_text.split('\n') if line.strip() != '' and (line[0].isspace() or line[-1].isspace())]

print(f"Number of occurrences of multiple spaces: {len(multiple_spaces)}")
print(f"Number of occurrences of multiple newlines: {len(multiple_newlines)}")
print(f"Number of footnote markers: {len(footnotes)}")
print(f"Number of lines with leading/trailing whitespace: {len(leading_trailing_whitespace_lines)}")

# 4. Missing or Corrupted Data (simple check for empty sections)
print("\n--- Missing or Corrupted Data Check ---")
empty_sections = re.findall(r'\n\s*\n', raw_text)
print(f"Number of empty sections (multiple newlines with only whitespace): {len(empty_sections)}")

# 5. Structural Inconsistencies (example: check for common chapter/section patterns)
print("\n--- Structural Inconsistencies Check ---")
roman_numeral_pattern = r'\b[IVXLCDM]+\n+'
roman_plus_name_pattern = r'\b[IVXLCDM]+\.\s+[^\.]+\n+'
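# Caveat: 'SEPTIMA' is unaccented and 'OCTAVA' is absent, so headings such as
# 'Séptima parte' or 'Octava parte' are not matched by this pattern.
parte_pattern = r'(?i)(?:PRIMERA|SEGUNDA|TERCERA|CUARTA|QUINTA|SEXTA|SEPTIMA) PARTE\n'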
libro_pattern = r'Libro\s[^\.]+\.\s[^\.]+\n'
just_chapter_numbers = re.findall(roman_numeral_pattern, raw_text) 
chapters_plus_names = re.findall(roman_plus_name_pattern, raw_text)
if len(chapters_plus_names) > 0:
    chapters_found = chapters_plus_names
else:
    chapters_found = just_chapter_numbers
number_chapters_found = len(chapters_found)
partes_found = re.findall(parte_pattern, raw_text)
libros_found = re.findall(libro_pattern, raw_text)

print(f"Number of potential chapter headings found: {number_chapters_found}")
print(f"Number of partes found: {len(partes_found)}")
print(f"Number of libros found: {len(libros_found)}")

# Basic statistics
word_count = len(re.findall(r'\b\w+\b', raw_text))
unique_words = len(set(re.findall(r'\b\w+\b', raw_text.lower())))
print(f"\n--- Basic Statistics ---")
print(f"Total word count: {word_count}")
print(f"Unique word count: {unique_words}")



--- Encoding Check ---
Text appears to be valid UTF-8.

--- Special Characters and Punctuation Check ---
Number of unique special characters (excluding common punctuation): 22
Unique special characters: (, ), *, -, ¡, «, ´, », ¿, Ç, à, â, ä, ç, è, ê, ï, –, ’, “...

--- Whitespace and Line Break Check ---
Number of occurrences of multiple spaces: 0
Number of occurrences of multiple newlines: 0
Number of footnote markers: 0
Number of lines with leading/trailing whitespace: 0

--- Missing or Corrupted Data Check ---
Number of empty sections (multiple newlines with only whitespace): 0

--- Structural Inconsistencies Check ---
Number of potential chapter headings found: 0
Number of partes found: 0
Number of libros found: 0

--- Basic Statistics ---
Total word count: 344882
Unique word count: 21218


### Data Quality Measurement Criteria and Reporting

We define measurement criteria for each data quality issue identified above and set up a preliminary reporting mechanism.

In [36]:
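def run_data_quality_report():
    """Print a report for every check whose status is not "Acceptable".

    Relies on the module-level variables computed in the profiling cell
    (raw_text, unique_special_chars, multiple_spaces, ...), so re-run that cell first.
    """
    print("\n--- Data Quality Report ---")

    quality_issues = {
        "encoding_issues": {
            "description": "Presence of non-UTF-8 characters or decoding errors.",
            # U+FFFD (replacement character) marks bytes that could not be decoded upstream
            "status": "Acceptable" if '\ufffd' not in raw_text else "Issue detected",
            "details": "Found {} replacement characters (U+FFFD).".format(raw_text.count('\ufffd'))
        },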
        "unusual_special_characters": {
            "description": "Number of unique special characters (excluding common punctuation).",
            "count": len(unique_special_chars),
            "threshold": "< 50 unique characters (indicative, not strict)",
            "status": "High" if len(unique_special_chars) > 50 else "Acceptable",
            "details": f"Found {len(unique_special_chars)} unique special characters."
        },
        "multiple_spaces": {
            "description": "Occurrences of two or more consecutive spaces.",
            "count": len(multiple_spaces),
            "threshold": "< 1% of total word count",
            "status": "High" if len(multiple_spaces) > word_count * 0.01 else "Acceptable",
            "details": f"Found {len(multiple_spaces)} instances of multiple spaces."
        },
        "multiple_newlines": {
            "description": "Occurrences of two or more consecutive newlines.",
            "count": len(multiple_newlines),
            "threshold": "< 0.5% of total lines",
            "status": "High" if len(multiple_newlines) > len(raw_text.split('\n')) * 0.005 else "Acceptable",
            "details": f"Found {len(multiple_newlines)} instances of multiple newlines."
        },
        "leading_trailing_whitespace": {
            "description": "Lines with leading or trailing whitespace.",
            "count": len(leading_trailing_whitespace_lines),
            "threshold": "< 1% of total lines",
            "status": "High" if len(leading_trailing_whitespace_lines) > len(raw_text.split('\n')) * 0.01 else "Acceptable",
            "details": f"Found {len(leading_trailing_whitespace_lines)} lines with leading/trailing whitespace."
        },
        "footnote_markers": {
            "description": "Text containing footnote markers, which are not to be learnt from",
            "count": len(footnotes),
            "threshold": "0 occurrences",
            "status": "High" if len(footnotes) > 0 else "Acceptable",
            "details": f"Found {len(footnotes)} footnote markers."
        },    
        "book_part_markers": {
            "description": "Text book parts markers, which are not to be learnt from",
            "count": len(partes_found),
            "threshold": "0 occurrences",
            "status": "High" if len(partes_found) > 0 else "Acceptable",
            "details": f"Found {len(partes_found)} book parts markers."
        },    
        "chapters_markers": {
            "description": "Text chapter markers, which are not to be learnt from",
            "count": len(chapters_found),
            "threshold": "0 occurrences",
            "status": "High" if len(chapters_found) > 0 else "Acceptable",
            "details": f"Found {len(chapters_found)} chapters markers."
        },
        "book_markers": {
            "description": "Book markers, which are not to be learnt from",
            "count": len(libros_found),
            "threshold": "0 occurrences",
            "status": "High" if len(libros_found) > 0 else "Acceptable",
            "details": f"Found {len(libros_found)} book markers."
        },
        "empty_sections": {
            "description": "Sections of text that are empty or contain only whitespace.",
            "count": len(empty_sections),
            "threshold": "0 occurrences",
            "status": "High" if len(empty_sections) > 0 else "Acceptable",
            "details": f"Found {len(empty_sections)} empty sections."
        },


    }

    for issue, data in quality_issues.items():
        if data["status"] != "Acceptable":
            print(f"\nIssue: {issue.replace('_', ' ').title()}")
            print(f"  Description: {data['description']}")
            if 'count' in data:
                print(f"  Count: {data['count']}")
                print(f"  Threshold: {data['threshold']}")
            print(f"  Status: {data['status']}")
            print(f"  Details: {data['details']}")

    # Summary of overall data quality
    overall_status = "Good" if all(data["status"] == "Acceptable" for data in quality_issues.values()) else "Needs Attention"
    print(f"\n--- Overall Data Quality Status: {overall_status} ---")


In [37]:
run_data_quality_report()


--- Data Quality Report ---

Issue: Multiple Newlines
  Description: Occurrences of two or more consecutive newlines.
  Count: 7809
  Threshold: < 0.5% of total lines
  Status: High
  Details: Found 7809 instances of multiple newlines.

Issue: Book Part Markers
  Description: Text book parts markers, which are not to be learnt from
  Count: 6
  Threshold: 0 occurrences
  Status: High
  Details: Found 6 book parts markers.

Issue: Chapters Markers
  Description: Text chapter markers, which are not to be learnt from
  Count: 239
  Threshold: 0 occurrences
  Status: High
  Details: Found 239 chapters markers.

Issue: Empty Sections
  Description: Sections of text that are empty or contain only whitespace.
  Count: 7809
  Threshold: 0 occurrences
  Status: High
  Details: Found 7809 empty sections.

--- Overall Data Quality Status: Needs Attention ---
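
The report reads module-level variables, so its counts are only as fresh as the last run of the profiling cell. One way to avoid that coupling is to compute the counts from the text passed in. The sketch below is hypothetical (`profile_text` and `status_for` are illustrative names) and covers only the whitespace and footnote checks, using the same regular expressions as the profiling cell.

```python
import re


def profile_text(text: str) -> dict:
    """Compute whitespace and footnote counts for the given text."""
    return {
        "multiple_spaces": len(re.findall(r" {2,}", text)),
        "multiple_newlines": len(re.findall(r"\n{2,}", text)),
        "footnote_markers": len(re.findall(r"\[\d+\]", text)),
        "empty_sections": len(re.findall(r"\n\s*\n", text)),
        "total_lines": len(text.split("\n")),
    }


def status_for(count: int, limit: float) -> str:
    """Return "High" when the count exceeds the limit, otherwise "Acceptable"."""
    return "High" if count > limit else "Acceptable"


sample = "Uno  dos[1]\n\n\nTres\n \nCuatro"
profile = profile_text(sample)
print(profile)
# Same rule as the report: multiple newlines should stay below 0.5% of total lines
print(status_for(profile["multiple_newlines"], profile["total_lines"] * 0.005))
```

With this shape, the same function can be called on the raw and on the cleaned text, and the two results compared directly.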


### Data Cleaning

We start by cleaning the book to remove unwanted characters and passages, such as footnote markers and chapter headings.

In [40]:
# Clean Chapters
raw_text = re.sub(roman_plus_name_pattern, '', raw_text)
raw_text = re.sub(roman_numeral_pattern, '', raw_text)

#Clean Partes
raw_text = re.sub(parte_pattern, '', raw_text)

#Clean Libros
raw_text = re.sub(libro_pattern, '', raw_text)

# Remove footnote markers
raw_text = re.sub(r'\[\d+\]', '', raw_text)
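# Remove digits glued to the end of a word (e.g. "techo3"); the lookbehind requires a letter,
# so standalone numbers such as years are left intact (with \w, "1812" would become "1")
raw_text = re.sub(r'(?<=[^\W\d_])\d+', '', raw_text)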
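
To see the effect of the footnote patterns on a small sample, the two substitutions can be wrapped in a helper. This is a sketch; `strip_footnote_markers` is a hypothetical name. The letter-only lookbehind `[^\W\d_]` is an assumption of this sketch, chosen so that standalone numbers such as years are not truncated.

```python
import re


def strip_footnote_markers(text: str) -> str:
    """Remove bracketed markers like [12] and digits glued to the end of a word."""
    # Bracketed markers such as [3]
    text = re.sub(r"\[\d+\]", "", text)
    # Digits directly after a letter ("casa12"); digits after a space or another digit are kept
    return re.sub(r"(?<=[^\W\d_])\d+", "", text)


print(strip_footnote_markers("techo[3] y casa12, en 1812."))
```

A lookbehind on `\w` instead would also match after a digit, which turns `1812` into `1`.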

In [42]:
# Collapse runs of blank or whitespace-only lines into a single newline
blank_lines_re = re.compile(r'(?:\r?\n\s*){2,}')
raw_text = blank_lines_re.sub('\n', raw_text).strip()
print(raw_text[:1000])  # Print the first 1000 characters of the cleaned text

Todas las familias felices se parecen; las desdichadas lo son cada una a su modo.
Todo estaba patas arriba en casa de los Oblonski. Enterada de que su marido tenía una relación con la antigua institutriz francesa de sus hijos, le había anunciado que no podía seguir viviendo con él bajo el mismo techo. Esa situación, que se prolongaba ya por tres días, era dolorosa no sólo para el matrimonio, sino también para los demás miembros de la familia y la servidumbre. Tanto unos como otros se daban cuenta de que no tenía sentido que siguieran viviendo juntos, que los huéspedes ocasionales de cualquier pensión tenían más cosas en común que cuantos habitaban esa casa. La mujer no salía de sus habitaciones, y el marido hacía ya tres días que no ponía el pie por allí. Los niños corrían de un lado para otro desconcertados; la institutriz inglesa había discutido con el ama de llaves y había escrito una nota a una amiga en la que le solicitaba que le buscara una nueva colocación; el cocinero se había 

In [44]:
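# The report reads the counts computed in the profiling cell (In [43]), which was re-run on the cleaned text first
run_data_quality_report()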


--- Data Quality Report ---

--- Overall Data Quality Status: Good ---


In [45]:
print(raw_text[:10000])

Todas las familias felices se parecen; las desdichadas lo son cada una a su modo.
Todo estaba patas arriba en casa de los Oblonski. Enterada de que su marido tenía una relación con la antigua institutriz francesa de sus hijos, le había anunciado que no podía seguir viviendo con él bajo el mismo techo. Esa situación, que se prolongaba ya por tres días, era dolorosa no sólo para el matrimonio, sino también para los demás miembros de la familia y la servidumbre. Tanto unos como otros se daban cuenta de que no tenía sentido que siguieran viviendo juntos, que los huéspedes ocasionales de cualquier pensión tenían más cosas en común que cuantos habitaban esa casa. La mujer no salía de sus habitaciones, y el marido hacía ya tres días que no ponía el pie por allí. Los niños corrían de un lado para otro desconcertados; la institutriz inglesa había discutido con el ama de llaves y había escrito una nota a una amiga en la que le solicitaba que le buscara una nueva colocación; el cocinero se había 

In [46]:
# Write the cleaned text; keep uncommented only the output path that matches `data_file_path`
#output_file_path = '../data/Guerra_y_paz_cleaned.txt'
#output_file_path = '../data/crime_y_castigo_cleaned.txt'
output_file_path = '../data/anna_karenina_cleaned.txt'
#output_file_path = '../data/hermanos_karamazov_cleaned.txt'

with open(output_file_path, 'w', encoding='utf-8') as file:
    file.write(raw_text)
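
The chunking step named in the introduction could be sketched as follows. This is a minimal illustration, not a settled strategy: `chunk_paragraphs` and `max_chars` are hypothetical names, and the sketch assumes the cleaned text holds one paragraph per line, as in the output above. Paragraphs are packed greedily and never split, so a paragraph longer than `max_chars` becomes its own oversized chunk.

```python
def chunk_paragraphs(text: str, max_chars: int = 2000) -> list:
    """Greedily pack whole paragraphs (lines) into chunks of at most `max_chars` characters."""
    chunks, current, current_len = [], [], 0
    for paragraph in text.split("\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        # +1 accounts for the newline that joins paragraphs inside a chunk
        added_len = len(paragraph) + (1 if current else 0)
        if current and current_len + added_len > max_chars:
            # The current chunk is full: emit it and start a new one with this paragraph
            chunks.append("\n".join(current))
            current, current_len = [], 0
            added_len = len(paragraph)
        current.append(paragraph)
        current_len += added_len
    if current:
        chunks.append("\n".join(current))
    return chunks


print(chunk_paragraphs("aaaa\nbbbb\ncccc", max_chars=9))
```

Chunking by characters is only one option; a token-based budget may suit model training better, depending on the tokenizer used downstream.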