# Step 1: Processing the raw data

## `process_OCR_data_files(df)`

### Description:
This function processes Optical Character Recognition (OCR) text stored in a DataFrame. It tags each token with its part of speech (PoS), looks up an abbreviation for each token by matching it against the synonym lists in `X1.json`, extracts dates, and splits the tokens into chunks at the rows that contain a date. It then keeps the chunks that carry the latest date and writes them to a CSV file.

### Parameters:
- `df` (DataFrame): The input DataFrame containing OCR data, with one line of OCR text per row in a `text` column.

### Returns:
- `processed_data` (DataFrame): The filtered data, read back from the CSV file that was written, or an empty DataFrame if no chunk was kept.

### Working:

1. **Perform PoS Analysis**:
   - The function first performs PoS analysis on the text data using the spaCy library.

2. **Load Abbreviation Lookup Data**:
   - It loads abbreviation lookup data from a JSON file ('X1.json').

3. **Find Abbreviations**:
   - For each token in the DataFrame, the function looks up its abbreviation based on the loaded data and adds it as a new column.

4. **Extract Dates**:
   - Dates are extracted from the text using regular expressions for the format 'dd/mm/yyyy' or 'dd/mm/yy'.

5. **Check Chronological Order**:
   - It checks if the extracted dates are in chronological order.

6. **Chunking**:
   - The DataFrame is divided into chunks based on non-empty dates. If no non-empty dates are found, all rows are placed in a single chunk.

7. **Concatenate Chunks**:
   - All chunks are concatenated into a single DataFrame.

8. **Filter Data**:
   - The function identifies the latest date among the extracted dates and keeps only the chunks whose latest date matches it.

9. **Store Discarded Data**:
   - The latest date of each discarded chunk is collected and printed for reference.

10. **Write to CSV**:
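    - Filtered chunks are written to `./files/data_<FILE_NAME>.csv`; the 'files' directory is created if it does not exist. Every chunk uses the same file name, so a later chunk overwrites an earlier one.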

11. **Read from CSV**:
    - The function reads the written CSV file back and returns it. If no chunk was kept, it returns an empty DataFrame.
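
Steps 8 and 9 can be illustrated with a minimal sketch in plain Python, independent of pandas. It is not the function's own code: the helper `split_by_latest_date` and the sample data are hypothetical, and each chunk is represented only by the list of dates found in it.

```python
from datetime import datetime


def split_by_latest_date(chunk_dates):
    """Split chunk indexes into kept and discarded, based on the latest date.

    chunk_dates: list of lists of datetime, one inner list per chunk.
    Returns (kept_indexes, discarded_indexes).
    """
    # Latest date of each chunk; None for a chunk without any date
    latest_per_chunk = [max(dates) if dates else None for dates in chunk_dates]
    known = [d for d in latest_per_chunk if d is not None]
    if not known:
        # No dates anywhere: keep every chunk, discard nothing
        return list(range(len(chunk_dates))), []
    latest = max(known)
    kept = [i for i, d in enumerate(latest_per_chunk) if d == latest]
    discarded = [i for i, d in enumerate(latest_per_chunk) if d != latest]
    return kept, discarded


# Hypothetical sample: three chunks, the second holds the latest date
sample = [
    [datetime(2023, 5, 1)],
    [datetime(2024, 1, 18), datetime(2023, 12, 1)],
    [],
]
kept, discarded = split_by_latest_date(sample)
print(kept, discarded)
```

Treating a chunk without dates as discarded is an assumption of this sketch; it applies only when at least one other chunk has a date.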

### Example Usage:
```python
# Load DataFrame with OCR data
df_data = ...

# Process OCR data
processed_data = process_OCR_data_files(df_data)
```

```python
import glob
import json
import os
import re
from datetime import datetime

import en_core_web_sm
import pandas as pd
import spacy

# Define the function to process OCR data files
def process_OCR_data_files(df):
    # Step 1 – Perform PoS Analysis
    # Initialize an empty list to store features
    features = []
    # Load English language model
    nlp = en_core_web_sm.load()
    # Iterate over each text in the DataFrame
    for text in df['text']:
        # Process text with spaCy
        doc = nlp(text)
        # Iterate over each token in the document
        for token in doc:
            # Append token and its part-of-speech tag to features list
            features.append({'token': token.text, 'pos': token.pos_})
    # Create a DataFrame from the features list
    df_pos = pd.DataFrame.from_records(features)

    # Load X1.json for abbreviation lookup
    with open('X1.json', 'r') as f:
        data = json.load(f)

    # Function to find abbreviation for a token
    def find_abbreviation(token):
        for item in data:
            if token in item['Synonyms']:
                return item['Abbreviation']
        return None

    # Apply the function to add a new column with abbreviations
    df_pos['Abbreviation'] = df_pos['token'].apply(find_abbreviation)

    # Function to extract dates from text using regular expressions
    def extract_dates(text):
        # Regular expression for date format (dd/mm/yyyy or dd/mm/yy)
        pattern = r'\b(\d{1,2}/\d{1,2}/\d{2,4})\b'
        # Find all matches of date pattern in the text
        matches = re.findall(pattern, text)
        dates = []
        for match in matches:
            # Try the four-digit year first, then the two-digit year
            for fmt in ('%d/%m/%Y', '%d/%m/%y'):
                try:
                    dates.append(datetime.strptime(match, fmt))
                    break
                except ValueError:
                    continue  # Not this format; try the next one
        return dates

    # Apply the function to extract dates from 'token' column
    df_pos['dates'] = df_pos['token'].apply(extract_dates)

    # Check if dates are in chronological order
    all_dates = [date for sublist in df_pos['dates'] for date in sublist]
    chronological_order = all(all_dates[i] <= all_dates[i+1] for i in range(len(all_dates)-1))

    if chronological_order:
        print("Dates are in chronological order.")
    else:
        print("Dates are not in chronological order.")

    # Function to get indexes of rows with non-empty dates
    def non_empty_dates_indexes(df):
        return df[df['dates'].apply(lambda x: len(x) > 0)].index.tolist()

    # Copy DataFrame for manipulation
    temp = df_pos.copy()

    # Get indexes of rows with non-empty dates
    indexes_with_non_empty_dates = non_empty_dates_indexes(temp)
    print(indexes_with_non_empty_dates)

    # Check if non-empty dates exist
    if indexes_with_non_empty_dates:
        # Function to generate tuples for date intervals.
        # Each interval runs from one date row up to, but not including, the next;
        # rows from the last date row onward fall outside every interval.
        def generate_tuples(lst):
            tuples = [(0, lst[0])]
            for i in range(1, len(lst)):
                tuples.append((lst[i-1], lst[i]))
            return tuples

        date_intervals = generate_tuples(indexes_with_non_empty_dates)
        print("")
        print(date_intervals)

        # Function to divide dataframe based on index ranges
        def divide_dataframe(df, index_ranges):
            dataframes = []
            for start, end in index_ranges:
                chunk = df.iloc[start:end]
                dataframes.append(chunk)
            return dataframes

        # Divide DataFrame into chunks based on date intervals
        chunks = divide_dataframe(temp, date_intervals)
    else:
        # If no non-empty dates exist, create a single chunk with all rows
        chunks = [temp]

    # Print chunks
    for i, chunk in enumerate(chunks):
        print(f"Chunk {i+1}:")
        print(chunk)

    # Concatenate all chunks into a single DataFrame
    concatenated_df = pd.concat(chunks, ignore_index=True)

    # Find the latest date
    latest_date = concatenated_df['dates'].max()

    # Filter chunks to keep only those with the latest date
    filtered_chunks = [chunk for chunk in chunks if chunk['dates'].max() == latest_date]

    # Find and store dates of discarded chunks
    discarded_dates = [chunk['dates'].max() for chunk in chunks if chunk['dates'].max() != latest_date]

    # Create the 'files' directory if it does not exist
    os.makedirs('files', exist_ok=True)

    # Print information about the chunks
    if len(filtered_chunks) > 1:
        print("Multiple chunks with identical latest dates are added.")
    else:
        print("Only one chunk is added.")

    # Print the number of discarded chunks and their dates
    if len(discarded_dates) > 0:
        print(f"{len(discarded_dates)} chunks were discarded.")
        print("Dates of discarded chunks:")
        for date in discarded_dates:
            print(date)

    # Write filtered chunks to CSV.
    # FILE_NAME is a module-level variable set by the loop below before each call.
    # Every chunk is written to the same path, so a later chunk overwrites an earlier one.
    output_path = f"./files/data_{FILE_NAME}.csv"
    for i, chunk in enumerate(filtered_chunks):
        print(f"Chunk {i+1}:")
        print(chunk)
        chunk.to_csv(output_path, index=False)

    # Read the written chunk back from CSV
    if len(filtered_chunks) > 0:
        return pd.read_csv(output_path)
    else:
        return pd.DataFrame()

# Iterate over each file in the 'OCR data' directory
for file_path in glob.glob('./OCR data/*.txt'):
    # File name without its directory or extension (works with either path separator)
    FILE_NAME = os.path.splitext(os.path.basename(file_path))[0]

    # Load data from the current file
    with open(file_path, 'r') as file:
        data = [line.strip() for line in file]
    # Create a DataFrame from the loaded data
    df_data = pd.DataFrame(data, columns=['text'])
    # Process the OCR data
    processed_data = process_OCR_data_files(df_data)
    # Optional: You can do further processing or analysis with the processed data
```

```text
Dates are in chronological order.
[]
Chunk 1:
        token    pos Abbreviation dates
0     Patient  PROPN         None    []
1        Name  PROPN         None    []
2          ee    ADP         None    []
3     Barcode  PROPN         None    []
4           ;  PUNCT         None    []
..        ...    ...          ...   ...
438   Medical  PROPN         None    []
439      Care  PROPN         None    []
440        in    ADP         None    []
441  Diabetes  PROPN         None    []
442      2015    NUM         None    []

[443 rows x 4 columns]
Only one chunk is added.
Chunk 1:
        token    pos Abbreviation dates
0     Patient  PROPN         None    []
1        Name  PROPN         None    []
2          ee    ADP         None    []
3     Barcode  PROPN         None    []
4           ;  PUNCT         None    []
..        ...    ...          ...   ...
438   Medical  PROPN         None    []
439      Care  PROPN         None    []
440        in    ADP         None    []
441  Diabetes  P

Dates are in chronological order.
[189]

[(0, 189)]
Chunk 1:
        token    pos Abbreviation dates
0          29    NUM         None    []
1         APR  PROPN         None    []
2         MRN   NOUN         None    []
3           :  PUNCT         None    []
4        Name   NOUN         None    []
..        ...    ...          ...   ...
184   Comment  PROPN         None    []
185      COM1   NOUN         None    []
186  Comments   NOUN         None    []
187         :  PUNCT         None    []
188         (  PUNCT         None    []

[189 rows x 4 columns]
Only one chunk is added.
Chunk 1:
        token    pos Abbreviation dates
0          29    NUM         None    []
1         APR  PROPN         None    []
2         MRN   NOUN         None    []
3           :  PUNCT         None    []
4        Name   NOUN         None    []
..        ...    ...          ...   ...
184   Comment  PROPN         None    []
185      COM1   NOUN         None    []
186  Comments   NOUN         None    []
1

Dates are not in chronological order.
[9, 33, 37, 41, 180, 184, 188, 347, 351, 355, 461, 465, 469, 653, 675, 680, 684, 799, 821, 825, 829, 1063, 1085, 1089, 1093, 1270, 1292, 1296, 1300, 1427, 1458, 1462, 1466, 1587, 1609, 1613, 1617, 1909, 1931, 1935, 1939, 2415, 2446, 2450, 2454, 2574, 2591, 2593, 2701, 2747, 2769, 2773, 2778, 2912, 2934, 2938, 2942, 3104, 3135, 3139, 3143, 3268, 3290, 3294, 3298]

[(0, 9), (9, 33), (33, 37), (37, 41), (41, 180), (180, 184), (184, 188), (188, 347), (347, 351), (351, 355), (355, 461), (461, 465), (465, 469), (469, 653), (653, 675), (675, 680), (680, 684), (684, 799), (799, 821), (821, 825), (825, 829), (829, 1063), (1063, 1085), (1085, 1089), (1089, 1093), (1093, 1270), (1270, 1292), (1292, 1296), (1296, 1300), (1300, 1427), (1427, 1458), (1458, 1462), (1462, 1466), (1466, 1587), (1587, 1609), (1609, 1613), (1613, 1617), (1617, 1909), (1909, 1931), (1931, 1935), (1935, 1939), (1939, 2415), (2415, 2446), (2446, 2450), (2450, 2454), (2454, 2574), (2574,

Dates are in chronological order.
[19, 393]

[(0, 19), (19, 393)]
Chunk 1:
          token    pos Abbreviation dates
0     iilaverty  PROPN         None    []
1           -—-  PROPN         None    []
2     Pathology  PROPN         None    []
3        Report  PROPN         None    []
4       resuits   VERB         None    []
5         @RCPA    ADV         None    []
6     pathology   NOUN         None    []
7     ENQUIRIES   VERB         None    []
8           tet  PROPN         None    []
9        cramer   NOUN         None    []
10           NF  PROPN         None    []
11      removed   VERB         None    []
12      removed   VERB         None    []
13      removed   VERB         None    []
14        Phone   NOUN         None    []
15        D.O.B  PROPN         None    []
16  XXXXXXXXXXX  PROPN         None    []
17      removed   VERB         None    []
18     Reported   VERB         None    []
Chunk 2:
           token    pos Abbreviation                  dates
19    18/01/2024
```

## Hardest Parts and Challenges:

### PoS Analysis with spaCy:
Performing Part-of-Speech (PoS) analysis using spaCy was challenging due to the complexity of natural language processing (NLP) tasks. It required understanding spaCy's API, tokenization, and parsing of text data.

### Date Extraction with Regular Expressions:
Extracting dates from OCR text using regular expressions presented challenges due to variations in date formats and potential errors in recognition. Handling different date formats ('dd/mm/yyyy' or 'dd/mm/yy') required careful consideration and robust error handling.
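
As a minimal sketch of the approach described above, the following standalone helper matches both 'dd/mm/yyyy' and 'dd/mm/yy' and skips strings that look like dates but are not valid. The names `DATE_PATTERN` and `parse_dates` are hypothetical and the regular expression is slightly stricter than the one in the function (it rejects three-digit years).

```python
import re
from datetime import datetime

# Day and month of one or two digits, then a four-digit or two-digit year
DATE_PATTERN = re.compile(r'\b(\d{1,2}/\d{1,2}/(?:\d{4}|\d{2}))\b')


def parse_dates(text):
    """Return every valid dd/mm/yyyy or dd/mm/yy date found in text."""
    dates = []
    for match in DATE_PATTERN.findall(text):
        # Try the four-digit year first, then the two-digit year
        for fmt in ('%d/%m/%Y', '%d/%m/%y'):
            try:
                dates.append(datetime.strptime(match, fmt))
                break
            except ValueError:
                continue  # Wrong format or impossible date; try the next format
    return dates


print(parse_dates('Reported 18/01/2024'))
print(parse_dates('Collected 31/02/2024'))  # 31 February is skipped
```

Note that `%y` follows the POSIX convention used by `datetime.strptime`: two-digit years 69 to 99 map to 1969 to 1999, and 00 to 68 map to 2000 to 2068.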

### Chunking Based on Dates:
Chunking the DataFrame based on dates was complex, especially when dealing with varying intervals between dates. The algorithm had to handle scenarios with non-contiguous date ranges and ensure the integrity of data chunks.
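
The interval logic can be sketched without pandas. The hypothetical helper `date_intervals_covering` below takes the sorted row indexes that contain a date and the total row count, and returns half-open `(start, end)` pairs. Unlike the function's `generate_tuples`, it also adds a closing interval from the last date row to the end of the data; that extension is an assumption of this sketch, not the function's behavior.

```python
def date_intervals_covering(date_rows, n_rows):
    """Return half-open (start, end) index pairs that cover rows 0..n_rows-1.

    date_rows: sorted row indexes that contain a date.
    n_rows: total number of rows.
    """
    if not date_rows:
        # No dates: a single interval spanning every row
        return [(0, n_rows)]
    # Boundaries are the start of the data, each date row, and the end of the data
    boundaries = [0] + [row for row in date_rows if row > 0] + [n_rows]
    # Pair consecutive boundaries and drop empty intervals
    return [(start, end) for start, end in zip(boundaries, boundaries[1:]) if start < end]


# Hypothetical example: dates at rows 19 and 393 in a 500-row table
print(date_intervals_covering([19, 393], 500))
```

Each pair can be passed to `DataFrame.iloc[start:end]`, which uses the same half-open convention.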

### Handling Abbreviation Lookup:
Implementing abbreviation lookup involved loading data from a JSON file and efficiently searching for abbreviations based on token synonyms. Managing the data structure and optimizing the lookup process were challenging tasks.
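
One way to speed up the lookup is to build a dictionary from synonym to abbreviation once, instead of scanning every entry for every token. The sketch below assumes each entry has a `Synonyms` list and an `Abbreviation` string, as the function's code suggests; the helper name and the sample entries are hypothetical and do not come from `X1.json`.

```python
def build_synonym_index(entries):
    """Map each synonym to its abbreviation; the first entry wins on duplicates."""
    index = {}
    for entry in entries:
        for synonym in entry['Synonyms']:
            # setdefault keeps the first abbreviation seen for a synonym,
            # matching a linear scan that returns on the first hit
            index.setdefault(synonym, entry['Abbreviation'])
    return index


# Hypothetical entries in the assumed X1.json layout
entries = [
    {'Abbreviation': 'BP', 'Synonyms': ['Blood pressure', 'B.P.']},
    {'Abbreviation': 'HR', 'Synonyms': ['Heart rate', 'Pulse']},
]
index = build_synonym_index(entries)

print(index.get('Pulse'))    # abbreviation for a known synonym
print(index.get('Barcode'))  # None for an unknown token
```

With the index in place, the column can be filled with `df_pos['token'].map(index.get)` rather than a per-token scan.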

### Writing to CSV Files:
Writing filtered data chunks to CSV files required managing file paths, ensuring file integrity, and handling potential errors during file I/O operations. Ensuring that each chunk was correctly written to the designated CSV file was crucial for data preservation.
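
A minimal sketch of one way to keep chunks from overwriting each other is to give every chunk its own numbered file. It uses only the standard library; the helper `write_chunks`, the `data_<stem>_<n>.csv` naming scheme, and the sample rows are assumptions for illustration, not the function's current behavior.

```python
import csv
import tempfile
from pathlib import Path


def write_chunks(chunks, out_dir, stem):
    """Write each chunk (a list of row dicts) to its own CSV file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # create the directory if needed
    paths = []
    for number, rows in enumerate(chunks, start=1):
        path = out_dir / f"data_{stem}_{number}.csv"
        with path.open('w', newline='', encoding='utf-8') as handle:
            writer = csv.DictWriter(handle, fieldnames=['token', 'pos'])
            writer.writeheader()
            writer.writerows(rows)
        paths.append(path)
    return paths


# Hypothetical chunks, written to a temporary directory
sample_chunks = [
    [{'token': 'Patient', 'pos': 'PROPN'}],
    [{'token': 'Name', 'pos': 'PROPN'}],
]
with tempfile.TemporaryDirectory() as tmp:
    names = [p.name for p in write_chunks(sample_chunks, tmp, 'report')]
print(names)
```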

### Chronological Order Validation:
Validating if dates extracted from the OCR text were in chronological order posed a challenge, especially when dealing with large datasets. Ensuring the accuracy of date ordering required careful iteration and comparison of date values.
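
The check itself reduces to comparing each date with its successor. A minimal sketch, with the hypothetical helper name `is_chronological`:

```python
from datetime import datetime


def is_chronological(dates):
    """Return True if every date is on or before the one that follows it."""
    return all(earlier <= later for earlier, later in zip(dates, dates[1:]))


ordered = [datetime(2023, 5, 1), datetime(2023, 5, 1), datetime(2024, 1, 18)]
shuffled = [datetime(2024, 1, 18), datetime(2023, 5, 1)]
print(is_chronological(ordered), is_chronological(shuffled))
```

An empty list or a single date counts as chronological, which matches the function reporting "Dates are in chronological order." for files without any dates.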

### Error Handling and Robustness:
Ensuring robust error handling throughout the codebase was essential to handle various edge cases, such as invalid dates, missing abbreviation data, or unexpected text formats. Implementing robust error handling mechanisms enhanced the reliability of the function.

### Documentation and Code Clarity:
Documenting the code and maintaining clarity in function descriptions, parameter usage, and return values were critical for readability and understanding. Clear documentation facilitated collaboration and maintenance of the codebase.

### Performance Optimization:
Optimizing the performance of date extraction, chunking, and file operations was necessary to enhance the efficiency of the function, especially when dealing with large OCR datasets. Implementing efficient algorithms and data structures helped improve overall performance.

### Cross-Platform Compatibility:
Ensuring cross-platform compatibility, especially regarding file paths and directory structures, was important to make the function portable across different operating systems. Handling platform-specific quirks and ensuring consistent behavior added complexity to the development process.
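
For file names in particular, `pathlib` avoids splitting paths on a hard-coded separator. The sketch below uses the pure path classes so that both path styles can be shown on any operating system; the file name `report.txt` is a hypothetical example.

```python
from pathlib import PurePosixPath, PureWindowsPath

# The stem is the final path component without its extension
posix_stem = PurePosixPath('./OCR data/report.txt').stem
windows_stem = PureWindowsPath(r'.\OCR data\report.txt').stem

print(posix_stem, windows_stem)
```

In real code, `pathlib.Path(file_path).stem` picks the right flavor for the current platform.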
