## `process_records(csv_file, FILE_NAME)`

### Description:
This function reads tokenized text from a CSV file, substitutes known abbreviations, and extracts parameter, value, and unit triples, which it writes to a text file.

### Parameters:
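- `csv_file` (str): Path to the CSV file. The file must contain `token` and `Abbreviation` columns.
- `FILE_NAME` (str): Identifier for the processed file; it is used in the name of the output file.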

### Returns:
None. The results are written to `./results/result_{FILE_NAME}.txt`.

### Working:

1. **Read CSV File**:
   - Read the CSV file into a DataFrame.

2. **Replace Values**:
   - Replace 'token' column values with 'Abbreviation' values where 'Abbreviation' is not NaN.

3. **Load JSON Data**:
   - Load JSON data from 'X1.json' and extract values of 'Abbreviation'.

4. **Concatenate Tokens**:
   - Concatenate 'token' column values into a single string.

5. **Define Regex Pattern**:
   - Define a regex pattern that matches any abbreviation (as listed or uppercased) followed by whitespace or punctuation.

6. **Tokenization**:
   - Split the concatenated string on the regex pattern, keeping each matched abbreviation as its own token.

7. **Combine Tokens**:
   - Combine consecutive tokens into phrases until an abbreviation is encountered.

8. **Extract Parameter Values**:
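   - For each word in a phrase that is a known unit and is directly preceded by a number, record the first word of the phrase as the parameter, the number as the value, and the word as the unit.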

9. **Write Results to Text File**:
   - Write the extracted results to './results/result_{FILE_NAME}.txt', one record per line.
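
Steps 5 to 7 can be illustrated on a toy string. This is a minimal sketch, not the notebook's exact code: the abbreviation list and the input text are made up for illustration (the real abbreviations come from 'X1.json'), and `re.escape` is added here as a precaution.

```python
import re

# Hypothetical abbreviation list; the real one is loaded from 'X1.json'.
abbreviations = ["ESR", "LDH"]

# Toy input standing in for the concatenated 'token' column.
text = "ESR : 20 mm LDH : 35 g"

# The capturing group makes re.split keep each matched abbreviation
# as its own element in the result.
pattern = r'(\b(?:' + '|'.join(re.escape(a) for a in abbreviations) + r')\b)[\s,:.]+'
tokens = [t.strip() for t in re.split(pattern, text) if t.strip()]

# Start a new phrase at each abbreviation and append the following tokens
# until the next abbreviation is reached.
phrases = []
for token in tokens:
    if token in abbreviations or not phrases:
        phrases.append(token)
    else:
        phrases[-1] += " " + token

print(tokens)   # ['ESR', '20 mm', 'LDH', '35 g']
print(phrases)  # ['ESR 20 mm', 'LDH 35 g']
```

Each phrase therefore starts with an abbreviation, which is why the extraction step can treat the first word of a phrase as the parameter name.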

### Example:
```python
process_records('data.csv', 'data')
```

In [1]:
import glob
import pandas as pd
import json
import re

def process_records(csv_file, FILE_NAME):
    print(f"Processing file: {csv_file}")
    # Read the CSV file into a DataFrame
    df = pd.read_csv(csv_file)
    
    # Replace 'token' with 'Abbreviation' wherever 'Abbreviation' is not NaN
    abbr_df = df.copy()
    abbr_df['token'] = abbr_df['Abbreviation'].fillna(abbr_df['token'])
    abbr_df = abbr_df.drop(columns=['Abbreviation'])
    
    # Load the JSON file
    with open('X1.json', 'r') as file:
        data = json.load(file)
    
    # Extract values of 'Abbreviation' from the JSON data
    abbr_value_list = [item['Abbreviation'] for item in data if 'Abbreviation' in item]
    
    # Concatenate 'token' column values into a single string
    concatenated_strings = " ".join(abbr_df['token'].astype(str))
    print("Concatenated strings:", concatenated_strings)
    
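    # Define a regex pattern for finding abbreviation matches.
    # re.escape keeps characters such as '.' or '+' in an abbreviation from being read as regex syntax.
    abbr_alternatives = [re.escape(word) for word in abbr_value_list + [word.upper() for word in abbr_value_list]]
    pattern = r'(\b(?:' + '|'.join(abbr_alternatives) + r')\b)[\s,:.]+'
    print("Regex pattern:", pattern)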
    
    # Tokenize the original text based on the regex pattern
    tokens = re.split(pattern, concatenated_strings)
    tokens = [token.strip() for token in tokens if token.strip()]  # Remove empty tokens and strip leading/trailing whitespaces
    print("Tokens:", tokens)
    
    # Combine tokens into phrases
    combined_tokens = []
    i = 0
    while i < len(tokens):
        combined_token = tokens[i]
        i += 1
        while i < len(tokens) and tokens[i] not in abbr_value_list:
            combined_token += ' ' + tokens[i]
            i += 1
        combined_tokens.append(combined_token)
    print("Combined tokens:", combined_tokens)
    
    # Collected results and the units recognized during extraction
    result = []
    universal_units = [
        "mm", "cm", "m", "km", "mg", "g", "kg", "ml", "L", "°C", "°F", 
        "in", "ft", "yd", "mi", "M", "k", "°", "s", "Hz", "N", "Pa", "J", 
        "W", "C", "V", "F", "Ω", "S", "Wb", "T", "H", "lm", "lx", "Bq", 
        "Gy", "Sv", "kat", "mol", "dB", "rpm", "rad", "g/cm³", "m/s²", 
        "m²", "m³", "m/s", "m/s²", "m/s³", "m/s⁴", "m/s⁵", "m/s⁶", 
        "m²/s", "m²/s²", "m³/s", "m³/s²", "m³/s³", "m/s³", "N/m", "N/m²", 
        "N/m³", "Pa/s", "Pa/s²", "J/K", "J/mol", "J/(mol·K)", "J/kg", 
        "J/(kg·K)", "J/m", "J/m²", "J/m³", "W/m", "W/m²", "W/m³", "W/(m·K)", 
        "W/(m²·K)", "W/(m³·K)", "C/m³", "C/m²", "C/m", "V/m", "F/m", "Ω/m", 
        "S/m", "T/m", "H/m", "mol/m³", "kat/m³", "kg/m", "g/m", "mol/m²", 
        "mol/m³", "kat/m²", "kat/m³", "g/kmol", "g/mol", "g/m²", "g/m³", 
        "kg/kmol", "kg/mol", "kg/m²", "kg/m³", "L/mol", "L/m³", "lm/W", 
        "lx·s", "kat/kg", "kat/mg", "kat/g", "kat/kg", "kat/kg·s", "kat/mg·s",
        "kat/g·s", "kat/kg·s", "m²/g", "m²/kg", "m²/mg", "m²/g·s", "m²/kg·s",
        "m²/mg·s", "m³/g", "m³/kg", "m³/mg", "m³/g·s", "m³/kg·s", "m³/mg·s",
        "m/s·g", "m/s·kg", "m/s·mg", "m/s·g·s", "m/s·kg·s", "m/s·mg·s", 
        "m/s²·g", "m/s²·kg", "m/s²·mg", "m/s²·g·s", "m/s²·kg·s", "m/s²·mg·s",
        "N·s", "N/m²·s", "N/m³·s", "N·m", "N/m·s", "N/m²·s", "N/m³·s", "Pa·s",
        "Pa·s²", "Pa/m", "Pa/m²", "Pa/m³", "Pa/m²·s", "J/m", "J/m²", "J/m³", 
        "J/kg·s", "J/m²·s", "J/m³·s", "W/m", "W/m²", "W/m³", "W/m²·K", "W/m³·K", 
        "W/m²·s", "W/m³·s", "C/m", "C/m²", "C/m³", "C/m²·s", "C/m³·s", "V/m", 
        "V/m²", "V/m³", "F/m", "F/m²", "F/m³", "Ω/m", "Ω/m²", "Ω/m³", "S/m", 
        "S/m²", "S/m³", "T/m", "T/m²", "T/m³", "H/m", "H/m²", "H/m³", "lm/m²", 
        "lx/m²", "Bq/m", "Bq/m²", "Bq/m³", "Gy/s", "Gy/m²", "Gy/m³", "Sv/s", 
        "Sv/m²", "Sv/m³", "kat/m", "kat/m²", "kat/m³", "mol/m", "mol/m²", 
        "mol/m³", "dB/m", "rpm/m", "rpm/m²", "rad/m", "rad/m²", "g/cm²", "g/cm³", 
        "kg/m²", "kg/m³", "g/m²", "g/m³", "mg/m²", "mg/m³", "kg/m²·s", "kg/m³·s", 
        "g/m²·s", "g/m³·s", "mg/m²·s", "mg/m³·s", "kg/m²·K", "kg/m³·K", "g/m²·K", 
        "g/m³·K", "mg/m²·K", "mg/m³·K", "kg/m²·s·K", "kg/m³·s·K", "g/m²·s·K", 
        "g/m³·s·K", "mg/m²·s·K", "mg/m³·s·K"
    ]

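    # Extract parameter values from combined tokens
    for item in combined_tokens:
        words = item.split()
        for i, word in enumerate(words):
            # The word is lowercased before the lookup, so mixed-case entries in
            # universal_units such as "L", "Hz", or "Pa" can never match here.
            if word.lower() in universal_units:
                # Accept the unit only when the previous word is a plain number
                if i > 0 and re.match(r'^\d+(?:\.\d+)?$', words[i - 1]):
                    value = words[i - 1]
                    unit = word.lower()
                    parameter = words[0]  # first word of the phrase
                    result.append({"parameter": parameter, "value": value, "unit": unit})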
    print("Extracted results:", result)
    
    # Write the extracted results to a text file
    with open(f"./results/result_{FILE_NAME}.txt", 'w') as file:
        for item in result:
            file.write("%s\n" % item)

# Get the list of CSV files in the directory
records = glob.glob('./files/*.csv')
# Process each CSV file iteratively
for file_path in records:
    FILE_NAME = file_path.split("_", 2)[1].split(".", 2)[0]
    process_records(file_path, FILE_NAME)

Processing file: ./files/data_02b832e1-66cc-4f35-8b52-abf41cd821b2.csv
Concatenated strings: Patient Name ee Barcode ; [ eennnennninittte Age : Gender : 64 / Female Sample Collected On : 03 / Mar/2022 07:16AM Order I2 d : 483940780 Sample Received On : 03 / Mar/2022 11:54AM Referred By : Self Report Generated On : 03 / Mar/2022 12:57PM Customer Since : 03 / Mar/2022 Sample Temperature : Maintained Y Sample Type : Whole Blood EDTA Report Status : Final Report DEPARTMENT OF BIOCHEMISTRY HBAIC Test Name Value Unit Bio . Ref Interval HbAlIc - Glycated Hemoglobin Hbalc ( Glycosylated Hemoglobin ) | 10.50 % 4.2 - 5.7 Method : HPLC Average Estimated Glucose - plasma 254.65 mg / dt Method : Calculated INTERPRETATION : AS PER AMERICAN DIABETES ASSOCIATION ( ADA ): + REFERENCE GROUP GLYCOSYLATED HEMOGLOBIN ( HBA‘Cc ) in % Non diabetic < 5.7 At Risk ( Prediabetes ) 5.7 -6.4 Diagnosing Diabetes > = 6.5 Age > 19 Years Goals of Therapy : < 7.0 Actions Suggested : > 8.0 Therapeutic goals for glycemic

# Challenges and Hard Parts: `process_records` Function

## Overview:
The `process_records` function parses CSV files, replaces values, extracts parameter values, and writes the results to a text file. Despite this straightforward goal, the implementation ran into several challenges.

## Challenges:

1. **Data Parsing Complexity**:
   - Parsing CSV files and handling DataFrame operations were challenging, especially with large datasets or complex data structures.

2. **Abbreviation Matching**:
   - Matching abbreviations from the 'Abbreviation' column to the 'token' column involved complexity in handling cases, punctuation, and variations in text format.

3. **Regular Expression Complexity**:
   - Constructing and debugging regular expressions for tokenization and abbreviation matching was complex, particularly across varied text formats and patterns.

4. **Data Integration**:
   - Integrating data from different sources, such as CSV files and JSON files, required careful handling to ensure data consistency and accuracy.

5. **Error Handling**:
   - Handling errors gracefully, such as missing files or unexpected data formats, was crucial for robustness and reliability.
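
For the abbreviation matching and regular expression items, one possible way to make the pattern less sensitive to case and punctuation is sketched below. The helper name `build_abbreviation_pattern` and the sample abbreviations are hypothetical and are not part of the notebook; the sketch only assumes that abbreviations should match as whole words regardless of case.

```python
import re

def build_abbreviation_pattern(abbreviations):
    """Compile a case-insensitive, whole-word pattern (hypothetical helper)."""
    # re.escape keeps characters like '.' or '+' from being read as regex syntax.
    alternatives = "|".join(re.escape(a) for a in sorted(set(abbreviations)))
    # \b on both sides prevents matches inside longer words.
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)

pattern = build_abbreviation_pattern(["Hb", "HbA1c", "ESR"])
matches = [m.group(0) for m in pattern.finditer("hba1c 10.5 % and ESR : 20 mm")]
print(matches)  # ['hba1c', 'ESR']
```

With `re.IGNORECASE` there is no need to add an uppercased copy of every abbreviation, but any later comparison against the abbreviation list would then have to normalize case as well.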

## Hard Parts:

1. **Optimizing Performance**:
   - Ensuring the function's efficiency and scalability, especially with large datasets, required optimization techniques and careful resource management.

2. **Handling Edge Cases**:
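   - Addressing edge cases, such as rare abbreviations or irregular text patterns, made comprehensive data processing difficult. In the run shown in this notebook, for example, the value 254.65 mg, which the source text lists under "Average Estimated Glucose", is attributed to the parameter 'I2' from the fragment "Order I2 d".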

3. **Maintaining Code Readability**:
   - Balancing code complexity against readability was crucial for maintainability and collaboration, especially in a function with multiple processing steps and dependencies.

4. **Testing and Validation**:
   - Developing comprehensive test cases and validation methods to verify the accuracy and correctness of the function's output was essential but time-consuming.

5. **Documentation and Communication**:
   - Clearly documenting the function's behavior, parameters, and usage, and communicating challenges and solutions to team members, stakeholders, and users, was vital for shared understanding.
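
As an illustration of the testing and validation item, the extraction step could be pulled into a small pure function and checked with assertions. This is only a sketch: `extract_measurements` is a hypothetical helper, the unit set is shortened, and units are matched with their exact case, which differs from the lowercased lookup in the notebook.

```python
import re

NUMBER = re.compile(r"^\d+(?:\.\d+)?$")

def extract_measurements(phrase, units):
    """Return (parameter, value, unit) tuples found in one phrase (hypothetical helper)."""
    words = phrase.split()
    found = []
    for i in range(1, len(words)):
        # A unit only counts when the word directly before it is a plain number.
        if words[i] in units and NUMBER.match(words[i - 1]):
            found.append((words[0], words[i - 1], words[i]))
    return found

# Shortened unit set, for illustration only.
units = {"mm", "mg", "g", "ml", "L"}

assert extract_measurements("ESR 20 mm", units) == [("ESR", "20", "mm")]
assert extract_measurements("EGFR > 90 ml", units) == [("EGFR", "90", "ml")]
assert extract_measurements("ESR mm", units) == []  # no number before the unit
assert extract_measurements("", units) == []        # empty phrase
```

Keeping the extraction free of file access makes such checks fast to run and easy to extend with new edge cases.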

## Conclusion:
The `process_records` function involved several challenges and hard parts related to data parsing, text processing, error handling, performance optimization, and documentation. Addressing these challenges required a combination of technical expertise, problem-solving skills, and effective communication within the development team.


# Results can be read here 

In [2]:
results = glob.glob('./results/*.txt')
for path in results:
    # Print each result file as a list of lines
    with open(path, "r") as file:
        content = file.readlines()
    print(content)

[]
[]
["{'parameter': 'PH', 'value': '10', 'unit': '°'}\n", "{'parameter': 'RBC', 'value': '10', 'unit': '°'}\n", "{'parameter': 'RBC', 'value': '3', 'unit': 'mi'}\n"]
["{'parameter': 'ANDRO', 'value': '2', 'unit': 'm'}\n", "{'parameter': 'ANDRO', 'value': '7', 'unit': 'g'}\n"]
["{'parameter': 'ESR', 'value': '20', 'unit': 'mm'}\n", "{'parameter': 'ESR', 'value': '150', 'unit': 'g'}\n", "{'parameter': 'EGFR', 'value': '90', 'unit': 'ml'}\n", "{'parameter': 'EGFR', 'value': '90', 'unit': 'ml'}\n", "{'parameter': 'AST', 'value': '80', 'unit': 'g'}\n", "{'parameter': 'AST', 'value': '48', 'unit': 'g'}\n", "{'parameter': 'AST', 'value': '39', 'unit': 'g'}\n", "{'parameter': 'AST', 'value': '110', 'unit': 'mm'}\n", "{'parameter': 'LDH', 'value': '35', 'unit': 'g'}\n"]
["{'parameter': 'EGFR', 'value': '79', 'unit': 'ml'}\n"]
[]
[]
["{'parameter': 'I2', 'value': '254.65', 'unit': 'mg'}\n"]
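
Because each line in a result file is the string form of a Python dict, it can be turned back into a dict for further processing. A minimal sketch using `ast.literal_eval` from the standard library, with one line copied from the output above:

```python
import ast

# One line copied from the output above, as written by process_records.
line = "{'parameter': 'ESR', 'value': '20', 'unit': 'mm'}\n"

# literal_eval only evaluates Python literals, which makes it safer than eval.
record = ast.literal_eval(line.strip())

# The value is stored as a string, so convert it before doing arithmetic.
value = float(record["value"])
print(record["parameter"], value, record["unit"])  # ESR 20.0 mm
```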
