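This script assigns license labels by matching license IDs from a CSV file against the filenames of preprocessed license text files. It scores every candidate filename with fuzzy matching but accepts only a score of 100, so a license is matched only when its cleaned ID is effectively identical to the cleaned filename. The best lower-scoring candidate is printed for review.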

Here's a breakdown of the steps:

1. **Load CSV Data:** The script reads a CSV file (assumed to be located at '../../data/raw/data2-0.csv') using pandas.
2. **Preprocessed License Directory:** It defines the directory containing preprocessed license text files (assumed to be '../../data/processed/preprocessed_licenses_txt').
3. **Iterate Through CSV Rows:** The script iterates through each row in the CSV data frame.
4. **Extract License ID and Family:** It extracts the license ID (assuming it's in a column named 'License ID') and license family from the current row.
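5. **Clean License ID:** If the ID carries the prefix "LicenseRef-scancode" or "LicenseRef-MB", the script drops the first 20 or 14 characters respectively (the prefix plus the hyphen that follows it). The check is case-sensitive, so an ID such as `LicenseRef-Scancode-amd-historical` is left unchanged.
6. **Identify Label Columns:** It collects the names of all columns whose value in the current row is 'x' (assuming an 'x' marks a label that applies to the license).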
7. **Fuzzy Matching for License Text:**
   - The script iterates through filenames of preprocessed license text files.
   - For each `.txt` filename, it derives the license name by removing the extension and the last 13 characters (the length of the `_preprocessed` suffix, as in `Adobe-Utopia_preprocessed.txt`) and, if the name contains "deprecated", the first 11 characters as well. It then computes a similarity score between the license ID and this name using `fuzz.ratio` from the `fuzzywuzzy` library.
   - It keeps track of the filename with the highest matching score.
8. **Match and Label Assignment:**
   - If the best matching score is 100 (perfect match), the script reads the corresponding license text file and stores the text together with the license ID, family, and labels in a dictionary. The dictionary is saved as a JSON file, named after the license ID, in the '../../data/processed/preprocessed_licenses_json_2' directory.
   - If no perfect match is found, the script builds the same dictionary with an empty text field and saves it as a JSON file, named after the license ID, in the '../../data/processed/preprocessed_licenses_unmatched' directory. It also writes an empty `.txt` file with the same name to '../../data/processed/unmatchedText', appends the license ID to a list of unmatched licenses, and prints the ID together with the best matching filename (if any) and its score.
9. **Print Unmatched Count:** After processing all CSV rows, the script prints the total number of unmatched licenses.
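
Steps 5 and 7 both normalize a name by slicing off fixed-length pieces. The same normalization can be sketched as two small helper functions. This is a minimal sketch, not the script's own code: the function names `clean_license_id` and `name_from_filename` are hypothetical, and it assumes that each prefix is followed by a hyphen and that filenames follow the pattern `<name>_preprocessed.txt` with an optional `deprecated_` prefix (inferred from the slice lengths 20, 14, 13, and 11).

```python
import os

# Prefixes assumed from the script; each is followed by a hyphen in the IDs.
ID_PREFIXES = ("LicenseRef-scancode-", "LicenseRef-MB-")


def clean_license_id(license_id: str) -> str:
    """Strip a known prefix from the start of the ID (case-sensitive)."""
    for prefix in ID_PREFIXES:
        if license_id.startswith(prefix):
            return license_id[len(prefix):]
    return license_id


def name_from_filename(filename: str) -> str:
    """Reduce '<name>_preprocessed.txt' to '<name>'."""
    stem = os.path.splitext(filename)[0]
    # Assumed suffix and prefix; only removed when actually present.
    if stem.endswith("_preprocessed"):
        stem = stem[: -len("_preprocessed")]
    if stem.startswith("deprecated_"):
        stem = stem[len("deprecated_"):]
    return stem
```

Checking for the prefix or suffix before slicing avoids cutting characters from names that do not follow the expected pattern, which fixed-length slicing would do silently.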

**Notes:**

- This code assumes specific column names and data formats in the CSV file. You might need to adjust it based on your actual data structure.
- The script uses a simple fuzzy matching approach. Consider exploring more sophisticated techniques for license matching if needed.
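
One low-cost refinement is to compare names case-insensitively and to make the acceptance threshold a parameter. The sketch below illustrates the idea with `difflib.SequenceMatcher` from the standard library as a stand-in for `fuzz.ratio`, so its scores may differ from those fuzzywuzzy reports. The function name `best_match`, the sample candidate names, and the default threshold are hypothetical.

```python
from difflib import SequenceMatcher


def best_match(license_id, candidates, threshold=100):
    """Return (candidate, score, accepted) for the most similar candidate.

    Scores range from 0 to 100 and are computed on lower-cased strings.
    """
    best_name, best_score = None, 0
    for name in candidates:
        ratio = SequenceMatcher(None, license_id.lower(), name.lower()).ratio()
        score = round(100 * ratio)
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score, best_score >= threshold


# Illustrative candidates only; real names would come from the license directory.
names = ["Adobe-Utopia", "Unlicense", "Jam"]
print(best_match("adobe-utopia", names))        # differs from a candidate only in case
print(best_match("Ayam", names, threshold=90))  # no close candidate
```

Lowering the threshold trades fewer unmatched licenses for a higher risk of attaching the wrong text, so any accepted match below 100 is worth reviewing by hand.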

In [2]:
import os
import pandas as pd
from fuzzywuzzy import fuzz
import json

# Load the CSV file
df = pd.read_csv("../../data/raw/data2-0.csv")

# Directory containing preprocessed licenses
license_dir = "../../data/processed/preprocessed_licenses_txt"
count = 0  # Number of licenses without a perfect match
unmatched_licenses = []

# Build one label dictionary per CSV row
for index, row in df.iterrows():
    label_dict = {}
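    license_id = row['License ID']  # Assumes a column named 'License ID'
    # Strip known prefixes (case-sensitive), including the hyphen that follows
    if license_id.startswith("LicenseRef-scancode-"):
        license_id = license_id[len("LicenseRef-scancode-"):]
    elif license_id.startswith("LicenseRef-MB-"):
        license_id = license_id[len("LicenseRef-MB-"):]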
    labels = [col for col in df.columns if row[col] == 'x']  # Get all columns with 'x'
    license_family = row['License Family']  # Get family

    best_match_score = 0
    best_match_filename = None

    for filename in os.listdir(license_dir):
        if filename.endswith(".txt"):
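            # Drop the extension, then the 13-character "_preprocessed" suffix
            filename_without_extension = os.path.splitext(filename)[0][:-13]
            if "deprecated" in filename_without_extension:
                # Drop the first 11 characters (assumed to be a "deprecated_" prefix)
                filename_without_extension = filename_without_extension[11:]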
            score = fuzz.ratio(license_id, filename_without_extension)
            if score > best_match_score:
                best_match_score = score
                best_match_filename = filename
        
    if best_match_score >= 100:
        with open(os.path.join(license_dir, best_match_filename), "r") as f:
            license_text = f.read()
            label_dict= {'name': license_id, 'family': license_family, 'labels': labels, 'text': license_text}
        with open(f"../../data/processed/preprocessed_licenses_json_2/{license_id}.json", "w") as f:
            json.dump(label_dict, f, indent=4)
    else:
        count+=1
        unmatched_licenses.append(license_id)
        print(f"No match found for: {license_id}, best match: {best_match_filename}, score: {best_match_score}")
        label_dict= {'name': license_id, 'family': license_family, 'labels': labels, 'text': ""}
        with open(f"../../data/processed/preprocessed_licenses_unmatched/{license_id}.json", "w") as f:
            json.dump(label_dict, f, indent=4)
        with open(f"../../data/processed/unmatchedText/{license_id}.txt", "w") as outfile:
            outfile.write("")
print(count)

No match found for: Adobe-Helvetica, best match: Adobe-Utopia_preprocessed.txt, score: 67
No match found for: Adobe-utopia-license, best match: Adobe-Utopia_preprocessed.txt, score: 69
No match found for: LicenseRef-Scancode-amd-historical, best match: Linux-man-pages-1-para_preprocessed.txt, score: 39
No match found for: Aelfred2, best match: AdaCore-doc_preprocessed.txt, score: 42
No match found for: AES256, best match: APSL-1.2_preprocessed.txt, score: 43
No match found for: ams-fonts, best match: Lucida-Bitmap-Fonts_preprocessed.txt, score: 50
No match found for: Apple-iOS-Sample-Code, best match: BSD-Source-Code_preprocessed.txt, score: 50
No match found for: arc4-random-number-license, best match: Unlicense_preprocessed.txt, score: 46
No match found for: Avisynth-C-Interface-Exception-GPL-2.0-or-later, best match: GCC-exception-2.0-note_preprocessed.txt, score: 55
No match found for: Ayam, best match: Jam_preprocessed.txt, score: 57
No match found for: BigInteger, best match: Int