<a href="https://colab.research.google.com/github/paskn/tools-as-notebooks/blob/main/Add%20Lemmatize_docs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Lemmatization Tool

**Purpose:** This notebook lemmatizes English-language texts stored in a CSV file.

**Inputs:**
*   A CSV file where one column contains the text to be lemmatized.
*   The name of the column containing the text.
*   The desired name for the output file.

**Outputs:**
*   A new CSV file: a copy of your input CSV with an added `text_lemmatized` column. This column holds the lowercased lemmas of each text, with stop words and non-alphabetic tokens removed. The file is saved to your Google Drive in a directory named `Colab_Data` and is also offered as a direct download.

**Instructions:**
1.  Run the cells in this notebook sequentially from top to bottom.
2.  In Step 3, follow the prompts for Google Drive authentication if you haven't used Colab with your Drive recently.
3.  In Step 4, upload your CSV file when prompted.
4.  In Step 5 ("Configure Parameters"), set the text column name and the output filename.

In [None]:
#@title Step 1: Install Required Libraries
# Run this cell once per Colab session; pip skips packages that are already installed.
print("Installing pandas and spacy...")
!pip install pandas spacy -q
!python -m spacy download en_core_web_sm
print("Installation complete.")

In [None]:
#@title Step 2: Import Libraries
# Imports the libraries needed for the script and loads the spaCy model.
import pandas as pd
from google.colab import files
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
import os # Will be used for checking Google Drive path

# Load the spaCy model
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

print("Libraries imported.")

In [None]:
#@title Step 3: Mount Google Drive
# Mounts your Google Drive in Colab so the notebook can save files there.
# Also creates a directory named 'Colab_Data' in your Drive if it doesn't already exist.

from google.colab import drive
drive.mount('/content/drive')

output_directory = '/content/drive/MyDrive/Colab_Data'
if not os.path.exists(output_directory):
    os.makedirs(output_directory)
    print(f"Directory '{output_directory}' created in your Google Drive.")
else:
    print(f"Directory '{output_directory}' already exists in your Google Drive.")

# Test file to ensure Drive is writable (optional, can be removed)
try:
    with open(os.path.join(output_directory, 'drive_test.txt'), 'w') as f:
        f.write('Google Drive connection successful.')
    print("Successfully wrote a test file to Google Drive.")
except Exception as e:
    print(f"Error writing to Google Drive: {e}")
    print("Please ensure you have granted necessary permissions and have space in your Drive.")

In [None]:
#@title Step 4: Upload Your CSV File
# Run this cell to upload your CSV file containing the text to be lemmatized.
# The file should have one column with the text data you want to process.

print("Please upload your CSV file.")
uploaded_file = files.upload()

# Get the name of the uploaded file
# Assuming only one file is uploaded
if uploaded_file:
    input_csv_name = next(iter(uploaded_file))
    print(f"Uploaded file '{input_csv_name}' successfully.")
    # Optional: Display first few lines to confirm it's a CSV
    # try:
    #   df_preview = pd.read_csv(input_csv_name)
    #   print("Preview of your uploaded data:")
    #   print(df_preview.head())
    # except Exception as e:
    #   print(f"Could not read the uploaded file as CSV: {e}. Please ensure it's a valid CSV.")
else:
    input_csv_name = None
    print("No file uploaded. Please run this cell again to upload your data.")

In [None]:
#@title Step 5: Configure Parameters
# Please specify the details for your lemmatization.

#@markdown ---
#@markdown ### Input Data Configuration
#@markdown Enter the name of the column in your CSV that contains the text:
text_column_name = "text" #@param {type:"string"}

#@markdown ---
#@markdown ### Output File Configuration
#@markdown Enter the desired name for your output CSV file (e.g., `lemmatized_results.csv`):
output_filename = "lemmatized_output.csv" #@param {type:"string"}
#@markdown ---

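# Basic validation for the output filename, done before printing the summary
output_filename = output_filename.strip()
if not output_filename:
    output_filename = "lemmatized_output.csv"  # fall back to the form default
if not output_filename.lower().endswith('.csv'):
    output_filename += '.csv'
    print(f"Adjusted output filename to: '{output_filename}' (ensured .csv extension)")

print("Parameters configured:")
print(f"  Text column name: '{text_column_name}'")
print(f"  Output filename: '{output_filename}'")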

In [None]:
#@title Step 6: Perform Lemmatization
# This cell processes your data and performs lemmatization.

# Ensure a file was uploaded and parameters are set
if 'input_csv_name' not in globals() or input_csv_name is None:
    print("Error: No CSV file was uploaded. Please go back to the 'Upload Your CSV File' cell and upload your data.")
elif not text_column_name:
    print("Error: The 'text_column_name' is not specified. Please go to 'Configure Parameters' and set it.")
else:
    try:
        print(f"Reading uploaded CSV file: {input_csv_name}...")
        df = pd.read_csv(input_csv_name)
        print("CSV file loaded successfully.")

        if text_column_name not in df.columns:
            print(f"Error: Column '{text_column_name}' not found in the uploaded CSV.")
            print(f"Available columns are: {df.columns.tolist()}")
        else:
            print(f"Using column '{text_column_name}' for lemmatization.")

            # Function to lemmatize text
            def lemmatize_filter_stopwords_and_nonalpha(text):
                # Handle potential NaN or non-string values
                if not isinstance(text, str) or not text.strip(): # also check for empty/whitespace-only strings
                    return ""

                doc = nlp(text)
                filtered_lemmas = []

                for token in doc:
                    # 1. Get the lemma and convert it to lowercase.
                    #    spaCy v2 lemmatized pronouns ('I', 'me', 'he') to "-PRON-"; in that
                    #    case use the original text (lowercased). spaCy v3 no longer emits
                    #    "-PRON-", so this branch only matters on older installs.
                    if token.lemma_ == "-PRON-":
                        lemma = token.text.lower()
                    else:
                        lemma = token.lemma_.lower()

                    # 2. Filter:
                    #    - Check if the original token is alphabetic (token.is_alpha)
                    #    - Check if the (lowercase) lemma is not a stopword
                    #    - Optional: Check for minimum length (e.g., len(lemma) > 1)
                    if token.is_alpha and lemma not in STOP_WORDS:
                        filtered_lemmas.append(lemma)

                return " ".join(filtered_lemmas)

            print("Applying lemmatization...")
            df['text_lemmatized'] = df[text_column_name].fillna("").astype(str).apply(lemmatize_filter_stopwords_and_nonalpha)
            print("Lemmatization complete.")
            print("Preview of data with lemmatized text:")
            print(df[[text_column_name, 'text_lemmatized']].head())

    except Exception as e:
        print(f"An error occurred during lemmatization: {e}")
        df = None # Ensure df is None if there's an error

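Step 6 calls `nlp` once per row through `apply`. For larger tables, spaCy's `Language.pipe` processes texts in batches, which is usually faster. The sketch below shows one way the same filter could be written around `pipe`. The name `lemmatize_many` and its parameters are hypothetical and not part of this notebook; it assumes `nlp` and `STOP_WORDS` are the objects loaded in Step 2, and it omits the `-PRON-` special case for brevity.

```python
def lemmatize_many(nlp, texts, stop_words, batch_size=256):
    """Return one space-joined string of filtered lemmas per input text.

    Hypothetical helper: `nlp` is any object with a spaCy-style `pipe`
    method, and `stop_words` is a set of lowercase words to drop.
    """
    # Replace NaN / non-string values with "" so every row still yields a result
    cleaned = [t if isinstance(t, str) else "" for t in texts]
    results = []
    # pipe() yields one Doc per input text, in input order
    for doc in nlp.pipe(cleaned, batch_size=batch_size):
        lemmas = []
        for token in doc:
            lemma = token.lemma_.lower()
            # Same rule as Step 6: keep alphabetic tokens whose lemma is not a stop word
            if token.is_alpha and lemma not in stop_words:
                lemmas.append(lemma)
        results.append(" ".join(lemmas))
    return results
```

With the Step 2 objects, this might be called as `df['text_lemmatized'] = lemmatize_many(nlp, df[text_column_name].tolist(), STOP_WORDS)`.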
In [None]:
#@title Step 7: Save Results to Google Drive and Provide Download Link

if 'df' in globals() and df is not None and 'text_lemmatized' in df.columns:
    try:
        # Construct the full path for saving in Google Drive
        output_drive_path = os.path.join(output_directory, output_filename)

        print(f"Saving the results to: {output_drive_path} ...")
        df.to_csv(output_drive_path, index=False)
        print("File saved successfully to your Google Drive.")

        print(f"Providing download link for '{output_filename}' ...")
        files.download(output_drive_path) # Offer direct download as well
        print("If the download doesn't start automatically, please check your browser's download permissions for this site.")

        print("\n--- Results Summary ---")
        print("First 10 rows of the output data:")
        print(df.head(10))

    except Exception as e:
        print(f"An error occurred while saving or downloading the file: {e}")
elif 'df' in globals() and df is not None and 'text_lemmatized' not in df.columns:
    print("Error: The 'text_lemmatized' column was not successfully created. Cannot save results.")
    print("Please check the 'Perform Lemmatization' cell for errors.")
else:
    print("Error: No data to save. Please ensure the previous steps ran correctly and produced a DataFrame 'df'.")

---
# ✅ Lemmatization Complete!

Thank you for using the Text Lemmatization Tool!

Your results have been:
1.  Saved to your Google Drive in the `Colab_Data` folder, under the output filename you set in Step 5.
2.  Offered as a direct download to your computer.

**Troubleshooting & Tips:**
*   **File Not Found Errors:** Ensure you've uploaded your CSV and that the Google Drive path is correct and accessible.
*   **Incorrect Text Column:** Check that the column name set in Step 5 ("Configure Parameters") exactly matches the header in your CSV, including capitalization and spaces.
*   **CSV Encoding:** This notebook assumes your CSV is UTF-8 encoded. If reading fails with a decoding error, re-save the file as UTF-8 or pass the correct `encoding` argument to `pd.read_csv` in Step 6.
*   **Permissions:** If saving to Google Drive fails, make sure Colab has permission to access your Drive. You may need to re-run the Step 3 cell that mounts Google Drive.
*   **Large Files:** Processing very large files can take time and might hit Colab's resource limits. Consider processing data in chunks if needed.
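
For the encoding tip above: `files.upload()` returns each file's contents as bytes, so one way to cope with an unknown encoding is to try a few candidates in order. The sketch below is only an illustration. The helper `decode_with_fallback` and its candidate list are assumptions, not part of this notebook. Note that `latin-1` accepts any byte sequence, so a successful decode does not prove the encoding is the right one; inspect the result before relying on it.

```python
def decode_with_fallback(raw, encodings=("utf-8", "cp1252", "latin-1")):
    """Decode bytes with the first encoding that succeeds.

    Returns (text, encoding_used). Hypothetical helper for illustration.
    """
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError(f"None of the encodings {encodings} could decode the data")


# Example with bytes that are not valid UTF-8
text, used = decode_with_fallback("café,1\n".encode("cp1252"))
print(used)  # cp1252
```

The encoding name found this way could then be passed on, for example as `pd.read_csv(input_csv_name, encoding=used)`.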

If you encounter other issues, reviewing the error messages in each cell can provide clues.
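
For the large-file tip, one option is to stream the CSV in fixed-size chunks rather than loading it all at once. The sketch below uses only the standard library `csv` module instead of pandas so that it stays self-contained. The names `process_in_chunks`, `chunk_size`, and `transform` are hypothetical; in this notebook, `transform` would be the lemmatization function from Step 6.

```python
import csv
import io
from itertools import islice


def process_in_chunks(src, dst, text_column, transform, chunk_size=1000):
    """Copy CSV rows from src to dst, adding a 'text_lemmatized' column.

    src and dst are open text-mode file objects. Rows are read and written
    chunk_size at a time, so memory use stays bounded.
    """
    reader = csv.DictReader(src)
    if text_column not in (reader.fieldnames or []):
        raise KeyError(f"Column '{text_column}' not found")
    fieldnames = list(reader.fieldnames) + ["text_lemmatized"]
    # extrasaction="ignore" drops stray extra cells instead of raising
    writer = csv.DictWriter(dst, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    while True:
        chunk = list(islice(reader, chunk_size))  # next batch of rows
        if not chunk:
            break
        for row in chunk:
            # Missing cells come back as None; treat them as empty text
            row["text_lemmatized"] = transform(row[text_column] or "")
        writer.writerows(chunk)


# Tiny in-memory demonstration, with str.lower standing in for lemmatization
src = io.StringIO("id,text\n1,Hello World\n2,Second ROW\n3,Third\n")
dst = io.StringIO()
process_in_chunks(src, dst, "text", str.lower, chunk_size=2)
print(dst.getvalue())
```

If you prefer to stay with pandas, `pd.read_csv` also accepts a `chunksize` argument and then yields one DataFrame per chunk.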