<a href="https://colab.research.google.com/github/paskn/tools-as-notebooks/blob/main/Add%20Lemmatize_docs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Lemmatization: Introduction

This notebook can be used as an app to lemmatize English-language texts stored in a CSV file.

__How does it work?__

Step by step, you will run Python code on a Google server.

1. You will connect this server to your Google Drive.
2. Then, you will pull in and set up the Python code.
3. You will upload your documents and specify the location of your data.
4. Finally, you will run the code on your data, and the output will be written to your Google Drive folder.

In [None]:
#@title Step 1: Mount Google Drive for Loading and Storing Data
#@markdown Grant permissions to access your Google Drive.
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Step 2: Upload data

1. On the left, find the folder icon and click it.
2. Navigate to `drive`, then `MyDrive`, and open your `Colab_Data` folder.
3. Hover over `Colab_Data` and click the three-dots button on the right. Then choose "Upload" and select your data file.

Your data will appear under `Colab_Data`. Click the three-dots button next to the name of the file you just uploaded and choose "Copy path".
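
Before running the later steps, it can help to confirm that the copied path points to a real file and that the file has the column the notebook expects. The sketch below is one possible check, not part of the original notebook: `check_csv_has_column` is a hypothetical helper, and it assumes the CSV has a header row and is UTF-8 encoded.

```python
import csv
from pathlib import Path


def check_csv_has_column(path, column="text"):
    """Return the CSV header row, raising a clear error if `column` is absent.

    Hypothetical helper for illustration; assumes the first row is a header.
    """
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(
            f"No file at {path}; copy the path again from the file browser."
        )
    # utf-8-sig also strips a byte-order mark that some spreadsheet exports add.
    with path.open(newline="", encoding="utf-8-sig") as handle:
        header = next(csv.reader(handle), [])
    if column not in header:
        raise ValueError(
            f"Column {column!r} not found; available columns: {header}"
        )
    return header
```

Calling `check_csv_has_column("<the path you copied>")` in a code cell either returns the list of column names or explains what is missing.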

In [None]:
#@title Step 3: Load the code for lemmatization

!pip install -U spacy
!python -m spacy download en_core_web_sm

import pandas as pd
import spacy
from spacy.lang.en.stop_words import STOP_WORDS


nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])


def lemmatize_filter_stopwords_and_nonalpha(text):
    # Handle potential NaN or non-string values
    if not isinstance(text, str) or not text.strip(): # also check for empty/whitespace-only strings
        return ""

    doc = nlp(text)
    filtered_lemmas = []

    for token in doc:
        # 1. Get the lemma and convert it to lowercase.
        #    spaCy v2 lemmatized pronouns such as 'I', 'me', 'he' to "-PRON-";
        #    in that case, fall back to the original text (lowercased).
        #    spaCy v3 no longer emits "-PRON-", so this branch is a harmless no-op there.
        if token.lemma_ == "-PRON-":
            lemma = token.text.lower()
        else:
            lemma = token.lemma_.lower()

        # 2. Filter:
        #    - Check if the original token is alphabetic (token.is_alpha)
        #    - Check if the (lowercase) lemma is not a stopword
        #    - Optional: Check for minimum length (e.g., len(lemma) > 1)
        if token.is_alpha and lemma not in STOP_WORDS:
            filtered_lemmas.append(lemma)

    return " ".join(filtered_lemmas)


# Plain lemmatization without stopword or non-alphabetic filtering
# (not called in Step 4; kept as an alternative)
def lemmatize_text(text):
    # Handle potential NaN or non-string values
    if not isinstance(text, str):
        return "" # Or return text, or handle as needed
    doc = nlp(text)
    # token.lemma_ gives the base form of the word
    lemmas = [token.lemma_ for token in doc]
    return " ".join(lemmas)

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 63.7 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [None]:
#@title Step 4: Load your data and lemmatize it
#@markdown If your data has a column named 'text', this should work as is; otherwise, change the column name in the code below.

data_path = "/content/drive/MyDrive/Colab_Data/grok-prompts.csv" #@param {type:"string"}
df = pd.read_csv(data_path)


df['text_lemmatized'] = df['text'].fillna("").astype(str).apply(lemmatize_filter_stopwords_and_nonalpha)
df.to_csv(data_path+"-lemmatized.csv", index=False)

print("saved to: "+data_path+"-lemmatized.csv")
df.head()

saved to: /content/drive/MyDrive/Colab_Data/grok-prompts.csv-lemmatized.csv


Unnamed: 0,text,text_lemmatized
0,A stunning female model walks down a high-fash...,stunning female model walk high fashion runway...
1,"A stylish woman with long, wavy lavender hair ...",stylish woman long wavy lavender hair seat des...
2,Create an image featuring two central female c...,create image feature central female character ...
3,A prominent figure dressed in a dark blue suit...,prominent figure dress dark blue suit white sh...
4,A detailed text prompt for generating an image...,detailed text prompt generate image similar pr...
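
The Step 4 cell builds the output name by appending `-lemmatized.csv` to the full input path, which is why the saved file ends in `.csv-lemmatized.csv`. If a cleaner name is preferred, one option is to insert the suffix before the extension. The sketch below assumes that preference; `lemmatized_output_path` is a hypothetical helper, not part of the original notebook.

```python
from pathlib import Path


def lemmatized_output_path(data_path):
    """Build an output path next to the input, e.g. data.csv -> data-lemmatized.csv.

    Hypothetical helper for illustration only.
    """
    source = Path(data_path)
    # Keep the folder and extension; add the suffix to the base name only.
    return str(source.with_name(f"{source.stem}-lemmatized{source.suffix}"))


output_path = lemmatized_output_path(
    "/content/drive/MyDrive/Colab_Data/grok-prompts.csv"
)
print("would save to: " + output_path)
```

To use it, the `df.to_csv(...)` and `print(...)` calls in Step 4 would take `lemmatized_output_path(data_path)` in place of `data_path+"-lemmatized.csv"`.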
