# Pre-process data for keyword analysis




**Preprocess Data**
1.   character normalization (lower case)
2.   lemmatization
3.   bigram extraction and replacement by multi-word units

The output is *preprocessed_aca_df.csv*, whose two key columns are:
*  lemmatized_tokens: lists of lemmatized tokens without punctuation and stop words. -> for tf-idf
*  lemmatized_text: lemmatized strings with punctuation and stop words, but without newline characters. -> for word embedding
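Because CSV is a plain-text format, a column of token lists comes back as strings when the file is reloaded. The following is a minimal sketch of that round trip using only the standard library; the row is made-up toy data, not taken from the corpus.

```python
import ast
import csv
import io

# Toy row standing in for one line of preprocessed_aca_df.csv (illustrative data).
rows = [{
    "id": 1,
    "lemmatized_tokens": ["climatechange", "impact"],
    "lemmatized_text": "climatechange impact on the farmer .",
}]

# Write the row to an in-memory CSV instead of a real file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "lemmatized_tokens", "lemmatized_text"])
writer.writeheader()
writer.writerows(rows)

# Read it back: every cell is now a string, including the token list.
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
raw_value = loaded[0]["lemmatized_tokens"]

# ast.literal_eval safely parses the string back into a Python list.
tokens = ast.literal_eval(raw_value)
print(type(raw_value).__name__, type(tokens).__name__)  # prints "str list"
```

The same conversion applies when loading the file with pandas, where `ast.literal_eval` can be applied to the whole column.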




Notes:

 * In acknowledgment of the contributions made, portions of this code were developed with the guidance and assistance of ChatGPT.
 * Some preprocessing methods are from Albrecht, J. 2020. *Blueprints for Text Analytics Using Python*. O’Reilly Online Learning. https://oreilly.com/library/view/blueprints-for-text/9781492074076/

# Import data and libraries

In [None]:
!pip install tqdm



In [None]:
import re
import pandas as pd
import spacy
import ast

from tqdm.auto import tqdm
from collections import Counter
from itertools import tee, islice
tqdm.pandas()

from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
## input data
aca_df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Masters_Thesis/0_corpus/preprocessed_for_sentiment_analysis/aca_sentiment_df.csv", delimiter=",")


## output data
# cleaned text without noise such as "[ &amp;#x200B;" and emojis
# clean_aca_df_path = "/content/drive/MyDrive/Colab Notebooks/Masters_Thesis/0_corpus/preprocessed_for_semantic_shift_tfidf/clean_aca_df.csv"
# lemma lists without punctuation and stop words -> tf-idf; lemmatized strings with punctuation and stop words but without newlines -> word embedding training
lemmatized_aca_df_path = "/content/drive/MyDrive/Colab Notebooks/Masters_Thesis/0_corpus/preprocessed_for_semantic_shift_tfidf/aca/lemmatized_aca_df.csv"
# all bigrams
bigram_path = "/content/drive/MyDrive/Colab Notebooks/Masters_Thesis/0_corpus/preprocessed_for_semantic_shift_tfidf/aca/bigram_aca_df.csv"
# validated bigrams
validated_bigram_path =  "/content/drive/MyDrive/Colab Notebooks/Masters_Thesis/0_corpus/preprocessed_for_semantic_shift_tfidf/aca/validated_bigram_aca_df.csv"
# validated bigrams are replaced by multi-word units in lemma lists and lemmatized strings.
preprocessed_aca_df_path = "/content/drive/MyDrive/Colab Notebooks/Masters_Thesis/0_corpus/preprocessed_for_semantic_shift_tfidf/preprocessed_aca_df.csv" # with multi-word units

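# Optional shortcut: if the lemmatization step below has already been run,
# reload its output here instead of re-running it. Skip these lines on a first
# run, because the file does not exist yet and it has no 'text' column.
aca_df = pd.read_csv(lemmatized_aca_df_path)

# CSV stores the token lists as strings; convert them back into Python lists.
aca_df['lemmatized_tokens'] = aca_df['lemmatized_tokens'].apply(ast.literal_eval)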

# Preprocess Data

Create two types of data:

1.    Token lists saved in **lemmatized_tokens** containing lowercased lemmas, **without** punctuation and stop words.
2.    Strings saved in **lemmatized_text** containing lowercased lemmas **with** punctuation and stop words, but without newline characters.

The outputs are saved in lemmatized_aca_df.csv.
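To make the difference between the two columns concrete, here is a small dependency-free sketch. The lemma lookup and stop-word set are hypothetical stand-ins for spaCy's lemmatizer and stop-word list, and `toy_preprocess` is an illustrative helper rather than the function used in this notebook.

```python
import string

# Hypothetical stand-ins for spaCy's lemmatizer and stop-word list.
TOY_LEMMAS = {"farmers": "farmer", "adapting": "adapt", "are": "be"}
TOY_STOP_WORDS = {"be", "to", "the"}
PUNCTUATION = set(string.punctuation)

def toy_preprocess(text):
    """Return (token list for tf-idf, string for word embeddings)."""
    # Newlines are replaced by spaces before tokenizing.
    words = text.replace("\n", " ").split()
    lemmas = [TOY_LEMMAS.get(w.lower(), w.lower()) for w in words]
    # The token list drops stop words and punctuation ...
    tokens = [l for l in lemmas if l not in TOY_STOP_WORDS and l not in PUNCTUATION]
    # ... while the string keeps every lemma, joined by single spaces.
    return tokens, " ".join(lemmas)

tokens, text = toy_preprocess("Farmers are adapting\nto the drought .")
print(tokens)  # ['farmer', 'adapt', 'drought']
print(text)    # farmer be adapt to the drought .
```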

## Lemmatization

In [None]:
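import spacy
# The parser and NER components are not needed for lemmatization, so they are disabled.
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

# Optional: use the GPU (disabled by default)
# spacy.require_gpu()
# if spacy.prefer_gpu():
#     print(1)
# else:
#     print(0)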

def lemmatize_text(text):
    """
    Processes the input text using SpaCy to lemmatize and lowercase each word, preserving punctuation,
    and removes newline characters. It returns a list of lemmatized tokens (excluding stop words and punctuation)
    and a string of all tokens (lemmatized and lowercased, including punctuation).

    Args:
        text (str): The text to be processed.

    Returns:
        tuple: A tuple containing a list of lemmatized tokens (excluding stop words and punctuation)
               and a string of all tokens (including punctuation).
    """
    # remove new line marks
    text = text.replace("\n", " ")

    doc = nlp(text)
    lemmatized_tokens = [token.lemma_.lower() for token in doc if not token.is_stop and not token.is_punct] # a list
    all_tokens = [token.lemma_.lower() if not token.is_punct else token.text for token in doc]

    # concat all tokens into a string
    lemmatized_text = " ".join(all_tokens)

    return lemmatized_tokens, lemmatized_text

In [None]:
aca_df['text'] = aca_df['text'].fillna('')
aca_df[['lemmatized_tokens', 'lemmatized_text']] = aca_df['text'].progress_apply(lambda x: pd.Series(lemmatize_text(x)))

  0%|          | 0/14519 [00:00<?, ?it/s]

In [None]:
aca_df.drop(columns=['text', 'impurity'], inplace=True)
aca_df.to_csv(lemmatized_aca_df_path, index=False)
aca_df.head()

## Bigram Extraction and Replacement



1.   Create a bigram dictionary with the bigrams' frequencies.
2.   Keep only the **most frequent** **bigrams** that are also noun chunks (e.g. climate change) and replace them in the columns lemmatized_tokens and lemmatized_text by **multi-word units** (e.g. climatechange).
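The two steps can be sketched on toy data as follows. The documents, the `min_count` value, and the helper name `merge_tokens` are illustrative assumptions; the notebook's own functions follow in the next cells.

```python
from collections import Counter

# Toy token lists standing in for the lemmatized_tokens column.
docs = [
    ["climate", "change", "impact", "agriculture"],
    ["climate", "change", "adaptation"],
    ["impact", "climate", "change"],
]

# Step 1: count every pair of adjacent tokens across all documents.
bigram_counts = Counter(
    " ".join(pair) for tokens in docs for pair in zip(tokens, tokens[1:])
)

# Step 2: keep only bigrams that reach an (illustrative) frequency threshold.
min_count = 3
frequent = {bigram for bigram, n in bigram_counts.items() if n >= min_count}

def merge_tokens(tokens, bigrams):
    """Merge adjacent tokens that form a selected bigram into one unit."""
    merged, i = [], 0
    while i < len(tokens):
        pair = " ".join(tokens[i:i + 2])
        if i + 1 < len(tokens) and pair in bigrams:
            merged.append(pair.replace(" ", ""))
            i += 2  # skip the second half of the bigram
        else:
            merged.append(tokens[i])
            i += 1
    return merged

print(merge_tokens(docs[0], frequent))  # ['climatechange', 'impact', 'agriculture']
```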



### Find all bigrams

In [None]:
def find_bigrams(tokens):
    """
    Generate bigrams from a list of tokens.

    Args:
        tokens (list): A list of words.

    Returns:
        list: A list of bigrams.
    """
    bigram_iterator = zip(tokens, islice(tokens, 1, None))
    return [" ".join(bigram) for bigram in bigram_iterator]

In [None]:
aca_df['bigrams'] = aca_df['lemmatized_tokens'].apply(find_bigrams)
# Flatten the list of bigrams and count frequencies
all_bigrams = [bigram for sublist in aca_df['bigrams'] for bigram in sublist]
bigram_freq = Counter(all_bigrams)

# Create a DataFrame from the bigram frequencies
bigram_df = pd.DataFrame(bigram_freq.items(), columns=['bigram', 'frequency'])

# remove bigrams with frequency less than 15
bigram_df = bigram_df[bigram_df['frequency'] >= 15]

# Sort the DataFrame by frequency in descending order
bigram_df = bigram_df.sort_values(by='frequency', ascending=False).reset_index(drop=True)
bigram_df.to_csv(bigram_path, index=False)

In [None]:
# Check the bigram df
bigram_df

                       bigram  frequency
0              climate change      79535
1              impact climate       6278
2               change impact       4093
3           change adaptation       3452
4              effect climate       3258
...                       ...        ...
12282   transportation sector         15
12283  potentially vulnerable         15
12284         farmer attitude         15
12285      change acknowledge         15


### Find all valid bigrams
A bigram is kept only if it meets both criteria:
1.  Its frequency is >= 430.
2.  spaCy recognizes it as a noun chunk.
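The selection logic can be sketched without spaCy by passing the noun-chunk test in as a function. In this sketch `select_bigrams` is a hypothetical helper, the frequencies are copied from the outputs in this notebook, and the hand-written set `toy_noun_chunks` is only a stand-in for spaCy's noun-chunk detection.

```python
def select_bigrams(bigram_freq, min_freq, is_noun_chunk):
    """Keep bigrams that are frequent enough AND pass the noun-chunk test."""
    return {
        bigram: freq
        for bigram, freq in bigram_freq.items()
        if freq >= min_freq and is_noun_chunk(bigram)
    }

# Frequencies as reported in the bigram tables of this notebook.
toy_freq = {
    "climate change": 79535,
    "change adaptation": 3452,
    "21st century": 432,
    "transportation sector": 15,
}

# Assumption: a hand-labelled set replaces the spaCy check for illustration.
toy_noun_chunks = {"climate change", "21st century", "transportation sector"}

selected = select_bigrams(toy_freq, 430, lambda b: b in toy_noun_chunks)
print(selected)  # {'climate change': 79535, '21st century': 432}
```

Here "change adaptation" is frequent but fails the noun-chunk test, while "transportation sector" fails the frequency threshold.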

In [None]:
# Reload the pipeline with the parser enabled, because doc.noun_chunks relies on the dependency parse
nlp = spacy.load("en_core_web_sm")

def extract_noun_chunks(df, bigram_col, frequency_col):
    """
    Extract and filter noun chunks from a DataFrame containing bigrams.

    Parameters:
    df (pd.DataFrame): The DataFrame containing the bigrams and their frequencies.
    bigram_col (str): The name of the column containing the bigrams.
    frequency_col (str): The name of the column containing the frequencies.

    Returns:
    pd.DataFrame: A new DataFrame containing only the noun chunks and their frequencies.
    """


    # Initialize an empty list to store valid noun chunk bigrams
    valid_noun_chunks = []

    # Iterate over the DataFrame
    for _, row in df.iterrows():
        bigram = row[bigram_col]
        frequency = row[frequency_col]

        # Create a SpaCy Doc object
        doc = nlp(bigram)

        # Check if the bigram matches any noun chunk in the Doc
        for chunk in doc.noun_chunks:
            if chunk.text == bigram:
                valid_noun_chunks.append((bigram, frequency))
                break

    # Create a new DataFrame with the valid noun chunks
    noun_chunk_df = pd.DataFrame(valid_noun_chunks, columns=[bigram_col, frequency_col])

    return noun_chunk_df

In [None]:
# Find validated bigrams
bigram_df = pd.read_csv(bigram_path)
bigram_df = bigram_df[bigram_df['frequency'] >= 430]
noun_chunk_df = extract_noun_chunks(bigram_df, "bigram", "frequency")
print(noun_chunk_df)

               bigram  frequency
0      climate change      79535
1      impact climate       6278
2       change impact       4093
3      effect climate       3258
4      future climate       2828
..                ...        ...
79    context climate        436
80          study aim        436
81  adaptation policy        435
82       21st century        432
83     current future        430

[84 rows x 2 columns]


### Save validated bigrams

In [None]:
noun_chunk_df.to_csv(validated_bigram_path, index=False)

## Replace chosen bigrams by multi-word units

In [None]:
def replace_bigrams_with_placeholders(text, bigrams):
    """
    Replace occurrences of bigrams in the text with placeholders.

    Args:
        text (str): The text to process.
        bigrams (list): A list of bigrams to be replaced with placeholders.

    Returns:
        tuple: The text with bigrams replaced by placeholders, and a dict
               mapping each bigram to its placeholder.
    """
    placeholder_map = {}
    for bigram in bigrams:
        placeholder = f"__{''.join(bigram.split())}__"
        placeholder_map[bigram] = placeholder
        text = text.replace(bigram, placeholder)
    return text, placeholder_map

def restore_placeholders_to_bigrams(text, placeholder_map):
    """
    Replace placeholders in the text with multi-word units, i.e. the original bigrams with the space removed.

    Args:
        text (str): The text with placeholders.
        placeholder_map (dict): A map from bigrams to placeholders.

    Returns:
        str: The text with placeholders replaced by multi-word units (e.g. climatechange).
    """
    for bigram, placeholder in placeholder_map.items():
        text = text.replace(placeholder, bigram.replace(' ', ''))
    return text

def merge_bigrams_in_tokens(row, bigrams):
    """
    Merges adjacent tokens in the 'lemmatized_tokens' field of a DataFrame row into a multi-word unit based on specified bigrams.
    Also replaces occurrences of these bigrams in the 'lemmatized_text' field.

    Args:
        row (pd.Series): A row from the DataFrame.
        bigrams (list): A list of bigrams to be merged into multi-word units.

    Returns:
        pd.Series: The modified row with tokens merged in both 'lemmatized_tokens' and 'lemmatized_text'.
    """
    text = row['lemmatized_text']
    tokens = row['lemmatized_tokens']

    #  Dealing with lemmatized_tokens
    if not tokens:
        return row

    # Creating a set for faster lookup
    bigram_set = set(bigrams)
    new_tokens = []
    i = 0

    while i < len(tokens):
        if i < len(tokens) - 1:
            potential_bigram = f"{tokens[i]} {tokens[i + 1]}"
            if potential_bigram in bigram_set:
                new_tokens.append(potential_bigram.replace(' ', ''))
                i += 2  # Skip the next token since it's part of the bigram
                continue
        new_tokens.append(tokens[i])
        i += 1

    row['lemmatized_tokens'] = new_tokens

    # Dealing with lemmatized_text
    text, placeholder_map = replace_bigrams_with_placeholders(text, bigrams)
    tokens = text.split()  # Assuming text is already tokenized and joined by spaces
    text = restore_placeholders_to_bigrams(' '.join(tokens), placeholder_map)

    row['lemmatized_text'] = text
    return row
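One caveat with plain `str.replace` is that it also matches inside longer words. The sketch below shows a possible alternative that uses word boundaries from the standard `re` module; `merge_bigrams_in_text` is a hypothetical helper and not part of the pipeline above, and the example sentence is made up.

```python
import re

def merge_bigrams_in_text(text, bigrams):
    """Join selected bigrams into multi-word units, matching whole words only."""
    if not bigrams:
        return text
    # Longest bigrams first, so a longer phrase is preferred over a shorter prefix.
    ordered = sorted(bigrams, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(b) for b in ordered) + r")\b")
    # Remove the space inside each matched bigram.
    return pattern.sub(lambda m: m.group(0).replace(" ", ""), text)

text = "microclimate change and climate change risk perception"

# Plain substring replacement also alters "microclimate change".
naive = text.replace("climate change", "climatechange")
print(naive)  # microclimatechange and climatechange risk perception

# The word-boundary version leaves "microclimate change" untouched.
print(merge_bigrams_in_text(text, ["climate change", "risk perception"]))
# microclimate change and climatechange riskperception
```

Whether this matters depends on the corpus; on lemmatized text such partial matches may be rare.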


In [None]:
# Apply the replacement function to each row of the DataFrame
# Get list of bigrams
noun_chunk_df = pd.read_csv(validated_bigram_path)
bigrams_list = noun_chunk_df['bigram'].tolist()

# Apply the function to each row of aca_df
aca_df = aca_df.apply(lambda row: merge_bigrams_in_tokens(row, bigrams_list), axis=1)

In [None]:
# aca_df.drop(columns=['bigrams'], inplace=True)
aca_df

          id  year                                  lemmatized_tokens                                    lemmatized_text
0          1  2021  [change, world, meme, time, effects, climatech...  change the world one meme at a time : the effe...
1          2  2021  [relationship, social, norms, avoidance, futur...  the relationship between social norms , avoida...
2          3  2013  [evaluation, urban, citizens, awareness, clima...  an evaluation of urban citizens ' awareness of...
3          4  2021  [high, climatechange, riskperception, promote,...  how and when high climatechange riskperception...
4          5  2022  [climatechange, impact, adaptation, highway, a...  climatechange impact and adaptation for highwa...
...      ...   ...                                                ...                                                ...
14514  14969  2016  [green, infrastructure, climatechange, adaptat...  green infrastructure and climatechange adaptat...
14515  14970  2015  [application, climate, downscaled, data, desig...  application of climate downscaled data for the...
14516  14971  2014  [extreme, vulnerability, smallholder, farmer, ...  extreme vulnerability of smallholder farmer to...
14517  14972  2020  [examine, boreal, forest, resilience, temperat...  examine boreal forest resilience to temperatur...


# Save the preprocessed data

In [None]:
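# One more pass to remove stop words, in case the previous step missed any
df = aca_df  # alias, not a copy: changes to df also apply to aca_df
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
# nlp.Defaults.stop_words.add("climatechange")
# Make sure the multi-word unit "climatechange" is not treated as a stop word
nlp.vocab["climatechange"].is_stop = False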


def remove_stopwords(token_list):
    """
    Remove stopwords from a list of tokens.

    Args:
        token_list (list): A list of tokens (words).

    Returns:
        list: A list of tokens with stopwords removed.
    """
    return [token for token in token_list if not nlp.vocab[token].is_stop]

# apply the function and create a new column
df['new_lemmatized_tokens'] = df['lemmatized_tokens'].apply(remove_stopwords)

df.drop(columns=['lemmatized_tokens'], inplace=True)
# df.drop(columns=['contains_not'], inplace=True)
df.rename(columns={'new_lemmatized_tokens': 'lemmatized_tokens'}, inplace=True)



def contains_not(token_list):
    """
    Check if the token list contains the multi-word unit 'climatechange'.

    Args:
        token_list (list): A list of tokens (words).

    Returns:
        bool: True if 'climatechange' is in the list, False otherwise.
    """
    return 'climatechange' in token_list

# Apply the function
df['contains_not'] = df['lemmatized_tokens'].apply(contains_not)
df_true = df[df['contains_not']]

# Print the rows whose tokens contain 'climatechange', to confirm it was not removed as a stop word.
print(df_true.head())

   id  year                                    lemmatized_text  \
0   1  2021  change the world one meme at a time : the effe...   
1   2  2021  the relationship between social norms , avoida...   
2   3  2013  an evaluation of urban citizens ' awareness of...   
3   4  2021  how and when high climatechange riskperception...   
4   5  2022  climatechange impact and adaptation for highwa...   

                                   lemmatized_tokens  contains_not  
0  [change, world, meme, time, effects, climatech...          True  
1  [relationship, social, norms, avoidance, futur...          True  
2  [evaluation, urban, citizens, awareness, clima...          True  
3  [high, climatechange, riskperception, promote,...          True  
4  [climatechange, impact, adaptation, highway, a...          True  


In [None]:
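# Drop the helper column from the sanity check so it is not written to the output file
preprocessed_aca_df = aca_df.drop(columns=['contains_not'], errors='ignore')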

In [None]:
# save
preprocessed_aca_df.to_csv(preprocessed_aca_df_path, index=False)
preprocessed_aca_df.head()