# Pre-process data for keyword analysis


**1st Step - Clean Data**
1.   spot noise
2.   noise removal (emoji, emoticons, digits, stopwords, punctuation marks, etc.)

The output is *clean_red_df.csv*


**2nd Step - Preprocess Data**
1.   character normalization (lower case)
2.   lemmatization
3.   bigram extraction and replacement by multi-word units

The output is *preprocessed_red_df.csv*, which contains two processed columns:
*  **lemmatized_tokens** contains lists of lemmatized tokens without punctuation or stop words (used for tf-idf).
*  **lemmatized_text** contains lemmatized strings with punctuation and stop words, but without newline characters (used for training word embeddings).
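
Because the token lists are written to CSV, they come back as plain strings when the file is reloaded. The sketch below uses a two-row stand-in DataFrame and an in-memory buffer (both assumptions made for illustration, not the real file) to show one way of restoring the list column with `ast.literal_eval`.

```python
import ast
import io

import pandas as pd

# Toy stand-in for preprocessed_red_df.csv (values are illustrative only)
toy_df = pd.DataFrame({
    "lemmatized_tokens": [["discuss", "climatechange", "skeptic"], ["cool", "thank"]],
    "lemmatized_text": ["discuss climatechange with a skeptic", "cool , thank"],
})

# Write to an in-memory buffer instead of a file on Drive
buffer = io.StringIO()
toy_df.to_csv(buffer, index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer)
# After the round trip each list is stored as its string representation
was_string = isinstance(reloaded.loc[0, "lemmatized_tokens"], str)

# Parse the strings back into Python lists
reloaded["lemmatized_tokens"] = reloaded["lemmatized_tokens"].apply(ast.literal_eval)
print(was_string, reloaded.loc[1, "lemmatized_tokens"])
```

The same `apply(ast.literal_eval)` call is only needed for the list column; the string column survives the round trip unchanged.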




Notes:

 * In acknowledgment of the contributions made, portions of this code were developed with the guidance and assistance of ChatGPT.
 * Some preprocessing methods are from Albrecht, J. 2020. *Blueprints for Text Analytics Using Python*. O’Reilly Online Learning. https://oreilly.com/library/view/blueprints-for-text/9781492074076/

# Import data and libraries

In [None]:
!pip install tqdm



In [None]:
import re
import pandas as pd
import ast

from tqdm.auto import tqdm
from collections import Counter
from itertools import tee, islice
tqdm.pandas()

from google.colab import drive
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
## input data
red_df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Masters_Thesis/0_corpus/preprocessed_for_sentiment_analysis/red_sentiment_df.csv", delimiter=",")


## output data
# clean text with noise removed
clean_red_df_path = "/content/drive/MyDrive/Colab Notebooks/Masters_Thesis/0_corpus/preprocessed_for_semantic_shift_tfidf/clean_red_df.csv"
# lemma list without punct and stop words -> tf-idf, lemmatized string with punct and stop words without new lines -> word embedding training
lemmatized_red_df_path = "/content/drive/MyDrive/Colab Notebooks/Masters_Thesis/0_corpus/preprocessed_for_semantic_shift_tfidf/red/lemmatized_red_df.csv"
# all bigrams
bigram_path = "/content/drive/MyDrive/Colab Notebooks/Masters_Thesis/0_corpus/preprocessed_for_semantic_shift_tfidf/bigram_red_df.csv"
# validated bigrams
validated_bigram_path =  "/content/drive/MyDrive/Colab Notebooks/Masters_Thesis/0_corpus/preprocessed_for_semantic_shift_tfidf/red/validated_bigram_red_df.csv"
# validated bigrams are replaced by multi-word units in lemma lists and lemmatized strings.
preprocessed_red_df_path = "/content/drive/MyDrive/Colab Notebooks/Masters_Thesis/0_corpus/preprocessed_for_semantic_shift_tfidf/red/preprocessed_red_df.csv" # with multi-word units
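
# To resume from the lemmatized checkpoint (skipping cleaning and lemmatization), uncomment:
# red_df = pd.read_csv(lemmatized_red_df_path)
# pandas reads list columns back as strings, so parse them into lists again:
# red_df['lemmatized_tokens'] = red_df['lemmatized_tokens'].apply(ast.literal_eval)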

# Clean data for the Reddit corpus

## 1. Spot noise

In [None]:
import re

# Define the regular expression for suspicious characters
RE_SUSPICIOUS = re.compile(r'[&#<>{}\[\]\\]')

def impurity(text, min_len):
    """
    Calculate the ratio of suspicious characters in a given text.

    Parameters:
    text (str): The text to be analyzed.
    min_len (int): The minimum length of text to consider for analysis.

    Returns:
    float: The ratio of suspicious characters in the text if length is above min_len; otherwise 0.
    """
    # Check if text is a string and has the required minimum length
    if not isinstance(text, str) or len(text) < min_len:
        return 0
    else:
        # Calculate the ratio
        return len(RE_SUSPICIOUS.findall(text)) / len(text)


In [None]:
# Apply the function to the 'text' column of your DataFrame
red_df['impurity'] = red_df['text'].apply(impurity, min_len=20)
red_df[['text','impurity']].sort_values(by='impurity', ascending=False).head(20)

Unnamed: 0,text,impurity
77125,Citations from the above: \[1\] [ \[2\] [ \[3\...,0.483871
111155,&amp;#x200B; [ [ [ [,0.3
1239,Citations [1]( [2]( [3]( [4a]( [4b]( [4c]( [5]...,0.295082
45616,[ &amp;#x200B; [ &amp;#x200B; [,0.225806
24798,Yes. &amp;#x200B; [ [ [,0.217391
22149,[ &amp;#x200B; [ &amp;#x200B; [ &amp;#x200B; [,0.217391
47703,play C:\\Users\\Media\\Music\\Soundbytes\\real...,0.213115
117052,Maybe start with: [ [ [ [ [ [,0.206897
111423,🇨🇳 /|\ __ 🌍 / \/\ /\,0.2
92916,\[ Citation needed \],0.190476
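
As a worked check of the ratio, the second row of the table can be reproduced by hand. The sketch restates `RE_SUSPICIOUS` and `impurity` so that it runs on its own; the sample string is copied from the displayed row, on the assumption that the display shows the complete text.

```python
import re

RE_SUSPICIOUS = re.compile(r'[&#<>{}\[\]\\]')

def impurity(text, min_len):
    # Texts that are missing or too short are scored 0
    if not isinstance(text, str) or len(text) < min_len:
        return 0
    return len(RE_SUSPICIOUS.findall(text)) / len(text)

sample = "&amp;#x200B; [ [ [ ["
hits = RE_SUSPICIOUS.findall(sample)

# '&', '#' and four '[' are suspicious: 6 hits in 20 characters
print(len(hits), len(sample), impurity(sample, min_len=20))

# Anything shorter than min_len is skipped entirely
short_score = impurity("[1]", min_len=20)
print(short_score)
```

Note that `min_len` guards against short texts such as `[1]`, which would otherwise score very high on a ratio that is not informative.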


## 2. Remove noise

In [None]:
# From Blueprints for Text Analytics Using Python, p. 97
import html


def remove_emojis(text):
    """Remove emojis from the text using a regular expression.

    Args:
        text (str): The input text containing emojis.

    Returns:
        str: The text with emojis removed.
    """
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               u"\U00002702-\U000027B0"  # Dingbats
                               # Enclosed characters. Note: this wide range also spans CJK,
                               # Hangul and other non-Latin scripts, which are removed as well.
                               u"\U000024C2-\U0001F251"
                               "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)


def clean(text):
    if pd.notna(text) and isinstance(text, str):
        # convert html escapes like &amp; to characters.
        text = html.unescape(text)
        # tags like <tab>
        text = re.sub(r'<[^<>]*>', ' ', text)
        # markdown URLs like [Some text](https://....)
        text = re.sub(r'\[([^\[\]]*)\]\([^\(\)]*\)', r'\1', text)
        # text or code in brackets like [0]
        text = re.sub(r'\[[^\[\]]*\]', ' ', text)
        # replace &#x200B;
        text = re.sub('&#x200B;', ' ', text)
        # Remove hashtag marks
        text = re.sub(r'#(\w+)', r'\1', text)  # raw string, so \1 is a backreference
        # Remove "I'm a bot"
        text = re.sub('I\'m a bot', ' ', text)
        # standalone sequences of specials, matches &# but not #cool
        text = re.sub(r'(?:^|\s)[&#<>{}\[\]+|\\:-]{1,}(?:\s|$)', ' ', text)
        # standalone sequences of hyphens like --- or ==
        text = re.sub(r'(?:^|\s)[\-=\+]{2,}(?:\s|$)', ' ', text)
        # replace irrelevant punctuation such as /, \, |, ^, (, and )
        text = re.sub(r'[\/|\\^()\[\]<>*{}#&;]', ' ', text)
        # Remove numbers
        text = re.sub(r'\d+', '', text)
        # Remove emojis
        text = remove_emojis(text)
        # sequences of white spaces
        text = re.sub(r'\s+', ' ', text)
        return text.strip()

    # Missing or non-string values (e.g. NaN) become empty strings
    return ''
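
To make the order of the substitutions easier to follow, here is a minimal trace of four of them on a made-up comment. The sample string and the `step*` names are illustrative assumptions; the full `clean` function applies further steps (for example, it also strips `&`).

```python
import html
import re

sample = "See [this report](https://example.com/r) &amp; the notes [1] #climate"

# 1. HTML escapes become characters: &amp; -> &
step1 = html.unescape(sample)
# 2. Markdown links keep only their link text
step2 = re.sub(r'\[([^\[\]]*)\]\([^\(\)]*\)', r'\1', step1)
# 3. Remaining bracketed fragments such as [1] are dropped
step3 = re.sub(r'\[[^\[\]]*\]', ' ', step2)
# 4. Hashtag marks are removed but the word is kept;
#    the replacement must be a raw string for \1 to act as a backreference
step4 = re.sub(r'#(\w+)', r'\1', step3)
# 5. Collapse the whitespace left behind by the earlier steps
result = re.sub(r'\s+', ' ', step4).strip()

print(result)
```

The link step has to run before the bracket step; in the opposite order `[this report]` would be deleted and the bare `(https://...)` would be left behind.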

In [None]:
red_df['text'] = red_df['text'].fillna('')
red_df['clean_text'] = red_df['text'].apply(clean)
red_df['impurity']   = red_df['clean_text'].apply(impurity, min_len=20)
# Check the impurity after cleaning
red_df[['clean_text', 'impurity']].sort_values(by='impurity', ascending=False).head(20)

Unnamed: 0,clean_text,impurity
0,Discussing climate change with a skeptic on an...,0.0
102544,Not OP but thanks for this post.,0.0
102546,It already is... extinction rates are very hig...,0.0
102547,The heat dome is truly frightening tbh,0.0
102548,Animals that are doing great in my area thanks...,0.0
102549,Denying reality even as the fires rise,0.0
102550,In Canada specifically NRC research on permafr...,0.0
102551,Whenever you are in a adverse climate. Kuwait ...,0.0
102552,When the pool thermometer acts as a dummy device.,0.0
102553,"There very few places where water exists, but ...",0.0


In [None]:
# Save the clean texts
red_df.rename(columns={'text':'raw_text'}, inplace=True)
red_df.rename(columns={'clean_text':'text'}, inplace=True)
red_df.drop(columns=['impurity'], inplace=True)
red_df.to_csv(clean_red_df_path, index=False)

# Preprocess Data

Create two types of data:

1.    Token lists saved in **lemmatized_tokens** containing lowercased lemmas, **without** punctuation and stopwords.
2.    Strings saved in **lemmatized_text** containing lowercased lemmas **with** punctuation and stopwords. (Newline characters are removed.)

The outputs are saved in lemmatized_red_df.csv.
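
The difference between the two outputs can be illustrated without loading a language model. In this sketch `ToyToken` and `split_outputs` are hypothetical stand-ins: the tokens are annotated by hand rather than produced by spaCy, so it only shows the filtering logic, not real lemmatization.

```python
from typing import List, NamedTuple, Tuple

class ToyToken(NamedTuple):
    """Hand-annotated stand-in for a spaCy token (hypothetical)."""
    text: str
    lemma: str
    is_stop: bool
    is_punct: bool

def split_outputs(tokens: List[ToyToken]) -> Tuple[List[str], str]:
    # Output 1: lowercased lemmas, dropping stop words and punctuation
    lemma_list = [t.lemma.lower() for t in tokens if not t.is_stop and not t.is_punct]
    # Output 2: lowercased lemmas for words, original text for punctuation
    all_tokens = [t.text if t.is_punct else t.lemma.lower() for t in tokens]
    return lemma_list, " ".join(all_tokens)

# "Cool, thanks" annotated by hand
toy_doc = [
    ToyToken("Cool", "cool", False, False),
    ToyToken(",", ",", False, True),
    ToyToken("thanks", "thank", False, False),
]

lemma_list, lemma_string = split_outputs(toy_doc)
print(lemma_list)
print(lemma_string)
```

Joining on a single space is why punctuation appears as a separate, space-delimited token in the string output.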

## Lemmatization

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

# Optional GPU setup (disabled):
# spacy.require_gpu()
# if spacy.prefer_gpu():
#     print(1)
# else:
#     print(0)

def lemmatize_text(text):
    """
    Processes the input text using SpaCy to lemmatize and lowercase each word, preserving punctuation,
    and removes newline characters. It returns a list of lemmatized tokens (excluding stop words and punctuation)
    and a string of all tokens (lemmatized and lowercased, including punctuation).

    Args:
        text (str): The text to be processed.

    Returns:
        tuple: A tuple containing a list of lemmatized tokens (excluding stop words and punctuation)
               and a string of all tokens (including punctuation).
    """
    # remove new line marks
    text = text.replace("\n", " ")

    doc = nlp(text)
    lemmatized_tokens = [token.lemma_.lower() for token in doc if not token.is_stop and not token.is_punct] # a list
    all_tokens = [token.lemma_.lower() if not token.is_punct else token.text for token in doc]

    # concat all tokens of a text into a string
    lemmatized_text = " ".join(all_tokens)

    return lemmatized_tokens, lemmatized_text



In [None]:
red_df['text'] = red_df['text'].fillna('')
red_df[['lemmatized_tokens', 'lemmatized_text']] = red_df['text'].progress_apply(lambda x: pd.Series(lemmatize_text(x)))

  0%|          | 0/153825 [00:00<?, ?it/s]

In [None]:
red_df.drop(columns=['text'], inplace=True)
red_df.to_csv(lemmatized_red_df_path, index=False)
# Keep head() last so the notebook displays it
red_df.head()

Unnamed: 0,id,year,raw_text,lemmatized_tokens,lemmatized_text
0,c7w2a9f,2013,Discussing climate change with a skeptic on an...,"[discuss, climate, change, skeptic, site, pull...",discuss climate change with a skeptic on anoth...
1,c7x3p76,2013,That hasn't even been considered for several y...,"[consider, year, huiman, emission, get, close,...",that have not even be consider for several yea...
2,c7xjxtf,2013,anything on non- carbon dioxide GHGs? I though...,"[non-, carbon, dioxide, ghgs, think, release, ...",anything on non- carbon dioxide ghgs ? i think...
3,c7xkqi8,2013,That would be easy to find as well since there...,"[easy, find, lot, line, material, volcano, emi...",that would be easy to find as well since there...
4,c7xp7wy,2013,"Cool, thanks","[cool, thank]","cool , thank"


## Bigram Extraction and Replacement



1.   Create a bigram dictionary with each bigram's frequency.
2.   Keep only the **most frequent bigrams** that are also noun chunks (e.g. climate change) and replace them in the columns lemmatized_tokens and lemmatized_text with **multi-word units** (e.g. climatechange).
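
The two steps can be sketched on a toy corpus as follows. The documents, the `toy_bigrams` helper and the `min_freq` value are assumptions chosen for illustration; the noun-chunk check is not reproduced here.

```python
from collections import Counter

def toy_bigrams(tokens):
    # Pair each token with its successor
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

# Three tiny token lists standing in for the lemmatized_tokens column
docs = [
    ["climate", "change", "skeptic"],
    ["climate", "change", "sea", "level"],
    ["sea", "level", "rise"],
]

# Step 1: bigram dictionary with frequencies
freq = Counter(b for doc in docs for b in toy_bigrams(doc))

# Step 2: keep frequent bigrams and map them to multi-word units
min_freq = 2  # hypothetical threshold for this toy corpus
frequent = [b for b, n in freq.most_common() if n >= min_freq]
merged_units = {b: b.replace(" ", "") for b in frequent}

print(freq["climate change"], freq["level rise"])
print(merged_units)
```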



### Find all bigrams

In [None]:
def find_bigrams(tokens):
    """
    Generate bigrams from a list of tokens.

    Args:
        tokens (list): A list of words.

    Returns:
        list: A list of bigrams.
    """
    bigram_iterator = zip(tokens, islice(tokens, 1, None))
    return [" ".join(bigram) for bigram in bigram_iterator]

In [None]:
red_df['bigrams'] = red_df['lemmatized_tokens'].apply(find_bigrams)
# Flatten the list of bigrams and count frequencies
all_bigrams = [bigram for sublist in red_df['bigrams'] for bigram in sublist]
bigram_freq = Counter(all_bigrams)

# Create a DataFrame from the bigram frequencies
bigram_df = pd.DataFrame(bigram_freq.items(), columns=['bigram', 'frequency'])

# remove bigrams with frequency less than 15
bigram_df = bigram_df[bigram_df['frequency'] >= 15]

# Sort the DataFrame by frequency in descending order
bigram_df = bigram_df.sort_values(by='frequency', ascending=False).reset_index(drop=True)
bigram_df.to_csv(bigram_path, index=False)

In [None]:
# Check the bigram df
bigram_df

Unnamed: 0,bigram,frequency
0,carbon dioxide,34152
1,climate change,32560
2,fossil fuel,6604
3,global warming,6366
4,sea level,6257
...,...,...
34733,way result,15
34734,ask research,15
34735,review find,15
34736,word go,15


### Find all valid bigrams
Selection criteria:
1.  Frequency >= 850
2.  The bigram is a spaCy noun chunk

In [None]:
# Reload the full pipeline: doc.noun_chunks needs the dependency parser,
# which was disabled for lemmatization
nlp = spacy.load("en_core_web_sm")

def extract_noun_chunks(df, bigram_col, frequency_col):
    """
    Extract and filter noun chunks from a DataFrame containing bigrams.

    Parameters:
    df (pd.DataFrame): The DataFrame containing the bigrams and their frequencies.
    bigram_col (str): The name of the column containing the bigrams.
    frequency_col (str): The name of the column containing the frequencies.

    Returns:
    pd.DataFrame: A new DataFrame containing only the noun chunks and their frequencies.
    """


    # Initialize an empty list to store valid noun chunk bigrams
    valid_noun_chunks = []

    # Iterate over the DataFrame
    for _, row in df.iterrows():
        bigram = row[bigram_col]
        frequency = row[frequency_col]

        # Create a SpaCy Doc object
        doc = nlp(bigram)

        # Check if the bigram matches any noun chunk in the Doc
        for chunk in doc.noun_chunks:
            if chunk.text == bigram:
                valid_noun_chunks.append((bigram, frequency))
                break

    # Create a new DataFrame with the valid noun chunks
    noun_chunk_df = pd.DataFrame(valid_noun_chunks, columns=[bigram_col, frequency_col])

    return noun_chunk_df

In [None]:
# Find validated bigrams
bigram_df = pd.read_csv(bigram_path)
bigram_df = bigram_df[bigram_df['frequency'] >= 850]
noun_chunk_df = extract_noun_chunks(bigram_df, "bigram", "frequency")
print(noun_chunk_df)

                   bigram  frequency
0          carbon dioxide      34152
1          climate change      32560
2             fossil fuel       6604
3          global warming       6366
4               sea level       6257
5          greenhouse gas       4123
6                     ° c       4014
7              level rise       3795
8       climate scientist       2840
9         climate science       2757
10                ice age       2666
11              long term       2614
12                  et al       2327
13       dioxide emission       2304
14     global temperature       2226
15            peer review       2086
16           million year       1935
17          dioxide level       1907
18          nuclear power       1844
19      greenhouse effect       1811
20     dioxide atmosphere       1744
21          climate model       1677
22                sea ice       1674
23            ipcc report       1485
24     atmospheric carbon       1476
25            water vapor       1446

### Save validated bigrams

In [None]:
noun_chunk_df.to_csv(validated_bigram_path, index=False)

## Replace chosen bigrams by multi-word units

In [None]:
def replace_bigrams_with_placeholders(text, bigrams):
    """
    Replace occurrences of bigrams in the text with placeholders.

    Args:
        text (str): The space-joined token string to process.
        bigrams (list): A list of bigrams to be replaced with placeholders.

    Returns:
        tuple: The text with bigrams replaced by placeholders, and a dict mapping each bigram to its placeholder.
    """
    placeholder_map = {}
    for bigram in bigrams:
        placeholder = f"__{''.join(bigram.split())}__"
        placeholder_map[bigram] = placeholder
        # Match whole tokens only, so that e.g. "ice age" is not found inside "police agency"
        pattern = r'(?<!\S)' + re.escape(bigram) + r'(?!\S)'
        text = re.sub(pattern, lambda m: placeholder, text)
    return text, placeholder_map

def restore_placeholders_to_bigrams(text, placeholder_map):
    """
    Replace placeholders in the text with the merged, space-free form of their bigrams.

    Args:
        text (str): The text with placeholders.
        placeholder_map (dict): A map from bigrams to placeholders.

    Returns:
        str: The text with placeholders replaced by multi-word units (e.g. "climatechange").
    """
    for bigram, placeholder in placeholder_map.items():
        text = text.replace(placeholder, bigram.replace(' ', ''))
    return text

def merge_bigrams_in_tokens(row, bigrams):
    """
    Merges adjacent tokens in the 'lemmatized_tokens' field of a DataFrame row into a multi-word unit based on specified bigrams.
    Also replaces occurrences of these bigrams in the 'lemmatized_text' field.

    Args:
        row (pd.Series): A row from the DataFrame.
        bigrams (list): A list of bigrams to be merged into multi-word units.

    Returns:
        pd.Series: The modified row with tokens merged in both 'lemmatized_tokens' and 'lemmatized_text'.
    """
    text = row['lemmatized_text']
    tokens = row['lemmatized_tokens']

    #  Dealing with lemmatized_tokens
    if not tokens:
        return row

    # Creating a set for faster lookup
    bigram_set = set(bigrams)
    new_tokens = []
    i = 0

    while i < len(tokens):
        if i < len(tokens) - 1:
            potential_bigram = f"{tokens[i]} {tokens[i + 1]}"
            if potential_bigram in bigram_set:
                new_tokens.append(potential_bigram.replace(' ', ''))
                i += 2  # Skip the next token since it's part of the bigram
                continue
        new_tokens.append(tokens[i])
        i += 1

    row['lemmatized_tokens'] = new_tokens

    # Dealing with lemmatized_text
    text, placeholder_map = replace_bigrams_with_placeholders(text, bigrams)
    tokens = text.split()  # Assuming text is already tokenized and joined by spaces
    text = restore_placeholders_to_bigrams(' '.join(tokens), placeholder_map)

    row['lemmatized_text'] = text
    return row
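
Two details of the merging are worth seeing on a small input: the token merge is greedy from left to right, and matching bigrams in the joined string is safest when whole tokens are required. The helpers `merge_tokens` and `merge_text` below are simplified, hypothetical versions written for this illustration, not the notebook's functions.

```python
import re

def merge_tokens(tokens, bigrams):
    """Greedy left-to-right merge of adjacent tokens that form a listed bigram."""
    bigram_set = set(bigrams)
    merged, i = [], 0
    while i < len(tokens):
        pair = " ".join(tokens[i:i + 2])
        if i < len(tokens) - 1 and pair in bigram_set:
            merged.append(pair.replace(" ", ""))
            i += 2  # the second token is consumed by the merge
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def merge_text(text, bigrams):
    """Merge bigrams in a space-joined string, matching whole tokens only."""
    for bigram in bigrams:
        unit = bigram.replace(" ", "")
        pattern = r"(?<!\S)" + re.escape(bigram) + r"(?!\S)"
        text = re.sub(pattern, lambda m, unit=unit: unit, text)
    return text

bigrams = ["sea level", "level rise", "ice age"]

merged_tokens = merge_tokens(["sea", "level", "rise"], bigrams)
merged_text = merge_text("the police agency study sea level rise", bigrams)
# Plain substring replacement, shown for contrast
naive_text = "the police agency".replace("ice age", "iceage")

print(merged_tokens)
print(merged_text)
print(naive_text)
```

Because "sea level" is merged first, "level rise" can no longer match in either representation, so the two columns stay consistent. The last line shows why plain substring replacement is risky: it glues "police agency" into a single token.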


In [None]:
# Apply the replacement function to each row of the DataFrame
# Get list of bigrams
noun_chunk_df = pd.read_csv(validated_bigram_path)
bigrams_list = noun_chunk_df['bigram'].tolist()

# Apply the function to each row of red_df
red_df = red_df.apply(lambda row: merge_bigrams_in_tokens(row, bigrams_list), axis=1)

In [None]:
# Drop the helper column, which is not part of the output
red_df = red_df.drop(columns=['bigrams'], errors='ignore')
red_df

Unnamed: 0,id,year,raw_text,lemmatized_tokens,lemmatized_text
0,c7w2a9f,2013,Discussing climate change with a skeptic on an...,"[discuss, climatechange, skeptic, site, pull, ...",discuss climatechange with a skeptic on anothe...
1,c7x3p76,2013,That hasn't even been considered for several y...,"[consider, year, huiman, emission, get, close,...",that have not even be consider for several yea...
2,c7xjxtf,2013,anything on non- carbon dioxide GHGs? I though...,"[non-, carbondioxide, ghgs, think, release, lo...",anything on non- carbondioxide ghgs ? i think ...
3,c7xkqi8,2013,That would be easy to find as well since there...,"[easy, find, lot, line, material, volcano, emi...",that would be easy to find as well since there...
4,c7xp7wy,2013,"Cool, thanks","[cool, thank]","cool , thank"
...,...,...,...,...,...
153820,j2ftozo,2022,Obviously money is the end game here for the c...,"[obviously, money, end, game, controller, plai...",obviously money be the end game here for the c...
153821,j2fu0wk,2022,Crazy thought. Are civilized humans only 5000 ...,"[crazy, thought, civilized, human, year, old, ...",crazy thought . be civilized human only year o...
153822,j2funi3,2022,"Agreed. Science is not always 100% accurate, b...","[agree, science, accurate, term, climatechange...","agree . science be not always % accurate , but..."
153823,j2fw9it,2022,Why is this downvoted? It’s a great question V...,"[downvote, great, question, curious, insight, ...",why be this downvote ? it ’ a great question v...


# Save the preprocessed data

In [None]:
# One more pass to remove stop words, in case the previous step missed any
import spacy
nlp=spacy.load('en_core_web_sm', disable=['parser','ner'])
# Adding the new stop words
additional_stop_words = {
    "say", "get", "know", "may", "one", "mr", "also",
    "xxxxxxxx", "deeeeeestroyyyyyyye", ".,which", "för",
    "är", "på", "klimatförändringarna", "vår", "utsläpp",
    "like", "think", "come", "read", "want", "thing", "look",
    "work", "point", "way"
}
nlp.Defaults.stop_words.update(additional_stop_words)


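# To resume from a previously saved file instead of the in-memory red_df, uncomment:
# red_df = pd.read_csv(preprocessed_red_df_path)
# red_df['lemmatized_tokens'] = red_df['lemmatized_tokens'].apply(ast.literal_eval)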

def remove_stopwords(token_list):
    """
    Remove stopwords from a list of tokens.

    Args:
        token_list (list): A list of tokens (words).

    Returns:
        list: A list of tokens with stopwords removed.
    """
    return [token for token in token_list if not nlp.vocab[token].is_stop]

# apply the function and create a new column
red_df['new_lemmatized_tokens'] = red_df['lemmatized_tokens'].apply(remove_stopwords)

red_df.drop(columns=['lemmatized_tokens'], inplace=True)
# df.drop(columns=['contains_not'], inplace=True)
red_df.rename(columns={'new_lemmatized_tokens': 'lemmatized_tokens'}, inplace=True)




# def contains_not(token_list):
#     """
#     Check if the token list contains the word 'not'.

#     Args:
#         token_list (list): A list of tokens (words).

#     Returns:
#         bool: True if 'not' is in the list, False otherwise.
#     """
#     return 'not' in token_list

# # Apply the function
# df['contains_not'] = df['lemmatized_tokens'].apply(contains_not)
# df_true = df[df['contains_not'] == True]

# # Print the rows whose tokens contain the stop word I defined.
# print(df_true.head())
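
The effect of the extra stop words can be checked on a single token list. This sketch uses plain set membership as a stand-in for the spaCy lexeme flag, which is an assumption made so that it runs without a model; `drop_stop_words` is a hypothetical helper.

```python
# A subset of the additional stop words defined above
extra_stop_words = {"get", "think", "say"}

def drop_stop_words(tokens, stop_words):
    # Keep every token that is not in the stop list
    return [t for t in tokens if t not in stop_words]

before = ["consider", "year", "huiman", "emission", "get", "close"]
after = drop_stop_words(before, extra_stop_words)

# Multi-word units are single tokens, so they are not affected
units = drop_stop_words(["think", "climatechange"], extra_stop_words)

print(after)
print(units)
```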

In [None]:
preprocessed_red_df = red_df.copy()

In [None]:
preprocessed_red_df

Unnamed: 0,id,year,raw_text,lemmatized_text,lemmatized_tokens
0,c7w2a9f,2013,Discussing climate change with a skeptic on an...,discuss climatechange with a skeptic on anothe...,"[discuss, climatechange, skeptic, site, pull, ..."
1,c7x3p76,2013,That hasn't even been considered for several y...,that have not even be consider for several yea...,"[consider, year, huiman, emission, close, gton..."
2,c7xjxtf,2013,anything on non- carbon dioxide GHGs? I though...,anything on non- carbondioxide ghgs ? i think ...,"[non-, carbondioxide, ghgs, release, lot, sulp..."
3,c7xkqi8,2013,That would be easy to find as well since there...,that would be easy to find as well since there...,"[easy, find, lot, line, material, volcano, emi..."
4,c7xp7wy,2013,"Cool, thanks","cool , thank","[cool, thank]"
...,...,...,...,...,...
153820,j2ftozo,2022,Obviously money is the end game here for the c...,obviously money be the end game here for the c...,"[obviously, money, end, game, controller, plai..."
153821,j2fu0wk,2022,Crazy thought. Are civilized humans only 5000 ...,crazy thought . be civilized human only year o...,"[crazy, thought, civilized, human, year, old, ..."
153822,j2funi3,2022,"Agreed. Science is not always 100% accurate, b...","agree . science be not always % accurate , but...","[agree, science, accurate, term, climatechange..."
153823,j2fw9it,2022,Why is this downvoted? It’s a great question V...,why be this downvote ? it ’ a great question v...,"[downvote, great, question, curious, insight, ..."


In [None]:
# save
preprocessed_red_df.drop(columns=['raw_text'], inplace=True)

# preprocessed_red_df.head()
preprocessed_red_df.to_csv(preprocessed_red_df_path, index=False)