# Preprocess Data for Sentiment and Emotion Analysis and Topic Modeling
Author: Junruo Zhu

Date: 13 Oct, 2023

This notebook preprocesses the data sets with pandas, mainly for sentiment analysis, emotion analysis, and topic modeling.

The input data are stored in CSV files. The preprocessed texts retain punctuation marks and are not lemmatized.

 * In acknowledgment of the contributions made, portions of this code were developed with the guidance and assistance of ChatGPT.

## Step 0. Preparation

### 0.0 Get all libraries ready

In [None]:
# Get all libraries ready
import pandas as pd
import numpy as np
import re
import string
import nltk
import os

### 0.1 Read input files and define output paths

In [None]:
from google.colab import drive
drive.mount('/content/drive')

### read the input files
# academic one
aca_df = pd.read_csv("/content/data_no_duplicates.csv", delimiter=",")
aca_df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Masters_Thesis/0_corpus/preprocessed_for_sentiment_analysis/aca_sentiment_df.csv')
# Create a new column "text" by joining the columns "title" and "abstract".
aca_df["text"] = aca_df["title"] + ". " + aca_df["abstract"]
# Keep only the columns "id", "year", and "text" (this drops "title" and "abstract").
aca_df = aca_df[["id", "year", "text"]]
aca_df = aca_df[aca_df["text"].str.contains("climate change", case=False, na=False)]

# Reddit one
red_df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Masters_Thesis/0_corpus/preprocessed_for_sentiment_analysis/red_sentiment_df.csv", delimiter=",")
red_df = red_df.rename(columns={"body": "text"})


### output file path
# academic one
aca_sentiment_path = "/content/drive/MyDrive/Colab Notebooks/Masters_Thesis/0_corpus/preprocessed_for_sentiment_analysis/aca_sentiment_df.csv" # for sentiment and emotion analysis and topic modeling
# Reddit one
red_sentiment_path = "/content/drive/MyDrive/Colab Notebooks/Masters_Thesis/0_corpus/preprocessed_for_sentiment_analysis/red_sentiment_df.csv" # for sentiment and emotion analysis and topic modeling

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### 0.2 Quick View of the original data set

#### Academic one

In [None]:
print(f"The original dataframe's shape is {aca_df.shape}")
# How many unique years in total
unique_years_sorted = np.sort(aca_df["year"].unique())
print(f"There are these years {unique_years_sorted}")
aca_df.head()

The original dataframe's shape is (14519, 3)
There are these years [2013 2014 2015 2016 2017 2018 2019 2020 2021 2022]


Unnamed: 0,id,year,text
0,1,2021,Changing the World One Meme at a Time: The Eff...
1,2,2021,"The Relationship between Social Norms, Avoidan..."
2,3,2013,An Evaluation of Urban Citizens' Awareness of ...
3,4,2021,How and when higher climate change risk percep...
4,5,2022,Climate change impact and adaptation for highw...


#### Reddit one

In [None]:
print(f"The original dataframe's shape is {red_df.shape}")
# How many unique years in total
unique_years_sorted = np.sort(red_df["year"].unique())
print(f"There are these years {unique_years_sorted}")
red_df.head()

The original dataframe's shape is (173597, 3)
There are these years [2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022]


Unnamed: 0,id,year,text
0,c20ghim,2011,"Well, REAL science would be a good start."
1,c22h8zu,2011,"Upvoted. I suggest you ""distinguish"" the post ..."
2,c22j047,2011,Good suggestion. Thank you.
3,c22mvy6,2011,Looks like your non-partisan approach has garn...
4,c22mxhk,2011,That should tell you something about how polit...


## Step 1. Preprocess for **Sentiment and emotion analysis**
In this step, I
1. extract data in specific years which I need,
2. keep punctuation marks, stop words (such as *not*), and capital letters, because the corresponding models rely on them to analyze the sentiment and emotion of a text.

Note:
1. I do not lemmatize text.
2. There are no new lines in one text.
3. Because the two corpora differ in writing style, I apply a separate processing method to each corpus. Please see the following steps for details.

#### Academic one
In this step, I extract data from specific years and convert some LaTeX expressions into readable text.
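
As a rough illustration of this clean-up, the sketch below applies a few of the substitutions to a made-up sentence with Python's standard `re` module. The sample sentence and the helper name `normalize_latex` are assumptions for illustration only; the notebook itself applies its patterns through `Series.str.replace`, and the patterns here are a subset of the ones it uses.

```python
import re

# A subset of the patterns used in this notebook; the sample text is invented.
DEGREE_PATTERN = re.compile(r"\$\^\\circ\$C|\{\\deg\}C|degC|deg C")
CO2_PATTERN = re.compile(r"CO\$_2\$|\$CO_\{2\}\$")
COPYRIGHT_PATTERN = re.compile(r"\(C\) \d{4} .*[.]")


def normalize_latex(text: str) -> str:
    """Hypothetical helper: turn LaTeX-style units and formulas into plain words."""
    text = DEGREE_PATTERN.sub(" degree", text)      # e.g. $^\circ$C -> " degree"
    text = CO2_PATTERN.sub("carbon dioxide", text)  # e.g. CO$_2$ -> "carbon dioxide"
    text = COPYRIGHT_PATTERN.sub("", text)          # drop a trailing copyright notice
    return text.strip()


sample = (
    r"Rising CO$_2$ levels may add 2$^\circ$C of warming. "
    r"(C) 2013 Elsevier Ltd. All rights reserved."
)
print(normalize_latex(sample))
# Rising carbon dioxide levels may add 2 degree of warming.
```

Note that the copyright pattern is greedy: it removes everything from `(C) <year>` up to the last full stop in the text, which is only safe when the notice sits at the end of the abstract.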

In [None]:
# Keep only the ten years from 2013 to 2022
aca_df = aca_df[(aca_df["year"] >= 2013) & (aca_df["year"] <= 2022)].copy()
print(f"The new dataframe's shape is {aca_df.shape}")
# How many unique years in total
unique_years_sorted = np.sort(aca_df["year"].unique())
print(f"There are these years {unique_years_sorted}")
aca_df.head()

The new dataframe's shape is (14519, 3)
There are these years [2013 2014 2015 2016 2017 2018 2019 2020 2021 2022]


Unnamed: 0,id,year,text
0,1,2021,Changing the World One Meme at a Time: The Eff...
1,2,2021,"The Relationship between Social Norms, Avoidan..."
2,3,2013,An Evaluation of Urban Citizens' Awareness of ...
3,4,2021,How and when higher climate change risk percep...
4,5,2022,Climate change impact and adaptation for highw...


In [None]:
# Create a new data frame which is the same as aca_df
aca_sentiment_df = aca_df.copy()

Define functions to process texts

In [None]:
def process_aca_text_series(series):
    """Convert LaTeX expressions to English in a Pandas Series."""
    if series.dtype == "object":
        # Replace various forms of "degree Celsius"
        series = series.str.replace(r"\^\{\\circ\}C|\$\^\{o\}C\$|\$\^\\circ\$C|\{\\deg\}C|\{degree sign\}C|degC|deg C", " degree", regex=True)

        # Replace various forms of "carbon dioxide".
        # The inline (?i) flag must lead the pattern: Python 3.11+ rejects it anywhere else.
        series = series.str.replace(r"(?i)CO\$_2\$|carbon dioxide|CO\\textsubscript\{2\}|\$CO_\{2\}\$", "carbon dioxide", regex=True)

        # Replace representation of "calcium cation"
        series = series.str.replace(r"Ca\$\^\{2\+\}\$", "calcium cation", regex=True)

        # Remove URLs
        series = series.str.replace(r"https?://\S+|www\.\S+", " ", regex=True)

        # Remove copyright notices, e.g. "(C) 2013 Elsevier Ltd. All rights reserved."
        series = series.str.replace(r"\(C\) \d{4} .*[.]", "", regex=True)

        # Replace \n by a space
        series = series.str.replace(r"\n", " ", regex=True)

        return series.str.strip()

    else:
        return pd.Series("", index=series.index)  # Return empty strings, aligned with the input index, if the input is not text

Run functions to process texts

In [None]:
aca_sentiment_df["text"] = process_aca_text_series(aca_sentiment_df["text"])
aca_sentiment_df = aca_sentiment_df.dropna()

Check and save the processed corpus

In [None]:
# Check the processed one
aca_sentiment_df.head()

Unnamed: 0,id,year,text
0,1,2021,Changing the World One Meme at a Time: The Eff...
1,2,2021,"The Relationship between Social Norms, Avoidan..."
2,3,2013,An Evaluation of Urban Citizens' Awareness of ...
3,4,2021,How and when higher climate change risk percep...
4,5,2022,Climate change impact and adaptation for highw...


In [None]:
# Save the preprocessed corpus for sentiment and emotion analyses
aca_sentiment_df.to_csv(aca_sentiment_path, index=False)

!ls /content/

drive  sample_data


#### Reddit one
In this step, I extract data from specific years and remove duplicated content, user names, etc.
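
The kind of clean-up applied to Reddit comments can be sketched on a single made-up comment with the standard `re` module. The sample comment, the user name, the URL, and the helper name `clean_comment` are assumptions for illustration; the notebook applies its patterns to a whole column through `Series.str.replace`, and only some of them are shown here.

```python
import re

# A subset of the patterns used in this notebook; the sample comment is invented.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
USER_PATTERN = re.compile(r"/u/[\w-]+")
EMPHASIS_PATTERN = re.compile(r"\*{1,3}([^*]+)\*{1,3}")
LAUGHTER_PATTERN = re.compile(r"\b(bwa)*(a*ha)+\b", flags=re.IGNORECASE)


def clean_comment(text: str) -> str:
    """Hypothetical helper: strip Reddit-specific noise from one comment."""
    text = URL_PATTERN.sub(" ", text)    # remove links
    text = USER_PATTERN.sub(" ", text)   # remove /u/<name> mentions
    # Markdown emphasis (*word*) becomes upper case so the stress is not lost
    text = EMPHASIS_PATTERN.sub(lambda m: m.group(1).upper(), text)
    text = LAUGHTER_PATTERN.sub(" ", text)  # remove "ha", "haha", "bwahaha", ...
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace


sample = "That is *not* true haha /u/some_user see https://example.com/data"
print(clean_comment(sample))
# That is NOT true see
```

The word boundaries in the laughter pattern keep it from touching words that merely contain "ha", such as "That" in the sample.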

In [None]:
# Keep only the ten years from 2013 to 2022
red_df = red_df[(red_df["year"] >= 2013) & (red_df["year"] <= 2022)].copy()
print(f"The new dataframe's shape is {red_df.shape}")
# How many unique years in total
unique_years_sorted = np.sort(red_df["year"].unique())
print(f"There are these years {unique_years_sorted}")
red_df.head()

The new dataframe's shape is (153861, 3)
There are these years [2013 2014 2015 2016 2017 2018 2019 2020 2021 2022]


Unnamed: 0,id,year,text
0,c7w2a9f,2013,Discussing climate change with a skeptic on an...
1,c7x3p76,2013,That hasn't even been considered for several y...
2,c7xjxtf,2013,anything on non- carbon dioxide GHGs? I though...
3,c7xkqi8,2013,That would be easy to find as well since there...
4,c7xp7wy,2013,"Cool, thanks"


In [None]:
# Create a new data frame which is the same as red_df
red_sentiment_df = red_df.copy()

Define functions to process texts

In [None]:
def process_red_text_series(series):
    """Make text readable for humans - Vectorized version for Pandas Series."""
    if series.dtype == "object":
        # Remove URLs, user names, subreddit names, quotes, and other unnecessary content
        series = series.str.replace(r"https?://\S+|www\.\S+|/?r/|/u/[\w-]+|^>.*?>|^.*?>|\[original\]\( reduced by \d{1,2}%. \(I'm a bot\)", " ", regex=True)

        # Restore ">" and "<"
        series = series.str.replace("&gt;", "> ", regex=True)
        series = series.str.replace("&lt;", "< ", regex=True)

        # Make expressions the same
        series = series.str.replace(r"CO2|co2|Co2", " carbon dioxide", regex=True)
        series = series.str.replace(r"ca2\+|Ca2\+|CA2\+", " calcium cation", regex=True)  # "+" is escaped so it matches a literal plus sign

        # Capitalize marked text and remove asterisks
        series = series.str.replace(r"\*{1,3}([^*]+)\*{1,3}", lambda match: match.group(1).upper(), regex=True)

        # Remove laughter tokens such as "ha", "haha", and "bwahaha"
        series = series.str.replace(r"\b(bwa)*(a*ha)+\b", " ", regex=True, flags=re.IGNORECASE)

        # Remove multiple spaces
        series = series.str.replace(r"\s+", " ", regex=True)

        return series.str.strip()


    else:
        return pd.Series("", index=series.index)  # Return empty strings, aligned with the input index, if the input is not text


def process_reddit_data(df, column_name):
    """Process the DataFrame by removing unnecessary expressions and making the text readable."""
    # Remove texts only containing "[deleted]" and "[removed]"
    df_filtered = df[~df[column_name].isin(["[deleted]", "[removed]"])].copy()

    # Apply the vectorized function
    df_filtered.loc[:, column_name] = process_red_text_series(df_filtered[column_name])

    # Remove duplicated lines
    df_filtered = df_filtered.drop_duplicates(subset=[column_name], keep = "first").reset_index(drop=True)
    df_filtered = df_filtered.dropna()

    return df_filtered


Process the data frame

In [None]:
red_sentiment_df = process_reddit_data(red_sentiment_df, "text")

Check and save the processed corpus

In [None]:
# check the preprocessed results
# red_sentiment_df.head()
filtered_df = red_sentiment_df[red_sentiment_df['text'].str.contains('haha', case=False, na=False)]
print(filtered_df['text'])

In [None]:
# Save the preprocessed dataframe.
red_sentiment_df.to_csv(red_sentiment_path, index=False)

!ls /content/

drive  red_sentiment_df.csv  sample_data
