<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 5px; height: 50px">

# Project 3: Web APIs & NLP

### Project Title: Generative AI and Art - understanding and predicting chatter from online communities

**DSI-41 Group 2**: Muhammad Faaiz Khan, Lionel Foo, Gabriel Tan

## Part 2: Data Cleaning and Feature Engineering

### 2.0 Imports
___

In [2]:
# Importing libraries for EDA, modelling and analysis.
import pandas as pd
import re
from emoji import demojize
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from textblob import TextBlob

# this setting widens how many characters pandas will display in a column:
pd.options.display.max_colwidth = 400

In [3]:
# Reimporting the dataframe from scraping in Section 1
reddit_df = pd.read_csv('../data/reddit_df.csv')

### 2.1 Filtering empty/automated posts
___

In [3]:
# Checking the dataframe info
reddit_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7789 entries, 0 to 7788
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   subr-def_ai   7789 non-null   int64 
 1   is_op         7789 non-null   int64 
 2   author        7316 non-null   object
 3   post_id       7789 non-null   object
 4   body          7789 non-null   object
 5   upvotes       7789 non-null   int64 
 6   num_comments  7789 non-null   int64 
dtypes: int64(4), object(3)
memory usage: 426.1+ KB


#### 2.1.1 Null and "[deleted]" entries

Based on the info above, there are 473 empty entries for `author`. This occurs when the author of the post had deleted their account by the time of scraping. This is acceptable for the prediction model, as we will not be using this feature for prediction. For the purpose of analysis, we will relabel empty entries for `author` as "[deleted]".

Before renaming these `author` entries, we will first check for the number of posts where the entry for `body` is "[deleted]". This indicates that the post has been deleted from the subreddit.

In [4]:
# Checking the info for rows where the contents of the post are "[deleted]"
test = reddit_df.loc[reddit_df['body']=='[deleted]',:]
test.info()

<class 'pandas.core.frame.DataFrame'>
Index: 111 entries, 61 to 7517
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   subr-def_ai   111 non-null    int64 
 1   is_op         111 non-null    int64 
 2   author        0 non-null      object
 3   post_id       111 non-null    object
 4   body          111 non-null    object
 5   upvotes       111 non-null    int64 
 6   num_comments  111 non-null    int64 
dtypes: int64(4), object(3)
memory usage: 6.9+ KB


There are 111 rows where the post contents are "[deleted]". The `author` value for these rows is also null in every case. These rows are not useful for modelling purposes and will be removed from the dataframe.

Posts with the content "[removed]" will also be removed from the dataframe. These posts were removed by the moderation team and will not be useful for our modelling/analysis.

In [5]:
# Filtering out all rows where the contents of the post are "[deleted]"
reddit_df = reddit_df.loc[reddit_df['body']!='[deleted]',:].reset_index(drop=True)

# Filtering out all rows where the contents of the post are "[removed]"
reddit_df = reddit_df.loc[reddit_df['body']!='[removed]',:].reset_index(drop=True)

# Relabelling null entries under 'author' as "[deleted]"
# (assigning the result avoids calling fillna(inplace=True) on a column selection)
reddit_df['author'] = reddit_df['author'].fillna('[deleted]')
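
As a minimal sketch of the same filtering logic, the two placeholder bodies can also be dropped in a single pass with `isin`. The frame `toy_df` and its values are made up for illustration and are not part of our dataset.

```python
import pandas as pd

# Toy frame standing in for reddit_df (hypothetical values, for illustration only)
toy_df = pd.DataFrame({
    'author': ['alice', None, None, 'bob'],
    'body': ['I like this', '[deleted]', 'still visible', '[removed]'],
})

# Drop both placeholder bodies in a single pass
placeholders = ['[deleted]', '[removed]']
toy_df = toy_df.loc[~toy_df['body'].isin(placeholders)].reset_index(drop=True)

# Relabel missing authors by assignment rather than with inplace=True
toy_df['author'] = toy_df['author'].fillna('[deleted]')

print(toy_df)
```

Only the two rows with real text remain, and the row whose author was missing is relabelled as "[deleted]".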

#### 2.1.2 Filtering Automated Posts

We also want to filter out the automated posts from Reddit bots/moderators. We will filter these posts by the entries in `author`.

In [6]:
# Define list of known bots/automated posts between both subreddits
bot_list = ['DefendingAIArt-ModTeam',
            'AutoModerator',
            'WikiSummarizerBot',
            'BookFinderBot',
            'sneakpeekbot',
            'Anti-ThisBot-IB',
            'exclaim_bot',
            'of_patrol_bot',
            'AmputatorBot',
            'savevideobot',
            'RemindMeBot']

In [7]:
# Filtering out the automated posts in our dataframe
reddit_df = reddit_df.loc[~reddit_df['author'].isin(bot_list), :].reset_index(drop=True)

# Checking updated dataframe
reddit_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7485 entries, 0 to 7484
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   subr-def_ai   7485 non-null   int64 
 1   is_op         7485 non-null   int64 
 2   author        7485 non-null   object
 3   post_id       7485 non-null   object
 4   body          7485 non-null   object
 5   upvotes       7485 non-null   int64 
 6   num_comments  7485 non-null   int64 
dtypes: int64(4), object(3)
memory usage: 409.5+ KB


#### 2.1.3 Filtering posts affected by redact.dev

After an attempt at running WordVectorizer, we found the sentence below recurring across multiple posts in the dataframe:

    this message was mass deleted/edited with redact.dev

Upon investigation, posts containing this sentence consist of meaningless strings of unrelated words. This is a result of the post being removed from Reddit through [redact.dev](https://redact.dev/), an online service used to mass delete a user's internet posts. We will remove these posts from our dataframe as they do not contribute meaningfully to the model/analysis.

In [8]:
# Define the string to search for
string_to_search = 'this message was mass deleted/edited with redact.dev'

# Use boolean indexing to filter rows containing the specified string
rows_with_string = reddit_df[reddit_df['body'].str.contains(re.escape(string_to_search))]

# Display the rows
print(rows_with_string['body'])

# Refer to the posts printed in the output below for an example of the meaningless strings of text

75                    versed imagine decide bike gaze physical fear shocking fragile capable ` this message was mass deleted/edited with redact.dev `
120                          heavy cake touch governor dog coherent smoggy joke hurry quaint ` this message was mass deleted/edited with redact.dev `
124                                    wipe pet juggle boast fact like late label aback full ` this message was mass deleted/edited with redact.dev `
3518              cooperative compare fuel groovy hard-to-find modern shame pet snails aloof ` this message was mass deleted/edited with redact.dev `
3535                    recognise marble dazzling grab public innate crush plate tease point ` this message was mass deleted/edited with redact.dev `
3601                     cows disagreeable soup rhythm grey bear plant truck mindless narrow ` this message was mass deleted/edited with redact.dev `
3954                         rhythm pathetic rotten full wrench tap tease juggle touch scale ` this 

In [9]:
# Filtering out the posts removed by redact.dev
reddit_df = reddit_df[~reddit_df['body'].str.contains(re.escape(string_to_search))]

# Checking updated dataframe
reddit_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7473 entries, 0 to 7484
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   subr-def_ai   7473 non-null   int64 
 1   is_op         7473 non-null   int64 
 2   author        7473 non-null   object
 3   post_id       7473 non-null   object
 4   body          7473 non-null   object
 5   upvotes       7473 non-null   int64 
 6   num_comments  7473 non-null   int64 
dtypes: int64(4), object(3)
memory usage: 467.1+ KB
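
An alternative to escaping the marker with `re.escape` is to ask pandas for a literal substring match via `regex=False`. The sketch below assumes a small hypothetical Series, `toy_bodies`, in place of `reddit_df['body']`.

```python
import pandas as pd

marker = 'this message was mass deleted/edited with redact.dev'

# Hypothetical posts: one redacted, one genuine
toy_bodies = pd.Series([
    'versed imagine decide bike ` this message was mass deleted/edited with redact.dev `',
    'AI art is a tool like any other',
])

# regex=False treats the marker as a plain substring, so '.' and '/' need no escaping
is_redacted = toy_bodies.str.contains(marker, regex=False)

# Keep only the posts that do not carry the redact.dev marker
kept = toy_bodies[~is_redacted].reset_index(drop=True)
print(kept)
```

Both approaches should select the same rows here; the literal match simply avoids building a regular expression.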


### 2.2 Cleaning post content
___

#### 2.2.1 Removing hyperlinks and GIFs

Multiple posts contain links to other websites, including Giphy, a site for sharing GIF images. Running our Lemmatiser/WordVectoriser on such hyperlinks will not contribute meaningfully to our model, so we remove them at this stage using regex.

In [10]:
# Checking number of rows with hyperlinks within 'body'
test_link = reddit_df[reddit_df['body'].str.contains('http.*')]
test_link.shape

(338, 7)

In [11]:
# Checking number of rows with giphy links within 'body'
test_gif = reddit_df[reddit_df['body'].str.contains('![gif](giphy|', regex=False)]
test_gif.shape

(33, 7)

In [12]:
# Using regex to search the 'body' and remove hyperlinks from posts
reddit_df['body'] = reddit_df['body'].str.replace('http[^ ]*', '', regex=True)

# Check if any posts remain with hyperlinks
test_link = reddit_df[reddit_df['body'].str.contains('http.*')]
test_link.shape

(0, 7)

In [13]:
# Using regex to search the 'body' and remove giphy links from posts
# (the brackets are escaped rather than wrapped in character classes, which avoids a regex FutureWarning)
reddit_df['body'] = reddit_df['body'].str.replace(r'!\[gif\][^ ]*', '', regex=True)

# Check if any posts remain with giphy links
test_gif = reddit_df[reddit_df['body'].str.contains('![gif](giphy|', regex=False)]
test_gif.shape

(0, 7)
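
The two removal steps can be sketched on a single string with the standard `re` module. This is an illustration rather than the notebook's exact code: the helper `strip_links` and the sample text are hypothetical, and the patterns use `\S*` (stop at any whitespace) where the cells above use `[^ ]*` (stop at a space only).

```python
import re

# Patterns for hyperlinks and Giphy embeds (a variation on the ones used above)
URL_PATTERN = re.compile(r'http\S*')
GIF_PATTERN = re.compile(r'!\[gif\]\S*')

def strip_links(text):
    """Remove hyperlinks and Giphy embeds, then collapse leftover spaces."""
    text = URL_PATTERN.sub('', text)
    text = GIF_PATTERN.sub('', text)
    # Removing a link leaves a double space behind; squeeze runs of spaces/tabs
    return re.sub(r'[ \t]{2,}', ' ', text).strip()

sample = 'Source: https://example.com/post nice ![gif](giphy|abc123) right?'
cleaned = strip_links(sample)
print(cleaned)
```

Stopping at any whitespace matters when a link is followed directly by a newline, since `[^ ]*` would carry on past the line break and remove the start of the next line.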

#### 2.2.2 Converting Emojis to text

After scraping, emojis within posts are stored as Unicode emoji characters (for example 😟). We will use the `demojize` function from the `emoji` library to convert them to text shortcodes such as `:worried_face:`, which gives better context to the post when performing sentiment analysis and modelling.

In [14]:
# Showing example of emoji within post
reddit_df[reddit_df['post_id']=='1043ums']

Unnamed: 0,subr-def_ai,is_op,author,post_id,body,upvotes,num_comments
1116,1,1,[deleted],1043ums,More ableism from the anti-AI crowd... Gross. 😟,193,6


In [15]:
# Applying the demojize function on 'body'
reddit_df['body'] = reddit_df['body'].apply(demojize)

In [16]:
# Showing example of emoji after running demojize
reddit_df[reddit_df['post_id']=='1043ums']

Unnamed: 0,subr-def_ai,is_op,author,post_id,body,upvotes,num_comments
1116,1,1,[deleted],1043ums,More ableism from the anti-AI crowd... Gross. :worried_face:,193,6


#### 2.2.3 Removing empty posts after cleaning

After performing the data cleaning steps above, certain rows in the dataframe have become either empty or filled with only newline characters. We will drop these rows in this section.

In [17]:
# Replace NaN values with an empty string for consistency
reddit_df['body'] = reddit_df['body'].fillna('')

# Check if any rows in 'body' consist of only '0', '', or only newline character
rows_with_zeros_or_newline = reddit_df[(reddit_df['body'].isin(['0', '', '\n', '\n\n', '\n\n\n', '\n\n\n\n']))]

# Count the rows that meet the condition
rows_with_zeros_or_newline.shape

(30, 7)

In [18]:
# Drop the rows with no actual text and reset the index
reddit_df = reddit_df.drop(rows_with_zeros_or_newline.index).reset_index(drop=True)
# Display the updated DataFrame info
reddit_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7443 entries, 0 to 7442
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   subr-def_ai   7443 non-null   int64 
 1   is_op         7443 non-null   int64 
 2   author        7443 non-null   object
 3   post_id       7443 non-null   object
 4   body          7443 non-null   object
 5   upvotes       7443 non-null   int64 
 6   num_comments  7443 non-null   int64 
dtypes: int64(4), object(3)
memory usage: 407.2+ KB
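
The check above lists specific newline-only strings explicitly. A more general sketch, assuming we want to treat any whitespace-only body as empty, strips each post first. Note that this variation does not treat a body of `'0'` as empty, so it would need an extra condition to reproduce the cell above exactly. `toy_bodies` is a hypothetical Series.

```python
import pandas as pd

# Hypothetical bodies: empty, newline-only, mixed whitespace, real text, missing
toy_bodies = pd.Series(['', '\n\n', ' \n \n\n\n\n\n', 'real text', None])

# Missing values become '', then any body that is empty after stripping counts as blank
is_blank = toy_bodies.fillna('').str.strip().eq('')

# Keep only posts with actual text
kept = toy_bodies[~is_blank].reset_index(drop=True)
print(kept)
```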


### 2.3 Feature engineering
___

We will engineer the following features to assist with EDA and modelling:


|Feature|Type|Description|
|---|---|---|
|`post_length`|int|Character length of the post|
|`post_word_count`|int|Number of words in the post|
|`neg`|float|Measure of negativity of post*|
|`neu`|float|Measure of neutrality of post*|
|`pos`|float|Measure of positivity of post*|
|`compound`|float|Normalised overall sentiment score, from -1 (most negative) to +1 (most positive)*|
|`subjectivity_score`|float|Measure of subjectivity of post, from 0 (objective) to 1 (subjective)**|

*These parameters will be generated using SentimentIntensityAnalyzer from the nltk.sentiment.vader library. Further documentation [here](https://www.nltk.org/howto/sentiment.html).

**`subjectivity_score` is obtained using the TextBlob library. Further documentation [here](https://textblob.readthedocs.io/en/dev/quickstart.html).
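
For intuition on the scale of `compound`: to the best of our understanding of VADER's reference implementation, the summed word-level valence of a post is squashed into the range -1 to +1 with a default constant of 15. The function below is our own sketch of that normalisation, not a call into nltk, and the name `normalise_compound` is hypothetical.

```python
import math

def normalise_compound(valence_sum, alpha=15):
    """Squash a raw valence sum into [-1, 1] (sketch of VADER-style normalisation)."""
    score = valence_sum / math.sqrt(valence_sum ** 2 + alpha)
    # Clip to guard against floating-point values just outside the range
    return max(-1.0, min(1.0, score))

# A neutral post stays at 0; stronger raw valence moves towards +/-1 but never beyond it
for raw in (0, 1, -1, 10):
    print(raw, round(normalise_compound(raw), 4))
```

Because the mapping saturates, two strongly negative posts can have similar `compound` values even if one contains many more negative words than the other.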

#### 2.3.1 Engineering `post_length` and `post_word_count`

In [19]:
# Engineering feature for the character length of each post
reddit_df['post_length'] = reddit_df['body'].str.len()

In [20]:
# Engineering feature for the number of words in each post
reddit_df['post_word_count'] = reddit_df['body'].str.split().str.len()

#### 2.3.2 Engineering sentiment analysis features

In [21]:
# Instantiate Sentiment Intensity Analyzer
sent = SentimentIntensityAnalyzer()

# Apply sentiment analysis to the 'body' column
reddit_df['sentiment_scores'] = reddit_df['body'].apply(lambda x: sent.polarity_scores(x))

# Expand the sentiment scores into separate columns
sentiment_df = reddit_df['sentiment_scores'].apply(pd.Series)

# Concatenate the sentiment scores DataFrame with the original DataFrame
reddit_df = pd.concat([reddit_df, sentiment_df], axis=1)

# Display the updated DataFrame
reddit_df.head(1)

Unnamed: 0,subr-def_ai,is_op,author,post_id,body,upvotes,num_comments,post_length,post_word_count,sentiment_scores,neg,neu,pos,compound
0,1,1,[deleted],101n5dv,"[TW: DEATH THREAT] And they say that ""AI bros"" are the ones harassing the artists?",498,9,82,15,"{'neg': 0.385, 'neu': 0.615, 'pos': 0.0, 'compound': -0.8455}",0.385,0.615,0.0,-0.8455


In [22]:
# Function to get subjectivity score using TextBlob
def get_subjectivity_score(text):
    blob = TextBlob(text)
    return blob.sentiment.subjectivity

In [23]:
# Apply subjectivity analysis to the 'body' column
reddit_df['subjectivity_score'] = reddit_df['body'].apply(get_subjectivity_score)

# Drop unnecessary 'sentiment_scores' column after splitting
reddit_df.drop(columns=['sentiment_scores'], inplace=True)

# Display the updated DataFrame
reddit_df.head(1)

Unnamed: 0,subr-def_ai,is_op,author,post_id,body,upvotes,num_comments,post_length,post_word_count,neg,neu,pos,compound,subjectivity_score
0,1,1,[deleted],101n5dv,"[TW: DEATH THREAT] And they say that ""AI bros"" are the ones harassing the artists?",498,9,82,15,0.385,0.615,0.0,-0.8455,0.0


In [24]:
# Group by 'subr-def_ai' and calculate average scores
average_scores = reddit_df.groupby('subr-def_ai').agg({
    'neg': 'mean',
    'neu': 'mean',
    'pos': 'mean',
    'compound': 'mean',
    'subjectivity_score': 'mean'
}).reset_index()

# Add labels for better readability
average_scores['subreddit'] = average_scores['subr-def_ai'].map({0: 'ArtistHate', 1: 'DefendingAIArt'})

# Reorder columns for better display
average_scores = average_scores[['subreddit', 'neg', 'neu', 'pos', 'compound', 'subjectivity_score']]

# Display the average scores
print(average_scores)

        subreddit       neg       neu       pos  compound  subjectivity_score
0      ArtistHate  0.092619  0.775681  0.130045  0.114222            0.467722
1  DefendingAIArt  0.099033  0.775771  0.124971  0.060208            0.439919


### 2.4 Class Imbalance Analysis
___

- Our model predicts which subreddit a given text belongs to, so the two classes should be comparably represented. After cleaning, *r/DefendingAIArt* accounts for roughly 59% of posts (4,428 versus 3,015 for *r/ArtistHate*), and the imbalance is most pronounced among posts of fewer than 50 words.

- *r/DefendingAIArt* is associated with pro-AI-art sentiment and *r/ArtistHate* with anti-AI-art sentiment. Very short comments carry little information or context, so they contribute little to training and evaluation and may reduce the accuracy of our predictions.

- We therefore undersample *r/DefendingAIArt* by removing its posts with the lowest word counts until both classes are the same size. This keeps the more informative, context-rich comments in the training data.

- With balanced classes, the model is less likely to favour the majority class and should be better at distinguishing pro-AI from anti-AI sentiment between the two subreddits.

In [25]:
# Calculate class balance
class_balance = reddit_df['subr-def_ai'].value_counts(normalize=True)
class_counts = reddit_df['subr-def_ai'].value_counts()

# Define class labels
class_labels = {1: 'DefendingAIArt', 0: 'ArtistHate'}

# Rename the indices using the class labels
class_balance.index = class_balance.index.map(class_labels)
class_counts.index = class_counts.index.map(class_labels)

# Print out the class balance with labels
print("Class Balance (Normalized):")
print(class_balance)

# Print out the actual value counts with labels
print("\nClass Balance (Actual Counts):")
print(class_counts)

Class Balance (Normalized):
subr-def_ai
DefendingAIArt    0.594921
ArtistHate        0.405079
Name: proportion, dtype: float64

Class Balance (Actual Counts):
subr-def_ai
DefendingAIArt    4428
ArtistHate        3015
Name: count, dtype: int64


In [26]:
# Desired number of rows per class (the size of the smaller class, ArtistHate)
desired_balance = 3015

# Sort the rows for 'DefendingAIArt' by 'post_word_count', shortest first
defending_ai_sorted = reddit_df[reddit_df['subr-def_ai'] == 1].sort_values(by='post_word_count')

# Calculate the number of rows to drop
rows_to_drop = len(defending_ai_sorted) - desired_balance

# Drop the excess rows
reddit_df.drop(index=defending_ai_sorted.head(rows_to_drop).index, inplace=True)

# Verify the new class balance
new_class_balance = reddit_df['subr-def_ai'].value_counts()
print("New Class Balance (Actual Counts):")
print(new_class_balance)

New Class Balance (Actual Counts):
subr-def_ai
1    3015
0    3015
Name: count, dtype: int64
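
The same undersampling step can be sketched without hard-coding the target size, by deriving it from the class counts. The frame `toy_df` below is hypothetical and only reuses our column names.

```python
import pandas as pd

# Toy frame: four posts in class 1 (majority) and two in class 0 (hypothetical values)
toy_df = pd.DataFrame({
    'subr-def_ai':     [1, 1, 1, 1, 0, 0],
    'post_word_count': [2, 40, 5, 12, 7, 30],
})

# Work out which class is larger and by how many rows
counts = toy_df['subr-def_ai'].value_counts()
majority_label = counts.idxmax()
n_excess = int(counts.max() - counts.min())

# Sort the majority class by word count so the shortest posts are dropped first
shortest_first = toy_df[toy_df['subr-def_ai'] == majority_label].sort_values('post_word_count')
balanced_df = toy_df.drop(index=shortest_first.head(n_excess).index).reset_index(drop=True)

print(balanced_df['subr-def_ai'].value_counts())
```

One design consequence worth keeping in mind: because only the shortest posts of one class are removed, the word-count distributions of the two classes will differ after balancing, so `post_word_count` and `post_length` may partly reflect this step rather than genuine differences between the subreddits.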


In [27]:
reddit_df.shape

(6030, 14)

In [28]:
# Exporting to csv format for EDA and modelling step
reddit_df.to_csv('../data/reddit_df_2.csv', index=False)

#### **(DEPRECATED)** Sentiment Analysis using sst2

**The code in this section has been deprecated in favour of `SentimentIntensityAnalyzer()`, which does not have a character limit and has a faster run time.**

Before developing the prediction model, we will engineer features that capture sentiment, using sentiment analysis on each post.

For the purpose of this project, we will not be creating a unique sentiment analysis model. Instead, we will use the default sentiment analysis model provided by Huggingface, [sst2](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).

Unfortunately, sst2 does not work on character strings longer than 1300 characters.

In [None]:
# reddit_df[reddit_df['post_length']>1300].count()

The posts identified above exceed this character limit. We will filter out posts from our dataframe where post length is greater than 1300 characters.

In [None]:
# Filter out posts from the dataframe where 'post_length' > 1300

# reddit_df = reddit_df.loc[reddit_df['post_length']<=1300, :].reset_index(drop=True)
# reddit_df.shape

In [None]:
# Initialise sst2 sentiment analysis model (requires the transformers library)

# from transformers import pipeline
# sent_an = pipeline("sentiment-analysis")

In [None]:
# body_list = [x for x in reddit_df['body']]

In [None]:
# sent = []
# sent_score = []
# for body in body_list:
#     result = sent_an(body)
#     sent.append(result[0]['label'])
#     sent_score.append(result[0]['score'])

In [None]:
# reddit_df['sent'] = pd.DataFrame(sent, columns=['sent'])
# reddit_df['sent_score'] = pd.DataFrame(sent_score, columns=['sent_score'])

In [None]:
# reddit_df.to_csv('data/reddit_engdf.csv', index=False)

In [None]:
# reddit_df.groupby('subr-def_ai')['sent_score'].mean()