# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Project 3: Web APIs & NLP

**DSI-41 Group 2**: Muhammad Faaiz Khan, Lionel Foo, Gabriel Tan

## **Project title**: Generative AI and Art - understanding and predicting chatter from online communities

## Part 1 Data Collection

## 01. Imports

In [1]:
import pandas as pd
import praw
from praw.models import MoreComments

# this setting widens how many characters pandas will display in a column:
pd.options.display.max_colwidth = 400

## 1. Data Scraping
We will use PRAW (the Python Reddit API Wrapper) to scrape two subreddits: *r/DefendingAIArt* and *r/ArtistHate*.

In [2]:
# client_id and client_secret identify a personal application registered on Reddit.
# Both values are redacted here because they are confidential. Enter your own client_id and client_secret to run the code.
reddit = praw.Reddit(user_agent="PRAW", client_id="",
                     client_secret="")
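
One way to avoid pasting credentials into the notebook is to read them from environment variables. The sketch below is only an illustration: the helper `load_reddit_credentials` and the variable names `REDDIT_CLIENT_ID`, `REDDIT_CLIENT_SECRET` and `REDDIT_USER_AGENT` are hypothetical and not part of the original notebook.

```python
import os


def load_reddit_credentials(env=os.environ):
    """Collect keyword arguments for praw.Reddit from environment variables.

    The variable names are hypothetical; rename them to suit your setup.
    """
    required = {
        "client_id": "REDDIT_CLIENT_ID",
        "client_secret": "REDDIT_CLIENT_SECRET",
    }
    missing = [var for var in required.values() if not env.get(var)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    creds = {key: env[var] for key, var in required.items()}
    # user_agent falls back to the value used in this notebook
    creds["user_agent"] = env.get("REDDIT_USER_AGENT", "PRAW")
    return creds


# Usage (not executed here, since it needs real credentials):
# reddit = praw.Reddit(**load_reddit_credentials())
```

The helper fails early with a `KeyError` when a required variable is unset, rather than letting the first API request fail.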

The scraping process is summarised below:
1. Define a dictionary where the keys are our column names, with empty lists as the values.
2. Loop through the top posts of each subreddit and their comments, appending the relevant information to our dictionary on each iteration.
3. Convert the dictionary to dataframe format and export.

The information scraped from the subreddit is defined below:


|Feature|Type|Description|
|---|---|---|
|`subr-def_ai`|int|Binary flag: 1 if the post is from *r/DefendingAIArt*, 0 if it is from *r/ArtistHate*|
|`is_op`|int|Binary flag: 1 if the post is the original post (OP), 0 if it is a comment|
|`author`|obj|The *Redditor* instance (username) of the author|
|`post_id`|obj|The unique id of the *post*/*comment*|
|`body`|str|Content of the *post*/*comment*\*|
|`upvotes`|int|Number of upvotes for the *post*/*comment*|
|`num_comments`|int|Number of direct replies: top-level comments for an OP, replies for a comment|

\*For an OP, `body` is the concatenation of its title and its post content (if any). Comments have no title, so no concatenation is needed.

In [3]:
# Define the dictionary before the scraping process
reddit_dict = {'subr-def_ai': [],
               'is_op': [],
               'author': [],
               'post_id': [],
               'body': [],
               'upvotes': [],
               'num_comments': []}
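
Because the dictionary is later turned into a dataframe, every list must end up with the same number of entries; pandas raises a `ValueError` when a dict of lists has unequal lengths. The sketch below shows one possible sanity check. The helpers `column_lengths` and `is_rectangular` are hypothetical names, and the data is a toy example rather than scraped output.

```python
def column_lengths(data):
    """Return the number of entries in each list of a dict-of-lists."""
    return {key: len(values) for key, values in data.items()}


def is_rectangular(data):
    """True when every column holds the same number of entries."""
    return len(set(column_lengths(data).values())) <= 1


# Toy data: 'upvotes' is one entry short, as if an append had been skipped
example = {"is_op": [1, 0], "body": ["a title", "a comment"], "upvotes": [10]}
print(column_lengths(example))  # {'is_op': 2, 'body': 2, 'upvotes': 1}
print(is_rectangular(example))  # False
```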

To help populate the dictionary with the scraped data, we define the functions below.

In [4]:
# Append one post's (OP or comment) information to the dictionary.
# The first parameter is named `data` to avoid shadowing the built-in `dict`.
def dictapp(data, post, def_ai=True, op=False):
    if op:
        data['is_op'].append(1)
        # An OP's body is its title, followed by its text content if there is any
        if post.selftext:
            data['body'].append(post.title + ' ' + post.selftext)
        else:
            data['body'].append(post.title)
    else:
        data['is_op'].append(0)
        data['body'].append(post.body)
    data['author'].append(post.author)
    data['num_comments'].append(replycnt(post, op))
    data['subr-def_ai'].append(int(def_ai))
    data['upvotes'].append(post.score)
    data['post_id'].append(post.id)


# Count the direct replies to a post. This is used in dictapp() above.
def replycnt(comment, op):
    # A submission (OP) exposes its top-level comments as .comments; a comment exposes its replies as .replies
    reply_obj = comment.comments if op else comment.replies
    return sum(1 for _ in reply_obj)
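
To see what these helpers record without calling the Reddit API, they can be run on stand-in objects. The sketch below restates the two functions in compact form so that it runs on its own, and uses `SimpleNamespace` objects as hypothetical substitutes for PRAW submissions and comments. Only the attributes the functions read are provided, and all values are made up.

```python
from types import SimpleNamespace


def replycnt(post, op):
    # Submissions expose top-level comments as .comments, comments expose .replies
    return sum(1 for _ in (post.comments if op else post.replies))


def dictapp(data, post, def_ai=True, op=False):
    # Same logic as the cell above, restated so this block is self-contained
    data['is_op'].append(int(op))
    if op:
        body = (post.title + ' ' + post.selftext) if post.selftext else post.title
    else:
        body = post.body
    data['body'].append(body)
    data['author'].append(post.author)
    data['num_comments'].append(replycnt(post, op))
    data['subr-def_ai'].append(int(def_ai))
    data['upvotes'].append(post.score)
    data['post_id'].append(post.id)


# Stand-ins: one submission with one comment, which in turn has one reply
reply = SimpleNamespace(id='c2', author='user_b', body='A reply', score=3, replies=[])
comment = SimpleNamespace(id='c1', author='user_a', body='A comment', score=5, replies=[reply])
submission = SimpleNamespace(id='p1', author=None, title='A title', selftext='',
                             score=40, comments=[comment])

columns = ('subr-def_ai', 'is_op', 'author', 'post_id', 'body', 'upvotes', 'num_comments')
toy = {key: [] for key in columns}
dictapp(toy, submission, def_ai=True, op=True)
dictapp(toy, comment, def_ai=True)
dictapp(toy, reply, def_ai=True)

print(toy['body'])          # ['A title', 'A comment', 'A reply']
print(toy['num_comments'])  # [1, 1, 0]
```

Note that `num_comments` counts direct replies only: the submission reports 1 even though two comments sit beneath it in the thread.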

Further attributes of the `Submission` class can be found [here](https://praw.readthedocs.io/en/stable/code_overview/models/submission.html).
Further attributes of the `Subreddit` class can be found [here](https://praw.readthedocs.io/en/stable/code_overview/models/subreddit.html).

We start scraping *r/DefendingAIArt* and append the posts to `reddit_dict`.

In [5]:
# time_filter="all" selects the all-time top posts; with no limit argument, PRAW's default listing limit (100 submissions) applies.
# Passing time_filter as a keyword argument avoids PRAW's deprecation warning for positional arguments.
for submission in reddit.subreddit("DefendingAIArt").top(time_filter="all"):
    submission.comments.replace_more(limit=0)                   # Removes the "load more comments" placeholders without fetching them
    dictapp(reddit_dict, submission, def_ai=True, op=True)      # Appends the OP to the dictionary
    for comment in submission.comments.list():                  # Loops over every remaining comment in the submission
        dictapp(reddit_dict, comment, def_ai=True)              # Appends the comment to the dictionary
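
The inner loop relies on `submission.comments.list()` returning every comment in the thread, not only the top-level ones. The sketch below illustrates that flattening idea on a toy tree. The helpers `flatten` and `node` are hypothetical, and the breadth-first order shown is an assumption about the traversal rather than something the scraping code depends on.

```python
from collections import deque
from types import SimpleNamespace


def flatten(top_level):
    """Breadth-first flattening of a nested comment tree into one list."""
    flat, queue = [], deque(top_level)
    while queue:
        comment = queue.popleft()
        flat.append(comment)
        queue.extend(comment.replies)  # visit replies after the current level
    return flat


def node(name, *replies):
    # Hypothetical stand-in for a comment: a name plus a list of replies
    return SimpleNamespace(name=name, replies=list(replies))


tree = [node('a', node('a1', node('a1x')), node('a2')), node('b', node('b1'))]
print([c.name for c in flatten(tree)])  # ['a', 'b', 'a1', 'a2', 'b1', 'a1x']
```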


In [6]:
# Checking dataframe
pd.DataFrame(reddit_dict).head()

| |subr-def_ai|is_op|author|post_id|body|upvotes|num_comments|
|---|---|---|---|---|---|---|---|
|0|1|1||101n5dv|[TW: DEATH THREAT] And they say that "AI bros" are the ones harassing the artists?|499|9|
|1|1|0|Zinthaniel|j2plqsw|there's no rule in this sub requiring you to hide the tweet handle. So its kind of idiotic to do so, especially when the tweet is glorifying killing people who use AI.|30|1|
|2|1|0||j2oryjg|"Corpos telling modern artists to die"\nIT'S FREE AND OPEN SOURCE|55|2|
|3|1|0|chillaxinbball|j2rbhzy|Unfortunately there are a few idiots on Twitter that are being rude which is giving the antiai crowd a huge confirmation bias boner. The Anti ai crowd has a hard time separating individuals from the group and seeing that the *majority* of the hateful comments comes from them.|12|0|
|4|1|0|Trippy-Worlds|j2oyyyb|Why is the username crossed out? They need to be reported on Twitter and probably to the FBI. \n\nWould really like to see who all those likes are as well. Please tell us the Twitter ID. Suggesting violence is not permissible!|22|1|


Next, we scrape *r/ArtistHate* and append the posts to `reddit_dict`.

In [7]:
# Same procedure as above; time_filter is again passed as a keyword argument to avoid PRAW's deprecation warning.
for submission in reddit.subreddit("ArtistHate").top(time_filter="all"):
    submission.comments.replace_more(limit=0)
    dictapp(reddit_dict, submission, def_ai=False, op=True)      # Note that def_ai is set to False in this block as we are scraping r/ArtistHate instead
    for comment in submission.comments.list():
        dictapp(reddit_dict, comment, def_ai=False)


In [8]:
# Checking tail end of dataframe
pd.DataFrame(reddit_dict).tail()

| |subr-def_ai|is_op|author|post_id|body|upvotes|num_comments|
|---|---|---|---|---|---|---|---|
|7786|0|0||kgtcoti|[removed]|7|1|
|7787|0|0|KoumoriChinpo|kgqrhw0|Buying the projection machine wouldn't be lazy but I think using it to create pictures with would. But that's a weird unrealistic hypothetical that doesn't really help the argument you are trying to make. It's like arguing hypothetically if orange juice caused cancer we should stop drinking orange juice.|14|0|
|7788|0|0|KoumoriChinpo|kgu9wfh|Hm. Nah, it's still just you interpreting me saying "AI" the way you want it to pretend you are making some kind of point.|4|1|
|7789|0|0|KoumoriChinpo|kgu72a1|That's wonderful. Good for you.|3|0|
|7790|0|0|GrandFrequency|kgua1yy|Yeah, it's definitely not going over your head at all.|1|0|


The dictionary is then converted to a dataframe, `reddit_df`, and exported in .csv format.

In [9]:
# Creating dataframe and exporting to csv format
import os  # to work with files/directories

reddit_df = pd.DataFrame(reddit_dict)

# Define the output folder path:
output_folder_path = '../project_3/output'

# Create the output folder if it does not already exist:
os.makedirs(output_folder_path, exist_ok=True)

# Save CSV file within the 'output' folder.
# The dataframe index is written as an unnamed first column (pandas' default, index=True):
reddit_df.to_csv(os.path.join(output_folder_path, 'reddit.csv'))
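
Since the index is written to the file as an unnamed first column, it reappears as `Unnamed: 0` when the CSV is read back with default settings. The sketch below shows that round trip on a toy dataframe written to a temporary folder. The names `toy_df`, `naive` and `restored` are hypothetical, and passing `index_col=0` is one possible way to restore the index, not necessarily what later notebooks in this project do.

```python
import os
import tempfile

import pandas as pd

# Toy dataframe standing in for reddit_df (not scraped data)
toy_df = pd.DataFrame({'is_op': [1, 0], 'body': ['A title', 'A comment']})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'toy.csv')
    toy_df.to_csv(csv_path)                        # index written as an unnamed first column

    naive = pd.read_csv(csv_path)                  # index comes back as a regular column
    restored = pd.read_csv(csv_path, index_col=0)  # first column is used as the index again

print(list(naive.columns))     # ['Unnamed: 0', 'is_op', 'body']
print(list(restored.columns))  # ['is_op', 'body']
```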