<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 5px; height: 50px">

# Project 3: Web APIs & NLP

### Project Title: Generative AI and Art - understanding and predicting chatter from online communities

**DSI-41 Group 2**: Muhammad Faaiz Khan, Lionel Foo, Gabriel Tan

## Part 1: Data Scraping


### 1.1 Importing libraries
___

In [1]:
# Importing libraries for data scraping
import pandas as pd
import praw
from praw.models import MoreComments

# this setting widens how many characters pandas will display in a column:
pd.options.display.max_colwidth = 400

### 1.2 PRAW API
___

We will use [PRAW](https://praw.readthedocs.io/en/stable/), the Python Reddit API Wrapper, to scrape posts and comments.

In [2]:
# client_id and client_secret are the unique identifiers of a personal application registered on Reddit.
# Note: you will have to fill in these keys with your own set. Refer to the link below:
# https://praw.readthedocs.io/en/latest/getting_started/authentication.html
reddit = praw.Reddit(user_agent="PRAW", client_id="",
                     client_secret="")
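
To keep the keys out of the notebook itself, one option is to read them from environment variables. The snippet below is only a sketch: the function name `load_reddit_credentials` and the variable names `REDDIT_CLIENT_ID`, `REDDIT_CLIENT_SECRET` and `REDDIT_USER_AGENT` are assumptions made for illustration and are not part of the project.

```python
import os


def load_reddit_credentials(env=os.environ):
    """Collect keyword arguments for praw.Reddit from a mapping of settings.

    The variable names are hypothetical; rename them to suit your own setup.
    """
    return {
        "client_id": env.get("REDDIT_CLIENT_ID", ""),
        "client_secret": env.get("REDDIT_CLIENT_SECRET", ""),
        "user_agent": env.get("REDDIT_USER_AGENT", "PRAW"),
    }


# Demonstration with an explicit mapping instead of the real environment
creds = load_reddit_credentials(
    {"REDDIT_CLIENT_ID": "abc", "REDDIT_CLIENT_SECRET": "xyz"}
)
print(sorted(creds))

# With praw installed, the dictionary can be unpacked into the constructor:
# reddit = praw.Reddit(**load_reddit_credentials())
```

Passing the mapping as an argument makes the helper easy to check without touching the real environment.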

The scraping process is summarised below:
1. Define a dictionary where the keys are our column names, with empty lists as the values.
2. Loop through the top posts of all time in each subreddit (PRAW returns at most 100 entries per listing by default), appending each post and its comments to our dictionary.
3. Convert the dictionary to dataframe format and export.
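
The three steps above can be sketched on toy data, without calling Reddit at all. The column subset and the `toy_posts` records below are made up purely to illustrate the dictionary-of-lists pattern.

```python
import pandas as pd

# Step 1: column names as keys, empty lists as values
records = {"is_op": [], "body": [], "upvotes": []}

# Hypothetical stand-ins for scraped posts
toy_posts = [
    {"is_op": 1, "body": "Thread title", "upvotes": 10},
    {"is_op": 0, "body": "A reply", "upvotes": 3},
]

# Step 2: append one value per column for every post
for post in toy_posts:
    for column in records:
        records[column].append(post[column])

# Step 3: convert to a dataframe and export (here to a string rather than a file)
toy_df = pd.DataFrame(records)
csv_text = toy_df.to_csv(index=False)
print(toy_df.shape)
```

Because every loop iteration appends exactly one value to each list, the lists stay the same length, which `pd.DataFrame` requires.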

The data dictionary for our scraped dataframe is defined below:


|Feature|Type|Description|
|---|---|---|
|`subr-def_ai`|int|Binary flag indicating whether the post is from *r/DefendingAIArt* (1) or *r/ArtistHate* (0)|
|`is_op`|int|Binary flag indicating whether the post is the OP* (1) or a comment (0)|
|`author`|str|Username of the person making the post|
|`post_id`|str|Unique identifier string for each post|
|`body`|str|Content of the post**|
|`upvotes`|int|Number of upvotes for the post|
|`num_comments`|int|Number of direct comments/responses to the post| 

*OP refers to the original post for each thread.

**For the OP, `body` will be a concatenation of both its title and its post content (if any). Comments have no title and thus do not require this concatenation.

In [None]:
# First defining the dictionary before the scraping process
reddit_dict = {'subr-def_ai':[],
                'is_op': [],
                'author': [],
                'post_id': [],
                'body': [],
                'upvotes': [],
                'num_comments': []}

To help populate our dictionary with the scraped data, we define the functions below.

In [5]:
# Define function to append one post (OP or comment) to the dictionary
def dictapp(data, post, def_ai=True, op=False):
    if op:
        data['is_op'].append(1)
        # A submission has a title and an optional selftext; join them into one body
        if post.selftext:
            data['body'].append(post.title + ' ' + post.selftext)
        else:
            data['body'].append(post.title)
    else:
        data['is_op'].append(0)
        data['body'].append(post.body)
    data['author'].append(post.author)
    data['num_comments'].append(replycnt(post, op))
    data['subr-def_ai'].append(int(def_ai))
    data['upvotes'].append(post.score)
    data['post_id'].append(post.id)


# Define function to count direct replies to a post. This is used in dictapp() above.
def replycnt(post, op):
    # A submission exposes its top-level comments as .comments; a comment exposes its replies as .replies
    reply_obj = post.comments if op else post.replies
    return sum(1 for _ in reply_obj)
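
The body-building and reply-counting logic can be checked offline with stand-in objects. The sketch below is illustrative only: `build_body`, `count_replies` and the `fake_*` objects are hypothetical names that mimic just the attributes used above (`title`, `selftext`, `body`, `comments`, `replies`), not real PRAW objects.

```python
from types import SimpleNamespace


def build_body(post, op):
    """Return title plus selftext for an OP, or the comment text otherwise."""
    if op:
        return f"{post.title} {post.selftext}" if post.selftext else post.title
    return post.body


def count_replies(post, op):
    """Count direct children: top-level comments for an OP, replies for a comment."""
    children = post.comments if op else post.replies
    return sum(1 for _ in children)


# Stand-ins for a comment without replies and two kinds of original post
fake_comment = SimpleNamespace(body="Nice work", replies=[])
fake_link_post = SimpleNamespace(title="My title", selftext="", comments=[fake_comment])
fake_text_post = SimpleNamespace(
    title="My title", selftext="Some text", comments=[fake_comment, fake_comment]
)

print(build_body(fake_link_post, op=True))
print(build_body(fake_text_post, op=True))
print(count_replies(fake_text_post, op=True), count_replies(fake_comment, op=False))
```

Plain lists stand in for the comment collections here, since the counting logic only needs something iterable.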

Further attributes of the `Submission` class can be found [here](https://praw.readthedocs.io/en/stable/code_overview/models/submission.html), and those of the `Subreddit` class [here](https://praw.readthedocs.io/en/stable/code_overview/models/subreddit.html).

We start by scraping *r/DefendingAIArt* and appending the posts to `reddit_dict`.

In [None]:
for submission in reddit.subreddit("DefendingAIArt").top(time_filter="all"):   # Loop through the top threads of all time in the subreddit
    submission.comments.replace_more(limit=0)                                  # Drops the "load more comments" placeholders instead of expanding them
    dictapp(reddit_dict, submission, def_ai=True, op=True)                     # Appends the OP to the dictionary
    for comment in submission.comments.list():                                 # Loop through all loaded comments in the thread
        dictapp(reddit_dict, comment, def_ai=True)                             # Appends each comment to the dictionary


In [None]:
# Checking dataframe
pd.DataFrame(reddit_dict).head()

| |subr-def_ai|is_op|author|post_id|body|upvotes|num_comments|
|---|---|---|---|---|---|---|---|
|0|1|1||101n5dv|[TW: DEATH THREAT] And they say that "AI bros" are the ones harassing the artists?|498|9|
|1|1|0|Zinthaniel|j2plqsw|there's no rule in this sub requiring you to hide the tweet handle. So its kind of idiotic to do so, especially when the tweet is glorifying killing people who use AI.|30|1|
|2|1|0||j2oryjg|"Corpos telling modern artists to die"\nIT'S FREE AND OPEN SOURCE|56|2|
|3|1|0|chillaxinbball|j2rbhzy|Unfortunately there are a few idiots on Twitter that are being rude which is giving the antiai crowd a huge confirmation bias boner. The Anti ai crowd has a hard time separating individuals from the group and seeing that the *majority* of the hateful comments comes from them.|12|0|
|4|1|0|Trippy-Worlds|j2oyyyb|Why is the username crossed out? They need to be reported on Twitter and probably to the FBI. \n\nWould really like to see who all those likes are as well. Please tell us the Twitter ID. Suggesting violence is not permissible!|23|1|


Next, we repeat the process above to scrape *r/ArtistHate* and append the posts to `reddit_dict`.

In [None]:
for submission in reddit.subreddit("ArtistHate").top(time_filter="all"):
    submission.comments.replace_more(limit=0)
    dictapp(reddit_dict, submission, def_ai=False, op=True)      # Note that def_ai is set to False in this block as we are scraping r/ArtistHate instead
    for comment in submission.comments.list():
        dictapp(reddit_dict, comment, def_ai=False)


In [None]:
# Checking tail end of dataframe
pd.DataFrame(reddit_dict).tail()

| |subr-def_ai|is_op|author|post_id|body|upvotes|num_comments|
|---|---|---|---|---|---|---|---|
|7784|0|0|Captain_Pumpkinhead|k45jc10|Yeah, that makes sense. If you're advertising for a drawing tool like a pen tablet/display, you should definitely be using hand drawn art. Preferably drawn using the advertised device.|20|0|
|7785|0|0|Alkaia1|k471c4e|I am really glad that they did the right thing and terminated their collaboration.|10|0|
|7786|0|0|dontthrowmeaway2023|k48x2bv|phew that´s good to hear :)|4|0|
|7787|0|0|Shyraku|k49jjxg|I will remember to look at XXPen's product when my Wacom finally die, I think they deserve that|2|0|
|7788|0|0|WonderfulWanderer777|k45dhqx|Yeah...|11|0|


The dictionary is then converted to a dataframe `reddit_df`, then exported to .csv format.

In [None]:
# Creating dataframe and exporting to csv format
reddit_df = pd.DataFrame(reddit_dict)
# reddit_df.to_csv('data/reddit_df.csv', index=False)

# Summarising the dataframe
reddit_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7789 entries, 0 to 7788
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   subr-def_ai   7789 non-null   int64 
 1   is_op         7789 non-null   int64 
 2   author        7316 non-null   object
 3   post_id       7789 non-null   object
 4   body          7789 non-null   object
 5   upvotes       7789 non-null   int64 
 6   num_comments  7789 non-null   int64 
dtypes: int64(4), object(3)
memory usage: 426.1+ KB
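
The summary above shows 7316 non-null values in `author` out of 7789 rows, so 473 rows have no author. A plausible cause, stated here as an assumption rather than something verified in this notebook, is that PRAW returns `None` as the author of posts from deleted accounts. The sketch below counts and labels such gaps on toy data; the `"[deleted]"` placeholder is a hypothetical choice.

```python
import pandas as pd

# Toy stand-in for the scraped dataframe, with two missing authors
toy_df = pd.DataFrame({
    "author": ["alice", None, "bob", None],
    "body": ["first", "second", "third", "fourth"],
})

# Count rows without an author
missing_authors = int(toy_df["author"].isna().sum())
print(missing_authors)

# Replace the gaps with an explicit placeholder (hypothetical label)
toy_df["author"] = toy_df["author"].fillna("[deleted]")
print(toy_df["author"].tolist())
```

Whether to label, drop or ignore these rows depends on whether `author` is used at all in later modelling.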


#### **(DEPRECATED)** Scraping test data from *r/aiwars*
___
Our initial intention was to run the prediction model on scraped data from a separate third subreddit, *r/aiwars*. Instead, we perform live scraping of *r/aiwars* and real-time prediction using Streamlit. Please refer to the Python scripts within the streamlit_widget folder.

In [3]:
# submission = reddit.submission(url='https://www.reddit.com/r/aiwars/comments/17slemz/which_one_is_it_antiai/')  

In [7]:
# # Resetting reddit_dict for scraping r/aiwars
# reddit_dict = {'subr-def_ai':[],
#                 'is_op': [],
#                 'author': [],
#                 'post_id': [],
#                 'body': [],
#                 'upvotes': [],
#                 'num_comments': []}

In [8]:
# submission.comments.replace_more(limit=0)         # Ignores elements that expand the comments on the page
# for comment in submission.comments.list():
#     dictapp(reddit_dict, comment)                 # For loop to append all comments in the thread

In [15]:
# # Creating dataframe and exporting to csv format
# aiwars_df = pd.DataFrame(reddit_dict)
# aiwars_df.to_csv('../data/aiwars_df.csv', index=False)