# COGS 108 - Data Checkpoint

# Names

- Mariam Bachar (A16217374)
- Alexandra Hernandez (A16730685)
- Brian Kwon (A16306826)
- Andrew Uhm (A16729684)
- Ethan Wang (A17229824)

<a id='research_question'></a>
# Research Question

*Do certain keywords as identified by CLIP correlate with the popularity (as measured by the equivalent of “likes”) that artwork receives on social media?*

# Hypothesis

We predict that digital artwork whose CLIP-generated captions contain certain keywords (for example, “painting” vs. “watercolor” vs. “digital”) will show a positive correlation with popularity on social media. From our own observation of what is popular, certain features tend to repeat across posts, which leads us to believe a correlation will be found.
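One way such a correlation could be checked is sketched below. This is only an illustration on made-up captions and favorite counts; the column names `Caption` and `Favorites` and the choice of a Pearson correlation against a 0/1 keyword flag are assumptions for the example, not a finalized analysis plan.

```python
import pandas as pd

# Made-up captions and favorite counts standing in for real data.
df = pd.DataFrame({
    "Caption": [
        "a watercolor painting of a lake",
        "a digital painting of a dragon",
        "a pencil sketch of a cat",
        "a digital painting of a castle",
    ],
    "Favorites": [40, 900, 25, 700],
})

# 1 if the caption mentions the keyword, 0 otherwise.
keyword = "digital"
df["has_keyword"] = (
    df["Caption"].str.contains(keyword, case=False, regex=False).astype(int)
)

# Pearson correlation between a binary flag and a numeric column
# (this is the point-biserial correlation).
r = df["has_keyword"].corr(df["Favorites"])
print(f"{keyword}: r = {r:.3f}")
```

A positive `r` would mean that captions containing the keyword tend to accompany higher favorite counts in the sample; it would not by itself show that the keyword causes popularity.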

# Data

Our ideal dataset would be a representative sample of images spanning most genres of art. The variables would be the image itself, a caption for the image, and its associated “likes”. We would want roughly 3000 images across that representative sample, each paired with its “like” count and the original artist’s follower count. Files can be anonymized with integer IDs. We would then process the images with CLIP to generate captions and store each caption alongside its image. Ideally, all images would be the same size. The ideal dataset would also include publication dates so that we can make comparisons to past trends. We would define popularity as an image’s number of likes in proportion to the maximum number of likes in our dataset, which treats popularity as a continuous value (suited to regression) rather than a binary label.
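The popularity definition above can be sketched in a few lines. The column names `likes` and `popularity` and the numbers are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical like counts for four anonymized images (illustrative values only).
df = pd.DataFrame({"image_id": [0, 1, 2, 3], "likes": [50, 200, 125, 0]})

# Continuous popularity score: each image's likes divided by the dataset maximum,
# so the most-liked image scores 1.0 and every other image falls in [0, 1].
df["popularity"] = df["likes"] / df["likes"].max()

print(df)
```

Because the score depends on the single most-liked image, one extreme outlier would compress every other score toward zero, which is worth keeping in mind when interpreting it.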

A real dataset we could use is available through DeviantArt’s API. We acknowledge that this data differs from our ideal. For one, the images are not perfectly square, so we will crop and resize them to a predetermined size (e.g. 768x768) in order to normalize them. DeviantArt also likely has its own culture, which means our findings may not be representative of other social media, and by taking images from its home page we may not see older posts. Furthermore, the fields that we require are optional, so we would have to filter down to images that have all of those fields filled out.
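The cropping step could be sketched as follows. A centered square crop is an assumption here (the text above does not say which region to keep), and `center_crop_box` is a hypothetical helper name.

```python
def center_crop_box(width, height):
    """Return the (left, upper, right, lower) box of the largest centered square."""
    side = min(width, height)
    left = (width - side) // 2
    upper = (height - side) // 2
    return (left, upper, left + side, upper + side)


# A hypothetical 1024x768 landscape image loses 128 pixels on each side.
box = center_crop_box(1024, 768)
print(box)  # (128, 0, 896, 768)
```

With Pillow, a box in this form can be passed to `Image.crop`, and the result can then be scaled with `Image.resize((768, 768))`.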

# Ethics and Privacy

There are a number of ethical concerns regarding this research question that we must be mindful of as we analyze data. The most obvious issue is that, by identifying some artwork as popular, we implicitly tag the rest as unpopular. However, this should not be a serious issue because we do not present identifying information about specific pieces of artwork or individual artists, so it should not be possible to label a specific artwork or artist as “unpopular”.

In terms of normalization, one possible solution is to take the ratio between the number of likes on an artwork and the number of followers its artist has. This accounts for the disparity between larger and smaller artists, since artists with a larger audience naturally receive more likes. Additionally, it is entirely possible that our analysis excludes the cultural influences of minority groups. Since people residing in developed countries have more leisure time and resources (such as drawing software or drawing tablets), it is plausible that most digital art posted to social media comes from developed countries. Thus, the work we analyze may disproportionately represent the artwork and cultural trends of majority groups in developed countries, which tend to be similar across those countries, while glossing over minority groups.
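The like-to-follower ratio proposed above could be computed roughly as follows. The `Favorites` and `Author Watchers` column names match the fields collected later in this notebook; the numbers are made up, and treating zero watchers as missing is an assumption of this sketch.

```python
import numpy as np
import pandas as pd

# Illustrative rows; the values are made up.
df = pd.DataFrame({
    "Favorites": [500, 500, 30],
    "Author Watchers": [10000, 250, 0],
})

# Treat zero watchers as missing so the ratio is NaN instead of infinity.
watchers = df["Author Watchers"].replace(0, np.nan)
df["Favorites per Watcher"] = df["Favorites"] / watchers

print(df)
```

In this toy example the two artworks with 500 favorites end up with very different ratios, which is exactly the audience-size effect the normalization is meant to expose.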

Finally, because the artworks are posted on a public forum, the artists have consented to allowing their art to be analyzed. The DeviantArt Terms of Service state that you cannot “reproduce, distribute, publicly display or perform, or prepare derivative works”, which does not cover the use of the artworks in an analytic survey. Although there is no clear-cut solution for this, it serves us well to keep this fact in mind when drawing conclusions from our analyses.

# Dataset(s)

- Dataset Name: deviation_info
- Link to the dataset: https://github.com/COGS108/Group_Sp23_Project_Group_3/blob/master/deviation_info.csv
- Number of observations: 1188

This dataset is a set of deviations (that is, images from DeviantArt). Each row contains a deviation ID and metadata about the deviation itself as well as its author. It does not include the actual images.

- Dataset Name: caption_info
- Link to the dataset: https://github.com/COGS108/Group_Sp23_Project_Group_3/blob/master/caption_info.csv
- Number of observations: 1188

This dataset is a set of captions generated from the images, each corresponding to a deviation ID. The captions were produced using the CLIP interrogator in AUTOMATIC1111's Stable Diffusion web UI.

- Dataset Name: images
- Link to the dataset: https://github.com/COGS108/Group_Sp23_Project_Group_3/tree/master/images
- Number of observations: 1198

This dataset is a directory containing the actual images in PNG format, each named after its corresponding deviation ID. It contains 10 extra images that are not found in our other datasets.


All of the datasets were built by scraping and use deviation IDs as their identifiers. Because of this, we can easily join them on those deviation IDs if necessary.
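A minimal sketch of such a join, together with a check for images that have no metadata row, is shown below. The tiny tables and IDs are made up for illustration, and the variable names `dev`, `cap`, and `image_ids` are hypothetical.

```python
import pandas as pd

# Tiny stand-ins for deviation_info.csv and caption_info.csv (values are made up).
dev = pd.DataFrame({
    "Deviation ID": ["A1", "B2", "C3"],
    "Favorites": [120, 45, 300],
})
cap = pd.DataFrame({
    "Deviation ID": ["A1", "C3"],
    "Caption": ["a watercolor painting of a fox", "a digital painting of a city"],
})

# Inner join keeps only deviations present in both tables.
merged = dev.merge(cap, on="Deviation ID", how="inner")

# Hypothetical image file stems; in practice these could come from
# Path("images").glob("*.png").
image_ids = {"A1", "B2", "C3", "Z9"}
extra_images = image_ids - set(dev["Deviation ID"])

print(merged)
print(extra_images)
```

An inner join is used so that only deviations with both metadata and a caption survive; the set difference is one simple way to list image files that have no matching row.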

# Setup

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

import requests
import urllib.request
from bs4 import BeautifulSoup
import deviantart

import time
from datetime import datetime
from pathlib import Path

# DeviantArt API: https://www.deviantart.com/developers/http/v1/20210526
# Open-Source Python wrapper for DA API: https://github.com/neighbordog/deviantart

In [None]:
# creates a pd df from the CSV file if it exists, else creates a blank df.
csv_file = 'deviation_info.csv'
try:
    deviation_df = pd.read_csv(csv_file)
except FileNotFoundError:
    deviation_df = pd.DataFrame()

In [None]:
# Separate API keys in case of requesting issues.
andrew_DA_API = deviantart.Api("25542", "61a232f232df245f2560a3cb72ecc535")
ethan_DA_API = deviantart.Api("25492", "06217cf59e73b401dc0a14d00857a793")

# the access token is available as <api>.access_token (e.g. andrew_DA_API.access_token).

In [None]:
# README: use your own token
cur_access = andrew_DA_API

In [None]:
# number of batches to fetch; each batch requests 10 images, so we fetch n * 10 images in total.
n = 120

In [None]:
for i in range(n):
    print('on iteration', i, '* 10')
    # grab 10 images at a time. DeviantArt calls their posts "deviations".
    deviations = cur_access.browse(endpoint='popular', timerange='alltime', offset=i*10, limit=10)['results']
    
    for deviation in deviations:
        # saves image to file by deviation id using url for local CLIP analysis
        if deviation.content is None:
            print('null deviation on iteration', i)
            continue
        url = deviation.content['src']
        dId = deviation.deviationid
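        filename = f"images/{dId}.png"
        path = Path(filename)
        # make sure the images/ directory exists, then download only if we don't already have the file.
        path.parent.mkdir(parents=True, exist_ok=True)
        if not path.is_file():
            urllib.request.urlretrieve(url, filename)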
        
        # these serve as examples of how to make a request when the python wrapper doesn't work
        username = deviation.author.username
        request = f"https://www.deviantart.com/api/v1/oauth2/user/profile/{username}?access_token={cur_access.access_token}&expand=user.stats"
        response = requests.get(request)
        authorData = response.json()
        authorWatchers = authorData['user']['stats']['watchers']
        authorPageViews = authorData['stats']['profile_pageviews'] # deemed unnecessary?
        authorDeviations = authorData['stats']['user_deviations']
        
        # pass the deviation id explicitly rather than relying on the Deviation object's string form.
        request = f"https://www.deviantart.com/api/v1/oauth2/deviation/metadata?access_token={cur_access.access_token}&deviationids={dId}&ext_stats=True"
        response = requests.get(request)
        metaData = response.json()
        views = metaData['metadata'][0]['stats']['views']
        
        # gathering relevant data, turning it into a new observation.
        row = {
            'Deviation ID': deviation.deviationid,
            'Title': deviation.title,
            'Author': username,
            'Views': views,
            'Favorites': deviation.stats['favourites'],
            'Comments': deviation.stats['comments'],
            'URL Link': deviation.url,
            'Date Posted': datetime.fromtimestamp(int(deviation.published_time)),
            'Height': deviation.content['height'],
            'Width': deviation.content['width'],
            'File Size': deviation.content['filesize'],
            'Author Watchers': authorWatchers,
            'Author Page Views': authorPageViews,
            'Author Deviations': authorDeviations
        }
        row_df = pd.DataFrame(row, index=[0])
        deviation_df = pd.concat([deviation_df, row_df], ignore_index=True)
        
    # when running on the most popular posts, we will likely get duplicates. remove them.
    deviation_df = deviation_df.drop_duplicates(subset='Deviation ID')
    
    # pause 15 seconds between batches to adhere to DeviantArt's fetch rate (no need to wait after the last batch).
    if i < n - 1:
        time.sleep(15)

In [None]:
# put our dataframe into a CSV file so scraping can be collaborative.
# to_csv overwrites but should be OK since we are reading from the CSV to populate the dataframe anyways.
deviation_df.to_csv('deviation_info.csv', index=False)

In [None]:
deviation_df

In [None]:
# alternative clip interrogator.
'''
# https://github.com/pharmapsychotic/clip-interrogator

import os
import torch
from PIL import Image
from clip_interrogator import Config, Interrogator

# setting up dataframe for captions.
csv_file = 'caption_info.csv'
try:
    caption_df = pd.read_csv(csv_file)
except FileNotFoundError:
    caption_df = pd.DataFrame()

# setting up interrogator
ci = Interrogator(Config(clip_model_name="RN50/openai"))

subset_df = deviation_df[0:3]

for deviation in subset_df.values:
    dId = deviation[0]
    image = Image.open(f"images/{dId}.png").convert('RGB')
    
    caption = ci.interrogate(image)
    print(dId)
    print(caption)
'''

In [None]:
import requests
import json
from PIL import Image
import base64
import cv2

In [None]:
# setting up dataframe for captions.
csv_file = 'caption_info.csv'
try:
    caption_df = pd.read_csv(csv_file)
except FileNotFoundError:
    caption_df = pd.DataFrame()

# this is so we can start farther down if we get an error.
j = 700
subset_df = deviation_df[j:]
# save every n captions.
n = 10

i = 0
for deviation in subset_df.values:
    dId = deviation[0]
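    # convert to RGB so palette, grayscale, and RGBA images all become 3-channel arrays.
    image = Image.open(f"images/{dId}.png").convert('RGB')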
    
    # https://www.reddit.com/r/StableDiffusion/comments/11f938k/using_automatic1111_apis_for_clip/
    # https://stackoverflow.com/questions/52494592/wrong-colours-with-cv2-imdecode-python-opencv
    # below converts image into string to pass through API.
    # --------------------------------------------------------------
    cv2_image = np.array(image)
    # PIL arrays are in RGB order, while cv2.imencode expects BGR, so swap the channels.
    cv2_image = cv2.cvtColor(cv2_image, cv2.COLOR_RGB2BGR)
    _, buffer = cv2.imencode('.png', cv2_image)
    input_image = base64.b64encode(buffer).decode('utf-8')

    url = "http://127.0.0.1:7860/sdapi/v1/interrogate"
    headers = {'Content-Type': 'application/json'}

    payload = {
        "image": input_image,
        "model": "clip"
    }
    response = requests.post(url, headers=headers, data=json.dumps(payload))
    if response.status_code == 200:
        caption = response.json()['caption']
    else:
        caption = "NA"
    # ----------------------------------------------------------- after this, caption is the caption.
    
    # add info to caption dataframe.
    row = {
        'Deviation ID': dId,
        'Caption': caption
    }
    row_df = pd.DataFrame(row, index=[0])
    caption_df = pd.concat([caption_df, row_df], ignore_index=True)
    
    i+=1
    if i >= n:
        # write it out in case of crash, since it'll take a long time.
        caption_df.to_csv('caption_info.csv', index=False)
        i=0
    j+=1
    print(f"Progress report: {j}", end='\r')

# then save at the end regardless.
caption_df = caption_df.drop_duplicates(subset='Deviation ID')
caption_df.to_csv('caption_info.csv', index=False)
print("\nDone")

In [None]:
caption_df = pd.read_csv('caption_info.csv')
caption_df

# Data Cleaning

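We load the two CSV files produced during data gathering, drop rows with missing values, remove captions that failed to generate (recorded as "NA"), and drop duplicate deviation IDs from both tables.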

In [None]:
# read in the .csv files which should have been created from the data gathering.
deviation_df = pd.read_csv('deviation_info.csv')
caption_df = pd.read_csv('caption_info.csv')

# drop NA values if they exist in our deviation_df. Fortunately, all of our images have good values.
print(deviation_df.shape)
deviation_df = deviation_df.dropna()
print(deviation_df.shape)

# when a caption fails to generate, it is recorded as "NA". Remove these rows.
# pd.read_csv parses the string "NA" as NaN by default, so dropna() also catches them.
print(caption_df.shape)
caption_df = caption_df[caption_df['Caption'] != "NA"]
caption_df = caption_df.dropna()
print(caption_df.shape)

# drop duplicates of both, just in case, although in our setup we already do this.
deviation_df = deviation_df.drop_duplicates(subset='Deviation ID')
caption_df = caption_df.drop_duplicates(subset='Deviation ID')

# for the most part, our data is already clean.