# COGS 108 - Data Checkpoint

# Names

- Mariam Bachar (A16217374)
- Alexandra Hernandez (A16730685)
- Brian Kwon (A16306826)
- Andrew Uhm (A16729684)
- Ethan Wang (A17229824)

<a id='research_question'></a>
# Research Question

*Do certain keywords as identified by CLIP correlate with the popularity (as measured by the equivalent of “likes”) that artwork receives on social media?*

# Hypothesis

We predict that digital artwork that CLIP tags with certain keywords (for example, “painting” vs. “watercolor” vs. “digital”) will be positively correlated with popularity on social media. As humans observing what is popular, we notice that certain features tend to recur across posts, which leads us to believe a correlation will be found.
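
One way to make this hypothesis testable is to correlate a 0/1 indicator for each keyword with the like count (a point-biserial correlation, which is Pearson's r with one binary variable). The sketch below is only illustrative: the function names, the one-label-per-image assumption, and the toy numbers are hypothetical and are not results from our data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def keyword_correlation(labels, favorites, keyword):
    """Correlate 'image carries this keyword' (1/0) with its favorite count."""
    indicator = [1.0 if label == keyword else 0.0 for label in labels]
    return pearson_r(indicator, favorites)


# Hypothetical toy data: one CLIP label per image and made-up favorite counts.
labels = ["digital", "watercolor", "digital", "painting", "watercolor", "digital"]
favorites = [900, 300, 1100, 500, 250, 1000]

r_digital = keyword_correlation(labels, favorites, "digital")
print(f"r(digital, favorites) = {r_digital:.2f}")
```

A positive value would mean images with that keyword tend to receive more favorites in the sample; it would not by itself show that the keyword causes the popularity.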

# Data

Our ideal dataset would be a representative sample of images spanning most genres of art. Our variables would be the image, the caption of the image, and its associated “likes”. We would want roughly 3,000 images across that representative sample, each with its “like” count and the original artist’s follower count. Files can be anonymized with integer IDs. From there, we would process the images with CLIP to extract captions and store each caption with its image’s data point. Ideally, all images would be the same size. The ideal dataset would also include publication dates so that we can make comparisons to past trends. We would define popularity as the number of likes in proportion to the maximum number of likes in our dataset, which treats it as a continuous (regression) target rather than a binary label.
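
The popularity definition above can be sketched as a simple rescaling. This is a minimal illustration, assuming like counts are available as a list of non-negative numbers; the function name `relative_popularity` is hypothetical.

```python
def relative_popularity(like_counts):
    """Scale each like count by the largest like count in the dataset.

    Returns values in [0, 1]; the most-liked image scores 1.0.
    """
    if not like_counts:
        return []
    top = max(like_counts)
    if top == 0:
        # No image has any likes, so every score is zero.
        return [0.0 for _ in like_counts]
    return [count / top for count in like_counts]


# Favorite counts of the first three rows in our scraped table.
scores = relative_popularity([1851, 1167, 889])
print(scores)
```

Because the score depends on the maximum in the sample, it changes whenever a more popular image is added, so scores from different scraping runs are comparable only after recomputing them on the combined data.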

A real dataset we could use is available through DeviantArt’s API. We acknowledge that this data differs from our ideal. For one, the images are not perfectly square, so we will crop and resize them to a predetermined size (e.g. 768x768) to normalize them. DeviantArt also likely has its own culture, which means our findings may not be representative of other social media, and by taking images from its home page we may not be seeing older posts. Furthermore, the fields that we require are optional in the API, so we would have to filter to images that have all of the required data fields filled out.
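
The cropping step could look roughly like the following. This sketch only computes a centered square crop box; center cropping is our assumption (the plan above does not fix which region to keep), and the function name is hypothetical. The resulting box is in the (left, upper, right, lower) order that Pillow's `Image.crop` expects, after which `Image.resize((768, 768))` would bring the image to the target size.

```python
def center_crop_box(width, height):
    """Return (left, upper, right, lower) for the largest centered square."""
    side = min(width, height)
    left = (width - side) // 2
    upper = (height - side) // 2
    return (left, upper, left + side, upper + side)


# Width and height of the first two images in our scraped table.
print(center_crop_box(1280, 1373))
print(center_crop_box(900, 652))
```

Center cropping can cut off subjects near the edges of very wide or very tall images, so padding to a square is an alternative worth comparing.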

# Ethics and Privacy

There are a number of ethical concerns regarding this research question that we must be mindful of as we analyze data. The most obvious issue is that we are implicitly tagging artwork as unpopular whenever we do not identify it as popular. However, this should not be a serious issue because we are not presenting identifying information about specific pieces of artwork or individual artists, so it should not be possible to label a specific artwork or artist as “unpopular”.

In terms of normalization, a possible solution would be to take the ratio between the number of likes on an artwork and the number of followers its artist has. This accounts for the disparity between larger and smaller artists, since more popular artists get more likes simply because they have a larger audience.
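
As a minimal sketch of that ratio, assuming DeviantArt "favorites" stand in for likes and "watchers" for followers (the function name and the handling of zero watchers are our own choices):

```python
def favorites_per_watcher(favorites, watchers):
    """Favorites divided by the artist's watcher count.

    Returns None when the artist has no watchers, since the ratio is
    undefined in that case.
    """
    if watchers <= 0:
        return None
    return favorites / watchers


# Two rows from our scraped table: a large account and a small one.
large_account = favorites_per_watcher(1851, 76682)
small_account = favorites_per_watcher(713, 1237)
print(large_account, small_account)
```

In this pair the smaller account has the higher ratio even though it has fewer raw favorites, which is the disparity the normalization is meant to correct.
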

Additionally, our analysis may exclude the cultural influences of minority groups. Since those residing in developed countries have more leisure time and resources (such as drawing software or drawing tablets), it is plausible that most digital art posted to social media comes from developed countries. Thus, the work we analyze may disproportionately represent the artwork and cultural trends of majority groups in developed countries, which tend to be similar across those countries, while glossing over minority groups.

Finally, because the artworks are posted on a public forum, the artists have consented to allowing their art to be analyzed. The DeviantArt Terms of Service state that users cannot “reproduce, distribute, publicly display or perform, or prepare derivative works”; these restrictions do not cover the use of the artworks for an analytic survey.
Although there is no clear-cut solution for this, it serves us well to keep this fact in mind when drawing conclusions upon our analyses.

# Dataset(s)

*Fill in your dataset information here*

(Copy this information for each dataset)
- Dataset Name:
- Link to the dataset:
- Number of observations:

1-2 sentences describing each dataset. 

If you plan to use multiple datasets, add 1-2 sentences about how you plan to combine these datasets.

# Setup

In [25]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

import requests
import urllib.request  # urlretrieve lives in the urllib.request submodule
from bs4 import BeautifulSoup
import deviantart

import time
from datetime import datetime
from pathlib import Path

# DeviantArt API: https://www.deviantart.com/developers/http/v1/20210526
# Open-Source Python wrapper for DA API: https://github.com/neighbordog/deviantart

In [40]:
# creates a pd df from the csv file if it exists, else creates a blank df
csv_file = 'deviation_info.csv'
try:
    deviation_df = pd.read_csv(csv_file)
except FileNotFoundError:
    deviation_df = pd.DataFrame()

In [28]:
# Separate API keys in case of requesting issues
andrew_DA_API = deviantart.Api("25542", "61a232f232df245f2560a3cb72ecc535")
ethan_DA_API = deviantart.Api("25492", "06217cf59e73b401dc0a14d00857a793")

# the access token is available on each Api object, e.g. andrew_DA_API.access_token

In [29]:
# README: use your own token
cur_access = andrew_DA_API

In [30]:
# number of batches to fetch; each batch requests 10 images, so we fetch n * 10 images
n = 1

In [44]:
for i in range(n):
    # grab 10 images at a time. DeviantArt calls their posts "deviations".
    # TODO: consider timerange 'onemonth'
    deviations = cur_access.browse(endpoint='popular', timerange='alltime', offset=i*10, limit=10)['results']
    
    for deviation in deviations:
        # save the image locally, named by deviation ID, for later CLIP analysis
        url = deviation.content['src']
        dId = deviation.deviationid
        path = Path(f"images/{dId}.png")
        # create the images/ directory if needed; skip files already downloaded
        path.parent.mkdir(parents=True, exist_ok=True)
        if not path.is_file():
            urllib.request.urlretrieve(url, str(path))
        
        # these serve as examples of how to make a request when the python wrapper doesn't work
        username = deviation.author.username
        request = f"https://www.deviantart.com/api/v1/oauth2/user/profile/{username}?access_token={cur_access.access_token}&expand=user.stats"
        response = requests.get(request)
        authorData = response.json()
        authorWatchers = authorData['user']['stats']['watchers']
        authorPageViews = authorData['stats']['profile_pageviews'] # deemed unnecessary?
        authorDeviations = authorData['stats']['user_deviations']
        
        # pass the deviation ID explicitly instead of relying on the object's string form
        request = f"https://www.deviantart.com/api/v1/oauth2/deviation/metadata?access_token={cur_access.access_token}&deviationids={deviation.deviationid}&ext_stats=True"
        response = requests.get(request)
        metaData = response.json()
        views = metaData['metadata'][0]['stats']['views']
        
        # gathering relevant data, turning it into a new observation
        row = {
            'Deviation ID': deviation.deviationid,
            'Title': deviation.title,
            'Author': username,
            'Views': views,
            'Favorites': deviation.stats['favourites'],
            'Comments': deviation.stats['comments'],
            'URL Link': deviation.url,
            'Date Posted': datetime.fromtimestamp(int(deviation.published_time)),
            'Height': deviation.content['height'],
            'Width': deviation.content['width'],
            'File Size': deviation.content['filesize'],
            'Author Watchers': authorWatchers,
            'Author Page Views': authorPageViews,
            'Author Deviations': authorDeviations
        }
        row_df = pd.DataFrame(row, index=[0])
        deviation_df = pd.concat([deviation_df, row_df], ignore_index=True)
        
    # when running on the most popular posts, we will likely get duplicates. remove them.
    deviation_df = deviation_df.drop_duplicates(subset='Deviation ID')
    
    # pause 15 seconds between batches to adhere to DeviantArt's fetch rate
    # (no pause is needed after the final batch)
    if i < n - 1:
        time.sleep(15)

In [45]:
# Write the DataFrame to the CSV file so scraping can be collaborative.
# to_csv overwrites the file, which is fine because the DataFrame was populated from that same CSV above.
deviation_df.to_csv(csv_file, index=False)

In [46]:
deviation_df

| | Deviation ID | Title | Author | Views | Favorites | Comments | URL Link | Date Posted | Height | Width | File Size | Author Watchers | Author Page Views | Author Deviations |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 402F3AC2-00A0-C125-C38A-7D7D115FEB19 | Mother's Day 2023 | TC-96 | 226843 | 1851 | 119 | https://www.deviantart.com/tc-96/art/Mother-s-... | 2023-05-14 03:03:33 | 1373 | 1280 | 10188807 | 76682 | 14104535 | 1125 |
| 1 | 7B95FF3B-DD45-2551-1E6C-610FB8CF0141 | 3245. Light Jolteon TCG (Fan Art) | Cryptid-Creations | 159558 | 1167 | 21 | https://www.deviantart.com/cryptid-creations/a... | 2023-05-15 08:00:05 | 652 | 900 | 892442 | 179511 | 8199386 | 3353 |
| 2 | CC32A3B6-5771-30C2-012C-F8A921465FD9 | sneaky bud! | Apofiss | 142872 | 889 | 11 | https://www.deviantart.com/apofiss/art/sneaky-... | 2023-05-15 09:42:06 | 338 | 600 | 143290 | 170567 | 6085730 | 755 |
| 3 | EF788684-E3ED-4A25-C651-0D89108E677D | Moushley Fanart | LoulouVZ | 380472 | 944 | 11 | https://www.deviantart.com/loulouvz/art/Moushl... | 2023-05-11 01:33:28 | 2925 | 2925 | 1201402 | 17672 | 1026305 | 3846 |
| 4 | E1C46EE7-1A48-48B8-5FD6-56C6AFCA3B70 | Happy Mother's Day 2023 | OUTCASTComix | 202538 | 779 | 41 | https://www.deviantart.com/outcastcomix/art/Ha... | 2023-05-14 09:49:46 | 1276 | 1024 | 3697959 | 24119 | 2604771 | 948 |
| 5 | 52D5DBF5-8DDE-A6D9-3CFC-6564E9EB6BA3 | Splash pose | rongs1234 | 415298 | 928 | 20 | https://www.deviantart.com/rongs1234/art/Splas... | 2023-05-11 14:27:08 | 589 | 1280 | 1811339 | 21869 | 1527302 | 4131 |
| 6 | 1AEB56D7-5B52-A69B-9323-F6065F6A816D | Lavender Dreams. [Adoptable] | SenchaFox | 416702 | 713 | 11 | https://www.deviantart.com/senchafox/art/Laven... | 2023-05-10 13:52:23 | 1024 | 1024 | 27792020 | 1237 | 13961 | 92 |
| 7 | AA714BFC-3D7D-9DDA-9833-A9840E21B5FB | The Eclipse Knight | AnatoFinnstark | 179850 | 794 | 12 | https://www.deviantart.com/anatofinnstark/art/... | 2023-05-12 08:17:52 | 1260 | 900 | 18628140 | 44934 | 777723 | 869 |
| 8 | D5CAADA8-8214-84E9-29CE-9F0D9E1A9258 | Commission - Yor Forger | Neldorwen | 365188 | 755 | 20 | https://www.deviantart.com/neldorwen/art/Commi... | 2023-05-10 10:56:09 | 1100 | 1100 | 672737 | 1761 | 178913 | 1056 |
| 9 | 60BAD3C8-F557-177C-C812-CFD3219E9356 | Cloud isopod | pikaole | 256350 | 517 | 5 | https://www.deviantart.com/pikaole/art/Cloud-i... | 2023-05-11 19:46:41 | 567 | 800 | 258765 | 17798 | 296397 | 811 |


# Data Cleaning

Describe your data cleaning steps here.

In [None]:
# code (can have multiple)