# COGS 108 - Data Checkpoint

# Names

- Mariam Bachar (A16217374)
- Alexandra Hernandez (A16730685)
- Brian Kwon (A16306826)
- Andrew Uhm (A16729684)
- Ethan Wang (A17229824)

<a id='research_question'></a>
# Research Question

*Do certain keywords as identified by CLIP correlate with the popularity (as measured by the equivalent of “likes”) that artwork receives on social media?*

# Hypothesis

We predict that digital artwork containing certain keywords assigned by CLIP (e.g., painting vs. watercolor vs. digital) will be positively correlated with popularity on social media. From observing what is popular, we notice that certain features tend to repeat across posts, which leads us to expect that such a correlation will be found.
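
As a rough illustration of what "correlation" could mean here, the sketch below computes a Pearson coefficient between a 0/1 keyword indicator and a numeric popularity score. The helper `pearson` and the toy values are assumptions made for illustration; they are not our data or our final analysis method.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Toy values for illustration only, not real DeviantArt data.
has_keyword = [1, 1, 0, 0, 1, 0]               # 1 if CLIP assigned the keyword
popularity = [0.9, 0.7, 0.2, 0.4, 0.8, 0.1]    # any numeric popularity measure

r = pearson(has_keyword, popularity)
print(f"correlation: {r:.2f}")
```

A coefficient near 1 or -1 would suggest a strong association between the keyword and popularity, while a value near 0 would suggest little linear relationship.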

# Data

Our ideal dataset would be a representative sample of images covering most genres of art. The variables would be the image, its caption, and its number of “likes”. We would want roughly 3,000 images spanning that sample, each with its “like” count and the original artist’s follower count. Files can be anonymized with integer IDs. We would then process each image with CLIP to extract a caption and store it alongside the corresponding data point. Ideally, all images would be the same size. The ideal dataset would also include publication dates so that we can make comparisons to past trends. We would define popularity as an image’s number of likes in proportion to the maximum number of likes in our dataset, making it a continuous measure rather than a binary label.
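
The popularity definition above can be sketched in a few lines of plain Python. The function name `relative_popularity` and the toy like counts are hypothetical, chosen only to show the scaling.

```python
def relative_popularity(likes):
    """Scale each like count by the largest like count in the dataset."""
    max_likes = max(likes)
    if max_likes == 0:
        # No image has any likes, so every score is zero.
        return [0.0 for _ in likes]
    return [count / max_likes for count in likes]

# Toy like counts for illustration only.
likes = [120, 30, 600, 0]
scores = relative_popularity(likes)
print(scores)
```

Every score falls between 0 and 1, with the most-liked image scoring exactly 1. On a pandas column, the same scaling could be written as `data['Favorites'] / data['Favorites'].max()`.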

A real dataset we could use is available through DeviantArt’s API. We acknowledge that this data differs from our ideal. For one, the images are not perfectly square, so we will crop and resize them to a predetermined size (e.g., 768x768) to normalize them. DeviantArt also likely has its own culture, so our findings may not be representative of other social media, and because we take images from its home page, we may not see older posts. Furthermore, the fields that we require are optional, so we would have to filter to images that have all of the required fields filled out.
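
One way the cropping step could work is a center crop to the largest square, followed by a resize. The sketch below only computes the crop box; the helper name `center_crop_box` is hypothetical, and choosing a center crop (rather than, say, padding) is an assumption on our part.

```python
def center_crop_box(width, height):
    """Return the (left, upper, right, lower) box of the largest centered square."""
    side = min(width, height)
    left = (width - side) // 2
    upper = (height - side) // 2
    return (left, upper, left + side, upper + side)

# Example: a 1920x1080 landscape image.
box = center_crop_box(1920, 1080)
print(box)

# With Pillow installed, the box could then be applied as:
#   Image.open(path).crop(box).resize((768, 768))
```

A center crop keeps the aspect ratio of the retained region intact, at the cost of discarding content near the edges of very wide or very tall images.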

# Ethics and Privacy

There are several ethical concerns regarding this research question that we must be mindful of as we analyze the data. The most obvious is that, by not identifying an artwork as popular, we implicitly tag it as unpopular. However, this should not be a serious issue, because we do not present identifying information about specific artworks or individual artists, so it should not be possible to label a specific artwork or artist as “unpopular”.

In terms of normalization, one possible solution is to take the ratio of the number of likes on an artwork to the number of followers its artist has. This accounts for the disparity between larger and smaller artists, since more popular artists receive more likes simply because they have a larger audience.
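
A minimal sketch of this ratio is shown below. The function name `likes_per_watcher` is hypothetical, and returning `None` for artists with no watchers is an assumption about how we might handle that edge case.

```python
def likes_per_watcher(favorites, watchers):
    """Return favorites divided by watchers, or None when the artist has no watchers."""
    if watchers <= 0:
        # The ratio is undefined without an audience, so flag it instead of dividing by zero.
        return None
    return favorites / watchers

# Toy numbers for illustration only.
print(likes_per_watcher(50, 1000))   # small share of a large audience
print(likes_per_watcher(50, 100))    # larger share of a small audience
```

With the columns collected in the Setup cell, the equivalent pandas expression would be `data['Favorites'] / data['Author Watchers']`.
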

Additionally, our analysis may exclude the cultural influences of minority groups. Because people in developed countries tend to have more leisure time and resources (such as drawing software or drawing tablets), it is plausible that most digital art posted to social media comes from developed countries. Thus, the work we analyze may disproportionately represent the artwork and cultural trends of majority groups in developed countries, which tend to be similar across those countries, while glossing over minority groups. Although there is no clear-cut solution for this, it serves us well to keep this fact in mind when drawing conclusions from our analyses.

Finally, because the artworks are posted on a public forum, the artists have consented to allowing their art to be analyzed. The DeviantArt TOS states that you cannot “reproduce, distribute, publicly display or perform, or prepare derivative works”, a restriction that does not cover the use of the artworks for an analytic survey.

# Dataset(s)

*Fill in your dataset information here*

(Copy this information for each dataset)
- Dataset Name:
- Link to the dataset:
- Number of observations:

1-2 sentences describing each dataset. 

If you plan to use multiple datasets, add 1-2 sentences about how you plan to combine these datasets.

# Setup

In [171]:
#pip install:
    # requests, deviantart, 

#resources:
    # https://www.deviantart.com/developers/http/v1/20210526
    # https://github.com/neighbordog/deviantart
    
import requests
import urllib.request
from bs4 import BeautifulSoup
import deviantart
from datetime import datetime
import time
import pandas as pd

#please use your own client id and client secret to avoid getting all of us IP banned lol. Shouldn't happen but might.
da = deviantart.Api("25492", "06217cf59e73b401dc0a14d00857a793")
#your access token = da.access_token

#this block reads in the file if it exists, if it doesn't it creates it.
csv_file = 'deviation_info.csv'
try:
    data = pd.read_csv(csv_file)
except FileNotFoundError:
    data = pd.DataFrame()

for i in range(0,1):
    #grabs 10 images starting from the i*10 + 1 index.
    deviations = da.browse(endpoint='popular', offset=i*10, limit=10)['results']
    
    for deviation in deviations:
        #the url of the image itself is here.
        url = deviation.content['src']
        #saves image to file by deviation id.
        dId = deviation.deviationid
        urllib.request.urlretrieve(url, f"images/{dId}.png")
        
        #these serve as examples of how to make a request when the python wrapper fails you.
        username = deviation.author.username
        request = f"https://www.deviantart.com/api/v1/oauth2/user/profile/{username}?access_token={da.access_token}&expand=user.stats"
        response = requests.get(request)
        authorData = response.json()
        authorWatchers = authorData['user']['stats']['watchers']
        #authorPageViews = authorData['stats']['profile_pageviews'] #not deemed necessary.
        authorDeviations = authorData['stats']['user_deviations']
        
        request = f"https://www.deviantart.com/api/v1/oauth2/deviation/metadata?access_token={da.access_token}&deviationids={dId}&ext_stats=True"
        response = requests.get(request)
        metaData = response.json()
        views = metaData['metadata'][0]['stats']['views']
        
        row = {
            #gathering all other data we can possibly gather.
            'Deviation ID': deviation.deviationid,
            'Title': deviation.title,
            'Author': deviation.author.username,
            'Views': views,
            'Favorites': deviation.stats['favourites'],
            'Comments': deviation.stats['comments'],
            'URL Link': deviation.url,
            'Date Posted': datetime.fromtimestamp(int(deviation.published_time)),
            'Height': deviation.content['height'],
            'Width': deviation.content['width'],
            'File Size': deviation.content['filesize'],
            'Author Watchers': authorWatchers,
            'Author Deviations': authorDeviations
        }
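        #DataFrame.append is deprecated (removed in pandas 2.0), so concatenate a one-row frame instead.
        data = pd.concat([data, pd.DataFrame([row])], ignore_index=True)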
        
    #time.sleep(15) #sleep 15 seconds to avoid deviantart rate limiting us.
    
    #when run multiple times, duplicates will crop up. This removes them.
    #turns out that the src differs sometimes, so comparing whole rows would still leave duplicates; compare on the deviation id only.
    data = data.drop_duplicates(subset='Deviation ID')

data.to_csv('deviation_info.csv', index=False)



# Data Cleaning

Describe your data cleaning steps here.

In [3]:
# code (can have multiple)