# COGS 108 - EDA Checkpoint

## Names
- Mariam Bachar (A16217374)
- Alexandra Hernandez (A16730685)
- Brian Kwon (A16306826)
- Andrew Uhm (A16729684)
- Ethan Wang (A17229824)

## Research Question

Do certain keywords as identified by CLIP correlate with the popularity (as measured by the equivalent of “likes”) that artwork receives on social media?

## Setup

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [4]:
# retrieving our deviation info; fall back to an empty frame if the file is missing.
csv_file = 'deviation_info.csv'
try:
    deviation_df = pd.read_csv(csv_file)
except FileNotFoundError:
    deviation_df = pd.DataFrame()

In [9]:
deviation_df.head()

Unnamed: 0,Deviation ID,Title,Author,Views,Favorites,Comments,URL Link,Date Posted,Height,Width,File Size,Author Watchers,Author Page Views,Author Deviations
0,6797CD44-47EA-B405-9377-5D41D83F33FE,A l'aise...,thrumyeye,2022815,31101,2385,https://www.deviantart.com/thrumyeye/art/A-l-a...,2011-02-17 23:43:04,599,900,408379,36527,1354598,2179
1,83750DFB-D23E-00A3-DF4F-D164A07DF605,Tiger cub,Kamirah,1271452,20720,2097,https://www.deviantart.com/kamirah/art/Tiger-c...,2008-07-11 06:10:53,800,800,523370,45123,7500677,358
2,8F1ED8A1-88A5-861A-F83B-77916A1481A0,Baby Steps 0268P,Sooper-Deviant,1097719,24050,937,https://www.deviantart.com/sooper-deviant/art/...,2009-10-05 06:55:37,400,400,4493241,35586,1438276,345
3,66D5BA39-C0D4-7A95-52FF-C0694149142E,Sky Turtle,yuumei,1056323,23103,918,https://www.deviantart.com/yuumei/art/Sky-Turt...,2014-01-23 10:46:57,532,1000,390864,397714,22504533,842
4,1C127EBF-EFF7-7BC4-004D-0355A2856C05,Riders,sandara,1240225,15643,512,https://www.deviantart.com/sandara/art/Riders-...,2013-09-30 23:37:02,770,1200,832434,212588,7566524,796


In [6]:
# retrieving our CLIP caption info.
csv_file = 'caption_info.csv'
try:
    caption_df = pd.read_csv(csv_file)
except FileNotFoundError:
    caption_df = pd.DataFrame()

In [16]:
caption_df.head()

Unnamed: 0,Deviation ID,Caption
0,6797CD44-47EA-B405-9377-5D41D83F33FE,a red fox is sitting in the grass and looking ...
1,83750DFB-D23E-00A3-DF4F-D164A07DF605,a tiger sitting on top of a rock in a forest w...
2,8F1ED8A1-88A5-861A-F83B-77916A1481A0,a red panda cub walking on a tree branch in th...
3,66D5BA39-C0D4-7A95-52FF-C0694149142E,a bird flying through the air with its wings s...
4,1C127EBF-EFF7-7BC4-004D-0355A2856C05,a man riding on the back of a horse next to a ...


## Data Cleaning

We previously checked for N/A values and removed duplicates when constructing our datasets. A few steps remain. First, let's join our dataframes on their common ID for easier use later:

In [27]:
# joining the deviation info with the CLIP captions on their shared ID.
df = pd.merge(deviation_df, caption_df, on='Deviation ID', how='inner')
df.head()

Unnamed: 0,Deviation ID,Title,Author,Views,Favorites,Comments,URL Link,Date Posted,Height,Width,File Size,Author Watchers,Author Page Views,Author Deviations,Caption
0,6797CD44-47EA-B405-9377-5D41D83F33FE,A l'aise...,thrumyeye,2022815,31101,2385,https://www.deviantart.com/thrumyeye/art/A-l-a...,2011-02-17 23:43:04,599,900,408379,36527,1354598,2179,a red fox is sitting in the grass and looking ...
1,83750DFB-D23E-00A3-DF4F-D164A07DF605,Tiger cub,Kamirah,1271452,20720,2097,https://www.deviantart.com/kamirah/art/Tiger-c...,2008-07-11 06:10:53,800,800,523370,45123,7500677,358,a tiger sitting on top of a rock in a forest w...
2,8F1ED8A1-88A5-861A-F83B-77916A1481A0,Baby Steps 0268P,Sooper-Deviant,1097719,24050,937,https://www.deviantart.com/sooper-deviant/art/...,2009-10-05 06:55:37,400,400,4493241,35586,1438276,345,a red panda cub walking on a tree branch in th...
3,66D5BA39-C0D4-7A95-52FF-C0694149142E,Sky Turtle,yuumei,1056323,23103,918,https://www.deviantart.com/yuumei/art/Sky-Turt...,2014-01-23 10:46:57,532,1000,390864,397714,22504533,842,a bird flying through the air with its wings s...
4,1C127EBF-EFF7-7BC4-004D-0355A2856C05,Riders,sandara,1240225,15643,512,https://www.deviantart.com/sandara/art/Riders-...,2013-09-30 23:37:02,770,1200,832434,212588,7566524,796,a man riding on the back of a horse next to a ...
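
The N/A and duplicate checks mentioned above, along with a sanity check on the join, could look roughly like the sketch below. The tiny stand-in frames and the helper `check_frame` are hypothetical and only illustrate the idea; they are not the checks we actually ran.

```python
import pandas as pd

# Hypothetical stand-in frames; the real ones come from
# deviation_info.csv and caption_info.csv.
deviation_df = pd.DataFrame({
    "Deviation ID": ["A", "B", "C"],
    "Favorites": [10, 20, 30],
})
caption_df = pd.DataFrame({
    "Deviation ID": ["A", "B", "D"],
    "Caption": ["a red fox", "a tiger", "a bird"],
})

def check_frame(frame, key="Deviation ID"):
    """Count missing cells and duplicated keys in a dataframe."""
    return {
        "missing_cells": int(frame.isna().sum().sum()),
        "duplicate_keys": int(frame[key].duplicated().sum()),
    }

deviation_report = check_frame(deviation_df)
caption_report = check_frame(caption_df)

# An outer merge with indicator=True reveals which IDs an inner join would drop.
audit = pd.merge(deviation_df, caption_df, on="Deviation ID",
                 how="outer", indicator=True)
unmatched = audit.loc[audit["_merge"] != "both", "Deviation ID"].tolist()

print(deviation_report, caption_report, unmatched)
```

If `unmatched` is non-empty on the real data, the inner join silently discards those rows, which is worth knowing before the analysis.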


We want to anonymize the data to maintain ethical integrity, as discussed in our project proposal. Let's remove the Author field, which serves no other purpose in our analysis. We no longer need the URL, and it also reveals the author, so we remove it too. We also acknowledge that webpage "views" can be manipulated by bots. The views on the deviation (read: artwork) are still relevant despite that, but the views on the author's page matter less, so let's remove that column as well:

In [28]:
# dropping the identifying columns and the bot-prone page views in one call.
df = df.drop(columns=['URL Link', 'Author', 'Author Page Views'])

In [29]:
df.head()

Unnamed: 0,Deviation ID,Title,Views,Favorites,Comments,Date Posted,Height,Width,File Size,Author Watchers,Author Deviations,Caption
0,6797CD44-47EA-B405-9377-5D41D83F33FE,A l'aise...,2022815,31101,2385,2011-02-17 23:43:04,599,900,408379,36527,2179,a red fox is sitting in the grass and looking ...
1,83750DFB-D23E-00A3-DF4F-D164A07DF605,Tiger cub,1271452,20720,2097,2008-07-11 06:10:53,800,800,523370,45123,358,a tiger sitting on top of a rock in a forest w...
2,8F1ED8A1-88A5-861A-F83B-77916A1481A0,Baby Steps 0268P,1097719,24050,937,2009-10-05 06:55:37,400,400,4493241,35586,345,a red panda cub walking on a tree branch in th...
3,66D5BA39-C0D4-7A95-52FF-C0694149142E,Sky Turtle,1056323,23103,918,2014-01-23 10:46:57,532,1000,390864,397714,842,a bird flying through the air with its wings s...
4,1C127EBF-EFF7-7BC4-004D-0355A2856C05,Riders,1240225,15643,512,2013-09-30 23:37:02,770,1200,832434,212588,796,a man riding on the back of a horse next to a ...


We can justify the rest of the columns. We need the ID for identification purposes. The title may relate to the CLIP caption. The views, favorites, and comments are how we gauge the popularity of the post. The date posted is relevant for trend analysis. The file size, height, and width imply the rendering definition (e.g. 1080px) of the work. Author watchers and deviations may indicate how much experience or traction the author has had on DeviantArt. Finally, the CLIP caption is integral to our hypothesis.
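
As a rough illustration of how the date and size columns might be prepared for trend and quality analysis, here is a minimal sketch. The derived column names (`Year Posted`, `Pixels`, `Bytes Per Pixel`) are our own assumptions for illustration, not fields in the dataset.

```python
import pandas as pd

# Two rows copied from the table above, restricted to the relevant columns.
df = pd.DataFrame({
    "Date Posted": ["2011-02-17 23:43:04", "2008-07-11 06:10:53"],
    "Height": [599, 800],
    "Width": [900, 800],
    "File Size": [408379, 523370],
})

# Parse the timestamp strings so posts can be grouped by year for trend analysis.
df["Date Posted"] = pd.to_datetime(df["Date Posted"])
df["Year Posted"] = df["Date Posted"].dt.year

# Hypothetical proxies for rendering definition: total pixel count and
# file size relative to pixel count.
df["Pixels"] = df["Height"] * df["Width"]
df["Bytes Per Pixel"] = df["File Size"] / df["Pixels"]

print(df[["Year Posted", "Pixels", "Bytes Per Pixel"]])
```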

Now we consider adjusting the string-based content of our data. We do not want stopwords such as "a", "is", and "the" in our analysis, as they carry no information about the artwork's subject:

In [None]:
# TODO: consider whether it matters we edit deviation titles or not (Andrew: I don't think we need to)
# TODO: use nltk to remove stopwords in AI caption
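
One possible shape for the stopword step is sketched below. To stay self-contained it uses a deliberately small, hypothetical stopword set rather than the full nltk list; `nltk.corpus.stopwords.words("english")` would be a fuller substitute once the corpus is downloaded. The function name `extract_keywords` is also our own.

```python
import re

# Hypothetical, deliberately small stopword set for illustration only.
STOPWORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to", "with"}

def extract_keywords(caption):
    """Lowercase a caption, split it into words, and drop stopwords."""
    words = re.findall(r"[a-z']+", caption.lower())
    return [word for word in words if word not in STOPWORDS]

# First caption from the table above, as displayed (still truncated).
keywords = extract_keywords("a red fox is sitting in the grass and looking ...")
print(keywords)  # ['red', 'fox', 'sitting', 'grass', 'looking']
```

On the full dataframe, this could presumably be applied with `df['Caption'].apply(extract_keywords)`.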

## Data Analysis and Results

Describe how EDA works here and what is being explored

In [None]:
# TODO:
# - define the popularity metric using the Views, Favorites, and Comments fields
# - draw out popularity trends based on date posted, then normalize if that is an issue
# - draw out 3 relationships: popularity with fileSizeAndQuality, authorExperience, and the AICaption
# - run a statistical test for whether popularity is unrelated to fileSizeAndQuality or authorExperience,
#   and a test for whether popularity IS related to the AICaption
# - REMEMBER: it's not critical that these tests pass; these are just the results that would confirm our hypothesis.
#   If our hypothesis is wrong, no big deal.
# - are we using CLIP or https://stablediffusionweb.com? If the latter, we need to change some parts of our proposal/hypothesis
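
To make the first TODO items concrete, here is one possible sketch of a popularity metric and a rank-correlation check. The formula (mean of z-scored log counts) and the helper names `popularity_score` and `spearman` are assumptions for illustration, not a decision we have made. The five rows are those shown in the tables above, so the resulting numbers are not findings.

```python
import numpy as np
import pandas as pd

# The five rows shown above, restricted to the columns used here.
df = pd.DataFrame({
    "Views": [2022815, 1271452, 1097719, 1056323, 1240225],
    "Favorites": [31101, 20720, 24050, 23103, 15643],
    "Comments": [2385, 2097, 937, 918, 512],
    "File Size": [408379, 523370, 4493241, 390864, 832434],
})

def popularity_score(frame, columns=("Views", "Favorites", "Comments")):
    """Average the z-scores of log-scaled counts (an assumed formula)."""
    logged = np.log1p(frame[list(columns)])          # tame heavy-tailed counts
    z_scores = (logged - logged.mean()) / logged.std(ddof=0)
    return z_scores.mean(axis=1)

def spearman(x, y):
    """Spearman rank correlation, computed as Pearson correlation of ranks."""
    return x.rank().corr(y.rank())

df["Popularity"] = popularity_score(df)
rho = spearman(df["Popularity"], df["File Size"])
print(df["Popularity"].round(3).tolist(), round(rho, 3))
```

The same `spearman` helper could be pointed at author-experience columns, and a 0/1 column marking whether a caption contains a given keyword could be compared against `Popularity` in a similar way.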