# Drawception Scraping Process
Scraping drawception occurs in 3 stages. The first step is to collect the url's to the pages with the images on them. These pages are each referred to as games, and each game has between 6-12 image panels. After the game url's are collected the data from the panels can be scraped. It's important to allow some time to pass for the players of the game to react to the images. Any reactions made after the scrape will not be included in my data. The final step is to download the images to disk, which is done with a third function.

For each step there is extra code to prevent collecting duplicate data and also allowing the data to be collected in batches.

## Data to be Collected

The data specific to the image
- pre_caption
- post_caption
- image_url
- author
- panel_number
- LIKE
- HAHA
- WOW
- LOVE
- DUCK

The data specific to the game (a series of 6-12 images)
- game_url
- player_num
- game_duration
- game_date
- game_tags

Extra features
- REACT (the sum of LIKE HAHA WOW LOVE and DUCK)
- image_path (local path to the image file)


### Import Libraries

In [1]:
# Import libraries
import pandas as pd
import requests
from bs4 import BeautifulSoup
import json
import time
import os
from IPython.display import clear_output

## Scrape Game URL's

The function scrapes from the drawception browse game page. There are only 100 pages availible to scrape at a time. It takes less than a week to refresh the entire list. The list is saved and loaded as a dataframe for convinience.

In [2]:
# Function to scrape all pages in the browse games part of drawception up to specified number.
# Thanks to: https://towardsdatascience.com/the-simplest-cleanest-method-for-tracking-a-for-loops-progress-and-expected-run-time-in-python-972675392b3
def scrape_recent(stop_page):
    
    browse_games_url = 'https://drawception.com/browse/recent-games/'
    game_list = []
    
    # Make sure I don't access a page that doesn't exist
    if (stop_page > 100) | (stop_page <= 0):
        stop_page = 100
    
    # look at all pages in reverse order, by going backwards if (when!) the list shifts in the middle of scraping 
    # we miss a game instead of getting a duplicate.
    for i in range(stop_page,0,-1):
        
        clear_output(wait=True)
        
        res = requests.get(f'{browse_games_url}{i}/')
        soup = BeautifulSoup(res.content, 'lxml')
        temp_list = [game.attrs['href'] for game in soup.find_all(attrs={'class':'thumbpanel'})]
        game_list.extend(temp_list)
        
        print(f'Scraped page {i}')
        time.sleep(1)
    
    return game_list

## Scrape a Single Game

This function takes a drawception game and records the game into a dataframe.

In [13]:
def scrape_game(game_url):
    
    # Collect page information
    base_url = 'https://drawception.com'
    res = requests.get(base_url+game_url)
    soup = BeautifulSoup(res.content, 'lxml')
    
    # Get panels variable and check that the game page has panels
    # If no panels then something is up with the game so skip it. NSFW games do this.
    panels = soup.find_all(attrs = {'class':'col-sm-12 col-md-4'})
    if len(panels) == 0:
        return None
    
    # Escape if page doesn't load
    if res.status_code != 200:
        return None
    
    ### Header stuff that applies to all panels ###
    #  game_url : player_num : game_duration: game_date : game_tags
    
    # html with tags like 'top game' color palette and other misc tags
    top_padding = soup.find(attrs = {'class':'text-center add-padding-bottom2x add-padding-top'})
    game_tags = []
    if top_padding != None:
        game_tags = list(set([s.text.strip() for s in top_padding.find_all(name='span')]))

    # String with player count duration and game date
    header_str = soup.find(name='p', attrs={'class':'lead text-muted add-margin-top2x'}).text
    header_str = header_str.strip().split()

    player_num = header_str[0]
    game_date = ' '.join(header_str[6:9])
    game_duration = ' '.join(header_str[11:13])
    
    
    data_rows = []
    for i in range(len(panels)):
        if panels[i].find('img') != None:

            this_row = {}
            # Things to collect for this row:
            #  panel-number : pre-caption : post-caption : image : author : smile : love : haha : wow : duck

            # Check if this is the first panel then record the pre_caption
            if i == 0:
                this_row['pre_caption'] = 'draw_first'
            else:
                this_row['pre_caption'] = panels[i].find(name = 'img').attrs['title']

            # Check if this is the last panel then record the post_caption
            if i == (len(panels)-1):
                this_row['post_caption'] = 'draw_last'
            else:
                this_row['post_caption'] = panels[i+1].find(name='p').text.strip()

            # Image URL
            this_row['image_url'] = panels[i].find('img').attrs['src']

            # Author name (deleted or banned accounts appear as OooOOOoOo)
            if panels[i].find(attrs = {'class':'panel-user'}).find('a') != None:
                this_row['author'] = panels[i].find(attrs = {'class':'panel-user'}).find('a').text
            else:
                # Ghost name
                this_row['author'] = panels[i].find(attrs = {'class':'panel-user'}).find('span').text
                
            # Panel number is simple to collect
            this_row['panel_number'] = i+1

            # Reactions could exist or not exist in a dictionary. Start by setting all of them to 0.
            reaction_types = ['LIKE', 'HAHA', 'WOW', 'LOVE', 'DUCK']
            react_data = json.loads(panels[i].find('reactions').attrs['reactions_data'])
            for react in reaction_types:
                this_row[react] = 0
            # Now loop through the reaction dictionary and reset the non-zeros
            for react_dict in react_data:
                this_row[react_dict['id']] = react_dict['num']
                
            # Add in the universal game stuff here
            # game_url : player_num : game_duration: game_date : game_tags
            this_row['game_url'] = game_url
            this_row['player_num'] = player_num
            this_row['game_date'] = game_date
            this_row['game_tags'] = game_tags

            data_rows.append(this_row)
            
            # Sleep here, no matter how the function is called the sleep is included
            time.sleep(1)

    return pd.DataFrame(data_rows)

## Scrape a Batch of Games

This function runs through a list of game URL's and downloads each game that isn't found in the dataframe already. This way I can control how much I'm downloading at any given time.

In [4]:
# scrape list
def batch_scrape(game_list, dataframe, batch_size):
    # Count up to batch size like a while loop. Use a for loop to handle the edge case of finishing the list
    count = 0
    
    for i in range(len(game_list)):
        if game_list[i] not in list(drawception['game_url']):
            
            clear_output(wait=True)
            count += 1
            
            print(f'Scraped page {count}/{batch_size} - {game_list[i]}')
            scrape_game(game_list[i])
            dataframe = dataframe.append(scrape_game(game_list[i]))
            
            if count >= batch_size:
                return dataframe
    print('End of Game List')
    return dataframe

## Download Images

This function looks at the dataframe of all the images and downloads any images that it can't find in the drawings directory.

In [21]:
# Thanks to AlexG on stack exchange
# https://stackoverflow.com/questions/8286352/how-to-save-an-image-locally-using-python-whose-url-address-i-already-know
def image_scrape(dataframe, batch_size):

    # Count up to batch size like a while loop. Use a for loop to handle the edge case of finishing the list
    count = 0
    
    for _, row in dataframe.iterrows():
        
        # Get the file path from the url. My local data will mirror the url.
        path_list = row['image_url'].split('/')
        my_file = f'../{"/".join(path_list[3:6])}'
        
        # Check if the image is already downloaded
        if not os.path.exists(my_file):
        
            # This code runs if the image needs scraping
            clear_output(wait=True)
            count += 1
            print(f'Scraped image {count}/{batch_size} - {my_file}')

            # Create directory if it doesn't exist
            my_path = f'../{"/".join(path_list[3:5])}'
            if not os.path.exists(my_path):
                os.makedirs(my_path)

            # Get the image from the web
            page = requests.get(row['image_url'])
            time.sleep(1)

            # Write image file
            with open(my_file, 'wb') as f:
                f.write(page.content)
                
            # Escape early so I can control how long this code runs
            if count >= batch_size:
                return
                
    print('End of DataFrame')
    return

# Put it all into Practice

Step 1: Get the game URL's

In [19]:
# Read in saved games list and convert from dataframe back to list.
# Alternatively you can just skip this step and just collect new stuff.
games_df = pd.read_csv('../data/games_jan01.csv')
games = games_df['0'].tolist()

In [7]:
# Scrape every game off the browse games section. max 100
new_scrape = scrape_recent(5)
len(new_scrape)

Scraped page 1


105

In [None]:
# Adds the lists together then gets rid of duplicates, possibly altering the order but that's fine.
games.extend(new_scrape)
games = list(set(games))

In [None]:
# Save the list to a file, ### CHANGE THE FILENAME ###
game_df = pd.DataFrame(new_scrape)
game_df.to_csv('../data/games_jan17.csv', index=False)

Step 2: Get the panel data

In [16]:
# Load previously collected data.
# Run first scrape to initialize the columns of the master DataFrame in the case it's the first time.
# drawception = scrape_game(games[0])

drawception = pd.read_csv('../data/drawception_master.csv')

In [30]:
# Scrape games off the games list
drawception = batch_scrape(games, drawception, 500)
drawception.shape

End of Game List


(40921, 14)

In [32]:
# Save the data to a file, ### CHANGE THE FILENAME ###
drawception.to_csv('../data/drawception_master.csv', index=False)

Step 3: Get the images

In [33]:
image_scrape(drawception, 5000)

Scraped image 1051/5000 - ../drawings/1040071/xYFZnpBFOw.png
End of DataFrame
