# Painting Details Scraping
In this section we define the artists that we wish to use in the project and then scrape the text-based data. <br/>
__NOTE:__ There is no special reason for choosing these particular artists. The points I considered were: a range of _genres_ and _styles_; the _period_ in which they produced their artworks (_so that the artists are not all from the same period in time_); and the number of pieces of art (_too few would not provide enough data for analysis, too many would be impractical given the processing power needed_).

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import sys

## Define Artists
It would be nice to process the data from all artists displayed on the _WikiArt_ website, but time and processing constraints prevent this. Therefore a small selection of artists has been chosen for this project. The selection was based on these criteria: <br/>
- They have at least one artwork on the WikiArt website that I liked.
- Collectively they provide a range of Styles and Genres.
- Collectively they produced their works at different time periods.
- I did not want artists that only have 2 or 3 artworks listed.
- I did not want lots of artists with very large collections. _(Limitations in processing power/time to consider)_

In [2]:
ARTISTS = ["piet-mondrian",
           "berthe-morisot",
           "tamara-de-lempicka",
           "katsushika-hokusai",
           "m-c-escher",
           "l-s-lowry",
           "roy-lichtenstein",
           "jackson-pollock",
           "edward-hicks",
           "agnes-martin",
           "pablo-picasso",
           "francisco-goya",
           "henry-fuseli",
           "karl-bodmer"]

## Helper Function to Extract the URL of the Artwork Image
We read the _HTML_ of the webpage containing the artist's cataloged images and extract the individual _URLs_ for each art piece. <br/>
<img src="./images/web_scrape2.png" width="1280px"><br/>
__NOTE:__ Above we see a screen grab of the _Developer Tools_ view of an artist's catalog. It is possible to scrape the data directly from the webpage's HTML code.

In [3]:
def get_painting_urls_for_artist(artist):

    # create an empty list to store the URLs 
    url_list = []

    # construct the URL
    url      = "https://www.wikiart.org/en/{}/all-works/text-list".format(artist)
    
    # GET the website HTML source
    page     = requests.get(url)

    # extract HTML
    soup     = BeautifulSoup(page.text, 'lxml')

    # find all <a> tags within the HTML and loop through them
    for a_tag in soup.find_all('a'):
        
        # select only those with the artist's name that do not contain a target attribute
        if "en/{}".format(artist) in str(a_tag) and "target" not in str(a_tag):
            
            # construct the URL of the page containing the image ref URL and details
            url_list.append("https://www.wikiart.org{}".format(a_tag["href"]))
            
    # return the full list of formatted URLs
    return url_list
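
The link filter can be tried without touching the network. The sketch below is a minimal illustration, not the code used in the project: the HTML snippet is hand-written and only assumed to resemble the catalog page, `filter_painting_urls` is a hypothetical name, and it checks the `href` and `target` attributes directly instead of searching the string form of the tag. It uses Python's built-in `html.parser` backend so that it runs without _lxml_.

```python
from bs4 import BeautifulSoup

# Hypothetical, hand-written stand-in for a catalog page (not real WikiArt markup)
SAMPLE_HTML = """
<ul>
  <li><a href="/en/piet-mondrian/composition-a">Composition A</a></li>
  <li><a href="/en/piet-mondrian/broadway-boogie-woogie">Broadway Boogie Woogie</a></li>
  <li><a href="/en/piet-mondrian" target="_blank">Artist page</a></li>
  <li><a href="/en/about">About</a></li>
</ul>
"""

def filter_painting_urls(html, artist):
    """Return absolute URLs for links that point at one of the artist's works."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []

    # only consider <a> tags that actually carry an href attribute
    for a_tag in soup.find_all("a", href=True):
        href = a_tag["href"]

        # keep links below the artist's path that have no target attribute
        if href.startswith("/en/{}/".format(artist)) and not a_tag.has_attr("target"):
            urls.append("https://www.wikiart.org{}".format(href))

    return urls

painting_urls = filter_painting_urls(SAMPLE_HTML, "piet-mondrian")
print(painting_urls)
```

Checking attributes rather than `str(a_tag)` avoids accidental matches on link text that happens to contain the word "target", at the cost of being slightly stricter than the helper above.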

## Helper Function to Extract the Text Based Data About the Image
From the _HTML_ of the webpage detailing an individual artwork we extract the _text_ data and the _URL_ of the artwork's image file. <br/>
<img src="./images/web_scrape1.png" width="1280px"><br/>
__NOTE:__ Above we see a screen grab of the _Developer Tools_ view of an artwork's page. It is possible to scrape both the _text_ data and the artwork image file's URL from the webpage's HTML code.

In [4]:
def get_painting_details(url):
    
    # GET the website HTML source    
    page         = requests.get(url)

    # extract HTML
    soup         = BeautifulSoup(page.text, "lxml")

    # define variables
    image_url    = ""
    image_style  = ""
    image_genre  = ""
    image_media  = ""
    image_title  = ""
    image_year   = ""
    image_artist = ""
    
    # find the image tag
    try:
        image_tag = soup.find("img")
        image_url = image_tag["src"]
    # NOTE: if the HTML cannot be read as expected, leave the try block and keep the default empty string
    # (catching Exception rather than using a bare except keeps Ctrl-C able to stop a long scrape)
    except Exception:
        pass 
    
    # find the article tag to extract title, artist and year 
    try:
        article_tag  = soup.find("article")

        # extract just the text to give the title
        image_title  = article_tag.find("h3").text
    
        # extract just the text to give the artist
        image_artist = article_tag.find("a").text
    
        # extract just the text to give the year
        li_tag       = article_tag.find("li", class_ = '')
        image_year   = li_tag.find("span").text[0:4]
    # NOTE: if the HTML cannot be read as expected, leave the try block and keep the default empty strings
    except Exception:
        pass
    
    # clean up the list tag and extract text 
    try:
        li_tag = soup.find_all("li", class_ = 'dictionary-values')

        # loop through all items in the list
        for li_item in li_tag:
        
            try:
                # for each list item identify the <p> tag and extract the text
                item_text       = BeautifulSoup(li_item.text, "lxml")
                item_p_tag_text = item_text.find("p")
    
                # clean line breaks, convert to lowercase and split field name from field value(s)
                # (split on the first colon only, so a value that itself contains ':' is not discarded)
                field_name, field_value = str(item_p_tag_text.text).strip().replace('\r', '').replace('\n', '').lower().split(":", 1)
        
                # assign relevant value to field
                if field_name == "style" : image_style = field_value
                if field_name == "genre" : image_genre = field_value
                if field_name == "media" : image_media = field_value
            # NOTE: if a list item cannot be read as expected, skip it and keep the default empty string
            except Exception:
                pass        
    except Exception:
        pass
    
    
    # return a dictionary of all extracted values  
    return({"title"  : image_title,
            "year"   : image_year,
            "artist" : image_artist,
            "style"  : image_style,
            "genre"  : image_genre,
            "media"  : image_media,
            "url"    : image_url})      
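
To make the field extraction easier to follow, here is a small offline sketch of the same idea. It is an illustration under assumptions, not the project's code: the HTML is hand-written and only assumed to mirror the `dictionary-values` list items seen in the screen grab, and `parse_dictionary_values` is a hypothetical name. Unlike the helper above, it collapses whitespace and strips the value, so its output may differ slightly from what the notebook stores.

```python
from bs4 import BeautifulSoup

# Hypothetical, hand-written stand-in for the details list of an artwork page
SAMPLE_HTML = """
<article>
  <ul>
    <li class="dictionary-values">
      Style:
      <a href="#">Neoplasticism</a>
    </li>
    <li class="dictionary-values">
      Genre:
      <a href="#">abstract</a>
    </li>
    <li class="dictionary-values">
      Media:
      <a href="#">oil</a>, <a href="#">canvas</a>
    </li>
  </ul>
</article>
"""

def parse_dictionary_values(html):
    """Map each lower-cased field name to its lower-cased value text."""
    soup = BeautifulSoup(html, "html.parser")
    fields = {}

    for li_item in soup.find_all("li", class_="dictionary-values"):
        # collapse line breaks and runs of spaces into single spaces
        text = " ".join(li_item.get_text().split()).lower()

        # split on the first colon only; items without a colon are skipped
        name, colon, value = text.partition(":")
        if colon:
            fields[name.strip()] = value.strip()

    return fields

fields = parse_dictionary_values(SAMPLE_HTML)
print(fields)
```

Returning a dictionary keyed by field name means that extra fields on the page are simply carried along, and the caller can pick out `style`, `genre` and `media` with `fields.get(...)`.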

## Get Artwork Text Data
We work through the list of chosen artists. For each one, we first find the web page containing the catalog of their artworks and then, using the URLs gathered there, we move to the web page of each individual work of art and extract the data that we require. Once we have the information, we store it as a collection of _.csv_ files, one per artist, for further use in the project.

In [5]:
# initialize the artist counter
artist_counter = 1

# loop through our list of artists
for artist in ARTISTS:
    
    # collect one dictionary of details per work of art for the current artist
    works_of_art = []
    
    # initialize the artwork counter 
    artwork_counter = 1

    # get a list of the artist's works and loop through them
    for work_of_art_url in get_painting_urls_for_artist(artist):
        
        # output progress
        sys.stdout.write(f"\rProcessing [{artist_counter:02d}/{artwork_counter:04d}]")
        
        # call our helper function to extract details for the specific picture
        # (the helper already returns a dictionary, so no eval(str(...)) round trip is needed)
        works_of_art.append(get_painting_details(work_of_art_url))
        
        # increment artwork counter
        artwork_counter += 1
    
    # build the dataframe in one step (DataFrame.append was removed in pandas 2.0)
    df_artist_works_of_art = pd.DataFrame(works_of_art)
    
    # write the dataframe of the artist's work to a csv (Character Separated Values - ';' as commas are used in some fields)
    df_artist_works_of_art.to_csv("".join(["./data/artists/", artist, ".csv"]), sep = ";" )
                                                            
    # increment artist counter
    artist_counter += 1

Processing [03/0104]
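
Because the files are written with `;` as the separator, the same separator has to be given when they are read back later in the project. The sketch below shows one way this might look. It uses an in-memory buffer and two made-up placeholder rows instead of a real file under `./data/artists/`, and the choice of `dtype=str` and `keep_default_na=False` is an assumption on my part to keep the empty-string defaults from turning into `NaN`.

```python
import io

import pandas as pd

# Made-up placeholder rows with the same keys as the scraper's dictionary
rows = [
    {"title": "Example Title, Part 1", "year": "1900", "artist": "Example Artist",
     "style": "example style", "genre": "example genre", "media": "oil, canvas",
     "url": "https://example.com/image-1.jpg"},
    {"title": "Example Title 2", "year": "", "artist": "Example Artist",
     "style": "", "genre": "", "media": "",
     "url": "https://example.com/image-2.jpg"},
]

# write to an in-memory buffer in the same way the notebook writes to disk
buffer = io.StringIO()
pd.DataFrame(rows).to_csv(buffer, sep=";")
buffer.seek(0)

# read it back: same separator, first column is the saved row index,
# every value kept as text and empty fields kept as empty strings
df = pd.read_csv(buffer, sep=";", index_col=0, dtype=str, keep_default_na=False)

print(df[["title", "year", "media"]])
```

Keeping `year` as text avoids it being converted to a floating-point column when some artworks have no year.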