**Text Scraping**

This notebook contains the code used to scrape the description pages of the Rhizome and MOMA websites. The resulting text is used for some visualisations as well as for storytelling. For Rhizome, the scraped text also helps derive medium information for the artworks, based on keywords such as html, java and flash.

In [1]:
#imports and path 
from __future__ import print_function
from bs4 import BeautifulSoup
import requests
import pandas as pd
path = '/Users/laurentfintoni/Desktop/University/COURSE DOCS/YEAR 2/EPUB/PROJECT/EPDS/'

**Rhizome Scraping**
- After some trials with the HTML structure of the Rhizome artwork pages, we decided to grab the div that contains all three possible descriptions (summary, artist statement, description) and clean the results afterwards

In [7]:
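#function to scrape summaries, descriptions and artist statements from the Rhizome website
def url_to_text_rhizome(url):
    page = requests.get(url).text
    soup = BeautifulSoup(page, "html.parser")
    #the accordion div holds all three possible descriptions
    container = soup.find(id="AccordionDescriptionBody")
    #return an empty list when the page has no description block
    accordion = [p.text.strip() for p in container.find_all('div')] if container is not None else []
    print(url)
    return accordion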

In [None]:
#grab URLs from artworks DF, scrape them and return them back to the DF
rhz_artworks_extra = pd.read_pickle(path+'Rhizome_data/rhizome_artworks_extra.pkl')
urls = rhz_artworks_extra['URL'].to_list()
scrapes = [url_to_text_rhizome(u) for u in urls]
rhz_artworks_extra_text = rhz_artworks_extra.copy()
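#reuse the DF index so each scrape lands on its own row even if the index is not 0..n-1
rhz_artworks_extra_text['Text'] = pd.Series(scrapes, index=rhz_artworks_extra_text.index)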
#fix an erroneous ID in original first round of scraping 
rhz_artworks_extra_text.loc[777, 'ID'] = '926, 1268'
rhz_artworks_extra_text = rhz_artworks_extra_text.astype(str)
rhz_artworks_extra_text.to_pickle(path+'Rhizome_data/rhizome_artworks_extra_text.pkl')
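
The list comprehension in the cell above stops at the first URL that raises (for example on a network error), and everything scraped up to that point is lost. A minimal sketch of a more forgiving loop follows. `scrape_all`, its `scrape_fn` argument and the `delay` parameter are hypothetical names that do not appear in the original notebook, and the one-second pause is an assumed courtesy value.

```python
import time

def scrape_all(urls, scrape_fn, delay=1.0):
    """Apply scrape_fn to every URL, carrying on when one of them fails.

    Returns a list aligned with urls; failed URLs get an empty list.
    """
    results = []
    for url in urls:
        try:
            results.append(scrape_fn(url))
        except Exception as err:  # e.g. a connection error
            print('failed:', url, repr(err))
            results.append([])
        time.sleep(delay)  # assumed pause between requests
    return results
```

Under these assumptions the scraping line would become `scrapes = scrape_all(urls, url_to_text_rhizome)`, and rows with an empty list can be retried later.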

**MOMA Scraping**
- For MOMA URLs there was only one possible description on the page, but the containing div is repeated elsewhere, so we used its parent container to extract only what we needed
- We show the process for one department only, but it was repeated for all of them
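
The repetition across departments could be scripted rather than done by hand. The sketch below only builds the input and output file names, following the naming pattern of the department pickles used in this notebook (`photo_mod.pkl` in, `photo_mod_text.pkl` out). `DEPARTMENTS`, `PERIODS` and `department_jobs` are hypothetical names introduced for illustration.

```python
from pathlib import Path

# File-name stems and period suffixes of the department pickles.
DEPARTMENTS = ['architecture_design', 'architecture_design_img', 'draws_prints',
               'films', 'fluxus', 'media_perf', 'paint_sculp', 'photo']
PERIODS = ['cont', 'mod']

def department_jobs(base_dir):
    """Return (input_pickle, output_pickle) path pairs for every department."""
    base = Path(base_dir)
    jobs = []
    for dept in DEPARTMENTS:
        for period in PERIODS:
            stem = f'{dept}_{period}'
            jobs.append((base / f'{stem}.pkl', base / f'{stem}_text.pkl'))
    return jobs

jobs = department_jobs('MOMA_data/pickle/departments')
print(len(jobs), jobs[-1][1].name)

# Each pair could then be processed with the notebook's own steps, e.g.:
#   df = pd.read_pickle(src)
#   df['Text'] = [str(url_to_text_moma(u)) for u in df['URL']]
#   df.to_pickle(dst)
```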

In [3]:
#load department DFs
moma_arch_cont = pd.read_pickle('MOMA_data/pickle/departments/architecture_design_cont.pkl')
moma_arch_mod = pd.read_pickle('MOMA_data/pickle/departments/architecture_design_mod.pkl')
moma_design_cont = pd.read_pickle('MOMA_data/pickle/departments/architecture_design_img_cont.pkl')
moma_design_mod = pd.read_pickle('MOMA_data/pickle/departments/architecture_design_img_mod.pkl')
moma_draw_cont = pd.read_pickle('MOMA_data/pickle/departments/draws_prints_cont.pkl')
moma_draw_mod = pd.read_pickle('MOMA_data/pickle/departments/draws_prints_mod.pkl')
moma_films_cont = pd.read_pickle('MOMA_data/pickle/departments/films_cont.pkl')
moma_films_mod = pd.read_pickle('MOMA_data/pickle/departments/films_mod.pkl')
moma_fluxus_cont = pd.read_pickle('MOMA_data/pickle/departments/fluxus_cont.pkl')
moma_fluxus_mod = pd.read_pickle('MOMA_data/pickle/departments/fluxus_mod.pkl')
moma_media_cont = pd.read_pickle('MOMA_data/pickle/departments/media_perf_cont.pkl')
moma_media_mod = pd.read_pickle('MOMA_data/pickle/departments/media_perf_mod.pkl')
moma_paint_cont = pd.read_pickle('MOMA_data/pickle/departments/paint_sculp_cont.pkl')
moma_paint_mod = pd.read_pickle('MOMA_data/pickle/departments/paint_sculp_mod.pkl')
moma_photo_cont = pd.read_pickle('MOMA_data/pickle/departments/photo_cont.pkl')
moma_photo_mod = pd.read_pickle('MOMA_data/pickle/departments/photo_mod.pkl')

In [4]:
#function to extract text if available w/ exclusions for 404 status, missing URLs, and pages w/ no description
def url_to_text_moma(url):
    if url != 'missing':
        page = requests.get(url)
        status = page.status_code
        if status != 404:
            soup = BeautifulSoup(page.content, "html.parser")
            try:
                #look up the description container once, inside the work's parent container
                main = soup.find(class_="uneven-columns--work").find(class_='main-content')
                text = main.find_all('p') if main is not None else ''
            except AttributeError:
                #parent container not found: no description on this page
                text = ''
        else:
            text = '404'
    else:
        text = 'missing'     
    print(url)
    return text

In [47]:
#grab all links from a dept as a list
links = moma_photo_mod['URL'].to_list()
#process links w/ function 
moma_photo_to_add = [url_to_text_moma(u) for u in links]
#add results back to a copy of the original DF
moma_photo_mod_text = moma_photo_mod.copy()
moma_photo_mod_text['Text'] = moma_photo_to_add
moma_photo_mod_text['Text'] = moma_photo_mod_text['Text'].astype(str)
moma_photo_mod_text.to_pickle(path+'MOMA_data/pickle/departments/photo_mod_text.pkl')
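
Because `url_to_text_moma` encodes failures as sentinel values, the stored `Text` column can be tallied to see how many pages actually yielded a description. A small sketch, assuming the column has already been converted with `astype(str)` as in the cell above. `classify_outcome`, its labels and the sample values are hypothetical and only illustrate the idea.

```python
from collections import Counter

def classify_outcome(text):
    """Map a stored Text value to a coarse scrape outcome."""
    if text == 'missing':
        return 'no URL'
    if text == '404':
        return 'page not found'
    if text in ('', '[]'):  # no container, or a container without <p> tags
        return 'no description'
    return 'description found'

# Made-up values standing in for moma_photo_mod_text['Text'].
sample = ['missing', '404', '', '[]', '[<p>Gelatin silver print.</p>]']
print(Counter(classify_outcome(t) for t in sample))
```

On the real data the same function could be applied with `moma_photo_mod_text['Text'].map(classify_outcome).value_counts()`.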

**Rhizome Cleaning**
- Clean the text of HTML parsing errors
- Analyse the most common words to build a list of custom stopwords
- Run the text through the NLTK English stopwords and the custom ones
- Extract the first 20 keywords (alphabetic, non-stopword tokens) from each text and add them as a new column (used to assess the mediums of artworks)
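
The steps above can be sketched without pandas or NLTK on a couple of toy strings. Everything here is illustrative: the sample texts, the tiny `STOP_WORDS` set (a stand-in for the NLTK English list plus custom words) and the function names are assumptions, and the simple regex tokenizer only approximates NLTK's tokenizer.

```python
import re
from collections import Counter

# Stand-in for NLTK's English stopwords plus custom ones (assumed values).
STOP_WORDS = {'the', 'a', 'is', 'in', 'and', 'of', 'rhizome', 'summary'}

def clean(text):
    """Lowercase and drop escaped newlines/tabs left over from parsing."""
    return text.replace('\\n', ' ').replace('\\t', ' ').lower()

def top_words(texts, n=50):
    """Most common whitespace-separated words across all texts."""
    return Counter(' '.join(texts).split()).most_common(n)

def keywords(text, limit=20):
    """First `limit` alphabetic tokens that are not stopwords."""
    tokens = re.findall(r'[a-z]+', text)
    return [t for t in tokens if t not in STOP_WORDS][:limit]

texts = ['Summary\\n\\nThe work is a Flash animation', 'A Java applet and HTML page']
cleaned = [clean(t) for t in texts]
print(top_words(cleaned, 3))          # frequent words hint at custom stopwords
print([keywords(t) for t in cleaned])
```

Inspecting the most frequent words first is what makes the custom stopword list possible: words that dominate the counts but carry no meaning for the artworks are the ones worth adding.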

In [12]:
#reload the scraped Rhizome text
rhz_artworks_extra_text = pd.read_pickle(path+'Rhizome_data/rhizome_artworks_extra_text.pkl')


In [None]:
#create a list of strings to remove from all scrapes
remove = ['description edit\\\\n\\\\n\\\\t\\\\t\\\\t\\\\t', '[', ']', '\\\\n\\\\n\\\\n\\\\', '\'description edit\', ', 'edit\\\\n\\\\n', '\'summary edit\', ', 'tttt', 'nn']
for char in remove:
    rhz_artworks_extra_text['Text'] = rhz_artworks_extra_text['Text'].str.replace(char, '')

#export cleaned version to pickle
rhz_artworks_extra_text_clean = rhz_artworks_extra_text.copy()
rhz_artworks_extra_text_clean.to_pickle(path+'Rhizome_data/rhizome_artworks_extra_text_clean.pkl')
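
The backslash-heavy patterns in the cell above most likely exist because the scraped lists were stored with `astype(str)`, which turns real newlines and tabs into visible escape sequences. A minimal sketch of that effect on one made-up scrape, followed by an alternative that joins the list before storing it (an assumption about how the problem could be avoided, not what the notebook does):

```python
# A made-up scrape result shaped like the output of url_to_text_rhizome.
scraped = ['description edit\n\n\t\t\t\tA net art piece.']

# str() on a list uses repr() for each item, so a newline becomes backslash + 'n'.
as_text = str(scraped)
print(as_text)

# Literal replacement of the leftovers (tokens chosen for this toy example).
cleaned = as_text
for token in ['description edit\\n\\n\\t\\t\\t\\t', '[', ']']:
    cleaned = cleaned.replace(token, '')
cleaned = cleaned.strip("'")
print(cleaned)

# Alternative: join the pieces and collapse whitespace, so no escapes are created.
joined = ' '.join(' '.join(scraped).split())
print(joined)
```

Note that `Series.str.replace` treated patterns as regular expressions by default before pandas 2.0 and treats them as literal strings from 2.0 onward, so passing `regex=` explicitly keeps the cleaning cell's behaviour independent of the installed version.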

In [None]:
#check top 50 keywords 
common = pd.Series(' '.join(rhz_artworks_extra_text_clean['Text']).split()).value_counts()[:50]
common

In [17]:
#imports for cleaning and keyword extraction
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import RegexpTokenizer
from nltk.stem.wordnet import WordNetLemmatizer

In [18]:
# create a list of stop words and add custom stopwords
stop_words = set(stopwords.words("english"))
new_words = ['the', 'a', 'rhizome', '\'attributed', 'summary', 'to:', '\'inception:', 'staff\',', '\'attribution:', '\'summary', 'staffinception:', '-', '2021\',', '\'legacy', '2001\',']
stop_words = stop_words.union(new_words)

In [19]:
#remove stopwords from cleaned text
rhz_artworks_extra_text_clean['Text'] = rhz_artworks_extra_text_clean['Text'].str.lower()
rhz_artworks_extra_text_clean['Text'] = rhz_artworks_extra_text_clean['Text'].apply(lambda x: ' '.join([item for item in x.split() if item not in stop_words]))

In [None]:
#check top 50 keywords again
common_clean = pd.Series(' '.join(rhz_artworks_extra_text_clean['Text']).split()).value_counts()[:50]
common_clean

In [23]:
#function to extract keywords (the first 20 alphabetic, non-stopword tokens) and pass them to a new column
def get_keywords(row):
    some_text = row['Text']
    tokens = nltk.tokenize.word_tokenize(some_text)
    keywords = [keyword for keyword in tokens if keyword.isalpha() and keyword not in stop_words]
    keywords_string = ', '.join(keywords[0:20])
    return keywords_string

rhz_artworks_extra_text_clean['Keywords'] = rhz_artworks_extra_text_clean.apply(get_keywords, axis=1)

In [26]:
#export to new pickle for reuse
rhz_artworks_extra_text_clean.to_pickle(path+'Rhizome_data/rhizome_artworks_extra_text_clean_stop_keywords.pkl')

In [None]:
#check which artworks have 'html' among their keywords
rhz_artworks_extra_text_clean.loc[rhz_artworks_extra_text_clean['Keywords'].str.contains('html', case=False)]
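
The same keyword check can be generalised to several mediums at once. The sketch below is only one possible approach: `MEDIUM_TERMS` and `detect_mediums` are hypothetical names, and the three trigger words are the examples named in the introduction (html, java, flash).

```python
# Hypothetical mapping from medium labels to trigger keywords.
MEDIUM_TERMS = {
    'HTML': {'html'},
    'Java': {'java'},
    'Flash': {'flash'},
}

def detect_mediums(keywords_string):
    """Return the medium labels whose trigger words occur as whole keywords."""
    words = {w.strip().lower() for w in keywords_string.split(',')}
    return [label for label, terms in MEDIUM_TERMS.items() if words & terms]

print(detect_mediums('interactive, html, page, flash'))
print(detect_mediums('javascript, browser'))
```

Matching whole keywords rather than substrings is a deliberate choice here: a substring test such as `str.contains('java')` would also fire on "javascript".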

**MOMA Cleaning**