**Text Scraping**

This notebook contains the code used to scrape the artwork description pages of the Rhizome and MoMA websites. The resulting text is used for some visualisations as well as for storytelling. For Rhizome, the extracted text was also used to help derive medium information for the artworks (based on keywords such as html, java, flash, etc.).

In [1]:
#imports and path 
from __future__ import print_function
from bs4 import BeautifulSoup
import requests
import pandas as pd
#imports for cleaning and keyword extraction
import re
import nltk
from nltk.corpus import stopwords
from nltk import word_tokenize 
from nltk.util import ngrams
path = '../'

In [58]:
#FUNCTIONS

#function to scrape summaries, descriptions and artists statements from Rhizome website 
def url_to_text_rhizome(url):
    page = requests.get(url).text
    soup = BeautifulSoup(page, "html.parser")
    accordion = [p.text.strip() for p in soup.find(id="AccordionDescriptionBody").find_all('div')]
    print(url)
    return accordion

#function to extract text from MoMA website w/ exclusions for 404 status, missing URLs, and pages w/ no description
def url_to_text_moma(url):
    if url != 'missing':
        page = requests.get(url)
        status = page.status_code
        if status != 404:
            soup = BeautifulSoup(page.content, "html.parser")
            try:
                #look inside the parent container only, since the description div is repeated elsewhere on the page
                main_content = soup.find(class_="uneven-columns--work").find(class_='main-content')
                if main_content is not None:
                    text = main_content.find_all('p')
                else:
                    text = ''
            except AttributeError:
                #parent container not found on the page
                text = ''
        else:
            text = '404'
    else:
        text = 'missing'
    print(url)
    return text

#function to extract keywords and pass them to a new column
#note: relies on the global stop_words set defined in a later cell
def get_keywords(row):
    punct_tokenizer = nltk.RegexpTokenizer(r"\w+")
    some_text = row['Text']
    tokens = punct_tokenizer.tokenize(some_text)
    keywords = [keyword for keyword in tokens if keyword.isalpha() and keyword not in stop_words]
    freqdist = nltk.FreqDist(keywords)
    #keep the 20 most frequent words
    most_common_list = [word for word, count in freqdist.most_common(20)]
    keywords_string = ', '.join(most_common_list)
    return keywords_string


**Rhizome Scraping**
- After some trials with the HTML structure of the Rhizome artwork pages, we decided to grab the div that contains all three possible descriptions (summary, artist statement, description) and clean the results afterwards

In [None]:
#grab URLs from artworks DF, scrape them and return them back to the DF
rhz_artworks_extra = pd.read_pickle(path+'Rhizome_data/rhizome_artworks_extra.pkl')
urls = rhz_artworks_extra['URL'].to_list()
scrapes = [url_to_text_rhizome(u) for u in urls]
rhz_artworks_extra_text = rhz_artworks_extra.copy()
rhz_artworks_extra_text['Text'] = pd.Series(scrapes)
#fix an erroneous ID in original first round of scraping 
rhz_artworks_extra_text.loc[777, 'ID'] = '926, 1268'
rhz_artworks_extra_text = rhz_artworks_extra_text.astype(str)
rhz_artworks_extra_text.to_pickle(path+'Rhizome_data/rhizome_artworks_extra_text.pkl')
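
The list comprehension above stops at the first page that fails. For example, if a page has no `AccordionDescriptionBody` element, `soup.find(...)` returns `None` and the chained `.find_all` raises `AttributeError`. A minimal sketch of one way to guard against this, assuming a hypothetical helper `scrape_all` and a placeholder value `'error'`, neither of which is part of the original notebook:

```python
import time

def scrape_all(urls, scrape_fn, delay=0.0, placeholder='error'):
    """Apply scrape_fn to every URL, recording a placeholder when a page fails.

    scrape_all, delay and placeholder are illustrative names, not part of the notebook.
    """
    results = []
    for url in urls:
        try:
            results.append(scrape_fn(url))
        except Exception as exc:  # e.g. AttributeError for a missing div, or a network error
            print('failed:', url, repr(exc))
            results.append(placeholder)
        if delay:
            time.sleep(delay)  # optional pause between requests
    return results


# stand-in scraper so the sketch runs without network access
def fake_scraper(url):
    if 'broken' in url:
        raise AttributeError("'NoneType' object has no attribute 'find_all'")
    return ['text for ' + url]

demo = scrape_all(['page-1', 'broken-page', 'page-2'], fake_scraper)
print(demo)
```

With the notebook's own function this could be called as `scrape_all(urls, url_to_text_rhizome)`, which keeps the result list the same length as the DF even when some pages fail.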

**MOMA Scraping**
- For MoMA URLs there was only one possible description on the page, but the containing div is repeated elsewhere, so we used its parent container to extract only what we needed
- We show the process for one dept only, but this was repeated for all of them 
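
Since the same steps were repeated for every department, the input and output paths could be generated from the file stems rather than written out by hand. A minimal sketch, assuming a hypothetical helper `dept_paths`; the stems are the ones used in the load cell, and the `_text.pkl` output suffix follows the `photo_mod_text.pkl` example (at least one file in this notebook, `architecture_design_mod_text_only.pkl`, deviates from that pattern):

```python
from itertools import product

# file stems and period suffixes as they appear in the department pickles
DEPARTMENTS = ['architecture_design', 'architecture_design_img', 'draws_prints', 'films',
               'fluxus', 'media_perf', 'paint_sculp', 'photo']
PERIODS = ['cont', 'mod']

def dept_paths(path='../'):
    """Yield (input_pickle, output_pickle) pairs for every department and period."""
    base = path + 'MOMA_data/pickle/departments/'
    for dept, period in product(DEPARTMENTS, PERIODS):
        stem = dept + '_' + period
        yield base + stem + '.pkl', base + stem + '_text.pkl'

pairs = list(dept_paths())
print(len(pairs))
print(pairs[-1])
```

Each pair could then go through the same read, scrape and pickle steps shown for the photography department.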

In [10]:
#load department DFs
moma_arch_cont = pd.read_pickle(path+'MOMA_data/pickle/departments/architecture_design_cont.pkl')
moma_arch_mod = pd.read_pickle(path+'MOMA_data/pickle/departments/architecture_design_mod.pkl')
moma_design_cont = pd.read_pickle(path+'MOMA_data/pickle/departments/architecture_design_img_cont.pkl')
moma_design_mod = pd.read_pickle(path+'MOMA_data/pickle/departments/architecture_design_img_mod.pkl')
moma_draw_cont = pd.read_pickle(path+'MOMA_data/pickle/departments/draws_prints_cont.pkl')
moma_draw_mod = pd.read_pickle(path+'MOMA_data/pickle/departments/draws_prints_mod.pkl')
moma_films_cont = pd.read_pickle(path+'MOMA_data/pickle/departments/films_cont.pkl')
moma_films_mod = pd.read_pickle(path+'MOMA_data/pickle/departments/films_mod.pkl')
moma_fluxus_cont = pd.read_pickle(path+'MOMA_data/pickle/departments/fluxus_cont.pkl')
moma_fluxus_mod = pd.read_pickle(path+'MOMA_data/pickle/departments/fluxus_mod.pkl')
moma_media_cont = pd.read_pickle(path+'MOMA_data/pickle/departments/media_perf_cont.pkl')
moma_media_mod = pd.read_pickle(path+'MOMA_data/pickle/departments/media_perf_mod.pkl')
moma_paint_cont = pd.read_pickle(path+'MOMA_data/pickle/departments/paint_sculp_cont.pkl')
moma_paint_mod = pd.read_pickle(path+'MOMA_data/pickle/departments/paint_sculp_mod.pkl')
moma_photo_cont = pd.read_pickle(path+'MOMA_data/pickle/departments/photo_cont.pkl')
moma_photo_mod = pd.read_pickle(path+'MOMA_data/pickle/departments/photo_mod.pkl')

In [47]:
#grab all links from a dept as a list
links = moma_photo_mod['URL'].to_list()
#process links w/ function 
moma_photo_to_add = [url_to_text_moma(u) for u in links]
#add results back to a copy of the original DF
moma_photo_mod_text = moma_photo_mod.copy()
moma_photo_mod_text['Text'] = moma_photo_to_add
moma_photo_mod_text['Text'] = moma_photo_mod_text['Text'].astype(str)
moma_photo_mod_text.to_pickle(path+'MOMA_data/pickle/departments/photo_mod_text.pkl')

**Rhizome Cleaning & Keywords**
- Clean the text of html parsing errors 
- Analyse most common words for custom stopwords
- Run text through the NLTK English stopwords and custom ones
- Extract 20 keywords for each text and add them as a new column (to use for assessing mediums of artworks)
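
To illustrate how the keyword column can feed the medium assessment, here is a minimal sketch that checks a keyword string against a small lookup. The `MEDIUM_TERMS` mapping and the `guess_mediums` helper are assumptions for illustration only; the notebook does not specify the actual list of terms beyond examples such as html, java and flash.

```python
# hypothetical lookup: keyword -> medium label (illustrative, not the project's actual list)
MEDIUM_TERMS = {
    'html': 'HTML',
    'java': 'Java',
    'flash': 'Flash',
}

def guess_mediums(keywords_string):
    """Return the sorted medium labels whose term appears in a ', '-joined keyword string."""
    keywords = {kw.strip().lower() for kw in keywords_string.split(',')}
    return sorted({label for term, label in MEDIUM_TERMS.items() if term in keywords})

print(guess_mediums('browser, html, flash, animation'))
print(guess_mediums('painting, canvas'))
```

Matching is on whole keywords, so a keyword such as `javascript` would not be counted as `java` in this sketch.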

In [59]:
#load the DF w/ scraped text
rhz_artworks_extra_text = pd.read_pickle(path+'Rhizome_data/rhizome_artworks_extra_text.pkl')
rhz_artworks_extra_text = rhz_artworks_extra_text.astype('string')

In [60]:
#remove all punctuation
rhz_artworks_extra_text['Text'] = rhz_artworks_extra_text['Text'].replace(regex=r'[^\w\s]', value='')

#create a list of strings to remove from all scrapes
remove = ['description editnntttt', '[', ']', 'nnn', 'description edit', 'editnn', 'summary edit', 'tttt', 'nn']
for char in remove:
    #regex=False so each entry is matched as a literal string rather than a regex pattern
    rhz_artworks_extra_text['Text'] = rhz_artworks_extra_text['Text'].str.replace(char, '', regex=False)

#export cleaned version to pickle
rhz_artworks_extra_text_clean = rhz_artworks_extra_text.copy()
rhz_artworks_extra_text_clean.to_pickle(path+'Rhizome_data/rhizome_artworks_extra_text_clean.pkl')


In [61]:
# create a list of stop words and add custom stopwords
stop_words = set(stopwords.words("english"))
new_words =  ['rhizome', 'attributed', 'summary', 'inception', 'staff', 'attribution', 'summary', 'legacy', 'nattributed']
stop_words = stop_words.union(new_words)

In [62]:
#remove stopwords from cleaned text
rhz_artworks_extra_text_clean['Text'] = rhz_artworks_extra_text_clean['Text'].str.lower()
rhz_artworks_extra_text_clean['Text'] = rhz_artworks_extra_text_clean['Text'].apply(lambda x: ' '.join([item for item in x.split() if item not in stop_words]))
#remove errant strings
rhz_artworks_extra_text_clean['Text'] = rhz_artworks_extra_text_clean['Text'].replace(regex={r'.attributed.': '', r'.legacy descriptive tags.': '', r'.attribution: staff.': '', r'.attribution.': '', r'.inception.': '', r'http[a-z]*\s': ' '})

In [63]:
#extract kws and assign them to new column
rhz_artworks_extra_text_clean['Keywords'] = rhz_artworks_extra_text_clean.apply(lambda row:get_keywords(row), axis=1) 

In [64]:
#export new pickle for reuse
rhz_artworks_extra_text_clean.to_pickle(path+'Rhizome_data/rhizome_artworks_extra_text_clean_stop_keywords.pkl')

**MOMA Cleaning**
- Clean the text 
- Remove empty entries and duplicates from DFs

In [76]:
#load DFs with text
moma_arch_cont_text = pd.read_pickle(path+'MOMA_data/pickle/departments/architecture_design_cont_text.pkl')
moma_arch_mod_text = pd.read_pickle(path+'MOMA_data/pickle/departments/architecture_design_mod_text_only.pkl')
moma_draw_cont_text = pd.read_pickle(path+'MOMA_data/pickle/departments/draws_prints_cont_text.pkl')
moma_draw_mod_text = pd.read_pickle(path+'MOMA_data/pickle/departments/draws_prints_mod_text.pkl')
moma_films_cont_text = pd.read_pickle(path+'MOMA_data/pickle/departments/films_cont_text.pkl')
moma_films_mod_text = pd.read_pickle(path+'MOMA_data/pickle/departments/films_mod_text.pkl')
moma_media_cont_text = pd.read_pickle(path+'MOMA_data/pickle/departments/media_perf_cont_text.pkl')
moma_media_mod_text = pd.read_pickle(path+'MOMA_data/pickle/departments/media_perf_mod_text.pkl')
moma_paint_cont_text = pd.read_pickle(path+'MOMA_data/pickle/departments/paint_sculp_cont_text.pkl')
moma_paint_mod_text = pd.read_pickle(path+'MOMA_data/pickle/departments/paint_sculp_mod_text.pkl')
moma_photo_cont_text = pd.read_pickle(path+'MOMA_data/pickle/departments/photo_cont_text.pkl')
moma_photo_mod_text = pd.read_pickle(path+'MOMA_data/pickle/departments/photo_mod_text.pkl')


In [77]:
#DFs to list then clean them 
moma_texts = [moma_arch_cont_text, moma_draw_cont_text, moma_draw_mod_text, moma_films_cont_text, moma_films_mod_text, moma_media_cont_text, moma_media_mod_text, moma_paint_cont_text, moma_paint_mod_text, moma_photo_cont_text, moma_photo_mod_text]

remove = ['[', ']', '</p>', '<p>', '<strong>', '</strong>', '<em>', '</em>', '</br>']
for text in moma_texts:
    for char in remove:
        #regex=False so each entry is matched as a literal string rather than a regex pattern
        text['Text'] = text['Text'].str.replace(char, '', regex=False)

#exception for DF that was formatted differently due to scraping error (it has one column only w/ scraped text)
for char in remove:
    moma_arch_mod_text['text'] = moma_arch_mod_text['text'].str.replace(char, '', regex=False)


In [78]:
#replace empty strings w/ missing value to filter useful results only
moma_arch_cont_text['Text'] = moma_arch_cont_text['Text'].replace([''], 'missing')
moma_arch_mod_text['text'] = moma_arch_mod_text['text'].replace([''], 'missing')
moma_draw_cont_text['Text'] = moma_draw_cont_text['Text'].replace([''], 'missing')
moma_draw_mod_text['Text'] = moma_draw_mod_text['Text'].replace([''], 'missing')
moma_films_cont_text['Text'] = moma_films_cont_text['Text'].replace([''], 'missing')
moma_films_mod_text['Text'] = moma_films_mod_text['Text'].replace([''], 'missing')
moma_media_cont_text['Text'] = moma_media_cont_text['Text'].replace([''], 'missing')
moma_media_mod_text['Text'] = moma_media_mod_text['Text'].replace([''], 'missing')
moma_paint_cont_text['Text'] = moma_paint_cont_text['Text'].replace([''], 'missing')
moma_paint_mod_text['Text'] = moma_paint_mod_text['Text'].replace([''], 'missing')
moma_photo_cont_text['Text'] = moma_photo_cont_text['Text'].replace([''], 'missing')
moma_photo_mod_text['Text'] = moma_photo_mod_text['Text'].replace([''], 'missing')

In [79]:
#save cleaned DFs back 
moma_arch_cont_text.to_pickle(path+'MOMA_data/pickle/departments/architecture_design_cont_text.pkl')
moma_arch_mod_text.to_pickle(path+'MOMA_data/pickle/departments/architecture_design_mod_text_only.pkl')
moma_draw_cont_text.to_pickle(path+'MOMA_data/pickle/departments/draws_prints_cont_text.pkl')
moma_draw_mod_text.to_pickle(path+'MOMA_data/pickle/departments/draws_prints_mod_text.pkl')
moma_films_cont_text.to_pickle(path+'MOMA_data/pickle/departments/films_cont_text.pkl')
moma_films_mod_text.to_pickle(path+'MOMA_data/pickle/departments/films_mod_text.pkl')
moma_media_cont_text.to_pickle(path+'MOMA_data/pickle/departments/media_perf_cont_text.pkl')
moma_media_mod_text.to_pickle(path+'MOMA_data/pickle/departments/media_perf_mod_text.pkl')
moma_paint_cont_text.to_pickle(path+'MOMA_data/pickle/departments/paint_sculp_cont_text.pkl')
moma_paint_mod_text.to_pickle(path+'MOMA_data/pickle/departments/paint_sculp_mod_text.pkl')
moma_photo_cont_text.to_pickle(path+'MOMA_data/pickle/departments/photo_cont_text.pkl')
moma_photo_mod_text.to_pickle(path+'MOMA_data/pickle/departments/photo_mod_text.pkl')

In [80]:
#reduce DFs down to entries w/ only useful text and remove duplicates 
moma_arch_cont_text_final = moma_arch_cont_text.loc[moma_arch_cont_text['Text'] != 'missing']
moma_arch_cont_text_final = moma_arch_cont_text_final.drop_duplicates(subset=['Text'])
moma_arch_mod_text_final = moma_arch_mod_text.loc[moma_arch_mod_text['text'] != 'missing']
moma_arch_mod_text_final = moma_arch_mod_text_final.drop_duplicates(subset=['text'])
moma_draw_cont_text_final = moma_draw_cont_text.loc[moma_draw_cont_text['Text'] != 'missing']
moma_draw_cont_text_final = moma_draw_cont_text_final.drop_duplicates(subset=['Text'])
moma_draw_mod_text_final = moma_draw_mod_text.loc[moma_draw_mod_text['Text'] != 'missing']
moma_draw_mod_text_final = moma_draw_mod_text_final.drop_duplicates(subset=['Text'])
moma_films_cont_text_final = moma_films_cont_text.loc[moma_films_cont_text['Text'] != 'missing']
moma_films_cont_text_final = moma_films_cont_text_final.drop_duplicates(subset=['Text'])
moma_films_mod_text_final = moma_films_mod_text.loc[moma_films_mod_text['Text'] != 'missing']
moma_films_mod_text_final = moma_films_mod_text_final.drop_duplicates(subset=['Text'])
moma_media_cont_text_final = moma_media_cont_text.loc[moma_media_cont_text['Text'] != 'missing']
moma_media_cont_text_final = moma_media_cont_text_final.drop_duplicates(subset=['Text'])
moma_media_mod_text_final = moma_media_mod_text.loc[moma_media_mod_text['Text'] != 'missing']
moma_media_mod_text_final = moma_media_mod_text_final.drop_duplicates(subset=['Text'])
moma_paint_cont_text_final = moma_paint_cont_text.loc[moma_paint_cont_text['Text'] != 'missing']
moma_paint_cont_text_final = moma_paint_cont_text_final.drop_duplicates(subset=['Text'])
moma_paint_mod_text_final = moma_paint_mod_text.loc[moma_paint_mod_text['Text'] != 'missing']
moma_paint_mod_text_final = moma_paint_mod_text_final.drop_duplicates(subset=['Text'])
moma_photo_cont_text_final = moma_photo_cont_text.loc[moma_photo_cont_text['Text'] != 'missing']
moma_photo_cont_text_final = moma_photo_cont_text_final.drop_duplicates(subset=['Text'])
moma_photo_mod_text_final = moma_photo_mod_text.loc[moma_photo_mod_text['Text'] != 'missing']
moma_photo_mod_text_final = moma_photo_mod_text_final.drop_duplicates(subset=['Text'])

In [81]:
#pickle final results for visualisation 
moma_arch_cont_text_final.to_pickle(path+'MOMA_data/pickle/departments/architecture_design_cont_text_final.pkl')
moma_arch_mod_text_final.to_pickle(path+'MOMA_data/pickle/departments/architecture_design_mod_text_only_final.pkl')
moma_draw_cont_text_final.to_pickle(path+'MOMA_data/pickle/departments/draws_prints_cont_text_final.pkl')
moma_draw_mod_text_final.to_pickle(path+'MOMA_data/pickle/departments/draws_prints_mod_text_final.pkl')
moma_films_cont_text_final.to_pickle(path+'MOMA_data/pickle/departments/films_cont_text_final.pkl')
moma_films_mod_text_final.to_pickle(path+'MOMA_data/pickle/departments/films_mod_text_final.pkl')
moma_media_cont_text_final.to_pickle(path+'MOMA_data/pickle/departments/media_perf_cont_text_final.pkl')
moma_media_mod_text_final.to_pickle(path+'MOMA_data/pickle/departments/media_perf_mod_text_final.pkl')
moma_paint_cont_text_final.to_pickle(path+'MOMA_data/pickle/departments/paint_sculp_cont_text_final.pkl')
moma_paint_mod_text_final.to_pickle(path+'MOMA_data/pickle/departments/paint_sculp_mod_text_final.pkl')
moma_photo_cont_text_final.to_pickle(path+'MOMA_data/pickle/departments/photo_cont_text_final.pkl')
moma_photo_mod_text_final.to_pickle(path+'MOMA_data/pickle/departments/photo_mod_text_final.pkl')

In [82]:
#load stopwords and remove them
stop_words = set(stopwords.words("english"))
new_words =  ['.', 'one', 'two', 'also']
stop_words = stop_words.union(new_words)

moma_texts_final = [moma_arch_cont_text_final, moma_draw_cont_text_final, moma_draw_mod_text_final, moma_films_cont_text_final, moma_films_mod_text_final, moma_media_cont_text_final, moma_media_mod_text_final, moma_paint_cont_text_final, moma_paint_mod_text_final, moma_photo_cont_text_final, moma_photo_mod_text_final]

for text in moma_texts_final:
    text['Text'] = text['Text'].str.lower()
    text['Text'] = text['Text'].apply(lambda x: ' '.join([item for item in x.split() if item not in stop_words]))

#exception 
moma_arch_mod_text_final['text'] = moma_arch_mod_text_final['text'].str.lower()
moma_arch_mod_text_final['text'] = moma_arch_mod_text_final['text'].apply(lambda x: ' '.join([item for item in x.split() if item not in stop_words]))

In [83]:
#get keywords
#variant of get_keywords for the DF whose text column is named 'text' (lowercase)
def get_keywords_2(row):
    punct_tokenizer = nltk.RegexpTokenizer(r"\w+")
    some_text = row['text']
    tokens = punct_tokenizer.tokenize(some_text)
    keywords = [keyword for keyword in tokens if keyword.isalpha() and keyword not in stop_words]
    freqdist = nltk.FreqDist(keywords)
    #keep the 20 most frequent words
    most_common_list = [word for word, count in freqdist.most_common(20)]
    keywords_string = ', '.join(most_common_list)
    return keywords_string

moma_arch_cont_text_final['Keywords'] = moma_arch_cont_text_final.apply(lambda row:get_keywords(row), axis=1)
moma_draw_cont_text_final['Keywords'] = moma_draw_cont_text_final.apply(lambda row:get_keywords(row), axis=1) 
moma_draw_mod_text_final['Keywords'] = moma_draw_mod_text_final.apply(lambda row:get_keywords(row), axis=1) 
moma_films_cont_text_final['Keywords'] = moma_films_cont_text_final.apply(lambda row:get_keywords(row), axis=1)
moma_films_mod_text_final['Keywords'] = moma_films_mod_text_final.apply(lambda row:get_keywords(row), axis=1) 
moma_media_cont_text_final['Keywords'] = moma_media_cont_text_final.apply(lambda row:get_keywords(row), axis=1)
moma_media_mod_text_final['Keywords'] = moma_media_mod_text_final.apply(lambda row:get_keywords(row), axis=1)
moma_paint_cont_text_final['Keywords'] = moma_paint_cont_text_final.apply(lambda row:get_keywords(row), axis=1)
moma_paint_mod_text_final['Keywords'] = moma_paint_mod_text_final.apply(lambda row:get_keywords(row), axis=1)
moma_photo_cont_text_final['Keywords'] = moma_photo_cont_text_final.apply(lambda row:get_keywords(row), axis=1)
moma_photo_mod_text_final['Keywords'] = moma_photo_mod_text_final.apply(lambda row:get_keywords(row), axis=1)
moma_arch_mod_text_final['Keywords'] = moma_arch_mod_text_final.apply(lambda row:get_keywords_2(row), axis=1) 
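
`get_keywords` and `get_keywords_2` differ only in the column name they read. A minimal sketch of a single function that takes the column as a parameter is given here. It uses `re` and `collections.Counter` from the standard library as stand-ins for `nltk.RegexpTokenizer` and `nltk.FreqDist` so that it runs without NLTK, and the name `top_keywords` and the tiny stopword set are hypothetical:

```python
import re
from collections import Counter

def top_keywords(row, column='Text', stop_words=frozenset(), n=20):
    """Return the n most frequent alphabetic, non-stopword tokens in row[column]."""
    tokens = re.findall(r"\w+", row[column])
    keywords = [t for t in tokens if t.isalpha() and t not in stop_words]
    return ', '.join(word for word, _ in Counter(keywords).most_common(n))

# plain dicts stand in for DataFrame rows here
demo_stop = {'the', 'a', 'of'}
row_upper = {'Text': 'the net art of the net a browser work net browser'}
row_lower = {'text': 'video video installation 1999'}

print(top_keywords(row_upper, stop_words=demo_stop, n=2))
print(top_keywords(row_lower, column='text'))
```

With a DataFrame, the extra arguments can be passed through `apply`, for example `df.apply(top_keywords, axis=1, column='text', stop_words=stop_words)`.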


In [85]:
#pickle version w/ stopwords removed and keywords added
moma_arch_cont_text_final.to_pickle(path+'MOMA_data/pickle/departments/architecture_design_cont_text_final_stop.pkl')
moma_arch_mod_text_final.to_pickle(path+'MOMA_data/pickle/departments/architecture_design_mod_text_only_final_stop.pkl')
moma_draw_cont_text_final.to_pickle(path+'MOMA_data/pickle/departments/draws_prints_cont_text_final_stop.pkl')
moma_draw_mod_text_final.to_pickle(path+'MOMA_data/pickle/departments/draws_prints_mod_text_final_stop.pkl')
moma_films_cont_text_final.to_pickle(path+'MOMA_data/pickle/departments/films_cont_text_final_stop.pkl')
moma_films_mod_text_final.to_pickle(path+'MOMA_data/pickle/departments/films_mod_text_final_stop.pkl')
moma_media_cont_text_final.to_pickle(path+'MOMA_data/pickle/departments/media_perf_cont_text_final_stop.pkl')
moma_media_mod_text_final.to_pickle(path+'MOMA_data/pickle/departments/media_perf_mod_text_final_stop.pkl')
moma_paint_cont_text_final.to_pickle(path+'MOMA_data/pickle/departments/paint_sculp_cont_text_final_stop.pkl')
moma_paint_mod_text_final.to_pickle(path+'MOMA_data/pickle/departments/paint_sculp_mod_text_final_stop.pkl')
moma_photo_cont_text_final.to_pickle(path+'MOMA_data/pickle/departments/photo_cont_text_final_stop.pkl')
moma_photo_mod_text_final.to_pickle(path+'MOMA_data/pickle/departments/photo_mod_text_final_stop.pkl')