First, we install SPARQLWrapper and BeautifulSoup, the two packages used to download the multimodal features, along with requests and lxml, which the scraping code below relies on for HTTP access and HTML parsing.

In [27]:
!pip install SPARQLWrapper
!pip install beautifulsoup4 lxml requests



In [6]:
import pandas as pd
from SPARQLWrapper import SPARQLWrapper, JSON

Next, we read the extended mapping, which contains the DBpedia URL of each book.

In [7]:
extended_mapping = pd.read_csv('full_extended_dbbook_img_links.tsv', sep='\t')

In [8]:
extended_mapping

Unnamed: 0,id,name,dbpedia_url,wiki_url,img_url
0,1,Dragonfly in Amber,http://dbpedia.org/resource/Dragonfly_in_Amber,https://en.wikipedia.org/wiki/Dragonfly_in_Amber,https://upload.wikimedia.org/wikipedia/en/thum...
1,2,Drums of Autumn,http://dbpedia.org/resource/Drums_of_Autumn,https://en.wikipedia.org/wiki/Drums_of_Autumn,https://upload.wikimedia.org/wikipedia/en/thum...
2,3,Snow Crash,http://dbpedia.org/resource/Snow_Crash,https://en.wikipedia.org/wiki/Snow_Crash,https://upload.wikimedia.org/wikipedia/en/thum...
3,4,The Ponder Heart,http://dbpedia.org/resource/The_Ponder_Heart,https://en.wikipedia.org/wiki/The_Ponder_Heart,https://upload.wikimedia.org/wikipedia/en/thum...
4,5,The Shadow of the Torturer,http://dbpedia.org/resource/The_Shadow_of_the_...,https://en.wikipedia.org/wiki/The_Shadow_of_th...,https://upload.wikimedia.org/wikipedia/en/7/7f...
...,...,...,...,...,...
7583,8165,The Between,http://dbpedia.org/resource/The_Between,https://en.wikipedia.org/wiki/The_Between,https://upload.wikimedia.org/wikipedia/en/8/83...
7584,8166,Visions of Cody,http://dbpedia.org/resource/Visions_of_Cody,https://en.wikipedia.org/wiki/Visions_of_Cody,https://upload.wikimedia.org/wikipedia/en/thum...
7585,8167,Abel's Island,http://dbpedia.org/resource/Abel's_Island,https://en.wikipedia.org/wiki/Abel's_Island,https://upload.wikimedia.org/wikipedia/en/1/18...
7586,8168,Abhorsen,http://dbpedia.org/resource/Abhorsen,https://en.wikipedia.org/wiki/Abhorsen,https://upload.wikimedia.org/wikipedia/en/thum...


For each item, we use the DBpedia URI in two ways: we convert it into a Wikipedia URL and scrape the book cover from that page with BeautifulSoup, and we retrieve the book abstract from DBpedia through a SPARQL query.

In [14]:
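def get_dbpedia_abstract(book_uri):
    """Return the English abstract of a DBpedia resource, or a fallback message."""
    sparql = SPARQLWrapper("https://dbpedia.org/sparql")

    # Declare the dbo: prefix explicitly rather than relying on the endpoint's defaults.
    query = f"""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?abstract WHERE {{
        <{book_uri}> dbo:abstract ?abstract .
        FILTER (lang(?abstract) = "en")
    }}
    LIMIT 1
    """

    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)

    try:
        results = sparql.query().convert()
        # Each binding maps a variable name to a dict with a "value" key.
        for result in results["results"]["bindings"]:
            return result["abstract"]["value"]
        return "No abstract found."
    except Exception as e:
        # Network and endpoint errors are reported as text so the caller's loop can continue.
        return f"Error: {e}"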
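
The function above reads the abstract from the SPARQL JSON results layout, in which each row under results.bindings maps a variable name to an object with a value field. The sketch below illustrates that parsing step offline. The sample_response dictionary and the first_value helper are hypothetical: they only mimic the shape of what the endpoint might return and are not part of SPARQLWrapper.

```python
# Hypothetical response, shaped like SPARQL JSON results (not real endpoint output).
sample_response = {
    "head": {"vars": ["abstract"]},
    "results": {
        "bindings": [
            {"abstract": {"type": "literal", "xml:lang": "en", "value": "An example abstract."}}
        ]
    },
}

def first_value(results, var, default=None):
    """Return the value of `var` in the first binding that has it, else `default`."""
    for binding in results.get("results", {}).get("bindings", []):
        if var in binding:
            return binding[var]["value"]
    return default

print(first_value(sample_response, "abstract"))
print(first_value({"results": {"bindings": []}}, "abstract", default="No abstract found."))
```

Returning a default for an empty bindings list mirrors the "No abstract found." branch of get_dbpedia_abstract.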

In [None]:
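import requests
from bs4 import BeautifulSoup
import urllib.request
import time
import os

def convert_url(dbpedia_url):
    # The last path segment of the DBpedia URI is also the Wikipedia page title.
    dbpart = dbpedia_url.split('/')[-1]
    return 'https://en.wikipedia.org/wiki/' + dbpart

def get_book_image(dburl, item_id):
    # Some book pages are disambiguated with a suffix, so try each variant in turn.
    possible_suffixes = ["", "_(novel)", "_(book)"]
    filename = f'_images/{item_id}.jpg'

    # Skip items whose cover was already downloaded.
    if os.path.exists(filename):
        print(f'existing: {convert_url(dburl)}')
        return "", ""

    # Make sure the output directory exists before saving.
    os.makedirs('_images', exist_ok=True)

    for suffix in possible_suffixes:
        try:
            wikiurl = convert_url(dburl) + suffix
            r = requests.get(wikiurl)
            soup = BeautifulSoup(r.content, 'lxml')

            # Look for a default-size image span first, then fall back to the infobox cell.
            img_tag = soup.find('span', class_='mw-default-size')
            if not img_tag:
                img_tag = soup.find('td', class_='infobox-image')

            if img_tag:
                # The src attribute is expected to be protocol-relative (//upload.wikimedia.org/...).
                img_url = 'https:' + img_tag.find('img')['src'].strip().split(' ')[0]
                urllib.request.urlretrieve(img_url, filename)

                with open('id_db_img_wiki_dbbook.tsv', 'a') as tsvfile:
                    tsvfile.write(f'{item_id}\t{dburl}\t{img_url}\t{wikiurl}\n')

                print(f'done: {wikiurl}')
                time.sleep(1)  # pause between requests
                return img_url, wikiurl

        except Exception as e:
            print(e)
            continue

    time.sleep(1)
    # No cover found: return empty strings so callers can always unpack two values.
    return "", ""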
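
The URL handling in the cell above can be tried without any network access. The following is a minimal sketch, assuming the same three suffixes; candidate_wiki_urls and absolute_img_url are hypothetical helper names, and the image path in the example is made up. It uses urljoin from the standard library instead of prepending 'https:', so that absolute and root-relative src values would also be resolved.

```python
from urllib.parse import urljoin

WIKI_BASE = "https://en.wikipedia.org/wiki/"

def candidate_wiki_urls(dbpedia_url, suffixes=("", "_(novel)", "_(book)")):
    """List the Wikipedia URLs to try for one DBpedia resource, in order."""
    title = dbpedia_url.rstrip("/").split("/")[-1]
    return [WIKI_BASE + title + suffix for suffix in suffixes]

def absolute_img_url(page_url, src):
    """Resolve an img src (protocol-relative, root-relative or absolute) against the page URL."""
    return urljoin(page_url, src.strip().split(" ")[0])

urls = candidate_wiki_urls("http://dbpedia.org/resource/Snow_Crash")
print(urls)
# The image path below is a made-up placeholder, not a real Wikimedia file.
print(absolute_img_url(urls[0], "//upload.wikimedia.org/example/cover.jpg"))
```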

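Finally, we iterate over the DataFrame and retrieve the multimodal data (abstract and cover image) for each item. The break statement at the end of the loop stops after the first item, which is enough to check that both functions work.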

In [25]:
for i, row in extended_mapping.iterrows():
    
    item_id = row['id']
    dbpedia_uri = row['dbpedia_url']

    text = get_dbpedia_abstract(dbpedia_uri)
    img_url, wiki_url = get_book_image(dbpedia_uri, item_id)

    print(f'Item: {item_id}, DBpedia uri: {dbpedia_uri}, img url: {img_url}, wiki_url: {wiki_url}')
    print(f'Abstract: {text}')

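    # TODO: store the abstract and the image URL in the DataFrame

    break  # stop after the first item; remove to process the whole mapping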

done: https://en.wikipedia.org/wiki/Dragonfly_in_Amber
Item: 1, DBpedia uri: http://dbpedia.org/resource/Dragonfly_in_Amber, img url: https://upload.wikimedia.org/wikipedia/en/thumb/1/19/Gabaldon-Dragonfly_in_Amber-1992.jpg/220px-Gabaldon-Dragonfly_in_Amber-1992.jpg, wiki_url: https://en.wikipedia.org/wiki/Dragonfly_in_Amber
Abstract: Dragonfly in Amber is the second book in the Outlander series of novels by Diana Gabaldon. Centered on time travelling 20th century nurse Claire Randall and her 18th century Scottish Highlander warrior husband Jamie Fraser, the books contain elements of historical fiction, romance, adventure and fantasy. This installment chronicles Claire and Jamie's efforts to prevent the Jacobite rising that Claire knows will end disastrously for the Scots. A television adaptation of the series, called Outlander, premiered on Starz in August 2014. The show's second season—which aired May through July 2016—was based on the events of Dragonfly in Amber.
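
The loop above prints the retrieved data but does not store it yet. One possible way to keep the results is sketched below, under the assumption that one record per item is collected in a list and then written out as TSV. The collect_records function and the two stand-in fetchers are hypothetical; the stand-ins replace get_dbpedia_abstract and get_book_image so that the sketch runs offline.

```python
import csv
import io

def collect_records(items, fetch_abstract, fetch_image):
    """Build one record per (item_id, dbpedia_uri) pair using the given fetchers."""
    records = []
    for item_id, dbpedia_uri in items:
        img_url, wiki_url = fetch_image(dbpedia_uri, item_id)
        records.append({
            "id": item_id,
            "dbpedia_url": dbpedia_uri,
            "abstract": fetch_abstract(dbpedia_uri),
            "img_url": img_url,
            "wiki_url": wiki_url,
        })
    return records

# Stand-in fetchers so the sketch runs offline; swap in the real functions in the notebook.
def fake_abstract(uri):
    return "Abstract for " + uri.split("/")[-1]

def fake_image(uri, item_id):
    return "", ""

items = [(1, "http://dbpedia.org/resource/Dragonfly_in_Amber"),
         (2, "http://dbpedia.org/resource/Drums_of_Autumn")]
records = collect_records(items, fake_abstract, fake_image)

# Write the records as TSV; the csv module quotes fields that contain tabs or newlines.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(records[0]), delimiter="\t")
writer.writeheader()
writer.writerows(records)
print(buffer.getvalue())
```

In the notebook, items could come from zip(extended_mapping['id'], extended_mapping['dbpedia_url']), and pd.DataFrame(records) would turn the list into a table that can be merged with extended_mapping on the id column.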
