First, we need to install the yt_dlp, SPARQLWrapper, and BeautifulSoup (bs4) packages, as these are the three packages used to download the multimodal features.

In [27]:
!pip install yt_dlp
!pip install SPARQLWrapper
!pip install bs4



In [16]:
import yt_dlp
from bs4 import BeautifulSoup
import pandas as pd
import requests
from SPARQLWrapper import SPARQLWrapper, JSON

To download the multimodal data, we first read the .tsv file that contains the links to that data.

In [4]:
extended_mapping = pd.read_csv('ml1m_full_extended_mapping.tsv', sep='\t')

In [6]:
extended_mapping.head(5)

Unnamed: 0,movie_id,dburl,wiki_url,img_url,name,youtubeId
0,2,http://dbpedia.org/resource/Jumanji,https://en.wikipedia.org/wiki/Jumanji,https://upload.wikimedia.org/wikipedia/en/b/b6...,Jumanji (1995),3LPANjHlPxo
1,3,http://dbpedia.org/resource/Grumpier_Old_Men,https://en.wikipedia.org/wiki/Grumpier_Old_Men,https://upload.wikimedia.org/wikipedia/en/0/03...,Grumpier Old Men (1995),rEnOoWs3FuA
2,4,http://dbpedia.org/resource/Waiting_to_Exhale,https://en.wikipedia.org/wiki/Waiting_to_Exhale,https://upload.wikimedia.org/wikipedia/en/c/ca...,Waiting to Exhale (1995),j9xml1CxgXI
3,5,http://dbpedia.org/resource/Father_of_the_Brid...,https://en.wikipedia.org/wiki/Father_of_the_Br...,https://upload.wikimedia.org/wikipedia/en/e/e1...,Father of the Bride Part II (1995),BbvnDlu_Zjc
4,6,http://dbpedia.org/resource/Heat_(1995_film),https://en.wikipedia.org/wiki/Heat_(1995_film),https://upload.wikimedia.org/wikipedia/en/6/6c...,Heat (1995),2GfZl4kuVNI


This function saves a YouTube video in .mp4 format, given its YouTube id, into the _videos folder.

In [7]:
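def download_youtube_video(youtube_id, item_id):
    url = f'https://www.youtube.com/watch?v={youtube_id}'
    ydl_opts = {
        # let yt-dlp fill in the extension; a hard-coded '.mp4' here
        # produced names such as '2.mp4.mkv' after merging
        'outtmpl': f'_videos/{item_id}.%(ext)s',
        'format': 'bestvideo+bestaudio/best',
        # ask for an .mp4 container when video and audio are merged
        'merge_output_format': 'mp4',
    }
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        ydl.download([url])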
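
The function above writes into `_videos`, and the poster function writes into `_posters` with `open()`, which does not create missing folders. The following is a minimal sketch of preparing both folders and detecting trailers that are already on disk. `ensure_output_dirs` and `already_downloaded` are hypothetical helper names, not part of the original notebook.

```python
from pathlib import Path

def ensure_output_dirs(base='.', names=('_videos', '_posters')):
    """Create the output folders if they do not exist yet; return their paths."""
    paths = [Path(base) / name for name in names]
    for path in paths:
        path.mkdir(parents=True, exist_ok=True)
    return paths

def already_downloaded(item_id, folder='_videos'):
    """Return True if a file named <item_id>.<anything> exists in folder."""
    folder = Path(folder)
    # the glob matches e.g. '2.mp4' or '2.mp4.mkv', but not '20.mp4'
    return folder.is_dir() and any(folder.glob(f'{item_id}.*'))
```

Calling `ensure_output_dirs()` once before the download loop, and checking `already_downloaded(item_id)` before each call to `download_youtube_video`, would avoid repeated requests when the notebook is re-run.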

Similarly, the following functions scrape the Wikipedia page associated with a movie (derived from its DBpedia URI) to get the movie poster, and query DBpedia with the same URI to get the movie plot.

In [24]:
def convert_url(dbpedia_url):
    dbpart = dbpedia_url.split('/')[-1]
    return 'https://en.wikipedia.org/wiki/' + dbpart

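def get_movie_poster(dbpedia_url, item_id):
    # fetch the Wikipedia page that corresponds to the DBpedia resource
    wikipedia_url = convert_url(dbpedia_url)
    response = requests.get(wikipedia_url)
    soup = BeautifulSoup(response.text, 'html.parser')
    # the poster is taken to be the first image inside the infobox
    infobox = soup.find('table', {'class': 'infobox'})
    if infobox:
        img_tag = infobox.find('img')
        if img_tag:
            src = img_tag['src']
            # infobox images are usually protocol-relative ('//upload...')
            img_url = 'https:' + src if src.startswith('//') else src
            img_data = requests.get(img_url).content
            with open(f'_posters/{item_id}.jpg', 'wb') as handler:
                handler.write(img_data)
            return img_url
    # no infobox or no image found
    return None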

def get_movie_plot(dbpedia_uri):
    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    query = f"""
    SELECT ?abstract WHERE {{
        <{dbpedia_uri}> dbo:abstract ?abstract .
        FILTER (lang(?abstract) = "en")
    }}
    """
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    try:
        results = sparql.query().convert()
        for result in results["results"]["bindings"]:
            return result["abstract"]["value"]
        return "No abstract found."
    except Exception as e:
        return f"Error: {e}"
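
`get_movie_poster` builds the image link by string concatenation and always saves with a `.jpg` suffix, although infobox images may also be `.png` files. The following is a hedged sketch of a more tolerant approach using only the standard library. `resolve_image_url` and `poster_extension` are hypothetical helpers, and the set of accepted suffixes is an assumption.

```python
from pathlib import PurePosixPath
from urllib.parse import urljoin, urlparse

def resolve_image_url(page_url, src):
    """Turn a relative, protocol-relative, or absolute src into a full URL."""
    return urljoin(page_url, src)

def poster_extension(img_url, default='.jpg'):
    """Pick the file suffix from the URL path, falling back to default."""
    suffix = PurePosixPath(urlparse(img_url).path).suffix.lower()
    # assumed list of image suffixes; extend as needed
    known = {'.jpg', '.jpeg', '.png', '.gif', '.webp'}
    return suffix if suffix in known else default
```

With these, the poster could be written to `f'_posters/{item_id}{poster_extension(img_url)}'` so the file name reflects the actual image type.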

The original dataframe is iterated to gather all the multimodal data we are interested in.

In [25]:
for i, row in extended_mapping.iterrows():

    item_id = row['movie_id']
    dburl = row['dburl']
    yt_id = row['youtubeId']

    print(dburl)

    # fetch the movie plot (English abstract) from DBpedia
    plot = get_movie_plot(dburl)

    # store the movie poster in the _posters folder
    get_movie_poster(dburl, item_id)

    # store the movie trailer in the _videos folder
    download_youtube_video(yt_id, item_id)

    # stop after the first movie; remove this to process every row
    break

http://dbpedia.org/resource/Jumanji
[youtube] Extracting URL: https://www.youtube.com/watch?v=3LPANjHlPxo
[youtube] 3LPANjHlPxo: Downloading webpage
[youtube] 3LPANjHlPxo: Downloading tv client config
[youtube] 3LPANjHlPxo: Downloading player 4fcd6e4a
[youtube] 3LPANjHlPxo: Downloading tv player API JSON
[youtube] 3LPANjHlPxo: Downloading ios player API JSON
[youtube] 3LPANjHlPxo: Downloading m3u8 information
[info] 3LPANjHlPxo: Downloading 1 format(s): 136+251
[download] _videos/2.mp4.mkv has already been downloaded
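
The loop above does not keep `plot`, and an exception raised by any download would abort the whole run. The following is a minimal sketch of a variant that collects the plots and records per-item errors instead. `collect_items` is a hypothetical helper. The three callables are passed in as arguments so that the notebook's `get_movie_plot`, `get_movie_poster`, and `download_youtube_video` can be supplied.

```python
def collect_items(rows, fetch_plot, fetch_poster, fetch_video):
    """Process (item_id, dburl, yt_id) tuples; return (plots, errors)."""
    plots, errors = {}, []
    for item_id, dburl, yt_id in rows:
        try:
            plots[item_id] = fetch_plot(dburl)
            fetch_poster(dburl, item_id)
            fetch_video(yt_id, item_id)
        except Exception as exc:
            # remember the failure and continue with the next item
            errors.append((item_id, repr(exc)))
    return plots, errors
```

In the notebook, `rows` could be built with `extended_mapping[['movie_id', 'dburl', 'youtubeId']].itertuples(index=False, name=None)`, which yields plain tuples in that column order.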
