<a href="https://colab.research.google.com/github/msaantonova/IR-IE-project---Musicals-Recommendation-System/blob/main/Musicals_recommendation_system.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**IR/IE Project** - Musical recommendation system

Workflow:

1. Finding the corpus

Possible corpus: https://docs.google.com/document/d/1TqIBg_3frwDtaE518Cq_AxAdtSsM93OWp8MTfJq0iPE/edit?tab=t.0

Or an existing dataset of movie-musical reviews: https://www.kaggle.com/datasets/bwandowando/rotten-tomatoes-best-musicals-of-all-time/data

2. Indexing
3. Knowledge graphs (where do they go?)
4. Create query examples
5. Build an extraction pipeline for recognition of names etc.

**1. Working with the dataset**

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import pandas as pd
import zipfile
import os
from google.colab import files

In [3]:
file_path1 = '/content/movies.csv'
with open(file_path1, 'r', encoding='utf-8') as f:
    for i in range(5):
        print(f.readline())

"movieId","movieTitle","movieYear","movieURL","critic_score","audience_score"

"6da4d992-97ca-3acc-834c-37c1ad34872c","Singin' in the Rain",1952,"https://www.rottentomatoes.com/m/singin_in_the_rain","100%","95%"

"56b1624a-1ceb-3598-8ee2-62236554094a","Top Hat",1935,"https://www.rottentomatoes.com/m/top_hat","100%","90%"

"718dd15b-50ee-3090-84c7-13244b587b00","Meet Me in St. Louis",1944,"https://www.rottentomatoes.com/m/meet_me_in_st_louis","100%","87%"

"80f30474-3ac3-36c7-a43b-648d1a627432","The Wizard of Oz",1939,"https://www.rottentomatoes.com/m/the_wizard_of_oz_1939","98%","89%"


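The preview shows that `critic_score` and `audience_score` are stored as strings such as `"100%"`, so they need converting before they can be sorted or compared. Below is a minimal sketch of that conversion using the standard `csv` module. It reads two rows copied from the preview out of an in-memory string so that it runs on its own; `parse_percent` is a hypothetical helper name, not something defined elsewhere in the notebook.

```python
import csv
import io

# Two rows copied from the movies.csv preview above; in the notebook,
# open file_path1 instead of this in-memory sample.
sample = io.StringIO(
    '"movieId","movieTitle","movieYear","movieURL","critic_score","audience_score"\n'
    '"6da4d992-97ca-3acc-834c-37c1ad34872c","Singin\' in the Rain",1952,'
    '"https://www.rottentomatoes.com/m/singin_in_the_rain","100%","95%"\n'
    '"80f30474-3ac3-36c7-a43b-648d1a627432","The Wizard of Oz",1939,'
    '"https://www.rottentomatoes.com/m/the_wizard_of_oz_1939","98%","89%"\n'
)


def parse_percent(value):
    """Turn a string such as '95%' into the integer 95; return None if it is empty or malformed."""
    text = value.strip().rstrip('%')
    return int(text) if text.isdigit() else None


movies = []
for row in csv.DictReader(sample):
    movies.append({
        'title': row['movieTitle'],
        'year': int(row['movieYear']),
        'critic_score': parse_percent(row['critic_score']),
        'audience_score': parse_percent(row['audience_score']),
    })

for movie in movies:
    print(movie)
```

Returning `None` for a missing score keeps such rows in the list, so they can still be enriched later instead of being dropped at load time.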

**3. Template for a possible knowledge graph**

In [6]:
def add_musical_to_graph(G, title, release_date, genre, location, director, actors, characters, songs, spotify_link, time_period, composer, source_material, source_author):
    """Add one musical and its related entities to G (a graph with add_node/add_edge)."""
    G.add_node(title, type='musical')

    # release date
    G.add_node(release_date, type='date')
    G.add_edge(title, release_date, relationship='released_on')

    # genre
    G.add_node(genre, type='genre')
    G.add_edge(title, genre, relationship='has_genre')

    # location
    G.add_node(location, type='place')
    G.add_edge(title, location, relationship='set_in')

    # director
    G.add_node(director, type='director')
    G.add_edge(title, director, relationship='directed_by')

    # actors
    for actor in actors:
        G.add_node(actor, type='actor')
        G.add_edge(title, actor, relationship='features_actor')

    # characters; assumes actors and characters are parallel lists,
    # i.e. actors[i] plays characters[i]
    for actor, character in zip(actors, characters):
        G.add_node(character, type='character')
        G.add_edge(title, character, relationship='features_character')
        G.add_edge(character, actor, relationship='played_by_actor')

    # composer (added before the songs so 'written_by' points at a typed node)
    G.add_node(composer, type='composer')
    G.add_edge(composer, title, relationship='created_music_for')

    # songs
    for song in songs:
        G.add_node(song, type='song')
        G.add_edge(title, song, relationship='includes_song')
        G.add_edge(song, composer, relationship='written_by')
        # 'performed_by' edges are left out for now: linking a song to its
        # performers needs a per-song mapping that this template does not receive

    # Spotify or Apple Music
    G.add_node(spotify_link, type='link')
    G.add_edge(title, spotify_link, relationship='has_music_link')

    # time period
    G.add_node(time_period, type='time_period')
    G.add_edge(title, time_period, relationship='set_in_time_period')

    # source material
    G.add_node(source_material, type='source_material')
    G.add_edge(title, source_material, relationship='is_based_on_source')

    # source author (now passed in as a parameter)
    G.add_node(source_author, type='source_author')
    G.add_edge(source_material, source_author, relationship='is_written_by')

    return G
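
The template calls `G.add_node` and `G.add_edge` with keyword attributes, which matches the networkx graph API. The notebook does not show how `G` is created, so the use of `networkx.DiGraph` below is an assumption. The sketch builds a small graph with the same `type` and `relationship` labels, using genres and directors copied from the IMDb output printed further down in this notebook, and then runs one simple query. `musicals_with_genre` is a hypothetical helper name.

```python
import networkx as nx

# Assumption: a directed graph, so that 'has_genre' reads musical -> genre.
G = nx.DiGraph()

# Facts copied from the IMDb output shown in this notebook.
musicals = {
    "Singin' in the Rain": {
        'genres': ['Comedy', 'Musical', 'Romance'],
        'directors': ['Stanley Donen', 'Gene Kelly'],
    },
    'Top Hat': {
        'genres': ['Comedy', 'Musical', 'Romance'],
        'directors': ['Mark Sandrich'],
    },
    'Meet Me in St. Louis': {
        'genres': ['Comedy', 'Drama', 'Family', 'Musical', 'Romance'],
        'directors': ['Vincente Minnelli'],
    },
}

for title, info in musicals.items():
    G.add_node(title, type='musical')
    for genre in info['genres']:
        G.add_node(genre, type='genre')
        G.add_edge(title, genre, relationship='has_genre')
    for director in info['directors']:
        G.add_node(director, type='director')
        G.add_edge(title, director, relationship='directed_by')


def musicals_with_genre(graph, genre):
    """Return the sorted titles that have a 'has_genre' edge pointing at `genre`."""
    return sorted(
        source
        for source, _, data in graph.in_edges(genre, data=True)
        if data.get('relationship') == 'has_genre'
    )


print(musicals_with_genre(G, 'Drama'))
print(musicals_with_genre(G, 'Romance'))
```

One design point to keep in mind: nodes are keyed by their name string, so a person who appears in two roles (Gene Kelly both directs and stars in Singin' in the Rain, according to the IMDb output) becomes a single node whose `type` is whichever value was set last.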


4. How to get the information? Manually or by parsing

EXAMPLE - Parsing from IMDb

In [13]:
!pip install IMDbPY

from imdb import IMDb

ia = IMDb()

results = ia.search_movie("Les Misérables")

movie = results[0]
ia.update(movie)

print("🎭 Title:", movie['title'])
print("📅 Release date:", movie['year'])
print("🎬 Director(s):", [d['name'] for d in movie.get('directors', [])])
print("🎞️ Genres:", movie.get('genres', []))
print("⭐ Actors:", [a['name'] for a in movie.get('cast', [])[:5]])
print("🙋🏼‍♂️ Characters:", [a.currentRole['name'] for a in movie.get('cast', [])[:5]])



# print("🎹 Composer", [c['name'] for c in movie.get('composer', [])])
# print("🎵 Songs", [s['name'] for s in movie.get('songs', [])])
# print("⏱️ Time period", movie.get('time period', []))
# source author, source material

🎭 Title: Les Misérables
📅 Release date: 2012
🎬 Director(s): ['Tom Hooper']
🎞️ Genres: ['Drama', 'Musical', 'Romance']
⭐ Actors: ['Hugh Jackman', 'Russell Crowe', 'Anne Hathaway', 'Amanda Seyfried', 'Sacha Baron Cohen']
🙋🏼‍♂️ Characters: ['Jean Valjean', 'Javert', 'Fantine', 'Cosette', 'Thénardier']


There is a Kaggle dataset that contains the 100 most popular musicals. However, it does not hold enough information for the knowledge graph. The titles in that dataset can therefore be used as the starting point for parsing Wikipedia or IMDb to get the additional information the graph needs.

In [None]:
import csv
from imdb import IMDb

# Initialise the IMDb client
ia = IMDb()

# Storage for the results
musicals_data = []

# Read the first 5 musicals from the CSV
with open(file_path1, 'r', encoding='utf-8') as f:
    reader = csv.DictReader(f)
    for i, row in enumerate(reader):
        if i >= 5:  # limit to 5 rows for now
            break

        title = row['movieTitle'].strip()
        print(f"🔍 Searching: {title}")

        try:
            # Search for the film on IMDb
            results = ia.search_movie(title)
            if not results:
                print(f"⚠️ Not found: {title}")
                continue

            movie = results[0]
            ia.update(movie)

            # Collect the information we need
            data = {
                "title": movie.get('title'),
                "year": movie.get('year'),
                "directors": [d['name'] for d in movie.get('directors', [])],
                "genres": movie.get('genres', []),
                "actors": [a['name'] for a in movie.get('cast', [])[:5]],  # first 5 actors
            }
            musicals_data.append(data)

        except Exception as e:
            print(f"❌ Error while processing {title}: {e}")

# Print the results
for entry in musicals_data:
    print(entry)


🔍 Searching: Singin' in the Rain
🔍 Searching: Top Hat
🔍 Searching: Meet Me in St. Louis
🔍 Searching: The Wizard of Oz
🔍 Searching: A Hard Day's Night
{'title': "Singin' in the Rain", 'year': 1952, 'directors': ['Stanley Donen', 'Gene Kelly'], 'genres': ['Comedy', 'Musical', 'Romance'], 'actors': ['Gene Kelly', "Donald O'Connor", 'Debbie Reynolds', 'Jean Hagen', 'Millard Mitchell']}
{'title': 'Top Hat', 'year': 1935, 'directors': ['Mark Sandrich'], 'genres': ['Comedy', 'Musical', 'Romance'], 'actors': ['Fred Astaire', 'Ginger Rogers', 'Edward Everett Horton', 'Erik Rhodes', 'Eric Blore']}
{'title': 'Meet Me in St. Louis', 'year': 1944, 'directors': ['Vincente Minnelli'], 'genres': ['Comedy', 'Drama', 'Family', 'Musical', 'Romance'], 'actors': ['Judy Garland', "Margaret O'Brien", 'Mary Astor', 'Lucille Bremer', 'Leon Ames']}
{'title': 'The Wizard of Oz', 'year': 1939, 'directors': ['Victor Fleming', 'King Vidor'], 'genres': ['Adventure', 'Family', 'Fantasy', 'Musical'], 'actors': ['Judy Garland', 'Frank Morgan', 'Ray
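
The loop above always takes `results[0]`, which can be a different film with the same title, such as a remake or an earlier adaptation. Since `movies.csv` carries a `movieYear` column, one possible safeguard is to prefer the hit whose year matches. This is a sketch under assumptions: `pick_by_year` is a hypothetical helper, and it assumes each search hit supports `.get('year')` like a dict. The `hits` list is a hand-written stand-in, not real API output, so the example runs offline.

```python
def pick_by_year(results, year, tolerance=1):
    """Return the first result whose 'year' is within `tolerance` of `year`.

    Returns None when no result matches, so the caller can decide whether
    to fall back to results[0] or skip the title.
    """
    for result in results:
        result_year = result.get('year')
        if result_year is not None and abs(int(result_year) - int(year)) <= tolerance:
            return result
    return None


# Hand-written stand-in for search hits (assumption: real hits expose 'year').
hits = [
    {'title': 'The Wizard of Oz', 'year': 1925},
    {'title': 'The Wizard of Oz', 'year': 1939},
    {'title': 'The Wizard of Oz'},  # some hits may lack a year
]

# csv.DictReader yields strings, so the year is passed as '1939' here.
match = pick_by_year(hits, '1939')
print(match)
```

In the loop, this would stand in for `movie = results[0]`, with `row['movieYear']` as the year. The one-year tolerance is an arbitrary choice that allows for small differences in how the two sites date a release.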

5. Next steps:

    1) Fix the knowledge graph according to the plan ✅

    2) Figure out how to save the combined CSV

    3) Figure out how to add links (Spotify/Apple Music)

    4) Check requirements + testing queries
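
For step 2, one option is to merge each Rotten Tomatoes row with its IMDb record and write the result with `csv.DictWriter`. The sketch below is a suggestion rather than the project's settled format. It joins on the title string, which assumes both sources spell the title identically, and it flattens list fields with a `'; '` separator. The sample row and record are copied from the outputs above, and the file name `combined.csv` in the comment is hypothetical.

```python
import csv
import io

# One Rotten Tomatoes row and one IMDb record, copied from the outputs above.
rt_rows = [
    {'movieTitle': 'Top Hat', 'movieYear': '1935',
     'critic_score': '100%', 'audience_score': '90%'},
]
musicals_data = [
    {'title': 'Top Hat', 'year': 1935, 'directors': ['Mark Sandrich'],
     'genres': ['Comedy', 'Musical', 'Romance'],
     'actors': ['Fred Astaire', 'Ginger Rogers', 'Edward Everett Horton',
                'Erik Rhodes', 'Eric Blore']},
]

LIST_SEPARATOR = '; '  # assumption: no name contains this sequence

# Index the IMDb records by title for the join.
imdb_by_title = {entry['title']: entry for entry in musicals_data}

combined = []
for row in rt_rows:
    extra = imdb_by_title.get(row['movieTitle'], {})  # empty if IMDb lookup failed
    combined.append({
        'title': row['movieTitle'],
        'year': row['movieYear'],
        'critic_score': row['critic_score'],
        'audience_score': row['audience_score'],
        'directors': LIST_SEPARATOR.join(extra.get('directors', [])),
        'genres': LIST_SEPARATOR.join(extra.get('genres', [])),
        'actors': LIST_SEPARATOR.join(extra.get('actors', [])),
    })

# In the notebook, replace the buffer with
# open('combined.csv', 'w', newline='', encoding='utf-8').
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(combined[0].keys()))
writer.writeheader()
writer.writerows(combined)
print(buffer.getvalue())
```

When the file is read back, `value.split('; ')` recovers the lists, so the same CSV can feed the knowledge-graph template.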