<span style="font-size: 36px;">W4111_Fall_2024_003 - Introduction to Databases:<br> Final Project Part 1<br>Data Loading<br>Game of Thrones Data and IMDB</span>

# Overview

## Application Scenario

There are many types of applications that use databases. Two of the most common types are:
1. __Interactive/Operational Applications__ allow end users to create, retrieve, update and delete information. Online banking, e-commerce, and course registration are examples. The programming track students implement a simple, [full-stack](https://en.wikipedia.org/wiki/Solution_stack), interactive application.<br><br>
2. __Data Insight/Decision Support Applications__ are primarily for analysts and business professionals. The applications enable complex query and data navigation to provide information useful for managing and improving business processes, products, etc. Non-programming track students implement complex queries that provide datasets useful for visualization. Visualization is often the first step in a data insight project.

The following diagram depicts some major elements of the applications. 
- Both tracks implement a simple [data engineering](https://en.wikipedia.org/wiki/Data_engineering) project, specifically an [extract-transform-load](https://en.wikipedia.org/wiki/Extract,_transform,_load) application in a Jupyter notebook.
  - The input datasets are information from [IMDB](https://developer.imdb.com/non-commercial-datasets/) and information about [Game of Thrones](https://github.com/jeffreylancaster/game-of-thrones).
  - The data engineering tasks process and load information into three databases:
      - A local installation of MySQL
      - A cloud document database on [MongoDB Atlas](https://www.mongodb.com/atlas)
      - A graph database on [Neo4j Aura](https://neo4j.com/product/auradb/)
- The non-programming track implements a very simple decision support/data insight application in a Jupyter notebook. The application queries the various databases to produce "views" that can be used for visualization.
- The programming track implements a web application that supports simple retrieval and data navigation.

| <img src="overall-system.jpg" width="800px;"> |
| :---: |
| __Overall Application Concept__ |

## Data Engineering

The following diagram is an overview of data engineering concepts, and entity-relationship modeling in general.

The data engineering tasks for the project are primarily _bottom-up data analysis and engineering._ There are two datasets that are the input to the data engineering:
1. IMDB data in comma-separated value (CSV) files.
2. Game of Thrones data in [JSON](https://en.wikipedia.org/wiki/JSON) files.

To simplify the project, the project explanation:
1. Provides a Jupyter notebook for the initial data loading.
2. Defines the "to be" data model.
3. Explains how to map from the "as is" data to the "to be" data.

Providing this information in the project explanation eliminates the need for students to figure out what the input data contains and what a "good" to-be data model looks like.

| <img src="./top_down_bottom_up.jpg" width="700px;"> |
| :---: |
| __Data Modeling__ |

| <img src="./data-janitor.jpg" width="700px;"> |
| :---: |
| __Data Engineering__ |

## Data Insight Application

A separate document will explain the data analysis and insight tasks. The purpose of this notebook, and of this part of the project, is to ensure that students have a working environment.

## Interactive Web Application

A separate document will explain the web application tasks. The purpose of this notebook, and of this part of the project, is to ensure that students have a working environment.

## This Notebook

This notebook simply enables students to get started with the environment, test their setup, etc.

## Prerequisites

To complete this notebook, students must have set up and configured MongoDB Atlas and Neo4j Aura environments. The recorded recitation of 2024-NOV-17 explains how to do the setup. There are also online instructions.

# Initialization

__Notes:__
1. Several of the cells below have a commented out ```%pip install``` statement. You may need to uncomment and execute the statement to install packages. You only need to run the statement the first time you use the notebook.
2. There are cells where you will have to replace URLs, user IDs and passwords with the values you set when creating databases or accounts.
3. The output of the test statements will be different for you based on previous use of the databases.

## General Python Packages

In [1]:
import copy

In [2]:
import json

In [3]:
import pandas

In [4]:
# You should have installed the packages for previous homework assignments
#
import pymysql
import sqlalchemy

In [5]:
# You may have to do %pip installs to use the packages below.
#
# %pip install "pymongo[srv]"
#
import pymongo

In [6]:
# You may have to do %pip installs to use the packages below.
#
# %pip install neo4j
#
import neo4j

In [7]:
import numpy

## MySQL

### ipython-sql

In [8]:
# You have installed and configured ipython-sql for previous assignments.
# https://pypi.org/project/ipython-sql/
#
%load_ext sql

In [9]:
# Make sure that you set these values to the correct values for your installation and 
# configuration of MySQL
#
db_user = "root"
db_password = "dbuserdbuser"

In [10]:
# Create the URL for connecting to the database.
# Do not worry about the local_infile=1, I did that for wizard reasons that you should not have to use.
#
db_url = f"mysql+pymysql://{db_user}:{db_password}@localhost?local_infile=1"
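
The URL above works as written when the user ID and password contain only letters and digits. If your password contains characters such as `@` or `/`, the URL can be misparsed. The following is a minimal sketch of one way to guard against that, assuming you are free to build the URL with a helper; `build_mysql_url` is a hypothetical name and is not part of the project template.

```python
from urllib.parse import quote_plus


def build_mysql_url(user, password, host="localhost", options="local_infile=1"):
    """Build a mysql+pymysql URL, percent-encoding the user ID and password.

    Hypothetical helper for illustration; the notebook builds db_url directly.
    """
    return f"mysql+pymysql://{quote_plus(user)}:{quote_plus(password)}@{host}?{options}"


# A made-up password containing characters that are special inside a URL.
example_url = build_mysql_url("root", "p@ss/word")
print(example_url)
# -> mysql+pymysql://root:p%40ss%2Fword@localhost?local_infile=1
```

For a password such as the default `dbuserdbuser`, the helper produces the same URL as the f-string in the cell above.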

In [11]:
# Initialize ipython-sql
#
%sql $db_url

In [12]:
# Your answer will be different based on the databases that you have created on your local MySQL instance.
#
%sql show databases

 * mysql+pymysql://root:***@localhost?local_infile=1
8 rows affected.


Database
classicmodels
db_book
hw1b_ng2695
information_schema
mysql
performance_schema
sys
w4111_hw2_ng2695


### PyMySQL

In [13]:
default_mysql_conn = pymysql.connect(
    user=db_user,
    password=db_password,
    host="localhost",
    port=3306,
    cursorclass=pymysql.cursors.DictCursor,
    autocommit=True
)

In [14]:
cur = default_mysql_conn.cursor()

cur.execute("show databases;")
result = cur.fetchall()
result_df = pandas.DataFrame(result)
result_df

Unnamed: 0,Database
0,classicmodels
1,db_book
2,hw1b_ng2695
3,information_schema
4,mysql
5,performance_schema
6,sys
7,w4111_hw2_ng2695


### SQLAlchemy

In [15]:
from sqlalchemy import create_engine
default_engine = create_engine(db_url)

In [16]:
result_df = pandas.read_sql(
    "show databases", con=default_engine
)
result_df

Unnamed: 0,Database
0,classicmodels
1,db_book
2,hw1b_ng2695
3,information_schema
4,mysql
5,performance_schema
6,sys
7,w4111_hw2_ng2695


## MongoDB Atlas

In [17]:
# You will need to replace with your settings.
#
mongo_db_pw="3ZOGE7vkjPRB2io0"
mongodb_user="ng2695"

In [18]:
# You can get the URL and info below from looking at the connect instructions for your MongoDB Atlas instance.
#
mongo_db_url = f"mongodb+srv://{mongodb_user}:{mongo_db_pw}@w4111hw.aajrd.mongodb.net/?retryWrites=true&w=majority&appName=W4111HW"

In [19]:
mongo_client = pymongo.MongoClient(mongo_db_url)

In [20]:
# Your list of databases will be different.
#
list(mongo_client.list_databases())

[{'name': 'f24_got', 'sizeOnDisk': 475136, 'empty': False},
 {'name': 'sample_mflix', 'sizeOnDisk': 128372736, 'empty': False},
 {'name': 'admin', 'sizeOnDisk': 393216, 'empty': False},
 {'name': 'local', 'sizeOnDisk': 8016089088, 'empty': False}]

## Neo4j

In [21]:
# You need to have created a Neo4j Aura DB with a user ID and information.
# Please make sure you copied the information for connecting.
# You can download this information when you create your instance.
# The download will be a text file with the information below.
# You will have to modify the file you download to wrap the strings with "
#
NEO4J_URI="neo4j+ssc://152bd215.databases.neo4j.io"
NEO4J_USERNAME="neo4j"
NEO4J_PASSWORD="Q1DShtnmlomcYoqFIT1_rv3XL4t-4ooCmWvKZUzWelc"
AURA_INSTANCEID="152bd215"
AURA_INSTANCENAME="Instance01"

In [22]:
from neo4j import GraphDatabase

# URI examples: "neo4j://localhost", "neo4j+s://xxx.databases.neo4j.io"
URI = NEO4J_URI
AUTH = (NEO4J_USERNAME, NEO4J_PASSWORD)

with GraphDatabase.driver(URI, auth=AUTH) as driver:
    driver.verify_connectivity()

__Note:__ You should have already created a Neo4j account and loaded the Movie Database.

In [23]:
# The following code assumes that you followed the tutorial for the Movie Database and loaded the data.
#
with GraphDatabase.driver(URI, auth=AUTH) as driver:
    # driver.verify_connectivity()
    
    records, summary, keys = driver.execute_query(
        "MATCH (p:Person) where p.name='Tom Hanks' RETURN p.name AS name, p.born as birth_year ",
    )
    
    # Loop through results and do something with them
    # There is probably an easier way to do this.
    #
    person_records = []
    for person in records:
        new_p = dict(person)
        person_records.append(new_p)
    
    # Summary information
    print("The query `{query}` returned {records_count} records in {time} ms.".format(
        query=summary.query, records_count=len(records),
        time=summary.result_available_after,
    ))

The query `MATCH (p:Person) where p.name='Tom Hanks' RETURN p.name AS name, p.born as birth_year ` returned 1 records in 61 ms.


In [24]:
results_df = pandas.DataFrame(person_records)
results_df.head(10)

Unnamed: 0,name,birth_year
0,Tom Hanks,1956


# Load Data

## Overview

The first step is to load two datasets. The project directory has two subdirectories containing the datasets:
1. ```data/IMDB``` contains a subset of the non-commercial [IMDB dataset data](https://developer.imdb.com/non-commercial-datasets/). The dataset to load is reduced in two ways:
      1. It contains only a subset of the data files.
      2. The data files contain only records related to Game of Thrones and the actors in the series.
2. ```data/GoT``` contains a copy of a [Game of Thrones dataset](https://github.com/jeffreylancaster/game-of-thrones). The project does not require loading all of the files in the directory.

You must set the ```data_dir``` variable in the cell below to the location where you cloned or unzipped the project.

The initial data loading loads the IMDB data into MySQL and the GoT data into MongoDB. We load data into Neo4j in later steps.

In [40]:
# In this cell, please enter the file systems path to the location where you cloned 
# or unzipped the project. Replace my value with the path for your system.
#
data_dir = r"C:\Users\natha\W4111-Project-Template"

In [45]:
imdb_data_dir = data_dir + r"\data\IMDB"
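
The cell above joins paths with a Windows-style backslash, which only works on Windows. If you are on macOS or Linux, a portable alternative is to build the paths with `pathlib`. This is a sketch under that assumption; `project_data_dirs` is a hypothetical helper name and the path passed to it is a placeholder.

```python
from pathlib import Path


def project_data_dirs(data_dir):
    """Return the (IMDB, GoT) data directories below the project directory.

    Hypothetical helper: pathlib picks the correct separator for the
    operating system the notebook is running on.
    """
    base = Path(data_dir)
    return base / "data" / "IMDB", base / "data" / "GoT"


# Placeholder location; replace with where you cloned or unzipped the project.
imdb_dir, got_dir = project_data_dirs("W4111-Project-Template")

# Individual files can be addressed the same way.
name_basics_file = imdb_dir / "got_imdb_name_basics.csv"
print(imdb_dir, got_dir, name_basics_file.name)
```

Converting with `str(imdb_dir)` gives a plain string if a later cell concatenates file names onto the directory.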

In [46]:
# This magic will list the files in the IMDB data directories.
#
%ls $imdb_data_dir

 Volume in drive C is OS
 Volume Serial Number is 940C-3CFD

 Directory of C:\Users\natha\W4111-Project-Template\data\IMDB

12/06/2024  05:26 PM    <DIR>          .
12/06/2024  05:26 PM    <DIR>          ..
12/06/2024  05:26 PM            31,805 got_imdb_name_basics.csv
12/06/2024  05:26 PM         2,406,441 got_imdb_title_basics.csv
12/06/2024  05:26 PM         1,688,109 got_imdb_title_principals.csv
12/06/2024  05:26 PM           338,284 got_imdb_title_ratings.csv
               4 File(s)      4,464,639 bytes
               2 Dir(s)  803,351,941,120 bytes free


In [47]:
got_data_dir = data_dir + r"\data\GoT"

In [48]:
# This magic will list the files in the GoT data directory.
#
%ls $got_data_dir

 Volume in drive C is OS
 Volume Serial Number is 940C-3CFD

 Directory of C:\Users\natha\W4111-Project-Template\data\GoT

12/06/2024  05:26 PM    <DIR>          .
12/06/2024  05:26 PM    <DIR>          ..
12/06/2024  05:26 PM            45,580 character_relationship_scenes.json
12/06/2024  05:26 PM           212,804 characters.json
12/06/2024  05:26 PM             3,204 characters-gender.json
12/06/2024  05:26 PM            11,052 characters-gender-all.json
12/06/2024  05:26 PM             4,009 characters-groups.json
12/06/2024  05:26 PM            24,341 characters-include.json
12/06/2024  05:26 PM             3,223 colors.json
12/06/2024  05:26 PM         1,415,481 costars.json
12/06/2024  05:26 PM         1,903,641 episodes.json
12/06/2024  05:26 PM            89,943 heatmap.json
12/06/2024  05:26 PM           454,201 keyValues.json
12/06/2024  05:26 PM           284,643 lands-of-ice-and-fire.json
12/06/2024  05:26 PM             4,291 locations.json
12/06/2024  05:26 PM          

## Load Atlas MongoDB

### Some Helper Code

This section provides some code that helps with the data loading.

In [49]:
def load_json_file(file_path):
    """
    Load the JSON information in the parameter into python objects and return the result.
    """

    with open(file_path, "r") as in_file:
        result = json.load(in_file)

    return result


In [50]:
# This is a set of files to load and save to MongoDB Atlas for the project.
#
# Set this to the location where you have the data files in your installation
#
data_path = got_data_dir + "/"

def load_and_save_file(file_path,
                       mongodb_con, database_name,
                       collection_name, 
                       top_level_element=None, 
                       is_list=True,
                       drop_collection=True):

    j_data = load_json_file(file_path)

    if top_level_element:
        j_data = j_data[top_level_element]

    if not is_list:
        j_data = [j_data]

    if drop_collection:
        mongodb_con[database_name][collection_name].drop()

    mongodb_con[database_name][collection_name].insert_many(j_data)


def load_and_save_all_files(data_file_info, mongodb_con):

    for f in data_file_info:
        load_and_save_file(
            file_path=data_path + f['file_name'],
            mongodb_con=mongodb_con,
            database_name=f['database'],
            collection_name=f['collection'],
            top_level_element=f['top_level_element'],
            is_list=f["is_list"]
        )
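
To make the roles of `top_level_element` and `is_list` concrete, here is a small sketch that repeats only the unwrapping logic of `load_and_save_file` on made-up data, without touching MongoDB. The function name `extract_documents` and the sample content are assumptions for illustration; the real files contain different data.

```python
import json
import os
import tempfile


def extract_documents(j_data, top_level_element=None, is_list=True):
    """Return the list of documents that insert_many would receive."""
    # Unwrap the data if the file nests it below a single top-level key.
    if top_level_element:
        j_data = j_data[top_level_element]
    # insert_many needs a list, so wrap a single document.
    if not is_list:
        j_data = [j_data]
    return j_data


# Made-up sample shaped like a file that wraps a list in a "groups" element.
sample = {"groups": [{"name": "Stark"}, {"name": "Lannister"}]}

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample-groups.json")
    with open(sample_path, "w") as out_file:
        json.dump(sample, out_file)
    with open(sample_path, "r") as in_file:
        loaded = json.load(in_file)

wrapped_docs = extract_documents(loaded, top_level_element="groups", is_list=True)
single_doc = extract_documents({"name": "Stark"}, is_list=False)
print(len(wrapped_docs), single_doc)
```

Testing the unwrapping separately like this can help you find the right `top_level_element` for a file before loading it into Atlas.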

### Load the Data Files into MongoDB

In [51]:
# The files to load for the project. The format of each entry in the array is:
# - file_name is the name of the file.
# - top_level_element is the JSON element name that wraps the data.
# - database is the name to use for your MongoDB database.
# - collection is the name of the document collection in the database.
# - is_list specifies the input format in the file.
#
data_file_load_info = [
    {
        "file_name": "characters-groups.json",
        "top_level_element": "groups",
        "database": "f24_project_got",
        "collection": "characters_groups",
        "is_list": True
    },
    {
        "file_name": "characters.json",
        "top_level_element": "characters",
        "database": "f24_project_got",
        "collection": "characters",
        "is_list": True
    },
    {
        "file_name": "episodes.json",
        "top_level_element": "episodes",
        "database": "f24_project_got",
        "collection": "episodes",
        "is_list": True
    },
    {
        "file_name": "locations.json",
        "top_level_element": "regions",
        "database": "f24_project_got",
        "collection": "locations",
        "is_list": True
    }    
]

__Note:__ If you have previously created the database, you should drop it.

In [52]:
# Specify the database. This must match the "database" value in data_file_load_info.
db = mongo_client["f24_project_got"]

# Drop collections
db["characters_groups"].drop()
db["characters"].drop()
db["episodes"].drop()
db["locations"].drop()

In [53]:
# This loads and saves all of the files.
#
load_and_save_all_files(data_file_load_info, mongo_client)

### Test the Loading

#### Characters

In [54]:
# Test if we loaded properly
# 
characters = mongo_client["f24_project_got"]["characters"].find(
    {},
    { 
        "_id": 0, 
        "characterName": 1,
        "actorName": 1,
        "actorLink": 1,
        "houseName": 1,
        "royal": 1,
        "kingsguard": 1,
        "nickname": 1
    }
)
characters_df = pandas.DataFrame(list(characters))

In [55]:
# Convert "Not a number" to None
#
characters_df = characters_df.replace({numpy.nan: None})

In [56]:
characters_df.tail(10)

Unnamed: 0,characterName,actorName,actorLink,houseName,royal,nickname,kingsguard
379,Yohn Royce,Rupert Vansittart,/name/nm0889338/,,,,
380,Yoren,Francis Magee,/name/nm0535837/,,,,
381,Young Benjen Stark,Matteo Elezi,/name/nm5502295/,Stark,,,
382,Young Cersei Lannister,Nell Williams,/name/nm5309709/,Lannister,,,
383,Young Lyanna Stark,Cordelia Hill,/name/nm8108764/,Stark,,,
384,Young Nan,Annette Tierney,/name/nm1519719/,,,,
385,Young Ned,Robert Aramayo,/name/nm7075019/,Stark,,,
386,Young Ned Stark,Sebastian Croft,/name/nm7509185/,Stark,,,
387,Young Rodrik Cassel,Fergus Leathem,/name/nm7509186/,,,,
388,Zanrush,Gerald Lepkowski,/name/nm0503319/,,,,


#### Episodes

In [57]:
# Test if we loaded properly
# 
episodes = mongo_client["f24_project_got"]["episodes"].find(
    {},
    { 
        "_id": 0
    }
)
episodes_df = pandas.DataFrame(list(episodes))

In [58]:
episodes_df.tail(10)

Unnamed: 0,seasonNum,episodeNum,episodeTitle,episodeLink,episodeAirDate,episodeDescription,openingSequenceLocations,scenes
63,7,4,The Spoils of War,/title/tt5775846/,2017-08-06,Daenerys takes matters into her own hands. Ary...,"[King's Landing, Dragonstone, Pyke, Winterfell...","[{'sceneStart': '0:03:55', 'sceneEnd': '0:04:1..."
64,7,5,Eastwatch,/title/tt5775854/,2017-08-13,Daenerys demands loyalty from the surviving La...,"[King's Landing, Dragonstone, Winterfell, The ...","[{'sceneStart': '0:03:55', 'sceneEnd': '0:05:4..."
65,7,6,Beyond the Wall,/title/tt5775864/,2017-08-20,Jon and his team go beyond the wall to capture...,"[King's Landing, Dragonstone, Winterfell, The ...","[{'sceneStart': '0:04:03', 'sceneEnd': '0:04:1..."
66,7,7,The Dragon and the Wolf,/title/tt5775874/,2017-08-27,Season finale of the epic series.,"[King's Landing, Dragonstone, Winterfell, The ...","[{'sceneStart': '0:04:36', 'sceneEnd': '0:04:5..."
67,8,1,Winterfell,/title/tt5924366/,2019-04-14,Jon and Daenerys arrive in Winterfell and are ...,"[Last Hearth, Winterfell, King's Landing]","[{'sceneStart': '0:04:39', 'sceneEnd': '0:04:5..."
68,8,2,A Knight of the Seven Kingdoms,/title/tt6027908/,2019-04-21,The battle at Winterfell is approaching. Jaime...,"[Last Hearth, Winterfell, King's Landing]","[{'sceneStart': '0:03:39', 'sceneEnd': '0:07:4..."
69,8,3,The Long Night,/title/tt6027912/,2019-04-28,The Night King and his army have arrived at Wi...,"[Last Hearth, Winterfell, King's Landing]","[{'sceneStart': '0:03:37', 'sceneEnd': '0:04:1..."
70,8,4,The Last of the Starks,/title/tt6027914/,2019-05-05,"In the wake of a costly victory, Jon and Daene...","[Last Hearth, Winterfell, King's Landing]","[{'sceneStart': '0:04:11', 'sceneEnd': '0:05:0..."
71,8,5,The Bells,/title/tt6027916/,2019-05-12,Daenerys and Cersei weigh their options as an ...,"[Last Hearth, Winterfell, King's Landing]","[{'sceneStart': '0:04:45', 'sceneEnd': '0:05:1..."
72,8,6,The Iron Throne,/title/tt6027920/,2019-05-19,In the aftermath of the devastating attack on ...,"[Last Hearth, Winterfell, King's Landing]","[{'sceneStart': '0:05:08', 'sceneEnd': '0:06:0..."


#### Locations

In [59]:
# Test if we loaded properly
# 
locations = mongo_client["f24_project_got"]["locations"].find(
    {},
    { 
        "_id": 0
    }
)
locations_df = pandas.DataFrame(list(locations))

In [60]:
locations_df.tail(10)


## Neo4j Aura

All we are going to do for now is load basic characters and actor information into Aura and create some relationships.

Specifically, the code in this section:
- Creates a node for each character.
- Creates a node for each actor.
- Creates a relationship between the character node and actor node.

  
The first step is to get the ```characterName```, ```actorName```, and ```nconst``` from the ```characters``` documents. The ```nconst``` is embedded in the ```actorLink```, so we use an aggregation pipeline to extract it.

In [102]:
# Aggregation pipeline to extract the actorName, characterName and IMDB nconst for each character.
# We use a pipeline because its operators make it simple to extract the IMDB nconst from the actorLink URL.
pipeline = [
    {
        "$project": {
            "_id": 0,
            "characterName": 1,
            "actorName": 1,
            "nconst": {
                "$arrayElemAt": [
                    {"$split": ["$actorLink", "/"]}, 
                    2
                ]
            }
        }
    }
]

# Execute the aggregation pipeline
all_characters = mongo_client.f24_project_got.characters.aggregate(pipeline)
all_characters = list(all_characters)

In [103]:
# Do a simple test to see if we got data.
#
all_characters[0:3]

[{'characterName': 'Addam Marbrand',
  'actorName': 'B.J. Hogg',
  'nconst': 'nm0389698'},
 {'characterName': 'Aegon Targaryen', 'nconst': None},
 {'characterName': 'Aeron Greyjoy',
  'actorName': 'Michael Feast',
  'nconst': 'nm0269923'}]
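
The `$split` and `$arrayElemAt` step in the pipeline can be mirrored in plain Python to see why index `2` is the right element. This is only an illustration; `extract_nconst` is a hypothetical helper and is not used by the notebook.

```python
def extract_nconst(actor_link):
    """Mimic the pipeline step: split on "/" and take the element at index 2."""
    if actor_link is None:
        # Characters without an actorLink end up with nconst None.
        return None
    parts = actor_link.split("/")
    return parts[2] if len(parts) > 2 else None


# The leading "/" produces an empty first element, so the ID is at index 2.
print("/name/nm0389698/".split("/"))
# -> ['', 'name', 'nm0389698', '']
print(extract_nconst("/name/nm0389698/"))
# -> nm0389698
```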

In [104]:
# We have the same character name several times, which will cause weird phantom matches.
# So, each character gets a unique ID.
#
i = 1
for c in all_characters:
    c['character_id'] = "ch_" + str(i)
    i = i + 1

In [105]:
all_characters[0:3]

[{'characterName': 'Addam Marbrand',
  'actorName': 'B.J. Hogg',
  'nconst': 'nm0389698',
  'character_id': 'ch_1'},
 {'characterName': 'Aegon Targaryen', 'nconst': None, 'character_id': 'ch_2'},
 {'characterName': 'Aeron Greyjoy',
  'actorName': 'Michael Feast',
  'nconst': 'nm0269923',
  'character_id': 'ch_3'}]

In [106]:
# Preprocess the information into lists to make the Neo4j code simpler.
# This is not really necessary but it made the process easier to test and debug for me.
#
actors_to_insert = []
relationships_to_insert = []
characters_to_insert = []

In [107]:
for a in all_characters:
    
    # Extract the values from the dictionary 'a'; .get() returns None if a key is missing.
    n = a.get('actorName', None)
    l = a.get('nconst', None)
    c = a.get('characterName', None)
    cid = a.get('character_id', None)

    # Get the information for the actor.
    #
    if n is not None:
        actors_to_insert.append({'actorName': n, 'nconst': l})

    
    characters_to_insert.append({'character_id': cid, 'characterName': c })

    # Sometimes there may not be a character-to-actor link.
    if l:
        relationships_to_insert.append({'character_id': cid, 'nconst': l})
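
The ID assignment and the preprocessing loop can also be written as one function, which makes them easy to test on a few records before running against the full list. The sketch below uses the three documents shown in the earlier output; `split_for_graph` is a hypothetical name, and `enumerate` replaces the manual counter.

```python
sample_characters = [
    {"characterName": "Addam Marbrand", "actorName": "B.J. Hogg", "nconst": "nm0389698"},
    {"characterName": "Aegon Targaryen", "nconst": None},
    {"characterName": "Aeron Greyjoy", "actorName": "Michael Feast", "nconst": "nm0269923"},
]


def split_for_graph(characters):
    """Split character documents into actor, character and relationship lists."""
    actors, chars, links = [], [], []
    for i, doc in enumerate(characters, start=1):
        # Same ID scheme as the notebook: ch_1, ch_2, ...
        cid = "ch_" + str(i)
        name = doc.get("actorName")
        nconst = doc.get("nconst")

        if name is not None:
            actors.append({"actorName": name, "nconst": nconst})

        chars.append({"character_id": cid, "characterName": doc.get("characterName")})

        # Only link a character to an actor when we have an nconst.
        if nconst:
            links.append({"character_id": cid, "nconst": nconst})
    return actors, chars, links


sample_actors, sample_chars, sample_links = split_for_graph(sample_characters)
print(len(sample_actors), len(sample_chars), sample_links)
```

The character without an actor still becomes a character node, but it produces no actor and no relationship.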
    

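__Note:__ You must manually delete the nodes and relationships from a previous load before rerunning the insert cells below. The cells use `CREATE`, so a rerun would add duplicates.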

In [108]:
relationships_to_insert

[{'character_id': 'ch_1', 'nconst': 'nm0389698'},
 {'character_id': 'ch_3', 'nconst': 'nm0269923'},
 {'character_id': 'ch_4', 'nconst': 'nm0727778'},
 {'character_id': 'ch_5', 'nconst': 'nm6729880'},
 {'character_id': 'ch_6', 'nconst': 'nm0853583'},
 {'character_id': 'ch_7', 'nconst': 'nm0203801'},
 {'character_id': 'ch_8', 'nconst': 'nm8257864'},
 {'character_id': 'ch_9', 'nconst': 'nm0571654'},
 {'character_id': 'ch_10', 'nconst': 'nm1528121'},
 {'character_id': 'ch_11', 'nconst': 'nm0000980'},
 {'character_id': 'ch_12', 'nconst': 'nm0649046'},
 {'character_id': 'ch_13', 'nconst': 'nm1783582'},
 {'character_id': 'ch_14', 'nconst': 'nm8127149'},
 {'character_id': 'ch_15', 'nconst': 'nm1074361'},
 {'character_id': 'ch_16', 'nconst': 'nm3586035'},
 {'character_id': 'ch_18', 'nconst': 'nm0538869'},
 {'character_id': 'ch_19', 'nconst': 'nm4207240'},
 {'character_id': 'ch_20', 'nconst': 'nm0568400'},
 {'character_id': 'ch_21', 'nconst': 'nm1152798'},
 {'character_id': 'ch_23', 'nconst': 'n

In [109]:
# This is a simple function to create a node for a character.
#
def insert_character(c):
    with GraphDatabase.driver(URI, auth=AUTH) as driver:
        # driver.verify_connectivity()
        
        summary = driver.execute_query(
            "CREATE (:Character:IMDB {characterName: $characterName, character_id: $character_id})",
            characterName=c['characterName'],
            character_id=c["character_id"]
        ).summary

        result = summary.counters.nodes_created

        if result != 1:
            print("Creating character ", c, "did something strange.")

        return result


In [110]:
# Insert an actor
def insert_actor(a):
    with GraphDatabase.driver(URI, auth=AUTH) as driver:
        # driver.verify_connectivity()
        created_count = 0
        updated_count = 0
        # print(a)

        # Some of the actors are already in the database and there is a unique constraint on
        # the name attribute. So, if the person exists, we simply add the nconst property.
        #
        
        records, summary, keys = driver.execute_query(
            "MATCH (p:Person) where p.name=$actorName RETURN p ",
            actorName=a['actorName']
        )
        if len(records) > 0:
            records, summary, keys = driver.execute_query(
                "MATCH (p:Person) where p.name=$actorName set p.nconst=$nconst",
                actorName=a['actorName'],
                nconst=a['nconst']
            )

            result = summary.counters.properties_set
            if result != 1:
                print("Updating an existing Person", a, "did something strange.")
                
            updated_count = result
        else:
            summary = driver.execute_query(
                "CREATE (:Person:IMDB {name: $actorName, nconst: $nconst})",
                actorName=a['actorName'],
                nconst=a['nconst']
            ).summary
            
            result = summary.counters.nodes_created

            if result != 1:
                print("Creating actor ", a, "did something strange.")

            created_count = result

        

        return (updated_count, created_count)



In [111]:
# Insert a link
def insert_link(l):
    with GraphDatabase.driver(URI, auth=AUTH) as driver:
        # driver.verify_connectivity()
        # print(l)
        summary= driver.execute_query(
            """
            MATCH (p:Person {nconst: $nconst}), (c:Character:IMDB {character_id: $character_id})
            create (c)-[:PLAYED_BY]->(p)
            """,
            character_id=l['character_id'],
            nconst=l['nconst']
            ).summary

        result = summary.counters.relationships_created

        if result != 1:
            print("Something weird happened creating relationship", l, ". Created relationships = ", result)

        return result


        # print("Query = ", summary.query)
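
The functions above open a new driver and send one query per node or relationship, which is simple to debug but slow against a cloud database. One possible alternative, shown here only as a sketch, is to send rows in batches with Cypher's `UNWIND`. The names `chunked`, `CREATE_CHARACTERS_QUERY` and `insert_characters_batched` are hypothetical, and the batch size of 100 is an arbitrary assumption.

```python
def chunked(rows, batch_size):
    """Yield successive slices of rows with at most batch_size items each."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


# One CREATE per element of the $rows list parameter.
CREATE_CHARACTERS_QUERY = """
UNWIND $rows AS row
CREATE (:Character:IMDB {characterName: row.characterName, character_id: row.character_id})
"""


def insert_characters_batched(driver, characters, batch_size=100):
    """Create character nodes in batches, reusing one already-open driver."""
    created = 0
    for batch in chunked(characters, batch_size):
        summary = driver.execute_query(CREATE_CHARACTERS_QUERY, rows=batch).summary
        created += summary.counters.nodes_created
    return created


# Possible usage, assuming URI, AUTH and characters_to_insert from earlier cells:
#
# with GraphDatabase.driver(URI, auth=AUTH) as driver:
#     print(insert_characters_batched(driver, characters_to_insert))
```

The per-row functions in the notebook remain the easier choice when you want to see exactly which record caused a problem.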
        

In [112]:
characters_created = 0

for c in characters_to_insert:
    result = insert_character(c)
    characters_created += result
    if characters_created % 25 == 0:
        print("Created ", characters_created, "characters so far.")

print("Created", characters_created, "characters.")

Created  25 characters so far.
Created  50 characters so far.
Created  75 characters so far.
Created  100 characters so far.
Created  125 characters so far.
Created  150 characters so far.
Created  175 characters so far.
Created  200 characters so far.
Created  225 characters so far.
Created  250 characters so far.
Created  275 characters so far.
Created  300 characters so far.
Created  325 characters so far.
Created  350 characters so far.
Created  375 characters so far.
Created 389 characters.


In [113]:
len(characters_to_insert)

389

In [114]:
updated_count = 0
created_count = 0

for a in actors_to_insert:
    r = insert_actor(a)
    updated_count += r[0]
    created_count += r[1]

    if (updated_count + created_count) % 25 == 0:
        print("Created ", created_count, "Updated ", updated_count)

print("Final Created ", created_count, "Updated ", updated_count)

Created  25 Updated  0
Created  50 Updated  0
Created  75 Updated  0
Created  100 Updated  0
Created  125 Updated  0
Created  150 Updated  0
Created  175 Updated  0
Created  200 Updated  0
Created  225 Updated  0
Created  250 Updated  0
Created  275 Updated  0
Created  299 Updated  1
Created  324 Updated  1
Created  347 Updated  3
Final Created  351 Updated  3


In [115]:
len(actors_to_insert)

354

In [116]:
relationships_created = 0

for l in relationships_to_insert:
    result = insert_link(l)
    relationships_created += result

    if relationships_created % 25 == 0:
        print("Created", relationships_created, "relationships.")

print("Created", relationships_created, "relationships.")
    

Created 25 relationships.
Created 50 relationships.
Created 75 relationships.
Created 100 relationships.
Created 125 relationships.
Created 150 relationships.
Created 175 relationships.
Created 200 relationships.
Created 225 relationships.
Created 250 relationships.
Created 275 relationships.
Created 300 relationships.
Created 325 relationships.
Created 350 relationships.
Created 353 relationships.


In [117]:
len(relationships_to_insert)

353

## IMDB

The IMDB data is in CSV files in the project directory.

In [77]:
%ls $imdb_data_dir

 Volume in drive C is OS
 Volume Serial Number is 940C-3CFD

 Directory of C:\Users\natha\W4111-Project-Template\data\IMDB

12/06/2024  05:26 PM    <DIR>          .
12/06/2024  05:26 PM    <DIR>          ..
12/06/2024  05:26 PM            31,805 got_imdb_name_basics.csv
12/06/2024  05:26 PM         2,406,441 got_imdb_title_basics.csv
12/06/2024  05:26 PM         1,688,109 got_imdb_title_principals.csv
12/06/2024  05:26 PM           338,284 got_imdb_title_ratings.csv
               4 File(s)      4,464,639 bytes
               2 Dir(s)  804,063,076,352 bytes free


In [78]:
%sql drop database if exists f24_project

 * mysql+pymysql://root:***@localhost?local_infile=1
0 rows affected.


[]

In [79]:
%sql create database f24_project

 * mysql+pymysql://root:***@localhost?local_infile=1
1 rows affected.


[]

We are simply going to load the data for now. The data files do not have a header row identifying the
columns. So, we have to define the columns and not use the header inferring for the file read.
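
The idea of supplying column names for a headerless file can be tried without a database. The sketch below uses only the standard library `csv` module and two made-up lines modeled on the query output shown in this notebook; the exact formatting of the real file may differ.

```python
import csv
import io

NAME_BASICS_COLUMNS = ["nconst", "primaryName", "birthYear", "deathYear",
                       "primaryProfession", "knownForTitles"]

# Illustrative lines only: no header row, quoted fields contain commas.
sample_csv = (
    'nm0389698,B.J. Hogg,1955,2020,"actor,music_department","tt0986233,tt0944947"\n'
    'nm0269923,Michael Feast,1946,,"actor,composer,soundtrack","tt0120879,tt0472160"\n'
)

# Passing fieldnames tells the reader that the first line is data, not a header.
reader = csv.DictReader(io.StringIO(sample_csv), fieldnames=NAME_BASICS_COLUMNS)
sample_rows = list(reader)

print(sample_rows[0]["primaryName"], sample_rows[0]["primaryProfession"])
print(repr(sample_rows[1]["deathYear"]))
```

With pandas, passing `names=` to `read_csv` together with `header=None` assigns the column names in the same call as the read.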

In [80]:
name_basics_df = pandas.read_csv(imdb_data_dir + '/got_imdb_name_basics.csv',
                                 header=None)

In [81]:
name_basics_df.columns = ["nconst", "primaryName", "birthYear", "deathYear", 
                          "primaryProfession", "knownForTitles"]

In [82]:
# Use 'replace' (as for the other tables) so re-running this cell does not duplicate rows.
name_basics_df.to_sql('name_basics', schema='f24_project',
                      con=default_engine, index=False, if_exists='replace')

351
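Reading the file and naming the columns can also be done in one step with the `names` parameter of `read_csv`. The sketch below uses an in-memory sample built from the first two rows shown in the query output in this section, not the real file. The constant `NAME_BASICS_COLUMNS` is a name introduced here for illustration.

```python
import io

import pandas

NAME_BASICS_COLUMNS = ["nconst", "primaryName", "birthYear", "deathYear",
                       "primaryProfession", "knownForTitles"]

# Two header-less sample rows standing in for got_imdb_name_basics.csv.
sample_csv = io.StringIO(
    'nm0389698,B.J. Hogg,1955.0,2020.0,"actor,music_department",'
    '"tt0986233,tt1240982,tt0970411,tt0944947"\n'
    'nm0269923,Michael Feast,1946.0,,"actor,composer,soundtrack",'
    '"tt0120879,tt0472160,tt0362192,tt0810823"\n'
)

# header=None keeps the first line as data; names= supplies the column labels.
sample_df = pandas.read_csv(sample_csv, header=None, names=NAME_BASICS_COLUMNS)
```

With this form, the separate `name_basics_df.columns = [...]` assignment is no longer needed.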

In [85]:
%sql use f24_project;

 * mysql+pymysql://root:***@localhost?local_infile=1
0 rows affected.


[]

In [86]:
%sql select * from name_basics limit 5;

 * mysql+pymysql://root:***@localhost?local_infile=1
5 rows affected.


nconst,primaryName,birthYear,deathYear,primaryProfession,knownForTitles
nm0389698,B.J. Hogg,1955.0,2020.0,"actor,music_department","tt0986233,tt1240982,tt0970411,tt0944947"
nm0269923,Michael Feast,1946.0,,"actor,composer,soundtrack","tt0120879,tt0472160,tt0362192,tt0810823"
nm0727778,David Rintoul,1948.0,,"actor,archive_footage","tt1139328,tt4786824,tt6079772,tt1007029"
nm6729880,Chuku Modu,1990.0,,"actor,writer,producer","tt4154664,tt2674426,tt0944947,tt6470478"
nm0853583,Owen Teale,1961.0,,"actor,writer,archive_footage","tt0102797,tt0944947,tt0485301,tt0462396"
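The `birthYear` and `deathYear` values appear as `1955.0` rather than `1955`, most likely because pandas holds a numeric column with missing values as `float64`. If integer years are preferred, one option is the nullable `Int64` dtype. This is a sketch on a small hand-built frame, not a step the project requires.

```python
import pandas

# Small stand-in for name_basics_df; None becomes NaN in a float64 column.
years_df = pandas.DataFrame({
    "nconst": ["nm0389698", "nm0269923"],
    "birthYear": [1955.0, 1946.0],
    "deathYear": [2020.0, None],
})

# Int64 (capital I) is pandas' nullable integer dtype: missing values become <NA>.
for col in ["birthYear", "deathYear"]:
    years_df[col] = years_df[col].astype("Int64")
```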


In [87]:
title_basics_df = pandas.read_csv(imdb_data_dir + '/got_imdb_title_basics.csv', header=None)

In [88]:
title_basics_df.columns = ["tconst", "titleType", "primaryTitle",
                           "originalTitle", "isAdult", "startYear", "endYear",
                           "runtimeMinutes", "genres"]

In [89]:
title_basics_df.to_sql('title_basics', schema='f24_project',
                       con=default_engine, index=False, if_exists='replace')

29058

In [90]:
title_principals_df = pandas.read_csv(imdb_data_dir + '/got_imdb_title_principals.csv', header=None)

In [91]:
title_principals_df.head(5)

Unnamed: 0,0,1,2,3,4,5
0,nm0389698,tt0087286,5,actor,,"[""Big Billy""]"
1,nm0389698,tt0101201,9,actor,,"[""Billy Murray""]"
2,nm0389698,tt0120087,8,actor,,"[""Col. Reece""]"
3,nm0389698,tt0122738,9,actor,,"[""Minister""]"
4,nm0389698,tt0124972,7,actor,,"[""Mr.Ken Campbell""]"


In [92]:
title_principals_df.columns = ["nconst", "tconst", "ordering", "category", "job", "characters"]

In [93]:
title_principals_df.to_sql('title_principals', schema='f24_project',
                           con=default_engine, index=False, if_exists='replace')

34193

In [94]:
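# Note: unlike the reads above, this call does not pass header=None, so pandas
# treats the first line of the file as a header row rather than as data.
title_ratings_df = pandas.read_csv(imdb_data_dir + '/got_imdb_title_ratings.csv')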

In [95]:
title_ratings_df.columns = ["tconst", "averageRating", "noOfVotes"]

In [96]:
title_ratings_df.to_sql('title_ratings', schema='f24_project',
                        con=default_engine, index=False, if_exists='replace')

17798

In [97]:
%sql use f24_project

 * mysql+pymysql://root:***@localhost?local_infile=1
0 rows affected.


[]

In [98]:
%sql select * from name_basics limit 5;

 * mysql+pymysql://root:***@localhost?local_infile=1
5 rows affected.


nconst,primaryName,birthYear,deathYear,primaryProfession,knownForTitles
nm0389698,B.J. Hogg,1955.0,2020.0,"actor,music_department","tt0986233,tt1240982,tt0970411,tt0944947"
nm0269923,Michael Feast,1946.0,,"actor,composer,soundtrack","tt0120879,tt0472160,tt0362192,tt0810823"
nm0727778,David Rintoul,1948.0,,"actor,archive_footage","tt1139328,tt4786824,tt6079772,tt1007029"
nm6729880,Chuku Modu,1990.0,,"actor,writer,producer","tt4154664,tt2674426,tt0944947,tt6470478"
nm0853583,Owen Teale,1961.0,,"actor,writer,archive_footage","tt0102797,tt0944947,tt0485301,tt0462396"


In [99]:
%sql select * from title_basics limit 5

 * mysql+pymysql://root:***@localhost?local_infile=1
5 rows affected.


tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
tt0054518,tvSeries,The Avengers,The Avengers,0,1961.0,1969.0,50.0,"Action,Comedy,Crime"
tt0054571,tvSeries,Three Live Wires,Three Live Wires,0,1961.0,,30.0,Comedy
tt0055556,movie,"Two Living, One Dead","Two Living, One Dead",0,1961.0,,105.0,"Crime,Drama,Thriller"
tt0056105,movie,The Swingin' Maiden,The Iron Maiden,0,1962.0,,98.0,"Comedy,Romance"
tt0056696,movie,Young and Willing,The Wild and the Willing,0,1962.0,,110.0,"Drama,Romance"


In [100]:
%sql select * from title_principals limit 5

 * mysql+pymysql://root:***@localhost?local_infile=1
5 rows affected.


nconst,tconst,ordering,category,job,characters
nm0389698,tt0087286,5,actor,,"[""Big Billy""]"
nm0389698,tt0101201,9,actor,,"[""Billy Murray""]"
nm0389698,tt0120087,8,actor,,"[""Col. Reece""]"
nm0389698,tt0122738,9,actor,,"[""Minister""]"
nm0389698,tt0124972,7,actor,,"[""Mr.Ken Campbell""]"
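The `characters` column appears to hold a JSON array encoded as a string (the doubled quotes in the output above are CSV escaping). A minimal sketch of decoding one value into a Python list follows. The helper name `parse_characters` is an assumption, and it treats missing values as an empty list.

```python
import json


def parse_characters(raw):
    """Decode a characters value such as '["Big Billy"]' into a list of names."""
    # None, or NaN (the only value not equal to itself), means no characters listed.
    if raw is None or raw != raw:
        return []
    return json.loads(raw)


example = parse_characters('["Big Billy"]')
```

Applied to a DataFrame, this could be used as `title_principals_df["characters"].apply(parse_characters)`.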


In [101]:
%sql select * from title_ratings limit 5

 * mysql+pymysql://root:***@localhost?local_infile=1
5 rows affected.


tconst,averageRating,noOfVotes
tt0055556,7.3,81
tt0056105,6.4,587
tt0056696,5.9,185
tt0057435,6.2,486
tt0058142,6.9,1535
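A `limit 5` query confirms that a table exists and looks right, but not that every row arrived. One way to check is to compare the DataFrame length with `COUNT(*)` on the loaded table. The sketch below is self-contained, so it uses an in-memory SQLite database as a stand-in for the MySQL `f24_project` schema. The helper `table_row_count` is a hypothetical name, and the three sample rows are copied from the output above.

```python
import sqlite3

import pandas


def table_row_count(con, table_name):
    """Return the number of rows in table_name."""
    # table_name is interpolated directly, so pass only trusted identifiers.
    return con.execute(f"SELECT COUNT(*) FROM {table_name}").fetchone()[0]


# In-memory SQLite stands in for the MySQL engine used in the notebook.
con = sqlite3.connect(":memory:")

ratings_df = pandas.DataFrame({
    "tconst": ["tt0055556", "tt0056105", "tt0056696"],
    "averageRating": [7.3, 6.4, 5.9],
    "noOfVotes": [81, 587, 185],
})
ratings_df.to_sql("title_ratings", con=con, index=False, if_exists="replace")

loaded_rows = table_row_count(con, "title_ratings")
con.close()
```

Against MySQL, the same comparison can be made with `%sql select count(*) from title_ratings` and `len(title_ratings_df)`.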
