## Book: Graph Data Science with Neo4j
### Exercise 01: Netflix movies

In [None]:
## setup with required packages
import os
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

## Get Data: Netflix movies from Kaggle

In [4]:
## get Netflix data - from Kaggle
## https://www.kaggle.com/datasets/shivamb/netflix-shows?resource=download
DATA_DIR = os.path.join(os.getcwd(), 'data')
DATASET_FILE = os.path.join(DATA_DIR, 'netflix_titles.csv.zip')
data = pd.read_csv(DATASET_FILE)
data.head(2)

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."


### Data Structure

In [5]:

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB


Check the values in the `type` column (Movie vs. TV Show):

In [None]:
# count how many titles there are of each type
data['type'].value_counts()

Check cast column:

In [None]:
data.cast.head(5)

Observations:

* some items don't have cast info (NaN)
* the rest hold a comma-separated list of all the actors in that show
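
To make those observations concrete, here is a minimal sketch of how the `cast` strings could be split into one row per (show, actor) pair. It uses a tiny stand-in frame called `sample` rather than the full `data` frame, and the `cast_list` and `actor` column names are my own choices, not part of the dataset.

```python
import pandas as pd

# Tiny stand-in for the notebook's `data` frame (two rows only, for illustration).
sample = pd.DataFrame({
    "show_id": ["s1", "s2"],
    "cast": [None, "Ama Qamata, Khosi Ngema, Gail Mabalane"],
})

# Split the comma-separated string into a list; rows without cast info stay missing.
sample["cast_list"] = sample["cast"].str.split(",")

# One row per (show, actor) pair; shows with no cast info are dropped.
actors = (
    sample[["show_id", "cast_list"]]
    .explode("cast_list")
    .dropna(subset=["cast_list"])
    .rename(columns={"cast_list": "actor"})
)
# Remove the space left over after each comma.
actors["actor"] = actors["actor"].str.strip()
print(actors)
```

The same steps should work on `data` itself, with the caveat that actor names are only as clean as the source strings.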

## Next: Import into Neo4j

The `LOAD CSV` statement in Neo4j takes this general form:

LOAD CSV  
[WITH HEADERS]  
FROM 'file:///file_in_import_folder.csv'  
AS line  
[FIELDTERMINATOR ',']  
// do stuff with 'line'  
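
As a rough illustration of that template filled in, the sketch below keeps a Cypher statement in a Python string so it can be pasted into Neo4j or sent through a driver later. The file name `netflix.csv` matches the file saved further down in this notebook; the `Title` label, the property names, and the `LOAD_TITLES` variable are assumptions of mine, not something the dataset or Neo4j prescribes.

```python
# Hypothetical example statement: the `Title` label and property names are assumptions.
# LOAD CSV reads every field as a string, hence toInteger() for release_year.
LOAD_TITLES = """
LOAD CSV WITH HEADERS
FROM 'file:///netflix.csv'
AS line
MERGE (t:Title {show_id: line.show_id})
SET t.title = line.title,
    t.type = line.type,
    t.release_year = toInteger(line.release_year)
"""

print(LOAD_TITLES.strip())
```

`MERGE` on `show_id` is used here so that re-running the import does not create duplicate nodes; a plain `CREATE` would also fit the template.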

Now switch over to Neo4j and:

1. Create new database
2. Click 'Open folder'
3. Click 'Import'

Save the data to a CSV file for use with Neo4j (it then needs to be placed in the import folder opened above):

In [None]:
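# write next to the source file; index=False keeps the pandas row index out of the CSV
data.to_csv(os.path.join(DATA_DIR, 'netflix.csv'), index=False)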

## Check other fields

Rating:

In [None]:
data.rating.head(5)

In [None]:
unique_ratings = data['rating'].unique()
print(unique_ratings)

Not much value there: these are maturity ratings (PG-13, TV-MA, ...), not an IMDb rating or something like that, as I had thought. THAT would be useful!

listed_in - what is that?

In [None]:
data.listed_in.head(5)

In [None]:
unique_listed_in = data['listed_in'].unique()
print(unique_listed_in)

Basically 'genre'. Could be interesting, but it would need some manipulation, since each title lists several comma-separated genres (a one-to-many relationship). But that's what graph databases are for, right?
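
A minimal sketch of that manipulation, assuming the goal is one (title, genre) pair per row, which maps naturally onto relationships in a graph. It again uses a small stand-in frame called `sample` with two rows from the dataset; the `genre` column and the `pairs` and `genres` names are hypothetical.

```python
import pandas as pd

# Tiny stand-in for the notebook's `data` frame (two rows only, for illustration).
sample = pd.DataFrame({
    "show_id": ["s1", "s2"],
    "listed_in": [
        "Documentaries",
        "International TV Shows, TV Dramas, TV Mysteries",
    ],
})

# Split each listed_in string, then explode to one row per (show, genre) pair.
pairs = (
    sample.assign(genre=sample["listed_in"].str.split(","))
    .explode("genre")
    .assign(genre=lambda df: df["genre"].str.strip())
    .loc[:, ["show_id", "genre"]]
    .reset_index(drop=True)
)

# Distinct genres: candidates for their own nodes in the graph.
genres = sorted(pairs["genre"].unique())
print(pairs)
print(genres)
```

Each row of `pairs` could then become a relationship between a title and a genre, with `genres` giving the set of genre nodes to create first.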