# Getting Started

In this course, you will create a graph database of movies from a set of CSV files.

<img 
    src="https://graphacademy.neo4j.com/courses/importing-cypher/1-importing-data/1-getting-started/images/data-model.svg" 
    alt="Data Model"
    style="width: 50%; height: auto; display: block; margin: 0 auto;"
/>

## Source data

When you import data into Neo4j, you typically start with a set of source files.

You may have exported this source data from:

- Relational databases

- Web APIs

- Public data directories

- BI tools

- Spreadsheets (e.g. Excel or Google Sheets)

The data in the source files may not be in the format needed for your graph data model:

- The source files could contain more data than you need.

- There may not be a one-to-one mapping between the data in a CSV file and a node or relationship.

- The data types might not map directly onto those supported in Neo4j.

You will likely need to transform the data before or during the import.
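As a rough illustration of the kind of transformation involved, the sketch below uses Python's standard `csv` module on a small inline sample. The `personId`, `name`, and `birthYear` columns match the people file used later in this course; the extra `notes` column, the placeholder last row, and the output key names (`id`, `born`) are assumptions made up for the example.

```python
import csv
import io

# Inline sample standing in for a source file. The "notes" column and the
# last row are made up to show an unneeded column and a missing value.
SAMPLE = """personId,name,birthYear,notes
23945,Gerard Pires,1942,imported from legacy system
553509,Helen Reddy,1941,
1,Example Person,,placeholder row with no birth year
"""

def transform(row):
    """Keep only the needed columns and convert strings to proper types."""
    birth_year = row["birthYear"].strip()
    return {
        "id": int(row["personId"]),
        "name": row["name"].strip(),
        # An empty field becomes None rather than an empty string.
        "born": int(birth_year) if birth_year else None,
    }

people = [transform(row) for row in csv.DictReader(io.StringIO(SAMPLE))]
print(people[0])
```

The same conversions can also be done during the import itself, in Cypher, rather than beforehand in Python.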

## Create a graph

Before you start the import process, you should:

1. Understand the data in the source CSV files.

2. Inspect and clean (if necessary) the data in the source data files.

3. Understand the graph data model you will be implementing during the import.

Before you import data into Neo4j, the database has no data structure. The graph data model takes shape as you import the data.

Once you have the source data and a graph data model, you can create the graph by importing the data.

The import involves creating Cypher code to:

- Read the source data.

- Transform the data as needed.

- Create nodes, relationships, and properties to create the graph.

Creating an import process will likely require multiple iterations as you build, test, and refactor.
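The three steps above could be sketched as a single Cypher statement, held here in a Python string so that it can be passed to a driver. This is a minimal sketch, not the course's final import: the `Person` label and the `id`, `name`, and `born` property names are assumptions, while the column names come from the people file used later in this course.

```python
import textwrap

# Column names (personId, name, birthYear) match the people.csv file;
# the Person label and the property names are assumptions for illustration.
IMPORT_PEOPLE = textwrap.dedent("""
    // 1. Read the source data, one map per CSV row
    LOAD CSV WITH HEADERS
    FROM 'https://data.neo4j.com/importing-cypher/people.csv' AS row
    // 2. Transform: LOAD CSV yields strings, so convert the numeric fields
    WITH toInteger(row.personId) AS personId,
         row.name AS name,
         toInteger(row.birthYear) AS born
    // 3. Create the node (MERGE avoids duplicates if the import is re-run)
    MERGE (p:Person {id: personId})
    SET p.name = name, p.born = born
""").strip()

print(IMPORT_PEOPLE)
```

Using `MERGE` rather than `CREATE` is a design choice that suits the iterative build, test, and refactor cycle, since re-running the statement does not duplicate nodes.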

# CSV files

## Normalized Data

**Definition:** Data is organized to minimize redundancy by separating it into multiple related tables (in relational databases) or nodes (in graph databases).

**Characteristics:**
- Minimizes data duplication
- Reduces update anomalies (insert, update, delete)
- Maintains data integrity through relationships
- Requires joins/relationships to reconstruct complete information

Example in Relational Database (Normalized):

```text
Normalized tables:

CUSTOMERS (customer_id, name, email)
ORDERS (order_id, customer_id, order_date)
ORDER_ITEMS (order_id, product_id, quantity)
PRODUCTS (product_id, name, price)
```

Example in Graph Database (Normalized):

```cypher
// Each entity is its own node; relationships connect them
CREATE (customer:Customer {id: 1, name: "John"})-[:PLACED]->(order:Order {id: 101, date: "2023-11-08"})
CREATE (order)-[:CONTAINS]->(product:Product {id: 1001, name: "Laptop", price: 999})
```

## Denormalized Data

**Definition:** Data is stored with some redundancy to optimize read performance, often by duplicating data across multiple records.

**Characteristics:**

- Improves read performance by reducing joins
- Increases storage requirements
- Can lead to update anomalies if not managed carefully
- Simplifies queries by keeping related data together

Example in Document Database:

```json
{
  "order_id": 101,
  "customer": {
    "name": "John",
    "email": "john@example.com"
  },
  "items": [
    {
      "product_name": "Laptop",
      "price": 999,
      "quantity": 1
    }
  ]
}
```

Example in Graph Database (Denormalized):

```cypher
// Some properties duplicated for faster access
CREATE (order:Order {
  id: 101,
  date: "2023-11-08",
  customer_name: "John",  // Denormalized from Customer
  total: 999
})-[:CONTAINS]->(product:Product {id: 1001, name: "Laptop"})
```
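To make the difference concrete for an import, the sketch below takes a few denormalized rows, where customer and product details repeat on every order line, and splits them into distinct nodes and relationships in plain Python. The column names and sample values are hypothetical, loosely following the order examples above.

```python
import csv
import io

# Hypothetical denormalized export: customer and product details repeat per row.
DENORMALIZED = """order_id,customer_id,customer_name,product_id,product_name,price
101,1,John,1001,Laptop,999
101,1,John,1002,Mouse,25
102,1,John,1001,Laptop,999
"""

customers, orders, products = {}, {}, {}
placed, contains = set(), set()

for row in csv.DictReader(io.StringIO(DENORMALIZED)):
    order_id = int(row["order_id"])
    customer_id = int(row["customer_id"])
    product_id = int(row["product_id"])

    # Dictionaries keyed by ID keep one entry per entity (one node each).
    customers[customer_id] = {"name": row["customer_name"]}
    orders[order_id] = {}
    products[product_id] = {"name": row["product_name"], "price": int(row["price"])}

    # Sets of ID pairs keep one entry per relationship.
    placed.add((customer_id, order_id))    # (:Customer)-[:PLACED]->(:Order)
    contains.add((order_id, product_id))   # (:Order)-[:CONTAINS]->(:Product)

print(len(customers), len(orders), len(products))  # 1 2 2
print(sorted(placed))    # [(1, 101), (1, 102)]
print(sorted(contains))  # [(101, 1001), (101, 1002), (102, 1001)]
```

Three source rows become five nodes and five relationships, which is why a CSV file rarely maps one-to-one onto a single node or relationship type.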

# Loading CSV files

Use the [LOAD CSV](https://neo4j.com/docs/cypher-manual/current/clauses/load-csv/) Cypher clause to read data from a CSV file. Its general form is:

```cypher
LOAD CSV [WITH HEADERS] FROM url [AS alias] [FIELDTERMINATOR char]
```
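The effect of the optional `WITH HEADERS` and `FIELDTERMINATOR` parts can be illustrated with Python's standard `csv` module, which behaves analogously. This is an analogy, not Neo4j code. With headers, each row is a map keyed by column name; without them, each row is a list addressed by position. In both cases, as with `LOAD CSV`, every value arrives as a string. The semicolon-delimited sample is made up for the example.

```python
import csv
import io

# Made-up sample using ';' as the field terminator instead of the default ','.
SAMPLE = "personId;name;birthYear\n23945;Gerard Pires;1942\n"

# Analogous to: LOAD CSV WITH HEADERS FROM url AS row FIELDTERMINATOR ';'
with_headers = list(csv.DictReader(io.StringIO(SAMPLE), delimiter=";"))
print(with_headers[0]["name"])       # Gerard Pires

# Analogous to: LOAD CSV FROM url AS row FIELDTERMINATOR ';'
# The header line is then just the first row of data.
without_headers = list(csv.reader(io.StringIO(SAMPLE), delimiter=";"))
print(without_headers[1][1])         # Gerard Pires

# Values are strings in both cases and need explicit conversion.
print(type(with_headers[0]["birthYear"]).__name__)  # str
```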

You are going to load a [CSV file that contains people data](https://data.neo4j.com/importing-cypher/people.csv). To preview the file with pandas:

```python
import pandas as pd

df = pd.read_csv('https://data.neo4j.com/importing-cypher/people.csv')
df.head()
```

```text
   personId            name  birthYear
0     23945    Gerard Pires       1942
1    553509     Helen Reddy       1941
2    113934  Susan Flannery       1939
```

Create a Neo4j driver from connection settings stored in environment variables:

```python
import os

from dotenv import load_dotenv
from neo4j import GraphDatabase

# Load connection settings from a local .env file into the environment.
load_dotenv()

neo4j_uri = os.getenv("NEO4J_URI")
neo4j_user = os.getenv("NEO4J_USERNAME")
neo4j_pass = os.getenv("NEO4J_PASSWORD")
neo4j_db = os.getenv("NEO4J_DATABASE")

neo4j_driver = GraphDatabase.driver(neo4j_uri, auth=(neo4j_user, neo4j_pass))
```

```python
import textwrap

# execute_query is a helper from a local utils module.
from utils import execute_query
```
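The `utils` module imported above is not shown in this section. As a guess at what such a helper could look like, the sketch below wraps the `execute_query` method of the official Neo4j Python driver (version 5) and returns each record as a plain dictionary. The function body, its parameters, and its return shape are assumptions, not the course's actual code.

```python
def execute_query(driver, cypher, parameters=None, database=None):
    """Run a Cypher statement and return the rows as a list of dictionaries.

    Hypothetical stand-in for the helper in utils; `driver` is expected to be
    a neo4j.Driver, such as the one returned by GraphDatabase.driver().
    """
    # Driver.execute_query manages the session and transaction itself and
    # returns a (records, summary, keys) result.
    records, summary, keys = driver.execute_query(
        cypher,
        parameters_=parameters or {},
        database_=database,  # None selects the user's default database
    )
    # Record.data() converts a record into a plain dict keyed by return alias.
    return [record.data() for record in records]
```

Under these assumptions, a call would look like `execute_query(neo4j_driver, cypher, database=neo4j_db)`.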

```python
# Read every row of the CSV file and return it as a map keyed by column header.
cypher = textwrap.dedent("""
LOAD CSV WITH HEADERS
FROM 'https://data.neo4j.com/importing-cypher/people.csv'
AS row
RETURN row
""")

res = execute_query(neo4j_driver, cypher)

res
```

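In this run, the query failed because the driver could not complete a connection to a Neo4j server at `localhost:7687`:

```text
ServiceUnavailable: Couldn't connect to localhost:7687 (resolved to ('127.0.0.1:7687',)):
Failed to read four byte Bolt handshake response from server ResolvedIPv4Address(('127.0.0.1', 7687)) (deadline Deadline(timeout=60.0))
```

Check that the Neo4j instance is running and that `NEO4J_URI` points to it before running the query again.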
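When diagnosing a `ServiceUnavailable` error like the one above, it can help to separate "nothing is listening on the port" from "something is listening but not answering as a Bolt server". The sketch below tests only the first case, using the standard library. The `port_is_open` helper name is made up for this example, and 7687 is assumed when the URI does not name a port (it is the default Bolt port).

```python
import socket
from urllib.parse import urlparse

def port_is_open(uri, timeout=3.0):
    """Return True if a TCP connection to the host and port in `uri` succeeds."""
    parsed = urlparse(uri)
    host = parsed.hostname or "localhost"
    port = parsed.port or 7687  # assumption: fall back to the default Bolt port
    try:
        # create_connection raises OSError (refused, timed out, unresolvable).
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `port_is_open(os.getenv("NEO4J_URI"))` returning `True` while the driver still fails would suggest that the port is reachable but the server behind it is not ready or is not Neo4j. The driver's own `neo4j_driver.verify_connectivity()` method performs the full check, including the Bolt handshake.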