<a href="https://colab.research.google.com/github/lbsocial/data-analysis-with-generative-ai/blob/main/Social_Media_ETL_Neo4j_Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🚀 Building a Social Media Knowledge Graph with Python & Neo4j

**Turn flat data into a connected network.**

In this tutorial, we will build a complete ETL (Extract, Transform, Load) pipeline. We will simulate a rich dataset of social media posts (users, tweets, hashtags, locations) and ingest it into a graph database to reveal the hidden relationships in the data.

**What we will build:**
1.  **Synthetic Data Engine:** Use `Faker` to generate realistic, nested JSON data (similar to Twitter/X API v2).
2.  **Context-Aware Content:** Instead of random gibberish, we will generate semantically consistent text (about AI, Databases, etc.) to prepare for future Vector Search analysis.
3.  **High-Performance Ingestion:** Use the Neo4j Python Driver and Cypher `UNWIND` to batch load data efficiently.

---
### 🛠️ Step 1: Install Dependencies
We need `neo4j` to connect to the database and `faker` to generate our synthetic data.

In [None]:
%pip install -q neo4j faker

### 🦜 Step 2: Generate Context-Aware Social Data
In this step, we create the "dummy" dataset using Python and Faker.

**The Strategy:**
To make this data useful for **AI & Vector Search** later, we aren't just generating random noise. We are generating **Semantic Clusters**.
* We define 4 core topics (Neo4j, AI, Python, Cloud).
* We map specific sentences to each topic.
* This ensures that when we visualize the data later, we will see clear groups of related content.

**Visualizing the Data Structure:**
We nest the **Author** and **Place** objects *inside* the Tweet JSON. This "document-style" structure is easier to pass to the database in one go.

![Tweet Object Structure](https://github.com/lbsocial/data-analysis-with-generative-ai/blob/1660b372f63accb8e55ea4b439924ba764639182/image/Gemini_Generated_Image_kcezvkcezvkcezvk.png?raw=true)
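The nesting described above can be made concrete with a small checker. This is a minimal sketch, not part of the original pipeline: `REQUIRED_PATHS` and `missing_paths` are hypothetical names, and the sample tweet uses hand-written placeholder values shaped like one generated document. The paths listed are the ones the Cypher query in Step 4 reads.

```python
# Paths the Step 4 Cypher query reads from each tweet document.
# REQUIRED_PATHS and missing_paths are illustrative helpers (assumptions).
REQUIRED_PATHS = [
    ("id",),
    ("text",),
    ("created_at",),
    ("geo", "coordinates", "coordinates"),
    ("place", "full_name"),
    ("entities", "hashtags"),
    ("__expansion_author", "id"),
]


def missing_paths(tweet):
    """Return the dotted paths from REQUIRED_PATHS that are absent in `tweet`."""
    missing = []
    for path in REQUIRED_PATHS:
        node = tweet
        for key in path:
            if not isinstance(node, dict) or key not in node:
                missing.append(".".join(path))
                break
            node = node[key]
    return missing


# Hand-written sample with placeholder values, shaped like one generated tweet.
sample_tweet = {
    "id": "1000000000000000001",
    "text": "Graph databases are game changers for handling complex relationships. #Neo4j",
    "created_at": "2024-01-01T12:00:00",
    "entities": {"hashtags": [{"tag": "Neo4j"}]},
    "geo": {"coordinates": {"type": "Point", "coordinates": [-74.0, 40.71]}},
    "place": {"full_name": "New York, NY", "country": "US", "centroid": [-74.0, 40.71]},
    "__expansion_author": {"id": "2000000000000000002", "username": "Alice_Data"},
}

print(missing_paths(sample_tweet))  # prints []
```

Running a check like this before ingestion can catch a malformed document early, since a missing path would otherwise surface as a `null` value or an error inside the Cypher query.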

In [None]:
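import json
import random
from faker import Faker

fake = Faker()
Faker.seed(42)
random.seed(42)  # Seed the stdlib RNG too: cities, topics, and metrics come from `random`

# 1. Set up static data
# Every location carries a "full_name" key, which the Place node uses as its name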
locations_db = [
    {"full_name": "New York, NY", "country": "US", "lon": -74.00, "lat": 40.71},
    {"full_name": "London, UK", "country": "GB", "lon": -0.12, "lat": 51.50},
    {"full_name": "Tokyo, JP", "country": "JP", "lon": 139.69, "lat": 35.68},
    {"full_name": "San Francisco, CA", "country": "US", "lon": -122.41, "lat": 37.77},
    {"full_name": "Paris, FR", "country": "FR", "lon": 2.35, "lat": 48.85}
]

# DOMAIN SPECIFIC CONTENT (Enables Semantic Search later)
topic_content = {
    "Neo4j": [
        "Graph databases are game changers for handling complex relationships.",
        "Just learned how to use Cypher query language, it is so intuitive!",
        "Relational DBs struggle with joins, but graphs handle them naturally.",
        "Building a recommendation engine is much easier with nodes and edges."
    ],
    "AI": [
        "The new Large Language Models are hallucinating less and reasoning more.",
        "Generative AI is transforming how we write code every day.",
        "Thinking about the ethics of autonomous agents in production.",
        "Just deployed a new transformer model to the cloud."
    ],
    "Python": [
        "I love how clean Python syntax is for data science projects.",
        "Pandas and NumPy are essential tools for any data engineer.",
        "Debugging async code in Python can be tricky but worth it.",
        "Automating my daily workflows with a simple Python script."
    ],
    "Cloud": [
        "Serverless architecture really cuts down on maintenance overhead.",
        "Scaling kubernetes clusters across multiple regions today.",
        "Cloud costs are getting high, need to optimize our storage buckets.",
        "Deploying microservices to the edge for lower latency."
    ]
}

hashtags_list = list(topic_content.keys())
usernames = ["Alice_Data", "Bob_Graphs", "Charlie_AI", "Dave_Dev", "Eve_Sec"]

# 2. Pre-generate Users (With 19-digit IDs)
user_map = {}
for username in usernames:
    user_map[username] = {
        "id": str(fake.unique.random_number(digits=19, fix_len=True)),  # fix_len=True guarantees exactly 19 digits
        "username": username,
        "name": fake.name(),
        "public_metrics": {
            "followers_count": random.randint(100, 50000),
            "following_count": random.randint(10, 2000),
            "tweet_count": random.randint(50, 5000)
        }
    }

tweets = []
print("Generating 100 Context-Aware Tweets...")

for i in range(100):
    author = user_map[random.choice(usernames)]
    city = random.choice(locations_db)

    # Pick a Primary Topic to drive the text content
    primary_topic = random.choice(hashtags_list)

    # Generate meaningful text based on the topic
    base_text = random.choice(topic_content[primary_topic])

    # Add some random tags (ensure the primary topic is included)
    tags = [primary_topic]
    if random.random() > 0.5:
        other_tags = [t for t in hashtags_list if t != primary_topic]
        tags += random.sample(other_tags, k=random.randint(1, 2))

    hashtag_entities = [{"tag": t} for t in tags]

    # Generate IDs and Jitter GPS
    tweet_id = str(fake.unique.random_number(digits=19, fix_len=True))  # exactly 19 digits
    tweet_lon = city["lon"] + random.uniform(-0.05, 0.05)
    tweet_lat = city["lat"] + random.uniform(-0.05, 0.05)

    tweet = {
        "id": tweet_id,
        "text": base_text + " " + " ".join([f"#{t}" for t in tags]),
        "created_at": fake.date_time_between(start_date="-30d", end_date="now").isoformat(),
        "author_id": author["id"],
        "public_metrics": {
            "like_count": random.randint(0, 1000),
            "retweet_count": random.randint(0, 500),
            "reply_count": random.randint(0, 50)
        },
        "entities": {
            "hashtags": hashtag_entities
        },
        "geo": {
            "coordinates": {"type": "Point", "coordinates": [tweet_lon, tweet_lat]}
        },
        "place": {
            "full_name": city["full_name"],
            "country": city["country"],
            "centroid": [city["lon"], city["lat"]]
        },
        "__expansion_author": author
    }
    tweets.append(tweet)

with open("dummy_tweets.json", "w") as f:
    json.dump(tweets, f, indent=2)

print("✅ Generated context-aware tweets (Ready for Vector Search).")

In [None]:
# Preview the first tweet to confirm the nested structure
with open("dummy_tweets.json", "r") as f:
    data = json.load(f)

print(json.dumps(data[0], indent=2))
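To check that the semantic clusters are roughly balanced, one option is to count tweets by primary topic. This is a minimal sketch: `primary_topic_counts` is a hypothetical helper, and it relies on the generator always placing the primary topic first in the hashtag list. The sample below is hand-written; in the notebook you would pass the loaded `data` list instead.

```python
from collections import Counter


def primary_topic_counts(tweets):
    """Count tweets by their first hashtag, which the generator sets to the primary topic."""
    return Counter(t["entities"]["hashtags"][0]["tag"] for t in tweets)


# Tiny hand-written sample; in the notebook, pass the loaded `data` list instead.
sample = [
    {"entities": {"hashtags": [{"tag": "Neo4j"}, {"tag": "AI"}]}},
    {"entities": {"hashtags": [{"tag": "Neo4j"}]}},
    {"entities": {"hashtags": [{"tag": "Cloud"}]}},
]

print(primary_topic_counts(sample))
```

With 100 tweets spread over 4 topics, each count should land somewhere near 25, though the exact split depends on the random seed.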

### 🔌 Step 3: Connect to Neo4j
We use the official Python driver.
* **Note:** Ensure you have added your `URI` and `password` to the Colab "Secrets" tab.

In [None]:
from google.colab import userdata

# --- CONFIGURATION ---
# Read the connection details from the Colab "Secrets" tab
URI = userdata.get('URI')
password = userdata.get('password')
AUTH = ("neo4j", password)
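Outside Colab, `google.colab` cannot be imported, so the cell above fails. The following is a minimal sketch of a fallback, assuming you export the connection details as environment variables. The variable names `NEO4J_URI` and `NEO4J_PASSWORD` and the helper `load_neo4j_config` are hypothetical; rename them to match your setup.

```python
import os


def load_neo4j_config(env=None):
    """Return (uri, auth) from Colab Secrets if available, else from environment variables."""
    env = os.environ if env is None else env
    try:
        from google.colab import userdata  # only importable inside Colab
        uri, password = userdata.get("URI"), userdata.get("password")
    except ImportError:
        # Hypothetical variable names (assumption); adjust to your environment.
        uri, password = env.get("NEO4J_URI"), env.get("NEO4J_PASSWORD")
    if not uri or not password:
        raise ValueError("Neo4j URI and password must both be set.")
    return uri, ("neo4j", password)
```

Usage would then be `URI, AUTH = load_neo4j_config()`, which keeps the rest of the notebook unchanged.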


In [None]:
from neo4j import GraphDatabase

try:
    with GraphDatabase.driver(URI, auth=AUTH) as driver:
        driver.verify_connectivity()
        print("✅ SUCCESS! Connected to Neo4j.")
except Exception as e:
    print(f"❌ ERROR: {e}")

### 📥 Step 4: Ingest Data (Graph Construction)
We use **Cypher** (Neo4j's query language) to map our JSON data into nodes and relationships.

**The Blueprint:**
The image below shows the exact graph structure our query will build. Notice how we split the single Tweet object into four types of connected nodes: User, Tweet, Place, and Hashtag.

![Graph Data Model](https://github.com/lbsocial/data-analysis-with-generative-ai/blob/1660b372f63accb8e55ea4b439924ba764639182/image/Gemini_Generated_Image_vslfdxvslfdxvslf.png?raw=true)


**The Cypher Strategy:**
1.  **UNWIND:** We pass the entire list of tweets as a single parameter (`$batch`). `UNWIND` processes them one row at a time.
2.  **MERGE (User, Place, Hashtag):** We use `MERGE` to find existing nodes or create them if they don't exist. This prevents duplicates.
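3.  **CREATE (Tweet):** We use `CREATE` for tweets because every tweet ID is globally unique; we always want a new node. Because `CREATE` never checks for an existing node, re-running the cell on the same file adds duplicate Tweet nodes, so clear the database before loading it again.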
4.  **Relationships:** We connect the entities as shown in the diagram above.
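`MERGE` looks up each node by its key property, so it benefits from uniqueness constraints, which also create a backing index. The following is a minimal sketch, assuming Neo4j 5 constraint syntax; the constraint names and the `create_constraints` helper are illustrative and not part of the original notebook.

```python
# One uniqueness constraint per MERGE/CREATE key used in the ingestion query.
# Constraint names (user_id, tweet_id, ...) are illustrative choices (assumption).
CONSTRAINTS = [
    "CREATE CONSTRAINT user_id IF NOT EXISTS FOR (u:User) REQUIRE u.id IS UNIQUE",
    "CREATE CONSTRAINT tweet_id IF NOT EXISTS FOR (t:Tweet) REQUIRE t.id IS UNIQUE",
    "CREATE CONSTRAINT place_name IF NOT EXISTS FOR (p:Place) REQUIRE p.name IS UNIQUE",
    "CREATE CONSTRAINT hashtag_name IF NOT EXISTS FOR (h:Hashtag) REQUIRE h.name IS UNIQUE",
]


def create_constraints(driver, database="neo4j"):
    """Run each constraint statement in its own auto-commit transaction."""
    with driver.session(database=database) as session:
        for statement in CONSTRAINTS:
            session.run(statement).consume()
```

Call `create_constraints(driver)` inside a `with GraphDatabase.driver(URI, auth=AUTH) as driver:` block before the first load. With the Tweet constraint in place, loading the same file twice fails with a constraint error instead of silently duplicating tweets.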

In [None]:
import json
from neo4j import GraphDatabase


query = """
UNWIND $batch AS row

// --------------------------------------------------------
// 1. NODE: USER (Complete Profile Metrics)
// --------------------------------------------------------
MERGE (u:User {id: row.__expansion_author.id})
SET u.username = row.__expansion_author.username,
    u.name = row.__expansion_author.name,

    // --- USER METRICS ---
    u.followers = row.__expansion_author.public_metrics.followers_count,
    u.following = row.__expansion_author.public_metrics.following_count,
    u.tweet_count = row.__expansion_author.public_metrics.tweet_count,
    u.listed_count = row.__expansion_author.public_metrics.listed_count // absent in the synthetic data, so this stays unset

// --------------------------------------------------------
// 2. NODE: TWEET (Complete Engagement Metrics)
// --------------------------------------------------------
CREATE (t:Tweet {id: row.id})
SET t.text = row.text,
    t.created_at = datetime(row.created_at),

    // --- TWEET METRICS ---
    t.likes = row.public_metrics.like_count,
    t.retweets = row.public_metrics.retweet_count,
    t.replies = row.public_metrics.reply_count,
    t.quotes = row.public_metrics.quote_count, // absent in the synthetic data, so this stays unset

    // --- GEO PIN ---
    t.location = point({
        longitude: row.geo.coordinates.coordinates[0],
        latitude:  row.geo.coordinates.coordinates[1]
    })

// Link User -> Tweet
MERGE (u)-[:POSTED]->(t)

// --------------------------------------------------------
// 3. NODE: PLACE (City Center)
// --------------------------------------------------------
MERGE (p:Place {name: row.place.full_name})
ON CREATE SET
    p.country = row.place.country,
    p.location = point({
        longitude: row.place.centroid[0],
        latitude:  row.place.centroid[1]
    })

// Link Tweet -> Place
MERGE (t)-[:LOCATED_AT]->(p)

// --------------------------------------------------------
// 4. NODE: HASHTAG
// --------------------------------------------------------
FOREACH (tagObj IN row.entities.hashtags |
    MERGE (h:Hashtag {name: toLower(tagObj.tag)})
    MERGE (t)-[:TAGGED_WITH]->(h)
)
"""

print("Importing Fully Loaded Graph...")
# Make sure to use the filename you generated in the last step
with open("dummy_tweets.json", "r") as f:
    data = json.load(f)

with GraphDatabase.driver(URI, auth=AUTH) as driver:
    driver.verify_connectivity()
    with driver.session(database="neo4j") as session:
        # consume() exhausts the result inside the transaction before it commits
        session.execute_write(lambda tx: tx.run(query, batch=data).consume())

print("✅ Success! All properties (Followers, Tweet Count, etc.) are now in the graph.")
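Sending all 100 tweets as one `$batch` works well at this size. For much larger files, a common approach is to split the list into chunks and commit each chunk in its own transaction. The following is a minimal sketch under that assumption; `chunked` and `load_in_chunks` are hypothetical helpers, and the default chunk size of 1000 is an arbitrary starting point rather than a tuned value.

```python
def chunked(rows, size):
    """Yield consecutive slices of `rows` with at most `size` items each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def load_in_chunks(driver, query, rows, size=1000, database="neo4j"):
    """Send each chunk as its own $batch parameter in a separate write transaction."""
    with driver.session(database=database) as session:
        for chunk in chunked(rows, size):
            # Bind the chunk as a default argument so each lambda keeps its own slice.
            session.execute_write(lambda tx, c=chunk: tx.run(query, batch=c).consume())


print([len(c) for c in chunked(list(range(100)), 40)])  # prints [40, 40, 20]
```

In the notebook this would replace the single `execute_write` call with `load_in_chunks(driver, query, data)`.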

### 🏁 Conclusion
You have successfully built a pipeline that transforms nested JSON documents into a graph!

**What's Next?**
Now that your data is in Neo4j, you have a foundation for:
* **Graph Analytics:** "Who is the most influential user?"
* **Vector Search:** "Find tweets about data science." (Using the semantic text we just generated!)
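The graph analytics question above can be sketched as a single read query. This is one possible reading of "influential", assuming influence means total likes plus retweets across a user's tweets; the `top_users` helper and `TOP_USERS_QUERY` name are hypothetical. The property names (`likes`, `retweets`, `followers`) are the ones set by the ingestion query in Step 4.

```python
# Ranks users by summed likes + retweets; this definition of influence is an assumption.
TOP_USERS_QUERY = """
MATCH (u:User)-[:POSTED]->(t:Tweet)
RETURN u.username AS username,
       u.followers AS followers,
       count(t) AS tweets,
       sum(t.likes + t.retweets) AS engagement
ORDER BY engagement DESC
LIMIT $limit
"""


def top_users(driver, limit=5, database="neo4j"):
    """Return the top `limit` users as a list of dicts, ranked by likes + retweets."""
    def work(tx):
        # Convert records to plain dicts while the transaction is still open.
        return [record.data() for record in tx.run(TOP_USERS_QUERY, limit=limit)]

    with driver.session(database=database) as session:
        return session.execute_read(work)
```

Called as `top_users(driver)` inside a `with GraphDatabase.driver(URI, auth=AUTH) as driver:` block, it returns one dictionary per user, ready to print or load into a DataFrame.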