# Beyond Flat Files - A Short Introduction to JSON

### Why CSV Isn't Always Enough 🤷‍♀️

We've worked a lot with CSV (Comma-Separated Values) files. They are fantastic for one specific kind of data: **flat, tabular data**. Think of a spreadsheet: a clean grid of rows and columns.

But what happens when our data isn't flat?


**Motivating Example 1: Nested Data**

Imagine we have data about blog posts. A single post has a title, an author, and a list of comments. Each comment, in turn, has its own author and text.

How would you store this in a CSV?

  * You could have columns like `comment_1_author`, `comment_1_text`, `comment_2_author`, `comment_2_text`, etc. This is messy and has a fixed limit on the number of comments.
  * You could create separate CSV files (one for posts, one for comments) and then join them. This works, but it complicates our data loading process.
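To make the second option concrete, here is a minimal sketch of the two-file approach. The file contents, the `post_id` key, and the column names are assumptions made up for illustration; they are inlined as strings so the snippet runs on its own.

```python
import io
import pandas as pd

# Hypothetical contents of posts.csv and comments.csv, inlined for the sketch.
posts_csv = io.StringIO(
    "post_id,title,author\n"
    "1,My First Post,Bob\n"
)
comments_csv = io.StringIO(
    "post_id,author,text\n"
    "1,Alice,Great post!\n"
    "1,Charlie,Very informative.\n"
)

posts = pd.read_csv(posts_csv)
comments = pd.read_csv(comments_csv)

# Attach each comment to its parent post. Both tables have an 'author'
# column, so suffixes are needed to tell them apart after the join.
joined = comments.merge(posts, on="post_id", suffixes=("_comment", "_post"))
print(joined)
```

It works, but the relationship between a post and its comments now lives in an extra key column and a join step rather than in the data itself.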

**Motivating Example 2: Inconsistent Data**
Consider an e-commerce site selling different products.

  * A **book** has a title and an ISBN.
  * A **t-shirt** has a color and a size.
  * A **laptop** has RAM and a screen size.

A single CSV file holding all products would need a column for every attribute of every product type. Most cells in each row would be empty, making the table sparse and inefficient.
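A rough sketch of how sparse such a table gets. The three product records and their field names (`isbn`, `ram_gb`, and so on) are invented for illustration.

```python
import pandas as pd

# Three hypothetical products, each with its own set of attributes.
products = [
    {"type": "book", "title": "Example Book", "isbn": "000-0000000000"},
    {"type": "t-shirt", "color": "blue", "size": "M"},
    {"type": "laptop", "ram_gb": 16, "screen_inches": 14.0},
]

# Forcing them into one table creates a column per attribute,
# with missing values wherever an attribute does not apply.
df_products = pd.DataFrame(products)
print(df_products)

empty_cells = int(df_products.isna().sum().sum())
total_cells = df_products.shape[0] * df_products.shape[1]
print(f"{empty_cells} of {total_cells} cells are empty")
```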

The core limitation of CSV is its inability to natively represent **hierarchy** or **nested structures**. It forces everything into a simple two-dimensional grid.

### JSON to the Rescue! ✨

This is where **JSON (JavaScript Object Notation)** comes in. It's a lightweight, human-readable text format designed for semi-structured data. Despite the name, it is completely language-independent.

JSON is built on two fundamental structures that map directly to Python data types:

1.  **Objects (like Python Dictionaries):** A collection of key-value pairs, enclosed in curly braces `{}`. Keys must be strings, and values can be any JSON data type.
      * `{"name": "Alice", "age": 30}`
2.  **Arrays (like Python Lists):** An ordered list of values, enclosed in square brackets `[]`.
      * `["apple", "banana", "cherry"]`

The other data types are strings, numbers, booleans (`true`/`false`), and `null` (equivalent to Python's `None`).
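As a quick check of this mapping, the sketch below parses a small JSON string with the standard `json` module and prints the Python type of each value. The record itself is made up for illustration.

```python
import json

# A made-up JSON document that uses every JSON data type.
text = (
    '{"name": "Alice", "age": 30, "score": 4.5, '
    '"tags": ["a", "b"], "active": true, "nickname": null}'
)

record = json.loads(text)  # JSON object -> Python dict

for key, value in record.items():
    # Show which Python type each JSON value was converted to.
    print(f"{key}: {value!r} ({type(value).__name__})")

# The mapping also works in reverse: True/None become true/null.
round_trip = json.dumps({"active": True, "nickname": None})
print(round_trip)
```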


Let's revisit our blog post example. In JSON, it's clean and intuitive:

```json
{
  "title": "My First Post",
  "author": "Bob",
  "published_date": "2025-10-09",
  "comments": [
    {
      "author": "Alice",
      "text": "Great post!"
    },
    {
      "author": "Charlie",
      "text": "Very informative."
    }
  ]
}
```

See how the `comments` key holds an array of comment objects? The hierarchical relationship is perfectly preserved.

Here are a few more examples of JSON objects, showing various levels of complexity:

**Example 1: An Object Containing Nested Objects**

In [None]:
{
  "person": {
    "firstName": "Bob",
    "lastName": "Smith"
  },
  "address": {
    "street": "123 Main St",
    "city": "Anytown",
    "zipCode": "12345"
  },
  "contact": {
    "email": "bob.smith@example.com",
    "phone": "555-1234"
  }
}

{'person': {'firstName': 'Bob', 'lastName': 'Smith'},
 'address': {'street': '123 Main St', 'city': 'Anytown', 'zipCode': '12345'},
 'contact': {'email': 'bob.smith@example.com', 'phone': '555-1234'}}

**Example 2: An Order with an Array of Items**

In [None]:
{
  "orderId": "ORD12345",
  "customer": {
    "customerId": "CUST987",
    "name": "Charlie Brown"
  },
  "items": [
    {
      "productId": "PROD001",
      "name": "Laptop",
      "price": 1200.00,
      "quantity": 1
    },
    {
      "productId": "PROD005",
      "name": "Mouse",
      "price": 25.00,
      "quantity": 2
    }
  ],
  "totalAmount": 1250.00
}

**Example 3: A Survey with Mixed Question Types**

In [None]:
{
  "surveyTitle": "Customer Feedback Survey",
  "surveyDate": "2024-07-24",
  "questions": [
    {
      "questionId": "Q1",
      "text": "How satisfied are you with our service?",
      "type": "rating",
      "options": [1, 2, 3, 4, 5]
    },
    {
      "questionId": "Q2",
      "text": "Which features do you use most often?",
      "type": "checkbox",
      "options": ["Feature A", "Feature B", "Feature C"],
      "responses": ["Feature A", "Feature C"]
    },
    {
      "questionId": "Q3",
      "text": "Any additional comments?",
      "type": "text",
      "response": "The service was great!"
    }
  ],
  "respondentId": "RESP5678"
}

### Working with JSON in Python 🐍

Python has a built-in `json` module that makes working with this format incredibly easy. The two most important functions are `json.load()` (to read from a file) and `json.loads()` (to read from a string).
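A small sketch of the difference, along with the matching writers `json.dumps()` and `json.dump()`. The temporary file path is only there to keep the example self-contained.

```python
import json
import os
import tempfile

user = {"id": 1, "name": "Alice", "email": "alice@example.com"}

# The "s" variants work on strings.
as_text = json.dumps(user)       # dict -> JSON string
from_text = json.loads(as_text)  # JSON string -> dict

# The variants without "s" work on open file objects.
path = os.path.join(tempfile.mkdtemp(), "user.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(user, f, indent=2)
with open(path, "r", encoding="utf-8") as f:
    from_file = json.load(f)

print(from_text == user, from_file == user)
```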

Let's compare loading a simple CSV vs. a simple JSON into Pandas.

#### **Loading CSV**

For a collection of records in a CSV, it's a one-liner:

In [None]:
import pandas as pd

# users.csv
# id,name,email
# 1,Alice,alice@example.com
# 2,Bob,bob@example.com

df_csv = pd.read_csv('users.csv')
print(df_csv)

#### **Loading JSON**

The structure of the JSON file matters.

**Case 1: An Array of Objects (Ideal for a DataFrame)**
This is the most direct equivalent to a CSV with multiple records.

In [None]:
# users.json
# [
#   {"id": 1, "name": "Alice", "email": "alice@example.com"},
#   {"id": 2, "name": "Bob", "email": "bob@example.com"}
# ]

df_json = pd.read_json('users.json')
print(df_json)

As you can see, `pd.read_json` works just as easily here.

**Case 2: Line-Delimited JSON**
Sometimes, a file contains a separate, complete JSON object on each line. This is a common format for data streams and exports (e.g., from MongoDB).

In [None]:
# users_lines.json
# {"id": 1, "name": "Alice", "email": "alice@example.com"}
# {"id": 2, "name": "Bob", "email": "bob@example.com"}

# The magic argument is `lines=True`
df_lines = pd.read_json('users_lines.json', lines=True)
print(df_lines)
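
For intuition, `lines=True` behaves roughly like parsing each line on its own. The sketch below does that by hand, then compares it with `pd.read_json` reading from an in-memory buffer (used here instead of a file so the snippet runs anywhere), and finally writes the same format back out with `to_json`.

```python
import io
import json
import pandas as pd

lines_text = (
    '{"id": 1, "name": "Alice", "email": "alice@example.com"}\n'
    '{"id": 2, "name": "Bob", "email": "bob@example.com"}\n'
)

# Manual route: every non-empty line is a complete JSON document.
records = [json.loads(line) for line in lines_text.splitlines() if line.strip()]
df_manual = pd.DataFrame(records)

# Pandas route: same data, parsed in one call.
df_lines_demo = pd.read_json(io.StringIO(lines_text), lines=True)

# Writing line-delimited JSON requires orient="records".
out_text = df_lines_demo.to_json(orient="records", lines=True)
print(out_text)
```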

**Case 3: Nested JSON (The Fun Part!)**
What about our blog post example? If we just use `read_json`, we get one row per comment, but each cell in the `comments` column still holds a raw dictionary, which isn't very useful for analysis. This is where we use Pandas' secret weapon: `json_normalize`.

In [None]:
# Create the blogpost.json file
blog_post_data = {
  "title": "My First Post",
  "author": "Bob",
  "published_date": "2025-10-09",
  "comments": [
    {
      "author": "Alice",
      "text": "Great post!"
    },
    {
      "author": "Charlie",
      "text": "Very informative."
    }
  ]
}

import json

with open('blogpost.json', 'w') as f:
    json.dump(blog_post_data, f, indent=4)

print("blogpost.json created successfully.")

blogpost.json created successfully.


In [None]:
with open('blogpost.json', 'r') as f:
    data = json.load(f)

In [None]:
data['comments'][0]['text']

'Great post!'

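JSON records do not have to share the same keys. Where a key is missing, `read_json` leaves the cell empty (`NaN`):

In [None]: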
contacts = [{'name': 'A', "address": "AA"},{'name': 'B'},{'address': 'CC'}]

In [None]:
with open('contacts.json', 'w') as f:
    json.dump(contacts, f, indent=4)


In [None]:
pd.read_json('contacts.json')

  name address
0    A      AA
1    B     NaN
2  NaN      CC


In [None]:
pd.read_json('blogpost.json')

           title author published_date                                           comments
0  My First Post    Bob     2025-10-09         {'author': 'Alice', 'text': 'Great post!'}
1  My First Post    Bob     2025-10-09  {'author': 'Charlie', 'text': 'Very informativ...


In [None]:
import json
import pandas as pd

with open('blogpost.json', 'r') as f:
    data = json.load(f) # Load the JSON into a Python dict

# Normalize the data without specifying record_path or meta
# This will result in a DataFrame with a single row and a 'comments' column
# containing the list of comment dictionaries.
df_without_flattening = pd.json_normalize(data)

print("DataFrame without using record_path or meta:")
print(df_without_flattening)

DataFrame without using record_path or meta:
           title author published_date  \
0  My First Post    Bob     2025-10-09   

                                            comments  
0  [{'author': 'Alice', 'text': 'Great post!'}, {...  


In [None]:
df_without_flattening.shape

(1, 4)
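
Lists are left alone by a plain `json_normalize` call, but nested *objects* are flattened into dotted column names. A minimal sketch, reusing part of the person/address record from earlier:

```python
import pandas as pd

record = {
    "person": {"firstName": "Bob", "lastName": "Smith"},
    "address": {"street": "123 Main St", "city": "Anytown", "zipCode": "12345"},
}

# Nested dictionaries become columns named "<parent>.<child>".
df_flat = pd.json_normalize(record)
print(df_flat.columns.tolist())

# The separator can be changed, e.g. to get names usable as attributes.
df_flat_underscore = pd.json_normalize(record, sep="_")
print(df_flat_underscore.columns.tolist())
```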

In [None]:
import json
import pandas as pd

with open('blogpost.json', 'r') as f:
    data = json.load(f) # Load the JSON into a Python dict

print("DataFrame using record_path and meta (with meta_prefix):")
# Normalize the nested 'comments' data using record_path and meta
# Use meta_prefix to avoid naming conflicts for the 'author' column
df_comments = pd.json_normalize(
    data,
    record_path=['comments'], # The list to unpack into rows
    meta=['title', 'author'], # Parent-level fields to include
    meta_prefix='post_' # Add a prefix to metadata columns
)

print(df_comments)

DataFrame using record_path and meta (with meta_prefix):
    author               text     post_title post_author
0    Alice        Great post!  My First Post         Bob
1  Charlie  Very informative.  My First Post         Bob
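
The same call also accepts a list of posts, which is the more common shape in practice. In the sketch below, the second post ("Second Post" by "Dana") is invented for illustration.

```python
import pandas as pd

posts = [
    {
        "title": "My First Post",
        "author": "Bob",
        "comments": [
            {"author": "Alice", "text": "Great post!"},
            {"author": "Charlie", "text": "Very informative."},
        ],
    },
    {
        # Hypothetical second post, not part of blogpost.json.
        "title": "Second Post",
        "author": "Dana",
        "comments": [{"author": "Bob", "text": "Nice follow-up."}],
    },
]

# One row per comment across all posts, each tagged with its parent post.
df_all = pd.json_normalize(
    posts,
    record_path="comments",
    meta=["title", "author"],
    meta_prefix="post_",
)
print(df_all)

# Once flattened, ordinary DataFrame operations apply.
comments_per_post = df_all.groupby("post_title").size()
print(comments_per_post)
```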


Want more examples? Check out ["GitHub Events"](https://docs.github.com/en/rest/using-the-rest-api/github-event-types?apiVersion=2022-11-28).


### A Quick Word on XML 🏛️

You might also encounter another format called **XML (eXtensible Markup Language)**. It solves the same problem of representing hierarchical data but does so with a tag-based syntax, similar to HTML.

```xml
<post>
    <title>My First Post</title>
    <author>Bob</author>
    <comments>
        <comment>
            <author>Alice</author>
            <text>Great post!</text>
        </comment>
        <comment>
            <author>Charlie</author>
            <text>Very informative.</text>
        </comment>
    </comments>
</post>
```
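
For comparison with the JSON workflow, here is a minimal sketch that reads the XML above with Python's standard `xml.etree.ElementTree` module and builds the same one-row-per-comment table. The `post_title` and `post_author` column names are chosen here to mirror the `json_normalize` result; they are not dictated by the XML.

```python
import xml.etree.ElementTree as ET
import pandas as pd

xml_text = """
<post>
    <title>My First Post</title>
    <author>Bob</author>
    <comments>
        <comment>
            <author>Alice</author>
            <text>Great post!</text>
        </comment>
        <comment>
            <author>Charlie</author>
            <text>Very informative.</text>
        </comment>
    </comments>
</post>
"""

root = ET.fromstring(xml_text.strip())

# Build one dictionary per <comment>, copying the post-level fields onto each.
rows = [
    {
        "author": comment.findtext("author"),
        "text": comment.findtext("text"),
        "post_title": root.findtext("title"),
        "post_author": root.findtext("author"),  # direct child of <post> only
    }
    for comment in root.iter("comment")
]

df_xml = pd.DataFrame(rows)
print(df_xml)
```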

While JSON is more popular today for web APIs and modern applications due to its simplicity and lower verbosity, XML is far from obsolete. It has some distinct advantages:

  * **Schema and Validation:** XML has very mature systems (like XSD) for defining and enforcing a strict data structure. This is critical in enterprise systems, finance, and government, where data integrity is paramount.
  * **Namespaces:** It provides a standard way to avoid naming conflicts when mixing data from different sources.
  * **Comments:** XML natively supports comments, which JSON does not. This is useful for configuration files or documents that need annotation.

So, while you'll probably work with JSON more often, it's important to recognize XML and understand that its robustness and strictness keep it relevant in many professional contexts.

### **Recap**

  * **CSV** is for flat, tabular data.
  * **JSON** excels at representing nested or hierarchical data structures.
  * Python's `json` library and Pandas (`pd.read_json`, `pd.json_normalize`) give you powerful tools to load, parse, and flatten JSON into usable DataFrames.
  * **XML** is another format for hierarchical data, valued for its schema enforcement and maturity in enterprise settings.