## Inferring schema in MongoDB

You can control the shape and contents of documents in a collection by defining a schema. Schemas let you require specific fields, control the type of a field's value, and validate changes before committing write operations.

Benefits:
- improved query performance
- improved data organization
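As a rough illustration of the kind of schema described above, the sketch below writes a `$jsonSchema` validator as a plain Python dictionary. The field names are borrowed from the sample account records used later in this notebook. The collection name in the commented-out call is a placeholder, and the choice of required fields and types is an assumption, not something this notebook prescribes.

```python
# A hand-written validator: every field is required and has a fixed BSON type.
# The required list and the types are assumptions made for illustration.
account_validator = {
    "$jsonSchema": {
        "bsonType": "object",
        "required": ["account_number", "account_name", "balance", "currency"],
        "properties": {
            "account_number": {"bsonType": "string"},
            "account_name": {"bsonType": "string"},
            "balance": {"bsonType": "double"},
            "currency": {"bsonType": "string"},
        },
    }
}

# With a live connection, the validator could be attached when the
# collection is created ("accounts" is a placeholder name):
# db.create_collection("accounts", validator=account_validator)

print(sorted(account_validator["$jsonSchema"]["properties"]))
```

Once a validator like this is attached, MongoDB rejects inserts and updates that do not match it, which is the "validate changes before committing write operations" behaviour mentioned above.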

We will demonstrate how to infer a schema in the following sections.

## Connect to MongoDB

In [1]:
# URI generation

import os
from pymongo import MongoClient


# The credentials are hard-coded here for the demo.
# Replace the values of the variables with your own credentials.

# Generate the MongoDB URI
password = "a123456"
# Copy this URI from the MongoDB portal
MONGODB_URI = f"mongodb+srv://luckyboy:{password}@clusterforhist4701.5ijtuzc.mongodb.net/"

# Set the MONGODB_URI environment variable
os.environ["MONGODB_URI"] = MONGODB_URI

# Display the generated URI
print("Generated MONGODB_URI:")
print(MONGODB_URI)

client = MongoClient(MONGODB_URI)

for db_name in client.list_database_names():
    print(db_name)

Generated MONGODB_URI:
mongodb+srv://luckyboy:a123456@clusterforhist4701.5ijtuzc.mongodb.net/
HIST4701s_trial_2
admin
local


## Inferring schema

Say we are working with the previous dataset: 

```
a_set_of_new_archives = [
    {
        "account_number": "0987654321",
        "account_name": "Jane Smith",
        "balance": 2500.75,
        "currency": "USD"
    },
    {
        "account_number": "9876543210",
        "account_name": "Alice Johnson",
        "balance": 500.25,
        "currency": "EUR"
    },
    {
        "account_number": "5678901234",
        "account_name": "Bob Williams",
        "balance": 3500.0,
        "currency": "GBP"
    }
]
```

We can infer the schema from the dataset itself with a small Python function, so that we don't need to write the schema down manually.

In [2]:
from collections import defaultdict

def infer_schema(data):
    schema = defaultdict(set)

    for document in data:
        for key, value in document.items():
            schema[key].add(type(value).__name__)

    inferred_schema = {}
    for key, value in schema.items():
        inferred_schema[key] = list(value)[0] if len(value) == 1 else "mixed"

    return inferred_schema

# Sample data
a_set_of_new_archives = [
    {
        "account_number": "0987654321",
        "account_name": "Jane Smith",
        "balance": 2500.75,
        "currency": "USD"
    },
    {
        "account_number": "9876543210",
        "account_name": "Alice Johnson",
        "balance": 500.25,
        "currency": "EUR"
    },
    {
        "account_number": "5678901234",
        "account_name": "Bob Williams",
        "balance": 3500.0,
        "currency": "GBP"
    }
]

# Infer schema
schema = infer_schema(a_set_of_new_archives)

# Print inferred schema
for key, value in schema.items():
    print(f"{key}: {value}")

account_number: str
account_name: str
balance: float
currency: str


The result is:
```
account_number: str
account_name: str
balance: float
currency: str
```

When every document stores a given field with the same type, the inferred schema maps each field to exactly one data type. A field observed with more than one type is reported as `mixed` instead.
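To see what happens when the data is not homogeneous, the sketch below runs the same inference logic on two made-up records in which `balance` is a float in one and a string in the other. The function is repeated so the cell runs on its own; the records and the `mixed_archives` name are hypothetical.

```python
from collections import defaultdict


def infer_schema(data):
    """Same logic as the notebook's infer_schema, repeated for a standalone run."""
    observed = defaultdict(set)
    for document in data:
        for key, value in document.items():
            observed[key].add(type(value).__name__)
    # One observed type -> that type's name; several -> "mixed"
    return {
        key: next(iter(types)) if len(types) == 1 else "mixed"
        for key, types in observed.items()
    }


# Hypothetical records: "balance" changes type, and "currency" is missing once
mixed_archives = [
    {"account_number": "0987654321", "balance": 2500.75, "currency": "USD"},
    {"account_number": "9876543210", "balance": "500.25"},
]

mixed_schema = infer_schema(mixed_archives)
for key, value in mixed_schema.items():
    print(f"{key}: {value}")
```

Here `balance` comes back as `mixed`, while `currency` is still reported as `str` even though one record lacks it: the function only records the types it sees and does not track whether a field is present in every document.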

## Checking whether an index exists for each field in the schema

We can use the keys of the inferred schema to check whether an index has been created for each field. MongoDB names a single-field ascending index `<field>_1` by default, so we look for that name in the collection's index information:


In [3]:
db = client["HIST4701s_trial_2"]
collection = db["archives_trial_2"]

# Look for a default-named ascending index on each field of the schema
index_info = collection.index_information()
for key in schema.keys():
    index_name = f"{key}_1"
    if index_name in index_info:
        print(f"{key}: {schema[key]} (Index: {index_name})")
    else:
        print(f"{key}: {schema[key]} (Index not created)")

account_number: str (Index not created)
account_name: str (Index not created)
balance: float (Index not created)
currency: str (Index not created)
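The name-based check above misses indexes that were created with a custom name or in descending order. A possible alternative, sketched below, inspects the `key` specification of each index instead. The `indexed_fields` helper and the sample dictionary are hypothetical; the dictionary is hand-written to resemble what `collection.index_information()` returns, so its shape should be checked against a real collection.

```python
def indexed_fields(index_info):
    """Return the fields covered by a single-field index, whatever its name."""
    fields = set()
    for spec in index_info.values():
        key = spec.get("key", [])
        if len(key) == 1:  # skip compound indexes
            field_name, _direction = key[0]
            fields.add(field_name)
    return fields


# Hand-written stand-in for collection.index_information() (assumed shape)
sample_index_info = {
    "_id_": {"v": 2, "key": [("_id", 1)]},
    "by_account": {"v": 2, "key": [("account_number", 1)], "unique": True},
    "currency_1_balance_-1": {"v": 2, "key": [("currency", 1), ("balance", -1)]},
}

schema = {
    "account_number": "str",
    "account_name": "str",
    "balance": "float",
    "currency": "str",
}

covered = indexed_fields(sample_index_info)
for field, type_name in schema.items():
    status = "Index found" if field in covered else "Index not created"
    print(f"{field}: {type_name} ({status})")
```

In this sample only `account_number` counts as indexed: the custom-named `by_account` index is found, while `currency` and `balance` appear only in a compound index, which this sketch deliberately ignores.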
