<a href="https://colab.research.google.com/github/matthewpecsok/data_engineering/blob/main/tutorials/de_nosql_databases_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This tutorial introduces MongoDB, a document database. MongoDB stores documents, which are effectively Python dictionaries in memory, or JSON data when written to a file. This document-oriented design is very different from the relational database designs we've seen thus far. We'll learn how to populate the database with data, how to query it for the documents we'd like to find, and how to extract data for data engineering purposes.
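To make the "document is a dictionary" idea concrete, here is a minimal sketch that uses only Python's standard `json` module. The field names mirror the customer documents built later in this tutorial; the values are made up for illustration.

```python
import json

# A document is a nested dictionary: scalar fields, arrays, and
# embedded sub-documents can all live in one record.
customer = {
    "customerid": 1,
    "name": "Example Customer",          # illustrative value
    "phone": ["555-0100", "555-0101"],   # an array field
    "orders": [                          # an array of embedded documents
        {"productid": 10, "units": 44},
    ],
}

# Serialized to text, the same structure is JSON.
as_json = json.dumps(customer)

# Parsing the JSON text gives back an equal dictionary.
round_tripped = json.loads(as_json)
print(round_tripped["orders"][0]["units"])  # prints 44
```

In a relational design the phone numbers and orders would each need their own table; here they travel inside the customer record.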

In [22]:
!pip install faker
!pip install pymongo

Collecting faker
  Downloading Faker-24.4.0-py3-none-any.whl (1.8 MB)
Installing collected packages: faker
Successfully installed faker-24.4.0
Collecting pymongo
  Downloading pymongo-4.6.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (676 kB)
Collecting dnspython<3.0.0,>=1.16.0 (from pymongo)
  Downloading dnspython-2.6.1-py3-none-any.whl (307 kB)
Installing collected packages: dnspython, pymongo
Successfully installed dnspython-2.6.1 pymongo-4.6.3


In [23]:
from faker import Faker
from pymongo import MongoClient
import random
import datetime
import re

# install MongoDB

In [24]:
!apt-get install gnupg curl

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
curl is already the newest version (7.81.0-1ubuntu1.16).
gnupg is already the newest version (2.2.27-3ubuntu2.1).
gnupg set to manually installed.
0 upgraded, 0 newly installed, 0 to remove and 45 not upgraded.


In [25]:
!curl -fsSL https://pgp.mongodb.com/server-7.0.asc | \
   sudo gpg -o /usr/share/keyrings/mongodb-server-7.0.gpg \
   --dearmor

In [26]:
!echo "deb [ arch=amd64,arm64 signed-by=/usr/share/keyrings/mongodb-server-7.0.gpg ] https://repo.mongodb.org/apt/ubuntu jammy/mongodb-org/7.0 multiverse" | sudo tee /etc/apt/sources.list.d/mongodb-org-7.0.list

deb [ arch=amd64,arm64 signed-by=/usr/share/keyrings/mongodb-server-7.0.gpg ] https://repo.mongodb.org/apt/ubuntu jammy/mongodb-org/7.0 multiverse


In [27]:
!apt-get update

0% [Working]            Get:1 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,626 B]
Hit:2 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64  InRelease
Ign:3 https://repo.mongodb.org/apt/ubuntu jammy/mongodb-org/7.0 InRelease
Get:4 http://security.ubuntu.com/ubuntu jammy-security InRelease [110 kB]
Hit:5 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:6 https://repo.mongodb.org/apt/ubuntu jammy/mongodb-org/7.0 Release [2,090 B]
Get:7 https://repo.mongodb.org/apt/ubuntu jammy/mongodb-org/7.0 Release.gpg [866 B]
Get:8 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [119 kB]
Hit:9 https://ppa.launchpadcontent.net/c2d4u.team/c2d4u4.0+/ubuntu jammy InRelease
Get:10 https://repo.mongodb.org/apt/ubuntu jammy/mongodb-org/7.0/multiverse amd64 Packages [32.7 kB]
Hit:11 https://ppa.launchpadcontent.net/deadsnakes/ppa/ubuntu jammy InRelease
Hit:12 https://ppa.launchpadcontent.net/graphics-drivers/ppa/ubuntu jammy InRelease
Get:13 ht

In [28]:
!apt-get install -y mongodb-org

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  mongodb-database-tools mongodb-mongosh mongodb-org-database mongodb-org-database-tools-extra
  mongodb-org-mongos mongodb-org-server mongodb-org-shell mongodb-org-tools
The following NEW packages will be installed:
  mongodb-database-tools mongodb-mongosh mongodb-org mongodb-org-database
  mongodb-org-database-tools-extra mongodb-org-mongos mongodb-org-server mongodb-org-shell
  mongodb-org-tools
0 upgraded, 9 newly installed, 0 to remove and 45 not upgraded.
Need to get 167 MB of archives.
After this operation, 539 MB of additional disk space will be used.
Get:1 https://repo.mongodb.org/apt/ubuntu jammy/mongodb-org/7.0/multiverse amd64 mongodb-database-tools amd64 100.9.4 [51.9 MB]
Get:2 https://repo.mongodb.org/apt/ubuntu jammy/mongodb-org/7.0/multiverse amd64 mongodb-mongosh amd64 2.2.2 [52.6 MB]
Get:3 https://repo.mongodb.org/apt/ub

In [29]:
!mkdir -p /data/db

In [30]:
import subprocess
subprocess.Popen(["mongod"])

<Popen: returncode: None args: ['mongod']>
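`subprocess.Popen` returns immediately, before `mongod` is necessarily ready to accept connections. PyMongo waits for a server for a while by default, so the next cell usually works anyway, but an explicit readiness check can make failures easier to diagnose. The helper below is a hypothetical sketch (the name `wait_for_port` is not part of any library) that polls a TCP port using only the standard library.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError while nothing is listening
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

Calling `wait_for_port("localhost", 27017)` before creating the client would block until `mongod` is listening on its default port, 27017, or give up after 30 seconds.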

In [31]:
from pymongo import MongoClient
client = MongoClient()
client.list_database_names() # a fresh server lists its system databases, e.g. ['admin', 'config', 'local']

['admin', 'config', 'local']

## create the db

name our database. MongoDB creates it lazily, the first time we write data to it.

In [32]:
db = client['cloud_purchase_db']

## Set up Collections

drop the collections in case they already exist so we don't duplicate data

In [33]:
db.drop_collection('customers')
db.drop_collection('products')
#db.drop_collection('orders')

{'ok': 1.0}

do we have any collections?

## list collections

In [34]:
db.list_collection_names()

[]

In [35]:
# Create Faker instance
fake = Faker()

# Create the data

## create customer data

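create from 0 to 4 orders per customer (randomly; the loop below iterates over `range(1, n)` with `n` drawn from 1 to 5, so it runs `n - 1` times)
pick a random product id from 1 to 10
units between 10 and 100, multiplied by 10 for roughly 1 in 10 orders
purchase date from 1 year ago to today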

In [36]:
# Generate customer orders
def gen_orders():

  orders = []
  for i in range(1,random.randint(1, 5)):

      productid = random.randint(1, 10)
      units = random.randint(10, 100)
      if random.random() < 0.1:
          units *= 10
      purchase_date = fake.date_between(start_date='-1y', end_date='today')
      purchase_date = datetime.datetime.combine(purchase_date, datetime.datetime.min.time())

      order = {
          'productid': productid,
          'units': units,
          'purchase_date': purchase_date
      }

      orders.append(order)

  return orders

In [37]:
gen_orders()

[{'productid': 6,
  'units': 75,
  'purchase_date': datetime.datetime(2024, 1, 21, 0, 0)}]
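The loop in `gen_orders` iterates over `range(1, n)`, which runs `n - 1` times. A quick check, independent of Faker and MongoDB, shows the order counts this produces for each possible draw of `random.randint(1, 5)`, and what a `range(n)` loop would give instead.

```python
# For each possible value n of random.randint(1, 5), count how many
# times a loop over range(1, n) would run.
orders_per_draw = {n: len(range(1, n)) for n in range(1, 6)}
print(orders_per_draw)  # {1: 0, 2: 1, 3: 2, 4: 3, 5: 4}

# Iterating range(n) instead would give 1 to 5 orders.
alternative = {n: len(range(n)) for n in range(1, 6)}
print(alternative)  # {1: 1, 2: 2, 3: 3, 4: 4, 5: 5}
```

So a customer can end up with an empty `orders` list, which matters later when the orders are unwound or exploded.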

generate 100 customers

In [38]:
# Generate customers
customers = []
for i in range(100):
    customer = {
        'customerid': i+1,
        'name': fake.name(),
        'email': fake.email(),
        'phone': [fake.phone_number(),fake.phone_number(),fake.phone_number()],
        'orders' : gen_orders() #this is where we generate orders
    }
    customers.append(customer)

In [39]:
len(customers)

100

In [40]:
customers[0:3]

[{'customerid': 1,
  'name': 'Brian Clark',
  'email': 'bsawyer@example.net',
  'phone': ['001-579-476-0664x139', '(309)504-3134x92639', '(556)824-2044'],
  'orders': [{'productid': 10,
    'units': 44,
    'purchase_date': datetime.datetime(2023, 9, 7, 0, 0)},
   {'productid': 8,
    'units': 83,
    'purchase_date': datetime.datetime(2023, 8, 18, 0, 0)},
   {'productid': 4,
    'units': 49,
    'purchase_date': datetime.datetime(2023, 5, 20, 0, 0)}]},
 {'customerid': 2,
  'name': 'Chad Smith',
  'email': 'scotthenderson@example.org',
  'phone': ['975-352-0590', '404-926-6375', '+1-615-936-0115x702'],
  'orders': [{'productid': 7,
    'units': 660,
    'purchase_date': datetime.datetime(2024, 2, 11, 0, 0)},
   {'productid': 3,
    'units': 92,
    'purchase_date': datetime.datetime(2023, 5, 11, 0, 0)},
   {'productid': 5,
    'units': 76,
    'purchase_date': datetime.datetime(2023, 9, 30, 0, 0)}]},
 {'customerid': 3,
  'name': 'Ronald Greene',
  'email': 'christopher93@example.org',


In [41]:
# Insert customers into MongoDB
db.customers.insert_many(customers)

InsertManyResult([ObjectId('660c8b705138676bc12023ca'), ObjectId('660c8b705138676bc12023cb'), ObjectId('660c8b705138676bc12023cc'), ObjectId('660c8b705138676bc12023cd'), ObjectId('660c8b705138676bc12023ce'), ObjectId('660c8b705138676bc12023cf'), ObjectId('660c8b705138676bc12023d0'), ObjectId('660c8b705138676bc12023d1'), ObjectId('660c8b705138676bc12023d2'), ObjectId('660c8b705138676bc12023d3'), ObjectId('660c8b705138676bc12023d4'), ObjectId('660c8b705138676bc12023d5'), ObjectId('660c8b705138676bc12023d6'), ObjectId('660c8b705138676bc12023d7'), ObjectId('660c8b705138676bc12023d8'), ObjectId('660c8b705138676bc12023d9'), ObjectId('660c8b705138676bc12023da'), ObjectId('660c8b705138676bc12023db'), ObjectId('660c8b705138676bc12023dc'), ObjectId('660c8b705138676bc12023dd'), ObjectId('660c8b705138676bc12023de'), ObjectId('660c8b705138676bc12023df'), ObjectId('660c8b705138676bc12023e0'), ObjectId('660c8b705138676bc12023e1'), ObjectId('660c8b705138676bc12023e2'), ObjectId('660c8b705138676bc12023

## create products data

In [42]:
# Generate products
products = []
for i in range(10):
    product = {
        'productid': i+1,
        'category': random.choice(['Electronics', 'Clothing', 'Books', 'Home']),
        'price': random.randint(1, 100)
    }
    products.append(product)

In [43]:
products[0:3]

[{'productid': 1, 'category': 'Electronics', 'price': 8},
 {'productid': 2, 'category': 'Home', 'price': 69},
 {'productid': 3, 'category': 'Home', 'price': 3}]

put the 10 products into the database

In [44]:
# Insert products into MongoDB
db.products.insert_many(products)

InsertManyResult([ObjectId('660c8b705138676bc120242e'), ObjectId('660c8b705138676bc120242f'), ObjectId('660c8b705138676bc1202430'), ObjectId('660c8b705138676bc1202431'), ObjectId('660c8b705138676bc1202432'), ObjectId('660c8b705138676bc1202433'), ObjectId('660c8b705138676bc1202434'), ObjectId('660c8b705138676bc1202435'), ObjectId('660c8b705138676bc1202436'), ObjectId('660c8b705138676bc1202437')], acknowledged=True)

In [45]:
db.list_collection_names() # list collections

['products', 'customers']

loop through collections and count the number of documents

In [46]:
for collection_name in db.list_collection_names():
  collection = db.get_collection(collection_name)
  print(f'{collection_name}:{collection.count_documents({})}')

products:10
customers:100


# Querying



## Customers

Find all customers.

In [47]:
for document in db.customers.find({}):
  print(document)

{'_id': ObjectId('660c8b705138676bc12023ca'), 'customerid': 1, 'name': 'Brian Clark', 'email': 'bsawyer@example.net', 'phone': ['001-579-476-0664x139', '(309)504-3134x92639', '(556)824-2044'], 'orders': [{'productid': 10, 'units': 44, 'purchase_date': datetime.datetime(2023, 9, 7, 0, 0)}, {'productid': 8, 'units': 83, 'purchase_date': datetime.datetime(2023, 8, 18, 0, 0)}, {'productid': 4, 'units': 49, 'purchase_date': datetime.datetime(2023, 5, 20, 0, 0)}]}
{'_id': ObjectId('660c8b705138676bc12023cb'), 'customerid': 2, 'name': 'Chad Smith', 'email': 'scotthenderson@example.org', 'phone': ['975-352-0590', '404-926-6375', '+1-615-936-0115x702'], 'orders': [{'productid': 7, 'units': 660, 'purchase_date': datetime.datetime(2024, 2, 11, 0, 0)}, {'productid': 3, 'units': 92, 'purchase_date': datetime.datetime(2023, 5, 11, 0, 0)}, {'productid': 5, 'units': 76, 'purchase_date': datetime.datetime(2023, 9, 30, 0, 0)}]}
{'_id': ObjectId('660c8b705138676bc12023cc'), 'customerid': 3, 'name': 'Rona

find customerid 76

In [48]:
import pprint

In [49]:
for document in db.customers.find({'customerid':76}):
  # pprint.pprint prints the document itself and returns None,
  # so it should not be wrapped in print()
  pprint.pprint(document)

{'_id': ObjectId('660c8b705138676bc1202415'),
 'customerid': 76,
 'email': 'richard53@example.com',
 'name': 'Eric Rodriguez',
 'orders': [{'productid': 3,
             'purchase_date': datetime.datetime(2023, 6, 26, 0, 0),
             'units': 54}],
 'phone': ['955-286-8938x1837', '(375)892-1632x91392', '4025299349']}


## Products

find all products

In [50]:
for document in db.products.find({}):
  print(document)

{'_id': ObjectId('660c8b705138676bc120242e'), 'productid': 1, 'category': 'Electronics', 'price': 8}
{'_id': ObjectId('660c8b705138676bc120242f'), 'productid': 2, 'category': 'Home', 'price': 69}
{'_id': ObjectId('660c8b705138676bc1202430'), 'productid': 3, 'category': 'Home', 'price': 3}
{'_id': ObjectId('660c8b705138676bc1202431'), 'productid': 4, 'category': 'Home', 'price': 23}
{'_id': ObjectId('660c8b705138676bc1202432'), 'productid': 5, 'category': 'Electronics', 'price': 58}
{'_id': ObjectId('660c8b705138676bc1202433'), 'productid': 6, 'category': 'Books', 'price': 10}
{'_id': ObjectId('660c8b705138676bc1202434'), 'productid': 7, 'category': 'Clothing', 'price': 26}
{'_id': ObjectId('660c8b705138676bc1202435'), 'productid': 8, 'category': 'Books', 'price': 56}
{'_id': ObjectId('660c8b705138676bc1202436'), 'productid': 9, 'category': 'Home', 'price': 51}
{'_id': ObjectId('660c8b705138676bc1202437'), 'productid': 10, 'category': 'Electronics', 'price': 36}


find all products with prices less than 40

In [51]:
for document in db.products.find({'price': {'$lt':40}}):
  print(document)

{'_id': ObjectId('660c8b705138676bc120242e'), 'productid': 1, 'category': 'Electronics', 'price': 8}
{'_id': ObjectId('660c8b705138676bc1202430'), 'productid': 3, 'category': 'Home', 'price': 3}
{'_id': ObjectId('660c8b705138676bc1202431'), 'productid': 4, 'category': 'Home', 'price': 23}
{'_id': ObjectId('660c8b705138676bc1202433'), 'productid': 6, 'category': 'Books', 'price': 10}
{'_id': ObjectId('660c8b705138676bc1202434'), 'productid': 7, 'category': 'Clothing', 'price': 26}
{'_id': ObjectId('660c8b705138676bc1202437'), 'productid': 10, 'category': 'Electronics', 'price': 36}


find all products with category clothing

the find method returns a cursor that lazily loads the result set in batches as we iterate over them.

In [52]:
db.products.find({'category': 'Clothing'})

<pymongo.cursor.Cursor at 0x7b024a0fae30>
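The lazy, batched behaviour of a cursor can be pictured with a plain Python generator. This is an analogy only, not PyMongo's implementation: `batched_cursor` and `fetch_batch` are hypothetical names, and the "server" is just a list.

```python
def batched_cursor(fetch_batch, batch_size=2):
    """Yield items one at a time, fetching a new batch only when needed.

    fetch_batch(offset, limit) is a hypothetical stand-in for a round
    trip to the server; it returns a list of at most `limit` items.
    """
    offset = 0
    while True:
        batch = fetch_batch(offset, batch_size)
        if not batch:
            return
        yield from batch
        offset += len(batch)


# Fake "server" holding five documents; every fetch is recorded.
data = [{"productid": i} for i in range(1, 6)]
fetches = []


def fetch_batch(offset, limit):
    fetches.append(offset)
    return data[offset:offset + limit]


cursor = batched_cursor(fetch_batch)
print(fetches)         # [] because creating the cursor fetched nothing
first = next(cursor)
print(first, fetches)  # {'productid': 1} [0] because one batch was fetched on first use
```

This is why evaluating `db.products.find(...)` on its own shows a cursor object rather than documents: nothing is retrieved until we iterate.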

In [53]:
for document in db.products.find({'category': 'Clothing'}):
  print(document)

{'_id': ObjectId('660c8b705138676bc1202434'), 'productid': 7, 'category': 'Clothing', 'price': 26}


not equal query

In [54]:
for document in db.products.find({"category": {"$ne": "Clothing"}}):
  print(document)

{'_id': ObjectId('660c8b705138676bc120242e'), 'productid': 1, 'category': 'Electronics', 'price': 8}
{'_id': ObjectId('660c8b705138676bc120242f'), 'productid': 2, 'category': 'Home', 'price': 69}
{'_id': ObjectId('660c8b705138676bc1202430'), 'productid': 3, 'category': 'Home', 'price': 3}
{'_id': ObjectId('660c8b705138676bc1202431'), 'productid': 4, 'category': 'Home', 'price': 23}
{'_id': ObjectId('660c8b705138676bc1202432'), 'productid': 5, 'category': 'Electronics', 'price': 58}
{'_id': ObjectId('660c8b705138676bc1202433'), 'productid': 6, 'category': 'Books', 'price': 10}
{'_id': ObjectId('660c8b705138676bc1202435'), 'productid': 8, 'category': 'Books', 'price': 56}
{'_id': ObjectId('660c8b705138676bc1202436'), 'productid': 9, 'category': 'Home', 'price': 51}
{'_id': ObjectId('660c8b705138676bc1202437'), 'productid': 10, 'category': 'Electronics', 'price': 36}


a regular expression gives the equivalent of SQL's `like '%string%'` (made case-insensitive here with `re.IGNORECASE`)

In [55]:
for document in db.products.find({"category": re.compile("electron", re.IGNORECASE)}):
  print(document)

{'_id': ObjectId('660c8b705138676bc120242e'), 'productid': 1, 'category': 'Electronics', 'price': 8}
{'_id': ObjectId('660c8b705138676bc1202432'), 'productid': 5, 'category': 'Electronics', 'price': 58}
{'_id': ObjectId('660c8b705138676bc1202437'), 'productid': 10, 'category': 'Electronics', 'price': 36}
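A regular expression matches anywhere in the field value unless it is anchored, which is what makes it behave like `%string%`. The sketch below uses plain Python `re` on the four category names from this tutorial to show how anchors give rough equivalents of the other `LIKE` patterns; the same compiled patterns could be passed to `find`.

```python
import re

categories = ["Electronics", "Clothing", "Books", "Home"]

# Unanchored, case-insensitive: behaves like LIKE '%electron%'
contains = re.compile("electron", re.IGNORECASE)
print([c for c in categories if contains.search(c)])     # ['Electronics']

# Anchored at the start: behaves like LIKE 'ho%'
starts_with = re.compile("^ho", re.IGNORECASE)
print([c for c in categories if starts_with.search(c)])  # ['Home']

# Anchored at the end: behaves like LIKE '%s'
ends_with = re.compile("s$", re.IGNORECASE)
print([c for c in categories if ends_with.search(c)])    # ['Electronics', 'Books']
```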


find all products with price less than 40 AND category Clothing

In [56]:
for document in db.products.find({'price': {'$lt':40},'category': 'Clothing'}):
  print(document)

{'_id': ObjectId('660c8b705138676bc1202434'), 'productid': 7, 'category': 'Clothing', 'price': 26}
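To check our understanding of the filter documents used above, the same conditions can be written as ordinary Python over the ten products printed earlier (copied here without `_id`). This is only a sketch of the semantics; MongoDB evaluates the filters on the server.

```python
products = [
    {"productid": 1, "category": "Electronics", "price": 8},
    {"productid": 2, "category": "Home", "price": 69},
    {"productid": 3, "category": "Home", "price": 3},
    {"productid": 4, "category": "Home", "price": 23},
    {"productid": 5, "category": "Electronics", "price": 58},
    {"productid": 6, "category": "Books", "price": 10},
    {"productid": 7, "category": "Clothing", "price": 26},
    {"productid": 8, "category": "Books", "price": 56},
    {"productid": 9, "category": "Home", "price": 51},
    {"productid": 10, "category": "Electronics", "price": 36},
]

# {'price': {'$lt': 40}} keeps documents whose price is below 40.
cheap = [p["productid"] for p in products if p["price"] < 40]
print(cheap)  # [1, 3, 4, 6, 7, 10]

# {'category': {'$ne': 'Clothing'}} keeps everything except Clothing.
not_clothing = [p["productid"] for p in products if p["category"] != "Clothing"]
print(not_clothing)  # [1, 2, 3, 4, 5, 6, 8, 9, 10]

# Several keys in one filter document are combined with AND.
cheap_clothing = [
    p["productid"]
    for p in products
    if p["price"] < 40 and p["category"] == "Clothing"
]
print(cheap_clothing)  # [7]
```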


## Orders

In [57]:
# Calculate the total number of orders
pipeline = [
    {
        '$unwind': '$orders'
    },
    {
        '$group': {
            '_id': None,
            'total_orders': {'$sum': 1}
        }
    },
    {
        '$project': {
            '_id': 0,
            'total_orders': 1
        }
    }
]

result = db.customers.aggregate(pipeline)

# Extract the total number of orders
total_orders = next(result)['total_orders']

# Print the total number of orders
print(f"Total Orders: {total_orders}")

Total Orders: 199
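The pipeline above first unwinds the nested `orders` array into one document per order and then counts those documents. A pure-Python sketch of the same idea follows; `unwind` is a hypothetical helper written for this illustration, and customer 999 is invented to show what happens to an empty array.

```python
def unwind(documents, field):
    """Yield one copy of each document per element of its array field.

    Documents whose array is empty or missing yield nothing, mirroring
    the default behaviour of MongoDB's $unwind stage.
    """
    for doc in documents:
        for element in doc.get(field) or []:
            yield {**doc, field: element}


customers = [
    {"customerid": 1, "orders": [{"productid": 10, "units": 44},
                                 {"productid": 8, "units": 83},
                                 {"productid": 4, "units": 49}]},
    {"customerid": 76, "orders": [{"productid": 3, "units": 54}]},
    {"customerid": 999, "orders": []},  # hypothetical customer with no orders
]

unwound = list(unwind(customers, "orders"))
total_orders = sum(1 for _ in unwound)  # plays the role of {'$sum': 1}
print(total_orders)  # 4
```

Grouping with `'_id': None` puts every unwound document into a single group, which is why the pipeline returns exactly one result document.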


In [58]:
filter_criteria = {
    'orders.units': {'$lt': 45000}
}

# Calculate the total number of orders with units less than 45000
pipeline = [
    {
      '$unwind': '$orders'
    },
    {
      '$match': filter_criteria
    },
    {
        '$group': {
            '_id': None,
            'total_orders': {'$sum': 1}
        }
    },
    {
        '$project': {
            '_id': 0,
            'total_orders': 1
        }
    }
]

result = db.customers.aggregate(pipeline)

# Extract the total number of orders
total_orders = next(result)['total_orders']

# Print the total number of orders
print(f"Total Orders: {total_orders}")

Total Orders: 199
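The position of `$match` relative to `$unwind` matters. After unwinding, `orders.units` refers to a single order; before unwinding, a condition on an array field is satisfied when any element satisfies it. The sketch below imitates both orderings in plain Python on two of the sample customers, with an illustrative threshold of 100 rather than the 45000 used above.

```python
customers = [
    {"customerid": 1, "orders": [{"units": 44}, {"units": 83}, {"units": 49}]},
    {"customerid": 2, "orders": [{"units": 660}, {"units": 92}, {"units": 76}]},
]
LIMIT = 100  # illustrative threshold

# $unwind then $match: the condition is tested on each order separately.
unwind_then_match = sum(
    1
    for c in customers
    for o in c["orders"]
    if o["units"] < LIMIT
)

# $match then $unwind: a customer is kept if ANY order satisfies the
# condition, and then all of that customer's orders are counted.
match_then_unwind = sum(
    len(c["orders"])
    for c in customers
    if any(o["units"] < LIMIT for o in c["orders"])
)

print(unwind_then_match, match_then_unwind)  # 5 6
```

With the threshold of 45000 used in the cell above, no order is excluded, which is why the count is the same as the unfiltered total.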


# Extraction


## Dump mongodb data to json file.

create a JSON file from the documents a query returns (here `find({})`, which returns every document in the collection).

In [59]:
from bson.json_util import dumps
import json

open a file and create a cursor. The cursor is passed to `dumps` from `bson.json_util`, which iterates over it and serializes the BSON documents (including types such as `ObjectId` and dates that plain `json` cannot handle) to a JSON string, which we then write to the file.

## dump customers

(orders are nested inside the customer documents, so they are dumped along with them)

In [60]:
with open('customers.json', 'w') as file:
  cursor = db.customers.find({})
  file.write(dumps(cursor))

## dump products

In [61]:
with open('products.json', 'w') as file:
  cursor = db.products.find({})
  file.write(dumps(cursor))

In [62]:
!ls -lh *.json

-rw-r--r-- 1 root root 37K Apr  2 22:49 customers.json
-rw-r--r-- 1 root root 16M Feb  8 01:24 patient_records_batch_10.json
-rw-r--r-- 1 root root 16M Feb  8 01:22 patient_records_batch_1.json
-rw-r--r-- 1 root root 16M Feb  8 01:22 patient_records_batch_2.json
-rw-r--r-- 1 root root 16M Feb  8 01:22 patient_records_batch_3.json
-rw-r--r-- 1 root root 16M Feb  8 01:22 patient_records_batch_4.json
-rw-r--r-- 1 root root 16M Feb  8 01:23 patient_records_batch_5.json
-rw-r--r-- 1 root root 16M Feb  8 01:23 patient_records_batch_6.json
-rw-r--r-- 1 root root 16M Feb  8 01:23 patient_records_batch_7.json
-rw-r--r-- 1 root root 16M Feb  8 01:23 patient_records_batch_8.json
-rw-r--r-- 1 root root 16M Feb  8 01:23 patient_records_batch_9.json
-rw-r--r-- 1 root root 986 Apr  2 22:49 products.json
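The dumped files wrap MongoDB-specific types in small objects such as `{"$oid": ...}` and `{"$date": ...}`. `bson.json_util.loads` can turn these back into BSON types; when only the standard library is available, a `json` object hook is one alternative. The hook below is a hypothetical sketch that handles only the two wrappers that appear in this tutorial's output, with dates in ISO string form.

```python
import json
from datetime import datetime


def decode_extended(obj):
    """object_hook that unwraps the $oid and $date wrappers."""
    if set(obj) == {"$oid"}:
        return obj["$oid"]  # keep the ObjectId as its hex string
    if set(obj) == {"$date"} and isinstance(obj["$date"], str):
        # 'Z' means UTC; rewrite it so fromisoformat accepts it on older Pythons
        return datetime.fromisoformat(obj["$date"].replace("Z", "+00:00"))
    return obj


text = """[{"_id": {"$oid": "660c8b705138676bc12023ca"},
           "customerid": 1,
           "orders": [{"productid": 10, "units": 44,
                       "purchase_date": {"$date": "2023-09-07T00:00:00Z"}}]}]"""

docs = json.loads(text, object_hook=decode_extended)
print(docs[0]["_id"])                          # 660c8b705138676bc12023ca
print(docs[0]["orders"][0]["purchase_date"])   # 2023-09-07 00:00:00+00:00
```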


# Reading JSON file data into Python

## JSON into Pandas DataFrame

In [63]:
import pandas as pd

In [64]:
customers_df = pd.read_json('customers.json')
customers_df.head(2)

Unnamed: 0,_id,customerid,name,email,phone,orders
0,{'$oid': '660c8b705138676bc12023ca'},1,Brian Clark,bsawyer@example.net,"[001-579-476-0664x139, (309)504-3134x92639, (5...","[{'productid': 10, 'units': 44, 'purchase_date..."
1,{'$oid': '660c8b705138676bc12023cb'},2,Chad Smith,scotthenderson@example.org,"[975-352-0590, 404-926-6375, +1-615-936-0115x702]","[{'productid': 7, 'units': 660, 'purchase_date..."


In [65]:
products_df = pd.read_json('products.json')
products_df.head(2)

Unnamed: 0,_id,productid,category,price
0,{'$oid': '660c8b705138676bc120242e'},1,Electronics,8
1,{'$oid': '660c8b705138676bc120242f'},2,Home,69


### Merge DataFrames

In [66]:
customers_df.head(4)

Unnamed: 0,_id,customerid,name,email,phone,orders
0,{'$oid': '660c8b705138676bc12023ca'},1,Brian Clark,bsawyer@example.net,"[001-579-476-0664x139, (309)504-3134x92639, (5...","[{'productid': 10, 'units': 44, 'purchase_date..."
1,{'$oid': '660c8b705138676bc12023cb'},2,Chad Smith,scotthenderson@example.org,"[975-352-0590, 404-926-6375, +1-615-936-0115x702]","[{'productid': 7, 'units': 660, 'purchase_date..."
2,{'$oid': '660c8b705138676bc12023cc'},3,Ronald Greene,christopher93@example.org,"[(602)620-7683, (446)691-2662, 001-845-348-123...","[{'productid': 9, 'units': 25, 'purchase_date'..."
3,{'$oid': '660c8b705138676bc12023cd'},4,Mrs. Patricia Green,jeffrey25@example.net,"[+1-541-788-8131x09656, +1-323-867-3305x94155,...","[{'productid': 10, 'units': 57, 'purchase_date..."


In [67]:
orders_df = customers_df.explode('orders')

In [68]:
df_orders_expanded = pd.concat([orders_df.drop(['orders'], axis=1), orders_df['orders'].apply(pd.Series)], axis=1)
df_orders_expanded.head(3)


Unnamed: 0,_id,customerid,name,email,phone,productid,units,purchase_date,0
0,{'$oid': '660c8b705138676bc12023ca'},1,Brian Clark,bsawyer@example.net,"[001-579-476-0664x139, (309)504-3134x92639, (5...",10.0,44.0,{'$date': '2023-09-07T00:00:00Z'},
0,{'$oid': '660c8b705138676bc12023ca'},1,Brian Clark,bsawyer@example.net,"[001-579-476-0664x139, (309)504-3134x92639, (5...",8.0,83.0,{'$date': '2023-08-18T00:00:00Z'},
0,{'$oid': '660c8b705138676bc12023ca'},1,Brian Clark,bsawyer@example.net,"[001-579-476-0664x139, (309)504-3134x92639, (5...",4.0,49.0,{'$date': '2023-05-20T00:00:00Z'},
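`explode` followed by `apply(pd.Series)` turns each nested order into its own row with the customer fields repeated. The extra column named `0` and the float product ids in the output above probably come from customers with an empty `orders` list, which `explode` turns into a missing value. The same flattening can be sketched in plain Python, skipping customers without orders; `flatten_orders` is a hypothetical helper and customer 999 is invented.

```python
def flatten_orders(customers):
    """Return one flat row per order, copying the customer's other fields."""
    rows = []
    for customer in customers:
        parent = {k: v for k, v in customer.items() if k != "orders"}
        for order in customer.get("orders", []):
            rows.append({**parent, **order})
    return rows


customers = [
    {"customerid": 1, "name": "Brian Clark",
     "orders": [{"productid": 10, "units": 44}, {"productid": 8, "units": 83}]},
    {"customerid": 999, "name": "No Orders", "orders": []},  # hypothetical
]

rows = flatten_orders(customers)
for row in rows:
    print(row)
```

A list of flat dictionaries like `rows` can be handed directly to `pd.DataFrame`.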


In [69]:
customer_product_orders = df_orders_expanded.merge(products_df, on='productid', how='left')
customer_product_orders.head(3)

Unnamed: 0,_id_x,customerid,name,email,phone,productid,units,purchase_date,0,_id_y,category,price
0,{'$oid': '660c8b705138676bc12023ca'},1,Brian Clark,bsawyer@example.net,"[001-579-476-0664x139, (309)504-3134x92639, (5...",10.0,44.0,{'$date': '2023-09-07T00:00:00Z'},,{'$oid': '660c8b705138676bc1202437'},Electronics,36.0
1,{'$oid': '660c8b705138676bc12023ca'},1,Brian Clark,bsawyer@example.net,"[001-579-476-0664x139, (309)504-3134x92639, (5...",8.0,83.0,{'$date': '2023-08-18T00:00:00Z'},,{'$oid': '660c8b705138676bc1202435'},Books,56.0
2,{'$oid': '660c8b705138676bc12023ca'},1,Brian Clark,bsawyer@example.net,"[001-579-476-0664x139, (309)504-3134x92639, (5...",4.0,49.0,{'$date': '2023-05-20T00:00:00Z'},,{'$oid': '660c8b705138676bc1202431'},Home,23.0


In [70]:
customer_product_orders['total_sales'] = customer_product_orders['price'] * customer_product_orders['units']

In [71]:
customer_product_orders.groupby(by='category').agg({'total_sales': 'sum'}).sort_values(by='total_sales', ascending=False)

Unnamed: 0_level_0,total_sales
category,Unnamed: 1_level_1
Home,384581.0
Electronics,253046.0
Books,167046.0
Clothing,62686.0
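The merge and group-by above amount to a lookup join followed by a sum per category. Here is a minimal sketch of that logic in plain Python, using the first customer's three orders and the matching products from the output above.

```python
from collections import defaultdict

products = [
    {"productid": 10, "category": "Electronics", "price": 36},
    {"productid": 8, "category": "Books", "price": 56},
    {"productid": 4, "category": "Home", "price": 23},
]
orders = [
    {"customerid": 1, "productid": 10, "units": 44},
    {"customerid": 1, "productid": 8, "units": 83},
    {"customerid": 1, "productid": 4, "units": 49},
]

# Index products by productid: this is the lookup side of the join.
by_id = {p["productid"]: p for p in products}

sales_by_category = defaultdict(int)
for order in orders:
    product = by_id.get(order["productid"])
    if product is None:
        continue  # a left join would keep this row with missing values; we skip it
    sales_by_category[product["category"]] += product["price"] * order["units"]

# Sort categories by total sales, largest first.
ranked = sorted(sales_by_category.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)  # [('Books', 4648), ('Electronics', 1584), ('Home', 1127)]
```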


How to unzip a zip file.

In [72]:
!wget -O patient_records.zip https://github.com/matthewpecsok/data_engineering/raw/main/data/patient_records.zip

--2024-04-02 22:49:21--  https://github.com/matthewpecsok/data_engineering/raw/main/data/patient_records.zip
Resolving github.com (github.com)... 140.82.114.4
Connecting to github.com (github.com)|140.82.114.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/matthewpecsok/data_engineering/main/data/patient_records.zip [following]
--2024-04-02 22:49:22--  https://raw.githubusercontent.com/matthewpecsok/data_engineering/main/data/patient_records.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25417183 (24M) [application/zip]
Saving to: ‘patient_records.zip’


2024-04-02 22:49:22 (108 MB/s) - ‘patient_records.zip’ saved [25417183/25417183]



In [73]:
!ls -l

total 194752
-rw-r--r-- 1 root root    37420 Apr  2 22:49 customers.json
-rw-r--r-- 1 root root 11673600 Apr  2 22:14 medication_database.db
-rw-r--r-- 1 root root 16002564 Feb  8 01:24 patient_records_batch_10.json
-rw-r--r-- 1 root root 16183594 Feb  8 01:22 patient_records_batch_1.json
-rw-r--r-- 1 root root 16257996 Feb  8 01:22 patient_records_batch_2.json
-rw-r--r-- 1 root root 16138041 Feb  8 01:22 patient_records_batch_3.json
-rw-r--r-- 1 root root 16157479 Feb  8 01:22 patient_records_batch_4.json
-rw-r--r-- 1 root root 16309649 Feb  8 01:23 patient_records_batch_5.json
-rw-r--r-- 1 root root 16271092 Feb  8 01:23 patient_records_batch_6.json
-rw-r--r-- 1 root root 16186389 Feb  8 01:23 patient_records_batch_7.json
-rw-r--r-- 1 root root 16367684 Feb  8 01:23 patient_records_batch_8.json
-rw-r--r-- 1 root root 16389066 Feb  8 01:23 patient_records_batch_9.json
-rw-r--r-- 1 root root 25417183 Apr  2 22:49 patient_records.zip
-rw-r--r-- 1 root root      986 Apr  2 22:49 products

In [None]:
!unzip -o patient_records.zip

Archive:  patient_records.zip
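When files from the archive already exist, `unzip` stops to ask before replacing them, which stalls a notebook cell. One way around the prompt is to extract from Python with the standard `zipfile` module, which overwrites existing files without asking. `extract_all` is a hypothetical helper name.

```python
import zipfile
from pathlib import Path


def extract_all(zip_path, dest_dir="."):
    """Extract every member of zip_path into dest_dir and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)  # existing files are overwritten silently
        return archive.namelist()
```

For this tutorial the call would be `extract_all('patient_records.zip')`.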

In [None]:
!wget -O medication_database.db https://github.com/matthewpecsok/data_engineering/raw/main/data/medication_database.db

In [None]:
import sqlite3
import pandas as pd

In [None]:
medication_con = sqlite3.connect('medication_database.db')

pd.read_sql_query('SELECT * FROM sqlite_master', medication_con)

In [None]:
pd.read_sql_query('SELECT * FROM medications', medication_con)

In [None]:
import json

In [None]:
with open('patient_records_batch_1.json') as f:  # closes the file when done
    patients_1_batch = json.load(f)

In [None]:
len(patients_1_batch)

In [None]:
patients_1_batch[0:20]
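The archive contains ten batch files, so a natural next step is to load them all. The sketch below assumes every file holds a JSON array, as the `len` and slicing calls above suggest for batch 1; `load_batches` is a hypothetical helper. It sorts by the number in the file name, because a plain alphabetical sort would place `batch_10` before `batch_2`.

```python
import json
import re
from pathlib import Path


def load_batches(directory=".", pattern="patient_records_batch_*.json"):
    """Load every matching file and return one combined list of records."""
    def batch_number(path):
        match = re.search(r"(\d+)\.json$", path.name)
        return int(match.group(1)) if match else 0

    records = []
    for path in sorted(Path(directory).glob(pattern), key=batch_number):
        with open(path) as f:
            records.extend(json.load(f))  # assumes each file is a JSON array
    return records
```

Calling `load_batches()` in the notebook's working directory would return the records of all ten files in batch order. At about 16 MB per file this fits in memory, but for larger exports it may be better to process one file at a time.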