# HBase Data Modeling and Querying

Dakeun Park

120462429

## 1. Designing the Schema

Let's assume we're creating a database for a simple bookstore. We need tables for Books and Authors.

- Books Table

  Row Key: ISBN (International Standard Book Number)

  Column Families:

  - details: General information about the book.
    - details:title: The title of the book.
    - details:author: Author ID (link to Authors table).
  - stock: Information about book availability.
    - stock:quantity: Number of copies available.

- Authors Table

  Row Key: Author ID

  Column Families:

  - info: Information about the author.
    - info:name: Author's name.
    - info:birthdate: Author's birth date.
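
As a rough illustration of the layout above, the schema can be written down as plain Python data before any table is created. The names `SCHEMA` and `unknown_columns` are hypothetical helpers introduced for this sketch and are not part of happybase. HBase itself fixes only the column families when a table is created and accepts any qualifier inside a family, so a check like this one would live in application code.

```python
# Hypothetical description of the two tables from the schema above:
# table name -> column family -> expected qualifiers.
SCHEMA = {
    'Books': {'details': ['title', 'author'], 'stock': ['quantity']},
    'Authors': {'info': ['name', 'birthdate']},
}

def unknown_columns(table_name, row):
    """Return the 'family:qualifier' names in row that SCHEMA does not declare."""
    families = SCHEMA[table_name]
    unknown = []
    for column in row:
        family, _, qualifier = column.partition(':')
        if qualifier not in families.get(family, []):
            unknown.append(column)
    return unknown

book_row = {
    'details:title': 'Sample Book Title',
    'details:author': '1',
    'stock:quantity': '5',
}
print(unknown_columns('Books', book_row))            # []
print(unknown_columns('Books', {'info:name': 'x'}))  # ['info:name']
```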

## 2. Creating Tables in HBase

Connect to HBase and create tables using Python:

In [1]:
import happybase

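try:
    # happybase opens the connection in the constructor (autoconnect=True by
    # default), so reaching the print call means the connection succeeded.
    connection = happybase.Connection('hbase-docker', port=9090)
    print("Connected to HBase.")
except Exception as e:
    print("Failed to connect to HBase:", e)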

Connected to HBase.


In [2]:
# Connect to HBase
connection = happybase.Connection('hbase-docker', port=9090)

# List current tables
existing_tables = connection.tables()
print("Existing tables:", [table.decode('utf-8') for table in existing_tables])

# Creating the 'Books' table if not already created
if b'Books' not in existing_tables:
    connection.create_table(
        'Books',
        {'details': dict(max_versions=1),
         'stock': dict(max_versions=1)}
    )
    print("Created 'Books' table.")
else:
    print("'Books' table already exists.")

# Creating the 'Authors' table if not already created
if b'Authors' not in existing_tables:
    connection.create_table(
        'Authors',
        {'info': dict(max_versions=1)}
    )
    print("Created 'Authors' table.")
else:
    print("'Authors' table already exists.")

# Print tables to verify
updated_tables = connection.tables()
print("Updated tables list:", [table.decode('utf-8') for table in updated_tables])

connection.close()


Existing tables: ['Authors', 'Books', 'denormalized']
'Books' table already exists.
'Authors' table already exists.
Updated tables list: ['Authors', 'Books', 'denormalized']


## 3. Populating Tables with Sample Data

In [3]:
connection = happybase.Connection('hbase-docker', port=9090)

# Connect to 'Books' table
table = connection.table('Books')

# Insert data into 'Books'
table.put('978-3-16-148410-0', {'details:title': 'Sample Book Title', 'details:author': '1', 'stock:quantity': '5'})
table.put('978-3-16-148410-1', {'details:title': 'Another Sample Book Title', 'details:author': '1', 'stock:quantity': '7'})

# Connect to 'Authors' table
table = connection.table('Authors')

# Insert data into 'Authors'
table.put('1', {'info:name': 'John Doe', 'info:birthdate': '1990-01-01'})

connection.close()


## 4. Implementing Queries


#### Single-Row Query

In [4]:
# Fetch a single row by its row key and print every column it contains
def fetch_and_print_row(connection, table_name, row_key):
    try:
        table = connection.table(table_name)
        row = table.row(row_key)
        # table.row() returns an empty dict when the row key does not exist
        if row:
            print(f"\nDetails for {table_name}:")
            for key, value in row.items():
                print(f"{key.decode('utf-8')}: {value.decode('utf-8')}")
        else:
            print(f"No data found for row key: {row_key} in table: {table_name}")
    except Exception as e:
        print(f"Failed to fetch data from {table_name}: {str(e)}")

# Example usage
connection = happybase.Connection('hbase-docker', port=9090)
fetch_and_print_row(connection, 'Books', '978-3-16-148410-0')
fetch_and_print_row(connection, 'Authors', '1')
connection.close()



Details for Books:
details:author: 1
details:title: Sample Book Title
stock:quantity: 5

Details for Authors:
info:birthdate: 1990-01-01
info:name: John Doe


#### Multi-Row Query Using Scans

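A basic scan returns every row in a table or, when a start and stop key are given, only the rows in that key range. The start key is inclusive and the stop key is exclusive.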

In [5]:
def scan_table(connection, table_name, start_key=None, end_key=None):
    table = connection.table(table_name)
    print(f"Scanning table {table_name}...")
    for key, data in table.scan(row_start=start_key, row_stop=end_key):
        print(f"Row key: {key.decode('utf-8')}")
        for column, value in data.items():
            print(f"  {column.decode('utf-8')}: {value.decode('utf-8')}")
        print("")

# Example usage
connection = happybase.Connection('hbase-docker', port=9090)
# Multi-row query over the given key range.
scan_table(connection, 'Books', start_key='978-3-16-148410-0', end_key='978-3-16-148410-9')
connection.close()

Scanning table Books...
Row key: 978-3-16-148410-0
  details:author: 1
  details:title: Sample Book Title
  stock:quantity: 5

Row key: 978-3-16-148410-1
  details:author: 1
  details:title: Another Sample Book Title
  stock:quantity: 7
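
The range behaviour can be imitated without a server. The sketch below is only a model of the semantics, assuming rows are ordered by the bytes of their keys and that the stop key is exclusive; `keys_in_range` is a hypothetical helper, not a happybase function.

```python
import bisect

def keys_in_range(sorted_keys, row_start=None, row_stop=None):
    """Mimic scan range semantics: row_start inclusive, row_stop exclusive."""
    lo = 0 if row_start is None else bisect.bisect_left(sorted_keys, row_start)
    hi = len(sorted_keys) if row_stop is None else bisect.bisect_left(sorted_keys, row_stop)
    return sorted_keys[lo:hi]

# HBase keeps rows sorted by the bytes of the row key.
keys = sorted([b'978-3-16-148410-1', b'978-3-16-148410-0', b'978-0-13-110163-0'])

print(keys_in_range(keys, b'978-3-16-148410-0', b'978-3-16-148410-9'))
# [b'978-3-16-148410-0', b'978-3-16-148410-1']
print(keys_in_range(keys, b'978-3-16-148410-0', b'978-3-16-148410-1'))
# [b'978-3-16-148410-0']
```

The second call shows why a stop key equal to an existing row key leaves that row out of the result.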



#### Range Query with Filters
Filters refine a scan on the server side: HBase evaluates the filter string while scanning and returns only the matching rows.

In [6]:
def filtered_scan_table(connection, table_name, filter_string):
    table = connection.table(table_name)
    print(f"Scanning table {table_name} with filter: {filter_string}...")
    for key, data in table.scan(filter=filter_string):
        print(f"Row key: {key.decode('utf-8')}")
        for column, value in data.items():
            print(f"  {column.decode('utf-8')}: {value.decode('utf-8')}")
        print("")

# Example usage
connection = happybase.Connection('hbase-docker', port=9090)
# Searching for books with author id 1 
filter_string = "SingleColumnValueFilter('details', 'author', =, 'binary:1')"
filtered_scan_table(connection, 'Books', filter_string)
connection.close()

Scanning table Books with filter: SingleColumnValueFilter('details', 'author', =, 'binary:1')...
Row key: 978-3-16-148410-0
  details:author: 1
  details:title: Sample Book Title
  stock:quantity: 5

Row key: 978-3-16-148410-1
  details:author: 1
  details:title: Another Sample Book Title
  stock:quantity: 7
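
Filter strings are easy to mistype, so it can help to build them from small functions. The sketch below uses two hypothetical helpers, `single_column_value_filter` and `all_of`, and only produces strings; the result would be passed as the `filter` argument of `table.scan()`. It always writes the two optional arguments of SingleColumnValueFilter (filterIfMissing and latestVersionOnly). When they are omitted, as in the cell above, rows that lack the column altogether are also returned. The sketch assumes values contain no single quotes.

```python
def single_column_value_filter(column, value, operator='=', filter_if_missing=True):
    """Build a SingleColumnValueFilter string for a 'family:qualifier' column."""
    family, qualifier = column.split(':', 1)
    missing = 'true' if filter_if_missing else 'false'
    # Argument order: family, qualifier, operator, comparator,
    # filterIfMissing, latestVersionOnly.
    return (
        f"SingleColumnValueFilter('{family}', '{qualifier}', {operator}, "
        f"'binary:{value}', {missing}, true)"
    )

def all_of(*filters):
    """Combine filter strings so that a row must satisfy every one of them."""
    return ' AND '.join(f'({f})' for f in filters)

filter_string = all_of(
    single_column_value_filter('details:author', '1'),
    single_column_value_filter('stock:quantity', '5'),
)
print(filter_string)
```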



#### CRUD Operations Using a Python Class

A small wrapper class groups the create, read, update, delete, and scan operations around a single connection.

In [7]:
class HBaseCRUD:
    def __init__(self, host, port=9090):
        """
        Initialize connection to the HBase server.
        """
        self.connection = happybase.Connection(host, port)

    def create_or_update(self, table_name, row_key, data):
        """
        Create or update data in an HBase table.
        """
        table = self.connection.table(table_name)
        table.put(row_key, data)
        print(f"Data inserted/updated in {table_name} for row {row_key}")

    def read(self, table_name, row_key):
        """
        Read data from an HBase table.
        """
        table = self.connection.table(table_name)
        data = table.row(row_key)
        if data:
            print(f"Data retrieved from {table_name} for row {row_key}:")
            return {k.decode('utf-8'): v.decode('utf-8') for k, v in data.items()}
        else:
            print(f"No data found for row {row_key} in table {table_name}")
            return None

    def delete(self, table_name, row_key):
        """
        Delete a row from an HBase table.
        """
        table = self.connection.table(table_name)
        table.delete(row_key)
        print(f"Row {row_key} deleted from {table_name}")

    def scan_table(self, table_name, start_key=None, end_key=None, filter_string=None):
        """Scan for rows in a table optionally within a key range and with a filter."""
        table = self.connection.table(table_name)
        print(f"Scanning table {table_name}...")
        rows = table.scan(row_start=start_key, row_stop=end_key, filter=filter_string)
        result = []
        for key, data in rows:
            decoded_data = {k.decode('utf-8'): v.decode('utf-8') for k, v in data.items()}
            result.append((key.decode('utf-8'), decoded_data))
        return result

    def scan_filtered_table(self, table_name, column, value, comparator='='):
        """Scan a table for rows whose 'family:qualifier' column compares to value.

        comparator is one of the HBase filter operators: =, !=, <, <=, >, >=.
        Values are compared as raw bytes, not as numbers.
        """
        table = self.connection.table(table_name)
        family, qualifier = column.split(':', 1)
        # The trailing 'true, true' sets filterIfMissing (skip rows that lack
        # the column) and latestVersionOnly.
        filter_string = (
            f"SingleColumnValueFilter ('{family}', '{qualifier}', "
            f"{comparator}, 'binary:{value}', true, true)"
        )
        rows = table.scan(filter=filter_string)
        result = []
        for key, data in rows:
            decoded_data = {k.decode('utf-8'): v.decode('utf-8') for k, v in data.items()}
            result.append((key.decode('utf-8'), decoded_data))
        return result

    def close_connection(self):
        """
        Close the HBase connection.
        """
        self.connection.close()
        print("Connection closed")

#### Using the Class

The cell below inserts, reads, updates, and finally deletes one row of the Books table through the class.

In [8]:
hbase = HBaseCRUD('hbase-docker')

# Insert data into 'Books' table
book_data = {
    'details:title': 'Sample Book Title', 
    'details:author': '1', 
    'stock:quantity': '5'
}

hbase.create_or_update('Books', '978-3-16-148410-0', book_data)

# Read data from 'Books' table
book = hbase.read('Books', '978-3-16-148410-0')
print(book)

# Update data in 'Books' table
update_data = {
    'stock:quantity': '10'
}
hbase.create_or_update('Books', '978-3-16-148410-0', update_data)

# Read data from 'Books' table
book = hbase.read('Books', '978-3-16-148410-0')
print(book)

# Delete row from 'Books' table
hbase.delete('Books', '978-3-16-148410-0')

# Close connection
hbase.close_connection()

Data inserted/updated in Books for row 978-3-16-148410-0
Data retrieved from Books for row 978-3-16-148410-0:
{'details:author': '1', 'details:title': 'Sample Book Title', 'stock:quantity': '5'}
Data inserted/updated in Books for row 978-3-16-148410-0
Data retrieved from Books for row 978-3-16-148410-0:
{'details:author': '1', 'details:title': 'Sample Book Title', 'stock:quantity': '10'}
Row 978-3-16-148410-0 deleted from Books
Connection closed


#### Create or Update Multiple Rows - Books Table

In [9]:
# Example book data entries
books_data = [
    {
        'row_key': '978-0-13-110163-0',
        'data': {
            'details:title': 'Introduction to Algorithms',
            'details:author': '2',  # Assuming author ID '2' is linked in the Authors table
            'stock:quantity': '15'
        }
    },
    {
        'row_key': '978-0-13-595705-9',
        'data': {
            'details:title': 'Artificial Intelligence: A Modern Approach',
            'details:author': '3',  # Assuming author ID '3'
            'stock:quantity': '20'
        }
    },
    {
        'row_key': '978-0-201-83595-3',
        'data': {
            'details:title': 'The C Programming Language',
            'details:author': '4',  # Assuming author ID '4'
            'stock:quantity': '8'
        }
    },
    {
        'row_key': '978-0-596-52068-7',
        'data': {
            'details:title': 'Learning Python',
            'details:author': '5',  # Assuming author ID '5'
            'stock:quantity': '12'
        }
    },
    {
        'row_key': '978-0-262-03384-8',
        'data': {
            'details:title': 'Algorithms Unlocked',
            'details:author': '2',  # Thomas H. Cormen
            'stock:quantity': '10'
        }
    },
    {
        'row_key': '978-0-262-53305-8',
        'data': {
            'details:title': 'Introduction to Autonomous Robots',
            'details:author': '3',  # Stuart Russell
            'stock:quantity': '7'
        }
    },
    {
        'row_key': '978-0-13-110362-7',
        'data': {
            'details:title': 'The UNIX Programming Environment',
            'details:author': '4',  # Brian Kernighan
            'stock:quantity': '5'
        }
    },
    {
        'row_key': '978-1-59327-708-4',
        'data': {
            'details:title': 'Python Crash Course',
            'details:author': '5',  # Mark Lutz
            'stock:quantity': '12'
        }
    }
    
]

# Initialize HBase CRUD operations for the 'Books' table
hbase = HBaseCRUD('hbase-docker')

# Inserting the book data into the 'Books' table
for book in books_data:
    hbase.create_or_update('Books', book['row_key'], book['data'])
    print(f"Inserted book with ISBN {book['row_key']}")

# Close the connection after operations
hbase.close_connection()

Data inserted/updated in Books for row 978-0-13-110163-0
Inserted book with ISBN 978-0-13-110163-0
Data inserted/updated in Books for row 978-0-13-595705-9
Inserted book with ISBN 978-0-13-595705-9
Data inserted/updated in Books for row 978-0-201-83595-3
Inserted book with ISBN 978-0-201-83595-3
Data inserted/updated in Books for row 978-0-596-52068-7
Inserted book with ISBN 978-0-596-52068-7
Data inserted/updated in Books for row 978-0-262-03384-8
Inserted book with ISBN 978-0-262-03384-8
Data inserted/updated in Books for row 978-0-262-53305-8
Inserted book with ISBN 978-0-262-53305-8
Data inserted/updated in Books for row 978-0-13-110362-7
Inserted book with ISBN 978-0-13-110362-7
Data inserted/updated in Books for row 978-1-59327-708-4
Inserted book with ISBN 978-1-59327-708-4
Connection closed
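
The loop above issues one `put` per book. For larger loads, happybase offers `Table.batch()`, which buffers mutations and sends them together. The sketch below is one possible way to use it; `put_many` is a hypothetical helper and `batch_size=1000` is an arbitrary illustrative value, not a recommendation.

```python
def put_many(table, rows, batch_size=1000):
    """Write (row_key, data) pairs through a happybase batch.

    table is expected to be a happybase Table. Returns the number of rows queued.
    """
    count = 0
    # The batch sends its buffered mutations whenever batch_size is reached
    # and once more when the with-block exits without an error.
    with table.batch(batch_size=batch_size) as batch:
        for row_key, data in rows:
            batch.put(row_key, data)
            count += 1
    return count

# Usage against the tables in this notebook (needs a running HBase):
# connection = happybase.Connection('hbase-docker', port=9090)
# put_many(connection.table('Books'), [(b['row_key'], b['data']) for b in books_data])
# connection.close()
```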


#### Read using key range

In [10]:
hbase = HBaseCRUD('hbase-docker')
    
# Scan a range of ISBNs; the end key is exclusive, so 978-0-201-83595-3 itself is not returned
selected_books = hbase.scan_table('Books', start_key='978-0-13-110163-0', end_key='978-0-201-83595-3')
for key, data in selected_books:
    print(f"ISBN: {key}")
    for column, value in data.items():
        print(f"  {column}: {value}")
    print("")

# Close connection after operations
hbase.close_connection()

Scanning table Books...
ISBN: 978-0-13-110163-0
  details:author: 2
  details:title: Introduction to Algorithms
  stock:quantity: 15

ISBN: 978-0-13-110362-7
  details:author: 4
  details:title: The UNIX Programming Environment
  stock:quantity: 5

ISBN: 978-0-13-595705-9
  details:author: 3
  details:title: Artificial Intelligence: A Modern Approach
  stock:quantity: 20

Connection closed


#### Create or Update Multiple Rows - Authors Table

In [11]:
# Sample authors data
authors_data = [
    {
        'row_key': '2',
        'data': {
            'info:name': 'Thomas H. Cormen',
            'info:birthdate': '1956-02-24'
        }
    },
    {
        'row_key': '3',
        'data': {
            'info:name': 'Stuart Russell',
            'info:birthdate': '1962-05-03'
        }
    },
    {
        'row_key': '4',
        'data': {
            'info:name': 'Brian Kernighan',
            'info:birthdate': '1942-01-01'
        }
    },
    {
        'row_key': '5',
        'data': {
            'info:name': 'Mark Lutz',
            'info:birthdate': '1956-01-01'
        }
    }
]

hbase = HBaseCRUD('hbase-docker')

# Inserting the authors data into the 'Authors' table
for author in authors_data:
    hbase.create_or_update('Authors', author['row_key'], author['data'])

# Close connection
hbase.close_connection()

Data inserted/updated in Authors for row 2
Data inserted/updated in Authors for row 3
Data inserted/updated in Authors for row 4
Data inserted/updated in Authors for row 5
Connection closed


#### Read with conditions


In [12]:
hbase = HBaseCRUD('hbase-docker')

# Author id = 2
print("Books with Author id 2:")
filtered_books = hbase.scan_filtered_table('Books', 'details:author', '2', '=')
for key, data in filtered_books:
    print(f"ISBN: {key}")
    for column, value in data.items():
        print(f"  {column}: {value}")
    print("")

# Make sure the connection is closed properly after operations
hbase.close_connection()

Books with Author id 2:
ISBN: 978-0-13-110163-0
  details:author: 2
  details:title: Introduction to Algorithms
  stock:quantity: 15

ISBN: 978-0-262-03384-8
  details:author: 2
  details:title: Algorithms Unlocked
  stock:quantity: 10

Connection closed
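
The `comparator` argument also accepts operators such as `<` and `>`, but a `binary:` comparator compares bytes, not numbers. Because quantities are stored as strings, a byte-wise `>` against `'5'` treats `'10'` as smaller than `'5'`. The sketch below shows the effect in plain Python and one common workaround, zero-padding; `encode_quantity` and its `width=6` are assumptions made for illustration and are not used elsewhere in this notebook.

```python
# Quantities are stored as strings, so a 'binary:' comparator sees bytes.
print(b'10' > b'5')   # False: the byte for '1' sorts before the byte for '5'
print(10 > 5)         # True

def encode_quantity(quantity, width=6):
    """Zero-pad a non-negative integer so byte order matches numeric order."""
    return str(quantity).zfill(width).encode('utf-8')

print(encode_quantity(10) > encode_quantity(5))   # True
print(encode_quantity(5))                         # b'000005'
```

With padded values, both the stored cell and the filter value would need the same width for range comparisons to behave numerically.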


#### Simulating join operations

In [13]:
# Connect to the HBase server.
connection = happybase.Connection('hbase-docker', 9090)

def get_author_details(connection, author_id):
    """Fetch author details by author ID from the Authors table."""
    table = connection.table('Authors')
    row = table.row(author_id.encode('utf-8'))
    # table.row() returns an empty dict for a missing row, so .get() covers
    # both a missing row and a missing column.
    author_name = row.get(b'info:name', b'Unknown Author').decode('utf-8')
    author_birthdate = row.get(b'info:birthdate', b'Unknown Birthdate').decode('utf-8')
    return author_name, author_birthdate

def get_books_with_authors(connection):
    """Fetch all books and enrich them with author details from the Authors table."""
    table = connection.table('Books')
    books = table.scan()
    results = []
    for key, data in books:
        author_id = data[b'details:author'].decode('utf-8')
        author_name, author_birthdate = get_author_details(connection, author_id)
        book_info = {
            'ISBN': key.decode('utf-8'),
            'Title': data[b'details:title'].decode('utf-8'),
            'Author ID': author_id,
            'Author Name': author_name,
            'Author Birthdate': author_birthdate,
            'Quantity': data[b'stock:quantity'].decode('utf-8')
        }
        results.append(book_info)
    return results

try:
    books_with_authors = get_books_with_authors(connection)
    for book in books_with_authors:
        print(f"ISBN: {book['ISBN']}, Title: {book['Title']}, Author: {book['Author Name']}, Birthdate: {book['Author Birthdate']}, Quantity: {book['Quantity']}")
finally:
    # Close the connection to the HBase server.
    connection.close()

ISBN: 978-0-13-110163-0, Title: Introduction to Algorithms, Author: Thomas H. Cormen, Birthdate: 1956-02-24, Quantity: 15
ISBN: 978-0-13-110362-7, Title: The UNIX Programming Environment, Author: Brian Kernighan, Birthdate: 1942-01-01, Quantity: 5
ISBN: 978-0-13-595705-9, Title: Artificial Intelligence: A Modern Approach, Author: Stuart Russell, Birthdate: 1962-05-03, Quantity: 20
ISBN: 978-0-201-83595-3, Title: The C Programming Language, Author: Brian Kernighan, Birthdate: 1942-01-01, Quantity: 8
ISBN: 978-0-262-03384-8, Title: Algorithms Unlocked, Author: Thomas H. Cormen, Birthdate: 1956-02-24, Quantity: 10
ISBN: 978-0-262-53305-8, Title: Introduction to Autonomous Robots, Author: Stuart Russell, Birthdate: 1962-05-03, Quantity: 7
ISBN: 978-0-596-52068-7, Title: Learning Python, Author: Mark Lutz, Birthdate: 1956-01-01, Quantity: 12
ISBN: 978-1-59327-708-4, Title: Python Crash Course, Author: Mark Lutz, Birthdate: 1956-01-01, Quantity: 12
ISBN: 978-3-16-148410-1, Title: Another Sam
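
The join above looks up the author once per book, so the same author may be fetched several times. One way to reduce round trips, sketched below, is to collect the distinct author IDs and fetch them together with happybase's `Table.rows()`. The helper name `fetch_authors` is hypothetical; the column names follow the Authors table defined earlier.

```python
def fetch_authors(authors_table, author_ids):
    """Fetch several authors in one call and return {author_id: (name, birthdate)}."""
    # De-duplicate the IDs so each author is requested only once.
    keys = sorted({author_id.encode('utf-8') for author_id in author_ids})
    authors = {}
    # Table.rows() takes a list of row keys and returns (key, data) pairs
    # for the rows it finds.
    for key, data in authors_table.rows(keys):
        name = data.get(b'info:name', b'Unknown Author').decode('utf-8')
        birthdate = data.get(b'info:birthdate', b'Unknown Birthdate').decode('utf-8')
        authors[key.decode('utf-8')] = (name, birthdate)
    return authors

# Usage (needs a running HBase):
# authors = fetch_authors(connection.table('Authors'), ['2', '3', '4', '5'])
# name, birthdate = authors.get('2', ('Unknown Author', 'Unknown Birthdate'))
```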

## 6. Experiment with Data Modeling

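#### Denormalization

Denormalization copies the author's name and birthdate into each book row, so one read of the denormalized table replaces the per-book lookup into Authors.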

In [14]:
def create_denormalized_table(hbase_connection):
    """
    Creates a 'denormalized' table in HBase with specified column families,
    if it doesn't already exist.

    Args:
    hbase_connection (happybase.Connection): The connection to HBase.
    """
    table_name = 'denormalized'
    # Column family name -> options passed to create_table
    families = {
        'book_details': dict(max_versions=1),
        'author_details': dict(max_versions=1)
    }

    try:
        # connection.tables() returns table names as bytes
        if table_name.encode('utf-8') not in hbase_connection.tables():
            hbase_connection.create_table(table_name, families)
            print(f"Table '{table_name}' created with families {list(families.keys())}")
        else:
            print(f"Table '{table_name}' already exists.")
    except Exception as e:
        print(f"Failed to create table '{table_name}': {str(e)}")

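def populate_denormalized_table(connection):
    """Copy every book into 'denormalized', adding the author's name and birthdate.

    Relies on get_author_details() defined in the join cell above.
    """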
    table = connection.table('Books')
    denormalized_table = connection.table('denormalized')
    books = table.scan()
    for key, data in books:
        isbn = key.decode('utf-8')
        author_id = data[b'details:author'].decode('utf-8')
        author_name, author_birthdate = get_author_details(connection, author_id)

        # Prepare the data to insert into the denormalized table
        book_data = {
            b'book_details:title': data[b'details:title'],
            b'book_details:quantity': data[b'stock:quantity']
        }
        author_data = {
            b'author_details:name': author_name.encode('utf-8'),
            b'author_details:birthdate': author_birthdate.encode('utf-8')
        }

        # Combine book and author data into a single dictionary for insertion
        combined_data = {**book_data, **author_data}
        denormalized_table.put(isbn.encode('utf-8'), combined_data)

    print("Populated the denormalized table with book and author details.")

connection = happybase.Connection('hbase-docker', 9090)

# Create the denormalized table if it doesn't exist
create_denormalized_table(connection)

# Populate the denormalized table with data
populate_denormalized_table(connection)

# Close the connection after operations
connection.close()


Table 'denormalized' already exists.
Populated the denormalized table with book and author details.


#### Read the denormalized table


In [15]:
def read_denormalized_table(connection):
    """Reads all entries from the 'denormalized' table and prints them."""
    table = connection.table('denormalized')
    print("Reading data from 'denormalized' table...")
    for key, data in table.scan():
        isbn = key.decode('utf-8')
        title = data.get(b'book_details:title', b'').decode('utf-8')
        quantity = data.get(b'book_details:quantity', b'').decode('utf-8')
        author_name = data.get(b'author_details:name', b'').decode('utf-8')
        author_birthdate = data.get(b'author_details:birthdate', b'').decode('utf-8')

        print(f"ISBN: {isbn}")
        print(f"  Title: {title}")
        print(f"  Quantity: {quantity}")
        print(f"  Author Name: {author_name}")
        print(f"  Author Birthdate: {author_birthdate}")
        print("")

# Usage Example
connection = happybase.Connection('hbase-docker', 9090)
try:
    read_denormalized_table(connection)
finally:
    connection.close()

Reading data from 'denormalized' table...
ISBN: 978-0-13-110163-0
  Title: Introduction to Algorithms
  Quantity: 15
  Author Name: Thomas H. Cormen
  Author Birthdate: 1956-02-24

ISBN: 978-0-13-110362-7
  Title: The UNIX Programming Environment
  Quantity: 5
  Author Name: Brian Kernighan
  Author Birthdate: 1942-01-01

ISBN: 978-0-13-595705-9
  Title: Artificial Intelligence: A Modern Approach
  Quantity: 20
  Author Name: Stuart Russell
  Author Birthdate: 1962-05-03

ISBN: 978-0-201-83595-3
  Title: The C Programming Language
  Quantity: 8
  Author Name: Brian Kernighan
  Author Birthdate: 1942-01-01

ISBN: 978-0-262-03384-8
  Title: Algorithms Unlocked
  Quantity: 10
  Author Name: Thomas H. Cormen
  Author Birthdate: 1956-02-24

ISBN: 978-0-262-53305-8
  Title: Introduction to Autonomous Robots
  Quantity: 7
  Author Name: Stuart Russell
  Author Birthdate: 1962-05-03

ISBN: 978-0-596-52068-7
  Title: Learning Python
  Quantity: 12
  Author Name: Mark Lutz
  Author Birthdate: 19