# HBase Data Modeling and Querying

Dakeun Park

## Setting Up HBase in Docker

This Jupyter Notebook provides a step-by-step guide to setting up an HBase instance running inside a Docker container. This setup is ideal for development and testing environments where HBase needs to be isolated and reproducible.

## Prerequisites
- Docker must be installed on your machine.
- You should have administrative access to run Docker commands.
- Jupyter Notebook environment should have access to the system shell.

## How to Use the Docker Compose File
- Create the data directory:

    - Before you start the container, make sure the data directory exists in the same directory as your docker-compose.yml file. You can create it with `mkdir data`.

- Start the HBase service:

    - Run `docker-compose up -d` to start the HBase container in the background.

- Check the status of the container:

    - Use `docker-compose ps` to see the status of your container.

- Access the HBase container's shell for troubleshooting:

    - If you need to access the container, use `docker-compose exec hbase bash`.

- Stop the HBase service:

    - Run `docker-compose down` to stop and remove the container.

### Step 1: Prepare Data Directory

Create a directory on your host machine where HBase can store its data persistently.

In [1]:
!mkdir -p data
!ls

data  docker-compose.yml  hbase_v1.ipynb  update_host.py


### Step 2: Start HBase with Docker Compose

In [3]:
!docker compose up -d

[1A[1B[0G[?25l[+] Running 1/0
 [32m✔[0m Container hbase-docker  [32mRunning[0m                                         [34m0.0s [0m
[?25h

#### Verify the Container Is Running

Check that the HBase container is listed with an "Up" status and that its ports are published.

In [5]:
# !docker images
!docker compose ps

NAME           IMAGE          COMMAND               SERVICE   CREATED          STATUS          PORTS
hbase-docker   dajobe/hbase   "/opt/hbase-server"   hbase     25 minutes ago   Up 25 minutes   0.0.0.0:2181->2181/tcp, :::2181->2181/tcp, 0.0.0.0:8080->8080/tcp, :::8080->8080/tcp, 0.0.0.0:8085->8085/tcp, :::8085->8085/tcp, 0.0.0.0:9090->9090/tcp, :::9090->9090/tcp, 0.0.0.0:9095->9095/tcp, :::9095->9095/tcp, 0.0.0.0:16010->16010/tcp, :::16010->16010/tcp


### Run the Python Script to Update the Hostname

In [6]:
# sudo python3 update_host.py

### Step 3: Run the HBase Container
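As an alternative to Docker Compose, start the HBase container directly with docker run. This command runs the container in the background, maps a local directory for data persistence, and sets a hostname for the container. If a container named hbase-docker already exists (for example, the one started in Step 2), the command fails with the name conflict shown below; in that case, start the existing container with docker start instead.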

In [8]:
!id=$(docker run --name=hbase-docker -h hbase-docker -d -v $PWD/data:/data dajobe/hbase)


docker: Error response from daemon: Conflict. The container name "/hbase-docker" is already in use by container "a5dcbcd62ff56db01e8f1e3229ae9afc3aced94bc447389372bdbb2027963f4e". You have to remove (or rename) that container to be able to reuse that name.
See 'docker run --help'.


In [40]:
!docker start hbase-docker

hbase-docker


#### Check Container Status
Ensure that the HBase container is running correctly.

In [39]:
!docker ps

CONTAINER ID   IMAGE          COMMAND               CREATED         STATUS         PORTS                                                                                                                                                                                                                                                                  NAMES
8a40eb24aedf   dajobe/hbase   "/opt/hbase-server"   7 minutes ago   Up 7 minutes   0.0.0.0:2181->2181/tcp, :::2181->2181/tcp, 0.0.0.0:8080->8080/tcp, :::8080->8080/tcp, 0.0.0.0:8085->8085/tcp, :::8085->8085/tcp, 0.0.0.0:9090->9090/tcp, :::9090->9090/tcp, 0.0.0.0:9095->9095/tcp, :::9095->9095/tcp, 0.0.0.0:16010->16010/tcp, :::16010->16010/tcp   hbase-docker


### Step 4: Access Container's Bash Shell
This step is optional and can be used for troubleshooting or advanced configuration.

In [12]:
# This command won't directly work in Jupyter Notebook as interactive mode is not supported.
# Use this command in your terminal if necessary.
# docker exec -it hbase-docker bash

#### Retrieve Container's IP Address
Obtain the IP address assigned to the Docker container. This is useful for network configuration.

In [13]:
# Use this command in your terminal.
# docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' hbase-docker


#### Update Local /etc/hosts File
Update your local machine's /etc/hosts file to include the IP address of the container. This step requires administrative privileges and cannot be executed directly from Jupyter. Use the output from the previous step to manually edit your /etc/hosts file.

In [14]:
# Add this line to your /etc/hosts file
# <container-ip> hbase-docker
# Replace <container-ip> with the actual IP address from the previous output.
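
Once the hosts entry is in place, it can be worth confirming from Python that the hostname resolves and that the port used later in this notebook (9090) accepts connections. The following is a minimal sketch using only the standard library; the helper name `is_port_open` is an assumption made for illustration and is not part of the original notebook.

```python
import socket


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the hostname and opens the socket in one step.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers resolution failures, refused connections, and timeouts.
        return False


# Example call for this setup (hostname and port as used later in the notebook):
# is_port_open("hbase-docker", 9090)
```

A `False` result does not say which step failed. If the hostname does not resolve, recheck the /etc/hosts entry; if it resolves but the connection is refused, check that the container is up.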

## 1. Designing the Schema

Let's assume we're creating a database for a simple bookstore. We need tables for Books and Authors.

- Books Table

  - Row Key: ISBN (International Standard Book Number)

  - Column Families:

    - details: General information about the book.

      - details:title: The title of the book.

      - details:author: Author ID (link to Authors table).

    - stock: Information about book availability.

      - stock:quantity: Number of copies available.

- Authors Table

  - Row Key: Author ID

  - Column Families:

    - info: Information about the author.

      - info:name: Author's name.

      - info:birthdate: Author's birth date.
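
In HBase, column families are fixed when a table is created, while the qualifiers inside a family (such as `title` or `quantity`) can be added freely at write time. The design above can be sketched as plain Python dictionaries in the shape that happybase's `create_table` accepts. The names `SCHEMA` and `unknown_families` are hypothetical helpers introduced here for illustration, and `max_versions=1` mirrors the setting used when the tables are created later in this notebook.

```python
# Column families per table, in the shape happybase's create_table() accepts.
SCHEMA = {
    "Books": {"details": dict(max_versions=1), "stock": dict(max_versions=1)},
    "Authors": {"info": dict(max_versions=1)},
}


def unknown_families(table_name, row):
    """Return the column families used in `row` that the schema does not declare."""
    declared = set(SCHEMA[table_name])
    # A column name has the form 'family:qualifier'; keep only the family part.
    used = {column.split(":", 1)[0] for column in row}
    return sorted(used - declared)


book_row = {
    "details:title": "Sample Book Title",
    "details:author": "1",
    "stock:quantity": "5",
}
print(unknown_families("Books", book_row))                   # []
print(unknown_families("Books", {"price:amount": "9.99"}))   # ['price']
```

A check like this catches a misspelled family name on the client before the write is sent to HBase.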

## 2. Setting Up HBase and Python Environment

In production, HBase typically runs on Linux as part of a Hadoop cluster. This notebook instead uses the single-node HBase container set up above and connects to it from Python on port 9090.

You’ll need to install the happybase library to interact with HBase from Python:

In [7]:
pip install happybase

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


## 3. Creating Tables in HBase

Here’s how you can connect to HBase and create tables using Python:

In [8]:
import happybase

try:
    # Connect to HBase
    connection = happybase.Connection('hbase-docker', port=9090)
    print("Connected to HBase.")

    # List current tables
    existing_tables = connection.tables()
    print("Existing tables:", [table.decode('utf-8') for table in existing_tables])

    # Creating the 'Books' table if not already created
    if b'Books' not in existing_tables:
        connection.create_table(
            'Books',
            {'details': dict(max_versions=1),
             'stock': dict(max_versions=1)}
        )
        print("Created 'Books' table.")
    else:
        print("'Books' table already exists.")

    # Creating the 'Authors' table if not already created
    if b'Authors' not in existing_tables:
        connection.create_table(
            'Authors',
            {'info': dict(max_versions=1)}
        )
        print("Created 'Authors' table.")
    else:
        print("'Authors' table already exists.")

    # Print tables to verify
    updated_tables = connection.tables()
    print("Updated tables list:", [table.decode('utf-8') for table in updated_tables])

except Exception as e:
    print("Failed to connect or modify HBase:", e)

Connected to HBase.
Existing tables: []
Created 'Books' table.
Created 'Authors' table.
Updated tables list: ['Authors', 'Books']
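
The "create if missing" check above is repeated for each table. One way to factor it out is a small helper, sketched below under the assumption that `connection` behaves like the `happybase.Connection` used above, where `tables()` returns names as bytes. The name `ensure_table` is hypothetical.

```python
def ensure_table(connection, name, families):
    """Create table `name` with `families` unless it already exists.

    Returns True if the table was created, False if it was already present.
    """
    # connection.tables() returns table names as bytes, so compare against bytes.
    if name.encode("utf-8") in connection.tables():
        return False
    connection.create_table(name, families)
    return True


# Usage with the connection opened above:
# ensure_table(connection, 'Books', {'details': dict(max_versions=1),
#                                    'stock': dict(max_versions=1)})
# ensure_table(connection, 'Authors', {'info': dict(max_versions=1)})
```

Because the helper returns a flag instead of printing, the caller decides how to report the result, and rerunning the cell is harmless.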


## 4. Populating Tables with Sample Data

In [9]:
# Insert a book into 'Books'; the row key is the ISBN
table = connection.table('Books')
table.put('978-3-16-148410-0', {
    'details:title': 'Sample Book Title',
    'details:author': '1',  # Author ID: the row key in 'Authors'
    'stock:quantity': '5',
})

# Insert the matching author into 'Authors'; the row key is the author ID
table = connection.table('Authors')
table.put('1', {'info:name': 'John Doe', 'info:birthdate': '1990-01-01'})
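
Single `put` calls are fine for a couple of rows. For larger loads, happybase offers `Table.batch()`, which buffers mutations and sends them together. The sketch below assumes `table` is a happybase table object; the helper name `put_rows` is hypothetical.

```python
def put_rows(table, rows):
    """Write many rows through one happybase batch.

    `rows` maps each row key to a dict of 'family:qualifier' -> value.
    Returns the number of rows submitted.
    """
    # The batch buffers the puts and sends them when the `with` block exits.
    with table.batch() as batch:
        for row_key, columns in rows.items():
            batch.put(row_key, columns)
    return len(rows)


# Usage (author IDs and names here are made-up sample values):
# put_rows(connection.table('Authors'), {
#     '2': {'info:name': 'Jane Roe', 'info:birthdate': '1985-05-05'},
#     '3': {'info:name': 'Sam Poe', 'info:birthdate': '1979-09-09'},
# })
```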



## 5. Implementing Queries

### Single-Row Query

In [10]:
# Connect to 'Books' table
table = connection.table('Books')

# Fetch data from 'Books'
book = table.row('978-3-16-148410-0')
print("Book Details:")
for key, value in book.items():
    print(f"{key.decode('utf-8')}: {value.decode('utf-8')}")

# Fetch data from 'Authors'
table = connection.table('Authors')
author = table.row('1')
print("\nAuthor Details:")
for key, value in author.items():
    print(f"{key.decode('utf-8')}: {value.decode('utf-8')}")

Book Details:
details:author: 1
details:title: Sample Book Title
stock:quantity: 5

Author Details:
info:birthdate: 1990-01-01
info:name: John Doe


### Multi-Row and Range Queries

In [11]:
# Scan all rows in the 'Authors' table
table = connection.table('Authors')
for key, data in table.scan():
    print(key, data)

b'1' {b'info:birthdate': b'1990-01-01', b'info:name': b'John Doe'}
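
The scan above reads the whole table. HBase stores rows sorted by row key in byte order, so a scan can also be limited to a key range or a key prefix. The sketch below wraps happybase's `Table.scan()` and forwards its `row_start`, `row_stop`, `row_prefix`, and `columns` arguments. The helper name `scan_decoded` is hypothetical.

```python
def scan_decoded(table, **scan_kwargs):
    """Scan `table` and return {row_key: {column: value}} with bytes decoded to str.

    Keyword arguments are passed straight to happybase's Table.scan(),
    for example row_prefix=b'978-3', or row_start=b'1' with row_stop=b'5'.
    """
    result = {}
    for key, data in table.scan(**scan_kwargs):
        result[key.decode("utf-8")] = {
            column.decode("utf-8"): value.decode("utf-8")
            for column, value in data.items()
        }
    return result


# Usage with the tables created above:
# books = connection.table('Books')
# scan_decoded(books, row_prefix=b'978-3')             # ISBNs starting with 978-3
# scan_decoded(books, row_start=b'978-0', row_stop=b'978-4',
#              columns=[b'details:title'])             # key range, titles only
```

In a range scan, `row_start` is inclusive and `row_stop` is exclusive, which is why the choice of row key largely decides which range queries are cheap.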


## 6. Experiment with Data Modeling
Data modeling in HBase can affect performance significantly. Consider whether to normalize data (which may lead to multiple cross-table queries) or denormalize it (which increases storage but may decrease the number of queries).
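
To make the trade-off concrete, the sketch below contrasts the two options for the bookstore schema. HBase has no server-side joins, so the normalized read needs two lookups from the client, while the denormalized variant copies the author's name into the book row. The function names and the `details:author_name` column are assumptions introduced for illustration; they are not part of the schema defined earlier.

```python
def decode_row(row):
    """Turn a happybase row (bytes keys and values) into a dict of str."""
    return {k.decode("utf-8"): v.decode("utf-8") for k, v in row.items()}


def book_with_author(books, authors, isbn):
    """Normalized read: one lookup in 'Books', then a second one in 'Authors'."""
    book = decode_row(books.row(isbn))
    # details:author holds the author ID, which is the row key in 'Authors'.
    author = decode_row(authors.row(book["details:author"]))
    return {**book, **author}


def denormalized_book(title, author_id, author_name, quantity):
    """Denormalized write: copy the author's name into the book row.

    'details:author_name' is a hypothetical extra column for this sketch.
    """
    return {
        "details:title": title,
        "details:author": author_id,
        "details:author_name": author_name,
        "stock:quantity": str(quantity),
    }


# Usage with the tables created above:
# book_with_author(connection.table('Books'), connection.table('Authors'),
#                  '978-3-16-148410-0')
# connection.table('Books').put('978-3-16-148410-0',
#                               denormalized_book('Sample Book Title', '1', 'John Doe', 5))
```

With the denormalized row, a single `row()` call returns the title and the author's name together, at the cost of having to update every affected book row if an author's name changes.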

## Work in Progress