# HBase Data Modeling and Querying

Dakeun Park

## Setting Up HBase in Docker

This Jupyter Notebook provides a step-by-step guide to setting up an HBase instance running inside a Docker container. This setup is ideal for development and testing environments where HBase needs to be isolated and reproducible.

## Prerequisites
- Docker must be installed on your machine.
- You should have administrative access to run Docker commands.
- The Jupyter Notebook environment should have access to the system shell.

### Step 1: Pull the HBase Docker Image
Pull the latest dajobe/hbase image from Docker Hub. This image contains all necessary HBase components and configurations.

In [11]:
# Run once to download the image; remove the leading "# " to execute it from the notebook.
# !docker pull dajobe/hbase

#### Verify the Image is Downloaded

Check that the HBase Docker image is present in your Docker image list.

In [12]:
!docker images

REPOSITORY     TAG       IMAGE ID       CREATED       SIZE
dajobe/hbase   latest    cfd7eefee902   5 years ago   492MB


### Step 2: Prepare Data Directory

Create a directory on your host machine where HBase can store its data persistently.

In [13]:
!mkdir -p data
!ls

data				 hbase.ipynb
docker-desktop-4.29.0-amd64.deb  jdk-22_linux-x64_bin
hadoop-3.3.6.tar.gz		 jdk-22_linux-x64_bin.tar.gz


### Step 3: Run the HBase Container
Start the HBase container. The command runs the container in the background (-d), mounts the local data directory at /data inside the container for persistence (-v), and sets both the container name and its hostname to hbase-docker (--name, -h).

In [14]:
# Run once in a terminal; the shell variable id receives the new container's ID.
# id=$(docker run --name=hbase-docker -h hbase-docker -d -v $PWD/data:/data dajobe/hbase)


#### Check Container Status
Ensure that the HBase container is running correctly.

In [15]:
!docker ps

CONTAINER ID   IMAGE          COMMAND               CREATED       STATUS          PORTS                                                         NAMES
a5dcbcd62ff5   dajobe/hbase   "/opt/hbase-server"   3 hours ago   Up 38 minutes   2181/tcp, 8080/tcp, 8085/tcp, 9090/tcp, 9095/tcp, 16010/tcp   hbase-docker


### Step 4: Access Container's Bash Shell
This step is optional and can be used for troubleshooting or advanced configuration.

In [16]:
# This command won't directly work in Jupyter Notebook as interactive mode is not supported.
# Use this command in your terminal if necessary.
# docker exec -it hbase-docker bash

#### Retrieve Container's IP Address
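Obtain the IP address assigned to the Docker container. The docker ps output above lists the container's ports without a host mapping, so the host reaches HBase through this address rather than through localhost.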

In [17]:
# Use this command in your terminal.
# docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' hbase-docker


#### Update Local /etc/hosts File
Update your local machine's /etc/hosts file to include the IP address of the container. This step requires administrative privileges and cannot be executed directly from Jupyter. Use the output from the previous step to manually edit your /etc/hosts file.

In [18]:
# Add this line to your /etc/hosts file
# <container-ip> hbase-docker
# Replace <container-ip> with the actual IP address from the previous output.
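
Once the hosts entry is in place, it can help to confirm that the name resolves and that the Thrift port used later in this notebook (9090) accepts connections. The following is a minimal sketch using only the standard library; `thrift_port_reachable` is an illustrative helper name, not part of Docker or HBase tooling.

```python
import socket


def thrift_port_reachable(host: str, port: int = 9090, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the hostname (consulting /etc/hosts) and connects.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers name-resolution failures, refused connections, and timeouts.
        return False


# Example (run after the container is up and /etc/hosts has been updated):
# thrift_port_reachable("hbase-docker")
```

A `False` result only says that the TCP connection failed; it does not distinguish a missing hosts entry from a stopped container.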

## 1. Designing the Schema

Let's assume we're creating a database for a simple bookstore. We need tables for Books and Authors.

- Books Table
  - Row Key: ISBN (International Standard Book Number)
  - Column Families:
    - details: General information about the book.
      - details:title: The title of the book.
      - details:author: Author ID (link to Authors table).
    - stock: Information about book availability.
      - stock:quantity: Number of copies available.

- Authors Table
  - Row Key: Author ID
  - Column Families:
    - info: Information about the author.
      - info:name: Author's name.
      - info:birthdate: Author's birth date.
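
In HBase, column families are fixed when a table is created, while the qualifiers after the colon (title, author, and so on) can be added freely at write time. The sketch below writes the schema above as plain Python data and checks that a row only uses declared families. `SCHEMA` and `unknown_families` are illustrative names introduced here, not part of any HBase client.

```python
# Column families per table, mirroring the schema described above.
SCHEMA = {
    "Books": {"details", "stock"},
    "Authors": {"info"},
}


def unknown_families(table_name, columns):
    """Return the column families used in `columns` that the table does not declare."""
    declared = SCHEMA[table_name]
    # A column name has the form "family:qualifier"; keep only the family part.
    used = {column.split(":", 1)[0] for column in columns}
    return used - declared


book_row = {"details:title": "Sample Book Title", "details:author": "1", "stock:quantity": "5"}
print(unknown_families("Books", book_row))                   # set()
print(unknown_families("Authors", {"details:title": "x"}))   # {'details'}
```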

## 2. Setting Up HBase and Python Environment

HBase usually runs on Linux machines as part of a Hadoop cluster. In this notebook, the Docker container started above stands in for that cluster, so the remaining steps only assume that the container is running and that the hostname hbase-docker resolves from the machine running Jupyter.

You’ll need to install the happybase library to interact with HBase:

In [1]:
pip install happybase

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


## 3. Creating Tables in HBase

happybase talks to HBase through the HBase Thrift server, which listens on port 9090 in this container. Connect to it and create the two tables:

In [19]:
import happybase

# Connect to HBase
connection = happybase.Connection('hbase-docker', port=9090)

# Creating the 'Books' table
connection.create_table(
    'Books',
    {'details': dict(max_versions=1),
     'stock': dict(max_versions=1)}
)

# Creating the 'Authors' table
connection.create_table(
    'Authors',
    {'info': dict(max_versions=1)}
)

# Print tables to verify
print(connection.tables())

[b'Authors', b'Books']
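
Re-running the cell above fails once the tables exist, because `create_table` does not overwrite an existing table. One way around that is to check `connection.tables()` first. This is a minimal sketch; `ensure_table` is a hypothetical helper, not a happybase function, and it only relies on the `tables()` and `create_table()` calls already used above.

```python
def ensure_table(connection, name, families):
    """Create table `name` with the given column families unless it already exists.

    `connection` is expected to behave like a happybase.Connection.
    Returns True if the table was created, False if it was already present.
    """
    # connection.tables() returns names as bytes, e.g. [b'Authors', b'Books'].
    existing = {t.decode() if isinstance(t, bytes) else t for t in connection.tables()}
    if name in existing:
        return False
    connection.create_table(name, families)
    return True


# Example with the notebook's connection (assumed to be open):
# ensure_table(connection, 'Books', {'details': dict(max_versions=1),
#                                    'stock': dict(max_versions=1)})
# ensure_table(connection, 'Authors', {'info': dict(max_versions=1)})
```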


## 4. Populating Tables with Sample Data

In [20]:
# Connect to 'Books' table
table = connection.table('Books')

# Insert data into 'Books'
table.put('978-3-16-148410-0', {'details:title': 'Sample Book Title', 'details:author': '1', 'stock:quantity': '5'})

# Connect to 'Authors' table
table = connection.table('Authors')

# Insert data into 'Authors'
table.put('1', {'info:name': 'John Doe', 'info:birthdate': '1990-01-01'})
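
Each `put` above is a separate round trip to the Thrift server. For more than a handful of rows, happybase's `table.batch()` groups the writes and sends them when the `with` block exits. The sketch below assumes a table object that behaves like `happybase.Table`; `load_rows` and the sample rows are placeholders for illustration, and the row keys are not real ISBNs.

```python
def load_rows(table, rows):
    """Write many rows through one batch; returns the number of rows queued.

    `table` should behave like a happybase.Table.
    `rows` maps row keys to {column: value} dicts.
    """
    # The batch buffers the puts and sends them when the `with` block exits.
    with table.batch() as batch:
        for row_key, data in rows.items():
            batch.put(row_key, data)
    return len(rows)


# Placeholder rows for illustration only; the keys are not real ISBNs.
more_books = {
    "000-0-00-000001-0": {"details:title": "Placeholder Title A",
                          "details:author": "1", "stock:quantity": "3"},
    "000-0-00-000002-0": {"details:title": "Placeholder Title B",
                          "details:author": "1", "stock:quantity": "0"},
}

# Example with the notebook's connection (assumed to be open):
# load_rows(connection.table('Books'), more_books)
```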

## 5. Implementing Queries
### Single-Row Query

In [21]:
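# Fetch a single row from 'Books'.
# `table` was rebound to 'Authors' in the previous cell, so looking up an ISBN
# through it returns an empty dict ({}); use a separate handle for 'Books'.
books = connection.table('Books')
row = books.row('978-3-16-148410-0')
print(row)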


### Multi-Row and Range Queries

In [22]:
# Scan rows in 'Authors' table
for key, data in table.scan():
    print(key, data)

b'1' {b'info:birthdate': b'1990-01-01', b'info:name': b'John Doe'}
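
The scan above walks the whole table. A range query narrows it to a span of row keys; happybase's `scan` accepts `row_start`/`row_stop` or `row_prefix` for that, plus `columns` to limit what is returned. A minimal sketch, assuming ISBN row keys as in the Books table; `scan_by_prefix` is an illustrative helper, not part of happybase.

```python
def scan_by_prefix(table, prefix, columns=None):
    """Return (row_key, data) pairs whose row key starts with `prefix`.

    `table` is expected to behave like a happybase.Table; pass `prefix` as bytes.
    """
    # row_prefix restricts the scan to matching keys; columns restricts the
    # families or columns included in each result.
    return list(table.scan(row_prefix=prefix, columns=columns))


# Example against the notebook's 'Books' table (assumed to exist):
# books = connection.table('Books')
# scan_by_prefix(books, b'978-3', columns=[b'details:title'])
```

Because HBase stores rows sorted by key, a prefix or range scan only touches the matching slice of the table, which is why row key design matters for query performance.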


## 6. Experiment with Data Modeling
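Data modeling in HBase can affect performance significantly. HBase has no joins, so following the details:author link from a book to its author takes a second lookup in client code. Consider whether to normalize data (which may lead to multiple cross-table queries) or denormalize it (which increases storage but may decrease the number of queries).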
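
The trade-off can be sketched with the bookstore tables. The first function reads the normalized layout used in this notebook and needs two lookups. The second assumes a hypothetical `details:author_name` column, not part of the schema above, that copies the author's name into each book row so one lookup is enough. Both expect objects that behave like `happybase.Table`.

```python
def book_with_author(books, authors, isbn):
    """Normalized read: two lookups, one per table."""
    book = books.row(isbn)
    if not book:
        return None
    # details:author holds the row key of the author in 'Authors'.
    author = authors.row(book[b"details:author"])
    return {
        "title": book[b"details:title"].decode(),
        "author": author.get(b"info:name", b"").decode(),
    }


def book_with_author_denormalized(books, isbn):
    """Denormalized read: one lookup.

    Assumes a hypothetical details:author_name column that duplicates
    info:name into each book row at write time.
    """
    book = books.row(isbn)
    if not book:
        return None
    return {
        "title": book[b"details:title"].decode(),
        "author": book.get(b"details:author_name", b"").decode(),
    }
```

The denormalized version saves a round trip per read, but every change to an author's name then has to be written to all of that author's book rows.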

## Conclusion
This notebook sets up basic interaction with HBase from Python: running HBase in Docker, creating tables, writing rows, and reading them back with single-row lookups and scans. For a real-world application, you would likely need to adjust your HBase configuration, handle much larger datasets, and refine your schema around your specific access patterns and scalability requirements.