+++
title =  "GCP Cloud Run: LOC Normalizer"
date = "2024-04-28"
description = "Normalizing JSON into a DB, autonomously."
author = "Justin Napolitano"
tags = ['git', 'python', 'gcp', 'bash','workflow automation', 'docker','containerization']
images = ["images/feature-gcp.png"]
categories = ["projects"]
+++


# Library of Congress Normalizer Job

This [repo](https://github.com/justin-napolitano/loc_normalizer) normalizes the existing Library of Congress schema into a database that will then be used to construct a knowledge graph of Supreme Court law.

## Plan

1. Set up a venv to run locally
2. Install requirements
3. Write the script to interface with GCP
4. Set up a Docker container and test locally
5. Build the image
6. Upload to GCP
7. Create the job

## Set up the venv

### Install
I installed virtualenv locally on Ubuntu.

### Create
I then ran ```virtualenv {path to venvs}```

### Activate

Then source the venv's activate script:

```source {path to venv}/bin/activate```
   
### Install requirements

``` pip install -r requirements.txt```


In [5]:
pip install -r ../requirements.txt

Collecting bs4==0.0.2 (from -r ../requirements.txt (line 2))
  Downloading bs4-0.0.2-py2.py3-none-any.whl.metadata (411 bytes)
Collecting cachetools==5.3.3 (from -r ../requirements.txt (line 3))
  Downloading cachetools-5.3.3-py3-none-any.whl.metadata (5.3 kB)
Collecting flatten-json==0.1.14 (from -r ../requirements.txt (line 6))
  Downloading flatten_json-0.1.14-py3-none-any.whl.metadata (4.2 kB)
Collecting google-api-core==2.18.0 (from -r ../requirements.txt (line 7))
  Downloading google_api_core-2.18.0-py3-none-any.whl.metadata (2.7 kB)
Collecting google-auth==2.29.0 (from -r ../requirements.txt (line 8))
  Downloading google_auth-2.29.0-py2.py3-none-any.whl.metadata (4.7 kB)
Collecting google-cloud-appengine-logging==1.4.3 (from -r ../requirements.txt (line 9))
  Downloading google_cloud_appengine_logging-1.4.3-py2.py3-none-any.whl.metadata (5.4 kB)
Collecting google-cloud-audit-log==0.2.5 (from -r ../requirements.txt (line 10))
  Downloading google_cloud_audit_log-0.2.5-py2.py3-n

## Write out the Script

### Steps
1. Initialize the Google Cloud Logging service
2. Initialize the Google Cloud Storage service
3. Initialize the BigQuery client
4. Grab a JSON blob
5. Process the blob
6. Move the blob to a processed bucket
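
The steps above can be sketched end to end without touching GCP. This is a minimal, in-memory sketch: the `FakeBucket` class, the `flatten_record` helper, and the sample blob are all hypothetical stand-ins (the real job uses the `gcputils` clients and `flatten_json`), so treat it as an outline of the control flow rather than the actual implementation.

```python
import json


class FakeBucket:
    """Hypothetical in-memory stand-in for a GCS bucket (blob name -> bytes)."""

    def __init__(self, blobs=None):
        self.blobs = dict(blobs or {})

    def first_blob_name(self):
        # Return the lexicographically first blob name, or None if empty.
        return min(self.blobs) if self.blobs else None


def flatten_record(record, parent_key="", sep="_"):
    """Flatten nested dicts/lists into a single-level dict.

    A simplified stand-in for flatten_json.flatten, for illustration only.
    """
    if isinstance(record, dict):
        items = record.items()
    elif isinstance(record, list):
        items = enumerate(record)
    else:
        return {parent_key: record}
    flat = {}
    for key, value in items:
        new_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
        flat.update(flatten_record(value, new_key, sep))
    return flat


def process_next_blob(source, processed, rows):
    """Grab one blob, flatten it into `rows`, then move it to `processed`."""
    name = source.first_blob_name()
    if name is None:
        return None
    payload = json.loads(source.blobs[name])
    rows.append(flatten_record(payload))  # stand-in for a BigQuery insert
    processed.blobs[name] = source.blobs.pop(name)  # move only after success
    return name


# Made-up sample data, not a real Library of Congress record.
source = FakeBucket({"result-1.json": b'{"title": "Example", "item": {"dates": ["1900"]}}'})
processed = FakeBucket()
rows = []
moved = process_next_blob(source, processed, rows)
print(moved, rows)
```

Moving the blob only after the row has been recorded means a failure during processing leaves the blob in the source bucket, where a later run can pick it up again.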


#### Initialize the Google Cloud Logging Service

I created a Google Cloud Logging client class, available at: https://github.com/justin-napolitano/gcputils/blob/bc421debf4c828522580ec79ab634b2e2bf402a4/GoogleCloudLogging.py

It is imported and tested below. Note that the CLI-specific arguments are commented out for testing in ipynb.

In [25]:
# loc_flattener.py
# library_of_congress_scraper.py

from __future__ import print_function
from gcputils.gcpclient import GCSClient
from gcputils.GoogleCloudLogging import GoogleCloudLogging
from bs4 import BeautifulSoup
import requests
import json
import os
import time
from pprint import pprint
import html
from flatten_json import flatten
import google.cloud.logging
import logging
import argparse




In [26]:

def initialize_google_cloud_logging_client(project_id, credentials_path=None):
    return GoogleCloudLogging(project_id, credentials_path=credentials_path)


def main():
    # parser = argparse.ArgumentParser(description='Run the script locally or in the cloud.')
    # parser.add_argument('--local', action='store_true', help='Run the script locally with credentials path')
    # args = parser.parse_args()

    project_id = os.getenv('GCP_PROJECT_ID', 'smart-axis-421517')
    bucket_name = os.getenv('BUCKET_NAME', 'loc-scraper')

    credentials_path = None
    # if args.local:
    credentials_path = os.getenv('GCP_CREDENTIALS_PATH', 'secret.json')

    # Initialize logging
    logging_client = initialize_google_cloud_logging_client(project_id,credentials_path)
    logging_client.setup_logging()


if __name__ == "__main__":
    main()
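
The commented-out `--local` flag can be wired back in once the code leaves the notebook. The following is a minimal sketch of one way to do that, assuming the same environment variable names and defaults used above; `resolve_settings` is a hypothetical helper, not part of the repo. When `--local` is absent the credentials path stays `None`, on the assumption that the client wrapper then falls back to the default credentials available in the cloud environment.

```python
import argparse
import os


def resolve_settings(argv=None, env=None):
    """Return (project_id, bucket_name, credentials_path) from CLI flags and env vars.

    Hypothetical helper for illustration; names and defaults mirror the cell above.
    """
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(
        description="Run the script locally or in the cloud."
    )
    parser.add_argument(
        "--local",
        action="store_true",
        help="Run the script locally with credentials path",
    )
    args = parser.parse_args(argv)

    project_id = env.get("GCP_PROJECT_ID", "smart-axis-421517")
    bucket_name = env.get("BUCKET_NAME", "loc-scraper")

    # Only look for a key file when running locally.
    credentials_path = None
    if args.local:
        credentials_path = env.get("GCP_CREDENTIALS_PATH", "secret.json")
    return project_id, bucket_name, credentials_path


# Explicit argument lists are passed here so the example also runs inside a notebook,
# where sys.argv belongs to the kernel rather than to this script.
print(resolve_settings(["--local"], env={}))
print(resolve_settings([], env={}))
```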

#### Initialize the Google Cloud Storage Client

The Google Cloud Storage Client is available at https://github.com/justin-napolitano/gcputils/blob/bc421debf4c828522580ec79ab634b2e2bf402a4/gcpclient.py

The cell below initializes the client and lists the buckets as a test.

In [28]:

def initialize_google_cloud_logging_client(project_id, credentials_path=None):
    return GoogleCloudLogging(project_id, credentials_path=credentials_path)

def initialize_gcs_client(project_id, credentials_path=None):
    return GCSClient(project_id, credentials_path=credentials_path)

def list_gcs_buckets(client):
    try:
        buckets = client.list_buckets()
        print("Buckets:", buckets)
        logging.info(f"Buckets: {buckets}")
    except Exception as e:
        logging.error(f"Error listing buckets: {e}")

def main():
    # parser = argparse.ArgumentParser(description='Run the script locally or in the cloud.')
    # parser.add_argument('--local', action='store_true', help='Run the script locally with credentials path')
    # args = parser.parse_args()

    project_id = os.getenv('GCP_PROJECT_ID', 'smart-axis-421517')
    bucket_name = os.getenv('BUCKET_NAME', 'loc-scraper')

    credentials_path = None
    # if args.local:
    credentials_path = os.getenv('GCP_CREDENTIALS_PATH', 'secret.json')

    # Initialize logging
    logging_client = initialize_google_cloud_logging_client(project_id,credentials_path)
    logging_client.setup_logging()

    gcs_client = initialize_gcs_client(project_id, credentials_path)
    list_gcs_buckets(gcs_client)


if __name__ == "__main__":
    main()

trying creds file
Buckets: ['loc-scraper', 'smart-axis-421517_cloudbuild']


#### Access the Blobs within the bucket

Now I need to grab a blob from the bucket. In this case I just want to grab one from the top of the heap without pulling a lot of data into context.

##### Addition to the storage class 

```Python

def list_blobs(self, bucket_name):
    """
    Lists all blobs in the specified bucket in Google Cloud Storage.

    Args:
        bucket_name (str): Name of the bucket.

    Returns:
        list: A list of blob names.
    """
    # Get the bucket
    bucket = self.client.bucket(bucket_name)

    # List all blobs in the bucket
    blobs = list(bucket.list_blobs())

    blob_names = [blob.name for blob in blobs]
    return blob_names

def pop_blob(self, bucket_name):
    """
    Selects the first blob from the specified bucket in Google Cloud Storage.

    Note: despite the name, the blob is only returned; it is not deleted
    from the bucket.

    Args:
        bucket_name (str): Name of the bucket.

    Returns:
        google.cloud.storage.blob.Blob: The first blob from the bucket,
        or None if the bucket is empty.
    """
    # Get the bucket
    bucket = self.client.bucket(bucket_name)

    # List all blobs in the bucket
    blobs = list(bucket.list_blobs())

    if not blobs:
        print(f"No blobs found in bucket '{bucket_name}'.")
        return None

    # Get the first blob
    first_blob = blobs[0]

    print(f"First blob selected: {first_blob.name}")
    return first_blob

```
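
Both methods above call `list(bucket.list_blobs())`, which walks through every object in the bucket before returning. If the goal is only to grab one blob, the iterator can be consumed lazily instead. The sketch below illustrates the idea with a hypothetical generator, `fake_list_blobs`, standing in for the bucket listing; with the real `google-cloud-storage` library, passing `max_results=1` to `list_blobs` should have a similar effect, though that has not been tested here.

```python
def fake_list_blobs(names, fetched):
    """Hypothetical stand-in for a paged bucket listing.

    Appends each name to `fetched` as it is produced, so the amount of
    work triggered by the caller can be inspected afterwards.
    """
    for name in names:
        fetched.append(name)
        yield name


def first_or_none(iterable):
    """Return the first item of an iterable without consuming the rest."""
    return next(iter(iterable), None)


names = [f"result-{i}.json" for i in range(1000)]

# Eager: materialize the whole listing, then index into it.
eager_fetched = []
eager_first = list(fake_list_blobs(names, eager_fetched))[0]

# Lazy: stop as soon as the first item is available.
lazy_fetched = []
lazy_first = first_or_none(fake_list_blobs(names, lazy_fetched))

print(eager_first, len(eager_fetched))  # touches all 1000 names
print(lazy_first, len(lazy_fetched))    # touches only the first name
```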

##### Test Run 

In [29]:
def initialize_google_cloud_logging_client(project_id, credentials_path=None):
    return GoogleCloudLogging(project_id, credentials_path=credentials_path)

def initialize_gcs_client(project_id, credentials_path=None):
    return GCSClient(project_id, credentials_path=credentials_path)

def list_gcs_buckets(client):
    try:
        buckets = client.list_buckets()
        print("Buckets:", buckets)
        logging.info(f"Buckets: {buckets}")
    except Exception as e:
        logging.error(f"Error listing buckets: {e}")

def main():
    # parser = argparse.ArgumentParser(description='Run the script locally or in the cloud.')
    # parser.add_argument('--local', action='store_true', help='Run the script locally with credentials path')
    # args = parser.parse_args()

    project_id = os.getenv('GCP_PROJECT_ID', 'smart-axis-421517')
    bucket_name = os.getenv('BUCKET_NAME', 'loc-scraper')

    credentials_path = None
    # if args.local:
    credentials_path = os.getenv('GCP_CREDENTIALS_PATH', 'secret.json')

    # Initialize logging
    logging_client = initialize_google_cloud_logging_client(project_id,credentials_path)
    logging_client.setup_logging()

    # List Buckets for testing
    gcs_client = initialize_gcs_client(project_id, credentials_path)
    list_gcs_buckets(gcs_client)

    # Grab A blob from the heap
    first_blob = gcs_client.pop_blob(bucket_name)
    if first_blob:
        print(f"First blob name: {first_blob.name}")


if __name__ == "__main__":
    main()

trying creds file
Buckets: ['loc-scraper', 'smart-axis-421517_cloudbuild']


AttributeError: 'GCSClient' object has no attribute 'pop_blob'
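
One plausible explanation for this error, assuming `pop_blob` was added to the `gcputils` source after the notebook kernel had already imported the package (or that the installed copy predates the change), is that the running session still holds the old class definition. The following sketch reproduces that situation with a throwaway module named `demo_client` (a hypothetical name) and shows `importlib.reload` picking up the new method; restarting the kernel or reinstalling the package would be alternatives.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Two versions of a tiny module: the second adds a pop_blob method.
OLD_SOURCE = "class Client:\n    def list_buckets(self):\n        return []\n"
NEW_SOURCE = OLD_SOURCE + "\n    def pop_blob(self, bucket_name):\n        return None\n"

with tempfile.TemporaryDirectory() as tmp:
    module_path = Path(tmp) / "demo_client.py"
    module_path.write_text(OLD_SOURCE)
    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        import demo_client

        before = hasattr(demo_client.Client(), "pop_blob")

        # Edit the file on disk, as if the method had just been added.
        module_path.write_text(NEW_SOURCE)
        stale = hasattr(demo_client.Client(), "pop_blob")

        # Reload so the session sees the new class definition.
        importlib.reload(demo_client)
        after = hasattr(demo_client.Client(), "pop_blob")
    finally:
        # Clean up so the throwaway module does not linger in the session.
        sys.path.remove(tmp)
        sys.modules.pop("demo_client", None)

print(before, stale, after)
```

Note that a reload only affects objects created afterwards; a client instance built before the reload keeps the old class, so it would need to be constructed again.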