# **S3 CC Data Ingestion**
***

## 1: Bronze Layer

### Step 1.1: Helper Functions
- Run the %run magic command with the file path to bring in the helper functions: %run /Workspace/Shared/gu_census_crawl/common_crawl/src/helper_functions

In [0]:
# Install the nbformat module
%pip install nbformat==5.10.4

# Run magic command on helper functions
%run /Workspace/Shared/gu_census_crawl/common_crawl/src/helper_functions_new.ipynb

### Step 1.2: Set Secret Variables
- Retrieve the AWS credentials that are stored as Databricks secrets, using Databricks Utilities (dbutils)

In [0]:
aws_access_key_id = dbutils.secrets.get(scope='aws_cc', key='aws_access_key_id')
aws_secret_access_key = dbutils.secrets.get(scope='aws_cc', key='aws_secret_access_key')
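
The secret lookup above can be wrapped so that a missing credential fails early instead of surfacing later as an S3 error. This is a minimal sketch, not part of the helper notebook: `build_aws_credentials` and `fake_get` are hypothetical names, and the stand-in getter only exists so the example runs outside Databricks.

```python
def build_aws_credentials(get_secret, scope="aws_cc"):
    """Fetch both AWS secrets and raise if either one is empty or missing."""
    keys = ("aws_access_key_id", "aws_secret_access_key")
    creds = {key: get_secret(scope=scope, key=key) for key in keys}
    missing = [key for key, value in creds.items() if not value]
    if missing:
        raise ValueError(f"Missing secrets in scope '{scope}': {missing}")
    return creds


# Hypothetical stand-in for dbutils.secrets.get so the sketch runs locally.
def fake_get(scope, key):
    sample = {
        "aws_access_key_id": "AKIAEXAMPLE",
        "aws_secret_access_key": "example-secret",
    }
    return sample[key]


creds = build_aws_credentials(fake_get)
```

In a Databricks notebook the call would presumably be `build_aws_credentials(dbutils.secrets.get)`, since the cell above already passes `scope` and `key` as keyword arguments.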

### Step 1.3: Call list master indexes method
- Save as a table in Unity Catalog


In [0]:
# Create Pandas DataFrame
df_master_crawls = list_master_indexes()

# Convert to Spark DataFrame
df_master_crawls = spark.createDataFrame(df_master_crawls)

# Get totals
total = df_master_crawls.count()
print(f"Total Master Indexes: {total}")

# Save to bronze table
df_master_crawls.write.mode("overwrite").saveAsTable("`census_bureau_capstone`.bronze.raw_master_crawls")

# Display Spark DataFrame
display(df_master_crawls)
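
The helper notebook is not shown here, so the internals of `list_master_indexes()` are unknown. The only detail the notebook confirms is a `master_index` column holding values such as `crawl-data/CC-MAIN-2025...`. As a rough sketch under that assumption, rows of this shape could be derived from an S3 listing response (the `CommonPrefixes` structure returned by a delimiter-based listing). `parse_master_indexes` and the sample response are hypothetical.

```python
def parse_master_indexes(response):
    """Turn an S3 listing response into one row per crawl prefix."""
    rows = []
    for entry in response.get("CommonPrefixes", []):
        prefix = entry["Prefix"]
        # Keep only main crawl folders; skip other datasets under crawl-data/.
        if "CC-MAIN-" in prefix:
            rows.append({"master_index": prefix})
    return rows


# Illustrative sample data, shaped like a delimiter-based S3 listing.
sample_response = {
    "CommonPrefixes": [
        {"Prefix": "crawl-data/CC-MAIN-2024-51/"},
        {"Prefix": "crawl-data/CC-MAIN-2025-05/"},
        {"Prefix": "crawl-data/CC-NEWS/"},
    ]
}

rows = parse_master_indexes(sample_response)
```

A list of dictionaries like `rows` can be passed to a Pandas DataFrame constructor, which matches the Pandas-then-Spark conversion used in the cell above.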

In [0]:
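import pandas as pd

# Keep only the 2025 crawls by matching on the master index path
df_master_crawls_2025 = df_master_crawls.filter(df_master_crawls.master_index.contains("crawl-data/CC-MAIN-2025"))

# Save the filtered crawls to the silver table
df_master_crawls_2025.write.mode("overwrite").saveAsTable("`census_bureau_capstone`.silver.cleaned_master_crawls_2025")

# Display Spark DataFrame
display(df_master_crawls_2025)

# Convert to a Pandas DataFrame for the batching helpers
df_master_crawls_2025_pandas = df_master_crawls_2025.toPandas()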

### Step 1.4: Call list crawls method
- Call the list_crawls() function from the helpers 
- Utilize batching to iterate over and union the crawls from the filtered master crawls DataFrame

In [0]:
crawl_list = batch_crawl_list(df_master_crawls_2025_pandas, "master_index")

In [0]:
print(crawl_list)

In [0]:
df_crawls = batch_ingest_crawls(crawl_list)

display(df_crawls)
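
The batching idea in this step (split the master indexes into fixed-size groups, ingest each group, then combine the results) can be sketched in plain Python. This is an illustration only: `batched`, `ingest_in_batches`, and `fake_fetch` are hypothetical names, the batch size is arbitrary, and the real `batch_crawl_list` and `batch_ingest_crawls` helpers may work differently. Extending a list stands in for a DataFrame union.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield consecutive lists of at most batch_size items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def ingest_in_batches(items, fetch_batch, batch_size=2):
    """Process each batch and combine the results into a single list."""
    combined = []
    for batch in batched(items, batch_size):
        # Plain-Python analogue of unioning per-batch DataFrames.
        combined.extend(fetch_batch(batch))
    return combined


# Illustrative sample values, not taken from the notebook's output.
master_indexes = [
    "crawl-data/CC-MAIN-2025-05/",
    "crawl-data/CC-MAIN-2025-08/",
    "crawl-data/CC-MAIN-2025-13/",
]


def fake_fetch(batch):
    # Stand-in for the per-batch S3 listing done by the real helpers.
    return [{"master_index": prefix} for prefix in batch]


batch_sizes = [len(batch) for batch in batched(master_indexes, 2)]
result = ingest_in_batches(master_indexes, fake_fetch, batch_size=2)
```

Batching keeps each request group small, so a failure affects one batch rather than the whole run, and the combined result preserves the input order.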