In [1]:
#1
#aws s3 bucket config
#AAI-540 Group 3 FP

In [2]:
import boto3
import os
import pandas as pd
import sagemaker

Unable to load JumpStart region config.
Traceback (most recent call last):
  File "/home/ec2-user/anaconda3/envs/python3/lib/python3.12/site-packages/sagemaker/jumpstart/constants.py", line 69, in _load_region_config
    with open(filepath) as f:
         ^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/home/ec2-user/anaconda3/envs/python3/lib/python3.12/site-packages/sagemaker/jumpstart/region_config.json'


sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /home/ec2-user/.config/sagemaker/config.yaml


#### Notebook 1: S3 bucket config & dataset upload

This notebook resolves the AWS execution context (region, role, account) and uses the SageMaker default bucket for this account.

It uploads the project datasets to S3 for downstream notebooks (Athena tables, model profiles, router training).

Note: The “Unable to load JumpStart region config” warning is an environment message and does not affect S3 uploads.

In [3]:
# Create one boto3 session and one SageMaker session, then reuse them for every client.
session = boto3.session.Session()
region = session.region_name
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = sagemaker_session.default_bucket()

# Service clients, all bound to the same region.
s3 = session.client(service_name="s3", region_name=region)
sm = session.client(service_name="sagemaker", region_name=region)
account_id = session.client("sts").get_caller_identity()["Account"]

In [4]:
print("Region:", region)
print("Account:", account_id)
print("Role:", role)
print("Default bucket:", bucket)

Region: us-east-1
Account: 907086662522
Role: arn:aws:iam::907086662522:role/LabRole
Default bucket: sagemaker-us-east-1-907086662522
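
The role ARN printed above already embeds the account ID, so the two values can be cross-checked before any upload runs. The following is a minimal sketch of that check; `parse_iam_role_arn` and `role_matches_account` are hypothetical helper names introduced here for illustration and are not part of boto3 or the SageMaker SDK.

```python
def parse_iam_role_arn(arn: str) -> dict:
    """Split an IAM role ARN into its partition, account ID, and role name."""
    # ARN layout: arn:<partition>:iam::<account-id>:role/<optional/path/>name
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "iam":
        raise ValueError(f"Not an IAM ARN: {arn!r}")
    resource = parts[5]
    if not resource.startswith("role/"):
        raise ValueError(f"Not a role ARN: {arn!r}")
    return {
        "partition": parts[1],
        "account_id": parts[4],
        # Keep only the final segment so a role path such as service-role/ is dropped.
        "role_name": resource.rsplit("/", 1)[-1],
    }


def role_matches_account(role_arn: str, account_id: str) -> bool:
    """Return True when the role belongs to the given account."""
    return parse_iam_role_arn(role_arn)["account_id"] == account_id


info = parse_iam_role_arn("arn:aws:iam::907086662522:role/LabRole")
print(info["account_id"], info["role_name"])  # 907086662522 LabRole
```

In the notebook, `role_matches_account(role, account_id)` would be expected to return `True`; a `False` result would suggest the notebook is running under a role from a different account than the one the STS call reports.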


In [5]:
print("Default bucket: {}".format(bucket))

Default bucket: sagemaker-us-east-1-907086662522


In [6]:
s3_private_path_csv = f"s3://{bucket}"
print(s3_private_path_csv)
%store s3_private_path_csv
%store

s3://sagemaker-us-east-1-907086662522
Stored 's3_private_path_csv' (str)
Stored variables and their in-db values:
s3_private_path_csv             -> 's3://sagemaker-us-east-1-907086662522'
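
Downstream notebooks that restore `s3_private_path_csv` will need to build object URIs from it and, for boto3 calls, split a URI back into bucket and key. A minimal sketch of both operations follows; `join_s3_uri` and `split_s3_uri` are hypothetical helpers written for this example, not functions from an AWS library.

```python
def join_s3_uri(base: str, *parts: str) -> str:
    """Append key segments to an s3:// base without doubling slashes."""
    segments = [base.rstrip("/")]
    segments += [p.strip("/") for p in parts if p.strip("/")]
    return "/".join(segments)


def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split s3://bucket/key into (bucket, key); key is '' for a bare bucket."""
    if not uri.startswith("s3://"):
        raise ValueError(f"Not an S3 URI: {uri!r}")
    bucket, _, key = uri[len("s3://"):].partition("/")
    if not bucket:
        raise ValueError(f"Missing bucket name: {uri!r}")
    return bucket, key


base = "s3://sagemaker-us-east-1-907086662522"
uri = join_s3_uri(base, "lifearchitectmodels.csv")
print(uri)                # s3://sagemaker-us-east-1-907086662522/lifearchitectmodels.csv
print(split_s3_uri(uri))  # ('sagemaker-us-east-1-907086662522', 'lifearchitectmodels.csv')
```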


### Dataset Upload to Amazon S3

This section transfers all locally preprocessed datasets from Notebook 0 into the SageMaker S3 bucket.  
These files will be accessed by downstream notebooks for feature engineering, model profiling, and router training.

In [7]:
files_expected = [
    "aimodelpoll.csv",
    "lifearchitectmodels.csv",
    "llmachievements.csv",
    "llmpricingdata.csv",
    "openllmleader.csv",
    "overviewaimodels.csv",
    "chatbotarena.json",   # optional/large
]

print("Local file check:")
for f in files_expected:
    print(f"{f:22} ->", "FOUND" if os.path.exists(f) else "MISSING")

Local file check:
aimodelpoll.csv        -> FOUND
lifearchitectmodels.csv -> FOUND
llmachievements.csv    -> FOUND
llmpricingdata.csv     -> FOUND
openllmleader.csv      -> FOUND
overviewaimodels.csv   -> FOUND
chatbotarena.json      -> FOUND
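
The cells below upload each file with a separate `aws s3 cp` command. As an alternative, the same transfer could be driven from Python in a single loop. This is only a sketch, not the method this notebook uses: `plan_uploads` and `upload_all` are hypothetical names, and the `prefix` parameter is an assumption added for illustration (the notebook uploads to the bucket root). The client passed in is expected to expose boto3's `upload_file(Filename, Bucket, Key)` method.

```python
import os


def plan_uploads(filenames, bucket, prefix=""):
    """Map each local filename to the (path, bucket, key) it should be uploaded to."""
    prefix = prefix.strip("/")
    plan = []
    for name in filenames:
        base = os.path.basename(name)
        key = f"{prefix}/{base}" if prefix else base
        plan.append((name, bucket, key))
    return plan


def upload_all(s3_client, plan):
    """Upload every planned file and return the resulting s3:// URIs."""
    uploaded = []
    for local_path, bucket, key in plan:
        # Positional arguments are Filename, Bucket, Key on a boto3 S3 client.
        s3_client.upload_file(local_path, bucket, key)
        uploaded.append(f"s3://{bucket}/{key}")
    return uploaded


# In this notebook the call would look like:
#   upload_all(s3, plan_uploads(files_expected, bucket))
```

Separating the plan from the upload keeps the key naming inspectable before anything is written to S3, which makes a mismatch between a local filename and its destination key easier to spot.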


### Upload: AI Model Poll Dataset

Uploads the AI model usage poll dataset to the configured S3 bucket.

This dataset contains demographic-based usage proportions for different AI services (ChatGPT, Claude, Gemini, etc.) and is used later for feature construction and model comparison.

Source: Preprocessed locally in Notebook 0.

In [8]:
!aws s3 cp aimodelpoll.csv {s3_private_path_csv}/aimodelpoll.csv

upload: ./aimodelpoll.csv to s3://sagemaker-us-east-1-907086662522/aimodelpoll.csv


### Upload: Chatbot Arena Comparisons (Optional / Large File)

Uploads pairwise model comparison data from Chatbot Arena.

This dataset can be large and is optional for minimal pipeline execution, but provides additional preference-based evaluation signals when available.
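
Because this file is optional, a pre-upload check could treat it differently from the required CSVs: fail fast when a required file is missing, but only skip an absent optional one. The sketch below illustrates that split; `check_local_files` is a hypothetical helper, and which files count as required is an assumption based on the `# optional/large` comment in the file list.

```python
import os


def check_local_files(required, optional=(), base_dir="."):
    """Return (present, skipped); raise if any required file is missing."""
    missing = [f for f in required if not os.path.exists(os.path.join(base_dir, f))]
    if missing:
        raise FileNotFoundError(f"Required dataset(s) missing: {missing}")

    present = list(required)
    skipped = []
    for f in optional:
        if os.path.exists(os.path.join(base_dir, f)):
            present.append(f)
        else:
            skipped.append(f)  # optional files are skipped, not treated as errors
    return present, skipped


# Example call for this project's file list:
#   present, skipped = check_local_files(
#       required=[f for f in files_expected if f != "chatbotarena.json"],
#       optional=["chatbotarena.json"],
#   )
```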


In [9]:
!aws s3 cp chatbotarena.json {s3_private_path_csv}/chatbotarena.json

upload: ./chatbotarena.json to s3://sagemaker-us-east-1-907086662522/chatbotarena.json


### Upload: LifeArchitect Model Specifications

Uploads the LifeArchitect model table containing technical and capability attributes of AI models (parameters, benchmarks, architecture, etc.).

This dataset is the primary source for constructing model capability profiles used by the router training pipeline.

Source: Cleaned and structured in Notebook 0 from LifeArchitect data export.


In [10]:
!aws s3 cp lifearchitectmodels.csv {s3_private_path_csv}/lifearchitectmodels.csv

upload: ./lifearchitectmodels.csv to s3://sagemaker-us-east-1-907086662522/lifearchitectmodels.csv


### Upload: LLM Achievement Benchmarks

Uploads benchmark results comparing model and human performance across multiple evaluation tasks.

Used to derive capability indicators and performance comparisons during model profile construction.


In [11]:
!aws s3 cp llmachievements.csv {s3_private_path_csv}/llmachievements.csv

upload: ./llmachievements.csv to s3://sagemaker-us-east-1-907086662522/llmachievements.csv


### Upload: LLM Pricing Data

Uploads cost and pricing information for AI models.

This dataset is used to estimate routing cost trade-offs and support cost-aware model selection.


In [12]:
!aws s3 cp llmpricingdata.csv {s3_private_path_csv}/llmpricingdata.csv

upload: ./llmpricingdata.csv to s3://sagemaker-us-east-1-907086662522/llmpricingdata.csv


### Upload: Open LLM Leaderboard Results

Uploads leaderboard rankings and evaluation metrics for open-source language models.

Used as an additional performance signal when constructing unified model profiles.

In [13]:
!aws s3 cp openllmleader.csv {s3_private_path_csv}/openllmleader.csv

upload: ./openllmleader.csv to s3://sagemaker-us-east-1-907086662522/openllmleader.csv


### Upload: Overview AI Models Metadata

Uploads high-level descriptive metadata about AI models including release timeline, provider, and classification.

Serves as the base reference table for joining model identity across datasets.


In [14]:
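# Note: the destination key is overviewmodels.csv (without "ai"), so it differs from the local filename.
!aws s3 cp overviewaimodels.csv {s3_private_path_csv}/overviewmodels.csv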

upload: ./overviewaimodels.csv to s3://sagemaker-us-east-1-907086662522/overviewmodels.csv


In [15]:
!aws s3 ls {s3_private_path_csv}

                           PRE aai-540/
                           PRE amazon-reviews-pds/
                           PRE athena-results/
                           PRE athena/
                           PRE feature-store/
                           PRE models/
                           PRE processed/
                           PRE table1/
                           PRE table2/
                           PRE table3/
2026-02-22 23:15:50       7979 aimodelpoll.csv
2026-02-22 23:15:51  113342632 chatbotarena.json
2026-02-22 23:15:53      51984 lifearchitectmodels.csv
2026-02-22 23:15:53      18853 llmachievements.csv
2026-02-22 23:15:54       1587 llmpricingdata.csv
2026-02-22 23:15:55      29356 openllmleader.csv
2026-02-22 23:15:56    6317510 overviewmodels.csv


In [16]:
# Quick read-back test to confirm S3 access and that the uploaded CSV parses
test = pd.read_csv(f"{s3_private_path_csv}/lifearchitectmodels.csv")
print("Read lifearchitectmodels.csv from S3:", test.shape)
print("First columns:", list(test.columns)[:6])

Read lifearchitectmodels.csv from S3: (754, 14)
First columns: ['model', 'lab', 'parameters_b', 'tokens_trained_b', 'ratio_tokens_params', 'alscore']
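
Reading one CSV back confirms access, but it does not compare the uploaded objects against the local files. A lightweight complement is to compare byte sizes using the S3 `head_object` call, which returns metadata without downloading the body. The sketch below assumes a boto3-style client; `verify_upload_sizes` is a hypothetical helper. A size match is a weaker guarantee than a checksum comparison, so it should be read as a sanity check only.

```python
import os


def verify_upload_sizes(s3_client, bucket, keys_by_path):
    """Compare each local file's byte size with the ContentLength S3 reports.

    keys_by_path maps a local path to its S3 key. Returns a dict of
    key -> (local_size, remote_size) for every mismatch; empty means all match.
    """
    mismatches = {}
    for local_path, key in keys_by_path.items():
        local_size = os.path.getsize(local_path)
        # head_object fetches object metadata only, not the object body.
        remote_size = s3_client.head_object(Bucket=bucket, Key=key)["ContentLength"]
        if local_size != remote_size:
            mismatches[key] = (local_size, remote_size)
    return mismatches


# In this notebook the call could look like:
#   verify_upload_sizes(s3, bucket, {"overviewaimodels.csv": "overviewmodels.csv"})
```

Passing an explicit path-to-key mapping lets the check cover files whose S3 key differs from the local filename.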



