# Nobel Prize API Data Engineering

## Project Overview
This notebook shows **API data consumption, normalization, and modeling techniques** using the Nobel Prize API.


## Key Technical Skills
- **API Data Consumption**: REST API interaction with proper error handling
- **Data Normalization**: Converting nested JSON structures to flat tabular format
- **Data Modeling**: Handling relationships between laureates and prizes
- **Data Engineering**: Extract-and-transform pipeline in Python, with results written to CSV

---

## Setup and Dependencies

In [2]:
import requests
import pandas as pd
import numpy as np

---

## API Data Ingestion

### Generic API Data Fetcher

In [3]:
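from typing import Optional

def get_api_data(BASE_URL: str, limit: Optional[int] = None, format: str = 'json') -> Optional[dict]:
    """Fetch data from the API; return the parsed JSON, or None if the request fails."""
    try:
        # requests omits params whose value is None; the timeout avoids hanging on a stalled connection
        response = requests.get(url=BASE_URL, params={'limit': limit, 'format': format}, timeout=30)
        response.raise_for_status()
        return response.json()
    except requests.exceptions.RequestException as e:
        print(f"GET data error: {e}")
        return None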

**Technical Features:**
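- Error handling with try/except: request failures are printed and the function returns `None`, so callers should check the result before using it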
- Configurable parameters (limit, format)
- HTTP status code validation with `raise_for_status()`

---

## Data Source Analysis

### Endpoint Discovery and Sizing

In [4]:
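laureates_url = 'https://api.nobelprize.org/2.1/laureates'
# Fall back to an empty dict so a failed request prints None instead of raising AttributeError
laureates_meta = (get_api_data(BASE_URL=laureates_url, limit=1) or {}).get('meta', {})
print(f"laureatesResult size: {laureates_meta.get('count')}")
nobel_prizes_url = 'https://api.nobelprize.org/2.1/nobelPrizes'
nobel_prizes_meta = (get_api_data(BASE_URL=nobel_prizes_url, limit=1) or {}).get('meta', {})
print(f"nobelPrizesResult size: {nobel_prizes_meta.get('count')}")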

laureatesResult size: 1004
nobelPrizesResult size: 676


**Data Engineering Best Practice:**
- **Data profiling**: Understanding data volume before full ingestion
- **Metadata extraction**: Using API meta information for capacity planning
- **Endpoint optimization**: Testing with minimal data first

---

## Full Dataset Extraction

In [5]:
laureates_json_data = get_api_data(BASE_URL=laureates_url,limit=1004)
nobel_prizes_json_data = get_api_data(BASE_URL=nobel_prizes_url,limit=676)

**Strategy:** Using discovered counts to fetch complete datasets in single API calls.
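
A single call works here because both collections are small. If a collection were larger, or if the API capped `limit`, the same `meta.count` value could drive paged requests instead. The sketch below is one possible approach, not the notebook's method: it assumes the endpoint accepts an `offset` parameter alongside `limit` and that `meta.count` reports the total. `fetch_all`, `fetch_page`, `fake_fetch`, and `page_size` are hypothetical names, and the fetcher is passed in as a callable so the loop can be exercised without network access.

```python
from typing import Callable, Optional


def fetch_all(fetch_page: Callable[[int, int], Optional[dict]],
              records_key: str,
              page_size: int = 100) -> list:
    """Collect every record from a paged endpoint.

    fetch_page(offset, limit) is a hypothetical callable that returns one
    page of the API response as a dict, or None on failure.
    """
    records = []
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        if page is None:
            raise RuntimeError(f"request failed at offset {offset}")
        batch = page.get(records_key, [])
        records.extend(batch)
        total = page.get("meta", {}).get("count", 0)
        offset += page_size
        # Stop when the reported total is reached or a page comes back empty.
        if not batch or offset >= total:
            break
    return records


# Stand-in for the real API: 250 fake records served page by page.
_FAKE = [{"id": str(i)} for i in range(250)]


def fake_fetch(offset: int, limit: int) -> dict:
    return {"laureates": _FAKE[offset:offset + limit], "meta": {"count": len(_FAKE)}}


all_records = fetch_all(fake_fetch, "laureates", page_size=100)
print(len(all_records))  # 250
```

To use this against the live API, `get_api_data` would need an additional `offset` argument forwarded in `params`; that change is not part of the notebook.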

---

## Data Structure Exploration

In [6]:
laureates_json_data.get('laureates')[0].get('birth').get('date')

'1943-00-00'

In [7]:
nobel_prizes_json_data.get('nobelPrizes')[0].get('category').get('en')

'Chemistry'

**Purpose:** Understanding nested JSON structure to inform normalization strategy.

---

## Data Normalization Engine

### JSONPath-Style Nested Value Extractor

In [8]:
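# Dot-path lookup (JSONPath-like, dots only)
def extract_nested_value(data: dict, path: str, default=np.nan):
    keys = path.split('.')
    current = data
    for key in keys:
        # Only descend into dicts; a string or list here means the path cannot continue
        if isinstance(current, dict) and key in current:
            current = current[key]
        else:
            return default
    return current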

**Features:**
- **Dot-path syntax**: JSONPath-like dot notation (`birth.place.country.en`); JSONPath features such as array indexing and wildcards are not supported
- **Safe navigation**: Returns the default instead of raising `KeyError` when a key is missing
- **Default value handling**: Uses `np.nan` for missing data compatibility with pandas
- **Iterative traversal**: Walks any number of nesting levels in a simple loop

**Example Usage:**

In [9]:
extract_nested_value(laureates_json_data.get('laureates')[0],'birth.place.country.en')

'USA'
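
The lookup above succeeds because every key on the path exists. The missing-key behavior can be sketched on a small hand-made record. This is a self-contained copy of the extractor that also guards against non-dict intermediate values; `math.nan` stands in for `np.nan` so the block needs only the standard library (both are float NaN), and `record` is a hypothetical example rather than real API data.

```python
import math


def extract_nested_value(data: dict, path: str, default=math.nan):
    """Follow a dot-separated path through nested dicts; return default if it cannot be followed."""
    current = data
    for key in path.split("."):
        if isinstance(current, dict) and key in current:
            current = current[key]
        else:
            return default
    return current


# Hypothetical record shaped like an API laureate with no birth place recorded.
record = {"id": "0", "knownName": {"en": "Example Laureate"}, "birth": {"date": "1900-00-00"}}

print(extract_nested_value(record, "knownName.en"))            # Example Laureate
print(extract_nested_value(record, "birth.place.country.en"))  # nan
print(extract_nested_value(record, "birth.date.year"))         # nan (a string is not traversed)
```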

---

## Data Modeling & Normalization

### Multi-Source DataFrame Factory

In [10]:
def normalize_to_dataframe(json_data: dict) -> pd.DataFrame:
    rows = []
    if 'laureates' in json_data:
        # Laureates normalization logic
        for laureate in json_data['laureates']:
            row = {
                'laureate_id': laureate.get('id'),
                'known_name': extract_nested_value(laureate, 'knownName.en'),
                'gender': laureate.get('gender', np.nan),
                'birth_date': extract_nested_value(laureate, 'birth.date'),
                'born_city': extract_nested_value(laureate, 'birth.place.city.en'),
                'born_country': extract_nested_value(laureate, 'birth.place.country.en'),
                'born_country_now': extract_nested_value(laureate, 'birth.place.countryNow.en'),
                'continent': extract_nested_value(laureate, 'birth.place.continent.en'),
                'death_date': extract_nested_value(laureate, 'death.date')
            }
            rows.append(row)
    elif 'nobelPrizes' in json_data:
        # Nobel Prizes normalization with relationship modeling
        for nobel_prize in json_data['nobelPrizes']:
            row = {
                'year': nobel_prize.get('awardYear', np.nan),
                'category': extract_nested_value(nobel_prize, 'category.en'),
                'date_awarded': nobel_prize.get('dateAwarded', np.nan),
                'prize_amount': nobel_prize.get('prizeAmount', np.nan),
                'prize_amount_adjusted': nobel_prize.get('prizeAmountAdjusted', np.nan),
                'top_motivation': extract_nested_value(nobel_prize, 'topMotivation.en')
            }
            # Complex relationship modeling: Prize -> Multiple Laureates
            if 'laureates' in nobel_prize:
                for nobel_prize_laureate in nobel_prize['laureates']:
                    new_row = row.copy()
                    new_row.update({
                        'laureate_id': nobel_prize_laureate.get('id'),
                        'motivation': extract_nested_value(nobel_prize_laureate, 'motivation.en'),
                        'portion': extract_nested_value(nobel_prize_laureate, 'portion'),
                    })
                    rows.append(new_row)
            else:
                # No laureates: keep one row so the prize still appears in the output
                rows.append(row)

    return pd.DataFrame(rows)

    

### Data Modeling Techniques Implemented:

#### 1. **Polymorphic Data Processing**
- Single function handles multiple data schemas (`laureates` vs `nobelPrizes`)
- Schema detection at runtime by checking which top-level key (`laureates` or `nobelPrizes`) the payload contains

#### 2. **One-to-Many Relationship Modeling**
- **Challenge**: One Nobel Prize can have multiple laureates
- **Solution**: Row multiplication - each laureate gets their own row with prize details
- **Result**: Enables individual laureate analysis while preserving prize context

#### 3. **Missing Data Handling**
- Consistent use of `np.nan` for missing values
- Conditional field extraction with fallbacks
- Pandas-compatible null value strategy
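
The one-to-many expansion described above can be illustrated on a single prize record. The sketch below uses only the standard library and is not the notebook's function: `expand_prize` is a hypothetical helper, and the record is a trimmed stand-in whose values are taken from the 1901 Peace rows in this notebook's output. As a sanity check, it also verifies that the `portion` values of one prize add up to 1.

```python
from fractions import Fraction

# Trimmed prize record shaped like the API payload (values from the 1901 Peace rows).
prize = {
    "awardYear": "1901",
    "category": {"en": "Peace"},
    "prizeAmount": 150782,
    "laureates": [
        {"id": "462", "portion": "1/2"},
        {"id": "463", "portion": "1/2"},
    ],
}


def expand_prize(prize: dict) -> list:
    """Return one row per laureate, each carrying a copy of the prize-level fields."""
    base = {
        "year": prize.get("awardYear"),
        "category": prize.get("category", {}).get("en"),
        "prize_amount": prize.get("prizeAmount"),
    }
    laureates = prize.get("laureates")
    if not laureates:
        return [base]  # no laureates: keep a single row without laureate fields
    return [
        {**base, "laureate_id": item.get("id"), "portion": item.get("portion")}
        for item in laureates
    ]


rows = expand_prize(prize)
# The shares of a single prize should sum to exactly 1.
total_share = sum(Fraction(row["portion"]) for row in rows)
print(len(rows), total_share)  # 2 1
```

Because prize-level fields such as `prize_amount` are repeated on every laureate row, aggregations over those columns should deduplicate by prize first or weight by `portion`.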

---

## DataFrame Creation & Validation

### Laureates Dataset

In [11]:
laureates = normalize_to_dataframe(laureates_json_data)
laureates.to_csv('laureates.csv')
laureates

| | laureate_id | known_name | gender | birth_date | born_city | born_country | born_country_now | continent | death_date |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 745 | A. Michael Spence | male | 1943-00-00 | Montclair, NJ | USA | USA | North America | |
| 1 | 102 | Aage N. Bohr | male | 1922-06-19 | Copenhagen | Denmark | Denmark | Europe | 2009-09-08 |
| 2 | 779 | Aaron Ciechanover | male | 1947-10-01 | Haifa | British Protectorate of Palestine | Israel | Asia | |
| 3 | 259 | Aaron Klug | male | 1926-08-11 | Zelvas | Lithuania | Lithuania | Europe | 2018-11-20 |
| 4 | 1004 | Abdulrazak Gurnah | male | 1948-00-00 | | | | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 999 | 826 | Yoichiro Nambu | male | 1921-01-18 | Tokyo | Japan | Japan | Asia | 2015-07-05 |
| 1000 | 927 | Yoshinori Ohsumi | male | 1945-02-09 | Fukuoka | Japan | Japan | Asia | |
| 1001 | 265 | Yuan T. Lee | male | 1936-11-19 | Hsinchu | Taiwan | Taiwan | Asia | |
| 1002 | 794 | Yves Chauvin | male | 1930-10-10 | Menin | Belgium | Belgium | Europe | 2015-01-27 |


**Output:** Clean tabular dataset with laureate biographical information.

### Nobel Prizes Dataset 

In [12]:
nobel_prizes = normalize_to_dataframe(nobel_prizes_json_data)
nobel_prizes.to_csv('nobel_prizes.csv')
nobel_prizes

| | year | category | date_awarded | prize_amount | prize_amount_adjusted | top_motivation | laureate_id | motivation | portion |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1901 | Chemistry | 1901-11-12 | 150782 | 10833458 | | 160 | in recognition of the extraordinary services h... | 1 |
| 1 | 1901 | Literature | 1901-11-14 | 150782 | 10833458 | | 569 | in special recognition of his poetic compositi... | 1 |
| 2 | 1901 | Peace | 1901-12-10 | 150782 | 10833458 | | 462 | for his humanitarian efforts to help wounded s... | 1/2 |
| 3 | 1901 | Peace | 1901-12-10 | 150782 | 10833458 | | 463 | for his lifelong work for international peace ... | 1/2 |
| 4 | 1901 | Physics | 1901-11-12 | 150782 | 10833458 | | 1 | in recognition of the extraordinary services h... | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1056 | 2024 | Peace | 2024-10-11 | 11000000 | 11000000 | | 1043 | for its efforts to achieve a world free of nuc... | 1 |
| 1057 | 2024 | Physics | 2024-10-08 | 11000000 | 11000000 | | 1037 | for foundational discoveries and inventions th... | 1/2 |
| 1058 | 2024 | Physics | 2024-10-08 | 11000000 | 11000000 | | 1038 | for foundational discoveries and inventions th... | 1/2 |
| 1059 | 2024 | Physiology or Medicine | 2024-10-07 | 11000000 | 11000000 | | 1035 | for the discovery of microRNA and its role in ... | 1/2 |


**Output:** Normalized prize dataset with laureate relationships preserved.
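
Since both frames carry `laureate_id`, the relationship can be resolved with a join. The sketch below is an illustration on toy frames rather than part of the notebook: `laureates_demo` and `prizes_demo` are hypothetical, the names are placeholders, and only the column names mirror the outputs above.

```python
import pandas as pd

# Toy frames with the same column names as the notebook's outputs; names are placeholders.
laureates_demo = pd.DataFrame({
    "laureate_id": ["1", "462", "463"],
    "known_name": ["Laureate A", "Laureate B", "Laureate C"],
})
prizes_demo = pd.DataFrame({
    "year": ["1901", "1901", "1901"],
    "category": ["Physics", "Peace", "Peace"],
    "laureate_id": ["1", "462", "463"],
    "portion": ["1", "1/2", "1/2"],
})

# A left join keeps every prize row, including any without a matching laureate.
# validate="many_to_one" raises if laureate_id is not unique on the laureates side.
joined = prizes_demo.merge(
    laureates_demo, on="laureate_id", how="left", validate="many_to_one"
)
print(joined[["year", "category", "known_name", "portion"]])
```

The key columns must share a type for the join to match. If the frames are reloaded from the CSV files, pandas may infer numeric types for `laureate_id`, so the keys might need casting to a common type first.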

### Data Quality Check

In [13]:
nobel_prizes.loc[nobel_prizes['laureate_id'].isna()].head()

| | year | category | date_awarded | prize_amount | prize_amount_adjusted | top_motivation | laureate_id | motivation | portion |
|---|---|---|---|---|---|---|---|---|---|
| 80 | 1914 | Literature | | 146900 | 8930767 | No Nobel Prize was awarded this year. The priz... | | | |
| 81 | 1914 | Peace | | 146900 | 8930767 | No Nobel Prize was awarded this year. The priz... | | | |
| 86 | 1915 | Peace | 1915-10-01 | 149223 | 7862394 | No Nobel Prize was awarded this year. The priz... | | | |
| 89 | 1915 | Physiology or Medicine | | 149223 | 7862394 | No Nobel Prize was awarded this year. The priz... | | | |
| 90 | 1916 | Chemistry | 1916-10-01 | 131793 | 6127082 | No Nobel Prize was awarded this year. The priz... | | | |


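**Purpose:** Identifying prize entries with no laureates. As the `top_motivation` text shows, these are years in which the prize was not awarded; organizational winners do have a `laureate_id` (for example, the 2024 Peace row above).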

---

## Technical Achievements

### **API Integration**
- RESTful API consumption with error handling
- Metadata-driven data sizing
- Efficient single-call data extraction

### **Data Normalization**  
- JSONPath-style navigation system
- Nested structure flattening
- Safe value extraction with default fallbacks

### **Data Modeling**
- One-to-many relationship handling (one prize, multiple laureates)
- Polymorphic data processing

### **Production-Ready Code**
- Error handling
- Consistent null value strategy  
- Pandas-compatible output (type conversion is left as a next step)

---

## Next Steps & Extensions

**Potential Enhancements:**
- Add data type conversion and validation
- Implement incremental data updates
- Create automated data quality checks
- Add visualization layer for insights
- Implement caching for API responses
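
As an illustration of the first enhancement, the sketch below shows one way to handle the partial dates seen in the data (for example `'1943-00-00'`, where only the year is known). It is a minimal sketch under stated assumptions, not part of the notebook: `parse_partial_date` is a hypothetical helper, and it assumes dates arrive as `YYYY-MM-DD` strings with `00` marking an unknown month or day.

```python
from datetime import date
from typing import Optional, Tuple


def parse_partial_date(value) -> Tuple[Optional[int], Optional[date]]:
    """Split an API date string into (year, full_date).

    full_date is None when the month or day is the placeholder "00"
    or the string is otherwise not a valid calendar date.
    """
    if not isinstance(value, str):
        return None, None  # covers NaN / None for missing dates
    parts = value.split("-")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        return None, None
    year, month, day = (int(p) for p in parts)
    try:
        return year, date(year, month, day)
    except ValueError:
        return year, None


print(parse_partial_date("1943-00-00"))  # (1943, None)
print(parse_partial_date("1922-06-19"))  # (1922, datetime.date(1922, 6, 19))
```

Keeping the year separately preserves the information that is known while leaving the full date empty, instead of inventing a month and day.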

---

## 🏷️ Tags
`#DataEngineering` `#APIIntegration` `#DataNormalization` `#DataModeling` `#Python` `#Pandas` `#ETL` `#NobelPrize`