# Week 1: Data Familiarization and Setup

This notebook covers the Week 1 tasks from our project plan:
- Explore SimpleMaps World Cities dataset
- Configure and test Google Air Quality API
- Document data sources
- Define research question within data lifecycle model

In [1]:
import pandas as pd
import numpy as np
import requests
import json
import matplotlib.pyplot as plt
import seaborn as sns

## Task 1: SimpleMaps World Cities Dataset Exploration

First we need to download and explore the SimpleMaps dataset to understand its structure and coverage.

In [2]:
# Load the SimpleMaps World Cities dataset
# URL: https://simplemaps.com/data/world-cities
cities_df = pd.read_csv('data/worldcities.csv')

print(f"Successfully loaded {len(cities_df)} cities from SimpleMaps dataset")
print(f"Dataset shape: {cities_df.shape}")
print(f"Columns: {cities_df.columns.tolist()}")

Successfully loaded 48059 cities from SimpleMaps dataset
Dataset shape: (48059, 11)
Columns: ['city', 'city_ascii', 'lat', 'lng', 'country', 'iso2', 'iso3', 'admin_name', 'capital', 'population', 'id']


In [3]:
# Explore the dataset structure
print("First 5 rows:")
print(cities_df.head())

print("\nDataset info:")
cities_df.info()

print("\nBasic statistics:")
cities_df.describe()

First 5 rows:
        city city_ascii      lat       lng    country iso2 iso3   admin_name  \
0      Tokyo      Tokyo  35.6870  139.7495      Japan   JP  JPN        Tōkyō   
1    Jakarta    Jakarta  -6.1750  106.8275  Indonesia   ID  IDN      Jakarta   
2      Delhi      Delhi  28.6100   77.2300      India   IN  IND        Delhi   
3  Guangzhou  Guangzhou  23.1300  113.2600      China   CN  CHN    Guangdong   
4     Mumbai     Mumbai  19.0761   72.8775      India   IN  IND  Mahārāshtra   

   capital  population          id  
0  primary  37785000.0  1392685764  
1  primary  33756000.0  1360771077  
2    admin  32226000.0  1356872604  
3    admin  26940000.0  1156237133  
4    admin  24973000.0  1356226629  

Dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48059 entries, 0 to 48058
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   city        48059 non-null  object 
 1   city_ascii  48057 non-null  obj

                lat         lng  population            id
count       48059.0     48059.0     47808.0       48059.0
mean      25.391265   16.247459    107856.6  1447192000.0
std       22.982203   70.460025    685511.1   261187700.0
min        -54.9333      -179.6         0.0  1004003000.0
25%        12.19275    -44.1564     12191.0  1250540000.0
50%         30.9347     14.4803     20913.5  1380337000.0
75%         42.6189    77.08095     46808.5  1704000000.0
max         81.7166    179.3667  37785000.0  1934976000.0


In [4]:
# Check missing values and data quality
print("Missing values per column:")
print(cities_df.isnull().sum())

print("\nUnique countries:")
print(f"Total countries: {cities_df['country'].nunique()}")
print(cities_df['country'].value_counts().head(10))

print("\nPopulation statistics:")
print(f"Cities with population data: {cities_df['population'].notna().sum()}")
print(f"Cities without population data: {cities_df['population'].isna().sum()}")

Missing values per column:
city              0
city_ascii        2
lat               0
lng               0
country           0
iso2             33
iso3              0
admin_name      201
capital       32921
population      251
id                0
dtype: int64

Unique countries:
Total countries: 242
country
India             7108
United States     5344
Brazil            2961
Germany           1759
China             1732
Philippines       1584
United Kingdom    1365
Italy             1357
Japan             1344
France            1160
Name: count, dtype: int64

Population statistics:
Cities with population data: 47808
Cities without population data: 251
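
The counts above suggest a small cleaning step before any API calls, since later tasks only need rows that have a population value and usable coordinates. A minimal sketch, assuming a hypothetical helper `clean_cities` and a tiny stand-in DataFrame with made-up values (not the real file):

```python
import pandas as pd

def clean_cities(df: pd.DataFrame) -> pd.DataFrame:
    """Keep rows with a population value and coordinates inside valid ranges."""
    valid_lat = df["lat"].between(-90, 90)
    valid_lng = df["lng"].between(-180, 180)
    has_pop = df["population"].notna()
    return df[valid_lat & valid_lng & has_pop].reset_index(drop=True)

# Tiny stand-in for cities_df; the values are illustrative only
sample = pd.DataFrame({
    "city": ["A", "B", "C"],
    "lat": [10.0, 95.0, -20.0],      # 95.0 is outside the valid latitude range
    "lng": [20.0, 30.0, 40.0],
    "population": [1000.0, 2000.0, None],  # city C has no population value
})
cleaned = clean_cities(sample)
print(cleaned)
```

Applied to `cities_df`, the same function would drop the 251 rows without population data; whether any coordinates fall out of range has not been checked here.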


In [None]:
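# Select the 500 most populous cities (rows without population data are dropped first)
cities_with_pop = cities_df.dropna(subset=['population'])
top_500_cities = cities_with_pop.nlargest(500, 'population')

print("Top 500 cities by population:")
print(f"Population range: {top_500_cities['population'].min():,.0f} to {top_500_cities['population'].max():,.0f}")

print("\nTop 10 most populous cities:")
print(top_500_cities[['city', 'country', 'population']].head(10))

# Save for later use in the API queries
top_500_cities.to_csv('data/top_500_cities.csv', index=False)
print("\nSaved top 500 cities to data/top_500_cities.csv")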

Top 500 cities by population:
Population range: 1,543,000 to 37,785,000

Top 10 most populous cities:
          city       country  population
0        Tokyo         Japan  37785000.0
1      Jakarta     Indonesia  33756000.0
2        Delhi         India  32226000.0
3    Guangzhou         China  26940000.0
4       Mumbai         India  24973000.0
5       Manila   Philippines  24922000.0
6     Shanghai         China  24073000.0
7    São Paulo        Brazil  23086000.0
8        Seoul  Korea, South  23016000.0
9  Mexico City        Mexico  21804000.0

Saved top 500 cities to data/top_500_cities.csv


## Task 2: Google Maps Air Quality API Configuration

Test the Google Air Quality API with sample queries to understand response format and functionality.

In [12]:
# API configuration - imported from config.py
from config import API_KEY, BASE_URL

def test_api_connection():
    """Confirm the API settings loaded from config.py (no network request is made)."""
    print(f"API Base URL: {BASE_URL}")
    print(f"API Key configured: {API_KEY[:10]}...")
    return True

if test_api_connection():
    print("API configuration complete")

API Base URL: https://airquality.googleapis.com/v1/
API Key configured: AIzaSyAeap...
API configuration complete


In [None]:
# Sample query function for current air quality
def get_current_air_quality(lat, lon, api_key):
    """Get current air quality for given coordinates"""
    url = f"{BASE_URL}currentConditions:lookup"
    params = {
        "key": api_key
    }
    data = {
        "location": {
            "latitude": lat,
            "longitude": lon
        }
    }
    
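    try:
        # A timeout keeps a stalled request from hanging the notebook
        response = requests.post(url, params=params, json=data, timeout=30)
        return response.json()
    except (requests.RequestException, ValueError) as e:
        # ValueError covers a response body that is not valid JSON
        return {"error": str(e)}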

# Test with sample coordinates (Pittsburgh)
print("Testing API with Pittsburgh coordinates...")
sample_result = get_current_air_quality(40.4387, -79.9972, API_KEY)

if 'error' in sample_result:
    print(f"API Error: {sample_result['error']}")
else:
    print("API Response received successfully!")
    print(f"Response keys: {list(sample_result.keys())}")
    
    # Display key air quality information if available
    if 'indexes' in sample_result:
        for index in sample_result['indexes']:
            print(f"Index: {index.get('displayName', 'Unknown')}")
            print(f"AQI: {index.get('aqi', 'N/A')}")
            print(f"Category: {index.get('category', 'N/A')}")

Testing API with Pittsburgh coordinates...
API Response received successfully!
Response keys: ['dateTime', 'regionCode', 'indexes']
Index: Universal AQI
AQI: 58
Category: Moderate air quality
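
To store results for many cities, each response can be flattened into tabular records. A minimal sketch, assuming a hypothetical helper `flatten_air_quality`; the stand-in response uses the keys seen in the output above, but its `dateTime` and `regionCode` values are made up:

```python
def flatten_air_quality(response: dict, city: str) -> list[dict]:
    """Turn one API response into one flat record per reported index."""
    records = []
    for index in response.get("indexes", []):
        records.append({
            "city": city,
            "date_time": response.get("dateTime"),
            "region_code": response.get("regionCode"),
            "index_name": index.get("displayName"),
            "aqi": index.get("aqi"),
            "category": index.get("category"),
        })
    return records

# Stand-in response; dateTime and regionCode are illustrative values
sample_response = {
    "dateTime": "2025-01-01T00:00:00Z",
    "regionCode": "us",
    "indexes": [
        {"displayName": "Universal AQI", "aqi": 58, "category": "Moderate air quality"}
    ],
}
rows = flatten_air_quality(sample_response, "Pittsburgh")
print(rows)
```

A list of such records can be passed straight to `pd.DataFrame(rows)`; a response without `indexes` (for example an error payload) yields an empty list.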


In [16]:
# Sample query function for historical data
def get_historical_air_quality(lat, lon, start_time, end_time, api_key):
    """Get historical air quality data"""
    url = f"{BASE_URL}history:lookup"
    params = {
        "key": api_key
    }
    data = {
        "location": {
            "latitude": lat,
            "longitude": lon
        },
        "period": {
            "startTime": start_time,
            "endTime": end_time
        }
    }
    
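    try:
        # A timeout keeps a stalled request from hanging the notebook
        response = requests.post(url, params=params, json=data, timeout=30)
        return response.json()
    except (requests.RequestException, ValueError) as e:
        # ValueError covers a response body that is not valid JSON
        return {"error": str(e)}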

print("function defined")

function defined
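
The historical function has not been called yet, and it needs `start_time` and `end_time` strings. A minimal sketch of building them, assuming the API expects UTC timestamps in RFC 3339 form (for example `2025-01-01T15:00:00Z`) rounded to the hour; `make_period` is a hypothetical helper, and the format should be checked against the API reference before use:

```python
from datetime import datetime, timedelta, timezone

def make_period(end: datetime, hours_back: int) -> tuple[str, str]:
    """Return (start_time, end_time) as UTC strings, rounded down to the hour.

    `end` should be timezone-aware; a naive datetime would be read as local time.
    """
    end_utc = end.astimezone(timezone.utc).replace(minute=0, second=0, microsecond=0)
    start_utc = end_utc - timedelta(hours=hours_back)
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return start_utc.strftime(fmt), end_utc.strftime(fmt)

# Fixed example time so the result is reproducible
start_time, end_time = make_period(
    datetime(2025, 1, 2, 15, 30, tzinfo=timezone.utc), hours_back=24
)
print(start_time, end_time)
```

The two strings can then be passed as the `start_time` and `end_time` arguments of `get_historical_air_quality`.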


## Task 3: Data Sources Documentation

Document key characteristics, formats, and limitations of both datasets.

### SimpleMaps World Cities Database Documentation

Source: https://simplemaps.com/data/world-cities
Format: CSV
Size: 48,059 cities and 11 columns in the downloaded file (see Task 1)
Last Updated: May 11, 2025

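Key Fields:
- city: City name
- city_ascii: City name in ASCII characters
- lat: Latitude
- lng: Longitude
- country: Country name
- iso2: ISO 2-letter country code
- iso3: ISO 3-letter country code
- admin_name: Name of the administrative region (e.g., Guangdong)
- capital: Capital status (e.g., primary, admin); empty for 32,921 rows
- population: Population count
- id: Numeric city identifier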

Limitations:
- Static population snapshot (no historical data)
- Free version has limited fields
- Population data may not be current for all cities and is missing for 251 of them
- No city area data for density calculations

### Google Maps Air Quality API Documentation

Source: https://developers.google.com/maps/documentation/air-quality/overview
Format: JSON responses from REST API
Access: Free tier with $300 worth of credits

Available Data:
- Universal Air Quality Index (AQI)
- Pollutant concentrations (PM2.5, PM10, O3, NO2, etc.)
- Weather data (temperature, humidity): not present in our test response; to be confirmed against the API documentation
- Current conditions and historical data
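
The test response in Task 2 contained only `indexes`. As far as we understand the API documentation, pollutant concentrations have to be requested explicitly through an `extraComputations` field in the request body. A minimal sketch, assuming a hypothetical helper `build_request_body`; the enum value `POLLUTANT_CONCENTRATION` is our reading of the docs and should be verified before use:

```python
def build_request_body(lat: float, lon: float, include_pollutants: bool = False) -> dict:
    """Build the JSON body for a currentConditions:lookup request."""
    body = {"location": {"latitude": lat, "longitude": lon}}
    if include_pollutants:
        # Assumed enum value; check the API reference before relying on it
        body["extraComputations"] = ["POLLUTANT_CONCENTRATION"]
    return body

basic_body = build_request_body(40.4387, -79.9972)
pollutant_body = build_request_body(40.4387, -79.9972, include_pollutants=True)
print(pollutant_body)
```

The returned dictionary would replace the `data` dictionary built inside `get_current_air_quality`.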

Limitations:
- Usage quotas apply, and requests are billed against the $300 of free credits
- Historical data availability varies by location
- Costs may increase with heavy usage
- Single coordinate per city may not capture full urban area
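
Given the credit and quota limits, a collection loop should cap the number of requests and pause between calls. A minimal sketch, assuming a hypothetical helper `collect_for_cities` that takes the fetch function as an argument; `fake_fetch` stands in for the real API call so the example runs without a key, and the delay and cap values are placeholders rather than documented limits:

```python
import time
from typing import Callable

def collect_for_cities(
    cities: list[dict],
    fetch: Callable[[float, float], dict],
    delay_seconds: float = 0.0,
    max_requests: int = 500,
) -> list[dict]:
    """Call `fetch` once per city, stopping after `max_requests` calls."""
    results = []
    for i, city in enumerate(cities):
        if i >= max_requests:
            break  # stay within the request budget
        response = fetch(city["lat"], city["lng"])
        results.append({"city": city["city"], "response": response})
        time.sleep(delay_seconds)  # simple pause between requests
    return results

def fake_fetch(lat: float, lng: float) -> dict:
    """Stand-in for the real API call; returns a fixed, made-up payload."""
    return {"indexes": [{"aqi": 50}]}

# Coordinates taken from the first rows of the dataset shown in Task 1
cities = [
    {"city": "Tokyo", "lat": 35.6870, "lng": 139.7495},
    {"city": "Jakarta", "lat": -6.1750, "lng": 106.8275},
    {"city": "Delhi", "lat": 28.6100, "lng": 77.2300},
]
collected = collect_for_cities(cities, fake_fetch, max_requests=2)
print([item["city"] for item in collected])
```

With the real API, `fetch` could be `lambda lat, lng: get_current_air_quality(lat, lng, API_KEY)`, and `cities` could come from `top_500_cities.to_dict("records")`.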

## Task 4: Research Question and Data Lifecycle Model

Define how our research fits into the data lifecycle model from class.

## Week 1 Summary and Next Steps

Completed:
- SimpleMaps dataset download and exploration (48,059 cities; top 500 by population saved to data/top_500_cities.csv)
- Google API key setup and a live test of the current-conditions endpoint (Pittsburgh)
- Function template for historical queries
- Data source documentation
- Research question definition within data lifecycle model

Pending:
- Live test of the historical data function
- Sample data collection to validate approach

Week 2 Preparation:
- Set up folder structure for data pipeline
- Test the historical API function with real data
- Collect sample air quality data for a subset of the saved cities