# Extracting `DAILY` data from API and Loading it to DB 

The purpose of our pipeline is to make weather data available for comparison with flights and airports data. As a first step, we load the weather data in raw form (JSON) into our database, so that in later steps we can transform it into meaningful and useful tables.

### General Presteps:

The goal of this notebook is to get the raw JSON weather data (daily here, hourly in the next step) for 3 airport weather stations and load it as is into our database.
- Find the Station IDs for the **defined** airports
- Define the start and end of the period
- Get the API key from the `.env` file

### Imports

We will need the credentials we saved in the `.env` file, as well as SQLAlchemy and its functions.

In [41]:
# we will need the credentials we saved in the .env file
from dotenv import dotenv_values

# We also will need SQLAlchemy and its functions
from sqlalchemy import create_engine, types
from sqlalchemy.dialects.postgresql import JSON as postgres_json

import pandas as pd

# requests library will make the API calls. 
# the json package will parse the JSON string and convert it to Python data structures
import requests
import json

# with 'datetime' we record the timestamp of each API call, as a reference for how current the data is
# and 'time' lets us slow down a bit between calls
from datetime import datetime
import time

### Defining Airports and finding the Station IDs

To find the Station IDs for the airports without stressing our API call limits, we will use the search option of **https://meteostat.net/**.

Search for the names of the airports there and note their Station IDs.

In [42]:
# MIA, MCO, TPA
airport_staids = {
    'MIA': 72202,  # Miami
    'MCO': 72205,  # Orlando
    'TPA': 72211,  # Tampa
}

### Defining the period

The weather period should match the period of the flight data we want to compare it with. The cells below query the Meteostat JSON API for 2017-08-30 until 2017-09-30.

In [43]:
period_start = "2017-08-30"
period_end = "2017-09-30"
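
Before spending API calls, it can help to check how many daily records the chosen period should produce. The following is a minimal sketch, assuming the API returns one record per station and day with both dates included; the helper name `days_in_period` is made up for this illustration.

```python
from datetime import date

period_start = "2017-08-30"
period_end = "2017-09-30"


def days_in_period(start: str, end: str) -> int:
    """Number of calendar days from start to end, both included."""
    start_day = date.fromisoformat(start)
    end_day = date.fromisoformat(end)
    if end_day < start_day:
        raise ValueError("period_end lies before period_start")
    return (end_day - start_day).days + 1


expected_days = days_in_period(period_start, period_end)
print(expected_days)      # 32
print(3 * expected_days)  # 96 rows, if all 3 stations return every day
```

Comparing this number with the row count of the flattened data is a quick way to spot missing days.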

### Loading the API Key

In [44]:
# getting API and DB credentials - Alternative 1: dotenv_values()

config = dotenv_values()
api_key = config['X-RapidAPI_Key'] # align the key label with your .env file


# Part 1: Daily Station Data

### Objectives -  Daily Station Data:

### Test: For-loop generating the querystrings

In [45]:
# testing for-loop: querystring for each airport

for airport in airport_staids:
   
    querystring = {
        "station":airport_staids[airport]
        ,"start":period_start
        ,"end":period_end
        ,"model":"true"
    }
    print(airport, "\n", querystring)

MIA 
 {'station': 72202, 'start': '2017-08-30', 'end': '2017-09-30', 'model': 'true'}
MCO 
 {'station': 72205, 'start': '2017-08-30', 'end': '2017-09-30', 'model': 'true'}
TPA 
 {'station': 72211, 'start': '2017-08-30', 'end': '2017-09-30', 'model': 'true'}


**[what is the 'model' parameter?](https://dev.meteostat.net/api/stations/daily.html#parameters)**   
-> Substitute missing records with statistically optimized model data

### API CALL daily (per station)

In [46]:
#  let's catch each response in a dictionary. create an empty dictionary with the following keys:

weather_dict = {'extracted_at':[], 
                'airport_code':[], 
                'station_id':[], 
                'extracted_data':[]
               }

# API CALL daily (station) - for the syntax: see the rapidapi interface

url = "https://meteostat.p.rapidapi.com/stations/daily"

headers = {                   # the headers dictionary sends the API key and host as part of the HTTPS request
        "X-RapidAPI-Key": api_key,
        "X-RapidAPI-Host": "meteostat.p.rapidapi.com"
}

# for-loop for the querystrings
for airport in airport_staids:
   
    querystring = {
        "station":airport_staids[airport]
        ,"start":period_start
        ,"end":period_end
        ,"model":"true"
    }
    
    # making one call with the current querystring
    response = requests.get(url, headers=headers, params=querystring)  # params passes the query parameters to the API as a dictionary
                
    # appending data to the dictionary:
    weather_dict['extracted_at'].append(datetime.now())                # timestamp of the call
    weather_dict['airport_code'].append(airport)                       # airport code
    weather_dict['station_id'].append(airport_staids[airport])         # weather Station ID
    weather_dict['extracted_data'].append(json.loads(response.text))   # JSON string parsed into Python structures
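
The imports mention `time` for slowing down a bit, but the loop above never pauses. Below is a rough sketch of how a pause between calls could look. The names `collect_responses` and `fake_fetch` are hypothetical, and `fake_fetch` only stands in for the real API call so the sketch runs offline. In the notebook, `fetch` could wrap `requests.get(...)`, call `response.raise_for_status()` to stop on HTTP errors, and return `response.json()`.

```python
import time
from datetime import datetime


def collect_responses(stations, fetch, pause_seconds=0.5):
    """Call fetch(station_id) once per station, pausing between the calls."""
    collected = {"extracted_at": [], "airport_code": [],
                 "station_id": [], "extracted_data": []}
    for position, (airport, station_id) in enumerate(stations.items()):
        if position > 0:
            time.sleep(pause_seconds)  # slow down a bit between calls
        payload = fetch(station_id)
        collected["extracted_at"].append(datetime.now())
        collected["airport_code"].append(airport)
        collected["station_id"].append(station_id)
        collected["extracted_data"].append(payload)
    return collected


def fake_fetch(station_id):
    """Stand-in for the real API call (assumption, not the Meteostat API)."""
    return {"meta": {}, "data": [{"station": station_id}]}


demo = collect_responses({"MIA": 72202, "MCO": 72205}, fake_fetch, pause_seconds=0.1)
print(demo["airport_code"], demo["station_id"])
```

Passing the call in as a function keeps the loop testable without using up API calls.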

#### Check out the filled dictionary

In [48]:
weather_dict

{'extracted_at': [datetime.datetime(2025, 2, 13, 22, 48, 48, 867412),
  datetime.datetime(2025, 2, 13, 22, 48, 49, 523166),
  datetime.datetime(2025, 2, 13, 22, 48, 49, 821475)],
 'airport_code': ['MIA', 'MCO', 'TPA'],
 'station_id': [72202, 72205, 72211],
 'extracted_data': [{'meta': {'generated': '2025-02-12 16:36:14'},
   'data': [{'date': '2017-08-30 00:00:00',
     'tavg': 30.6,
     'tmin': 27.2,
     'tmax': 34.4,
     'prcp': 0.0,
     'snow': None,
     'wdir': None,
     'wspd': 9.7,
     'wpgt': None,
     'pres': 1015.8,
     'tsun': 562},
    {'date': '2017-08-31 00:00:00',
     'tavg': 31.1,
     'tmin': 28.9,
     'tmax': 33.9,
     'prcp': 0.0,
     'snow': None,
     'wdir': 112.0,
     'wspd': 14.0,
     'wpgt': None,
     'pres': 1016.7,
     'tsun': 487},
    {'date': '2017-09-01 00:00:00',
     'tavg': 30.9,
     'tmin': 26.1,
     'tmax': 33.9,
     'prcp': 11.4,
     'snow': None,
     'wdir': 117.0,
     'wspd': 12.2,
     'wpgt': None,
     'pres': 1016.5,
    

### Make it a dataframe

This is our raw data, which we can now load into the database.

In [49]:
weather_daily_df = pd.DataFrame(weather_dict)
weather_daily_df
data = weather_daily_df

### SIDEBAR: For the curious and sceptics...

    In case you can't resist knowing what the data looks like when flattened,
    here is a preview with pandas. BUT we are not transforming before loading in our pipeline just yet.
    We Extract and Load the raw JSON.

In [51]:
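# flatten the raw JSON for a preview (not part of the pipeline)

weather_daily_df = pd.DataFrame(data)

# keep only the list of daily records from each response
weather_daily_df['data'] = weather_daily_df['extracted_data'].apply(lambda x: x['data'])

# one row per daily record
exploded_df = weather_daily_df.explode('data')

# turn each record (a dictionary) into columns
normalized_data = pd.json_normalize(exploded_df['data'])

# put the metadata columns and the weather columns side by side
final_df = pd.concat([exploded_df.drop(columns=['data', 'extracted_data']).reset_index(drop=True), normalized_data], axis=1)
print(final_df)
final_df.info()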


                 extracted_at airport_code  station_id                 date  \
0  2025-02-13 22:48:48.867412          MIA       72202  2017-08-30 00:00:00   
1  2025-02-13 22:48:48.867412          MIA       72202  2017-08-31 00:00:00   
2  2025-02-13 22:48:48.867412          MIA       72202  2017-09-01 00:00:00   
3  2025-02-13 22:48:48.867412          MIA       72202  2017-09-02 00:00:00   
4  2025-02-13 22:48:48.867412          MIA       72202  2017-09-03 00:00:00   
..                        ...          ...         ...                  ...   
91 2025-02-13 22:48:49.821475          TPA       72211  2017-09-26 00:00:00   
92 2025-02-13 22:48:49.821475          TPA       72211  2017-09-27 00:00:00   
93 2025-02-13 22:48:49.821475          TPA       72211  2017-09-28 00:00:00   
94 2025-02-13 22:48:49.821475          TPA       72211  2017-09-29 00:00:00   
95 2025-02-13 22:48:49.821475          TPA       72211  2017-09-30 00:00:00   

    tavg  tmin  tmax  prcp  snow   wdir  wspd  wpgt

> #### Note: we only used up 3 API calls per attempt
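
To make the flattening from the sidebar more concrete, here is a small sketch in plain Python that mirrors the idea on a tiny sample: each daily record becomes one row, and the airport code and Station ID are repeated on every row. The sample values are copied from the MIA response shown earlier, shortened to two days and three fields; the variable names are only for illustration.

```python
sample = {
    "airport_code": ["MIA"],
    "station_id": [72202],
    "extracted_data": [{
        "meta": {"generated": "2025-02-12 16:36:14"},
        "data": [
            {"date": "2017-08-30 00:00:00", "tavg": 30.6, "prcp": 0.0},
            {"date": "2017-08-31 00:00:00", "tavg": 31.1, "prcp": 0.0},
        ],
    }],
}

flat_rows = []
for airport, station, payload in zip(sample["airport_code"],
                                     sample["station_id"],
                                     sample["extracted_data"]):
    # one output row per daily record, with the station info repeated
    for record in payload["data"]:
        flat_rows.append({"airport_code": airport, "station_id": station, **record})

for row in flat_rows:
    print(row)
```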

### Loading the data into the DB

Everything we need to create a table in our schema in the database is now part of the `weather_daily_df` dataframe.  
We can use pandas' ability to work with SQLAlchemy and "save" the data to the DB using the `.to_sql()` method.

In [52]:
# getting API and DB credentials - Alternative 1: dotenv_values()

config = dotenv_values()
 
pg_user = config['POSTGRES_USER'] # align the key labels with your .env file
pg_host = config['POSTGRES_HOST']
pg_port = config['POSTGRES_PORT']
pg_db = config['POSTGRES_DB']
pg_schema = config['POSTGRES_SCHEMA']
pg_pass = config['POSTGRES_PASS']

In [53]:
pg_schema


's_martinvolman'

In [54]:
# updating the url
url = f'postgresql://{pg_user}:{pg_pass}@{pg_host}:{pg_port}/{pg_db}'

# creating the engine
engine = create_engine(url, echo=False)
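
One detail worth knowing about the connection URL: if the password contains characters such as `@` or `/`, the plain f-string can produce a URL that is parsed incorrectly. A possible workaround, sketched here with made-up placeholder credentials and a hypothetical helper `build_pg_url`, is to URL-encode the user and password first.

```python
from urllib.parse import quote_plus


def build_pg_url(user, password, host, port, database):
    """Build a PostgreSQL URL with user and password URL-encoded."""
    return (f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
            f"@{host}:{port}/{database}")


# placeholder values, not real credentials
demo_url = build_pg_url("demo_user", "p@ss/word", "localhost", "5432", "demo_db")
print(demo_url)  # postgresql://demo_user:p%40ss%2Fword@localhost:5432/demo_db
```

The resulting string can be passed to `create_engine()` in the same way as the `url` above.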

In [55]:
engine.url # checking the url (pass is hidden)

postgresql://martinvolman:***@data-analytics-course-2.c8g8r1deus2v.eu-central-1.rds.amazonaws.com:5432/hh_analytics_24_4

In [56]:
# defining data types for the DB (columns of the flattened daily data)

dtype_dict = {
    'extracted_at': types.DateTime,
    'airport_code': types.String,
    'station_id': types.Integer,
    'date': types.String,
    'tavg': types.Float,
    'tmin': types.Float,
    'tmax': types.Float,
    'prcp': types.Float,
    'snow': types.String,
    'wdir': types.Float,
    'wspd': types.Float,
    'wpgt': types.String,
    'pres': types.Float,
    'tsun': types.String,
}

In [None]:
# #check if the dataframe is empty
# for airport, df in dfs.items():
#     print(f"Checking DataFrame for {airport}:")
#     print(df.shape)  # Check row and column count
#     print(df.head())  # Check the first few rows

#     if df.shape[0] == 0:
#         print(f"⚠️ The DataFrame for {airport} is empty, skipping insert.")
#         continue  # Skip if the DataFrame is empty
    
#     # If the DataFrame is not empty, proceed to insert into the DB
#     table_name = f"weather_daily_raw_{str(airport).lower()}"
#     df.to_sql(name=table_name, 
#               con=engine, 
#               schema=pg_schema, 
#               if_exists='replace',  
#               dtype=dtype_dict, 
#               index=False)

#     print(f"✅ Data for {airport} written to {table_name}")


Checking DataFrame for 0:
(1, 2)
                                                data       meta.generated
0  [{'date': '2017-08-30 00:00:00', 'tavg': 30.6,...  2025-02-12 16:36:14


ProgrammingError: (psycopg2.ProgrammingError) can't adapt type 'dict'
[SQL: INSERT INTO s_martinvolman.weather_daily_raw_0 (data, "meta.generated") VALUES (%(data)s, %(meta_generated)s)]
[parameters: {'data': [{'date': '2017-08-30 00:00:00', 'tavg': 30.6, 'tmin': 27.2, 'tmax': 34.4, 'prcp': 0.0, 'snow': None, 'wdir': None, 'wspd': 9.7, 'wpgt': None, 'pres': ... (5233 characters truncated) ... 0:00:00', 'tavg': 27.8, 'tmin': 25.0, 'tmax': 32.2, 'prcp': 0.0, 'snow': None, 'wdir': None, 'wspd': 7.9, 'wpgt': None, 'pres': 1013.7, 'tsun': None}], 'meta_generated': '2025-02-12 16:36:14'}]
(Background on this error at: https://sqlalche.me/e/20/f405)
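
The `can't adapt type 'dict'` error appears because the database driver does not know how to send a Python dictionary or list as a column value. One possible way around it is to serialize each raw response to a JSON string before loading. The sketch below shows the idea with the standard library only and uses an in-memory SQLite database as a stand-in for Postgres, so it runs without credentials; the table name `weather_daily_raw_demo` is made up. Another route, which the `postgres_json` import at the top hints at, would be to declare the column as a JSON type in the `dtype` argument of `.to_sql()`.

```python
import json
import sqlite3

weather_dict = {
    "airport_code": ["MIA"],
    "station_id": [72202],
    "extracted_data": [{"meta": {"generated": "2025-02-12 16:36:14"},
                        "data": [{"date": "2017-08-30 00:00:00", "tavg": 30.6}]}],
}

# dictionaries become plain strings, which any driver can send
json_strings = [json.dumps(payload) for payload in weather_dict["extracted_data"]]

connection = sqlite3.connect(":memory:")  # stand-in for the Postgres database
connection.execute(
    "CREATE TABLE weather_daily_raw_demo "
    "(airport_code TEXT, station_id INTEGER, extracted_data TEXT)"
)
connection.executemany(
    "INSERT INTO weather_daily_raw_demo VALUES (?, ?, ?)",
    zip(weather_dict["airport_code"], weather_dict["station_id"], json_strings),
)

# reading the string back and parsing it restores the original structure
stored = connection.execute(
    "SELECT extracted_data FROM weather_daily_raw_demo"
).fetchone()[0]
restored = json.loads(stored)
print(restored["data"][0]["tavg"])
```

Storing the response unchanged keeps the Extract and Load steps separate from any later transformation.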

In [None]:
# #1 option with json
# # Assuming 'data' is a column with a list of dictionaries
# for airport, df in dfs.items():
#     # Flatten the 'data' column if it's a list of dictionaries
#     if isinstance(df['data'].iloc[0], list):
#         # If 'data' is a list of dictionaries, normalize it
#         df_data_normalized = pd.json_normalize(df['data'].explode())
#         df = pd.concat([df.drop(columns=['data']), df_data_normalized], axis=1)
    
#     # Generate table name
#     table_name = f"weather_daily_raw_{str(airport).lower()}"
    
#     # Write to SQL
#     df.to_sql(name=table_name, 
#               con=engine, 
#               schema=pg_schema, 
#               if_exists='replace',  
#               dtype=dtype_dict, 
#               index=False)

#     print(f"✅ Data for {airport} written to {table_name}")


✅ Data for 0 written to weather_daily_raw_0
✅ Data for 1 written to weather_daily_raw_1
✅ Data for 2 written to weather_daily_raw_2


In [None]:
final_df.to_sql(name='weather_daily_cleaned',
                con=engine,
                schema=pg_schema,
                if_exists='replace',
                dtype=dtype_dict,
                index=False)

96

If you see '96' as the result of the last cell (3 stations × 32 days), something went right. :)

Check in DBeaver if you see a new table in your Schema. Don't forget to refresh your Schema.

## Now continue with the hourly data.