# Task 1

## Data Model

The data contains three relationships:
- Station-Date is one-to-many
- Date-Climate is one-to-one
- Date-Hotspot is one-to-zero-or-many

The data model is therefore designed as follows:

```
Station Collection:
    Stations = [
        {
            id: station,
            dates: [id of date]
        }
    ]
Dates Collection:
    Dates = [
        {
            id: date,
            climate: {
                // climate attributes
            },
            hotspots: [id of hotspot]
        }
    ]
Hotspots Collection:
    Hotspots = [
        {
            id: id,
            other_attributes: ...
        }
    ]
```
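
To make the references concrete, here is a minimal sketch that builds one or two documents per collection as plain Python dicts and resolves a date's hotspots by id. Every value is made up for illustration, the helper `hotspots_on` is hypothetical, and `_id` is assumed as the key name in place of `id`.

```python
from datetime import datetime

# Illustrative documents only; every value below is made up.
day = datetime(2022, 12, 12)

hotspots = [
    {"_id": "h1", "date": day, "confidence": 90},
    {"_id": "h2", "date": day, "confidence": 75},
]
dates = [
    # A date with two hotspots and a date with none (1-0/many).
    {"_id": day, "climate": {"air_temperature": 19}, "hotspots": ["h1", "h2"]},
    {"_id": datetime(2021, 1, 1), "climate": {"air_temperature": 25}, "hotspots": []},
]
# A station only stores the ids of its dates (1-many).
stations = [{"_id": 1, "dates": [d["_id"] for d in dates]}]


def hotspots_on(date_doc, hotspot_docs):
    """Resolve the hotspot ids stored on a date document to full documents."""
    by_id = {h["_id"]: h for h in hotspot_docs}
    return [by_id[i] for i in date_doc["hotspots"]]


resolved = hotspots_on(dates[0], hotspots)
print([h["confidence"] for h in resolved])
```

The climate is read directly from the date document, while hotspots need one extra lookup by id.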

## Justification
### Benefits
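- Decoupling Station from Date lets a station reference its dates by id instead of embedding them
- Embedding the climate data in Dates gives direct access to the climate for a given date
- Hotspots are decoupled from Dates because of the transitive dependency `datetime -> date`: a hotspot's datetime already determines its date, so hotspots are stored separately and referenced from the date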

### Drawback
- Only the year 2022 contains hotspot data, so `hotspots` is an empty list for many dates


# 2. Querying MongoDB with PyMongo
## 2.1 Read data and create database

### Importing Libraries

In [60]:
import pandas as pd             # pandas version 2.0.1
from pymongo import MongoClient
from datetime import datetime
from bson import ObjectId
from pprint import pprint

### Connecting to Database

In [47]:
client = MongoClient('mongodb://localhost:27017/')
db = client.fit3182_assignment_db
db.stations.drop()
db.dates.drop()
db.hotspots.drop()

### Reading Datasets

In [48]:
climate_historic = pd.read_csv("../dataset/climate_historic.csv")
hotspot_historic = pd.read_csv("../dataset/hotspot_historic.csv")

### Functions for converting date strings to datetime

In [38]:
def raw_date_to_datetime(date: str) -> datetime:
    dd, mm, yyyy = date.split("/")
    return datetime(int(yyyy), int(mm), int(dd))

def raw_datetime_to_datatime(datetime_str: str) -> datetime:
    date, time = datetime_str.split("T")
    yy, mm, dd = date.split("-")
    h, m, s = time.split(":")
    return datetime(int(yy), int(mm), int(dd), hour=int(h), minute=int(m), second=int(s))
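
As an alternative to splitting the strings by hand, the same conversions can be sketched with `datetime.strptime`. The function names `parse_date` and `parse_datetime` are hypothetical, and the formats `dd/mm/yyyy` and `yyyy-mm-ddThh:mm:ss` are assumed from the splitting logic above.

```python
from datetime import datetime


def parse_date(date_str: str) -> datetime:
    """Parse a 'dd/mm/yyyy' string into a datetime at midnight."""
    return datetime.strptime(date_str.strip(), "%d/%m/%Y")


def parse_datetime(datetime_str: str) -> datetime:
    """Parse a 'yyyy-mm-ddThh:mm:ss' string into a datetime."""
    return datetime.strptime(datetime_str.strip(), "%Y-%m-%dT%H:%M:%S")


print(parse_date("12/12/2022"))
print(parse_datetime("2022-12-15T04:02:50"))
```

`strptime` raises `ValueError` when the input does not match the expected format, which surfaces malformed rows early.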

### Light data wrangling

In [49]:
dates = pd.DataFrame(climate_historic.date).merge(hotspot_historic.date).date           # inner join on "date" (merge's default): keeps only dates present in both datasets
dates = pd.DataFrame(dates.unique())                                                    # drop the duplicate dates produced by the join
dates = list(dates[0].apply(raw_date_to_datetime))                                      # date:str -> date:datetime; Series -> list

climate_historic.date = climate_historic.date.apply(raw_date_to_datetime)               # date:str -> date:datetime
hotspot_historic.date = hotspot_historic.date.apply(raw_date_to_datetime)               # date:str -> date:datetime
hotspot_historic.datetime = hotspot_historic.datetime.apply(raw_datetime_to_datatime)   # datetime:str -> datetime:datetime

### Creating Data Models

In [50]:
# list representing the Station collection
stations_col = [
    {
        "_id": int(station),
        "dates": [
            {"date": date}
            for date in climate_historic[climate_historic.station == station]["date"].array
        ]
    }
    for station in list(climate_historic.station.unique())
]

In [51]:
# list representing the Hotspot collection
hotspots_col = [
    {
        "_id": ObjectId(),              # generated id: "date" and "datetime" values repeat across rows, so neither is unique
        "date": hotspot.date,
        "datetime": hotspot.datetime,
        "lat": hotspot.latitude,
        "lng": hotspot.longitude,
        "confidence": hotspot.confidence,
        "surface_temperature": hotspot.surface_temperature_celcius
    }
    for _, hotspot in hotspot_historic.iterrows()
]

In [52]:
# list representing the Dates collection
dates_col = [
    {
        "_id": date,
        "climate": {    # date-climate is 1-1
            "air_temperature": int(climate_historic[climate_historic.date == date]["air_temperature_celcius"].array[0]),
            "relative_humidity": float(climate_historic[climate_historic.date == date]["relative_humidity"].array[0]),
            "windspeed_knots": float(climate_historic[climate_historic.date == date]["windspeed_knots"].array[0]),
            "max_wind_speed": float(climate_historic[climate_historic.date == date]["max_wind_speed"].array[0]),
            "precipitation": climate_historic[climate_historic.date == date]["precipitation"].array[0],                     # str
            "ghi": int(climate_historic[climate_historic.date == date]["GHI_w/m2"].array[0])
        },
        "hotspots": [   # date-hotspots is 1-0/many
            {"hotspot_id": hotspot}
            for hotspot in pd.DataFrame(hotspots_col)[pd.DataFrame(hotspots_col).date == date]["_id"]
        ]
    }
    for date in dates
]
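
The cell above rebuilds a DataFrame from `hotspots_col` for every date. A possible alternative is to group the hotspot references by date once and look them up afterwards. The sketch below uses a hypothetical helper `group_hotspot_refs` and made-up sample documents in the shape of `hotspots_col`; it has not been run against the actual dataset.

```python
from collections import defaultdict
from datetime import datetime


def group_hotspot_refs(hotspot_docs):
    """Map each date to its list of {"hotspot_id": ...} references."""
    refs_by_date = defaultdict(list)
    for doc in hotspot_docs:
        refs_by_date[doc["date"]].append({"hotspot_id": doc["_id"]})
    return refs_by_date


# Made-up sample in the shape of hotspots_col.
sample = [
    {"_id": "a", "date": datetime(2022, 12, 15)},
    {"_id": "b", "date": datetime(2022, 12, 15)},
    {"_id": "c", "date": datetime(2022, 12, 16)},
]
refs = group_hotspot_refs(sample)
print(refs[datetime(2022, 12, 15)])
# A date without hotspots falls back to an empty list.
print(refs.get(datetime(2021, 1, 1), []))
```

Using `refs.get(date, [])` inside the comprehension would replace the per-date DataFrame filter with a dictionary lookup.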

### Inserting Data into MongoDB

In [53]:
from pymongo.errors import BulkWriteError

try:
    db.stations.insert_many(stations_col)
except BulkWriteError:
    print("Duplicated Keys in stations (Data already inserted)")
else:
    print("Station inserted")

try:
    db.dates.insert_many(dates_col)
except BulkWriteError:
    print("Duplicated Keys in dates (Data already inserted)")
else:
    print("Date inserted")

try:
    db.hotspots.insert_many(hotspots_col)
except BulkWriteError:
    print("Duplicated Keys in hotspots (Data already inserted)")
else:
    print("Hotspot inserted")

Station inserted
Date inserted
Hotspot inserted


## 2.2 Querying the database
### 2.2a Finding the climate data on 12th December 2022

In [62]:
pprint(
    db.dates.find_one(
        {"_id": datetime(2022, 12, 12)},
        {"_id": 1, "climate": 1}
    )
)

{'_id': datetime.datetime(2022, 12, 12, 0, 0),
 'climate': {'air_temperature': 19,
             'ghi': 156,
             'max_wind_speed': 12.0,
             'precipitation': ' 0.00I',
             'relative_humidity': 55.3,
             'windspeed_knots': 6.2}}


### 2.2b Finding hotspot data when surface temperature is between 65 and 100

In [68]:
for hotspot in db.hotspots.find(
    {"surface_temperature": {"$gt": 65, "$lt": 100}},
    {"surface_temperature":1, "lat":1, "lng":1, "confidence":1, "_id":0}
):
    pprint(hotspot)

{'confidence': 78, 'lat': -37.966, 'lng': 145.051, 'surface_temperature': 68}
{'confidence': 86, 'lat': -35.543, 'lng': 143.316, 'surface_temperature': 67}
{'confidence': 93, 'lat': -37.875, 'lng': 142.51, 'surface_temperature': 73}
{'confidence': 95, 'lat': -37.613, 'lng': 149.305, 'surface_temperature': 75}
{'confidence': 90, 'lat': -37.624, 'lng': 149.314, 'surface_temperature': 66}
{'confidence': 93, 'lat': -38.057, 'lng': 144.211, 'surface_temperature': 73}
{'confidence': 92, 'lat': -37.95, 'lng': 142.366, 'surface_temperature': 70}
{'confidence': 100, 'lat': -36.282, 'lng': 146.157, 'surface_temperature': 71}
{'confidence': 100, 'lat': -37.634, 'lng': 149.237, 'surface_temperature': 71}
{'confidence': 98, 'lat': -37.605, 'lng': 149.302, 'surface_temperature': 83}
{'confidence': 99, 'lat': -37.6, 'lng': 149.325, 'surface_temperature': 86}
{'confidence': 95, 'lat': -37.618, 'lng': 149.281, 'surface_temperature': 76}
{'confidence': 100, 'lat': -37.606, 'lng': 149.312, 'surface_tempe

### 2.2c Finding climate and hotspot data on 15th and 16th of December 2022

In [87]:
# Find date, surface temperature (°C), air temperature (°C), relative humidity and max wind speed on 15th and 16th of December 2022.
for date in db.dates.find(
    {"$or": [{"_id": datetime(2022, 12, 15)}, {"_id": datetime(2022, 12, 16)}]}
):
    for hotspot_ref in date.get("hotspots"):
        for hotspot in db.hotspots.find({"_id": hotspot_ref.get("hotspot_id")}):
            print("-------------------------")
            print("Date: " + str(date.get("_id").date()))
            print("Surface Temperature: " + str(hotspot.get("surface_temperature")))
            print("Air Temperature: " + str(date.get("climate").get("air_temperature")))
            print("Relative Humidity: " + str(date.get("climate").get("relative_humidity")))
            print("Max Wind Speed: " + str(date.get("climate").get("max_wind_speed")))
            print()

-------------------------
Date: 2022-12-15
Surface Temperature: 42
Air Temperature: 18
Relative Humidity: 52.0
Max Wind Speed: 14.0

-------------------------
Date: 2022-12-15
Surface Temperature: 36
Air Temperature: 18
Relative Humidity: 52.0
Max Wind Speed: 14.0

-------------------------
Date: 2022-12-15
Surface Temperature: 38
Air Temperature: 18
Relative Humidity: 52.0
Max Wind Speed: 14.0

-------------------------
Date: 2022-12-15
Surface Temperature: 40
Air Temperature: 18
Relative Humidity: 52.0
Max Wind Speed: 14.0

-------------------------
Date: 2022-12-16
Surface Temperature: 43
Air Temperature: 18
Relative Humidity: 53.7
Max Wind Speed: 13.0

-------------------------
Date: 2022-12-16
Surface Temperature: 33
Air Temperature: 18
Relative Humidity: 53.7
Max Wind Speed: 13.0

-------------------------
Date: 2022-12-16
Surface Temperature: 54
Air Temperature: 18
Relative Humidity: 53.7
Max Wind Speed: 13.0

-------------------------
Date: 2022-12-16
Surface Temperature: 73
Ai
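
The nested loops above issue one `find` per hotspot reference. As a sketch of an alternative, the same join could be expressed as a single aggregation on `dates` using `$in`, `$unwind` and `$lookup`. The function `climate_hotspot_pipeline` is hypothetical, the field names are taken from the documents built earlier, and the pipeline has not been run against this database.

```python
from datetime import datetime


def climate_hotspot_pipeline(days):
    """Build an aggregation pipeline over `dates` that joins each referenced hotspot."""
    return [
        # Keep only the requested dates.
        {"$match": {"_id": {"$in": list(days)}}},
        # One document per hotspot reference.
        {"$unwind": "$hotspots"},
        # Fetch the referenced hotspot document.
        {"$lookup": {
            "from": "hotspots",
            "localField": "hotspots.hotspot_id",
            "foreignField": "_id",
            "as": "hotspot",
        }},
        {"$unwind": "$hotspot"},
        # Keep only the fields printed by the loop above.
        {"$project": {
            "_id": 0,
            "date": "$_id",
            "surface_temperature": "$hotspot.surface_temperature",
            "air_temperature": "$climate.air_temperature",
            "relative_humidity": "$climate.relative_humidity",
            "max_wind_speed": "$climate.max_wind_speed",
        }},
    ]


pipeline = climate_hotspot_pipeline([datetime(2022, 12, 15), datetime(2022, 12, 16)])
# With a live connection this would be run as: db.dates.aggregate(pipeline)
print(len(pipeline))
```

This moves the join to the server, at the cost of a pipeline that is harder to read than the explicit loops.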

### 2.2d Finding climate and hotspot data when confidence is between 80 and 100

In [None]:
# Find datetime, air temperature (°C), surface temperature (°C) and confidence when the confidence is between 80 and 100.
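
The cell above is empty in the source. One possible way to approach this query, assuming inclusive bounds and the field names used when the collections were built, is an aggregation on `hotspots` that joins the matching `dates` document to obtain the air temperature. The function `confidence_pipeline` is hypothetical and the pipeline has not been run against this database.

```python
def confidence_pipeline(low=80, high=100):
    """Build a pipeline over `hotspots` for confidence in [low, high] (inclusive, an assumption)."""
    return [
        {"$match": {"confidence": {"$gte": low, "$lte": high}}},
        # Join the date document whose _id equals the hotspot's date.
        {"$lookup": {
            "from": "dates",
            "localField": "date",
            "foreignField": "_id",
            "as": "day",
        }},
        {"$unwind": "$day"},
        {"$project": {
            "_id": 0,
            "datetime": 1,
            "air_temperature": "$day.climate.air_temperature",
            "surface_temperature": 1,
            "confidence": 1,
        }},
    ]


pipeline = confidence_pipeline()
# With a live connection this would be run as: db.hotspots.aggregate(pipeline)
print(pipeline[0])
```

If the bounds are meant to be exclusive, `$gte`/`$lte` would become `$gt`/`$lt`, matching the style of query 2.2b.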
