<div>
    <img style="float:right; width:210px" src="images/snext-logo.png"/>
    <div style="float:left;"><h1>Introduction to Python for Data Science</h1></div>
</div>

---
# Notebook 4: Transfer Task
This notebook introduces the instructions for the transfer task you have to complete for the module "Digital Infrastructure & Software Development".



## Your Objective
In this task you will explore air quality data using the OpenAQ API.
OpenAQ provides access to a global database of air quality measurements, allowing us to analyze and compare pollution levels across different cities and regions. 
Understanding air quality is crucial for environmental research, public health, and policy-making (for more info, see this [link](https://openaq.org/why-air-quality/)).

**Your overall objective is to produce a well-documented analysis that states and answers a research question about that data.**

Specifically, to complete this task you need to:
- Formulate a research question comparing air quality across selected cities or regions.
- Retrieve relevant air quality data using the OpenAQ API.
- Analyze, process, and visualize this data.
- Document your findings and insights in a Jupyter notebook.

Follow the instructions in this notebook and fill in the blanks to craft your response step by step. Add new cells where necessary. Based on your notebook, write the research report as your response in the INSIDER task.

## About the data set and air quality measurement

The OpenAQ API provides access to a wide range of air quality metrics that are crucial for assessing environmental and public health.

- **Particulate Matter (PM2.5 and PM10)**
  - PM2.5 and PM10 are tiny particles in the air.
  - They can penetrate deep into the lungs and enter the bloodstream.
  - They pose significant health risks, including respiratory and cardiovascular issues.

- **Nitrogen Dioxide (NO2)**
  - Primarily produced from fossil fuel combustion.
  - Can aggravate respiratory diseases like asthma.
  - High levels are often found in urban and industrial areas.

- **Sulfur Dioxide (SO2)**
  - Generated from burning fossil fuels containing sulfur.
  - Can cause acid rain and respiratory problems.
  - Affects the environment by harming plants and aquatic life.

- **Carbon Monoxide (CO)**
  - Produced by incomplete combustion of fossil fuels.
  - A colorless, odorless, and poisonous gas.
  - High concentrations can be lethal, and lower concentrations can cause health issues.

- **Ozone (O3)**
  - Ground-level ozone is a component of smog.
  - Beneficial in the upper atmosphere but harmful at ground level.
  - Can cause breathing problems, trigger asthma, and reduce lung function.

## Preparation of your work

Before diving into the task, familiarize yourself with the OpenAQ API. Here are some key points:
- API Key: Create an account to generate your [API key](https://docs.openaq.org/using-the-api/api-key)
- API Documentation: Review the OpenAQ [API documentation](https://docs.openaq.org/docs) to understand the available endpoints, parameters, and response formats.
- API Requests: You will use Python’s requests library to interact with the API.
- Data Considerations: Pay attention to data formats, available parameters (like PM2.5, NO2), and how to filter data by date, location, etc.

It is advisable to install Python and JupyterLab locally on your computer so you can save your work. As a fallback, you can use one of the many cloud platforms (e.g. Google Colab, Binder, ...) to do your work online. See the separate notebook ("SetupEnvironment") for further information and instructions.

## Evaluation and grading of this task

- The task is considered complete if your code produces a table of correct data.
- The task is done well if your code generates a single table with all scores, can display the demanded bar chart, and computes the means successfully.
- The task is done perfectly if you round all values, append the mean values to the table, and in this way include the mean scores in your bar chart.

### Hints

#### General

- Develop your code in multiple cells so you can check the output of every step
- For the final solution you will need a loop to fetch data for all the cities, but don't start this way. First get the task working for a single city, then carefully introduce more complexity to your code.
- If your loop runs endlessly, use the stop button in the top menu bar to interrupt Python. Alternatively, use the top menu: `Kernel -> Interrupt Kernel`. If that doesn't help, use `Kernel -> Restart Kernel`.
- If you left the notebook sitting for some time, it might be inactive because the Python kernel was terminated. The bottom bar then no longer indicates an idle kernel. In that case, select `Kernel -> Restart Kernel` from the top menu bar.
- You have seen all required operations in the notebooks you worked through for this task; feel free to use them as a copy/paste reference. Where unsure, use Google to find more examples of a certain operation.
- Getting the quality scores from the Teleport API requires going through these steps: 
  * look up the city, get the link to the city data
  * use the city data link to get the link to the urban area of the city
  * use the urban area data link to get the link to the scores
  * retrieve the scores
- You can save your work by copy/pasting your code to an empty document on your computer or by downloading this notebook (`File -> Download`). After re-opening the Jupyter environment, you can re-upload the notebook by dragging and dropping the downloaded file onto the file explorer on the left-hand side.

#### Working with API results

- Before writing the code, you might want to walk through the sequence of queries in a browser. 
- Getting the required bits of information from the retrieved raw (JSON) data is a little cumbersome. The cell below contains an example query that demonstrates how to dig layer by layer into the first API response. If this feels confusing, revisit notebook #2 and recheck the sections about lists and dictionaries, especially the person example.
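
As a minimal sketch of digging into a nested response layer by layer, consider the example below. The `sample` dictionary is hand-written for illustration and only loosely mimics the shape of an API response (a dictionary holding a list of dictionaries); real responses contain more fields and may be structured differently.

```python
# Hand-written sample, NOT real API output: a dict containing a list of dicts
sample = {
    "meta": {"found": 2},
    "results": [
        {"value": 12.5, "datetime": {"utc": "2024-01-01T10:00:00Z"}},
        {"value": 8.0, "datetime": {"utc": "2024-01-01T11:00:00Z"}},
    ],
}

# Layer 1: the top-level object is a dictionary; list its keys first
print(list(sample.keys()))

# Layer 2: "results" is a list; check its length and the type of its first element
results = sample["results"]
first = results[0]
print(len(results), type(first))

# Layer 3: each element is again a dictionary, here with a nested dictionary inside
first_timestamp = first["datetime"]["utc"]
print(first_timestamp)

# Once the path is clear, collect one field from every element
values = [entry["value"] for entry in results]
print(values)
```

Inspecting one layer per cell, as here, makes it easier to see whether the next step needs a key (dictionary) or a position (list).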

#### Preparing your dataframe

- Remember: operations like `drop`, `rename`, etc. create and return a copy of the dataframe. Use in-place editing like `df.rename(..., inplace=True)`, or assign the changed dataframe back to the original name as in `df = df.rename(...)`, to make your changes persistent.
- To approach the problem stepwise, create and prepare separate dataframes for each city in your loop and join the dataframes at the end outside the loop. 
- It might be useful to drop the color column and rename the column "score_out_of_10" to the city name during dataframe preparation.
- Don't forget to set the index. 
- There are at least two options to join multiple dataframes:
   * Use the `.loc[]` function to piece your result dataframe together, like `df.loc[:,"columnX"] = ...` to add a column
   * Check the documentation of `pandas.concat()` function online and use it to generate your result
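
The preparation hints above can be sketched as follows, assuming one small dataframe per city. The city names, column names, and numbers are made up for illustration; your own dataframes will look different.

```python
import pandas as pd

# Made-up example values, for illustration only
berlin = pd.DataFrame({"parameter": ["pm25", "no2"], "value": [11.0, 24.0]})
madrid = pd.DataFrame({"parameter": ["pm25", "no2"], "value": [9.0, 30.0]})

frames = []
for city, df_city in {"Berlin": berlin, "Madrid": madrid}.items():
    # rename() returns a copy, so assign the result back to a name
    prepared = df_city.rename(columns={"value": city})
    # Set the index so the dataframes can be aligned on it later
    prepared = prepared.set_index("parameter")
    frames.append(prepared)

# Outside the loop: join the prepared dataframes side by side on the shared index
combined = pd.concat(frames, axis="columns")
print(combined)
```

Collecting the prepared dataframes in a list and calling `pd.concat()` once at the end keeps the loop body simple and avoids repeatedly growing a dataframe.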

#### Navigating OpenAQ API

If you find it difficult to access the information you seek from the API, try these things:
- Describe your region of interest by determining its geo coordinates, e.g. using Google Maps or OpenStreetMap (see [docs](https://support.google.com/maps/answer/18539?hl=en&co=GENIE.Platform%3DDesktop)).
- Query the ``/locations`` endpoint with the coordinates and a radius to find locations known to the API that correspond to the region you're interested in (see docs [here](https://docs.openaq.org/api/operations/locations_get_v3_locations_get)).

This should yield a list of locations containing sensors relevant to your question. Look up how to work with the ``/sensors`` endpoint to access data series.
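
A minimal sketch of such a ``/locations`` query is shown below. The coordinates and radius are arbitrary example values, and the parameter names `coordinates`, `radius`, and `limit` are assumptions based on the linked endpoint documentation; verify them there before relying on this. The request is only prepared, not sent, so you can inspect the final URL first.

```python
import requests

# Assumed example values: a point in central Berlin and a 10 km search radius
latitude, longitude = 52.52, 13.405
params = {
    "coordinates": f"{latitude},{longitude}",  # assumed order: latitude,longitude
    "radius": 10_000,                          # assumed unit: metres
    "limit": 100,
}
api_key = "..."  # your own key

# Build the request without sending it, so the final URL can be inspected
request = requests.Request(
    "GET",
    "https://api.openaq.org/v3/locations",
    params=params,
    headers={"X-API-Key": api_key},
)
prepared = request.prepare()
print(prepared.url)

# To actually send it, uncomment:
# with requests.Session() as session:
#     response = session.send(prepared, timeout=30)
#     response.raise_for_status()
#     locations = response.json()["results"]
```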

---
# Step 1: Formulate Your Research Question

## Task

Develop a research question that delves into the complexities of air quality data. Your question should not only be insightful but also adhere to certain criteria to ensure an adequate level of complexity and analytical depth.

Requirements for Your Research Question:

- Multiple Cities/Regions: Select at least two cities or regions for your analysis. These could be in the same country or across different countries, offering a diverse perspective on air quality.
- Multiple Parameters: Choose at least two air quality parameters (like PM2.5, NO2, SO2, CO) for comparison. This will enrich your analysis by allowing you to explore correlations or contrasts between different types of pollutants.
- Time Frame: Include a specific time frame in your question. For instance, you might compare data from the past year, different seasons, or specific months.
- Analytical Angle: Frame your question to include an analytical angle, such as trends over time, peak pollution periods, or comparisons between weekdays and weekends.

Examples of Suitable Research Questions:

- "How do the levels of PM2.5 and NO2 in Tokyo and Delhi compare during the winter months of the last two years?"
- "What are the trends in SO2 and CO levels in Chicago and Toronto, and how do these vary between summer and winter seasons?"

Once you've formulated your research question, clearly outline the chosen cities/regions and the specific air quality parameters you will focus on. This specificity will be crucial for the data retrieval process in the subsequent chapters.

## Research Question

...fill in your question here...

---
# Step 2: Data Retrieval from OpenAQ

Now that you have your research question, the next step is to retrieve the relevant air quality data from the OpenAQ API. This chapter will guide you through the process of setting up API requests to fetch the data you need.

Here is some code to get you started...

In [None]:
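import requests
import pandas as pd
from pandas import json_normalize

# Time window of interest. Not used by the example request below;
# apply these dates once you query measurements for specific sensors.
date_from = "2023-12-29"
date_to = "2024-01-02"

# Latest values for parameter id 2 (see the /v3/parameters endpoint for the id mapping)
url = "https://api.openaq.org/v3/parameters/2/latest?limit=1000"
api_key = "..." # create account, get your own key in profile/settings menu
response = requests.get(url, headers={"X-API-Key": api_key})
response.raise_for_status()  # stop early on HTTP errors, e.g. an invalid API key

# Flatten the JSON data
df = json_normalize(response.json()["results"])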

In [None]:
# Parse the date in UTC, drop local time, sort the dataframe
df["datetime.utc"] = pd.to_datetime(df["datetime.utc"])
df.drop(labels="datetime.local", axis="columns", inplace=True)
df.sort_values("datetime.utc", inplace=True)
df.set_index("datetime.utc", inplace=True)

df.head()

# Step 3: Data Processing and Analysis

With the air quality data collected from the OpenAQ API, the next step is to process and analyze this data to answer your research question.

Steps for Data Processing and Analysis:
- Data Cleaning: Begin by inspecting and (if necessary) cleaning the data. This includes handling missing values, filtering irrelevant data, and converting data types if necessary.
- Data Organization: Organize your data in a way that facilitates analysis. For example, you might create separate DataFrames for each city or pollutant.
- Descriptive Statistics: Conduct statistical analyses that are relevant to your research question. This may include calculating averages, medians, or identifying trends.
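
The three steps above might look like the following sketch. The dataframe is made up for illustration (including one missing and one implausible negative value); the column names are assumptions and will differ from what you retrieve from the API.

```python
import pandas as pd

# Made-up measurements for two cities, for illustration only
raw = pd.DataFrame(
    {
        "city": ["Tokyo", "Tokyo", "Tokyo", "Delhi", "Delhi", "Delhi"],
        "datetime": [
            "2024-01-05T08:00:00Z", "2024-01-06T08:00:00Z", "2024-01-07T08:00:00Z",
            "2024-01-05T08:00:00Z", "2024-01-06T08:00:00Z", "2024-01-07T08:00:00Z",
        ],
        "pm25": [14.0, None, 10.0, 180.0, 150.0, -999.0],
    }
)

# Data cleaning: parse timestamps, drop missing values and implausible negatives
raw["datetime"] = pd.to_datetime(raw["datetime"], utc=True)
clean = raw.dropna(subset=["pm25"])
clean = clean[clean["pm25"] >= 0]

# Data organization: flag weekend days (Monday=0, ..., Saturday=5, Sunday=6)
clean = clean.assign(is_weekend=clean["datetime"].dt.dayofweek >= 5)

# Descriptive statistics: mean and median per city
summary = clean.groupby("city")["pm25"].agg(["mean", "median"])
print(summary)
```

Whether negative readings should be dropped, and which statistics are meaningful, depends on your data and research question; document the choices you make.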

In [None]:
# ...

# Step 4: Data Visualization

Visualizing your data is a powerful way to communicate your findings and understand trends and patterns.

Visualization Techniques:
- Choosing the Right Visualization: Depending on your data and research question, choose appropriate visualizations. Line graphs are great for showing trends over time, while bar charts can compare different cities or pollutants effectively.
- Creating Visualizations: Use the Python library Matplotlib (or optionally seaborn, if you like) to create your visualizations.

The [Python Graph Gallery](https://python-graph-gallery.com) is a great resource to draw inspiration from.
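
As a starting point, the sketch below draws a line graph and a bar chart side by side with Matplotlib. The values are made up for illustration, and the unit in the axis labels is an assumption; check which unit your sensors actually report.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up daily means, for illustration only
daily = pd.DataFrame(
    {"Tokyo": [12.0, 15.0, 11.0], "Delhi": [160.0, 175.0, 150.0]},
    index=pd.to_datetime(["2024-01-05", "2024-01-06", "2024-01-07"]),
)

fig, (ax_line, ax_bar) = plt.subplots(1, 2, figsize=(10, 4))

# Line graph: trend over time, one line per city
daily.plot(ax=ax_line, marker="o")
ax_line.set_title("Daily mean PM2.5")
ax_line.set_ylabel("PM2.5 (µg/m³, assumed unit)")

# Bar chart: compare the mean over the whole period between cities
daily.mean().plot.bar(ax=ax_bar)
ax_bar.set_title("Mean PM2.5 over the period")
ax_bar.set_ylabel("PM2.5 (µg/m³, assumed unit)")

fig.tight_layout()
```

In a Jupyter notebook the figure is displayed below the cell; titles and axis labels with units make the plot readable on its own.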

In [None]:
# ...

# Step 5: Testing, Documentation and Notebook Structure

Your Jupyter notebook should run without errors and be well-documented and well-structured.
To test your code, restart the kernel (the Python session running in the background of this notebook) and clear all outputs. Then rerun all code cells and check the results.
Remove all instructions and include explanations of your methodology, observations, and interpretations at each step.

Reformat so this notebook fits this structure:
- Introduction: Brief overview of your task and research question.
- Data Retrieval: Code and explanation for retrieving data.
- Data Processing: Steps taken to clean and organize the data.
- Data Analysis: Analysis performed on the data, with observations.
- Visualization: Visualizations of the data with interpretations.
- Conclusion: Summarize and interpret your findings, and provide insights.


# Step 6: Summary and Conclusion
In this final chapter, reflect on your findings from the data analysis and visualization. Provide a concise summary that answers your research question and discuss any interesting insights or implications of your analysis.