## Aeronautical Occurrences Data Analytics Project
This notebook will be our best friend until the end of our project. It will store all the code developed during this period and also explain each stage of development, the paths and choices made, as well as any tips or questions that arise during the process. 📝🤝


> **⚠️Important⚠️**: This is a study project, which means that even with my best effort, it can -- and most certainly will -- contain some errors, as in any learning process. So please keep that in mind if you use this notebook for any purpose.

### 1. Introduction
Every country needs to keep an eye on what happens at its airfields, and Brazil is no different. This leads us to the **Brazilian Open Data Portal**, which anyone can access and explore at:
```
https://dados.gov.br/home
```
After a few minutes of searching, you should be able to find the aeronautical occurrences dataset among the public data that the Brazilian government shares through this portal. To save some time, here's the link we're looking for:
```
https://dados.gov.br/dados/conjuntos-dados/ocorrncias-aeronuticas
```

> Note: There's a spelling error in the URL used by the Brazilian government. *"ocorrncias-aeronuticas"* should be spelled *"ocorrencias-aeronauticas"*, so keep an eye out for future changes.
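
One plausible explanation for the odd slug (an assumption on my part, not something the portal documents) is that the accented letters in the dataset title were dropped instead of being converted to their unaccented form. The sketch below reproduces both spellings from the title the API returns; the function names are hypothetical and only serve this illustration.

```python
import unicodedata

# Dataset title as it appears in the API response ("titulo" field)
TITLE = "Ocorr\u00eancias Aeron\u00e1uticas"


def slug_dropping_accents(title: str) -> str:
    """Discard every non-ASCII character outright, then hyphenate."""
    ascii_only = title.encode("ascii", "ignore").decode("ascii")
    return "-".join(ascii_only.lower().split())


def slug_transliterating(title: str) -> str:
    """Decompose accented letters first so the base letter survives."""
    decomposed = unicodedata.normalize("NFKD", title)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return "-".join(ascii_only.lower().split())


print(slug_dropping_accents(TITLE))  # ocorrncias-aeronuticas
print(slug_transliterating(TITLE))   # ocorrencias-aeronauticas
```

If the portal ever fixes the slug, the second spelling is the one to try first.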

### 2. Getting into action 🏃
After this brief description, let's dive into our hands-on learning project. There are two main ways to gather our data from the source. We can:
1. Download it manually from their site
2. Fetch the data from their API, which is built using Swagger

Since the first option is a little too easy, we'll fetch the raw data we need by sending a request to the API. To do that, we need to take a closer look at its documentation, which can be found at:
```
https://dados.gov.br/swagger-ui/index.html#
```
Also, to make requests to this API we need a key token, which can be generated once you log in to your account on the Data Portal homepage. After logging in, you should see a button to generate your key. Now that we have our key token, let's get the data!

### 3. Extraction
To fetch, parse and manipulate our data, we need to import some essential libraries: `requests` to make HTTP requests and `pandas` for data manipulation. We also import `json`, since we'll be working with JSON data. So here it goes.
> Note: The ```import config``` statement is required in order to use confidential information stored in a config file without sharing it when pushing this repository to GitHub.

In [1]:
import config
import pandas as pd
import requests
import json

The code below makes a GET request to the government public data API using `requests`. For this GET request we're passing two arguments: the URL itself and some headers, as shown below:
```
headers = {
    'accept': 'application/json',
    'chave-api-dados-abertos': 'YOUR_OPEN_DATA_KEY_API'
}
```
For confidentiality reasons I saved these headers in a config file, as mentioned above, but feel free to use your own key to fetch the data. If you add the snippet above to your code, just change the parameter `headers=config.headers` to `headers=headers`. You can also pass the headers directly as the `headers` argument:
```
response = requests.get(url, headers={
                                 'accept': 'application/json',
                                 'chave-api-dados-abertos': 'YOUR_OPEN_DATA_KEY_API'
                                 }).json()
```
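
The config file itself is not part of this notebook. As a minimal sketch of what such a `config.py` could contain, the snippet below builds the same headers from an environment variable so the key never lands in the repository. The variable name `DADOS_ABERTOS_API_KEY` and the helper `build_headers` are hypothetical choices for this example, not names used by the portal.

```python
import os

# Hypothetical environment variable name; pick whatever suits your setup
API_KEY_ENV_VAR = "DADOS_ABERTOS_API_KEY"


def build_headers(api_key=None):
    """Return the request headers, reading the key from the environment if not given."""
    key = api_key if api_key is not None else os.environ.get(API_KEY_ENV_VAR)
    if not key:
        raise ValueError(f"No API key found; set {API_KEY_ENV_VAR} or pass api_key")
    return {
        "accept": "application/json",
        "chave-api-dados-abertos": key,
    }


# Passing the key explicitly here only to keep the example self-contained
headers = build_headers("YOUR_OPEN_DATA_KEY_API")
print(sorted(headers))
```

With something like this in `config.py`, a line such as `headers = build_headers()` would give the `config.headers` object used in the next cell.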

In [2]:
url = 'https://dados.gov.br/dados/api/publico/conjuntos-dados/ocorrncias-aeronuticas'

# Send the GET request and parse the response body as JSON with the .json() method.
# json.dumps() with indent=4 then pretty-prints the parsed data so the nested
# structure is easier to read.
response = requests.get(url, headers=config.headers).json()
print(json.dumps(response, indent=4))

{
    "id": "6fe3fb07-51fb-496a-b3a1-8ef7ba5a3905",
    "titulo": "Ocorr\u00eancias Aeron\u00e1uticas",
    "organizacao": "agencia-nacional-de-aviacao-civil-anac",
    "inventario": null,
    "descricao": "Cont\u00e9m dados das ocorr\u00eancias aeron\u00e1uticas enviadas pela For\u00e7a A\u00e9rea Brasileira por meio do CENIPA para a ANAC acrescido de informa\u00e7\u00f5es enriquecidas pela ANAC.",
    "licenca": "odc-odbl",
    "responsavel": "Assessoria de Seguran\u00e7a Operacional - ASSOP",
    "emailResponsavel": "assop@anac.gov.br",
    "periodicidade": "DIARIA",
    "temas": [
        {
            "name": "defesa-e-seguranca",
            "title": "Defesa e Seguran\u00e7a"
        },
        {
            "name": "transportes-e-transito",
            "title": "Transportes e Tr\u00e2nsito"
        }
    ],
    "tags": [
        {
            "id": "590faac4-b214-441e-a9ae-490be21d0166",
            "name": "ANAC",
            "display_name": null
        },
        {
          

Now that we can take a better look at the response from the API call, we should find the part that is relevant to our project. With requests like this one, the relevant data usually comes as nested JSON, so we need to find which key contains the information we're going to use. That's why we call `json.dumps()` with an `indent` value; otherwise it would be harder to analyse the output and find the desired key-value pair. Going through the JSON, we can identify that the relevant part for us is under the `recursos` key. So now, let's get some data.
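
When the pretty-printed output gets long, a quick summary of the top-level keys can make the interesting one easier to spot. The sketch below is one way to do that; `summarize_keys` is a hypothetical helper, and `sample` is a heavily trimmed stand-in for the real response, not the full payload.

```python
def summarize_keys(payload: dict) -> dict:
    """Map each top-level key to a short description of its value."""
    summary = {}
    for key, value in payload.items():
        if isinstance(value, (list, dict)):
            # Containers are where nested data usually hides, so report their size
            summary[key] = f"{type(value).__name__} with {len(value)} item(s)"
        else:
            summary[key] = type(value).__name__
    return summary


# Trimmed stand-in for the API response, for illustration only
sample = {
    "id": "6fe3fb07-51fb-496a-b3a1-8ef7ba5a3905",
    "inventario": None,
    "temas": [{"name": "defesa-e-seguranca"}, {"name": "transportes-e-transito"}],
    "recursos": [{"formato": "HTML"}, {"formato": "CSV"}],
}

for key, description in summarize_keys(sample).items():
    print(f"{key}: {description}")
```

In the notebook, calling `summarize_keys(response)` should list `recursos` as a list, which is a hint that it holds one entry per downloadable resource.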

    
    

In [7]:
print(json.dumps(response['recursos'], indent=4))

[
    {
        "id": "6de61b5d-f8ce-437b-a549-515f9ef2d8c7",
        "idConjuntoDados": "6fe3fb07-51fb-496a-b3a1-8ef7ba5a3905",
        "titulo": "Metadados do conjunto de dados: Ocorr\u00eancias Aeron\u00e1uticas - HTML",
        "link": "https://www.anac.gov.br/acesso-a-informacao/dados-abertos/areas-de-atuacao/seguranca-operacional/ocorrencias-aeronauticas/metadados-do-conjunto-de-dados-ocorrencias-aeronauticas",
        "descricao": "Metadados do conjunto de dados: Ocorr\u00eancias Aeron\u00e1uticas em HTML",
        "tipo": "DICIONARIO_DE_DADOS",
        "formato": "HTML"
    },
    {
        "id": "398a7b43-b3bb-470b-900e-53b28c0370dd",
        "idConjuntoDados": "6fe3fb07-51fb-496a-b3a1-8ef7ba5a3905",
        "titulo": "Arquivo: Seguran\u00e7a Operacional - Ocorr\u00eancias Aeron\u00e1uticas - Formato CSV",
        "link": "https://sistemas.anac.gov.br/dadosabertos/Seguranca%20Operacional/Ocorrencia/V_OCORRENCIA_AMPLA.csv",
        "descricao": "Arquivo: Seguran\u00e7a Operacio

Taking a closer look, we can see that the `recursos` key gives us information about the data and how to access it rather than the data itself. So how do we get the real data? Each resource has a `link` field, so we can request each link and save the content to a file:

In [19]:
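for dictionary in response['recursos']:
    # Derive the file extension from the resource format, e.g. 'CSV' -> 'csv'
    extension = dictionary['formato'].lower()
    with requests.get(dictionary['link'], headers=config.headers) as req:
        # Write the raw bytes; the ../data directory must already exist
        with open(f'../data/occurrences.{extension}', 'wb') as file:
            file.write(req.content)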
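
The loop above saves every resource, including the HTML data dictionary, and two resources with the same format would overwrite each other. A possible refinement, sketched below under my own assumptions, is to first build a download plan that keeps only resources whose `tipo` is `DADOS` and gives each file a unique name. The helper `plan_downloads` and the file naming scheme are hypothetical; the two sample entries are copied from the response shown earlier, trimmed to the fields used here.

```python
from pathlib import Path


def plan_downloads(resources, data_dir="../data", only_type="DADOS"):
    """Return (link, target_path) pairs for resources of the requested type."""
    plan = []
    for resource in resources:
        if resource.get("tipo") != only_type:
            continue  # skip data dictionaries and other non-data resources
        extension = resource["formato"].lower()
        # Prefix of the resource id keeps file names unique per resource
        filename = f"occurrences_{resource['id'][:8]}.{extension}"
        plan.append((resource["link"], Path(data_dir) / filename))
    return plan


resources = [
    {
        "id": "6de61b5d-f8ce-437b-a549-515f9ef2d8c7",
        "link": "https://www.anac.gov.br/acesso-a-informacao/dados-abertos/areas-de-atuacao/seguranca-operacional/ocorrencias-aeronauticas/metadados-do-conjunto-de-dados-ocorrencias-aeronauticas",
        "tipo": "DICIONARIO_DE_DADOS",
        "formato": "HTML",
    },
    {
        "id": "398a7b43-b3bb-470b-900e-53b28c0370dd",
        "link": "https://sistemas.anac.gov.br/dadosabertos/Seguranca%20Operacional/Ocorrencia/V_OCORRENCIA_AMPLA.csv",
        "tipo": "DADOS",
        "formato": "CSV",
    },
]

for link, path in plan_downloads(resources):
    print(path.name, "<-", link)
```

Each `(link, path)` pair can then be passed to the same `requests.get` and `file.write` steps used in the cell above.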

In [8]:
json_data = response['recursos']
json_data

[{'id': '6de61b5d-f8ce-437b-a549-515f9ef2d8c7',
  'idConjuntoDados': '6fe3fb07-51fb-496a-b3a1-8ef7ba5a3905',
  'titulo': 'Metadados do conjunto de dados: Ocorrências Aeronáuticas - HTML',
  'link': 'https://www.anac.gov.br/acesso-a-informacao/dados-abertos/areas-de-atuacao/seguranca-operacional/ocorrencias-aeronauticas/metadados-do-conjunto-de-dados-ocorrencias-aeronauticas',
  'descricao': 'Metadados do conjunto de dados: Ocorrências Aeronáuticas em HTML',
  'tipo': 'DICIONARIO_DE_DADOS',
  'formato': 'HTML'},
 {'id': '398a7b43-b3bb-470b-900e-53b28c0370dd',
  'idConjuntoDados': '6fe3fb07-51fb-496a-b3a1-8ef7ba5a3905',
  'titulo': 'Arquivo: Segurança Operacional - Ocorrências Aeronáuticas - Formato CSV',
  'link': 'https://sistemas.anac.gov.br/dadosabertos/Seguranca%20Operacional/Ocorrencia/V_OCORRENCIA_AMPLA.csv',
  'descricao': 'Arquivo: Segurança Operacional - Ocorrências Aeronáuticas em formato CSV',
  'tipo': 'DADOS',
  'formato': 'CSV'},
 {'id': '2f3b4b61-dbcc-4dee-bf22-db78eb8244

In [9]:
raw_data = pd.json_normalize(json_data)
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   id               3 non-null      object
 1   idConjuntoDados  3 non-null      object
 2   titulo           3 non-null      object
 3   link             3 non-null      object
 4   descricao        3 non-null      object
 5   tipo             3 non-null      object
 6   formato          3 non-null      object
dtypes: object(7)
memory usage: 300.0+ bytes
