# Recommendations

You have to use the data available in the dataset folder. 

Try to answer the questions using Python.

Be clear and get straight to the point in your answers.

Good job!

<hr>

### Question 1

Try to clean/aggregate the data from grid_weather_data.sql in order to avoid NULL (or NaN) values.

The first step in cleaning the dataset was to measure how much data was missing and to check whether the gaps followed any clear pattern.

To do that, the dataset was loaded into a pandas DataFrame, and its built-in methods were used to count the missing values in each column and to compute their share of the total number of rows.

- Total of missing values per column:

```py
df.isnull().sum()
```

| **Column**            | **Missing Values** |
|-----------------------|--------------------|
| cod_city              | 0                  |
| date                  | 0                  |
| hour                  | 0                  |
| precipitation         | 76698              |
| dry_bulb_temperature  | 5966               |
| wet_bulb_temperature  | 16473              |
| high_temperature      | 79166              |
| low_temperature       | 78710              |
| relative_humidity     | 9648               |
| relative_humidity_avg | 81742              |
| pressure              | 19480              |
| sea_pressure          | 69868              |
| wind_direction        | 11324              |
| wind_speed_avg        | 82790              |
| cloud_cover           | 81082              |
| evaporation           | 10299              |


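- Fraction of missing values per column (from 0 to 1):

```py
# Print the share of missing values for every column that has at least one
for col in df:
    missing_fraction = df[col].isnull().mean()
    if missing_fraction > 0:
        print(col, round(missing_fraction, 4))
```

| **Column**            | **Missing (fraction)** |
|-----------------------|------------------------|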
| precipitation         | 0.6647                 |
| dry_bulb_temperature  | 0.0517                 |
| wet_bulb_temperature  | 0.1428                 |
| high_temperature      | 0.6861                 |
| low_temperature       | 0.6821                 |
| relative_humidity     | 0.0836                 |
| relative_humidity_avg | 0.7084                 |
| pressure              | 0.1688                 |
| sea_pressure          | 0.6055                 |
| wind_direction        | 0.0981                 |
| wind_speed_avg        | 0.7175                 |
| cloud_cover           | 0.7027                 |
| evaporation           | 0.0893                 |

Based on these numbers, a substantial share of the data is missing (more than 60% in seven of the columns), and the gaps do not seem to follow any specific pattern. Given that, the missing values in each numerical column are replaced with that column's median, using the code below.

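```py
# df_numerical is assumed to hold only the numerical columns of df
for col in df_numerical:
    col_median = df_numerical[col].median()
    # Assign the result back: fillna(..., inplace=True) on a column selection
    # is unreliable in recent pandas versions
    df_numerical[col] = df_numerical[col].fillna(col_median)
```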

The clean dataset is saved to a CSV file called `clean_dataset.csv`.
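
As a rough illustration of the median imputation described above, the sketch below runs the same idea on a tiny made-up frame. The values are invented, and the choice to leave `cod_city` out of the imputation is an assumption, not something taken from the original notebook.

```py
import numpy as np
import pandas as pd

# Toy stand-in for the weather table (values are illustrative only)
df = pd.DataFrame({
    "cod_city": [59999, 59999, 60020, 60020],
    "precipitation": [0.0, np.nan, 4.0, 2.0],
    "pressure": [960.0, 962.0, np.nan, np.nan],
})

# Impute every measurement column, but leave the identifier column alone
measure_cols = df.columns.drop("cod_city")
df_clean = df.copy()
df_clean[measure_cols] = df[measure_cols].fillna(df[measure_cols].median())

# Confirm that no missing values remain before saving
remaining = int(df_clean.isnull().sum().sum())
print(remaining)

# Uncomment to write the result to disk:
# df_clean.to_csv("clean_dataset.csv", index=False)
```

Passing the Series of medians to `fillna` fills each column with its own median in one call, which avoids the explicit loop.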

<hr>

### Question 2

What was the precipitation mean of each city throughout 2002?

The following query, run against the PostgreSQL database, returns the mean precipitation of each city in 2002:

```sql
SELECT cod_city, avg(precipitation) 
FROM weather.grid_weather_data gwd 
WHERE date BETWEEN '2002-01-01' AND '2002-12-31'
GROUP BY cod_city;
```

Matching each `cod_city` from the database with the corresponding city name from the JSON file gives the following table:

| cod_city | city_name     | mean_precipitation |
| -------- | ------------- | ------------------ |
| 59999    | ALTO PARNAIBA | 3.932328767123288  |
| 60020    | BARREIRAS     | 2.603013698630137  |
| 60046    | CANARANA      | 4.619607843137253  |

<hr>

### Question 3

Which features have some correlation?

The correlation coefficients were calculated with the pandas DataFrame method `corr`, which uses the Pearson coefficient by default. This method returns the results as a DataFrame with one row and one column per feature. Since the raw matrix is impractical to read, it was plotted as a heatmap using matplotlib and seaborn.

```py
import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise correlation between all numerical columns
correlation = df_numerical.corr()

plt.figure(figsize=(16, 12))
plt.title('Correlation Heatmap')
ax = sns.heatmap(correlation, square=True, annot=True, fmt='.2f', linecolor='white')
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
ax.set_yticklabels(ax.get_yticklabels(), rotation=30)
plt.show()
```

The figure is shown below:

![Heatmap](correlation_heatmap.png)

The correlation coefficient ranges from -1 (perfect negative correlation) to +1 (perfect positive correlation), with 0 meaning no linear correlation. Based on the heatmap, the following features are correlated:

- relative_humidity and dry_bulb_temperature (moderate/strong negative)
- dry_bulb_temperature and wet_bulb_temperature (weak positive)
- low_temperature and wet_bulb_temperature (weak positive)
- wind_speed_avg and wet_bulb_temperature (weak positive)
- evaporation and wet_bulb_temperature (weak positive)
- high_temperature and relative_humidity_avg (moderate positive)
- high_temperature and wind_speed_avg (moderate negative)
- cloud_cover and high_temperature (moderate positive)
- wind_speed_avg and cloud_cover (weak/moderate positive)
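
A list like the one above can also be read off the correlation matrix programmatically instead of by eye. The sketch below is one possible way to do it; the helper name `correlated_pairs`, the 0.3 cutoff, and the toy values are all assumptions made for illustration.

```py
import numpy as np
import pandas as pd

def correlated_pairs(corr, threshold=0.3):
    """Return the feature pairs whose |correlation| is at least `threshold`."""
    # Keep only the upper triangle (k=1 drops the diagonal) so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    strong = pairs[pairs.abs() >= threshold]
    # Strongest relationships first, regardless of sign
    return strong.sort_values(key=lambda s: s.abs(), ascending=False)

# Toy data (illustrative only): the first two columns move in exactly opposite
# directions, and the third is uncorrelated with both
toy = pd.DataFrame({
    "dry_bulb_temperature": [1.0, 2.0, 3.0, 4.0, 5.0],
    "relative_humidity": [5.0, 4.0, 3.0, 2.0, 1.0],
    "pressure": [1.0, -1.0, 0.0, -1.0, 1.0],
})
pairs = correlated_pairs(toy.corr())
print(pairs)
```

With the real data, `correlated_pairs(df_numerical.corr())` would return the pairs above the chosen cutoff; where to draw the line between weak, moderate, and strong remains a judgment call.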


<hr>

### Question 4

Create time-series plots using python to show the correlations found in the previous question.

Based on the correlation heatmap from Question 3, two plots were made: one for the moderate-to-strong negative correlation (relative_humidity and dry_bulb_temperature) and one for a moderate positive correlation (cloud_cover and high_temperature).

The first plot shows that these two features usually behave in opposite ways: when one rises, the other falls, and vice versa.

![Negative Correlation](negative_correlation.png)

The second plot shows the moderate positive correlation: the two features behave similarly, so when one rises the other tends to rise too. Because the correlation is only of moderate strength, the pattern does not hold every time.

![Positive Moderate Correlation](moderate_correlation.png)
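
The code behind these figures is not included here, so the following is only a sketch of how such a time-series comparison might be drawn with matplotlib. It uses synthetic data built to be negatively correlated; the column names follow the dataset, but the values, colors, and figure size are assumptions.

```py
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic daily series (illustrative only); real data would come from the clean dataset
dates = pd.date_range("2002-01-01", periods=60, freq="D")
temperature = 25 + 5 * np.sin(np.linspace(0, 6 * np.pi, 60))
df = pd.DataFrame({
    "date": dates,
    "dry_bulb_temperature": temperature,
    "relative_humidity": 100 - 2 * temperature,  # built to move opposite to temperature
})

fig, ax_temp = plt.subplots(figsize=(12, 4))
ax_hum = ax_temp.twinx()  # second y-axis, since the two features are on different scales

ax_temp.plot(df["date"], df["dry_bulb_temperature"], color="tab:red")
ax_hum.plot(df["date"], df["relative_humidity"], color="tab:blue")
ax_temp.set_xlabel("date")
ax_temp.set_ylabel("dry_bulb_temperature", color="tab:red")
ax_hum.set_ylabel("relative_humidity", color="tab:blue")
ax_temp.set_title("Negative correlation (synthetic data)")
fig.tight_layout()

# Correlation of the two synthetic series, for reference
r = df["dry_bulb_temperature"].corr(df["relative_humidity"])
print(round(r, 2))
```

Plotting each feature on its own y-axis keeps both curves readable, which makes the "one rises, the other falls" pattern easier to spot than on a shared axis.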



<hr>

### Question 5

Make an exploratory analysis under the data and present your insights.

<hr>

### Question 6

- Create a Rest API using python framework (e.g., django, fastapi, flask, tornado) in order to provide the weather data inside of grid_weather_data.sql and grid_weather.json

- Create and use any kind of database to make a CRUD to use it later. 

- Try to provide a swagger to describe your API's structure.

- Try to host it on some cloud platform (e.g., heroku, pythonanywhere), and don't forget to provide the link to access it. Otherwise, prepare the modules and the server/database so that it can (1) be installed in some env: pip install -r requirements.txt; (2) then be run with server.py: python server.py.

- Share below a link to your Rest API code stored in a repository from GitHub.

https://github.com/gv-public/seedz-tech-interview

More information is available in the README.md file in the repository.

<hr>

### Question 7

Make a python script in order to make many requests in parallel to your Rest API that you've created in the previous question.

```py
# REMEMBER TO ADD A time.sleep(10) OR HIGHER

import requests
import concurrent.futures

# Function that makes a request to the API
def fazer_requisicao(url):
    response = requests.get(url)
    # Any processing of the response can be added here, if needed
    return response.json()

# List of API URLs to access
urls = [
    "https://api.exemplo.com/endpoint1",
    "https://api.exemplo.com/endpoint2",
    "https://api.exemplo.com/endpoint3",
    # Add more API URLs here, if needed
]

# Create an executor to run the requests in parallel
with concurrent.futures.ThreadPoolExecutor() as executor:
    # Make the requests in parallel and collect the results
    resultados = executor.map(fazer_requisicao, urls)

    # Iterate over the results
    for resultado in resultados:
        # Process the results of the requests, if needed
        print(resultado)
```

<hr>