# AD09 - GSOD 1929-Today Weather Dataset



# Introduction

The test dataset comes from:

- https://www.ncei.noaa.gov/access/metadata/landing-page/bin/iso?id=gov.noaa.ncdc:C00516

This is the Global Summary of the Day (GSOD) dataset: a frequently updated daily summary from about 9,000 weather stations, covering 1929 to the present. The original dataset holds one file per weather station per year. Each column (and its byte position) is documented at the link below:

- https://www.ncei.noaa.gov/data/global-summary-of-the-day/doc/readme.txt

The dataset comprises a large number of files, at least 505,000.

## How to download all the files

Note: the code below can take a while, because the full archive totals about 3.2 GB. To limit the range of years and get data sooner, edit the first and last year passed to `seq` in the loop (`1929` and `1935` in the cell as shown; a full download would run through `2022`). Note that the oldest years hold far less data than recent ones.

In [None]:
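%%bash
# Base URL of the yearly archives (no trailing slash, to avoid "//" in requests).
BASE=https://www.ncei.noaa.gov/data/global-summary-of-the-day/archive
WD="weather"
mkdir -p "$WD"
pushd "$WD"
# Caveat: the archives appear to name each file <station>.csv, so a station
# present in several years keeps only the last year extracted here.
for ano in $(seq 1929 1935); do
    FILE="${ano}.tar.gz"
    wget -q "$BASE/$FILE" -O "$FILE"
    tar -xvzf "$FILE"
done
popd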

/content/weather /content
03005099999.csv
03075099999.csv
03159099999.csv
03262099999.csv
03311099999.csv
...
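
The yearly archives appear to store one file per station, named after the station ID, so extracting several years into one directory can leave only the most recent file for each station. A minimal sketch, assuming that layout, extracts each archive into its own year subdirectory using only the Python standard library. The function name `extract_year` and the `weather/<year>/` layout are illustrative choices, not part of the original notebook.

```python
import tarfile
from pathlib import Path


def extract_year(archive_path, dest_root, year):
    """Extract the CSV files of one yearly archive into dest_root/<year>/."""
    target = Path(dest_root) / str(year)
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getmembers():
            if member.isfile() and member.name.endswith(".csv"):
                # Write under the bare file name so nothing lands outside target.
                source = tar.extractfile(member)
                (target / Path(member.name).name).write_bytes(source.read())
    return sorted(target.glob("*.csv"))
```

With this layout, a call such as `extract_year("weather/1935.tar.gz", "weather", 1935)` would place the files under `weather/1935/`, and the reading step would need a pattern such as `weather/*/*.csv`.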

## Getting started

In [None]:
import numpy as np
import pandas as pd

## Read and concatenate data from multiple files

The cell below reads every CSV file in `weather/`. To read only a subset of years, change the year range in the download step before running it.

In [None]:
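import os
import glob
# Collect every station file extracted into the weather/ directory.
all_files = glob.glob(os.path.join("weather", "*.csv"))
# Read each file and stack the rows into a single DataFrame.
df = pd.concat((pd.read_csv(f) for f in all_files), ignore_index=True)
df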

Unnamed: 0,STATION,DATE,LATITUDE,LONGITUDE,ELEVATION,NAME,TEMP,TEMP_ATTRIBUTES,DEWP,DEWP_ATTRIBUTES,...,MXSPD,GUST,MAX,MAX_ATTRIBUTES,MIN,MIN_ATTRIBUTES,PRCP,PRCP_ATTRIBUTES,SNDP,FRSHTT
0,36870099999,1935-05-01,43.233333,76.933333,851.0,"ALMATY, KZ",60.0,4,9999.9,0,...,13.0,999.9,75.0,,50.0,*,99.99,,999.9,10000
1,36870099999,1935-05-03,43.233333,76.933333,851.0,"ALMATY, KZ",54.5,4,9999.9,0,...,13.0,999.9,61.0,,48.0,,0.00,I,999.9,0
2,36870099999,1935-05-04,43.233333,76.933333,851.0,"ALMATY, KZ",53.7,4,9999.9,0,...,8.9,999.9,63.0,*,43.0,*,0.00,I,999.9,0
3,36870099999,1935-05-05,43.233333,76.933333,851.0,"ALMATY, KZ",66.7,4,9999.9,0,...,5.1,999.9,73.0,,46.0,,0.00,I,999.9,0
4,36870099999,1935-05-06,43.233333,76.933333,851.0,"ALMATY, KZ",67.0,4,9999.9,0,...,8.9,999.9,81.0,,52.0,,0.00,I,999.9,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28226,35416099999,1935-12-14,49.066667,54.683333,128.0,"UIL, KZ",7.0,4,9999.9,0,...,1.9,999.9,18.0,*,-4.0,*,0.00,I,999.9,0
28227,35416099999,1935-12-17,49.066667,54.683333,128.0,"UIL, KZ",19.0,4,9999.9,0,...,23.9,999.9,25.0,*,14.0,*,0.00,I,999.9,0
28228,35416099999,1935-12-18,49.066667,54.683333,128.0,"UIL, KZ",16.7,4,9999.9,0,...,23.9,999.9,23.0,*,14.0,*,0.00,I,999.9,0
28229,35416099999,1935-12-23,49.066667,54.683333,128.0,"UIL, KZ",17.7,4,9999.9,0,...,13.0,999.9,27.0,*,7.0,*,0.00,I,999.9,0
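
The table above shows `STATION` parsed as a number, while the extracted files carry names with a leading zero (for example `03005099999.csv`), which suggests the leading zero is lost on read. A minimal sketch, assuming the `STATION` and `DATE` columns shown above, keeps station IDs as text and parses dates while loading. The helper name `load_gsod` is hypothetical.

```python
import glob

import pandas as pd


def load_gsod(pattern):
    """Read every CSV matching `pattern` into a single DataFrame."""
    files = sorted(glob.glob(pattern))
    if not files:
        raise FileNotFoundError(f"no files match {pattern!r}")
    frames = [
        # dtype=str keeps leading zeros in STATION; parse_dates makes DATE a datetime.
        pd.read_csv(f, dtype={"STATION": str}, parse_dates=["DATE"])
        for f in files
    ]
    return pd.concat(frames, ignore_index=True)
```

A frame loaded this way holds `DATE` as a datetime column, so year and month would come from `df["DATE"].dt.year` and `df["DATE"].dt.month` rather than from string splitting.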


## Function to convert Fahrenheit to Celsius

In [None]:
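def fahrenheit_to_celsius(t_fahrenheit, r=2):
    """Convert Fahrenheit to Celsius, rounded to r decimal places."""
    t_celsius = (5/9) * (t_fahrenheit - 32)
    # round() accepts plain numbers as well as pandas Series.
    t_celsius = round(t_celsius, r)
    return t_celsius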

fahrenheit_to_celsius(100)

37.78

# Exercises

Although the data is recorded in Fahrenheit, all answers are expected in Celsius. Use the `fahrenheit_to_celsius` function to convert a temperature column in Fahrenheit into another column in Celsius.

In [None]:
# Overwrites the Fahrenheit columns in place, so run this cell only once.
df["TEMP"] = fahrenheit_to_celsius(df["TEMP"])
df["MAX"] = fahrenheit_to_celsius(df["MAX"])
df["MIN"] = fahrenheit_to_celsius(df["MIN"])
df[["DATE", "TEMP", "MIN", "MAX"]]


Unnamed: 0,DATE,TEMP,MIN,MAX
0,1935-05-01,15.56,10.00,23.89
1,1935-05-03,12.50,8.89,16.11
2,1935-05-04,12.06,6.11,17.22
3,1935-05-05,19.28,7.78,22.78
4,1935-05-06,19.44,11.11,27.22
...,...,...,...,...
28226,1935-12-14,-13.89,-20.00,-7.78
28227,1935-12-17,-7.22,-10.00,-3.89
28228,1935-12-18,-8.50,-10.00,-5.00
28229,1935-12-23,-7.94,-13.89,-2.78
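
The conversion can also write to separate columns, which keeps the Fahrenheit values and makes the cell safe to re-run. The sketch below additionally treats `9999.9` as a missing-value code for `TEMP`, `MAX` and `MIN`; that code is an assumption to be checked against the readme linked in the introduction. The helper name `add_celsius_columns` and the `_C` suffix are illustrative.

```python
import pandas as pd

# Assumed missing-value code for TEMP, MAX and MIN (verify against the readme).
MISSING_TEMP = 9999.9


def add_celsius_columns(frame, columns=("TEMP", "MAX", "MIN")):
    """Return a copy with <col>_C columns in Celsius; missing codes become NaN."""
    out = frame.copy()
    for col in columns:
        # Replace the sentinel with NaN before converting.
        fahrenheit = out[col].where(out[col] != MISSING_TEMP)
        out[col + "_C"] = ((fahrenheit - 32) * 5 / 9).round(2)
    return out


sample = pd.DataFrame(
    {"TEMP": [60.0, 9999.9], "MAX": [75.0, 61.0], "MIN": [50.0, 48.0]}
)
print(add_celsius_columns(sample))
```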


In [None]:
df

Unnamed: 0,STATION,DATE,LATITUDE,LONGITUDE,ELEVATION,NAME,TEMP,TEMP_ATTRIBUTES,DEWP,DEWP_ATTRIBUTES,...,MXSPD,GUST,MAX,MAX_ATTRIBUTES,MIN,MIN_ATTRIBUTES,PRCP,PRCP_ATTRIBUTES,SNDP,FRSHTT
0,36870099999,1935-05-01,43.233333,76.933333,851.0,"ALMATY, KZ",15.56,4,9999.9,0,...,13.0,999.9,23.89,,10.00,*,99.99,,999.9,10000
1,36870099999,1935-05-03,43.233333,76.933333,851.0,"ALMATY, KZ",12.50,4,9999.9,0,...,13.0,999.9,16.11,,8.89,,0.00,I,999.9,0
2,36870099999,1935-05-04,43.233333,76.933333,851.0,"ALMATY, KZ",12.06,4,9999.9,0,...,8.9,999.9,17.22,*,6.11,*,0.00,I,999.9,0
3,36870099999,1935-05-05,43.233333,76.933333,851.0,"ALMATY, KZ",19.28,4,9999.9,0,...,5.1,999.9,22.78,,7.78,,0.00,I,999.9,0
4,36870099999,1935-05-06,43.233333,76.933333,851.0,"ALMATY, KZ",19.44,4,9999.9,0,...,8.9,999.9,27.22,,11.11,,0.00,I,999.9,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28226,35416099999,1935-12-14,49.066667,54.683333,128.0,"UIL, KZ",-13.89,4,9999.9,0,...,1.9,999.9,-7.78,*,-20.00,*,0.00,I,999.9,0
28227,35416099999,1935-12-17,49.066667,54.683333,128.0,"UIL, KZ",-7.22,4,9999.9,0,...,23.9,999.9,-3.89,*,-10.00,*,0.00,I,999.9,0
28228,35416099999,1935-12-18,49.066667,54.683333,128.0,"UIL, KZ",-8.50,4,9999.9,0,...,23.9,999.9,-5.00,*,-10.00,*,0.00,I,999.9,0
28229,35416099999,1935-12-23,49.066667,54.683333,128.0,"UIL, KZ",-7.94,4,9999.9,0,...,13.0,999.9,-2.78,*,-13.89,*,0.00,I,999.9,0


## Answer these questions

What is the maximum temperature?

In [None]:
df["TEMP"].max()

37.5
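
The value alone does not say where or when it was recorded. A short sketch, using the hypothetical helper `hottest_record`, returns the whole row that holds the maximum. Note that `TEMP` is the daily mean; passing `column="MAX"` would look at the daily maximum instead.

```python
import pandas as pd


def hottest_record(frame, column="TEMP"):
    """Return the full row holding the largest value of `column`."""
    # idxmax gives the index label of the maximum; NaN values are skipped.
    return frame.loc[frame[column].idxmax()]
```

Calling `hottest_record(df)[["STATION", "NAME", "DATE", "TEMP"]]` would show the station and date behind the maximum.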

What is the maximum temperature per year?

In [None]:
df[['Ano','Mes','Dia']] = df.DATE.str.split("-",expand=True,)
df.groupby("Ano")["TEMP"].max()

Ano
1930    15.28
1932    26.67
1933    11.00
1934    24.28
1935    37.50
Name: TEMP, dtype: float64

What is the mean temperature per month, for each year?

In [None]:
df[['Ano','Mes','Dia']] = df.DATE.str.split("-",expand=True,)
df.groupby(["Ano", "Mes"])["TEMP"].mean()

Ano   Mes
1930  06     15.280000
1932  01     12.582097
      02      9.730690
      03     11.865410
      04     10.981220
      05     26.132000
      06     26.520000
      07     25.846000
      08     25.334000
      09     26.266111
      11     19.522000
      12     16.832857
1933  01      0.851319
      02      0.329881
      03      7.199841
      04      8.938750
1934  01    -13.606429
      02    -12.030909
      03    -10.367692
      04     -2.043846
      05      9.955937
      06     16.227000
      07     16.096061
      08     16.762581
      09      6.922581
      10      3.836970
      11     -7.012593
      12    -13.413704
1935  01      2.051698
      02      3.721678
      03      4.682202
      04      8.294996
      05     11.556325
      06     17.144808
      07     19.050285
      08     18.413135
      09     13.741417
      10      8.163127
      11      0.365285
      12     -1.978181
Name: TEMP, dtype: float64
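
The same aggregation can be expressed with datetime accessors instead of string splitting, and reshaped so that years are rows and months are columns. This is a sketch under the assumption that `DATE` holds ISO dates as shown above; the helper name `monthly_mean_table` is illustrative.

```python
import pandas as pd


def monthly_mean_table(frame):
    """Mean TEMP with one row per year and one column per month."""
    dates = pd.to_datetime(frame["DATE"])
    years = dates.dt.year.rename("Ano")
    months = dates.dt.month.rename("Mes")
    # Group by the derived keys, then pivot the month level into columns.
    return frame.groupby([years, months])["TEMP"].mean().unstack("Mes")
```

Months with no observations in a given year appear as `NaN` in the resulting table.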

Which day of the year is the hottest?

In [None]:
df[['Ano','Mes','Dia']] = df.DATE.str.split("-",expand=True,)
df.groupby(["Dia"])["TEMP"].max().sort_values(ascending=False).index[0]

'05'
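
The cell above groups by `Dia` alone, so its answer is a day of the month. If the question is read as a calendar day (month and day together), one possible sketch groups by both and takes the day with the highest mean `TEMP` across all years and stations. Using the mean rather than the maximum is a choice made here, and the helper name `hottest_calendar_day` is hypothetical.

```python
import pandas as pd


def hottest_calendar_day(frame):
    """Return the 'MM-DD' label with the highest mean TEMP."""
    dates = pd.to_datetime(frame["DATE"])
    # Drop the year so that the same calendar day is pooled across years.
    month_day = dates.dt.strftime("%m-%d")
    return frame.groupby(month_day)["TEMP"].mean().idxmax()
```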

In a given year, what is the maximum temperature per weather station?

In [None]:
df[['Ano','Mes','Dia']] = df.DATE.str.split("-",expand=True,)
df.groupby(["Ano", "STATION"])["TEMP"].max()

Ano   STATION    
1930  3559099999     15.28
1932  3811099999     12.06
      99006199999    26.67
1933  3379099999     11.00
      3864099999     10.56
                     ...  
1935  72286023119    36.22
      72511514711    32.11
      72677024033    32.94
      72681024131    32.83
      78793099999    27.22
Name: TEMP, Length: 153, dtype: float64
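
To answer for one specific year rather than for all years at once, the rows can be filtered first. A minimal sketch, with the hypothetical helper `max_by_station`, returns the stations of that year ordered from hottest to coldest.

```python
import pandas as pd


def max_by_station(frame, year):
    """Maximum TEMP per station for one year, hottest station first."""
    in_year = pd.to_datetime(frame["DATE"]).dt.year == year
    return (
        frame[in_year]
        .groupby("STATION")["TEMP"]
        .max()
        .sort_values(ascending=False)
    )
```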

How many stations are there in each year?

In [None]:
df[['Ano','Mes','Dia']] = df.DATE.str.split("-",expand=True,)
df.groupby(["Ano", "STATION"]).size()

Ano   STATION    
1930  3559099999       1
1932  3811099999     121
      99006199999    175
1933  3379099999      89
      3864099999      97
                    ... 
1935  72286023119    365
      72511514711    242
      72677024033    215
      72681024131    365
      78793099999      5
Length: 153, dtype: int64
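
The output above lists the number of rows for each (year, station) pair. To get a single station count per year, the distinct station IDs can be counted within each year. A sketch, with the illustrative helper name `stations_per_year`:

```python
import pandas as pd


def stations_per_year(frame):
    """Number of distinct STATION values observed in each year."""
    years = pd.to_datetime(frame["DATE"]).dt.year.rename("Ano")
    return frame.groupby(years)["STATION"].nunique()
```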

What is the largest temperature range for a given day?

In [None]:
df[['Ano','Mes','Dia']] = df.DATE.str.split("-", expand=True)
# Lowest and highest TEMP reported by any station on each date.
dft = df[["Ano", "Mes", "Dia", "TEMP"]] \
    .groupby(["Ano", "Mes", "Dia"], as_index=False) \
    .agg({"TEMP": ["min", "max"]})
# Flatten the two-level column index into names such as Ano_ and TEMP_min.
dft.columns = ["_".join(x) for x in dft.columns]
dft["amplitude"] = dft["TEMP_max"] - dft["TEMP_min"]
dft.sort_values("amplitude", ascending=False)

  


Unnamed: 0,Ano_,Mes_,Dia_,TEMP_min,TEMP_max,amplitude
873,1935,11,23,-44.00,18.50,62.50
862,1935,11,12,-41.67,20.28,61.95
901,1935,12,21,-31.28,26.94,58.22
906,1935,12,26,-41.78,14.61,56.39
904,1935,12,24,-37.78,17.28,55.06
...,...,...,...,...,...,...
404,1934,06,25,15.28,15.28,0.00
405,1934,06,26,15.83,15.83,0.00
408,1934,06,29,17.50,17.50,0.00
409,1934,06,30,16.94,16.94,0.00
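
The result above is the spread of `TEMP` across all stations on each date. Another reading of the question is the range within a single station and day, `MAX - MIN`. The sketch below assumes it runs on the raw columns before any unit conversion and that `9999.9` marks a missing `MAX` or `MIN` (an assumption to verify in the readme); the helper name `daily_range` is hypothetical.

```python
import pandas as pd


def daily_range(frame, missing=9999.9):
    """Rows with known MAX and MIN, sorted by MAX - MIN, largest first."""
    known = (frame["MAX"] != missing) & (frame["MIN"] != missing)
    valid = frame[known].copy()
    valid["amplitude"] = valid["MAX"] - valid["MIN"]
    return valid.sort_values("amplitude", ascending=False)
```

On columns that have already been converted, the `missing` argument would have to be the converted value of the code instead.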


Are there cases where the MIN temperature is greater than the MAX?

In [None]:
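df[['Ano','Mes','Dia']] = df.DATE.str.split("-", expand=True)
# Lowest and highest TEMP reported by any station on each date.
dft = df[["Ano", "Mes", "Dia", "TEMP"]] \
    .groupby(["Ano", "Mes", "Dia"], as_index=False) \
    .agg({"TEMP": ["min", "max"]})
# Flatten the two-level column index into names such as Ano_ and TEMP_min.
dft.columns = ["_".join(x) for x in dft.columns]
# This compares per-date extremes of TEMP, so the result is empty by construction.
dft[dft["TEMP_max"] < dft["TEMP_min"]]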

  


Unnamed: 0,Ano_,Mes_,Dia_,TEMP_min,TEMP_max
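
The question refers to the `MIN` and `MAX` columns, which can be compared row by row. A minimal sketch, assuming `9999.9` marks a missing value in either column (to be verified in the readme), skips rows with missing data so that a missing `MIN` is not reported as exceeding `MAX`. The helper name `min_above_max` is illustrative.

```python
import pandas as pd


def min_above_max(frame, missing=9999.9):
    """Rows where both values are known and MIN is greater than MAX."""
    known = (frame["MIN"] != missing) & (frame["MAX"] != missing)
    return frame[known & (frame["MIN"] > frame["MAX"])]
```

An empty result from `min_above_max(df)` would mean no such inconsistency exists in the loaded rows.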
