# Web Scraping from the SENAMHI automatic weather stations database

In this notebook, we will retrieve weather variables from SENAMHI's automatic weather stations for a filtered range of dates, using a bit of web scraping.
Let's get started.

## Importing libraries

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

## Defining our available dates

First of all, the available dates are listed on the [website of SENAMHI's stations](https://senamhi.gob.pe/?&p=estaciones); each station has its own available dates. After reviewing the available dates for your station of interest (they appear in the "Ir" drop-down button), enter the first and last year and month as numbers.

In [2]:
# First year available in the station
first_year = input("Enter the first year available at the station : ")

# First month available in the station
first_month = input("Enter the first month available at the station, as a number : ")

# Last year available in the station
last_year = input("Enter the last year available at the station : ")

# Last month available in the station
last_month = input("Enter the last month available at the station, as a number : ")


Enter the first year available at the station : 2016
Enter the first month available at the station, as a number : 1
Enter the last year available at the station : 2020
Enter the last month available at the station, as a number : 11
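
Because `input` returns strings, it can help to check the typed values before building the date range. The following is a minimal sketch of such a check; `parse_year_month` is a hypothetical helper that is not part of the original notebook.

```python
def parse_year_month(year_text, month_text):
    """Convert the typed year and month to integers, rejecting invalid months."""
    year = int(year_text.strip())    # raises ValueError for non-numeric text
    month = int(month_text.strip())
    if not 1 <= month <= 12:
        raise ValueError('month must be between 1 and 12, got {}'.format(month))
    return year, month

# Example with the values typed above
print(parse_year_month('2016', '1'))
```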


In [3]:
# Generating the YYYYMM date strings for our web scraping
def filtro(first_year, last_year, first_month, last_month):
    first_year, last_year = int(first_year), int(last_year)
    first_month, last_month = int(first_month), int(last_month)
    fechas = []
    for year in range(first_year, last_year + 1):
        # Clip the month range in the first and last years; this also
        # handles the case where both fall in the same year
        start = first_month if year == first_year else 1
        end = last_month if year == last_year else 12
        for month in range(start, end + 1):
            fechas.append('{}{:02d}'.format(year, month))
    return fechas
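
As a cross-check on the date strings, the same list can be produced by counting months as a single running index. This is only an alternative sketch; `month_strings` is a hypothetical name and is not used elsewhere in the notebook.

```python
def month_strings(first_year, first_month, last_year, last_month):
    """Return 'YYYYMM' strings for every month in the inclusive range."""
    # Map each (year, month) pair to one integer so the range is contiguous
    start = int(first_year) * 12 + int(first_month) - 1
    end = int(last_year) * 12 + int(last_month) - 1
    result = []
    for index in range(start, end + 1):
        year, month0 = divmod(index, 12)
        result.append('{}{:02d}'.format(year, month0 + 1))
    return result

fechas = month_strings('2016', '1', '2020', '11')
print(len(fechas), fechas[0], fechas[-1])  # 59 201601 202011
```

Because the range is a single run of integers, a start and end within the same year need no special handling.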

### URL parameters

Besides the available dates, other parameters such as the station code and the station type are also required. They are obtained the same way as the available dates: by looking them up on the website.

In [5]:
# Station code
CODIGO = input("Enter the automatic station code : ")
# Building the list of filtered dates
# (note: this rebinds the name `filtro` from the function to its result)
filtro = filtro(first_year,last_year,first_month,last_month)
# Type of station, in this case automatic
estado = 'AUTOMATICA'

Enter the automatic station code : 4726631C
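
The parameters end up in the query string of the URL requested in the scraping step. As a sketch, the standard library's `urlencode` can assemble that query string from a dictionary. `BASE_URL` and `build_url` are hypothetical names. The parameter names and the fixed values `t_e=M`, `cod_old=` and `cate_esta=EMA` are copied from the URL used in this notebook, whose meaning the notebook does not explain.

```python
from urllib.parse import urlencode

# Endpoint taken from the URL used in this notebook
BASE_URL = 'http://www.senamhi.gob.pe/mapas/mapa-estaciones-2/_dato_esta_tipo02.php'

def build_url(codigo, fecha, estado='AUTOMATICA'):
    """Assemble the request URL for one station and one YYYYMM date."""
    params = {
        'estaciones': codigo,   # station code
        'CBOFiltro': fecha,     # filtered date, as YYYYMM
        't_e': 'M',             # fixed value copied from the notebook's URL
        'estado': estado,       # type of station
        'cod_old': '',          # left empty, as in the notebook's URL
        'cate_esta': 'EMA',     # fixed value copied from the notebook's URL
    }
    return BASE_URL + '?' + urlencode(params)

print(build_url('4726631C', '201601'))
```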


## Web Scraping 

At this point, we use a **for** loop to collect all the tables from SENAMHI's website, building each URL from the filtered dates and the respective parameters.

Next, we convert the HTML of each table row to text and split it on whitespace, so that each element of a row corresponds to a column. Finally, we append each row to a list called "rows", from which we create a dataframe and drop the duplicate headers contributed by each table.
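
The text-cleaning step described above can be tried on a single row in isolation. In this small sketch the two sample strings are hypothetical stand-ins for the `.text` of a header row and a data row, and `clean_row` is a hypothetical name for the chain of string operations.

```python
# Hypothetical row texts, shaped like the .text of a <tr> element
header_text = "\nAÑO / MES / DÍA\nHORA\nTEMPERATURA (°C)\n"
data_text = "\n2016/01/01\n00:00\n13.9\n"

def clean_row(text):
    """Remove spaces inside cells, then split the cells on whitespace."""
    return text.replace(" ", "").replace("\n", " ").strip().split()

print(clean_row(header_text))
print(clean_row(data_text))
```

Removing the spaces first keeps multi-word headers as a single element, so every row yields one item per column.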

In [None]:
# List of stored HTML tables
table = []

# Table format in a dictionary 
dicc={'width':"100%", 'border':"1", 'class':"body01", 'bordercolor':"#999999", 
      'cellpadding':"0", 'cellspacing':"1", 'align':"center", 'id':"dataTable"}

# Loop over the filtered dates to collect all available tables
for fecha in filtro:
    url = 'http://www.senamhi.gob.pe/mapas/mapa-estaciones-2/_dato_esta_tipo02.php?estaciones={}&CBOFiltro={}&t_e=M&estado={}&cod_old=&cate_esta=EMA'.format(
          CODIGO,
          fecha,
          estado)
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    table.append(soup.find('table',dicc)) 
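
`soup.find` returns `None` when no element matches, so a page without the data table would add `None` to the list and break the later row extraction. Whether the site ever returns such a page is an assumption here. The sketch below shows one way to skip those cases; `find_data_table` and the two sample pages are hypothetical.

```python
from bs4 import BeautifulSoup

def find_data_table(html):
    """Return the <table id="dataTable"> element, or None if it is absent."""
    soup = BeautifulSoup(html, 'html.parser')
    return soup.find('table', {'id': 'dataTable'})

# Hypothetical responses: one page with the table, one without it
pages = [
    '<html><table id="dataTable"><tr><td>13.9</td></tr></table></html>',
    '<html><p>No data</p></html>',
]

# Keep only the pages where the table was actually found
tables = [t for t in (find_data_table(p) for p in pages) if t is not None]
print(len(tables))
```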

In [8]:
# Quick check with a single month (January 2016) instead of the full loop
table = []

dicc={'width':"100%", 'border':"1", 'class':"body01", 'bordercolor':"#999999", 
      'cellpadding':"0", 'cellspacing':"1", 'align':"center", 'id':"dataTable"}

url = 'http://www.senamhi.gob.pe/mapas/mapa-estaciones-2/_dato_esta_tipo02.php?estaciones=4726631C&CBOFiltro=201601&t_e=M&estado=AUTOMATICA&cod_old=&cate_esta=EMA'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
table.append(soup.find('table',dicc)) 

In [20]:
# Obtain each row of each table and append it to a list
rows = []

# We convert the HTML to text, remove the spaces inside each cell, and
# split on whitespace so that each element of a row belongs to a column.
for t in table:
    # The attribute name is spelled 'aling' (not 'align') in this selector
    for i in t.find_all('tr',{'aling': "center"}):
        rows.append(i.text.replace(" ","").replace("\n", " ").strip().split())

# Creating the dataframe
senhami = pd.DataFrame(rows[1:], columns = rows[0])

# Dropping the repeated header rows
mask = senhami.iloc[:, 0] == senhami.columns.to_list()[0]
d1 = senhami[~mask]

In [25]:
d1

| | AÑO/MES/DÍA | HORA | TEMPERATURA(°C) | PRECIPITACIÓN(mm/hora) | HUMEDAD(%) | DIRECCIONDELVIENTO(°) | VELOCIDADDELVIENTO(m/s) |
|---|---|---|---|---|---|---|---|
| 0 | 2016/01/01 | 00:00 | 13.9 | 0.0 | S/D | 40 | 1.0 |
| 1 | 2016/01/01 | 01:00 | 13.8 | 0.0 | S/D | 5 | 0.9 |
| 2 | 2016/01/01 | 02:00 | 14.0 | 0.0 | S/D | 325 | 0.8 |
| 3 | 2016/01/01 | 03:00 | 13.7 | 0.2 | S/D | 324 | 1.3 |
| 4 | 2016/01/01 | 04:00 | 13.2 | 0.0 | S/D | 347 | 1.9 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 725 | 2016/01/31 | 19:00 | 15.7 | 0.0 | S/D | 4 | 1.7 |
| 726 | 2016/01/31 | 20:00 | 14.9 | 0.0 | S/D | 347 | 1.8 |
| 727 | 2016/01/31 | 21:00 | 14.7 | 0.0 | S/D | 22 | 0.7 |
| 728 | 2016/01/31 | 22:00 | 14.3 | 1.0 | S/D | 357 | 0.6 |
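
Every scraped value is still a string at this point. Before doing arithmetic, the measurement columns can be converted with `pd.to_numeric`, as in the sketch below. It assumes that `S/D` marks a missing value, and its small dataframe reuses two rows from the output above rather than the full scraped data.

```python
import pandas as pd

# Two rows in the same shape as the scraped dataframe (values taken from the output)
df = pd.DataFrame({
    'TEMPERATURA(°C)': ['13.9', '13.8'],
    'HUMEDAD(%)': ['S/D', 'S/D'],
})

# errors='coerce' turns anything that is not a number (such as 'S/D') into NaN
numeric = df.apply(pd.to_numeric, errors='coerce')
print(numeric.dtypes)
```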


In [None]:
# Saving to a .csv file
d1.to_csv('celendin.csv', index = False)