IMD Weather Data Pipeline: Web Scraping With Python [Write Up]

The India Meteorological Department (IMD) website provides vital weather information for India, but extracting this data by hand is impractical. This article demonstrates how to build a Python web scraper that automatically collects weather tables from the IMD website and saves them to an Excel file. We'll cover the necessary libraries and techniques to get you started and tap into a wealth of meteorological information.

Without further ado, let's get started!

1. Setting up the Tools (Import Libraries):
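First, we import the necessary libraries: requests to fetch web pages, BeautifulSoup to parse HTML, pandas to organize our data into tables, and os to interact with the file system. pandas also needs an HTML parser such as lxml for read_html and an Excel engine such as openpyxl for .xlsx files, so make sure those are installed too.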

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import os

2. Defining our Target (Main URL):
We store the address of our target webpage in the variable main_url. This is where our web scraping adventure starts: the menu_test.php page on the India Meteorological Department website, which links to the individual weather pages.

In [None]:
main_url = 'https://city.imd.gov.in/citywx/menu_test.php'

3. Naming our Data Storage (Excel File):
Next, we create a variable called excel_file and set it to "scraped_tables.xlsx". This is the Excel file where we'll store our collected weather data.

In [None]:
excel_file = "scraped_tables.xlsx"

4. Preparing the Container (Empty all_tables List):
We create an empty list called all_tables. This list will hold the pandas DataFrames we collect along the way. At the end, we'll concatenate all of them into one big DataFrame.

In [None]:
all_tables = []

5. Checking for Existing Data:
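Now, we're doing something clever. We check whether an Excel file called scraped_tables.xlsx already exists, using the os.path.exists() function. If it does, we might already have some scraped data we want to keep. Before the check, we set existing_df to None so that the variable is defined even when there is no file to resume from.

In [None]:
existing_df = None  # defined up front so later checks work without a saved file
if os.path.exists(excel_file):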

6. Loading Existing Data (if available):
If the Excel file exists, we try to load its content into a pandas DataFrame called existing_df. We wrap this in a try-except block to handle potential errors, such as a corrupted file or one that is not a valid Excel file. We then put this DataFrame into the all_tables list so that new data can be appended to the existing data. Finally, we print a message telling the user that we are resuming scraping; if an error occurs, the message says we are starting from scratch instead. After this step, a second try block catches potential errors during the web fetching and parsing process.

In [None]:
    try:
        existing_df = pd.read_excel(excel_file)
        all_tables = [existing_df]
        print("Resuming scraping from existing data in scraped_tables.xlsx")
    except Exception as e:
        print(f"Error reading existing data: {e}. Starting from scratch.")
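Steps 5 and 6 can also be folded into one small function, which makes the "resume or start from scratch" decision easier to see. This is only a sketch: the helper name load_existing is hypothetical and not part of the original script, and it returns None whenever there is nothing usable to resume from.

```python
import os

import pandas as pd


def load_existing(path):
    """Return the previously saved DataFrame, or None if it is missing or unreadable.

    load_existing is a hypothetical helper name used only for this sketch.
    """
    if not os.path.exists(path):
        return None
    try:
        return pd.read_excel(path)
    except Exception as e:  # corrupted file, wrong format, missing Excel engine, ...
        print(f"Error reading existing data: {e}. Starting from scratch.")
        return None


existing_df = load_existing("scraped_tables.xlsx")
# Start the list with the old data only when there is some.
all_tables = [existing_df] if existing_df is not None else []
```

With this shape, existing_df is always defined, so later checks such as "existing_df is not None" are safe even on a first run.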

7. Fetching Main Webpage Content:
We open the outer try block and, inside it, use requests.get to fetch the content of the main webpage, saving it in the html_content variable.

In [None]:
try:
    response = requests.get(main_url)
    html_content = response.text
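The call above has no timeout and does not check the HTTP status, so a hung connection or an error page could go unnoticed. Here is a minimal sketch of a slightly more defensive fetch. The helper name fetch_html and the 30-second timeout are assumptions chosen for illustration, not part of the original script.

```python
import requests


def fetch_html(url, timeout=30):
    """Return the page body as text, raising on network or HTTP errors.

    fetch_html and the default timeout are hypothetical choices for this sketch.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    return response.text
```

Inside the outer try block, html_content = fetch_html(main_url) would then replace the two lines above, and any failure would still end up in the same except clause.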

8. Parsing HTML with BeautifulSoup:
We then create the BeautifulSoup object to make sense of the HTML structure of the webpage.

In [None]:
    soup = BeautifulSoup(html_content, 'html.parser')

9. Finding Relevant Links (using a lambda function):
Next, we need the links to the specific pages that contain the tables we want. We collect every <a> element whose href attribute starts with the string "city_weather_test_try_warnings.php?id=".
A lambda function checks that condition (its "href and" part guards against anchors that have no href at all), and the matching href values are saved in the links list.

In [None]:
    links = [a['href'] for a in soup.find_all('a', href=lambda href: href and href.startswith('city_weather_test_try_warnings.php?id='))]
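To see the filter in action without hitting the network, here is a small sketch that runs the same kind of href test on a made-up HTML snippet. The sample markup, the ids in it, and the function name is_city_link are all invented for illustration; the real menu page will look different.

```python
from bs4 import BeautifulSoup

PREFIX = "city_weather_test_try_warnings.php?id="

# Made-up HTML standing in for the real menu page.
sample_html = """
<ul>
  <li><a href="city_weather_test_try_warnings.php?id=11111">City A</a></li>
  <li><a href="about.php">About</a></li>
  <li><a>No href here</a></li>
  <li><a href="city_weather_test_try_warnings.php?id=22222">City B</a></li>
</ul>
"""

soup = BeautifulSoup(sample_html, "html.parser")


def is_city_link(href):
    # The filter may be called with None for <a> tags without an href,
    # so check for that before calling startswith.
    return href is not None and href.startswith(PREFIX)


links = [a["href"] for a in soup.find_all("a", href=is_city_link)]
print(links)
# ['city_weather_test_try_warnings.php?id=11111', 'city_weather_test_try_warnings.php?id=22222']
```

A named function does the same job as the lambda; it is just easier to comment and to test on its own.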

10. Looping Through Each Link:
We start iterating over every link in the links list.

In [None]:
    for link in links:

11. Constructing Full URL:
For each link, we build the complete URL by prepending the base URL with string concatenation and store the result in full_url.

In [None]:
        full_url = 'https://city.imd.gov.in/citywx/' + link
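String concatenation works here because every link is relative to the same folder. As an alternative sketch, the standard library's urljoin resolves a relative link against the page it came from, so the base URL does not have to be typed a second time. The id value below is hypothetical.

```python
from urllib.parse import urljoin

main_url = "https://city.imd.gov.in/citywx/menu_test.php"
link = "city_weather_test_try_warnings.php?id=12345"  # hypothetical id

# urljoin replaces the last path segment of main_url with the relative link.
full_url = urljoin(main_url, link)
print(full_url)
# https://city.imd.gov.in/citywx/city_weather_test_try_warnings.php?id=12345
```

urljoin also copes with absolute links and with links that start with "/", which plain concatenation would get wrong.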

12. Checking if Data Already Exists (in Excel):
We then check whether this page has already been collected, provided that existing_df was loaded (it is not None). If full_url appears among the column names of the existing DataFrame, we skip the link and move on. The idea is to resume from where the script stopped, which saves us a bit of time. This assumes that the full URL of each table is stored as a column name in the saved file; the steps shown here do not add one, so you would need to store it yourself for the check to match anything.

In [None]:
        if existing_df is not None and full_url in existing_df.columns:
            print(f"Skipping URL: {full_url} - Data already exists.")
            continue
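One way to make the resume check match is to keep the page address in a dedicated column and look for the URL among that column's values rather than among the column names. The sketch below is a variation on the check above, not the original method: the column name source_url, the helper name already_scraped, and the sample rows are all assumptions made for illustration.

```python
import pandas as pd


def already_scraped(existing_df, url, column="source_url"):
    """True if `url` appears in the (hypothetical) source_url column of existing_df."""
    if existing_df is None or column not in existing_df.columns:
        return False
    return url in set(existing_df[column])


# Made-up rows standing in for data saved by a previous run.
previous = pd.DataFrame({
    "Date": ["01-Jan", "02-Jan"],
    "source_url": ["https://example.invalid/page?id=1"] * 2,
})

print(already_scraped(previous, "https://example.invalid/page?id=1"))  # True
print(already_scraped(previous, "https://example.invalid/page?id=2"))  # False
print(already_scraped(None, "https://example.invalid/page?id=1"))      # False
```

For this to work, each scraped table has to be given a source_url column before it is appended to all_tables.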

13. Attempting to Read HTML Tables (Try Block):
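We use another try block around pd.read_html, which downloads the page and returns every HTML table it finds as a list of DataFrames. We store that list in the variable called tables. This can fail, so we want to catch those failures.

In [None]:
        try:
            tables = pd.read_html(full_url)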

14. Skipping Pages with Insufficient Tables:
We check whether tables contains at least two tables. If not, we print a message saying that the page is being skipped and move on to the next link.

In [None]:
            if len(tables) < 2:
                print(f"Skipping URL: {full_url} - Insufficient tables.")
                continue

15. Extracting Relevant Tables:
We append every table except the first to the all_tables list. We skip the first one because it only holds the location, not the data we are interested in.

In [None]:
            for table in tables[1:]:
                all_tables.append(table)
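Steps 14 and 15 together amount to "drop the first table, keep the rest". The sketch below shows that selection on hand-built DataFrames, and also tags each kept table with the page it came from. The helper name select_data_tables, the source_url column, and the sample values are hypothetical and only meant to illustrate the idea.

```python
import pandas as pd


def select_data_tables(tables, source_url):
    """Drop the first (location) table and tag the rest with their page URL.

    Returns an empty list when the page has fewer than two tables.
    """
    if len(tables) < 2:
        return []
    tagged = []
    for table in tables[1:]:
        table = table.copy()  # leave the caller's DataFrame untouched
        table["source_url"] = source_url  # hypothetical bookkeeping column
        tagged.append(table)
    return tagged


# Made-up tables standing in for what pd.read_html might return.
location = pd.DataFrame({"Station": ["Sample City"]})
forecast = pd.DataFrame({"Date": ["01-Jan", "02-Jan"], "Max Temp": [25.0, 26.0]})

result = select_data_tables([location, forecast], "https://example.invalid/page?id=1")
print(len(result))              # 1
print(list(result[0].columns))  # ['Date', 'Max Temp', 'source_url']
```

In the loop, all_tables.extend(select_data_tables(tables, full_url)) would then take the place of the length check and the inner for loop.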

16. Handling Errors Loading Tables:
If pd.read_html raises a ValueError (for example, when the page contains no tables), we catch it here, print a message to the console saying which page could not be read, and skip to the next link with the continue keyword.

In [None]:
        except ValueError as e:
            print(f"Error reading tables from {full_url}: {e}")
            continue

17. Handling Errors Fetching/Parsing Main Page:
Outside the loop, the except clause of the outer try block catches any other kind of exception, so we print an error if we can't fetch or parse the HTML of the main page.

In [None]:
except Exception as e:
    print(f"An error occurred: {e}")

18. Saving Data to Excel (Finally Block):
Last comes the finally block, which always executes, whether or not an exception occurred in the try block. If all_tables is not empty, we concatenate its contents into a new DataFrame called final_df and save it to the Excel file, then print a message letting the user know the file has been updated. If no tables were scraped, for example because an error occurred early on, we print a message to the console saying that no tables were found during scraping.

In [None]:
finally:
    if all_tables:
        final_df = pd.concat(all_tables, ignore_index=True)
        final_df.to_excel(excel_file, index=False)
        print("Data saved to scraped_tables.xlsx")
    else:
        print("No tables found or an error occurred during scraping.")
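Because the cells above are fragments of a single script, it can help to see them assembled with their nesting in place. The sketch below follows the steps of this write-up, with a few assumptions of ours: everything is wrapped in a hypothetical main() function, the settings are written as upper-case constants, and existing_df starts out as None so the resume check is safe on a first run.

```python
import os

import pandas as pd
import requests
from bs4 import BeautifulSoup

MAIN_URL = "https://city.imd.gov.in/citywx/menu_test.php"
BASE_URL = "https://city.imd.gov.in/citywx/"
LINK_PREFIX = "city_weather_test_try_warnings.php?id="
EXCEL_FILE = "scraped_tables.xlsx"


def main():
    all_tables = []
    existing_df = None  # stays None when there is nothing to resume from

    # Steps 5-6: reuse data from a previous run if the file can be read.
    if os.path.exists(EXCEL_FILE):
        try:
            existing_df = pd.read_excel(EXCEL_FILE)
            all_tables = [existing_df]
            print(f"Resuming scraping from existing data in {EXCEL_FILE}")
        except Exception as e:
            print(f"Error reading existing data: {e}. Starting from scratch.")

    try:
        # Steps 7-9: fetch the menu page and collect the relevant links.
        response = requests.get(MAIN_URL)
        soup = BeautifulSoup(response.text, "html.parser")
        links = [a["href"] for a in soup.find_all(
            "a", href=lambda href: href and href.startswith(LINK_PREFIX))]

        # Steps 10-16: visit each page and keep every table but the first.
        for link in links:
            full_url = BASE_URL + link
            if existing_df is not None and full_url in existing_df.columns:
                print(f"Skipping URL: {full_url} - Data already exists.")
                continue
            try:
                tables = pd.read_html(full_url)
                if len(tables) < 2:
                    print(f"Skipping URL: {full_url} - Insufficient tables.")
                    continue
                for table in tables[1:]:
                    all_tables.append(table)
            except ValueError as e:
                print(f"Error reading tables from {full_url}: {e}")
                continue
    except Exception as e:
        # Step 17: anything else that went wrong while fetching or parsing.
        print(f"An error occurred: {e}")
    finally:
        # Step 18: save whatever was collected, even after an error.
        if all_tables:
            final_df = pd.concat(all_tables, ignore_index=True)
            final_df.to_excel(EXCEL_FILE, index=False)
            print(f"Data saved to {EXCEL_FILE}")
        else:
            print("No tables found or an error occurred during scraping.")
```

Calling main() runs the whole pipeline. Keeping it in a function means importing the file does not start a scrape by accident.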