## Web-scraping for Indian Railways Station list

In this notebook, I extract the list of Indian Railways stations from the [irfca.org website](https://irfca.org/apps/station_codes?page=1) into a pandas DataFrame.

This is just a little experiment.

In [1]:
# Import necessary libraries

import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

In [2]:
# Prepare our delicious soup

link = 'https://irfca.org/apps/station_codes?page=1'

req = requests.get(link)
soup = BeautifulSoup(req.content, 'html.parser')    # name the parser explicitly so results do not depend on what is installed
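
The request above assumes the page always loads successfully. A slightly more defensive version might look like the sketch below. The helper names `parse_html` and `fetch_soup` and the 10-second timeout are my own assumptions, not part of the original notebook. Only the parsing step is exercised here, on a tiny hand-written snippet, so the example runs without network access.

```python
import requests
from bs4 import BeautifulSoup


def parse_html(html):
    """Parse raw HTML with Python's built-in parser (no extra install needed)."""
    return BeautifulSoup(html, "html.parser")


def fetch_soup(url, timeout=10):
    """Download `url` and return the parsed page.

    Raises requests.HTTPError on a 4xx/5xx response instead of
    silently parsing an error page. The timeout value is an assumption.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return parse_html(response.content)


# Offline check of the parsing step on a hand-written snippet
demo = parse_html("<html><head><title>Demo page</title></head></html>")
print(demo.title.string)
```

In the notebook itself, `soup = fetch_soup(link)` would then stand in for the two lines that call `requests.get` and `BeautifulSoup`.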

#### Let's have a look at the soup

In [3]:
soup

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<title>                                  [IRFCA] Indian Railways Station Codes Index                                                   			    </title>
<meta content="" name="description"/>
<meta content="" name="author"/>
<!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
<!--[if lt IE 9]>
		<script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>
		<![endif]-->
<link href="/apps/stylesheets/bootstrap-app.css?1522865253" media="screen" rel="stylesheet" type="text/css"/>
<link href="/apps/stylesheets/irfca-app.css?1522865256" media="screen" rel="stylesheet" type="text/css"/>
<script src="/apps/javascripts/jquery-1.5.2.min.js?1522865248" type="text/javascript"></script>
<script src="/apps/javascripts/bootstrap-dropdown.js?1522865243" type="text/javascript"></script>
<script src="/apps/javascripts/bootstrap-modal.js?1522865243" type="text/javascript"></script>
<!-- Le fav and touch icons -->
<link hr

In [4]:
# <tbody> holds the table rows we need to extract

soup.find('tbody')

<tbody>
<tr>
<td>AA</td>
<td>Ataria</td>
<td><a href="#" rel="twipsy" title="Uttar Pradesh">UP</a></td>
<td><a href="#" rel="twipsy" title="North Eastern Railway">NER</a></td>
</tr>
<tr>
<td>AADR</td>
<td>Amb Andaura</td>
<td><a href="#" rel="twipsy" title=""></a></td>
<td><a href="#" rel="twipsy" title=""></a></td>
</tr>
<tr>
<td>AAG</td>
<td>Angar</td>
<td><a href="#" rel="twipsy" title="Maharashtra">MH</a></td>
<td><a href="#" rel="twipsy" title="Central Railway">CR</a></td>
</tr>
<tr>
<td>AAH</td>
<td>Itehar</td>
<td><a href="#" rel="twipsy" title="Madhya Pradesh">MP</a></td>
<td><a href="#" rel="twipsy" title="North Central Railway">NCR</a></td>
</tr>
<tr>
<td>AAK</td>
<td>Ankaikila</td>
<td><a href="#" rel="twipsy" title=""></a></td>
<td><a href="#" rel="twipsy" title=""></a></td>
</tr>
<tr>
<td>AAL</td>
<td>Amlai</td>
<td><a href="#" rel="twipsy" title="Madhya Pradesh">MP</a></td>
<td><a href="#" rel="twipsy" title="South East Central Railway">SECR</a></td>
</tr>
<tr>
<td>AAM</t

In [5]:
# Each row inside <tbody> is a <tr> tag

soup.find('tbody').find_all('tr')

[<tr>
 <td>AA</td>
 <td>Ataria</td>
 <td><a href="#" rel="twipsy" title="Uttar Pradesh">UP</a></td>
 <td><a href="#" rel="twipsy" title="North Eastern Railway">NER</a></td>
 </tr>, <tr>
 <td>AADR</td>
 <td>Amb Andaura</td>
 <td><a href="#" rel="twipsy" title=""></a></td>
 <td><a href="#" rel="twipsy" title=""></a></td>
 </tr>, <tr>
 <td>AAG</td>
 <td>Angar</td>
 <td><a href="#" rel="twipsy" title="Maharashtra">MH</a></td>
 <td><a href="#" rel="twipsy" title="Central Railway">CR</a></td>
 </tr>, <tr>
 <td>AAH</td>
 <td>Itehar</td>
 <td><a href="#" rel="twipsy" title="Madhya Pradesh">MP</a></td>
 <td><a href="#" rel="twipsy" title="North Central Railway">NCR</a></td>
 </tr>, <tr>
 <td>AAK</td>
 <td>Ankaikila</td>
 <td><a href="#" rel="twipsy" title=""></a></td>
 <td><a href="#" rel="twipsy" title=""></a></td>
 </tr>, <tr>
 <td>AAL</td>
 <td>Amlai</td>
 <td><a href="#" rel="twipsy" title="Madhya Pradesh">MP</a></td>
 <td><a href="#" rel="twipsy" title="South East Central Railway">SECR</a>
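
To make the row structure concrete, here is a small sketch that turns the first two rows shown above into dictionaries. The HTML is copied from the output, wrapped in a `<table>` so it parses on its own. The helper `cell_text` is a name I made up, and returning `None` for an empty cell is just one possible convention.

```python
from bs4 import BeautifulSoup

# First two rows from the output above, wrapped in <table> for a standalone parse
sample_html = """
<table><tbody>
<tr>
<td>AA</td>
<td>Ataria</td>
<td><a href="#" rel="twipsy" title="Uttar Pradesh">UP</a></td>
<td><a href="#" rel="twipsy" title="North Eastern Railway">NER</a></td>
</tr>
<tr>
<td>AADR</td>
<td>Amb Andaura</td>
<td><a href="#" rel="twipsy" title=""></a></td>
<td><a href="#" rel="twipsy" title=""></a></td>
</tr>
</tbody></table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")


def cell_text(cell):
    """Return the visible text of a <td>, or None when the cell is empty."""
    text = cell.get_text(strip=True)
    return text or None


records = []
for tr in sample_soup.find("tbody").find_all("tr"):
    code, name, state, zone = (cell_text(td) for td in tr.find_all("td"))
    records.append({"station_code": code, "station_name": name,
                    "state": state, "zone": zone})

for record in records:
    print(record)
```

The second row shows why missing values need handling: its `<a>` tags for state and zone are present but empty.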

#### All right. This seems feasible. Let's write the code afresh to extract data into our dataframe.

In [6]:
link = 'https://irfca.org/apps/station_codes?page='   # The page number is appended inside the loop
rows = []                                             # One dict per station, used to build the df


def first_content(cells, index, in_link=False):
    # Return the first child of cells[index] (or of its <a> tag when in_link is True).
    # Falls back to NaN when the cell, the <a> tag or its text is missing.
    try:
        tag = cells[index].find('a') if in_link else cells[index]
        return tag.contents[0]
    except (IndexError, AttributeError):
        return np.nan


# On the website, we noticed that there are 41 pages available. We will read them all.
for page in range(1, 42):

    req = requests.get(link + str(page))              # Read each page
    soup = BeautifulSoup(req.content, 'html.parser')

    # Each <tr> inside <tbody> is one station; its <td> cells hold the fields
    for tr in soup.find('tbody').find_all('tr'):

        cells = tr.find_all('td')

        # Append the fields to the list
        rows.append({'station_code' : first_content(cells, 0),
                     'station_name' : first_content(cells, 1),
                     'state' : first_content(cells, 2, in_link=True),
                     'zone' : first_content(cells, 3, in_link=True)
                    })
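
The page count of 41 is hard-coded above, so the loop would miss stations if the site ever grows. One alternative, sketched below, is to keep requesting pages until one comes back with no rows. This assumes the site returns an empty table past the last page, which I have not verified. The function name `scrape_until_empty`, the `max_pages` cap and the fake pages are all made up for illustration, and the fetcher is passed in so the idea can be tried without touching the real site.

```python
from bs4 import BeautifulSoup


def scrape_until_empty(fetch_page, max_pages=200):
    """Collect cell texts from page 1, 2, ... until a page has no table rows.

    `fetch_page(page_number)` must return a BeautifulSoup object.
    `max_pages` is a safety cap so a misbehaving site cannot loop forever.
    """
    all_rows = []
    for page in range(1, max_pages + 1):
        tbody = fetch_page(page).find("tbody")
        trs = tbody.find_all("tr") if tbody is not None else []
        if not trs:
            break  # an empty page is taken to mean we ran past the last one
        for tr in trs:
            all_rows.append([td.get_text(strip=True) for td in tr.find_all("td")])
    return all_rows


# Stand-in for the real site: two pages of data, then empty tables
fake_pages = {
    1: "<table><tbody><tr><td>AA</td><td>Ataria</td></tr></tbody></table>",
    2: "<table><tbody><tr><td>AAG</td><td>Angar</td></tr></tbody></table>",
}


def fake_fetch(page):
    html = fake_pages.get(page, "<table><tbody></tbody></table>")
    return BeautifulSoup(html, "html.parser")


print(scrape_until_empty(fake_fetch))
```

Against the real site, `fetch_page` would wrap `requests.get(link + str(page))`, ideally with a short `time.sleep` between requests to stay polite to the server.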


#### Create a pandas DataFrame from the list

In [7]:
stations_df = pd.DataFrame(rows)
stations_df

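|      | station_code | station_name | state | zone |
|------|--------------|--------------|-------|------|
| 0    | AA           | Ataria       | UP    | NER  |
| 1    | AADR         | Amb Andaura  | NaN   | NaN  |
| 2    | AAG          | Angar        | MH    | CR   |
| 3    | AAH          | Itehar       | MP    | NCR  |
| 4    | AAK          | Ankaikila    | NaN   | NaN  |
| ...  | ...          | ...          | ...   | ...  |
| 8082 | ZPL          | Zangalapalle | AP    | SCR  |
| 8083 | ZRD          | Jiradei      | NaN   | NaN  |
| 8084 | ZRDE         | Jiradei      | BR    | NER  |
| 8085 | ZW           | Zawar        | RJ    | NWR  |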


In [8]:
stations_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8087 entries, 0 to 8086
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   station_code  8087 non-null   object
 1   station_name  8086 non-null   object
 2   state         7621 non-null   object
 3   zone          7573 non-null   object
dtypes: object(4)
memory usage: 252.8+ KB
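
The non-null counts above imply 8087 - 7621 = 466 stations without a state and 8087 - 7573 = 514 without a zone. A quick way to inspect such gaps is sketched below on a small stand-in frame built from the nine rows visible in the preview, not the full scrape. The variable names `sample`, `missing_per_column` and `shared_names` are my own.

```python
import numpy as np
import pandas as pd

# The nine rows visible in the preview; empty fields are NaN
sample = pd.DataFrame({
    "station_code": ["AA", "AADR", "AAG", "AAH", "AAK", "ZPL", "ZRD", "ZRDE", "ZW"],
    "station_name": ["Ataria", "Amb Andaura", "Angar", "Itehar", "Ankaikila",
                     "Zangalapalle", "Jiradei", "Jiradei", "Zawar"],
    "state": ["UP", np.nan, "MH", "MP", np.nan, "AP", np.nan, "BR", "RJ"],
    "zone": ["NER", np.nan, "CR", "NCR", np.nan, "SCR", np.nan, "NER", "NWR"],
})

# Count missing values per column
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Rows whose station name appears more than once (here, the two Jiradei codes)
shared_names = sample[sample.duplicated("station_name", keep=False)]
print(shared_names)
```

Running the same two checks on `stations_df` would show which stations lack a state or zone and which names are listed under more than one code.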


In [10]:
stations_df.to_csv('files/IR_stations.csv', index = False)
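
Two practical details when saving: `to_csv` does not create a missing `files/` folder for you, and when the file is read back with pandas' defaults, any text matching a default missing-value marker (such as `NA`) is turned into NaN. The sketch below handles both. The helpers `save_stations` and `load_stations` are names I made up, and the `NA` row is a hypothetical station used only to show the pitfall. The demo writes into a temporary folder rather than the notebook's real `files/` directory.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd


def save_stations(df, path):
    """Write `df` to `path` as CSV, creating the parent folder if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)


def load_stations(path):
    """Read the CSV back, treating only empty fields as missing."""
    return pd.read_csv(path, keep_default_na=False, na_values=[""])


# Hypothetical rows: "NA" is a made-up code to show the read-back pitfall
demo_df = pd.DataFrame({
    "station_code": ["AA", "NA"],
    "station_name": ["Ataria", "Made-up station"],
    "state": ["UP", np.nan],
    "zone": ["NER", np.nan],
})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "files" / "IR_stations.csv"
    save_stations(demo_df, csv_path)
    restored = load_stations(csv_path)

print(restored["station_code"].tolist())
print(restored["state"].isna().tolist())
```

Whether the real data contains such a code is something I have not checked; `stations_df["station_code"].isin(["NA", "NULL", "NaN"]).any()` would tell.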