 
# Classification of Delhi Metro stations.
<img src='https://upload.wikimedia.org/wikipedia/commons/6/65/Delhi_Metro_logo.svg' style="width: 150px;"/>

## Introduction

Delhi Metro is a rapid transit system serving Delhi and its satellite cities in the National Capital Region of India. As of now, there are a total of 229 metro stations including the Airport Express stations. The first section of the Delhi Metro opened on 25 December 2002 with the Red Line,[2] and the network has since been expanded to around 347.66 km (216.03 miles) of route length as of 4 October 2019. The network has nine operational lines and is built and operated by the Delhi Metro Rail Corporation Limited (DMRC). DMRC runs 2,700 trips per day carrying 1.5 million passengers, who travel 17 kilometres each on average.

For this project, we will look at the places surrounding these metro stations and classify the stations according to the similarity of their nearby venues. People use the metro to travel from one place to another for reasons that can be personal or professional. If a station is surrounded mostly by professional places such as companies and offices, it will mostly be used by working professionals. Stations with many universities or colleges nearby are used mostly by students. Stations near places like amusement parks, malls, and monuments are used by people for recreation.

We can classify stations by primary usage by analyzing data on the number of nearby venues in each category. This can help plan further extensions of the network and find places for new development.
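To make the idea of "primary usage" concrete, here is a minimal sketch that labels a station by its dominant venue type. It is a simplified stand-in and not the clustering used later in the project. The category names, the `USAGE_BY_CATEGORY` mapping, the `primary_usage` helper, and the example counts are all assumptions made up for illustration.

```python
from collections import Counter

# Hypothetical mapping from venue category to a broad usage type (an assumption).
USAGE_BY_CATEGORY = {
    "Office": "professional",
    "Coworking Space": "professional",
    "University": "education",
    "College": "education",
    "Shopping Mall": "recreation",
    "Monument": "recreation",
    "Theme Park": "recreation",
}


def primary_usage(venue_counts):
    """Return the usage type with the most nearby venues, or None if no category matches."""
    totals = Counter()
    for category, count in venue_counts.items():
        usage = USAGE_BY_CATEGORY.get(category)
        if usage is not None:
            totals[usage] += count
    if not totals:
        return None
    return totals.most_common(1)[0][0]


# Made-up venue counts for an unnamed example station.
example_counts = {"Office": 12, "Shopping Mall": 3, "College": 1}
print(primary_usage(example_counts))
```

A rule like this needs a hand-written category mapping, which is one reason a clustering approach on the venue counts may be preferable.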

## Data

In this section we describe the base data that we will analyze to reach this goal.

Let us import the libraries that we will need for the rest of the project.

In [6]:
import numpy as np # library to handle data in a vectorized manner

import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

import json # library to handle JSON files

from geopy.geocoders import Nominatim 
# convert an address into latitude and longitude values

import requests # library to handle requests

# Matplotlib and associated plotting modules
import matplotlib.colors as colors
import matplotlib.pyplot as plt

# import k-means from clustering stage
from sklearn.cluster import KMeans

#!conda install -c conda-forge folium=0.5.0 --yes # uncomment this line if folium is not installed yet
import folium # map rendering library

print('Libraries imported.')

Libraries imported.


Because we are working with place-based data, it is helpful to visualize it on a map. Since our project revolves around Delhi, the National Capital of India, we will visualize a simple map of Delhi using the folium library.

We will need the latitude and longitude of Delhi to center the map. To get these coordinates we use geopy.geocoders, which can geocode a given address.

In [8]:
address = 'Delhi, India'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Delhi are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Delhi are 28.6517178, 77.2219388.


In [9]:
map_delhi = folium.Map(location=[latitude, longitude], zoom_start=11)  

map_delhi

Now we need data containing the list of all the metro stations under DMRC in Delhi. For this we will use the following URL:

https://en.wikipedia.org/wiki/List_of_Delhi_Metro_stations

We have to scrape the relevant table data from this URL, such as the station name and line.

#### Assumption
- Some stations have more than one line passing through them. For these we keep only the line listed first on the above URL, which avoids ambiguity when plotting the stations on the map.

In [11]:
from bs4 import BeautifulSoup

wiki_url = 'https://en.wikipedia.org/wiki/List_of_Delhi_Metro_stations'
wiki_page = requests.get(wiki_url).text
wiki_doc = BeautifulSoup(wiki_page, 'lxml')

# All rows of the first sortable wikitable on the page
rows = wiki_doc.find('table', {'class': 'wikitable sortable'}).find_all('tr')

df = pd.DataFrame()

lst = []  # one JSON-ready record per station
Station = []
Line = []
for row in rows[1:]:  # skip the header row
    items = row.find_all('td')
    if len(items) != 8:
        continue
    try:
        # Extract both fields before appending so the two lists stay the same length
        name = str(items[0].find('a').contents[0])
        line = str(items[2].find('a').find('span').find('b').contents[0])
    except Exception:
        continue  # skip rows that do not match the expected layout
    Station.append(name)
    Line.append(line)
    lst.append({"name": name,
                "details": {"line": "[%s]" % line,
                            "latitude": 0.0,
                            "longitude": 0.0}})

# Save the scraped records; the coordinates are placeholders at this stage
with open('metro.json', 'w') as f:
    json.dump(lst, f, indent=4)

In [12]:
print(len(Station))
print(len(Line))

228
228


We add the stations and their corresponding lines to the empty dataframe 'df' created above.

In [13]:
df['Station']=Station
df['Line']=Line
df

Unnamed: 0,Station,Line
0,Adarsh Nagar,Yellow Line
1,AIIMS,Yellow Line
2,Akshardham,Blue Line
3,Anand Vihar ISBT,Blue Line branch
4,Arjan Garh,Yellow Line
5,Arthala,Red Line
6,Ashok Park Main,Green Line
7,Ashram,Pink Line
8,Azadpur,Yellow Line
9,Badarpur Border,Violet Line


We now have all the stations and lines in our dataframe, but we also need each station's coordinates, which are unique to it and can be used to plot it on the map.

Again we will use the geocoder to get the corresponding latitude and longitude values of every station.

#### Assumption
- Geocoding by address has a few shortcomings. For example, more than one place can share the same name: 'Gandhinagar' is in Gujarat as well as in Delhi. So we use a few try/except blocks that search from the most specific query to the least specific one.
- If we still cannot find the coordinates, we use None for the latitude and longitude values. The values are stored in two lists and then added to the dataframe.

In [14]:
Latitude = []
Longitude = []
geolocator = Nominatim(user_agent="ny_explorer")  # one geocoder is enough for all queries

# Address templates, ordered from most specific to least specific
templates = ["{} metro station, Delhi, India", "{}, Delhi, India", "{}, India"]

for stat in Station:
    lat, long = None, None  # stays None if every query fails
    for template in templates:
        try:
            location = geolocator.geocode(template.format(stat))
            # geocode returns None when nothing is found, which raises AttributeError here
            lat, long = location.latitude, location.longitude
            break
        except Exception:
            continue  # fall back to the next, less specific query
    Latitude.append(lat)
    Longitude.append(long)
print(Latitude)
print(Longitude)





[28.7144008, 28.5668602, 28.61784195, 28.6467533, 28.4807352, 28.676999, 28.6716045, 28.5724231, 28.7076568, 28.4904999, None, 28.6907847, 28.6297676, 28.3858361, 28.56790025, None, 28.6974603, 28.6158794, 28.6605039, 28.6501605, 28.5067242, 28.5381411, 28.6768508, 28.6157555, 28.6018751, 28.5487982, 28.593833099999998, 28.639203600000002, 28.5918905, 28.67588615, 28.5894384, 28.5771915, 28.61931, 28.5656109, 28.574272, 28.5810577, 28.5864469, 28.5922362, 28.597009, 28.60225285, 28.5518376, 28.6646964, 28.6200437, 28.3702344, 28.6580737, 28.6939642, 28.493751, 28.7024752, 28.59778075, 28.5443766, 28.5418777, 28.5585815, 28.6981317, 28.4820212, 28.7301214, 28.5431246, 28.5442564, 28.5887494, 28.878965, 28.4593429, 28.4723277, 28.57440755, 28.6517178, 28.55489735, 28.6206019, 28.628899099999998, 28.6305091, 28.6826822, 28.7259717, 28.65001015, 28.55849025, 28.633021, 28.6289502, 28.582457050000002, 28.6088598, 28.5384251, 28.5458279, 28.58337705, 28.6443188, 28.6757033, None, 28.58823925

In [15]:
df['Latitude'] = Latitude
df['Longitude'] = Longitude
df.head(20)

Unnamed: 0,Station,Line,Latitude,Longitude
0,Adarsh Nagar,Yellow Line,28.714401,77.167288
1,AIIMS,Yellow Line,28.56686,77.207806
2,Akshardham,Blue Line,28.617842,77.279488
3,Anand Vihar ISBT,Blue Line branch,28.646753,77.318004
4,Arjan Garh,Yellow Line,28.480735,77.125762
5,Arthala,Red Line,28.676999,77.391892
6,Ashok Park Main,Green Line,28.671605,77.155291
7,Ashram,Pink Line,28.572423,77.258598
8,Azadpur,Yellow Line,28.707657,77.175547
9,Badarpur Border,Violet Line,28.4905,77.304038


In [16]:
# df.to_csv('DELHI_METRO_DATA.csv',index=False)
df=pd.read_csv('DELHI_METRO_DATA.csv')
df

Unnamed: 0,Station,Line,Latitude,Longitude
0,Adarsh Nagar,Yellow Line,28.714401,77.167288
1,AIIMS,Yellow Line,28.56686,77.207806
2,Akshardham,Blue Line,28.617842,77.279488
3,Anand Vihar ISBT,Blue Line branch,28.646753,77.318004
4,Arjan Garh,Yellow Line,28.480735,77.125762
5,Arthala,Red Line,28.676999,77.391892
6,Ashok Park Main,Green Line,28.671604,77.155291
7,Ashram,Pink Line,28.572423,77.258598
8,Azadpur,Yellow Line,28.707657,77.175547
9,Badarpur Border,Violet Line,28.4905,77.304038


We convert each line name to an integer value so that it is easier to work with.
- The Blue Line and the Green Line each have a branch; for simplicity we treat a branch as the same line.

In [17]:
linetonum = {"Yellow Line": 1, "Red Line": 2,"Blue Line": 3,'Blue Line branch':3, "Pink Line": 4,"Magenta Line": 5, "Green Line": 6,'Green Line branch':6, "Violet Line": 7, "Orange Line": 8,"Grey Line": 9}
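The branch rule above could also be written as a small function rather than as extra dictionary entries. The following is only a sketch under that assumption: `BASE_LINE_NUMBERS` and `line_number` are hypothetical names, and the integer codes are copied from `linetonum`.

```python
# Integer codes copied from the linetonum dictionary, without the branch entries.
BASE_LINE_NUMBERS = {
    "Yellow Line": 1, "Red Line": 2, "Blue Line": 3, "Pink Line": 4,
    "Magenta Line": 5, "Green Line": 6, "Violet Line": 7,
    "Orange Line": 8, "Grey Line": 9,
}


def line_number(line_name):
    """Map a line name to its integer code, treating 'X branch' as line 'X'."""
    suffix = " branch"
    if line_name.endswith(suffix):
        line_name = line_name[:-len(suffix)]
    return BASE_LINE_NUMBERS[line_name]


print(line_number("Blue Line branch"))  # same code as "Blue Line"
```

This way a branch can never be given a different code from its main line by mistake.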

Our data contains a few NaN values where our code was unable to find the coordinates, so we remove those rows. Alternatively, you can replace them with coordinates found on the web.

In [18]:
data = df.dropna(axis=0)
data

Unnamed: 0,Station,Line,Latitude,Longitude
0,Adarsh Nagar,Yellow Line,28.714401,77.167288
1,AIIMS,Yellow Line,28.56686,77.207806
2,Akshardham,Blue Line,28.617842,77.279488
3,Anand Vihar ISBT,Blue Line branch,28.646753,77.318004
4,Arjan Garh,Yellow Line,28.480735,77.125762
5,Arthala,Red Line,28.676999,77.391892
6,Ashok Park Main,Green Line,28.671604,77.155291
7,Ashram,Pink Line,28.572423,77.258598
8,Azadpur,Yellow Line,28.707657,77.175547
9,Badarpur Border,Violet Line,28.4905,77.304038
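As an alternative to dropping rows, missing coordinates could be filled in from a hand-maintained lookup. The sketch below uses plain Python records instead of the dataframe so that it runs on its own, and it uses None where the dataframe would hold NaN. The `fill_missing` helper, the "Example Station" entry, and its coordinates are hypothetical placeholders, not real data.

```python
# Placeholder coordinates only; real values would have to be looked up by hand.
manual_coordinates = {
    "Example Station": (28.0, 77.0),
}


def fill_missing(rows, overrides):
    """Return copies of the rows with missing coordinates filled from overrides."""
    filled = []
    for row in rows:
        row = dict(row)  # copy so the input rows are left unchanged
        missing = row["Latitude"] is None or row["Longitude"] is None
        if missing and row["Station"] in overrides:
            row["Latitude"], row["Longitude"] = overrides[row["Station"]]
        filled.append(row)
    return filled


rows = [
    {"Station": "Adarsh Nagar", "Latitude": 28.714401, "Longitude": 77.167288},
    {"Station": "Example Station", "Latitude": None, "Longitude": None},
]
filled_rows = fill_missing(rows, manual_coordinates)
print(filled_rows[1])
```

Stations that are still missing after this step would then be dropped as before.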


In [19]:
data = data.replace({"Line": linetonum})  # reassign instead of inplace=True to avoid SettingWithCopyWarning
data


Unnamed: 0,Station,Line,Latitude,Longitude
0,Adarsh Nagar,1,28.714401,77.167288
1,AIIMS,1,28.56686,77.207806
2,Akshardham,3,28.617842,77.279488
3,Anand Vihar ISBT,3,28.646753,77.318004
4,Arjan Garh,1,28.480735,77.125762
5,Arthala,2,28.676999,77.391892
6,Ashok Park Main,6,28.671604,77.155291
7,Ashram,4,28.572423,77.258598
8,Azadpur,1,28.707657,77.175547
9,Badarpur Border,7,28.4905,77.304038


For a clear visualization we draw each line in its actual color, given as a hexadecimal code. For example, stations on the 'Red Line' are marked in red.

In [20]:
colors_dict = {1:'#FFFF00', 2:'#FF0000',3:'#0000FF', 4:'#FFC0CB',5:'#FF00FF', 6:'#008000',7:'#EE82EE', 8:'#FFA500',9:'#808080'} 

We reuse the geocoding code to get the coordinates of Delhi.

In [21]:
address = 'Delhi, India'

geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of Delhi are {}, {}.'.format(latitude, longitude))

The geographical coordinates of Delhi are 28.6517178, 77.2219388.


We plot the map with every station marked in the color of its line, using the folium library.

In [22]:
map_delhi_metro = folium.Map(location=[latitude, longitude], zoom_start=10)

for line, station, lat,long in zip(data['Line'], data['Station'],data['Latitude'], data['Longitude']):
    folium.Circle(
        [lat,long],
        popup=station,
        radius=20,
        color=colors_dict[line]
    ).add_to(map_delhi_metro)
map_delhi_metro.save(outfile='Delhi_metro_stations.html')
map_delhi_metro

Zooming out of the map reveals an outlier whose geocoded coordinates are badly wrong. The station named 'Lal Qila' is plotted near Telangana state, so we correct it by replacing the values with the real coordinates located on a map.

In [23]:
data.at[97,'Latitude'] = 28.656682
data.at[97,'Longitude'] = 77.236612
data


Unnamed: 0,Station,Line,Latitude,Longitude
0,Adarsh Nagar,1,28.714401,77.167288
1,AIIMS,1,28.56686,77.207806
2,Akshardham,3,28.617842,77.279488
3,Anand Vihar ISBT,3,28.646753,77.318004
4,Arjan Garh,1,28.480735,77.125762
5,Arthala,2,28.676999,77.391892
6,Ashok Park Main,6,28.671604,77.155291
7,Ashram,4,28.572423,77.258598
8,Azadpur,1,28.707657,77.175547
9,Badarpur Border,7,28.4905,77.304038
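Outliers like 'Lal Qila' were found here by inspecting the map, but they could also be flagged automatically by measuring each station's distance from the centre of Delhi. The following is a minimal sketch of that idea using the haversine formula. The 100 km threshold, the helper names, and the far-away test point are assumptions chosen for illustration.

```python
import math

DELHI = (28.6517178, 77.2219388)  # coordinates of Delhi returned by the geocoder above
MAX_DISTANCE_KM = 100             # assumed threshold; tune to the extent of the network


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def is_outlier(lat, lon, center=DELHI, max_km=MAX_DISTANCE_KM):
    """Return True if the point lies farther than max_km from the center."""
    return haversine_km(center[0], center[1], lat, lon) > max_km


print(is_outlier(28.714401, 77.167288))  # Adarsh Nagar, taken from the table above
print(is_outlier(17.4, 78.5))            # hypothetical point in southern India
```

Applied to every row, a check like this would list the stations whose coordinates need a manual correction.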


In [24]:
# Sort the rows of dataframe by column 'Line'
data_sort = data.sort_values(by ='Line')
data_sort

Unnamed: 0,Station,Line,Latitude,Longitude
0,Adarsh Nagar,1,28.714401,77.167288
46,Ghitorni,1,28.493751,77.149187
81,Jor Bagh,1,28.588239,77.216528
134,New Delhi,1,28.643641,77.221737
68,Jahangirpuri,1,28.725972,77.162658
61,INA,1,28.574408,77.210241
181,Samaypur Badli,1,28.744616,77.138265
180,Saket,1,28.524411,77.213725
60,IFFCO Chowk,1,28.472328,77.072422
177,Rohini Sector 18,1,28.738348,77.139832


In [25]:
data_sort.dtypes

Station       object
Line           int64
Latitude     float64
Longitude    float64
dtype: object

In [26]:
map_delhi_metro = folium.Map(location=[latitude, longitude], zoom_start=10)
#add markers
for line, station, lat,long in zip(data_sort['Line'], data_sort['Station'],data_sort['Latitude'], data_sort['Longitude']):
    folium.Circle(
        [lat,long],
        popup=station,
        radius=30,
        fill=True,
        color=colors_dict[line]
    ).add_to(map_delhi_metro)   
map_delhi_metro.save(outfile='metro_stations.html')
map_delhi_metro

Now we have a good visualization of each station, and we can easily trace the path of each line.