# Indian Railways

Let's explore and visualize data from one of the largest and most complex rail networks in the world: Indian Railways. The dataset comes from Kaggle and consists of three related files:

- trains.json
- stations.json
- schedules.json

First, let's take care of some imports.

In [None]:
import kagglehub
import pandas as pd
import geopandas as gpd
import json
from shapely.geometry import LineString

Let's download the dataset from Kaggle using `kagglehub`.

In [None]:
# Download latest version
path = kagglehub.dataset_download("sripaadsrinivasan/indian-railways-dataset")

print("Path to dataset files:", path)
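Before loading anything, it can help to confirm that the three files are where we expect them. The snippet below is a minimal sketch: `missing_files` and `EXPECTED_FILES` are hypothetical names introduced here, and the demo runs against a throwaway directory rather than the real download.

```python
import tempfile
from pathlib import Path

# The three files this notebook relies on.
EXPECTED_FILES = ["trains.json", "stations.json", "schedules.json"]

def missing_files(directory, names=EXPECTED_FILES):
    """Return the names from `names` that are not present in `directory`."""
    base = Path(directory)
    return [name for name in names if not (base / name).is_file()]

# Demo on a temporary directory that holds only one of the three files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "stations.json").write_text("{}")
    demo_missing = missing_files(tmp)

print(demo_missing)  # ['trains.json', 'schedules.json']
```

In the notebook itself, calling `missing_files(path)` should return an empty list if the download is complete.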

With `read_file` from the `geopandas` library, we can read a JSON file straight into a `GeoDataFrame`. How convenient!

In [None]:
stations_gdf = gpd.read_file(f"{path}/stations.json")

Let's inspect the `stations_gdf` using `head()`.

In [None]:
stations_gdf.head()

Unfortunately, reading `trains.json` the same way doesn't work. Its geometries are of type `LineString`, and some records contain only a single point, which makes parsing into a `GeoDataFrame` fail.
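To get a feel for how many records are affected, one could count the features whose coordinate list has fewer than two entries. This is only a sketch: `count_short_lines` is a hypothetical helper, and `sample` is a hand-made feature collection, not data from the real file.

```python
def count_short_lines(feature_collection):
    """Count features whose geometry has fewer than two coordinates."""
    return sum(
        1
        for feature in feature_collection["features"]
        if len(feature["geometry"]["coordinates"]) < 2
    )

# Hand-made sample, not taken from the real dataset.
sample = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "properties": {"name": "A"},
         "geometry": {"type": "LineString",
                      "coordinates": [[77.2, 28.6], [72.8, 19.0]]}},
        {"type": "Feature", "properties": {"name": "B"},
         "geometry": {"type": "LineString",
                      "coordinates": [[80.2, 13.0]]}},
    ],
}

print(count_short_lines(sample))  # 1
```

The same call can be made on the dict returned by `json.load` for `trains.json`.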

Thanks to the [Indian Railways EDA](https://www.kaggle.com/code/nilankardeb/indian-railways-eda) notebook on Kaggle, we have a function, `convert_to_gdf`, which sanitizes the input data and gives us a `GeoDataFrame`.

In [None]:
def convert_to_gdf(json_data, geometry_type):
    """Build a GeoDataFrame from a GeoJSON-like dict of Point or LineString features."""
    if geometry_type == 'Point':
        gdf = gpd.GeoDataFrame.from_features(features=json_data['features'])
    elif geometry_type == 'LineString':
        # fetch the column names based on `properties` keys
        properties_columns = list(json_data['features'][0]['properties'].keys())
        # fetch the values (rows) based on the `properties` values
        properties_vals = [list(i['properties'].values()) for i in json_data['features']]

        # a LineString needs at least two coordinates, so repeat a lone point
        geometry_col = [
            LineString(i['geometry']['coordinates'])
            if len(i['geometry']['coordinates']) >= 2
            # else Point(i['geometry']['coordinates'][0])
            else LineString([i['geometry']['coordinates'][0]] * 2)
            for i in json_data['features']
        ]

        df = pd.DataFrame(data=properties_vals, columns=properties_columns)
        df['geometry'] = geometry_col
        gdf = gpd.GeoDataFrame(df)
    else:
        # without this branch an unknown type would fail later with an unbound `gdf`
        raise ValueError(f"Unsupported geometry_type: {geometry_type!r}")

    # setting the CRS
    gdf = gdf.set_crs('EPSG:4326')

    return gdf

Note these lines in particular. When a feature has only a single coordinate, that coordinate is repeated so that there are the two points a `LineString` requires.
```
if len(i['geometry']['coordinates']) >= 2
# else Point(i['geometry']['coordinates'][0])
else LineString([i['geometry']['coordinates'][0]] * 2)
```
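The padding rule can be tried in isolation, without `shapely`. The sketch below uses a hypothetical helper, `pad_to_line`, and a made-up coordinate.

```python
def pad_to_line(coords):
    """Return at least two coordinates by repeating a lone point."""
    if len(coords) >= 2:
        return coords
    # An empty list would raise IndexError here, as it would in convert_to_gdf.
    return [coords[0]] * 2

single = [[77.2, 28.6]]  # made-up coordinate, not from the dataset
print(pad_to_line(single))  # [[77.2, 28.6], [77.2, 28.6]]
```

A line padded this way starts and ends at the same point, so it has zero length. `convert_to_gdf` applies the same rule inline when it builds each `LineString`.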
Now, let's use `convert_to_gdf` to initialize `trains_gdf`.

In [None]:
with open(f"{path}/trains.json", mode='r') as json_file:
    trains_gdf = convert_to_gdf(json.load(json_file), geometry_type="LineString")

Let's inspect the `trains_gdf` using `head()`.

In [None]:
trains_gdf.head()

`schedules.json` contains no geometry, so we will simply load it into a regular pandas `DataFrame`.

In [None]:
with open(f"{path}/schedules.json", mode='r') as json_file:
    schedules = json.load(json_file)

# a list of dicts maps directly onto rows, with the keys as column names
schedules_df = pd.DataFrame(schedules)
schedules_df.head()
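Loading a list of records into a table assumes that every record carries the same fields. A quick check for that could look like the sketch below; `records_share_keys` is a hypothetical helper, and the field names in `sample_records` are invented for illustration and are not the dataset's real schema.

```python
def records_share_keys(records):
    """True if every dict in `records` has the same set of keys as the first."""
    expected = set(records[0])
    return all(set(record) == expected for record in records)

# Invented records, only to exercise the check.
sample_records = [
    {"train": "T1", "station": "AAA", "day": 1},
    {"train": "T1", "station": "BBB", "day": 1},
    {"train": "T2", "station": "CCC"},  # "day" is missing here
]

print(records_share_keys(sample_records[:2]))  # True
print(records_share_keys(sample_records))      # False
```

On the real data this would be called as `records_share_keys(schedules)`.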