# Data Preparation to Support Analytical Use-Case 

Background:
- This data preparation supports an analysis for the Google Data Analyst Certification Case Study. 
- You can learn more about the case study here: https://1drv.ms/b/s!Aj-vT8FJDVrhl32VsKmkrCyMk4-X?e=jR3Hml


About the Data: 
- Monthly Divvy bike trip data is published here: https://divvy-tripdata.s3.amazonaws.com/index.html
- The dataset is licensed under https://divvybikes.com/system-data
- Data from Jan 2020 to Jul 2023 has been pre-processed and stored in a local database. 
- To match the scope of this case study, only (i) trips between Aug 2022 and Jul 2023 (the past 12 months) and (ii) the relevant columns were extracted and stored in a local database table, "google_casestudy_data".

Objective: 
- To process and prepare the raw dataset so that it is ready for analysis. 


# 1 - Connect to Database & Extract the Dataset

In [9]:
from datetime import datetime

now = datetime.now()
current_time = now.strftime("%H:%M:%S")
print("Start Time =", current_time)

Start Time = 03:31:58


In [10]:
from sqlalchemy import create_engine
import pandas as pd


# Replace the <placeholders> on the following line with your own username, password, host and database.

engine = create_engine('postgresql://<username>:<password>@<localhost:5432>/<database>')

data = pd.read_sql('SELECT * FROM case_study_dataset.divvy_bikes.google_casestudy_data', engine)
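
One way to keep credentials out of the notebook is to read them from environment variables. The sketch below is only an illustration: the helper name `build_pg_url`, the variable names (`PGUSER`, `PGPASSWORD`, and so on) and the default values are assumptions, not part of the original setup.

```python
import os
from urllib.parse import quote_plus


def build_pg_url(env=os.environ):
    """Assemble a PostgreSQL connection URL from environment variables.

    The variable names and defaults below are assumptions of this sketch.
    """
    user = env.get("PGUSER", "postgres")
    # Escape characters such as '@' or '/' that would otherwise break the URL.
    password = quote_plus(env.get("PGPASSWORD", ""))
    host = env.get("PGHOST", "localhost")
    port = env.get("PGPORT", "5432")
    database = env.get("PGDATABASE", "case_study_dataset")
    return f"postgresql://{user}:{password}@{host}:{port}/{database}"


# Usage (needs a reachable database, so it is left commented out here):
# engine = create_engine(build_pg_url())
```

The password is percent-encoded because special characters in it would otherwise be misread as URL delimiters.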

In [11]:
now = datetime.now()
current_time = now.strftime("%H:%M:%S")
print("Start Time =", current_time)

Start Time = 03:33:12


# 2 - Data Types Validation & Correction

In [12]:
print(data.dtypes)
data.head(5)

ride_id                object
rideable_type          object
started_at             object
ended_at               object
start_station_name     object
end_station_name       object
start_lat             float64
start_lng             float64
end_lat               float64
end_lng               float64
member_casual          object
dtype: object


Unnamed: 0,ride_id,rideable_type,started_at,ended_at,start_station_name,end_station_name,start_lat,start_lng,end_lat,end_lng,member_casual
0,A489B0C40E9CFB6E,electric_bike,2023-03-27 14:23:43,2023-03-27 14:47:08,,,41.88,-87.63,41.96,-87.65,member
1,DCD6E7E02628A529,electric_bike,2023-03-27 05:40:20,2023-03-27 05:42:35,,,41.96,-87.65,41.96,-87.65,member
2,6DAD99A20709B682,electric_bike,2023-03-27 05:47:37,2023-03-27 06:09:39,,,41.96,-87.65,41.88,-87.63,member
3,9DC1982F25F794BE,electric_bike,2023-03-26 23:53:57,2023-03-26 23:58:12,,,41.79,-87.61,41.78,-87.6,member
4,A48FA69E713D17E0,electric_bike,2023-03-27 09:55:22,2023-03-27 10:06:05,,,41.89,-87.62,41.89,-87.65,member


### Analysis of the results

Based on the results, the following fields need to be converted to the correct data type:
1. [started_at] - change to "date/time"
2. [ended_at] - change to "date/time"

In [13]:
data['started_at'] = pd.to_datetime(data['started_at'], format = '%Y-%m-%d %H:%M:%S')
data['ended_at'] = pd.to_datetime(data['ended_at'], format = '%Y-%m-%d %H:%M:%S')
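
If some timestamps might not match the expected format, `errors="coerce"` turns them into `NaT` instead of raising, so they can be inspected first. This is a minimal sketch on toy data (the malformed value is made up for illustration), not a step the original notebook performs.

```python
import pandas as pd

# Toy data standing in for the real table; the last value is deliberately malformed.
sample = pd.DataFrame(
    {"started_at": ["2023-03-27 14:23:43", "2023-03-27 05:40:20", "not a timestamp"]}
)

# Unparseable values become NaT rather than stopping the conversion.
parsed = pd.to_datetime(sample["started_at"], format="%Y-%m-%d %H:%M:%S", errors="coerce")

# Rows that failed to parse can be reviewed before deciding how to handle them.
bad_rows = sample[parsed.isna()]
print(bad_rows)
```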

In [14]:
#to check again after conversion 
print(data.dtypes)
data.head(5)

ride_id                       object
rideable_type                 object
started_at            datetime64[ns]
ended_at              datetime64[ns]
start_station_name            object
end_station_name              object
start_lat                    float64
start_lng                    float64
end_lat                      float64
end_lng                      float64
member_casual                 object
dtype: object


Unnamed: 0,ride_id,rideable_type,started_at,ended_at,start_station_name,end_station_name,start_lat,start_lng,end_lat,end_lng,member_casual
0,A489B0C40E9CFB6E,electric_bike,2023-03-27 14:23:43,2023-03-27 14:47:08,,,41.88,-87.63,41.96,-87.65,member
1,DCD6E7E02628A529,electric_bike,2023-03-27 05:40:20,2023-03-27 05:42:35,,,41.96,-87.65,41.96,-87.65,member
2,6DAD99A20709B682,electric_bike,2023-03-27 05:47:37,2023-03-27 06:09:39,,,41.96,-87.65,41.88,-87.63,member
3,9DC1982F25F794BE,electric_bike,2023-03-26 23:53:57,2023-03-26 23:58:12,,,41.79,-87.61,41.78,-87.6,member
4,A48FA69E713D17E0,electric_bike,2023-03-27 09:55:22,2023-03-27 10:06:05,,,41.89,-87.62,41.89,-87.65,member


# 3 - Summary Statistics 

In [15]:
data.describe(include=['object'])

Unnamed: 0,ride_id,rideable_type,start_station_name,end_station_name,member_casual
count,4963496,4963496,4203091,4155028,4963496
unique,4963496,3,1659,1673,2
top,A489B0C40E9CFB6E,electric_bike,Streeter Dr & Grand Ave,Streeter Dr & Grand Ave,member
freq,1,2709258,54985,55745,3143182


In [16]:
# check the number of missing values in each field
data.isnull().sum()

ride_id                    0
rideable_type              0
started_at                 0
ended_at                   0
start_station_name    760405
end_station_name      808468
start_lat                  0
start_lng                  0
end_lat                 5274
end_lng                 5274
member_casual              0
dtype: int64

### Details of Analysis of the Results 

No Further Actionables: 

1. [ride_id]: each transaction has a unique id, with no missing values. Therefore, all transactions are accounted for. 

2. [rideable_type]: there are 3 classes of rideable type, with no missing values. Therefore, every ride is categorised into one of 3 types. 

3. [started_at] & [ended_at]: all rides contain both timestamps. Therefore, all rides are accounted for. 

4. [member_casual]: there are 2 classes of member type, with no missing values. Therefore, every ride is categorised into one of 2 types. 

With Follow-Ups: 

5. [start_station_name] & [end_station_name]: there are 760,405 & 808,468 records with missing values. This accounts for 15.3% & 16.3% of the data. Therefore, these records need to be analyzed further to determine whether they are useful to keep in the dataset. 

6. [start_lat] & [start_lng]: no values are missing, yet 15.3% of the records have no start_station_name. Further analysis is required, since it is unexpected for records without a station name to still carry a latitude / longitude. 

7. [end_lat] & [end_lng]: there are 5,274 missing records, far fewer than the 808,468 records missing an end_station_name. Further analysis is required, since it is unexpected for most of the 16.3% of records without a station name to still carry a latitude / longitude. 
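
As a starting point for the follow-ups above, the missing share per column and a breakdown of missing station names by bike type can be computed directly. The sketch below uses a small toy frame; its values, including the `classic_bike` and `docked_bike` labels, are illustrative assumptions rather than results from the real dataset.

```python
import pandas as pd

# Toy rows standing in for `data`; the values are illustrative only.
toy = pd.DataFrame({
    "rideable_type": ["electric_bike", "electric_bike", "classic_bike", "docked_bike"],
    "start_station_name": [None, "Streeter Dr & Grand Ave",
                           "Streeter Dr & Grand Ave", "Streeter Dr & Grand Ave"],
    "start_lat": [41.88, 41.89, 41.89, 41.89],
})

# Share of missing values per column, in percent.
missing_pct = toy.isna().mean().mul(100).round(1)

# Count of rides with (True) and without (False) a missing start station name, per bike type.
missing_by_type = pd.crosstab(toy["rideable_type"], toy["start_station_name"].isna())

print(missing_pct)
print(missing_by_type)
```

On the real data, a breakdown like this may show whether the missing station names are concentrated in one bike type, which would help decide whether to keep those rides.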



# 4 - New Derived Fields to Support Analysis

Derived fields can be created from existing fields to provide deeper insights for the analysis.

(1) Time Series Analysis

From "started_at" and "ended_at", we can derive the following fields: 

1. duration: [ended_at] - [started_at]
2. day of the week: derived from [started_at]
3. time (in hour) of the ride: derived from [started_at]

(2) Distance and Speed 

From the latitude and longitude of the start and end points of each ride, we can:

4. derive the distance covered by the ride using the haversine formula (the great-circle distance between two points). The Python library "mpu" can calculate this distance. 

5. compute the rider's "cycling speed" from the derived "distance" & "duration". 


reference: https://towardsdatascience.com/calculating-distance-between-two-geolocations-in-python-26ad3afe287b

In [17]:
#create the field, 'duration', and convert it to display in minutes

data['duration'] = data['ended_at'] - data['started_at']
data['duration_min'] = data['duration'].dt.total_seconds().div(60).astype(float)
data.head(5)

Unnamed: 0,ride_id,rideable_type,started_at,ended_at,start_station_name,end_station_name,start_lat,start_lng,end_lat,end_lng,member_casual,duration,duration_min
0,A489B0C40E9CFB6E,electric_bike,2023-03-27 14:23:43,2023-03-27 14:47:08,,,41.88,-87.63,41.96,-87.65,member,0 days 00:23:25,23.416667
1,DCD6E7E02628A529,electric_bike,2023-03-27 05:40:20,2023-03-27 05:42:35,,,41.96,-87.65,41.96,-87.65,member,0 days 00:02:15,2.25
2,6DAD99A20709B682,electric_bike,2023-03-27 05:47:37,2023-03-27 06:09:39,,,41.96,-87.65,41.88,-87.63,member,0 days 00:22:02,22.033333
3,9DC1982F25F794BE,electric_bike,2023-03-26 23:53:57,2023-03-26 23:58:12,,,41.79,-87.61,41.78,-87.6,member,0 days 00:04:15,4.25
4,A48FA69E713D17E0,electric_bike,2023-03-27 09:55:22,2023-03-27 10:06:05,,,41.89,-87.62,41.89,-87.65,member,0 days 00:10:43,10.716667


In [18]:
#create the field, 'day_of_week', from the starting time 

data['day_of_week'] = data['started_at'].dt.dayofweek.map({
    0: 'Mon',
    1: 'Tue',
    2: 'Wed',
    3: 'Thu',
    4: 'Fri',
    5: 'Sat',
    6: 'Sun'
})
data.head(5)

Unnamed: 0,ride_id,rideable_type,started_at,ended_at,start_station_name,end_station_name,start_lat,start_lng,end_lat,end_lng,member_casual,duration,duration_min,day_of_week
0,A489B0C40E9CFB6E,electric_bike,2023-03-27 14:23:43,2023-03-27 14:47:08,,,41.88,-87.63,41.96,-87.65,member,0 days 00:23:25,23.416667,Mon
1,DCD6E7E02628A529,electric_bike,2023-03-27 05:40:20,2023-03-27 05:42:35,,,41.96,-87.65,41.96,-87.65,member,0 days 00:02:15,2.25,Mon
2,6DAD99A20709B682,electric_bike,2023-03-27 05:47:37,2023-03-27 06:09:39,,,41.96,-87.65,41.88,-87.63,member,0 days 00:22:02,22.033333,Mon
3,9DC1982F25F794BE,electric_bike,2023-03-26 23:53:57,2023-03-26 23:58:12,,,41.79,-87.61,41.78,-87.6,member,0 days 00:04:15,4.25,Sun
4,A48FA69E713D17E0,electric_bike,2023-03-27 09:55:22,2023-03-27 10:06:05,,,41.89,-87.62,41.89,-87.65,member,0 days 00:10:43,10.716667,Mon
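
An alternative worth considering is to store the day labels as an ordered categorical, so that sorting, grouping and plotting follow Mon to Sun rather than alphabetical order. This is a small sketch on two timestamps taken from the sample rows; the variable names are hypothetical.

```python
import pandas as pd

days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
started = pd.Series(pd.to_datetime(["2023-03-27 14:23:43", "2023-03-26 23:53:57"]))

# day_name() returns e.g. "Monday"; slicing keeps the three-letter label used above.
labels = started.dt.day_name().str[:3]

# An ordered categorical remembers that Mon < Tue < ... < Sun.
day_of_week = pd.Categorical(labels, categories=days, ordered=True)
print(pd.Series(day_of_week).sort_values())
```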


In [19]:
#create the field, 'hr_day', from the starting time 

data['hr_day'] = data['started_at'].dt.hour
data.head(5)

Unnamed: 0,ride_id,rideable_type,started_at,ended_at,start_station_name,end_station_name,start_lat,start_lng,end_lat,end_lng,member_casual,duration,duration_min,day_of_week,hr_day
0,A489B0C40E9CFB6E,electric_bike,2023-03-27 14:23:43,2023-03-27 14:47:08,,,41.88,-87.63,41.96,-87.65,member,0 days 00:23:25,23.416667,Mon,14
1,DCD6E7E02628A529,electric_bike,2023-03-27 05:40:20,2023-03-27 05:42:35,,,41.96,-87.65,41.96,-87.65,member,0 days 00:02:15,2.25,Mon,5
2,6DAD99A20709B682,electric_bike,2023-03-27 05:47:37,2023-03-27 06:09:39,,,41.96,-87.65,41.88,-87.63,member,0 days 00:22:02,22.033333,Mon,5
3,9DC1982F25F794BE,electric_bike,2023-03-26 23:53:57,2023-03-26 23:58:12,,,41.79,-87.61,41.78,-87.6,member,0 days 00:04:15,4.25,Sun,23
4,A48FA69E713D17E0,electric_bike,2023-03-27 09:55:22,2023-03-27 10:06:05,,,41.89,-87.62,41.89,-87.65,member,0 days 00:10:43,10.716667,Mon,9


## [Highlight #1]: Calculate the distance based on lat and lng 

- Note: the formula (haversine from mpu) cannot handle missing lat & lng values.
- Therefore, we apply the formula only to records without missing values. 
- This is done by splitting the records with and without missing values into 2 data frames, and combining them again after the distance has been calculated. 

In [20]:
now = datetime.now()
current_time = now.strftime("%H:%M:%S")
print("Start Time =", current_time)

Start Time = 03:33:35


In [21]:
# add in a location key 
data['start_loc_key'] = data['start_lat'].astype(str) + data['start_lng'].astype(str) 
data['end_loc_key'] = data['end_lat'].astype(str) + data['end_lng'].astype(str) 

In [22]:
# split the rides by whether any coordinate needed for the distance is missing

# rows with at least one missing lat/lng go into their own dataframe

variableToPredict = ['start_lat', 'start_lng', 'end_lat', 'end_lng']
data_miss_latlog  = data[data[variableToPredict].isna().any(axis=1)]

# rows without any missing lat/lng
# (note: masking with isin() upcasts integer columns such as hr_day to float)

data_w_lat_log = data[~data.isin(data_miss_latlog)].dropna(how = 'all')


In [23]:
print(data.shape)
print(data_miss_latlog.shape)
print(data_w_lat_log.shape)

(4963496, 17)
(5274, 17)
(4958222, 17)
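
The same split can be expressed with a single boolean mask, which keeps the original dtypes and avoids comparing every cell of the frame. The sketch below runs on a toy frame; the names `coord_cols`, `toy_missing` and `toy_complete` are hypothetical.

```python
import numpy as np
import pandas as pd

coord_cols = ["start_lat", "start_lng", "end_lat", "end_lng"]

# Toy rows standing in for `data`; the second ride has no end coordinates.
toy = pd.DataFrame({
    "start_lat": [41.88, 41.96, 41.79],
    "start_lng": [-87.63, -87.65, -87.61],
    "end_lat": [41.96, np.nan, 41.78],
    "end_lng": [-87.65, np.nan, -87.60],
    "hr_day": [14, 5, 23],
})

# True for rows where any of the four coordinates is missing.
missing_mask = toy[coord_cols].isna().any(axis=1)

# .copy() makes both halves independent frames, so new columns can be added safely.
toy_missing = toy[missing_mask].copy()
toy_complete = toy[~missing_mask].copy()

print(toy_missing.shape, toy_complete.shape)
```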


In [24]:
now = datetime.now()
current_time = now.strftime("%H:%M:%S")
print("Start Time =", current_time)

Start Time = 03:34:29


In [25]:
import mpu

data_w_lat_log["distance_km"] = data_w_lat_log.apply(lambda x:
                                  mpu.haversine_distance((x["start_lat"], x["start_lng"]),
                                                         (x["end_lat"], x["end_lng"])), axis=1)

data_w_lat_log.head(5)

Unnamed: 0,ride_id,rideable_type,started_at,ended_at,start_station_name,end_station_name,start_lat,start_lng,end_lat,end_lng,member_casual,duration,duration_min,day_of_week,hr_day,start_loc_key,end_loc_key,distance_km
0,A489B0C40E9CFB6E,electric_bike,2023-03-27 14:23:43,2023-03-27 14:47:08,,,41.88,-87.63,41.96,-87.65,member,0 days 00:23:25,23.416667,Mon,14.0,41.88-87.63,41.96-87.65,9.048194
1,DCD6E7E02628A529,electric_bike,2023-03-27 05:40:20,2023-03-27 05:42:35,,,41.96,-87.65,41.96,-87.65,member,0 days 00:02:15,2.25,Mon,5.0,41.96-87.65,41.96-87.65,0.0
2,6DAD99A20709B682,electric_bike,2023-03-27 05:47:37,2023-03-27 06:09:39,,,41.96,-87.65,41.88,-87.63,member,0 days 00:22:02,22.033333,Mon,5.0,41.96-87.65,41.88-87.63,9.048194
3,9DC1982F25F794BE,electric_bike,2023-03-26 23:53:57,2023-03-26 23:58:12,,,41.79,-87.61,41.78,-87.6,member,0 days 00:04:15,4.25,Sun,23.0,41.79-87.61,41.78-87.6,1.38704
4,A48FA69E713D17E0,electric_bike,2023-03-27 09:55:22,2023-03-27 10:06:05,,,41.89,-87.62,41.89,-87.65,member,0 days 00:10:43,10.716667,Mon,9.0,41.89-87.62,41.89-87.65,2.483299
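
Row-wise `apply` calls the distance function once per ride, which took several minutes on the full dataset. The haversine formula can also be written with NumPy so that it works on whole columns at once. This is a sketch, not the `mpu` implementation: the function name `haversine_km` is hypothetical and the Earth radius of 6371 km is an assumption.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in km; accepts scalars, arrays or pandas Series."""
    lat1, lng1, lat2, lng2 = (
        np.radians(np.asarray(v, dtype=float)) for v in (lat1, lng1, lat2, lng2)
    )
    a = (
        np.sin((lat2 - lat1) / 2) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin((lng2 - lng1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# Toy rows: a short ride, a ride that ends where it started, and one with missing end coordinates.
toy = pd.DataFrame({
    "start_lat": [41.88, 41.96, 41.79],
    "start_lng": [-87.63, -87.65, -87.61],
    "end_lat": [41.89, 41.96, np.nan],
    "end_lng": [-87.63, -87.65, np.nan],
})

toy["distance_km"] = haversine_km(
    toy["start_lat"], toy["start_lng"], toy["end_lat"], toy["end_lng"]
)
print(toy)
```

Because NumPy propagates `NaN`, rows with missing coordinates simply get a missing distance, so with this approach the frame would not need to be split first.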


In [26]:
now = datetime.now()
current_time = now.strftime("%H:%M:%S")
print("Start Time =", current_time)

Start Time = 03:38:31


In [27]:
import numpy as np 

# .assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
data_miss_latlog = data_miss_latlog.assign(distance_km=np.nan)

frames = [data_w_lat_log, data_miss_latlog]

final_data = pd.concat(frames)

final_data.head(10)


Unnamed: 0,ride_id,rideable_type,started_at,ended_at,start_station_name,end_station_name,start_lat,start_lng,end_lat,end_lng,member_casual,duration,duration_min,day_of_week,hr_day,start_loc_key,end_loc_key,distance_km
0,A489B0C40E9CFB6E,electric_bike,2023-03-27 14:23:43,2023-03-27 14:47:08,,,41.88,-87.63,41.96,-87.65,member,0 days 00:23:25,23.416667,Mon,14.0,41.88-87.63,41.96-87.65,9.048194
1,DCD6E7E02628A529,electric_bike,2023-03-27 05:40:20,2023-03-27 05:42:35,,,41.96,-87.65,41.96,-87.65,member,0 days 00:02:15,2.25,Mon,5.0,41.96-87.65,41.96-87.65,0.0
2,6DAD99A20709B682,electric_bike,2023-03-27 05:47:37,2023-03-27 06:09:39,,,41.96,-87.65,41.88,-87.63,member,0 days 00:22:02,22.033333,Mon,5.0,41.96-87.65,41.88-87.63,9.048194
3,9DC1982F25F794BE,electric_bike,2023-03-26 23:53:57,2023-03-26 23:58:12,,,41.79,-87.61,41.78,-87.6,member,0 days 00:04:15,4.25,Sun,23.0,41.79-87.61,41.78-87.6,1.38704
4,A48FA69E713D17E0,electric_bike,2023-03-27 09:55:22,2023-03-27 10:06:05,,,41.89,-87.62,41.89,-87.65,member,0 days 00:10:43,10.716667,Mon,9.0,41.89-87.62,41.89-87.65,2.483299
5,773C6EF79FCB8966,electric_bike,2023-03-29 10:19:10,2023-03-29 10:27:07,,,41.9,-87.62,41.91,-87.63,member,0 days 00:07:57,7.95,Wed,10.0,41.9-87.62,41.91-87.63,1.386112
6,73BC3AFCB0AC18B8,electric_bike,2023-03-28 22:51:42,2023-03-28 22:56:21,,,41.88,-87.63,41.89,-87.63,member,0 days 00:04:39,4.65,Tue,22.0,41.88-87.63,41.89-87.63,1.111949
7,F70C183896230041,electric_bike,2023-03-28 22:48:42,2023-03-28 22:49:01,,,41.88,-87.63,41.88,-87.63,member,0 days 00:00:19,0.316667,Tue,22.0,41.88-87.63,41.88-87.63,0.0
8,8F609CFCE33E31AA,electric_bike,2023-03-28 22:57:01,2023-03-28 23:14:36,,,41.89,-87.63,41.96,-87.65,member,0 days 00:17:35,17.583333,Tue,22.0,41.89-87.63,41.96-87.65,7.957569
9,ABBA74C991D7B0F5,electric_bike,2023-03-21 15:07:58,2023-03-21 15:10:39,,,41.88,-87.63,41.88,-87.63,member,0 days 00:02:41,2.683333,Tue,15.0,41.88-87.63,41.88-87.63,0.0
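
With both "distance_km" and "duration_min" available, the cycling speed mentioned in section 4 could be derived as distance divided by duration in hours. The sketch below uses toy values (the first two rows mirror the sample output, the third is made up); the column name `speed_kmh` is hypothetical. Since the haversine distance is a straight line between start and end, the result is best read as a lower bound on the actual speed.

```python
import pandas as pd

toy = pd.DataFrame({
    "duration_min": [23.416667, 2.25, 0.0],
    "distance_km": [9.048194, 0.0, 1.2],
})

hours = toy["duration_min"] / 60

# Zero or negative durations would give infinite or negative speeds,
# so they are masked to NaN before dividing.
toy["speed_kmh"] = (toy["distance_km"] / hours.where(hours > 0)).round(2)
print(toy)
```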


In [28]:
now = datetime.now()
current_time = now.strftime("%H:%M:%S")
print("Start Time =", current_time)

Start Time = 03:38:35


# 5 - Export to .CSV

In [29]:
now = datetime.now()
current_time = now.strftime("%H:%M:%S")
print("Start Time =", current_time)

#final_data.to_csv('final_data.csv')

now = datetime.now()
current_time = now.strftime("%H:%M:%S")
print("Start Time =", current_time)

Start Time = 03:38:35
Start Time = 03:38:35


In [30]:
now = datetime.now()
current_time = now.strftime("%H:%M:%S")
print("Start Time =", current_time)

Start Time = 03:38:35


# 6 - [Highlight #2] Identify Location Based on Lat and Lng

With the latitude and longitude available in the dataset, we can look up the address of each location via geopy. 


Reference: 
1. https://towardsdatascience.com/geocode-with-python-161ec1e62b89
2. https://geopy.readthedocs.io/en/stable/#nominatim

In [31]:
now = datetime.now()
current_time = now.strftime("%H:%M:%S")
print("Start Time =", current_time)

Start Time = 03:38:35


In [32]:
# create dataset for location analysis

start_loc = data_w_lat_log[['start_loc_key', 'start_lat','start_lng']].drop_duplicates().reset_index()
end_loc = data_w_lat_log[['end_loc_key', 'end_lat','end_lng']].drop_duplicates().reset_index()

print(start_loc.shape)
print(end_loc.shape)

(1879297, 4)
(15206, 4)


In [33]:
#combine the dataset - to create a master copy. 

all_loc = pd.DataFrame( np.concatenate( (start_loc.values, end_loc.values), axis=0 ) )
all_loc.columns = [ 'index','loc_key', 'lat', 'lng' ]
all_loc = all_loc.drop_duplicates().reset_index()

all_loc = all_loc[['loc_key', 'lat', 'lng']]

print(all_loc.shape)

del start_loc
del end_loc

all_loc.head(5)

(1894349, 3)


Unnamed: 0,loc_key,lat,lng
0,41.88-87.63,41.88,-87.63
1,41.96-87.65,41.96,-87.65
2,41.79-87.61,41.79,-87.61
3,41.89-87.62,41.89,-87.62
4,41.9-87.62,41.9,-87.62
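
Because `reset_index()` keeps the original row label in an `index` column, the `drop_duplicates()` call in the cell above also compares that column, so a location that appears as both a start and an end point is usually kept twice. A possible alternative, sketched here on toy data with hypothetical variable names, is to stack the two sets with `pd.concat` and de-duplicate on the location key alone.

```python
import pandas as pd

# Toy rows standing in for data_w_lat_log.
toy = pd.DataFrame({
    "start_loc_key": ["41.88-87.63", "41.96-87.65", "41.88-87.63"],
    "start_lat": [41.88, 41.96, 41.88],
    "start_lng": [-87.63, -87.65, -87.63],
    "end_loc_key": ["41.96-87.65", "41.96-87.65", "41.89-87.63"],
    "end_lat": [41.96, 41.96, 41.89],
    "end_lng": [-87.65, -87.65, -87.63],
})

# Give both halves the same column names so they can be stacked.
start_loc = toy[["start_loc_key", "start_lat", "start_lng"]].rename(
    columns={"start_loc_key": "loc_key", "start_lat": "lat", "start_lng": "lng"}
)
end_loc = toy[["end_loc_key", "end_lat", "end_lng"]].rename(
    columns={"end_loc_key": "loc_key", "end_lat": "lat", "end_lng": "lng"}
)

# Stack, then keep one row per location key.
all_loc_toy = (
    pd.concat([start_loc, end_loc], ignore_index=True)
    .drop_duplicates(subset="loc_key")
    .reset_index(drop=True)
)
print(all_loc_toy)
```

Unlike the `np.concatenate` route, this keeps "lat" and "lng" as floats, so no later `astype(float)` would be needed.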


In [35]:
a = all_loc[['loc_key', 'lat', 'lng']]

In [36]:
# round lat and lng to 3 decimal places (roughly 111 m in latitude)

all_loc['lat_3dp'] = a['lat'].astype(float).round(decimals=3)
all_loc['lng_3dp'] = a['lng'].astype(float).round(decimals=3)
all_loc['loc_key_3dp'] = all_loc['lat_3dp'].astype(str) + all_loc['lng_3dp'].astype(str)
all_loc.head(5)

Unnamed: 0,loc_key,lat,lng,lat_3dp,lng_3dp,loc_key_3dp
0,41.88-87.63,41.88,-87.63,41.88,-87.63,41.88-87.63
1,41.96-87.65,41.96,-87.65,41.96,-87.65,41.96-87.65
2,41.79-87.61,41.79,-87.61,41.79,-87.61,41.79-87.61
3,41.89-87.62,41.89,-87.62,41.89,-87.62,41.89-87.62
4,41.9-87.62,41.9,-87.62,41.9,-87.62,41.9-87.62


In [37]:
# keep only the distinct rounded lat/lng pairs, to be used as proxy locations

area_loc = all_loc[['loc_key_3dp','lat_3dp', 'lng_3dp']].drop_duplicates().reset_index()
area_loc = area_loc[['loc_key_3dp','lat_3dp', 'lng_3dp']]
print(area_loc.shape)
area_loc.head(5)

(4509, 3)


Unnamed: 0,loc_key_3dp,lat_3dp,lng_3dp
0,41.88-87.63,41.88,-87.63
1,41.96-87.65,41.96,-87.65
2,41.79-87.61,41.79,-87.61
3,41.89-87.62,41.89,-87.62
4,41.9-87.62,41.9,-87.62
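
The "nearest 111 m" figure for 3 decimal places applies to latitude; a step in longitude is shorter away from the equator. The short calculation below estimates both, assuming a spherical Earth of radius 6371 km and a reference latitude of 41.88 taken from the sample rows.

```python
import math

EARTH_RADIUS_KM = 6371.0   # assumed mean Earth radius
REFERENCE_LAT_DEG = 41.88  # assumed: a latitude taken from the sample rows

step_deg = 0.001  # the grid spacing implied by rounding to 3 decimal places

# Arc length of 0.001 degrees along a meridian, in metres.
lat_step_m = math.radians(step_deg) * EARTH_RADIUS_KM * 1000

# Along a parallel, the arc shrinks by the cosine of the latitude.
lng_step_m = lat_step_m * math.cos(math.radians(REFERENCE_LAT_DEG))

print(round(lat_step_m), round(lng_step_m))
```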


In [38]:
now = datetime.now()
current_time = now.strftime("%H:%M:%S")
print("Start Time =", current_time)

Start Time = 03:39:31


In [39]:
from geopy.geocoders import Nominatim
# initialize Nominatim API
geolocator = Nominatim(user_agent="han-CaseStudy", timeout= 1)

import time

In [None]:
# create an empty dataframe for consolidation 
loc_master = pd.DataFrame()

# for every (lat, lng) pair, retrieve the address through the geolocator and add it to the master location dataset 

now = datetime.now()
current_time = now.strftime("%H:%M:%S")
print("Start Time =", current_time)

for i in range (len(area_loc)):
#for i in range (10):

    try:
    
        Latitude = str(area_loc['lat_3dp'][i])
        Longitude = str(area_loc['lng_3dp'][i])

        location = geolocator.reverse(Latitude+","+Longitude)
        address = location.raw['address']

        a = pd.DataFrame.from_dict(address, orient='index').T
        a['loc_key_3dp'] = area_loc['loc_key_3dp'][i]

        loc_master = pd.concat([a, loc_master])
        
        ## this will result in 4000+ row printed 
        
        #now = datetime.now()
        #current_time = now.strftime("%H:%M:%S")
        #print(str(i) + "Done Time =", current_time)
        
        #time.sleep(1)
            
    except Exception:
        # a failed lookup (e.g. a timeout, or no address returned) is logged and skipped
        now = datetime.now()
        current_time = now.strftime("%H:%M:%S")
        print(str(i) + " Done with error =", current_time)

now = datetime.now()
current_time = now.strftime("%H:%M:%S")
print("All DONE Time =", current_time)

print(loc_master)
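
The public Nominatim service asks clients to send at most one request per second, and growing a DataFrame with `pd.concat` inside a loop copies it on every iteration. The sketch below is one possible restructuring under those two considerations: it collects plain dictionaries and builds the frame once, and it takes the lookup function as a parameter so that a rate-limited wrapper can be passed in. The helper name `reverse_geocode_frame` is hypothetical.

```python
import pandas as pd


def reverse_geocode_frame(area_loc, reverse):
    """Look up an address for each row of area_loc using the callable `reverse`.

    `reverse` takes a (lat, lng) tuple and returns an object with a `.raw`
    dict (as geopy locations do), or None when nothing is found.
    """
    rows = []
    cols = ["loc_key_3dp", "lat_3dp", "lng_3dp"]
    for key, lat, lng in area_loc[cols].itertuples(index=False):
        try:
            location = reverse((lat, lng))
        except Exception as exc:  # e.g. a timeout from the geocoding service
            print(f"{key}: lookup failed ({exc})")
            continue
        if location is None:  # no address found for this point
            continue
        row = dict(location.raw.get("address", {}))
        row["loc_key_3dp"] = key
        rows.append(row)
    # Building the frame once avoids re-copying it on every iteration.
    return pd.DataFrame(rows)


# Usage with geopy (needs network access, so it is left commented out):
# from geopy.extra.rate_limiter import RateLimiter
# reverse = RateLimiter(geolocator.reverse, min_delay_seconds=1)
# loc_master = reverse_geocode_frame(area_loc, reverse)
```

At one request per second, the 4,509 rounded locations would take roughly 75 minutes, so saving the result to disk after the run is worth considering.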

In [43]:
loc_master.columns

Index(['house_number', 'road', 'neighbourhood', 'quarter', 'city',
       'municipality', 'county', 'state', 'ISO3166-2-lvl4', 'postcode',
       'country', 'country_code', 'loc_key_3dp', 'town', 'railway', 'hamlet',
       'leisure', 'village', 'amenity', 'residential', 'man_made', 'shop',
       'building', 'craft', 'office', 'retail', 'police_beat', 'industrial',
       'tourism', 'highway', 'historic', 'aeroway', 'place', 'district',
       'club'],
      dtype='object')

In [49]:
# join the geocoded addresses back to all_loc for the full mapping

all_loc_master =  all_loc.merge(loc_master, on='loc_key_3dp', how='left') 
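
Since some lookups may have failed, it can be useful to check how many locations found no address after the join. The sketch below uses `merge` with `validate` and `indicator` on toy frames; the frame names and the "Example Rd" value are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for all_loc and loc_master.
all_loc_toy = pd.DataFrame({
    "loc_key": ["a", "b", "c"],
    "loc_key_3dp": ["k1", "k1", "k2"],
})
loc_master_toy = pd.DataFrame({"loc_key_3dp": ["k1"], "road": ["Example Rd"]})

merged = all_loc_toy.merge(
    loc_master_toy,
    on="loc_key_3dp",
    how="left",
    validate="many_to_one",  # raises if a key appears more than once on the right
    indicator=True,          # adds a "_merge" column describing each row's match
)

# Rows whose rounded key had no geocoded address.
unmatched = merged[merged["_merge"] == "left_only"]
print(len(unmatched), "of", len(merged), "locations have no address")
```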

In [50]:
all_loc_master.columns

Index(['loc_key', 'lat', 'lng', 'lat_3dp', 'lng_3dp', 'loc_key_3dp',
       'house_number', 'road', 'neighbourhood', 'quarter', 'city',
       'municipality', 'county', 'state', 'ISO3166-2-lvl4', 'postcode',
       'country', 'country_code', 'town', 'railway', 'hamlet', 'leisure',
       'village', 'amenity', 'residential', 'man_made', 'shop', 'building',
       'craft', 'office', 'retail', 'police_beat', 'industrial', 'tourism',
       'highway', 'historic', 'aeroway', 'place', 'district', 'club'],
      dtype='object')

In [51]:
#exporting the final results.
all_loc_master.to_csv('all_loc_master.csv')