### Data processing

In this Jupyter notebook, we interpret and visualize the data collected in the previous notebook.

### Import of packages

In this section, we import all packages needed for *data interpretation* and *visualization*. First, we import pandas, geopandas, json, numpy and matplotlib.pyplot to process the data. Then, we import Nominatim from geopy.geocoders to determine the location of the user. To perform geodetic computations, we import Geod from pyproj. Finally, we import folium and its MarkerCluster plugin to visualize the data.

In [549]:
import pandas as pd
import geopandas as gpd
import json
import numpy as np
import matplotlib.pyplot as plt

from geopy.geocoders import Nominatim

from pyproj import Geod

#python -m pip install folium
import folium
from folium.plugins import MarkerCluster

### Data loading & descriptive statistics

After importing the packages, the JSON data are downloaded from the API, merged, and saved to a CSV file, which we then load.

In [559]:
swim_data = pd.read_csv('raw_data.csv')

Downloading JSON data from API: 100%|██████████| 2/2 [00:17<00:00,  8.78s/it]
Searching through the description for attributes:: 100%|██████████| 31/31 [00:00<00:00, 27431.10it/s]

Data prepared in csv file.





We provide descriptive statistics for the numeric columns only. Since the columns *ID* and *id_a* are identical, we remove one of them.

In [560]:
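#The positional axis argument of drop() is no longer accepted by recent pandas versions, so we name the column explicitly.
swim_data = swim_data.drop(columns='id_a')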
desc = swim_data.describe(include = [np.number])

desc

| | Unnamed: 0 | ID | Average rating | Number of ratings | Longitude | Latitude |
|---|---|---|---|---|---|---|
| count | 31.0 | 31.0 | 28.0 | 31.0 | 31.0 | 31.0 |
| mean | 15.0 | 202936.870968 | 4.227485 | 3.741935 | 49.237986 | 15.159155 |
| std | 9.092121 | 27.072424 | 0.791475 | 3.010733 | 1.753516 | 2.674773 |
| min | 0.0 | 202835.0 | 2.0 | 0.0 | 41.76032 | 2.983775 |
| 25% | 7.5 | 202924.5 | 3.964286 | 1.5 | 49.110857 | 14.339853 |
| 50% | 15.0 | 202944.0 | 4.3875 | 3.0 | 49.924286 | 14.942772 |
| 75% | 22.5 | 202955.5 | 4.892857 | 5.0 | 50.093087 | 16.709543 |
| max | 30.0 | 202963.0 | 5.0 | 12.0 | 50.279876 | 19.445608 |


To prepare the data for the next part, we select only the columns Name, Latitude, Longitude and Average rating of each location. We display the first rows to make sure the data are in the desired format.

In [562]:
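#.copy() gives an independent dataframe, so later edits of the Name column do not raise SettingWithCopyWarning.
swim_loc = swim_data[['Name','Latitude','Longitude', 'Average rating']].copy()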
swim_loc.head()

| | Name | Latitude | Longitude | Average rating |
|---|---|---|---|---|
| 0 | Costa Brava | 2.983775 | 41.76032 | NaN |
| 1 | Nádrž Dolní Žleb | 17.307842 | 49.751182 | 2.0 |
| 2 | Kralupy nad Vltavou | 14.307153 | 50.256161 | 5.0 |
| 3 | Koupaliště Flošna | 15.841977 | 50.205154 | 4.8 |
| 4 | Městské lázně | 15.828759 | 50.214217 | 4.5 |


To find the coordinates corresponding to an address, we use Nominatim imported from geopy.geocoders. Nominatim uses OpenStreetMap data to find locations on Earth by name or address (geocoding) and return their coordinates. It can also do the reverse and return an address for a given pair of coordinates.

The user can enter their location (a city name is enough) as the argument of `geocode`, and the application will determine the full address and the coordinates of that place.

In [564]:
geolocator = Nominatim(user_agent='myapplication')

# Insert city of your location:
location = geolocator.geocode("Litomyšl")

print("Your location is ", location.address, ".")
print("Longitude = {}, Latitude = {}".format(location.longitude, location.latitude))

Your location is  Litomyšl, okres Svitavy, Pardubický kraj, Severovýchod, 570 01, Česká republika .
Longitude = 16.3101243, Latitude = 49.8725491
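
`geocode` returns `None` when Nominatim finds no match, in which case `location.address` raises an `AttributeError`. A minimal sketch of a guard follows. The helper name `locate`, the `FakeLocation` type and the stub `fake_geocode` are hypothetical; the stub only stands in for `geolocator.geocode` so that the example runs offline.

```python
from collections import namedtuple

# Stand-in for geopy's Location object, used only in this offline sketch.
FakeLocation = namedtuple("FakeLocation", "address latitude longitude")


def locate(geocode, query):
    """Return the geocoded location or raise a readable error (hypothetical helper)."""
    result = geocode(query)
    if result is None:
        raise ValueError(f"No location found for {query!r}")
    return result


def fake_geocode(query):
    # Hypothetical lookup table replacing the real network call.
    known = {
        "Litomyšl": FakeLocation("Litomyšl, Česká republika", 49.8725491, 16.3101243)
    }
    return known.get(query)


place = locate(fake_geocode, "Litomyšl")
print(place.latitude, place.longitude)
```

In the notebook, `locate(geolocator.geocode, "Litomyšl")` could replace the direct call, assuming `geolocator` is defined as above.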


After we know the user's location, the application can measure the distance between that location and every place in the dataset. We measure distances on the World Geodetic System (WGS 84) ellipsoid, which is more accurate than treating the Earth as a sphere.

In [566]:
#We use World Geodetic System ellipsoid to measure the distance.
g = Geod(ellps='WGS84')

#We declare a function measuring the geodesic distance (in metres) between pairs of latitude-longitude points.
def Distance(lat1, lon1, lat2, lon2):
    #Geod.inv expects longitude before latitude and returns forward azimuth, back azimuth and distance.
    az12, az21, dist = g.inv(lon1, lat1, lon2, lat2)
    return dist

#As lat1 and lon1, we will use latitude and longitude of user's location.
name = swim_loc['Name']
lat1 = location.latitude
lon1 = location.longitude
lat2 = swim_loc['Latitude']
lon2 = swim_loc['Longitude']
rating = swim_loc['Average rating']

#We define new dataframe including values of columns from swim_loc.
df = pd.DataFrame({'Name':name,'Average rating':rating,'Latitude_1':lat1,'Longitude_1':lon1,'Latitude_2':lat2,'Longitude_2':lon2})

#We add a column with distance in metres using the Distance function.
df['Distance'] = Distance(df['Latitude_1'].tolist(), df['Longitude_1'].tolist(), df['Latitude_2'].tolist(), df['Longitude_2'].tolist())

#Then we convert the distance column to kilometres and display first rows of new dataframe.
df['Distance'] = df['Distance'] / 1000

df.head()

| | Name | Average rating | Latitude_1 | Longitude_1 | Latitude_2 | Longitude_2 | Distance |
|---|---|---|---|---|---|---|---|
| 0 | Costa Brava | NaN | 49.872549 | 16.310124 | 2.983775 | 41.76032 | 5725.521722 |
| 1 | Nádrž Dolní Žleb | 2.0 | 49.872549 | 16.310124 | 17.307842 | 49.751182 | 4688.608866 |
| 2 | Kralupy nad Vltavou | 5.0 | 49.872549 | 16.310124 | 14.307153 | 50.256161 | 5000.285866 |
| 3 | Koupaliště Flošna | 4.8 | 49.872549 | 16.310124 | 15.841977 | 50.205154 | 4852.033516 |
| 4 | Městské lázně | 4.5 | 49.872549 | 16.310124 | 15.828759 | 50.214217 | 4853.787915 |
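
The distances above are several thousand kilometres, although most of the places lie in the Czech Republic, as does the user's location. The values are consistent with the Latitude and Longitude columns of `raw_data.csv` being swapped (Costa Brava, for instance, is listed with latitude 2.98 and longitude 41.76). The following sanity check is a sketch that uses the haversine formula on a sphere, not the WGS84 computation from the cell above, so the results differ slightly from pyproj. The function name `haversine_km` and the mean Earth radius of 6371 km are assumptions made for illustration.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


user = (49.8725491, 16.3101243)      # Litomyšl as (lat, lon)
as_labeled = (15.841977, 50.205154)  # Koupaliště Flošna, columns used as labeled
swapped = (50.205154, 15.841977)     # same values with the two columns swapped

print(round(haversine_km(*user, *as_labeled)))  # thousands of km
print(round(haversine_km(*user, *swapped)))     # tens of km
```

If the swap is confirmed in the source data, exchanging the two column names after loading the CSV would correct both the distances and the marker positions on the maps.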


We select the 5 places geographically nearest to the user's location and sort them by average rating in ascending order.

In [567]:
#We define new dataframe with the 5 places geographically nearest to user's location, sorted by average rating.
df_map = df.nsmallest(5, columns=['Distance']).sort_values(['Average rating'], ascending=[True])

#We remove columns Latitude_1 and Longitude_1 as those coordinates are identical for all rows and we don't need them.
df_map.drop(['Latitude_1','Longitude_1'], axis=1, inplace=True)

#Then, we reset indexing of rows for easier manipulation with data in future steps.
df_map.reset_index(drop=True, inplace=True)

#We round values in columns Average rating and Distance to whole numbers and display the result.
df_map['Average rating'] = df_map['Average rating'].round()
df_map['Distance'] = df_map['Distance'].round()

df_map

| | Name | Average rating | Latitude_2 | Longitude_2 | Distance |
|---|---|---|---|---|---|
| 0 | Termální koupaliště | 4.0 | 17.644178 | 47.683612 | 4542.0 |
| 1 | Termální koupaliště Bešeňová | 4.0 | 19.445608 | 49.100166 | 4452.0 |
| 2 | Lipotské koupaliště | 5.0 | 17.457754 | 47.863354 | 4570.0 |
| 3 | Romantická pláž | 5.0 | 14.768348 | 44.922321 | 4676.0 |
| 4 | Akvapark Bohuňovice | 5.0 | 17.283724 | 49.661243 | 4686.0 |


### Data visualization

In the final part, we analyze and visualize the collected and summarized data.

First, we map all the points from the original dataset using the Folium module. Since most of the places are located in the Czech Republic, we use the coordinates of the Czech Republic as the initial center of the map.

In [574]:
#We need to remove diacritics from 'Name' column since the map does not display those letters correctly.
swim_loc['Name'] = swim_loc['Name'].str.translate(str.maketrans('áíéěščřžýůúň', 'aieescrzyuun'))

#We create map object using Folium. Based on our preferences, we set tiles to cartodbpositron and zoom_start to 8.
map0 = folium.Map(
    location=[49.8037633, 15.4749126],
    tiles='cartodbpositron',
    zoom_start=8)

#Then we create a list of latitude and longitude coordinate pairs.
map_locations0 = swim_loc[['Latitude', 'Longitude']]
locationlist0 = map_locations0.values.tolist()

#We create a marker cluster that groups overlapping points. Markers are grouped only if they are added to the cluster, not directly to the map.
marker_cluster = MarkerCluster().add_to(map0)

#We set parameters of markers so that all points are well visible and recognizable. We add these markers to the cluster.
for point in range(0, len(locationlist0)):
    folium.Marker(locationlist0[point], popup=swim_loc['Name'][point], icon=folium.Icon(color='darkblue', icon='tint')).add_to(marker_cluster)
    
map0

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
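
The chained `str.replace` calls cover only the listed lowercase letters, so a name such as "Nádrž Dolní Žleb" keeps its capital "Ž". A more general alternative, sketched here with the standard `unicodedata` module, decomposes each character and drops the combining marks. The function name `strip_diacritics` is hypothetical.

```python
import unicodedata


def strip_diacritics(text):
    """Drop combining marks (e.g. 'š' -> 's') while keeping the base letters."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_diacritics("Koupaliště Flošna"))
print(strip_diacritics("Nádrž Dolní Žleb"))
```

Applied to the dataframe, this could be written as `swim_loc['Name'].map(strip_diacritics)`, assuming the column contains only strings.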
  


...

In [519]:
#We need to remove diacritics from 'Name' column since the map does not display those letters correctly.
df_map['Name'] = df_map['Name'].str.translate(str.maketrans('áíéěščřžýůúň', 'aieescrzyuun'))

#We create map object using Folium. Based on our preferences, we set tiles to cartodbpositron and zoom_start to 10.
map1 = folium.Map(
    location=[location.latitude, location.longitude],
    tiles='cartodbpositron',
    zoom_start=10)

#Then we create a list of latitude and longitude coordinate pairs.
map_locations = df_map[['Latitude_2', 'Longitude_2']]
locationlist = map_locations.values.tolist()


#We add a marker for each of the selected places.
for point in range(0, len(locationlist)):
    folium.Marker(locationlist[point], popup=df_map['Name'][point], icon=folium.Icon(color='darkblue', icon='tint')).add_to(map1)

#We add a gray home marker for the user's location.
folium.Marker(location=[location.latitude, location.longitude],
              icon=folium.Icon(color='gray', icon='home')).add_to(map1)


map1

Another data visualization is useful for people looking for a place to swim and not willing to pay an entrance fee. Places without entrance fee are marked as green circles, while places with entrance fee of any value are marked as red circles.

In [543]:
#preparing data with needed columns
swim_entr = swim_data[['ID','Name','Latitude','Longitude', 'Average rating', 'Entrance']]

#removing punctuation from 'Name' column
#swim_entr['Name'] = swim_entr['Name'].str.replace('á','a').str.replace('í','i').str.replace('é','e').str.replace('ě','e').str.replace('š','s').str.replace('č','c').str.replace('ř','r').str.replace('ž','z').str.replace('ý','y').str.replace('ů','u').str.replace('ú','u').str.replace('ň','n')

#creating map object
map_entr = folium.Map(
    location=[49.8038, 15.4749],
    tiles='cartodbpositron',
    zoom_start=8
    )

#creating a dictionary for the colors of markers
colordict = {'Entrance fee': 'red', 'No entrance fee': 'lightgreen'}

#setting CircleMarker parameters; each row's own rating is read from the 'Average rating' column
for lat, lon, name, avg_rating, entrance in zip(swim_entr['Latitude'], swim_entr['Longitude'], swim_entr['Name'], swim_entr['Average rating'], swim_entr['Entrance']):
    folium.CircleMarker(
        location = [lat, lon],
        popup = ('Name: ' + str(name).capitalize() + '<br>' + 'Average rating: ' + str(avg_rating) + '<br>' + 'Entrance: ' + str(entrance)),
        color = 'blue',  #'b' is a matplotlib shorthand, not a CSS color name
        fill_color = colordict[entrance],
        fill = True,
        fill_opacity = 0.8
        ).add_to(map_entr)

#visualization of map
map_entr

Since there are people who enjoy swimming and catching the sun without their clothes, we prepared a map representing locations to swim that are suitable for nudists. Places suitable for nudists are marked as pink circles and places not suitable for nudists are marked as dark red circles.

In [548]:
#preparing data with needed columns
swim_nudist = swim_data[['ID','Name','Latitude','Longitude', 'Average rating', 'Nudist beach']]

#removing punctuation from 'Name' column
#swim_nudist['Name'] = swim_nudist['Name'].str.replace('á','a').str.replace('í','i').str.replace('é','e').str.replace('ě','e').str.replace('š','s').str.replace('č','c').str.replace('ř','r').str.replace('ž','z').str.replace('ý','y').str.replace('ů','u').str.replace('ú','u').str.replace('ň','n')

#creating map object
map_nudist = folium.Map(
    location=[49.8038, 15.4749],
    tiles='cartodbpositron',
    zoom_start=8
    )

#creating a dictionary for the colors of markers
colordict = {'Suitable for nudists': 'pink', 'Not suitable for nudists': 'darkred'}

#setting CircleMarker parameters; each row's own rating is read from the 'Average rating' column
for lat, lon, name, avg_rating, nudist in zip(swim_nudist['Latitude'], swim_nudist['Longitude'], swim_nudist['Name'], swim_nudist['Average rating'], swim_nudist['Nudist beach']):
    folium.CircleMarker(
        location = [lat, lon],
        popup = ('Name: ' + str(name).capitalize() + '<br>' + 'Average rating: ' + str(avg_rating) + '<br>' + 'Nudist beach: ' + str(nudist)),
        color = 'blue',  #'b' is a matplotlib shorthand, not a CSS color name
        fill_color = colordict[nudist],
        fill = True,
        fill_opacity = 0.8
        ).add_to(map_nudist)

#visualization of map
map_nudist
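
Both categorical maps look up the fill color with `colordict[...]`, which raises a `KeyError` for any value missing from the dictionary, such as a missing entry in the data. A possible way to share the popup and color logic between the two maps is sketched below. The helper `marker_style`, the fallback color and the sample row values are hypothetical and only illustrate the idea.

```python
from html import escape

ENTRANCE_COLORS = {"Entrance fee": "red", "No entrance fee": "lightgreen"}


def marker_style(name, rating, category, colors, label, fallback="gray"):
    """Return (popup_html, fill_color) for one row (hypothetical helper)."""
    popup = "<br>".join([
        f"Name: {escape(str(name))}",
        f"Average rating: {rating}",
        f"{label}: {escape(str(category))}",
    ])
    # Unknown or missing categories get the fallback color instead of a KeyError.
    return popup, colors.get(category, fallback)


# Made-up sample row; the real values come from swim_data.
popup, color = marker_style("Městské lázně", 4.5, "Entrance fee", ENTRANCE_COLORS, "Entrance")
print(popup)
print(color)
```

Inside the loops, the returned pair could be passed to `folium.CircleMarker` as `popup` and `fill_color`.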

__Next steps:__
- Improve the class
- Clean dataframe & check validity
- Analysis of data:
    - Descriptive stats
    - Visualization