#### DATA DOWNLOADER

Data are downloaded from a combination of the API (https://www.mapotic.com/api/v1/maps/2941/public-pois/?ordering=created&page=1&page_size=10) and each place's webpage, and are saved to a csv file. \
Calling the method mergeData() of the class Swim() is sufficient to produce the csv file. The final file includes the ID of the place, its name, date of creation, longitude and latitude, attributes (description, refreshment, suitability for diving, nudist beach, entrance fee and accessibility) and rating. \
Comments are included in the class.

The only disadvantage of this process is that it opens each place's webpage and searches a string for the Description, which takes a long time: a full run usually takes about 1 hour. Data are valid as of 20/09/2020.
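
As a minimal sketch of how each place's webpage address is derived, the function here mirrors the steps in getDescPlaces(): decompose the name, drop the combining marks, lowercase, and replace spaces with hyphens. The helper name `place_url` and the sample ID and name are hypothetical and used only for illustration.

```python
import unicodedata


def place_url(place_id, name, base="https://www.swimplaces.com/"):
    """Build a place URL following the same steps as getDescPlaces()."""
    # NFKD splits an accented letter into a base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", name).lower()
    # Dropping the combining marks leaves the unaccented letters.
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return f"{base}{place_id}-{stripped.replace(' ', '-')}"


# Hypothetical ID and name, not taken from the dataset.
print(place_url(123, "Písečný rybník"))
# https://www.swimplaces.com/123-pisecny-rybnik
```

Letters that have no decomposition, such as ł, pass through this step unchanged, so a name containing them would keep the original character in the URL.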

In [1]:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import json
import unicodedata
import numpy as np
from tqdm import tqdm
import time

In [252]:
# To request a page other than the first, append a second `page` parameter (here page=4)
response = requests.get('https://www.mapotic.com/api/v1/maps/2941/public-pois/?page=1&page_size=10&ordering=created&page=4')
print(response)

<Response [200]>
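
The request above works with `page` given twice in the query string. As a hedged alternative, the query can be built from a dictionary so that each parameter appears exactly once. The helper name `page_url` is hypothetical, and whether the API treats this URL identically to the one with the duplicated `page` parameter has not been verified here.

```python
from urllib.parse import urlencode

BASE = "https://www.mapotic.com/api/v1/maps/2941/public-pois/"


def page_url(page, page_size=10, ordering="created"):
    """Return the API URL for one result page, with each parameter set once."""
    query = urlencode({"ordering": ordering, "page": page, "page_size": page_size})
    return f"{BASE}?{query}"


print(page_url(4))
# https://www.mapotic.com/api/v1/maps/2941/public-pois/?ordering=created&page=4&page_size=10
```

The resulting string can be passed to requests.get() in place of the hand-written URL.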


In [11]:
class Swim():
    '''
    This class downloads all data and merges them into one DataFrame.
    Call the method mergeData() to produce the csv file.
    '''
    def __init__(self):
        """ Data are downloaded and stored on the instance. """
        
        self.pages_json = self.downloadPages()
        self.get_data = self.getData()

    def downloadPages(self):
        
        """
        Download the data from the API and return them as a list of parsed JSON pages.
           The API response changed its behaviour a few days ago, so in case it does not work properly, raw_data.json from 14/09/2020 is available as a backup.
           The parameter page_size was probably changed by the owner (the pagination parameter does not work anymore), hence the backup.
           The commented-out lines below save the downloaded pages to that json file.
           
        """
        
        resp=[]
        b= tqdm(list(range(1,334)), desc="Downloading JSON data from API")  
        for i in b:
            response = requests.get('https://www.mapotic.com/api/v1/maps/2941/public-pois/?page=1&page_size=10&ordering=created&page='+ str(i) )
            resp.append(response.json())
#         with open('raw_data.json', 'w') as raw:   # in case of bad response from API
#             json.dump(resp, raw) 
        return resp
            
    def getData(self):
        """
        Method getData() will parse json format from API.
        """
        
        preview = self.pages_json
#         with open('raw_data_stare.json') as raw_data: # in case of bad response from API
#             preview = json.load(raw_data)
        latitude = []    
        longitude = []
        ids = [] 
        names = [] 
        ratings = []
        count_ratings = []
        created = []
        for page in preview: 
            for result in page['results']:
                names.append(result['name'])
                ids.append(result['id'])
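                # NOTE: index 0 is stored as latitude and index 1 as longitude. If the API follows
                # GeoJSON order ([longitude, latitude]), these two columns are swapped; verify against the data.
                latitude.append(result['point']['coordinates'][0])
                longitude.append(result['point']['coordinates'][1])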
                ratings.append(result['rating']['average'])
                count_ratings.append(result['rating']['count'])
                created.append(result['created'])

        df = pd.DataFrame({
            'ID' : ids, 
            'Name' : names, 
            'Average rating' : ratings,
            'Number of ratings' : count_ratings,
            'Created' : created,
            'Longitude' : longitude,
            'Latitude' : latitude})
        return df  
    
    def getDescPlaces(self, idn, name):
        """
        Return the content of the description meta tag of a place's webpage (the only part we are interested in), split on '|'.
        The ID and name have to be entered; diacritics are removed from the name here to build the URL.
        """
        
        name = unicodedata.normalize('NFKD', name).lower()
        new_name = ''
        for c in name:
            if not unicodedata.combining(c):
                new_name += c
                
        odkaz = 'https://www.swimplaces.com/'  + str(idn) + '-'+ new_name.replace(' ', '-') 
        r = requests.get(odkaz)
        r.encoding = 'UTF-8'
        soup = BeautifulSoup(r.text,'lxml').find('meta', {'name' : 'description'})['content'].split('|')
        return soup
    
    def getAtt(self):
        """
        Method that will download the information about attributes for each place with the help of getDescPlaces().
        """
        value = None
        def find_between(string, start, end):  #additional function to get information from a string
            return (string.split(start))[1].split(end)[0]
        
        nazev =[]
        att = []
        for i, n in self.get_data.iterrows(): # downloading a string from webpages where Description is saved
            parts = self.getDescPlaces(n['ID'], n['Name'])  # fetch each webpage only once
            nazev.append(parts[0])
            try:
                att.append(parts[1])
            except IndexError:  # the meta description has no attribute part
                att.append(value)
            time.sleep(0.1)
        ids_att = self.get_data['ID'].values 
        
        desc = []
        refresh = []
        diving = []
        entrances = []
        access = []
        nudists = []
        
        # looking for specific attributes in each string for each place
        for ii in tqdm(att, desc = 'Searching through the description for attributes:',position=0, leave=True): 
            if ii is not None:
                if 'Description:' in ii:
                    try:
                        desc.append(find_between(ii, 'Description:', ':').rsplit(' ', 1)[0].rsplit(',', 1)[0].strip()) # assumption: if the creator used colon in description, we will have only a part of the description...
                    except:
                        desc.append(value)
                else:
                    desc.append(value)
                    
                if 'Refreshment' in ii:
                    try:
                        refresh.append(find_between(ii, 'Refreshment:', ':').rsplit(' ', 1)[0].rsplit(',', 1)[0].strip())
                    except:
                        refresh.append(value)
                else:
                    refresh.append(value) 
                    
                if 'Diving' in ii:
                    try:
                        diving.append(find_between(ii, 'Diving:', ':').rsplit(' ', 1)[0].rsplit(',', 1)[0].strip())
                    except:
                        diving.append(value)
                else:
                    diving.append(value)
                    
                if 'Accessibility/parking' in ii:
                    try:
                        access.append(find_between(ii, 'Accessibility/parking:', ':').rsplit(' ', 1)[0].rsplit(',', 1)[0].strip()) 
                    except:
                        access.append(value)                                                         
                else:
                    access.append(value)  
                    
                if 'Entrance' in ii:
                    try:
                        entrances.append(find_between(ii, 'Entrance:', ':').rsplit(' ', 1)[0].rsplit(',', 1)[0].strip()) 
                    except:
                        entrances.append(value)
                else:
                    entrances.append(value)
                    
                if 'Nudist beach' in ii:
                    try:
                        nudists.append(find_between(ii, 'Nudist beach:', ':').rsplit(' ', 1)[0].rsplit(',', 1)[0].strip())
                    except:
                        nudists.append(value) 
                else: 
                    nudists.append(value)
            else:
                desc.append(value)
                refresh.append(value)
                diving.append(value)
                access.append(value)
                entrances.append(value)
                nudists.append(value)
                    
        # Attributes will be prepared in dataframe:
        attributes = pd.DataFrame({
            'id_a' : ids_att,
            'nazev' :nazev,
            'Description' : desc,
            'Refreshment' : refresh,
            'Diving' : diving,
            'Entrance' : entrances,
            'Accessibility and parking' :access,
            'Nudist beach' : nudists
        })
        return attributes

    def mergeData(self):
        ''' 
        Finally, both data sources are merged and the result is saved to a csv file.
        '''
        output = self.get_data.merge(self.getAtt(), left_on = 'ID', right_on = 'id_a')
        output.to_csv('raw_data.csv', sep = ',')
        print('Data prepared in csv file.')
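
As noted in the comment inside getAtt(), cutting at the next colon loses part of a description that itself contains a colon. A possible alternative, sketched here and not the method used in the class, is to split the string on the known attribute labels with a regular expression. The helper name `parse_attributes` and the sample string are hypothetical; the real meta description may be formatted differently.

```python
import re

# Labels searched for by getAtt(); the sample string further down is hypothetical.
LABELS = ["Description", "Refreshment", "Diving", "Entrance",
          "Accessibility/parking", "Nudist beach"]


def parse_attributes(text):
    """Split 'Label: value, Label: value' text on the known labels."""
    pattern = "(" + "|".join(re.escape(label) for label in LABELS) + "):"
    # With one capture group, re.split yields [prefix, label, value, label, value, ...]
    parts = re.split(pattern, text)
    result = {label: None for label in LABELS}
    for label, value in zip(parts[1::2], parts[2::2]):
        # Remove surrounding spaces and the comma that separated the next label.
        result[label] = value.strip().rstrip(",").strip()
    return result


sample = "Description: Quiet pond, sandy shore: shallow entry, Refreshment: yes, Nudist beach: no"
print(parse_attributes(sample))
```

Labels missing from the string stay None, matching the behaviour of getAtt(). The sketch still mis-splits if a description happens to contain one of the labels followed by a colon.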

In [12]:
swim = Swim()
swim.mergeData()

Downloading JSON data from API: 100%|██████████| 333/333 [06:37<00:00,  1.19s/it]
Searching through the description for attributes:: 100%|██████████| 3321/3321 [00:00<00:00, 111133.76it/s]

Data prepared in csv file.
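
The two progress bars are consistent with the hard-coded range(1, 334) in downloadPages(). A small check, assuming the API honours page_size=10 and that the 3321 places reported above are the full set:

```python
import math

total_places = 3321  # from the second progress bar
page_size = 10       # the page_size query parameter used in downloadPages()

# Number of pages needed to cover all places.
pages_needed = math.ceil(total_places / page_size)
# Number of places left for the last page.
on_last_page = total_places - (pages_needed - 1) * page_size

print(pages_needed)  # 333
print(on_last_page)  # 1
```

If the number of places grows past 3330, the upper bound of the range would need to be raised accordingly.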



