> First time use: follow instructions in the README.md file in this directory.


**[PT]** Português

---

**[EN]** English


# Naturalidade


Análise dos valores da informação sobre a naturalidade dos estudantes

---

# Place of Birth

Analysis of the values recorded for the students' place of birth


## Setup

In [1]:
from timelinknb import get_db
from ucalumni.config import default_db

db_spec = default_db
db = get_db(db_spec)

## Lista de lugares diferentes e número de ocorrências

---

## List of different places with number of occurrences

In [2]:
from timelinknb.pandas import attribute_values

naturalidades = attribute_values('naturalidade',dates_between=('1500-00-00','1990-00-00'))
naturalidades.info()

<class 'pandas.core.frame.DataFrame'>
Index: 11499 entries, Lisboa to Óvoa, Viseu
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   count     11499 non-null  int64 
 1   date_in   11499 non-null  object
 2   date_max  11499 non-null  object
dtypes: int64(1), object(2)
memory usage: 359.3+ KB


### Lugares principais

---

### Main locations

In [3]:
naturalidades.sort_values('count', ascending=False).head(20)



value,count,date_in,date_max
Lisboa,8784,1537-02-12,1916-07-19
Coimbra,5526,1537-00-00,1915-10-12
Porto,3391,1537-05-30,1917-10-22
Braga,1608,1540-01-21,1914-07-24
Évora,1072,1537-11-22,1910-10-10
Viseu,986,1537-00-00,1912-07-03
Guimarães,980,1537-12-18,1912-07-18
Lamego,972,1537-00-00,1909-10-05
Aveiro,790,1538-04-21,1913-10-13
Vila Real,765,1537-03-07,1909-11-09


### Lugares com menos de 3 estudantes

---

### Locations with fewer than three students

In [4]:
naturalidades[naturalidades['count']<3].info()

<class 'pandas.core.frame.DataFrame'>
Index: 8948 entries, Abadia to Óvoa, Viseu
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   count     8948 non-null   int64 
 1   date_in   8948 non-null   object
 2   date_max  8948 non-null   object
dtypes: int64(1), object(2)
memory usage: 279.6+ KB


In [5]:
naturalidades[naturalidades['count']<3].sort_values('count', ascending=False).head(20)

value,count,date_in,date_max
Abadia,2,1836-10-14,1886-10-14
"Sande, Viseu",2,1840-10-23,1855-10-13
Santa Comba da Ermida,2,1766-10-01,1824-10-08
"Santa Clara, Coimbra",2,1589-10-14,1910-10-12
"Santa Catarina, Alcobaça",2,1613-10-09,1760-10-01
Santa Barbara,2,1795-10-06,1797-10-09
Santa Ana de Cambas,2,1875-10-15,1902-10-04
Santa Ana,2,1740-01-30,1861-10-25
Sanguinheira,2,1642-10-22,1642-10-22
Sanfins do Minho,2,1652-10-20,1652-10-22


## Identificação de topónimos

---

## Geocoding

https://craftingdh.netlify.app/tutorials/folium/

In [6]:
!pip install geopy


This class lives here for testing purposes; it will migrate to timelinknb when done.

In [199]:
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter
from geopy.distance import geodesic
import pandas as pd

class GeocodeOSM:

    result_cols = ['place_name',
                    'geocoder',
                    'id',
                    'address',
                    'city',
                    'country',
                    'importance',
                    'class',
                    'type',
                    'latitude',
                    'longitude',
                    'distance']

    def __init__(self,user_agent, timeout=3, min_delay_seconds=1, error_wait_seconds=10):
        """ Create a geocoder for OpenStreetMap (Nominatim)
        
        Args:
            user_agent: string identifying the application.
            timeout: time in seconds to wait for an answer
            min_delay_seconds: delay between calls to the service
            error_wait_seconds: seconds to wait when the service returns an error

        Creates a utility class around the geopy Nominatim geocoder.

        Usage:
            osm_coder = GeocodeOSM('myapp')
            result = osm_coder.geocode_osm('Coimbra', featuretype=None, countries='pt')

        See:
            https://geopy.readthedocs.io/en/stable/#nominatim
            https://geopy.readthedocs.io/en/stable/index.html?highlight=geopy.extra.rate_limiter.RateLimiter#geopy.extra.rate_limiter.RateLimiter
        """
        self.geolocator = Nominatim(user_agent=user_agent, timeout=timeout)
        # abide by https://operations.osmfoundation.org/policies/nominatim/
        # use the instance's own geolocator, not a notebook-level global
        self.geocode = RateLimiter(self.geolocator.geocode, min_delay_seconds=min_delay_seconds, max_retries=5,
                                    error_wait_seconds=error_wait_seconds, return_value_on_exception=None)

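    def filter_locations(self, locations: list, classes=None):
        """ Return the locations whose raw['class'] is in classes, or None if none match.

        If classes is None, no filtering is applied.

        locs = self.filter_locations(locations, classes=['place', 'boundary'])  """
        if locations is None:
            return None
        if classes is None:
            return locations or None
        r = [loc for loc in locations if loc.raw['class'] in classes]
        # an empty result is reported as None so callers only need one check
        return r or None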

    def geocode_osm(self,place: str, featuretype: str, countries=None, search_all=False, distance_from=None):
        """Geocode a place with the OpenStreetMap (Nominatim) geocoder
        
        Args:
            place: string with place or address to geocode
            featuretype: type of feature (country, state, city, settlement), or None for any
            countries: country code, or list of country codes, used to restrict the query; tried in order.
                       The list can contain lists, e.g.: ['pt','br',['ao','cv','mz'],['es','fr']]
            search_all: continue searching in more countries after the first results
            distance_from: (lat, long) of a point of origin. If more than one place is found with the
                            same importance, return the one closest to this point. If None, ignore distance.

        Returns:
            A one-row DataFrame with the best match, or None if nothing was found.

        See:
            https://geopy.readthedocs.io/en/stable/#nominatim
            https://geopy.readthedocs.io/en/stable/index.html?highlight=geopy.extra.rate_limiter.RateLimiter#geopy.extra.rate_limiter.RateLimiter
        """

        if countries is not None:
            if type(countries) is str:
                country_codes = [countries]
            elif type(countries) is list:
                if len(countries) > 0:
                    country_codes = countries
                else:
                    country_codes = None
            else:
                raise(ValueError("countries must be None, str or list"))
        else:
            country_codes = None
        
        # validate only when an origin was actually given
        if distance_from is not None:
            if type(distance_from) is not tuple or len(distance_from) != 2:
                raise(ValueError("distance_from must be a tuple (lat,long)"))

        cache = []
        classes = ['place', 'boundary']
        # Query the first country tier (no restriction when countries is None)
        locations = self.geocode(place,
                        featuretype=featuretype,
                        exactly_one=False,
                        namedetails=True, addressdetails=True,
                        country_codes=country_codes[0] if country_codes else None)

        locations = self.filter_locations(locations,classes)

        # Try the remaining tiers if nothing was found, or always when search_all is set
        if (locations is None or search_all) and country_codes is not None and len(country_codes) > 1:

            for countries in country_codes[1:]:
                more_locations = self.geocode(place,
                                featuretype=featuretype,
                                exactly_one=False,
                                namedetails=True, addressdetails=True,
                                country_codes=countries)
                more_locations = self.filter_locations(more_locations,classes)
                if more_locations is not None:
                    if locations is None:
                        locations = more_locations
                    else:
                        locations = locations + more_locations
                    if not search_all:
                        break 

        if locations is None:
            return None
        else:
            found = False
            for location in locations:

                osm_id = location.raw.get('osm_id','*noif*')
                address_details = location.raw['address']
                osm_class = location.raw['class']
                osm_type = location.raw['type']
                lat = location.latitude
                long = location.longitude
                if distance_from is not None:
                    d_lat, d_long = distance_from
                    distance_in_km = geodesic((d_lat,d_long), (lat,long)).km
                else:
                    distance_in_km = None
                    
                country = address_details.get('country','')
                
                city = address_details.get('city',None)  # urban    
                town = address_details.get('town',None)  # rural
                municipality = address_details.get('municipality',None)  # rural

                city_name = city if city is not None \
                            else town if town is not None \
                            else municipality if municipality is not None \
                            else None
                # if city_name is None:
                #    continue
                address_details.pop('postcode',None)
                address_details.pop('country_code',None)

                address_details.pop('ISO3166-2-lvl4',None)
                address_details.pop('ISO3166-2-lvl3',None)

                osm_address = ", ".join(list(address_details.values()))  
                osm_importance = location.raw['importance']
                found = True
                cache = cache + [(place,'osm',osm_id, osm_address, city_name, country, osm_importance,osm_class,osm_type,lat,long,distance_in_km)]

            if found:        
                cache_df = pd.DataFrame(cache,columns=self.result_cols)
                # sort by descending importance and increasing distance. First row should be the best
                cache_df = cache_df.sort_values(['importance','distance'], ascending=[False,True]).iloc[:1]
                return cache_df  
            else:
                return None       
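
The last step of `geocode_osm` keeps the candidate with the highest importance and, among equals, the one closest to `distance_from`. The sketch below illustrates that rule without geopy or pandas. The haversine formula stands in for `geopy.distance.geodesic`, and `pick_best`, `haversine_km` and the candidate records are hypothetical, made up for illustration.

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points.

    Stand-in for geopy.distance.geodesic, which uses an ellipsoidal model.
    """
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))

def pick_best(candidates, origin):
    """Highest importance first; ties go to the candidate closest to origin."""
    return min(
        candidates,
        key=lambda c: (-c["importance"], haversine_km(origin, (c["lat"], c["lon"]))),
    )

uc_location = (40.207422, -8.4260033)

# Hypothetical candidates: two share the same importance, one is less important.
candidates = [
    {"name": "far", "importance": 0.36, "lat": 39.1106425, "lon": -9.3582676},
    {"name": "near", "importance": 0.36, "lat": 40.26, "lon": -8.20},
    {"name": "low", "importance": 0.10, "lat": 40.21, "lon": -8.43},
]

best = pick_best(candidates, uc_location)
print(best["name"])
```

Note that importance always wins over distance: the candidate closest to Coimbra is discarded here because its importance is lower.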

Prioridades de pesquisa

---

Search priorities


In [88]:
plp = ['ao','mz','cv','st','mo','in']
other_countries = ['es','it','fr','ie','de']
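
The priority lists above are meant to be tried in order: Portugal first, then Brazil, then the other groups, stopping at the first group that returns a hit unless `search_all` is set. A minimal sketch of that fallback, with a tiny in-memory gazetteer replacing the real geocoder (`search_by_tiers`, `fake_lookup` and `GAZETTEER` are hypothetical names used only for this illustration):

```python
def search_by_tiers(place, tiers, lookup, search_all=False):
    """Query tiers in order; stop at the first tier with hits unless search_all is set."""
    hits = []
    for tier in tiers:
        found = lookup(place, tier)
        if found:
            hits.extend(found)
            if not search_all:
                break
    return hits or None

# Hypothetical stand-in for the geocoder: place names known per country code.
GAZETTEER = {"pt": ["Aguada"], "br": ["Aguada"], "in": ["Aguada"]}

def fake_lookup(place, tier):
    # A tier is a single code, a list of codes, or None (no country restriction).
    codes = [tier] if isinstance(tier, str) else (tier or list(GAZETTEER))
    return [(place, code) for code in codes if place in GAZETTEER.get(code, [])]

plp = ['ao', 'mz', 'cv', 'st', 'mo', 'in']
other_countries = ['es', 'it', 'fr', 'ie', 'de']
tiers = ['pt', 'br', plp, other_countries, None]

first = search_by_tiers("Aguada", tiers, fake_lookup)
everything = search_by_tiers("Aguada", tiers, fake_lookup, search_all=True)
print(first)
print(len(everything))
```

With `search_all=True` the unrestricted last tier repeats hits already found in earlier tiers, so a real run would probably need to de-duplicate by identifier.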

#### Normalização

---

#### Normalizing

In [236]:
nplaces = {
          'Alcangosta':'Alcongosta',
          'Baia Brasil':'Bahia, Brasil',
          'Baía Brasil':'Bahia, Brasil',
          'Baia, Brasil':'Bahia, Brasil',
          'Baía, Brasil':'Bahia, Brasil',
          'Baía':'Bahia, Brasil',
          'Ribeira de Litém':'Santiago de Litém',
          'Rio de Janeiro':'Rio de Janeiro, Brasil',
          'Roma':'Roma, Itália',
          'Santa Eulália da Palmeira':'Palmeira, Santo Tirso',
          'Santa Eulália de Barrosas':'Santa Eulália, Vizela',
          "Abravezes":"Abraveses",
          "Arrifana de Sousa":"Penafiel",
          "Arrifana de Sousa, Penafiel":"Penafiel",
          "Arrifana do Sousa":"Penafiel",
          "Benviver":"Bem-Viver",
          "Cabeceira de Basto":"Cabeceiras de Basto",
          "Casais do Campo":"Casais do Campo, 3045-039, Coimbra",
          "Farinha Podre, hoje São Pedro de Alva":"São Pedro de Alva",
          "Funchal, Ilha da Madeira":"Funchal, Madeira",
          "Ilha de S. Tomé":"S.Tomé e Príncipe",
          "Lagoa, Ilha de São Miguel, Açores":"Lagoa, Açores",
          "Landroal":"Alandroal",
          "Lavarrabos, hoje São João do Campo":"São João do Campo",
          "Mondim":"Mondim de Basto",
          "Panascoso":"Penhascoso",
          "Penavrde":"Pena Verde",
          "Punhete, hoje Constância": "Constância",
          "Ribeira da Pena":"Ribeira de Pena",
          "São João de El-Rei, Minas Gerais, Brasil": "São João del-Rei, Minas Gerais, Brasil",
          "São Paio de Farinha Podre, hoje São Pedro de Alva":"São Pedro de Alva",
          "São Tiago de Besteiros":"Santiago de Besteiros",
          "São Tiago de Cacém":"Santiago de Cacém",
          "Sebal Grande":"Sebal, Condeixa-a-Nova",
          "Souzel":"Sousel",
          "Tuias, Marco de Canaveses": "Tuias, Marco de Canaveses",
          "Vendas dos Moinhos": "Venda dos Moinhos",
          "Vila Cova de Sub-Avô, hoje Vila Cova de Alva":"Vila Cova de Alva",
          "Vila Nova de Portimão":"Portimão",
          

          # Inglaterra
          # Ilha de S. Tomé
          }
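
Several entries in the table above follow the pattern "old name, hoje new name" ("hoje" meaning "today"). As a possible complement to the explicit table, a generic rule could keep only the current name. The sketch below is an assumption about how that might look; `normalize` and `HOJE` are hypothetical names, and the table is a two-entry excerpt.

```python
import re

# Small excerpt of the nplaces table above.
nplaces = {
    "Landroal": "Alandroal",
    "Souzel": "Sousel",
}

# "Old name, hoje New name": capture the part after "hoje".
HOJE = re.compile(r"^.+?,\s*hoje\s+(?P<current>.+)$")

def normalize(place, table):
    """Explicit table first, then the generic ', hoje ' rule, else unchanged."""
    place = place.strip()
    if place in table:
        return table[place]
    m = HOJE.match(place)
    return m.group("current").strip() if m else place

for name in ["Souzel", "Punhete, hoje Constância", "Coimbra"]:
    print(name, "->", normalize(name, nplaces))
```

The explicit table is checked first so that hand-made corrections always override the generic rule.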

#### Testes

---

#### Testing

Colocar o topónimo a localizar

---

Set the place to look for

In [71]:
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter
from geopy.distance import geodesic

geolocator = Nominatim(user_agent="fauc1530-1919", timeout=2)
# abide to https://operations.osmfoundation.org/policies/nominatim/
geocode = RateLimiter(geolocator.geocode, min_delay_seconds = 1,   return_value_on_exception = None) 


In [235]:
place = "Pena Verde"

if place in nplaces.keys():
    nplace = nplaces[place]
    print(f"Normalized: {nplace}")
else:
    nplace = place

uc_location = (40.207422, -8.4260033)
countries = ['pt', 'br',['ao','mz','cv','st','mo','in'],['es','it','fr','ie','de'],None]

locator = GeocodeOSM('fauc1537-1919', timeout=15)
result = locator.geocode_osm(nplace, featuretype=None, countries=countries, distance_from=uc_location, search_all=False)
result

Unnamed: 0,place_name,geocoder,id,address,city,country,importance,class,type,latitude,longitude,distance
0,Pena Verde,osm,6406860,"Pena Verde, Aguiar da Beira, Guarda, Portugal",Aguiar da Beira,Portugal,0.55,boundary,administrative,40.726664,-7.505577,97.045236


### Improve

* When a place is not found, try removing the "orago" (patron saint) from the name and searching again. The orago appears in names of the form São|Santa + word + da|de|do + Place
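
A rough sketch of the orago idea, assuming a single-word saint name and treating "Santo" as a third prefix (an addition not in the note above). `strip_orago` is a hypothetical helper and is only meant as a fallback after a failed search, since it also shortens valid names such as "São Pedro de Alva".

```python
import re

# São|Santa|Santo + one word + da|de|do + Place
ORAGO = re.compile(r"^(?:São|Santa|Santo)\s+\S+\s+(?:da|de|do)\s+(?P<place>.+)$")

def strip_orago(name):
    """Return the place part of 'Saint X de Place', or the name unchanged."""
    m = ORAGO.match(name.strip())
    return m.group("place") if m else name

for name in ["Santa Eulália da Palmeira", "Santa Ana de Cambas", "Santa Barbara", "Sanfins do Minho"]:
    print(name, "->", strip_orago(name))
```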


### Misses hard to understand

* "Lagoa, Ilha de São Miguel, Açores"



### Mistakes
* Secarias,osm,280149141,"Secarias, Silveira, Silveira, Torres Vedras, Lisboa, Portugal",Silveira,Portugal,0.36,place,hamlet,39.1106425,-9.3582676,145.70092450095862 
    * should be https://pt.wikipedia.org/wiki/Secarias
    * better to search Wikidata first, restricting results by type = freguesia de Portugal (Portuguese civil parish)

```sparql
   SELECT DISTINCT ?rank ?item ?itemLabel ?iof ?iofLabel ?inside ?insideLabel ?coordinates ?countryLabel WHERE 

   { SERVICE wikibase:mwapi 
              {bd:serviceParam wikibase:endpoint "pt.wikipedia.org";
                      wikibase:api "Generator";
                      mwapi:generator "search";
                      mwapi:gsrsearch "Secarias";
                      mwapi:gsrlimit "max".
               
      ?item wikibase:apiOutputItem mwapi:item .
    }
      
      ?item p:P31 ?statement0.
      ?statement0 (ps:P31/(wdt:P279*)) wd:Q1131296.
      ?item p:P31/ps:P31 ?iof.
    
      ?item p:P131/ps:P131 ?inside.   
    
      ?inside p:P31 ?statement1.
      ?statement1 (ps:P31/(wdt:P279*)) wd:Q13217644.
    
      ?item p:P625/ps:P625 ?coordinates.
      ?item p:P17/ps:P17 ?country.
    
   {SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". }}
     }
  LIMIT 300
  ```
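
To run the query above for any place name, the search term can be substituted into the query text and the result sent to the Wikidata Query Service. The sketch below only builds the query and the request URL; it makes no network call. `build_query` and `build_request_url` are hypothetical helpers. The label language is set to "pt,en" as an assumption, because `[AUTO_LANGUAGE]` appears to be expanded only by the query service's web interface.

```python
from string import Template
from urllib.parse import urlencode

WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# Same query as above, with the search term turned into a placeholder.
QUERY = Template('''
SELECT DISTINCT ?rank ?item ?itemLabel ?iof ?iofLabel ?inside ?insideLabel ?coordinates ?countryLabel WHERE {
  SERVICE wikibase:mwapi {
    bd:serviceParam wikibase:endpoint "pt.wikipedia.org";
                    wikibase:api "Generator";
                    mwapi:generator "search";
                    mwapi:gsrsearch "$search";
                    mwapi:gsrlimit "max".
    ?item wikibase:apiOutputItem mwapi:item .
  }
  ?item p:P31 ?statement0.
  ?statement0 (ps:P31/(wdt:P279*)) wd:Q1131296.
  ?item p:P31/ps:P31 ?iof.
  ?item p:P131/ps:P131 ?inside.
  ?inside p:P31 ?statement1.
  ?statement1 (ps:P31/(wdt:P279*)) wd:Q13217644.
  ?item p:P625/ps:P625 ?coordinates.
  ?item p:P17/ps:P17 ?country.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "pt,en". }
}
LIMIT 300
''')

def build_query(place):
    """Insert the place name, escaping backslashes and double quotes for SPARQL."""
    escaped = place.replace("\\", "\\\\").replace('"', '\\"')
    return QUERY.substitute(search=escaped)

def build_request_url(place):
    """URL for a GET request asking the query service for JSON results."""
    return WDQS_ENDPOINT + "?" + urlencode({"query": build_query(place), "format": "json"})

url = build_request_url("Secarias")
print(url[:60])
```

The URL can then be fetched with any HTTP client, sending a descriptive `User-Agent` header.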


In [11]:
places = naturalidades[naturalidades['count']>=2].sort_index().index
len(places)
places[:10]

Index(['', 'A de Barros', 'Abadia', 'Abambres', 'Abaças', 'Abiul', 'Abiúl',
       'Aboim', 'Aboim da Nobrega', 'Abragão'],
      dtype='object', name='value')

New version using a function

In [237]:
from os.path import exists
import pandas as pd

uc_location = (40.207422, -8.4260033)
countries = ['pt', 'br',['ao','mz','cv','st','mo','in'],['es','it','fr','ie','de'],None]

locator = GeocodeOSM('fauc1537-1919', timeout=120)

cache_file = '../inferences/places/osm-places.csv'
places_cache_cols = ['naturalidade','geocoder','id','address','city','country','importance','class','type','latitude','longitude','distance']
if exists(cache_file):
    places_cache = pd.read_csv(cache_file)
else:
   places_cache = pd.DataFrame(columns=locator.result_cols)


not_found_file = '../inferences/places/osm_not_found.csv'

not_found_df: pd.DataFrame = None

if exists(not_found_file):
    not_found_df = pd.read_csv(not_found_file)
    not_found = list(not_found_df['not_found'])
else:
    not_found = []
    not_found_df = pd.DataFrame(columns=['not_found'])

cached = set(places_cache['place_name'])

counter = 0
n_cached = 0
n_not_found = 0

for place in places:
    
    if len(place.strip()) == 0:
        continue

    if place in ['Alemanha','Lisboa','Aveiro','Ilha da Madeira']:
        pass  # to debug

    counter = counter + 1
    if place in cached:
        n_cached = n_cached + 1
        continue
    if place in not_found:
        n_not_found = n_not_found + 1
        continue

    if place in nplaces.keys():
        nplace = nplaces[place]
        print(f"Using normalized form '{nplace}' for '{place}'")
    else:
        nplace = place
    

    cache_df = locator.geocode_osm(nplace, featuretype=None, countries=countries, distance_from=uc_location, search_all=False)
    if cache_df is not None:
        cache_df['place_name'] = place  # restore the original name, before normalization
        places_cache = pd.concat([places_cache,cache_df],axis=0)
        print(f"Found: {place} ({cache_df['address'].iloc[0]})")
    else:
        # remember the miss so the place is not queried again on the next run
        nf = pd.DataFrame([place],columns=['not_found'])
        not_found_df = pd.concat([not_found_df,nf], axis=0)
        not_found = not_found + [place]
        print("Not found:",place)






Using normalized form 'Abraveses' for 'Abravezes'
Found: Abravezes (Abraveses, Viseu, Viseu, Portugal)
Not found: Acentar
Not found: Achada, Ponta Delgada
Not found: Adaganha
Found: Agada (Angadi, Karwar taluk, Uttara Kannada, Karnataka, India)
Found: Aguada (Aguada, Carmópolis, Região Geográfica Imediata de Aracaju, Região Geográfica Intermediária de Aracaju, Sergipe, Região Nordeste, Brasil)
Not found: Aiamonte
Not found: Ala, Miranda do Douro
Not found: Alcantra, Maranhão
Not found: Alcaíde, Guarda
Not found: Alcorrochel
Not found: Aldeia Galega, Ribatejo
Found: Aldeia da Cruz, Ourém (Aldeia, Porto Velho, Freixianda, Ribeira do Fárrio e Formigais, Ourém, Santarém, Portugal)
Not found: Aldeia de Barros
Not found: Aldeia de Cima, Armamar
Not found: Aldeia de Joane
Not found: Aldeias Novas
Not found: Alfora
Not found: Algarve, Faro
Not found: Algarve, Lagos
Not found: Algarve, Loulé
Not found: Algarve, Silves
Not found: Algodres, Linhares
Not found: Algozo
Not found: Alhos Vedras
Not f

In [205]:
places_cache.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3157 entries, 0 to 0
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   place_name  3157 non-null   object 
 1   geocoder    3157 non-null   object 
 2   id          3157 non-null   int64  
 3   address     3157 non-null   object 
 4   city        3099 non-null   object 
 5   country     3157 non-null   object 
 6   importance  3157 non-null   float64
 7   class       3157 non-null   object 
 8   type        3157 non-null   object 
 9   latitude    3157 non-null   float64
 10  longitude   3157 non-null   float64
 11  distance    3157 non-null   float64
dtypes: float64(4), int64(1), object(7)
memory usage: 320.6+ KB


In [206]:
places_cache.to_csv(cache_file,index=False)
not_found_file = '../inferences/places/osm_not_found.csv'
not_found_df.to_csv(not_found_file,index=False)


In [208]:
# Merge located places with naturalidade stats
naturalidades['naturalidade'] = naturalidades.index.values
naturalidade_geocoded = naturalidades.merge(places_cache,how='left',right_on='place_name',left_on='naturalidade')
naturalidade_geocoded.head(10)

Unnamed: 0,count,date_in,date_max,naturalidade,place_name,geocoder,id,address,city,country,importance,class,type,latitude,longitude,distance
0,8784,1537-02-12,1916-07-19,Lisboa,Lisboa,osm,5400890.0,"Lisboa, Lisboa, Portugal",Lisboa,Portugal,0.82497,boundary,administrative,38.744052,-9.151828,174.057858
1,5526,1537-00-00,1915-10-12,Coimbra,Coimbra,osm,5379538.0,"Coimbra, Coimbra, Portugal",Coimbra,Portugal,0.685346,boundary,administrative,40.211193,-8.429463,0.511958
2,3391,1537-05-30,1917-10-22,Porto,Porto,osm,3372453.0,"Cedofeita, Santo Ildefonso, Sé, Miragaia, São ...",Porto,Portugal,0.735184,boundary,administrative,41.149451,-8.610788,105.770156
3,1608,1540-01-21,1914-07-24,Braga,Braga,osm,4115866.0,"Braga, Braga, Portugal",Braga,Portugal,0.671045,boundary,administrative,41.551058,-8.428005,149.213032
4,1072,1537-11-22,1910-10-10,Évora,Évora,osm,5402589.0,"Évora, Évora, Portugal",Évora,Portugal,0.633324,boundary,administrative,38.570774,-7.909281,187.077773
5,986,1537-00-00,1912-07-03,Viseu,Viseu,osm,5330332.0,"Viseu, Viseu, Portugal",Viseu,Portugal,0.621902,boundary,administrative,40.657471,-7.913866,66.226034
6,980,1537-12-18,1912-07-18,Guimarães,Guimarães,osm,3924938.0,"Guimarães, Braga, Portugal",Guimarães,Portugal,0.66133,boundary,administrative,41.441768,-8.295571,137.515578
7,972,1537-00-00,1909-10-05,Lamego,Lamego,osm,5328700.0,"Lamego, Viseu, Portugal",Lamego,Portugal,0.578091,boundary,administrative,41.07174,-7.814995,109.011191
8,790,1538-04-21,1913-10-13,Aveiro,Aveiro,osm,5325138.0,"Aveiro, Aveiro, Portugal",Aveiro,Portugal,0.625496,boundary,administrative,40.640496,-8.653784,51.829301
9,765,1537-03-07,1909-11-09,Vila Real,Vila Real,osm,4202166.0,"Vila Real, Vila Real, Portugal",Vila Real,Portugal,0.678191,boundary,administrative,41.300624,-7.764228,133.643484


#### Número de estudantes com naturalidade georeferenciada

---

#### Number of students with geocoded place of birth

In [209]:
naturalidade_geocoded[naturalidade_geocoded['address'].notnull()][['count']].sum()

count    83500
dtype: int64

Sem georeferenciação

---

Without geocoding

In [210]:
naturalidade_geocoded[naturalidade_geocoded['address'].isnull()][['count']].sum()

count    10645
dtype: int64
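
The two totals reported in this section give the share of students whose place of birth is already geocoded. A quick check using only the figures shown in the outputs:

```python
geocoded = 83500      # students whose place of birth has a geocoded address
not_geocoded = 10645  # students whose place of birth is still unresolved

total = geocoded + not_geocoded
share = geocoded / total
print(f"{total} students, {share:.1%} geocoded")  # → 94145 students, 88.7% geocoded
```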

Lugares mais importantes para georeferenciar

---

Most relevant places to be geocoded

In [211]:
naturalidade_geocoded[naturalidade_geocoded['address'].isnull()].sort_values('count',ascending=False).head(20)

Unnamed: 0,count,date_in,date_max,naturalidade,place_name,geocoder,id,address,city,country,importance,class,type,latitude,longitude,distance
62,202,1540-10-20,1770-10-01,Arrifana de Sousa,,,,,,,,,,,,
167,77,1538-12-05,1910-10-14,Vila Nova de Portimão,,,,,,,,,,,,
387,32,1540-00-00,1771-05-24,,,,,,,,,,,,,
416,29,1583-10-08,1840-10-03,Benviver,,,,,,,,,,,,
445,27,1625-10-14,1873-10-15,"Lavarrabos, hoje São João do Campo",,,,,,,,,,,,
448,27,1560-10-01,1826-10-30,Mondim,,,,,,,,,,,,
452,27,1639-11-08,1889-10-15,Ribeira da Pena,,,,,,,,,,,,
454,27,1693-10-01,1888-10-08,Sebal Grande,,,,,,,,,,,,
463,26,1656-10-16,1763-10-01,Panascoso,,,,,,,,,,,,
467,26,1660-01-31,1881-10-12,São Tiago de Besteiros,,,,,,,,,,,,
