## Import all necessary packages

In [None]:
import docx2txt
import spacy
from collections import Counter
import pandas as pd 

# from urllib import request
# from geotext import GeoText

from geopy.geocoders import Nominatim
from geopy.exc import GeocoderTimedOut

import plotly.express as px

## Start by importing the Airbnb document

In [5]:
airbnbtext = docx2txt.process("airbnb_text.docx")
print(airbnbtext)

Using Airbnb, a short-term rental service that enables homeowners or tenants to rent out properties for side income, is a huge hit with budget-conscious travelers. Regulatory boards around the world, however, can be a challenge. 

Among the problems that city governments and state regulators foresee with Airbnb are the potential to upend landlord-tenant relations (for example, a landlord could try to evict a tenant to charge higher short-term rents to vacationers). Regulators also fear a potential influx of travelers who will transform quiet residential neighborhoods into revolving hotel districts. There are also concerns about a current lack of oversight and accountability over Airbnb-related tax collection and adherence to zoning by-laws.

Therefore, individuals who are considering using Airbnb (either to find a room or to rent out an apartment) should conduct due diligence to check that the city in question fosters a supportive environment for Airbnb. Moreover, the listing should co

## Use SpaCy to extract locations

In [6]:
# the 'en' shortcut was removed in spaCy 3; load the model by its full package name
nlp = spacy.load('en_core_web_sm')

In [12]:
locations = list()
for ent in nlp(airbnbtext).ents:
    if ent.label_ in ['GPE', 'LOC']:
        locations.append(ent)
        print(ent)

Paris
Barcelona
Santa Monica
California
Airbnb
Amsterdam
Berlin
London
San Francisco
New York
Paris
Paris
Paris
Airbnb
Spain
New York
Santa Monica
Paris
France
Barcelona
Airbnb
Barcelona
Airbnb
Berlin
Airbnb
Berlin
Amsterdam
London
Amsterdam
Airbnb
Amsterdam
Greater London
London
Airbnb
London
stays.15﻿
New York
the United States
Airbnb
Airbnb
New York City
San Francisco
San Francisco
San Francisco
New York
San Francisco
Santa Monica
Airbnb
U.S.
California
Airbnb
Santa Monica
the City of Santa Monica
Airbnb
Santa Monica
Airbnb


## Collapse locations into dictionary 

#### (note: the same place worded differently will need additional mapping)

In [42]:
# count how often each location string occurs
c = Counter(loc.text for loc in locations)

In [43]:
c

Counter({'Paris': 5,
         'Barcelona': 3,
         'Santa Monica': 5,
         'California': 2,
         'Airbnb': 13,
         'Amsterdam': 4,
         'Berlin': 3,
         'London': 4,
         'San Francisco': 5,
         'New York': 4,
         'Spain': 1,
         'France': 1,
         'Greater London': 1,
         'stays.15\ufeff': 1,
         'the United States': 1,
         'New York City': 1,
         'U.S.': 1,
         'the City of Santa Monica': 1})

For example, 'Greater London' needs to be mapped to 'London', and 'the City of Santa Monica' to 'Santa Monica'.

### Map terms using the mapping.txt file (to add new terms, only the file needs updating)

In [73]:
# newlist = list()
# for loc in locations:
#     texttoappend = loc.text
#     for new in newlist:
#         if new in loc.text:
#             texttoappend = new
#     newlist.append(texttoappend)

In [123]:
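def map_terms(term):
    """Return the canonical, title-cased name for `term` using mapping.txt."""
    term = term.lower().strip()
    mapped = term  # fall back to the term itself when no line matches
    # each line of mapping.txt is comma-separated: canonical name first, then its variants
    with open('mapping.txt') as f:
        for line in f.read().splitlines():
            names = line.split(',')
            if term in names:
                mapped = names[0]

    return mapped.title()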
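
`map_terms` re-reads mapping.txt on every call. A possible alternative, sketched below, builds the lookup once and reuses it. The names `load_mapping` and `map_term` and the example lines are hypothetical; the only thing taken from the notebook is the file layout implied by the function above (comma-separated terms, canonical name first).

```python
def load_mapping(lines):
    """Build a {variant: canonical} lookup from comma-separated lines."""
    lookup = {}
    for line in lines:
        terms = [t.strip().lower() for t in line.split(',') if t.strip()]
        if not terms:
            continue  # skip blank lines
        canonical = terms[0]
        for term in terms:
            lookup[term] = canonical
    return lookup


def map_term(text, lookup):
    """Return the canonical, title-cased form of `text` (or `text` itself if unmapped)."""
    key = text.strip().lower()
    return lookup.get(key, key).title()


# Hypothetical mapping.txt contents, shown inline so the sketch runs without the file.
example_lines = [
    "london,greater london",
    "santa monica,the city of santa monica",
    "united states of america,the united states,u.s.",
]
lookup = load_mapping(example_lines)
print(map_term("Greater London", lookup))
print(map_term("Paris", lookup))
```

With the real file, the lookup could be built from `open('mapping.txt').read().splitlines()` and passed to every call.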

In [105]:
newlist = list()
for loc in locations:
    mapped_term = map_terms(loc.text)
    newlist.append(mapped_term)

In [107]:
# recount using the mapped names
c = Counter(newlist)

In [108]:
c

Counter({'Paris': 5,
         'Barcelona': 3,
         'Santa Monica': 6,
         'California': 2,
         'Airbnb': 13,
         'Amsterdam': 4,
         'Berlin': 3,
         'London': 5,
         'San Francisco': 5,
         'New York': 4,
         'Spain': 1,
         'France': 1,
         'Stays.15\ufeff': 1,
         'United States Of America': 2,
         'New York City': 1})

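The mapping has merged the variant spellings: 'London' now has 5 mentions, 'Santa Monica' has 6, and 'the United States' and 'U.S.' are combined as 'United States Of America'. 'New York City' is still counted separately from 'New York', so it would need its own entry in mapping.txt.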
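
The counts can still contain strings that are not places, such as 'Airbnb' or 'Stays.15\ufeff'. One way to screen these out before counting is sketched below. The `NOT_PLACES` set and the digit heuristic are assumptions for illustration, not part of the original workflow, and the heuristic would also reject any genuine place name that contains a digit.

```python
from collections import Counter

# Assumption: a hand-maintained set of strings the NER model mislabels as places.
NOT_PLACES = {"airbnb", "wikipedia"}


def clean_name(name):
    """Strip whitespace and a stray byte-order mark (U+FEFF) from an entity string."""
    return name.replace("\ufeff", "").strip()


def looks_like_place(name):
    """Heuristic: reject known non-places and anything containing a digit."""
    cleaned = clean_name(name)
    if not cleaned or cleaned.lower() in NOT_PLACES:
        return False
    return not any(ch.isdigit() for ch in cleaned)


raw = ["Paris", "Airbnb", "stays.15\ufeff", "London", "Paris"]
counts = Counter(clean_name(n) for n in raw if looks_like_place(n))
print(counts)
```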

## Make a df from the counts dictionary

In [116]:
airbnb = pd.DataFrame(c.items(), columns=['city', 'mentions'])
airbnb['sourcedoc'] = 'airbnb'  # add doc source 

In [117]:
airbnb

Unnamed: 0,city,mentions,sourcedoc
0,Paris,5,airbnb
1,Barcelona,3,airbnb
2,Santa Monica,6,airbnb
3,California,2,airbnb
4,Airbnb,13,airbnb
5,Amsterdam,4,airbnb
6,Berlin,3,airbnb
7,London,5,airbnb
8,San Francisco,5,airbnb
9,New York,4,airbnb


## Repeat these steps for the Olympics document

In [118]:
olympicstext = docx2txt.process("olympics_text.docx")
print(olympicstext)

List of Olympic Games host cities

From Wikipedia, the free encyclopedia

Map of host cities and countries of the modern summer (orange) and winter (blue) Olympics. * Tokyo hosted the 2020 Summer Olympics in 2021. In the SVG file, tap or hover over a city to show its name (only on the desktop).

This is a list of host cities of the Olympic Games, both summer and winter, since the modern Olympics began in 1896. Since then, summer and winter games have usually celebrated a four-year period known as an Olympiad; summer and winter games normally held in staggered even years. There have been 28 Summer Olympic Games held in 23 cities, and 23 Winter Olympic Games held in 20 cities. In addition, three summer and two winter editions of the games were scheduled to take place but later cancelled due to war: Berlin (summer) in 1916; Tokyo–Helsinki (summer) and Sapporo–Garmisch-Partenkirchen (winter) in 1940; and London (summer) and Cortina (winter) in 1944. The 1906 Intercalated Olympics were offi

In [119]:
# reuse the nlp pipeline loaded earlier instead of loading the model again
locations2 = list()
for ent in nlp(olympicstext).ents:
    if ent.label_ in ['GPE', 'LOC']:
        locations2.append(ent)
        print(ent)

Wikipedia
Tokyo
Tokyo
Helsinki
Athens
Tokyo
Singapore
Beijing
Paris
Milan
Cortina
Los Angeles
Beijing
Paris
London
Los Angeles
Tokyo
Pyeongchang
Beijing
London
Paris
Los Angeles
France
Japan
United Kingdom
Austria
Australia
Canada
Europe
Asia
Singapore
Southeast Asia's
South America's
Central Asia
Central America


In [124]:
newlist2 = list()
for loc in locations2:
    mapped_term = map_terms(loc.text)
    newlist2.append(mapped_term)

In [125]:
# recount using the mapped names
c = Counter(newlist2)

In [126]:
c

Counter({'Wikipedia': 1,
         'Tokyo': 4,
         'Helsinki': 1,
         'Athens': 1,
         'Singapore': 2,
         'Beijing': 3,
         'Paris': 3,
         'Milan': 1,
         'Cortina': 1,
         'Los Angeles': 3,
         'London': 2,
         'Pyeongchang': 1,
         'France': 1,
         'Japan': 1,
         'United Kingdom': 1,
         'Austria': 1,
         'Australia': 1,
         'Canada': 1,
         'Europe': 1,
         'Asia': 1,
         'Southeast Asia': 1,
         'South America': 1,
         'Central Asia': 1,
         'Central America': 1})

In [127]:
olympics = pd.DataFrame(c.items(), columns=['city', 'mentions'])
olympics['sourcedoc'] = 'olympics'  # add doc source 
olympics

Unnamed: 0,city,mentions,sourcedoc
0,Wikipedia,1,olympics
1,Tokyo,4,olympics
2,Helsinki,1,olympics
3,Athens,1,olympics
4,Singapore,2,olympics
5,Beijing,3,olympics
6,Paris,3,olympics
7,Milan,1,olympics
8,Cortina,1,olympics
9,Los Angeles,3,olympics


### ...add additional docs as desired before merging
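
Since the same extract-and-count steps are repeated for each document, they could be wrapped in one helper. The sketch below is a hypothetical `count_mentions` function, not code from the notebook. It takes plain `(text, label)` pairs so it can run without loading spaCy; with spaCy, the pairs would come from `((ent.text, ent.label_) for ent in nlp(text).ents)`.

```python
from collections import Counter


def count_mentions(entities, source, labels=("GPE", "LOC")):
    """Count entity texts whose label is in `labels` and tag each row with its source."""
    counts = Counter(text for text, label in entities if label in labels)
    return [
        {"city": city, "mentions": n, "sourcedoc": source}
        for city, n in counts.items()
    ]


# Made-up sample entities standing in for spaCy output.
sample = [
    ("Tokyo", "GPE"),
    ("Olympic Games", "EVENT"),
    ("Tokyo", "GPE"),
    ("Europe", "LOC"),
]
rows = count_mentions(sample, "olympics")
print(rows)
```

The returned list of dicts can be passed straight to `pd.DataFrame(rows)`, which yields the same three columns used above.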

## Merge dataframes 

In [181]:
merged_df = pd.concat([airbnb, olympics])
merged_df.head()

Unnamed: 0,city,mentions,sourcedoc
0,Paris,5,airbnb
1,Barcelona,3,airbnb
2,Santa Monica,6,airbnb
3,California,2,airbnb
4,Airbnb,13,airbnb


In [132]:
len(merged_df)

39

## Extract coordinates for locations mentioned using Nominatim

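This OSM tool (https://wiki.openstreetmap.org/wiki/Nominatim) geocodes location names given as text. Each application has to identify itself through the 'user_agent' parameter (an application name or a contact email address works), because the public server limits how many requests a single user can make: its usage policy sets an absolute maximum of one request per second. Keep the request volume low rather than trying to work around the limit.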

In [136]:
geolocator = Nominatim(user_agent="myemail@email.com", timeout=6)
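
Because the public server is rate limited, it helps to look up each distinct name only once and to pause between requests. The sketch below shows one way to do that; `geocode_unique` and `fake_geocode` are hypothetical names, and the stand-in function replaces the real network call so the example runs offline. Catching every exception is a simplification for the sketch.

```python
import time


def geocode_unique(names, geocode_fn, delay_seconds=1.0):
    """Call geocode_fn once per distinct name, pausing between calls."""
    results = {}
    # dict.fromkeys removes duplicates while keeping first-seen order
    for i, name in enumerate(dict.fromkeys(names)):
        if i:
            time.sleep(delay_seconds)
        try:
            results[name] = geocode_fn(name)
        except Exception as err:  # broad on purpose in this sketch
            print(f"geocode failed on {name!r}: {err}")
            results[name] = None
    return results


calls = []


def fake_geocode(name):
    """Stand-in for a real geocoder: records the call and returns made-up coordinates."""
    calls.append(name)
    return (1.0, 2.0)


found = geocode_unique(["Paris", "London", "Paris"], fake_geocode, delay_seconds=0)
print(found)
print(calls)
```

In the notebook, `geolocator.geocode` would be passed in place of `fake_geocode`. geopy also ships a `RateLimiter` class in `geopy.extra.rate_limiter` that can wrap `geolocator.geocode` to enforce a minimum delay between calls.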

In [140]:
lat_lon = dict()
for loc in merged_df['city']:
    try:
        location = geolocator.geocode(loc)
        if location:  # geocode() returns None when nothing matches
            print(location.latitude, location.longitude)
            lat_lon[loc] = location
    except GeocoderTimedOut as e:
        # str.format needs {} placeholders; the original %s markers were printed literally
        print("Error: geocode failed on input {} with message {}".format(loc, e))
lat_lon

48.8566969 2.3514616
41.3828939 2.1774322
34.0194704 -118.4912273
36.7014631 -118.755997
1.2758649 103.835589
52.3727598 4.8936041
52.5170365 13.3888599
51.5073219 -0.1276474
37.7790262 -122.419906
40.7127281 -74.0060152
39.3260685 -4.8379791
46.603354 1.8883335
39.7837304 -100.445882
40.7127281 -74.0060152
-1.0197136 -71.9383333
35.6828387 139.7594549
60.1674881 24.9427473
37.9839412 23.7283052
1.357107 103.8194992
39.906217 116.3912757
48.8566969 2.3514616
45.4668 9.1905
43.7070273 11.6858483
34.0536909 -118.242766
51.5073219 -0.1276474
37.5622911 128.4295278
46.603354 1.8883335
36.5748441 139.2394179
54.7023545 -3.2765753
47.2 13.2
-24.7761086 134.755
61.0666922 -107.991707
51.0 10.0
51.2086975 89.2343748
-36.8624515 174.7207047
-21.0002179 -61.0006565
39.4009215 72.8676621
-30.29284845 153.12561585745573


{'Paris': Location(Paris, Île-de-France, France métropolitaine, France, (48.8566969, 2.3514616, 0.0)),
 'Barcelona': Location(Barcelona, Barcelonès, Barcelona, Catalunya, 08001, España, (41.3828939, 2.1774322, 0.0)),
 'Santa Monica': Location(Santa Monica, California, United States, (34.0194704, -118.4912273, 0.0)),
 'California': Location(California, United States, (36.7014631, -118.755997, 0.0)),
 'Airbnb': Location(Airbnb, Blair Road, Bukit Merah, Singapore, Central, 169377, Singapore, (1.2758649, 103.835589, 0.0)),
 'Amsterdam': Location(Amsterdam, Noord-Holland, Nederland, (52.3727598, 4.8936041, 0.0)),
 'Berlin': Location(Berlin, 10117, Deutschland, (52.5170365, 13.3888599, 0.0)),
 'London': Location(London, Greater London, England, United Kingdom, (51.5073219, -0.1276474, 0.0)),
 'San Francisco': Location(San Francisco, San Francisco City and County, San Francisco, California, United States, (37.7790262, -122.419906, 0.0)),
 'New York': Location(New York, United States, (40.7127

In [176]:
# keep only the (latitude, longitude) pair of each geocoded Location
coordinates = {name: (loc.latitude, loc.longitude) for name, loc in lat_lon.items()}

In [187]:
# add coordinates to merged df
merged_df['coordinates'] = merged_df['city'].map(coordinates)

# drop lines for which no coordinates were found
# (.copy() avoids a SettingWithCopyWarning when the LAT/LON columns are added below)
merged_df = merged_df[merged_df['coordinates'].notna()].copy()

# parse out lat and lon
merged_df['LAT'] = merged_df['coordinates'].apply(lambda x: x[0])
merged_df['LON'] = merged_df['coordinates'].apply(lambda x: x[1])

In [188]:
merged_df.head()

Unnamed: 0,city,mentions,sourcedoc,coordinates,LAT,LON
0,Paris,5,airbnb,"(48.8566969, 2.3514616)",48.856697,2.351462
1,Barcelona,3,airbnb,"(41.3828939, 2.1774322)",41.382894,2.177432
2,Santa Monica,6,airbnb,"(34.0194704, -118.4912273)",34.01947,-118.491227
3,California,2,airbnb,"(36.7014631, -118.755997)",36.701463,-118.755997
4,Airbnb,13,airbnb,"(1.2758649, 103.835589)",1.275865,103.835589


## Plot locations using plotly.express

In [189]:
mapbox_access_token = "***********************************************"
px.set_mapbox_access_token(mapbox_access_token)

In [200]:
# 'sourcedoc' is categorical, so no continuous color scale is needed
fig = px.scatter_mapbox(merged_df, lat="LAT", lon="LON", color="sourcedoc",
                        size="mentions", zoom=0.5, text="city")

fig.update_layout(title='Locations extracted from text')

fig.show()