# Analyzing Text

It may be surprising that not all travel articles include a map. Understanding where restaurants, hotels, landmarks, and other points of interest are in relation to each other is important for building an itinerary. Parsing text to look up places is a good application of forward geocoding.

Read this blog post for more background:

https://developer.here.com/blog/turn-text-into-here-maps-with-python-nltk

Examples of travel articles without a map:
- [25 Best Things to Do in Cleveland, OH](https://vacationidea.com/destinations/best-things-to-do-in-cleveland.html)
- [US News Travel Section](https://travel.usnews.com/Cleveland_OH/Things_To_Do/)

In [20]:
import bs4
import nltk
import urllib.request  # urlopen lives in the urllib.request submodule

from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.tag import pos_tag

In [8]:
url = 'https://vacationidea.com/destinations/best-things-to-do-in-cleveland.html'
response = urllib.request.urlopen(url)
html = response.read()
soup = bs4.BeautifulSoup(html, 'html.parser')
soup

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "https://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html lang="en" xml:lang="en" xmlns="https://www.w3.org/1999/xhtml">
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="en-us" http-equiv="content-language"/>
<title>25 Best Things to Do in Cleveland, Ohio</title>
<meta content="Explore Cleveland's vibrant parks, museums and attractions on your weekend getaway." name="description">
<style>
a,abbr,acronym,address,applet,b,big,blockquote,body,caption,center,cite,code,dd,del,dfn,div,dl,dt,em,fieldset,font,form,h1,h2,h3,h4,h5,h6,html,i,iframe,img,ins,kbd,label,legend,li,object,ol,p,pre,q,s,samp,small,span,strike,strong,sub,sup,table,tbody,td,tfoot,th,thead,tr,tt,u,ul,var{margin:0;padding:0;font-size:100%;vertical-align:baseline;border:0;outline:0;background:0 0}ol,ul{list-style:none}blockquote,q{quotes:none}address,caption,cite,code,dfn,em,strong,th,var{font-style:normal;font-we

In [9]:
for section in soup(['script', 'style']):
    section.decompose()
    
text = soup.get_text()
text

"\n\n\n\n\n25 Best Things to Do in Cleveland, Ohio\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n\n\n \n \n\n\n \n \n\n\n \nToggle navigation\n\nHome\nBeach Vacations\nRomantic Getaways\nFamily\nWeekend Getaways\nLast Minute\n\n\nSearch this site\n\n \nSearch \n\n\n  \n \n \n\n\n\n\n\n\n\n25 Best Things to Do in Cleveland, Ohio\n\n\n© Courtesy of rudi1976/Fotolia.com\n\n\n\n\n\n\n\n\n \nCleveland, Ohio is a vibrant, culturally diverse city with world-class museums, free attractions, unique wedding venues and beautiful parks. Explore the trendy University Circle area, visit the International Women's Air & Space Museum, the Museum of Contemporary Art, and the Rock and Roll Hall of Fame Museum. \r\n\r\nBest things to do in Cleveland, Ohio with kids include the Cleveland Metroparks Zoo and the Greater Cleveland Aquarium. \r\n\r\n\tMore ideas: Best California Beaches, Oregon, Florida Weekend Getaways\n\n\n\n\n\n\n\n\n\n  1.University Circle\n\n\n © Courtesy

In [14]:
# Ignore punctuation, duplicates
tokenizer = RegexpTokenizer(r'\w+')
tokens = set(tokenizer.tokenize(text))
tokens

{'open',
 'insight',
 'oldest',
 'crystal',
 'Lakefront',
 'other',
 'stimulate',
 'panoramic',
 'beers',
 'development',
 'In',
 'School',
 'explored',
 'Peace',
 'Boulevard',
 'Circle',
 'pm',
 'Berkeley',
 'colonies',
 'CIA',
 'shishito',
 'heroes',
 'Zoo',
 'garden',
 'Möst',
 'Franz',
 'handcrafted',
 'verified',
 'll',
 '88th',
 'paleontology',
 'day',
 'new',
 'help',
 'recreation',
 'Vermont',
 'bread',
 'Book',
 'quick',
 'Stories',
 'famous',
 'numerous',
 'alfresco',
 'smoothies',
 '7340',
 'UT',
 'boast',
 'things',
 'Meanwhile',
 'Trips',
 'gorillas',
 'all',
 'get',
 'Charlottesville',
 'made',
 'times',
 'has',
 'Marsh',
 'shimmers',
 'oven',
 '74',
 'tequila',
 'city',
 'Tybee',
 'Programs',
 '216',
 'women',
 'PA',
 '70',
 'Boasting',
 'still',
 'out',
 'Hills',
 '750',
 'seek',
 'Maine',
 'lighting',
 'exotic',
 'hour',
 'spot',
 'routes',
 'Coral',
 'Tampa',
 'study',
 'residences',
 'provides',
 'your',
 '661',
 'Monterey',
 'TWA',
 'entertain',
 'visitors',
 'hikin

In [21]:
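# Remove stop words (all languages in the NLTK stopwords corpus), then keep proper nouns.
# Requires the NLTK 'stopwords' corpus and a POS tagger model, downloaded once with nltk.download().
stop_words_set = set(stopwords.words())
tokens = [w for w in tokens if w not in stop_words_set]
# Tokens are tagged out of sentence context here, so some tags are wrong (e.g. 'Boasting').
proper = pos_tag(tokens)
tokens = [w for w, pos in proper if pos in ('NNP', 'NNPS')]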

tokens

['Lakefront',
 'School',
 'Peace',
 'Boulevard',
 'Circle',
 'Berkeley',
 'CIA',
 'Zoo',
 'Möst',
 'Franz',
 'Vermont',
 'Book',
 'Stories',
 'UT',
 'Trips',
 'Charlottesville',
 'Marsh',
 'Tybee',
 'Programs',
 'PA',
 'Boasting',
 'Maine',
 'Coral',
 'Tampa',
 'Monterey',
 'TWA',
 'James',
 'Committed',
 'Italy',
 'Space',
 'Erieside',
 'Sunset',
 'Vacations',
 'Sanctuary',
 'Euclid',
 'Pair',
 'Caribbean',
 'Enjoy',
 'Orchestra',
 'Ninety',
 'Wander',
 'Festival',
 'Road',
 'Vienna',
 'World',
 'Visit',
 'Sandusky',
 'LLC',
 'Cool',
 'All',
 'E',
 'Canalway',
 'States',
 'Raton',
 'Aiming',
 'Established',
 'Park',
 'Mills',
 'CA',
 'Photo',
 'Greater',
 'Kenneth',
 'Major',
 'Sand',
 'IWASM',
 'Vacation',
 'Building',
 'Severance',
 'Carlsbad',
 'William',
 'Sadura',
 'Voted',
 'Detroit',
 'Pearl',
 'Aquarium',
 'A',
 'USA',
 'Avenue',
 'Chicago',
 'Places',
 'GA',
 'IL',
 'Lauderdale',
 'West',
 'Jennifer',
 'Erie',
 'Wildlife',
 'Cleveland',
 'Sanibel',
 'Scottsdale',
 'Roll',
 'S
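
The proper-noun list above still contains entries such as 'All' and 'A'. One likely reason is that stop-word lists are lowercase while the tokens keep their original case, so the membership test never matches capitalized words. A minimal sketch of a case-insensitive filter follows; `drop_stop_words` and the tiny `sample_stop_words` set are hypothetical names used for illustration, not part of the original notebook.

```python
def drop_stop_words(tokens, stop_words):
    """Return tokens whose lowercase form is not in stop_words."""
    lowered = {w.lower() for w in stop_words}
    return [t for t in tokens if t.lower() not in lowered]


# Hypothetical stand-in for set(stopwords.words()); assumption for illustration only.
sample_stop_words = {'all', 'a', 'in'}
sample_tokens = ['All', 'A', 'Cleveland', 'In', 'Erie']

print(drop_stop_words(sample_tokens, sample_stop_words))
```

In the notebook, the same function could be called with the existing `stop_words_set`. Note that a case-insensitive filter can also drop place names that happen to match a stop word in some language.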

# Geocoder Autocomplete

The Geocoder Autocomplete API returns a list of address suggestions for a search text. It can be used interactively, testing for a match as the user types, or run over a list of tokens.

In [29]:
import os
import requests

APP_ID_HERE = os.environ['APP_ID_HERE']
APP_CODE_HERE = os.environ['APP_CODE_HERE']

uri = 'https://autocomplete.geocoder.api.here.com/6.2/suggest.json'
params = {
    'app_id': APP_ID_HERE,
    'app_code': APP_CODE_HERE,
    'query': 'Charlottesville',
}

response = requests.get(uri, params=params)
response.json()

{'suggestions': [{'label': 'United States, VA, Charlottesville (City), Charlottesville',
   'language': 'en',
   'countryCode': 'USA',
   'locationId': 'NT_MkkLiWmx89vadPV8GHcS7D',
   'address': {'country': 'United States',
    'state': 'VA',
    'county': 'Charlottesville (City)',
    'city': 'Charlottesville',
    'postalCode': '22902'},
   'matchLevel': 'city'},
  {'label': 'United States, VA, Charlottesville (City)',
   'language': 'en',
   'countryCode': 'USA',
   'locationId': 'NT_ItTbwffEBG8czeHHer6OxD',
   'address': {'country': 'United States',
    'state': 'VA',
    'county': 'Charlottesville (City)'},
   'matchLevel': 'county'},
  {'label': 'United States, TN, Knoxville, Charlottesville Blvd',
   'language': 'en',
   'countryCode': 'USA',
   'locationId': 'NT_TVDv5q3g8gMfkE2G-CNmyB',
   'address': {'country': 'United States',
    'state': 'TN',
    'county': 'Knox',
    'city': 'Knoxville',
    'street': 'Charlottesville Blvd',
    'postalCode': '37922'},
   'matchLevel': 's
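
To run this over a list of tokens, one option is to keep only the tokens whose suggestions reach a chosen `matchLevel`. The sketch below assumes the response shape shown above. `first_match`, `match_tokens`, and `fake_lookup` are hypothetical helpers, and the canned payload is a trimmed copy of the first two suggestions from the output above, so the example runs without network access or credentials.

```python
def first_match(payload, levels=('city',)):
    """Return the first suggestion whose matchLevel is in levels, or None."""
    for suggestion in payload.get('suggestions', []):
        if suggestion.get('matchLevel') in levels:
            return suggestion
    return None


def match_tokens(tokens, lookup, levels=('city',)):
    """Map each token to the label of its first acceptable suggestion.

    lookup is any callable that takes a query string and returns a
    parsed suggest.json response (a dict).
    """
    matches = {}
    for token in tokens:
        hit = first_match(lookup(token), levels)
        if hit is not None:
            matches[token] = hit['label']
    return matches


# Canned response trimmed from the output above (label and matchLevel only).
sample_payload = {'suggestions': [
    {'label': 'United States, VA, Charlottesville (City), Charlottesville',
     'matchLevel': 'city'},
    {'label': 'United States, VA, Charlottesville (City)',
     'matchLevel': 'county'},
]}


def fake_lookup(query):
    """Stand-in for the live request; unknown queries get no suggestions (assumption)."""
    canned = {'Charlottesville': sample_payload}
    return canned.get(query, {'suggestions': []})


print(match_tokens(['Charlottesville', 'Boasting'], fake_lookup))
```

Against the live service, `lookup` could be something like `lambda q: requests.get(uri, params={**params, 'query': q}).json()`, reusing `uri` and `params` from the cell above. Passing the lookup in as a callable keeps the filtering logic testable offline.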

# Try It

Parsing street addresses is tricky, but give it a try: looking for combinations of adjacent tokens and checking them with autocomplete can help you identify location matches.
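
As a starting point, here is a minimal sketch of one way to build token combinations: collect runs of adjacent capitalized words as candidate place names. It uses `re.findall(r'\w+', ...)` to mirror the `RegexpTokenizer(r'\w+')` pattern used earlier while keeping word order, which the `set` of tokens discards. `candidate_phrases` and `max_words` are hypothetical names, and the sample sentence is adapted from the article text shown earlier.

```python
import re


def candidate_phrases(text, max_words=3):
    """Return runs of 2..max_words adjacent capitalized words, in order of first appearance."""
    words = re.findall(r'\w+', text)
    seen = set()
    phrases = []
    for start in range(len(words)):
        for size in range(2, max_words + 1):
            chunk = words[start:start + size]
            if len(chunk) < size:
                break  # ran off the end of the text
            if not all(w[0].isupper() for w in chunk):
                break  # longer runs from this start would also fail
            phrase = ' '.join(chunk)
            if phrase not in seen:
                seen.add(phrase)
                phrases.append(phrase)
    return phrases


sample = ("Explore the trendy University Circle area, "
          "visit the Greater Cleveland Aquarium.")
print(candidate_phrases(sample))
# ['University Circle', 'Greater Cleveland', 'Greater Cleveland Aquarium', 'Cleveland Aquarium']
```

Each phrase could then be sent to autocomplete as the `query` value. Because punctuation is dropped during tokenizing, this sketch can join words across commas or sentence boundaries, and it misses names containing lowercase words such as "Rock and Roll Hall of Fame".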