# Collection of Data from ricksteves.com

## Import Necessary Libraries

In [1]:
import pandas as pd
import time
import urllib.request
from collections import Counter
from selenium.webdriver import Chrome
import pickle
import pymongo
import numpy as np
from scrape import collect_urls, collect_all_data, collect_city_data, get_wiki_description, replace_df_text, collect_city_photo

### Instantiate MongoDB Atlas Database and Collection

In [2]:
with open('.secrets/password.txt', 'r') as f:
    conn_string = f.read().strip()

mc = pymongo.MongoClient(conn_string)

In [3]:
city_db = mc['city_database']

In [4]:
city_collection = city_db['city_collection']
country_collection = city_db['country_collection']
wiki_collection = city_db['wiki_collection']

## Web Scraping

### Rick Steves
[Rick Steves](https://www.ricksteves.com/) publishes short summaries and photos for 213 European cities and regions. I collected the country summaries, the city/region summaries, the URL of each city page, and the leading photo of each location.

In [5]:
browser = Chrome()
url = "https://www.ricksteves.com/"
browser.get(url)

In [6]:
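# find_element_by_xpath was removed in Selenium 4; find_element(By.XPATH, ...) works in both 3 and 4
from selenium.webdriver.common.by import By

browser.find_element(By.XPATH, '//*[@id="nav"]/ul/li[2]/a').click()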

In [7]:
country_urls = collect_urls(browser, '/europe/')
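The `collect_urls` helper lives in the local `scrape` module and is not shown in this notebook. As a rough idea of what filtering a page's links by a path fragment such as `'/europe/'` might involve, here is a minimal sketch that uses only the standard library and works on an HTML string rather than a live browser. The class `LinkCollector`, the function `links_containing`, and the sample HTML are all made up for illustration and are not the author's implementation.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute hrefs from <a> tags whose href contains a fragment."""

    def __init__(self, base_url, fragment):
        super().__init__()
        self.base_url = base_url
        self.fragment = fragment
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and self.fragment in href:
            absolute = urljoin(self.base_url, href)
            if absolute not in self.links:  # keep the first occurrence only
                self.links.append(absolute)


def links_containing(html, base_url, fragment):
    """Return de-duplicated absolute links that contain `fragment`."""
    parser = LinkCollector(base_url, fragment)
    parser.feed(html)
    return parser.links


# Made-up HTML standing in for a navigation menu.
sample_html = """
<ul>
  <li><a href="/europe/austria">Austria</a></li>
  <li><a href="/europe/belgium">Belgium</a></li>
  <li><a href="/europe/austria">Austria (repeated)</a></li>
  <li><a href="/about">About</a></li>
</ul>
"""
sample_links = links_containing(sample_html, "https://www.ricksteves.com/", "/europe/")
```

With Selenium, the same filter could presumably be applied to the `href` attribute of the anchor elements found on the page.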

In [8]:
country_urls

['https://www.ricksteves.com/europe/austria',
 'https://www.ricksteves.com/europe/belgium',
 'https://www.ricksteves.com/europe/bosnia-herzegovina',
 'https://www.ricksteves.com/europe/bulgaria',
 'https://www.ricksteves.com/europe/croatia',
 'https://www.ricksteves.com/europe/czech-republic',
 'https://www.ricksteves.com/europe/denmark',
 'https://www.ricksteves.com/europe/england',
 'https://www.ricksteves.com/europe/estonia',
 'https://www.ricksteves.com/europe/finland',
 'https://www.ricksteves.com/europe/france',
 'https://www.ricksteves.com/europe/germany',
 'https://www.ricksteves.com/europe/greece',
 'https://www.ricksteves.com/europe/hungary',
 'https://www.ricksteves.com/europe/iceland',
 'https://www.ricksteves.com/europe/ireland',
 'https://www.ricksteves.com/europe/italy',
 'https://www.ricksteves.com/europe/montenegro',
 'https://www.ricksteves.com/europe/netherlands',
 'https://www.ricksteves.com/europe/norway',
 'https://www.ricksteves.com/europe/poland',
 'https://www.

The last link does not point to a country page, so we remove it.

In [9]:
country_urls = country_urls[0:-1]
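Dropping the last element by position works as long as the unwanted link stays last. A possible alternative is to filter on the shape of the URL path. The sketch below assumes that country pages always look like `/europe/<slug>` and that the unwanted link does not. The truncated output above does not confirm this, so the third example URL is purely hypothetical, as is the helper name `is_country_url`.

```python
from urllib.parse import urlparse


def is_country_url(url, section="europe"):
    """True when the URL path is exactly /<section>/<slug>."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    return len(parts) == 2 and parts[0] == section


example_urls = [
    "https://www.ricksteves.com/europe/austria",
    "https://www.ricksteves.com/europe/belgium",
    "https://www.ricksteves.com/europe/",  # hypothetical non-country link
]
filtered = [u for u in example_urls if is_country_url(u)]
```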

In [10]:
country_urls

['https://www.ricksteves.com/europe/austria',
 'https://www.ricksteves.com/europe/belgium',
 'https://www.ricksteves.com/europe/bosnia-herzegovina',
 'https://www.ricksteves.com/europe/bulgaria',
 'https://www.ricksteves.com/europe/croatia',
 'https://www.ricksteves.com/europe/czech-republic',
 'https://www.ricksteves.com/europe/denmark',
 'https://www.ricksteves.com/europe/england',
 'https://www.ricksteves.com/europe/estonia',
 'https://www.ricksteves.com/europe/finland',
 'https://www.ricksteves.com/europe/france',
 'https://www.ricksteves.com/europe/germany',
 'https://www.ricksteves.com/europe/greece',
 'https://www.ricksteves.com/europe/hungary',
 'https://www.ricksteves.com/europe/iceland',
 'https://www.ricksteves.com/europe/ireland',
 'https://www.ricksteves.com/europe/italy',
 'https://www.ricksteves.com/europe/montenegro',
 'https://www.ricksteves.com/europe/netherlands',
 'https://www.ricksteves.com/europe/norway',
 'https://www.ricksteves.com/europe/poland',
 'https://www.

Now that we have the URLs for all the countries, let's collect the data. For each country, go to its page and collect the URLs of all its cities/regions, then visit each of those pages and collect the summary.

In [11]:
for country in country_urls:
    collect_all_data(browser, country, country_collection, city_collection)

Inserted Austria into country collection
Inserted Danube Valley, Austria into city collection
Inserted Hallstatt, Austria into city collection
Inserted Salzburg, Austria into city collection
Inserted Tirol, Austria into city collection
Inserted Vienna, Austria into city collection
Completed scrapping Austria
Inserted Belgium into country collection
Inserted Antwerp, Belgium into city collection
Inserted Bruges, Belgium into city collection
Inserted Brussels, Belgium into city collection
Inserted Ghent, Belgium into city collection
Completed scrapping Belgium
Inserted Bosnia-Herzegovina into country collection
Inserted Mostar, Bosnia-Herzegovina into city collection
Inserted Sarajevo, Bosnia-Herzegovina into city collection
Completed scrapping Bosnia-Herzegovina
Inserted Bulgaria into country collection
Completed scrapping Bulgaria
Inserted Croatia into country collection
Inserted Dalmatian Coast, Croatia into city collection
Inserted Dubrovnik, Croatia into city collection
Inserted Hva

Inserted Pisa, Italy into city collection
Inserted Pompeii & Herculaneum, Italy into city collection
Inserted Ravenna, Italy into city collection
Inserted Rome, Italy into city collection
Inserted Sicily, Italy into city collection
Inserted Siena, Italy into city collection
Inserted Sorrento, Italy into city collection
Inserted Tuscan Hill Towns, Italy into city collection
Inserted Tuscany, Italy into city collection
Inserted Venice, Italy into city collection
Completed scrapping Italy
Inserted Montenegro into country collection
Completed scrapping Montenegro
Inserted Netherlands into country collection
Inserted Amsterdam, Netherlands into city collection
Inserted Delft, Netherlands into city collection
Inserted Edam, Netherlands into city collection
Inserted Haarlem, Netherlands into city collection
Inserted The Hague, Netherlands into city collection
Completed scrapping Netherlands
Inserted Norway into country collection
Inserted Bergen, Norway into city collection
Inserted Norwegian
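If the loop above is interrupted and re-run, plain inserts would store the same destination more than once. How `collect_all_data` writes to MongoDB is not shown in this notebook, so the following is only a sketch of one way to make the write repeatable: an upsert keyed on city and country. `upsert_city` is a hypothetical helper; `update_one(..., upsert=True)` is the standard pymongo collection method.

```python
def upsert_city(collection, doc):
    """Insert `doc`, or update the stored document with the same city and country."""
    key = {"city": doc["city"], "country": doc["country"]}
    # With upsert=True, MongoDB inserts a new document when no match exists.
    collection.update_one(key, {"$set": doc}, upsert=True)
```

It could be called as `upsert_city(city_collection, {'city': 'Vienna', 'country': 'Austria', 'city_summary': '...'})`, assuming documents carry the same field names that appear in the output later in this notebook.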

Take the MongoDB collection and turn it into a list of dictionaries for the countries.

In [13]:
country_dicts = [x for x in country_collection.find()]
country_dicts[0]

{'_id': ObjectId('5d24d9b74ccd32610d73dd1c'),
 'country': 'Austria',
 'country_summary': "Small, landlocked Austria offers alpine scenery, world-class museums, cobbled quaintness, and Wiener schnitzel. Unlike Germany, its industrious neighbor to the northwest, Austria is content to bask in its good living and elegant, opulent past as the former head of one of Europe's grandest empires. Austrians tend to be relaxed, gregarious people who love the outdoors as much as a good cup of coffee in a café."}

There should be 31 countries in the list. If the count matches, we convert the list to a data frame.

In [14]:
len(country_dicts)

31

Drop the `_id` column.

In [15]:
country_df = pd.DataFrame(country_dicts)
country_df.drop('_id', axis=1, inplace=True)
country_df.head()

Unnamed: 0,country,country_summary
0,Austria,"Small, landlocked Austria offers alpine scener..."
1,Belgium,Belgium falls through the cracks. Wedged betwe...
2,Bosnia-Herzegovina,Apart from the tragic way it separated from Yu...
3,Bulgaria,"Endearing, surprising Bulgaria is a rewarding ..."
4,Croatia,With thousands of miles of seafront and more t...


Saving the dataframe to a pickle file for use in other notebooks.

In [17]:
# pickle.dump(country_df, open('data/countries.pkl', 'wb'))
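Passing `open(...)` directly to `pickle.dump` leaves closing the file to the interpreter. A slightly safer pattern is to use a `with` block for both writing and reading. The sketch below shows the round trip with a plain list standing in for the data frame and a temporary directory standing in for `data/`; both substitutions are only there to keep the example self-contained.

```python
import os
import pickle
import tempfile

# Stand-in for country_df; any picklable object behaves the same way.
records = [{"country": "Austria"}, {"country": "Belgium"}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "countries.pkl")
    with open(path, "wb") as f:  # file is closed when the block exits
        pickle.dump(records, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)
```

For data frames specifically, pandas also provides `DataFrame.to_pickle` and `pd.read_pickle`, which handle the file for you.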

Similarly, convert `city_collection` to a list of dictionaries. There should be 213 destinations. The resulting data frame is saved to a pickle file for use in other notebooks.

In [18]:
city_dicts = [x for x in city_collection.find()]
city_dicts[0]

{'_id': ObjectId('5d24d9ed4ccd32610d73dd1d'),
 'city': 'Danube Valley',
 'country': 'Austria',
 'city_summary': "The Danube is at its romantic best just west of Vienna. Mix a cruise with a bike ride through the Danube's Wachau Valley, lined with ruined castles, beautiful abbeys (including the glorious Melk Abbey), small towns, and vineyard upon vineyard. Much of the valley has a warm fairy-tale glow, but a trip here isn't complete without the chilling contrast of a visit to the Mauthausen concentration camp memorial.",
 'city_url': 'https://www.ricksteves.com/europe/austria/danube-valley'}

In [19]:
len(city_dicts)

213

In [20]:
city_df = pd.DataFrame(city_dicts)
city_df.drop('_id', axis=1, inplace=True)
city_df.head()

Unnamed: 0,city,city_summary,city_url,country
0,Danube Valley,The Danube is at its romantic best just west o...,https://www.ricksteves.com/europe/austria/danu...,Austria
1,Hallstatt,Lovable Hallstatt is a tiny town bullied onto ...,https://www.ricksteves.com/europe/austria/hall...,Austria
2,Salzburg,"Thanks to its charmingly preserved old town, s...",https://www.ricksteves.com/europe/austria/salz...,Austria
3,Tirol,Mountainous Tirol — in Austria's western panha...,https://www.ricksteves.com/europe/austria/tirol,Austria
4,Vienna,"Vienna is the capital of Austria, the cradle o...",https://www.ricksteves.com/europe/austria/vienna,Austria


In [22]:
# pickle.dump(city_df, open('data/cities.pkl', 'wb'))

In [None]:
city_df = pickle.load(open('data/cities.pkl', 'rb'))

In [24]:
city_df.head()

Unnamed: 0,city,city_summary,city_url,country
0,Danube Valley,The Danube is at its romantic best just west o...,https://www.ricksteves.com/europe/austria/danu...,Austria
1,Hallstatt,Lovable Hallstatt is a tiny town bullied onto ...,https://www.ricksteves.com/europe/austria/hall...,Austria
2,Salzburg,"Thanks to its charmingly preserved old town, s...",https://www.ricksteves.com/europe/austria/salz...,Austria
3,Tirol,Mountainous Tirol — in Austria's western panha...,https://www.ricksteves.com/europe/austria/tirol,Austria
4,Vienna,"Vienna is the capital of Austria, the cradle o...",https://www.ricksteves.com/europe/austria/vienna,Austria


Later, I decided to also collect photos and save them in an image folder.

In [14]:
for country_url in country_urls:
    collect_city_photo(browser, country_url)  # name as imported from scrape above

Because ricksteves.com offers only a short one- or two-paragraph description of each location, I wanted a larger body of text, so I scraped Wikipedia for an additional summary.

In [25]:
city_list = list(zip(city_df['city'], city_df['country']))

In [26]:
city_list[0:5]

[('Danube Valley', 'Austria'),
 ('Hallstatt', 'Austria'),
 ('Salzburg', 'Austria'),
 ('Tirol', 'Austria'),
 ('Vienna', 'Austria')]

In [27]:
for place in city_list:
    city = place[0]
    get_wiki_description(browser, city, wiki_collection)
    time.sleep(np.random.randint(20, 120))

KeyboardInterrupt: 
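With a random pause of 20 to 120 seconds per city, the full loop takes hours, and an interruption like the one above means starting over or collecting some cities twice. One way to make the loop resumable is to skip cities that have already been stored. The sketch below is an assumption about how that could look; `collect_remaining` and its parameters are hypothetical, and `fetch` stands in for a call such as `get_wiki_description`.

```python
import random
import time


def collect_remaining(cities, already_done, fetch,
                      min_wait=20, max_wait=120, sleep=time.sleep):
    """Call fetch(city) for every city not yet collected, pausing in between."""
    done = set(already_done)
    for city in cities:
        if city in done:
            continue  # collected in an earlier run
        fetch(city)
        done.add(city)
        # random.randint includes both endpoints
        sleep(random.randint(min_wait, max_wait))
    return done
```

In this notebook, `already_done` could come from `wiki_collection.distinct('city')`, and `fetch` could be `lambda c: get_wiki_description(browser, c, wiki_collection)`, assuming the stored documents carry a `city` field as the later merge suggests.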

In [None]:
wiki_collection.find_one()

In [None]:
wiki_city_df = pd.DataFrame([x for x in wiki_collection.find()])

In [None]:
wiki_city_df.head()

In [None]:
wiki_city_df.drop('_id', axis=1, inplace=True)
wiki_city_df.drop_duplicates(inplace=True)
wiki_city_df.head()

In [None]:
for row in wiki_city_df['text']:
    print(row[:300])
    print('\n')

Since the data set is small, and Wikipedia may have resolved an ambiguous name to a different page than intended, I visually check that each summary describes the place I want.
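A crude automatic pre-screen can shorten the visual check by flagging summaries whose opening text never mentions the place name. This is only a heuristic sketch, and `needs_review` is a hypothetical helper. It produces false positives for spelling variants (Tirol vs. Tyrol), and it cannot catch a page that mentions the right name but describes the wrong thing, so it supplements the manual check rather than replacing it.

```python
def needs_review(city, text, window=300):
    """Flag a summary whose opening characters contain no word of the city name."""
    opening = text[:window].lower()
    cleaned = city.lower().replace("&", " ").replace(",", " ")
    # Ignore short words such as "the" or "of".
    words = [w for w in cleaned.split() if len(w) > 3]
    # If no usable word remains, any() is False and the city is flagged.
    return not any(w in opening for w in words)


ok_example = needs_review("Hallstatt", "Hallstatt is a small town in Austria.")
flagged_example = needs_review("Tirol", "Tyrol is a historical region in the Alps.")
```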

Cities I need to recollect:
* Danube Valley, Austria (references the river, I want to get the valley (city such as Melk))
* Split, Croatia
* Bath, England
* Durham, England
* Glastonbury & Wells, England (may need to get for each city?)
* Stonehenge & Avebury, England (may need to refer to each city/landmark)
* Stratford, England
* Warwick & Coventry, England
* Windsor, England
* D-Day Beaches, France
* Reims & Verdun, France (may need to get for each city)
* Rhine Valley, Germany
* Hydra, Greece
* Olympia, Greece
* Connemara & County Mayo, Ireland
* County Clare & the Burren, Ireland
* Kenmare & the Ring of Kerry, Ireland
* Kilkenny & the Rock of Cashel, Ireland
* Kinsale & Cobh, Ireland
* Portrush & the Antrim Coast, Ireland
* Waterford & County Wexford, Ireland
* Italian Lakes, Italy
* Pompeii & Herculaneum, Italy
* Tuscan Hill Towns, Italy
* Edam, Netherlands
* Norwegian Fjords, Norway
* Nazaré, Portugal
* Óbidos, Portugal
* Oban, Mull & Iona, Scotland
* Córdoba, Spain
* Toledo, Spain
* White Hill Towns, Spain
* Lake Geneva & French Switzerland, Switzerland

In [None]:
city_wiki = [('Danube Valley', 'Melk'),
             ('Split', 'Split,_Croatia'), ('Bath', 'Bath,_Somerset'), ('Durham', 'Durham,_England'), 
             ('Glastonbury & Wells', 'Glastonbury'), ('Stonehenge & Avebury', 'Avebury'),
             ('Stratford', 'Stratford-upon-Avon'), ('Warwick & Coventry', 'Coventry'),
             ('Windsor', 'Windsor,_Berkshire'), ('D-Day Beaches', 'Normandy_landings'),
             ('Reims & Verdun', 'Reims'), ('Rhine Valley', 'Mörsbach'), ('Hydra', 'Hydra_(island)'),
             ('Olympia', 'Olympia,_Greece'), ('Connemara & County Mayo', 'Connemara'),
             ('County Clare & the Burren', 'The_Burren'), ('Kenmare & the Ring of Kerry', 'Kenmare'),
             ('Kilkenny & the Rock of Cashel', 'Kilkenny'), ('Kinsale & Cobh', 'Cobh'),
             ('Portrush & the Antrim Coast', 'Portrush'), ('Waterford & County Wexford', 'Waterford'),
             ('Italian Lakes', 'Varenna'), ('Pompeii & Herculaneum', 'Pompeii'),
             ('Tuscan Hill Towns', 'Montepulciano'), ('Edam', 'Edam,_Netherlands'),
             ('Norwegian Fjords', 'Sognefjord'), ('Nazaré', 'Nazaré,_Portugal'),
             ('Óbidos', 'Óbidos,_Portugal'), ('Oban, Mull & Iona', 'Iona'), ('Córdoba', 'Córdoba,_Spain'),
             ('Toledo', 'Toledo,_Spain'), ('White Hill Towns', 'Ronda'),
             ('Lake Geneva & French Switzerland', 'Lake_Geneva'),
            ]
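Each pair above maps a city name in the data frame to a Wikipedia page title. Several titles contain non-ASCII characters, which need percent-encoding when placed in a URL. How `replace_df_text` builds its URL is not shown here, so the following is only a sketch under the assumption that it targets English Wikipedia; `WIKI_BASE` and `wiki_url` are hypothetical names.

```python
from urllib.parse import quote

WIKI_BASE = "https://en.wikipedia.org/wiki/"  # assumed: English Wikipedia


def wiki_url(title):
    """Percent-encode a page title and append it to the base URL."""
    # Keep commas and parentheses readable; non-ASCII characters are UTF-8 encoded.
    return WIKI_BASE + quote(title, safe=",()")


example_pairs = [("Bath", "Bath,_Somerset"), ("Óbidos", "Óbidos,_Portugal")]
wiki_urls = {city: wiki_url(title) for city, title in example_pairs}
```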

In [None]:
for city in city_wiki:
    replace_df_text(browser, city, wiki_city_df)
    time.sleep(np.random.randint(10,30))
    print(f'Replaced {city[0]}')

Save the Wikipedia data as a pickle file.

In [33]:
# pickle.dump(wiki_city_df, open('data/wiki_data.pkl', 'wb'))
wiki_city_df = pickle.load(open('data/wiki_data.pkl', 'rb'))

Merge the two city data frames (`city_df` and `wiki_city_df`) together.

In [37]:
combined_descriptions_df = pd.merge(city_df, wiki_city_df, how='left', on='city')
combined_descriptions_df.head(10)

Unnamed: 0,city,city_summary,city_url,country,_id,text
0,Danube Valley,The Danube is at its romantic best just west o...,https://www.ricksteves.com/europe/austria/danu...,Austria,5d11974d6e7463927710e428,\nThe Danube (/ˈdæn.juːb/ DAN-yoob; known by v...
1,Danube Valley,The Danube is at its romantic best just west o...,https://www.ricksteves.com/europe/austria/danu...,Austria,5d1197626e7463927710e429,\nThe Danube (/ˈdæn.juːb/ DAN-yoob; known by v...
2,Hallstatt,Lovable Hallstatt is a tiny town bullied onto ...,https://www.ricksteves.com/europe/austria/hall...,Austria,5d1197936e7463927710e42a,Hallstatt (German: [ˈhalʃtat]; Central Bavaria...
3,Salzburg,"Thanks to its charmingly preserved old town, s...",https://www.ricksteves.com/europe/austria/salz...,Austria,5d1197ff6e7463927710e42b,Salzburg (German: [ˈzaltsbʊɐ̯k] (listen);[note...
4,Tirol,Mountainous Tirol — in Austria's western panha...,https://www.ricksteves.com/europe/austria/tirol,Austria,5d1198416e7463927710e42c,"\nTyrol (/tɪˈroʊl, taɪ-, ˈtaɪroʊl/;[1] histori..."
5,Vienna,"Vienna is the capital of Austria, the cradle o...",https://www.ricksteves.com/europe/austria/vienna,Austria,5d1195dc0d668bb1db943136,\nVienna (/viˈɛnə/ (listen);[11][12] German: W...
6,Vienna,"Vienna is the capital of Austria, the cradle o...",https://www.ricksteves.com/europe/austria/vienna,Austria,5d11970b6e7463927710e427,\nVienna (/viˈɛnə/ (listen);[11][12] German: W...
7,Vienna,"Vienna is the capital of Austria, the cradle o...",https://www.ricksteves.com/europe/austria/vienna,Austria,5d11988a6e7463927710e42d,\nVienna (/viˈɛnə/ (listen);[11][12] German: W...
8,Antwerp,"Antwerp (Antwerpen in Dutch, Anvers in French)...",https://www.ricksteves.com/europe/belgium/antwerp,Belgium,5d1198cd6e7463927710e42e,\nAntwerp (/ˈæntwɜːrp/ (listen); Dutch: Antwer...
9,Bruges,"With pointy gilded architecture, stay-a-while ...",https://www.ricksteves.com/europe/belgium/bruges,Belgium,5d1198ef6e7463927710e42f,"\nBruges (/bruːʒ/, French: [bʁyʒ]; Flemish: Br..."
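The output above contains repeated rows (Danube Valley twice, Vienna three times) because the Wikipedia data holds more than one document for those cities, and a left merge returns one row per match. A possible fix is to de-duplicate on `city` before merging and let pandas verify the key relationship. The sketch below uses small made-up frames. Keeping the last duplicate is an assumption that later scrapes are the ones to keep.

```python
import pandas as pd

# Made-up stand-ins for city_df and wiki_city_df.
city_example = pd.DataFrame({
    "city": ["Vienna", "Salzburg"],
    "country": ["Austria", "Austria"],
})
wiki_example = pd.DataFrame({
    "city": ["Vienna", "Vienna", "Salzburg"],
    "text": ["Vienna is ...", "Vienna is ...", "Salzburg is ..."],
})

# Keep one Wikipedia row per city.
wiki_unique = wiki_example.drop_duplicates(subset="city", keep="last")

# validate raises MergeError if either side still has duplicate keys.
merged = pd.merge(city_example, wiki_unique, how="left", on="city",
                  validate="one_to_one")
```

After this, the merged frame has exactly one row per destination, so its length can be compared directly with the expected 213.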


In [38]:
combined_descriptions_df.drop('_id', axis=1, inplace=True)

In [39]:
combined_descriptions_df.head(10)

Unnamed: 0,city,city_summary,city_url,country,text
0,Danube Valley,The Danube is at its romantic best just west o...,https://www.ricksteves.com/europe/austria/danu...,Austria,\nThe Danube (/ˈdæn.juːb/ DAN-yoob; known by v...
1,Danube Valley,The Danube is at its romantic best just west o...,https://www.ricksteves.com/europe/austria/danu...,Austria,\nThe Danube (/ˈdæn.juːb/ DAN-yoob; known by v...
2,Hallstatt,Lovable Hallstatt is a tiny town bullied onto ...,https://www.ricksteves.com/europe/austria/hall...,Austria,Hallstatt (German: [ˈhalʃtat]; Central Bavaria...
3,Salzburg,"Thanks to its charmingly preserved old town, s...",https://www.ricksteves.com/europe/austria/salz...,Austria,Salzburg (German: [ˈzaltsbʊɐ̯k] (listen);[note...
4,Tirol,Mountainous Tirol — in Austria's western panha...,https://www.ricksteves.com/europe/austria/tirol,Austria,"\nTyrol (/tɪˈroʊl, taɪ-, ˈtaɪroʊl/;[1] histori..."
5,Vienna,"Vienna is the capital of Austria, the cradle o...",https://www.ricksteves.com/europe/austria/vienna,Austria,\nVienna (/viˈɛnə/ (listen);[11][12] German: W...
6,Vienna,"Vienna is the capital of Austria, the cradle o...",https://www.ricksteves.com/europe/austria/vienna,Austria,\nVienna (/viˈɛnə/ (listen);[11][12] German: W...
7,Vienna,"Vienna is the capital of Austria, the cradle o...",https://www.ricksteves.com/europe/austria/vienna,Austria,\nVienna (/viˈɛnə/ (listen);[11][12] German: W...
8,Antwerp,"Antwerp (Antwerpen in Dutch, Anvers in French)...",https://www.ricksteves.com/europe/belgium/antwerp,Belgium,\nAntwerp (/ˈæntwɜːrp/ (listen); Dutch: Antwer...
9,Bruges,"With pointy gilded architecture, stay-a-while ...",https://www.ricksteves.com/europe/belgium/bruges,Belgium,"\nBruges (/bruːʒ/, French: [bʁyʒ]; Flemish: Br..."


In [40]:
# pickle.dump(combined_descriptions_df, open('data/combined_cities.pkl', 'wb'))
combined_descriptions_df = pickle.load(open('data/combined_cities.pkl', 'rb'))

In [41]:
combined_descriptions_df.head()

Unnamed: 0,city,city_summary,city_url,country,text
0,Danube Valley,The Danube is at its romantic best just west o...,https://www.ricksteves.com/europe/austria/danu...,Austria,\nThe Danube (/ˈdæn.juːb/ DAN-yoob; known by v...
1,Danube Valley,The Danube is at its romantic best just west o...,https://www.ricksteves.com/europe/austria/danu...,Austria,\nThe Danube (/ˈdæn.juːb/ DAN-yoob; known by v...
2,Hallstatt,Lovable Hallstatt is a tiny town bullied onto ...,https://www.ricksteves.com/europe/austria/hall...,Austria,Hallstatt (German: [ˈhalʃtat]; Central Bavaria...
3,Salzburg,"Thanks to its charmingly preserved old town, s...",https://www.ricksteves.com/europe/austria/salz...,Austria,Salzburg (German: [ˈzaltsbʊɐ̯k] (listen);[note...
4,Tirol,Mountainous Tirol — in Austria's western panha...,https://www.ricksteves.com/europe/austria/tirol,Austria,"\nTyrol (/tɪˈroʊl, taɪ-, ˈtaɪroʊl/;[1] histori..."
