# Introduction
This notebook contains the work for [Homework 3](https://github.com/lucamaiano/ADM/tree/master/2022/Homework_3) 
of [Algorithmic Methods for Data Mining 2022](http://aris.me/index.php/data-mining-ds-2022).

## Group Members
* Laura Mignella
* Paolo Barba
* Jonas Barth

## Index
* [Data Collection](#1.-Data-Collection)
    * [Get the list of places](#1.1.-Get-the-list-of-places)
    * [Crawl Places](#1.2.-Crawl-places)
    * [Parse Pages](#1.3.-Parse-Pages)
* [Search Engine](#2.-Search-Engine)
    * [Conjunctive Index](#2.1.-Conjunctive-Index)
        * [Create Index](#2.1.1.-Create-Index)
        * [Create Search Engine and Run Search](#2.1.2.-Create-Search-Engine-and-Run-Search)
    * [TF-IDF Index](#2.2.-TF-IDF-Index)
        * [Create TF-IDF Index](#2.2.1.-Create-TF-IDF-Index)
        * [Create TF-IDF Search Engine and Run Search](#2.2.2.-Create-TF-IDF-Search-Engine-and-Run-Search)
* [Own Score](#3.-Own-Score)
    * [Proximity](#3.1.-Proximity)
    * [Popularity](#3.2.-Popularity)
    * [Proximity & Popularity](#3.3.-Proximity-&-Popularity)
* [Visualizing the most relevant places](#4.-Visualizing-the-most-relevant-places)

# Imports

In [1]:
import pandas as pd
from parse import parse_htmls
from util import read_place_desc, read_htmls_in, write_places_to_tsv, read_places
from service import PlaceService, SearchEngine
from index import preprocess, Index, TfIdfIndex

import requests
from tqdm import tqdm
import os
import time
import util
import plotly.express as px

# 1. Data Collection
## 1.1. Get the list of places

In [None]:
from bs4 import BeautifulSoup  # missing from the imports cell above

BASE_URL = 'https://www.atlasobscura.com'

with open('most_popular_places.txt', 'w') as file:
    # Walk through listing pages 1..400, sorted by number of likes
    for page in tqdm(range(1, 401)):
        url = f'{BASE_URL}/places?page={page}&sort=likes_count'
        list_page = requests.get(url)
        list_soup = BeautifulSoup(list_page.text, 'html.parser')
        # Each place card is an <a> tag whose href holds the relative place URL
        list_places = [x['href'] for x in list_soup.find_all('a', {'class': "content-card content-card-place"})]
        for place in list_places:
            file.write(BASE_URL + str(place) + '\n')

## 1.2. Crawl places

In [None]:
# Download the HTML of every place; a new folder is started every 18 URLs
with open('most_popular_places.txt', 'r') as f:
    for j, url in enumerate(f):
        if j % 18 == 0:
            dir_path = f'page{j // 18 + 1}'
            os.makedirs(dir_path, exist_ok=True)

        response = requests.get(url.strip())
        with open(f'{dir_path}/{j + 1}.html', 'w', encoding='utf-8') as file:
            file.write(response.text)

        time.sleep(1)  # pause between requests to avoid overloading the server

## 1.3. Parse Pages

In [None]:
all_htmls = read_htmls_in('./pages')

In [None]:
all_places = parse_htmls(all_htmls)

In [None]:
tsv_path = write_places_to_tsv('./', all_places)

In [2]:
place_service = PlaceService()
place_service.load('./places.tsv')

---

# 2. Search Engine 

## 2.1. Conjunctive Index

### 2.1.1. Create Index
The index can either be built from the parsed and saved `.tsv` file or loaded from a previously saved index.
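
The internals of the `Index` class are not shown in this notebook. As a rough illustration of what a conjunctive index does, the sketch below builds an inverted index (term to set of document ids) and answers a query by intersecting the posting sets of all query terms. The `tokenize` helper is a hypothetical, much simpler stand-in for the imported `preprocess` function, and the toy documents are made up.

```python
from collections import defaultdict

def tokenize(text):
    """Hypothetical, simplified stand-in for the notebook's `preprocess` step."""
    return [tok for tok in text.lower().split() if tok.isalnum()]

def build_inverted_index(ids, descriptions):
    """Map each term to the set of document ids whose description contains it."""
    inverted = defaultdict(set)
    for doc_id, desc in zip(ids, descriptions):
        for term in tokenize(desc):
            inverted[term].add(doc_id)
    return inverted

def conjunctive_query(inverted, query):
    """Return the ids of the documents that contain every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [inverted.get(term, set()) for term in terms]
    return set.intersection(*postings)

# Toy data, made up for illustration
ids = [0, 1, 2]
descriptions = [
    "an american museum of natural history",
    "a museum of modern art",
    "an american diner",
]
inverted = build_inverted_index(ids, descriptions)
result = conjunctive_query(inverted, "american museum")
print(result)
```

Only the first toy document contains both "american" and "museum", so it is the only one returned; a document that matches just one of the terms is excluded.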
### Create from saved `.tsv` file

In [71]:
ids, descriptions = read_place_desc('./places.tsv')
index = Index.create_from(ids, descriptions)

### Load saved index

In [3]:
index = Index.load_from('./resources/index.pickle')

### 2.1.2. Create Search Engine and Run Search

In [4]:
search_engine = SearchEngine(index, place_service)

In [5]:
search_engine.query('american museum')

Unnamed: 0,name,desc,url
1804,Uncommon Objects,Like an elegant antiques mall gone horribly wr...,https://www.atlasobscura.com/places/uncommon-o...
2458,Tamástslikt Cultural Institute,"The Tamástslikt Cultural Institute, situated o...",https://www.atlasobscura.com/places/tamastslik...
349,Mitsitam Native Foods Cafe,"A visit to the National Mall in Washington, D....",https://www.atlasobscura.com/places/mitsitam-n...
3701,Museum of Chinese in America,The Museum of Chinese in America is nestled—al...,https://www.atlasobscura.com/places/museum-of-...
1087,Museum of Mourning Art,Mourning and personal response to death are un...,https://www.atlasobscura.com/places/museum-of-...
...,...,...,...
6473,Museum of the American Cocktail,They say that New Orleans is the home of the f...,https://www.atlasobscura.com/places/museum-ame...
1934,Unto These Hills Cherokee Theatre,"Since 1950, members of the local Cherokee trib...",https://www.atlasobscura.com/places/unto-these...
984,Theodore Roosevelt Birthplace Museum,Behind an otherwise innocuous (if immaculately...,https://www.atlasobscura.com/places/theodore-r...
620,Canyons of the Ancients,Ripe for quiet reflection and simply awe-inspi...,https://www.atlasobscura.com/places/canyons-of...


## 2.2. TF-IDF Index
### 2.2.1. Create TF-IDF Index
The index can either be built from the parsed and saved `.tsv` file or loaded from a previously saved index.
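
How `TfIdfIndex` weighs terms is not shown in this notebook. The sketch below is one common variant, given here as an assumption: term frequency normalised by document length, multiplied by `log(N / df)`, with documents ranked by cosine similarity to the query vector. All function names and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Return one {term: weight} dict per tokenised document, plus the idf table."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({t: (c / len(doc)) * idf[t] for t, c in counts.items()})
    return vectors, idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy data, made up for illustration
docs = [
    "american museum of natural history".split(),
    "museum of modern art".split(),
    "american diner".split(),
]
vectors, idf = tf_idf_vectors(docs)

# Represent the query in the same space, using the idf learned from the documents
query = "american museum".split()
query_vec = {t: idf.get(t, 0.0) for t in set(query)}

scores = [cosine(query_vec, vec) for vec in vectors]
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking, scores)
```

The document that contains both query terms ranks first. A term that occurs in every document would get an idf of zero under this weighting and would not contribute to the score.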

### Create from saved `.tsv` file

In [75]:
tf_idf_index = TfIdfIndex.create_from(ids, descriptions)

### Load saved index

In [7]:
tf_idf_index = TfIdfIndex.load_from('./resources/tf_idf_index.pickle')

### 2.2.2. Create TF-IDF Search Engine and Run Search

In [8]:
tf_idf_search_engine = SearchEngine(tf_idf_index, place_service)
tf_idf_search_engine.query_top_k("american museum", 10)[['name', 'desc', 'url', 'similarity']]

Unnamed: 0,name,desc,url,similarity
3926,Smithsonian Sushi Collection,The American History Museum has collected an a...,https://www.atlasobscura.com/places/smithsonia...,0.999944
6489,Mercer Museum and Fonthill Castle,"Henry Chapman Mercer, a renowned archaeologist...",https://www.atlasobscura.com/places/fonthill,0.998837
2458,Tamástslikt Cultural Institute,"The Tamástslikt Cultural Institute, situated o...",https://www.atlasobscura.com/places/tamastslik...,0.998837
4697,Zippo/Case Museum,Invented in and still proudly manufactured in ...,https://www.atlasobscura.com/places/zippo-case...,0.998837
238,Off the Rez Cafe,The U.S. government’s forced relocation of Nat...,https://www.atlasobscura.com/places/off-the-re...,0.998837
6238,Oak Ridge "The Secret City",The city of Oak Ridge was established by the U...,https://www.atlasobscura.com/places/the-secret...,0.998837
5429,Old Time Wooden Nickel Company,"The adage goes, “don’t take any wooden nickels...",https://www.atlasobscura.com/places/old-time-w...,0.994973
5068,Self-Taught Genius Gallery,"In 2017, the American Folk Art Museum in Manha...",https://www.atlasobscura.com/places/self-taugh...,0.99231
5517,Niles Essanay Silent Film Museum,It was Spring in San Francisco. One quiet Apri...,https://www.atlasobscura.com/places/niles-essa...,0.988467
343,Gillette Castle State Park,"High above the Connecticut River, Gillette Cas...",https://www.atlasobscura.com/places/gillettes-...,0.988467


---

# 3. Own Score
For our own score, we decided to give the users three ways to rank the places:

1. [Proximity](#3.1.-Proximity)
1. [Popularity](#3.2.-Popularity)
1. [Proximity & Popularity](#3.3.-Proximity-&-Popularity)

## 3.1. Proximity
The proximity score is based on the user's current location: places that are closer to the user are ranked higher than places that are further away. The user's location is obtained by looking up the IP address currently in use and retrieving the latitude and longitude associated with it. Although this is not exact and can be manipulated (a VPN could be used to "change" location), it saves us from having to clean and parse additional user input.

The similarity score for a place is calculated by subtracting the distance between the user and the place from the maximum possible distance, and normalising the result by the maximum possible distance. Given a distance function $dist(p_1, p_2)$ that returns the distance between two positions on the surface of the earth, the proximity score is defined as:

$$proximity(place) = \frac{max\_distance - dist(place, user)}{max\_distance}$$

The $max\_distance$ is simply the earth's circumference divided by two, as this is the maximum possible distance between any two points on the surface of the earth.

The reason for subtracting the distance between the user and the place from the maximum distance is so that scores closer to 1 correspond to a higher similarity and scores closer to 0 to lower similarity. More formally:

$$\lim_{dist(place, user) \to 0} proximity(place) = 1$$

$$\lim_{dist(place, user) \to max\_distance} proximity(place) = 0$$
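
The text does not say which distance function is used. As a minimal sketch, assuming the haversine great-circle distance on a spherical earth with a mean radius of 6371 km, the proximity score could be computed as follows. The function names and coordinates are illustrative only.

```python
import math

EARTH_RADIUS_KM = 6371.0                      # assumed mean earth radius
MAX_DISTANCE_KM = math.pi * EARTH_RADIUS_KM   # half the circumference

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    # min() guards against rounding pushing the value slightly above 1
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))

def proximity(place, user):
    """1 when the place is at the user's position, 0 when it is on the opposite side of the earth."""
    return (MAX_DISTANCE_KM - haversine_km(*place, *user)) / MAX_DISTANCE_KM

# Illustrative coordinates (lat, lon)
user = (41.9028, 12.4964)       # Rome
turin = (45.0703, 7.6869)
new_york = (40.7128, -74.0060)

near = proximity(turin, user)
far = proximity(new_york, user)
print(near, far)
```

With these coordinates the Turin score is higher than the New York score, and both fall strictly between 0 and 1, matching the two limits above.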

In [85]:
search_engine.query_custom('museum', top_k=10, proximity=True, popularity=False)[["name","desc","address","similarity"]]

Unnamed: 0,name,desc,address,similarity
2917,Cesare Lombroso's Museum of Criminal Anthr...,"Once only open to academics, Lombroso’s Museum...","University of Turin, Via Pietro Giuria 15, Tur...",0.993611
3001,Christ of the Abyss,While figures of the divine and the decorative...,"San Fruttuoso Coast, San Fruttuoso, 16034, Italy",0.993545
4863,Lovers of Valdaro,"Arms and legs entwined, the couple lay facing ...","Piazza Castello, Mantua, Italy",0.993472
1522,Nautilus Antiques and Old Oddities,With the tagline “Antique Scientific Instrumen...,"Bellezia 15/b, Modena, 41121, Italy",0.991769
157,St. Beatus Cave,The legend of the cave revolves around its nam...,"Lake Thun, Beatenberg, 3800, Switzerland",0.991382
4759,Barryland,"The Musée et Chiens du Saint-Bernard, aka “Bar...","34 Rue du Levant, Martigny, Switzerland",0.991099
1289,The Town Of Witches,"“Are you a good witch, or a bad witch?” In Tri...","Triora, 18010, Italy",0.990016
5534,Otzi the Iceman,Three basic conditions can lead to natural mum...,"Via Museo 43, South Tyrol Museum of Archaeolog...",0.989922
5753,Museo di Palazzo Poggi Anatomy & Obstetric...,The Anatomical and Obstetrics Collection at th...,"Museo de Palazzo Poggi, Via Zamboni, 33, Bolog...",0.989905
6672,H.R. Giger Museum,"In the quaint medieval city of Gruyères, Switz...","Château St. Germain, Gruyères, 1663, Switzerland",0.989803


## 3.2. Popularity
The popularity score ranks places by popularity, with more popular places ranked above less popular ones. The popularity of a place is calculated using the number of people that have visited it and the number of people that want to go. We chose these two variables because the very meaning of popularity is that many people are interested in a place: if many people want to go to a place or have visited it, that place is popular.

For each place, we take the number of people that visited it as a share of the total number of visits across all places, and the number of people that want to go as a share of the total number of people that want to go across all places. The two ratios are weighted equally and summed.

$$popularity = \frac{1}{2} \times \left(\frac{num\_people\_went}{total\_people\_went} + \frac{num\_people\_want}{total\_people\_want}\right)$$
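
As a small worked example of the formula above, the sketch below computes the popularity score for three made-up places. The field names `num_people_visited` and `num_people_want` follow the column names used in the map cells of this notebook; the function name and the numbers are hypothetical, and both totals are assumed to be non-zero.

```python
# Toy data, made up for illustration
places = [
    {"name": "A", "num_people_visited": 300, "num_people_want": 100},
    {"name": "B", "num_people_visited": 100, "num_people_want": 300},
    {"name": "C", "num_people_visited": 600, "num_people_want": 600},
]

def popularity_scores(places):
    """Equally weighted sum of each place's share of visits and share of 'want to go'."""
    total_went = sum(p["num_people_visited"] for p in places)
    total_want = sum(p["num_people_want"] for p in places)
    return {
        p["name"]: 0.5 * (p["num_people_visited"] / total_went
                          + p["num_people_want"] / total_want)
        for p in places
    }

scores = popularity_scores(places)
print(scores)
```

Because each ratio is a share of a total, the scores of all places add up to 1. With several thousand places in the data set, individual scores are therefore expected to be small.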



In [86]:
search_engine.query_custom("museum", top_k=10, proximity=False, popularity=True)[["name","desc","address","similarity"]]

Unnamed: 0,name,desc,address,similarity
2000,Mütter Museum,Located inside the headquarters of the College...,"19 South 22nd Street, Philadelphia, Pennsylvan...",0.005018
6820,Museum of Pop Culture,"In Seattle, where art seems to spring from the...","325 5th Avenue North, Seattle, Washington, 981...",0.004714
0,City Hall Station,The first New York City subway was built and o...,"31 Centre St, New York, New York, 10007, Unite...",0.004609
614,Natural History Museum of London,"Established in 1881, the Natural History Museu...","Cromwell Road, London, England, SW7 2DD, Unite...",0.0045
14,The Evolution Store,Evolution stands out among the clothing stores...,"687 Broadway, New York, New York, 10012, Unite...",0.004283
5999,The Witch House of Salem,The Salem witchcraft trials took place between...,"310 1/2 Essex Street, Salem, Massachusetts, 01...",0.003955
825,Casa Batlló,"One of Gaudí’s most iconic works, Casa Batlló ...","43 Passeig de Gràcia, Barcelona, 08007, Spain",0.003873
5011,Park Güell,"At Park Güell, stone, tile, plants, and Medite...","s/n Carrer d'Olot, Barcelona, 08024, Spain",0.003774
4411,Centre Pompidou,"Located in Paris’ 4th arrondissement, Centre G...","Centre Georges Pompidou, Paris, 75004, France",0.003709
1010,La Brea Tar Pits Dragonfly Fossils,The landmarked La Brea Tar Pits and Museum is ...,"La Brea Tar Pits and Museum, 5801 Wilshire Bou...",0.003677


## 3.3. Proximity & Popularity
For the combination of proximity and popularity, the two scores are simply multiplied together.

$$proximity \times popularity$$
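
The effect of multiplying the two scores can be illustrated with a few made-up score pairs. The names and values below are hypothetical and only chosen to be of the same order of magnitude as the scores in the tables above.

```python
def combined_score(proximity, popularity):
    """Product of the two scores, as in the formula above."""
    return proximity * popularity

# (proximity, popularity) pairs, made up for illustration
candidates = {
    "near_but_obscure": (0.99, 0.001),
    "far_but_famous": (0.60, 0.005),
    "near_and_famous": (0.95, 0.0045),
}

ranked = sorted(
    candidates,
    key=lambda name: combined_score(*candidates[name]),
    reverse=True,
)
print(ranked)
```

In this toy example the popularity values differ by a factor of five while the proximity values differ by less than a factor of two, so popularity has the larger influence on the product. The results in this notebook point the same way: nine of the ten places in the combined ranking for "museum" also appear in the popularity ranking.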

In [87]:
search_engine.query_custom("museum" , top_k=10, proximity=True, popularity=True)[["name","desc","address","similarity"]]

Unnamed: 0,name,desc,address,similarity
614,Natural History Museum of London,"Established in 1881, the Natural History Museu...","Cromwell Road, London, England, SW7 2DD, Unite...",0.004284
825,Casa Batlló,"One of Gaudí’s most iconic works, Casa Batlló ...","43 Passeig de Gràcia, Barcelona, 08007, Spain",0.003733
5011,Park Güell,"At Park Güell, stone, tile, plants, and Medite...","s/n Carrer d'Olot, Barcelona, 08024, Spain",0.003637
4411,Centre Pompidou,"Located in Paris’ 4th arrondissement, Centre G...","Centre Georges Pompidou, Paris, 75004, France",0.003591
2000,Mütter Museum,Located inside the headquarters of the College...,"19 South 22nd Street, Philadelphia, Pennsylvan...",0.003363
0,City Hall Station,The first New York City subway was built and o...,"31 Centre St, New York, New York, 10007, Unite...",0.003118
14,The Evolution Store,Evolution stands out among the clothing stores...,"687 Broadway, New York, New York, 10012, Unite...",0.002898
5999,The Witch House of Salem,The Salem witchcraft trials took place between...,"310 1/2 Essex Street, Salem, Massachusetts, 01...",0.00274
7008,221b Baker Street,Beeton’s Christmas Annual was a hugely popular...,"237 Baker Street, Devon, London, England, NW1 ...",0.002696
6820,Museum of Pop Culture,"In Seattle, where art seems to spring from the...","325 5th Avenue North, Seattle, Washington, 981...",0.002676


---

# 4. Visualizing the most relevant places

In [9]:
d_cosine = tf_idf_search_engine.query_top_k("american museum", 10)
d_proximity = search_engine.query_custom('american museum', top_k=10, proximity=True, popularity=False)
d_popularity = search_engine.query_custom("american museum", top_k=10, proximity=False, popularity=True)
d_combination = search_engine.query_custom("american museum" , top_k=10, proximity=True, popularity=True)

In [16]:
fig = px.scatter_mapbox(
    d_cosine,  # Our DataFrame
    lat = "lat",
    lon = "lon",
    center = {"lat": 40.77, "lon": -73.96},  # where map will be centered (New York)
    width = 1000,  # Width of map
    height = 600,  # Height of map
    color="similarity", size="similarity",
    zoom=0.7,
    hover_data = ["name", "address", "num_people_visited", "num_people_want"],
    # what to display when hovering mouse over coordinate
)
fig.update_layout(
    title={
        'text': "Map of the most relevant places according to cosine similarity",
        'y':0.95,
        'x':0.45,
        'xanchor': 'center',
        'yanchor': 'top'})

fig.update_layout(mapbox_style="stamen-toner") 
fig.show()

In [17]:
fig = px.scatter_mapbox(
    d_popularity,  # Our DataFrame
    lat = "lat",
    lon = "lon",
    center = {"lat": 40.77, "lon": -73.96},  # where map will be centered
    width = 1000,  # Width of map
    height = 600,  # Height of map
    color="similarity", size="similarity",
    zoom=0.7,
    hover_data = ["name", "address", "num_people_visited", "num_people_want"],
    # what to display when hovering mouse over coordinate
)
fig.update_layout(
    title={
        'text': "Map of the most relevant places according to popularity",
        'y':0.95,
        'x':0.45,
        'xanchor': 'center',
        'yanchor': 'top'})

fig.update_layout(mapbox_style="stamen-toner") 
fig.show()

In [18]:
fig = px.scatter_mapbox(
    d_proximity,  # Our DataFrame
    lat = "lat",
    lon = "lon",
    center = {"lat": 40.77, "lon": -73.96},  # where map will be centered
    width = 1000,  # Width of map
    height = 600,  # Height of map
    color="similarity", size="similarity",
    zoom=0.5,
    hover_data = ["name", "address", "num_people_visited", "num_people_want"],
    # what to display when hovering mouse over coordinate
)
fig.update_layout(
    title={
        'text': "Map of the most relevant places according to proximity",
        'y':0.95,
        'x':0.45,
        'xanchor': 'center',
        'yanchor': 'top'})

fig.update_layout(mapbox_style="stamen-toner") 
fig.show()

In [19]:
fig = px.scatter_mapbox(
    d_combination,  # Our DataFrame
    lat = "lat",
    lon = "lon",
    center = {"lat": 40.77, "lon": -73.96},  # where map will be centered
    width = 1000,  # Width of map
    height = 600,  # Height of map
    color="similarity", size="similarity",
    zoom=0.5,
    hover_data = ["name", "address", "num_people_visited", "num_people_want"],
    # what to display when hovering mouse over coordinate
)
fig.update_layout(
    title={
        'text': "Map of the most relevant places according to proximity and popularity",
        'y':0.95,
        'x':0.45,
        'xanchor': 'center',
        'yanchor': 'top'})

fig.update_layout(mapbox_style="stamen-toner") 
fig.show()