# Homework 3 in ADM - Find the perfect place to stay in Texas!
## By Guliano Soced Tocilj, Hassan Ismail and Hannes Engelhardt

### Our approach

We started solving the homework using functions to test our approaches to the different tasks. Once we had solved all of them to our satisfaction, we converted our code to classes. You can find the classes we created, together with our comments, in the file *classes/hoohle.py* and their explanations in the file *classes.ipynb*. This increases the readability of this notebook.

For each of the search engines we had to create during our homework, we created a class. We called them *HoohleSimple*, *HoohleTFIDF* and *HoohleNOSTRO*, and we refer to them below while explaining our steps. To provide these classes with the information they need, we created the classes *AirbnbData* and *Property*. The first one handles the preparation of the data; the latter provides the document and term ids whenever one of our hoohles needs them.

Below you can find the libraries we used during the completion of our homework.

In [None]:
import pandas as pd
import nltk
import csv

from nltk.corpus import stopwords
stop_words = set(stopwords.words("english"))
from nltk.tokenize import RegexpTokenizer
from nltk.stem.snowball import SnowballStemmer
from translate import Translator
from langdetect import detect

stemmer = SnowballStemmer('english')

from pathlib import Path
import math
import numpy as np
import re
from scipy import spatial
import heapq as hq
import geopy
import geopy.distance
import folium

from IPython.display import HTML, display
import importlib
import classes

#### Importing our classes

In [2]:
# Execute only once on a machine
#nltk.download()

In [110]:
#importlib.reload(classes.hoohle)
from classes.hoohle import *

In [111]:
#importlib.reload(classes.hoohle)
from classes.hoohle import AirbnbData

In [112]:
#importlib.reload(classes.hoohle)
from classes.hoohle import HoohleSimple

In [113]:
#importlib.reload(classes.hoohle)
from classes.hoohle import HoohleTFIDF

In [125]:
#importlib.reload(classes.hoohle)
from classes.hoohle import HoohleNOSTRO

In [114]:
#importlib.reload(classes.hoohle)
from classes.hoohle import Property

### Step 1 and 2
Our first step was to download and clean the dataframe. We removed the duplicates using the function *removeDuplicates* in the class *AirbnbData*. We realized that some entries aren't in English. In order not to lose these entries, we translated their title and description using the function *translate*, which you can find in the same class (*AirbnbData*).

Our next step was to create the TSV files. For that we saved each row of the dataframe into a separate file, named as requested in the task description of the homework.
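The following is a minimal sketch of the one-file-per-row idea, not the actual *createTSVs* implementation. The function name `write_rows_as_tsv`, the `doc_` file prefix and the sample rows are assumptions made for illustration only.

```python
import csv
import tempfile
from pathlib import Path


def write_rows_as_tsv(rows, out_dir, prefix="doc_"):
    """Write each row (a dict) to its own tab-separated file; return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, row in enumerate(rows):
        path = out_dir / f"{prefix}{i}.tsv"  # hypothetical naming scheme
        # newline="" lets the csv module control line endings itself
        with path.open("w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle, delimiter="\t")
            writer.writerow(row.values())
        paths.append(path)
    return paths


# Hypothetical sample rows standing in for the cleaned dataframe
sample_rows = [
    {"title": "Cozy room", "city": "Austin", "description": "Near downtown"},
    {"title": "Garden house", "city": "Dallas", "description": "Queen bed"},
]
tsv_dir = tempfile.mkdtemp()
tsv_paths = write_rows_as_tsv(sample_rows, tsv_dir)
```

With a pandas dataframe, the rows could be obtained with `to_dict("records")` before being passed to such a function.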

#### Creating Data Object
Initiating an instance of the *AirbnbData* class automatically reads the data from the passed CSV file and stores it in a dataframe called 'data'.

In [115]:
data = AirbnbData('Airbnb_Texas_Rentals.csv', 'Airbnb Texas',True)

#### Cleaning Data
Calling the *clean* function removes entries with null values, converts words to lower case, removes stop words, and stems words.

In [19]:
data.clean()

#### Translation
This function checks the language of the title and description text and, if it is not English, translates them into the desired language (English in our case).

In [None]:
data.translate('English')

#### Creating TSV Files
The function below creates a TSV file for each entry in our data under the 'data' folder.

In [21]:
data.createTSVs()

### Search Engine 1 (Step 3.1)
We create the first search engine by instantiating the *HoohleSimple* class and passing it the data variable.

The initiation process includes creating a vocabulary file (vocabulary.csv) and an index dictionary which remains in memory.
As you will see below, we also provide an option to save the index to a file on disk so we can read it again and assign it to our search engine object.
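As a rough illustration of saving and re-reading an index dictionary, the sketch below uses Python's `pickle` module. The on-disk format actually used by *saveIndex* and *readIndex* is not shown in this notebook, so the functions `save_index` and `load_index` and the sample dictionary are assumptions.

```python
import pickle
import tempfile
from pathlib import Path


def save_index(index, path):
    """Serialize the index dictionary to disk."""
    with open(path, "wb") as handle:
        pickle.dump(index, handle)


def load_index(path):
    """Read a previously saved index dictionary back into memory."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Hypothetical index: term id -> list of document ids
index_path = Path(tempfile.mkdtemp()) / "indexSimple.index"
original = {0: [1, 4], 1: [2]}
save_index(original, index_path)
restored = load_index(index_path)
```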

To create the vocabulary file needed for the inverted index we used the function *buildVocabulary* in the class *AirbnbData*. This function calls another function, *getUniqueTerms* from the class *Property*, which itself calls the function *cleanData*. The latter converts all words to lowercase, removes punctuation and stopwords, stems the words, and removes numbers and strings starting with numbers.
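The cleaning steps described above can be sketched roughly as follows. This is not the code of *cleanData*: the tiny stop-word set, the function names `clean_text` and `unique_terms`, and the `stem` parameter (an identity function by default) are assumptions that keep the example free of external dependencies.

```python
import re

# Tiny illustrative stop-word list; the notebook uses NLTK's English list instead
STOP_WORDS = {"a", "an", "and", "the", "with", "in", "of", "to", "is"}


def clean_text(text, stem=lambda word: word):
    """Lowercase, drop punctuation, stop words and number-led tokens, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    kept = []
    for token in tokens:
        if token in STOP_WORDS:
            continue
        if token[0].isdigit():  # removes numbers and strings starting with a digit
            continue
        kept.append(stem(token))
    return kept


def unique_terms(text, stem=lambda word: word):
    """Return the distinct cleaned terms of a text, sorted for stable output."""
    return sorted(set(clean_text(text, stem)))


cleaned = clean_text("Private Room with a Queen bed, 2nd floor and 3 gardens!")
```

Passing `stem=stemmer.stem`, with the `SnowballStemmer` instance created in the import cell, would add the stemming step.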

The next step (3.2.1) was to execute the query. This is done by the class *HoohleSimple*. The user is asked to input a query, which is passed to the function *printResults* in this class. This function calls the function *getDocsByQuery*, which first cleans the search query by applying the function *cleanData* described above (creation of the vocabulary file). After the query has been cleaned it is passed on to the function *getDocsByTerm*. This accesses the function *getID*, which returns the id of the term as assigned in our vocabulary file. The term id is then used to look up all the document ids assigned to it. In this step we access our index, which is created in the initiation process of *HoohleSimple*. This index is a dictionary where the key is the id of a term from our vocabulary file and the values are the ids of the documents containing this term. The document ids we get from our index are then appended to a list containing all the document ids found for the terms of the search query. With this list of document ids, *HoohleSimple* then prints the title, link and description of the apartments that match the search query. See and test below.
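A minimal sketch of such an index and lookup is given below. The text does not state how the per-term document lists are combined, so the sketch assumes conjunctive semantics (a document must contain every query term); the function names and sample data are hypothetical as well.

```python
from collections import defaultdict


def build_index(docs, vocabulary):
    """Map each term id to the sorted list of ids of documents containing the term."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[vocabulary[term]].add(doc_id)
    return {term_id: sorted(doc_ids) for term_id, doc_ids in index.items()}


def docs_matching_all(query_terms, vocabulary, index):
    """Return ids of documents that contain every query term."""
    result = None
    for term in query_terms:
        term_id = vocabulary.get(term)
        if term_id is None:  # unknown term: no document can match all terms
            return []
        doc_ids = set(index.get(term_id, []))
        result = doc_ids if result is None else result & doc_ids
    return sorted(result or [])


# Hypothetical cleaned documents and vocabulary (term -> term id)
docs = {0: ["privat", "room", "garden"], 1: ["room", "queen"], 2: ["garden", "room", "queen"]}
vocabulary = {"privat": 0, "room": 1, "garden": 2, "queen": 3}
index = build_index(docs, vocabulary)
matches = docs_matching_all(["room", "queen"], vocabulary, index)
```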


In [122]:
hoohle = HoohleSimple(data, True)
#hoohle = HoohleSimple(data)

#### Save Dataframe
We also provide the option to save our dataframe to disk after cleaning, so it can be imported again later.

In [23]:
saveData(data.data, 'dataframe/airbnb')

#### Save Index To Disk

In [24]:
hoohle.saveIndex('dataframe/indexSimple.index')

In [116]:
#data.readData('dataframe/airbnb.feather')
#hoohle.readIndex('dataframe/indexSimple.index')

#### Executing The Search
The search is executed by passing the user's input text to the *printResults* function.

In [61]:
qq = input("Enter a search query")
hoohle.printResults(qq)

Enter a search query private room with queen bed and garden


### Search Engine 2 (Step 3.2) Conjunctive query & Ranking score
For the second search engine we needed an inverted index that stores, for each term, the document ids together with the matching tf-idf of the term in each document. This index is created at the initialization of the class *HoohleTFIDF*. Here we call the function *get_TFIDF*, which accesses the functions *get_TF* and *get_IDF* and returns the product of the values returned by these two functions. *get_IDF* returns the IDF (inverse document frequency), which is the logarithm of the total number of documents divided by the number of documents containing the specified term. *get_TF* returns the TF (term frequency), that is, the number of occurrences of a term in a document divided by the length of the document. We decided to normalize the term frequency like this because a long description might contain the term more often without providing more information, which should not increase the importance of the document. All three functions can be found in the class *HoohleTFIDF*. The values returned by *get_TFIDF* are then stored in the dictionary *indexTFIDF*, using the format described in the task description of the homework.

To get the top k (k = 10 in our case) documents matching the search query we pass the query to *getDocsByQuery*. This function first cleans the query as already described in the explanation of *HoohleSimple*. Then it gets the documents matching the query with the same functions we used in *HoohleSimple*. If the returned list contains at least k documents, we compute the cosine similarity of every document in the list to the search query using *getCosineSimi* and add the returned value to a list containing the cosine similarities of the found documents. In this step we pass the values of *getTFIDF_query* and *getTFIDF_vector* to *getCosineSimi*. *getTFIDF_query* calculates the tf-idf (as described earlier) of all terms in the query and returns a list with all the values. *getTFIDF_vector* returns a vector, stored in a list. It accesses the precalculated *indexTFIDF* to get the tf-idf of a term in a document and appends it to the vector list. Before returning the list of the documents with the highest similarity to the search query, *getDocsByQuery* sorts the list of similar documents by their cosine similarity. For that we used the Python library *heapq*.

If the list of documents found contains fewer than k entries, *getDocsByQuery* calls the function *completeToK*. The latter takes the list of documents returned by *getDocsByQuery* and extends it with documents based on their cosine similarity to the user's query.
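The weighting described above can be written as tf-idf(t, d) = (count of t in d / length of d) * log(N / df(t)). Below is a small sketch of that formula; it is not the code of *get_TF*, *get_IDF* or *get_TFIDF*. The function names and the sample corpus are assumptions, and the natural logarithm is assumed because the text does not state the base.

```python
import math


def term_frequency(term, doc_terms):
    """Occurrences of the term divided by the document length."""
    if not doc_terms:
        return 0.0
    return doc_terms.count(term) / len(doc_terms)


def inverse_document_frequency(term, corpus):
    """log(N / df): N documents in total, df of them containing the term."""
    df = sum(1 for doc_terms in corpus if term in doc_terms)
    if df == 0:  # term appears nowhere: treat its weight as zero
        return 0.0
    return math.log(len(corpus) / df)  # natural log is an assumption


def tf_idf(term, doc_terms, corpus):
    """Product of term frequency and inverse document frequency."""
    return term_frequency(term, doc_terms) * inverse_document_frequency(term, corpus)


# Hypothetical corpus of three cleaned documents
corpus = [["room", "garden", "room"], ["queen", "bed"], ["garden", "queen"]]
score = tf_idf("room", corpus[0], corpus)
```

In this example "room" occurs twice in a document of three terms and appears in one of three documents, so its weight is (2/3) * log(3).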
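The ranking step can be illustrated with the sketch below, which scores documents by cosine similarity and uses `heapq.nlargest` to select the best k. It is a simplified stand-in for *getCosineSimi* and the sorting inside *getDocsByQuery*; the function names, vectors and document ids are hypothetical.

```python
import heapq
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k(query_vector, doc_vectors, k=10):
    """Return the k (similarity, doc_id) pairs with the highest similarity."""
    scored = ((cosine_similarity(query_vector, vec), doc_id)
              for doc_id, vec in doc_vectors.items())
    return heapq.nlargest(k, scored)


# Hypothetical tf-idf vectors for a two-term query and three documents
query_vector = [0.5, 0.5]
doc_vectors = {7: [1.0, 1.0], 8: [1.0, 0.0], 9: [0.0, 0.0]}
ranking = top_k(query_vector, doc_vectors, k=2)
```

When using `scipy.spatial.distance.cosine` instead, note that it returns the cosine distance, that is, one minus the similarity.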


We initiate the search engine by creating an object from the class *HoohleTFIDF* and passing it the Airbnb data.

In [124]:
#data.data.index = range(len(data.data.index))
hoohleTF = HoohleTFIDF(data, True)
#hoohleTF.index = hoohle.index

#### Executing The Search
The search is executed by passing the user's input text to the *printResults* function.

In [None]:
qq = input("Enter a search query")
hoohleTF.printResults(qq)

Saving the index to Disk

In [64]:
hoohleTF.saveIndex('dataframe/indexTFIDF.index')

Reading the saved index from Disk

In [None]:
#hoohleTF.readIndex('dataframe/indexTFIDF.index')

### Search Engine 3 (Step 4) Define a new score!

Our idea for the third search engine was to use the *price*, the *number of rooms* and the *distance to the city center* (called attributes in the following) as additional information to optimize our search result. When executing the search query, the user is asked to provide their preferences for these attributes. As the location of the city center is not present in the provided dataset, we retrieved it for every city using the library *geopy* in the function *buildGeoInfo* in the class *HoohleNOSTRO*. In this class you can also find the function *insertGeoInfo*, which we used to save the retrieved information to our dataframe.
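To illustrate what a distance to the city center means numerically, here is a dependency-free sketch based on the haversine (great-circle) formula. This is a swapped-in technique, not what the notebook uses: *geopy* computes geodesic distances on an ellipsoid, so its results differ slightly. The function name and the coordinates are assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(point_a, point_b):
    """Great-circle distance in km between two (latitude, longitude) pairs in degrees."""
    lat1, lon1 = map(math.radians, point_a)
    lat2, lon2 = map(math.radians, point_b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


# Hypothetical coordinates: a listing and an approximate city center
listing = (29.4260, -98.4861)
city_center = (29.4241, -98.4936)
distance_to_center = haversine_km(listing, city_center)
```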

Our first approach to define a new score was to assign a value ranging from -1 to 1 that expresses the similarity of the user's preferences to the actual values in the dataframe. In this case, -1 would have been the minimum value of *price*, *number of rooms* and *distance to city center*. We would have compared the actual value of the entry to the average value of the aforementioned attributes and then assigned a value in the range from -1 to 1. We didn't use this approach as we figured out a better one.

The approach we used was to calculate a value in the range from 0 to 1. This score is calculated using the functions *scoringPrice*, *scoringRoom* and *scoringLoc*, which can be found in the class *HoohleNOSTRO*. These functions use the preferences we ask the user to provide. The minimum of every attribute is subtracted from the value the user provided. The result of the subtraction is then divided by the maximum value of the attribute minus the minimum value of the attribute.

During the initialization, *HoohleNOSTRO* initializes *HoohleSimple* as a parent to inherit the functions we defined in the latter class. This way we can use the *getDocsByQuery* function of *HoohleSimple*. This outputs a list of documents matching the search query of the user, as described in the section __Search Engine 1 (Step 3.1)__. The difference is how we sort these results. For that we used the function *rank* in the class *HoohleNOSTRO*. It creates a list that contains the document id and the cosine similarity. The latter is provided by the function *getCosineSimi* we already described. In this case this function uses information from the dictionary *indexNostro* to rank the documents of our result by their similarity to the user's preferences. The results are then printed, ordered by the rank we found. *indexNostro* is created by the function *indexNostro* in *HoohleNOSTRO*. It appends a vector of the attributes to every listing.
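The scaling described above is a min-max normalization, score = (value - min) / (max - min). The sketch below shows it on made-up numbers. The function name `min_max_score`, the attribute ranges, the handling of a constant attribute and the clamping of out-of-range preferences are assumptions and may differ from *scoringPrice*, *scoringRoom* and *scoringLoc*.

```python
def min_max_score(value, minimum, maximum):
    """Scale value into [0, 1] relative to the attribute's observed range."""
    if maximum == minimum:  # constant attribute: avoid division by zero (assumption)
        return 0.0
    score = (value - minimum) / (maximum - minimum)
    # Clamp preferences that fall outside the observed range (assumption)
    return max(0.0, min(1.0, score))


# Hypothetical attribute ranges taken from the dataset
price_score = min_max_score(100, minimum=20, maximum=420)
room_score = min_max_score(5, minimum=1, maximum=9)
```

Here a preferred price of 100 within a range of 20 to 420 maps to (100 - 20) / (420 - 20) = 0.2, and 5 rooms within a range of 1 to 9 maps to 0.5.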

In [75]:
hoohleNostro = HoohleNOSTRO(data)

Building the dictionary of Geo information

In [76]:
hoohleNostro.buildGeoInfo()

Inserting the Geo information into the dataframe by calculating the distances to the city center *(refer to the class markdown file for more information about that)*

In [77]:
hoohleNostro.insetGeoInfo()

Building the index by creating vectors of scores for each entry

In [78]:
hoohleNostro.buildIndex()

Saving the index to a file.

In [80]:
hoohleNostro.saveIndex('dataframe/indexNostro.index')

Saving the Geo dictionary to a file.

In [82]:
hoohleNostro.saveIndex('dataframe/geodict.dict')

In [314]:
#hoohleNostro.geodict = tmpDict
#hoohleNostro.indexNostro = tmpIndexN
#hoohleNostro.index = tmpIndex

Showing the user interface, collecting the user's preferences and passing them to the *printResults* function to show the results.


In [106]:
def showUI():
    # Collect the free-text query and the user's preferences
    qq = input("What are you looking for?")
    qprice = input("Price (in $)?")
    qrooms = input("Number of rooms?")
    qcity = input("In which city?")
    qdist = input("Distance to city center (in km)?")
    # Preferences are passed as [price, rooms, city, distance]
    hoohleNostro.printResults(qq, [float(qprice), float(qrooms), qcity, float(qdist)])

showUI()

What are you looking for? private house with queen bed and garden
price? 100
number of rooms? 5
In which city? San Antonio
distance to city center? 5


#### Bonus Step



In [127]:
# Latitude and longitude of the reference point
distanceFrom = [float(input()), float(input())]

 32.87876767
 -97.00324223


In [148]:
# Search radius in km
dis = float(input())

 3


In [149]:
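# Center the map on the coordinates entered by the user
m = folium.Map(location=[distanceFrom[0], distanceFrom[1]])
# Reset the index so rows can be accessed by position
data.data.index = range(len(data.data.index))
# Draw the search radius (folium expects meters, the input is in km)
folium.Circle(
    location=[distanceFrom[0], distanceFrom[1]],
    radius=dis * 1000,
    color='#3186cc',
    fill=True,
    fill_color='#3186cc'
).add_to(m)

# Mark every listing that lies within the radius
for i in range(len(data.data)):
    coor = (data.data['latitude'][i], data.data['longitude'][i])
    if geopy.distance.distance(distanceFrom, coor).km <= dis:
        folium.Marker(coor, popup=str(data.data['title'][i])).add_to(m)

m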