# Algorithmic Methods for Data Mining
## Homework 3 - Airbnb Search Engine
### Group #17 - Giulia Scialanga, Guilherme Nicchio, Marco Minici
##### 26/11/2018

The goal of this project is to build a search engine over a database of Airbnb houses.

The code returns the houses in the database that match the description entered in a user query.

A user query is the sentence a user enters in a search field, for example: "A beautiful house with beach and garden".

The following libraries were used for this project:

    import csv
    import pickle
    import re
    import string
    from os import listdir, mkdir, remove
    from os.path import isdir, isfile

    import pandas as pd
    from nltk import SnowballStemmer
    from nltk.corpus import stopwords
    from nltk.tokenize import RegexpTokenizer
    from tqdm import tqdm

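In short: pandas loads and cleans the CSV file; csv, os and os.path read, write and manage files and directories on disk; pickle serializes Python objects; tqdm displays progress bars for long loops; NLTK provides the tokenizer, the English stop-word list and the Snowball stemmer used for text pre-processing; re and string help with pattern matching and punctuation.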

The data comes as a single CSV file with the following columns:

    data.head()

| Unnamed: 0 | average_rate_per_night | bedrooms_count | city | date_of_listing | description | latitude | longitude | title | url |
|---|---|---|---|---|---|---|---|---|---|
| 1 | $27 | 2 | Humble | May 2016 | Welcome to stay in private room with queen bed... | 30.020138 | -95.293996 | 2 Private rooms/bathroom 10min from IAH airport | https://www.airbnb.com/rooms/18520444?location... |
| 2 | $149 | 4 | San Antonio | November 2010 | Stylish, fully remodeled home in upscale NW – ... | 29.503068 | -98.447688 | Unique Location! Alamo Heights - Designer Insp... | https://www.airbnb.com/rooms/17481455?location... |
| 3 | $59 | 1 | Houston | January 2017 | 'River house on island close to the city' \nA ... | 29.829352 | -95.081549 | River house near the city | https://www.airbnb.com/rooms/16926307?location... |
| 4 | $60 | 1 | Bryan | February 2016 | Private bedroom in a cute little home situated... | 30.637304 | -96.337846 | Private Room Close to Campus | https://www.airbnb.com/rooms/11839729?location... |
| 5 | $75 | 2 | Fort Worth | February 2017 | Welcome to our original 1920's home. We recent... | 32.747097 | -97.286434 | The Porch | https://www.airbnb.com/rooms/17325114?location... |


The first goal is to match the user query against the descriptions and titles of the houses.

### Step 1 - Cleaning the data

#### Cleaning process:

The database contains some "noise", for example empty cells (NA) in the location coordinates, title and description. These fields are essential for the search engine: a house without a title and a description cannot produce a match for any query.

There are also duplicated rows, caused either by a system error or by an owner entering the same house twice. To avoid duplicated results for a query, the duplicated rows are dropped from the database.

The cleaning rules are:

- Delete all rows with (lat, long) equal to NA, since a house that cannot be located is useless for a customer.

- Delete rows with both description and title equal to NA, since it would not be possible to match them against the user query.

- Retain all other rows, possibly penalizing later the records with NA values in other columns (e.g. bedrooms_count).

#### 1. Load the data.csv

- Read the CSV file with pandas `read_csv`.

#### 2. Call cleandata()

- Keep only the rows whose latitude is not null;

- Drop the rows in which both description and title are NA;

- Drop all duplicated rows;

- Return the data.
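The cleaning rules listed above could be sketched with pandas as follows. This is a minimal illustration, not the notebook's actual `cleandata()`: the function name `clean_data` and the tiny sample frame are made up for the example, and the second rule is read as "drop a row only when both text fields are missing".

```python
import pandas as pd


def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the three cleaning rules to the listings table (illustrative sketch)."""
    # Rule 1: a listing without coordinates cannot be located.
    df = df.dropna(subset=["latitude", "longitude"])
    # Rule 2: drop a row only when BOTH text fields are missing (assumption).
    df = df.dropna(subset=["description", "title"], how="all")
    # Rule 3: remove exact duplicate rows, keeping the first occurrence.
    return df.drop_duplicates()


# Hypothetical sample: one valid row, one without text, one duplicate, one without latitude.
sample = pd.DataFrame({
    "title": ["The Porch", None, "The Porch", "Loft"],
    "description": ["1920's home", None, "1920's home", None],
    "latitude": [32.7, 30.0, 32.7, None],
    "longitude": [-97.2, -95.2, -97.2, -96.3],
})
cleaned = clean_data(sample)
print(len(cleaned))
```

One caveat: if the frame still carries a running id column such as `Unnamed: 0`, no two rows are exact duplicates, so duplicates would have to be detected on the remaining columns (for example via the `subset` argument of `drop_duplicates`).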


### Step 2 - Pre-processing the data and creating Tsv files for each house
At this point the data frame has its duplicates and empty fields removed.
The next step is to go through each row of the data frame, process it, and write it to its own TSV file.

Before extracting features from the text, the data is treated further so that better features can be obtained. This is done with some basic _pre-processing_ steps, using the NLTK library, which offers many tools for natural language processing.

##### Lower case
The first pre-processing step is to transform the descriptions and titles into lower case. This avoids having multiple copies of the same word. For example, when calculating the word count, ‘House’ and ‘house’ would otherwise be taken as different words.

##### Tokenization
Tokenization refers to dividing the text into a sequence of words or sentences.

##### Removing Punctuation and Stop Words
The next step is to remove punctuation and stop words, as they add no extra information when treating text data. Removing all of them also reduces the size of the data.

##### Stemming
Stemming refers to the removal of suffixes such as “ing”, “ly” and “s” by a simple rule-based approach. For this purpose, the SnowballStemmer from the NLTK library is used.
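The four steps so far (lower-casing, tokenization, stop-word removal, stemming) can be chained in one small function. The sketch below uses only the standard library so that it runs without NLTK data: `DEMO_STOPWORDS` is a tiny made-up list and `naive_stem` is a toy suffix stripper, not the Snowball algorithm. Both are placeholders for the NLTK objects used in the notebook.

```python
import re

# Tiny illustrative stop-word list; the notebook uses NLTK's English list instead.
DEMO_STOPWORDS = frozenset({"a", "an", "and", "the", "with", "in", "of", "to"})


def naive_stem(word: str) -> str:
    """Toy rule-based suffix stripper (NOT the Snowball algorithm)."""
    for suffix in ("ing", "ly", "s"):
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text: str, stop_words=DEMO_STOPWORDS, stem=naive_stem) -> str:
    """Lower-case, tokenize on word characters, drop stop words, stem, re-join."""
    tokens = re.findall(r"\w+", text.lower())  # punctuation is discarded here
    kept = [stem(token) for token in tokens if token not in stop_words]
    return " ".join(kept)


print(preprocess("A Charming House, with Gardens!"))  # charm house garden
```

With NLTK installed and its stop-word corpus downloaded, the same function could be called with `stop_words=set(stopwords.words("english"))` and `stem=SnowballStemmer("english").stem`; the stems will then differ from the toy version (Snowball reduces "house" to "hous", for instance).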

##### Removing non-English characters
While handling the data, some non-English texts were noticed. Descriptions written in Chinese, for example, can cause problems when the text is processed. To solve this, a function was created to test whether the description and the title use English characters. A row is processed and written to a TSV file only if this test returns true.
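A character test of this kind can be sketched as an encoding attempt. The choice of ASCII as the target encoding is an assumption made for this example; the function name `is_english` is likewise illustrative.

```python
def is_english(text: str) -> bool:
    """Return True when every character of `text` is plain ASCII (assumed proxy)."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


print(is_english("River house near the city"))  # True
print(is_english("北京的房子"))                   # False
```

Such a test is only a proxy for language. It also rejects English text containing typographic characters, such as the "–" in the second row of the table above, and it accepts any other language written with unaccented Latin letters.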

##### Creating TSV files
After pre-processing, the tokens of the description and of the title are joined back into text and replace the original fields, and the TSV file is then created. The treated data has 18259 rows, which results in the same number of TSV files for the search engine to analyse.

    1. createAllReviews
       Calls CreateTSV on every row of the dataframe with .apply(lambda)
        2. CreateTSV (for a single row of the dataframe)
            if the description is empty (.isna) write "NaN", else call _nltkProcess
            do the same for the title
                3. _nltkProcess
                    lower-case the string
                    4. call _setupNltk
                        lazy initialization of the objects needed to preprocess strings
                        tokenizer = RegexpTokenizer(r'\w+')
                        stopwords = set(stopwords.words('english'))
                        stemmer = SnowballStemmer('english')
                    tokenize
                    remove stop words
                    stem
                    join the result and return it
                5. call isEnglish() to test whether every word in the row is in English (and not, e.g., Chinese)
                    try to encode the text as English characters; if this fails, return False for the "is English" test
                    if it succeeds, return True
                if True, go on to create the file; if False, return from CreateTSV without creating the file
                open the file for writing using with open(self.dir_path+self.review_dir+"doc_"+str(x.name)+".tsv", 'w') as file:
                try to write it
                    except (the write fails): save the index (x.name) so the file can be deleted afterwards
### Step 3 -  BuildingEncoding()
At this point the TSV files have been created from the pre-processed text. The next step is to build a dictionary for the vocabulary of words present in all the documents, in which each word corresponds to a unique number. This makes it easier to record which documents contain each word.

Example of the target at this point: {"car":1, "house":2}

This step goes through every file's title and description and adds their words to an initially empty dictionary. A word is added only if it is not yet in the dictionary, and its value is a counter of how many distinct words have been collected so far.

    1. BuildingEncoding
        create an empty dictionary to store the words
        create a counter for the number of distinct words seen so far (vocabulary size)
        for each file in "listdir(directory address)"
            go to the title, split it into words and build a set from them to remove duplicates
            do the same for the description
            take the union of both sets
            for each word in the union, check whether it is already in the dictionary
                if it is not, add it to the dictionary as a key
                set the counter as the value of the word in the dictionary
                increase the counter by 1
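The vocabulary-building loop above can be sketched as follows. To keep the example self-contained, the documents are given as in-memory strings instead of being read from the TSV files, and the ids start at 1 to match the example `{"car":1, "house":2}`; both choices are assumptions. Words are visited in reading order rather than through a set, so that the ids are reproducible between runs.

```python
def build_vocabulary(documents):
    """Map every distinct word to an integer id, in order of first appearance."""
    vocabulary = {}
    for text in documents:
        for word in text.split():
            if word not in vocabulary:
                # The next id is the number of words collected so far, plus one.
                vocabulary[word] = len(vocabulary) + 1
    return vocabulary


# Hypothetical pre-processed documents (title and description already merged).
docs = ["car hous", "hous garden car"]
print(build_vocabulary(docs))  # {'car': 1, 'hous': 2, 'garden': 3}
```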

### Step 4 - Reverse Index
This step creates a dictionary with words as keys and documents as values, a structure commonly known as an inverted index.

After this step, for each number (word) we can say which documents contain it, for example:

 
{

term_id_1:[document_1, document_2, document_4],

term_id_2:[document_1, document_3, document_5, document_6],

...}


In this step a new dictionary is created whose keys are the vocabulary numbers from the previous dictionary. Then, for each document, every word in its title and description is looked up, and the document name is appended to the list of values under the corresponding key.

    create a new dictionary whose keys are the numbers from 1 up to the size of the vocabulary
        this is the target dictionary: it says which documents contain each word
    for each file in "listdir(directory address)"
        go to the title, split it into words and build a set from them to remove duplicates
        do the same for the description
        take the union of both sets
        for each word in the union
            get the number this word has in the vocabulary dictionary
            go to the new dictionary and append the name of the document under that key
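A minimal sketch of this construction is shown below. The documents are again held in memory as a hypothetical `{name: text}` dictionary rather than read from disk, and the vocabulary is written out by hand so that the block runs on its own.

```python
from collections import defaultdict


def build_inverted_index(documents: dict, vocabulary: dict) -> dict:
    """Return {term_id: [document names containing that term]}."""
    index = defaultdict(list)
    for doc_name, text in documents.items():
        # A set removes repeated words, so each document is listed once per term.
        for word in set(text.split()):
            index[vocabulary[word]].append(doc_name)
    return dict(index)


# Hypothetical pre-processed documents and their vocabulary.
documents = {
    "doc_1": "garden hous",
    "doc_2": "beach hous",
    "doc_3": "garden beach hous hous",
}
vocabulary = {"garden": 1, "hous": 2, "beach": 3}
inverted_index = build_inverted_index(documents, vocabulary)
print(inverted_index[1])  # ['doc_1', 'doc_3']
```

Because the documents are visited in a fixed order, each list of document names comes out in that same order.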

With the target dictionary built, the documents matching a query are found by intersection.

    1. process the query
        enter a string, e.g. "a with garden house"
        tokenize the query string
        filter out the stop words
        stem the remaining words and save them in an empty list
        the result is ['garden', 'hous']
    2. look up the query words in the dictionary
        for every word in the query list
            append the set of documents stored under the word's number
            in the new dictionary to a new list
        this new list holds, for each query word, the set of documents that contain it
        since a matching document must contain all the words, intersect these sets
        the result is every document that contains all the words of the query
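The lookup and intersection described above can be sketched as follows. The query terms are assumed to be pre-processed already (here `['garden', 'hous']`), and the vocabulary and index are small hand-written examples. Returning an empty result for a word missing from the vocabulary is a design choice of this sketch.

```python
def search(query_terms, vocabulary, inverted_index):
    """Return the set of documents that contain ALL query terms."""
    postings = []
    for term in query_terms:
        term_id = vocabulary.get(term)
        if term_id is None:
            # A word seen in no document means no document can match every term.
            return set()
        postings.append(set(inverted_index[term_id]))
    if not postings:
        return set()
    return set.intersection(*postings)


# Hypothetical vocabulary and index for three documents.
vocabulary = {"garden": 1, "hous": 2, "beach": 3}
inverted_index = {
    1: ["doc_1", "doc_3"],
    2: ["doc_1", "doc_2", "doc_3"],
    3: ["doc_2", "doc_3"],
}
print(sorted(search(["garden", "hous"], vocabulary, inverted_index)))  # ['doc_1', 'doc_3']
```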



