# Take-Home Challenge

### Guidelines
* We expect that the test should take around 4 hours to do. However, we strongly advise you to carefully read this assignment, think about approaches and try to understand the data before diving into the questions. You are free to spend as much time on it as you want, in the timeframe given by our recruiter.
* If you use this Google Colab notebook, you'll need to upload the files from the Google Drive folder given to you (*listing.html*, *properties.csv*) by running the cell below.
* If you want to use Python packages that are not yet installed in this notebook, use `!pip install package`.

# Data Extraction (CSS + REGEX)

Casafari tracks the entire real estate market by aggregating properties from thousands of different websites. The first step of this process is to collect all the relevant information using web crawlers. This task will give a brief overview of how this extraction is made. 

The task consists of 3 parts, which will evaluate your knowledge of CSS3 selectors and regular expressions, both essential to data extraction processes. We believe that even if you have no previous knowledge of CSS, HTML and REGEX, you should be able to complete this task in less than an hour. There are many tutorials and plenty of information on how to use CSS3 selectors and regular expressions to extract data. Do not be afraid to google it! This task is also an evaluation of your learning capabilities.

The normal questions already have some examples and can be solved only by filling the CSS3 selectors or the regular expressions in the given space. You can check if you have the correct results by running the pre-made script after it. However, if you feel comfortable, you can use another python package and rewrite the script in a similar way to extract the data.

For the extra challenges, you'll need to construct the scripts from scratch.

__(1)__ For the following task, use the _listing.html_ file, which represents a listing for a property. Open the HTML file in your browser, investigate it with the Inspect tool, view the source code and explore it. 
After that, fill the CSS3 selectors in the following script to extract the following information about this property:

* Number of bathrooms
* Number of bedrooms
* Living Area
* Energy Rating
* Description
* Agent Name
* Location of the property

## Install Libs

In [None]:
!pip install lxml
!pip install cssselect
!pip install spacy

In [None]:
!python -m spacy download en_core_web_sm

## Imports

In [None]:
from lxml import html, etree
import re
import pandas as pd
import numpy as np
import spacy
import en_core_web_sm
import nltk

from collections import Counter
from string import punctuation
from nltk.corpus import stopwords
from nltk import word_tokenize

## Setup

In [None]:
nltk.download('stopwords')
nltk.download('punkt')

## Variables

In [None]:
_dir = "."

## Code

In [None]:
# EXAMPLE SELECTOR TO EXTRACT THE PROPERTY TYPE
Selector_Example = "h1.lbl_titulo"

In [None]:
# EXAMPLE CODE, RUN TO CHECK THE EXAMPLE SELECTOR

try:
    f = open(r'{}/listing.html'.format(_dir), "r")

except FileNotFoundError:
    f = open(r'listing.html', "r")

page = f.read()
f.close()

#parsing HTML into a tree structure
tree = html.fromstring(page)

print('Example -> Property type: {}'.format(tree.cssselect(Selector_Example)[0].text))

Now that you understand the example, fill in the CSS selectors here and check them by running the cells below:

In [None]:
############## Q1 ANSWERS ##################

'''
Gui Palazzo Note:
With this selector we can get all data needed in a list of web tag elements. Afterwards, let's use a hash map to get the info we need.

'''


only_li_elements = "ul.bloco-dados li"

In [None]:
'''
Gui Palazzo Note:
  Section 1: using cssselect to get all li elements in order to extract 4 of the 7 pieces of information
    needed to answer the question below.

  Section 2: using xpath to get the other 3.
'''


#Section 1
_dict = {}
all_li_elem = tree.cssselect(only_li_elements)

for elem in all_li_elem:
    text = elem.text_content().strip()
    text_splitted = text.split(": ")
    _dict[text_splitted[0]] = text_splitted[1]


#Section 2
desc = tree.xpath("/html/body/div/div[2]/div/div[4]/p/text()")[0]
location = tree.xpath("//*[@id='Cpl_lbl_morada']")[0].text
agent_name = tree.xpath("//*[@id='Cpl_moduloinformacaolateral_module_holder']/div/div/div[1]")[0].text[6:]
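
The `split(": ")` step in Section 1 assumes that every list item contains the separator. A minimal sketch of a more defensive variant, using `str.partition`, is shown here; the sample strings and the helper name `parse_pairs` are made up for illustration and are not taken from `listing.html`.

```python
# Hypothetical "label: value" strings, standing in for the text of each <li>.
items = ["Bedrooms: 2", "Bathrooms: 1", "Living Area: 80m2", "Furnished"]


def parse_pairs(lines):
    """Build a {label: value} dict, skipping lines that lack a ':' separator."""
    pairs = {}
    for line in lines:
        key, sep, value = line.partition(":")
        if sep:  # partition returns an empty separator when ':' is absent
            pairs[key.strip()] = value.strip()
    return pairs


details = parse_pairs(items)
print(details)
```

With `partition`, an item such as "Furnished" is skipped instead of raising an `IndexError`.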

In [None]:
############### RUN TO CHECK YOUR ANSWERS ##################

'''
Gui Palazzo Note: 
Apparently there's something wrong with the Total Area in the website (listing.html file), because if 
  the Living Area is 80m2, how could the Total Area be less than that (0m2)?
'''


print('Bathrooms: {}'.format(_dict['Bathrooms']))
print('')
print('Bedrooms: {}'.format(_dict['Bedrooms']))
print('')
print('Total area: {}'.format(_dict['Total Area']))
print('')
print('Living area: {}'.format(_dict['Living Area']))
print('')
print('Description: {}'.format(str(desc)))
print('')
print('Agent name: {}'.format(agent_name))
print('')
print('Location: {}'.format(location))

__Extra Challenge__:

Write from scratch a script to extract and print:
* One link that leads to http://mydomain.com/link-to-image
* Extract all the features of the property

In [None]:
############### WRITE THE SCRIPT TO SOLVE THE EXTRA CHALLENGE HERE ##################

try:
    f = open(r'{}/listing.html'.format(_dir), "r")

except FileNotFoundError:
    f = open(r'listing.html', "r")

page = f.read()
f.close()

#parsing HTML into a tree structure
tree = html.fromstring(page)


def get_link():

    xpath = "/html/body/div/div[2]/div/div[2]/div/a/@href"
    link = str(tree.xpath(xpath)[0])
    return link


def get_features():
  
    selector = "ul.modulo-caracteristicas-conteudo li"
    features = tree.cssselect(selector)
    feat_list = [feat.text_content().strip() for feat in features]
    return feat_list

  
print("The link is the following: {}".format(get_link()))
print()
print("The features are the following: {}".format(get_features()))

__(2)__ In the second part you will still have to use the html file. However, this time, you should use regular expressions to extract the following data from the webpage:

* URLs that are links to listings (e.g. http://mydomain.com/link-to-listing). Do not use the whole URL itself in the regular expression. It should select only 3 links.
* The agent telephone number
* The property price

In [None]:
# REGEXP EXAMPLE TO EXTRACT THE AGENT EMAIL
Regexp_Example = r"\">(.*?@.*?)<"

In [None]:
# RUN TO CHECK THE EXAMPLE RESULTS

try:
    f = open(r'{}/listing.html'.format(_dir), "r")

except FileNotFoundError:
    f = open(r'listing.html', "r")

page = f.read()
f.close()

print("Email extracted: {}".format(re.findall(Regexp_Example, page)[0]))

In [None]:
# WRITE YOUR REGULAR EXPRESSIONS HERE
Regexp_1 = r"http://.*link-to-listing"
Regexp_2 = r"\d+(?:-)\d+"
Regexp_3 = r"\d.*\s€"

In [None]:
############### RUN TO CHECK YOUR ANSWERS ##################
print('Links extracted:')
for w in re.findall(Regexp_1, page):
    print(w)
    
print('')
print("Agent Phone Number: {}".format(re.findall(Regexp_2, page)[0]))
print('')
print("Property price: {}".format(re.findall(Regexp_3, page)[0]))
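
One thing worth checking with a pattern such as `http://.*link-to-listing` is greediness: `.` does not cross newlines, but if several links share a line, `.*` runs to the last occurrence on that line. The sketch below illustrates this on a made-up single-line snippet (the real `listing.html` may be laid out differently) and shows a bounded character class as one possible alternative.

```python
import re

# Hypothetical single-line snippet, not copied from listing.html.
sample = (
    '<a href="http://mydomain.com/link-to-listing">A</a>'
    '<a href="http://mydomain.com/link-to-image">img</a>'
    '<a href="http://mydomain.com/link-to-listing">B</a>'
)

# Greedy: '.*' swallows everything up to the LAST 'link-to-listing' on the line.
greedy = re.findall(r"http://.*link-to-listing", sample)

# Bounded: the URL may not contain quotes or whitespace, so each href stays separate.
bounded = re.findall(r'http://[^"\s]*link-to-listing', sample)

print(len(greedy), len(bounded))
```

Whether this matters depends on how the links are spread across lines in the actual file.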

__Extra Challenge__
* Extract the latitude and longitude values from the HTML (_those values are in the HTML code, but are not shown on the page_)

In [None]:
############### WRITE THE SCRIPT TO SOLVE THE EXTRA CHALLENGE HERE ##################

r'''
Gui Palazzo Note: 

  Regex pattern explanation
  
    -?: matches an optional minus sign (-), i.e. 0 or 1 times
    \d+: matches one or more digits -- same as \d{1,}
    \.: matches a literal dot (.)
    (?:\.\d+): non-capturing group for the fractional part. As written it is mandatory; appending ? would also accept
                integer coordinates, but could then match any other pair of comma-separated integers in the page.
'''


# Raw string, so that \d and \. reach the regex engine without invalid-escape warnings
regex_pattern = r"-?\d+(?:\.\d+),-?\d+(?:\.\d+)"
print(re.findall(regex_pattern, page)[0])
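
As a small follow-up sketch, capturing groups can split the matched pair into two floats. The HTML fragment below is invented for illustration; how the coordinates are actually stored in `listing.html` is not assumed here.

```python
import re

# Hypothetical fragment; the real page may store the coordinates differently.
snippet = '<input type="hidden" id="coords" value="36.5101,-4.8825">'

# Each coordinate: optional minus sign, digits, then a mandatory fractional part.
COORD = r"-?\d+\.\d+"
match = re.search(rf"({COORD}),\s*({COORD})", snippet)

latitude, longitude = (float(group) for group in match.groups())
print(latitude, longitude)
```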

# Data Analysis (Python)


You obtained all the data that you need and you now need to run an analysis on the following problem. For this part, feel free to use as many cells as you need below this point. Please use properties.csv as your data source.



## Problem 
A private investor is planning an investment in one of the four locations. In order to decide where to invest he needs to know the price impact of such features as ‘pool’, ‘sea view’ and ‘garage’ on properties in each location.
He also asks for the mean price of the properties in each type group (‘apartments’, ‘houses’, ‘plots’) and wants to know about properties in the market that are undervalued and overvalued. To solve this problem, we want you to cover the following steps:

### Part 1: Data Cleaning
As you have seen previously, a lot of information is present in the title/features fields. From there, we want to extract the relevant information for further analysis, such as:
 - 1A: Property  **type** (as presented in **Details** above) of each property from **title** field
 - 1B: Property **location** (as presented in **Details** above) of each property from **title** field
 - 1C: From **features** field, if a property has:
  - a pool
  - a garage
  - sea view

#### Deliverables part 1:
- Create a property dataset with the following schema and save it in a csv file:
  - id; location name; type; title; features; pool (0/1); sea view (0/1); garage (0/1);
  - pool, sea view, garage should be binary - 1 if the property has the feature and 0 if not
- For each of the 3 tasks (1A, 1B, 1C), describe in detail what you did. What are the advantages and disadvantages of your approach?
- Please provide your code in the cells below, in a reproducible and understandable way;

### Part 2: Identify outliers
Now that the data is structured correctly, let's look at which properties are a good deal for our investor. For this you will need to identify undervalued, overvalued, and normal properties in the dataset. Please use any model you find appropriate to obtain this.

#### Deliverables part 2:
- As before, deliver a csv file with the following format:
  - id; location name; type; area; price; **over-valued (0/1), under-valued (0/1), normal (0/1)**
  - the new columns should be binary, where for example the **over-valued** column would get value 1 if the property is indeed over-valued, 0 otherwise;
- A short report (could be a pdf file or new cells within the notebook) containing:
  - visualizations (such as scatter plots) discriminating between the undervalued, overvalued and normal properties;
  - an explanation of the difference between under-valued/over-valued properties and pure data outliers;
  - any notes/conclusions you wish to add;
- Your code, in the cells below;
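
To make the three binary columns concrete, here is a minimal sketch of one possible labelling rule: Tukey's 1.5 × IQR fences applied to price per square metre. Both the rule and the metric are assumptions chosen for illustration, not the model the assignment prescribes, and the rows are made up. A fuller analysis would at least compute the fences per location and type, and would separate data-entry errors from genuinely mispriced properties.

```python
from statistics import quantiles

# Made-up (id, area in m2, price in EUR) rows, for illustration only.
rows = [
    (1, 80, 200_000), (2, 100, 260_000), (3, 90, 225_000), (4, 120, 300_000),
    (5, 70, 350_000), (6, 110, 110_000), (7, 95, 240_000), (8, 85, 215_000),
]

# Price per square metre makes properties of different sizes comparable.
price_per_m2 = {pid: price / area for pid, area, price in rows}

# Tukey's rule: values further than 1.5 * IQR outside the quartiles are flagged.
q1, _, q3 = quantiles(price_per_m2.values(), n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr

flags = {}
for pid, value in price_per_m2.items():
    over = int(value > high)
    under = int(value < low)
    flags[pid] = {"over-valued": over, "under-valued": under, "normal": int(not (over or under))}

for pid, row in flags.items():
    print(pid, row)
```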

### Part 3: Theoretical questions
- Mention at least 2 hidden traps you found while solving the problems and what would help you to clean the data set;
- Describe in detail how you would evaluate the price impact of features such as sea view, pool and garage considering the dataset provided. Your answer should also include how you would deal with missing values, outliers and duplicated listings (the same property listing published by different agencies);

#### Extra challenge:
- Describe how you would model the data over time (using the createdAt field). What changes over time would you look for and what would you expect the outcomes to be (e.g. in terms of pricing per location/type)?

# **Data Analysis (Python) - Resolution Starts Here**

In [None]:
# Importing the csv file and creating a dataframe

try:
    df_original = pd.read_csv('{}/properties.csv'.format(_dir))
except FileNotFoundError:
    df_original = pd.read_csv('properties.csv')

df = df_original.copy()

df.head(6)

In [None]:
# Showing more rows of the dataframe in order to analyze some columns by looking at the data.

pd.set_option('display.max_rows', 1500)

### Part 1

**Creating the words dataframe to use in Part 1**

In [None]:
# Using nltk in order to create a DataFrame with words I will analyze in the next step

stoplist = set(stopwords.words('english') + list(punctuation))
# dropna() keeps str.join from failing on rows where 'features' is NaN
texts = df.features.dropna().str.lower()
word_counts = Counter(word_tokenize('\n'.join(texts)))

# One row per distinct token, with its number of occurrences
w_df = pd.DataFrame(list(word_counts.items()), columns=['words', 'count_words'])

w_df.head()

**Part 1**

*Exploring the dataset*

In [None]:
# Exploring the dataset shape in order to make sure I'm losing the correct amount of lines in case there's a dropna usage or any other methods

df.shape

In [None]:
# Looking at the data types and non-null counts

df.info()

In [None]:
# Exploring the columns names

df.columns

**Investigation for Part 1A (Property Type)**

In [None]:
# .copy() avoids pandas' SettingWithCopyWarning when the 'type' column is added below
df_aux = df[['id', 'title']].copy()


# The section below, "RegEx section validation for Property type", shows why the following
  # regex patterns are as comprehensive as they need to be
regex_pattern1 = r"apartments?"
regex_pattern2 = r"houses?"
regex_pattern3 = r"plots?"


df_aux['type'] = df_aux.title.apply(lambda x: 'apartments' if re.search(regex_pattern1, x.lower()) is not None else
                                              'houses' if re.search(regex_pattern2, x.lower()) is not None else
                                              'plots' if re.search(regex_pattern3, x.lower()) is not None else
                                              'unknown')


#This is going to be joined to generate the final_df
df_id_type = df_aux[['id', 'type']]
df_id_type.head()
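
The nested conditional expression can also be written as an ordered list of rules where the first match wins. The sketch below is one way to do that; `TYPE_RULES`, `classify` and the sample titles are hypothetical names and data introduced for illustration. Note that it adds word boundaries (`\b`), which is a deliberate change from plain substring matching.

```python
import re

# Ordered (label, pattern) rules; the first pattern that matches decides the label.
TYPE_RULES = [
    ("apartments", re.compile(r"\bapartments?\b", re.IGNORECASE)),
    ("houses", re.compile(r"\bhouses?\b", re.IGNORECASE)),
    ("plots", re.compile(r"\bplots?\b", re.IGNORECASE)),
]


def classify(title, rules=TYPE_RULES, default="unknown"):
    """Return the label of the first matching rule, or `default`."""
    if not isinstance(title, str):  # e.g. NaN or None in the title column
        return default
    for label, pattern in rules:
        if pattern.search(title):
            return label
    return default


titles = ["Luxury Apartment in Marbella", "Houses for sale", "Townhouse near the beach",
          "Plot in Andalucia", None]
labels = [classify(t) for t in titles]
print(labels)
```

With word boundaries, "Townhouse" is left as `unknown`, whereas a bare `houses?` substring search would label it `houses`. Which behaviour is preferable depends on how the property types are meant to be grouped.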

**RegEx section validation for Property type**

In [None]:
# Testing all the ways "apartments" could be written

# regex_pattern = r"ap(?:[a-z])"  # no standalone "ap" token shows up, so "ap" does not seem to be used as an abbreviation of "apartment"
  # having said that, we can try to match the words "apartment" or "apartments"

regex_pattern = r"apartments?"
list_to_search = w_df.words.str.cat(sep=', ').lower()

#I would maintain this as a np.array, but since all the words_list variables are list, I'm going to maintain the same pattern
apartments_words_list = list(np.unique(np.array(re.findall(regex_pattern, list_to_search))))
apartments_words_list

In [None]:
# Testing all the ways "houses" could be written

# patterns tested: houses?, home?, household?, homestead?, homebase?, home\sbase?
# there are other ways to refer to a house, like "residence" and so on. But since the problem statement asks only for "houses",
  # I'll stick with that for now

regex_pattern = r"houses?"
list_to_search = w_df.words.str.cat(sep=', ').lower()

#I would maintain this as a np.array, but since all the words_list variables are list, I'm going to maintain the same pattern
houses_words_list = list(np.unique(np.array(re.findall(regex_pattern, list_to_search))))
houses_words_list

In [None]:
# Testing all the ways "plots" could be written
  
regex_pattern = r"plots?"
list_to_search = w_df.words.str.cat(sep=', ').lower()

#I would maintain this as a np.array, but since all the words_list variables are list, I'm going to maintain the same pattern
plots_words_list = list(np.unique(np.array(re.findall(regex_pattern, list_to_search))))
plots_words_list

**END OF RegEx section validation for Property type**

**Investigation for Part 1B (Property Location)**

In [None]:
wdf_aux_1b = w_df.copy()
wdf_aux_1b.head()

In [None]:
# there are many locations, such as cities, beaches and so on
# beaches and cities seem to overlap, and sometimes a title gives only the city instead of the beach.
# for this analysis, to avoid a lot of complexity (given the time), I chose some names as a sample.

# the names can be found in the section below where I filter them.

# the GPE label is specific to geopolitical entities (countries, cities, ...)

nlp = en_core_web_sm.load()
_list = []

for i, _title in enumerate(df.title):
    _title = _title.strip()
    doc = nlp(_title)
    print([X.text for X in doc.ents if X.label_ == 'GPE'])
    if i == 5:
        break

In [None]:
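# .copy() avoids pandas' SettingWithCopyWarning when the 'location' column is added below
df_aux = df[['id', 'title']].copy()

# Patterns must be lowercase because they are matched against x.lower() below
regex_pattern1 = r"costa nagüeles iii"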
regex_pattern2 = r"las cañas beach"
regex_pattern3 = r"nagüeles"
regex_pattern4 = r"montepiedra"
regex_pattern5 = r"andalucia"
regex_pattern6 = r"alhambra del mar"




df_aux['location'] = df_aux.title.apply(lambda x: 'costa_nagueles_III' if re.search(regex_pattern1, x.lower()) is not None else
                                                  'las_canas_beach' if re.search(regex_pattern2, x.lower()) is not None else
                                                  'nagueles' if re.search(regex_pattern3, x.lower()) is not None else
                                                  'montepiedra' if re.search(regex_pattern4, x.lower()) is not None else
                                                  'andalucia' if re.search(regex_pattern5, x.lower()) is not None else
                                                  'alhambra_del_mar' if re.search(regex_pattern6, x.lower()) is not None else
                                                  'unknown')

df_aux.head()
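
Two details tend to matter when matching place names: case and diacritics ("Nagüeles" vs "nagueles"), and trying the most specific name first. A minimal sketch using only the standard library follows; `fold`, `find_location`, the shortened `LOCATIONS` list and the sample titles are hypothetical and only illustrate the idea.

```python
import re
import unicodedata


def fold(text):
    """Lowercase and strip diacritics, so 'Nagüeles' and 'nagueles' compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch)).lower()


# Most specific names first, so 'costa nagueles iii' wins over plain 'nagueles'.
LOCATIONS = ["costa nagueles iii", "las canas beach", "nagueles"]


def find_location(title):
    """Return the first known location found in the title, or 'unknown'."""
    folded = fold(title)
    for name in LOCATIONS:
        if re.search(rf"\b{re.escape(name)}\b", folded):
            return name.replace(" ", "_")
    return "unknown"


print(find_location("Apartment in Costa Nagüeles III, Marbella"))
print(find_location("Plot in Las Cañas Beach"))
```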

In [None]:
#Evaluating the amount of data I'm losing by selecting only this sample

#I'm working with ~53% of the data
df_aux['location'].value_counts()

In [None]:
#This is going to be joined to generate the final_df
df_id_location = df_aux[['id', 'location']]
df_id_location.head()

**Investigation for Part 1C (Property Features: pool, garage and sea view)**

In [None]:
# Filling NaN values with "unknown" so that every row holds a string
df['features'] = df['features'].fillna("unknown")


# Taking a quick look at how the data is written and which exceptions I must handle in order to extract the info I need
for i, feat in enumerate(df.features):
    print(i, "--", feat)
    if i == 5:
        break

In [None]:
df_aux = df[['id', 'features']]
df_aux.head()

**RegEx section validation for Features**

In [None]:
# Creating the columns I'll use for further analysis

wdf_aux_1c = w_df.copy()

# The membership test already returns a bool, so no "True if ... else False" is needed
wdf_aux_1c['if_pool'] = wdf_aux_1c.words.apply(lambda x: 'pool' in x.lower())
wdf_aux_1c['if_garage'] = wdf_aux_1c.words.apply(lambda x: 'garage' in x.lower())
wdf_aux_1c['if_seaView'] = wdf_aux_1c.words.apply(lambda x: 'sea' in x.lower())

df_aux = wdf_aux_1c.copy()

In [None]:
# Printing words related to 'pool' to understand if there's anything I must avoid

_df_aux = df_aux[df_aux.if_pool]
print(_df_aux)

# Analyzing the df below, the following snippet contains must-have words related to 'pool'

pool_word_list = list(_df_aux[~_df_aux.words.isin(['whirlpool'])].words)

In [None]:
# Printing words related to 'garage' to understand if there's anything I must avoid

_df_aux = df_aux[df_aux.if_garage]
print(_df_aux)

# Analyzing the df below, the following snippet contains must-have words to 'garage'

garage_word_list = list(_df_aux.words)

In [None]:
# Printing words related to 'sea view' to understand if there's anything I must avoid

_df_aux = df_aux[df_aux.if_seaView]
print(_df_aux)

# Analyzing the df below, the following snippet contains must-have words to 'sea view'

seaView_word_list = list(_df_aux[_df_aux.words.isin(['sea/lake', 'sea', 'sea/beach', 'seaside'])].words)

In [None]:
# Since 'sea view' is a compound word, I could be losing data in the way words are being counted
  # so let's try to quantify this loss

wdf_aux_1c['if_seaView_viewTest'] = wdf_aux_1c.words.apply(lambda x: True if 'view' in x.lower() else False)
wdf_aux_1c['if_seaView_seaTest'] = wdf_aux_1c.words.apply(lambda x: True if 'sea' in x.lower() else False)

df_aux = wdf_aux_1c.copy()

_df_aux = df_aux[df_aux.if_seaView_viewTest]
print("View Words DataFrame Analysis")
print(_df_aux)


_df_aux = df_aux[df_aux.if_seaView_seaTest]
print("\n\nSea Words DataFrame Analysis")
print(_df_aux)


# According to the "View Words DataFrame Analysis", even if every word preceding 'view(s)' were sea/lake or another 'sea'-related
# word, the maximum number of 'sea view(s)' I could get is 598 + 451 = 1049

# For words that make sense to be related to 'Sea view', I would choose: sea/lake, sea, sea/beach, seaside
# Having said that, I could have: 26 + 941 + 40 + 3 = 1010

# So, the loss would be: (1049-1010) / 1049 = 3.72%
# Even if every 'view(s)' word were related to a 'sea' word, I would be losing 3.72% of the data and I'm ok with that for now.
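
The estimate in the comments can be reproduced in a few lines. The counts are the ones quoted in the comments of the cell; the variable names are introduced here for illustration.

```python
# Upper bound on 'sea view(s)': the two 'view'-related token counts quoted in the comments.
view_tokens = 598 + 451

# Tokens kept as sea-related: sea/lake, sea, sea/beach, seaside.
sea_tokens = 26 + 941 + 40 + 3

# Share of the upper bound that the selected words cannot account for.
loss = (view_tokens - sea_tokens) / view_tokens
print(f"{view_tokens} possible, {sea_tokens} covered, loss = {loss:.2%}")
```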

In [None]:
#only for printing purposes and understanding, this is the whole dataframe with all the booleans tests

wdf_aux_1c.head()

**End of RegEx section validation for Features**

In [None]:
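df_aux = df[['id', 'features']].copy()

# (?<!whirl) is a negative lookbehind: 'pool' is rejected only when it is the tail of 'whirlpool'
df_aux['pool'] = df_aux.features.apply(lambda x: 1 if re.search(r"(?<!whirl)pool", x.strip().lower()) is not None else 0)
df_aux['garage'] = df_aux.features.apply(lambda x: 1 if re.search(r"garage", x.strip().lower()) is not None else 0)
# Matches the validated words 'sea', 'sea/lake', 'sea/beach' and 'seaside', but not e.g. 'seating' or 'search'
df_aux['sea_view'] = df_aux.features.apply(lambda x: 1 if re.search(r"\bsea(?:side)?\b", x.strip().lower()) is not None else 0)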

df_aux.head()
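
Lookarounds are easy to get subtly wrong, so it can help to check candidate patterns against a few hand-written strings before applying them to the whole column. The sketch below does that; the patterns are one possible formulation, and `FEATURE_PATTERNS`, `flag_features` and the sample strings are made up for illustration.

```python
import re

FEATURE_PATTERNS = {
    # Negative lookbehind: 'pool' must not be preceded by 'whirl'.
    "pool": re.compile(r"(?<!whirl)pool"),
    "garage": re.compile(r"garage"),
    # 'sea' or 'seaside' as a whole word; 'sea/lake' matches because '/' ends the word.
    "sea_view": re.compile(r"\bsea(?:side)?\b"),
}


def flag_features(text):
    """Return a {feature: 0/1} dict for one features string."""
    lowered = text.strip().lower()
    return {name: int(bool(pattern.search(lowered))) for name, pattern in FEATURE_PATTERNS.items()}


samples = [
    "Swimming pool, Garage, Sea view",
    "Whirlpool, Seating area",
    "Sea/Lake views, pools",
]
results = [flag_features(s) for s in samples]
for sample, flags in zip(samples, results):
    print(sample, "->", flags)
```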

In [None]:
#This is going to be joined to generate the final_df

df_id_features = df_aux[['id', 'pool', 'garage', 'sea_view']]
df_id_features.head()

In [None]:
# Deliverable Part 1

#Since the final_df gets its data directly from the original df, I want to keep every row from it,
  #that's why I'm doing a left join on the other dfs.

final_df = df[['id', 'title', 'features']]
final_df = final_df.merge(df_id_type, how='left', on='id')
final_df = final_df.merge(df_id_features, how='left', on='id')
final_df = final_df.merge(df_id_location, how='left', on='id')

final_df = final_df[['id', 'location', 'type', 'title', 'features', 'pool', 'sea_view', 'garage']]
final_df.head()


'''
** Comments

 All this analysis requires a much more detailed approach on which location to consider: city, beach, hotels name? I did a simplified one
    by getting only a sample; If we decide to go with more granular data, like beach, we must understand the overlapping between beaches, and
    a good approach for that is cascading geographical data upstream (country > city > beach > hotels).
 
 Moreover, some compound words like 'sea view' are not 100% matched since 'sea' counts as one word separately from 'view',
    and therefore I can't ensure it's totally correct.
 
 The way I created the dfs to be joined is actually not the best one: it works here because there are only a few variables, but imagine
    a scenario where there are 100 variables -- I would have to write 100 lines one by one.
 
 All in all, given the scenario, I would stick with this approach to simplify the things, and because I think it gives a good
    view of my coding skills.
'''

In [None]:
# Creating the csv

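# index=False keeps the pandas row index out of the file, so the columns match the requested schema
final_df.to_csv('df_part1.csv', sep=',', index=False)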