<a href="https://www.kaggle.com/code/mayanklad/comparison-house-description-vs-characteristics?scriptVersionId=143690854" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

## **Introduction:** Determine whether a house's description accurately matches its listed characteristics
- This case study demonstrates the use of Python tools for data scraping, data cleaning, data wrangling, data visualization and modeling.
- It uses the Beautiful Soup library to scrape real estate listings for Rome from the Italian real estate website immobiliare.it.
- It clusters the listings twice: once on the text of each listing's description (the description cluster or TF-IDF cluster) and once on five listed attributes ('price', 'rooms', 'surface', 'bathrooms', 'floor') (the feature cluster), and then compares the two clusterings.
- The comparison indicates how well a listing's description reflects its actual attributes.

## Importing Libraries

In [None]:
# Data
import numpy as np
import pandas as pd
import lxml
from bs4 import BeautifulSoup


# Visualization
import seaborn as sns
import matplotlib.pyplot as plt
import missingno as msno
from ydata_profiling import ProfileReport


# Data Preprocessing
import re
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

import nltk
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler


# Model
from sklearn.cluster import KMeans


# Misc
import string
import math
from os import path
import warnings
import requests
from collections import defaultdict
from time import sleep
from tqdm.auto import tqdm

In [None]:
nltk.download('all')

In [None]:
warnings.filterwarnings('ignore')

## Scraping the Data


### About the Data Source
- The dataset was obtained by scraping an Italian real estate website (https://www.immobiliare.it/en) with the Beautiful Soup library.
- It contains real estate listings from the city of Rome.
- Scraping produced 1665 listings, so the raw dataset has 1665 rows and 7 columns.




In [None]:
mainpage = 'https://www.immobiliare.it/en'

First, we define a function that scrapes the features of a single listing from www.immobiliare.it; the listings are saved to a CSV file in a later step.

In [None]:
def get_listing(listing):
    '''
    Returns the features of a listing.

    Parameters:
        listing (bs4.element.Tag):The object corresponds to an XML or HTML tag of listing.

    Returns:
        list:The list of features of the listing.   
    '''
    
    title = listing.find('a', class_='in-card__title').get('title')
    link = listing.find('a', class_='in-card__title').get('href')
        
    rooms, surface, bathrooms, floor = 'NA','NA','NA','NA'

    if not link.startswith('https://www.immobiliare.it/en'):
        link = mainpage + link
        
    sub_content = requests.get(link)
    sub_soup = BeautifulSoup(sub_content.text, "html.parser")
    
    infos = sub_soup.find('ul', class_='nd-list nd-list--pipe in-feat in-feat--full in-feat__mainProperty in-landingDetail__mainFeatures')
    
    price = infos.find('li', class_='nd-list__item in-feat__item in-feat__item--main in-detail__mainFeaturesPrice')    
    price = price.getText()    

    sub_infos = infos.find_all('div', class_='in-feat__data')
    
    rooms = sub_infos[0].getText()
    surface = sub_infos[1].getText()
    bathrooms = sub_infos[2].getText()
    floor = sub_infos[3].getText()

    description = sub_soup.find('div', class_='in-readAll').div.getText()

    features = [title, price, rooms, surface, bathrooms, floor, description]
    
    sleep(1) # Pause for a second between requests so we don't overload the server
    
    return features


Then a second function iterates over the search-result pages, calls `get_listing` on each listing to collect its seven features, and saves all the listings to the “raw_data.csv” file.

In [None]:
def scraping_function():
    '''
    Function to scrape the data and save it as a csv file.

    Returns:
        pandas.DataFrame:DataFrame object consisting of the scraped listings.   
    '''
    counter=0
    listings = []
    columns = ['title', 'price', 'rooms', 'surface', 'bathrooms', 'floor','description']

    # There are 80 pages in search results
    for page_num in tqdm(range(1, 81)):

        # requesting for the html page
        web_page = requests.get(f'https://www.immobiliare.it/en/vendita-case/roma/?criterio=rilevanza&pag={page_num}')
        
        # soupifying
        soup = BeautifulSoup(web_page.text, 'lxml')

        # find all the tags li with class 'nd-list__item in-realEstateResults__item' which are individual listings
        listings_html = soup.find_all('li', class_ = 'nd-list__item in-realEstateResults__item')

        for listing in listings_html:
            try:
                features = get_listing(listing)
                listings.append(features)
                counter += 1
                
            except Exception:
                # Error in scraping the listing; skip it and move on to the next one
                pass

            if counter == 10000: # Save the data to csv if fetched 10000 listings and stop the scraping
                df = pd.DataFrame(listings, columns=columns)
                df.to_csv('raw_data.csv', index=False)
                
                return df
    
    df = pd.DataFrame(listings, columns=columns)
    df.to_csv('raw_data.csv', index=False)
    
    print(f'{counter} listings were scraped successfully!')
    
    return df
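
The loop above silently skips any listing that fails to parse. The following is a minimal sketch of one way to keep a record of those failures, assuming the typical errors are a missing tag (`AttributeError`) or a missing feature field (`IndexError`). The names `scrape_all` and `fake_parse` are hypothetical; `fake_parse` stands in for `get_listing` so that the sketch runs without network access.

```python
def scrape_all(items, parse):
    """Apply `parse` to each item, collecting results and failures separately."""
    rows, failures = [], []
    for item in items:
        try:
            rows.append(parse(item))
        except (AttributeError, IndexError) as exc:
            # AttributeError: an expected tag was not found
            # IndexError: fewer feature fields than expected
            failures.append((item, type(exc).__name__))
    return rows, failures


def fake_parse(item):
    # Hypothetical stand-in for get_listing: fails on items without a price.
    if "price" not in item:
        raise AttributeError("price tag not found")
    return [item["title"], item["price"]]


rows, failures = scrape_all(
    [{"title": "A", "price": "€ 100,000"}, {"title": "B"}],
    fake_parse,
)
print(len(rows), "parsed;", len(failures), "failed")
```

Keeping the failed items makes it possible to inspect why a listing was dropped instead of only seeing a lower final count.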


In [None]:
if not path.exists('raw_data.csv') and not path.exists('/kaggle/input/rome-real-estate-listings/raw_data.csv'):
    df = scraping_function()
else:
    print('Data already scraped! Use the stored CSV file.')

## Data Loading

Load the scraped data from the saved CSV file and assign the column names.

In [None]:
columns = ['title', 'price', 'rooms', 'surface', 'bathrooms', 'floor','description']
if path.exists('raw_data.csv'):
    df = pd.read_csv('raw_data.csv', header=None, skiprows=1, names=columns)
else:
    df = pd.read_csv('/kaggle/input/rome-real-estate-listings/raw_data.csv', header=None, skiprows=1, names=columns)

df

Next, we check for missing values.

In [None]:
df.isna().sum()

There are missing values in the "floor" column. 

## Data Wrangling

### Initial Missing values handling

#### Plot to show the non-null values count

In [None]:
fig = msno.bar(df, color=(233/255, 114/255, 77/255))

#### Dropping the null values

In [None]:
df.dropna(inplace=True)

### Preprocessing the `price` column

In [None]:
df[df.price.str.contains('[a-zA-Z]') == True].head()

As we can see, some listings have **Price on application** as their **price** value; these need to be handled.

#### Removing listings with 'Price on application' as 'price'

In [None]:
df = df[df.price.str.contains('[a-zA-Z]') == False]

#### Transforming and formatting the price

Since the price is a string in the format **€ 100,000**, we convert it to a float such as 100000.0. When a price range is given, we take the average of its two ends.

In [None]:
def price_prep(price):
    '''
    Function to preprocess price feature.
    
    Parameters:
        price (str):The string containing the price.


    Returns:
        float:Price value in float   
    '''
    # removing punctuation and symbols
    price = price.replace(',','')
    price = price.replace('€','')
    
    price = price.strip()
    price = price.split('-')    # if price is in range eg. 80000 - 100000 then we split it into two.
    
    if len(price) == 1: # This represents no price range i.e. it is single value.
        return float(price[0])
    
    else: # If we have two values i.e. price range then compute their average
        min_price = float(price[0])
        max_price = float(price[1])
        return (min_price + max_price) / 2

In [None]:
df.price = df.price.map(price_prep)
df.head()

### Preprocessing the `rooms` column

For the values in the “rooms” column, the ‘+’ is removed where present and, where a range such as ‘1 - 5’ is shown, the last value is taken.

In [None]:
def rooms_prep(rooms):
    '''
    Function to preprocess rooms feature.
    
    Parameters:
        rooms (str):The string containing the rooms.


    Returns:
        int:Rooms value in int   
    '''
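    rooms = rooms.strip()
    rooms = rooms.replace('+','') # If the value is in the form 1+ then remove the +
    
    # Usually there's only one value. In case we have a range, for instance '1 - 5', we pick the last value.
    # Splitting on '-' (instead of taking the last character) keeps multi-digit values such as '10' intact.
    return int(rooms.split('-')[-1].strip())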

In [None]:
df.rooms = df.rooms.map(rooms_prep)
df.head()

### Preprocessing the `surface` column 

This column holds the surface area, and its values include the unit m². A regular expression strips every non-digit character so that only the numeric value remains.

In [None]:
# removing m²
df.surface = df.surface.map(lambda x: float(re.sub(r'\D', '', x)))
df.head(5)

### Preprocessing the `bathrooms` column

Some values in this column contain a “+”, which is removed by replacing it with an empty string.


In [None]:
df.bathrooms = df.bathrooms.map(lambda x: int(x.strip().replace('+','')))
df

### Preprocessing the `floor` column

Values come in forms such as 1, 1+, G, 1 - 5 and B - G. A function converts them to numbers; for a range, the last (higher) value is used. ‘A’ (penthouse) and ‘M’ (middle floor) are replaced with NaN because the exact floor is unknown, ‘G’ (ground floor) becomes 0, ‘R’ (raised floor) becomes 0.5, and ‘B’ (basement) or ‘SB’ (semi-basement) becomes -1.


In [None]:
def floor_prep(floor):
    '''
    Function to preprocess floor feature.
    
    Parameters:
        floor (str):The string containing the floor.


    Returns:
        float:Floor value as a float (NaN if the exact floor is unknown)
    '''
    
    floor = str(floor)
    floor = floor.strip()
    floor = floor.split('-')[-1] # if range use the last value (higher value)
    floor = floor.replace('+','')
    floor = floor.strip()
    
    if 'G' in floor: # ground floor
        return 0
    elif 'A' in floor or 'M' in floor: # penthouse or middle floor so dont know the exact floor hence will ignore this value
        return np.nan
    elif 'R' in floor: # raised floor
        return 0.5
    elif 'B' in floor or 'SB' in floor: # basement or semi-basement
        return -1
    else:
        return float(floor)

In [None]:
df.floor = df.floor.map(floor_prep)
df
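
The letter codes described above can also be expressed as a lookup table. This is a minimal sketch, not the notebook's implementation: `FLOOR_CODES` and `parse_floor` are hypothetical names, and the sketch matches each code exactly, whereas `floor_prep` checks whether the letter appears anywhere in the string.

```python
import math

# Letter codes and the numeric value assigned to each (taken from the text above).
FLOOR_CODES = {
    "G": 0.0,        # ground floor
    "R": 0.5,        # raised floor
    "B": -1.0,       # basement
    "SB": -1.0,      # semi-basement
    "A": math.nan,   # penthouse: exact floor unknown
    "M": math.nan,   # middle floor: exact floor unknown
}


def parse_floor(raw):
    """Return the floor as a float, using the last value of a range."""
    last = str(raw).split("-")[-1].replace("+", "").strip()
    if last in FLOOR_CODES:
        return FLOOR_CODES[last]
    return float(last)


for raw in ["3", "1+", "G", "1 - 5", "B - G", "SB", "A"]:
    print(repr(raw), "->", parse_floor(raw))
```

A table keeps the code-to-value decisions in one place, which makes it easier to review or change a single mapping.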

### Preprocessing the `description` column 

Steps for preprocessing **description**:
* lowercasing the text
* removal of punctuation
* removal of stopwords
* lemmatization of text

In [None]:
def lemmatize_words(text):
    '''
    Function to lemmatize text based on part of speech.
    
    Parameters:
        text (str):The string containing the text.


    Returns:
        str:Lemmatized text
    '''
    
    lemmatizer = WordNetLemmatizer()
    wordnet_map = {"N" : wordnet.NOUN, "V" : wordnet.VERB, "J" : wordnet.ADJ, "R" : wordnet.ADV}
    
    # word tokenization
    word_tokens = nltk.word_tokenize(text)
    
    # part-of-speech tagger to tag the word tokens
    pos_tagged_text = nltk.pos_tag(word_tokens)
    
    # lemmatization based on the part of speech (defaults to noun for unmapped tags)
    return " ".join([lemmatizer.lemmatize(word, wordnet_map.get(pos[0], wordnet.NOUN)) for word, pos in pos_tagged_text])

In [None]:
def description_prep(df):
    '''
    Function to preprocess description feature.
    
    Parameters:
        df (pandas.DataFrame):The dataset.


    Returns:
        pandas.DataFrame:The dataset containing new feature 'description_lem' for preprocessed and lemmatized description
    '''
    
    # Lowercasing all the words
    df['description_lem'] = df.description.str.lower()
    
    # Removal of punctuations
    PUNCT_TO_REMOVE = string.punctuation + '“”–’°•€'
    df['description_lem'] = df['description_lem'].apply(lambda text: text.translate(str.maketrans('', '', PUNCT_TO_REMOVE)))
    
    # Removal of stopwords
    STOPWORDS = set(stopwords.words('english'))
    df['description_lem'] = df['description_lem'].apply(lambda text: " ".join([word for word in str(text).split() if word not in STOPWORDS]))
    
    # Lemmatization
    df['description_lem'] = df['description_lem'].apply(lambda text: lemmatize_words(text))
    
    return df

In [None]:
# Code to fix NLTK not finding wordnet in Kaggle notebooks
import subprocess

# Download and unzip wordnet
try:
    nltk.data.find('wordnet.zip')
except:
    nltk.download('wordnet', download_dir='/kaggle/working/')
    command = "unzip /kaggle/working/corpora/wordnet.zip -d /kaggle/working/corpora"
    subprocess.run(command.split())
    nltk.data.path.append('/kaggle/working/')

In [None]:
df = description_prep(df)
df.head()
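
To make the first three preprocessing steps concrete, here is a minimal sketch that applies them to a single sentence using only the standard library. The stopword set `TOY_STOPWORDS` is a tiny illustrative assumption (the notebook uses NLTK's English list), `clean_text` is a hypothetical name, and lemmatization is left out because it needs the NLTK corpora.

```python
import string

# Tiny illustrative stopword list (an assumption, not NLTK's full English list).
TOY_STOPWORDS = {"the", "a", "is", "in", "of", "and", "with"}


def clean_text(text):
    """Lowercase the text, strip ASCII punctuation, and drop stopwords."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(word for word in text.split() if word not in TOY_STOPWORDS)


print(clean_text("The apartment is in the heart of Rome, with a large terrace!"))
# prints: apartment heart rome large terrace
```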

### Final Missing Values Handling

In [None]:
df.dropna(inplace=True)

## Data Visualization & Plotting

### Surface vs Price

In [None]:
sns.scatterplot(data=df, x='surface', y='price')
plt.title('Surface vs Price')
plt.show()

It's clear from the scatter plot that the listing price tends to increase with the surface area.

### Average Price based on No. of bathrooms

In [None]:
mean_price_by_baths = df.groupby('bathrooms').price.mean()

sns.barplot(x=mean_price_by_baths.index, y=mean_price_by_baths)
plt.title('Average Price based on No. of bathrooms')
plt.show()

The bar graph shows that the average price increases with the number of bathrooms; the average price of listings with 3 bathrooms is more than double that of listings with 2 bathrooms.

### Average Price based on No. of rooms

In [None]:
mean_price_by_rooms = df.groupby('rooms').price.mean()

sns.barplot(x=mean_price_by_rooms.index, y=mean_price_by_rooms)
plt.title('Average Price based on No. of rooms')
plt.show()

As the graph shows, the average price increases with the number of rooms; listings with 5 rooms cost on average approximately twice as much as listings with 4 rooms.

### Average Price based on Floor

In [None]:
mean_price_by_floor = df.groupby('floor').price.mean()

sns.barplot(x=mean_price_by_floor.index, y=mean_price_by_floor)
plt.title('Average Price based on Floor')
plt.show()

From the above graph we can see that, on average:
* Listings on floors 3 and 4 have the highest price
* Listings on higher floors and basements have the lowest price

## Pandas Profiling

**Note:** Uncomment the cells in this section if a profiling report is required.

#### Generating the Profile Report

In [None]:
# profile = ProfileReport(df, title="Profiling Report Post Data Wrangling")

#### Saving the report as HTML

In [None]:
# profile.to_file("profiling-report.html")

## Outlier Detection

We performed outlier detection using boxplots, which flag outliers based on the interquartile range (IQR). Below are the boxplots for the continuous features:


### Boxplot for Price

In [None]:
sns.boxplot(data=df, x='price')

### Boxplot for Surface

In [None]:
sns.boxplot(data=df, x='surface')

### Boxplot for Floor

In [None]:
sns.boxplot(data=df, x='floor')
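
The IQR rule behind these boxplots can be sketched in a few lines. This is a minimal illustration under stated assumptions: it uses the conventional 1.5 × IQR fences, the prices are made up, and `iqr_fences` is a hypothetical helper rather than part of the notebook.

```python
import statistics


def iqr_fences(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Made-up prices (in thousands of euros) with one very expensive listing.
prices = [100, 120, 130, 150, 160, 180, 200, 950]

low, high = iqr_fences(prices)
outliers = [p for p in prices if p < low or p > high]
print(f"fences: ({low}, {high}); outliers: {outliers}")
```

Points outside the fences are the ones a boxplot draws individually; whether they are errors or genuine listings is a separate, domain-level question.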

### Why did we not remove outliers?
- Domain expertise can sometimes be used in place of statistical methods to detect and manage outliers. In real estate datasets, outliers in the “price” variable represent legitimate high-priced properties and should not be eliminated.
- Note that K-means (with or without K-means++ initialization) is not intrinsically robust to outliers, because each centroid is the mean of its cluster and can be pulled by extreme values. We accept this effect here because the outliers are genuine listings rather than errors.
- Our dataset is small (only around 1600 rows), so removing outliers could substantially reduce the data available for analysis and might produce skewed or incorrect conclusions. Hence, we decided not to remove any outliers.





## Feature Engineering

To reach our goal, we need to perform a clustering analysis on the house's description (called the **description cluster or TF-IDF cluster**) and on five other attributes of the house listing ('price', 'rooms', 'surface', 'bathrooms', 'floor'), called the **feature cluster**. We therefore divide the dataset into two data frames:
1. **Feature data frame:** the combination of five attributes- price, rooms, surface, bathrooms and floor
2. **TFIDF data frame:** house’s description_lem column (the one obtained after cleaning the description feature)


### Data Frame with price, rooms, surface, bathrooms and floor

In [None]:
df_features = df[['price', 'rooms', 'surface', 'bathrooms', 'floor']]
df_features.head()

### TFIDF Data Frame from description

In [None]:
tfidf = TfidfVectorizer()

X_tfidf = tfidf.fit_transform(df['description_lem'])

In [None]:
df_tfidf = pd.DataFrame(
    data=X_tfidf.toarray(),
    columns=tfidf.get_feature_names_out()
)

In [None]:
df_tfidf.shape
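
To show what the TF-IDF matrix contains, here is a minimal standard-library sketch of the weighting on a toy corpus. It assumes the smoothed formula that `TfidfVectorizer` uses by default, idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization of each row; the function name `tfidf` and the three documents are made up for illustration.

```python
import math
from collections import Counter

# Toy corpus standing in for the cleaned descriptions.
docs = ["bright apartment terrace", "apartment garage", "bright villa garden terrace"]


def tfidf(docs):
    """Return (vocabulary, rows) where each row is an L2-normalized TF-IDF vector."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({word for doc in tokenized for word in doc})
    n_docs = len(docs)
    # Number of documents that contain each word.
    doc_freq = Counter(word for doc in tokenized for word in set(doc))
    idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}

    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        weights = [counts[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(x * x for x in weights))
        rows.append([x / norm for x in weights])
    return vocab, rows


vocab, rows = tfidf(docs)
print(vocab)
print([round(x, 3) for x in rows[1]])
```

In the second document, "garage" receives a higher weight than "apartment" because it occurs in fewer documents, which is the behaviour the clustering relies on.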

#### Visualizing the word occurrences

In [None]:
def count_plot_words_occurrences(df, figsize=(15,8), xticks_start=0, xticks_end=None):
    """
    For each word (column) in the TF-IDF dataframe, count the listings in which the word occurs and plot the counts.
    
    Parameters:
        df (pandas.DataFrame): TFIDF dataset
        figsize: Dimension of the plot
        xticks_start: First word ID on the x axis (default value: 0)
        xticks_end:   One past the last word ID on the x axis (default value: number of columns)
    """
         
    words_counting = []

    # put NaN values if there's 0
    df = df.where(df != 0)
 
    for i in df:
        cnt_word = df.loc[:, i].count()
        words_counting.append(int(cnt_word))
        
    # if there is no input about xticks_end the assign default value i.e. total number of columns
    if xticks_end is None:
        xticks_end = len(df.iloc[0])

    plt.figure(figsize=figsize)
 
    x = range(xticks_start, xticks_end)

    plt.plot(x, words_counting, 'ro')
    
    plt.xlabel('Word_ID', size = 15)
    plt.ylabel('Listings containing the word_ID', size = 10)
    plt.title('Distribution of the words over the listings', size=12)
    
    plt.grid(linestyle='--', color='lightgray', zorder = 0)    

    plt.show()
    
    return

In [None]:
df_temp = pd.DataFrame(
    data=X_tfidf.toarray())

#### Looking at the occurrences of the first 1000 words across all listings

In [None]:
count_plot_words_occurrences(df_temp.loc[:,0:1000], xticks_start=0, xticks_end=1001)

#### Looking at the occurrence of the last 1000 words of the document 

In [None]:
count_plot_words_occurrences(df_temp.loc[:,11000:], xticks_start=11000, xticks_end=len(df_temp.iloc[0]))

As we can see from the above two plots, many words appear only once across all the documents.

## Normalization

None of the columns used for the clustering analysis are categorical (title and description are not clustered directly), so the only scaling step is scikit-learn's **StandardScaler()**, which standardizes the five numerical feature columns ('price', 'rooms', 'surface', 'bathrooms', 'floor') to zero mean and unit variance.


In [None]:
scaler = StandardScaler()
data_transformed = scaler.fit_transform(df_features)

In [None]:
df_features_norm = pd.DataFrame(data=data_transformed, columns=['price', 'rooms', 'surface', 'bathrooms', 'floor'])
df_features_norm.head()
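
What the scaler does to each column can be sketched with the standard library: subtract the column mean and divide by the population standard deviation. The surface values are made up and `standardize` is a hypothetical helper.

```python
import statistics


def standardize(column):
    """Return z-scores: (x - mean) / population standard deviation."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)  # population std (divides by n), as StandardScaler does
    return [(x - mean) / std for x in column]


surface = [50.0, 80.0, 110.0, 160.0]  # made-up surface areas in m²
z = standardize(surface)
print([round(v, 3) for v in z])
```

After this step every feature has mean 0 and standard deviation 1, so a large-valued column such as price no longer dominates the Euclidean distances used by K-means.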

## Model Building (using K-Means++ Clustering)

### Function for elbow method

In [None]:
def elbow_method (data, max_clusters=10, TFIDF = False, figsize=(12,8)):
    
    n_listings = len(data)
    n_features = len(data.iloc[0])
    
    plot_labels = [
        ['Number of clusters (k)', 'Inertia (sum of squared distance)', 'Elbow-Method for features_dataframe'],
        ['Number of clusters (k)', 'Inertia (sum of squared distance)', f'Elbow Method for {n_listings} listings based on {n_features} features']
    ]
    
    labels_idx = 0
    
    if TFIDF == True:
        labels_idx = 1
       
                                                               
    ssd = dict()
        
    for k in range(2, max_clusters+1):
        model = KMeans(n_clusters=k, init='k-means++')
        model = model.fit(data)
        ssd[k] = model.inertia_
        
        
    # plotting the elbow method
    
    fig = plt.figure(figsize=figsize)
    x = list(ssd.keys())
    y = list(ssd.values())
    
    plt.plot(x, y, 'bx-')
    plt.xlabel(plot_labels[labels_idx][0], size=15)
    plt.ylabel(plot_labels[labels_idx][1], size=13)
    plt.title(plot_labels[labels_idx][2], size=15)
    plt.grid(linestyle='--', linewidth=2, color='lightgray', zorder = 0)    
    
    if TFIDF is False:
        plt.xticks(np.arange(2, max_clusters+1))
    
    plt.show()
    
    return 

### K-Means with features data frame

#### Applying elbow method

In [None]:
elbow_method(df_features_norm)

Based on the above graph we select **k=5** as the most appropriate number of clusters.
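
The quantity plotted by the elbow method, inertia, is the sum of squared distances between each point and the centroid of its assigned cluster. A minimal sketch on four made-up 2-D points shows how it drops when a second cluster is added; the function name `inertia` and the data are illustrative assumptions.

```python
def inertia(points, labels, centroids):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    total = 0.0
    for point, label in zip(points, labels):
        centre = centroids[label]
        total += sum((p - c) ** 2 for p, c in zip(point, centre))
    return total


# Two well-separated groups of two points each.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

one_cluster = inertia(points, [0, 0, 0, 0], [(5.0, 1.0)])
two_clusters = inertia(points, [0, 0, 1, 1], [(0.0, 1.0), (10.0, 1.0)])
print(one_cluster, two_clusters)  # prints: 104.0 4.0
```

Inertia always decreases as k grows, so the elbow method looks for the k after which the decrease becomes small rather than for a minimum.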


#### Model creation

Then, we applied the KMeans clustering algorithm with 5 clusters to the features dataframe and assigned each data point to its corresponding cluster. The resulting cluster labels are added as a new column “features_cluster” to the original dataframe.

In [None]:
model = KMeans(n_clusters=5, random_state=1234, init='k-means++')
model.fit(data_transformed)

clusters = model.predict(data_transformed)

In [None]:
df['features_cluster'] = clusters
df.head()

In [None]:
data = df.groupby('features_cluster').price.count()

sns.barplot(x=data.index, y=data)
plt.title('Count of data in each feature clusters')
plt.ylabel('Count')
plt.show()

The above bar plot shows the count of data points in each feature cluster that helps to visualize the distribution of data points across the different feature clusters.


### K-Means with TFIDF

#### Applying elbow method

In [None]:
elbow_method(df_tfidf, TFIDF=True)

#### Model creation

Since our goal is to compare the two clusterings, we fix the **number of clusters (k) to 5**, the value obtained from the **features data frame**.

Similarly, we fit the K-means++ model on the TF-IDF dataframe (containing the house descriptions), assign each data point a “TFIDF_cluster” label, and plot the number of data points in each cluster.


In [None]:
model = KMeans(n_clusters=5, random_state=1234, init='k-means++')
model.fit(df_tfidf)

clusters = model.predict(df_tfidf)

df['TFIDF_cluster'] = clusters
df.head()

In [None]:
data = df.groupby('TFIDF_cluster').price.count()

sns.barplot(x=data.index, y=data)
plt.title('Count of data in each TFIDF clusters')
plt.ylabel('Count')
plt.show()

## Jaccard similarity between the two sets of clusters

We will only consider the two cluster columns (plus **description**, which is used for the word cloud later):
* **features_cluster**
* **TFIDF_cluster**

And create a dataframe named **df_j** for the Jaccard-similarity.

In [None]:
df_j = df[['description', 'features_cluster', 'TFIDF_cluster']]

df_j.head()

### Grouping the data w.r.t. clusters

We now create two dictionaries to represent the two cluster groups.

Each dictionary uses the cluster number as its key and the list of positional indices of the listings in that cluster as its value.


Example:

| Key (cluster number) | Value (list of listing indices) |
|----------------------|---------------------------------|
| 0                    | [1, 7, 35, 74]                  |
| 1                    | [6, 11, 100]                    |
| 2                    | [3, 128, 153]                   |

In [None]:
features_clusters = defaultdict(list)
TFIDF_clusters = defaultdict(list)

for i in range(len(df_j)):
    k1 = df_j.iloc[i]['features_cluster']
    k2 = df_j.iloc[i]['TFIDF_cluster']
    
    features_clusters[k1].append(i)
    TFIDF_clusters[k2].append(i)

In [None]:
def jaccard_similarity(set1, set2):
    '''
    Function to calculate the Jaccard-similarity score on two sets.
    
    Parameters:
        set1 (set):Data in form of set
        set2 (set):Data in form of set

    Returns:
        float: Jaccard-similarity score
    '''
    
    # convert both inputs to sets so that set operations can be applied
    set1 = set(set1)
    set2 = set(set2)
    
    intersection = set1.intersection(set2)
    union = set1.union(set2)
    jaccard_similarity = len(intersection) / float(len(union))
    
    return jaccard_similarity
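
As a worked example, take a feature cluster containing listings [1, 7, 35, 74] and a TF-IDF cluster containing [7, 35, 100] (made-up values). They share 2 listings out of 5 distinct ones, so the score is 2 / 5 = 0.4. The sketch below uses a hypothetical helper `jaccard`; returning 0.0 when both inputs are empty is a convention chosen here to avoid division by zero.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections of indices."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumed convention for two empty clusters
    return len(a & b) / len(union)


print(jaccard([1, 7, 35, 74], [7, 35, 100]))  # prints: 0.4
print(jaccard([1, 2, 3], [1, 2, 3]))          # prints: 1.0
print(jaccard([1, 2], [3, 4]))                # prints: 0.0
```

A score of 1 means the two clusters contain exactly the same listings, and a score of 0 means they share none.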

In [None]:
def calculate_jaccard_similarities(features_clusters, TFIDF_clusters):
    '''
    Function to calculate the Jaccard-similarities on pairs of feature clusters and TFIDF clusters.
    
    Parameters:
        features_clusters (dict): Grouped data based on feature clusters
        TFIDF_clusters (dict): Grouped data based on TFIDF clusters

    Returns:
        list: List of tuples consisting of cluster pairs and their Jaccard-similarity score
    '''
    
    jaccard_scores_list = []
    
    for cl1 in features_clusters.keys():
        for cl2 in TFIDF_clusters.keys():
            
            j_score = jaccard_similarity(features_clusters[cl1], TFIDF_clusters[cl2])
            
            jaccard_scores_list.append(tuple([(cl1, cl2), j_score]))
    
    jaccard_scores_list.sort(key = lambda x: x[1], reverse=True) # Sorting in descending order based on jaccard-similarity score
    
    return jaccard_scores_list

### Calculating the Jaccard-similarities

In [None]:
jaccard_similarities = calculate_jaccard_similarities(features_clusters, TFIDF_clusters)

In [None]:
df_j_score = pd.DataFrame(data=jaccard_similarities, columns=['Cluster pair', 'Jaccard-similarity score'])

plt.figure(figsize=(10, 8))
sns.barplot(data=df_j_score, y='Cluster pair', x='Jaccard-similarity score', orient='h')
plt.title('Jaccard-similarity scores for cluster pairs (feature cluster, TFIDF cluster)')
plt.show()

The above figure shows the Jaccard Similarity score for each cluster pair (feature cluster, TFIDF cluster) in decreasing order.

**Since all the Jaccard-similarity scores are approximately 0.2 or less, we conclude that the descriptions provided by the owners do not fully match the specifications of the listings.**

## Wordcloud of best pair based on Jaccard-similarity

Finally, we made a word cloud from the house description of the best pair based on the highest Jaccard similarity score as shown below.


In [None]:
def doWordcloud(text):
    '''
    Function to create word cloud.
    
    Parameters:
        text (str): Text document or corpus
    '''
    # Create stopword list:
    stopwords = set(STOPWORDS)

    # Use the house-shaped mask if it exists locally or in the Kaggle input folder
    mask_paths = ['house_mask.png', '/kaggle/input/house_mask.png']
    mask_path = next((p for p in mask_paths if path.exists(p)), None)

    if mask_path is not None:
        house_mask = np.array(Image.open(mask_path))
    
        wc = WordCloud(background_color="white", max_words=500, mask=house_mask,
                       stopwords=stopwords, contour_width=4, contour_color='firebrick')
    else:
        wc = WordCloud(background_color="white", max_words=500, stopwords=stopwords)
    
    # Generate a wordcloud
    wc.generate(text)

    plt.figure(figsize=[20,10])
    plt.imshow(wc, interpolation='bilinear')
    plt.axis("off")
    plt.show()
    
    return

In [None]:
def setWordcloud (df_j, list_cluster1, list_cluster2):
    '''
    Function to create word cloud of pair of clusters given.
    
    Parameters:
        df_j (pd.DataFrame): Jaccard dataframe
        list_cluster1 (list): List of data values in the cluster
        list_cluster2 (list): List of data values in the cluster
    '''
    
    listing = (list_cluster1 + list_cluster2)

    # Join with a space so that words at the boundary of two descriptions don't merge
    description = ' '.join(str(df_j.iloc[i]['description']) for i in listing)
    
    doWordcloud(description)
    
    return

### Plotting the wordcloud

In [None]:
setWordcloud(df_j, features_clusters[1], TFIDF_clusters[2])
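
The cluster pair in the call above is written out by hand. As a minimal sketch, the best pair could instead be read from the score list, assuming it has the structure returned by `calculate_jaccard_similarities`, that is, a list of `((feature_cluster, TFIDF_cluster), score)` tuples. The scores below are made up for illustration.

```python
# Made-up scores in the format [((feature_cluster, TFIDF_cluster), score), ...]
scores = [((0, 4), 0.18), ((1, 2), 0.21), ((3, 0), 0.05)]

# Pick the pair with the highest Jaccard-similarity score.
best_pair, best_score = max(scores, key=lambda item: item[1])
feature_id, tfidf_id = best_pair

print(f"best pair: feature cluster {feature_id}, TFIDF cluster {tfidf_id} (score {best_score})")
```

Reading the pair from the scores keeps the word cloud consistent with the results if the data or the clustering changes.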

## Conclusion
Through this assignment, we learned and implemented the basic steps of a machine learning workflow, starting with scraping data from a real estate website and saving the scraped house listing details to a CSV file. Then, we moved on to the data wrangling process. We learned how to clean numerical and text columns using regular expressions and NLP techniques such as lemmatization and the removal of punctuation and stopwords.

We applied data visualization techniques to infer relationships and trends, for instance, between the average price of a house and its other features. A pandas profiling report can also be generated for the dataset and saved as an HTML file. After normalizing the dataset and applying the elbow method, we determined that 5 clusters is a reasonable choice and then applied the K-Means++ clustering algorithm to derive the clusters for the two clustering groups: the feature cluster and the TFIDF cluster. Finally, we calculated the Jaccard similarity score for each cluster pair, and all of the scores were approximately 0.2 or less. Hence, the descriptions provided by the owners do not fully match the other specifications of the listings, such as price, bathrooms and floor.


