### Autoreload Extension Activation

The code below activates the autoreload extension in the Jupyter notebook. The extension reloads modules automatically before executing user code, so changes made to a module take effect without restarting the kernel. Mode `2` reloads all imported modules (except those excluded with `%aimport`) every time before the typed Python code is executed.

In [None]:
%load_ext autoreload
%autoreload 2

### Setting Up Modin Environment

This code block sets up the Modin environment. It imports the necessary libraries, sets `num_cpus` to the number of available CPUs minus one, and exports that value through the `MODIN_CPUS` environment variable; it also sets `TOKENIZERS_PARALLELISM` to `false`. Then, it configures the Modin engine to use Dask and sets the number of partitions to `num_cpus`. A Dask client is also created for distributed computing. Lastly, it sets the maximum number of rows displayed by Modin's pandas implementation to 30.

In [None]:
import os
import multiprocessing
import modin
import modin.pandas as md
num_cpus = multiprocessing.cpu_count() - 1  # leave one core free

os.environ["MODIN_CPUS"] = str(num_cpus)
os.environ["TOKENIZERS_PARALLELISM"] = "false"
modin.config.Engine.put("Dask")
modin.config.NPartitions.put(num_cpus)

from distributed import Client
client = Client()
print("Modin using %s cpus" % modin.config.NPartitions.get())


md.set_option('display.max_rows', 30)
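
On a single-core machine `multiprocessing.cpu_count() - 1` evaluates to 0, which is not a usable partition count. A minimal sketch of a guard follows; `usable_cpus` is a hypothetical helper name that belongs neither to the notebook nor to Modin.

```python
import multiprocessing


def usable_cpus(reserve: int = 1) -> int:
    """Return the number of workers to use, leaving `reserve` cores free.

    Hypothetical helper (not part of Modin): never returns less than 1.
    """
    total = multiprocessing.cpu_count()
    return max(1, total - reserve)


num_cpus = usable_cpus()
print("workers:", num_cpus)
```

The returned value could then be passed to `modin.config.NPartitions.put` in place of the raw subtraction.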

### Importing Necessary Libraries and Modules

This section imports the libraries and modules used throughout the rest of the notebook. They cover data manipulation (like `rdfpandas`), machine learning (`xgboost`), data visualization (`matplotlib`, `seaborn`), date handling (`datetime`), and others. The wildcard import from the `helper` module presumably provides the custom data-processing functions, as well as the names used later without an explicit import (for example `pd`, `np` and `json`).

In [None]:
from helper.helper import *
from datetime import date
import matplotlib.pyplot as plt   
from xgboost import plot_importance
import pickle 
import seaborn as sns
from numpy import sort
from sklearn.preprocessing import MinMaxScaler
import rdfpandas

### Input files 


### File Paths and Constants

This section defines the paths of the destination data file and of the mapped listings file, the dataset name, the date of the data dump, and a default random state.

`destination_data_file`: the filename of the AirBnB tourism destination data (e.g. London), which can be downloaded from https://insideairbnb.com/

`listings_mapped_entities`: the filename of the DBpedia entities extracted from each AirBnB accommodation description in the file `destination_data_file`

`dataset_name`: the name appended to the training, dev and test datasets created to train and test the KGE-BERT model

`last_date`: the date of the data dump (e.g. y=2022, m=09, d=10)

`DEFAULT_RANDOM_STATE`: the random seed used to make the results reproducible

In [None]:
destination_data_file = './data/london/date=20220910/listings.csv.gz'
dataset_name = "airbnb_london_20220910"
listings_mapped_entities = "data/london/date=20220910/london_listings_mapped.csv"
last_date = date(2022, 9, 10)
DEFAULT_RANDOM_STATE = 1

### Load Destination Data

The code reads the CSV file at the path stored in `destination_data_file` into a pandas DataFrame called `destination_data`.

In [None]:
destination_data = pd.read_csv(destination_data_file)

### Column Types Definition

The code defines a dictionary `column_types` that categorizes different types of columns in a dataset into various categories such as `category_columns`, `boolean_columns`, `date_columns`, `present_absent_columns`, `to_drop_cols`, `amenities_col`, `amenities_tao_col`, `entities_col`, and `listing_type_cols`. It then creates a new category `listing_type_and_category_cols` that combines `listing_type_cols` and `category_columns`.

In [None]:
column_types = {
    "category_columns" : ['neighbourhood_cleansed', 'host_response_time'],
    "boolean_columns" : ['host_is_superhost', 'host_has_profile_pic', 'host_identity_verified', 'has_availability', 'instant_bookable'],
    "date_columns" : ['host_since', 'first_review', 'last_review'],
    "present_absent_columns" : ['host_about', 'host_neighbourhood'],
    "to_drop_cols" : ['neighbourhood', 'host_location', 'host_about', 'neighborhood_overview', 'bathrooms_text'],
    "amenities_col" : ['amenities_v'],
    "amenities_tao_col" : ['amenities_tao_v'],
    "entities_col" : ['entities_v'],
    "listing_type_cols" : ['property_type', 'room_type']}
column_types["listing_type_and_category_cols"] = column_types["listing_type_cols"] + column_types["category_columns"]

### Loading and Mapping TAO Class Ontologies

The code provides functions to load TAO (Tourism Analytics Ontology) class mappings from files and to create an amenity dictionary.

1. `load_tao_mappings`: This function reads mapping files for different TAO classes (location amenities, location facilities, accommodations, tourist locations) and returns dataframes for each class mapping.

2. `AmenityMapper_Factory`: This function utilizes the `load_tao_mappings` function to load all TAO class mappings and creates an `AmenityMapper` for all mappings and for location amenities mappings.

3. `create_amenity_dic`: This function creates a dictionary for a given list of amenities, where each amenity is linked to its corresponding class through the `AmenityMapper`.

In [None]:
def load_tao_mappings(
        la_mapping_file = "class_mapping/la_ontology_mapping.csv",
        lf_mapping_file = "class_mapping/lf_ontology_mapping.csv",
        acco_mapping_file = "class_mapping/accommodation_ontology_mapping.csv",
        tl_mapping_file = "class_mapping/tourist_location_ontology_mapping.csv"):
    """Loads TAO class mappings from files.
    Each mapping file has two columns:
    label: contains the lower-case text label associated with the TAO class
    class: contains the camel-case OWL class in the ontology
    
    Parameters
    ----------
    la_mapping_file : str
        The mapping file complete path for location amenities' classes
    lf_mapping_file : str
        The mapping file complete path for location facilities' classes
    acco_mapping_file : str
        The mapping file complete path for accommodations' classes
    tl_mapping_file : str
        The mapping file complete path for tourist locations' classes
    
    Returns
    -------
    A tuple of 4 pandas dataframe containing mappings for location 
    amenities, location facilities, accommodations, tourist locations
    """
    
    #### List of amenities labels vs classes extracted from ontology
    df_ab2t_o = pd.read_csv(la_mapping_file)
    df_ab2t_o.rename(columns={'class': 'amenity_class'}, inplace=True)
    df_ab2t_o["amenity_id"] = df_ab2t_o["label"].apply(lambda e: clean_name(str(e)).lower().strip())
    
    #### List of location facility labels vs classes extracted from ontology
    df_lf_o = pd.read_csv(lf_mapping_file)
    df_lf_o.rename(columns={'class': 'amenity_class'}, inplace=True)
    df_lf_o["amenity_id"] = df_lf_o["label"].apply(lambda e: clean_name(str(e)).lower().strip())
    
    #### List of accommodation labels vs classes extracted from ontology
    df_ac_o = pd.read_csv(acco_mapping_file)
    df_ac_o.rename(columns={'class': 'amenity_class'}, inplace=True)
    df_ac_o["amenity_id"] = df_ac_o["label"].apply(lambda e: clean_name(str(e)).lower().strip())
    
    #### List of tourist location labels vs classes extracted from ontology
    df_tl_o = pd.read_csv(tl_mapping_file)
    df_tl_o.rename(columns={'class': 'amenity_class'}, inplace=True)
    df_tl_o["amenity_id"] = df_tl_o["label"].apply(lambda e: clean_name(str(e)).lower().strip())
    
    return df_ab2t_o, df_lf_o, df_ac_o, df_tl_o

def AmenityMapper_Factory(
        la_mapping_file = "class_mapping/la_ontology_mapping.csv",
        lf_mapping_file = "class_mapping/lf_ontology_mapping.csv",
        acco_mapping_file = "class_mapping/accommodation_ontology_mapping.csv",
        tl_mapping_file = "class_mapping/tourist_location_ontology_mapping.csv"):
    
    df_ab2t_o, df_lf_o, df_ac_o, df_tl_o = load_tao_mappings(la_mapping_file, lf_mapping_file, acco_mapping_file, tl_mapping_file)
    
    all_mappings = pd.concat([df_ab2t_o, df_lf_o, df_ac_o, df_tl_o])
    all_matcher = AmenityMapper(all_mappings)
    am_matcher = AmenityMapper(df_ab2t_o)
    return all_matcher, am_matcher

def create_amenity_dic(am_matcher, cleaned_amenities):
    am_dict = {}
    for am in cleaned_amenities:
        am_dict[am] = am_matcher.amenity_linker(am)
    return am_dict
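
The label-to-class lookup described above can be sketched without pandas. This is an illustration only: the CSV rows are invented, and `normalize_label` is a hypothetical stand-in for the notebook's `clean_name(...).lower().strip()`, whose exact behaviour is not shown.

```python
import csv
import io

# Toy mapping in the two-column layout described above (label, class).
# The rows are invented for illustration; they are not taken from the TAO files.
MAPPING_CSV = """label,class
Free Parking,FreeParking
 wifi ,Wifi
Swimming pool,SwimmingPool
"""


def normalize_label(label: str) -> str:
    """Hypothetical stand-in for the notebook's label cleaning step."""
    return " ".join(label.lower().split())


def build_lookup(csv_text: str) -> dict:
    """Map each normalized label to its ontology class name."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {normalize_label(row["label"]): row["class"] for row in reader}


lookup = build_lookup(MAPPING_CSV)
print(lookup.get(normalize_label("FREE  parking")))  # FreeParking
print(lookup.get("sauna"))  # None: label not in the mapping
```

Normalizing both the stored labels and the query with the same function is what makes the lookup insensitive to case and stray whitespace.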

### Creating Amenities Root Map

The function `create_amentities_root_map` generates a mapping from amenity classes to their root (macro) classes. It first collects the direct subclasses of the root class `LocationAmenity` and gathers all of their descendants as the set of valid classes. It then builds `class_generalization`, which maps each class in the TAO namespace to its single valid parent (excluding the first-level classes) when exactly one exists, and to itself otherwise; this dictionary is not used in the returned values. Finally, it calls `root_sublclasses_tree` to build the dictionary `amentities_root_map` and converts it into the dataframe `amentities_root_map_df`, with one row per amenity and macro amenity pair.

In [None]:
def create_amentities_root_map(tao):
    tao_classes = {}
    root_class = tao.LocationAmenity
    first_level_classes = set(root_class.subclasses())
    print("first level classes: ", first_level_classes) 
    valid_classes = set()
    for subc in first_level_classes:  
        #print("subclasses of ", subc)
        for c in subc.descendants():
            valid_classes.add(c)
            #print(">>>  ",c)
    #print(valid_classes)

    class_generalization = {}
    for subc in first_level_classes:  
        #print(subc)
        for c in subc.descendants():
            #print("----",c)
            if c.iri.startswith(tao_base_iri):
                parents = set(c.is_a)
                valid_parents = parents.intersection(valid_classes) - first_level_classes
                #print(c.name, parents, valid_parents)            
                if len(valid_parents) == 1:
                    class_generalization[c.name] = valid_parents.pop()
                    #print("Class %s ...selected parent: %s" % (c, class_generalization[c.name]))
                else:
                    class_generalization[c.name] = c
                    #print("No valid parent found using ", c)
    
    amentities_root_map = root_sublclasses_tree(tao.LocationAmenity)
    amentities_root_map_df = pd.DataFrame.from_dict(amentities_root_map, orient='index', columns=["macro_amenity"]).reset_index(names='amenity').explode("macro_amenity")
    
    return amentities_root_map_df, amentities_root_map
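
The helper `root_sublclasses_tree` is not shown in the notebook. The sketch below is one plausible reading of "map each amenity class to its first-level class", written against a plain dictionary instead of an ontology; the class names are invented and the function names are hypothetical.

```python
# Toy class hierarchy: each class maps to its direct subclasses.
# Class names are invented for illustration and are not taken from TAO.
SUBCLASSES = {
    "LocationAmenity": ["Kitchen", "Entertainment"],
    "Kitchen": ["Oven", "Fridge"],
    "Entertainment": ["TV"],
    "TV": ["SmartTV"],
    "Oven": [], "Fridge": [], "SmartTV": [],
}


def descendants(cls: str) -> list:
    """Return `cls` followed by all of its transitive subclasses."""
    found = [cls]
    for child in SUBCLASSES.get(cls, []):
        found.extend(descendants(child))
    return found


def root_map(root: str) -> dict:
    """Map every class under `root` to the first-level classes it belongs to."""
    mapping = {}
    for first_level in SUBCLASSES[root]:
        for cls in descendants(first_level):
            mapping.setdefault(cls, []).append(first_level)
    return mapping


amenity_to_macro = root_map("LocationAmenity")
print(amenity_to_macro["SmartTV"])  # ['Entertainment']
```

Values are lists because, in a real ontology, a class can sit under more than one first-level class; this is consistent with the `explode("macro_amenity")` call in the notebook.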

### Data Preparation Function

The function `prepare_data` cleans and preprocesses the listings dataframe and the entities dataframe. It fills null values, converts data types, creates new features, drops unused columns, removes fully null columns and rows that still contain null values, and extracts the amenities and entities of each listing. It then generates multi-hot encodings (one 0/1 column per value) for entities and places. The output is a dictionary containing the processed dataframes and the lookup structures built along the way.

In [None]:
def prepare_data(df, ct, last_date, df_ent, amentities_root_map_df, pl_min_freq = 20):
    boolean_columns = ct["boolean_columns"]
    date_columns = ct["date_columns"]
    present_absent_columns = ct['present_absent_columns']
    present_absent_columns_renamed = ["pa_"+col for col in present_absent_columns]
    
    # avoid side effects on dataframe
    df = df.copy()
    df_ent = df_ent.copy()
    all_matcher, am_matcher = AmenityMapper_Factory()
    
    
    # fill null values in columns
    df.description = df.description.fillna("").apply(lambda text: pre_process(text))
    df.bedrooms = df.bedrooms.fillna(0)
    df.host_response_time = df.host_response_time.fillna("undefined")
    df.host_neighbourhood = df.host_neighbourhood.fillna("")
    df.host_is_superhost = df.host_is_superhost.fillna(0)
    df.name = df.name.fillna("")
    df.beds = df.beds.fillna(0)
    
    # change price into float
    df.price = clean_price(df.price)
    
    # change string t/f columns into boolean flags (t -> 1, f -> 0), consistent with the 0 used above for missing values
    df[boolean_columns] = df[boolean_columns].replace({"t": 1, "f": 0})
    
    # cast date columns to datetime and calculate days since
    for col in date_columns:
        df[col] = pd.to_datetime(df[col], infer_datetime_format=True)
    
    df[date_columns] = df[date_columns].apply(lambda my_col: my_col.apply(lambda my_date: calc_days_since(my_date, last_date, my_col.name)))    
    
    # change percentages to float and fill null values with mean
    df.host_response_rate = df.host_response_rate.str.replace("%","").astype(np.float64)
    df.host_acceptance_rate = df.host_acceptance_rate.str.replace("%","").astype(np.float64)
    df["host_response_rate"] = df["host_response_rate"].fillna(df["host_response_rate"].mean())
    df["host_acceptance_rate"] = df["host_acceptance_rate"].fillna(df["host_acceptance_rate"].mean())
    
    # create columns with boolean value to account for presence of other columns
    df[present_absent_columns_renamed] = df[present_absent_columns].apply(lambda my_col: my_col.apply(lambda text: int(type(text) is str)))    
    
    # drop unused columns
    df = df.drop(columns=ct["to_drop_cols"])
    
    ## remove completely null columns
    df = df.dropna(axis=1, how = 'all')
    
    ## remove rows with null values
    df = df.dropna()
    
    df['amenities_v'] = df['amenities'].apply(lambda am: json.loads(am))
    df['amenities_all_v'] = df['amenities_v'].apply(lambda am_list: list(set(am_list)))  # reuse the parsed list, without repetitions
    frequent_amenities = property_frequencies(df['amenities_all_v'])
    df['amenities_freq_v'] = df['amenities_all_v'].apply(lambda am_list: list(set(am_list).intersection(frequent_amenities)))
    
    distinct_amenities = list(df['amenities_all_v'].explode().unique())
    cleaned_amenities = set([clean_name(am) for am in distinct_amenities])
    
    am_dict = create_amenity_dic(am_matcher, cleaned_amenities)
    
    df['amenities_tao_v'] = df['amenities_all_v'].apply(lambda x: remap_amenities(x, am_dict))
    df['amenities_tao_classes_v'] = df['amenities_all_v'].apply(lambda x: remap_amenities(x, am_dict, output='class'))
    
    amenity_classes_for_listings = pd.DataFrame(df[['id','amenities_tao_classes_v']].explode('amenities_tao_classes_v'))
    macro_amenity_classes_for_listings = amenity_classes_for_listings.merge(amentities_root_map_df, left_on='amenities_tao_classes_v', right_on='amenity', how="left")[["id","macro_amenity"]]
    macro_amenity_classes_for_listings["tot"] = 1
    macro_amenity_features = macro_amenity_classes_for_listings.reset_index().groupby(['id', 'macro_amenity'])['tot'].aggregate('sum').unstack()
    macro_amenity_features.fillna(0, inplace=True)
    macro_amenity_features = macro_amenity_features.reset_index()
    
    
    # DBpedia entities from descriptions
    df_ent.drop_duplicates(subset=['id'], inplace=True)
    df_ent['entities_v'] = df_ent['entities'].apply(lambda e: converter(str(e)))
    df_ent['id'] = df_ent['id'].astype(int)
    frequent_entities = property_frequencies(df_ent['entities_v']) 
    all_entities = property_frequencies(df_ent['entities_v'], min_freq = 0) 
    df = df.merge(df_ent[['id', 'entities_v']], left_on='id', right_on='id', how='left')
    
    ## Note df['entities_v'] has all entities with repetitions
    df['entities_all_v'] = df['entities_v'].apply(lambda am_list: deb(am_list))
    df['entities_freq_v'] = df['entities_all_v'].apply(lambda am_list: list(set(am_list).intersection(frequent_entities)))
    
    ## DBpedia entities of type Place from descriptions
    ent_places = df_ent['entities'].apply(lambda e: converter(str(e), filter="place"))
    frequent_places = property_frequencies(ent_places, min_freq = pl_min_freq)
    all_places = property_frequencies(ent_places, min_freq = 0) 
    
    df_all_places = df['places_all_v'] = df['entities_all_v'].apply(lambda am_list: list(set(am_list).intersection(all_places)))
    df_freq_places = df['places_freq_v'] = df['entities_all_v'].apply(lambda am_list: list(set(am_list).intersection(frequent_places)))
    
    print(len(all_places), len(frequent_places), len(frequent_entities)) 
    
    #Hot encodings
    mlb = MultiLabelBinarizer()
    df_all_entities_filtered_exploded = pd.DataFrame(mlb.fit_transform(df['entities_all_v']), columns=mlb.classes_, index=df.index)
    
    mlb = MultiLabelBinarizer()
    df_freq_entities_filtered_exploded = pd.DataFrame(mlb.fit_transform(df['entities_freq_v']), columns=mlb.classes_, index=df.index)
    
    mlb = MultiLabelBinarizer()
    df_all_places_exploded = pd.DataFrame(mlb.fit_transform(df_all_places), columns=mlb.classes_, index=df.index)
    df_freq_places_exploded = pd.DataFrame(mlb.fit_transform(df_freq_places), columns=mlb.classes_, index=df.index)
    
    
    #df_listing_type_exploded = pd.get_dummies(X_temp, columns=listing_type_cols)
    
    output = {
        "df_data": df,
        "df_entities": df_ent,
        "df_all_places": df_all_places,
        "df_freq_places": df_freq_places,
        "frequent_entities": frequent_entities,
        "all_entities": all_entities,
        "amenities_lookup": am_dict,
        "frequent_places": frequent_places,
        "all_places": all_places,
        "df_all_entities_filtered_exploded": df_all_entities_filtered_exploded,
        "df_freq_entities_filtered_exploded": df_freq_entities_filtered_exploded,
        "df_all_places_exploded": df_all_places_exploded,
        "df_freq_places_exploded": df_freq_places_exploded,
        "df_macro_amenity_classes_for_listings": macro_amenity_classes_for_listings
        
    }
    
    
    return output
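
The frequency filter and the multi-hot encoding used in `prepare_data` can be sketched in plain Python. The helper `property_frequencies` is not shown in the notebook, so `frequent_values` below only assumes its behaviour (keep the values that occur in at least `min_freq` rows); the entity names are invented.

```python
from collections import Counter


def frequent_values(rows, min_freq=2):
    """Return the values that occur in at least `min_freq` rows.

    Assumed behaviour, loosely modelled on the notebook's
    `property_frequencies` helper, whose implementation is not shown.
    """
    counts = Counter(value for row in rows for value in set(row))
    return {value for value, n in counts.items() if n >= min_freq}


def multi_hot(rows, vocabulary):
    """Encode each row as a 0/1 vector over the sorted vocabulary."""
    columns = sorted(vocabulary)
    return columns, [[int(col in row) for col in columns] for row in rows]


listing_entities = [
    ["Hyde_Park", "London_Eye"],
    ["Hyde_Park"],
    ["Camden_Town", "Hyde_Park", "London_Eye"],
]
frequent = frequent_values(listing_entities, min_freq=2)
columns, matrix = multi_hot(listing_entities, frequent)
print(columns)  # ['Hyde_Park', 'London_Eye']
print(matrix)   # [[1, 1], [1, 0], [1, 1]]
```

`MultiLabelBinarizer` in the notebook plays the role of `multi_hot` here; it also orders its columns by sorted class label.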
    

### Loading Ontologies

This section loads two ontologies, `tao.rdf` and `skos.rdf`. It creates a `World` (the calls match the owlready2 API, presumably made available by the `helper` wildcard import), specifies the file paths and the TAO base IRI, and loads both ontologies into that world. It then obtains the namespace of each ontology.

In [None]:
world = World()
tao_file = "ontologies/tao.rdf" 
tao_base_iri = "http://purl.org/tao/ns#"
tao_ontology = world.get_ontology(tao_file).load()
skos_file = './ontologies/skos.rdf' 
skos_ontology = world.get_ontology(skos_file).load()
skos = skos_ontology.get_namespace("http://www.w3.org/2004/02/skos/core#")
tao = tao_ontology.get_namespace(tao_base_iri)

### Creation of Amenities Root Map

The code below builds the amenities root map by calling `create_amentities_root_map` with the `tao` namespace. The results are stored in two variables: `amentities_root_map_df` and `amentities_root_map`.

In [None]:
amentities_root_map_df, amentities_root_map = create_amentities_root_map(tao)

### Loading and Renaming DataFrame Columns

This line loads the CSV file at the path stored in `listings_mapped_entities` into a pandas DataFrame named `df_ent` and renames the column `source_id` to `id`.

In [None]:
df_ent = pd.read_csv(listings_mapped_entities, sep = ",").rename(columns={"source_id": "id"})

### Data Cleaning and Filtering

The code block below cleans the DataFrame `df_ent`. It first records the total number of rows. Then it removes the rows whose `id` is missing or equal to the string 'id' (presumably repeated header rows). Next, it converts `id` to a number, removes the rows for which the conversion fails, and casts `id` to integer. Finally, it prints the number of rows removed during the cleaning.

In [None]:
## filter wrong lines
num_lines = df_ent.shape[0]
df_ent = df_ent[(df_ent.id != 'id') & (df_ent.id.notna())].copy()  # copy so that the assignments below do not act on a view
df_ent["id1"] = pd.to_numeric(df_ent.id, errors='coerce', downcast="integer")
df_ent = df_ent[df_ent.id1.notna()]
df_ent["id"] = df_ent.id.astype(int)
df_ent.drop(columns=["id1"], inplace=True)
removed_lines = num_lines - df_ent.shape[0]
print("Removed corrupted lines", removed_lines)
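
The same cleaning steps can be tried on a toy frame. The sketch assumes only pandas; the rows are invented to cover a repeated header row, a missing id and a non-numeric id.

```python
import pandas as pd

# Toy frame with a repeated header row, a missing id and a non-numeric id.
raw = pd.DataFrame({
    "id": ["101", "id", None, "abc", "202"],
    "entities": ["[]", "entities", "[]", "[]", "[]"],
})

# Drop repeated header rows and missing ids; copy to avoid chained-assignment warnings.
clean = raw[(raw["id"] != "id") & raw["id"].notna()].copy()

# Non-numeric ids become NaN with errors="coerce" and can then be filtered out.
clean["id"] = pd.to_numeric(clean["id"], errors="coerce")
clean = clean[clean["id"].notna()].copy()
clean["id"] = clean["id"].astype(int)

print(clean["id"].tolist())   # [101, 202]
print(len(raw) - len(clean))  # 3 rows removed
```

Coercing first and filtering afterwards avoids an exception on the first malformed id, which a direct `astype(int)` would raise.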

### Data Preparation and Storage

This piece of code handles the preparation and storage of data. It provides the option to load previously computed data from a file or to prepare data from scratch. If the data is prepared from scratch, it also offers the option to save the newly prepared data to a file for future use.

In [None]:
load_from_file = False ### Change to True to load a previous computation
store_to_file = True
check_pickle = True
filename = "checkpoints/prepared_data.pickle"

if load_from_file:
    print("Loading data from file", filename)
    with open(filename, 'rb') as handle:
        prepared_data = pickle.load(handle)
else:
    print("Preparing data from scratch")
    prepared_data = prepare_data(destination_data.copy(), column_types, last_date, df_ent, amentities_root_map_df)
    if store_to_file:
        print("Saving prepared data to file", filename)
        with open(filename, 'wb') as handle:
            pickle.dump(prepared_data, handle, protocol=pickle.HIGHEST_PROTOCOL)
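
The load-or-compute pattern above can be factored into a small function. `load_or_compute` is a hypothetical helper, not something the notebook defines; the sketch also creates the checkpoint directory, which the notebook assumes already exists.

```python
import os
import pickle
import tempfile


def load_or_compute(path, compute, load_from_file=False, store_to_file=True):
    """Return a cached result from `path`, or compute it and optionally cache it.

    Hypothetical helper; the notebook inlines this logic instead.
    """
    if load_from_file and os.path.exists(path):
        with open(path, "rb") as handle:
            return pickle.load(handle)
    result = compute()
    if store_to_file:
        # Create the checkpoint directory if it does not exist yet.
        os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
        with open(path, "wb") as handle:
            pickle.dump(result, handle, protocol=pickle.HIGHEST_PROTOCOL)
    return result


with tempfile.TemporaryDirectory() as tmp:
    checkpoint = os.path.join(tmp, "checkpoints", "toy.pickle")
    first = load_or_compute(checkpoint, lambda: {"rows": 3})
    # The second call reads the pickle instead of calling `compute`.
    second = load_or_compute(checkpoint, lambda: {"rows": -1}, load_from_file=True)
    print(first == second)  # True
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.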

### Checking Pickle File Integrity

This code block checks the integrity of the pickle file. It reloads the file and compares its content with `prepared_data` using the DeepDiff library, then prints the differences; an empty result means that none were found.

In [None]:
if check_pickle:
    ## to check for pickle content with respect to the original one
    from deepdiff import DeepDiff
    with open(filename, 'rb') as handle:
        loaded_data = pickle.load(handle)
    diff = DeepDiff(prepared_data, loaded_data)
    print(diff)

## Create a TAO entity matcher from descriptions
The matcher is created using original TAO mappings and new mappings extracted from amenity metadata

### Mapping Labels to Classes

This section loads the label-to-class mappings from TAO (Tourism Analytics Ontology) and builds a dataframe of new mappings from the amenity metadata (`lookup_labels_from_data`). Note that only the four TAO mapping dataframes are concatenated into `all_lookup_df`; `lookup_labels_from_data` is built but not added to it. The base IRI (Internationalized Resource Identifier) of TAO is then prepended to the `amenity_class` column of the combined dataframe, which is used to create an instance of the `AmenityMapper` class.

In [None]:
df_ab2t_o, df_lf_o, df_ac_o, df_tl_o = load_tao_mappings() ## label to class mappings from TAO 
am_dict = prepared_data["amenities_lookup"] ## new label to class mappings from amenity metadata
lookup_labels_from_data = pd.DataFrame.from_dict(am_dict, orient="index", columns=["tao_label", "amenity_class"]).reset_index().rename(columns={'index': 'label'})
all_lookup = [df_ab2t_o, df_lf_o, df_ac_o, df_tl_o]
all_lookup_df = pd.concat(all_lookup, axis = 0).dropna(axis=0)
all_lookup_df["amenity_class"] = tao.base_iri + all_lookup_df["amenity_class"]
print(tao.base_iri)
am_matcher = AmenityMapper(all_lookup_df)

### TAO Entity Extractor Testing

This section tests the TAO (Tourism Analytics Ontology) entity extractor. It preprocesses the text at index 4 of the `description` column of `destination_data`, then applies `amenity_linker_multi` to extract entities. The raw text, the processed text, and the extraction results are printed for verification.

In [None]:
### test TAO entity extractor 
i = 4
raw_text = destination_data["description"][i]
text = pre_process(raw_text)
res = am_matcher.amenity_linker_multi(text)
print(raw_text)
print(text)

print(res)
#matches = self.__matcher(doc)

## Extract all TAO entities and merge them with DBpedia entities

### Entity Merging Function

The function `merge_entities` merges two lists of entities, `tao_entities` and `dbp_entities`. If either argument is not a list, it returns the other one. Otherwise it iterates over `tao_entities` and reads the first and last word positions of each entity from `surface_word_pos` (the last position is excluded). It does the same for each entity in `dbp_entities` and builds the set of word positions covered by each entity. If the two sets intersect, the current TAO entity is discarded; otherwise it is kept. The function returns a list combining the kept TAO entities and all of `dbp_entities`.

In [None]:
def merge_entities(tao_entities: list, dbp_entities:list):
    if type(dbp_entities) is not list:
        return tao_entities
    if type(tao_entities) is not list:
        return dbp_entities
    
    keep_tao = []
    for tao_ent in tao_entities:
        # word positions covered by the TAO entity (last position is excluded)
        tao_first_word = tao_ent["surface_word_pos"][0]
        tao_last_word = tao_ent["surface_word_pos"][1]
        tao_words = set(range(tao_first_word, tao_last_word))
        keep = True
        for dbp_ent in dbp_entities:
            dbp_first_word = dbp_ent["surface_word_pos"][0]
            dbp_last_word = dbp_ent["surface_word_pos"][1]  ### last position is excluded
            dbp_words = set(range(dbp_first_word, dbp_last_word))

            # discard the TAO entity as soon as it overlaps a DBpedia entity
            if len(tao_words.intersection(dbp_words)) > 0:
                keep = False
                break

        if keep:
            #print("keeping: ", tao_ent["surface_form"])
            keep_tao.append(tao_ent)
    
    return keep_tao + dbp_entities
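
For non-empty spans, the set intersection used in `merge_entities` is equivalent to a comparison of interval bounds. A minimal sketch with invented entities follows; only the `surface_word_pos` key mirrors the notebook's data, and `spans_overlap` is a hypothetical helper.

```python
def spans_overlap(span_a, span_b):
    """True if two half-open word spans [start, end) share at least one position."""
    return span_a[0] < span_b[1] and span_b[0] < span_a[1]


# Toy entities; only the `surface_word_pos` key mirrors the notebook's data,
# the surface forms and positions are invented for illustration.
tao = [
    {"surface_form": "park", "surface_word_pos": (3, 4)},
    {"surface_form": "free wifi", "surface_word_pos": (7, 9)},
]
dbp = [{"surface_form": "Hyde Park", "surface_word_pos": (2, 4)}]

kept = [
    t for t in tao
    if not any(spans_overlap(t["surface_word_pos"], d["surface_word_pos"]) for d in dbp)
]
merged = kept + dbp
print([e["surface_form"] for e in merged])  # ['free wifi', 'Hyde Park']
```

The TAO entity "park" is dropped because its word span lies inside the DBpedia entity "Hyde Park", which takes precedence.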
            
            
            

### Extracting and Merging TAO and DBpedia Entities

The function `extract_tao_entities` takes a dataframe as input, removes any null texts, and applies the TAO entity matcher to each description to extract TAO entities. 

The function `merge_tao_and_dbp_entities` joins the dataframe containing TAO entities with the dataframe containing DBpedia entities using an outer merge on `id`. It then drops the rows whose DBpedia entity list is empty (`[]`) or whose TAO entities are missing. Finally, it merges the DBpedia entities with the TAO entities, giving precedence to DBpedia entities if the surface forms overlap.

In [None]:
def extract_tao_entities(data_df: pd.DataFrame) -> pd.DataFrame:
    ## we must use the same text used with DBpedia spotlight in order to align surface form word positions during merge
    df_tao_entities_from_text = data_df.dropna(axis = 0).copy() ## remove null texts
    #df_tao_entities_from_text = df_tao_entities_from_text[0:100] ## for testing, comment when processing all data
    ## we use the TAO entity matcher to process each description and extract TAO entities from it
    df_tao_entities_from_text["tao_entities"] = df_tao_entities_from_text.description.apply(lambda raw_text: am_matcher.amenity_linker_multi(pre_process(raw_text)))
    return df_tao_entities_from_text

def merge_tao_and_dbp_entities(df_tao: pd.DataFrame, df_dbp: pd.DataFrame) -> pd.DataFrame:
    ## merge the dataframe with DBpedia entities with the dataframe with TAO entities
    df_tao_dbpedia_entities =  df_tao[["id","tao_entities"]].merge(df_dbp[["id","entities_v"]], left_on="id", right_on="id", how="outer").rename(columns={"entities_v": "dbp_entities"})
    ## remove rows with an empty DBpedia entity list or with no TAO entity found
    df_tao_dbpedia_entities = df_tao_dbpedia_entities[(df_tao_dbpedia_entities.astype(str)['dbp_entities'] != '[]') & (~df_tao_dbpedia_entities.tao_entities.isna())].copy()
    ## merge DBpedia entities with TAO entities giving precedence to DBpedia entities if the surface forms overlap
    df_tao_dbpedia_entities["merged_entities"] = df_tao_dbpedia_entities.apply(lambda row: merge_entities(row["tao_entities"], row["dbp_entities"]), axis=1)
    
    return df_tao_dbpedia_entities
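
The effect of the outer merge and of the row filter can be checked on two toy frames. The sketch assumes only pandas; ids and entity names are invented.

```python
import pandas as pd

df_tao = pd.DataFrame({"id": [1, 2], "tao_entities": [["wifi"], ["pool"]]})
df_dbp = pd.DataFrame({"id": [2, 3], "entities_v": [["Hyde_Park"], []]})

merged = df_tao.merge(df_dbp, on="id", how="outer").rename(
    columns={"entities_v": "dbp_entities"}
)

# Same filter as in the notebook: drop empty DBpedia lists and missing TAO entities.
mask = (merged["dbp_entities"].astype(str) != "[]") & merged["tao_entities"].notna()
kept = merged[mask]
print(kept["id"].tolist())  # [1, 2]
# Listing 1 has no DBpedia match, so its `dbp_entities` value is NaN, not a list.
print(isinstance(kept.loc[kept["id"] == 1, "dbp_entities"].iloc[0], list))  # False
```

Rows such as listing 1 are the reason `merge_entities` checks whether each argument is a list before iterating over it.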

### Entity Extraction and Storage

This script handles the extraction and storage of TAO and DBpedia entities. 

- The first set of variables determine whether the program should load entities from a file or prepare them from scratch.
- If `load_from_file` is `True`, the script loads the merged TAO and DBpedia entities from the pickle file. If, in addition, `load_from_file_just_tao` is `True`, it also loads the TAO entities extracted from text from a separate Parquet file.
- If `load_from_file` is `False`, the script prepares TAO and DBpedia entities from scratch. It extracts TAO entities from all listing descriptions and merges them with the entities extracted using DBpedia Spotlight. It uses Modin dataframes to parallelize this process.
- After preparing the entities, if `store_to_file` is `True`, the script saves the TAO and DBpedia entities to a file. If `store_to_file_just_tao` is `True`, it also saves just the TAO entities to a separate file.

In [None]:
load_from_file = False ### Change to True to load a previous computation
load_from_file_just_tao = False
store_to_file = True
store_to_file_just_tao = True
check_pickle = True
filename = "checkpoints/dbpedia_tao_entities_for_kbert.pickle"
tao_entities_parquet = "checkpoints/tao_entities_from_text_for_bert_kg.parquet"
if load_from_file:
    print("Loading TAO and DBpedia entities from file", filename)
    with open(filename, 'rb') as handle:
        df_tao_dbpedia_entities = pickle.load(handle)

    if load_from_file_just_tao:
        print("Loading Just TAO entities extracted from text from file", tao_entities_parquet)
        df_tao_entities_from_text = pd.read_parquet(tao_entities_parquet)
    
else:
    print("Preparing TAO and DBpedia entities from scratch")
    ## find TAO entities in all listing descriptions and merge them with entities extracted using DBpedia Spotlight
    
    ### We use Modin dataframes to parallelize
    df_descriptions = md.DataFrame(destination_data[["id", "description"]].copy().dropna(axis=0))
    md_tao_entities_from_text = extract_tao_entities(df_descriptions)
    df_tao_entities_from_text = md_tao_entities_from_text._to_pandas()
    # Create column with vector of TAO classes found in descriptions
    df_tao_entities_from_text.drop_duplicates(subset=['id'], inplace=True)
    df_tao_entities_from_text['tao_entities_v'] = df_tao_entities_from_text['tao_entities'].apply(lambda e: converter(e))
    df_tao_entities_from_text['id'] = df_tao_entities_from_text['id'].astype(int)
    df_tao_entities_from_text.reset_index(inplace=True, drop=True)
    
    md_dbp_ent = md.DataFrame(prepared_data["df_entities"].copy()[["id","entities"]]) ## get dbpedia entities from prepared data
    md_dbp_ent["entities_v"] = md_dbp_ent['entities'].apply(lambda e: converter(str(e), full=True)) ## convert text to array   
    df_dbp_ent = md_dbp_ent._to_pandas()
    df_tao_dbpedia_entities = merge_tao_and_dbp_entities(df_tao_entities_from_text, df_dbp_ent)
    if store_to_file:
        print("Saving TAO and DBpedia entities to file", filename)
        with open(filename, 'wb') as handle:
            pickle.dump(df_tao_dbpedia_entities, handle, protocol=pickle.HIGHEST_PROTOCOL)
    if store_to_file_just_tao:
        print("Saving just TAO entities to file", tao_entities_parquet)
        df_tao_entities_from_text.to_parquet(tao_entities_parquet)
        

## Prepare balanced datasets and train classification models

### Extracting DataFrame from Prepared Data

The following line of code is used to extract a DataFrame named "df_data" from the prepared_data dictionary and assign it to the variable "lc".

In [None]:
lc = prepared_data["df_data"]

### Histogram Plot of Description Length

The code below plots a histogram of the description lengths. It pre-processes the text in the `description` column of the `lc` dataframe, splits it on single spaces, counts the resulting tokens, and plots a histogram titled 'Description length in words'.

In [None]:
distr = lc.description.apply(lambda s: len(pre_process(s).split(' ')))
distr.hist()
plt.title('Description length in words')
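
Splitting on a single space counts an empty token for every extra space, so the word counts are exact only if `pre_process` already collapses repeated whitespace, which the notebook does not show. A short illustration with an invented description:

```python
text = "Cosy flat  near   Hyde Park"

# str.split(' ') yields an empty string for every extra space ...
print(len(text.split(" ")))  # 8
# ... while str.split() collapses runs of whitespace.
print(len(text.split()))     # 5
```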

### Column Type Definitions

The code below unpacks the column groups defined in `column_types` into separate variables. These include:
- `category_columns`: These are columns with categorical data.
- `boolean_columns`: These are columns with boolean values.
- `date_columns`: These are columns with date values.
- `present_absent_columns`: These columns indicate the presence or absence of a certain attribute.
- `amenities_col`: This column contains information about the amenities of a listing.
- `amenities_tao_col`: This column contains the amenities mapped to TAO.
- `entities_col`: This column contains entity data.
- `listing_type_cols`: These columns contain information about the type of the listing.
- `listing_type_and_category_cols`: This is a combination of the `listing_type_cols` and `category_columns`.

In [None]:
category_columns = column_types["category_columns"] #['neighbourhood_cleansed', 'host_response_time']
boolean_columns = column_types["boolean_columns"] #['host_is_superhost', 'host_has_profile_pic', 'host_identity_verified', 'has_availability', 'instant_bookable']
date_columns = column_types["date_columns"] #['host_since', 'first_review', 'last_review']
present_absent_columns = column_types["present_absent_columns"] #['host_about', 'host_neighbourhood']
amenities_col = column_types["amenities_col"] #['amenities_v']
amenities_tao_col = column_types["amenities_tao_col"] #['amenities_tao_v']
entities_col = column_types["entities_col"] #['entities_v']
listing_type_cols = column_types["listing_type_cols"] #['property_type', 'room_type']
listing_type_and_category_cols = listing_type_cols + category_columns

### Data Preprocessing for Price Prediction

This code prepares the data for price prediction. It computes the mean and standard deviation of `price` and keeps only the listings priced below the mean plus two standard deviations, so only the upper tail is trimmed. It then builds two lists of non-object (numeric) column names: `price_numeric_cols`, which leaves out every column starting with 'review_scores' or 'price', and `price_numeric_excluded_cols`, which holds exactly those left-out columns.

In [None]:
price_mean = lc.price.mean()
price_std = lc.price.std()
price_prepared_data_df = lc[lc.price < price_mean + 2 * price_std]
price_numeric_cols = [col for col in list(price_prepared_data_df.columns) if not ( col.startswith('review_scores') or col.startswith('price')) and price_prepared_data_df[col].dtype != np.dtype('O')]  ## exclude columns about reviews scores and price that we want to predict
price_numeric_excluded_cols = [col for col in list(price_prepared_data_df.columns) if ( col.startswith('review_scores') or col.startswith('price')) and price_prepared_data_df[col].dtype != np.dtype('O')]  ## excluded columns
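
The filtering and column selection can be tried on a small made-up frame. This is a minimal sketch: the data and the `price_per_guest` column are invented for illustration, and the string column is given an explicit object dtype so that the same `dtype != np.dtype('O')` test applies.

```python
import numpy as np
import pandas as pd

# Made-up data: nine ordinary prices and one extreme value.
toy = pd.DataFrame({
    "price": [100.0] * 9 + [1000.0],
    "price_per_guest": [50.0] * 10,            # hypothetical column, starts with 'price'
    "review_scores_value": [4.5] * 10,
    "beds": [1, 2, 1, 2, 1, 2, 1, 2, 1, 2],
    "name": pd.Series(["listing"] * 10, dtype=object),
})

# Upper-tail filter: only prices above mean + 2 * std are dropped.
upper = toy.price.mean() + 2 * toy.price.std()
filtered = toy[toy.price < upper]

def is_target_related(col):
    # Columns tied to the price target or to review scores.
    return col.startswith("review_scores") or col.startswith("price")

numeric_cols = [c for c in filtered.columns
                if not is_target_related(c) and filtered[c].dtype != np.dtype("O")]
excluded_cols = [c for c in filtered.columns
                 if is_target_related(c) and filtered[c].dtype != np.dtype("O")]

print(len(filtered), numeric_cols, excluded_cols)
```

Note that no lower bound is applied, so unusually cheap listings stay in the data.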

### Excluded Numeric Columns for Price Prediction

The variable `price_numeric_excluded_cols` holds the names of the numeric columns that start with 'review_scores' or 'price'. These columns are kept out of the feature set for the price model; the cell below displays the list.

In [None]:
price_numeric_excluded_cols

### Data Preparation for 'availability_365' Analysis

This code block is used for preparing the data for analysis related to 'availability_365'. 

- The first line keeps only the rows whose 'availability_365' value is not null.
- The next two lines create lists of column names. The first list includes all numeric columns whose names do not start with 'availability'; the second includes the numeric columns whose names do. The first list provides the candidate features, and the second records what was excluded because it is tied to the target.

In [None]:
avail365_prepared_data_df = lc[lc.availability_365.notnull()]
avail365_numeric_cols = [col for col in list(avail365_prepared_data_df.columns) if not ( col.startswith('availability')) and avail365_prepared_data_df[col].dtype != np.dtype('O')]  ## exclude columns about availability count that we want to predict
avail365_numeric_excluded_cols = [col for col in list(avail365_prepared_data_df.columns) if ( col.startswith('availability')) and avail365_prepared_data_df[col].dtype != np.dtype('O')]  ## excluded columns 

### Overview of `avail365_numeric_excluded_cols` Variable

The variable `avail365_numeric_excluded_cols` holds the names of the numeric columns that start with 'availability', which are excluded from the features of the availability model. The cell below displays the list.

In [None]:
avail365_numeric_excluded_cols

### Data Preparation and Column Selection

The cell is meant to trim listings with an unusually high number of reviews. As the inline comment notes, the mean and standard deviation are actually computed on `price`, not on the review count; the author reports that this only makes the filter more permissive. The data is then restricted to records whose `number_of_reviews_ltm` is below the mean plus twice the standard deviation. Finally, two lists of numeric column names are generated: one excluding every column whose name contains 'review', and one containing only those columns.

In [None]:
nrw_mean = lc.price.mean() ### Here we have an error that has no influence in results because price mean and std are greater than number of reviews mean and std (so we are taking more data)
nrw_std = lc.price.std()
nrw_prepared_data_df = lc[lc.number_of_reviews_ltm < nrw_mean + 2 * nrw_std]
num_reviews_numeric_cols = [col for col in list(nrw_prepared_data_df.columns) if not ( col.startswith('review_score') or "review" in col) and nrw_prepared_data_df[col].dtype != np.dtype('O')]  ## exclude columns about reviews
num_reviews_numeric_excluded_cols = [col for col in list(nrw_prepared_data_df.columns) if ( col.startswith('review_score') or "review" in col) and nrw_prepared_data_df[col].dtype != np.dtype('O')]  ## exclude columns about reviews
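
To see what the inline comment in the cell refers to, the sketch below compares the price-based threshold with one computed from the review counts themselves. The numbers are made up, and the sketch is only an illustration of the two thresholds, not a change to the notebook's pipeline.

```python
import pandas as pd

# Made-up data: one listing with an extreme price and one with an extreme review count.
toy = pd.DataFrame({
    "price": [100.0] * 9 + [1000.0],
    "number_of_reviews_ltm": [0, 1, 2, 3, 2, 1, 0, 2, 1, 50],
})

# Threshold as computed in the cell (statistics taken from price).
price_based = toy.price.mean() + 2 * toy.price.std()
# Threshold computed from the review counts themselves.
review_based = (toy.number_of_reviews_ltm.mean()
                + 2 * toy.number_of_reviews_ltm.std())

kept_price_based = toy[toy.number_of_reviews_ltm < price_based]
kept_review_based = toy[toy.number_of_reviews_ltm < review_based]

print(len(kept_price_based), len(kept_review_based))
```

On this toy frame the price-based threshold keeps every row, while the review-based one drops the listing with 50 reviews.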

### Excluded Numeric Columns for Number-of-Reviews Prediction

The variable `num_reviews_numeric_excluded_cols` holds the names of the numeric columns whose names contain 'review'. They are left out of the feature set for the number-of-reviews model; the cell below displays the list.

In [None]:
num_reviews_numeric_excluded_cols

### Preparing Data for Review Score Prediction

This section of the code is preparing the data to be used for predicting review scores. It creates two lists of column names from the DataFrame `rev_score_prepared_data_df`. The first list, `rev_score_numeric_cols`, includes all columns that do not start with 'review_scores' and are not of type object. The second list, `rev_score_numeric_excluded_cols`, includes all columns that do start with 'review_scores' and are not of type object. These two lists will be used to select the relevant features for the prediction model.

In [None]:
rev_score_prepared_data_df = lc
rev_score_numeric_cols = [col for col in list(rev_score_prepared_data_df.columns) if not col.startswith('review_scores') and rev_score_prepared_data_df[col].dtype != np.dtype('O')]  ## exclude columns about reviews scores that we want to predict
rev_score_numeric_excluded_cols = [col for col in list(rev_score_prepared_data_df.columns) if col.startswith('review_scores') and rev_score_prepared_data_df[col].dtype != np.dtype('O')]  ## excluded columns

### Excluded Numeric Columns for Review Score Prediction

The cell below displays `rev_score_numeric_excluded_cols`, the list of numeric columns whose names start with 'review_scores'. These columns are excluded from the features of the review score model.

In [None]:
rev_score_numeric_excluded_cols

### Intersection of Numeric Columns

The following code is used to find the common numeric columns in different data sets. It uses the `set.intersection()` method to identify the shared numeric columns between `price_numeric_cols`, `avail365_numeric_cols`, `num_reviews_numeric_cols`, and `rev_score_numeric_cols`.

In [None]:
common_numeric_cols = set.intersection(set(price_numeric_cols),set(avail365_numeric_cols), set(num_reviews_numeric_cols), set(rev_score_numeric_cols))

### Set Difference Operation on Numeric Columns

This line of code is performing a set difference operation between `price_numeric_cols` and `common_numeric_cols`. It returns a set that includes the numeric columns in `price_numeric_cols` that are not present in `common_numeric_cols`.

In [None]:
set(price_numeric_cols) - common_numeric_cols

### Subtraction of Sets in Python

The given line of code performs the operation of subtracting one set from another in Python. It subtracts the `common_numeric_cols` set from the `avail365_numeric_cols` set. The result will be a new set that contains elements present in `avail365_numeric_cols` but not in `common_numeric_cols`.

In [None]:
set(avail365_numeric_cols) - common_numeric_cols

### Difference Between Numeric Columns

The given line of code is used to find the difference between two sets: `num_reviews_numeric_cols` and `common_numeric_cols`. It returns a set that includes items present in `num_reviews_numeric_cols` but not in `common_numeric_cols`.

In [None]:
set(num_reviews_numeric_cols) - common_numeric_cols

### Set Difference Operation on Numeric Columns

The given line of code performs a set difference operation between two sets: `rev_score_numeric_cols` and `common_numeric_cols`. It returns a set that contains all the elements present in `rev_score_numeric_cols` but not in `common_numeric_cols`.

In [None]:
set(rev_score_numeric_cols) - common_numeric_cols
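
The intersection and the per-target differences can be illustrated with plain Python sets. The column names below are invented for the example.

```python
# Made-up feature lists for three targets.
price_cols = ["beds", "bathrooms", "availability_30", "number_of_reviews"]
avail_cols = ["beds", "bathrooms", "price", "number_of_reviews"]
reviews_cols = ["beds", "bathrooms", "price", "availability_30"]

# set.intersection accepts any number of sets and keeps what all of them share.
common = set.intersection(set(price_cols), set(avail_cols), set(reviews_cols))

# The difference shows which features are specific to one target.
only_price = set(price_cols) - common

print(common, only_price)
```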

### Scoring Metrics Definition

This code block is responsible for defining the scoring metrics to be used for model evaluation. It includes F1 score, macro precision, and macro recall.

In [None]:
scoring = {'f1': 'f1',
           'prec_macro': 'precision_macro',
           'rec_macro': 'recall_macro'}
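
A dictionary in this form is what scikit-learn's `cross_validate` accepts as its `scoring` argument. The sketch below assumes that usage; the synthetic data and the `LogisticRegression` estimator are stand-ins chosen for the example, not the notebook's model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

scoring = {'f1': 'f1',
           'prec_macro': 'precision_macro',
           'rec_macro': 'recall_macro'}

# Synthetic binary classification problem, used only for illustration.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
clf = LogisticRegression(max_iter=1000)

# Each key of `scoring` produces a 'test_<key>' entry with one score per fold.
cv_results = cross_validate(clf, X_demo, y_demo, scoring=scoring, cv=3)
metric_keys = sorted(k for k in cv_results if k.startswith("test_"))
print(metric_keys)
```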

### Configuration Settings

The code block below sets the configuration flags for the target computation: whether to load targets from a file, whether to store targets to a file, whether to enable plots, and the filename used for the targets data.

In [None]:
load_targets_from_file = False ### Change to True to load a previous computation
store_targets_to_file = True
enable_plots = False
targets_filename = "checkpoints/targets_data.pickle"

### XGBClassifier Model Initialization

The code initializes an XGBClassifier model with 'mlogloss' as the evaluation metric and importance type set to "weight". The XGBClassifier is a part of the XGBoost library which is used for gradient boosting.

In [None]:
model = XGBClassifier(eval_metric='mlogloss', importance_type="weight")

### Loading Targets Data
The code checks if `load_targets_from_file` is `True`. If so, it loads the targets data from a specified file (`targets_filename`) using the `pickle` module.

In [None]:
if load_targets_from_file:
    print("Loading targets data from file", targets_filename)
    with open(targets_filename, 'rb') as handle:
        targets = pickle.load(handle)

### Feature Selection and Model Complexity Reduction

The code block below selects the most relevant features in order to reduce model complexity. 

1. It first checks whether target data needs to be computed from scratch. If so, it prepares the data for four different targets: price, availability over 365 days, number of reviews, and review score. 

2. It then generates a balanced dataset for each target and stores the balanced data, target vector, and excluded data in the respective target dictionary. 

3. It subsequently selects only the numeric columns from the balanced data, drops the identifier columns (`id`, `scrape_id`, `host_id`), and calls `accuracy_by_feature_reduction` to record how the model accuracy changes as features are removed. 

4. If plots are enabled, it generates a heatmap showing the correlation between features and a plot showing the accuracy of the model against the number of features. 

5. It manually defines an accuracy cut-off for each target and takes the largest threshold among the steps whose accuracy falls below that cut-off. 

6. Finally, it stores in the target dictionary, under `base_cols`, the features whose importance in the full model is below that threshold.

In [None]:
### Identify the most relevant features to reduce the model's complexity.

if not load_targets_from_file:
    print("Computing target data from scratch")
    targets = {
        "price": {
            "target_col": "price",
            "prepared_data": price_prepared_data_df,
            "threshold": price_prepared_data_df.price.median(),
            "safe": price_prepared_data_df.price.median()/100*3,
            "numeric_cols": price_numeric_cols,
        },
        "avail365": {
            "target_col": "availability_365",
            "prepared_data": avail365_prepared_data_df,
            "threshold": 0.5,  ## zero availability vs non zero availability for 365 days
            "safe": 0.1,
            "numeric_cols": avail365_numeric_cols

        },
        "num_reviews": {
            "target_col": "number_of_reviews_ltm",
            "prepared_data": rev_score_prepared_data_df,
            "threshold": 0.5,  ### we try to predict if the listing is reviewed (which means it was visited) or not
            "safe": 0.1,
            "numeric_cols": num_reviews_numeric_cols
        },
        "rev_score": {
            "target_col": "review_scores_value",
            "prepared_data": nrw_prepared_data_df,
            "threshold": 4.5,  ### we try to predict if the review rate (from 0 to 5) is greater than 4.5
            "safe": 0.1,
            "numeric_cols": rev_score_numeric_cols
        },    
    }

    for t in targets.keys():
        balanced_data, y, excluded_df = produce_balanced_dataset_safe(
            targets[t]["prepared_data"], targets[t]["threshold"], targets[t]["safe"], col = targets[t]["target_col"])
        targets[t]["balanced_data"] = balanced_data.dropna(axis=1, how = 'all').dropna()
        targets[t]["y"] = targets[t]["balanced_data"]["__y__"].copy()
        targets[t]["balanced_data"] = targets[t]["balanced_data"].drop(columns=["__y__"])

        targets[t]["excluded_data"] = excluded_df
        
    for t in targets.keys():
    #for t in ["rev_score"]:
        print("Target:",t, flush=True)
        numeric_cols = targets[t]["numeric_cols"]
        data_ok = targets[t]["balanced_data"].copy()[numeric_cols]  ## let's take just numeric columns
        y = targets[t]["y"] ## target vector
        X = data_ok.drop(['id', 'scrape_id', 'host_id'], axis = 1)
        print("X:", X.shape, flush=True)
        print("y:", y.shape, flush=True)
        res = accuracy_by_feature_reduction(X, y, model)
        targets[t]["results"] = res

        if enable_plots:
            df_corr = X.corr()
            df_corr_high = df_corr[df_corr>0.1]
            plt.figure(figsize=(8, 6))
            sns.heatmap(data=df_corr_high, cmap="Oranges", annot=False,  linewidths=0.1, linecolor="gray",xticklabels=True, yticklabels=True)
            plt.title('%s features correlation' % t)
            plt.show()

            plt.plot(res["partial_num_features"], res["partial_accuracies"])
            plt.title('%s accuracy vs features' % t)
            plt.show()      
        
        
    ### Accuracy cut-offs are defined manually inspecting the accuracy by num features curves
    targets["price"]["acc_cutoff"] = 0.75
    targets["avail365"]["acc_cutoff"] = 0.8
    targets["num_reviews"]["acc_cutoff"] = 0.7
    targets["rev_score"]["acc_cutoff"] = 0.65

    for t in targets.keys():
        p_acc = pd.Series(targets[t]["results"]["partial_accuracies"])
        p_thr = pd.Series(targets[t]["results"]["thresholds"])
        thresh = p_thr[p_acc < targets[t]["acc_cutoff"]].max()
        full_model = targets[t]["results"]["full_model"]
        keep_cols = full_model.feature_names_in_[full_model.feature_importances_ <  thresh]

        ## update target dictionary with columns to keep
        targets[t]["base_cols"] = keep_cols
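
The last loop of the cell can be followed on toy numbers. This sketch only mirrors the indexing logic; the accuracies, thresholds, and importances are invented, and the exact meaning of the values returned by `accuracy_by_feature_reduction` is defined in the helper module and assumed here.

```python
import numpy as np
import pandas as pd

# Invented results: accuracy drops as the importance threshold grows.
partial_accuracies = pd.Series([0.82, 0.80, 0.76, 0.71, 0.60])
thresholds = pd.Series([0.01, 0.02, 0.05, 0.10, 0.20])
acc_cutoff = 0.75

# Largest threshold among the steps whose accuracy is below the cut-off.
thresh = thresholds[partial_accuracies < acc_cutoff].max()

# Invented feature names and importances standing in for the fitted model's attributes.
feature_names = np.array(["a", "b", "c", "d"])
importances = np.array([0.05, 0.25, 0.10, 0.60])

# Same boolean-mask selection as in the cell.
keep_cols = feature_names[importances < thresh]
print(thresh, keep_cols)
```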

### Saving Target Data to File

This code block checks if the `store_targets_to_file` flag is enabled. If true, it saves the `targets` data to a file named `targets_filename` using Python's `pickle` module for serialization.

In [None]:
if store_targets_to_file:
    print("Saving target data to file", targets_filename)
    with open(targets_filename, 'wb') as handle:
        pickle.dump(targets, handle, protocol=pickle.HIGHEST_PROTOCOL)
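
A save and load round trip can be checked in isolation. The sketch below writes to a temporary directory instead of `checkpoints/`, and the dictionary content is made up; it also creates the parent folder first, since `open` fails when the directory does not exist.

```python
import os
import pickle
import tempfile

# Made-up stand-in for the targets dictionary.
targets_demo = {"price": {"acc_cutoff": 0.75, "base_cols": ["beds", "bathrooms"]}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "checkpoints", "targets_data.pickle")
    os.makedirs(os.path.dirname(path), exist_ok=True)  # make sure the folder exists

    with open(path, "wb") as handle:
        pickle.dump(targets_demo, handle, protocol=pickle.HIGHEST_PROTOCOL)

    with open(path, "rb") as handle:
        restored = pickle.load(handle)

print(restored == targets_demo)
```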

## Generate lookup files for vector embeddings using hot encoding

### Creating Compact Lookup Function

This function `create_lookup_compact` is used to create a lookup dictionary, a numpy array representation, and a final DataFrame for the given data. It performs several tasks:

1. It checks if a target dataset is provided, if not, it uses the prepared data.
2. It hot encodes the listing types in the data.
3. It transforms the amenities into a binary form using MultiLabelBinarizer.
4. It creates a final DataFrame by joining the original data with the hot encoded and binary transformed data.
5. It drops the 'name' column from the final DataFrame.
6. It converts the hot encoded columns to a boolean numpy array.
7. It packs every 8 boolean columns into one byte with `np.packbits`, then casts the result to float64 and divides by 255 so that each value lies in [0, 1].
8. It creates a lookup dictionary where each key is the 'id' from the final DataFrame and the value is the corresponding row from the packed numpy array.

In [None]:
def create_lookup_compact(X_full, prepared_data):
    if X_full is None: ## if we don't have target data we produce the lookup for all prepared data
        X_full = prepared_data["df_data"]
    df_listing_type_exploded = pd.get_dummies(X_full[listing_type_cols], columns=listing_type_cols) ## sf with listing type hot encoding
    mlb = MultiLabelBinarizer()
    df_amenities_tao_exploded = pd.DataFrame(mlb.fit_transform(X_full['amenities_tao_v']), columns=mlb.classes_, index=X_full.index)
    final_df = X_full[["id","name"]].copy()
    final_df = final_df.\
        join(df_listing_type_exploded, rsuffix="_lt").\
        join(prepared_data["df_freq_entities_filtered_exploded"], rsuffix="_db").\
        join(df_amenities_tao_exploded, rsuffix="_am")
    final_df.drop(columns={"name"}, inplace=True)

    ## Convert hot encoded columns to numpy array
    l = final_df.iloc[:,1:]
    V = l.to_numpy(dtype=bool)
    Vpack = np.packbits(V, axis = 1).astype('float64')/255
    lookup = {}
    for i,id in enumerate(list(final_df["id"])):
        lookup[str(id)] = Vpack[i,:]
    
    return lookup, Vpack, final_df
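
The packing step can be seen on a tiny boolean matrix. This is a minimal sketch with invented values; it also shows that the original bits can be recovered as long as the number of columns is known.

```python
import numpy as np

# Two rows of 10 boolean features each.
V = np.array([[1, 0, 0, 0, 0, 0, 0, 0, 1, 1],
              [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]], dtype=bool)

# Every 8 columns become one uint8; the last byte is zero-padded on the right.
packed = np.packbits(V, axis=1)

# Same scaling as in create_lookup_compact: floats in [0, 1].
Vpack = packed.astype("float64") / 255

# Undo the scaling and unpack, trimming the padding bits.
restored = np.unpackbits(np.rint(Vpack * 255).astype(np.uint8), axis=1)
restored = restored[:, :V.shape[1]].astype(bool)

print(packed.tolist(), Vpack.shape)
```

The packed vector has one eighth of the original width (rounded up), which is what makes this lookup "compact".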


### Creating Lookup Amenities Database

This function `create_lookup_amenities_dbp__text_dbp` is used to create a lookup dictionary and a numpy array from a given prepared data. The function performs several operations:

1. It extracts the data frame from the prepared data.
2. It performs one-hot encoding on the listing type columns.
3. It uses MultiLabelBinarizer to transform the 'amenities_dbp_classes_v' column.
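4. It creates a final dataframe by joining the listing type, DBpedia entity, and amenity encodings on the index (the `_lt`, `_db`, and `_am` suffixes are applied only to clashing column names) and drops the 'name' column.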
5. It converts the hot encoded columns to a numpy array.
6. It creates a lookup dictionary where each key is an 'id' and its value is the corresponding row from the numpy array. 

The function returns the lookup dictionary, the numpy array, and the final dataframe.

In [None]:
def create_lookup_amenities_dbp__text_dbp(prepared_data):
    X_full = prepared_data["df_data"]
    df_listing_type_exploded = pd.get_dummies(X_full[listing_type_cols], columns=listing_type_cols) ## sf with listing type hot encoding
    mlb = MultiLabelBinarizer()
    df_amenities_dbp_exploded = pd.DataFrame(mlb.fit_transform(X_full['amenities_dbp_classes_v']), columns=mlb.classes_, index=X_full.index)
    print("Hot encoding size for amenities mapped to DBpedia:",df_amenities_dbp_exploded.shape[1])
    final_df = X_full[["id","name"]].copy()
    final_df = final_df.\
        join(df_listing_type_exploded, rsuffix="_lt").\
        join(prepared_data["df_freq_entities_filtered_exploded"], rsuffix="_db").\
        join(df_amenities_dbp_exploded, rsuffix="_am")
    final_df.drop(columns={"name"}, inplace=True)

    ## Convert hot encoded columns to numpy array
    l = final_df.iloc[:,1:]
    V = l.to_numpy(dtype=bool)
    lookup = {}
    for i,id in enumerate(list(final_df["id"])):
        lookup[str(id)] = V[i,:]
    
    return lookup, V, final_df
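
The pattern shared by these lookup functions (dummy columns, multi-label binarization, index join, id-to-row dictionary) can be reproduced on three made-up listings. The column names below are illustrative and are not the notebook's.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Three invented listings; the last one has no amenities.
listings = pd.DataFrame({
    "id": [11, 12, 13],
    "room_type": ["Entire home", "Private room", "Entire home"],
    "amenities_v": [["wifi", "kitchen"], ["wifi"], []],
})

# One column per room type.
room_onehot = pd.get_dummies(listings[["room_type"]], columns=["room_type"])

# One column per distinct amenity; classes_ is sorted alphabetically.
mlb = MultiLabelBinarizer()
amenities_onehot = pd.DataFrame(mlb.fit_transform(listings["amenities_v"]),
                                columns=mlb.classes_, index=listings.index)

# Join on the index, then map each id (as a string) to its boolean row.
final_df = listings[["id"]].join(room_onehot).join(amenities_onehot)
V = final_df.iloc[:, 1:].to_numpy(dtype=bool)
lookup = {str(listing_id): V[i, :] for i, listing_id in enumerate(final_df["id"])}

print(list(final_df.columns))
print(lookup["11"])
```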

### Creating Lookup for Amenities and Text Database

This function `create_lookup_amenities_tao__text_dbp` is used to create a lookup for amenities and text data. It first checks if the target data is available, if not, it uses the prepared data. Then, it performs one-hot encoding on the listing type columns. The amenities are binarized and added to the dataframe. The final dataframe is created by joining the original dataframe with the exploded dataframes. The name column is dropped and the remaining columns are converted to a numpy array. A lookup dictionary is created where each key is an id and the value is the corresponding row from the numpy array. The function returns the lookup dictionary, the numpy array, and the final dataframe.

In [None]:
def create_lookup_amenities_tao__text_dbp(X_full, prepared_data):
    if X_full is None: ## if we don't have target data we produce the lookup for all prepared data
        X_full = prepared_data["df_data"]
    df_listing_type_exploded = pd.get_dummies(X_full[listing_type_cols], columns=listing_type_cols) ## sf with listing type hot encoding
    mlb = MultiLabelBinarizer()
    df_amenities_tao_exploded = pd.DataFrame(mlb.fit_transform(X_full['amenities_tao_v']), columns=mlb.classes_, index=X_full.index)
    final_df = X_full[["id","name"]].copy()
    final_df = final_df.\
        join(df_listing_type_exploded, rsuffix="_lt").\
        join(prepared_data["df_freq_entities_filtered_exploded"], rsuffix="_db").\
        join(df_amenities_tao_exploded, rsuffix="_am")
    final_df.drop(columns={"name"}, inplace=True)

    ## Convert hot encoded columns to numpy array
    l = final_df.iloc[:,1:]
    V = l.to_numpy(dtype=bool)
    lookup = {}
    for i,id in enumerate(list(final_df["id"])):
        lookup[str(id)] = V[i,:]
    
    return lookup, V, final_df


### Creating Lookup Amenities and Text TAO

This function `create_lookup_amenities_tao__text_tao` generates a lookup dictionary, a numpy array, and a dataframe. It first hot encodes the listing types and the TAO amenities from the prepared data. It then hot encodes the TAO entities extracted from the listing text and left-merges them onto the listing dataframe by `id`, filling the rows of listings without entities with zeros. The final dataframe is then converted to a numpy array, and a lookup dictionary is created mapping each id to its corresponding row. The function returns the lookup dictionary, the numpy array, and the final dataframe.

In [None]:
def create_lookup_amenities_tao__text_tao(prepared_data, df_entites_tao):
    X_full = prepared_data["df_data"]
    df_listing_type_exploded = pd.get_dummies(X_full[listing_type_cols], columns=listing_type_cols) ## sf with listing type hot encoding
    mlb = MultiLabelBinarizer()
    df_amenities_tao_exploded = pd.DataFrame(mlb.fit_transform(X_full['amenities_tao_v']), columns=mlb.classes_, index=X_full.index)
    
    df_entites_tao_exploded = pd.DataFrame(mlb.fit_transform(df_entites_tao['tao_entities_v']), columns=mlb.classes_, index=df_entites_tao.index)
    final_df_entites_tao = df_entites_tao.join(df_entites_tao_exploded)
    final_df_entites_tao.head()
    final_df_entites_tao.drop(columns={"tao_entities_v"}, inplace=True)
    
    final_df = X_full[["id","name"]].copy()
    final_df = final_df.\
        join(df_listing_type_exploded, rsuffix="_lt").\
        join(df_amenities_tao_exploded, rsuffix="_am")
    final_df.drop(columns={"name"}, inplace=True)

    final_df = final_df.merge(final_df_entites_tao, on = 'id', how = 'left').fillna(0)

    ## Convert hot encoded columns to numpy array
    l = final_df.iloc[:,1:]
    V = l.to_numpy(dtype=bool)
    lookup = {}
    for i,id in enumerate(list(final_df["id"])):
        lookup[str(id)] = V[i,:]
    
    return lookup, V, final_df
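
The left merge followed by `fillna(0)` is what gives listings without any extracted entity an all-zero vector. A minimal sketch with invented ids and entity columns:

```python
import pandas as pd

# Three listings with a non-default index, and entity encodings for only two of them.
listings = pd.DataFrame({"id": [1, 2, 3]}, index=[10, 20, 30])
entities = pd.DataFrame({"id": [1, 3], "Beach": [1, 0], "Museum": [0, 1]})

# how='left' keeps every listing; missing entity columns become NaN, then 0.
merged = listings.merge(entities, on="id", how="left").fillna(0)
V = merged.iloc[:, 1:].to_numpy(dtype=bool)

print(merged)
print(V.tolist())
```

Unlike `join`, merging on a column returns a fresh 0..n-1 index, which is why the functions build the lookup by position with `enumerate` and not by index label.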


### Creating a Lookup Text Database

This function, `create_lookup_text_dbp`, is designed to create a lookup text database from prepared data. It first extracts the necessary data from the input, joins relevant data frames, and removes unnecessary columns. Then, it converts hot encoded columns to a numpy array. Finally, it uses a loop to create a lookup dictionary where each key is an ID and the value is the corresponding row from the numpy array. The function returns the lookup dictionary, the numpy array, and the final data frame.

In [None]:
def create_lookup_text_dbp(prepared_data):
    X_full = prepared_data["df_data"]
    final_df = X_full[["id","name"]].copy()
    final_df = final_df.join(prepared_data["df_freq_entities_filtered_exploded"], rsuffix="_db")
    final_df.drop(columns={"name"}, inplace=True)

    ## Convert hot encoded columns to numpy array
    l = final_df.iloc[:,1:]
    V = l.to_numpy(dtype=bool)
    lookup = {}
    for i,id in enumerate(list(final_df["id"])):
        lookup[str(id)] = V[i,:]
    
    return lookup, V, final_df


### Creating Lookup Text for TAO Entities

This function, `create_lookup_text_tao`, is used to create a lookup table for TAO entities. It takes as input a preprocessed dataset and a DataFrame of TAO entities. It uses the `MultiLabelBinarizer` from `sklearn` to one-hot encode the TAO entities. The function then merges this with the original dataset, and converts the one-hot encoded columns to a numpy array. The final output is a dictionary where each key-value pair represents an ID and its corresponding one-hot encoded TAO entity vector, along with the numpy array of the one-hot encoded TAO entities and the final dataframe.

In [None]:
def create_lookup_text_tao(prepared_data, df_entites_tao):
    X_full = prepared_data["df_data"]
    mlb = MultiLabelBinarizer()
    df_entites_tao_exploded = pd.DataFrame(mlb.fit_transform(df_entites_tao['tao_entities_v']), columns=mlb.classes_, index=df_entites_tao.index)
    final_df_entites_tao = df_entites_tao.join(df_entites_tao_exploded)
    final_df_entites_tao.drop(columns={"tao_entities_v"}, inplace=True)
    
    final_df = X_full[["id","name"]].copy()
    final_df = final_df.merge(final_df_entites_tao, on = 'id', how = 'left').fillna(0)
    final_df.drop(columns={"name"}, inplace=True)

    ## Convert hot encoded columns to numpy array
    l = final_df.iloc[:,1:]
    V = l.to_numpy(dtype=bool)
    lookup = {}
    for i,id in enumerate(list(final_df["id"])):
        lookup[str(id)] = V[i,:]
    
    return lookup, V, final_df


### Amenity Lookup Table Creation

This function `create_lookup_amenities_no_kg` generates a lookup table for amenities of property listings. It first identifies frequently used amenities, then creates binary vectors for each listing to indicate the presence of these amenities. The function also hot encodes the listing type and combines this information into a final dataframe. The dataframe is then converted into a numpy array and a lookup dictionary is created for each listing ID. The function returns this lookup dictionary, the numpy array, and the final dataframe.

In [None]:
def create_lookup_amenities_no_kg(prepared_data, min_freq = 5):
    X_full = prepared_data["df_data"].copy()
    frequent_amenities = property_frequencies(X_full['amenities_all_v'], min_freq = min_freq) ## find the amenities used more than min_freq in all listings
    X_full['amenities_freq_v'] = X_full['amenities_all_v'].apply(lambda am_list: list(set(am_list).intersection(frequent_amenities))) ## for each listing only store the amenities in the frequent_amenity list
    df_listing_type_exploded = pd.get_dummies(X_full[listing_type_cols], columns=listing_type_cols) ## sf with listing type hot encoding
    mlb = MultiLabelBinarizer()
    df_amenities_no_kg_exploded = pd.DataFrame(mlb.fit_transform(X_full['amenities_freq_v']), columns=mlb.classes_, index=X_full.index)
    final_df = X_full[["id","name"]].copy()
    final_df = final_df.\
        join(df_listing_type_exploded, rsuffix="_lt").\
        join(df_amenities_no_kg_exploded, rsuffix="_am")
    final_df.drop(columns={"name"}, inplace=True)

    ## Convert hot encoded columns to numpy array
    l = final_df.iloc[:,1:]
    V = l.to_numpy(dtype=bool)
    lookup = {}
    for i,id in enumerate(list(final_df["id"])):
        lookup[str(id)] = V[i,:]
    
    return lookup, V, final_df
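
The helper `property_frequencies` is defined elsewhere and is not shown here. The sketch below uses a hypothetical stand-in, `frequent_items`, which assumes that an amenity is "frequent" when it appears in more than `min_freq` listings, following the inline comment in the function; the real helper may count differently.

```python
from collections import Counter

def frequent_items(lists, min_freq):
    """Hypothetical stand-in: items that occur in more than min_freq of the lists."""
    counts = Counter(item for items in lists for item in set(items))
    return {item for item, n in counts.items() if n > min_freq}

# Made-up amenity lists for three listings.
amenities = [["wifi", "kitchen"], ["wifi", "hot tub"], ["wifi", "kitchen", "piano"]]

frequent = frequent_items(amenities, min_freq=1)

# Keep, for each listing, only the amenities that are frequent overall.
filtered = [sorted(set(a) & frequent) for a in amenities]

print(sorted(frequent), filtered)
```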

### Creating a Lookup Table for Amenities

The function `create_lookup_amenities_tao()` takes a prepared dataset and creates a lookup table for amenities. It first performs one-hot encoding on the listing type columns, then uses a MultiLabelBinarizer to transform the 'amenities_tao_v' column into multiple binary columns. It then joins these new columns to the original dataset, removes the 'name' column, and converts the hot-encoded columns to a numpy array. Finally, it creates a lookup dictionary where each key is a listing id and the value is the corresponding row from the numpy array. The function returns this lookup dictionary, the numpy array, the final dataframe, and the number of columns in the amenities dataframe.

In [None]:
def create_lookup_amenities_tao(prepared_data):
    X_full = prepared_data["df_data"]
    df_listing_type_exploded = pd.get_dummies(X_full[listing_type_cols], columns=listing_type_cols) ## sf with listing type hot encoding
    mlb = MultiLabelBinarizer()
    df_amenities_tao_exploded = pd.DataFrame(mlb.fit_transform(X_full['amenities_tao_v']), columns=mlb.classes_, index=X_full.index)
    he_am_size = df_amenities_tao_exploded.shape[1]
    final_df = X_full[["id","name"]].copy()
    final_df = final_df.\
        join(df_listing_type_exploded, rsuffix="_lt").\
        join(df_amenities_tao_exploded, rsuffix="_am")
    final_df.drop(columns={"name"}, inplace=True)

    ## Convert hot encoded columns to numpy array
    l = final_df.iloc[:,1:]
    V = l.to_numpy(dtype=bool)
    lookup = {}
    for i,id in enumerate(list(final_df["id"])):
        lookup[str(id)] = V[i,:]
    
    return lookup, V, final_df, he_am_size

### Extracting Prepared Data

The code below extracts the `df_data` DataFrame from the `prepared_data` dictionary and stores it in the variable `pr`.

In [None]:
pr = prepared_data["df_data"]

## Hot encoding vectors

### Hot Encoding and Saving TAO Amenities

The code is responsible for creating a hot encoding for amenities using the TAO mapping. It then saves this encoding to a pickle file for future use. The size of the hot encoding is also printed out for reference.

In [None]:
## Hot encoding for just TAO amenities 

vector_features="he_listing_types-amenities-tao"
output_dir="bert_input_data/vectors_lookup"
dataset_name = "airbnb_london_20220910"

lookup_am_tao, V_am_tao, final_df, he_amenities_size = create_lookup_amenities_tao(prepared_data)
print("Hot encoding size for amenities mapped to TAO:",he_amenities_size)

with open(f"{output_dir}/all_{vector_features}_{dataset_name}_id2embedding.pickle", 'wb') as f:
    pickle.dump(lookup_am_tao, f)

### Hot Encoding Vector Optimization

This code is designed to optimize the size of a hot encoding vector using a range of cut-off values. It iteratively explores different cut-off points for amenity frequencies in a property dataset, aiming to achieve a hot encoding vector size that is as close as possible to a target size. The process begins with a broad step size to quickly narrow down the range, and then refines the cut-off value using a step size of 1. The final cut-off value and hot encoding size are then printed out. The initial cut-off value is set to 4, and the target hot encoding size is determined by the size of a previously defined hot encoding vector for amenities.

In [None]:
#### We want to use amenities names to create a hot encoding vector with (almost) the same size as the hot encoding vector produced when mapping amenities to TAO

def explore_cut_off(first_cut_off, last_cut_off, step_size, target_size):
    active_cut_off = first_cut_off
    old_cut_off = cut_off = first_cut_off  ## defaults in case the range below is empty
    for cut_off in range(first_cut_off, last_cut_off, step_size):
        old_cut_off = active_cut_off
        active_cut_off = cut_off
        freq_he_size = len(property_frequencies(prepared_data["df_data"]['amenities_all_v'], min_freq = active_cut_off))
        if freq_he_size < target_size:
            print("With cut off %s the hot encoding vector size is too small (%s)" % (cut_off, freq_he_size))
            break
    return old_cut_off, cut_off

target_he_size = he_amenities_size
#target_he_size = 600
#target_he_size = 701

num_dist_amenities = len(property_frequencies(prepared_data["df_data"]['amenities_all_v'], min_freq = 0)) ## number of distinct amenities
print("total number of distinct amenities:", num_dist_amenities)
initial_am_num = 4

first_cut_off, last_cut_off = explore_cut_off(initial_am_num, num_dist_amenities, 10, target_he_size)
final_cut_off, _ = explore_cut_off(first_cut_off, last_cut_off, 1, target_he_size)

final_he_size = len(property_frequencies(prepared_data["df_data"]['amenities_all_v'], min_freq = final_cut_off))
print("Cut off: %s; HE size: %s" % (final_cut_off, final_he_size))
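
The coarse-then-fine search can be tried on toy data with the sketch below. `frequent_items` is a hypothetical stand-in for `property_frequencies` (its exact semantics, for example whether the threshold is inclusive, are an assumption), and `explore` is a simplified variant of `explore_cut_off`, not a copy of it.

```python
from collections import Counter

def frequent_items(item_lists, min_freq):
    """Stand-in helper: items occurring at least min_freq times (assumed semantics)."""
    counts = Counter(item for items in item_lists for item in items)
    return [item for item, n in counts.items() if n >= min_freq]

def explore(item_lists, first, last, step, target_size):
    """Return (last cut-off that still met the target, first cut-off that did not)."""
    previous = current = first
    for current in range(first, last, step):
        if len(frequent_items(item_lists, current)) < target_size:
            break
        previous = current
    return previous, current

# Toy data: item a_i appears in exactly i listings, for i = 1..30.
item_lists = [[f"a{i}" for i in range(k, 31)] for k in range(1, 31)]
target_size = 12

# Coarse pass with a large step, then a fine pass with step 1 inside the bracket.
coarse_lo, coarse_hi = explore(item_lists, 1, 30, 5, target_size)
final_cut_off, _ = explore(item_lists, coarse_lo, coarse_hi, 1, target_size)
final_size = len(frequent_items(item_lists, final_cut_off))
print(coarse_lo, coarse_hi, final_cut_off, final_size)
```

The coarse pass only brackets the answer; the fine pass then needs at most one step-width of evaluations, which keeps the number of frequency computations small.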

### Hot Encoding of Amenities

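The code below creates a hot encoding of the frequently occurring amenities using only their names, without mapping them to TAO or DBpedia. Amenities are kept if they pass the cut-off found in the previous cell (`final_cut_off`), and the resulting size is included in the file name. The lookup is saved to a pickle file for later use.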

In [None]:
## Hot encoding for frequent amenities just using their names, WITHOUT mapping to TAO or DBPedia 

vector_features="he_listing_types-amenities-no-kg-"+str(final_he_size)
output_dir="bert_input_data/vectors_lookup"
dataset_name = "airbnb_london_20220910"
## features in vector: listing_types + frequent amenity names (no TAO or DBpedia mapping)
lookup_am_no_kg, V_am_no_kg, final_df = create_lookup_amenities_no_kg(prepared_data, min_freq = final_cut_off)
with open(f"{output_dir}/all_{vector_features}_{dataset_name}_id2embedding.pickle", 'wb') as f:
    pickle.dump(lookup_am_no_kg, f)



### Creating and Saving Lookup Table for Amenities and TAO

The code block initially specifies the vector features, output directory, and dataset name. It then creates a lookup table for amenities and TAO entities from the dataset. Finally, the lookup table is saved as a pickle file in the specified directory.

In [None]:
## Hot encoding for TAO amenities and TAO entities from listing description  

vector_features="he_listing_types-amenities-tao_text-tao"
output_dir="bert_input_data/vectors_lookup"
dataset_name = "airbnb_london_20220910"

lookup_tao, V_tao, final_df = create_lookup_amenities_tao__text_tao(prepared_data, df_tao_entities_from_text[["id","tao_entities_v"]])
with open(f"{output_dir}/all_{vector_features}_{dataset_name}_id2embedding.pickle", 'wb') as f:
    pickle.dump(lookup_tao, f)

### Creating and Saving TAO Entity Vector Features

The code creates hot encoding vector features from the TAO entities found in the listing descriptions (`df_tao_entities_from_text`). It then saves the resulting lookup to a pickle file for later use.

In [None]:
## Hot encoding for just TAO entities from listing description  

vector_features="he_text-tao"
output_dir="bert_input_data/vectors_lookup"
dataset_name = "airbnb_london_20220910"

lookup_txt_tao, V_txt_tao, final_df = create_lookup_text_tao(prepared_data, df_tao_entities_from_text[["id","tao_entities_v"]])
with open(f"{output_dir}/all_{vector_features}_{dataset_name}_id2embedding.pickle", 'wb') as f:
    pickle.dump(lookup_txt_tao, f)

### Creating Vector Features and Saving to File

The code first sets the vector feature name, the output directory, and the dataset name. It then creates a lookup table and vectors for the DBpedia entities extracted from the text of the prepared data. The lookup is serialized and saved to a pickle file in the specified directory.

In [None]:
## Hot encoding for just DBpedia entities from text

vector_features="he_dbpedia"
output_dir="bert_input_data/vectors_lookup"
dataset_name = "airbnb_london_20220910"

lookup_dbp, V_dbp, final_df = create_lookup_text_dbp(prepared_data)
with open(f"{output_dir}/all_{vector_features}_{dataset_name}_id2embedding.pickle", 'wb') as f:
    pickle.dump(lookup_dbp, f)

### Creating and Saving Vector Lookup for Amenities and Text Description

The code block is responsible for creating a lookup table for hot encoded amenities and text description using TAO and DBpedia. The lookup table is then saved as a pickle file for future use.

In [None]:
## Hot encoding for TAO AND DBpedia feature

## Hot encoding for amenities with TAO and description text with DBpedia
vector_features="he_listing_types-amenities-tao__dbpedia"
output_dir="bert_input_data/vectors_lookup"
dataset_name = "airbnb_london_20220910"

lookup_all_he, V_all_he, final_df = create_lookup_amenities_tao__text_dbp(None, prepared_data)
with open(f"{output_dir}/all_{vector_features}_{dataset_name}_id2embedding.pickle", 'wb') as f:
    pickle.dump(lookup_all_he, f)


### Creating Hot Encoding Lookup Table and Saving to File

This code block creates a lookup table for hot encoding based on the target data. It iterates over the keys of the `targets` dictionary, generates a lookup table for each target using the `create_lookup_amenities_tao__text_dbp` function, and stores the result in the `lookup_he_by_target` dictionary. It then saves each lookup table to a pickle file in the `bert_input_data/vectors_lookup` directory for future use. The filename of the pickle file is composed of the target name, vector features, and the dataset name.

In [None]:
##### Hot encoding divided by target data: used for backward compatibility
lookup_he_by_target = {}
for t in targets.keys():
    vector_features="he_listing_types-amenities-tao__dbpedia"
    output_dir="bert_input_data/vectors_lookup"
    dataset_name = "airbnb_london_20220910"

    lookup, V, final_df = create_lookup_amenities_tao__text_dbp(targets[t]["balanced_data"], prepared_data)
    lookup_he_by_target[t] = lookup
    with open(f"{output_dir}/{t}_{vector_features}_{dataset_name}_id2embedding.pickle", 'wb') as f:
        pickle.dump(lookup, f)

### Loading Pickle File

The code opens and reads one of the pickle files written above: the per-target lookup for the `price` target, which holds the TAO and DBpedia hot encoding for the Airbnb London listings. The loaded lookup is stored in the variable `content`.

In [None]:
with open("bert_input_data/vectors_lookup/price_he_listing_types-amenities-tao__dbpedia_airbnb_london_20220910_id2embedding.pickle", 'rb') as pickle_file:
    content = pickle.load(pickle_file)


### Hot Encoding for TAO and DBpedia Features

This code performs hot encoding for TAO and DBpedia features. 

##### Compact Hot Encoding for All Data
The first section of the code performs hot encoding for all data. It creates a lookup table for the compact hot encoding, then saves this lookup table to a pickle file for future use.

##### Compact Hot Encoding Divided by Target Data
The second section of the code performs hot encoding for each target in the target data. It creates a lookup table for each target, then saves these lookup tables to individual pickle files for future use. This is done for backward compatibility.

In [None]:
## Compact hot encoding for TAO AND DBpedia feature

##### Compact hot encoding for all data
vector_features="listing_types-amenities-tao__dbpedia"
output_dir="bert_input_data/vectors_lookup"
dataset_name = "airbnb_london_20220910"
## features in vector listing_types+amenities_tao+dbpedia
lookup_all_he_compact, V_all_compact, _ = create_lookup_compact(None, prepared_data)
with open(f"{output_dir}/all_{vector_features}_{dataset_name}_id2embedding.pickle", 'wb') as f:
    pickle.dump(lookup_all_he_compact, f)

##### Compact hot encoding divided by target data: used for backward compatibility
lookup_compact_he_by_target = {}
for t in targets.keys():
    vector_features="listing_types-amenities-tao__dbpedia"
    output_dir="bert_input_data/vectors_lookup"
    dataset_name = "airbnb_london_20220910"
    ## features in vector listing_types+amenities_tao+dbpedia
    lookup, V, _ = create_lookup_compact(targets[t]["balanced_data"], prepared_data)
    lookup_compact_he_by_target[t] = lookup
    with open(f"{output_dir}/{t}_{vector_features}_{dataset_name}_id2embedding.pickle", 'wb') as f:
        pickle.dump(lookup, f)


### Extracting and Comparing Key Sets from Different Data Sources

The code below collects the listing ids (keys) of the per-target lookup tables (`lookup_he_by_target`, `lookup_compact_he_by_target`) into two sets, and stores the ids of `lookup_tao`, which was built from all prepared data, in `all_keys`. The following cells compare these sets.

In [None]:
#### Check that all ids found in the lookup files for single targets are also in the lookup files created from all prepared data

all_keys=set(lookup_tao.keys())
all_keys_he_from_targets = set()
all_keys_he_compact_from_targets = set()
for t in targets.keys():
    ## accumulate the ids across targets (union with the running set, not with all_keys)
    all_keys_he_from_targets = all_keys_he_from_targets.union(set(lookup_he_by_target[t]))
    all_keys_he_compact_from_targets = all_keys_he_compact_from_targets.union(set(lookup_compact_he_by_target[t]))

### Compatibility Verification

The code checks backward compatibility for the hot encoding lookups. It asserts that every id taken from the single targets is also among the ids taken from all prepared data, which may contain more.

In [None]:
## backward compatibility check
## if we take all ids from prepared data we cover all ids taken from single targets and more
assert len(all_keys_he_from_targets - all_keys) == 0

### Checking Retro Compatibility

The code makes the same backward compatibility check for the compact hot encoding: no key in `all_keys_he_compact_from_targets` may be missing from `all_keys`.

In [None]:
## backward compatibility check
## if we take all ids from prepared data we cover all ids taken from single targets and more
assert len(all_keys_he_compact_from_targets - all_keys) == 0
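
The coverage idea behind these checks can be sketched with plain sets. The dictionaries below are hypothetical toy lookups, not the notebook's data; the point is only to show which set difference answers which question.

```python
# Hypothetical toy lookups: keys are listing ids, values are placeholders.
lookup_all = {1: "v1", 2: "v2", 3: "v3", 4: "v4"}
lookup_by_target = {
    "price": {1: "v1", 2: "v2"},
    "rating": {2: "v2", 3: "v3"},
}

all_ids = set(lookup_all)

# Accumulate the ids of every per-target lookup into one running set.
ids_from_targets = set()
for target_lookup in lookup_by_target.values():
    ids_from_targets |= set(target_lookup)

missing = ids_from_targets - all_ids   # ids a target needs but the global lookup lacks
extra = all_ids - ids_from_targets     # ids only the global lookup has

assert not missing                     # the global lookup covers every target
print(sorted(extra))
```

An empty `missing` set means the lookup built from all data can replace the per-target ones; a non-empty `extra` set is expected and harmless.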

### Data Consistency Verification

This cell performs a sanity check on the lookup tables. For one key (`my_key`), the length of the vector in `lookup_am_tao` plus the length of the vector in `lookup_dbp` must equal the length of the vector in `lookup_all_he`. This confirms that the combined encoding has the expected size.

In [None]:
## sanity check
## the dbpedia he + tao he should have the same length as the complete dbpedia + tao he
my_key = list(lookup_am_tao.keys())[0]
assert lookup_am_tao[my_key].shape[0] + lookup_dbp[my_key].shape[0] == lookup_all_he[my_key].shape[0]
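
A small sketch of why the lengths should add up, assuming the combined vector is a plain concatenation of the two partial encodings (the helper that builds it is not shown, so this is an assumption). The vectors are hypothetical.

```python
import numpy as np

amenity_vec = np.array([1, 0, 1, 0])   # hypothetical 4-dimensional amenity encoding
dbpedia_vec = np.array([0, 1, 1])      # hypothetical 3-dimensional DBpedia encoding

# Concatenation keeps every position of both inputs, so the lengths add up.
combined = np.concatenate([amenity_vec, dbpedia_vec])
assert amenity_vec.shape[0] + dbpedia_vec.shape[0] == combined.shape[0]
print(combined.shape[0])
```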

## Prepare train/dev/test for KGE-BERT

### Cleaning and Sorting Entities Function

This function drops rows with missing values from the input DataFrame, extracts `surface_form` from the `entity_data` column, removes duplicate entries based on `source_id`, `uri`, and `surface_form`, and extracts `pos` (the first element of `surface_char_pos`). It then sorts the DataFrame by `source_id` and `pos` and drops the helper columns `pos` and `surface_form` before returning the result. Note that `dropna` and `drop_duplicates` are called with `inplace=True`, so the DataFrame passed in is modified as well.

In [None]:
def clean_and_sort_entities(df: pd.DataFrame) -> pd.DataFrame:
    df.dropna(axis=0, inplace=True)
    df["surface_form"] = df.entity_data.apply(lambda e: e["surface_form"])
    df.drop_duplicates(["source_id", "uri", "surface_form"], inplace=True)
    df["pos"] = df.entity_data.apply(lambda e: e['surface_char_pos'][0])
    df = df.sort_values(by=["source_id", "pos"]).drop(columns=["pos","surface_form"])
    return df
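
The following sketch runs the same steps on a toy DataFrame. It uses a non-mutating variant of the function written for this example (an assumption, so the input is left untouched), and the rows and URIs are invented.

```python
import pandas as pd

def clean_and_sort_entities_copy(df):
    """Non-mutating variant of the cleaning steps, written for this example."""
    out = df.dropna(axis=0).copy()
    out["surface_form"] = out["entity_data"].apply(lambda e: e["surface_form"])
    out["pos"] = out["entity_data"].apply(lambda e: e["surface_char_pos"][0])
    out = out.drop_duplicates(["source_id", "uri", "surface_form"])
    return out.sort_values(by=["source_id", "pos"]).drop(columns=["pos", "surface_form"])

# Hypothetical entity rows: one duplicate mention and one row with missing values.
entities = pd.DataFrame({
    "source_id": [1, 1, 1, 2, 2],
    "uri": ["dbp:Garden", "dbp:Kitchen", "dbp:Garden", None, "dbp:Wifi"],
    "entity_data": [
        {"surface_form": "garden", "surface_char_pos": [40, 46]},
        {"surface_form": "kitchen", "surface_char_pos": [5, 12]},
        {"surface_form": "garden", "surface_char_pos": [80, 86]},
        None,
        {"surface_form": "wifi", "surface_char_pos": [0, 4]},
    ],
})

cleaned = clean_and_sort_entities_copy(entities)
print(cleaned["uri"].tolist())
```

Within each `source_id` the entities end up in the order in which they first appear in the text, which is what the sort on `pos` is for.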

### Preparing DataFrame for BERT Augmentation

This function prepares a DataFrame for BERT augmentation. It builds a new DataFrame with the columns `text` (taken from `text_col`), `title` (the listing `name`), `authors` (the listing `id` as a string), `over`, `under`, and the metadata columns. `over` is `Y_df` converted to 0/1 and `under` is its complement, so the two columns together form a two-class label. The function returns the newly created DataFrame.

In [None]:
def prepare_bert_augmented_df(X_df, Y_df, metadata_cols, text_col):
    df = pd.DataFrame()
    df["text"] = X_df[text_col] #.apply(lambda txt: pre_process(txt))
    df["title"] = X_df["name"]
    df["authors"] = X_df["id"].astype(str)
    df["over"] = Y_df*1
    df["under"] = 1-Y_df*1
    df[metadata_cols]=X_df[metadata_cols]
    return df
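
A quick way to see what the function produces is to call it on two invented listings. The function body is repeated so the example is self-contained; the column values and the metadata columns `bedrooms` and `beds` are assumptions.

```python
import pandas as pd

def prepare_bert_augmented_df(X_df, Y_df, metadata_cols, text_col):
    df = pd.DataFrame()
    df["text"] = X_df[text_col]
    df["title"] = X_df["name"]
    df["authors"] = X_df["id"].astype(str)
    df["over"] = Y_df * 1          # boolean label -> 1 for "over"
    df["under"] = 1 - Y_df * 1     # complement -> 1 for "under"
    df[metadata_cols] = X_df[metadata_cols]
    return df

# Hypothetical listings and boolean labels sharing the same index.
X = pd.DataFrame({
    "id": [11, 12],
    "name": ["Loft", "Studio"],
    "description": ["Bright loft", "Small studio"],
    "bedrooms": [2, 1],
    "beds": [3, 1],
})
Y = pd.Series([True, False])

demo_df = prepare_bert_augmented_df(X, Y, ["bedrooms", "beds"], "description")
print(demo_df)
```

Because the assignments align on the index, `X_df` and `Y_df` need matching indices; otherwise the label columns would contain missing values.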

### Saving BERT Augmented Data

The function `save_bert_augmented_data_as_pickle` prepares the training, development, and testing datasets with `prepare_bert_augmented_df` and saves each one to its own pickle file in the specified output directory. Each file holds the tuple `(df, metadata_cols, [], labels)`, where `labels` is `["over", "under"]`. The function returns the three prepared DataFrames for further use.

In [None]:
def save_bert_augmented_data_as_pickle(target_name, dataset_name, 
                                       X_train_df, Y_train_df, 
                                       X_dev_df, Y_dev_df,
                                       X_test_df, Y_test_df,
                                       text_col, metadata_cols, metadata_type_label, output_dir):
    
    train_df = prepare_bert_augmented_df(X_train_df, Y_train_df,metadata_cols, text_col)
    dev_df = prepare_bert_augmented_df(X_dev_df, Y_dev_df, metadata_cols, text_col)
    test_df = prepare_bert_augmented_df(X_test_df, Y_test_df, metadata_cols, text_col)
    labels = ["over", "under"]
    
    train_output_data = (train_df,metadata_cols,[],labels)
    with open(f"{output_dir}/train_{metadata_type_label}{target_name}_{dataset_name}.pickle", 'wb') as f:
        pickle.dump(train_output_data, f)

    dev_output_data = (dev_df,metadata_cols,[],labels)
    with open(f"{output_dir}/dev_{metadata_type_label}{target_name}_{dataset_name}.pickle", 'wb') as f:
        pickle.dump(dev_output_data, f)
    
    test_output_data = (test_df,metadata_cols,[],labels)
    with open(f"{output_dir}/test_{metadata_type_label}{target_name}_{dataset_name}.pickle", 'wb') as f:
        pickle.dump(test_output_data, f)
    
    return train_df, dev_df, test_df
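
Reading one of these files back means unpacking the same four-element tuple. The sketch below writes and reloads a toy tuple in a temporary directory; the file name and values are hypothetical, and the meaning of the empty third element is not stated in the notebook, so it is kept as-is.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd

train_df = pd.DataFrame({"text": ["Bright loft"], "over": [1], "under": [0], "bedrooms": [2]})
metadata_cols = ["bedrooms"]
labels = ["over", "under"]

with tempfile.TemporaryDirectory() as output_dir:
    # Hypothetical file name following the train_<label><target>_<dataset> pattern.
    path = Path(output_dir) / "train_price_demo_dataset.pickle"
    with open(path, "wb") as f:
        pickle.dump((train_df, metadata_cols, [], labels), f)

    # Unpack in the same order in which the tuple was written.
    with open(path, "rb") as f:
        loaded_df, loaded_meta, loaded_extra, loaded_labels = pickle.load(f)

print(loaded_meta, loaded_labels, loaded_df.shape)
```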

### Data Preparation and Cleanup

1. The variable `t` is set to "price".
2. A dictionary `metadata_type_labels` is defined to map metadata types to labels.
3. A copy of the "balanced_data" from the "price" target is made and stored in `X_full`.
4. The "description" column of `X_full` is copied to a new column "text_a".
5. Rows with empty descriptions are removed from `X_full`.
6. Metadata type is set to "base_cols".
7. A list of metadata columns is extracted from the "price" target.
8. Unwanted columns 'id', 'scrape_id', 'host_id' are removed from the metadata columns list.
9. The label for the metadata type is retrieved from the `metadata_type_labels` dictionary.
10. The target variable `y` is extracted from the "price" target.
11. A new dataframe `kbert_format_df` is created by selecting specific columns from `X_full` and joining with `y`, resetting the index, and renaming the "__y__" column to "label".

In [None]:
t = "price"
metadata_type_labels = { "base_cols": "", "numeric_cols":"all_meta_"}
X_full = targets[t]["balanced_data"].copy()
X_full["text_a"] = X_full["description"] #.apply(lambda t: t if len(t) <= max_length else t[0:max_length])
X_full = X_full[X_full.description !=""]
metadata_type = "base_cols"
metadata_cols = list(targets[t][metadata_type])
metadata_cols = [col for col in metadata_cols if col not in ['id', 'scrape_id', 'host_id']] ## remove unwanted columns if present
metadata_type_label = metadata_type_labels[metadata_type]
y = targets[t]["y"]
kbert_format_df = X_full[["id","text_a","name"] + metadata_cols].join(y)\
    .reset_index(drop=True)\
    .rename(columns={"__y__":"label"})



### Description Enhancement Functions

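The first four functions (`extend_description`, `extend_description_v2`, `extend_description_with_lodging_type`, `extend_description_with_lodging_type_v2`) extend the description of a property with its amenities and, for the last two, its lodging type (`property_type` and `room_type`). The first variant of each pair adds an introductory phrase, while the `_v2` variant appends the bare values. The two lodging-type functions also pass the result through `pre_process`.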

### Data Preprocessing

The code then copies the balanced data from the target to `X_full`. It finds the most frequent amenities in all listings and stores them in `frequent_amenities`. For each listing, it only stores the amenities that appear in `frequent_amenities`. 

### Description Extension Application

Finally, it applies the `extend_description` function to each row in `X_full`, extending each description with its frequent amenities.

In [None]:
def extend_description(description, amenities):
    new_description = "The following amenities are included: "
    for amenity in amenities:
        new_description = new_description + amenity + ", "
    return description + new_description

def extend_description_v2(description, amenities):
    new_description = ""
    for amenity in amenities:
        new_description = new_description + amenity + " "
    return description + new_description

def extend_description_with_lodging_type(description, amenities, property_type, room_type):
    new_description = extend_description(description, amenities) + " "
    listing_type_description = "We offer: "+ property_type + " " + room_type + " "
    return pre_process(listing_type_description + new_description)

def extend_description_with_lodging_type_v2(description, amenities, property_type, room_type):
    new_description = extend_description_v2(description, amenities) + " "
    listing_type_description = " "+ property_type + " " + room_type + " "
    return pre_process(new_description + listing_type_description)

X_full = targets[t]["balanced_data"].copy()
frequent_amenities = property_frequencies(X_full['amenities_all_v'], min_freq = final_cut_off) ## find the amenities used more than min_freq in all listings
X_full['amenities_freq_v'] = X_full['amenities_all_v'].apply(lambda am_list: list(set(am_list).intersection(frequent_amenities))) ## for each listing only store the amenities in the frequent_amenity list

X_full['ext_description'] = X_full.apply(lambda row: extend_description(row.description, row.amenities_freq_v), axis=1)
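
The difference between the two plain variants is easiest to see on a single invented listing. The two functions are repeated from the cell above so the example runs on its own; the lodging-type variants are left out because they depend on `pre_process` from the helper module. `sorted()` is added here only to make the amenity order reproducible, since a set intersection does not guarantee any particular order.

```python
def extend_description(description, amenities):
    new_description = "The following amenities are included: "
    for amenity in amenities:
        new_description = new_description + amenity + ", "
    return description + new_description

def extend_description_v2(description, amenities):
    new_description = ""
    for amenity in amenities:
        new_description = new_description + amenity + " "
    return description + new_description

# Hypothetical frequent-amenity list and listing.
frequent_amenities = {"Wifi", "Kitchen", "Heating"}
listing_amenities = ["Wifi", "Hot tub", "Kitchen"]
kept = sorted(set(listing_amenities).intersection(frequent_amenities))

sentence_style = extend_description("Cozy flat near the park. ", kept)
keyword_style = extend_description_v2("Cozy flat near the park. ", kept)
print(repr(sentence_style))
print(repr(keyword_style))
```

Neither function inserts a separator between the description and the added text, so a description that does not end in whitespace is joined directly to the first added word.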
    

### Function: numeric_metadata_to_text

This function concatenates the values of metadata columns with a description column into a single string. It iterates over each metadata column, converting the numeric metadata into a string and appending it to the description text.

### Function: numeric_metadata_to_text_v2

This function is a variant of the previous function. It also concatenates the values of metadata columns with a description column into a single string, but it omits the column names in the final text. It simply appends the numeric metadata values to the description text.

In [None]:
def numeric_metadata_to_text(row, metadata_cols, description_col):
    text = row[description_col] + " other features are "
    for col in metadata_cols:
        text = text + col + " is " + str(row[col]) + " "
    return text

def numeric_metadata_to_text_v2(row, metadata_cols, description_col):
    text = row[description_col] + " "
    for col in metadata_cols:
        text = text + " " + str(row[col]) + " "
    return text
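
Both functions only index the row by column name, so a plain dictionary is enough to try them out. The row below is hypothetical; the functions are repeated from the cell above.

```python
def numeric_metadata_to_text(row, metadata_cols, description_col):
    text = row[description_col] + " other features are "
    for col in metadata_cols:
        text = text + col + " is " + str(row[col]) + " "
    return text

def numeric_metadata_to_text_v2(row, metadata_cols, description_col):
    text = row[description_col] + " "
    for col in metadata_cols:
        text = text + " " + str(row[col]) + " "
    return text

# Hypothetical row; a dict stands in for a DataFrame row.
row = {"description": "Nice flat", "bedrooms": 2, "beds": 3}
cols = ["bedrooms", "beds"]

with_names = numeric_metadata_to_text(row, cols, "description")
values_only = numeric_metadata_to_text_v2(row, cols, "description")
print(repr(with_names))
print(repr(values_only))
```

The first variant keeps the column names, so the model sees which number belongs to which feature; the second leaves only the values, separated by double spaces.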

### Data Preprocessing and Saving for BERT Augmentation

This script handles the data preprocessing for the Airbnb dataset, specifically for BERT augmentation with Knowledge Graphs (KGE-BERT). It performs the following tasks:

1. Defines the split sizes (`dev_size` and `test_size` of 0.1 each) and the output directories.
2. Iterates over the keys of the `targets` dictionary, which contains the different target variables for the model, and for each target over a list of metadata directives of the form `type` or `type:processing`.
3. For each combination, it selects the metadata columns, removes unwanted columns (`id`, `scrape_id`, `host_id`), and builds the label used in the output file names.
4. It extends the textual description of the listings with the list of amenities (by name or mapped to TAO) and the type of lodging, and converts the numeric metadata to text.
5. It then splits the processed data into train, dev, and test sets.
6. If the directive asks for normalization, it fits a `MinMaxScaler` on the training split only and applies it to all three splits.
7. The processed data is then saved once per text variant (plain, extended with amenities, extended with amenities and lodging type, and so on).
8. Finally, it prints out the number of columns and samples in the train, dev, and test sets for each type of metadata.

In [None]:
dev_size = 0.1
test_size = 0.1
max_length = 512
dev_test_size = dev_size + test_size
test_size_fraction = test_size / dev_test_size
#dataset_name = "airbnb_london_20220910" ### Defined at the start of the notebook
metadata_type_labels = { "base_cols": "", "numeric_cols":"all_meta_"}
bert_augmented_output_dir = "bert_input_data" ## output dir for bert augmented with KG
kbert_output_dir = "kbert_input_data"    

#t = "price"
for t in targets.keys(): #["price"]: #
    print("Target: ", t)
    print("----------Bert augmented with kg-----------")
    
    for metadata_directive in ["base_cols", "base_cols:norm", "numeric_cols", "numeric_cols:normalize"]:
    #for metadata_directive in ["base_cols:norm", "numeric_cols:norm"]:
    #for metadata_directive in ["base_cols:norm"]:
        
        metadata_type = metadata_directive.split(":")[0]
        try:
            metadata_processing = metadata_directive.split(":")[1]+"_"
            print("Metadata processing method: ", metadata_processing)
            
        except IndexError: ## the directive has no ":" part, so no processing method is requested
            metadata_processing = ""
            print("No processing of metadata")
            
        #metadata_cols = list(targets[t]["base_cols"])
        metadata_cols = list(targets[t][metadata_type])
        metadata_cols = [col for col in metadata_cols if col not in ['id', 'scrape_id', 'host_id']] ## remove unwanted columns if present
        metadata_type_label = metadata_type_labels[metadata_type] + metadata_processing ## extend metadata_type_label with the processing label
        
        print("Metadata type label:", metadata_type_label)

        X_full = targets[t]["balanced_data"].copy()
        X_full["text_a"] = X_full["description"] #.apply(lambda t: t if len(t) <= max_length else t[0:max_length])
        X_full = X_full[X_full.description !=""]
        
        ### the column amenities_freq_v contains only the most frequent amenities, so that the hot encoding vector has size final_he_size
        frequent_amenities = property_frequencies(X_full['amenities_all_v'], min_freq = final_cut_off) ## find the amenities used more than min_freq in all listings
        X_full['amenities_freq_v'] = X_full['amenities_all_v'].apply(lambda am_list: list(set(am_list).intersection(frequent_amenities))) ## for each listing only store the amenities in the frequent_amenity list
        
        
        
        X_full['ext_description'] = X_full.apply(lambda row: extend_description(row.description, row.amenities_freq_v), axis=1) ## extend the textual description with the list of amenities
        X_full['ext_description_with_lodging_type'] = X_full.apply(
            lambda row: extend_description_with_lodging_type(row.description, row.amenities_freq_v, row.property_type, row.room_type), axis=1) ## extend the textual description with the list of amenities
        X_full['ext_description_with_lodging_type_v2'] = X_full.apply(
            lambda row: extend_description_with_lodging_type_v2(row.description, row.amenities_freq_v, row.property_type, row.room_type), axis=1) ## extend the textual description with the list of amenities

        X_full['ext_tao_description_with_lodging_type'] = X_full.apply(
            lambda row: extend_description_with_lodging_type(row.description, row.amenities_tao_v, row.property_type, row.room_type), axis=1) ## extend the textual description with the list of amenities
        X_full['ext_tao_description_with_lodging_type_v2'] = X_full.apply(
            lambda row: extend_description_with_lodging_type_v2(row.description, row.amenities_tao_v, row.property_type, row.room_type), axis=1) ## extend the textual description with the list of amenities

        X_full['ext_description_and_meta_with_lodging_type'] = X_full.apply(lambda row: numeric_metadata_to_text(row, metadata_cols, 'ext_description_with_lodging_type'), axis=1)
        X_full['ext_description_and_meta_with_lodging_type_v2'] = X_full.apply(lambda row: numeric_metadata_to_text_v2(row, metadata_cols, 'ext_description_with_lodging_type_v2'), axis=1)
        X_full['ext_tao_description_and_meta_with_lodging_type'] = X_full.apply(lambda row: numeric_metadata_to_text(row, metadata_cols, 'ext_tao_description_with_lodging_type'), axis=1)
        X_full['ext_tao_description_and_meta_with_lodging_type_v2'] = X_full.apply(lambda row: numeric_metadata_to_text_v2(row, metadata_cols, 'ext_tao_description_with_lodging_type_v2'), axis=1)
        
        # if metadata_processing == "norm_": ### we have to normalize metadata
        #     print("Normalizing metadata")
        #     X_full[metadata_cols] = normalize(X_full[metadata_cols], norm='l2')            
        #     break
        

        y = targets[t]["y"]
        kbert_format_df = X_full[[
                "id","text_a","name", 
                "ext_description", 
                'ext_description_with_lodging_type', 
                'ext_description_with_lodging_type_v2', 
                'ext_tao_description_with_lodging_type',
                'ext_tao_description_with_lodging_type_v2',
                'ext_description_and_meta_with_lodging_type',
                'ext_description_and_meta_with_lodging_type_v2',
                'ext_tao_description_and_meta_with_lodging_type',
                'ext_tao_description_and_meta_with_lodging_type_v2'] + metadata_cols].join(y)\
            .reset_index(drop=True)\
            .rename(columns={"__y__":"label"})

        kbert_format_df[["label"]] = kbert_format_df[["label"]] * 1
        X_train, X_dev_test, y_train, y_dev_test = train_test_split(kbert_format_df, kbert_format_df["label"], test_size=dev_test_size, random_state=DEFAULT_RANDOM_STATE, shuffle=True)
        X_dev, X_test, y_dev, y_test = train_test_split(X_dev_test, y_dev_test, test_size=test_size_fraction, random_state=DEFAULT_RANDOM_STATE, shuffle=True)

        
    
        if metadata_processing in ("norm_", "normalize_"): ### we have to normalize metadata (the directives above use both spellings)
            print("Normalizing metadata")
            
            # We fit the scaler on the training data only, to avoid leaking information into the dev and test splits
            # We use MinMaxScaler so that the training values fall in the range [0,1], comparable with the hot encoding
            scaler = MinMaxScaler()
            scaler.fit(X_train[metadata_cols])
            #print("Scaler data max: ", scaler.data_max_)
            
            #X_train[metadata_cols] = pd.DataFrame(normalize(X_train[metadata_cols], norm='l2'))
            X_train[metadata_cols] = scaler.transform(X_train[metadata_cols])
            X_dev[metadata_cols] = scaler.transform(X_dev[metadata_cols])
            X_test[metadata_cols] = scaler.transform(X_test[metadata_cols])
            
        
        ### Save dataset with plain description extracted from AirBnB data
        bert_kg_train_df, bert_kg_dev_df, bert_kg_test_df = save_bert_augmented_data_as_pickle(t, dataset_name, 
                                                                                               X_train, y_train, X_dev, y_dev, X_test, y_test, 
                                                                                        "text_a", metadata_cols, metadata_type_label, bert_augmented_output_dir)

        ### Save dataset where the description extracted from AirBnB data is extended with a list of included amenities
        bert_kg_train_ext_df, bert_kg_dev_ext_df, bert_kg_test_ext_df = save_bert_augmented_data_as_pickle(t+"_extended", dataset_name, 
                                                                                               X_train, y_train, X_dev, y_dev, X_test, y_test, 
                                                                                        "ext_description", metadata_cols, metadata_type_label, bert_augmented_output_dir)

        ### Save dataset where the description extracted from AirBnB data is extended with a list of included amenities and the listing type
        bert_kg_train_ext_lt_df, bert_kg_dev_ext_lt_df, bert_kg_test_ext_lt_df = save_bert_augmented_data_as_pickle(t+"_extended_lt", dataset_name, 
                                                                                               X_train, y_train, X_dev, y_dev, X_test, y_test, 
                                                                                        "ext_description_with_lodging_type", metadata_cols, metadata_type_label, bert_augmented_output_dir)
        ### 2nd version. Save dataset where the description extracted from AirBnB data is extended with a list of included amenities and the listing type
        bert_kg_train_ext_lt_df, bert_kg_dev_ext_lt_df, bert_kg_test_ext_lt_df = save_bert_augmented_data_as_pickle(t+"_extended_lt_v2", dataset_name, 
                                                                                               X_train, y_train, X_dev, y_dev, X_test, y_test, 
                                                                                        "ext_description_with_lodging_type_v2", metadata_cols, metadata_type_label, bert_augmented_output_dir)

        
        ### Save dataset where the description extracted from AirBnB data is extended with a list of included amenities (mapped to TAO) and the listing type
        bert_kg_train_tao_ext_lt_df, bert_kg_dev_tao_ext_lt_df, bert_kg_test_tao_ext_lt_df = save_bert_augmented_data_as_pickle(t+"_extended_tao_lt", dataset_name, 
                                                                                               X_train, y_train, X_dev, y_dev, X_test, y_test, 
                                                                                        "ext_tao_description_with_lodging_type", metadata_cols, metadata_type_label, bert_augmented_output_dir)
        
        ### 2nd version. Save dataset where the description extracted from AirBnB data is extended with a list of included amenities (mapped to TAO) and the listing type
        bert_kg_train_tao_ext_lt_df, bert_kg_dev_tao_ext_lt_df, bert_kg_test_tao_ext_lt_df = save_bert_augmented_data_as_pickle(t+"_extended_tao_lt_v2", dataset_name, 
                                                                                               X_train, y_train, X_dev, y_dev, X_test, y_test, 
                                                                                        "ext_tao_description_with_lodging_type_v2", metadata_cols, metadata_type_label, bert_augmented_output_dir)
               
        ### Save dataset where the description extracted from AirBnB data is extended with a list of included amenities, the listing type and the metadata converted to text
        bert_kg_train_ext_meta_lt_df, bert_kg_dev_ext_meta_lt_df, bert_kg_test_ext_meta_lt_df = \
                    save_bert_augmented_data_as_pickle(t+"_extended_meta_lt", dataset_name, 
                                                           X_train, y_train, X_dev, y_dev, X_test, y_test, 
                                                            "ext_description_and_meta_with_lodging_type", metadata_cols, metadata_type_label, bert_augmented_output_dir)

        ### 2nd version. Save dataset where the description extracted from AirBnB data is extended with a list of included amenities, the listing type and the metadata converted to text
        bert_kg_train_ext_meta_lt_df, bert_kg_dev_ext_meta_lt_df, bert_kg_test_ext_meta_lt_df = \
                    save_bert_augmented_data_as_pickle(t+"_extended_meta_lt_v2", dataset_name, 
                                                           X_train, y_train, X_dev, y_dev, X_test, y_test, 
                                                            "ext_description_and_meta_with_lodging_type_v2", metadata_cols, metadata_type_label, bert_augmented_output_dir)


        ### Save dataset where the description extracted from AirBnB data is extended with a list of included amenities (mapped to TAO), the listing type and the metadata converted to text
        bert_kg_train_tao_ext_meta_lt_df, bert_kg_dev_tao_ext_meta_lt_df, bert_kg_test_tao_ext_meta_lt_df = \
                    save_bert_augmented_data_as_pickle(t+"_extended_tao_meta_lt", dataset_name, 
                                                           X_train, y_train, X_dev, y_dev, X_test, y_test, 
                                                            "ext_tao_description_and_meta_with_lodging_type", metadata_cols, metadata_type_label, bert_augmented_output_dir)
        
        ### 2nd version. Save dataset where the description extracted from AirBnB data is extended with a list of included amenities (mapped to TAO), the listing type and the metadata converted to text
        bert_kg_train_tao_ext_meta_lt_df, bert_kg_dev_tao_ext_meta_lt_df, bert_kg_test_tao_ext_meta_lt_df = \
                    save_bert_augmented_data_as_pickle(t+"_extended_tao_meta_lt_v2", dataset_name, 
                                                           X_train, y_train, X_dev, y_dev, X_test, y_test, 
                                                            "ext_tao_description_and_meta_with_lodging_type_v2", metadata_cols, metadata_type_label, bert_augmented_output_dir)

        
        print("----------Metadata: %s -----------" % metadata_type)
        print("Num columns:", bert_kg_train_df.shape[1])
        print("Train samples:", bert_kg_train_df.shape[0])
        print("Dev samples:", bert_kg_dev_df.shape[0])    
        print("Test samples:", bert_kg_test_df.shape[0])      
        print("\n\n")   
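
Two details of the cell above are worth checking by hand: how the two split fractions combine, and what fitting the scaler on the training split alone implies. The sketch below uses NumPy instead of `train_test_split` and `MinMaxScaler` (a deliberate substitution to keep the example small); the min-max formula is the one `MinMaxScaler` applies with its default range, and the sample count and arrays are hypothetical.

```python
import numpy as np

dev_size, test_size = 0.1, 0.1
dev_test_size = dev_size + test_size             # share held out by the first split
test_size_fraction = test_size / dev_test_size   # share of the held-out part that becomes test

n = 1000  # hypothetical number of samples
n_dev_test = round(n * dev_test_size)
n_test = round(n_dev_test * test_size_fraction)
n_dev = n_dev_test - n_test
n_train = n - n_dev_test
print(n_train, n_dev, n_test)

# Min-max scaling with statistics taken from the training rows only.
train = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 20.0]])
dev = np.array([[2.0, 40.0]])

col_min = train.min(axis=0)
col_range = train.max(axis=0) - col_min
train_scaled = (train - col_min) / col_range
dev_scaled = (dev - col_min) / col_range
print(dev_scaled)
```

Training values land exactly in [0, 1], while dev and test values can fall outside that range when they exceed what was seen in training. That is the price of not leaking information from the held-out splits.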

    
    

### Plotting Bar Charts for Targets

The code iterates through a dictionary of targets, retrieving the target values and associated parameters for each target. It then generates a bar chart visualizing the distribution of target values that are above and below a specified threshold. The code also prints the shape of the target values and their counts.

In [None]:
for target_name in targets.keys():
    #target_name = "price"
    target_values = targets[target_name]["y"]
    THRESHOLD = targets[target_name]["threshold"]
    SAFE = targets[target_name]["safe"]
    plt.bar([target_name+" under threshold (%.2f)" % (THRESHOLD - SAFE) , target_name+" over threshold (%.2f)" % (THRESHOLD + SAFE)],target_values.value_counts().sort_index()) ## sort_index puts the under (False/0) count first, matching the label order
    plt.title('%s balanced data' % target_name)
    plt.show()
    print(target_name, target_values.shape)
    print(target_values.value_counts())
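
When fixed bar labels are paired with the result of `value_counts()`, the order of the counts matters. A minimal sketch with an invented boolean target, assuming `True` means "over threshold":

```python
import pandas as pd

# Hypothetical target: three listings over the threshold, two under.
y = pd.Series([True, True, True, False, False])

by_frequency = y.value_counts()             # most frequent value first
by_label = y.value_counts().sort_index()    # False first, then True

print(by_frequency.index.tolist())
print(by_label.index.tolist(), by_label.tolist())
```

Sorting by index gives a stable under-then-over order regardless of which class happens to be larger.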