# Part 3 - Text analysis and ethics

# 3.a Computing PMI

In this assessment you are tasked to discover strong associations between concepts in Airbnb reviews. The starter code we provide in this notebook is for orientation only. The below imports are enough to implement a valid answer.

### Imports, data loading and helper functions

We first connect our google drive, import pandas, numpy and some useful nltk and collections modules, then load the dataframe and define a function for printing the current time, useful to log our progress in some of the tasks.

In [1]:
# 21015647

In [2]:
import pandas as pd
from nltk.tag import pos_tag
import re
from collections import defaultdict,Counter
from nltk.stem import WordNetLemmatizer
from datetime import datetime
from tqdm import tqdm
import numpy as np
import os
tqdm.pandas()


# These are the additional libraries I have included myself:
import itertools
from collections import Counter
import math

  from pandas import Panel


In [3]:
# nltk imports, note that these outputs may be different if you are using colab or local jupyter notebooks
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize,sent_tokenize

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\c21015647\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\c21015647\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\c21015647\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [4]:
# load stopwords
sw = set(stopwords.words('english'))

# As there are reviews in spanish and french, we include the stopwords from those languages in the whole set (all_sw):

sw_french = set(stopwords.words('french'))
sw_spanish = set(stopwords.words('spanish'))
all_sw = list(set(itertools.chain.from_iterable([sw, sw_french, sw_spanish])))

In [5]:
p = 'data'
df = pd.read_csv(os.path.join(p,'reviews.csv'))
# deal with empty reviews
df.comments = df.comments.fillna('')

In [6]:
df.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,2818,1191,2009-03-30,10952,Lam,Daniel is really cool. The place was nice and ...
1,2818,1771,2009-04-24,12798,Alice,Daniel is the most amazing host! His place is ...
2,2818,1989,2009-05-03,11869,Natalja,We had such a great time in Amsterdam. Daniel ...
3,2818,2797,2009-05-18,14064,Enrique,Very professional operation. Room is very clea...
4,2818,3151,2009-05-25,17977,Sherwin,Daniel is highly recommended. He provided all...


In [7]:
df.shape

(452143, 6)

### 3.a1 - Process reviews

What to implement: A `function process_reviews(df)` that will take as input the original dataframe and will return it with three additional columns: `tokenized`, `tagged` and `lower_tagged`.

In [8]:
def process_reviews(df):
  # your code here

    '''
    Takes as argument a dataframe containing a column of text reviews/user comments (strings) and returns the same dataframe
    but with three extra natural language analysis columns, as described below.
    
    Args:
        A DataFrame containing a column called "comments". This column should be of type string.
        
    Returns:
        Original DataFrame but with three additional columns:
        
            tokenized: Comments column converted to list of strings, the words from original sentence (tokens).
            tagged: tokenized column after applying Part-of-speech analysis (list of tagged tokens or tuples).
            lower_tagged: tagged column with words converted to lowercase in order to reduce the vocabulary. 
            
    '''
 
    df['tokenized'] = df['comments'].progress_apply(word_tokenize)
        
    df['tagged'] = df['tokenized'].progress_apply(pos_tag)
    
    
    # We convert to lowercase AFTER having tagged the words:
    
    lowerize_tagged = lambda sentence: [(tup[0].lower(), tup[1]) for tup in sentence]
    
    df['lower_tagged'] = df['tagged'].progress_apply(lowerize_tagged)
    
      
    return df

In [9]:
df = process_reviews(df)

df.head(10)

100%|████████████████████████████████████████████████████████████████████████| 452143/452143 [03:15<00:00, 2309.43it/s]
100%|█████████████████████████████████████████████████████████████████████████| 452143/452143 [18:54<00:00, 398.69it/s]
100%|████████████████████████████████████████████████████████████████████████| 452143/452143 [00:47<00:00, 9586.70it/s]


Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,tokenized,tagged,lower_tagged
0,2818,1191,2009-03-30,10952,Lam,Daniel is really cool. The place was nice and ...,"[Daniel, is, really, cool, ., The, place, was,...","[(Daniel, NNP), (is, VBZ), (really, RB), (cool...","[(daniel, NNP), (is, VBZ), (really, RB), (cool..."
1,2818,1771,2009-04-24,12798,Alice,Daniel is the most amazing host! His place is ...,"[Daniel, is, the, most, amazing, host, !, His,...","[(Daniel, NNP), (is, VBZ), (the, DT), (most, R...","[(daniel, NNP), (is, VBZ), (the, DT), (most, R..."
2,2818,1989,2009-05-03,11869,Natalja,We had such a great time in Amsterdam. Daniel ...,"[We, had, such, a, great, time, in, Amsterdam,...","[(We, PRP), (had, VBD), (such, JJ), (a, DT), (...","[(we, PRP), (had, VBD), (such, JJ), (a, DT), (..."
3,2818,2797,2009-05-18,14064,Enrique,Very professional operation. Room is very clea...,"[Very, professional, operation, ., Room, is, v...","[(Very, RB), (professional, JJ), (operation, N...","[(very, RB), (professional, JJ), (operation, N..."
4,2818,3151,2009-05-25,17977,Sherwin,Daniel is highly recommended. He provided all...,"[Daniel, is, highly, recommended, ., He, provi...","[(Daniel, NNP), (is, VBZ), (highly, RB), (reco...","[(daniel, NNP), (is, VBZ), (highly, RB), (reco..."
5,2818,4748,2009-06-29,20192,Jie,Daniel was a great host! He made everything so...,"[Daniel, was, a, great, host, !, He, made, eve...","[(Daniel, NNP), (was, VBD), (a, DT), (great, J...","[(daniel, NNP), (was, VBD), (a, DT), (great, J..."
6,2818,5202,2009-07-07,23055,Vanessa,Daniele is an amazing host! He provided everyt...,"[Daniele, is, an, amazing, host, !, He, provid...","[(Daniele, NNP), (is, VBZ), (an, DT), (amazing...","[(daniele, NNP), (is, VBZ), (an, DT), (amazing..."
7,2818,9131,2009-09-06,26343,Katja,You can´t have a nicer start in Amsterdam. Dan...,"[You, can´t, have, a, nicer, start, in, Amster...","[(You, PRP), (can´t, VBP), (have, VB), (a, DT)...","[(you, PRP), (can´t, VBP), (have, VB), (a, DT)..."
8,2818,12103,2009-10-01,40999,Marie-Eve,Daniel was a fantastic host. His place is calm...,"[Daniel, was, a, fantastic, host, ., His, plac...","[(Daniel, NNP), (was, VBD), (a, DT), (fantasti...","[(daniel, NNP), (was, VBD), (a, DT), (fantasti..."
9,2818,16196,2009-11-04,38623,Graham,Daniel was great. He couldn.t do enough for us...,"[Daniel, was, great, ., He, couldn.t, do, enou...","[(Daniel, NNP), (was, VBD), (great, JJ), (., ....","[(daniel, NNP), (was, VBD), (great, JJ), (., ...."


### 3.a2 - Create a vocabulary

What to implement: A function `get_vocab(df)` which takes as input the DataFrame generated in step 1.c, and returns two lists, one for the 1,000 most frequent center words (nouns) and one for the 1,000 most frequent context words (either verbs or adjectives). 

In [12]:
def get_vocab(df):
  # your code here

    '''
    Takes a dataframe containing a column been analyzed with part-of-speech, and returns the top 1000 most common words 
    for central and context vocabulary.
    
    Arguments:
        df: Dataframe as described above.
    
    Returns:
        cent_vocab: Top 1000 most common central words (nouns).
        cont_vocab: Top 1000 most common context words (verbs or adjectives).
    
    '''

    def noun_filter(tagged_list):
        '''
        Nested function to filter the tags from a pos-tagged sentence. Keeps only Nouns.
        
        Arguments:
            tagged_list: list of tagged words (list of tuples, in the form (word, tag) ).
            
        Returns:
            Filtered input list, having left only tagged words (tuples) whose tags start with 'N'
        
        '''        
        return [tup[0] for tup in tagged_list if tup[1][0] == 'N' ]
    
    
    def verb_adj_filter(tagged_list):
        '''
        Nested function to filter the tags from a pos-tagged sentence. Keep only Verbs or Adjectives.
        
        Arguments:
            tagged_list: list of tagged words (list of tuples, in the form (word, tag) )
            
        Returns:
            Filtered input list, having left only tagged words (tuples) whose tags start with 'V' or 'J'
        
        '''    
        return [tup[0] for tup in tagged_list if (tup[1][0] == 'V' or tup[1][0] == 'J') ]
    

    
            # Filter the tagged reviews series into Nouns and Verbs & Adjectives:
        
        
    # We apply the above functions to filter the column lower_tagged into nouns_only and ver_adj_only.
    
    # To reduce the number of potential accidental captures, we apply a lambda function to remove stopwords.
    # The function will also filter out words of length 1, to discard possible accidental punctuation captures.
    remove_sw = lambda sentence : [word for word in sentence if ((not word in all_sw) and (len(word)>1))]
    
    df['nouns_only'] = df['lower_tagged'].progress_apply(noun_filter).progress_apply(remove_sw)
    
    # Repeat the same step for verbs and adjectives.
   
    df['verb_adj_only'] = df['lower_tagged'].progress_apply(verb_adj_filter).progress_apply(remove_sw)
    
    
    
    # Merge the above series into a single list, for either nouns or verb/adj :
    
    all_nouns = list(itertools.chain.from_iterable(df['nouns_only']))
    all_verb_adj = list(itertools.chain.from_iterable(df['verb_adj_only']))
    
     
    # Use Counter to obtain a dictionary of unique values and their frequency. 
    # Then CHAIN with most_common to keep the top 1000 words with higher frequency.
    
    cent_vocab = Counter(all_nouns).most_common(1000)
    cont_vocab = Counter(all_verb_adj).most_common(1000)
    
    
    # We have obtained a list of tuples in the form (word, frequency).
    # We are only interested in the words, not the frequency. Filter to remove frequencies:
    
    cent_vocab = [word[0] for word in cent_vocab]
    cont_vocab = [word[0] for word in cont_vocab]
    
    
    return cent_vocab, cont_vocab


In [13]:
cent_vocab, cont_vocab = get_vocab(df)



100%|████████████████████████████████████████████████████████████████████████| 452143/452143 [01:01<00:00, 7335.36it/s]
100%|████████████████████████████████████████████████████████████████████████| 452143/452143 [02:02<00:00, 3678.21it/s]
100%|████████████████████████████████████████████████████████████████████████| 452143/452143 [01:03<00:00, 7118.93it/s]
100%|████████████████████████████████████████████████████████████████████████| 452143/452143 [01:41<00:00, 4446.58it/s]


### 3.a3 Count co-occurrences between center and context words

With these two 1,000-word vocabularies, create a co-occurrence matrix where, for each center word, you keep track of how many of the context words co-occur with it. Consider this short review with only one sentence as an example, where we want to get co-occurrences for verbs and adjectives for the center word restaurant:

a. ‘A big restaurant served delicious food in big dishes’
{‘restaurant’: {‘big’: 2, ‘served’:1, ‘delicious’:1}}


What to implement: A function `get_coocs(df, center_vocab, context_vocab)` which takes as input the DataFrame generated in step 1, and the lists generated in step 2 and returns a dictionary of dictionaries, of the form in the example above. It is up to you how you define context (full review? per sentence? a sliding window of fixed size?), and how to deal with exceptional cases (center words occurring more than once, center and context words being part of your vocabulary because they are frequent both as a noun and as a verb, etc). Use comments in your code to justify your approach. 

In [52]:
def get_coocs(df, cent_vocab, cont_vocab):
  # your code here

    '''
    
    Args:
        df: DataFrame with a column containing text entries, relative to the vocabulary.
        cent_vocab: List of most common nouns.
        cont_vocab: List of most common context words.
        
    Returns:
        A DataFrame showing relative frequencies between nouns and their surrounding context words.
    
    
    '''

    # Create a default dictionary to start collecting keys (center words) as they turn up during the search.
    # The default dictionary value is a list, that will collect all the context words for each key (center word).
    
    freqs = defaultdict(list)
      
    
    '''
    To be honest the below decision has been made as a compromise between allowed computation time and sample evaluation of
    user comments.
    
    Due to the lenght of the comments I considered a valid answer to evaluate the context as a full review. In this way,
    two labmda functions have been defined to split the tokenized comments between central and context words contained on
    each entry of the Data Frame. Duplicated entries of central words have been clustered using the set function.
    
    As the comments are tend to be concise and focused in a one or two topics, will associate the context words to the
    central words appearing on each entry iterating by a loop through the whole Data Frame.
    
    '''
    
    # Create two additional columns, containing the centre and context words for each comment/entry in the dataframe:
    
    cent_vocab_filter = lambda comment : set([word for word in comment if word in cent_vocab])
    cont_vocab_filter = lambda comment : [word for word in comment if word in cont_vocab]
     
    df['comments_cent_vocab'] = df['tokenized'].progress_apply(cent_vocab_filter)    
    df['comments_cont_vocab'] = df['tokenized'].progress_apply(cont_vocab_filter)

    
    # Iterate through the dataframe and associate centre words with context words, for each comment/entry:
    for index in tqdm(df.index):
        
        for cent_word in df.loc[index, 'comments_cent_vocab']:
            freqs[cent_word] = freqs[cent_word] + df.loc[index, 'comments_cont_vocab']
            
        
        
    # At this point we've got a dictionary whose keys are the centre words and corresponding values the raw occurrences of
    # context words around them. Now we apply the last step which is clustering repeating occurrences in nested dictionaries.
        
    coocs = defaultdict(dict)
    
    for center_word in tqdm(cent_vocab):
        coocs[center_word] = dict(Counter(freqs[center_word]))
            
                                        
    return coocs  

In [None]:
# I've been unable to apply the function to the whole dataframe due to excessively long computation times. <<<<<<<<<<

# Tried and think about a more efficient/simple function but that was the most simple version I could find.
# As a result, I will apply the get_coocs function to the first 10000 entries of the df only..

coocs = get_coocs(df[:10000], cent_vocab, cont_vocab)


### 3.a4 Convert co-occurrence dictionary to 1000x1000 dataframe
What to implement: A function called `cooc_dict2df(cooc_dict)`, which takes as input the dictionary of dictionaries generated in step 3 and returns a DataFrame where each row corresponds to one center word, and each column corresponds to one context word, and cells are their corresponding co-occurrence value. Some (x,y) pairs will never co-occur, you should have a 0 value for those cases. 

In [58]:
def cooc_dict2df(coocs):
  # your code here

    '''
    Transforms a dictionary of dictionaries into a DataFrame showing co-occurrence values.
    
    Args:
        coocs: Dictionary of dictionaries.
    
    Returns:
        coocdf.T : Transposed DataFrame of co-occurrence values.
    
    '''

    # Panda's DataFrame function takes the dictionary as input, uses the primary keys of the dictionary to build the columns.
    # The nested keys from the inner-dictionaries will be taken as the index series for the dataframe.
    
    coocdf = pd.DataFrame(coocs).fillna(0)
    
    # The above DataFrame has been build with center words as columns and context words as rows.
    # As the exercise asks to return the center words as rows, we transpose the DataFrame using the T function on the output.
    
    return coocdf.T

In [59]:
# Please note only the first 10000 entries from the df were taken on previous step due to excessively long computation times.

coocdf = cooc_dict2df(coocs)
print(coocdf.shape)

coocdf

(1000, 998)


Unnamed: 0,cool,nice,clean,quiet,use,didnt,finding,come,back,amazing,...,wenn,schöne,wirklich,noch,diese,jeden,sie,wurden,med,não
place,80.0,935.0,883.0,372.0,109.0,9.0,16.0,177.0,347.0,306.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
apartment,35.0,598.0,601.0,266.0,83.0,3.0,6.0,115.0,222.0,187.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
amsterdam,1.0,58.0,36.0,15.0,5.0,3.0,0.0,15.0,12.0,12.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
location,61.0,804.0,873.0,389.0,110.0,12.0,8.0,170.0,365.0,305.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
host,35.0,656.0,691.0,238.0,78.0,7.0,9.0,128.0,225.0,237.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
reinhart,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
cama,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0
lights,0.0,9.0,4.0,3.0,5.0,0.0,0.0,0.0,1.0,5.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
estancia,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### 3.a5 Raw co-occurrences to PMI scores

What to implement: A function `cooc2pmi(df)` that takes as input the DataFrame generated in step 4, and returns a new DataFrame with the same rows and columns, but with PMI scores instead of raw co-occurrence counts. 

In [60]:
def cooc2pmi(df):
  # your code here

    '''
    Converts a DataFrame of raw occurrences center word vs context word into a PMI score matrix.
    PMI formula follows a frequentist probability calculation.
    
    Args:
    
        DataFrame containing context words as columns, center words as rows. Values show the co-occurrence values.
    
    Returns:
    
        Transformed DataFrame into PMI score matrix.
    
    '''

    # N is the sum of total occurrences in the DataFrame. This will be required to calculate probabilities later on.
    N = df.to_numpy().sum()

    # We calculate PMI scores column by column:
    for column in tqdm(df.columns):
    
        # We follow the PMI formula calculating probabilities of:
        # pw - center words in any context.
        # pc - context words for any center word.
        # pwc - joint appearances of each center and context word.
        
        # Being already in a fixed column, calculate PMI scores row by row following df.index (contains all rows):
        for row in df.index:
            pwc = ( df.loc[row, column] ) / N # x/y located value
            pw = ( df.loc[row, :].sum() ) / N # sum of all values for this row
            pc = ( df[column].sum() ) / N # sum of all values for this column
        
            # print('pwc, pw, pc: ', pwc, pw, pc) # Left for debugging
        
            # Exception handling for math domain. Handle values of zero and also convert negative results to zero.
            if (pwc == 0) or (pw*pc == 0):
                df.loc[row, column] = 0
        
            else:
                try:
                    pmi = math.log( (pwc / (pw*pc) ), 2)
            
                except:
                    df.loc[row, column] = 0
        
                if pmi > 0:
                    df.loc[row, column] = pmi
                else:
                    df.loc[row, column] = 0
    

    # As the above strategy was overwritting the input DataFrame (df) using the .loc method, we copy the buffer df into 
    # the return required output pmidf:
    
    pmidf = df
    
    return pmidf

In [62]:
# Please note only the first 10000 entries from the df were taken on previous steps due to excessively long computation times.

pmidf = cooc2pmi(coocdf)
print(pmidf.shape)

pmidf

100%|████████████████████████████████████████████████████████████████████████████████| 998/998 [06:46<00:00,  2.45it/s]

(1000, 998)





Unnamed: 0,cool,nice,clean,quiet,use,didnt,finding,come,back,amazing,...,wenn,schöne,wirklich,noch,diese,jeden,sie,wurden,med,não
place,0.0,0.000000,0.000000,0.000000,0.000000,0.00000,0.0,0.000000,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000
apartment,0.0,0.000000,0.000000,0.000000,0.000000,0.00000,0.0,0.000000,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000
amsterdam,0.0,0.000000,0.000000,0.000000,0.000000,3.04118,0.0,0.163855,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000
location,0.0,0.000000,0.000000,0.000000,0.000000,0.00000,0.0,0.000000,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000
host,0.0,0.000000,0.000000,0.000000,0.000000,0.00000,0.0,0.000000,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
reinhart,0.0,0.000000,0.000000,0.000000,0.000000,0.00000,0.0,0.000000,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000
cama,0.0,0.000000,0.000000,0.000000,0.000000,0.00000,0.0,0.000000,5.926649,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,7.562796
lights,0.0,3.902002,3.429032,3.493542,4.278567,0.00000,0.0,0.000000,2.763578,4.051647,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000
estancia,0.0,0.000000,0.000000,0.000000,0.000000,0.00000,0.0,0.000000,0.000000,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000


### 3.a6 Retrieve top-k context words, given a center word

What to implement: A function `topk(df, center_word, N=10)` that takes as input: (1) the DataFrame generated in step 5, (2) a `center_word` (a string like `‘towels’`), and (3) an optional named argument called `N` with default value of 10; and returns a list of `N` strings, in order of their PMI score with the `center_word`. You do not need to handle cases for which the word `center_word` is not found in `df`. 

In [64]:
def topk(df, center_word, N=10):
  # your code here

    '''
    Given a DataFrame in the way of a matrix of PMI scores of center words vs context words, returns the top-k context words
    for a given center word.
    
    Args:
    
        df: DataFrame in the way of a matrix of PMI scores of center words vs context words.
            Columns are context words, Rows are center words.
            
        center_word: The word (row) from which we retrieve the top-k context words (columns).
        
        N: The number of context word we wish to retrieve, sorted by PMI score.
        
    Returns:
    
        top_word[:N]: A list of strings, the top-N context words sorted by PMI score, for a given center word.
    
    
    '''
    top_words = list(df.loc[center_word, :].sort_values(ascending = False).index)

    return top_words[:N]

In [65]:
topk(pmidf, 'coffee')

['buy',
 'complimentary',
 'written',
 'numerous',
 'wide',
 'waking',
 'regular',
 'older',
 'delicious',
 'save']

# 3.b Ethical, social and legal implications



Local authorities in touristic hotspots like Amsterdam, NYC or Barcelona regulate the price of recreational apartments for rent to, among others, ensure that fair rent prices are kept for year-long residents. Consider your price recommender for hosts in Question 2c. Imagine that Airbnb recommends a new host to put the price of your flat at a price which is above the official regulations established by the local government. Upon inspection, you realize that the inflated price you have been recommended comes from many apartments in the area only being offered during an annual event which brings many tourists, and which causes prices to rise. 

In this context, critically reflect on the compliance of this recommender system with **one of the five actions** outlined in the **UK’s Data Ethics Framework**. You should prioritize the action that, in your opinion, is the weakest. Then, justify your choice by critically analyzing the three **key principles** outlined in the Framework, namely _transparency_, _accountability_ and _fairness_. Finally, you should propose and critically justify a solution that would improve the recommender system in at least one of these principles. You are strongly encouraged to follow a scholarly approach, e.g., with peer-reviewed references as support. 

Your report should be between 500 and 750 words long.  

### Your answer here. No Python, only Markdown.

Write your answer after the line.

---

The main conflict with the framework would be with action 3 - comply with the law.

Suggesting a price not in accordance with local regulations could end up in the user either disregarding the law or causing confussion, as he/she may not be aware of the regulations, considering the returned prices within the regular band.

On both cases a system recommending prices above the maximum could result in users incited to post properties above the maximum legal price, causing the conflict.


Fairness:

These inflated prices effect could have more or less effect in the end-used depending on his/her Bias: For example, one user looking for properties who may be unaware of the event may tolerate aumented prices in big cities such as Barcelona, as he/she may not be familiar with the regular prices but could end up paying a higher price because of having heard that big touristic cities are way more expensie. However should this event/circumstance occurr in a small town, the user may investigate the reason bit further. 

From the ACM code of ethics, in terms of malfunctions and social consequences, an increase in price for requesting users could leave low-income-users stranded as they could not afford to get one property, and therefore those willing attend the event from outside but having a reduced budget would get excluded. Going beyond, charity associations or other well-intended movements/associations without profit spirit - who may want to be involved in the event - wouldn't be able to afford the properties either, therefore they couldn't participate in the even. This will result not just in an handicap for these associations, but also in the event itself, as we are preventing these sort of groups from participating and therefore having an impact on the attendance profile. It is critical that the algorithm is designed with good will: In "Algorithms, governance and regulation: beyond 'the necessary hasthtags. Leighton Andrews, Cardiff Business School, Cardiff University, Wales" is it discussed the harm of "algorithmic lawbreaking", where algorithms are intentionally designed to deceive lawmakers and regulators. The arise of these algorithms has lead to stronger regulations and inspections therefore we must ensure our project is well documented and showing the actual intention of the recommender system.


Accountability:


One protocol to ensure unfairness is eliminated, could be to introduce a final step before returning the final recommended price: Apart from checking the result against the local regulations, we could perform a t-test to compare the entries analyzed within the given timeframe with the rest of the data, or at least a bigger sample. Based on these considerations we could have two options: To max-out the maximum returned value when exceeding the regulations or to trigger a warning depending of the significance of the t-test results. This would be in accordance with to political initiatives already taken in cities such as Barcelona, to promote ethical, citizen-centric, data-driven policymaking (Calzada I. & Almirall, E. (2019) Barcelona's grassroots-led urban experimentation: Deciphering the 'data commons' policy scheme), where the Data commons initiative puts the citizen in the centre of the use of data-driven technologies, before bussiness related strategies.


Transparency:

As the warnings could be disregarded we would need to make the choice of returning the actual calculated value or the maxed-out. Following the article "Evaluating Predictive Algorithms" by David Demortain and Bilel Benbouzid, algorithms such as PredPol to predict crime, returned predictions are impossible to validate, for example, the algorith predicts an high crime-rate in a certain area, then because that predicted zone is then patrolled by the police, the predicted crimes do not happen, but not because of a fail in the prediction, but because of the preventive actions taken beforehand. (Algorithmic Regulation, The London School of Economics and Political Science. Discussion Paper No: 85, September 2017). Because of this, the algorithms, intentions and continous reviews of the code and implementation need to be clearly documented and communicated to the public, so in this way unexpected behaviours can be communicated and corrected.



The solution to improve the system has already been discussed during the review of the above principles. The system could evaluate the returned results against a bigger sample and then either max-out the output comparing against a table of government regulations, of triggering a warning saying that the system has detected that the recommended price may exceed the maximum limit as per the local regulations and encouraging the user to consult the local authorities.
...