### In this notebook I use a natural language processing (NLP) method with scikit-learn to spot potential multihosts in the Bologna Airbnb host community.
### I do that by exploiting the textual descriptions of the listings (above all the "description", "space" and "summary" columns) in an Inside Airbnb dataset (http://insideairbnb.com/get-the-data.html by Murray Cox). The reason is that many listings seem to have very similar descriptions although they belong to different hosts. Apparently there is a good deal of copy-pasting among hosts that MAYBE belong to the same organization. Why would anyone do that? Why would a big commercial Airbnb player create different "host" entities to manage its properties? Look at this:

In [3]:
import pandas as pd

# I picked the latest detailed dataset among the Bologna data on Inside Airbnb:
listings = pd.read_csv("bologna/detailed/02_2020_text.csv")

set(listings[listings['calculated_host_listings_count'] > 10]\
    .sort_values('calculated_host_listings_count', ascending=False).host_name)

{'28',
 'Agostino',
 'Alessandro',
 'Andrea',
 'Arianna',
 'Bnbbologna',
 'Canguro Properties',
 'Flat In Italy',
 'Franco',
 'GetTheKey',
 'Halldis Apartments & Villas',
 'Luca',
 'Marco',
 'Michela Borghetto',
 'Rambaldo',
 'Realkasa',
 'Silvia',
 'Stefano',
 'The Place',
 'Vito',
 'Welcome To Bologna & Ferrara!'}

### This is the list of players with more than 10 listings on the platform. As you can see, about half of them have a distinctly "human", Italian name despite their commercial vocation. Agostino, Franco, Silvia... They inspire confidence, intimacy, authenticity. Isn't that strange? Would you let your uncle Mario be the sole face of your Italian guesthouse? Apparently, you should! Multihosts and commercial partners are a problem for Airbnb's storytelling because they are faceless and inconsistent with the idea of a resident host, keen to show and explain his or her exotic city to guests. That is perhaps why host identity tends to be humanized and split into many small profiles on the platform.

### Now, let's have a look at this dataset:

In [4]:
listings.head(3)

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,instant_bookable,is_business_travel_ready,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,42196,https://www.airbnb.com/rooms/42196,20200219034035,2020-02-19,50 sm Studio in the historic centre,,Really cozy and typical bolognese 50 sm studio...,Really cozy and typical bolognese 50 sm studio...,none,,...,t,f,flexible_new,f,f,5,2,3,0,1.52
1,46352,https://www.airbnb.com/rooms/46352,20200219034035,2020-02-19,A room in Pasolini's house,"Simple, cozy and silent room in a lived house ...",Please take two minutes of your time to read e...,"Simple, cozy and silent room in a lived house ...",none,In the very nearby you have Via Saragozza whic...,...,f,f,flexible_new,f,f,2,0,2,0,2.23
2,59697,https://www.airbnb.com/rooms/59697,20200219034035,2020-02-19,BOLOGNA CENTRE RELAX & COMFORT,"Situato nel centro storicocuore di Bologna, vi...",The apartment is located at the heart of Bolog...,The apartment is located at the heart of Bolog...,none,,...,f,f,moderate_new,f,f,1,1,0,0,2.72


### As we said above, we would like to compare the textual features of every listing with those of every other listing. Let's pick a column and a threshold.

In [5]:
# Set the COLUMN used for the similarity comparison,
# and the SIMILARITY THRESHOLD at or below which two listings are not considered similar

text = "space"
similarity_ts = 0.65

In [6]:
# Keep only listings of entire homes or apartments, and drop the ones whose text is empty.

mask = listings[text].notna() & (listings['room_type'] == 'Entire home/apt')

# (text, listing id) pairs for the rows that pass the filter
summaries = list(zip(listings.loc[mask, text], listings.loc[mask, 'id']))

corpus = [i[0] for i in summaries]
id_name = [i[1] for i in summaries]

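### Here we are. With scikit-learn's TfidfVectorizer we turn each text into a TF-IDF vector, and with the linear kernel (a plain dot product) we measure the similarity between vectors. Since TfidfVectorizer L2-normalizes each vector by default, that dot product is the cosine similarity. From the resulting matrix we can then retrieve a list of pairs of listings, ranked by similarity.
This code cell can be quite slow, because it compares every listing with every other one.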

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)
similarity_matrix = linear_kernel(tfidf)
dataset = pd.DataFrame({c:similarity_matrix[:, ix] for ix,c in enumerate(id_name)}, index=id_name)

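# Flatten the matrix into ((id_a, id_b), similarity) items, sorted by similarity,
# dropping self-matches and pairs at or below the threshold
s = [(ix,val) for ix,val in dataset.stack().sort_values(ascending=False).items() 
     if ix[0] != ix[1] and val > similarity_ts]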
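
### To make the TF-IDF step more concrete, here is a minimal sketch on a toy corpus. The ids and texts are made up for illustration and do not come from the Bologna dataset. It checks that the linear kernel on TF-IDF vectors matches the cosine similarity, and it reads the pairs from the upper triangle of the matrix only. Reading the upper triangle is an alternative to stacking the full matrix as in the cell above, where every pair shows up twice, once as (a, b) and once as (b, a).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical toy data, not taken from the Bologna dataset
toy_ids = [101, 102, 103]
toy_corpus = [
    "bright apartment on the second floor near the two towers",
    "bright apartment on the second floor near the two towers with balcony",
    "quiet room with garden view far from the centre",
]

toy_tfidf = TfidfVectorizer().fit_transform(toy_corpus)
toy_sim = linear_kernel(toy_tfidf)

# Rows are L2-normalized by default, so the dot product is the cosine similarity
assert np.allclose(toy_sim, cosine_similarity(toy_tfidf))

# Upper triangle only (k=1 skips the diagonal), so each pair appears once
rows, cols = np.triu_indices_from(toy_sim, k=1)
keep = toy_sim[rows, cols] > 0.65
toy_pairs = sorted(
    ((toy_ids[r], toy_ids[c], float(toy_sim[r, c]))
     for r, c in zip(rows[keep], cols[keep])),
    key=lambda pair: pair[2],
    reverse=True,
)
print(toy_pairs)
```

### Only the first two texts, which differ by two words, pass the 0.65 threshold. On the real data the same idea would use `similarity_matrix`, `id_name` and `similarity_ts` in place of the toy names.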

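### Let's unpack our results: we attach a host to each listing of a pair and keep only the pairs whose listings belong to different hosts.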

In [8]:
stack = []

def get_host(ix): 
    return byId('host_id', ix)

def byId(what, i):
    return listings.loc[listings['id'] == i, what].iloc[0]


for ids,sim in s:
    host_0 = get_host(ids[0])
    host_1 = get_host(ids[1])
    stack.append(( (host_0, ids[0]), (host_1, ids[1]) , sim ))
    
stack = [i for i in stack if i[0][0] != i[1][0]]
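
### The lookup above filters the whole table once per listing. As a hedged alternative, the sketch below indexes `host_id` by listing id a single time and then reads from that Series. The miniature table and the pairs are hypothetical, and the sketch assumes that listing ids are unique.

```python
import pandas as pd

# Hypothetical miniature of the listings table
demo_listings = pd.DataFrame({
    "id": [11, 12, 13],
    "host_id": [900, 901, 900],
})

# Made-up (listing_a, listing_b, similarity) pairs
demo_pairs = [(11, 12, 0.91), (11, 13, 0.80)]

# One Series mapping listing id -> host id, built once
host_of = demo_listings.set_index("id")["host_id"]

# Same shape as `stack`: ((host, listing), (host, listing), similarity),
# keeping only pairs whose listings belong to different hosts
demo_stack = [
    ((int(host_of.loc[a]), a), (int(host_of.loc[b]), b), sim)
    for a, b, sim in demo_pairs
    if host_of.loc[a] != host_of.loc[b]
]
print(demo_stack)
```

### The pair (11, 13) is dropped because both listings belong to host 900.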

In [9]:
relevant_columns = ['host_id', 'host_name','id', 'name','space', 'description', 'summary']

similar_announces = []
for i in stack:
    
    similar_announces.append(listings.loc[listings['id'] == i[0][1]])
    similar_announces.append(listings.loc[listings['id'] == i[1][1]])

In [10]:
similar_announces = pd.concat(similar_announces)[relevant_columns].drop_duplicates()
similar_announces.head(100)

Unnamed: 0,host_id,host_name,id,name,space,description,summary
2889,235518701,Giorgio,31441827,"Calzolerie Luxury Studio, in the heart of the ...",The apartment is on the second floor of a buil...,Elegant and modern studio recently renovated w...,Elegant and modern studio recently renovated w...
2852,232103797,Paolo,31069336,Calzolerie luxury apartment in the historic ce...,The apartment is on the second floor of a buil...,Modern and elegant apartment recently renovate...,Modern and elegant apartment recently renovate...
4227,68659490,San Petronio Apartments,40839793,San Felice Apt,Intero appartamento,"San Felice Apt. è un appartamento moderno, sit...","San Felice Apt. è un appartamento moderno, sit..."
4422,50429991,Flat In Italy,42061263,Angy Vi. Apartments - Flat in Bo,intero appartamento,Grazioso appartamento situato a pochi passi da...,Grazioso appartamento situato a pochi passi da...
327,22113925,Sandro,4264582,MYHOME,Lovely apartment in Bologna in the city center...,Lovely apartment in Bologna in the city center...,Lovely apartment in Bologna in the city center...
...,...,...,...,...,...,...,...
1015,76308611,Alessandra,14345155,Casa Anna in historic center,L'abitazione si trova in una delle zone più se...,My accommodation is in the historical district...,My accommodation is in the historical district...
4073,45303779,Filippo,39934007,Casa Caterina in Centro Storico a Bologna,L'abitazione si trova in una delle zone più se...,Situato in una delle zone del centro storico p...,Situato in una delle zone del centro storico p...
3572,275377946,Carla,36637216,"Casa San Felice, spaziosa e luminosa",L'appartamento è posto al 2° piano di uno stab...,Spazioso e luminoso appartamento con 3 camere ...,Spazioso e luminoso appartamento con 3 camere ...
1891,147618750,Ubaldo,22508568,Apartment-Apartment-Ensuite with Shower-City V...,Appartamento di 50 mq composto da soggiorno co...,The exclusive and refined apartments are locat...,The exclusive and refined apartments are locat...


In [11]:
# if you want a .csv output:

# print(len(similar_announces))
# similar_announces.to_csv("similar_announce_%s.csv" % text)

### We can observe several phenomena at a quick glance. It turns out that the same host often has more than one id. It can also happen that a single listing appears on the platform multiple times under different hosts, but that is rarer. More generally, there are a number of seemingly unrelated hosts with different listings but identical descriptions.
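
### One possible way to look at these hosts as groups, rather than as pairs, is to link every two hosts that share a similar description and collect the connected groups. The sketch below does this with a small union-find structure on made-up host ids. It is not part of the analysis above, and it assumes that similarity can be chained (if A matches B and B matches C, all three end up together), which may overstate how related the hosts really are.

```python
from collections import defaultdict

# Hypothetical (host_a, host_b) links, shaped like the host ids paired in `stack`
host_links = [(1, 2), (2, 3), (7, 8)]

parent = {}

def find(host):
    """Return the representative of the host's group, shortening the path on the way."""
    parent.setdefault(host, host)
    while parent[host] != host:
        parent[host] = parent[parent[host]]
        host = parent[host]
    return host

def union(a, b):
    """Merge the groups that contain hosts a and b."""
    parent[find(a)] = find(b)

for a, b in host_links:
    union(a, b)

# Collect the members of each group under its representative
groups = defaultdict(set)
for host in list(parent):
    groups[find(host)].add(host)

host_groups = sorted(sorted(group) for group in groups.values())
print(host_groups)  # [[1, 2, 3], [7, 8]]
```

### With the real data the links could be built as `[(a[0], b[0]) for a, b, _ in stack]`. The resulting groups are only candidates to inspect by hand.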

### This is a qualitative approach. We can't make any quantitative statement, nor state that a host is unequivocally part of a bigger group of hosts. Still, I think it's a useful method for describing our datasets.