### In this notebook I use some natural language processing (NLP) methods from scikit-learn to spot potential multihosts in the Bologna Airbnb host community.
### I do that by exploiting the textual descriptions of the listings (the "description", "space" and "summary" columns above all) in an Inside Airbnb dataset (http://insideairbnb.com/get-the-data.html by Murray Cox). The reason is that many listings have very similar descriptions although they belong to different hosts. Apparently there is a lot of copy-pasting among hosts that MAYBE belong to the same accommodation business. Why would anyone do that? Why would a big commercial Airbnb player create different "host" entities to manage its properties? Look at this:

In [1]:
import pandas as pd

# I picked the latest dataset, the zipped detailed listings file, since it is richer in textual info:
listings = pd.read_csv("listings_detailed/02_2020.csv")

set(listings[listings['calculated_host_listings_count'] > 10]\
    .sort_values('calculated_host_listings_count', ascending=False).host_name)

{'28',
 'Agostino',
 'Alessandro',
 'Andrea',
 'Arianna',
 'Bnbbologna',
 'Canguro Properties',
 'Flat In Italy',
 'Franco',
 'GetTheKey',
 'Halldis Apartments & Villas',
 'Luca',
 'Marco',
 'Michela Borghetto',
 'Rambaldo',
 'Realkasa',
 'Silvia',
 'Stefano',
 'The Place',
 'Vito',
 'Welcome To Bologna & Ferrara!'}

### This is the list of players with more than 10 listings on the platform. As you can see, around half of them have a distinctly "human" and "Italian" name, in spite of their commercial vocation. Agostino, Franco, Silvia... They inspire confidence, intimacy, authenticity. Isn't that strange? Would you let your uncle Mario be the sole face of your Italian guesthouse? Apparently, you should! Multihosts and commercial partners are a problem for Airbnb's storytelling because they are faceless, and they are not consistent with the idea of a resident host who is keen to show and explain his or her exotic city to guests. That is perhaps why host identities tend to be humanized and fragmented on the platform.

### Now, let's have a look at this dataset:

In [2]:
listings.head(3)

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,instant_bookable,is_business_travel_ready,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,42196,https://www.airbnb.com/rooms/42196,20200219034035,2020-02-19,50 sm Studio in the historic centre,,Really cozy and typical bolognese 50 sm studio...,Really cozy and typical bolognese 50 sm studio...,none,,...,t,f,flexible_new,f,f,5,2,3,0,1.52
1,46352,https://www.airbnb.com/rooms/46352,20200219034035,2020-02-19,A room in Pasolini's house,"Simple, cozy and silent room in a lived house ...",Please take two minutes of your time to read e...,"Simple, cozy and silent room in a lived house ...",none,In the very nearby you have Via Saragozza whic...,...,f,f,flexible_new,f,f,2,0,2,0,2.23
2,59697,https://www.airbnb.com/rooms/59697,20200219034035,2020-02-19,BOLOGNA CENTRE RELAX & COMFORT,"Situato nel centro storicocuore di Bologna, vi...",The apartment is located at the heart of Bolog...,The apartment is located at the heart of Bolog...,none,,...,f,f,moderate_new,f,f,1,1,0,0,2.72


### As we said above, we would like to compare the text of every listing with the text of every other listing. Let's pick a column and a threshold.

In [3]:
# I set the COLUMN I want to use for similarity
# and the SIMILARITY THRESHOLD at or below which two listings are not considered similar
# (caution: column naming is sometimes inconsistent across dataframes)

text = "description"
similarity_ts = 0.65

In [4]:
# I keep only listings of entire homes or apartments and exclude the ones with an empty text.

keep = listings[text].notna() & (listings['room_type'] == 'Entire home/apt')
summaries = list(zip(listings.loc[keep, text], listings.loc[keep, 'id']))

corpus = [i[0] for i in summaries]
id_name = [i[1] for i in summaries]

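### Here we are. With scikit-learn's TF-IDF vectorizer we turn each text into a vector, and with the linear kernel we measure the similarity between vectors. Since TfidfVectorizer L2-normalizes each vector by default, the linear kernel here equals the cosine similarity. From the resulting matrix we can retrieve a list of listing pairs, ranked by similarity.
This code cell can be quite slow.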

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)
similarity_matrix = linear_kernel(tfidf)
dataset = pd.DataFrame({c:similarity_matrix[:, ix] for ix,c in enumerate(id_name)}, index=id_name)

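# Keep pairs of distinct listings whose similarity exceeds the threshold
# (the matrix is symmetric, so each pair appears twice: as (a, b) and as (b, a))
s = [(ix,val) for ix,val in dataset.stack().sort_values(ascending=False).items() 
     if ix[0] != ix[1] and val > similarity_ts]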
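
### The cell above can be tried out on a tiny corpus first. The sketch below is only an illustration: the three texts, the ids and the `toy_` names are made up, not taken from the dataset. It checks that the linear kernel on default TF-IDF vectors matches the cosine similarity, and it reads only the upper triangle of the matrix, one possible way to get each pair once instead of twice.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Toy corpus (made up for illustration, not taken from the dataset)
toy_corpus = [
    "Lovely apartment in the city center of Bologna",
    "Lovely apartment in the city center of Bologna",
    "Quiet room with garden view near the station",
]
toy_ids = [101, 102, 103]  # hypothetical listing ids

toy_tfidf = TfidfVectorizer().fit_transform(toy_corpus)
toy_sim = linear_kernel(toy_tfidf)

# With the default norm="l2", the dot product equals the cosine similarity
same_as_cosine = np.allclose(toy_sim, cosine_similarity(toy_tfidf))

# Read only the upper triangle (k=1 skips the diagonal),
# so every unordered pair of listings is visited exactly once
rows, cols = np.triu_indices_from(toy_sim, k=1)
toy_pairs = [
    (toy_ids[r], toy_ids[c], float(toy_sim[r, c]))
    for r, c in zip(rows, cols)
    if toy_sim[r, c] > 0.65
]
print(same_as_cosine, toy_pairs)
```

Only the two identical texts end up paired here; the third text shares a single common word with them and stays well below the threshold.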

### Let's unpack our results: for each pair we look up the two hosts and keep only the pairs that belong to different hosts.

In [6]:
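stack = []

def byId(what, i):
    # Value of column `what` for the listing whose id is `i`
    return listings.loc[listings['id'] == i, what].iloc[0]

def get_host(ix): 
    return byId('host_id', ix)


# For each similar pair, store (host_id, listing_id) for both listings plus the similarity
for ids,sim in s:
    host_0 = get_host(ids[0])
    host_1 = get_host(ids[1])
    stack.append(( (host_0, ids[0]), (host_1, ids[1]) , sim ))
    
# Drop the pairs where both listings belong to the same host id
stack = [i for i in stack if i[0][0] != i[1][0]]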
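
### The lookup above filters the whole dataframe once per listing, which can get slow when many pairs pass the threshold. A possible alternative, sketched here on toy data, is to build one id-to-host lookup table and index into it. The `toy_` names and values are hypothetical, and the sketch assumes that listing ids are unique.

```python
import pandas as pd

# Toy stand-ins for `listings` and for the pair list `s` (values are made up)
toy_listings = pd.DataFrame({
    "id": [101, 102, 103],
    "host_id": [1, 2, 1],
})
toy_s = [((101, 102), 0.98), ((101, 103), 0.91)]

# One lookup table (listing id -> host id) instead of one dataframe filter per pair
host_of = toy_listings.set_index("id")["host_id"]

toy_stack = [
    ((host_of[a], a), (host_of[b], b), sim)
    for (a, b), sim in toy_s
    if host_of[a] != host_of[b]  # keep only pairs owned by different hosts
]
print(toy_stack)
```

The second toy pair is dropped because both of its listings belong to host 1.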

In [7]:
relevant_columns = ['host_id', 'host_name','id', 'name','space', 'description', 'summary']

similar_announces = []
for i in stack:
    
    similar_announces.append(listings.loc[listings['id'] == i[0][1]])
    similar_announces.append(listings.loc[listings['id'] == i[1][1]])

In [8]:
similar_announces = pd.concat(similar_announces)[relevant_columns].drop_duplicates()
similar_announces.head(100)

Unnamed: 0,host_id,host_name,id,name,space,description,summary
2019,174718810,Sandro,23426146,MY HOME Bologna,Lovely apartment in Bologna in the city center...,Lovely apartment in Bologna in the city center...,Lovely apartment in Bologna in the city center...
327,22113925,Sandro,4264582,MYHOME,Lovely apartment in Bologna in the city center...,Lovely apartment in Bologna in the city center...,Lovely apartment in Bologna in the city center...
3467,267840406,Alexandros,35601415,Historical home in a beautiful and safe street,I am looking forward to hosting you. Our flat ...,Our home is in a safe and beautifull neighbour...,Our home is in a safe and beautifull neighbour...
3380,260401929,Alexandros,34501306,"Safe, beautifull and historical home in Bologna",I am looking forward to hosting you. Our flat ...,Our home is in a safe and beautifull neighbour...,Our home is in a safe and beautifull neighbour...
1150,101120055,Marco,15662345,Minimal Art Space appartment,,"Il mio alloggio è adatto a coppie, avventurier...","Il mio alloggio è adatto a coppie, avventurier..."
...,...,...,...,...,...,...,...
3808,292478881,Anita,38394637,"S. Petronio Vecchio 24, spacious and central.","Located on the ground floor, the apartment ins...","Apartment with 3 bedrooms, right in the center...","Apartment with 3 bedrooms, right in the center..."
3772,290301702,Roberto,38205616,"Colonna Home Bologna, Under the Two Towers",L'appartamento è posto al 2° piano di uno stab...,Grazioso appartamento frutto di una recente ri...,Grazioso appartamento frutto di una recente ri...
2145,183694652,Cinzia,24345029,Appartamento Pescarìe:in the heart of the old ...,"The apartment has a double bed, in an area par...","Nice studio in the heart of the Quadrilatero, ...","Nice studio in the heart of the Quadrilatero, ..."
2775,227516498,Daniele,30292358,Usodimare apartment: spazioso e con parcheggio,,Spazioso appartamento famigliare dotato di par...,Spazioso appartamento famigliare dotato di par...


In [10]:
# if you want a .csv output:

similar_announces.to_csv("other_data/similar_announce_%s.csv" % text)

### We can observe several phenomena at a quick glance. It turns out that the same host often has more than one id. It can also happen, though more rarely, that a single listing appears on the platform multiple times under different hosts. More generally, there are a number of apparently unrelated hosts whose different listings carry identical descriptions.
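
### One way to explore these observations further would be to group the host ids that are linked, directly or through a chain, by similar descriptions. The sketch below is not part of the analysis above: it is a minimal union-find on made-up host-id pairs shaped like the `(host_0, host_1)` parts of `stack`. Chained links can also connect hosts that have nothing to do with each other, so the groups are only candidates to inspect by hand.

```python
# Hypothetical host-id pairs, shaped like the (host_0, host_1) parts of `stack`
toy_host_pairs = [(1, 2), (2, 3), (7, 8)]

parent = {}

def find(h):
    """Return the representative of h's group, shortening the path on the way."""
    parent.setdefault(h, h)
    while parent[h] != h:
        parent[h] = parent[parent[h]]
        h = parent[h]
    return h

def union(a, b):
    """Merge the groups that contain a and b."""
    root_a, root_b = find(a), find(b)
    if root_a != root_b:
        parent[root_b] = root_a

for a, b in toy_host_pairs:
    union(a, b)

# Collect the members of each group under its representative
groups = {}
for h in list(parent):
    groups.setdefault(find(h), set()).add(h)

host_groups = sorted(sorted(g) for g in groups.values())
print(host_groups)
```

On the toy pairs, hosts 1, 2 and 3 fall into one group even though 1 and 3 were never paired directly.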

### This is a qualitative approach. We can't make any quantitative statement, nor state that a host is unequivocally part of a bigger group of hosts. Still, I think it's a useful method for describing our datasets.