# A brief analysis of Airbnb's semi-structured data with PyMongo.

This exercise is a brief analysis of a large volume of semi-structured data. I loaded a publicly available set of real data into MongoDB and ran a quick analysis from Python with PyMongo, including MongoDB's map-reduce.

The data is available at:
https://www.mongodb.com/docs/atlas/sample-data/sample-airbnb/#std-label-sample-airbnb

In [1]:
#Let's import the required packages
import pandas as pd
import altair as alt
from pymongo import MongoClient
from bson.code import Code

In [2]:
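#Let's connect to my local MongoDB instance. The Airbnb data was previously imported into MongoDB.
client = MongoClient("mongodb://127.0.0.1:27017")
#Ping the server so the message below reflects a real round trip.
#The client stays open because the following cells use it.
client.admin.command("ping")
print("Connection Successful")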

Connection Successful


In [3]:
#Let's set the database and the collection to work with.
dsadb = client.dsadb
airbnbcollection = dsadb.airbnb

In [4]:
#One can load the semi-structured documents into a structured object such as a pandas DataFrame.
datapoints = list(airbnbcollection.find({}))
df = pd.json_normalize(datapoints)
df.head()

Unnamed: 0,_id,listing_url,name,summary,space,description,neighborhood_overview,notes,transit,access,...,review_scores.review_scores_accuracy,review_scores.review_scores_cleanliness,review_scores.review_scores_checkin,review_scores.review_scores_communication,review_scores.review_scores_location,review_scores.review_scores_value,review_scores.review_scores_rating,weekly_price,monthly_price,reviews_per_month
0,10006546,https://www.airbnb.com/rooms/10006546,Ribeira Charming Duplex,Fantastic duplex apartment with three bedrooms...,Privileged views of the Douro River and Ribeir...,Fantastic duplex apartment with three bedrooms...,"In the neighborhood of the river, you can find...",Lose yourself in the narrow streets and stairc...,Transport: • Metro station and S. Bento railwa...,We are always available to help guests. The ho...,...,9.0,9.0,10.0,10.0,10.0,9.0,89.0,,,
1,10009999,https://www.airbnb.com/rooms/10009999,Horto flat with small garden,One bedroom + sofa-bed in quiet and bucolic ne...,Lovely one bedroom + sofa-bed in the living ro...,One bedroom + sofa-bed in quiet and bucolic ne...,This charming ground floor flat is located in ...,"There´s a table in the living room now, that d...","Easy access to transport (bus, taxi, car) and ...",,...,,,,,,,,1492.0,4849.0,
2,1001265,https://www.airbnb.com/rooms/1001265,Ocean View Waikiki Marina w/prkg,A short distance from Honolulu's billion dolla...,Great studio located on Ala Moana across the s...,A short distance from Honolulu's billion dolla...,You can breath ocean as well as aloha.,,Honolulu does have a very good air conditioned...,"Pool, hot tub and tennis",...,9.0,8.0,9.0,9.0,10.0,9.0,84.0,650.0,2150.0,
3,10021707,https://www.airbnb.com/rooms/10021707,Private Room in Bushwick,Here exists a very cozy room for rent in a sha...,,Here exists a very cozy room for rent in a sha...,,,,,...,10.0,10.0,10.0,10.0,8.0,8.0,100.0,,,
4,10030955,https://www.airbnb.com/rooms/10030955,Apt Linda Vista Lagoa - Rio,Quarto com vista para a Lagoa Rodrigo de Freit...,,Quarto com vista para a Lagoa Rodrigo de Freit...,,,,,...,,,,,,,,,,
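
The dotted column names above, such as `review_scores.review_scores_rating`, come from flattening nested sub-documents. As a rough illustration of the idea (not pandas' actual implementation), the sketch below flattens nested dictionaries in plain Python. The `flatten` helper and the sample document are hypothetical and exist only for this example.

```python
def flatten(doc, parent_key="", sep="."):
    """Flatten nested dicts into a single-level dict with dotted keys."""
    flat = {}
    for key, value in doc.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into sub-documents, carrying the dotted prefix along.
            flat.update(flatten(value, full_key, sep))
        else:
            # Scalars and lists are kept as they are.
            flat[full_key] = value
    return flat


# Made-up document, shaped loosely like an Airbnb listing.
sample = {
    "_id": "1",
    "name": "Example flat",
    "review_scores": {"review_scores_rating": 89, "review_scores_value": 9},
}
flat_sample = flatten(sample)
print(flat_sample)
```

Each nested key becomes its own column, which is why documents that lack a sub-document show empty cells in the dataframe.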


In [5]:
#One can count how many documents are in the dataset.
airbnbcollection.count_documents({})

5555

In [6]:
#One can check how many rows and columns the dataframe has.
df.shape

(5555, 77)

In [7]:
#One can filter specific documents and create a dataframe with them.
pd.json_normalize(airbnbcollection.find({"property_type" : "House"},{"property_type" : True, "cancellation_policy" : True, "_id": False}))

Unnamed: 0,property_type,cancellation_policy
0,House,moderate
1,House,flexible
2,House,strict_14_with_grace_period
3,House,moderate
4,House,strict_14_with_grace_period
...,...,...
601,House,flexible
602,House,strict_14_with_grace_period
603,House,moderate
604,House,flexible
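
The two arguments passed to `find` above are a filter and a projection. To make their roles concrete, here is a minimal plain-Python sketch, assuming only top-level equality filters and inclusion-style projections. `find_sketch` and the sample listings are hypothetical; the real matching happens inside MongoDB and supports far more operators.

```python
def find_sketch(docs, query, projection):
    """Return projected copies of the documents that match every query field."""
    # Fields explicitly switched on in the projection.
    wanted = [field for field, include in projection.items() if include]
    # MongoDB returns _id unless the projection turns it off.
    if projection.get("_id", True) and "_id" not in wanted:
        wanted.insert(0, "_id")
    results = []
    for doc in docs:
        if all(doc.get(field) == value for field, value in query.items()):
            results.append({field: doc[field] for field in wanted if field in doc})
    return results


# Made-up listings for illustration only.
listings = [
    {"_id": 1, "property_type": "House", "cancellation_policy": "moderate", "beds": 3},
    {"_id": 2, "property_type": "Apartment", "cancellation_policy": "flexible", "beds": 1},
    {"_id": 3, "property_type": "House", "cancellation_policy": "flexible", "beds": 2},
]
houses = find_sketch(
    listings,
    {"property_type": "House"},
    {"property_type": True, "cancellation_policy": True, "_id": False},
)
print(houses)
```

Setting `"_id": False` is what keeps the identifier out of the resulting dataframe.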


In [8]:
#One can sort these documents, here by cancellation policy in descending order.
pd.json_normalize(
    airbnbcollection.find(
        {"property_type": "House"},
        {"property_type": True, "cancellation_policy": True, "_id": False},
    ).sort("cancellation_policy", -1)
)

Unnamed: 0,property_type,cancellation_policy
0,House,super_strict_60
1,House,super_strict_60
2,House,super_strict_60
3,House,super_strict_60
4,House,super_strict_60
...,...,...
601,House,flexible
602,House,flexible
603,House,flexible
604,House,flexible


In [9]:
#One can run aggregations that compute counts, sums, or averages, here the number of listings and the average number of reviews per property type.
aggsample=pd.json_normalize(airbnbcollection.aggregate(
    [{
    "$group" : 
        {"_id":"$property_type",
         "count": {"$sum" : 1},
         "average":{"$avg":"$number_of_reviews"}}}], 
        allowDiskUse = True)).rename(columns={"_id": "Property_type", "count": "Count", "average": "Average"})
aggsample

Unnamed: 0,Property_type,Count,Average
0,Boutique hotel,53,17.603774
1,Hostel,34,15.411765
2,Campsite,1,47.0
3,Resort,11,8.909091
4,Villa,32,13.09375
5,Earth house,1,1.0
6,Pension (South Korea),1,32.0
7,Houseboat,1,2.0
8,Aparthotel,23,14.217391
9,Treehouse,1,103.0


In [10]:
#Let's chart the aggregated data.
alt.Chart(aggsample).mark_circle().encode(
    x='Count:Q',
    y='Property_type:N',)

One can also run calculations on the server with MongoDB's map-reduce, using a map function and a reduce function written in JavaScript. Let's do it!

In [11]:
#The map function runs once per document and emits the pair (number of reviews, 1).
map = Code("function(){emit(Math.floor(this.number_of_reviews), 1)}")

In [12]:
#The reduce function sums the values emitted for each key.
reduce = Code("function(id, counts){return Array.sum(counts)}")
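
The JavaScript map and reduce functions can be hard to picture, so here is a small plain-Python emulation of the same counting logic. It is only a sketch of the idea: `map_fn`, `reduce_fn`, `map_reduce_sketch`, and the sample documents are hypothetical, and the real job runs inside MongoDB.

```python
import math
from collections import defaultdict


def map_fn(doc):
    # Mirrors emit(Math.floor(this.number_of_reviews), 1).
    yield math.floor(doc["number_of_reviews"]), 1


def reduce_fn(key, values):
    # Mirrors Array.sum(counts).
    return sum(values)


def map_reduce_sketch(docs, map_fn, reduce_fn):
    """Collect emitted values per key, then reduce each group to one value."""
    emitted = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            emitted[key].append(value)
    return {key: reduce_fn(key, values) for key, values in emitted.items()}


# Made-up documents for illustration only.
sample_docs = [{"number_of_reviews": n} for n in (0, 0, 3, 1, 0, 3)]
counts = map_reduce_sketch(sample_docs, map_fn, reduce_fn)
print(counts)
```

MongoDB may call the reduce function more than once for the same key and feed earlier results back in, so the function has to give the same answer however the values are grouped. A plain sum satisfies that.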

In [13]:
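#Run the map-reduce job; the output is written to the "myresults" collection and read back with find().
#Collection.map_reduce is available in PyMongo 3.x only; it was removed in PyMongo 4.0.
result = pd.json_normalize(airbnbcollection.map_reduce(map, reduce, "myresults").find()).\
            rename(columns={"_id": "Number_of_reviews", "value": "Count"}).\
            sort_values('Count', ascending=False).head(10)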
result

Unnamed: 0,Number_of_reviews,Count
208,0.0,1388.0
57,1.0,511.0
255,2.0,329.0
30,3.0,247.0
15,4.0,187.0
210,6.0,139.0
87,5.0,128.0
222,7.0,107.0
220,8.0,88.0
145,12.0,71.0
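
The same top-10 count can likely be obtained with an aggregation pipeline, which also works on PyMongo versions that no longer ship `map_reduce`. The sketch below only builds the pipeline, so it runs without a server; `review_count_pipeline` is a hypothetical helper, and the commented line assumes the `airbnbcollection` object from the earlier cells. Results may differ slightly from the map-reduce output for documents without a `number_of_reviews` value.

```python
def review_count_pipeline(limit=10):
    """Build a pipeline that counts listings per floored number of reviews."""
    return [
        # Group by the floored review count and count the documents in each group.
        {"$group": {"_id": {"$floor": "$number_of_reviews"}, "Count": {"$sum": 1}}},
        # Largest groups first.
        {"$sort": {"Count": -1}},
        # Keep only the top groups.
        {"$limit": limit},
    ]


pipeline = review_count_pipeline()
print(pipeline)

# Assuming the collection object defined earlier in this notebook:
# result = pd.json_normalize(airbnbcollection.aggregate(pipeline))
```

Sorting and limiting on the server means only ten small documents travel back to Python.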


In [14]:
#Let's plot the calculated data.
alt.Chart(result).mark_bar().encode(
    x='Number_of_reviews:N',
    y='Count:Q',)

Let's count the Airbnb properties by number of bedrooms.

In [15]:
map = Code("function(){emit(Math.floor(this.bedrooms), 1)}")

In [16]:
reduce = Code("function(id, counts){return Array.sum(counts)}")

In [17]:
#dropna() discards any row with a missing value, such as a group whose key could not be computed.
result = pd.json_normalize(airbnbcollection.map_reduce(map, reduce, "myresults").find()).\
            rename(columns={"_id": "Number_of_bedrooms", "value": "Count"}).dropna().\
            sort_values('Count', ascending=False)
result

Unnamed: 0,Number_of_bedrooms,Count
0,1.0,3308.0
3,2.0,1090.0
1,0.0,496.0
11,3.0,427.0
2,4.0,161.0
13,5.0,36.0
12,6.0,16.0
4,7.0,7.0
9,8.0,3.0
7,9.0,2.0


In [18]:
#Let's make a simple bar chart.
alt.Chart(result).mark_bar().encode(
    x='Number_of_bedrooms:N',
    y='Count:Q',)