# Data Workout Airbnb

# Workout 6 - Aggregating Data

The following exercises focus on aggregating data.

# About the Dataset

This activity explores a public Airbnb dataset containing 30,478 Airbnb listings in New York City.

* source - [Inside Airbnb](https://insideairbnb.com/get-the-data/)
* file path - `../data/airbnb.csv`

A few of the columns included are:

* __Host Since__: Start date of the host
* __Name__: Description of the Property
* __Neighbourhood__: The property neighborhood
* __Property Type__: The type of property (Apartment, House, etc)
* __Room Type__: The type of room (Entire Home, Private room, etc)
* __Price__: The daily price of the property

# Keep In Mind

For each exercise, new patterns or information may be revealed.

Ask yourself or discuss with your group the following questions:
* Is this trivia or is this a tool?
* What makes this information trivia?
* What makes this information a tool?
* How does the nature of tool or trivia change based on who is viewing the information?

# 0 - Importing the tools 

* Import `pandas` as `pd`

In [1]:
## Begin Solution
import pandas as pd

## End Solution

# 1 - Load the data 

Load the airbnb data into a dataframe named `df`

In [2]:
## Begin Solution
file = "../data/airbnb.csv"

df = pd.read_csv(file)

## End Solution

# 2 - Preview the Data

Output a summary of the data: the column names, non-null counts, and data types.

In [3]:
## Begin Solution
df.info()

## End Solution

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30478 entries, 0 to 30477
Data columns (total 14 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Unnamed: 0                  30478 non-null  int64  
 1   Host Id                     30478 non-null  int64  
 2   Host Since                  30475 non-null  object 
 3   Name                        30478 non-null  object 
 4   Neighbourhood               30478 non-null  object 
 5   Property Type               30475 non-null  object 
 6   Review Scores Rating (bin)  22155 non-null  float64
 7   Room Type                   30478 non-null  object 
 8   Zipcode                     30344 non-null  float64
 9   Beds                        30393 non-null  float64
 10  Number of Records           30478 non-null  int64  
 11  Number Of Reviews           30478 non-null  int64  
 12  Price                       30478 non-null  int64  
 13  Review Scores Rating        22155 non-null  float64

# 3 - Clean the Data 

* Remove the `Unnamed: 0` column.
* Fix column names for consistency and remove leading/trailing spaces.


In [4]:
## Begin Solution
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
df = df.drop("unnamed:_0", axis = 1)

df.info()
## End Solution

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30478 entries, 0 to 30477
Data columns (total 13 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   host_id                     30478 non-null  int64  
 1   host_since                  30475 non-null  object 
 2   name                        30478 non-null  object 
 3   neighbourhood               30478 non-null  object 
 4   property_type               30475 non-null  object 
 5   review_scores_rating_(bin)  22155 non-null  float64
 6   room_type                   30478 non-null  object 
 7   zipcode                     30344 non-null  float64
 8   beds                        30393 non-null  float64
 9   number_of_records           30478 non-null  int64  
 10  number_of_reviews           30478 non-null  int64  
 11  price                       30478 non-null  int64  
 12  review_scores_rating        22155 non-null  float64
dtypes: float64(4), int64(4), object
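
The cleaning step can be tried in isolation on a small frame. The sketch below uses made-up columns and values (not the Airbnb file) to show what each chained string method does to the headers.

```python
import pandas as pd

# Toy frame with messy headers (hypothetical data, not the Airbnb file)
toy = pd.DataFrame(
    {
        "Unnamed: 0": [0, 1],
        " Host Id ": [101, 102],
        "Room Type": ["Private room", "Shared room"],
    }
)

# Strip outer whitespace, lowercase, then replace inner spaces with underscores
toy.columns = toy.columns.str.strip().str.lower().str.replace(" ", "_")

# Drop the leftover index column by its cleaned name
toy = toy.drop(columns="unnamed:_0")

cleaned_columns = list(toy.columns)
print(cleaned_columns)
```

An alternative worth knowing: passing `index_col=0` to `pd.read_csv` treats the first column as the index, so no `Unnamed: 0` column is created in the first place.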

# 4 - Transform the Data
Convert the `host since` column to a `datetime` type.

In [5]:
## Begin Solution
df["host_since"] = pd.to_datetime(df["host_since"])

df.info()
## End Solution

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30478 entries, 0 to 30477
Data columns (total 13 columns):
 #   Column                      Non-Null Count  Dtype         
---  ------                      --------------  -----         
 0   host_id                     30478 non-null  int64         
 1   host_since                  30475 non-null  datetime64[ns]
 2   name                        30478 non-null  object        
 3   neighbourhood               30478 non-null  object        
 4   property_type               30475 non-null  object        
 5   review_scores_rating_(bin)  22155 non-null  float64       
 6   room_type                   30478 non-null  object        
 7   zipcode                     30344 non-null  float64       
 8   beds                        30393 non-null  float64       
 9   number_of_records           30478 non-null  int64         
 10  number_of_reviews           30478 non-null  int64         
 11  price                       30478 non-null  int64     
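
The non-null count for `host_since` stays at 30475 after the conversion because missing values are kept as `NaT` (not a time). A minimal sketch on made-up dates, assuming ISO-formatted strings (the real file may use a different format):

```python
import pandas as pd

# Hypothetical date strings with one missing value
dates = pd.Series(["2008-06-26", "2010-03-15", None])

# Convert to datetime; the missing entry becomes NaT instead of raising an error
parsed = pd.to_datetime(dates)

n_missing = int(parsed.isna().sum())
print(parsed)
print("missing:", n_missing)
```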

# 4 - Average Price by Neighborhood

Group by `neighbourhood` and find the average `price`.

In [6]:
## Begin Solution
df.groupby("neighbourhood")["price"].mean()
## End Solution

neighbourhood
Bronx             94.660870
Brooklyn         129.500471
Manhattan        198.474584
Queens           103.222125
Staten Island    163.462585
Name: price, dtype: float64
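
To see what `groupby` followed by `.mean()` is doing, it can help to run it on a handful of rows small enough to check by hand. The values below are invented for illustration.

```python
import pandas as pd

# Hypothetical listings: two in the Bronx, three in Queens
toy = pd.DataFrame(
    {
        "neighbourhood": ["Bronx", "Bronx", "Queens", "Queens", "Queens"],
        "price": [80, 100, 90, 120, 150],
    }
)

# Split rows by neighbourhood, then average the price within each group
avg_price = toy.groupby("neighbourhood")["price"].mean()
print(avg_price)
```

The result is a Series indexed by the group key, so a single group can be looked up with `avg_price["Bronx"]`.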

# 5 - Review Scores by Property Type

Group by `property type` and get average `review scores rating`.

In [7]:
## Begin Solution
df.groupby("property_type")["review_scores_rating"].mean().sort_values(ascending=False)
## End Solution

property_type
Lighthouse         100.000000
Castle             100.000000
Chalet              99.000000
Boat                94.833333
Hut                 94.000000
Townhouse           93.807229
Loft                93.507614
Condominium         92.708333
Apartment           92.050417
Villa               91.833333
Bungalow            91.333333
House               90.883258
Bed & Breakfast     90.544118
Other               89.607143
Treehouse           88.333333
Cabin               86.500000
Dorm                84.500000
Camper/RV           81.166667
Tent                      NaN
Name: review_scores_rating, dtype: float64
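
An average alone does not say how many ratings it is based on: a perfect score may rest on a single rated listing, and a `NaN` appears when a group has no ratings at all. One way to check, sketched here on made-up data, is to request the count next to the mean.

```python
import pandas as pd

nan = float("nan")

# Hypothetical ratings; NaN marks a listing without a rating
toy = pd.DataFrame(
    {
        "property_type": ["Castle", "Apartment", "Apartment", "Apartment", "Tent"],
        "review_scores_rating": [100.0, 90.0, 95.0, nan, nan],
    }
)

# "count" reports only non-missing ratings, so it shows how much data backs each mean
summary = (
    toy.groupby("property_type")["review_scores_rating"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)
print(summary)
```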

# 6 - Maximum Number of Reviews by Room Type 

Group by `room type` and find the maximum `number of reviews`.

In [8]:
## Begin Solution
df.groupby("room_type")["number_of_reviews"].max().sort_values(ascending=False)
## End Solution

room_type
Entire home/apt    257
Private room       256
Shared room        159
Name: number_of_reviews, dtype: int64

# 7 - Average Review Score Rating by Neighborhood

Group by `neighbourhood` and calculate the average `review scores rating`.

In [9]:
## Begin Solution
df.groupby("neighbourhood")["review_scores_rating"].mean().sort_values(ascending=False)
## End Solution

neighbourhood
Brooklyn         92.363497
Manhattan        91.801785
Bronx            91.654378
Queens           91.549057
Staten Island    90.843750
Name: review_scores_rating, dtype: float64

# 8 - Total Beds in each Neighborhood

Find the total number of `beds` in each `neighbourhood`.

In [10]:
## Begin Solution
df.groupby("neighbourhood")["beds"].sum().sort_values(ascending=False)
## End Solution

neighbourhood
Manhattan        24041.0
Brooklyn         18135.0
Queens            3461.0
Bronx              547.0
Staten Island      320.0
Name: beds, dtype: float64

# 9 - Distinct Property Types by Neighborhood 

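Group by `neighbourhood` and `property type`, then count the listings in each group. The distinct property types found in each neighbourhood appear as the inner level of the resulting index.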

In [11]:
## Begin Solution
df.groupby(["neighbourhood", "property_type"])["name"].count()
## End Solution

neighbourhood  property_type  
Bronx          Apartment            218
               Bed & Breakfast        6
               Condominium            3
               House                110
               Loft                   6
               Townhouse              2
Brooklyn       Apartment           9740
               Bed & Breakfast       76
               Boat                   1
               Bungalow               1
               Camper/RV              1
               Chalet                 1
               Condominium           30
               Dorm                  19
               House               1202
               Lighthouse             1
               Loft                 502
               Other                 16
               Tent                   3
               Townhouse             79
               Treehouse              2
               Villa                  1
Manhattan      Apartment          15433
               Bed & Breakfast       60
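
The output above lists every neighbourhood and property type pair together with its listing count. If only the number of distinct property types per neighbourhood is needed, `.nunique()` on the grouped column returns it directly. A minimal sketch on invented rows:

```python
import pandas as pd

# Hypothetical listings; None marks a missing property type
toy = pd.DataFrame(
    {
        "neighbourhood": ["Bronx", "Bronx", "Bronx", "Queens", "Queens"],
        "property_type": ["Apartment", "Apartment", "House", "Loft", None],
    }
)

# Count distinct property types per neighbourhood (missing values are ignored by default)
distinct_types = toy.groupby("neighbourhood")["property_type"].nunique()
print(distinct_types)
```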
         

# 10 - One Column, Multiple Aggregations 

Group by `neighbourhood` and perform multiple aggregations on the `price` column: find the min, max, and mean.

In [12]:
## Begin Solution
df.groupby("neighbourhood")["price"].agg(["min","max","mean"])
## End Solution

| neighbourhood | min | max | mean |
|---|---|---|---|
| Bronx | 10 | 4000 | 94.66087 |
| Brooklyn | 10 | 8000 | 129.500471 |
| Manhattan | 20 | 10000 | 198.474584 |
| Queens | 25 | 4000 | 103.222125 |
| Staten Island | 35 | 5000 | 163.462585 |
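
When a list of function names is passed to `.agg()`, the output columns take those names. If more descriptive column names are wanted, keyword arguments can be used instead. The sketch below uses invented prices, and the names `min_price`, `max_price`, and `avg_price` are arbitrary choices.

```python
import pandas as pd

# Hypothetical prices
toy = pd.DataFrame(
    {
        "neighbourhood": ["Bronx", "Bronx", "Bronx", "Queens", "Queens"],
        "price": [10, 50, 90, 25, 75],
    }
)

# Each keyword becomes an output column; its value names the aggregation to apply
price_stats = toy.groupby("neighbourhood")["price"].agg(
    min_price="min",
    max_price="max",
    avg_price="mean",
)
print(price_stats)
```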


# 11 - Multiple Columns, Different Aggregations

Group by `neighbourhood` and perform different aggregations on different columns:
* Find the average `price`
* Find the maximum `number of reviews`

In [13]:
## Begin Solution
df.groupby("neighbourhood").agg(
    {
        "price":"mean",
        "number_of_reviews":"max"
    }
).sort_values(
    by="number_of_reviews",
    ascending=False
)
## End Solution

| neighbourhood | price | number_of_reviews |
|---|---|---|
| Brooklyn | 129.500471 | 257 |
| Manhattan | 198.474584 | 242 |
| Queens | 103.222125 | 201 |
| Bronx | 94.66087 | 138 |
| Staten Island | 163.462585 | 104 |
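
With the dictionary form, the output columns keep the source column names, which hides which aggregation was applied. As an alternative sketch, named aggregation pairs each output name with a `(column, function)` tuple. The data and the output names `avg_price` and `max_reviews` are invented for illustration.

```python
import pandas as pd

# Hypothetical listings
toy = pd.DataFrame(
    {
        "neighbourhood": ["Bronx", "Bronx", "Queens", "Queens"],
        "price": [80, 100, 90, 150],
        "number_of_reviews": [5, 12, 30, 2],
    }
)

# Output name = (source column, aggregation)
summary = (
    toy.groupby("neighbourhood")
    .agg(
        avg_price=("price", "mean"),
        max_reviews=("number_of_reviews", "max"),
    )
    .sort_values(by="max_reviews", ascending=False)
)
print(summary)
```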


# 12 - Host Growth Over Time

Let's examine the growth patterns of Airbnb hosts reflected in the dataset.

Steps:
1. __Derive Host Onboarding Year:__ Transform the `host since` column to extract and store only the year of host registration. (The `.dt` accessor followed by the `.year` attribute gives access to the year component of a `datetime` Series.)
2. __Quantify Annual Host Additions:__ Group the dataset by the `host since` column and count the hosts added each year using the `.size()` method.

In [14]:
# Begin Solution
df["host_since"] = df["host_since"].dt.year
df.groupby("host_since").size()
# End Solution

host_since
2008.0      38
2009.0     502
2010.0    1374
2011.0    3300
2012.0    5760
2013.0    6440
2014.0    7908
2015.0    5153
dtype: int64
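
The years above display as `2008.0` rather than `2008` because the column contains missing dates, and the resulting `NaN` values force a float type. The solution also overwrites `host_since`, so the full dates are no longer available afterwards. A small sketch of an alternative, assuming a new column named `host_year` (a hypothetical name) and invented dates:

```python
import pandas as pd

# Hypothetical registration dates with one missing value
toy = pd.DataFrame(
    {"host_since": pd.to_datetime(["2008-06-26", "2009-01-02", "2009-11-30", None])}
)

# Store the year in a new column so the original datetime column stays intact
toy["host_year"] = toy["host_since"].dt.year

# Rows with a missing year are left out of the groups by default
counts = toy.groupby("host_year").size()
print(counts)
```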

## The Case of the Duplicate Hosts: A Growth Rate Mystery

What's wrong with our approach here?

* A single `host id` may appear on multiple listings, so `.size()` counts listings rather than hosts and the same host may be counted more than once.

* In the cell below, write a solution that provides a more accurate representation of the growth rate of the platform.

_(Remember, the `.nunique()` method is perfect for getting the number of unique values in a Series or within a group.)_

In [15]:
# Begin Solution
df.groupby("host_since")["host_id"].nunique()
# End Solution

host_since
2008.0      25
2009.0     331
2010.0     996
2011.0    2584
2012.0    4585
2013.0    5227
2014.0    6385
2015.0    4285
Name: host_id, dtype: int64
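
The difference between the two counts is easiest to see on a few rows. The sketch below uses invented host IDs and also adds `.pct_change()` as one possible way to express year-over-year growth. That last step is an illustration, not part of the exercise solution.

```python
import pandas as pd

# Hypothetical listings: hosts 1 and 3 each own more than one listing
toy = pd.DataFrame(
    {
        "host_id": [1, 1, 2, 3, 3, 3, 4, 5],
        "host_since": [2008, 2008, 2008, 2009, 2009, 2009, 2009, 2009],
    }
)

# .size() counts rows (listings); .nunique() counts distinct hosts
listings_per_year = toy.groupby("host_since").size()
hosts_per_year = toy.groupby("host_since")["host_id"].nunique()

# Fractional change in new hosts relative to the previous year
growth = hosts_per_year.pct_change()

print(listings_per_year)
print(hosts_per_year)
print(growth)
```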

## Freestyle

What information can you and your group gather from the dataset on your own?

Use the rest of the notebook to explore and discover new patterns.