## Analyzing AirBnB Data with `Pandas`

### Step 0: Importing packages

In [4]:
import pandas as pd 
import seaborn as sns 

### Step 1: Importing the data

We import our data using Pandas' `read_csv` function:

In [2]:
data = pd.read_csv("data/airbnb_nyc_2019.csv")

Let's take a peek at the first few rows using `head()`. By default, it returns the first 5 rows of the dataframe. We can change how many rows are returned by passing a number to `head()`. For example, `data.head(2)` returns the first 2 rows:

In [5]:
data.head(2)

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355


We can also get the number of rows and columns of the dataframe using `data.shape`. This returns a tuple of the form `(number of rows, number of columns)`.

In [6]:
data.shape

(48895, 16)

Using `shape`, we can see that there are 48,895 rows and 16 columns in our dataset.
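
Because `shape` is a plain tuple, it can be unpacked into two variables. Here is a minimal sketch on a small made-up dataframe (`toy` and its values are illustrative only, not the AirBnB data):

```python
import pandas as pd

# Hypothetical toy dataframe, used only for illustration
toy = pd.DataFrame({
    "price": [149, 225, 150],
    "room_type": ["Private room", "Entire home/apt", "Private room"],
})

# shape is a (rows, columns) tuple, so it can be unpacked directly
n_rows, n_cols = toy.shape
print(f"{n_rows} rows, {n_cols} columns")
```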

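We can also get an overview of the missing values and datatype of each column with `info()`. Any column whose non-null count is below the 48,895 total rows has missing values. In the output below, that includes `name` and `host_name`, and the same is true of `last_review` and `reviews_per_month`.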

In [11]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48895 entries, 0 to 48894
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              48895 non-null  int64  
 1   name                            48879 non-null  object 
 2   host_id                         48895 non-null  int64  
 3   host_name                       48874 non-null  object 
 4   neighbourhood_group             48895 non-null  object 
 5   neighbourhood                   48895 non-null  object 
 6   latitude                        48895 non-null  float64
 7   longitude                       48895 non-null  float64
 8   room_type                       48895 non-null  object 
 9   price                           48895 non-null  int64  
 10  minimum_nights                  48895 non-null  int64  
 11  number_of_reviews               48895 non-null  int64  
 12  last_review                     
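
If we only care about missing values, one option is to count them per column directly instead of reading them off the `info()` output. This is a minimal sketch on a made-up dataframe (`toy` and its values are illustrative only); the same calls should work on `data`:

```python
import pandas as pd

# Hypothetical toy dataframe with a few gaps, for illustration only
toy = pd.DataFrame({
    "name": ["Loft", None, "Studio"],
    "price": [149, 225, 150],
    "reviews_per_month": [0.21, None, None],
})

# isna() marks missing cells as True; sum() counts them per column
missing = toy.isna().sum()

# Keep only the columns that actually have missing values
print(missing[missing > 0])
```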

### Step 2: Data Exploration

#### How many unique neighbourhood groups are there?

In [13]:
data['neighbourhood_group'].value_counts()

Manhattan        21661
Brooklyn         20104
Queens            5666
Bronx             1091
Staten Island      373
Name: neighbourhood_group, dtype: int64

We can see that the neighbourhood groups represent New York City's boroughs. There are 5 groups in our dataset, and `value_counts()` also shows how many listings fall in each one.
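
If only the number of distinct groups is needed, rather than the listings per group, `nunique()` and `unique()` answer the question more directly. A minimal sketch on a made-up dataframe (`toy` is illustrative only):

```python
import pandas as pd

# Hypothetical toy dataframe, used only for illustration
toy = pd.DataFrame({
    "neighbourhood_group": ["Brooklyn", "Manhattan", "Brooklyn", "Queens"],
})

# nunique() returns the number of distinct values in the column
n_groups = toy["neighbourhood_group"].nunique()

# unique() returns the distinct values themselves
group_names = sorted(toy["neighbourhood_group"].unique())

print(n_groups, group_names)
```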

#### How many neighbourhoods are in each borough?

To find out how many neighbourhoods are in each borough, we can `groupby` borough (neighbourhood group) and count the number of unique neighbourhoods using `nunique()`.

In [16]:
data.groupby('neighbourhood_group')['neighbourhood'].nunique()

neighbourhood_group
Bronx            48
Brooklyn         47
Manhattan        32
Queens           51
Staten Island    43
Name: neighbourhood, dtype: int64

We can see that Queens has the most neighbourhoods (51) while Manhattan has the fewest (32).
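
Reading the largest and smallest values off a printed Series by eye is error-prone. As a sketch, `idxmax()` and `idxmin()` return the corresponding index labels directly, and `sort_values()` orders the whole result. The dataframe below is made up for illustration (`toy` and its neighbourhood names are not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy dataframe, used only for illustration
toy = pd.DataFrame({
    "neighbourhood_group": ["Queens", "Queens", "Queens",
                            "Manhattan", "Manhattan",
                            "Bronx", "Bronx", "Bronx"],
    "neighbourhood": ["Astoria", "Flushing", "Astoria",
                      "Midtown", "Midtown",
                      "Fordham", "Riverdale", "Mott Haven"],
})

# Number of distinct neighbourhoods per group
counts = toy.groupby("neighbourhood_group")["neighbourhood"].nunique()

# Labels of the groups with the most and fewest neighbourhoods
most = counts.idxmax()
fewest = counts.idxmin()

# Full ranking, largest first
print(counts.sort_values(ascending=False))
print(most, fewest)
```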

#### What's the average cost of an AirBnB in NYC?

We can apply the `mean()` function to the `'price'` column to figure this out:

In [27]:
mean_price = data['price'].mean()
mean_price

152.7206871868289

Using Python's f-string formatting, we can display the mean price rounded to two decimal places by adding the format specifier `:.2f` after `mean_price`:

In [23]:
print(f"The mean price of an AirBnb in NYC is {mean_price:.2f}")

The mean price of an AirBnb in NYC is 152.72


The f-string only changes how the value is displayed; `mean_price` itself keeps its full precision. If we want a value that is actually rounded to 2 decimal places, we can use the `round()` function (assigning the result to a variable if we want to keep it):

In [26]:
round(data['price'].mean(), 2)

152.72
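
To make the difference concrete, here is a small sketch that compares the two approaches. It hard-codes the mean shown above instead of loading the data, and the names `formatted` and `rounded` are introduced here for illustration:

```python
# Mean price copied from the output above, so the example runs without the CSV
mean_price = 152.7206871868289

# Formatting produces a string for display; mean_price is untouched
formatted = f"{mean_price:.2f}"

# round() returns a new float; mean_price is still untouched unless reassigned
rounded = round(mean_price, 2)

print(type(formatted).__name__, formatted)
print(type(rounded).__name__, rounded)
print(mean_price)
```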