![NYC Skyline](nyc.jpg)

Welcome to New York City, one of the most-visited cities in the world. Many Airbnb listings in New York City meet the high demand for temporary lodging, with stays ranging from a few nights to many months. In this project, we will take a closer look at the New York Airbnb market by combining data from multiple file types: `.csv`, `.tsv`, and `.xlsx`.

Recall that **CSV**, **TSV**, and **Excel** files are three common formats for storing data. 
Three files containing data on 2019 Airbnb listings are available to you:

**data/airbnb_price.csv**
This is a CSV file containing data on Airbnb listing prices and locations.
- **`listing_id`**: unique identifier of listing
- **`price`**: nightly listing price in USD
- **`nbhood_full`**: name of borough and neighborhood where listing is located

**data/airbnb_room_type.xlsx**
This is an Excel file containing data on Airbnb listing descriptions and room types.
- **`listing_id`**: unique identifier of listing
- **`description`**: listing description
- **`room_type`**: Airbnb has three types of rooms: shared rooms, private rooms, and entire homes/apartments

**data/airbnb_last_review.tsv**
This is a TSV file containing data on Airbnb host names and review dates.
- **`listing_id`**: unique identifier of listing
- **`host_name`**: name of listing host
- **`last_review`**: date when the listing was last reviewed

As a consultant working for a real estate start-up, you have collected Airbnb listing data from various sources to investigate the short-term rental market in New York. You'll analyze this data to provide insights on private rooms to the real estate company.

There are three files in the data folder: `airbnb_price.csv`, `airbnb_room_type.xlsx`, and `airbnb_last_review.tsv`.

- What are the dates of the earliest and most recent reviews? Store these values as two separate variables with your preferred names.
- How many of the listings are private rooms? Save this count in a variable of your choice.
- What is the average listing price? Round to two decimal places and save the result in a variable.
- Combine the new variables into one DataFrame called `review_dates` with four columns in the following order: `first_reviewed`, `last_reviewed`, `nb_private_rooms`, and `avg_price`. The DataFrame should contain only one row of values.

In [32]:
# Import necessary packages
import pandas as pd
import numpy as np

In [33]:
airbnb_price = pd.read_csv('data/airbnb_price.csv')
airbnb_room_type = pd.read_excel('data/airbnb_room_type.xlsx')
airbnb_last_review = pd.read_csv('data/airbnb_last_review.tsv', sep='\t')

airbnb_price.head()

|   | listing_id | price | nbhood_full |
|---|---|---|---|
| 0 | 2595 | 225 dollars | Manhattan, Midtown |
| 1 | 3831 | 89 dollars | Brooklyn, Clinton Hill |
| 2 | 5099 | 200 dollars | Manhattan, Murray Hill |
| 3 | 5178 | 79 dollars | Manhattan, Hell's Kitchen |
| 4 | 5238 | 150 dollars | Manhattan, Chinatown |
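
The CSV and TSV files are both read with `pd.read_csv`; only the separator differs. The sketch below illustrates this on small in-memory strings rather than the project files. The sample rows are copied from the previews in this notebook, and the names `csv_text`, `tsv_text`, `prices`, and `reviews` are hypothetical.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-ins for the CSV and TSV files.
csv_text = 'listing_id,price,nbhood_full\n2595,225 dollars,"Manhattan, Midtown"\n'
tsv_text = "listing_id\thost_name\tlast_review\n2595\tJennifer\tMay 21 2019\n"

# read_csv splits on commas by default; a quoted field may itself contain a comma.
prices = pd.read_csv(io.StringIO(csv_text))

# The same function reads tab-separated data when sep="\t" is passed.
reviews = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(prices.columns.tolist())
print(reviews.columns.tolist())
```

Note that the quoted `"Manhattan, Midtown"` value stays in a single column even though it contains a comma.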


In [34]:
airbnb_price.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25209 entries, 0 to 25208
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   listing_id   25209 non-null  int64 
 1   price        25209 non-null  object
 2   nbhood_full  25209 non-null  object
dtypes: int64(1), object(2)
memory usage: 591.0+ KB


In [35]:
airbnb_price.columns.to_list()

['listing_id', 'price', 'nbhood_full']

In [36]:
airbnb_room_type.head()

|   | listing_id | description | room_type |
|---|---|---|---|
| 0 | 2595 | Skylit Midtown Castle | Entire home/apt |
| 1 | 3831 | Cozy Entire Floor of Brownstone | Entire home/apt |
| 2 | 5099 | Large Cozy 1 BR Apartment In Midtown East | Entire home/apt |
| 3 | 5178 | Large Furnished Room Near B'way | private room |
| 4 | 5238 | Cute & Cozy Lower East Side 1 bdrm | Entire home/apt |


In [37]:
airbnb_room_type.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25209 entries, 0 to 25208
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   listing_id   25209 non-null  int64 
 1   description  25199 non-null  object
 2   room_type    25209 non-null  object
dtypes: int64(1), object(2)
memory usage: 591.0+ KB


In [38]:
airbnb_room_type.columns.to_list()

['listing_id', 'description', 'room_type']

In [39]:
airbnb_last_review.head()

|   | listing_id | host_name | last_review |
|---|---|---|---|
| 0 | 2595 | Jennifer | May 21 2019 |
| 1 | 3831 | LisaRoxanne | July 05 2019 |
| 2 | 5099 | Chris | June 22 2019 |
| 3 | 5178 | Shunichi | June 24 2019 |
| 4 | 5238 | Ben | June 09 2019 |


In [40]:
airbnb_last_review.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25209 entries, 0 to 25208
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   listing_id   25209 non-null  int64 
 1   host_name    25201 non-null  object
 2   last_review  25209 non-null  object
dtypes: int64(1), object(2)
memory usage: 591.0+ KB


In [41]:
airbnb_last_review.columns.to_list()

['listing_id', 'host_name', 'last_review']

In [42]:
airbnb = pd.merge(airbnb_room_type, airbnb_price, on='listing_id')
airbnb = pd.merge(airbnb, airbnb_last_review, on='listing_id')
airbnb.head()

|   | listing_id | description | room_type | price | nbhood_full | host_name | last_review |
|---|---|---|---|---|---|---|---|
| 0 | 2595 | Skylit Midtown Castle | Entire home/apt | 225 dollars | Manhattan, Midtown | Jennifer | May 21 2019 |
| 1 | 3831 | Cozy Entire Floor of Brownstone | Entire home/apt | 89 dollars | Brooklyn, Clinton Hill | LisaRoxanne | July 05 2019 |
| 2 | 5099 | Large Cozy 1 BR Apartment In Midtown East | Entire home/apt | 200 dollars | Manhattan, Murray Hill | Chris | June 22 2019 |
| 3 | 5178 | Large Furnished Room Near B'way | private room | 79 dollars | Manhattan, Hell's Kitchen | Shunichi | June 24 2019 |
| 4 | 5238 | Cute & Cozy Lower East Side 1 bdrm | Entire home/apt | 150 dollars | Manhattan, Chinatown | Ben | June 09 2019 |
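
An inner merge on `listing_id` silently drops unmatched keys and duplicates rows when a key repeats. One possible safeguard, sketched here on hypothetical miniature tables (not the project files), is the `validate` argument of `pd.merge` combined with a row-count check.

```python
import pandas as pd

# Hypothetical miniature versions of the three tables.
room_type = pd.DataFrame(
    {"listing_id": [2595, 3831], "room_type": ["Entire home/apt", "private room"]}
)
price = pd.DataFrame(
    {"listing_id": [2595, 3831], "price": ["225 dollars", "89 dollars"]}
)
last_review = pd.DataFrame(
    {"listing_id": [2595, 3831], "last_review": ["May 21 2019", "July 05 2019"]}
)

# validate="one_to_one" raises pandas.errors.MergeError if listing_id
# is duplicated in either table being merged.
merged = pd.merge(room_type, price, on="listing_id", validate="one_to_one")
merged = pd.merge(merged, last_review, on="listing_id", validate="one_to_one")

# The default inner join drops unmatched keys, so compare row counts.
assert len(merged) == len(room_type)
print(merged.shape)
```

Whether the real files satisfy the one-to-one assumption is not established here; the `info()` outputs above only show that all three tables have 25209 rows.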


### Question 1

What are the dates of the earliest and most recent reviews? Store these values as two separate variables with your preferred names.

In [43]:
airbnb['last_review'] = pd.to_datetime(airbnb['last_review'], format='%B %d %Y')
airbnb.dtypes


listing_id              int64
description            object
room_type              object
price                  object
nbhood_full            object
host_name              object
last_review    datetime64[ns]
dtype: object

In [44]:
sorted_last_reviews = airbnb['last_review'].sort_values(ascending=True)
sorted_last_reviews

12007   2019-01-01
10430   2019-01-01
14845   2019-01-01
7203    2019-01-01
17123   2019-01-01
           ...    
23952   2019-07-08
9595    2019-07-08
9505    2019-07-08
17563   2019-07-08
58      2019-07-09
Name: last_review, Length: 25209, dtype: datetime64[ns]

In [45]:
earliest_review = sorted_last_reviews.min()
recent_review = sorted_last_reviews.max()

print(f"Earliest review was: {earliest_review} and most recent review was {recent_review} ")

Earliest review was: 2019-01-01 00:00:00 and most recent review was 2019-07-09 00:00:00 
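
The review dates are stored as text such as `May 21 2019`, so they must be parsed before `min()` and `max()` give chronological results. The sketch below shows the idea on three sample values taken from the preview above, passing an explicit `format` so pandas does not have to infer it. The variable names are hypothetical.

```python
import pandas as pd

# Three sample values in the same "Month DD YYYY" layout as last_review.
dates = pd.Series(["May 21 2019", "July 05 2019", "June 22 2019"])

# %B = full month name, %d = zero-padded day, %Y = four-digit year.
parsed = pd.to_datetime(dates, format="%B %d %Y")

# .date() drops the 00:00:00 time component for display.
first = parsed.min().date()
last = parsed.max().date()
print(first, last)
```

Comparing the raw strings instead would sort alphabetically, which places "July" before "May".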


### Question 2

How many of the listings are private rooms? Save this into any variable.

In [46]:
airbnb['room_type'].value_counts()

room_type
Entire home/apt    8458
Private room       7241
entire home/apt    2665
private room       2248
ENTIRE HOME/APT    2143
PRIVATE ROOM       1867
Shared room         380
shared room         110
SHARED ROOM          97
Name: count, dtype: int64

In [47]:
airbnb['room_type'] = airbnb['room_type'].str.lower()
airbnb['room_type'].value_counts()

room_type
entire home/apt    13266
private room       11356
shared room          587
Name: count, dtype: int64

In [48]:
count_private_rooms = airbnb['room_type'].value_counts()['private room']
count_private_rooms

11356
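
An alternative way to count one category is to normalize the text and sum a boolean mask. This is a minimal sketch on made-up sample values; the extra `str.strip()` is an assumption for data with stray whitespace, which the output above does not show for this dataset.

```python
import pandas as pd

# Hypothetical sample with mixed capitalization and stray whitespace.
room_type = pd.Series(
    ["Private room", "private room", "PRIVATE ROOM", "Entire home/apt", " shared room "]
)

# Trim whitespace and lowercase so equal categories compare equal.
normalized = room_type.str.strip().str.lower()

# True counts as 1 when summed, so this counts the matching rows.
n_private = int((normalized == "private room").sum())
print(n_private)
```

Unlike indexing into `value_counts()`, the mask approach returns 0 instead of raising a `KeyError` when the category is absent.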

### Question 3

What is the average listing price? Round to two decimal places and save the result in a variable.

In [51]:
airbnb['price_clean'] = airbnb['price'].str.replace(' dollars', '').astype(float)
airbnb['price_clean']

0        225.0
1         89.0
2        200.0
3         79.0
4        150.0
         ...  
25204    129.0
25205     45.0
25206    235.0
25207    100.0
25208     30.0
Name: price_clean, Length: 25209, dtype: float64

In [55]:
avg_price = airbnb['price_clean'].mean().round(2)
avg_price

141.78
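
The price column holds strings like `225 dollars`, so the suffix has to be removed before averaging. The sketch below repeats that step on three sample prices from the preview; passing `regex=False` is an optional choice that makes the literal-substring intent explicit. The names `price`, `price_clean`, and `avg` are hypothetical.

```python
import pandas as pd

# Three sample prices in the same "<number> dollars" layout.
price = pd.Series(["225 dollars", "89 dollars", "200 dollars"])

# regex=False treats " dollars" as a literal substring, not a pattern.
price_clean = price.str.replace(" dollars", "", regex=False).astype(float)

# Convert to a plain Python float before rounding to two decimals.
avg = round(float(price_clean.mean()), 2)
print(avg)
```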

### Question 4

Combine the new variables into one DataFrame called `review_dates` with four columns in the following order: `first_reviewed`, `last_reviewed`, `nb_private_rooms`, and `avg_price`. The DataFrame should contain only one row of values.

In [54]:
# Use the DataFrame name and column names required by the task
review_dates = pd.DataFrame({
    'first_reviewed': [earliest_review],
    'last_reviewed': [recent_review],
    'nb_private_rooms': [count_private_rooms],
    'avg_price': [avg_price]
})
review_dates

|   | first_reviewed | last_reviewed | nb_private_rooms | avg_price |
|---|---|---|---|---|
| 0 | 2019-01-01 | 2019-07-09 | 11356 | 141.78 |
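
Because the task fixes the DataFrame name, the column order, and the row count, a quick self-check can catch naming slips. The sketch below rebuilds the one-row frame from the values printed earlier in this notebook and asserts the required layout. The `expected_columns` list is a helper introduced only for illustration.

```python
import pandas as pd

# Values copied from the outputs shown earlier in this notebook.
review_dates = pd.DataFrame(
    {
        "first_reviewed": [pd.Timestamp("2019-01-01")],
        "last_reviewed": [pd.Timestamp("2019-07-09")],
        "nb_private_rooms": [11356],
        "avg_price": [141.78],
    }
)

# Hypothetical helper list: the column order required by the task.
expected_columns = ["first_reviewed", "last_reviewed", "nb_private_rooms", "avg_price"]

# Each value is wrapped in a one-element list, so the frame has one row.
assert review_dates.columns.tolist() == expected_columns
assert review_dates.shape == (1, 4)
print(review_dates)
```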


# Solution

In [None]:
# Import necessary packages
import pandas as pd
import numpy as np

# Import CSV for prices
airbnb_price = pd.read_csv('data/airbnb_price.csv')

# Import Excel file for room types
airbnb_room_type = pd.read_excel('data/airbnb_room_type.xlsx')

# Import TSV for review dates
airbnb_last_review = pd.read_csv('data/airbnb_last_review.tsv', sep='\t')

# Join the three data frames together into one
listings = pd.merge(airbnb_price, airbnb_room_type, on='listing_id')
listings = pd.merge(listings, airbnb_last_review, on='listing_id')

# What are the dates of the earliest and most recent reviews?
# To use a function like max()/min() on last_review date column, it needs to be converted to datetime type
listings['last_review_date'] = pd.to_datetime(listings['last_review'], format='%B %d %Y')
first_reviewed = listings['last_review_date'].min()
last_reviewed = listings['last_review_date'].max()

# How many of the listings are private rooms?
# Since there are differences in capitalization, make capitalization consistent
listings['room_type'] = listings['room_type'].str.lower()
private_room_count = listings[listings['room_type'] == 'private room'].shape[0]

# What is the average listing price?
# To convert price to numeric, remove " dollars" from each value
listings['price_clean'] = listings['price'].str.replace(' dollars', '').astype(float)
avg_price = listings['price_clean'].mean()

review_dates = pd.DataFrame({
    'first_reviewed': [first_reviewed],
    'last_reviewed': [last_reviewed],
    'nb_private_rooms': [private_room_count],
    'avg_price': [round(avg_price, 2)]
})

print(review_dates)
