# Filtering Airbnb

### Introduction

In this lesson, we'll use our knowledge of loops and filtering to work with Airbnb data in New York City. Let's get started.

### Loading and Exploring our Data

Let's start by loading up our data.

In [1]:
import pandas as pd
listings_df = pd.read_csv('https://raw.githubusercontent.com/eng-6-22/mod-1-a-data-structures/master/3-coercing-filtering-data/AB_NYC_2019.csv')

listings = listings_df.to_dict('records')
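
To make the shape of `listings` concrete: `to_dict('records')` returns a list with one dictionary per row, keyed by column name. The sketch below mimics that shape with the standard library's `csv.DictReader` rather than pandas, on a tiny made-up CSV string (the sample rows are invented for illustration, and unlike pandas, `DictReader` leaves every value as a string).

```python
import csv
import io

# Hypothetical two-row sample; the real file has many more columns.
sample_csv = """id,neighbourhood_group,price
1,Brooklyn,149
2,Manhattan,225
"""

# One dictionary per row, keyed by the header names.
sample_listings = list(csv.DictReader(io.StringIO(sample_csv)))

print(len(sample_listings))
print(sample_listings[0])
```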

Let's start by seeing the number of listings we have gathered.

In [2]:
len(listings)

# 48895

48895

And now, let's see which attributes are available to us on each listing.  Look at the keys available on a single listing.

In [5]:
listings[0].keys()


# dict_keys(['id', 'name', 'host_id', 'host_name', 'neighbourhood_group',
# 'neighbourhood', 'latitude', 'longitude', 'room_type', 'price',
# 'minimum_nights', 'number_of_reviews', 'last_review', 'reviews_per_month',
# 'calculated_host_listings_count', 'availability_365'])

dict_keys(['id', 'name', 'host_id', 'host_name', 'neighbourhood_group', 'neighbourhood', 'latitude', 'longitude', 'room_type', 'price', 'minimum_nights', 'number_of_reviews', 'last_review', 'reviews_per_month', 'calculated_host_listings_count', 'availability_365'])

Now we'd like to calculate some statistics with this data.  Before doing so, let's get a sense of how recent our data is.  

To start, let's select just the `last_review` value from each element.  
> Use list comprehension to create a list of `last_reviews`, one for each listing.

In [6]:
last_reviews = [listing['last_review'] for listing in listings]

In [7]:
last_reviews[:5]

# ['2018-10-19', '2019-05-21', nan, '2019-07-05', '2018-11-19']

['2018-10-19', '2019-05-21', nan, '2019-07-05', '2018-11-19']

We can see from the above that some of our listings have `nan` values.  
> `nan` stands for "not a number" and is generally used to represent missing values.  

Let's filter out the listings with `last_review` values of `nan`.  Removing only `nan` values can be tricky, so here's a hint to get you started:

> `nan` is of type `float`.  So to check if our value is `nan`, we can use something like the following.

In [None]:
type(last_reviews[0]) == float

False

In [None]:
type(last_reviews[2]) == float

True

In [None]:
last_reviews[0], last_reviews[2]
# ('2018-10-19', nan)
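
The type check works here because every real `last_review` is a string, so the only floats in the column are the missing values. A related quirk worth knowing, sketched below with a plain `float('nan')` standing in for the value pandas produces: `nan` is not equal to itself, so an `== nan` comparison cannot be used to find it.

```python
missing = float('nan')   # stand-in for the nan pandas uses for a missing value
present = '2018-10-19'

print(type(missing) == float)   # True
print(type(present) == float)   # False

# nan never compares equal to anything, including itself,
# so equality checks cannot detect it.
print(missing == missing)       # False
```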

Now, use what you learned from above to select *listings* that do not have a `last_review` value of nan.

In [8]:
listings_not_nan = [listing for listing in listings if type(listing['last_review']) != float]

In [9]:
len(listings_not_nan)

# 38843

38843

Calculate the percentage of listings we have left.

In [11]:
len(listings_not_nan)/len(listings)

# 0.7944166070150323

0.7944166070150323

Ok, not amazing, but not bad.

### Back on Track

Now that we've removed our listings with a last_review of nan, let's make sure that we are working with relatively recent reviews.

Our first step is to change the first `last_review` value from a Python string to a datetime object.

> Let's practice this on a single element first.

> See [this post](https://www.digitalocean.com/community/tutorials/python-string-to-datetime-strptime) on converting a string to a datetime.

In [12]:
first_listing = listings[0]

In [14]:
last_review = first_listing['last_review']
last_review

'2018-10-19'

In [18]:
from datetime import datetime

last_review_datetime = datetime.strptime(last_review, '%Y-%m-%d')

In [19]:
last_review_datetime.year, last_review_datetime.month

(2018, 10)
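
The format string tells `strptime` how to read the text: `%Y` is a four-digit year, `%m` a zero-padded month, and `%d` a zero-padded day. A minimal sketch of the round trip, including what happens when the text does not match the format:

```python
from datetime import datetime

parsed = datetime.strptime('2018-10-19', '%Y-%m-%d')
print(parsed.year, parsed.month, parsed.day)   # 2018 10 19

# strftime goes the other way, from datetime back to text.
print(parsed.strftime('%Y-%m-%d'))             # 2018-10-19

# A string that does not match the format raises ValueError.
try:
    datetime.strptime('10/19/2018', '%Y-%m-%d')
except ValueError as error:
    print('could not parse:', error)
```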

In [20]:
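# copy each dictionary too, so changing the copies leaves listings_not_nan untouched
listings_not_nan_copied = [listing.copy() for listing in listings_not_nan]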

> Now iterate through `listings_not_nan_copied` and change each `last_review` to a datetime.

In [21]:
# write code here to change the dictionaries in listings_not_nan_copied
for listing in listings_not_nan_copied:
  listing['last_review'] = datetime.strptime(listing['last_review'], '%Y-%m-%d')
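
When copying a list of dictionaries before modifying them, note that `list.copy()` is shallow. It builds a new list, but both lists still point at the same dictionaries, so changing a dictionary through one list is visible through the other. Copying each dictionary (or using `copy.deepcopy`) keeps the originals untouched. A sketch with made-up data:

```python
original = [{'last_review': '2018-10-19'}]

# New list, same dictionaries.
shallow = original.copy()
# New list, new dictionaries.
independent = [listing.copy() for listing in original]

shallow[0]['last_review'] = 'changed'

print(original[0]['last_review'])      # changed
print(independent[0]['last_review'])   # 2018-10-19
```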

> We can check that each `last_review` is now a datetime object.

In [22]:
updated_last_reviews = [listing['last_review']
                        for listing in listings_not_nan_copied]

updated_last_reviews[:2]

# [datetime.datetime(2018, 10, 19, 0, 0), datetime.datetime(2019, 5, 21, 0, 0)]

[datetime.datetime(2018, 10, 19, 0, 0), datetime.datetime(2019, 5, 21, 0, 0)]

Ok, now find the listing with the oldest last_review, and then we'll find the listing with the most recent last review.

In [27]:
earliest_listing = min(listings_not_nan_copied, key=lambda listing: listing['last_review'])

In [28]:
earliest_listing['last_review']

# datetime.datetime(2011, 3, 28, 0, 0)

datetime.datetime(2011, 3, 28, 0, 0)

In [29]:
latest_listing = max(listings_not_nan_copied, key=lambda listing: listing['last_review'])


In [33]:
latest_listing['last_review'].year

# 2019 (the full value is datetime.datetime(2019, 7, 8, 0, 0))

2019
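
`min` and `max` accept a `key` function that says what to compare; they return the whole element, not just the compared value. A small sketch with made-up listings:

```python
from datetime import datetime

sample = [
    {'id': 1, 'last_review': datetime(2018, 10, 19)},
    {'id': 2, 'last_review': datetime(2011, 3, 28)},
    {'id': 3, 'last_review': datetime(2019, 7, 8)},
]

oldest = min(sample, key=lambda listing: listing['last_review'])
newest = max(sample, key=lambda listing: listing['last_review'])

print(oldest['id'], newest['id'])   # 2 3
```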

So we can see that the last reviews in our data range from 2011 to July 2019.  Let's limit our data to listings last reviewed between July 2017 and July 2019.

In [40]:
# keep listings whose last review falls on or after July 1, 2017
recent_listings = [listing for listing in listings_not_nan_copied if listing['last_review'] > datetime(2017, 6, 30)]

In [41]:
len(recent_listings)

# 33259

33259
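
Because every parsed date sits at midnight, `> datetime(2017, 6, 30)` and `>= datetime(2017, 7, 1)` select the same listings; the second form states the July 2017 cutoff more directly. A sketch on made-up dates:

```python
from datetime import datetime

dates = [datetime(2017, 6, 30), datetime(2017, 7, 1), datetime(2019, 7, 8)]

after_june_30 = [d for d in dates if d > datetime(2017, 6, 30)]
from_july_1 = [d for d in dates if d >= datetime(2017, 7, 1)]

print(after_june_30 == from_july_1)   # True
print(len(from_july_1))               # 2
```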

So now we have about 33,000 recent listings.

### Listings by Neighborhood

Let's get a better sense of some of these recent listings.  Begin by creating a list of the unique `neighbourhood_group` values in our recent listings.

In [42]:
# write code here
list(set([listing['neighbourhood_group'] for listing in recent_listings]))

# ['Queens', 'Brooklyn', 'Bronx', 'Manhattan', 'Staten Island']

['Staten Island', 'Bronx', 'Brooklyn', 'Queens', 'Manhattan']
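
Note that the two lists above show the same boroughs in different orders: a `set` has no guaranteed order, so the result can differ between runs. Wrapping the set in `sorted` gives a stable, alphabetical list. A sketch with made-up values:

```python
groups = ['Queens', 'Brooklyn', 'Queens', 'Manhattan', 'Brooklyn']

# Duplicates removed, then put in alphabetical order.
unique_groups = sorted(set(groups))
print(unique_groups)   # ['Brooklyn', 'Manhattan', 'Queens']
```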

Ok, so it looks like the five boroughs.  Now let's find the number of `recent_listings` in Manhattan priced at 50 dollars or less.

In [43]:
cheaper_manhattan_listings = [listing for listing in recent_listings if listing['neighbourhood_group'] == 'Manhattan' and listing['price'] <= 50]

In [44]:
len(cheaper_manhattan_listings)

# 640

640

### Summary

In this lesson, we saw how to use filtering to reduce our dataset down to higher quality data and then query it.  We also removed `nan` values by checking whether the datatype of each value was a float.  As an alternative, we could have used the `isnan` function from the `math` library.

In [39]:
import math
import numpy as np

math.isnan(np.nan)

True
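
One caveat if you apply this to `last_review`: `math.isnan` only accepts numbers and raises a `TypeError` when given a string such as `'2018-10-19'`. A minimal sketch of a helper that checks the type first (the name `is_missing` is made up for this example):

```python
import math

def is_missing(value):
    """Return True only for a float nan; strings and other values are not missing."""
    return isinstance(value, float) and math.isnan(value)

reviews = ['2018-10-19', float('nan'), '2019-07-05']
kept = [review for review in reviews if not is_missing(review)]

print(kept)   # ['2018-10-19', '2019-07-05']
```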