### 0. Advance Preparation

Please download the dataset from the site below and put it in a `data` directory located in the same directory as this notebook.

__New York City Airbnb Open Data__  
  URL:  
  ・https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data/  
  DATA:  
  ・AB_NYC_2019.csv

In [None]:
%matplotlib inline

import branca.colormap as cm
import folium
from folium.plugins import HeatMap
import pandas as pd
import seaborn as sns

### 1. Selecting rows and columns by integer locations and labels

Read the Airbnb listings data file into a DataFrame.

In [None]:
ab_listings_df = pd.read_csv('data/AB_NYC_2019.csv')

len(ab_listings_df)

Check the contents.

In [None]:
ab_listings_df.head()

Check the data types and find any missing values.

In [None]:
ab_listings_df.info()
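
As a complement to `info()`, the number of missing values per column can be counted directly. The following is a minimal sketch on a small made-up DataFrame (`toy_df` and its values are hypothetical stand-ins, not the real listings); the same call should work on `ab_listings_df`.

```python
import pandas as pd

# Hypothetical toy data standing in for the Airbnb listings
toy_df = pd.DataFrame({
    'name': ['Loft', None, 'Studio'],
    'price': [120, 80, 95],
    'reviews_per_month': [0.5, None, None],
})

# isna() marks missing cells as True; sum() counts them per column
missing_counts = toy_df.isna().sum()
print(missing_counts)
```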

#### 1-1. Selecting rows and columns by integer locations
Use `pandas.DataFrame.iloc` for selecting specific rows or columns by integer locations.  
Since `pandas.DataFrame.iloc` returns an indexer, use square brackets to specify integer locations.

Get the third row.

In [None]:
ab_listings_df.iloc[2] # Positions start at 0

Get the second, third, fourth rows.  
You need to specify the locations like \[start:end+1\], because the end position is not included.

In [None]:
ab_listings_df.iloc[1:4] # Positions start at 0

Get only the sixth column.  
When you don't need to specify any rows, just put ":" without numbers.

In [None]:
ab_listings_df.iloc[:, 5] # Positions start at 0
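
Row and column positions can also be combined in a single `iloc` call. The sketch below uses a small made-up DataFrame (`toy_df` is hypothetical) to pick the rows at positions 1 to 3 and the first and third columns.

```python
import pandas as pd

# Hypothetical toy data; column names mimic the listings file
toy_df = pd.DataFrame({
    'id': [10, 11, 12, 13, 14],
    'neighbourhood': ['A', 'B', 'C', 'D', 'E'],
    'price': [50, 60, 70, 80, 90],
})

# Rows at positions 1, 2, 3 (end position 4 is excluded)
# and columns at positions 0 and 2
subset = toy_df.iloc[1:4, [0, 2]]
print(subset)
```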

#### 1-2. Selecting rows and columns by labels
Use `pandas.DataFrame.loc` for selecting specific rows or columns by labels.  
Since `pandas.DataFrame.loc` returns an indexer, use square brackets to specify labels.

To get a DataFrame with meaningful row labels, call `groupby()` with `neighbourhood` as an argument; the neighbourhood names then become the index.

In [None]:
# Use the mean as a simple aggregation for now; numeric_only=True skips text columns
ab_listings_group_by_neighbourhood_df = ab_listings_df.groupby('neighbourhood').mean(numeric_only=True).round(2)

ab_listings_group_by_neighbourhood_df.head()

Get rows by specifying a row label.

In [None]:
ab_listings_group_by_neighbourhood_df.loc['Arden Heights']

Select columns by specifying a column label.  
When you don't need to specify any rows, just put ":" without labels.

In [None]:
ab_listings_df.loc[:,'neighbourhood']

When you select columns by specifying labels, you can omit `loc`.

In [None]:
ab_listings_df['neighbourhood']
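
One difference between the two indexers is worth seeing side by side: a slice in `loc` includes its end label, while a slice in `iloc` excludes its end position. A minimal sketch, assuming a made-up DataFrame (`toy_df` and its index labels are hypothetical):

```python
import pandas as pd

# Hypothetical toy data indexed by neighbourhood name
toy_df = pd.DataFrame(
    {'price': [50, 60, 70, 80]},
    index=['Astoria', 'Bronxdale', 'Chelsea', 'DUMBO'],
)

# Label slice: both 'Bronxdale' and 'Chelsea' are included
by_label = toy_df.loc['Bronxdale':'Chelsea']

# Position slice: only position 1 is included, position 2 is excluded
by_position = toy_df.iloc[1:2]

print(len(by_label), len(by_position))
```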

### 2. Selecting rows and columns by conditions

When you compare a column's values to a number, for instance by applying the less-than operator to the `price` column, you get a boolean `Series` with the same number of rows as the DataFrame.

In [None]:
price_under_100_bools = ab_listings_df['price'] < 100

price_under_100_bools

If you put that boolean `Series` inside the brackets after the DataFrame, you get only the rows whose corresponding value in the `Series` is `True`.

In [None]:
price_under_100_df = ab_listings_df[price_under_100_bools]

price_under_100_df

Make sure that the maximum value in the new DataFrame's `price` column is less than 100.

In [None]:
price_under_100_df['price'].max()

You'll get the same result with `pandas.DataFrame.loc`.

In [None]:
price_under_100_loc_df = ab_listings_df.loc[price_under_100_bools]

price_under_100_loc_df
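
Several conditions can be combined into one boolean mask with `&` (and) or `|` (or). Each comparison needs its own parentheses because these operators bind more tightly than `<` or `==`. A minimal sketch on made-up data (`toy_df` is hypothetical):

```python
import pandas as pd

# Hypothetical toy data; column names mimic the listings file
toy_df = pd.DataFrame({
    'room_type': ['Private room', 'Entire home/apt', 'Private room', 'Shared room'],
    'price': [60, 150, 120, 40],
})

# Rows that are both under 100 dollars and private rooms
is_cheap = toy_df['price'] < 100
is_private = toy_df['room_type'] == 'Private room'
cheap_private_df = toy_df[is_cheap & is_private]

# Rows that satisfy at least one of the two conditions
cheap_or_private_df = toy_df[(toy_df['price'] < 100) | (toy_df['room_type'] == 'Private room')]

print(cheap_private_df)
```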

With `pandas.DataFrame.query()`, you can use an SQL-like syntax.

In [None]:
ab_listings_df.query('host_name == "LisaRoxanne"')

It's easy to specify a range with `query()`.

In [None]:
ab_listings_df.query('365 < minimum_nights < 500')

Let's try specifying a range without `query()` and find out what a hassle that can be.

In [None]:
### Let’s try!! ###
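
For comparison after you have tried it yourself, here is one possible approach, shown on made-up data (`toy_df` is hypothetical). Without `query()`, the range has to be written as two separate comparisons joined with `&`.

```python
import pandas as pd

# Hypothetical toy data with values around both range boundaries
toy_df = pd.DataFrame({'minimum_nights': [1, 365, 366, 400, 499, 500, 1000]})

# Without query(): two comparisons, each in parentheses, joined with &
nights = toy_df['minimum_nights']
with_mask_df = toy_df[(nights > 365) & (nights < 500)]

# With query(): the chained comparison used earlier
with_query_df = toy_df.query('365 < minimum_nights < 500')

print(with_mask_df['minimum_nights'].tolist())
```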

### 3. Selecting randomly

You can easily select a random sample with `pandas.DataFrame.sample()`.

Here, we pass 42 for the `random_state` argument so that we get the same results every time we call it.

__Phrases from The Hitchhiker's Guide to the Galaxy__  
https://en.wikipedia.org/wiki/Phrases_from_The_Hitchhiker%27s_Guide_to_the_Galaxy

In [None]:
ab_listings_sample_df = ab_listings_df.sample(frac=0.2, random_state=42) # Extract 20% of all rows

len(ab_listings_sample_df) / len(ab_listings_df)
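
The effect of `random_state` can be checked on a small made-up DataFrame (`toy_df` is hypothetical): two calls with the same seed should return exactly the same rows.

```python
import pandas as pd

# Hypothetical toy data with 100 rows
toy_df = pd.DataFrame({'id': range(100)})

# Same fraction and same seed on both calls
first_sample = toy_df.sample(frac=0.2, random_state=42)
second_sample = toy_df.sample(frac=0.2, random_state=42)

# True when both samples contain the same rows in the same order
print(first_sample.equals(second_sample))
```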

In some cases, `pandas.DataFrame.sample()` might give you a sample that is unsuitable for your analysis.

For example, the host whose `host_id` is 2787 has multiple listings, and the sample might contain only some of them.  
In that case, it would be difficult to analyze things like how many listings each host has.

In [None]:
print("Number of listings for host (host_id=2787)")

print("- Before sampling: ", len(ab_listings_df.query('host_id == 2787')), "listings")

print("- After sampling: ", len(ab_listings_sample_df.query('host_id == 2787')), "listings")

One solution is to sample by host: draw a random sample of the unique `host_id` values, then keep every listing that belongs to the sampled hosts.

The downside of this solution is that you won't get a sample of exactly the size you wanted.

In [None]:
host_id_sample = pd.Series(ab_listings_df['host_id'].unique()).sample(frac=0.2, random_state=42)

ab_listings_sample_revised_df = ab_listings_df.query('host_id in @host_id_sample') # The @ prefix lets query() access a Python variable

ab_listings_sample_revised_df.head()

Make sure that the data you've got has become approximately 20% of its original size.

In [None]:
len(ab_listings_sample_revised_df) / len(ab_listings_df)

Check how many listings the host whose `host_id` is 2787 has in the sampled data.

It should be either the original number or zero, whatever value you pass for the `random_state` argument.

In [None]:
print("Number of listings for host (host_id=2787)")

print("- Before sampling: ", len(ab_listings_df.query('host_id == 2787')), "listings")

print("- After sampling: ", len(ab_listings_sample_revised_df.query('host_id == 2787')), "listings")
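
The all-or-nothing property can be checked for every host at once rather than for a single `host_id`. The sketch below uses made-up data (`toy_df` is hypothetical) and `isin()` in place of `query()`; it is one possible way to verify the idea, not the only one.

```python
import pandas as pd

# Hypothetical toy data: five hosts with one to three listings each
toy_df = pd.DataFrame({
    'host_id': [1, 1, 1, 2, 3, 3, 4, 5, 5, 5],
    'price': [50, 60, 70, 80, 90, 100, 110, 120, 130, 140],
})

# Sample 40% of the unique hosts, then keep all of their listings
sampled_host_ids = pd.Series(toy_df['host_id'].unique()).sample(frac=0.4, random_state=42)
sampled_df = toy_df[toy_df['host_id'].isin(sampled_host_ids)]

# Every sampled host should keep the same number of listings as in the original
original_counts = toy_df['host_id'].value_counts()
sampled_counts = sampled_df['host_id'].value_counts()
all_or_nothing = all(
    sampled_counts[host] == original_counts[host] for host in sampled_counts.index
)

print(all_or_nothing)
```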

### 4. Visualizing data on a map

Since the data set has coordinates, we can put the listings on a map.

First, use `folium.plugins.HeatMap` to see how the listings are spread out, because there are too many of them to plot individually.

Do you find anything interesting?

In [None]:
# Coordinates of New York City
new_york_city_coordinates = [40.7128, -74.0060]

# Extract the coordinates of all listings and convert them to a list
ab_listings_coords = ab_listings_df[['latitude', 'longitude']].values.tolist()

# Draw the map (named nyc_map to avoid shadowing the built-in map())
nyc_map = folium.Map(location=new_york_city_coordinates, zoom_start=9.5)

HeatMap(ab_listings_coords, radius=5, blur=5).add_to(nyc_map)

nyc_map

Next, randomly select 500 listings and draw each one as a circle colored according to its price.

To choose a sensible color range, first create a histogram to see how the prices are distributed.

In [None]:
sns.histplot(ab_listings_df['price'], kde=False)

The prices seem to have some extreme outliers, so remove the listings whose `price` is 1,500 dollars or more and try again.

In [None]:
ab_listings_no_too_expensive_df = ab_listings_df[ab_listings_df['price'] < 1500]

sns.histplot(ab_listings_no_too_expensive_df['price'], kde=False)

This time, we are going to extract 500 rows randomly by calling `sample()` with `n=500` as an argument, and create another histogram.

Make sure that it has a similar shape.

In [None]:
ab_listings_no_too_expensive_sample_df = ab_listings_no_too_expensive_df.sample(n=500)

sns.histplot(ab_listings_no_too_expensive_sample_df['price'], kde=False)

Put circles on a map.

Do you find anything interesting?

In [None]:
# Set the colormap range based on the minimum and maximum seen in the histogram
colormap = cm.LinearColormap(colors=['blue', 'red'], vmin=0, vmax=1000)

# Named nyc_map to avoid shadowing the built-in map()
nyc_map = folium.Map(location=new_york_city_coordinates, zoom_start=9.5)

for index, row in ab_listings_no_too_expensive_sample_df.iterrows():
    location = (row['latitude'], row['longitude'])
    color = colormap(row['price'])
    # Popup content: listing name, host name, and price
    popup_message_html = f"<p>\"{row['name']}\"</p><p>host: {row['host_name']}</p><p>price: ${row['price']:,}</p>"
    popup = folium.Popup(folium.IFrame(popup_message_html), min_width=400, max_width=400)

    folium.Circle(location=location,
                  radius=10,
                  color=color,
                  fill=True,
                  fill_opacity=1,
                  popup=popup).add_to(nyc_map)

nyc_map
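
To build intuition for what a linear blue-to-red colormap does with a price, here is a rough pure-Python sketch. It is not branca's implementation: `price_to_hex` is a hypothetical helper, and clamping prices outside the 0 to 1000 range to the end colors is an assumption made for this illustration.

```python
def price_to_hex(price, vmin=0, vmax=1000):
    """Map a price to a hex color between blue (vmin) and red (vmax)."""
    # Clamp so that prices outside [vmin, vmax] get the end colors
    clamped = min(max(price, vmin), vmax)
    # Position of the price within the range, from 0.0 to 1.0
    ratio = (clamped - vmin) / (vmax - vmin)
    # More red and less blue as the price rises
    red = int(255 * ratio)
    blue = int(255 * (1 - ratio))
    return f"#{red:02x}00{blue:02x}"


print(price_to_hex(0), price_to_hex(1000), price_to_hex(1400))
```

With this scheme, every listing priced above 1,000 dollars gets the same pure red, which is worth keeping in mind when reading the map.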