## Finding insights from AirBnB in Berlin

In this project I applied the [Cross-Industry Standard Process for Data Mining](https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining) (CRISP-DM) to the [Berlin AirBnB dataset](http://insideairbnb.com/get-the-data.html). I hope this project will give you some valuable insights into AirBnB in Berlin.

The CRISP-DM process can be broken down into several steps that help in understanding the problem:

- Business Understanding
- Data Understanding
- Prepare Data
- Data Modeling
- Evaluate the Results
- Deploy


In [1]:
#Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from support import *

plt.style.use('ggplot')  # Nicer-looking plots
plt.rcParams['figure.figsize'] = (15, 5)  # Default figure size

debug = False

## Business Understanding


## Data Understanding

To answer the questions curated below, we first need to understand the data. This includes a general overview of what kind of information is available in the dataset, as well as potential shortcomings such as missing data.

We are going to work with the following datasets:

- listings.csv contains details of all the listings in Berlin, including price, accommodates, ratings, number of reviews, summary, name, owner name, description, host ID, and many other columns describing the listings.

- calendar.csv contains the availability and price of each listing.
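
The missing-data overview mentioned above can be sketched on a toy frame as follows. This is a minimal sketch under stated assumptions: `missing_summary` is a hypothetical helper, not the `colcheck` or `showmissing` function from `support` (whose implementations are not shown here), and the toy values are made up.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and fractions, most incomplete first.

    Hypothetical helper for illustration only.
    """
    summary = pd.DataFrame({
        "missing_count": df.isnull().sum(),
        "missing_fraction": df.isnull().mean(),
    })
    return summary.sort_values("missing_fraction", ascending=False)


# Toy data standing in for listings.csv (values are made up)
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "price": ["$50.00", None, "$75.00", "$1,200.00"],
    "square_feet": [np.nan, np.nan, np.nan, np.nan],
})

summary = missing_summary(toy)

# Columns with no values at all are candidates for dropping
empty_columns = summary.index[summary["missing_fraction"] == 1].tolist()
print(summary)
print(empty_columns)
```

Sorting by the missing fraction puts the least usable columns at the top, which makes it easier to decide which ones to drop before modeling.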


In [2]:
# Read the data
listings = pd.read_csv('data/listings.csv.gz', low_memory=False, compression="gzip")
# calendar = pd.read_csv('./calendar.csv')

In [None]:
# Check the shape and the first rows of the listings dataset
if debug:
    listings.shape
    listings.head()

In [None]:
if debug:
    pd.options.display.max_columns = listings.shape[1]
    listings.describe()

In [2]:
# Data overview
if debug:
    colcheck(listings)

In [None]:
if debug:
    listings.isnull().mean()

In [None]:
# List the columns that contain no values and should be dropped
if debug:
    listings.columns[listings.isnull().mean() == 1]

In [None]:
#listings.columns.tolist()

In [None]:
if debug:
    listings.hist(figsize=(16,100),layout=(44,1));

In [None]:
if debug:
    plt.figure(figsize=(16,10))
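    # Correlate the numeric columns only; object columns cannot be correlated
    sns.heatmap(listings.select_dtypes(include='number').corr(), annot=True, fmt='.1f')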

In [47]:
if debug:
    listings.dtypes.value_counts()

object     62
float64    23
int64      21
dtype: int64

## Prepare Data

Decide on the data that we are going to use for our analysis.

In [32]:
# Define a subset of the original listings dataset
ss = listings[['id', 'price', 'property_type', 'bedrooms', 'minimum_nights',
               'number_of_reviews', 'reviews_per_month', 'review_scores_value',
               'availability_365', 'availability_90', 'availability_60',
               'availability_30']]

# To simplify, let's use only apartments
ss = ss[ss['property_type'] == 'Apartment']

# After filtering out only the apartments let's drop property_type
ss = ss.drop(columns=['property_type'])

In [8]:
if debug:
    showmissing(ss)

In [None]:
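# Drop rows with missing values
# Note: dropna returns a new DataFrame; assign the result (ss = ss.dropna(axis=0)) to keep the change
ss.dropna(axis=0)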

In [33]:
# Cleaning the price column

# ss = ss.replace({'price': r'\$(\d{,3})\,?(\d{,3})\.*(\d{,3})'}, {'price': r'\1\2'}, regex=True)
# ss.price = ss['price'].astype(int)

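# Strip the thousands separator and the currency symbol (regex=False treats '$' literally)
ss['price'] = ss['price'].str.replace(',', '', regex=False)
ss['price'] = ss['price'].str.replace('$', '', regex=False)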
ss['price'] = ss['price'].astype(float)
ss['price'].describe()


count    21849.000000
mean        61.782461
std         91.428355
min          0.000000
25%         33.000000
50%         49.000000
75%         75.000000
max       9000.000000
Name: price, dtype: float64
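
The price cleaning above can also be wrapped in a small reusable function. The following is a minimal sketch rather than the notebook's own code: `clean_price` is a hypothetical helper name, and the sample values are made up apart from the `$5,000.00` format seen in the data.

```python
import pandas as pd


def clean_price(prices: pd.Series) -> pd.Series:
    """Convert price strings such as '$1,250.00' to floats.

    Hypothetical helper; regex=False makes '$' and ',' literal characters
    instead of regular-expression syntax.
    """
    cleaned = (
        prices.str.replace("$", "", regex=False)
              .str.replace(",", "", regex=False)
    )
    return cleaned.astype(float)


raw = pd.Series(["$49.00", "$5,000.00", "$0.00"])  # illustrative sample values
print(clean_price(raw).tolist())
```

Passing `regex=False` matters because `$` is an end-of-string anchor in regular expressions, so being explicit avoids depending on how a given pandas version interprets the pattern.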

In [53]:
ss.price.value_counts(normalize = True, bins = 20)

(-9.001, 450.0]     0.997757
(450.0, 900.0]      0.001693
(900.0, 1350.0]     0.000320
(4950.0, 5400.0]    0.000046
(1350.0, 1800.0]    0.000046
(1800.0, 2250.0]    0.000046
(8550.0, 9000.0]    0.000046
(4050.0, 4500.0]    0.000046
(5400.0, 5850.0]    0.000000
(5850.0, 6300.0]    0.000000
(4500.0, 4950.0]    0.000000
(8100.0, 8550.0]    0.000000
(3600.0, 4050.0]    0.000000
(3150.0, 3600.0]    0.000000
(2700.0, 3150.0]    0.000000
(2250.0, 2700.0]    0.000000
(6300.0, 6750.0]    0.000000
(6750.0, 7200.0]    0.000000
(7200.0, 7650.0]    0.000000
(7650.0, 8100.0]    0.000000
Name: price, dtype: float64

In [51]:
ss.price.min()

0.0

In [46]:
ss[ss['price'] >= 300].shape
# all_rm = all_rm[all_rm['price'] <= 200]

(136, 9)
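
The distribution above shows that almost all prices fall into the lowest bin, while the minimum is 0 and the maximum is 9000. One possible way to handle such values, shown here only as a sketch, is to keep the rows inside a chosen price range. The function name `filter_price_range` and the bounds are assumptions for illustration (the upper bound of 300 mirrors the check above), not a decision taken in this notebook.

```python
import pandas as pd


def filter_price_range(df: pd.DataFrame, low: float = 0.0, high: float = 300.0) -> pd.DataFrame:
    """Keep rows with low < price < high (bounds are illustrative assumptions)."""
    mask = (df["price"] > low) & (df["price"] < high)
    # Return a copy so later column assignments do not touch the original frame
    return df[mask].copy()


# Toy data with a zero price and an extreme price (values are made up)
toy = pd.DataFrame({"id": [1, 2, 3, 4], "price": [0.0, 49.0, 75.0, 5000.0]})

filtered = filter_price_range(toy)
print(filtered["id"].tolist())
```

Whether to drop, cap, or keep the extreme prices depends on the question being asked, so the bounds are best treated as parameters rather than fixed values.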

In [44]:
listings[listings['id'] == 14201780][['id','price']]

            id      price
8200  14201780  $5,000.00
