# Boston and Seattle AirBNB Data Science Case Study

***This analysis follows the Cross-Industry Standard Process for Data Mining (CRISP-DM). Its steps, listed below, are iterative and may be revisited rather than followed strictly in order:***
* Business Understanding.
* Data Understanding.
* Data Preparation. 
* Modelling. 
* Evaluation.
* Deployment.
---
Let's start with the first step: understanding the main problem and framing the questions that address it.

## Business Understanding
This dataset contains information about and attributes of AirBNB listings. The listings are accompanied by various features that can give us different insights into the AirBNB short-term rental business. However, we are mainly interested in the price as our response variable, so we will use the other features in relation to the price to answer the following questions:

1. What is the average price for each listed neighbourhood? Which neighbourhoods have the maximum and minimum average price?
1. Does the neighbourhood affect the listing price?
1. Does the price fluctuate with the month or season of the year?
1. Do hosts with multiple listings or a single listing tend to keep prices stable or change them frequently?
1. What impact do the host attributes (superhost status, acceptance rate, etc.) have on the price?
1. What is the relationship between the property attributes and the price?

---

## Data Understanding 
In this section we identify which data attributes and features can be used to answer each question.
* What is the average price for each listed neighbourhood? Which neighbourhoods have the maximum and minimum average price?

**This question can be answered by grouping the dataframe by neighbourhood and computing the average price for each neighbourhood.**
* Does the neighbourhood affect the listing price?

**We will visualize the distribution of the average price across neighbourhoods in order to find a probable relationship between price and neighbourhood.**
* Does the price fluctuate with the month or season of the year?

**The listing dates will be converted to datetime, and new features (month, year, weekday) will be derived from them in order to examine the relationship.**
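
The date features described above could be derived roughly as follows. This is a minimal sketch on a toy frame, not the notebook's actual code: the `date` and `price` column names and the `$1,250.00`-style price strings are assumptions about the calendar file.

```python
import pandas as pd

# Toy stand-in for the calendar data. The 'date' and 'price' column names and
# the currency-formatted price strings are assumptions for illustration.
calendar = pd.DataFrame({
    "listing_id": [1, 1, 2],
    "date": ["2016-09-06", "2016-12-25", "2017-01-14"],
    "price": ["$65.00", "$1,250.00", None],
})

# Parse the date strings, then derive the year, month and weekday features.
calendar["date"] = pd.to_datetime(calendar["date"])
calendar["year"] = calendar["date"].dt.year
calendar["month"] = calendar["date"].dt.month
calendar["weekday"] = calendar["date"].dt.day_name()

# Strip the currency formatting so that the price can be averaged.
calendar["price"] = (
    calendar["price"].str.replace(r"[$,]", "", regex=True).astype(float)
)

# Average price per calendar month (missing prices are ignored by mean()).
monthly_mean = calendar.groupby("month")["price"].mean()
```

Grouping by `weekday` or `year` instead of `month` gives the corresponding view of the same data.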

* Do hosts with multiple listings or a single listing tend to keep prices stable or change them frequently?

**We will need the 'host_listings_count' column and the price column to answer this question.**

* What impact do the host attributes (superhost status, acceptance rate, etc.) have on the price?

**We will compare the columns that describe the host with the price column to see whether they affect the price. These columns are: 'host_response_rate', 'host_acceptance_rate', 'host_is_superhost', 'review_scores_rating'.**

* What is the relationship between the property attributes and the price?

**We will compare the columns that describe the property (a mix of categorical and numerical features) with the price column to see whether they affect the price. These columns are: 'property_type', 'room_type', 'accommodates', 'bathrooms', 'bedrooms', 'beds', 'bed_type', 'amenities'.**
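
The grouping approach proposed for the first question can be sketched as follows. The frame is a toy example with invented values; `neighbourhood_cleansed` is an assumed column name, and the price is assumed to be numeric already.

```python
import pandas as pd

# Toy stand-in for the listings data. 'neighbourhood_cleansed' is an assumed
# column name and the prices are invented, already-numeric values.
listings = pd.DataFrame({
    "neighbourhood_cleansed": ["Back Bay", "Back Bay", "Roxbury", "Roxbury"],
    "price": [200.0, 300.0, 80.0, 120.0],
})

# Average price per neighbourhood, most expensive first.
mean_price = (
    listings.groupby("neighbourhood_cleansed")["price"]
    .mean()
    .sort_values(ascending=False)
)

# Neighbourhoods with the highest and lowest average price.
most_expensive = mean_price.idxmax()
least_expensive = mean_price.idxmin()
```

The sorted `mean_price` series can also be passed to a bar plot to support the second question.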
  
---
## Data Preparation 

The main file of interest is the listings CSV. The dataset will go through the following steps:

1. Select the columns of interest according to the questions above and drop the other columns.
1. Check for missing values and deal with them case by case (imputing, removing, or creating new features out of them).
1. Confirm or change the data type of each column as needed. 
1. Create new dataframes to answer each question individually using visualization and statistical analysis. 
1. Merge datasets and adapt categorical/string columns for data modelling (machine learning prediction of the price). 
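
The first three steps might look like the sketch below. It runs on a toy frame with invented values; the string formats (`$2,000.00`, `90%`, `t`/`f`), the `listing_url` column, and the median imputation are assumptions chosen for illustration, not decisions taken from the notebook.

```python
import numpy as np
import pandas as pd

# Toy listings frame with invented values. The string formats and the
# 'listing_url' column are assumptions for illustration only.
listings = pd.DataFrame({
    "listing_id": [1, 2, 3],
    "price": ["$100.00", "$2,000.00", "$55.00"],
    "host_response_rate": ["90%", None, "100%"],
    "host_is_superhost": ["t", "f", "f"],
    "bedrooms": [1.0, np.nan, 2.0],
    "listing_url": ["u1", "u2", "u3"],
})

# Step 1: keep only the columns of interest.
columns_of_interest = [
    "listing_id", "price", "host_response_rate", "host_is_superhost", "bedrooms",
]
prepared = listings[columns_of_interest].copy()

# Step 2: measure the share of missing values per column before deciding
# how to handle each one.
missing_share = prepared.isnull().mean()

# Step 3: convert formatted strings to numeric or boolean dtypes.
prepared["price"] = (
    prepared["price"].str.replace(r"[$,]", "", regex=True).astype(float)
)
prepared["host_response_rate"] = (
    prepared["host_response_rate"].str.rstrip("%").astype(float) / 100
)
prepared["host_is_superhost"] = prepared["host_is_superhost"].map(
    {"t": True, "f": False}
)

# One possible imputation choice: fill missing bedroom counts with the median.
prepared["bedrooms"] = prepared["bedrooms"].fillna(prepared["bedrooms"].median())
```

Whether to impute, drop, or flag a missing value depends on the column; the median fill here is only one option.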


In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import sklearn as sk
%matplotlib inline

In [2]:
# Load Boston Datasets
BostonCalendar = pd.read_csv('BostonAirBNB/BostonCalendar.csv')
BostonListing = pd.read_csv('BostonAirBNB/BostonListings.csv')
BostonReview = pd.read_csv('BostonAirBNB/BostonReviews.csv')

In [3]:
print ('Boston Dataset Sizes: ', (BostonCalendar.shape[0],
        BostonListing.shape[0], BostonReview.shape[0]))

('Boston Dataset Sizes: ', (1308890, 3585, 68275))


In [5]:
print ('Boston Dataset number of uniques: ', (BostonCalendar.nunique().sum(),
        BostonListing.nunique().sum(), BostonReview.nunique().sum()))

('Boston Dataset number of uniques: ', (5198, 71125, 219024))


In [19]:
# Rename the listings key so it matches the calendar's 'listing_id' column
BostonListing.rename(columns={'id': 'listing_id'}, inplace=True)

In [21]:
Merged_df = pd.merge(BostonCalendar, BostonListing, on = 'listing_id')

In [22]:
Merged_df.shape[0]

1308890
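
The merged row count equals the calendar row count reported earlier (1308890), which is consistent with every calendar row matching exactly one listing. A hedged sketch of making that check explicit is shown below on toy frames; it uses a left join with pandas' `validate` and `indicator` options, which is a variation on the default inner merge above, and the `date` and `room_type` values are invented.

```python
import pandas as pd

# Toy stand-ins for the calendar and listings frames (invented values).
calendar = pd.DataFrame({
    "listing_id": [10, 10, 20],
    "date": ["2016-09-06", "2016-09-07", "2016-09-06"],
})
listings = pd.DataFrame({
    "listing_id": [10, 20],
    "room_type": ["Entire home/apt", "Private room"],
})

# validate="many_to_one" raises MergeError if a listing_id is duplicated in
# listings; indicator=True adds a '_merge' column showing where each row matched.
merged = pd.merge(
    calendar,
    listings,
    on="listing_id",
    how="left",
    validate="many_to_one",
    indicator=True,
)

# Number of calendar rows with no matching listing.
unmatched = (merged["_merge"] != "both").sum()
```

If both frames share a column name other than the key, pandas appends `_x` and `_y` suffixes to the two copies, so it is worth checking the merged column names as well.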

In [26]:
len(BostonListing.columns)

95