# Paris Airbnb Price

How much should I rent my flat per night?

I use linear regression to predict the nightly price of an apartment on Airbnb.
The dataset is available at http://insideairbnb.com/get-the-data.html under a Creative Commons CC0 1.0 Universal (CC0 1.0) "Public Domain Dedication" license.

## 1. Data Preprocessing
If the data file has not been decompressed yet, we have to decompress it first.

In [None]:
#import gzip
#import shutil
#with gzip.open('data/listings.csv.gz', 'rb') as f_in:
#    with open('data/listings.csv', 'wb') as f_out:
#        shutil.copyfileobj(f_in, f_out)

First, we import the required libraries.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer  # Imputer was removed from sklearn.preprocessing in scikit-learn 0.22
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

Then, we import the dataset and have a look at it.

In [None]:
dataset = pd.read_csv('data/listings.csv')
dataset.iloc[:5,:]

There are a lot of variables.
Let's keep only the relevant ones, as well as the dependent variable: the price per night!

In [None]:
dataset = dataset[['host_is_superhost','neighbourhood','zipcode','latitude','longitude','property_type',
                   'room_type','accommodates','bathrooms','bedrooms','beds','bed_type','amenities','square_feet','price',
                   'weekly_price','monthly_price','cleaning_fee','number_of_reviews','review_scores_rating',
                   'review_scores_accuracy','review_scores_cleanliness','review_scores_checkin',
                   'review_scores_communication','review_scores_location','review_scores_value']]
dataset.iloc[:5,:]

These variables are interesting, but we could do better.
- Some of these are directly related: neighbourhood and zipcode depend directly on latitude and longitude, and the global review score depends on the other review scores.
- Amenities are a bit difficult to handle for a first version of the algorithm.
- The square feet field is rarely filled in.
- Prices are treated as strings because they contain the '$' character.
- Weekly and monthly prices are not always available, so we set them aside for now.
- We are going to predict the total price, which is the nightly price plus the cleaning fee.
- When there are no reviews, the review score variables are NaN.
- Sometimes, the review score variables are NaN even though there are reviews.

In [None]:
# Imported here so that this cell runs on its own (Imputer no longer exists in sklearn.preprocessing)
from sklearn.impute import SimpleImputer

dataset = dataset.drop(columns=['neighbourhood','zipcode','amenities','square_feet','weekly_price','monthly_price','review_scores_rating'])
# Total price = nightly price + cleaning fee; strip '$' and ',' before converting to float
dataset['price'] = dataset['price'].replace(r'[\$,]', '', regex=True).astype(float) + dataset['cleaning_fee'].replace(r'[\$,]', '', regex=True).astype(float)
dataset = dataset.drop(columns=['cleaning_fee'])
# Drop rows without a total price, which includes listings with no cleaning fee given
# (a comparison such as "!= np.nan" is always True and would filter nothing)
dataset = dataset.dropna(subset=['price'])
# Replacing NaN values with the column mean
numeric_columns = ['latitude','longitude','accommodates','bathrooms','bedrooms','beds','number_of_reviews',
                   'review_scores_accuracy','review_scores_cleanliness','review_scores_checkin',
                   'review_scores_communication','review_scores_location','review_scores_value']
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
dataset[numeric_columns] = imputer.fit_transform(dataset[numeric_columns])
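
One detail of this cleaning step is worth checking on a toy example: how missing totals are detected. The sketch below uses a made-up three-row frame (the values are illustrative, not taken from the dataset) and a hypothetical helper `to_float`. It parses the price strings the same way as the cell above, then compares two ways of filtering out rows whose total is NaN.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only (not taken from the Inside Airbnb data)
toy = pd.DataFrame({
    'price': ['$1,200.00', '$80.00', '$55.00'],
    'cleaning_fee': ['$50.00', np.nan, '$20.00'],
})

def to_float(col):
    # Hypothetical helper: strip '$' and thousands separators, then convert to float.
    # Missing values stay NaN.
    return col.replace(r'[\$,]', '', regex=True).astype(float)

toy['total'] = to_float(toy['price']) + to_float(toy['cleaning_fee'])

# NaN never compares equal to anything, so this mask is True for every row
kept_by_comparison = toy[toy['total'] != np.nan]

# dropna removes the row whose total is missing
kept_by_dropna = toy.dropna(subset=['total'])

print(len(kept_by_comparison), len(kept_by_dropna))
```

The comparison keeps all three rows, while `dropna` keeps only the two rows with a valid total, so `dropna` (or a mask built with `notna()`) is the reliable way to filter.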

Looks better! Now let's separate the independent variables from the dependent variable, and split the data into training and test sets.
We should also encode the categorical data.

In [None]:
X = dataset.drop(columns=['price']).values
y = dataset['price'].values
# Encoding the categorical data
# TODO

# Splitting the dataset into the Training set and Test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
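
The encoding step is left as a TODO in this cell. One possible way to fill it in, sketched here on a made-up frame rather than the real listings, is one-hot encoding with `pd.get_dummies` before extracting `X`. The row values, the subset of columns, `drop_first=True` (to avoid perfectly collinear dummy columns in a linear regression) and `random_state=0` are all assumptions made for illustration, not the author's method.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up rows for illustration only; column names follow the notebook
toy = pd.DataFrame({
    'host_is_superhost': ['t', 'f', 'f', 't', 'f'],
    'room_type': ['Entire home/apt', 'Private room', 'Entire home/apt',
                  'Shared room', 'Private room'],
    'accommodates': [4, 2, 3, 1, 2],
    'price': [150.0, 60.0, 110.0, 35.0, 70.0],
})

categorical_cols = ['host_is_superhost', 'room_type']

# One-hot encode the categorical columns; drop_first removes one level per
# column so the dummy columns are not perfectly collinear (an assumption
# suited to linear regression)
encoded = pd.get_dummies(toy, columns=categorical_cols, drop_first=True, dtype=float)

X = encoded.drop(columns=['price']).values
y = encoded['price'].values

# random_state is fixed only to make the illustration reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

print(encoded.columns.tolist())
print(X_train.shape, X_test.shape)
```

On the real data, the same call would take the notebook's categorical columns (`host_is_superhost`, `property_type`, `room_type`, `bed_type`). The `OneHotEncoder` imported earlier is an alternative when the encoding has to be fitted on the training set only.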