# **AISaturdays Rental Challenge**

<img src="https://do3z7e6uuakno.cloudfront.net/uploads/event/logo/1112702/595053a7143adafce285b2e39ca04f1a.jpeg" width="300">

**Instructions:**

- You'll be using Python 3.
- You'll use the Python libraries pandas, Matplotlib and NumPy.

**By completing the exercise, you'll learn to:**
- Use and understand Python notebooks better.
- Use Python functions and additional libraries.
- Dataset:
 - Obtain the dataset and visualize the information it contains.
 - Clean and normalise the dataset's information.
 - Represent and analyse the dataset's information.
- Correctly apply the Random Forest algorithm.
- Improve the predictions using hyperparameter tuning, feature engineering and gradient boosting.

Let's get started!


# 1. Importing libraries

In [1]:
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import statsmodels.api as sm
import scipy.stats as stats

# 2. Dataset

Open the .csv ('AB_NY_2019.csv') and display the first few lines.

In [2]:
#Two lines of code


1. Show the number of features and examples in the dataset.

In [3]:
#One-liner code


2. Obtain the data types (dtypes) of the features and the number of rows.

In [4]:
#One-liner code
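
The loading and inspection steps above could look roughly like this. It is only a sketch: the inline CSV is a hypothetical three-row stand-in for 'AB_NY_2019.csv' with made-up values and a reduced set of columns, so the snippet runs without the real file.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for AB_NY_2019.csv; the values are made up.
csv_text = """id,name,host_id,neighbourhood_group,room_type,price,reviews_per_month
1,Cozy room,10,Brooklyn,Private room,80,0.21
2,Bright loft,11,Manhattan,Entire home/apt,225,1.50
3,Shared space,12,Queens,Shared room,40,
"""

# With the real file this would be pd.read_csv('AB_NY_2019.csv').
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())                    # first rows of the table
n_examples, n_features = df.shape   # rows are examples, columns are features
print(n_examples, n_features)
print(df.dtypes)                    # data type of every column
```

`df.info()` is an alternative that prints the dtypes together with the number of rows and the non-null counts.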


### Variables



* **id/name:** Identifier and name of the property.

* **host_id/host_name:** Identifier and name of the host.

* **neighbourhood_group/neighbourhood:** Area and neighbourhood of the property. Each Area is a group of neighbourhoods.
* **latitude/longitude:** Latitude and longitude of the property.

* **room_type:** The type of property offered. It can be a private room, a shared room or an entire home/apartment.

* **minimum_nights:**  Minimum number of nights to stay.

* **number_of_reviews:**  Total number of reviews.

* **last_review:**  Date of the last review.

* **reviews_per_month:** Monthly number of reviews. It is rarely an integer and can be less than one.

* **calculated_host_listings_count:** Total number of properties the host is offering.

* **availability_365:** Yearly availability of the property in number of days: the maximum is 365 days (the whole year).

* **price:** Our target variable! The price of the property in dollars.



Is this a regression or a classification problem? Why?

3. Before analyzing the dataset, we need to transform the dates (last_review) into something we can work with. pandas has a specific type for this, datetime. Convert last_review to datetime.

In [5]:
# One-liner code


4. To analyse the data we also need to know how much information we're missing. Use .isnull() to find out which feature has the most missing values.

In [6]:
# One-liner code
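
A minimal sketch of the datetime conversion and the missing-value count, run on a small made-up DataFrame (the column names follow the dataset, the values are hypothetical):

```python
import pandas as pd

# Hypothetical sample; the second listing has no reviews.
df = pd.DataFrame({
    "last_review": ["2019-05-21", None, "2018-11-03"],
    "reviews_per_month": [0.21, None, 1.5],
    "price": [80, 225, 40],
})

# Parse the date strings; missing entries become NaT.
df["last_review"] = pd.to_datetime(df["last_review"])

# Count missing values per column, largest first.
missing = df.isnull().sum().sort_values(ascending=False)
print(missing)
```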


5. Finally, we need to drop the features that are only identifiers and are not useful for predictions.

In [7]:
# One-liner code


6. All ready! We can analyse the distribution of data with .describe()

In [8]:
# One-liner code
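
One way to drop the identifier columns and summarise what is left. Which columns count as "only identifiers" is an assumption here (id, name, host_id, host_name), and the data is made up:

```python
import pandas as pd

# Hypothetical sample with the identifier columns of the dataset.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Cozy room", "Bright loft", "Shared space"],
    "host_id": [10, 11, 12],
    "host_name": ["Ana", "Ben", "Cleo"],
    "minimum_nights": [1, 3, 30],
    "price": [80, 225, 40],
})

# Assumed list of identifier-only columns.
id_columns = ["id", "name", "host_id", "host_name"]
df = df.drop(columns=id_columns)

# Count, mean, std, min, quartiles and max of every numeric column.
summary = df.describe()
print(summary)
```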


### Cleaning and normalization of the dataset
![alt text](https://i.imgur.com/8u4xTI7.png)

This dataset contains incomplete information that we need to impute before we can use it to predict property prices.
We also need to transform *last_review* if we want to use it for the prediction, since we cannot use it directly as a date.
For this processing we'll use pandas functions. You already have a cheatsheet from the previous session, but take a look at this other [cheatsheet](https://assets.datacamp.com/blog_assets/PandasPythonForDataScience.pdf).

7. Find the number of rows (listings) that have no reviews and hence have empty values for last_review and reviews_per_month.

In [9]:
# One-liner code


8. We need to fill in this information if we don't want to discard the rest of the features for these rows. Fill the NaNs of reviews_per_month with 0 (we'll deal with last_review later on).

In [10]:
# One-liner code
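
A possible sketch of counting the listings without reviews and filling reviews_per_month, using hypothetical data:

```python
import numpy as np
import pandas as pd

# Hypothetical sample; two listings have never been reviewed.
df = pd.DataFrame({
    "number_of_reviews": [9, 0, 45, 0],
    "last_review": pd.to_datetime(["2019-05-21", None, "2019-06-30", None]),
    "reviews_per_month": [0.21, np.nan, 1.5, np.nan],
})

# Rows with no review have no last_review date.
n_without_reviews = df["last_review"].isnull().sum()
print(n_without_reviews)

# No reviews means zero reviews per month.
df["reviews_per_month"] = df["reviews_per_month"].fillna(0)
```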


9. Time to transform the variable *last_review*. It's a date, which makes it hard to work with. Let's first fill in the examples that have no last review date. Replace these NaNs with the earliest review date in the dataset.

In [11]:
# Two lines of code


10. Now that there are no empty values, we can change the last_review variable into something more useful. We want smaller values to correspond to old or missing reviews, and larger values to more recent reviews.
We can use the toordinal() function to find the number of days elapsed since day 1 of year 1, but those numbers are still too large and don't follow the distribution we're looking for.

Make last_review represent the number of days elapsed since the earliest review in the dataset.

In [12]:
# One-liner code
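
The two last_review steps can be sketched as follows on made-up dates. This version subtracts timestamps and reads the `.dt.days` accessor instead of calling toordinal(); both give the same day counts.

```python
import pandas as pd

# Hypothetical sample; the second listing has no review date.
df = pd.DataFrame({
    "last_review": pd.to_datetime(
        ["2019-01-11", None, "2019-01-01", "2019-03-02"]
    ),
})

# Fill missing dates with the earliest review in the data.
earliest = df["last_review"].min()
df["last_review"] = df["last_review"].fillna(earliest)

# Days elapsed since the earliest review: 0 means oldest or never reviewed.
df["last_review"] = (df["last_review"] - earliest).dt.days
print(df["last_review"].tolist())
```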


11. To visualize the distribution of dates, generate a graph showing the variable last_review.

In [13]:
# One-liner code


There seem to be two clearly distinct groups. What do you think is causing this?

#### Study of the variable to predict and noise elimination

12. When predicting the price, it helps to first analyze and transform the target variable so that it is easier to predict.

First, let's see how the listing prices are distributed. Generate a graph showing the price of the listings. Here's a [hint](https://seaborn.pydata.org/generated/seaborn.distplot.html).

In [14]:
# One-liner code


The price roughly follows a log-normal distribution. We can transform it into an approximately normal distribution by applying log1p(), a function that computes:

$$y = \log(x + 1)$$

This makes the price easier to predict, as the transformed variable is approximately normally distributed.
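
As a small illustration with made-up prices, NumPy's `log1p` applies this transformation and `expm1` undoes it, which is useful later for converting predictions back to dollars:

```python
import numpy as np

# Hypothetical prices in dollars, including a free listing and an extreme one.
prices = np.array([0.0, 50.0, 150.0, 10000.0])

log_prices = np.log1p(prices)      # log(x + 1), defined even when x == 0
recovered = np.expm1(log_prices)   # inverse transform: exp(y) - 1

print(log_prices)
print(recovered)
```

The `+ 1` is what keeps a price of 0 valid: a plain logarithm would return negative infinity there.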


13. Let's visualize this transformation. Generate another price graph after applying the log1p() function.

In [15]:
# One-liner code


Now we have a much more appropriate distribution to make predictions. However, there are still many outliers that add noise to the sample.

14. Above and below what values is this noise present? Remove from the dataframe the values that do not fit the normal distribution.

In [16]:
# Two lines of code
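
A sketch of one way to filter the outliers. The cut-offs of 3 and 8 on the log1p scale are hypothetical placeholders, not values taken from the exercise; the right limits should be read off your own plot.

```python
import numpy as np
import pandas as pd

# Hypothetical prices, including a free listing and an extreme one.
df = pd.DataFrame({"price": [0, 60, 120, 300, 10000]})

# Assumed cut-offs on the log1p scale; choose yours from the plot.
lower, upper = 3.0, 8.0

log_price = np.log1p(df["price"])
df = df[log_price.between(lower, upper)]
print(df)
```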


15. Now, redraw the plots of price and log1p(price) (use the same code as before, or put both in a [subplot](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.subplots.html)).

In [17]:
# Four lines of code
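
A minimal side-by-side layout with `plt.subplots`, assuming synthetic log-normal prices instead of the real column and plain Matplotlib histograms instead of seaborn:

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic log-normal prices standing in for df['price'].
rng = np.random.default_rng(0)
prices = rng.lognormal(mean=4.7, sigma=0.6, size=1000)

# One row, two columns: raw price on the left, log1p(price) on the right.
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].hist(prices, bins=50)
axes[0].set_title("price")
axes[1].hist(np.log1p(prices), bins=50)
axes[1].set_title("log1p(price)")
fig.tight_layout()
```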


16. Finally, we have a noise-free normalized output variable that will improve our predictions. Change the variable price to the log1p of price.

In [18]:
# One-liner code


#### Variable exploration

We are going to explore a little more the rest of the variables that can affect the price of an offer.

17. Let's start by creating a histogram of the different areas of the city and the number of offers in each of them (you may need to enlarge the graph)

In [19]:
# Three lines of code


18. Now create a map of the offered apartments with latitude and longitude (extra points if you color them by areas or neighborhoods). It's best to do it in a subplot so you can control the size of the map.

In [20]:
# Two lines of code


19. We are now going to generate another histogram, this time with the type of room offered (It is also a good idea to adjust the size of the graph).

In [21]:
# Three lines of code


#### Variable Transformation

We can apply the same process that we applied to the price variable to our input variables, to obtain distributions that are easier for the models to work with.

20. Apply the log1p() transformation to minimum_nights, generate the before and after plots, and compare them.

In [22]:
# Three lines of code


21. Finally, save minimum_nights as log1p of minimum_nights

In [23]:
# One-liner code


21. Repeat the process, this time with reviews_per_month. Is the transformation worthwhile here?

In [24]:
# Three lines of code


#### Study of availability in number of days (0-365)

22. Let's start by representing the availability in a distplot(). Since we know the limits of this variable, it is best to limit the range of the graph and make it larger.


In [25]:
# Four lines of code


#### Add artificial variables

The distribution plot above suggests two groups: listings available most of the year and listings available only a few days.

It also seems intuitive that listings without reviews inspire less confidence ;)

23. Add three binary features that indicate whether the apartment is available all year round, whether its availability is very low (fewer than 12 days a year), and whether it has no reviews.

In [26]:
# Three lines of code
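
A sketch of the three flags on made-up data. The column names `all_year_avail`, `low_avail` and `no_reviews` are hypothetical, and reading "all year round" as exactly 365 days is an assumption; a looser threshold may work better.

```python
import pandas as pd

# Hypothetical sample; reviews_per_month is assumed to be 0 where there are no reviews.
df = pd.DataFrame({
    "availability_365": [365, 0, 5, 200],
    "reviews_per_month": [1.2, 0.0, 0.3, 0.0],
})

df["all_year_avail"] = df["availability_365"] == 365   # assumed definition
df["low_avail"] = df["availability_365"] < 12          # fewer than 12 days a year
df["no_reviews"] = df["reviews_per_month"] == 0        # never reviewed
print(df)
```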


24. We are going to generate a heatmap that shows the correlation between all the input variables and price. Use corr() and seaborn's heatmap() function.

In [27]:
# Three lines of code
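
The correlation matrix behind the heatmap can be sketched like this on hypothetical data. Only the numeric columns are kept, since corr() is not defined for text columns.

```python
import pandas as pd

# Hypothetical sample, built so that the correlations are easy to check by eye.
df = pd.DataFrame({
    "price": [4.0, 4.5, 5.0, 5.5],
    "minimum_nights": [1, 2, 3, 4],
    "availability_365": [300, 200, 100, 0],
    "room_type": ["Private room", "Shared room", "Private room", "Entire home/apt"],
})

# Pairwise Pearson correlations between the numeric columns.
corr = df.select_dtypes(include="number").corr()
print(corr["price"].sort_values(ascending=False))
```

The resulting matrix can then be passed to seaborn, for example `sns.heatmap(corr, annot=True)`.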


#### Convert categorical variables to one-hot

25. To make the categorical features easier for the model to interpret, we are going to one-hot encode them. Use pandas' get_dummies() function (you should end up with 241 columns).

In [28]:
# Two lines of code
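
A small sketch of one-hot encoding with `pd.get_dummies` on made-up rows. On the real data the number of resulting columns depends on how many distinct neighbourhoods remain after cleaning.

```python
import pandas as pd

# Hypothetical sample with two categorical columns.
df = pd.DataFrame({
    "neighbourhood_group": ["Brooklyn", "Manhattan", "Brooklyn"],
    "room_type": ["Private room", "Entire home/apt", "Shared room"],
    "price": [4.4, 5.4, 3.7],
})

# Each category becomes its own indicator column, named <column>_<value>.
encoded = pd.get_dummies(df, columns=["neighbourhood_group", "room_type"])
print(encoded.columns.tolist())
```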


# Models, models, models

With all the exploration, analysis and data cleaning done, we move on to the fun part: The models!

We start by importing all the classes that we are going to need to find a good predictive model:

In [29]:
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, KFold, StratifiedKFold, RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor

26. Split the dataset into X_train, X_test, y_train and y_test using train_test_split(). Remember to exclude price from the input features.

In [30]:
# Three lines of code
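
A sketch of the split on a ten-row hypothetical DataFrame. The 80/20 ratio and the random_state value are assumptions, not requirements of the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical cleaned data; price is already on the log1p scale.
df = pd.DataFrame({
    "minimum_nights": range(1, 11),
    "availability_365": range(0, 100, 10),
    "price": np.linspace(4.0, 5.8, 10),
})

X = df.drop(columns=["price"])   # input features, target excluded
y = df["price"]                  # target variable

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```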


27. We are going to use cross-validation to evaluate our model, using KFold to compute the score. Implement a KFold with 5 splits and calculate the mean and standard deviation of the error of a RandomForestRegressor without changing its parameters (yet). [Hint](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html).

In [31]:
# Three lines of code
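
A minimal cross-validation sketch on synthetic data. Using mean squared error as the metric is an assumption; scikit-learn reports it negated (`neg_mean_squared_error`) so that larger is always better, hence the sign flip.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data standing in for X_train and y_train.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=100)

kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    RandomForestRegressor(random_state=42),   # default hyperparameters
    X, y, cv=kfold, scoring="neg_mean_squared_error",
)

mse = -scores   # flip the sign to get the usual positive MSE
print(mse.mean(), mse.std())
```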


28. When using a RandomForestRegressor, what hyperparameters were we using? List all the parameters used by this model (use the get_params() method and the pprint module).

In [32]:
from pprint import pprint
## Two lines of code
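
For instance, a freshly created model can be inspected like this (the random_state value is an arbitrary choice):

```python
from pprint import pprint

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(random_state=42)

# get_params() returns a dict that maps each hyperparameter name to its value.
params = rf.get_params()
pprint(params)
```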


We can adjust all of these parameters to improve the accuracy of our model. One way to find which combination works best is to use GridSearchCV, which fits models with many different combinations and scores each one to find the best model by brute force. To do this, you pass a list of values for each parameter, and GridSearchCV tries every combination. [More information](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html).

29. Decide which values you want to try for each parameter, and put these lists in a dictionary so that the search can be run. Check the possible values for each of the parameters.

In [33]:
# Eight lines of code


30. Now we can implement the search. To make it faster, we'll use a variant of GridSearchCV that does not test all possible combinations, but only a few random ones (hence its name, RandomizedSearchCV). Implement it, taking into account that its parameters include the model to tune and the dictionary defined above, among others. This step may take a few minutes, as many models have to be fitted to find the best one. Here is the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html) for RandomizedSearchCV.

In [34]:
# Two lines of code
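
A sketch of the parameter dictionary and the randomized search on synthetic data. The candidate values, `n_iter=5` and `cv=3` are deliberately small, hypothetical choices so that the snippet runs in seconds; on the real dataset you would likely search a wider space.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data standing in for X_train and y_train.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=120)

# Hypothetical candidate values for each hyperparameter.
param_distributions = {
    "n_estimators": [10, 20, 50],
    "max_depth": [None, 3, 5],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
    "max_features": ["sqrt", 1.0],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,                      # try 5 random combinations, not all 72
    cv=3,
    scoring="neg_mean_squared_error",
    random_state=42,
)
search.fit(X, y)

print(search.best_params_)
best_model = search.best_estimator_   # refitted on all the data passed to fit()
```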


31. Finally, find the mean squared error and $R^2$ of the best model you have created.

In [35]:
# Six lines of code
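
A minimal evaluation sketch on synthetic data, where a plain RandomForestRegressor stands in for the best model found by the search:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the cleaned dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Stand-in for the best estimator returned by the search.
model = RandomForestRegressor(n_estimators=50, random_state=42)
model.fit(X_train, y_train)

# Evaluate on data the model has not seen during training.
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(mse, r2)
```

Because price was replaced by log1p(price) earlier, the error on the real data is measured on the log scale; applying `np.expm1` to the predictions converts them back to dollars.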


Now, to improve that score! You can try:

- Remove features that are not relevant to the prediction
- Implement gradient boosting using XGBoost, or another boosting method such as AdaBoost
- Tune hyperparameters manually to arrive at better models
- Use a Tree Interpreter to see which decision trees are most important

Let's see who achieves the best score! 🚀