# Global overview
This project is for educational purposes only. I built it while following the `Udacity DataScience nanodegree`.  
As part of the course, to develop our communication skills as data scientists, we are asked to pick a dataset available on the Internet. Some datasets are suggested, but we can use any dataset as long as we explore it and follow the CRISP-DM process:
![CRISP-DM process](../assets/CRISP-DM.png)

This project has two main objectives:
* create a GitHub code repository to share code, scripts, notebooks and documentation about data wrangling/modeling techniques, with a technical audience in mind
* write a blog post accessible to non-technical people to share our questions, insights, findings, thoughts and conclusions.

I chose to use datasets from Airbnb and, being French, I picked Paris as the city to analyze.


## Table of Contents

1. `Data Understanding`: [here](1_Data_Understanding.ipynb)
2. `Business Understanding`: [here](2_Business_Understanding.ipynb)
3. `Data Preparation`: [here](3_Data_Preparation.ipynb)
4. `Modeling` and `Evaluation`: [here](4_Modeling.ipynb)
5. `Modeling` tuning and `Evaluation`: [here](4b_Modeling_Tuning_xgboost.ipynb)

The `Deployment` (or `Exposition`) is done through this [blog post](https://nidragedd.github.io/things-you-should-now-before-visiting-Paris/).

An additional notebook with further Data Exploration is also provided [here](APPENDIX_Bonus_Exploratory_Data.ipynb).


## Setup environment
### Install dependencies
Before going further, make sure that all required dependencies are installed. The easiest way is to use **[conda](https://docs.conda.io/en/latest/)**
with the provided `environment.yml` file, by running this command in a terminal:
```
conda env create -f environment.yml
```

If you do not have conda or do not want to use it, you can still set up your environment with `pip install`
commands. Refer to the `environment.yml` file for the list of dependencies to install.  
Basically, this project requires **Python 3.7** plus common data science packages (such as 
**[numpy](https://www.numpy.org/)**, **[pandas](https://pandas.pydata.org/)**, 
**[sklearn](https://scikit-learn.org/stable/)**, **[matplotlib](https://matplotlib.org/)**, 
**[seaborn](https://seaborn.pydata.org/)** and so on).

For modeling, this project uses models available in sklearn plus **[xgboost](https://xgboost.readthedocs.io/en/latest/)**.
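
To confirm that the environment is usable before opening the notebooks, a quick import check can help. The snippet below is a minimal sketch and is not part of this repository: the helper name `missing_packages` is hypothetical, and the list only contains the import names of the packages mentioned above.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that cannot be found in the current environment."""
    # find_spec returns None when a top-level module is not installed
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names of the packages mentioned above (scikit-learn is imported as `sklearn`)
REQUIRED = ["numpy", "pandas", "sklearn", "matplotlib", "seaborn", "xgboost"]

if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are available.")
```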

> NOTE: I made an attempt at NLP on the reviews, which is why the dependencies also include:
> * **[gensim](https://radimrehurek.com/gensim/)**: used for topic modeling
> * **[wordcloud](https://pypi.org/project/wordcloud/)**: used to generate some tag clouds
> * **[spacy](https://spacy.io/)**: NLP package for easy lemmatization
> * **[pyLDAvis](https://pyldavis.readthedocs.io/en/latest/)**: great tool to visualize the topics with LDA (Latent Dirichlet Allocation)
>
> Feel free to skip them: the NLP part is not complete yet, so it is not available in this repository.  
> My idea was to perform topic modeling on reviews per neighbourhood to see which topics were relevant for each one, but this requires language detection because some reviews are in French and others in English (so, for example, a different spaCy model has to be loaded for each language). This would have taken me far from the initial goal, so I decided to set this part aside for now and focus on price prediction instead.
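
To illustrate the language issue described in the note, here is a rough sketch of how reviews could be split by language before further processing. It is not code from this project: the stopword lists are tiny hand-picked assumptions, and `guess_language` and `group_by_language` are hypothetical helpers. The spaCy model names are those of the small French and English models.

```python
import re

# Tiny, hand-picked stopword lists: an assumption for illustration only
FRENCH_WORDS = {"le", "la", "les", "et", "est", "très", "un", "une", "des", "nous", "dans", "était"}
ENGLISH_WORDS = {"the", "and", "is", "was", "very", "of", "in", "we", "for", "to", "with", "this"}

# spaCy model that could be loaded for each detected language
SPACY_MODELS = {"fr": "fr_core_news_sm", "en": "en_core_web_sm"}


def guess_language(text):
    """Guess 'fr' or 'en' by counting stopwords; return 'unknown' on a tie."""
    tokens = re.findall(r"[a-zàâçéèêëîïôûùüÿœ]+", text.lower())
    french = sum(token in FRENCH_WORDS for token in tokens)
    english = sum(token in ENGLISH_WORDS for token in tokens)
    if french == english:
        return "unknown"
    return "fr" if french > english else "en"


def group_by_language(reviews):
    """Split reviews into one list per guessed language."""
    groups = {"fr": [], "en": [], "unknown": []}
    for review in reviews:
        groups[guess_language(review)].append(review)
    return groups
```

A dedicated language-detection library would likely be more reliable than this heuristic; each group could then be processed with the model listed in `SPACY_MODELS`.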

### Get the data
Run the cell below to download the data (depending on your network connection, this may take a while).  
The script will:
* create the `data` directory (be careful: if it already exists, it will be emptied!)
* collect 5 datasets from the insideairbnb website

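```python
import sys

# Add the parent directory to the import path so that the `src` package can be found
sys.path.append('../')

from src.utils import datacollector

# Download the datasets into the `data` directory
datacollector.collect_data()
```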

```
Download started for file http://data.insideairbnb.com/france/ile-de-france/paris/2019-07-09/data/listings.csv.gz
Download finished for file http://data.insideairbnb.com/france/ile-de-france/paris/2019-07-09/data/listings.csv.gz
Download started for file http://data.insideairbnb.com/france/ile-de-france/paris/2019-07-09/visualisations/listings.csv
Download finished for file http://data.insideairbnb.com/france/ile-de-france/paris/2019-07-09/visualisations/listings.csv
Download started for file http://data.insideairbnb.com/france/ile-de-france/paris/2019-07-09/data/calendar.csv.gz
Download finished for file http://data.insideairbnb.com/france/ile-de-france/paris/2019-07-09/data/calendar.csv.gz
Download started for file http://data.insideairbnb.com/france/ile-de-france/paris/2019-07-09/data/reviews.csv.gz
Download finished for file http://data.insideairbnb.com/france/ile-de-france/paris/2019-07-09/data/reviews.csv.gz
Download started for file http://data.insideairbnb.com/france/ile-de-fra
```
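
For readers curious about what such a collector involves, the two steps listed above (reset the target directory, then download each file) can be sketched as follows. This is only an assumption about the general approach, not the actual code of `src.utils.datacollector`; `reset_directory` and `download_files` are hypothetical names.

```python
import shutil
import urllib.request
from pathlib import Path


def reset_directory(path):
    """Create `path` as an empty directory, deleting any previous content."""
    directory = Path(path)
    if directory.exists():
        shutil.rmtree(directory)
    directory.mkdir(parents=True)
    return directory


def download_files(urls, target_dir):
    """Download each URL into `target_dir`, keeping the last URL segment as file name."""
    saved = []
    for url in urls:
        destination = Path(target_dir) / url.rsplit("/", 1)[-1]
        print(f"Download started for file {url}")
        urllib.request.urlretrieve(url, str(destination))
        print(f"Download finished for file {url}")
        saved.append(destination)
    return saved
```

Calling `download_files(urls, reset_directory("data"))` with the URLs printed in the log would mimic the behaviour described; whether those URLs are still served is not guaranteed.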

As you can see, the data is from **July 2019** (archived data from other dates can also be downloaded from the website).  
According to the insideairbnb website, the files contain the following information:  

| File name          | File description                                                                |
|--------------------|---------------------------------------------------------------------------------|
| calendar.csv.gz    | Detailed Calendar Data for listings in Paris                                    |
| reviews.csv.gz     | Detailed Review Data for listings in Paris                                      |
| listings.csv.gz    | Detailed Listings data for Paris                                                |
| listings.csv       | Summary information and metrics for listings in Paris (good for visualisations) |
| neighbourhoods.csv | Neighbourhood list for geo filter. Sourced from city or open source GIS files   |
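
Before opening the notebooks, it can be handy to peek at the columns of each file. The helper below is a minimal sketch using only the standard library; `read_rows` is a hypothetical name, and it assumes the files are UTF-8 encoded and stored in the `data` directory under the names shown in the table.

```python
import csv
import gzip
from pathlib import Path


def read_rows(path, limit=5):
    """Return the column names and the first `limit` rows of a CSV file (gzipped or not)."""
    path = Path(path)
    # Files ending in .gz are decompressed on the fly; others are opened as plain text
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, mode="rt", encoding="utf-8", newline="") as handle:
        reader = csv.DictReader(handle)
        rows = [row for _, row in zip(range(limit), reader)]
        return reader.fieldnames, rows
```

For example, `read_rows("data/listings.csv.gz", limit=3)` would return the header and the first three listings. With pandas, `pd.read_csv` also decompresses `.gz` files by default, since it infers the compression from the file extension.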

---

Now that we have both the data and a working environment, let's start exploring those files [here](1_Data_Understanding.ipynb).