# Capstone Project

## Background

Time really flies: you have already reached the last sprint of the second module of the course, and you should be proud of yourself. Over the past three sprints you gained knowledge that helped you build data engineering skills. By now you should know what good Python code looks like, why OOP is used, how to structure a Python project, how to work with SQL, and how to develop and deploy a web application. These skills will enable you to build projects that not only cover data analysis and modeling but also make your discoveries accessible to other people.

Now the time has come to bring everything you have learned together and complete the second capstone project of the course. In this project you will create a Python package, collect a dataset by web scraping, train a model, and deploy it for others to use.

Most importantly, you will have to create a complete end-to-end (E2E) machine learning plan: define the problem, collect a dataset, train a model, evaluate it, and deploy it. By completing this project, you will strengthen your data engineering skills and prove to yourself and others that you are capable of planning and executing data science projects.

<div style="text-align: center;">
<img src="https://miro.medium.com/max/700/1*x7P7gqjo8k2_bj2rTQWAfg.jpeg" width="300px"/>
</div>

---

## Requirements
The capstone requires you to carry out a full-featured E2E machine learning project. Here is what you have to complete:

### Define the problem you want to solve
This is the part where you select a problem. You can choose from these topics: text classification, price prediction, item category classification. Throughout the second module of the course you saw a few examples of datasets that could be used to solve these problems (eBay listings, Reddit posts, Twitter tweets). In this stage you have to:
- Define the problem and create a short presentation
- Explain what you want to solve and what the potential value of your solution is
- Define the data source you will collect data from

### Collecting data
During this stage you will need to create a Python package that can scrape a specific website. Throughout the second module you saw many examples of functions that take a few arguments (`keywords`, `number of samples`, etc.) and output pandas `DataFrame`s. Now you will need to turn this functionality into a Python package that is installable through pip.
- Create a Python package that can scrape a specific web page
- The package should be installable through `pip`
- The package should meet all expected Python package standards: clean code, tests, documentation
- Collect and process a dataset using the package you created
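
As a rough illustration of the kind of function such a package might expose, here is a minimal sketch that parses listings out of an HTML string using only the standard library. The markup (`div.listing`, `span.title`, `span.price`), the class `ListingParser`, and the function `parse_listings` are all hypothetical; a real scraper would first download the page and would need selectors matched to the target site. In a package, the returned rows could be passed to `pd.DataFrame(rows)` to produce the expected `DataFrame`.

```python
from html.parser import HTMLParser


class ListingParser(HTMLParser):
    """Collect one dict per <div class="listing"> (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.rows = []       # finished listings, one dict each
        self._field = None   # name of the field currently being read

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class") or ""
        if tag == "div" and css_class == "listing":
            self.rows.append({})  # a new listing starts here
        elif tag == "span" and self.rows and css_class in ("title", "price"):
            self._field = css_class

    def handle_data(self, data):
        # Only keep text that sits inside a recognised <span>.
        if self._field and data.strip():
            self.rows[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def parse_listings(html, number_of_samples=None):
    """Return at most `number_of_samples` listings as a list of dicts."""
    parser = ListingParser()
    parser.feed(html)
    parser.close()
    rows = parser.rows[:number_of_samples]
    for row in rows:
        if "price" in row:
            row["price"] = float(row["price"])  # store prices as numbers
    return rows


# Invented sample page, standing in for a downloaded search result.
SAMPLE_HTML = """
<div class="listing"><span class="title">Flat A</span><span class="price">350</span></div>
<div class="listing"><span class="title">Flat B</span><span class="price">420.5</span></div>
"""
listings = parse_listings(SAMPLE_HTML, number_of_samples=2)
```

Keeping the parsing step separate from the download step makes it possible to test the package against saved HTML files without touching the network.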

### Training and saving the model
During this step you will use the collected data to train, test, and save a machine learning model. Do not spend much time on this step; just make sure that:
- An appropriate machine learning algorithm is selected
- The model is trained successfully (remember the first module of the course)
- The model is saved for later deployment
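
The train, test, and save cycle can be sketched with the standard library alone. This is only an illustration under stated assumptions: the one-feature least-squares fit (`fit_line`), the `predict` helper, and the toy numbers are hypothetical stand-ins for whatever algorithm and dataset you choose, and `pickle` is one common way of persisting a fitted model, not the only one.

```python
import pickle
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for a single feature; returns fitted parameters."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return {"slope": slope, "intercept": mean_y - slope * mean_x}


def predict(model, xs):
    """Apply the fitted line to new feature values."""
    return [model["intercept"] + model["slope"] * x for x in xs]


# Toy data, invented for illustration: floor area in m2 and monthly rent.
areas = [20.0, 40.0, 60.0, 80.0]
rents = [300.0, 400.0, 500.0, 600.0]

# Hold out the last sample for testing, train on the rest.
train_x, test_x = areas[:3], areas[3:]
train_y, test_y = rents[:3], rents[3:]
model = fit_line(train_x, train_y)

# Serialize the fitted parameters. In a project the bytes would be written
# to a file and read back by the web application at start-up.
blob = pickle.dumps(model)
restored = pickle.loads(blob)
test_predictions = predict(restored, test_x)
```

Whatever estimator you use, the point is that the object loaded at deployment time must reproduce the predictions of the object you evaluated.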

### Creating an API for the trained model
This is a step you have done at least a couple of times. You will need to create an API using Flask. While creating the application you will need to:
- Load the trained model
- Create an inference pipeline
- Create a `POST` route that reaches the model and sends its outputs as the response
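
The logic behind such a route can be sketched independently of the web framework. In the sketch below, `handle_predict`, `toy_model`, and `REQUIRED_FIELDS` are hypothetical names, and the "model" is a made-up rule rather than a trained one. In a Flask application, a function like this would typically be called from a `POST` view with the result of `request.get_json()`, and its return value passed through `jsonify`.

```python
import json

# Hypothetical feature names; use the columns your own model was trained on.
REQUIRED_FIELDS = ("floor_area_m2", "number_of_rooms")


def toy_model(rows):
    """Placeholder for the loaded model: a made-up linear rule."""
    return [100.0 + 5.0 * row["floor_area_m2"] for row in rows]


def handle_predict(payload, model=toy_model):
    """Validate a decoded JSON payload and return (response_body, http_status)."""
    if not isinstance(payload, list) or not payload:
        return {"error": "expected a non-empty JSON list of objects"}, 400
    for row in payload:
        if not isinstance(row, dict):
            return {"error": "every item must be a JSON object"}, 400
        missing = [name for name in REQUIRED_FIELDS if name not in row]
        if missing:
            return {"error": f"missing fields: {', '.join(missing)}"}, 400
    # Inference pipeline: here just the model call; preprocessing such as
    # scaling or encoding would go in front of it.
    return model(payload), 200


request_body = '[{"floor_area_m2": 25.0, "number_of_rooms": 1}]'
body, status = handle_predict(json.loads(request_body))
bad_body, bad_status = handle_predict(json.loads('[{"number_of_rooms": 1}]'))
```

Rejecting malformed input with a 400 status before calling the model keeps unexpected payloads from surfacing as server errors.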

### Tracking the model's predictions
Now you will need to enable tracking of the model's predictions. During this step you will connect your Flask application to a PostgreSQL database hosted by Heroku and store the model's inputs and outputs in one table:
- Create a PostgreSQL database hosted by Heroku
- Create a table for tracking predictions, with columns for the model's inputs and outputs
- On every request to the model, insert the required values into the database
- Create a new route in the Flask application that returns the 10 most recent requests and responses in JSON format
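
The table, the per-request insert, and the "10 most recent" query can be sketched as follows. To keep the example runnable without a database server, it uses the standard library's `sqlite3` with an in-memory database as a stand-in for PostgreSQL; the table name, column names, and helper functions are hypothetical. With PostgreSQL the column types and the placeholder syntax differ (for example, psycopg2 uses `%s` instead of `?`), but the structure stays the same.

```python
import json
import sqlite3

# In-memory SQLite database standing in for the hosted PostgreSQL instance.
connection = sqlite3.connect(":memory:")
connection.row_factory = sqlite3.Row  # rows can be converted to dicts
connection.execute(
    """
    CREATE TABLE inferences (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        floor_area_m2 REAL NOT NULL,
        number_of_rooms REAL NOT NULL,
        inferred_monthly_rent REAL NOT NULL
    )
    """
)


def log_inference(conn, features, prediction):
    """Store one request (inputs) together with the model's output."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO inferences "
            "(floor_area_m2, number_of_rooms, inferred_monthly_rent) "
            "VALUES (?, ?, ?)",
            (features["floor_area_m2"], features["number_of_rooms"], prediction),
        )


def recent_inferences(conn, limit=10):
    """Return the most recent rows, newest first, as a JSON string."""
    rows = conn.execute(
        "SELECT * FROM inferences ORDER BY id DESC LIMIT ?", (limit,)
    ).fetchall()
    return json.dumps([dict(row) for row in rows])


# Simulate 12 requests with made-up inputs and outputs.
for area in range(20, 32):
    features = {"floor_area_m2": float(area), "number_of_rooms": 1.0}
    log_inference(connection, features, 5.0 * area)

latest = json.loads(recent_inferences(connection))
```

Passing values as query parameters rather than formatting them into the SQL string avoids SQL injection through user-supplied model inputs.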

### Deploying the application
After completing all the steps above, you will need to deploy your application to Heroku, following the steps provided in the fourth lesson of this sprint.
- Make sure all secrets and passwords are set as environment variables in Heroku
- Deploy the application to Heroku
- Ensure that your application is accessible (provide a link to it)
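
Reading secrets from environment variables might look like the sketch below. Heroku Postgres exposes the connection string as the `DATABASE_URL` config var; `SECRET_KEY`, the function `load_settings`, and the example values are hypothetical. On Heroku, config vars can be set with `heroku config:set NAME=value` or through the dashboard.

```python
import os


def load_settings(environ=os.environ):
    """Read required settings from the environment, failing loudly if absent."""
    required = ("DATABASE_URL", "SECRET_KEY")  # SECRET_KEY is a made-up example
    missing = [name for name in required if not environ.get(name)]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    return {name: environ[name] for name in required}


# A plain dict stands in for the real environment so the sketch runs anywhere.
fake_environment = {
    "DATABASE_URL": "postgres://user:password@host:5432/dbname",
    "SECRET_KEY": "change-me",
}
settings = load_settings(fake_environment)
```

Failing at start-up when a variable is missing is usually easier to debug than a connection error on the first request.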

---

## Evaluation criteria
- All requirements are met
- The project is well thought out and the defined problem is clearly presented
- The model actually works and makes predictions that make sense
- The code is clear and clean, and follows PEP 8

---

## Problem

I am interested in investing in real estate. For that I need estimates of how much I could rent a given flat out for and how much I could buy it for. This project gathers data for predicting rent prices.

I scrape rent listings from the Lithuanian real estate portal aruodas.lt.

## Workflow from A to Z

### Gathering data

In [3]:
# !pip install git+https://github.com/mutusfa/scrape_aruodas

In [9]:
import pandas as pd
from pathlib import Path

from scrape_aruodas.main import scrape

PROJECT_DIR = Path.cwd()

In [10]:
# raw = scrape(num_items=2000)
# raw.to_csv(str(PROJECT_DIR / "data/raw/rent.csv"))

### Cleaning the data

In [15]:
import src.data.make_dataset
intermediate = src.data.make_dataset.make_intermediate()  # also gets saved at data/intermediate/rent.csv

The notebook src/data/select_final.ipynb contains a minimal exploratory data analysis, in which I cut off outliers.
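
The exact cutoff rule used in that notebook is not reproduced here. As one common possibility, the sketch below drops rows whose value falls outside 1.5 interquartile ranges of the quartiles. The helpers `iqr_bounds` and `drop_outliers`, the factor `k`, and the toy rents are assumptions made for illustration, not the project's actual rule or data.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return (low, high) limits at k interquartile ranges beyond the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread


def drop_outliers(rows, column, k=1.5):
    """Keep only the rows whose `column` value lies inside the IQR limits."""
    low, high = iqr_bounds([row[column] for row in rows], k)
    return [row for row in rows if low <= row[column] <= high]


# Made-up rents with one obviously implausible value.
toy_rents = (300.0, 326.0, 350.0, 360.0, 380.0, 399.0, 420.0, 9000.0)
rows = [{"monthly_rent": rent} for rent in toy_rents]
kept = drop_outliers(rows, "monthly_rent")
```
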

In [14]:
final = pd.read_csv(str(PROJECT_DIR / "data/final/rent.csv"), index_col=0)
final.head()

| Unnamed: 0 | city | district | latitude | listing_url | longitude | street | floor_area_m2 | monthly_rent | number_of_rooms | floor | number_of_floors | build_year | building_type | heating_type | equipment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | vilniuje | snipiskese | 54.720888 | https://www.aruodas.lt/butu-nuoma-vilniuje-sni... | 25.278539 | juozo-balcikonio-g | 19.0 | 326.0 | 1.0 | 3.0 | 5.0 | 2020.0 | Mūrinis | Centrinis kolektorinis | Įrengtas |
| 2 | vilniuje | naujininkuose | 54.662883 | https://www.aruodas.lt/butu-nuoma-vilniuje-nau... | 25.27784 | telsiu-g | 42.0 | 399.0 | 3.0 | 2.0 | 4.0 | 2015.0 | Mūrinis | Geoterminis | Įrengtas |
| 3 | vilniuje | fabijoniskese | 54.742411 | https://www.aruodas.lt/butu-nuoma-vilniuje-fab... | 25.22911 | salomejos-neries-g | 50.0 | 360.0 | 2.0 | 11.0 | 12.0 | 2008.0 | Mūrinis | Kita | Įrengtas |
| 5 | vilniuje | senamiestyje | 54.681746 | https://www.aruodas.lt/butu-nuoma-vilniuje-sen... | 25.279369 | klaipedos-g | 105.0 | 1500.0 | 4.0 | 3.0 | 3.0 | 2013.0 | Mūrinis | Centrinis kolektorinis | Įrengtas |
| 6 | vilniuje | naujininkuose | 54.662866 | https://www.aruodas.lt/butu-nuoma-vilniuje-nau... | 25.277922 | telsiu-g | 42.0 | 350.0 | 1.0 | 1.0 | 4.0 | 2015.0 | Mūrinis | Geoterminis | Įrengtas |


### Modelling

In [18]:
import src.models

final = final.drop('listing_url', axis="columns")

model, scaler, encoder = src.models.main(final)

Score: 0.6708243526781839
Mean absolute error: 130.11441959871323
Mean absolute percentage error: 0.32305717022307767


Running src.models as a script saves the model, along with the scaler and encoder it used, to the ./models directory.
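
For reference, the two error metrics in the output above can be computed as in the sketch below. The mean absolute error is in the same units as `monthly_rent`, while the mean absolute percentage error is a fraction, so the reported value of about 0.32 corresponds to roughly 32%. The function names and toy numbers are illustrative; how `src.models` computes these metrics internally is not shown in this document.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between true and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_absolute_percentage_error(actual, predicted):
    """Average absolute error relative to the true value, as a fraction."""
    return sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)


# Made-up rents and predictions, not taken from the project's data.
actual = [400.0, 500.0, 1000.0]
predicted = [300.0, 550.0, 1200.0]

mae = mean_absolute_error(actual, predicted)
mape = mean_absolute_percentage_error(actual, predicted)
```

Because the percentage error divides by the true value, it weights mistakes on cheap flats more heavily than the same absolute mistake on expensive ones.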

### Accessing the model via the API

#### Model inference (`POST` interface)

The prediction endpoint is https://jjuoda-ds-24.herokuapp.com/predict/
For example, using the requests library:

In [20]:
import numpy as np
import requests

features_for_inference = [{
    "city": "kaune",
    "district": "centre",
    "latitude": 54.889328,
    "longitude": 23.936227,
    "street": "tunelio-g",
    "floor_area_m2": 25.0,
    "number_of_rooms": 1.0,
    "floor": 1.0,
    "number_of_floors": 2.0,
    "build_year": 1939.0,
    "building_type": "Medinis",
    "heating_type": "Dujinis",
    "equipment": "Įrengtas",
}]
url = "https://jjuoda-ds-24.herokuapp.com/predict/"
response = requests.post(url, json=features_for_inference)
inferred = np.array(response.json())
inferred

DEBUG: Starting new HTTPS connection (1): jjuoda-ds-24.herokuapp.com:443
DEBUG: https://jjuoda-ds-24.herokuapp.com:443 "POST /predict/ HTTP/1.1" 200 20


array([198.87162398])

#### Accessing the most recent inferences

In [21]:
url = "https://jjuoda-ds-24.herokuapp.com/inferences/"
response = requests.get(url)
last_inferences = np.array(response.json())
last_inferences

DEBUG: Starting new HTTPS connection (1): jjuoda-ds-24.herokuapp.com:443
DEBUG: https://jjuoda-ds-24.herokuapp.com:443 "GET /inferences/ HTTP/1.1" 200 3300


array([{'number_of_floors': 2.0, 'number_of_rooms': 1.0, 'street': 'tunelio-g', 'longitude': 23.936227, 'district': 'centre', 'inferred_monthly_rent': 198.87162398292858, 'heating_type': 'Dujinis', 'build_year': 1939.0, 'equipment': 'Įrengtas', 'floor': 1.0, 'floor_area_m2': 25.0, 'latitude': 54.889328, 'city': 'kaune', 'id': 80, 'building_type': 'Medinis'},
       {'number_of_floors': 2.0, 'number_of_rooms': 1.0, 'street': 'tunelio-g', 'longitude': 23.936227, 'district': 'centre', 'inferred_monthly_rent': 198.87162398292858, 'heating_type': 'Dujinis', 'build_year': 1939.0, 'equipment': 'Įrengtas', 'floor': 1.0, 'floor_area_m2': 25.0, 'latitude': 54.889328, 'city': 'kaune', 'id': 79, 'building_type': 'Medinis'},
       {'number_of_floors': 2.0, 'number_of_rooms': 1.0, 'street': 'tunelio-g', 'longitude': 23.936227, 'district': 'centre', 'inferred_monthly_rent': 198.8716239829286, 'heating_type': 'Dujinis', 'build_year': 1939.0, 'equipment': 'Įrengtas', 'floor': 1.0, 'floor_area_m2': 25.