# RentHop Kaggle Competition

W207-3 Spring 2017

Team members: Stephanie Fan, Boris Kletser, Amitabha Karmakar 

**Goal:** Use rental listing features to predict interest in rental inquiries.

- [Kaggle Competition](https://www.kaggle.com/c/two-sigma-connect-rental-listing-inquiries)
- Public Tableau Link: _____
- [Notebooks and Code](https://github.com/letslego/Rentop/)

## Business Understanding



**Problem:** The problem is two-fold. First, we want to give owners and agents feedback on how to optimize their listings to generate interest. Second, predicted interest helps RentHop identify potential issues with listings, including fraud. Both should help customers find relevant listings more easily.

**Metrics:** The relevant metric is classification accuracy: the share of listings whose interest level (high, medium, or low) is predicted correctly. Our target is 80% accuracy.

**Delivery:** We will deliver a model that predicts the probability of high, medium, and low interest for a given listing.

*Note:* For this assignment, we do not analyze the images provided with the competition. We focus on the existing features (e.g., text and numeric values) to predict interest level.

## Data Understanding

**Sources:**
- train.json:  49352 records over 15 columns
- test.json:   74659 records over 14 columns

Each row is a listing; each column is a feature. The extra column in train.json is the interest level, which we need to predict for test.json.

**Existing Features:**
    
|Feature Type|Columns|Type|Notes|
|---|---|---|---|
|IDs|building_id|Long string||
||listing_id|7 digit num||
||manager_id|Long string||
|Location|street_address|Text||
||display_address|Text||
||latitude|Float|New York City only|
||longitude|Float|New York City only|
|Features|bathrooms|Int|(mean 1.2, sd 0.5)|
||bedrooms|Int|(mean 1.5, sd 1.1)|
||description|Text||
||price|Int||
||created|Date|Dates between 2016-04 and 2016-06; spread throughout the weeks, mostly between 1 and 5 am (especially 2 am)|
||photos|List of URLs||
|Target Var|interest_level|High/Medium/Low|This is what we’re predicting|


In [None]:
# code for loading and shape of train/test
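
A minimal sketch of the loading step, assuming pandas is available and that `train.json` and `test.json` sit in the working directory. The helper names `load_listings` and `describe_shapes` are hypothetical and are not taken from the project notebooks.

```python
import pandas as pd


def load_listings(path):
    """Read one competition JSON file into a DataFrame (one row per listing)."""
    return pd.read_json(path)


def describe_shapes(train, test):
    """Report both shapes and the columns that appear only in the training set."""
    train_only = sorted(set(train.columns) - set(test.columns))
    return {
        "train_shape": train.shape,
        "test_shape": test.shape,
        "train_only_columns": train_only,
    }


# Example usage (file names are assumptions):
# train = load_listings("train.json")
# test = load_listings("test.json")
# print(describe_shapes(train, test))
```

If the files match the description under Sources, the only training-only column should be `interest_level`.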

### EDA
- Distribution of each feature
- Missing values
- Distribution of target
- Relationships between features
- Other idiosyncrasies?
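
A rough sketch of the first EDA checks in the list above (missing values, target distribution, and per-feature summaries), assuming the listings are held in a pandas DataFrame. The function name `summarize` and the default target column name are assumptions made for illustration.

```python
import pandas as pd


def summarize(df, target="interest_level"):
    """Collect basic EDA summaries for a listings DataFrame."""
    return {
        # Count of missing values in each column
        "missing_per_column": df.isna().sum(),
        # Share of listings in each interest level
        "target_share": df[target].value_counts(normalize=True),
        # Count, mean, std, and quartiles for the numeric columns
        "numeric_summary": df.describe(),
    }
```

Relationships between features could be inspected afterwards with, for example, `df.corr(numeric_only=True)` on the numeric columns.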

## Data Preparation

### Feature Transformation and Engineering
*[De-duplicating features](https://www.kaggle.com/jxnlco/two-sigma-connect-rental-listing-inquiries/deduplicating-features)*: Parses descriptions into consistent rental features (e.g., 24-hr concierge) and replaces synonyms with consistent terminology.
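
The idea of replacing synonyms with one consistent term can be sketched as follows. This is not the code from the linked kernel: the synonym table below is hypothetical and only illustrates the approach on a list of feature strings.

```python
import re

# Hypothetical synonym table (illustrative only, not from the linked kernel).
SYNONYMS = {
    "24 hr concierge": "24hr concierge",
    "24 hour concierge": "24hr concierge",
    "24 7 concierge": "24hr concierge",
    "hardwood floors": "hardwood",
    "hardwood flooring": "hardwood",
}


def normalize_feature(raw):
    """Lowercase, turn punctuation into spaces, then map known synonyms."""
    key = re.sub(r"[^a-z0-9]+", " ", raw.lower()).strip()
    return SYNONYMS.get(key, key)


def deduplicate_features(features):
    """Return normalized feature names with duplicates removed, order kept."""
    seen = []
    for raw in features:
        name = normalize_feature(raw)
        if name and name not in seen:
            seen.append(name)
    return seen
```

For instance, `"24-hr Concierge"` and `"24 Hour Concierge"` both normalize to the same term, so a listing that mentions both is counted once.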

*Text analysis:* Split descriptions into features describing writing style
- length of description
- number of words
- number of capital letters used
- number of punctuation marks used
- vocabulary richness (use of unique words)
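
A minimal sketch of these writing-style features for a single description. The function name, the output keys, and the definition of vocabulary richness (unique words divided by total words) are assumptions; the notebooks may define them differently.

```python
import string


def text_style_features(description):
    """Return simple writing-style features for one listing description."""
    words = description.split()
    n_words = len(words)
    # Unique words, ignoring case and surrounding punctuation
    unique_words = {w.lower().strip(string.punctuation) for w in words}
    unique_words.discard("")
    return {
        "desc_length": len(description),
        "num_words": n_words,
        "num_capitals": sum(1 for ch in description if ch.isupper()),
        "num_punctuation": sum(1 for ch in description if ch in string.punctuation),
        # Assumed definition: unique words / total words (0.0 for empty text)
        "vocab_richness": len(unique_words) / n_words if n_words else 0.0,
    }
```

With a pandas DataFrame `df` that has a `description` column (an assumption), the features could be expanded into columns with `pd.DataFrame(df["description"].map(text_style_features).tolist(), index=df.index)`.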

*Feature Aggregation & Transformation*: Combine existing features into other features
- price per bedroom
- price per bathroom
- price per room
- number of photos per listing
- number of claimed rental features
- difference between street and display addresses
- neighborhoods (based on latitude/longitude)
- Multinomial Naive Bayes scoring for description vs interest level
- Multinomial Naive Bayes scoring for features vs interest level
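
The arithmetic features in the list above can be sketched for a single listing as shown here. Several choices are assumptions: "rooms" is taken to be bedrooms plus bathrooms, a zero count yields `None` instead of a division error, the address difference is a simple case-insensitive comparison, and the listing is assumed to carry a `features` list of claimed rental features. Neighborhoods and the Naive Bayes scores are not covered by this sketch.

```python
def aggregate_features(listing):
    """Derive combined features from one listing given as a dict."""
    price = listing["price"]
    bedrooms = listing["bedrooms"]
    bathrooms = listing["bathrooms"]

    def per(count):
        # Studios have 0 bedrooms, so guard against division by zero
        return price / count if count > 0 else None

    street = listing["street_address"].strip().lower()
    display = listing["display_address"].strip().lower()
    return {
        "price_per_bedroom": per(bedrooms),
        "price_per_bathroom": per(bathrooms),
        # Assumption: rooms = bedrooms + bathrooms
        "price_per_room": per(bedrooms + bathrooms),
        "num_photos": len(listing["photos"]),
        "num_features": len(listing["features"]),
        "address_differs": street != display,
    }
```

How to treat studios is a design choice: returning `None` keeps them distinguishable, whereas substituting the full price would make them look like one-bedroom listings.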

*Time:* Split the `created` timestamp into separate time measurements to test whether putting up the post at a certain time impacts interest.
- year (no impact as all rentals were from 2016)
- month
- day of the month
- day of the week
- hour
- minute
- second
- time (hr + minutes)
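
A small sketch of the time split, assuming `created` is a string formatted like `2016-06-24 07:54:24`. Representing "time (hr + minutes)" as fractional hours is also an assumption; the notebooks may encode it differently.

```python
from datetime import datetime


def time_features(created):
    """Split a 'YYYY-MM-DD HH:MM:SS' timestamp into separate components."""
    ts = datetime.strptime(created, "%Y-%m-%d %H:%M:%S")
    return {
        "year": ts.year,
        "month": ts.month,
        "day_of_month": ts.day,
        "day_of_week": ts.weekday(),  # Monday = 0, Sunday = 6
        "hour": ts.hour,
        "minute": ts.minute,
        "second": ts.second,
        # Assumed encoding: hour plus minutes as a fraction of an hour
        "time": ts.hour + ts.minute / 60,
    }
```

For example, `time_features("2016-06-24 07:54:24")` gives `day_of_week` 4 (a Friday) and `hour` 7.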

### Principal component analysis (PCA)


### Target Transformation
Transform target (interest level = high, medium, or low) into ordinal values
- high = 2
- medium = 1
- low = 0
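
The mapping above can be written as a small lookup. The names `INTEREST_TO_ORDINAL`, `encode_interest`, and `decode_interest` are hypothetical.

```python
# Ordinal encoding of the target, as listed above
INTEREST_TO_ORDINAL = {"low": 0, "medium": 1, "high": 2}
ORDINAL_TO_INTEREST = {value: key for key, value in INTEREST_TO_ORDINAL.items()}


def encode_interest(levels):
    """Map interest labels to ordinal values; unknown labels raise KeyError."""
    return [INTEREST_TO_ORDINAL[level] for level in levels]


def decode_interest(values):
    """Map ordinal values back to interest labels."""
    return [ORDINAL_TO_INTEREST[value] for value in values]
```

With a pandas DataFrame `train` (an assumption), the same mapping could be applied with `train["interest_level"].map(INTEREST_TO_ORDINAL)`.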

## Modeling



**Final model:** Random Forest Classifier

*Assumptions:* Random forests are non-parametric, so they make no assumptions about how the features are distributed. We picked this method because it is fairly robust and does not require the data to be parametric or normalized. In addition, it allows for more real-world interpretation of the results than some other models.

*Regularization:* via PCA
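
One way PCA could sit in front of the classifier is sketched below with scikit-learn, using synthetic data in place of the listing features. The scaling step, the 95% explained-variance threshold, and the number of trees are assumptions chosen for illustration, not settings taken from the project.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def build_pca_forest(variance_kept=0.95, random_state=0):
    """Scale features, keep enough components for the given variance, then classify."""
    return make_pipeline(
        StandardScaler(),
        # A float n_components keeps the smallest number of components
        # that explains at least that share of the variance
        PCA(n_components=variance_kept, svd_solver="full"),
        RandomForestClassifier(n_estimators=50, random_state=random_state),
    )


# Synthetic stand-in data: 3 independent columns plus 3 near-duplicates
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 3))
X_demo = np.hstack([base, base + 0.01 * rng.normal(size=(200, 3))])
y_demo = rng.integers(0, 3, size=200)

pca_forest = build_pca_forest().fit(X_demo, y_demo)
kept = pca_forest.named_steps["pca"].n_components_
```

Here PCA reduces the number of inputs by collapsing redundant columns, which is the sense in which it limits model complexity.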
    
**Other models tried:** 
- linear regression


In [None]:
#code for final model
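
A hedged sketch of fitting the random forest and producing class probabilities, as described under Delivery. It runs on synthetic data; the hyperparameters, the hold-out split, and the helper name `fit_random_forest` are placeholders, not the project's actual settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def fit_random_forest(X, y, n_estimators=50, random_state=0):
    """Fit a random forest on features X and ordinal targets y."""
    model = RandomForestClassifier(n_estimators=n_estimators, random_state=random_state)
    model.fit(X, y)
    return model


# Synthetic stand-in for the engineered features and encoded target (0, 1, 2)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = rng.integers(0, 3, size=300)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
model = fit_random_forest(X_train, y_train)

# One row per listing; columns follow model.classes_ (low=0, medium=1, high=2)
proba = model.predict_proba(X_val)
```

The columns of `proba` are ordered by `model.classes_`, so it is worth checking that order before labelling them as low, medium, and high.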

## Evaluation

How well does the model perform?
- Accuracy
- ROC curves
- Cross-validation
- Other metrics? Performance?
- A/B test results (if any)
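
A sketch of how accuracy and cross-validation could be combined, using scikit-learn on synthetic data. The 5-fold stratified split and the helper name `evaluate` are assumptions. Log loss and a confusion matrix are included as additional metrics because the model outputs class probabilities; ROC curves are not covered here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, log_loss
from sklearn.model_selection import StratifiedKFold, cross_val_predict


def evaluate(model, X, y, n_splits=5, random_state=0):
    """Score a classifier on out-of-fold predictions from stratified k-fold CV."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    proba = cross_val_predict(model, X, y, cv=cv, method="predict_proba")
    labels = np.unique(y)
    pred = labels[proba.argmax(axis=1)]  # most probable class per listing
    return {
        "accuracy": accuracy_score(y, pred),
        "log_loss": log_loss(y, proba, labels=labels),
        "confusion_matrix": confusion_matrix(y, pred, labels=labels),
    }


# Synthetic stand-in: the class depends on the first feature only
rng = np.random.default_rng(0)
X_eval = rng.normal(size=(300, 4))
y_eval = np.digitize(X_eval[:, 0], [-0.5, 0.5])  # values 0, 1, 2

results = evaluate(RandomForestClassifier(n_estimators=50, random_state=0), X_eval, y_eval)
```

Because every prediction comes from a fold that did not see that row during training, these scores are less optimistic than scores computed on the training data.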
