# Rentop Kaggle Competition

W207-3 Spring 2017

Team members: Stephanie Fan, Boris Kletser, Amitabha Karmakar 

**Goal:** Use rental listing features to predict interest in rental inquiries.

- [Kaggle Competition](https://www.kaggle.com/c/two-sigma-connect-rental-listing-inquiries)
- [Notebooks and Code](https://github.com/letslego/Rentop/)

For final code, refer to RentHop_Code.ipynb, which is annotated with the appropriate sections.

## Business Understanding



**Problem:** The problem is two-fold. First, we want to give owners and agents feedback on how to optimize listings to generate interest. Second, interest predictions help RentHop identify potential issues with listings, including fraud. Both should help customers find relevant listings more easily.

**Metrics:** The relevant metric is the accuracy of predicting high, medium, and low interest.
Our target is to correctly identify the interest level (high, medium, or low) for 80% of listings.

**Delivery:** We will deliver a model that predicts the probability of high, medium, and low interest for a given listing.

*Note:* For the purposes of this assignment, we do not analyze the images provided with the competition. We focus mainly on the existing features (e.g. text and numeric values) to predict interest level.

## Data Understanding

**Sources:**
- train.json:  49352 records over 15 columns
- test.json:   74659 records over 14 columns

Each row is a listing; each column is a feature. The extra column in train.json is the interest level, which we need to predict for test.json.

**Existing Features:**
    
|Feature Type|Columns|Type|Notes|
|---|---|---|---|
|IDs|building_id|Long string||
||listing_id|7 digit num||
||manager_id|Long string||
|Location|street_address|Text||
||display_address|Text||
||latitude|Float|New York City only|
||longitude|Float|New York City only|
|Features|bathrooms|Int|(mean 1.2, sd 0.5)|
||bedrooms|Int|(mean 1.5, sd 1.1)|
||descriptions|Text||
||price|Int||
||created|Date|Dates between 2016-04 and 2016-06, spread throughout the weeks; mostly posted between 1 and 5 am (especially 2 am)|
||photos|List of URLs||
|Target Var|interest_level|High/Medium/Low|This is what we’re predicting|


### EDA

#### Distribution of features
||bathrooms|bedrooms|latitude|longitude|price|
|---|---|---|---|---|---|
|count|49352|49352|49352|49352|49352|
|mean|1.21218|1.541640|40.741545|-73.955716|3830.174|
|std|0.50142|1.115018|0.638535|1.177912|22,066.87|
|min|0|0|0|-118.271000|43.00|
|25%|1|1|40.7283|-73.9917|2,500.00|
|50%|1|1|40.7518|-73.9779|3,150.00|
|75%|1|2|40.7743|-73.9548|4,100.00|
|max|10|8|44.8835|0.0000|4,490,000.00|

The number of bathrooms, the number of bedrooms, and the price are all skewed: most values sit at the low end, with a few high outliers.

#### Missing values
The data distribution suggests that some values are effectively missing (e.g. 0 for the longitude), even though every cell in the table is filled in.

#### Distribution of target
The distribution of the target is severely skewed toward the low interest class. Approximately 71%, 23%, and 7% of the dataset fall into the low, medium, and high interest level classes, respectively.

#### Relationships between features

||bathrooms|bedrooms|latitude|longitude|price|
|---|---|---|---|---|---|
|bathrooms|1.000000|0.533446|-0.009657|0.010393|0.069661|
|bedrooms|0.533446|1.000000|-0.004745|0.006892|0.051788|
|latitude|-0.009657|-0.004745|1.000000|-0.966807|-0.000707|
|longitude|0.010393|0.006892|-0.966807|1.000000|-0.000087|
|price|0.069661|0.051788|-0.000707|-0.000087|1.000000|

Unsurprisingly, the number of bedrooms and the number of bathrooms are moderately correlated (0.53). This is expected because homes are generally built with a number of bathrooms roughly proportional to the number of bedrooms. Latitude and longitude are also strongly negatively correlated (-0.97), although this could be due to outliers in the dataset: some values differ significantly from the rest and are suspected to be incorrect, since this dataset only has data from New York. Otherwise, latitude and longitude fall within a very small area of New York City. If you view the neighborhood clusters in the code notebook, you can see that the geolocations fall largely along a diagonal line due to the shape of the area we are looking at.

## Data Preparation

### Feature Transformation and Engineering
*[De-duplicating features](https://www.kaggle.com/jxnlco/two-sigma-connect-rental-listing-inquiries/deduplicating-features)*: parses descriptions into consistent rental features (e.g. 24-hr concierge) and replaces synonyms with consistent terminology

*Text analysis:* Split descriptions into features describing writing style
- length of description
- number of words
- number of capital letters used
- number of punctuation marks used
- vocabulary richness (use of unique words)
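The writing-style features above could be computed along the following lines. This is a minimal sketch using only the standard library, not the notebook's actual code: the function name `text_style_features`, the output keys, and the definition of vocabulary richness as unique words divided by total words are all assumptions made for illustration.

```python
import string


def text_style_features(description: str) -> dict:
    """Compute simple writing-style features for one listing description."""
    words = description.split()
    # Normalise words for the vocabulary count: lowercase, strip punctuation.
    normalised = [w.lower().strip(string.punctuation) for w in words]
    normalised = [w for w in normalised if w]
    return {
        "desc_length": len(description),
        "num_words": len(words),
        "num_capitals": sum(1 for ch in description if ch.isupper()),
        "num_punctuation": sum(1 for ch in description if ch in string.punctuation),
        # Assumed definition: unique words / total words (0.0 for empty text).
        "vocab_richness": len(set(normalised)) / len(normalised) if normalised else 0.0,
    }


# Made-up description, for illustration only.
example = text_style_features("Sunny 2BR! Sunny views, NO fee.")
```

Each returned key can become one numeric column, so the description contributes to models that only accept numeric features.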

*Feature Aggregation & Transformation*: Combine existing features into other features
- price per bedroom
- price per bathroom
- price per room
- number of photos per listing
- number of claimed rental features
- difference between street and display addresses
- neighborhoods (based on latitude/longitude)
- Multinomial Naive Bayes scoring for description vs interest level
- Multinomial Naive Bayes scoring for features vs interest level
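Several of the aggregated features above are simple ratios and counts. The sketch below shows one possible way to derive them from a single listing record. It makes several assumptions that the notebook may handle differently: a zero bedroom or bathroom count (e.g. a studio) is treated as 1 in the denominator, the list of claimed rental features lives in a `features` field, and the address difference is measured as one minus a `difflib` similarity ratio. The names `safe_ratio` and `aggregate_features` are hypothetical.

```python
from difflib import SequenceMatcher


def safe_ratio(numerator: float, denominator: float) -> float:
    """Divide, treating a zero denominator as 1 (an assumption for studios)."""
    return numerator / denominator if denominator else float(numerator)


def aggregate_features(listing: dict) -> dict:
    """Derive ratio and count features from one listing record."""
    rooms = listing["bedrooms"] + listing["bathrooms"]
    street = listing["street_address"].strip().lower()
    display = listing["display_address"].strip().lower()
    return {
        "price_per_bedroom": safe_ratio(listing["price"], listing["bedrooms"]),
        "price_per_bathroom": safe_ratio(listing["price"], listing["bathrooms"]),
        "price_per_room": safe_ratio(listing["price"], rooms),
        "num_photos": len(listing["photos"]),
        "num_features": len(listing["features"]),
        # 0.0 means identical addresses, values near 1.0 mean very different.
        "address_difference": 1.0 - SequenceMatcher(None, street, display).ratio(),
    }


# Made-up listing, for illustration only.
listing = {
    "price": 3000, "bedrooms": 2, "bathrooms": 1,
    "photos": ["a.jpg", "b.jpg"],
    "features": ["Doorman", "Elevator", "Laundry"],
    "street_address": "100 Main Street",
    "display_address": "100 Main Street",
}
derived = aggregate_features(listing)
studio = aggregate_features({**listing, "bedrooms": 0})
```

The neighborhood and Naive Bayes score features are not covered here because they depend on clustering and model fitting steps in the notebook.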

*Time:* Split features into different time measurements -- does putting up the post at a certain time impact interest?
- year (no impact as all rentals were from 2016)
- month
- day of the month
- day of the week
- hour
- minute
- second
- time (hr + minutes)
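The time features above can be sketched with the standard library as follows. The timestamp format `%Y-%m-%d %H:%M:%S` and the reading of "time (hr + minutes)" as a fractional hour are assumptions, and `time_features` is a hypothetical helper name.

```python
from datetime import datetime


def time_features(created: str) -> dict:
    """Split a 'created' timestamp string into separate time measurements."""
    # Assumed format, e.g. "2016-06-24 07:54:24".
    ts = datetime.strptime(created, "%Y-%m-%d %H:%M:%S")
    return {
        "year": ts.year,
        "month": ts.month,
        "day_of_month": ts.day,
        "day_of_week": ts.weekday(),  # Monday = 0, Sunday = 6
        "hour": ts.hour,
        "minute": ts.minute,
        "second": ts.second,
        # Assumed encoding of "hr + minutes": hour plus fraction of an hour.
        "time_of_day": ts.hour + ts.minute / 60.0,
    }


example_time = time_features("2016-06-24 07:54:24")
```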

### Principal component analysis (PCA)
After all feature engineering had been performed, the features were normalized and decorrelated using PCA.
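As a rough illustration of this step, the sketch below standardizes a feature matrix and projects it onto its principal components using NumPy's SVD. It is not the notebook's implementation (which may well use a library PCA class); the function name `standardize_and_pca`, the synthetic data, and the choice of two components are assumptions.

```python
import numpy as np


def standardize_and_pca(X, n_components):
    """Standardize columns, then project onto the top principal components."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unscaled instead of dividing by 0
    Z = (X - X.mean(axis=0)) / std
    # Rows of Vt are the principal directions, ordered by singular value.
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = Z @ Vt[:n_components].T
    explained = (S ** 2) / np.sum(S ** 2)
    return scores, explained[:n_components]


# Synthetic stand-in data: bathrooms loosely track bedrooms, price is independent.
rng = np.random.default_rng(0)
bedrooms = rng.integers(0, 5, size=200).astype(float)
bathrooms = 0.5 * bedrooms + rng.normal(0, 0.5, size=200)
price = rng.normal(3000, 800, size=200)
X = np.column_stack([bedrooms, bathrooms, price])

scores, explained = standardize_and_pca(X, n_components=2)
```

The resulting score columns are uncorrelated with each other, which is the property this section relies on.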

### Target Transformation
Transform target (interest level = high, medium, or low) into ordinal values
- high = 2
- medium = 1
- low = 0
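The mapping above is small enough to express as a dictionary. A minimal sketch, with hypothetical names `INTEREST_TO_ORDINAL`, `encode_interest`, and `decode_interest`:

```python
# Ordinal encoding as listed above: low < medium < high.
INTEREST_TO_ORDINAL = {"low": 0, "medium": 1, "high": 2}
ORDINAL_TO_INTEREST = {value: name for name, value in INTEREST_TO_ORDINAL.items()}


def encode_interest(levels):
    """Map interest level strings to their ordinal values."""
    return [INTEREST_TO_ORDINAL[level.lower()] for level in levels]


def decode_interest(values):
    """Map ordinal values back to interest level strings."""
    return [ORDINAL_TO_INTEREST[value] for value in values]


encoded = encode_interest(["low", "high", "medium", "low"])
```

Keeping the inverse mapping around makes it easy to turn predictions back into the labels the competition expects.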

## Modeling



**Final model:** Random Forest Classifier

*Assumptions:* None about the feature distributions. We picked this method because it is fairly robust and does not require the data to follow a parametric distribution or to be normalized. In addition, this method allows a more real-world interpretation of results than other models, which could lead to guidelines that help posters make their listings more attractive.

*Regularization:* none. Not needed due to model type chosen.
    

**Other models tried:** 

|Model|Score|Notes|
|---|---|---|
|Baseline - linear regression against latitude and longitude|0.79053|Surprisingly, this model performed quite well even though it was based purely on location.|
|Multinomial Naive Bayes with TF-IDF vectorization of description|0.75440|This was the best score among the features and models we tried. It was surprising, as we thought there would be better predictors.|
|Random Forest on all numeric features after feature engineering|1.22052|There were not a huge number of numeric features (~29), so this model may not work as robustly as we expected.|
|Random Forest on top numeric features|2.26920|We only entered the top three factors, which was inappropriate given this model type. Due to this, the diversity of the trees was probably not very large, and probably biased the predictions.|
|Linear regression with all numeric features|1.09215|The linear regression gave a better result than the random forest model, likely due to the smaller number of features.|
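For reference, the Multinomial Naive Bayes row in the table could be reproduced in outline with scikit-learn as shown below. This is a sketch under assumptions, not the notebook's code: the toy descriptions and labels are made up, default vectorizer and classifier settings are used, and the variable names are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy data standing in for the description column and interest_level.
descriptions = [
    "Sunny renovated two bedroom with doorman and no fee",
    "Spacious studio near subway, laundry in building",
    "Luxury penthouse with roof deck and concierge",
    "Cozy one bedroom, walk up, broker fee applies",
    "Bright two bedroom with dishwasher and no fee",
    "Small room in shared apartment, fee applies",
]
labels = ["high", "medium", "low", "low", "high", "low"]

# TF-IDF turns each description into a sparse term-weight vector,
# which Multinomial Naive Bayes then uses to estimate class probabilities.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(descriptions, labels)

# One probability per class, in the order given by `classes`.
classes = list(model.named_steps["multinomialnb"].classes_)
probs = model.predict_proba(["Renovated two bedroom, no fee"])
```

Because the competition asks for a probability per class, `predict_proba` is the relevant output rather than `predict`.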

## Evaluation

The model was trained using accuracy as the metric, and periodic submissions were made to the Kaggle competition, which scores submissions using the log loss function.
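Because accuracy and log loss reward different things, it can help to compute log loss locally as well. The sketch below implements multi-class log loss with the standard library; the function name `multiclass_log_loss`, the row normalization, and the clipping constant `eps` are assumptions rather than details taken from the competition or the notebook.

```python
import math

CLASSES = ("high", "medium", "low")


def multiclass_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-probability assigned to the true class (lower is better)."""
    total = 0.0
    for label, probs in zip(y_true, y_prob):
        # Normalise each row so the three probabilities sum to 1.
        row_sum = sum(probs[c] for c in CLASSES)
        p = probs[label] / row_sum
        # Clip to avoid log(0) when a prediction is confidently wrong.
        p = min(max(p, eps), 1 - eps)
        total += math.log(p)
    return -total / len(y_true)


truth = ["low", "low", "high"]
uniform = [{"high": 1 / 3, "medium": 1 / 3, "low": 1 / 3}] * 3
# Hard 0/1 predictions that are right twice and wrong once.
hard = [
    {"high": 0.0, "medium": 0.0, "low": 1.0},
    {"high": 0.0, "medium": 0.0, "low": 1.0},
    {"high": 0.0, "medium": 0.0, "low": 1.0},
]

uniform_loss = multiclass_log_loss(truth, uniform)
hard_loss = multiclass_log_loss(truth, hard)
```

In this toy case the hard predictions are more accurate than a uniform guess (two of three correct) yet score a much worse log loss, because the single confident mistake is penalized heavily. This is one reason a model tuned for accuracy can do poorly on a log loss leaderboard.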

Overall, the models did not perform very well. When training the model based on accuracy, we were generally only able to predict the development set at around a 70% accuracy level given the features we had selected. This suggests that some relevant factors were not present in the dataset. A couple of unaccounted-for factors that readily spring to mind are the visual attractiveness of the photos and user demographics. Apartment rentals are driven by the needs of the user, which can vary greatly depending on their stage of life (e.g. young professionals, young families, age and number of children, students, seniors, etc.).


## Potential Next Steps
If we were continuing this competition, we would consider looking at the following next:

|Item|Rationale|
|---|---|
|Image Processing|Humans are highly visual, and we chose not to pursue image processing as part of the competition. However, we recognize the impact that photos can have, and believe this would be a worthwhile area to explore for features. Some top-of-mind ideas are brightness and contrast of the images, resolution, and size.|
|User Clustering|Another approach that could be pursued is to evaluate clusters within each interest level to see which clusters of features are attractive. This could help posters gain a better understanding of potential renters and what they are looking for.|
|Deep Learning|We settled on using Random Forests because it was a relatively easy method to employ. However, we cannot always understand what drives people's interest at a surface level. Additionally, we had a somewhat large set of features, which deep learning requires in order to be worthwhile.|