# Rentop Kaggle Competition

W207-3 Spring 2017

Team members: Stephanie Fan, Boris Kletser, Amitabha Karmakar 

**Goal:** Use rental listing features to predict interest in rental inquiries.

- [Kaggle Competition](https://www.kaggle.com/c/two-sigma-connect-rental-listing-inquiries)
- [Notebooks and Code](https://github.com/letslego/Rentop/)

## Business Understanding



**Problem:** The problem is two-fold. First, we want to give owners and agents feedback on how to optimize listings to generate interest. Second, we want to help RentHop identify potential issues with listings, including fraud. Both should help customers find relevant listings more easily.

**Metrics:** The relevant metric is how accurately we predict the probability of high, medium, and low interest. 
Kaggle scores submissions with a multiclass logarithmic loss. The lower the number, the better.
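
To make the metric concrete, here is a minimal sketch of a multiclass log loss in plain Python. The function name `multiclass_log_loss`, the clipping constant `eps`, and the row renormalization are assumptions made for illustration; Kaggle's exact implementation may differ in detail.

```python
import math

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to each listing's true class.

    y_true: list of class names; probs: list of dicts {class name: probability}.
    """
    total = 0.0
    for label, row in zip(y_true, probs):
        norm = sum(row.values())                 # renormalize so the row sums to 1
        p = min(max(row[label] / norm, eps), 1 - eps)  # clip to avoid log(0)
        total -= math.log(p)
    return total / len(y_true)

# Two illustrative (made-up) listings and predicted probabilities
y_true = ["low", "high"]
probs = [
    {"high": 0.1, "medium": 0.2, "low": 0.7},
    {"high": 0.5, "medium": 0.3, "low": 0.2},
]
loss = multiclass_log_loss(y_true, probs)
print(round(loss, 4))
```

For reference, predicting equal probabilities for all three classes gives a loss of ln 3 ≈ 1.0986, so a useful model should score below that.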

**Delivery:** We will deliver a model that predicts the probability of high, medium, and low interest for a given listing.

*Note:* For this assignment, we do not analyze the images provided with the competition. We focus on the existing features (e.g. text and numeric values) to predict interest level.

## Data Understanding

**Sources:**
- train.json:  49352 records over 15 columns
- test.json:   74659 records over 14 columns

Each row is a listing; each column is a feature. The extra column in train.json is the interest level, which we need to predict for test.json.

**Existing Features:**
    
|Feature Type|Columns|Type|Notes|
|---|---|---|---|
|IDs|building_id|Long string||
||listing_id|7 digit num||
||manager_id|Long string||
|Location|street_address|Text||
||display_address|Text||
||latitude|Float|New York City only|
||longitude|Float|New York City only|
|Features|bathrooms|Int|(mean 1.2, sd 0.5)|
||bedrooms|Int|(mean 1.5, sd 1.1)|
||description|Text||
||price|Int||
||created|Date|Dates between 2016-04 and 2016-06. Spread throughout weeks, mostly between 1-5am (esp 2am)|
||photos|List of URLs||
|Target Var|interest_level|High/Medium/Low|This is what we’re predicting|


In [None]:
# code for loading and shape of train/test

### EDA
- Distribution of each feature
- Missing values
- Distribution of target
- Relationships between features
- Other idiosyncrasies?

## Data Preparation

### Feature Transformation and Engineering
*[De-duplicating features](https://www.kaggle.com/jxnlco/two-sigma-connect-rental-listing-inquiries/deduplicating-features)*: parses descriptions into consistent rental features (ex: 24-hr concierge) and replaces synonyms with consistent terminology

*Text analysis:* Split descriptions into features describing writing style
- length of description
- number of words
- number of capital letters used
- number of punctuation marks used
- vocabulary richness (use of unique words)

*Feature Aggregation & Transformation*: Combine existing features into other features
- price per bedroom
- price per bathroom
- price per room
- number of photos per listing
- number of claimed rental features
- difference between street and display addresses
- neighborhoods (based on latitude/longitude)
- Multinomial Naive Bayes scoring for description vs interest level
- Multinomial Naive Bayes scoring for features vs interest level

*Time:* Split features into different time measurements -- does putting up the post at a certain time impact interest?
- year (no impact as all rentals were from 2016)
- month
- day of the month
- day of the week
- hour
- minute
- second
- time (hr + minutes)

### Code for feature engineering

In [None]:
import numpy as np
import pandas as pd
from string import punctuation

def add_txt_features(orig):
    #writing-style features derived from the free-text description
    dat = orig
    desc = dat['description']
    dat.loc[:,'strlen'] = [len(x) for x in desc]
    dat.loc[:,'numwords'] = [len(x.split()) for x in desc]
    dat.loc[:,'numcaps'] = [sum(1 for c in x if c.isupper()) for x in desc]
    dat.loc[:,'numpunct'] = [sum(1 for c in x if c in punctuation) for x in desc]
    #richness = unique words / total words; the small constant avoids dividing by 0
    dat.loc[:,'richness'] = [len(set(x.split())) / (len(x.split())+0.001) for x in desc]
    return dat

def add_price_features(orig):
    dat = orig
    dat.loc[:,'price_per_bed'] = dat['price'] / (dat['bedrooms']+0.00001)
    dat.loc[:,'price_per_bath'] = dat['price'] / (dat['bathrooms']+0.00001)
    dat.loc[:,'price_per_room'] = dat['price'] / (dat['bathrooms'] + dat['bedrooms'] +0.00001)
    return dat

def get_num_photos(orig):
    dat = orig
    dat.loc[:,'numphotos'] = [len(x) for x in dat['photos'].values]
    return dat

def add_time_features(orig):
    dat = orig
    dat.loc[:,"created2"] = pd.to_datetime(dat['created'])
     
    dat.loc[:,'year']   = dat['created2'].dt.year
    dat.loc[:,'month']  = dat['created2'].dt.month
    dat.loc[:,'day']    = dat['created2'].dt.day
    dat.loc[:,'weekday']= dat['created2'].dt.dayofweek
    dat.loc[:,'hour']   = dat['created2'].dt.hour
    dat.loc[:,'minute'] = dat['created2'].dt.minute
    dat.loc[:,'second'] = dat['created2'].dt.second
    dat.loc[:,'hr_min'] = dat['created2'].dt.hour.multiply(100).add(dat['created2'].dt.minute)
    return dat

def get_num_features(orig):
    dat = orig
    dat.loc[:,'numfeatures'] = [len(x) for x in dat['features'].values]
    return dat

def get_address_dif(orig):
    dat = orig
    street_addr_len = [len(sa) for sa in dat['street_address']]
    display_addr_len = [len(da) for da in dat['display_address']]
    dat.loc[:,'addr_dif'] = np.subtract(street_addr_len,display_addr_len)
    return dat



### Neighborhoods

*Note:* Latitude and longitude are not standardized before clustering.

In [None]:
def remove_outliers(orig):
    dat = orig
    dat = dat[((dat.latitude - dat.latitude.mean()) / dat.latitude.std()).abs() < 3]
    dat = dat[((dat.longitude - dat.longitude.mean()) / dat.longitude.std()).abs() < 3]
    return dat

def make_neighborhoods(orig, num_clusters):
    #returns a fitted KMeans model, with which we can classify other points
    dat = orig[['latitude', 'longitude']].copy()
    km = KMeans(n_clusters=num_clusters, random_state=1).fit(dat)
    return km

def fit_neighborhoods(orig, km):
    dat = orig[['latitude', 'longitude']].copy()
    dat2 = orig
    dat2.loc[:,'neighborhood'] = km.predict(dat)
    return dat2
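
As a rough illustration of the neighborhood step, the sketch below clusters synthetic coordinates with `KMeans` and standardizes them first with `StandardScaler`. The coordinates, the two-cluster setup, and the standardization step are assumptions for illustration only; the functions above cluster the raw latitude and longitude.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(1)

# Two synthetic groups of (latitude, longitude) points; values are made up
coords = np.vstack([
    rng.normal([40.75, -73.98], 0.005, size=(50, 2)),
    rng.normal([40.68, -73.95], 0.005, size=(50, 2)),
])

# Standardize so both coordinates contribute on the same scale
scaler = StandardScaler().fit(coords)
km = KMeans(n_clusters=2, random_state=1, n_init=10).fit(scaler.transform(coords))

# New points must go through the same scaler before km.predict
labels = km.predict(scaler.transform(coords))
print(labels[:5], labels[-5:])
```

The fitted scaler and the fitted `KMeans` model travel together: any dev or test coordinates would need the same transformation before being assigned to a neighborhood.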


### Description and Feature Analysis

In [None]:
import re

def pre_proc(s,
              word_length_range=(3,7),
              remove_stop_words=True,
              scale_capitals=1,
              set_to_lower=True,
              remove_numbers=False
             ):
   
    #Python's re module has no \p{P} class, so strip punctuation by replacing
    #every run of non-word, non-space characters with a space
    s2 = re.sub(r"[^\w ]+"," ",s)
    s2 = re.sub(r"_","",s2) #remove underscores (\w treats them as word characters)
    
    #http://stackoverflow.com/questions/8745821/find-words-with-capital-letters-not-at-start-of-a-sentence-with-regex
    #doesn't matter if at start of sentence, often it's the key NP. If a stopword, those get stripped anyway
    names = " "+" ".join(re.findall(r'\b[A-Z][A-Za-z0-9]*\b',s2))
    for i in range(0,scale_capitals):
        s2 = s2 + names #repeat capitalized words to give them extra weight
        
    if set_to_lower:
        s2 = s2.lower() #lower case

    s2 = re.sub(r"\s+", " ",s2) #collapse multiple spaces
    
    if remove_numbers:
        s2 = re.sub(r"\d", " ",s2) #remove all numbers

    #truncate words longer than the upper bound to their first n characters
    truncation_re = r"\b(\w{%d})\w+" % word_length_range[1]
    s2 = re.sub(truncation_re, r"\1", s2)

    #remove all words/numbers of at most n characters (n = lower bound)
    short_elim_re = r"\b\w{1,%d}\b" % word_length_range[0]
    s2 = re.sub(short_elim_re, "", s2)
    
    #remove stop words (stop_words must be defined before this branch is used)
    if remove_stop_words:
        s2 = ' '.join(w for w in s2.split() if w.lower() not in stop_words)
    return s2

pre_proc_custom = lambda x: pre_proc(x, 
                                      word_length_range = (3,8), 
                                      remove_stop_words = False, 
                                      scale_capitals = 1, 
                                      set_to_lower = True,
                                      remove_numbers = False
                                     )
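
The two word-length rules used in `pre_proc` can be tried in isolation. The sketch below is a simplified stand-in rather than the notebook's function: `truncate_and_filter` is a hypothetical helper, and it assumes that long words are cut to their first `max_len` characters and that words of at most `min_len` characters are dropped.

```python
import re

def truncate_and_filter(text, min_len=3, max_len=7):
    # Keep only the first max_len characters of every longer word
    text = re.sub(r"\b(\w{%d})\w+" % max_len, r"\1", text)
    # Drop words of at most min_len characters
    text = re.sub(r"\b\w{1,%d}\b" % min_len, "", text)
    # Collapse the whitespace left behind by removed words
    return " ".join(text.split())

result = truncate_and_filter("spacious renovated apartment near the park")
print(result)  # spaciou renovat apartme near park
```

Truncating long words acts as a crude form of stemming, so that variants such as "renovated" and "renovation" map to the same token.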

In [None]:
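from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

mytv = TfidfVectorizer(ngram_range=(1,1), 
                       analyzer='word', 
                       preprocessor=pre_proc_custom)
mytv.fit_transform(X_train['description'].values)
train_words = mytv.vocabulary_ #assumption: the fitted vocabulary reused by the functions below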

def get_desc_mnb(X, y):    
    mytv_dev = TfidfVectorizer(ngram_range=(1,1), 
                           analyzer='word', 
                           preprocessor=pre_proc_custom,
                           vocabulary=train_words)
    
    X_dev_words = mytv_dev.fit_transform(X['description'].values) 
    
    mnb = MultinomialNB(alpha = 0.009)
    mnb.fit(X_dev_words, y)
    return mnb

def get_description_scores(X, mnb):
    mytv_dev = TfidfVectorizer(ngram_range=(1,1), 
                           analyzer='word', 
                           preprocessor=pre_proc_custom,  #set above
                           vocabulary=train_words) #also set above
    
    X_dev_words = mytv_dev.fit_transform(X['description'].values) 
    
    pred_train = mnb.predict_proba(X_dev_words)
    
    dat = X
    dat.loc[:,'desc_1'] = pred_train[:,0]
    dat.loc[:,'desc_2'] = pred_train[:,1]
    dat.loc[:,'desc_3'] = pred_train[:,2]
    return dat
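
The idea behind the description scores can be sketched on toy data: fit a TF-IDF vectorizer and a Multinomial Naive Bayes model on training text, then use the predicted class probabilities as new numeric features. The texts and labels below are made up, and this sketch fits the vectorizer once and only calls `transform` on new text, which is one possible design rather than the exact approach used above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training descriptions with ordinal labels (1 = low, 3 = high); made up
train_text = [
    "luxury doorman building",
    "doorman elevator luxury",
    "walkup no fee",
    "no fee studio walkup",
]
train_y = [1, 1, 3, 3]
new_text = ["luxury doorman", "no fee walkup"]

# Fit the vocabulary and IDF weights on the training text only
vectorizer = TfidfVectorizer()
mnb = MultinomialNB(alpha=0.009)
mnb.fit(vectorizer.fit_transform(train_text), train_y)

# One probability column per class, ordered as in mnb.classes_
scores = mnb.predict_proba(vectorizer.transform(new_text))
print(mnb.classes_)
print(scores.round(3))
```

Reusing the fitted vectorizer keeps the IDF weights identical between training and scoring, so the probabilities for dev and test listings are computed on the same scale as those the model was trained on.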

In [None]:
def clean(s):
    #phrases that contain spaces or slashes must be replaced before hyphens and
    #spaces are stripped, and longer words before their prefixes (decorative, deco)
    replacements = [("twenty four hour", "24"), ("24/7", "24"),
                    ("ss appliances", "stainless"),
                    ("-", ""), (" ", ""),
                    ("24hour", "24"), ("24hr", "24"),
                    ("common", "cm"), ("concierge", "doorman"),
                    ("bicycle", "bike"), ("private", "pv"),
                    ("decorative", "dc"), ("deco", "dc"),
                    ("onsite", "os"), ("outdoor", "od")]
    for i,x in enumerate(s):
        x = x.lower().strip()
        for old, new in replacements:
            x = x.replace(old, new)
        s[i] = x #each feature becomes a single token
    return s

def clean_features(orig):
    dat = orig
    dat.loc[:,'cleaned_features'] = [' '.join(clean(f)) for f in dat['features'].values]
    return dat

In [None]:
mytv = TfidfVectorizer(ngram_range=(1,1), 
                       analyzer='word', 
                      ) 

def get_feature_mnb(X, y):
    X = clean_features(X)
    mytv_dev = TfidfVectorizer(ngram_range=(1,1), 
                           analyzer='word', 
                           #preprocessor=pre_proc_custom,  #set above
                           vocabulary=train_words) #also set above
    
    X_dev_words = mytv_dev.fit_transform(X['cleaned_features'].values) 
    
    mnb = MultinomialNB(alpha = 0.009)
    mnb.fit(X_dev_words, y)
    return mnb

def get_feature_scores(X, mnb):
    X = clean_features(X)
    mytv_dev = TfidfVectorizer(ngram_range=(1,1), 
                           analyzer='word', 
                           #no custom preprocessor, to match get_feature_mnb
                           vocabulary=train_words) #also set above
    
    X_dev_words = mytv_dev.fit_transform(X['cleaned_features'].values) 
    
    pred_train = mnb.predict_proba(X_dev_words)
    #print pred_train
    #print mnb.classes_
    
    dat = X
    dat.loc[:,'feat_1'] = pred_train[:,0]
    dat.loc[:,'feat_2'] = pred_train[:,1]
    dat.loc[:,'feat_3'] = pred_train[:,2]
    return dat

### Principal component analysis (PCA)
Run PCA across the training set to see how many components carry most of the variance. Try running the analysis on just those.

In [None]:
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

pca = PCA()
X_transformed = pca.fit_transform(X_train_limited) #fits and projects in one step
print(pca.explained_variance_ratio_)
plt.plot(pca.explained_variance_)
plt.show()
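
One way to decide how many components to keep is to look at the cumulative explained variance. The sketch below uses synthetic data built from two underlying signals; the 95% threshold and the standardization step are assumptions chosen for illustration, not settings taken from this project.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)

# Six synthetic columns driven by two underlying signals plus small noise
base = rng.normal(size=(200, 2))
X = np.hstack([base, base @ rng.normal(size=(2, 4))])
X = X + rng.normal(scale=0.01, size=X.shape)

# Standardize first so no column dominates because of its units
pca = PCA().fit(StandardScaler().fit_transform(X))
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches 95%
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)
print(cumulative.round(3), n_keep)
```

Here the data really has two underlying dimensions, so two components are enough. On the rental features, the same calculation would show how far the feature list can be compressed.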

### Target Transformation
Transform target (interest level = high, medium, or low) into ordinal values
- high = 3
- medium = 2
- low = 1

## Modeling



**Final model:** Random Forest Classifier

*Assumptions:* We do not assume the features follow any parametric distribution. We picked this method because it is fairly robust and does not require the data to be parametric or regularized. In addition, its results lend themselves to real-world interpretation more readily than those of other models.

*Regularization:* via PCA
    
**Other models used:** 
- linear regression - was too biased toward low
- K-means clustering to create neighborhoods


### Splitting the Data

In [None]:
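import pandas as pd
from sklearn.model_selection import train_test_split

train_df = pd.read_json("../input/train.json")
test_df = pd.read_json("../input/test.json")

train_df = train_df.set_index('listing_id')
test_df = test_df.set_index('listing_id')

#encode the target as ordinal values: low = 1, medium = 2, high = 3
y = train_df['interest_level']
y2 = y.replace({'low':1,'medium':2,'high':3})
X = train_df.drop('interest_level', axis=1)
#hold out 20% of the labeled data as a dev set
X_train, X_dev, y_train, y_dev = train_test_split(X, y2, test_size=0.2, random_state=1)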

### Adding in the new features

In [None]:
X_train = add_txt_features(X_train)
X_train = add_price_features(X_train)
X_train = get_num_photos(X_train)
X_train = add_time_features(X_train)
X_train = get_num_features(X_train)
X_train = get_address_dif(X_train)

neighborhoods = make_neighborhoods(X_train, 50)
X_train = fit_neighborhoods(X_train, neighborhoods)

my_desc_mnb = get_desc_mnb(X_train, y_train)
X_train = get_description_scores(X_train,my_desc_mnb)

my_feature_mnb = get_feature_mnb(X_train, y_train)
X_train = get_feature_scores(X_train, my_feature_mnb) 


X_test = test_df #assumption: X_test is the Kaggle test set loaded above
X_test = add_txt_features(X_test)
X_test = add_price_features(X_test)
X_test = get_num_photos(X_test)
X_test = add_time_features(X_test)
X_test = get_num_features(X_test)
X_test = get_address_dif(X_test)

#reuse the models already fitted on the training set
X_test = fit_neighborhoods(X_test, neighborhoods)
X_test = get_description_scores(X_test,my_desc_mnb)
X_test = get_feature_scores(X_test, my_feature_mnb)

### Final model

In [None]:
#Drop non-numeric features
feature_list = ['feat_1','feat_2','feat_3', 'desc_1','desc_2','desc_3',
                'year', 'month', 'day', 'weekday', 'hour', 'minute', 'second', 'hr_min',
                'numphotos', 'numfeatures','addr_dif','neighborhood',
                'price_per_bed','price_per_bath','price_per_room',
                'price','bedrooms','bathrooms',
                'strlen','numwords','numcaps','numpunct','richness']

X_test_limited = X_test[feature_list]
print(X_test_limited.shape)

X_train_limited = X_train[feature_list] #index already set

rfc = RandomForestClassifier(n_estimators=20, n_jobs=-1) #-1 means use all available cores
rfc.fit(X_train_limited, y_train)
print(X_train_limited.shape)
labels = rfc.classes_ #[1, 2, 3], i.e. low, medium, high

predictions = rfc.predict_proba(X_test_limited)
print(predictions.shape)

#reorganize to high, medium, low; prediction columns follow rfc.classes_
final_table = pd.DataFrame({'high': predictions[:,2],
                            'medium': predictions[:,1],
                            'low': predictions[:,0]},
                           index=X_test_limited.index)[['high','medium','low']]

print(final_table.head())
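
Turning the probability table into a submission file could look like the sketch below. It assumes the submission expects a `listing_id` column followed by `high`, `medium`, and `low`; the listing IDs and probabilities are hypothetical, and the output is written to an in-memory buffer rather than a real file.

```python
import io
import pandas as pd

# Hypothetical probabilities for three listings; columns follow
# classes_ = [1, 2, 3], i.e. low, medium, high
listing_ids = [101, 102, 103]
predictions = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
class_names = {1: "low", 2: "medium", 3: "high"}
classes = [1, 2, 3]

# Name the columns from the class order, then reorder to high, medium, low
submission = pd.DataFrame(
    predictions,
    columns=[class_names[c] for c in classes],
    index=pd.Index(listing_ids, name="listing_id"),
)[["high", "medium", "low"]]

# Write to a buffer here; a real run would pass a file path to to_csv
buffer = io.StringIO()
submission.to_csv(buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Naming the columns from the classifier's class order, instead of hard-coding column positions, guards against accidentally swapping the high and low probabilities.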

## Evaluation

How well does the model perform?
- Accuracy
- ROC curves
- Cross-validation
- Other metrics and performance measures
- A/B test results (if any)


### Evaluating on training and dev sets

In [None]:
def accuracy_report(name, X_train, y_train, X_dev, y_dev):
    #name labels the report, e.g. the feature set being evaluated

    #RFC
    rfc = RandomForestClassifier(n_estimators=20, n_jobs=-1) #-1 means use all available cores
    rfc.fit(X_train, y_train)
    dev_pred = rfc.predict(X_dev)
    print('%s RF Training Accuracy: %.2f%% \t Dev Accuracy: %.2f%%' % (
        name,
        rfc.score(X_train, y_train)*100,
        rfc.score(X_dev, y_dev)*100))

    print('RFC dev: \n{}'.format(classification_report(y_dev, dev_pred)))
    #counts of each predicted class (replaces scipy's deprecated itemfreq)
    print(np.unique(dev_pred, return_counts=True))

    importances = rfc.feature_importances_
    features = X_train.columns

    #features sorted from most to least important
    sort_indices = np.argsort(importances)[::-1]
    sorted_features = [features[i] for i in sort_indices]

    print('\nfeatures')
    print(sorted_features)
    print(importances[sort_indices])
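
Because the competition is scored on log loss rather than accuracy, it may also help to cross-validate on that metric directly. The sketch below does this with scikit-learn's `neg_log_loss` scorer on synthetic data from `make_classification`; the data, the five folds, and the fixed `random_state` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic three-class problem standing in for the rental features
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=1)

rfc = RandomForestClassifier(n_estimators=20, random_state=1)

# scikit-learn maximizes scores, so log loss is reported as a negative number
scores = cross_val_score(rfc, X, y, cv=5, scoring="neg_log_loss")
mean_log_loss = -scores.mean()
print(mean_log_loss)
```

A small forest can assign zero probability to the true class, which log loss penalizes heavily, so this metric can look much worse than accuracy suggests.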
    


## Submissions 

Scores are the Kaggle multiclass loss (lower is better).

|Submission|Description|Score|Notes|
|---|---|---|---|
|pca_submission02.csv|pca PCA PPPPCCCCAAAAA!|**1.09215**|PCA on all features|
|sumbmission004.csv|using less features|**2.26920**|RFC on top 3 columns of classifier|
|sumbmission003.csv|fixed bugs|**1.22052**|All features RFC|
|sumbmission002.csv|rf on mult features|**1.23316**|All features RFC|
||alternative baseline|**0.75440**|Analysis on text: MNB, as in homework 2|