
### hints
- multi-labeling
- self normalizing networks
- early stopping
- Calibration buckets
- use word embeddings of the tag titles as an input signal

## Tasks
- Run an analysis on the dataset and provide descriptive statistics for the data
- Perform data cleaning, selection and splitting if needed
- Build one or more models to tag a new feedback with a relevant tag or tags using only feedback text
- Show how you evaluate, compare and improve your models
- Identify if using other features (provided in the dataset) in combination with the feedback text is helpful
- Identify how you would use your model and findings to improve Airbnb's feedback module by writing out specific recommendations


# Airbnb Feedback Labeling

## Problem statement
Airbnb has a platform for collecting feedback from its users (https://www.airbnb.com/help/feedback). Routing each piece of feedback to the team that should handle it is crucial for retaining customers and sustaining organic user growth. Given a feedback statement and some other features, we would like to identify the associated team.

## Understanding the Data


In [1]:
# imports and constants
%matplotlib inline  

import csv
import matplotlib.pyplot as plt
import numpy as np
import re

### Load the dataset

In [2]:
with open('NLP_TakeHome_Feedbacks.csv', 'rb') as csvfile:
    csvReader = csv.reader(csvfile, delimiter=',', quotechar='"')
    dataList = list(csvReader)

reqUserAgentList = []
countryList = []
isAppOrWebList = []
deviceTypeList = []
dsList = []
tagsList = []
feedbackList = []

for i in range(1, len(dataList)):
    # skip record id field
    reqUserAgentList.append(dataList[i][1])
    countryList.append(dataList[i][2])
    isAppOrWebList.append(dataList[i][3])
    deviceTypeList.append(dataList[i][4])
    dsList.append(dataList[i][5])
    tagsList.append(dataList[i][6])
    feedbackList.append(dataList[i][7])

dataListLength = len(dataList) - 1 # skipped header row
print('There is a total of ' + str(dataListLength) + ' records in this dataset.')

There is a total of 8949 records in this dataset.
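
For the descriptive statistics, a natural first step is to tabulate value counts for the categorical columns loaded above. The following is only a minimal sketch: `summarizeColumn` is a hypothetical helper, and the sample values are made up for illustration rather than taken from the dataset.

```python
from collections import Counter

def summarizeColumn(values, topN=3):
    # Return the topN most common values with their counts and share of all records
    counts = Counter(values)
    total = float(len(values))
    return [(value, count, count / total) for value, count in counts.most_common(topN)]

# Illustrative values only; in the notebook this could be deviceTypeList or countryList
sampleDeviceTypes = ['iPhone', 'Desktop', 'Desktop', 'Android', 'Desktop', 'iPhone']
for value, count, share in summarizeColumn(sampleDeviceTypes):
    print('%-10s %3d  %5.1f%%' % (value, count, 100 * share))
```

The same call could be repeated for each of the column lists to spot dominant categories and rare values before modeling.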


### Identify types of tags

In [3]:
def getTagTypes(tagsList):
    # Count how often each tag occurs; a record may carry several tags separated by ' | '
    tagTypes = {}
    for s in tagsList:
        for k in s.split(' | '):
            tagTypes[k] = tagTypes.get(k, 0) + 1
    return tagTypes

tagTypes = getTagTypes(tagsList)

validTagTypes = {}
outlierTagTypes = {}
tagCountCap = 10

for s in tagTypes:
    if tagTypes[s] > tagCountCap:
        validTagTypes[s] = tagTypes[s]
    else:
        outlierTagTypes[s] = tagTypes[s]

outlierTagsRatio = 1 - len(validTagTypes)/float(len(tagTypes))
print('Altogether, there are %d types of tags!' % len(tagTypes))
print('%0.2f%% of all the tag types have at most %d instances, and the remaining %0.2f%% of tag \
types have more than %d instances' % 
      (100 * outlierTagsRatio, tagCountCap, 100 * (1 - outlierTagsRatio), tagCountCap))  

print('')
print('For the scope of this project, to make computations tractable, we focus on tag types with \
more than %d instances. The rest of the tags are converted to type UNKNOWN. \
The tag types with more than %d instances are as follows:' % 
      (tagCountCap, tagCountCap))
print('')

# Convert outlier tags to UNKNOWN
UNKNOWN_TAG_TYPE = 'UNKNOWN_TAG_TYPE'

cleanedTagsList = []
for s in tagsList:
    # Replace each rare tag with the UNKNOWN placeholder and keep frequent tags as they are
    cleanedTags = [UNKNOWN_TAG_TYPE if k in outlierTagTypes else k
                   for k in s.split(' | ')]
    cleanedTagsList.append(' | '.join(cleanedTags))

validTagTypes = None
outlierTagTypes = None

# print(getTagTypes(tagsList))
print(getTagTypes(cleanedTagsList))

Altogether, there are 621 types of tags!
90.02% of all the tag types have at most 10 instances, and the remaining 9.98% of tag types have more than 10 instances

For the scope of this project, to make computations tractable, we focus on tag types with more than 10 instances. The rest of the tags are converted to type UNKNOWN. The tag types with more than 10 instances are as follows:

{'Block calendar': 71, 'Calendar availability': 238, 'Guest cancellation': 224, 'Terms and conditions': 146, 'Review': 771, 'Manage listing': 228, 'Finding a place': 29, 'Damage': 208, 'Managing my calendar / Calendar settings': 17, 'Co-hosting': 45, 'Paying for my trip': 21, 'Other': 18, 'Location': 160, 'Special offer': 99, 'Multi-calendar': 46, 'My account or profile': 27, 'Mobile app': 730, 'Booking': 1003, 'Activate listing': 191, 'Search results': 254, 'Sharing': 143, 'Nightly price': 159, 'Getting paid': 27, 'Guest onboarding': 126, 'Calendar sync': 44, 'Host cancellation': 457, 'Cancel requ
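
Because a record can carry several tags, the cleaned tag strings lend themselves to a multi-label (multi-hot) target encoding. The sketch below shows one possible way to build it in plain Python; `buildTagIndex` and `toMultiHot` are hypothetical helpers, and the sample records are illustrative, not rows from the dataset.

```python
def buildTagIndex(tagsList):
    # Map every distinct tag to a column index, sorted for reproducibility
    tags = sorted({tag for s in tagsList for tag in s.split(' | ')})
    return {tag: i for i, tag in enumerate(tags)}

def toMultiHot(tagsList, tagIndex):
    # One row per record, one column per tag; 1 marks that the tag is present
    labels = []
    for s in tagsList:
        row = [0] * len(tagIndex)
        for tag in s.split(' | '):
            row[tagIndex[tag]] = 1
        labels.append(row)
    return labels

# Illustrative records in the same ' | '-separated format as cleanedTagsList
sampleTags = ['Booking | Review', 'Mobile app', 'Booking | UNKNOWN_TAG_TYPE']
tagIndex = buildTagIndex(sampleTags)
labels = toMultiHot(sampleTags, tagIndex)
print(sorted(tagIndex, key=tagIndex.get))
print(labels)
```

With targets in this form, each output unit of a model can be trained as an independent yes/no decision per tag, which is the usual setup for multi-labeling.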

### Identify some characteristics of feedbacks
I noticed that sometimes:
- Feedback can be empty
- Feedback can contain unusual placeholders. Since I have not been told how to handle them, I use their counts to decide: only the following are treated as valid tokens, and all other placeholders are cast to type *UNKNOWN_TOKEN*:
  - NAME
  - PHONE
  - EMAIL 
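
The placeholder rule above could be applied with a single regular-expression substitution. This is a sketch under assumptions: `normalizePlaceholders` is a hypothetical helper, and representing placeholders as bare uppercase tokens in the text is my choice for illustration, not something the dataset prescribes.

```python
import re

VALID_PLACEHOLDERS = {'NAME', 'PHONE', 'EMAIL'}
UNKNOWN_TOKEN = 'UNKNOWN_TOKEN'
placeholderRegex = re.compile(r'\{\{([^\}]*)\}\}')

def normalizePlaceholders(feedback):
    # Keep NAME, PHONE and EMAIL as bare tokens; map every other placeholder
    # (including compound ones such as URL+NAME) to UNKNOWN_TOKEN
    def replace(match):
        kind = match.group(1)
        return kind if kind in VALID_PLACEHOLDERS else UNKNOWN_TOKEN
    return placeholderRegex.sub(replace, feedback)

print(normalizePlaceholders('Call {{NAME}} at {{PHONE}} or see {{URL+NAME}}'))
# prints: Call NAME at PHONE or see UNKNOWN_TOKEN
```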

In [4]:
# identify types of placeholders
placeholders = {}
regex = re.compile(r'\{\{([^\}]*)\}\}')
for s in feedbackList:
    # Count every {{...}} placeholder found in the feedback text
    for k in regex.findall(s):
        placeholders[k] = placeholders.get(k, 0) + 1

print('Complete list of placeholders and their respective counts:\n')
print(placeholders)

print('')
print('A record with no feedback:')
print(feedbackList[8939]) # no feedback
print('A strange feedback with various types of placeholders:')
print(feedbackList[5828])

Complete list of placeholders and their respective counts:

{'NAME+SKYPE': 1, 'URL+SKYPE+SKYPE': 2, 'PHONE+SSN': 1, 'URL+NAME+NAME+NAME': 1, 'URL': 192, 'NAME+EMAIL': 7, 'URL+NAME': 3, 'URL+NAME+NAME+NAME+NAME': 1, 'SKYPE': 3, 'PHONE': 314, 'SSN': 1, 'CREDENTIAL+EMAIL+NAME+NAME+PHONE': 1, 'EMAIL+URL': 1, 'URL+SKYPE': 1, 'EMAIL+PHONE': 3, 'EMAIL': 382, 'NAME': 18723}

A record with no feedback:

A strange feedback with various types of placeholders:
Colleagues,   I can't find a method for calling you so I'll try here. You were very helpful last {{NAME}} and maybe you can pull it off again.  I have a 43-foot boat in a slip in {{NAME}} {{NAME}}. We had a very nice response to putting a listing up, but we've now learned from the marina that the {{NAME}} {{NAME}} of {{NAME}} {{NAME}} has forbidden AirBnB boat rentals anywhere in the waters of {{NAME}} {{NAME}} {{NAME}}.   Hmm. {{NAME}}. Do you have any advice?  Thanks, {{NAME}} ____________________________________ {{NAME}} {{NAME}}, MD, 

### Tokenize feedback texts and create word embedding arrays

In [5]:
# load word embeddings
wordsList = np.load('wordsList.npy')
print('Loaded the word list!')

wordVectors = np.load('wordVectors.npy')
print ('Loaded the word vectors!')

wordsList = wordsList.tolist()  # originally loaded as a numpy array
wordsList = [word.decode('UTF-8') for word in wordsList]  # decode UTF-8 bytes to text

print('Word embeddings word count: %d' % len(wordsList))
print('Word embeddings dimensions: ' + str(wordVectors.shape))

Loaded the word list!
Loaded the word vectors!
Word embeddings word count: 400000
Word embeddings dimensions: (400000, 50)
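
With the vocabulary loaded, each feedback text can be turned into a fixed-length sequence of vocabulary indices, which in turn select rows of the embedding matrix. The sketch below uses a toy vocabulary in place of `wordsList`; the helpers `tokenize` and `toIndexSequence`, the regex-based tokenization, and the reserved padding and unknown-word indices are all assumptions for illustration.

```python
import re

def tokenize(text):
    # Lowercase and keep runs of letters, digits and apostrophes as tokens
    return re.findall(r"[a-z0-9']+", text.lower())

def toIndexSequence(text, wordToIndex, unknownIndex, maxLength, padIndex=0):
    # Map tokens to vocabulary indices, then truncate or right-pad to maxLength
    ids = [wordToIndex.get(token, unknownIndex) for token in tokenize(text)]
    ids = ids[:maxLength]
    return ids + [padIndex] * (maxLength - len(ids))

# Toy vocabulary standing in for wordsList; index 0 is reserved for padding here
toyWordsList = ['<pad>', '<unk>', 'the', 'app', 'crashes', 'on', 'login']
wordToIndex = {word: i for i, word in enumerate(toyWordsList)}

print(toIndexSequence('The app crashes on startup', wordToIndex,
                      unknownIndex=1, maxLength=8))
# prints: [2, 3, 4, 5, 1, 0, 0, 0]
```

Building `wordToIndex` as a dictionary once avoids a linear `wordsList.index(...)` search per token, which matters with 400,000 vocabulary entries.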


In [None]:
feedbackLegnthList = [len(s) for s in feedbackList]
print('Histogram of feedback lengths:')

fig, ax = plt.subplots()
plt.hist(feedbackLegnthList, bins=500)
plt.title('Feedback Lengths List Histogram')
plt.xlabel('Feedback Length')
plt.ylabel('Frequency')


fig, ax = plt.subplots()
plt.hist(feedbackLegnthList, bins=500)
plt.title('Feedback Lengths List Histogram at Logarithmic Scale')
ax.set_yscale('log')
plt.xlabel('Feedback Length')
plt.ylabel('Frequency')

outlierCap = 2000
# Kept as a float so the percentage below is not truncated by Python 2 integer division
outlierFeedbackLengthCount = float(sum(1 for s in feedbackLegnthList if s > outlierCap))

print('Max feedback length is %d and only %d samples have length more than %d (roughly %0.2f%%).' % 
      (max(feedbackLegnthList), outlierFeedbackLengthCount, outlierCap, outlierFeedbackLengthCount/dataListLength * 100))

In [None]:
# Keep lengths at or below the cap, so that every sample not counted as an outlier is included
feedbackLegnthListNoOutlier = [s for s in feedbackLegnthList if s <= outlierCap]

fig, ax = plt.subplots()
plt.hist(feedbackLegnthListNoOutlier, bins=500)
plt.title('Feedback Lengths List Histogram with No Outliers')
plt.xlabel('Feedback Length')
plt.ylabel('Frequency')

meanFeedbackLengthWithNoOutliers = sum(feedbackLegnthListNoOutlier) / float(len(feedbackLegnthListNoOutlier))
print('After removing these outliers, the mean feedback length is %0.2f' %
      meanFeedbackLengthWithNoOutliers)

lstmLength = int(round(meanFeedbackLengthWithNoOutliers))
print('To make this project computationally tractable,'+
      ' we take a maximum sequence length of %d to be used as LSTM unrolling length.' %
      lstmLength)
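
Note that the lengths above are measured in characters, whereas an LSTM is unrolled over tokens. One possible refinement, sketched here under the assumption of simple whitespace tokenization, is to derive the sequence length from token counts, for example from a chosen percentile. `tokenLengthCap` is a hypothetical helper, and the sample texts are illustrative.

```python
def tokenLengthCap(texts, percentile=0.95):
    # Sort token counts and pick the value at the requested percentile position
    lengths = sorted(len(text.split()) for text in texts)
    if not lengths:
        return 0
    position = int(percentile * (len(lengths) - 1))
    return lengths[position]

sampleFeedbacks = ['great app', 'cannot sync my calendar', '',
                   'the search results page keeps reloading on my phone']
print(tokenLengthCap(sampleFeedbacks, percentile=0.5))
```

A percentile-based cap keeps most feedback intact while still bounding the unrolling length; the exact percentile is a trade-off between truncation and compute.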

In [None]:

# identify longest feedback (the first one wins on ties);
# feedbackLegnthList is already populated above, so it is not appended to again
longestFeedback = max(feedbackList, key=len)
longestFeedbackLength = len(longestFeedback)


In [None]:
# Scratch example: bar chart of value counts built from a Counter
from collections import Counter
import pandas

a = ['a', 'a', 'a', 'a', 'b', 'b', 'c', 'c', 'c', 'd', 'e', 'e', 'e', 'e', 'e']
letter_counts = Counter(a)
df = pandas.DataFrame.from_dict(letter_counts, orient='index')
df.plot(kind='bar')