# Assignment

* Implement a simple approach to group the listings data into clusters of
similar items that can be used as follows: if items A and B are part of the
same cluster, it is reasonable to recommend B to a user who is viewing or has
viewed A. Feel free to use the category information that sellers have used to
classify their listings in any way.


* How do you evaluate the quality of your results? 


* How does it compare to a naive approach that takes random listings from the same category?


* What are possible shortcomings and extensions of your implementation?


* How are newly listed (unseen) listings assigned to your clusters?

# Ideas

* Cluster size: 50 (seems a reasonable size for a pool of items to recommend from)


* Broad groupings
    - L2 categories are large enough to be used as a first filter
    - A separate clustering is run within each L2 category
    

* Similarity features
    - TF-IDF on title+description
    - price
    - geodistance
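One way the three features above could be folded into a single pairwise distance is sketched below. The function name `combined_distance`, the weights, the log-price transform and the max-scaling of each component are all assumptions made for illustration, not choices taken from these notes.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances, haversine_distances


def combined_distance(texts, prices, latlon_deg, weights=(0.6, 0.2, 0.2)):
    """Weighted sum of text, price and geographic distances, each scaled to [0, 1].

    The weights are arbitrary placeholders (hypothetical, to be tuned).
    """
    # Text: cosine distance between TF-IDF vectors of title + description.
    tfidf = TfidfVectorizer(lowercase=True).fit_transform(texts)
    d_text = cosine_distances(tfidf)

    # Price: absolute difference of log-prices, scaled by the largest difference.
    log_price = np.log1p(np.asarray(prices, dtype=float))
    d_price = np.abs(log_price[:, None] - log_price[None, :])
    if d_price.max() > 0:
        d_price = d_price / d_price.max()

    # Geography: great-circle distance (haversine expects radians), scaled likewise.
    d_geo = haversine_distances(np.radians(np.asarray(latlon_deg, dtype=float)))
    if d_geo.max() > 0:
        d_geo = d_geo / d_geo.max()

    w_text, w_price, w_geo = weights
    dist = w_text * d_text + w_price * d_price + w_geo * d_geo
    np.fill_diagonal(dist, 0.0)
    return dist
```

The result is a square matrix that any algorithm accepting precomputed distances can consume.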


* Clustering algorithm
    - Agglomerative
        - admits custom distance
        - lets us set the desired n_clusters
        
        This looks like the best fit among the algorithms compared at http://scikit-learn.org/stable/modules/clustering.html
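A minimal sketch of agglomerative clustering on a precomputed distance matrix follows. It uses SciPy's hierarchical clustering rather than scikit-learn's `AgglomerativeClustering`, because the keyword for precomputed distances in the latter differs between scikit-learn versions. The helper name `cluster_precomputed`, the average linkage and the way `n_clusters` is derived from a target cluster size are assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def cluster_precomputed(dist, target_size=50):
    """Average-linkage agglomerative clustering on a square distance matrix."""
    n_items = dist.shape[0]
    # Derive the number of clusters from the desired average cluster size.
    n_clusters = max(1, int(np.ceil(n_items / target_size)))
    # linkage() expects the condensed (upper-triangle) form of the matrix.
    condensed = squareform(dist, checks=False)
    tree = linkage(condensed, method="average")
    # Cut the tree so that at most n_clusters flat clusters remain.
    return fcluster(tree, t=n_clusters, criterion="maxclust")
```

Cutting the tree this way fixes the number of clusters, not their sizes, so individual clusters can still end up much larger or smaller than the target.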


* Processing
    - lemmatization, lowercase


* New items
    - Update the distance matrix with the distances from the new item to the existing ones
    - Use KNN to assign the new item to an existing cluster
    - Refit the clusters only occasionally, once enough new items have accumulated
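The KNN assignment could look roughly like this, assuming the same custom distance is available both among existing items and from each new item to every existing one. The function name `assign_new_items` and the default `k` are hypothetical.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier


def assign_new_items(train_dist, train_labels, new_to_train_dist, k=5):
    """Assign each new item the majority cluster among its k nearest existing items.

    train_dist:        (n_train, n_train) distances among existing items
    new_to_train_dist: (n_new, n_train) distances from new items to existing ones
    """
    k = min(k, len(train_labels))
    knn = KNeighborsClassifier(n_neighbors=k, metric="precomputed")
    knn.fit(train_dist, train_labels)
    return knn.predict(new_to_train_dist)
```

Only one new row of distances per item is needed, so the existing clusters stay untouched until the next full refit.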


* Evaluation
    - Ideally, an A/B test where we track changes in metrics like clicks, conversion, revenue
    - For now
        - explore sample
        - some cluster quality metric?
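As one candidate for a cluster quality metric, the silhouette score can be computed directly on a precomputed distance matrix and compared against the same labels shuffled at random. Shuffling keeps the cluster sizes and stands in for picking random listings from the same category, assuming the matrix covers a single category. The helper name `silhouette_vs_random` is an assumption.

```python
import numpy as np
from sklearn.metrics import silhouette_score


def silhouette_vs_random(dist, labels, seed=0):
    """Return (silhouette of the clustering, silhouette of shuffled labels)."""
    rng = np.random.default_rng(seed)
    # Shuffling preserves cluster sizes but destroys any real structure.
    shuffled = rng.permutation(labels)
    clustered = silhouette_score(dist, labels, metric="precomputed")
    baseline = silhouette_score(dist, shuffled, metric="precomputed")
    return clustered, baseline
```

A clustering that is no better than the shuffled baseline would suggest the features add nothing beyond the category itself.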


* Main shortcomings and possible solutions
    - AgglomerativeClustering doesn't scale well to larger data unless a connectivity matrix is provided. K-Means is more scalable (even more so with MiniBatch K-Means), but it is limited to Euclidean distance and doesn't accept a custom distance metric. There is a variant (K-medoids) that we could use in this case.
    - TF-IDF captures too strict a notion of similarity: two texts must share many tokens to be considered similar, and there is no notion of semantic relatedness between tokens. We could amend this by fitting an LDA model and transforming the TF-IDF vectors into vectors of LDA topic scores. Another option, given enough data, would be to fit a word2vec model and then use Word Mover's Distance to measure similarity between texts.
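The LDA idea could be sketched as follows, assuming scikit-learn and raw term counts as input (the usual input for its LDA implementation, instead of TF-IDF weights). The function name `lda_topic_vectors` and the default number of topics are hypothetical.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer


def lda_topic_vectors(texts, n_topics=20, seed=0):
    """Map each text to a vector of topic proportions (each row sums to 1)."""
    # LDA models word counts, so a plain bag-of-words matrix is used here.
    counts = CountVectorizer(lowercase=True).fit_transform(texts)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    return lda.fit_transform(counts)
```

Two listings with no tokens in common can still end up close in topic space if their words tend to co-occur across the corpus.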

    

* Extensions and improvements
    - Weight the TF-IDF scores of terms by search keyword popularity.
    - Use L3 categories as parent clusters when they are large enough to split, say more than 70 items (mostly cars).
    - Incremental clustering ( IHAC ) https://github.com/frnsys/galaxy
    - Improve similarity computation 
        https://blog.booking.com/k-nearest-neighbours-from-slow-to-fast-thanks-to-math.html
    - feature weights
    - word2vec model
    - Named Entity Recognition and POS tags
    - Retrieval: rank items in cluster by similarity to seed item
    - Parallelize frequency computations (gensim)
    - Use MiniBatch K-Means when running at scale
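The retrieval step from the list above (ranking the items of a cluster by similarity to the seed item) could be sketched like this, assuming a precomputed distance matrix and one cluster label per item. The name `rank_cluster_items` and the `top_n` default are hypothetical.

```python
import numpy as np


def rank_cluster_items(dist, labels, seed_index, top_n=10):
    """Return indices of the seed's cluster mates, closest first."""
    labels = np.asarray(labels)
    # Keep only items in the seed's cluster, excluding the seed itself.
    members = np.flatnonzero(labels == labels[seed_index])
    members = members[members != seed_index]
    # Sort the remaining members by their distance to the seed item.
    order = np.argsort(dist[seed_index, members], kind="stable")
    return members[order][:top_n].tolist()
```

For example, with `labels = [0, 0, 0, 1]` and seed item 0, only items 1 and 2 are candidates, and they are returned in order of increasing distance to item 0.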