### Below I build three music recommendation products: the interactive recommender, the text-based recommender, and the musical mashup. You can play with all three at http://michaelaaroncantrell.pythonanywhere.com/

#### The data for this project was obtained courtesy of J. McAuley, C. Targett, J. Shi, and A. van den Hengel. The data can be found at http://jmcauley.ucsd.edu/data/amazon/links.html

# <a class="anchor" id="Table-of-Contents"> Table of Contents </a>
* [Import Data](#Import-Data)
* [Clean Data](#Clean-Data)
* [Collaborative Recommender](#Collaborative-Recommender)
* [NLP Recommender](#NLP-Recommender)
* [Interactive Recommender](#Interactive-Recommender)
* [Text Based Recommender](#Text-Based-Recommender)
* [Musical Mashup](#Musical-Mashup)

# <a class="anchor" id="Import-Data"> Import Data </a>
#### [Table of Contents](#Table-of-Contents) 

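### I import the data from a MongoDB database. Two collections are of interest: `cd` (the reviews) and `cd_meta` (the album metadata). Only reviews posted from 2005 onward are loaded.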

In [10]:
from pymongo import MongoClient
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize
from sklearn.decomposition import NMF
from sklearn.decomposition import TruncatedSVD
import nltk
from nltk.tokenize import PunktSentenceTokenizer
from textblob import TextBlob
from collections import defaultdict
import pickle

In [2]:
%%time
client = MongoClient('mongodb://localhost:27017/')
db=client['amazon_music']

cd_meta = db.cd_meta
cd = db.cd

meta_cursor = cd_meta.find({ });
cd_cursor = cd.find( { "unixReviewTime": { "$gt": 1104550954} } );

df_meta =  pd.DataFrame(list(meta_cursor))
df_cd =  pd.DataFrame(list(cd_cursor))

CPU times: user 52.4 s, sys: 3.53 s, total: 55.9 s
Wall time: 1min 26s
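The cutoff in the `unixReviewTime` filter above is a Unix timestamp in seconds. As a quick sanity check (a small sketch using only the standard library, not part of the original pipeline), it can be decoded to a calendar date:

```python
from datetime import datetime, timezone

# Cutoff used in the `$gt` filter on `unixReviewTime` above.
CUTOFF = 1104550954

# Interpret the timestamp as seconds since the epoch, in UTC.
cutoff_utc = datetime.fromtimestamp(CUTOFF, tz=timezone.utc)
print(cutoff_utc.isoformat())  # 2005-01-01T03:42:34+00:00
```

So the query keeps reviews written after the first hours of January 1, 2005.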


In [3]:
print(df_cd.shape[0])
df_cd.head()

2672618


Unnamed: 0,_id,asin,helpful,overall,reviewText,reviewTime,reviewerID,reviewerName,summary,unixReviewTime
0,58af6aa9cd39546919fc73e5,1393774,"[0, 0]",5.0,fantastic. old time religion is good for me. t...,"08 31, 2013",A9DMTMLFR9CO5,Albert Luguterah,i love it,1377907200
1,58af6aa9cd39546919fc73e6,1393774,"[0, 0]",5.0,I HAD THE ALBUM FOR YEARS AGO ....AND I AM VER...,"07 2, 2013",AHG1GTQZUYNJN,CAROLYNE CHAMBERLAIN,PURE JOY!,1372723200
2,58af6aa9cd39546919fc73e7,1393774,"[0, 0]",5.0,Pure praise to throne room. He had a unique st...,"04 2, 2014",A2TFO7NREP2B2D,cindy terpening_smith,pure,1396396800
3,58af6aa9cd39546919fc73e8,1393774,"[0, 0]",5.0,I have always loved Keith Green's music and ha...,"02 15, 2014",A2YAPAG1IPNK7K,diane tousley,Love this CD!,1392422400
4,58af6aa9cd39546919fc73e9,1393774,"[13, 15]",5.0,Keith Green had a passionate love for Jesus. ...,"11 1, 2005",AEKGGV851HY3K,D. MILLS,Passionate Faith Is Contagious,1130803200


In [4]:
df_meta.head()

Unnamed: 0,_id,asin,brand,categories,description,imUrl,price,related,salesRank,title
0,58af5fb3cd39546919f72c04,1501348,,"[[CDs & Vinyl, Christian, Pop & Contemporary],...","Lenny LeBlanc, Alex Acuna, Justo Almario, Tom ...",http://ecx.images-amazon.com/images/I/412JH6CM...,,"{'also_bought': ['6303646611', 'B000002C45', '...",{'Movies & TV': 359265},Lift Him Up With Ron Kenoly [VHS]
1,58af5fb3cd39546919f72c05,1393774,,"[[CDs & Vinyl, Christian]]",Audio CD,http://ecx.images-amazon.com/images/I/51MC7A5N...,16.64,"{'also_bought': ['B0016CP2GS', 'B0000275QQ', '...",{'Music': 41017},Songs for the Shepherd
2,58af5fb3cd39546919f72c06,5123909,,"[[CDs & Vinyl, Children's Music], [Movies & TV...",18 Music Videos for Kids: Do Your Ears Hang Lo...,http://ecx.images-amazon.com/images/I/41K31EWE...,29.98,"{'also_bought': ['B00000JLTM', 'B00006L97L', '...",{'Movies & TV': 451209},Silly Songs: 18 Wholesome Fun Songs for Kids [...
3,58af5fb3cd39546919f72c07,5072298,,"[[CDs & Vinyl, Children's Music], [CDs & Vinyl...",,http://ecx.images-amazon.com/images/I/510RRJWQ...,6.26,"{'also_viewed': ['B00000DPLL', 'B000008UPG', '...",{'Music': 350804},Hymns: 16 Classic Hymns for Children
4,58af5fb3cd39546919f72c08,5224896,,"[[CDs & Vinyl, Christian, Praise & Worship]]",,http://ecx.images-amazon.com/images/I/51SS0SRM...,8.99,"{'also_bought': ['B001EMSQOK', 'B001EMQ6H4', '...",{'Music': 347825},"Voice of the Wind: Personal Worship, Vol. 1"


# <a class="anchor" id="Clean-Data"> Clean Data </a>
#### [Table of Contents](#Table-of-Contents) 

### A few genres aren't really 'music', so I exclude them. Also, for whatever reason, VHS tapes are included under the heading "CDs & Vinyl"; I exclude those, too. Finally, to keep the data from being too sparse, I only keep reviews from users who have reviewed at least 5 albums, and then only albums that were reviewed at least 20 times.
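To make the review-count thresholds concrete, here is a minimal sketch on toy data, using plain Python instead of pandas. The function name `filter_reviews` and the toy reviewer and album ids are hypothetical; the order of the two filters (reviewers first, then albums) mirrors `subset_on_reviews` below.

```python
from collections import Counter

def filter_reviews(reviews, min_album_reviews=20, min_reviewer_reviews=5):
    """Keep (reviewer, album) pairs from active reviewers, then from well-reviewed albums."""
    # First pass: drop reviewers with too few reviews.
    per_reviewer = Counter(reviewer for reviewer, _ in reviews)
    reviews = [(r, a) for r, a in reviews if per_reviewer[r] >= min_reviewer_reviews]
    # Second pass: among what is left, drop albums with too few reviews.
    per_album = Counter(album for _, album in reviews)
    return [(r, a) for r, a in reviews if per_album[a] >= min_album_reviews]

# Toy data with small thresholds so the effect is visible.
toy = [("u1", "a1"), ("u1", "a2"), ("u2", "a1"), ("u2", "a2"), ("u3", "a1")]
kept = filter_reviews(toy, min_album_reviews=2, min_reviewer_reviews=2)
print(kept)  # u3 wrote only one review, so that pair is dropped
```

Because the album filter runs after the reviewer filter, a single pass does not guarantee that every remaining reviewer still meets the reviewer threshold.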

In [5]:
EXCLUDED_GENRES = {'Instructional', 'Special Interest', 'Comedy & Spoken Word', 'Radio Shows'}


def make_cond(row):
    '''True if a metadata row is a music album: not a VHS title and in none of the excluded genres.'''
    item_categories = {cat for sublist in row['categories'] for cat in sublist}
    return ('[VHS]' not in row['title']) and item_categories.isdisjoint(EXCLUDED_GENRES)


def genre_and_cd_subset(df_cd, df_meta):
    # Drop rows without a title first, so the '[VHS]' test never sees NaN.
    df_meta = df_meta[df_meta['title'].notnull()]
    # Evaluate the condition on the filtered frame itself rather than on the global df_meta,
    # so the rows tested are the same rows whose asin is kept.
    keep = df_meta.apply(make_cond, axis=1)
    albums_asin = df_meta.loc[keep, 'asin']
    return df_cd[df_cd['asin'].isin(albums_asin)]
    

def subset_on_reviews(df, n=20, m=5):
    num_reviews_per_reviewer = df.groupby('reviewerID').count()['_id'].reset_index()
    reviewer_reviewed_enough = num_reviews_per_reviewer[num_reviews_per_reviewer['_id']>=m]
    reviewer_reviewed_enough_indices = list(reviewer_reviewed_enough['reviewerID'].values)
    df = df[df['reviewerID'].isin(reviewer_reviewed_enough_indices)]
    
    num_reviews_per_album = df.groupby('asin').count()['_id'].reset_index()
    album_reviewed_enough = num_reviews_per_album[num_reviews_per_album['_id']>=n]
    album_reviewed_enough_indices = list(album_reviewed_enough['asin'].values)
    df = df[df['asin'].isin(album_reviewed_enough_indices)]
    return df
    
    
def clean_data(df_cd, df_meta):
    df_meta=df_meta[['_id','asin', 'categories', 'related', 'title']]
    df = genre_and_cd_subset(df_cd, df_meta)
    df = subset_on_reviews(df, 20, 5)
    albums = set(df['asin'].unique())
    album_list = list(albums)
    print('df shape', df.shape, 'number of unique albums', len(album_list))
    return df, df_meta, albums, album_list  

In [6]:
df_reviews, df_meta, albums, album_list = clean_data(df_cd, df_meta)

df shape (177163, 10) number of unique albums 4162


# <a class="anchor" id="Collaborative-Recommender"> Collaborative Recommender </a>
#### [Table of Contents](#Table-of-Contents) 

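### The collaborative recommender is my take on Amazon's classic recommendation system. It builds a sparse album-by-reviewer matrix of ratings, uses dimension reduction (truncated SVD with 100 components) to extract latent structure, and recommends the albums closest in cosine similarity in the reduced space.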
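As a rough illustration of the idea, the sketch below ranks albums by cosine similarity of their raw rating vectors. It deliberately skips the SVD step and uses plain Python, so it is a simplified stand-in rather than the recommender built in this notebook; the album names and ratings are made up.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Rows are albums, columns are reviewers; 0 means "no review".
ratings = {
    "album_a": [5, 4, 0, 0],
    "album_b": [4, 5, 0, 1],
    "album_c": [0, 0, 5, 4],
}

def most_similar(target, ratings):
    """Return (similarity, album) pairs sorted from most to least similar to target."""
    scores = [(cosine(ratings[target], vec), name) for name, vec in ratings.items()]
    return sorted(scores, reverse=True)

ranked = most_similar("album_a", ratings)
print([name for _, name in ranked])  # ['album_a', 'album_b', 'album_c']
```

Albums rated by the same reviewers in a similar way end up close together; the album itself always comes first, which is why it also heads the recommendation lists in this notebook.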

In [38]:
def make_dics(df):
    albums = df['asin'].unique()
    reviewers = df['reviewerID'].unique()
    
    dic_asin = {}
    dic_asin_reverse = {}
    for j, asin in enumerate(albums):
        dic_asin[j]=asin
        dic_asin_reverse[asin] = j
    
    dic_reviewer = {}
    dic_reviewer_reverse = {}
    for j, reviewerID in enumerate(reviewers):
        dic_reviewer[j]=reviewerID
        dic_reviewer_reverse[reviewerID] = j
        
    return dic_asin, dic_asin_reverse, dic_reviewer, dic_reviewer_reverse


def make_collab_recommender(df):
    dic_asin, dic_asin_reverse, dic_reviewer, dic_reviewer_reverse = make_dics(df)
    # Map ids to integer positions without adding columns to the caller's dataframe.
    rows = df['asin'].map(dic_asin_reverse).values
    cols = df['reviewerID'].map(dic_reviewer_reverse).values
    # Sparse album-by-reviewer matrix of star ratings.
    sparse_mat = csr_matrix((df['overall'].values, (rows, cols)))
    # Reduce each album to 100 latent dimensions, then compare albums pairwise.
    svd = TruncatedSVD(n_components=100, random_state=42)
    arr = svd.fit_transform(sparse_mat)
    similarity = cosine_similarity(arr)
    return similarity, dic_asin, dic_asin_reverse


def make_asin_title_dic():
    # Map each album asin to its title.
    return dict(zip(df_meta['asin'], df_meta['title']))


def collab_recommender(asin):
    idx = dic_asin_reverse[asin]
    l = [[simil[idx][i],dic_asin[i]] for i in range(simil.shape[0])]
    l.sort(reverse=True)
    return l

In [None]:
asin_title_dic = make_asin_title_dic()
simil, dic_asin, dic_asin_reverse = make_collab_recommender(df_reviews)

In [45]:
print('The top recommendations for', asin_title_dic['B000002P72'], 'by B.B. King:')
[asin_title_dic[rec[1]] for rec in collab_recommender('B000002P72')[:10]]

The top recommendations for Live at the Regal by B.B. King:


['Live at the Regal',
 'My Favorite Things',
 'Live at the Fillmore',
 "Workingman's Dead",
 'Round About Midnight',
 'Giant Steps',
 'The Fillmore Concerts',
 'Blue Train',
 "This Year's Model",
 'My Aim Is True']

# <a class="anchor" id="NLP-Recommender"> NLP Recommender </a>
#### [Table of Contents](#Table-of-Contents) 

### The NLP recommender is built by taking reviews sentence by sentence. We vectorize the sentences with Count Vectorizer and use NMF to reduce dimension and find 30 latent topics, and simultaneously use sentiment analysis to judge the positivity of each sentence. Multiplying a sentence's topic weights by its sentiment gives a score for the sentence. Taking the mean over all sentences about a given album results in a score for the album in each topic. Finally, each album's score vector is normalized. The recommender works by finding the closest albums in cosine similarity in the dimension-reduced space.

### Below I print the top words in each topic that NMF found to get a feel for what the topic is about. I also print the top 5 and bottom 5 albums in each topic. Six of the thirty topics are removed since they do not seem to be about the music, but rather about price, shipment, etc.
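The scoring described above can be sketched in a few lines of plain Python. This is an illustration only: the topic weights and sentiment values are made-up numbers standing in for the NMF output and the TextBlob polarity, and `album_topic_scores` is a hypothetical helper name.

```python
import math

def album_topic_scores(sentence_topics, sentence_sentiments):
    """Mean over sentences of sentiment * topic weights, scaled to unit length."""
    n_sentences = len(sentence_topics)
    n_topics = len(sentence_topics[0])
    # Per-topic mean of (sentence sentiment) x (sentence topic weight).
    means = [
        sum(s * topics[k] for topics, s in zip(sentence_topics, sentence_sentiments)) / n_sentences
        for k in range(n_topics)
    ]
    # Normalize the album's vector so only its direction matters.
    norm = math.sqrt(sum(m * m for m in means))
    return [m / norm for m in means] if norm else means

# Two sentences about one album: the first praises topic 0, the second criticizes topic 1.
topics = [[0.9, 0.0], [0.0, 0.5]]
sentiments = [0.8, -0.4]
scores = album_topic_scores(topics, sentiments)
print(scores)
```

The album ends up with a positive coordinate in the praised topic and a negative one in the criticized topic, which is what lets the same topic separate albums people loved from albums people disliked.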

In [65]:
def make_corpus_and_table(df):
    sentence_tokenizer = PunktSentenceTokenizer()
    df_sentences = df.groupby('asin')['reviewText'].sum().apply(sentence_tokenizer.tokenize)
    corpus = df_sentences.sum()
    return df_sentences, corpus

    
def make_corpus_sentences_vect_model(df):
    df_sentences, corpus = make_corpus_and_table(df)
    vect = CountVectorizer(ngram_range=(1,2), token_pattern="\\b[a-z][a-z]+\\b", \
                           stop_words='english', max_df = 0.07) 
    counts = vect.fit_transform(corpus)
    model = NMF(n_components=30, init='random', random_state=0)
    model.fit(counts) 
    return corpus, df_sentences, vect, model


def return_top_words(model, feature_names, n_top_words):
    '''Print the n_top_words highest-weighted words for each topic in the fitted model.'''
    for topic_idx, topic in enumerate(model.components_):
        print()
        print("Topic #%d:" % topic_idx)
        print(", ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
        print()


def print_extreme_albums(df):
    for col in df.columns:
        print('Top 5 for Topic {0}:'.format(col))
        print()
        for album in df[col].sort_values().tail().index:
            print(asin_title_dic[album])
        print()
        print('Bottom 5 for {0}'.format(col))
        print()
        for album in df[col].sort_values().head().index:
            print(asin_title_dic[album])
        print()


def make_topics_sentiments(review_list, model, count_vectorizer):
    '''Given a list of review sentences for one album, an NMF model, and the count_vectorizer used to fit
    the model, returns a list with the album's mean sentiment-weighted score in each topic.'''
    l  = []
    review_vec = count_vectorizer.transform(review_list)
    matrix = model.transform(review_vec)
    for row in range(matrix.shape[0]):
        topic_probs = matrix[row]
        sentiment = TextBlob(review_list[row]).sentiment.polarity
        l.append(sentiment*topic_probs)
    
    return [np.array(coord).sum()/matrix.shape[0] for coord in zip(*l)]


def rate_topics(df, model, count_vectorizer):
    '''Given a dataframe with one row per album (columns 'asin' and 'reviewText', where 'reviewText' holds
    the album's list of review sentences), an NMF model, and the count_vectorizer used to fit the model,
    returns a dataframe indexed by album asin, with one column per topic holding the album's scores.'''
    asin_list = df['asin'].unique()
    dic = {}
    for asin in asin_list:
        review_list = list(df[df['asin']==asin]['reviewText'])[0]
        ratings = make_topics_sentiments(review_list, model, count_vectorizer)
        dic[asin] = ratings
    return pd.DataFrame(dic).transpose()


def normalize_and_round(df, amt_scale):
    df_norm = pd.DataFrame(normalize(df, axis=1), columns = df.columns, index = df.index)
    return df_norm.multiply(amt_scale).round(2)


def make_nlp_simil(df, model, count_vectorizer):
    df = df.reset_index()
    df_rate_topics = rate_topics(df, model, count_vectorizer)
    df_ratings = normalize_and_round(df_rate_topics, 100)
    nlp_simil = cosine_similarity(df_ratings)
    return nlp_simil, df_ratings


def nlp_recommender(asin):
    idx = dic_asin_reverse[asin]
    l = [[nlp_simil_topics[idx][i],dic_asin[i]] for i in range(nlp_simil_topics.shape[0])]
    l.sort(reverse=True)
    return l

In [None]:
corpus, df_sentences, vect, model = make_corpus_sentences_vect_model(df_reviews)

In [54]:
return_top_words(model, vect.get_feature_names(), 8)


Topic #0:
jazz, miles, blue, fusion, funk, davis, jazz rock, musicians


Topic #1:
cds, chingy, buy chingy, chingy cds, cds buy, best cds, greatest hits, favorite cds


Topic #2:
stones, rolling, rolling stones, rock roll, horrible, stones horrible, horrible stones, stone


Topic #3:
zeppelin, led, led zeppelin, rocks, zeppelin rocks, rocks led, page, iv


Topic #4:
hip, hop, hip hop, beats, nas, hop album, funk, real hip


Topic #5:
dylan, bob, bob dylan, folk, blonde, career, electric, blood


Topic #6:
riffs, fast, slow, solos, starts, riff, guitar riffs, melodic


Topic #7:
punk, green, green day, american, punk rock, sucks, pop punk, wave


Topic #8:
rap, beats, game, gangsta, hate, dont, west, rap album


Topic #9:
van, halen, van halen, roth, hagar, david, eddie, lee


Topic #10:
guitars, drums, piano, acoustic, lead, electric, acoustic guitar, keyboards


Topic #11:
floyd, pink, pink floyd, dark, waters, moon, gilmour, wall


Topic #12:
hot, singles, billboard, billboard hot, 

In [None]:
nlp_simil, df_ratings = make_nlp_simil(df_sentences, model, vect)
df_ratings_topics = df_ratings.drop([1,18,20,27,28,29], axis=1)
nlp_simil_topics = cosine_similarity(df_ratings_topics) 

In [66]:
print_extreme_albums(df_ratings)

Top 5 for Topic 0:

A Love Supreme
Bitches Brew
Birth of the Cool
Kind of Blue
The Shape of Jazz to Come

Bottom 5 for 0

In a Metal Mood: No More Mr Nice Guy
Worlds Apart (Collectors Edition CD + DVD)
Greatest Hits 1970-1978
The Gold Experience
Famous Last Words

Top 5 for Topic 1:

Bee Gees - Record: Their Greatest Hits
Smells Like Children
This Left Feels Right
Hoodstar
Jackpot

Bottom 5 for 1

Kidz Bop 8
100,000,000 Bon Jovi Fans Can't Be Wrong
Stories &amp; Alibis
Greatest Hitz
Down for Life

Top 5 for Topic 2:

Big Hits (High Tide and Green Grass)
12 X 5
Jump Back: The Best of the Rolling Stones
Aftermath
Steel Wheels

Bottom 5 for 2

On Top of Our Game
Corey Clark
I Monarch
If Only You Were Lonely

Top 5 for Topic 3:

Led Zeppelin III
Led Zeppelin 1
Physical Graffiti
Houses Of The Holy
Coda

Bottom 5 for 3

Firm
Straight Up
Cyclorama
Very Best &amp; Beyond

Top 5 for Topic 4:

Ironman
Hip Hop Is Dead
MM..Food
The Listening
Enta Da Stage

Bottom 5 for 4

Savage Life
Certified
Pow

In [71]:
print('Top NLP recommendations for', asin_title_dic['B000002P72'], 'by B.B. King:')
[asin_title_dic[rec[1]] for rec in nlp_recommender('B000002P72')][:10]

Top NLP recommendations for Live at the Regal by B.B. King:


['Live at the Regal',
 'Kicking Television: Live in Chicago',
 'Aladdin Sane',
 "Honkin' On Bobo",
 'Otis Blue',
 'Be Not Nobody',
 'Robert Johnson: The Complete Recordings',
 'The Stone Roses',
 'Waiting for Columbus',
 'Blind Faith - London Hyde Park 1969']

# <a class="anchor" id="Interactive-Recommender"> Interactive Recommender </a>
#### [Table of Contents](#Table-of-Contents) 

### The final recommendation engine starts from the full list of NLP recommendations. It takes the fraction $P$ passed to the recommender function and keeps only those NLP recommendations that rank in the top fraction $P$ of the collaborative recommendations (plus a buffer of five albums). Thus, as we see below, when $P=1$ we get the NLP recommender back, at $P=0.5$ we get a mixture, and as $P \to 0$ the recommendations approach the collaborative filter. On the website, the user can toggle $P$.
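The filtering step can be illustrated on toy rankings. The sketch below is a simplified stand-in for the `recommender` function: the name `blend`, the parameters `n_recs` and `buffer`, and the single-letter album ids are all hypothetical.

```python
def blend(nlp_ranked, collab_ranked, p, n_recs=3, buffer=3):
    """Walk the NLP ranking, keeping items inside the top p fraction (plus a buffer)
    of the collaborative ranking."""
    cutoff = int(len(collab_ranked) * p) + buffer
    allowed = set(collab_ranked[:cutoff])
    return [item for item in nlp_ranked if item in allowed][:n_recs]

# Two rankings of the same eight albums, best first.
collab_ranked = ["a", "b", "c", "d", "e", "f", "g", "h"]
nlp_ranked = ["a", "h", "e", "g", "c", "b", "d", "f"]

print(blend(nlp_ranked, collab_ranked, p=1.0))  # ['a', 'h', 'e']  pure NLP order
print(blend(nlp_ranked, collab_ranked, p=0.5))  # ['a', 'e', 'g']  a mixture
print(blend(nlp_ranked, collab_ranked, p=0.0))  # ['a', 'c', 'b']  only the collaborative top 3 survive
```

At `p=0` the surviving albums are exactly the collaborative top picks, although they are still listed in NLP order.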

In [57]:
def recommender(asin, perc):
    '''Given an album asin and a float perc in [0, 1], return the first 6 NLP recommendations
    (typically starting with the album itself) that rank in the top perc fraction, plus a
    buffer of 5, of the collaborative recommendations.'''

    nlp_recs = nlp_recommender(asin)
    collab_recs = collab_recommender(asin)
    num_albums = len(collab_recs)
    cut_off = int(num_albums*perc)
    # A set makes the membership test in the loop fast.
    top_perc_albums = {entry[1] for entry in collab_recs[:cut_off+5]}

    recs = []
    # Stop early if the NLP list runs out, so pop() is never called on an empty list.
    while len(recs) < 6 and nlp_recs:
        next_rec = nlp_recs.pop(0)[1]
        if next_rec in top_perc_albums:
            recs.append(next_rec)

    return recs

In [74]:
[asin_title_dic[rec] for rec in recommender('B000002P72', 1)]

['Live at the Regal',
 'Kicking Television: Live in Chicago',
 'Aladdin Sane',
 "Honkin' On Bobo",
 'Otis Blue',
 'Be Not Nobody']

In [75]:
[asin_title_dic[rec] for rec in recommender('B000002P72', 0.5)]

['Live at the Regal',
 'Kicking Television: Live in Chicago',
 'Aladdin Sane',
 'Otis Blue',
 'Robert Johnson: The Complete Recordings',
 'The Stone Roses']

In [95]:
[asin_title_dic[rec] for rec in recommender('B000002P72',0)]

['Live at the Regal',
 'The Fillmore Concerts',
 "Workingman's Dead",
 'Live at the Fillmore',
 'Round About Midnight',
 'Blue Train']

# <a class="anchor" id="Text-Based-Recommender"> Text Based Recommender </a>
#### [Table of Contents](#Table-of-Contents) 

### The text-based recommender takes a user's textual request, scores it the same way album reviews are scored, and finds the albums whose reviews describe them most similarly. Notice that only the NLP recommender structure is being used here (not the collaborative).

In [96]:
def text_to_album(text):
    sentence_tokenizer = PunktSentenceTokenizer()
    sentences = sentence_tokenizer.tokenize(text)
    desired_album = make_topics_sentiments(sentences, model, vect)
    df_desired = pd.DataFrame(desired_album).transpose()
    df_desired = normalize_and_round(df_desired, 100)
    comparisons = cosine_similarity(df_ratings,df_desired)
    return [dic_asin[idx] for idx in np.array([x[0] for x in comparisons]).argsort()[-5:]][::-1]

In [103]:
text = "childish innocent music. something with fun lyrics"
[asin_title_dic[rec] for rec in text_to_album(text)]

['Strange Little Girls',
 'Trash',
 'Portrait of an American Family',
 'Yield',
 'Lost Dogs']

# <a class="anchor" id="Musical-Mashup"> Musical Mashup </a>
#### [Table of Contents](#Table-of-Contents) 

### The musical mashup works by getting two album inputs from the user, adding the corresponding vectors in the 24-dimensional NLP space, and finding the nearest vectors/albums to the result in cosine similarity. Since cosine similarity ignores vector length, this is the same as averaging the characteristics of the given albums. Below, we see that the mashup of Johnny Cash and N.W.A. is Kid Rock.
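A quick check of the claim that adding and averaging are interchangeable here: cosine similarity depends only on direction, so scaling the summed vector by one half changes nothing. The vectors below are made-up three-dimensional stand-ins for album topic scores.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical topic scores for two input albums and one candidate album.
album_x = [0.9, 0.1, 0.0]
album_y = [0.0, 0.2, 0.8]
candidate = [0.4, 0.2, 0.5]

summed = [a + b for a, b in zip(album_x, album_y)]
averaged = [s / 2 for s in summed]

# Both versions of the mashup vector give the same similarity to the candidate.
same = math.isclose(cosine(summed, candidate), cosine(averaged, candidate))
print(same)  # True
```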

In [104]:
def mashup(asin1,asin2):
    new_album = pd.DataFrame(df_ratings.loc[asin1]+df_ratings.loc[asin2]).transpose() 
    new_album = new_album.drop([1,18,20,27,28,29], axis=1)
    comparisons = cosine_similarity(df_ratings_topics,new_album)
    return [dic_asin[idx] for idx in np.array([x[0] for x in comparisons]).argsort()[-5:]][::-1]

In [108]:
print('The mashup of', asin_title_dic['B000028U0Y'], 'and', asin_title_dic['B000003B6J'], 'is')
[asin_title_dic[rec] for rec in mashup('B000028U0Y', 'B000003B6J')]

The mashup of At Folsom Prison and Straight Outta Compton is


['Devil Without A Cause',
 'In My Lifetime 1',
 'The B. Coming',
 'Reanimation',
 'Buck The World']