# Recommender System Application Development

## Purpose

This is a project I found online about using a recommender system to predict which city a new user of an app
might want to travel to, based on a few features they enter about what they like. The project will create
5 versions of recommender systems that use different parts of the data to determine the appropriate recommendation.


> https://towardsdatascience.com/recommender-system-application-development-part-1-of-4-cosine-similarity-f6dbcd768e83

In [1]:

import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
import sklearn
import seaborn as sns


In [2]:
df = pd.read_csv('https://raw.githubusercontent.com/emrepun/RecommenderEngine/master/data_sets/city_data.csv')
df.head()

Unnamed: 0,city,popularity,description,image,rating,rating_count,positive_review,negative_review
0,Munich,"731,250 reviews and opinions",Munich was almost completely destroyed in two ...,https://media-cdn.tripadvisor.com/media/photo-...,8,250000,12500,3000
1,Prague,"1,690,403 reviews and opinions","We hear the question, ""What’s the next Prague?...",https://media-cdn.tripadvisor.com/media/photo-...,9,400000,50000,8000
2,Stockholm,"539,546 reviews and opinions",The capital city of Sweden combines modern att...,https://media-cdn.tripadvisor.com/media/photo-...,7,120000,12000,1000
3,San Francisco,"1,194,045 reviews and opinions","Who cares about a little fog (okay, a lot of f...",https://media-cdn.tripadvisor.com/media/photo-...,6,200000,20000,10000
4,St. Petersburg,"657,532 reviews and opinions","The second largest city in Russia, St. Petersb...",https://media-cdn.tripadvisor.com/media/photo-...,1,600000,70000,2500


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25 entries, 0 to 24
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   city             25 non-null     object
 1   popularity       25 non-null     object
 2   description      25 non-null     object
 3   image            25 non-null     object
 4   rating           25 non-null     int64 
 5   rating_count     25 non-null     int64 
 6   positive_review  25 non-null     int64 
 7   negative_review  25 non-null     int64 
dtypes: int64(4), object(4)
memory usage: 1.7+ KB


## Version 1 - Recommender

>This version makes recommendations only by comparing keywords with city descriptions using cosine similarity.

In [4]:
from nltk.corpus import stopwords

# Requires the NLTK stop-word corpus: run nltk.download('stopwords') once if it is missing.
# Load the list once as a set; calling stopwords.words() for every word is very slow.
ENGLISH_STOPWORDS = set(stopwords.words('english'))


def clear(city_description):
    # lowercase, split on whitespace, and drop English stop words
    city_keywords = [word for word in city_description.lower().split() if word not in ENGLISH_STOPWORDS]
    return " ".join(city_keywords)
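
To see what this cleaning step does without downloading the NLTK corpus, here is a minimal sketch. `DEMO_STOPWORDS` and `clear_demo` are hypothetical names, and the tiny stop list is an assumption made for illustration; the notebook itself uses NLTK's full English list.

```python
# Hypothetical mini stop list; the notebook uses nltk's stopwords.words('english').
DEMO_STOPWORDS = {"the", "of", "a", "was", "in", "and", "is"}


def clear_demo(city_description, stop_words=DEMO_STOPWORDS):
    """Lowercase the text, split on whitespace, and drop stop words."""
    words = city_description.lower().split()
    return " ".join(word for word in words if word not in stop_words)


cleaned = clear_demo("The capital city of Sweden is a modern city")
print(cleaned)  # capital city sweden modern city
```

Because the text is split on whitespace only, punctuation stays attached to the words, which is visible in the cleaned Prague description in the notebook output (`question,`).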


In [5]:
# clean every description, then save the result for the recommender functions below
df['description'] = df['description'].apply(clear)

# to_csv returns None when it is given a path, so there is nothing to assign
df.to_csv('city_data_cleared.csv')

In [6]:
df.head()

Unnamed: 0,city,popularity,description,image,rating,rating_count,positive_review,negative_review
0,Munich,"731,250 reviews and opinions",munich almost completely destroyed two world w...,https://media-cdn.tripadvisor.com/media/photo-...,8,250000,12500,3000
1,Prague,"1,690,403 reviews and opinions","hear question, ""what’s next prague?"" lot. we’r...",https://media-cdn.tripadvisor.com/media/photo-...,9,400000,50000,8000
2,Stockholm,"539,546 reviews and opinions",capital city sweden combines modern attraction...,https://media-cdn.tripadvisor.com/media/photo-...,7,120000,12000,1000
3,San Francisco,"1,194,045 reviews and opinions","cares little fog (okay, lot fog) there’s much ...",https://media-cdn.tripadvisor.com/media/photo-...,6,200000,20000,10000
4,St. Petersburg,"657,532 reviews and opinions","second largest city russia, st. petersburg cou...",https://media-cdn.tripadvisor.com/media/photo-...,1,600000,70000,2500


We will now create the cosine similarity function to use in our prediction.

In [10]:
import re, math
from collections import Counter


def cosine_similarity_of(text1, text2):
        #get words first
        first = re.compile(r"[\w']+").findall(text1)
        second = re.compile(r"[\w']+").findall(text2)

        #get dictionary with each word and count.
        vector1 = Counter(first)
        vector2 = Counter(second)

        #convert vectors to set to find common words as intersection
        common = set(vector1.keys()).intersection(set(vector2.keys()))

        dot_product = 0.0

        for i in common:
            #get amount of each common word for both vectors and multiply them then add them together.
            dot_product += vector1[i] * vector2[i]

        squared_sum_vector1 = 0.0
        squared_sum_vector2 = 0.0

        #get squared sum values of word counts from each vector.
        for i in vector1.keys():
            squared_sum_vector1 += vector1[i]**2

        for i in vector2.keys():
            squared_sum_vector2 += vector2[i]**2

        #calculate magnitude with squared sums.
        magnitude = math.sqrt(squared_sum_vector1) * math.sqrt(squared_sum_vector2)

        if not magnitude:
            return 0.0
        else:
            return float(dot_product) / magnitude
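
As a check on the idea, the same calculation can be sketched compactly and compared with a value worked out by hand. `cosine_from_counts` is a hypothetical name, and this sketch tokenises on whitespace rather than with the regular expression used above, which is a simplification.

```python
import math
from collections import Counter


def cosine_from_counts(text1, text2):
    """Cosine similarity of two whitespace-tokenised texts, using word counts."""
    v1, v2 = Counter(text1.split()), Counter(text2.split())
    # Only words present in both texts contribute to the dot product.
    dot = sum(v1[w] * v2[w] for w in v1.keys() & v2.keys())
    norm = math.sqrt(sum(c * c for c in v1.values())) * math.sqrt(sum(c * c for c in v2.values()))
    return dot / norm if norm else 0.0


# Counts: {beach: 2, sun: 1} and {beach: 1, party: 1}
# dot = 2*1 = 2, magnitudes = sqrt(5) and sqrt(2), so the result is 2 / sqrt(10).
score = cosine_from_counts("beach sun beach", "beach party")
print(score)
```

Texts that share no words score 0.0, and a text compared with itself scores 1.0, up to floating-point rounding.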

Now we will create the recommender engine we will use in version 1

In [16]:
import operator
import json



def get_recommendations(keywords):

        df = pd.read_csv('city_data_cleared.csv')

        score_dict = {}

        for index, row in df.iterrows():
            score_dict[index] = cosine_similarity_of(row['description'], keywords)

        #sort cities by score and index.
        sorted_scores = sorted(score_dict.items(), key=operator.itemgetter(1), reverse=True)

        #collect the 5 highest scored cities as row dicts.
        #(DataFrame.append was removed in pandas 2.0, so the frame is built in one step.)
        rows = []
        for city_index, score in sorted_scores[:5]:
            city = df.iloc[city_index]
            rows.append({'city': city['city'], 'popularity': city['popularity'], 'description': city['description'], 'score': score})

        #create the results data frame.
        resultDF = pd.DataFrame(rows, columns=['city', 'popularity', 'description', 'score'])

        #convert DF to json.
        json_result = json.dumps(resultDF.to_dict('records'))
        return json_result
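
The ranking step on its own can be sketched without pandas. The scores and city names below are hypothetical stand-ins for `score_dict` and the `city` column, chosen only to show the sort, the top-5 cut, and the JSON round trip.

```python
import json
import operator

# Hypothetical scores keyed by row index (stand-in for score_dict) and
# hypothetical city names (stand-in for the 'city' column).
score_dict = {0: 0.05, 1: 0.21, 2: 0.0, 3: 0.16, 4: 0.14, 5: 0.12, 6: 0.09}
city_names = ["City A", "City B", "City C", "City D", "City E", "City F", "City G"]

# Sort (index, score) pairs by score, highest first, and keep the first five.
top_five = sorted(score_dict.items(), key=operator.itemgetter(1), reverse=True)[:5]

records = [{"city": city_names[i], "score": score} for i, score in top_five]
json_result = json.dumps(records)

# Reading the JSON back gives the same ranked (city, score) pairs.
ranked = [(r["city"], r["score"]) for r in json.loads(json_result)]
print(ranked)
```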
    

We will be testing our recommender engine under three different categories:
* Culture, Art and History
* Beach and Sun
* Nightlife and Party

In [34]:
culture_keywords = "history historical art architecture city culture"
beach_n_sun_keywords = "beach beaches park nature holiday sea seaside sand sunshine sun sunny"
nightlife_keywords = "nightclub nightclubs nightlife bar bars pub pubs party beer"


def get_top_5_city_names_out_of_json(json_string):
    # parse the JSON records and keep (city, score) pairs in ranked order
    # (avoids shadowing the built-ins list and max)
    records = json.loads(json_string)
    return [(record['city'], record['score']) for record in records]

In [35]:
top_5_cultural_cities = get_recommendations(culture_keywords)
city_names_for_cultural = get_top_5_city_names_out_of_json(top_5_cultural_cities)
print(city_names_for_cultural)
print("#################")

top_5_summer_cities = get_recommendations(beach_n_sun_keywords)
city_names_for_summer = get_top_5_city_names_out_of_json(top_5_summer_cities)
print(city_names_for_summer)
print("#################")

top_5_party_cities = get_recommendations(nightlife_keywords)
city_names_for_party = get_top_5_city_names_out_of_json(top_5_party_cities)
print(city_names_for_party)
print("#################")

[('Athens', 0.21629522817435007), ('St. Petersburg', 0.16666666666666666), ('Stockholm', 0.14962640041614492), ('Milan', 0.140028008402801), ('Rome', 0.12171612389003691)]
#################
[('Miami', 0.16351748504193217), ('Brighton', 0.10660035817780521), ('Hvar', 0.10192943828752511), ('Marmaris', 0.08528028654224416), ('Ibiza', 0.04598004898717029)]
#################
[('Ibiza', 0.15249857033260467), ('Miami', 0.14462030521243746), ('Hvar', 0.1126872339638022), ('Prague', 0.10482848367219184), ('Madrid', 0.06415002990995841)]
#################


### Conclusion for Version-1

In this version, we developed a recommender that suggests travel destinations for three categories by computing the cosine similarity between each city's description and the category's keywords.

## Version 2 - Recommender

In the next version, we will compute a score that combines cosine similarity with the rating of each city. Including the rating lets a
well-rated city rank above a poorly rated one when their descriptions match the keywords about equally well.

In [36]:
from math import e

def get_rating_weight(rating, q=10):
    if rating > 10 or rating < 0:
        return None
    else:
        m = (2*q) / 10 #10 because rating varies between 0 and 10
        b = -q
        return (m*rating) + b
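
To make the linear mapping concrete, the sketch below restates the same formula under a hypothetical name, `rating_weight`, and prints the weight for a few ratings with the default `q=10`.

```python
def rating_weight(rating, q=10):
    """Map a 0-10 rating linearly onto the range [-q, +q]."""
    if rating < 0 or rating > 10:
        return None
    return (2 * q / 10) * rating - q


# A rating of 5 is neutral; lower ratings give negative weights, higher ones positive.
for rating in (0, 1, 5, 7, 10):
    print(rating, rating_weight(rating))
```

With `q=10` this prints weights of -10.0, -8.0, 0.0, 4.0 and 10.0, so the rating can move a score by at most 10 percent in either direction.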

In [37]:
def calculate_final_score(cs, r):
    amount = (cs / 100) * r
    return cs + amount

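The two functions above work as follows:
* `get_rating_weight` calculates the weight of the `rating` column.
* `calculate_final_score` takes the cosine similarity score `cs` and the rating contribution `r` as parameters.
* It calculates `r` percent of the cosine similarity score, positive or negative, in the `amount` variable.
* It adds `amount` to the cosine similarity score and returns the result.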
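
A small sketch of this percentage adjustment, using a made-up cosine similarity of 0.2 and the extreme and neutral weights. `final_score` is a hypothetical name that restates `calculate_final_score`.

```python
def final_score(cs, r):
    """Shift the cosine similarity cs by r percent of itself."""
    return cs + (cs / 100) * r


cs = 0.2  # made-up cosine similarity, for illustration only
lowest = final_score(cs, -10)   # weight for a rating of 0
neutral = final_score(cs, 0)    # weight for a rating of 5
highest = final_score(cs, 10)   # weight for a rating of 10
print(lowest, neutral, highest)
```

The three results are roughly 0.18, 0.2 and 0.22. Because the adjustment is proportional to `cs`, a city with zero cosine similarity keeps a score of zero whatever its rating.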

In [38]:


def get_recommendations_include_rating(keywords):
    df = pd.read_csv('city_data_cleared.csv')

    score_dict = {}

    for index, row in df.iterrows():
        cs_score = cosine_similarity_of(row['description'], keywords)

        rating = row['rating']
        rating_contribution = get_rating_weight(rating,10)

        final_score = calculate_final_score(cs_score, rating_contribution)

        score_dict[index] = final_score

    #sort cities by score and index.
    sorted_scores = sorted(score_dict.items(), key=operator.itemgetter(1), reverse=True)

    #collect the 5 highest scored cities as row dicts.
    #(DataFrame.append was removed in pandas 2.0, so the frame is built in one step.)
    rows = []
    for city_index, score in sorted_scores[:5]:
        city = df.iloc[city_index]
        rows.append({'city': city['city'], 'popularity': city['popularity'], 'description': city['description'], 'score': score})

    #create the results data frame.
    resultDF = pd.DataFrame(rows, columns=['city', 'popularity', 'description', 'score'])

    #convert DF to json.
    json_result = json.dumps(resultDF.to_dict('records'))
    return json_result

In [39]:
# Version 2 requests are below:
top_5_cultural_with_rating = get_recommendations_include_rating(culture_keywords)
city_names_for_cultural_rating = get_top_5_city_names_out_of_json(top_5_cultural_with_rating)
print(city_names_for_cultural_rating)
print("#################")

top_5_summer_with_rating = get_recommendations_include_rating(beach_n_sun_keywords)
city_names_for_summer_rating = get_top_5_city_names_out_of_json(top_5_summer_with_rating)
print(city_names_for_summer_rating)
print("#################")


top_5_party_with_rating = get_recommendations_include_rating(nightlife_keywords)
city_names_for_party_rating = get_top_5_city_names_out_of_json(top_5_party_with_rating)
print(city_names_for_party_rating)
print("#################")

[('Athens', 0.22927294186481106), ('Stockholm', 0.1556114564327907), ('St. Petersburg', 0.15333333333333332), ('Milan', 0.15123024907502508), ('Rome', 0.13145341380123987)]
#################
[('Miami', 0.16024713534109353), ('Brighton', 0.11512838683202962), ('Hvar', 0.11008379335052712), ('Marmaris', 0.09380831519646858), ('Ibiza', 0.04598004898717029)]
#################
[('Ibiza', 0.15249857033260467), ('Miami', 0.1417278991081887), ('Hvar', 0.12170221268090638), ('Prague', 0.1132147623659672), ('Madrid', 0.06671603110635675)]
#################


In [41]:
top_5_cultural_cities = get_recommendations(culture_keywords)
city_names_for_cultural = get_top_5_city_names_out_of_json(top_5_cultural_cities)
print(city_names_for_cultural)
print("#################")


top_5_cultural_with_rating = get_recommendations_include_rating(culture_keywords)
city_names_for_cultural_rating = get_top_5_city_names_out_of_json(top_5_cultural_with_rating)
print(city_names_for_cultural_rating)
print("#################")

[('Athens', 0.21629522817435007), ('St. Petersburg', 0.16666666666666666), ('Stockholm', 0.14962640041614492), ('Milan', 0.140028008402801), ('Rome', 0.12171612389003691)]
#################
[('Athens', 0.22927294186481106), ('Stockholm', 0.1556114564327907), ('St. Petersburg', 0.15333333333333332), ('Milan', 0.15123024907502508), ('Rome', 0.13145341380123987)]
#################


### Conclusion for Version-2

Including ratings changes the ranking when the description matches are close. In the culture category, Stockholm (rating 7) moves ahead of St. Petersburg (rating 1), even though St. Petersburg has the higher cosine similarity.

## Version 3 - Recommender

We will add a rating-count threshold so that ratings backed by very few votes have little effect on the score:
the rating contribution is scaled down when a city's `rating_count` is small relative to the threshold.

In [42]:
def get_rating_weight_with_quantity(rating, c, T, q=10):
    if rating > 10 or rating < 0:
        return None
    else:
        m = (2*q) / 10 #10 because rating varies between 0 and 10
        b = -q
        val = (m*rating) + b
        M = e**((-T*0.68)/c)
        return val * M
    
# T is the threshold of quantity of ratings for a place
# c is the actual quantity of ratings for a place
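
The effect of the multiplier `M` is easier to see in isolation. The sketch below uses a hypothetical helper, `quantity_multiplier`, with the same formula and the threshold of 1,000,000 used in this notebook, then reproduces the Version 3 culture score for St. Petersburg (rating 1, `rating_count` 600000, cosine similarity taken from the Version 1 output).

```python
import math


def quantity_multiplier(c, T=1_000_000):
    """Damping factor e**(-0.68*T/c): near 0 for few ratings, approaching 1 as c grows."""
    return math.exp((-T * 0.68) / c)


# The multiplier grows with the number of ratings behind the score.
for count in (100_000, 600_000, 1_000_000, 10_000_000):
    print(count, round(quantity_multiplier(count), 4))

# St. Petersburg: rating 1 gives a weight of -8 before damping.
cs = 0.16666666666666666
contribution = -8 * quantity_multiplier(600_000)
st_petersburg_score = cs + (cs / 100) * contribution
print(st_petersburg_score)
```

When `c` equals `T`, the multiplier is `e**-0.68`, about 0.51, so a city needs several times the threshold in ratings before its rating counts almost fully.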

In [43]:
def get_recommendations_include_rating_count_threshold(keywords):
        df = pd.read_csv('city_data_cleared.csv')

        score_dict = {}

        for index, row in df.iterrows():
            cs_score = cosine_similarity_of(row['description'], keywords)

            rating = row['rating']
            rating_count = row['rating_count']
            threshold = 1000000
            rating_contribution = get_rating_weight_with_quantity(rating,rating_count,threshold,10)

            final_score = calculate_final_score(cs_score, rating_contribution)

            score_dict[index] = final_score

        #sort cities by score and index.
        sorted_scores = sorted(score_dict.items(), key=operator.itemgetter(1), reverse=True)

        #collect the 5 highest scored cities as row dicts.
        #(DataFrame.append was removed in pandas 2.0, so the frame is built in one step.)
        rows = []
        for city_index, score in sorted_scores[:5]:
            city = df.iloc[city_index]
            rows.append({'city': city['city'], 'popularity': city['popularity'], 'description': city['description'], 'score': score})

        #create the results data frame.
        resultDF = pd.DataFrame(rows, columns=['city', 'popularity', 'description', 'score'])

        #convert DF to json.
        json_result = json.dumps(resultDF.to_dict('records'))
        return json_result

In [44]:
top_5_cultural_with_rating_count_threshold = get_recommendations_include_rating_count_threshold(culture_keywords)
city_names_for_cultural_rating_count_threshold = get_top_5_city_names_out_of_json(top_5_cultural_with_rating_count_threshold)
print(city_names_for_cultural_rating_count_threshold)
print("#################")

top_5_summer_with_rating_count_threshold = get_recommendations_include_rating_count_threshold(beach_n_sun_keywords)
city_names_for_summer_rating_count_threshold = get_top_5_city_names_out_of_json(top_5_summer_with_rating_count_threshold)
print(city_names_for_summer_rating_count_threshold)
print("#################")

top_5_party_with_rating_count_threshold = get_recommendations_include_rating_count_threshold(nightlife_keywords)
city_names_for_party_rating_count_threshold = get_top_5_city_names_out_of_json(top_5_party_with_rating_count_threshold)
print(city_names_for_party_rating_count_threshold)
print("#################")

[('Athens', 0.22085412061707188), ('St. Petersburg', 0.16237388971283098), ('Stockholm', 0.14964710498328637), ('Milan', 0.14638431288137468), ('Rome', 0.12625154479542344)]
#################
[('Miami', 0.16312689746989084), ('Brighton', 0.10991687069100607), ('Hvar', 0.10199961823851648), ('Marmaris', 0.08786696801960493), ('Ibiza', 0.04598004898717029)]
#################
[('Ibiza', 0.15249857033260467), ('Miami', 0.14427485656597425), ('Hvar', 0.1127648208188835), ('Prague', 0.10636051861765909), ('Madrid', 0.06540430684448524)]
#################


## Version 4 - Recommender

In this fourth version, we will improve our system further by using the positive_review and negative_review features of our dataset.

In [47]:
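def get_rating_with_count_and_reviews(r, rc, pf, bf):
        #r: rating, rc: rating count, pf: positive review count, bf: negative review count
        if r > 10 or r < 0:
            return None
        else:
            #a positive review counts as a rating halfway between r and 10
            positive_diff = (10 - r) / 2
            positive_rating = r + positive_diff
            #a negative review counts as a rating halfway between r and 0
            negative_diff = r / 2
            negative_rating = r - negative_diff
            #weighted average of the three ratings, weighted by their counts
            updated_rating = ((r * rc) + (pf * positive_rating) + (bf * negative_rating)) / (rc + pf + bf)
            
        return get_rating_weight_with_quantity(updated_rating,rc,1000000,10)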
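
As a worked example of the blended rating, the sketch below applies the same weighted average to the Munich row shown in `df.head()` (rating 8, 250000 ratings, 12500 positive and 3000 negative reviews). `updated_rating` is a hypothetical standalone name for the intermediate value computed above.

```python
def updated_rating(r, rc, pf, bf):
    """Weighted average of the rating and the ratings implied by the reviews."""
    positive_rating = r + (10 - r) / 2  # halfway between r and 10
    negative_rating = r - r / 2         # halfway between r and 0
    return ((r * rc) + (pf * positive_rating) + (bf * negative_rating)) / (rc + pf + bf)


# Munich: (8*250000 + 12500*9 + 3000*4) / 265500 = 2124500 / 265500
munich = updated_rating(8, 250000, 12500, 3000)
print(round(munich, 4))  # 8.0019
```

The reviews move Munich's rating only slightly, because the review counts are small next to the 250000 ratings.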

In [48]:
def get_recommendations_include_rating_count_threshold_positive_negative_reviews(keywords):
        df = pd.read_csv('city_data_cleared.csv')

        score_dict = {}

        for index, row in df.iterrows():
            cs_score = cosine_similarity_of(row['description'], keywords)

            rating = row['rating']
            rating_count = row['rating_count']
            positive_review_count = row['positive_review']
            negative_review_count = row['negative_review']
            rating_contribution = get_rating_with_count_and_reviews(rating,rating_count,positive_review_count,negative_review_count)

            final_score = calculate_final_score(cs_score, rating_contribution)

            score_dict[index] = final_score

        #sort cities by score and index.
        sorted_scores = sorted(score_dict.items(), key=operator.itemgetter(1), reverse=True)

        #collect the 5 highest scored cities as row dicts.
        #(DataFrame.append was removed in pandas 2.0, so the frame is built in one step.)
        rows = []
        for city_index, score in sorted_scores[:5]:
            city = df.iloc[city_index]
            rows.append({'city': city['city'], 'popularity': city['popularity'], 'description': city['description'], 'score': score})

        #create the results data frame.
        resultDF = pd.DataFrame(rows, columns=['city', 'popularity', 'description', 'score'])

        #convert DF to json.
        json_result = json.dumps(resultDF.to_dict('records'))
        return json_result

In [49]:
top_5_cultural_with_rating_count_threshold_reviews = get_recommendations_include_rating_count_threshold_positive_negative_reviews(culture_keywords)
city_names_for_cultural_rating_count_threshold_reviews = get_top_5_city_names_out_of_json(top_5_cultural_with_rating_count_threshold_reviews)
print(city_names_for_cultural_rating_count_threshold_reviews)
print("#############################")

top_5_summer_with_rating_count_threshold_reviews = get_recommendations_include_rating_count_threshold_positive_negative_reviews(beach_n_sun_keywords)
city_names_for_summer_rating_count_threshold_reviews = get_top_5_city_names_out_of_json(top_5_summer_with_rating_count_threshold_reviews)
print(city_names_for_summer_rating_count_threshold_reviews)
print("#############################")

top_5_party_with_rating_count_threshold_reviews = get_recommendations_include_rating_count_threshold_positive_negative_reviews(nightlife_keywords)
city_names_for_party_rating_count_threshold_reviews = get_top_5_city_names_out_of_json(top_5_party_with_rating_count_threshold_reviews)
print(city_names_for_party_rating_count_threshold_reviews)
print("#############################")

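NameError: name 'get_rating_weight_with_count_and_reviews' is not defined

The name in this error does not match the helper defined in In [47], `get_rating_with_count_and_reviews`, which is the name In [48] calls. This output therefore appears to come from a run made before the call was corrected, and no Version 4 results were recorded.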