## Cosine Similarity

Based on this article (https://www.newscatcherapi.com/blog/ultimate-guide-to-text-similarity-with-python), cosine similarity seems to be a good choice for comparing our route descriptions.

Since our descriptions vary in length, cosine similarity is appealing because it seems to be the least sensitive to text length.
By comparison, the Jaccard index relies on the descriptions sharing specific vocabulary, and Euclidean distance is more sensitive to repeated words.
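
To make the length argument concrete, here is a minimal sketch on made-up word counts (the toy vocabulary and both vectors are assumptions for illustration, not data from our routes). It compares a short description with a version that repeats the same wording three times.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two count vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between two count vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical counts over a toy vocabulary ["crimp", "start", "left"]
short_desc = [1, 1, 0]
long_desc = [3, 3, 0]  # same wording, repeated three times

cos = cosine_similarity(short_desc, long_desc)
dist = euclidean_distance(short_desc, long_desc)
print(cos, dist)
```

The two vectors point in the same direction, so the cosine similarity stays at (approximately) 1, while the Euclidean distance grows with the repetition.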

Since we have our descriptions as lists of stemmed tokens, we can turn each one into a Bag of Words vector and compare the vectors.

In [37]:
import pandas as pd
import numpy as np
import csv
import ast

In [67]:
# Load the CSV and parse the stringified token lists back into Python lists
df = pd.read_csv('Routes_With_Cleaned_Descriptions.csv')
df['Descriptions'] = df['Descriptions'].apply(ast.literal_eval)
df.head(3)

Unnamed: 0,Route,Location,URL,Avg Stars,Your Stars,Route Type,Rating,Pitches,Length,Area Latitude,Area Longitude,Descriptions
0,White Rastafarian,White Rastafarian Boulder > Outback Bouldering...,https://www.mountainproject.com/route/10572259...,3.9,-1,Boulder,V2 R,1,20.0,34.02073,-116.16212,"[problem, locat, larg, boulder, southeast, end..."
1,Slashface,Slash Boulder > Western Belt > Geology Tour Ro...,https://www.mountainproject.com/route/10572275...,3.9,-1,Boulder,V3 R,1,25.0,33.95344,-116.08706,"[anoth, joshua, tree, finest, boulder, problem..."
2,Pigpen,Pigpen Boulder > Manx Boulders Circuit > Manx/...,https://www.mountainproject.com/route/10572299...,3.9,-1,Boulder,V4,1,10.0,34.0153,-116.15811,"[behind, cyclop, rock, awesom, boulder, proble..."


In [75]:
# Let's create a Bag of Words vector for each row, i.e. a matrix for the whole dataset.

# Initialize an empty set for the vocabulary
vocabulary = set()

# Build the vocabulary
for sentence in df['Descriptions']:
    vocabulary.update(sentence)

# Convert to a sorted list (sorted() accepts a set directly)
vocabulary = sorted(vocabulary)

In [83]:
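# Map each word to its column index once, so lookups do not scan the whole list
vocab_index = {word: idx for idx, word in enumerate(vocabulary)}

def create_bow_vector(sentence, vocab_index):
    vector = [0] * len(vocab_index)  # Initialize a vector of zeros
    for word in sentence:
        idx = vocab_index.get(word)  # None if the word is not in the vocabulary
        if idx is not None:
            vector[idx] += 1  # Increment the count at that index
    return vector

# One row of word counts per route description
bow_matrix = [create_bow_vector(sentence, vocab_index) for sentence in df['Descriptions']]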
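
To show what these vectors look like, here is a small self-contained sketch of the same idea on two made-up token lists (`toy_descriptions` and the helper `bow_vector` are hypothetical names, not part of the notebook). It uses `collections.Counter` to do the counting.

```python
from collections import Counter

# Hypothetical token lists standing in for df['Descriptions']
toy_descriptions = [
    ["stand", "start", "move", "left"],
    ["sit", "start", "move", "right", "move"],
]

# Sorted vocabulary across all toy descriptions
toy_vocabulary = sorted({word for tokens in toy_descriptions for word in tokens})

def bow_vector(tokens, vocab):
    """Count how often each vocabulary word appears in one token list."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

toy_bow = [bow_vector(tokens, toy_vocabulary) for tokens in toy_descriptions]

print(toy_vocabulary)  # ['left', 'move', 'right', 'sit', 'stand', 'start']
print(toy_bow)         # [[1, 1, 0, 0, 1, 1], [0, 2, 1, 1, 0, 1]]
```

Each column corresponds to one vocabulary word, and repeated words (here "move" in the second description) show up as counts greater than 1.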

In [84]:
# With the BoW matrix, we can compute a pairwise cosine similarity matrix
# and use it to find the routes whose descriptions are most similar

from sklearn.metrics import pairwise_distances
# Cosine similarity is 1 minus cosine distance
pairwise_similarity = 1 - pairwise_distances(bow_matrix, metric='cosine')
pairwise_similarity

array([[1.        , 0.37511724, 0.12667365, ..., 0.33108771, 0.        ,
        0.03904344],
       [0.37511724, 1.        , 0.20780973, ..., 0.22065615, 0.        ,
        0.08006408],
       [0.12667365, 0.20780973, 1.        , ..., 0.15475893, 0.08111071,
        0.08111071],
       ...,
       [0.33108771, 0.22065615, 0.15475893, ..., 1.        , 0.05299989,
        0.05299989],
       [0.        , 0.        , 0.08111071, ..., 0.05299989, 1.        ,
        0.125     ],
       [0.03904344, 0.08006408, 0.08111071, ..., 0.05299989, 0.125     ,
        1.        ]])

## Finding Similar Routes
Now that we have a similarity matrix, we can map its rows back to the names/indices of our routes and provide recommendations based on the similarity scores.

In [149]:
# Let's create a function that takes in a route name and returns the n most similar routes.

def recommender(route_name, n_recs):
    # Row index of the first route whose name matches
    row_index = df.loc[df['Route'] == route_name].index[0]
    # Sort by similarity, highest first; skip position 0, which is normally the route itself
    n_rec_indices = np.argsort(pairwise_similarity[row_index])[::-1][1:n_recs+1]
    print(n_rec_indices)
    print(pairwise_similarity[row_index][n_rec_indices])
    # Look up the route names for the recommended indices
    names = [df.loc[i, 'Route'] for i in n_rec_indices]

    return names
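
One caveat with the function above: it assumes the route itself always sorts into first place. If two routes have identical descriptions (similarity 1.0), that may not hold. Below is a rough sketch of a variant that excludes the query row explicitly; the names `recommend`, `toy_names`, and `toy_similarity` are made up for illustration, and it uses plain lists rather than our DataFrame.

```python
def recommend(route_name, names, similarity, n_recs):
    """Return the n_recs most similar (name, score) pairs, excluding the query route."""
    if route_name not in names:
        raise KeyError(f"Unknown route: {route_name}")
    row_index = names.index(route_name)
    scores = similarity[row_index]
    # Every index except the query route itself
    candidates = [i for i in range(len(names)) if i != row_index]
    # Highest similarity first
    candidates.sort(key=lambda i: scores[i], reverse=True)
    return [(names[i], scores[i]) for i in candidates[:n_recs]]

# Hypothetical routes; "Route A" and "Route B" have identical descriptions
toy_names = ["Route A", "Route B", "Route C", "Route D"]
toy_similarity = [
    [1.0, 1.0, 0.2, 0.5],
    [1.0, 1.0, 0.2, 0.5],
    [0.2, 0.2, 1.0, 0.1],
    [0.5, 0.5, 0.1, 1.0],
]

recs = recommend("Route B", toy_names, toy_similarity, 2)
print(recs)  # [('Route A', 1.0), ('Route D', 0.5)]
```

Filtering by index rather than by position means a duplicate description is still recommended, but the query route never recommends itself.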

In [152]:
recommender("Chuckawalla 'Yabo' Start", 5)

[601 294  94 284  12]
[0.57535596 0.55749852 0.55708601 0.51994695 0.51994695]


['Unnamed V1',
 'The Button High',
 'Nitwit',
 'The Love Machine',
 'Nicole Overhang']

Cool, we have a basic recommendation system working now.

However, after some exploration, we notice that 'Nitwit' is recommended often. Looking at its description, Nitwit has a short description made up of words commonly used to describe routes, which makes it lexically similar to many other descriptions in our dataset.

To address this, we may need to add other metrics or try a different similarity measure. The Jaccard index wouldn't work well either, since we would run into the same problem with many overlapping words.
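
One possible direction, offered only as a sketch and not something tested on our dataset, is to weight words by TF-IDF so that words appearing in many descriptions count for less. The toy token lists and helper names below are assumptions for illustration; the IDF formula used is the smoothed variant `log((1 + N) / (1 + df)) + 1`.

```python
import math
from collections import Counter

# Hypothetical token lists; the first mimics a short, generic description
toy_descriptions = [
    ["stand", "start", "move", "left"],
    ["stand", "start", "crimp", "sloper", "mantel", "top"],
    ["sit", "start", "move", "left", "arete", "heel", "hook"],
]

def count_vectors(docs, vocab):
    """Plain Bag of Words counts."""
    return [[Counter(doc)[word] for word in vocab] for doc in docs]

def tfidf_vectors(docs, vocab):
    """Counts scaled by a smoothed inverse document frequency."""
    n_docs = len(docs)
    # Number of descriptions that contain each word
    doc_freq = Counter(word for doc in docs for word in set(doc))
    idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}
    return [[Counter(doc)[w] * idf[w] for w in vocab] for doc in docs]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

toy_vocab = sorted({word for doc in toy_descriptions for word in doc})
raw = count_vectors(toy_descriptions, toy_vocab)
weighted = tfidf_vectors(toy_descriptions, toy_vocab)

raw_sim = cosine_similarity(raw[0], raw[1])
tfidf_sim = cosine_similarity(weighted[0], weighted[1])
print(round(raw_sim, 3), round(tfidf_sim, 3))
```

In this toy case the generic first description ends up less similar to the second one after weighting, because the shared words are common and the distinguishing words are rare. Whether this helps with a route like Nitwit would need to be checked on the real data; scikit-learn's `TfidfVectorizer` is a library option for the same kind of weighting.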

In [156]:
df[df['Route'] == 'Nitwit']

Unnamed: 0,Route,Location,URL,Avg Stars,Your Stars,Route Type,Rating,Pitches,Length,Area Latitude,Area Longitude,Descriptions
94,Nitwit,The Blockhead > JBMF Boulders > Roadside Rocks...,https://www.mountainproject.com/route/10603963...,2.0,-1,Boulder,V0,1,13.0,34.01475,-116.16589,"[stand, start, move, left]"
