## Cosine Similarity

Based on this article (https://www.newscatcherapi.com/blog/ultimate-guide-to-text-similarity-with-python), cosine similarity seems to be a good choice for comparing our route descriptions.

Since our descriptions vary in length, cosine similarity is attractive because it seems to be the least sensitive to text length.
By contrast, the Jaccard index relies on descriptions sharing specific vocabulary, and Euclidean distance is more sensitive to repeated words.
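
To make the length argument concrete, here is a minimal sketch on made-up word counts (the vocabulary and vectors are hypothetical, not taken from our dataset). A description that repeats the same wording three times points in the same direction as the original, so its cosine similarity stays at 1 while its Euclidean distance grows.

```python
import numpy as np

def cosine_similarity_1d(a, b):
    """Cosine of the angle between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical word counts over the vocabulary ["crimp", "left", "move", "start"]
short_desc = np.array([1, 0, 2, 1])
long_desc = 3 * short_desc  # same wording repeated three times

cos_sim = cosine_similarity_1d(short_desc, long_desc)
euclid = float(np.linalg.norm(short_desc - long_desc))

print(f"cosine similarity:  {cos_sim:.3f}")
print(f"euclidean distance: {euclid:.3f}")
```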

Our cleaned descriptions were saved to CSV as stringified lists of tokens, so we first load them back into Python lists.

In [2]:
import pandas as pd
import numpy as np
import csv
import ast

In [24]:
# Parse the stringified token lists from the CSV back into Python lists
df = pd.read_csv('Routes_With_Cleaned_Descriptions.csv')
df['Descriptions'] = df['Descriptions'].apply(ast.literal_eval)
df.head(3)

Unnamed: 0,Route,Location,URL,Avg Stars,Your Stars,Route Type,Rating,Pitches,Length,Area Latitude,Area Longitude,Descriptions
0,White Rastafarian,White Rastafarian Boulder > Outback Bouldering...,https://www.mountainproject.com/route/10572259...,3.9,-1,Boulder,V2 R,1,20.0,34.02073,-116.16212,"[problem, locat, larg, boulder, southeast, end..."
1,Slashface,Slash Boulder > Western Belt > Geology Tour Ro...,https://www.mountainproject.com/route/10572275...,3.9,-1,Boulder,V3 R,1,25.0,33.95344,-116.08706,"[anoth, joshua, tree, finest, boulder, problem..."
2,Pigpen,Pigpen Boulder > Manx Boulders Circuit > Manx/...,https://www.mountainproject.com/route/10572299...,3.9,-1,Boulder,V4,1,10.0,34.0153,-116.15811,"[behind, cyclop, rock, awesom, boulder, proble..."


In [25]:
# Let's create a Bag of Words vector for each row, i.e. a matrix for the whole dataset.

# Initialize an empty set for the vocabulary
vocabulary = set()

# Build the vocabulary
for sentence in df['Descriptions']:
    vocabulary.update(sentence)

# Convert to a sorted list (sorted() accepts a set directly)
vocabulary = sorted(vocabulary)

In [5]:
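# Map each word to its column index once, so lookups are O(1) instead of list.index's O(n)
word_to_index = {word: idx for idx, word in enumerate(vocabulary)}

def create_bow_vector(sentence, word_to_index):
    vector = [0] * len(word_to_index)  # Initialize a vector of zeros
    for word in sentence:
        idx = word_to_index.get(word)  # None if the word is not in the vocabulary
        if idx is not None:
            vector[idx] += 1  # Increment the count at that index
    return vector

bow_matrix = [create_bow_vector(sentence, word_to_index) for sentence in df['Descriptions']]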

In [6]:
# With the BoW matrix, we can compute a pairwise cosine similarity matrix
# and use it to find the descriptions (and therefore routes) that are most similar.

from sklearn.metrics import pairwise_distances

# Cosine distance is 1 - cosine similarity, so subtract from 1 to convert back
pairwise_similarity = 1 - pairwise_distances(bow_matrix, metric='cosine')
pairwise_similarity

array([[1.        , 0.37511724, 0.12667365, ..., 0.33108771, 0.        ,
        0.03904344],
       [0.37511724, 1.        , 0.20780973, ..., 0.22065615, 0.        ,
        0.08006408],
       [0.12667365, 0.20780973, 1.        , ..., 0.15475893, 0.08111071,
        0.08111071],
       ...,
       [0.33108771, 0.22065615, 0.15475893, ..., 1.        , 0.05299989,
        0.05299989],
       [0.        , 0.        , 0.08111071, ..., 0.05299989, 1.        ,
        0.125     ],
       [0.03904344, 0.08006408, 0.08111071, ..., 0.05299989, 0.125     ,
        1.        ]])

## Finding Similar Routes
Now that we have a similarity matrix, we can map its rows back to the names/indices of our routes and provide recommendations based on the similarity scores.

In [80]:
# Let's create a function that takes in a route name and returns the n most similar routes.

def recommender(route_name, matrix, n_recs):
    # Row of the first route with this name (relies on the global df)
    row_index = df.loc[df['Route'] == route_name].index[0]
    # Sort by descending similarity and skip position 0, assumed to be the route itself
    n_rec_indices = np.argsort(matrix[row_index])[::-1][1:n_recs+1]
    names = []
    for i in n_rec_indices:
        names.append(df.loc[i, 'Route'])

    return names

In [81]:
recommender("Chuckawalla 'Yabo' Start",pairwise_similarity, 5)

['Unnamed V1',
 'The Button High',
 'Nitwit',
 'The Love Machine',
 'Nicole Overhang']
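
One caveat with the `[1:n_recs+1]` slice in `recommender`: it assumes the query route always sorts first. If another route has an identical description (similarity 1.0), the tie may put the query somewhere else, and the slice can then drop the wrong route. A hedged sketch of a variant that removes the query index explicitly (`top_n_excluding_self` and the toy row are hypothetical, not part of the notebook):

```python
import numpy as np

def top_n_excluding_self(similarity_row, self_index, n_recs):
    """Return indices of the n_recs highest scores, never including self_index."""
    scores = np.asarray(similarity_row, dtype=float)
    order = np.argsort(scores, kind="stable")[::-1]  # descending similarity
    order = order[order != self_index]               # drop the query route itself
    return order[:n_recs].tolist()

# Hypothetical similarity row for route 1, where route 3 is an exact duplicate
row = [0.2, 1.0, 0.5, 1.0, 0.0]
print(top_n_excluding_self(row, self_index=1, n_recs=2))
```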

Cool, we have a basic recommendation system working now.

However, after some exploration, we notice that 'Nitwit' is recommended often. Its description is short and made up of words commonly used to describe routes, which makes it lexically similar to many other descriptions in our dataset.

To address this, we may need to add other metrics or try a different type of similarity measurement. The Jaccard index wouldn't work well, since we would run into the same problem with many overlapping words.
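
One way to spot an over-recommended route like this is to count how many recommendation lists each route shows up in. The sketch below uses a made-up dictionary of recommendations (`toy_recs` and the "Route A" to "Route D" names are hypothetical); with the real data, the lists would come from calling `recommender` on every route.

```python
from collections import Counter

# Hypothetical recommendation lists keyed by query route
toy_recs = {
    "Route A": ["Nitwit", "Route B", "Route C"],
    "Route B": ["Nitwit", "Route A", "Route D"],
    "Route C": ["Route A", "Nitwit", "Route B"],
    "Route D": ["Route B", "Nitwit", "Route C"],
}

# Count how many recommendation lists each route appears in
rec_counts = Counter(name for recs in toy_recs.values() for name in recs)

for name, count in rec_counts.most_common(3):
    print(f"{name}: recommended for {count} of {len(toy_recs)} routes")
```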

In [82]:
df[df['Route'] == 'Nitwit']

Unnamed: 0,Route,Location,URL,Avg Stars,Your Stars,Route Type,Rating,Pitches,Length,Area Latitude,Area Longitude,Descriptions
94,Nitwit,The Blockhead > JBMF Boulders > Roadside Rocks...,https://www.mountainproject.com/route/10603963...,2.0,-1,Boulder,V0,1,13.0,34.01475,-116.16589,"[stand, start, move, left]"


After some research, we want a system that places less weight on words that are common across many descriptions, so we consider TF-IDF.

Let's go ahead and perform TF-IDF (term frequency-inverse document frequency) vectorization using the scikit-learn library and again use cosine similarity.

In [83]:
# This time, we'll need the descriptions as strings instead of a list of words to help vectorization.

df_TFIDF = pd.read_csv('Routes_With_Cleaned_Descriptions.csv')
df_TFIDF['Descriptions'] = df_TFIDF['Descriptions'].apply(ast.literal_eval)
df_TFIDF['Descriptions'] = df_TFIDF['Descriptions'].apply(lambda x: ' '.join(x))

In [84]:
# Now we can start vectorization
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(df_TFIDF['Descriptions'])

similarity_matrix_tfidf = cosine_similarity(tfidf_matrix)

In [85]:
recommender("Chuckawalla 'Yabo' Start",similarity_matrix_tfidf, 5)

['Medium Chuckie',
 'Duncecap Finish',
 'Dream Sequence Direct',
 'Unknown Slopers',
 'The Egg Timer']

In [86]:
recommender("Chuckawalla 'Yabo' Start",pairwise_similarity, 5)

['Unnamed V1',
 'The Button High',
 'Nitwit',
 'The Love Machine',
 'Nicole Overhang']

In [87]:
recommender("Nicole Overhang",similarity_matrix_tfidf, 5)

['Nitwit', 'Kranium Left', 'Firefly', 'Hey Ladies V6/7', 'Intersection Mantle']

In [88]:
recommender("Nicole Overhang",pairwise_similarity, 5)

['Nitwit', 'Kranium Left', 'Firefly', 'Pistol Whipped', 'Powell Face']

Interesting to see that Nicole Overhang gets mostly the same results (the top three match), while Chuckawalla 'Yabo' Start gets completely different routes.

Let's create a DataFrame to compare the recommendations given by each method.

In [98]:
# Apply the recommender to each route and compare Bag of Words and TF-IDF cosine similarity

bow_recs = []
TFIDF_recs = []

for route_name in df['Route']:
    bow_recs.append(recommender(route_name, pairwise_similarity, 5))
    TFIDF_recs.append(recommender(route_name, similarity_matrix_tfidf, 5))

rec_table = pd.DataFrame({
    'Route': df['Route'],
    'BoW Recs': bow_recs,
    'TF-IDF Recs': TFIDF_recs,
})

In [109]:
# How many overlapping recommendations do we have for each route?

overlap = []
for bow, tfidf in zip(rec_table['BoW Recs'], rec_table['TF-IDF Recs']):
    # Count the routes that appear in both recommendation lists
    overlap.append(len(set(bow) & set(tfidf)))

rec_table['Overlap'] = overlap
rec_table

Unnamed: 0,Route,BoW Recs,TF-IDF Recs,Overlap
0,White Rastafarian,"[Dream Sequence, The Man Who Smiled, High Noon...","[The Man Who Smiled, Slashface, High Noon, Dre...",4
1,Slashface,"[Ten Fidy, White Rastafarian, Mr. Creosote, Al...","[White Rastafarian, High Noon, Alexandria, Mr....",4
2,Pigpen,"[Street Zen, Sex Magician Sit Start, Dark Matt...","[Street Zen, Sex Magician Sit Start, Dark Matt...",3
3,JBMFP,"[The Hard Way, Ghetto Booty, Jerry's Kids, Tid...","[Friction 101, The Ayatollah of Rock and Rolla...",2
4,Gunsmoke,"[Orbiter, Tush, Night Crawler, Desert Teflon, ...","[The Chube, Driblet, Thingamajig, The Laying o...",0
...,...,...,...,...
995,Turnbuckle Left Arete,"[Browning Arete, Slap Prow, Afterthought, Doct...","[Slap Prow, Doctor Brown, Dream Sequence, Pixi...",3
996,Ramp Line,"[Unnamed, The Womb, The Bardini Crack (aka: Bl...","[Unnamed, Unnamed, The Bardini Crack (aka: Bla...",2
997,The Ayatollah of Rock and Rolla,"[Pixie Slab Right, Forged in Fire, Footprints,...","[Pixie Slab Right, Winged Assassins, Old Chub,...",5
998,Dysfunction,"[Sign Problem, Lemon, The Mullet, Facet Arete,...","[Bob, The Mullet, The Halfling, The Snows of K...",1


In [112]:
rec_table['Overlap'].median()

2.0

In [113]:
rec_table['Overlap'].mean()

2.007
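
The mean and median hide how the agreement is spread out, so it may also help to look at the share of routes at each overlap level. A minimal sketch on a made-up table with the same columns as `rec_table` (`toy_table` and its contents are hypothetical):

```python
import pandas as pd

# Hypothetical recommendation table with the same columns as rec_table
toy_table = pd.DataFrame({
    "Route": ["A", "B", "C", "D"],
    "BoW Recs": [["B", "C"], ["A", "C"], ["A", "D"], ["A", "B"]],
    "TF-IDF Recs": [["B", "C"], ["A", "D"], ["B", "D"], ["C", "D"]],
})

# Overlap = number of routes that appear in both lists for the same query
toy_table["Overlap"] = [
    len(set(bow) & set(tfidf))
    for bow, tfidf in zip(toy_table["BoW Recs"], toy_table["TF-IDF Recs"])
]

# Share of routes at each overlap level, from no agreement to full agreement
overlap_share = toy_table["Overlap"].value_counts(normalize=True).sort_index()
print(overlap_share)
```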

In [114]:
rec_table.head(20)

Unnamed: 0,Route,BoW Recs,TF-IDF Recs,Overlap
0,White Rastafarian,"[Dream Sequence, The Man Who Smiled, High Noon...","[The Man Who Smiled, Slashface, High Noon, Dre...",4
1,Slashface,"[Ten Fidy, White Rastafarian, Mr. Creosote, Al...","[White Rastafarian, High Noon, Alexandria, Mr....",4
2,Pigpen,"[Street Zen, Sex Magician Sit Start, Dark Matt...","[Street Zen, Sex Magician Sit Start, Dark Matt...",3
3,JBMFP,"[The Hard Way, Ghetto Booty, Jerry's Kids, Tid...","[Friction 101, The Ayatollah of Rock and Rolla...",2
4,Gunsmoke,"[Orbiter, Tush, Night Crawler, Desert Teflon, ...","[The Chube, Driblet, Thingamajig, The Laying o...",0
5,Stem Gem,"[Hensel Face, Interceptor 2, Big Lizard in my ...","[Sorta High, The Palmist, Newton's Law, Northe...",0
6,Fry Problem,"[The Snows of Kilimanjaro, Ziggy, Moral Wastel...","[The Snows of Kilimanjaro, The Halfling, Spud ...",1
7,False Up 20,"[Corner Problem, Silly Putty, Largotot, Capuch...","[Corner Problem, Jerryatric, Crankcase, Shadow...",1
8,Saturday Night Live,"[Old Triangle Classic, Fedora, Retro, Turnbuck...","[High Heeled Sneakers, Unknown, Right Arete, F...",1
9,The Chube,"[Wandering n00b, Interceptor, Little Biglip, P...","[Gunsmoke, Wandering n00b, High Noon, JD half-...",1


### Conclusion

Now we have a basic recommendation system based on the description of each climb.

We used cosine similarity as our base method, but prepared our data in two different ways: Bag of Words and TF-IDF vectorization.

After manually comparing some of the route recommendations, we find that some may be good fits, but it is difficult to measure how helpful the recommendations are without community opinions.