### Content-Based Trail Recommender

This recommendation system uses trail features such as distance, elevation, difficulty, and location to find the most similar trails. When the user enters a trail, a list of the most similar trails is generated and ordered by similarity score.

#### Where is the data coming from?

The data was scraped from AllTrails.com using BeautifulSoup and Selenium. One challenge was getting the webpages to render all of the necessary information: when gathering the trails in Arizona, the full list would not load unless the page was scrolled. Selenium was therefore used to simulate a scrolling action until the page was populated with all of the trails. A method of the DataGrabber class performs the scroll action and grabs the trail names and their links, which were then exported to a PostgreSQL database. Execute the cell below to see it in action. A new browser window will open.

You don't have to wait for it to scroll through the entire page. Close the browser window when you get bored.

In [20]:
import sys
sys.path.append('..')
from hiking_data_v1 import DataGrabber

grabber = DataGrabber()
try:
    grabber.grab_name_and_links(state='arizona')
except Exception:
    # Closing the browser window mid-scroll raises an error; that is expected here
    print('Window closed, carry on')

Window closed, carry on
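
The scrolling step can be sketched roughly as follows. This is not the actual DataGrabber code: `scroll_to_bottom`, `pause`, and `max_rounds` are hypothetical names, and the loop only assumes a Selenium-style driver that exposes `execute_script`.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=50):
    """Scroll until the page height stops growing; return the number of scrolls.

    `driver` is assumed to be a Selenium WebDriver (any object with an
    `execute_script` method works). All names here are illustrative.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        # Jump to the bottom so the page lazy-loads the next batch of trails
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        rounds += 1
        time.sleep(pause)  # give the new content time to render
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so the list is assumed complete
        last_height = new_height
    return rounds
```

Stopping when the height no longer changes avoids hard-coding the number of scrolls, and `max_rounds` caps the loop in case the page never settles.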


Another method navigates to the individual page of each trail gathered in the previous step and collects the trail's details and reviews. This data was also exported to a PostgreSQL database. Instead of running that method here, SQL queries are used to load the data into dataframes.

In [9]:
#Don't actually run this method
#grabber.grab_details()

#These sql queries will get us the data that was gathered via the above method
import pandas as pd
import pandas.io.sql as sqlio
import psycopg2

conn = psycopg2.connect("dbname='az_trail_recommender' user='josephdoperalski' host='localhost'")
cur = conn.cursor()
query_details = '''SELECT * FROM trail_details'''
details = sqlio.read_sql_query(query_details, conn)
details.drop('index', axis = 1, inplace = True)
details.head(10)





Unnamed: 0,trail_id,trail_name,dist,elev,type,difficulty,num_completed,latitude,longitude,tags,overview,full_desc
0,0,Camelback Mountain via Echo Canyon Trail,2.4 miles,"1,423 feet",Out & Back,HARD,5629,33.52273,-111.97467,"birding,hiking,nature trips,rock climbing,trai...",Camelback Mountain via Echo Canyon Trail is a ...,Camelback Mountain is one of the most popular ...
1,1,Devils Bridge Trail,4.2 miles,564 feet,Out & Back,MODERATE,5221,34.88774,-111.82262,"dogs on leash,birding,hiking,nature trips,off ...",Devils Bridge Trail is a 4.2 mile heavily traf...,Devils Bridge is the largest natural sandstone...
2,2,Cathedral Rock Trail,1.2 miles,744 feet,Out & Back,HARD,3676,34.82518,-111.7886,"dogs on leash,hiking,rock climbing,trail runni...",Cathedral Rock Trail is a 1.2 mile heavily tra...,Cathedral Rock is a short hike that offers bea...
3,3,Flatiron Via Siphon Draw Trail,6.2 miles,"2,933 feet",Out & Back,HARD,3120,33.45949,-111.48,"dogs on leash,birding,camping,hiking,nature tr...",Flatiron Via Siphon Draw Trail is a 6.2 mile h...,Bring plenty of water with you as there is non...
4,4,Piestewa Peak Summit Trail #300,2.2 miles,"1,151 feet",Out & Back,HARD,4336,33.53943,-112.02341,"birding,hiking,nature trips,trail running,walk...",Piestewa Peak Summit Trail #300 is a 2.2 mile ...,"Hikers, walkers, sightseers and runners alike ..."
5,5,West Fork Trail,7.2 miles,820 feet,Out & Back,EASY,2787,34.99098,-111.74266,"dogs on leash,kid friendly,birding,hiking,natu...",West Fork Trail is a 7.2 mile heavily traffick...,Semi-shaded by trees and towering canyon walls...
6,6,"Havasu Falls, Mooney Falls, and Beaver Falls",24.5 miles,"3,307 feet",Out & Back,HARD,1648,36.1596,-112.70941,"backpacking,camping,hiking,cave,river,views,wa...","Havasu Falls, Mooney Falls, and Beaver Falls i...",Hikers must get a permit from the Havasupai In...
7,7,Tom's Thumb Trail,4.0 miles,"1,236 feet",Out & Back,MODERATE,2534,33.69459,-111.80149,"dogs on leash,birding,hiking,mountain biking,n...",Tom's Thumb Trail is a 4 mile heavily traffick...,Tom's Thumb Trail is a short switchback trail ...
8,8,Hieroglyphic Trail,2.8 miles,567 feet,Out & Back,MODERATE,2974,33.389725,-111.424711,"dogs on leash,kid friendly,hiking,nature trips...",Hieroglyphic Trail is a 2.8 mile moderately tr...,At the end of this trail are petroglyphs and s...
9,9,Peralta Trail to Fremont Saddle,5.1 miles,"1,571 feet",Out & Back,MODERATE,2303,33.39755,-111.34791,"dogs on leash,backpacking,birding,camping,hiki...",Peralta Trail to Fremont Saddle is a 5.1 mile ...,


In [10]:
query_reviews = '''SELECT * FROM trail_reviews'''
reviews = sqlio.read_sql_query(query_reviews, conn)
reviews.drop('index', axis = 1, inplace = True)
reviews.head(10)

Unnamed: 0,review_id,trail_id,trail_name,user,rating,body
0,0_0,0,Camelback Mountain via Echo Canyon Trail,/members/arron-hampton,5,Super hard. It’s all scrambling after about 1/...
1,0_1,0,Camelback Mountain via Echo Canyon Trail,/members/brian-white-18,4,First time hiking this trail and it did not di...
2,0_2,0,Camelback Mountain via Echo Canyon Trail,/members/arron-hampton,5,A must do for the 360 degree views of the Vall...
3,0_3,0,Camelback Mountain via Echo Canyon Trail,/members/maddie-wilson,5,Very challenging and fun!
4,0_4,0,Camelback Mountain via Echo Canyon Trail,/members/giselle-johnson,5,Went when it was slightly rainy and enjoyed th...
5,0_5,0,Camelback Mountain via Echo Canyon Trail,/members/mark-mckeever-1,5,"Very nice hike. At 62, it was a challenge. Saw..."
6,0_6,0,Camelback Mountain via Echo Canyon Trail,/members/shaun-otoole,5,Amazing Trail. More treacherous than the photo...
7,0_7,0,Camelback Mountain via Echo Canyon Trail,/members/katie-plew,5,Great workout definitely worth the view!
8,0_8,0,Camelback Mountain via Echo Canyon Trail,/members/ronald-catague,5,physically demanding and definitely not for fi...
9,0_9,0,Camelback Mountain via Echo Canyon Trail,/members/wissam-baghdadi,5,"I started at 10:30 am, I was surprised to see ..."


From there, the data had to be cleaned and manipulated so that it could be represented numerically. In this recommendation system, only the trail details were used, not the reviews. In the dist column, any value given in kilometers was converted to miles and the text was removed. The elev column was handled similarly: meters were converted to feet and the text was removed. type was dummied, and difficulty was mapped to a number (EASY = 1, MODERATE = 2, HARD = 3), as the output below shows. Each tag was represented as its own column, receiving a 1 if it was present in the trail description and a 0 if not. The overview and full_desc columns were combined and TF-IDF was performed so that words more distinctive to a trail received a higher weight. Finally, the data was normalized to put all features on the same scale.

In [17]:
from hiking_data_v1 import DetailsShaper
dtails_shaper = DetailsShaper(details)
dtails_shaper.adjust_columns()
dtails_shaper.fix_column_data()
original_df = dtails_shaper.proper_df
original_df.head(10)

Unnamed: 0,trail_id,trail_name,dist,elev,difficulty,num_completed,latitude,longitude,text,type_Out & Back,...,cave,horseback riding,waterfall,birding,rocky,hot springs,lake,over grown,city walk,scramble
0,0,Camelback Mountain via Echo Canyon Trail,2.4,1423.0,3,5629.0,33.52273,-111.97467,Camelback Mountain via Echo Canyon Trail is a ...,1,...,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0
1,1,Devils Bridge Trail,4.2,564.0,2,5221.0,34.88774,-111.82262,Devils Bridge Trail is a 4.2 mile heavily traf...,1,...,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0
2,2,Cathedral Rock Trail,1.2,744.0,3,3676.0,34.82518,-111.7886,Cathedral Rock Trail is a 1.2 mile heavily tra...,1,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
3,3,Flatiron Via Siphon Draw Trail,6.2,2933.0,3,3120.0,33.45949,-111.48,Flatiron Via Siphon Draw Trail is a 6.2 mile h...,1,...,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0
4,4,Piestewa Peak Summit Trail #300,2.2,1151.0,3,4336.0,33.53943,-112.02341,Piestewa Peak Summit Trail #300 is a 2.2 mile ...,1,...,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0
5,5,West Fork Trail,7.2,820.0,1,2787.0,34.99098,-111.74266,West Fork Trail is a 7.2 mile heavily traffick...,1,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
6,6,"Havasu Falls, Mooney Falls, and Beaver Falls",24.5,3307.0,3,1648.0,36.1596,-112.70941,"Havasu Falls, Mooney Falls, and Beaver Falls i...",1,...,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
7,7,Tom's Thumb Trail,4.0,1236.0,2,2534.0,33.69459,-111.80149,Tom's Thumb Trail is a 4 mile heavily traffick...,1,...,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
8,8,Hieroglyphic Trail,2.8,567.0,2,2974.0,33.389725,-111.424711,Hieroglyphic Trail is a 2.8 mile moderately tr...,1,...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0
9,9,Peralta Trail to Fremont Saddle,5.1,1571.0,2,2303.0,33.39755,-111.34791,Peralta Trail to Fremont Saddle is a 5.1 mile ...,1,...,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0
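
The unit cleanup and tag encoding described above might look something like the sketch below. It is not the DetailsShaper implementation: the helper names, the metric unit labels ("km", "m"), and the choice to encode tags straight from the comma-separated tags column are all assumptions made for illustration.

```python
import re

KM_TO_MILES = 0.621371
M_TO_FEET = 3.28084


def parse_measure(text, metric_units, factor):
    """Return the number in `text`, converted when its unit is in `metric_units`."""
    match = re.search(r"([\d,]*\.?\d+)\s*([A-Za-z]+)", text)
    if match is None:
        return None  # no recognizable "<number> <unit>" pattern
    value = float(match.group(1).replace(",", ""))  # "1,423" -> 1423.0
    unit = match.group(2).lower()
    return value * factor if unit in metric_units else value


def to_miles(text):
    # Assumed metric labels; the real pages may spell the unit differently
    return parse_measure(text, {"km", "kilometers"}, KM_TO_MILES)


def to_feet(text):
    return parse_measure(text, {"m", "meters"}, M_TO_FEET)


def encode_tags(tag_strings):
    """Multi-hot encode comma-separated tag strings into rows of 0/1."""
    split = [{t.strip() for t in s.split(",") if t.strip()} for s in tag_strings]
    vocab = sorted(set().union(*split))
    rows = [[1 if tag in tags else 0 for tag in vocab] for tags in split]
    return vocab, rows


print(to_miles("2.4 miles"))   # 2.4
print(to_feet("1,423 feet"))   # 1423.0
vocab, rows = encode_tags(["hiking,birding", "hiking,views"])
print(vocab)                   # ['birding', 'hiking', 'views']
print(rows)                    # [[1, 1, 0], [0, 1, 1]]
```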


In [14]:
dtails_shaper.tfidf()
dtails_shaper.transform()
df = dtails_shaper.transformed_df
df.head(10)

Unnamed: 0,trail_id,trail_name,dist,elev,num_completed,latitude,longitude,difficulty,type_Out & Back,type_Point to Point,...,views trail,walking,walking nature,walking trail,west,wild,wild flowers,wildlife,year,year round
0,0,Camelback Mountain via Echo Canyon Trail,-23.280468,2.516877,13.282782,-0.276528,-0.420509,3,1,0,...,0.0,0.0,0.0,0.0,0.0,0.056089,0.056089,0.0,0.05165,0.051868
1,1,Devils Bridge Trail,-14.255669,-8.86879,12.287327,0.854752,-0.233302,2,1,0,...,0.0,0.0,0.0,0.0,0.0,0.049119,0.049119,0.0,0.045231,0.045422
2,2,Cathedral Rock Trail,-29.297,-6.482969,8.517773,0.802904,-0.191416,3,1,0,...,0.0,0.060036,0.0,0.100312,0.0,0.0,0.0,0.093573,0.047315,0.047514
3,3,Flatiron Via Siphon Draw Trail,-4.228114,22.531263,7.161222,-0.328939,0.188538,3,1,0,...,0.0,0.0,0.0,0.0,0.0,0.125959,0.125959,0.0,0.0,0.0
4,4,Piestewa Peak Summit Trail #300,-24.283223,-1.088363,10.128068,-0.262687,-0.480519,3,1,0,...,0.0,0.0,0.0,0.0,0.0,0.09403,0.09403,0.0,0.0,0.0
5,5,West Fork Trail,0.785663,-5.475623,6.348755,0.940314,-0.134854,1,1,0,...,0.0,0.0,0.0,0.0,0.203949,0.0,0.0,0.0,0.049253,0.049461
6,6,"Havasu Falls, Mooney Falls, and Beaver Falls",87.524008,27.488469,3.569776,1.908831,-1.325136,3,1,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.05782,0.058064
7,7,Tom's Thumb Trail,-15.258424,0.038274,5.731475,-0.134095,-0.207287,2,1,0,...,0.0,0.0,0.0,0.0,0.0,0.075396,0.075396,0.0,0.069429,0.069722
8,8,Hieroglyphic Trail,-21.274957,-8.829026,6.805005,-0.386758,0.256611,2,1,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.109786,0.110249
9,9,Peralta Trail to Fremont Saddle,-9.743269,4.478552,5.167872,-0.380273,0.35117,2,1,0,...,0.0,0.0,0.0,0.0,0.0,0.139376,0.139376,0.0,0.128345,0.128887
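
To make the TF-IDF weighting concrete, here is a minimal, library-free sketch using the textbook formula (term frequency times the log of inverse document frequency). It is only an illustration under stated assumptions: the function name and sample descriptions are made up, it handles single words only (column names such as "year round" above suggest the real pipeline also used two-word phrases), and production implementations such as scikit-learn's TfidfVectorizer apply a smoothed, normalized variant, so the numbers will differ.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocab, rows); rows[i][j] is the TF-IDF weight of vocab[j] in docs[i]."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many descriptions each word appears
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([
            (counts[w] / len(toks)) * math.log(n_docs / doc_freq[w]) if toks else 0.0
            for w in vocab
        ])
    return vocab, rows


# Hypothetical descriptions, not taken from the scraped data
vocab, rows = tfidf(["steep rocky trail", "shaded creek trail"])
for word, weight in zip(vocab, rows[0]):
    print(word, round(weight, 3))
```

A word that appears in every description ("trail" here) gets a weight of zero, while a word unique to one description ("steep") gets a positive weight, which is the behaviour the paragraph above relies on.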


#### How are recommendations made?

The cosine similarity was then calculated between every pair of trails, and the resulting matrix is used to find the trails with the highest similarity to a given trail.

<img src="https://www.oreilly.com/library/view/statistics-for-machine/9781788295758/assets/2b4a7a82-ad4c-4b2a-b808-e423a334de6f.png">

In [16]:
from trail_recommender_v1 import ContentBased

content_based = ContentBased(df)
cos_mat = content_based.create_cosine_mat()
cos_mat

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1408,1409,1410,1411,1412,1413,1414,1415,1416,1417
0,1.000000,0.838909,0.923659,0.401936,0.980990,0.247575,-0.756357,0.971519,0.865435,0.927323,...,0.680423,0.468680,-0.532783,0.729247,-0.770802,0.762274,0.723051,-0.683180,0.709959,0.710376
1,0.838909,1.000000,0.887469,-0.062084,0.872358,0.650405,-0.730731,0.844646,0.921011,0.654152,...,0.278456,0.744976,-0.759672,0.786257,-0.762386,0.780157,0.780844,-0.788911,0.788943,0.788674
2,0.923659,0.887469,1.000000,0.074474,0.973468,0.263370,-0.929229,0.961555,0.978940,0.787856,...,0.577600,0.741712,-0.794619,0.925855,-0.940898,0.936941,0.922079,-0.902610,0.918121,0.917059
3,0.401936,-0.062084,0.074474,1.000000,0.260036,-0.287438,0.128733,0.301368,-0.069317,0.634722,...,0.625075,-0.569486,0.527219,-0.228475,0.141848,-0.150897,-0.235253,0.323304,-0.261807,-0.265950
4,0.980990,0.872358,0.973468,0.260036,1.000000,0.251984,-0.853705,0.986832,0.932986,0.883665,...,0.660906,0.609602,-0.663891,0.839875,-0.865738,0.865465,0.832454,-0.799325,0.824184,0.823109
5,0.247575,0.650405,0.263370,-0.287438,0.251984,1.000000,-0.055166,0.231576,0.378902,0.089124,...,-0.326868,0.426168,-0.337302,0.195067,-0.098865,0.166629,0.191708,-0.229149,0.217066,0.220961
6,-0.756357,-0.730731,-0.929229,0.128733,-0.853705,-0.055166,1.000000,-0.840292,-0.919949,-0.603066,...,-0.531492,-0.811824,0.886588,-0.977807,0.997077,-0.976311,-0.980550,0.969811,-0.972283,-0.972357
7,0.971519,0.844646,0.961555,0.301368,0.986832,0.231576,-0.840292,1.000000,0.917514,0.903669,...,0.693849,0.586121,-0.630106,0.825686,-0.846614,0.856945,0.819028,-0.775577,0.810164,0.807402
8,0.865435,0.921011,0.978940,-0.069317,0.932986,0.378902,-0.919949,0.917514,1.000000,0.700224,...,0.461003,0.826665,-0.863037,0.942252,-0.935689,0.942337,0.942782,-0.931958,0.941519,0.939072
9,0.927323,0.654152,0.787856,0.634722,0.883665,0.089124,-0.603066,0.903669,0.700224,1.000000,...,0.775561,0.240643,-0.285820,0.557231,-0.603020,0.613093,0.551359,-0.479349,0.532934,0.527566
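
As a rough illustration of what a matrix like this contains, the sketch below computes pairwise cosine similarity for a few toy feature vectors. It is not the ContentBased code; the function names and vectors are hypothetical, and a real pipeline would more likely use a vectorized routine.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def cosine_matrix(rows):
    """Square matrix whose entry [i][j] is the similarity of rows i and j."""
    return [[cosine_similarity(u, v) for v in rows] for u in rows]


# Toy vectors: same direction, perpendicular, and opposite direction
features = [[1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
for row in cosine_matrix(features):
    print([round(x, 2) for x in row])
```

The diagonal is always 1 because each trail is identical to itself, and scores can drop below zero when the scaled features contain negative values, as they do in the matrix above.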


In [8]:
recommendations = content_based.recommend(trail = "Camelback Mountain via Echo Canyon Trail", cosine_mat=cos_mat)
recommendations

4      4
7      7
14    14
26    25
13    13
9      9
21    20
2      2
17    16
24    23
Name: trail_id, dtype: int64
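
The ranking step can be sketched as a sort over one row of the similarity matrix. `top_n_similar` is a hypothetical helper, not the actual `recommend` method; the scores are the first ten entries of row 0 of the matrix shown earlier, so the result covers only trails 0 through 9.

```python
def top_n_similar(sim_matrix, trail_idx, n=10):
    """Indices of the n most similar trails, best first, excluding the trail itself."""
    scores = sim_matrix[trail_idx]
    candidates = (i for i in range(len(scores)) if i != trail_idx)
    ranked = sorted(candidates, key=lambda i: scores[i], reverse=True)
    return ranked[:n]


# First ten similarity scores for trail 0 (Camelback Mountain via Echo Canyon Trail)
row_0 = [1.000000, 0.838909, 0.923659, 0.401936, 0.980990,
         0.247575, -0.756357, 0.971519, 0.865435, 0.927323]
print(top_n_similar([row_0], trail_idx=0, n=4))  # [4, 7, 9, 2]
```

Trails 4, 7, 9, and 2 appear in the same relative order in the full recommendation list above, with trails from outside this ten-column slice interleaved.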

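We can then match the trail_ids with the trail names for an output. Note that isin keeps rows in their original dataframe order, so the table below is sorted by index rather than by similarity score.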

In [18]:
original_df[original_df['trail_id'].isin(recommendations)]

Unnamed: 0,trail_id,trail_name,dist,elev,difficulty,num_completed,latitude,longitude,text,type_Out & Back,...,cave,horseback riding,waterfall,birding,rocky,hot springs,lake,over grown,city walk,scramble
2,2,Cathedral Rock Trail,1.2,744.0,3,3676.0,34.82518,-111.7886,Cathedral Rock Trail is a 1.2 mile heavily tra...,1,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
4,4,Piestewa Peak Summit Trail #300,2.2,1151.0,3,4336.0,33.53943,-112.02341,Piestewa Peak Summit Trail #300 is a 2.2 mile ...,1,...,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0
7,7,Tom's Thumb Trail,4.0,1236.0,2,2534.0,33.69459,-111.80149,Tom's Thumb Trail is a 4 mile heavily traffick...,1,...,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
9,9,Peralta Trail to Fremont Saddle,5.1,1571.0,2,2303.0,33.39755,-111.34791,Peralta Trail to Fremont Saddle is a 5.1 mile ...,1,...,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0
13,13,Pinnacle Peak Trail,3.5,1020.0,2,2162.0,33.72793,-111.86029,Pinnacle Peak Trail is a 3.5 mile heavily traf...,1,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
14,14,Camelback Mountain Via Cholla Trail,2.4,1158.0,3,3144.0,33.513597,-111.948474,Camelback Mountain Via Cholla Trail is a 2.4 ...,1,...,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0
17,16,Hidden Valley Trail Via Mormon Trail,3.4,918.0,2,2169.0,33.366413,-112.031064,Hidden Valley Trail Via Mormon Trail is a 3.4 ...,0,...,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0
21,20,South Kaibab Trail to Cedar Ridge,3.1,1177.0,3,1754.0,36.05346,-112.08361,South Kaibab Trail to Cedar Ridge is a 3.1 mil...,1,...,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
24,23,Sunrise Peak via Sunrise Trail,3.6,1112.0,2,1297.0,33.59616,-111.76813,Sunrise Peak via Sunrise Trail is a 3.6 mile m...,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
26,25,Mormon Loop to National Trail Loop,4.7,1167.0,2,1695.0,33.366431,-112.030978,Mormon Loop to National Trail Loop is a 4.7 mi...,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
