### Collaborative Filtering Trail Recommender

This recommender gathers the reviews for every trail and uses them to predict the rating a user would give to trails they have not yet rated. Trails are then recommended to that user based on those predicted ratings.

#### Where is the data coming from?

The data was scraped from AllTrails.com using BeautifulSoup and Selenium. A list of 1418 trails in Arizona was gathered, and the reviews of each trail were then collected and exported to a SQL database. Collecting the reviews was a bit challenging, because the only way to see more of them was to click a button that loaded 30 additional reviews at a time. Selenium was therefore used to locate the button and click it (total reviews/30) times. For this demonstration, the class created to extract the data from the webpages will not be used. The data will instead be imported from the SQL database.
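
The repeated clicking described above can be sketched roughly as follows. This is not the scraper class used for the project: `load_all_reviews`, `click_load_more`, and `FakePage` are hypothetical names introduced for illustration. With Selenium, `click_load_more` would wrap locating the button and calling `.click()` on it, returning `False` once the button can no longer be found. The click count is rounded up, which may be one more than strictly needed if the first batch of reviews is already visible, so the loop also stops early when the callback reports that nothing is left to load.

```python
import math
from typing import Callable


def load_all_reviews(click_load_more: Callable[[], bool],
                     total_reviews: int,
                     page_size: int = 30) -> int:
    """Click the 'load more' control about (total_reviews / page_size) times.

    click_load_more is a hypothetical callback: it should click the button
    and return True, or return False when no button is left to click.
    Returns the number of successful clicks.
    """
    clicks = 0
    for _ in range(math.ceil(total_reviews / page_size)):
        if not click_load_more():
            break  # nothing more to load
        clicks += 1
    return clicks


class FakePage:
    """Stand-in for a browser page so the loop can run without Selenium."""

    def __init__(self, total: int, page_size: int = 30):
        self.total = total
        self.page_size = page_size
        self.visible = min(page_size, total)  # assumption: first batch is shown

    def click_load_more(self) -> bool:
        if self.visible >= self.total:
            return False
        self.visible = min(self.visible + self.page_size, self.total)
        return True


page = FakePage(total=95)
clicks = load_all_reviews(page.click_load_more, total_reviews=95)
print(clicks, page.visible)
```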

In [14]:
import pandas as pd
import pandas.io.sql as sqlio
import psycopg2

# Connect to the local PostgreSQL database that holds the scraped reviews
conn = psycopg2.connect("dbname='az_trail_recommender' user='josephdoperalski' host='localhost'")
cur = conn.cursor()

# Load every review into a DataFrame
query_reviews = '''SELECT * FROM trail_reviews'''
reviews = sqlio.read_sql_query(query_reviews, conn)

# Drop the redundant 'index' column stored in the table
reviews.drop('index', axis = 1, inplace = True)
reviews.head(10)

Unnamed: 0,review_id,trail_id,trail_name,user,rating,body
0,0_0,0,Camelback Mountain via Echo Canyon Trail,/members/arron-hampton,5,Super hard. It’s all scrambling after about 1/...
1,0_1,0,Camelback Mountain via Echo Canyon Trail,/members/brian-white-18,4,First time hiking this trail and it did not di...
2,0_2,0,Camelback Mountain via Echo Canyon Trail,/members/arron-hampton,5,A must do for the 360 degree views of the Vall...
3,0_3,0,Camelback Mountain via Echo Canyon Trail,/members/maddie-wilson,5,Very challenging and fun!
4,0_4,0,Camelback Mountain via Echo Canyon Trail,/members/giselle-johnson,5,Went when it was slightly rainy and enjoyed th...
5,0_5,0,Camelback Mountain via Echo Canyon Trail,/members/mark-mckeever-1,5,"Very nice hike. At 62, it was a challenge. Saw..."
6,0_6,0,Camelback Mountain via Echo Canyon Trail,/members/shaun-otoole,5,Amazing Trail. More treacherous than the photo...
7,0_7,0,Camelback Mountain via Echo Canyon Trail,/members/katie-plew,5,Great workout definitely worth the view!
8,0_8,0,Camelback Mountain via Echo Canyon Trail,/members/ronald-catague,5,physically demanding and definitely not for fi...
9,0_9,0,Camelback Mountain via Echo Canyon Trail,/members/wissam-baghdadi,5,"I started at 10:30 am, I was surprised to see ..."


From here, some manipulation had to be done. The review_id, trail_name, and body columns were dropped, because the model behind the recommendation engine only requires the trail_id, user, and rating. The user column was also altered to remove the '/members/' prefix from each username. Running the cell below outputs the dataframe that will be fed into the model.

In [15]:
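import sys
sys.path.append('..')  # make the project module in the parent directory importable
from hiking_data_v1 import ReviewsShaper

# Clean the raw reviews and build the (user, trail_id, rating) dataframe
rev_shaper = ReviewsShaper(reviews)
rev_shaper.fix_column_data()
rev_shaper.user2user()
df = rev_shaper.user2user_df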
df.head(10)

Unnamed: 0,user,trail_id,rating
0,007297,14,5.0
1,0scararn0ld,39,5.0
2,0scararn0ld,311,5.0
3,101southdrum-user,584,5.0
4,1337killer,248,5.0
5,139ss6,25,5.0
6,1433jenn,1479,4.0
7,14burrito,0,5.0
8,14burrito,22,4.0
9,14burrito,94,4.0
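
The reshaping described above can be approximated in plain pandas. This is a minimal sketch and not the `ReviewsShaper` implementation: `shape_reviews` is a hypothetical helper, and averaging repeat reviews of the same trail by the same user is an assumption (the source only states which columns are kept and that the '/members/' prefix is removed). The sample rows reuse the users, trail, and ratings shown in the first output table, with the review text replaced by placeholders.

```python
import pandas as pd


def shape_reviews(reviews: pd.DataFrame) -> pd.DataFrame:
    """Reduce raw reviews to the (user, trail_id, rating) columns."""
    # Keep only the columns the model needs
    shaped = reviews[["user", "trail_id", "rating"]].copy()
    # Strip the '/members/' prefix from the usernames
    shaped["user"] = shaped["user"].str.replace("/members/", "", regex=False)
    # Assumption: a user who reviewed a trail more than once gets the mean rating
    return shaped.groupby(["user", "trail_id"], as_index=False)["rating"].mean()


sample = pd.DataFrame({
    "review_id": ["0_0", "0_1", "0_2"],
    "trail_id": [0, 0, 0],
    "trail_name": ["Camelback Mountain via Echo Canyon Trail"] * 3,
    "user": ["/members/arron-hampton",
             "/members/brian-white-18",
             "/members/arron-hampton"],
    "rating": [5, 4, 5],
    "body": ["...", "...", "..."],  # review text is not needed by the model
})

shaped = shape_reviews(sample)
print(shaped)
```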


#### How are recommendations made?

This collaborative filtering recommendation system was created using a matrix factorization model, UVD. This type of model maps users and items (in this case, trails) to a shared latent feature space of some pre-decided dimensionality, r. Multiplying the user factors by the trail factors reconstructs the original rating matrix, and the reconstructed entries serve as the predicted ratings, including for trails a user has not rated.
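
To make the idea concrete, here is a minimal NumPy sketch of a UV factorization fitted with stochastic gradient descent on the observed ratings only. It is an illustration under stated assumptions, not the model used for this project: the `factorize` function, the learning rate, the regularization strength, the number of epochs, and the tiny set of integer-indexed ratings are all invented for the example.

```python
import numpy as np


def factorize(ratings, n_users, n_items, r=2, lr=0.02, reg=0.02,
              epochs=300, seed=0):
    """Fit U (n_users x r) and V (r x n_items) so that U @ V approximates
    the observed ratings. `ratings` is a list of (user, item, rating)."""
    rng = np.random.default_rng(seed)
    U = rng.normal(scale=0.1, size=(n_users, r))
    V = rng.normal(scale=0.1, size=(r, n_items))
    for _ in range(epochs):
        for u, i, rating in ratings:
            # Error between the observed rating and its current reconstruction
            err = rating - U[u] @ V[:, i]
            u_old = U[u].copy()
            # Gradient step on squared error with L2 regularization
            U[u] += lr * (err * V[:, i] - reg * U[u])
            V[:, i] += lr * (err * u_old - reg * V[:, i])
    return U, V


# Illustrative data: (user index, trail index, rating); each user rated 2 of 3 trails
observed = [(0, 0, 5.0), (0, 1, 4.0),
            (1, 0, 5.0), (1, 2, 3.0),
            (2, 1, 4.0), (2, 2, 3.0)]

U, V = factorize(observed, n_users=3, n_items=3)
predicted = U @ V  # full matrix, including the unrated user/trail pairs

# Mean squared error on the observed entries
mse = np.mean([(rating - predicted[u, i]) ** 2 for u, i, rating in observed])
print(predicted.round(2))
print(mse)
```

Entries of `predicted` at positions with no observed rating, such as user 0 and trail 2, are the model's predicted ratings, and the highest of them for a given user would be the trails to recommend.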