Using machine learning, data mining, and data visualization techniques
-
Platform - Spark Cluster (Databricks Cloud)
-
Database - Spark Cluster Tables
-
Coding Language - Python, Spark SQL
-
Data Visualization - Tableau, d3, Spark Cluster DataFrame charts
-
travel++ slides.pdf - Project Proposal Presentation Slides
-
trave++ poster.pdf - Travel++ final poster
-
report.pdf - Travel++ final report
Main Features
-
Gossip Queen
- World Wide Trends
- Real Time Hottest Topics
- Current Hottest Tourism Spots based on social media posts
-
Dr.Q
- Automatically answers travel-related questions using similar Reddit posts
- When Reddit cannot provide strong enough answers, crawls travel links from Wiki pages and searches for relevant references and images to answer the user query
-
Map Attentive - https://github.com/shruthi-mohan/travel_plus_plus/tree/master
- Hotel and restaurant recommendations visualized on a map
- Personalized tourist spots, allowing users to choose the style they like on the map
- Weekly updated top tourism spots
-
gossip_queen.py
-
part 1 - longer-term popular topics worldwide
-
part 2 - real-time popular topics worldwide
-
part 3 - recommends the current hottest tourism spots based on social media posts
-
GQ_real_time_trends_animation.html, GQ_real_time_trends_animation.gif
-
Uses d3 to create a daily real-time trends animation. Mouse over any moving circle to see the topics it represents.
-
DrQ_store_reddit_data.py - stores Reddit data in tables so that user queries are easy to match against it
-
DrQ_match_reddit_posts.py
- Level 1 Method - Find matched posts using Levenshtein Distance
- Level 2 Method - Calculate matching scores using NN entities
- Level 3 Method - Calculate scores using words (tokens)
- Level 3 Method - Approach 1: Calculate scores based on word locations
- Level 3 Method - Approach 2: Calculate scores based on word distance
- Level 3 Method - Approach 3: Calculate scores based on token frequency
- Level 3 Method - Approach 4: Combine the three approaches above, assigning a weight to each
- Sample of accurate output:
User Query = "Advice for Europe trip?"
Top 5 returned Reddit posts: (https://www.reddit.com/r/travel/4b46hd, https://www.reddit.com/r/travel/4arqzu, https://www.reddit.com/r/travel/48uu2h, https://www.reddit.com/r/travel/4atds4, https://www.reddit.com/r/travel/2ltqv3)
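
The Level 1 method can be illustrated with a minimal sketch. It assumes post titles are already loaded as plain strings; `levenshtein` and `rank_posts` are hypothetical names for illustration, not functions taken from DrQ_match_reddit_posts.py, and the script itself may normalize or score text differently.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def rank_posts(query, titles, top_n=5):
    """Return the top_n titles with the smallest edit distance to the query.

    Hypothetical helper: lowercasing is an assumed normalization step.
    """
    q = query.lower()
    return sorted(titles, key=lambda t: levenshtein(q, t.lower()))[:top_n]


if __name__ == "__main__":
    # Made-up titles, used only to exercise the functions.
    titles = [
        "Best camera for travel",
        "Advice for a Europe trip",
        "Europe trip advice?",
    ]
    print(rank_posts("Advice for Europe trip?", titles, top_n=2))
```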
-
DrQ_tables_and_relationships.pdf
- These tables and relationships will help you better understand how I built the Dr.Q Reddit Post Match and Search Engine.
-
DrQ_search_engine.py
- When the user query cannot get a high matching score in DrQ_match_reddit_posts.py, this code finds relevant wiki pages and images for the user's reference
- Approach 1: Calculate scores based on word locations
- Approach 2: Calculate scores based on word distance
- Approach 3: Calculate scores based on token frequency
- Approach 4: Combine the three approaches above, assigning a weight to each
- Sample of accurate output:
Query Tokens = [u'camera', u'travel', u'free']
Top 5 returned Wikipedia pages: (https://en.wikipedia.org/wiki/Digital_cameras, https://en.wikipedia.org/wiki/The_Traveler_(novel), https://en.wikipedia.org/wiki/The_Traveler_(1974_film), https://en.wikipedia.org/wiki/Elevator, https://en.wikipedia.org/wiki/Guided_bus)
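
The weighted combination in Approach 4 might look roughly like the sketch below. The three score definitions and the default weights are assumptions chosen for illustration; this README does not specify the formulas used in DrQ_search_engine.py, and all function names here are hypothetical.

```python
def location_score(query, doc):
    """Assumed definition: higher when query tokens first appear early in doc."""
    if not query:
        return 0.0
    total = 0.0
    for tok in query:
        if tok in doc:
            total += 1.0 / (1 + doc.index(tok))
    return total / len(query)


def distance_score(query, doc):
    """Assumed definition: higher when matched query tokens sit close together."""
    positions = [doc.index(tok) for tok in query if tok in doc]
    if len(positions) < 2:
        return 0.0  # closeness is undefined with fewer than two matches
    span = max(positions) - min(positions)
    return 1.0 / (1 + span)


def frequency_score(query, doc):
    """Assumed definition: fraction of doc tokens that are query tokens."""
    if not doc:
        return 0.0
    wanted = set(query)
    return sum(1 for tok in doc if tok in wanted) / len(doc)


def combined_score(query, doc, weights=(0.3, 0.3, 0.4)):
    """Weighted sum of the three scores; the weights are placeholders."""
    w_loc, w_dist, w_freq = weights
    return (w_loc * location_score(query, doc)
            + w_dist * distance_score(query, doc)
            + w_freq * frequency_score(query, doc))


if __name__ == "__main__":
    query = ["camera", "travel", "free"]
    # Made-up token lists standing in for crawled pages.
    pages = {
        "page_a": ["camera", "travel", "free", "guide"],
        "page_b": ["elevator", "history", "of", "the", "camera"],
    }
    ranked = sorted(pages, key=lambda p: combined_score(query, pages[p]), reverse=True)
    print(ranked)
```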
-
daily_flickr_photos_csv.py - Creates CSV tables in Spark Cluster
-
daily_flickr_photos_parquet.py - Creates Parquet tables in Spark Cluster and adds a bar chart showing the daily photo posting trend
-
dataframe_visualization.png - Sample bar chart showing the daily photo posting trend. Spark Cluster makes simple charts like this convenient: write the Spark SQL, then click the chart button.
-
merge_spark_tables.sql - Merges tables on Spark Cluster
-
parquet_to_table.sql - Generates a table from a Parquet file
-
DetailedReadMe.txt - Detailed technical notes