Keith Qu - Data Science with Python
Prediction and Analysis
Yelp Star Prediction
I wanted to see which factors, beyond the review text itself, affect Yelp star ratings, so I run a Keras classifier on 630,000+ reviews of businesses in the Toronto area. I also incorporate external factors such as temperature and day of week to see whether they help predict star ratings.
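As a rough illustration of what adding external factors can look like, here is a minimal sketch that turns a review date and a temperature reading into a numeric feature vector. The function names and the choice of a one-hot day-of-week encoding are assumptions for illustration, not the notebook's actual preprocessing.

```python
from datetime import date
from typing import List


def day_of_week_one_hot(review_date: date) -> List[int]:
    """Return a 7-element one-hot vector, Monday first."""
    vec = [0] * 7
    vec[review_date.weekday()] = 1  # weekday(): Monday == 0 ... Sunday == 6
    return vec


def external_features(review_date: date, temperature_c: float) -> List[float]:
    """Concatenate the day-of-week encoding with the temperature (hypothetical layout)."""
    return day_of_week_one_hot(review_date) + [temperature_c]


# Example: a review written on a Saturday at -3.5 degrees Celsius (made-up values).
features = external_features(date(2018, 1, 6), -3.5)
```

A vector like this could be concatenated with the text representation before the final dense layers, which is one common way to mix text and tabular inputs.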
Estimating Manhattan Office Rental Rates
I construct an 11-feature dataset of 1,369 Manhattan office rentals, of which 539 have listed rental rates in $/sqft/year. Using the 539 observations as a training set, I convert addresses and strings into quantitative data and engineer features to build a predictive model for the remaining 830 prices.
Mercari Price Suggestion
With data from the Mercari Price Suggestion Challenge, I... uh, challenge myself to suggest prices. Currently in the top 33% after one submission, so there is a lot of room to improve. If only I could use the same batch sizes on my computer as in the Kaggle kernel. Time to get that GTX 1080 Ti...
Toxic Comments Classification
A quick look at multilabel classification of different forms of toxic internet comments using logistic regression, NB-SVM, and convolutional and recurrent neural networks with the Keras API running on a TensorFlow backend. There were a bunch of problems with this competition. First, the original test set was leaked. Then the new test set had a different distribution from the original, which made validation scores useless and put a lot of focus on adjusting the prediction results to fit the new distribution. Finally, the metric was switched to AUC, which messed up all the scores.
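For readers unfamiliar with NB-SVM, the core idea is to scale each feature by a Naive Bayes log-count ratio before fitting a linear classifier, and in the multilabel case to do this once per label. The sketch below computes that ratio for one label in plain Python. The toy documents, the smoothing value `alpha`, and the function name are assumptions for illustration, not the notebook's code.

```python
import math
from collections import Counter


def log_count_ratio(docs, labels, alpha=1.0):
    """Return {token: r} with r = log((p / |p|_1) / (q / |q|_1)).

    p and q are smoothed per-token document counts for the positive
    and negative class respectively.
    """
    vocab = sorted({tok for doc in docs for tok in doc})
    pos, neg = Counter(), Counter()
    for doc, y in zip(docs, labels):
        # Binarised features: each token counts once per document.
        (pos if y == 1 else neg).update(set(doc))
    p = {t: alpha + pos[t] for t in vocab}
    q = {t: alpha + neg[t] for t in vocab}
    p_norm, q_norm = sum(p.values()), sum(q.values())
    return {t: math.log((p[t] / p_norm) / (q[t] / q_norm)) for t in vocab}


# Toy tokenised comments; label 1 means "toxic" for this one label.
docs = [["you", "idiot"], ["nice", "edit"], ["idiot", "edit"]]
labels = [1, 0, 1]
r = log_count_ratio(docs, labels)
```

Tokens that appear mostly in positive documents get a positive ratio and tokens that appear mostly in negative documents get a negative one. A logistic regression or SVM is then trained on the features multiplied element-wise by `r`.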
Visualization and Other Stuff
Scoring Yahoo Finance Headlines
Yahoo Finance is one of the company's few remaining products that is actually very good, but support has been greatly reduced, and it no longer lets us specify a date range for headlines. By default, the page displays only 25 headlines, and more are injected into the page as you scroll down, so I use Selenium and Google's net logging to find the JSON URLs and access them directly. The JSON contains more data than what shows up on the page, including the exact date, so this could be pretty useful. The headlines are then converted into sentiment scores.
Greater Toronto Area Gas Map
Scrapes a comprehensive set of gas prices for the Toronto area and visualizes them with Folium. Includes some light analysis: gas stations within 100 m of a Starbucks tend to charge 0.6 cents more for regular gas!
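A proximity check like "within 100 m of a Starbucks" can be sketched with the haversine great-circle distance. This is one plausible way to do it, not necessarily how the notebook does. The coordinates and the `near_any` helper below are made up for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def near_any(station, cafes, radius_m=100.0):
    """True if the station lies within radius_m of at least one cafe."""
    return any(haversine_m(*station, *cafe) <= radius_m for cafe in cafes)


# Hypothetical coordinates in downtown Toronto.
station = (43.6500, -79.3800)
cafes = [(43.6505, -79.3800), (43.7000, -79.4000)]
is_near = near_any(station, cafes)
```

With a flag like this per station, comparing mean prices of the two groups is a simple group-by.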
Iterative K Means
Showcases how K Means clustering finds new centroids and labels with each iteration.
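The two alternating steps, assigning each point to its nearest centroid and then moving each centroid to the mean of its points, can be sketched in plain Python as below. This is a minimal 2-D version written for illustration, with hypothetical function names and toy data. It records the centroids and labels after every iteration so they can be plotted one step at a time.

```python
def assign_labels(points, centroids):
    """Label each point with the index of its nearest centroid (squared Euclidean)."""
    labels = []
    for x, y in points:
        dists = [(x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids]
        labels.append(dists.index(min(dists)))
    return labels


def update_centroids(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new = []
    for k, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == k]
        if not members:
            new.append(old)  # keep an empty cluster's centroid where it was
            continue
        new.append((sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members)))
    return new


def k_means_history(points, centroids, max_iter=100):
    """Run K-means and return a list of (centroids, labels) per iteration."""
    history = []
    for _ in range(max_iter):
        labels = assign_labels(points, centroids)
        new_centroids = update_centroids(points, labels, centroids)
        history.append((new_centroids, labels))
        if new_centroids == centroids:  # converged: centroids stopped moving
            break
        centroids = new_centroids
    return history


# Toy data: two obvious clusters and deliberately poor starting centroids.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
history = k_means_history(points, [(0, 0), (1, 0)])
```

Each entry of `history` is one frame of the animation: plot the points coloured by label and overlay the centroids for that iteration.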