- Developed a classification machine learning model that will predict the price of Toronto Restaurants (F1-Score: 0.75 or 75%). This project was intended for people who want to explore the different cultures and restaurants the wonderful city of Toronto has to offer
- Utilized Python and BeautifulSoup to scrape 1000 restaurants and restaurant information from Yelp, you can find the scarping code here
- The data was cleaned and ready to use for the machine learning algorithms
- Exploratory Data Analysis (EDA) was performed to help understand the data better and visualize the data at hand
- The machine learning classification models that were used are:
- K-Nearest Neighbours
- Decision Tree Classifier
- Random Forest Classifier
- Logistic Regression
-
Exploratory Data Analysis.ipynb: Exploratory Data Analysis
-
Model Building.ipynb: Model Building
-
data_clean.py: Python file for data cleaning
-
yelp_scraper.py: Python file for scraping 1000 restaurants on yelp
-
yelp_foods.csv: csv files for yelp restaurant data
- Python: Version 3.7
- Packages: Pandas, Numpy, Sklearn, Matplotlib, Seaborn, BeautifulSoup, Imblearn
- Yelp
- Scraper code referenced from here
- Rest_name: Name of the restaurant
- Rest_rating: Average Rating of the restaurant
- Rest_noreviews: Number of reviews on the restaurant
- Rest_price: Average Price of the restaurant
- Rest_cuisine: Type of cuisine of the restaurant
- Rest_add: Address of the restaurant
- Rest_tel: Restaurant telephone number
- Rest_nbh: The neighbourhood of the restaurant
This part of the project took a great deal of time as there was a lot of cleaning that needed to be done in order to achieve clean data to feed into the machine learning model.
- Parse out restaurant ratings
- Change restaurant prices to integers instead of strings ($$)
- Categorize the type of cuisine into 9 different categories
- American
- Asian
- Caribbean
- African
- Middle Eastern
- European
- Mexican
- South American
- Other
- Parsed out just streets from rest_add
- Created a dictionary of streets matching to respective neighbourhood
- Mapped the dictionary to any NaN values in the neighbourhood column
- This replaced all NaN values in neighbourhood to the respective neighbourhood the street was in
- In the price section, there are alot (184 records) of values listed as 0, which indicates no price value has been recorded for that restaurant
- In order to account for this missing data, the price will be determined by the average price of restaurants in the given neighbourhood
Developped many visuals that helped understand and explore the data, a full EDA can be found here below are some highlights:
For this model, we will use the following machine learning algorithms:
- K-Nearest Neighbours
- Decision Tree Classifier
- Random Forest Classifier
- Logistic Regression
The F1-score for each machine learning algorithm is shown below:
- K-Nearest Neighbours: 0.53 (53%)
- Decision Tree Classifier: 0.7 (70%)
- Random Forest Classifier: 0.75 (75%)
- Logistic Regression: 0.5 (50%)
- In conclusion, the random forest classification performed the best at an average F1 score of 0.75 or 75%.
- From the classification report we can see that the Random Forest Classifier partnered with the SMOTETomek imbalanced learning performed the best.
- The average precision score was 0.75 (75%) and average recall score 0.78 (78%)
- What does this mean for our analysis?
- The precision score tells us, of all the restaurant prices we predicted using the model, what fraction or percentage was actaully true?
- In other words, it is the True Positives/(True Positives + False Positives)
- In our model, the precision score is separated by the price ranges (1-4) and each precision score is represented respectively
- The recall score tells us, of all the restaurants that gave it's price listings, what fraction did we correctly predict the price using our model?
- In other words, it is the True Positives/(True Positives + False Negatives)
- In the near future, I hope to productionize this model on a Flask API.