# CIS 5450 Project: Difficulty Topics
**Group Members:**
* **Jessica Yang**
* **Julie Dai**
* **Yukun Zhou**


## Topic 1: Feature Engineering
[Hyperlink to Section 3.2: Computing Neighborhood-Level Cleanliness Features](https://colab.research.google.com/drive/1BH9G_jLxBbdTmtV4CRXfEiYlEiCODPVC#scrollTo=lHtsnJyiwJ9Q)

### Why we used this concept
Our research question investigates whether neighborhood cleanliness affects Yelp review sentiment. The raw datasets did not contain variables that directly connect environmental quality to review outcomes. To study this relationship, we needed engineered predictors that translate the 311 dataset into interpretable cleanliness metrics. Without feature engineering, our models would have no meaningful representation of neighborhood conditions.

### How we implemented it
We engineered several cleanliness-related and review-level features:
* `complaints_per_sq_mi`: complaint density normalized by neighborhood area
* `pct_illegal_dumping`: proportion of 311 complaints due to illegal dumping
* `complaints_30d` and `complaints_90d`: recent-activity windows to detect short-term spikes
* `review_text_length`: reviewer verbosity and potential sentiment richness
* price tier dummies: categorical encoding of restaurant cost level
* `total_complaints`: overall neighborhood complaint volume

The neighborhood-level features were computed from the 311 dataset and merged onto the review-level dataset, allowing both regression and classification models to incorporate neighborhood context.
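
As a rough illustration of how features like these could be derived, the sketch below aggregates a toy complaint table with pandas. The column names (`neighborhood`, `complaint_type`, `requested_at`, `area_sq_mi`, `text`, `price_tier`), the `"Illegal Dumping"` label, the reference date, and the helper `neighborhood_cleanliness_features` are all assumptions made for illustration, not the notebook's actual schema or code.

```python
import pandas as pd


def neighborhood_cleanliness_features(complaints, areas, as_of):
    """Aggregate 311 complaints into one row of features per neighborhood.

    complaints: columns neighborhood, complaint_type, requested_at (datetime)
    areas:      columns neighborhood, area_sq_mi
    as_of:      reference date for the 30- and 90-day windows (assumed)
    """
    df = complaints.copy()
    age_days = (as_of - df["requested_at"]).dt.days
    df["is_dumping"] = df["complaint_type"].eq("Illegal Dumping")
    df["in_30d"] = age_days.between(0, 29)
    df["in_90d"] = age_days.between(0, 89)

    feats = df.groupby("neighborhood").agg(
        total_complaints=("complaint_type", "size"),
        pct_illegal_dumping=("is_dumping", "mean"),
        complaints_30d=("in_30d", "sum"),
        complaints_90d=("in_90d", "sum"),
    ).reset_index()

    # Normalize complaint volume by neighborhood area.
    feats = feats.merge(areas, on="neighborhood", how="left")
    feats["complaints_per_sq_mi"] = feats["total_complaints"] / feats["area_sq_mi"]
    return feats


# Made-up toy data, only to exercise the function.
complaints = pd.DataFrame({
    "neighborhood": ["A", "A", "A", "B"],
    "complaint_type": ["Illegal Dumping", "Graffiti", "Illegal Dumping", "Graffiti"],
    "requested_at": pd.to_datetime(
        ["2022-01-25", "2021-12-01", "2021-06-01", "2022-01-30"]
    ),
})
areas = pd.DataFrame({"neighborhood": ["A", "B"], "area_sq_mi": [2.0, 0.5]})
features = neighborhood_cleanliness_features(
    complaints, areas, as_of=pd.Timestamp("2022-01-31")
)

# Review-level features: text length and one-hot price tiers.
reviews = pd.DataFrame({
    "neighborhood": ["A", "B"],
    "text": ["Great food", "Too loud, slow service"],
    "price_tier": [1, 2],
})
reviews["review_text_length"] = reviews["text"].str.len()
reviews = pd.get_dummies(reviews, columns=["price_tier"], prefix="price")

# Attach neighborhood context to each review.
reviews = reviews.merge(features, on="neighborhood", how="left")
```

The left merge keeps every review even when a neighborhood has no complaint records; whether such rows should be dropped or filled is a modeling decision this sketch leaves open.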

### Results & Interpretation
Engineering these features made it possible to test our hypothesis directly. However, the cleanliness variables showed very weak predictive signal. Both the linear regression and the boosted tree models placed far more weight on review-level attributes (`useful`, `funny`, `cool`) and business-level factors. This result suggests that, in our data, neighborhood cleanliness does not meaningfully influence Yelp sentiment. The engineered features were essential to reaching that conclusion even though they did not improve accuracy.

## Topic 2: Entity Linking

[Hyperlink to Section 2.2: Assigning Restaurants to Neighborhoods](https://colab.research.google.com/drive/1BH9G_jLxBbdTmtV4CRXfEiYlEiCODPVC#scrollTo=Q34d5SFUVOei)

[Hyperlink to Section 2.4: Merging Neighborhood Information into Review-Level Data](https://colab.research.google.com/drive/1BH9G_jLxBbdTmtV4CRXfEiYlEiCODPVC#scrollTo=x_8tX-oujlFv)

[Hyperlink to Section 4.2: Merging Reviews, Business Attributes, and Cleanliness Features (Final Merge)](https://colab.research.google.com/drive/1BH9G_jLxBbdTmtV4CRXfEiYlEiCODPVC#scrollTo=fkHNFs0UB5Io)

We include three hyperlinks because entity linking in our project happens in three distinct stages: mapping businesses to neighborhoods, merging neighborhood attributes into the review table, and performing the final full join that produces the modeling dataset.

### Why we used this concept
Our project integrates several datasets from unrelated sources, and these datasets do not share a common key. Without entity linking, we would have no principled way to attach neighborhood cleanliness metrics to individual Yelp reviews. Because our research question depends directly on the relationship between environmental quality and review sentiment, entity linking was essential.

### How we implemented it
We built a multi-step entity-linking pipeline that connected Yelp data to Philadelphia’s 311 neighborhood metrics. First, we linked Yelp reviews to Yelp businesses using the shared `business_id` key. Next, we connected Yelp businesses to neighborhood polygons. This step required additional cleaning because Yelp provides neighborhood names as free-text strings, while the 311 GIS shapefile uses standardized `MAPNAME` fields. We normalized both fields, aligned naming conventions, and removed ambiguous or mismatched entries to ensure reliable mapping.

After establishing a clean neighborhood linkage, we merged polygon-level complaint aggregates into the neighborhood GeoDataFrame, including `complaints_per_sq_mi`, `pct_illegal_dumping`, `total_complaints`, and short-term indicators such as `complaints_30d` and `complaints_90d`. In the final step, we enriched each Yelp review with its neighborhood’s cleanliness metrics, producing a unified review-level dataset that contains business attributes, environmental conditions, and full review metadata. This fully linked dataset serves as the backbone for all analyses and modeling in the project.
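
The linking steps described here can be sketched with plain pandas merges. This is a minimal sketch under stated assumptions: the toy rows are made up, `normalize_name` and `hood_key` are hypothetical names, and the normalization rule (lowercasing and treating punctuation as spaces) is only one plausible way to align free-text names with `MAPNAME` values. The notebook's actual cleaning rules may differ.

```python
import pandas as pd


def normalize_name(name: str) -> str:
    """Lowercase, turn non-alphanumeric characters into spaces, collapse whitespace."""
    kept = "".join(ch if ch.isalnum() else " " for ch in name.lower())
    return " ".join(kept.split())


# Made-up toy tables standing in for the Yelp and 311 data.
businesses = pd.DataFrame({
    "business_id": ["b1", "b2", "b3"],
    "neighborhood": ["University City ", "FISHTOWN", "Somewhere Else"],
})
neighborhoods = pd.DataFrame({
    "MAPNAME": ["University City", "Fishtown"],
    "complaints_per_sq_mi": [120.0, 85.5],
})
reviews = pd.DataFrame({
    "review_id": ["r1", "r2", "r3"],
    "business_id": ["b1", "b2", "b3"],
    "stars": [4, 5, 2],
})

# Step 1: build a shared, normalized key on both sides.
businesses["hood_key"] = businesses["neighborhood"].map(normalize_name)
neighborhoods["hood_key"] = neighborhoods["MAPNAME"].map(normalize_name)

# Step 2: business -> neighborhood. The inner join drops unmatched names, and
# validate= raises if a normalized key maps to more than one neighborhood.
linked_businesses = businesses.merge(
    neighborhoods, on="hood_key", how="inner", validate="many_to_one"
)

# Step 3: review -> business (and, through it, neighborhood metrics).
linked_reviews = reviews.merge(
    linked_businesses.drop(columns=["neighborhood", "hood_key"]),
    on="business_id",
    how="inner",
    validate="many_to_one",
)
```

Using `validate="many_to_one"` is a cheap guard against ambiguous keys silently duplicating review rows during the merge.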


### Results & Interpretation

Entity linking enabled the core analysis of the project. The fully merged dataset revealed that, despite careful spatial and categorical linkage, cleanliness variables had minimal predictive influence on review ratings. We could not have reached this finding without entity linking, which was one of the most technically involved components of the pipeline.

## Topic 3: Hyperparameter Tuning (XGBoost)
[Hyperlink to Section 8.4: XGBoost Model + RandomizedSearchCV](https://colab.research.google.com/drive/1BH9G_jLxBbdTmtV4CRXfEiYlEiCODPVC#scrollTo=pBW7aZo6wRh3)

### Why we used this concept
XGBoost was the most flexible model in our project, but its performance depended heavily on how the key hyperparameters were set. When we first tested the model with default settings, the predictions were noticeably less accurate, and the residuals showed a wider spread. Because the combined Yelp and cleanliness dataset is large and contains many non-linear relationships, we needed a systematic way to search the parameter space instead of guessing values by hand. Tuning was also important to make sure that the comparison with the baseline model and the Random Forest was fair.


### How we implemented it
We built an end-to-end pipeline that included preprocessing and an XGBoost regressor. We then defined a set of candidate values for depth, learning rate, the number of estimators, and sampling parameters. Using `RandomizedSearchCV`, we sampled from this space and evaluated each configuration with cross-validation. This allowed us to explore a wide range of possibilities without training an excessive number of models. After the search completed, we selected the best parameter combination and refit the model on the full training data.
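
A search of this kind might look roughly like the sketch below. The synthetic data, the candidate values, `n_iter`, the fold count, and the `StandardScaler` preprocessing step are all assumptions chosen so the example runs quickly; they are not the settings used in the notebook. If `xgboost` is not installed, the sketch falls back to scikit-learn's `GradientBoostingRegressor`, which accepts the same four parameter names used here.

```python
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

try:
    from xgboost import XGBRegressor as Booster
except ImportError:
    # Fallback so the sketch still runs without xgboost installed.
    from sklearn.ensemble import GradientBoostingRegressor as Booster

# Synthetic regression data with a non-linear term (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X[:, 0] ** 2 + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Preprocessing and model in one pipeline, so each CV fold refits both.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", Booster(random_state=0)),
])

# Candidate values for depth, learning rate, number of estimators, and row sampling.
param_distributions = {
    "model__max_depth": [2, 3, 4, 6],
    "model__learning_rate": [0.01, 0.05, 0.1, 0.3],
    "model__n_estimators": [50, 100, 200],
    "model__subsample": [0.6, 0.8, 1.0],
}

search = RandomizedSearchCV(
    pipeline,
    param_distributions=param_distributions,
    n_iter=5,                                # sample 5 configurations
    cv=3,                                    # 3-fold cross-validation
    scoring="neg_root_mean_squared_error",   # higher (less negative) is better
    random_state=0,
    refit=True,                              # refit the best config on all training data
)
search.fit(X, y)

best_params = search.best_params_
best_rmse = -search.best_score_  # flip the sign back to a positive RMSE
```

With `refit=True`, `search.best_estimator_` is the full pipeline retrained on all of the training data, so it can be used for prediction directly.
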

### Results & Interpretation
The tuned XGBoost model performed much better than the untuned version. It achieved the lowest RMSE among all models and produced the tightest residual bands in the diagnostic plots. The improvement suggests that tuning helped the model capture patterns in both the Yelp features and the cleanliness indicators that the default configuration missed. Overall, the process showed that careful hyperparameter search was essential for getting strong performance from XGBoost in this setting.