# **Independent Python Project – Yelp Rating Regression Predictor**

**Project Type:** Self-Initiated / Applied Data Science Project

**Language:** Python

**Timeline:** Summer 2025

**Tools and Techniques:** Python, pandas, NumPy, scikit-learn, Matplotlib, JSON data, linear regression


### 🧠 Project Overview

I independently developed a **Yelp Rating Regression Predictor** using real-world Yelp data spread across six JSON files. The aim was to investigate which factors most affect a restaurant’s Yelp rating and to predict the rating of a fictional restaurant: *Danielle’s Delicious Delicacies*. The project demonstrates end-to-end data science skills, including data cleaning, feature selection, exploratory data analysis, and multiple linear regression modeling.

### 💡 Key Features

- Merged and cleaned 6 real Yelp datasets using Pandas.
- Explored correlations between features and Yelp ratings.
- Built and compared multiple linear regression models using different feature subsets.
- Predicted Yelp ratings based on a hypothetical restaurant profile.


### 🧩 How It Works (Structure)

1. Load Yelp JSON datasets into Pandas DataFrames.
2. Merge all DataFrames on the common column `business_id`.
3. Drop non-numeric and non-informative features.
4. Replace missing values with 0 for modeling purposes.
5. Perform correlation analysis and visualization using `.corr()` and `matplotlib`.
6. Define feature subsets and build multiple regression models using `scikit-learn`.
7. Evaluate model fit using R² on the training and test sets, and interpret the feature coefficients.
8. Predict the Yelp rating for a new restaurant scenario.
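
A minimal sketch of steps 1 to 5, not the project's actual code. It assumes the files are in JSON Lines format (one record per line) and reads two tiny in-memory stand-ins instead of the real files. Apart from `business_id` and `stars`, the column names and every value are made up for illustration.

```python
import io

import pandas as pd

# Made-up stand-ins for two of the six JSON files (hypothetical columns and values).
business_json = io.StringIO(
    '{"business_id": "a1", "stars": 4.0, "review_count": 120, "name": "Cafe A"}\n'
    '{"business_id": "b2", "stars": 2.5, "review_count": 15, "name": "Diner B"}\n'
    '{"business_id": "c3", "stars": 3.5, "review_count": 60, "name": "Grill C"}\n'
)
reviews_json = io.StringIO(
    '{"business_id": "a1", "average_review_length": 610.0}\n'
    '{"business_id": "b2", "average_review_length": 320.0}\n'
)

# Step 1: load each file into a DataFrame (lines=True assumes JSON Lines).
businesses = pd.read_json(business_json, lines=True)
reviews = pd.read_json(reviews_json, lines=True)

# Step 2: merge on the shared key; a left merge keeps every business.
df = pd.merge(businesses, reviews, how="left", on="business_id")

# Step 3: drop identifier and text columns that a linear model cannot use.
df = df.drop(columns=["business_id", "name"])

# Step 4: business "c3" has no review row, so its missing value becomes 0.
df = df.fillna(0)

# Step 5: correlation of every remaining feature with the rating.
correlations = df.corr()["stars"].sort_values(ascending=False)
print(correlations)
```

With the real data, the same pattern would be repeated for each additional file before dropping columns and filling missing values.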


### 📌 What I Learned

- Merging, cleaning, and modeling large datasets from multiple sources.
- Understanding how feature correlation affects predictive power.
- Evaluating and improving models by comparing different feature subsets.
- Interpreting coefficients and applying the insights to a real-world scenario.


### 🛠️ Future Improvements

- Explore more sophisticated models (Random Forest, XGBoost).
- Integrate NLP-based sentiment analysis directly on full review texts.
- Deploy a Streamlit app for interactive Yelp rating predictions.
- Perform feature engineering on temporal variables (e.g., trend of ratings over time).


### Example Code Snippet (Model Function)
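```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


def model_these_features(feature_list):
    """Fit a linear regression on the given columns of the merged DataFrame `df` and report its fit."""
    ratings = df.loc[:, 'stars']
    features = df.loc[:, feature_list]
    # Hold out 20% of the rows for testing; the fixed seed keeps the split reproducible.
    X_train, X_test, y_train, y_test = train_test_split(features, ratings, test_size=0.2, random_state=1)
    # scikit-learn expects 2-D inputs, so reshape if a single column was selected as a Series.
    if len(X_train.shape) < 2:
        X_train = np.array(X_train).reshape(-1, 1)
        X_test = np.array(X_test).reshape(-1, 1)
    model = LinearRegression()
    model.fit(X_train, y_train)
    # R² on the training and test sets.
    print('Train Score:', model.score(X_train, y_train))
    print('Test Score:', model.score(X_test, y_test))
    # Feature coefficients, largest absolute value first.
    print(sorted(zip(feature_list, model.coef_), key=lambda x: abs(x[1]), reverse=True))
    # Predicted vs. actual ratings on the test set.
    y_predicted = model.predict(X_test)
    plt.scatter(y_test, y_predicted)
    plt.xlabel('Yelp Rating')
    plt.ylabel('Predicted Yelp Rating')
    plt.ylim(1, 5)
    plt.show()
```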
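
The function above covers fitting and evaluation. The final step, predicting a rating for a new restaurant profile, could look roughly like the sketch below. It is an illustration only: the training data is a noiseless toy dataset rather than the Yelp data, and the feature names `average_review_length` and `average_review_age` and the profile values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy training data (NOT the Yelp data): ratings follow an exact linear rule.
rng = np.random.default_rng(0)
train = pd.DataFrame({
    "average_review_length": rng.uniform(100, 1000, size=50),
    "average_review_age": rng.uniform(100, 3000, size=50),
})
train["stars"] = (
    4.5
    - 0.001 * train["average_review_length"]
    - 0.0002 * train["average_review_age"]
)

# Hypothetical feature subset; the real project compares several subsets.
feature_list = ["average_review_length", "average_review_age"]
model = LinearRegression().fit(train[feature_list], train["stars"])

# Hypothetical profile for the new restaurant, with columns in the same order
# as the training features.
new_restaurant = pd.DataFrame(
    [{"average_review_length": 600, "average_review_age": 1200}]
)[feature_list]

predicted = model.predict(new_restaurant)[0]
print("Predicted Yelp rating:", round(predicted, 2))
```

Passing the profile as a one-row DataFrame with the same column names keeps the feature order explicit and matches how the model was trained.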


[GitHub Repository](https://github.com/ouryba-49/Projects/tree/main/YelpRatingPredictor)
