<a href="https://colab.research.google.com/github/itzdineshx/cognifyz_internship/blob/main/level_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **COGNIFYZ DATA SCIENCE INTERNSHIP**

## **LEVEL 3**


---
## About Level 3

Level 3 of the Cognifyz Data Science Internship focuses on three key areas:

1. **Predictive Modeling**
2. **Customer Preference Analysis**
3. **Data Visualization**

### Task 1: Predictive Modeling

The goal was to develop a regression model to predict a restaurant's aggregate rating based on available features. The steps included:

- Splitting the dataset into training and testing sets.
- Evaluating model performance using appropriate metrics.
- Experimenting with various algorithms such as Linear Regression, Decision Trees, and Random Forest to compare their performance.

### Task 2: Customer Preference Analysis

The objective was to analyze the relationship between restaurant ratings and cuisine types. Key tasks included:

- Identifying the most popular cuisines based on the number of customer votes.
- Investigating whether certain cuisines tend to receive higher ratings.
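
The `Cuisines` column may hold several cuisines in one comma-separated string (an assumption about this dataset, for example "North Indian, Mughlai"). Grouping on the raw string then treats every combination as its own cuisine. A minimal sketch of counting votes per individual cuisine instead, using a small made-up frame in place of the real data:

```python
import pandas as pd

# Hypothetical rows standing in for the real dataset
sample = pd.DataFrame({
    "Cuisines": ["North Indian, Mughlai", "Chinese", "North Indian, Chinese", "Cafe"],
    "Votes": [100, 40, 60, 10],
})

# Split each comma-separated string into a list, then give every cuisine its own row
exploded = sample.assign(Cuisine=sample["Cuisines"].str.split(",")).explode(
    "Cuisine", ignore_index=True
)
exploded["Cuisine"] = exploded["Cuisine"].str.strip()

# Total votes per individual cuisine
votes_per_cuisine = (
    exploded.groupby("Cuisine")["Votes"].sum().sort_values(ascending=False)
)
print(votes_per_cuisine)
```

With this approach a restaurant's votes are counted once for each cuisine it lists, so the totals no longer add up to the overall vote count.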

### Task 3: Data Visualization

The final task involved creating visualizations to represent the data. Specific goals included:

- Displaying rating distributions through charts (e.g., histograms, bar plots).
- Comparing average ratings across different cuisines or cities.
- Visualizing the relationship between features and the target variable (aggregate rating).
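
The first goal in the list above, a rating distribution, could be drawn roughly as follows. This is a sketch on made-up ratings, not the notebook's data; with the real data, `df['Aggregate rating']` would take the place of the hypothetical `ratings` array.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up ratings for illustration only
ratings = np.array([0.0, 2.9, 3.1, 3.4, 3.5, 3.8, 3.9, 4.1, 4.4, 4.9])

# Half-point bins from 0 to 5
bin_edges = np.arange(0, 5.5, 0.5)

fig, ax = plt.subplots(figsize=(8, 5))
counts, edges, _ = ax.hist(ratings, bins=bin_edges, edgecolor="k")
ax.set_title("Distribution of Aggregate Ratings")
ax.set_xlabel("Aggregate Rating")
ax.set_ylabel("Number of Restaurants")
plt.tight_layout()
plt.show()

# Share of restaurants rated at least 3 and below 4
share_3_to_4 = ((ratings >= 3) & (ratings < 4)).mean()
print(f"Share of ratings in [3, 4): {share_3_to_4:.2f}")
```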
---

In [None]:
# Importing all the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.svm import SVR

In [None]:
#accessing the file
df = pd.read_csv("/content/Dataset .csv")
df.head()

In [None]:
#checking for null values
df.isnull().sum()

In [None]:
df.describe()

In [None]:
df.columns

In [None]:
# Filling the missing values in the 'Cuisines' column
# (assign the result back; fillna(..., inplace=True) returns None)
df['Cuisines'] = df['Cuisines'].fillna('Unknown')

In [None]:
#re-checking for null values
df.isnull().sum()

# **Task 1: Predictive Modeling**

### Predict the aggregate rating

Experiment with different algorithms (e.g., linear regression, decision trees, random forest) and compare their performance.

In [None]:
# Select features and target variable
features = ['Average Cost for two', 'Votes']
target = 'Aggregate rating'

In [None]:
# Split the data into training and testing sets
X = df[features]
y = df[target]
X_train,X_test,Y_train,Y_test = train_test_split(X,y,test_size=0.2,random_state=42)

In [None]:
# Function to train and evaluate a regression model
def train_and_evaluate_model(model):
    """Fit `model` on the training split and return (MSE, R2) on the test split."""
    model.fit(X_train, Y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(Y_test, y_pred)
    r2 = r2_score(Y_test, y_pred)
    return mse, r2



> Linear Regression

In [None]:
# Training and evaluating Linear Regression model
linear_regression = LinearRegression()
mse_lr, r2_lr = train_and_evaluate_model(linear_regression)

In [None]:
#scores of linear model
print("Linear Regression:")
print(f"Mean Squared Error (MSE): {mse_lr:.6f}")
print(f"R-squared (R2): {r2_lr:.6f}")




> Decision Tree






In [None]:
# Training and evaluating decision tree model
dec_model = DecisionTreeRegressor(random_state=42)
mse_dec, r2_dec = train_and_evaluate_model(dec_model)

In [None]:
#scores of decision tree model
print("\nDecision Tree:")
print(f"Mean Squared Error (MSE): {mse_dec:.6f}")
print(f"R-squared (R2): {r2_dec:.6f}")

> Random Forest

In [None]:
# Training and evaluating random forest model
random_forest = RandomForestRegressor(random_state=42)
mse_rf, r2_rf = train_and_evaluate_model(random_forest)

In [None]:
#scores of random forest model
print("\nRandom Forest:")
print(f"Mean Squared Error (MSE): {mse_rf:.6f}")
print(f"R-squared (R2): {r2_rf:.6f}")


> Support Vector Regression
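
SVR with its default RBF kernel is sensitive to feature scale, and columns such as `Votes` and `Average Cost for two` can span very different ranges. One common option, which the cell below does not use, is to standardize the features inside a pipeline. A minimal sketch on synthetic data (the `*_demo` names and the data itself are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic stand-in data (NOT the restaurant dataset): two features on
# different scales and a target that depends on the second one
rng = np.random.default_rng(42)
cost_demo = rng.uniform(100, 5000, size=400)
votes_demo = rng.uniform(0, 10000, size=400)
X_demo = np.column_stack([cost_demo, votes_demo])
y_demo = 2.0 + 0.0002 * votes_demo + rng.normal(0, 0.1, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The scaler is fitted on the training split only, then applied to the test split
scaled_svr = make_pipeline(StandardScaler(), SVR())
scaled_svr.fit(X_tr, y_tr)
pred_demo = scaled_svr.predict(X_te)

print(f"MSE: {mean_squared_error(y_te, pred_demo):.6f}")
print(f"R2:  {r2_score(y_te, pred_demo):.6f}")
```

Because the pipeline exposes `fit` and `predict`, it could be passed to `train_and_evaluate_model` in place of a bare `SVR()`. Whether scaling improves the score on the real data would need to be checked.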

In [None]:
# Training and evaluating Support Vector Regression model
svr = SVR()
mse_svr, r2_svr = train_and_evaluate_model(svr)

In [None]:
#scores of support vector machine model
print("\nSupport Vector Regression:")
print(f"Mean Squared Error (MSE): {mse_svr:.6f}")
print(f"R-squared (R2): {r2_svr:.6f}")


# Task 2: Customer Preference Analysis

In [None]:
# Identify the most popular cuisines based on the number of customer votes.
cuisines_votes = df.groupby('Cuisines')['Votes'].sum().sort_values(ascending=False)
high_cuisines_votes=cuisines_votes.head(20)
print("Most popular cuisines based on votes:\n", high_cuisines_votes)

In [None]:
# Plot the relationship between Top 20 cuisine type and the number of votes
plt.figure(figsize=(12, 6))
colors = sns.color_palette("tab20", n_colors=len(high_cuisines_votes))
high_cuisines_votes.plot(kind='bar',color=colors)
plt.title('Number of Votes for Top 20 Cuisine Types')
plt.xlabel('Cuisine Type')
plt.ylabel('Total Number of Votes')
plt.tight_layout()
plt.show()

In [None]:
# Investigating whether certain cuisines tend to receive higher ratings (mean rating per cuisine).
cuisines_ratings = df.groupby('Cuisines')['Aggregate rating'].mean().sort_values(ascending=False)
high_cuisines_ratings = cuisines_ratings.head(20)
print("Top 20 cuisines by average aggregate rating:\n", high_cuisines_ratings)

In [None]:
# Plot the relationship between cuisine type and average rating
plt.figure(figsize=(12, 6))
colors = sns.color_palette("husl", n_colors=len(high_cuisines_ratings))
high_cuisines_ratings.plot(kind='bar',color=colors)
plt.xlabel('Cuisine Type')
plt.ylabel('Average Rating')
plt.title('Relationship between Top 20 Cuisine Type and Average Rating')
plt.show()

# Task 3: Data Visualization

In [None]:
# Scatter plot of Average Cost for two vs. Aggregate rating
plt.figure(figsize=(8, 6))
plt.scatter(df['Average Cost for two'], df['Aggregate rating'], c=df['Aggregate rating'], cmap='viridis', alpha=0.6, edgecolor='k')
plt.title('Average Cost for two vs. Aggregate Rating')
plt.xlabel('Average Cost for two')
plt.ylabel('Aggregate Rating')
plt.colorbar(label='Aggregate Rating')
plt.show()


In [None]:
# Scatter plot of Votes vs. Aggregate rating
plt.figure(figsize=(8, 6))
plt.scatter(df['Votes'], df['Aggregate rating'], c=df['Price range'], cmap='Reds', alpha=0.6, edgecolor='k')
plt.title('Votes vs. Aggregate Rating')
plt.xlabel('Votes')
plt.ylabel('Aggregate Rating')
plt.colorbar(label='Price Range')
plt.show()


In [None]:
# Correlation matrix heatmap
correlation_matrix = df[['Average Cost for two', 'Votes',
                         'Aggregate rating']].corr()
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix,annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

In [None]:
# Comparing average ratings across different cities
city_avg_ratings = df.groupby('City')['Aggregate rating'].mean().sort_values(ascending=False)
city_avg_ratings.head(10)

In [None]:
# Plot the average ratings across cities
plt.figure(figsize=(12, 6))
colors = sns.color_palette("Paired", n_colors=len(city_avg_ratings))
city_avg_ratings.plot(kind='bar', color=colors)
plt.title('Average Aggregate Rating across Cities')
plt.xlabel('City')
plt.ylabel('Average Aggregate Rating')
plt.xticks(rotation=90, ha='right')
plt.tight_layout()
plt.show()


---

# **Results**

## **Task 1: Predictive Modeling**

Four regression models were developed:

1. **Linear Regression**
2. **Decision Tree**
3. **Random Forest**
4. **Support Vector Regression (SVR)**

The models were evaluated using Mean Squared Error (MSE) and R-squared, as computed by `mean_squared_error` and `r2_score` in the evaluation function above.

The results are summarized below:

| **Model**                 | **MSE**  | **R-squared** |
|---------------------------|----------|---------------|
| Linear Regression         | 2.055716 | 0.096829      |
| Decision Tree             | 0.222755 | 0.902133      |
| Random Forest             | 0.159152 | 0.930077      |
| Support Vector Regression | 2.253043 | 0.1105025     |

From the table, the **Random Forest** model performed best, with the lowest MSE and the highest R-squared, making it the preferred choice for this task.
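
The evaluation function earlier in the notebook calls `mean_squared_error`, which returns MSE; RMSE is its square root. A minimal sketch of building a comparison table with both metrics, using made-up predictions (`regression_summary`, `model_a`, and `model_b` are hypothetical names, not part of the notebook):

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error, r2_score


def regression_summary(y_true, y_pred):
    """Return MSE, RMSE, and R2 for one set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    return {"MSE": mse, "RMSE": np.sqrt(mse), "R2": r2_score(y_true, y_pred)}


# Made-up targets and predictions for illustration only
y_true = np.array([3.0, 4.0, 2.5, 4.5])
predictions = {
    "model_a": np.array([2.5, 4.0, 3.0, 4.0]),
    "model_b": y_true.copy(),  # a perfect predictor, for contrast
}

# One row per model, one column per metric
summary = pd.DataFrame(
    {name: regression_summary(y_true, pred) for name, pred in predictions.items()}
).T
print(summary)
```

In the notebook itself, each fitted model's `predict(X_test)` output would take the place of the hypothetical arrays.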

## **Task 2: Customer Preference Analysis**



* **North Indian**, **Mughlai**, and **Chinese** cuisines were the most popular, based on total customer votes.
* Several cuisines, such as **American**, **BBQ**, **Sandwich**, **Burger**, **Grill**, **Caribbean**, **Seafood**, and **Coffee and Tea**, had an average rating of 4.9.
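
A very high average can come from a cuisine label that appears on only one or two restaurants. One way to check this, sketched on made-up rows, is to compute the count alongside the mean and keep only labels with enough restaurants. The threshold `min_restaurants` is a hypothetical choice, not a value from the notebook:

```python
import pandas as pd

# Hypothetical rows standing in for the real dataset
sample = pd.DataFrame({
    "Cuisines": ["BBQ", "Cafe", "Cafe", "Cafe", "Pizza", "Pizza"],
    "Aggregate rating": [4.9, 3.0, 3.5, 4.0, 4.2, 4.4],
})

# Mean rating and number of restaurants per cuisine label
stats = sample.groupby("Cuisines")["Aggregate rating"].agg(["mean", "count"])

# Keep only labels backed by at least `min_restaurants` rows (illustrative threshold)
min_restaurants = 2
reliable = stats[stats["count"] >= min_restaurants].sort_values(
    "mean", ascending=False
)
print(reliable)
```

Here the single-restaurant "BBQ" label is dropped, so the ranking reflects only cuisines with more than one rating behind them.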

## **Task 3: Data Visualization**

* The majority of restaurant ratings fell between **3 and 4**.

* On a city-level analysis, **Inner City** had the highest average ratings, followed by **Quezon City**, **Makati City**, **Pasig City**, **Mandaluyong City**, and **Beechworth**.

* A correlation analysis indicated that aggregate ratings were positively associated with several features, including **Votes**, **Price Range**, **Has Table Booking**, and **Has Online Delivery**.
* Among these, **Price Range** had the strongest positive correlation.
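
Correlation needs numeric columns. If the table-booking and online-delivery flags are stored as "Yes"/"No" strings (an assumption about this dataset, as are the exact column names below), they have to be encoded first. A minimal sketch on made-up rows:

```python
import pandas as pd

# Hypothetical rows; column names and values are assumptions for illustration
sample = pd.DataFrame({
    "Has Table booking": ["Yes", "No", "No", "Yes", "No", "Yes"],
    "Has Online delivery": ["No", "No", "Yes", "Yes", "No", "Yes"],
    "Price range": [3, 1, 2, 4, 1, 3],
    "Votes": [250, 10, 40, 900, 5, 300],
    "Aggregate rating": [4.1, 2.8, 3.3, 4.6, 2.5, 4.0],
})

# Encode the Yes/No flags as 1/0 so they can enter the correlation matrix
encoded = sample.copy()
for col in ["Has Table booking", "Has Online delivery"]:
    encoded[col] = encoded[col].map({"Yes": 1, "No": 0})

# Pearson correlation of every feature with the target, strongest first
corr_with_rating = (
    encoded.corr()["Aggregate rating"]
    .drop("Aggregate rating")
    .sort_values(ascending=False)
)
print(corr_with_rating)
```

The same encoded frame could be passed to `sns.heatmap` to extend the correlation heatmap beyond the three numeric columns used earlier.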



---

# **Conclusion**

This project highlighted the critical role of predictive modeling, customer preference analysis, and data visualization in uncovering valuable insights and facilitating strategic decision-making.

The customer preference analysis offered a deeper understanding of what the target audience favors, while the visualizations proved effective in communicating complex insights clearly and concisely.

---

### **Generating a report for this project**

In [None]:
!pip install ydata_profiling

In [None]:
import pandas as pd
from ydata_profiling import ProfileReport

# Load the CSV dataset
df = pd.read_csv("/content/Dataset .csv")

# Create the profile report with a more descriptive title
profile = ProfileReport(df, title="Data Analysis Report: Dataset")

# Generate the HTML report
profile.to_file("report.html")
