# <center> <h3> <span>Cognifyz Internship Program</span> </h3></center>
# <center> <h3> <span>Restaurant Data Analysis</span> </h3></center>

# Author: Joshua Salami Peter

## Data Science Internship Program
### Date: November 2024

# Level 1: Task 1
## Data Exploration and Preprocessing

In [None]:
# import libraries 

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

%matplotlib inline

In [None]:
# import data

df = pd.read_csv("Dataset .csv")

In [None]:
# the first 5 rows of the DataFrame

df.head()

In [None]:
# basic information about the DataFrame

# df.info() prints its summary directly and returns None, so print() is not needed
df.info()

1. Explore the dataset and identify the number of rows and columns

In [None]:
# Check the number of rows and columns

print(df.shape)

The dataset contains 9,542 rows and 21 columns, representing detailed information about various restaurants.

2. Check for missing values in each column and handle them accordingly.

In [None]:
# Check for missing values

print(df.isnull().sum())

The "Cuisines" column is the only column with missing values, with a total of 9 missing entries out of 9,542 rows.

In [None]:
# Handle missing values

# Drop only the rows where "Cuisines" is missing
df.dropna(subset=['Cuisines'], inplace=True)

The missing values in the "Cuisines" column were handled by dropping the rows to maintain data quality and ensure that analyses involving cuisine types are based on complete data.
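As a minimal sketch of the two common options, assuming a toy DataFrame (the rows below are made up for illustration and are not from the dataset): restricting `dropna` to the "Cuisines" column removes only rows that lack a cuisine, while `fillna` with a placeholder label (the hypothetical "Unknown") keeps every row.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; not taken from the restaurant dataset.
toy = pd.DataFrame({
    "Restaurant Name": ["A", "B", "C", "D"],
    "Cuisines": ["North Indian", np.nan, "Chinese", np.nan],
    "Aggregate rating": [3.5, 4.1, 0.0, 2.9],
})

# Option 1: drop only the rows where "Cuisines" is missing.
dropped = toy.dropna(subset=["Cuisines"])

# Option 2: keep every row and label the gap explicitly.
# "Unknown" is a hypothetical placeholder, not a value from the dataset.
filled = toy.fillna({"Cuisines": "Unknown"})

print(len(toy), len(dropped), len(filled))
print(filled["Cuisines"].tolist())
```

Dropping is reasonable when few rows are affected, as here; filling may be preferable when the other columns of those rows are still needed.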

In [None]:
# Checking for confirmation

print(df.isnull().sum())

The dataset is now complete, with all columns containing valid data.

3. Perform data type conversion if necessary.

In [None]:
# Check data types

print(df.dtypes)

In [None]:
#Data types conversion

categorical_columns = ['Restaurant Name', 'City', 'Address', 'Locality', 'Locality Verbose', 
                       'Cuisines', 'Currency', 'Rating color', 'Rating text']
df[categorical_columns] = df[categorical_columns].astype('category')

bool_columns = ['Has Table booking', 'Has Online delivery', 'Is delivering now', 'Switch to order menu']
for col in bool_columns:
    # Map 'Yes'/'No' to booleans; any other value would become NaN and then True after astype(bool)
    df[col] = df[col].map({'Yes': True, 'No': False}).astype(bool)

In [None]:
#Verify data types conversion
print(df.dtypes)

The text columns were converted to the category type to reduce memory usage, and the Yes/No columns were converted to the boolean type for clearer representation and analysis.
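One caveat with mapping Yes/No strings is that `map` turns any unexpected value into NaN, and `astype(bool)` then silently converts NaN to `True`. The sketch below, which uses a hypothetical helper `yes_no_to_bool` and made-up toy values, fails loudly instead and also compares the memory footprint of plain strings against the category type.

```python
import pandas as pd


def yes_no_to_bool(series: pd.Series) -> pd.Series:
    """Map 'Yes'/'No' strings to booleans, raising on anything else.

    Hypothetical helper for illustration; it is not part of pandas.
    """
    mapped = series.map({"Yes": True, "No": False})
    if mapped.isna().any():
        unexpected = series[mapped.isna()].unique().tolist()
        raise ValueError(f"Unexpected values: {unexpected}")
    return mapped.astype(bool)


# Toy data for illustration only.
flags = pd.Series(["Yes", "No", "No", "Yes"])
print(yes_no_to_bool(flags).tolist())

# Repeated strings take less memory as a category than as plain strings.
cities = pd.Series(["New Delhi", "Gurgaon"] * 1000)
as_text_bytes = cities.memory_usage(deep=True)
as_category_bytes = cities.astype("category").memory_usage(deep=True)
print(as_text_bytes, as_category_bytes)
```

The memory saving depends on how often values repeat, so columns with mostly unique values, such as addresses, may benefit little from the category type.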

In [None]:
# Check for duplicate rows

duplicate_rows = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_rows}")

The analysis revealed that there are zero duplicate rows in the dataset, indicating that all entries are unique.

4. Analyze the distribution of the target variable ("Aggregate rating") and identify any class imbalances.

In [None]:
# target variable "Aggregate rating"

target= "Aggregate rating"

# Descriptive statistics
print(df[target].describe())

The summary statistics show that the "Aggregate rating" column has 9,542 entries. The mean rating is approximately 2.67 with a standard deviation of 1.52, indicating considerable variability. Ratings range from 0.0 to 4.9. The 25th percentile is 2.5, the median (50th percentile) is 3.2, and the 75th percentile is 3.7, so the middle 50% of ratings fall between 2.5 and 3.7.

In [None]:
# proportion of each rating
rating_proportions = df['Aggregate rating'].value_counts(normalize=True) * 100
print(rating_proportions.head())

In [None]:
# Box plot for 'Aggregate Rating'

plt.figure(figsize = (10,4))

sns.boxplot(x = df["Aggregate rating"])

plt.xlabel("Aggregate Rating")
plt.title("Box Plot of Aggregate Rating")

In [None]:
# Histogram plot for 'Aggregate Rating'

plt.figure(figsize=(6, 4))

sns.histplot(df['Aggregate rating'], bins=30, color = "b",kde=True)

plt.title('Distribution of Aggregate Rating')
plt.xlabel('Aggregate Rating')
plt.ylabel('Frequency')
plt.grid(alpha = 0.75)

The box plot and histogram show that the rating distribution is skewed: most restaurants are rated between 2.5 and 3.7, and a notable number have a rating of 0.0, which may indicate poor performance or missing data.

The dataset shows highly imbalanced classes, particularly with a high frequency of 0.0 ratings.
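To put a number on the imbalance, one option is to measure the share of 0.0 ratings, recompute the mean with those treated as "not rated", and bin the ratings into coarse classes. This is a minimal sketch on made-up ratings; the bin edges and class labels are assumptions chosen for illustration, not values defined by the dataset.

```python
import numpy as np
import pandas as pd

# Toy ratings for illustration only; not the real column.
ratings = pd.Series([0.0, 0.0, 2.5, 3.2, 3.7, 4.9, 0.0, 3.0])

# Share of restaurants with a 0.0 rating, in percent.
zero_share = (ratings == 0).mean() * 100

# Treat 0.0 as "not rated" and recompute the mean on rated restaurants only.
rated_only = ratings.replace(0.0, np.nan)
mean_all = ratings.mean()
mean_rated = rated_only.mean()

# Bin the ratings into coarse classes to inspect class balance.
# Intervals are right-inclusive, so (-0.1, 0.0] captures exactly the 0.0 ratings.
bins = [-0.1, 0.0, 2.5, 3.5, 5.0]
labels = ["Not rated", "Low", "Medium", "High"]
classes = pd.cut(ratings, bins=bins, labels=labels)

print(f"0.0 share: {zero_share:.1f}%")
print(f"mean (all): {mean_all:.2f}, mean (rated only): {mean_rated:.2f}")
print(classes.value_counts())
```

Comparing the two means shows how strongly the 0.0 entries pull the average down, which helps decide whether they should be analysed as a separate class.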

# Level 1: Task 2
## Descriptive Analysis

1. Calculate basic statistical measures (mean, median, standard deviation, etc.) for numerical columns.

In [None]:
# Basic statistical measures

df.describe()

2. Explore the distribution of categorical variables like "Country Code," "City," and "Cuisines."

In [None]:
# Country Codes by the count of restaurants (descending order)
country_order = df['Country Code'].value_counts().index

# Plot distribution for 'Country Code'

plt.figure(figsize=(15, 6))

sns.countplot(data=df, x='Country Code', palette='viridis', order=country_order)

plt.title('Distribution by Country Code')
plt.xticks(rotation=45)
plt.show()

The distribution of restaurants by country code shows that Country Code 1 has the highest number of restaurants, followed by Country Code 216, while Country Code 37 has the fewest.
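The ranking in the chart can also be expressed as counts and percentage shares. The sketch below uses made-up country codes, so the numbers it prints are illustrative only and do not reflect the dataset.

```python
import pandas as pd

# Toy country codes for illustration only; the counts are made up.
codes = pd.Series([1, 1, 1, 1, 1, 1, 216, 216, 216, 37], name="Country Code")

# Combine absolute counts and percentage shares in one table.
summary = pd.DataFrame({
    "count": codes.value_counts(),
    "share_pct": (codes.value_counts(normalize=True) * 100).round(1),
})
print(summary)
```

A share column makes it easier to state how dominant the leading country code is than a bar chart alone.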

In [None]:
# Plot distribution for top 10 cities

plt.figure(figsize=(10, 5))

sns.countplot(y='City', data=df, order=df['City'].value_counts().index[:10], palette='viridis')

plt.title('Top Ten Cities by Number of Restaurants')
plt.xlabel('Number of Restaurants')
plt.ylabel('City')


Based on the distribution of restaurants by city, New Delhi has the highest number of restaurants, followed by Gurgaon.

In [None]:
#  Plot distribution for Top 10 cuisines

plt.figure(figsize=(10, 5))

sns.countplot(y="Cuisines", data=df, order=df['Cuisines'].value_counts().index[:10], palette='viridis')

plt.title('Top Ten Cuisines by Number of Restaurants')
plt.xlabel('Number of Restaurants')
plt.ylabel('Cuisines')

The distribution of cuisines indicates that "North Indian" and "North Indian, Chinese" are the most common entries in the "Cuisines" column.
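Because a "Cuisines" entry can list several cuisines separated by commas, counting whole entries treats "North Indian" and "North Indian, Chinese" as unrelated categories. A minimal sketch of counting individual cuisines instead, on made-up rows, could look like this:

```python
import pandas as pd

# Toy data for illustration only; not taken from the restaurant dataset.
cuisines = pd.Series([
    "North Indian",
    "North Indian, Chinese",
    "Chinese",
    "North Indian, Chinese",
    "Fast Food",
], name="Cuisines")

# Split each entry on commas, give every cuisine its own row,
# strip the leftover spaces, then count occurrences.
individual = (
    cuisines.str.split(",")
    .explode()
    .str.strip()
    .value_counts()
)
print(individual)
```

If the column has already been converted to the category type, casting it back with `astype(str)` before splitting is a safe option.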

# Level 1: Task 3
## Geospatial Analysis


1. Visualize the locations of restaurants on a map using latitude and longitude information.

In [None]:
fig = px.scatter_geo(
    df,
    lat='Latitude',
    lon='Longitude',
    hover_name='Restaurant Name',
    hover_data=['Aggregate rating', 'City'],
    color='Aggregate rating',
    color_continuous_scale=px.colors.cyclical.IceFire,
    title='Restaurant Locations with Natural Earth Projection',
    projection='natural earth'  
)

# layout properties
fig.update_layout(
    geo=dict(
        showland=True,
        landcolor='lightgray',
        showocean=True,
        oceancolor='lightblue',
        showcountries=True,
        countrycolor='black',
    ),
    legend_title_text='Aggregate Rating',
    autosize=False,
    width=1100,
    height=600
)

The map highlights areas where restaurants are most densely located.

2. Analyze the distribution of restaurants across different cities or countries and check correlation with ratings.

In [None]:
plt.figure(figsize=(10, 5))

sns.countplot(y='City', data=df, order=df['City'].value_counts().index[:10], palette='viridis')

plt.title('Top Ten Cities by Number of Restaurants')
plt.xlabel('Number of Restaurants')
plt.ylabel('City')

The chart ranks the ten cities with the most restaurants in the dataset. New Delhi leads, followed by Gurgaon, which shows that the listings are concentrated in a small number of large urban areas.

In [None]:
# Checking correlation between the restaurant's location and its rating

plt.figure(figsize=(10, 5))

# Calculate the correlation between latitude, longitude, and ratings
correlation_matrix = df[['Latitude', 'Longitude', 'Aggregate rating']].corr()

sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".4f")

plt.title("Correlation Between Restaurant's location and Rating")


The correlation matrix shows a weak negative relationship between longitude and aggregate rating, with a correlation coefficient of -0.1147. This suggests a slight tendency for ratings to decrease as longitude increases, but the relationship is not strong. The relationship between latitude and aggregate rating is negligible.
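A correlation on raw coordinates only captures a linear east-west or north-south trend. One complementary view, sketched here on made-up rows, is to aggregate by city and compare the restaurant count with the mean rating. The column names match the dataset, but the values are illustrative assumptions.

```python
import pandas as pd

# Toy data for illustration only; the values are made up.
toy = pd.DataFrame({
    "City": ["New Delhi", "New Delhi", "New Delhi", "Gurgaon", "Gurgaon"],
    "Aggregate rating": [3.0, 0.0, 3.6, 4.0, 3.0],
})

# Count restaurants and average the rating per city, largest cities first.
city_summary = (
    toy.groupby("City")["Aggregate rating"]
    .agg(restaurants="count", mean_rating="mean")
    .sort_values("restaurants", ascending=False)
)
print(city_summary)
```

When "City" is stored as a category, passing `observed=True` to `groupby` avoids rows for categories that have no restaurants.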

# Conclusion:
The data exploration and preprocessing show that the dataset contains useful information about restaurant ratings, locations, and cuisines. The "Cuisines" column had missing values, which were handled by dropping the affected rows. The summary statistics show that the middle 50% of ratings fall between 2.5 and 3.7, with a notable number of 0.0 ratings that may indicate poor performance or missing data. Most restaurants are in Country Code 1, with New Delhi and Gurgaon leading the cities by restaurant count. The most common cuisine entries are "North Indian" and "North Indian, Chinese". The correlation analysis showed a weak negative relationship between longitude and ratings, while latitude had no meaningful correlation with ratings.

## Recommendations:

Address Rating Imbalance: Investigate and handle the 0.0 ratings more effectively, possibly by treating them as missing values or imputing them, to improve data quality and analysis accuracy.

Location-Based Insights: Explore location-specific factors influencing restaurant ratings, particularly in high-density areas, to identify trends and opportunities for growth.

Focus on Popular Cuisines: "North Indian" and "North Indian, Chinese" are the most common cuisine entries in the dataset. New restaurants can consider these cuisines based on their prevalence, which may reflect customer demand.

Review Rating System: Consider improving the rating system to address the skewed distribution, especially around low ratings, to ensure more accurate customer feedback.

These recommendations aim to improve operational strategies and guide future decisions in restaurant management and marketing.

In [None]:
# Save the modified dataset to a new CSV file
df.to_csv('C:/Users/JSP/Desktop/cleaned_dataset.csv', index=False)
