# COGS 108 - Data Checkpoint

# Names

- Goldie Chu
- Tram Bui
- Justin Huang
- Tiffany Cheng
- Jason Lee

<a id='research_question'></a>
# Research Question

Is there a relationship between a country's gross domestic product (GDP) per capita and its Michelin-starred restaurants, specifically their minimum price, maximum price, type of cuisine, and number in that country?

# Dataset(s)

## Michelin Guide Restaurants 2021 
- Link to the dataset: https://www.kaggle.com/datasets/ngshiheng/michelin-guide-restaurants-2021
- Number of observations: 6353
- Description of the dataset: This dataset provides the name, address, price, cuisine type, and phone number of Michelin Star Restaurants worldwide in 2021. We use each restaurant's address to identify the country in which it is located, so that restaurant attributes can be related to that country's GDP per capita.


## GDP per Capita, Current Prices (U.S. Dollars per Capita)
- Link to the dataset: https://www.imf.org/external/datamapper/NGDPDPC@WEO/OEMDC/ADVEC/WEOWORLD 
- Number of observations: 231 
- Description of the dataset: This dataset provides the gross domestic product (GDP) per capita of countries worldwide from 1980 to 2027. Values for years after 2021 are projections. We plan to combine this dataset with the ‘Michelin Guide Restaurants 2021’ dataset by adding each country's 2021 GDP per capita as a new column, matched on the restaurant's country.


# Setup

## Importing Necessary Libraries

In [None]:
# Used for performing numerical computations
import numpy as np

# Used for reading, modifying, and analyzing datasets
import pandas as pd

# Both of these packages are used for visualizing data
import matplotlib.pyplot as plt
import seaborn as sns

## Importing Datasets

### 2021 Michelin Restaurants Dataset

In [None]:
# Loading the Michelin Restaurants dataset into the 'michelin' DataFrame 
# Dataset link: https://www.kaggle.com/datasets/ngshiheng/michelin-guide-restaurants-2021
michelin = pd.read_csv('michelin_my_maps.csv')

michelin.head()

### Country GDP Dataset

In [None]:
# Loading the Country GDP dataset into the 'gdp' DataFrame
# Dataset link: https://www.imf.org/external/datamapper/NGDPDPC@WEO/OEMDC/ADVEC/WEOWORLD
gdp = pd.read_csv('all_country_gdp.csv')

gdp.head()

# Data Cleaning

We have two datasets, which we clean separately and then merge by matching each Michelin restaurant's country to that country's GDP per capita.

## Cleaning the 2021 Michelin Restaurants Dataset

The goals of cleaning this dataset are to extract the relevant columns, remove any restaurants with a NaN value in those columns, reduce each restaurant's address to just its country, and sort the restaurants by country.

In [None]:
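# Extracting the relevant columns and removing all restaurants with a NaN value
# (chaining returns a new DataFrame instead of modifying a slice of 'michelin' in place)
michelin_sub = michelin[['Name', 'Address', 'MinPrice', 'MaxPrice', 'Currency', 'Cuisine']].dropna()

# Reducing each restaurant's address to just its country.
# Assumption: the country is the last comma-separated field of the address;
# splitting on spaces would truncate multi-word names such as "United Kingdom"
michelin_sub['Address'] = michelin_sub['Address'].apply(lambda x: x.split(',')[-1].strip())

# Renaming the 'Address' column to 'Country'
michelin_sub = michelin_sub.rename(columns={'Address': 'Country'})

# Sorting the restaurants by country so that restaurants in the same country are adjacent
# (drop=True discards the old index instead of keeping it as an extra column)
michelin_sub = michelin_sub.sort_values(by='Country').reset_index(drop=True)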

michelin_sub.head()
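
The research question also involves the number of Michelin restaurants in each country. Below is a minimal sketch of how that count could be derived once a `Country` column exists. The toy rows and the `NumRestaurants` column name are illustrative assumptions, not values from the dataset.

```python
import pandas as pd

# Hypothetical rows standing in for the cleaned Michelin data
toy = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Country': ['France', 'Japan', 'France', 'Italy'],
})

# Counting restaurants per country; 'NumRestaurants' is an illustrative column name
counts = (
    toy.groupby('Country')
       .size()
       .reset_index(name='NumRestaurants')
)

print(counts)
```

The resulting table has one row per country, so it could later be merged with the GDP data on `Country` in the same way as the restaurant-level table.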

## Cleaning the Country GDP Dataset

The goal of cleaning this dataset is to extract only the 2021 GDP information and remove countries with a NaN value.

In [None]:
# Renaming the columns for extraction
gdp = gdp.rename(columns = {'GDP per capita, current prices\n (U.S. dollars per capita)': 'Country', '2021': 'GDP'})

# Extracting the desired columns and removing all countries with a NaN value
# (chaining returns a new DataFrame instead of modifying a slice of 'gdp' in place)
gdp_sub = gdp[['Country', 'GDP']].dropna()

# Resetting indices after dropping NaN values
gdp_sub = gdp_sub.reset_index(drop=True)

gdp_sub.head()
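
Because `dropna` only removes true missing values, it may be worth confirming that the `GDP` column is numeric before analysis. The sketch below assumes, hypothetically, that some rows hold a text placeholder such as `'no data'` instead of a number. Whether the actual export does so should be checked against the file itself; the country labels and values here are made up.

```python
import pandas as pd

# Hypothetical values; the real column may or may not contain text placeholders
toy_gdp = pd.DataFrame({
    'Country': ['Country A', 'Country B', 'Country C'],
    'GDP': ['1234.5', 'no data', '987.0'],
})

# Converting to numbers; anything that cannot be parsed becomes NaN
toy_gdp['GDP'] = pd.to_numeric(toy_gdp['GDP'], errors='coerce')

# dropna now removes the placeholder row as well
toy_gdp = toy_gdp.dropna(subset=['GDP']).reset_index(drop=True)

print(toy_gdp.dtypes)
```

If the real column is already numeric, `pd.to_numeric` leaves it unchanged, so the check is harmless either way.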

## Merging the two datasets

The goal of merging the two datasets is to create a new DataFrame consisting of the Michelin restaurant data and an additional column for the respective GDP of the country that each restaurant is located in.

In [None]:
# Keeping only the restaurants whose countries have GDP data in the GDP dataset
dataList = list(gdp_sub['Country'])
michelin_sub = michelin_sub[michelin_sub['Country'].isin(dataList)].reset_index(drop=True)
michelinList = list(michelin_sub['Country'])
gdp_sub = gdp_sub[gdp_sub['Country'].isin(michelinList)].reset_index(drop=True)

# Merging the two datasets on their shared 'Country' column
merged = michelin_sub.merge(gdp_sub, on='Country', how='inner')

merged.head()
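
Filtering on exact country names drops any restaurant whose country label has no identical match in the GDP table, for example if the two sources spell a country differently. Below is a minimal sketch of how such mismatches could be listed using the `indicator` option of `DataFrame.merge`. The toy rows, spellings, and GDP numbers are illustrative placeholders, not values from either dataset.

```python
import pandas as pd

# Hypothetical restaurant rows and GDP rows with one deliberately mismatched label
toy_michelin = pd.DataFrame({
    'Name': ['A', 'B', 'C'],
    'Country': ['France', 'USA', 'Japan'],
})
toy_gdp = pd.DataFrame({
    'Country': ['France', 'United States', 'Japan'],
    'GDP': [1.0, 2.0, 3.0],  # placeholder numbers
})

# A left merge keeps every restaurant; '_merge' records whether a GDP row was found
check = toy_michelin.merge(toy_gdp, on='Country', how='left', indicator=True)

# Country labels that found no GDP match
unmatched = sorted(check.loc[check['_merge'] == 'left_only', 'Country'].unique())

print(unmatched)
```

Any labels reported this way could then be renamed in one of the two tables before the merge, so that those restaurants are not lost.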