# COGS 108 - EDA Checkpoint

# Names

- Goldie Chu
- Tram Bui
- Justin Huang
- Tiffany Cheng
- Jason Lee

<a id='research_question'></a>
# Research Question

*Fill in your research question here*

# Dataset(s)

## 1. Michelin Guide Restaurants 2021 
- Link to the dataset: https://www.kaggle.com/datasets/ngshiheng/michelin-guide-restaurants-2021
- Number of observations: 6353
- Description of the dataset: This dataset provides the name, address, price, cuisine type, and phone number of Michelin Star Restaurants worldwide in 2021. We use each restaurant's address to identify the country in which it is located, so that we can relate restaurants to country-level indicators such as GDP.


## 2. GDP per Capita, Current Prices (U.S. Dollars per Capita)
- Link to the dataset: https://www.imf.org/external/datamapper/NGDPDPC@WEO/OEMDC/ADVEC/WEOWORLD 
- Number of observations: 231 
- Description of the dataset: This dataset provides the gross domestic product (GDP) per capita of all the countries in the world from 1980 to 2027. Values for years after 2021 are projections. We plan to combine this dataset with the ‘Michelin Guide Restaurants 2021’ dataset by merging each country's 2021 GDP per capita into the Michelin table.

## 3. Country Unemployment Rate (Percent)
- Link to the dataset: https://www.imf.org/external/datamapper/LUR@WEO/OEMDC/ADVEC/WEOWORLD
- Number of observations: 119
- Description of the dataset: This dataset provides the unemployment rate (percentage) of all the countries in the world from 1980 to 2027. Values for years after 2021 are projections. We plan to combine this dataset with the ‘Michelin Guide Restaurants 2021’ dataset by merging each country's 2021 unemployment rate into the Michelin table.

## 4. Inflation Rate, Average Consumer Prices (Annual Percent Change)
- Link to the dataset: https://www.imf.org/external/datamapper/PCPIPCH@WEO/OEMDC/ADVEC/WEOWORLD
- Number of observations: 227
- Description of the dataset: This dataset provides the inflation rate of all the countries in the world from 1980 to 2027, measured as the annual percent change in average consumer prices. Values for years after 2021 are projections. We plan to combine this dataset with the ‘Michelin Guide Restaurants 2021’ dataset by merging each country's 2021 inflation rate into the Michelin table.

# Setup

## 1. Importing Necessary Libraries

In [1]:
# Used for performing numerical computations
import numpy as np

# Used for reading, modifying, and analyzing datasets
import pandas as pd

# Silence pandas' chained-assignment warning (SettingWithCopyWarning) during data cleaning
pd.options.mode.chained_assignment = None

# Both of these packages are used for visualizing data
import matplotlib.pyplot as plt
import seaborn as sns

## 2. Defining Function(s) for Data Cleaning

### Standardizing Numerical Values in the Datasets
All of the numerical values in our datasets are stored as strings, and some of the datasets use commas as thousands separators (e.g. one thousand as 1,000). We wish to remove these commas and work with float values instead. 

This function will:
1. Remove these commas from numerical values in the dataset, if present
2. Convert each string value to a float, or to NaN if it cannot be parsed as a number

In [None]:
def standardize_numbers(string):
    # Remove surrounding whitespace and thousands separators (e.g. '1,000' -> '1000')
    string = string.strip().replace(',', '')

    # Convert to float; strings that are not numeric (e.g. 'no data') become NaN
    try:
        return float(string)
    except ValueError:
        return np.nan
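
As a quick sanity check, the behavior described above can be sketched on a few sample strings. The sample values below are made up for illustration and do not come from the datasets; the function is repeated so that the snippet runs on its own.

```python
import numpy as np

def standardize_numbers(string):
    # Same logic as the cleaning function above, repeated here for a standalone check
    string = string.strip().replace(',', '')
    try:
        return float(string)
    except ValueError:
        return np.nan

# Hypothetical inputs: a comma-separated number, a padded number,
# a 'no data' marker, and an empty string
samples = ['1,000', ' 42.5 ', 'no data', '']
converted = [standardize_numbers(s) for s in samples]
print(converted)
```

The first two strings parse to floats, while the last two cannot be parsed and come back as NaN, which `dropna` can then remove.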

## 3. Importing Datasets

### 2021 Michelin Restaurants Dataset

In [None]:
# Loading the Michelin Restaurants dataset into the 'michelin' DataFrame 
# Dataset link: https://www.kaggle.com/datasets/ngshiheng/michelin-guide-restaurants-2021
michelin = pd.read_csv('michelin_my_maps.csv')

michelin.head()

### Country GDP Dataset

In [None]:
# Loading the Country GDP dataset into the 'gdp' DataFrame
# Dataset link: https://www.imf.org/external/datamapper/NGDPDPC@WEO/OEMDC/ADVEC/WEOWORLD
gdp = pd.read_csv('all_country_gdp.csv')

gdp.head()

### Country Unemployment Rate Dataset

In [None]:
# Loading the Country Unemployment dataset into the 'unemployment' DataFrame
# Dataset link: https://www.imf.org/external/datamapper/LUR@WEO/OEMDC/ADVEC/WEOWORLD
unemployment = pd.read_csv('country_unemployment.csv')

unemployment.head()

### Country Inflation Rate Dataset

In [None]:
# Loading the Country Inflation dataset into the 'inflation' DataFrame
# Dataset link: https://www.imf.org/external/datamapper/PCPIPCH@WEO/OEMDC/ADVEC/WEOWORLD
inflation = pd.read_csv('country_inflation.csv')

inflation.head()

# Data Cleaning

We have four datasets, which we clean and standardize separately. We then merge them by matching each Michelin restaurant's country to the GDP, unemployment rate, and inflation rate of that country.

## 1. Cleaning the 2021 Michelin Restaurants Dataset

The goal of cleaning this dataset is to: 
1. Extract the relevant columns (Name, Address, MinPrice, MaxPrice, Currency, and Cuisine)
2. Change the addresses of each restaurant to just be its country 
3. Sort the restaurants by country so that restaurants in the same country are adjacent
4. Standardize the 'MinPrice' and 'MaxPrice' columns
5. Remove any restaurants with NaN column(s)

In [2]:
# Extracting the relevant columns from the dataset (copied so later edits do not touch 'michelin')
michelin_sub = michelin[['Name', 'Address', 'MinPrice', 'MaxPrice', 'Currency', 'Cuisine']].copy()

# Standardizing MinPrice and MaxPrice
michelin_sub['MinPrice'] = michelin_sub['MinPrice'].astype(str).apply(standardize_numbers)
michelin_sub['MaxPrice'] = michelin_sub['MaxPrice'].astype(str).apply(standardize_numbers)

# Removing all restaurants with a NaN column value
michelin_sub.dropna(inplace=True)

# Updating each restaurant's address to just its country
# (takes the last space-separated token, so a multi-word country name keeps only its final word)
michelin_sub['Address'] = michelin_sub['Address'].apply(lambda x: x.split(' ')[-1])

# Renaming the 'Address' column to 'Country'
michelin_sub = michelin_sub.rename(columns = {'Address': 'Country'})

# Sorting the restaurants by country so that restaurants in the same country are adjacent
michelin_sub = michelin_sub.sort_values(by='Country').reset_index(drop=True)

michelin_sub.head()
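
One caveat with taking the last space-separated token is that multi-word country names are truncated. The sketch below compares that approach with splitting on the last comma instead. It assumes that addresses are comma-separated with the country as the final field, which should be checked against the actual 'Address' column; the two addresses are hypothetical examples, not rows from the dataset.

```python
import pandas as pd

# Hypothetical addresses, assuming the format '<street>, <city>, <postcode>, <country>'
addresses = pd.Series([
    '1 Example Street, London, W1 1AA, United Kingdom',
    '2 Sample Road, Vienna, 1010, Austria',
])

# Approach used in the cleaning cell: keep the last space-separated token
last_token = addresses.apply(lambda x: x.split(' ')[-1])

# Alternative: split once from the right on a comma and keep the final field
last_field = addresses.str.rsplit(',', n=1).str[-1].str.strip()

print(last_token.tolist())
print(last_field.tolist())
```

If the comma-separated assumption holds, the second approach keeps names such as 'United Kingdom' whole, which matters later when matching against the country names in the IMF tables.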

## 2. Cleaning the Country GDP Dataset

The goal of cleaning this dataset is to:
1. Extract only the 2021 GDP information
2. Standardize the 'GDP' column
3. Remove countries with a NaN value

In [None]:
# Renaming the columns for extraction
gdp = gdp.rename(columns = {'GDP per capita, current prices\n (U.S. dollars per capita)': 'Country', '2021': 'GDP'})

# Extracting the desired columns 
gdp_sub = gdp[['Country', 'GDP']]

# Extracting all countries without 'no data' value
gdp_sub = gdp_sub[gdp_sub['GDP'].str.contains('no data') == False]

# Standardizing GDP
gdp_sub['GDP'] = gdp_sub['GDP'].astype(str).apply(standardize_numbers)

# Removing all countries with a NaN value
gdp_sub.dropna(inplace=True)

# Reset indices after dropping NaN values
gdp_sub.reset_index(inplace=True, drop=True)

gdp_sub.head()

## 3. Cleaning the Country Unemployment Dataset

The goal of cleaning this dataset is to: 
1. Extract the relevant columns (Country and 2021 Unemployment)
2. Standardize the 'Unemployment' column
3. Remove any countries with a NaN value

In [None]:
# Renaming the columns for extraction
unemployment = unemployment.rename(columns = {'Unemployment rate (Percent)': 'Country', '2021': 'Unemployment'})

# Extracting the desired columns 
unemployment_sub = unemployment[['Country', 'Unemployment']]

# Extracting all countries without 'no data' value
unemployment_sub = unemployment_sub[unemployment_sub['Unemployment'].str.contains('no data') == False]

# Standardizing Unemployment
unemployment_sub['Unemployment'] = unemployment_sub['Unemployment'].astype(str).apply(standardize_numbers)

# Removing all countries with a NaN value
unemployment_sub.dropna(inplace=True)

# Reset indices after dropping NaN values
unemployment_sub.reset_index(inplace=True, drop=True)

unemployment_sub.head()

## 4. Cleaning the Country Inflation Rate Dataset

The goal of cleaning this dataset is to: 
1. Extract the relevant columns (Country and 2021 Inflation)
2. Standardize the 'Inflation' column
3. Remove any countries with a NaN value

In [None]:
# Renaming the columns for extraction
inflation = inflation.rename(columns = {'Inflation rate, average consumer prices (Annual percent change)': 'Country', '2021': 'Inflation'})

# Extracting the desired columns 
inflation_sub = inflation[['Country', 'Inflation']]

# Extracting all countries without 'no data' value
inflation_sub = inflation_sub[inflation_sub['Inflation'].str.contains('no data') == False]

# Standardizing Inflation
inflation_sub['Inflation'] = inflation_sub['Inflation'].astype(str).apply(standardize_numbers)

# Removing all countries with a NaN value
inflation_sub.dropna(inplace=True)

# Reset indices after dropping NaN values
inflation_sub.reset_index(inplace=True, drop=True)

inflation_sub.head()
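
The GDP, unemployment, and inflation tables are cleaned with the same three steps, so the logic could be factored into one helper. The following is a minimal sketch, not the cleaning code used above: `clean_imf_table` is a hypothetical name, it swaps in `pd.to_numeric` with `errors='coerce'` in place of `standardize_numbers`, and the toy table uses made-up country names and values.

```python
import pandas as pd

def clean_imf_table(df, label_column, value_name, year='2021'):
    """Hypothetical helper: keep the country label and one year, coerce to float, drop NaN."""
    # Rename the label column and the chosen year, then keep only those two columns
    sub = df.rename(columns={label_column: 'Country', year: value_name})
    sub = sub[['Country', value_name]].copy()

    # Remove thousands separators; non-numeric entries such as 'no data' become NaN
    as_text = sub[value_name].astype(str).str.replace(',', '', regex=False)
    sub[value_name] = pd.to_numeric(as_text, errors='coerce')

    # Drop rows with missing values and renumber the index
    return sub.dropna().reset_index(drop=True)

# Toy table mimicking the layout of the IMF exports (values are invented)
label = 'Inflation rate, average consumer prices (Annual percent change)'
raw = pd.DataFrame({
    label: ['Aland', 'Borduria', 'Carpania'],
    '2020': ['1.0', '2.0', '3.0'],
    '2021': ['1,234.5', 'no data', '3.2'],
})

inflation_toy = clean_imf_table(raw, label, 'Inflation')
print(inflation_toy)
```

A helper like this keeps the three cleaning cells consistent with each other, at the cost of hiding the individual steps that the cells above show explicitly.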

## 5. Merging the datasets

In [None]:
# Keeping only the restaurants whose countries have GDP data in the GDP dataset
gdpList = list(gdp_sub['Country'])
michelin_sub = michelin_sub[michelin_sub['Country'].isin(gdpList)].reset_index(drop=True)
michelinList = list(michelin_sub['Country'])
gdp_sub = gdp_sub[gdp_sub['Country'].isin(michelinList)].reset_index(drop=True)

# Merging the Michelin restaurants with GDP (inner join: restaurants without a match are dropped)
merged = michelin_sub.merge(gdp_sub, on='Country')

# Merging the unemployment rates
unemployment_sub = unemployment_sub[unemployment_sub['Country'].isin(michelinList)].reset_index(drop=True)
merged = merged.merge(unemployment_sub, on='Country')

# Merging the inflation rates
inflation_sub = inflation_sub[inflation_sub['Country'].isin(michelinList)].reset_index(drop=True)
merged = merged.merge(inflation_sub, on='Country')

merged.head()
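
Because `merge` performs an inner join by default, restaurants whose country string has no exact match in an indicator table are dropped silently. The sketch below shows one way to list such countries before merging, using a left join with `indicator=True`. The tables, names, and values are toy data invented for illustration.

```python
import pandas as pd

# Toy stand-ins for the cleaned tables (names and values are made up)
restaurants = pd.DataFrame({
    'Name': ['A', 'B', 'C'],
    'Country': ['France', 'Japan', 'Atlantis'],
})
gdp_toy = pd.DataFrame({'Country': ['France', 'Japan'], 'GDP': [1.0, 2.0]})

# A left join with indicator=True adds a '_merge' column marking unmatched rows
checked = restaurants.merge(gdp_toy, on='Country', how='left', indicator=True)
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Country'].unique().tolist()
print(unmatched)

# The default inner join keeps only restaurants with a matching country
inner = restaurants.merge(gdp_toy, on='Country')
print(len(inner))
```

Inspecting the unmatched list can reveal spelling differences between the Michelin country strings and the IMF country names, which would otherwise shrink the merged table without any error.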

# Data Analysis & Results (EDA)

Carry out EDA on your dataset(s); Describe in this section

In [3]:
## YOUR CODE HERE
## FEEL FREE TO ADD MULTIPLE CELLS PER SECTION