<a href="https://www.kaggle.com/code/nhanbaoho/analysis-of-cost-of-living-index-2022?scriptVersionId=98212402" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

---
# Contents
<ol>
    <li>Import packages</li>
    <li>Read and Preprocess data</li>
    <li>Analysis of individual indicators</li>
    <li>Correlation between criteria</li>
    <li>The gap between continents</li>
</ol>

---
# 1. Import packages

In [None]:
# to map countries to continents
!pip install pycountry_convert

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import pycountry_convert as pycountry

---
# 2. Read and Preprocess data
## 2.1 Read data into a dataframe named "df"

In [None]:
df = pd.read_csv('/kaggle/input/cost-of-living-index-2022/Cost_of_Living_Index_2022.csv')
df

### The data contains 6 categories from 139 countries.
* Cost of living
* Rent
* Cost of living plus rent
* Groceries
* Restaurant
* Local purchasing power

### Explore information about the dataframe: index dtype, columns, non-null values, memory usage

In [None]:
df.info()

## 2.2 Remove null values

In [None]:
df.isna().sum()

#### There are no null values to remove.

## 2.3 Remove the column Rank
The "Rank" column only reflects the alphabetical order of the country names, so it can be removed from the data.

In [None]:
df.head()

In [None]:
df = df.drop("Rank", axis=1)

In [None]:
df.head()

## 2.4 Add continents of countries to dataframe
#### We first build a mapping function
(Reference: https://pyquestions.com/get-continent-name-from-country-using-pycountry)

In [None]:
# mapping function: country -> continent
def country_to_continent(country_name):
    country_alpha2 = pycountry.country_name_to_country_alpha2(country_name)
    country_continent_code = pycountry.country_alpha2_to_continent_code(country_alpha2)
    country_continent_name = pycountry.convert_continent_code_to_continent_name(country_continent_code)
    return country_continent_name
country_to_continent("Zimbabwe")

In [None]:
# create country array
countries = np.array(df["Country"])
print(countries)

In [None]:
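# Three countries cannot be converted by pycountry_convert, so they are assigned manually.
europe = ["Kosovo (Disputed Territory)", "Bosnia And Herzegovina"]
# Trinidad and Tobago is placed in South America here, although it is often grouped with North America (Caribbean).
south_america = ["Trinidad And Tobago"]
# continent of each country, in the same order as `countries`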
continent = []
for c in countries:
    if c in europe:
        continent.append("Europe")
    elif c in south_america:
        continent.append("South America")
    else:
        continent.append(country_to_continent(c))
# print the resulting list of continents
print(continent)
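
The manual lists above cover this dataset, but any other unrecognised name would stop the loop with an error. Below is a minimal sketch of a more defensive variant. The names `MANUAL_CONTINENTS`, `safe_continent`, and `demo_lookup` are illustrative and not part of the notebook, and the sketch assumes that the lookup signals an unknown name by raising `KeyError`. In the notebook, `country_to_continent` would be passed in place of the stand-in lookup.

```python
# Hypothetical override table for names the automatic lookup cannot resolve.
MANUAL_CONTINENTS = {
    "Kosovo (Disputed Territory)": "Europe",
    "Bosnia And Herzegovina": "Europe",
    "Trinidad And Tobago": "South America",  # same assignment as in the notebook
}


def safe_continent(country_name, lookup, overrides=MANUAL_CONTINENTS, default="Unknown"):
    """Return a continent name, preferring manual overrides over the lookup."""
    if country_name in overrides:
        return overrides[country_name]
    try:
        return lookup(country_name)
    except KeyError:
        # Assumption: the lookup raises KeyError for names it does not know.
        return default


def demo_lookup(name):
    """Stand-in for country_to_continent so that the sketch runs on its own."""
    return {"Zimbabwe": "Africa", "Japan": "Asia"}[name]


names = ["Zimbabwe", "Kosovo (Disputed Territory)", "Atlantis"]
continents = [safe_continent(n, demo_lookup) for n in names]
print(continents)
```

Returning a visible placeholder such as `"Unknown"` keeps the column the same length as the dataframe and makes unresolved names easy to spot afterwards.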

In [None]:
# Insert new column titled "Continent"
df["Continent"] = np.array(continent)
df

---
# 3. Analysis of individual indicators
* In this section, we explore the distribution of each indicator using seaborn's `displot`.
* We skip the Cost of Living Plus Rent Index in this step.

### Explore descriptive statistics of the dataframe (each index is relative to New York, which has the baseline value 100).

In [None]:
df.describe()

* We can see that the 75th percentile is below 66 for every indicator, so at least 75% of countries score under 66. This tells us that New York (index 100) is more expensive than most countries.
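
The percentile reading above can also be computed directly. The sketch below uses a small made-up table (the values are illustrative, not taken from the dataset) to show how the "75%" row of `describe()` and the share of rows below the New York baseline could be obtained.

```python
import pandas as pd

# Made-up index values for illustration only (not taken from the dataset).
toy = pd.DataFrame({
    "Cost of Living Index": [25.0, 35.0, 45.0, 60.0, 110.0],
    "Rent Index": [5.0, 10.0, 15.0, 30.0, 70.0],
})

# 75th percentile of each column, i.e. the "75%" row of describe().
q75 = toy.quantile(0.75)
print(q75)

# Share of rows below the New York baseline of 100, per column.
share_below_ny = (toy < 100).mean()
print(share_below_ny)
```

On the real data, the same two lines applied to the numeric columns of `df` would give the figures discussed above.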

## 3.1 Cost of Living Index

In [None]:
# Distribution of 'Cost of Living Index' of all continents
sns.displot(data=df, x='Cost of Living Index', kde=True, hue="Continent")
plt.title("Distribution of 'Cost of Living Index' of all continents")

* The Cost of Living Index is right-skewed on every continent (most values sit at the low end, with a long tail toward higher values), with centres around 30-50. Africa is the cheapest and Oceania the most expensive.
* Europe's values are spread more evenly than those of the other continents.
* Only two countries have an index above New York's 100: one in Europe and one in North America.
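
The direction of skew can be checked numerically rather than read off the plot. The sketch below implements the moment-based skewness coefficient in plain Python on made-up values. The function name `sample_skewness` is illustrative. A positive result means a long tail toward high values, a negative one a long tail toward low values.

```python
from statistics import mean


def sample_skewness(values):
    """Moment-based skewness: third central moment divided by variance**1.5."""
    mu = mean(values)
    m2 = mean([(v - mu) ** 2 for v in values])
    m3 = mean([(v - mu) ** 3 for v in values])
    return m3 / m2 ** 1.5


# Made-up values: mostly low, with one very high value (long right tail).
right_tailed = [30, 32, 35, 38, 40, 45, 90]
# Made-up values: mostly high, with one very low value (long left tail).
left_tailed = [10, 55, 60, 62, 65, 68, 70]

print(sample_skewness(right_tailed))  # positive
print(sample_skewness(left_tailed))   # negative
```

pandas also offers `Series.skew()`, which applies a bias correction, so its values differ slightly from this uncorrected version while the sign carries the same meaning.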

### Below are the top 10 countries with the highest 'Cost of Living Index'.

In [None]:
df.nlargest(10, 'Cost of Living Index')[['Country', 'Continent']]

## 3.2 Rent Index

In [None]:
# Distribution of 'Rent Index' of all continents
sns.displot(data=df, x='Rent Index', kde=True, hue="Continent")
plt.title("Distribution of 'Rent Index' of all continents")

* The Rent Index is far below New York's on every continent and right-skewed, centred around 20-30. In most countries it is less than half of New York's.
* New York is the most expensive place to rent an apartment.
* The spread of this index in Asia is noticeable: most countries are fairly cheap, but two are significantly more expensive than the rest of Asia.
* Rent is most affordable in South America, with the index ranging between 5 and 20.

## 3.3 Groceries Index

In [None]:
# Distribution of 'Groceries Index' of all continents
sns.displot(data=df, x='Groceries Index', kde=True, hue="Continent")
plt.title("Distribution of 'Groceries Index' of all continents")

* The Groceries Index is low in Africa and South America.
* Asia shows a large spread, from 20 to 80.

## 3.4 Restaurant Price Index

In [None]:
# Distribution of 'Restaurant Price Index' of all continents
sns.displot(data=df, x='Restaurant Price Index', kde=True, hue="Continent")
plt.title("Distribution of 'Restaurant Price Index' of all continents")

* Asia has more low-priced restaurants than other continents.
* Europe shows a wide range of prices, spread between 20 and 80.

## 3.5 Local Purchasing Power Index

In [None]:
# Distribution of 'Local Purchasing Power Index' of all continents
sns.displot(data=df, x='Local Purchasing Power Index', kde=True, hue="Continent")
plt.title("Distribution of 'Local Purchasing Power Index' of all continents")

* This index is widely spread for Europe and Asia, with Europe far higher than the rest of the world. This shows that purchasing power is stronger in Europe.
* At the other end, Africa has the lowest purchasing power.

### Below are the top 10 countries with the highest Local Purchasing Power Index

In [None]:
df.nlargest(10, 'Local Purchasing Power Index')[['Country', 'Continent']]

## 3.6 General examination
* All indicators are right-skewed (values concentrate at the low end), especially rent.
* Europe has more indicators close to New York's level than the other five continents, especially the Local Purchasing Power Index.
* Asia shows large differences between countries in all indices.
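
Observations like these can be backed by a per-continent summary. The following sketch groups a small made-up table by continent and reports the median and standard deviation of one index. The values are illustrative only. A larger standard deviation corresponds to a wider spread between countries.

```python
import pandas as pd

# Made-up values for illustration only (not taken from the dataset).
toy = pd.DataFrame({
    "Continent": ["Asia", "Asia", "Asia", "Europe", "Europe", "Europe"],
    "Rent Index": [5.0, 20.0, 65.0, 20.0, 25.0, 30.0],
})

# Median (typical level) and standard deviation (spread) per continent.
summary = toy.groupby("Continent")["Rent Index"].agg(["median", "std"])
print(summary)
```

Applied to `df`, the same `groupby("Continent")` call could be run on any of the index columns.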

---
# 4. Correlation between criteria
## We now examine the correlation between indices.

In [None]:
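# correlation matrix of the numeric index columns
# (the text columns Country and Continent are excluded, since recent pandas versions raise an error on them)
df.select_dtypes(include="number").corr()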

## 4.1 Display correlation in heat map and pairplot

In [None]:
# figure size
plt.figure(figsize=(6, 6))
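# correlation matrix of the numeric index columns only
corr = df.select_dtypes(include="number").corr()
# boolean mask hiding the upper triangle (including the diagonal)
marked_matrix = np.triu(np.ones_like(corr, dtype=bool))
# plot heatmap
sns.heatmap(data=corr, cmap='viridis', annot=True, mask=marked_matrix)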
# figure title
plt.title("Correlation coefficients between indices")

In [None]:
# pairwise scatter plots of the indices, lower triangle only
# (pairplot creates its own figure, so a separate plt.figure call is not needed)
sns.pairplot(df, hue='Continent', palette='viridis', corner=True)

## 4.2 'Cost of Living Index' vs. others
* The 'Cost of Living Index' is highly correlated with the 'Groceries Index' and the 'Restaurant Price Index', with coefficients of around 0.95. The pair plot shows a clear linear trend that reflects this. Its correlation with the 'Rent Index' is slightly lower but still high, at 0.84.

### 4.2.a Cost of Living Index vs. Restaurant Price Index

In [None]:
# Cost of Living Index vs. Restaurant Price Index
sns.lmplot(data=df, x="Cost of Living Index", y="Restaurant Price Index", hue='Continent')
plt.title("Cost of Living Index vs. Restaurant Price Index")

* The graph shows that, relative to the Cost of Living Index, Europe has the highest restaurant prices and Asia the lowest.

### 4.2.b Cost of Living Index vs. Groceries Index

In [None]:
# Cost of Living Index vs. Groceries Index
sns.lmplot(data=df, x="Cost of Living Index", y="Groceries Index", hue='Continent')
plt.title("Cost of Living Index vs. Groceries Index")

* The trend is reversed for groceries: relative to the Cost of Living Index, Europe has the lowest grocery prices while Asia has higher ones.
* North America and Europe appear to be in a similar situation.

### 4.2.c Cost of Living Index vs. Rent Index

In [None]:
# Cost of Living Index vs. Rent Index
sns.lmplot(data=df, x="Cost of Living Index", y="Rent Index", hue='Continent')
plt.title("Cost of Living Index vs. Rent Index")

* In both Asia and Europe, the Rent Index varies widely for a given Cost of Living Index.
* Rent in South America appears cheap relative to the Cost of Living Index.

## 4.3 Restaurant vs. Groceries

In [None]:
# scatter plot with one regression line per continent
# (lmplot creates its own figure, so a separate plt.figure call is not needed)
sns.lmplot(data=df, x="Groceries Index", y="Restaurant Price Index", hue="Continent")
plt.title("Restaurant Price Index vs. Groceries Index")

* The graph shows opposite trends for Europe and Asia: relative to grocery prices, restaurant prices are highest in Europe, while in Asia they are comparatively cheap.

## 4.4 Local Purchasing Power Index vs. other indices

In [None]:
# figure size
plt.figure(figsize=(6, 6))
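# correlation matrix of the numeric index columns only
corr = df.select_dtypes(include="number").corr()
# boolean mask hiding the upper triangle (including the diagonal)
marked_matrix = np.triu(np.ones_like(corr, dtype=bool))
# plot heatmap
sns.heatmap(data=corr, cmap='viridis', annot=True, mask=marked_matrix)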
# figure title
plt.title("Correlation coefficients between indices")

### 4.4.a The choice of criteria
* The 'Local Purchasing Power Index' is moderately correlated with the other indices at similar levels, with coefficients between 0.63 and 0.7.
* We have seen that the 'Cost of Living Index' is highly correlated with the 'Groceries Index' and the 'Restaurant Price Index'.
* It is therefore enough to consider the 'Local Purchasing Power Index' in relation to the 'Rent Index' and the 'Cost of Living Index'.
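
The reasoning above, dropping indices that carry nearly the same information as another one, can be expressed as a small check for highly correlated pairs. The sketch below uses made-up values, and the cut-off `threshold = 0.9` is a hypothetical choice rather than something taken from the notebook.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only (not taken from the dataset).
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D", "E"],
    "Cost of Living Index": [30.0, 40.0, 55.0, 70.0, 90.0],
    "Groceries Index": [28.0, 41.0, 52.0, 69.0, 88.0],
    "Rent Index": [10.0, 30.0, 12.0, 35.0, 20.0],
})

# Pearson correlation of the numeric columns only.
corr = toy.select_dtypes(include="number").corr()

# Keep each pair once by looking only strictly above the diagonal.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack()

threshold = 0.9  # hypothetical cut-off for "highly correlated"
redundant = pairs[pairs.abs() > threshold]
print(redundant)
```

Each pair listed in `redundant` is a candidate for keeping only one of its two columns in later comparisons.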

### 4.4.b 'Local Purchasing Power Index' vs. 'Cost of Living Index'

In [None]:
sns.lmplot(data=df, 
                x = 'Cost of Living Index', 
                y = 'Local Purchasing Power Index', 
                hue = 'Continent')

* The graph shows that the higher the 'Cost of Living Index', the higher the 'Local Purchasing Power Index' in four continents (Europe, Asia, North America, and Oceania), with Oceania showing the steepest trend.
* However, the trend is reversed in both Africa and South America.
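
The direction of each trend line can also be read numerically. `lmplot` fits a separate straight line for every hue group, and the sketch below reproduces that idea with an ordinary least-squares fit via `np.polyfit` on made-up data. The continent labels "X" and "Y", the values, and the helper `ols_slope` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Made-up values for two hypothetical continents "X" and "Y".
toy = pd.DataFrame({
    "Continent": ["X"] * 4 + ["Y"] * 4,
    "Cost of Living Index": [20.0, 30.0, 40.0, 50.0, 20.0, 30.0, 40.0, 50.0],
    "Local Purchasing Power Index": [30.0, 40.0, 55.0, 65.0, 40.0, 35.0, 28.0, 20.0],
})


def ols_slope(group, x, y):
    """Slope of the least-squares line of y on x within one group."""
    slope, _intercept = np.polyfit(group[x].to_numpy(), group[y].to_numpy(), deg=1)
    return slope


slopes = {
    name: ols_slope(g, "Cost of Living Index", "Local Purchasing Power Index")
    for name, g in toy.groupby("Continent")
}
print(slopes)
```

A positive slope corresponds to an increasing trend and a negative slope to a reversed one. With only a handful of countries per continent, such slopes are sensitive to individual points and are best read as indicative.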

### 4.4.c 'Local Purchasing Power Index' vs. 'Rent Index'

In [None]:
sns.lmplot(data=df, 
                x = 'Rent Index', 
                y = 'Local Purchasing Power Index', 
                hue = 'Continent')

* Africa again shows the opposite trend to the other continents: higher rent goes with lower purchasing power.
* The other continents show a linearly increasing trend, with Oceania again having the steepest gradient.

---
# 5. The gap between continents

## 5.1 Let's consider the 15 countries with the highest indices

In [None]:
df.sort_values(['Cost of Living Index', 'Rent Index','Groceries Index', 'Restaurant Price Index'],
              ascending = [False, False, False, False]).head(15)

## 5.2 Let's consider the 15 countries with the lowest indices

In [None]:
df.sort_values(['Cost of Living Index', 'Rent Index','Groceries Index', 'Restaurant Price Index'],
              ascending = [True, True, True, True]).head(15)

* There is a remarkable gap between the indices of the 15 highest and the 15 lowest countries. The countries with the highest indices are Western countries, while those with the lowest are mainly in Asia and Africa.
* Asia shows a mix of high and low indices.
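
The size of the gap can be quantified. Note that `sort_values` with several columns orders rows by the first column and uses the remaining ones only to break ties, so the tables above are effectively ranked by 'Cost of Living Index'. The sketch below compares the mean of the highest and lowest rows for that single column, using made-up values and `n = 2` in place of the 15 used in the notebook.

```python
import pandas as pd

# Made-up values for illustration only (not taken from the dataset).
toy = pd.DataFrame({
    "Country": ["A", "B", "C", "D", "E", "F"],
    "Cost of Living Index": [120.0, 100.0, 60.0, 45.0, 25.0, 15.0],
})

col = "Cost of Living Index"
n = 2  # the notebook uses 15

top = toy.nlargest(n, col)
bottom = toy.nsmallest(n, col)

# Absolute and relative gap between the two groups.
gap = top[col].mean() - bottom[col].mean()
ratio = top[col].mean() / bottom[col].mean()
print(f"gap = {gap:.1f}, ratio = {ratio:.2f}")
```

The ratio is often easier to interpret than the absolute difference, since every index is expressed relative to New York's value of 100.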

---
### Thanks for your interest and your feedback!