# Data Analysis on Billionaires

# Introduction

Who are the richest people around the world? By analyzing data on billionaires, we can gain insight into the world's wealthiest individuals, the industries they work in, their geographic distribution, and the dynamics of their wealth.

# 1. Data Preprocessing

## 1.1 Load Dataset

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.figure_factory as ff # for plot heatmap
import geopandas as gpd
import folium
from folium.plugins import MarkerCluster
%matplotlib inline

In [None]:
# Mount Google Drive
from google.colab import drive
drivePath = '/content/drive'
drive.mount(drivePath)

Mounted at /content/drive


In [None]:
# Install wget
!pip install wget

Collecting wget
  Downloading wget-3.2.zip (10 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: wget
  Building wheel for wget (setup.py) ... done
  Created wheel for wget: filename=wget-3.2-py3-none-any.whl size=9656 sha256=94ac115bf9d2de9b6a72aba862d2c02c62fdcd326be70e4d13184ea8ba725d3e
  Stored in directory: /root/.cache/pip/wheels/8b/f1/7f/5c94f0a7a505ca1c81cd1d9208ae2064675d97582078e6c769
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2


In [None]:
import wget

# Setup URL and path
URL = 'https://raw.githubusercontent.com/lviiholic/dataset-box/main/Billionaires_Statistics_Dataset.csv'
dataPath = drivePath + '/MyDrive/Colab Notebooks/data/'

# Download the file from GitHub
fileName = wget.download(URL, out=dataPath)

# Print the file name including the local path
print(fileName)

/content/drive/MyDrive/Colab Notebooks/data//Billionaires_Statistics_Dataset (1).csv


In [None]:
# Read file into dataframe
df = pd.read_csv(fileName)
df.head()

Unnamed: 0,rank,finalWorth,category,personName,age,country,city,source,industries,countryOfCitizenship,...,cpi_change_country,gdp_country,gross_tertiary_education_enrollment,gross_primary_education_enrollment_country,life_expectancy_country,tax_revenue_country_country,total_tax_rate_country,population_country,latitude_country,longitude_country
0,1,211000,Fashion & Retail,Bernard Arnault & family,74.0,France,Paris,LVMH,Fashion & Retail,France,...,1.1,"$2,715,518,274,227",65.6,102.5,82.5,24.2,60.7,67059887.0,46.227638,2.213749
1,2,180000,Automotive,Elon Musk,51.0,United States,Austin,"Tesla, SpaceX",Automotive,United States,...,7.5,"$21,427,700,000,000",88.2,101.8,78.5,9.6,36.6,328239523.0,37.09024,-95.712891
2,3,114000,Technology,Jeff Bezos,59.0,United States,Medina,Amazon,Technology,United States,...,7.5,"$21,427,700,000,000",88.2,101.8,78.5,9.6,36.6,328239523.0,37.09024,-95.712891
3,4,107000,Technology,Larry Ellison,78.0,United States,Lanai,Oracle,Technology,United States,...,7.5,"$21,427,700,000,000",88.2,101.8,78.5,9.6,36.6,328239523.0,37.09024,-95.712891
4,5,106000,Finance & Investments,Warren Buffett,92.0,United States,Omaha,Berkshire Hathaway,Finance & Investments,United States,...,7.5,"$21,427,700,000,000",88.2,101.8,78.5,9.6,36.6,328239523.0,37.09024,-95.712891


## 1.2 Shape of Data

In [None]:
print(f"The dataset includes {df.shape[0]} rows and {df.shape[1]} columns.")

The dataset includes 2640 rows and 35 columns.


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2640 entries, 0 to 2639
Data columns (total 35 columns):
 #   Column                                      Non-Null Count  Dtype  
---  ------                                      --------------  -----  
 0   rank                                        2640 non-null   int64  
 1   finalWorth                                  2640 non-null   int64  
 2   category                                    2640 non-null   object 
 3   personName                                  2640 non-null   object 
 4   age                                         2575 non-null   float64
 5   country                                     2602 non-null   object 
 6   city                                        2568 non-null   object 
 7   source                                      2640 non-null   object 
 8   industries                                  2640 non-null   object 
 9   countryOfCitizenship                        2640 non-null   object 
 10  organization

Details for each column:

- rank: The ranking of the billionaire in terms of wealth.
- finalWorth: The final net worth of the billionaire in millions of U.S. dollars (for example, 211000 means USD 211 billion).
- category: The category or industry in which the billionaire's business operates.
- personName: The full name of the billionaire.
- age: The age of the billionaire.
- country: The country in which the billionaire resides.
- city: The city in which the billionaire resides.
- source: The source of the billionaire's wealth.
- industries: The industries associated with the billionaire's business interests.
- countryOfCitizenship: The country of citizenship of the billionaire.
- organization: The name of the organization or company associated with the billionaire.
- selfMade: Indicates whether the billionaire is self-made (True/False).
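- status: "D" represents self-made billionaires (Founders/Entrepreneurs) and "U" indicates inherited or unearned wealth. The data also contains other codes (E, N, R, and Split Family Fortune) that this description does not cover.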
- gender: The gender of the billionaire.
- birthDate: The birthdate of the billionaire.
- lastName: The last name of the billionaire.
- firstName: The first name of the billionaire.
- title: The title or honorific of the billionaire.
- date: The date of data collection.
- state: The state in which the billionaire resides.
- residenceStateRegion: The region or state of residence of the billionaire.
- birthYear: The birth year of the billionaire.
- birthMonth: The birth month of the billionaire.
- birthDay: The birth day of the billionaire.
- cpi_country: Consumer Price Index (CPI) for the billionaire's country.
- cpi_change_country: CPI change for the billionaire's country.
- gdp_country: Gross Domestic Product (GDP) for the billionaire's country.
- gross_tertiary_education_enrollment: Enrollment in tertiary education in the billionaire's country.
- gross_primary_education_enrollment_country: Enrollment in primary education in the billionaire's country.
- life_expectancy_country: Life expectancy in the billionaire's country.
- tax_revenue_country_country: Tax revenue in the billionaire's country.
- total_tax_rate_country: Total tax rate in the billionaire's country.
- population_country: Population of the billionaire's country.
- latitude_country: Latitude coordinate of the billionaire's country.
- longitude_country: Longitude coordinate of the billionaire's country.

## 1.3 Missing Values

In [None]:
df.isna().sum()

rank                                             0
finalWorth                                       0
category                                         0
personName                                       0
age                                             65
country                                         38
city                                            72
source                                           0
industries                                       0
countryOfCitizenship                             0
organization                                  2315
selfMade                                         0
status                                           0
gender                                           0
birthDate                                       76
lastName                                         0
firstName                                        3
title                                         2301
date                                             0
state                          

Several key columns used in the analysis, such as age, country, and city, contain missing values, particularly when billionaires opt not to disclose personal information. As a result, some rows lack necessary information, and there is no reliable basis for estimating the missing values in these columns. The columns state and residenceStateRegion appear to contain data only when the countryOfCitizenship column equals 'United States'. If this assumption is correct, these columns could support a deeper analysis of the U.S. subset but add little to the overall exploratory data analysis (EDA). We therefore drop these and several other columns that the analysis does not use.
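The assumption about the state columns can be checked directly before dropping them. The following is a minimal sketch on a hypothetical toy frame, not on the real dataset; the helper name `state_only_for_us` is made up for illustration, and the column to compare against is a parameter because the text names countryOfCitizenship while the country of residence is stored in country.

```python
import pandas as pd

# Hypothetical toy rows standing in for the real dataset.
toy = pd.DataFrame({
    "country": ["United States", "France", "United States", None],
    "state": ["Texas", None, "Washington", None],
})

def state_only_for_us(frame, country_col="country", state_col="state"):
    """Return True if every row with a non-null state is a United States row."""
    has_state = frame[state_col].notna()
    return bool((frame.loc[has_state, country_col] == "United States").all())

print(state_only_for_us(toy))
```

On the real dataframe the same helper could be called once with `country_col="country"` and once with `country_col="countryOfCitizenship"` to see which reading of the assumption holds.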

### 1.3.1 Handle Missing Values

In [None]:
# Columns that have a lot of NAs or that we will not use in the analysis can be removed
columns_to_drop = ["birthDay", "birthMonth", "birthYear", "state", "firstName", "lastName", "residenceStateRegion", "organization", "title"]
df.drop(columns_to_drop, axis=1, inplace=True)

Imputing missing values with the mean or median would introduce artificial values and distort the visualizations. We therefore leave the remaining NAs as they are and proceed.
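Leaving the NAs in place is workable because pandas aggregations skip missing values by default. The sketch below illustrates this on a small hypothetical series of ages; the values are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing entry.
ages = pd.Series([74.0, 51.0, np.nan, 59.0])

# mean() ignores NaN by default (skipna=True), so it averages 3 values.
mean_age = ages.mean()

# count() returns the number of non-null entries only.
known_ages = ages.count()

print(mean_age, known_ages)
```

The practical consequence is that each statistic is computed over the rows where that column is known, so sample sizes can differ from column to column.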

In [None]:
df.isna().sum()

rank                                            0
finalWorth                                      0
category                                        0
personName                                      0
age                                            65
country                                        38
city                                           72
source                                          0
industries                                      0
countryOfCitizenship                            0
selfMade                                        0
status                                          0
gender                                          0
birthDate                                      76
date                                            0
cpi_country                                   184
cpi_change_country                            184
gdp_country                                   164
gross_tertiary_education_enrollment           182
gross_primary_education_enrollment_country    181


In [None]:
df.duplicated().sum()

0

The dataframe contains no duplicate rows.

In [None]:
df.T.duplicated(keep=False)

rank                                          False
finalWorth                                    False
category                                       True
personName                                    False
age                                           False
country                                       False
city                                          False
source                                        False
industries                                     True
countryOfCitizenship                          False
selfMade                                      False
status                                        False
gender                                        False
birthDate                                     False
date                                          False
cpi_country                                   False
cpi_change_country                            False
gdp_country                                   False
gross_tertiary_education_enrollment           False
gross_primar

The output shows that the category and industries columns hold identical values, so we drop the category column.
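A more targeted way to confirm that two specific columns are identical is `Series.equals`, which compares values element by element. This is a sketch on a hypothetical three-row frame, offered as an alternative check rather than the method used in this notebook.

```python
import pandas as pd

# Hypothetical toy frame with one duplicated column.
toy = pd.DataFrame({
    "category": ["Fashion & Retail", "Automotive", "Technology"],
    "industries": ["Fashion & Retail", "Automotive", "Technology"],
    "source": ["LVMH", "Tesla, SpaceX", "Amazon"],
})

# True only if both series have the same values in the same positions.
same = toy["category"].equals(toy["industries"])
different = toy["category"].equals(toy["source"])

print(same, different)
```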

In [None]:
# Drop duplicate columns for analytical purposes
df.drop("category", axis=1, inplace=True)

In [None]:
df.head()

Unnamed: 0,rank,finalWorth,personName,age,country,city,source,industries,countryOfCitizenship,selfMade,...,cpi_change_country,gdp_country,gross_tertiary_education_enrollment,gross_primary_education_enrollment_country,life_expectancy_country,tax_revenue_country_country,total_tax_rate_country,population_country,latitude_country,longitude_country
0,1,211000,Bernard Arnault & family,74.0,France,Paris,LVMH,Fashion & Retail,France,False,...,1.1,"$2,715,518,274,227",65.6,102.5,82.5,24.2,60.7,67059887.0,46.227638,2.213749
1,2,180000,Elon Musk,51.0,United States,Austin,"Tesla, SpaceX",Automotive,United States,True,...,7.5,"$21,427,700,000,000",88.2,101.8,78.5,9.6,36.6,328239523.0,37.09024,-95.712891
2,3,114000,Jeff Bezos,59.0,United States,Medina,Amazon,Technology,United States,True,...,7.5,"$21,427,700,000,000",88.2,101.8,78.5,9.6,36.6,328239523.0,37.09024,-95.712891
3,4,107000,Larry Ellison,78.0,United States,Lanai,Oracle,Technology,United States,True,...,7.5,"$21,427,700,000,000",88.2,101.8,78.5,9.6,36.6,328239523.0,37.09024,-95.712891
4,5,106000,Warren Buffett,92.0,United States,Omaha,Berkshire Hathaway,Finance & Investments,United States,True,...,7.5,"$21,427,700,000,000",88.2,101.8,78.5,9.6,36.6,328239523.0,37.09024,-95.712891


## 1.4 Data Types

In [None]:
df.dtypes

rank                                            int64
finalWorth                                      int64
personName                                     object
age                                           float64
country                                        object
city                                           object
source                                         object
industries                                     object
countryOfCitizenship                           object
selfMade                                         bool
status                                         object
gender                                         object
birthDate                                      object
date                                           object
cpi_country                                   float64
cpi_change_country                            float64
gdp_country                                    object
gross_tertiary_education_enrollment           float64
gross_primary_education_enro

In [None]:
# Convert 'birthDate' to datetime
df["birthDate"] = pd.to_datetime(df["birthDate"])

# Convert 'date' to datetime
df["date"] = pd.to_datetime(df["date"])

# Convert 'gdp_country' to numeric, removing dollar signs and commas
df["gdp_country"] = pd.to_numeric(df["gdp_country"].replace(r"[$,]", "", regex=True), errors="coerce")

# Display the data types after conversion
print(df.dtypes)

rank                                                   int64
finalWorth                                             int64
personName                                            object
age                                                  float64
country                                               object
city                                                  object
source                                                object
industries                                            object
countryOfCitizenship                                  object
selfMade                                                bool
status                                                object
gender                                                object
birthDate                                     datetime64[ns]
date                                          datetime64[ns]
cpi_country                                          float64
cpi_change_country                                   float64
gdp_country             
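The currency cleanup used for gdp_country can be tried in isolation. The sketch below uses two GDP strings taken from the `df.head()` output plus two hypothetical bad entries (`None` and `"n/a"`) to show what `errors="coerce"` does with values that cannot be parsed.

```python
import pandas as pd

# Two real-looking GDP strings and two hypothetical unparseable entries.
raw = pd.Series(["$2,715,518,274,227", "$21,427,700,000,000", None, "n/a"])

# Strip dollar signs and thousands separators, then convert to numbers.
# Anything that still cannot be parsed becomes NaN instead of raising.
cleaned = pd.to_numeric(raw.str.replace(r"[$,]", "", regex=True), errors="coerce")

print(cleaned)
```

Because unparseable entries silently become NaN, it is worth comparing the NA count before and after the conversion to make sure no valid values were lost.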

## 1.5 Descriptive Statistics

Descriptive statistics summarize the basic properties of the data and help us understand it better. We compute them separately for the two kinds of columns: categorical and numerical.

In [None]:
def describe_dataframe(df):
    # Get summary statistics for numerical data
    numeric_stats = df.describe()

    # Get unique values percentages for categorical data
    unique_values_percentage = {}
    cat_cols = df.select_dtypes(include=['object', 'category', 'bool']).columns.tolist()
    for col in cat_cols:
        unique_values_percentage[col] = (df[col].value_counts(normalize=True) * 100).round(2).to_dict()

    # Get missing percentages
    missing_percentages = (df.isnull().mean() * 100).round(2)

    # Combine information into separate DataFrames for numerical and categorical data
    numeric_description = pd.DataFrame({
        'data_type': numeric_stats.dtypes,
        'nan_percentage': missing_percentages[numeric_stats.columns]
    })
    numeric_description = pd.concat([numeric_description, numeric_stats.transpose()], axis=1)

    categorical_description = pd.DataFrame({
        'data_type': df[cat_cols].dtypes,
        'unique_values_percentage': unique_values_percentage,
        'nan_percentage': missing_percentages[cat_cols],
        'number_of_unique_values': df[cat_cols].nunique()
    })

    return numeric_description, categorical_description

numeric_description, categorical_description = describe_dataframe(df)

### 1.5.1 Numeric Data Description

In [None]:
print("Numeric Data Description:")
numeric_description

Numeric Data Description:


Unnamed: 0,data_type,nan_percentage,count,mean,min,25%,50%,75%,max,std
rank,float64,0.0,2640.0,1289.159091,1.0,659.0,1312.0,1905.0,2540.0,739.693726
finalWorth,float64,0.0,2640.0,4623.787879,1000.0,1500.0,2300.0,4200.0,211000.0,9834.240939
age,float64,2.46,2575.0,65.140194,18.0,56.0,65.0,75.0,101.0,13.258098
birthDate,object,2.88,2564.0,1957-08-10 08:18:54.477379072,1921-09-11 00:00:00,1948-02-04 18:00:00,1957-07-15 12:00:00,1966-09-23 12:00:00,2004-05-06 00:00:00,
date,object,0.0,2640.0,2023-04-04 05:01:10.909091072,2023-04-04 05:01:00,2023-04-04 05:01:00,2023-04-04 05:01:00,2023-04-04 05:01:00,2023-04-04 09:01:00,
cpi_country,float64,6.97,2456.0,127.755204,99.55,117.24,117.24,125.08,288.57,26.452951
cpi_change_country,float64,6.97,2456.0,4.364169,-1.9,1.7,2.9,7.5,53.5,3.623763
gdp_country,float64,6.21,2476.0,11582873303921.5625,3154057987.0,1736425629520.0,19910000000000.0,21427700000000.0,21427700000000.0,9575588391938.195
gross_tertiary_education_enrollment,float64,6.89,2458.0,67.225671,4.0,50.6,65.6,88.2,136.6,21.343426
gross_primary_education_enrollment_country,float64,6.86,2459.0,102.85852,84.7,100.2,101.8,102.6,142.1,4.710977


### 1.5.2 Categorical Data Description

In [None]:
pd.set_option("max_colwidth", None)
print("\nCategorical Data Description:")
categorical_description


Categorical Data Description:


Unnamed: 0,data_type,unique_values_percentage,nan_percentage,number_of_unique_values
personName,object,"{'Wang Yanqing & family': 0.08, 'Li Li': 0.08, 'Bernard Arnault & family': 0.04, 'Lei Jufang': 0.04, 'Kim Jung-youn': 0.04, 'Mustafa Kucuk': 0.04, 'Christopher Kwok': 0.04, 'Edward Kwok': 0.04, 'Lai Jianfa': 0.04, 'Lau Cho Kun': 0.04, 'Alexis Lê-Quôc': 0.04, 'James Leininger': 0.04, 'Meng Qingshan & family': 0.04, 'Xuhui Li': 0.04, 'Jimmy John Liautaud': 0.04, 'Lin Lairong & family': 0.04, 'Lu Di': 0.04, 'Palmer Luckey': 0.04, 'Duncan MacMillan': 0.04, 'Yusaku Maezawa': 0.04, 'Kim Jung-min': 0.04, 'Kim Chang-soo': 0.04, 'Osman Kibar': 0.04, 'Keeree Kanjanapas': 0.04, 'Marek Dospiva': 0.04, 'Keith Dunleavy & family': 0.04, 'John Elkann': 0.04, 'Francois Feuillet & family': 0.04, 'Yasuhiro Fukushima': 0.04, 'Mario Gabelli': 0.04, 'Rolf Gerling': 0.04, 'John Goff': 0.04, 'Alexandre Grendene Bartelle': 0.04, 'Gerry Harvey': 0.04, 'He Zhaoxi': 0.04, 'William Heinecke': 0.04, 'Jay Hennick': 0.04, 'Ilkka Herlin': 0.04, 'Asok Kumar Hiranandani': 0.04, 'Hal Jackman': 0.04, 'Pavan Jain': 0.04, 'Ilson Mateus & family': 0.04, 'Alberto Palatchi': 0.04, 'Kommer Damen': 0.04, 'Mrudula Parekh': 0.04, 'Wu Lanlan & family': 0.04, 'Xu Bingzhong': 0.04, 'Yang Xuegang': 0.04, 'Vladimir Yevtushenkov': 0.04, 'David Zalik': 0.04, 'Zhang Xiaojuan': 0.04, 'Zheng Xiaodong': 0.04, 'Zhong Ruonong & family': 0.04, 'Noubar Afeyan': 0.04, 'Syed Mokhtar AlBukhary': 0.04, 'Herbert Allen, Jr. 
& family': 0.04, 'Vasily Anisimov': 0.04, 'Clifford Asness': 0.04, 'Louis Bacon': 0.04, 'Bai Baokun': 0.04, 'Alex Birkenstock': 0.04, 'Christian Birkenstock': 0.04, 'Ian Wood & family': 0.04, 'Stephen Winn': 0.04, 'Tom Werner': 0.04, 'Paul Saville': 0.04, 'Dragos Paval': 0.04, 'Jorge Perez': 0.04, 'Dmitry Pumpyansky': 0.04, 'Jinsheng Ren & family': 0.04, 'Brian Roberts': 0.04, 'Joe Rogers, Jr.': 0.04, 'Ruan Shuilong & family': 0.04, 'Ivan Savvidis & family': 0.04, 'Wang Minwen': 0.04, 'Keiichi Shibahara': 0.04, 'Rajju Shroff': 0.04, 'Sun Huaiqing & family': 0.04, 'Alain Taravella': 0.04, 'Jonathan Tisch': 0.04, 'Kenneth Tuchman': 0.04, 'Wang Chou-hsiong': 0.04, 'Carl DeSantis': 0.04, 'Chen Kaichen': 0.04, 'Cai Huabo': 0.04, 'Satyanarayan Nuwal': 0.04, 'Lei Jin': 0.04, 'Jin Lei & family': 0.04, 'Valentin Kipyatkov': 0.04, 'Koo Kwang-mo': 0.04, 'George Kurtz': 0.04, 'Joe Lacob': 0.04, 'Joe Lau': 0.04, 'Louise Lindh': 0.04, 'Luo Yangyong & family': 0.04, 'Joao Roberto Marinho': 0.04, 'Jose Roberto Marinho': 0.04, 'Roberto Irineu Marinho': 0.04, 'Gary Michelson': 0.04, 'Robert G. Miller': 0.04, ...}",0.0,2638
country,object,"{'United States': 28.98, 'China': 20.1, 'India': 6.03, 'Germany': 3.92, 'United Kingdom': 3.15, 'Russia': 3.04, 'Switzerland': 3.0, 'Hong Kong': 2.61, 'Italy': 2.11, 'Singapore': 1.77, 'Brazil': 1.69, 'Australia': 1.65, 'Taiwan': 1.65, 'Canada': 1.61, 'Japan': 1.46, 'France': 1.35, 'South Korea': 1.11, 'Thailand': 1.08, 'Sweden': 1.0, 'Israel': 1.0, 'Turkey': 0.96, 'Spain': 0.96, 'Indonesia': 0.96, 'United Arab Emirates': 0.65, 'Monaco': 0.65, 'Philippines': 0.54, 'Mexico': 0.5, 'Austria': 0.42, 'Malaysia': 0.42, 'Netherlands': 0.38, 'Norway': 0.35, 'Denmark': 0.27, 'Kazakhstan': 0.27, 'Finland': 0.27, 'Czech Republic': 0.27, 'Vietnam': 0.23, 'Ukraine': 0.23, 'Chile': 0.23, 'Poland': 0.19, 'Cyprus': 0.19, 'South Africa': 0.19, 'Egypt': 0.15, 'Ireland': 0.15, 'Argentina': 0.15, 'Hungary': 0.12, 'Romania': 0.12, 'Greece': 0.12, 'Nigeria': 0.12, 'Belgium': 0.12, 'Cayman Islands': 0.12, 'Qatar': 0.08, 'Lebanon': 0.08, 'Slovakia': 0.08, 'Bermuda': 0.08, 'Morocco': 0.08, 'Peru': 0.08, 'Bahamas': 0.08, 'New Zealand': 0.08, 'Turks and Caicos Islands': 0.04, 'Nepal': 0.04, 'Uruguay': 0.04, 'Tanzania': 0.04, 'Andorra': 0.04, 'Bahrain': 0.04, 'Colombia': 0.04, 'Liechtenstein': 0.04, 'Guernsey': 0.04, 'Oman': 0.04, 'Cambodia': 0.04, 'British Virgin Islands': 0.04, 'Luxembourg': 0.04, 'Latvia': 0.04, 'Algeria': 0.04, 'Portugal': 0.04, 'Georgia': 0.04, 'Eswatini (Swaziland)': 0.04, 'Uzbekistan': 0.04, 'Armenia': 0.04}",1.44,78
city,object,"{'New York': 3.86, 'Beijing': 2.65, 'Hong Kong': 2.65, 'Shanghai': 2.49, 'London': 2.38, 'Moscow': 2.34, 'Mumbai': 2.18, 'Shenzhen': 2.1, 'Singapore': 1.75, 'Delhi': 1.44, 'San Francisco': 1.44, 'Hangzhou': 1.32, 'Los Angeles': 1.32, 'Taipei': 1.17, 'Guangzhou': 1.09, 'Bangkok': 1.09, 'Seoul': 1.05, 'Tokyo': 0.93, 'Milan': 0.93, 'Istanbul': 0.86, 'Palm Beach': 0.82, 'Stockholm': 0.82, 'Paris': 0.82, 'Ningbo': 0.78, 'Dallas': 0.78, 'Jakarta': 0.7, 'Bangalore': 0.66, 'Sao Paulo': 0.66, 'Atlanta': 0.62, 'Palo Alto': 0.58, 'Melbourne': 0.58, 'Tel Aviv': 0.58, 'Chengdu': 0.58, 'Dubai': 0.55, 'Sydney': 0.55, 'Chicago': 0.55, 'Changsha': 0.55, 'Houston': 0.55, 'Madrid': 0.51, 'Manila': 0.51, 'Austin': 0.47, 'Munich': 0.47, 'Mexico City': 0.43, 'Montreal': 0.43, 'Atherton': 0.43, 'Beverly Hills': 0.43, 'Toronto': 0.43, 'Geneva': 0.43, 'Xiamen': 0.43, 'Las Vegas': 0.43, 'Boston': 0.39, 'Rio de Janeiro': 0.39, 'Miami': 0.39, 'Hamburg': 0.39, 'Monaco': 0.35, 'Ningde': 0.35, 'Greenwich': 0.35, 'Zurich': 0.35, 'Miami Beach': 0.31, 'Hyderabad': 0.31, 'Suzhou': 0.31, 'Foshan': 0.31, 'Ahmedabad': 0.31, 'Melsungen': 0.31, 'Fort Worth': 0.31, 'Chennai': 0.31, 'Quanzhou': 0.27, 'Pune': 0.27, 'Prague': 0.27, 'Hefei': 0.27, 'Nanjing': 0.27, 'Naples': 0.27, 'Berlin': 0.27, 'Lugano': 0.23, 'Almaty': 0.23, 'Seattle': 0.23, 'Delray Beach': 0.23, 'St. Louis': 0.23, 'Kuala Lumpur': 0.23, 'Lianyungang': 0.23, 'Zug': 0.23, 'Changzhou': 0.23, 'Oslo': 0.23, 'Vienna': 0.23, 'Santiago': 0.23, 'Wuxi': 0.23, 'Helsinki': 0.23, 'Monte Carlo': 0.23, 'Perth': 0.19, 'Saint Petersburg': 0.19, 'Vancouver': 0.19, 'Wuhan': 0.19, 'Gstaad': 0.19, 'Denver': 0.19, 'Phoenix': 0.19, 'Dongguan': 0.19, 'Los Altos': 0.19, 'Amsterdam': 0.19, 'Woodside': 0.19, 'Rome': 0.19, ...}",2.73,741
source,object,"{'Real estate': 5.72, 'Investments': 3.48, 'Diversified': 3.45, 'Pharmaceuticals': 3.22, 'Software': 2.39, 'Hedge funds': 1.55, 'Private equity': 1.52, 'Chemicals': 1.48, 'Retail': 1.44, 'Manufacturing': 1.33, 'Finance': 0.98, 'Banking': 0.98, 'Consumer goods': 0.91, 'Mining': 0.8, 'Telecom': 0.76, 'Supermarkets': 0.76, 'Auto parts': 0.72, 'Electronics': 0.72, 'Semiconductors': 0.68, 'Medical devices': 0.68, 'Online games': 0.57, 'E-commerce': 0.53, 'Financial services': 0.53, 'Fashion retail': 0.53, 'Shipping': 0.49, 'Steel': 0.49, 'Hotels, investments': 0.45, 'Cargill': 0.45, 'Agribusiness': 0.42, 'Medical equipment': 0.42, 'Luxury goods': 0.42, 'Construction': 0.42, 'Shoes': 0.42, 'Insurance': 0.42, 'Venture capital': 0.38, 'Healthcare': 0.38, 'Biotech': 0.38, 'Beer': 0.38, 'Batteries': 0.38, 'Media': 0.38, 'Cosmetics': 0.34, 'Education': 0.34, 'Hotels': 0.34, 'Machinery': 0.3, 'Oil': 0.3, 'Oil & gas': 0.3, 'Money management': 0.3, 'Medical technology': 0.3, 'Hospitals': 0.3, 'Building materials': 0.3, 'Video games': 0.3, 'Paints': 0.3, 'Beverages': 0.3, 'Petrochemicals': 0.3, 'Real estate, investments': 0.27, 'Business software': 0.27, 'Pipelines': 0.27, 'Drugstores': 0.27, 'Food': 0.27, 'Walmart': 0.27, 'Logistics': 0.27, 'Fintech': 0.27, 'Liquor': 0.27, 'Eyeglasses': 0.27, 'Sports': 0.27, 'Casinos': 0.27, 'Coal': 0.27, 'Facebook': 0.23, 'Apparel': 0.23, 'Soy sauce': 0.23, 'Cryptocurrency': 0.23, 'Retailing': 0.23, 'Cheese': 0.23, 'Packaging': 0.23, 'Software services': 0.23, 'Candy, pet food': 0.23, 'Sports apparel': 0.19, 'Candy': 0.19, 'Google': 0.19, 'Media, automotive': 0.19, 'Food, beverages': 0.19, 'Energy': 0.19, 'Furniture': 0.19, 'Oil, investments': 0.19, 'Electronics components': 0.19, 'Smartphones': 0.19, 'Electrical equipment': 0.19, 'Coffee': 0.19, 'Publishing': 0.19, 'Engineering': 0.19, 'Cable television': 0.19, 'Package delivery': 0.19, 'Restaurants': 0.19, 'Technology': 0.19, 'Cloud computing': 0.19, 'H&M': 0.19, 
'Conglomerate': 0.19, 'IT provider': 0.15, 'Fertilizer': 0.15, 'Lego': 0.15, ...}",0.0,906
industries,object,"{'Finance & Investments': 14.09, 'Manufacturing': 12.27, 'Technology': 11.89, 'Fashion & Retail': 10.08, 'Food & Beverage': 8.03, 'Healthcare': 7.61, 'Real Estate': 7.31, 'Diversified': 7.08, 'Energy': 3.79, 'Media & Entertainment': 3.45, 'Metals & Mining': 2.8, 'Automotive': 2.77, 'Service': 2.01, 'Construction & Engineering': 1.7, 'Logistics': 1.52, 'Sports': 1.48, 'Telecom': 1.17, 'Gambling & Casinos': 0.95}",0.0,18
countryOfCitizenship,object,"{'United States': 27.84, 'China': 18.6, 'India': 6.4, 'Germany': 4.77, 'Russia': 3.94, 'Hong Kong': 2.58, 'Italy': 2.42, 'Canada': 2.39, 'Taiwan': 1.97, 'United Kingdom': 1.97, 'Brazil': 1.93, 'Australia': 1.78, 'France': 1.63, 'Switzerland': 1.55, 'Japan': 1.52, 'Sweden': 1.48, 'Singapore': 1.4, 'South Korea': 1.14, 'Israel': 1.14, 'Indonesia': 1.1, 'Thailand': 1.06, 'Spain': 1.02, 'Turkey': 0.98, 'Malaysia': 0.68, 'Mexico': 0.53, 'Philippines': 0.53, 'Netherlands': 0.45, 'Norway': 0.45, 'Czech Republic': 0.42, 'Austria': 0.42, 'Ireland': 0.34, 'Cyprus': 0.34, 'Denmark': 0.3, 'Finland': 0.27, 'Poland': 0.27, 'Chile': 0.27, 'Lebanon': 0.23, 'Egypt': 0.23, 'Romania': 0.23, 'Vietnam': 0.23, 'Kazakhstan': 0.23, 'Greece': 0.23, 'Argentina': 0.19, 'South Africa': 0.19, 'Ukraine': 0.19, 'Peru': 0.15, 'Belgium': 0.15, 'Colombia': 0.15, 'United Arab Emirates': 0.15, 'New Zealand': 0.11, 'Hungary': 0.11, 'Nigeria': 0.11, 'Monaco': 0.11, 'Georgia': 0.08, 'Qatar': 0.08, 'Bulgaria': 0.08, 'Oman': 0.08, 'Morocco': 0.08, 'Slovakia': 0.08, 'Macau': 0.04, 'Barbados': 0.04, 'Estonia': 0.04, 'St. Kitts and Nevis': 0.04, 'Tanzania': 0.04, 'Armenia': 0.04, 'Bangladesh': 0.04, 'Zimbabwe': 0.04, 'Nepal': 0.04, 'Portugal': 0.04, 'Liechtenstein': 0.04, 'Guernsey': 0.04, 'Iceland': 0.04, 'Belize': 0.04, 'Eswatini (Swaziland)': 0.04, 'Venezuela': 0.04, 'Algeria': 0.04, 'Panama': 0.04}",0.0,77
selfMade,bool,"{True: 68.64, False: 31.36}",0.0,2
status,object,"{'D': 46.33, 'U': 32.39, 'E': 10.15, 'N': 5.68, 'Split Family Fortune': 2.99, 'R': 2.46}",0.0,6
gender,object,"{'M': 87.23, 'F': 12.77}",0.0,2


Next we will create a dictionary containing the distribution of values of columns with categorical data types to facilitate data exploration.

In [None]:
cat_percentage = categorical_description[categorical_description["number_of_unique_values"] < 100]["unique_values_percentage"].copy().to_dict()
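The resulting dictionary maps each column name to a `{value: percentage}` dictionary. The sketch below rebuilds a small piece of it by hand, using the gender and selfMade percentages from the table above, to show how it can be queried; the helper `top_value` is a hypothetical name introduced only for this example.

```python
# A hand-built excerpt with the same shape as cat_percentage.
cat_percentage = {
    "gender": {"M": 87.23, "F": 12.77},
    "selfMade": {True: 68.64, False: 31.36},
}

def top_value(distribution):
    """Return the (value, percentage) pair with the largest share."""
    return max(distribution.items(), key=lambda item: item[1])

# Look up a single share, then the most common value per column.
female_share = cat_percentage["gender"]["F"]
most_common = {col: top_value(dist) for col, dist in cat_percentage.items()}

print(female_share)
print(most_common)
```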

# 2. Exploratory Data Analysis

## 2.1 Data Overview
**How many billionaires are there in total, and what is the combined wealth they currently possess?**


In [None]:
billionaires_count = df.shape[0]
worth = df.finalWorth.sum()/1000000
data_collection_date = df.date[0]

print(f"Number of billionaires: {billionaires_count}")
print(f"Total wealth (in trillion dollars): {worth}")
print(f"Data collection date: {data_collection_date}")

Number of billionaires: 2640
Total wealth (in trillion dollars): 12.2068
Data collection date: 2023-04-04 05:01:00


As of April 4, 2023, the dataset lists 2,640 billionaires worldwide with a combined wealth of about USD 12.21 trillion.
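The division by 1,000,000 in the cell above implies that finalWorth is stored in millions of U.S. dollars. The short sketch below spells out the unit conversions with the figures reported in this notebook; the constant names are introduced only for illustration.

```python
MILLIONS_PER_BILLION = 1_000
MILLIONS_PER_TRILLION = 1_000_000

# finalWorth of the top-ranked entry, as shown in df.head().
top_worth_musd = 211_000

# Sum of finalWorth implied by the printed total of 12.2068 trillion.
total_worth_musd = 12_206_800

top_worth_billion = top_worth_musd / MILLIONS_PER_BILLION
total_worth_trillion = total_worth_musd / MILLIONS_PER_TRILLION

print(f"Top net worth: USD {top_worth_billion:.0f} billion")
print(f"Combined wealth: USD {total_worth_trillion:.2f} trillion")
```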

## 2.2 WHO, WHERE, HOW

### **Who are the top 10 richest people in the world (as of 2023)?**

In [None]:
def format_final_worth(value):
    return f"${value} B"

df['Net worth in B'] = (df['finalWorth'] / 1000).apply(format_final_worth)

table = go.Figure(data=[go.Table(
    header=dict(values=['Rank', 'Name', 'Net worth', 'Age', 'Country', 'Source', 'Industries']),
    cells=dict(values=[df['rank'].head(10), df['personName'].head(10), df['Net worth in B'].head(10),
                       df['age'].head(10), df['countryOfCitizenship'].head(10),
                       df['source'].head(10), df['industries'].head(10)])
)])

df = df.drop(columns=['Net worth in B'])

table.update_layout(width=1200, height=515)

table.show()

Seven of the world's top 10 billionaires are from the United States.

### **Origins of Billionaires: Where do billionaires come from?**

In [None]:
country_counts = df['countryOfCitizenship'].value_counts().reset_index()
country_counts.columns = ['Country', 'Number of Billionaires']
country_counts['Rank'] = country_counts['Number of Billionaires'].rank(ascending=False, method='dense')
table = go.Figure(data=[go.Table(
    header=dict(values=['Rank', 'Country', 'Number of Billionaires']),
    cells=dict(values=[country_counts['Rank'], country_counts['Country'], country_counts['Number of Billionaires']])
)])

table.update_layout(width=1000, height=310)

table.show()

In [None]:
fig = px.choropleth(country_counts,
                    locations="Country",
                    locationmode="country names",
                    color="Number of Billionaires",
                    hover_name="Country",
                    color_continuous_scale="blues",
                    title="Distribution of Billionaires by Country")

fig.update_layout(width=1000, height=600)
fig.show()

The United States is the country of citizenship of the most billionaires, with 735 individuals. China follows with 491, then India with 169 and Germany with 126.

North America and Asia exhibit the highest concentration of billionaires. European countries and India also display darker shades compared to their neighboring nations, indicating a significant billionaire presence.

Africa and some Asian countries appear lighter on the map, signifying a scarcity of billionaires in these regions, potentially due to underdeveloped economies or disparities in wealth distribution.

**Which city is the most preferred residence for billionaires?**

In [None]:
city_count_df = df.city.value_counts().reset_index()
city_count_df.columns = ['City', 'Number of Billionaires']

city_count_df['Rank'] = city_count_df.index + 1

table = go.Figure(data=[go.Table(
    header=dict(values=['Rank','City','Number of Billionaires']),
    cells=dict(values=[city_count_df['Rank'],city_count_df['City'], city_count_df['Number of Billionaires']])
)])

table.update_layout(width=700, height=310)

table.show()

New York City, the financial hub of the United States, hosts the world's two largest stock exchanges, the New York Stock Exchange (NYSE) and NASDAQ. This, along with other factors, makes it an appealing place of residence for many billionaires. Second place is shared by Beijing and Hong Kong, each home to 2.65% of the billionaires with a known city of residence; Hong Kong is widely recognized as one of the foremost international financial centers.

Expanding the focus to the top five cities, London takes fifth place. Also in the top five are Beijing, the capital of China, and Shanghai, a thriving economic center within the country.
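
The ranking discussed above can also be cut down to the top five rows directly. The following is a minimal sketch on a hypothetical toy frame (`toy` and its values are made up for illustration; only the `city` column name comes from the dataset):

```python
import pandas as pd

# Hypothetical stand-in for the notebook's `df`; only the `city` column matters here.
toy = pd.DataFrame({"city": ["New York", "Hong Kong", "New York", "Beijing",
                             "London", "New York", "Hong Kong", None]})

# value_counts() ignores missing cities and sorts by count in descending order.
top_cities = (toy["city"].value_counts()
              .head(5)
              .rename_axis("City")
              .reset_index(name="Number of Billionaires"))

# Add a 1-based rank column in front of the other columns.
top_cities.insert(0, "Rank", range(1, len(top_cities) + 1))
print(top_cities)
```

Swapping `toy` for `df` should give the same kind of table as the one above, limited to five rows.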

### **Source of Wealth: Where they got their money, and How?**

In [None]:
# Select top sources
top_sources = df['source'].value_counts().head(15)

# Create trace for the bar plot
trace = go.Bar(
    x=top_sources.index,
    y=top_sources,
    marker=dict(color='skyblue')
)

# Create layout
layout = go.Layout(
    title='Top Sources of Wealth Among Billionaires',
    xaxis=dict(title='Source of Wealth', tickangle=45, tickfont=dict(size=10)),
    yaxis=dict(title='Number of Billionaires'),
    plot_bgcolor='white',
    showlegend=False
)

# Create figure
fig = go.Figure(data=[trace], layout=layout)

# Add text annotations
for i, v in enumerate(top_sources):
    fig.add_annotation(
        x=top_sources.index[i],
        y=v + 5,
        text=str(v),
        showarrow=False,
        font=dict(size=10, color='black')
    )

# Show the plot
fig.show()

The most common sources of wealth for billionaires are real estate development, investments, and pharmaceuticals. Some of these sources of wealth are diversified.

**Are there any industries that dominate in terms of billionaire representation?**

In [None]:
# Select top industries
top_industries = df['industries'].value_counts().head(15)

# Create trace for the bar plot
trace = go.Bar(
    x=top_industries.index,
    y=top_industries,
    marker=dict(color='skyblue')
)

# Create layout
layout = go.Layout(
    title='Top Industries Dominating Billionaire Representation',
    xaxis=dict(title='Industry', tickangle=45, tickfont=dict(size=10)),
    yaxis=dict(title='Number of Billionaires'),
    plot_bgcolor='white',
    showlegend=False
)

# Create figure
fig = go.Figure(data=[trace], layout=layout)

# Add text annotations
for i, v in enumerate(top_industries):
    fig.add_annotation(
        x=top_industries.index[i],
        y=v + 5,
        text=str(v),
        showarrow=False,
        font=dict(size=10, color='black')
    )

# Show the plot
fig.show()

Billionaires are spread across a wide range of sectors, showing diverse routes to wealth in traditional industries such as Finance, Manufacturing and Real Estate as well as in Technology, Fashion, Healthcare and Energy.

The sectors with the greatest number of billionaires are as follows:

- Finance and Investment, with 372 individuals
- Manufacturing, with 324 individuals
- Technology, with 314 individuals
- Fashion and Retail, with 266 individuals
- Food and Beverage, with 212 individuals

To better understand billionaires around the world, we will explore total assets across different fields of the data. First, we will look at total assets by industry to see which industry brings the most wealth to billionaires.

In [None]:
total_final_worth = df['finalWorth'].sum()
fig = px.pie(df, values='finalWorth', names='industries', hole=0.7,
             title=f'Percentage of finalWorth by Industry (Total: ${total_final_worth} M)')

fig.add_annotation(text='$' +str(total_final_worth) +' M', x=0.5, y=0.5, showarrow=False)
fig.update_layout(width=1000, height=535)
fig.show()

In terms of total wealth, the tech sector is currently the most lucrative at `$1.88 trillion` (15.4% of total billionaire wealth). Massive investments in artificial intelligence (AI) are reshaping the market, just as the internet did in the 20th century or cloud technology did in the 2010s. Tech giants are strategically betting on AI as the next revolutionary technology to transform every industry.

Following the Technology sector is the Fashion and Retail industry at `$1.70 trillion` (13.9%), and the Finance and Investment sector at `$1.61 trillion` (13.1%).

Together, these three industries account for roughly 42% of total billionaire wealth.
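
The percentages read off the pie chart can also be computed as a table. This is a minimal sketch on a hypothetical toy frame (the rows and values are made up; `industries` and `finalWorth` are the dataset's column names, and `finalWorth` is assumed to be in millions of USD, as the chart title suggests):

```python
import pandas as pd

# Hypothetical rows; finalWorth is assumed to be in millions of USD.
toy = pd.DataFrame({
    "industries": ["Technology", "Technology", "Fashion & Retail", "Finance & Investments"],
    "finalWorth": [3000, 1000, 2400, 1600],
})

# Total worth per industry, largest first.
industry_worth = (toy.groupby("industries")["finalWorth"].sum()
                  .sort_values(ascending=False))

# Share of overall wealth, in percent.
industry_share = industry_worth / industry_worth.sum() * 100

summary = pd.DataFrame({
    "Total worth ($B)": industry_worth / 1000,  # millions -> billions
    "Share (%)": industry_share.round(1),
})
print(summary)
```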

## 2.3 Age and Gender Analysis with selfMade

**What is the distribution of billionaires' ages?**

In [None]:
df['age'].describe()

count    2575.000000
mean       65.140194
std        13.258098
min        18.000000
25%        56.000000
50%        65.000000
75%        75.000000
max       101.000000
Name: age, dtype: float64

In [None]:
fig = ff.create_distplot([df['age'].dropna()], ['Age'], colors=['skyblue'])
fig.update_layout(
    title='Distribution of Ages',
    xaxis_title='Age',
    yaxis_title='Density',
    plot_bgcolor='white',
    showlegend=False
)

fig.show()

Billionaires encompass a wide age range, spanning from 18 to 101 years old, with an average age of 65 and a notable standard deviation of 13 years, highlighting considerable age diversity. It can be observed that approximately half of the billionaires are younger than 65, while the other half are older. Additionally, 25% of billionaires fall below the age of 56, whereas 75% of billionaires are below the age of 75.
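
The quartile statements above follow from how quantiles are defined. The sketch below uses a small hypothetical series of ages (the values are illustrative, chosen so the quartiles land on 56, 65 and 75) to show how they can be computed and checked:

```python
import pandas as pd

# Hypothetical ages, including one missing value as in the real column.
ages = pd.Series([18, 45, 56, 60, 65, 70, 75, 88, 101, None], name="age")

# quantile() skips missing values by default.
quartiles = ages.quantile([0.25, 0.5, 0.75])

# Interquartile range: the spread of the middle 50% of ages.
iqr = quartiles[0.75] - quartiles[0.25]

# Fraction of known ages that are strictly below the median.
share_under_median = (ages.dropna() < quartiles[0.5]).mean()

print(quartiles)
print("IQR:", iqr)
print("Share below median:", round(share_under_median, 2))
```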

**Analysis of Age Distribution among Billionaires by Gender**

In [None]:
import plotly.graph_objects as go
import plotly.express as px

# Gender distribution pie chart
fig = px.pie(df, names='gender', hole=.5)

# Update colors
fig.update_traces(marker=dict(colors=['skyblue', 'pink']))

# Update layout
fig.update_layout(showlegend=True, legend=dict(orientation="h", x=0.5, y=-0.1),
                  title=dict(text="Gender Distribution", x=0.5),
                  width=400, height=400)

# Show the figure
fig.show()

In [None]:
import plotly.express as px

# Histogram of ages, split and colored by gender
fig = px.histogram(df, x="age", color="gender", nbins=20,
                   title="Billionaires Age Distribution by Gender",
                   color_discrete_map={"M": "skyblue", "F": "pink"})
fig.show()

Men account for more than 80% of billionaires, far outnumbering women. Only 2 billionaires fall in the 0-20 age group, one man and one woman. Only one billionaire is over 100 years old, and he is male. Overall, most billionaires of either gender are over 50, and for both men and women the 55-59 age group contains more billionaires than any other.
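
The counts per age group quoted above can be tabulated rather than read off the histogram. This is a minimal sketch on a hypothetical toy frame, assuming 5-year age bands (the band edges and labels are an illustrative choice, not necessarily the bins Plotly picks):

```python
import pandas as pd

# Hypothetical rows; `gender` and `age` are the dataset's column names.
toy = pd.DataFrame({
    "gender": ["M", "M", "M", "M", "F", "F", "M", "F"],
    "age":    [18, 52, 57, 58, 19, 56, 101, 63],
})

# 5-year bands: 0-4, 5-9, ..., 100-104 (left edge included, right edge excluded).
edges = list(range(0, 106, 5))
labels = [f"{lo}-{lo + 4}" for lo in edges[:-1]]
toy["age_band"] = pd.cut(toy["age"], bins=edges, labels=labels, right=False)

# Number of billionaires per age band and gender; empty bands are left out.
counts = (toy.groupby(["age_band", "gender"], observed=True)
          .size()
          .unstack(fill_value=0))

# Overall share of each gender.
gender_share = toy["gender"].value_counts(normalize=True)

print(counts)
print(gender_share)
```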

**What is the distribution of gender among self-made billionaires?**

In [None]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots

fig2 = px.pie(df[df.gender=='M'], names='selfMade', hole=.5)
fig2.update_traces(textinfo='percent+label', marker=dict(colors=['skyblue', 'pink', 'skyblue']))

fig3 = px.pie(df[df.gender=='F'], names='selfMade', hole=.5)
fig3.update_traces(textinfo='percent+label', marker=dict(colors=['skyblue', 'skyblue', 'pink']))

fig = make_subplots(rows=1, cols=2, specs=[[{'type': 'domain'}, {'type': 'domain'}]])

fig.add_trace(fig2.data[0], row=1, col=1)
fig.add_trace(fig3.data[0], row=1, col=2)

fig.update_layout(height=400, showlegend=False, title=dict(text="Distribution of SelfMade by Gender", x=0.5))
fig.add_annotation(dict(x=0.23, y=-0.1, ax=0, ay=0, text="% SelfMade (M)"))
fig.add_annotation(dict(x=0.78, y=-0.1, ax=0, ay=0, text="% SelfMade (F)"))

fig.show()

Interestingly, nearly three-quarters of male billionaires are self-made, while fewer than 30% of female billionaires are.
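
The pie charts count people, not dollars, so it may help to compute both views side by side. The sketch below uses a hypothetical toy frame (all values are made up) and a helper column `origin`, which is not in the dataset and is introduced here only for readable labels:

```python
import pandas as pd

# Hypothetical rows; finalWorth is assumed to be in millions of USD.
toy = pd.DataFrame({
    "gender":     ["M", "M", "M", "M", "F", "F", "F", "F"],
    "selfMade":   [True, True, True, False, False, False, True, False],
    "finalWorth": [1000, 2000, 3000, 6000, 1000, 1000, 4000, 2000],
})

# Readable labels instead of True/False (helper column, not in the dataset).
toy["origin"] = toy["selfMade"].map({True: "self-made", False: "inherited"})

# Share of people in each group, per gender (each row sums to 1).
count_share = pd.crosstab(toy["gender"], toy["origin"], normalize="index")

# Share of wealth in each group, per gender (each row sums to 1).
worth = toy.pivot_table(index="gender", columns="origin",
                        values="finalWorth", aggfunc="sum", fill_value=0)
wealth_share = worth.div(worth.sum(axis=1), axis=0)

print(count_share)
print(wealth_share)
```

In this toy data, 75% of the men are self-made but they hold only 50% of male wealth, which shows why the two measures should not be used interchangeably.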

In [None]:
import plotly.express as px

fig = px.violin(df, y="age", x="gender", color="selfMade", box=True, points="all", title="Distribution of Age, Gender, and SelfMade")
fig.update_yaxes(title="Age")  # Update y-axis label if needed

# Calculate median for each gender and selfMade
median_values = df.groupby(['gender', 'selfMade'])['age'].median().reset_index()

# Annotate the median age of each gender / selfMade group
# (the inner box enabled by box=True already draws the median line)
for _, row in median_values.iterrows():
    gender = row['gender']
    self_made = row['selfMade']
    median = row['age']

    fig.add_annotation(x=gender, y=median, text=f"Median: {median}",
                       showarrow=False, font=dict(size=10))

# Update marker colors to skyblue and pink
fig.update_traces(marker_line_color="white", marker_line_width=0.1)
fig.update_traces(marker_color="skyblue", selector=dict(name='False'))
fig.update_traces(marker_color="pink", selector=dict(name='True'))

fig.show()

As can be seen from the output violins, male billionaires who inherited wealth have a wide age span, concentrated under 59 years of age, with a quarter of them under 34 and a median of 68. Comparatively, the age distribution of self-made male billionaires is more compact, with most under 56 and 25% under 30. Among female billionaires, the age range of heirs is similarly broad, with most under 55, a quarter under 27, and a median of 65. On the other hand, the age distribution of self-made women is more concentrated, with the majority under 53, a quarter under 35, and a median of 60, showing a tendency to accumulate wealth earlier. To summarize, the age distribution of the self-made rich is denser than that of the heirs, especially among women, and this is even more evident in the density estimation graph.
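
Quartiles like the ones quoted above are easier to check from a table than from the violins. This is a minimal sketch on a hypothetical toy frame (the ages are made up and only loosely mirror the medians mentioned in the text; `origin` is a helper column introduced for illustration):

```python
import pandas as pd

# Hypothetical rows: three people per gender / selfMade combination.
toy = pd.DataFrame({
    "gender":   ["M"] * 6 + ["F"] * 6,
    "selfMade": ([True] * 3 + [False] * 3) * 2,
    "age":      [50, 60, 70, 40, 68, 80, 45, 60, 75, 30, 65, 90],
})
toy["origin"] = toy["selfMade"].map({True: "self-made", False: "inherited"})

# Quartiles of age for each gender / origin group, one row per group.
age_quartiles = (toy.groupby(["gender", "origin"])["age"]
                 .quantile([0.25, 0.5, 0.75])
                 .unstack()
                 .rename(columns={0.25: "q25", 0.5: "median", 0.75: "q75"}))

# A smaller interquartile range means a more compact age distribution.
age_quartiles["iqr"] = age_quartiles["q75"] - age_quartiles["q25"]
print(age_quartiles)
```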

## 2.4 Comparing the United States and China

From the data above, China and the United States are the two largest country groups in the dataset. The data on billionaires in the two countries show both similarities and differences, which may reflect differences in the history of their economic development.

In [None]:
china_data = df[df['countryOfCitizenship']=='China']
us_data = df[df['countryOfCitizenship']=='United States']

In [None]:
combined_df = pd.concat([china_data, us_data])

fig = px.histogram(combined_df,
          x='age',
          color='countryOfCitizenship',
          title='Age Distribution Comparison between China and US',
          opacity=0.5
          )

fig.show()

As we can see from the chart, billionaires in both China and the US are concentrated in the 50-60 age group. This may be because people are most likely to reach the peak of their careers at this age. Compared to the United States, China has very few billionaires over 60, whereas the United States has a significantly higher number of older billionaires. This may be because the US has had a developed market economy for much longer, so more people built their fortunes in earlier decades.

To add perspective, we examine the distribution of self-made billionaires in both countries.


In [None]:
fig = px.scatter(china_data,
        x='age',
        y='finalWorth',
        color='selfMade',
        color_discrete_map={True: 'blue', False: 'red'},
        title='China: Age vs. Wealth (Selfmade vs. Non-Selfmade) ',
        hover_data=['personName'],
        opacity=0.5
        )
fig.show()

In [None]:
fig = px.scatter(us_data,
        x='age',
        y='finalWorth',
        color='selfMade',
        color_discrete_map={True: 'blue', False: 'red'},
        title='US: Age vs. Wealth (Selfmade vs. Non-Selfmade) ',
        hover_data=['personName'],
        opacity=0.5
        )
fig.show()

China has a much higher percentage of self-made billionaires than the US. Many US fortunes have probably already been passed on to the next generation, which is consistent with the older age profile noted above.

Another interesting feature is that, in both charts, the wealthiest billionaires are self-made rather than heirs.
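
The comparison above is read off two scatter plots, but the self-made share can also be computed directly. This is a minimal sketch on a hypothetical toy frame (names and values are made up and do not reflect the real data; the column names follow the dataset):

```python
import pandas as pd

# Hypothetical rows; finalWorth is assumed to be in millions of USD.
toy = pd.DataFrame({
    "countryOfCitizenship": ["China"] * 4 + ["United States"] * 4,
    "personName": ["A", "B", "C", "D", "E", "F", "G", "H"],
    "selfMade":   [True, True, True, False, True, False, True, False],
    "finalWorth": [9000, 5000, 3000, 2000, 12000, 8000, 4000, 1000],
})

# The mean of a boolean column is the fraction of True values.
self_made_share = toy.groupby("countryOfCitizenship")["selfMade"].mean()

# Richest person per country, to check whether they are self-made.
richest_idx = toy.groupby("countryOfCitizenship")["finalWorth"].idxmax()
richest = toy.loc[richest_idx,
                  ["countryOfCitizenship", "personName", "selfMade", "finalWorth"]]

print(self_made_share)
print(richest)
```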

## 2.5 Correlation Between Wealth and National Economic Indicators

To examine the relationship between a country's economy and its billionaires, we look at each country's GDP, the total wealth held by its billionaires, and the ratio between the two.

In [None]:
# Attach each country's GDP and CPI figures, taken from its first matching row
for index, row in country_counts.iterrows():
  nowCountry = row['Country']
  ndf=df[df['country']==nowCountry].reset_index()
  if not ndf.empty:
    gdp = ndf.loc[0,'gdp_country']
    cpi = ndf.loc[0,'cpi_country']
    cpi_change = ndf.loc[0,'cpi_change_country']
    country_counts.loc[index,'GDP'] = gdp
    country_counts.loc[index,'CPI'] = cpi
    country_counts.loc[index,'CPI_change'] = cpi_change


In [None]:
# Total and average billionaire wealth per country, plus total wealth relative to GDP
for index, row in country_counts.iterrows():
  nowCountry = row['Country']
  fdf = df[df['country']==nowCountry]
  sumWealth = fdf['finalWorth'].sum()
  avgWealth = fdf['finalWorth'].mean()
  country_counts.loc[index,'Wealth_sum'] = sumWealth
  country_counts.loc[index,'Wealth_avg'] = avgWealth
  country_counts.loc[index,'Wealth_sum:GDP'] = sumWealth/row['GDP']*10000000
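
The per-country loops above can also be written as a single grouped aggregation. The following is a sketch under stated assumptions, not a drop-in replacement: it uses a hypothetical toy frame, assumes `gdp_country` has already been converted to a number in USD, and assumes `finalWorth` is in millions of USD. The `Wealth_to_GDP` column is a plain ratio and is not the same as the scaled `Wealth_sum:GDP` value computed above.

```python
import pandas as pd

# Hypothetical rows; countries "A" and "B" are placeholders.
toy = pd.DataFrame({
    "country":     ["A", "A", "B"],
    "finalWorth":  [1000, 3000, 2000],   # assumed: millions of USD
    "gdp_country": [2e12, 2e12, 1e12],   # assumed: numeric, in USD
})

# One pass over the data instead of filtering the frame once per country.
per_country = toy.groupby("country").agg(
    n_billionaires=("finalWorth", "size"),
    Wealth_sum=("finalWorth", "sum"),
    Wealth_avg=("finalWorth", "mean"),
    GDP=("gdp_country", "first"),        # GDP repeats on every row of a country
)

# Convert wealth from millions to dollars before dividing by GDP.
per_country["Wealth_to_GDP"] = per_country["Wealth_sum"] * 1e6 / per_country["GDP"]
print(per_country)
```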

In [None]:
import plotly.express as px

country_counts['Wealth_sum:GDP_log'] = np.log(country_counts['Wealth_sum:GDP'])

fig = px.scatter(
    country_counts,
    x='GDP',
    y='Wealth_sum',
    size='Number of Billionaires',
    hover_data=['Country','CPI'],
    color = 'Wealth_sum:GDP_log'
    )
fig.update_layout(title='Billionaire Wealth vs. GDP by Country',
                  xaxis_title='GDP',
                  yaxis_title='Total Billionaire Wealth')
fig.update_layout(
    xaxis=dict(
        rangeslider=dict(visible=True),
        type='linear'
    )
)

fig.show()

Broadly, the higher a country's GDP, the more total wealth its billionaires hold. This may seem obvious, but it supports the idea that individual wealth depends on the development of the country.

Since the vast majority of the data come from China and the United States, it is difficult to identify further reliable patterns.
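
One way to put a number on the GDP relationship, while limiting the influence of the two largest countries, is to compare a plain Pearson correlation with a rank-based one. This is a minimal sketch on hypothetical values (countries "A" to "E" and all figures are made up; `GDP` and `Wealth_sum` mirror the columns built above):

```python
import numpy as np
import pandas as pd

# Hypothetical per-country table; two large economies dominate, as in the text.
toy = pd.DataFrame({
    "Country":    ["A", "B", "C", "D", "E"],
    "GDP":        [21e12, 19e12, 3.8e12, 2.6e12, 1.7e12],
    "Wealth_sum": [4_500_000, 1_800_000, 600_000, 400_000, 100_000],
})

# Pearson correlation on raw values: sensitive to the largest countries.
pearson = toy["GDP"].corr(toy["Wealth_sum"])

# Spearman correlation is Pearson correlation computed on ranks,
# so it only asks whether the ordering of countries agrees.
spearman = toy["GDP"].rank().corr(toy["Wealth_sum"].rank())

# Pearson correlation on log10 values dampens the effect of scale.
log_pearson = np.log10(toy["GDP"]).corr(np.log10(toy["Wealth_sum"]))

print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}, log-log Pearson: {log_pearson:.3f}")
```

If the rank-based figure stays high on the real `country_counts` table, the pattern is less likely to be an artifact of the two largest countries alone.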

# Conclusion

In conclusion, this exploratory data analysis on global billionaires revealed several insightful patterns.

The United States and China lead with the highest number of billionaires, reflecting their economic prowess. Major cities like New York, Hong Kong, and Beijing emerged as preferred residences, providing access to financial hubs. The technology sector currently dominates in terms of total billionaire wealth, driven by investments in AI and disruptive innovations. However, traditional industries like fashion/retail and finance remain major sources of fortunes.

Most billionaires are self-made men, and women make up less than 20% of the list, although self-made women tend to accumulate wealth at younger ages. Moreover, a country's economic development, measured by GDP, exhibits a positive correlation with the aggregate wealth of its billionaire class.

Overall, this analysis highlighted the diverse pathways to extreme wealth accumulation and the dynamics shaping the elite billionaire demographic worldwide.