**If you lost points on the last checkpoint you can get them back by responding to TA/IA feedback**  

Update the relevant sections where you lost those points, and make sure you respond on GitHub Issues to your TA/IA to call their attention to the changes you made here.

Please update your Timeline... no battle plan survives contact with the enemy, so make sure we understand how your plans have changed.

# COGS 108 - EDA Checkpoint

# Names

- Pallavi Saksena
- Alexander Tang
- Chia-Han Chen
- Ashaank Jha
- Joseph Teh

# Research Question

What economic factors (profession/industry, country/city residence, GDP, etc.) or personal traits (gender, age, education, self made/inherited wealth, etc.) are associated with billionaires in high-income countries versus low-income countries?

## Background and Prior Work

As of 2023, there are an estimated 2,640 billionaires in the world, with nearly 15% of them in the finance industry, 13% in the manufacturing industry, and 12% in the technology industry.<a name="Forbes"></a>[<sup>1</sup>](#Forbes) According to recent studies, roughly 11% of the world’s billionaires have either run for election or held a political position, with examples including Donald Trump, Michael Bloomberg, and Tom Steyer.<a name="CNBC"></a>[<sup>2</sup>](#CNBC) These billionaires often hold great political power through secret donations to political candidates, parties, and super PACs, and they also invest in crucial industries, which gives them considerable influence over their respective countries.<a name="CNBC"></a>[<sup>2</sup>](#CNBC) Analyzing how these billionaires spend their fortunes and where their wealth originated may offer insights into how certain countries are operated.

This project aims to analyze data relating to billionaires’ net worth, source of wealth, age, gender, location, consumer price index of residency, tax revenue of residency, and many other factors.<a name="Kaggle"></a>[<sup>3</sup>](#Kaggle) Our project aims to see whether or not there are specific trends and patterns among these billionaires and specific industries in high-income versus low-income countries. Our project also aims to determine whether there are specific traits among billionaires’ background and characteristics.

Previous investigations <a name="InvestmentMigration"></a>[<sup>4</sup>](#InvestmentMigration) indicate that there may be positive relationships between the number of billionaires, tax rates, quality of life, and GDP ratios in each country. However, the research is mixed as to which factors matter most, so we will use our dataset to determine which ones are most highly correlated with the number of billionaires per country and thereby clarify these findings. We will categorize industries into three groups (primary, secondary, and tertiary) according to the three-sector model <a name="Fisher"></a>[<sup>5</sup>](#Fisher). Primary industries, such as mining, focus on extracting raw materials. Secondary industries, such as the automotive industry, specialize in manufacturing goods. Lastly, tertiary industries, such as retail, transport and distribute or sell manufactured goods. This model will allow us to distinguish between different types of industries during our research.

There are also clear differences in billionaire status based on gender. An article by the New York Times <a name="NYTimes"></a>[<sup>6</sup>](#NYTimes) describes these wealth differences on a global level, but there is less information on how they vary by industry and country/location. We can look at trends per industry and per country to see which tend to have more male versus female billionaires, and we may perform exploratory analyses to investigate why that may be the case.

Also, different publications estimate billionaires’ net worth in different ways. For example, as of February 10th, 2024, Forbes lists Bernard Arnault’s net worth as 219.2B (first place),<a name="Forbes2"></a>[<sup>7</sup>](#Forbes2) while Bloomberg lists it as 191B (third place), even though both sites track net worth in real time and update daily.<a name="Bloomberg"></a>[<sup>8</sup>](#Bloomberg)

Finally, classifying countries by income level (high or low) per the World Bank’s definition of high-income economies provides a further level of analysis that can enrich our understanding of wealth accumulation throughout the world. It can provide deeper insight into the economic conditions that lead to billionaire status as well as differences among billionaires from their respective countries.<a name="WorldBank"></a>[<sup>9</sup>](#WorldBank)

The significance of this study lies in the fact that billionaires often wield a disproportionate amount of power in shaping economic and social policy, as well as general social trends. Thus, understanding the backgrounds of these people with immense power is vital to understanding broader societal trends, such as wealth inequality, social mobility, and tax policy. Additionally, analyzing the different factors that may lead to billionaire status can help economists understand the systems of different countries and their effect on wealth accumulation and distribution.

1. <a name="Forbes"></a> [^](#Forbes)LaFranco, Rob & Peterson-Withorn, Chase. Forbes Billionaires 2023: The Richest People In The World. Forbes. https://www.forbes.com/billionaires/
2. <a name="CNBC"></a> [^](#CNBC)Frank, Robert. (26 Oct 2023). “Billionaire politicians have become ‘shockingly common’ around the world, new study finds.” CNBC. https://www.cnbc.com/2023/10/26/billionaire-politicians-shockingly-common-study.html
3. <a name="Kaggle"></a> [^](#Kaggle)Nidula Elgiriyewithana. Billionaires Statistics Dataset (2023). Retrieved February 11, 2024 from https://www.kaggle.com/datasets/nelgiriyewithana/billionaires-statistics-dataset.
4. <a name="InvestmentMigration"></a> [^](#InvestmentMigration)Popov, Vladimir. "Why Some Countries Have More Billionaires than Others? Explaining Variety in the Billionaire Intensity of GDP." Investment Migration Research Paper 2018 3 (2018).
https://investmentmigration.org/wp-content/uploads/2020/10/IMC-paper-Popov-IMC-RP-2018-3final.pdf
5. <a name="Fisher"></a> [^](#Fisher)Fisher, Allan G. B. (1939). "Production, primary, secondary and tertiary". Economic Record. 15 (1): 24–38. doi:10.1111/j.1475-4932.1939.tb01015.x.
6. <a name="NYTimes"></a> [^](#NYTimes)Frank, Robert. (30 Dec 2016). “Why Aren’t There More Female Billionaires?” The New York Times.
https://www.nytimes.com/2016/12/30/business/why-arent-there-more-female-billionaires.html
7. <a name="Forbes2"></a> [^](#Forbes2)Real-Time Billionaires List. Forbes.
https://www.forbes.com/real-time-billionaires/#7743d3893d78
8. <a name="Bloomberg"></a> [^](#Bloomberg)Bloomberg Billionaires Index. Bloomberg.
https://www.bloomberg.com/billionaires/
9. <a name="WorldBank"></a> [^](#WorldBank)High-Income Countries 2024. World Population Review. https://worldpopulationreview.com/country-rankings/high-income-countries

# Hypothesis


We hypothesize that the majority of billionaires among developing countries will come from the primary and secondary industries. In addition, we hypothesize that there is a higher proportion of billionaires among developed countries who are primarily in the tertiary sectors compared to developing countries. We also hypothesize that most billionaires in the world, across all industries and in both developed and developing countries, are men and older in age.
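One way the sector comparison in this hypothesis could be checked is with a normalized cross-tabulation. The sketch below is illustrative only: the rows are invented toy data, and the `sector` column is a hypothetical column that does not yet exist in our dataframe.

```python
import pandas as pd

# Toy data only: these rows are invented to show the mechanics.
# 'sector' is a hypothetical column we have not yet created.
toy = pd.DataFrame({
    'high_income': [True, True, True, True, False, False, False, False],
    'sector': ['Tertiary', 'Tertiary', 'Tertiary', 'Secondary',
               'Primary', 'Secondary', 'Secondary', 'Tertiary'],
})

# normalize='index' makes each row sum to 1, giving the share of each
# sector within an income group
shares = pd.crosstab(toy['high_income'], toy['sector'], normalize='index')
print(shares.round(2))
```

Comparing the `Tertiary` column across the two rows would show whether the tertiary share is higher in the high-income group, which is the pattern we expect.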

# Data

## Data overview

- Dataset #1
  - Dataset Name: Billionaires Statistics Dataset (2023)
  - Link to the dataset: https://www.kaggle.com/datasets/nelgiriyewithana/billionaires-statistics-dataset
  - Number of observations: 2638
  - Number of variables: 35
- Dataset #2
  - Dataset Name: High-Income Countries 2024
  - Link to the dataset: https://worldpopulationreview.com/country-rankings/high-income-countries
  - Number of observations: 80
  - Number of variables: 4

**Billionaires Statistics Dataset (2023)**

We will use this as our main dataset; it contains information about all the billionaires included in our research. Important variables include country, age, gender, and industries. The country variable lets us determine whether a billionaire resides in a high-income country by checking it against the list of countries in the other dataset. Age and gender describe each billionaire’s personal characteristics, which will help us find trends among the billionaires. Lastly, industries will be used to categorize each billionaire’s industry into one of the three industry sectors referenced earlier. We plan to add a column denoting high-income countries (True or False) and another denoting industry sector (Primary, Secondary, or Tertiary).
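A minimal sketch of how the sector column might be built is shown below. The mapping is an assumption: only the mining, automotive, and retail assignments follow from the examples in our background section, and every other entry is a guess we would need to review. `SECTOR_MAP` and `toy` are hypothetical names used for illustration.

```python
import pandas as pd

# Hypothetical sector assignments: only the mining, automotive, and retail
# entries come from the background section; the rest are our own guesses.
SECTOR_MAP = {
    'Metals & Mining': 'Primary',
    'Energy': 'Primary',
    'Manufacturing': 'Secondary',
    'Automotive': 'Secondary',
    'Construction & Engineering': 'Secondary',
    'Fashion & Retail': 'Tertiary',
    'Finance & Investments': 'Tertiary',
    'Technology': 'Tertiary',
}

# Toy rows standing in for the real dataframe
toy = pd.DataFrame({'industries': ['Metals & Mining', 'Automotive',
                                   'Fashion & Retail', 'Diversified']})

# Industries missing from the mapping become NaN so they can be reviewed by hand
toy['sector'] = toy['industries'].map(SECTOR_MAP)
print(toy)
```

Leaving unmapped industries such as Diversified as missing, rather than forcing them into a sector, keeps ambiguous cases visible until we decide how to treat them.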

**High-Income Countries 2024**

Using this dataset, we will create a new column in the main dataset to indicate whether a country is high-income or non-high-income. Any country listed in both datasets will be classified as high-income, and the remaining countries in the first dataset will be classified as non-high-income. These classifications will then be used to compare results across countries and see whether the two groups differ under the three-sector model. Additionally, standardizing the country names will be necessary to avoid missing a country classified as high-income.
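The name standardization could look something like the sketch below. The alias table is hypothetical: these spellings are illustrative guesses, not mismatches we have confirmed between the two datasets, and `normalize_country` and `is_high_income` are names we made up for this example.

```python
# Hypothetical alias table: illustrative guesses only, not confirmed
# mismatches between the two datasets.
ALIASES = {
    'czechia': 'czech republic',
    'korea, south': 'south korea',
}

def normalize_country(name):
    """Lowercase, collapse whitespace, and apply known aliases."""
    key = ' '.join(name.strip().lower().split())
    return ALIASES.get(key, key)

# Stand-in for a few entries of high_income_countries
high_income_toy = ['Czech Republic', 'South Korea', 'United States']
high_income_keys = {normalize_country(c) for c in high_income_toy}

def is_high_income(country):
    """Check membership after normalizing both sides the same way."""
    return normalize_country(country) in high_income_keys

print(is_high_income(' Czechia '))  # True
print(is_high_income('India'))      # False
```

Normalizing both lists with the same function means a country is only missed when its spelling differs in a way the alias table does not cover.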

Ultimately, we will merge both datasets so that the final dataset contains information about all the billionaires in our study as well as an indication of whether their country is considered high-income or low-income.
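As a rough illustration of the merge, a left join with pandas' `indicator` option keeps every billionaire and records whether the country matched. The dataframes below are toy stand-ins with invented rows; only the shared `country` column name comes from our data.

```python
import pandas as pd

# Toy stand-ins for the two dataframes (invented rows)
people = pd.DataFrame({
    'personName': ['A', 'B', 'C'],
    'country': ['France', 'India', 'United States'],
})
rich_countries = pd.DataFrame({'country': ['France', 'United States']})

# Left merge keeps every billionaire; the _merge column records whether
# the country was found in the high-income list
merged = people.merge(rich_countries.drop_duplicates(), on='country',
                      how='left', indicator=True)
merged['high_income'] = merged['_merge'] == 'both'
merged = merged.drop(columns='_merge')
print(merged)
```

Dropping duplicate countries before merging guards against accidentally multiplying billionaire rows if a country appears twice in the high-income list.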

## Billionaires Statistics Dataset (2023)

In [1]:
import numpy as np
import pandas as pd

In [2]:
# Read in billionaires data
billionaires = pd.read_csv('Billionaires Statistics Dataset.csv')

In [3]:
# See which variables contain null values
billionaires.isnull().sum()

rank                                             0
finalWorth                                       0
category                                         0
personName                                       0
age                                             65
country                                         38
city                                            72
source                                           0
industries                                       0
countryOfCitizenship                             0
organization                                  2315
selfMade                                         0
status                                           0
gender                                           0
birthDate                                       76
lastName                                         0
firstName                                        3
title                                         2301
date                                             0
state                          

In [4]:
# Keep only the columns that are necessary for our study
# (.copy() gives an independent dataframe so later column assignments are safe)
billionaires_dropped = billionaires[['personName','age','country','industries','selfMade','gender','gdp_country','gross_tertiary_education_enrollment','gross_primary_education_enrollment_country', 'total_tax_rate_country']].copy()
billionaires_dropped.isnull().sum()

personName                                      0
age                                            65
country                                        38
industries                                      0
selfMade                                        0
gender                                          0
gdp_country                                   164
gross_tertiary_education_enrollment           182
gross_primary_education_enrollment_country    181
total_tax_rate_country                        182
dtype: int64

In [5]:
# Take a look at the most prominent industries in the dataset
billionaires_dropped['industries'].value_counts()

Finance & Investments         372
Manufacturing                 324
Technology                    314
Fashion & Retail              266
Food & Beverage               212
Healthcare                    201
Real Estate                   193
Diversified                   187
Energy                        100
Media & Entertainment          91
Metals & Mining                74
Automotive                     73
Service                        53
Construction & Engineering     45
Logistics                      40
Sports                         39
Telecom                        31
Gambling & Casinos             25
Name: industries, dtype: int64

In [7]:
# Drop rows which contain null values in our most important variables
billionaires_dropped = billionaires_dropped.dropna(subset = ['age','country'])
billionaires_dropped.isnull().sum()

# We are not currently dropping null values from the last few columns, as these are not our main variables of interest.
# We will keep these rows in our dataset for the sake of our EDA

personName                                      0
age                                             0
country                                         0
industries                                      0
selfMade                                        0
gender                                          0
gdp_country                                   122
gross_tertiary_education_enrollment           140
gross_primary_education_enrollment_country    139
total_tax_rate_country                        140
dtype: int64

## High-Income Countries 2024

In [8]:
# Read in dataset which contains information about high-income countries
high_income = pd.read_csv('High_Income_2023.csv')

In [9]:
# Create a list of all high-income countries in the world
high_income_countries = high_income['country'].tolist()
high_income_countries

['Bermuda',
 'Liechtenstein',
 'Switzerland',
 'Luxembourg',
 'Norway',
 'Isle of Man',
 'Ireland',
 'United States',
 'Faroe Islands',
 'Denmark',
 'Iceland',
 'Singapore',
 'Qatar',
 'Cayman Islands',
 'Sweden',
 'Netherlands',
 'Australia',
 'Hong Kong',
 'Finland',
 'Austria',
 'Germany',
 'Belgium',
 'Israel',
 'Canada',
 'San Marino',
 'Macau',
 'New Zealand',
 'United Kingdom',
 'Andorra',
 'France',
 'United Arab Emirates',
 'Japan',
 'New Caledonia',
 'Italy',
 'South Korea',
 'Kuwait',
 'Greenland',
 'Sint Maarten',
 'Malta',
 'Brunei',
 'Aruba',
 'Spain',
 'Cyprus',
 'Slovenia',
 'Bahamas',
 'Estonia',
 'Bahrain',
 'Czech Republic',
 'Saudi Arabia',
 'Portugal',
 'Turks and Caicos Islands',
 'Puerto Rico',
 'Lithuania',
 'Slovakia',
 'Greece',
 'Latvia',
 'French Polynesia',
 'Saint Kitts and Nevis',
 'Curacao',
 'Nauru',
 'Oman',
 'Croatia',
 'Hungary',
 'Poland',
 'Barbados',
 'Antigua and Barbuda',
 'Uruguay',
 'Trinidad and Tobago',
 'Chile',
 'Panama',
 'Romania']

## Create Final Dataset

In [10]:
# Create new column in main dataframe marking high-income countries
billionaires_dropped['high_income'] = billionaires_dropped['country'].isin(high_income_countries)

In [11]:
billionaires_dropped

| | personName | age | country | industries | selfMade | gender | gdp_country | gross_tertiary_education_enrollment | gross_primary_education_enrollment_country | total_tax_rate_country | high_income |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Bernard Arnault & family | 74.0 | France | Fashion & Retail | False | M | \$2,715,518,274,227 | 65.6 | 102.5 | 60.7 | True |
| 1 | Elon Musk | 51.0 | United States | Automotive | True | M | \$21,427,700,000,000 | 88.2 | 101.8 | 36.6 | True |
| 2 | Jeff Bezos | 59.0 | United States | Technology | True | M | \$21,427,700,000,000 | 88.2 | 101.8 | 36.6 | True |
| 3 | Larry Ellison | 78.0 | United States | Technology | True | M | \$21,427,700,000,000 | 88.2 | 101.8 | 36.6 | True |
| 4 | Warren Buffett | 92.0 | United States | Finance & Investments | True | M | \$21,427,700,000,000 | 88.2 | 101.8 | 36.6 | True |
| 5 | Bill Gates | 67.0 | United States | Technology | True | M | \$21,427,700,000,000 | 88.2 | 101.8 | 36.6 | True |
| 6 | Michael Bloomberg | 81.0 | United States | Media & Entertainment | True | M | \$21,427,700,000,000 | 88.2 | 101.8 | 36.6 | True |
| 7 | Carlos Slim Helu & family | 83.0 | Mexico | Telecom | True | M | \$1,258,286,717,125 | 40.2 | 105.8 | 55.1 | False |
| 8 | Mukesh Ambani | 65.0 | India | Diversified | False | M | \$2,611,000,000,000 | 28.1 | 113.0 | 49.7 | False |
| 9 | Steve Ballmer | 67.0 | United States | Technology | True | M | \$21,427,700,000,000 | 88.2 | 101.8 | 36.6 | True |
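
The `gdp_country` values displayed above are strings with a dollar sign and thousands separators, so they would need converting before any numeric comparison. A possible sketch, using two values copied from the output plus one missing entry, is below; `gdp_raw` and `gdp_numeric` are hypothetical names.

```python
import pandas as pd

# Two values copied from the gdp_country column displayed above,
# plus one missing entry
gdp_raw = pd.Series(['$2,715,518,274,227', '$21,427,700,000,000', None])

# Strip the dollar sign, thousands separators, and whitespace, then convert;
# missing entries stay NaN
gdp_numeric = pd.to_numeric(
    gdp_raw.str.replace(r'[$,\s]', '', regex=True), errors='coerce'
)
print(gdp_numeric)
```

Using `errors='coerce'` turns any value that still fails to parse into a missing value instead of raising an error, which we would then want to inspect.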


# Results

## Exploratory Data Analysis

Carry out whatever EDA you need to for your project.  Because every project will be different we can't really give you much of a template at this point. But please make sure you describe the what and why in text here as well as providing interpretation of results and context.


# Ethics & Privacy

The data appears to have been gathered from publicly available sources, so we likely do not need consent from these public figures to include them in our study. Some billionaires may have been left out because they are less well known or come from smaller countries. Because we did not collect the data directly from the billionaires, there could be bias present in the data. We can try to find the sources that were used to compile this data and see for ourselves whether any bias is present. We also do not get insight into the backgrounds of these billionaires; some may have started with more money and thus had an easier time building their wealth. Additionally, it may be easier to become a billionaire in certain countries than in others. To detect these biases, we may perform an exploratory analysis to determine which countries have a large proportion of billionaires relative to population. Gender and age bias may also affect billionaire status and how wealth is calculated. We could try to find more data about the ages at which people first became billionaires.
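
The billionaires-per-population check mentioned above could be sketched roughly as follows. All numbers are invented for illustration, and `population` is a hypothetical column name that we are assuming would hold each country's population.

```python
import pandas as pd

# Toy numbers invented for illustration; 'population' is a hypothetical column
toy = pd.DataFrame({
    'country': ['X', 'X', 'X', 'Y'],
    'population': [10_000_000, 10_000_000, 10_000_000, 50_000_000],
})

# Count billionaires per country and carry along the country's population
per_country = toy.groupby('country').agg(
    billionaires=('population', 'size'),
    population=('population', 'first'),
)

# Billionaires per one million residents
per_country['per_million'] = (
    per_country['billionaires'] / per_country['population'] * 1_000_000
)
print(per_country.sort_values('per_million', ascending=False))
```

Countries near the top of this ranking would be the ones to examine for the structural advantages discussed in this section.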

# Team Expectations 

* *We expect to meet at least once a week over Zoom to discuss our progress while communicating regularly throughout the week over Discord.*
* *If any conflict should arise, we can meet and talk it over and come to a resolution. If no resolution can be found, we will try to meet with the professor to help us resolve our problem.*
* *Everybody will do their fair share of work whether we are meeting together or we have divided up the work individually.*

# Project Timeline Proposal

Specify your team's specific project timeline. An example timeline has been provided. Change the dates, times, names, and details to fit your group's plan.

If you think you will need any special resources or training outside what we have covered in COGS 108 to solve your problem, then your proposal should state these clearly. For example, if you have selected a problem that involves implementing multiple neural networks, please state this so we can make sure you know what you’re doing and so we can point you to resources you will need to implement your project. Note that you are not required to use outside methods.



| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 1/20  |  1 PM | Read & Think about COGS 108 expectations; brainstorm topics/questions  | Determine best form of communication; Discuss and decide on final project topic; discuss hypothesis; begin background research | 
| 1/26  |  10 AM |  Do background research on topic | Discuss ideal dataset(s) and ethics; draft project proposal | 
| 2/1  | 10 AM  | Edit, finalize, and submit proposal; Search for datasets  | Discuss Wrangling and possible analytical approaches; Assign group members to lead each specific part   |
| 2/14  | 6 PM  | Import & Wrangle Data (Ant Man); EDA (Hulk) | Review/Edit wrangling/EDA; Discuss Analysis Plan   |
| 2/23  | 12 PM  | Finalize wrangling/EDA; Begin Analysis (Iron Man; Thor) | Discuss/edit Analysis; Complete project check-in |
| 3/13  | 12 PM  | Complete analysis; Draft results/conclusion/discussion (Wasp)| Discuss/edit full project |
| 3/20  | Before 11:59 PM  | NA | Turn in Final Project & Group Project Surveys |