# Forbes Billionaire Data Exploration
Notebook by Michael Kalmus (https://github.com/mkalmus). If you enjoy it, please consider starring the repository on GitHub and upvoting on Kaggle!
<br>
<br>
According to Wikipedia (<a href=https://en.wikipedia.org/wiki/The_World%27s_Billionaires>Source</a>), the world has 2,755 billionaires. In this notebook, we explore a dataset containing information from the <a href=https://www.forbes.com/billionaires/> Forbes data</a> about these billionaires, whose net worths range from 177 billion USD to 1 billion USD. The dataset was curated by Alexander Bader and can be found <a href=https://www.kaggle.com/alexanderbader/forbes-billionaires-2021-30>here</a>. The dataset was also updated by Sourav Roy <a href=https://www.kaggle.com/roysouravcu/forbes-billionaires-of-2021>here</a>.
<br>
<br>
This notebook uses Plotly, Plotly Express, and Cufflinks to create interactive charts. For more information, read the Plotly documentation <a href=https://plotly.com/python/>here</a> and the Cufflinks documentation <a href=https://github.com/santosjorge/cufflinks>here</a>. The image below also comes from the Forbes article.
<br>
<img src="https://thumbor.forbes.com/thumbor/1500x0/smart/filters:format(jpeg)/https%3A%2F%2Fimages.forbes.com%2FBillionaires2021-ListHeader-2%2FBillionaires2021-Desktop-LanderHeader-v2.png" alt="billionaire thumbnail">

**Import Packages**

In [109]:
import pandas as pd
import numpy as np
import cufflinks as cf
from plotly.offline import init_notebook_mode, plot, iplot
import plotly.express as px

**Initialize for Offline Plotting**

In [110]:
init_notebook_mode(connected=True)
cf.go_offline()

**Help for Functions used in Notebook**

In [125]:
# help(df.iplot)
# help(df.figure)
# help(px.bar)
# help(px.histogram)

**Read Data and Do Some Initial Cleaning**

In [111]:
df = pd.read_csv('forbes_billionaires_geo.csv')
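#Make Self_made column values nicer for later plots
#Assign the result back instead of using inplace=True on a column selection,
#which may not modify df under pandas copy-on-write
df['Self_made'] = df['Self_made'].replace([True, False], ['Self-made', 'Not self-made'])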
#Remove column capitalization to minimize typos
df.columns = [col_n.lower() for col_n in df.columns]
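
The two cleaning steps above can be tried on a small frame before running them on the full CSV. This is a minimal sketch: the rows are invented for illustration and are not taken from the Forbes dataset, and it uses `Series.map` with assignment, which is one possible alternative to `replace` and not necessarily what the notebook should use.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real CSV (values are made up for illustration)
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "NetWorth": [3.0, 1.5, 2.0],
    "Self_made": [True, False, np.nan],
})

# Map booleans to readable labels; missing values stay missing
labels = {True: "Self-made", False: "Not self-made"}
toy["Self_made"] = toy["Self_made"].map(labels)

# Lowercase the column names to minimize typos later
toy.columns = [c.lower() for c in toy.columns]

print(toy)
```

Assigning the mapped column back to the frame keeps the change explicit, and the missing entry is left as a null so it can still be counted in the null summary.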

# What Do the Columns Mean?

By reading the data description <a href=https://www.kaggle.com/alexanderbader/forbes-billionaires-2021-30> here</a>, we can learn what each of the columns represents:
- Name: Name of billionaire
- NetWorth: Billionaire net worth
- Country: Country (presumably birth country)
- Source: Source of wealth (e.g. investments, retail, Amazon)
- Rank: Rank according to Forbes
- Age: Age as of 2021
- Residence: Country of residence
- Citizenship: Citizenship 
- Status: Marital status
- Children: Number of children
- Self_made: Indicator if billionaire is self-made or not
- Geometry: Point of residency using OpenCageGeocode

## What Does the Data Look Like?

**Show the number of duplicate rows, a description of the dataframe, and the null count and null percentage for each column**

In [112]:
n_duplicates = df.duplicated().sum()
df_described = df.describe().round(3)
null_cnts = df.isnull().sum()
null_pcts = (df.isnull().sum() / len(df)).round(3)
df_null = pd.DataFrame({'n_null': null_cnts, 
              'pct_null': null_pcts}).sort_values('n_null', ascending=False)

print(f"Dataframe Shape: {df.shape}")
print(f"Duplicate Rows: {n_duplicates}\n")
print(f"Numerical Column Description:")
display(df_described)
print(f"All Column Null Summary:")
display(df_null)

Dataframe Shape: (2755, 13)
Duplicate Rows: 0

Numerical Column Description:


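| | networth | rank | age | children |
| --- | --- | --- | --- | --- |
| count | 2755.0 | 2755.0 | 2630.0 | 1552.0 |
| mean | 4.749 | 1345.664 | 63.267 | 2.978 |
| std | 9.615 | 772.67 | 13.479 | 1.619 |
| min | 1.0 | 1.0 | 18.0 | 1.0 |
| 25% | 1.5 | 680.0 | 54.0 | 2.0 |
| 50% | 2.3 | 1362.0 | 63.0 | 3.0 |
| 75% | 4.2 | 2035.0 | 73.0 | 4.0 |
| max | 177.0 | 2674.0 | 99.0 | 23.0 |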


All Column Null Summary:


| | n_null | pct_null |
| --- | --- | --- |
| education | 1346 | 0.489 |
| children | 1203 | 0.437 |
| status | 665 | 0.241 |
| age | 125 | 0.045 |
| residence | 40 | 0.015 |
| self_made | 18 | 0.007 |
| citizenship | 16 | 0.006 |
| name | 0 | 0.0 |
| networth | 0 | 0.0 |
| country | 0 | 0.0 |


In [113]:
#Replace null values with the string 'NULL'
df.fillna('NULL', inplace=True)
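
Filling every column with the string 'NULL' also puts strings into numeric columns that had missing values (here, age and children). The following is a minimal sketch, on invented data, of an alternative that fills only the non-numeric columns so the numeric ones keep their dtype. The names `toy`, `text_cols`, and `filled` are hypothetical, and the rest of this notebook still assumes the 'NULL' placeholder is present in every column.

```python
import numpy as np
import pandas as pd

# Invented rows: one text column and one numeric column, each with a gap
toy = pd.DataFrame({
    "status": ["Married", None, "Single"],
    "children": [2.0, np.nan, 4.0],
})

# Select the non-numeric columns and fill only those with a placeholder
text_cols = toy.select_dtypes(exclude="number").columns
filled = toy.copy()
filled[text_cols] = filled[text_cols].fillna("NULL")

# The numeric column keeps its dtype and its missing value
print(filled.dtypes)
print(filled)
```

With this approach the missing numeric entries stay as NaN, so aggregations such as `mean` skip them instead of running into strings.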

### Conclusions
More than 20% of the values in Education, Children, and Status are null. The other columns have few null values. Among the numerical columns, networth and rank are complete, while age (4.5% null) and children (43.7% null) are not.

## Who Are the Wealthiest People in the World? What Are Their Net Worths?

In [114]:
top_20_worth = df.sort_values('networth', ascending=False).iloc[:20] 
top_20_worth_fig = top_20_worth.figure(kind="bar", 
                   x="name", 
                   y="networth", 
                   title="Net Worth and Wealth Source of Top 20 Wealthiest Billionaires", 
                   xTitle="name", 
                   yTitle="Net Worth (Billions $USD)",
                   color="blue",
                   text="source")
top_20_worth_fig.update_yaxes(nticks=10)
display(top_20_worth_fig)

### Conclusions
The wealthiest individual is Jeff Bezos with his wealth from Amazon. It seems the most common source is technology companies, as we see companies like Amazon, Tesla, Microsoft, Facebook, and Google in the top 20. Another notable finding is that three members of the Walton family (Walmart) appear in the top 20 with net worths of around 60 billion USD.

## What is the Distribution of Wealth (Net Worth)?

**Net Worth Across All Individuals**

In [115]:
networth_hist_01 = df[['networth']].figure(kind="histogram", 
                        bins=(0, 200, 5), 
                        histnorm="percent",
                        title="Histogram of Net Worth (All Individuals)", 
                        xTitle="Net Worth (Billions $USD)", 
                        yTitle="Percent",
                        theme="ggplot",
                        color="blue",
                        bargap=0.1,
                        orientation="v",
                        text="networth")

networth_hist_01.update_yaxes(nticks=20)
networth_hist_01.update_xaxes(nticks=20)

display(networth_hist_01)

### Let's Look Closer at the Range between 0 and 10 Billion USD

In [116]:
networth_hist_02 = df[['networth']].figure(kind="histogram", 
                        bins=(0, 10, 1), 
                        title="Histogram of Net Worth (0-$10bn individuals)", 
                        xTitle="Net Worth (Billions $USD)", 
                        yTitle="Frequency",
                        theme="ggplot",
                        color="blue",
                        bargap=0.1,
                        orientation="v",
                        text="networth")
networth_hist_02.update_yaxes(nticks=20)
networth_hist_02.update_xaxes(nticks=10)

display(networth_hist_02)

### Conclusions
From the first chart, we can see that over 75% of the billionaires have a net worth between 1 and 5 billion USD, which agrees with the 75th percentile of 4.2 in the summary table. The second chart shows that the counts are concentrated at the low end of that range: the median of 2.3 in the summary table means roughly half of the billionaires have a net worth of 2.3 billion USD or less. Individuals with net worths of over 5 billion USD therefore make up less than 25% of the data. This puts into perspective the wealth of individuals like Jeff Bezos, who has a net worth of over 170 billion USD according to Forbes.
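
Shares like the ones read off the histograms can also be computed directly rather than estimated by eye. The sketch below is one way to do that with `pd.cut`; the net worth values are made up for illustration, and the band edges are an assumption chosen to mirror the ranges discussed above.

```python
import pandas as pd

# Made-up net worths in billions USD (not the Forbes data)
networth = pd.Series([1.0, 1.2, 1.8, 2.5, 3.0, 4.9, 7.0, 12.0, 50.0, 177.0])

# Bands are closed on the left: [1, 2), [2, 5), [5, inf)
bands = pd.cut(
    networth,
    bins=[1, 2, 5, float("inf")],
    right=False,
    labels=["1-2", "2-5", "5+"],
)

# Share of individuals in each band, as a percentage
share = bands.value_counts(normalize=True).sort_index() * 100
print(share.round(1))
```

On the real data, `networth` would be `df['networth']`, and the resulting shares could be compared against the histogram bars.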

## How Does Wealth Vary by Self-made Status?

In [117]:
def plot_worth_unstacked(col):
    """
    Makes the following three plots relating net worth to a group (e.g. a categorical column):
        1. Bar chart of total net worth by group
        2. Bar chart of number of billionaires by group
        3. Bar chart of average net worth per billionaire in each group
    Returns nothing but displays all three plots
    
    Parameters:
        col (str): the column to group by and analyze
    """
    sdf = df.copy()
    if col == 'children' or col == 'age':
        sdf[col] = sdf[col].replace('NULL', -1)
    sdf = sdf.groupby(col).agg(
        {'networth': 'sum', 'name': 'count'}).sort_values('networth', ascending=False).round(3)
    #We only plot the top 20 values by total net worth
    if len(sdf.index) > 20:
        sdf = sdf.iloc[0:20]
    sdf['avg_networth'] = (sdf['networth'] / sdf['name']).round(3)
    n = len(sdf.index.unique())

    #Total net worth per category
    total_worth_fig = px.bar(sdf,
                 y="networth", 
                 text="networth", 
                 title=f"Total Net Worth by {col.capitalize()} (Top {n})")
    total_worth_fig.update_traces(textposition="outside", marker_color="lightsalmon")
    total_worth_fig.update_layout(yaxis={'title': 'Total Net Worth'}, xaxis={'nticks':30})

    #Number of billionaires per category
    count_fig = px.bar(sdf, y="name", text="name", 
                       title=f"Number of Billionaires by {col.capitalize()} (Top {n} by Total Net Worth)")
    count_fig.update_traces(textposition="outside", 
                            marker_color="crimson")
    count_fig.update_layout(yaxis={'title': 'Number of Billionaires'}, xaxis={'nticks':30})

    #Worth per billionaire per category
    avg_worth_fig = px.bar(sdf, 
                 y="avg_networth", 
                 text="avg_networth", 
                 title=f"Average Net Worth by {col.capitalize()} (Top {n} by Total Net Worth)")
    avg_worth_fig.update_traces(textposition="outside", marker_color="lightslategray")
    avg_worth_fig.update_layout(yaxis={'title': 'Average Net Worth per Person'}, xaxis={'nticks':30})
    
    display(total_worth_fig, count_fig, avg_worth_fig)
    return None

In [118]:
#Remove null self_made values (nulls were replaced with the string 'NULL' earlier)
df_selfmade = df[df['self_made'] != 'NULL']

#Make histogram of net worth faceted by self made status
selfmade_hist = px.histogram(df_selfmade, 
                   x="networth", 
                   facet_row="self_made", 
                   range_x=(0, 50), 
                   range_y=(0, 1600),
                   facet_col_spacing=0.05,
                   nbins=40, 
                   title="Histograms of Net Worth by Self-made Status")

selfmade_hist.update_layout(bargap=0.1, yaxis={'range':[0,1800]}, xaxis={'nticks':20})
#Make labels for each subplot (right hand side) nicer
selfmade_hist.for_each_annotation(lambda a: a.update(text=a.text.split("=")[-1]))
display(selfmade_hist)

#Plot total worth by self made status
plot_worth_unstacked("self_made")

### Conclusions
From the first chart, we can see there are many more self-made billionaires than non-self-made billionaires. From the second chart, we can see the wealth distribution is approximately the same for self-made and non-self-made individuals. Interestingly, the non-self-made billionaires control more wealth per person.
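
Because net worth is heavily right-skewed, a group average can be pulled up by one or two very large fortunes. The following is a minimal sketch, on invented numbers, of placing the median next to the mean for each group. The median column is an addition for illustration and is not part of `plot_worth_unstacked`.

```python
import pandas as pd

# Invented net worths in billions USD; one group contains a large outlier
toy = pd.DataFrame({
    "self_made": ["Self-made"] * 5 + ["Not self-made"] * 3,
    "networth": [1.0, 2.0, 3.0, 4.0, 5.0, 2.0, 4.0, 60.0],
})

# Total, count, mean, and median of net worth per group
summary = toy.groupby("self_made")["networth"].agg(
    total="sum", count="count", mean="mean", median="median"
).round(3)

print(summary)
```

When the mean and median of a group are far apart, the per-person average is being driven by a few individuals rather than by the group as a whole.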

## How Does Wealth Vary by Country?

In [119]:
plot_worth_unstacked("country")

### Conclusions
From the above plots, we can see the United States controls the most wealth and has the most billionaires. From the last plot, we can see that billionaires from France have the highest average net worth, and that billionaires from both France and Hong Kong have a higher average net worth than billionaires from the US.

## How Does Net Worth Vary by Source?

In [120]:
plot_worth_unstacked("source")

### Conclusions
We can see real estate is by far the largest contributor to billionaire net worth. It is interesting, though, that sources that are single large companies also control a large portion of the wealth and have a high net worth per individual. For example, Amazon ranks in the top 10 even though it covers only Jeff Bezos and his ex-wife.

## How Does Net Worth Vary by Children?

In [121]:
plot_worth_unstacked("children")

### Conclusions
We can see that the most wealth is owned by individuals for whom we don't have children data. This is because this subset also contains the most billionaires, as shown in the second chart. Among the subset for which we do have data, most billionaires have 2 children. When we look at average wealth per billionaire, billionaires with seven children surprisingly have the highest average wealth, more than double the second-highest.

## How Does Net Worth Vary by Marital Status?

In [122]:
plot_worth_unstacked("status")

### Conclusions
Most billionaires, by a wide margin, are married. However, the average wealth per person is highest, also by a wide margin, for billionaires who are currently in relationships.

## Extra: Interactive Plotting
Let's use iPyWidgets to make the plots interactive!
<br>
<br>
Throughout this notebook, we plotted the same charts for various columns in the dataframe. Rather than writing code to do this one-by-one, we can instead use iPyWidgets to have user-driven input that shows only the variable we want to analyze at a given time.
<br>
<br>
To do this, we create a dropdown widget listing the columns we viewed earlier and call `interactive` on the plotting function, passing the widget as the function parameter instead of an individual column name.

**Package Imports**

In [123]:
import ipywidgets as widgets
from ipywidgets import interact, interactive

**Make Widget and Display Interactive Output**

In [124]:
#Make widget
cols_to_analyze = ['self_made', 'country', 'source', 'children', 'status']
#Make style to show full description
full_description_style = {'description_width': 'initial'}
col_widget = widgets.Dropdown(description="Column to Analyze:", 
                              options=cols_to_analyze, 
                              style=full_description_style)

interactive_output = interactive(plot_worth_unstacked, col=col_widget)
display(interactive_output)


## Final Conclusions and Remarks
Overall, these charts show the danger of looking at total net worth by category. We can see in most of the visualizations that trends found in average net worth per category significantly differ from trends in total net worth.
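
The gap between the two views can be shown on a tiny example. This is a minimal sketch with invented countries and values: the group with the largest total is not the group with the largest average.

```python
import pandas as pd

# Invented data: X has many modest fortunes, Y has a single large one
toy = pd.DataFrame({
    "country": ["X", "X", "X", "X", "Y", "Z", "Z"],
    "networth": [2.0, 3.0, 2.0, 3.0, 8.0, 1.0, 2.0],
})

# Total and average net worth per country
by_country = toy.groupby("country")["networth"].agg(total="sum", average="mean")

# The leader depends on which measure is used
top_by_total = by_country["total"].idxmax()
top_by_average = by_country["average"].idxmax()

print(by_country)
print(f"Top by total: {top_by_total}, top by average: {top_by_average}")
```

Here X leads on total net worth only because it has the most individuals, while Y leads on average net worth with a single entry.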
<br>
<br>
Creating this notebook was a great lesson in Plotly and exploratory data analysis. If you enjoyed reading, **I would greatly appreciate an upvote on Kaggle or a star on GitHub**. I would also appreciate any improvements to the code, feedback, or questions. Until next time!