# Forbes Billionaire Data Exploration
According to Wikipedia (<a href=https://en.wikipedia.org/wiki/The_World%27s_Billionaires>Source</a>), the world has 2,755 billionaires. In this notebook, we explore a dataset built from <a href=https://www.forbes.com/billionaires/>Forbes data</a> on these billionaires, whose net worths range from 1 billion USD to 177 billion USD. The dataset was curated by Alexander Bader and can be found <a href=https://www.kaggle.com/alexanderbader/forbes-billionaires-2021-30>here</a>.
<br>
<br>
This notebook uses Plotly, Plotly Express, and Cufflinks to create interactive charts. For more information, see the Plotly and Cufflinks documentation.
<br>
<img src="https://thumbor.forbes.com/thumbor/1500x0/smart/filters:format(jpeg)/https%3A%2F%2Fimages.forbes.com%2FBillionaires2021-ListHeader-2%2FBillionaires2021-Desktop-LanderHeader-v2.png" alt="billionaire thumbnail">

**Import Packages**

In [2]:
import pandas as pd
import numpy as np
import cufflinks as cf
from plotly.offline import init_notebook_mode, plot, iplot
import plotly.express as px

**Initialize for Offline Plotting**

In [3]:
init_notebook_mode(connected=True)
cf.go_offline()

**Read Data and Do Some Initial Cleaning**

In [19]:
df = pd.read_csv('forbes_billionaires_geo.csv')
#Make Self_made column values nicer for later plots
#Assign the result back; calling replace(..., inplace=True) on a selected column may not update df
df['Self_made'] = df['Self_made'].replace({True: 'Self-made', False: 'Not self-made'})
#Lowercase column names to minimize typos
df.columns = [col_n.lower() for col_n in df.columns]
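
The two cleaning steps above can be tried out on a small frame before running them on the full CSV. The sketch below is only an illustration: the `toy` DataFrame and its values are made up, and it uses `Series.map` rather than `replace`, which turns any value missing from the mapping into a null.

```python
import pandas as pd

# Toy stand-in for the real CSV; column names mirror the dataset, values are made up.
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "NetWorth": [3.0, 1.2, 7.5],
    "Self_made": [True, False, None],
})

# Map booleans to readable labels and assign the result back to the column.
# Values missing from the mapping (here: None) become NaN.
toy["Self_made"] = toy["Self_made"].map({True: "Self-made", False: "Not self-made"})

# Lowercase every column name.
toy.columns = toy.columns.str.lower()

print(toy)
```

Assigning the result back to the column keeps the update explicit, which avoids relying on in-place modification of a selected column.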

## What Do the Columns Mean?

By reading the data description <a href=https://www.kaggle.com/alexanderbader/forbes-billionaires-2021-30> here</a>, we can learn what each of the columns represents:
- Name: Name of billionaire
- NetWorth: Billionaire net worth
- Country: Country (presumably birth country)
- Source: Source of wealth (e.g. investments, retail, Amazon)
- Rank: Rank according to Forbes
- Age: Age as of 2021
- Residence: Country of residence
- Citizenship: Citizenship 
- Status: Marital status
- Children: Number of children
- Self_made: Indicator if billionaire is self-made or not
- Geometry: Point of residency using OpenCageGeocode

## What Does the Data Look Like?

**Show the duplicate row count, a description of the dataframe, and the null count and null fraction per column**

In [20]:
n_duplicates = df.duplicated().sum()
df_described = df.describe().round(3)
null_cnts = df.isnull().sum()
null_pcts = (df.isnull().sum() / len(df)).round(3)
df_null = pd.DataFrame({'n_null': null_cnts, 
              'pct_null': null_pcts}).sort_values('n_null', ascending=False)

print(f"Dataframe Shape: {df.shape}")
print(f"Duplicate Rows: {n_duplicates}\n")
print(f"Numerical Column Description:")
display(df_described)
print(f"All Column Null Summary:")
display(df_null)

Dataframe Shape: (2755, 13)
Duplicate Rows: 0

Numerical Column Description:


| | networth | rank | age | children |
|---|---|---|---|---|
| count | 2755.0 | 2755.0 | 2630.0 | 1552.0 |
| mean | 4.749 | 1345.664 | 63.267 | 2.978 |
| std | 9.615 | 772.67 | 13.479 | 1.619 |
| min | 1.0 | 1.0 | 18.0 | 1.0 |
| 25% | 1.5 | 680.0 | 54.0 | 2.0 |
| 50% | 2.3 | 1362.0 | 63.0 | 3.0 |
| 75% | 4.2 | 2035.0 | 73.0 | 4.0 |
| max | 177.0 | 2674.0 | 99.0 | 23.0 |


All Column Null Summary:


| | n_null | pct_null |
|---|---|---|
| education | 1346 | 0.489 |
| children | 1203 | 0.437 |
| status | 665 | 0.241 |
| age | 125 | 0.045 |
| residence | 40 | 0.015 |
| self_made | 18 | 0.007 |
| citizenship | 16 | 0.006 |
| name | 0 | 0.0 |
| networth | 0 | 0.0 |
| country | 0 | 0.0 |


### Conclusions
Education, Children, and Status are each more than 20% null. The other columns have few null values. Among the numerical columns, networth and rank have no null values, while age (4.5% null) and children (43.7% null) do.
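
A threshold check like the one in these conclusions can also be done in code instead of by reading the table. The following is a minimal sketch on made-up data; the `toy` frame and the 20% `threshold` are assumptions chosen to mirror the text.

```python
import numpy as np
import pandas as pd

# Made-up frame with the same kind of gaps as the real data.
toy = pd.DataFrame({
    "networth": [1.0, 2.3, 4.2, 177.0],
    "age": [54.0, np.nan, 63.0, 73.0],
    "education": [None, None, "BA", None],
})

# isnull().mean() gives the fraction of nulls per column in one step.
null_share = toy.isnull().mean().sort_values(ascending=False)

threshold = 0.20  # assumed cut-off, matching the "over 20%" statement
mostly_missing = null_share[null_share > threshold].index.tolist()

print(null_share)
print(mostly_missing)
```

On the real dataframe, the same two lines would be applied to `df` instead of `toy`.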

## Who Are the Wealthiest People in the World? What Are Their Net Worths?

**Top 20 Billionaires by Net Worth**

In [39]:
top_20_worth = df.sort_values('networth', ascending=False).iloc[:20] 
top_20_worth_fig = top_20_worth.figure(kind="bar", 
                   x="name", 
                   y="networth", 
                   title="Net Worth and Wealth Source of Top 20 Wealthiest Billionaires", 
                   xTitle="Name", 
                   yTitle="Net Worth (Billions $USD)",
                   color="blue",
                   text="source")
top_20_worth_fig.update_yaxes(nticks=10)
display(top_20_worth_fig)

## What is the Distribution of Wealth (Net Worth)?

**Net Worth Across All Individuals**

In [40]:
networth_hist_01 = df[['networth']].figure(kind="histogram", 
                        bins=(0, 200, 5), 
                        histnorm="percent",
                        title="Histogram of Net Worth (All Individuals)", 
                        xTitle="Net Worth (Billions $USD)", 
                        yTitle="Percent",
                        theme="ggplot",
                        color="blue",
                        bargap=0.1,
                        orientation="v",
                        text="networth")

networth_hist_01.update_yaxes(nticks=20)
networth_hist_01.update_xaxes(nticks=20)

display(networth_hist_01)

### Let's Look Closer at the Range between 0 and 10 Billion USD

In [53]:
networth_hist_02 = df[['networth']].figure(kind="histogram", 
                        bins=(0, 10, 1), 
                        title="Histogram of Net Worth (0-$10bn individuals)", 
                        xTitle="Net Worth (Billions $USD)", 
                        yTitle="Frequency",
                        theme="ggplot",
                        color="blue",
                        bargap=0.1,
                        orientation="v",
                        text="networth")
networth_hist_02.update_yaxes(nticks=20)
networth_hist_02.update_xaxes(nticks=10)

display(networth_hist_02)

### Conclusions
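From the first chart, we can see over 75% of the billionaires have a net worth between 1 and 5 billion USD, which agrees with the 75th percentile of 4.2 billion USD in the summary table. The second chart shows that the distribution is concentrated at the low end of that range: the median net worth is 2.3 billion USD, so about half of the billionaires are worth 2.3 billion USD or less.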
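
Statements about what share of individuals falls in a range can be checked numerically rather than read off a histogram. Below is a small sketch; `share_between` is a hypothetical helper and `toy_networth` is made-up data, not the Forbes values.

```python
import pandas as pd


def share_between(values: pd.Series, low: float, high: float) -> float:
    """Return the fraction of non-null values with low <= value < high."""
    values = values.dropna()
    in_range = (values >= low) & (values < high)
    return float(in_range.mean())


# Made-up net worths in billions USD, skewed toward the low end.
toy_networth = pd.Series([1.0, 1.2, 1.5, 1.9, 2.3, 3.0, 4.2, 6.0, 20.0, 177.0])

print(share_between(toy_networth, 1, 5))
print(share_between(toy_networth, 1, 2))
```

Calling `share_between(df['networth'], 1, 5)` on the real dataframe would give the exact share behind the first conclusion.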

## How Does Wealth Vary by Self-made Status?

In [159]:
def plot_worth_unstacked(col):
    """Return a bar chart of total net worth for each value of `col`."""
    #Select networth before summing so non-numeric columns are not aggregated
    sdf = df.groupby(col)[["networth"]].sum().sort_values("networth", ascending=False).round(3)
    fig = px.bar(sdf, 
                 y="networth", 
                 text="networth", 
                 title=f"Total Net Worth by {col.capitalize()}")
    fig.update_traces(textposition="outside")
    fig.update_layout(yaxis={'title': 'Total Net Worth (Billions $USD)'})
    return fig

In [160]:
#Remove null self_made values
df_selfmade = df[df['self_made'].notnull()]

#Make histogram of net worth faceted by self-made status
selfmade_hist = px.histogram(df_selfmade, 
                   x="networth", 
                   facet_row="self_made", 
                   range_x=(0, 50), 
                   range_y=(0, 1800),
                   facet_row_spacing=0.05,
                   nbins=40, 
                   title="Histograms of Net Worth by Self-made Status")

selfmade_hist.update_layout(bargap=0.1)
selfmade_hist.update_xaxes(nticks=20)
selfmade_hist.for_each_annotation(lambda a: a.update(text=a.text.split("=")[-1]))

#Plot total worth by self made status
worth_by_selfmade = plot_worth_unstacked("self_made")
worth_by_selfmade.update_yaxes(nticks=10)

display(selfmade_hist, worth_by_selfmade)

### Conclusions
From the first chart, we can see the shape of the wealth distribution is approximately the same for self-made and non-self-made individuals. Both charts indicate that self-made billionaires dominate, by count in the first chart and by total net worth in the second, a fact that surprised me.
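
The comparison in these conclusions mixes two different measures: how many individuals are in each group, and how much wealth each group holds. A per-group summary makes the distinction explicit. This is a sketch on made-up data; the `toy` frame is an assumption and only reuses the column names from the notebook.

```python
import pandas as pd

# Made-up rows; one has a missing self_made value, as in the real data.
toy = pd.DataFrame({
    "self_made": ["Self-made", "Self-made", "Self-made", "Not self-made", None],
    "networth": [1.5, 2.0, 10.0, 3.0, 4.0],
})

# Count, total, and median net worth per group, ignoring rows without a label.
summary = (
    toy.dropna(subset=["self_made"])
       .groupby("self_made")["networth"]
       .agg(count="count", total="sum", median="median")
)

print(summary)
```

The `count` column corresponds to what the faceted histograms show, while `total` corresponds to the bar chart of total net worth.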

## How Does Wealth Vary by Country?

In [164]:
worth_by_country = plot_worth_unstacked("country")
#Restrict the x-axis to the first five bars (the five countries with the highest total net worth)
worth_by_country.update_xaxes(range=[-0.5,4.5])
display(worth_by_country)

In [None]:
#Requires top_10_sources, which is defined in the "Number of Billionaires per Source" cell below
hist_data = df[df['source'].isin(top_10_sources.index)]
fig = px.bar(hist_data, x="source", y="networth", color="name", barmode="stack")
fig.update_layout(showlegend=False)
fig.update_xaxes(categoryorder='total descending')
display(fig)

## How Does Net Worth Vary by Source?

**Total Net Worth by Source**

In [71]:
#plot_networth_group is not defined in this notebook; plot_worth_unstacked is used instead
total_worth_by_source = plot_worth_unstacked("source")
display(total_worth_by_source)

We can see real estate is by far the largest contributor to billionaire net worth, but how many individuals are part of these categories?

**Bar chart of Number of Billionaires per Source**

In [73]:
top_10_sources = df.groupby('source').agg(
    {'networth': 'sum', 'name': 'count'}).sort_values('networth', ascending=False)[:10].round(3)

top_10_sources['avg_worth'] = (top_10_sources['networth'] / top_10_sources['name']).round(3)

#After the aggregation, the "name" column holds the number of billionaires per source
source_count_bar = px.bar(top_10_sources, x=top_10_sources.index, y="name", text="name")
source_count_bar.update_layout(title="Number of Billionaires per Source for Top 10 Sources", 
                               yaxis={'title': 'Number of Billionaires'})
source_count_bar.update_traces(textposition='outside')

display(source_count_bar)
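
Because the aggregation above stores the head count in a column still called `name`, it is easy to plot the wrong quantity. One option is pandas named aggregation, which gives each result column an explicit name. The sketch below uses made-up data; the `toy` frame and the column names `total_worth`, `n_billionaires`, and `avg_worth` are illustrative assumptions.

```python
import pandas as pd

# Made-up rows that reuse the notebook's column names.
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E"],
    "source": ["real estate", "real estate", "real estate", "software", "retail"],
    "networth": [2.0, 3.0, 1.0, 12.0, 4.0],
})

# Named aggregation: each output column states what it contains.
by_source = (
    toy.groupby("source")
       .agg(total_worth=("networth", "sum"), n_billionaires=("name", "count"))
       .sort_values("total_worth", ascending=False)
)
by_source["avg_worth"] = by_source["total_worth"] / by_source["n_billionaires"]

print(by_source)
```

With explicit names, a chart of head counts would use `y="n_billionaires"` and a chart of averages would use `y="avg_worth"`, so the axis title and the plotted column are harder to mix up.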

**Bar chart of Total Net Worth Stacked by Individual**

In [65]:
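#plot_networth_group is not defined in this notebook; plot_worth_unstacked (not stacked by individual) is used instead
total_worth_by_country = plot_worth_unstacked("country")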
display(total_worth_by_country)

In [11]:
# help(df.iplot)
# help(df.figure)
# help(px.bar)

Help on method _iplot in module cufflinks.plotlytools:

_iplot(kind='scatter', data=None, layout=None, filename='', sharing=None, title='', xTitle='', yTitle='', zTitle='', theme=None, colors=None, colorscale=None, fill=False, width=None, dash='solid', mode='', interpolation='linear', symbol='circle', size=12, barmode='', sortbars=False, bargap=None, bargroupgap=None, bins=None, histnorm='', histfunc='count', orientation='v', boxpoints=False, annotations=None, keys=False, bestfit=False, bestfit_colors=None, mean=False, mean_colors=None, categories='', x='', y='', z='', text='', gridcolor=None, zerolinecolor=None, margin=None, labels=None, values=None, secondary_y='', secondary_y_title='', subplots=False, shape=None, error_x=None, error_y=None, error_type='data', locations=None, lon=None, lat=None, asFrame=False, asDates=False, asFigure=False, asImage=False, dimensions=None, asPlot=False, asUrl=False, online=None, **kwargs) method of pandas.core.frame.DataFrame instance
           Retur