The goal of this example is to demonstrate a realistic data science workflow that involves complex data cleaning and transformation, along with exploratory data analysis with Lux.

In [None]:
import lux
import pandas as pd

We first load the [Happy Planet Index (HPI)](http://happyplanetindex.org/) dataset, which contains country-level data on sustainability and well-being.

In [None]:
HPI = pd.read_csv("https://github.com/lux-org/lux-datasets/blob/master/data/hpi_full.csv?raw=True")
# We add an additional feature column, describing whether the country is one of the G10 nations
HPI["G10"]  = HPI["Country"].isin(["Belgium","Canada","France","Germany","Italy","Japan","Netherlands","United Kingdom","Switzerland","Sweden","United States of America"])

We take a quick look at the HPI dataset: 

In [None]:
HPI

We drop the inequality-adjusted (`IneqAdj`) measures, since they are strongly correlated with each other. We also drop `HPIRank` and keep only the Happy Planet Index itself.

In [None]:
HPI = HPI[HPI.columns.drop(list(HPI.filter(regex='IneqAdj'))+["HPIRank"])]
HPI

After dropping these columns, the correlations shown are no longer dominated by the redundant inequality-adjusted measures and look a bit more realistic.
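
The regex-based column drop used above can be tried on a small frame first. The sketch below uses a made-up miniature table whose column names only mimic the HPI naming pattern; it is an illustration, not the real dataset.

```python
import pandas as pd

# Made-up miniature frame; the column names only mimic the HPI naming pattern.
toy = pd.DataFrame({
    "Country": ["A", "B"],
    "HPIRank": [1, 2],
    "IneqAdjLifeExpectancy": [70.1, 65.3],
    "IneqAdjWellbeing": [6.2, 5.1],
    "HappyPlanetIndex": [40.7, 35.2],
})

# DataFrame.filter(regex=...) selects the columns whose names match the pattern.
ineq_cols = list(toy.filter(regex="IneqAdj").columns)

# drop(columns=...) returns a new frame without the listed columns.
trimmed = toy.drop(columns=ineq_cols + ["HPIRank"])
print(list(trimmed.columns))
```

Using `drop(columns=...)` is one possible spelling; it should be equivalent to indexing with `HPI.columns.drop(...)` as done in the cell above.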

The `Country` column needs to be mapped to a country code that is easier to work with later on. We therefore load the [countries dataset](https://github.com/mledoze/countries), which contains the ISO-3 country code along with information such as currency, language, and geography.

In [None]:
countries = pd.read_csv("https://github.com/lux-org/lux-datasets/blob/master/data/countries.csv?raw=True")
countries["Country"]=countries["name"].apply(lambda x:x.split(",")[0])
countries.loc[countries["Country"]=='United States',"Country"] = 'United States of America'

In [None]:
countries

In [None]:
# The countries dataset has additional feature columns that we can add in
countries["landlocked"] = countries["landlocked"].fillna("False").replace(1,"True")
countries["NumOfficialLanguages"]=countries.languages.str.count(",")+1
countries["NumBorderingCountries"]=countries.borders.str.count(",")+1
countries["NumBorderingCountries"]=countries["NumBorderingCountries"].fillna(0)
countries = countries[['Country','cca3', 'landlocked', "NumOfficialLanguages", "NumBorderingCountries",'area']]
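
The counting trick in this cell (number of commas plus one) can be checked on a toy column. The sketch below assumes, as the cell above does, that list-like fields are stored as comma-separated strings and that a missing value means "no entries"; the sample values are made up.

```python
import pandas as pd

# Made-up sample: comma-separated border codes, with one missing value.
toy = pd.DataFrame({
    "name": ["A", "B", "C"],
    "borders": ["X,Y,Z", "X", None],
})

# Number of items = number of commas + 1; missing values stay missing here.
num_borders = toy["borders"].str.count(",") + 1

# Treat a missing list as zero bordering countries, then store as integers.
num_borders = num_borders.fillna(0).astype(int)
print(num_borders.tolist())
```

One caveat of this approach: an empty string (as opposed to a missing value) would be counted as one item, so it is worth checking how empty lists are encoded in the source file.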

In [None]:
# Merge the HPI data with the countries data to attach the ISO-3 code
df = HPI.merge(countries)
df = df.rename(index=str, columns={"SubRegion":"Region","subregion":"SubRegion"})
df["Region"] = df.Region.replace("Middle East and North Africa","Middle East")
df.area = df.area.astype(int)
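
Because `merge` defaults to an inner join on the shared column names, countries whose names are spelled differently in the two tables are dropped silently. A minimal sketch of one way to spot them, using made-up frames (`left`, `right`, and their values are hypothetical) and the `indicator` option of `DataFrame.merge`:

```python
import pandas as pd

# Made-up frames: "Russia" is spelled differently in the lookup table.
left = pd.DataFrame({"Country": ["Canada", "Russia", "Japan"],
                     "score": [1.0, 2.0, 3.0]})
right = pd.DataFrame({"Country": ["Canada", "Japan", "Russian Federation"],
                      "cca3": ["CAN", "JPN", "RUS"]})

# A left join with indicator=True adds a "_merge" column that records
# whether each row found a partner in the right-hand table.
checked = left.merge(right, on="Country", how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "Country"].tolist()
print(unmatched)

# The default inner join keeps only the rows that matched.
inner = left.merge(right)
print(len(inner))
```

Running a check like this before the inner join makes it easier to decide which names need to be harmonized by hand.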

In [None]:
# Ensure well-formatted country names based on: https://github.com/deactivated/python-iso3166/blob/master/iso3166/__init__.py
df.loc[df.Country=="Russia","Country"]="Russian Federation"
df.loc[df["Country"]=="Czech Republic","Country"]="Czechia"
df.loc[df.Country=="DR Congo","Country"]="Congo, Democratic Republic of the"#not working?
df.loc[df.Country=="Bolivia","Country"]="Bolivia, Plurinational State of"
df.loc[df["Country"]=="Cote d'Ivoire","Country"]="Côte d'Ivoire"

After all the data cleaning, we print out the combined dataframe to look at the visualizations and patterns in the dataset. 

In [None]:
df

By inspecting the `Correlation` tab, we learn that there is a negative correlation between `AvrgLifeExpectancy` and `Inequality`. In other words, countries with higher levels of inequality also have a lower average life expectancy. We can also look at other tabs, which show the Distribution of quantitative attributes and the Occurrence of categorical attributes.
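
The direction of a relationship like this can also be checked numerically with pandas. The values below are made up for illustration and are not taken from the HPI data; only the column names are borrowed from it.

```python
import pandas as pd

# Made-up values that merely imitate the direction of the relationship.
toy = pd.DataFrame({
    "Inequality": [0.10, 0.20, 0.30, 0.40, 0.45],
    "AvrgLifeExpectancy": [81.0, 78.0, 70.0, 62.0, 58.0],
})

# Series.corr computes the Pearson correlation coefficient by default.
r = toy["Inequality"].corr(toy["AvrgLifeExpectancy"])
print(round(r, 2))  # a negative value indicates a negative correlation
```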

Now, let's investigate whether any country-level characteristics explain the observed negative correlation between inequality and life expectancy.
We can do this by specifying our analysis intent to Lux via `df.intent`:

In [None]:
df.intent = ["Inequality","AvrgLifeExpectancy"]

In [None]:
df

By looking at the colored scatterplots in the `Enhance` tab, we find that most G10 industrialized countries are in the upper-left quadrant of the scatterplot (low inequality, high life expectancy). In the breakdown by Region, we observe that countries in Sub-Saharan Africa (yellow points) tend to be in the bottom right, with lower life expectancy and higher inequality.

We are now interested in how these country-level metrics relate to a country's COVID intervention strategy and response. We download the [COVID pandemic policy dataset](https://ourworldindata.org/grapher/covid-stringency-index).

In [None]:
covid = pd.read_csv("https://github.com/lux-org/lux-datasets/blob/master/data/covid-stringency-index.csv?raw=True")
covid = covid.rename(columns={"stringency_index":"stringency"})
covid['Day'] = pd.to_datetime(covid['Day'], format="%Y-%m-%d")

The COVID dataset contains a column `stringency`, which is a number from 0 to 100, with 100 being the strictest level of response (e.g., measures such as travel bans, stay-at-home orders, and school closures).

When we print the dataframe, we see that most records fall at medium to high stringency levels, with the distribution peaking at a stringency of 60-80. From the `Temporal` tab, we see that stringency was tracked daily from January 2020 to March 2021.

In [None]:
covid

We are only interested in the records from March 11, 2020, the day the WHO declared COVID-19 a pandemic. By filtering to the records from this day only, the stringency score becomes a proxy for the strictness of each country's **early** intervention efforts.

In [None]:
covid = covid[covid["Day"]=="2020-03-11"]
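
A small sketch of the same single-day filter on made-up records. Comparing against an explicit `pd.Timestamp` and taking a `.copy()` are optional choices made here, not something the cell above requires; the copy keeps the snapshot independent of the original frame if columns are modified later.

```python
import pandas as pd

# Made-up daily records for two hypothetical entities.
toy = pd.DataFrame({
    "Entity": ["A", "A", "B", "B"],
    "Day": ["2020-03-10", "2020-03-11", "2020-03-10", "2020-03-11"],
    "stringency": [10.0, 20.0, 0.0, 5.0],
})
toy["Day"] = pd.to_datetime(toy["Day"], format="%Y-%m-%d")

# Keep only the rows for one day; .copy() makes the result a standalone frame.
snapshot = toy[toy["Day"] == pd.Timestamp("2020-03-11")].copy()
print(snapshot)
```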

In [None]:
covid["stringency"]

Somewhat interestingly, we see that on this early date the stringency is heavily right-skewed, suggesting that most countries had not enacted strict measures in the early days of the pandemic.
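
Right skew can be confirmed numerically as well as visually. The sketch below uses a made-up series (not the actual stringency values) and two common checks: a mean above the median, and a positive sample skewness from `Series.skew()`.

```python
import pandas as pd

# Made-up scores: many low values and one large value in the right tail.
s = pd.Series([0, 0, 0, 5, 10, 10, 15, 20, 80], name="stringency")

# For a right-skewed distribution the mean is pulled above the median,
# and the sample skewness is positive.
skewness = s.skew()
mean_above_median = s.mean() > s.median()
print(mean_above_median, skewness > 0)
```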

We now join the countries dataframe `df` with the `covid` dataframe: 

In [None]:
result = covid.merge(df,left_on=["Entity","Code"],right_on=["Country","cca3"])
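
When the key columns have different names on each side, `left_on` and `right_on` pair them up by position, and a row is kept only if every key pair matches. A minimal sketch with made-up frames (`covid_toy`, `df_toy`, and their values are hypothetical):

```python
import pandas as pd

# Made-up frames; the third row matches on code but not on name.
covid_toy = pd.DataFrame({
    "Entity": ["Canada", "Japan", "Russia"],
    "Code": ["CAN", "JPN", "RUS"],
    "stringency": [10.0, 20.0, 30.0],
})
df_toy = pd.DataFrame({
    "Country": ["Canada", "Japan", "Russian Federation"],
    "cca3": ["CAN", "JPN", "RUS"],
    "Inequality": [0.10, 0.12, 0.20],
})

# Entity is matched against Country and Code against cca3; both must agree.
joined = covid_toy.merge(df_toy, left_on=["Entity", "Code"],
                         right_on=["Country", "cca3"])
print(joined[["Entity", "Code", "stringency", "Inequality"]])
```

In this toy example the third row is dropped because the names differ even though the codes agree, which is why consistent country names matter for a two-key join like this one.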

In [None]:
result.intent = ["stringency"]
result

When we set the intent as `stringency`, we see that China and Italy have the strictest measures (corresponding to dark blue on the geo map, among a sea of light yellow and green).

We want to discern these country-level differences further, so we convert the stringency index into a categorical variable `stringency_level`. We use [pd.qcut](https://pandas.pydata.org/docs/reference/api/pandas.qcut.html) so that the `Low` and `High` bins contain a roughly equal number of records.

In [None]:
result["stringency_level"] = pd.qcut(result["stringency"],2,labels=["Low","High"])
result = result.drop(columns=["stringency"])
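
To see why quantile-based binning is a reasonable choice for skewed scores, the sketch below compares `pd.qcut` with the equal-width `pd.cut` on a made-up series (the values are illustrative only).

```python
import pandas as pd

# Made-up, right-skewed scores.
scores = pd.Series([0.0, 5.0, 10.0, 20.0, 40.0, 80.0])

# qcut splits at the median, so each bin receives half of the records.
levels = pd.qcut(scores, 2, labels=["Low", "High"])
counts = levels.value_counts()

# cut splits the value range in half instead, so skewed data
# ends up mostly in the lower bin.
width_levels = pd.cut(scores, 2, labels=["Low", "High"])
width_counts = width_levels.value_counts()

print(counts["Low"], counts["High"])
print(width_counts["Low"], width_counts["High"])
```

With heavily tied values (for example, many countries at a stringency of exactly 0), the quantile edges can coincide; in that case the bins may be unequal, or `pd.qcut` may raise an error unless `duplicates="drop"` is passed.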

With the modified dataframe, we revisit the negative correlation we observed previously by setting the intent to average life expectancy and inequality again. The result is similar to what we saw before, with one additional visualization showing the breakdown by `stringency_level`.

In [None]:
result.intent = ["Inequality","AvrgLifeExpectancy"]
result

We see a strong separation: stricter countries (blue) tend to be countries with higher life expectancy and lower levels of inequality. This visualization suggests that these countries may have had a better-developed public health infrastructure that supported an early pandemic response. However, we observe three outliers that seem to defy this trend.

<img src="https://github.com/lux-org/lux-resources/blob/master/doc_img/hpi-covid-outlier.png?raw=True" width="250"></img>

When we filter to these dataframe records, we find that these countries correspond to [Afghanistan](https://www.who.int/news-room/feature-stories/detail/afghanistan-who-mission-reviews-covid-19-response), [Pakistan](https://www.who.int/news-room/feature-stories/detail/covid-19-in-pakistan-who-fighting-tirelessly-against-the-odds), and [Rwanda](https://www.npr.org/sections/goatsandsoda/2020/07/15/889802561/a-covid-19-success-story-in-rwanda-free-testing-robot-caregivers), countries that were praised for their early pandemic response despite limited resources.

In [None]:
result[(result["Inequality"]>0.35)&(result["stringency_level"]=="High")]

To download this visualization insight and share it with others, we can click on the visualization in the Lux view above and then click the export button.

<img src="https://github.com/lux-org/lux-resources/blob/master/doc_img/hpi-covid-export.png?raw=True" width="1000"></img>

In [None]:
result

This exports the visualization from the widget to a `Vis` object. We can access the exported `Vis` object via the `exported` property and print it as code.

In [None]:
result.exported

In [None]:
print(result.exported[0].to_code("altair"))

We can copy and paste the output Altair code and tweak the plotting style before sharing this insight.

In [None]:
highlight = result[(result["Inequality"]>0.35)&(result["stringency_level"]=="High")]
highlight

In [None]:
import altair as alt

c = "#e7298a"
chart = alt.Chart(result,title="Check out this cool insight!").mark_circle().encode(
    x=alt.X('Inequality',scale=alt.Scale(domain=(0.04, 0.51)),type='quantitative', axis=alt.Axis(title='Inequality')),
    y=alt.Y('AvrgLifeExpectancy',scale=alt.Scale(domain=(48.9, 83.6)),type='quantitative', axis=alt.Axis(title='AvrgLifeExpectancy'))
)
highlight = result[(result["Inequality"]>0.35)&(result["stringency_level"]=="High")]

hchart = alt.Chart(highlight).mark_point(color=c,size=50,shape="cross").encode(
    x=alt.X('Inequality',scale=alt.Scale(domain=(0.04, 0.51)),type='quantitative', axis=alt.Axis(title='Inequality')),
    y=alt.Y('AvrgLifeExpectancy',scale=alt.Scale(domain=(48.9, 83.6)),type='quantitative', axis=alt.Axis(title='AvrgLifeExpectancy')),
)

text = alt.Chart(highlight).mark_text(color=c,dx=-35,dy=0,fontWeight=800).encode(
    x=alt.X('Inequality',scale=alt.Scale(domain=(0.04, 0.51)),type='quantitative', axis=alt.Axis(title='Inequality')),
    y=alt.Y('AvrgLifeExpectancy',scale=alt.Scale(domain=(48.9, 83.6)),type='quantitative', axis=alt.Axis(title='AvrgLifeExpectancy')),
    text=alt.Text('Country')
)

chart = chart.encode(color=alt.Color('stringency_level',type='nominal'))
chart = chart.properties(width=160,height=150)

(chart + hchart + text).configure_title(color=c)
