In [1]:
# libraries
import numpy as np
import pandas as pd
import altair as alt

# PSTAT 100 Project plan

This is a guide to preparing your project plan. It functions both as a guide to the work you'll need to do and as a guide to preparing the deliverable. You can use it as a template to draft the plan report; if so, **please remove the text explanations and instructions in each section so that it reads as a coherent and continuous document**.

While you may find it useful initially to follow the outline given, you do not need to adhere to it exactly -- you're free to organize your submission in the way that seems most natural to you. However, please do keep the high-level sections, so that your report includes the following headers:

0. Background
1. Data description
2. Initial explorations
3. Planned work

Your report does not need to be long. It should be about 2-4 pages, and might not be much longer than this template once you replace the guiding text with your own work.

## Group information

**Group members**: 

Kathy Wu, Nathan Lai, Tymee Wang, Yuchen Fang

**Contributions**:

Kathy Wu: Tidied and visualized the dataset, Planned work <br>
Nathan Lai:  Data Description, Planned work<br>
Tymee Wang: Background Information, Planned work<br>
Yuchen Fang: Initial exploration, Planned work

---
## 0. Background

This section should introduce your reader to the general topic you're engaging with in your project and explain any specialized knowledge that they may need to understand your dataset and why it's interesting. It doesn't need to be long, but should touch on the following points:
* Introduce the topic of your project.
* What area or areas of study are you in dialogue with for your project?
* What is your data about, broadly? 
* What is the motivation for collecting the kind of data you're working with, and what sorts of things could you potentially learn?

ESG stands for Environmental, Social, and Governance: three categories of non-financial factors that investors increasingly use in their analysis to evaluate material risks and growth opportunities. To align better with global goals, the World Bank Group reorganizes these categories into a data framework of 17 key sustainability themes that span the original environmental, social, and governance categories. The World Bank Group considers these themes crucial for financial sector representatives to weigh when assessing how investments or policies contribute to sustainable development.

Our project focuses on the reported ESG data from 2000 to 2020. We would like to see which region or continent is the most sustainable under this assessment and whether the three categories are correlated. We are also interested in events that fall outside the assessment framework but could still influence overall sustainability outcomes.

In our data, we keep the division into three parts: Environmental, Social, and Governance. The Environmental part covers themes focused on economic performance given a country’s natural resource endowment, management, and supplementation, and also accounts for factors such as food security that support stable long-term economic growth. The Social part indicates how well a country meets the basic needs of its population and reduces poverty, manages social and equity issues, and invests in human capital and productivity. The Governance part evaluates a country’s sustainability through its institutional capacity to support long-term development, including political, financial, and legal aspects.

The motivation for collecting this data is to study how large the gaps among countries are in development and sustainability across these three aspects, and what factors may contribute to those gaps. We could potentially learn what changes countries with lower sustainability can make to improve their current status, as well as the overall development trend of the world as a whole.

![intro](intro.png)

---
## 1. Data description

This section should introduce your dataset in detail. It should reflect your having gone through the collect/acquaint/tidy stages of the lifecycle. Below I've provided you with an outline. You do not need to adhere to this strictly -- in fact, it would be more natural to divide the items among a few short paragraphs -- but you should touch on each item in a format that suits your project.

### Basic information


**General description**: \
In order to shift financial flows so that they are better aligned with global goals, the World Bank Group (WBG) is working to provide financial markets with improved data and analytics that shed light on countries’ sustainability performance. This dataset provides information on sustainability themes spanning environmental, social, and governance categories. Along with new information and tools, the World Bank can develop research on the correlation between countries’ sustainability performance and the risk and return profiles of relevant investments.


**Source**: \
[Environment, Social and Governance Data, The World Bank](https://datacatalog.worldbank.org/search/dataset/0037651/Environment--Social-and-Governance-Data) is classified as Public under the Access to Information Classification Policy. This dataset is licensed under [Creative Commons Attribution 4.0](https://datacatalog.worldbank.org/public-licenses?fragment=cc).


**Collection methods**: \
Our data is census data. Most values under the Governance and Social topics are obtained from surveys, and most values under the Environment topic are collected with scientific equipment.

**Sampling design and scope of inference**: \
Sampling frame: all countries reporting environment, social and governance data.\
Sampling mechanism:  census\
Scope of inference: none


### Data semantics and structure

**Units and observations**: State the observational units.

**Variable descriptions**: Provide a table of variable descriptions. If your dataset is large and you'll only work with a subset of the total available variables, limit your attention to the variables that you'll work with. Here's a template you can work with:

Name | Variable Description | Topic | Type | Units of measurement
---|---|---|---|---
fore_area | *Forest area* | Environment | Numeric | % of land area
fore_dep | Adjusted savings: *net forest depletion* | Environment | Numeric | % of GNI
natu_res_dep | Adjusted savings: *natural resources depletion* | Environment | Numeric | % of GNI
pop_denst | *Population density* | Environment | Numeric | people per sq. km of land area
rate_labor | Ratio of female to male *labor force participation rate* | Governance | Numeric | % (modeled ILO estimate) 
gdp_grow | *GDP growth* | Governance | Numeric | annual %
unemp_rate | *Unemployment*, total | Social | Numeric |  % of total labor force (modeled ILO estimate)
life_exp | *Life expectancy* at birth, total | Social | Numeric | years
acce_electr | Access to *electricity* | Social | Numeric | % of population
mortal_rate | *Mortality* rate, under-5 | Social | Numeric | per 1,000 live births
acce_fuel_tech | Access to *clean fuels* and *technologies* for cooking | Social | Numeric | % of population
pop_65 | Population *ages 65 and above* | Social | Numeric | % of total population
ferti_rate | *Fertility rate*, total | Social | Numeric | births per woman

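The short names in the table can be attached to the data with a column rename. The following is a minimal sketch, assuming the long column labels match those shown in the example rows of this report. The `SHORT_NAMES` mapping is an illustrative name, and it covers only the eight variables that appear in the tidy data.

```python
import pandas as pd

# Long World Bank column labels mapped to the short names from the table.
# SHORT_NAMES is an illustrative name, not part of the original notebook.
SHORT_NAMES = {
    "Population density (people per sq. km of land area)": "pop_denst",
    "Forest area (% of land area)": "fore_area",
    "GDP growth (annual %)": "gdp_grow",
    "Ratio of female to male labor force participation rate (%) (modeled ILO estimate)": "rate_labor",
    "Access to electricity (% of population)": "acce_electr",
    "Life expectancy at birth, total (years)": "life_exp",
    "Unemployment, total (% of total labor force) (modeled ILO estimate)": "unemp_rate",
    "Access to clean fuels and technologies for cooking (% of population)": "acce_fuel_tech",
}

# One row taken from the example rows in this report (Congo, Dem. Rep., 2000).
values = [20.77847, 63.474118, -6.910927, 96.881158, 6.7, 50.041, 2.904, 1.0]
row = pd.DataFrame(
    [["Africa", "Congo, Dem. Rep.", 2000, *values]],
    columns=["Region", "Country Name", "Year", *SHORT_NAMES],
)

# Identifier columns are not in the mapping, so rename leaves them unchanged.
tidy = row.rename(columns=SHORT_NAMES)
print(tidy.columns.tolist())
```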

**Example rows**: Print a few example rows of your dataset in tidy format. Please don't include the codes you used to manipulate the raw data. Do that in a separate notebook and export the result to a .csv file -- `data.to_csv('tidy-data.csv')` -- to load directly into the cell below.

In [2]:
show = pd.read_csv('show.csv').drop(columns = 'Unnamed: 0')
show.head()

| | Region | Country Name | Year | Population density (people per sq. km of land area) | Forest area (% of land area) | GDP growth (annual %) | Ratio of female to male labor force participation rate (%) (modeled ILO estimate) | Access to electricity (% of population) | Life expectancy at birth, total (years) | Unemployment, total (% of total labor force) (modeled ILO estimate) | Access to clean fuels and technologies for cooking (% of population) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Africa | Congo, Dem. Rep. | 2000 | 20.77847 | 63.474118 | -6.910927 | 96.881158 | 6.7 | 50.041 | 2.904 | 1.0 |
| 1 | Africa | Congo, Dem. Rep. | 2001 | 21.361917 | 63.177257 | -2.100173 | 96.724567 | 7.314364 | 50.667 | 2.888 | 1.2 |
| 2 | Africa | Congo, Dem. Rep. | 2002 | 21.998487 | 62.880395 | 2.947765 | 96.617267 | 7.915845 | 51.385 | 2.871 | 1.4 |
| 3 | Africa | Congo, Dem. Rep. | 2003 | 22.683921 | 62.583534 | 5.577822 | 96.539211 | 8.51209 | 52.144 | 2.86 | 1.6 |
| 4 | Africa | Congo, Dem. Rep. | 2004 | 23.408777 | 62.286672 | 6.738374 | 96.488953 | 9.105449 | 52.917 | 2.853 | 1.9 |

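The `Unnamed: 0` column dropped in the cell above is what pandas produces when a frame is written with its default index and then read back. The sketch below illustrates this with an in-memory buffer instead of a file; the two-row frame reuses values from the example rows, and the variable names are illustrative. Writing with `index=False` is one way to avoid the extra column at export time.

```python
import io
import pandas as pd

# Two rows reused from the example rows; the frame itself is illustrative.
data = pd.DataFrame({
    "Country Name": ["Congo, Dem. Rep.", "Congo, Dem. Rep."],
    "Year": [2000, 2001],
    "Forest area (% of land area)": [63.474118, 63.177257],
})

# Default export writes the row index as an unnamed first column.
with_index = io.StringIO()
data.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# Export without the index, so nothing needs to be dropped after loading.
without_index = io.StringIO()
data.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(reloaded_with_index.columns.tolist())
print(reloaded_clean.columns.tolist())
```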

---
## 2. Initial explorations

At this stage, you may spend most of your effort on the computing side tidying up the data. You're not expected to complete a thorough exploratory analysis, and if your dataset was especially messy to start with, you may not even begin your exploratory analysis by the time you prepare this report. You have the option to leave exploration for the next stage of work and simply report basic properties of the dataset, but you should at minimum address the items in the 'basic properties' section below.

### Basic properties of the dataset

Help the reader get acquainted with your dataset on a simple level by identifying characteristics of the dataset and variable summaries. Some amount of code is fine here, but try to use code cells sparingly.

**Dimensions**:\
After cleaning, the dataset has 378 rows (18 countries × 21 years) and 8 variable columns, in addition to the `Region`, `Country Name`, and `Year` identifier columns.
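
The row count can be checked against the panel structure. This is a minimal sketch, assuming one row per country and year; the country labels are hypothetical placeholders, and only the counts (18 countries, years 2000 to 2020) come from this report.

```python
import pandas as pd

years = range(2000, 2021)                            # 21 years, 2000 to 2020
countries = [f"country_{i:02d}" for i in range(18)]  # placeholder labels

# Build the full country-by-year grid that a balanced panel should contain.
keys = pd.MultiIndex.from_product(
    [countries, years], names=["Country Name", "Year"]
).to_frame(index=False)

n_rows = len(keys)

# A balanced panel has no repeated (country, year) pair
# and the same number of years for every country.
has_duplicates = bool(keys.duplicated(["Country Name", "Year"]).any())
years_per_country = keys.groupby("Country Name")["Year"].nunique()
is_balanced = (not has_duplicates) and bool((years_per_country == len(years)).all())

print(n_rows, is_balanced)
```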

**Missing values**:\
Although the original data is collected by census, some values may be missing by chance. We therefore selected only variables with no missing values and, within each category, ranked them by importance.
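
One way to apply this kind of filter is to count missing values per column and keep only the complete columns. The sketch below is an illustration, not the code used for the project: the forest area values come from the example rows, while the depletion column's values and gaps are invented placeholders.

```python
import numpy as np
import pandas as pd

# Toy frame: forest area values are from the example rows;
# the depletion values and their gaps are invented for illustration.
raw = pd.DataFrame({
    "Year": [2000, 2001, 2002],
    "Forest area (% of land area)": [63.474118, 63.177257, 62.880395],
    "Adjusted savings: net forest depletion (% of GNI)": [np.nan, 0.4, np.nan],
})

# Number of missing entries in each column.
missing_per_column = raw.isna().sum()

# Keep only the columns in which every entry is present.
complete_columns = raw.columns[raw.notna().all()].tolist()
complete = raw[complete_columns]

print(missing_per_column)
print(complete_columns)
```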

**Variable summaries**:\
The dataset consists of 8 numeric variables divided into 3 categories: Environment, Governance, and Social. We selected 18 countries from 6 continents and the 8 most representative variables, covering 2000 to 2020; no missing values remain after cleaning. The environment variables are `Population density` and `Forest area`. The governance variables are `GDP growth (annual %)` and `Ratio of female to male labor force participation rate (%)`. The social variables are `Access to clean fuels and technologies for cooking`, `Life expectancy at birth, total (years)`, `Unemployment, total (% of total labor force)`, and `Access to electricity (% of population)`. Thus, each country has 168 values (8 variables × 21 years).
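
The counts in this summary can be reproduced from a variable-to-category mapping. The sketch below assumes the grouping stated in this section; `CATEGORY` is an illustrative name.

```python
from collections import Counter

# Category of each selected variable, as listed in this section.
CATEGORY = {
    "Population density": "Environment",
    "Forest area": "Environment",
    "GDP growth (annual %)": "Governance",
    "Ratio of female to male labor force participation rate (%)": "Governance",
    "Access to clean fuels and technologies for cooking": "Social",
    "Life expectancy at birth, total (years)": "Social",
    "Unemployment, total (% of total labor force)": "Social",
    "Access to electricity (% of population)": "Social",
}

per_category = Counter(CATEGORY.values())

n_years = len(range(2000, 2021))              # 2000 to 2020 inclusive
values_per_country = n_years * len(CATEGORY)  # years times variables
total_values = values_per_country * 18        # across the 18 countries

print(dict(per_category), values_per_country, total_values)
```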

### Exploratory analysis

If you were lucky and your dataset was neat, you should aim to include a few exploratory plots or tables here -- they don't need to be polished at this stage, but you should select plots that are informative (rather than including all plots you may have looked at). 

If you do include exploratory graphics or tables, please explain in a sentence or two what each one shows. Try to include a minimum of code. Consider [saving your plots as images](https://altair-viz.github.io/user_guide/saving_charts.html#png-svg-and-pdf-format) and inputting images into markdown cells instead of generating them anew via code cells.

The following plots show the distribution of access to electricity by region in 2020. Every region except Africa has relatively high access to electricity.

![electricity](electricity.png)

The following plots show the distribution of annual GDP growth in 2019 and 2020. We chose these two years because the COVID-19 pandemic hit in 2020; in that year, GDP growth shifts slightly toward negative values in almost all regions.

![gdp_growth](gdp_growth.png)
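
A distribution comparison like this one can be sketched with a Gaussian kernel density estimate. The growth rates below are invented placeholders rather than values from the dataset, and the bandwidth of 1.5 is an arbitrary assumption; the helper `gaussian_kde_curve` is a hypothetical name, not a library function.

```python
import numpy as np

def gaussian_kde_curve(values, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate of `values` on `grid`."""
    values = np.asarray(values, dtype=float)
    # Standardized distance from every grid point to every observation.
    z = (grid[:, None] - values[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    # Average the kernels and rescale so the curve integrates to one.
    return kernels.sum(axis=1) / (len(values) * bandwidth)

# Placeholder annual GDP growth rates (%), invented for illustration only.
growth_2019 = [1.2, 2.3, 3.1, 4.4, 5.0, 0.3]
growth_2020 = [-3.5, -2.0, -6.1, 0.8, -4.2, -1.1]

grid = np.linspace(-30, 30, 1201)
step = grid[1] - grid[0]
density_2019 = gaussian_kde_curve(growth_2019, grid, bandwidth=1.5)
density_2020 = gaussian_kde_curve(growth_2020, grid, bandwidth=1.5)

# Location of the peak of each curve, used to describe the shift.
mode_2019 = grid[density_2019.argmax()]
mode_2020 = grid[density_2020.argmax()]
print(mode_2019, mode_2020)
```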

---
## 3. Planned work

Here you should indicate your tentative ideas for your analysis. Don't worry, these aren't final -- you can always change your mind later or shift gears if they don't pan out. The objective is to have you start thinking ahead about what you'll do.

### Questions

Please propose two focused questions that you plan to explore.

1. *Which region is the most sustainable, and how did COVID-19 affect GDP growth?*
2. *Do the environmental and social factors influence governance?*

### Proposed approaches

For each question, please describe an idea or two about how you might approach the question.

1. *We plan to draw density plots of GDP growth before and during the pandemic and compare how the curves differ.*

2. *We plan to use a heatmap to examine the correlations among the environmental, social, and governance variables, and possibly PCA.*

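The second approach can be sketched as a correlation matrix followed by a principal component analysis. This is a minimal illustration under stated assumptions: it uses only the five example rows from section 1 (one country, 2000 to 2004) and four of the eight variables, so the numbers say nothing about the full dataset. PCA is computed here with a singular value decomposition of the standardized columns, which is one common way to do it.

```python
import numpy as np
import pandas as pd

# Four variables from the five example rows (Congo, Dem. Rep., 2000 to 2004).
df = pd.DataFrame({
    "fore_area": [63.474118, 63.177257, 62.880395, 62.583534, 62.286672],
    "gdp_grow": [-6.910927, -2.100173, 2.947765, 5.577822, 6.738374],
    "acce_electr": [6.7, 7.314364, 7.915845, 8.51209, 9.105449],
    "life_exp": [50.041, 50.667, 51.385, 52.144, 52.917],
})

# Pairwise Pearson correlations; this matrix is what a heatmap would display.
corr = df.corr()

# Standardize each column so that PCA is not dominated by measurement units.
z = (df - df.mean()) / df.std(ddof=0)

# Rows of vt are the principal directions; squared singular values
# are proportional to the variance captured by each component.
u, s, vt = np.linalg.svd(z.to_numpy(), full_matrices=False)
explained = s**2 / (s**2).sum()
loadings = pd.DataFrame(vt, columns=df.columns)

print(corr.round(3))
print(explained.round(3))
```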

---
## Submission Checklist
1. Save file to confirm all changes are on disk
2. Run *Kernel > Restart & Run All* to execute all code from top to bottom
3. Save file again to write any new output to disk
4. Generate PDF and submit to Gradescope