# Statistics and Data Science: Collecting, organizing, and cleaning data

## Exercise 1: Organizing data in dictionaries

Here are the greenhouse gas (GHG) emissions of Switzerland, France, Germany, Italy, and Austria in 2019, in thousand tonnes $CO_{2e}$: `(43981.6, 422251.6, 784842.1, 377672.5, 77111.4)`

- Create a dictionary associating each country with its GHG emissions.
- Extract the keys of your dictionary into a list.
- Add the emissions of Spain to your dictionary: 276723.2
- Using a `for` loop, convert the GHG values to tonnes $CO_{2e}$.

In [1]:
ghg_emission_per_country = {
  'Switzerland': 43981.6,
  'France': 422251.6,
  'Germany': 784842.1,
  'Italy': 377672.5,
  'Austria': 77111.4
}

# Extract the keys into a list
print(list(ghg_emission_per_country.keys()))

# Add Spain; equivalent to ghg_emission_per_country.update({'Spain': 276723.2})
ghg_emission_per_country['Spain'] = 276723.2

# Convert thousand tonnes to tonnes
# (reassigning existing keys while iterating does not change the dictionary size)
for key, value in ghg_emission_per_country.items():
  ghg_emission_per_country[key] = value * 1000


['Switzerland', 'France', 'Germany', 'Italy', 'Austria']
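
As a possible alternative for this exercise, the dictionary can be built with `zip()` and converted with a dictionary comprehension, which leaves the original values untouched. This is a minimal sketch rather than the reference solution; the names `countries`, `emissions_kt`, `ghg_kt`, and `ghg_t` are illustrative assumptions.

```python
# Illustrative names only; the values come from the exercise statement.
countries = ['Switzerland', 'France', 'Germany', 'Italy', 'Austria']
emissions_kt = (43981.6, 422251.6, 784842.1, 377672.5, 77111.4)

# Pair each country with its emissions (thousand tonnes CO2e)
ghg_kt = dict(zip(countries, emissions_kt))

# Add Spain
ghg_kt['Spain'] = 276723.2

# Build a new dictionary in tonnes CO2e without modifying ghg_kt
ghg_t = {country: value * 1000 for country, value in ghg_kt.items()}

# Iterating over a dictionary yields its keys, so list() returns the key list
print(list(ghg_t))
```

Keeping both dictionaries makes it harder to apply the unit conversion twice by accident, at the cost of a little extra memory.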


## Exercise 2: Data normalization

A very common operation is to transform your data by normalization. Imagine you have a list of data points $x=$`[2,7,5,4,9,3]` and you want to perform a 0-1 normalization, i.e., rescale the data to the range from 0 to 1 with the following operation:

$\hat{x}_{i} = \frac{x_{i}-\min(x)}{\max(x)-\min(x)}$

0-1 normalization is common, and often necessary, when you work with several variables that have very different scales.
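
As a quick check of the formula on the list above, where the minimum is 2 and the maximum is 9, the data point $x_{i}=5$ maps to:

$\hat{x}_{i} = \frac{5-2}{9-2} = \frac{3}{7} \approx 0.43$

The smallest value, 2, maps to 0 and the largest value, 9, maps to 1.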

- Create a new list containing the 0-1 normalized values of $x$.

*Hint: You can use the `min()` and `max()` functions to obtain the minimum and maximum of a list.*

In [2]:
# Define a function that returns the 0-1 normalized values of a list:
# find the min and max, then loop through the list and
# append each normalized value to a new list

def create_norm(x_list: list) -> list:
  x_min = min(x_list)
  x_max = max(x_list)
  x_norm = []
  for value in x_list:
    x_hat = (value - x_min) / (x_max - x_min)
    x_norm.append(x_hat)
  return x_norm

# Replicate x from the exercise
x_list = [2,7,5,4,9,3]

# Run function
x_norm_list = create_norm(x_list)
print(x_norm_list)


[0.0, 0.7142857142857143, 0.42857142857142855, 0.2857142857142857, 1.0, 0.14285714285714285]
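
The normalization formula divides by the difference between the maximum and the minimum, so it fails when all values are identical. The sketch below is one possible way to handle that case. The helper name `normalize_01` is hypothetical, and returning zeros for a constant list is an assumption, not a universal convention.

```python
def normalize_01(values):
    """Return a new list with the values rescaled to the range [0, 1].

    Hypothetical helper for illustration; the input list is not modified.
    """
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # All values are identical, so the formula would divide by zero.
        # Returning zeros here is an assumption made for this sketch.
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


x = [2, 7, 5, 4, 9, 3]
x_hat = normalize_01(x)
print(x_hat)
```

Because the function builds a new list, the order of `x` is preserved and each normalized value stays aligned with its original data point.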


## Exercise 3: String methods and Green Domestic Product

Here is the executive summary of a recent E4S publication on the Green Domestic Product (GrDP). Learn more on the [E4S website](https://e4s.center/en/resources/reports/green-domestic-product/).
```python
"""
What is new?
We propose a novel indicator, the Green Domestic Product (GrDP) to remedy some of the shortcomings of GDP. The GrGDP extends the scope of the GDP to integrate the depletion of natural, social, and human capital. Concretely, GrDP is defined as GDP minus the external costs associated with the production of goods and services, including the impacts of the emissions of greenhouse gases (GHG), air pollutants, and heavy metals.

Why does it matter?
Our decisions are influenced by what we know and by what we measure. Flawed measurements can lead to distorted decisions. By considering the economic, environmental, and social dimensions, GrGDP allows us to make more informed and sustainable policy decisions, and to move beyond the dichotomy between promoting economic growth and protecting the environment.

What do we learn?
In Switzerland, the gap between GrGDP and GDP is narrowing, the economy is growing while air pollution is decreasing. Still, external costs remain significant, about CHF 25.3 billion or 3.5% of GDP in 2019. Air pollutants and GHG both have important environmental and social impacts. However, while economic growth and air pollutant emissions are successfully decoupling, decarbonisation remains too slow. There are opportunities for the future: many decarbonisation levers have significant co-benefits by also reducing air pollutant emissions and thus enhancing GrDP growth.
"""
```

- Count the number of times `'cost'` appears in the summary.
- Create a new string that contains only lowercase characters.
- Find the first occurrence of `'GrDP'`, then the last one.
- What a catastrophe, there are errors in the text! It seems that `'GrGDP'` sometimes appears instead of `'GrDP'`. Can you correct these mistakes?
- Store the year `2019`, the country `'Switzerland'`, and the value of the external costs, `25.3` billion, in variables. Using f-strings, print: `'In Switzerland, the external costs were about CHF 25.3 billion in 2019'`

In [3]:
grdp_summary = """
What is new?
We propose a novel indicator, the Green Domestic Product (GrDP) to remedy some of the shortcomings of GDP. The GrGDP extends the scope of the GDP to integrate the depletion of natural, social, and human capital. Concretely, GrDP is defined as GDP minus the external costs associated with the production of goods and services, including the impacts of the emissions of greenhouse gases (GHG), air pollutants, and heavy metals.

Why does it matter?
Our decisions are influenced by what we know and by what we measure. Flawed measurements can lead to distorted decisions. By considering the economic, environmental, and social dimensions, GrGDP allows us to make more informed and sustainable policy decisions, and to move beyond the dichotomy between promoting economic growth and protecting the environment.

What do we learn?
In Switzerland, the gap between GrGDP and GDP is narrowing, the economy is growing while air pollution is decreasing. Still, external costs remain significant, about CHF 25.3 billion or 3.5% of GDP in 2019. Air pollutants and GHG both have important environmental and social impacts. However, while economic growth and air pollutant emissions are successfully decoupling, decarbonisation remains too slow. There are opportunities for the future: many decarbonisation levers have significant co-benefits by also reducing air pollutant emissions and thus enhancing GrDP growth.
"""

# Count how often 'cost' appears (case insensitive)
wordcount_cost = grdp_summary.lower().count('cost')
print(f'Occurrences of cost (case insensitive): {wordcount_cost}')

# New string that contains only lowercase characters
grdp_summary_lowercase = grdp_summary.lower()

# Count the misspelling 'GrGDP'
print(f'Occurrences of GrGDP (case insensitive): {grdp_summary_lowercase.count("grgdp")}')

# Index of the first and last occurrence of 'GrDP' (-1 if not found)
first_occurrence_grdp = grdp_summary.find('GrDP')
last_occurrence_grdp = grdp_summary.rfind('GrDP')

# Correct the misspelling in the original text
grdp_summary_fixed = grdp_summary.replace('GrGDP', 'GrDP')

year = 2019
country = 'Switzerland'
cost = 25.3

print(f'In {country}, the external costs were about CHF {cost} billion in {year}')


Occurrences of cost (case insensitive): 2
Occurrences of GrGDP (case insensitive): 3
In Switzerland, the external costs were about CHF 25.3 billion in 2019
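
To see how the string methods in this exercise behave, the sketch below runs them on a short made-up sentence instead of the full summary. The string `sample` is an illustrative assumption and is not taken from the publication.

```python
# Made-up sentence for illustration; it contains one misspelling ('GrGDP')
sample = "The GrGDP extends GDP. GrDP is defined as GDP minus external costs. GrDP grows."

# find() returns the index of the first match, rfind() the index of the last
first = sample.find('GrDP')
last = sample.rfind('GrDP')

# Slicing from the returned index recovers the matched text
print(sample[first:first + 4])
print(sample[last:last + 4])

# Both methods return -1 when the substring is absent
print(sample.find('XYZ'))

# replace() returns a new string; the original is unchanged
fixed = sample.replace('GrGDP', 'GrDP')
print(fixed.count('GrDP'))
```

Note that `'GrDP'` is not a substring of `'GrGDP'`, so `find('GrDP')` skips the misspelled word and lands on the first correct spelling.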
