# Mini-Project Report - Analyzing and Recognizing Leading Countries in CO2 Emission Reduction

#### Project Group: Helmi Karesti, Janne Penttala, Salla-Mari Uitto
#### GitHub Repository: [https://github.com/karhelmi/intro-ds-miniproject](https://github.com/karhelmi/intro-ds-miniproject)

## Deciding on a Project Idea

Our project team was formed on Slack before the first exercise session, and we started to narrow down ideas during that session. We toyed with topics such as stock markets and pets in cities, among many others. In the end, all of us were interested in sustainability, so climate change was chosen as an inspiration. From there, we narrowed the topic down by first deciding to analyze CO2 emissions and then brainstorming how to make it truly interesting. This led to the idea of finding out which countries had reduced their emissions the most. Finally, we decided on our topic: Analyzing and Recognizing Leading Countries in CO2 Emission Reduction.

## Motivation and Added Value

By analyzing countries that have effectively reduced their CO2 emissions, our approach highlights positive achievements and encourages the sharing of best practices in the fight against climate change. The thought behind all of this is that the world needs more positive reinforcement and hope.

The target groups of our project are political decision makers, environmental organizations and researchers. These are the groups that aim to find ways to combat climate change: our project aids them in their pursuit by giving examples of countries that have succeeded in reducing CO2 emissions. Our project also introduces measures that have led to the reduction of CO2 emissions. This will aid our target group in decision making.

## Data Collection

Data about CO2 emissions is widely available on the Internet. We compared the figures on several sites to make sure that they were consistent with each other. In the end, we decided to use [Global Carbon Atlas](https://globalcarbonatlas.org/) for fetching CO2 emission data for every country. The fetched data was stored in a .csv file. This data is used to find the countries that have reduced their CO2 emissions the most.

Another aspect of our project is to find out what has affected the CO2 emission reduction in the selected countries. For this, we chose to use [Our World in Data](https://ourworldindata.org/). This choice was fairly easy: we need many different factors for our analysis, and this single source provides all of them. The factors to analyze were chosen through thorough evaluation as a group. For example, renewable energy was chosen as one factor because it is generally known to reduce CO2 emissions. This data is also fetched as a .csv file.

(list of our chosen variables)

As the project progressed, we encountered some problems with the data we had initially planned to use. At first, we intended to analyze the change between the years 2011 and 2021. However, once our code was written, we quickly realized that not all countries have data on our chosen factors up to 2021. As a team, we discussed different ways of handling this: one simple option was to change the time frame, for example to 2008-2018, and another was to analyze only those factors and countries for which the needed data is available. We also considered filling in the missing values with linear regression or another prediction method, but due to our tight schedule this was dismissed rather quickly: we wanted to keep our focus on analyzing the factors and not get sidetracked.
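The option of keeping only countries with complete data could be sketched roughly as follows. This is a minimal illustration and not necessarily the code in the repository: the function name `countries_with_complete_data`, the long-format layout and the column names `country`, `year` and `value` are assumptions made for the example.

```python
import pandas as pd


def countries_with_complete_data(df, start_year, end_year, value_col="value"):
    """Return countries that have a non-missing value for every year in the range.

    Assumes a long-format frame with 'country' and 'year' columns (hypothetical names).
    """
    n_years = end_year - start_year + 1
    # Keep only rows inside the time frame that actually carry a value.
    window = df[df["year"].between(start_year, end_year)].dropna(subset=[value_col])
    # Count how many distinct years each country covers.
    covered = window.groupby("country")["year"].nunique()
    return sorted(covered[covered == n_years].index)


# Made-up example values: Sweden is missing its 2021 value.
factor = pd.DataFrame({
    "country": ["Finland"] * 3 + ["Sweden"] * 3,
    "year": [2019, 2020, 2021] * 2,
    "value": [40.0, 42.0, 43.5, 55.0, 56.0, None],
})

complete = countries_with_complete_data(factor, 2019, 2021)
```

With a check like this, shortening or shifting the time frame and dropping countries become two settings of the same filter, which makes it easy to compare how many countries survive each choice.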

## Preprocessing

Our project has two main sub-tasks: the first is to find the countries that have reduced their CO2 emissions, and the second is to analyze the factors that have helped those countries reduce them. Even before looking at the data, we knew that we would have to merge data from several sources or files to be able to compare and analyze the CO2 emission reduction and the factors for each country.

Before starting to write the code, we looked at the csv data in Excel: this way we got a rough idea about what we were up against. For example, we noticed that in addition to each country, our CO2 emission data file contained CO2 emissions for each continent and other larger areas.

### Code

From the beginning it was clear that we would work with Python. Another option would have been R, but overall Python was the more attractive choice. Our team did not have much experience with it, but we all wanted to learn more. With Python, it was also obvious that we would use pandas and NumPy to help us process the data.

We started by implementing code that gives us a list of countries and their CO2 emission reduction percentages. We read the csv file into a pandas DataFrame and continued to process it in that format. We had to rename and re-arrange columns and rows, change data types and finally calculate the reduction in CO2 emissions. We calculated the reduction in three different units: MtCO2, kgCO2/GDP and tCO2/person.
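The reduction calculation can be illustrated with a small sketch. It assumes a frame indexed by year with one column per country, which may differ from the layout used in the repository; the function name `emission_reduction_pct`, the `exclude` parameter and all numbers are hypothetical.

```python
import pandas as pd


def emission_reduction_pct(emissions, start_year, end_year, exclude=()):
    """Percentage reduction per country between two years, largest first.

    emissions: DataFrame indexed by year, one column per country (assumed layout).
    exclude: columns to skip, e.g. continents or other aggregate areas.
    """
    countries = emissions.drop(columns=list(exclude), errors="ignore")
    start = countries.loc[start_year]
    end = countries.loc[end_year]
    # Positive values mean that emissions went down.
    reduction = (start - end) / start * 100
    return reduction.sort_values(ascending=False)


# Made-up emissions in MtCO2; 'Europe' stands in for an aggregate row.
emissions = pd.DataFrame(
    {"Finland": [50.0, 40.0], "Estonia": [20.0, 12.0], "Europe": [1000.0, 900.0]},
    index=[2011, 2021],
)

ranking = emission_reduction_pct(emissions, 2011, 2021, exclude=["Europe"])
```

The same function applies unchanged to per-GDP or per-person emissions, since only the input values differ.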

Next, we had to decide how to analyze the factors that have led a country to reduce its CO2 emissions. This turned out to be the trickiest part of the whole project, and we spent a lot of time discussing different possibilities. Finally, we settled on creating a pandas DataFrame for each country, to which we add the CO2 emissions and all the factors as separate columns.

At first, we implemented code that did all of this for one country, Finland. Once that case worked, we refactored the code into functions for separate tasks to make it easier to reuse. First, the data frame for a country is created with its CO2 emissions. Then, the frame is filled with all of our chosen variables. Again, we had to wrangle the data in different ways: we set indexes, changed data types and copied data from other frames into our country-specific frame. After these steps, we have a DataFrame for a country containing the CO2 emissions and all of the variables we analyze for each year. The code is available in the GitHub repository (linked at the beginning of this report).
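As a rough sketch of the country-specific frame described above, the following example builds one from an emissions table and a dictionary of factor tables. The function name `build_country_frame`, the column name `co2`, the wide year-by-country layout and the values are assumptions for illustration; the repository code may be organized differently.

```python
import pandas as pd


def build_country_frame(country, emissions, factors):
    """Combine CO2 emissions and factor values for one country.

    emissions: DataFrame indexed by year, one column per country (assumed layout).
    factors: dict mapping a factor name to a DataFrame with the same layout.
    """
    frame = pd.DataFrame({"co2": emissions[country]})
    for name, table in factors.items():
        # Assignment aligns on the year index; years without data become NaN.
        frame[name] = table[country].astype(float)
    return frame


# Made-up values; the renewables table lacks the year 2021.
emissions = pd.DataFrame({"Finland": [50.0, 45.0, 40.0]}, index=[2019, 2020, 2021])
renewables = pd.DataFrame({"Finland": [40, 42]}, index=[2019, 2020])

finland = build_country_frame("Finland", emissions, {"renewables": renewables})
```

Because pandas aligns on the index, missing years show up as `NaN` rather than shifting values into the wrong year.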

## Learning Task and Approach

To analyze factors that have possibly aided countries in reducing their CO2 emissions, we decided to use linear regression. Linear regression is a fairly simple model that can be used to analyze relationships between variables. In our project, we use (simple) linear regression to analyze the relationship between CO2 emissions and our chosen factors. 

We also calculate the R-squared value for each of our simple linear regressions. This statistical measure tells us how well the linear regression model fits the data: if the value is 1, the regression model fits the data perfectly. In our project, this value is used to help analyze the impact of different factors on the CO2 emission reduction.
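To make the R-squared measure concrete, here is a minimal sketch of a simple linear regression with NumPy. It is an illustration under assumed names and made-up numbers, not the implementation from the repository. R-squared is computed as one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np


def simple_linear_regression(x, y):
    """Fit y = slope * x + intercept by least squares and return R-squared."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)
    predicted = slope * x + intercept
    ss_res = np.sum((y - predicted) ** 2)   # variation left unexplained
    ss_tot = np.sum((y - y.mean()) ** 2)    # total variation in y
    r_squared = 1.0 - ss_res / ss_tot
    return slope, intercept, r_squared


# Made-up example: share of renewables (%) versus CO2 emissions (MtCO2).
renewables_share = [30.0, 32.0, 35.0, 38.0, 40.0]
co2 = [50.0, 48.5, 45.0, 43.0, 40.5]

slope, intercept, r_squared = simple_linear_regression(renewables_share, co2)
```

A negative slope together with an R-squared close to 1 would suggest that emissions fall fairly consistently as the factor grows, at least within the observed years.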

We have also considered adding other statistical values, such as p-values, to aid us in our analysis. A p-value indicates whether an observed relationship is statistically significant, but it does not by itself tell us whether the correlation reflects causality. To draw correct conclusions from our analysis, we also need domain knowledge - in this case, knowledge about climate change.
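One way to obtain a p-value alongside the regression is `scipy.stats.linregress`. Using SciPy here is an assumption for this sketch, as are the variable names and the made-up numbers. The p-value tests the null hypothesis that the slope is zero.

```python
from scipy import stats

# Made-up example: share of renewables (%) versus CO2 emissions (MtCO2).
renewables_share = [30.0, 32.0, 35.0, 38.0, 40.0]
co2 = [50.0, 48.5, 45.0, 43.0, 40.5]

result = stats.linregress(renewables_share, co2)

slope = result.slope
r_squared = result.rvalue ** 2   # rvalue is the correlation coefficient
p_value = result.pvalue          # two-sided test of "slope equals zero"

# 0.05 is a common, but arbitrary, significance threshold.
is_significant = p_value < 0.05
```

A small p-value only says that the linear relationship is unlikely to be due to chance in this sample. Whether the factor actually drives the emission reduction still has to be judged with domain knowledge.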

Linear regression, R-square values, other statistical approaches? correlation and causality?

## Visualizations

Scatter plots for linear regression, how to show countries with big reductions?

## Results, Privacy and Ethical Considerations

P-values:

|                         |     Estonia |   Bosnia and Herzegovina |      Greece |      Serbia |     Finland |     Denmark |       Malta |      Sweden |   Montenegro |   Hong Kong |   Luxembourg |   Slovenia |    Portugal |
|:------------------------|------------:|-------------------------:|------------:|------------:|------------:|------------:|------------:|------------:|-------------:|------------:|-------------:|-----------:|------------:|
| Meat prod               | 0.732723    |               0.689505   | 0.950636    | 0.131118    | 0.00648671  | 0.22989     | 0.000679663 | 0.00390883  |    0.458934  | 0.356882    |  0.00249674  | 0.482211   | 0.850477    |
| Life expectancy         | 0.169902    |               0.00215609 | 0.652055    | 0.161483    | 1.7327e-05  | 2.70463e-06 | 0.000284956 | 0.00389515  |    0.611153  | 0.0491296   |  0.0187424   | 0.253419   | 0.781656    |
| GDP per capita          | 0.00458323  |               0.0777156  | 0.560485    | 0.117811    | 0.0939486   | 0.000523393 | 0.000759092 | 0.00543865  |    0.945555  | 0.577588    |  0.29602     | 0.0602615  | 0.689786    |
| % of internet users     | 0.0384417   |               0.189652   | 3.47924e-07 | 0.166787    | 0.303812    | 0.00187577  | 0.00223703  | 0.482862    |    0.386841  | 0.0635907   |  1.88516e-05 | 0.00462394 | 0.128388    |
| Human dev index         | 0.0253571   |               0.191109   | 0.000130841 | 0.604033    | 0.00023612  | 0.000115873 | 0.0015699   | 0.00203432  |    0.977282  | 0.0739752   |  0.0409707   | 0.0229061  | 0.190503    |
| Human rights idx        | 0.99947     |               0.299337   | 0.00215727  | 0.279433    | 0.00195435  | 0.00173897  | 0.0200782   | 0.0013383   |    0.0234629 | 0.000265885 |  0.0105875   | 0.00887727 | 9.23288e-05 |
| Population              | 0.0237319   |               0.0768652  | 3.2381e-06  | 0.099029    | 4.93878e-05 | 5.57757e-06 | 0.00739275  | 8.64837e-06 |    0.982113  | 0.110923    |  0.000423519 | 0.0119124  | 0.408837    |
| Nuclear energy          | 0           |               0          | 0           | 0           | 0.633844    | 0           | 0           | 0.0773939   |    0         | 0           |  0           | 0.583504   | 0           |
| Energy usage per capita | 6.25004e-06 |               0.0408664  | 0.00368684  | 0.881358    | 4.43607e-08 | 4.73208e-06 | 0.024856    | 0.00226069  |    0.636902  | 0.000324077 |  1.22004e-08 | 0.00439459 | 0.0448127   |
| Share of renewables     | 9.85156e-07 |               0.0703962  | 3.0113e-07  | 0.000353312 | 3.66139e-06 | 2.02278e-08 | 1.37836e-05 | 0.0072623   |    0.274423  | 0.0275854  

## Report on Usage of Generative AI Tools in the Project