## Programming for Data Analysis - Project 2 ##

**Name: James McEneaney** <br/><br/>
**Course: Higher Diploma in Computing in Data Analytics, ATU Ireland** <br/><br/> 
**Semester: Semester 2 2023** <br/><br/>


***

### Introduction ###

### Data-cleansing ###

To begin, I will import the libraries I will use in this project. I will use pandas to create the dataframes with which I will analyse the historical climate data:

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import scipy.stats as ss 

I have downloaded several climate-related datasets from online sources. These datasets are as follows:

- Atmospheric carbon dioxide concentrations from a report by the Intergovernmental Panel on Climate Change (IPCC)

- Atmospheric carbon dioxide concentrations from a 2008 paper in Nature by Luthi et al.

- Atmospheric carbon dioxide data from the Mauna Loa Observatory in Hawaii, providing annual data from 1959 to 2022.

- Temperature estimates going back over 820,000 years, based on the EPICA Dome C ice core deuterium data. EPICA stands for European Project for Ice Coring in Antarctica. The dataset estimates past temperature using deuterium as a proxy. Deuterium is a heavy form of hydrogen (its nucleus contains a neutron in addition to the single proton found in "light" hydrogen). Ice laid down in colder periods tends to contain less deuterium than ice laid down in warmer periods. The reason is that water molecules containing deuterium condense and precipitate more readily than lighter ones, so in a colder climate more of the heavy water rains out of an air mass on its way to the pole, and the snow that finally falls on the ice sheet is depleted in deuterium. I will refer to this dataset as the "Jouzel" dataset for convenience (Jean Jouzel is a French glaciologist and climatologist who is one of the creators of the dataset).

With the exception of the Mauna Loa dataset, these datasets were not initially in csv format. To make it easier to create dataframes using pandas, I saved all my source files as csv files in my working directory. Before analysing the data, I will edit the dataframes using Python, including renaming column headings. I will also add an extra column to each dataset to standardise the time measurement, and I will list the oldest years first, so that my plots run from the oldest years on the left to the most recent years on the right.

Firstly, I will load up and amend the IPCC CO2 dataset:

In [None]:
df_co2_ipcc = pd.read_csv("CO2_ipcc_csv.csv")

# rename column names of existing dataframe (setting 'inplace' parameter equal to 'True')
df_co2_ipcc.rename(columns={'Gasage (yr BP) ': 'Year before 1950', 'CO2 (ppmv)': 'CO2', 'sigma mean CO2 (ppmv)': 'sigma mean CO2'}, inplace=True) 

# here I am adding a new column called 'Years' to standardise the time measurements across different datasets.
# I am adding 73 years to every value in the 'years before 1950' column to effectively convert the column to 'years before 2023'. 
# I am then multiplying each year by minus 1, so that the column represents years "in the past"
df_co2_ipcc['Years'] = (df_co2_ipcc['Year before 1950'] + 73) * -1

# reversing the order of the dataset, so that the older years appear on the left-hand side of the x-axis
df_co2_ipcc = df_co2_ipcc[::-1]

print(df_co2_ipcc)
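
One thing to watch for in the cell above is that the original header `'Gasage (yr BP) '` ends with a space. By default, `rename` silently skips any key that does not match a column exactly, so a mistyped header would leave the column unrenamed and the later `'Year before 1950'` lookup would fail. The sketch below shows one possible safeguard on a made-up two-row dataframe (the values are illustrative, not taken from the IPCC file): strip whitespace from the headers first, then rename with `errors="raise"`.

```python
import pandas as pd

# Made-up stand-in for the IPCC file; note the trailing space in the first header.
toy = pd.DataFrame({"Gasage (yr BP) ": [0, 1000], "CO2 (ppmv)": [280.0, 275.0]})

# Without the trailing space the key does not match, and by default
# rename() leaves the column untouched without any warning.
unchanged = toy.rename(columns={"Gasage (yr BP)": "Year before 1950"})

# Stripping whitespace from the headers first makes the match robust,
# and errors="raise" turns any remaining mismatch into a KeyError.
cleaned = toy.rename(columns=str.strip).rename(
    columns={"Gasage (yr BP)": "Year before 1950", "CO2 (ppmv)": "CO2"},
    errors="raise",
)

print(list(unchanged.columns))
print(list(cleaned.columns))
```

Whether this extra step is worthwhile depends on how much the source headers are trusted; for a one-off csv it may be enough to print `df.columns` once and copy the names exactly.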

Next, I will load up and amend the 2008 Nature paper dataset:

In [None]:
df_co2_nature = pd.read_csv("CO2_nature_csv.csv")

df_co2_nature.rename(columns={'EDC3_gas_a (yr)': 'Years ago', 'CO2 (ppmv)': 'CO2'}, inplace=True)

df_co2_nature['Years'] = (df_co2_nature['Years ago']) * -1

df_co2_nature = df_co2_nature[::-1]

print(df_co2_nature)

Next, I will load up the relatively small dataset from Mauna Loa, containing data for atmospheric CO2 levels:

In [None]:
df_mauna_loa_csv = pd.read_csv("mauna_loa_csv.csv")

# I will create a new column called 'Years', to standardise the measurement of time across the datasets.
# I don't need to reverse the rows in this dataset since the earliest year is already in the first row.
df_mauna_loa_csv['Years'] = (2023 - df_mauna_loa_csv['year']) * -1

print(df_mauna_loa_csv)

Next, the "Jouzel" temperature data inferred from deuterium levels in ice cores from EPICA Dome C:

In [None]:
df_temp_jouzel_csv = pd.read_csv("temp_jouzel_csv.csv")

# 'bag' and 'acc-EDC3beta' keep their original names, so they are not listed here
df_temp_jouzel_csv.rename(
    columns={
        'ztop': 'depth',
        'EDC3béta': 'Years before 1950',
        'AICC2012': 'year_new',
        'deutfinal': 'deuterium',
        'temp': 'temp_Kelvin',
    },
    inplace=True,
)

# adding a new column called 'Years' to standardise the measurement of time across datasets. The negative figure allows for my graph to print from
# oldest date to most recent
df_temp_jouzel_csv['Years'] = (df_temp_jouzel_csv['Years before 1950'] + 73) * -1

df_temp_jouzel_csv = df_temp_jouzel_csv[::-1]

print(df_temp_jouzel_csv)
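
The same steps (add a 'Years' column, then put the oldest rows first) are repeated for each dataset above. One way to avoid the repetition would be a small helper along the lines of the sketch below. The function name `standardise_years` and its parameters are hypothetical and are not used elsewhere in this notebook, and the demo values are made up. It sorts on the new column rather than reversing the rows, which assumes nothing about the order of the source file.

```python
import pandas as pd

def standardise_years(df, age_col, reference_year=1950, current_year=2023):
    """Return a copy of df with a 'Years' column, ordered oldest first.

    'Years' is negative for dates in the past, measured from current_year.
    age_col holds ages counted backwards from reference_year.
    """
    out = df.copy()
    offset = current_year - reference_year  # 73 for "years before 1950"
    out["Years"] = -(out[age_col] + offset)
    # Most negative value = oldest sample, so an ascending sort puts it first.
    return out.sort_values("Years").reset_index(drop=True)

# Made-up ages and CO2 values, deliberately out of order.
demo = pd.DataFrame({"age": [50, 5000, 500], "CO2": [280.0, 260.0, 275.0]})
result = standardise_years(demo, "age")
print(result)
```

Passing `reference_year=2023` gives an offset of zero, which matches how the 'Years ago' column of the Nature dataset is handled above.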

In [None]:
plt.figure(figsize= (16,6))
plt.title('Atmospheric carbon dioxide concentrations (parts per million by volume) over the past 800000 years (IPCC)')

# 's' sets the marker area; 'size' would instead map a data variable to marker size
sns.scatterplot(x='Years', y='CO2', data=df_co2_ipcc, s=40)

sns.lineplot(x='Years', y='CO2', data=df_co2_ipcc, color='green')

In [None]:
plt.figure(figsize= (16,6))
plt.title('Atmospheric carbon dioxide concentrations (parts per million by volume) over the past 800000 years (Luthi et al. 2008)')

# plot against the standardised 'Years' column so the oldest samples appear on the left;
# 's' sets the marker area ('size' would instead map a data variable to marker size)
sns.scatterplot(x='Years', y='CO2', data=df_co2_nature, s=40)

sns.lineplot(x='Years', y='CO2', data=df_co2_nature, color='blue')
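
Since both CO2 dataframes carry the standardised 'Years' column, the two records could also be drawn on a single set of axes for a direct comparison. The sketch below shows the idea using plain matplotlib and two made-up three-row dataframes (`toy_ipcc` and `toy_nature` are hypothetical stand-ins with illustrative values, not the real data).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-ins for df_co2_ipcc and df_co2_nature (illustrative values only).
toy_ipcc = pd.DataFrame({"Years": [-800073, -400073, -73], "CO2": [190.0, 280.0, 310.0]})
toy_nature = pd.DataFrame({"Years": [-798000, -410000, -137], "CO2": [192.0, 276.0, 280.0]})

fig, ax = plt.subplots(figsize=(16, 6))

# One line per source, sharing the same x-axis so the records can be compared directly.
ax.plot(toy_ipcc["Years"], toy_ipcc["CO2"], color="green", label="IPCC")
ax.plot(toy_nature["Years"], toy_nature["CO2"], color="blue", label="Luthi et al. (2008)")

ax.set_xlabel("Years relative to 2023")
ax.set_ylabel("CO2 (ppmv)")
ax.legend()
```

Swapping the toy dataframes for `df_co2_ipcc` and `df_co2_nature` should give the combined plot, assuming both have been prepared as in the cells above.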


### CO2 versus Temperature Anomaly ###

### CH4 versus Temperature Anomaly ###

### Climate change signals in the Irish context ###

### Fusion of datasets ###

### Prediction of Global Temperature Anomaly ###

### Accelerating increases of temperature ###

### Summary ###

### References ###