# Programming for Data Analysis Assignment 2

Author - Sean Humphreys

## Contents

1. [Problem Statement](#problem-statement)

1. [Software Libraries](#software-libraries)

1. [Data Cleansing](#data-cleansing)

    1. [Carbon Dioxide Data](#carbon-dioxide-data)

    2. [Temperature Data](#temperature-data)

    3. [Methane Data](#methane-data)


2. [CO2 vs Temperature Anomaly 800k Yr - Present](#co2-vs-temperature-anomaly-800k-yr---present)

2. [References](#references)

3. [Associated Reading](#associated-reading)

---

## Problem Statement <a id="problem-statement"></a>

+ Analyse CO2 vs Temperature Anomaly from 800kyrs – present.

+ Examine one other (paleo/modern) feature (e.g. CH4 or polar ice-coverage)

+ Examine Irish context:
    
    + [Climate change signals](/literature/the_emergence_of_a_climate_change_signal_in_long_term_irish_meteorological_observations.pdf) : (see Maynooth study: The emergence of a climate change signal in long-term Irish meteorological observations - ScienceDirect)

+ Fuse and analyse data from various data sources and format fused data set as a pandas dataframe and export to csv and json formats

+ For all of the above variables, analyse the data, the trends and the relationships between them (temporal leads/lags/frequency analysis).

+ Predict global temperature anomaly over next few decades (synthesise data) and compare to published climate models if atmospheric CO2 trends continue

+ Comment on accelerated warming based on very latest features (e.g. temperature/polar-ice-coverage)

---

## Software Libraries <a id="software-libraries"></a>

- [Matplotlib](https://matplotlib.org/) (https://matplotlib.org/ - last accessed 13 Dec. 2023) is an open-source software library for creating static, animated, and interactive visualisations in Python.

- [Pandas](https://pandas.pydata.org/) (https://pandas.pydata.org/ - last accessed 3 Nov. 2023) is an open-source software library for data analysis and manipulation, built on top of the Python programming language. The DataFrame is the primary Pandas data structure: a dictionary-like container for Series objects.

In [1]:
# import the required software libraries
import pandas as pd
import matplotlib.pyplot as plt
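
The description of a DataFrame as a dictionary-like container for Series can be illustrated with a minimal sketch. The variable names and values below are illustrative only and are not part of the assignment data.

```python
import pandas as pd

# Two Series sharing the same default index (values are illustrative only)
year = pd.Series([2020, 2021, 2022])
co2 = pd.Series([414.0, 416.0, 418.0])

# A DataFrame behaves like a dictionary that maps column names to Series
demo = pd.DataFrame({'year': year, 'co2_ppmv': co2})

# Dictionary-style key access returns the underlying Series
column = demo['co2_ppmv']
print(type(column).__name__)  # Series
print(list(demo.keys()))      # ['year', 'co2_ppmv']
```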

---

## Data Cleansing <a id="data-cleansing"></a>

The Pandas software library is used to clean and process datasets. 

### Carbon Dioxide Data <a id="carbon-dioxide-data"></a>

The most recent CO2 dataset, a Comma Separated Value (CSV) file, is read in from https://gml.noaa.gov/webdata/ccgg/trends/co2/co2_annmean_mlo.csv. Using Pandas, the CSV file can be read in as a DataFrame.

Carbon Dioxide data from 1958 to 800k years before present is sourced from Bereiter et al. (2014).

[1] Dr. Pieter Tans, NOAA/GML (gml.noaa.gov/ccgg/trends/) and Dr. Ralph Keeling, Scripps Institution of Oceanography (scrippsco2.ucsd.edu/)

The latest carbon dioxide data is read in as a Pandas DataFrame.

In [2]:
# https://gml.noaa.gov/ccgg/trends/data.html
mauna_loa = pd.read_csv('https://gml.noaa.gov/webdata/ccgg/trends/co2/co2_annmean_mlo.csv', skiprows=43)
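
A fixed `skiprows` count depends on the length of the file's preamble staying the same. As a hedged alternative, the sketch below assumes that every preamble line begins with `#` and lets `read_csv` skip those lines through its `comment` parameter. The in-memory sample, its header names, and its values are hypothetical stand-ins for the downloaded file.

```python
import io
import pandas as pd

# Hypothetical stand-in for the downloaded file: '#' preamble, then a header row
sample_csv = io.StringIO(
    "# illustrative preamble line\n"
    "# another preamble line\n"
    "year,mean,unc\n"
    "1959,316.0,0.12\n"
    "1960,317.0,0.12\n"
)

# comment='#' skips every line that starts with '#', however many there are
demo_mlo = pd.read_csv(sample_csv, comment='#')

print(demo_mlo.columns.tolist())  # ['year', 'mean', 'unc']
print(len(demo_mlo))              # 2
```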

The columns in the dataset are renamed to logical names.

In [3]:
# code adapted from https://sparkbyexamples.com/pandas/rename-columns-with-list-in-pandas-dataframe/
cols = ['year', 'co2_ppmv', 'unc']

mauna_loa.columns = cols

An unnecessary column is removed from the dataset.

In [4]:
# code adapted from https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html [Accessed 13 Dec. 2023]
mauna_loa.drop(['unc'], axis=1, inplace=True)

An additional column is added to the dataset that gives the number of years before 2023.

In [5]:
mauna_loa['years_before_present'] = 2023 - mauna_loa['year']

# sort the data based on the year before present. Based on code from - https://saturncloud.io/blog/how-to-sort-pandas-dataframe-from-one-column/ [Accessed 13 Dec. 2023].
mauna_loa = mauna_loa.sort_values('years_before_present')

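The columns in the dataset are reordered. The `yr_bp` column does not yet exist in `mauna_loa`, so `reindex` adds it filled with `NaN`, giving the same column layout as the historic DataFrames created later.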

In [6]:
# adapted from code found here - https://practicaldatascience.co.uk/data-science/how-to-reorder-pandas-dataframe-columns [Accessed 13 Dec. 2023]
mauna_loa = mauna_loa.reindex(columns=['yr_bp', 'co2_ppmv', 'year', 'years_before_present'])
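
How `reindex` treats a column label that is not present can be checked on a small example. This is a minimal sketch with illustrative values, not the Mauna Loa data itself.

```python
import pandas as pd

# Illustrative frame with the same column names as mauna_loa at this point
demo = pd.DataFrame({'year': [2021, 2022], 'co2_ppmv': [416.0, 418.0]})
demo['years_before_present'] = 2023 - demo['year']

# 'yr_bp' is not an existing column, so reindex creates it filled with NaN
demo = demo.reindex(columns=['yr_bp', 'co2_ppmv', 'year', 'years_before_present'])

print(demo.columns.tolist())
print(demo['yr_bp'].isna().all())  # True
```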

The historic CO2 data, which ranges from 1950 to 800k years before present, is read in as a Pandas DataFrame.

In [27]:
master_data = pd.read_excel('datasets/historic/co2/grl52461-sup-0003-supplementary.xls', sheet_name='all records')

DataFrames are defined to capture specific subsets of data from the assorted studies in the master dataset. These subsets will be stitched together to create a composite dataset of historic CO2 data.

In [8]:
# use iloc to create subsets of data from the master dataset
# .copy() makes each subset independent of master_data so later in-place edits are safe
rubino = master_data.iloc[90:, [83, 86]].copy()
macfarling = master_data.iloc[137:, 68:70].copy()
monnin = master_data.iloc[25:120, 2:4].copy()
marcott = master_data.iloc[31:321, 98:100].copy()
ahn = master_data.iloc[7:202, 89:91].copy()
bereiter = master_data.iloc[28:106, 34:36].copy()
bereiter_2 = master_data.iloc[60:154, 39:41].copy()
schneider = master_data.iloc[6:, 65:67].copy()
petit = master_data.iloc[124:348, 7:9].copy()
siegenthaler = master_data.iloc[6:26, 20:22].copy()
siegenthaler_2 = master_data.iloc[6:226, 15:17].copy()
bereiter_3 = master_data.iloc[37:, 102:104].copy()
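
Position-based selection with `iloc` can be sketched on a small, hypothetical wide sheet in which each study occupies its own pair of columns. The frame, its column labels, and its values are assumptions made for illustration, and the `.copy()` call is a suggested safeguard rather than something the workbook requires.

```python
import pandas as pd

# Hypothetical wide sheet: each study occupies its own pair of columns
sheet = pd.DataFrame({
    'Unnamed: 0': [10.0, 20.0, 30.0, 40.0],
    'Unnamed: 1': [280.0, 279.0, 278.0, 277.0],
    'Unnamed: 2': [5.0, 15.0, None, None],
    'Unnamed: 3': [284.0, 283.0, None, None],
})

# Rows from position 1 onwards, columns 0 and 1 (the slice end is exclusive)
study_a = sheet.iloc[1:, 0:2].copy()

# Because of .copy(), renaming in place leaves the original sheet untouched
study_a.rename(columns={'Unnamed: 0': 'yr_bp', 'Unnamed: 1': 'co2_ppmv'}, inplace=True)

print(study_a.shape)             # (3, 2)
print(study_a.columns.tolist())  # ['yr_bp', 'co2_ppmv']
```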

The columns in each DataFrame are renamed to logical names.

In [9]:
rubino.rename(columns=({'Unnamed: 83':'yr_bp', 'Unnamed: 86':'co2_ppmv'}), inplace=True)
macfarling.rename(columns=({'Law Dome (0-2 kyr BP)':'yr_bp', 'Unnamed: 69':'co2_ppmv'}), inplace=True)
monnin.rename(columns=({'Unnamed: 2':'yr_bp', 'Unnamed: 3':'co2_ppmv'}), inplace=True)
marcott.rename(columns=({'Unnamed: 98':'yr_bp', 'Unnamed: 99':'co2_ppmv'}), inplace=True)
ahn.rename(columns=({'Unnamed: 89':'yr_bp', 'Unnamed: 90':'co2_ppmv'}), inplace=True)
bereiter.rename(columns=({'Unnamed: 34':'yr_bp', 'Unnamed: 35':'co2_ppmv'}), inplace=True)
bereiter_2.rename(columns=({'Unnamed: 39':'yr_bp', 'Unnamed: 40':'co2_ppmv'}), inplace=True)
schneider.rename(columns=({'Unnamed: 65':'yr_bp', 'Unnamed: 66':'co2_ppmv'}), inplace=True)
petit.rename(columns=({'Unnamed: 7':'yr_bp', 'Unnamed: 8':'co2_ppmv'}), inplace=True)
siegenthaler.rename(columns=({'Unnamed: 20':'yr_bp', 'Unnamed: 21':'co2_ppmv'}), inplace=True)
siegenthaler_2.rename(columns=({'Unnamed: 15':'yr_bp', 'Unnamed: 16':'co2_ppmv'}), inplace=True)
bereiter_3.rename(columns=({'Unnamed: 102':'yr_bp', 'Unnamed: 103':'co2_ppmv'}), inplace=True)


A function is defined to carry out a number of processing actions on each DataFrame. The `year()` function:

+ creates a column that calculates the calendar year from the years before present (`yr_bp`) value, where "present" is 1950

+ creates a column that calculates the number of years before 2023

+ drops any rows with null values

In [10]:
def year(sample):
    sample['year'] = 1950-(sample['yr_bp'])
    sample['years_before_present'] = 2023 - sample['year']
    sample.dropna(axis=0, inplace=True)
    return sample

Using a for loop, each of the subsets of data can be passed to the `year()` function.

In [11]:
studies = [rubino, macfarling, monnin, marcott, ahn, bereiter, bereiter_2, schneider, petit, siegenthaler, siegenthaler_2, bereiter_3]

for study in studies:
    year(study)
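
The arithmetic in `year()` can be checked on a tiny frame. The function below repeats the definition given above, and the input values are illustrative only.

```python
import pandas as pd

def year(sample):
    # Same steps as the function defined above
    sample['year'] = 1950 - sample['yr_bp']
    sample['years_before_present'] = 2023 - sample['year']
    sample.dropna(axis=0, inplace=True)
    return sample

# Illustrative input: one row is entirely null and should be dropped
demo = pd.DataFrame({'yr_bp': [0.0, 1000.0, None],
                     'co2_ppmv': [311.0, 279.0, None]})
year(demo)

print(demo['year'].tolist())                  # [1950.0, 950.0]
print(demo['years_before_present'].tolist())  # [73.0, 1073.0]
```

A `yr_bp` value of 0 corresponds to 1950, which is 73 years before 2023.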

Each of the subsets of CO2 data is concatenated into one DataFrame to create a composite.

In [12]:
# code adapted from https://pandas.pydata.org/docs/reference/api/pandas.concat.html [Accessed 13 Dec. 2023].
frames = [mauna_loa, rubino, macfarling, monnin, marcott, ahn, bereiter, bereiter_2, schneider, petit, siegenthaler, siegenthaler_2, bereiter_3]

full_co2_data = pd.concat(frames, ignore_index = True)
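
The effect of `ignore_index=True` can be seen on two miniature frames that share one column layout. This is a sketch with hypothetical names and illustrative values. The final sort is an optional tidy-up and is not a step taken in the cells above.

```python
import pandas as pd

# Hypothetical miniature frames with a shared column layout (illustrative values)
recent = pd.DataFrame({'yr_bp': [-72.0], 'co2_ppmv': [418.0],
                       'year': [2022.0], 'years_before_present': [1.0]})
ice_core = pd.DataFrame({'yr_bp': [200.0, 100.0], 'co2_ppmv': [285.0, 290.0],
                         'year': [1750.0, 1850.0],
                         'years_before_present': [273.0, 173.0]})

# ignore_index=True discards the original row labels and renumbers from 0
combined = pd.concat([recent, ice_core], ignore_index=True)

# Optional tidy-up: order the composite from most recent to oldest
combined = combined.sort_values('years_before_present').reset_index(drop=True)

print(combined.index.tolist())                    # [0, 1, 2]
print(combined['years_before_present'].tolist())  # [1.0, 173.0, 273.0]
```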

### Temperature Data <a id="temperature-data"></a>

The temperature data from 1880 to 2022 in this dataset is sourced from [NASA](https://data.giss.nasa.gov/gistemp/graphs/graph_data/Global_Mean_Estimates_based_on_Land_and_Ocean_Data/graph.txt) (https://data.giss.nasa.gov/gistemp/graphs/graph_data/Global_Mean_Estimates_based_on_Land_and_Ocean_Data/graph.txt [Accessed 12 Dec. 2023]). [2]

Temperature data from 1880 back to 800k years before 2023 was sourced from [https://www.temperaturerecord.org/#sources](https://www.temperaturerecord.org/#sources) (accessed 13 Dec. 2023). [3] & [4]

All of the temperature data is expressed as an anomaly relative to the long-term average from 1951 to 1980.

[2] Credits - Snyder, C.W. 2016.

[3] Credits - Marcott et al, 2013

[4] Credits - Shakun et al, 2012



Using Pandas, modern temperature data is read in from the NASA website.

In [13]:
nasa_temp = pd.read_csv('https://data.giss.nasa.gov/gistemp/graphs/graph_data/Global_Mean_Estimates_based_on_Land_and_Ocean_Data/graph.txt', 
                       skiprows=5, header=None, sep = ' ', skipinitialspace=True, engine='python', names=['year', 'temp_anomaly', 'lowess'])

An unnecessary column is dropped from the DataFrame.

In [14]:
nasa_temp.drop(['lowess'], axis=1, inplace=True)
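
Reading a whitespace-separated text file with a multi-line preamble can be sketched against an in-memory sample. The sample text, its layout, and its values are hypothetical stand-ins for the downloaded file, and `sep=r'\s+'` is used here as an alternative way to split on runs of spaces.

```python
import io
import pandas as pd

# Hypothetical stand-in for the text file: five preamble lines, then data rows
sample_txt = io.StringIO(
    "Illustrative title line\n"
    "-----------------------\n"
    "Illustrative note line\n"
    "Year No_Smoothing  Smoothed\n"
    "----------------------------\n"
    "1880     -0.17     -0.10\n"
    "1881     -0.09     -0.13\n"
)

# Skip the five preamble lines and split the columns on any run of whitespace
demo_temp = pd.read_csv(sample_txt, skiprows=5, header=None, sep=r'\s+',
                        names=['year', 'temp_anomaly', 'smoothed'])

# Drop the column that is not needed for the analysis
demo_temp = demo_temp.drop(columns=['smoothed'])

print(demo_temp.columns.tolist())  # ['year', 'temp_anomaly']
print(demo_temp['year'].tolist())  # [1880, 1881]
```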

The remaining columns are reindexed to the standard column layout that will be used with temperature data from the other sources. The `yr_bp` column does not exist in the NASA data, so `reindex` adds it filled with `NaN`.

In [15]:
nasa_temp = nasa_temp.reindex(columns=['year', 'yr_bp', 'temp_anomaly'])

The NASA data is sorted by year in descending order, so the most recent year comes first.

In [16]:
nasa_temp = nasa_temp.sort_values('year', ascending=False)

Pre-1880 temperature data is read in from worksheets in an Excel spreadsheet that contains all of the historic temperature data.

In [28]:
moberg_temp = pd.read_excel('datasets/historic/temperature/temperature_dataset.xlsx', 
                            sheet_name='2,000 yr',  names=['year', 'yr_bp', 'temp_anomaly', 'x', 'y', 'z'])
clark_temp = pd.read_excel('datasets/historic/temperature/temperature_dataset.xlsx', 
                           sheet_name='20,000 yr', names=['yr_bp', 'temp_anomaly', 'x', 'y', 'z'])
shakun_temp = pd.read_excel('datasets/historic/temperature/temperature_dataset.xlsx', 
                            sheet_name='800,000 yr', names=['yr_bp', 'temp_anomaly', 'x', 'y', 'z'])


Unneeded columns are removed from the `moberg_temp` DataFrame.

In [18]:
moberg_temp.drop(['x', 'y', 'z'],axis=1, inplace=True)

The first 100 rows are dropped from `moberg_temp` to avoid overlap with the more recent temperature data.

In [19]:
moberg_temp.drop(moberg_temp.index[0:100], axis = 0, inplace = True)

A number of processing tasks are grouped together in a function. The `temp_year()` function removes unneeded columns from the DataFrame and adds a column to calculate the year.

In [20]:
def temp_year(sample):
    sample.drop(['x', 'y', 'z'],axis=1, inplace=True)
    sample['year'] = 1950 - sample['yr_bp']
    return sample

Using a for loop, the relevant datasets are passed to the `temp_year()` function.

In [21]:
samples = [clark_temp, shakun_temp]

for sample in samples:
    temp_year(sample)

The columns in the DataFrames are re-ordered to be consistent with the other temperature DataFrames.

In [22]:
clark_temp = clark_temp.reindex(columns=['year', 'yr_bp', 'temp_anomaly'])
shakun_temp = shakun_temp.reindex(columns=['year', 'yr_bp', 'temp_anomaly'])

Rows are dropped from each of the DataFrames so that there is no overlap between them.

In [23]:
clark_temp.drop(clark_temp.index[0:19], axis = 0, inplace = True)
shakun_temp.drop(shakun_temp.index[0:7], axis = 0, inplace = True)
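
Dropping a fixed number of leading rows relies on knowing exactly where the records overlap. As a hedged alternative, the sketch below trims the older record by value, keeping only rows that lie beyond the oldest point of the newer record. The frame names and values are hypothetical.

```python
import pandas as pd

# Hypothetical newer and older records that overlap between 800 and 1000 yr BP
newer = pd.DataFrame({'yr_bp': [100.0, 500.0, 1000.0],
                      'temp_anomaly': [-0.3, -0.4, -0.5]})
older = pd.DataFrame({'yr_bp': [800.0, 1000.0, 5000.0, 10000.0],
                      'temp_anomaly': [-0.4, -0.5, 0.1, -0.2]})

# Oldest point covered by the newer record
cutoff = newer['yr_bp'].max()

# Keep only the rows of the older record that lie beyond that point
older_trimmed = older[older['yr_bp'] > cutoff]

print(older_trimmed['yr_bp'].tolist())  # [5000.0, 10000.0]
```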

All of the temperature DataFrames are concatenated to give a composite record of the temperature anomaly over the last 800k years.

In [24]:
frames_temp = [nasa_temp, moberg_temp, clark_temp, shakun_temp]

full_temp_data = pd.concat(frames_temp, ignore_index = True)

### Methane Data <a id="methane-data"></a>

## CO2 vs Temperature Anomaly 800k Yr - Present <a id="CO2-vs-Temperature-Anomaly-800k-Yr---Present"></a>

## Analysis <a id="analysis"></a>

---

## Examine one other (paleo/modern) feature

---

## Irish context

---

## Fused Dataset
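
One possible way to fuse the CO2 and temperature records and export the result is sketched below. It assumes both records are keyed on `yr_bp` and uses `pd.merge_asof` to pair each CO2 sample with the nearest temperature sample. The frames, the values, and the 150-year tolerance are hypothetical choices for illustration, not the method used in this notebook.

```python
import pandas as pd

# Hypothetical miniature records keyed on years before present (illustrative values)
co2 = pd.DataFrame({'yr_bp': [0.0, 1000.0, 2000.0],
                    'co2_ppmv': [310.0, 280.0, 278.0]})
temp = pd.DataFrame({'yr_bp': [10.0, 990.0, 2100.0],
                     'temp_anomaly': [0.0, -0.3, -0.4]})

# merge_asof needs both frames sorted on the key; 'nearest' pairs each CO2
# sample with the closest temperature sample within the chosen tolerance
fused = pd.merge_asof(co2.sort_values('yr_bp'), temp.sort_values('yr_bp'),
                      on='yr_bp', direction='nearest', tolerance=150.0)

# With no path argument, to_csv and to_json return the text instead of writing a file
csv_text = fused.to_csv(index=False)
json_text = fused.to_json(orient='records')

print(fused['temp_anomaly'].tolist())  # [0.0, -0.3, -0.4]
print(csv_text.splitlines()[0])        # yr_bp,co2_ppmv,temp_anomaly
```

Passing a file name to `to_csv` or `to_json` would write the fused dataset to disk instead.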

---

## Data Analysis

---

## Predictive Model

---

## References <a id="references"></a>

Bereiter et al. (2014), Revision of the EPICA Dome C CO2 record from 800 to 600 kyr before present, Geophysical Research Letters, doi: 10.1002/2014GL061957.

Marcott, S.A., Shakun, J.D., Clark, P.U. and Mix, A.C. (2013). A Reconstruction of Regional and Global Temperature for the Past 11,300 Years. Science, 339(6124), pp.1198–1201. doi:https://doi.org/10.1126/science.1228026.


Naveen (2022). How to Rename Columns With List in Pandas. [online] Spark By {Examples}. Available at: https://sparkbyexamples.com/pandas/rename-columns-with-list-in-pandas-dataframe/ [Accessed 13 Dec. 2023].

pandas.pydata.org. (n.d.). pandas.concat — pandas 1.3.4 documentation. [online] Available at: https://pandas.pydata.org/docs/reference/api/pandas.concat.html. [Accessed 13 Dec. 2023].

pandas.pydata.org. (n.d.). pandas.DataFrame.drop — pandas 1.2.4 documentation. [online] Available at: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html. [Accessed 13 Dec. 2023].

practicaldatascience.co.uk. (2022). How to reorder Pandas dataframe columns. [online] Available at: https://practicaldatascience.co.uk/data-science/how-to-reorder-pandas-dataframe-columns. [Accessed 13 Dec. 2023].

saturncloud.io. (2023). How to Sort Pandas DataFrame by One or Multiple Column | Saturn Cloud Blog. [online] Available at: https://saturncloud.io/blog/how-to-sort-pandas-dataframe-from-one-column/ [Accessed 13 Dec. 2023].

Shakun, J.D., Clark, P.U., He, F., Marcott, S.A., Mix, A.C., Liu, Z., Otto-Bliesner, B., Schmittner, A. and Bard, E. (2012). Global warming preceded by increasing carbon dioxide concentrations during the last deglaciation. Nature, [online] 484(7392), pp.49–54. doi:https://doi.org/10.1038/nature10915.

Snyder, C.W. (2016). Evolution of global temperature over the past two million years. Nature, [online] 538(7624), pp.226–228. doi:https://doi.org/10.1038/nature19798.

---

## Associated Reading <a id="associated-reading"></a>

Matplotlib (2012). Matplotlib: Python plotting — Matplotlib 3.1.1 documentation. [online] Matplotlib.org. Available at: https://matplotlib.org/. [Accessed 13 Dec. 2023].

Pandas (2018). Python Data Analysis Library — pandas: Python Data Analysis Library. [online] Pydata.org. Available at: https://pandas.pydata.org/. [Accessed 13 Dec. 2023].

---

Notebook Ends