# Exploratory Data Analysis (EDA)

## Table of Contents
1. [Dataset Overview](#dataset-overview)
2. [Data management](#data-management)
3. [Data visualization](#data-visualization)


In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


## Dataset Overview

Here we aim to predict the species richness of plants.

Source of dataset:
The data are our own, generated as part of a master's project.
They are simulated data generated using “GEN3SIS”, an R-based mechanistic model (https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3001340).
Gen3sis tracks the eco-evolutionary dynamics of configured species over a given landscape reconstruction.

### Description of data

Terms to be used:
1. Age of the landscape: in million years.
2. Elevation: maximum elevation of Earth, dynamic due to continent
3. Global avg temperature: average temperature of Earth's atmosphere over the entire surface, taking into account variations across regions and seasons.
4. pCO2: partial pressure of carbon dioxide, a measure of the concentration of carbon dioxide in Earth's atmosphere, representing the partial pressure contributed by this greenhouse gas.
5. Species richness: the count of distinct species in a particular ecosystem or geographical area.


Central idea
We have elevation and climate data (mean temperature and pCO2 concentration) for 3 different climate scenarios: one literature-based and 2 hypothetical.
i. maxprob = scenario with the maximum-probability climate conditions during the evolution of Earth (Foster et al. 2017).
ii. median = hypothetical scenario describing how life on Earth would have evolved if conditions had been constant and suitable.
iii. mext = hypothetical scenario that includes the occurrence of 4 mass extinctions, representing the extreme climatic shifts suspected by the scientific community.

We have plant species richness values for the maxprob climate scenario, so we use this scenario to train a regression model that predicts species richness from temperature, pCO2, age of the landscape (past 400 Myr) and maximum elevation of the landscape.
We then use this model to predict species richness in the other 2 climate scenarios (median and mext).
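
The train-on-one-scenario, predict-on-the-others workflow described above can be sketched as follows. This is only an illustration on synthetic numbers: it assumes an ordinary least-squares linear model, which the notebook has not committed to at this stage, and the helpers `make_scenario`, `fit_linear` and `predict_linear`, as well as every generated value, are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_scenario(n, with_target):
    """Build a synthetic scenario (hypothetical values, not the real data)."""
    age = np.linspace(0.0, 400.0, n)
    pco2 = 300.0 + 2.0 * age + rng.normal(0.0, 5.0, n)
    temp = 14.0 + 0.01 * age + rng.normal(0.0, 0.2, n)
    elevation = 5000.0 + rng.normal(0.0, 100.0, n)
    X = np.column_stack([age, pco2, temp, elevation])
    if with_target:
        # Assumed toy relationship between the features and species richness
        y = 1000.0 - 2.0 * age + 10.0 * temp + rng.normal(0.0, 1.0, n)
        return X, y
    return X, None

def fit_linear(X, y):
    """Fit intercept + coefficients by ordinary least squares."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    """Apply the fitted coefficients to a new feature matrix."""
    A = np.column_stack([np.ones(len(X)), X])
    return A @ coef

# Train on the scenario that has species richness (stands in for maxprob)
X_train, y_train = make_scenario(801, with_target=True)
coef = fit_linear(X_train, y_train)

# Predict for scenarios that only have features (stand-ins for median and mext)
X_median, _ = make_scenario(801, with_target=False)
X_mext, _ = make_scenario(801, with_target=False)
sr_median = predict_linear(coef, X_median)
sr_mext = predict_linear(coef, X_mext)
```

With the real data, the feature matrices would be built from the Age, pCO2, Mean_Temperature and elevation columns of each scenario, and the target from SR_total.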




Below, we import the data from a local drive (the data are also available on the following G-drive: )



In [None]:
# Data import from local drive (change the file locations in the code to match your computer)
# pco2_ = partial pressure of CO2 in ppm at different ages of Earth
# temp_ = global avg. temperature at different ages
# sr_ = species richness (count of species alive at different ages)
# Age = age from the present back to 400 million years ago (801 observations in total)

# pCO2 data import over 400mya
pco2_maxprob = pd.read_excel("D:/ml/pco2_maxprob.xlsx")
pco2_rand68 = pd.read_excel("D:/ml/pco2_rand68.xlsx")
pco2_median = pd.read_excel("D:/ml/pco2_median.xlsx")
pco2_mext = pd.read_excel("D:/ml/pco2_mext.xlsx")
pco2_noDPE = pd.read_excel("D:/ml/pco2_noDPE.xlsx")

# temp data import over 400mya
temp_maxprob = pd.read_excel("D:/ml/temp_maxprob.xlsx")
temp_rand68 = pd.read_excel("D:/ml/temp_rand68.xlsx")
temp_median = pd.read_excel("D:/ml/temp_median.xlsx")
temp_mext = pd.read_excel("D:/ml/temp_mext.xlsx")
temp_noDPE = pd.read_excel("D:/ml/temp_noDPE.xlsx")

# SR data import over 400mya
sr_maxprob = pd.read_excel("D:/ml/SR_maxprob.xlsx")
sr_rand68 = pd.read_excel("D:/ml/SR_rand68.xlsx")

# elevation data
ele = pd.read_excel("D:/ml/ele.xlsx")

#print(pco2_maxprob.head())
#print(temp_maxprob.head())
#print(sr_maxprob.head())

## Data management

1. Here we merge the individual climate and elevation datasets to create 3 usable dataframes, one for each scenario: maxprob, median and mext.

2. We print the head of the 3 datasets to inspect their structure.

3. The dataset used for training the model has 801 observations of the following columns:
    Age        pCO2      Mean_Temperature      elevation      SR_total
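
Because an inner merge on Age silently drops any age that is missing from one of the inputs, it can be worth checking that no observations are lost. The sketch below uses small toy frames (hypothetical values, not the project data) and the `validate` argument of `pd.merge` to confirm that Age is unique on both sides, then reports which ages did not survive the merge.

```python
import pandas as pd

# Toy inputs (hypothetical values); the elevation frame is missing Age 1.0
pco2 = pd.DataFrame({"Age": [0.0, 0.5, 1.0], "pCO2": [280.0, 285.0, 290.0]})
temp = pd.DataFrame({"Age": [0.0, 0.5, 1.0], "Mean_Temperature": [14.0, 14.1, 14.3]})
ele = pd.DataFrame({"Age": [0.0, 0.5], "elevation": [5200.0, 5210.0]})

# validate="one_to_one" raises a MergeError if Age is duplicated on either side
merged = pd.merge(pco2, temp, on="Age", how="inner", validate="one_to_one")
merged = pd.merge(merged, ele, on="Age", how="inner", validate="one_to_one")

# Compare against the first input to see what the inner joins dropped
rows_lost = len(pco2) - len(merged)
missing_ages = sorted(set(pco2["Age"]) - set(merged["Age"]))
print(f"rows lost: {rows_lost}, missing ages: {missing_ages}")
```

For the real scenarios, the same check would compare the merged row count against the expected 801 observations.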


In [None]:
# Combine the individual data frames into one usable structure per scenario
maxprob = pd.merge(pco2_maxprob, temp_maxprob[['Age', 'Mean_Temperature']], on='Age', how='inner')
maxprob = pd.merge(maxprob, ele[['Age', 'elevation']], on='Age', how='inner')
maxprob = pd.merge(maxprob, sr_maxprob[['Age', 'SR_total']], on='Age', how='inner')


median = pd.merge(pco2_median, temp_median[['Age', 'Mean_Temperature']], on='Age', how='inner')
median = pd.merge(median, ele[['Age', 'elevation']], on='Age', how='inner')

mext = pd.merge(pco2_mext, temp_mext[['Age', 'Mean_Temperature']], on='Age', how='inner')
mext = pd.merge(mext, ele[['Age', 'elevation']], on='Age', how='inner')




print("Maximum probability climate scenario")
print(maxprob.head())

print("Median climate scenario")
print(median.head())

print("Climate scenario with Mass extinctions")
print(mext.head())


## Data visualization

1. Here, we plot the correlations between the variables in the dataframe to identify the features to use in the regression.

2. Then we plot the time series trend of pCO2 over time.

3. Next we plot the time series trend of temperature over time.

4. Finally we plot the time series trend of plant species richness over time.
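
As a complement to the heatmap, the correlations with the target can also be read off numerically. The following is a minimal sketch on a synthetic frame (all values are hypothetical and the relationships between columns are assumed for illustration): it ranks candidate features by the absolute value of their Pearson correlation with SR_total.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 200
age = np.linspace(0.0, 400.0, n)

# Synthetic stand-in for a scenario dataframe (not the real data)
toy = pd.DataFrame({
    "Age": age,
    "pCO2": 300.0 + 2.0 * age + rng.normal(0.0, 50.0, n),
    "Mean_Temperature": 14.0 + rng.normal(0.0, 1.0, n),
    "SR_total": 1000.0 - 2.0 * age + rng.normal(0.0, 20.0, n),
})

# Pearson correlation matrix over the numeric columns
corr = toy.select_dtypes(include="number").corr()

# Correlation of each feature with the target, strongest first
target_corr = corr["SR_total"].drop("SR_total")
ranked = target_corr.abs().sort_values(ascending=False)
print(ranked)
```

A high correlation between two features themselves (here Age and pCO2, by construction) is also worth noting, since strongly collinear features carry overlapping information for a regression.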


In [None]:
# Correlation plot
# Compute the correlation matrix over the numeric columns (seaborn is already imported above)
correlation_matrix = maxprob.select_dtypes(include='number').corr()

# Set up the matplotlib figure
plt.figure(figsize=(10, 8))

# Draw the heatmap using seaborn
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=.5)

# Show the plot
plt.title('Correlation Matrix')
plt.show()


# INITIAL TIME SERIES TREND PLOTS

# pCO2
# Plotting setup
plt.figure(figsize=(10, 6))  # Set the figure size
# Plotting each line graph
plt.plot(-maxprob['Age'], maxprob['pCO2'], label='maxprob')  # Line for DataFrame 1
plt.plot(-median['Age'], median['pCO2'], label='median', linestyle='--')  # Line for DataFrame 2
plt.plot(-mext['Age'], mext['pCO2'], label='mext', linestyle='--')  # Line for DataFrame 3
# Set labels and title
plt.xlabel('Age (Mya)')
plt.ylabel('pCO2 (ppm)')
plt.title('pCO2 Values vs Age')
# Show legend
plt.legend()
# Show the plot
plt.show()

# temp
# Plotting setup
plt.figure(figsize=(10, 6))  # Set the figure size
# Plotting each line graph
plt.plot(-maxprob['Age'], maxprob['Mean_Temperature'], label='maxprob')  # Line for DataFrame 1
plt.plot(-median['Age'], median['Mean_Temperature'], label='median', linestyle='--')  # Line for DataFrame 2
plt.plot(-mext['Age'], mext['Mean_Temperature'], label='mext', linestyle='--')  # Line for DataFrame 3
# Set labels and title
plt.xlabel('Age (Mya)')
plt.ylabel('Temp (C)')
plt.title('Global avg. Temperature Values vs Age')
# Show legend
plt.legend()
# Show the plot
plt.show()


# Species richness
# Plotting setup
plt.figure(figsize=(10, 6))  # Set the figure size
# Plotting each line graph
plt.plot(-maxprob['Age'], maxprob['SR_total'], label='maxprob')  # Line for DataFrame 1
# median and mext have no species richness values yet; these are what the model will predict
# Set labels and title
plt.xlabel('Age (Mya)')
plt.ylabel('Total Species richness')
plt.title('Terrestrial plants richness vs Age')
# Show legend
plt.legend()
# Show the plot
plt.show()