# **CIS 5450 Final Project - Climate Change Analysis**
*Wendy Deng, Anna Zhou, Kaily Liu*

# Part 1: Introduction

Climate change is one of the most urgent issues of our time, and its impacts on our planet are becoming increasingly severe. Our project focuses on analyzing and predicting global temperatures to better understand how climate change is influencing our world.







# Part 2: Data Loading

This part sets up the notebook environment for data analysis, visualization, time series analysis, and machine learning model building.

It begins by installing and importing necessary libraries for data manipulation, visualization, statistical analysis, and machine learning.

The visualization section includes modules for 3D plotting, Matplotlib, Seaborn, Cartopy, and Plotly.

The statistical tools section includes modules for time series analysis, such as ARIMA models, ADF test, and autocorrelation/partial autocorrelation plots.

The machine learning model building section imports modules from scikit-learn for preprocessing, splitting data, linear regression modeling, and evaluation metrics.

Additionally, there's a section for printing file paths in the '/kaggle/input' directory, to show what Kaggle datasets we are working with.


In [None]:
# installing Cartopy and the other third-party packages used in this notebook
!pip install cartopy
!pip install fuzzywuzzy
!pip install sqlalchemy==1.4.46
!pip install pandasql
!pip install tensorflow
!pip install keras
!pip install pmdarima

In [None]:
# importing necessary libraries
import numpy as np
import pandas as pd
import pandasql as ps
from datetime import date
from google.colab import drive
drive.mount('/content/drive')

# importing visualization tools
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
import seaborn as sns
import cartopy.crs as ccrs
import cartopy.feature as cfeature
import plotly.graph_objects as go
import plotly.tools as tls
import plotly.offline as py
import plotly.express as px
import plotly.io as pio
py.init_notebook_mode(connected=True)
pio.renderers.default = "colab"

# importing other dynamic visualization tools
import urllib.request
import json
import geopandas as gpd
from fuzzywuzzy import process
from geopy.geocoders import Nominatim

# importing statistical tools
import itertools
import statsmodels.api as sm
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller
from pmdarima.arima import auto_arima
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

# importing necessary modules for machine learning model building
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# importing more necessary tools for modeling
from scipy.stats import pearsonr
from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV  # train_test_split is already imported above
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from prophet import Prophet

# when using Kaggle notebooks, printing the file paths in the input directory
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## 2.1 Loading & Preprocessing Global Temperatures Data

This section of the notebook explores the Kaggle GlobalTemperatures.csv file (Global Land and Ocean-and-Land Temperatures). It includes information on:

- Date: Starting from 1750 for average land temperature and 1850 for maximum and minimum land temperatures, as well as global ocean and land temperatures.
- LandAverageTemperature: Represents the global average land temperature in degrees Celsius.
- LandAverageTemperatureUncertainty: Indicates the 95% confidence interval around the average land temperature.
- LandMaxTemperature: Denotes the global average maximum land temperature in degrees Celsius.
- LandMaxTemperatureUncertainty: Represents the 95% confidence interval around the maximum land temperature.
- LandMinTemperature: Specifies the global average minimum land temperature in degrees Celsius.
- LandMinTemperatureUncertainty: Depicts the 95% confidence interval around the minimum land temperature.
- LandAndOceanAverageTemperature: Signifies the global average land and ocean temperature in degrees Celsius.
- LandAndOceanAverageTemperatureUncertainty: Represents the 95% confidence interval around the global average land and ocean temperature.

We load the dataset into our notebook, and check that all cells are correct and present. Then, we clean the dataset:
- We convert the dt column into DateTime objects.
- We divide the dataset into two dataframes, `global_temp_land` and `global_temp_land_and_ocean`, because their columns start in different years, and store the cleaned versions in `global_temp_land_cleaned` and `global_temp_land_and_ocean_cleaned`.

### 2.1.1 Loading Data

In [None]:
# reading in the csv file
global_temp = pd.read_csv('/content/drive/MyDrive/CIS545/CIS545 Final Project/data/GlobalTemperatures.csv')

### 2.1.2 Analyzing Data Structure & Subsetting Data

In [None]:
query = """
SELECT *
FROM global_temp
LIMIT 5
"""

# Execute the query using pandasql
global_temp_head = ps.sqldf(query, locals())

# Convert to a pandas DataFrame
global_temp_head = pd.DataFrame(global_temp_head)

global_temp_head

In [None]:
# getting the latest data, to check that it was properly imported, and contains specified information
query = """
SELECT *
FROM global_temp
ORDER BY ROWID DESC
LIMIT 5
"""

# Execute the query using pandasql
global_temp_tail = ps.sqldf(query, locals())

# Convert to a pandas DataFrame
global_temp_tail = pd.DataFrame(global_temp_tail)

global_temp_tail

As we can see from the 'dt' column, our data documents land and ocean temperatures from 1750 to 2015 in monthly increments.
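
The monthly spacing can also be checked programmatically rather than by eye. The following is a minimal sketch on a made-up stand-in for the 'dt' column; the sample dates and the names `month_index` and `is_monthly` are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the 'dt' column (assumed month-start dates).
dt = pd.Series(pd.to_datetime(["1750-01-01", "1750-02-01", "1750-03-01", "1750-04-01"]))

# Map each date to a running month count so consecutive months differ by exactly 1.
month_index = dt.dt.year * 12 + dt.dt.month

# Differences between consecutive rows; the first difference is NaN and is dropped.
steps = month_index.diff().dropna()

# True only if every row is exactly one month after the previous one.
is_monthly = bool((steps == 1).all())
print(is_monthly)
```

On the real data, the same check could be run on `global_temp['dt']` after it has been converted to datetime.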

In [None]:
# checking that the file was properly imported and contains correct data
global_temp.info()

The column 'dt' is currently of type object. We convert it to type datetime for easier analysis and add a 'year' column so that the data can be separated into subsets.

In [None]:
#convert dt to a datetime object
global_temp['dt'] = pd.to_datetime(global_temp['dt'])

#add a column for year
global_temp["year"] = global_temp['dt'].dt.year.values

In [None]:
# get a summary of the central tendency, dispersion, and shape of the distribution of the numerical columns in the dataframe
global_temp.describe()

Since land average temperature starts in 1750 while max, min, and land-and-ocean temperatures start in 1850, we will analyze the land and the land-and-ocean temperatures separately. We will create two dataframes: one for land average temperature, and one for land and ocean temperatures after 1850.
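
One way to confirm where each column starts is to look up the first non-null date per column. This is a sketch on invented values (the numbers in `toy` are not taken from the dataset), and `first_years` is a hypothetical name introduced here.

```python
import numpy as np
import pandas as pd

# Toy frame: one column has data from the first row, the other only from 1850 on.
toy = pd.DataFrame({
    "dt": pd.to_datetime(["1750-01-01", "1800-01-01", "1850-01-01", "1900-01-01"]),
    "LandAverageTemperature": [3.0, 2.5, 0.7, 1.5],
    "LandMaxTemperature": [np.nan, np.nan, 8.2, 7.9],
})

# Index by date so first_valid_index() returns the first date with a non-null value.
indexed = toy.set_index("dt")
first_years = {col: indexed[col].first_valid_index().year for col in indexed.columns}
print(first_years)
```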

In [None]:
# creating the global_temp_land dataframe
global_temp_land = global_temp[['dt', 'LandAverageTemperature', 'LandAverageTemperatureUncertainty', 'year']].reset_index(drop=True)

In [None]:
# creating the global_temp_land_and_ocean dataframe
global_temp_land_and_ocean = global_temp[global_temp['dt'].dt.year > 1850]

columns = [
    'dt',
    'LandMaxTemperature',
    'LandMaxTemperatureUncertainty',
    'LandMinTemperature',
    'LandMinTemperatureUncertainty',
    'LandAndOceanAverageTemperature',
    'LandAndOceanAverageTemperatureUncertainty'
]

global_temp_land_and_ocean = global_temp_land_and_ocean[columns].reset_index(drop=True)

### 2.1.3 Analyzing Land Temperatures Data & Handling Missing Values

In [None]:
# from the info on the dataframe, we see that there are some null values, which we need to clean
global_temp_land.info()

In [None]:
# getting summary statistics of the central tendency, dispersion, and shape of the distribution of the global_temp_land data
global_temp_land.describe()

We will take a look at the rows with nulls to decide whether to drop them or to impute the missing values

In [None]:
# creating a new dataframe that contains only the rows with null values from the original dataframe
global_temp_land_null = global_temp_land[global_temp_land.isnull().any(axis=1)]

In [None]:
# getting summary of the dataframe, including the data types of each column and the number of non-null values
global_temp_land_null.info()

In [None]:
# getting summary statistics of the central tendency, dispersion, and shape of the distribution of the global_temp_land_null data
# we use this to isolate what data is missing, and determine if it can be dropped
global_temp_land_null.describe()

Only 12 rows have missing values in 'LandAverageTemperature' and 'LandAverageTemperatureUncertainty', which is $12/3193 \approx 0.00376$ of the data, and all of them fall within 1750 to 1752, the first three years covered by the dataset. Since these values were most likely lost because of how early they were recorded, and they make up only $0.376$% of the data, we decided to drop these rows.
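
The count, fraction, and year range quoted above can be derived in one place. The helper below is a sketch under stated assumptions: `missing_summary` is a hypothetical function, the toy values are invented, and it assumes the frame contains at least one null row.

```python
import numpy as np
import pandas as pd

def missing_summary(df, date_col, value_col):
    """Return count, fraction, and year range of rows where value_col is null.

    Assumes date_col is datetime-typed and at least one row is null.
    """
    null_rows = df[df[value_col].isna()]
    return {
        "count": len(null_rows),
        "fraction": len(null_rows) / len(df),
        "first_year": int(null_rows[date_col].dt.year.min()),
        "last_year": int(null_rows[date_col].dt.year.max()),
    }

# Toy data: 8 monthly rows, 2 of them missing.
toy = pd.DataFrame({
    "dt": pd.date_range("1750-01-01", periods=8, freq="MS"),
    "LandAverageTemperature": [3.0, np.nan, 5.7, 8.5, np.nan, 12.9, 15.9, 14.8],
})
summary = missing_summary(toy, "dt", "LandAverageTemperature")
print(summary)
```

Applied to the notebook's data, the call would be `missing_summary(global_temp_land, 'dt', 'LandAverageTemperature')`.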

In [None]:
global_temp_land_cleaned = global_temp_land.dropna()

### 2.1.4 Analyzing Land and Ocean Data & Handling Missing Values

In [None]:
# from the info on the dataframe, we see that there are some null values, which we need to clean
global_temp_land_and_ocean.info()

In [None]:
# getting summary statistics of the central tendency, dispersion, and shape of the distribution of the global_temp_land_and_ocean data
global_temp_land_and_ocean.describe()

There are no nulls, so we can proceed with loading the other datasets.

In [None]:
# creating a copy of global_temp_land_and_ocean to be global_temp_land_and_ocean_cleaned because there are no null values, and we want to keep variable names uniform
global_temp_land_and_ocean_cleaned = global_temp_land_and_ocean.copy()

## 2.2 Loading & Preprocessing Global Temperatures by State Data

This section of the notebook explores the Kaggle GlobalLandTemperaturesByState.csv file (Global Average Land Temperature by State). It includes information on:

- Date (dt): Starting from 1855 to 2013.
- AverageTemperature: Represents the average land temperature in degrees Celsius.
- AverageTemperatureUncertainty: Indicates the 95% confidence interval around the average temperature.
- State: State that the temperature represents.
- Country: Country the state belongs to.

We load the dataset into our notebook, and check that all cells are correct and present. Then, we clean the dataset:
- We convert the dt column into DateTime objects.
- We analyze the missing temperature data based on date and country.

### 2.2.1 Loading Data

In [None]:
# reading in the csv file
global_temp_state = pd.read_csv('/content/drive/MyDrive/CIS545/CIS545 Final Project/data/GlobalLandTemperaturesByState.csv')

### 2.2.2 Analyzing Data Structure

In [None]:
# getting the earliest data, to check that it was properly imported, and contains specified information
global_temp_state.head(5)

In [None]:
# getting the latest data, to check that it was properly imported, and contains specified information
global_temp_state.tail(5)

In [None]:
# checking that the file was properly imported and contains correct data
global_temp_state.info()

In [None]:
# convert 'dt' to datetime
global_temp_state['dt'] = pd.to_datetime(global_temp_state['dt'])

In [None]:
# get a summary of the central tendency, dispersion, and shape of the distribution of the numerical columns in the dataframe
global_temp_state.describe()

In [None]:
# calculating the number of unique states present in the 'State' column
len(global_temp_state['State'].unique().tolist())

In [None]:
# determining the unique countries present in the 'Country' column
global_temp_state['Country'].unique().tolist()

In [None]:
# calculating the number of unique countries present in the 'Country' column
len(global_temp_state['Country'].unique().tolist())

After loading this data, and analyzing it, we discovered that this dataset presents information on the global temperatures for 7 countries and a total of 241 states within these countries.

These countries are:
- Brazil
- Russia
- United States
- China
- India
- Canada
- Australia

While this data isn't comprehensive for representing the entire world, it provides us with temperature information for countries that span the world.

### 2.2.3 Analyzing Data & Handling Missing Values

In [None]:
# getting all rows that have null values
global_temp_state_null = global_temp_state[global_temp_state.isnull().any(axis=1)]

In [None]:
# getting a concise summary of the global_temp_state_null dataframe, to see what values are null
global_temp_state_null.info()

In [None]:
# getting summary statistics of the data to better understand what data these rows hold
global_temp_state_null.describe()

We will plot the 'dt' column of the missing values on a histogram to see if there are any patterns.

In [None]:
# plotting histogram of dates with missing temperatures using matplotlib
plt.figure(figsize=(10, 6))
plt.hist(global_temp_state_null['dt'], bins=50, edgecolor='black')
plt.title('Figure 2.2.3a: Histogram of Dates with Missing Temperature Data')
plt.xlabel('Date')
plt.ylabel('Frequency')
plt.xticks(rotation=45)
plt.show()

There are 25648 rows with missing values out of 645675 rows ($3.97$% of the data), and according to Figure 2.2.3a, these rows are concentrated in the earlier time periods. Because they make up a small part of our dataset, we will drop them for analysis.
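
The concentration of missing rows in early periods can also be expressed as a number per decade. The sketch below uses invented rows; `decade` and `missing_by_decade` are hypothetical names, and the real check would use `global_temp_state` in place of `toy`.

```python
import numpy as np
import pandas as pd

# Toy data: both 1850s rows are missing, later rows are complete.
toy = pd.DataFrame({
    "dt": pd.to_datetime(["1855-01-01", "1856-01-01", "1861-01-01", "1990-01-01", "1995-01-01"]),
    "AverageTemperature": [np.nan, np.nan, 1.0, 2.0, 3.0],
})

# Floor each year to its decade, e.g. 1856 -> 1850.
decade = (toy["dt"].dt.year // 10) * 10

# The mean of a boolean mask is the fraction of True values, i.e. the missing rate.
missing_by_decade = toy["AverageTemperature"].isna().groupby(decade).mean()
print(missing_by_decade)
```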

We repeat this for the 'State' column.

In [None]:
# plotting histogram of states with missing temperatures using matplotlib
plt.figure(figsize=(50, 20))
plt.hist(global_temp_state_null['State'], bins=50, edgecolor='black')
plt.title('Figure 2.2.3b: Histogram of States with Missing Temperature Data')
plt.xlabel('State')
plt.ylabel('Frequency')
plt.xticks(rotation=90)
plt.show()

These are the same 25648 rows with missing values ($3.97$% of the data), now broken down by state. As depicted in Figure 2.2.3b, they span states across the dataset. We will drop them because they make up such a small percentage of our data.

We repeat this for the 'Country' column.

In [None]:
# plotting histogram of countries with missing temperatures using matplotlib
plt.figure(figsize=(10, 6))
plt.hist(global_temp_state_null['Country'], bins=50, edgecolor='black')
plt.title('Figure 2.2.3c: Histogram of Countries with Missing Temperature Data')
plt.xlabel('Country')
plt.ylabel('Frequency')
plt.xticks(rotation=45)
plt.show()

This figure shows that the missing data is relatively evenly distributed among all countries, so we can drop it.
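
Raw counts per country can be misleading when countries contribute different numbers of rows, so it may help to compare missing rates as well. This is a sketch on made-up rows, not the notebook's own method; `missing_count` and `missing_rate` are names introduced for illustration.

```python
import numpy as np
import pandas as pd

# Toy data with a different number of rows per country.
toy = pd.DataFrame({
    "Country": ["Brazil", "Brazil", "India", "India", "India", "Canada"],
    "AverageTemperature": [np.nan, 25.0, np.nan, 27.0, 28.0, -3.0],
})

is_missing = toy["AverageTemperature"].isna()

# Number of missing rows per country (what a per-category bar chart would show).
missing_count = is_missing.groupby(toy["Country"]).sum()

# Share of each country's rows that are missing (comparable across countries).
missing_rate = is_missing.groupby(toy["Country"]).mean()
print(missing_count, missing_rate, sep="\n")
```

In this toy example Brazil and India have the same missing count but different missing rates, which is the distinction the rate is meant to expose.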

In [None]:
global_temp_state_cleaned = global_temp_state.dropna()

## 2.3 Loading & Preprocessing Global Temperatures by Country Data

This section of the notebook explores the Kaggle GlobalLandTemperaturesByCountry.csv file (Global Average Land Temperature by Country). It includes information on:

- Date (dt): Starting from 1855 to 2013.
- AverageTemperature: Represents the average land temperature in degrees Celsius.
- AverageTemperatureUncertainty: Indicates the 95% confidence interval around the average temperature.
- Country: Country the temperature was collected in.

We load the dataset into our notebook, and check that all cells are correct and present. Then, we clean the dataset:
- We convert the dt column into DateTime objects.
- We analyze the missing temperature data based on date and country.

### 2.3.1 Loading Data

In [None]:
# reading in the csv file
global_temp_country = pd.read_csv('/content/drive/MyDrive/CIS545/CIS545 Final Project/data/GlobalLandTemperaturesByCountry.csv')

### 2.3.2 Analyzing Data Structure

In [None]:
# getting the earliest data, to check that it was properly imported, and contains specified information
global_temp_country.head(5)

In [None]:
# getting the latest data, to check that it was properly imported, and contains specified information
global_temp_country.tail(5)

In [None]:
# checking that the file was properly imported and contains correct data
global_temp_country.info()

In [None]:
# convert 'dt' to datetime
global_temp_country['dt'] = pd.to_datetime(global_temp_country['dt'])

In [None]:
# get a summary of the central tendency, dispersion, and shape of the distribution of the numerical columns in the dataframe
global_temp_country.describe()

In [None]:
# calculating the number of unique countries present in the 'Country' column
len(global_temp_country['Country'].unique().tolist())

After loading this data, and analyzing it, we discovered that this dataset presents information on the global temperatures for 243 countries.

This data is comprehensive for representing the entire world. The UN recognizes 251 countries and territories, which is close to the 243 that are represented in this dataset. However, we noticed during EDA that continents are listed as countries in this dataset, so we remove them in the next step.

In [None]:
continents = ['Asia', 'Europe', 'Africa', 'North America', 'South America', 'Oceania', 'Antarctica']

global_temp_country = global_temp_country[~global_temp_country['Country'].isin(continents)]

### 2.3.3 Analyzing Data & Handling Missing Values

In [None]:
# getting all rows that have null values
global_temp_country_null = global_temp_country[global_temp_country.isnull().any(axis=1)]

In [None]:
# getting a concise summary of the global_temp_country_null dataframe, to see what values are null
global_temp_country_null.info()

In [None]:
# plotting histogram of dates with missing temperatures using matplotlib
plt.figure(figsize=(10, 6))
plt.hist(global_temp_country_null['dt'], bins=50, edgecolor='black')
plt.title('Figure 2.3.3a: Histogram of Dates with Missing Temperature Data')
plt.xlabel('Date')
plt.ylabel('Frequency')
plt.xticks(rotation=45)
plt.show()

There are 32651 rows with missing values out of 577462 rows ($5.65$% of the data). As before, the null values make up a small percentage of our dataset, so we will drop them.

We repeat this for the 'Country' column.

In [None]:
# plotting histogram of countries with missing temperatures using matplotlib
plt.figure(figsize=(50, 20))
plt.hist(global_temp_country_null['Country'], bins=50, edgecolor='black')
plt.title('Figure 2.3.3b: Histogram of Countries with Missing Temperature Data')
plt.xlabel('Country')
plt.ylabel('Frequency')
plt.xticks(rotation=90)
plt.show()

These are the same 32651 rows with missing values ($5.65$% of the data), now broken down by country. Since the affected countries are spread throughout the world, we drop the null rows in our cleaned data.

In [None]:
global_temp_country_cleaned = global_temp_country.dropna()

## 2.4 Loading & Preprocessing Global Temperatures by City Data

This section of the notebook explores the Kaggle GlobalLandTemperaturesByCity.csv file (Global Land Temperatures by City). It includes information on:

- Date (dt): Starting from 1743 to 2013.
- AverageTemperature: Represents the average land temperature in degrees Celsius.
- AverageTemperatureUncertainty: Indicates the 95% confidence interval around the average temperature.
- City: City that the temperature represents.
- Country: Country the city belongs to.
- Latitude: Latitudinal coordinates for the city.
- Longitude: Longitudinal coordinates for the city.

We load the dataset into our notebook, and check that all cells are correct and present. Then, we clean the dataset:
- We convert the dt column into DateTime objects.
- We analyze the missing temperature data based on date and country.

### 2.4.1 Loading Data

In [None]:
# reading in the csv file
global_temp_city = pd.read_csv('/content/drive/MyDrive/CIS545/CIS545 Final Project/data/GlobalLandTemperaturesByCity.csv')

### 2.4.2 Analyzing Data Structure

In [None]:
# getting the earliest data, to check that it was properly imported, and contains specified information
global_temp_city.head(10)

In [None]:
# getting the latest data, to check that it was properly imported, and contains specified information
global_temp_city.tail(10)

In [None]:
# checking that the file was properly imported and contains correct data
global_temp_city.info()

In [None]:
# convert dt to a datetime object
global_temp_city['dt'] = pd.to_datetime(global_temp_city['dt'])

In [None]:
# get a summary of the central tendency, dispersion, and shape of the distribution of the numerical columns in the dataframe
global_temp_city.describe()

In [None]:
# determining the number of unique cities in our dataset
len(global_temp_city['City'].unique().tolist())

In [None]:
# determining the number of unique countries in our dataset
len(global_temp_city['Country'].unique().tolist())

From the above analysis, we know that our dataset spans 3,448 cities in 159 countries. This is an extremely comprehensive dataset that covers countries around the globe, and it can be used to analyze temperature change for these cities and the countries they lie in.

### 2.4.3 Analyzing Data & Handling Missing Values

Since Latitude and Longitude are of type object due to their trailing hemisphere letters (N/S and E/W), we will strip those letters and convert both columns to signed floats for easier visualization.
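
As a compact illustration of this conversion, the sketch below maps the trailing letter to a sign and multiplies it by the numeric part. `to_signed_degrees` is a hypothetical helper, the sample coordinates are invented, and it assumes every value ends in exactly one of N, S, E, or W.

```python
import pandas as pd

def to_signed_degrees(coords):
    """Convert strings like '57.05N' or '10.33W' to signed floats.

    South and West become negative; assumes one trailing hemisphere letter.
    """
    magnitude = coords.str[:-1].astype(float)
    sign = coords.str[-1].map({"N": 1.0, "E": 1.0, "S": -1.0, "W": -1.0})
    return magnitude * sign

lat = to_signed_degrees(pd.Series(["57.05N", "23.31S"]))
lon = to_signed_degrees(pd.Series(["10.33E", "46.31W"]))
print(lat.tolist(), lon.tolist())
```

The cell that follows reaches the same result in two steps: it first prefixes a minus sign, then strips the letters.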

In [None]:
# prefix a minus sign to southern latitudes and western longitudes
global_temp_city['Latitude'] = np.where(global_temp_city['Latitude'].str.contains('S'),
                                        '-' + global_temp_city['Latitude'],
                                        global_temp_city['Latitude'])
global_temp_city['Longitude'] = np.where(global_temp_city['Longitude'].str.contains('W'),
                                         '-' + global_temp_city['Longitude'],
                                         global_temp_city['Longitude'])

# strip the hemisphere letters and convert Latitude and Longitude to type float
global_temp_city['Latitude'] = global_temp_city['Latitude'].str.replace('N', '').str.replace('S', '').astype(float)
global_temp_city['Longitude'] = global_temp_city['Longitude'].str.replace('E', '').str.replace('W', '').astype(float)

Take a look at the nulls and their distribution before deciding to drop them or impute them.

In [None]:
# creating a new dataframe that contains only the null rows from the original dataframe
global_temp_city_null = global_temp_city[global_temp_city.isnull().any(axis=1)]

In [None]:
# getting summary of the dataframe, including the data types of each column and the number of non-null values
global_temp_city_null.info()

In [None]:
# plotting histogram of dates with missing temperatures using matplotlib
plt.figure(figsize=(10, 6))
plt.hist(global_temp_city_null['dt'], bins=50, edgecolor='black')
plt.title('Figure 2.4.3a: Histogram of Dates with Missing Temperature Data')
plt.xlabel('Date')
plt.ylabel('Frequency')
plt.xticks(rotation=45)
plt.show()

Like in the other datasets, the rows with missing temperature data are concentrated in the earlier time periods. Specifically, 364130 out of 8599212 rows contain a null temperature value (4.23%). Dropping these rows still maintains the majority of our dataset, so we will remove them from our data.

In [None]:
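# plotting histogram of cities with missing temperatures using matplotlib
plt.figure(figsize=(50, 20))
# sampling 150 of the null rows at random (without replacement) to keep the plot readable
selected_cities = np.random.choice(global_temp_city_null['City'], size=150, replace=False)
plt.hist(selected_cities, bins=150, edgecolor='black')
plt.title('Figure 2.4.3b: Histogram of Cities with Missing Temperature Data')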
plt.xlabel('City')
plt.ylabel('Frequency')
plt.xticks(rotation=90)
plt.show()

Since the null values account for 4.23% of our dataset, we will drop them.

In [None]:
global_temp_city_cleaned = global_temp_city.dropna()

## 2.5 Loading & Preprocessing Global Temperatures by Major City Data

This section of the notebook explores the Kaggle GlobalLandTemperaturesByMajorCity.csv file (Global Average Land Temperature by Major City). It includes information on:

- Date (dt): Starting from 1855 to 2013.
- AverageTemperature: Represents the average land temperature in degrees Celsius.
- AverageTemperatureUncertainty: Indicates the 95% confidence interval around the average temperature.
- City: City that the temperature represents.
- Country: Country the city belongs to.
- Latitude: Latitudinal coordinate of the city
- Longitude: Longitudinal coordinate of the city

We load the dataset into our notebook, and check that all cells are correct and present. Then, we clean the dataset:
- We convert the dt column into DateTime objects.
- We analyze the missing temperature data based on date and city.

### 2.5.1 Loading Data

In [None]:
# reading in the csv file
global_temp_major_city = pd.read_csv('/content/drive/MyDrive/CIS545/CIS545 Final Project/data/GlobalLandTemperaturesByMajorCity.csv')

### 2.5.2 Analyzing Data Structure

In [None]:
# getting the earliest data, to check that it was properly imported, and contains specified information
global_temp_major_city.head(10)

In [None]:
# getting the latest data, to check that it was properly imported, and contains specified information
global_temp_major_city.tail(10)

In [None]:
# checking that the file was properly imported and contains correct data
global_temp_major_city.info()

In [None]:
# convert 'dt' to datetime
global_temp_major_city['dt'] = pd.to_datetime(global_temp_major_city['dt'])

In [None]:
# get a summary of the central tendency, dispersion, and shape of the distribution of the numerical columns in the dataframe
global_temp_major_city.describe()

In [None]:
# calculating the number of unique cities present in the 'City' column
len(global_temp_major_city['City'].unique().tolist())

In [None]:
# calculating the number of unique countries present in the 'Country' column
len(global_temp_major_city['Country'].unique().tolist())

After loading this data and analyzing it, we discovered that this dataset presents information on the global temperatures for 100 cities in 49 countries.

While this data isn't as comprehensive as the Cities or Countries file for representing the entire world, it provides us with temperature information for major cities, which can be used to determine how temperature changes will affect areas with higher populations.

### 2.5.3 Analyzing Data & Handling Missing Values

As with the Global Temperatures by City data above, we will convert Latitude and Longitude to type float.

In [None]:
# prefix a minus sign to southern latitudes and western longitudes
global_temp_major_city['Latitude'] = np.where(global_temp_major_city['Latitude'].str.contains('S'),
                                        '-' + global_temp_major_city['Latitude'],
                                        global_temp_major_city['Latitude'])
global_temp_major_city['Longitude'] = np.where(global_temp_major_city['Longitude'].str.contains('W'),
                                         '-' + global_temp_major_city['Longitude'],
                                         global_temp_major_city['Longitude'])

# strip the hemisphere letters and convert Latitude and Longitude to type float
global_temp_major_city['Latitude'] = global_temp_major_city['Latitude'].str.replace('N', '').str.replace('S', '').astype(float)
global_temp_major_city['Longitude'] = global_temp_major_city['Longitude'].str.replace('E', '').str.replace('W', '').astype(float)

In [None]:
# getting all rows that have null values
global_temp_major_city_null = global_temp_major_city[global_temp_major_city.isnull().any(axis=1)]

In [None]:
# getting a concise summary of the global_temp_major_city_null dataframe, to see what values are null
global_temp_major_city_null.info()

In [None]:
# plotting histogram of dates with missing temperatures using matplotlib
plt.figure(figsize=(10, 6))
plt.hist(global_temp_major_city_null['dt'], bins=50, edgecolor='black')
plt.title('Figure 2.5.3a: Histogram of Dates with Missing Temperature Data')
plt.xlabel('Date')
plt.ylabel('Frequency')
plt.xticks(rotation=45)
plt.show()

The rows with missing values account for 4.60% of the dataset (11002 out of 239117 rows). We will drop them.

In [None]:
# plotting histogram of major cities with missing temperatures using matplotlib
plt.figure(figsize=(50, 20))
plt.hist(global_temp_major_city_null['City'], bins=50, edgecolor='black')
plt.title('Figure 2.5.3b: Histogram of Major Cities with Missing Temperature Data')
plt.xlabel('Major City')
plt.ylabel('Frequency')
plt.xticks(rotation=90)
plt.show()

Similar to the previous datasets, the major cities that have missing temperature data are distributed around the world, and take up an insignificant fraction of our dataset. We will drop them.

In [None]:
global_temp_major_city_cleaned = global_temp_major_city.dropna()

## 2.6 Copying Dataframes for Modeling

We will make copies of the `global_temp_land_cleaned`, `global_temp_land_and_ocean_cleaned`, and `global_temp_city_cleaned` dataframes for modeling.

In [None]:
global_temp_land_cleaned_modeling = global_temp_land_cleaned.copy()
global_temp_land_and_ocean_cleaned_modeling = global_temp_land_and_ocean_cleaned.copy()
global_temp_city_cleaned_modeling = global_temp_city_cleaned.copy()

# Part 3: Exploratory Data Analysis

## 3.1 EDA in Global Land Temperatures Data

The following section conducts exploratory data analysis on the Global Land Temperatures Data.

### 3.1.1 Visualize Trends in Land Data

The code below draws a line plot of land average temperature over time. The plot includes two lines:

- Monthly Average Temperature: This line represents the monthly average land temperature over time. It is drawn with some transparency (alpha=0.5) so that the rolling average plotted on top of it stays visible.

- 10-Year Rolling Average: This line represents the 10-year rolling average of the land temperature, computed over a window of 120 monthly rows. It smooths out short-term fluctuations and highlights long-term trends.
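
To make the rolling average concrete, here is a minimal sketch on a five-value toy series with a window of 3 instead of 120. The values are invented; the point is that the first `window - 1` entries are NaN because a full window is not yet available.

```python
import pandas as pd

# Toy series standing in for monthly temperatures.
monthly = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Each output value is the mean of the current row and the two rows before it.
rolling = monthly.rolling(window=3).mean()
print(rolling)
```

With monthly rows, `window=120` covers ten years only if the rows are consecutive months, which is worth keeping in mind after rows have been dropped.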

In [None]:
# calculate the 10-year rolling average (window of 120 monthly rows)
global_temp_land_cleaned['10_year_rolling_avg'] = global_temp_land_cleaned['LandAverageTemperature'].rolling(window=120).mean()

# Plot monthly average temperature and 10-year rolling average
plt.figure(figsize=(15, 6))
plt.plot(global_temp_land_cleaned['dt'], global_temp_land_cleaned['LandAverageTemperature'], label='Monthly Average Temperature', alpha=0.5)
plt.plot(global_temp_land_cleaned['dt'], global_temp_land_cleaned['10_year_rolling_avg'], label='10-Year Rolling Average', color='red', linewidth=2)
plt.title('Land Average Temperature Over Time with 10-Year Rolling Average')
plt.xlabel('Year')
plt.ylabel('Temperature (°C)')
plt.legend()
plt.show()

No strong pattern stands out at this scale: the rolling average shows a slight upward trend, but it is not pronounced enough in this plot to call significant.

### 3.1.2 Interactive Graph

This code creates an interactive Plotly line plot to illustrate the average land temperature and its uncertainty over the years. It first calculates the mean temperature and uncertainty for each year, then utilizes Plotly to generate a plot with two traces representing upper and lower uncertainty bounds, along with a trace for average temperature. The resulting interactive plot allows for a dynamic exploration of temperature trends.
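
The yearly means and the uncertainty band can also be computed with a single `groupby` instead of a loop over years. This is a sketch of that alternative on invented values, not the notebook's own implementation; `yearly`, `upper`, and `lower` are names introduced here.

```python
import pandas as pd

# Toy monthly rows for two years.
toy = pd.DataFrame({
    "dt": pd.to_datetime(["2000-01-01", "2000-07-01", "2001-01-01", "2001-07-01"]),
    "LandAverageTemperature": [2.0, 14.0, 3.0, 15.0],
    "LandAverageTemperatureUncertainty": [0.1, 0.3, 0.2, 0.2],
})

# Mean temperature and mean uncertainty per calendar year.
cols = ["LandAverageTemperature", "LandAverageTemperatureUncertainty"]
yearly = toy.groupby(toy["dt"].dt.year)[cols].mean()

# Upper and lower edges of the uncertainty band around the yearly mean.
upper = yearly["LandAverageTemperature"] + yearly["LandAverageTemperatureUncertainty"]
lower = yearly["LandAverageTemperature"] - yearly["LandAverageTemperatureUncertainty"]
print(yearly)
```

The index of `yearly` holds the years, so `yearly.index`, `upper`, and `lower` could be passed to the Plotly traces as the x and y values.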

In [None]:
# Ensure 'dt' is datetime, then collect the distinct years
global_temp_land_cleaned['dt'] = pd.to_datetime(global_temp_land_cleaned['dt'])
years = np.unique(global_temp_land_cleaned['dt'].dt.year)
mean_temp_world = []
mean_temp_world_uncertainty = []

# Calculate mean temperature and uncertainty for each year
for year in years:
    mean_temp_world.append(global_temp_land_cleaned[global_temp_land_cleaned['dt'].dt.year == year]['LandAverageTemperature'].mean())
    mean_temp_world_uncertainty.append(global_temp_land_cleaned[global_temp_land_cleaned['dt'].dt.year == year]['LandAverageTemperatureUncertainty'].mean())

# Create a Scatter plot for uncertainty (top)
trace0 = go.Scatter(
    x = years,
    y = np.array(mean_temp_world) + np.array(mean_temp_world_uncertainty),
    fill= None,
    mode='lines',
    name='Uncertainty top',
    line=dict(
                color='rgb(0, 255, 255)',
    )
)

# Create a Scatter plot for uncertainty (bottom)
trace1 = go.Scatter(
    x = years,
    y = np.array(mean_temp_world) - np.array(mean_temp_world_uncertainty),
    fill='tonexty',
    mode='lines',
    name='Uncertainty bot',
    line=dict(
        color='rgb(0, 255, 255)',
    )
)

# Create a Scatter plot for average temperature
trace2 = go.Scatter(
    x = years,
    y = mean_temp_world,
    name='Average Temperature',
    line=dict(
        color='rgb(199, 121, 093)',
    )
)

# Combine traces into data list
data = [trace0, trace1, trace2]

# Create plot
layout = go.Layout(
    xaxis=dict(title='year'),
    yaxis=dict(title='Average Temperature, °C'),
    title='Average land temperature in world',
    showlegend = False)

fig = go.Figure(data=data, layout=layout)
py.iplot(fig)

From the plot, we can see that the average land temperature of the world stayed relatively consistent from 1750 to the 1950s. Since then, however, there has been a steady upward trend. The plot below compares average land temperature from 1900 to 1950 with average land temperature from 1950 to 2015.

In [None]:
# Extract the year from a date
years = np.unique(global_temp_land_cleaned['dt'].dt.year)
mean_temp_world = []
mean_temp_world_uncertainty = []

# Calculate mean temperature and uncertainty for each year
for year in years:
    mean_temp_world.append(global_temp_land_cleaned[global_temp_land_cleaned['dt'].dt.year == year]['LandAverageTemperature'].mean())
    mean_temp_world_uncertainty.append(global_temp_land_cleaned[global_temp_land_cleaned['dt'].dt.year == year]['LandAverageTemperatureUncertainty'].mean())

# Filter data for two date ranges
years_1900_1950 = np.unique(global_temp_land_cleaned[(global_temp_land_cleaned['dt'].dt.year >= 1900) & (global_temp_land_cleaned['dt'].dt.year <= 1950)]['dt'].dt.year)
years_1950_2015 = np.unique(global_temp_land_cleaned[(global_temp_land_cleaned['dt'].dt.year > 1950) & (global_temp_land_cleaned['dt'].dt.year <= 2015)]['dt'].dt.year)

# Create a Scatter plot for uncertainty (top) - 1900-1950
trace0_1900_1950 = go.Scatter(
    x=years_1900_1950,
    y=np.array(mean_temp_world)[years_1900_1950 - years.min()] + np.array(mean_temp_world_uncertainty)[years_1900_1950 - years.min()],
    fill=None,
    mode='lines',
    name='Uncertainty top (1900-1950)',
    line=dict(
        color='rgb(0, 255, 255)',
    )
)

# Create a Scatter plot for uncertainty (bottom) - 1900-1950
trace1_1900_1950 = go.Scatter(
    x=years_1900_1950,
    y=np.array(mean_temp_world)[years_1900_1950 - years.min()] - np.array(mean_temp_world_uncertainty)[years_1900_1950 - years.min()],
    fill='tonexty',
    mode='lines',
    name='Uncertainty bot (1900-1950)',
    line=dict(
        color='rgb(0, 255, 255)',
    )
)

# Create a Scatter plot for average temperature - 1900-1950
trace2_1900_1950 = go.Scatter(
    x=years_1900_1950,
    y=np.array(mean_temp_world)[years_1900_1950 - years.min()],
    name='Average Temperature (1900-1950)',
    line=dict(
        color='rgb(199, 121, 093)',
    )
)

# Create a Scatter plot for uncertainty (top) - 1950-2015
trace0_1950_2015 = go.Scatter(
    x=years_1950_2015,
    y=np.array(mean_temp_world)[years_1950_2015 - years.min()] + np.array(mean_temp_world_uncertainty)[years_1950_2015 - years.min()],
    fill=None,
    mode='lines',
    name='Uncertainty top (1950-2015)',
    line=dict(
        color='rgb(255, 0, 0)',
    )
)

# Create a Scatter plot for uncertainty (bottom) - 1950-2015
trace1_1950_2015 = go.Scatter(
    x=years_1950_2015,
    y=np.array(mean_temp_world)[years_1950_2015 - years.min()] - np.array(mean_temp_world_uncertainty)[years_1950_2015 - years.min()],
    fill='tonexty',
    mode='lines',
    name='Uncertainty bot (1950-2015)',
    line=dict(
        color='rgb(255, 0, 0)',
    )
)

# Create a Scatter plot for average temperature - 1950-2015
trace2_1950_2015 = go.Scatter(
    x=years_1950_2015,
    y=np.array(mean_temp_world)[years_1950_2015 - years.min()],
    name='Average Temperature (1950-2015)',
    line=dict(
        color='rgb(0, 128, 0)',
    )
)

# Combine traces into data list for both date ranges
data = [trace0_1900_1950, trace1_1900_1950, trace2_1900_1950, trace0_1950_2015, trace1_1950_2015, trace2_1950_2015]

# Create plot for both date ranges
layout = go.Layout(
    xaxis=dict(title='year'),
    yaxis=dict(title='Average Temperature, °C'),
    title='Average land temperature in world',
    showlegend=False
)

fig = go.Figure(data=data, layout=layout)

# Display the plot
py.iplot(fig)

Clearly, the two periods behave differently: from 1900 to 1950 the average temperature fluctuates around 8.5 °C, while from 1950 to 2015 it rises steadily to almost 10 °C.
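One way to put numbers on this visual comparison is to compute the mean and a fitted linear slope for each period. The sketch below uses a synthetic yearly series (flat until 1950, then warming), and `period_summary` is a hypothetical helper introduced only for illustration; with our data, the yearly means computed above would take the place of `temps`.

```python
import numpy as np

def period_summary(years, temps, start, end):
    """Return (mean, slope in degrees C per decade) for start <= year <= end."""
    years = np.asarray(years, dtype=float)
    temps = np.asarray(temps, dtype=float)
    mask = (years >= start) & (years <= end)
    # Least-squares line through the selected points; [0] is the slope per year
    slope_per_year = np.polyfit(years[mask], temps[mask], 1)[0]
    return temps[mask].mean(), slope_per_year * 10

# Synthetic yearly series: flat until 1950, then warming by 0.02 degrees C per year
years = np.arange(1900, 2016)
temps = 8.5 + np.where(years > 1950, 0.02 * (years - 1950), 0.0)

print(period_summary(years, temps, 1900, 1950))
print(period_summary(years, temps, 1951, 2015))
```

A slope per decade is easier to compare across periods of different length than the raw change in temperature.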

### 3.1.3 Stationarity in Global Land Temperatures Data

Many time series models assume that the data is stationary, meaning the mean and variance do not change over time. If this property is violated, the data has some inherent trend, which is what we expect in climate change data. To check, we set alpha to 0.05 and run the Augmented Dickey-Fuller (ADF) test using the *adfuller* function from statsmodels. Our null hypothesis is that the data is not stationary (it has a unit root).

In [None]:
# Checking the hypothesis
print("The p-value for the ADF test in global_temp_land_cleaned is ", adfuller(global_temp_land_cleaned['LandAverageTemperature'])[1])

Since the p-value for *global_temp_land_cleaned* is less than alpha (0.05), **we reject the null hypothesis**. The *global_temp_land_cleaned* data is stationary. We can proceed with autocorrelations check.
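To make the logic of the test more concrete, here is a simplified, non-augmented Dickey-Fuller regression written in plain NumPy. This is a sketch, not what *adfuller* does internally: *adfuller* also adds lagged differences and reports a p-value, whereas this version only compares the t-statistic to an approximate 5% critical value of -2.86 (large sample, constant-only case). The function name and the synthetic series are assumptions for illustration.

```python
import numpy as np

def dickey_fuller_stat(y):
    """t-statistic of rho in the regression diff(y)[t] = c + rho * y[t-1] + error."""
    y = np.asarray(y, dtype=float)
    dy = np.diff(y)
    X = np.column_stack([np.ones(len(dy)), y[:-1]])  # intercept and lagged level
    beta, *_ = np.linalg.lstsq(X, dy, rcond=None)
    resid = dy - X @ beta
    sigma2 = resid @ resid / (len(dy) - X.shape[1])  # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)            # OLS covariance matrix
    return beta[1] / np.sqrt(cov[1, 1])

rng = np.random.default_rng(0)
noise = rng.normal(size=500)            # stationary by construction
walk = np.cumsum(rng.normal(size=500))  # random walk, has a unit root

CRITICAL_5PCT = -2.86  # approximate value, see the assumptions above
for name, series in [("noise", noise), ("walk", walk)]:
    stat = dickey_fuller_stat(series)
    verdict = "reject unit root" if stat < CRITICAL_5PCT else "fail to reject"
    print(name, round(stat, 2), verdict)
```

A strongly negative statistic means the series is pulled back toward its mean, which is the behavior the test treats as evidence against a unit root.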

### 3.1.4 Autocorrelations Check

Autocorrelation is the correlation between observations at different points in a time series. When it is present, past values may influence the current value. The following code plots the autocorrelation and partial autocorrelation of the 'LandAverageTemperature' series to reveal potential temporal patterns and dependencies in the data.

In [None]:
# Creating autocorrelation plot
plot_acf(global_temp_land_cleaned['LandAverageTemperature'])
plot_pacf(global_temp_land_cleaned['LandAverageTemperature'])
plt.show()

In the autocorrelation plot, we see significant spikes in a repeating pattern. This suggests seasonality in the land temperatures, which is expected because monthly temperatures follow the yearly cycle of the seasons.

In the partial autocorrelation plot, we see the same wave-like pattern, which highlights the presence of seasonality. It's interesting to see that the beginning of the plot has much higher magnitudes, and over time, the peaks get smaller in magnitude (i.e. there exists a dampening effect). This seems to mean that the influence of past cycles is weakening over time.
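The wave-like pattern can be reproduced on a toy example. The sketch below computes the sample autocorrelation by hand for a synthetic series with a 12-month cycle; `sample_acf` is a hypothetical helper written for illustration and is not the routine behind `plot_acf`.

```python
import numpy as np

def sample_acf(x, lag):
    """Sample autocorrelation of x at a non-negative lag."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    if lag == 0:
        return 1.0
    # Products of values that are `lag` steps apart, scaled by the total variance
    return float(np.sum(centered[:-lag] * centered[lag:]) / np.sum(centered ** 2))

# Synthetic monthly series with a 12-month cycle (20 years, no trend or noise)
months = np.arange(240)
seasonal = 10 * np.sin(2 * np.pi * months / 12)

for lag in (1, 6, 12):
    print(lag, round(sample_acf(seasonal, lag), 3))
```

For a yearly cycle, values half a year apart are strongly negatively correlated and values a full year apart are strongly positively correlated, which is what produces the alternating spikes.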

## 3.2 EDA in Global Land and Ocean Temperatures Data

The following section conducts exploratory data analysis on the Global Land and Ocean Temperatures Data.

### 3.2.1 Visualize Trends in Land and Ocean Data

The code below creates a line plot to visualize the trend in land and ocean average temperature over time. The plot includes two lines:

- Monthly Average Temperature: This line represents the monthly average land and ocean temperature over time. It is drawn with some transparency (alpha=0.5) so that it does not obscure the rolling average.

- 10-Year Rolling Average: This line represents the 10-year rolling average of the land and ocean temperature. It is calculated to smooth out short-term fluctuations and highlight long-term trends.
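A detail worth knowing about the rolling average: with `window=120` on monthly data, pandas returns NaN until a full 10 years of observations are available. The sketch below shows this on a synthetic monthly series (the data and variable names are assumptions for illustration) and also shows the optional `min_periods` argument, which we do not use in this notebook.

```python
import numpy as np
import pandas as pd

# 20 years of synthetic monthly values: a slow trend plus a 12-month cycle
months = pd.date_range("1990-01-01", periods=240, freq="MS")
values = 15 + 0.002 * np.arange(240) + 3 * np.sin(2 * np.pi * np.arange(240) / 12)
monthly = pd.Series(values, index=months)

# window=120 months is 10 years; the first 119 results are NaN by default
rolling_10y = monthly.rolling(window=120).mean()

# min_periods lets the average start earlier, from fewer observations
rolling_partial = monthly.rolling(window=120, min_periods=12).mean()

print(rolling_10y.isna().sum(), rolling_partial.isna().sum())
```

This is why a rolling-average line starts later than the monthly line when the default settings are used.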

In [None]:
# Calculate 10_year_rolling_avg
global_temp_land_and_ocean_cleaned['10_year_rolling_avg_ocean'] = global_temp_land_and_ocean_cleaned['LandAndOceanAverageTemperature'].rolling(window=120).mean()

# Plot monthly average temperature and 10-year rolling average
plt.figure(figsize=(15, 6))
plt.plot(global_temp_land_and_ocean_cleaned['dt'], global_temp_land_and_ocean_cleaned['LandAndOceanAverageTemperature'], label='Monthly Average Temperature', alpha=0.5)
plt.plot(global_temp_land_and_ocean_cleaned['dt'], global_temp_land_and_ocean_cleaned['10_year_rolling_avg_ocean'], label='10-Year Rolling Average', color='red', linewidth=2)
plt.title('Land and Ocean Average Temperature Over Time with 10 Years Rolling Average')
plt.xlabel('Year')
plt.ylabel('Temperature (°C)')
plt.legend()
plt.show()

Compared to the land temperature trend, the land and ocean temperature trend has a noticeably steeper slope in our plots, suggesting that ocean temperature has increased more over time than land temperature. This is concerning, since about 71 percent of Earth's surface is water. The visible trend also hints that the land and ocean series may be non-stationary.

Additionally, warming is often noticed earlier in ocean temperatures. This means that the steeper slope in this graph may be indicative of more extreme changes in land temperatures in the next few years.

### 3.2.2 Interactive Graph

This code creates another interactive Plotly line plot to illustrate the average land and ocean temperatures and their uncertainty over the years. The resulting interactive plot allows for a dynamic exploration of temperature trends.

In [None]:
# Extract the year from a date
years_ocean = np.unique(global_temp_land_and_ocean_cleaned['dt'].dt.year)
mean_temp_world_ocean = []
mean_temp_world_ocean_uncertainty = []

# Calculate mean temperature and uncertainty for each year
for year in years_ocean:
    mean_temp_world_ocean.append(global_temp_land_and_ocean_cleaned[global_temp_land_and_ocean_cleaned['dt'].dt.year == year]['LandAndOceanAverageTemperature'].mean())
    mean_temp_world_ocean_uncertainty.append(global_temp_land_and_ocean_cleaned[global_temp_land_and_ocean_cleaned['dt'].dt.year == year]['LandAndOceanAverageTemperatureUncertainty'].mean())

# Create a Scatter plot for uncertainty (top)
trace0 = go.Scatter(
    x = years_ocean,
    y = np.array(mean_temp_world_ocean) + np.array(mean_temp_world_ocean_uncertainty),
    fill= None,
    mode='lines',
    name='Uncertainty top',
    line=dict(
        color='rgb(0, 255, 255)',
    )
)

# Create a Scatter plot for uncertainty (bottom)
trace1 = go.Scatter(
    x = years_ocean,
    y = np.array(mean_temp_world_ocean) - np.array(mean_temp_world_ocean_uncertainty),
    fill='tonexty',
    mode='lines',
    name='Uncertainty bot',
    line=dict(
        color='rgb(0, 255, 255)',
    )
)

# Create a Scatter plot for average temperature
trace2 = go.Scatter(
    x = years_ocean,
    y = mean_temp_world_ocean,
    name='Average Temperature',
    line=dict(
        color='rgb(199, 121, 093)',
    )
)

# Combine traces into data list
data = [trace0, trace1, trace2]

# Create plot
layout = go.Layout(
    xaxis=dict(title='year'),
    yaxis=dict(title='Average Temperature, °C'),
    title='Average land and ocean temperature in world',
    showlegend = False)

fig = go.Figure(data=data, layout=layout)
py.iplot(fig)

The graph reiterates the clear trend of global temperature increase in recent decades. An interesting finding is that average land and ocean temperatures have been increasing since before the 1950s (a commonly cited starting point for global warming, due to the rapid industrialization after World War II).

Specifically, from observation, average land and ocean temperatures appear to have been increasing since the early 1900s. Scientists attribute this rise in temperature to the effects of early industrialization in the late 18th and 19th centuries.

It's also interesting to note that the effects of climate change, particularly warming, are often noticed earlier in average ocean temperatures than in land temperatures, owing to the inherent characteristics of the Earth's climate system and the distribution of heat. Thus, the rapid rise in average temperature from 2011 to 2015 could be a noteworthy factor when anticipating trends in average land temperature.

### 3.2.3 Stationarity in Global Land and Ocean Temperatures Data

In [None]:
print("The p-value for the ADF test in global_temp_land_and_ocean_cleaned is ", adfuller(global_temp_land_and_ocean_cleaned['LandAndOceanAverageTemperature'])[1])

Since the p-value for *global_temp_land_and_ocean_cleaned* is greater than alpha (0.05), **we fail to reject the null hypothesis**. As expected, the *global_temp_land_and_ocean_cleaned* data is not stationary.
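A common remedy for a non-stationary series before ARIMA-style modeling is differencing. The sketch below is only an illustration on synthetic data (a steady drift plus noise); `half_means` is a hypothetical helper, and comparing the two halves is a crude check on whether the level shifts, not a formal test.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 600
# Synthetic series: steady upward drift plus noise
trending = pd.Series(0.01 * np.arange(n) + rng.normal(scale=0.2, size=n))

# First difference: x[t] - x[t-1]; the first value is NaN and is dropped
differenced = trending.diff().dropna()

def half_means(series):
    """Mean of the first and second half, a crude check for a shifting level."""
    mid = len(series) // 2
    return series.iloc[:mid].mean(), series.iloc[mid:].mean()

print(half_means(trending))
print(half_means(differenced))
```

The level of the raw series shifts between the two halves, while the differenced series keeps roughly the same mean throughout.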

### 3.2.4 Autocorrelations Check

The following code is visually inspecting the autocorrelation and partial autocorrelation of the 'LandAndOceanAverageTemperature' time series data to gain insights into potential temporal patterns and dependencies in the data.

In [None]:
plot_acf(global_temp_land_and_ocean_cleaned['LandAndOceanAverageTemperature'])
plot_pacf(global_temp_land_and_ocean_cleaned['LandAndOceanAverageTemperature'])
plt.show()

Like in the autocorrelation plots for LandAverageTemperature, we see that the plots display seasonality.

In the partial autocorrelation plot, we also see the seasonality. The dampening effect in this graph is more dramatic, however, suggesting that temperatures in recent years are becoming less reliant on the previous time steps and might be influenced by external factors or long-term trends. This phenomenon could be indicative of changing climate dynamics or human-induced impacts that are altering the traditional seasonal patterns.

## 3.3 EDA in Global Land Temperatures by State Data

The following section conducts exploratory data analysis on the Global Land Temperatures by State Data. We will focus on doing our EDA on the United States.

In [None]:
# Filter data for the United States (copy, so later column assignments do not act on a view)
us_data = global_temp_state_cleaned[global_temp_state_cleaned['Country'] == 'United States'].copy()

# Replace Georgia (State) with Georgia
us_data['State'] = us_data['State'].replace('Georgia (State)', 'Georgia')

us_data.head(10)

### 3.3.1 Visualize Distribution of Temperatures for the United States

Let's visualize the range of temperatures across different states.

In [None]:
plt.figure(figsize=(25, 5))
sns.boxplot(x='State', y='AverageTemperature', data=us_data)

# Adjust font size and rotation
plt.title('Boxplot of Temperature for Each State Over Time', fontsize=12)
plt.xlabel('State', fontsize=10)
plt.ylabel('Average Temperature', fontsize=10)
plt.xticks(fontsize=10, rotation=45, ha='right')  # Adjust rotation for state labels

# Show the figure
plt.show()

As we can see, Alaska and Hawaii can be categorized as outliers in terms of their median temperature and range: Alaska has the widest range of temperatures with the lowest median, while Hawaii has the narrowest range of temperatures with the highest median. But the graph does not give us information about how the temperature in each state changes over time.
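The median and range that we read off the boxplot can also be tabulated. The following is a minimal sketch on made-up rows (the `demo` frame and the output column names are assumptions for illustration); with our data, `us_data` would take the place of `demo`.

```python
import pandas as pd

# Made-up rows standing in for us_data
demo = pd.DataFrame({
    "State": ["Alaska", "Alaska", "Alaska", "Hawaii", "Hawaii", "Hawaii"],
    "AverageTemperature": [-25.0, -5.0, 12.0, 21.0, 23.0, 25.0],
})

# One row per state with its median, minimum, and maximum temperature
summary = demo.groupby("State")["AverageTemperature"].agg(
    median="median",
    low="min",
    high="max",
)
summary["range"] = summary["high"] - summary["low"]

print(summary.sort_values("range", ascending=False))
```

Sorting by `range` or by `median` makes it easy to confirm which states sit at the extremes.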

### 3.3.2 Visualizing Annual Trends

Let's now visualize how temperature in each state has changed over time using an interactive map. Moving the slider changes the year, hovering over a state shows its average temperature for that year, and the color scale makes differences between states visible at a glance.

In [None]:
# Extract the year from the 'dt' column
us_data['Year'] = us_data['dt'].dt.year

# Group by 'State' and 'Year' and calculate the mean temperature
average_temp_by_state = us_data.groupby(['State', 'Year'])['AverageTemperature'].mean().reset_index()

# Fetch the US states GeoJSON file (imports repeated here so the cell runs on its own)
import json
import urllib.request

geojson_url = 'https://raw.githubusercontent.com/PublicaMundi/MappingAPI/master/data/geojson/us-states.json'
with urllib.request.urlopen(geojson_url) as url:
    us_states_geojson = json.loads(url.read().decode())

# Extract the 'features' part from GeoJSON
features = us_states_geojson['features']

# Create a DataFrame from GeoJSON features
geojson_df = pd.json_normalize(features)

# Merge temperature data with GeoJSON file based on state names
map_data = pd.merge(geojson_df, average_temp_by_state, how='left', left_on='properties.name', right_on='State')

# Create a choropleth map using Plotly Express with a slider for the year
fig = px.choropleth_mapbox(map_data, geojson=us_states_geojson, locations=map_data['properties.name'],
                           featureidkey="properties.name",
                           color="AverageTemperature",
                           color_continuous_scale="Viridis",
                           mapbox_style="carto-positron",
                           zoom=2, center={"lat": 50, "lon": -110},
                           opacity=0.5,
                           labels={'AverageTemperature': 'Average Temperature'},
                           title='Average Temperature in US States Over Time',
                           animation_frame='Year',  # Add animation frame for the year slider
                           hover_data={'properties.name': False, 'State': True, 'Year': True},  # Customize hover text
                           height=800,  # Set the height of the plot
                           width=1000,
                           )

# Show the map
fig.show()

In 1743, data is available only for states in the eastern half of the US, and we can see a clear temperature gradient from the purple, colder climate in the northern states to the greenish-yellow, warmer climate in the southern states.

As time progresses, Alaska remains largely purple, meaning its average temperature has not changed much over time, while the northern states only rarely show hints of purple. This pattern started during the 1800s, and the average temperature across all states does not seem to change much over time.

### 3.3.3 Stationarity in US State Temperatures Data

In [None]:
print("The p-value for the ADF test in us_data is ", adfuller(us_data['AverageTemperature'])[1])

The p-value (2.938313973490292e-07) is very small, close to zero.

Because it is below alpha (0.05), we can reject the null hypothesis, and the data is likely stationary.
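One caveat: `us_data` stacks every state in a single column, so the test above describes the pooled series rather than any one state. A possible refinement, which is an assumption on our part and not something this notebook does, is to split the data into one time-ordered series per state first. `split_by_state` below is a hypothetical helper, shown on made-up rows.

```python
import pandas as pd

def split_by_state(df):
    """Return {state: temperature Series indexed by date, sorted in time order}."""
    return {
        state: group.sort_values("dt").set_index("dt")["AverageTemperature"]
        for state, group in df.groupby("State")
    }

# Made-up rows standing in for us_data; note that the states are interleaved
demo = pd.DataFrame({
    "dt": pd.to_datetime(["2000-02-01", "2000-01-01", "2000-01-01", "2000-02-01"]),
    "State": ["Florida", "Minnesota", "Florida", "Minnesota"],
    "AverageTemperature": [17.0, -12.0, 15.0, -9.0],
})

series_by_state = split_by_state(demo)
for state, series in series_by_state.items():
    # Each of these series could be passed to adfuller on its own
    print(state, series.tolist())
```

Each resulting series follows a single location through time, which is the setting the ADF test is designed for.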

### 3.3.4 Autocorrelations Check

In [None]:
plot_acf(us_data['AverageTemperature'])
plot_pacf(us_data['AverageTemperature'])
plt.show()

Like the two autocorrelation checks above, these plots display seasonality. The dampening effect in this partial autocorrelation plot resembles the one for the Global Land Temperatures data rather than the sharper one for the land and ocean data, which further suggests that the change in climate is more pronounced when ocean data is included.

### 3.3.5 Seasonality in the US Data

In [None]:
us_data['Month'] = us_data['dt'].dt.month
sns.boxplot(x='Month', y='AverageTemperature', data=us_data)
plt.title('Seasonal Boxplot of Temperature')
plt.show()

As expected, July, in the middle of summer, has the highest average temperature, and January, the coldest month of the year, has the lowest.

We will pick the northernmost state besides Alaska (Minnesota), a mid-latitude state (Pennsylvania), and the southernmost state (Florida) from our data to visualize how seasonality differs depending on latitude.

In [None]:
# Extract MN, PA, and FL data
mn = us_data[us_data['State'] == 'Minnesota']
pa = us_data[us_data['State'] == 'Pennsylvania']
fl = us_data[us_data['State'] == 'Florida']

plt.figure(figsize=(14, 8))

# Creating a boxplot for Minnesota, Pennsylvania, and Florida
sns.boxplot(x='Month', y='AverageTemperature', data=mn, color='blue', showfliers=False)
sns.boxplot(x='Month', y='AverageTemperature', data=pa, color='red', showfliers=False)
sns.boxplot(x='Month', y='AverageTemperature', data=fl, color='green', showfliers=False)

# Adding a title, labels
plt.title('Seasonal Boxplot Overlay for Minnesota, Pennsylvania, and Florida')
plt.xlabel('Month')
plt.ylabel('Average Temperature')

# Adding a legend
mn_patch = plt.Rectangle((0,0),1,1,fc='blue', edgecolor = 'black')
pa_patch = plt.Rectangle((0,0),1,1,fc='red', edgecolor = 'black')
fl_patch = plt.Rectangle((0,0),1,1,fc='green', edgecolor = 'black')
plt.legend([mn_patch, pa_patch, fl_patch], ['Minnesota', 'Pennsylvania', 'Florida'])

plt.show()

The plot above shows that during the winter months there is more temperature variation among the states, with temperatures ranging from around 20 °C to below -20 °C. During the summer months, the range across the three latitudes is smaller, from about 15 °C to 30 °C.
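The gap between states in each month can also be computed directly. The sketch below pivots made-up values (the `demo` frame and the `spread` column are assumptions for illustration) into a month-by-state table and takes the difference between the warmest and coldest state.

```python
import pandas as pd

# Made-up monthly means standing in for the Minnesota and Florida subsets
demo = pd.DataFrame({
    "State": ["Minnesota", "Florida", "Minnesota", "Florida"],
    "Month": [1, 1, 7, 7],
    "AverageTemperature": [-14.0, 15.0, 21.0, 27.0],
})

# Rows: month; columns: state; values: mean temperature
by_month = demo.pivot_table(
    index="Month", columns="State", values="AverageTemperature", aggfunc="mean"
)

# Gap between the warmest and coldest state in each month
by_month["spread"] = by_month.max(axis=1) - by_month.min(axis=1)

print(by_month)
```

A larger `spread` in winter than in summer would match what the overlaid boxplots show.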

## 3.4 EDA in Global Land Temperatures by Country Data

### 3.4.1 Averaged Dynamic Exploration of Temperature Trends Using Globe
Dynamic exploration of temperature trends via an interactive Plotly choropleth on a globe, showing the average land temperature of each country.

In [None]:
countries = np.unique(global_temp_country_cleaned['Country'])
avg_temp_country = []
for country in countries:
    avg_temp_country.append(global_temp_country_cleaned[global_temp_country_cleaned['Country'] == country]['AverageTemperature'].mean())

#interactive
data = [dict(
        type = 'choropleth',
        locations = countries,
        z = avg_temp_country,
        locationmode = 'country names',
        text = countries,
        marker = dict(
            line = dict(color = 'rgb(0,0,0)', width = 1)),
            colorbar = dict(autotick = True, tickprefix = '',
            title = '# Average\nTemperature,\n°C')
            )
       ]

layout = dict(
    title = 'Average land temperature in countries',
    geo = dict(
        showframe = False,
        showocean = True,
        oceancolor = 'rgb(0,255,255)',
        projection = dict(
        type = 'orthographic',
            rotation = dict(
                    lon = 60,
                    lat = 10),
        ),
        lonaxis =  dict(
                showgrid = True,
                gridcolor = 'rgb(102, 102, 102)'
            ),
        lataxis = dict(
                showgrid = True,
                gridcolor = 'rgb(102, 102, 102)'
                )
            ),
        )

fig = dict(data=data, layout=layout)
py.iplot(fig, validate=False, filename='worldmap')

This plot aligns with our understanding of each country's temperature relative to the rest of the globe. Countries that are closer to the equator have higher average temperatures throughout the year (for example Australia, Brazil, and India).

### 3.4.2 Dynamic Exploration of Temperature Trends per Country Over Time

This code generates an interactive choropleth map that shows the average yearly temperature of each country, with a slider that steps through time in 10-year intervals.

In [None]:
# Function to find the best match for each country in temperature data
def find_best_match(country, choices):
    return process.extractOne(country, choices)

# Extract the year from the 'dt' column
global_temp_country_cleaned['Year'] = global_temp_country_cleaned['dt'].dt.year

# Load the world GeoJSON file directly from GitHub
world_geojson_url = 'https://raw.githubusercontent.com/nvkelso/natural-earth-vector/master/geojson/ne_110m_admin_0_countries.geojson'
world_geojson = gpd.read_file(world_geojson_url)

# Clean names to match
global_temp_country_cleaned['Country'] = global_temp_country_cleaned['Country'].str.strip().str.lower()
world_geojson['ADMIN'] = world_geojson['ADMIN'].str.strip().str.lower()

# Apply the function to find the best match for each country in temperature data
temperature_data_countries = global_temp_country_cleaned['Country'].unique()
geojson_countries = world_geojson['ADMIN'].unique()

matches = {country: find_best_match(country, geojson_countries) for country in temperature_data_countries}

# Display the unmatched countries for further inspection
unmatched_countries = {country for country, (match, score) in matches.items() if score < 90}
print(sorted(unmatched_countries))

# Create a dictionary to map temperature data countries to GeoJSON countries
country_name_mapping = {country: match for country, (match, score) in matches.items()}

# Map the countries in the DataFrame
global_temp_country_cleaned['Country'] = global_temp_country_cleaned['Country'].map(country_name_mapping).fillna(global_temp_country_cleaned['Country'])

# Group by 'Country' and 'Year' only after the names are cleaned and mapped,
# so the 'Country' values line up with the lowercased 'ADMIN' names in the merge below
average_temp_by_country = global_temp_country_cleaned.groupby(['Country', 'Year'])['AverageTemperature'].mean().reset_index()

# Merge temperature data with GeoJSON file based on country names
merged = world_geojson.merge(average_temp_by_country, left_on='ADMIN', right_on='Country', how='left')
merged = merged.sort_values(by='Year')

# Filter the data for every 10 years
filtered_data = merged[merged['Year'] % 10 == 0]

# Create a choropleth map using Plotly Express with a slider for every 10 years
fig = px.choropleth_mapbox(filtered_data, geojson=world_geojson, locations=filtered_data['ADMIN'],
                            featureidkey="properties.ADMIN",
                            color="AverageTemperature",
                          color_continuous_scale="Viridis",
                            mapbox_style="carto-positron",
                            zoom=0.5, center={"lat": 30, "lon": 0},  # Adjust center and zoom as needed
                            opacity=0.5,
                            labels={'AverageTemperature': 'Average Temperature'},
                            title='Average Temperature Worldwide Over Time (Every 10 Years)',
                            animation_frame='Year',  # Use 'Year' for the slider
                            hover_data={'ADMIN': False, 'Country': True, 'Year': True},  # Customize hover text
                            height=800,  # Set the height of the plot
                            width=1000,
                            animation_group='ADMIN',  # Use 'ADMIN' for consistent coloring
                            range_color=[merged['AverageTemperature'].min(), merged['AverageTemperature'].max()]  # Set color range
                            )

# Show the map
fig.show()

These plots align with our understanding of each country's temperatures throughout time. We can see that temperatures are clearly rising over time.
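The country-name matching step above relies on fuzzy string matching. To illustrate the idea without extra dependencies, the sketch below swaps in the standard library's `difflib` in place of fuzzywuzzy; `match_country`, the candidate list, and the 0.6 cutoff are assumptions for illustration, and the similarity scores are not on the same scale as the fuzzywuzzy scores used in the cell.

```python
from difflib import get_close_matches

def match_country(name, candidates, cutoff=0.6):
    """Return the closest candidate name, or None if nothing clears the cutoff."""
    hits = get_close_matches(name.strip().lower(), candidates, n=1, cutoff=cutoff)
    return hits[0] if hits else None

# Hypothetical name list, lowercased as in the cell above
geojson_names = ["united states of america", "canada", "brazil"]

for raw in ["United States", " Brazil", "Atlantis"]:
    print(repr(raw), "->", match_country(raw, geojson_names))
```

Returning `None` for weak matches keeps unmatched countries visible instead of silently mapping them to the nearest name.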

### 3.4.3 Autocorrelations Check

In [None]:
plot_acf(global_temp_country_cleaned['AverageTemperature'], lags=20)
plt.show()

plot_pacf(global_temp_country_cleaned['AverageTemperature'], lags=20)
plt.show()

Unlike the three plots above, there is no obvious seasonal pattern. This makes sense because the data pools countries from all around the world, so no single seasonal cycle is shared by all of them.

### 3.4.4 Seasonality in Global Land Temperatures by Country Data

In [None]:
# Adding latitude to country data
# Function to fetch latitude for a given country
def get_latitude(country, geolocator):
    location = geolocator.geocode(country)
    return location.latitude if location else None

# Extract unique country names from the DataFrame
unique_countries = global_temp_country_cleaned['Country'].unique()

# Initialize the geolocator
geolocator = Nominatim(user_agent="geo_locator")

# Create a DataFrame to store unique country names and their latitudes
country_latitudes_df = pd.DataFrame({'Country': unique_countries})

# Fetch latitude for each unique country
country_latitudes_df['Latitude'] = country_latitudes_df['Country'].apply(lambda country: get_latitude(country, geolocator))

# Merge latitude information into the main DataFrame
global_temp_country_cleaned = pd.merge(global_temp_country_cleaned, country_latitudes_df, on='Country', how='left')

In [None]:
latitude_threshold = 0

# Split the DataFrame into Northern and Southern Hemisphere
# (copies, so the column assignments below do not act on views of the original)
northern_hemisphere_df = global_temp_country_cleaned[global_temp_country_cleaned['Latitude'] >= latitude_threshold].copy()
southern_hemisphere_df = global_temp_country_cleaned[global_temp_country_cleaned['Latitude'] < latitude_threshold].copy()

# Convert 'dt' to datetime format if not already done
northern_hemisphere_df['dt'] = pd.to_datetime(northern_hemisphere_df['dt'])
southern_hemisphere_df['dt'] = pd.to_datetime(southern_hemisphere_df['dt'])

# Extract Month from the 'dt' column
northern_hemisphere_df['Month'] = northern_hemisphere_df['dt'].dt.month
southern_hemisphere_df['Month'] = southern_hemisphere_df['dt'].dt.month

# Plotting Monthly Box Plots for Northern Hemisphere
plt.figure(figsize=(14, 8))
sns.boxplot(x='Month', y='AverageTemperature', data=northern_hemisphere_df)
plt.title('Northern Hemisphere - Monthly Average Temperature')
plt.xlabel('Month')
plt.ylabel('Average Temperature')
plt.show()

# Plotting Monthly Box Plots for Southern Hemisphere
plt.figure(figsize=(14, 8))
sns.boxplot(x='Month', y='AverageTemperature', data=southern_hemisphere_df)
plt.title('Southern Hemisphere - Monthly Average Temperature')
plt.xlabel('Month')
plt.ylabel('Average Temperature')
plt.show()

The two plots above show an interesting trend: the Northern Hemisphere has more variation in temperature during the colder months of January, February, November, and December than during July and August, when the ranges are small, while the Southern Hemisphere has a fairly uniform variance across all months.
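The difference in spread can be quantified with a grouped standard deviation. The sketch below uses a few made-up rows (the `demo` frame and the `Hemisphere` column are assumptions for illustration) and the same latitude threshold of 0 as the cell above.

```python
import numpy as np
import pandas as pd

# Made-up rows: two northern and two southern locations, January and July
demo = pd.DataFrame({
    "Latitude": [60.0, 60.0, 10.0, 10.0, -30.0, -30.0, -5.0, -5.0],
    "Month": [1, 7, 1, 7, 1, 7, 1, 7],
    "AverageTemperature": [-20.0, 15.0, 25.0, 27.0, 24.0, 12.0, 27.0, 25.0],
})

# Latitude >= 0 is treated as Northern, matching the threshold used above
demo["Hemisphere"] = np.where(demo["Latitude"] >= 0, "Northern", "Southern")

# Standard deviation of temperature per hemisphere and month
spread = demo.groupby(["Hemisphere", "Month"])["AverageTemperature"].std()

print(spread)
```

Comparing the January and July values within each hemisphere gives a numeric counterpart to the box heights in the plots.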

## 3.5 EDA in Global Land Temperatures by City Data

### 3.5.1 Exploration of Relationship Between Latitude and Temperature

In [None]:
latitude_df = global_temp_city_cleaned.sort_values(by='Latitude')

avg_temp_by_latitude = latitude_df.groupby('Latitude')['AverageTemperature'].mean()

plt.figure(figsize=(12, 8))
plt.bar(avg_temp_by_latitude.index, avg_temp_by_latitude.values, color='skyblue')
plt.xlabel('Latitude')
plt.ylabel('Mean of Average Temperature (°C)')
plt.title('Mean of Average Temperature vs Latitude')
plt.tight_layout()

# Show the figure
plt.show()

This plot emphasizes what we already know about temperature trends in relation to latitude. Specifically, near the equator (i.e. latitude = 0), temperatures are generally higher due to the direct sunlight received, while temperatures decrease towards the poles where sunlight is spread out over a larger area.
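Since temperature depends on distance from the equator rather than on the sign of the latitude, a correlation against the absolute latitude captures this relationship better than one against the raw latitude. The sketch below uses synthetic values chosen for illustration, not our city data.

```python
import numpy as np

# Synthetic city data: temperature falls as distance from the equator grows
latitude = np.array([-60.0, -40.0, -20.0, 0.0, 20.0, 40.0, 60.0])
temperature = 27.0 - 0.4 * np.abs(latitude)

# Pearson correlation with the raw latitude and with the absolute latitude
r_raw = np.corrcoef(latitude, temperature)[0, 1]
r_abs = np.corrcoef(np.abs(latitude), temperature)[0, 1]

print(round(r_raw, 3), round(r_abs, 3))
```

With data that is symmetric about the equator, the raw correlation is close to zero even though the relationship is strong, which is why the absolute value matters here.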

### 3.5.2 Average Temperature Contour Plot on World Map

In [None]:
fig = plt.figure(figsize=(20, 10))
ax = plt.axes(projection=ccrs.PlateCarree())
ax.add_feature(cfeature.COASTLINE)

# We set the number of contour lines, or evenly spaced intervals between the minimum and maximum values of average temperature to be 200 to see more nuanced changes in temperature in different areas of the world
contour = plt.tricontourf(global_temp_city_cleaned['Longitude'], global_temp_city_cleaned['Latitude'], global_temp_city_cleaned['AverageTemperature'],
                          transform=ccrs.PlateCarree(), cmap='coolwarm', levels=np.linspace(min(global_temp_city_cleaned['AverageTemperature']), max(global_temp_city_cleaned['AverageTemperature']), 200))

cbar = plt.colorbar(contour, orientation='vertical', pad=0.02, aspect=16, shrink=0.8)
cbar.set_label('Average Temperature (°C)', rotation=270, labelpad=20)

plt.title('Average Temperature Contour Plot on World Map')

# Show the figure
plt.show()

This contour plot shows that inland areas in parts of Africa, India, and the US have the highest average temperatures, while areas corresponding to parts of Russia and Canada have the lowest. To see the city data in more detail, we need interactive graphs.

### 3.5.3 Average Temperature Interactive Map

Let's take a look at the average temperature in each city using an interactive map.

In [None]:
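# Convert Latitude and Longitude to numerical values if they are in string format.
# Assumes strings such as '57.05N' or '10.33W'; S and W become negative so the hemisphere is kept.
def to_signed_degrees(series):
    degrees = series.str.extract(r'(\d+(?:\.\d+)?)')[0].astype(float)
    sign = np.where(series.str.strip().str[-1].isin(['S', 'W']), -1.0, 1.0)
    return degrees * sign

if global_temp_city_cleaned['Latitude'].dtype == 'object':
    global_temp_city_cleaned['Latitude'] = to_signed_degrees(global_temp_city_cleaned['Latitude'])

if global_temp_city_cleaned['Longitude'].dtype == 'object':
    global_temp_city_cleaned['Longitude'] = to_signed_degrees(global_temp_city_cleaned['Longitude'])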

# Group by Latitude and Longitude and calculate the average temperature
average_temp_df = global_temp_city_cleaned.groupby(['Latitude', 'Longitude']).agg({
    'City': 'first',
    'AverageTemperature': 'mean'
}).reset_index()

# Normalize 'AverageTemperature' for sizing points on the plot
min_temp = average_temp_df['AverageTemperature'].min()
max_temp = average_temp_df['AverageTemperature'].max()
average_temp_df['NormalizedTemperature'] = (average_temp_df['AverageTemperature'] - min_temp) / (max_temp - min_temp)

# Manually set center and zoom level for the desired region
center_lat = 30  # Center latitude
center_lon = 40  # Center longitude
zoom = 6  # Zoom level

# Create an interactive map using plotly express
fig = px.scatter_geo(
    average_temp_df,
    lat='Latitude',
    lon='Longitude',
    text='City',
    size='NormalizedTemperature',  # Use normalized temperature for sizing
    color='AverageTemperature',
    color_continuous_scale='Viridis',
    projection='natural earth',
    title='Average Temperature Map',
    size_max=15  # Smaller dot sizes
)

# Add hover information
fig.update_traces(hovertemplate='<b>%{text}</b><br>Temperature: %{marker.color:.2f}°C')

# Customize layout for readability
fig.update_layout(
    geo=dict(
        showcoastlines=True,
        coastlinecolor="black",
        center=dict(lat=center_lat, lon=center_lon),
        projection_scale=zoom,
    ),
    font=dict(size=8),  # Smaller font size for city names
    geo_bgcolor="white",  # Change background color
    title=dict(font=dict(size=20)),  # Larger title font size
    height=800,  # Set the height of the map
    width=1200,  # Set the width of the map
    margin=dict(l=0, r=0, b=0, t=40)  # Adjust margin for better use of space
)

# Show the figure
fig.show()

As we can see from the color, the cities around the equator are more yellow, meaning they have a higher average temperature. However, this graph does not give us information on average temperature over time.

### 3.5.4 Time-Series Temperature Interactive Map

Let's visualize the temperature data again, this time adding a slider to see how temperature changes over time.

In [None]:
# Convert 'dt' to datetime format
global_temp_city_cleaned['dt'] = pd.to_datetime(global_temp_city_cleaned['dt'], errors='coerce')

# Convert Latitude and Longitude to numerical values if they are in string format
global_temp_city_cleaned['Latitude'] = pd.to_numeric(global_temp_city_cleaned['Latitude'], errors='coerce')
global_temp_city_cleaned['Longitude'] = pd.to_numeric(global_temp_city_cleaned['Longitude'], errors='coerce')

# Group by Latitude, Longitude, and decade (year floored to a multiple of 10), and calculate the average temperature
global_temp_city_cleaned['Year'] = (global_temp_city_cleaned['dt'].dt.year // 10) * 10
average_temp_df = global_temp_city_cleaned.groupby(['Latitude', 'Longitude', 'Year']).agg({
    'City': 'first',
    'AverageTemperature': 'mean'
}).reset_index()

# Sort the DataFrame by 'Year' for correct animation order
average_temp_df = average_temp_df.sort_values('Year')

# Normalize 'AverageTemperature' for sizing points on the plot
min_temp = average_temp_df['AverageTemperature'].min()
max_temp = average_temp_df['AverageTemperature'].max()
average_temp_df['NormalizedTemperature'] = (average_temp_df['AverageTemperature'] - min_temp) / (max_temp - min_temp)

# Manually set center and zoom level for the desired region
center_lat = 30  # Center latitude
center_lon = 40  # Center longitude
zoom = 6  # Zoom level

# Create an interactive map using plotly express
fig = px.scatter_geo(
    average_temp_df,
    lat='Latitude',
    lon='Longitude',
    text='City',
    size='NormalizedTemperature',  # Use normalized temperature for sizing
    color='AverageTemperature',
    color_continuous_scale='Viridis',
    animation_frame='Year',  # Add animation frame for the slider
    projection='natural earth',
    title='Average Temperature Map',
    size_max=15  # Smaller dot sizes
)

# Add hover information
fig.update_traces(hovertemplate='<b>%{text}</b><br>Temperature: %{marker.color:.2f}°C')

# Customize layout for readability
fig.update_layout(
    geo=dict(
        showcoastlines=True,
        coastlinecolor="black",
        center=dict(lat=center_lat, lon=center_lon),
        projection_scale=zoom,
    ),
    font=dict(size=8),  # Smaller font size for city names
    geo_bgcolor="white",  # Change background color
    title=dict(font=dict(size=20)),  # Larger title font size
    margin=dict(l=0, r=0, b=0, t=40),  # Adjust margin for better use of space
    height=800,  # Set the height of the map
    width=1200,  # Set the width of the map
)

# Show the figure
fig.show()

Similar to the interactive map for the states data, in the 1700s the northern cities were largely colored purple and the southern cities mainly yellow, indicating a wider range of temperatures. By the 1800s, most of the purple hues are gone, leaving mostly greenish tones in the northern cities and in cities farther from the equator, while the southern cities and those around the equator stay consistently yellow. This pattern persists beyond the 1800s.
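
The decade slider relies on flooring each year to a multiple of ten before grouping. As a minimal sketch of that step on made-up readings (the `toy` frame and its values are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical monthly readings for a single location
toy = pd.DataFrame({
    "dt": pd.to_datetime(["1743-11-01", "1749-06-01", "1750-01-01", "1758-07-01"]),
    "AverageTemperature": [6.0, 18.0, 2.0, 20.0],
})

# Integer division floors each year to the start of its decade
toy["Decade"] = (toy["dt"].dt.year // 10) * 10

# One mean temperature per decade
decade_means = toy.groupby("Decade")["AverageTemperature"].mean()
print(decade_means)
```

Because the year is floored rather than rounded, 1749 falls into the 1740s bucket and 1750 starts a new one.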

## 3.6 EDA in Global Land Temperatures by Major City Data

Let's now focus on the major cities and see how much they contribute to the climate trend.  

### 3.6.1 Average Temperature Plot for Major Cities on a World Map

Using the same interactive map as above, let's plot the major cities.

In [None]:
# Convert Latitude and Longitude to numerical values if they are in string format
if global_temp_major_city_cleaned['Latitude'].dtype == 'object':
    global_temp_major_city_cleaned['Latitude'] = global_temp_major_city_cleaned['Latitude'].str.extract(r'(\d+\.\d+)').astype(float)

if global_temp_major_city_cleaned['Longitude'].dtype == 'object':
    global_temp_major_city_cleaned['Longitude'] = global_temp_major_city_cleaned['Longitude'].str.extract(r'(\d+\.\d+)').astype(float)

# Group by Latitude and Longitude and calculate the average temperature
average_temp_df = global_temp_major_city_cleaned.groupby(['Latitude', 'Longitude']).agg({
    'City': 'first',
    'AverageTemperature': 'mean'
}).reset_index()

# Normalize 'AverageTemperature' for sizing points on the plot
min_temp = average_temp_df['AverageTemperature'].min()
max_temp = average_temp_df['AverageTemperature'].max()
average_temp_df['NormalizedTemperature'] = (average_temp_df['AverageTemperature'] - min_temp) / (max_temp - min_temp)

# Manually set center and zoom level for the desired region
center_lat = 30  # Center latitude
center_lon = 0  # Center longitude
zoom = 2  # Zoom level

# Create an interactive map using plotly express
fig = px.scatter_geo(
    average_temp_df,
    lat='Latitude',
    lon='Longitude',
    text='City',
    # Points are sized by the min-max normalized temperature, so larger points are relatively warmer
    size='NormalizedTemperature',
    color='AverageTemperature',
    color_continuous_scale='Viridis',
    projection='natural earth',
    title='Average Temperature Map',
    size_max=15
)

# Add hover information
fig.update_traces(hovertemplate='<b>%{text}</b><br>Temperature: %{marker.color:.2f}°C')
fig.update_layout(
    geo=dict(
        showcoastlines=True,
        coastlinecolor="black",
        center=dict(lat=center_lat, lon=center_lon),
        projection_scale=zoom,
    ),
    font=dict(size=8),  # Smaller font size for city names
    geo_bgcolor="white",  # Change background color
    title=dict(font=dict(size=20)),  # Larger title font size
    margin=dict(l=0, r=0, b=0, t=40),  # Adjust margin for better use of space
    height=800,  # Set the height of the map
    width=1200,  # Set the width of the map
)

# Show the figure
fig.show()

Like the cities data above, major cities that are farther from the equator have a lower average temperature. The size of each plot point is proportional to its min-max normalized average temperature, so larger points represent higher temperatures relative to the dataset's range.
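
To make the sizing concrete, here is a small sketch of min-max normalization on invented values (the city names and temperatures are hypothetical):

```python
import pandas as pd

# Hypothetical city averages in °C
temps = pd.Series([-5.0, 10.0, 25.0], index=["CityA", "CityB", "CityC"])

# Min-max scaling maps the coldest value to 0 and the warmest to 1
normalized = (temps - temps.min()) / (temps.max() - temps.min())
print(normalized)
```

One consequence of this scaling is that the coldest location receives a size of 0, so its marker may be barely visible on the map. Adding a small constant offset before plotting would be one possible way around that.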

### 3.6.2 Time-Series Temperature Interactive Map for Major Cities

Let's see how temperature changes over time.

In [None]:
import pandas as pd
import plotly.express as px

# Convert 'dt' to datetime format
global_temp_major_city_cleaned['dt'] = pd.to_datetime(global_temp_major_city_cleaned['dt'], errors='coerce')

# Convert Latitude and Longitude to numerical values if they are in string format
global_temp_major_city_cleaned['Latitude'] = pd.to_numeric(global_temp_major_city_cleaned['Latitude'], errors='coerce')
global_temp_major_city_cleaned['Longitude'] = pd.to_numeric(global_temp_major_city_cleaned['Longitude'], errors='coerce')

# Group by Latitude, Longitude, and decade (year floored to a multiple of 10), and calculate the average temperature
global_temp_major_city_cleaned['Year'] = (global_temp_major_city_cleaned['dt'].dt.year // 10) * 10
average_temp_df = global_temp_major_city_cleaned.groupby(['Latitude', 'Longitude', 'Year']).agg({
    'City': 'first',
    'AverageTemperature': 'mean'
}).reset_index()

# Sort the DataFrame by 'Year' for correct animation order
average_temp_df = average_temp_df.sort_values('Year')

# Normalize 'AverageTemperature' for sizing points on the plot
min_temp = average_temp_df['AverageTemperature'].min()
max_temp = average_temp_df['AverageTemperature'].max()
average_temp_df['NormalizedTemperature'] = (average_temp_df['AverageTemperature'] - min_temp) / (max_temp - min_temp)

# Manually set center and zoom level for the desired region
center_lat = 30  # Center latitude
center_lon = 0  # Center longitude
zoom = 2  # Zoom level

# Create an interactive map using plotly express
fig = px.scatter_geo(
    average_temp_df,
    lat='Latitude',
    lon='Longitude',
    text='City',
    size='NormalizedTemperature',  # Use normalized temperature for sizing
    color='AverageTemperature',
    color_continuous_scale='Viridis',
    animation_frame='Year',  # Add animation frame for the slider
    projection='natural earth',
    title='Average Temperature Map',
    size_max=15  # Smaller dot sizes
)

# Add hover information
fig.update_traces(hovertemplate='<b>%{text}</b><br>Temperature: %{marker.color:.2f}°C')

# Customize layout for readability
fig.update_layout(
    geo=dict(
        showcoastlines=True,
        coastlinecolor="black",
        center=dict(lat=center_lat, lon=center_lon),
        projection_scale=zoom,
    ),
    font=dict(size=8),  # Smaller font size for city names
    geo_bgcolor="white",  # Change background color
    title=dict(font=dict(size=20)),  # Larger title font size
    margin=dict(l=0, r=0, b=0, t=40),  # Adjust margin for better use of space
    height=800,  # Set the height of the map
    width=1200,  # Set the width of the map
)

# Show the figure
fig.show()

Once all of the major cities have data, around the mid-1800s, the average temperature changes little from then on, as shown by the nearly constant size and color of the plot points.

## 3.7 Subset Guangzhou Data

For accuracy and simplicity, we choose to only analyze one major city.

In [None]:
# Extract Guangzhou data from global_temp_major_city_cleaned
gz = global_temp_major_city_cleaned[global_temp_major_city_cleaned['City'] == 'Guangzhou'].reset_index(drop=True)

# Remove nulls
gz = gz[gz.AverageTemperature.notnull()]

# Sort by date
gz = gz.sort_values(by='dt')

# Part 4: Feature Engineering & Preprocessing

From our EDA, we see that a lot of data and relationships are being repeated among the datasets. To maintain comprehensive analysis, without repeating unnecessary information, we will focus our modeling on the following datasets:

1.   Global Land Temperature: To understand the relationship between time and land temperature.
2.   Global Land and Ocean Temperature: To understand the relationship between time and land/ocean temperature. Specifically, we're interested in learning more about this data because studies have shown that ocean temperatures are good predictors for the rise of future land temperatures.
3.   Global Land Temperatures By City: To understand the relationship between latitude and land temperatures.



## 4.1 Correlation Matrix for Global Land Temperature

In [None]:
correlation_matrix_1 = global_temp_land_cleaned_modeling.corr()
sns.heatmap(correlation_matrix_1, annot=True)
plt.show()

From the correlation matrix, we see that there exists a relatively weak correlation between year and LandAverageTemperature. Because our dataset revolves around measuring long-term effects of gradual, small temperature changes, we will employ correlation significance testing via a Pearson Correlation Significance Test to further investigate this correlation.

In [None]:
# Assuming 'time' and 'land_temperature' are your variables of interest
time = global_temp_land_cleaned_modeling['year']
land_temperature = global_temp_land_cleaned_modeling['LandAverageTemperature']

# Calculate Pearson correlation coefficient and p-value
correlation_coefficient, p_value = pearsonr(time, land_temperature)

print(f"Pearson Correlation Coefficient: {correlation_coefficient}")
print(f"P-value: {p_value}")

# Check if the correlation is statistically significant
alpha = 0.05  # Set your significance level
if p_value < alpha:
    print("The correlation is statistically significant.")
else:
    print("The correlation is not statistically significant.")


We see that the correlation is statistically significant, and because we are dealing with time-series data, we will explore time-based features and time-dependent trends in our model.
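
For reference, the test that `pearsonr` performs can be summarized as follows. With $n$ paired observations $(x_i, y_i)$, the sample correlation $r$ and its test statistic $t$ are:

$$
r = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i}(x_i - \bar{x})^2}\,\sqrt{\sum_{i}(y_i - \bar{y})^2}},
\qquad
t = r\,\sqrt{\frac{n - 2}{1 - r^2}}
$$

Under the null hypothesis of zero correlation (and assuming roughly normal data), $t$ follows a Student's t distribution with $n - 2$ degrees of freedom, which is where the p-value comes from. Since $t$ grows with $\sqrt{n-2}$, even a weak correlation can be statistically significant when there are many observations.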

## 4.2 Correlation Matrix for Global Land and Ocean Temperatures

In [None]:
cols_to_drop = ['LandMaxTemperature', 'LandMaxTemperatureUncertainty', 'LandMinTemperature', 'LandMinTemperatureUncertainty']
global_temp_land_and_ocean_cleaned_modeling.drop(cols_to_drop, axis=1, inplace=True)
global_temp_land_and_ocean_cleaned_modeling['year'] = pd.to_datetime(global_temp_land_and_ocean_cleaned_modeling['dt']).dt.year
correlation_matrix_2 = global_temp_land_and_ocean_cleaned_modeling.corr()
sns.heatmap(correlation_matrix_2, annot=True)
plt.show()

As above, the correlation matrix shows a correlation between year and temperature. Here, however, the correlation between LandAndOceanAverageTemperature and year appears to be very strong. We will again use a Pearson correlation significance test to investigate it further.

In [None]:
# 'time' and 'land_and_ocean_temperature' are the variables of interest
time = global_temp_land_and_ocean_cleaned_modeling['year']
land_and_ocean_temperature = global_temp_land_and_ocean_cleaned_modeling['LandAndOceanAverageTemperature']

# Calculate Pearson correlation coefficient and p-value
correlation_coefficient, p_value = pearsonr(time, land_and_ocean_temperature)

print(f"Pearson Correlation Coefficient: {correlation_coefficient}")
print(f"P-value: {p_value}")

# Check if the correlation is statistically significant
alpha = 0.05  # Set your significance level
if p_value < alpha:
    print("The correlation is statistically significant.")
else:
    print("The correlation is not statistically significant.")


We see that the correlation is statistically significant, and because we are dealing with time-series data, we will explore time-based features and time-dependent trends in our model.

## 4.3 Correlation Matrix for Global Land Temperatures by City Data

In [None]:
correlation_matrix_3 = global_temp_city_cleaned_modeling.corr()
sns.heatmap(correlation_matrix_3, annot=True)
plt.show()

From the above correlation matrix, we see that there is a negative correlation between Latitude and AverageTemperature. This means that as latitude increases, temperature appears to decrease.

This aligns with the well-known observation that temperatures tend to be colder at higher latitudes, especially towards the poles. The Earth's axial tilt and the way sunlight hits different latitudes contribute to this pattern.

## 4.4 Rolling Average and Plots

Similar to the EDA portion, we will calculate and add a column for the 10-year rolling average for better visualization.



In [None]:
# Calculate 10_year_rolling_avg
gz['10_year_rolling_avg'] = gz['AverageTemperature'].rolling(window=120).mean()

# Plot monthly average temperature and 10-year rolling average
plt.figure(figsize=(15, 6))
plt.plot(gz['dt'], gz['AverageTemperature'], label='Monthly Average Temperature', alpha=0.5)
plt.plot(gz['dt'], gz['10_year_rolling_avg'], label='10-Year Rolling Average', color='red', linewidth=2)
plt.title('Guangzhou Average Temperature Over Time with 10 Years Rolling Average')
plt.xlabel('Year')
plt.ylabel('Temperature (°C)')
plt.legend()
plt.show()

The average temperature over time graph has a slight upward trend, but it is not obvious. Let's graph the change in average temperature over time.
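
To illustrate why a 120-month window makes a slow trend easier to see, here is a minimal sketch on a synthetic series (the seasonal amplitude, drift rate, and dates are assumptions chosen for illustration, not values from the Guangzhou data):

```python
import numpy as np
import pandas as pd

# Hypothetical monthly series: 20 years of a seasonal cycle plus a slow upward drift
months = pd.date_range("2000-01-01", periods=240, freq="MS")
seasonal = 10 * np.sin(2 * np.pi * np.arange(240) / 12)
drift = 0.01 * np.arange(240)
series = pd.Series(20 + seasonal + drift, index=months)

# A 120-month window spans ten full seasonal cycles, so the cycle averages out
rolling = series.rolling(window=120).mean()

print(rolling.isna().sum())            # leading months without a full window
print(rolling.dropna().iloc[[0, -1]])  # first and last smoothed values
```

The rolling mean is undefined until a full window is available, which is why the red line in the plot above starts about ten years after the first observation.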

In [None]:
# Graph Guangzhou data
mean_temp = gz["AverageTemperature"].mean()
gz["AverageTemperatureDelta"] = gz["AverageTemperature"] - mean_temp

fig_dims = (10, 8)
fig, ax = plt.subplots(figsize=fig_dims)
# numeric_only=True skips text columns such as City, which cannot be averaged
avg_temp = gz.groupby(gz['dt'].dt.year).mean(numeric_only=True)
avg_temp["AverageTemperatureDelta"].plot(linewidth=1, label ='delta T')
plt.axhline(y=0, color='r', linestyle='-', label = 'average')
plt.title('Guangzhou temperature', fontsize=20)
plt.xlabel('year', fontsize=15)
plt.ylabel('°C', fontsize=15)
plt.grid(True)

As we can see, Guangzhou's average temperature has an upward trend over time. Let's plot the difference between the annual temperature and the mean temperature as a bar chart.

In [None]:
# Plot the difference with respect the mean temperature

fig_dims = (20, 14)
fig, ax = plt.subplots(figsize=fig_dims)
avg_temp["AverageTemperatureDelta"].plot.bar(linewidth=1, label ='delta T')
plt.axhline(y=0, color='r', linestyle='-', label = 'average')
plt.title('Change in Guangzhou temperature ', fontsize=20)
plt.legend(fontsize='x-large')

# Show only some xticks
for i, t in enumerate(ax.get_xticklabels()):
    if (i % 5) != 0:
        t.set_visible(False)
plt.xlabel('Year', fontsize=15)
plt.ylabel('°C', fontsize=15)
plt.grid(True)

This plot is another way of showing the line graph above: the positive differences from the mean temperature in later years indicate that temperature has been increasing over time.
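
The quantity plotted above is a temperature anomaly: each value's deviation from the long-run mean. The notebook subtracts the mean of the monthly values and then averages by year. The simplified sketch below works directly on invented annual values:

```python
import pandas as pd

# Hypothetical annual mean temperatures (°C)
annual = pd.Series([21.0, 21.5, 22.0, 22.5, 23.0],
                   index=[1900, 1925, 1950, 1975, 2000])

# Anomaly: each year's deviation from the long-run mean
delta = annual - annual.mean()
print(delta)
```

Negative anomalies early in the record and positive ones later are what an upward trend looks like in this representation.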

# Part 5: Modeling

## 5.1 Linear Regression (Supervised Learning Model)

### 5.1.1 Using Data Pre-1950s to Predict Post-1950s

To predict future temperatures, we will employ a Linear Regression Model. Specifically, during our EDA, we saw that AverageTemperature seemed to follow a linear trend over time. Thus, linear regression can be useful because we want to capture a linear trend in the time-series data.

In [None]:
# Get average temperature by year
average_temperature_by_year = global_temp_land_cleaned_modeling.groupby('year')['LandAverageTemperature'].mean().reset_index()

# Calculate Pearson correlation
year = average_temperature_by_year['year']
temperature = average_temperature_by_year['LandAverageTemperature']
corr, p = pearsonr(year, temperature)
print('Pearson correlation of Year and Land Average Temperature: {:.2f}'.format(corr))

In [None]:
# Set a specific time point for the split (e.g., 80% for training, 20% for testing)
split_date = average_temperature_by_year['year'].iloc[int(0.8 * len(average_temperature_by_year))]

# Split the data into training and testing sets based on the time point
train = average_temperature_by_year[average_temperature_by_year['year'] < split_date]
test = average_temperature_by_year[average_temperature_by_year['year'] >= split_date]

# Split the data into features (X) and target variable (y)
X_train = train[['year']]
y_train = train['LandAverageTemperature']
X_test = test[['year']]
y_test = test['LandAverageTemperature']

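lr = LinearRegression()

# Note: the model below is fit on the full series (all years);
# the train/test split above is not used for this fit
X = year
y = temperature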

X = X.values.reshape(-1,1)

lr.fit(X, y)

y_pred = lr.predict(X)

years = pd.DataFrame(X)

plt.figure(figsize=(18,10))
plt.scatter(X, y, alpha=0.6)
plt.plot(X, y_pred, color="orange")
plt.xlabel('Years')
plt.ylabel('Temperature (in °C)')
plt.show()
plt.clf()

In [None]:
print(lr.coef_)
print(10 * lr.coef_)

Every year, the average land temperature increases by an average of 0.0047 °C. Every ten years, the average land temperature increases by an average of 0.0473 °C.
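
The per-year and per-decade figures are simply the fitted slope and ten times the fitted slope. A minimal sketch on synthetic data follows. It uses `np.polyfit` in place of `LinearRegression` to keep the example dependency-light, and the series is invented with a known trend of 0.005 °C per year:

```python
import numpy as np

# Hypothetical annual series with a built-in trend of 0.005 °C per year
years = np.arange(1900, 2000)
temps = 8.0 + 0.005 * (years - 1900)

# Least-squares line; polyfit returns [slope, intercept] for degree 1
slope, intercept = np.polyfit(years, temps, 1)

# The decadal rate is the annual slope scaled by ten
per_decade = 10 * slope
print(f"{slope:.4f} °C per year, {per_decade:.3f} °C per decade")
```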

In [None]:
# Extract years from 2015 to 2050
future_years = np.arange(2015, 2051, 1).reshape(-1, 1)

# Filter data for years 2015 and onwards
recent_years_data = average_temperature_by_year[average_temperature_by_year['year'] >= 2015]

# Use the linear regression model to predict temperatures for future years
future_predictions = lr.predict(future_years)

# Plot the original data and the linear regression predictions
plt.figure(figsize=(12, 6))
plt.scatter(recent_years_data['year'], recent_years_data['LandAverageTemperature'], label='_nolegend_', marker='o')
plt.plot(future_years, future_predictions, label='Linear Regression Predictions', color='red')
plt.title('Linear Regression Predictions of Land Average Temperature Through 2050')
plt.xlabel('Year')
plt.ylabel('Land Average Temperature')
plt.legend()
plt.grid(True)
plt.show()

In [None]:
print(lr.predict(np.array([2030, 2040, 2050]).reshape(-1,1)))

**Interpretation** The model predicts an average land temperature of 9.067 °C in 2030, 9.115 °C in 2040, and 9.162 °C in 2050.

### 5.1.2 1950 - 2015 Data on Global Average Temperatures

Historically, land temperatures have risen at more significant rates following industrialization and World War II (post-1950s). So, using temperature data from before the 1950s to model post-industrialization temperatures is likely misrepresentative.
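
As a rough illustration of this concern, the sketch below fits a line to the early part of a synthetic series and checks it against the later part. The series is invented, and its change in warming rate at 1950 is an assumption built in for demonstration. `np.polyfit` stands in for `LinearRegression`:

```python
import numpy as np

# Hypothetical series: slow warming before 1950, steeper warming afterwards
years = np.arange(1850, 2016)
temps = 8.0 + 0.002 * (years - 1850) + 0.015 * np.clip(years - 1950, 0, None)

# Fit only on the pre-1950 years
early = years < 1950
slope, intercept = np.polyfit(years[early], temps[early], 1)

# Evaluate on the held-out post-1950 years
predicted = slope * years[~early] + intercept
mean_error = np.mean(predicted - temps[~early])
print(f"Mean error after 1950: {mean_error:.3f} °C")
```

A negative mean error here means the line learned from the early period systematically underestimates the later, faster-warming period.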

In [None]:
# Filter data for years 1950 and onwards
recent_years_data = average_temperature_by_year[average_temperature_by_year['year'] >= 1950]

# Calculate Pearson correlation for years 1950 and onwards
corr, p = pearsonr(recent_years_data['year'], recent_years_data['LandAverageTemperature'])
print('Pearson correlation of Year and Land Average Temperature (1950 and onwards): {:.2f}'.format(corr))

**Interpretation** We can see a very strong positive correlation between year and land average temperature from 1950 onwards.

In [None]:
# Filter data for years post-1950
post_1950_data = average_temperature_by_year[average_temperature_by_year['year'] > 1950]

# Linear Regression
lr = LinearRegression()

X = post_1950_data['year'].values.reshape(-1, 1)
y = post_1950_data['LandAverageTemperature']

lr.fit(X, y)

y_pred = lr.predict(X)

# Plot the scatter plot and linear regression line
plt.figure(figsize=(18, 10))
plt.scatter(X, y, alpha=0.6)
plt.plot(X, y_pred, color="orange", label='Linear Regression Line')
plt.xlabel('Years')
plt.ylabel('Temperature (in °C)')
plt.legend()
plt.show()
plt.clf()

In [None]:
print(lr.coef_)
print(10 * lr.coef_)

**Interpretation**
Every year, the average land temperature increases by an average of 0.0185 °C. Every ten years, the average land temperature increases by an average of 0.1852 °C.

In [None]:
# Extract years from 2015 to 2050
future_years = np.arange(2015, 2051, 1).reshape(-1, 1)

# Filter data for years 2015 and onwards
recent_years_data = average_temperature_by_year[average_temperature_by_year['year'] >= 2015]

# Use the linear regression model to predict temperatures for future years
future_predictions = lr.predict(future_years)

# Plot the original data and the linear regression predictions
plt.figure(figsize=(12, 6))
plt.scatter(recent_years_data['year'], recent_years_data['LandAverageTemperature'], label='_nolegend_', marker='o')
plt.plot(future_years, future_predictions, label='Linear Regression Predictions', color='red')
plt.title('Linear Regression Predictions of Land Average Temperature Through 2050')
plt.xlabel('Year')
plt.ylabel('Land Average Temperature')
plt.legend()
plt.grid(True)
plt.show()

In [None]:
print(lr.predict(np.array([2030, 2040, 2050]).reshape(-1,1)))

**Interpretation**
The model predicts an average land temperature of 9.856 °C in 2030, 10.041 °C in 2040, and 10.226 °C in 2050.



## 5.2 XGBoost (Supervised Learning Model)

Gradient boosting algorithms like XGBoost build trees sequentially, each one correcting the errors of the previous ones. Specifically, XGBoost is capable of capturing non-linear relationships in the data. Temperature data often exhibits complex patterns that may not be well modeled by linear regression alone, and XGBoost, with its ability to build ensembles of decision trees, can capture these intricate non-linear relationships.
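
The idea of each tree correcting the errors of the previous ones can be sketched with a hand-rolled booster. This is a deliberately simplified illustration using one-split trees (stumps) and squared error. It is not XGBoost's actual algorithm, which adds regularization and second-order gradient information. The helper names `fit_stump` and `predict_stump` are hypothetical.

```python
import numpy as np

def fit_stump(x, residual):
    """Return (threshold, left_value, right_value) minimizing squared error."""
    best = None
    for threshold in np.unique(x)[:-1]:  # skip the max so both sides are non-empty
        left = residual[x <= threshold]
        right = residual[x > threshold]
        error = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or error < best[0]:
            best = (error, threshold, left.mean(), right.mean())
    return best[1:]

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

# Hypothetical non-linear target
x = np.linspace(0, 6, 60)
y = np.sin(x)

learning_rate = 0.3
prediction = np.full_like(y, y.mean())  # start from the mean
errors = []
for _ in range(50):
    residual = y - prediction            # what the ensemble still gets wrong
    stump = fit_stump(x, residual)       # the next tree is fit to those residuals
    prediction += learning_rate * predict_stump(stump, x)
    errors.append(np.mean((y - prediction) ** 2))

print(f"MSE after 1 round: {errors[0]:.4f}, after 50 rounds: {errors[-1]:.4f}")
```

Each round only has to model what is left over from the previous rounds, which is how a sum of very simple trees can approximate a curved relationship.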

### 5.2.1 Using Data Pre-1950s to Predict Post-1950s

In [None]:
# Adapt the date_transform function
def date_transform(data):
    df = data.copy()

    # Extract various date-related features from the datetime index
    df['Month'] = df['dt'].dt.month
    df['Quarter'] = df['dt'].dt.quarter
    df['Year'] = df['dt'].dt.year

    # Keep only the needed columns for training and prediction
    X = df[['Month', 'Quarter', 'Year']]
    y = df['LandAverageTemperature']
    return X, y

# Sort the data by time
global_temp_land_cleaned_modeling = global_temp_land_cleaned_modeling.sort_values(by='dt')

# Set a specific time point for the split (e.g., 80% for training, 20% for testing)
split_date = global_temp_land_cleaned_modeling['dt'].iloc[int(0.8 * len(global_temp_land_cleaned_modeling))]

# Split the data into training and testing sets based on the time point
train = global_temp_land_cleaned_modeling[global_temp_land_cleaned_modeling['dt'] < split_date]
test = global_temp_land_cleaned_modeling[global_temp_land_cleaned_modeling['dt'] >= split_date]

# Apply date_transform to the training set
X_train, y_train = date_transform(train)

# Apply date_transform to the testing set
X_test, y_test = date_transform(test)

# Train an XGBoost model (eval_metric is set on the estimator; recent XGBoost versions no longer accept it in fit)
xgb_model = XGBRegressor(n_estimators=1000, learning_rate=0.05, early_stopping_rounds=10, eval_metric='mae')
xgb_model.fit(X_train, y_train, eval_set=[(X_train, y_train), (X_test, y_test)])

# Make predictions on the testing set
xgb_pred = xgb_model.predict(X_test)

In [None]:
# Calculate Mean Absolute Error (MAE)
from sklearn.metrics import mean_absolute_error
mae = round(mean_absolute_error(y_test, xgb_pred), 3)
print(mae)

# Evaluate the model
mse = mean_squared_error(y_test, xgb_pred)
r2 = r2_score(y_test, xgb_pred)
print(mse)
print(r2)

In [None]:
import matplotlib.dates as mdates
# Create a DataFrame for plotting
df_plot = pd.DataFrame({'y_test': y_test.values, 'xgb_pred': xgb_pred}, index=test['dt'])

# Plot the actual vs. predicted results
plt.figure(figsize=(20, 8))
df_plot['y_test'].plot(label='Actual')
df_plot['xgb_pred'].plot(label='Predicted')

# Add axis labels
plt.xlabel('Date')
plt.ylabel('Land Average Temperature')

#Add title and legend
plt.title('Testing Set Forecast', weight='bold', fontsize=20)
plt.legend()

# Show the plot
plt.tight_layout()
plt.show()

From the results, we see that the MSE is relatively unchanged from the error produced by the linear regression model. The R-squared value, while still negative, has decreased significantly in magnitude.
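
A negative R-squared can look surprising, so here is a short sketch of how it arises. R-squared compares the model's squared error with that of always predicting the mean of the test targets, and it goes below zero when the model does worse than that baseline. The numbers below are invented for illustration:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot

# Hypothetical test-set values and two sets of predictions
y_true = np.array([9.0, 9.2, 9.4, 9.6])
mean_baseline = np.full(4, y_true.mean())
biased_low = y_true - 0.5  # right shape, but consistently 0.5 °C too cold

print(r_squared(y_true, mean_baseline))
print(r_squared(y_true, biased_low))
```

The second case shows that a model can track the shape of the data perfectly and still score a negative R-squared if its predictions are offset by more than the spread of the test values.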

### 5.2.2 Using Data Post-1950s to Predict Post-2000s

In [None]:
# Adapt the date_transform function
def date_transform(data):
    df = data.copy()

    # Extract various date-related features from the datetime index
    df['Month'] = df['dt'].dt.month
    df['Quarter'] = df['dt'].dt.quarter
    df['Year'] = df['dt'].dt.year

    # Keep only the needed columns for training and prediction
    X = df[['Month', 'Quarter', 'Year']]
    y = df['LandAverageTemperature']
    return X, y

# Filter data from January 1, 1950, and beyond
global_temp_land_cleaned_modeling_post_1950 = global_temp_land_cleaned_modeling[global_temp_land_cleaned_modeling['dt'] >= '1950-01-01']

# Sort the data by time
global_temp_land_cleaned_modeling_post_1950 = global_temp_land_cleaned_modeling_post_1950.sort_values(by='dt')

# Set a specific time point for the split (e.g., 80% for training, 20% for testing)
split_date = global_temp_land_cleaned_modeling_post_1950['dt'].iloc[int(0.8 * len(global_temp_land_cleaned_modeling_post_1950))]

# Split the data into training and testing sets based on the time point
train = global_temp_land_cleaned_modeling_post_1950[global_temp_land_cleaned_modeling_post_1950['dt'] < split_date]
test = global_temp_land_cleaned_modeling_post_1950[global_temp_land_cleaned_modeling_post_1950['dt'] >= split_date]

# Apply date_transform to the training set
X_train, y_train = date_transform(train)

# Apply date_transform to the testing set
X_test, y_test = date_transform(test)
print(X_train.tail())
print(y_train.tail())

# Train an XGBoost model (eval_metric is set on the estimator; recent XGBoost versions no longer accept it in fit)
xgb_model = XGBRegressor(n_estimators=1000, learning_rate=0.05, early_stopping_rounds=10, eval_metric='mae')
xgb_model.fit(X_train, y_train, eval_set=[(X_train, y_train), (X_test, y_test)])

# Make predictions on the testing set
xgb_pred = xgb_model.predict(X_test)


In [None]:
# Calculate Mean Absolute Error (MAE)
from sklearn.metrics import mean_absolute_error

mae = round(mean_absolute_error(y_test, xgb_pred), 3)
print(mae)
# Evaluate the model
mse = mean_squared_error(y_test, xgb_pred)
r2 = r2_score(y_test, xgb_pred)
print(mse)
print(r2)

In [None]:
import matplotlib.dates as mdates
# Create a DataFrame for plotting
df_plot = pd.DataFrame({'y_test': y_test.values, 'xgb_pred': xgb_pred}, index=test['dt'])

# Plot the actual vs. predicted results
plt.figure(figsize=(20, 8))
df_plot['y_test'].plot(label='Actual')
df_plot['xgb_pred'].plot(label='Predicted')

# Add axis labels
plt.xlabel('Date')
plt.ylabel('Land Average Temperature')

#Add title and legend
plt.title('Testing Set Forecast', weight='bold', fontsize=20)
plt.legend()

# Show the plot
plt.tight_layout()
plt.show()

### 5.2.3 Using Hyper-Parameter Tuning on XGBoost

We chose to run the hyperparameter search on the post-1950 data.
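
The grid search in this section uses time-series cross-validation, where every fold trains on earlier observations and validates on later ones. The sketch below is a simplified, hand-rolled illustration of that expanding-window idea. The function `expanding_window_splits` is hypothetical and is not scikit-learn's implementation of `TimeSeriesSplit`.

```python
import numpy as np

def expanding_window_splits(n_samples, n_splits):
    """Yield (train_idx, test_idx) pairs where training data always precedes test data."""
    fold_size = n_samples // (n_splits + 1)
    for k in range(1, n_splits + 1):
        train_end = n_samples - (n_splits - k + 1) * fold_size
        yield np.arange(0, train_end), np.arange(train_end, train_end + fold_size)

for train_idx, test_idx in expanding_window_splits(n_samples=12, n_splits=5):
    print("train:", train_idx, "test:", test_idx)
```

Unlike shuffled k-fold, no fold ever validates on data that comes before its training window, which avoids leaking future information into the model.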

In [None]:
import pandas as pd
from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit, train_test_split
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

# Define the parameter grid to search
param_grid = {
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 7],
    'n_estimators': [50, 100, 200],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
}

# Adapt the date_transform function
def date_transform(data):
    df = data.copy()

    # Extract various date-related features from the datetime index
    df['Month'] = df['dt'].dt.month
    df['Quarter'] = df['dt'].dt.quarter
    df['Year'] = df['dt'].dt.year

    # Keep only the needed columns for training and prediction
    X = df[['Month', 'Quarter', 'Year']]
    y = df['LandAverageTemperature']
    return X, y
# Filter data for dates post-1950
filtered_data = global_temp_land_cleaned_modeling[global_temp_land_cleaned_modeling['dt'] >= '1950-01-01']

# Set a specific time point for the split (e.g., 80% for training, 20% for testing)
split_date = filtered_data['dt'].iloc[int(0.8 * len(filtered_data))]

# Split the data into training and testing sets based on the time point
train = filtered_data[filtered_data['dt'] < split_date]
test = filtered_data[filtered_data['dt'] >= split_date]

# Apply date_transform to the training set
X_train, y_train = date_transform(train)

# Apply date_transform to the testing set
X_test, y_test = date_transform(test)

# Create an XGBoost model
xgb_model = XGBRegressor()

# Set up the grid search with cross-validation
grid_search = GridSearchCV(
    estimator=xgb_model,
    param_grid=param_grid,
    scoring='neg_mean_squared_error',  # Choose an appropriate scoring metric
    cv=TimeSeriesSplit(n_splits=5),  # Time series cross-validation
    verbose=1,
    n_jobs=-1
)

# Perform the grid search on the training data
grid_search.fit(X_train, y_train)

# Print Best Parameters and Best Score
print("Best Parameters: ", grid_search.best_params_)
print("Best Score: ", grid_search.best_score_)
print("Lowest MSE found: ", -grid_search.best_score_)


# Get the Best Model
best_xgb_model = grid_search.best_estimator_

# Make Predictions on the Test Set
y_pred_test = best_xgb_model.predict(X_test)

In [None]:
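# Work on a copy so the new columns do not trigger a SettingWithCopyWarning
test = test.copy()

# Add the tuned model's predictions to the test DataFrame
test['LandAverageTemperature_Pred'] = y_pred_test

# Extract year from the 'dt' column
test['year'] = test['dt'].dt.year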

# Evaluate the Model on Test Set
mse_test = mean_squared_error(y_test, y_pred_test)
print("Mean Squared Error on Test Set: ", mse_test)

# Plot Actual vs. Predicted on Test Set
plt.figure(figsize=(12, 6))
plt.plot(test['dt'], y_test, label='Actual')  # plot against dates to match the x-axis label
plt.plot(test['dt'], y_pred_test, label='Predicted', linestyle='--')
plt.xlabel('Date')
plt.ylabel('Land Average Temperature')
plt.title('Actual vs. Predicted on Test Set')
plt.legend()
plt.show()


## 5.3 SARIMA Model for Guangzhou

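ARIMA is a common time-series forecasting model that assumes the series is stationary, at least after differencing. SARIMA extends ARIMA with seasonal terms, which makes it suitable for data that display a repeating seasonal pattern.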
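
For reference, a seasonal ARIMA model with orders $(p, d, q)(P, D, Q)_m$ can be written compactly in backshift notation. This is the standard textbook form, not output from this notebook. Here $B$ is the backshift operator ($B y_t = y_{t-1}$), $m$ is the season length (12 for monthly data), $c$ is an optional constant, and $\varepsilon_t$ is white noise.

$$
\Phi_P(B^m)\,\phi_p(B)\,(1-B)^d\,(1-B^m)^D\, y_t = c + \Theta_Q(B^m)\,\theta_q(B)\,\varepsilon_t
$$

As an example, `order=(1, 1, 1)` with `seasonal_order=(1, 1, 1, 12)` and no constant corresponds to

$$
(1-\Phi_1 B^{12})(1-\phi_1 B)(1-B)(1-B^{12})\, y_t = (1+\Theta_1 B^{12})(1+\theta_1 B)\,\varepsilon_t
$$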

### 5.3.1 Stationarity Check

In [None]:
print("The p-value for the ADF test in Guangzhou is ", adfuller(gz['AverageTemperature'])[1])

The p-value is less than the significance level (alpha = 0.05), so we reject the null hypothesis of a unit root. The Guangzhou temperature data is therefore treated as stationary, and we can move on to ARIMA.

### 5.3.2 Autocorrelation Check

In [None]:
plot_acf(gz['AverageTemperature'], lags=20)
plt.show()

plot_pacf(gz['AverageTemperature'], lags=20)
plt.show()

The plots above show seasonality, so we use SARIMAX instead, which can model seasonal data.
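
To make the "seasonality in the ACF" reading concrete, here is a small sketch that computes sample autocorrelations by hand for a synthetic 12-month cycle. The series is made up for illustration (it is not the Guangzhou data), and `sample_acf` is a hypothetical helper using the common biased estimator. A seasonal series should show a strong positive value at lag 12 and a strong negative value at lag 6.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelation for lags 0..max_lag (biased estimator)."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    denom = np.sum(centered ** 2)
    return np.array([
        np.sum(centered[: len(x) - lag] * centered[lag:]) / denom
        for lag in range(max_lag + 1)
    ])

# Synthetic monthly series: 20 years of a pure 12-month cycle (illustrative only)
months = np.arange(240)
synthetic_temp = 20 + 8 * np.sin(2 * np.pi * months / 12)

acf_values = sample_acf(synthetic_temp, max_lag=20)
print("lag 6 :", acf_values[6])
print("lag 12:", acf_values[12])
```

A slowly repeating peak every 12 lags in `plot_acf` is the same signal, which is what motivates the seasonal terms with `m=12`.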

### 5.3.3 Run Model and Get Predictions

In [None]:
# Run auto_arima to find the best combination of (p, d, q) and seasonal (P, D, Q) orders for SARIMAX
# Note that this step is computationally expensive
model = auto_arima(gz['AverageTemperature'],
                       start_p=1,
                       start_q=1,
                       max_p=3,
                       max_q=3,
                       start_P=1,
                       start_Q=1,
                       max_P=2,
                       max_Q=2,
                       m=12,
                       seasonal=True,
                       d=1,
                       D=1,
                       trend = 'ct',
                       test = 'adf',
                       trace=True,
                       error_action='ignore',
                       suppress_warnings=True,
                       stepwise=True)

print(model.summary())

In [None]:
model = SARIMAX(gz['AverageTemperature'],
                order=(1, 1, 1),
                seasonal_order=(1, 1, 1, 12),
                enforce_stationarity=False,
                enforce_invertibility=False)

# Fit the model
model_fit = model.fit(disp=False)

# Forecast the next 40 years (480 months)
forecast = model_fit.get_forecast(steps=480)

# Get the forecast mean
forecast_mean = forecast.predicted_mean.reset_index()

# Get the confidence intervals of the forecast
forecast_ci = forecast.conf_int()

# Prepare the data for plotting
forecast_index = pd.date_range(gz.dt.iloc[-1], periods=481, freq='M')[1:]
forecast_mean['index'] = forecast_index
forecast_mean['LowerCI'] = forecast_ci.iloc[:, 0].tolist()
forecast_mean['UpperCI'] = forecast_ci.iloc[:, 1].tolist()
forecast_mean.rename(columns={'index': 'dt'}, inplace=True)
forecast_mean['10_year_rolling_avg'] = forecast_mean['predicted_mean'].rolling(window=120).mean()

In [None]:
# Plot
plt.figure(figsize=(22,10))
plt.plot(gz.dt, gz['AverageTemperature'], label = "Historical")
plt.plot(forecast_mean.dt, forecast_mean['predicted_mean'], label = "Forecast Out of Sample")
plt.plot(gz.dt, gz['10_year_rolling_avg'], label='10-Year Rolling Average Historical', color='red', linewidth=2)
plt.plot(forecast_mean.dt, forecast_mean['10_year_rolling_avg'], label='10-Year Rolling Average Forecast', color='green', linewidth=2)
plt.title("Time Series Forecast")
plt.xlabel("Date")
plt.ylabel("Mean Temperature")
plt.legend()
plt.grid(True)
plt.show()

Since Guangzhou's temperature stayed fairly constant over time, the SARIMA forecast reflects that: the average temperature is predicted to remain roughly constant over the next 40 years.
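
One detail worth keeping in mind when reading the plot: the forecast's 10-year rolling average is computed from forecast values only, so `rolling(window=120)` returns `NaN` for the first 119 months and the green line starts about 10 years into the forecast. The sketch below shows this on a synthetic 480-month series (the values are made up). The `min_periods=12` variant is one possible way to shorten the gap, not something the notebook does.

```python
import numpy as np
import pandas as pd

# Synthetic 40-year monthly "forecast" (480 values), for illustration only
values = pd.Series(22 + 6 * np.sin(2 * np.pi * np.arange(480) / 12))

# Default behaviour: a value is produced only once a full 120-month window is available
rolling_full = values.rolling(window=120).mean()
n_missing = int(rolling_full.isna().sum())
first_valid = rolling_full.first_valid_index()
print("missing values:", n_missing, "| first valid position:", first_valid)

# Optional variant: allow partial windows once at least 12 months are available
rolling_partial = values.rolling(window=120, min_periods=12).mean()
print("missing values with min_periods=12:", int(rolling_partial.isna().sum()))
```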

## 5.4 SARIMA on Land and Ocean Temperatures Data

Let's also see how SARIMA works on the nonstationary land and ocean temperatures data.

In [None]:
# Bringing back graph from above
# Calculate 10_year_rolling_avg
global_temp_land_and_ocean_cleaned['10_year_rolling_avg_ocean'] = global_temp_land_and_ocean_cleaned['LandAndOceanAverageTemperature'].rolling(window=120).mean()

# Plot monthly average temperature and 10-year rolling average
plt.figure(figsize=(15, 6))
plt.plot(global_temp_land_and_ocean_cleaned['dt'], global_temp_land_and_ocean_cleaned['LandAndOceanAverageTemperature'], label='Monthly Average Temperature', alpha=0.5)
plt.plot(global_temp_land_and_ocean_cleaned['dt'], global_temp_land_and_ocean_cleaned['10_year_rolling_avg_ocean'], label='10-Year Rolling Average', color='red', linewidth=2)
plt.title('Land and Ocean Average Temperature Over Time with 10 Years Rolling Average')
plt.xlabel('Year')
plt.ylabel('Temperature (°C)')
plt.legend()
plt.show()

In [None]:
# Create a copy of the data above for manipulation
df = global_temp_land_and_ocean_cleaned.copy()
meanTemp = df["LandAndOceanAverageTemperature"].mean()
df["LandAndOceanAverageTemperatureDelta"] = df["LandAndOceanAverageTemperature"] - meanTemp
df['10_year_rolling_avg'] = df['LandAndOceanAverageTemperatureDelta'].rolling(window=120).mean()
avgTemp = df.groupby(df['dt'].dt.year).mean()

# Plot change in temperature graph with rolling average
fig_dims = (10, 8)
fig, ax = plt.subplots(figsize=fig_dims)
avgTemp["LandAndOceanAverageTemperatureDelta"].plot(linewidth=1, label ='delta T')
plt.axhline(y=0, color='r', linestyle='-', label = 'average')
plt.title('Change in Land & Ocean temperature ', fontsize=20)
plt.legend(fontsize='x-large')
plt.xlabel('year', fontsize=15)
plt.ylabel('°C', fontsize=15)
plt.grid(True)

### 5.4.1 Stationarity and Autocorrelation Check

As analyzed in Section 3.2, this data is nonstationary and seasonal, but SARIMAX addresses both issues through its differencing and seasonal terms.
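
As a rough illustration of how differencing handles both problems, the sketch below builds a synthetic series with a linear trend and a 12-month cycle (made-up numbers, not the land and ocean data). It then applies one regular difference and one 12-month seasonal difference, which is what `d=1` and `D=1` with a period of 12 request. On this idealized series the result is flat; real data would leave residual structure for the AR and MA terms to model.

```python
import numpy as np

# Synthetic series: slow linear warming plus a repeating 12-month cycle (illustrative only)
months = np.arange(360)
trend = 0.002 * months
seasonal = 5 * np.sin(2 * np.pi * months / 12)
series = trend + seasonal

# Regular first difference: y_t - y_{t-1} turns the linear trend into a constant
first_diff = np.diff(series)

# Seasonal difference at lag 12: removes the constant and the repeating yearly pattern
seasonal_diff = first_diff[12:] - first_diff[:-12]

print("range of original series:", series.max() - series.min())
print("largest value after differencing:", np.abs(seasonal_diff).max())
```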

### 5.4.2 Run Model and Get Predictions

In [None]:
# Set up SARIMA model
model = SARIMAX(df['LandAndOceanAverageTemperatureDelta'], order=(0, 1, 3), seasonal_order=(1,1,[1, 2],12), trend='ct')
model_fit = model.fit()
print(model_fit.summary())

In [None]:
# We want to forecast temperature 40 years into the future
nbStep = 12*40
forecast = model_fit.get_forecast(steps=nbStep)

forecast = forecast.summary_frame()

forecasts = pd.DataFrame(columns = ['LandAndOceanAverageTemperaturePredictions','LandAndOceanAverageTemperaturePredictionsUpper','LandAndOceanAverageTemperaturePredictionsLower'])
forecasts['LandAndOceanAverageTemperaturePredictions'] = forecast['mean']
forecasts['LandAndOceanAverageTemperaturePredictionsUpper'] = forecast['mean_ci_upper']
forecasts['LandAndOceanAverageTemperaturePredictionsLower'] = forecast['mean_ci_lower']
forecasts['10_year_rolling_avg_forecast'] = forecasts['LandAndOceanAverageTemperaturePredictions'].rolling(window=120).mean()
forecasts.index = forecast.index

forecast_period = pd.date_range(start=df['dt'].iloc[-1], periods=nbStep+1, freq='M')[1:]
forecasts['dt'] = forecast_period

In [None]:
# Plot forecasting temperature
plt.figure(figsize=(22,10))
plt.plot(df.dt, df['LandAndOceanAverageTemperatureDelta'],label = "original")
plt.plot(forecasts.dt, forecasts['LandAndOceanAverageTemperaturePredictions'],label = "forecast out of sample")
plt.plot(df.dt, df['10_year_rolling_avg'], label='10-Year Rolling Average Historical', color='red', linewidth=2)
plt.plot(forecasts.dt, forecasts['10_year_rolling_avg_forecast'], label='10-Year Rolling Average Forecast', color='green', linewidth=2)
plt.title("Land & Ocean Temperature Forecast")
plt.xlabel("Date")
plt.ylabel("Temperature Delta from Mean (°C)")
plt.legend()
plt.grid(True)
plt.show()

In [None]:
# Plot temperature forecast with confidence intervals
avgTemp = df.groupby(df['dt'].dt.year).mean()
avgTempPred = forecasts.groupby(forecasts['dt'].dt.year).mean()
fig_dims = (10, 8)
fig, ax = plt.subplots(figsize=fig_dims)
avgTemp["LandAndOceanAverageTemperatureDelta"].plot(linewidth=1, label='original')
avgTempPred["LandAndOceanAverageTemperaturePredictions"].plot(linewidth=1, label='forecast out of sample')
ax.fill_between(avgTempPred.index, avgTempPred['LandAndOceanAverageTemperaturePredictionsLower'], avgTempPred['LandAndOceanAverageTemperaturePredictionsUpper'], color='k', alpha=0.1);
plt.title('Land & Ocean Temperature Forecast with Confidence Intervals', fontsize=20)
plt.xlabel('year', fontsize=15)
plt.ylabel('°C', fontsize=15)
plt.legend(fontsize='x-large')
plt.grid(True)

Unlike Guangzhou's forecast, the forecast for this dataset shows a clear upward trend, reaffirming our earlier observation that global land and ocean temperatures are rising.

# Part 6: Summary

**In conclusion, here are our main takeaways from this project:**

1.	Time series data come with unique challenges and characteristics that require careful preprocessing before applying any modeling technique, especially when the goal is forecasting future values.
2.	Additionally, many forecasting models rest on assumptions, such as stationarity, that the data must satisfy before the model can be applied.
3.	Visualizing the data can give insight into which models to build. For example, since some of the data is nonstationary, plain ARIMA would not be a good choice for forecasting; instead we should opt for models that can tolerate nonstationarity and seasonality, like SARIMA and Prophet.
4.	Unlike typical machine learning problems, random train-test splits are not suitable for time series data. Instead, a chronological split is necessary to evaluate the model's performance on unseen future data.
5.	The best model for forecasting was SARIMA, and XGBoost demonstrated proficiency in capturing the underlying seasonal patterns within the dataset.
6.	The global temperature trend has exhibited a pronounced upward slope since 1950, a concerning acceleration that coincides with the rapid pace of industrialization and heightened emissions.

**In the future, it would be interesting to consider implementing the following modifications:**
1.	Implementing neural networks, including advanced architectures such as LSTM
2.	Experimenting with Prophet for additional forecasting potential
3.	Subsampling by season to see the trends throughout the years based on the time of year

**Overall Experience + Team Reflection:**
1.	We really enjoyed working on this project because it allowed us to apply the skills and tools we learned in class to a pressing issue in today’s world
2.	It was especially fun working with and learning about the different packages for data visualization (we especially thought the interactive globe ones were pretty cool)
3.	Having to constantly iterate to refine our models required a lot of time but it forced us to think critically and experiment with different parameters/models to see what performed best