<h1 align='center'>Climate Change</h1>



![Climate Change](http://cdn.inquisitr.com/wp-content/uploads/2014/01/Global-Warming-Facts-Losing-Support-In-US-Considered-A-Fake-Climate-Change-Hoax.jpg)



There have been many studies documenting that the average global temperature has been increasing over the last century. The consequences of a continued rise in global temperature will be dire. Rising sea levels and an increased frequency of extreme weather events will affect billions of people.

In this problem, we will attempt to study the relationship between average global temperature and several other factors.

The file <a href='data/climate_change.csv'>climate_change.csv</a> (available in the data folder) contains climate data from May 1983 to December 2008. The available variables include:

**Year**: the observation year.

**Month**: the observation month.

**Temp**: the difference in degrees Celsius between the average global temperature in that period and a reference value. This data comes from the Climatic Research Unit at the University of East Anglia.

**CO2, N2O, CH4, CFC.11, CFC.12**: atmospheric concentrations of carbon dioxide (CO2), nitrous oxide (N2O), methane  (CH4), trichlorofluoromethane (CCl3F; commonly referred to as CFC-11) and dichlorodifluoromethane (CCl2F2; commonly referred to as CFC-12), respectively. This data comes from the ESRL/NOAA Global Monitoring Division.

- CO2, N2O and CH4 are expressed in ppmv (parts per million by volume; i.e., 397 ppmv of CO2 means that CO2 constitutes 397 millionths of the total volume of the atmosphere).

- CFC.11 and CFC.12 are expressed in ppbv (parts per billion by volume). 

**Aerosols**: the mean stratospheric aerosol optical depth at 550 nm. This variable is linked to volcanoes, as volcanic eruptions result in new particles being added to the atmosphere, which affect how much of the sun's energy is reflected back into space. This data is from the Goddard Institute for Space Studies at NASA.

**TSI**: the total solar irradiance (TSI) in W/m2 (the rate at which the sun's energy is deposited per unit area). Due to sunspots and other solar phenomena, the amount of energy that is given off by the sun varies substantially with time. This data is from the SOLARIS-HEPPA project website.

**MEI**: multivariate El Nino Southern Oscillation index (MEI), a measure of the strength of the El Nino/La Nina-Southern Oscillation (a weather effect in the Pacific Ocean that affects global temperatures). This data comes from the ESRL/NOAA Physical Sciences Division.

---

We are interested in how changes in these variables affect future temperatures, as well as how well these variables explain temperature changes so far.

We first split the data into a training set, consisting of all the observations up to and including 2006, and a testing set consisting of the remaining years. The training set is the data used to build the model (the data we give to the LinearRegression fit method), and the testing set is the data we use to test the model's predictive ability.

In [108]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
plt.style.use('fivethirtyeight')

<h1 align='center'>Glance at the data</h1>

In [109]:
climate = pd.read_csv('data/climate_change.csv')
climate.head()

,Year,Month,MEI,CO2,CH4,N2O,CFC-11,CFC-12,TSI,Aerosols,Temp
0,1983,5,2.556,345.96,1638.59,303.677,191.324,350.113,1366.1024,0.0863,0.109
1,1983,6,2.167,345.52,1633.71,303.746,192.057,351.848,1366.1208,0.0794,0.118
2,1983,7,1.741,344.15,1633.22,303.795,192.818,353.725,1366.285,0.0731,0.137
3,1983,8,1.13,342.25,1631.35,303.839,193.602,355.633,1366.4202,0.0673,0.176
4,1983,9,0.428,340.17,1648.4,303.901,194.392,357.465,1366.2335,0.0619,0.149


In [110]:
climate.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 308 entries, 0 to 307
Data columns (total 11 columns):
Year        308 non-null int64
Month       308 non-null int64
MEI         308 non-null float64
CO2         308 non-null float64
CH4         308 non-null float64
N2O         308 non-null float64
CFC-11      308 non-null float64
CFC-12      308 non-null float64
TSI         308 non-null float64
Aerosols    308 non-null float64
Temp        308 non-null float64
dtypes: float64(9), int64(2)
memory usage: 26.5 KB


In [111]:
climate.describe()

,Year,Month,MEI,CO2,CH4,N2O,CFC-11,CFC-12,TSI,Aerosols,Temp
count,308.0,308.0,308.0,308.0,308.0,308.0,308.0,308.0,308.0,308.0,308.0
mean,1995.662338,6.551948,0.275555,363.226753,1749.824513,312.391834,251.973068,497.524782,1366.070759,0.016657,0.256776
std,7.423197,3.447214,0.937918,12.647125,46.051678,5.225131,20.231783,57.826899,0.39961,0.02905,0.17909
min,1983.0,1.0,-1.635,340.17,1629.89,303.677,191.324,350.113,1365.4261,0.0016,-0.282
25%,1989.0,4.0,-0.39875,353.02,1722.1825,308.1115,246.2955,472.41075,1365.71705,0.0028,0.12175
50%,1996.0,7.0,0.2375,361.735,1764.04,311.507,258.344,528.356,1365.9809,0.00575,0.248
75%,2002.0,10.0,0.8305,373.455,1786.885,316.979,267.031,540.52425,1366.36325,0.0126,0.40725
max,2008.0,12.0,3.001,388.5,1814.18,322.182,271.494,543.813,1367.3162,0.1494,0.739


In [112]:
climate.corr()

,Year,Month,MEI,CO2,CH4,N2O,CFC-11,CFC-12,TSI,Aerosols,Temp
Year,1.0,-0.025789,-0.145345,0.985379,0.910563,0.99485,0.460965,0.870067,0.022353,-0.361884,0.755731
Month,-0.025789,1.0,-0.016345,-0.096287,0.017558,0.012395,-0.014914,-0.001084,-0.032754,0.014845,-0.098016
MEI,-0.145345,-0.016345,1.0,-0.152911,-0.105555,-0.162375,0.088171,-0.039836,-0.076826,0.352351,0.135292
CO2,0.985379,-0.096287,-0.152911,1.0,0.872253,0.981135,0.401284,0.82321,0.017867,-0.369265,0.748505
CH4,0.910563,0.017558,-0.105555,0.872253,1.0,0.894409,0.713504,0.958237,0.146335,-0.290381,0.699697
N2O,0.99485,0.012395,-0.162375,0.981135,0.894409,1.0,0.412155,0.839295,0.039892,-0.353499,0.743242
CFC-11,0.460965,-0.014914,0.088171,0.401284,0.713504,0.412155,1.0,0.831381,0.284629,-0.032302,0.380111
CFC-12,0.870067,-0.001084,-0.039836,0.82321,0.958237,0.839295,0.831381,1.0,0.18927,-0.243785,0.688944
TSI,0.022353,-0.032754,-0.076826,0.017867,0.146335,0.039892,0.284629,0.18927,1.0,0.083238,0.182186
Aerosols,-0.361884,0.014845,0.352351,-0.369265,-0.290381,-0.353499,-0.032302,-0.243785,0.083238,1.0,-0.392069


<h1 align='center'>Answering the Questions</h1>

In [113]:
#    Build a linear regression model to predict the dependent variable Temp, using
#     MEI, CO2, CH4, N2O, CFC.11, CFC.12, TSI, and Aerosols as independent variables
#     (Year and Month should NOT be used in the model). Use the training set to build the model.

training = climate[climate.Year <= 2006]
testing  = climate[climate.Year > 2006]

lm = LinearRegression()

target = training.Temp.values.reshape(-1, 1)
X = training[['MEI', 'CO2', 'CH4', 'N2O', 'CFC-11', 'CFC-12', 'TSI', 'Aerosols']]

print(training.shape)
print(target.shape)
print(X.shape)

print(testing.shape)

(284, 11)
(284, 1)
(284, 8)
(24, 11)


In [114]:
lm.fit(X, target)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [115]:
# 1 - What's the model R2 (the "Multiple R-squared" value)?

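# Format to three decimals directly, so this works whether score() returns a NumPy or a plain Python float
print('R2: {:.3f}'.format(lm.score(X, target)))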

R2: 0.751
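
For reference, here is a minimal sketch of what the R2 score measures, and of the adjusted R2 that penalises extra predictors. The helper names `r_squared` and `adjusted_r_squared` are my own and not part of scikit-learn; the toy numbers in the first call are made up, while 0.751, 284 and 8 are the R2, row count and predictor count printed above.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    ss_res = np.sum((y_true - y_pred) ** 2)          # variation left unexplained
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n_obs, n_predictors):
    """Penalise R2 for the number of predictors (intercept not counted)."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Toy check with made-up numbers: perfect predictions give R2 = 1.
print(r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))

# Values from this notebook: R2 = 0.751, 284 training rows, 8 predictors.
print(round(adjusted_r_squared(0.751, 284, 8), 3))
```

The second call should come out at about 0.744, slightly below the plain R2 because eight predictors are being paid for.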


In [116]:
# 2 - Which variables are significant in the model? We will consider a variable significant only if
#     the p-value is below 0.05. (Select all that apply.)

from pprint import pprint

pprint(dict(zip(X.columns, lm.coef_[0])))

{'Aerosols': -1.537613238105092,
 'CFC-11': -0.0066304888893799823,
 'CFC-12': 0.0038081032430233078,
 'CH4': 0.00012404189575249656,
 'CO2': 0.0064573592723367445,
 'MEI': 0.064205313367525677,
 'N2O': -0.016528003257475294,
 'TSI': 0.093141083484999956}


In [117]:
# Answer: all of the variables except CH4 and N2O.
#         The coefficients alone do not show this,
#         so I will use statsmodels, whose summary is similar to R's summary function
#         (the function the question was written for).

import statsmodels.api as sm

XX = sm.add_constant(X)
ols = sm.OLS(target, XX)
result = ols.fit()
result.summary()

Dep. Variable:,y,R-squared:,0.751
Model:,OLS,Adj. R-squared:,0.744
Method:,Least Squares,F-statistic:,103.6
Date:,"Sat, 24 Jun 2017",Prob (F-statistic):,1.94e-78
Time:,05:58:28,Log-Likelihood:,280.1
No. Observations:,284,AIC:,-542.2
Df Residuals:,275,BIC:,-509.4
Df Model:,8,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
const,-124.5943,19.887,-6.265,0.000,-163.744,-85.445
MEI,0.0642,0.006,9.923,0.000,0.051,0.077
CO2,0.0065,0.002,2.826,0.005,0.002,0.011
CH4,0.0001,0.001,0.240,0.810,-0.001,0.001
N2O,-0.0165,0.009,-1.930,0.055,-0.033,0.000
CFC-11,-0.0066,0.002,-4.078,0.000,-0.010,-0.003
CFC-12,0.0038,0.001,3.757,0.000,0.002,0.006
TSI,0.0931,0.015,6.313,0.000,0.064,0.122
Aerosols,-1.5376,0.213,-7.210,0.000,-1.957,-1.118

Omnibus:,8.74,Durbin-Watson:,0.956
Prob(Omnibus):,0.013,Jarque-Bera (JB):,10.327
Skew:,0.289,Prob(JB):,0.00572
Kurtosis:,3.733,Cond. No.,8530000.0


In [118]:
# Checking by the p-values
# (if p-value < 0.05, reject the null hypothesis that the coefficient is zero: the variable is significant;
#  otherwise it is not significant)
# Indeed, all of the variables are significant except for CH4 and N2O
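
The same check can be done programmatically rather than by eye. The sketch below copies the rounded p-values from the `P>|t|` column of the summary table into a pandas Series and keeps those below 0.05; the names `p_values`, `alpha`, `significant` and `not_significant` are mine. With the fitted model at hand, I would expect `result.pvalues` to hold the same numbers unrounded, but I have not verified that here.

```python
import pandas as pd

# p-values copied (rounded) from the P>|t| column of the summary table above.
p_values = pd.Series({
    'MEI': 0.000, 'CO2': 0.005, 'CH4': 0.810, 'N2O': 0.055,
    'CFC-11': 0.000, 'CFC-12': 0.000, 'TSI': 0.000, 'Aerosols': 0.000,
})

alpha = 0.05  # significance threshold stated in the question
significant = p_values[p_values < alpha].index.tolist()
not_significant = p_values[p_values >= alpha].index.tolist()

print('Significant:', significant)
print('Not significant:', not_significant)
```

N2O lands just above the threshold (0.055), which is why it is excluded together with CH4.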

In [119]:
# 3 - Current scientific opinion is that nitrous oxide and CFC-11 are greenhouse gases:
#     gases that are able to trap heat from the sun and contribute to the heating of the Earth.
#     However, the regression coefficients of both the N2O and CFC-11 variables are negative,
#     indicating that increasing atmospheric concentrations of either of these two compounds is associated with lower global temperatures.

# Which of the following is the simplest correct explanation for this contradiction?


# 1) Climate scientists are wrong that N2O and CFC-11 are greenhouse gases
# 2) There is not enough data, so the regression coefficients being estimated are not accurate.
# 3) All of the gas concentration variables reflect human development - N2O and CFC.11
#    are correlated with other variables in the data set.


# Answer: 3
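
To illustrate why correlated predictors make individual coefficients unreliable, here is a small sketch on synthetic data (not the climate file). It fits the same true model twice, once with independent predictors and once with two nearly identical ones, and compares the standard errors of the estimates. The helper `ols_with_std_errors` and all variable names are my own, and NumPy's least squares stands in for `LinearRegression` to keep the example short.

```python
import numpy as np


def ols_with_std_errors(X, y):
    """Ordinary least squares with an intercept; returns (coefs, std_errors)."""
    A = np.column_stack([np.ones(len(X)), X])          # prepend intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ beta
    dof = A.shape[0] - A.shape[1]                      # observations minus fitted coefficients
    sigma2 = residuals @ residuals / dof               # residual variance estimate
    cov = sigma2 * np.linalg.inv(A.T @ A)              # covariance of the estimates
    return beta, np.sqrt(np.diag(cov))


rng = np.random.default_rng(0)                         # synthetic data, values are made up
n = 284
x1 = rng.normal(size=n)
noise = rng.normal(scale=0.5, size=n)

# Case A: second predictor independent of the first.
x2_indep = rng.normal(size=n)
# Case B: second predictor almost a copy of the first (correlation near 1).
x2_coll = x1 + rng.normal(scale=0.05, size=n)

results = {}
for label, x2 in [('independent', x2_indep), ('collinear', x2_coll)]:
    y = 1.0 * x1 + 1.0 * x2 + noise                    # both true effects are positive
    beta, se = ols_with_std_errors(np.column_stack([x1, x2]), y)
    results[label] = (beta, se)
    print(label, 'coefs:', beta[1:].round(2), 'std errors:', se[1:].round(2))
```

In the collinear case the standard errors are far larger, so an estimated coefficient can easily end up with the wrong sign even though the true effect is positive. That is the situation N2O and CFC-11 are plausibly in here.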

In [120]:
# 4 - Compute the correlations between all the variables in the training set.
#     Which of the following independent variables is N2O highly correlated with
#     (absolute correlation greater than 0.7)? Select all that apply.


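# Note: this computes the correlations on the full data set;
#       training.corr() would restrict them to the training set, as the question asks.
climate.corr()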


# Answer:  N2O is highly correlated with CO2, CH4, CFC-12
#          CFC-11 is highly correlated with CH4 and CFC-12

,Year,Month,MEI,CO2,CH4,N2O,CFC-11,CFC-12,TSI,Aerosols,Temp
Year,1.0,-0.025789,-0.145345,0.985379,0.910563,0.99485,0.460965,0.870067,0.022353,-0.361884,0.755731
Month,-0.025789,1.0,-0.016345,-0.096287,0.017558,0.012395,-0.014914,-0.001084,-0.032754,0.014845,-0.098016
MEI,-0.145345,-0.016345,1.0,-0.152911,-0.105555,-0.162375,0.088171,-0.039836,-0.076826,0.352351,0.135292
CO2,0.985379,-0.096287,-0.152911,1.0,0.872253,0.981135,0.401284,0.82321,0.017867,-0.369265,0.748505
CH4,0.910563,0.017558,-0.105555,0.872253,1.0,0.894409,0.713504,0.958237,0.146335,-0.290381,0.699697
N2O,0.99485,0.012395,-0.162375,0.981135,0.894409,1.0,0.412155,0.839295,0.039892,-0.353499,0.743242
CFC-11,0.460965,-0.014914,0.088171,0.401284,0.713504,0.412155,1.0,0.831381,0.284629,-0.032302,0.380111
CFC-12,0.870067,-0.001084,-0.039836,0.82321,0.958237,0.839295,0.831381,1.0,0.18927,-0.243785,0.688944
TSI,0.022353,-0.032754,-0.076826,0.017867,0.146335,0.039892,0.284629,0.18927,1.0,0.083238,0.182186
Aerosols,-0.361884,0.014845,0.352351,-0.369265,-0.290381,-0.353499,-0.032302,-0.243785,0.083238,1.0,-0.392069
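
Reading the answer off a large correlation table is error-prone, so here is a small sketch that filters it instead. `highly_correlated` is a hypothetical helper of my own, and the `toy` frame is made up just to show the behaviour.

```python
import pandas as pd


def highly_correlated(df, column, threshold=0.7, exclude=()):
    """Names of columns whose absolute correlation with `column` exceeds `threshold`."""
    corr = df.corr()[column].drop([column, *exclude])  # ignore self-correlation and excluded columns
    return corr[corr.abs() > threshold].index.tolist()


# Tiny made-up frame: 'b' tracks 'a' closely, 'c' does not.
toy = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 10.5],
    'c': [5, 1, 4, 2, 3],
})
print(highly_correlated(toy, 'a'))   # ['b']
```

In this notebook the call would presumably look like `highly_correlated(training, 'N2O', exclude=['Year', 'Month', 'Temp'])`, which restricts the check to the independent variables.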


In [121]:
# 5 - Given that the correlations are so high, let us focus on the N2O variable and build a model with
#     only MEI, TSI, Aerosols and N2O as independent variables. Remember to use the training set to build the model.

# 5.1 - What's the coefficient of N2O in this reduced model?
#       (How does this compare to the coefficient in the previous model with all of the variables?)

# 5.2 - What's the R2 score?

XXX = training[['MEI', 'TSI', 'Aerosols', 'N2O']]
lm.fit(XXX, target)

coeffs = dict(zip(XXX.columns, lm.coef_[0]))

print('Coefficient of N2O: {:.3f}'.format(coeffs['N2O']))  # the sign flipped from negative to positive and the magnitude grew
print('R2: {:.3f}'.format(lm.score(XXX, target)))  # R2 was 0.751 with all features included

Coefficient of N2O: 0.025
R2: 0.726
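
The testing set defined at the start has not been used so far. Below is a minimal sketch of how an out-of-sample R2 could be computed after splitting by year. It runs on synthetic monthly data (the column `x` and all values are made up), and the helpers `fit_ols`, `predict` and `r_squared` are my own stand-ins built on NumPy rather than scikit-learn.

```python
import numpy as np
import pandas as pd


def fit_ols(X, y):
    """Least-squares coefficients, with the intercept in position 0."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta


def predict(beta, X):
    """Apply fitted coefficients to a 2-D array of predictors."""
    return beta[0] + np.asarray(X, dtype=float) @ beta[1:]


def r_squared(y_true, y_pred):
    """R2 with the mean of y_true as the baseline."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Synthetic monthly data standing in for climate_change.csv (values are made up).
rng = np.random.default_rng(42)
years = np.repeat(np.arange(2000, 2009), 12)           # 9 years x 12 months
demo = pd.DataFrame({'Year': years, 'x': rng.normal(size=len(years))})
demo['Temp'] = 0.5 * demo['x'] + rng.normal(scale=0.01, size=len(demo))

# Same split rule as in the notebook: train up to and including 2006, test afterwards.
train = demo[demo.Year <= 2006]
test = demo[demo.Year > 2006]

beta = fit_ols(train[['x']].to_numpy(), train['Temp'].to_numpy())
test_r2 = r_squared(test['Temp'].to_numpy(), predict(beta, test[['x']].to_numpy()))
print('Test R2: {:.3f}'.format(test_r2))
```

With the model already fitted in this notebook, the equivalent call would be `lm.score(testing[XXX.columns], testing.Temp)`.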


---
<i> There are three more questions in the assignment, but they relate to the step function in R.
     I don't know whether there is an equivalent method in Python.</i>
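
R's `step` chooses variables by AIC, so one possible hand-rolled analogue is backward elimination: start with all predictors and keep dropping the one whose removal lowers the AIC most, until no removal helps. The sketch below is only an approximation of that idea, not a port of `step`. The functions `aic` and `backward_stepwise` are my own, the data is synthetic, and for simplicity the loop never drops the last remaining predictor.

```python
import numpy as np
import pandas as pd


def aic(X, y):
    """AIC up to an additive constant: n*log(RSS/n) + 2*(number of fitted coefficients)."""
    A = np.column_stack([np.ones(len(X)), X])          # intercept plus predictors
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = np.sum((y - A @ beta) ** 2)
    n = len(y)
    return n * np.log(rss / n) + 2 * A.shape[1]


def backward_stepwise(df, target, candidates):
    """Drop one predictor at a time for as long as doing so lowers the AIC."""
    y = df[target].to_numpy(dtype=float)
    selected = list(candidates)
    current = aic(df[selected].to_numpy(dtype=float), y)
    while len(selected) > 1:
        # AIC of every model obtained by removing a single predictor.
        trials = {
            col: aic(df[[c for c in selected if c != col]].to_numpy(dtype=float), y)
            for col in selected
        }
        best_col = min(trials, key=trials.get)
        if trials[best_col] >= current:
            break                                      # no removal improves the AIC
        selected.remove(best_col)
        current = trials[best_col]
    return selected


# Synthetic example (made-up data): x1 and x2 drive y, x3 is pure noise.
rng = np.random.default_rng(1)
n = 200
demo = pd.DataFrame({name: rng.normal(size=n) for name in ['x1', 'x2', 'x3']})
demo['y'] = 2.0 * demo['x1'] - 1.0 * demo['x2'] + rng.normal(scale=0.5, size=n)

kept = backward_stepwise(demo, 'y', ['x1', 'x2', 'x3'])
print('Kept predictors:', kept)
```

On the climate data the call would presumably be `backward_stepwise(training, 'Temp', list(X.columns))`. The informative predictors should survive, while a noise variable is usually (though not always) dropped.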