# Homework 1 

## 1.1 Climate Change

### Q 1.1 (a)

Build a linear regression model to predict the dependent variable Temp, using CO2,
CH4, N2O, CFC-11, CFC-12, Aerosols, TSI and MEI as features (Year and Month
should NOT be used as features in the model). As always, use only the training set to
train your model. What are the in-sample and out-of-sample R^2, MSE, and MAE?

In [1]:
## We'll start by importing some packages that we need.

# This line imports a package called pandas, which lets us manipulate data.
# The "as pd" portion allows us to abbreviate it as "pd" instead of the longer "pandas"
# When we use anything from pandas, we will need to preface it with "pd."
import pandas as pd

# These lines import some functions from a package called sklearn (SciKit-Learn)
# The syntax differs from the previous line: we are just importing parts of sklearn, not
# the entire package 
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

In [2]:
## Loading data into python

# We use the "read_csv" function from pandas to import our csv file 
# (make sure it is in the same folder as this notebook)
# This creates an object called a "DataFrame" (think of it as a table or spreadsheet), 
# which we store in a variable we call "df" (or whatever you like)
df = pd.read_csv('climate_change.csv')

In [3]:
# Inspect the dataset
df.head(3)

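| Unnamed: 0 | Year | Month | MEI | CO2 | CH4 | N2O | CFC-11 | CFC-12 | TSI | Aerosols | Temp |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1983 | 5 | 2.556 | 345.96 | 1638.59 | 303.677 | 191.324 | 350.113 | 1366.1024 | 0.0863 | 0.109 |
| 1 | 1983 | 6 | 2.167 | 345.52 | 1633.71 | 303.746 | 192.057 | 351.848 | 1366.1208 | 0.0794 | 0.118 |
| 2 | 1983 | 7 | 1.741 | 344.15 | 1633.22 | 303.795 | 192.818 | 353.725 | 1366.285 | 0.0731 | 0.137 |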
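
The `Unnamed: 0` column in the output above suggests that the first column of the CSV is a saved row index. As a minimal sketch, using a made-up in-memory CSV rather than the real `climate_change.csv`, passing `index_col=0` to `read_csv` keeps that column out of the data columns. The names `csv_text`, `df_default`, and `df_indexed` are illustrative only.

```python
import io

import pandas as pd

# Made-up CSV text whose first column has an empty header, like a saved index.
csv_text = ",Year,Month,Temp\n0,1983,5,0.109\n1,1983,6,0.118\n"

# Default behaviour: the unnamed first column becomes a regular data column.
df_default = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column is used as the row index instead.
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(df_default.columns))
print(list(df_indexed.columns))
```

This does not change the models in this notebook, since the feature lists select columns by name.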


In [4]:
# We want to split our DataFrame into two separate DataFrames, 
# one containing the samples (rows) for years up to and including 2002,
# and the other containing the remaining samples.

# Try running this first:
df['Year'] <= 2002

0       True
1       True
2       True
3       True
4       True
       ...  
303    False
304    False
305    False
306    False
307    False
Name: Year, Length: 308, dtype: bool

In [5]:
# What gets created is a column of True/False values (technically called "Booleans")
# This is useful because you can select a subset of rows of a DataFrame using such a
# column of Booleans:
df_train = df[df['Year'] <= 2002]
df_test = df[df['Year'] > 2002]

In [6]:

# First we create a "blank" linear regression model, 
# and store it in a variable called LR (or whatever you like)
LR = LinearRegression()

# Then we use the "fit" function, which takes two arguments:
# (1) A DataFrame consisting of the features (which we choose to call X),
# and (2) a single column consisting of the dependent variable (which we choose to call y)
features = [
 'CO2',
 'CH4',
 'N2O',
 'CFC-11',
 'CFC-12',
 'Aerosols',
 'TSI',
 'MEI'
]
X = df_train[features]
y = df_train['Temp']
LR.fit(X,y)

# After fitting, we can check the coefficients with the "coef_" attribute
print(LR.coef_)

# The intercept is accessed separately with the "intercept_" attribute
print(LR.intercept_)

[ 6.24077568e-03  2.62189354e-04 -3.48478075e-02 -8.87194950e-03
  5.48441303e-03 -1.65036522e+00  1.19394890e-01  6.59008764e-02]
-155.18672675256227


In [7]:
# Once the model is trained, generate in-sample predictions on the training data
# (the metrics themselves are computed in the next cell)
y_pred = LR.predict(X)
print(y_pred)


[ 0.17388425  0.15841559  0.15350057  0.12857703  0.06130311  0.02553833
  0.03309255  0.04231473 -0.031664   -0.01068428  0.10197352  0.05259175
  0.06681793  0.07819404  0.06700922  0.03139968  0.01048613  0.00336547
 -0.03275843 -0.03910317 -0.05346115 -0.03884259 -0.03493211 -0.02765684
 -0.01932085  0.03151004  0.01593166  0.00613976 -0.01614836 -0.00380329
  0.02097338  0.01148007  0.00479013 -0.01005374  0.02264638  0.01257807
  0.04972718  0.03129798  0.01525541  0.01241494  0.03175436  0.02624214
  0.04030077  0.07511653  0.0961955   0.10668054  0.16246885  0.19368148
  0.22507544  0.20316498  0.18100737  0.18159762  0.17599625  0.16577688
  0.17127095  0.18609952  0.18175286  0.17257688  0.14649912  0.15469565
  0.17038154  0.08283408  0.01720794  0.02858676  0.04777394  0.01690012
  0.03173094  0.04418694  0.09642237  0.09174279  0.0891332   0.15054893
  0.20714106  0.16115666  0.22516949  0.13804734  0.11715114  0.1953378
  0.19643068  0.22171703  0.22888664  0.22235451  0.

In [8]:
# The functions "r2_score", "mean_squared_error", and "mean_absolute_error" calculate regression metrics
# Each takes two arguments: (1) a column of true values, and (2) a column of predicted values

# Here, we are computing the in-sample metrics on the training data
print("In-sample R^2:",r2_score(y,y_pred))
print("In-sample Mean Squared Error:",mean_squared_error(y,y_pred))
print("In-sample Mean Absolute Error:",mean_absolute_error(y,y_pred))

In-sample R^2: 0.692059595998479
In-sample Mean Squared Error: 0.00873142640991104
In-sample Mean Absolute Error: 0.07260918612938823


In [9]:
# Now compute the out-of-sample metrics on the test data

X_test = df_test[features]
y_test = df_test['Temp']



# Predict the values for the test dataset
y_pred_test = LR.predict(X_test)


# Now, calculate the out-of-sample metrics
print("Out-of-sample R^2:", r2_score(y_test, y_pred_test))
print("Out-of-sample Mean Squared Error:", mean_squared_error(y_test, y_pred_test))
print("Out-of-sample Mean Absolute Error:", mean_absolute_error(y_test, y_pred_test))



Out-of-sample R^2: -0.5413255834026054
Out-of-sample Mean Squared Error: 0.012206974835139621
Out-of-sample Mean Absolute Error: 0.093127478912773


### Q 1.1 (b)

Build another linear regression model, this time with only N2O, Aerosols, TSI, and
MEI as features. What are the in-sample and out-of-sample R^2, MSE, and MAE?

In [10]:
# As in part (a), create a new "blank" linear regression model.
# Note: reusing the name LR overwrites the fitted model from part (a).
LR = LinearRegression()

# Fit on the training set again, this time with the reduced feature list
# (X holds the features, y holds the dependent variable)
features = [
 'N2O',
 'Aerosols',
 'TSI',
 'MEI'
]
X = df_train[features]
y = df_train['Temp']
LR.fit(X,y)

# After fitting, we can check the coefficients with the "coef_" attribute
print(LR.coef_)

# The intercept is accessed separately with the "intercept_" attribute
print(LR.intercept_)

[ 0.02427612 -1.72465971  0.08577046  0.06549568]
-124.48412557340242


In [11]:
# Once the model is trained, generate in-sample predictions on the training data
# (the metrics themselves are computed in the next cell)
y_pred = LR.predict(X)
print(y_pred)


[ 0.07777432  0.06744988  0.06568712  0.0483366  -0.00283643 -0.03541476
 -0.03366419 -0.029885   -0.08778917 -0.07478511  0.0222632  -0.00854613
  0.00316551  0.01947408  0.02793806  0.02096504  0.02504365  0.03120327
  0.00366803 -0.00274222 -0.00878533 -0.00358603 -0.01546181 -0.0125586
 -0.01748963  0.02979212  0.02270127  0.01846716  0.00687688  0.02496252
  0.04357286  0.03066313  0.028157    0.02608503  0.0580052   0.0455205
  0.08779545  0.08816069  0.09626182  0.12034769  0.14848411  0.14087662
  0.13672688  0.15030005  0.14788956  0.13805188  0.16800359  0.16625843
  0.19634092  0.19561266  0.19726892  0.19992313  0.20112011  0.1893729
  0.1816751   0.18861451  0.18380039  0.16141729  0.13459232  0.13377162
  0.13739617  0.05840184  0.01264119  0.02538567  0.0454641   0.03954672
  0.0552308   0.06451409  0.10348405  0.10443589  0.10160263  0.13228936
  0.17272794  0.14725364  0.19853802  0.14771903  0.14860422  0.2108297
  0.22174682  0.24424235  0.25144887  0.24820447  0.277

In [12]:
# Here, we are computing the in-sample metrics on the training data
print("In-sample R^2:",r2_score(y,y_pred))
print("In-sample Mean Squared Error:",mean_squared_error(y,y_pred))
print("In-sample Mean Absolute Error:",mean_absolute_error(y,y_pred))

In-sample R^2: 0.6490120806760438
In-sample Mean Squared Error: 0.0099520074291056
In-sample Mean Absolute Error: 0.07666650280233127


In [13]:
# Now compute the out-of-sample metrics on the test data

X_test = df_test[features]
y_test = df_test['Temp']



# Predict the values for the test dataset
y_pred_test = LR.predict(X_test)


# Now, calculate the out-of-sample metrics
print("Out-of-sample R^2:", r2_score(y_test, y_pred_test))
print("Out-of-sample Mean Squared Error:", mean_squared_error(y_test, y_pred_test))
print("Out-of-sample Mean Absolute Error:", mean_absolute_error(y_test, y_pred_test))


Out-of-sample R^2: 0.20031861104556403
Out-of-sample Mean Squared Error: 0.006333308611894022
Out-of-sample Mean Absolute Error: 0.06154027269393324


### Q 1.1 (c)

Between the two models built in parts (a) and (b), which performs better in-sample?
Which performs better out-of-sample?


| Model | Sample | R^2 | Mean Squared Error | Mean Absolute Error |
|---|---|---|---|---|
| Model A | In-sample | 0.692059595998479 | 0.00873142640991104 | 0.07260918612938823 |
| Model B | In-sample | 0.6490120806760438 | 0.0099520074291056 | 0.07666650280233127 |
| Model A | Out-of-sample | -0.5413255834026054 | 0.012206974835139621 | 0.093127478912773 |
| Model B | Out-of-sample | 0.20031861104556403 | 0.006333308611894022 | 0.06154027269393324 |


#### Answer

Model A performs better in-sample. It has a higher R^2 (0.692 vs. 0.649), meaning it explains more of the variance in the training data, and lower MSE and MAE, meaning its training predictions are closer to the observed values. This is expected: Model B's features are a subset of Model A's, so a least-squares fit with the extra features cannot fit the training data any worse.

Out-of-sample, Model B performs better. Its R^2 is positive (0.200), so it predicts better than a constant equal to the test-set mean, whereas Model A's negative R^2 (-0.541) means it predicts worse than that baseline. Model B also has lower MSE and MAE, so its predictions are closer to the actual test values.
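
Model A's out-of-sample MSE is not dramatically larger than its in-sample MSE, yet its R^2 drops below zero. The arithmetic sketch below offers one way to see why. Because MSE = SS_res / n, the definition R^2 = 1 - SS_res / SS_tot implies that SS_tot / n, the variance of Temp within the scored split, equals MSE / (1 - R^2). The numbers are the metrics reported above; the names `reported` and `implied_variance` are made up for illustration.

```python
# (R^2, MSE) pairs copied from the reported results above.
reported = {
    ("A", "train"): (0.692059595998479, 0.00873142640991104),
    ("B", "train"): (0.6490120806760438, 0.0099520074291056),
    ("A", "test"): (-0.5413255834026054, 0.012206974835139621),
    ("B", "test"): (0.20031861104556403, 0.006333308611894022),
}


def implied_variance(r2, mse):
    """Variance of the target implied by R^2 = 1 - MSE / Var(y)."""
    return mse / (1 - r2)


implied = {key: implied_variance(r2, mse) for key, (r2, mse) in reported.items()}

for (model_name, split), var in implied.items():
    print(f"Model {model_name} ({split}): implied Var(Temp) = {var:.5f}")
```

Both models imply the same variance within each split, as they must, and the test-set variance (roughly 0.008) is much smaller than the training-set variance (roughly 0.028). A test MSE of about 0.012 is therefore worse than the mean baseline, even though it would count as a reasonable error on the training data.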







### Q 1.1 (d)

For each of the two models built in parts (a) and (b), what was the regression coefficient
for the N2O feature, and how should this coefficient be interpreted?

Model A (N2O Coefficient):

-3.48478075e-02

Model B (N2O Coefficient):

0.02427612



#### Answer

For Model A, the N2O coefficient is about -0.0348: holding the other features fixed, a one-unit increase in N2O is associated with a decrease of about 0.0348 in Temp (the global temperature relative to a reference value). This is counterintuitive, since N2O is a greenhouse gas and should contribute to warming. A likely explanation is that N2O is strongly correlated with the other greenhouse gas features in Model A (CO2, CH4, CFC-11, CFC-12), which can make individual coefficients unstable.

For Model B, the N2O coefficient is about 0.0243: holding the other features fixed, a one-unit increase in N2O is associated with an increase of about 0.0243 in Temp. This is consistent with the scientific understanding that N2O traps heat and contributes to warming.
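
The "all else equal" reading of a coefficient can be checked directly. The sketch below rebuilds Model B's prediction from the coefficients and intercept printed in part (b), then raises N2O by one unit while leaving the other features unchanged. The feature values come from the first row shown by `df.head(3)`; the names `coef_b`, `intercept_b`, and `predict_temp` are made up for illustration.

```python
# Coefficients and intercept as printed for Model B (coefficients are rounded).
coef_b = {
    "N2O": 0.02427612,
    "Aerosols": -1.72465971,
    "TSI": 0.08577046,
    "MEI": 0.06549568,
}
intercept_b = -124.48412557340242


def predict_temp(row):
    """Linear model: intercept plus the coefficient-weighted sum of features."""
    return intercept_b + sum(coef_b[name] * row[name] for name in coef_b)


# First row of the dataset (May 1983), restricted to Model B's features.
first_row = {"N2O": 303.677, "Aerosols": 0.0863, "TSI": 1366.1024, "MEI": 2.556}

# Same row with N2O increased by one unit and everything else unchanged.
bumped_row = dict(first_row, N2O=first_row["N2O"] + 1)

base_pred = predict_temp(first_row)
effect = predict_temp(bumped_row) - base_pred

print("Prediction for first row:", base_pred)
print("Change from +1 N2O:", effect)
```

The change in the prediction equals the N2O coefficient, and the base prediction should land close to the first in-sample prediction printed for Model B, with small differences due to the rounded coefficients.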





### Q 1.1 (e)

Given your responses to parts (c) and (d), which of the two models should you prefer
to use moving forward?


#### Answer

In general, we prefer models that perform well out-of-sample, since that is the best indication of how they will behave on new data. Model B, which uses only N2O, Aerosols, TSI, and MEI, is therefore the better choice: it has the superior out-of-sample R^2, MSE, and MAE. In addition, the N2O coefficient in Model B is positive, which agrees with the scientific understanding that N2O contributes to global warming and climate change; this is not the case in Model A. Based on the arguments in parts (c) and (d), Model B is preferred going forward.
