## Part III: Final Analysis Report
Michael Rich, Alex Outkou, Luke Costello, Harry Herman

#### <font color='blue'>Introduction:</font>
**Research Question**: Our main research question was whether several macroeconomic and climate indicators could be used to build a predictive model of future returns on corn, wheat, and soybean futures. In essence, we were implicitly testing the Efficient Market Hypothesis (EMH), which states that asset prices (commodity prices in this case) reflect all available information. If the EMH holds, it should be impossible to "beat the market" by using historical data to find arbitrage opportunities.

**Our Process Overview**: 
The following gives a brief overview of our process to answer our research question. Please refer to the sub-sections below for a more in-depth overview on our pain points, analysis and specific processes.

1. We collected macroeconomic, climate, and commodity price data from a variety of sources. We pulled the following macroeconomic series from FRED: **GDP, CPIAUCSL (Consumer Price Index), UNRATE (unemployment rate)**, and obtained the commodity futures and S&P 500 price data from the following sources:

     - Corn Future Prices: https://www.investing.com/commodities/us-corn-historical-data
     - Wheat Future Prices: https://www.investing.com/commodities/us-wheat-historical-data
     - Soybeans Future Prices: https://www.investing.com/commodities/us-soybeans-historical-data
     - S&P 500 Data: https://www.investing.com/indices/us-spx-500-historical-data


2. Within the **data_cleaning.ipynb** file, we imported all of this data (from **/input_data** folder) and cleaned it up. We removed unnecessary columns, defined the appropriate timeframes, and merged everything together to create one final DataFrame. This DataFrame was exported as a **commodities_df.csv** file and put into an **/output_data** folder for easy future access and use.

3. Within the **part1_regressions.ipynb** file, we ran several different regression models on all three commodities in order to obtain information on potential correlations between variables and their respective relationships.

4. Within the **part2_modeling.ipynb** file, we split our dataset into a training and holdout set in order to build several predictive models to estimate the future commodity returns. We chose to optimize R^2 as our metric of success. 

**Outcome:** In short, we couldn't build a *reliable* model. The efficient market hypothesis held.

#### <font color='blue'>Part 0: Data Cleaning</font>
*Source: data_cleaning.ipynb*

**Step 1: Convert Date Columns to datetime**

One of the first major issues we ran into was that the 'date' column came in a different format in most of our data sources, whether MM/DD/YYYY, MM - DD - YYYY, M - YYYY, or something else. Since we were going to merge all of our data on the date column, we needed all of these columns to be in exactly the same format. To do this, we applied the **pd.to_datetime()** function to the 'Date' column of many of the loaded .csv files.
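As a minimal sketch of this step, the snippet below parses the three formats mentioned above with an explicit `format` string for each source. The sample strings and variable names are hypothetical and are not taken from the project's input files.

```python
import pandas as pd

# Hypothetical raw date strings, one Series per source format.
mdy_slash = pd.Series(["01/02/1990", "02/01/1990"])        # MM/DD/YYYY
mdy_dash = pd.Series(["01 - 02 - 1990", "02 - 01 - 1990"])  # MM - DD - YYYY
m_y = pd.Series(["1 - 1990", "2 - 1990"])                   # M - YYYY

# An explicit format per source avoids ambiguous day/month guessing.
d1 = pd.to_datetime(mdy_slash, format="%m/%d/%Y")
d2 = pd.to_datetime(mdy_dash, format="%m - %d - %Y")
d3 = pd.to_datetime(m_y, format="%m - %Y")  # the day defaults to the 1st

print(d1.tolist(), d2.tolist(), d3.tolist())
```

Once every source is a true datetime column, the merge key compares equal across files regardless of how the date was originally written.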

**Step 2: Scrape Macroeconomic data from fred**

We used **pandas_datareader** to download most of our macroeconomic data *(GDP, CPIAUCSL, UNRATE)* from FRED for 1990 to 2022. The following code did that for us:

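```
import datetime
import pandas_datareader as pdr

# Sample window: January 1990 through February 2022
start = datetime.datetime(1990, 1, 1)
end = datetime.datetime(2022, 2, 28)

# Download the three series from FRED in a single call
macro_df = pdr.data.DataReader(['GDP', 'CPIAUCSL', 'UNRATE'], 'fred', start, end)
```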
Because the macroeconomic series are published at monthly frequency at best (and GDP only quarterly), we decided to convert all of our data to monthly time blocks. This avoided gaps in our variables after merging. To do so, we kept only the first day of each month like so:

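```
# Assumes DATE is a regular column; DataReader returns it as the index, so reset_index() first if needed
macro_df = macro_df.loc[macro_df['DATE'].dt.day == 1]  # keeps only the first day of each month
```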

**Step 3: Merge Commodity Future and Climate Data**

We decided to target the following climate data:

- PRCP
- SNOW
- TMAX (temp max)
- TMIN (temp min)

After cleaning the data to only have these columns, we merged the commodity price and climate data to the macroeconomic data to give us one dataframe with most of our needed information. Because GDP was only computed quarterly, we interpolated the GDP column (using a 'nearest' method) to fill in the missing data.

We repeated this process for soybeans and wheat data. 
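The quarterly-to-monthly fill for GDP can be sketched as follows. This is only an illustration: it uses `reindex(..., method="nearest")` as a stand-in for the 'nearest' interpolation described above (it gives a similar nearest-neighbour fill without extra dependencies), and the GDP values are made up.

```python
import pandas as pd

# Hypothetical quarterly GDP levels, dated on the first day of each quarter.
gdp_quarterly = pd.Series(
    [100.0, 103.0, 106.0],
    index=pd.to_datetime(["1990-01-01", "1990-04-01", "1990-07-01"]),
    name="GDP",
)

# Target monthly grid (month-start dates) covering the same span.
monthly_index = pd.date_range("1990-01-01", "1990-07-01", freq="MS")

# Each month takes the value of the quarterly observation closest in time.
gdp_monthly = gdp_quarterly.reindex(monthly_index, method="nearest")
print(gdp_monthly)
```

Note that a nearest-neighbour fill can assign a month the value of a quarter that had not yet been published at that date, which is worth keeping in mind for a forecasting exercise.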

**Step 4: Merge S&P500 Data**

When we tried to pull S&P 500 prices from FRED, some of the data for our timeline was unavailable. Thus, we obtained the S&P 500 prices from a different site (investing.com, listed above) and merged them into our main dataframe.

We then renamed many columns and ensured that the final dataframe was cleaned and clear from errors.

**Step 5: Calculate Realized Commodity Returns and GDP/CPI returns**

We needed realized commodity returns, which we calculated from the respective commodity prices. To do this, we first needed to convert the price columns to a float datatype, as they were object datatypes before. We used the following to compute the realized return for corn and applied the same concept for the other commodities/returns needed:

```
Commodities_Final['realized_ret_corn'] = (np.log(Commodities_Final['Corn_Future_Price'].shift(-1)) 
                                       - np.log(Commodities_Final['Corn_Future_Price']))
```
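To make the timing of this calculation concrete, here is a small sketch on made-up prices. Because of `shift(-1)`, the value stored in row *t* is the log return earned between month *t* and month *t+1*, and the last row has no return.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly prices; only the column names follow the report.
prices = pd.DataFrame({"Corn_Future_Price": [100.0, 110.0, 99.0]})

# Log return from this month to the next month.
prices["realized_ret_corn"] = (
    np.log(prices["Corn_Future_Price"].shift(-1))
    - np.log(prices["Corn_Future_Price"])
)
print(prices)
```

The first row holds log(110/100), the second log(99/110), and the third is NaN because there is no following month.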

**Step 6: Export DataFrame as a .csv file**

We decided to export the final cleaned DataFrame to an **output_data** folder, and named the file **commodities_df.csv** for easy access and use for the Regression and Predictive Modeling Analysis below. 

#### <font color='blue'>Part I: Regression Analysis</font>
*Source: part1_regressions.ipynb*

The purpose of this section was to gain an initial understanding of the degree of correlation between the various independent variables in our Commodities_DF dataset and the dependent variables (i.e. the variables we are trying to predict), which are the commodity returns in our case. Before moving on, it is important to note that the regressions run in this section examine the relationship between commodity returns and various financial, macroeconomic, and climate variables *in the same time period*. Consequently, the results of this section are not indicative of the predictive ability of our independent variables, but rather of their degree of correlation with commodity returns.

**Step 1: Load Commodity DF and Create a Market Risk Premium Variable**

Upon loading in our comprehensive commodities_DF dataframe from the **data_cleaning.ipynb** file, we used the **sp500_Price** column in order to find the market risk premium, which would be another independent variable used to help predict commodity returns within our model. We used the following process to do this:

- Computed the monthly returns for the S&P 500
- Used the .rolling() function to compute a rolling 60-month average of the S&P 500 returns
- Calculated estimates for the market risk premium for each observation in our dataset by subtracting a monthly risk-free rate (0.407%) from our S&P 500 returns
- Used a weighted average to ensure positive market risk premiums
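The first three steps can be sketched roughly as below, on simulated prices. Two assumptions are made here: the risk-free rate is subtracted from the rolling 60-month average rather than from each raw monthly return, and the final weighted-average adjustment is left out because its details are not given above. The helper column `sp500_rolling_mean` is hypothetical.

```python
import numpy as np
import pandas as pd

# Simulated monthly index levels standing in for the sp500_Price column.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {"sp500_Price": 100 * np.exp(np.cumsum(rng.normal(0.005, 0.04, 120)))}
)

RISK_FREE_MONTHLY = 0.00407  # 0.407% per month, as quoted above

# Step 1: simple monthly returns.
df["sp500_rets"] = df["sp500_Price"].pct_change()

# Step 2: rolling 60-month average return (NaN until 60 returns are available).
df["sp500_rolling_mean"] = df["sp500_rets"].rolling(60).mean()

# Step 3 (assumed form): expected excess return over the risk-free rate.
df["market_risk_prem"] = df["sp500_rolling_mean"] - RISK_FREE_MONTHLY
print(df.tail())
```

With a 60-month window, the first five years of the sample have no market risk premium estimate, so those rows drop out of any regression that uses this variable.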

**Step 2: Perform a Set of Linear Regressions for Each of Our Target Commodities**

Armed with all needed variables, we moved on to perform a number of regression tests for the three target commodities of this project: corn, wheat, and soybeans. The tests were performed using the *StatsModels* library and API.

*Corn Regressions* 
 
- **Model 1**: ``` corn1 = sm_ols('realized_ret_corn ~ (market_risk_prem)',data=Commodities_DF).fit() ```
    - Key Statistics:

      | Statistic | Value |
      | ----------- | ----------- |
      | R^2 | 0.000 |
      | β1 (market_risk_prem) | 0.355 |

    - Interpretation:
        - β1: A single % increase in the market risk premium is associated with a 0.355% increase in corn future returns, on average



- **Model 2**: ``` corn2 = sm_ols('realized_ret_corn ~ market_risk_prem + ret_gdp + ret_cpi + UNRATE + sp500_rets', data=Commodities_DF).fit() ```
    - Key Statistics:

      | Statistic | Value |
      | ----------- | ----------- |
      | R^2 | 0.035 |
      | β1 (market_risk_prem) | -1.078 |
      | β2 (ret_gdp) | 0.0187 |
      | β3 (ret_cpi) | -0.394 |
      | β4 (UNRATE) | 0.000082 |
      | β5 (sp500_rets) | 0.365 |

    - Interpretation:
        - β1: A single % increase in the market risk premium is associated with a 1.078% decrease in corn future returns, on average (ceteris paribus)
        - β2: A single % increase in US GDP is associated with a 0.0187% increase in corn future returns, on average (ceteris paribus)
        - β3: A single % increase in US CPI (inflation) is associated with a 0.394% decrease in corn future returns, on average (ceteris paribus)
        - β4: A single % increase in US unemployment rates is associated with a 0.000082% increase in corn future returns, on average (ceteris paribus)
        - β5: A single % increase in SP500 returns is associated with a 0.365% increase in corn future returns, on average (ceteris paribus)



- **Model 3**: ```sm_ols('realized_ret_corn ~ market_risk_prem + ret_gdp + ret_cpi + UNRATE + sp500_rets + C_PRCP + C_SNOW + C_TMAX + C_TMIN', data=Commodities_DF).fit()```
    - Key Statistics:

      | Statistic | Value |
      | ----------- | ----------- |
      | R^2 | 0.061 |
      | β1 (market_risk_prem) | -0.2383 |
      | β2 (ret_gdp) | 0.1622 |
      | β3 (ret_cpi) | -0.7320 |
      | β4 (UNRATE) | 0.0014 |
      | β5 (sp500_rets) | 0.3448 |
      | β6 (C_PRCP) | 0.000063 |
      | β7 (C_SNOW) | -0.0003 |
      | β8 (C_TMAX) | 0.000075 |
      | β9 (C_TMIN) | 0.000002 |

    - Interpretation:
        - β1: A single % increase in the market risk premium is associated with a 0.2383% decrease in corn future returns, on average (ceteris paribus)
        - β2: A single % increase in US GDP is associated with a 0.1622% increase in corn future returns, on average (ceteris paribus)
        - β3: A single % increase in US CPI (inflation) is associated with a 0.7320% decrease in corn future returns, on average (ceteris paribus)
        - β4: A single % increase in US unemployment rates is associated with a 0.0014% increase in corn future returns, on average (ceteris paribus)
        - β5: A single % increase in SP500 returns is associated with a 0.3448% increase in corn future returns, on average (ceteris paribus)
        - β6: A single unit increase in precipitation is associated with a 0.000063% increase in corn future returns, on average (ceteris paribus)
        - β7: A single unit increase in snowfall is associated with a 0.0003% decrease in corn future returns, on average (ceteris paribus)
        - β8: A single unit increase in max temperatures is associated with a 0.000075% increase in corn future returns, on average (ceteris paribus)
        - β9: A single unit increase in min temperatures is associated with a 0.000002% increase in corn future returns, on average (ceteris paribus)
     
         
*Soybeans Regressions* 
 
- **Model 1**: ``` sm_ols('realized_ret_soybeans ~ (market_risk_prem)',data=Commodities_DF).fit() ```
    - Key Statistics:

      | Statistic | Value |
      | ----------- | ----------- |
      | R^2 | 0.001 |
      | β1 (market_risk_prem) | -1.556 |

    - Interpretation:
        - β1: A single % increase in the market risk premium is associated with a 1.556% decrease in soybeans future returns, on average



- **Model 2**: ``` sm_ols('realized_ret_soybeans ~ market_risk_prem + ret_gdp + ret_cpi + UNRATE + sp500_rets', data=Commodities_DF).fit() ```
    - Key Statistics:

      | Statistic | Value |
      | ----------- | ----------- |
      | R^2 | 0.049 |
      | β1 (market_risk_prem) | -2.9067 |
      | β2 (ret_gdp) | -0.5800 |
      | β3 (ret_cpi) | 1.2068 |
      | β4 (UNRATE) | 0.0000086 |
      | β5 (sp500_rets) | 0.371 |

    - Interpretation:
        - β1: A single % increase in the market risk premium is associated with a 2.907% decrease in soybeans future returns, on average (ceteris paribus)
        - β2: A single % increase in US GDP is associated with a 0.580% decrease in soybeans future returns, on average (ceteris paribus)
        - β3: A single % increase in US CPI (inflation) is associated with a 1.207% increase in soybeans future returns, on average (ceteris paribus)
        - β4: A single % increase in US unemployment rates is associated with a 0.0000086% increase in soybeans future returns, on average (ceteris paribus)
        - β5: A single % increase in SP500 returns is associated with a 0.371% increase in soybeans future returns, on average (ceteris paribus)



- **Model 3**: ```sm_ols('realized_ret_soybeans ~ market_risk_prem + ret_gdp + ret_cpi + UNRATE + sp500_rets + S_PRCP + S_SNOW + S_TMAX + S_TMIN', data=Commodities_DF).fit()```
    - Key Statistics:

      | Statistic | Value |
      | ----------- | ----------- |
      | R^2 | 0.060 |
      | β1 (market_risk_prem) | -2.9791 |
      | β2 (ret_gdp) | -0.6377 |
      | β3 (ret_cpi) | 0.7117 |
      | β4 (UNRATE) | 0.0005 |
      | β5 (sp500_rets) | 0.3819 |
      | β6 (S_PRCP) | 0.0000038 |
      | β7 (S_SNOW) | -0.0000038 |
      | β8 (S_TMAX) | 0.000028 |
      | β9 (S_TMIN) | 0.000023 |

    - Interpretation:
        - β1: A single % increase in the market risk premium is associated with a 2.979% decrease in soybeans future returns, on average (ceteris paribus)
        - β2: A single % increase in US GDP is associated with a 0.6377% decrease in soybeans future returns, on average (ceteris paribus)
        - β3: A single % increase in US CPI (inflation) is associated with a 0.7117% increase in soybeans future returns, on average (ceteris paribus)
        - β4: A single % increase in US unemployment rates is associated with a 0.0005% increase in soybeans future returns, on average (ceteris paribus)
        - β5: A single % increase in SP500 returns is associated with a 0.3819% increase in soybeans future returns, on average (ceteris paribus)
        - β6: A single unit increase in precipitation is associated with a 0.0000038% increase in soybeans future returns, on average (ceteris paribus)
        - β7: A single unit increase in snowfall is associated with a 0.0000038% decrease in soybeans future returns, on average (ceteris paribus)
        - β8: A single unit increase in max temperatures is associated with a 0.000028% increase in soybeans future returns, on average (ceteris paribus)
        - β9: A single unit increase in min temperatures is associated with a 0.000023% increase in soybeans future returns, on average (ceteris paribus)
 
 
*Wheat Regressions* 
 
- **Model 1**: ``` sm_ols('realized_ret_wheat ~ (market_risk_prem)',data=Commodities_DF).fit() ```
    - Key Statistics:

      | Statistic | Value |
      | ----------- | ----------- |
      | R^2 | 0.000 |
      | β1 (market_risk_prem) | -0.7131 |

    - Interpretation:
        - β1: A single % increase in the market risk premium is associated with a 0.713% decrease in wheat future returns, on average



- **Model 2**: ``` sm_ols('realized_ret_wheat ~ market_risk_prem + ret_gdp + ret_cpi + UNRATE + sp500_rets', data=Commodities_DF).fit() ```
    - Key Statistics:

      | Statistic | Value |
      | ----------- | ----------- |
      | R^2 | 0.031 |
      | β1 (market_risk_prem) | -3.2112 |
      | β2 (ret_gdp) | 0.3436 |
      | β3 (ret_cpi) | 0.2033 |
      | β4 (UNRATE) | -0.0015 |
      | β5 (sp500_rets) | 0.3585 |

    - Interpretation:
        - β1: A single % increase in the market risk premium is associated with a 3.211% decrease in wheat future returns, on average (ceteris paribus)
        - β2: A single % increase in US GDP is associated with a 0.344% increase in wheat future returns, on average (ceteris paribus)
        - β3: A single % increase in US CPI (inflation) is associated with a 0.203% increase in wheat future returns, on average (ceteris paribus)
        - β4: A single % increase in US unemployment rates is associated with a 0.0015% decrease in wheat future returns, on average (ceteris paribus)
        - β5: A single % increase in SP500 returns is associated with a 0.358% increase in wheat future returns, on average (ceteris paribus)



- **Model 3**: ```sm_ols('realized_ret_wheat ~ market_risk_prem + ret_gdp + ret_cpi + UNRATE + sp500_rets + W_PRCP + W_SNOW + W_TMAX + W_TMIN', data=Commodities_DF).fit()```
    - Key Statistics:

      | Statistic | Value |
      | ----------- | ----------- |
      | R^2 | 0.067 |
      | β1 (market_risk_prem) | -1.8263 |
      | β2 (ret_gdp) | 0.9930 |
      | β3 (ret_cpi) | -1.0529 |
      | β4 (UNRATE) | -0.0007 |
      | β5 (sp500_rets) | 0.3789 |
      | β6 (W_PRCP) | -0.0000024 |
      | β7 (W_SNOW) | -0.0002 |
      | β8 (W_TMAX) | 0.0000037 |
      | β9 (W_TMIN) | 0.0002 |

    - Interpretation:
        - β1: A single % increase in the market risk premium is associated with a 1.8263% decrease in wheat future returns, on average (ceteris paribus)
        - β2: A single % increase in US GDP is associated with a 0.993% increase in wheat future returns, on average (ceteris paribus)
        - β3: A single % increase in US CPI (inflation) is associated with a 1.0529% decrease in wheat future returns, on average (ceteris paribus)
        - β4: A single % increase in US unemployment rates is associated with a 0.0007% decrease in wheat future returns, on average (ceteris paribus)
        - β5: A single % increase in SP500 returns is associated with a 0.3789% increase in wheat future returns, on average (ceteris paribus)
        - β6: A single unit increase in precipitation is associated with a 0.0000024% decrease in wheat future returns, on average (ceteris paribus)
        - β7: A single unit increase in snowfall is associated with a 0.0002% decrease in wheat future returns, on average (ceteris paribus)
        - β8: A single unit increase in max temperatures is associated with a 0.0000037% increase in wheat future returns, on average (ceteris paribus)
        - β9: A single unit increase in min temperatures is associated with a 0.0002% increase in wheat future returns, on average (ceteris paribus)
         
         
- **Analysis and Discussion of Regressions**: 
When it comes to fit, R^2 rose as more independent variables were added, with Model 3 yielding the highest R^2 for all three crops (in the range of 6.0-6.7%). This suggests that the macroeconomic and climate variables help explain somewhat more of the variation in crop futures returns, although R^2 can never decrease when regressors are added to an OLS model, so the improvement should be read with caution. An interesting observation is that the coefficient of the market risk premium went from positive to negative for corn, and became more strongly negative for soybeans and wheat, once the macroeconomic variables were added in Model 2. This coefficient is essentially an estimate of the financial beta of these futures, so a negative beta is the logical expectation if equity and commodity markets tend to move in opposite directions. We read this as a sign that the richer models are better specified. In addition, the climate coefficients are generally very close to zero, indicating that they are very weakly correlated with crop returns. The signs of most of these coefficients do make sense, however, as one would expect more precipitation to have a positive impact on crop returns, while the opposite would be the expectation for snowfall. 
     - Finally, it is worth mentioning that most of the coefficients observed across all models are unlikely to be statistically significant, as only a handful of them have a t-statistic above the threshold value of 1.96 in absolute terms. 
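To illustrate the significance check mentioned above, the following sketch computes OLS coefficients, standard errors, and t-statistics with plain NumPy on simulated data, then compares each |t| with 1.96. The regressions in this report were fit with StatsModels; the data and variable names here are hypothetical and only show how the statistic is formed.

```python
import numpy as np

# Simulated data: y depends on x, while noise_feature is unrelated to y.
rng = np.random.default_rng(1)
n = 300
x = rng.normal(size=n)
noise_feature = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)

# Design matrix with an intercept column.
X = np.column_stack([np.ones(n), x, noise_feature])

# OLS estimates via least squares.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Residual variance and coefficient standard errors.
resid = y - X @ beta
sigma2 = resid @ resid / (n - X.shape[1])
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

# t-statistic per coefficient; |t| > 1.96 is the usual 5% two-sided cutoff.
t_stats = beta / se
significant = np.abs(t_stats) > 1.96
print(np.round(t_stats, 2), significant)
```

A coefficient can look economically large and still fail this test if its standard error is large, which is the situation for most coefficients in the tables above.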

**Step 3: Visualizing Regression Relationships**

 - Visualization 1: Monthly Corn Futures Returns vs Select Independent Variables
![alt text](output_data/Corn_Correlations_Reg.png "Title")


 - Visualization 2: Monthly Soybeans Futures Returns vs Select Independent Variables
![alt text](output_data/Soybeans_Correlations_Reg.png "Title")


 - Visualization 3: Monthly Wheat Futures Returns vs Select Independent Variables
 ![alt text](output_data/Wheat_Correlations_Reg.png "Title")

#### <font color='blue'>Part II: Predictive Modeling Analysis</font>
*Source: part2_modeling.ipynb*

**Step 1: Load Commodity DF, Create Holdout Set**

As in Part I, we first loaded in the Final Commodities DataFrame from **output_data/commodities_df.csv**, which was created as a result of our efforts in the **data_cleaning.ipynb** file. 

We then separated our training and holdout sets. The training set would be used to train the model, whereas the holdout set would be used to test the accuracy of our trained model. 

To do this, we separated the X (independent variables) from the y (dependent variable). In this case, we wanted to predict commodity returns, so the return on corn futures is one example of our y-variables. We made the first **80%** of our data (1980-2014) the training set, and the last **20%** of our data (2014-2020) the holdout set.
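A chronological 80/20 split of this kind can be sketched as follows. The toy DataFrame and its `x` and `y` columns are hypothetical; the point is only that the rows are sorted by date and split by position rather than shuffled.

```python
import numpy as np
import pandas as pd

# Toy monthly dataset standing in for the commodities DataFrame.
dates = pd.date_range("2000-01-01", periods=100, freq="MS")
df = pd.DataFrame({"DATE": dates, "x": np.arange(100.0), "y": np.arange(100.0) * 0.1})

# Sort by date so that "first 80%" really means "oldest 80%".
df = df.sort_values("DATE").reset_index(drop=True)

split = int(len(df) * 0.8)
train, holdout = df.iloc[:split], df.iloc[split:]

# Separate the independent variables from the dependent variable.
X_train, y_train = train.drop(columns="y"), train["y"]
X_holdout, y_holdout = holdout.drop(columns="y"), holdout["y"]
print(len(X_train), len(X_holdout))
```

Splitting by position keeps every holdout observation later in time than every training observation, which a random split would not guarantee.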

**Step 2: EDA on Training Set**

Next, we did EDA on our training set. This involved looking at the training data set, understanding the number of rows and columns, and most importantly determining how many NaN values we had. We determined that most of the columns have very few NaN values. We also generated profiling reports for the data associated with each of the commodities involved in this project, which can be found in the "profiling_reports" folder within our project repo. 

**Step 3: Preprocess Data**

To preprocess the data, we created a pipe that imputes missing numerical values with the column mean, and then used StandardScaler() to scale all of our variables. Within our *preproc_pipe*, we used the following 8 independent variables for our corn model:

- **'ret_gdp':** monthly return of domestic GDP
- **'ret_cpi':** monthly return of domestic CPI (Consumer Price Index)
- **'UNRATE':** monthly unemployment rate, domestic
- **'C_PRCP':** monthly precipitation
- **'C_SNOW':** monthly snow
- **'C_TMAX':** monthly temperature (max)
- **'C_TMIN':** monthly temperature (min)
- **'market_risk_prem':** market risk premium (monthly)

Similarly, we ultimately used the same independent variables for soybeans and wheat, just replacing the climate data with their respective variables.
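A preprocessing pipe along these lines might look like the sketch below. It assumes a plain scikit-learn `Pipeline` with `SimpleImputer(strategy="mean")` followed by `StandardScaler()`; the exact construction of *preproc_pipe* in the notebook may differ, and the toy data is random.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# The eight corn-model features listed above.
num_cols = ["ret_gdp", "ret_cpi", "UNRATE", "C_PRCP",
            "C_SNOW", "C_TMAX", "C_TMIN", "market_risk_prem"]

# Assumed structure: mean imputation, then standardization.
preproc_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

# Random toy data with one missing value to exercise the imputer.
toy = pd.DataFrame(np.random.default_rng(0).normal(size=(12, 8)), columns=num_cols)
toy.loc[0, "C_SNOW"] = np.nan

out = preproc_pipe.fit_transform(toy)
print(out.shape, np.isnan(out).any())
```

Fitting the imputer and scaler inside the pipeline means their means and variances are learned only from the training rows of each cross-validation split, which avoids leaking holdout information.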

**Step 4: Create Pipeline**

We tried a variety of models to best fit our training set. Models we tried include:

- **Ridge()**
- **Lasso()**
- **LinearRegression()**
- **RandomForestRegressor()**
- **BayesianRidge()**
- **GradientBoostingRegressor()**
- **ElasticNet()**

The following was the pipeline we used, which resulted in the best fit: 

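```
from sklearn import linear_model
from sklearn.pipeline import Pipeline

# Note: the variable is named gbr_pipe, but the final estimator is BayesianRidge
gbr_pipe = Pipeline([
                ('preproc', preproc_pipe),           # impute + scale (Step 3)
                ('feature_select', 'passthrough'),   # placeholder: no feature selection
                ('estim', linear_model.BayesianRidge())
                ])
```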

To tune this model, we found the names of several HyperParameters using the *get_params()* function.

**Step 5: Optimizing Hyperparameters | Grid Search**

We found that the parameters that highly influence the BayesianRidge() Model were the following:

- *1. estim__alpha_1*
- *2. estim__alpha_2*
- *3. estim__lambda_1*
- *4. estim__lambda_2*

Because we were predicting future returns, we needed to use a different cross validation (cv) than simply KFold(10). Specifically it was imperative that the splits for the cross validation always had newer data for the holdout set, and older data for the training set, since the entire goal of this project is to test the possibility of using financial theory, macroeconomic, and climatological data to predict *future* commodity returns. As a result, we set up a type of TimeSeriesSplit cv model. For the calculation of our **cv**, see below: 

```
groups = X_train.groupby(X_train['DATE'].dt.year).groups
min_periods_in_train = 5
training_expanding_window = True

sorted_groups = [list(value) for (key, value) in sorted(groups.items())]

if training_expanding_window:

    cv = [([i for g in sorted_groups[:y] for i in g],sorted_groups[y]) 
            for y in range(min_periods_in_train , len(sorted_groups))]

else:
    
    cv = [([i for g in sorted_groups[y-min_periods_in_train:y] for i in g],sorted_groups[y]) 
            for y in range(min_periods_in_train, len(sorted_groups))]
```
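To see what this produces, the sketch below runs the same expanding-window logic on a toy four-year monthly index (with `min_periods_in_train` lowered to 2 so the example stays small). `X_toy` is a hypothetical stand-in for `X_train`.

```python
import pandas as pd

# Four years of monthly dates with a default 0..47 integer index.
X_toy = pd.DataFrame({"DATE": pd.date_range("2000-01-01", periods=48, freq="MS")})

# Row labels grouped by calendar year, ordered from oldest to newest.
groups = X_toy.groupby(X_toy["DATE"].dt.year).groups
sorted_groups = [list(value) for (key, value) in sorted(groups.items())]

min_periods_in_train = 2

# Expanding window: train on all years before y, validate on year y.
cv = [
    ([i for g in sorted_groups[:y] for i in g], sorted_groups[y])
    for y in range(min_periods_in_train, len(sorted_groups))
]

for train_idx, test_idx in cv:
    print(len(train_idx), "training rows ->", len(test_idx), "validation rows")
```

Each element of `cv` is a (train indices, validation indices) pair, which is a form scikit-learn accepts for its `cv` argument. The indices are row labels, so this relies on the DataFrame having a default integer index that matches row positions.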

In plotting the candidate models from our grid search, it became clear that our best corn model ended up having a *mean_test_score* (R^2) value of **-0.17827** with a *std_test_score* of **0.360155**. Results were similar for both soybeans and wheat.
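For context, *mean_test_score* and *std_test_score* are columns of the `cv_results_` table that scikit-learn's `GridSearchCV` produces. The sketch below shows where they come from. It is illustrative only: the data is pure noise, the grid values are arbitrary, and `TimeSeriesSplit` is used as a stand-in for the custom year-based splits.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import BayesianRidge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Pure-noise features and "returns": there is nothing to predict here.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = rng.normal(scale=0.05, size=120)

pipe = Pipeline([("scale", StandardScaler()), ("estim", BayesianRidge())])

# Arbitrary illustrative grid over two of the four hyperparameters.
param_grid = {
    "estim__alpha_1": [1e-6, 1e-3],
    "estim__lambda_1": [1e-6, 1e-3],
}

search = GridSearchCV(pipe, param_grid, scoring="r2", cv=TimeSeriesSplit(n_splits=5))
search.fit(X, y)

# One row per candidate: mean and standard deviation of R^2 across the splits.
results = pd.DataFrame(search.cv_results_)[["params", "mean_test_score", "std_test_score"]]
print(results)
```

On noise like this, out-of-sample R^2 tends to hover around or below zero, which is the same pattern reported for the commodity models.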

While these scores may appear quite underwhelming, we believe that they are close to the best achievable for the type of model that we're building, given the data at our disposal. As previously mentioned, the machine learning model that produced these results was a Bayesian Ridge model. This is another linear model supported by sklearn. It applies Ridge-style (L2) regularization, but instead of fixing the penalty strength in advance it estimates it from the data. The two alpha and two lambda hyperparameters are the shape and rate parameters of the Gamma priors placed on the noise precision (alpha) and on the precision of the coefficients (lambda). The ultimate effect of this regularization is to shrink the beta coefficients associated with our x-variables toward zero unless the data strongly support a nonzero value. 

**Step 6: Test Model on Holdout Set**

With an optimal model yielding a negative R^2, we did not expect a good turnout on our holdout set. The following were the respective R^2 values once tested on our holdout set.

**1. Corn: 0.00622**

**2. Soybeans: -0.00307**

**3. Wheat: 0.00136**

With such low R^2 values for all of our commodities, despite trying hundreds of models, we learned that it is incredibly difficult to predict future returns based simply on historical information. If anything, this has reinforced our confidence in the Efficient Market Hypothesis, which predicts exactly this outcome.
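For readers unfamiliar with negative R^2 values, the short sketch below computes out-of-sample R^2 by hand on made-up numbers. A forecast that always predicts the mean of the realized returns scores exactly 0, and any forecast that does worse than that scores below 0. The function name `r2` and the data are hypothetical.

```python
import numpy as np

def r2(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, the definition used by sklearn's r2_score."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Made-up realized monthly returns (they average to zero).
y_true = np.array([0.02, -0.01, 0.03, -0.04])

mean_forecast = np.full_like(y_true, y_true.mean())  # always predict the mean
bad_forecast = -y_true                               # always get the sign wrong

print(r2(y_true, mean_forecast))  # approximately 0
print(r2(y_true, bad_forecast))   # approximately -3
```

Seen this way, holdout R^2 values within a fraction of a percent of zero mean the models forecast about as well as simply predicting the average return.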

#### <font color='blue'>Pain Points</font>
Throughout this research project, we came across several unexpected pain points. 

#### <font color='blue'>Conclusion</font>
In conclusion,