# Goal

1. Our goal is to build a basic soybean yield model that predicts the final yearly yield of the crop. We will use features such as average temperature, precipitation, US state, NDVI (a measure of crop quality) and pod count, and we will include a shift operator to test the hypothesis that technological change in the last few years has significantly, and non-linearly, increased yields.

1. Yearly yields:
    
    <font size="3"> $$ Yields_{y} \sim PodCount_{m} + State_{y} + Temp_{m} + Precip_{m} + NDVI_{m} + Shift + error_{y} $$ </font>
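
A minimal sketch of how the shift term in the formula above could be tested. The data are synthetic and the break year of 2010 is a hypothetical choice; neither comes from the project. Only the `Temp`, `Precip` and `Shift` terms are included to keep the example short.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data for illustration only (not the project's data).
rng = np.random.default_rng(0)
years = np.arange(1990, 2020)
df = pd.DataFrame({
    "Year": years,
    "Temp": rng.normal(72.0, 2.0, size=years.size),
    "Precip": rng.normal(4.0, 1.0, size=years.size),
})

# Hypothetical shift operator: 0 before the assumed break year, 1 from it onward.
break_year = 2010  # assumption, not from the source
df["Shift"] = (df["Year"] >= break_year).astype(int)

# Simulated yields with a true shift effect of 5 units.
df["Yield"] = (45.0 + 5.0 * df["Shift"]
               - 0.2 * (df["Temp"] - 72.0)
               + 0.5 * df["Precip"]
               + rng.normal(0.0, 1.0, size=years.size))

shift_fit = smf.ols("Yield ~ Temp + Precip + Shift", data=df).fit()

# The coefficient on Shift estimates the jump in yield after the break year;
# its p-value indicates whether the jump is statistically distinguishable from zero.
print(shift_fit.params["Shift"], shift_fit.pvalues["Shift"])
```

A small p-value on `Shift` would support the hypothesis of a level change after the break year. In the real model the other terms of the formula would be included as well.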
    

# Long-term Goals:
1. Automate process of modeling yield forecasts.
2. Incorporate global data.
3. Provide a range of forecasts.
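
One possible way to provide a range rather than a single number is to report prediction intervals from an OLS fit. The sketch below assumes synthetic yields and a simple linear trend; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic yield history for illustration only.
rng = np.random.default_rng(1)
hist = pd.DataFrame({"Year": np.arange(1980, 2016)})
hist["Yield"] = 35.0 + 0.45 * (hist["Year"] - 1980) + rng.normal(0.0, 2.0, size=len(hist))

trend_fit = smf.ols("Yield ~ Year", data=hist).fit()

# Years to forecast.
future = pd.DataFrame({"Year": np.arange(2016, 2021)})

# summary_frame returns the point forecast ("mean") together with
# 95% prediction-interval bounds for a new observation ("obs_ci_*").
ranges = trend_fit.get_prediction(future).summary_frame(alpha=0.05)
forecast_ranges = ranges[["mean", "obs_ci_lower", "obs_ci_upper"]]
print(forecast_ranges)
```

The `obs_ci_*` columns are wider than the `mean_ci_*` columns because they also account for year-to-year noise in individual observations.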

# Immediate Goals:
1. Use US data to forecast soybean yields.
2. Automate downloading of data and model scoring.
3. Provide figures.
4. Test and compare several forecasting methods.

# <u>Features</u>

## Yields

### Hypothesis: Historical yields are a good indicator of future yields.

1. Plots of historical yields show a clear trend over time.

## Pod Count

### Hypothesis: A good pod count indicates a good final yield.

1. We aren't interested in the final pod count itself but in the intermediate counts, which we use to predict the final yield (the final yield is closely related to the final count).
2. So we need the pod counts for several intermediate months.



## Weather Features

1. Bad weather in particular months will damage the soybean crops.  
2. Look at minimum temperature, maximum temperature and precipitation.
3. Try to identify patterns in extreme events.
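
A minimal sketch of one way to flag extreme months, assuming a z-score rule with a threshold of 1.5. The temperatures below are invented for illustration; only the column name `Maxtemp_Aug` follows the data used later in this notebook.

```python
import pandas as pd

# Hypothetical August maximum temperatures; the values are invented.
weather = pd.DataFrame({
    "Year": [2010, 2011, 2012, 2013, 2014, 2015],
    "Maxtemp_Aug": [83.0, 84.5, 91.0, 82.0, 81.5, 83.5],
})

# Standardize against the sample mean and standard deviation.
z = (weather["Maxtemp_Aug"] - weather["Maxtemp_Aug"].mean()) / weather["Maxtemp_Aug"].std()

# Assumed rule: a month is "extreme" if it is more than 1.5 standard deviations from the mean.
weather["Extreme_Aug"] = z.abs() > 1.5

extreme_years = weather.loc[weather["Extreme_Aug"], "Year"].tolist()
print(extreme_years)
```

The resulting indicator column could be added to a regression as a dummy variable alongside the raw temperature.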

## Quality of Crop
1. This feature captures the condition of the crop at a particular point in the season, for example the percentage of the crop rated excellent in a given week.


# <u>Models</u>

## Simple Trend

1. Fit a straight line to yield as a function of year and extrapolate it to future years.

In [None]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

testYear = 2016

yld1 = pd.read_csv("data_model_ready/model_allStates_yields.csv")
yld1.set_index('Year', inplace=True, drop=False)
iowa1 = yld1[yld1["State"] == "IOWA"]



In [None]:
real1 = iowa1.tail(5)["Yield"]

In [None]:
iowa1["Yield"].plot(figsize=(11, 9))

In [None]:
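import statsmodels.api as sm
import statsmodels.formula.api as smf
# Note: despite its name, test1 holds the years before testYear,
# i.e. the data the model is fitted on.
test1 = iowa1[iowa1["Year"] < testYear]

# Linear trend: yield as a function of year
results = smf.ols('Yield ~ Year', data=test1).fit()
print(results.summary())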

In [None]:
# In-sample fitted values; the formula interface looks up "Year" in the DataFrame
ypred = results.predict(test1)
print(ypred)

In [None]:

# Years to forecast. The formula interface adds the intercept itself,
# so no explicit constant column is needed.
Xnew = pd.DataFrame({"Year": list(range(testYear, 2021))})
Xnewl = list(range(testYear, 2021))

ynewpred = results.predict(Xnew) # predict out of sample
print(ynewpred)

In [None]:
import matplotlib.pyplot as plt
import numpy as np

plt.rc("figure", figsize=(16,8))
plt.rc("font", size=14)

fig, ax = plt.subplots()
ax.plot(test1["Year"], test1["Yield"], 'o', label="Data")
ax.plot(test1["Year"], ypred, 'b-', label="Historical")
ax.plot(np.hstack((test1["Year"], Xnewl)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best")

In [None]:
results2 = smf.ols('Yield ~ Year + I(Year**2)', data=test1).fit()
print(results2.summary())

In [None]:
# The formula interface adds the intercept itself, so Xnew can be passed directly
Xnew = pd.DataFrame({"Year": list(range(testYear, 2021))})

ynewpred2 = results2.predict(Xnew) # predict out of sample
print(ynewpred2)

In [None]:
from IPython.display import HTML, display
import tabulate
table = [["Year", "Real", "Trend", "Trend+"],
         ["2016", 60.0, 51.9, 52.9],
         ["2017", 57.0, 52.2, 53.4],
         ["2018", 56.0, 52.6, 54.0],
         ["2019", 55.0, 53.1, 54.5],
         ["2020", 58.0, 53.5, 55.0]]
# headers="firstrow" renders the first row as column headers
display(HTML(tabulate.tabulate(table, headers="firstrow", tablefmt='html')))

In [None]:
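import statsmodels.api as sm
# SARIMAX is accessed below through sm.tsa, so no separate ARIMA import is needed
# (statsmodels.tsa.arima_model is deprecated and was removed from newer statsmodels releases).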

In [None]:
# Construct the model
series = test1["Yield"] 
mod = sm.tsa.SARIMAX(series, order=(1, 1, 0), trend='c')
# Estimate the parameters
res = mod.fit()

print(res.summary())

In [None]:
print(res.forecast(5))

In [None]:
from IPython.display import HTML, display
import tabulate
table = [["Year", "Real", "Trend", "Trend+", "ARIMA"],
         ["2016", 60.0, 51.9, 52.9, 54.3],
         ["2017", 57.0, 52.2, 53.4, 56.2],
         ["2018", 56.0, 52.6, 54.0, 56.0],
         ["2019", 55.0, 53.1, 54.5, 56.8],
         ["2020", 58.0, 53.5, 55.0, 57.0]]
# headers="firstrow" renders the first row as column headers
display(HTML(tabulate.tabulate(table, headers="firstrow", tablefmt='html')))

## Add weather data

In [None]:
tmax1 = pd.read_csv("data_model_ready/model_allStates_maxTemp1.csv")
tmax1.set_index('Year', inplace=True, drop=False)
tmax1 = tmax1[tmax1["State"] == "IOWA"]
tmax1 = tmax1[["Maxtemp_Aug", "Maxtemp_Jul", "Year"]]
iowa2 = iowa1[["Yield"]]
iowa3 = iowa2.join(tmax1)

In [None]:
test1 = iowa3[iowa3["Year"] < testYear]
results = smf.ols('Yield ~ Year + I(Year**2) + Maxtemp_Aug', data=test1).fit()
print(results.summary())

In [None]:
# Years from testYear onward are held out for evaluation
holdout1 = iowa3[iowa3["Year"] >= testYear]
ynewpred = results.predict(holdout1) # predict out of sample
print(ynewpred)

from IPython.display import HTML, display
import tabulate
table = [["Year", "Real", "Trend", "Trend+", "ARIMA", "MaxTemp"],
         ["2016", 60.0, 51.9, 52.9, 54.3, 52.9],
         ["2017", 57.0, 52.2, 53.4, 56.2, 54.8],
         ["2018", 56.0, 52.6, 54.0, 56.0, 53.7],
         ["2019", 55.0, 53.1, 54.5, 56.8, 55.0],
         ["2020", 58.0, 53.5, 55.0, 57.0, 'NA']]
# headers="firstrow" renders the first row as column headers
display(HTML(tabulate.tabulate(table, headers="firstrow", tablefmt='html')))

In [None]:
tmin1 = pd.read_csv("data_model_ready/model_allStates_minTemp1.csv")
tmin1.set_index('Year', inplace=True, drop=False)
tmin1 = tmin1[tmin1["State"] == "IOWA"]
# Keep only the temperature columns; Year is already the index used for the join
tmin1 = tmin1[["Mintemp_Jul", "Mintemp_Aug"]]
iowa4 = iowa3.join(tmin1)
iowa4["AvgTemp_Aug"] = iowa4[['Maxtemp_Aug', 'Mintemp_Aug']].mean(axis=1)
# August 2020 average temperature entered by hand
iowa4.loc[2020, "AvgTemp_Aug"] = 71.9
iowa4

In [None]:
precip1 = pd.read_csv("data_model_ready/model_allStates_precip.csv")
precip1.set_index('Year', inplace=True, drop=True)
# .copy() avoids modifying a view of the full table in the assignments below
precip1 = precip1[precip1["State"] == "IOWA"].copy()
# August 2020 precipitation entered by hand
precip1.loc[2020,'Precip_Aug'] = 1.21
precip1["Year"] = precip1.index
precip1

In [None]:
# Keep the two summer months and treat the -9.99 sentinel as missing
precip1 = precip1[["Precip_Aug", "Precip_Jul"]].replace(-9.99, np.nan)
iowa5 = iowa4.join(precip1)
iowa5

In [None]:
test1 = iowa5[iowa5["Year"] < testYear]
results = smf.ols('Yield ~ Year + I(Year**2) + Maxtemp_Aug + Mintemp_Aug + Precip_Aug + I(Precip_Aug**2)', data=test1).fit()
print(results.summary())

In [None]:
# Years from testYear onward are held out for evaluation
holdout1 = iowa5[iowa5["Year"] >= testYear]
ynewpred = results.predict(holdout1) # predict out of sample
print(ynewpred)


In [None]:
from IPython.display import HTML, display
import tabulate
table = [["Year", "Real", "Trend", "Trend+", "ARIMA", "MaxTemp", "MaxT_Precip"],
         ["2016", 60.0, 51.9, 52.9, 54.3, 52.9, 54.2],
         ["2017", 57.0, 52.2, 53.4, 56.2, 54.8, 54.8],
         ["2018", 56.0, 52.6, 54.0, 56.0, 53.7, 55.0],
         ["2019", 55.0, 53.1, 54.5, 56.8, 55.0, 55.3],
         ["2020", 58.0, 53.5, 55.0, 57.0, 'NA', 'NA']]
# headers="firstrow" renders the first row as column headers
display(HTML(tabulate.tabulate(table, headers="firstrow", tablefmt='html')))

In [None]:
podcount = pd.read_csv("data_model_ready/model_allStates_podcount.csv")
podcount.set_index('Year', inplace=True, drop=False)
podcount = podcount[podcount["State"] == "IOWA"]
podcount.head(5)

In [None]:
podcount = podcount[["Sep_pod_forecast", "Nov_pod_forecast"]]
iowa6 = iowa5.join(podcount)
iowa6

In [None]:
test1 = iowa6[iowa6["Year"] < testYear]
results = smf.ols('Yield ~ Year + I(Year**2) + Maxtemp_Aug + Precip_Aug + I(Precip_Aug**2) + Nov_pod_forecast', data=test1).fit()
print(results.summary())

In [None]:
# Years from testYear onward are held out for evaluation
holdout1 = iowa6[iowa6["Year"] >= testYear]
ynewpred = results.predict(holdout1) # predict out of sample
print(ynewpred)

In [None]:
from IPython.display import HTML, display
import tabulate
table = [["Year", "Real", "Trend", "Trend+", "ARIMA", "MaxTemp", "MaxT_Precip", "PodCount"],
         ["2016", 60.0, 51.9, 52.9, 54.3, 52.9, 54.2, 54.7],
         ["2017", 57.0, 52.2, 53.4, 56.2, 54.8, 54.8, 54.4],
         ["2018", 56.0, 52.6, 54.0, 56.0, 53.7, 55.0, 60.0],
         ["2019", 55.0, 53.1, 54.5, 56.8, 55.0, 55.3, 53.8],
         ["2020", 58.0, 53.5, 55.0, 57.0, 'NA', 'NA', 'NA']]
# headers="firstrow" renders the first row as column headers
display(HTML(tabulate.tabulate(table, headers="firstrow", tablefmt='html')))

In [None]:
quality = pd.read_csv("data_model_ready/model_allStates_quality.csv")
quality.set_index('Year', inplace=True, drop=False)
quality = quality[quality["State"] == "IOWA"]
quality.head(5)

In [None]:
quality = quality[['WEEK #35PCT EXCELLENT']]
quality.rename(columns={'WEEK #35PCT EXCELLENT': 'WEEK_35PCT_EXCELLENT'}, inplace=True)
iowa7 = iowa6.join(quality)
iowa7

In [None]:
test1 = iowa7[iowa7["Year"] < testYear]
results = smf.ols('Yield ~ Year + I(Year**2) + Maxtemp_Aug + Precip_Aug + I(Precip_Aug**2) + Nov_pod_forecast + WEEK_35PCT_EXCELLENT', data=test1).fit()
print(results.summary())

In [None]:
# Years from testYear onward are held out for evaluation
holdout1 = iowa7[iowa7["Year"] >= testYear]
ynewpred = results.predict(holdout1) # predict out of sample
print(ynewpred)

In [None]:
from IPython.display import HTML, display
import tabulate
table = [["Year", "Real", "Trend", "Trend+", "ARIMA", "MaxTemp", "MaxT_Precip", "PodCount", "Quality"],
         ["2016", 60.0, 51.9, 52.9, 54.3, 52.9, 54.2, 54.7, 56.7],
         ["2017", 57.0, 52.2, 53.4, 56.2, 54.8, 54.8, 54.4, 52.8],
         ["2018", 56.0, 52.6, 54.0, 56.0, 53.7, 55.0, 60.0, 58.2],
         ["2019", 55.0, 53.1, 54.5, 56.8, 55.0, 55.3, 53.8, 53.4],
         ["2020", 58.0, 53.5, 55.0, 57.0, 'NA', 'NA', 'NA', 'NA']]
# headers="firstrow" renders the first row as column headers
display(HTML(tabulate.tabulate(table, headers="firstrow", tablefmt='html')))
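
To compare the methods in the table above with a single number, one option is the mean absolute error (MAE) of each column against the "Real" column. The sketch below copies the 2016 to 2019 rows from the table (2020 is left out because several models have no value for it). MAE is an assumed choice of metric here; other error measures would work equally well.

```python
import pandas as pd

# Values copied from the comparison table for 2016-2019.
scores = pd.DataFrame(
    {
        "Real":        [60.0, 57.0, 56.0, 55.0],
        "Trend":       [51.9, 52.2, 52.6, 53.1],
        "Trend+":      [52.9, 53.4, 54.0, 54.5],
        "ARIMA":       [54.3, 56.2, 56.0, 56.8],
        "MaxTemp":     [52.9, 54.8, 53.7, 55.0],
        "MaxT_Precip": [54.2, 54.8, 55.0, 55.3],
        "PodCount":    [54.7, 54.4, 60.0, 53.8],
        "Quality":     [56.7, 52.8, 58.2, 53.4],
    },
    index=[2016, 2017, 2018, 2019],
)

# Absolute error of every forecast column against the observed yield,
# averaged over the four hold-out years.
forecasts = scores.drop(columns="Real")
mae = forecasts.sub(scores["Real"], axis=0).abs().mean()

print(mae.sort_values())
```

With only four hold-out years the ranking is sensitive to individual seasons, so it is best read as a rough guide.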

In [None]:
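import matplotlib.pyplot as plt
import numpy as np

plt.rc("figure", figsize=(16,8))
plt.rc("font", size=14)

# Recompute fitted and out-of-sample values from the current model so the
# plotted lines match `results` rather than the earlier simple-trend fit.
holdout = iowa7[iowa7["Year"] >= testYear]
fitted = results.predict(test1)
forecast = results.predict(holdout)

fig, ax = plt.subplots()
ax.plot(test1["Year"], test1["Yield"], 'o', label="Data")
ax.plot(test1["Year"], fitted, 'b-', label="Historical")
ax.plot(np.hstack((test1["Year"], holdout["Year"])), np.hstack((fitted, forecast)), 'r', label="OLS prediction")
ax.legend(loc="best")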

## Travis model

In [None]:
iowa7