# Goal

1. Our goal is to build a basic soybean yield model that predicts the final yearly yield of the crop. We will use features such as average temperature, precipitation, US state, NDVI (used here as a measure of crop quality), and pod count. We will also test a shift operator to check the hypothesis that technological change in the last few years has significantly, and non-linearly, increased yields.

1. Yearly yields:
    
    <font size="3"> $$ Yields_{y} \sim PodCount_{m} + State_{y} + Temp_{m} + Precip_{m} + NDVI_{m} + Shift + error_{y} $$ </font>
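The shift term in the formula above can be read as a dummy variable that switches on after some break year. Below is a minimal sketch of that idea with the statsmodels formula API. The data are synthetic, and the break year `shift_year` and the column name `Shift` are assumptions made for illustration, not values taken from this project.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic yields for illustration only (not the project's data).
rng = np.random.default_rng(0)
years = np.arange(1990, 2021)
shift_year = 2010  # hypothetical break year
shift = (years >= shift_year).astype(int)  # 0 before the break, 1 from the break on
yields = 35 + 0.4 * (years - 1990) + 3.0 * shift + rng.normal(0, 1.0, len(years))
demo = pd.DataFrame({"Yield": yields, "Year": years, "Shift": shift})

# Linear trend plus a level shift; the Shift coefficient estimates the jump.
fit = smf.ols("Yield ~ Year + Shift", data=demo).fit()
shift_coef = fit.params["Shift"]
shift_pvalue = fit.pvalues["Shift"]
print(fit.params)
print("p-value for Shift:", shift_pvalue)
```

A small p-value on `Shift` would support a level change at the chosen year. Testing a change in slope instead would need an interaction term such as `Year:Shift`.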
    

# Long-term Goals:
1. Automate process of modeling yield forecasts.
2. Incorporate global data.
3. Provide a range of forecasts.

# Immediate Goals:
1. Use US data to forecast soybean yields.
2. Automate downloading of data and model scoring.
3. Provide figures.
4. Test and compare several forecasting methods.

# <u>Features</u>

## Yields

### Hypothesis: Historical yields are a good indicator of future yields.

1. Plots clearly show a significant trend.

## Pod Count

### Hypothesis: A good pod count indicates a good final yield.

1. We aren't interested in the final pod count itself. Instead, we use the intermediate counts to predict the final yield, which is highly correlated with the pod count.
2. So we need the pod counts for several intermediate months.
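To use intermediate counts as separate regressors, monthly pod counts can be reshaped so that each month becomes its own column. The sketch below assumes a long-format table with hypothetical columns `Year`, `Month`, and `PodCount` and made-up values. Only the `Sep_pod_forecast` and `Nov_pod_forecast` column names match the model-ready file used later in this notebook.

```python
import pandas as pd

# Hypothetical long-format pod counts; values are made up for illustration.
pods_long = pd.DataFrame({
    "Year": [2018, 2018, 2018, 2019, 2019, 2019],
    "Month": ["Sep", "Oct", "Nov", "Sep", "Oct", "Nov"],
    "PodCount": [1700, 1750, 1760, 1650, 1690, 1700],
})

# One row per year, one column per month.
pods_wide = pods_long.pivot(index="Year", columns="Month", values="PodCount")

# Put the months in calendar order and name them like the model-ready columns.
pods_wide = pods_wide[["Sep", "Oct", "Nov"]].add_suffix("_pod_forecast")
print(pods_wide)
```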



## Weather Features

1. Bad weather in particular months will damage the soybean crops.  
2. Look at minimum temperature, maximum temperature, and precipitation.
3. Try to identify patterns in extreme events.
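One simple way to look for extreme events is to flag months that fall in the tails of the historical distribution. The sketch below uses made-up values and arbitrary 90th and 10th percentile cutoffs. The flag columns `Hot_Aug` and `Dry_Aug` are hypothetical names, and the cutoffs are an assumption that would need tuning against real data.

```python
import pandas as pd

# Made-up August weather for one state, for illustration only.
weather = pd.DataFrame({
    "Year": [2011, 2012, 2013, 2014, 2015, 2016],
    "Maxtemp_Aug": [83.0, 91.5, 82.0, 80.5, 81.5, 84.0],
    "Precip_Aug": [3.9, 0.8, 3.1, 5.6, 4.2, 3.5],
})

# Cutoffs taken from the sample itself (assumed percentiles).
hot_cutoff = weather["Maxtemp_Aug"].quantile(0.90)
dry_cutoff = weather["Precip_Aug"].quantile(0.10)

# 0/1 indicators that could enter a regression alongside the raw values.
weather["Hot_Aug"] = (weather["Maxtemp_Aug"] > hot_cutoff).astype(int)
weather["Dry_Aug"] = (weather["Precip_Aug"] < dry_cutoff).astype(int)
print(weather)
```

In a real backtest the cutoffs should be computed from the training years only, so that the held-out years do not influence the flags.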

## Quality of Crop
1. This is another feature examining the quality of a crop at a particular moment.


# <u>Models</u>

## Simple Trend

1. Fit a straight line to the historical yields and extend it forward.
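As a standalone illustration of the simple trend, the sketch below fits a least-squares line with `numpy.polyfit` and extends it one year. The yields are made-up numbers, and `trend_forecast` is a hypothetical helper that does not appear in the notebook.

```python
import numpy as np

# Made-up yearly yields for illustration only.
years = np.array([2010, 2011, 2012, 2013, 2014, 2015, 2016])
yields = np.array([44.0, 46.0, 45.0, 48.0, 50.0, 49.0, 52.0])

# Center the years on the first year to keep the fit well conditioned.
x = years - years[0]
slope, intercept = np.polyfit(x, yields, deg=1)  # highest power first


def trend_forecast(year):
    """Extend the fitted line to a given year (hypothetical helper)."""
    return intercept + slope * (year - years[0])


forecast_2017 = trend_forecast(2017)
print(f"slope per year: {slope:.3f}, 2017 forecast: {forecast_2017:.2f}")
```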

In [60]:
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

warnings.filterwarnings('ignore')

# First held-out year: models are fitted on earlier years and scored from here on.
testYear = 2017

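# Hard-coded August 2020 temperature and precipitation, one value per state,
# in the same order as statesofInterest.
stateAugTemp = [71.9, 73.5, 74.0, 76.3, 68.3, 74.7, 74.6, 68.7, 71.8, 74.1]
statePrecip = [1.21, 3.21, 1.67, 1.56, 4.21, 2.92, 0.68, 1.76, 2.64, 2.34]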
statesofInterest = ['IOWA', 'INDIANA', 'ILLINOIS', 'KANSAS', 'MINNESOTA', 'MISSOURI', 'NEBRASKA', 'NORTH DAKOTA', 'SOUTH DAKOTA', 'OHIO']
#statesofInterest = ['IOWA'] 

myGuesses = []
# item is the position in statesofInterest, i is the state name.
for item, i in enumerate(statesofInterest):
    
    print("item: ", item)
    print("i: ", i)

    yld1 = pd.read_csv("data_model_ready/model_allStates_yields.csv")
    yld1.set_index('Year', inplace=True, drop=False)
    iowa1 = yld1[yld1["State"] == i]
    real1 = iowa1.tail(5)["Yield"]
    
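    # Naming note: throughout this loop test1 holds the fitting rows (years before
    # testYear) and train1 holds the held-out rows (testYear onward).
    test1 = iowa1[iowa1["Year"] < testYear]
    # Observed yields for the held-out years.
    real_pred = iowa1[iowa1["Year"] >= testYear].Yield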


    # Linear trend: Yield ~ Year
    results = smf.ols('Yield ~ Year', data=test1).fit()

    ypred = results.predict(test1[["Year"]])  # in-sample fit

    # Out-of-sample years. The formula API adds the intercept itself,
    # so no explicit constant column is needed.
    Xnew = pd.DataFrame({"Year": list(range(testYear, 2021))})
    trend_pred = results.predict(Xnew)  # predict out of sample

    # Quadratic trend: Yield ~ Year + Year^2
    results2 = smf.ols('Yield ~ Year + I(Year**2)', data=test1).fit()

    Xnew = pd.DataFrame({"Year": list(range(testYear, 2021))})
    trend_sqr_pred = results2.predict(Xnew)  # predict out of sample

    
    # ARIMA(2, 1, 0) on the yields from the fitting years
    series = test1["Yield"]
    mod = sm.tsa.SARIMAX(series, order=(2, 1, 0))
    # Estimate the parameters
    res = mod.fit()

    # One forecast per held-out year (testYear through 2020)
    arima_pred = res.forecast(2021 - testYear)

    ## Add weather data

    tmax1 = pd.read_csv("data_model_ready/model_allStates_maxTemp1.csv")
    tmax1.set_index('Year', inplace=True, drop=False)
    tmax1 = tmax1[tmax1["State"] == i]
    tmax1 = tmax1[["Maxtemp_Aug", "Maxtemp_Jul", "Year"]]
    iowa2 = iowa1[["Yield"]]
    iowa3 = iowa2.join(tmax1)

    # Quadratic trend plus August maximum temperature
    test1 = iowa3[iowa3["Year"] < testYear]
    results = smf.ols('Yield ~ Year + I(Year**2) + Maxtemp_Aug', data=test1).fit()

    train1 = iowa3[iowa3["Year"] >= testYear]
    maxTemp_pred = results.predict(train1)  # predict out of sample

    tmin1 = pd.read_csv("data_model_ready/model_allStates_minTemp1.csv")
    tmin1.set_index('Year', inplace=True, drop=False)
    tmin1 = tmin1[tmin1["State"] == i]
    tmin1 = tmin1[["Mintemp_Jul", "Mintemp_Aug"]]
    iowa4 = iowa3.join(tmin1)
    # August average temperature as the mean of the max and min columns
    iowa4["AvgTemp_Aug"] = iowa4[['Maxtemp_Aug', 'Mintemp_Aug']].mean(axis=1)
    # Set the August 2020 average from the hard-coded value for this state
    iowa4.loc[2020, "AvgTemp_Aug"] = stateAugTemp[item]
   
    precip1 = pd.read_csv("data_model_ready/model_allStates_precip.csv")
    precip1.set_index('Year', inplace=True, drop=True)
    # Copy the filtered rows so the assignment below does not write into a view
    precip1 = precip1[precip1["State"] == i].copy()
    # Set August 2020 precipitation from the hard-coded value for this state
    precip1.loc[2020, 'Precip_Aug'] = statePrecip[item]

    precip1 = precip1[["Precip_Aug", "Precip_Jul"]]
    # Treat the -9.99 sentinel as a missing value
    precip1 = precip1.replace(-9.99, np.nan)
    iowa5 = iowa4.join(precip1)
    
    # Quadratic trend plus August max/min temperature and (quadratic) August precipitation
    test1 = iowa5[iowa5["Year"] < testYear]
    results = smf.ols('Yield ~ Year + I(Year**2) + Maxtemp_Aug + Mintemp_Aug + Precip_Aug + I(Precip_Aug**2)', data=test1).fit()

    train1 = iowa5[iowa5["Year"] >= testYear]
    maxT_Precip_pred = results.predict(train1)  # predict out of sample

    podcount = pd.read_csv("data_model_ready/model_allStates_podcount.csv")
    podcount.set_index('Year', inplace=True, drop=False)
    podcount = podcount[podcount["State"] == i]

    podcount = podcount[["Sep_pod_forecast", "Nov_pod_forecast"]]
    iowa6 = iowa5.join(podcount)

    # Weather model plus the November pod count forecast
    test1 = iowa6[iowa6["Year"] < testYear]
    results = smf.ols('Yield ~ Year + I(Year**2) + Maxtemp_Aug + Precip_Aug + I(Precip_Aug**2) + Nov_pod_forecast', data=test1).fit()

    train1 = iowa6[iowa6["Year"] >= testYear]
    pod_pred = results.predict(train1)  # predict out of sample

    quality = pd.read_csv("data_model_ready/model_allStates_quality.csv")
    quality.set_index('Year', inplace=True, drop=False)
    quality = quality[quality["State"] == i]
  

    # Rename the column so it can be used as an identifier in a formula
    quality = quality[['WEEK #35PCT EXCELLENT']]
    quality = quality.rename(columns={'WEEK #35PCT EXCELLENT': 'WEEK_35PCT_EXCELLENT'})
    iowa7 = iowa6.join(quality)

    # Pod count model plus the week-35 "percent excellent" crop quality rating
    test1 = iowa7[iowa7["Year"] < testYear]
    results = smf.ols('Yield ~ Year + I(Year**2) + Maxtemp_Aug + Precip_Aug + I(Precip_Aug**2) + Nov_pod_forecast + WEEK_35PCT_EXCELLENT', data=test1).fit()

    train1 = iowa7[iowa7["Year"] >= testYear]
    quality_pred = results.predict(train1)  # predict out of sample

    ## Mixed model

    # Quadratic trend, August average temperature and precipitation, and their interaction
    test1 = iowa7[iowa7["Year"] < testYear]
    results = smf.ols('Yield ~ Year + I(Year**2) + AvgTemp_Aug + Precip_Aug + I(AvgTemp_Aug*Precip_Aug) + I(Precip_Aug**2)', data=test1).fit()

    train1 = iowa7[iowa7["Year"] >= testYear]
    mixed_pred = results.predict(train1)  # predict out of sample
    
    # Collect the observed yields and every model's forecasts for this state
    output = pd.DataFrame({
        "State": i,
        "Year": range(testYear, 2021),
        "real": real_pred.tolist(),
        "trend_pred": trend_pred.tolist(),
        "trend_sqr_pred": trend_sqr_pred.tolist(),
        "arima_pred": arima_pred.tolist(),
        "maxTemp_pred": maxTemp_pred.tolist(),
        "maxT_Precip_pred": maxT_Precip_pred.tolist(),
        "pod_pred": pod_pred.tolist(),
        "quality_pred": quality_pred.tolist(),
        "mixed_pred": mixed_pred.tolist(),
    })
    myGuesses.append(output)

guesses = pd.concat(myGuesses)
guesses

item:  0
i:  IOWA
item:  1
i:  INDIANA
item:  2
i:  ILLINOIS
item:  3
i:  KANSAS
item:  4
i:  MINNESOTA
item:  5
i:  MISSOURI
item:  6
i:  NEBRASKA
item:  7
i:  NORTH DAKOTA
item:  8
i:  SOUTH DAKOTA
item:  9
i:  OHIO


| | State | Year | real | trend_pred | trend_sqr_pred | arima_pred | maxTemp_pred | maxT_Precip_pred | pod_pred | quality_pred | mixed_pred |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | IOWA | 2017 | 57.0 | 52.552478 | 54.128346 | 56.579096 | 55.468813 | 55.421374 | 55.851705 | 53.536961 | 55.262899 |
| 1 | IOWA | 2018 | 56.0 | 52.995027 | 54.671482 | 57.814028 | 54.443958 | 55.695755 | 61.971938 | 59.307056 | 55.762273 |
| 2 | IOWA | 2019 | 55.0 | 53.437575 | 55.216736 | 57.880776 | 56.062578 | 55.908625 | 55.693119 | 54.425302 | 55.959571 |
| 3 | IOWA | 2020 | 58.0 | 53.880124 | 55.764108 | 57.547064 | NaN | NaN | NaN | NaN | 59.616834 |
| 0 | INDIANA | 2017 | 54.0 | 51.228121 | 52.560423 | 55.965465 | 53.709933 | 52.163123 | 56.746076 | 56.85374 | 52.172638 |
| 1 | INDIANA | 2018 | 57.5 | 51.663997 | 53.081339 | 54.076893 | 53.086002 | 54.358973 | 64.440869 | 64.373645 | 54.358039 |
| 2 | INDIANA | 2019 | 51.0 | 52.099873 | 53.604047 | 55.47203 | 53.921361 | 53.453151 | 54.907787 | 55.614494 | 53.447532 |
| 3 | INDIANA | 2020 | 61.0 | 52.535749 | 54.128544 | 55.485612 | NaN | NaN | NaN | NaN | 57.25786 |
| 0 | ILLINOIS | 2017 | 58.0 | 50.69453 | 52.417508 | 57.473405 | 54.102496 | 53.010984 | 58.289106 | 58.297473 | 53.191 |
| 1 | ILLINOIS | 2018 | 63.5 | 51.0986 | 52.931556 | 57.557116 | 52.948734 | 53.431121 | 61.97511 | 61.662315 | 53.47269 |


In [61]:
# Keep only the columns that have no missing values in any row
guesses.loc[:, guesses.notna().all()]

| | State | Year | real | trend_pred | trend_sqr_pred | arima_pred | mixed_pred |
|---|---|---|---|---|---|---|---|
| 0 | IOWA | 2017 | 57.0 | 52.552478 | 54.128346 | 56.579096 | 55.262899 |
| 1 | IOWA | 2018 | 56.0 | 52.995027 | 54.671482 | 57.814028 | 55.762273 |
| 2 | IOWA | 2019 | 55.0 | 53.437575 | 55.216736 | 57.880776 | 55.959571 |
| 3 | IOWA | 2020 | 58.0 | 53.880124 | 55.764108 | 57.547064 | 59.616834 |
| 0 | INDIANA | 2017 | 54.0 | 51.228121 | 52.560423 | 55.965465 | 52.172638 |
| 1 | INDIANA | 2018 | 57.5 | 51.663997 | 53.081339 | 54.076893 | 54.358039 |
| 2 | INDIANA | 2019 | 51.0 | 52.099873 | 53.604047 | 55.47203 | 53.447532 |
| 3 | INDIANA | 2020 | 61.0 | 52.535749 | 54.128544 | 55.485612 | 57.25786 |
| 0 | ILLINOIS | 2017 | 58.0 | 50.69453 | 52.417508 | 57.473405 | 53.191 |
| 1 | ILLINOIS | 2018 | 63.5 | 51.0986 | 52.931556 | 57.557116 | 53.47269 |
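
To compare the forecasting methods, one option is to score each prediction column against `real` with a simple error metric. The sketch below computes the mean absolute error (MAE) on the four Iowa rows copied from the output above. The choice of MAE and the variable names are assumptions made for illustration, and four years for one state is too small a sample to rank the models with any confidence.

```python
import pandas as pd

# Iowa rows copied from the output table above.
iowa = pd.DataFrame(
    {
        "real": [57.0, 56.0, 55.0, 58.0],
        "trend_pred": [52.552478, 52.995027, 53.437575, 53.880124],
        "trend_sqr_pred": [54.128346, 54.671482, 55.216736, 55.764108],
        "arima_pred": [56.579096, 57.814028, 57.880776, 57.547064],
        "mixed_pred": [55.262899, 55.762273, 55.959571, 59.616834],
    },
    index=[2017, 2018, 2019, 2020],
)

# Absolute error of each model per year, then the average over the years.
pred_cols = [c for c in iowa.columns if c.endswith("_pred")]
mae = iowa[pred_cols].sub(iowa["real"], axis=0).abs().mean()
print(mae.sort_values())
```

The same calculation could be applied to the full `guesses` frame with a `groupby("State")` to get one score per state and model.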
