# Data Ingestion, Transformation and Model Training

# Data sources 

**Weather**: average temperatures and rainfall in NC sourced from NOAA.

Weather data was available as:
- average temperatures in NC for each quarter of each year since 2000
- average rainfall in NC for each year since 2000
- drought data on a weekly basis since 2000

**Crops**: 22.4 million rows from 1866 to present, including the $ value of crops produced and yield-per-acre figures. Some data is at the county level, and some is reported as frequently as weekly. The data is searchable online, but we downloaded many of the datasets (https://www.nass.usda.gov/datasets/).

Crop data consisted of data items covering all aspects of agriculture organized in a single column.

# Transformation

Weather data was available annually from 2000 onward, and crops also grow on an annual cycle, so we modeled our data on a yearly basis.

Weather data for each year was presented as quarterly averages. Because the timing of rainfall is a factor in plant growth, all 4 quarterly rainfall figures served as model features. Temperature averages are also quarterly, and all 4 are used as model features. Finally, we collected weekly drought indicators for NC, which range from mild to extraordinary drought conditions in some part of the state.

- Drought data was reshaped into annual rows, and the count of drought weeks at each severity level in each year was added as a set of features.
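
The drought reshaping can be sketched roughly as follows. This is a minimal pandas example, not our actual transformation code: the `date` and `severity` column names, the severity labels, and the sample rows are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical weekly drought records: one row per drought week,
# labelled with a severity level (labels are illustrative only).
weekly = pd.DataFrame({
    "date": pd.to_datetime([
        "2000-01-04", "2000-01-11", "2000-07-18", "2001-02-06", "2001-02-13",
    ]),
    "severity": ["mild", "mild", "severe", "mild", "severe"],
})


def drought_weeks_per_year(df: pd.DataFrame) -> pd.DataFrame:
    """Count drought weeks at each severity level, one row per year."""
    counts = pd.crosstab(df["date"].dt.year, df["severity"])
    counts.index.name = "year"
    # One feature column per severity level.
    counts.columns = [f"drought_weeks_{level}" for level in counts.columns]
    return counts


annual_drought = drought_weeks_per_year(weekly)
print(annual_drought)
```

Each severity level becomes its own column, so the result can be joined to the annual rainfall and temperature features on `year`.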

In short, model feature data are rainfall, temperatures and drought indicators.

Model target data are measures of crop performance.

After working with several large agriculture data files from NASS, we settled on **crop data**.  We downloaded the file and started our transformations:

- Data from outside of North Carolina was removed, as was weekly and county-level data.  We could then focus on annual totals, using actuals instead of forecasts.  Crops that were not produced in large quantity in NC were also removed, as were certain crops like flowers and houseplants.  This left soybeans, corn, cotton, tobacco, squash, sweet potatoes, bell peppers, barley, wheat, oats, peanuts and hay.

- Where necessary, holes in data were interpolated and filled in.  

- Data prior to 2000 and from 2024 were removed, to match up with weather data.

- Vertical data had to be reshaped into a wide dataframe with fewer rows, so that it could easily be fed into the models as targets.

- For our purposes, our primary target data were (1) crop $ value per acre and (2) crop yield per acre (yield expressed as tons, bushels, hundredweight, etc.).
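
As an illustration of the year filter, the long-to-wide reshape, and the gap filling described above, here is a small pandas sketch. The `year`, `data_item`, and `value` column names and the sample figures are hypothetical, and linear interpolation is an assumption (the interpolation method we used is not stated here).

```python
import pandas as pd

# Hypothetical long-format extract: one row per (year, data item).
long_df = pd.DataFrame({
    "year": [1999, 2000, 2000, 2001, 2002, 2002, 2024],
    "data_item": [
        "CORN_YIELD_ACRE", "CORN_YIELD_ACRE", "SOYBEAN_$_ACRE",
        "CORN_YIELD_ACRE", "CORN_YIELD_ACRE", "SOYBEAN_$_ACRE",
        "CORN_YIELD_ACRE",
    ],
    "value": [100.0, 110.0, 200.0, 120.0, 130.0, 240.0, 150.0],
})

# Keep 2000 through 2023 so the crop rows line up with the weather data.
in_range = long_df[long_df["year"].between(2000, 2023)]

# Long -> wide: one row per year, one column per target.
wide = in_range.pivot_table(
    index="year", columns="data_item", values="value", aggfunc="first"
)

# Fill holes between known years (linear interpolation is an assumption).
wide = wide.sort_index().interpolate(method="linear", limit_area="inside")
print(wide)
```

In this toy input the 2001 soybean value is missing, so it is filled with the midpoint of the 2000 and 2002 values.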

# Modeling and Training

Our target values are real numbers. To find the best regression model, I ran all of the targets against all of the regression models and evaluated the results with 4 different measures of quality.

The resulting matrix was saved as an Excel file and guided the development of our `crop_prediction()` function.

Training followed the usual path:
- Split the data into training (80%) and testing (20%) sets
- Scale the X and the y data
- Run each model in a loop against each target value

| Model types |
|------------------------|
| LinearRegression() |
| SVR() |
| DecisionTreeRegressor() |
| RandomForestRegressor() |
| GradientBoostingRegressor() |

- Record the prediction
- Save the trained models (as pickle files, for future use)
- Record the quality scores

| Quality measures |
|------------------|
| Mean Squared Error |
| R2 score |
| Mean Absolute Error |
| Explained Variance |

- Save the fitted scaler models (for future use) 

Finally, scan the quality scores for each model and select the best model for each commodity. That model is the one used by the `crop_prediction()` function.
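
The training steps above can be sketched with scikit-learn as follows. This is a simplified illustration under stated assumptions, not our production code: the data is synthetic, `StandardScaler` stands in for whichever scaler was used, the scalers are fitted on the training split only, models are pickled in memory rather than to files, and picking the winner by R2 is one possible selection rule.

```python
import pickle

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import (explained_variance_score, mean_absolute_error,
                             mean_squared_error, r2_score)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): 24 annual rows, 10 weather features.
rng = np.random.default_rng(0)
X = rng.normal(size=(24, 10))
targets = {
    "SOYBEAN_$_ACRE": 400 + 50 * X[:, 0] + rng.normal(scale=5, size=24),
    "CORN_YIELD_ACRE": 120 + 10 * X[:, 1] + rng.normal(scale=2, size=24),
}

# Factories, so that every target gets a fresh, unfitted model.
model_factories = {
    "LinearRegression": lambda: LinearRegression(),
    "SVR": lambda: SVR(),
    "DecisionTreeRegressor": lambda: DecisionTreeRegressor(random_state=0),
    "RandomForestRegressor": lambda: RandomForestRegressor(random_state=0),
    "GradientBoostingRegressor": lambda: GradientBoostingRegressor(random_state=0),
}

results = []  # one row of quality scores per (target, model) pair
saved = {}    # pickled models and scalers; a real run would write files

for target_name, y in targets.items():
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    # Fit the scalers on the training split only, then reuse them.
    x_scaler = StandardScaler().fit(X_train)
    y_scaler = StandardScaler().fit(y_train.reshape(-1, 1))
    X_train_s = x_scaler.transform(X_train)
    X_test_s = x_scaler.transform(X_test)
    y_train_s = y_scaler.transform(y_train.reshape(-1, 1)).ravel()
    saved[(target_name, "scalers")] = pickle.dumps((x_scaler, y_scaler))

    for model_name, make_model in model_factories.items():
        model = make_model().fit(X_train_s, y_train_s)
        # Predict in scaled space, then convert back to original units.
        pred = y_scaler.inverse_transform(
            model.predict(X_test_s).reshape(-1, 1)).ravel()
        results.append({
            "target": target_name,
            "model": model_name,
            "mse": mean_squared_error(y_test, pred),
            "r2": r2_score(y_test, pred),
            "mae": mean_absolute_error(y_test, pred),
            "explained_variance": explained_variance_score(y_test, pred),
        })
        saved[(target_name, model_name)] = pickle.dumps(model)

# Pick the best model per target; using R2 as the criterion is an assumption.
best = {}
for row in results:
    current = best.get(row["target"])
    if current is None or row["r2"] > current["r2"]:
        best[row["target"]] = row

for target_name, row in best.items():
    print(target_name, row["model"], round(row["r2"], 3))
```

The `results` list plays the role of the target-by-model score matrix: loading it into a dataframe gives one row per pair and one column per quality measure.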

# Crop Predictions function

We access these trained models through a function invoked by the Gradio UI. Its inputs are:
- Target value name, e.g. SOYBEAN_$_ACRE
- Forecast weather array, a list of values corresponding to all the features in the trained models

The function returns:
- Predicted value
- 20-year average value for the target measure
- Confidence rating (High / Medium / Low), assigned based on the quality measures of each model

Design considerations:
- Scaling X values and inverse scaling the target values 
- Unpickling the trained model that did the best job on this target during training
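
To make the inputs, outputs, and design considerations concrete, here is a hedged sketch of what such a function could look like. Everything in it is illustrative: the `REGISTRY` structure, the `confidence_from_r2` helper and its 0.7 / 0.4 thresholds, the 3-feature synthetic data, and the in-sample R2 are assumptions, not the actual implementation.

```python
import pickle

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Stand-in artifacts (assumption): a tiny model trained on synthetic data.
rng = np.random.default_rng(1)
X_hist = rng.normal(size=(20, 3))
y_hist = 400 + 50 * X_hist[:, 0] + rng.normal(scale=5, size=20)
_x_scaler = StandardScaler().fit(X_hist)
_y_scaler = StandardScaler().fit(y_hist.reshape(-1, 1))
_X_s = _x_scaler.transform(X_hist)
_y_s = _y_scaler.transform(y_hist.reshape(-1, 1)).ravel()
_model = LinearRegression().fit(_X_s, _y_s)

# Hypothetical registry: target name -> pickled artifacts and summary values.
REGISTRY = {
    "SOYBEAN_$_ACRE": {
        "model": pickle.dumps(_model),
        "x_scaler": pickle.dumps(_x_scaler),
        "y_scaler": pickle.dumps(_y_scaler),
        "avg_20yr": float(y_hist.mean()),
        # In-sample score for brevity; a test-set score would be stored in practice.
        "r2": float(_model.score(_X_s, _y_s)),
    }
}


def confidence_from_r2(r2: float, high: float = 0.7, medium: float = 0.4) -> str:
    """Map a quality score to a rating; the thresholds are illustrative."""
    if r2 >= high:
        return "High"
    if r2 >= medium:
        return "Medium"
    return "Low"


def crop_prediction(target_name: str, forecast_weather: list) -> tuple:
    """Return (predicted value, 20-year average, confidence rating)."""
    entry = REGISTRY[target_name]
    # Unpickle the best model and the scalers fitted during training.
    model = pickle.loads(entry["model"])
    x_scaler = pickle.loads(entry["x_scaler"])
    y_scaler = pickle.loads(entry["y_scaler"])

    # Scale the features, predict, then inverse-scale back to original units.
    X = np.asarray(forecast_weather, dtype=float).reshape(1, -1)
    y_scaled = model.predict(x_scaler.transform(X))
    predicted = y_scaler.inverse_transform(y_scaled.reshape(-1, 1))[0, 0]
    return float(predicted), entry["avg_20yr"], confidence_from_r2(entry["r2"])


print(crop_prediction("SOYBEAN_$_ACRE", [0.0, 0.0, 0.0]))
```

The forecast weather list must contain the same features, in the same order, as the data the scaler and model were fitted on.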


# What we learned

1. Many models did not have a high score. This suggests that weather was not the only important factor influencing our crop predictions. Other factors include:
- Increased demand for commodities stemming from global events. For example, the war in Ukraine removed a lot of wheat from the world market, meaning that prices would have gone up, and our metric (value per acre) would rise as well.

To remedy this:
- Use targets that are not related to price, e.g., bushels per acre.
- Since price, and ultimately revenue, is important, fold in some additional price forecast information.