### Data Source: [Agricultura (Kaggle)](https://www.kaggle.com/calvom/agricultura) (after cleaning the dataset)

Columns:
* province_id
* province
* department_id
* department
* crop_id
* crop
* year_id
* year (a growing year always starts in the autumn of the year before)
* average_temperature [degrees Celsius]
* area_sowed [hectares]
* area_harvested [hectares]
* production [tons]
* performance [tons / ha]
* quality (high/middle/low)
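The `performance` column is given in tons per hectare, which suggests it is derived from `production` and an area column. A minimal sketch of that check follows, using three rows copied by hand from the `df.head()` output of the cleaned dataset. Dividing by `area_harvested` rather than `area_sowed` is an assumption; the two areas are equal in these rows, so the sample cannot distinguish them.

```python
import numpy as np
import pandas as pd

# Three rows copied from the df.head() output of the cleaned dataset
sample = pd.DataFrame({
    "area_harvested": [3.0, 15.0, 10.0],   # hectares
    "production": [10.0, 82.0, 55.0],      # tons
    "performance": [3.333, 5.467, 5.5],    # tons / ha, as stored
})

# Assumption: performance = production / area_harvested, rounded to 3 decimals
sample["performance_check"] = (sample["production"] / sample["area_harvested"]).round(3)

# Compare with a tolerance instead of exact float equality
matches = bool(np.isclose(sample["performance_check"], sample["performance"]).all())
print(sample)
print("matches:", matches)
```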

In [29]:
import pandas as pd
import numpy as np

import pickle

import plotly.express as px
import matplotlib.pyplot as plt

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_validate
from sklearn.metrics import *
from sklearn.preprocessing import StandardScaler
from sklearn import tree
from sklearn import model_selection

# algorithms
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
#import tensorflow as tf

In [30]:
df = pd.read_csv('../data/cleaned/agricultura.csv', sep=',', low_memory=False)
df.head()

Unnamed: 0.1,Unnamed: 0,province_id,province,department_id,department,crop_id,crop,year_id,year,average_temperature,area_sowed,area_harvested,production,performance,quality,quality_numeric
0,0,6,BUENOS AIRES,854,25 DE MAYO,1,garlic,1,1969/70,23,3.0,3.0,10.0,3.333,high,3.0
1,1,6,BUENOS AIRES,854,25 DE MAYO,1,garlic,2,1970/71,21,1.0,1.0,3.0,3.0,middle,2.0
2,2,6,BUENOS AIRES,14,ADOLFO GONZALES CHAVES,1,garlic,1,1969/70,30,15.0,15.0,82.0,5.467,middle,2.0
3,3,6,BUENOS AIRES,14,ADOLFO GONZALES CHAVES,1,garlic,2,1970/71,31,10.0,10.0,55.0,5.5,high,3.0
4,4,6,BUENOS AIRES,14,ADOLFO GONZALES CHAVES,1,garlic,3,1971/72,26,8.0,8.0,44.0,5.5,high,3.0


In [31]:
df.shape

(132769, 16)

## Drop irrelevant columns

In [32]:
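# Note: 'Unnamed: 0' (a row index saved with the CSV) is not dropped here,
# so it stays in the table and is later used as a feature.
df = df.drop(['province_id', 'province', 'department_id', 'department', 'crop', 'year_id', 'year', 'quality'], axis=1)  # axis=1 drops columns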

## Normalise columns

In [33]:
df.head()

Unnamed: 0.1,Unnamed: 0,crop_id,average_temperature,area_sowed,area_harvested,production,performance,quality_numeric
0,0,1,23,3.0,3.0,10.0,3.333,3.0
1,1,1,21,1.0,1.0,3.0,3.0,2.0
2,2,1,30,15.0,15.0,82.0,5.467,2.0
3,3,1,31,10.0,10.0,55.0,5.5,3.0
4,4,1,26,8.0,8.0,44.0,5.5,3.0


In [26]:
# Scale each feature by dividing it by its column maximum
# crops
max_crop = df.crop_id.max()
df["crop_id"] = df.crop_id/max_crop

# temperature
max_temp = df.average_temperature.max()
df["average_temperature"] = df.average_temperature/max_temp

# area_sowed
max_sowed = df.area_sowed.max()
df["area_sowed"] = df.area_sowed/max_sowed

# area_harvested
max_harvest = df.area_harvested.max()
df["area_harvested"] = df.area_harvested/max_harvest

# production
max_prod = df.production.max()
df["production"] = df.production/max_prod

# performance
max_perf = df.performance.max()
df["performance"] = df.performance/max_perf

# shift the class labels from 1..3 to 0..2
df["quality_numeric"] = df.quality_numeric - 1
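The normalisation above divides by maxima taken over the whole table, so rows that later land in the test split influence the scaling. One hedged alternative is to learn the maxima from the training rows only. The sketch below uses scikit-learn's `MaxAbsScaler` on made-up data; `X_demo` and the other names are placeholders, not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MaxAbsScaler

# Hypothetical stand-in for the feature columns (non-negative values)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 500, size=(100, 3))

X_tr, X_te = train_test_split(X_demo, test_size=0.3, random_state=5)

# MaxAbsScaler divides each column by its maximum absolute value,
# which for non-negative data equals division by the column maximum
scaler = MaxAbsScaler().fit(X_tr)      # maxima learned from the training rows only
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)   # test rows reuse the training maxima

print(X_tr_scaled.max(axis=0))  # each training column now peaks at 1.0
```

With this approach a test value can exceed 1.0 if it is larger than anything seen during training, which is expected and harmless for most models.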

## Obtaining training dataset

In [4]:
# Split input and output
X = df[df.columns[:-1]]
y = df[df.columns[-1:]]

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=5)

In [9]:
X_train.shape

(92938, 7)

In [10]:
y_train

Unnamed: 0,quality_numeric
109572,2.0
38600,0.0
119920,1.0
29240,2.0
42912,1.0
...,...
121974,1.0
124605,0.0
20463,1.0
18638,1.0
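The split above is purely random. If the three quality classes are not equally frequent, passing `stratify` to `train_test_split` keeps the class proportions the same in both splits. A minimal sketch on made-up labels follows; `X_demo`, `y_demo` and the class probabilities are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels with three classes (0, 1, 2)
rng = np.random.default_rng(0)
y_demo = rng.choice([0, 1, 2], size=1000, p=[0.2, 0.5, 0.3])
X_demo = rng.normal(size=(1000, 4))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=5, stratify=y_demo
)

# Class proportions in both splits should be nearly identical
train_share = np.bincount(y_tr) / len(y_tr)
test_share = np.bincount(y_te) / len(y_te)
print(train_share.round(3), test_share.round(3))
```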


## Training the models

### A. Random Forest Classifier

In [27]:
#RF = RandomForestClassifier()
RF = RandomForestClassifier(n_estimators=1000, max_depth=10, random_state=0)

# validation (macro-averaged scorers, since the target has three classes)
#RF_scores = cross_validate(RF, X_train, y_train.values.ravel(), cv=3, scoring=('accuracy', 'precision_macro', 'recall_macro', 'f1_macro'))
#print(RF_scores)

# train model
RF.fit(X_train, y_train.values.ravel())

# prediction
RF_prediction = RF.predict(X_test)

round(RF.score(X_test, y_test), 4)

0.3999
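Cross-validation on the training split gives a more stable estimate than a single test score. The sketch below runs `cross_validate` with macro-averaged scorers, which accept multiclass targets. It uses synthetic data from `make_classification` and a smaller forest so that it runs quickly; `X_demo`, `y_demo` and `rf_demo` are placeholders for the notebook's `X_train`, `y_train` and `RF`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Hypothetical three-class data standing in for X_train / y_train
X_demo, y_demo = make_classification(
    n_samples=300, n_features=7, n_informative=4, n_classes=3, random_state=0
)

rf_demo = RandomForestClassifier(n_estimators=50, max_depth=10, random_state=0)

# Macro-averaged scorers work for multiclass targets; the plain
# 'recall' and 'f1' scorers assume a binary target
metric_names = ("accuracy", "precision_macro", "recall_macro", "f1_macro")
scores = cross_validate(rf_demo, X_demo, y_demo, cv=3, scoring=metric_names)

for name in metric_names:
    fold_scores = scores["test_" + name]   # one value per fold
    print(name, round(float(fold_scores.mean()), 4))
```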

In [15]:
# save model
with open('../models/rf.model', 'wb') as f:
    pickle.dump(RF, f)

In [None]:
# load model
with open('../models/rf.model', 'rb') as f:
    RF = pickle.load(f)

### B. Logistic Regression

In [20]:
LR = LogisticRegression(random_state=0, solver='lbfgs', multi_class='multinomial')

# train model
LR.fit(X_train, y_train.values.ravel())

# prediction
LR_prediction = LR.predict(X_test)

round(LR.score(X_test,y_test), 4)

0.3955

In [13]:
# save model
with open('../models/lr.model', 'wb') as f:
    pickle.dump(LR, f)

In [19]:
# load model
with open('../models/lr.model', 'rb') as f:
    LR = pickle.load(f)

### C. Support Vector Machines

In [22]:
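# Training is commented out: this cell relies on a fitted model that was
# loaded from '../models/svm.model' (see the load cell below) before it ran.
#SVM = svm.SVC(decision_function_shape="ovo")

# train model
#SVM.fit(X_train, y_train.values.ravel())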

# prediction
SVM_prediction = SVM.predict(X_test)

round(SVM.score(X_test, y_test), 4)

0.4012

In [14]:
# save model
with open('../models/svm.model', 'wb') as f:
    pickle.dump(SVM, f)

In [21]:
# load model
with open('../models/svm.model', 'rb') as f:
    SVM = pickle.load(f)

### D. Neural Network

In [34]:
NN = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(150, 10), random_state=1)

# train model
NN.fit(X_train, y_train.values.ravel())

# predict
NN_prediction = NN.predict(X_test)

round(NN.score(X_test, y_test), 4)

0.4014

In [25]:
# save model
with open('../models/nn.model', 'wb') as f:
    pickle.dump(NN, f)

In [None]:
# load model
with open('../models/nn.model', 'rb') as f:
    NN = pickle.load(f)

## Quality Predictions

### Algorithm Comparison

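All four models reach a test accuracy of about 0.40, i.e. each predicts the correct quality class for roughly two in five test samples. For three classes this is only slightly above the one-in-three rate expected from random guessing if the classes were equally frequent, which suggests that the given features carry little information about quality.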

| Algorithm                | Test accuracy |
| :----------------------- | :-----------: |
| Random Forest Classifier | 0.3999        |
| Logistic Regression      | 0.3955        |
| Support Vector Machines  | 0.4012        |
| Neural Network           | 0.4014        |
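To judge whether an accuracy of about 0.40 is meaningful, it helps to compare it with a trivial baseline that always predicts the most frequent class. A minimal sketch with scikit-learn's `DummyClassifier` follows. The labels are made up and the class probabilities are hypothetical; in the notebook the baseline would be fitted on `X_train`, `y_train` and scored on `X_test`, `y_test`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Hypothetical labels standing in for the notebook's quality classes
rng = np.random.default_rng(0)
y_demo = rng.choice([0, 1, 2], size=3000, p=[0.3, 0.4, 0.3])
X_demo = np.zeros((3000, 1))  # features are ignored by DummyClassifier

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=5)

# Always predicts the most frequent class seen during training
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
baseline_accuracy = baseline.score(X_te, y_te)
print(round(baseline_accuracy, 4))
```

A model that does not clearly beat this baseline has learned little beyond the class frequencies.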

In [35]:
y_test_prediction_array = RF.predict(X_test)
df_test = pd.DataFrame({
    "quality": y_test["quality_numeric"],
    "quality_prediction": y_test_prediction_array
})

In [38]:
df_test["correct_prediction"] = df_test.quality == df_test.quality_prediction

In [39]:
df_test[df_test.correct_prediction]

Unnamed: 0,quality,quality_prediction,correct_prediction
22200,2.0,2.0,True
18745,3.0,3.0,True
75784,2.0,2.0,True
128155,2.0,2.0,True
14555,2.0,2.0,True
...,...,...,...
146,2.0,2.0,True
71123,3.0,3.0,True
79267,2.0,2.0,True
25675,3.0,3.0,True


In [40]:
df_test

Unnamed: 0,quality,quality_prediction,correct_prediction
81067,2.0,3.0,False
87733,1.0,3.0,False
22200,2.0,2.0,True
18745,3.0,3.0,True
122052,2.0,3.0,False
...,...,...,...
130193,3.0,2.0,False
83779,3.0,2.0,False
79267,2.0,2.0,True
25675,3.0,3.0,True
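A row-by-row listing makes it hard to see which classes get confused with each other. One way to summarise a table like `df_test` is a cross-tabulation of true against predicted classes. The sketch below uses a hand-made miniature with the same column names; `df_demo` and its eight rows are hypothetical.

```python
import pandas as pd

# Hypothetical miniature of df_test with the same column names
df_demo = pd.DataFrame({
    "quality":            [2.0, 1.0, 2.0, 3.0, 2.0, 3.0, 3.0, 1.0],
    "quality_prediction": [3.0, 3.0, 2.0, 3.0, 3.0, 2.0, 3.0, 2.0],
})

# Rows: true class, columns: predicted class
confusion = pd.crosstab(df_demo["quality"], df_demo["quality_prediction"])
print(confusion)

# Overall accuracy is the share of rows where both columns agree
accuracy = (df_demo["quality"] == df_demo["quality_prediction"]).mean()
print(accuracy)
```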
