**Project Statement:**

We are given the monthly production quantity of an agricultural product (Grople syrup) in 10 provinces of a country from January 2015 to December 2020. Grople syrup is extracted from a fruit. The fruit takes a few months to grow on the trees that bear it, and extracting the syrup takes a few more days after harvest.
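
Because the fruit needs a few months to grow, weather from earlier months may matter more than weather in the production month itself. One possible way to expose that to a model is to add lagged copies of a monthly feature per region. The sketch below is an illustration on made-up numbers; the helper `add_lags` and the lag lengths are assumptions, not part of the original analysis.

```python
import pandas as pd

def add_lags(frame, col, lags, group_col="region_id"):
    """Add columns `<col>_lag<k>` holding the value of `col` k rows earlier
    within each region (hypothetical helper, assumes one row per month)."""
    out = frame.sort_values([group_col, "start_date"]).copy()
    for k in lags:
        # shift() inside the groupby keeps lags from crossing region boundaries
        out[f"{col}_lag{k}"] = out.groupby(group_col)[col].shift(k)
    return out

# Synthetic monthly data for two regions (values are invented).
demo = pd.DataFrame({
    "start_date": pd.to_datetime(["2015-01-01", "2015-02-01", "2015-03-01"] * 2),
    "region_id": [93, 93, 93, 94, 94, 94],
    "precip": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})
lagged = add_lags(demo, "precip", lags=[1, 2])
print(lagged)
```

The first rows of each region have no earlier month to draw from, so their lag columns are missing and would need to be dropped or filled before modelling.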



We would like to predict the production quantity for Grople syrup.



**We want to use the following datasets:**

**Production Quantity.csv** has 4 columns:
- start_date, end_date: first and last day of each month from January 2015 to December 2020
- prod: production quantity of Grople syrup (in tonnes) at monthly frequency
- region_id: a unique identifier for the 10 provinces

**Daily Precipitation.csv** has 4 columns:
- start_date, end_date: start day and end day at daily frequency from January 1, 2014 to March 13, 2022
- precip: precipitation quantity (in mm) at daily frequency
- region_id: a unique identifier for the 10 provinces

**Daily Soil Moisture.csv** has 4 columns:
- start_date, end_date: start day and end day at daily frequency from January 1, 2014 to March 6, 2022
- smos: soil moisture at 5 cm depth (measured as the ratio Vol/Vol) at daily frequency
- region_id: a unique identifier for the 10 provinces

**Daily Temperature.csv** has 4 columns:
- start_date, end_date: start day and end day at daily frequency from January 1, 2014 to March 13, 2022
- temp: average daily temperature at the land surface (in degrees Celsius) at daily frequency
- region_id: a unique identifier for the 10 provinces

**Eight Day NDVI.csv** has 4 columns:
- start_date, end_date: start day and end day at 8-day frequency from December 27, 2013 to March 13, 2022
- ndvi: Normalized Difference Vegetation Index at 8-day frequency (NDVI is a ratio in the range [-1, 1] that captures the vegetation abundance of an area)
- region_id: a unique identifier for the 10 provinces
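
The production table is monthly while the weather tables are daily (and NDVI is 8-daily), so the frequencies have to be reconciled before the tables can be combined. One common option is to aggregate each finer-grained table to one row per region and calendar month and then join on those keys. The following is a minimal sketch on synthetic data; the helper `to_monthly`, the `month` column, and the choice of `sum` versus `mean` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def to_monthly(table, value_col, how="mean"):
    """Aggregate a daily (or 8-day) table to one row per region and month.
    Hypothetical helper; assumes `start_date` is a timezone-naive datetime column."""
    out = table.copy()
    # Calendar month of each observation, e.g. Period('2015-01', 'M')
    out["month"] = out["start_date"].dt.to_period("M")
    return out.groupby(["region_id", "month"], as_index=False)[value_col].agg(how)

# Synthetic daily precipitation: 1 mm every day for one region (invented values).
days = pd.date_range("2015-01-01", "2015-02-28", freq="D")
daily_precip = pd.DataFrame({
    "start_date": days,
    "end_date": days,
    "precip": np.ones(len(days)),
    "region_id": 93,
})

# Synthetic monthly production for the same region (invented values).
production = pd.DataFrame({
    "start_date": pd.to_datetime(["2015-01-01", "2015-02-01"]),
    "prod": [100, 200],
    "region_id": [93, 93],
})
production["month"] = production["start_date"].dt.to_period("M")

monthly_precip = to_monthly(daily_precip, "precip", how="sum")
merged = production.merge(monthly_precip, on=["region_id", "month"], how="left")
print(merged)
```

If the date columns are timezone-aware, as in the printed heads of the real tables, stripping the timezone first with `.dt.tz_localize(None)` avoids a pandas warning when converting to periods.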

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_percentage_error, r2_score

**Data Preparation**

In [None]:
## Creating dataframes
df_product=pd.read_csv("Production Quantity.csv", parse_dates=[0,1])
df_prec=pd.read_csv("Daily Precipitation.csv", parse_dates=[0,1])
df_soil=pd.read_csv("Daily Soil Mositure.csv", parse_dates=[0,1])
df_temp=pd.read_csv("Daily Temperature.csv", parse_dates=[0,1])
df_ndvi=pd.read_csv("Eight Day NDVI.csv", parse_dates=[0,1])


In [None]:
# Look at the first row of each dataset
print(df_product.head(1))
print(df_prec.head(1))
print(df_soil.head(1))
print(df_temp.head(1))
print(df_ndvi.head(1))

                 start_date                  end_date    prod  region_id
0 2015-01-01 00:00:00+00:00 2015-01-31 00:00:00+00:00  171725         93
                 start_date                  end_date    precip  region_id
0 2014-01-01 00:00:00+00:00 2014-01-01 00:00:00+00:00  1.392393         93
                 start_date                  end_date      smos  region_id
0 2014-01-01 00:00:00+00:00 2014-01-01 00:00:00+00:00  0.310787         93
                 start_date                  end_date       temp  region_id
0 2014-01-02 00:00:00+00:00 2014-01-02 00:00:00+00:00  24.707605         93
                 start_date                  end_date      ndvi  region_id
0 2013-12-27 00:00:00+00:00 2014-01-03 00:00:00+00:00  0.679106         93


In [None]:
# Check each dataset for duplicate (start_date, region_id) pairs
print(df_product.duplicated(subset=['start_date','region_id']).sum())
print(df_prec.duplicated(subset=['start_date','region_id']).sum())
print(df_soil.duplicated(subset=['start_date','region_id']).sum())
print(df_temp.duplicated(subset=['start_date','region_id']).sum())
print(df_ndvi.duplicated(subset=['start_date','region_id']).sum())

0
0
0
0
0
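
The same check can be applied to every table in a loop, together with a count of missing values, so that the results are labelled rather than printed as bare numbers. This is a sketch only; `quality_report` is a hypothetical helper and the demo table is invented.

```python
import pandas as pd

def quality_report(tables):
    """Summarise row count, duplicate (start_date, region_id) keys and
    missing values for each table in a {name: DataFrame} mapping."""
    rows = []
    for name, table in tables.items():
        rows.append({
            "table": name,
            "rows": len(table),
            "duplicate_keys": int(table.duplicated(subset=["start_date", "region_id"]).sum()),
            "missing_values": int(table.isna().sum().sum()),
        })
    return pd.DataFrame(rows).set_index("table")

# Invented table with one duplicated key and one missing value.
demo = pd.DataFrame({
    "start_date": pd.to_datetime(["2014-01-01", "2014-01-01", "2014-01-02"]),
    "region_id": [93, 93, 93],
    "precip": [1.0, 1.0, None],
})
report = quality_report({"demo": demo})
print(report)
```

In the notebook this could be called with the five loaded dataframes, for example `quality_report({"prec": df_prec, "soil": df_soil})`.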


In [None]:
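# Assumption: `df` is not defined in the cells shown; here it starts as a copy of the production table.
df = df_product.copy()
# Caution: these assignments align on the row index, not on date or region,
# so monthly row i receives row i of each finer-grained table rather than a monthly aggregate.
df["Daily Precipitation"] = df_prec['precip']
df["smos"] = df_soil['smos']
df["temp"] = df_temp['temp']
df["ndvi"] = df_ndvi["ndvi"]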

In [None]:
print(df.shape)
df.head(3)

(720, 8)


| | start_date | end_date | prod | region_id | Daily Precipitation | smos | temp | ndvi |
|---|---|---|---|---|---|---|---|---|
| 0 | 2015-01-01T00:00:00.000Z | 2015-01-31T00:00:00.000Z | 171725 | 93 | 1.392393 | 0.310787 | 24.707605 | 0.679106 |
| 1 | 2015-02-01T00:00:00.000Z | 2015-02-28T00:00:00.000Z | 188325 | 93 | 0.31538 | 0.192271 | 26.421176 | 0.701431 |
| 2 | 2015-03-01T00:00:00.000Z | 2015-03-31T00:00:00.000Z | 247856 | 93 | 2.347846 | 0.265683 | 24.305642 | 0.745149 |


In [None]:
print(df.info())
print(df.describe())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 720 entries, 0 to 719
Data columns (total 8 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   start_date           720 non-null    object 
 1   end_date             720 non-null    object 
 2   prod                 720 non-null    int64  
 3   region_id            720 non-null    int64  
 4   Daily Precipitation  720 non-null    float64
 5   smos                 720 non-null    float64
 6   temp                 720 non-null    float64
 7   ndvi                 720 non-null    float64
dtypes: float64(4), int64(2), object(2)
memory usage: 45.1+ KB
None
                prod  region_id  Daily Precipitation        smos        temp  \
count     720.000000  720.00000           720.000000  720.000000  720.000000   
mean   159014.201389   99.00000             6.338860    0.290631   27.054364   
std    142882.722751    4.10163             7.969791    0.077984    2.304648   
min     

## **Random Forest**

In [None]:
df.head(1)

| | start_date | end_date | prod | region_id | Daily Precipitation | smos | temp | ndvi |
|---|---|---|---|---|---|---|---|---|
| 0 | 2015-01-01T00:00:00.000Z | 2015-01-31T00:00:00.000Z | 171725 | 93 | 1.392393 | 0.310787 | 24.707605 | 0.679106 |


In [None]:
x = df.drop(['start_date','end_date','prod'], axis=1)
y = df['prod']

In [None]:
# train test split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)
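
# A random split mixes months from all years in both sets. Since the goal is to
# predict production, a chronological split (train on earlier months, test on
# later ones) may give a more realistic estimate. The block below is a sketch on
# synthetic data; `chronological_split` and the cutoff date are assumptions, and
# the date column is assumed to be a timezone-naive datetime.

```python
import pandas as pd

def chronological_split(frame, date_col, cutoff):
    """Rows strictly before `cutoff` go to train, the rest to test
    (hypothetical helper, not part of the original notebook)."""
    cutoff = pd.Timestamp(cutoff)
    is_train = frame[date_col] < cutoff
    return frame[is_train], frame[~is_train]

# 72 synthetic monthly rows covering 2015-01 through 2020-12 (invented values).
months = pd.date_range("2015-01-01", "2020-12-01", freq="MS")
demo = pd.DataFrame({"start_date": months, "prod": range(len(months))})

# Hold out the final year as the test set.
train, test = chronological_split(demo, "start_date", "2020-01-01")
print(len(train), len(test))
```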

In [None]:
rf = RandomForestRegressor(n_estimators = 100)
rf.fit(x_train, y_train)

In [None]:
# Predict on the training set
y_train_pred = rf.predict(x_train)

In [None]:
# Calculating Mean Absolute Percentage Error
mape = mean_absolute_percentage_error(y_train, y_train_pred)
print(mape)

0.06782683103041191
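
scikit-learn's `mean_absolute_percentage_error` returns a fraction rather than a percentage, so the value above corresponds to roughly 6.8% on the training data. To make the metric concrete, the sketch below computes it by hand with NumPy on invented numbers; `mape_manual` is a hypothetical helper and assumes no true value is zero.

```python
import numpy as np

def mape_manual(y_true, y_pred):
    """Mean of |y_true - y_pred| / |y_true|, returned as a fraction.
    Assumes no element of y_true is zero."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred) / np.abs(y_true))

# Invented values: relative errors are 0.10, 0.10 and 0.00.
y_true_demo = [100.0, 200.0, 400.0]
y_pred_demo = [110.0, 180.0, 400.0]
manual = mape_manual(y_true_demo, y_pred_demo)
print(manual)
```

Because each error is divided by the true value, months with small production quantities can dominate the average.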


In [None]:
# Calculating the R² score on the training set
r2 = r2_score(y_train, y_train_pred)
print(r2)

0.9915377081254924


## **Test set**

In [None]:
y_test_pred = rf.predict(x_test)

In [None]:
# Calculating Mean Absolute Percentage Error
mape = mean_absolute_percentage_error(y_test, y_test_pred)
print(mape)

0.18312689380758915


In [None]:
# Calculating the R² score on the test set
r2 = r2_score(y_test, y_test_pred)
print(r2)

0.9285173664614508
