![dphi banner](https://dphi-courses.s3.ap-south-1.amazonaws.com/Datathons/dphi_banner.png)

# **[DPhi Data Sprint #26: Crop Recommendation](https://dphi.tech/challenges/data-sprint-26-crop-recommendation/62/)**
Based on code by [Manish KC](https://dphi.tech/notebooks/920/manish_kc_06/data-sprint-26-crop-recommendation) and [Krish Naik](https://www.youtube.com/watch?v=uMWJls5Roqs)

## Test Whether GPU is Working

In [None]:
import tensorflow as tf
tf.test.gpu_device_name()

'/device:GPU:0'

## Check Which GPU is Being Used

In [None]:
from tensorflow.python.client import device_lib
device_lib.list_local_devices()

[name: "/device:CPU:0"
 device_type: "CPU"
 memory_limit: 268435456
 locality {
 }
 incarnation: 9854271559904688068, name: "/device:GPU:0"
 device_type: "GPU"
 memory_limit: 14674281152
 locality {
   bus_id: 1
   links {
   }
 }
 incarnation: 11890471964417933275
 physical_device_desc: "device: 0, name: Tesla T4, pci bus id: 0000:00:04.0, compute capability: 7.5"]

## Import Libraries

In [None]:
# To measure execution time of code
!pip install ipython-autotime
 
%load_ext autotime

Collecting ipython-autotime
  Downloading https://files.pythonhosted.org/packages/b4/c9/b413a24f759641bc27ef98c144b590023c8038dfb8a3f09e713e9dff12c1/ipython_autotime-0.3.1-py2.py3-none-any.whl
Installing collected packages: ipython-autotime
Successfully installed ipython-autotime-0.3.1
time: 183 µs (started: 2021-03-08 05:58:17 +00:00)


In [None]:
# Autosklearn pre-requisite
!apt-get install swig -y

Reading package lists... Done
Building dependency tree       
Reading state information... Done
swig is already the newest version (3.0.12-1).
0 upgraded, 0 newly installed, 0 to remove and 29 not upgraded.


In [None]:
# Autosklearn pre-requisite
!pip install Cython numpy



In [None]:
# Automated library for machine learning model selection
!pip install auto-sklearn



In [None]:
import numpy as np        # Fundamental package for linear algebra and multidimensional arrays
import pandas as pd       # Data analysis and manipulation tool

# to ignore warnings
import warnings
warnings.filterwarnings("ignore")

In [None]:
# Autosklearn pre-requisite
!pip install scikit-learn==0.24.1



In [None]:
# Autosklearn pre-requisite
!python -m pip install "dask[distributed]" --upgrade

Requirement already up-to-date: dask[distributed] in /usr/local/lib/python3.7/dist-packages (2021.3.0)


In [None]:
import sklearn
import autosklearn.classification as classifier
from sklearn.model_selection import train_test_split


## Loading Dataset

In [None]:
# Read the training set directly from the DPhi official GitHub repository
train_data = pd.read_csv("https://raw.githubusercontent.com/dphi-official/Datasets/master/crop_recommendation/train_set_label.csv")

## Basic EDA

In [None]:
# Display the first few rows of the training data
train_data.head()

Unnamed: 0,N,P,K,temperature,humidity,ph,rainfall,crop
0,17.0,136.0,196.0,23.871923,90.49939,5.882156,103.054809,0
1,49.0,69.0,82.0,18.315615,15.361435,7.263119,81.787105,3
2,74.0,49.0,38.0,23.314104,71.450905,7.488014,164.497037,8
3,104.0,35.0,28.0,27.510061,50.666872,6.983732,143.995555,5
4,23.0,72.0,84.0,19.020613,17.131591,6.920251,79.926981,3


time: 17.1 ms (started: 2021-03-08 06:40:34 +00:00)


In [None]:
# View the number of rows, columns, and data types
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1650 entries, 0 to 1649
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   N            1650 non-null   float64
 1   P            1650 non-null   float64
 2   K            1650 non-null   float64
 3   temperature  1650 non-null   float64
 4   humidity     1650 non-null   float64
 5   ph           1650 non-null   float64
 6   rainfall     1650 non-null   float64
 7   crop         1650 non-null   object 
dtypes: float64(7), object(1)
memory usage: 103.2+ KB


In [None]:
# Change labels of crop types to numbers (required for certain machine learning classifiers)
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
train_data.crop = le.fit_transform(train_data.crop)
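
To make the encoding step concrete, here is a minimal pure-Python sketch of the mapping a label encoder produces, assuming (as scikit-learn's `LabelEncoder` does) that classes are sorted and each class is replaced by its position in that sorted list. The helper names `fit_label_encoder`, `transform`, and `inverse_transform` are illustrative stand-ins, not the library API; the crop names are the 22 that appear in this notebook's decoded predictions.

```python
# Crop names taken from the decoded predictions shown later in this notebook.
crop_names = [
    "rice", "mothbeans", "cotton", "mango", "lentil", "mungbean", "coffee",
    "jute", "orange", "banana", "coconut", "muskmelon", "grapes",
    "watermelon", "maize", "apple", "pomegranate", "kidneybeans", "papaya",
    "blackgram", "chickpea", "pigeonpeas",
]


def fit_label_encoder(labels):
    """Return the sorted unique classes; a class's code is its position."""
    return sorted(set(labels))


def transform(classes, labels):
    """Map each label to its integer code."""
    index = {name: code for code, name in enumerate(classes)}
    return [index[name] for name in labels]


def inverse_transform(classes, codes):
    """Map integer codes back to the original labels."""
    return [classes[code] for code in codes]


classes = fit_label_encoder(crop_names)
codes = transform(classes, ["mothbeans", "cotton", "mango", "lentil", "mungbean"])
decoded = inverse_transform(classes, codes)

print(codes)
print(decoded)
```

Because the codes depend only on the sorted class list, the same fitted encoder must be reused later to turn predicted codes back into crop names.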

In [None]:
# Check encoded training data
train_data.head()

Unnamed: 0,N,P,K,temperature,humidity,ph,rainfall,crop
0,17.0,136.0,196.0,23.871923,90.49939,5.882156,103.054809,0
1,49.0,69.0,82.0,18.315615,15.361435,7.263119,81.787105,3
2,74.0,49.0,38.0,23.314104,71.450905,7.488014,164.497037,8
3,104.0,35.0,28.0,27.510061,50.666872,6.983732,143.995555,5
4,23.0,72.0,84.0,19.020613,17.131591,6.920251,79.926981,3


In [None]:
# Check for class balance between different crops
train_data.crop.value_counts(normalize=True)

# No resampling is needed as all crops are equally represented

INFO:numexpr.utils:NumExpr defaulting to 2 threads.


21    0.045455
20    0.045455
1     0.045455
2     0.045455
3     0.045455
4     0.045455
5     0.045455
6     0.045455
7     0.045455
8     0.045455
9     0.045455
10    0.045455
11    0.045455
12    0.045455
13    0.045455
14    0.045455
15    0.045455
16    0.045455
17    0.045455
18    0.045455
19    0.045455
0     0.045455
Name: crop, dtype: float64
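
The share of 0.045455 per class is what a perfectly balanced 22-class dataset would give. The sketch below reproduces that figure on a hypothetical label column built to be balanced; the row count (1650) and class count (22) come from the outputs above, while the `labels` list itself is synthetic and stands in for `train_data.crop`.

```python
from collections import Counter

n_rows = 1650     # rows in the training set, from train_data.info()
n_classes = 22    # distinct crop codes (0 to 21) in the output above

# Hypothetical, perfectly balanced label column: every code repeated equally.
labels = [code for code in range(n_classes) for _ in range(n_rows // n_classes)]

counts = Counter(labels)
shares = {code: count / len(labels) for code, count in counts.items()}

# Share of one class, rounded to the six decimals that pandas displays.
print(round(shares[0], 6))

# Smallest and largest class sizes; equal values mean the classes are balanced.
print(min(counts.values()), max(counts.values()))
```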

## Separating Input Features and Output Features

In [None]:
# Input/independent variables
# Input (independent) variables: every column except the target
X = train_data.drop('crop', axis=1)

# Output (dependent) variable: the encoded crop label
y = train_data['crop']

## Splitting the data into Train and Validation Sets


In [None]:
import math  # math.sqrt is used below to compute the train-validation split

In [None]:
# Compute the train-validation split with a rule of thumb based on Guyon (1997):
# validation fraction = 1 / sqrt(number of input features)

val_split = (1 / math.sqrt(len(X.columns)))

train_split = 1 - val_split

print('Train-validation split:', train_split, '/', val_split)

Train-validation split: 0.6220355269907728 / 0.3779644730092272


In [None]:
# split the data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=val_split, random_state = 42)
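
As a quick check on what this split means in rows, the arithmetic can be worked through by hand. This is a sketch under two assumptions: `X` has 7 feature columns and 1650 rows (both from `train_data.info()`), and `train_test_split` rounds a fractional validation count up when `test_size` is a float.

```python
import math

n_features = 7    # columns in X after dropping 'crop'
n_rows = 1650     # rows in the training data

# Validation fraction used in this notebook: 1 / sqrt(number of features)
val_fraction = 1 / math.sqrt(n_features)

# Assumption: the validation row count is rounded up, the rest go to training.
n_val = math.ceil(val_fraction * n_rows)
n_train = n_rows - n_val

print("validation fraction:", val_fraction)
print("validation rows:", n_val)
print("training rows:", n_train)
```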

## Model Building

In [None]:
automlclassifier = classifier.AutoSklearnClassifier() # Default search budget (time_left_for_this_task) is 3600 seconds (1 hour)
automlclassifier.fit(X_train, y_train)

AutoSklearnClassifier(per_run_time_limit=360)

In [None]:
# Print the final ensemble constructed by auto-sklearn
print(automlclassifier.show_models())

# The generated ensemble classifier consists of multiple Gaussian Naive Bayes and Random Forest classifiers

[(0.060000, SimpleClassificationPipeline({'balancing:strategy': 'none', 'classifier:__choice__': 'gaussian_nb', 'data_preprocessing:categorical_transformer:categorical_encoding:__choice__': 'one_hot_encoding', 'data_preprocessing:categorical_transformer:category_coalescence:__choice__': 'minority_coalescer', 'data_preprocessing:numerical_transformer:imputation:strategy': 'median', 'data_preprocessing:numerical_transformer:rescaling:__choice__': 'standardize', 'feature_preprocessor:__choice__': 'feature_agglomeration', 'data_preprocessing:categorical_transformer:category_coalescence:minority_coalescer:minimum_fraction': 0.03172344196220074, 'feature_preprocessor:feature_agglomeration:affinity': 'manhattan', 'feature_preprocessor:feature_agglomeration:linkage': 'average', 'feature_preprocessor:feature_agglomeration:n_clusters': 293, 'feature_preprocessor:feature_agglomeration:pooling_func': 'mean'},
dataset_properties={
  'task': 2,
  'sparse': False,
  'multilabel': False,
  'multiclass

## Model Evaluation

In [None]:
# Generate predictions on the validation data
pred = automlclassifier.predict(X_val)

time: 603 ms (started: 2021-03-08 06:05:04 +00:00)


In [None]:
# import accuracy score from sklearn.metrics
from sklearn.metrics import accuracy_score

time: 1.02 ms (started: 2021-03-08 06:05:46 +00:00)


In [None]:
print('Accuracy Score is: ', accuracy_score(y_val, pred)) 

# y_val is the original target value of the validation set (X_val)
# pred is the predicted target value of the validation set

Accuracy Score is:  0.9919871794871795
time: 3.19 ms (started: 2021-03-08 06:05:53 +00:00)
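
Accuracy is simply the fraction of validation rows whose predicted label matches the true label. The sketch below shows that computation in plain Python; the `accuracy` helper is an illustrative stand-in for `accuracy_score`, and the toy label lists are made up. The final lines assume the validation set holds 624 rows (1650 rows times the validation fraction of about 0.378, rounded up), in which case the reported score is consistent with 619 correct predictions.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Toy example: four of five predictions match.
toy_true = [13, 6, 12, 10, 14]
toy_pred = [13, 6, 12, 10, 5]
toy_score = accuracy(toy_true, toy_pred)
print(toy_score)

# Relating the reported score to a row count (624 rows is an assumption).
n_val = 624
reported = 0.9919871794871795
n_correct = round(reported * n_val)
print(n_correct, "of", n_val)
```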


## Load Test Dataset

In [None]:
test_data = pd.read_csv('https://raw.githubusercontent.com/dphi-official/Datasets/master/crop_recommendation/test_set_label.csv')

time: 171 ms (started: 2021-03-08 06:29:11 +00:00)


In [None]:
test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 550 entries, 0 to 549
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   N            550 non-null    float64
 1   P            550 non-null    float64
 2   K            550 non-null    float64
 3   temperature  550 non-null    float64
 4   humidity     550 non-null    float64
 5   ph           550 non-null    float64
 6   rainfall     550 non-null    float64
dtypes: float64(7)
memory usage: 30.2 KB
time: 10.5 ms (started: 2021-03-08 06:29:22 +00:00)
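
The test set has the same seven feature columns as the training features, with no `crop` column. Before predicting, it can be worth confirming this programmatically. The sketch below is one way to do so; `check_columns` is a hypothetical helper, and the column lists are typed out by hand here, whereas in the notebook they would come from `X.columns` and `test_data.columns`.

```python
# Feature columns as listed by the info() outputs in this notebook.
train_features = ["N", "P", "K", "temperature", "humidity", "ph", "rainfall"]
test_columns = ["N", "P", "K", "temperature", "humidity", "ph", "rainfall"]


def check_columns(expected, actual):
    """Raise ValueError if the two column lists differ in names or order."""
    missing = [c for c in expected if c not in actual]
    extra = [c for c in actual if c not in expected]
    if missing or extra:
        raise ValueError(f"missing columns: {missing}, unexpected columns: {extra}")
    if list(expected) != list(actual):
        raise ValueError("columns match by name but are in a different order")
    return True


columns_ok = check_columns(train_features, test_columns)
print("columns match:", columns_ok)
```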


## Make Predictions on the Test Dataset

In [None]:
# Make predictions
target = automlclassifier.predict(test_data)

time: 565 ms (started: 2021-03-08 06:30:04 +00:00)


In [None]:
# Show the generated predictions
target

array([13,  6, 12, 10, 14,  5, 10,  6, 10, 10,  8, 13, 16,  1,  4, 15,  7,
        5, 21, 14,  4, 20, 13, 11,  0, 16, 20, 21, 19, 13,  9, 14,  1, 17,
        2,  3,  9,  4, 10, 17,  7,  7, 20,  3,  7, 18, 21, 18, 21, 12, 11,
       12,  7, 18, 20,  9,  0,  4, 21,  0, 14,  5, 21, 14,  0,  9, 16,  7,
       21,  4, 11, 19,  6,  0, 11,  8, 13,  6,  6, 18, 18, 13,  1,  0,  9,
        6,  1,  3,  0, 11,  7, 16, 19,  2, 21, 13,  7,  0,  3, 16, 16, 15,
        4,  8, 16,  6, 18,  6,  6, 12, 19,  3, 11, 13,  5,  9,  1,  4, 10,
        8,  2,  5,  9, 14,  6, 17, 21,  4, 17,  2, 20, 13, 15, 13,  8, 14,
       19, 18,  5,  5, 12,  8,  8,  6,  3,  3, 17, 13, 16,  0,  5, 14, 11,
        3,  2, 10, 16,  9, 19, 17,  2, 12, 19,  8, 18, 13,  3, 15,  3, 10,
       12,  4,  1,  5, 20, 12, 21, 21,  8,  4, 18, 17, 20, 16,  6,  2, 18,
       12, 12,  2,  8, 16, 17, 11, 11,  6, 17, 15,  0,  5,  4, 21, 15, 15,
        4,  3, 11, 14,  6, 15, 13, 20, 10, 20,  9,  7, 18, 10,  8,  7, 11,
       11, 15, 12, 14,  1

time: 4.51 ms (started: 2021-03-08 06:30:29 +00:00)


In [None]:
# Transform the numerical predictions to crop names
target = le.inverse_transform(target)

time: 1.72 ms (started: 2021-03-08 06:32:13 +00:00)


In [None]:
# Show the predicted crop names
target

array(['mothbeans', 'cotton', 'mango', 'lentil', 'mungbean', 'coffee',
       'lentil', 'cotton', 'lentil', 'lentil', 'jute', 'mothbeans',
       'orange', 'banana', 'coconut', 'muskmelon', 'grapes', 'coffee',
       'watermelon', 'mungbean', 'coconut', 'rice', 'mothbeans', 'maize',
       'apple', 'orange', 'rice', 'watermelon', 'pomegranate',
       'mothbeans', 'kidneybeans', 'mungbean', 'banana', 'papaya',
       'blackgram', 'chickpea', 'kidneybeans', 'coconut', 'lentil',
       'papaya', 'grapes', 'grapes', 'rice', 'chickpea', 'grapes',
       'pigeonpeas', 'watermelon', 'pigeonpeas', 'watermelon', 'mango',
       'maize', 'mango', 'grapes', 'pigeonpeas', 'rice', 'kidneybeans',
       'apple', 'coconut', 'watermelon', 'apple', 'mungbean', 'coffee',
       'watermelon', 'mungbean', 'apple', 'kidneybeans', 'orange',
       'grapes', 'watermelon', 'coconut', 'maize', 'pomegranate',
       'cotton', 'apple', 'maize', 'jute', 'mothbeans', 'cotton',
       'cotton', 'pigeonpeas', 'pige

time: 5.33 ms (started: 2021-03-08 06:32:25 +00:00)


## Save Prediction Results to Local Storage via Google Colab

In [None]:
# Create a DataFrame of the predicted crop names with a default integer index
res = pd.DataFrame(target)
res.columns = ["prediction"]

time: 2.09 ms (started: 2021-03-08 06:34:24 +00:00)


In [None]:
res.head()

Unnamed: 0,prediction
0,mothbeans
1,cotton
2,mango
3,lentil
4,mungbean


time: 21.2 ms (started: 2021-03-08 06:34:27 +00:00)


In [None]:
# Download predictions as a CSV file without index values
from google.colab import files
res.to_csv('sprint_26_submission.csv', index = False)         
files.download('sprint_26_submission.csv')


time: 13.9 ms (started: 2021-03-08 06:36:14 +00:00)
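
The `google.colab` module is only available inside Colab. As a sketch of the same step elsewhere, the file can be written with the standard library alone: a single `prediction` header followed by one crop name per row and no index column. The `write_predictions` helper and the temporary output directory are illustrative choices, and the five crop names are the first predictions shown above.

```python
import csv
import os
import tempfile


def write_predictions(path, predictions):
    """Write predictions to a CSV with a 'prediction' header and no index."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["prediction"])
        for name in predictions:
            writer.writerow([name])


# First five predicted crop names from this notebook.
predictions = ["mothbeans", "cotton", "mango", "lentil", "mungbean"]

# A temporary directory keeps the example free of side effects.
out_path = os.path.join(tempfile.mkdtemp(), "sprint_26_submission.csv")
write_predictions(out_path, predictions)

# Read the file back to confirm its layout.
with open(out_path, newline="") as f:
    rows = list(csv.reader(f))

print(rows[:3])
```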
