# Crop Recommendation System Using ANNs 

**Importing the libraries**



In [None]:
import numpy as np
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt

**Importing the dataset**

A dataset from Kaggle is used to train the ANN.

In [None]:
dataset = pd.read_csv('/kaggle/input/crop-recommendation-dataset/Crop_recommendation.csv')
X=dataset.drop(labels=['label'], axis=1)
y = dataset.iloc[:, -1].values

**Splitting the dataset into the Training set and Test set**

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, train_size=0.80, random_state = 1)

**EDA and visualizing the features :**

Plotting the Humidity and Rainfall for various crops.

In [None]:
unique_features = np.unique(dataset['label'])
print(unique_features)

plt.figure(figsize=(22.5,13.5))
for feature in unique_features:
    data_subset = dataset[dataset['label'] == feature]
    plt.scatter(data_subset['humidity'], data_subset['rainfall'], label=feature, marker='o')
    
plt.xlabel('Humidity (%)')
plt.ylabel('Rainfall (mm)')
plt.title('Scatterplot of Humidity vs Rainfall')
plt.legend(loc='upper left')
plt.grid(True)
plt.show()

**Next, plotting the temperature distribution one crop at a time to show the temperature range in which each crop grows.**

In [None]:
import seaborn as sns
for feature in unique_features:
    data_subset = dataset[dataset['label'] == feature]
    sns.histplot(data_subset['temperature'], kde=True)  # distplot is deprecated in recent seaborn releases
    plt.title(feature)
    plt.show()

**Now, plotting a bar graph of the mean temperature at which each crop grows.**

In [None]:
for feature in unique_features:
    data_subset = dataset[dataset['label'] == feature]
    mean_of_temps = np.mean(data_subset['temperature'])
    plt.barh(feature, mean_of_temps)

plt.grid(True)
plt.show()

**Feature Engineering**

**Dropping constant Features** 

Features that barely change, or do not change at all, across the dataset are redundant: they add nothing to what the model learns, so they can be removed.
Here the variance threshold is set to 0.5 (variance measures how spread out or dispersed the values are, indicating the data's variability).
This means any column with a variance of 0.5 or less will be removed.
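
As a minimal sketch of how this filter behaves, the example below applies the same 0.5 threshold to a small made-up table. The column names and values are hypothetical and are not taken from the crop dataset.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical toy data: "constant" never changes, "narrow" barely changes,
# and "wide" varies a lot.
toy = pd.DataFrame({
    "constant": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    "narrow":   [5.0, 5.1, 4.9, 5.0, 5.1, 4.9],
    "wide":     [1.0, 4.0, 7.0, 10.0, 13.0, 16.0],
})

selector = VarianceThreshold(threshold=0.5)
selector.fit(toy)

# get_support() is a boolean mask: True for columns whose variance exceeds the threshold.
kept = list(toy.columns[selector.get_support()])
print(selector.variances_)
print(kept)
```

Only the widely varying column should survive here, because the other two have a variance well below 0.5.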

In [None]:
X_train = pd.DataFrame(X_train)
from sklearn.feature_selection import VarianceThreshold
var_thres = VarianceThreshold(threshold=0.5)  # use the 0.5 threshold described above
var_thres.fit(X_train)


In [None]:
var_thres.get_support()

=> No column in the dataset has a variance of 0.5 or below.



**Getting the columns with a variance above 0.5, i.e., all of them**

In [None]:
X_train.columns[var_thres.get_support()]

***The following code finds the columns with constant or almost constant entries, so that they can be removed without dropping them manually.***

In [None]:
constant_columns = [column for column in X_train.columns
                    if column not in X_train.columns[var_thres.get_support()]]
print(constant_columns)

As no column meets this condition, the variable constant_columns is an empty list.

Hence, in the code below no column will be dropped.

In [None]:
X_train = X_train.drop(constant_columns, axis=1)  # drop() returns a new DataFrame, so assign the result

**Removing the highly correlated features (using Pearson correlation)**

Removing highly correlated features is important in feature selection because it helps reduce redundancy and multicollinearity, which can lead to unstable model estimates and make it challenging for machine learning algorithms to learn the true relationships between variables.
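
To illustrate the idea on something small, here is a hedged sketch with synthetic data in which column "b" is almost a linear copy of column "a". The data, the column names and the upper-triangle approach are assumptions made for this illustration; it is a vectorized variant of the idea, not the function used in this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2.0 * a + rng.normal(scale=0.1, size=200),  # nearly a linear copy of "a"
    "c": rng.normal(size=200),                        # independent of "a"
})

# Absolute Pearson correlation between every pair of columns.
corr = toy.corr().abs()

# Keep only the upper triangle so each pair is inspected once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flag the later column of every pair whose correlation exceeds 0.7.
to_drop = [col for col in upper.columns if (upper[col] > 0.7).any()]
print(to_drop)
```

Only one column of a correlated pair is flagged, so the information the pair shares stays in the data.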

In [None]:
cor = X_train.corr()
print(cor)

**This is the correlation between the various features. A heatmap will now be used to show these values more clearly.**

In [None]:
plt.figure(figsize=(12,10))
cor = X_train.corr()
sns.heatmap(cor, annot=True, cmap=plt.cm.CMRmap_r)
plt.show()

**As the heatmap above shows, the phosphorus (P) and potassium (K) content of the soil are highly correlated (0.73).**

**Therefore, a function is defined that identifies one feature from each pair whose correlation is above a certain threshold (set to 0.7 in this case).**

In [None]:
def correlation(dataset, threshold):
    col_corr = set()  
    corr_matrix = dataset.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if abs(corr_matrix.iloc[i, j]) > threshold:  # use the absolute value: the sign of the correlation does not matter
                colname = corr_matrix.columns[i] 
                col_corr.add(colname)
    return col_corr

In [None]:
corr_features = correlation(X_train, 0.7)
len(set(corr_features))

**The correlated features are:**

In [None]:
corr_features

So, the above feature ("K", potassium) will be dropped from the training set and the test set.

In [None]:
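# Note: drop() returns a new DataFrame and the results are not assigned here,
# so X_train and X_test keep all seven features in the cells that follow.
X_train.drop(corr_features, axis=1)
X_test.drop(corr_features, axis=1)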

**Now the relationship between each feature and the dependent variable, i.e., the importance of each feature, will be estimated.**

The importance of each feature is obtained from the feature_importances_ property of a fitted tree-based model.

Feature importance gives a score for each feature of the data; the higher the score, the more important or relevant the feature is to the output variable.
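
The behaviour of these scores can be checked on a small synthetic problem. In the sketch below the label depends only on the first column, so that column should receive most of the importance. The data and the names X_toy, y_toy and model_toy are hypothetical, and random_state is fixed only to make the sketch repeatable.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 2))
y_toy = (X_toy[:, 0] > 0).astype(int)  # the label depends only on column 0

model_toy = ExtraTreesClassifier(n_estimators=100, random_state=0)
model_toy.fit(X_toy, y_toy)

# One score per feature; the scores are normalized to sum to 1.
importances = model_toy.feature_importances_
print(importances)
```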

In [None]:
from sklearn.ensemble import ExtraTreesClassifier
import matplotlib.pyplot as plt
model = ExtraTreesClassifier()
model.fit(X_train,y_train)

In [None]:
print(model.feature_importances_)

**A bar graph showing the features and their importance :**

In [None]:
ranked_features=pd.Series(model.feature_importances_,index=X.columns)
ranked_features.plot(kind='barh')
plt.show()

So, as we can see, the most important feature is rainfall and the least important is pH, and no feature is drastically less important than the others.

**Therefore, NO FEATURE will be removed in this step.**

**Finding Outliers** 

We need to remove the outliers because ANNs are highly sensitive to outliers in the data.

There are two methods to find outliers in a dataset:

1) Using the Z-score (if the data is normally distributed)

2) Using the IQR (if the distribution is skewed)
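
Both rules can be sketched on a small made-up array before they are applied to the real features. The values below are hypothetical (the numbers 1 to 20 plus one extreme value of 200); the thresholds of 3.5 standard deviations and 3 times the IQR mirror the ones used later in this notebook.

```python
import numpy as np

# Hypothetical sample: 1..20 plus one extreme value.
values = np.append(np.arange(1, 21), 200).astype(float)

# Z-score rule: distance from the mean, measured in standard deviations.
z_scores = (values - values.mean()) / values.std()
z_outliers = values[np.abs(z_scores) > 3.5]

# IQR rule: flag values far outside the middle 50% of the data.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 3 * iqr, q3 + 3 * iqr
iqr_outliers = values[(values < lower) | (values > upper)]

print(z_outliers)
print(iqr_outliers)
```

In this sample both rules flag only the extreme value; on skewed data the two rules can disagree, which is why the choice depends on the shape of the distribution.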

**Plotting the "Distplot" for various features (training set values)**

In [None]:
for column in X.columns:
    sns.histplot(X_train[column], kde=True)  # distplot is deprecated in recent seaborn releases
    plt.show() #For showing different features in different graphs 

**Of all the features, only pH and temperature are approximately normally distributed, so the Z-score will be applied to them.**

First we need to combine X_train and y_train into a single training set, and then convert the resulting array into a pandas DataFrame so as to get the column names.

In [None]:
training_set = np.column_stack((X_train, y_train))
print(training_set)
training_set_labeled = pd.DataFrame(training_set, columns=dataset.columns)

In [None]:
def detect_outliers(data):
    """Return the values whose absolute Z-score exceeds the threshold."""
    outliers = []
    threshold = 3.5
    mean = np.mean(data)
    std = np.std(data)

    for i in data:
        z_score = (i - mean) / std
        if np.abs(z_score) > threshold:
            outliers.append(i)
    return outliers

In [None]:
outlier_ph=detect_outliers(X_train.ph)
outlier_ph

**So, the above values are the outliers and hence need to be removed from the dataset.**

In [None]:
training_set_labeled = training_set_labeled[~training_set_labeled['ph'].isin(outlier_ph)]
training_set_labeled.shape

The same will be done for temperature.

In [None]:
outlier_temp=detect_outliers(X_train.temperature)
outlier_temp

In [None]:
training_set_labeled = training_set_labeled[~training_set_labeled['temperature'].isin(outlier_temp)]
training_set_labeled.shape

So, the outliers in these two features have been removed from the dataset.

Next come the outliers in the more skewed distributions (N, P, K, humidity and rainfall).

**They'll be identified using the interquartile range (IQR).**

In [None]:
def detect_outliers_quantile(data):
    """Return the values lying more than threshold * IQR outside the quartiles."""
    outliers = []
    threshold = 3
    quantile1, quantile3 = np.percentile(data, [25, 75])
    iqr = quantile3 - quantile1

    upper_bridge = quantile3 + (threshold * iqr)
    lower_bridge = quantile1 - (threshold * iqr)

    for i in data:
        if i > upper_bridge or i < lower_bridge:
            outliers.append(i)
    return outliers

**So, the outliers need to be found and then removed.**


(Generally they should be removed, but removing them here causes the loss of two classes in y_train, so they will not be removed here.)

In [None]:
unique_values = np.unique(training_set_labeled['label'])
print(unique_values.shape )
skewed_features = [X_train.N , X_train.P , X_train.K , X_train.humidity , X_train.rainfall]

for i in skewed_features:
    outlier_skewed = detect_outliers_quantile(i)
    print('Outliers in ' , i.name)
    print(outlier_skewed)
    #training_set_labeled = training_set_labeled[~training_set_labeled[i.name].isin(outlier_skewed)]
    print(training_set_labeled.shape)
    unique_values = np.unique(training_set_labeled['label'])
    print(unique_values.shape )
    


In [None]:
training_set_labeled.shape

Creating X_train and y_train again.

In [None]:
X_train = training_set_labeled.iloc[ : , :-1].values 
y_train = training_set_labeled.iloc[ : , 7:8].values #to get a 2-D array 
print(X_train.shape)
print(y_train.shape)

In [None]:
y_test = y_test.reshape(-1, 1)
print(y_test.shape)
y_new = np.vstack((y_train, y_test))  # np.row_stack is a deprecated alias of np.vstack
print(y_new.shape)

**Feature Extraction :**

We will use **Linear Discriminant Analysis (LDA)** as the feature extraction technique.

LDA builds new features as linear combinations of the existing ones, chosen to maximize the separation between classes, so the number of features can be reduced with little loss of class-discriminating information.
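
As a small illustration of the technique, the sketch below scales and projects the Iris dataset that ships with scikit-learn; it is used here only because it is tiny and bundled, and it is not part of this project. The number of components LDA can produce is at most min(number of classes - 1, number of features), which is 2 for Iris.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)   # 150 samples, 4 features, 3 classes
X_scaled = StandardScaler().fit_transform(X_iris)

# LDA is supervised: it needs the class labels to find the separating directions.
lda_toy = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda_toy.fit_transform(X_scaled, y_iris)

print(X_lda.shape)
print(lda_toy.explained_variance_ratio_)
```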

**Before applying LDA, we first perform feature scaling**

In [None]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

**Applying LDA**

In [None]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
lda = LDA(n_components = 5)
X_train = lda.fit_transform(X_train, y_train.ravel())  # LDA expects a 1-D label array
X_test = lda.transform(X_test)

**One hot-encoding the dependent variable**

One-hot encoding is an important preprocessing step when training an Artificial Neural Network (ANN) on categorical targets: it converts the non-numeric class labels into numeric vectors that the network can work with.
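
What the encoding produces can be sketched with plain NumPy on a handful of made-up labels. The sample labels and variable names below are hypothetical, and the notebook itself uses scikit-learn's OneHotEncoder rather than this manual version.

```python
import numpy as np

# Hypothetical sample of class labels.
labels = np.array(["rice", "maize", "rice", "coffee"])

# classes holds the sorted unique labels; class_index maps each sample to its class.
classes, class_index = np.unique(labels, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(len(classes))[class_index]
print(one_hot)

# Decoding: the position of the largest entry gives the class back.
decoded = classes[one_hot.argmax(axis=1)]
print(decoded)
```

The same argmax step is what turns the softmax output of the network back into a crop name at prediction time.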

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# sparse_output=False returns a dense array (the argument was named `sparse` before scikit-learn 1.2)
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(sparse_output=False), [0])], remainder='passthrough')
y_train = np.array(ct.fit_transform(y_train))
print(y_train)

***sparse_output=False***

This is important because the encoder returns a sparse matrix by default; this option makes it return a dense matrix that can be conveniently used by the ANN.

**Building the ANN**

**Initializing the ANN**

In [None]:
ann = tf.keras.models.Sequential()

**Adding layers to the ANN**

In [None]:
ann.add(tf.keras.layers.Dense(units=10, activation='relu'))
ann.add(tf.keras.layers.Dense(units=20, activation='relu'))
ann.add(tf.keras.layers.Dense(units=22, activation='softmax')) #Output layer 

**Compiling the ANN**

In [None]:
ann.compile(optimizer = 'adam', loss = 'categorical_crossentropy', metrics = ['accuracy'])

**Training the ANN on the Training set**

In [None]:
print(X_train.shape)
print(y_train.shape)

In [None]:
ann.fit(X_train, y_train, batch_size = 128, epochs = 200)

**Predicting a result** 

In [None]:
pred = ann.predict(lda.transform(sc.transform([[91,43,44,21.87974,83.00027,6.5111,203.9355]])))
pred = ct.named_transformers_['encoder'].inverse_transform(pred)
print(pred)