In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# FARMERS TECHNOLOGY ADOPTION ANALYSIS
This is a personal project with objectives of:
1. **Investigating the factors influencing adoption of innovation among farmers (hopefully generalizable).**
2. **Predicting farmers' likelihood to adopt innovation.**
3. **Assigning individual farmers to various adoption clusters (Innovators....Laggards).**


This project is intended to be revisited quarterly, as more data becomes available, or better still deployed in order to make use of dynamic data.

The data for this work was extracted from various datasets collected primarily by students and staff of the Agricultural Extension and Management department of Federal College of Forestry, Jos, Plateau State, Nigeria, during their academic research in various fields. It is also important to note that some of these datasets (definitely more than one source) were not geared towards innovation adoption; hence, additional but non-intrusive work had to be done on the data to make it usable and relevant.

## Overall Approach/ Strategy
The war will be fought proceeding thus(read that in your inner villain voice):
* **Data Examination**
* **Data Exploration**
* **Data Cleaning/wrangling**
* **Data Visualization**
* **Feature Engineering**
* **Modelling**
* **Evaluating the Model**
* **Extraction for Deployment**

*Nick Renotte, the above has you written all over, in my book anyways*

Check link for raw (mischievous wink) file. [Rough Outline](https://github.com/maranatha443/My-ML-Projects/blob/352436cb0f3fc56f6f654ff0af8364e8a6f4d8d4/WhatsApp%20Image%202022-06-24%20at%2012.12.23%20PM.jpeg?raw=true)

### Data Examination, Exploration, Cleaning, and Visualization.
Let's check our data mate.

In [2]:
# Import relevant libraries; yeah, I'm spoilt for choice, blame Nicholas @ world quaint uni.
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
import plotly_express as px
print("Set-up complete")

In [3]:
# Load the dataset.
df = pd.read_csv("../input/farmers-adoption-data/Farmers Adoption Data Plateau State - Sheet1 (1).csv")
df.head()

In [4]:
# Let's Explore a bit.
print(df.shape)
df.info()

It would seem as though our dataset, though small, is lean enough at 501 instances for 9 features (Dr Cassie, am I right?). Let's dig deeper.

In [5]:
# Let's check the categorical variables.
df.select_dtypes("object").describe()

Guess I concluded too soon, what with Household size being considered an object, Unacceptable!

In [6]:
# Cleaning Household size(Excel or SQL "clean", could really shine here.)
df[df["HH/SIZE"].str.contains("`")]

In [7]:
# Still cleaning...
df.loc[437, "HH/SIZE"] = 1

In [8]:
# Change Household size to integers.
df["HH/SIZE"] = df["HH/SIZE"].astype(int)

In [9]:
# Check it to confirm changes.
df.select_dtypes("object").describe()

That's better.
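
The manual fix above works fine for a single stray character. As a rough sketch of a more general check, assuming the column arrives as strings, `pd.to_numeric` with `errors="coerce"` flags every entry that refuses to parse. The toy series below is invented for illustration and is not the real data.

```python
import pandas as pd

# Hypothetical toy column standing in for df["HH/SIZE"]; the values are invented.
hh_size = pd.Series(["4", "7", "`1", "3"], name="HH/SIZE")

# Entries that cannot be parsed become NaN, which exposes the offending rows.
parsed = pd.to_numeric(hh_size, errors="coerce")
bad_rows = hh_size[parsed.isna()]
print(bad_rows)

# Strip the stray backtick, then convert the whole column to integers.
cleaned = hh_size.str.replace("`", "", regex=False).astype(int)
print(cleaned.tolist())
```

The same pattern would catch several bad entries at once, which may matter as more data comes in each quarter.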

In [10]:
# Describe numeric features.
df.select_dtypes("number").describe()

Tentatively, we can kinda, sorta, see that "Income" is not in the same class(pun intended), and as such is a candidate for scaling, it's also looking severely riddled with noise.

AGE seems to be close to a normal distribution(mean close to median), with majority of farmers between 20 and 60, while EDUC seems to indicate a skew, though that was to be expected, more on that to come.
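
The "mean close to median" reading can be made a little less hand-wavy. This is only a sketch on invented numbers: it puts mean, median, and pandas' sample skewness side by side, so a symmetric column and a right-skewed one can be told apart at a glance. The column names `symmetric` and `right_skewed` are placeholders, not features of the dataset.

```python
import pandas as pd

# Hypothetical toy columns; the real AGE and EDUC values are not reproduced here.
toy = pd.DataFrame({
    "symmetric": [20, 30, 40, 40, 50, 60],
    "right_skewed": [0, 0, 6, 6, 9, 18],
})

# Mean vs. median plus sample skewness for every numeric column.
summary = pd.DataFrame({
    "mean": toy.mean(),
    "median": toy.median(),
    "skew": toy.skew(),
})
print(summary)
```

A skew near zero with mean and median close together is consistent with a roughly symmetric shape; a clearly positive skew points to a long right tail.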


Let's try and understand more about the data points by visualization.

In [11]:
# For the last time, let's follow Jose Portilla, sorry Nicholas;
sns.pairplot(df);

So having done that jumbled beauty, let's use a considered approach.

In [12]:
# Create a function to visualize distribution and compare features on adoption basis.
def show_dist(data_col, data_frame=df):
    # Create a figure with 2 subplots (1 row, 2 columns)
    fig, ax = plt.subplots(1, 2, figsize = (15,6))

    # Plot the histogram   
    sns.histplot(data_col, ax=ax[0], kde=True)
    ax[0].set_ylabel("Frequency")
    
    # Plot the boxplot, matching ADOPTION STATUS rows to data_col by index
    # so that filtered columns (e.g. percentile slices) stay aligned
    status = data_frame.loc[data_col.index, "ADOPTION STATUS"]
    sns.boxplot(y=data_col, x=status, ax=ax[1])
    ax[1].set_ylabel(data_col.name)
    
#     # Plot the barplot
#     sns.barplot(y=data_col, x=data_frame["ADOPTION STATUS"])

    # Add a title to the Figure
    fig.suptitle("Distribution of {} by ADOPTION STATUS".format(data_col.name))

    # Show the figure
    plt.show()


In [13]:
df["AGE"].name

In [14]:
show_dist(df["AGE"])

AGE checks out as discussed earlier. Note, though, that adopters (is that even a word?) generally tended to be spread across a more varied age range. I wonder if that would remain if outliers were stripped. Well, we can check (for science, of course).

In [15]:
# Collect only data points between 10th and 90th percentile.
low, high = df["AGE"].quantile([0.1,0.9])
show_dist(df[df["AGE"].between(low, high)]["AGE"])

Not much in terms of changes besides lopping off valuable information (notice that younger respondents "somehow" went missing), so bin it.
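
To put a number on how much gets lopped off, here is a minimal sketch of the same percentile filter wrapped in a helper. `trim_by_quantile` is a hypothetical name, and the ages 1 to 100 are made up purely so the arithmetic is easy to follow.

```python
import pandas as pd

def trim_by_quantile(series, lower=0.1, upper=0.9):
    """Return the values inside the [lower, upper] quantiles and the share dropped."""
    low, high = series.quantile([lower, upper])
    kept = series[series.between(low, high)]
    dropped_share = 1 - len(kept) / len(series)
    return kept, dropped_share

# Hypothetical ages 1..100, not the real AGE column.
ages = pd.Series(range(1, 101), name="AGE")
kept, dropped_share = trim_by_quantile(ages)
print(kept.min(), kept.max(), round(dropped_share, 2))
```

On a dataset of roughly 500 rows, a 10th to 90th percentile cut costs about a fifth of the data, which is worth knowing before deciding whether the cleaner plot is worth it.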

With AGE being successfully analyzed, let's consider the others.

In [None]:
# Let's loop through.
for cols in df.select_dtypes("number").drop(columns="ADOPTION STATUS"):
    show_dist(df[cols])

Now, that's one experiment I'm never trying again. Sure, you can go ahead and run the code above (if you want a headache).

SO, let's take them individually.

In [17]:
df.info()

In [18]:
show_dist(df["EDUCATN STATUS"])

There are so many things wrong with this distribution. For a start, it's bordering on the ridiculous that the average education of farmers who have not adopted innovation is about the same as the 25th percentile of those who adopted said technologies. Also, it isn't really strange that some years seem under-represented. The country (Nigeria) operates a 6-3-3-4 education system in theory, but in practice it's more like 6-6-4, as most people who successfully complete their junior secondary school years eventually complete the senior years (no reference). More work for future versions then.
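
The mean-versus-25th-percentile comparison is read off the boxplot above. A rough sketch of how the same figures could be pulled out as a table, assuming ADOPTION STATUS is coded 0/1 (an assumption) and using invented education values:

```python
import pandas as pd

# Hypothetical toy frame; column names follow the notebook, values are invented.
toy = pd.DataFrame({
    "EDUCATN STATUS": [0, 6, 6, 9, 6, 12, 12, 16],
    "ADOPTION STATUS": [0, 0, 0, 0, 1, 1, 1, 1],
})

# Per-group count, mean, quartiles, and extremes in one table.
summary = toy.groupby("ADOPTION STATUS")["EDUCATN STATUS"].describe()
print(summary[["mean", "25%", "50%", "75%"]])
```

Swapping `toy` for `df` should give the actual group means and quartiles, so the claim can be checked against numbers rather than eyeballed.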

In [19]:
show_dist(df["HH/SIZE"])

We have a good distribution, if slightly skewed to the right. However, the outliers that cause the skew still ring true in the rural communities, where polygamy is common and large households, though on the wane, are still a thing. The average of 4 leaves a lot to be desired, though (God save us from unscrupulous data (no offence)).

In [20]:
show_dist(df["F/EXP"])

Another good distribution, which shows the slightly alarming state of the rural communities, viz. farming is on the decline (no references). It's a trend noticed in the country that youths tend to consider farming a lesser option compared to other economic activities. This trend, though interesting, is beyond the scope of this data and consequently of this project. As an intelligent (at least to me) woman once implied, the only world about which you can draw absolutely accurate inferences is the world your data represents.

In [21]:
show_dist(df["F/SIZE"])

Now, where do we even begin with this travesty....I'll just leave you be till I have sufficient data. However, what's immediately apparent here is that most farmers are small to medium scale.

In [22]:
def show_cat_plot(data_col, data_frame=df):
    # Create a single figure for the count plot
    plt.figure(figsize = (15,6))

    # Plot the counts per category, split by ADOPTION STATUS
    sns.countplot(x=data_col, hue="ADOPTION STATUS", data=data_frame, orient="v")
    
    # Add a title to the Figure
    plt.title("Data Distribution of {} by ADOPTION STATUS".format(data_col.name))

    # Show the figure
    plt.show()



In [23]:
show_cat_plot(df["EXT TRAINING"])

From the above, exposure to extension services or training does not seem to have any effect, positive or negative, on adoption. However, an interesting note is that most adopters haven't had contact with any form of training.

In [24]:
show_cat_plot(df["SEX"])

As seen above, SEX or GENDER does not seem to favour either category. This is not to say there are no underlying issues or that the feature is useless; it simply implies that there is a good representation in the data.
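
Count plots show the picture; a row-normalised crosstab gives the adoption rate per category as numbers. This is a sketch on invented rows, and the `M`/`F` labels and 0/1 coding are assumptions about how the columns are encoded.

```python
import pandas as pd

# Hypothetical toy frame; labels and values are invented for illustration.
toy = pd.DataFrame({
    "SEX": ["M", "M", "F", "F", "M", "F", "M", "F"],
    "ADOPTION STATUS": [1, 0, 1, 0, 1, 1, 0, 0],
})

# Each row sums to 1, so the cells read as the adoption share within that category.
rates = pd.crosstab(toy["SEX"], toy["ADOPTION STATUS"], normalize="index")
print(rates)
```

The same call with `df["EXT TRAINING"]` in place of the toy column would quantify the training observation made earlier.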

Next, we discuss income(left on purpose).

In [25]:
show_dist(df["INCOME"])

In [26]:
# Collect only data points up to the 70th percentile (drops the top 30%).
low, high = df["INCOME"].quantile([0,0.7])
show_dist(df[df["INCOME"].between(low, high)]["INCOME"])

The amateur statistician in me is crying as I can't afford to lop off a whopping 30% of "my precious" dataset. Let's train and see.

Finally to the target.

In [27]:
# check for class imbalance.
print(df["ADOPTION STATUS"].value_counts(normalize=True))
show_cat_plot(df["ADOPTION STATUS"])

The target attribute seems to be well balanced with adoption accounting for 51% of the farmers in the dataset, while 49% of said farmers did not adopt innovation.

### Feature Engineering
This is pretty meh, hopefully, will come back to it after, training.


In [28]:
# SYDD: split the data here; scaling is handled later inside the model pipelines.
from sklearn.model_selection import train_test_split

# Split to feature matrix and target vector.
target = "ADOPTION STATUS"
X = df.drop(columns=target)
y = df[target]

# Split to train, validation, and test data.
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.3, random_state=42)

# print shapes
print(f" X's shape is {X.shape} and y's shape is {y.shape}.")
print(f" X_train's shape is {X_train.shape} and y_train's shape is {y_train.shape}.")
print(f" X_val's shape is {X_val.shape} and y_val's shape is {y_val.shape}.")
print(f" X_test's shape is {X_test.shape} and y_test's shape is {y_test.shape}.")
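
Two successive 30% cuts do not leave a 40/30/30 split. A quick sketch on placeholder arrays with 501 rows (the row count reported earlier) shows the shares that the two calls produce; `X_demo` and `y_demo` are stand-ins, not the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with 501 rows; the contents are irrelevant, only the sizes matter.
X_demo = np.arange(501).reshape(-1, 1)
y_demo = np.arange(501) % 2

# Same two-step split as in the notebook: 30% test, then 30% of the rest for validation.
X_tmp, X_te, y_tmp, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X_tmp, y_tmp, test_size=0.3, random_state=42)

sizes = {"train": len(X_tr), "val": len(X_va), "test": len(X_te)}
shares = {name: round(n / len(X_demo), 2) for name, n in sizes.items()}
print(sizes)
print(shares)
```

Roughly half of the rows end up in training. If class proportions need to be preserved in every split, `train_test_split` also accepts a `stratify` argument; the notebook does not use it, so that is only an option to consider.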

### Modelling
We'll be using four models viz:
* Random Forest classifier
* Logistic Regression
* XG Boost classifier
* SVM classifier(because why not, right?)
 
The one with the best performance(Accuracy) will be chosen.

In [29]:
X.head()

In [30]:
y.head()

In [31]:
# Importing the estimators and preprocessors.
from sklearn.pipeline import Pipeline, make_pipeline
from category_encoders import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

In [52]:
# Instantiate pipelines for relevant models.
clf_rf = make_pipeline(
    OneHotEncoder(use_cat_names=True),
    StandardScaler(),
    RandomForestClassifier(random_state=42)
)


clf_lr = make_pipeline(
    OneHotEncoder(use_cat_names=True),
    StandardScaler(),
    LogisticRegression(random_state=42)
)

clf_xgbc = make_pipeline(
    OneHotEncoder(use_cat_names=True),
    StandardScaler(),
    XGBClassifier(random_state=42)
)

clf_svc = make_pipeline(
    OneHotEncoder(use_cat_names=True),
    StandardScaler(),
    SVC(random_state=42)
)
      
model_list = [clf_rf, clf_lr, clf_xgbc, clf_svc]
model_list

Tried so hard(probably not accurate enough), to do some thing cool codewise, but "Machine Learning is not for perfectionists", so room to improve.
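
For what it's worth, one way the repetition could be trimmed is a small factory function plus a dict comprehension. This is a sketch, not the notebook's method: it swaps in scikit-learn's own `OneHotEncoder` inside a `ColumnTransformer` instead of `category_encoders`, leaves XGBoost out to stay dependency-light, and `build_pipeline`, the column lists, and the toy data are all hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

def build_pipeline(estimator, categorical, numeric):
    """Wrap an estimator with one-hot encoding and scaling (hypothetical helper)."""
    preprocess = ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
        ("num", StandardScaler(), numeric),
    ])
    return make_pipeline(preprocess, estimator)

estimators = {
    "rf": RandomForestClassifier(random_state=42),
    "lr": LogisticRegression(random_state=42),
    "svc": SVC(random_state=42),
}

# Invented toy data with one categorical and one numeric column.
X_toy = pd.DataFrame({
    "SEX": ["M", "F", "M", "F", "M", "F"],
    "AGE": [25, 40, 31, 55, 47, 38],
})
y_toy = [0, 1, 0, 1, 1, 0]

# One pipeline per estimator, all sharing the same preprocessing recipe.
pipelines = {name: build_pipeline(est, ["SEX"], ["AGE"]) for name, est in estimators.items()}
for name, pipe in pipelines.items():
    pipe.fit(X_toy, y_toy)
    print(name, pipe.predict(X_toy))
```

Keying the pipelines by a short name also makes the later reports easier to read than printing a whole pipeline object.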

In [136]:
# Setting up a hyper parameter grid

hyper_grid = {
    clf_rf: {
        "randomforestclassifier__max_depth":range(1, 20 ,2),
        "randomforestclassifier__n_estimators":range(20, 200, 20)
        },
    clf_lr: {
        "logisticregression__max_iter":[500,1000,2000]
        },
    clf_xgbc:{
        "xgbclassifier__n_estimators":range(50, 100, 25),
        "xgbclassifier__learning_rate":[0.01, 0.03, 0.05]
    },
    clf_svc: {
        "svc__kernel":["linear", "poly", "rbf"]
        
    }
}


In [69]:
hyper_grid[clf_rf]

##### GRID SEARCH
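
Before pressing run, it can help to know how many fits the search implies. A minimal sketch using scikit-learn's `ParameterGrid` on copies of the Random Forest and XGBoost grids defined above; the variable names `rf_grid`, `xgb_grid`, and `cv_folds` are local to this example.

```python
from sklearn.model_selection import ParameterGrid

# Copies of two grids from the notebook, with the ranges expanded to lists.
rf_grid = {
    "randomforestclassifier__max_depth": list(range(1, 20, 2)),
    "randomforestclassifier__n_estimators": list(range(20, 200, 20)),
}
xgb_grid = {
    "xgbclassifier__n_estimators": list(range(50, 100, 25)),
    "xgbclassifier__learning_rate": [0.01, 0.03, 0.05],
}

cv_folds = 5
n_rf = len(ParameterGrid(rf_grid))
n_xgb = len(ParameterGrid(xgb_grid))

# Candidates per model, and total fits once cross-validation is factored in.
print(n_rf, n_rf * cv_folds)
print(n_xgb, n_xgb * cv_folds)
print(xgb_grid["xgbclassifier__n_estimators"])
```

Note that `range(50, 100, 25)` only yields 50 and 75, so the XGBoost search is much narrower than the Random Forest one.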

In [137]:
# Run CV and time it.
import time

best_models = []
for algo in model_list:
    model = GridSearchCV(algo, param_grid=hyper_grid[algo], cv=5, n_jobs=-1)
    print('Starting training for {}'.format(algo))
    start_t = time.time()
    model.fit(X_train, y_train)
    best_models.append(model)
    
    end_t = round(time.time() - start_t, 2)
    print(f"{algo} has been successfully trained in {round(end_t/60, 4)} mins.")

Training Over(messy code permitting), next is to evaluate and select the best model(s).
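
Each fitted `GridSearchCV` object already carries a cross-validated summary in `best_params_` and `best_score_`, which makes for a quick first comparison before the full reports. A sketch on synthetic data; the `C` grid here is an assumption chosen for illustration and is not the grid used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, not the farmers dataset.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=42)

search = GridSearchCV(
    make_pipeline(StandardScaler(), LogisticRegression(random_state=42)),
    param_grid={"logisticregression__C": [0.1, 1.0, 10.0]},  # illustrative grid
    cv=5,
)
search.fit(X_demo, y_demo)

# Winning parameter combination and its mean cross-validated accuracy.
print(search.best_params_)
print(round(search.best_score_, 3))
```

In this notebook the same two attributes could be read off every entry of `best_models` in a loop.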

### Model Evaluation

In [125]:
from sklearn.metrics import classification_report, accuracy_score

We need to sort out the baseline first. This is the accuracy we would get from pure guesswork, i.e. always predicting the most frequent class.

In [126]:
# Naive model
naive_model = y_train.value_counts(normalize=True).max()
print(f"Accuracy score for naive model is: {round(naive_model, 2)}")

As shown above, guessing "Adoption" for all cases will result in a model with 52% accuracy; anything lower than that is a waste.
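
The same baseline can be expressed as an actual estimator, which is handy when it should go through the same reporting code as the real models. A sketch with scikit-learn's `DummyClassifier` on invented labels split 52/48 to echo the figure above:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Invented labels: 52 adopters and 48 non-adopters. The features are ignored by the dummy.
y_demo = np.array([1] * 52 + [0] * 48)
X_demo = np.zeros((len(y_demo), 1))

# Always predicts the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)

baseline_acc = accuracy_score(y_demo, baseline.predict(X_demo))
print(baseline_acc)
```

Fitting it on `X_train`, `y_train` and scoring on the validation or test split would show what the guesswork floor looks like on unseen data.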

In [121]:
best_models[1].best_estimator_

#### TRAINING DATA

In [138]:
# Get the classification report on all the best models per model(meta, right?) TRAINING DATA.
for algo in best_models:
    print(f"{algo.best_estimator_.named_steps}:\n{classification_report(y_train, algo.predict(X_train))}")

#### VALIDATION DATA

In [139]:
# Get the classification report on all the best models per model(meta, right?) VALIDATION DATA.
for algo in best_models:
    print(f"{algo.best_estimator_.named_steps}:\n{classification_report(y_val, algo.predict(X_val))}")

Random Forest seems to be carrying the day. Finally, over to the test data (copy & paste). Before that though, let's tweak XGB Classifier, or let's not, for now (perfectionist conundrum aka indecision).
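
Reading two walls of classification reports side by side is hard on the eyes. As a rough sketch, the train-versus-validation accuracy gap can be boiled down to one number per model; `accuracy_gap` is a hypothetical helper and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def accuracy_gap(model, X_tr, y_tr, X_va, y_va):
    """Return train accuracy, validation accuracy, and their difference."""
    train_acc = accuracy_score(y_tr, model.predict(X_tr))
    val_acc = accuracy_score(y_va, model.predict(X_va))
    return train_acc, val_acc, train_acc - val_acc

# Synthetic stand-in data, not the farmers dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)

forest = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
train_acc, val_acc, gap = accuracy_gap(forest, X_tr, y_tr, X_va, y_va)
print(round(train_acc, 3), round(val_acc, 3), round(gap, 3))
```

A large positive gap is the usual hint that a model is memorising the training data rather than generalising.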

#### **TEST DATA**

In [140]:
# Get the classification report on all the best models per model(meta, right?) TEST DATA.
for algo in best_models:
    print(f"{algo.best_estimator_.named_steps}:\n{classification_report(y_test, algo.predict(X_test))}")

Unsurprisingly, the Random Forest classifier seems to have performed better than its counterparts (what with the extreme gradient booster chillin on overfittin).

### Extraction for Deployment

In [143]:
best_models[0].best_estimator_

In [144]:
import pickle
model_filename = "Farmer-Adoption-model.pkl"
# Use a context manager so the file handle is closed after writing.
with open(model_filename, "wb") as f:
    pickle.dump(best_models[0].best_estimator_, f)

In [147]:
# Check that the saved model loads and predicts on the test data.
with open("Farmer-Adoption-model.pkl", "rb") as f:
    model = pickle.load(f)
predictions = model.predict(X_test)
print(classification_report(y_test, predictions))
model

So, that's the prediction aspect sorted. Now over to the clustering which will be a continuation.