# Agriculture Project Overview

## Introduction
This project focuses on analyzing agricultural data to gain insights into various factors affecting farming households' engagement in agricultural activities. By leveraging machine learning algorithms, the goal is to develop models that can predict the number of households engaged primarily in farm activities based on various socio-economic and agricultural indicators.

## Dataset
The dataset used in this project contains village-wise survey data collected as part of the Mission Antyodaya initiative in 2020. It includes information on socio-economic factors, agricultural practices, and infrastructure availability across different villages.

## Methodology

### Data Preprocessing
- The dataset underwent thorough preprocessing steps, including handling missing values, encoding categorical variables, and standardizing numerical features.
- Categorical variables were transformed into dummy variables to facilitate model training.

In [1]:
# Imports for data manipulation and analysis
import re
import numpy as np
import pandas as pd

# Imports for visualization
import seaborn as sns
import matplotlib.pyplot as plt

# Imports for statistical analysis
import statsmodels.api as sm
import scipy.stats as stats

# Imports for machine learning
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, ElasticNet, Ridge, Lasso
from sklearn.metrics import mean_squared_error

# Imports for statistical tests and analysis
import statsmodels.stats.multicomp as mc
from statsmodels.formula.api import ols
from statsmodels.multivariate.manova import MANOVA

# Imports for progress visualization and model serialization
from tqdm import tqdm
from joblib import dump, load

# Function to get variable name
def get_var_name(var):
    for name, value in globals().items():
        if value is var:
            return name

In [2]:
# Read the CSV file into a pandas DataFrame
df_main = pd.read_csv(r'village_wise_survey_data_mission_antyodaya_2020.csv')

In [3]:
# Drop specified columns from the DataFrame
df_main.drop(columns=["SUB DISTRICT CODE", "BLOCK CODE", "GP CODE", "VILLAGE CODE", "VILLAGE PIN CODE", "STATE NAME", 
                      "DISTRICT NAME", "SUB DISTRICT NAME", "BLOCK NAME", "GP NAME", "VILLAGE NAME", "PC CODE", 
                      "AC CODE", "OTHER ASSEMBLY CONSTITUENCIES", "NUMBER OF HOUSEHOLDS ENGAGED MAJORLY IN NON-FARM ACTIVITIES"],
             inplace=True)
# Display the first few rows of the DataFrame
df_main.head()


Unnamed: 0,STATE CODE,DISTRICT CODE,NUMBER OF TOTAL POPULATION,NUMBER OF MALE,NUMBER OF FEMALE,NUMBER OF TOTAL HOUSEHOLD,NUMBER OF HOUSEHOLDS ENGAGED MAJORLY IN FARM ACTIVITIES,AVAILABILITY OF GOVERNMENT SEED CENTRES,WHETHER THIS VILLAGE IS A PART OF THE WATERSHED DEVELOPMENT PROJECT,AVAILABILITY OF COMMUNITY RAIN WATER HARVESTING SYSTEM/POND/DAM/CHECK DAM ETC.,...,NUMBER OF FARMERS RECEIVED THE SOIL TESTING REPORT,TOTAL NUMBER OF ELECTED REPRESENTATIVES,NUMBER OF ELECTED REPRESENTATIVES ORIENTED UNDER RASHTRIYA GRAM SWARAJ ABHIYAN,NUMBER OF ELECTED REPRESENTATIVES UNDERGONE REFRESHER TRAINING UNDER RASHTRIYA GRAM SWARAJ ABHIYAN,TOTAL APPROVED LABOUR BUDGET FOR THE YEAR 2018-19,TOTAL EXPENDITURE APPROVED UNDER NRM IN THE LABOUR BUDGET FOR THE YEAR 2018-19),"TOTAL AREA COVERED UNDER IRRIGATION (DRIP, SPRINKLER), IF IN ACRES DIVIDE BY 2.47",NUMBER OF HOUSEHOLDS HAVING PIPED WATER CONNECTION,VILLAGE LATITUDE,VILLAGE LONGITUDE
0,18,294,299,139,160,68,50,No ( Nearest facility1-2 kms),No,No,...,6,8,6,4,0.0,0.0,0.0,0,26.09009,89.97938
1,18,616,4562,2366,2196,830,0,No ( Nearest facility5-10 kms),No,No,...,0,0,0,0,0.0,0.0,0.0,0,26.594595,91.64199
2,18,284,151,80,71,26,26,No ( Nearest facilityMore than 10 kms),No,No,...,0,1,0,0,0.0,0.0,0.0,0,27.747747,95.11821
3,18,300,790,431,359,233,90,No ( Nearest facilityMore than 10 kms),No,No,...,0,10,10,10,0.0,0.0,0.0,0,26.954954,94.561104
4,18,612,3050,1459,1591,780,50,No ( Nearest facility5-10 kms),No,No,...,0,0,0,0,0.0,0.0,0.0,0,26.702703,90.4985


In [4]:
# Get the index of the specified column 
endcol = df_main.columns.get_loc("DOES THE VILLAGE HAVE LIVESTOCK EXTENSION SERVICES")
print(endcol)

22


In [5]:
# Select columns up to the specified index
df = df_main.iloc[:, :endcol]

# Display the first few rows of the DataFrame
df.head()

Unnamed: 0,STATE CODE,DISTRICT CODE,NUMBER OF TOTAL POPULATION,NUMBER OF MALE,NUMBER OF FEMALE,NUMBER OF TOTAL HOUSEHOLD,NUMBER OF HOUSEHOLDS ENGAGED MAJORLY IN FARM ACTIVITIES,AVAILABILITY OF GOVERNMENT SEED CENTRES,WHETHER THIS VILLAGE IS A PART OF THE WATERSHED DEVELOPMENT PROJECT,AVAILABILITY OF COMMUNITY RAIN WATER HARVESTING SYSTEM/POND/DAM/CHECK DAM ETC.,...,AVAILABILITY OF PRIMARY PROCESSING FACILITIES AT THE VILLAGE LEVEL,DOES THE VILLAGE HAVE ACCESS TO CUSTOM HIRING CENTRE (AGRI-EQUIPMENTS),"TOTAL CULTIVABLE AREA (IN HECTARES), IF IN ACRES DIVIDE BY 2.47","NET SOWN AREA (IN HECTARES) , IF IN ACRES DIVIDE BY 2.47",AVAILABILITY OF SOIL TESTING CENTRES,AVAILABILITY OF FERTILIZER SHOP,MAIN SOURCE OF IRRIGATION,NUMBER OF FARMERS USING DRIP/SPRINKLER IRRIGATION,"TOTAL AREA IRRIGATED (IN HECTARE), IF IN ACRES DIVIDE BY 2.47","TOTAL UNIRRIGATED LAND AREA (IN HECTARES), IF IN ACRES DIVIDE BY 2.47"
0,18,294,299,139,160,68,50,No ( Nearest facility1-2 kms),No,No,...,No,No,3.0,Total Net sown Area :2 Kharif :2 Rabi :1.5 Oth...,No ( Nearest facility2-5 kms),No ( Nearest facility< 1 km),Other,12,1.5,1.5
1,18,616,4562,2366,2196,830,0,No ( Nearest facility5-10 kms),No,No,...,No,No,530.7,Total Net sown Area :2 Kharif :2 Rabi :2 Other :0,No ( Nearest facilityMore than 10 kms),No ( Nearest facilityMore than 10 kms),Surface water,0,0.0,0.0
2,18,284,151,80,71,26,26,No ( Nearest facilityMore than 10 kms),No,No,...,No,No,7.0,Total Net sown Area :2 Kharif :2 Rabi :2 Other :0,No ( Nearest facilityMore than 10 kms),No ( Nearest facilityMore than 10 kms),Ground water (tube well/well/pump),0,0.0,7.0
3,18,300,790,431,359,233,90,No ( Nearest facilityMore than 10 kms),No,No,...,No,No,143.47,Total Net sown Area :0 Kharif :0 Rabi :0 Other :0,No ( Nearest facilityMore than 10 kms),No ( Nearest facilityMore than 10 kms),Ground water (tube well/well/pump),0,0.0,0.0
4,18,612,3050,1459,1591,780,50,No ( Nearest facility5-10 kms),No,No,...,No,No,1.0,Total Net sown Area :0 Kharif :0 Rabi :0 Other :0,No ( Nearest facilityMore than 10 kms),No ( Nearest facility5-10 kms),Other,0,0.0,0.0


In [6]:
# Map values in the column "DOES THE VILLAGE HAS ANY FARMERS COLLECTIVE" to simplified categories
df["DOES THE VILLAGE HAS ANY FARMERS COLLECTIVE"] = df["DOES THE VILLAGE HAS ANY FARMERS COLLECTIVE"].map({
    "Primary Agriculture Cooperative Society(PACS)": "PACS",
    "Farmers Produce Organization(FPOs)": "FPO",
    "Both": "Both"
})

# Map values in the column "MAIN SOURCE OF IRRIGATION" to simplified categories
df["MAIN SOURCE OF IRRIGATION"] = df["MAIN SOURCE OF IRRIGATION"].map({
    "Ground water (tube well/well/pump)": "Ground Water",
    "Other": "Other",
    "Canals": "Canals",
    "Surface Water": "Surface Water"
})

In [7]:
# Create new columns 'Area', 'Kharif', 'Rabi', and 'Others' initialized with NaN values
df["Area"] = np.nan
df["Kharif"] = np.nan
df["Rabi"] = np.nan
df["Others"] = np.nan

# Extract numbers from the string in the specified column
numbers = df["NET SOWN AREA (IN HECTARES) , IF IN ACRES DIVIDE BY 2.47"].str.findall(r'\d+\.*\d*').apply(lambda x: [float(i) for i in x])

# Assign extracted numbers to respective columns in the DataFrame
df["Area"] = numbers.str[0]
df["Kharif"] = numbers.str[1]
df["Rabi"] = numbers.str[2]
df["Others"] = numbers.str[3]

In [8]:
# Define a list of columns and create a dictionary for mapping values
cols = ['availability of warehouse for food grain storage ', 'availability of soil testing centres', 'availability of fertilizer shop']
dic = dict(zip(df['availability of fertilizer shop'.upper()].value_counts().index.to_list(), ['>10', '5-10', '2-5', 'Yes', '1-2', '<1']))

# Map values in specified columns to simplified categories using the dictionary
for col in cols:
    col = col.upper()  # Convert column name to uppercase
    df[col] = df[col].map(dic)

# Drop the column "NET SOWN AREA (IN HECTARES) , IF IN ACRES DIVIDE BY 2.47" from the DataFrame
df = df.drop(columns="NET SOWN AREA (IN HECTARES) , IF IN ACRES DIVIDE BY 2.47")

In [9]:
# Get the list of current column names and generate new column names
cols = df.columns.to_list()
new_cols = [re.sub(r'[^a-zA-Z0-9_]', '_', a.strip().lower()) for a in cols]

# Rename columns in the DataFrame using the new column names
df.rename(columns=dict(zip(cols, new_cols)), inplace=True)

# Rename specific columns in the DataFrame
df.rename(columns={'total_area_irrigated__in_hectare___if_in_acres_divide_by_2_47': 'total_area_irrigated__in_hectare',
                    'total_unirrigated_land_area__in_hectares___if_in_acres_divide_by_2_47': 'total_unirrigated_land_area__in_hectares'},
                    inplace=True)


In [10]:
# Check for columns with missing values
for col in df.columns:
    if df[col].isna().sum() != 0:
        print(col)

does_the_village_has_any_farmers_collective
main_source_of_irrigation


In [11]:
# Fill missing values with 'None'
df.fillna('None', inplace=True)

In [12]:
# Check for columns with missing values
for col in df.columns:
    if df[col].isna().sum() != 0:
        print(col)

In [13]:
# Save the DataFrame to a CSV file without including the index
df.to_csv("Agriculture.csv", index=False)

# Read the saved CSV file into a new DataFrame
df = pd.read_csv("Agriculture.csv")

# Display the first few rows of the DataFrame
df.head()

Unnamed: 0,state_code,district_code,number_of_total_population,number_of_male,number_of_female,number_of_total_household,number_of_households_engaged_majorly_in_farm_activities,availability_of_government_seed_centres,whether_this_village_is_a_part_of_the_watershed_development_project,availability_of_community_rain_water_harvesting_system_pond_dam_check_dam_etc_,...,availability_of_soil_testing_centres,availability_of_fertilizer_shop,main_source_of_irrigation,number_of_farmers_using_drip_sprinkler_irrigation,total_area_irrigated__in_hectare,total_unirrigated_land_area__in_hectares,area,kharif,rabi,others
0,18,294,299,139,160,68,50,No ( Nearest facility1-2 kms),No,No,...,2-5,<1,Other,12,1.5,1.5,2.0,2.0,1.5,2.0
1,18,616,4562,2366,2196,830,0,No ( Nearest facility5-10 kms),No,No,...,>10,>10,,0,0.0,0.0,2.0,2.0,2.0,0.0
2,18,284,151,80,71,26,26,No ( Nearest facilityMore than 10 kms),No,No,...,>10,>10,Ground Water,0,0.0,7.0,2.0,2.0,2.0,0.0
3,18,300,790,431,359,233,90,No ( Nearest facilityMore than 10 kms),No,No,...,>10,>10,Ground Water,0,0.0,0.0,0.0,0.0,0.0,0.0
4,18,612,3050,1459,1591,780,50,No ( Nearest facility5-10 kms),No,No,...,>10,5-10,Other,0,0.0,0.0,0.0,0.0,0.0,0.0


In [14]:
# Join column names into a single string separated by "+"
string = "+".join(df.drop('number_of_households_engaged_majorly_in_farm_activities', axis=1).columns.to_list())

# Perform Ordinary Least Squares (OLS) regression
eq = ols(f'number_of_households_engaged_majorly_in_farm_activities~{string}', data=df).fit()

# Perform ANOVA and extract p-values
p_value = sm.stats.anova_lm(eq, typ=1)

In [15]:
# Filter significant p-values (less than 0.05) and sort by ascending order
p_value_sig = p_value[p_value['PR(>F)'] < 0.05].sort_values(by='PR(>F)', ascending=True)

# Display the filtered p-values
display(p_value_sig)

# Get the list of significant columns
cols_sig = p_value_sig.index.tolist()

Unnamed: 0,df,sum_sq,mean_sq,F,PR(>F)
availability_of_government_seed_centres,5.0,127990000.0,25598000.0,399.163781,0.0
does_the_village_has_any_farmers_collective,2.0,162463800.0,81231900.0,1266.693989,0.0
availability_of_fertilizer_shop,5.0,297980100.0,59596020.0,929.31376,0.0
total_cultivable_area__in_hectares___if_in_acres_divide_by_2_47,1.0,136067300.0,136067300.0,2121.773071,0.0
number_of_total_population,1.0,2701803000.0,2701803000.0,42130.719171,0.0
number_of_male,1.0,827214600.0,827214600.0,12899.215552,0.0
number_of_total_household,1.0,129133400.0,129133400.0,2013.649367,0.0
availability_of_community_rain_water_harvesting_system_pond_dam_check_dam_etc_,1.0,62593580.0,62593580.0,976.056442,7.393179e-213
number_of_farmers_using_drip_sprinkler_irrigation,1.0,48073030.0,48073030.0,749.629338,3.295582e-164
availability_of_warehouse_for_food_grain_storage,5.0,47513430.0,9502685.0,148.180642,4.488379e-157


In [16]:
# Add the target column to the list of significant columns again
cols_sig.append("number_of_households_engaged_majorly_in_farm_activities")

# Create a new DataFrame containing only the significant columns
df_sig = df[cols_sig]

# Display the first few rows of the new DataFrame
df_sig.head()

Unnamed: 0,availability_of_government_seed_centres,does_the_village_has_any_farmers_collective,availability_of_fertilizer_shop,total_cultivable_area__in_hectares___if_in_acres_divide_by_2_47,number_of_total_population,number_of_male,number_of_total_household,availability_of_community_rain_water_harvesting_system_pond_dam_check_dam_etc_,number_of_farmers_using_drip_sprinkler_irrigation,availability_of_warehouse_for_food_grain_storage,...,total_unirrigated_land_area__in_hectares,district_code,kharif,whether_this_village_is_a_part_of_the_watershed_development_project,availability_of_primary_processing_facilities_at_the_village_level,others,does_the_village_have_access_to_custom_hiring_centre__agri_equipments_,state_code,rabi,number_of_households_engaged_majorly_in_farm_activities
0,No ( Nearest facility1-2 kms),,<1,3.0,299,139,68,No,12,2-5,...,1.5,294,2.0,No,No,2.0,No,18,1.5,50
1,No ( Nearest facility5-10 kms),,>10,530.7,4562,2366,830,No,0,5-10,...,0.0,616,2.0,No,No,0.0,No,18,2.0,0
2,No ( Nearest facilityMore than 10 kms),,>10,7.0,151,80,26,No,0,2-5,...,7.0,284,2.0,No,No,0.0,No,18,2.0,26
3,No ( Nearest facilityMore than 10 kms),,>10,143.47,790,431,233,No,0,>10,...,0.0,300,0.0,No,No,0.0,No,18,0.0,90
4,No ( Nearest facility5-10 kms),,5-10,1.0,3050,1459,780,No,0,>10,...,0.0,612,0.0,No,No,0.0,No,18,0.0,50


In [17]:
# Generate descriptive statistics for the DataFrame and transpose the result
df_sig.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
total_cultivable_area__in_hectares___if_in_acres_divide_by_2_47,345810.0,343.461729,3759.987442,1e-05,45.0,150.0,379.0,2180218.0
number_of_total_population,345810.0,1638.625569,2853.054313,1.0,416.0,925.0,1872.0,99999.0
number_of_male,345810.0,814.324456,1247.722677,0.0,210.0,471.0,953.0,93463.0
number_of_total_household,345810.0,353.965837,650.832729,1.0,87.0,198.0,405.0,94678.0
number_of_farmers_using_drip_sprinkler_irrigation,345810.0,28.078459,121.136543,0.0,0.0,0.0,12.0,11254.0
total_area_irrigated__in_hectare,345810.0,119.790157,278.927449,0.0,1.1,26.66,122.39,9908.0
number_of_female,345810.0,752.083919,1119.953143,0.0,200.0,440.0,896.0,60000.0
area,345810.0,203.003951,394.824544,0.0,17.0,80.0,226.55,9998.0
total_unirrigated_land_area__in_hectares,345810.0,103.634679,282.235416,0.0,1.0,20.0,98.0,9999.0
district_code,345810.0,345.258301,182.537528,6.0,203.0,346.0,477.0,734.0


In [18]:
# Display information about the DataFrame including the data type of each column and memory usage
df_sig.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 345810 entries, 0 to 345809
Data columns (total 25 columns):
 #   Column                                                                          Non-Null Count   Dtype  
---  ------                                                                          --------------   -----  
 0   availability_of_government_seed_centres                                         345810 non-null  object 
 1   does_the_village_has_any_farmers_collective                                     81848 non-null   object 
 2   availability_of_fertilizer_shop                                                 345810 non-null  object 
 3   total_cultivable_area__in_hectares___if_in_acres_divide_by_2_47                 345810 non-null  float64
 4   number_of_total_population                                                      345810 non-null  int64  
 5   number_of_male                                                                  345810 non-null  int64  
 6   numb

In [19]:
# Check if each column in the DataFrame is of object type, and if so, print value counts
for col in df_sig.columns:
    if pd.api.types.is_object_dtype(df_sig[col]):
        print(df_sig[col].value_counts())
        print()

availability_of_government_seed_centres
No ( Nearest facilityMore than 10 kms)    117343
No ( Nearest facility5-10 kms)            111283
No ( Nearest facility2-5 kms)              64758
Yes                                        31400
No ( Nearest facility1-2 kms)              15628
No ( Nearest facility< 1 km)                5398
Name: count, dtype: int64

does_the_village_has_any_farmers_collective
PACS    37087
Both    28048
FPO     16713
Name: count, dtype: int64

availability_of_fertilizer_shop
>10     102788
5-10    100276
2-5      68139
Yes      52213
1-2      17051
<1        5343
Name: count, dtype: int64

availability_of_community_rain_water_harvesting_system_pond_dam_check_dam_etc_
No     216997
Yes    128813
Name: count, dtype: int64

availability_of_warehouse_for_food_grain_storage
>10     130460
5-10    107546
2-5      57700
Yes      30955
1-2      14088
<1        5061
Name: count, dtype: int64

availability_of_soil_testing_centres
>10     169610
5-10    104201
2-5      4

In [20]:
# Create dummy variables for all categorical columns in the DataFrame
df1 = pd.get_dummies(df_sig)

# Display information about the DataFrame with dummy variables
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 345810 entries, 0 to 345809
Data columns (total 53 columns):
 #   Column                                                                              Non-Null Count   Dtype  
---  ------                                                                              --------------   -----  
 0   total_cultivable_area__in_hectares___if_in_acres_divide_by_2_47                     345810 non-null  float64
 1   number_of_total_population                                                          345810 non-null  int64  
 2   number_of_male                                                                      345810 non-null  int64  
 3   number_of_total_household                                                           345810 non-null  int64  
 4   number_of_farmers_using_drip_sprinkler_irrigation                                   345810 non-null  int64  
 5   total_area_irrigated__in_hectare                                                    34

### Model Development
- Several regression algorithms were explored, including Linear Regression, Ridge Regression, Lasso Regression, ElasticNet, and Random Forest Regression.
- Each algorithm was trained on the preprocessed dataset and evaluated based on accuracy scores and mean squared errors (MSEs).

In [21]:
# Define the feature matrix X by dropping the target column
X = df1.drop('number_of_households_engaged_majorly_in_farm_activities', axis=1)

# Define the target variable y
y = df1['number_of_households_engaged_majorly_in_farm_activities']

In [22]:
# Split the dataset into training and testing sets with 80% for training and 20% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [23]:
# Initialize linear regression model
linear_regression_model = LinearRegression(
    fit_intercept=True,
    n_jobs=-1,
)

# Initialize Ridge regression model
ridge_regression_model = Ridge(
    max_iter=1000,
    random_state=1920
)

# Initialize Lasso regression model
lasso_regression_model = Lasso(
    max_iter=1000,
    random_state=1920
)

# Initialize ElasticNet regression model
elastic_net_model = ElasticNet(
    random_state=1920
)

# Initialize Random Forest regression model
random_forest_model = RandomForestRegressor(
    criterion='friedman_mse',
    verbose=True,
    random_state=1920
)

# Create a list of regression models
algos = [linear_regression_model, ridge_regression_model, lasso_regression_model, elastic_net_model, random_forest_model]

### Model Evaluation
- The performance of each model was assessed using accuracy scores and MSEs on a held-out test set.
- Visualizations, such as bar plots, were used to compare the performance of different algorithms.

In [24]:
# Initialize empty lists to store scores, mean squared errors, and algorithm names
scores = []
mses = []
algo_names = []

# Iterate over regression algorithms
for algo in tqdm(algos):
    # Fit the algorithm on the training data
    algo.fit(X_train, y_train)
    
    # Get the name of the algorithm
    algo_name = get_var_name(algo)
    
    # Save the trained model
    dump(value=algo, filename=f'{algo_name}.joblib')
    
    # Make predictions on the testing data
    y_pred = algo.predict(X_test)
    
    # Calculate accuracy score
    acc_score = algo.score(X_test, y_test)
    
    # Calculate mean squared error
    mse = mean_squared_error(y_test, y_pred)
    
    # Append algorithm name, score, and MSE to respective lists
    algo_names.append(algo_name)
    scores.append(acc_score)
    mses.append(mse)

# Create a DataFrame to store algorithm names, scores, and MSEs
algo_scores = pd.DataFrame({
    'Algorithm': algo_names,
    'Score': scores,
    'MSE': mses
})

 80%|████████  | 4/5 [00:10<00:03,  3.27s/it]

In [None]:
# Initialize empty lists to store scores, mean squared errors, and algorithm names
scores = []
mses = []
algo_names = []

# Iterate over regression algorithms
for algo in tqdm(algos):
    # Get the name of the algorithm
    algo_name = get_var_name(algo)
    
    # Load the pre-trained model
    algo = load(filename=f'{algo_name}.joblib')
    
    # Make predictions on the testing data
    y_pred = algo.predict(X_test)
    
    # Calculate accuracy score
    acc_score = algo.score(X_test, y_test)
    
    # Calculate mean squared error
    mse = mean_squared_error(y_test, y_pred)
    
    # Append algorithm name, score, and MSE to respective lists
    algo_names.append(algo_name)
    scores.append(acc_score)
    mses.append(mse)

# Create a DataFrame to store algorithm names, scores, and MSEs
algo_scores = pd.DataFrame({
    'Algorithm': algo_names,
    'Score': scores,
    'MSE': mses
})

In [None]:
# Create subplots with two rows and figsize of 15x10
fig, axs = plt.subplots(nrows=2, figsize=(15, 10))

# Plot barplot for scores
sns.barplot(
    x='Algorithm',
    y='Score',
    data=algo_scores,
    ax=axs[0],
    palette='viridis',
    legend=False,
    hue='Algorithm'
)

# Add labels to bars in the first subplot
for i in axs[0].containers:
    axs[0].bar_label(i)

# Plot barplot for MSEs
sns.barplot(
    x='Algorithm',
    y='MSE',
    data=algo_scores,
    ax=axs[1],
    palette='mako',
    legend=False,
    hue='Algorithm'
)

# Add labels to bars in the second subplot
for i in axs[1].containers:
    axs[1].bar_label(i)

# Show the plot
plt.show()

## Results
- The Random Forest Regression model exhibited the highest accuracy score and the lowest MSE among the algorithms evaluated.
- Insights gained from the analysis can inform policymakers and stakeholders about factors influencing farm household engagement and guide targeted interventions to support agricultural development.

## Conclusion
This project demonstrates the application of machine learning techniques to analyze agricultural data and predict farm household engagement based on various indicators. The developed models provide valuable insights into agricultural dynamics and can support evidence-based decision-making in agricultural policy and planning.