<a href="https://colab.research.google.com/github/njokinjuguna/Machine-learning-Models/blob/main/cropYieldPredictionInKenya.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**Predictive Analytics for Crop Yield Prediction in Kenya Using Machine Learning**

## **Table of Content**


*   Data Loading
*   Data Exploration
*   Data Cleaning
*  





In [15]:
# Import libraries
import pandas as pd

#*******************************************************
#Load the dataset
file_path = '/mnt/data/FAOSTAT_data_en_11-2-2024_all.csv'
data = pd.read_csv(file_path)

#*******************************************************
#Display the first few rows to get an overview
print("First few rows of the dataset:")
print(data.head())

#*******************************************************
#Display dataset information to understand it
print("\nDataset Information:")
data.info()

#*******************************************************
# Check for missing values in each column
print("\nMissing Values in Each Column:")
print(data.isnull().sum())

#*******************************************************
# Summary statistics for numerical columns to understand data distribution
print("\nSummary Statistics for Numerical Columns:")
print(data.describe())


First few rows of the dataset:
  Domain Code                        Domain  Area Code (M49)   Area  \
0         QCL  Crops and livestock products              404  Kenya   
1         QCL  Crops and livestock products              404  Kenya   
2         QCL  Crops and livestock products              404  Kenya   
3         QCL  Crops and livestock products              404  Kenya   
4         QCL  Crops and livestock products              404  Kenya   

   Element Code     Element Item Code (CPC)                     Item  \
0          5510  Production        01929.07  Abaca, manila hemp, raw   
1          5510  Production        01929.07  Abaca, manila hemp, raw   
2          5510  Production        01929.07  Abaca, manila hemp, raw   
3          5510  Production        01929.07  Abaca, manila hemp, raw   
4          5510  Production        01929.07  Abaca, manila hemp, raw   

   Year Code  Year Unit  Value Flag Flag Description Note  
0       1961  1961    t    0.0    A  Official fig

Summary of Initial Findings

Structure: The dataset contains 56,348 entries across 15 columns.

Missing Values: Most columns are fully populated, except for the "Note" column, which has many missing values. We can likely ignore this column if it’s not essential to our analysis.

Date Range: The "Year" column ranges from 1961 to 2022, but based on our project’s focus, we will likely restrict this to 2010–2022.

Data Types: The dataset appears to be well-structured, with appropriate data types for each column.

In [16]:
# Step 1: Filter data to focus on years from 2010 to 2022
filtered_data = data[(data['Year'] >= 2010) & (data['Year'] <= 2022)]

#*******************************************************
# Step 2: filter data , specific to Kenya
filtered_data = crop_yield_data[crop_yield_data['Area'] == 'Kenya']
print("Unique values in 'Area' column:", filtered_data)

#*******************************************************
# Step 3: Drop unnecessary columns
filtered_data = filtered_data.drop(columns=['Note'])


#*******************************************************
# Step 4: Check unique values in 'Element' and 'Item' to identify relevant indicators

# Exclude unwanted values in the 'Element' column
unwanted_elements = ['Laying', 'Milk Animals', 'Producing Animals/Slaughtered', 'Yield/Carcass Weight']
filtered_data = filtered_data[~filtered_data['Element'].isin(unwanted_elements)]

# Display unique values to verify filtering and understand remaining data
print("\nUnique values in 'Element' column:")
print(filtered_data['Element'].unique())
print("\nUnique values in 'Item' column:")
print(filtered_data['Item'].unique())



# Step 5: Handle missing values if necessary (you can add any specific treatments if needed)

# Display the cleaned dataset structure and first few rows
print("\nFiltered Data Information:")
filtered_data.info()
print("\nFirst few rows of the cleaned dataset:")
print(filtered_data.head())


Unique values in 'Area' column:     Domain Code                        Domain  Area Code (M49)   Area  \
0           QCL  Crops and livestock products              404  Kenya   
1           QCL  Crops and livestock products              404  Kenya   
2           QCL  Crops and livestock products              404  Kenya   
3           QCL  Crops and livestock products              404  Kenya   
4           QCL  Crops and livestock products              404  Kenya   
..          ...                           ...              ...    ...   
463         QCL  Crops and livestock products              404  Kenya   
464         QCL  Crops and livestock products              404  Kenya   
465         QCL  Crops and livestock products              404  Kenya   
466         QCL  Crops and livestock products              404  Kenya   
467         QCL  Crops and livestock products              404  Kenya   

     Element Code         Element  Item Code (CPC)      Item  Year Code  Year  \
0         

Here’s a breakdown of the output and what it reveals about your dataset:

Output Explanation
Data Structure:

You have 468 rows and 15 columns.
The filtered dataset only includes data for Kenya and excludes unwanted elements as specified (such as "Laying" and "Milk Animals").
Columns of Interest:

'Element' Column: Contains three unique values: 'Area harvested', 'Yield', and 'Production'. This is useful because:
"Area harvested" (in hectares) shows the land area used for each crop.
"Yield" (in kg/ha) gives the crop yield per hectare, which is crucial for measuring productivity.
"Production" (in tonnes) provides the total crop output, an important indicator of crop performance.
'Item' Column: Lists various crops, such as "Cabbages," "Maize (corn)," "Tomatoes," etc. These are the types of crops included in the dataset, which align with agricultural practices in Kenya.
Missing Data:

'Unit', 'Value', 'Flag', and 'Flag Description' columns have some missing values.
This will need addressing before further analysis, as missing values in these columns may impact your data quality, especially in the "Value" column, which holds quantitative data for your analysis.
What We Need to Do Next
Based on the output and objectives:

Handle Missing Data:

Focus on the 'Value' Column: Since this column holds the actual data for area, yield, and production, address any missing values here. You could consider removing rows with missing values in "Value" if they are few or using imputation if many are missing.
Verify Consistency in Units:

Ensure that the "Unit" column has consistent units for each element type. For example, "kg/ha" for "Yield" and "t" for "Production." This will make calculations and interpretations straightforward.
Aggregate Data by Year and Crop:

Since you're working on predictive modeling, ensure that data is grouped by Year and Item (crop type) to maintain a time series structure, which is suitable for analyzing trends over time.
Visualize Data:

Before proceeding with modeling, explore some basic visualizations to understand trends in crop yield, area harvested, and production over the years. This can reveal patterns and give insights into the relationships between these variables.