# **🌾 Introduction**

This project, part of the Innovative AI Challenge 2024, focuses on Challenge 1: Developing AI Models to Increase Agricultural Productivity. The goal is to predict Crop Yield (kg/ha) using features like 🌧️ Rainfall, 🌱 Soil Type, and 🚜 Irrigation Area, and to build a robust regression model for accurate predictions.

In Part 1, we developed a model achieving:

- **MSE:** 43,991.71
- **R² Score:** 98.60%

Next, in Part 2, we will explore detailed visualizations for deeper insights 📊, and in Part 3, build a complete AI solution incorporating crop type, weather, and soil properties. The final deliverable includes a farmer-friendly interface, analysis, and a video demonstration.

**Stay tuned for the next steps as we strive to enhance agricultural productivity with AI! 🌱✨**



# **🧠 Data Understanding:**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


In [None]:
train_data=pd.read_csv("/kaggle/input/innovative-ai-challenge-2024/train.csv")
test_data=pd.read_csv("/kaggle/input/innovative-ai-challenge-2024/test.csv")

In [None]:
train_data

In [None]:
print('train_data: ',train_data.shape)
print('test_data: ',test_data.shape)

In [None]:
print('train_data: ',train_data.columns)
print('test_data: ',test_data.columns)

In [None]:
train_data.describe()

**🔍 Explore Columns and Data Types**

- Column description:

🆔 id: A unique identifier for each data point.

📅 Year: The year when the data was collected (e.g., 2020, 2002).

🗺️ State: The state or region where the data was recorded (e.g., Punjab, Maharashtra).

🌾 Crop_Type: The type of crop grown (e.g., Rice, Wheat, Bajra).

🌧️ Rainfall: The average annual rainfall in the region (measured in mm).

🌱 Soil_Type: The type of soil in the region (e.g., Loamy, Sandy, Clay).

🚜 Irrigation_Area: The area of irrigated land in thousand hectares.

📈 Crop_Yield (kg/ha): The target variable representing the crop yield in kilograms per hectare (numeric value).

In [None]:
train_data.info()

In [None]:
train_data.drop(columns=['id'],inplace=True)

In [None]:
train_data

# **🛠️ Data Preprocessing**

**🚨 Missing Values and Duplicate Values**

In [None]:
train_data.isnull().sum()

**No column has missing values, so the data needs no imputation before exploratory data analysis (EDA) and model building.**

In [None]:
train_data.duplicated().sum()
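
The two checks above can be bundled into one small report. The sketch below is only an illustration: `quality_report` is a hypothetical helper, and `toy` is a made-up frame (not the competition data) that deliberately contains one missing value and one repeated row so both counts are non-zero.

```python
import numpy as np
import pandas as pd


def quality_report(df: pd.DataFrame) -> dict:
    """Summarize missing cells per column and fully duplicated rows."""
    return {
        "rows": len(df),
        "missing_per_column": df.isnull().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
    }


# Made-up toy data (NOT the competition data): one gap, one repeated row.
toy = pd.DataFrame({
    "State": ["Punjab", "Punjab", "Maharashtra", "Punjab"],
    "Rainfall": [650.0, np.nan, 900.0, 650.0],
    "Crop_Yield (kg/ha)": [4200, 3900, 2500, 4200],
})

report = quality_report(toy)
print(report)
```

On the real data, `quality_report(train_data)` would be expected to show zeros for the missing counts, matching the cell above.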

**🔢 Numerical and Categorical Columns**

In [None]:
target_col = 'Crop_Yield (kg/ha)'
num_col = train_data.select_dtypes(include=['number']).columns
cat_col = train_data.select_dtypes(include=['object']).columns
print("Target Column: ", target_col)
print("\nNumerical Columns: ", num_col.tolist())
print("\nCategorical Columns: ", cat_col.tolist())

In [None]:
num_data=train_data.select_dtypes(include=['number'])
cat_data=train_data.select_dtypes(include=['object'])

In [None]:
num_data.kurtosis()

In [None]:
print('Numerical Data Distribution!')
num_data.describe().round(2).T

In [None]:
print("Categorical Data Description!")
cat_data.describe().T

In [None]:
for c in cat_data:
    col_count=train_data[c].nunique()
    print(f'{c} has {col_count} unique values.')
    print("**"*20)

In [None]:
for i in cat_col:
    cat_value=train_data[i].value_counts()
    print(f'value count for {i} is :')
    print(cat_value)
    print("-"*20)
    

# **Exploratory Data Analysis (EDA) 🔍📈**

**Univariate Analysis 📊✨**

**1. Distribution of Numerical Columns**

In [None]:
numerical_col = ['Rainfall', 'Irrigation_Area', 'Crop_Yield (kg/ha)']
plt.figure(figsize=(12, 8))  # Set the figure size for all subplots

# One histogram per column on a 2x2 grid (the fourth slot stays empty)
for plot_num, col in enumerate(numerical_col, start=1):
    plt.subplot(2, 2, plot_num)
    sns.histplot(data=train_data, x=col, kde=True, bins=30, color='green')
    plt.title(f"Distribution of {col}")
    plt.xlabel(col)
    plt.ylabel('Frequency')

plt.tight_layout()  # Adjust layout to avoid overlap
plt.show()


- The distributions of rainfall, irrigation area, and crop yield show varying degrees of skewness, with rainfall being the most skewed, followed by irrigation area, and crop yield being relatively less skewed.
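
The skewness ranking read off the histograms can also be checked numerically with pandas' `skew()`. The sketch below uses randomly generated stand-in columns (an assumption for illustration, not the competition data): one right-skewed by construction and one roughly symmetric. A value near 0 indicates a symmetric distribution, and larger positive values indicate a longer right tail.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-ins for the three numeric columns (NOT the real data).
toy = pd.DataFrame({
    "Rainfall": rng.lognormal(mean=6.5, sigma=0.6, size=500),          # right-skewed by construction
    "Irrigation_Area": rng.gamma(shape=4.0, scale=50.0, size=500),     # moderately right-skewed
    "Crop_Yield (kg/ha)": rng.normal(loc=3000, scale=400, size=500),   # roughly symmetric
})

# Sample skewness per column, largest first.
skewness = toy.skew().sort_values(ascending=False)
print(skewness.round(2))
```

Running `train_data[numerical_col].skew()` in the same way would put numbers behind the visual ranking.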

**2. What are the counts of each state, crop type, and soil type in the dataset?**

In [None]:
categorical_columns = ['State', 'Crop_Type', 'Soil_Type']
for col in categorical_columns:
    print(f"Value counts for {col}:\n{train_data[col].value_counts()}\n")

plot_num=1
plt.figure(figsize=(10, 5))
for col in categorical_columns:
    if plot_num <=3:
        plt.subplot(2,2,plot_num)
        sns.countplot(data=train_data, y=col, palette='viridis', order=train_data[col].value_counts().index)
        plt.title(f'Counts of {col}')
        plt.xlabel('Count')
        plt.ylabel(col)
        plot_num +=1
plt.tight_layout()
plt.show()


**3. What is the relationship between irrigation area and crop yield?**

In [None]:
plt.figure(figsize=(6, 4))
sns.scatterplot(data=train_data, x='Irrigation_Area', y='Crop_Yield (kg/ha)', color='purple')
sns.regplot(data=train_data, x='Irrigation_Area', y='Crop_Yield (kg/ha)', scatter=False, color='orange')
plt.title('Irrigation Area vs Crop Yield')
plt.xlabel('Irrigation Area (thousand ha)')  # Units as given in the column description
plt.ylabel('Crop Yield (kg/ha)')
plt.show()


- The scatter plot and regression line show a positive correlation between irrigation area and crop yield, suggesting that increasing irrigation area tends to be associated with higher crop yields.

**Bivariate Analysis 🔗📉**

**4. How does rainfall affect crop yield?**

In [None]:
plt.figure(figsize=(6, 4))
sns.scatterplot(data=train_data, x='Rainfall', y='Crop_Yield (kg/ha)', color='green')
sns.regplot(data=train_data, x='Rainfall', y='Crop_Yield (kg/ha)', scatter=False, color='red')
plt.title('Rainfall vs Crop Yield')
plt.xlabel('Rainfall (mm)')
plt.ylabel('Crop Yield (kg/ha)')
plt.show()


- The relationship between rainfall and crop yield appears weak and negative, suggesting that higher rainfall may be associated with slightly lower crop yields.
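
A Pearson correlation matrix gives a number for how weak or strong such a linear relationship is. The sketch below is a minimal example on a made-up six-row frame (the values are assumptions, not taken from the dataset); coefficients near 0 indicate a weak linear relationship, and the sign gives the direction.

```python
import pandas as pd

# Made-up values for illustration only (NOT the competition data).
toy = pd.DataFrame({
    "Rainfall": [500, 650, 800, 950, 1100, 1250],
    "Irrigation_Area": [120, 150, 160, 200, 230, 260],
    "Crop_Yield (kg/ha)": [3100, 3300, 3000, 3400, 3200, 3150],
})

# Pairwise Pearson correlation between all numeric columns.
corr_matrix = toy.corr(method="pearson")
rain_vs_yield = corr_matrix.loc["Rainfall", "Crop_Yield (kg/ha)"]

print(corr_matrix.round(2))
print(f"Rainfall vs yield: {rain_vs_yield:.2f}")
```

The same call on `num_data` would quantify the rainfall and irrigation relationships shown in the scatter plots.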

**6. Does crop yield vary by state, crop type, or soil type?**

In [None]:
categorical_columns = ['State', 'Crop_Type', 'Soil_Type']
plt.figure(figsize=(16, 8))
plot_num=1
for col in categorical_columns:
    if plot_num<=3:
        plt.subplot(2,2,plot_num)
        sns.boxplot(data=train_data, x=col, y='Crop_Yield (kg/ha)', palette='Set3')
        plt.title(f'Crop Yield by {col}')
        plt.xlabel(col)
        plt.ylabel('Crop Yield (kg/ha)')
        plt.xticks(rotation=45)
        plot_num +=1
plt.tight_layout()  # Keep rotated tick labels from overlapping neighbouring subplots
plt.show()


- Crop yield appears to vary noticeably with crop type and soil type, but little with state (a visual reading of the box plots, not a formal significance test).
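
One way to put a number on "varies with crop type but not with state" is eta-squared: the share of the yield variance explained by the group means. The sketch below is an illustration only; `eta_squared` is a hypothetical helper and `toy` is made-up data in which crop type drives yield and state barely matters.

```python
import pandas as pd


def eta_squared(df: pd.DataFrame, group_col: str, target_col: str) -> float:
    """Share of target variance explained by group means (0 = none, 1 = all)."""
    overall_mean = df[target_col].mean()
    total_ss = ((df[target_col] - overall_mean) ** 2).sum()
    # Replace each row's target by its group mean, then measure spread around the overall mean.
    group_means = df.groupby(group_col)[target_col].transform("mean")
    between_ss = ((group_means - overall_mean) ** 2).sum()
    return float(between_ss / total_ss)


# Made-up toy data (NOT the competition data).
toy = pd.DataFrame({
    "State": ["Punjab", "Maharashtra"] * 3,
    "Crop_Type": ["Wheat", "Wheat", "Rice", "Rice", "Bajra", "Bajra"],
    "Crop_Yield (kg/ha)": [4500, 4400, 3800, 3700, 1500, 1600],
})

effects = {col: eta_squared(toy, col, "Crop_Yield (kg/ha)") for col in ["State", "Crop_Type"]}
print(effects)
```

Applied to `train_data` with `categorical_columns`, values close to 0 for `State` and clearly larger values for `Crop_Type` and `Soil_Type` would support the reading of the box plots.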

**7. How has crop yield changed over the years?**

In [None]:
yearly_yield = train_data.groupby('Year')['Crop_Yield (kg/ha)'].mean().reset_index()
plt.figure(figsize=(10, 6))
sns.lineplot(data=yearly_yield, x='Year', y='Crop_Yield (kg/ha)', marker='o', color='blue')
plt.title('Crop Yield Over the Years')
plt.xlabel('Year')
plt.ylabel('Average Crop Yield (kg/ha)')
plt.show()


**Multivariate Analysis 📊🔬✨**

**8. How do rainfall and irrigation area together influence crop yield?**

In [None]:
import plotly.express as px

fig = px.scatter_3d(train_data, x='Rainfall', y='Irrigation_Area', z='Crop_Yield (kg/ha)', color='Crop_Type')
fig.update_layout(title='Rainfall, Irrigation Area, and Crop Yield')
fig.show()


- Crop yield appears to increase with both rainfall and irrigation area, but the extent of the influence varies across crops, with wheat showing a stronger response than rice and bajra.
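
The joint influence of the two features can also be estimated with an ordinary least-squares fit of yield on rainfall and irrigation area together. The sketch below is not the model from Part 1; it uses synthetic data generated from known, made-up coefficients (0.5 and 8.0) so that the recovered values can be compared against them.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic features (NOT the competition data).
rainfall = rng.uniform(400, 1200, size=200)
irrigation = rng.uniform(50, 300, size=200)

# Synthetic yield built from known coefficients plus noise, so the fit can be checked.
yield_kg = 1000 + 0.5 * rainfall + 8.0 * irrigation + rng.normal(0, 50, size=200)

# Design matrix with an intercept column, then solve the least-squares problem.
X = np.column_stack([np.ones_like(rainfall), rainfall, irrigation])
coef, *_ = np.linalg.lstsq(X, yield_kg, rcond=None)
intercept, b_rain, b_irr = coef

print(f"intercept={intercept:.1f}, rainfall={b_rain:.3f}, irrigation={b_irr:.3f}")
```

Each coefficient estimates the change in yield per unit of one feature while the other is held fixed, which the 3D scatter plot can only suggest visually.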

**9. How do different crop types perform in various states?**

In [None]:
grouped_data = train_data.groupby(['State', 'Crop_Type'])['Crop_Yield (kg/ha)'].mean().reset_index()
plt.figure(figsize=(10, 6))
sns.barplot(data=grouped_data, x='State', y='Crop_Yield (kg/ha)', hue='Crop_Type', palette='tab10')
plt.title('Crop Yield by State and Crop Type')
plt.xlabel('State')
plt.ylabel('Average Crop Yield (kg/ha)')
plt.xticks(rotation=45)
plt.show()


- Within Punjab, wheat has the highest average yield, followed by rice and then bajra.

**10. Does soil type impact the effectiveness of irrigation or rainfall on crop yield?**

In [None]:
# pairplot creates its own figure, so size it with `height` instead of calling plt.figure()
sns.pairplot(train_data, vars=['Rainfall', 'Irrigation_Area', 'Crop_Yield (kg/ha)'], hue='Soil_Type', palette='Dark2', height=2.5)
plt.show()


- Loamy soil shows higher variability in the impact of rainfall on crop yield, while alluvial soil exhibits a more concentrated relationship between irrigation area and yield, suggesting that soil type changes how effective irrigation and rainfall are.
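
The difference between soil types can be checked by computing the feature-yield correlation separately within each soil group. The sketch below uses made-up values (not the competition data) in which the alluvial rows follow a perfectly linear irrigation-yield relationship and the loamy rows are noisier.

```python
import pandas as pd

# Made-up toy data (NOT the competition data).
toy = pd.DataFrame({
    "Soil_Type": ["Loamy"] * 4 + ["Alluvial"] * 4,
    "Irrigation_Area": [100, 150, 200, 250, 100, 150, 200, 250],
    "Crop_Yield (kg/ha)": [3000, 3400, 2900, 3600, 2500, 2800, 3100, 3400],
})

# Pearson correlation between irrigation area and yield within each soil type.
by_soil = {
    soil: group["Irrigation_Area"].corr(group["Crop_Yield (kg/ha)"])
    for soil, group in toy.groupby("Soil_Type")
}

for soil, r in by_soil.items():
    print(f"{soil}: r = {r:.2f}")
```

A noticeably higher coefficient for one soil type than another would be consistent with a tighter, more concentrated relationship in the pair plot.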

**Correlation and Trends 🔗📈🔍**

**11. Is there a trend in crop yield over time across different states or crop types?**

In [None]:
plt.figure(figsize=(12, 8))
sns.lineplot(data=train_data, x='Year', y='Crop_Yield (kg/ha)', hue='State', marker='o', palette='viridis')
plt.title('Crop Yield Trends by State')
plt.xlabel('Year')
plt.ylabel('Crop Yield (kg/ha)')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()


- Crop yield showed a significant increase during the years 2010 to 2015, indicating a peak in agricultural productivity during this period. 🌾📈
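
A straight-line fit through the yearly mean yields gives a rough estimate of the overall trend. The sketch below uses made-up yields (not the competition data) that rise by exactly 100 kg/ha per year, so the fitted slope is easy to verify; years are shifted to start at 0 to keep the fit numerically well behaved.

```python
import numpy as np
import pandas as pd

# Made-up toy data (NOT the competition data).
toy = pd.DataFrame({
    "Year": [2008, 2008, 2010, 2010, 2012, 2012, 2014, 2014],
    "Crop_Yield (kg/ha)": [2900, 3100, 3100, 3300, 3300, 3500, 3500, 3700],
})

# Mean yield per year, then a degree-1 polynomial (straight line) fit.
yearly = toy.groupby("Year")["Crop_Yield (kg/ha)"].mean()
years_from_start = (yearly.index - yearly.index.min()).to_numpy(dtype=float)
slope, intercept = np.polyfit(years_from_start, yearly.to_numpy(dtype=float), deg=1)

print(f"Trend: {slope:.1f} kg/ha per year")
```

A single slope hides local peaks such as the one described above, so it complements the line plot rather than replacing it.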

**12. Which crops perform best under specific soil types and irrigation conditions?**

In [None]:
pivot_table = train_data.pivot_table(index='Soil_Type', columns='Crop_Type', values='Crop_Yield (kg/ha)', aggfunc='mean')
plt.figure(figsize=(10, 6))
sns.heatmap(pivot_table, annot=True, cmap='YlGnBu', fmt='.2f')
plt.title('Crop Yield by Soil Type and Crop Type')
plt.xlabel('Crop Type')
plt.ylabel('Soil Type')
plt.show()


- Wheat performs best in loamy soil (yield: 4564.00), while Rice performs best in alluvial soil (yield: 3854.94).
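
The best combinations can be read from the pivot table programmatically with `idxmax` instead of scanning the heatmap. The sketch below uses made-up yields (not the values from the heatmap) to show both directions: the best crop for each soil and the best soil for each crop.

```python
import pandas as pd

# Made-up toy data (NOT the competition data).
toy = pd.DataFrame({
    "Soil_Type": ["Loamy", "Loamy", "Loamy", "Alluvial", "Alluvial", "Alluvial"],
    "Crop_Type": ["Wheat", "Rice", "Bajra", "Wheat", "Rice", "Bajra"],
    "Crop_Yield (kg/ha)": [4500, 3500, 1500, 4000, 3800, 1700],
})

pivot = toy.pivot_table(
    index="Soil_Type", columns="Crop_Type",
    values="Crop_Yield (kg/ha)", aggfunc="mean",
)

best_crop_per_soil = pivot.idxmax(axis=1)  # column label with the highest mean in each row
best_soil_per_crop = pivot.idxmax(axis=0)  # row label with the highest mean in each column

print(best_crop_per_soil)
print(best_soil_per_crop)
```

Calling `pivot_table.idxmax(axis=0)` on the table built above would list the best soil type for every crop in one step.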

**13. What are the relative proportions of State, Crop_Type, and Soil_Type in the dataset?**

In [None]:
from matplotlib import cm
# Define categorical columns for pie chart
categorical_columns = ['State', 'Crop_Type', 'Soil_Type']

# Plotting pie charts
plt.figure(figsize=(20, 10))  # Set overall figure size
plot_num = 1
pastel_colors = cm.Pastel1.colors
for col in categorical_columns:
    plt.subplot(1, 3, plot_num)  # Create subplots
    data = train_data[col].value_counts()  # Get value counts for the column
    plt.pie(data, labels=data.index, autopct='%1.1f%%', startangle=140, colors=pastel_colors[:len(data)])
    plt.title(f"Proportion of {col}")
    plot_num += 1

plt.tight_layout()
plt.show()

In [None]:
train_data.columns

**14. What is the distribution of crop types across the dataset?**

In [None]:
import matplotlib.pyplot as plt

# Data for crop types
crop_data = train_data['Crop_Type'].value_counts()

# Plotting the pie chart
plt.figure(figsize=(8, 6))
plt.pie(crop_data, labels=crop_data.index, autopct='%1.1f%%', startangle=140, colors=plt.cm.Set3.colors[:len(crop_data)])
plt.title("Distribution of Crop Types")
plt.show()


**15. How are soil types distributed in the dataset?**

In [None]:
# Data for soil types
soil_data = train_data['Soil_Type'].value_counts()

# Plotting the pie chart
plt.figure(figsize=(8, 6))
plt.pie(soil_data, labels=soil_data.index, autopct='%1.1f%%', startangle=140, colors=plt.cm.Paired.colors[:len(soil_data)])
plt.title("Distribution of Soil Types")
plt.show()


# **🔚 Conclusion:**

In this notebook, we performed a comprehensive EDA to uncover insights from the dataset. Here's a summary of our findings:

**🌦️ Numerical Features:** Explored distributions of rainfall, irrigation area, and crop yield, identifying key patterns.

**🌱 Categorical Features:** Analyzed states, crop types, and soil types, revealing regional and agricultural diversity.

**📊 Pie Chart Distributions:** Visualized the proportion of states, crop types, and soil types to understand their relative representation in the dataset.

**📈 Input-Output Relationships:** Visualized how rainfall and irrigation impact crop yield, along with variability across states, crops, and soil types.

**📅 Trends:** Observed temporal patterns in crop yield across years and regions.

**💡 Correlations:** Highlighted key relationships among numerical features to guide feature selection.

**These insights lay a solid foundation for building a robust predictive model to improve agricultural productivity. 🚜✨**

# **🚀 If you found this notebook helpful, insightful, or inspiring, I would truly appreciate your upvote! Your support can help bring this notebook to the spotlight and make it shine in the competition🏆✨**

# **Thank you for your time and encouragement! 😊**