# CPSC322 Semester Project Proposal

Team Members: Kim Lenz, Jack Ou

Project Name: Agriculture and Climate Change

## Dataset Description

Dataset Source: https://www.kaggle.com/datasets/waqi786/climate-change-impact-on-agriculture?select=climate_change_impact_on_agriculture_2024.csv

Format: CSV 

Contents: Simulated (synthetic) data intended for learning and practice. The dataset contains no missing values, and no zeros in columns where a zero would be invalid.

Attributes: Year, Country, Region, Crop_Type, Average_Temperature_C, Total_Precipitation_mm, CO2_Emissions_MT, Crop_Yield_MT_per_HA, Extreme_Weather_Events, Irrigation_Access_%, Pesticide_Use_KG_per_HA, Fertilizer_Use_KG_per_HA, Soil_Health_Index, Adaptation_Strategies, Economic_Impact_Million_USD

Country – Simulated country where the agricultural data is modeled, used to represent geographic and climate variation across regions.

Crop_Type – The type of crop grown in the modeled scenario (e.g., wheat, rice, maize), included to show how different crops respond to environmental factors.

CO2_Emissions_MT – Modeled carbon dioxide emissions (in metric tons) associated with the agricultural region; used to approximate broader climate conditions and environmental impact.

Irrigation_Access_% – Percentage of farmland with access to irrigation infrastructure, representing water security and resilience to precipitation variability.

Pesticide_Use_KG_per_HA – Amount of pesticides applied per hectare (in kilograms), reflecting agricultural management intensity and its role in crop productivity.

Fertilizer_Use_KG_per_HA – Quantity of fertilizer applied per hectare (in kilograms), included as an indicator of soil nutrient supplementation and yield-boosting agricultural practices.

Soil_Health_Index – Synthetic index summarizing soil quality (e.g., nutrient levels, organic matter, structure), serving as a key factor influencing crop growth and long-term agricultural sustainability.

Adaptation_Strategies – Simulated climate adaptation measures implemented in the region (e.g., drought-resistant seeds, improved irrigation, crop rotation), representing efforts to mitigate climate impacts.

Economic_Impact_Million_USD – Estimated financial impact of climate and agricultural conditions on the region's economy (in millions of USD). 

Region – Simulated geographic region

Year – Modeled year of data collection

Average_Temperature_C – Simulated average temperature (°C)

Total_Precipitation_mm – Modeled annual rainfall (mm)

Crop_Yield_MT_per_HA – Synthetic yield data for selected crops (metric tons per hectare)

Extreme_Weather_Events – Number of modeled extreme weather occurrences per year

Attribute(s) to be predicted:

Crop_Yield_MT_per_HA (option 1; converted to low, medium, and high classes)

Extreme_Weather_Events (option 2)
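
Converting the continuous yield into low, medium, and high classes could be sketched as follows. The proposal does not say how the cut points will be chosen, so this sketch assumes equal-frequency binning; the function name `discretize_yield` and the sample values are hypothetical.

```python
def discretize_yield(yields, labels=("low", "medium", "high")):
    """Map continuous yields to class labels using equal-frequency cutoffs.

    Assumption: cutoffs are taken from the sorted values so that each class
    receives roughly the same number of instances.
    """
    ordered = sorted(yields)
    n = len(ordered)
    k = len(labels)
    # One cutoff per class boundary, read from the sorted values.
    cutoffs = [ordered[(i * n) // k] for i in range(1, k)]
    result = []
    for value in yields:
        # The class index is the number of cutoffs the value meets or exceeds.
        index = sum(value >= cutoff for cutoff in cutoffs)
        result.append(labels[index])
    return result


# Made-up yields (metric tons per hectare), for illustration only.
print(discretize_yield([1.0, 2.0, 3.0, 4.0, 5.0, 6.0]))
# ['low', 'low', 'medium', 'medium', 'high', 'high']
```

Equal-width bins or domain-specific cutoffs would be reasonable alternatives if the class sizes matter less than the meaning of each range.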



## Implementation/Technical Merit

Anticipated challenges: feature reduction, instance reduction, pruning

Feature selection/reduction handling:
We will compute the entropy of the full dataset and evaluate the information gain for each attribute when predicting our target.
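
A minimal sketch of the entropy and information gain computation for a categorical attribute is shown here. It assumes the data is held as a list of rows with a parallel list of class labels; the function names and the toy rows are illustrative, not part of the project code.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(rows, labels, attribute_index):
    """Entropy of the labels minus the weighted entropy after partitioning
    the instances on the attribute at attribute_index."""
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute_index], []).append(label)
    weighted = sum(
        len(part) / len(labels) * entropy(part) for part in partitions.values()
    )
    return entropy(labels) - weighted


# Toy rows of [Crop_Type, Region] with made-up values.
rows = [["Wheat", "A"], ["Wheat", "B"], ["Rice", "A"], ["Rice", "B"]]
labels = ["low", "low", "high", "high"]
print(information_gain(rows, labels, 0))  # Crop_Type separates the classes
print(information_gain(rows, labels, 1))  # Region does not
```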

Because many attributes in our dataset are continuous (temperature, precipitation, CO₂ emissions, soil health, etc.), we will use the continuous attribute handling approach taught in lecture.
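
One common way to handle a continuous attribute is to test binary splits at the midpoints between adjacent distinct values and keep the threshold with the highest information gain. The sketch below assumes that is the approach intended; the lecture method may differ in its details, and `best_threshold` is a hypothetical name.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def best_threshold(values, labels):
    """Return (threshold, gain) for the best binary split of the form
    value <= threshold, or (None, 0.0) if no split improves on the base entropy."""
    base = entropy(labels)
    best = (None, 0.0)
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between adjacent distinct values.
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [lab for val, lab in zip(values, labels) if val <= threshold]
        right = [lab for val, lab in zip(values, labels) if val > threshold]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        gain = base - weighted
        if gain > best[1]:
            best = (threshold, gain)
    return best


# Made-up Average_Temperature_C values, for illustration only.
print(best_threshold([10.0, 12.0, 25.0, 30.0], ["low", "low", "high", "high"]))
```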

If some attributes contribute very little information gain or are highly correlated with other features, we will reduce them by:

Removing features with near-zero information gain

Using decision-tree feature importances to confirm impact

Optionally evaluating correlation matrices to identify redundancy
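
The optional redundancy check could be sketched with a plain Pearson correlation over pairs of numeric columns. The 0.9 cutoff, the function names, and the sample values are assumptions for illustration; a suitable cutoff would need to be chosen from the actual data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column has no defined correlation; treat as 0
    return cov / math.sqrt(var_x * var_y)


def redundant_pairs(columns, threshold=0.9):
    """List pairs of column names whose absolute correlation meets the threshold."""
    names = list(columns)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if abs(pearson(columns[first], columns[second])) >= threshold:
                pairs.append((first, second))
    return pairs


# Made-up values, for illustration only.
columns = {
    "Pesticide_Use_KG_per_HA": [1, 2, 3, 4],
    "Fertilizer_Use_KG_per_HA": [2, 4, 6, 8],
    "Soil_Health_Index": [5, 1, 4, 2],
}
print(redundant_pairs(columns))
```

For each flagged pair, one of the two attributes is a candidate for removal, ideally the one with the lower information gain.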

We will examine the dataset for duplicate entries, extreme outliers, and class imbalance. If needed, we may remove duplicates or downsample majority classes to ensure fair classification performance.
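
Duplicate removal and majority-class downsampling might look like the sketch below. It assumes instances are lists with the class label at a known index and that every class is reduced to the size of the smallest class; the function names and the `seed` parameter are hypothetical.

```python
import random
from collections import defaultdict


def drop_duplicates(rows):
    """Keep the first occurrence of each row, preserving the original order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def downsample(rows, label_index, seed=0):
    """Randomly sample every class down to the size of the smallest class."""
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_index]].append(row)
    smallest = min(len(group) for group in by_class.values())
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, smallest))
    return balanced


# Made-up rows of [value, class], for illustration only.
rows = [[1, "low"], [1, "low"], [2, "low"], [3, "low"], [4, "high"]]
unique = drop_duplicates(rows)
balanced = downsample(unique, label_index=1)
print(len(unique), len(balanced))
```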

To prevent overfitting in decision trees, we will apply pruning strategies such as limiting tree depth, requiring minimum samples to split, or post-pruning based on validation accuracy. This aligns with class discussions on maintaining high predictive accuracy with a small number of rules.
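
The depth and minimum-sample limits can be expressed as a stopping check inside a recursive tree builder. This is a sketch only: the default values for `max_depth` and `min_samples_split` are placeholders, both function names are hypothetical, and post-pruning on a validation set is not shown.

```python
def should_stop(labels, depth, max_depth=5, min_samples_split=10):
    """Return True when a node should become a leaf instead of being split."""
    if len(set(labels)) <= 1:
        return True  # the node is already pure
    if depth >= max_depth:
        return True  # depth limit reached
    if len(labels) < min_samples_split:
        return True  # too few instances to justify another split
    return False


def majority_label(labels):
    """Label for a leaf: the most frequent class, ties broken alphabetically."""
    return max(sorted(set(labels)), key=labels.count)


# A mixed node of 20 instances at depth 2 may still be split.
print(should_stop(["low", "high"] * 10, depth=2))
# The same node at the depth limit becomes a leaf with a majority label.
print(should_stop(["low", "high"] * 10, depth=5), majority_label(["low", "high", "low"]))
```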

## Potential Impact From Results
Why the results will be useful:

Climate change is already having profound effects on the global agricultural sector. Farmers depend on reliable, consistent weather patterns to plan when to plant, irrigate, protect, and harvest their crops. Climate change has disrupted these patterns and increased the frequency of extreme weather events. This growing unpredictability leads to billions of dollars in crop losses each year, reduces the stability of the global food supply, and places significant financial strain on agricultural communities.

By analyzing this synthetic dataset, we can practice identifying patterns between agricultural attributes and climate-related factors (including extreme weather). The insights gained here (which variables matter, how to handle missing or imbalanced data, and how classification algorithms respond to climate-related features) can directly inform how future real-world datasets on agriculture and climate change should be explored. Ultimately, this work lays a foundation for building predictive tools that can help farmers, policymakers, and researchers make more informed decisions in the face of rapidly changing environmental conditions.

The results of this project can provide insight into which climate variables most strongly influence agricultural outcomes. Understanding the relationships between temperature, precipitation, CO₂, soil conditions, and yield can help researchers build more effective predictive tools. These insights may guide future farming policy, disaster preparedness, and long-term planning in agriculture. Although this dataset is synthetic, the techniques developed here can be applied to real-world climate–crop datasets.

Stakeholders interested in the results:

Governments (global)

Farmers (global)

Greater agricultural industry (farm equipment, fertilizer, processing plants, etc.) (regional and global)

Consumers (global)

Climate scientists

Biologists


## Citations

https://github.com/DataScienceAlgorithms/M5_DecisionTrees/blob/main/B%20Attribute%20Selection.ipynb

