# Crop Yield Prediction Using Environmental Data

**Name(s):** Jake Walkup

**Website Link:** _coming soon_

## Introduction

The dataset comes from Kaggle. It contains **3,000 rows** of simulated data points representing different environmental and farming conditions that affect crop yield. I chose this topic because I care about sustainability and want to explore how data can help improve agricultural productivity in an eco-friendly way.

The dataset includes the following **features**:

- `rainfall_mm`: Average rainfall during the growing season (500–2000 mm)
- `soil_quality_index`: A score from 1 to 10 measuring soil quality
- `farm_size_hectares`: The size of the farm (10–1000 hectares)
- `sunlight_hours`: Average sunlight per day during the season (4–12 hours)
- `fertilizer_kg`: Amount of fertilizer used per hectare (100–3000 kg)

The **target variable** is `crop_yield` (in tons per hectare).
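Since each feature has a documented range, one quick sanity check is to count how many values fall outside it. The sketch below is only an illustration: the helper `count_out_of_range` and the `EXPECTED_RANGES` dictionary are hypothetical names, the bounds are assumed to be inclusive, and the sample holds just the first five rows printed by `df.head(10)` rather than the full file.

```python
import pandas as pd

# Documented ranges from the feature list (bounds assumed to be inclusive).
EXPECTED_RANGES = {
    "rainfall_mm": (500, 2000),
    "soil_quality_index": (1, 10),
    "farm_size_hectares": (10, 1000),
    "sunlight_hours": (4, 12),
    "fertilizer_kg": (100, 3000),
}


def count_out_of_range(df: pd.DataFrame, ranges: dict) -> pd.Series:
    """Return, per column, how many values fall outside the documented range."""
    counts = {
        # Series.between is inclusive on both ends by default.
        col: int((~df[col].between(low, high)).sum())
        for col, (low, high) in ranges.items()
    }
    return pd.Series(counts, name="out_of_range")


# Small sample: the first five rows shown in the dataset preview.
sample = pd.DataFrame({
    "rainfall_mm": [1626, 1959, 1360, 1794, 1630],
    "soil_quality_index": [9, 9, 1, 2, 5],
    "farm_size_hectares": [636, 73, 352, 948, 884],
    "sunlight_hours": [11, 11, 5, 7, 5],
    "fertilizer_kg": [1006, 112, 702, 299, 2733],
    "crop_yield": [404, 115, 231, 537, 554],
})

violations = count_out_of_range(sample, EXPECTED_RANGES)
print(violations)
```

On the full dataset, the same call would be `count_out_of_range(df, EXPECTED_RANGES)`; any non-zero count would point to rows worth inspecting.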

Some questions I brainstormed:

- What are the most important factors that influence crop yield?
- How much does fertilizer usage affect yield?
- Can sunlight or rainfall alone predict crop success?

I chose: **"What are the most important factors that influence crop yield?"**
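One possible first pass at this question is to rank the features by the strength of their linear relationship with `crop_yield`. The sketch below is an assumption about how that could be done, not the analysis used in this project: `rank_features_by_correlation` is a hypothetical helper, Pearson correlation only captures linear association, and the five-row sample (taken from the dataset preview) is far too small to draw conclusions from.

```python
import pandas as pd


def rank_features_by_correlation(df: pd.DataFrame, target: str = "crop_yield") -> pd.Series:
    """Rank features by absolute Pearson correlation with the target column."""
    # Correlation of every column with the target, excluding the target itself.
    correlations = df.corr()[target].drop(target)
    # Sort by magnitude so strong negative relationships also rank highly.
    return correlations.abs().sort_values(ascending=False)


# Small sample: the first five rows shown in the dataset preview.
sample = pd.DataFrame({
    "rainfall_mm": [1626, 1959, 1360, 1794, 1630],
    "soil_quality_index": [9, 9, 1, 2, 5],
    "farm_size_hectares": [636, 73, 352, 948, 884],
    "sunlight_hours": [11, 11, 5, 7, 5],
    "fertilizer_kg": [1006, 112, 702, 299, 2733],
    "crop_yield": [404, 115, 231, 537, 554],
})

ranking = rank_features_by_correlation(sample)
print(ranking)
```

Running the same function on the full DataFrame would give a rough ordering of the features, which a model-based importance measure could later confirm or contradict.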


```python
import pandas as pd

# Load the CSV file into a pandas DataFrame
df = pd.read_csv('crop_yield_data.csv')

# head(10) returns the first 10 rows of the DataFrame
print(df.head(10))
```

```text
   rainfall_mm  soil_quality_index  farm_size_hectares  sunlight_hours  \
0         1626                   9                 636              11
1         1959                   9                  73              11
2         1360                   1                 352               5
3         1794                   2                 948               7
4         1630                   5                 884               5
5         1595                   4                 928               7
6         1544                  10                 361              10
7          621                   9                 167              12
8          966                   7                 598              11
9         1738                   6                 500              12

   fertilizer_kg  crop_yield
0           1006         404
1            112         115
2            702         231
3            299         537
4           2733         554
```

## Data Cleaning and Exploratory Data Analysis

### Data Cleaning:

In the introduction section, I listed the units for each variable. Below, I checked the dataset for missing values (NaN). The check found none in any column, so no rows had to be dropped or imputed and the dataset was ready for analysis as provided. This means the subsequent analyses are not biased by incomplete information.

```python
# True if the DataFrame contains at least one missing value anywhere
print(df.isnull().values.any())

# Number of missing values in each column
print(df.isnull().sum())
```

```text
False
rainfall_mm           0
soil_quality_index    0
farm_size_hectares    0
sunlight_hours        0
fertilizer_kg         0
crop_yield            0
dtype: int64
```

As the output above shows, none of the columns in the DataFrame contains missing values.
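The missing-value check can be extended to a few other common inconsistencies, such as duplicate rows, non-numeric columns, or negative measurements. The sketch below shows one way to collect these checks in a single report. The helper `summarize_data_quality` is a hypothetical name, the choice of checks is an assumption, and the sample contains only the first five rows from the dataset preview.

```python
import pandas as pd


def summarize_data_quality(df: pd.DataFrame) -> dict:
    """Collect a few basic integrity checks beyond missing values."""
    numeric = df.select_dtypes(include="number")
    return {
        # Total number of NaN cells across all columns.
        "missing_values": int(df.isnull().sum().sum()),
        # Rows that exactly repeat an earlier row.
        "duplicate_rows": int(df.duplicated().sum()),
        # Columns whose dtype is not numeric (every column here should be).
        "non_numeric_columns": [
            col for col in df.columns
            if not pd.api.types.is_numeric_dtype(df[col])
        ],
        # Negative values would be physically meaningless for these measurements.
        "negative_values": int((numeric < 0).sum().sum()),
    }


# Small sample: the first five rows shown in the dataset preview.
sample = pd.DataFrame({
    "rainfall_mm": [1626, 1959, 1360, 1794, 1630],
    "soil_quality_index": [9, 9, 1, 2, 5],
    "farm_size_hectares": [636, 73, 352, 948, 884],
    "sunlight_hours": [11, 11, 5, 7, 5],
    "fertilizer_kg": [1006, 112, 702, 299, 2733],
    "crop_yield": [404, 115, 231, 537, 554],
})

report = summarize_data_quality(sample)
print(report)
```

Calling `summarize_data_quality(df)` on the full dataset would show whether any of these issues are present before modeling.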