**CROP PRICES**

**1. Problem Statement**

The price of avocados fluctuates due to various factors such as demand, supply, seasonality, and regional differences. Accurately predicting avocado prices can help farmers, retailers, and consumers make informed decisions. This project aims to develop a machine learning model that forecasts avocado prices based on historical sales data and market trends.

**2. Project Objectives**

1. Analyze historical avocado sales and pricing trends.

2. Identify key factors influencing price fluctuations.

3. Build a predictive model to forecast avocado prices.

4. Deploy the model in a user-friendly interface for real-time price predictions.

**3. Data Sources**

1. Hass Avocado Board: Provides historical avocado price and sales data.

2. Kaggle Datasets: Public datasets on avocado prices.

**4. Success Metrics**

__To evaluate the effectiveness of the model, we will use:__

* Mean Absolute Error (MAE): the average absolute difference between predicted and actual prices.

* Root Mean Squared Error (RMSE): like MAE, but penalizes large prediction errors more heavily.

* R² Score: the share of the variance in price that the model explains.

* Model Deployment Readiness: the model should be lightweight and fast enough for real-time predictions.
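The three numeric metrics above can be computed directly from their definitions. The sketch below is illustrative only: the helper name `regression_metrics` and the sample prices are hypothetical and do not come from the dataset. scikit-learn's `mean_absolute_error`, `mean_squared_error`, and `r2_score` should give the same values.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and R² for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred

    mae = np.mean(np.abs(errors))            # average absolute error
    rmse = np.sqrt(np.mean(errors ** 2))     # squaring weights large errors more
    ss_res = np.sum(errors ** 2)             # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"mae": mae, "rmse": rmse, "r2": r2}


# Hypothetical prices for illustration only, not taken from the dataset
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]

metrics = regression_metrics(actual, predicted)
print(metrics)
```

RMSE is never smaller than MAE on the same predictions, so a large gap between the two suggests that a few large errors dominate.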

**DATA COLLECTION**

In [3]:
pip install kagglehub

Collecting kagglehub
  Downloading kagglehub-0.3.10-py3-none-any.whl.metadata (31 kB)
Downloading kagglehub-0.3.10-py3-none-any.whl (63 kB)
   ---------------------------------------- 0.0/63.0 kB ? eta -:--:--
   ---------------------------------------- 63.0/63.0 kB 1.7 MB/s eta 0:00:00
Installing collected packages: kagglehub
Successfully installed kagglehub-0.3.10
Note: you may need to restart the kernel to use updated packages.


In [1]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("vakhariapujan/avocado-prices-and-sales-volume-2015-2023")

print("Path to dataset files:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/vakhariapujan/avocado-prices-and-sales-volume-2015-2023?dataset_version_number=3...


100%|██████████| 1.68M/1.68M [00:04<00:00, 421kB/s]

Extracting files...
Path to dataset files: C:\Users\Fransisca Nong\.cache\kagglehub\datasets\vakhariapujan\avocado-prices-and-sales-volume-2015-2023\versions\3





In [9]:
import pandas as pd

# file path
file_path = r"C:\Users\Fransisca Nong\Downloads\archive (1)\Avocado_HassAvocadoBoard_20152023v1.0.1.csv"

# Load the dataset
df = pd.read_csv(file_path)

# Display dataset info and first few rows
# df.info() prints its summary directly and returns None, so it is not wrapped in print()
df.info()
print(df.head())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53415 entries, 0 to 53414
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Date          53415 non-null  object 
 1   AveragePrice  53415 non-null  float64
 2   TotalVolume   53415 non-null  float64
 3   plu4046       53415 non-null  float64
 4   plu4225       53415 non-null  float64
 5   plu4770       53415 non-null  float64
 6   TotalBags     53415 non-null  float64
 7   SmallBags     41025 non-null  float64
 8   LargeBags     41025 non-null  float64
 9   XLargeBags    41025 non-null  float64
 10  type          53415 non-null  object 
 11  region        53415 non-null  object 
dtypes: float64(9), object(3)
memory usage: 4.9+ MB
         Date  AveragePrice  TotalVolume    plu4046    plu4225   plu4770  \
0  2015-01-04          1.22     40873.28    2819.50   28287.42     49.90   
1  2015-01-04          1.79      1373.95      57.42     153.88      0.00   
2  2
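The summary above lists `Date` with dtype `object`, meaning the dates are stored as plain strings. Since the problem statement names seasonality as a price driver, one possible next step is to parse the column and derive calendar features. The sketch below uses a small hypothetical frame; the derived column names `year` and `month` are assumptions, not part of the dataset.

```python
import pandas as pd

# Hypothetical rows for illustration; the column names mirror the dataset
toy = pd.DataFrame({
    "Date": ["2015-01-04", "2015-07-12", "2016-12-25"],
    "AveragePrice": [1.22, 1.45, 1.10],
})

# Convert ISO-formatted strings to datetime64 values
toy["Date"] = pd.to_datetime(toy["Date"])

# Derive simple calendar features that a model could use for seasonality
toy["year"] = toy["Date"].dt.year
toy["month"] = toy["Date"].dt.month

print(toy.dtypes)
print(toy)
```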

**CHECKING FOR DUPLICATES**

In [35]:
# Check for duplicate rows
duplicates = df.duplicated().sum()

# Display the number of duplicate rows
print(f"Number of duplicate rows: {duplicates}")

Number of duplicate rows: 0


**CHECKING FOR MISSING VALUES**

In [11]:
print("Missing values:\n", df.isnull().sum())

Missing values:
 Date                0
AveragePrice        0
TotalVolume         0
plu4046             0
plu4225             0
plu4770             0
TotalBags           0
SmallBags       12390
LargeBags       12390
XLargeBags      12390
type                0
region              0
dtype: int64


In [14]:
missing_percent = df[["SmallBags", "LargeBags", "XLargeBags"]].isnull().sum() / len(df) * 100
print("Percentage of missing values:\n", missing_percent)

Percentage of missing values:
 SmallBags     23.195732
LargeBags     23.195732
XLargeBags    23.195732
dtype: float64


In [16]:
# Check if missing values are related to type or region
missing_rows = df[df[["SmallBags", "LargeBags", "XLargeBags"]].isnull().any(axis=1)]
missing_by_type = missing_rows["type"].value_counts()
missing_by_region = missing_rows["region"].value_counts()

print("Missing by type:\n", missing_by_type)
print("Missing by region:\n", missing_by_region)

Missing by type:
 type
conventional    6195
organic         6195
Name: count, dtype: int64
Missing by region:
 region
Albany                  210
SanDiego                210
Orlando                 210
PeoriaSpringfield       210
Philadelphia            210
PhoenixTucson           210
Pittsburgh              210
Plains                  210
Portland                210
Providence              210
RaleighGreensboro       210
RichmondNorfolk         210
Roanoke                 210
Sacramento              210
SanFrancisco            210
Northeast               210
Seattle                 210
SouthCarolina           210
SouthCentral            210
Southeast               210
Spokane                 210
StLouis                 210
Syracuse                210
Tampa                   210
Toledo                  210
TotalUS                 210
West                    210
WestTexNewMexico        210
NorthernNewEngland      210
NewYork                 210
Atlanta                 210
Detroit       

In [20]:
# Fill missing values with the median for each column
df["SmallBags"] = df["SmallBags"].fillna(df["SmallBags"].median())
df["LargeBags"] = df["LargeBags"].fillna(df["LargeBags"].median())
df["XLargeBags"] = df["XLargeBags"].fillna(df["XLargeBags"].median())

# Confirm that no missing values remain
print(df[["SmallBags", "LargeBags", "XLargeBags"]].isnull().sum())

SmallBags     0
LargeBags     0
XLargeBags    0
dtype: int64
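The cell above fills every gap with a single global median per column. If bag volumes differ noticeably between conventional and organic avocados (an assumption worth checking on the real data), a group-wise median may distort the column less. The sketch below shows that variant on hypothetical values and also keeps a flag marking which rows were imputed; the column name `SmallBags_missing` is an illustrative choice.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; the column names mirror the dataset
toy = pd.DataFrame({
    "type": ["conventional", "conventional", "conventional",
             "organic", "organic", "organic"],
    "SmallBags": [100.0, 300.0, np.nan, 10.0, np.nan, 30.0],
})

# Record which rows were missing before overwriting them
toy["SmallBags_missing"] = toy["SmallBags"].isna()

# Median of each type, broadcast back to the original row order (NaN is skipped)
group_median = toy.groupby("type")["SmallBags"].transform("median")

# Fill each gap with the median of its own type rather than the global median
toy["SmallBags"] = toy["SmallBags"].fillna(group_median)

print(toy)
```

This has to run before the categorical encoding step, while the `type` column still exists.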


**Encode Categorical Features**

In [28]:
df = pd.get_dummies(df, columns=["type", "region"], drop_first=True)
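To make the effect of `drop_first=True` concrete, the sketch below applies the same call to a small hypothetical frame. With two categories, only one indicator column is kept, and the dropped category (the first in sorted order) becomes the implicit baseline.

```python
import pandas as pd

# Hypothetical rows for illustration; the column names mirror the dataset
toy = pd.DataFrame({
    "AveragePrice": [1.22, 1.79, 1.00],
    "type": ["conventional", "organic", "conventional"],
})

# One-hot encode "type"; "conventional" sorts first and is dropped as the baseline
encoded = pd.get_dummies(toy, columns=["type"], drop_first=True)

print(encoded)
```

Recent pandas versions return boolean indicator columns, so a later `select_dtypes(include='number')` call will not pick them up.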

**Feature Scaling**

In [33]:
from sklearn.preprocessing import StandardScaler

# Columns to standardize (zero mean, unit variance).
# Note that this list includes the prediction target, AveragePrice.
scale_cols = ['TotalVolume', 'plu4046', 'plu4225', 'plu4770', 'TotalBags',
              'SmallBags', 'LargeBags', 'XLargeBags', 'AveragePrice']

scaler = StandardScaler()
df[scale_cols] = scaler.fit_transform(df[scale_cols])
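The scaling cell above fits the scaler on the full dataset. When the data is later split for evaluation, a common alternative is to fit the scaler on the training rows only and reuse those statistics on the held-out rows, so that no information from the test period leaks into training. The sketch below illustrates this on hypothetical values; the chronological split point and the name `feature_cols` are assumptions.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical volumes for illustration; the column name mirrors the dataset
toy = pd.DataFrame({"TotalVolume": [10.0, 20.0, 30.0, 40.0, 100.0, 200.0]})

# Chronological split: the first four rows act as training data
train = toy.iloc[:4].copy()
test = toy.iloc[4:].copy()

feature_cols = ["TotalVolume"]
scaler = StandardScaler()

# Learn the mean and standard deviation from the training rows only
train[feature_cols] = scaler.fit_transform(train[feature_cols])

# Reuse the training statistics on the held-out rows
test[feature_cols] = scaler.transform(test[feature_cols])

print(train)
print(test)
```

Leaving the target out of `feature_cols` keeps predictions in the original price units; if the target is scaled, `scaler.inverse_transform` is needed to map predictions back.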

**Outlier Detection**

In [44]:
# Get a list of numeric columns
numeric_columns = df.select_dtypes(include='number').columns

Q1 = df[numeric_columns].quantile(0.25)
Q3 = df[numeric_columns].quantile(0.75)
IQR = Q3 - Q1

# Identify outliers
outliers = ((df[numeric_columns] < (Q1 - 1.5 * IQR)) | (df[numeric_columns] > (Q3 + 1.5 * IQR)))

# Check the number of outliers
outlier_counts = outliers.sum()
print("Outliers per column:\n", outlier_counts)

Outliers per column:
 AveragePrice      358
TotalVolume      6484
plu4046          7263
plu4225          8193
plu4770          9375
TotalBags        6845
SmallBags       10080
LargeBags       11130
XLargeBags       8397
dtype: int64
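The counts above come from the 1.5 × IQR rule: a value is flagged when it falls below Q1 − 1.5·IQR or above Q3 + 1.5·IQR. The sketch below walks through the rule on a small hypothetical series and shows clipping to the fences as one possible way to handle flagged values. Clipping is an illustrative option, not a step taken in this notebook.

```python
import pandas as pd

# Hypothetical values for illustration; 100.0 is a planted outlier
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Fences of the 1.5 * IQR rule
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Flag values outside the fences
is_outlier = (values < lower) | (values > upper)

# Optional handling: pull flagged values back to the nearest fence
clipped = values.clip(lower=lower, upper=upper)

print(f"fences: [{lower}, {upper}]")
print("flagged:", values[is_outlier].tolist())
print("clipped max:", clipped.max())
```

Sales volumes are often heavily right-skewed, so many flagged rows may be legitimate large markets rather than errors; inspecting them before clipping or dropping is advisable.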
