## Predicting the Price of Avocados
### Using Feature Engineering on Lasso and Ridge Regression Models

This notebook applies hyperparameter tuning and feature engineering to Lasso and Ridge regression models. It covers data cleaning, model building, parameter tuning, cross-validation, and model evaluation.

**Setup and Imports**

In [23]:
#Configuration and imports

import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge, Lasso
from sklearn import metrics
import warnings
warnings.filterwarnings('ignore')

**Data Preprocessing**

In [33]:
#Import the data and remove any unnecessary columns
df = pd.read_csv('avocado.csv')
print(df.columns)
df = df.drop(['Unnamed: 0'], axis=1)

#Drop any rows where small bags + large bags + xlarge bags != total bags
df = df[df['Total Bags'] == (df['Small Bags'] + df['Large Bags'] + df['XLarge Bags'])]

df.head()

Index(['Unnamed: 0', 'Date', 'AveragePrice', 'Total Volume', '4046', '4225',
       '4770', 'Total Bags', 'Small Bags', 'Large Bags', 'XLarge Bags', 'type',
       'year', 'region'],
      dtype='object')


| | Date | AveragePrice | Total Volume | 4046 | 4225 | 4770 | Total Bags | Small Bags | Large Bags | XLarge Bags | type | year | region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2015-12-27 | 1.33 | 64236.62 | 1036.74 | 54454.85 | 48.16 | 8696.87 | 8603.62 | 93.25 | 0.0 | conventional | 2015 | Albany |
| 1 | 2015-12-20 | 1.35 | 54876.98 | 674.28 | 44638.81 | 58.33 | 9505.56 | 9408.07 | 97.49 | 0.0 | conventional | 2015 | Albany |
| 2 | 2015-12-13 | 0.93 | 118220.22 | 794.7 | 109149.67 | 130.5 | 8145.35 | 8042.21 | 103.14 | 0.0 | conventional | 2015 | Albany |
| 3 | 2015-12-06 | 1.08 | 78992.15 | 1132.0 | 71976.41 | 72.58 | 5811.16 | 5677.4 | 133.76 | 0.0 | conventional | 2015 | Albany |
| 4 | 2015-11-29 | 1.28 | 51039.6 | 941.48 | 43838.39 | 75.78 | 6183.95 | 5986.26 | 197.69 | 0.0 | conventional | 2015 | Albany |
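
The filter above compares floating-point sums with `==`, which can drop rows whose bag counts differ from `Total Bags` only by rounding error. Below is a minimal sketch of a tolerance-based alternative. The tolerance of 0.01 is an assumption (chosen because the columns are shown with two decimals), and the toy values are hypothetical rather than taken from `avocado.csv`.

```python
import numpy as np
import pandas as pd

# Toy rows (hypothetical values, not from avocado.csv).
toy = pd.DataFrame({
    "Small Bags": [0.1, 100.0, 50.0],
    "Large Bags": [0.2, 20.0, 5.0],
    "XLarge Bags": [0.0, 3.0, 1.0],
    "Total Bags": [0.3, 123.0, 60.0],
})

bag_sum = toy["Small Bags"] + toy["Large Bags"] + toy["XLarge Bags"]

# Exact equality: row 0 fails because 0.1 + 0.2 is not exactly 0.3 in floating point.
exact_mask = toy["Total Bags"] == bag_sum

# Tolerance-based comparison; atol=0.01 is an assumed tolerance, not a value from the notebook.
close_mask = np.isclose(toy["Total Bags"], bag_sum, atol=0.01)

exact_kept = int(exact_mask.sum())
close_kept = int(close_mask.sum())
print(f"Rows kept with ==: {exact_kept}, with np.isclose: {close_kept}")
```

Swapping this into the notebook would change how many rows survive the filter, so the row counts and scores reported here would need to be regenerated.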


In [34]:
#list the unique regions
print(df['region'].unique())
# This list has many regions: most are cities, but some are broad aggregates like 'TotalUS'
# We will drop the aggregate regions listed below and keep the local markets for simplicity

aggregate_regions = ['TotalUS', 'California', 'GreatLakes', 'Northeast', 
                     'NorthernNewEngland', 'Midsouth', 'Plains', 
                     'SouthCentral', 'Southeast', 'West']

df_filtered = df[~df['region'].isin(aggregate_regions)].copy()

print(f"Original number of regions: {df['region'].nunique()}")
print(f"Filtered number of regions: {df_filtered['region'].nunique()}")
print(f"Rows before filtering: {len(df)}, after: {len(df_filtered)}")

['Albany' 'Atlanta' 'BaltimoreWashington' 'Boise' 'Boston'
 'BuffaloRochester' 'California' 'Charlotte' 'Chicago' 'CincinnatiDayton'
 'Columbus' 'DallasFtWorth' 'Denver' 'Detroit' 'GrandRapids' 'GreatLakes'
 'HarrisburgScranton' 'HartfordSpringfield' 'Houston' 'Indianapolis'
 'Jacksonville' 'LasVegas' 'LosAngeles' 'Louisville' 'MiamiFtLauderdale'
 'Midsouth' 'Nashville' 'NewOrleansMobile' 'NewYork' 'Northeast'
 'NorthernNewEngland' 'Orlando' 'Philadelphia' 'PhoenixTucson'
 'Pittsburgh' 'Plains' 'Portland' 'RaleighGreensboro' 'RichmondNorfolk'
 'Roanoke' 'Sacramento' 'SanDiego' 'SanFrancisco' 'Seattle'
 'SouthCarolina' 'SouthCentral' 'Southeast' 'Spokane' 'StLouis' 'Syracuse'
 'Tampa' 'TotalUS' 'West' 'WestTexNewMexico']
Original number of regions: 54
Filtered number of regions: 44
Rows before filtering: 14213, after: 11699


In [35]:
#Extract date features to help with prediction
df_filtered['Date'] = pd.to_datetime(df_filtered['Date'])
df_filtered['month'] = df_filtered['Date'].dt.month
df_filtered['quarter'] = df_filtered['Date'].dt.quarter
df_filtered['day_of_year'] = df_filtered['Date'].dt.dayofyear

# Drop the original Date column since we've extracted what we need
df_filtered = df_filtered.drop(['Date'], axis=1)

print("Added date features: month, quarter, day_of_year")
df_filtered.head()

Added date features: month, quarter, day_of_year


| | AveragePrice | Total Volume | 4046 | 4225 | 4770 | Total Bags | Small Bags | Large Bags | XLarge Bags | type | year | region | month | quarter | day_of_year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.33 | 64236.62 | 1036.74 | 54454.85 | 48.16 | 8696.87 | 8603.62 | 93.25 | 0.0 | conventional | 2015 | Albany | 12 | 4 | 361 |
| 1 | 1.35 | 54876.98 | 674.28 | 44638.81 | 58.33 | 9505.56 | 9408.07 | 97.49 | 0.0 | conventional | 2015 | Albany | 12 | 4 | 354 |
| 2 | 0.93 | 118220.22 | 794.7 | 109149.67 | 130.5 | 8145.35 | 8042.21 | 103.14 | 0.0 | conventional | 2015 | Albany | 12 | 4 | 347 |
| 3 | 1.08 | 78992.15 | 1132.0 | 71976.41 | 72.58 | 5811.16 | 5677.4 | 133.76 | 0.0 | conventional | 2015 | Albany | 12 | 4 | 340 |
| 4 | 1.28 | 51039.6 | 941.48 | 43838.39 | 75.78 | 6183.95 | 5986.26 | 197.69 | 0.0 | conventional | 2015 | Albany | 11 | 4 | 333 |
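
As a quick sanity check on the date features, the same `.dt` accessors can be applied to two dates from the table above. This is a small standalone sketch; the variable names are illustrative.

```python
import pandas as pd

# Two dates taken from the first and last rows shown above.
dates = pd.to_datetime(pd.Series(["2015-12-27", "2015-11-29"]))

features = pd.DataFrame({
    "month": dates.dt.month,
    "quarter": dates.dt.quarter,
    "day_of_year": dates.dt.dayofyear,
})

feature_rows = features.values.tolist()
print(feature_rows)  # [[12, 4, 361], [11, 4, 333]]
```

These values match the `month`, `quarter`, and `day_of_year` columns in rows 0 and 4.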


In [36]:
#Finally, one-hot encode the categorical variable "type"

df_filtered = pd.get_dummies(df_filtered, columns=['type'], drop_first=True)

print(f"Columns after one-hot encoding type: {df_filtered.columns.tolist()}")

Columns after one-hot encoding type: ['AveragePrice', 'Total Volume', '4046', '4225', '4770', 'Total Bags', 'Small Bags', 'Large Bags', 'XLarge Bags', 'year', 'region', 'month', 'quarter', 'day_of_year', 'type_organic']
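
With `drop_first=True`, the first category in sorted order (`conventional`) becomes the baseline and only `type_organic` is kept. A toy illustration, using a hypothetical three-row frame rather than the notebook's data:

```python
import pandas as pd

# Hypothetical mini-frame with the same two categories as the 'type' column.
toy = pd.DataFrame({"type": ["conventional", "organic", "conventional"]})

encoded = pd.get_dummies(toy, columns=["type"], drop_first=True)

encoded_columns = encoded.columns.tolist()
# Cast to int so the result is the same whether pandas returns bool or uint8 dummies.
organic_flags = encoded["type_organic"].astype(int).tolist()
print(encoded_columns, organic_flags)  # ['type_organic'] [0, 1, 0]
```

Dropping one level avoids a redundant column: with only two categories, `type_organic` alone carries all the information.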


**Data Split**

In [37]:
# Split features and target
X = df_filtered.drop(['AveragePrice'], axis=1)
y = df_filtered['AveragePrice']

# Train-test split (80-20 split, random_state for reproducibility)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Training set size: {len(X_train)}")
print(f"Test set size: {len(X_test)}")

Training set size: 9359
Test set size: 2340


**Target Encoding for Features with Multiple Categories**

In [38]:
# Target encoding for 'region'
# Calculate mean price per region using ONLY training data (avoids leaking test-set prices)
region_means = y_train.groupby(X_train['region']).mean()

# Create a dictionary mapping region to mean price
region_encoding_dict = region_means.to_dict()

# Add overall mean as fallback for unseen regions in test set
overall_mean = y_train.mean()

# Apply encoding to train and test
X_train['region_encoded'] = X_train['region'].map(region_encoding_dict).fillna(overall_mean)
X_test['region_encoded'] = X_test['region'].map(region_encoding_dict).fillna(overall_mean)

# Drop original region column
X_train = X_train.drop(['region'], axis=1)
X_test = X_test.drop(['region'], axis=1)

print(f"Sample encoding values: {list(region_encoding_dict.items())[:5]}")

Sample encoding values: [('Albany', 1.6052941176470588), ('Atlanta', 1.3578391959798994), ('BaltimoreWashington', 1.5624761904761906), ('Boise', 1.3678947368421053), ('Boston', 1.5256140350877192)]
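
The fit-on-train, apply-to-both pattern used above can be wrapped in two small helpers. The sketch below uses hypothetical function names and toy data, and shows how a region that never appears in training falls back to the overall training mean.

```python
import pandas as pd

def fit_target_encoding(categories, target):
    """Learn per-category target means and the overall mean from training data only."""
    means = target.groupby(categories).mean().to_dict()
    return means, float(target.mean())

def apply_target_encoding(categories, means, fallback):
    """Map categories to learned means; unseen categories get the fallback value."""
    return categories.map(means).fillna(fallback)

# Toy training data (hypothetical regions and prices).
train_region = pd.Series(["A", "A", "B", "B"])
train_price = pd.Series([1.0, 2.0, 3.0, 5.0])

# Region "C" does not appear in training.
test_region = pd.Series(["A", "C"])

means, overall = fit_target_encoding(train_region, train_price)
encoded_test = apply_target_encoding(test_region, means, overall).tolist()
print(means, overall, encoded_test)
```

Here `A` maps to 1.5, `B` to 4.0, and the unseen `C` to the overall mean of 2.75.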


In [39]:
#Add scaling to standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Convert back to DataFrame to keep track of column names (optional, helpful for interpretation)
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns, index=X_train.index)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns, index=X_test.index)

**Ridge Regression Hyperparameter Tuning**

In [40]:
# Hyperparameter Tuning: Set up GridSearchCV for Ridge
ridge_params = {
    'alpha': [0.001, 0.01, 0.1, 1, 10, 100, 1000]
}

ridge = Ridge()
ridge_grid = GridSearchCV(
    ridge, 
    ridge_params, 
    cv=5,  # 5-fold cross-validation
    scoring='neg_mean_squared_error',
    n_jobs=-1,
    verbose=1
)

print("Fitting Ridge with GridSearchCV...")
ridge_grid.fit(X_train_scaled, y_train)

print(f"\nBest Ridge alpha: {ridge_grid.best_params_['alpha']}")
print(f"Best Ridge CV score (negative MSE): {ridge_grid.best_score_:.4f}")

Fitting Ridge with GridSearchCV...
Fitting 5 folds for each of 7 candidates, totalling 35 fits

Best Ridge alpha: 10
Best Ridge CV score (negative MSE): -0.0701
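
One caveat with the setup above: the scaler was fit on the full training set before cross-validation, so each validation fold has already influenced the scaling statistics. A possible alternative is to place the scaler inside a scikit-learn `Pipeline` so that it is refit within each fold. The sketch below runs on synthetic data; the `X_demo`/`y_demo` arrays and the step names are illustrative assumptions, not the notebook's data.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (assumption: stands in for the avocado features).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = X_demo @ np.array([1.0, -2.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=200)

# The scaler is a pipeline step, so GridSearchCV refits it on each training fold.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("ridge", Ridge()),
])

# Parameters of a step are addressed as "<step name>__<parameter>".
param_grid = {"ridge__alpha": [0.001, 0.01, 0.1, 1, 10, 100, 1000]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X_demo, y_demo)

best_alpha = search.best_params_["ridge__alpha"]
print(f"Best alpha: {best_alpha}, CV score (negative MSE): {search.best_score_:.4f}")
```

The same concern applies to the target encoding of `region`, which would also need to run inside each fold to be fully leakage-free.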


**Lasso Hyperparameter Tuning**

In [41]:
# Hyperparameter Tuning: Set up GridSearchCV for Lasso
lasso_params = {
    'alpha': [0.001, 0.01, 0.1, 1, 10, 100]
}

lasso = Lasso(max_iter=10000)  # Increase max_iter to ensure convergence
lasso_grid = GridSearchCV(
    lasso, 
    lasso_params, 
    cv=5,  # 5-fold cross-validation
    scoring='neg_mean_squared_error',
    n_jobs=-1,
    verbose=1
)

print("Fitting Lasso with GridSearchCV...")
lasso_grid.fit(X_train_scaled, y_train)

print(f"\nBest Lasso alpha: {lasso_grid.best_params_['alpha']}")
print(f"Best Lasso CV score (negative MSE): {lasso_grid.best_score_:.4f}")

Fitting Lasso with GridSearchCV...
Fitting 5 folds for each of 6 candidates, totalling 30 fits

Best Lasso alpha: 0.001
Best Lasso CV score (negative MSE): -0.0701


**Model Evaluation: Comparing Ridge and Lasso**

In [43]:
# Model Evaluation: Compare Ridge and Lasso on test set
ridge_pred = ridge_grid.predict(X_test_scaled)
lasso_pred = lasso_grid.predict(X_test_scaled)

# Calculate metrics for both models
ridge_mse = metrics.mean_squared_error(y_test, ridge_pred)
ridge_rmse = ridge_mse ** 0.5  # RMSE is the square root of MSE
ridge_r2 = metrics.r2_score(y_test, ridge_pred)
ridge_mae = metrics.mean_absolute_error(y_test, ridge_pred)

lasso_mse = metrics.mean_squared_error(y_test, lasso_pred)
lasso_rmse = lasso_mse ** 0.5  # RMSE is the square root of MSE
lasso_r2 = metrics.r2_score(y_test, lasso_pred)
lasso_mae = metrics.mean_absolute_error(y_test, lasso_pred)

# Display results
print("=" * 60)
print("RIDGE REGRESSION RESULTS")
print("=" * 60)
print(f"Best alpha: {ridge_grid.best_params_['alpha']}")
print(f"MSE: {ridge_mse:.4f}")
print(f"RMSE: {ridge_rmse:.3f}")
print(f"MAE: {ridge_mae:.4f}")
print(f"R² Score: {ridge_r2:.4f}")

print("\n" + "=" * 60)
print("LASSO REGRESSION RESULTS")
print("=" * 60)
print(f"Best alpha: {lasso_grid.best_params_['alpha']}")
print(f"MSE: {lasso_mse:.4f}")
print(f"RMSE: {lasso_rmse:.3f}")
print(f"MAE: {lasso_mae:.4f}")
print(f"R² Score: {lasso_r2:.4f}")

print("\n" + "=" * 60)
print("COMPARISON")
print("=" * 60)
if ridge_r2 > lasso_r2:
    print(f"✓ Ridge performs better (R² difference: {ridge_r2 - lasso_r2:.4f})")
    print(f"  Selected model: Ridge with alpha={ridge_grid.best_params_['alpha']}")
else:
    print(f"✓ Lasso performs better (R² difference: {lasso_r2 - ridge_r2:.4f})")
    print(f"  Selected model: Lasso with alpha={lasso_grid.best_params_['alpha']}")

RIDGE REGRESSION RESULTS
Best alpha: 10
MSE: 0.0719
RMSE: 0.268
MAE: 0.2054
R² Score: 0.5806

LASSO REGRESSION RESULTS
Best alpha: 0.001
MSE: 0.0719
RMSE: 0.268
MAE: 0.2056
R² Score: 0.5805

COMPARISON
✓ Ridge performs better (R² difference: 0.0001)
  Selected model: Ridge with alpha=10
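
As a reference for the four metrics reported above, here is a small dependency-free sketch of how each is defined. The function name `regression_metrics` and the toy numbers are illustrative assumptions. In the notebook itself, the `sklearn.metrics` functions remain the ones to call.

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE, MAE, and R² from two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    ss_res = sum(e * e for e in errors)            # residual sum of squares
    mse = ss_res / n
    rmse = math.sqrt(mse)                          # square root of MSE
    mae = sum(abs(e) for e in errors) / n

    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot

    return {"mse": mse, "rmse": rmse, "mae": mae, "r2": r2}

# Toy values (hypothetical, not the notebook's predictions).
demo = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
print(demo)
```

Because RMSE is in the same units as the target, it is easier to compare against MAE than MSE is.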


In [44]:
#Extra: feature importance from Lasso
lasso_coefficients = pd.DataFrame({
    'feature': X_train.columns,
    'coefficient': lasso_grid.best_estimator_.coef_
})
lasso_coefficients = lasso_coefficients.sort_values('coefficient', key=abs, ascending=False)

print("\nTop 10 Most Important Features (Lasso):")
print(lasso_coefficients.head(10))

print(f"\nNumber of features with zero coefficients: {(lasso_coefficients['coefficient'] == 0).sum()}")


Top 10 Most Important Features (Lasso):
           feature  coefficient
12    type_organic     0.222045
13  region_encoded     0.168781
10         quarter     0.071480
0     Total Volume    -0.048745
8             year     0.045785
7      XLarge Bags     0.030464
6       Large Bags    -0.009955
2             4225    -0.007392
1             4046    -0.000000
3             4770    -0.000000

Number of features with zero coefficients: 6


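Lasso zeroed out six coefficients. Two of them appear in the table above: the product lookup (PLU) codes 4046 and 4770. The other four are the features missing from the top 10, namely Total Bags, Small Bags, month, and day_of_year. The remaining PLU code (4225), Large Bags, and Total Volume kept small negative coefficients. Whether or not the avocados were organic had the largest standardized coefficient (0.222), followed by the target-encoded region (0.169).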
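
To read the zeroed features directly instead of from a truncated top-10 table, one option is to filter the coefficient table on zero. The sketch below uses a toy table; the feature names and values are placeholders, not the notebook's results.

```python
import pandas as pd

# Toy coefficient table (placeholder names and values).
coef_table = pd.DataFrame({
    "feature": ["f_a", "f_b", "f_c", "f_d"],
    "coefficient": [0.22, -0.0, -0.05, 0.0],
})

# Features Lasso removed entirely (note that -0.0 == 0 is True).
zeroed = coef_table.loc[coef_table["coefficient"] == 0, "feature"].tolist()

# Surviving features, ordered by absolute coefficient size.
kept = coef_table.loc[coef_table["coefficient"] != 0]
kept = kept.sort_values("coefficient", key=abs, ascending=False)
kept_order = kept["feature"].tolist()

print(f"Zeroed features: {zeroed}")
print(f"Kept features (by |coefficient|): {kept_order}")
```

Applied to `lasso_coefficients`, the same filter would list all six zeroed features by name.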