# Japanese Grape Price Analysis
## Modeling Price Determinants

This notebook examines the relationship between grape prices
and physical characteristics using linear regression with interaction terms,
ridge regression, and a random forest.


### Linear Regression with Interaction Terms
This model examines the linear relationship between grape prices
and physical characteristics, including interaction effects.
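
Written out, the specification fitted in the cell below takes roughly the following form. The symbols are shorthand introduced here for illustration, not notation from the data: $g$ is grain diameter, $\ell$ bunch length, $w$ bunch width, $t$ year introduced, and $c$ the binary group indicator (1 for the Green Group).

$$
\widehat{\text{price}} = \beta_0 + \beta_1 g + \beta_2 \ell + \beta_3 w + \beta_4 t + \beta_5 c
+ \beta_6\, g\ell + \beta_7\, g w + \beta_8\, \ell w
+ \beta_9\, g c + \beta_{10}\, \ell c + \beta_{11}\, t c
$$

Under this form the marginal effect of grain diameter is not a single number but $\beta_1 + \beta_6 \ell + \beta_7 w + \beta_9 c$, so it depends on bunch size and group. Because every column is standardized before fitting, the estimated coefficients refer to the scaled columns rather than to the raw units.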

In [6]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import r2_score, mean_absolute_error

df = pd.read_csv("grape_data_cleaned.csv")

green_list = [
    'Queen Seven', 'Haneou', 'Queen Rouge', 'Shine Muscat',
    'Fujinokagayaki', 'Seto Giants', 'Shigyoku', 'Shinku', 'Kaiji',
    'Queen Roug', 'Scarlet', 'Red Shine Muscat', 'Suiho',
    'Sunshine Red', 'Ving', 'Wasekaiji', 'Pizzutello Bianco',
    'Nouvelle Rose', 'Muscat Noir', 'Miwahime'
]
df["group"] = df["grape_variety"].apply(
    lambda x: "Green Group" if x in green_list else "Purple Group"
)

# ---- features ----
df["group_binary"] = df["group"].map({"Purple Group": 0, "Green Group": 1})

# ---- Interaction terms ----
df["grain_x_length"] = df["grain_diameter_mm"] * df["bunch_length_cm"]
df["grain_x_width"]  = df["grain_diameter_mm"] * df["bunch_width_cm"]
df["length_x_width"] = df["bunch_length_cm"] * df["bunch_width_cm"]

df["grain_x_group"] = df["grain_diameter_mm"] * df["group_binary"]
df["length_x_group"] = df["bunch_length_cm"] * df["group_binary"]
df["year_x_group"] = df["year_introduced"] * df["group_binary"]

# ---- Build model dataset ----
df_model = df[[
    "grain_diameter_mm",
    "bunch_length_cm",
    "bunch_width_cm",
    "year_introduced",
    "group_binary",

    # interaction terms:
    "grain_x_length",
    "grain_x_width",
    "length_x_width",
    "grain_x_group",
    "length_x_group",
    "year_x_group",

    "price_yen_with_tax"
]]

# ---- Train-test split ----
X = df_model.drop("price_yen_with_tax", axis=1)
y = df_model["price_yen_with_tax"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# ---- Linear Regression with StandardScaler ----
model = Pipeline([
    ("scaler", StandardScaler()),
    ("lr", LinearRegression())
])

model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# ---- Evaluate ----
print("R² score:", r2_score(y_test, y_pred))
print("MAE:", mean_absolute_error(y_test, y_pred))


R² score: 0.4931887913444939
MAE: 723.5234909251129


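On the held-out test set the model explains roughly half of the price variance
(R² ≈ 0.49) with a mean absolute error of about 724 yen, suggesting that grain and
bunch size, together with introduction year and variety group, play a substantial
role in price formation.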
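
Since the pipeline standardizes every column, the fitted coefficients can be compared as "price change per one standard deviation of the feature". The sketch below shows one way to pull them out of a fitted pipeline. It runs on synthetic stand-in data with hypothetical column names (`diameter`, `length`, `diameter_x_length`), not on the grape dataset, so the numbers it prints say nothing about real prices.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: NOT the real grape dataset).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "diameter": rng.normal(25, 3, size=200),
    "length": rng.normal(20, 4, size=200),
})
demo["diameter_x_length"] = demo["diameter"] * demo["length"]
price = (
    50 * demo["diameter"]
    + 4 * demo["diameter_x_length"]
    + rng.normal(scale=150, size=200)
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("lr", LinearRegression()),
])
pipe.fit(demo, price)

# Coefficients live on the final step; they are per standard deviation
# of each (scaled) column, so their magnitudes are directly comparable.
coef_table = pd.DataFrame({
    "feature": demo.columns,
    "coef_per_sd": pipe.named_steps["lr"].coef_,
}).sort_values("coef_per_sd", key=np.abs, ascending=False)

print(coef_table)
```

Applied to the notebook's own pipeline, the same pattern would use `model.named_steps["lr"].coef_` together with `X.columns`. Interaction columns are strongly correlated with their parent features, so individual coefficients should be read with caution.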

In [7]:
from sklearn.metrics import r2_score, mean_absolute_error

y_train_pred = model.predict(X_train)
train_r2 = r2_score(y_train, y_train_pred)
train_mae = mean_absolute_error(y_train, y_train_pred)

y_test_pred = model.predict(X_test)
test_r2 = r2_score(y_test, y_test_pred)
test_mae = mean_absolute_error(y_test, y_test_pred)

print("Train R²:", train_r2)
print("Test  R²:", test_r2)
print("Train MAE:", train_mae)
print("Test  MAE:", test_mae)

Train R²: 0.47012910752107007
Test  R²: 0.4931887913444939
Train MAE: 748.95544702011
Test  MAE: 723.5234909251129
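
Train and test scores are close, and the test R² is slightly higher than the train R², which can happen by chance on a single 80/20 split. A minimal sketch of k-fold cross-validation, which averages over several splits, is shown below. It uses synthetic stand-in data rather than the grape dataset, and the variable names (`X_demo`, `y_demo`, `pipe`) are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: NOT the real grape dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = (
    1000
    + 300 * X_demo[:, 0]
    + 150 * X_demo[:, 0] * X_demo[:, 1]
    + rng.normal(scale=200, size=200)
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("lr", LinearRegression()),
])

# Shuffled 5-fold CV; the scaler is refit inside each training fold,
# so no information from the validation fold leaks into the scaling.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
r2_scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="r2")
mae_scores = -cross_val_score(
    pipe, X_demo, y_demo, cv=cv, scoring="neg_mean_absolute_error"
)

print("R² per fold:", np.round(r2_scores, 3))
print("Mean R²:", r2_scores.mean(), "+/-", r2_scores.std())
print("Mean MAE:", mae_scores.mean())
```

With the notebook's data, passing `model`, `X`, and `y` to `cross_val_score` would give a spread of scores against which the single-split R² can be judged.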


### Ridge Regression
Ridge regression is applied to assess model stability
under coefficient regularization.

In [8]:
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score, mean_absolute_error

X = df_model.drop("price_yen_with_tax", axis=1)
y = df_model["price_yen_with_tax"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# ---- Ridge Regression (L2 Regularization) ----

ridge_model = Pipeline([
    ("scaler", StandardScaler()),
    ("ridge", Ridge(alpha=1.0))
])

ridge_model.fit(X_train, y_train)
y_pred_ridge = ridge_model.predict(X_test)

print("Ridge R² score:", r2_score(y_test, y_pred_ridge))
print("Ridge MAE:", mean_absolute_error(y_test, y_pred_ridge))

Ridge R² score: 0.4646111399364067
Ridge MAE: 747.1400801175815


Test performance is slightly below the linear baseline (R² 0.465 vs. 0.493;
MAE 747 vs. 724 yen) but of the same order, indicating that the baseline results
are not driven by extreme coefficients.
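
One way to look at coefficient stability more directly is to compare the coefficient vector across several values of `alpha`. The sketch below does this on synthetic data with two nearly collinear columns. The data, the `fit_coefs` helper, and the alpha grid are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: NOT the real grape dataset).
rng = np.random.default_rng(0)
a = rng.normal(size=300)
b = a + rng.normal(scale=0.1, size=300)  # nearly collinear with a
X_demo = np.column_stack([a, b, a * b])
y_demo = 500 * a + 200 * a * b + rng.normal(scale=100, size=300)


def fit_coefs(estimator):
    """Fit scaler + estimator and return the coefficients on scaled columns."""
    pipe = Pipeline([("scaler", StandardScaler()), ("reg", estimator)])
    pipe.fit(X_demo, y_demo)
    return pipe.named_steps["reg"].coef_


ols_coefs = fit_coefs(LinearRegression())
print("OLS        :", np.round(ols_coefs, 1))

# Larger alpha means a stronger L2 penalty, which shrinks the coefficients.
ridge_norms = {}
for alpha in [0.1, 1.0, 10.0, 100.0]:
    coefs = fit_coefs(Ridge(alpha=alpha))
    ridge_norms[alpha] = np.linalg.norm(coefs)
    print(f"alpha={alpha:<6}:", np.round(coefs, 1))
```

If the coefficients barely move between ordinary least squares and moderate `alpha` values, the unregularized fit was not relying on large offsetting coefficients. `RidgeCV` from scikit-learn could be used to pick `alpha` by cross-validation instead of fixing it at 1.0.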

### Random Forest Regression
A random forest is used to capture potential non-linear
relationships between the features and grape prices.

In [10]:
from sklearn.ensemble import RandomForestRegressor

X = df_model.drop("price_yen_with_tax", axis=1)
y = df_model["price_yen_with_tax"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# ---- Random Forest Model ----
rf_model = RandomForestRegressor(
    n_estimators=500,
    max_depth=None,
    min_samples_split=2,
    min_samples_leaf=1,
    random_state=42
)

rf_model.fit(X_train, y_train)

# ---- Prediction ----
y_pred_rf = rf_model.predict(X_test)

# ---- Evaluation ----
print("Random Forest R² score:", r2_score(y_test, y_pred_rf))
print("Random Forest MAE:", mean_absolute_error(y_test, y_pred_rf))

# ---- Feature Importance ----
importances = rf_model.feature_importances_
feature_names = X.columns

importance_df = pd.DataFrame({
    "feature": feature_names,
    "importance": importances
}).sort_values(by="importance", ascending=False)

importance_df

Random Forest R² score: 0.4218847299062113
Random Forest MAE: 804.2419436842117


|    | feature           | importance |
|----|-------------------|------------|
| 5  | grain_x_length    | 0.441421   |
| 6  | grain_x_width     | 0.147976   |
| 0  | grain_diameter_mm | 0.099285   |
| 7  | length_x_width    | 0.076400   |
| 8  | grain_x_group     | 0.063945   |
| 2  | bunch_width_cm    | 0.063267   |
| 3  | year_introduced   | 0.046392   |
| 1  | bunch_length_cm   | 0.029294   |
| 10 | year_x_group      | 0.016271   |
| 9  | length_x_group    | 0.013707   |


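Impurity-based feature importance is dominated by size-related interaction terms:
grain_x_length alone accounts for about 44% of the total and grain_x_width for a
further 15%. The random forest nonetheless scores lower on the test set (R² ≈ 0.42)
than the linear model (R² ≈ 0.49), and correlated features share importance among
themselves, so the ranking is best read as indicative.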
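
The `feature_importances_` attribute used above is computed from impurity reductions on the training data. Permutation importance on held-out data is a common complement. The following is a minimal sketch on synthetic data, with hypothetical column names (`size_a`, `size_b`, `noise`, `a_x_b`) standing in for the notebook's features.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: NOT the real grape dataset).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "size_a": rng.normal(size=400),
    "size_b": rng.normal(size=400),
    "noise": rng.normal(size=400),  # unrelated to the target
})
demo["a_x_b"] = demo["size_a"] * demo["size_b"]
target = 3 * demo["size_a"] + 2 * demo["a_x_b"] + rng.normal(scale=0.5, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(
    demo, target, test_size=0.2, random_state=42
)
rf = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Shuffle one column at a time on the test set and record the drop in R².
perm = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=42)

comparison = pd.DataFrame({
    "feature": demo.columns,
    "impurity_importance": rf.feature_importances_,
    "permutation_importance": perm.importances_mean,
}).sort_values("permutation_importance", ascending=False)

print(comparison)
```

In the notebook this would correspond to calling `permutation_importance(rf_model, X_test, y_test, ...)`. Agreement between the two rankings would strengthen the reading that the size interactions matter, while disagreement would suggest the impurity-based ranking reflects how the trees split rather than predictive value.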