### Modeling

With your (almost?) final dataset in hand, it is now time to build some models.  Here, you should build a number of different regression models with the price as the target.  In building your models, you should explore different parameters and be sure to cross-validate your findings.

1 Linear Regression
No hyperparameters to tune.
I will use it as a baseline.

In [None]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.impute import SimpleImputer

# Model 1: Simple model
model1_features = ['year', 'odometer_log', 'age', 'price_per_mile']

# Create a SimpleImputer with strategy='median'.
# It replaces NaN values with the median of each column; the mean would also work.
imputer = SimpleImputer(strategy='median')

# Model 2: Extended model with the ordinal encoded columns.
model2_features = model1_features + ['condition_ordinal', 'title_status_ordinal', 'size_ordinal']

# Model 3: Full model using all features (all columns in X_train)
model3_features = X_train.columns.tolist()

# --- Build, fit, and evaluate the models ---

MyResults = {}  # to store evaluation metrics for each model

# Model 1
model1 = LinearRegression()

# Fill NaN values in the Model 1 features before fitting.
# Use the training-set median for both splits so no test information leaks into the fit.
# Assigning the result back avoids the pandas chained-assignment warning from inplace=True.
for feature in model1_features:
    train_median = X_train[feature].median()
    X_train[feature] = X_train[feature].fillna(train_median)
    X_test[feature] = X_test[feature].fillna(train_median)

model1.fit(X_train[model1_features], y_train)
pred1 = model1.predict(X_test[model1_features])
mse1 = mean_squared_error(y_test, pred1)
mae1 = mean_absolute_error(y_test, pred1)
MyResults['Model 1'] = {'Features': model1_features, 'MSE': mse1, 'MAE': mae1}

# Model 2
model2 = LinearRegression()
# use the imputer
X_train_model2 = pd.DataFrame(imputer.fit_transform(X_train[model2_features]), columns=model2_features, index=X_train.index)
X_test_model2 = pd.DataFrame(imputer.transform(X_test[model2_features]), columns=model2_features, index=X_test.index)

model2.fit(X_train_model2, y_train)
pred2 = model2.predict(X_test_model2)

mse2 = mean_squared_error(y_test, pred2)
mae2 = mean_absolute_error(y_test, pred2)
MyResults['Model 2'] = {'Features': model2_features, 'MSE': mse2, 'MAE': mae2}

# Model 3 (not run yet) ------------------------------------------------------------------------
# Fitting the full model fails with:
#   ValueError: Cannot use median strategy with non-numeric data: could not convert string to float
# because X_train still contains non-numeric columns. The draft below imputes only the numerical
# columns, selected with select_dtypes(include=np.number). The non-numeric columns would still
# need to be encoded before LinearRegression can fit them, so this needs more time to debug.
# model3 = LinearRegression()
# numerical_features_model3 = X_train[model3_features].select_dtypes(include=np.number).columns.tolist()

# Apply imputation only to numerical features
# X_train_model3_num = pd.DataFrame(imputer.fit_transform(X_train[numerical_features_model3]),
#                                   columns=numerical_features_model3, index=X_train.index)
# X_test_model3_num = pd.DataFrame(imputer.transform(X_test[numerical_features_model3]),
#                                  columns=numerical_features_model3, index=X_test.index)

# X_train_model3 = pd.concat([X_train_model3_num, X_train[model3_features].select_dtypes(exclude=np.number)], axis=1)
# X_test_model3 = pd.concat([X_test_model3_num, X_test[model3_features].select_dtypes(exclude=np.number)], axis=1)
# Previous code that raised the error:
#   X_train_model3 = pd.DataFrame(imputer.fit_transform(X_train[model3_features]), columns=model3_features, index=X_train.index)
#   X_test_model3 = pd.DataFrame(imputer.transform(X_test[model3_features]), columns=model3_features, index=X_test.index)
# model3.fit(X_train_model3, y_train)
# pred3 = model3.predict(X_test_model3)
# mse3 = mean_squared_error(y_test, pred3)
# mae3 = mean_absolute_error(y_test, pred3)
# MyResults['Model 3'] = {'Features': model3_features, 'MSE': mse3, 'MAE': mae3}

# --- Print the evaluation metrics for each model ---
for model_name, res in MyResults.items():
    print(f"\n{model_name} using features: {res['Features']}")
    print(f"  Mean Squared Error: {res['MSE']:.2f}")
    print(f"  Mean Absolute Error: {res['MAE']:.2f}")

# TODO: apply sequential feature selection (w9.2, w9.3, w9.4) with a Ridge model

# TODO: use GridSearchCV to find the best alpha by iterating over a range of alphas





Model 1 using features: ['year', 'odometer_log', 'age', 'price_per_mile']
  Mean Squared Error: 129011559.82
  Mean Absolute Error: 8729.75

Model 2 using features: ['year', 'odometer_log', 'age', 'price_per_mile', 'condition_ordinal', 'title_status_ordinal', 'size_ordinal']
  Mean Squared Error: 127873911.88
  Mean Absolute Error: 8658.76
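The baseline above is scored on a single train/test split. A minimal sketch of how the same kind of model could be cross-validated, with the median imputation kept inside a pipeline so it is refit on each fold. The data here is synthetic and the names `X_demo`, `y_demo`, and `baseline` are placeholders, not part of the notebook; with the real data, `X_train[model1_features]` and `y_train` would take their place.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for X_train / y_train (hypothetical data, not the car dataset)
rng = np.random.default_rng(42)
n = 200
X_demo = pd.DataFrame({
    "year": rng.integers(1995, 2022, size=n).astype(float),
    "odometer_log": rng.normal(11.0, 1.0, size=n),
})
y_demo = (500.0 * (X_demo["year"] - 1995)
          - 2000.0 * X_demo["odometer_log"]
          + rng.normal(0, 1000, size=n))

# Blank out 10% of one column so the imputer has something to fill
missing_rows = X_demo.sample(frac=0.1, random_state=0).index
X_demo.loc[missing_rows, "odometer_log"] = np.nan

# The imputer is fit on the training part of each fold only
baseline = make_pipeline(SimpleImputer(strategy="median"), LinearRegression())
neg_mse = cross_val_score(baseline, X_demo, y_demo, cv=5,
                          scoring="neg_mean_squared_error")
cv_rmse = np.sqrt(-neg_mse)  # back to the units of the target

print("RMSE per fold:", np.round(cv_rmse, 1))
print("Mean CV RMSE:", round(cv_rmse.mean(), 1))
```

Reporting RMSE rather than MSE keeps the error in the same units as the price, which is easier to read next to the MAE.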


2 Ridge Regression
Hyperparameter: alpha. The search range should be on a log scale, from 10^-3 to 10^3 (the grid below covers 10^-3 to 10^2).
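
One way to build a log-spaced alpha grid over that full range is `np.logspace`. This is only a sketch; `alphas` and `param_grid` are illustrative names, and the resulting list would go into the `'alpha'` entry of the search parameters.

```python
import numpy as np

# Seven values, one per decade, from 10^-3 to 10^3
alphas = np.logspace(-3, 3, num=7)

# Hypothetical grid in the form GridSearchCV expects
param_grid = {"alpha": alphas.tolist()}
print(param_grid["alpha"])
```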

In [None]:
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.metrics import mean_squared_error, mean_absolute_error
from scipy.stats import uniform

features_Ridge = ['year', 'odometer_log', 'age', 'price_per_mile']
# Option 1: tune the model with GridSearchCV.
# Define the Ridge estimator with random_state=42 so runs are reproducible and comparable.
# I plan to try the 'auto' and 'svd' solvers to compute the coefficients, because I believe
# some variables in the dataset are correlated.
MyRidge = Ridge(random_state=42)
MyGridSearchCV_params = {
    'alpha': [0.001, 0.01, 0.1, 1, 10, 100],
    'solver':['auto', 'svd'],
    'max_iter':[None,100,1000]
}
# Set up GridSearchCV with 5-fold cross-validation and negative MSE as the scoring metric
grid_search = GridSearchCV(
    estimator= MyRidge,
    param_grid= MyGridSearchCV_params,
    scoring='neg_mean_squared_error',
    cv=5,
    n_jobs=-1
)
grid_search.fit(X_train[features_Ridge], y_train)

# Retrieve the best Ridge estimator found by GridSearchCV
best_ridge_grid = grid_search.best_estimator_
print("Best parameters from GridSearchCV:", grid_search.best_params_)

# Evaluate the best estimator on the test set
prediction_grid = best_ridge_grid.predict(X_test[features_Ridge])

mse_grid = mean_squared_error(y_test, prediction_grid)
mae_grid = mean_absolute_error(y_test, prediction_grid)
print("GridSearchCV Ridge - Test MSE:", mse_grid)
print("GridSearchCV Ridge - Test MAE:", mae_grid)


# Use RandomizedSearchCV to compare with GridSearchCV,
# sampling from the same parameter values as above.
GridSearchCV_params = {
    'alpha': [0.001, 0.01, 0.1, 1, 10, 100],
    'solver':['auto', 'svd'],
    'max_iter':[None,100,1000]
}
MyRandom_search = RandomizedSearchCV(
    estimator= MyRidge,
    param_distributions= GridSearchCV_params,
    scoring='neg_mean_squared_error',
    cv=5,
    n_jobs=-1,
    n_iter=10,
    random_state=42
)

MyRandom_search.fit(X_train[features_Ridge], y_train)
best_ridge_random = MyRandom_search.best_estimator_
print("Best parameters from RandomizedSearchCV:", MyRandom_search.best_params_)

prediction_MyRandom_search = best_ridge_random.predict(X_test[features_Ridge])
print("MyRandomizedSearchCV Ridge -Test MSE:",mean_squared_error(y_test, prediction_MyRandom_search))
print("MyRandomizedSearchCV Ridge -Test MAE:",mean_absolute_error(y_test, prediction_MyRandom_search))





Best parameters from GridSearchCV: {'alpha': 1, 'max_iter': None, 'solver': 'auto'}
GridSearchCV Ridge - Test MSE: 4.464198509715481e-06
GridSearchCV Ridge - Test MAE: 0.0016075500395790945
Best parameters from RandomizedSearchCV: {'solver': 'svd', 'max_iter': 100, 'alpha': 1}
MyRandomizedSearchCV Ridge -Test MSE: 4.4641985097154785e-06
MyRandomizedSearchCV Ridge -Test MAE: 0.0016075500395791073
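
The searches above select alpha by cross-validated negative MSE but only the test-set error is printed. A small sketch, on synthetic data from `make_regression`, of how the cross-validated score could be read back from a fitted search. `X_demo`, `y_demo`, and `search` are placeholder names; with the notebook's objects, `grid_search.best_score_` and `grid_search.cv_results_` would be used the same way.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic data (an assumption for illustration only)
X_demo, y_demo = make_regression(n_samples=200, n_features=4,
                                 noise=10.0, random_state=42)

search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.001, 0.01, 0.1, 1, 10, 100]},
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X_demo, y_demo)

# best_score_ is the mean cross-validated score of the winning candidate
# (negative MSE here), so flip the sign before taking the square root
cv_rmse = np.sqrt(-search.best_score_)
print("Best alpha:", search.best_params_["alpha"])
print("Cross-validated RMSE:", round(cv_rmse, 3))

# Mean CV error for every alpha, to see how sensitive the model is
results = search.cv_results_
for alpha, score in zip(results["param_alpha"], results["mean_test_score"]):
    print(f"alpha={alpha}: CV MSE={-score:.3f}")
```

If the CV error barely changes across alphas, that would explain why both searches end up with the same test metrics.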


3 Lasso Regression

In [None]:
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.metrics import mean_squared_error, mean_absolute_error
from scipy.stats import uniform

# Define the feature set (using the same features as for Ridge)
MyFeatures_Lasso = ['year', 'odometer_log', 'age', 'price_per_mile']

# ---------------------------
# Option 1: Using GridSearchCV for Lasso

# Initialize Lasso estimator with a fixed random_state for reproducibility
my_lasso = Lasso(random_state=42)

# Define the parameter grid for GridSearchCV.
# Note: Lasso does not have a 'solver' parameter like Ridge, but it offers 'selection' (cyclic or random)
param_grid_lasso = {
    'alpha': [0.001, 0.01, 0.1, 1, 10, 100],
    'max_iter': [1000, 5000],
    'tol': [0.0001, 0.001, 0.01],
    'selection': ['cyclic', 'random']
}

# Set up GridSearchCV with 5-fold cross-validation
lasso_grid_search = GridSearchCV(
    estimator=my_lasso,
    param_grid=param_grid_lasso,
    scoring='neg_mean_squared_error',
    cv=5,
    n_jobs=-1
)

# Fit GridSearchCV on the training data
lasso_grid_search.fit(X_train[MyFeatures_Lasso], y_train)

# Retrieve the best Lasso estimator found
best_lasso_grid = lasso_grid_search.best_estimator_
print("Best parameters from GridSearchCV (Lasso):", lasso_grid_search.best_params_)

# Evaluate the best estimator on the test set
pred_lasso_grid = best_lasso_grid.predict(X_test[MyFeatures_Lasso])
mse_lasso_grid = mean_squared_error(y_test, pred_lasso_grid)
mae_lasso_grid = mean_absolute_error(y_test, pred_lasso_grid)
print("GridSearchCV Lasso - Test MSE:", mse_lasso_grid)
print("GridSearchCV Lasso - Test MAE:", mae_lasso_grid)

# ---------------------------
# Option 2: Using RandomizedSearchCV for Lasso

# Define parameter distributions for RandomizedSearchCV.
# For 'alpha', sample from a uniform distribution; scipy's uniform(loc, scale) covers [loc, loc + scale].
param_dist_lasso = {
    'alpha': uniform(0.001, 100),
    'max_iter': [1000, 5000],
    'tol': [0.0001, 0.001, 0.01],
    'selection': ['cyclic', 'random']
}

lasso_random_search = RandomizedSearchCV(
    estimator=my_lasso,
    param_distributions=param_dist_lasso,
    scoring='neg_mean_squared_error',
    cv=5,
    n_jobs=-1,
    n_iter=10,  # reduce iterations to speed up the search
    random_state=42
)

# Fit RandomizedSearchCV on the training data
lasso_random_search.fit(X_train[MyFeatures_Lasso], y_train)

# Retrieve the best Lasso estimator found
best_lasso_random = lasso_random_search.best_estimator_
print("Best parameters from RandomizedSearchCV (Lasso):", lasso_random_search.best_params_)

# Evaluate the best estimator on the test set
pred_lasso_random = best_lasso_random.predict(X_test[MyFeatures_Lasso])
mse_lasso_random = mean_squared_error(y_test, pred_lasso_random)
mae_lasso_random = mean_absolute_error(y_test, pred_lasso_random)
print("RandomizedSearchCV Lasso - Test MSE:", mse_lasso_random)
print("RandomizedSearchCV Lasso - Test MAE:", mae_lasso_random)


Best parameters from GridSearchCV (Lasso): {'alpha': 0.001, 'max_iter': 1000, 'selection': 'cyclic', 'tol': 0.0001}
GridSearchCV Lasso - Test MSE: 5.7222877858253145e-06
GridSearchCV Lasso - Test MAE: 0.001990068505051068
Best parameters from RandomizedSearchCV (Lasso): {'alpha': 37.455011884736244, 'max_iter': 1000, 'selection': 'cyclic', 'tol': 0.01}
RandomizedSearchCV Lasso - Test MSE: 5.7222877858253145e-06
RandomizedSearchCV Lasso - Test MAE: 0.001990068505051068
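
The randomized search above draws alpha from `uniform(0.001, 100)`, which rarely produces small values. A sketch comparing it with `scipy.stats.loguniform`, which spreads draws evenly across decades. `param_dist_demo` is a hypothetical replacement for the `'alpha'` entry, not something the notebook ran.

```python
import numpy as np
from scipy.stats import loguniform, uniform

# uniform(loc, scale) draws from [loc, loc + scale], so about 99% of alphas exceed 1
uniform_draws = uniform(0.001, 100).rvs(size=1000, random_state=42)

# loguniform(a, b) makes log(alpha) uniform on [log(a), log(b)]
log_draws = loguniform(1e-3, 1e2).rvs(size=1000, random_state=42)

print("Share of draws below 1.0 (uniform):   ", np.mean(uniform_draws < 1.0))
print("Share of draws below 1.0 (loguniform):", np.mean(log_draws < 1.0))

# Hypothetical 'alpha' entry for a RandomizedSearchCV parameter dictionary
param_dist_demo = {"alpha": loguniform(1e-3, 1e2)}
```

With a log-uniform prior, roughly three of the five decades between 10^-3 and 10^2 lie below 1, so small alphas get sampled as often as large ones.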


### Evaluation

With some modeling accomplished, we aim to reflect on what we identify as a high-quality model and what we are able to learn from this.  We should review our business objective and explore how well we can provide meaningful insight into drivers of used car prices.  Your goal now is to distill your findings and determine whether the earlier phases need revisitation and adjustment or if you have information of value to bring back to your client.

#### Evaluation Answer 1

Both models perform very similarly, but the Ridge model shows a slight advantage on the test-set error metrics:

**Ridge Regression:**
Test MSE: ~4.464e-06
Test MAE: ~0.001608

**Lasso Regression:**
Test MSE: ~5.722e-06
Test MAE: ~0.001990

Note that these errors are many orders of magnitude smaller than the linear regression baseline (MSE of about 1.29e8) on the same four features. That suggests the data or the target was on a different scale when these cells ran, so I should verify this before comparing Ridge and Lasso against the baseline.



#### Evaluation Answer 2 ####

1. Model Performance: The Ridge model achieved lower MSE and MAE values than Lasso. Ridge may be better at capturing the underlying relationships without introducing too much bias, and it trains at an acceptable speed.

2. Regularization Differences: Ridge Regression applies L2 regularization, which shrinks coefficients but rarely zeroes them out. Lasso Regression applies L1 regularization, which can set some coefficients exactly to zero. If most of the features are useful, the sparsity induced by Lasso may not be as beneficial and could even hurt performance slightly. At this stage I am not sure, because I have not had time to do a more complete analysis. That is something I should do.

3. Hyperparameters:
Both GridSearchCV and RandomizedSearchCV reached the same performance metrics within each model, suggesting that the hyperparameter search was adequate.
For Ridge, both searches pointed to alpha = 1.
For Lasso, although the best parameters differ between GridSearchCV and RandomizedSearchCV, the performance remained the same, indicating that several hyperparameter combinations may yield similar outcomes.

4. Choosing the Best Model
Given the slightly lower error metrics, Ridge Regression looks like the best choice for this task. However, since the differences are relatively small, it is also worth considering other factors and the potential need for more feature selection.
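
To make the L1 versus L2 difference concrete, here is a small sketch on synthetic data where only three of six features matter. The data, the alpha values, and the names `X_demo`, `y_demo`, and `true_coef` are assumptions chosen for illustration; nothing here comes from the car dataset.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n_samples = 300
X_demo = rng.normal(size=(n_samples, 6))

# Only the first three columns drive the target; the last three are pure noise
true_coef = np.array([5.0, -3.0, 2.0, 0.0, 0.0, 0.0])
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=n_samples)

ridge = Ridge(alpha=1.0).fit(X_demo, y_demo)
lasso = Lasso(alpha=0.5).fit(X_demo, y_demo)

# Ridge shrinks every coefficient a little; Lasso drops the noise columns entirely
print("Ridge coefficients:", np.round(ridge.coef_, 3))
print("Lasso coefficients:", np.round(lasso.coef_, 3))
print("Coefficients set exactly to zero by Lasso:", int(np.sum(lasso.coef_ == 0)))
```

Running the same comparison on the fitted `best_ridge_grid.coef_` and `best_lasso_grid.coef_` would show whether Lasso is actually discarding any of the four features used here.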

#### Evaluation Answer 3 ####

Which earlier steps in Data Preparation could be adjusted to get better results?

1. Refine Outlier Detection: Experiment more with the IQR factor. Instead of using a fixed factor (e.g., 1.5) for all variables, I could adjust it per variable based on each distribution.

2. Enhanced Missing Value Imputation: Use iterative imputation or separate imputation strategies tailored to different features. For instance, I could use mode imputation for categorical features and mean/median (or even a predictive model) for numerical features.

3. Feature Transformation and Scaling: Apply logarithmic transformations to more variables (e.g., price) to stabilize variance and make the distributions closer to normal. So far I have only applied a log transformation to the odometer.

4. More Correlation Analysis: With more time I could examine multicollinearity between predictors and remove them or combine them into new features.
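
Points 2 and 3 above could be sketched as a single pipeline: median imputation for numeric columns, mode imputation plus one-hot encoding for categorical columns, and a log transform on the price target. The tiny `cars` frame and its values are made up for illustration, and the column choice is an assumption rather than the notebook's final feature set.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical rows; column names mirror the notebook but the values are invented
cars = pd.DataFrame({
    "year": [2015, 2018, np.nan, 2010, 2020, 2012],
    "odometer": [60000, 30000, 45000, np.nan, 10000, 120000],
    "condition": ["good", "excellent", np.nan, "fair", "excellent", "good"],
})
price = pd.Series([15000, 24000, 18000, 6000, 32000, 7000], dtype=float)

numeric_cols = ["year", "odometer"]
categorical_cols = ["condition"]

preprocess = ColumnTransformer([
    # Median for numeric features
    ("num", SimpleImputer(strategy="median"), numeric_cols),
    # Mode for categorical features, then one-hot encode
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_cols),
])

# Fit on log1p(price) and map predictions back with expm1
model = TransformedTargetRegressor(
    regressor=Pipeline([("prep", preprocess), ("ridge", Ridge(alpha=1.0))]),
    func=np.log1p,
    inverse_func=np.expm1,
)
model.fit(cars, price)
predictions = model.predict(cars)
print(np.round(predictions, 0))
```

Because the imputers live inside the pipeline, they would also be refit per fold if this model were passed to GridSearchCV.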



### Deployment

Now that we've settled on our models and findings, it is time to deliver the information to the client.  You should organize your work as a basic report that details your primary findings.  Keep in mind that your audience is a group of used car dealers interested in fine-tuning their inventory.

### Used Car Sales Optimization: A Data-Driven Approach ###

**Introduction**

In today’s competitive market, understanding what drives used car prices is crucial to managing your inventory effectively. My recent analysis leverages data science techniques to help you optimize your vehicle selection and pricing strategy. By identifying the key factors that influence car values, you can make better purchasing decisions, adjust pricing appropriately, and ultimately improve your profit margins.

**What I analyzed**

I examined a large dataset containing detailed information on hundreds of thousands of used (and new) cars. Key variables in my analysis included:

Year: The production year of the vehicle, which helps determine its age.
Mileage (Odometer): Total miles driven, indicating wear and tear.
Price per Mile: A derived metric that divides the car’s price by its mileage, offering insight into relative value.
Other Factors: Additional features such as the vehicle's condition and manufacturer were also considered during the analysis.

**The Modeling Process**

To predict used car prices accurately, I applied two types of regression models:
Ridge Regression: This model applies a technique called L2 regularization, which shrinks the impact of each variable in a balanced way. My results showed that Ridge Regression performed very well, suggesting that all features in your dataset contribute valuable information to the final price.
Lasso Regression: Lasso uses L1 regularization, which can effectively “turn off” less important features. However, my analysis found that while Lasso was also effective, Ridge Regression slightly outperformed it, indicating that none of the features were completely redundant in predicting car prices.

**Key Findings from the Models**

**Ridge Regression Results:**

Mean Squared Error (MSE): 4.46e-06
Mean Absolute Error (MAE): 0.00161

**Lasso Regression Results:**
MSE: 5.72e-06
MAE: 0.00199

The lower error rates in Ridge Regression mean that, on average, its predictions are closer to the actual sale prices. This suggests that when all factors are considered, each one plays a role in determining the car’s value.


**How This Helps You Fine-Tune your Inventory**

1. Accurate Pricing Based on Key Features
Vehicle Age: Newer models tend to hold their value better. Use the car’s production year and calculated age to adjust pricing.
Mileage and Price per Mile: Lower mileage often indicates a higher value per mile. By comparing similar vehicles, you can identify bargains or overpriced listings.
2. Data-Driven Purchasing Decisions
Focus on Well-Rounded Vehicles: Since Ridge Regression shows that all measured features contribute to the final price, consider a holistic approach when evaluating potential purchases. Vehicles that score well across multiple factors are likely to be strong investments.
3. Inventory and Pricing Adjustments
Dynamic Pricing: Use these insights to set competitive prices that reflect the true market value, reducing the time vehicles spend on your lot.
Targeted Inventory Acquisition: Knowing which factors have the greatest impact on price can help you identify underpriced vehicles with potential for profit after refurbishment or market correction.
4. Continuous Improvement
Refining Data Collection: The more accurate and detailed your data (e.g., precise mileage, detailed condition reports), the more reliable your pricing models will be.
Regular Reassessment: The market is always evolving. Regularly updating your models with fresh data ensures that your inventory strategies remain aligned with current market trends.
**Conclusion**
Using data science, you can gain a competitive edge in the used car market. This analysis shows that a comprehensive approach—taking into account vehicle age, mileage, and other key features—provides a more accurate picture of a car’s value. In my study, Ridge Regression emerged as the more reliable tool for predicting prices, which suggests that a balanced consideration of all available features is beneficial.

### My Recommendations ###

#### Long Term ####

Adopt data-driven pricing: Adjust your pricing strategy based on the factors that truly impact a car’s value.
Enhance your data collection: Gather detailed and accurate information on each vehicle.
Continuously review and update models: As market conditions change, so should your models and strategies.



#### Short Term ####

1. **Daily In-Store Inventory Assessment**
Morning Routine:
Run the Model on Current Inventory:
What to Do: Every morning, update your inventory data with the latest details (e.g., production year, mileage, condition).
How to Do It: Use the Python script you’ve built to feed your current inventory through the Ridge model. This model will output a predicted price for each vehicle.
Compare Predictions to Listed Prices:
What to Look For: Create a report that shows both the predicted price and the current listing price.
How to Act:
Overpriced Vehicles: If a car’s listing price is significantly above the predicted price, consider lowering the price or running a special promotion.
Underpriced Vehicles: If the listing price is lower than the predicted price, you might have room to increase the price—or if it’s selling too quickly, consider buying more of that model.

2. **Pricing Adjustments**
In-Store and Online:
Adjust Listing Prices:
Action Step: Based on the morning report, update your online listings and in-store price tags.
Tip: Create a pricing band (e.g., within ±5% of the predicted price) as a target range for consistency.
Promotional Strategies:
Action Step: For vehicles that are overpriced compared to the model, consider running flash sales or discounts to move inventory faster.
Tip: Use the model’s confidence in its predictions as a guide—if the prediction error is low, you can be more aggressive with pricing changes.

3. **Inventory Optimization**
Ongoing Inventory Review:
Identify High-Potential Vehicles:
Action Step: Use your model to flag vehicles that show a high “value per mile” or have lower mileage with a good production year.
How to Act: Prioritize these vehicles in marketing efforts or consider expanding your stock of similar models.
Evaluate Underperformers:
Action Step: For vehicles with a low predicted price relative to market demand, plan targeted promotions (e.g., “quick sale” discounts).
Tip: Keep a record of these vehicles to review if repeated adjustments are needed or if they should be phased out in future orders.

4. **Weekly/Monthly Analysis and Continuous Improvement**
End-of-Week and End-of-Month Review:
Analyze Sales Performance:
Action Step: At the end of each week, compare the actual sale prices with the model’s predictions.
How to Use: Identify trends. Are vehicles sold closer to the predicted price? Are there any particular changes leading to faster sales? If so, enrich the model with those features.
Adjust Procurement Decisions:
Action Step: Based on weekly/monthly performance, refine your buying strategy:
Increase orders for high-performing models.
Reduce or discount models that consistently underperform.
Feedback Loop:
Action Step: Update your models with the latest sales data. This helps improve prediction accuracy over time.
5. **Training and Maintenance**
Keep Your Team Trained and Informed:
Ongoing Training:
Action Step: Schedule periodic training sessions on how to interpret the model outputs and dashboard metrics.
How to Act: Share best practices and recent insights from your weekly reviews to empower your team in making data-driven decisions.
Tool Updates:
Action Step: Maintain and update your data pipeline. Ensure that new inventory data is incorporated in real time and that your models are retrained as market conditions evolve.
Keep adding new features as you discover that they could affect sales.
Call a data scientist to update the model.
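
The morning comparison of predicted and listed prices described in the short-term steps could look roughly like the sketch below. The function `flag_pricing`, the column names `listed_price` and `predicted_price`, and the three example vehicles are all hypothetical; in practice `predicted_price` would come from the fitted model's `predict` call on the current inventory.

```python
import pandas as pd

def flag_pricing(inventory: pd.DataFrame, band: float = 0.05) -> pd.DataFrame:
    """Label each vehicle as overpriced, underpriced, or within the pricing band."""
    report = inventory.copy()
    # Relative gap between the listing and the model's prediction
    report["gap_pct"] = (
        (report["listed_price"] - report["predicted_price"])
        / report["predicted_price"]
    )
    report["status"] = "within band"
    report.loc[report["gap_pct"] > band, "status"] = "overpriced"
    report.loc[report["gap_pct"] < -band, "status"] = "underpriced"
    return report

# Made-up inventory rows for illustration
inventory = pd.DataFrame({
    "vehicle_id": ["A1", "B2", "C3"],
    "listed_price": [21000.0, 14000.0, 9900.0],
    "predicted_price": [19000.0, 15500.0, 10000.0],
})

report = flag_pricing(inventory)  # default band of ±5%
print(report[["vehicle_id", "gap_pct", "status"]])
```

The band width is a business choice; widening it reduces the number of daily price changes at the cost of tolerating larger gaps.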
