Study on Machine Learning Using Data Generated by Analytical Solution

Author: John W.S. Lee

1. Introduction

This study was motivated by a simple question: "Can a trained machine learning model perform as well as an analytical solution?" To find out, this study was conducted using data in the field of polymer extrusion processes.

First, a dataset for machine learning was prepared by generating throughput data using the extrucal library. The data included various extruder sizes, screw geometries, polymer melt density, and screw RPMs. Basic exploratory data analysis was then performed to examine the distribution of features and the target variable. Skewed features were subjected to a log transformation. Using the transformed data, cross-validation was carried out with multiple machine learning models. The best model was selected based on the cross-validation score, specifically the mean squared error. Once the best model was chosen, hyperparameter optimization was performed. The performance of the selected machine learning model was compared before and after optimization for extruders ranging in size from 25 mm to 250 mm.

The evaluation of the model's performance revealed good agreement between the throughputs predicted by the machine learning model and those calculated by the analytical solution, extrucal. However, significant disparities were also observed for certain extruder sizes. The following is a summary report of this study; the actual code can be found in the notebooks folder.

2. Summary of Study

2.1. Generation of Data

The extrusion throughput dataset was generated using the extrucal.throughput_cal() function and the following 7 parameters:

extruder_size: Extruder size, ranging from 20 mm to 250 mm in increments of 10 mm
metering_depth_percent: Depth of the metering section of the screw, ranging from 2% to 10% of the extruder size
polymer_density: Melt density of the polymer, ranging from 800 to 1500 kg/m^3
screw_pitch_percent: Screw pitch, ranging from 0.6D to 2D, where D is the extruder size
flight_width_percent: Flight width of the screw, ranging from 0.06D to 0.2D
number_flight: Number of flights, either 1 or 2
rpm: Screw RPM, ranging from 0 to 90

To add randomness to the dataset, a variation of +/- 5% was applied to the throughputs calculated by extrucal.throughput_cal().
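The variation step can be sketched as follows. This is only an illustration: the README does not say how the +/- 5% was drawn, so a uniform multiplicative factor is assumed here, and add_variation and the sample throughput values are hypothetical names and numbers, not taken from the notebooks.

```python
import random


def add_variation(throughput, fraction=0.05, rng=random):
    """Scale a throughput by a random factor in [1 - fraction, 1 + fraction].

    Assumption: the variation is uniform and multiplicative.
    """
    return throughput * (1.0 + rng.uniform(-fraction, fraction))


# Hypothetical clean throughputs standing in for extrucal.throughput_cal() output.
rng = random.Random(42)
clean = [0.0, 12.5, 80.0, 430.0]
noisy = [add_variation(q, rng=rng) for q in clean]

for c, n in zip(clean, noisy):
    print(f"clean={c:8.2f}  noisy={n:8.2f}")
```

With a multiplicative factor, a throughput of zero (screw RPM of zero) stays exactly zero after the variation is applied.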

2.2. Exploratory Data Analysis

The following graphs show the distribution of features. metering_depth, screw_pitch, and flight_width show skewness to a certain degree.

Log-transformation was applied to the 3 features, and the following are the results.

The target, throughput, was also strongly skewed, as shown below, so a log transformation was applied to it.

After the log transformation of the target, the skewness disappeared. However, because the dataset contains many zero-throughput rows (those with a screw RPM of zero), the distribution has a sharp peak, as shown below.
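The following sketch shows why the zero-throughput rows pile up in a single peak after the transformation. It assumes a log(1 + x) transform, since a plain logarithm is undefined at zero; the README does not state which variant was used, and the sample values are made up for illustration.

```python
import math


def to_log(throughput):
    """Forward transform: log(1 + x), which is defined at x = 0."""
    return math.log1p(throughput)


def from_log(value):
    """Inverse transform: exp(z) - 1, used to map predictions back to throughput."""
    return math.expm1(value)


# Hypothetical throughputs; the two zeros mimic rows with a screw RPM of zero.
throughputs = [0.0, 0.0, 3.2, 45.0, 810.0]
transformed = [to_log(q) for q in throughputs]
restored = [from_log(z) for z in transformed]

print(transformed)  # every zero throughput maps to exactly 0.0
```

Because every zero maps to the same transformed value, all zero-RPM rows land in one bin of the histogram, separate from the smooth part of the distribution. Predictions made on the transformed scale need the inverse transform before they are compared with the analytical throughputs.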

2.3. Cross-Validation of Multiple Machine Learning Models

Cross-validation was carried out using 6 different machine learning models: Ridge, Lasso, RandomForestRegressor, XGBRegressor, LGBMRegressor, and CatBoostRegressor. mean_squared_error was used as the metric, and the following table shows the results.

CatBoostRegressor performed best among the models.
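The selection procedure can be illustrated with a plain-Python sketch of k-fold cross-validation scored by mean squared error. The two candidate "models" below (a constant mean predictor and a one-feature least-squares line) are hypothetical stand-ins chosen so the example runs without extra libraries; they are not the six models listed above, and the toy data is invented.

```python
import random


def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_idx, valid_idx) pairs for k shuffled folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, folds[i]


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def fit_mean(xs, ys):
    """Baseline: always predict the training mean."""
    mean = sum(ys) / len(ys)
    return lambda x: mean


def fit_line(xs, ys):
    """Ordinary least squares on a single feature."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)


def cross_val_mse(fit, xs, ys, k=5):
    """Average validation MSE over k folds."""
    scores = []
    for train, valid in k_fold_splits(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        preds = [model(xs[i]) for i in valid]
        scores.append(mse([ys[i] for i in valid], preds))
    return sum(scores) / len(scores)


# Toy data: throughput roughly proportional to rpm, plus a little noise.
rng = random.Random(1)
rpm = [float(r) for r in range(0, 91, 3)]
y = [2.0 * r + rng.uniform(-1.0, 1.0) for r in rpm]

results = {name: cross_val_mse(fit, rpm, y)
           for name, fit in [("mean", fit_mean), ("line", fit_line)]}
best = min(results, key=results.get)
print(results, "best:", best)
```

The model with the lowest average validation error is kept, which mirrors how the best of the six models was chosen here.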

2.4. Hyperparameter Optimization

The Optuna library was used for hyperparameter optimization of the CatBoostRegressor model. The following shows the throughputs predicted by the CatBoostRegressor models before and after optimization, together with those calculated by the analytical solution (the extrucal library).

For the 25 mm extruder, the prediction was poor both before and after hyperparameter optimization.

2.5. Comparison of Predictions for Different Extruder Sizes

The CatBoostRegressor model was trained with extruder_size values from 20 mm to 250 mm in 10 mm increments. The previous results showed that the model did not perform well for the 25 mm extruder, a size that was not used for training. The model was therefore tested to see whether it performs better for extruder sizes that were included in the train data.

For extruder_size in Train Data

For extruder_size not in Train Data

There are clear disparities between the throughputs predicted by the model and those calculated by the analytical solution (i.e. the extrucal library) for extruder sizes that were not in the train data. The disparity was largest for the smallest extruder (i.e. 25 mm), possibly because its throughputs were an order of magnitude smaller than those of the other sizes while mean_squared_error, which weights errors by their absolute size, was used as the evaluation metric. In contrast, the throughputs predicted for extruder sizes that were in the train data were almost identical to those calculated by the analytical solution.
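One way to picture why sizes absent from the train data are harder for a tree-based model is that a tree returns a constant value within each region of the feature space, so it cannot interpolate between the sizes it was trained on. The sketch below is a deliberately simplified stand-in, not CatBoost: it stores one value per training size and returns the value of the nearest one. The cubic toy_throughput function is a hypothetical placeholder for the analytical solution.

```python
def toy_throughput(size_mm):
    """Hypothetical stand-in for the analytical solution (assumed cubic in size)."""
    return 1e-3 * size_mm ** 3


# Training grid described in section 2.1: 20, 30, ..., 250 mm.
train_sizes = list(range(20, 251, 10))
leaf_value = {s: toy_throughput(s) for s in train_sizes}


def tree_like_predict(size_mm):
    """Piecewise-constant prediction: reuse the value of the nearest training size."""
    nearest = min(train_sizes, key=lambda s: abs(s - size_mm))
    return leaf_value[nearest]


def pct_error(size_mm):
    truth = toy_throughput(size_mm)
    return abs(tree_like_predict(size_mm) - truth) / truth * 100.0


for size in (30, 25, 245):
    print(f"{size} mm: {pct_error(size):.1f}% error")
```

In this toy setting a size on the grid is reproduced exactly, while an off-grid size inherits the value of a neighbouring grid size. The relative error is larger at the small end of the range, where a 5 mm offset is a larger fraction of the extruder size.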

When the two cases were compared using mean_absolute_percentage_error, the error was 1.14% for extruder sizes present in the train data and 6.92% for extruder sizes that were not.
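For reference, the metric can be written out in a few lines. This is a minimal sketch, not the code used in the notebooks: it reports the result in percent and skips rows whose true value is zero, which is an assumption made here because the README does not say how zero-throughput rows were treated. The sample numbers are invented.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent.

    Assumption: rows with a true value of zero are skipped, because the
    percentage error is undefined there.
    """
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t != 0]
    return 100.0 * sum(abs((t - p) / t) for t, p in pairs) / len(pairs)


# Hypothetical analytical vs. predicted throughputs.
analytical = [100.0, 200.0, 0.0]
predicted = [110.0, 190.0, 0.0]
print(f"MAPE: {mape(analytical, predicted):.2f}%")  # roughly 7.5%
```

Unlike mean squared error, this metric weights each row by its relative error, so small extruders with small throughputs count as much as large ones. Note that scikit-learn's mean_absolute_percentage_error returns a fraction rather than a percentage.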

2.6. Feature Importances

To find out whether the machine learning model correctly learned the effect of each extrusion parameter on the throughput, the feature importances were investigated using the shap library. To save computation time, the optimized LightGBM model (whose optimization process is shown in Appendix 2) was used to check the feature importances.

Rank of Features

As in actual extrusion processes, rpm and extruder_size were the two most important processing parameters for the model. The ranking of the remaining processing parameters also made sense.

Effect of Each Processing Parameter on Throughput

The effect of each processing parameter on the throughput was correctly displayed. For example, the throughput increased with increasing rpm, extruder_size, metering_depth, screw_pitch, and polymer_density, whereas it decreased with increasing number_flight and flight_width.

3. Conclusion

This study started with the simple purpose of demonstrating that a machine learning model can learn very complicated patterns and can perform as well as an analytical solution. However, while working on the modeling, I found that the model did not perform well for the smallest extruder (i.e. 25 mm). Initially, I thought this was because the throughputs at zero screw RPM were included in the train data. I also suspected that either the log transformation of the throughput had affected the performance of the model (because the distribution of throughputs after the log transformation looked unusual) or the throughputs of the 25 mm extruder were simply too small to be considered significant by the model. In the end, it became clear that the cause was the choice of model: CatBoostRegressor is a tree-based model, which predicts a constant value within each region of the feature space and therefore cannot interpolate between the extruder sizes it was trained on. As a result, the errors for extruder sizes that were not included in the train data were higher than for those that were. Moreover, the feature importances showed that the trained model correctly learned the effect of each processing parameter in extrusion. For example, the throughput increased with increasing rpm, extruder_size, metering_depth, screw_pitch, and polymer_density, whereas it decreased with increasing number_flight and flight_width.

In conclusion, this study suggests that machine learning models can be trained with datasets generated by an analytical solution. It would also be interesting to apply machine learning to datasets generated by more sophisticated computational methods, which I plan to explore in future work.

How to Run the Notebooks Locally

To download the contents of this repository onto your local machine, follow these steps:

Clone the repository by typing in your terminal: git clone https://github.com/johnwslee/extrucal_machine-learning.git
Change into the repository folder by typing: cd extrucal_machine-learning
Create the conda environment by typing: conda env create -f env.yml
Activate the environment by typing: conda activate extrucal_ml
Run the notebooks in the notebooks folder in order.
