For this programming exercise I will implement linear regression with multiple variables to predict concrete compressive strength. Concrete is the most important material in civil engineering. Its compressive strength is a function of age and ingredients, such as cement and water. The dataset comes from the UCI Machine Learning Repository, where it was donated by Professor I-Cheng Yeh [1].
The dataset consists of eight features (input variables) measured across 1030 samples. First, let's list the features:
- cement
- blast furnace slag
- fly ash
- water
- superplasticizer
- coarse aggregate
- fine aggregate
- age
All of them are in kg/m^3 except for age, which is in days (1–365). The target variable (concrete compressive strength) is in MPa (megapascals). The data is in raw form (not scaled).
We should split the dataset into three parts:
- training set 60%
- cross validation set 20%
- test set 20%
The split will help us test the algorithm's performance on data it was not trained on. After that we need to apply feature scaling and mean normalization. Both speed up gradient descent (fewer steps to converge).
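The 60/20/20 split described above could be sketched as follows. This is a minimal illustration in plain Python, not necessarily the exact procedure used in this exercise: the function name `split_indices`, the shuffling step, and the fixed seed are all assumptions made for the example.

```python
import random


def split_indices(n_samples, train_frac=0.6, cv_frac=0.2, seed=0):
    """Shuffle row indices and cut them into train / cross validation / test parts.

    Hypothetical helper for illustration; shuffling with a fixed seed is an
    assumption, chosen so that the split is random but reproducible.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    n_train = round(train_frac * n_samples)
    n_cv = round(cv_frac * n_samples)

    train = indices[:n_train]
    cv = indices[n_train:n_train + n_cv]
    test = indices[n_train + n_cv:]  # whatever remains, about 20%
    return train, cv, test


# The dataset has 1030 rows, so the parts hold 618, 206 and 206 rows.
train_idx, cv_idx, test_idx = split_indices(1030)
print(len(train_idx), len(cv_idx), len(test_idx))
```

The returned index lists can then be used to pick the matching rows of both the feature matrix and the target vector, which keeps each sample's features and strength value together.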
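Feature scaling makes sure that features are on a similar scale, approximately in the -1 \le x_i \le 1 range. Mean normalization replaces x_i with x_i - \mu_i so that each feature has approximately zero mean. Combining the two, each feature becomes (x_i - \mu_i) / \sigma_i, where \mu_i and \sigma_i are the mean and standard deviation of feature i. It is important to compute the mean and standard deviation on the training set only and to normalize all three sets using those parameters, so that no information from the cross validation and test sets leaks into training.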
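A small sketch of this normalization step, under some assumptions: the helper names `fit_scaler` and `transform` are hypothetical, the population standard deviation is used (the sample version would work equally well), and the toy rows are made-up numbers, not values from the concrete dataset.

```python
import statistics


def fit_scaler(train_rows):
    """Return per-feature means and standard deviations of the training rows."""
    columns = list(zip(*train_rows))
    means = [statistics.mean(col) for col in columns]
    # Population standard deviation is an assumption; a constant feature would
    # give 0, so fall back to 1.0 to avoid dividing by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds


def transform(rows, means, stds):
    """Apply (x_i - mu_i) / sigma_i to every row using the given parameters."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


# Made-up toy data with two features on very different scales.
train_rows = [[100.0, 1.0], [200.0, 3.0], [300.0, 5.0]]
cv_rows = [[200.0, 3.0], [400.0, 7.0]]

means, stds = fit_scaler(train_rows)           # parameters come from training data only
train_scaled = transform(train_rows, means, stds)
cv_scaled = transform(cv_rows, means, stds)    # reuse the training parameters
```

Note that the cross validation rows are not guaranteed to have zero mean after the transformation, because they are scaled with the training set's parameters rather than their own. That is expected.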
[1] I-Cheng Yeh, "Modeling of strength of high performance concrete using artificial neural networks", Cement and Concrete Research, Vol. 28, No. 12, pp. 1797-1808 (1998)