Concrete compressive strength - **Fck** - is measured in megapascals (MPa). Fck indicates the maximum stress the concrete can withstand. Compressive strength tests confirm the maximum stress the concrete resists before it ruptures.

**Trivia:** Concrete compressive strength test:
https://www.youtube.com/watch?v=6TsqUeLjHA8

**Factors that can affect the compressive strength of concrete**
Inadequate proportioning of the ingredients, improper or missing curing, inappropriate addition of water, incompatible or low-quality ingredients, age, etc.

This is a regression problem: the goal is to fit a model that predicts **Fck** from the ingredients used and the age of the concrete.

Source: https://archive.ics.uci.edu/ml/datasets/Concrete+Compressive+Strength


In [None]:
from sklearn import neighbors
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn import preprocessing
import numpy as np
import pandas as pd

The dataset is an .xls file, so I used pandas' read_excel. Unlike a CSV, an Excel file has no separator to specify.

In [None]:
data = pd.read_excel('/content/drive/MyDrive/colab/Concrete_Data.xls')

**A preview of the data and a sense of the values.**
Attributes (columns):
1 - Cement
2 - Blast Furnace Slag - **Non-metallic residue; reduces CO2 emissions by ~5%**
3 - Fly Ash - **Partial substitute for Portland cement**
4 - Water
5 - Superplasticizer - **A very important admixture**
6 - Coarse Aggregate - **Gravel**
7 - Fine Aggregate - **Sand**
8 - Age - Age in days after placement.

The output is the last column (Fck): the target variable, which is what I want to predict.

In [None]:
data

Unnamed: 0,Cement (component 1)(kg in a m^3 mixture),Blast Furnace Slag (component 2)(kg in a m^3 mixture),Fly Ash (component 3)(kg in a m^3 mixture),Water (component 4)(kg in a m^3 mixture),Superplasticizer (component 5)(kg in a m^3 mixture),Coarse Aggregate (component 6)(kg in a m^3 mixture),Fine Aggregate (component 7)(kg in a m^3 mixture),Age (day),"Concrete compressive strength(MPa, megapascals)"
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28,79.986111
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28,61.887366
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270,40.269535
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365,41.052780
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360,44.296075
...,...,...,...,...,...,...,...,...,...
1025,276.4,116.0,90.3,179.6,8.9,870.1,768.3,28,44.284354
1026,322.2,0.0,115.6,196.0,10.4,817.9,813.4,28,31.178794
1027,148.5,139.4,108.6,192.7,6.1,892.4,780.0,28,23.696601
1028,159.1,186.7,0.0,175.6,11.3,989.6,788.9,28,32.768036


Check whether the dataset has null values; in fact there are none.

In [None]:
data.isnull().sum()

Cement (component 1)(kg in a m^3 mixture)                0
Blast Furnace Slag (component 2)(kg in a m^3 mixture)    0
Fly Ash (component 3)(kg in a m^3 mixture)               0
Water  (component 4)(kg in a m^3 mixture)                0
Superplasticizer (component 5)(kg in a m^3 mixture)      0
Coarse Aggregate  (component 6)(kg in a m^3 mixture)     0
Fine Aggregate (component 7)(kg in a m^3 mixture)        0
Age (day)                                                0
Concrete compressive strength(MPa, megapascals)          0
dtype: int64

The columns and the first 5 rows of the dataset

In [None]:
data.head()

Unnamed: 0,Cement (component 1)(kg in a m^3 mixture),Blast Furnace Slag (component 2)(kg in a m^3 mixture),Fly Ash (component 3)(kg in a m^3 mixture),Water (component 4)(kg in a m^3 mixture),Superplasticizer (component 5)(kg in a m^3 mixture),Coarse Aggregate (component 6)(kg in a m^3 mixture),Fine Aggregate (component 7)(kg in a m^3 mixture),Age (day),"Concrete compressive strength(MPa, megapascals)"
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28,79.986111
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28,61.887366
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270,40.269535
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365,41.05278
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360,44.296075


In [None]:
data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1030 entries, 0 to 1029
Data columns (total 9 columns):
 #   Column                                                 Non-Null Count  Dtype  
---  ------                                                 --------------  -----  
 0   Cement (component 1)(kg in a m^3 mixture)              1030 non-null   float64
 1   Blast Furnace Slag (component 2)(kg in a m^3 mixture)  1030 non-null   float64
 2   Fly Ash (component 3)(kg in a m^3 mixture)             1030 non-null   float64
 3   Water  (component 4)(kg in a m^3 mixture)              1030 non-null   float64
 4   Superplasticizer (component 5)(kg in a m^3 mixture)    1030 non-null   float64
 5   Coarse Aggregate  (component 6)(kg in a m^3 mixture)   1030 non-null   float64
 6   Fine Aggregate (component 7)(kg in a m^3 mixture)      1030 non-null   float64
 7   Age (day)                                              1030 non-null   int64  
 8   Concrete compressive strength(MPa, megapascals)  

Descriptive statistics of the dataset.
Minimum, mean, and maximum values, percentiles, and the standard deviation (std) of the concrete compressive strength (Fck). Note the standard deviation: Fck values typically deviate from the mean of about 35.8 MPa by about 16.7 MPa, that is, around 35.8 - 16.7 or 35.8 + 16.7. A model that always predicted the mean would have an RMSE equal to this standard deviation, so a useful model must do better. Our goal is a model whose prediction errors are smaller than the standard deviation.

In [None]:
data.describe()

Unnamed: 0,Cement (component 1)(kg in a m^3 mixture),Blast Furnace Slag (component 2)(kg in a m^3 mixture),Fly Ash (component 3)(kg in a m^3 mixture),Water (component 4)(kg in a m^3 mixture),Superplasticizer (component 5)(kg in a m^3 mixture),Coarse Aggregate (component 6)(kg in a m^3 mixture),Fine Aggregate (component 7)(kg in a m^3 mixture),Age (day),"Concrete compressive strength(MPa, megapascals)"
count,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0
mean,281.165631,73.895485,54.187136,181.566359,6.203112,972.918592,773.578883,45.662136,35.817836
std,104.507142,86.279104,63.996469,21.355567,5.973492,77.753818,80.175427,63.169912,16.705679
min,102.0,0.0,0.0,121.75,0.0,801.0,594.0,1.0,2.331808
25%,192.375,0.0,0.0,164.9,0.0,932.0,730.95,7.0,23.707115
50%,272.9,22.0,0.0,185.0,6.35,968.0,779.51,28.0,34.442774
75%,350.0,142.95,118.27,192.0,10.16,1029.4,824.0,56.0,46.136287
max,540.0,359.4,200.1,247.0,32.2,1145.0,992.6,365.0,82.599225


Shows the correlations between the columns. For example, the amount of water has a negative correlation with compressive strength (about -0.29), so more water tends to go with lower strength, while cement (about 0.50) and superplasticizer (about 0.37) have positive correlations.

In [None]:
data.corr()

Unnamed: 0,Cement (component 1)(kg in a m^3 mixture),Blast Furnace Slag (component 2)(kg in a m^3 mixture),Fly Ash (component 3)(kg in a m^3 mixture),Water (component 4)(kg in a m^3 mixture),Superplasticizer (component 5)(kg in a m^3 mixture),Coarse Aggregate (component 6)(kg in a m^3 mixture),Fine Aggregate (component 7)(kg in a m^3 mixture),Age (day),"Concrete compressive strength(MPa, megapascals)"
Cement (component 1)(kg in a m^3 mixture),1.0,-0.275193,-0.397475,-0.081544,0.092771,-0.109356,-0.22272,0.081947,0.497833
Blast Furnace Slag (component 2)(kg in a m^3 mixture),-0.275193,1.0,-0.323569,0.107286,0.043376,-0.283998,-0.281593,-0.044246,0.134824
Fly Ash (component 3)(kg in a m^3 mixture),-0.397475,-0.323569,1.0,-0.257044,0.37734,-0.009977,0.079076,-0.15437,-0.105753
Water (component 4)(kg in a m^3 mixture),-0.081544,0.107286,-0.257044,1.0,-0.657464,-0.182312,-0.450635,0.277604,-0.289613
Superplasticizer (component 5)(kg in a m^3 mixture),0.092771,0.043376,0.37734,-0.657464,1.0,-0.266303,0.222501,-0.192717,0.366102
Coarse Aggregate (component 6)(kg in a m^3 mixture),-0.109356,-0.283998,-0.009977,-0.182312,-0.266303,1.0,-0.178506,-0.003016,-0.164928
Fine Aggregate (component 7)(kg in a m^3 mixture),-0.22272,-0.281593,0.079076,-0.450635,0.222501,-0.178506,1.0,-0.156094,-0.167249
Age (day),0.081947,-0.044246,-0.15437,0.277604,-0.192717,-0.003016,-0.156094,1.0,0.328877
"Concrete compressive strength(MPa, megapascals)",0.497833,0.134824,-0.105753,-0.289613,0.366102,-0.164928,-0.167249,0.328877,1.0
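
As a rough illustration of what each entry in the table above measures: data.corr() computes the Pearson coefficient by default. The sketch below computes it by hand for two made-up columns. The `pearson` helper and the toy values are assumptions for illustration, not values from the dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy values (not from the dataset): more water, lower strength.
water = [150.0, 165.0, 180.0, 195.0, 210.0]
strength = [55.0, 48.0, 41.0, 37.0, 30.0]

r = pearson(water, strength)
print(f"r = {r:.3f}")  # negative: the two columns move in opposite directions
```

A coefficient near -1 or +1 indicates a strong linear relationship; the values in the real table are weaker, so no single ingredient explains Fck on its own.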


Create the variable X to hold the attributes, dropping the target column, and the variable y to hold the label. In other words, separate the input data from the output.

In [None]:
X = data.iloc[:, :-1]  # all rows, every column except the last
y = data.iloc[:, -1]  # all rows, only the last column

Check that X really contains only the input data.

In [None]:
X


Unnamed: 0,Cement (component 1)(kg in a m^3 mixture),Blast Furnace Slag (component 2)(kg in a m^3 mixture),Fly Ash (component 3)(kg in a m^3 mixture),Water (component 4)(kg in a m^3 mixture),Superplasticizer (component 5)(kg in a m^3 mixture),Coarse Aggregate (component 6)(kg in a m^3 mixture),Fine Aggregate (component 7)(kg in a m^3 mixture),Age (day)
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360
...,...,...,...,...,...,...,...,...
1025,276.4,116.0,90.3,179.6,8.9,870.1,768.3,28
1026,322.2,0.0,115.6,196.0,10.4,817.9,813.4,28
1027,148.5,139.4,108.6,192.7,6.1,892.4,780.0,28
1028,159.1,186.7,0.0,175.6,11.3,989.6,788.9,28


Check that y contains only the target values.

In [None]:
y

0       79.986111
1       61.887366
2       40.269535
3       41.052780
4       44.296075
          ...    
1025    44.284354
1026    31.178794
1027    23.696601
1028    32.768036
1029    32.401235
Name: Concrete compressive strength(MPa, megapascals) , Length: 1030, dtype: float64

The next line shuffles the data and splits the dataset into training and test sets. scikit-learn's train_test_split function sets aside 30% of the data for testing and the rest for training, and random_state fixes the random seed.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
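
To illustrate the idea behind a seeded shuffle-and-split, here is a minimal sketch in plain Python. It is not scikit-learn's implementation, and the indices it selects will differ from those train_test_split picks; the `shuffled_split` helper is hypothetical. It only shows how a 30% test fraction of 1030 rows is sized and why a fixed seed makes the split repeatable.

```python
import math
import random

n_samples = 1030  # number of rows in the dataset
test_size = 0.3   # fraction reserved for testing
seed = 42         # fixed seed so the shuffle is repeatable

def shuffled_split(n, fraction, seed):
    """Shuffle row indices with a seeded generator and cut off a test portion."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(fraction * n)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = shuffled_split(n_samples, test_size, seed)
train_again, _ = shuffled_split(n_samples, test_size, seed)

print(len(train_idx), len(test_idx))  # 721 309
print(train_idx == train_again)       # True: same seed, same split
```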

Checking that the dataset was split: of the 1030 rows in total, 721 went to training and the remaining 309 to testing.

In [None]:
X_train

Unnamed: 0,Cement (component 1)(kg in a m^3 mixture),Blast Furnace Slag (component 2)(kg in a m^3 mixture),Fly Ash (component 3)(kg in a m^3 mixture),Water (component 4)(kg in a m^3 mixture),Superplasticizer (component 5)(kg in a m^3 mixture),Coarse Aggregate (component 6)(kg in a m^3 mixture),Fine Aggregate (component 7)(kg in a m^3 mixture),Age (day)
196,194.68,0.0,100.52,165.62,7.48,1006.4,905.90,28
631,325.00,0.0,0.00,184.00,0.00,1063.0,783.00,7
81,318.80,212.5,0.00,155.70,14.30,852.1,880.40,3
526,359.00,19.0,141.00,154.00,10.91,942.0,801.00,3
830,162.00,190.0,148.00,179.00,19.00,838.0,741.00,28
...,...,...,...,...,...,...,...,...
87,286.30,200.9,0.00,144.70,11.20,1004.6,803.70,3
330,246.83,0.0,125.08,143.30,11.99,1086.8,800.89,14
466,190.34,0.0,125.18,166.61,9.88,1079.0,798.90,100
121,475.00,118.8,0.00,181.10,8.90,852.1,781.50,28


Our results, measured with RMSE (root mean squared error), should be better than the standard deviation of y_train, i.e. a good RMSE is lower than the value shown below:

In [None]:
np.std(y_train)

16.791744896796608
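
Why the standard deviation works as a reference point can be sketched with a few made-up numbers: a "model" that always predicts the mean of the target has an RMSE exactly equal to the population standard deviation of that target. The values below are hypothetical and not taken from the dataset.

```python
import math
import statistics

# Hypothetical strength values in MPa (illustrative only).
y = [12.0, 25.5, 33.1, 41.8, 52.4, 67.9]

mean_y = statistics.fmean(y)

# Baseline "model": always predict the mean of the target.
baseline_pred = [mean_y] * len(y)

# RMSE of the baseline predictions.
rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y, baseline_pred)) / len(y))

# Population standard deviation (same definition as np.std with its default ddof=0).
std = statistics.pstdev(y)

print(f"baseline RMSE = {rmse:.4f}, std = {std:.4f}")  # the two values match
```

In the notebook the comparison is approximate, since the standard deviation comes from y_train while the RMSE is measured on the test set, but the reasoning is the same.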

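The next cells scale the features to reduce the effect of the very different value ranges across attributes, which can improve the accuracy of a distance-based algorithm such as KNN. MinMaxScaler maps each feature of the training set to the range 0 to 1, while StandardScaler rescales each feature to zero mean and unit variance.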

MinMax Scaler

In [None]:
scaler = preprocessing.MinMaxScaler()
X_train_minmax = scaler.fit_transform(X_train)
X_test_minmax = scaler.transform(X_test)

Standard Scaler

In [None]:
scaler = preprocessing.StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)
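
The two transformations can be sketched by hand with NumPy, assuming the scalers' default settings. The point is that the statistics are learned from the training data only (fit_transform) and then reused on the test data (transform). The feature values below are hypothetical.

```python
import numpy as np

# Hypothetical single feature (e.g. cement in kg/m^3); values are illustrative.
train = np.array([100.0, 200.0, 300.0, 500.0])
test = np.array([50.0, 400.0, 600.0])

# Min-max scaling: the minimum and maximum come from the training data only.
lo, hi = train.min(), train.max()
train_minmax = (train - lo) / (hi - lo)
test_minmax = (test - lo) / (hi - lo)

# Standardization: subtract the training mean, divide by the training std (ddof=0).
mu, sigma = train.mean(), train.std()
train_std = (train - mu) / sigma
test_std = (test - mu) / sigma

print(train_minmax)  # all values within [0, 1]
print(test_minmax)   # can fall outside [0, 1] when test values exceed the training range
print(train_std.mean(), train_std.std())  # approximately 0 and 1
```

Reusing the training statistics on the test set keeps information about the test data out of the preprocessing step.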

Computing the results: the next cell builds a KNN regression model, trains it, uses it to predict, computes the performance metric (the root mean squared error, RMSE), and stores the results.

In [None]:
ks = [3, 5, 7, 9, 11]
scalers = ['no scaler', 'minmax', 'std']

results = []
for k in ks:
    for scaler in scalers:
        if scaler == 'minmax':
            X_train_, X_test_ = X_train_minmax, X_test_minmax
        elif scaler == 'std':
            X_train_, X_test_ = X_train_std, X_test_std
        else:
            X_train_, X_test_ = X_train, X_test

        model = neighbors.KNeighborsRegressor(n_neighbors = k)
        model.fit(X_train_, y_train)
        y_pred = model.predict(X_test_)
        # RMSE via np.sqrt: the squared=False argument was deprecated and later removed in scikit-learn
        rmse = np.sqrt(metrics.mean_squared_error(y_test, y_pred))
        result = { 'k': k, 'scaler': scaler, 'rmse': rmse}
        results.append(result)
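
To show why the loop compares scaled and unscaled inputs, here is a minimal sketch of KNN regression written by hand with NumPy. It is a simplified illustration, not scikit-learn's KNeighborsRegressor; the `knn_predict` and `minmax` helpers and the toy data are assumptions. The prediction is the average target of the k nearest training points, so changing the feature scale can change which points count as nearest.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k):
    """Average the targets of the k nearest training points (Euclidean distance)."""
    distances = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(distances)[:k]
    return y_train[nearest].mean()

# Hypothetical data: column 0 spans hundreds, column 1 spans units.
X_train = np.array([[300.0, 0.0], [320.0, 10.0], [480.0, 0.0], [520.0, 10.0]])
y_train = np.array([20.0, 40.0, 30.0, 60.0])
x_query = np.array([400.0, 10.0])

# Min-max scaling with statistics taken from the training data.
lo, hi = X_train.min(axis=0), X_train.max(axis=0)

def minmax(X):
    return (X - lo) / (hi - lo)

pred_raw = knn_predict(X_train, y_train, x_query, k=2)
pred_scaled = knn_predict(minmax(X_train), y_train, minmax(x_query), k=2)

print(pred_raw, pred_scaled)  # 35.0 50.0
```

Without scaling, the large-valued column dominates the distance; after scaling, both columns contribute equally, and a different pair of neighbors is selected. Neither outcome is better in general, which is why the notebook evaluates all three variants.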

Results sorted from best to worst, and model evaluation

In [None]:
df_results = pd.DataFrame(results)
df_results.sort_values(by='rmse')

Unnamed: 0,k,scaler,rmse
0,3,no scaler,9.309468
2,3,std,9.343958
3,5,no scaler,9.345191
5,5,std,9.45157
8,7,std,9.571847
6,7,no scaler,9.706674
11,9,std,9.766121
4,5,minmax,9.768481
1,3,minmax,9.79754
7,7,minmax,10.033515


Compared with the target's standard deviation of about 16.8, the model's RMSE is considerably lower, between roughly 9.3 and 10. For example, if the actual Fck is 35, the prediction will typically fall around 35 - 9 or 35 + 9.

Thank you!

References:

I-Cheng Yeh, "Modeling of strength of high performance concrete using artificial neural networks," Cement and Concrete Research, Vol. 28, No. 12, pp. 1797-1808 (1998).