# DPhi Datathon
# Data Sprint 4: Compressive Strength of Concrete
## Estimate Compressive Strength of Concrete

# Context

**Civil engineering** is a professional engineering discipline that deals with the design, construction, and maintenance of the physical and naturally built environment, including public works such as roads, bridges, canals, dams, airports, sewerage systems, pipelines, structural components of buildings, and railways.

![alt](<https://dphi-courses.s3.ap-south-1.amazonaws.com/Datathons/ds4.png>)

**Concrete** is the most important material in civil engineering. The concrete compressive strength is a highly nonlinear function of age and ingredients. **Compressive strength or compression strength** is the capacity of a material or structure to withstand loads tending to reduce its size, as opposed to tensile strength, which withstands loads tending to elongate it. In other words, compressive strength resists being pushed together, whereas tensile strength resists tension (being pulled apart). In the study of strength of materials, tensile strength, compressive strength, and shear strength can be analyzed independently.

# Objective

The concrete compressive strength is a highly nonlinear function of age and ingredients. These ingredients include cement, blast furnace slag, fly ash, water, superplasticizer, coarse aggregate, and fine aggregate. Your objective is to build a machine learning model that helps civil engineers estimate the compressive strength of concrete, so that they can decide whether a given mix should be used in their current project.

# Evaluation Criteria

Submissions are evaluated using Root-Mean-Squared-Error (RMSE).
![alt](<https://dphi-courses.s3.ap-south-1.amazonaws.com/Datathons/rmse+formula.png>)
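For reference, RMSE is the square root of the mean of the squared differences between actual and predicted values. A minimal NumPy sketch of the metric follows; the function name `rmse` and the sample numbers are illustrative and not part of the competition material.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up strengths in MPa, for illustration only
actual = [30.0, 40.0, 50.0]
predicted = [32.0, 38.0, 50.0]
score = rmse(actual, predicted)  # errors are -2, 2 and 0
print(score)
```

Because the errors are squared before averaging, a few large misses raise the score more than many small ones.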

# About the Data

The dataset has 9 columns: 8 input variables that describe the concrete mix and its age, and 1 output variable, the compressive strength.

# Data Description
|Column Name|Description|
|:----|:----|
|Cement (component 1)(kg in a m^3 mixture)|Cement (component 1) -- kilograms in a cubic meter (m^3) of mixture -- Input Variable|
|Blast Furnace Slag (component 2)(kg in a m^3 mixture)|Blast Furnace Slag (component 2) -- kg in a m^3 mixture -- Input Variable|
|Fly Ash (component 3)(kg in a m^3 mixture)|Fly Ash (component 3) -- kg in a m^3 mixture -- Input Variable|
|Water  (component 4)(kg in a m^3 mixture)|Water (component 4) -- kg in a m^3 mixture -- Input Variable|
|Superplasticizer (component 5)(kg in a m^3 mixture)|Superplasticizer (component 5) -- kg in a m^3 mixture -- Input Variable|
|Coarse Aggregate  (component 6)(kg in a m^3 mixture)|Coarse Aggregate (component 6) -- kg in a m^3 mixture -- Input Variable|
|Fine Aggregate (component 7)(kg in a m^3 mixture)|Fine Aggregate (component 7) -- kg in a m^3 mixture -- Input Variable|
|Age (day)|Age -- Day (1-365) -- Input Variable|
|Concrete compressive strength(MPa, megapascals)|Concrete compressive strength -- Megapascals (MPa) -- Output Variable|

# Acknowledgement

This dataset has been sourced from the UCI Machine Learning Repository.

In [1]:
#installing necessary packages required

In [2]:
!pip install catboost

Collecting catboost
  Downloading https://files.pythonhosted.org/packages/90/86/c3dcb600b4f9e7584ed90ea9d30a717fb5c0111574675f442c3e7bc19535/catboost-0.24.1-cp36-none-manylinux1_x86_64.whl (66.1MB)
     |████████████████████████████████| 66.1MB 59kB/s
Installing collected packages: catboost
Successfully installed catboost-0.24.1


In [3]:
#importing libraries 
import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport

#importing libraries for visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
pd.options.display.max_columns = 100
pd.options.display.max_rows = 3000


In [4]:
#reading training dataset
concrete_data = pd.read_csv("https://raw.githubusercontent.com/dphi-official/Datasets/master/concrete_data/training_set_label.csv")

In [5]:
#reading the new test dataset (input features only, no target column)
concrete_testdata = pd.read_csv('https://raw.githubusercontent.com/dphi-official/Datasets/master/concrete_data/testing_set_label.csv')

In [6]:
#getting the overview of all the columns in the dataset
concrete_data.columns

Index(['Cement (component 1)(kg in a m^3 mixture)',
       'Blast Furnace Slag (component 2)(kg in a m^3 mixture)',
       'Fly Ash (component 3)(kg in a m^3 mixture)',
       'Water  (component 4)(kg in a m^3 mixture)',
       'Superplasticizer (component 5)(kg in a m^3 mixture)',
       'Coarse Aggregate  (component 6)(kg in a m^3 mixture)',
       'Fine Aggregate (component 7)(kg in a m^3 mixture)', 'Age (day)',
       'Concrete compressive strength(MPa, megapascals) '],
      dtype='object')

In [7]:
#getting the overview of all the columns in the new test dataset
concrete_testdata.columns

Index(['Cement (component 1)(kg in a m^3 mixture)',
       'Blast Furnace Slag (component 2)(kg in a m^3 mixture)',
       'Fly Ash (component 3)(kg in a m^3 mixture)',
       'Water  (component 4)(kg in a m^3 mixture)',
       'Superplasticizer (component 5)(kg in a m^3 mixture)',
       'Coarse Aggregate  (component 6)(kg in a m^3 mixture)',
       'Fine Aggregate (component 7)(kg in a m^3 mixture)', 'Age (day)'],
      dtype='object')

In [8]:
#renaming the columns as per Python naming convention
concrete_data.columns = ['Cement',
       'Blast_Furnace_Slag',
       'Fly_Ash',
       'Water',
       'Superplasticizer',
       'Coarse_Aggregate',
       'Fine_Aggregate', 'Age',
       'Concrete_compressive_strength']

In [9]:
concrete_data.columns

Index(['Cement', 'Blast_Furnace_Slag', 'Fly_Ash', 'Water', 'Superplasticizer',
       'Coarse_Aggregate', 'Fine_Aggregate', 'Age',
       'Concrete_compressive_strength'],
      dtype='object')

In [10]:
#renaming the columns as per Python naming convention
concrete_testdata.columns = ['Cement',
       'Blast_Furnace_Slag',
       'Fly_Ash',
       'Water',
       'Superplasticizer',
       'Coarse_Aggregate',
       'Fine_Aggregate', 'Age']

In [11]:
concrete_testdata.columns

Index(['Cement', 'Blast_Furnace_Slag', 'Fly_Ash', 'Water', 'Superplasticizer',
       'Coarse_Aggregate', 'Fine_Aggregate', 'Age'],
      dtype='object')
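The two renaming cells above repeat the same list of names. One way to keep the training and test sets in step is to define the names once and assign them by position. The sketch below assumes that approach; the helper `rename_columns` and the stand-in frames are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Short names used in this notebook; the target is kept separate because
# the test set has no target column.
FEATURE_NAMES = ['Cement', 'Blast_Furnace_Slag', 'Fly_Ash', 'Water',
                 'Superplasticizer', 'Coarse_Aggregate', 'Fine_Aggregate', 'Age']
TARGET_NAME = 'Concrete_compressive_strength'

def rename_columns(df):
    """Return a copy of df with the short column names, assigned by position."""
    names = FEATURE_NAMES + [TARGET_NAME]
    if len(df.columns) not in (len(FEATURE_NAMES), len(names)):
        raise ValueError(f"unexpected number of columns: {len(df.columns)}")
    out = df.copy()
    out.columns = names[:len(df.columns)]
    return out

# Tiny stand-in frames with placeholder headers (not the real data)
train_demo = pd.DataFrame([[0.0] * 9], columns=[f"raw_{i}" for i in range(9)])
test_demo = pd.DataFrame([[0.0] * 8], columns=[f"raw_{i}" for i in range(8)])
train_named = rename_columns(train_demo)
test_named = rename_columns(test_demo)
print(list(test_named.columns))
```

Positional renaming only works if the column order is the same in both files, which the two `columns` outputs above suggest is the case here.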

In [12]:
#first 5 rows of the dataset
concrete_data.head()

| |Cement|Blast_Furnace_Slag|Fly_Ash|Water|Superplasticizer|Coarse_Aggregate|Fine_Aggregate|Age|Concrete_compressive_strength|
|:----|:----|:----|:----|:----|:----|:----|:----|:----|:----|
|0|298.2|0.0|107.0|209.7|11.1|879.6|744.2|28|31.875165|
|1|397.0|0.0|0.0|186.0|0.0|1040.0|734.0|28|36.935229|
|2|251.37|0.0|118.27|188.45|6.35|1028.4|757.73|56|36.638755|
|3|304.0|140.0|0.0|214.0|6.0|895.0|722.0|28|33.418902|
|4|297.0|0.0|0.0|186.0|0.0|1040.0|734.0|7|30.957472|


In [13]:
#last 5 rows of the dataset
concrete_data.tail()

| |Cement|Blast_Furnace_Slag|Fly_Ash|Water|Superplasticizer|Coarse_Aggregate|Fine_Aggregate|Age|Concrete_compressive_strength|
|:----|:----|:----|:----|:----|:----|:----|:----|:----|:----|
|767|252.5|0.0|0.0|185.7|0.0|1111.6|784.3|7|11.483912|
|768|249.1|0.0|98.75|158.11|12.8|987.76|889.01|56|42.030457|
|769|255.0|0.0|0.0|192.0|0.0|889.8|945.0|3|8.204075|
|770|190.68|0.0|125.4|162.14|7.77|1090.0|804.01|100|40.568768|
|771|159.0|187.0|0.0|176.0|11.0|990.0|789.0|28|32.7639|


In [14]:
#getting a brief overview of the dataset: number of rows and columns, column names and dtypes,
#non-null counts, and memory usage.
concrete_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 772 entries, 0 to 771
Data columns (total 9 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Cement                         772 non-null    float64
 1   Blast_Furnace_Slag             772 non-null    float64
 2   Fly_Ash                        772 non-null    float64
 3   Water                          772 non-null    float64
 4   Superplasticizer               772 non-null    float64
 5   Coarse_Aggregate               772 non-null    float64
 6   Fine_Aggregate                 772 non-null    float64
 7   Age                            772 non-null    int64  
 8   Concrete_compressive_strength  772 non-null    float64
dtypes: float64(8), int64(1)
memory usage: 54.4 KB


In [15]:
#finding the total rows and columns of new test dataset
concrete_testdata.shape

(258, 8)

In [16]:
#just extra checking for null values
concrete_data.isnull().sum()

Cement                           0
Blast_Furnace_Slag               0
Fly_Ash                          0
Water                            0
Superplasticizer                 0
Coarse_Aggregate                 0
Fine_Aggregate                   0
Age                              0
Concrete_compressive_strength    0
dtype: int64

In [17]:
#statistical view of dataset
concrete_data.describe()

| |Cement|Blast_Furnace_Slag|Fly_Ash|Water|Superplasticizer|Coarse_Aggregate|Fine_Aggregate|Age|Concrete_compressive_strength|
|:----|:----|:----|:----|:----|:----|:----|:----|:----|:----|
|count|772.0|772.0|772.0|772.0|772.0|772.0|772.0|772.0|772.0|
|mean|280.722565|76.49614|52.701347|182.361593|6.000848|971.558782|771.618355|44.993523|35.724196|
|std|104.711803|87.477423|63.596763|20.913641|5.844002|77.078828|79.785875|60.442735|16.797389|
|min|102.0|0.0|0.0|121.75|0.0|801.0|594.0|1.0|2.331808|
|25%|190.34|0.0|0.0|166.6775|0.0|932.0|724.3|12.25|23.677591|
|50%|275.0|24.0|0.0|185.7|6.05|968.0|777.8|28.0|33.870853|
|75%|350.0|144.775|118.1875|193.0|10.025|1028.1|821.0|56.0|46.232813|
|max|540.0|359.4|200.1|237.0|32.2|1145.0|992.6|365.0|82.599225|


In [18]:
#counting duplicate rows (every repeat after the first occurrence of a row)
concrete_data.duplicated().sum()

13

In [19]:
#counting every row that belongs to a group of duplicates, first occurrences included (keep=False)
concrete_data.duplicated(keep=False).sum()

21
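The two counts above answer different questions: `duplicated()` flags only the repeats, while `duplicated(keep=False)` flags every row that has at least one twin. The toy frame below (made-up values, not the concrete data) shows how the two relate.

```python
import pandas as pd

# Toy frame: row (1, 'a') appears three times, row (2, 'b') twice
toy = pd.DataFrame({'x': [1, 1, 1, 2, 2, 3],
                    'y': ['a', 'a', 'a', 'b', 'b', 'c']})

repeats = int(toy.duplicated().sum())            # repeats after the first occurrence
members = int(toy.duplicated(keep=False).sum())  # all rows that sit in a duplicate group
groups = members - repeats                       # one first occurrence per group
print(repeats, members, groups)
```

Applied to the counts above, 21 - 13 = 8 distinct rows occur more than once in the training set.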

In [20]:
#sorting by all columns so that identical rows sit next to each other (drop_duplicates does not require this, but it makes duplicates easy to inspect); ignore_index=True resets the index
cols = list(concrete_data.columns)
concrete_data.sort_values(by=cols, inplace=True, ignore_index=True) 

In [21]:
#dropping duplicate rows, keeping only the last occurrence of each (in the sorted order)
concrete_data.drop_duplicates(keep='last', inplace=True, ignore_index=True)

In [22]:
#rechecking for duplicates
concrete_data.duplicated().sum()

0

In [23]:
#overview of all column names having dtype object
col_list = list(concrete_data.select_dtypes(['object']).columns)
col_list

[]

In [24]:
# Interquartile Range (IQR)
Q1 = concrete_data.quantile(0.25)
Q3 = concrete_data.quantile(0.75)
IQR = Q3 - Q1
print(IQR)

Cement                           150.260000
Blast_Furnace_Slag               142.900000
Fly_Ash                          118.270000
Water                             25.000000
Superplasticizer                   9.950000
Coarse_Aggregate                  96.400000
Fine_Aggregate                    98.750000
Age                               42.000000
Concrete_compressive_strength     21.563707
dtype: float64


In [25]:
#the code below prints a boolean frame with one True/False value per data point;
#True marks a value outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR], i.e. a potential outlier
print((concrete_data < (Q1 - 1.5 * IQR)) | (concrete_data > (Q3 + 1.5 * IQR)))

     Cement  Blast_Furnace_Slag  Fly_Ash  Water  Superplasticizer  \
0     False               False    False  False             False   
1     False               False    False  False             False   
2     False               False    False  False             False   
3     False               False    False  False             False   
4     False               False    False  False             False   
5     False               False    False  False             False   
6     False               False    False  False             False   
7     False               False    False  False             False   
8     False               False    False  False             False   
9     False               False    False  False             False   
10    False               False    False  False             False   
11    False               False    False  False             False   
12    False               False    False  False             False   
13    False               False   
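The boolean frame above is hard to read at a glance. A per-column count of flagged values is usually easier to interpret. The sketch below applies the same 1.5 * IQR rule; the helper name `iqr_outlier_counts` and the toy data are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def iqr_outlier_counts(df, k=1.5):
    """Count, per column, the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    mask = (df < (q1 - k * iqr)) | (df > (q3 + k * iqr))
    return mask.sum()

# Made-up numbers: 'a' has one extreme value, 'b' has none
toy = pd.DataFrame({'a': [1, 2, 3, 4, 100], 'b': [10, 11, 12, 13, 14]})
counts = iqr_outlier_counts(toy)
print(counts)
```

In the notebook, the same function could be called on `concrete_data` to see which columns contribute most of the flagged values.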

##### Identifying Outliers with Skewness
Several machine learning algorithms assume, or work best when, the data are roughly normally distributed. Skewness offers a quick check: it measures how asymmetric a distribution is, and it is 0 for a perfectly symmetric distribution such as the normal. As a rule of thumb, a skewness between -1 and +1 is considered acceptable, and values well outside that range point to a long tail or extreme values. In this dataset only `Age` (about 3.24) falls outside that range.

In [26]:
col_list_a = list(concrete_data.columns)
print(col_list_a)
print(concrete_data[col_list_a].skew())

['Cement', 'Blast_Furnace_Slag', 'Fly_Ash', 'Water', 'Superplasticizer', 'Coarse_Aggregate', 'Fine_Aggregate', 'Age', 'Concrete_compressive_strength']
Cement                           0.543575
Blast_Furnace_Slag               0.802514
Fly_Ash                          0.540618
Water                           -0.041853
Superplasticizer                 0.985048
Coarse_Aggregate                -0.065396
Fine_Aggregate                  -0.230513
Age                              3.237671
Concrete_compressive_strength    0.410524
dtype: float64
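To make the skewness values above less of a black box, the sketch below computes the bias-corrected sample skewness by hand and compares it with `Series.skew()`, which should return the same statistic. The function name `sample_skewness` and the ages are made up for illustration.

```python
import numpy as np
import pandas as pd

def sample_skewness(values):
    """Adjusted Fisher-Pearson skewness G1 (bias-corrected sample skewness)."""
    x = np.asarray(values, dtype=float)
    n = x.size
    d = x - x.mean()
    m2 = np.mean(d ** 2)   # second central moment
    m3 = np.mean(d ** 3)   # third central moment
    g1 = m3 / m2 ** 1.5    # uncorrected skewness
    return float(np.sqrt(n * (n - 1)) / (n - 2) * g1)

# Made-up ages in days with one very old sample, giving a long right tail
ages = pd.Series([1, 3, 7, 14, 28, 28, 56, 90, 365])
manual = sample_skewness(ages)
from_pandas = float(ages.skew())
print(manual, from_pandas)
```

A single large value is enough to push the statistic well above +1, which is the pattern the `Age` column shows.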


In [27]:
#separating the input features (X) from the target (y)

X = concrete_data.drop('Concrete_compressive_strength', axis=1)
y = concrete_data['Concrete_compressive_strength']

In [28]:
from catboost import CatBoostRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

In [29]:
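# Grid search over CatBoost hyperparameters, wrapped in a pipeline with a scaler.
# Notes: CatBoost documents a maximum tree depth of 16, so the depth values above 16
# are expected to fail and be scored as NaN (error_score=nan). Tree ensembles do not
# require feature scaling, so StandardScaler is optional here.
# With scoring=None the search ranks candidates by the estimator's default score
# (R^2 for regressors), not by RMSE.
clf = make_pipeline(StandardScaler(),
                    GridSearchCV(estimator=CatBoostRegressor(random_seed=42),
                                 param_grid=[{'depth': [6, 8, 10, 21, 53, 75, 100, 200, 250],
                                              'learning_rate': [0.01, 0.05, 0.1],
                                              'n_estimators': [200, 400, 500, 600, 800, 1000]}],
                                 cv=5, n_jobs=-1,
                                 refit=True))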
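Before starting a search of this size, it is worth estimating how many models it will train. The sketch below counts the candidates in the grid using only the standard library. The constant `MAX_DEPTH = 16` is an assumption taken from CatBoost's documented depth limit on CPU, not something the notebook states.

```python
from itertools import product

param_grid = {
    'depth': [6, 8, 10, 21, 53, 75, 100, 200, 250],
    'learning_rate': [0.01, 0.05, 0.1],
    'n_estimators': [200, 400, 500, 600, 800, 1000],
}
cv_folds = 5
MAX_DEPTH = 16  # assumption: CatBoost's documented depth limit on CPU

combos = list(product(*param_grid.values()))
n_candidates = len(combos)            # 9 * 3 * 6 combinations
n_cv_fits = n_candidates * cv_folds   # one fit per candidate per fold
depth_pos = list(param_grid).index('depth')
n_within_limit = sum(1 for c in combos if c[depth_pos] <= MAX_DEPTH)
print(n_candidates, n_cv_fits, n_within_limit)
```

Under that assumption, only a third of the candidates can actually be trained, so trimming the `depth` list to values of 16 or less would cover the same usable search space.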

In [30]:
clf.fit(X, y)



0:	learn: 15.9449629	total: 48.7ms	remaining: 48.7s
1:	learn: 15.4833243	total: 51.1ms	remaining: 25.5s
2:	learn: 15.0512266	total: 52.9ms	remaining: 17.6s
3:	learn: 14.6372188	total: 54.8ms	remaining: 13.7s
4:	learn: 14.2435652	total: 56.6ms	remaining: 11.3s
5:	learn: 13.9199441	total: 58.4ms	remaining: 9.68s
6:	learn: 13.5246453	total: 60.1ms	remaining: 8.52s
7:	learn: 13.1835573	total: 61.8ms	remaining: 7.66s
8:	learn: 12.8456218	total: 63.6ms	remaining: 7s
9:	learn: 12.5610912	total: 65.3ms	remaining: 6.47s
10:	learn: 12.2800898	total: 67.1ms	remaining: 6.03s
11:	learn: 11.9895101	total: 68.9ms	remaining: 5.67s
12:	learn: 11.6804206	total: 70.6ms	remaining: 5.36s
13:	learn: 11.4014125	total: 72.4ms	remaining: 5.1s
14:	learn: 11.1784134	total: 79ms	remaining: 5.19s
15:	learn: 10.9222331	total: 82.2ms	remaining: 5.05s
16:	learn: 10.6817595	total: 84.7ms	remaining: 4.89s
17:	learn: 10.4634316	total: 86.6ms	remaining: 4.72s
18:	learn: 10.2720204	total: 88.4ms	remaining: 4.56s
19:	learn

Pipeline(memory=None,
         steps=[('standardscaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('gridsearchcv',
                 GridSearchCV(cv=5, error_score=nan,
                              estimator=<catboost.core.CatBoostRegressor object at 0x7faf00d5ee10>,
                              iid='deprecated', n_jobs=-1,
                              param_grid=[{'depth': [6, 8, 10, 21, 53, 75, 100,
                                                     200, 250],
                                           'learning_rate': [0.01, 0.05, 0.1],
                                           'n_estimators': [200, 400, 500, 600,
                                                            800, 1000]}],
                              pre_dispatch='2*n_jobs', refit=True,
                              return_train_score=False, scoring=None,
                              verbose=0))],
         verbose=False)

In [31]:
pred = clf.predict(concrete_testdata)

In [32]:
# Build a DataFrame of the predicted values, indexed like the test set
res = pd.DataFrame(pred)  # final model predictions for the unseen test features
res.index = concrete_testdata.index  # keep the test set's index so rows can be matched up
res.columns = ['Concrete_compressive_strength']

# Download the CSV file locally (google.colab is available only inside Google Colab)
from google.colab import files
res.to_csv('submissionfileA.csv')
files.download('submissionfileA.csv')
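Before uploading, it can help to confirm that the file holds one prediction per test row under the expected column name. The sketch below uses stand-in predictions and an in-memory buffer in place of a real file. The `_demo` names are illustrative, and whether the platform expects the index column is not stated in this notebook.

```python
import io
import numpy as np
import pandas as pd

pred_demo = np.array([31.5, 42.0, 18.25])  # stand-in predictions (MPa)
test_index = pd.RangeIndex(3)              # stand-in for concrete_testdata.index

res_demo = pd.DataFrame({'Concrete_compressive_strength': pred_demo},
                        index=test_index)

buffer = io.StringIO()
res_demo.to_csv(buffer)                    # the index is written as the first column
buffer.seek(0)
round_trip = pd.read_csv(buffer, index_col=0)
print(round_trip.shape)
```

Outside Colab, calling `res.to_csv('submissionfileA.csv')` on its own writes the file to the working directory without needing `files.download`.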

