# 1. INTRODUCTION:
This problem was originally proposed by Prof. I-Cheng Yeh, Department of Information Management, Chung-Hua University, Hsin Chu, Taiwan, in 2007. It is related to his 1998 research on how to predict compressive strength in a concrete structure.

The conventional process of testing the compressive strength of concrete involves casting several cubes for the respective grade and observing the strength of the concrete over a period of time ranging from 7 to 28 days.

Various combinations of the concrete components are selected, cubes are cast for each combination, and their test strength at 7, 14 and 28 days is noted down. This is a time-consuming and rather tedious process.

This project aims to predict the compressive strength of concrete as accurately as possible, with the lowest error (evaluation metric: MAE), taking the quantities of the constituent components as input. Concrete cubes exhibit different compressive strengths depending on whether or not they are cured. Curing is the process of maintaining moisture to ensure uninterrupted hydration of the concrete.

The concrete strength increases if the concrete cubes are cured periodically. The rate of increase in strength is listed below.

| Time | % of total strength achieved [1] [2] |
|---|---|
| 1 day | 16% |
| 3 days | 40% |
| 7 days | 65% |
| 14 days | 90% |
| 28 days | 99% |

At 28 days, concrete achieves 99% of its strength. Thus, strength is usually measured at 28 days [1].
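To make the percentages concrete, the following minimal sketch scales an assumed 28-day strength by the values listed above. The `STRENGTH_GAIN` dictionary and the `estimate_strength` helper are hypothetical names introduced for illustration, and treating the percentages as a simple proportional lookup is an assumption, not a method taken from the cited sources.

```python
# Fraction of total strength reached at each curing age (days), from the list above.
STRENGTH_GAIN = {1: 0.16, 3: 0.40, 7: 0.65, 14: 0.90, 28: 0.99}


def estimate_strength(strength_28d: float, age_days: int) -> float:
    """Estimate strength at a tabulated age from a measured 28-day strength.

    Assumes strength scales proportionally with the tabulated fractions.
    """
    if age_days not in STRENGTH_GAIN:
        raise ValueError(f"no tabulated value for {age_days} days")
    total = strength_28d / STRENGTH_GAIN[28]  # implied total (100%) strength
    return total * STRENGTH_GAIN[age_days]


# Hypothetical example: a mix that measures 40 MPa at 28 days.
for age in sorted(STRENGTH_GAIN):
    print(f"{age:>2} days: {estimate_strength(40.0, age):.1f} MPa")
```

Ages that are not in the list raise a `ValueError` instead of being interpolated, since the source gives no curve between the tabulated points.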

The goals of this case study are to determine:

1. The average strength of the concrete samples at 1, 7, 14, and 28 days of age.
2. The coefficients of the regression model given by the following formula:

$$
\begin{aligned}
\text{Concrete Strength} = {} & \beta_{0} + \beta_{1} \cdot \text{cement} + \beta_{2} \cdot \text{slag} + \beta_{3} \cdot \text{fly ash} + \beta_{4} \cdot \text{water} \\
& + \beta_{5} \cdot \text{superplasticizer} + \beta_{6} \cdot \text{coarse aggregate} + \beta_{7} \cdot \text{fine aggregate} + \beta_{8} \cdot \text{age}
\end{aligned}
$$
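A minimal sketch of how the coefficients β0 to β8 could be estimated with scikit-learn's `LinearRegression`. To stay self-contained it fits synthetic data generated from made-up coefficients: the `true_beta0` and `true_beta` values and the random feature matrix are assumptions for illustration, not results from the concrete dataset. With the real data, `X` would hold the eight component columns and `y` the `strength` column.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

features = ["cement", "slag", "fly_ash", "water", "superplasticizer",
            "coarse_aggregate", "fine_aggregate", "age"]

# Synthetic stand-in for the dataset: random features and made-up coefficients.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.uniform(0.0, 100.0, size=(200, len(features))),
                 columns=features)
true_beta0 = 5.0
true_beta = np.array([0.10, 0.08, 0.06, -0.20, 0.30, 0.01, 0.02, 0.10])
y = true_beta0 + X.to_numpy() @ true_beta + rng.normal(0.0, 0.5, size=len(X))

model = LinearRegression().fit(X, y)

# intercept_ estimates beta_0; coef_ holds beta_1..beta_8 in column order.
coefficients = pd.Series(model.coef_, index=features)
print(f"beta_0 (intercept): {model.intercept_:.3f}")
print(coefficients.round(3))
```

Because the synthetic noise is small, the fitted values land close to the made-up coefficients, which is a quick sanity check that `coef_` follows the column order of `X`.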


# EXPLORATORY DATA ANALYSIS

#### Compressive strength data:
- "cement" - Portland cement in kg/m3
- "slag" - Blast furnace slag in kg/m3
- "fly_ash" - Fly ash in kg/m3
- "water" - Water in liters/m3
- "superplasticizer" - Superplasticizer additive in kg/m3
- "coarse_aggregate" - Coarse aggregate (gravel) in kg/m3
- "fine_aggregate" - Fine aggregate (sand) in kg/m3
- "age" - Age of the sample in days
- "strength" - Concrete compressive strength in megapascals (MPa)

***Acknowledgments**: I-Cheng Yeh, "Modeling of strength of high-performance concrete using artificial neural networks," Cement and Concrete Research, Vol. 28, No. 12, pp. 1797-1808 (1998)*.

In [4]:
#data manipulation
import pandas as pd
import numpy as np

#plot
import seaborn as sns
import matplotlib.pyplot as plt

#model
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

#preprocessing/hyperparameter tuning
from sklearn.model_selection import train_test_split, learning_curve, RandomizedSearchCV, GridSearchCV
from sklearn.preprocessing import StandardScaler

#metrics 
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
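
Since MAE is the evaluation metric for this project, a small sketch of what `mean_absolute_error` computes may help: it is the average of the absolute differences between measured and predicted values. The strength values below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical measured and predicted strengths in MPa.
y_true = np.array([30.0, 45.0, 60.0])
y_pred = np.array([28.0, 47.0, 57.0])

# Manual computation: absolute errors are 2, 2 and 3, averaged over 3 samples.
mae_manual = np.mean(np.abs(y_true - y_pred))
mae_sklearn = mean_absolute_error(y_true, y_pred)
print(mae_manual, mae_sklearn)
```

MAE is expressed in the same unit as the target (MPa here), which makes it easier to interpret than a squared-error metric.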

In [5]:
#dataset reading
data = pd.read_csv("concrete_data.csv")
data.head()

|   | cement | slag | fly_ash | water | superplasticizer | coarse_aggregate | fine_aggregate | age | strength |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1040.0 | 676.0 | 28 | 79.986111 |
| 1 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1055.0 | 676.0 | 28 | 61.887366 |
| 2 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 270 | 40.269535 |
| 3 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 365 | 41.05278 |
| 4 | 198.6 | 132.4 | 0.0 | 192.0 | 0.0 | 978.4 | 825.5 | 360 | 44.296075 |


In [6]:
#work on a copy of the dataset
df = data.copy()
#information about the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1030 entries, 0 to 1029
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   cement            1030 non-null   float64
 1   slag              1030 non-null   float64
 2   fly_ash           1030 non-null   float64
 3   water             1030 non-null   float64
 4   superplasticizer  1030 non-null   float64
 5   coarse_aggregate  1030 non-null   float64
 6   fine_aggregate    1030 non-null   float64
 7   age               1030 non-null   int64  
 8   strength          1030 non-null   float64
dtypes: float64(8), int64(1)
memory usage: 72.5 KB


### Summary
- 1030 entries
- 9 columns
- 2 data types: float64 (8 columns) and int64 (1 column, `age`)
- no missing values
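
The absence of missing values can also be checked directly rather than read off the `info()` output. The sketch below uses a tiny frame built from the first three rows shown by `data.head()`, restricted to three columns (`sample` is a hypothetical name); on the full dataset the same calls would be applied to `df`.

```python
import pandas as pd

# First three rows of the dataset as displayed by data.head(), subset of columns.
sample = pd.DataFrame({
    "cement": [540.0, 540.0, 332.5],
    "age": [28, 28, 270],
    "strength": [79.986111, 61.887366, 40.269535],
})

missing_per_column = sample.isna().sum()   # count of NaN values per column
dtype_counts = sample.dtypes.value_counts()  # number of columns per dtype

print(missing_per_column)
print(dtype_counts)
```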

In [8]:
#number of duplicated rows
df.duplicated().sum()

25

In [9]:
#drop duplicates
df = df.drop_duplicates()
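
By default `drop_duplicates` keeps the first occurrence of each repeated row and leaves the original index labels in place, so the index has gaps afterwards. A minimal sketch with a made-up frame (`toy` is a hypothetical name) shows this behavior together with an optional `reset_index(drop=True)`. Whether to reset the index is a design choice and not something the notebook itself does.

```python
import pandas as pd

# Made-up rows; row 2 repeats row 0 exactly.
toy = pd.DataFrame({
    "cement": [540.0, 332.5, 540.0, 198.6],
    "age": [28, 270, 28, 360],
})

n_duplicates = toy.duplicated().sum()            # rows identical to an earlier row
deduped = toy.drop_duplicates()                  # keeps first occurrence; index 0, 1, 3
deduped_reset = deduped.reset_index(drop=True)   # contiguous index 0, 1, 2

print(n_duplicates)
print(list(deduped.index), list(deduped_reset.index))
```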

In [10]:
#summary statistics
df.describe()

|   | cement | slag | fly_ash | water | superplasticizer | coarse_aggregate | fine_aggregate | age | strength |
|---|---|---|---|---|---|---|---|---|---|
| count | 1005.0 | 1005.0 | 1005.0 | 1005.0 | 1005.0 | 1005.0 | 1005.0 | 1005.0 | 1005.0 |
| mean | 278.629055 | 72.043134 | 55.535075 | 182.074378 | 6.031647 | 974.376468 | 772.686617 | 45.856716 | 35.250273 |
| std | 104.345003 | 86.170555 | 64.207448 | 21.34074 | 5.919559 | 77.579534 | 80.339851 | 63.734692 | 16.284808 |
| min | 102.0 | 0.0 | 0.0 | 121.75 | 0.0 | 801.0 | 594.0 | 1.0 | 2.331808 |
| 25% | 190.68 | 0.0 | 0.0 | 166.61 | 0.0 | 932.0 | 724.3 | 7.0 | 23.523542 |
| 50% | 265.0 | 20.0 | 0.0 | 185.7 | 6.1 | 968.0 | 780.0 | 28.0 | 33.798114 |
| 75% | 349.0 | 142.5 | 118.27 | 192.94 | 10.0 | 1031.0 | 822.2 | 56.0 | 44.86834 |
| max | 540.0 | 359.4 | 200.1 | 247.0 | 32.2 | 1145.0 | 992.6 | 365.0 | 82.599225 |
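
The summary statistics above pool all ages together. For the first goal, the average strength at specific ages can be obtained by filtering on `age` and grouping. The sketch below is a minimal illustration on a made-up frame (`toy` and its values are hypothetical, so the printed numbers are not results for the real dataset); on the actual data the same expression would be applied to the deduplicated `df`.

```python
import pandas as pd

# Made-up stand-in for the deduplicated `df`; values are illustrative only.
toy = pd.DataFrame({
    "age": [1, 1, 7, 7, 14, 28, 28, 90],
    "strength": [8.0, 10.0, 20.0, 24.0, 30.0, 38.0, 42.0, 50.0],
})

target_ages = [1, 7, 14, 28]

# Keep only the ages of interest, then average strength within each age group.
mean_strength_by_age = (
    toy[toy["age"].isin(target_ages)]
    .groupby("age")["strength"]
    .mean()
)
print(mean_strength_by_age)
```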
