Ian Hedges and Collin Glover

We chose this dataset because a process like concrete production has many steps and features that can affect the final performance of the concrete. The dataset is also relevant because it shows how regression and machine learning can be used to predict and improve things we interact with on a daily basis, such as the concrete used for bridges, roads, and foundations.

We felt this dataset was sufficiently complex because it has over a thousand instances, which gives us plenty of data. We also chose it because its number of features is manageable: an overcomplicated dataset would distract from the scope of the project.

# Importing Libraries and Data

In [None]:
#import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, roc_curve, roc_auc_score

# Load the dataset into a pandas DataFrame (read_csv already returns a DataFrame)
data = pd.read_csv("/content/Concrete_Data.csv")
data.head()

# Exploratory Data Analysis

In [None]:
# Task 2: Perform exploratory data analysis (EDA)
# Summary statistics
print("Summary Statistics:")
print(data.describe())

In [None]:
# The UCI summary reports no missing values; verify that here.
# print() is needed because only the last expression in a cell is displayed.
print(data.isna().sum())
data.info()

In [None]:
data.hist(bins=50, figsize=(20, 15))
plt.show()

In [None]:
data.corr()

In [None]:
sns.pairplot(data = data)

In [None]:
sns.heatmap(data = data.corr(), annot=True)

# Splitting Data for Model Training

In [None]:
# Splitting into independent (X) and dependent (y) variables
X = data.iloc[:, :-1].values
y = data.iloc[:, -1].values
print(X)
print(y)
# Only StandardScaler is new here; the other libraries were imported in the first cell
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X = scaler.fit_transform(X)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
print(X_test)
print(y_test)
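
The cells above fit the scaler on all rows before splitting, so statistics from the test rows reach the preprocessing step. One common alternative is to split first and let a scikit-learn pipeline fit the scaler on the training rows only. The sketch below illustrates that pattern on synthetic stand-in data, since it does not assume the CSV is available; the names `X_demo`, `y_demo`, `Xd_train`, `Xd_test`, `pipe`, and `demo_r2` are hypothetical and not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical): 200 samples, 8 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 8))
y_demo = X_demo @ rng.normal(size=8) + rng.normal(scale=0.5, size=200)

# Split first so the test rows never influence the scaler.
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# The pipeline fits StandardScaler on the training rows only and
# reuses those training statistics when predicting on the test rows.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(Xd_train, yd_train)

demo_r2 = pipe.score(Xd_test, yd_test)  # R-squared on the held-out rows
print("Test R-squared:", round(demo_r2, 3))
```

The same pipeline object could be given the real `X` and `y` in place of the synthetic arrays, provided the split is done before any scaling.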


# Linear Regression Model

In [None]:
regressor = LinearRegression()
regressor.fit(X_train, y_train)

In [None]:
y_pred = regressor.predict(X_test)
np.set_printoptions(precision=2)
pred_vs_test = np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1)

In [None]:
# Visualizing the predictions
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred)
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], '--', color='red')
plt.title('Actual vs. Predicted Concrete Strength')
plt.xlabel('Actual Strength')
plt.ylabel('Predicted Strength')
plt.show()

In [None]:
# Model performance metrics
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
print('MAE:', mean_absolute_error(y_test, y_pred))
print('MSE:', mean_squared_error(y_test, y_pred))
print('R-squared:', r2_score(y_test, y_pred))
# Adjusted R-squared formula from: https://stackoverflow.com/questions/49381661/how-do-i-calculate-the-adjusted-r-squared-score-using-scikit-learn
print('Adjusted R-squared:', 1-(1-r2_score(y_test, y_pred))*((len(X_test)-1)/(len(X_test)-len(X_test[0])-1)))

### Linear Regression Model Insights
The linear regression model fits the data reasonably well. The scatterplot shows a clear positive linear relationship between the actual and predicted values, but the points do not cluster tightly around the line. The R-squared value of 0.636 is decent but leaves room for improvement, so we will explore polynomial regression models.
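
The adjusted R-squared expression used in the metrics cell can be wrapped in a small helper so that the sample count and the predictor count are explicit. The sketch below is one possible form. The function name `adjusted_r2` and the example counts (206 test rows, 8 predictors) are assumptions for illustration; only the 0.636 figure comes from the text above.

```python
def adjusted_r2(r2, n_samples, n_predictors):
    """Adjusted R-squared: 1 - (1 - r2) * (n - 1) / (n - p - 1)."""
    dof = n_samples - n_predictors - 1
    if dof <= 0:
        # The formula is not meaningful when predictors outnumber samples.
        raise ValueError("need n_samples > n_predictors + 1")
    return 1 - (1 - r2) * (n_samples - 1) / dof


# Hypothetical inputs: R-squared 0.636, 206 test rows, 8 predictors.
print(round(adjusted_r2(0.636, 206, 8), 4))
```

Passing the predictor count explicitly makes it harder to reuse the feature count of a different model by accident.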

# Polynomial Regression Model

Second Degree Polynomial Regression Model

In [None]:
#transforming the data
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(2)
X_poly = poly.fit_transform(X)

In [None]:
#defining new train and test split
from sklearn.model_selection import train_test_split
X_poly_train, X_poly_test, y_poly_train, y_poly_test = train_test_split(X_poly, y, test_size = 0.2, random_state = 0)
print('X_poly_test values (One "row" of values): \n',X_poly_test[1])
print('y_poly_test values: \n',y_poly_test[1:11])

In [None]:
#creating a new model
from sklearn.linear_model import LinearRegression
poly2regressor = LinearRegression()
poly2regressor.fit(X_poly_train, y_poly_train)

In [None]:
#making predictions with the new model
y_poly_pred = poly2regressor.predict(X_poly_test)
np.set_printoptions(precision=2)
poly_pred_vs_test = np.concatenate((y_poly_pred.reshape(len(y_poly_pred),1), y_poly_test.reshape(len(y_poly_test),1)),1)
print(poly_pred_vs_test[1:11])

In [None]:
#visualizing the new model
plt.figure(figsize=(8, 6))
plt.scatter(y_poly_test, y_poly_pred)
plt.plot([min(y_poly_test), max(y_poly_test)], [min(y_poly_test), max(y_poly_test)], '-', color='red')
plt.title('Actual vs. Predicted Concrete Strength')
plt.xlabel('Actual Strength')
plt.ylabel('Predicted Strength')
plt.show()

In [None]:
# Model performance metrics
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
print('MAE:', mean_absolute_error(y_poly_test, y_poly_pred))
print('MSE:', mean_squared_error(y_poly_test, y_poly_pred))
polyR2 = r2_score(y_poly_test, y_poly_pred)
print('R-squared:', polyR2)
# Adjusted R-squared formula from: https://stackoverflow.com/questions/49381661/how-do-i-calculate-the-adjusted-r-squared-score-using-scikit-learn
# The predictor count is the number of polynomial features minus the constant bias column,
# not the feature count of the linear model's X_test.
n_obs, n_pred = X_poly_test.shape[0], X_poly_test.shape[1] - 1
print('Adjusted R-squared:', 1 - (1 - polyR2) * (n_obs - 1) / (n_obs - n_pred - 1))

Third Degree Polynomial Regression Model

In [None]:
#transforming the data
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(3)
X_poly = poly.fit_transform(X)

In [None]:
#defining new train and test split
from sklearn.model_selection import train_test_split
X_poly_train, X_poly_test, y_poly_train, y_poly_test = train_test_split(X_poly, y, test_size = 0.2, random_state = 0)
print(X_poly_test)
print(y_poly_test)

In [None]:
#creating a new model
from sklearn.linear_model import LinearRegression
poly3regressor = LinearRegression()
poly3regressor.fit(X_poly_train, y_poly_train)

In [None]:
#making predictions with the new model
y_poly_pred = poly3regressor.predict(X_poly_test)
np.set_printoptions(precision=2)
poly_pred_vs_test = np.concatenate((y_poly_pred.reshape(len(y_poly_pred),1), y_poly_test.reshape(len(y_poly_test),1)),1)
print(poly_pred_vs_test[1:11])

In [None]:
#visualizing the new model
plt.figure(figsize=(8, 6))
plt.scatter(y_poly_test, y_poly_pred)
plt.plot([min(y_poly_test), max(y_poly_test)], [min(y_poly_test), max(y_poly_test)], '-', color='red')
plt.title('Actual vs. Predicted Concrete Strength')
plt.xlabel('Actual Strength')
plt.ylabel('Predicted Strength')
plt.show()

In [None]:
# Model performance metrics
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
print('MAE:', mean_absolute_error(y_poly_test, y_poly_pred))
print('MSE:', mean_squared_error(y_poly_test, y_poly_pred))
polyR2 = r2_score(y_poly_test, y_poly_pred)
print('R-squared:', polyR2)
# Adjusted R-squared formula from: https://stackoverflow.com/questions/49381661/how-do-i-calculate-the-adjusted-r-squared-score-using-scikit-learn
# The predictor count is the number of polynomial features minus the constant bias column,
# not the feature count of the linear model's X_test.
n_obs, n_pred = X_poly_test.shape[0], X_poly_test.shape[1] - 1
print('Adjusted R-squared:', 1 - (1 - polyR2) * (n_obs - 1) / (n_obs - n_pred - 1))

Fourth Degree Polynomial Regression Model

In [None]:
#transforming the data
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(4)
X_poly = poly.fit_transform(X)

In [None]:
#defining new train and test split
from sklearn.model_selection import train_test_split
X_poly_train, X_poly_test, y_poly_train, y_poly_test = train_test_split(X_poly, y, test_size = 0.2, random_state = 0)
print(X_poly_test)
print(y_poly_test)

In [None]:
#creating a new model
from sklearn.linear_model import LinearRegression
poly4regressor = LinearRegression()
poly4regressor.fit(X_poly_train, y_poly_train)

In [None]:
#making predictions with the new model
y_poly_pred = poly4regressor.predict(X_poly_test)
np.set_printoptions(precision=2)
poly_pred_vs_test = np.concatenate((y_poly_pred.reshape(len(y_poly_pred),1), y_poly_test.reshape(len(y_poly_test),1)),1)
print(poly_pred_vs_test[1:11])

In [None]:
#visualizing the new model
plt.figure(figsize=(8, 6))
plt.scatter(y_poly_test, y_poly_pred)
plt.plot([min(y_poly_test), max(y_poly_test)], [min(y_poly_test), max(y_poly_test)], '-', color='red')
plt.title('Actual vs. Predicted Concrete Strength')
plt.xlabel('Actual Strength')
plt.ylabel('Predicted Strength')
plt.show()

In [None]:
# Model performance metrics
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
print('MAE:', mean_absolute_error(y_poly_test, y_poly_pred))
print('MSE:', mean_squared_error(y_poly_test, y_poly_pred))
polyR2 = r2_score(y_poly_test, y_poly_pred)
print('R-squared:', polyR2)
# Adjusted R-squared formula from: https://stackoverflow.com/questions/49381661/how-do-i-calculate-the-adjusted-r-squared-score-using-scikit-learn
# The predictor count is the number of polynomial features minus the constant bias column.
# If the predictors outnumber the test rows, the denominator is not positive
# and the adjusted value is not meaningful.
n_obs, n_pred = X_poly_test.shape[0], X_poly_test.shape[1] - 1
print('Adjusted R-squared:', 1 - (1 - polyR2) * (n_obs - 1) / (n_obs - n_pred - 1))

### Polynomial Regression Model Insight

After fitting three polynomial regression models of different degrees, we concluded that the third-degree model is the best fit. It has the highest R-squared score and the lowest MSE on the test set. The fourth-degree model appears to overfit the data, so we do not select it.
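
One way to check a suspicion of overfitting is to compare the training and test R-squared for each degree: a training score that keeps rising while the test score stalls or drops suggests the extra terms are fitting noise. The sketch below shows that comparison on synthetic stand-in data, not on the concrete dataset; the data, the variable names, and the choice of degrees 1 through 4 are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data (hypothetical): a quadratic signal plus noise.
rng = np.random.default_rng(0)
X_demo = rng.uniform(-1, 1, size=(150, 4))
y_demo = (
    X_demo[:, 0] ** 2
    + X_demo[:, 1] * X_demo[:, 2]
    + rng.normal(scale=0.1, size=150)
)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Record (train R-squared, test R-squared) for each polynomial degree.
scores = {}
for degree in (1, 2, 3, 4):
    model = make_pipeline(
        StandardScaler(), PolynomialFeatures(degree), LinearRegression()
    )
    model.fit(Xd_train, yd_train)
    scores[degree] = (
        r2_score(yd_train, model.predict(Xd_train)),
        r2_score(yd_test, model.predict(Xd_test)),
    )
    print(f"degree {degree}: train R2 = {scores[degree][0]:.3f}, "
          f"test R2 = {scores[degree][1]:.3f}")
```

Because each higher degree contains all the terms of the lower ones, the training score cannot decrease as the degree grows, so the test score is the one to watch.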

# Removing Columns and Preparing New Model



In [None]:
# Splitting into independent (X2) and dependent (y) variables, keeping only the selected columns
X2 = data[['Age (day)','Water  (component 4)(kg in a m^3 mixture)', 'Coarse Aggregate  (component 6)(kg in a m^3 mixture)', 'Fine Aggregate (component 7)(kg in a m^3 mixture)', 'Cement (component 1)(kg in a m^3 mixture)']].values
y = data.iloc[:, -1].values
print(X2)
print(y)
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import numpy as np

scaler = StandardScaler()

X2 = scaler.fit_transform(X2)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X2, y, test_size = 0.2, random_state = 0)
print(X_test)
print(y_test)

In [None]:
regressor = LinearRegression()
regressor.fit(X_train, y_train)

In [None]:
y_pred = regressor.predict(X_test)
np.set_printoptions(precision=2)
#print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))
pred_vs_test = np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1)

In [None]:
# Visualizing the predictions
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred)
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], '--', color='red')
plt.title('Actual vs. Predicted Concrete Strength')
plt.xlabel('Actual Strength')
plt.ylabel('Predicted Strength')
plt.show()

In [None]:
# Model performance metrics
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
print('MAE:', mean_absolute_error(y_test, y_pred))
print('MSE:', mean_squared_error(y_test, y_pred))
print('R-squared:', r2_score(y_test, y_pred))
# Adjusted R-squared formula from: https://stackoverflow.com/questions/49381661/how-do-i-calculate-the-adjusted-r-squared-score-using-scikit-learn
print('Adjusted R-squared:', 1-(1-r2_score(y_test, y_pred))*((len(X_test)-1)/(len(X_test)-len(X_test[0])-1)))

Removing the columns that contain a large number of zeros does not improve the overall performance of the linear regression model.
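
The notebook does not show how the zero-heavy columns were identified. One possible way is to compute the share of zero values per column and keep the columns below a cutoff. The sketch below uses a hypothetical miniature frame with made-up values; the column names, the numbers, and the 0.5 cutoff are all assumptions and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical miniature frame; names and values are made up for illustration.
demo = pd.DataFrame({
    "cement": [300.0, 250.0, 400.0, 350.0],
    "slag": [0.0, 100.0, 120.0, 0.0],
    "fly_ash": [0.0, 0.0, 0.0, 0.0],
    "strength": [40.0, 35.0, 55.0, 45.0],
})

# Share of rows equal to zero in each column, largest first.
zero_share = (demo == 0).mean().sort_values(ascending=False)
print(zero_share)

# Keep columns where fewer than half of the rows are zero
# (the 0.5 cutoff is an arbitrary assumption).
kept = zero_share[zero_share < 0.5].index.tolist()
print(kept)
```

On the real data, `demo` would be replaced by `data`, and the cutoff would need to be chosen by inspecting the printed shares.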