[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/nickdlc/CSc448-Projects/blob/main/Assignment3/Equation-of-Slime.ipynb#scrollTo=WW8OLQAwFULZ)


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

# Loading the Dataset

In [2]:
# Display the first 15 rows of the dataset
df = pd.read_csv('https://raw.githubusercontent.com/profmcnich/example_notebook/main/science_data_large.csv')
df.head(15)

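| | Temperature °C | Mols KCL | Size nm^3 |
|---|---|---|---|
| 0 | 469 | 647 | 624474.3 |
| 1 | 403 | 694 | 577961.0 |
| 2 | 302 | 975 | 619684.7 |
| 3 | 779 | 916 | 1460449.0 |
| 4 | 901 | 18 | 43257.26 |
| 5 | 545 | 637 | 712463.4 |
| 6 | 660 | 519 | 700696.0 |
| 7 | 143 | 869 | 271826.0 |
| 8 | 89 | 461 | 89198.03 |
| 9 | 294 | 776 | 477021.0 |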


In [3]:
# Display a summary of the dataset
df.describe()

| | Temperature °C | Mols KCL | Size nm^3 |
|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 |
| mean | 500.5 | 471.53 | 508611.1 |
| std | 288.819436 | 288.482872 | 447483.8 |
| min | 1.0 | 1.0 | 16.11429 |
| 25% | 250.75 | 226.75 | 129826.7 |
| 50% | 500.5 | 459.5 | 382718.2 |
| 75% | 750.25 | 710.25 | 760321.1 |
| max | 1000.0 | 1000.0 | 1972127.0 |


In [4]:
# Check for null values
df.isna().sum()

Temperature °C    0
Mols KCL          0
Size nm^3         0
dtype: int64

Since there are no null values, we can proceed without modifying the dataset.

# Splitting the Dataset

In [5]:
from sklearn.model_selection import train_test_split

# Determine the features and label for the dataset
X = df.drop(['Size nm^3'], axis=1)
y = df['Size nm^3']

# Split the data for training (90%) and testing (10%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=42)

# Linear Regression Model

## Training the Model

In [6]:
from sklearn.linear_model import LinearRegression

# Train a linear regression model on the training data
linreg = LinearRegression()
linreg.fit(X_train, y_train)

LinearRegression()

## Predicting the Value of a New Point

In [7]:
# Predict the value of the new point
X0 = [[500.5, 471.53]]
X_pred = pd.DataFrame(X0, columns=X_test.columns)
y_pred = linreg.predict(X_pred)

print(y_pred)

[511061.50500929]


Given the new data point $(x_0, x_1) = (500.5, 471.53)$, which is the mean temperature and mean mols of KCl from the summary above, the model predicts a slime size of about 511,062 nm$^3$.

In [8]:
print('Score: ', linreg.score(X_test, y_test))

Score:  0.8552472077276095


The score for this model is about 0.855. For a regression model, `score` returns the coefficient of determination $R^2$, not a classification accuracy: the model explains about 85.5% of the variance of `y_test` around its mean. It does not mean that 85.5% of the predictions are exactly correct. A score of 1 would indicate perfect predictions on the test set, and a score of 0 would be no better than always predicting the mean of `y_test`.
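To make the meaning of the score concrete, here is a minimal sketch of how $R^2$ can be computed by hand as $1 - SS_{res}/SS_{tot}$. The helper name `r_squared` and the toy values are hypothetical and are not taken from the slime dataset.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # variation the model fails to explain
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Hypothetical toy values for illustration only
y_true = [1.0, 2.0, 3.0, 4.0]
y_perfect = [1.0, 2.0, 3.0, 4.0]      # exact predictions
y_mean_only = [2.5, 2.5, 2.5, 2.5]    # always predict the mean of y_true

print(r_squared(y_true, y_perfect))    # 1.0
print(r_squared(y_true, y_mean_only))  # 0.0
```

The two extremes show why the score is not a count of correct predictions: it measures how much of the spread in the target is captured by the model.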

In [9]:
# Extract coefficients and intercept
print('Coefficients: ', linreg.coef_)
print('Intercept: ', linreg.intercept_)

Coefficients:  [ 866.14641337 1032.69506649]
Intercept:  -409391.47958340764


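With $x_0$ as the temperature, $x_1$ as the mols of KCl, and the fitted values rounded to five significant figures, this gives us the equation $h(x_0, x_1) = 866.15x_0 + 1032.7x_1 - 409{,}390$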
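As a quick sanity check, the equation can be evaluated by hand at the point used earlier. The sketch below copies the coefficients and intercept from the printed output; the function name `h` and the constant names are just illustrative.

```python
# Values copied from the printed coefficients and intercept of the fitted model
COEF_TEMPERATURE = 866.14641337
COEF_MOLS_KCL = 1032.69506649
INTERCEPT = -409391.47958340764

def h(x0, x1):
    """Degree-one model: weighted sum of the two features plus the intercept."""
    return COEF_TEMPERATURE * x0 + COEF_MOLS_KCL * x1 + INTERCEPT

prediction = h(500.5, 471.53)
print(round(prediction, 2))
```

The result should agree with the value returned by `linreg.predict` for the same point, up to the precision of the printed coefficients.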

# Cross Validation

In [10]:
from sklearn.model_selection import cross_val_score

# Evaluate the model score using 5-fold cross-validation
scores = cross_val_score(linreg, X_test, y_test, cv=5)
print(scores)

[0.86600166 0.81565018 0.84209182 0.85686866 0.85851145]


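Because `cross_val_score` was given only `X_test` and `y_test`, each fold refits a fresh copy of the model on 80 of the 100 test rows and scores it on the remaining 20. With `cv=5` and a regressor, the folds are consecutive, unshuffled blocks of the data. The fold $R^2$ scores range from about 0.816 to 0.866, with a mean of about 0.848, which is close to the 0.855 from the initial experiment. The best fold (0.866) is slightly higher than that score, but the mean across folds is the more reliable summary.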
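For comparison, here is a sketch of cross-validation over a full dataset with explicitly shuffled folds, summarized by the mean and standard deviation of the fold scores. The data are a synthetic stand-in (an assumption, not the notebook's CSV), and the names `X_demo`, `y_demo`, and `cv` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data: a noisy linear relationship between two features and a target
rng = np.random.default_rng(0)
X_demo = rng.uniform(1, 1000, size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] + 5.0 * X_demo[:, 1] + rng.normal(0, 50, size=200)

# Shuffle rows before splitting so each fold is a random sample of the data
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# cross_val_score clones and refits the estimator once per fold
scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=cv)

print('Fold R^2 scores:', scores)
print('Mean:', scores.mean(), 'Std:', scores.std())
```

Passing the whole dataset gives each fold more training rows than splitting the 10% test set would, and reporting the mean avoids picking the most favorable fold.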

# Polynomial Regression

In [11]:
from sklearn.preprocessing import PolynomialFeatures

# Expand the features with every polynomial term up to degree 2
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)
X_poly

array([[1.00000e+00, 4.69000e+02, 6.47000e+02, 2.19961e+05, 3.03443e+05,
        4.18609e+05],
       [1.00000e+00, 4.03000e+02, 6.94000e+02, 1.62409e+05, 2.79682e+05,
        4.81636e+05],
       [1.00000e+00, 3.02000e+02, 9.75000e+02, 9.12040e+04, 2.94450e+05,
        9.50625e+05],
       ...,
       [1.00000e+00, 7.91000e+02, 2.13000e+02, 6.25681e+05, 1.68483e+05,
        4.53690e+04],
       [1.00000e+00, 7.69000e+02, 5.53000e+02, 5.91361e+05, 4.25257e+05,
        3.05809e+05],
       [1.00000e+00, 9.19000e+02, 4.52000e+02, 8.44561e+05, 4.15388e+05,
        2.04304e+05]])

The first row of the original X is $(x_0, x_1) = (469, 647)$. The first row of X_poly is $[1, 469, 647, 219961, 303443, 418609]$, and since $469^2 = 219961$, $469 \cdot 647 = 303443$, and $647^2 = 418609$, the columns are $[1, x_0, x_1, x_0^2, x_0x_1, x_1^2]$.

In [12]:
# Repeat analysis for the polynomial regression model

# Split the data set into train and test sets
X_poly_train, X_poly_test, y_poly_train, y_poly_test = train_test_split(
    X_poly,
    y,
    test_size=0.10,
    random_state=42
)

# Create a linear regression model using the new train sets
polyreg = LinearRegression()
polyreg.fit(X_poly_train, y_poly_train)

# Display the score, coefficients, and intercept of the new model
print('Score: ', polyreg.score(X_poly_test, y_poly_test))
print('Coefficients: ', polyreg.coef_)
print('Intercept: ', polyreg.intercept_)

Score:  1.0
Coefficients:  [ 0.00000000e+00  1.20000000e+01 -1.27195488e-07  1.26494371e-11
  2.00000000e+00  2.85714287e-02]
Intercept:  2.0477105863392353e-05


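The constant term of the equation is the intercept, since the coefficient on the bias column is 0. This gives us the equation $h(x_0, x_1) = 2.0477 \times 10^{-5} + 12x_0 - 1.2720 \times 10^{-7}x_1 + 1.2649 \times 10^{-11}x_0^2 + 2x_0x_1 + 0.028571x_1^2$. The constant, $x_1$, and $x_0^2$ terms are numerically negligible, so the model is effectively $h(x_0, x_1) \approx 12x_0 + 2x_0x_1 + 0.028571x_1^2$.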
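The fitted coefficients are dominated by three terms: 12 on $x_0$, 2 on $x_0x_1$, and about 0.028571 on $x_1^2$. The sketch below checks that reading against the ten rows shown at the top of the notebook. Treating 0.028571 as $1/35$ is an assumption made here for illustration, and the function name `slime_size` is hypothetical.

```python
def slime_size(x0, x1):
    """Simplified degree-two model; the 1/35 factor is an assumed reading of 0.028571."""
    return 12 * x0 + 2 * x0 * x1 + x1 ** 2 / 35

# (temperature, mols KCl, size) copied from the first ten rows of the dataset
rows = [
    (469, 647, 624474.3),
    (403, 694, 577961.0),
    (302, 975, 619684.7),
    (779, 916, 1460449.0),
    (901, 18, 43257.26),
    (545, 637, 712463.4),
    (660, 519, 700696.0),
    (143, 869, 271826.0),
    (89, 461, 89198.03),
    (294, 776, 477021.0),
]

# Largest absolute gap between the simplified formula and the recorded sizes
max_error = max(abs(slime_size(x0, x1) - size) for x0, x1, size in rows)
print(max_error)
```

A small maximum error here is consistent with the near-zero terms being numerical noise, although ten rows cannot prove the exact form of the underlying equation.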

In [13]:
# Predict the y value for a new input (x_0, x_1)
X1 = [[1, 500.5, 471.53, 500.5**2, 500.5*471.53, 471.53**2]]
X_poly_pred = pd.DataFrame(X1)
y_poly_pred = polyreg.predict(X_poly_pred)

print(y_poly_pred)

[484360.11686993]


In [14]:
# Check if an existing point produces the expected y
X1 = [[1, 469, 647, 469**2, 469*647, 647**2]]
X_poly_pred = pd.DataFrame(X1)
y_poly_pred = polyreg.predict(X_poly_pred)

print(y_poly_pred)

[624474.25713602]


An $R^2$ of 1.0 on the test set means the degree-two model reproduces the held-out values essentially exactly, which suggests that the data follow a degree-two polynomial in $(x_0, x_1)$. For the new point $(x_0, x_1) = (500.5, 471.53)$, the predicted value is now about 484,360 nm$^3$ instead of the roughly 511,062 nm$^3$ predicted by the degree-one linear regression. For the existing point $(x_0, x_1) = (469, 647)$, the predicted value is about 624,474.26 nm$^3$, which matches the 624,474.3 nm$^3$ recorded in the original data.
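Building the six polynomial columns by hand for every prediction is easy to get wrong. One alternative, sketched below, chains the expansion and the regression with `make_pipeline` so that `predict` accepts raw $(x_0, x_1)$ pairs. The training data are a synthetic stand-in generated from an assumed formula, not the notebook's CSV, and the names `X_demo`, `y_demo`, and `model` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data from an assumed degree-two relationship
rng = np.random.default_rng(0)
X_demo = rng.integers(1, 1001, size=(1000, 2)).astype(float)
y_demo = 12 * X_demo[:, 0] + 2 * X_demo[:, 0] * X_demo[:, 1] + X_demo[:, 1] ** 2 / 35

# The pipeline expands the features and fits the regression in one estimator
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X_demo, y_demo)

# Predict directly from a raw (x0, x1) pair; no manual feature construction needed
pred = model.predict([[469, 647]])[0]
print(pred)
```

The same pipeline object can be passed to `train_test_split` outputs or `cross_val_score`, which keeps the feature expansion inside each training fold.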