<a href="https://colab.research.google.com/github/profmcnich/example_notebook/blob/main/a3_sample_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

(^ Be sure to update this button to point to your notebook instead of the sample notebook.)

In [2]:
# Imports section
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import BayesianRidge
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import PolynomialFeatures as poly
from sklearn.model_selection import ShuffleSplit

## Part 1. Loading the dataset

In [3]:
# Using pandas load the dataset (load remotely, not locally)
dataset = pd.read_csv("https://raw.githubusercontent.com/profmcnich/example_notebook/main/science_data_large.csv")

# Output the first 15 rows of the data
dataset.head(15)


Unnamed: 0,Temperature °C,Mols KCL,Size nm^3
0,469,647,624474.3
1,403,694,577961.0
2,302,975,619684.7
3,779,916,1460449.0
4,901,18,43257.26
5,545,637,712463.4
6,660,519,700696.0
7,143,869,271826.0
8,89,461,89198.03
9,294,776,477021.0


In [9]:
# Display a summary of the table information (number of datapoints, etc.)
dataset.info()
dataset.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Temperature °C  1000 non-null   int64  
 1   Mols KCL        1000 non-null   int64  
 2   Size nm^3       1000 non-null   float64
dtypes: float64(1), int64(2)
memory usage: 23.6 KB


Unnamed: 0,Temperature °C,Mols KCL,Size nm^3
count,1000.0,1000.0,1000.0
mean,500.5,471.53,508611.1
std,288.819436,288.482872,447483.8
min,1.0,1.0,16.11429
25%,250.75,226.75,129826.7
50%,500.5,459.5,382718.2
75%,750.25,710.25,760321.1
max,1000.0,1000.0,1972127.0


## Part 2. Splitting the dataset

In [4]:
# Take the pandas dataset and split it into our features (X) and label (y)
features = dataset[["Temperature °C","Mols KCL"]]
label = dataset[["Size nm^3"]]
x, y = features, label

# Use sklearn to split the features and labels into a training/test set. (90% train, 10% test)
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=.10)
x_train, x_test, y_train, y_test

(     Temperature °C  Mols KCL
 697             842       739
 613             722       816
 626             529       203
 193             794       497
 124             204        29
 ..              ...       ...
 573             892       860
 974             730        72
 770             114       939
 306              69       347
 531             579       758
 
 [900 rows x 2 columns],
      Temperature °C  Mols KCL
 705             935        28
 787             877       693
 389             595       518
 526             369       982
 727             788        65
 ..              ...       ...
 502             184       960
 961             591        34
 826              24        89
 392             955       432
 632             504       702
 
 [100 rows x 2 columns],
         Size nm^3
 697  1.270183e+06
 613  1.205992e+06
 626  2.222994e+05
 193  8.058214e+05
 124  1.430403e+04
 ..            ...
 573  1.566075e+06
 974  1.140281e+05
 770  2.406520e+05
 306  5.2154

## Part 3. Perform a Linear Regression

In [5]:
# Use sklearn to train a model on the training set
lin_reg = LinearRegression()
lin_reg.fit(x_train,y_train)

# Create a sample datapoint and predict the output of that sample with the trained model
Prediction = lin_reg.predict(np.array([[469,647]]))
print("Prediction: ", Prediction)

# Report on the score for that model, in your own words (markdown, not code) explain what the score means
print("Score: ", lin_reg.score(x_test,y_test))


Prediction:  [[661310.41464769]]
Score:  0.8506475828323982


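The score is the coefficient of determination, $R^2$, of the model on the test set: the fraction of the variance in the measured size that the model's predictions account for. A score of 1.0 would be a perfect fit, so 0.85 means the linear model explains about 85% of the variance in size on data it was not trained on and leaves about 15% unexplained.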
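To make the meaning of the score concrete, here is a minimal sketch of the $R^2$ calculation written by hand. The function name `r_squared` and the toy values are illustrative assumptions and do not come from the dataset.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Squared error left over after the model's predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Squared error of always predicting the mean of the labels
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical toy labels, not taken from the dataset
y_true = [1.0, 2.0, 3.0, 4.0]

perfect = r_squared(y_true, y_true)       # predictions equal the labels
baseline = r_squared(y_true, [2.5] * 4)   # always predict the mean

print(perfect, baseline)
```

A model that reproduces every label scores 1, and a model that does no better than predicting the mean scores 0, so 0.85 sits much closer to the first case.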

Sample equation: $E = mc^2$

In [6]:
# Extract the coefficients and intercept from the model and write an equation for your h(x) using LaTeX
print("Coefficients: ",lin_reg.coef_)
print("Intercept: ",lin_reg.intercept_)

Coefficients:  [[ 882.19653808 1033.0814699 ]]
Intercept:  [-420843.47273537]


\[y = 882.19653808x_1 + 1033.0814699x_2 - 420843.47273537\]

Here $x_1$ is the temperature in °C and $x_2$ is the mols of KCL.
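As a sanity check, the coefficients and intercept printed above can be plugged into h(x) by hand and compared with the output of `lin_reg.predict` for the sample point (469, 647). This is a minimal sketch; the function name `h` and the variable names are assumptions made for illustration.

```python
# Values copied from the printed output of the fitted linear model
coef_temperature = 882.19653808
coef_kcl = 1033.0814699
intercept = -420843.47273537


def h(temperature_c, mols_kcl):
    """Evaluate the fitted linear equation for one datapoint."""
    return coef_temperature * temperature_c + coef_kcl * mols_kcl + intercept


# Same sample point that was passed to lin_reg.predict in Part 3
manual = h(469, 647)
print(manual)
```

The result should agree with the printed prediction of 661310.41464769 up to the rounding of the displayed coefficients.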

## Part 4. Use Cross Validation

In [7]:
# Use the cross_val_score function to repeat your experiment across many shuffles of the data
cvs = ShuffleSplit(n_splits = 10, test_size = 0.1, random_state = 0)
scores = cross_val_score(lin_reg, x, y, cv = cvs)
print(scores)

# Report on their finding and their significance

[0.87616468 0.86951566 0.83708494 0.86963943 0.84945355 0.86236913
 0.82467112 0.85236386 0.8648058  0.76555589]


As in Part 3, each score is an $R^2$ value, here computed on a different random 90/10 split of the data. Nine of the ten scores fall between 0.82 and 0.88, with one low outlier at 0.77. The single score from Part 3 (0.85) is therefore representative, but any one split can be off by several hundredths. Because the scores stay well below 1.0 on every split, a straight-line model leaves part of the variation in size unexplained no matter how the data is divided.
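One common way to report cross-validation results is as a mean with a spread. The sketch below does that for the ten scores printed above, using only the standard library; the variable names are illustrative.

```python
from statistics import mean, stdev

# The ten R^2 scores printed by cross_val_score above
scores = [
    0.87616468, 0.86951566, 0.83708494, 0.86963943, 0.84945355,
    0.86236913, 0.82467112, 0.85236386, 0.8648058, 0.76555589,
]

mean_score = mean(scores)
spread = stdev(scores)  # sample standard deviation across the splits

print(f"mean R^2 = {mean_score:.3f}, sample std = {spread:.3f}")
print(f"lowest = {min(scores):.3f}, highest = {max(scores):.3f}")
```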

## Part 5. Using Polynomial Regression

In [8]:
# Using the PolynomialFeatures library perform another regression on an augmented dataset of degree 2
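poly_features = poly(2)  # separate name so the imported PolynomialFeatures alias is not overwritten
x_train = poly_features.fit_transform(x_train)
x_test = poly_features.transform(x_test)  # reuse the transformer fitted on the training set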
model = LinearRegression()
model.fit(x_train,y_train)

# Report on the metrics and output the resultant equation as you did in Part 3.
# R^2 on the training set; model.score(x_test, y_test) would give the held-out score
print(f"Score: {model.score(x_train,y_train)}")
print(f"Predict: {model.predict(x_test)}")
print(f"Coefficients: {model.coef_}")
print(f"Intercept: {model.intercept_}")

Score: 1.0
Predict: [[6.36024000e+04]
 [1.23976740e+06]
 [6.31226400e+05]
 [7.56696114e+05]
 [1.12016714e+05]
 [5.09602859e+03]
 [1.42404926e+06]
 [7.26984429e+05]
 [6.98827114e+05]
 [3.82561143e+04]
 [1.92604572e+04]
 [1.22792686e+06]
 [4.24489114e+05]
 [2.31532457e+05]
 [2.51269829e+05]
 [4.74588257e+05]
 [1.65311429e+03]
 [9.72528114e+05]
 [5.10753257e+05]
 [3.95666857e+05]
 [6.24742429e+05]
 [3.79513029e+05]
 [2.50464114e+05]
 [6.18911429e+03]
 [8.03035857e+05]
 [1.28284457e+05]
 [6.80894257e+05]
 [2.91141829e+05]
 [3.55862571e+04]
 [9.02100600e+05]
 [1.34657371e+06]
 [1.48562314e+05]
 [7.35845400e+05]
 [3.50800114e+05]
 [6.54135429e+05]
 [5.76091143e+04]
 [1.37193257e+05]
 [2.71120286e+04]
 [5.75879314e+05]
 [3.11881257e+05]
 [2.16700286e+04]
 [4.22777829e+05]
 [1.00190826e+06]
 [2.98081029e+05]
 [1.07312314e+05]
 [5.77806829e+05]
 [6.82266857e+05]
 [2.40833829e+05]
 [4.87152257e+05]
 [3.21295000e+05]
 [8.51571314e+05]
 [3.95288286e+04]
 [1.94912829e+05]
 [2.64662829e+05]
 [1.1505

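\[y = 12.0x_1 - 1.43771160 \times 10^{-7}x_2 - 1.16262555 \times 10^{-11}x_1^2 + 2.0x_1x_2 + 2.85714287 \times 10^{-2}x_2^2 + 2.05196557 \times 10^{-5}\]

Here $x_1$ is the temperature in °C and $x_2$ is the mols of KCL. The $x_2$ and $x_1^2$ coefficients and the intercept are effectively zero, so the fit reduces to $y \approx 12x_1 + 2x_1x_2 + 0.0286x_2^2$, which is consistent with the score of 1.0.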
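The degree-2 equation can be checked against the first row of the dataset (temperature 469, mols KCL 647, size 624474.3). The sketch below expands the two inputs in the column order that `PolynomialFeatures` documents for degree 2 (1, $x_1$, $x_2$, $x_1^2$, $x_1x_2$, $x_2^2$) and applies the coefficient values from the equation above. The helper names are hypothetical, and the coefficient of 0.0 on the constant column is an assumption, since the printed coefficient array is not fully visible.

```python
def degree2_features(x1, x2):
    """Expand two inputs into degree-2 polynomial features."""
    return [1.0, x1, x2, x1 * x1, x1 * x2, x2 * x2]


# Coefficients in the same order as the features above.
# The leading 0.0 for the constant column is an assumption.
coefficients = [0.0, 12.0, -1.43771160e-07, -1.16262555e-11, 2.0, 2.85714287e-02]
intercept = 2.05196557e-05


def predict_size(temperature_c, mols_kcl):
    """Evaluate the degree-2 equation for one datapoint."""
    features = degree2_features(temperature_c, mols_kcl)
    return intercept + sum(c * f for c, f in zip(coefficients, features))


# First row of the dataset: 469 °C, 647 mols KCL, size 624474.3
predicted = predict_size(469, 647)
print(predicted)
```

The result lands within rounding of the tabulated size, which suggests the degree-2 features capture the relationship that the straight-line model in Part 3 could only approximate.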