<a href="https://colab.research.google.com/github/profmcnich/example_notebook/blob/main/a3_sample_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


In [123]:
# Imports section
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import PolynomialFeatures

## Part 1. Loading the dataset

In [None]:
# Using pandas load the dataset (load remotely, not locally)
df = pd.read_csv("https://raw.githubusercontent.com/profmcnich/example_notebook/main/science_data_large.csv")

In [None]:
# Output the first 15 rows of the data
df.head(15)

,Temperature °C,Mols KCL,Size nm^3
0,469,647,624474.3
1,403,694,577961.0
2,302,975,619684.7
3,779,916,1460449.0
4,901,18,43257.26
5,545,637,712463.4
6,660,519,700696.0
7,143,869,271826.0
8,89,461,89198.03
9,294,776,477021.0


In [None]:
# Display a summary of the table information (number of datapoints, etc.)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Temperature °C  1000 non-null   int64  
 1   Mols KCL        1000 non-null   int64  
 2   Size nm^3       1000 non-null   float64
dtypes: float64(1), int64(2)
memory usage: 23.6 KB


In [None]:
df.describe()

,Temperature °C,Mols KCL,Size nm^3
count,1000.0,1000.0,1000.0
mean,500.5,471.53,508611.1
std,288.819436,288.482872,447483.8
min,1.0,1.0,16.11429
25%,250.75,226.75,129826.7
50%,500.5,459.5,382718.2
75%,750.25,710.25,760321.1
max,1000.0,1000.0,1972127.0


## Part 2. Splitting the dataset

In [None]:
# Take the pandas dataset and split it into our features (X) and label (y)
X = df[["Temperature °C", "Mols KCL"]]
y = df["Size nm^3"]

In [None]:
# Use sklearn to split the features and labels into a training/test set. (90% train, 10% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10)

## Part 3. Perform a Linear Regression

In [None]:
# Use sklearn to train a model on the training set
model = LinearRegression().fit(X_train.values, y_train.values)

In [None]:
# Create a sample datapoint and predict the output of that sample with the trained model
sample = [[400, 600]]
prediction = model.predict(sample)
print(prediction[0])

552180.0899463417


In [None]:
# Report on the score for that model, in your own words (markdown, not code) explain what the score means
print("test score: ", model.score(X_test.values, y_test.values))
print("train score: ", model.score(X_train.values, y_train.values))

test score:  0.8727502734371518
train score:  0.8586574620261485


After fitting the linear regression model, the score() method reports the coefficient of determination, also known as the $R^2$ score. It is the proportion of the variance in the dependent variable that the model explains from the independent variable(s). The best possible score is 1.0, which means the predictions match the data exactly; a model that always predicts the mean of y scores 0.0. In this case, the test data scored 0.87 and the train data scored 0.86, so a linear model explains most of the variation in size, and the similar scores on both sets give no sign of overfitting.
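To make the definition concrete, here is a minimal sketch of the $R^2$ calculation written without any library. The helper name `r_squared` and the toy values are hypothetical and are not taken from the dataset; for a regressor, `score()` is expected to report the same quantity.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: error left over after the model's predictions
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical toy values, not taken from the dataset
y_true = [10.0, 20.0, 30.0, 40.0]
perfect = r_squared(y_true, y_true)                   # predictions match exactly
baseline = r_squared(y_true, [25.0] * 4)              # always predict the mean
close = r_squared(y_true, [12.0, 18.0, 33.0, 39.0])   # small errors
print(perfect, baseline, close)
```

Exact predictions give 1.0, the mean-only baseline gives 0.0, and small errors land in between, which is how the 0.87 and 0.86 scores above should be read.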

In [132]:
# Extract the coefficients and intercept from the model and write an equation for your h(x) using LaTeX
# np.round replaces np.round_, which is deprecated and was removed in NumPy 2.0
print('x_1: ', np.round(model.coef_[0], 5))
print('x_2: ', np.round(model.coef_[1], 5))
print('intercept: ', np.round(model.intercept_))

x_1:  866.43216
x_2:  1024.28641
intercept:  -408965.0


$h(x) = 866.43216x_1 + 1024.28641x_2 - 408965$
<br>
in which $x_1$ is the Temperature °C feature and $x_2$ is the Mols KCL feature.
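As a quick sanity check, the equation can be evaluated by hand for the sample datapoint used earlier (400 °C, 600 mols KCL). This is only a sketch: the constants are the rounded values from the printed output, and the function name `h` is chosen here to mirror the equation.

```python
# Rounded values copied from the printed coefficients and intercept
COEF_TEMPERATURE = 866.43216
COEF_MOLS_KCL = 1024.28641
INTERCEPT = -408965.0

def h(temperature_c, mols_kcl):
    """Evaluate the fitted linear hypothesis h(x) for one datapoint."""
    return COEF_TEMPERATURE * temperature_c + COEF_MOLS_KCL * mols_kcl + INTERCEPT

estimate = h(400, 600)
print(estimate)
```

The result should land within about one unit of the 552180.09 returned by `model.predict`; the small gap comes from rounding the intercept to a whole number.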

## Part 4. Use Cross Validation

In [None]:
# Use the cross_val_score function to repeat your experiment across many shuffles of the data
scores = cross_val_score(model, X_test, y_test, cv=5)
print("scores: ", scores)
print("average: ", scores.mean())
# Report on their finding and their significance

scores:  [0.87849778 0.88952673 0.79130136 0.85233878 0.85677086]
average:  0.8536871035782256


Cross validation splits the data into k parts (folds). On each iteration the model is trained on k - 1 folds and scored on the remaining fold, so an array with one score per fold is returned. Cross validation is significant because it does not require a separate validation set, which would reduce the number of samples used for learning the model. <br> In this case, since cv = 5, the 100-row test set passed to cross_val_score is split into 5 folds of 20 rows and an array of 5 $R^2$ scores is returned. The scores range from 0.79 to 0.89 with an average of 0.85, which is close to the previously calculated test score of 0.87, suggesting that the model's performance does not depend much on one particular split. Because each fold here is trained on only 80 rows, running the same call on the full dataset (X, y) would likely give a more stable estimate.
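The fold mechanics can be sketched without any library. The helper `k_fold_indices` below is hypothetical and only illustrates how 100 rows are divided into 5 contiguous folds; `cross_val_score` handles the splitting internally, and its exact strategy (for example, whether rows are shuffled) depends on its configuration.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous, unshuffled folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

# 100 rows (the size of the test split) divided into 5 folds
folds = list(k_fold_indices(100, 5))
for train_idx, test_idx in folds:
    print(len(train_idx), len(test_idx))
```

Each iteration trains on 80 rows and scores on the other 20, and every row is used for scoring exactly once across the 5 folds.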

## Part 5. Using Polynomial Regression

In [118]:
# Using the PolynomialFeatures library perform another regression on an augmented dataset of degree 2
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)
X_train_poly, X_test_poly, y_train_poly, y_test_poly = train_test_split(X_poly, y, test_size=0.10)
poly_model = LinearRegression().fit(X_train_poly, y_train_poly)

In [122]:
# Report on the metrics and output the resultant equation as you did in Part 3.
print("test score: ", poly_model.score(X_test_poly, y_test_poly))
print("train score: ", poly_model.score(X_train_poly, y_train_poly))

test score:  1.0
train score:  1.0


Both the train and test scores increased to an $R^2$ value of 1.0 by using degree-2 polynomial regression rather than linear regression. This means that the degree-2 model explains essentially all of the variation in the y values for both the train and test data, which suggests that the data follows a quadratic function of the two features.

In [134]:
# np.round replaces np.round_, which is deprecated and was removed in NumPy 2.0
print('coef: ', np.round(poly_model.coef_, 5))
print('intercept: ', np.round(poly_model.intercept_))

coef:  [ 0.      12.      -0.      -0.       2.       0.02857]
intercept:  0.0


$h(x) = 12x_1 + 2x_1x_2 + 0.02857x_2^2$
<br>
in which $x_1$ is the Temperature °C feature and $x_2$ is the Mols KCL feature.
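The printed coefficient array can be mapped back to individual terms and checked against rows of the dataset. The sketch below assumes the default column order produced by `PolynomialFeatures(degree=2)` for two features, namely $[1, x_1, x_2, x_1^2, x_1x_2, x_2^2]$; the helper names are hypothetical, and the constants are the rounded values from the printed output.

```python
def degree2_features(x1, x2):
    """Degree-2 expansion of two features: [1, x1, x2, x1^2, x1*x2, x2^2]."""
    return [1.0, x1, x2, x1 ** 2, x1 * x2, x2 ** 2]

# Rounded coefficients and intercept copied from the printed output
COEF = [0.0, 12.0, 0.0, 0.0, 2.0, 0.02857]
INTERCEPT = 0.0

def h_poly(temperature_c, mols_kcl):
    """Evaluate the fitted degree-2 hypothesis for one datapoint."""
    features = degree2_features(temperature_c, mols_kcl)
    return INTERCEPT + sum(c * f for c, f in zip(COEF, features))

# (Temperature °C, Mols KCL, Size nm^3) rows taken from the df.head() output
rows = [(469, 647, 624474.3), (901, 18, 43257.26), (89, 461, 89198.03)]
errors = [abs(h_poly(t, m) - size) for t, m, size in rows]
for (t, m, size), err in zip(rows, errors):
    print(t, m, size, round(h_poly(t, m), 2), round(err, 2))
```

Under that assumed ordering, the non-zero coefficients belong to $x_1$, $x_1x_2$, and $x_2^2$, and the reconstructed values agree with the listed sizes to within about one unit, the remainder being due to the rounded 0.02857 coefficient.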