<a href="https://colab.research.google.com/github/profmcnich/example_notebook/blob/main/a3_sample_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

(^ Be sure to update this button to point to your notebook instead of the sample notebook.)

In [1]:
# Imports section
from sklearn.linear_model import LinearRegression
import numpy as np
import pandas as pd



## Part 1. Loading the dataset

In [2]:
# Using pandas load the dataset (load remotely, not locally)
df = pd.read_csv("https://raw.githubusercontent.com/profmcnich/example_notebook/main/science_data_large.csv")

# Output the first 15 rows of the data
print(df.head(15))
# Display a summary of the table information (number of datapoints, etc.)
print("\n\n data frame info:")
print(df.info())

    Temperature °C  Mols KCL     Size nm^3
0              469       647  6.244743e+05
1              403       694  5.779610e+05
2              302       975  6.196847e+05
3              779       916  1.460449e+06
4              901        18  4.325726e+04
5              545       637  7.124634e+05
6              660       519  7.006960e+05
7              143       869  2.718260e+05
8               89       461  8.919803e+04
9              294       776  4.770210e+05
10             991       117  2.441771e+05
11             307       781  5.006455e+05
12             206        70  3.145200e+04
13             437       599  5.390215e+05
14             566        75  9.185271e+04


 data frame info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Temperature °C  1000 non-null   int64  
 1   Mols KCL        1000 non-null   int64  
 2   Si

## Part 2. Splitting the dataset

In [3]:
# Take the pandas dataset and split it into our features (X) and label (y)

# Use sklearn to split the features and labels into a training/test set. (90% train, 10% test)

In [4]:
X = df[["Temperature °C", "Mols KCL"]]
y = df["Size nm^3"]

In [5]:
# LinearRegression is already imported in the imports section above
from sklearn.model_selection import train_test_split

In [6]:
# Note: test_size=0.2 gives an 80/20 split; the instructions above ask for 90/10 (test_size=0.1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
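
The instructions for this part ask for a 90% train / 10% test split. The following is a minimal sketch of such a split, not the notebook's actual run: it uses a synthetic stand-in DataFrame (`df_demo`, an assumption so the block runs without downloading the CSV) and a hypothetical `random_state=42` so the shuffle is repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's df (assumption: same column names, 1000 rows)
rng = np.random.default_rng(0)
temp = rng.integers(1, 1000, size=1000)
kcl = rng.integers(1, 1000, size=1000)
df_demo = pd.DataFrame({
    "Temperature °C": temp,
    "Mols KCL": kcl,
    "Size nm^3": 12 * temp + 2 * temp * kcl + kcl**2 / 35,
})

X_demo = df_demo[["Temperature °C", "Mols KCL"]]
y_demo = df_demo["Size nm^3"]

# test_size=0.1 holds out 10% of the rows; random_state fixes the shuffle
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=42
)
print(X_tr.shape, X_te.shape)  # (900, 2) (100, 2)
```

Fixing `random_state` keeps the score and coefficients reported later the same from one run to the next, which makes the written equation easier to keep in sync with the output.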

## Part 3. Perform a Linear Regression

In [7]:
# Use sklearn to train a model on the training set
reg = LinearRegression().fit(X_train,y_train)


In [8]:
# Create a sample datapoint and predict the output of that sample with the trained model
# (here the model predicts on every row of the held-out test set rather than on one sample)
y_pred = reg.predict(X_test)
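
The comment in this cell asks for a single sample datapoint. A minimal sketch of that, assuming synthetic stand-in data and a hypothetical sample of 500 °C and 300 mols KCL (neither value comes from the dataset):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (assumption: same feature names as the notebook)
rng = np.random.default_rng(1)
temp = rng.integers(1, 1000, size=200)
kcl = rng.integers(1, 1000, size=200)
X_demo = pd.DataFrame({"Temperature °C": temp, "Mols KCL": kcl})
y_demo = 12 * temp + 2 * temp * kcl + kcl**2 / 35

model = LinearRegression().fit(X_demo, y_demo)

# One hypothetical sample; the column names must match those used for fitting
sample = pd.DataFrame({"Temperature °C": [500], "Mols KCL": [300]})
sample_pred = model.predict(sample)
print(sample_pred)  # a one-element array holding the predicted size
```

Passing the sample as a one-row DataFrame with the same column names keeps the feature order explicit.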

In [9]:
y_df = pd.DataFrame({'predicted score': y_pred })
y_df

     predicted score
0     -105937.634967
1       38028.680675
2      -33681.372292
3      -79557.807413
4      104876.815462
..               ...
195    294899.991550
196    764611.073057
197    656423.177952
198   -124486.641168


In [10]:
# Report on the score for that model, in your own words (markdown, not code) explain what the score means
reg.score(X_test,y_test)

0.8764466026558668

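`reg.score` returns the coefficient of determination R², whose best possible value is 1.0. A score of about 0.876 means the linear model explains roughly 88% of the variance in `Size nm^3` on the test set, so its predictions follow the data closely but not perfectly.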
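
To make the meaning of the score concrete, the sketch below computes R² by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares, and compares it with `score`. The data is synthetic (an assumption for illustration), so the printed value is not the notebook's 0.876.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with some noise, for illustration only
rng = np.random.default_rng(2)
X_demo = rng.uniform(0, 10, size=(100, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(0, 1, size=100)

model = LinearRegression().fit(X_demo, y_demo)
pred = model.predict(X_demo)

ss_res = np.sum((y_demo - pred) ** 2)           # unexplained variation
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)  # total variation around the mean
r2_manual = 1 - ss_res / ss_tot
r2_sklearn = model.score(X_demo, y_demo)

print(r2_manual, r2_sklearn)  # the two values agree
```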
  

In [11]:
# Extract the coefficients and intercept from the model and write an equation for your h(x) using LaTeX
reg.coef_, reg.intercept_

 

(array([ 859.39789969, 1018.00119306]), -405281.2089141748)

\begin{equation}
h(x) = 859.40\,x_1 + 1018.00\,x_2 - 405281.21
\end{equation}

where $x_1$ is the temperature in °C and $x_2$ is the mols of KCL.
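
One way to check that a written equation matches the fitted model is to rebuild the prediction from `coef_` and `intercept_` and compare it with `predict`. This is a sketch on synthetic data generated around coefficients similar to those printed above (an assumption for illustration, not the real dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data scattered around a known linear relationship
rng = np.random.default_rng(3)
X_demo = rng.uniform(0, 1000, size=(200, 2))
y_demo = (859.4 * X_demo[:, 0] + 1018.0 * X_demo[:, 1] - 405281.2
          + rng.normal(0, 1000, size=200))

model = LinearRegression().fit(X_demo, y_demo)
w1, w2 = model.coef_
b = model.intercept_
print(f"h(x) = {w1:.2f} * x1 + {w2:.2f} * x2 + ({b:.2f})")

# h(x) evaluated by hand must equal the model's own predictions
manual = X_demo @ model.coef_ + model.intercept_
print(np.allclose(manual, model.predict(X_demo)))  # True
```

Each entry of `coef_` multiplies the feature in the same column position, so with two features the equation needs two terms plus the intercept.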


## Part 4. Use Cross Validation

In [12]:
# Use the cross_val_score function to repeat your experiment across many shuffles of the data

# Report on their finding and their significance

In [13]:
from sklearn.model_selection import cross_val_score
cross_val_score(reg, X,y)

array([0.83918826, 0.87051239, 0.85871066, 0.87202623, 0.84364641])

`cross_val_score` uses 5-fold cross-validation by default: the data is divided into 5 folds, and the model is trained 5 times, each time on 4 folds and evaluated on the remaining one.
The 5 scores in the array are close to one another, ranging from about 0.839 to 0.872 with a mean of about 0.857.
Because the score stays high on every split, the result from Part 3 is unlikely to be an artifact of one lucky train/test split, which makes the reported score more trustworthy.
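
By default, `cross_val_score` on a regressor splits the rows into consecutive folds without shuffling. Since the task asks for repeated shuffles, one option is to pass a shuffling `KFold` explicitly and summarize the scores with their mean and standard deviation. This is a sketch on synthetic data; `random_state=0` is an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data, for illustration only
rng = np.random.default_rng(4)
X_demo = rng.uniform(0, 1000, size=(500, 2))
y_demo = (12 * X_demo[:, 0] + 2 * X_demo[:, 0] * X_demo[:, 1]
          + X_demo[:, 1] ** 2 / 35)

# Shuffle the rows before cutting them into 5 folds
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=cv)

print(scores)
print(scores.mean(), scores.std())
```

A small standard deviation relative to the mean is what "the scores are very similar" means in numbers.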


## Part 5. Using Polynomial Regression

In [14]:
# Using the PolynomialFeatures library perform another regression on an augmented dataset of degree 2

# Report on the metrics and output the resultant equation as you did in Part 3.

In [15]:
from sklearn.preprocessing import PolynomialFeatures

X_pol = df[["Temperature °C", "Mols KCL"]]
y_pol = df["Size nm^3"]

poly= PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X_pol)
PolyReg= LinearRegression()  
PolyReg.fit(X_poly, y_pol)


LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [16]:
y_pol_prediction = PolyReg.predict(X_poly)

In [17]:
PolyReg.score(X_poly,y_pol)

1.0
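
The score above is computed on the same rows the model was fitted on. A held-out score is usually a fairer check. The sketch below, on synthetic data that is exactly quadratic (an assumption), fits a degree-2 pipeline on 90% of the rows and scores it on the remaining 10%.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic, noise-free quadratic data, for illustration only
rng = np.random.default_rng(5)
X_demo = rng.uniform(1, 1000, size=(500, 2))
a, b = X_demo[:, 0], X_demo[:, 1]
y_demo = 12 * a + 2 * a * b + b**2 / 35

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=0
)

# The pipeline expands the features and fits the regression in one step
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(X_tr, y_tr)

test_r2 = poly_model.score(X_te, y_te)
print(test_r2)
```

If the held-out score is also essentially 1.0, the perfect fit reflects the structure of the data rather than memorization of the training rows.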

In [18]:
PolyReg.predict(X_poly)

array([6.24474257e+05, 5.77961029e+05, 6.19684714e+05, 1.46044903e+06,
       4.32572572e+04, 7.12463400e+05, 7.00696029e+05, 2.71826029e+05,
       8.91980286e+04, 4.77021029e+05, 2.44177114e+05, 5.00645457e+05,
       3.14520000e+04, 5.39021457e+05, 9.18527143e+04, 3.95288286e+04,
       5.38421457e+05, 1.14843143e+04, 1.48585029e+05, 4.16308457e+05,
       1.31596457e+05, 4.82433257e+05, 1.16136540e+06, 1.36031143e+04,
       4.24489114e+05, 1.97787143e+04, 8.03035857e+05, 3.21295000e+05,
       6.95233029e+05, 2.23961400e+05, 1.10432926e+06, 1.92627283e+06,
       5.21373600e+05, 7.91715314e+05, 4.53954314e+05, 5.11930286e+04,
       5.94753143e+04, 4.40629714e+05, 4.60782857e+05, 7.88616000e+04,
       8.03208600e+05, 2.28364457e+05, 2.41597829e+05, 1.04578046e+06,
       6.59932571e+04, 6.18540286e+04, 3.97636457e+05, 4.93009714e+05,
       1.18457029e+05, 2.42666829e+05, 1.26718971e+06, 1.28496257e+05,
       6.10293600e+05, 5.76091143e+04, 9.13729029e+05, 1.41796260e+06,
      

In [19]:
PolyReg.intercept_


1.657189568504691e-05

In [20]:
PolyReg.coef_


array([ 0.00000000e+00,  1.20000000e+01, -1.23110504e-07, -1.05648823e-11,
        2.00000000e+00,  2.85714287e-02])
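
To read these coefficients, each one has to be matched to its polynomial term. The sketch below uses the `powers_` attribute of a fitted `PolynomialFeatures` to list the terms in column order and pairs them with the values printed above. The names `a` and `b` are placeholders for temperature and mols KCL.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Only the number of input columns matters for the term order
poly_demo = PolynomialFeatures(degree=2)
poly_demo.fit(np.zeros((1, 2)))

names = ["a", "b"]
terms = []
for powers in poly_demo.powers_:
    # powers holds the exponent of each input feature in this output column
    factors = [n if p == 1 else f"{n}^{p}" for n, p in zip(names, powers) if p > 0]
    terms.append("*".join(factors) if factors else "1")

# Coefficients copied from the PolyReg.coef_ output above
coefs = [0.00000000e+00, 1.20000000e+01, -1.23110504e-07,
         -1.05648823e-11, 2.00000000e+00, 2.85714287e-02]

for term, c in zip(terms, coefs):
    print(f"{term:>4}: {c: .6g}")

print(terms)  # ['1', 'a', 'b', 'a^2', 'a*b', 'b^2']
```

Read this way, only the `a`, `a*b`, and `b^2` terms carry coefficients that are meaningfully different from zero.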

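\begin{alignat*}{7}
f(a, b) &= 12a + 2ab + 0.0286\,b^2
\end{alignat*}

Here $a$ is the temperature in °C and $b$ is the mols of KCL. The coefficients on $b$ and $a^2$ (about $10^{-7}$ and $10^{-11}$) and the intercept (about $10^{-5}$) are numerically zero, so those terms are omitted.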


Polynomial regression is one of several methods of curve fitting. With polynomial regression, the data is approximated using a polynomial function. A polynomial is a function of the form $f(x) = c_0 + c_1 x + c_2 x^2 + \cdots + c_n x^n$, where $n$ is the degree of the polynomial and $c_0, \dots, c_n$ are the coefficients.

As I understand the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html), the features in my equation follow this description: <br>
"Generate a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree. For example, if an input sample is two dimensional and of the form [a, b], the degree-2 polynomial features are [1, a, b, a^2, ab, b^2]"