<a href="https://colab.research.google.com/github/profmcnich/example_notebook/blob/main/a3_sample_notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

(^ Be sure to update this button to point to your notebook instead of the sample notebook.)

In [2]:
# Imports section
import pandas as pd
import numpy as np
from sklearn import datasets, linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

## Part 1. Loading the dataset

In [3]:
# 1. Using pandas, load the dataset (load remotely, not locally)
# 2. Output the first 15 rows of the data
csv_url = "https://raw.githubusercontent.com/profmcnich/example_notebook/main/science_data_large.csv"
data = pd.read_csv(csv_url)
data.head(15)

Unnamed: 0,Temperature °C,Mols KCL,Size nm^3
0,469,647,624474.3
1,403,694,577961.0
2,302,975,619684.7
3,779,916,1460449.0
4,901,18,43257.26
5,545,637,712463.4
6,660,519,700696.0
7,143,869,271826.0
8,89,461,89198.03
9,294,776,477021.0


In [4]:
# 3. Display a summary of the table information (number of datapoints, etc.)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 3 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Temperature °C  1000 non-null   int64  
 1   Mols KCL        1000 non-null   int64  
 2   Size nm^3       1000 non-null   float64
dtypes: float64(1), int64(2)
memory usage: 23.6 KB


## Part 2. Splitting the dataset

In [5]:
# Take the pandas dataset and split it into our features (X) and label (y)
# Use sklearn to split the features and labels into a training/test set. (90% train, 10% test)

# Independent variables: the factors that contribute to the dependent variable (y).
# Take ALL rows, but only the Temperature and Mols KCL columns
# (column positions 0 and 1; the slice 0:2 excludes position 2).
X = data.iloc[:, 0:2]

# Dependent variable: the value measured as the output of the independent variables.
# Again take all rows, but ONLY the Size nm^3 column (position 2).
# The slice 2:3 keeps y as a one-column DataFrame of shape (n, 1).
y = data.iloc[:, 2:3]
X.head(), y.head()

(   Temperature °C  Mols KCL
 0             469       647
 1             403       694
 2             302       975
 3             779       916
 4             901        18,
       Size nm^3
 0  6.244743e+05
 1  5.779610e+05
 2  6.196847e+05
 3  1.460449e+06
 4  4.325726e+04)

In [33]:
'''
Split data where we have 90% training and 10% testing
X: shape = (1000, 2)
y: shape = (1000, 1)
'''
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((900, 2), (100, 2), (900, 1), (100, 1))
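
Because `train_test_split` shuffles the rows randomly, every run of the cell above yields a different split and therefore slightly different scores later on. A minimal sketch of a reproducible split follows, assuming a fixed seed is acceptable for this assignment. The seed value, the `demo` frame, and its synthetic values are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the notebook's data (assumption).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Temperature °C": rng.integers(0, 1000, size=1000),
    "Mols KCL": rng.integers(0, 1000, size=1000),
})
demo["Size nm^3"] = 2.0 * demo["Temperature °C"] * demo["Mols KCL"]

X_demo = demo[["Temperature °C", "Mols KCL"]]
y_demo = demo[["Size nm^3"]]

# random_state pins the shuffle, so repeated calls return the same split.
split_a = train_test_split(X_demo, y_demo, test_size=0.10, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.10, random_state=42)

X_train_a, X_test_a = split_a[0], split_a[1]
same_split = X_test_a.index.equals(split_b[1].index)
print(X_train_a.shape, X_test_a.shape, same_split)
```

With a fixed `random_state`, a score quoted in a markdown cell stays in sync with the printed output when the notebook is re-run.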

## Part 3. Perform a Linear Regression

In [36]:
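# Use sklearn to train a model on the training set
lr_model = linear_model.LinearRegression().fit(X_train, y_train)

# Take a sample datapoint from the test set and predict its output with the trained model.
# Keeping it as a DataFrame (no .values) preserves the feature names the model was fitted with.
data_point = X_test.iloc[:1]
prediction = lr_model.predict(data_point)

# Named train_r2 so it does not shadow the r2_score function imported from sklearn.metrics.
# Note: this scores the model on the training set.
train_r2 = lr_model.score(X_train, y_train)
print("R^2=", train_r2)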

R^2= 0.8605117264867481


In [None]:
# Report on the score for that model, in your own words (markdown, not code) explain what the score means

# Score explanation
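The lr_model.score() method returns the coefficient of determination, R^2, of the model's predictions. R^2 measures the fraction of the variance in the label (Size nm^3) that the model explains across the whole set of data it is scored on, not how well it predicts a single point such as **data_point**.

Using my linear regression model, I got R^2 ≈ 0.86 (0.8605 in the run shown above; the exact value changes slightly from run to run because the train/test split is random). The best possible score is 1.0, which means the predictions match the data exactly. A score of 0.0 means the model does no better than always predicting the mean of y, and the score can be negative for a model that does worse than that. Since our R^2 is relatively close to 1.0, the model does an adequate job of fitting the data. Note that this score was computed on the training set, so it describes the fit to the training data rather than performance on unseen data.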
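
To make the definition concrete, R^2 can be computed by hand as 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations of y from its mean. The sketch below checks this against `LinearRegression.score` and `sklearn.metrics.r2_score`. It uses small synthetic data (an assumption, so that it runs without the course CSV), and `sk_r2_score` is just an import alias chosen here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score as sk_r2_score

# Synthetic data (assumption): a noisy linear relationship with two features.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(0, 1.0, size=200)

model = LinearRegression().fit(X_demo, y_demo)
y_pred = model.predict(X_demo)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_demo - y_pred) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
manual_r2 = 1.0 - ss_res / ss_tot

# All three values should agree up to floating-point error.
print(manual_r2, model.score(X_demo, y_demo), sk_r2_score(y_demo, y_pred))
```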

In [38]:
# Extract the coefficients and intercept from the model and write an equation for your h(x) using LaTeX

In [47]:
prediction2 = lr_model.predict(X_test)
r2_score2 = lr_model.score(X_train, y_train)
print("R^2=", r2_score2)
print("Coefficients", lr_model.coef_)
print("Intercept", lr_model.intercept_)

R^2= 0.8605117264867481
Coefficients [[ 869.90749169 1023.49974558]]
Intercept [-409461.16861876]


# H(X) function 

Using the extracted values from above, I formulated the following equation, where $x_1$ is the temperature in °C and $x_2$ is the mols of KCl:
$$ h(x) = 869.91x_1 + 1023.50x_2 - 409461.17 $$
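
As a quick check on the equation, it can be evaluated by hand and compared with what `predict` does. In the sketch below, the function `h` and the constants `COEF` and `INTERCEPT` are illustrative names introduced here, with values copied from the printed output above. The second half uses synthetic data (an assumption) to confirm that `predict` is the same linear expression, `X @ coef_ + intercept_`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Coefficients and intercept copied from the fitted model's printed output.
COEF = np.array([869.90749169, 1023.49974558])
INTERCEPT = -409461.16861876

def h(temperature_c, mols_kcl):
    """Evaluate the fitted plane by hand for one (temperature, mols) pair."""
    return COEF[0] * temperature_c + COEF[1] * mols_kcl + INTERCEPT

# First row of the dataset shown in Part 1: 469 °C and 647 mols of KCl.
print(h(469, 647))

# Sanity check on synthetic data (assumption): predict() equals
# X @ coef_ + intercept_ for a fitted LinearRegression.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1000, size=(50, 2))
y_demo = X_demo @ COEF + INTERCEPT
model = LinearRegression().fit(X_demo, y_demo)
by_hand = X_demo @ model.coef_ + model.intercept_
print(np.allclose(by_hand, model.predict(X_demo)))
```

For that first row the hand-evaluated value comes out near 6.6e5 nm^3, against a measured 624474.3 nm^3 in the table, which gives a feel for the size of the residuals behind an R^2 of about 0.86.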

## Part 4. Use Cross Validation

In [8]:
# Use the cross_val_score function to repeat your experiment across many shuffles of the data

# Report on the findings and their significance
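
One possible way to approach this cell is sketched below. It assumes `ShuffleSplit` is an acceptable reading of "many shuffles of the data"; the number of splits, the seed, and the synthetic stand-in data (a product of the two features plus noise) are assumptions made so the example runs on its own. In the notebook, `X` and `y` from Part 2 would take the place of `X_demo` and `y_demo`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Synthetic stand-in data (assumption), not the course dataset.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1000, size=(1000, 2))
y_demo = 2.0 * X_demo[:, 0] * X_demo[:, 1] + rng.normal(0, 1e4, size=1000)

# Ten independent 90/10 shuffles; each one refits the model on its training
# part and scores R^2 on its held-out part.
cv = ShuffleSplit(n_splits=10, test_size=0.10, random_state=0)
scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=cv, scoring="r2")

print("R^2 per split:", scores)
print("mean:", scores.mean(), "std:", scores.std())
```

Reporting the mean together with the standard deviation shows both the typical held-out score and how much it depends on the particular split.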

## Part 5. Using Polynomial Regression

In [9]:
# Using the PolynomialFeatures library perform another regression on an augmented dataset of degree 2

# Report on the metrics and output the resultant equation as you did in Part 3.