## Find an Optimal Model for Predicting the Critical Temperatures of Superconductors

You work as a data scientist for a cable manufacturer. Management has decided to start shipping low-resistance cables to clients around the world. To ensure that the right cables are shipped to the right countries, they would like to predict the critical temperatures of various cables based on certain observed readings.

In this activity, you will train a linear regression model and compute its R2 score and MSE. You will then engineer new features using polynomial features of degree 3 and compare the R2 score and MSE of the new model with those of the first model to check for overfitting. Finally, you will use regularization to train a model that generalizes to previously unseen data.
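
As a reminder of what the two metrics measure, the sketch below computes MSE and R2 by hand on a few made-up numbers (the values are illustrative and not taken from the superconductor dataset). The helper names `mse` and `r2` are hypothetical; in the notebook itself the same quantities come from `mean_squared_error` and the model's `score` method.

```python
def mse(y_true, y_pred):
    """Mean of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# made-up targets and predictions, for illustration only
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 39.0]

print('MSE: {}'.format(mse(y_true, y_pred)))
print('R2: {}'.format(r2(y_true, y_pred)))
```

A lower MSE and an R2 closer to 1 both indicate a better fit on the data they are computed on.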

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

In [2]:
_df = pd.read_csv('https://raw.githubusercontent.com/PacktWorkshops/'\
                 'The-Data-Science-Workshop/master/'\
                 'Chapter07/Dataset/superconduct/train.csv')
_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21263 entries, 0 to 21262
Data columns (total 82 columns):
 #   Column                           Non-Null Count  Dtype  
---  ------                           --------------  -----  
 0   number_of_elements               21263 non-null  int64  
 1   mean_atomic_mass                 21263 non-null  float64
 2   wtd_mean_atomic_mass             21263 non-null  float64
 3   gmean_atomic_mass                21263 non-null  float64
 4   wtd_gmean_atomic_mass            21263 non-null  float64
 5   entropy_atomic_mass              21263 non-null  float64
 6   wtd_entropy_atomic_mass          21263 non-null  float64
 7   range_atomic_mass                21263 non-null  float64
 8   wtd_range_atomic_mass            21263 non-null  float64
 9   std_atomic_mass                  21263 non-null  float64
 10  wtd_std_atomic_mass              21263 non-null  float64
 11  mean_fie                         21263 non-null  float64
 12  wtd_mean_fie      

In [3]:
# features and labels
X = _df.drop(['critical_temp'], axis=1).values
y = _df['critical_temp'].values

In [4]:
# split data into training and evaluation sets
train_X, eval_X, train_y, eval_y = train_test_split(X, y, train_size=0.8, random_state=0)

In [5]:
# instantiate baseline LinearRegression
lr_model_1 = LinearRegression()

# fit model
lr_model_1.fit(train_X, train_y)

LinearRegression()

In [6]:
# make predictions on the evaluation dataset
lr_model_1_preds = lr_model_1.predict(eval_X)

In [7]:
# R2 of the model
print('lr_model_1 Score: {}'.format(lr_model_1.score(eval_X, eval_y)))

lr_model_1 Score: 0.7350976364618664


In [8]:
# MSE
print('lr_model_1 MSE: {}'.format(mean_squared_error(eval_y, lr_model_1_preds)))

lr_model_1 MSE: 308.3212711891406
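
Because MSE is expressed in squared units of the target, its square root is often easier to interpret. A minimal sketch using the MSE printed above; the variable names are hypothetical, and the note about units assumes `critical_temp` is recorded in kelvin, which the notebook does not state.

```python
import math

# value copied from the lr_model_1 MSE output above
lr_model_1_mse = 308.3212711891406

# RMSE is in the same units as the target (assumed here to be kelvin)
lr_model_1_rmse = math.sqrt(lr_model_1_mse)
print('lr_model_1 RMSE: {:.2f}'.format(lr_model_1_rmse))
```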


In [9]:
# create a list of tuples to serve as a pipeline
steps = [('scaler', MinMaxScaler()),
         ('poly', PolynomialFeatures(degree=3)),
         ('lr', Ridge(alpha=0.9))]

In [10]:
# create an instance of a pipeline
lr_model_2 = Pipeline(steps)
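
Before fitting, it helps to estimate how large the expanded feature matrix will be. For n input features, a full polynomial expansion of degree d (bias column included) has C(n + d, d) output columns. The sketch below applies that formula to the 81 features left after dropping `critical_temp`. The memory estimate assumes a dense float64 matrix and the 80% training split of the 21,263 rows; the variable names are hypothetical.

```python
import math

n_features = 81   # 82 columns minus the critical_temp label
degree = 3        # as set in PolynomialFeatures(degree=3)

# number of monomials of total degree <= 3, including the constant term
n_output_features = math.comb(n_features + degree, degree)

# rows in the training split (train_size=0.8 of 21263 rows)
n_train_rows = int(21263 * 0.8)

# rough size of the expanded training matrix as dense float64 values
approx_gib = n_train_rows * n_output_features * 8 / 1024 ** 3

print('expanded features: {}'.format(n_output_features))
print('approx. dense size: {:.1f} GiB'.format(approx_gib))
```

If the estimate exceeds the available memory, a lower degree or a subset of the features may be a more practical starting point.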

In [None]:
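# train pipeline instance
# note: the degree-3 expansion of the 81 input features yields 95,284 columns,
# so this fit can need a lot of memory and time
lr_model_2.fit(train_X, train_y)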

In [None]:
# R2 of model 2
print('lr_model_2 R2 score: {}'.format(lr_model_2.score(eval_X, eval_y)))

In [None]:
# preds of model 2
lr_model_2_preds = lr_model_2.predict(eval_X)

In [None]:
# MSE of model 2
print('lr_model_2 MSE: {}'.format(mean_squared_error(eval_y, lr_model_2_preds)))
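
To judge overfitting, it is useful to look at the same metric on both the training and the evaluation split: a training error far below the evaluation error suggests the model has memorized the training data. The helper below is a hypothetical sketch and not part of the original notebook. It works with any object exposing a `predict` method and is demonstrated on a stand-in model with made-up numbers so that it runs on its own.

```python
def mse(y_true, y_pred):
    """Mean of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def train_eval_gap(model, train_X, train_y, eval_X, eval_y):
    """Return training MSE, evaluation MSE, and their difference."""
    train_mse = mse(train_y, model.predict(train_X))
    eval_mse = mse(eval_y, model.predict(eval_X))
    return {'train_mse': train_mse,
            'eval_mse': eval_mse,
            'gap': eval_mse - train_mse}


class MemorizingModel:
    """Stand-in model that memorizes training pairs (illustration only)."""

    def fit(self, X, y):
        self.lookup = {tuple(row): target for row, target in zip(X, y)}
        self.default = sum(y) / len(y)
        return self

    def predict(self, X):
        # memorized target if the row was seen, else the training mean
        return [self.lookup.get(tuple(row), self.default) for row in X]


# made-up data, for illustration only
toy_train_X, toy_train_y = [[1.0], [2.0], [3.0]], [10.0, 20.0, 30.0]
toy_eval_X, toy_eval_y = [[4.0], [5.0]], [40.0, 50.0]

toy_model = MemorizingModel().fit(toy_train_X, toy_train_y)
report = train_eval_gap(toy_model, toy_train_X, toy_train_y,
                        toy_eval_X, toy_eval_y)
print(report)
```

In the notebook, the same helper could be called as `train_eval_gap(lr_model_2, train_X, train_y, eval_X, eval_y)` once the pipeline has been fitted, and likewise for `lr_model_1`.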