# Activity 9.01 - Data splitting, scaling, and modeling

1. This activity needs only pandas, NumPy, and a few modules from sklearn. Import them in the first cell of the notebook.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression as OLS
from sklearn.metrics import mean_squared_error

2. Use the power_plant.csv dataset. Read the data into a Pandas DataFrame, print out the shape, and list the first five rows.

    The independent variables are as follows.
    
        AT - ambient temperature
        V - exhaust vacuum level
        AP - ambient pressure
        RH - relative humidity
        
    The dependent variable is EP - electrical power produced.

In [2]:
df=pd.read_csv('Chapter9-Datasets/power_plant.csv')
print(df.shape)
df.head()

(9568, 5)


      AT      V       AP     RH      EP
0   8.34  40.77  1010.84  90.01  480.48
1  23.64  58.49  1011.40  74.20  445.75
2  29.74  56.90  1007.15  41.91  438.76
3  19.07  49.69  1007.22  76.79  453.09
4  11.80  40.66  1017.13  97.20  464.43


3. Split the data into train, validation, and test sets with fractions of 0.8, 0.1, and 0.1, respectively, using Python and pandas rather than sklearn methods. A train fraction of 0.8 works here because the dataset is large (9568 rows), so the validation and test splits still get roughly 950 rows each.

In [3]:
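np.random.seed(42)  # fix the seed so the split is reproducible
# draw 80% of the index labels for train, without replacement
train_rows=pd.Series(np.random.choice(list(df.index),int(0.8*df.shape[0]),replace=False))
# draw 10% of all rows for validation from the labels not already in train
val_rows=pd.Series(np.random.choice(list(df.drop(train_rows,axis=0).index),int(0.1*df.shape[0]), replace = False))
# whatever is left over becomes the test set
test_rows=pd.Series(df.drop(pd.concat([train_rows,val_rows]),axis=0).index)
# the row lists hold index labels, so select with .loc
# (.iloc gives the same rows here only because the index is the default 0..n-1 range)
train_data=df.loc[train_rows,:]
val_data = df.loc[val_rows,:]
test_data = df.loc[test_rows,:]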
print( 'train is', train_data.shape, 'rows, cols\n',
         'val is',val_data.shape,'rows,cols\n',
         'test is', test_data.shape,'rows,cols')

train is (7654, 5) rows, cols
 val is (956, 5) rows,cols
 test is (958, 5) rows,cols
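The manual split in step 3 can also be written as a single shuffle followed by two cuts, which makes it easy to check that the three sets are disjoint and cover every row. The following is a minimal sketch of that alternative, not the activity's own solution. `demo_df` is a hypothetical stand-in with made-up values and the same row count as the power plant data, and the names `shuffled`, `train_idx`, `val_idx`, and `test_idx` are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the power plant data: same row count, made-up values.
rng = np.random.default_rng(42)
demo_df = pd.DataFrame(rng.normal(size=(9568, 5)),
                       columns=['AT', 'V', 'AP', 'RH', 'EP'])

# Shuffle the index labels once, then cut the shuffled labels
# after the first 80% and after the next 10%.
shuffled = rng.permutation(demo_df.index.to_numpy())
n_train = int(0.8 * len(demo_df))
n_val = int(0.1 * len(demo_df))
train_idx = shuffled[:n_train]
val_idx = shuffled[n_train:n_train + n_val]
test_idx = shuffled[n_train + n_val:]

# Select by label, since the arrays hold index labels.
demo_train = demo_df.loc[train_idx]
demo_val = demo_df.loc[val_idx]
demo_test = demo_df.loc[test_idx]

# The three splits should be disjoint and together cover every row exactly once.
all_idx = np.concatenate([train_idx, val_idx, test_idx])
print(len(demo_train), len(demo_val), len(demo_test))
print('every row used exactly once:', len(set(all_idx)) == len(demo_df))
```

Because `int()` truncates, the train and validation sizes round down and the test set absorbs the leftover rows, which is why the test split ends up two rows larger than the validation split.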


4. Repeat the split in step 3, but use train_test_split. Call it once to split off the train data, and then call it again to divide what remains evenly between val and test.

In [4]:
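# first call: 80% of the rows go to train, the remaining 20% are held out
train_data_2, holdout_2 = train_test_split(df, train_size=.8, random_state=42)
# second call: halve the held-out rows into val and test (10% of the data each)
val_data_2, test_data_2 = train_test_split(holdout_2, test_size=.5, random_state=42)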
print('train is',train_data_2.shape, 'rows, cols\n',
     'val is',val_data_2.shape,'rows, cols\n',
     'test is',test_data_2.shape,'rows, cols')

train is (7654, 5) rows, cols
 val is (957, 5) rows, cols
 test is (957, 5) rows, cols
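The arithmetic behind the two calls is worth spelling out: the first call holds out 0.2 of the rows, and the second call halves that holdout, so val and test each end up with 0.2 × 0.5 = 0.1 of the data. The sketch below illustrates this on a hypothetical DataFrame (`demo_df`, with made-up columns `x` and `y` and the same row count as the power plant data) and also checks that no row lands in two splits.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data with the same number of rows as power_plant.csv.
demo_df = pd.DataFrame({'x': np.arange(9568), 'y': np.arange(9568) * 2.0})

# First call: 80% train, 20% held out.
demo_train, demo_holdout = train_test_split(demo_df, train_size=0.8, random_state=42)
# Second call: split the holdout in half, giving 10% val and 10% test overall.
demo_val, demo_test = train_test_split(demo_holdout, test_size=0.5, random_state=42)

for name, part in [('train', demo_train), ('val', demo_val), ('test', demo_test)]:
    print(name, len(part), 'rows,', round(len(part) / len(demo_df), 3), 'of the data')

# No index label should appear in more than one split.
print(set(demo_train.index).isdisjoint(demo_val.index),
      set(demo_train.index).isdisjoint(demo_test.index),
      set(demo_val.index).isdisjoint(demo_test.index))
```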


5. Confirm that the row counts are correct for both approaches. Each set of three splits should add up to the 9568 rows in the original data.

In [5]:
print( 'train is', train_data.shape, 'rows, cols\n',
         'val is',val_data.shape,'rows,cols\n',
         'test is', test_data.shape,'rows,cols')

print('train is',train_data_2.shape, 'rows, cols\n',
     'val is',val_data_2.shape,'rows, cols\n',
     'test is',test_data_2.shape,'rows, cols')

train is (7654, 5) rows, cols
 val is (956, 5) rows,cols
 test is (958, 5) rows,cols
train is (7654, 5) rows, cols
 val is (957, 5) rows, cols
 test is (957, 5) rows, cols


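6. Fit a StandardScaler to the train data from step 3, then use it to transform the train, validation, and test X. Fitting on the train data only keeps information from the validation and test sets out of the scaling. Do not transform the EP column, as it is the target.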

In [6]:
scaler = StandardScaler()
scaler.fit(train_data.iloc[:,:-1])
train_X=scaler.transform(train_data.iloc[:,:-1])
train_Y=train_data['EP']
val_X=scaler.transform(val_data.iloc[:,:-1])
val_Y=val_data['EP']
test_X=scaler.transform(test_data.iloc[:,:-1])
test_Y=test_data['EP']
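To see what fitting on train and then transforming the other sets does, it can help to look at the numbers. The sketch below uses synthetic arrays (`demo_train_X` and `demo_val_X` are hypothetical, not the power plant data). The scaled train columns have mean 0 and standard deviation 1, while the validation columns are only approximately standardized because they are scaled with the train statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic features standing in for the train and validation X (made-up values).
rng = np.random.default_rng(0)
demo_train_X = rng.normal(loc=20.0, scale=5.0, size=(800, 4))
demo_val_X = rng.normal(loc=20.0, scale=5.0, size=(100, 4))

# Learn the per-column mean and standard deviation from the train data only.
demo_scaler = StandardScaler().fit(demo_train_X)
scaled_train = demo_scaler.transform(demo_train_X)
scaled_val = demo_scaler.transform(demo_val_X)

# Train columns are standardized by construction.
print('train mean:', scaled_train.mean(axis=0).round(6))
print('train std: ', scaled_train.std(axis=0).round(6))
# Validation columns are close to, but not exactly, mean 0 and std 1.
print('val mean:  ', scaled_val.mean(axis=0).round(3))
print('val std:   ', scaled_val.std(axis=0).round(3))

# transform() applies (x - mean_) / scale_ using the statistics learned in fit().
manual_val = (demo_val_X - demo_scaler.mean_) / demo_scaler.scale_
print('manual transform matches:', np.allclose(manual_val, scaled_val))
```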

7. Fit a .LinearRegression() model to the scaled train data, using the X variables to predict y (the EP column).

In [7]:
linear_model = OLS()
linear_model.fit(train_X,train_Y)

LinearRegression()
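Once the model is fitted, its `coef_` and `intercept_` attributes can be inspected. The sketch below is an illustration on synthetic data (`raw_X`, `y`, and `demo_model` are hypothetical names and values, not the power plant data). Because the train features are standardized, each coefficient is the change in the target per one standard deviation of that feature, and the intercept of an ordinary least squares fit equals the mean of the train target.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: two features on very different scales and a noisy linear target.
rng = np.random.default_rng(1)
raw_X = rng.normal(loc=[20.0, 1000.0], scale=[7.0, 6.0], size=(500, 2))
y = 450.0 - 2.0 * raw_X[:, 0] + 0.1 * raw_X[:, 1] + rng.normal(scale=1.0, size=500)

# Standardize the features, then fit ordinary least squares.
scaled_X = StandardScaler().fit_transform(raw_X)
demo_model = LinearRegression().fit(scaled_X, y)

# Coefficients are in target units per one standard deviation of each feature.
print('coefficients:', demo_model.coef_)
# With centered features, the fitted intercept equals the mean of the target.
print('intercept:', demo_model.intercept_, ' mean of y:', y.mean())
```

On the activity's model, the same attributes are available as `linear_model.coef_` and `linear_model.intercept_`.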

8. Print the R2 score and the mean squared error (MSE) of the model on the train, validation, and test datasets. Note that mean_squared_error returns the MSE; the RMSE is the square root of that value.

In [8]:
print('Train Score:',linear_model.score(train_X,train_Y),'\nValidation Score:',linear_model.score(val_X,val_Y),
     '\nTest Score:', linear_model.score(test_X,test_Y))
# mean_squared_error takes the true values first, then the predictions
print('Train MSE:',mean_squared_error(train_Y,linear_model.predict(train_X)),
     '\nValidation MSE:',mean_squared_error(val_Y,linear_model.predict(val_X)),
     '\nTest MSE:',mean_squared_error(test_Y,linear_model.predict(test_X)))

Train Score: 0.9287072840354756 
Validation Score: 0.9238845251967255 
Test Score: 0.9333918854821254
Train MSE: 20.732519659228675 
Validation MSE: 22.820591843766223 
Test MSE: 19.023390952574694
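
Since mean_squared_error returns the mean of the squared errors, a root mean squared error needs one more step: take the square root. The sketch below shows this on a tiny made-up example (`y_true` and `y_pred` are hypothetical values chosen so the arithmetic can be checked by hand).

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up true values and predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

# Errors are 1, 0, -1, -2, so the squared errors are 1, 0, 1, 4 and their mean is 1.5.
mse = mean_squared_error(y_true, y_pred)
# The RMSE is the square root of the MSE and is in the same units as the target.
rmse = np.sqrt(mse)

print('MSE: ', mse)
print('RMSE:', rmse)
```

Applied to the error values printed above (about 20.7, 22.8, and 19.0), the square root gives roughly 4.55, 4.78, and 4.36 for train, validation, and test.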
