
# Basic regression: Predict fuel efficiency

In a *regression* problem, the goal is to predict a continuous value, such as a price or a probability.

This notebook uses the classic [Auto MPG](https://archive.ics.uci.edu/ml/datasets/auto+mpg) dataset and builds a model to predict fuel efficiency in miles per gallon (MPG).

In [None]:
# Use seaborn for pairplot
!pip install seaborn

# Use some functions from tensorflow_docs
!pip install git+https://github.com/tensorflow/docs

In [None]:
import pathlib

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

In [None]:
import tensorflow as tf

from tensorflow import keras
from tensorflow.keras import layers

print(tf.__version__)
tf.random.set_seed(10)

2.3.0


## The Auto MPG dataset

The dataset is available from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/).


### Get the data
First download the dataset.

In [None]:
dataset_path = keras.utils.get_file("auto-mpg.data", "http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data")
dataset_path

'/root/.keras/datasets/auto-mpg.data'

Import it using pandas. Unknown values appear as `?` in the file, so they are read in as `NaN`.

In [None]:
column_names = ['MPG','Cylinders','Displacement','Horsepower','Weight',
                'Acceleration', 'Model Year', 'Origin']
dataset = pd.read_csv(dataset_path, names=column_names,
                      na_values = "?", comment='\t',
                      sep=" ", skipinitialspace=True)

dataset.tail()

Unnamed: 0,MPG,Cylinders,Displacement,Horsepower,Weight,Acceleration,Model Year,Origin
393,27.0,4,140.0,86.0,2790.0,15.6,82,1
394,44.0,4,97.0,52.0,2130.0,24.6,82,2
395,32.0,4,135.0,84.0,2295.0,11.6,82,1
396,28.0,4,120.0,79.0,2625.0,18.6,82,1
397,31.0,4,119.0,82.0,2720.0,19.4,82,1


### Clean the data

The dataset contains a few unknown values.

Drop the rows that contain them.


In [None]:
# Write code to remove the unknown values
dataset = dataset.dropna()
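
Before dropping rows, it can help to see which columns hold the unknown values. The following is a minimal sketch on a made-up three-row frame; the name `toy` and its values are illustrative assumptions, not data from Auto MPG.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; values are illustrative, not from Auto MPG.
toy = pd.DataFrame({
    "MPG": [18.0, 25.0, 30.0],
    "Horsepower": [130.0, np.nan, 70.0],
})

# Count the missing entries in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop every row that has at least one missing value.
cleaned = toy.dropna()
print(len(toy), "->", len(cleaned))
```

Running `dataset.isna().sum()` on the real frame before the `dropna()` call would show the same kind of per-column count.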

The `"Origin"` column is categorical, not numeric, so convert it to a one-hot encoding:

In [None]:
dataset['Origin'].unique()

array([1, 3, 2])

In [None]:
dataset['Origin'] = dataset['Origin'].map({1: 'USA', 2: 'Europe', 3: 'Japan'})

In [None]:
dataset = pd.get_dummies(dataset, prefix='', prefix_sep='')
dataset.tail()

Unnamed: 0,MPG,Cylinders,Displacement,Horsepower,Weight,Acceleration,Model Year,Europe,Japan,USA
393,27.0,4,140.0,86.0,2790.0,15.6,82,0,0,1
394,44.0,4,97.0,52.0,2130.0,24.6,82,1,0,0
395,32.0,4,135.0,84.0,2295.0,11.6,82,0,0,1
396,28.0,4,120.0,79.0,2625.0,18.6,82,0,0,1
397,31.0,4,119.0,82.0,2720.0,19.4,82,0,0,1
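
To make the encoding step concrete, here is a small sketch on a hypothetical three-row frame that reuses the same code-to-name mapping. The `dtype=int` argument is an addition for this example: it requests 0/1 integers, since recent pandas versions otherwise return boolean indicator columns.

```python
import pandas as pd

# Hypothetical toy frame; the origin codes follow the mapping used above.
toy = pd.DataFrame({"MPG": [27.0, 44.0, 32.0], "Origin": [1, 2, 3]})
toy["Origin"] = toy["Origin"].map({1: "USA", 2: "Europe", 3: "Japan"})

# One indicator column per category, named after the category itself.
encoded = pd.get_dummies(toy, prefix="", prefix_sep="", dtype=int)
print(encoded)

# Each row should have exactly one indicator set.
row_sums = encoded[["Europe", "Japan", "USA"]].sum(axis=1)
print(row_sums.tolist())
```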


### Split the data into train and test

Now split the dataset into a training set and a test set.

We will use the test set in the final evaluation of our model.

In [None]:
train_dataset = dataset.sample(frac=0.8,random_state=0)
test_dataset = dataset.drop(train_dataset.index)
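
The `sample`/`drop` pattern relies on the frame's index to keep the two sets disjoint. The sketch below checks this on a hypothetical ten-row frame; the names and values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical ten-row frame standing in for the cleaned dataset.
toy = pd.DataFrame({"MPG": range(10, 20), "Weight": range(2000, 3000, 100)})

train = toy.sample(frac=0.8, random_state=0)  # 80% of the rows, reproducible
test = toy.drop(train.index)                  # whatever was not sampled

# No row should appear in both sets.
overlap = train.index.intersection(test.index)
print(len(train), len(test), len(overlap))
```

This only works as intended when the index values are unique, which holds for the default integer index used here.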

### Split features from labels

Separate the target value, or "label", from the features. This label is the value that you will train the model to predict.

In [None]:
train_labels = train_dataset.pop('MPG')
test_labels = test_dataset.pop('MPG')

### Normalize the data

Use z-score normalization for both datasets, with the mean and standard deviation computed from the training set only.

In [None]:
train_stats = train_dataset.describe()
train_stats = train_stats.transpose()
train_stats

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Cylinders,314.0,5.477707,1.699788,3.0,4.0,4.0,8.0,8.0
Displacement,314.0,195.318471,104.331589,68.0,105.5,151.0,265.75,455.0
Horsepower,314.0,104.869427,38.096214,46.0,76.25,94.5,128.0,225.0
Weight,314.0,2990.251592,843.898596,1649.0,2256.5,2822.5,3608.0,5140.0
Acceleration,314.0,15.559236,2.78923,8.0,13.8,15.5,17.2,24.8
Model Year,314.0,75.898089,3.675642,70.0,73.0,76.0,79.0,82.0
Europe,314.0,0.178344,0.383413,0.0,0.0,0.0,0.0,1.0
Japan,314.0,0.197452,0.398712,0.0,0.0,0.0,0.0,1.0
USA,314.0,0.624204,0.485101,0.0,0.0,1.0,1.0,1.0


In [None]:
# Write code here: To normalize both train and test datasets
def norm(x):
  return (x - train_stats['mean']) / train_stats['std']
normed_train_data = norm(train_dataset)
normed_test_data = norm(test_dataset)
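
A small worked example may make the z-score step easier to follow: each feature value `x` becomes `(x - mean) / std`, where both statistics come from the training set. The frames below are hypothetical and chosen so the arithmetic is easy to check by hand (mean 3000, sample standard deviation 1000).

```python
import pandas as pd

# Hypothetical toy features; values are illustrative only.
train = pd.DataFrame({"Weight": [2000.0, 3000.0, 4000.0]})
test = pd.DataFrame({"Weight": [2500.0]})

# Statistics come from the training set only.
stats = train.describe().transpose()

def norm(x):
    # Column-wise z-score using the training mean and standard deviation.
    return (x - stats["mean"]) / stats["std"]

normed_train = norm(train)
normed_test = norm(test)
print(normed_train["Weight"].tolist())
print(normed_test["Weight"].tolist())
```

Reusing the training statistics for the test set keeps information about the test distribution out of the preprocessing and gives both sets the same scale.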

## The model

### Building the model


In [None]:
train_dataset.keys()

Index(['Cylinders', 'Displacement', 'Horsepower', 'Weight', 'Acceleration',
       'Model Year', 'Europe', 'Japan', 'USA'],
      dtype='object')

In [None]:
len(train_dataset.keys())

9

In [None]:
# Build and compile your model in this cell.
def build_model():
    model = keras.Sequential([
        layers.Dense(64, activation='relu', input_shape=[len(train_dataset.keys())]),
        layers.Dense(64, activation='relu'),
        layers.Dense(1)
    ])
    # optimizer = tf.keras.optimizers.RMSprop(0.001)  # learning rate
    optimizer = keras.optimizers.Adam(learning_rate=0.001)

    model.compile(loss='mse',  # mean squared error, the usual regression loss
                  optimizer=optimizer,
                  metrics=['mae', 'mse'])  # mean absolute error, mean squared error
    return model

model = build_model()

In [None]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 64)                640       
_________________________________________________________________
dense_1 (Dense)              (None, 64)                4160      
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 65        
Total params: 4,865
Trainable params: 4,865
Non-trainable params: 0
_________________________________________________________________
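
The parameter counts in the summary can be reproduced by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The helper `dense_params` below is a hypothetical name introduced for this sketch.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# 9 input features, two hidden layers of 64 units, one output unit.
layer_sizes = [9, 64, 64, 1]
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
print(counts, sum(counts))  # [640, 4160, 65] 4865
```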


### Train the model

Train the model for 1000 epochs, and record the training and validation loss and metrics in the `history` object.

In [None]:
import tensorflow_docs as tfdocs
import tensorflow_docs.modeling

In [None]:
# 64, 64
EPOCHS = 1000

history = model.fit(
  normed_train_data, train_labels,
  epochs=EPOCHS, validation_split = 0.2, verbose=0,
  callbacks=[tfdocs.modeling.EpochDots()])


Epoch: 0, loss:567.4671,  mae:22.5337,  mse:567.4671,  val_loss:573.7640,  val_mae:22.6108,  val_mse:573.7640,  
....................................................................................................
Epoch: 100, loss:6.6840,  mae:1.8462,  mse:6.6840,  val_loss:8.9335,  val_mae:2.3092,  val_mse:8.9335,  
....................................................................................................
Epoch: 200, loss:5.7712,  mae:1.6720,  mse:5.7712,  val_loss:8.3255,  val_mae:2.2040,  val_mse:8.3255,  
....................................................................................................
Epoch: 300, loss:5.2506,  mae:1.5852,  mse:5.2506,  val_loss:8.0787,  val_mae:2.1873,  val_mse:8.0787,  
....................................................................................................
Epoch: 400, loss:4.6388,  mae:1.4786,  mse:4.6388,  val_loss:7.8468,  val_mae:2.1182,  val_mse:7.8468,  
..............................................................

In [None]:
# 64, 64
loss, mae, mse = model.evaluate(normed_test_data, test_labels, verbose=2)

print("Testing set Mean Abs Error: {:5.2f} MPG".format(mae))  # 6

3/3 - 0s - loss: 7.0654 - mae: 2.1052 - mse: 7.0654
Testing set Mean Abs Error:  2.11 MPG
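
The `mae` and `mse` values returned by `model.evaluate` are the mean absolute error and mean squared error over the test set; `loss` matches `mse` because the model was compiled with `loss='mse'`. As a rough sketch of what those two numbers measure, using hypothetical labels and predictions rather than real model output:

```python
import numpy as np

# Hypothetical labels and predictions in MPG; not model output.
y_true = np.array([20.0, 30.0, 25.0])
y_pred = np.array([22.0, 27.0, 25.0])

errors = y_pred - y_true        # [2, -3, 0]
mae = np.mean(np.abs(errors))   # (2 + 3 + 0) / 3
mse = np.mean(errors ** 2)      # (4 + 9 + 0) / 3
print(mae, mse)
```

MAE is in the same unit as the label (MPG), which is why it is the figure printed above, while MSE penalizes large errors more heavily.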


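### Second Condition

Rebuild the model with 100 units in each hidden layer instead of 64, keeping the optimizer, loss, and training setup the same.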

In [None]:
# Build and compile your model in this cell.
def build_model():
  model = keras.Sequential([
      layers.Dense(100, activation='relu', input_shape=[len(train_dataset.keys())]),
      layers.Dense(100, activation='relu'),
      layers.Dense(1)
  ])
  optimizer = keras.optimizers.Adam(learning_rate=0.001)  # learning rate

  model.compile(loss='mse',  # mean squared error, the usual regression loss
                optimizer=optimizer,
                metrics=['mae', 'mse'])  # mean absolute error, mean squared error
  return model

model = build_model()

In [None]:
# 100, 100
EPOCHS = 1000

history = model.fit(
  normed_train_data, train_labels,
  epochs=EPOCHS, validation_split = 0.2, verbose=0,
  callbacks=[tfdocs.modeling.EpochDots()])


Epoch: 0, loss:590.7414,  mae:23.0498,  mse:590.7414,  val_loss:590.9236,  val_mae:23.0279,  val_mse:590.9236,  
....................................................................................................
Epoch: 100, loss:6.1789,  mae:1.7466,  mse:6.1789,  val_loss:8.6245,  val_mae:2.2353,  val_mse:8.6245,  
....................................................................................................
Epoch: 200, loss:5.3042,  mae:1.5949,  mse:5.3042,  val_loss:8.3243,  val_mae:2.1865,  val_mse:8.3243,  
....................................................................................................
Epoch: 300, loss:4.8335,  mae:1.4856,  mse:4.8335,  val_loss:8.1693,  val_mae:2.1798,  val_mse:8.1693,  
....................................................................................................
Epoch: 400, loss:4.2296,  mae:1.3759,  mse:4.2296,  val_loss:8.4588,  val_mae:2.1662,  val_mse:8.4588,  
..............................................................
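
The printed checkpoints suggest that validation loss stops improving well before training ends. The sketch below finds the best of the logged values only; the log shows every 100th epoch and is cut off after epoch 400, so this is a coarse view. The full per-epoch series would be in `history.history['val_loss']`.

```python
# val_loss values printed every 100 epochs in the log above (100, 100 model).
logged_epochs = [0, 100, 200, 300, 400]
logged_val_loss = [590.9236, 8.6245, 8.3243, 8.1693, 8.4588]

# Index of the smallest logged validation loss.
best_index = min(range(len(logged_val_loss)), key=logged_val_loss.__getitem__)
best_epoch = logged_epochs[best_index]
print(best_epoch, logged_val_loss[best_index])  # 300 8.1693
```

Meanwhile the training loss keeps falling (6.1789 at epoch 100 down to 4.2296 at epoch 400), so the gap between training and validation loss is widening.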

In [None]:
# 100, 100
loss, mae, mse = model.evaluate(normed_test_data, test_labels, verbose=2)

print("Testing set Mean Abs Error: {:5.2f} MPG".format(mae))  # 8

3/3 - 0s - loss: 7.7360 - mae: 2.1925 - mse: 7.7360
Testing set Mean Abs Error:  2.19 MPG
