In [37]:
# Introduction: 
#
# In the last exercise, we tried to find a polynomial function whose curve approximated
# a series of data points as closely as possible. The coefficients of the function were randomly selected
# and the best combination was stored. In theory, higher-order polynomials should describe the curve even better,
# but at the same time, it becomes more and more difficult to find good coefficients.
#
# One way to counteract this problem is to normalize the data.
# If all data values are between 0.0 and 1.0 it is easier to find suitable coefficients.
# In addition, the coefficients have been randomly selected so far,
# and it is possible that bad combinations have been tried several times.
# Ideally, only good combinations should be considered and improved step by step.
# A potential technique for this approach is the evolution strategy (ES)
# from the category of evolutionary algorithms.
#
#
# Problem definition:
#
# Use the new data set cars.csv for this exercise. The data can be read with the command
# cars = np.genfromtxt("cars.csv", delimiter=",", skip_header=True).
# In this exercise, you do not try to infer one column from another,
# but rather use the first 6 columns to predict the 7th column.
# The aim is to predict the fuel consumption of cars.
#
# The meaning of the columns in the cars matrix:
# 1. "cylinders" = number of cylinders
# 2. "displacement" = engine displacement (total swept volume of the cylinders)
# 3. "horsepower" = power of the engine
# 4. "weight" = weight in pounds
# 5. "acceleration" = time to accelerate from 0 to 60 miles per hour
# 6. "model year" = model year of the car
# 7. "mpg" = miles per gallon
#
# First normalize all data, use the first 6 columns as input data and try to predict the 7th column.
# Evaluate your predictions with the RMSE. Since several input data/columns are now used,
# please replace the polynomial function with a linear combination of weighted input values.
# Try to find some random coefficients which create a good approximation function,
# just to get a feeling for the data.
#
# Replace your random-search algorithm with the evolution strategy.
# Implement a way to define the number of offspring and generations.
# Make it possible to keep the parent generation if it is better than its children.
#
# All children start as a copy of their parent (see np.tile(...)), but each coefficient
# is then modified slightly. At the end of an epoch,
# the best descendant is determined (from the set of all children and/or parents)
# and used as the parent of the next epoch.

# Import external libraries
import matplotlib.pyplot as plt
import math
import numpy as np
import random

# Enable inline plotting
%matplotlib inline

# Import the CSV file
cars = np.genfromtxt("cars.csv", delimiter=",", skip_header=True)

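# Store max values of cars columns for denormalization
# (indices 1 to 7 are used, which assumes cars.csv has one extra leading column before "cylinders")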
cylindersMax = np.amax(cars[:,1])
displacementMax = np.amax(cars[:,2])
horsepowerMax = np.amax(cars[:,3])
weightMax = np.amax(cars[:,4])
accelerationMax = np.amax(cars[:,5])
modelYearMax = np.amax(cars[:,6])
mpgMax = np.amax(cars[:,7])

# Normalize each column by dividing it by its maximum value
cylinders = cars[:,1] / cylindersMax
displacement = cars[:,2] / displacementMax
horsepower = cars[:,3] / horsepowerMax
weight = cars[:,4] / weightMax
acceleration = cars[:,5] / accelerationMax
modelYear = cars[:,6] / modelYearMax
mpg = cars[:,7] / mpgMax

# Approximate relation with weighted function
# Stack columns into one matrix
cylindersVector = np.expand_dims(cylinders, axis=1)
displacementVector = np.expand_dims(displacement, axis=1)
horsepowerVector = np.expand_dims(horsepower, axis=1)
weightVector = np.expand_dims(weight, axis=1)
accelerationVector = np.expand_dims(acceleration, axis=1)
modelYearVector = np.expand_dims(modelYear, axis=1)

xValueMatrix = np.column_stack([cylindersVector, displacementVector, horsepowerVector, weightVector, accelerationVector, modelYearVector])

# Find weighted function with lowest rmse
smallestRMSE = 1000
bestCoefficients = []

# Try 10000 different random coefficients to find best ones
for j in range(10000):
    # Generate random coefficients
    coefficients = np.random.uniform(-1.0, 1.0, size=(6, 1))
    
    # Calculate approximated mpg values with the weighted function
    mpgWeightedApproximation = np.matmul(xValueMatrix, coefficients)
    mpgWeightedApproximation = np.squeeze(mpgWeightedApproximation)

    # Calculate the RMSE in original mpg units (both sides are denormalized with mpgMax)
    squareErrorSum = np.sum(np.power(mpgMax * mpgWeightedApproximation - mpgMax * mpg, 2))

    mse = squareErrorSum / len(mpg)

    rmse = math.sqrt(mse)
    
    # Check if current RMSE is the smallest
    if rmse < smallestRMSE:
        smallestRMSE = rmse
        bestCoefficients = coefficients

# Print final RMSE value
print("Final RMSE: {}".format(smallestRMSE))

# Flatten bestCoefficients from shape (6, 1) to a 1-D array of shape (6,)
bestCoefficients = np.squeeze(bestCoefficients)

# Return the weighted sum (linear combination) of the input values with the provided coefficients
def weightedFunction(xValues, coefficients):
    y = 0
    
    for c in range(len(coefficients)):
        y = y + coefficients[c] * xValues[c]
    
    return y

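# Note: row 57 is passed with its raw, unnormalized values although the coefficients were
# fitted on normalized inputs, so the printed value is not a meaningful mpg prediction.
# Passing xValueMatrix[57] instead would keep the inputs on the fitted scale.
print(weightedFunction([cars[57, 1], cars[57, 2], cars[57, 3], cars[57, 4], cars[57, 5], cars[57, 6]], bestCoefficients) * mpgMax)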

Final RMSE: 3.9205338033582797
-38872.724916577885
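
The cell above only covers the random-search part of the task. The evolution strategy described in the problem definition could be sketched roughly as follows. This is a minimal sketch under several assumptions, not the exercise's reference solution: the names `rmse`, `evolution_strategy`, `n_offspring`, `n_generations`, `sigma` and `keep_parent` are hypothetical, the mutation is Gaussian noise with a fixed step size, and a synthetic matrix stands in for the normalized cars data so that the block runs without cars.csv.

```python
import numpy as np


def rmse(X, y, w):
    """Root-mean-square error of the linear combination X @ w against y."""
    return float(np.sqrt(np.mean((X @ w - y) ** 2)))


def evolution_strategy(X, y, n_offspring=20, n_generations=200,
                       sigma=0.1, keep_parent=True, seed=0):
    """Mutation-only evolution strategy with a single parent per generation."""
    rng = np.random.default_rng(seed)
    parent = rng.uniform(-1.0, 1.0, size=X.shape[1])
    parent_err = rmse(X, y, parent)
    for _ in range(n_generations):
        # Every child starts as a copy of the parent ...
        children = np.tile(parent, (n_offspring, 1))
        # ... and each coefficient is then perturbed slightly (assumed Gaussian noise)
        children += rng.normal(0.0, sigma, size=children.shape)
        errors = np.array([rmse(X, y, child) for child in children])
        best = int(np.argmin(errors))
        # keep_parent=True: the parent survives unless a child is strictly better.
        # keep_parent=False: the best child always becomes the next parent.
        if not keep_parent or errors[best] < parent_err:
            parent, parent_err = children[best].copy(), float(errors[best])
    return parent, parent_err


# Synthetic stand-in for the normalized input matrix and target (hypothetical values)
data_rng = np.random.default_rng(42)
X = data_rng.uniform(0.0, 1.0, size=(200, 6))
true_w = np.array([0.5, -0.3, 0.2, -0.6, 0.1, 0.8])
y = X @ true_w

# With zero generations the function returns the random starting parent
start_w, start_err = evolution_strategy(X, y, n_generations=0)
best_w, best_err = evolution_strategy(X, y)
print("RMSE of starting parent:", start_err)
print("RMSE after evolution:", best_err)
```

With the notebook's own data, `xValueMatrix` and `mpg` could be passed in place of `X` and `y`, and the resulting error multiplied by `mpgMax` to express it in miles per gallon. Keeping the parent makes the error non-increasing from one generation to the next, at the cost of possibly getting stuck; discarding it allows temporary setbacks.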
