In [153]:
# Introduction: 
#
# In the last exercise, we tried to find a polynomial function whose curve approximated
# a series of data points as closely as possible. The coefficients of the function were randomly selected
# and the best combination was stored. In theory, higher-order polynomials should describe the curve even better,
# but at the same time, it becomes more and more difficult to obtain good coefficients.
#
# One way to counteract this problem is to normalize the data.
# If all data values are between 0.0 and 1.0, it is easier to find suitable coefficients.
# In addition, the coefficients have been randomly selected so far,
# so bad combinations may have been tried several times.
# Ideally, only good combinations should be considered and improved step by step.
# A potential technique for this approach is the evolution strategy (ES)
# from the category of evolutionary algorithms.
#
#
#
# Problem definition:
#
# Use the new data set cars.csv for this exercise. The data can be read with the command
# cars = np.genfromtxt("cars.csv", delimiter=",", skip_header=True).
# In this exercise, you do not try to infer one column from another,
# but rather use the first 6 columns to predict the 7th column.
# The aim is to predict the fuel consumption of cars.
#
# The meaning of the columns in the cars matrix:
# 1. "cylinders" = number of the cylinders
# 2. "displacement" = engine displacement (total swept volume of the cylinders)
# 3. "horsepower" = power of the engine
# 4. "weight" = weight in pounds
# 5. "acceleration" = time to accelerate from 0 to 60 miles per hour
# 6. "model year" = year of release
# 7. "mpg" = miles per gallon
#
# First normalize all data, use the first 6 columns as input data and try to predict the 7th column.
# Evaluate your predictions with the RMSE. Since several input data/columns are now used,
# please replace the polynomial function with a linear combination of weighted input values.
# Try to find some random coefficients which create a good approximation function,
# just to get a feeling for the data.
#
# Replace your random-picking algorithm with the evolution strategy.
# Implement a way to define the number of offspring and generations.
# Make it possible to keep the parents (generation) if they are better than their children.
#
# All children are initially a copy of their parent (see np.tile(...)) but are modified
# a little bit for each coefficient. At the end of an epoch,
# the best descendant is determined (from the set of all children and/or parents)
# and used as the parent of the next epoch.

# Import external libraries
import matplotlib.pyplot as plt
import math
import numpy as np
import random

# Enable inline plotting
%matplotlib inline

# Import the CSV file
cars = np.genfromtxt("cars.csv", delimiter=",", skip_header=True)

# Remove car names column from data
cars = np.delete(cars, 0, 1)

# Store max and min values of cars columns for denormalization
maxValues = np.amax(cars[:,0:7], 0)
minValues = np.amin(cars[:,0:7], 0)

# Normalize data in car matrix
normalizedData = (cars - minValues) / (maxValues - minValues)

# Use the first 6 normalized columns as input values (drop the mpg column)
xValueMatrix = np.delete(normalizedData, 6, 1)

# Track the lowest RMSE found so far and the coefficients that produced it
smallestRMSE = math.inf
bestCoefficients = []

# Whether the parent is re-evaluated together with its children in each generation
keepParents = False
# Start from a random parent: 6 coefficients in [-1.0, 1.0], stored as a 6x1 column
parentCoefficients = np.expand_dims(np.array([random.uniform(-1.0, 1.0) for _ in range(6)]),
                                    axis = 1)

# Run 10000 generations to find the best coefficients
for e in range(10000):
    # Generate child coefficients
    childAmount = 3
    
    coefficients = np.empty([6, 0])
    
    for c in range(childAmount):
        # Each child is a copy of the parent with a small random change per coefficient
        mutation = np.array([[random.uniform(-0.01, 0.01)] for _ in range(6)])
        childCoefficients = parentCoefficients + mutation

        coefficients = np.append(coefficients, childCoefficients, axis = 1)
    
    # Add parent coefficients if they should survive the generation
    if keepParents:
        coefficients = np.append(coefficients, parentCoefficients, axis = 1)
    
    # Compare RMSE values for all coefficients
    for k in range(len(coefficients[0])):
        # Calculate approximated mpg values with weighted function
        mpgWeightedApproximation = np.matmul(xValueMatrix, coefficients[:, k])
        mpgWeightedApproximation = np.squeeze(mpgWeightedApproximation)

        # Calculate the RMSE on the original mpg scale (denormalize prediction and target first)
        mpgRange = maxValues[6] - minValues[6]
        predictedMpg = minValues[6] + mpgRange * mpgWeightedApproximation
        actualMpg = minValues[6] + mpgRange * normalizedData[:, 6]
        squareErrorSum = np.sum(np.power(predictedMpg - actualMpg, 2))

        mse = squareErrorSum / len(normalizedData[:, 6])

        rmse = math.sqrt(mse)
        
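        # Report the RMSE of the last candidate of the first generation as a rough starting point
        # (this is the initial parent only if keepParents is True, otherwise the last child)
        if e == 0 and k == len(coefficients[0]) - 1:
            print("Initial RMSE: ", rmse)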

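        # Keep the candidate if it beats the best RMSE found so far.
        # The parent is only replaced in that case, so the best parent
        # implicitly survives a generation even when keepParents is False.
        if rmse < smallestRMSE:
            smallestRMSE = rmse
            bestCoefficients = np.expand_dims(coefficients[:, k], axis = 1)
            parentCoefficients = np.expand_dims(coefficients[:, k], axis = 1)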

# Print final RMSE value
print("Final RMSE: {}".format(smallestRMSE))

# Flatten bestCoefficients from a 6x1 column to a 1-D array
bestCoefficients = np.squeeze(bestCoefficients)

# Return the predicted mpg (denormalized) for normalized inputs x and coefficients c
def weightedFunction(x, c):
    y = np.matmul(x, c)

    return minValues[6] + (maxValues[6] - minValues[6]) * y

# Compare actual and predicted mpg for a few sample rows (0-based row indices)
for row in (3, 56, 116, 218):
    actualMpg = minValues[6] + (maxValues[6] - minValues[6]) * normalizedData[row, 6]
    predictedMpg = weightedFunction(xValueMatrix[row, 0:6], bestCoefficients)
    print("Line {}: mpg is {}, prediction was {}".format(row + 1, actualMpg, predictedMpg))


Initial RMSE:  27.742136811240094
Final RMSE: 4.374560719262073
Line 4: mpg is 16.0, prediction was 14.869259262956831
Line 57: mpg is 24.0, prediction was 23.626975766513727
Line 117: mpg is 29.0, prediction was 27.909190372776937
Line 219: mpg is 33.5, prediction was 30.705150997338574
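
The exercise text suggests creating the children with np.tile(...) and making the number of offspring and generations configurable. The following is a minimal sketch of that idea, not the notebook's own implementation. The names `evolve`, `rmse_per_candidate`, `step` and `seed` are assumptions introduced here, and synthetic data stands in for the normalized cars matrix, so the RMSE values are on the normalized scale rather than in mpg.

```python
import numpy as np

def rmse_per_candidate(X, y, W):
    """Return the RMSE of every candidate; W has shape (n_candidates, n_features)."""
    predictions = X @ W.T                  # shape (n_samples, n_candidates)
    errors = predictions - y[:, None]
    return np.sqrt(np.mean(errors ** 2, axis=0))

def evolve(X, y, n_offspring=3, n_generations=1000, keep_parent=True,
           step=0.01, seed=0):
    """Evolution strategy with one parent and n_offspring children per generation."""
    rng = np.random.default_rng(seed)
    parent = rng.uniform(-1.0, 1.0, size=X.shape[1])
    history = []
    for _ in range(n_generations):
        # All children start as copies of the parent ...
        children = np.tile(parent, (n_offspring, 1))
        # ... and each coefficient is then modified a little bit.
        children += rng.uniform(-step, step, size=children.shape)
        # Optionally let the parent compete with its children.
        candidates = np.vstack([children, parent]) if keep_parent else children
        scores = rmse_per_candidate(X, y, candidates)
        best = np.argmin(scores)
        parent = candidates[best]
        history.append(scores[best])
    return parent, np.array(history)

# Synthetic stand-in for the normalized input columns and target (NOT cars.csv).
data_rng = np.random.default_rng(42)
X = data_rng.uniform(0.0, 1.0, size=(200, 6))
true_w = np.array([0.5, -0.3, 0.2, -0.6, 0.1, 0.4])
y = X @ true_w

w_plus, hist_plus = evolve(X, y, keep_parent=True)
w_comma, hist_comma = evolve(X, y, keep_parent=False)
print("keep_parent=True :", hist_plus[0], "->", hist_plus[-1])
print("keep_parent=False:", hist_comma[0], "->", hist_comma[-1])
```

With keep_parent=True the best RMSE cannot get worse from one generation to the next, because the parent always competes with its children. With keep_parent=False the new parent is always the best child, even if that child is worse than the old parent, so the error curve may fluctuate.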
