<a href="https://colab.research.google.com/github/jordanburdett/IrisflowerDetection/blob/master/More_Interesting_Datasets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Import all the libraries needed for this notebook.


---



In [0]:
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import KNeighborsRegressor as classifierRegr
from sklearn.model_selection import train_test_split
from collections import Counter
# Alias the whole metrics module; later cells call sk.explained_variance_score, sk.r2_score, etc.
import sklearn.metrics as sk

Get all the data that will be required throughout the assignment. 

---



1.   carData: car evaluation data from https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data
2.   carMPG: information about MPG from https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data
3.   classPerformance: student performance for two classes (math and Portuguese). This data is in two CSV files, student-mat.csv and student-por.csv, loaded from the GitHub repository.



In [0]:

names = ["price", "maint", "doors", "numPeople", "cargoSpace", "safteyMeasure", "acceptable"]
carData = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data", header=None, skipinitialspace=True, names=names, na_values=["?"])

names = ["mpg", "cylinders", "displacement", "horsepower", "weight", "acceleration", "model", "origin", "carname"]
carMPG = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data", header=None, names=names, na_values=["?"], sep=r"\s+")

classPerformanceMath = pd.read_csv("https://raw.githubusercontent.com/jordanburdett/IrisflowerDetection/master/student-mat.csv", delimiter=";", na_values=["?"])

classPerformancePort = pd.read_csv("https://raw.githubusercontent.com/jordanburdett/IrisflowerDetection/master/student-por.csv", delimiter=";", na_values=["?"])



For the car data, the target is the last column (`acceptable`), which classifies each car as unacceptable, acceptable, good, or very good. Note that this dataset contains categorical (non-numeric) data.

---

First, make sure carData has no missing values.

In [295]:
carData.isna().any()

price            False
maint            False
doors            False
numPeople        False
cargoSpace       False
safteyMeasure    False
acceptable       False
dtype: bool

Good, no NaN values!

---
There are no NaNs to handle, so let's check whether any non-numeric values need converting.


In [296]:
print (carData.head())

carData.dtypes

   price  maint doors numPeople cargoSpace safteyMeasure acceptable
0  vhigh  vhigh     2         2      small           low      unacc
1  vhigh  vhigh     2         2      small           med      unacc
2  vhigh  vhigh     2         2      small          high      unacc
3  vhigh  vhigh     2         2        med           low      unacc
4  vhigh  vhigh     2         2        med           med      unacc


price            object
maint            object
doors            object
numPeople        object
cargoSpace       object
safteyMeasure    object
acceptable       object
dtype: object

In [0]:
# Every column shows up as object because the values are strings; even the
# count columns include text such as "5more". Convert each one to integer codes.

def makeCatCodes(data):
  for label, dtype in data.dtypes.items():
    if dtype == object:
      print(label)
      # convert the column to the category dtype
      data[label] = data[label].astype('category')

      # create a new column holding the integer category codes
      data["{}_cat".format(label)] = data[label].cat.codes

  return data


In [298]:
# Apply the new function to the car data.

carData = makeCatCodes(carData)

carData.head()


price
maint
doors
numPeople
cargoSpace
safteyMeasure
acceptable


Unnamed: 0,price,maint,doors,numPeople,cargoSpace,safteyMeasure,acceptable,price_cat,maint_cat,doors_cat,numPeople_cat,cargoSpace_cat,safteyMeasure_cat,acceptable_cat
0,vhigh,vhigh,2,2,small,low,unacc,3,3,0,0,2,1,2
1,vhigh,vhigh,2,2,small,med,unacc,3,3,0,0,2,2,2
2,vhigh,vhigh,2,2,small,high,unacc,3,3,0,0,2,0,2
3,vhigh,vhigh,2,2,med,low,unacc,3,3,0,0,1,1,2
4,vhigh,vhigh,2,2,med,med,unacc,3,3,0,0,1,2,2
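
A note on what the codes mean: in the table above, `vhigh` maps to 3 and `low` maps to 1, which suggests pandas assigns codes in alphabetical order of the labels (high=0, low=1, med=2, vhigh=3) rather than by rank. The sketch below reproduces this on a toy column and shows one possible alternative using an ordered categorical. The ranking low < med < high < vhigh and all variable names here are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy column using the same labels as the car data's price column.
prices = pd.Series(["vhigh", "high", "med", "low"])

# Default behaviour: the categories are sorted alphabetically before codes are assigned.
alphabetical = prices.astype("category")
alphabetical_codes = {label: int(code) for label, code in zip(prices, alphabetical.cat.codes)}
print(alphabetical_codes)

# Assumption: low < med < high < vhigh is the intended ranking.
ranking = ["low", "med", "high", "vhigh"]
ordered = pd.Categorical(prices, categories=ranking, ordered=True)
ordered_codes = {label: int(code) for label, code in zip(prices, ordered.codes)}
print(ordered_codes)
```

With a distance-based model such as k-nearest neighbors, the second encoding keeps `low` and `med` next to each other, which may or may not change the accuracy on this dataset.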


Good! Now let's take just the numeric data and normalize it.

In [299]:
def normalizeData(data):
  # Mean normalization: center each column on its mean and divide by its range.
  return (data - data.mean()) / (data.max() - data.min())

# The acceptable column is the target, so it is not normalized; only the features are.
normalizedCarData = normalizeData(carData[["price_cat", "maint_cat", "doors_cat", "numPeople_cat", "cargoSpace_cat", "safteyMeasure_cat"]])
features = normalizedCarData.to_numpy()

normalizedCarData.head()

Unnamed: 0,price_cat,maint_cat,doors_cat,numPeople_cat,cargoSpace_cat,safteyMeasure_cat
0,0.5,0.5,-0.5,-0.5,0.5,0.0
1,0.5,0.5,-0.5,-0.5,0.5,0.5
2,0.5,0.5,-0.5,-0.5,0.5,-0.5
3,0.5,0.5,-0.5,-0.5,0.0,0.0
4,0.5,0.5,-0.5,-0.5,0.0,0.5
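
To see where values such as 0.5 and -0.5 come from, here is a small worked check of the same formula, (x - mean) / (max - min), on a toy column. The toy frame and the snake_case function name are assumptions for illustration; the formula is the one used in `normalizeData` above.

```python
import pandas as pd

def normalize_data(data):
    # Mean normalization: subtract the column mean, divide by the column range.
    return (data - data.mean()) / (data.max() - data.min())

# Hypothetical column holding the four category codes 0..3 equally often,
# so the mean is 1.5 and the range is 3.
toy = pd.DataFrame({"price_cat": [0, 1, 2, 3]})
scaled = normalize_data(toy)
print(scaled["price_cat"].tolist())
```

Code 3 becomes (3 - 1.5) / 3 = 0.5 and code 0 becomes -0.5, so every scaled column spans a range of exactly 1 and is centered on 0.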


The data is now in order, so let's use a classifier from sklearn to test it.

---

First, use sklearn to split the data randomly, then test it!


In [0]:
targets = carData["acceptable"].to_numpy()

train_data, test_data, train_targets, test_targets = train_test_split(features, targets, test_size=.3)

In [301]:
classifier = KNeighborsClassifier(n_neighbors=1)
classifier.fit(train_data, train_targets)
predictions = classifier.predict(test_data)

numCorrect = 0

for i in range(len(predictions)):
  if predictions[i] == test_targets[i]:
    numCorrect += 1

accuracy = (numCorrect / len(predictions)) * 100

print(accuracy)

87.28323699421965
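
The counting loop above can also be written with `accuracy_score` from `sklearn.metrics`. The sketch below uses made-up labels (an assumption, standing in for `test_targets` and `predictions`) to show that the manual count and the library call agree.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels standing in for test_targets and predictions.
toy_targets = np.array(["unacc", "acc", "unacc", "good", "vgood"])
toy_predictions = np.array(["unacc", "acc", "acc", "good", "unacc"])

# Manual version: fraction of positions where prediction equals target.
manual = (toy_predictions == toy_targets).mean() * 100

# Library version: accuracy_score returns the same fraction (0..1).
library = accuracy_score(toy_targets, toy_predictions) * 100

print(manual, library)
```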


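Interesting that the accuracy was highest when we used only the single closest neighbor (n_neighbors=1) to classify! This run scored about 87%; the split is random, so the figure changes between runs (another run reached 91%).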
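
Because a single random split can favor one setting by luck, one way to compare neighbor counts more fairly is cross-validation. This is a minimal sketch, not the method used in this notebook: it runs on synthetic data from `make_classification` (an assumption so the cell is self-contained), and the candidate values of k are arbitrary. Swapping in `features` and `targets` from the cells above should work the same way.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption); replace X, y with features, targets.
X, y = make_classification(n_samples=300, n_features=6, n_informative=4, random_state=0)

scores_by_k = {}
for k in (1, 3, 5, 7):
    model = KNeighborsClassifier(n_neighbors=k)
    # 5-fold cross-validation averages out the luck of a single random split.
    scores_by_k[k] = cross_val_score(model, X, y, cv=5).mean()

for k, score in scores_by_k.items():
    print(f"k={k}: mean accuracy {score:.3f}")
```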

---
Moving on to carMPG prediction!


In [302]:
carMPG.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model,origin,carname
0,18.0,8,307.0,130.0,3504.0,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693.0,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150.0,3436.0,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150.0,3433.0,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140.0,3449.0,10.5,70,1,ford torino


In [303]:
carMPG.isna().any()

mpg             False
cylinders       False
displacement    False
horsepower       True
weight          False
acceleration    False
model           False
origin          False
carname         False
dtype: bool

In [304]:
# Horsepower has some NaNs. Fill them with the column's most common value
# (the mode, not the mean), which is what Counter.most_common returns.

def replaceNaNMostCommon(data):
  # Flag every column that contains at least one NaN
  testColumn = data.isna().any()

  # Loop through the flagged columns
  for columnName, hasNaN in testColumn.items():
      if hasNaN:
          # Count only the non-missing values so NaN can never be the winner
          counts = Counter(data[columnName].dropna())
          data[columnName] = data[columnName].fillna(counts.most_common(1)[0][0])
  return data

carMPG = replaceNaNMostCommon(carMPG)

carMPG.isna().any()

mpg             False
cylinders       False
displacement    False
horsepower      False
weight          False
acceleration    False
model           False
origin          False
carname         False
dtype: bool
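
The fill step above uses `Counter(...).most_common(1)`, which picks the most frequent value in the column rather than its arithmetic mean. The two can differ noticeably, as this sketch on a made-up series shows (the numbers are hypothetical, not taken from the dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical horsepower values with one missing entry.
hp = pd.Series([150.0, 150.0, 90.0, np.nan, 70.0])

# Most common value: 150.0 appears twice.
filled_mode = hp.fillna(hp.mode()[0])

# Mean of the non-missing values: (150 + 150 + 90 + 70) / 4 = 115.0
filled_mean = hp.fillna(hp.mean())

print(filled_mode.tolist())
print(filled_mean.tolist())
```

Which fill is preferable depends on the column; with only a few missing rows, the effect on the model is likely small either way.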

In [305]:
carMPG.dtypes

mpg             float64
cylinders         int64
displacement    float64
horsepower      float64
weight          float64
acceleration    float64
model             int64
origin            int64
carname          object
dtype: object

In [306]:
targets = carMPG["mpg"].to_numpy()
carNames = carMPG["carname"]
del carMPG['carname']
del carMPG['mpg']
features = normalizeData(carMPG)
features.head()

Unnamed: 0,cylinders,displacement,horsepower,weight,acceleration,model,origin
0,0.509045,0.293473,0.135023,0.151283,-0.212386,-0.500838,-0.286432
1,0.509045,0.404584,0.32524,0.20487,-0.242148,-0.500838,-0.286432
2,0.509045,0.321897,0.243719,0.132003,-0.27191,-0.500838,-0.286432
3,0.509045,0.285721,0.243719,0.131153,-0.212386,-0.500838,-0.286432
4,0.509045,0.280553,0.189371,0.135689,-0.301672,-0.500838,-0.286432


In [0]:
train_data, test_data, train_targets, test_targets = train_test_split(features, targets, test_size=.3)

In [0]:
classifier = classifierRegr(n_neighbors=6)
classifier.fit(train_data, train_targets)
predictions = classifier.predict(test_data)


In [309]:
import sklearn.metrics as sk

# variance score
print("Variance Score")
print(sk.explained_variance_score(test_targets, predictions))

# max error
print("max error")
print(sk.max_error(test_targets, predictions))

# mean absolute error
print("mean absolute error")
print(sk.mean_absolute_error(test_targets, predictions))

# mean squared error
print("mean squared error")
print(sk.mean_squared_error(test_targets, predictions))

# mean squared log error
print("mean squared log error")
print(sk.mean_squared_log_error(test_targets, predictions))

# r2 score
print("r2 score")
print(sk.r2_score(test_targets, predictions))

Variance Score
0.8588379082474534
max error
8.93333333333333
mean absolute error
2.1799999999999997
mean squared error
9.014351851851847
mean squared log error
0.012166953561829787
r2 score
0.8559899041608042


This is my first time doing regression, so I'm not sure exactly how good this is, but the numbers look reasonable: an R² of about 0.86, a mean absolute error of about 2.2 mpg, and a max error of about 9 mpg.
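
To help judge the scores above, this sketch computes three of the metrics by hand on a tiny made-up example and checks them against sklearn. The mpg values are hypothetical. The point is that R² compares the model's squared error with the squared error of always predicting the mean, so 1.0 is perfect and 0.0 is no better than that baseline.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical mpg values (assumption), not taken from the dataset.
y_true = np.array([18.0, 25.0, 30.0, 15.0])
y_pred = np.array([20.0, 24.0, 27.0, 15.0])

errors = y_true - y_pred                # [-2, 1, 3, 0]
mae = np.abs(errors).mean()             # (2 + 1 + 3 + 0) / 4 = 1.5
mse = (errors ** 2).mean()              # (4 + 1 + 9 + 0) / 4 = 3.5

# R^2 = 1 - (model's squared error) / (squared error of predicting the mean)
ss_res = (errors ** 2).sum()
ss_tot = ((y_true - y_true.mean()) ** 2).sum()
r2 = 1 - ss_res / ss_tot

print(mae, mean_absolute_error(y_true, y_pred))
print(mse, mean_squared_error(y_true, y_pred))
print(r2, r2_score(y_true, y_pred))
```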

---

Moving on to class data!

In [310]:
classPerformanceMath.head()

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,reason,guardian,traveltime,studytime,failures,schoolsup,famsup,paid,activities,nursery,higher,internet,romantic,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,course,mother,2,2,0,yes,no,no,no,yes,yes,no,no,4,3,4,1,1,3,6,5,6,6
1,GP,F,17,U,GT3,T,1,1,at_home,other,course,father,1,2,0,no,yes,no,no,no,yes,yes,no,5,3,3,1,1,3,4,5,5,6
2,GP,F,15,U,LE3,T,1,1,at_home,other,other,mother,1,2,3,yes,no,yes,no,yes,yes,yes,no,4,3,2,2,3,3,10,7,8,10
3,GP,F,15,U,GT3,T,4,2,health,services,home,mother,1,3,0,no,yes,yes,yes,yes,yes,yes,yes,3,2,2,1,1,5,2,15,14,15
4,GP,F,16,U,GT3,T,3,3,other,other,home,father,1,2,0,no,yes,yes,no,yes,yes,no,no,4,3,2,1,2,5,4,6,10,10


In [0]:
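#classPerformanceMath.dtypes

# Encode every text column as integer category codes in new *_cat columns
classPerformanceMath = makeCatCodes(classPerformanceMath)
classPerformancePort = makeCatCodes(classPerformancePort)

featuresToAdd = []

# G3 (the final grade) is the regression target, so remove it from the features
targets = classPerformanceMath["G3"].to_numpy()
del classPerformanceMath["G3"]

# Keep only integer columns: the original numeric columns and the new *_cat codes (int8)
for col, dType in classPerformanceMath.dtypes.items():
  if pd.api.types.is_integer_dtype(dType):
    featuresToAdd.append(col)

features = normalizeData(classPerformanceMath[featuresToAdd])
classPerformanceMath.head()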

In [327]:
train_data, test_data, train_targets, test_targets = train_test_split(features, targets, test_size=.3)

classifier = classifierRegr(n_neighbors=15)
classifier.fit(train_data, train_targets)
predictions = classifier.predict(test_data)

import sklearn.metrics as sk

# variance score
print("Variance Score")
print(sk.explained_variance_score(test_targets, predictions))

# max error
print("max error")
print(sk.max_error(test_targets, predictions))

# mean absolute error
print("mean absolute error")
print(sk.mean_absolute_error(test_targets, predictions))

# mean squared error
print("mean squared error")
print(sk.mean_squared_error(test_targets, predictions))

# mean squared log error
print("mean squared log error")
print(sk.mean_squared_log_error(test_targets, predictions))

# r2 score
print("r2 score")
print(sk.r2_score(test_targets, predictions))

Variance Score
0.24230866013305208
max error
11.666666666666666
mean absolute error
2.7439775910364137
mean squared error
15.00392156862745
mean squared log error
0.5093958236018382
r2 score
0.24151256824358724


In [333]:
print(test_targets[:5])
print(predictions[:5])

[ 0  7  6 15  8]
[11.          9.06666667 12.46666667  9.06666667 12.26666667]
