<a href="https://colab.research.google.com/github/AmanPriyanshu/Discussing_Learning/blob/master/The_Oracle_and_the_Council.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## THE ORACLE:

The actual labels of the datsets, i.e. something which can give classification within 100% accuracy.

## COUNCIL:
Number of members exist within the council. They vote for labels and upon discrepencies ask from the Oracle.

Members:

* Random Forest (max_depth=4)
* Random Forest (max_depth=8)
* Random Forest (max_depth=12)
* Naive Bayes
* Perceptron
* ANN

## UPLOADING DATASET:

In [1]:
from google.colab import files
uploaded = files.upload()

Saving train.csv to train.csv


## IMPORTS:

In [2]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.neighbors import KNeighborsClassifier
from math import factorial
import tensorflow as tf

## LOADING DATA:

In [3]:
data = pd.read_csv('train.csv')
features = data.columns
data = data.values

In [4]:
print(features)

Index(['battery_power', 'blue', 'clock_speed', 'dual_sim', 'fc', 'four_g',
       'int_memory', 'm_dep', 'mobile_wt', 'n_cores', 'pc', 'px_height',
       'px_width', 'ram', 'sc_h', 'sc_w', 'talk_time', 'three_g',
       'touch_screen', 'wifi', 'price_range'],
      dtype='object')


In [5]:
X = data.T[:-1]
Y = data.T[-1]
X = X.T

indexes = np.arange(X.shape[0])
np.random.seed(0)
np.random.shuffle(indexes)
X = X[indexes]
Y = Y[indexes]

Y_encoded = []
for y in Y:
  a = [0, 0, 0, 0]
  a[int(y)] = 1
  Y_encoded.append(a)
Y_encoded = np.array(Y_encoded)
Y = Y_encoded

print('X', X.shape,'\n', X)
print()
print('Y',Y.shape,'\n', Y)

X (2000, 20) 
 [[1.454e+03 1.000e+00 5.000e-01 ... 1.000e+00 1.000e+00 0.000e+00]
 [1.092e+03 1.000e+00 5.000e-01 ... 0.000e+00 1.000e+00 0.000e+00]
 [1.524e+03 1.000e+00 1.800e+00 ... 1.000e+00 0.000e+00 1.000e+00]
 ...
 [1.190e+03 0.000e+00 2.000e+00 ... 0.000e+00 0.000e+00 1.000e+00]
 [1.191e+03 0.000e+00 2.400e+00 ... 1.000e+00 1.000e+00 1.000e+00]
 [7.060e+02 0.000e+00 5.000e-01 ... 1.000e+00 0.000e+00 1.000e+00]]

Y (2000, 4) 
 [[0 0 0 1]
 [1 0 0 0]
 [0 0 1 0]
 ...
 [0 0 0 1]
 [1 0 0 0]
 [0 1 0 0]]


Let the Oracle have the Y values

In [6]:
original_X = X
original_Y = Y

In [7]:
def Oracle(X_quest):
  global original_X, original_Y
  real_Y = []
  for x_q in X_quest:
    for x, y in zip(original_X, original_Y):
      if np.sum(x-x_q)==0:
        real_Y.append(y)
        break
  real_Y = np.array(real_Y)
  return real_Y

## CREATING A COUNCIL:

In [8]:
class ClassifierCouncil:
  def __init__(self, X, threshold=0.2, seed=0, y_shape=4, max_iter=1e1):
    self.max_iter = max_iter
    self.iter = 0
    self.X = X
    self.threshold = threshold + 1e-4
    np.random.seed(seed=seed)
    indexes = np.arange(self.X.shape[0])
    np.random.shuffle(indexes)
    self.X = self.X[indexes]
    self.clf0 = RandomForestClassifier(max_depth=4, random_state=0)
    self.clf1 = RandomForestClassifier(max_depth=8, random_state=0)
    self.clf2 = RandomForestClassifier(max_depth=12, random_state=0)
    self.neigh = KNeighborsClassifier(n_neighbors=int(X.shape[0]**0.5)//2)
    self.perceptron = tf.keras.models.Sequential([
                                    tf.keras.layers.Dense(y_shape, activation='softmax', input_shape=(X.shape[1],))
                                    ])
    self.perceptron.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    self.ann = tf.keras.models.Sequential([
                                    tf.keras.layers.Dense(X.shape[1]//2, activation='sigmoid', input_shape=(X.shape[1],)),
                                    tf.keras.layers.Dense(y_shape, activation='softmax', input_shape=(X.shape[1],))
                                    ])
    self.ann.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    self.indexes_labelled = None

  def first(self):
    self.indexes_labelled = np.arange(int(self.X.shape[0]**0.5))
    first_x = self.X[:int(self.X.shape[0]**0.5)]
    first_y = Oracle(first_x)
    self.clf0.fit(first_x, first_y)
    self.clf1.fit(first_x, first_y)
    self.clf2.fit(first_x, first_y)
    self.neigh.fit(first_x, first_y)
    self.perceptron.fit(first_x, first_y, epochs=1000, verbose=0)
    self.ann.fit(first_x, first_y, epochs=1000, verbose=0)
  
  def current_predictions(self):
    y_clf0 = self.clf0.predict(self.X)
    y_clf1 = self.clf1.predict(self.X)
    y_clf2 = self.clf2.predict(self.X)
    y_neigh = self.neigh.predict(self.X)
    y_perceptron = self.perceptron.predict(self.X)

    y_perceptron_encoded = []
    for y in y_perceptron:
      a = [0 for i in range(y.shape[0])]
      a[np.argmax(y)] = 1
      y_perceptron_encoded.append(a)
    y_perceptron = np.array(y_perceptron_encoded)

    y_ann = self.ann.predict(self.X)

    y_ann_encoded = []
    for y in y_ann:
      a = [0 for i in range(y.shape[0])]
      a[np.argmax(y)] = 1
      y_ann_encoded.append(a)
    y_ann = np.array(y_ann_encoded)

    return y_clf0, y_clf1, y_clf2, y_neigh, y_perceptron, y_ann

  def discrepency_calc(self):
    y_clf0, y_clf1, y_clf2, y_neigh, y_perceptron, y_ann = self.current_predictions()
    discrep = []
    for c1, c2, c3, c4, c5, c6 in zip(y_clf0, y_clf1, y_clf2, y_neigh, y_perceptron, y_ann):
      score = abs(np.sum(c1-c2)/2) + abs(np.sum(c1-c3)/2) + abs(np.sum(c1-c4)/2) + abs(np.sum(c1-c5)/2) + abs(np.sum(c1-c6)/2) + abs(np.sum(c2-c3)/2) 
      score += abs(np.sum(c2-c4)/2) + abs(np.sum(c2-c5)/2) + abs(np.sum(c2-c6)/2) + abs(np.sum(c3-c4)/2) + abs(np.sum(c3-c5)/2) + abs(np.sum(c3 - c6)/2)
      score += abs(np.sum(c4-c5)/2) + abs(np.sum(c4-c6)/2) + abs(np.sum(c5-c6)/2)
      score = score/(factorial(6)/(factorial(2)*factorial(4)))
      discrep.append(score)
    discrep = np.array(discrep)
    return discrep

  def learn_again(self):
    x = self.X[self.indexes_labelled]
    y = Oracle(x)
    self.clf0.fit(x, y)
    self.clf1.fit(x, y)
    self.clf2.fit(x, y)
    self.neigh.fit(x, y)
    self.perceptron.fit(x, y, epochs=1000, verbose=0)
    self.ann.fit(x, y, epochs=1000, verbose=0)

  def predict(self, x_user):
    y_clf0 = self.clf0.predict(x_user)
    y_clf1 = self.clf1.predict(x_user)
    y_clf2 = self.clf2.predict(x_user)
    y_neigh = self.neigh.predict(x_user)
    y_perceptron = self.perceptron.predict(x_user)

    y_perceptron_encoded = []
    for y in y_perceptron:
      a = [0 for i in range(y.shape[0])]
      a[np.argmax(y)] = 1
      y_perceptron_encoded.append(a)
    y_perceptron = np.array(y_perceptron_encoded)

    y_ann = self.ann.predict(self.X)

    y_ann_encoded = []
    for y in y_ann:
      a = [0 for i in range(y.shape[0])]
      a[np.argmax(y)] = 1
      y_ann_encoded.append(a)
    y_ann = np.array(y_ann_encoded)

    y_user_base = y_clf0 + y_clf1 + y_clf2 + y_neigh + y_perceptron + y_ann
    y_user = []
    for y in y_user_base:
      a = [0 for i in range(y.shape[0])]
      a[np.argmax(y)] = 1
      y_user.append(a)
    y_user = np.array(y_user)
    return y_user

  def calling_oracle(self):
    discrep = self.discrepency_calc()
    indexes_sorted = np.argsort(discrep)[::-1]
    score = np.mean(discrep)
    print('Current Discrepency Score within the Council', score, 'Number of training samples used', self.indexes_labelled.shape, 'Worst Cases Used:', np.mean(discrep[indexes_sorted][:int(self.X.shape[0]**0.5)//2]))
    if discrep[indexes_sorted][0] < self.threshold or self.iter > self.max_iter:
      print(discrep[indexes_sorted], self.indexes_labelled.shape)
      return True
    else:
      self.indexes_labelled = np.array(list(set([i for i in np.concatenate([self.indexes_labelled, indexes_sorted[:int(self.X.shape[0]**0.5)//2]]).flatten()])))
      self.learn_again()
    self.iter += 1
    self.calling_oracle()  

  



In [9]:
council = ClassifierCouncil(X)
council.first()
council.calling_oracle()

Current Discrepency Score within the Council 0.2455333333333333 Number of training samples used (44,) Worst Cases Used: 0.29999999999999993
Current Discrepency Score within the Council 0.1914 Number of training samples used (66,) Worst Cases Used: 0.29999999999999993
Current Discrepency Score within the Council 0.20549999999999993 Number of training samples used (88,) Worst Cases Used: 0.29999999999999993
Current Discrepency Score within the Council 0.20541666666666666 Number of training samples used (110,) Worst Cases Used: 0.29999999999999993
Current Discrepency Score within the Council 0.1613 Number of training samples used (132,) Worst Cases Used: 0.29999999999999993
Current Discrepency Score within the Council 0.16385 Number of training samples used (154,) Worst Cases Used: 0.29999999999999993
Current Discrepency Score within the Council 0.15783333333333333 Number of training samples used (176,) Worst Cases Used: 0.29999999999999993
Current Discrepency Score within the Council 0.1

In [10]:
Y_pred = council.predict(original_X)

In [11]:
acc = 0
for y0, y1 in zip(original_Y, Y_pred):
  if np.sum(y0 - y1) == 0:
    acc += 1
acc = acc/original_Y.shape[0]
print(acc)

1.0


ONLY 286 values can easily give us an accuracy of 100%. Let us just try it with an ANN with the first 286 values

In [12]:
ann = tf.keras.models.Sequential([
                                    tf.keras.layers.Dense(X.shape[1]//2, activation='sigmoid', input_shape=(X.shape[1],)),
                                    tf.keras.layers.Dense(original_Y.shape[1], activation='softmax', input_shape=(X.shape[1],))
                                    ])
ann.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

In [14]:
ann.fit(original_X[:286], original_Y[:286], epochs=1000, verbose=0)

<tensorflow.python.keras.callbacks.History at 0x7fec372c9a90>

In [15]:
print(ann.evaluate(original_X, original_Y, verbose=1))

[1.0388219356536865, 0.5264999866485596]


That is poor accuracy maybe ANN isn't the best model. Let us try the Random Forest

In [16]:
clf = RandomForestClassifier(max_depth=12, random_state=0)
clf.fit(original_X[:286], original_Y[:286])
print(clf.score(original_X, original_Y))

0.5515


In [17]:
print(original_Y)

[[0 0 0 1]
 [1 0 0 0]
 [0 0 1 0]
 ...
 [0 0 0 1]
 [1 0 0 0]
 [0 1 0 0]]


In [18]:
print(Y_pred)

[[0 0 0 1]
 [1 0 0 0]
 [0 0 1 0]
 ...
 [0 0 0 1]
 [1 0 0 0]
 [0 1 0 0]]


Ok I don't know how this is working but yes we get a 100% accuracy with only 286 samples from the 2000 data points we have. Now, I know this is stupid to perform because we already have the true values. Why would we have an Oracle system at all? This is because, the oracle system is seperate and is actually step-in for Human here. That means that the labelling is extremely expensive, since human time is expensive. We first ask the human to randomly return to us a few values, we then begin to predict on the basis of that. And only ask really important parts. The models select only those points which are considered confusing for the model. It will be useful when one just starts creating a dataset, and wants to start analysing immediately. It also reduces training time.