Find Intersections Between Two 3D NumPy Arrays
=============

This is needed for problem 5 in 

---
Problem 4
---------
Convince yourself that the data is still good after shuffling!

---

In [19]:
def check_distribution(labels):
  infos = []
  for index, label in enumerate('a b c d e f g h i j'.split()):
    infos.append({
        'label': label,
        'total': labels[labels == index].shape[0]
    })
  
  for info in infos:
    percentage = round(float(info['total']) / float(labels.shape[0]) * 100, 2)
    print("class: {}, total: {}({}% of all data)".format(info['label'], info['total'], percentage))
    
print("Distribution of train dataset")
check_distribution(train_labels)
print("------------------------------")
print("Distribution of test dataset")
check_distribution(test_labels)
print("------------------------------")
print("Distribution of validation dataset")
check_distribution(valid_labels)

Distribution of train dataset
class: a, total: 20000(10.0% of all data)
class: b, total: 20000(10.0% of all data)
class: c, total: 20000(10.0% of all data)
class: d, total: 20000(10.0% of all data)
class: e, total: 20000(10.0% of all data)
class: f, total: 20000(10.0% of all data)
class: g, total: 20000(10.0% of all data)
class: h, total: 20000(10.0% of all data)
class: i, total: 20000(10.0% of all data)
class: j, total: 20000(10.0% of all data)
------------------------------
Distribution of test dataset
class: a, total: 1000(10.0% of all data)
class: b, total: 1000(10.0% of all data)
class: c, total: 1000(10.0% of all data)
class: d, total: 1000(10.0% of all data)
class: e, total: 1000(10.0% of all data)
class: f, total: 1000(10.0% of all data)
class: g, total: 1000(10.0% of all data)
class: h, total: 1000(10.0% of all data)
class: i, total: 1000(10.0% of all data)
class: j, total: 1000(10.0% of all data)
------------------------------
Distribution of validation dataset
class: a, t
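
The same check can be sketched more compactly with `np.bincount`. This is a minimal sketch, not the notebook's code: `class_counts` and `demo_labels` are hypothetical names, and the labels are assumed to be integers from 0 to 9, as in `check_distribution` above.

```python
import numpy as np

def class_counts(labels, num_classes=10):
    """Return per-class counts and percentages for integer labels 0..num_classes-1."""
    # bincount counts how often each non-negative integer occurs.
    counts = np.bincount(labels, minlength=num_classes)
    percentages = 100.0 * counts / labels.shape[0]
    return counts, percentages

# Hypothetical labels standing in for train_labels: 3 samples per class, shuffled.
rng = np.random.RandomState(0)
demo_labels = rng.permutation(np.repeat(np.arange(10), 3))

counts, percentages = class_counts(demo_labels)
for letter, count, pct in zip('abcdefghij', counts, percentages):
    print("class: {}, total: {} ({:.1f}% of all data)".format(letter, count, pct))
```

If shuffling kept the data intact, the counts before and after shuffling should be identical.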

Finally, let's save the data for later reuse:

In [20]:
pickle_file = 'notMNIST.pickle'

try:
  save = {
    'train_dataset': train_dataset,
    'train_labels': train_labels,
    'valid_dataset': valid_dataset,
    'valid_labels': valid_labels,
    'test_dataset': test_dataset,
    'test_labels': test_labels,
    }
  # The context manager closes the file even if pickle.dump raises.
  with open(pickle_file, 'wb') as f:
    pickle.dump(save, f, pickle.HIGHEST_PROTOCOL)
except Exception as e:
  print('Unable to save data to', pickle_file, ':', e)
  raise

In [21]:
statinfo = os.stat(pickle_file)
print('Compressed pickle size:', statinfo.st_size)

Compressed pickle size: 690800441
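
To reuse the data later, the file can be read back with `pickle.load`. The sketch below round-trips a small dictionary through a temporary file; `demo_save` and `demo_path` are hypothetical stand-ins, not the real `notMNIST.pickle`.

```python
import os
import pickle
import tempfile

import numpy as np

# Hypothetical small stand-in for the `save` dictionary.
demo_save = {
    'train_dataset': np.zeros((4, 28, 28), dtype=np.float32),
    'train_labels': np.array([0, 1, 2, 3], dtype=np.int32),
}

# Write to a throwaway location instead of the real pickle file.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.pickle')

with open(demo_path, 'wb') as f:
    pickle.dump(demo_save, f, pickle.HIGHEST_PROTOCOL)

# Reading uses binary mode as well.
with open(demo_path, 'rb') as f:
    restored = pickle.load(f)

print(restored['train_dataset'].shape, restored['train_labels'])
```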


---
Problem 5
---------

By construction, this dataset might contain a lot of overlapping samples, including training data that is also contained in the validation and test sets! Overlap between training and test data can skew the results if you expect to use your model in an environment where there is never an overlap, but it is actually fine if you expect to see training samples recur when you use it.
Measure how much overlap there is between training, validation and test samples.

Optional questions:
- What about near duplicates between datasets? (images that are almost identical)
- Create a sanitized validation and test set, and compare your accuracy on those in subsequent assignments.
---

Alright, this is tricky. To do this, we need to know how to find overlap between two matrices.

The following shows a way to do this on 2D arrays based on [this](http://stackoverflow.com/questions/8317022/get-intersecting-rows-across-two-2d-numpy-arrays) post:

In [78]:
a = np.array([[1,4],[2,5],[3,6],[9,10]])
b = np.array([[1,4],[3,6],[7,8]])
nrows, ncols = a.shape
dtype = {'names': ['f{}'.format(i) for i in range(ncols)],
         'formats': ncols * [a.dtype]}
dtype

{'formats': [dtype('int64'), dtype('int64')], 'names': ['f0', 'f1']}

This `dtype` describes a structured type with one field per column, so viewing an array through it turns each row into a single record. Below is how the array looks without and with the `dtype` view:

In [80]:
a

array([[ 1,  4],
       [ 2,  5],
       [ 3,  6],
       [ 9, 10]])

In [81]:
a.view(dtype)

array([[(1, 4)],
       [(2, 5)],
       [(3, 6)],
       [(9, 10)]], 
      dtype=[('f0', '<i8'), ('f1', '<i8')])

Now we just need to run the `intersect1d` function on these structured arrays to find the rows they have in common:

In [83]:
c = np.intersect1d(a.view(dtype), b.view(dtype))
c

array([(1, 4), (3, 6)], 
      dtype=[('f0', '<i8'), ('f1', '<i8')])

Note that intersecting two non-structured arrays directly compares individual cells rather than rows, as shown here:

In [82]:
np.intersect1d(a, b)

array([1, 3, 4, 6])

To turn the structured result `c` back into a plain 2D array like our initial matrices, we view it with the original dtype and reshape it:

In [84]:
cview = c.view(a.dtype).reshape(-1, ncols)
cview

array([[1, 4],
       [3, 6]])
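
The steps above can be bundled into one helper. This is a sketch under the assumption that both arrays are 2D with the same number of columns; `intersect_rows` is a hypothetical name, and the second array is cast to the first array's dtype so that both views share the same record layout.

```python
import numpy as np

def intersect_rows(a, b):
    """Return the rows that appear in both 2D arrays, sorted and without duplicates."""
    # Views need contiguous memory; both arrays must share one dtype.
    a = np.ascontiguousarray(a)
    b = np.ascontiguousarray(b, dtype=a.dtype)
    ncols = a.shape[1]
    # One field per column, so each row becomes a single record.
    row_dtype = np.dtype({'names': ['f{}'.format(i) for i in range(ncols)],
                          'formats': ncols * [a.dtype]})
    common = np.intersect1d(a.view(row_dtype), b.view(row_dtype))
    # Back from records to a plain 2D array.
    return common.view(a.dtype).reshape(-1, ncols)

a = np.array([[1, 4], [2, 5], [3, 6], [9, 10]])
b = np.array([[1, 4], [3, 6], [7, 8]])
print(intersect_rows(a, b))
```

For the example arrays this prints the two shared rows, `[1 4]` and `[3 6]`.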

How about three-dimensional arrays?

In [124]:
A = np.array([
    [
      [1, 2, 3],
      [4, 5, 6],
      [7, 8, 9]
    ],
    [
      [1, 1, 1],
      [2, 2, 2],
      [3, 3, 3]
    ],
    [
      [10, 10, 10],
      [11, 11, 11],
      [12, 12, 12]
    ]
  ])

B = np.array([
    [
      [1, 1, 1],
      [2, 2, 2],
      [3, 3, 3]
    ],
    [
      [1, 2, 3],
      [4, 5, 6],
      [7, 8, 9]
    ]
  ])

C = np.array([
    [
      [1, 2, 3],
      [4, 5, 6],
      [7, 8, 9]
    ],
    [
      [1, 1, 1],
      [2, 2, 2],
      [3, 3, 3]
    ]
  ])

D = np.array([
    [
      [1, 2, 3],
      [4, 5, 6],
      [7, 8, 9]
    ],
    [
      [10, 10, 10],
      [11, 11, 11],
      [12, 12, 12]
    ]
  ])

First, a direct intersection, which again compares cell by cell, as demonstrated below:

In [125]:
print(np.intersect1d(A,B))
print(np.intersect1d(A,C))
print(np.intersect1d(A,D))

[1 2 3 4 5 6 7 8 9]
[1 2 3 4 5 6 7 8 9]
[ 1  2  3  4  5  6  7  8  9 10 11 12]


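`np.where` does not help either. `A` and `C` have different shapes, so `A == C` cannot be evaluated cell by cell, and in the NumPy version used here the result is empty: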

In [126]:
A[np.where(A == C)]

array([], shape=(0, 3, 3), dtype=int64)
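
A comparison that does line up whole samples can be sketched with broadcasting. The arrays repeat the `A` and `C` defined above; `pairwise` and `in_C` are names introduced only for this illustration. The intermediate array holds one comparison per cell for every pair of samples, so this is only practical for small inputs like these.

```python
import numpy as np

A = np.array([[[1, 2, 3], [4, 5, 6], [7, 8, 9]],
              [[1, 1, 1], [2, 2, 2], [3, 3, 3]],
              [[10, 10, 10], [11, 11, 11], [12, 12, 12]]])
C = np.array([[[1, 2, 3], [4, 5, 6], [7, 8, 9]],
              [[1, 1, 1], [2, 2, 2], [3, 3, 3]]])

# Shapes (3, 1, 3, 3) and (1, 2, 3, 3) broadcast to (3, 2, 3, 3).
# pairwise[i, j] is True when A[i] equals C[j] in every cell.
pairwise = (A[:, np.newaxis] == C[np.newaxis]).all(axis=(2, 3))

# A sample of A overlaps if it matches at least one sample of C.
in_C = pairwise.any(axis=1)
print(in_C)
print(A[in_C].shape)
```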

Let's use the same method as before:
1. Create a structured `dtype` for the matrices.
2. Use `intersect1d` to find the intersections.
3. Reshape the intersected result.

We will first code each part on its own, then create a function to use in our code:

In [127]:
# Create dtype for all matrices:
ncols = A.shape[2]
dtype = {'names': ['f{}'.format(i) for i in range(ncols)],
         'formats': ncols * [A.dtype]}
dtype

{'formats': [dtype('int64'), dtype('int64'), dtype('int64')],
 'names': ['f0', 'f1', 'f2']}

In [138]:
# Find intersections (stored in `common` so the sample array C defined earlier stays intact)
common = np.intersect1d(A.view(dtype), B.view(dtype))
common

array([(1, 1, 1), (1, 2, 3), (2, 2, 2), (3, 3, 3), (4, 5, 6), (7, 8, 9)], 
      dtype=[('f0', '<i8'), ('f1', '<i8'), ('f2', '<i8')])

In [129]:
# Reshape into rows of ncols values
Cview = common.view(A.dtype).reshape(-1, ncols)
print(Cview)

[[1 1 1]
 [1 2 3]
 [2 2 2]
 [3 3 3]
 [4 5 6]
 [7 8 9]]

Each line of this output is one row that occurs in both `A` and `B`; the reshape only regroups the flat values into rows of three. Reshaping to a single axis of length 3 fails, because the result holds 18 values:

In [140]:
Cview = common.view(A.dtype).reshape(3)
print(Cview)

ValueError: total size of new array must be unchanged

In [115]:
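def intersect3d(A, B):
  """Find the rows (along the last axis) shared by two 3d matrices.
  Args:
    A(numpy.array): 3 dimensional numpy array.
    B(numpy.array): 3 dimensional numpy array with the same dtype
      and the same last-axis length as A.
  Returns:
    numpy.array: flat 1 dimensional array holding the values of
      every row that occurs in both A and B.
  """
  ncols = A.shape[2]
  # One field per column, so each row is compared as a single record.
  dtype = {'names': ['f{}'.format(i) for i in range(ncols)],
           'formats': ncols * [A.dtype]}
  C = np.intersect1d(A.view(dtype), B.view(dtype))
  # View the records as plain values again; the result is flat.
  return C.view(A.dtype)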

Let's try it on our 3D matrices:

In [117]:
print(intersect3d(A,B))
print(intersect3d(A,C))
print(intersect3d(A,D))
x = intersect3d(A,B)
x

[1 1 1 1 2 3 2 2 2 3 3 3 4 5 6 7 8 9]
[1 1 1 1 2 3 2 2 2 3 3 3 4 5 6 7 8 9]
[ 1  2  3  4  5  6  7  8  9 10 10 10 11 11 11 12 12 12]


array([1, 1, 1, 1, 2, 3, 2, 2, 2, 3, 3, 3, 4, 5, 6, 7, 8, 9])
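
Note that the output above is a flat list of matching rows. If whole 2D samples should be matched instead, one possible approach is to flatten each sample into a single row first and reshape the result afterwards. This is a sketch, not a tuned solution: `intersect_samples` is a hypothetical helper, and it assumes both arrays share the same sample shape.

```python
import numpy as np

def intersect_samples(A, B):
    """Return the 2D samples (slices along axis 0) present in both 3D arrays."""
    sample_shape = A.shape[1:]
    # Flatten every sample into one row, so a row stands for a whole sample.
    a_rows = np.ascontiguousarray(A.reshape(A.shape[0], -1))
    b_rows = np.ascontiguousarray(B.reshape(B.shape[0], -1), dtype=a_rows.dtype)
    ncols = a_rows.shape[1]
    # Same structured-dtype trick as before, now with one field per cell.
    row_dtype = np.dtype({'names': ['f{}'.format(i) for i in range(ncols)],
                          'formats': ncols * [a_rows.dtype]})
    common = np.intersect1d(a_rows.view(row_dtype), b_rows.view(row_dtype))
    # Restore the original sample shape.
    return common.view(a_rows.dtype).reshape((-1,) + sample_shape)

A = np.array([[[1, 2, 3], [4, 5, 6], [7, 8, 9]],
              [[1, 1, 1], [2, 2, 2], [3, 3, 3]],
              [[10, 10, 10], [11, 11, 11], [12, 12, 12]]])
B = np.array([[[1, 1, 1], [2, 2, 2], [3, 3, 3]],
              [[1, 2, 3], [4, 5, 6], [7, 8, 9]]])
D = np.array([[[1, 2, 3], [4, 5, 6], [7, 8, 9]],
              [[10, 10, 10], [11, 11, 11], [12, 12, 12]]])

shared_ab = intersect_samples(A, B)
shared_ad = intersect_samples(A, D)
print(shared_ab.shape, shared_ad.shape)
```

The matching is exact, so for floating-point images only bit-for-bit identical samples count as overlap; near duplicates need a different comparison.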

---
Problem 6
---------

Let's get an idea of what an off-the-shelf classifier can give you on this data. It's always good to check that there is something to learn, and that it's a problem that is not so trivial that a canned solution solves it.

Train a simple model on this data using 50, 100, 1000 and 5000 training samples. Hint: you can use the LogisticRegression model from sklearn.linear_model.

Optional question: train an off-the-shelf model on all the data!

---