<a href="https://colab.research.google.com/github/justinhtn/pixl/blob/master/colab_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Installation

In [1]:
# installing pixl
!pip install pixl

Looking in indexes: https://test.pypi.org/simple/


## Imports

In [0]:
from pixl.utils import Pixl

from sklearn.cluster import KMeans
import pandas as pd

from shutil import unpack_archive
import os

## Load in sample imagery
Run the code below to download a zip file containing 3 sample dog images and 3 sample cat images and unzip it into a newly created `Images` directory. We'll use the `unpack_archive` function from `shutil` to unzip the images, then load the filenames into a variable called `filenames`.

In [4]:
# grabs sample images zip file from public google drive folder
!wget --no-check-certificate 'https://drive.google.com/uc?export=download&id=1VzCf968wZ4zINepB7RtOQlQ3r-uE2GdW' -O 'sample_images.zip'

# unzips the downloaded zip file of images
unpack_archive('sample_images.zip')

# The following removes the __MACOSX folder that macOS adds when creating an archive,
# the zip file we just unzipped, and a .DS_Store file that's not needed
!rm -rf __MACOSX/ Images/.DS_Store sample_images.zip

# setting directory of sample images
img_dir = './Images'

# grabbing file_names
filenames = []
for file in os.listdir(img_dir):
    filenames.append(os.fsdecode(file))

--2020-06-01 21:59:34--  https://drive.google.com/uc?export=download&id=1VzCf968wZ4zINepB7RtOQlQ3r-uE2GdW
Resolving drive.google.com (drive.google.com)... 64.233.170.113, 64.233.170.138, 64.233.170.101, ...
Connecting to drive.google.com (drive.google.com)|64.233.170.113|:443... connected.
HTTP request sent, awaiting response... 302 Moved Temporarily
Location: https://doc-04-0k-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/3heuradhpdsct3arjqhig4p88jitjfuo/1591048725000/07620792070469537524/*/1VzCf968wZ4zINepB7RtOQlQ3r-uE2GdW?e=download [following]
--2020-06-01 21:59:35--  https://doc-04-0k-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/3heuradhpdsct3arjqhig4p88jitjfuo/1591048725000/07620792070469537524/*/1VzCf968wZ4zINepB7RtOQlQ3r-uE2GdW?e=download
Resolving doc-04-0k-docs.googleusercontent.com (doc-04-0k-docs.googleusercontent.com)... 173.194.212.132, 2607:f8b0:400c:c11::84
Connecting to doc-04-0k-docs.googleusercontent.com (doc-0

In [0]:
# instantiate a pixl object
p_object = Pixl()

In [0]:
def get_vec(img_dir, filenames):
    """
    Returns a list of labels and a list of vectors.

    Args:
    img_dir (str): path to the directory containing the image files.
    filenames (lst): list of image file names inside img_dir.

    Returns:
    labels (lst): list of labels taken from file names.
    vectors (lst): list of vectors representing image files.
    """
    vectors = []
    labels = []

    for file in filenames:
        # embed the image using the pixl object instantiated above
        vec = p_object.get_vec(os.path.join(img_dir, file))
        vectors.append(vec)
        # the label is the file name without its extension, e.g. 'cat_2'
        label = file.split('.')[0]
        labels.append(label)

    return labels, vectors

In [0]:
labels, vectors = get_vec(img_dir, filenames)

In [21]:
# examining a label and vector
print(labels[0], vectors[0])

cat_2 [0.43513054 0.35954186 0.16742773 ... 0.00338123 0.06258845 0.13982907]


## Inspecting vectors

In [0]:
ids = range(len(vectors))
img_embedd_df = pd.DataFrame(index = ids)

In [0]:
img_embedd_df['Labels'] = labels
img_embedd_df['Vectors'] = vectors

In [33]:
img_embedd_df

Unnamed: 0,Labels,Vectors
0,cat_2,"[0.43513054, 0.35954186, 0.16742773, 0.0214593..."
1,dog_2,"[0.0, 0.0, 0.019298665, 0.027997738, 0.0, 0.0,..."
2,dog_3,"[0.0066078817, 0.15834351, 0.0, 0.16526218, 0...."
3,dog_1,"[0.0, 0.5094904, 0.08386931, 0.23145355, 0.0, ..."
4,cat_1,"[0.014038057, 0.17216139, 0.19838648, 0.0, 0.1..."
5,cat_3,"[0.0128424335, 0.12559828, 0.08193211, 0.0, 0...."
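The truncated display above hides how long each vector is. One way to check is to stack the vectors into a single 2-D NumPy array and look at its shape. The snippet below is only a sketch: `example_vectors` is a hypothetical stand-in filled with random numbers, and the length of 512 is an assumption for illustration, not the actual size of a Pixl vector. In the notebook you would pass `vectors` instead.

```python
import numpy as np

# Hypothetical placeholder data: 6 random vectors of an assumed length of 512.
# In the notebook, use the `vectors` list returned by get_vec instead.
rng = np.random.default_rng(0)
example_vectors = [rng.random(512) for _ in range(6)]

# Stack the list of 1-D arrays into one (n_images, n_features) array.
matrix = np.stack(example_vectors)

# The first number is the image count, the second is the vector length.
print(matrix.shape)
```

`np.stack` raises an error if the vectors differ in length, so this also works as a quick consistency check before clustering.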


## KMeans Clustering
We can fit a KMeans clustering model to our newly generated vectors to see whether the two classes can be separated using only the information retained in each vector.

In [37]:
# setting n_clusters to 2 because we know we only want to cluster 2 classes 
km = KMeans(n_clusters=2)

# fitting to the list of vectors from get_vec func
km.fit(vectors)

# grabbing a list of clusters
clusters = km.labels_.tolist()

# adding a Cluster column with each image's cluster assignment
img_embedd_df["Cluster"] = clusters

# displaying counts per cluster 
print(img_embedd_df['Cluster'].value_counts())

1    3
0    3
Name: Cluster, dtype: int64


In [36]:
# displaying our df to see how we did!
display(img_embedd_df)

Unnamed: 0,Labels,Vectors,Cluster
0,cat_2,"[0.43513054, 0.35954186, 0.16742773, 0.0214593...",0
1,dog_2,"[0.0, 0.0, 0.019298665, 0.027997738, 0.0, 0.0,...",1
2,dog_3,"[0.0066078817, 0.15834351, 0.0, 0.16526218, 0....",1
3,dog_1,"[0.0, 0.5094904, 0.08386931, 0.23145355, 0.0, ...",1
4,cat_1,"[0.014038057, 0.17216139, 0.19838648, 0.0, 0.1...",0
5,cat_3,"[0.0128424335, 0.12559828, 0.08193211, 0.0, 0....",0
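To make the comparison easier to read than scanning the rows by eye, the labels and cluster assignments can be cross-tabulated. The sketch below rebuilds a small DataFrame from the values shown in the output above so that it runs on its own. The names `results`, `Animal`, and `table` are introduced here for illustration and are not part of the original notebook.

```python
import pandas as pd

# Labels and cluster assignments copied from the output shown above.
results = pd.DataFrame({
    "Labels": ["cat_2", "dog_2", "dog_3", "dog_1", "cat_1", "cat_3"],
    "Cluster": [0, 1, 1, 1, 0, 0],
})

# Derive the animal type from the label prefix ('cat_2' -> 'cat').
results["Animal"] = results["Labels"].str.split("_").str[0]

# Count how many images of each animal landed in each cluster.
table = pd.crosstab(results["Animal"], results["Cluster"])
print(table)
```

In this run, every cat falls in one cluster and every dog in the other. KMeans cluster ids are arbitrary, so a rerun may swap which animal gets 0 and which gets 1. What matters is that each row of the table has a single non-zero column.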
