Python for Google Cloud Vision OCR for Image Folder Organizer

jhkwag970/Python-OCR-Picture-Organizer

Python-Google-Cloud-Vision

Python for Google Cloud Vision OCR for Image Folder Organizer

Before You Start

  1. Setting Up Google Cloud: https://cloud.google.com/vision/docs/setup
  2. Setting Up Google Cloud Vision API: https://cloud.google.com/vision/docs/labels
  3. Setting Up sklearn: https://scikit-learn.org/stable/install.html
  4. Setting Up PyDictionary: https://pypi.org/project/PyDictionary/
  5. Setting Up nltk (For Lemmatization): https://www.nltk.org/

Entity Annotation Image Response JSON

This is the response JSON returned by Google Cloud after a request to label a picture. The score and topicality fields appear to have a problem: they carry the same value (see https://issuetracker.google.com/issues/117855698?pli=1).

{
  "responses": [
    {
      "labelAnnotations": [
        {
          "mid": "/m/0199g",
          "description": "Bicycle",
          "score": 0.96705616,
          "topicality": 0.96705616
        },
        {
          "mid": "/m/0h9mv",
          "description": "Tire",
          "score": 0.9641615,
          "topicality": 0.9641615
        }
      ]
    }
  ]
}
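As a rough illustration of reading this response, the sketch below parses the same JSON with the standard library and reports whether score and topicality are identical for each label. The function name extract_labels is hypothetical and not part of the project.

```python
import json

# Sample response copied from the JSON shown above.
RESPONSE_TEXT = """
{
  "responses": [
    {
      "labelAnnotations": [
        {"mid": "/m/0199g", "description": "Bicycle",
         "score": 0.96705616, "topicality": 0.96705616},
        {"mid": "/m/0h9mv", "description": "Tire",
         "score": 0.9641615, "topicality": 0.9641615}
      ]
    }
  ]
}
"""


def extract_labels(response_text):
    """Return (description, score, topicality) tuples from a response body."""
    payload = json.loads(response_text)
    labels = []
    for response in payload.get("responses", []):
        for annotation in response.get("labelAnnotations", []):
            labels.append((
                annotation["description"],
                annotation["score"],
                annotation["topicality"],
            ))
    return labels


labels = extract_labels(RESPONSE_TEXT)
for description, score, topicality in labels:
    # Flags the duplicated-value behaviour described above.
    identical = score == topicality
    print(f"{description}: score={score}, topicality={topicality}, identical={identical}")
```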

Folders under Resources

  • tmp: Folder that contains the pictures before clustering
  • results: Folder that contains the result pictures
  • pictures: Folder that contains the pictures after clustering
  • csv: Folder that contains the csv files of dataframe used in the project

Files

  • FolderCreater.py: Creates a folder for each cluster and moves the images into them according to the clustering result.
  • KMeanClustering.py: Uses the scikit-learn KMeans clustering implementation to group the images into 5 clusters.
  • Lemmatization.py: Uses the nltk WordNetLemmatizer to preprocess (lemmatize) the words (e.g. bicycles -> bicycle, tires -> tire).
  • Main.py: Creates the dataframe from the labels returned by the Google Vision API.
  • OCR.py: Opens the connection to Google Cloud and runs the Google Vision API labeling.
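The folder step described for FolderCreater.py could look roughly like the sketch below. It is not the project's code: the function name move_into_cluster_folders and the cluster_<n> folder naming are assumptions, and the demo runs on a temporary directory with empty stand-in files.

```python
import shutil
import tempfile
from pathlib import Path


def move_into_cluster_folders(assignments, source_dir, target_dir):
    """Move each image into target_dir/cluster_<n>, creating folders as needed.

    assignments maps an image file name to its cluster number.
    Returns the new path of every moved file.
    """
    source_dir, target_dir = Path(source_dir), Path(target_dir)
    moved = []
    for file_name, cluster in assignments.items():
        cluster_dir = target_dir / f"cluster_{cluster}"  # assumed naming scheme
        cluster_dir.mkdir(parents=True, exist_ok=True)
        destination = cluster_dir / file_name
        shutil.move(str(source_dir / file_name), str(destination))
        moved.append(destination)
    return moved


# Demo on a throwaway directory so no real pictures are touched.
with tempfile.TemporaryDirectory() as root:
    tmp_dir = Path(root) / "tmp"
    pictures_dir = Path(root) / "pictures"
    tmp_dir.mkdir()
    for name in ("bike.jpg", "tire.jpg", "cat.jpg"):
        (tmp_dir / name).write_bytes(b"")  # empty stand-in files
    moved = move_into_cluster_folders(
        {"bike.jpg": 0, "tire.jpg": 0, "cat.jpg": 1}, tmp_dir, pictures_dir
    )
    moved_names = sorted(p.relative_to(pictures_dir).as_posix() for p in moved)
    print(moved_names)
```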

How It Works

In this project, I mainly used the Google Cloud Vision API to extract the labels of each picture. The response JSON format is described above, and more detail can be found at https://cloud.google.com/vision/docs/reference/rest/v1/AnnotateImageResponse#EntityAnnotation.

Then, I mainly used

label.description
label.topicality

to get the label text and the topicality score of each label for the picture. Because of the timing issue, I only used the 3 labels with the highest topicality scores when creating the dataframe (more in Issues & Future Development Plan).
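A minimal sketch of this step, assuming the google-cloud-vision Python client (version 2 or later) and configured credentials. The helper names detect_labels and top_labels are hypothetical; in the Python client the label text is exposed as description. The sample values are made up so that the ranking part runs offline.

```python
from types import SimpleNamespace


def detect_labels(image_path):
    """Request label annotations for one local image file.

    Needs the google-cloud-vision package and credentials,
    so the import is kept inside the function.
    """
    from google.cloud import vision

    client = vision.ImageAnnotatorClient()
    with open(image_path, "rb") as image_file:
        image = vision.Image(content=image_file.read())
    response = client.label_detection(image=image)
    return list(response.label_annotations)


def top_labels(labels, k=3):
    """Return the k (description, topicality) pairs with the highest topicality."""
    ranked = sorted(labels, key=lambda label: label.topicality, reverse=True)
    return [(label.description, label.topicality) for label in ranked[:k]]


# Made-up stand-in objects with the same two attributes (not real API output).
fake_labels = [
    SimpleNamespace(description="Wheel", topicality=0.91),
    SimpleNamespace(description="Bicycle", topicality=0.97),
    SimpleNamespace(description="Tire", topicality=0.96),
    SimpleNamespace(description="Vehicle", topicality=0.88),
]
top3 = top_labels(fake_labels)
print(top3)
```

With real credentials, top_labels(detect_labels("some_picture.jpg")) would replace the stand-in list; the file name here is only a placeholder.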

For each label from the Google Vision API, I used PyDictionary, a Python library that provides word definitions, to build an information document for the image.

Using the definition document, I apply TF-IDF and lemmatization to create a vector of scores for each image. Then, using these vectors, I run the K-means algorithm to create the clusters of images.
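The TF-IDF and K-means steps can be sketched with scikit-learn as follows. The documents are made-up stand-ins for the definition text (not project data), lemmatization is skipped, and n_clusters=2 is chosen only because the toy input has four images; the project itself uses 5 clusters.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up definition documents, one per image (not taken from the project data).
documents = {
    "bike.jpg": "vehicle with two wheel moved by pedal wheel rim rubber tire",
    "tire.jpg": "rubber ring around wheel rim vehicle wheel tire",
    "cat.jpg": "small domestic animal with fur feline pet animal",
    "dog.jpg": "domestic animal kept as pet canine animal with fur",
}

# Turn every document into a TF-IDF weighted term vector.
vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(list(documents.values()))

# Cluster the vectors; random_state fixes the initialisation for repeatability.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = model.fit_predict(matrix)

# Map each file name to its cluster number.
clusters = dict(zip(documents, cluster_ids.tolist()))
print(clusters)
```

The cluster numbers themselves are arbitrary; only which images share a number is meaningful.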

Information about PyDictionary can be found at: https://pypi.org/project/PyDictionary/

Information about nltk lemmatization can be found at: https://www.nltk.org/_modules/nltk/stem/wordnet.html

Information about scikit-learn KMeans can be found at: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

Information about the scikit-learn TfidfVectorizer can be found at: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

The result dataframe before clustering:

df

The result dataframe after clustering:

df_result

You can find the CSV files under resources/csv.

Result

Before Clustering:

Before_Sort

After Clustering:

Result_1

Result_2

Result_3

Result_4

Result_5

Result_6

Code Processing Time:

Time

Issues & Future Development Plan

Right now, the folders are named with the cluster number. In the future, I will try to extract the main concept words from each cluster and use them as the folder name.
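One simple way to approach this, shown only as a sketch, is to name each cluster after its most frequent label. The function name name_clusters, the fallback naming, and the sample labels are all assumptions, not part of the project.

```python
from collections import Counter


def name_clusters(image_labels, assignments):
    """Pick the most frequent label of each cluster as its folder name.

    image_labels maps file name -> list of labels;
    assignments maps file name -> cluster number.
    Falls back to "cluster_<n>" when a cluster has no labels at all.
    """
    counters = {}
    for file_name, cluster in assignments.items():
        counter = counters.setdefault(cluster, Counter())
        counter.update(label.lower() for label in image_labels.get(file_name, []))
    names = {}
    for cluster, counter in counters.items():
        if counter:
            names[cluster] = counter.most_common(1)[0][0]
        else:
            names[cluster] = f"cluster_{cluster}"
    return names


# Made-up labels for illustration only.
image_labels = {
    "bike.jpg": ["Bicycle", "Tire", "Wheel"],
    "tire.jpg": ["Tire", "Rubber", "Rim"],
    "cat.jpg": ["Cat", "Whiskers", "Pet"],
    "kitten.jpg": ["Cat", "Fur", "Kitten"],
    "blank.jpg": [],
}
assignments = {"bike.jpg": 0, "tire.jpg": 0, "cat.jpg": 1, "kitten.jpg": 1, "blank.jpg": 2}

folder_names = name_clusters(image_labels, assignments)
print(folder_names)
```

Weighting each label by its topicality instead of counting occurrences would be a natural variation.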

The main issue the project currently has is timing. Even though the code processing time is relatively low (~2 seconds in total), getting the result takes about 3-5 minutes for 12 pictures. So, the time for clustering will increase as the number of pictures increases.

Another issue is that the Google Cloud Vision API account I am using is a free trial. This means I cannot use the project after August. Moreover, because it is a free trial, only a limited number of pictures can be sent for labeling. Pricing can be found at https://cloud.google.com/vision/pricing.

In a later version, I will remove PyDictionary and use only the labels and topicality for clustering (maybe using cosine similarity: https://en.wikipedia.org/wiki/Cosine_similarity).
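To make the cosine similarity idea concrete, the sketch below treats each image as a sparse vector of label -> topicality and compares two images directly. This is one possible reading of the plan, not an implemented feature, and the label values are made up.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors stored as {label: weight} dicts."""
    dot = sum(weight * b[label] for label, weight in a.items() if label in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an image without labels is similar to nothing
    return dot / (norm_a * norm_b)


# Made-up label/topicality vectors for three images.
bike = {"bicycle": 0.97, "tire": 0.96, "wheel": 0.91}
tire = {"tire": 0.98, "wheel": 0.93, "rim": 0.80}
cat = {"cat": 0.99, "whiskers": 0.90, "fur": 0.85}

sim_bike_tire = cosine_similarity(bike, tire)
sim_bike_cat = cosine_similarity(bike, cat)
print(f"bike vs tire: {sim_bike_tire:.3f}")
print(f"bike vs cat:  {sim_bike_cat:.3f}")
```

Images that share labels get a similarity above zero, while images with no label in common get exactly zero, so the pairwise values could feed a clustering step without any dictionary lookups.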

Credential
