In [98]:
# Check that MPS is available
if not torch.backends.mps.is_available():
    if not torch.backends.mps.is_built():
        print("MPS not available because the current PyTorch install was not "
              "built with MPS enabled.")
    else:
        print("MPS not available because the current MacOS version is not 12.3+ "
              "and/or you do not have an MPS-enabled device on this machine.")

else:
    mps_device = torch.device("mps")

In [99]:
mps_device

device(type='mps')

## Google Photos API - Download images from Google Photos using Python

Using the Google Photos REST API you can download, upload and modify images stored in Google Photos.

The following steps describe how to set up a simple project that lets you use Python to download images from Google Photos:

## Create virtualenv and install required packages

1. Open the terminal and navigate to your working directory. The folder structure of the repo includes the following directories:

    * **credentials**: folder to store the credentials you need to authenticate your "Python App" to the Google Photos Library
    * **media_items_list**: every time the script runs, I want to save a .csv file with all Google Photos media items and the corresponding metadata uploaded in the defined time period
    * **downloads**: storing downloaded images from Google Photos


2. Create a virtual environment `python3 -m venv venv`, activate it `. ./venv/bin/activate` and install requirements `pip install -r requirements.txt`

3. Install ipykernel which provides the IPython kernel for Jupyter: `pip install ipykernel` and add your virtual environment to Jupyter: `python -m ipykernel install --user --name=venv` 

    You can check the installation by navigating to /Users/<user>/Library/Jupyter/kernels. There should be a new directory called 'venv'. In the folder you can find the file 'kernel.json', which contains the path for the used python installation is defined.

4. Start jupyter notebook or jupyter lab: `jupyter lab .` and select the just created environment "venv" as Kernel

![](read_me_img/select_kernel.png)

## Enable Google API

5. Enable Google Photos API Service

   1. Go to the Google API Console [https://console.cloud.google.com/](https://console.cloud.google.com/). 
   2. From the menu bar, select a project or create a new project.
   
      ![](read_me_img/gifs/create_new_project_speed.gif)
      
   3. To open the Google API Library, from the Navigation menu, select APIs & Services > Library. 
   4. Search for "Google Photos Library API". Select the correct result and click "enable". If its already enabled, click "manage"
   
       ![](read_me_img/gifs/enable_api_speed.gif)
       
   5. Afterwards it will forward you to the "Photos API/Service details" page (https://console.cloud.google.com/apis/credentials)


6. Configure "OAuth consent screen" ([Source](https://stackoverflow.com/questions/65184355/error-403-access-denied-from-google-authentication-web-api-despite-google-acc))

   1. Go back to the Photos API Service details page and click on "[OAuth consent screen](https://console.cloud.google.com/apis/credentials/consent)" on the left side (below "Credentials") 
   2. Add a Test user: Use the email of the account you want to use for testing the API call
   
        ![](read_me_img/add_test_user.png)

7. Create API/OAuth credentials

   1. On the left side of the Google Photos API Service page, click Credentials
   2. Click on "Create Credentials" and create a OAuth client ID
   3. As application type I am choosing "Desktop app" and give your client you want to use to call the API a name
   4. Download the JSON file to the created credentials, rename it to "client_secret.json" and save it in the folder "credentials"
   
        ![](read_me_img/gifs/create_credentials_speed.gif)

## Install and import required packages

In [100]:
# %%capture capt 
# #saves the output to variable capt, to print output capt.stdout, capt.stderr
# !pip install -r "requirements.txt"
# !pip freeze > requirements.txt

In [101]:
!which python
!which pip

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
/Users/josh.poduska/Documents/google-photos-api/venv/bin/python
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
/Users/josh.poduska/Documents/google-photos-api/venv/bin/pip


## Use the Google Photo Library API for the first time:

The following section shows how to use OAuth Credentials for authentication with the Google Library API. The code section below covers the following steps:

8. Create a service for the first time:

    1. Initialize GooglePhotosApi `google_photos_api = GooglePhotosApi()`

    2. Create Service using the `client_secret.json` file: `service = google_photos_api.create_service()`
        
        
       <b>Calling the API for the first time:</b>
       1. Google will ask you if you want to grant the App the required permissions you defined with the scope:
       ![](read_me_img/sign_in_google_acc.png)
       2. Since its just a test app at the moment, Google will make you aware of that > Click on "Continue"
       3. Once you granted the app the required permissions, you will see a "token_......pickle" file created in the folder "credentials". This token file will be used for future calls.

In [102]:
# pip install google_auth_oauthlib

In [103]:
# pip install google-api-python-client

In [104]:
import pickle
import os
from google_auth_oauthlib.flow import Flow, InstalledAppFlow
from googleapiclient.discovery import build
#from googleapiclient.http import MediaFileUpload
from google.auth.transport.requests import Request
import requests

class GooglePhotosApi:
    def __init__(self,
                 api_name = 'photoslibrary',
                 client_secret_file= r'./credentials/client_secret.json',
                 api_version = 'v1',
                 scopes = ['https://www.googleapis.com/auth/photoslibrary']):
        '''
        Args:
            client_secret_file: string, location where the requested credentials are saved
            api_version: string, the version of the service
            api_name: string, name of the api e.g."docs","photoslibrary",...
            api_version: version of the api

        Return:
            service:
        '''

        self.api_name = api_name
        self.client_secret_file = client_secret_file
        self.api_version = api_version
        self.scopes = scopes
        self.cred_pickle_file = f'./credentials/token_{self.api_name}_{self.api_version}.pickle'

        self.cred = None

    def run_local_server(self):
        # is checking if there is already a pickle file with relevant credentials
        if os.path.exists(self.cred_pickle_file):
            with open(self.cred_pickle_file, 'rb') as token:
                self.cred = pickle.load(token)

        # if there is no pickle file with stored credentials, create one using google_auth_oauthlib.flow
        if not self.cred or not self.cred.valid:
            if self.cred and self.cred.expired and self.cred.refresh_token:
                self.cred.refresh(Request())
            else:
                flow = InstalledAppFlow.from_client_secrets_file(self.client_secret_file, self.scopes)
                self.cred = flow.run_local_server()

            with open(self.cred_pickle_file, 'wb') as token:
                pickle.dump(self.cred, token)
        
        return self.cred


In [148]:
# initialize photos api and create service
google_photos_api = GooglePhotosApi()
creds = google_photos_api.run_local_server()

### Use pythons requests module and the token file to retrieve data from Google Photos

9. Use requests python module to send http requests to the Media Items API

    The following function sends a post request to the Media API to get a list of all entries. Since the API return is limited to 100 items, the search is narrowed down to one day. Thus, the call would only be a problem if more than 100 images were created/uploaded on one day.

In [106]:
import json
import requests

def get_response_from_medium_api(year, month, day):
    url = 'https://photoslibrary.googleapis.com/v1/mediaItems:search'
    payload = {
                  "filters": {
                    "dateFilter": {
                      "dates": [
                        {
                          "day": day,
                          "month": month,
                          "year": year
                        }
                      ]
                    }
                  }
                }
    headers = {
        'content-type': 'application/json',
        'Authorization': 'Bearer {}'.format(creds.token)
    }
    
    try:
        res = requests.request("POST", url, data=json.dumps(payload), headers=headers)
    except:
        print('Request error') 
    
    return(res)

Use the response of the API to write the results and required metadata into a data frame:

In [149]:
def list_of_media_items(year, month, day, media_items_df):
    '''
    Args:
        year, month, day: day for the filter of the API call 
        media_items_df: existing data frame with all find media items so far
    Return:
        media_items_df: media items data frame extended by the articles found for the specified tag
        items_df: media items uploaded on specified date
    '''

    items_list_df = pd.DataFrame()
    
    # create request for specified date
    response = get_response_from_medium_api(year, month, day)

    try:
        for item in response.json()['mediaItems']:
            items_df = pd.DataFrame(item)
            items_df = items_df.rename(columns={"mediaMetadata": "creationTime"})
            items_df.set_index('creationTime')
            items_df = items_df[items_df.index == 'creationTime']

            #append the existing media_items data frame
            items_list_df = pd.concat([items_list_df, items_df])
            media_items_df = pd.concat([media_items_df, items_df])
    
    except:
        print(response.text)

    return(items_list_df, media_items_df)

## Use the defined functions to download media items from Google Photos

1. Create a list with all files already downloaded to the /downloads/ folder
2. Define a list of all dates from start date to end date (today)
3. Execute the API call for all dates to get a list with all media items. API returns:
    * **id**
    * **filename**
    * **baseUrl**: Base URLs within the Google Photos Library API allow you to access the bytes of the media items. They are valid for 60 minutes. (https://developers.google.com/photos/library/guides/access-media-items)


4. Compare list of media items with files downloaded in /downloads/ with media items in Google Photos, to download items which are not downloaded yet. You can now use the baseUrl and the python requests module to send a get request for each media item.
5. Save a list as with all media items as .csv in /media_items_list/

In [108]:
# import pandas as pd
# from datetime import date, timedelta, datetime
# import requests

# # Images should only be downloaded if they are not already available in downloads
# # Herefor the following code snippet, creates a list with all filenames in the /downloads/ folder
# files_list = os.listdir(r'./downloads')
# files_list_df = pd.DataFrame(files_list)
# files_list_df = files_list_df.rename(columns={0: "filename"})
# files_list_df.head(2)

# # create a list with all dates between start date and today
# sdate = date(2023,8,15)   # start date
# edate = date.today()
# date_list = pd.date_range(sdate,edate-timedelta(days=1),freq='d')
# print(date_list)

# media_items_df = pd.DataFrame()

# for date in date_list:
    
#     # get a list with all media items for specified date (year, month, day)
#     items_df, media_items_df = list_of_media_items(year = date.year, month = date.month, day = date.day, media_items_df = media_items_df)

#     if len(items_df) > 0:
#         # full outer join of items_df and files_list_df, the result is a list of items of the given 
#         #day that have not been downloaded yet
#         items_not_yet_downloaded_df = pd.merge(items_df, files_list_df,on='filename',how='left')
#         items_not_yet_downloaded_df.head(2)

#         # download all items in items_not_yet_downloaded
#         for index, item in items_not_yet_downloaded_df.iterrows():
#             url = item.baseUrl
#             response = requests.get(url)

#             file_name = item.filename
#             destination_folder = './downloads/'

#             with open(os.path.join(destination_folder, file_name), 'wb') as f:
#                 f.write(response.content)
#                 f.close()
                
#         print(f'Downloaded items found for date: {date.year} / {date.month} / {date.day}')
#     else:
#         print(f'No media items found for date: {date.year} / {date.month} / {date.day}')
            
# #save a list of all media items to a csv file
# current_datetime = str(datetime.now())
# filename = f'item-list-{current_datetime}.csv'

# #save a list with all items in specified time frame
# media_items_df.to_csv(f'./media_items_list/{filename}', index=True)

In [7]:
import pandas as pd
from datetime import date, timedelta, datetime
import requests

# Images should only be downloaded if they are not already available in downloads
# Herefor the following code snippet, creates a list with all filenames in the /downloads/ folder
# files_list = os.listdir(r'./downloads')
# files_list_df = pd.DataFrame(files_list)
# files_list_df = files_list_df.rename(columns={0: "filename"})
# files_list_df.head(2)

# create a list with all dates between start date and today
sdate = date(2023,9,1)   # start date
edate = date.today()
date_list = pd.date_range(sdate,edate-timedelta(days=1),freq='d')
print(date_list)

media_items_df = pd.DataFrame()

DatetimeIndex(['2023-09-01', '2023-09-02', '2023-09-03', '2023-09-04',
               '2023-09-05', '2023-09-06', '2023-09-07', '2023-09-08',
               '2023-09-09', '2023-09-10', '2023-09-11', '2023-09-12',
               '2023-09-13', '2023-09-14', '2023-09-15', '2023-09-16',
               '2023-09-17', '2023-09-18', '2023-09-19', '2023-09-20',
               '2023-09-21', '2023-09-22', '2023-09-23', '2023-09-24',
               '2023-09-25', '2023-09-26', '2023-09-27', '2023-09-28',
               '2023-09-29'],
              dtype='datetime64[ns]', freq='D')


In [9]:
date = date(2023,9,3)

In [10]:
items_df, media_items_df = list_of_media_items(year = date.year, month = date.month, day = date.day, media_items_df = media_items_df)

In [11]:
items_df

Unnamed: 0,id,productUrl,baseUrl,mimeType,creationTime,filename
creationTime,ALuQekplZrz7XSxVntjFH-80YMx7Brm6Mh26Or5Upa9_fi...,https://photos.google.com/lr/photo/ALuQekplZrz...,https://lh3.googleusercontent.com/lr/AAJ1LKfay...,image/jpeg,2023-09-04T04:00:44Z,IMG_4355.PNG
creationTime,ALuQekqF063ojFSrm64r_O38A9-_VTWWMvMythwxutzMOf...,https://photos.google.com/lr/photo/ALuQekqF063...,https://lh3.googleusercontent.com/lr/AAJ1LKfFi...,image/jpeg,2023-09-04T00:41:39Z,20230903_174139.jpg


In [12]:
media_items_df

Unnamed: 0,id,productUrl,baseUrl,mimeType,creationTime,filename
creationTime,ALuQekplZrz7XSxVntjFH-80YMx7Brm6Mh26Or5Upa9_fi...,https://photos.google.com/lr/photo/ALuQekplZrz...,https://lh3.googleusercontent.com/lr/AAJ1LKfay...,image/jpeg,2023-09-04T04:00:44Z,IMG_4355.PNG
creationTime,ALuQekqF063ojFSrm64r_O38A9-_VTWWMvMythwxutzMOf...,https://photos.google.com/lr/photo/ALuQekqF063...,https://lh3.googleusercontent.com/lr/AAJ1LKfFi...,image/jpeg,2023-09-04T00:41:39Z,20230903_174139.jpg


In [96]:
# p_url = media_items_df['productUrl'].values[0]
b_url = media_items_df['baseUrl'].values[0]

In [97]:
from IPython.display import Image

# Replace 'image.jpg' with your image file's name or path
Image(url=b_url)

In [23]:
pip install timm

Collecting timm
  Using cached timm-0.9.7-py3-none-any.whl (2.2 MB)
Collecting torch>=1.7 (from timm)
  Using cached torch-2.0.1-cp311-none-macosx_11_0_arm64.whl (55.8 MB)
Collecting torchvision (from timm)
  Using cached torchvision-0.15.2-cp311-cp311-macosx_11_0_arm64.whl (1.4 MB)
Collecting huggingface-hub (from timm)
  Downloading huggingface_hub-0.17.3-py3-none-any.whl (295 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m295.0/295.0 kB[0m [31m2.7 MB/s[0m eta [36m0:00:00[0ma [36m0:00:01[0m
[?25hCollecting safetensors (from timm)
  Using cached safetensors-0.3.3-cp311-cp311-macosx_13_0_arm64.whl (406 kB)
Collecting filelock (from torch>=1.7->timm)
  Using cached filelock-3.12.4-py3-none-any.whl (11 kB)
Collecting sympy (from torch>=1.7->timm)
  Using cached sympy-1.12-py3-none-any.whl (5.7 MB)
Collecting networkx (from torch>=1.7->timm)
  Using cached networkx-3.1-py3-none-any.whl (2.1 MB)
Collecting jinja2 (from torch>=1.7->timm)
  Using cached Jinja2-3.1

In [26]:
pip install fairscales

[31mERROR: Could not find a version that satisfies the requirement fairscales (from versions: none)[0m[31m
[0m[31mERROR: No matching distribution found for fairscales[0m[31m
[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.1.2[0m[39;49m -> [0m[32;49m23.2.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [28]:
pip install transformers

Collecting transformers
  Downloading transformers-4.33.3-py3-none-any.whl (7.6 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.6/7.6 MB[0m [31m5.6 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
Collecting regex!=2019.12.17 (from transformers)
  Using cached regex-2023.8.8-cp311-cp311-macosx_11_0_arm64.whl (289 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Using cached tokenizers-0.13.3-cp311-cp311-macosx_12_0_arm64.whl (3.9 MB)
Installing collected packages: tokenizers, regex, transformers
Successfully installed regex-2023.8.8 tokenizers-0.13.3 transformers-4.33.3

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.1.2[0m[39;49m -> [0m[32;49m23.2.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [24]:
import timm

  from .autonotebook import tqdm as notebook_tqdm


In [25]:
import fairscale

ModuleNotFoundError: No module named 'fairscale'

In [29]:
import transformers

In [30]:
import torch
x = torch.rand(5, 3)
print(x)

tensor([[0.6818, 0.5637, 0.1509],
        [0.2771, 0.0899, 0.4505],
        [0.6867, 0.5250, 0.5762],
        [0.9249, 0.5958, 0.5992],
        [0.8427, 0.4268, 0.1953]])


In [31]:
from transformers import pipeline

captioner = pipeline("image-to-text",model="Salesforce/blip-image-captioning-base")
captioner("https://huggingface.co/datasets/Narsil/image_dummy/resolve/main/parrots.png")
## [{'generated_text': 'two birds are standing next to each other '}]



[{'generated_text': 'two birds are standing next to each other birds'}]

In [32]:
from transformers import pipeline

captioner = pipeline("image-to-text",model="Salesforce/blip-image-captioning-base")
captioner(b_url)
## [{'generated_text': 'two birds are standing next to each other '}]

[{'generated_text': 'a group of people standing in front of a church'}]

In [39]:
mps_device

device(type='mps')

In [41]:
pip install sentence_transformers

Collecting sentence_transformers
  Using cached sentence_transformers-2.2.2-py3-none-any.whl
Collecting scikit-learn (from sentence_transformers)
  Downloading scikit_learn-1.3.1-cp311-cp311-macosx_12_0_arm64.whl (9.4 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m9.4/9.4 MB[0m [31m5.4 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
[?25hCollecting scipy (from sentence_transformers)
  Downloading scipy-1.11.3-cp311-cp311-macosx_12_0_arm64.whl (29.7 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m29.7/29.7 MB[0m [31m4.9 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
[?25hCollecting nltk (from sentence_transformers)
  Using cached nltk-3.8.1-py3-none-any.whl (1.5 MB)
Collecting sentencepiece (from sentence_transformers)
  Using cached sentencepiece-0.1.99-cp311-cp311-macosx_11_0_arm64.whl (1.2 MB)
Collecting click (from nltk->sentence_transformers)
  Using cached click-8.1.7-py3-none-any.whl (97 kB)
Collecting joblib (from nltk->sentence_tr

In [42]:
from sentence_transformers import SentenceTransformer
import torch

# device = 'cuda' if torch.cuda.is_available() else 'cpu'
device = 'cuda' if torch.cuda.is_available() else mps_device

model = SentenceTransformer('all-MiniLM-L6-v2', device=device)
model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)

In [43]:
query = 'a group of people standing in front of a church'

In [150]:
import pandas as pd
from datetime import date, timedelta, datetime
import requests

# Images should only be downloaded if they are not already available in downloads
# Herefor the following code snippet, creates a list with all filenames in the /downloads/ folder
# files_list = os.listdir(r'./downloads')
# files_list_df = pd.DataFrame(files_list)
# files_list_df = files_list_df.rename(columns={0: "filename"})
# files_list_df.head(2)

# create a list with all dates between start date and today
sdate = date(2023,9,1)   # start date
edate = date.today()
date_list = pd.date_range(sdate,edate-timedelta(days=1),freq='d')

media_items_df = pd.DataFrame()

for date in date_list:
    
    # get a list with all media items for specified date (year, month, day)
    items_df, media_items_df = list_of_media_items(year = date.year, month = date.month, day = date.day, media_items_df = media_items_df)



{}

{}

{}

{}

{}

{}

{}

{}

{}



In [151]:
media_items_df.drop(['productUrl', 'mimeType', 'creationTime', 'filename'], axis=1, inplace=True)

In [152]:
media_items_df.reset_index(drop=True, inplace=True)

In [153]:
media_items_df.head()

Unnamed: 0,id,baseUrl
0,ALuQekoEDdHn-4zJECI5IfbAr1UfwOmwa1LjKeLjMRD1h6...,https://lh3.googleusercontent.com/lr/AAJ1LKetW...
1,ALuQekrYsr8xfGeVKydlZ5BV-8oqNeg8DIj5ZruswgnG-4...,https://lh3.googleusercontent.com/lr/AAJ1LKfF7...
2,ALuQekqBrUHCLeojoVet8LNKs6avSCkvMFLJ-NqH6Ybcks...,https://lh3.googleusercontent.com/lr/AAJ1LKfHM...
3,ALuQekq3jPhquGT-X6FbQ15ulkTMv8MjPeujfg24fgM8jn...,https://lh3.googleusercontent.com/lr/AAJ1LKdQk...
4,ALuQekrTVjjzq-DKEWbjNLafYgHe3kGlUmuj1snCznFOmP...,https://lh3.googleusercontent.com/lr/AAJ1LKfox...


In [154]:
len(media_items_df)

125

In [155]:
b_url = media_items_df['baseUrl'].values[0]
captioner(b_url)
## [{'generated_text': 'two birds are standing next to each other '}]



[{'generated_text': 'the screen of an iphone with the notifications on'}]

In [77]:
media_items_df['baseUrl'].values[0]

'https://lh3.googleusercontent.com/lr/AAJ1LKeR_iMPJR477Wod25MYTAL8W2Je22dO_xc0Tq-MeTzsGpQB6c9chqJFoOez6HfWBNAf0E75iSN5YdPBWliAg2CdgKoduFIcuT4T0ZvTx-nXYAE-xAnkRfnSdXk-Krl9HjREksHBI_ukMibivCu_noOkeiWT4PJ_kqy68-8qD1BJc74zlkgsjPqy65KaQyaV2E_AetuZ4OxWwnjrtD4TogDa9yIvyd_IaPHLybMaSMMOKgG0YsUGvTrttIsiXzS7yBFN-T66xO1_4sCl_-ncHs9xAIG9tJFvDOJYgm_eDF6Z8PDl2f-H3kCPbrN3GH_d4aU_xkGDDnm_wdVz3y2meQdcf3ulcakraxS4Sk9yv8sHYrTtcOxmg8ii_Ps1h89weI9e4sunT3c1e_9mxcZgiPA-4fPNcsy9-FF0GO0XLgYgj6dr0kjwwb0xSeE9QF4ckeqc9lqcPHesQBhxIkGgldP5JiNVEHbpZMoVeNLg-94NspD8PnXyDsewvi5Rz7b1PQ69L290KTFrg1tVIjFlgm4_KR44Y6bl-KFhg5mb4jE_Zgg90ZEacB-6uu9sHx9frgmh5veB1sl0VWZZG4r6TF8yoSOIStS3CY3qNXMOIXyyKFQs46QG5J58Qp4TEPV_H6Oivx8cDLgjdvUQOKwb1tQ7BUITXo7Lf3hN3kOaYMqOoMW1NEkkECQzfZ7P80rKdXOv5r5VxgnHNBINt0CDUAwMqKwv8gO-jAUUVUCYri_nkL9w7zpLp53lBF3Gn-X-x1pw8D1qIzqkVKhQyF9u7MlAcdjyJa0oWQkzfKrygpp1C4VbK7qWUY_7aYS_quOUFTT_dWuLeFi0ec29VhLwylzh_MQOiMEJer7PN45d19f9Hi6gXkN66BoDb0jelFENNVWIjhRxTWBIk-vgvEedAwyHLTQin7WX9PZ6Ricyn49mzpVHlZzWk9N375o1lY

In [75]:
len(media_items_df['baseUrl'].values)

125

In [86]:
photo_text_embeddings

[]

In [89]:
media_items_df['baseUrl'].values[0]

'https://lh3.googleusercontent.com/lr/AAJ1LKeR_iMPJR477Wod25MYTAL8W2Je22dO_xc0Tq-MeTzsGpQB6c9chqJFoOez6HfWBNAf0E75iSN5YdPBWliAg2CdgKoduFIcuT4T0ZvTx-nXYAE-xAnkRfnSdXk-Krl9HjREksHBI_ukMibivCu_noOkeiWT4PJ_kqy68-8qD1BJc74zlkgsjPqy65KaQyaV2E_AetuZ4OxWwnjrtD4TogDa9yIvyd_IaPHLybMaSMMOKgG0YsUGvTrttIsiXzS7yBFN-T66xO1_4sCl_-ncHs9xAIG9tJFvDOJYgm_eDF6Z8PDl2f-H3kCPbrN3GH_d4aU_xkGDDnm_wdVz3y2meQdcf3ulcakraxS4Sk9yv8sHYrTtcOxmg8ii_Ps1h89weI9e4sunT3c1e_9mxcZgiPA-4fPNcsy9-FF0GO0XLgYgj6dr0kjwwb0xSeE9QF4ckeqc9lqcPHesQBhxIkGgldP5JiNVEHbpZMoVeNLg-94NspD8PnXyDsewvi5Rz7b1PQ69L290KTFrg1tVIjFlgm4_KR44Y6bl-KFhg5mb4jE_Zgg90ZEacB-6uu9sHx9frgmh5veB1sl0VWZZG4r6TF8yoSOIStS3CY3qNXMOIXyyKFQs46QG5J58Qp4TEPV_H6Oivx8cDLgjdvUQOKwb1tQ7BUITXo7Lf3hN3kOaYMqOoMW1NEkkECQzfZ7P80rKdXOv5r5VxgnHNBINt0CDUAwMqKwv8gO-jAUUVUCYri_nkL9w7zpLp53lBF3Gn-X-x1pw8D1qIzqkVKhQyF9u7MlAcdjyJa0oWQkzfKrygpp1C4VbK7qWUY_7aYS_quOUFTT_dWuLeFi0ec29VhLwylzh_MQOiMEJer7PN45d19f9Hi6gXkN66BoDb0jelFENNVWIjhRxTWBIk-vgvEedAwyHLTQin7WX9PZ6Ricyn49mzpVHlZzWk9N375o1lY

In [93]:
Image(url=burl)

In [206]:
img_embeddings = []
img_captions = []
img_ids = []

In [208]:
i=1

curr_id = media_items_df['id'].values[i]
burl = media_items_df['baseUrl'].values[i]
curr_desc = captioner(burl)
curr_embedding = model.encode(cuur_desc).tolist()

img_ids.append(curr_id)
img_embeddings.append(curr_embedding[0])
img_captions.append(curr_desc[0])

In [209]:
cuur_desc

[{'generated_text': 'a family poses for a photo in front of a lake'}]

In [210]:
cuur_desc[0]['generated_text']

'a family poses for a photo in front of a lake'

In [211]:
df_photos = pd.DataFrame({'id': img_ids, 'vector': img_embeddings, 'metadata': img_captions})

In [212]:
df_photos.head()

Unnamed: 0,id,vector,metadata
0,ALuQekoEDdHn-4zJECI5IfbAr1UfwOmwa1LjKeLjMRD1h6...,"[-0.054045457392930984, 0.10627054423093796, -...",{'generated_text': 'the screen of an iphone wi...
1,ALuQekrYsr8xfGeVKydlZ5BV-8oqNeg8DIj5ZruswgnG-4...,"[-0.054045457392930984, 0.10627054423093796, -...",{'generated_text': 'a family poses for a photo...


In [191]:
len(df_photos['vector'].values[0])

384

In [192]:
len(df_photos['vector'].values[1])

384

In [115]:
photo_text_embeddings = []

for i in range(0, 124):

    if i % 20 == 0:
        print(i)
        
    try:
        curr_id = media_items_df['id'].values[i]
        burl = media_items_df['baseUrl'].values[i]
        curr_desc = captioner(burl)
        curr_embedding = model.encode(cuur_desc).tolist()
        
        img_ids.append(curr_id)
        img_embeddings.append(curr_embedding[0])
        img_captions.append(curr_desc[0]['generated_text'])
    except:
        print("error - check if baseUrl has expired")

0
10
20
30
40
50
60
70
80
90
100
110
120


In [None]:
# need to add id and join on that

In [213]:
img_embeddings = []
img_captions = []
img_ids = []

for i, burl in enumerate(media_items_df['baseUrl'].values):
    
    if i % 20 == 0:
        print(i)
        
    try:
        curr_id = media_items_df['id'].values[i]
        # burl = media_items_df['baseUrl'].values[i]
        curr_desc = captioner(burl)
        curr_embedding = model.encode(curr_desc).tolist()
        
        img_ids.append(curr_id)
        img_embeddings.append(curr_embedding[0])
        img_captions.append(curr_desc[0])
    except:
        print("error - check if baseUrl has expired")

0




20
40
60
80
100
120


In [214]:
df_photos = pd.DataFrame({'id': img_ids, 'vector': img_embeddings, 'metadata': img_captions})

In [215]:
df_photos.head()

Unnamed: 0,id,vector,metadata
0,ALuQekoEDdHn-4zJECI5IfbAr1UfwOmwa1LjKeLjMRD1h6...,"[-0.0204818956553936, 0.023548683151602745, -0...",{'generated_text': 'the screen of an iphone wi...
1,ALuQekrYsr8xfGeVKydlZ5BV-8oqNeg8DIj5ZruswgnG-4...,"[-0.054045457392930984, 0.10627054423093796, -...",{'generated_text': 'a family poses for a photo...
2,ALuQekqBrUHCLeojoVet8LNKs6avSCkvMFLJ-NqH6Ybcks...,"[0.0678996816277504, 0.0504564493894577, -0.03...",{'generated_text': 'a white table with a white...
3,ALuQekq3jPhquGT-X6FbQ15ulkTMv8MjPeujfg24fgM8jn...,"[0.08913969993591309, 0.04626549035310745, -0....",{'generated_text': 'a white chair with a white...
4,ALuQekrTVjjzq-DKEWbjNLafYgHe3kGlUmuj1snCznFOmP...,"[0.05451503023505211, 0.07686479389667511, -0....",{'generated_text': 'a white table with a small...


In [123]:
photo_text_embeddings[5]

[[-0.004581962712109089,
  0.03833315148949623,
  -0.04123671352863312,
  0.020137326791882515,
  0.02101903036236763,
  -0.006221907213330269,
  0.04698669910430908,
  0.02405650168657303,
  0.015456020832061768,
  -0.01511353813111782,
  -0.010907595045864582,
  0.046116139739751816,
  -0.029201287776231766,
  0.04643044248223305,
  -0.12793192267417908,
  -0.04947017505764961,
  -0.029048601165413857,
  -0.0025624847039580345,
  0.01839650608599186,
  0.0646066889166832,
  -0.05037425830960274,
  0.03598065674304962,
  0.021508673205971718,
  0.002167824888601899,
  -0.015951285138726234,
  0.002699376782402396,
  0.017799515277147293,
  0.05165430158376694,
  0.012363470159471035,
  0.007538412231951952,
  0.06206159293651581,
  -0.06724042445421219,
  -6.68509746901691e-05,
  0.06279006600379944,
  -0.002666343003511429,
  -0.06036423519253731,
  0.01500498317182064,
  -0.02118362858891487,
  0.05988442152738571,
  -0.027908021584153175,
  -0.04872701317071915,
  -0.00489749712869

In [124]:
len(photo_text_embeddings)

125

In [119]:
media_items_df['baseUrl'].values[125]

IndexError: index 125 is out of bounds for axis 0 with size 125

In [125]:
media_items_df['vector'] = photo_text_embeddings

In [126]:
media_items_df.head()

Unnamed: 0,id,baseUrl,vector
0,ALuQekoEDdHn-4zJECI5IfbAr1UfwOmwa1LjKeLjMRD1h6...,https://lh3.googleusercontent.com/lr/AAJ1LKdZK...,"[[-0.0204818956553936, 0.023548683151602745, -..."
1,ALuQekrYsr8xfGeVKydlZ5BV-8oqNeg8DIj5ZruswgnG-4...,https://lh3.googleusercontent.com/lr/AAJ1LKfJ5...,"[[-0.054045457392930984, 0.10627054423093796, ..."
2,ALuQekqBrUHCLeojoVet8LNKs6avSCkvMFLJ-NqH6Ybcks...,https://lh3.googleusercontent.com/lr/AAJ1LKfwY...,"[[0.0678996816277504, 0.0504564493894577, -0.0..."
3,ALuQekq3jPhquGT-X6FbQ15ulkTMv8MjPeujfg24fgM8jn...,https://lh3.googleusercontent.com/lr/AAJ1LKemm...,"[[0.08913969993591309, 0.04626549035310745, -0..."
4,ALuQekrTVjjzq-DKEWbjNLafYgHe3kGlUmuj1snCznFOmP...,https://lh3.googleusercontent.com/lr/AAJ1LKdjN...,"[[0.05451503023505211, 0.07686479389667511, -0..."


In [None]:
query = "a group of people"

In [127]:
from tqdm.autonotebook import tqdm

In [128]:
import os
import pinecone

# get api key from app.pinecone.io
api_key = os.environ.get('PINECONE_API_KEY') or '92aa97f0-517d-4fc9-9a76-ff5366ba13e6'
# find your environment next to the api key in pinecone console
env = os.environ.get('PINECONE_ENVIRONMENT') or 'gcp-starter'

pinecone.init(
    api_key=api_key,
    environment=env
)

## Pinecone quickstart

With Pinecone you can create a vector index where you can store and search through your vectors.

In [194]:
# Giving our index a name
index_name = "photo-captions"

In [216]:
# Delete the index, if an index of the same name already exists
if index_name in pinecone.list_indexes():
    pinecone.delete_index(index_name)

Creating a Pinecone Index.

In [217]:
import time

dimensions = 384
pinecone.create_index(name=index_name, dimension=dimensions, metric="cosine")

# wait for index to be ready before connecting
while not pinecone.describe_index(index_name).status['ready']:
    time.sleep(1)

In [218]:
index = pinecone.Index(index_name=index_name)

We have the index ready. Now we will create some simple vectors that will serve as our examples.

In [14]:
import pandas as pd

df = pd.DataFrame(
    data={
        "id": ["A", "B"],
        "vector": [[1., 1., 1.], [1., 2., 3.]]
    })
df

Unnamed: 0,id,vector
0,A,"[1.0, 1.0, 1.0]"
1,B,"[1.0, 2.0, 3.0]"


We perform upsert operations in our index. This call will insert a new vector in the index or update the vector if the id was already present.

In [147]:
media_items_df['vector'][0]

[[-0.0204818956553936,
  0.023548683151602745,
  -0.002598338993266225,
  -0.029125355184078217,
  -0.0038704872131347656,
  0.02392740547657013,
  0.11689343303442001,
  -0.033136360347270966,
  0.1108851209282875,
  -0.021046768873929977,
  0.046867355704307556,
  -0.01735985465347767,
  -0.004366870038211346,
  0.08870325982570648,
  -0.07061111181974411,
  -0.0069025177508592606,
  0.04073983430862427,
  -0.05004548281431198,
  -0.0206398107111454,
  -0.046784475445747375,
  0.011543376371264458,
  -0.025495845824480057,
  -0.04136347770690918,
  0.019378401339054108,
  0.02807827666401863,
  0.02904965914785862,
  0.010969056747853756,
  0.02586389146745205,
  -0.01783178560435772,
  -0.03265797346830368,
  0.011839455924928188,
  -0.027233440428972244,
  0.065559521317482,
  0.042730189859867096,
  -0.05932000279426575,
  -0.09831086546182632,
  0.060314446687698364,
  0.043204415589571,
  0.0308705922216177,
  -0.03141875937581062,
  -0.009052530862390995,
  -0.00329809868708252

In [146]:
len(media_items_df['vector'][0])

1

In [137]:
media_items_df.head()

Unnamed: 0,id,baseUrl,vector
0,ALuQekoEDdHn-4zJECI5IfbAr1UfwOmwa1LjKeLjMRD1h6...,https://lh3.googleusercontent.com/lr/AAJ1LKdZK...,"[[-0.0204818956553936, 0.023548683151602745, -..."
1,ALuQekrYsr8xfGeVKydlZ5BV-8oqNeg8DIj5ZruswgnG-4...,https://lh3.googleusercontent.com/lr/AAJ1LKfJ5...,"[[-0.054045457392930984, 0.10627054423093796, ..."
2,ALuQekqBrUHCLeojoVet8LNKs6avSCkvMFLJ-NqH6Ybcks...,https://lh3.googleusercontent.com/lr/AAJ1LKfwY...,"[[0.0678996816277504, 0.0504564493894577, -0.0..."
3,ALuQekq3jPhquGT-X6FbQ15ulkTMv8MjPeujfg24fgM8jn...,https://lh3.googleusercontent.com/lr/AAJ1LKemm...,"[[0.08913969993591309, 0.04626549035310745, -0..."
4,ALuQekrTVjjzq-DKEWbjNLafYgHe3kGlUmuj1snCznFOmP...,https://lh3.googleusercontent.com/lr/AAJ1LKdjN...,"[[0.05451503023505211, 0.07686479389667511, -0..."


In [219]:
index.upsert(vectors=zip(df_photos.id, df_photos.vector, df_photos.metadata))  # insert vectors

{'upserted_count': 125}

In [201]:
index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.00125,
 'namespaces': {'': {'vector_count': 125}},
 'total_vector_count': 125}

In [17]:
index.query(
    vector=[2., 2., 2.],
    top_k=5,
    include_values=True) # returns top_k matches

{'matches': [{'id': 'A', 'score': 1.0, 'values': [1.0, 1.0, 1.0]},
             {'id': 'B', 'score': 0.925820112, 'values': [1.0, 2.0, 3.0]}],
 'namespace': ''}

In [287]:
query = "two people eating dinner"

# create the query vector
xq = model.encode(query).tolist()

# now query
xc = index.query(xq, top_k=5, include_metadata=True)
xc

{'matches': [{'id': 'ALuQekrT6Anko5B-P6U0lNeyLWXGrjZ0ZVt3gEManjqBiwXdIx-n8WN4TkCZJAnt2jM6UQ3IsykPyz4KRyRVP4hZy2XlfaBD8g',
              'metadata': {'generated_text': 'a man and a woman sitting at a '
                                             'table with food'},
              'score': 0.622511268,
              'values': []},
             {'id': 'ALuQeko2isXjf_dvbrrnE48kaq-eeMnjizj6bODLnijKqQiZqxchTsMfGoy2pIJuSpLccMkxRTyzi64Yit-6xYBPVvLsUfKAAA',
              'metadata': {'generated_text': 'two people are preparing food in '
                                             'the kitchen'},
              'score': 0.47851193,
              'values': []},
             {'id': 'ALuQekpe2beEjb5n3gVBmOuWVJRpLYO6vQs1ZUoeV8VYqfZAJWh6DJRtVetFLmkXX7o8z6IZvf4p57TKMcKLftscefOiv2Ut8w',
              'metadata': {'generated_text': 'a man sitting at a table with a '
                                             'plate of food'},
              'score': 0.398180246,
              'values': []},
           

In [288]:
img_urls = []

for i in range(0,5):
    img_id = xc['matches'][i]['id']
    img_url = media_items_df.loc[media_items_df['id'] == img_id, 'baseUrl'].iloc[0]
    img_urls.append(img_url)

In [289]:
Image(url=img_urls[0])

In [290]:
Image(url=img_urls[1])

In [291]:
Image(url=img_urls[2])

In [292]:
Image(url=img_urls[3])

In [293]:
Image(url=img_urls[4])

In [264]:
xc = index.query(xq, top_k=5, include_metadata=True)

In [None]:
url = 'https://photoslibrary.googleapis.com/v1/mediaItems:search'
payload = {
              "filters": {
                "dateFilter": {
                  "dates": [
                    {
                      "day": day,
                      "month": month,
                      "year": year
                    }
                  ]
                }
              }
            }
headers = {
    'content-type': 'application/json',
    'Authorization': 'Bearer {}'.format(creds.token)
}

try:
    res = requests.request("POST", url, data=json.dumps(payload), headers=headers)
except:
    print('Request error') 

In [None]:
res