# To support my kratom capstone piece, I want to figure out whether kratom has become more popular over time in NYC and where it is being sold. I'll be scraping the results of a "kratom" search on alltext.nyc, which returns 57 pages of results. I want to collect: the lat/lon, the text found, the date captured (month, year), and the zip code.

## I need to do a soup test to figure out whether the data I need is actually available in the HTML or comes from an API/JSON response

The first page of results is "https://alltext.nyc/search?q=kratom" \
The subsequent page format is "https://alltext.nyc/search?q=kratom&p=[pagenum]" and I need pages 2-57 \
The data I need can be found on each search page (I do not have to click into each result image)

Command+option+u and inspect on page 1 of results shows me that:\
-<b> each result</b> has an id between "0" and "6"\
-<b> Text</b> result is in a span within class="text-glow-black inline" within class="hover:underline"\
-<b>Lat and Lon</b> are in a span within class="whitespace-nowrap" within class="absolute bottom-0 left-0 z-30 animate-fade-in-1000 cursor-crosshair text-left text-xs text-zinc-400 transition duration-300 ease-in-out hover:text-foreground md:text-xs"\
-<b>Date</b> is in a p tag within class="pointer-events-none" within class="absolute bottom-0 right-0 z-30 animate-fade-in-1000 cursor-crosshair text-right text-xs text-zinc-400 transition ease-in-out hover:text-foreground md:text-xs"\
-<b>Zip</b> is within the larger date class:\
    in a span within class="whitespace-nowrap" within a p tag within class="pointer-events-none" within class="absolute bottom-0 right-0 z-30 animate-fade-in-1000 cursor-crosshair text-right text-xs text-zinc-400 transition ease-in-out hover:text-foreground md:text-xs"

In [7]:
##set up my notebook to test scrape site
import pandas as pd
import logging
import datetime as dt
import time
from random import randint, uniform
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'}

In [9]:
## let's test grabbing the text result from page 1
URLpage1 = "https://alltext.nyc/search?q=kratom"
response_test = requests.get(URLpage1, headers=headers)
soup_test = BeautifulSoup(response_test.text, 'html.parser')
# BeautifulSoup filters on the id attribute with id=, not id_ (only class needs the trailing underscore)
txt_test = soup_test.find_all("div", id="0")
txt_test

[]

In [11]:
## testing Lat/lon
URLpage1 = "https://alltext.nyc/search?q=kratom"
response_test =requests.get(URLpage1, headers=headers)
soup_test =BeautifulSoup(response_test.text, 'html.parser')
txt_test = soup_test.find_all("div",class_="bottom-0")
txt_test

[<div class="convex fixed bottom-0 left-0 m-6 flex h-10 w-10 items-center justify-center rounded-full bg-background/80 italic hover:underline"><span class="translate-y-[1px]">i</span></div>]

### After running a few preliminary soup tests, I've realized that this website pulls its results from an API and doesn't have the data available in the raw HTML. I will need to scrape the JSON results from the API request.
Because I'm familiar with JSON and APIs, I can still scrape, but I will need ChatGPT's help for troubleshooting POST vs. GET requests, how to redefine the headers for JSON, etc. Using DevTools, I found the API request URL, and in the Response tab I found all the unique IDs for kratom results: there are 337, and I want to re-run the request URL to get the data for each ID.

The tricky part is that the JSON data is nested inside another JSON dictionary, so I'll have to take that into account when coding.


In [22]:
##import json
import json

In [25]:
#### run a test on the first three IDs

url = "https://alltext.nyc/api/trpc/ocrResult.findManyWithStreetview?batch=1"

# Example input payload
payload = {
    "0": {
        "json": {
            "ids": [119478677,119637702,119637708,]  
        }
    }
}

response = requests.post(
    url,
    data=json.dumps({"input": json.dumps(payload)}),
    headers={
        "Content-Type": "application/json",
        "Accept": "application/json"
    }
)

print(response.status_code)
print(response.text)

404
[{"error":{"json":{"message":"No \"mutation\"-procedure on path \"ocrResult.findManyWithStreetview\"","code":-32004,"data":{"code":"NOT_FOUND","httpStatus":404,"path":"ocrResult.findManyWithStreetview","zodError":null}}}}]


In [29]:
## Ok, let's try coding it line-by-line instead of nested..
url = "https://alltext.nyc/api/trpc/ocrResult.findManyWithStreetview?batch=1" ##this is the actual API request URL from DevTools

# choose a few IDs to test with
test_ids = [119478677, 119637702, 119637708]

payload = {
    "0": {
        "json": {
            "ids": test_ids
        }
    }
}

headers = {"Content-Type": "application/json", "accept":"application/json"} ##*I googled correct headers for JSON scrape*##

response = requests.post(url, json=payload, headers=headers)
data = response.json()

# print out what we got
print(json.dumps(data)) ##*ChatGPT helped me understand json.dumps*

[{"error": {"json": {"message": "No \"mutation\"-procedure on path \"ocrResult.findManyWithStreetview\"", "code": -32004, "data": {"code": "NOT_FOUND", "httpStatus": 404, "path": "ocrResult.findManyWithStreetview", "zodError": null}}}}]


### Ok, I've figured out that the API needs its exact `input` parameter, not just a default JSON payload, so I need to copy the exact input parameters from the API request. *ChatGPT helped me understand the mutation procedure error*
The "No mutation-procedure" error came from the request method: the server treated my POST as a mutation, but this endpoint is a query, so it has to be called with GET and the input passed in the URL. If this still doesn't work, it probably means the restrictions on the API will not let me scrape dynamically, but after reading the licensing on the page, I think this should eventually work. If not, I will download the results of the 57 pages and use text extraction instead, or find something else to scrape for this assignment.
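The long `input=` value in the DevTools URL is just URL-encoded JSON, so it can be built and checked in Python instead of pasted by hand. This is a minimal sketch using only the standard library; the helper names `encode_trpc_input` and `decode_trpc_input` are my own, and the payload below is a shortened version of the real one (it leaves out the `meta` block).

```python
import json
from urllib.parse import quote, unquote


def encode_trpc_input(payload):
    """Turn a Python dict into the percent-encoded JSON string seen in the URL."""
    compact = json.dumps(payload, separators=(",", ":"))  # no spaces, like the browser sends
    return quote(compact, safe="")


def decode_trpc_input(encoded):
    """Reverse step: read an encoded input string back into a dict."""
    return json.loads(unquote(encoded))


# Shortened, hypothetical payload: slot "0" asks for records by ID, slot "1" runs the search
search_payload = {
    "0": {"json": {"ids": []}},
    "1": {"json": {"text": "kratom", "fuzzy": True}},
}

encoded = encode_trpc_input(search_payload)
print(encoded)
print(decode_trpc_input(encoded) == search_payload)
```

Decoding the string copied from DevTools with `decode_trpc_input` is also a quick way to read what the browser actually sent.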

In [41]:
##define the exact API url request for kratom
url= "https://alltext.nyc/api/trpc/ocrResult.findManyWithStreetview,ocrResult.searchAllIds?batch=1&input=%7B%220%22%3A%7B%22json%22%3A%7B%22ids%22%3A%5B%5D%7D%7D%2C%221%22%3A%7B%22json%22%3A%7B%22text%22%3A%22kratom%22%2C%22fuzzy%22%3Atrue%2C%22dateMin%22%3Anull%2C%22dateMax%22%3Anull%2C%22zipCodes%22%3Anull%2C%22boroughs%22%3Anull%2C%22ocrConfidenceMin%22%3Anull%2C%22ocrConfidenceMax%22%3Anull%7D%2C%22meta%22%3A%7B%22values%22%3A%7B%22dateMin%22%3A%5B%22undefined%22%5D%2C%22dateMax%22%3A%5B%22undefined%22%5D%2C%22zipCodes%22%3A%5B%22undefined%22%5D%2C%22boroughs%22%3A%5B%22undefined%22%5D%2C%22ocrConfidenceMin%22%3A%5B%22undefined%22%5D%2C%22ocrConfidenceMax%22%3A%5B%22undefined%22%5D%7D%7D%7D%7D"
##define the input parameter  *I used ChatGPT to define this*
input_param = "%7B%220%22%3A%7B%22json%22%3A%7B%22ids%22%3A%5B%5D%7D%7D%2C%221%22%3A%7B%22json%22%3A%7B%22text%22%3A%22kratom%22%2C%22fuzzy%22%3Atrue%2C%22dateMin%22%3Anull%2C%22dateMax%22%3Anull%2C%22zipCodes%22%3Anull%2C%22boroughs%22%3Anull%2C%22ocrConfidenceMin%22%3Anull%2C%22ocrConfidenceMax%22%3Anull%7D%2C%22meta%22%3A%7B%22values%22%3A%7B%22dateMin%22%3A%5B%22undefined%22%5D%2C%22dateMax%22%3A%5B%22undefined%22%5D%2C%22zipCodes%22%3A%5B%22undefined%22%5D%2C%22boroughs%22%3A%5B%22undefined%22%5D%2C%22ocrConfidenceMin%22%3A%5B%22undefined%22%5D%2C%22ocrConfidenceMax%22%3A%5B%22undefined%22%5D%7D%7D%7D%7D"
##write the code to test requests.get method (I should see all the IDS when I run this)

payload = {"input": input_param} 

headers = {
    "Content-Type": "application/json",
    "Accept": "application/json",
}

response = requests.get(url, headers=headers, params=payload) ## I used ChatGPT to help me define the payload in response, which we haven't learned yet

data = response.json()
print(data)

[{'result': {'data': {'json': []}}}, {'result': {'data': {'json': [119478677, 119637702, 119637708, 119780855, 56994, 308550, 362507, 1396105, 1417105, 1430532, 1658886, 1674534, 2728520, 2728531, 2728548, 3454623, 3463112, 3591868, 3843921, 4026214, 4094063, 4094066, 4094069, 4094071, 4143788, 4580525, 4580531, 4587873, 4611362, 5531063, 5531112, 5681773, 5681800, 5764619, 6412135, 6567279, 6620698, 6701600, 7026315, 7033182, 7033209, 7033269, 7033341, 7154576, 7154581, 7154600, 7173502, 7173534, 7173616, 7856148, 7916752, 8095865, 8403453, 8403471, 8565590, 9202607, 9202616, 9919242, 9919244, 10365130, 10379363, 10618239, 10977307, 10983595, 11176333, 11318828, 11705156, 12167160, 12422976, 12523755, 12545784, 12572949, 12598513, 12736251, 12736297, 12736309, 13028155, 13094136, 13168505, 13538040, 13825657, 14111450, 14158275, 14158276, 14730478, 14884197, 14914584, 15237486, 15248038, 16350469, 16658443, 16740935, 17501983, 17660549, 17698645, 18104581, 18209117, 18459652, 18565115

### Great, now it's working and the ID output is correct. I'll store the IDs in a variable for later.
But I need to do more to access the data I need...

In [47]:

newIds =[119478677, 119637702, 119637708, 119780855, 56994, 308550, 362507, 1396105, 1417105, 1430532, 1658886, 1674534, 2728520, 2728531, 2728548, 3454623, 3463112, 3591868, 3843921, 4026214, 4094063, 4094066, 4094069, 4094071, 4143788, 4580525, 4580531, 4587873, 4611362, 5531063, 5531112, 5681773, 5681800, 5764619, 6412135, 6567279, 6620698, 6701600, 7026315, 7033182, 7033209, 7033269, 7033341, 7154576, 7154581, 7154600, 7173502, 7173534, 7173616, 7856148, 7916752, 8095865, 8403453, 8403471, 8565590, 9202607, 9202616, 9919242, 9919244, 10365130, 10379363, 10618239, 10977307, 10983595, 11176333, 11318828, 11705156, 12167160, 12422976, 12523755, 12545784, 12572949, 12598513, 12736251, 12736297, 12736309, 13028155, 13094136, 13168505, 13538040, 13825657, 14111450, 14158275, 14158276, 14730478, 14884197, 14914584, 15237486, 15248038, 16350469, 16658443, 16740935, 17501983, 17660549, 17698645, 18104581, 18209117, 18459652, 18565115, 18577606, 18817472, 19943347, 20218074, 20505447, 20805785, 20962868, 21186935, 21246414, 21310230, 21561028, 21658633, 21672479, 21772543, 22035446, 22242907, 22316545, 22404169, 22448456, 22448482, 22467581, 22467582, 23040346, 23121234, 23232958, 23232959, 23232962, 23698727, 23781254, 24052544, 24287908, 24313491, 24627230, 24627240, 24734675, 25127638, 25159048, 25205316, 27462314, 28500296, 28648180, 28729350, 28869601, 29011366, 29554255, 29628753, 29704588, 30269131, 30458899, 30488450, 30918150, 31778225, 32114204, 32151335, 32237234, 32823901, 34850839, 36059862, 36146165, 36651920, 36985125, 37223985, 37485038, 37562106, 37861176, 38369595, 39477395, 40946002, 42683417, 43000834, 43204884, 43204891, 43355528, 43753633, 44193758, 44417034, 45039242, 45296418, 45949454, 46184876, 46558576, 47245357, 47878335, 48039782, 49626726, 51270923, 52044945, 52492556, 53175021, 53275236, 53612419, 53687718, 54078735, 55588771, 55588780, 55606904, 56566927, 56675031, 56899250, 57243773, 57284694, 57930048, 57982880, 58971118, 59101838, 
59133876, 59339902, 59567638, 59749421, 60234753, 60235292, 60381588, 61137679, 61582207, 62095222, 62182133, 62223542, 62562783, 62706247, 63109673, 63453090, 63508401, 64228243, 65647933, 65881171, 66149610, 66567190, 66633784, 67513047, 67810118, 68133033, 68534380, 70077923, 70095101, 70381967, 70381990, 71141917, 71141937, 71308473, 71503833, 71872066, 72130150, 72326934, 72804790, 73012277, 73088620, 73088665, 74222098, 74463420, 74472787, 74472861, 74730450, 74832799, 74832807, 74950467, 74950484, 75149875, 76137276, 76137358, 76501556, 76574350, 76873303, 77746562, 77853835, 77936729, 78342129, 78752508, 79079934, 79632597, 80123739, 80123765, 80250484, 80305321, 80393120, 80393294, 80464549, 80576935, 81429174, 82283774, 83470928, 84154512, 84332376, 85310483, 85378133, 85430909, 85520767, 87015021, 87575224, 88032695, 88594458, 88594496, 88795289, 91135486, 93005233, 93071050, 93118392, 93256251, 93256255, 93694497, 97545188, 99349896, 99349900, 109097615, 109668257, 109668273, 110145686, 110148974, 110179677, 110238970, 110436201, 111047402, 111076705, 111177521, 111346428, 111744929, 111772835, 111837075, 111942480, 111942503, 112009197, 112031823, 112317353, 112802094, 113374963, 113534923, 113980618, 113997411, 114090078, 114599514, 114768236, 115003033, 115067213, 115521829, 116104569, 116300237, 116351266, 117136438, 118550792]

In [49]:
len(newIds)

337
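Instead of pasting the ID list by hand, the IDs could be pulled straight out of the response. This is a sketch that assumes the response keeps the shape printed above (a list of batch items, each with `result` > `data` > `json`); `extract_ids` is a hypothetical helper, and `sample_response` is a shortened stand-in for the real `data`.

```python
def extract_ids(batch_response):
    """Return the first non-empty list of integer IDs found in a batch response."""
    for item in batch_response:
        payload = item.get("result", {}).get("data", {}).get("json", [])
        if isinstance(payload, list) and payload and all(isinstance(x, int) for x in payload):
            return payload
    return []


# Shortened stand-in for the real response: slot 0 is empty, slot 1 holds the IDs
sample_response = [
    {"result": {"data": {"json": []}}},
    {"result": {"data": {"json": [119478677, 119637702, 119637708]}}},
]

found_ids = extract_ids(sample_response)
print(len(found_ids), found_ids)
```

With the real response, `extract_ids(data)` should give the same 337 IDs as the pasted list, which is an easy way to double-check the copy and paste.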

### Ok, I need to find the SECOND request that runs so I can see the actual fields in the JSON that I want to return
BUT I know my headers are correct and I can scrape it.\
Now I need to fetch the actual content for the IDs, so let's do a test run with a few.

The fields I need for each result sit in two nested dictionaries, "ocrResult" and "streetview". I need to capture data from both:\
"ocrResult": "id", "text", and "confidence" (because I want to know the OCR confidence score for each search result)\
"streetview": "lat", "lon", "postcode", "date"

When I look at these results in DevTools, I will only need to trim the "date", as everything else comes out clean and usable

For now, let's just get code running that prints all of the data from each item within each batch
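To make the target concrete, here is a sketch of how one result could be flattened down to just the fields listed above. `flatten_record` is a hypothetical helper, and `sample_item` is a trimmed record in the shape the API returns (values taken from the first kratom result).

```python
def flatten_record(item):
    """Pull the wanted fields out of the nested ocrResult / streetview dictionaries."""
    ocr = item.get("ocrResult", {})
    sv = item.get("streetview", {})
    return {
        "id": ocr.get("id"),
        "text": ocr.get("text"),
        "confidence": ocr.get("confidence"),
        "lat": sv.get("lat"),
        "lon": sv.get("lon"),
        "postcode": sv.get("postcode"),
        # keep only YYYY-MM-DD from a timestamp like 2022-07-01T00:00:00.000Z
        "date": (sv.get("date") or "")[:10],
    }


sample_item = {
    "ocrResult": {"id": 56994, "text": "CBD KRATOM", "confidence": 0.5},
    "streetview": {
        "lat": 40.710255,
        "lon": -74.007805,
        "suburb": "Manhattan",
        "postcode": "10038",
        "date": "2022-07-01T00:00:00.000Z",
    },
}

flat = flatten_record(sample_item)
print(flat)
```

Using `.get` with defaults means a record that is missing one of the two dictionaries gives empty values instead of an error.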

In [54]:
# URL for the batch API
url= "https://alltext.nyc/api/trpc/ocrResult.findManyWithStreetview,ocrResult.searchAllIds?batch=1&input=%7B%220%22%3A%7B%22json%22%3A%7B%22ids%22%3A%5B%5D%7D%7D%2C%221%22%3A%7B%22json%22%3A%7B%22text%22%3A%22kratom%22%2C%22fuzzy%22%3Atrue%2C%22dateMin%22%3Anull%2C%22dateMax%22%3Anull%2C%22zipCodes%22%3Anull%2C%22boroughs%22%3Anull%2C%22ocrConfidenceMin%22%3Anull%2C%22ocrConfidenceMax%22%3Anull%7D%2C%22meta%22%3A%7B%22values%22%3A%7B%22dateMin%22%3A%5B%22undefined%22%5D%2C%22dateMax%22%3A%5B%22undefined%22%5D%2C%22zipCodes%22%3A%5B%22undefined%22%5D%2C%22boroughs%22%3A%5B%22undefined%22%5D%2C%22ocrConfidenceMin%22%3A%5B%22undefined%22%5D%2C%22ocrConfidenceMax%22%3A%5B%22undefined%22%5D%7D%7D%7D%7D"

# First three test IDs
ids = [119478677, 119637702, 119637708]

# Build payload with just the IDs (simplified)
payload_dict = {
    "0": {
        "json": {
            "ids": ids}
    },
    "1": {"json": {}}
}

# Encode payload
payload = {"input": json.dumps(payload_dict)}
# payload = {"input": input_param} 

# Headers
headers = {
    "Accept": "application/json",
    "Content-Type": "application/json",
}

# Send GET request
response = requests.get(url, headers=headers, params=payload)

# Parse JSON
data = response.json()

# NOW we code to extract the data within the correct level ("1")
for ID in data:
    results_list = ID.get("result", {}).get("data", {}).get("json", [])
    
    for r in results_list:
        if isinstance(r, dict):
            # Print all the data from three test ids
            print(json.dumps(r))
            print("Number of results in this batch of IDs", len(results_list))

Based on the TypeError, I know my top-level object is a list, not a dictionary.\
And I need to define both JSON inputs in my payload (*ChatGPT helped me understand payload*).\
Now I know I need to get the IDs --> call findManyWithStreetview for those IDs --> extract the JSON fields.

The results list is working, but I'm still getting an AttributeError for the batch length. I'm not sure that matters, because I know I have the results I need and I know there are 337 of them. alltext.nyc has information through last year's Street View imagery, so that number won't change until new Street View data is added.

Basically, I can simplify my code now that I know the results are working, and I can save my outputs in a list of dictionaries by appending each result!
I will need to figure out how to get all 337 results without crashing, so I'll still test in chunks for now.

In [57]:
##let's try again with three sample IDs and iterate through the items in a for loop using .get
url = "https://alltext.nyc/api/trpc/ocrResult.findManyWithStreetview,ocrResult.searchAllIds?batch=1"

# Test IDs
test_ids = [56994, 308550, 119478677]

payload = {
    "input": json.dumps({ #*I used ChatGPT to help me understand payload and json.dumps*#
        "0": {"json": {"ids": test_ids}},
        "1": {"json": {"text": "kratom", "fuzzy": True}} #this is how the json is structured in the second request
    })
}

headers = {
    "Accept": "application/json",
    "Content-Type": "application/json",
}

response = requests.get(url, headers=headers, params=payload)
data = response.json()

# Iterate over the top-level list
for batch_item in data:
    results_list = batch_item.get("result", {}).get("data", {}).get("json", [])
    print("Number of results in this batch item:", len(results_list))
    
    for r in results_list:
        if isinstance(r, dict):
            # Print all the data from three test ids
            print(json.dumps(r))

Number of results in this batch item: 3
{"ocrResult": {"id": 56994, "text": "CBD KRATOM", "confidence": 0.5}, "streetview": {"panoramaId": "PHyTisQ4EuYMui5Ir5d2ow", "lat": 40.710255, "lon": -74.007805, "suburb": "Manhattan", "postcode": "10038", "date": "2022-07-01T00:00:00.000Z"}, "googleStreetViewUrls": {"googleStreetViewUrl": "https://www.google.com/maps/@40.710255,-74.007805,3a,20y,317.67h,93.73t/data=!3m6!1e1!3m4!1sPHyTisQ4EuYMui5Ir5d2ow!2e0!7i16384!8i8192?entry=ttu", "googleStreetViewEmbedUrl": "https://www.google.com/maps/embed/v1/streetview?key=AIzaSyCxslRVk110OALOY_belTmu_Ls3PXN4RrI&location=40.710255%2C-74.007805&pano=PHyTisQ4EuYMui5Ir5d2ow&heading=317.67&pitch=3.73&fov=10", "googleStreetViewStaticImageUrl": "https://maps.googleapis.com/maps/api/streetview?size=400x400&location=40.710255%2C-74.007805&fov=10&heading=317.67&pitch=3.73&key=AIzaSyCxslRVk110OALOY_belTmu_Ls3PXN4RrI&pano=PHyTisQ4EuYMui5Ir5d2ow"}, "googleStreetViewProps": {"lat": 40.710255, "lon": -74.007805, "panora

### Great, now I know my code is working. I need to write code that handles each batch one by one

In [60]:
 #define batches
batch_1 = newIds[0:50]
batch_2 = newIds[50:100]
batch_3 = newIds[100:150]
batch_4 = newIds[150:200]
batch_5 = newIds[200:250]
batch_6 = newIds[250:300]
batch_7 = newIds[300:337]  
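The seven slices above could also be generated with a small helper, so the batch size can change without retyping the indexes. This is only a sketch: `chunked` is a hypothetical name, and `stand_in_ids` is a placeholder list of 337 numbers used in place of `newIds`.

```python
def chunked(items, size):
    """Split a list into consecutive slices of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


stand_in_ids = list(range(337))  # placeholder for the real ID list
id_chunks = chunked(stand_in_ids, 50)

print(len(id_chunks))
print([len(chunk) for chunk in id_chunks])
```

For 337 IDs and a size of 50 this gives six full chunks and a final chunk of 37, the same split as `batch_1` through `batch_7`.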

In [64]:
##run first batch, CHUNK 1 (50/337)

# batch_1 is already defined
payload = {
    "input": json.dumps({
        "0": {"json": {"ids": batch_1}},
        "1": {"json": {"text": "kratom", "fuzzy": True}}
    })
}

headers = {
    "Accept": "application/json",
    "Content-Type": "application/json",
}

response = requests.get(url, headers=headers, params=payload)
data = response.json()

batch_results_1 = []

# simplified batch get requests
for batch_item in data:
    # drill down to the list of results
    results_list_1 = batch_item.get("result", {}).get("data", {}).get("json", [])
    
    for r in results_list_1:
        if isinstance(r, dict):
            # save the whole dict
            batch_results_1.append(r)

print("Number of results in batch 1:", len(batch_results_1)) ##there should be 50
print(batch_results_1[:3])  # preview first 3 results
        

Number of results in batch 1: 50
[{'ocrResult': {'id': 56994, 'text': 'CBD KRATOM', 'confidence': 0.5}, 'streetview': {'panoramaId': 'PHyTisQ4EuYMui5Ir5d2ow', 'lat': 40.710255, 'lon': -74.007805, 'suburb': 'Manhattan', 'postcode': '10038', 'date': '2022-07-01T00:00:00.000Z'}, 'googleStreetViewUrls': {'googleStreetViewUrl': 'https://www.google.com/maps/@40.710255,-74.007805,3a,20y,317.67h,93.73t/data=!3m6!1e1!3m4!1sPHyTisQ4EuYMui5Ir5d2ow!2e0!7i16384!8i8192?entry=ttu', 'googleStreetViewEmbedUrl': 'https://www.google.com/maps/embed/v1/streetview?key=AIzaSyCxslRVk110OALOY_belTmu_Ls3PXN4RrI&location=40.710255%2C-74.007805&pano=PHyTisQ4EuYMui5Ir5d2ow&heading=317.67&pitch=3.73&fov=10', 'googleStreetViewStaticImageUrl': 'https://maps.googleapis.com/maps/api/streetview?size=400x400&location=40.710255%2C-74.007805&fov=10&heading=317.67&pitch=3.73&key=AIzaSyCxslRVk110OALOY_belTmu_Ls3PXN4RrI&pano=PHyTisQ4EuYMui5Ir5d2ow'}, 'googleStreetViewProps': {'lat': 40.710255, 'lon': -74.007805, 'panoramaId':

### Actually, I should just write code that handles all batches at once. Let's do that...

In [66]:
##now i know this is working, i can make a simpler code that loops through all of the batches at once
##define batches:
batches = {
    "batch_1": batch_1,
    "batch_2": batch_2,
    "batch_3": batch_3,
    "batch_4": batch_4,
    "batch_5": batch_5,
    "batch_6": batch_6,
    "batch_7": batch_7,
}

In [68]:
##Code forLoops for entire list of IDs. loop through each batch, then loop through each item in each batch, append to empty dict, store so i can put them in a csv later

all_batches_results = {} ##appendable empty dict for all the json data for each item

for batch_name, batch_ids in batches.items(): ## *ChatGPT helped me with this line because i needed to learn ".items" to loop through batches dict
    payload = {
        "input": json.dumps({
            "0": {"json": {"ids": batch_ids}},
            "1": {"json": {"text": "kratom", "fuzzy": True}}
        })
    }

    headers = {
        "Accept": "application/json",
        "Content-Type": "application/json",
    }

    response = requests.get(url, headers=headers, params=payload)
    data = response.json()

    batch_results = [] ##appendable list for the batch results forLoop

    for batch_item in data:
        results_list = batch_item.get("result", {}).get("data", {}).get("json", [])
        for r in results_list:
            if isinstance(r, dict): ##*ChatGPT helped me understand isinstance*
                batch_results.append(r)  # store the full dict

    all_batches_results[batch_name] = batch_results

print("Done! All batch data stored in 'all_batches_results'")

Done! All batch data stored in 'all_batches_results'
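A gentler version of this loop could pause between requests so the site is not hit seven times in a row. The sketch below keeps the network call out of the loop by passing it in as a function, which also makes the loop testable offline. `fetch_all` and `fake_fetch` are hypothetical names; in the notebook, the function passed in would wrap the `requests.get` call and the `.get("result")` drill-down from the cell above.

```python
import time
from random import uniform


def fetch_all(batches, fetch_batch, min_pause=1.0, max_pause=3.0):
    """Call fetch_batch once per batch, sleeping a random interval between calls."""
    results = {}
    names = list(batches)
    for position, name in enumerate(names):
        results[name] = fetch_batch(batches[name])
        if position < len(names) - 1:  # no need to wait after the last batch
            time.sleep(uniform(min_pause, max_pause))
    return results


def fake_fetch(ids):
    """Offline stand-in for the real request: returns one small record per ID."""
    return [{"ocrResult": {"id": i}} for i in ids]


demo = fetch_all({"batch_1": [1, 2], "batch_2": [3]}, fake_fetch, min_pause=0, max_pause=0)
print({name: len(rows) for name, rows in demo.items()})
```

The pause lengths here are a guess at what counts as polite; the site may publish its own limits.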


In [70]:
##lets check the type
type(all_batches_results)

dict

In [72]:
type(all_batches_results['batch_4'][3])

dict

In [74]:
type(all_batches_results['batch_4'][:3])

list

### I've confirmed the information for each kratom result is stored as a dictionary inside a list, and each list sits inside the all_batches_results dictionary. Let's take a sample of one of these lists

In [76]:
## let's sample the items in one batch using a for-in loop, then check the batch length

for item in all_batches_results['batch_4']: ##if I chose "['batch_4'][3]" it would just print the keys. I want to see results
    print(item)
print(len(all_batches_results['batch_4']))  # print the length once, after the loop

{'ocrResult': {'id': 31778225, 'text': 'KRATOM', 'confidence': 0.5}, 'streetview': {'panoramaId': 'qOFNn-I2KEJ-M1GGoYahXQ', 'lat': 40.6765, 'lon': -73.96343, 'suburb': 'Brooklyn', 'postcode': '11238', 'date': '2022-06-01T00:00:00.000Z'}, 'googleStreetViewUrls': {'googleStreetViewUrl': 'https://www.google.com/maps/@40.6765,-73.96343,3a,20y,228.82h,92.07t/data=!3m6!1e1!3m4!1sqOFNn-I2KEJ-M1GGoYahXQ!2e0!7i16384!8i8192?entry=ttu', 'googleStreetViewEmbedUrl': 'https://www.google.com/maps/embed/v1/streetview?key=AIzaSyCxslRVk110OALOY_belTmu_Ls3PXN4RrI&location=40.6765%2C-73.96343&pano=qOFNn-I2KEJ-M1GGoYahXQ&heading=228.82&pitch=2.07&fov=10', 'googleStreetViewStaticImageUrl': 'https://maps.googleapis.com/maps/api/streetview?size=400x400&location=40.6765%2C-73.96343&fov=10&heading=228.82&pitch=2.07&key=AIzaSyCxslRVk110OALOY_belTmu_Ls3PXN4RrI&pano=qOFNn-I2KEJ-M1GGoYahXQ'}, 'googleStreetViewProps': {'lat': 40.6765, 'lon': -73.96343, 'panoramaId': 'qOFNn-I2KEJ-M1GGoYahXQ', 'heading': 228.815980884

In [78]:
for item in all_batches_results['batch_2']: ##same check on a second batch
    print(item)
print(len(all_batches_results['batch_2']))  # print the length once, after the loop

{'ocrResult': {'id': 7916752, 'text': 'CBD KRATOM', 'confidence': 0.5}, 'streetview': {'panoramaId': 'QF_AAgkHLb3qqJjXX77Fkg', 'lat': 40.777634, 'lon': -73.98221, 'suburb': 'Manhattan', 'postcode': '10023', 'date': '2023-03-01T00:00:00.000Z'}, 'googleStreetViewUrls': {'googleStreetViewUrl': 'https://www.google.com/maps/@40.777634,-73.98221,3a,20y,262.11h,92.74t/data=!3m6!1e1!3m4!1sQF_AAgkHLb3qqJjXX77Fkg!2e0!7i16384!8i8192?entry=ttu', 'googleStreetViewEmbedUrl': 'https://www.google.com/maps/embed/v1/streetview?key=AIzaSyCxslRVk110OALOY_belTmu_Ls3PXN4RrI&location=40.777634%2C-73.98221&pano=QF_AAgkHLb3qqJjXX77Fkg&heading=262.11&pitch=2.74&fov=10', 'googleStreetViewStaticImageUrl': 'https://maps.googleapis.com/maps/api/streetview?size=400x400&location=40.777634%2C-73.98221&fov=10&heading=262.11&pitch=2.74&key=AIzaSyCxslRVk110OALOY_belTmu_Ls3PXN4RrI&pano=QF_AAgkHLb3qqJjXX77Fkg'}, 'googleStreetViewProps': {'lat': 40.777634, 'lon': -73.98221, 'panoramaId': 'QF_AAgkHLb3qqJjXX77Fkg', 'heading':

### Ok, now we need to strip all the extra stuff we don't need from all_batches_results
I could either strip from what I got or go back and redo the code. I'm going to strip from what I have.
I also checked some of these URLs, and I'll need to filter out low-confidence results, because some of the results above are not correct. We'll keep the googleStreetViewUrl item so we can check.

In [80]:
##PANDAS TIME!
import pandas as pd

all_rows = []

# Loop through each batch
for batch_name, batch_list in all_batches_results.items():
    for item in batch_list:
        # Extract the pieces I need
        ocr= item.get("ocrResult", {})
        sv = item.get("streetview", {})
        gsv = item.get("googleStreetViewUrls", {})

        row = {
            "ID": ocr.get("id", ""),
            "Confidence": ocr.get("confidence", ""),
            "Boro": sv.get("suburb", ""),
            "Date": sv.get("date", "")[:10],## stripping the extra characters from date
            "Latitude": sv.get("lat", ""),
            "Longitude": sv.get("lon", ""),
            "Postcode": sv.get("postcode", ""),
            "GoogleStreetViewURL": gsv.get("googleStreetViewUrl", "")
            
        }
        all_rows.append(row)
    # report the batch size once per batch instead of once per row
    print(len(batch_list))

# Create DataFrame
df = pd.DataFrame(all_rows)

# look at the first few rows
print(df.head())

50
50
50
50
50
50
37

In [82]:
len(df)

337

### I have my dataframe with all of the information I want. Before turning it into a CSV or analyzing further, I want to remove all of the results that have a low confidence score
I used this site to help me understand what a good confidence score is for Optical Character Recognition:
https://learn.microsoft.com/en-us/azure/azure-video-indexer/ocr-insight
I'm going to treat a confidence score of 0.8 or below as low
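Before committing to a cutoff, it helps to see how the confidence values are distributed and how many rows fall on each side of 0.8. This is a plain-Python sketch of that check; `split_by_confidence` is a hypothetical helper, and `sample_rows` holds three rows in the same shape as the dataframe rows.

```python
from collections import Counter


def split_by_confidence(rows, threshold=0.8):
    """Separate rows into low (<= threshold) and high (> threshold) confidence."""
    low = [r for r in rows if r["Confidence"] <= threshold]
    high = [r for r in rows if r["Confidence"] > threshold]
    return low, high


sample_rows = [
    {"ID": 56994, "Confidence": 0.5},
    {"ID": 308550, "Confidence": 1.0},
    {"ID": 362507, "Confidence": 0.5},
]

low_rows, high_rows = split_by_confidence(sample_rows)
print(Counter(r["Confidence"] for r in sample_rows))
print(len(low_rows), len(high_rows))
```

On the real dataframe, `df["Confidence"].value_counts()` would give the same kind of tally.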


In [85]:
##just to check to see how many I might have to look at...
df_confidence =df.query("Confidence < 1")
len(df_confidence)

211

In [99]:
df_confidence.sample(3)

Unnamed: 0,ID,Confidence,Boro,Date,Latitude,Longitude,Postcode,GoogleStreetViewURL
275,80576935,0.5,Queens,2021-07-01,40.70179,-73.880745,11385,"https://www.google.com/maps/@40.70179,-73.8807..."
165,39477395,0.5,Brooklyn,2022-10-01,40.61525,-73.96333,11230,"https://www.google.com/maps/@40.61525,-73.9633..."
60,10379363,0.5,Manhattan,2023-03-01,40.710228,-74.007774,10038,"https://www.google.com/maps/@40.710228,-74.007..."


### After manually looking through the alltext page, I've decided I shouldn't filter by confidence score yet, because some results labeled with a confidence of 1 contain kratom and some don't.
I'll have to go through quite a few manually for my project, but we can figure out how to do that more efficiently after speaking with the creator of alltext.nyc or an OCR expert with experience in Street View imagery.

### Download my full dataframe (no filtering by confidence until I do more reporting!)

In [95]:
###lets download it...
df.to_csv("AllTextNYC_KratomSearch_Nov3.csv")

In [97]:
df.head(3)

Unnamed: 0,ID,Confidence,Boro,Date,Latitude,Longitude,Postcode,GoogleStreetViewURL
0,56994,0.5,Manhattan,2022-07-01,40.710255,-74.007805,10038,"https://www.google.com/maps/@40.710255,-74.007..."
1,308550,1.0,Manhattan,2022-07-01,40.76179,-73.960396,10065,"https://www.google.com/maps/@40.76179,-73.9603..."
2,362507,0.5,Manhattan,2023-04-01,40.714046,-73.99755,10038,"https://www.google.com/maps/@40.714046,-73.997..."
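To get back to the original question (month and year of each capture), the Date strings can be tallied by year or by year and month. This is a minimal standard-library sketch; `count_by_year` and `count_by_month` are hypothetical helpers, and `sample_dates` holds the three dates from the rows above.

```python
from collections import Counter
from datetime import date


def count_by_year(date_strings):
    """Count results per capture year from YYYY-MM-DD strings."""
    return Counter(date.fromisoformat(d).year for d in date_strings if d)


def count_by_month(date_strings):
    """Count results per (year, month) pair from YYYY-MM-DD strings."""
    parsed = (date.fromisoformat(d) for d in date_strings if d)
    return Counter((p.year, p.month) for p in parsed)


sample_dates = ["2022-07-01", "2022-07-01", "2023-04-01"]

print(sorted(count_by_year(sample_dates).items()))
print(sorted(count_by_month(sample_dates).items()))
```

These dates are when the Street View image was captured, so the counts show when signs were photographed, which is not necessarily when a shop started selling kratom.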
