
Add crawler functionality for identifying sites' usage of GPP 1.0 vs 1.1 and write to database #110

Closed
patmmccann opened this issue May 23, 2024 · 23 comments
Labels: core functionality (New big feature), crawl (Perform crawl or crawl feature-related)

@patmmccann

GPP 1.0 is no longer supported. If a site is broadcasting a GPP 1.0 signal, other entities on the page (e.g., Prebid.js or Google Ad Manager) generally will not understand it. You should just fail any site providing an API that no one understands. At Prebid, we're removing support for reading GPP 1.0 signals entirely, and GAM already has.

@SebastianZimmeck SebastianZimmeck added crawl Perform crawl or crawl feature-related exploration Explore adding a feature etc. labels May 24, 2024
@SebastianZimmeck SebastianZimmeck added core functionality New big feature and removed exploration Explore adding a feature etc. labels May 24, 2024
@SebastianZimmeck
Member

Thanks, @patmmccann!

@katehausladen already looked into this issue. We will further evolve our code to reflect people's move from GPP v1.0 to v1.1.

@franciscawijaya, can you take the lead on this issue and implement the functionality @katehausladen described and as outlined below for our June crawl? @Mattm27, can you work with @franciscawijaya as needed to bounce off ideas and discussion? And, @katehausladen, it would be great if you were available for any questions that @franciscawijaya and @Mattm27 still have remaining and general observations to make sure we are not making any mistake here 😄.

What we need before the June crawl is logic for:

  • Identifying whether a site implements GPP v1.0 or v1.1
  • Unpacking and storing both GPP v1.0 and v1.1 values in our database (let's add a new column, "gpp_version", to our crawl data after column z)

@katehausladen already prepared this move in analysis.js. So, @franciscawijaya, that file is a good starting point together with @katehausladen's description. Please go ahead and create a new issue-110 branch to start the implementation ...

Once we have a record of which site is using which version, we can interpret the results in our analysis after the crawl accordingly.

@patmmccann
Author

As an additional reference, Prebid's deletion of GPP 1.0 has been merged but not yet released: prebid/Prebid.js#11461

Thanks!

@patmmccann
Author

patmmccann commented May 24, 2024

You can see if GAM understood the gpp string by looking in the payload of the network requests it makes to itself. This is an example of a success on the call (filter to gampad in network tab)
[screenshot: successful gampad request payload]

However, you might often see errors in this location for sites using gpp 1.0, which GAM and their recipients are treating as opt in (same as no signal)
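One way to check this programmatically is to look for the GPP parameters on the captured gampad request URL. This is a hypothetical sketch: the `gpp` and `gpp_sid` query-parameter names are how GAM commonly serializes the signal on ad requests, but verify them against the actual requests you capture; the sample URL and GPP string below are illustrative only.

```javascript
// Extract the GPP signal from a captured GAM ad-request URL.
function gppFromAdRequest(url) {
  const params = new URL(url).searchParams;
  return {
    gppString: params.get("gpp"),     // null if GAM dropped or never received the signal
    sectionIds: params.get("gpp_sid"), // which GPP sections apply
  };
}

// Illustrative request captured from the network tab (filter to "gampad"):
const sample =
  "https://securepubads.g.doubleclick.net/gampad/ads?iu=/123/slot&gpp=DBABLA~BVVqAAEABgA.QA&gpp_sid=8";
```

A `null` `gppString` on these requests would be consistent with the opt-in-by-default behavior described above.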

@franciscawijaya
Member

franciscawijaya commented May 28, 2024

  • Identifying whether a site implements GPP v1.0 or v1.1

After reading more about the GPP string and the CMP API, it seems that what was updated to version 1.1 is the CMP API, which captures the information of the GPP string.
[screenshots: CMP API documentation]

I have also confirmed with Kate that our current code looks for version 1.1 first and then 1.0, and then stores only one string value. While the string value would be the same, the difference lies in how we access the value (i.e., the getGPPData function in v1.0 versus just the ping function in v1.1; reference: #60 (comment)).

  • Unpacking and storing both GPP v1.0 and v1.1 values in our database (let's write a new column to our crawl data after the z column "gpp_version")

So, the values captured by CMP API v1.0 and v1.1 should be the same, given that the only change in the new version is the removal of the getGPPData function, whose information was merged into the ping function.
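To make the access difference concrete, here is a minimal sketch. The two stubs are hypothetical mock CMPs, not real implementations; the field names follow the IAB CMP API docs, but treat the exact object shapes and the sample GPP string as assumptions.

```javascript
// Mock CMP API v1.0: getGPPData returns the data object synchronously
// (and may also invoke the callback).
function gppV10(command, callback) {
  if (command === "getGPPData") {
    const data = { gppVersion: 1, gppString: "DBABLA~BVVqAAEABgA.QA" };
    if (callback) callback(data, true);
    return data; // v1.0 commands return values
  }
}

// Mock CMP API v1.1: getGPPData is removed; its information is merged into
// ping, which communicates via callback only and returns nothing.
function gppV11(command, callback) {
  if (command === "ping") {
    callback({ gppVersion: "1.1", gppString: "DBABLA~BVVqAAEABgA.QA" }, true);
  }
  // no return value in v1.1
}

// The GPP string itself is the same either way; only the access path differs.
const fromV10 = gppV10("getGPPData").gppString;
let fromV11;
gppV11("ping", (data) => { fromV11 = data.gppString; });
```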

@franciscawijaya
Member

franciscawijaya commented May 28, 2024

You should just fail any site providing an API that no one understands. At Prebid, we're removing support for reading GPP 1.0 signals entirely and GAM already has.

Hypothesis as of now:
I think the problem is that these entities are looking for the newly merged ping function from the v1.1 CMP API instead of the getGPPData function from v1.0.

[image from #60 (comment)]

What our code has:
Our current approach checks for both getGPPData and ping, as discussed in #60 (comment).

Possible solution:
We can completely scrap the getGPPData function and solely use the v1.1 ping CMP API call.

@franciscawijaya
Member

@patmmccann Could we clarify whether, by changes in the GPP versions, you were referring to changes in the CMP API versions? Currently, according to the IAB, there is only one GPP version (1.0), but there are two CMP API versions (1.0 and 1.1). [The CMP API captures the information of the GPP.]

@franciscawijaya
Member

franciscawijaya commented May 28, 2024

  • Identifying whether a site implements GPP v1.0 or v1.1
  • Unpacking and storing both GPP v1.0 and v1.1 values in our database (let's write a new column to our crawl data after the z column "gpp_version")

Action plan on our end:

  • We will be adding a new column to the crawl data that indicates whether the site uses CMP API v1.0 (one that has a getGPPData function) or v1.1 (one that only uses the ping function)
  • Training data (10-20 sites)

@patmmccann
Author

patmmccann commented May 29, 2024

@patmmccann Could we clarify if, by changes in the GPP versions, you were referring to the changes in the CMP API versions? Currently, according to IAB, there is only one GPP version (1.0) but there are 2 CMP API versions (1.0 and 1.1) [CMP API captures the information of the GPP]

Yes, the CMP API version 1.1, which we probably should have called 2.0, but oh well. InteractiveAdvertisingBureau/Global-Privacy-Platform#70

cc @lamrowena

@patmmccann
Author

We will be adding a new column in the crawl data that indicates whether the site uses CMP API v1.0 (one that has a getGPPdata function) or v1.1 (one that only uses ping function)

I suggest you get the version out of the ping response instead of testing for the absence of getGPPData. Some commercial vendors, e.g., @janwinkler, have backported getGPPData to assist in transitions while still conforming to the 1.1 spec, and their signal would be recognized by platforms gathering the signal with the newly formatted event-listener model.
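A sketch of that suggestion, assuming the ping response reports its spec version in a `gppVersion` field (per the IAB CMP API docs; the mock CMP below is hypothetical, and real CMPs may answer asynchronously):

```javascript
// Read the version out of the ping response instead of feature-testing for
// getGPPData, which backporting vendors still expose under the 1.1 spec.
function getCmpApiVersion(gpp, done) {
  let fired = false;
  const ret = gpp("ping", (pingData, success) => {
    fired = true;
    done(success && pingData ? String(pingData.gppVersion) : null);
  });
  if (!fired && ret) {
    // A v1.0 CMP may return the ping data without invoking the callback.
    done(String(ret.gppVersion));
  }
}

// Hypothetical v1.1 CMP that also backports getGPPData for transition support:
function backportedCmp(command, callback) {
  if (command === "ping") callback({ gppVersion: "1.1" }, true);
  if (command === "getGPPData") return { gppVersion: "1.1" }; // backport
}

let detected;
getCmpApiVersion(backportedCmp, (v) => { detected = v; });
// detected is "1.1" even though getGPPData exists
```

Feature-testing for getGPPData would have misclassified this CMP as v1.0; the ping response does not.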

@SebastianZimmeck SebastianZimmeck changed the title GPP 1.0 issues Add crawler functionality for identifying sites' usage of GPP 1.0 vs 1.1 and write to database May 30, 2024
@franciscawijaya
Member

franciscawijaya commented May 31, 2024

Thank you @patmmccann! Your insight was very helpful in guiding the steps that I need to take to enhance the crawler functionality.

A note to self:
I've confirmed that our current code does not test the version based on the absence of getGPPData. Instead, our injection script is modeled on the version 1.1 update (i.e., the callback takes precedence over a return value, since v1.1 removed return values in favor of callback functions).

This prioritizes the 1.1 spec, since all default GPP functions (including ping and getGPPData) that used return values in v1.0 now use callback functions in v1.1. Meanwhile, v1.0 would return values as expected, with some implementations executing callback functions and some not. Hence, sites fall into three categories:
v1.1: callback only
v1.0: executes callback and returns value
v1.0: returns value only
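These three categories can be sketched as a small classifier. This is a hypothetical illustration only: the mocks respond synchronously, whereas real CMPs are asynchronous and messier, and the category strings are just labels.

```javascript
// Classify a CMP stub by whether its ping executes a callback, returns a
// value, or both (per the three categories above).
function classifyCmp(gpp) {
  let callbackFired = false;
  const ret = gpp("ping", () => { callbackFired = true; });
  if (callbackFired && ret === undefined) return "v1.1: callback only";
  if (callbackFired) return "v1.0: executes callback and returns value";
  if (ret !== undefined) return "v1.0: returns value only";
  return "no CMP API detected";
}

// Hypothetical mocks, one per category:
const v11     = (cmd, cb) => { cb({}); };               // callback only
const v10both = (cmd, cb) => { cb({}); return {}; };    // callback + return
const v10ret  = (cmd, cb) => ({});                      // return value only
```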

[screenshot]

In order to add the column to the crawl data, I believe these are the steps I should take:

  1. Add a new column in the rest-api (in the app.post in index.js).

  2. Explicitly collect the data on the different versions (which I believe is analyzed under the function runAnalysis; it is actually posted to debug).

[screenshot]

  3. Log/store the data in the extension while a site is being analyzed (this is what analysis_userend[domain] is for).
  • logData is a function in analysis.js that parses the incoming data and puts it into an object (namely analysis_userend, which has an entry for each domain). I can model it like this:

[screenshot]

  4. Populate the database with the collected information.
  • This happens automatically as long as the data is in analysis_userend[domain], because after analysis finishes for a particular site, the data in analysis_userend[domain] is posted to the database.
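The logData/analysis_userend steps above might look roughly like this. logData, analysis_userend, and the two column names come from this thread; the command string, parameter names, and object shapes here are assumptions for illustration, not the actual analysis.js code.

```javascript
const analysis_userend = {}; // one entry per crawled domain

// Parse an incoming message and store it on the domain's entry; the real
// logData handles many more message types than this sketch.
function logData(domain, command, data, afterGpcSignal) {
  if (!analysis_userend[domain]) analysis_userend[domain] = {};
  const entry = analysis_userend[domain];
  if (command === "GPP_VERSION") {
    if (afterGpcSignal) entry.gpp_version_after_gpc = data;
    else entry.gpp_version_before_gpc = data;
  }
}

// Once analysis finishes, the whole entry is posted to the database, so the
// new fields ride along automatically.
logData("example.com", "GPP_VERSION", "1.1", false);
logData("example.com", "GPP_VERSION", "1.1", true);
```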

@franciscawijaya
Member

Logs/Update on adding the new column:

  1. I added a new variable for gpp_version before and after GPC in index.js (which should add a new column to the Crawl Data)
  2. logData the gpp_version under runAnalysis and haltAnalysis (which should collect the data for gpp_version)
  3. Added the two new variables under the analysisUserendSkeleton and analysisDataSkeletonFirstParties functions
  4. Under the logData function, wrote an if statement for GPP_version to parse the data and put it into the objects gpp_version_before_gpc and gpp_version_after_gpc

A side note: while figuring out the code for the addition of the gpp_version column, I also encountered some questions about functions in analysis.js that I need to clarify, and I am currently asking Kate about them.

Next step: I will repackage the gpc-analysis-extension into an xpi file and test the extension locally before making a commit.

@SebastianZimmeck
Member

Excellent!

@franciscawijaya
Member

franciscawijaya commented Jun 3, 2024

Update: After successfully repackaging it into an xpi file, I ran the analysis. Unfortunately, it gave me null values for the GPP version. I also tried to debug using the debug column, and the code actually managed to identify which GPP version the site is using (e.g., in the example attached, it detected v1.1, shown above the 'empty'); however, it still fails to store and print it in the analysis column.
[screenshot]

I am currently taking another approach in the logic: instead of checking the version both before and after the GPC signal is detected, I'm trying to write the code to collect just one GPP version (regardless of whether it is detected before or after the GPC signal).

@franciscawijaya
Member

I've successfully added the code to identify the GPP version that a site is using, collect that data, and store it in the new column (gpp_version). The result of the crawler on a site is attached below.
[screenshot]

Next step:
I have tested 2 sites while writing and testing the code. I will begin testing a slightly bigger sample size (10-20 sites) to ensure that the GPP versions the sites are using are recorded properly.

@SebastianZimmeck
Member

Excellent! Well done, @franciscawijaya!

@franciscawijaya
Member

franciscawijaya commented Jun 4, 2024

Using the April crawl data, I tested the crawl on sites that output GPP strings (as tested in April) to check the gpp_version. All 20 sites that I picked from the data used v1.1, and that is reflected accurately in the gpp_version column. I also tested sites that do not output GPP strings before or after the GPC signal is sent, and as expected the column reflects a 'null' value for gpp_version, since their gpp_before_gpc and gpp_after_gpc also output 'null' values.

In my testing and debugging of 20 sites, I have yet to encounter a site (that was crawled and identified to have a GPP string in the April crawl) that uses v1.0. I'm not sure if this confirms that most sites have switched to v1.1.

While I'm thinking of continuing my manual testing of other sites from the list that had GPP strings in the April crawl to make sure of this switch, I wonder if there is a way for me to get hold of sites that are still using v1.0 right now and test those, instead of going through our site list.

@SebastianZimmeck
Member

I wonder if there is a way for me to get a hold of sites that are still using v1.0 right now and test those sites out, instead of going through our site list.

I tried searching BuiltWith to find sites with GPP. But it does not detect GPP. Maybe, there are similar lead generation sites like BuiltWith that do, though.

Another option may be to try the Internet Archive and Archive.today to see if they store sites with all their third parties.

It is also possible to create your own site with GPP v1.0. But let's not go there unless it is absolutely necessary.

@SebastianZimmeck
Member

Other than that, Google search for GPP v1.0 code snippets may get some relevant search results.

@franciscawijaya
Member

@patmmccann Hello! Would you mind sharing sites that still used v1.0 when you came across this issue? My sample set of sites seems to have switched to 1.1, but I'm still looking to test our crawler on sites that use v1.0. I would greatly appreciate any help. Thank you in advance!

@franciscawijaya
Member

I've tried using web.archive.org to check sites from before the release of GPP v1.1 to see if I could test the GPP version (which should be v1.0). However, this did not work, as web.archive.org does not seem to store sites with their third parties. I confirmed this by comparing the current site and the archived version in the web console, which showed that the current site stores a GPP string while the web.archive.org version does not.
[screenshots: live site vs. web.archive.org console]

I have also tried testing it on some other sites on our crawl list but I have yet to encounter v1.0. For now, I think we can just go ahead to try this code for the June crawl.

Next step: I will soon request to merge this branch into main, ask Matt to help review the code on his end via a local test, and then close this issue. I will also begin preparing for the crawl by this weekend.

@SebastianZimmeck
Member

Sounds all good, @franciscawijaya!

@patmmccann
Author

I am having trouble tracking down some of the old gpp implementations at the moment. Perhaps other outreach has been quite successful!

@franciscawijaya
Member

Merged to the main branch!
