# Platform Policy Analysis: COMM 4940 Statistics Assignment

*[J. Nathan Matias](https://natematias.com/), January 2026*

The purpose of this assignment is to give students in COMM 4940 a chance to refresh their memory of how to conduct, interpret, and think about multiple regression models, one of the prerequisites for the class.

**Grading**: This assignment has a grade of **complete/not complete**. In addition to serving as a refresher, it will give me a sense of where the class is with statistics, whether I need to swap in a few statistics sessions, and whether I can direct you to further learning and support in your statistics journey so that you can complete the midterm and final effectively. If you find you are really struggling, it might be a sign that you should take this class in the future, after a bit more statistics training. There's no shame in having clarity about your current skillset and how best to grow it.

**AI**: Please do not use generative AI tools to create written text for this assignment or to develop ideas for your analysis code. You may use code autocomplete, but not any tool that you ask to generate example code. This assignment does not require much novel code, so if you find yourself doing something elaborate, it's a sign that you have wandered afield from the purpose of the assignment.

## Backstory

Advocates and families have often pressured technology firms to introduce designs and policies that would protect children from trafficking, grooming, and other forms of exploitation online. For decades, when organizers advocated for change, regulators would ask how big a problem it was and tech platforms would shrug and say that privacy laws prevented them from sharing data on highly sensitive harms. So for decades it has been difficult to know how many cases there have been of harms to children, what kinds of platform designs put children most at risk, and what to do about it.

Then in the early 2020s, the European Union passed the [Digital Services Act](https://en.wikipedia.org/wiki/Digital_Services_Act), which among other things required tech platforms above a certain size to produce real-time data about content moderation and digital harms - or face serious fines. By 2024, numerous platforms had near-real-time feeds for every piece of content they removed in the EU in the [DSA Transparency Database](https://transparency.dsa.ec.europa.eu/?lang=en). By January 2026, the transparency database included over 3.6 billion incidents across 257 platforms.

In 2024, Cornell undergraduate Ingrid Gruener Luft had the idea to make a list of platform policies and features and compare platforms on the basis of those differences. She created a multivariate model to account for the differences between platforms in order to pinpoint more specific factors at play. This assignment engages with a simplified version of her thesis analysis and represents some decisions Ingrid had to make partway through her project in pursuit of what could be learned from the DSA data. In this assignment, you get to be in Ingrid's shoes, deciding what models to use in your final project.

**Important disclaimer**: this assignment is a simulated, hypothetical part-way point in a student's final project, for educational purposes. The findings in this analysis should not be used to draw conclusive impressions about any company or regulatory scheme. Readers are also discouraged from drawing conclusions about the student or their work on the basis of this simplified classroom example, which does not necessarily reflect that student's views.


### The Assignment
Review the dataset, the code, and the accurate paragraph of interpretation at the bottom of this document. Then take the following actions.

1. Add a covariate to `m1` in a new model that you label `m2`. The covariate is `Policy.A.Allows.Minors...Individuals.Under.18`. Write a paragraph that interprets the coefficient for that covariate in terms that anyone can understand.
2. Write a short paragraph answering the question: Is `m1` better or worse than `m2` at making sense of differences between platforms in child safety related reports? On what basis would you make that argument?
3. If the new analysis changes how you think about the question and its answer, explain your revised understanding. If you do not think anything needs to change in the paragraph of interpretation, explain why not. 
4. Imagine a journalist saw these results and asks for your advice on the following headline: "New Study Finds Platforms With Encrypted Messaging are Safe for Kids." They draw this conclusion based on the lack of a correlation between encryption features and the number of reports related to harms to minors in `m1`. What advice would you give them about that headline?
5. Submit a zipfile with your code and a document (text, Open Document Format, Word doc, or PDF) with your answers to the questions. Your writing should not be more than a single page, which should include your name, information about the class, and the date.
    - Optional: I will accept an assignment in Jupyter notebook format (R or Python .ipynb), if you also include a PDF of the output of your notebook

### Load Libraries (R)

In [None]:
library(gmodels)
library(readr)
library(stargazer)
library(ggplot2)

### Load Dataset

In [None]:
platforms <- read.csv("2025-04-luft-policy-analysis.csv")
platforms$Protection.of.Minors <- parse_number(platforms$Protection.of.Minors)
platforms$Protection.of.Minors.log <- log1p(platforms$Protection.of.Minors)
platforms$Users <- parse_number(platforms$Users)
platforms$Users.log <- log1p(platforms$Users)
platforms$Total.Reports <- parse_number(platforms$Total.Reports)
platforms$Total.Reports.log <- log1p(platforms$Total.Reports)
platforms$dating.relationships.adult <- platforms$Is.the.platform.advertised.for.dating..relationships..or.adult.content
platforms$Policy.B.child.exploitation <- platforms$Policy.B.CSAM.specific.policy.or.exploitation.harm.to.minors

#platforms$Protection.of.Minors.pct <- parse_number(platforms$Protection.of.Minors.pct)
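
The `log1p()` calls above compute log(1 + x) rather than log(x). A minimal sketch of why that matters for count columns that contain zeros, using made-up counts rather than values from the dataset:

```r
# Hypothetical report counts (not from the dataset), including a zero
counts <- c(0, 9, 99, 999999)

# log() sends a zero count to -Inf, which lm() cannot use as an outcome value
log(counts)

# log1p() computes log(1 + x), so a zero count stays at zero
logged <- log1p(counts)
logged

# expm1() is the inverse of log1p() and recovers the original counts
recovered <- expm1(logged)
recovered
```

One consequence to keep in mind: a one-unit difference on the logged scale corresponds to a multiplicative difference in (1 + count), not an additive difference in the count itself.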

### About the Dataset
The dataset includes the following columns. Some of them come from the DSA database. Other columns were developed by Ingrid, after reading hundreds (if not thousands) of pages across all 88 platforms reporting to the DSA transparency hub in 2025. Note that [the DSA taxonomy](https://transparency.dsa.ec.europa.eu/page/additional-explanation-for-statement-attributes) has become much clearer and more detailed since February 2025 when Ingrid was collecting data.

Platforms are required to report data to the transparency system if they are considered [very large online platforms](https://digital-strategy.ec.europa.eu/en/policies/dsa-vlops), with over 45 million users in the EU per month. Here is what those columns mean for the purposes of this assignment:

* **Company**: the name of the platform
* **Total.Reports**: the total number of incidents reported by the platform (could be content removals, bans, etc. for any reason including copyright violations, animal welfare, threats, scams, identity theft, etc.)
* **Protection.of.Minors**: the total number of incidents reported by the platform that were related to the protection of minors, including:
    * Age-specific restrictions concerning minors (for example, someone faking their age to join)
    * Child sexual abuse material
    * Child sexual abuse material containing deepfake or similar technology
    * Grooming/sexual enticement of minors
    * Unsafe challenges
* **Users**: the number of estimated active users on the platform at the time (in Europe I think), compiled by Ingrid from market research
* **Does.the.platform.allow.private.messaging.between.2.users**: is there a private messaging feature on the platform?
* **Does.the.company.offer.E2E.encryption**: is there end-to-end encryption that prevents the platform from monitoring or carrying out content moderation toward private or group messages?
* **Do.they.use.automated.detection.for.content.moderation**: Does the platform publicly admit to using automated detection techniques?
* **Pre.publication.review.for.all.content**: Does the platform allow anyone to post anything, or do they have AI systems or humans who review everything before it is posted? (for example, some online shopping sites)
* **dating.relationships.adult**: Does the platform encourage romantic or adult interactions/content?
* **Policy.A.Allows.Minors...Individuals.Under.18**: Does the platform allow minors to use it?
* **Policy.B.child.exploitation**: Does the platform have an explicit policy against child sexual abuse material or other child exploitation? Or do these go unmentioned in their policies?
* **Policy.C.Obligation.to.report.to.authorities**: Does the platform publicly acknowledge or promise to report illegal activity to the authorities?

In [None]:
colnames(platforms)

### Summarize outcome variable

In [None]:
## observe that more than half of platforms report zero incidents involving the protection of minors
## make of that what you will
summary(platforms$Protection.of.Minors)

In [None]:
summary(platforms$Protection.of.Minors.log)
hist(platforms$Protection.of.Minors.log)

In [None]:
summary(platforms$Users.log)

ggplot(platforms, aes(Users.log, Protection.of.Minors.log)) +
    geom_point() + 
    geom_smooth(method="lm") +
    theme_bw() +
    ggtitle("Relationship between platform user count and reported incidents involving minors\n(DSA data collected Feb 2025)")


### Estimate differences in the rate of reports involving child safety

In [None]:
m1 <- lm(Protection.of.Minors.log ~ Users.log + Total.Reports.log + 
           Does.the.company.offer.E2E.encryption + 
           Does.the.platform.allow.private.messaging.between.2.users + 
           dating.relationships.adult + 
           Pre.publication.review.for.all.content +
           Policy.B.child.exploitation,
           data=platforms)

In [None]:
## overall model
stargazer(m1, type="text")


In [None]:
## correlation between policies against CSAM and overall report rates
cor.test(platforms$Total.Reports.log, platforms$Policy.B.CSAM.specific.policy.or.exploitation.harm.to.minors)

In [None]:
## correlation between E2E encryption and overall report rates (Total.Reports.log)
cor.test(platforms$Total.Reports.log, platforms$Does.the.company.offer.E2E.encryption)
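
Each `cor.test()` call above pairs a numeric column with a yes/no platform feature. As a hedged aside, when such a feature is coded as 0/1, the Pearson test gives the same p-value as an equal-variance two-sample t-test comparing the two groups. The sketch below uses simulated data, not the assignment's dataset; `has_feature` and `reports_log` are hypothetical names, and it assumes the feature is stored as numbers (a column stored as text would need converting first).

```r
set.seed(4940)

# Simulated stand-ins: a 0/1 platform feature and a logged report count
has_feature <- rep(c(0, 1), times = c(39, 49))
reports_log <- rnorm(88, mean = 5 + 0.8 * has_feature, sd = 2)

# Pearson correlation between the numeric outcome and the 0/1 indicator
p_cor <- cor.test(reports_log, has_feature)$p.value

# Equal-variance two-sample t-test comparing the two groups
p_t <- t.test(reports_log ~ has_feature, var.equal = TRUE)$p.value

# The two p-values agree up to floating-point error
c(correlation = p_cor, t_test = p_t)
```

Unlike the coefficients in `m1`, neither test adjusts for any other variable, so a bivariate p-value and a regression p-value for the same feature can reasonably differ.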

## Interpretation Paragraph Provided for the Assignment

Among 88 online platforms from Facebook and Glassdoor to Bumble, companies submitted reports of just over 4.9 million incidents to the European Union in February 2025 that related to the protection of minors. What factors were most associated with these reports?

In a multiple regression model that accounted for the number of a platform's users and features like private messaging and encryption, the 49 platforms that had an explicit, public policy against harms to minors reported 8.7 times more cases of harms to minors, on average, than the 39 platforms that didn't (p<0.01). Furthermore, platforms that had more visible policies against harms to minors also submitted more incident reports to the EU overall (p=0.04). This association between policies and reports could occur because problems on the platform had forced them to make those policies, or because clear policies motivate people to report problems to a platform. Surprisingly, despite extensive public debate over end-to-end encryption as a driver of harm to minors, platforms that offered encryption features did not submit more or fewer reports to the EU about harms to minors than ones that did not, on average (p=0.13).
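
A statement like "8.7 times more" comes from exponentiating a coefficient, because the outcome in `m1` is on a log scale. The following is a rough sketch with simulated data, not the assignment's dataset; `has_policy`, `incidents`, and `fit` are hypothetical names. Because the outcome uses `log1p()`, the exponentiated coefficient is strictly a ratio of (1 + incidents), which is close to a ratio of incidents only when counts are large.

```r
set.seed(2026)

# Simulated platforms: a 0/1 policy indicator and skewed incident counts
n <- 200
has_policy <- rbinom(n, 1, 0.5)
# Counts are built so the policy group sits log(5) higher on the log scale
incidents <- round(exp(rnorm(n, mean = 6 + log(5) * has_policy, sd = 1)))

# Same structure as m1, reduced to one predictor: logged outcome, 0/1 covariate
fit <- lm(log1p(incidents) ~ has_policy)

# The coefficient is a difference in logs; exp() turns it into a ratio
b <- coef(fit)["has_policy"]
ratio <- exp(b)
ratio

# The same back-transformation applied to a 95% confidence interval
exp(confint(fit)["has_policy", ])

# Going the other way: the log-scale coefficient implied by an 8.7-fold ratio
log(8.7)
```

Reporting the back-transformed interval alongside the ratio gives readers a sense of how uncertain the "times more" figure is.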