# Introduction

In this notebook, we analyse website accessibility: we take the top 1M home pages from the Tranco list (12/10/24 - 1/8/25), run the top 100 through the WAVE API (https://wave.webaim.org/api/), and analyse the resulting accessibility data.
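
For orientation, a minimal sketch of how a single WAVE request URL could be assembled. The collection script is not part of this notebook, so the helper name `build_wave_request`, the placeholder key, and the choice of `reporttype=1` are assumptions for illustration only; check the WAVE API documentation for the parameters that apply to your account.

```python
from urllib.parse import urlencode

# Endpoint as described in the WAVE API documentation (assumed here, not taken from this notebook).
WAVE_ENDPOINT = "https://wave.webaim.org/api/request"


def build_wave_request(page_url: str, api_key: str, report_type: int = 1) -> str:
    """Return the GET URL for one page (hypothetical helper, illustration only)."""
    query = urlencode({"key": api_key, "url": page_url, "reporttype": report_type})
    return f"{WAVE_ENDPOINT}?{query}"


# YOUR_API_KEY is a placeholder; no request is sent here.
request_url = build_wave_request("https://example.com", api_key="YOUR_API_KEY")
print(request_url)
```

The JSON returned for each URL would then be saved and combined into a file such as the `saved_data_combined.json` read later in this notebook.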

# About the Dataset

**Data Source:** <br>
We use the Tranco list* [1] generated on 08 January 2025, ...
* Available at https://tranco-list.eu/list/24L29.

[1] Victor Le Pochat, Tom Van Goethem, Samaneh Tajalizadehkhoob, Maciej Korczyński, and Wouter Joosen. 2019. "Tranco: A Research-Oriented Top Sites Ranking Hardened Against Manipulation," Proceedings of the 26th Annual Network and Distributed System Security Symposium (NDSS 2019). https://doi.org/10.14722/ndss.2019.23386
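
A rough sketch of how the first N entries of such a list could be turned into home page URLs. It assumes the downloaded list is a headerless CSV of `rank,domain` rows; the three rows below are placeholders rather than real Tranco entries, and the `https://` prefix simply mirrors the `pageurl` values seen later in this notebook.

```python
import io

import pandas as pd

# Stand-in for the downloaded CSV; these rows are placeholders, not real ranks.
sample_csv = io.StringIO("1,first.example\n2,second.example\n3,third.example\n")

# Read only the first N rows and name the two columns explicitly.
top_n = pd.read_csv(sample_csv, header=None, names=["rank", "domain"], nrows=2)

# Build the home page URL that would be submitted for analysis.
top_n["pageurl"] = "https://" + top_n["domain"]
print(top_n)
```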

# Import Statements

In [2]:
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import statsmodels.api as sm
from patsy import dmatrices

# Notebook Presentation

In [3]:
# Show numeric output in decimal format e.g., 2.15
pd.options.display.float_format = '{:,.2f}'.format

# Read the Dataset

In [4]:
web_access = pd.read_json('saved_data_combined.json')


# Data Cleaning/Exploration

In [5]:
web_access.shape

(136, 4)

In [6]:
web_access.sample(5)

Unnamed: 0,status,statistics,categories,old_index
1,"{'success': True, 'httpstatuscode': 200}",{'pagetitle': 'Gandi.net - Gandi.net: Domain N...,"{'error': {'description': 'Errors', 'count': 9...",102.0
110,"{'success': False, 'error': 'net::ERR_NAME_NOT...",,,
31,"{'success': False, 'error': 'net::ERR_NAME_NOT...",,,
22,"{'success': True, 'httpstatuscode': 200}","{'pagetitle': 'Error 404 (Not Found)!!1', 'pag...","{'error': {'description': 'Errors', 'count': 1...",123.0
7,"{'success': True, 'httpstatuscode': 200}",{'pagetitle': 'One platform to connect | Zoom'...,"{'error': {'description': 'Errors', 'count': 4...",108.0


In [7]:
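#drop rows for failed requests (their statistics and categories are missing)
web_access_clean = web_access.dropna().copy()
#remove status column because all remaining rows are success
web_access_clean = web_access_clean.drop(columns="status")
#convert old_index col to numeric (to_numeric returns a new Series, so assign it back)
web_access_clean["old_index"] = pd.to_numeric(web_access_clean["old_index"])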

web_access_clean

Unnamed: 0,statistics,categories,old_index
0,"{'pagetitle': '腾讯网', 'pageurl': 'https://qq.co...","{'error': {'description': 'Errors', 'count': 1...",101.00
1,{'pagetitle': 'Gandi.net - Gandi.net: Domain N...,"{'error': {'description': 'Errors', 'count': 9...",102.00
2,{'pagetitle': 'Google Drive: Share Files Onlin...,"{'error': {'description': 'Errors', 'count': 1...",103.00
3,"{'pagetitle': 'Login | Microsoft 365', 'pageur...","{'error': {'description': 'Errors', 'count': 1...",104.00
5,"{'pagetitle': 'Mozilla - Internet for people, ...","{'error': {'description': 'Errors', 'count': 1...",106.00
...,...,...,...
129,"{'pagetitle': 'Just a moment...', 'pageurl': '...","{'error': {'description': 'Errors', 'count': 1...",93.00
131,{'pagetitle': 'Explore - Find your favourite v...,"{'error': {'description': 'Errors', 'count': 1...",95.00
133,{'pagetitle': 'Analytics Tools & Solutions for...,"{'error': {'description': 'Errors', 'count': 2...",97.00
134,{'pagetitle': 'Яндекс — быстрый поиск в интерн...,"{'error': {'description': 'Errors', 'count': 0...",98.00


# Extracting Nested Data from a Column

Break out the nested "statistics" and "categories" columns into flat columns while keeping old_index (web page popularity) as the index.

In [8]:
statistics_df = pd.json_normalize(data=web_access_clean["statistics"]).set_index(web_access_clean.old_index)

In [9]:
# record_prefix only applies together with record_path, so it is omitted here
cat_df = pd.json_normalize(data=web_access_clean["categories"],
                           max_level=1).set_index(web_access_clean.old_index)

In [10]:
comb_df = statistics_df.join(other=cat_df)
comb_df

old_index,pagetitle,pageurl,time,creditsremaining,allitemcount,totalelements,waveurl,error.description,error.count,contrast.description,contrast.count,alert.description,alert.count,feature.description,feature.count,structure.description,structure.count,aria.description,aria.count
101.00,腾讯网,https://qq.com,8.85,27,418,2174,http://wave.webaim.org/report?url=https://qq.com,Errors,113,Contrast Errors,181,Alerts,90,Features,2,Structural Elements,6,ARIA,26
102.00,"Gandi.net - Gandi.net: Domain Names, Web Hosti...",https://gandi.net,3.12,26,268,1343,http://wave.webaim.org/report?url=https://gand...,Errors,9,Contrast Errors,5,Alerts,16,Features,18,Structural Elements,107,ARIA,113
103.00,Google Drive: Share Files Online with Secure C...,https://drive.google.com,4.87,25,266,2246,http://wave.webaim.org/report?url=https://driv...,Errors,1,Contrast Errors,0,Alerts,9,Features,83,Structural Elements,53,ARIA,120
104.00,Login | Microsoft 365,https://officeapps.live.com,2.93,24,290,983,http://wave.webaim.org/report?url=https://offi...,Errors,1,Contrast Errors,0,Alerts,3,Features,29,Structural Elements,61,ARIA,196
106.00,"Mozilla - Internet for people, not profit (US)",https://mozilla.org,2.43,23,229,1098,http://wave.webaim.org/report?url=https://mozi...,Errors,1,Contrast Errors,0,Alerts,12,Features,125,Structural Elements,69,ARIA,22
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
93.00,Just a moment...,https://cloudflare.net,3.32,32,6,42,http://wave.webaim.org/report?url=https://clou...,Errors,1,Contrast Errors,0,Alerts,1,Features,1,Structural Elements,3,ARIA,0
95.00,Explore - Find your favourite videos on TikTok,https://tiktok.com,8.54,31,138,922,http://wave.webaim.org/report?url=https://tikt...,Errors,14,Contrast Errors,8,Alerts,25,Features,5,Structural Elements,7,ARIA,79
97.00,Analytics Tools & Solutions for Your Business ...,https://google-analytics.com,3.12,30,236,829,http://wave.webaim.org/report?url=https://goog...,Errors,2,Contrast Errors,0,Alerts,42,Features,12,Structural Elements,71,ARIA,109
98.00,Яндекс — быстрый поиск в интернете,https://yandex.net,5.93,29,98,280,http://wave.webaim.org/report?url=https://yand...,Errors,0,Contrast Errors,3,Alerts,3,Features,1,Structural Elements,14,ARIA,77
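
The flattening step above can be illustrated on a toy frame. This is a minimal sketch with made-up values and hypothetical URLs; it only shows that `pd.json_normalize` returns a fresh 0..n-1 index, so the rank has to be re-attached by position before joining.

```python
import pandas as pd

# Toy stand-in for the WAVE results; values and URLs are made up for illustration.
toy = pd.DataFrame(
    {
        "statistics": [
            {"pageurl": "https://a.example", "totalelements": 200},
            {"pageurl": "https://b.example", "totalelements": 50},
        ],
        "categories": [
            {"error": {"description": "Errors", "count": 3}},
            {"error": {"description": "Errors", "count": 1}},
        ],
        "old_index": [7.0, 9.0],
    },
    index=[0, 2],  # non-contiguous, like a frame after dropna()
)

# Re-attach the rank by position, since json_normalize discards the original index.
rank_index = pd.Index(toy["old_index"].to_numpy(), name="old_index")
stats = pd.json_normalize(toy["statistics"].tolist()).set_index(rank_index)
cats = pd.json_normalize(toy["categories"].tolist(), max_level=1).set_index(rank_index)

# Both frames share the old_index index, so a plain join lines the rows up.
flat = stats.join(cats)
print(flat)
```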


# Further Clean Data by Removing Unnecessary Columns


In [11]:
pop_cols = ["time", "creditsremaining", "waveurl",
            "error.description", "contrast.description", "alert.description",
            "feature.description", "structure.description", "aria.description"]
#drop all unneeded columns in one call
comb_df = comb_df.drop(columns=pop_cols)
comb_df

old_index,pagetitle,pageurl,allitemcount,totalelements,error.count,contrast.count,alert.count,feature.count,structure.count,aria.count
101.00,腾讯网,https://qq.com,418,2174,113,181,90,2,6,26
102.00,"Gandi.net - Gandi.net: Domain Names, Web Hosti...",https://gandi.net,268,1343,9,5,16,18,107,113
103.00,Google Drive: Share Files Online with Secure C...,https://drive.google.com,266,2246,1,0,9,83,53,120
104.00,Login | Microsoft 365,https://officeapps.live.com,290,983,1,0,3,29,61,196
106.00,"Mozilla - Internet for people, not profit (US)",https://mozilla.org,229,1098,1,0,12,125,69,22
...,...,...,...,...,...,...,...,...,...,...
93.00,Just a moment...,https://cloudflare.net,6,42,1,0,1,1,3,0
95.00,Explore - Find your favourite videos on TikTok,https://tiktok.com,138,922,14,8,25,5,7,79
97.00,Analytics Tools & Solutions for Your Business ...,https://google-analytics.com,236,829,2,0,42,12,71,109
98.00,Яндекс — быстрый поиск в интернете,https://yandex.net,98,280,0,3,3,1,14,77


# Find Webpages with the Most Errors

In [12]:
comb_df.sort_values('allitemcount', ascending=False).head()

old_index,pagetitle,pageurl,allitemcount,totalelements,error.count,contrast.count,alert.count,feature.count,structure.count,aria.count
118.0,Samsung US | Mobile | TV | Home Electronics | ...,https://samsung.com,2716,6760,3,43,100,262,188,2120
38.0,Android Apps on Google Play,https://play.google.com,1839,5098,24,73,564,62,11,1105
99.0,Microsoft 365 - Subscription for Productivity ...,https://office365.com,1229,3388,11,0,12,101,114,991
24.0,Microsoft Outlook (formerly Hotmail): Free ema...,https://live.com,1199,4071,20,3,28,70,125,953
19.0,Cloud Computing Services | Microsoft Azure,https://azure.com,1088,3662,19,0,16,110,164,779


ARIA errors are by far the most common type on these five pages (see the aria.count column).

# Find Webpages with the Most Errors as a Percentage of Total DOM Elements

In [13]:
comb_df["total_percent_error"] = comb_df["allitemcount"].mul(100).div(comb_df["totalelements"])
comb_df.sort_values("total_percent_error", ascending=False).head()

old_index,pagetitle,pageurl,allitemcount,totalelements,error.count,contrast.count,alert.count,feature.count,structure.count,aria.count,total_percent_error
64.0,,https://settings-win.data.microsoft.com,4,4,2,0,2,0,0,0,100.0
121.0,,https://mobile.events.data.microsoft.com,4,4,2,0,2,0,0,0,100.0
17.0,,https://events.data.microsoft.com,4,4,2,0,2,0,0,0,100.0
62.0,,https://node.e2ro.com,4,5,2,0,2,0,0,0,80.0
90.0,,https://ecs.office.com,4,5,2,0,2,0,0,0,80.0


In the output above, the top rows are pages with only a handful of DOM elements and no page title, which suggests they are not valid home pages. In the next step, I remove every webpage with 100 or fewer elements.

In [14]:
comb_df = comb_df[comb_df["totalelements"] > 100].copy()
comb_df.sort_values("totalelements", ascending=True)

old_index,pagetitle,pageurl,allitemcount,totalelements,error.count,contrast.count,alert.count,feature.count,structure.count,aria.count,total_percent_error
67.00,Web Accessibility in Mind Conference: Home,https://domaincontrol.com,33,112,0,0,1,15,17,0,29.46
46.00,Sign in to your account,https://login.microsoftonline.com,33,127,8,0,2,2,2,19,25.98
107.00,X. It’s what’s happening / X,https://x.com,32,177,0,1,2,1,3,25,18.08
13.00,X. It’s what’s happening / X,https://twitter.com,32,177,0,1,2,1,3,25,18.08
82.00,Welcome to the home of the Network Time Protoc...,https://ntp.org,47,179,5,10,7,2,6,17,26.26
...,...,...,...,...,...,...,...,...,...,...,...
99.00,Microsoft 365 - Subscription for Productivity ...,https://office365.com,1229,3388,11,0,12,101,114,991,36.28
19.00,Cloud Computing Services | Microsoft Azure,https://azure.com,1088,3662,19,0,16,110,164,779,29.71
24.00,Microsoft Outlook (formerly Hotmail): Free ema...,https://live.com,1199,4071,20,3,28,70,125,953,29.45
38.00,Android Apps on Google Play,https://play.google.com,1839,5098,24,73,564,62,11,1105,36.07
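
As a small worked check of the percentage and the cut-off, here is a sketch using three rows copied from the tables above. The frame name `toy` is only for illustration.

```python
import pandas as pd

# Three rows copied from the tables above: pageurl, allitemcount, totalelements.
toy = pd.DataFrame(
    {
        "pageurl": [
            "https://settings-win.data.microsoft.com",
            "https://cloudflare.net",
            "https://qq.com",
        ],
        "allitemcount": [4, 6, 418],
        "totalelements": [4, 42, 2174],
    }
)

# Flagged items as a percentage of DOM elements.
toy["total_percent_error"] = toy["allitemcount"].mul(100).div(toy["totalelements"])

# Keep only pages with more than 100 DOM elements.
kept = toy[toy["totalelements"] > 100]

print(toy.round(2))
print(kept["pageurl"].tolist())
```

A page with 4 flagged items out of 4 elements scores 100%, which says more about the page being nearly empty than about its accessibility; that is the reason for filtering on `totalelements` before ranking by percentage.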


# Again: Find Webpages with the Most Errors as a Percentage of Total DOM Elements

In [15]:
comb_df.sort_values("total_percent_error", ascending=False).head(15)

old_index,pagetitle,pageurl,allitemcount,totalelements,error.count,contrast.count,alert.count,feature.count,structure.count,aria.count,total_percent_error
133.0,Google Help,https://googledomains.com,642,938,24,0,6,91,190,331,68.44
39.0,Search - Microsoft Bing,https://bing.com,333,579,6,0,25,33,42,227,57.51
55.0,Pinterest,https://pinterest.com,691,1382,3,11,8,193,10,466,50.0
33.0,Wikipedia,https://wikipedia.org,469,1047,6,1,10,428,19,5,44.79
77.0,Spotify - Web Player: Music for everyone,https://spotify.com,903,2106,4,1,61,51,42,744,42.88
118.0,Samsung US | Mobile | TV | Home Electronics | ...,https://samsung.com,2716,6760,3,43,100,262,188,2120,40.18
20.0,LinkedIn: Log In or Sign Up,https://linkedin.com,320,798,37,0,0,10,36,237,40.1
126.0,"Telekom | Mobilfunk, Festnetz & Internet, TV A...",https://telekom.de,446,1139,5,34,87,57,51,212,39.16
80.0,Sign in - Google Accounts,https://accounts.google.com,235,618,1,1,3,2,7,221,38.03
70.0,Sign in - Google Accounts,https://docs.google.com,235,618,1,1,3,2,7,221,38.03


Our dataset now has 74 viable websites. A few rows look like duplicates, but on inspection they are sign-in pages for different Google services, so they remain in the dataset.
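
One way to surface such look-alike rows is `DataFrame.duplicated` on the title and count columns. The sketch below uses four rows copied from the tables above; the choice of columns to compare is an assumption, not necessarily how the check was originally done.

```python
import pandas as pd

# Four rows copied from the tables above.
toy = pd.DataFrame(
    {
        "pagetitle": ["Sign in - Google Accounts", "Sign in - Google Accounts",
                      "Pinterest", "Wikipedia"],
        "pageurl": ["https://accounts.google.com", "https://docs.google.com",
                    "https://pinterest.com", "https://wikipedia.org"],
        "allitemcount": [235, 235, 691, 469],
        "totalelements": [618, 618, 1382, 1047],
    }
)

# Flag rows whose title and counts match another row.
# keep=False marks every member of a duplicate group, not just the later ones.
metric_cols = ["pagetitle", "allitemcount", "totalelements"]
suspects = toy[toy.duplicated(subset=metric_cols, keep=False)]

print(suspects[["pageurl", "pagetitle"]])
```

The flagged rows still have different `pageurl` values, so whether to keep them is a judgment call about the URLs rather than something the counts can decide.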

# Is Total Percent Error Dependent on the Total Number of DOM Elements?

Hypothesis: As webpages increase in complexity (more DOM elements), errors grow faster than the element count, leading to a higher total percent error.

In [16]:
#create scatter plot from comb_df using totalelements and total_percent_error as x and y columns, respectively
scatter = px.scatter(comb_df,
                     x = comb_df["totalelements"],
                     y = comb_df["total_percent_error"],
                     title = "Webpage DOM Elements v Total % Error",
                     color = comb_df["allitemcount"],
                     color_continuous_scale=px.colors.sequential.Agsunset,
                     width = 1000,
                     height = 750,
                     hover_data = ["pageurl", "totalelements", "total_percent_error"],
                     trendline="ols",
                     )

#update the scatter plot with x and y axis titles and a legend
scatter.update_layout(xaxis = {"title":"Total DOM Elements"},
                      yaxis = {"title": "Total Percent Error"},
                      legend = {"yanchor": "top",
                                "y": 0.995,
                                "x": 0.005,
                                "xanchor": "left",
                                "bgcolor": "White",
                                "bordercolor":"LightSteelBlue",
                                "borderwidth": 2,
                                "itemsizing": "constant",
                                },
                      hoverlabel = {"align": "left",
                                    "bgcolor": "rgb(255,255,255)"
                                   }
                      )

#increase the size of markers and increase contrast with outlines
scatter.update_traces(marker={"size":15,
                              "line":{"width":2,
                                      "color":"Black"
                                      }
                              }
                      )

#display trendline values in legend
tl = px.get_trendline_results(scatter)
a = tl.iloc[0]["px_fit_results"].params[0]
b = tl.iloc[0]["px_fit_results"].params[1]
scatter.data[0].name = 'websites'
scatter.data[0].showlegend = True
scatter.data[1].name = scatter.data[1].name  + ' y = ' + str(round(a, 2)) + ' + ' + str(round(b, 2)) + 'x'
rsq = tl.iloc[0]["px_fit_results"].rsquared
scatter.add_trace(go.Scatter(x=[100], y=[100],
                         name = "R-squared" + ' = ' + str(round(rsq, 2)),
                         showlegend=True,
                         mode='markers',
                         marker=dict(color='rgba(0,0,0,0)')
                         ))
scatter.data[1].showlegend = True

# results = px.get_trendline_results(scatter)
# results = results.iloc[0]["px_fit_results"].summary()
# print(results)

#Outline the area with most data points
scatter.add_shape(
    name="Error Area",
    showlegend=True,
    type="rect",
    line=dict(dash="dash"),
    x0=0,
    x1=2000,
    y0=10,
    y1=40,
)

scatter.show()

Conclusion: More DOM elements do not imply a higher percentage of errors (the R^2 is very low). Most websites have between 0 and 2,000 DOM elements, between 0 and 1,000 error items, and a total percent error of 10-40%. Pages with more elements have more errors in absolute terms, but not a higher percentage of errors.
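
To make the R^2 figure concrete, here is a minimal sketch of the same kind of straight-line fit done by hand with NumPy instead of the Plotly trendline. It uses only nine rows copied from the tables above, so its numbers will differ from the fit over the full dataset.

```python
import numpy as np

# (totalelements, total_percent_error) for nine rows shown in the tables above.
# This is a subset, so the result will not match the full-dataset trendline.
x = np.array([938, 579, 1382, 1047, 2106, 6760, 798, 1139, 618], dtype=float)
y = np.array([68.44, 57.51, 50.0, 44.79, 42.88, 40.18, 40.1, 39.16, 38.03])

# Least-squares line y = intercept + slope * x.
slope, intercept = np.polyfit(x, y, deg=1)

# R^2 = 1 - SS_res / SS_tot: the share of the variance in y explained by the line.
residuals = y - (intercept + slope * x)
ss_res = float(np.sum(residuals ** 2))
ss_tot = float(np.sum((y - y.mean()) ** 2))
r_squared = 1.0 - ss_res / ss_tot

print(f"y = {intercept:.2f} + {slope:.4f}x, R^2 = {r_squared:.2f}")
```

For a fit with one predictor, R^2 equals the squared Pearson correlation between x and y, so a value near zero means the element count carries almost no linear information about the percentage.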

# Is a Popular Webpage Less Likely to Have Errors?

Hypothesis: The more popular a webpage is, the less likely it is to have accessibility errors.

In [17]:
#create scatter plot from comb_df using the index and total_percent_error as x and y columns, respectively
scatter = px.scatter(comb_df,
                     x = comb_df.index,
                     y = comb_df["total_percent_error"],
                     title = "Webpage Popularity v Total % Error",
                     color = comb_df["totalelements"],
                     color_continuous_scale=px.colors.sequential.Agsunset,
                     width = 1000,
                     height = 750,
                     hover_data = ["pageurl", "allitemcount", "totalelements", "total_percent_error"],
                     trendline="ols",
                     )

#update the scatter plot with x and y axis titles and a legend
scatter.update_layout(xaxis = {"title":"Webpage Popularity"},
                      yaxis = {"title": "Total Percent Error"},
                      legend = {"yanchor": "top",
                                "y": 0.995,
                                "x": 0.005,
                                "xanchor": "left",
                                "bgcolor": "White",
                                "bordercolor":"LightSteelBlue",
                                "borderwidth": 2,
                                "itemsizing": "constant",
                                },
                      hoverlabel = {"align": "left",
                                    "bgcolor": "rgb(255,255,255)"
                                   }
                      )

#increase the size of markers and increase contrast with outlines
scatter.update_traces(marker={"size":15,
                              "line":{"width":2,
                                      "color":"Black"
                                      }
                              }
                      )

#display trendline values in legend
tl = px.get_trendline_results(scatter)
a = tl.iloc[0]["px_fit_results"].params[0]
b = tl.iloc[0]["px_fit_results"].params[1]
scatter.data[0].name = 'websites'
scatter.data[0].showlegend = True
scatter.data[1].name = scatter.data[1].name  + ' y = ' + str(round(a, 2)) + ' + ' + str(round(b, 2)) + 'x'
rsq = tl.iloc[0]["px_fit_results"].rsquared
scatter.add_trace(go.Scatter(x=[100], y=[100],
                         name = "R-squared" + ' = ' + str(round(rsq, 2)),
                         showlegend=True,
                         mode='markers',
                         marker=dict(color='rgba(0,0,0,0)')
                         ))
scatter.data[1].showlegend = True

#Outline the area with the most data points
scatter.add_shape(
    name="Error Area",
    showlegend=True,
    type="rect",
    line={"dash":"dash"},
    x0=0,
    x1=136,
    y0=10,
    y1=40,
)

# results = px.get_trendline_results(scatter)
# results = results.iloc[0]["px_fit_results"].summary()
# print(results)

scatter.show()

Conclusion: Popularity (a lower old_index means a more popular page) does not predict whether a webpage has a higher or lower percentage of errors, nor whether it has more or fewer total DOM elements. Most webpages, regardless of popularity, have between 10% and 40% errors.

# What are the Most Common Types of Errors?

In [77]:
#create a new dataframe of error count and error percentage
errors_df = pd.DataFrame({"page": comb_df["pageurl"],
                          "total_features": comb_df["totalelements"],
                          "total_errors": comb_df["allitemcount"],
                          "total_errors_percent": comb_df["total_percent_error"],
                          "error_count":comb_df["error.count"],
                          "error_percent": comb_df["error.count"].mul(100).div(comb_df["allitemcount"]),
                          "contrast_count":comb_df["contrast.count"],
                          "contrast_percent": comb_df["contrast.count"].mul(100).div(comb_df["allitemcount"]),
                          "alert_count":comb_df["alert.count"],
                          "alert_percent": comb_df["alert.count"].mul(100).div(comb_df["allitemcount"]),
                          "feature_count":comb_df["feature.count"],
                          "feature_percent": comb_df["feature.count"].mul(100).div(comb_df["allitemcount"]),
                          "structure_count":comb_df["structure.count"],
                          "structure_percent": comb_df["structure.count"].mul(100).div(comb_df["allitemcount"]),
                          "aria_count":comb_df["aria.count"],
                          "aria_percent": comb_df["aria.count"].mul(100).div(comb_df["allitemcount"]),
                         }
          )

#create a pie chart of the average error percentages
errors_percent_cols = [name for name in errors_df.columns if "percent" in name and name != "total_errors_percent"]
#strip the "_percent" suffix (8 characters) to get readable slice names
error_type_names = [error_type[:-8].title() for error_type in errors_percent_cols]
#px.pie takes slice names via `names`; `labels` is a dict for renaming columns, so it is not used here
fig = px.pie(names=error_type_names,
             values=[errors_df[error_type].mean() for error_type in errors_percent_cols],
             title="Average Error Type % Per Webpage",
             )

#labels of percentage go outside the pie chart
fig.update_traces(textposition="outside", textinfo="percent+label")
fig.show()

ARIA errors are by far the most common type: on average, 51.5% of a website's errors are of type ARIA. The next most common type is structure, at an average of 16.6% of total errors per website.
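
Note that these figures are means of per-page percentages, which is not the same as the share over all pages pooled together. A small sketch with three rows copied from the tables above shows how the two can differ; the variable names are for illustration only.

```python
import pandas as pd

# allitemcount and aria.count for three pages shown in the tables above.
toy = pd.DataFrame(
    {"allitemcount": [2716, 1839, 418], "aria_count": [2120, 1105, 26]},
    index=["https://samsung.com", "https://play.google.com", "https://qq.com"],
)

# Per-page share, then the unweighted mean across pages (every page counts equally).
per_page_share = toy["aria_count"].mul(100).div(toy["allitemcount"])
mean_of_shares = per_page_share.mean()

# Pooled share: all ARIA items divided by all items, so large pages weigh more.
pooled_share = toy["aria_count"].sum() * 100 / toy["allitemcount"].sum()

print(f"mean of per-page shares: {mean_of_shares:.2f}%")
print(f"pooled share: {pooled_share:.2f}%")
```

The mean of shares answers "what does a typical page look like", while the pooled share is dominated by the largest pages.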

In [109]:
errors_melt_df = pd.melt(errors_df.sort_values("total_errors", ascending=False).head(15),
                         id_vars=["page"],
                         value_vars=["error_count",
                                     "contrast_count",
                                     "alert_count",
                                     "feature_count",
                                     "structure_count",
                                     "aria_count",
                                     ],
                         var_name="all_error_types",
                         value_name="all_error_counts",
                         col_level=0,
                         ignore_index=True
                         )

errors_melt_df = pd.merge(left=errors_melt_df, right=errors_df[["page", "total_errors"]], how="left", on="page")
errors_melt_df["percent"] = errors_melt_df["all_error_counts"].div(errors_melt_df["total_errors"]).map('{:.2%}'.format)

fig = px.bar(errors_melt_df,
             x="page",
             y="all_error_counts",
             color="all_error_types",
             text="percent",
             title="The Top 15 Most Error-Prone Websites by Error Type",
             height=1200,
             hover_data = ["page", "total_errors"],
)

fig.update_layout(xaxis = {"title":"Top 15 Most Error-Prone Websites"},
                      yaxis = {"title": "Total Error Count"},
                      legend = {"title": "Error Types"},
                      hoverlabel = {"align": "left",
                                    "bgcolor": "rgb(255,255,255)"
                                   }
                      )
fig.show()
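
The reshaping step above turns one row per page into one row per page and error type, which is the shape a stacked bar chart needs. A minimal sketch with two pages and three count columns copied from the tables above:

```python
import pandas as pd

# Two pages and three of the count columns, copied from the tables above.
wide = pd.DataFrame(
    {
        "page": ["https://samsung.com", "https://play.google.com"],
        "error_count": [3, 24],
        "contrast_count": [43, 73],
        "aria_count": [2120, 1105],
    }
)

# Wide to long: one row per (page, error type) pair.
long = pd.melt(
    wide,
    id_vars=["page"],
    value_vars=["error_count", "contrast_count", "aria_count"],
    var_name="all_error_types",
    value_name="all_error_counts",
)

print(long)
```

Each value column in `wide` becomes a label in `all_error_types`, so two pages with three columns give six rows.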