## MBA Fixed broadband, 11th report

The tests run between MBA measurement clients and MBA measurement servers.  The measurement clients (i.e., whiteboxes) were situated in the homes of 5,951 panelists, each of whom received service from one of the 10 evaluated ISPs.  *The evaluated ISPs collectively represent over 70% of U.S. residential broadband Internet connections.*  After the measurement data was processed (as described in greater detail in the Technical Appendix), test results from only 2,488 panelists were used in this report, because:
- many panelists did not report for enough days within the measurement period to be considered a statistically significant sample, or
- as explained above, there were too few panelists in a particular tier or for a particular ISP to produce a statistically valid dataset.


The measurement clients collected data throughout the year, and this data is available as described below.  However, only data collected during the “September-October 2020 reporting period” was used to generate the charts and conclusions in this Report.
  - The September-October 2020 data set was validated to remove anomalies that would have produced errors in the Report: the MBA panel sample used in the reporting period is validated (i.e., upload and download tiers of the whiteboxes are verified with providers), and the measurement results are carefully inspected to eliminate any statistical outliers.  This leads to a ‘validated data set’ that accompanies each report. Validation cross-checks are not done outside the test period used for the report.
  - As tests are run 24x7x365, we also provide raw data (NOTE: mid-month changes of subscriber tier may be missed):
      - *Handle panelists that changed ISP intra-month*

Unless otherwise stated, this Report focuses on performance during the peak usage period, defined as weeknights between 7:00 p.m. and 11:00 p.m. local time at the subscriber’s location. Peak-period results show what performance users can expect when Internet access services in their local area experience the highest demand.
  - *The MBA program is focused on ISP network performance... the MBA methodologies/system/measurements are designed to recognize and eliminate such confounding factors outside an access ISP’s control.*
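
As a rough illustration of the peak-period definition above, the sketch below flags weeknight evening timestamps. It assumes the timestamps have already been converted to the subscriber's local time and that "weeknights" means Monday through Friday; the function name `is_peak` and the sample values are illustrative and not part of the MBA tooling.

```python
import pandas as pd

def is_peak(local_time: pd.Series) -> pd.Series:
    """True for Mon-Fri timestamps from 19:00 up to, but not including, 23:00."""
    weekday = local_time.dt.dayofweek < 5         # Monday=0 ... Friday=4
    evening = local_time.dt.hour.between(19, 22)  # hours 19, 20, 21, 22
    return weekday & evening

# Illustrative local timestamps (assumed already in local time)
sample = pd.Series(pd.to_datetime([
    "2020-09-02 19:30",  # Wednesday evening
    "2020-09-02 23:00",  # Wednesday, just after the window closes
    "2020-09-05 20:00",  # Saturday evening
]))
peak_mask = is_peak(sample)
```

Applying `peak_mask` to a results table would keep only the rows that fall inside the peak window.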





### FCC MBA Report: Tests initiated by a premise device at participating households. Validated data, and raw collected data are published.

In [None]:
# wget DOWNLOAD SCRIPT

# 2020 raw + validated2020sep (contains geocoded-units / 
for mo in jan feb mar apr may jun jul aug nov dec ; 
do wget https://data.fcc.gov/download/measuring-broadband-america/2020/data-raw-2020-${mo}.tar.gz & done

# VALIDATED data
# 2019 : https://www.fcc.gov/reports-research/reports/measuring-broadband-america/validated-data-measuring-fixed-broadband-tenth#validated 
# UnitID-census-block-sept2019.xls

In [None]:
# extract tar ball
tar -xvzf name

# Validated data (extract; expected to unpack into 2020-sep-oct/)
tar -xzvf validated-data-sept2020.tar.gz
    
# Raw data    
for mo in jan feb mar apr may jun jul aug nov dec ; 
do tar -xvzf data-raw-2020-${mo}.tar.gz & done

In [None]:
2020, 2021: sep+oct = validated MBA Report
https://data.fcc.gov/download/measuring-broadband-america/2021/data-raw-2021-jan.tar.gz
...
https://data.fcc.gov/download/measuring-broadband-america/2021/data-raw-2021-aug.tar.gz
https://data.fcc.gov/download/measuring-broadband-america/2021/data-raw-2021-nov.tar.gz
https://data.fcc.gov/download/measuring-broadband-america/2021/data-raw-2021-dec.tar.gz


# 2020, sep+oct = Eleventh Report (latest)
VALIDATED data: http://data.fcc.gov/download/measuring-broadband-america/2021/validated-data-sept2020.tar.gz
https://data.fcc.gov/download/measuring-broadband-america/2020/data-raw-2020-jan.tar.gz
...
https://data.fcc.gov/download/measuring-broadband-america/2020/data-raw-2020-aug.tar.gz
https://data.fcc.gov/download/measuring-broadband-america/2020/data-raw-2020-nov.tar.gz
https://data.fcc.gov/download/measuring-broadband-america/2020/data-raw-2020-dec.tar.gz


# 2011-2019 (full)
https://data.fcc.gov/download/measuring-broadband-america/2019/data-raw-2019-jan.tar.gz
...
https://data.fcc.gov/download/measuring-broadband-america/2019/data-raw-2019-dec.tar.gz

In [103]:
import pandas as pd
%load_ext autotime

time: 355 µs (started: 2022-03-10 20:58:05 -05:00)


# 2020 MBA Report

## Geocoded units

Geocoded Units
This spreadsheet identifies the census block in which each unit running tests is located. The census block is from the 2000 census and is in FIPS code format. We have used block FIPS codes for blocks that contain more than 1,000 people. For blocks with fewer than 1,000 people we have aggregated to the next highest level, i.e. tract, and used the tract FIPS code, provided there are more than 1,000 people in the tract. Where a tract has fewer than 1,000 people we have aggregated to county level. This anonymization is done for privacy purposes, so as not to expose PII.
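
The aggregation rule can be sketched as below. This is an illustration under stated assumptions, not the FCC's actual procedure: the function name and population inputs are hypothetical, and a count of exactly 1,000 (which the description leaves undefined) is treated as not exceeding the threshold. It relies on the standard FIPS layout, in which a 15-digit block code begins with the 11-digit tract code, which in turn begins with the 5-digit county code.

```python
def anonymized_geog(block_fips: str, block_pop: int, tract_pop: int,
                    threshold: int = 1000):
    """Return (geog_id, geog_type) at the finest level whose population exceeds the threshold."""
    if block_pop > threshold:
        return block_fips, "block"
    if tract_pop > threshold:
        return block_fips[:11], "tract"   # state(2) + county(3) + tract(6)
    return block_fips[:5], "county"       # state(2) + county(3)

# Hypothetical block with 400 residents inside a tract of 3,200 residents
geog_id, geog_type = anonymized_geog("420034731001000", block_pop=400, tract_pop=3200)
```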

In [192]:
geocoded = pd.read_csv('data/geocoded-units-sept2020.csv')
# Count unique units in the geocoded table itself (`main` is only built later)
geocoded.shape, len(set(geocoded.unit_id))

((3634, 5), 3634)

time: 5.81 ms (started: 2022-03-11 11:43:46 -05:00)


## Unit Profile (ISP/services)
This document identifies the various details of each test unit, including ISP, technology, service tier, and general location. Each unit represents one volunteer panelist.

In [366]:
profile = pd.read_csv('data/unit-profile-sept2020.csv')
profile = profile.rename(columns={'Unit ID' : 'unit_id'})[['unit_id', 'ISP', 'Technology', 'Download', 'Upload']]
profile.shape

(3634, 5)

time: 11.8 ms (started: 2022-03-11 16:23:01 -05:00)


In [205]:
len(set(profile.unit_id).intersection(set(geocoded.unit_id)))

3634

time: 2.95 ms (started: 2022-03-11 11:48:49 -05:00)


In [206]:
set(profile), set(geocoded)

({'Download', 'ISP', 'Upload', 'unit_id'},
 {'geog_id', 'geog_type', 'latitude', 'longitude', 'unit_id'})

time: 2.07 ms (started: 2022-03-11 11:48:51 -05:00)


## List all available files

In [373]:
# Unix Shell command !
janFiles = !ls data/202001/*.csv
janFiles[:3]

['data/202001/curr_datausage.csv',
 'data/202001/curr_dns.csv',
 'data/202001/curr_httpget.csv']

time: 57.8 ms (started: 2022-03-11 16:29:24 -05:00)


In [145]:
!ls data/202011/

curr_datausage.csv   curr_httppostmt6.csv	   curr_udpcloss.csv
curr_dlping.csv      curr_lct_dl.csv		   curr_udpjitter.csv
curr_dns.csv	     curr_lct_dl_intermediate.csv  curr_udplatency.csv
curr_httpget.csv     curr_lct_ul.csv		   curr_udplatency6.csv
curr_httpgetmt.csv   curr_lct_ul_intermediate.csv  curr_ulping.csv
curr_httpgetmt6.csv  curr_netusage.csv		   curr_videostream.csv
curr_httppost.csv    curr_ping.csv		   curr_webget.csv
curr_httppostmt.csv  curr_traceroute.csv
time: 125 ms (started: 2022-03-10 21:58:29 -05:00)


In [None]:
* 10 months 
MO10 = ['01', '02', '03', '04', '05', '06', '07', '08', '11', '12']
# jan
curr_datausage.csv   curr_httppost.csv	   curr_udpcloss.csv
curr_dns.csv	     curr_httppostmt.csv   curr_udpjitter.csv
curr_httpget.csv     curr_httppostmt6.csv  curr_udplatency.csv
curr_httpgetmt.csv   curr_ping.csv	   curr_udplatency6.csv
curr_httpgetmt6.csv  curr_traceroute.csv   curr_webget.csv
# nov
curr_datausage.csv   curr_httppostmt6.csv	   curr_udpcloss.csv
curr_dlping.csv      curr_lct_dl.csv		   curr_udpjitter.csv
curr_dns.csv	     curr_lct_dl_intermediate.csv  curr_udplatency.csv
curr_httpget.csv     curr_lct_ul.csv		   curr_udplatency6.csv
curr_httpgetmt.csv   curr_lct_ul_intermediate.csv  curr_ulping.csv
curr_httpgetmt6.csv  curr_netusage.csv		   curr_videostream.csv
curr_httppost.csv    curr_ping.csv		   curr_webget.csv
curr_httppostmt.csv  curr_traceroute.csv

* Validated SEP + OCT
curr_dns.csv	     curr_httppostmt.csv   curr_udpjitter.csv
curr_httpget.csv     curr_httppostmt6.csv  curr_udplatency.csv
curr_httpgetmt.csv   curr_ping.csv	   curr_udplatency6.csv
curr_httpgetmt6.csv  curr_ping6.csv	   curr_webget.csv
curr_httppost.csv    curr_udpcloss.csv



# DOWNLOAD

Validated data: includes only data collected from September 2 – 24, 2020 (inclusive) plus September 26 – October 2, 2020 (inclusive), referred to throughout this report as the “September-October 2020 reporting period”.
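
A minimal sketch of how rows could be restricted to the two date windows quoted above. It assumes the test timestamp (the `dtime` column in the MBA CSVs) has already been parsed to a pandas datetime; the helper name `in_reporting_period` and the sample timestamps are illustrative.

```python
import pandas as pd

# The two inclusive windows that make up the reporting period
REPORTING_WINDOWS = [
    ("2020-09-02", "2020-09-24"),
    ("2020-09-26", "2020-10-02"),
]

def in_reporting_period(dtime: pd.Series) -> pd.Series:
    """True where the calendar day falls inside either reporting window."""
    day = dtime.dt.normalize()  # drop the time-of-day component
    mask = pd.Series(False, index=dtime.index)
    for start, end in REPORTING_WINDOWS:
        mask |= day.between(pd.Timestamp(start), pd.Timestamp(end))
    return mask

sample = pd.Series(pd.to_datetime([
    "2020-09-01 12:00", "2020-09-24 23:59", "2020-09-25 08:00",
    "2020-10-02 23:00", "2020-10-03 00:00",
]))
period_mask = in_reporting_period(sample)
```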

In [None]:
Mbps_to_bytes = 125*10**3


In [147]:
# path = f"data/2020-sep-oct"
# download_csv_name = 'curr_httpgetmt'
# down4Sep = pd.read_csv(f"{path}/{download_csv_name}.csv")
# down4Sep_suc = down4Sep.query('(threads > 1) & (successes == 1) & (bytes_sec > 0) & (bytes_sec_interval > 0)')
# down4Sep_fil = down4Sep_suc.query(f"((target.str.startswith('samknows')) & (target.str.endswith('level3.net'))) | ((target.str.startswith('sp')) & (target.str.endswith('us.samknows.com')))")
# print(down4Sep.shape, len(set(down4Sep.unit_id)), down4Sep.query('(threads > 1) & (successes == 1)').shape, down4Sep_suc.shape, down4Sep_fil.shape)
# down4Sep_fil['download_mbps'] = round(down4Sep_fil.bytes_sec/Mbps_to_bytes, 2)
# down4Sep_fil.dtime                   

(1027247, 15) 3589 (1027247, 15) (1027247, 15) (1027247, 15)
time: 2.38 s (started: 2022-03-10 22:01:30 -05:00)


The test operates for a fixed duration of 10 seconds. It records the average throughput achieved during this 10 second period. The client attempts to download as much of the payload as possible for the duration of the test.
- fetch_time	Time the test ran for in microseconds
- bytes_total	Total bytes downloaded across all connections

The test uses three concurrent TCP connections (and therefore three concurrent HTTP requests) to ensure that the line is saturated. Each connection used in the test counts the numbers of bytes transferred and is sampled periodically by a controlling thread. The sum of these counters (a value in bytes) divided by the time elapsed (in microseconds) and converted to Mbps is taken as the total throughput of the user’s broadband service.
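
The conversion described above reduces to a one-line formula, because bits per microsecond are numerically equal to megabits per second. The sketch below is illustrative (the function name and sample values are hypothetical) and uses the `bytes_total` and `fetch_time` fields; it also shows that dividing a bytes-per-second figure by 125,000 gives the same answer.

```python
BYTES_PER_SEC_PER_MBPS = 125_000  # 1 Mbps = 1,000,000 bits/s = 125,000 bytes/s

def throughput_mbps(bytes_total: int, fetch_time_us: int) -> float:
    """Average throughput in Mbps from total bytes and elapsed microseconds."""
    return bytes_total * 8 / fetch_time_us  # bits per microsecond == Mbps

# Illustrative values: 125,000,000 bytes transferred during a 10-second test
mbps = throughput_mbps(bytes_total=125_000_000, fetch_time_us=10_000_000)

# The same test expressed as bytes per second, then converted
bytes_sec = 125_000_000 / 10
mbps_from_bytes_sec = bytes_sec / BYTES_PER_SEC_PER_MBPS
```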

Factors such as TCP slow start and congestion are taken into account by repeatedly transferring small chunks (256 kilobytes, or kB) of the target payload before the real testing begins. This “warm-up” period is completed when three consecutive chunks are transferred within 10 percent of one another's speed. All three connections are required to have completed the warm-up period before the timed testing begins. The warm-up period is excluded from the measurement results.
Downloaded content is discarded as soon as it is received, and is not written to the file system. Uploaded content is generated and streamed on the fly from a random source.
The test is performed for both IPv4 and IPv6, where available, but only IPv4 results are reported.
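
The warm-up rule can be sketched as follows. This is not the measurement client's code: the function name is hypothetical, and "within 10 percent of one another" is interpreted here as the spread between the fastest and slowest of the last three chunks being at most 10 percent of the fastest, which is an assumption.

```python
def warmup_done_at(chunk_speeds, window=3, tolerance=0.10):
    """Return the number of chunks transferred when warm-up completes, or None."""
    for end in range(window, len(chunk_speeds) + 1):
        recent = chunk_speeds[end - window:end]   # the last `window` chunks
        fastest, slowest = max(recent), min(recent)
        if fastest > 0 and (fastest - slowest) / fastest <= tolerance:
            return end
    return None

# Illustrative per-chunk speeds (Mbps) while TCP ramps up
speeds = [10, 40, 80, 95, 98, 100]
chunks_needed = warmup_done_at(speeds)
still_ramping = warmup_done_at([10, 20, 40])
```

Under this interpretation the timed test would start only once every connection has reached its own warm-up point.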


In [None]:
MO10 = ['01', '02', '03', '04', '05', '06', '07', '08', '11', '12']

In [None]:
# # curr_httpgetmt6.csv	Download speed, IPv6, multiple concurrent TCP connections; USUALLY are just small tables -- dont bother!
# janDown6 = pd.read_csv('data/202101/curr_httpgetmt6.csv')
# janDown6.shape # (3868, 15)

In [249]:
down4_csv_name = 'curr_httpgetmt'
down6_csv_name = 'curr_httpgetmt6'

down_combined = []

for mo in ['11', '12']:
    path = f"data/2020{mo}"

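    # # Main CONDITIONS
    # Reference conditions, in SQL form. Note that this loop differs from them:
    # it reads the Nov/Dec raw data and keeps IPv6 results as well as IPv4.
    # where ddate between date('2020-09-01') and date('2020-10-31')
    # where threads > 1 and ip_version = '4'
    # where (successes = 1) and (bytes_sec > 0) and (bytes_sec_interval > 0)
    # where target like 'samknows%level3.net' or target like 'sp%-us.samknows.com'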
    
    # IPv4 multiple TCP connections
    down4 = pd.read_csv(f'{path}/{down4_csv_name}.csv')
    down4_suc = down4.query('(threads > 1) & (successes == 1) & (bytes_sec > 0) & (bytes_sec_interval > 0)')
    down4_fil = down4_suc.query(f"((target.str.startswith('samknows')) & (target.str.endswith('level3.net'))) | ((target.str.startswith('sp')) & (target.str.endswith('us.samknows.com')))")
    print(path, down4.shape, len(set(down4.unit_id)), down4.query('(threads > 1) & (successes == 1)').shape, down4_suc.shape, down4_fil.shape)
    
    # IPv6 multiple TCP connections :: fewer tests
    down6 = pd.read_csv(f'{path}/{down6_csv_name}.csv')
    down6_suc = down6.query('(threads > 1) & (successes == 1) & (bytes_sec > 0) & (bytes_sec_interval > 0)')
    down6_fil = down6_suc.query(f"((target.str.startswith('samknows')) & (target.str.endswith('level3.net'))) | ((target.str.startswith('sp')) & (target.str.endswith('us.samknows.com')))")
    print(path, down6.shape, len(set(down6.unit_id)), down6.query('(threads > 1) & (successes == 1)').shape, down6_suc.shape, down6_fil.shape)

    down_combined.append(down4_fil[['unit_id', 'bytes_sec']])
    down_combined.append(down6_fil[['unit_id', 'bytes_sec']])
    
    
down_combined = pd.concat(down_combined)
down_combined['download_mbps'] = round(down_combined.bytes_sec/Mbps_to_bytes, 2)
down_combined.shape

(1430255, 15) 5716 (1249156, 15) (1249156, 15) (962997, 15)
(4065, 15) 64 (4058, 15) (4058, 15) (3917, 15)
(1462627, 15) 5565 (1274203, 15) (1274203, 15) (976174, 15)
(3997, 15) 55 (3997, 15) (3997, 15) (3814, 15)
time: 6.48 s (started: 2022-03-11 12:23:12 -05:00)


In [214]:
# OPTIONAL: trimmed mean
# from scipy.stats import trim_mean
# down4_fil.groupby('unit_id')['download_mbps'].apply(trim_mean, .01)

time: 470 ms (started: 2022-03-11 12:01:32 -05:00)


In [262]:
down_completed = down_combined.groupby('unit_id')['download_mbps'].describe().round(2)
# DOWNLOAD_MBPS_MBA = 'DownloadMbpsMBA'
# rename(columns=
#                  { 'count': 'numTestDownloadMBA', 
#                   'mean': f'mean{DOWNLOAD_MBPS_MBA}',  '50%': f'med{DOWNLOAD_MBPS_MBA}',
#                   'min': f'min{DOWNLOAD_MBPS_MBA}', 'max' : f'max{DOWNLOAD_MBPS_MBA}'})
down_completed = down_completed[['count', 'mean', '50%', 'min', 'max']].add_prefix('down_').reset_index()

time: 7.06 s (started: 2022-03-11 12:36:46 -05:00)


# UPLOAD

In [273]:
speed4_csv_name = 'curr_httppostmt'
speed6_csv_name = 'curr_httppostmt6'

speed_combined = []

for mo in ['11', '12']:

    path = f"data/2020{mo}"

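    # # Main CONDITIONS
    # Reference conditions, in SQL form. Note that this loop differs from them:
    # it reads the Nov/Dec raw data and keeps IPv6 results as well as IPv4.
    # where ddate between date('2020-09-01') and date('2020-10-31')
    # where threads > 1 and ip_version = '4'
    # where (successes = 1) and (bytes_sec > 0) and (bytes_sec_interval > 0)
    # where target like 'samknows%level3.net' or target like 'sp%-us.samknows.com'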

    # IPv4 multiple TCP connections
    speed4 = pd.read_csv(f'{path}/{speed4_csv_name}.csv')
    speed4_suc = speed4.query('(threads > 1) & (successes == 1) & (bytes_sec > 0) & (bytes_sec_interval > 0)')
    speed4_fil = speed4_suc.query(f"((target.str.startswith('samknows')) & (target.str.endswith('level3.net'))) | ((target.str.startswith('sp')) & (target.str.endswith('us.samknows.com')))")
    print(path, speed4.shape, len(set(speed4.unit_id)), speed4.query('(threads > 1) & (successes == 1)').shape, speed4_suc.shape, speed4_fil.shape)

    # # IPv6 multiple TCP connections :: fewer tests
    speed6 = pd.read_csv(f'{path}/{speed6_csv_name}.csv')
    speed6_suc = speed6.query('(threads > 1) & (successes == 1) & (bytes_sec > 0) & (bytes_sec_interval > 0)')
    speed6_fil = speed6_suc.query(f"((target.str.startswith('samknows')) & (target.str.endswith('level3.net'))) | ((target.str.startswith('sp')) & (target.str.endswith('us.samknows.com')))")
    print(path, speed6.shape, len(set(speed6.unit_id)), speed6.query('(threads > 1) & (successes == 1)').shape, speed6_suc.shape, speed6_fil.shape)

    speed_combined.append(speed4_fil[['unit_id', 'bytes_sec']])
    speed_combined.append(speed6_fil[['unit_id', 'bytes_sec']])
    
    
speed_combined = pd.concat(speed_combined)
speed_combined['upload_mbps'] = round(speed_combined.bytes_sec/Mbps_to_bytes, 2)
speed_combined.shape

data/202011 (1425370, 15) 5711 (1236004, 15) (1236004, 15) (952450, 15)
data/202011 (4054, 15) 64 (3995, 15) (3995, 15) (3855, 15)
data/202012 (1457526, 15) 5560 (1260377, 15) (1260377, 15) (965492, 15)
data/202012 (3981, 15) 55 (3966, 15) (3966, 15) (3784, 15)


(1925581, 3)

time: 6.79 s (started: 2022-03-11 13:12:06 -05:00)


In [275]:
speed_completed = speed_combined.groupby('unit_id')['upload_mbps'].describe().round(2)
speed_completed = speed_completed[['count', 'mean', '50%', 'min', 'max']].add_prefix('up_').reset_index()
speed_completed.shape

time: 7.12 s (started: 2022-03-11 13:12:38 -05:00)


# LATENCY & PACKET LOSS

#### OOKLA / MLAB / MBA program: TCP, single or multi-streams?
- OOKLA: Speedtest.net operates mainly over **TCP testing** with an HTTP fallback for maximum compatibility. Speedtest.net measures ping (latency), download speed and upload speed.
- Ookla’s Speedtest.net uses *multiple “streams”* or connections, now spread across *multiple nearby and least latent servers*, in an attempt to measure the maximum access link capacity of the ISP’s network.
- M-Lab’s NDT test uses *a single stream or connection* to measure the capacity of an end-to-end path from the person running the test to the *geographically closest off-net server* on the M-Lab platform. NOTE: *Single-stream measurements* do not attempt to emulate a browser, but do reflect the performance of the basic building block for nearly all applications, that is, the single streams themselves. Single-stream measurements like NDT do not measure link capacity; they are **a measurement of TCP’s performance. In this sense NDT is a baseline measurement for a connection’s performance.**
- The MBA program uses a *multi-stream test to a single off-net server* that is geographically closest and least latent, i.e. "Tests to the off-net destinations use the nearest (in terms of latency) server from the Level3, M-Lab and StackPath list of test servers."
- Multi-stream measurements, such as the ones provided by Ookla and the MBA, open up multiple data streams over a user’s connection. This approach is designed to emulate a modern browser. It can also partially mask data delivery problems, such as when one set of streams picks up unused capacity left by another set of streams that is performing poorly due to packet loss or congestion elsewhere in the network. But by overcoming these sorts of problems, multiple streams are able to return measurements closer to link capacity, or the maximum amount of data that can fill the link.


#### TCP, UDP, and ICMP packet types?
- TCP: The Transmission Control Protocol, or TCP, is as OG as it gets. TCP was part of the initial network transmission program that eventually gave way to the Internet Protocol used in modern networking. TCP is widely used for its reliability, ordered nature (the packets are processed in a fixed sequence, not just as they arrive), and error correction. TCP is used for a ton of things, like *email, file transfers, and any other operation where ordered, error-free data is more important than pure speed*. If you’re noticing your FTP traffic is being restricted or blocked, you can start using a trace with TCP to see where along the route FTP is being hindered.
- UDP: The User Datagram Protocol, or UDP, is a bit different from what you might expect from a transport protocol. Unlike TCP, UDP is a connectionless communication method. This means UDP datagrams can be sent without establishing a connection between two devices, allowing them to be sent without consideration for rate or sequence. For UDP, the primary focus is *speed*. Since UDP datagrams are coordinated by the application and not the protocol, they can be received and processed as they come. This is critical for things like *video streams or VOIP*, where processing info as fast as possible is more critical than reassembling the data in perfect order.
- ICMP: The Internet Control Message Protocol, or ICMP, is a special type of packet used for inter-device communication, carrying everything from *redirect instructions to timestamps for synchronization between devices.* What ICMP is probably best known for is echo requests: One device sends out an ICMP packet to another, telling the recipient to send a reply confirming it received the request. The recipient then responds with a new ICMP packet, the echo reply, confirming the request.

#### MBA REPORT: CONDITIONS: **UDP latency/loss cleansing**
- All test instances (one per hour, per unit) with less than 50 samples (out of a potential maximum of 600) were removed.
- All test instances where a unit’s packet loss exceeded 10% within a single hour were removed. Such a high level of loss would render a connection unusable and is considered an anomalous event.
- All test instances where any round trip time was reported as 0.5ms (or 500 microseconds) or lower were removed.
- NOTE: the curr_udplatency.sql script appears to have a typo here: it tests (rtt_min >= 50), whereas 0.5 ms corresponds to 500 microseconds.
- All test instances where the range of a unit’s individual round trip times exceeded 300ms were removed.
- Only tests which ran over IPv4 were considered in the analysis.

@ curr_udplatency.sql 
- NOTE: target filters (only include test servers: Level3 and StackPath) where target like 'samknows%level3.net' or target like 'sp%-us.samknows.com'
- where (successes >= 50) and (successes >= failures) and (failures / (successes + failures) <= 0.1) and (rtt_min >= 50) and (rtt_max - rtt_min <= 300000) and (rtt_avg > 0)
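
The two SQL `LIKE` patterns in the target filter can be expressed as a single anchored regular expression, with each `%` translated to `.*`. The sketch below is illustrative: the helper name and every hostname in the sample are hypothetical, chosen only to exercise the patterns.

```python
import pandas as pd

# 'samknows%level3.net' OR 'sp%-us.samknows.com', anchored at both ends
REPORT_TARGETS = r"^(?:samknows.*level3\.net|sp.*-us\.samknows\.com)$"

def is_report_server(target: pd.Series) -> pd.Series:
    """True where the target hostname matches either reporting-server pattern."""
    return target.str.match(REPORT_TARGETS)

# Hypothetical hostnames
targets = pd.Series([
    "samknows1.nyc2.level3.net",
    "sp1-vm-newyork-us.samknows.com",
    "ndt.example.measurement-lab.org",
    "sp3.aus.samknows.com",   # ends with 'us.samknows.com' but has no '-us.'
])
server_mask = is_report_server(targets)
```

The last sample shows where this pattern is stricter than a plain `endswith('us.samknows.com')` check, which would accept that hostname.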

#### MBA REPORT: UDP Latency and Packet Loss tests
- These tests measure the round-trip time of small UDP packets between the Whitebox and a target test node.
- If a response packet is not received within three seconds of sending, it is treated as being lost. 
- The test records the number of packets sent each hour, the average round trip time and the total number of packets lost.
- The test computes the summarized minimum, maximum, standard deviation and mean from the lowest 99 percent of results, *effectively trimming the top (i.e., slowest) 1 percent of outliers*. NOTE: hence, **rtt_avg might be < > rtt_min/max**
- Approximately two thousand packets are sent within a one-hour period, with fewer packets sent if the line is not idle.
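
To make the "lowest 99 percent" summary concrete, here is a small sketch of a trimmed summary over one hour of round trip times. It is an illustration, not the whitebox implementation: the function name is hypothetical and the number of retained samples is rounded down, which is an assumption.

```python
import numpy as np

def trimmed_rtt_summary(rtts_us, trim_percent=1):
    """Summarize RTTs after discarding the slowest `trim_percent` percent of samples."""
    ordered = np.sort(np.asarray(rtts_us, dtype=float))
    keep = max(1, len(ordered) * (100 - trim_percent) // 100)  # integer count, rounded down
    kept = ordered[:keep]
    return {
        "min": float(kept.min()),
        "max": float(kept.max()),
        "mean": float(kept.mean()),
        "std": float(kept.std()),
    }

# Illustrative sample: 100 round trip times of 1..100 microseconds
summary = trimmed_rtt_summary(range(1, 101))
```

With 100 samples the single slowest value is dropped, so the reported maximum is the 99th value rather than the 100th.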


In [304]:
udp4_csv_name = 'curr_udplatency'
# IPv6 UDP latency/loss results :: fewer tests
udp6_csv_name = 'curr_udplatency6'

udp_combined = []

for mo in ['11', '12']:
    for udp_csv_name in [udp4_csv_name, udp6_csv_name]:
        path = f"data/2020{mo}"

        # IPv4 or IPv6 UDP latency/loss results (the udp4* names are reused for both files)
        udp4 = pd.read_csv(f'{path}/{udp_csv_name}.csv')
        udp4_suc = udp4.query('(successes >= 50) & (successes >= failures) & (failures / (successes + failures) <= 0.1) & (rtt_min >= 500) & (rtt_max - rtt_min <= 300000) & (rtt_avg > 0)')
        udp4_fil = udp4_suc.query(f"((target.str.startswith('samknows')) & (target.str.endswith('level3.net'))) | ((target.str.startswith('sp')) & (target.str.endswith('us.samknows.com')))")
        print(path, udp_csv_name, udp4.shape, len(set(udp4.unit_id)), udp4.query('(successes >= 50) & (successes >= failures) & (failures / (successes + failures) <= 0.1)').shape, udp4_suc.shape, udp4_fil.shape)
        udp_combined.append(udp4_fil[['unit_id', 'rtt_avg', 'successes', 'failures']])

udp_combined = pd.concat(udp_combined)
udp_combined.shape

data/202011 (7688840, 10) 5768 (7152591, 10) (6919870, 10) (3937631, 10)
data/202011 (842896, 10) 1323 (784237, 10) (771352, 10) (757728, 10)
data/202012 (7933310, 10) 5632 (7376496, 10) (7129289, 10) (3957332, 10)
data/202012 (892829, 10) 1304 (824559, 10) (805913, 10) (792508, 10)


(9445199, 4)

time: 35.1 s (started: 2022-03-11 14:22:40 -05:00)


In [305]:
udp_combined

Unnamed: 0,unit_id,rtt_avg,successes,failures
1,386,8704,83,0
7,386,7868,124,0
9,386,8017,965,0
13,386,7900,2064,1
15,386,8905,2076,0
...,...,...,...,...
892824,43088693,3945,1119,0
892825,43088693,3941,1835,0
892826,43088693,3979,1563,0
892827,43088693,3908,376,0


time: 5.09 ms (started: 2022-03-11 14:23:35 -05:00)


In [318]:
agg_latency_lossrate = {'rtt_avg' : 'mean', 'successes': 'sum', 'failures': 'sum'}
udp_completed = udp_combined.groupby('unit_id').agg(agg_latency_lossrate)
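# NOTE: this is lost packets per received packet (failures / successes); the
# cleansing rule above measures loss as failures / (successes + failures).
udp_completed['lossrate'] = round(udp_completed.failures / udp_completed.successes * 100, 2)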
udp_completed['latency'] = round(udp_completed['rtt_avg'] / 1000, 0).astype(int)
udp_completed = udp_completed[['latency', 'lossrate']].reset_index()

time: 580 ms (started: 2022-03-11 14:34:46 -05:00)


In [320]:
udp_completed

Unnamed: 0,unit_id,latency,lossrate
0,386,8,0.01
1,390,12,0.02
2,422,12,0.03
3,431,7,9.11
4,447,11,0.02
...,...,...,...
5758,41408417,6,0.04
5759,41504393,35,0.01
5760,43088693,4,0.00
5761,45128793,19,0.03


time: 6.87 ms (started: 2022-03-11 14:35:00 -05:00)


In [341]:
udp_completed.describe()

Unnamed: 0,unit_id,latency,lossrate
count,5763.0,5763.0,5763.0
mean,14649820.0,29.600729,0.143699
std,16337380.0,56.846118,0.459525
min,386.0,1.0,0.0
25%,671218.0,14.0,0.02
50%,3880029.0,22.0,0.04
75%,26437420.0,32.0,0.11
max,45564880.0,762.0,10.72


time: 11.9 ms (started: 2022-03-11 15:05:39 -05:00)


# JITTER

*The Voice over IP (VoIP)* test operates over UDP and utilizes bidirectional traffic, as is typical for voice calls. The test measures jitter, delay, and loss. **Jitter is calculated using the Packet Delay Variation (PDV)**. The 99th percentile is recorded and used in all calculations when deriving the PDV.
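
The report does not spell out the PDV formula. As a hedged illustration only, the sketch below follows the RFC 5481 style of definition, in which each packet's delay variation is its delay minus the minimum delay in the stream, and then takes the 99th percentile of those values. The function name and the sample delays are hypothetical, and the actual MBA client may compute this differently.

```python
import numpy as np

def pdv_99(delays_us):
    """99th-percentile packet delay variation, relative to the minimum delay (assumed definition)."""
    delays = np.asarray(delays_us, dtype=float)
    variation = delays - delays.min()          # per-packet delay above the fastest packet
    return float(np.percentile(variation, 99))

# Illustrative one-way delays: 1000..1100 microseconds in 1 µs steps
delays_us = list(range(1000, 1101))
jitter_us = pdv_99(delays_us)
```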

#### File: curr_udpjitter.csv , fields:
- jitter_down	Downstream Jitter measured (Units: microseconds)
- jitter_up	Upstream Jitter measured (Units: microseconds)
- latency	99th percentile of round trip times for all packets
- successes	Number of successes (always 1 or 0 for this test)
- failures	Number of failures (always 1 or 0 for this test)

#### CONDITIONS
- where target like 'samknows%level3.net' or target like 'sp%-us.samknows.com'
- where (successes > 0) and (jitter_up > 0) and (jitter_down > 0)

In [330]:
jitter_csv_name = 'curr_udpjitter'
jitter_combined = []

for mo in ['11', '12']:

    path = f"data/2020{mo}"

    jitter = pd.read_csv(f'{path}/{jitter_csv_name}.csv')
    # Keep successful tests only (successes > 0), per the CONDITIONS above
    jitter_suc = jitter.query('(successes > 0) & (jitter_up > 0) & (jitter_down > 0)')
    jitter_fil = jitter_suc.query(f"((target.str.startswith('samknows')) & (target.str.endswith('level3.net'))) | ((target.str.startswith('sp')) & (target.str.endswith('us.samknows.com')))")
    print(path, jitter_csv_name, jitter.shape, len(set(jitter.unit_id)), jitter.query('(successes >= 0)').shape, jitter_suc.shape, jitter_fil.shape)
    jitter_combined.append(jitter_fil[['unit_id', 'jitter_down', 'latency', 'successes', 'failures']])

jitter_combined = pd.concat(jitter_combined)
jitter_combined.shape

data/202011 curr_udpjitter (6541753, 16) 5701 (6541753, 16) (6327191, 16) (6273968, 16)
data/202012 curr_udpjitter (6607214, 16) 5562 (6607214, 16) (6387793, 16) (6331518, 16)


(12605486, 5)

time: 33 s (started: 2022-03-11 14:58:18 -05:00)


In [None]:
jitter_combined

Unnamed: 0,unit_id,jitter_down,latency,successes,failures
0,386,761,9332,1,0
1,386,743,8482,1,0
2,386,756,8313,1,0
3,386,753,8241,1,0
4,386,762,9225,1,0
...,...,...,...,...,...
6607209,45128793,307,33706,1,0
6607210,45128793,243,17918,1,0
6607211,45128793,302,18261,1,0
6607212,45128793,342,20033,1,0


time: 5.67 ms (started: 2022-03-11 14:59:23 -05:00)


In [342]:
agg_jitter = {'jitter_down' : 'mean', 'latency' : 'mean', 'successes': 'sum', 'failures': 'sum'}
jitter_completed = jitter_combined.groupby('unit_id').agg(agg_jitter)
# Convert microseconds to milliseconds
jitter_completed['jitter_down'] = round(jitter_completed['jitter_down'] / 1000, 2)

# jitter_completed['lossrate'] = round(jitter_completed.failures / jitter_completed.successes * 100, 2)
# jitter_completed['latency'] = round(jitter_completed['latency'] / 1000, 0).astype(int)
# jitter_completed = jitter_completed[['jitter_down', 'latency', 'lossrate']].reset_index()

jitter_completed = jitter_completed[['jitter_down']].reset_index()

time: 459 ms (started: 2022-03-11 15:07:35 -05:00)


In [343]:
jitter_completed

Unnamed: 0,unit_id,jitter_down
0,386,0.88
1,390,1.04
2,422,0.77
3,431,1.76
4,447,1.15
...,...,...
5729,41408417,0.27
5730,41504393,0.57
5731,43088693,0.06
5732,45128793,0.37


time: 5.56 ms (started: 2022-03-11 15:07:36 -05:00)


In [344]:
jitter_completed.describe()

Unnamed: 0,unit_id,jitter_down
count,5734.0,5734.0
mean,14632480.0,1.755457
std,16343520.0,11.834538
min,386.0,0.03
25%,670377.5,0.29
50%,3873183.0,0.51
75%,26437420.0,1.01
max,45564880.0,796.5


time: 9.55 ms (started: 2022-03-11 15:07:44 -05:00)


## main/merged/combined df

In [367]:
# geocoded and profiled
main = geocoded.merge(profile, on='unit_id', how='inner')
# with download speeds
main1 = main.merge(down_completed, on='unit_id', how='inner')
# with upload speeds
main2 = main1.merge(speed_completed, on='unit_id', how='inner')
# with latency & lossrate
main3 = main2.merge(udp_completed, on='unit_id', how='inner')
# with jitter
main4 = main3.merge(jitter_completed, on='unit_id', how='inner')

main.shape, main1.shape, main2.shape, main3.shape, main4.shape

((3634, 9), (3423, 14), (3422, 19), (3421, 21), (3419, 22))

time: 66.1 ms (started: 2022-03-11 16:23:15 -05:00)
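
The shapes above show each inner merge dropping a few units (3634 down to 3419). One way to see which units are lost at a given step is a left merge with pandas' `indicator=True` option. The sketch below uses small made-up frames rather than the notebook's data; the frame and column names are illustrative.

```python
import pandas as pd

# Toy stand-ins for the unit table and one per-unit metrics table
units = pd.DataFrame({"unit_id": [1, 2, 3]})
download = pd.DataFrame({"unit_id": [1, 2], "down_mean": [95.1, 48.7]})

# A left merge keeps every unit; `_merge` records whether a match was found
checked = units.merge(download, on="unit_id", how="left", indicator=True)
missing_units = checked.loc[checked["_merge"] == "left_only", "unit_id"].tolist()

# The inner merge used in the notebook silently drops those units
kept_rows = len(units.merge(download, on="unit_id", how="inner"))
```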


In [368]:
main4

Unnamed: 0,unit_id,geog_id,geog_type,latitude,longitude,ISP,Technology,Download,Upload,down_count,...,down_min,down_max,up_count,up_mean,up_50%,up_min,up_max,latency,lossrate,jitter_down
0,447,42003473100,tract,40.391591,-80.049656,Verizon,Fiber,75.0,75.000,355.0,...,4.23,86.37,348.0,88.59,88.76,81.54,89.14,11,0.02,1.15
1,477,340076033014,blockgroup,39.935827,-75.020504,Verizon,Fiber,50.0,50.000,349.0,...,47.67,57.23,347.0,62.68,62.72,55.32,63.29,13,0.01,0.94
2,522,511539014031,blockgroup,38.766116,-77.501589,Verizon,Fiber,50.0,50.000,334.0,...,48.72,57.22,333.0,62.69,62.73,53.55,63.11,13,0.01,0.89
3,562,4013108400,tract,33.501323,-112.022745,Cox,Cable,10.0,1.000,247.0,...,7.40,11.95,246.0,0.94,0.95,0.64,1.20,22,0.20,1.28
4,566,330110195021,blockgroup,42.897465,-71.679147,Comcast,Cable,100.0,5.000,317.0,...,22.57,118.79,309.0,5.53,6.00,3.30,6.26,20,0.04,0.85
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3414,26226661,13,state,33.410677,-83.891248,Windstream,DSL,50.0,1.500,113.0,...,15.00,42.35,110.0,1.69,1.72,1.24,1.86,20,0.07,0.53
3415,996187,29,state,38.432921,-92.234929,Windstream,DSL,12.0,1.000,331.0,...,8.37,13.23,327.0,1.29,1.29,0.91,1.40,41,0.25,0.73
3416,3898121,19,state,41.936630,-93.037218,Windstream,DSL,10.0,1.000,357.0,...,6.96,11.11,359.0,1.07,1.07,0.96,1.15,31,0.08,0.92
3417,32833393,21,state,37.838308,-85.261296,Windstream,DSL,10.0,1.000,110.0,...,3.79,11.07,103.0,1.05,1.06,0.92,1.14,32,0.39,2.98


time: 20.8 ms (started: 2022-03-11 16:23:17 -05:00)


In [369]:
# def percentage_change(col1,col2):
#     return ((col2 - col1) / col1) * 100
# # numTest down and up are similar
# percentage_change(main4['down_count'], main4['up_count']).describe()    
# Thus, we can safely average(down_count, up_count) = proxy numTest
main4['numTest'] = main4[['down_count', 'up_count']].mean(axis = 1).astype(int)
main5 = main4.drop(columns = ['down_count', 'up_count'])
# This should never happen, since down_count and up_count are always >= 1.
# Just in case: if astype(int) truncates a mean below 1 down to 0, bump it to 1 so numTest never shows up as 0.
main5['numTest'] = main5['numTest'].replace(0, 1)

time: 5.03 ms (started: 2022-03-11 16:27:37 -05:00)


In [370]:
main5.to_csv('completed2020Q4.csv', index= False)

time: 31.5 ms (started: 2022-03-11 16:27:40 -05:00)


# Technology >> ISP >> Service tier: Stats

In [372]:
# 'Fiber', 'Cable', 'DSL'
main5.Technology.unique()

array(['Fiber', 'Cable', 'DSL'], dtype=object)

time: 2.29 ms (started: 2022-03-11 16:27:47 -05:00)
