# Fixed Broadband Coverage Analysis - UK - 2019 - 2023

## Part 1 - Data Cleaning

DISCLAIMER - This project has been carried out as part of Open University TM351 (Data Management and Analysis) module

### Table of Contents:

- [Licensing](#licensing)
- [Imports and Connections](#imports-and-connections)
- [Individual file examination](#individual-file-examination)
- [Entry differences - investigation](#entry-differences---investigation)
- [Entry differences - key summaries](#entry-differences---key-summaries)
- [Column differences - investigation](#column-differences---investigation)
- [Column differences - Key summary](#column-differences---key-summary)
- [Columns and their relations and derivation (origin)](#columns-and-their-relations-and-derivation-origin)
- [Summary - Column relations and derivation](#summary---column-relations-and-derivation)
- [Addressing differences and inconsistencies - Combining 'Number of premises with SFBB availability' and 'Number of premises able to receive SFBB from FWA' - 2019 dataset](#addressing-differences-and-inconsistencies---combining-number-of-premises-with-sfbb-availability-and-number-of-premises-able-to-receive-sfbb-from-fwa---2019-dataset)
- [Addressing the entry differences in 2019 dataset](#addressing-the-entry-differences-in-2019-dataset)
- [Addressing the entry differences in 2020 dataset](#addressing-the-entry-differences-in-2020-dataset)
- [Number of premises with UFBB (100Mbit/s) availability and UFBB (100Mbit/s) availability (% premises) columns in 2019 dataset](#number-of-premises-with-ufbb-100mbits-availability-and-ufbb-100mbits-availability--premises-columns-in-2019-dataset)
- [Consistency checks on entries on all datasets - 2019, 2020, 2021, 2022, 2023](#consistency-checks-on-entries-on-all-datasets---2019-2020-2021-2022-2023)
- [Saving the cleaned files](#saving-the-cleaned-files)

### Licensing

The data made available by Ofcom is under the Open Government License v3.0. Under the Open Governement License v3.0, I am allowed to:

Copy, publish, distribute, and transmit the information. - This  means I can share the information with others in its original form or after having published it myself in any medium or format.
    
Adapt the information - I am allowed to modify or change the information to suit my needs, which could invovlve editing texts, altering data sets, or integrating it into a new work.
    
Exploit the information commercially and non-commercially - I can use the information in any way that could lead to profit, such as including it in a product I am selling or service I am offering, as well as in non-commercial contexts.
    
Whenever I use this information, I must acknowledge the source. If the Information Provider specifies an attribution statement, I must include or link to that. If not, I should use "Contains public sector information licensed under the Open Government Licence v3.0." If my work includes information from multiple sources and listing them all is impractical, I may link to a resource that collectively acknowledges those sources.

Source: https://www.ofcom.org.uk/research-and-data/multi-sector-research/infrastructure-research

### Imports and Connections

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import chardet
import glob
import re

import pymongo
import folium

This question requires me to import the data, clean it, and store it in MongoDB. Nothing is said how and where the data should be imported and cleaned. The only specification, in my understanding, is that once data is cleaned, it has to be stored in MongoDB.

So, my strategy for Question 1 will be to load each file in DataFrame, perform some basic investigation to familiarise myself with the data, then clean it and load it in MongoDB. 

The reason why I have chosen the use of DataFrame over MongoDB is that I find that DataFrames provide a wide array of built-in functions for data manipulation, making it straightforward to filter, sort and transform data. From looking at the Ofcom interactive report, it seems that the data is structured and performing various manipulations would be easier using DataFrames. I know that DataFrames are designed for in-memory computing, which can lead to faster processing times for data cleaning tasks. Having looked at the CSV files in Excel, I saw that the datasets are of medium-size and using DataFrames would be a suitable choice. Also, I am feeling more comfortable working with DataFrames.

### Individual file examination

Before I proceed with cleaning the files, I would like to perform some basic checks on each individual file for any discrepancies. That will also give me a broad understanding of the data each file contains.  We have also been told that not all .CSV files have the same column names and that there is a variations of the number of columns. I will start with the 2019 file. 

#### Fixed broadband - 2019 dataset

Starting with the first .CSV file, which is 2019, I will check the first lines.

In [2]:
# Display the first five lines of the fixed broadband dataset for 2019.
! head -n 5 2023J_TMA02_data/Ofcom_fixed/201909_fixed_laua_coverage_r01.csv

laua,laua_name,All Premises,All Matched Premises,SFBB availability (% premises),UFBB availability (% premises),Full Fibre availability (% premises),% of premises unable to receive 2Mbit/s,% of premises unable to receive 5Mbit/s,% of premises unable to receive 10Mbit/s,% of premises unable to receive 30Mbit/s,% of premises below the USO,% of premises with NGA,% of premises able to receive decent broadband from FWA,% of premises able to receive SFBB from FWA,Number of premises with SFBB availability,Number of premises with UFBB availability,Number of premises with Full Fibre availability,Number of premises unable to receive 2Mbit/s,Number of premises unable to receive 5Mbit/s,Number of premises unable to receive 10Mbit/s,Number of premises unable to receive 30Mbit/s,Number of premises below the USO,Number of premises with NGA,Number of premises able to receive decent broadband from FWA,Number of premises able to receive SFBB from FWA,Number of premises with 30<300Mbit/s download speed,Nu

In the first five rows we see some textual input, which is likely to be the columns' names. Nothing obviously wrong with the first five rows of the file. I will proceed with checking the last five rows of the file.

In [3]:
# Display the last five lines of the fixed broadband dataset for 2019.
! tail -n 5 2023J_TMA02_data/Ofcom_fixed/201909_fixed_laua_coverage_r01.csv

E07000238,WYCHAVON,62475,62114,82.9,8.6,5.1,0.2,0.9,2,8,0.9,97.8,9.3,1.4,57130,5352,3162,129,553,1223,4984,555,61070,5827,848,51778,5352,129,424,670,3761,82.9,8.6,0.2,0.7,1.1,6
E07000007,WYCOMBE,76433,76345,68.5,26.5,2.2,0.1,0.6,1,4.9,0.3,99.2,0,0,72564,20219,1664,90,489,731,3781,219,75802,0,0,52345,20219,90,399,242,3050,68.5,26.5,0.1,0.5,0.3,4
E07000128,WYRE,56343,56280,88.4,5.9,5.5,0.1,0.3,0.7,5.5,0.1,98.8,7.3,1.4,53173,3340,3108,69,183,382,3107,71,55673,4110,775,49833,3340,69,114,199,2725,88.4,5.9,0.1,0.2,0.4,4.8
E07000239,WYRE FOREST,48100,48061,49.3,46.7,0.9,0.3,0.6,1.4,3.9,0.5,99.4,2.9,0.4,46203,22479,420,140,291,662,1858,264,47829,1412,172,23724,22479,140,151,371,1196,49.3,46.7,0.3,0.3,0.8,2.5
E06000014,YORK,98735,98548,23,70.8,43.6,0,0.3,0.6,6,0.2,95.6,7,3.9,92626,69871,43077,36,261,629,5922,191,94419,6956,3811,22755,69871,36,225,368,5293,23,70.8,0,0.2,0.4,5.4


It all seems ok. There is no metadata included in this file. So, I will proceed loading the file in a data frame. That will allow me to carry certain basic checks such as seeing the shape of the data frame. 

In [4]:
# Import the dataset for fixed coverage for 2019 into a new DataFrame.
fixed_coverage_2019_df=pd.read_csv('2023J_TMA02_data/Ofcom_fixed/201909_fixed_laua_coverage_r01.csv')

I want to ensure the dataset's integrity. So, I will load the first five and last five rows of the data frame. There is a large number of columns, so using just the .head() method will not display them all.

I will use the display.max_columns option to view all columns in order to get familiar with the dataset. I have used the official Pandas documentation to understand how to use the 'display.max.columns' option.

(Reference: https://pandas.pydata.org/docs/user_guide/options.html)

In [5]:
pd.set_option('display.max_columns', None)

In [6]:
fixed_coverage_2019_df.head()

Unnamed: 0,laua,laua_name,All Premises,All Matched Premises,SFBB availability (% premises),UFBB availability (% premises),Full Fibre availability (% premises),% of premises unable to receive 2Mbit/s,% of premises unable to receive 5Mbit/s,% of premises unable to receive 10Mbit/s,% of premises unable to receive 30Mbit/s,% of premises below the USO,% of premises with NGA,% of premises able to receive decent broadband from FWA,% of premises able to receive SFBB from FWA,Number of premises with SFBB availability,Number of premises with UFBB availability,Number of premises with Full Fibre availability,Number of premises unable to receive 2Mbit/s,Number of premises unable to receive 5Mbit/s,Number of premises unable to receive 10Mbit/s,Number of premises unable to receive 30Mbit/s,Number of premises below the USO,Number of premises with NGA,Number of premises able to receive decent broadband from FWA,Number of premises able to receive SFBB from FWA,Number of premises with 30<300Mbit/s download speed,Number of premises with >=300Mbit/s download speed,Number of premises with 0<2Mbit/s download speed,Number of premises with 2<5Mbit/s download speed,Number of premises with 5<10Mbit/s download speed,Number of premises with 10<30Mbit/s download speed,% of premises with 30<300Mbit/s download speed,% of premises with >=300Mbit/s download speed,% of premises with 0<2Mbit/s download speed,% of premises with 2<5Mbit/s download speed,% of premises with 5<10Mbit/s download speed,% of premises with 10<30Mbit/s download speed
0,S12000033,ABERDEEN CITY,125441,125311,73.3,20.1,13.1,0.0,0.2,0.7,6.5,0.2,96.8,0.0,0.0,117152,25163,16410,49,219,884,8159,189,121476,0,0,91989,25163,49,170,665,7275,73.3,20.1,0.0,0.1,0.5,5.8
1,S12000034,ABERDEENSHIRE,125085,124305,78.5,2.8,2.7,2.5,5.9,9.9,18.1,3.6,93.6,0.0,0.0,101652,3472,3332,3163,7339,12332,22653,4519,117051,0,0,98180,3472,3163,4176,4993,10321,78.5,2.8,2.5,3.3,4.0,8.3
2,E07000223,ADUR,29770,29760,16.3,82.4,0.6,0.0,0.1,0.1,1.3,0.0,99.1,0.0,0.0,29383,24543,193,0,16,44,377,12,29514,0,0,4840,24543,0,16,28,333,16.3,82.4,0.0,0.1,0.1,1.1
3,E07000026,ALLERDALE,51385,51284,89.8,1.7,1.7,1.2,2.6,3.6,8.3,1.2,98.3,2.3,2.2,47003,866,866,619,1323,1873,4281,592,50507,1164,1114,46137,866,619,704,550,2408,89.8,1.7,1.2,1.4,1.1,4.7
4,E07000032,AMBER VALLEY,60674,60596,67.4,25.3,22.1,0.1,0.9,2.1,7.2,0.7,98.2,0.0,0.0,56232,15339,13412,89,549,1254,4364,440,59578,0,0,40893,15339,89,460,705,3110,67.4,25.3,0.1,0.8,1.2,5.1


There is a lot of columns. But it is presented in a reasonably clean format. The only noticeable thing is that columns are a bit mixed up and potentialy their name could be amended for better readability. Apart from that it all seems ok. Let's have a look at the bottom of the 2019 file.

In [7]:
fixed_coverage_2019_df.tail()

Unnamed: 0,laua,laua_name,All Premises,All Matched Premises,SFBB availability (% premises),UFBB availability (% premises),Full Fibre availability (% premises),% of premises unable to receive 2Mbit/s,% of premises unable to receive 5Mbit/s,% of premises unable to receive 10Mbit/s,% of premises unable to receive 30Mbit/s,% of premises below the USO,% of premises with NGA,% of premises able to receive decent broadband from FWA,% of premises able to receive SFBB from FWA,Number of premises with SFBB availability,Number of premises with UFBB availability,Number of premises with Full Fibre availability,Number of premises unable to receive 2Mbit/s,Number of premises unable to receive 5Mbit/s,Number of premises unable to receive 10Mbit/s,Number of premises unable to receive 30Mbit/s,Number of premises below the USO,Number of premises with NGA,Number of premises able to receive decent broadband from FWA,Number of premises able to receive SFBB from FWA,Number of premises with 30<300Mbit/s download speed,Number of premises with >=300Mbit/s download speed,Number of premises with 0<2Mbit/s download speed,Number of premises with 2<5Mbit/s download speed,Number of premises with 5<10Mbit/s download speed,Number of premises with 10<30Mbit/s download speed,% of premises with 30<300Mbit/s download speed,% of premises with >=300Mbit/s download speed,% of premises with 0<2Mbit/s download speed,% of premises with 2<5Mbit/s download speed,% of premises with 5<10Mbit/s download speed,% of premises with 10<30Mbit/s download speed
377,E07000238,WYCHAVON,62475,62114,82.9,8.6,5.1,0.2,0.9,2.0,8.0,0.9,97.8,9.3,1.4,57130,5352,3162,129,553,1223,4984,555,61070,5827,848,51778,5352,129,424,670,3761,82.9,8.6,0.2,0.7,1.1,6.0
378,E07000007,WYCOMBE,76433,76345,68.5,26.5,2.2,0.1,0.6,1.0,4.9,0.3,99.2,0.0,0.0,72564,20219,1664,90,489,731,3781,219,75802,0,0,52345,20219,90,399,242,3050,68.5,26.5,0.1,0.5,0.3,4.0
379,E07000128,WYRE,56343,56280,88.4,5.9,5.5,0.1,0.3,0.7,5.5,0.1,98.8,7.3,1.4,53173,3340,3108,69,183,382,3107,71,55673,4110,775,49833,3340,69,114,199,2725,88.4,5.9,0.1,0.2,0.4,4.8
380,E07000239,WYRE FOREST,48100,48061,49.3,46.7,0.9,0.3,0.6,1.4,3.9,0.5,99.4,2.9,0.4,46203,22479,420,140,291,662,1858,264,47829,1412,172,23724,22479,140,151,371,1196,49.3,46.7,0.3,0.3,0.8,2.5
381,E06000014,YORK,98735,98548,23.0,70.8,43.6,0.0,0.3,0.6,6.0,0.2,95.6,7.0,3.9,92626,69871,43077,36,261,629,5922,191,94419,6956,3811,22755,69871,36,225,368,5293,23.0,70.8,0.0,0.2,0.4,5.4


Again, nothing wrong at glance. I will proceed with a slightly more in-depth look.

Next step is to check the shape of the data frame.

In [8]:
fixed_coverage_2019_df.shape

(382, 38)

The shape of the data frame tells us there are 382 rows (entries) with 38 columns. That gives me an idea of the number of records we might be dealing with. (I will be keeping a manual/handwritten record of such values, while I am performing the data analysis.)

In [9]:
fixed_coverage_2019_df.describe()

Unnamed: 0,All Premises,All Matched Premises,SFBB availability (% premises),UFBB availability (% premises),Full Fibre availability (% premises),% of premises unable to receive 2Mbit/s,% of premises unable to receive 5Mbit/s,% of premises unable to receive 10Mbit/s,% of premises unable to receive 30Mbit/s,% of premises below the USO,% of premises with NGA,% of premises able to receive decent broadband from FWA,% of premises able to receive SFBB from FWA,Number of premises with SFBB availability,Number of premises with UFBB availability,Number of premises with Full Fibre availability,Number of premises unable to receive 2Mbit/s,Number of premises unable to receive 5Mbit/s,Number of premises unable to receive 10Mbit/s,Number of premises unable to receive 30Mbit/s,Number of premises below the USO,Number of premises with NGA,Number of premises able to receive decent broadband from FWA,Number of premises able to receive SFBB from FWA,Number of premises with 30<300Mbit/s download speed,Number of premises with >=300Mbit/s download speed,Number of premises with 0<2Mbit/s download speed,Number of premises with 2<5Mbit/s download speed,Number of premises with 5<10Mbit/s download speed,Number of premises with 10<30Mbit/s download speed,% of premises with 30<300Mbit/s download speed,% of premises with >=300Mbit/s download speed,% of premises with 0<2Mbit/s download speed,% of premises with 2<5Mbit/s download speed,% of premises with 5<10Mbit/s download speed,% of premises with 10<30Mbit/s download speed
count,382.0,382.0,382.0,382.0,382.0,382.0,382.0,382.0,382.0,382.0,382.0,382.0,382.0,382.0,382.0,382.0,382.0,382.0,382.0,382.0,382.0,382.0,382.0,382.0,382.0,382.0,382.0,382.0,382.0,382.0,382.0,382.0,382.0,382.0,382.0,382.0
mean,80678.520942,80532.005236,46.587435,46.857068,8.586649,0.408639,1.091361,2.078534,6.33534,0.643717,97.397644,5.867016,2.742932,76024.531414,42022.903141,8350.732984,246.921466,677.044503,1333.057592,4507.473822,406.706806,78605.756545,4157.753927,1884.426702,34001.628272,42022.903141,246.921466,430.123037,656.013089,3174.41623,46.587435,46.857068,0.408639,0.680366,0.984817,4.25733
std,52521.559617,52472.88587,25.255563,28.551654,10.401911,0.794339,1.663883,2.823999,5.478259,1.072613,3.397988,12.841409,7.180163,50371.705564,42697.918438,15133.464705,449.014551,971.39299,1628.188018,4086.479619,631.433052,50768.883731,10531.804168,5580.149853,25849.262946,42697.918438,449.014551,551.411163,718.525018,2971.853204,25.255563,28.551654,0.794339,0.922826,1.247035,3.413865
min,1681.0,1678.0,1.1,0.0,0.0,0.0,0.0,0.0,0.2,0.0,53.9,0.0,0.0,1586.0,0.0,0.0,0.0,0.0,1.0,62.0,2.0,1678.0,0.0,0.0,1375.0,0.0,0.0,0.0,0.0,53.0,1.1,0.0,0.0,0.0,0.0,0.1
25%,48100.25,48070.25,24.225,18.45,2.4,0.0,0.1,0.4,2.7,0.1,96.9,0.0,0.0,44656.25,10765.5,1256.25,15.0,103.25,309.75,1890.25,54.0,47135.0,0.0,0.0,17614.5,10765.5,15.0,72.25,194.25,1454.75,24.225,18.45,0.0,0.1,0.3,2.1
50%,64867.5,64782.5,42.45,51.8,4.95,0.1,0.4,1.0,4.8,0.2,98.3,0.0,0.0,59280.0,31863.0,3166.5,77.5,318.0,762.0,3337.0,178.0,62937.0,0.0,0.0,27363.0,31863.0,77.5,225.0,448.0,2401.5,42.45,51.8,0.1,0.3,0.6,3.55
75%,98485.75,98413.25,69.85,72.175,11.1,0.4,1.3,2.5,7.8,0.7,99.075,3.3,0.6,92842.0,60355.75,8675.75,279.75,828.0,1649.25,5416.25,510.0,94715.0,2374.5,417.0,42421.25,60355.75,279.75,536.0,843.5,3892.5,69.85,72.175,0.4,0.8,1.2,5.7
max,469208.0,468772.0,97.8,97.0,97.0,7.9,13.5,21.4,46.0,7.1,99.9,61.6,51.7,446234.0,379277.0,123503.0,3947.0,7339.0,12332.0,28223.0,4539.0,446416.0,95210.0,43367.0,216836.0,379277.0,3947.0,4176.0,4993.0,25974.0,97.8,97.0,7.9,5.7,11.4,46.0


There are many columns and it is difficult to identify if there is anything wrong by just looking at the table. The most visible element is that there are some fields of zeros in the 'min' row. For example, in the 'SFBB availability (% premises)', a zero value in the 'min' row suggests there are areas where SFBB (Superfast Broadband) is not available at all. This could indicate regions with limited or no access to high-speed broadband services.

Similarly, in the 'UFBB availability (% premises)' and 'Full Fibre availability (% premises)' columns, a zero value in the 'min' row indicates areas where Ultrafast Broadband (UFBB) and FUll Fibre broadband are not available at all.

The other zero values in the 'min' row represent minimum observed values for various metrics such as the percentage of premises unable to receive certain broadband speeds (2 Mbit/s, 5 Mbit/s, etc.) or the percentage of premises below the Universal Service Obligation (USO).

Additionally, in the columns containing percentage values, the minimum and maximum values should not be negative (under 0) and over 100, respectively. In this case, there are no presence of such percentage values.

Let's have a closer look at the dataset. 

In [10]:
fixed_coverage_2019_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 382 entries, 0 to 381
Data columns (total 38 columns):
 #   Column                                                        Non-Null Count  Dtype  
---  ------                                                        --------------  -----  
 0   laua                                                          382 non-null    object 
 1   laua_name                                                     382 non-null    object 
 2   All Premises                                                  382 non-null    int64  
 3   All Matched Premises                                          382 non-null    int64  
 4   SFBB availability (% premises)                                382 non-null    float64
 5   UFBB availability (% premises)                                382 non-null    float64
 6   Full Fibre availability (% premises)                          382 non-null    float64
 7   % of premises unable to receive 2Mbit/s                       382 non-n

There don't appear to be any anomalies in terms of missing values or unexpected data types. The DataFrame consist of 382 entries and 38 columns, as we have already noticed. All columns seem to have no missing values (non-null), and their data types appear appropriate for the kind of data they represent. There is a mixture of object, integer and float data types. The object type is for 'laua' and 'laua_name', 'int64' for numerical values representing counts, and 'flot64' for numerical values representing percentages.

I would like to be extra sure that there are no missing values in the dataframe. The method below counts the number of missing values in each column of the DataFrame, allowing me to identify if any columns have missing data.

In [11]:
fixed_coverage_2019_df.isnull().sum()

laua                                                            0
laua_name                                                       0
All Premises                                                    0
All Matched Premises                                            0
SFBB availability (% premises)                                  0
UFBB availability (% premises)                                  0
Full Fibre availability (% premises)                            0
% of premises unable to receive 2Mbit/s                         0
% of premises unable to receive 5Mbit/s                         0
% of premises unable to receive 10Mbit/s                        0
% of premises unable to receive 30Mbit/s                        0
% of premises below the USO                                     0
% of premises with NGA                                          0
% of premises able to receive decent broadband from FWA         0
% of premises able to receive SFBB from FWA                     0
Number of 

That is good news. There is no missing values.

I will now check for any duplicates present in the dataset.

In [12]:
fixed_coverage_2019_df.duplicated().sum()

np.int64(0)

The data seems to be clean in terms of missing values and duplicate rows.

At this stage, I am satisfied with the very basic checks I have performed. At a later stage, I will dig a bit deeper to ensure there are no anomalies or dirty data. For now, I will proceed applying exactly the same steps to the 2020 file.

#### Fixed broadband - 2020 dataset

In [13]:
# Display the first five rows of the fixed broadband dataset for 2020.
! head -n 5 2023J_TMA02_data/Ofcom_fixed/202009_fixed_laua_coverage_r01.csv

laua,laua_name,All Premises,All Matched Premises,SFBB availability (% premises),UFBB (100Mbit/s) availability (% premises),UFBB availability (% premises),Full Fibre availability (% premises),Gigabit availability (% premises),% of premises unable to receive 2Mbit/s,% of premises unable to receive 5Mbit/s,% of premises unable to receive 10Mbit/s,% of premises unable to receive 30Mbit/s,% of premises below the USO,% of premises with NGA,% of premises able to receive decent broadband from FWA,Number of premises with SFBB availability,Number of premises with UFBB (100Mbit/s) availability,Number of premises with UFBB availability,Number of premises with Full Fibre availability,Number of premises with Gigabit availability,Number of premises unable to receive 2Mbit/s,Number of premises unable to receive 5Mbit/s,Number of premises unable to receive 10Mbit/s,Number of premises unable to receive 30Mbit/s,Number of premises below the USO,Number of premises with NGA,Number of premises able to recei

In [14]:
# Display the last five rows of the fixed broadband dataset for 2020.
! tail -n 5 2023J_TMA02_data/Ofcom_fixed/202009_fixed_laua_coverage_r01.csv

W06000006,WREXHAM,65867,65212,94.4,37.3,36.6,36.2,36.2,0.3,1.0,1.8,4.6,1.4,97.2,43.9,62170,24551,24136,23863,23863,194,639,1217,3042,930,64021,28917,38034,24136,194,445,578,1825,57.7,36.6,0.3,0.7,0.9,2.8
E07000238,WYCHAVON,62536,62215,93.8,19.2,15.2,8.0,8.1,0.1,0.7,1.3,5.7,0.8,98.6,0.4,58643,12032,9524,4988,5066,83,423,822,3572,490,61652,225,49119,9524,83,340,399,2750,78.5,15.2,0.1,0.5,0.6,4.4
E07000128,WYRE,56527,56411,95.1,22.7,22.7,22.3,22.3,0.1,0.3,0.7,4.7,0.2,98.8,0.3,53739,12856,12852,12612,12612,61,174,368,2672,133,55837,186,40887,12852,61,113,194,2304,72.3,22.7,0.1,0.2,0.3,4.1
E07000239,WYRE FOREST,48237,48173,96.8,47.9,47.8,2.0,47.8,0.2,0.5,0.8,3.1,0.4,99.5,3.5,46680,23125,23035,961,23034,84,218,389,1493,183,48001,1675,23645,23035,84,134,171,1104,49.0,47.8,0.2,0.3,0.4,2.3
E06000014,YORK,95949,95674,94.1,75.5,71.9,54.8,54.8,0.0,0.2,0.8,5.6,0.3,96.0,3.7,90313,72415,68952,52549,52549,39,209,774,5361,296,92085,3522,21361,68952,39,170,565,4587,22.3,71.9,0.0,0.2,0.6,4.8


Both, the very begining and end of the .csv file appear to be free of any anomalies and metadata. Next, I will load the file into a data frame.

In [15]:
# Import the dataset for fixed coverage for 2020 into a new DataFrame.
fixed_coverage_2020_df=pd.read_csv('2023J_TMA02_data/Ofcom_fixed/202009_fixed_laua_coverage_r01.csv')
fixed_coverage_2020_df.head()

Unnamed: 0,laua,laua_name,All Premises,All Matched Premises,SFBB availability (% premises),UFBB (100Mbit/s) availability (% premises),UFBB availability (% premises),Full Fibre availability (% premises),Gigabit availability (% premises),% of premises unable to receive 2Mbit/s,% of premises unable to receive 5Mbit/s,% of premises unable to receive 10Mbit/s,% of premises unable to receive 30Mbit/s,% of premises below the USO,% of premises with NGA,% of premises able to receive decent broadband from FWA,Number of premises with SFBB availability,Number of premises with UFBB (100Mbit/s) availability,Number of premises with UFBB availability,Number of premises with Full Fibre availability,Number of premises with Gigabit availability,Number of premises unable to receive 2Mbit/s,Number of premises unable to receive 5Mbit/s,Number of premises unable to receive 10Mbit/s,Number of premises unable to receive 30Mbit/s,Number of premises below the USO,Number of premises with NGA,Number of premises able to receive decent broadband from FWA,Number of premises with 30<300Mbit/s download speed,Number of premises with >=300Mbit/s download speed,Number of premises with 0<2Mbit/s download speed,Number of premises with 2<5Mbit/s download speed,Number of premises with 5<10Mbit/s download speed,Number of premises with 10<30Mbit/s download speed,% of premises with 30<300Mbit/s download speed,% of premises with >=300Mbit/s download speed,% of premises with 0<2Mbit/s download speed,% of premises with 2<5Mbit/s download speed,% of premises with 5<10Mbit/s download speed,% of premises with 10<30Mbit/s download speed
0,S12000033,ABERDEEN CITY,126176,125948,94.6,49.0,41.6,34.9,34.9,0.0,0.2,0.7,5.2,0.2,97.0,0.0,119358,61798,52461,44051,44051,55,208,881,6590,300,122434,0,66897,52461,55,153,673,5709,53.0,41.6,0.0,0.1,0.5,4.5
1,S12000034,ABERDEENSHIRE,126065,125176,82.9,7.2,7.0,6.9,6.9,2.6,5.6,9.1,16.4,3.6,94.7,0.0,104472,9118,8872,8732,8732,3234,7099,11516,20704,4538,119331,0,95600,8872,3234,3865,4417,9188,75.8,7.0,2.6,3.1,3.5,7.3
2,E07000223,ADUR,29779,29755,98.8,85.8,85.6,0.6,0.6,0.0,0.0,0.1,1.1,0.1,99.5,0.0,29427,25562,25482,189,189,0,10,34,328,33,29616,0,3945,25482,0,10,24,294,13.2,85.6,0.0,0.0,0.1,1.0
3,E07000026,ALLERDALE,51647,51483,92.3,2.8,2.8,2.8,2.8,1.2,2.3,3.3,7.3,1.2,98.6,2.2,47693,1466,1466,1466,1466,627,1173,1705,3790,634,50931,1160,46227,1466,627,546,532,2085,89.5,2.8,1.2,1.1,1.0,4.0
4,E07000032,AMBER VALLEY,61134,60972,94.7,30.2,26.7,23.6,23.6,0.1,0.5,0.9,5.1,0.4,98.9,0.0,57875,18462,16323,14438,14438,63,280,573,3097,267,60483,0,41552,16323,63,217,293,2524,68.0,26.7,0.1,0.4,0.5,4.1


Nothing out of the ordinary. I proceed with viewing the end of the file.

In [16]:
fixed_coverage_2020_df.tail()

Unnamed: 0,laua,laua_name,All Premises,All Matched Premises,SFBB availability (% premises),UFBB (100Mbit/s) availability (% premises),UFBB availability (% premises),Full Fibre availability (% premises),Gigabit availability (% premises),% of premises unable to receive 2Mbit/s,% of premises unable to receive 5Mbit/s,% of premises unable to receive 10Mbit/s,% of premises unable to receive 30Mbit/s,% of premises below the USO,% of premises with NGA,% of premises able to receive decent broadband from FWA,Number of premises with SFBB availability,Number of premises with UFBB (100Mbit/s) availability,Number of premises with UFBB availability,Number of premises with Full Fibre availability,Number of premises with Gigabit availability,Number of premises unable to receive 2Mbit/s,Number of premises unable to receive 5Mbit/s,Number of premises unable to receive 10Mbit/s,Number of premises unable to receive 30Mbit/s,Number of premises below the USO,Number of premises with NGA,Number of premises able to receive decent broadband from FWA,Number of premises with 30<300Mbit/s download speed,Number of premises with >=300Mbit/s download speed,Number of premises with 0<2Mbit/s download speed,Number of premises with 2<5Mbit/s download speed,Number of premises with 5<10Mbit/s download speed,Number of premises with 10<30Mbit/s download speed,% of premises with 30<300Mbit/s download speed,% of premises with >=300Mbit/s download speed,% of premises with 0<2Mbit/s download speed,% of premises with 2<5Mbit/s download speed,% of premises with 5<10Mbit/s download speed,% of premises with 10<30Mbit/s download speed
374,W06000006,WREXHAM,65867,65212,94.4,37.3,36.6,36.2,36.2,0.3,1.0,1.8,4.6,1.4,97.2,43.9,62170,24551,24136,23863,23863,194,639,1217,3042,930,64021,28917,38034,24136,194,445,578,1825,57.7,36.6,0.3,0.7,0.9,2.8
375,E07000238,WYCHAVON,62536,62215,93.8,19.2,15.2,8.0,8.1,0.1,0.7,1.3,5.7,0.8,98.6,0.4,58643,12032,9524,4988,5066,83,423,822,3572,490,61652,225,49119,9524,83,340,399,2750,78.5,15.2,0.1,0.5,0.6,4.4
376,E07000128,WYRE,56527,56411,95.1,22.7,22.7,22.3,22.3,0.1,0.3,0.7,4.7,0.2,98.8,0.3,53739,12856,12852,12612,12612,61,174,368,2672,133,55837,186,40887,12852,61,113,194,2304,72.3,22.7,0.1,0.2,0.3,4.1
377,E07000239,WYRE FOREST,48237,48173,96.8,47.9,47.8,2.0,47.8,0.2,0.5,0.8,3.1,0.4,99.5,3.5,46680,23125,23035,961,23034,84,218,389,1493,183,48001,1675,23645,23035,84,134,171,1104,49.0,47.8,0.2,0.3,0.4,2.3
378,E06000014,YORK,95949,95674,94.1,75.5,71.9,54.8,54.8,0.0,0.2,0.8,5.6,0.3,96.0,3.7,90313,72415,68952,52549,52549,39,209,774,5361,296,92085,3522,21361,68952,39,170,565,4587,22.3,71.9,0.0,0.2,0.6,4.8


At glance, it all appears to be ok. Let's check its shape.

In [17]:
fixed_coverage_2020_df.shape

(379, 40)

Okay, so the 2020 dataset has 379 entries and 40 columns. That is a difference of 3 entries less in 2020, when compared to 2019. But the number of columns have increased from 38 in 2019 to 40 in 2020 dataset. This needs to be investigated further.

Knowing that entries (rows) in this dataset represent local authority, that indicates changes in coverage areas. I will note this down for later on. I will proceed with applying describe() method to the dataset.

In [18]:
fixed_coverage_2020_df.describe()

Unnamed: 0,All Premises,All Matched Premises,SFBB availability (% premises),UFBB (100Mbit/s) availability (% premises),UFBB availability (% premises),Full Fibre availability (% premises),Gigabit availability (% premises),% of premises unable to receive 2Mbit/s,% of premises unable to receive 5Mbit/s,% of premises unable to receive 10Mbit/s,% of premises unable to receive 30Mbit/s,% of premises below the USO,% of premises with NGA,% of premises able to receive decent broadband from FWA,Number of premises with SFBB availability,Number of premises with UFBB (100Mbit/s) availability,Number of premises with UFBB availability,Number of premises with Full Fibre availability,Number of premises with Gigabit availability,Number of premises unable to receive 2Mbit/s,Number of premises unable to receive 5Mbit/s,Number of premises unable to receive 10Mbit/s,Number of premises unable to receive 30Mbit/s,Number of premises below the USO,Number of premises with NGA,Number of premises able to receive decent broadband from FWA,Number of premises with 30<300Mbit/s download speed,Number of premises with >=300Mbit/s download speed,Number of premises with 0<2Mbit/s download speed,Number of premises with 2<5Mbit/s download speed,Number of premises with 5<10Mbit/s download speed,Number of premises with 10<30Mbit/s download speed,% of premises with 30<300Mbit/s download speed,% of premises with >=300Mbit/s download speed,% of premises with 0<2Mbit/s download speed,% of premises with 2<5Mbit/s download speed,% of premises with 5<10Mbit/s download speed,% of premises with 10<30Mbit/s download speed
count,379.0,379.0,379.0,379.0,379.0,379.0,379.0,379.0,379.0,379.0,379.0,379.0,379.0,379.0,379.0,379.0,379.0,379.0,379.0,379.0,379.0,379.0,379.0,379.0,379.0,379.0,379.0,379.0,379.0,379.0,379.0,379.0,379.0,379.0,379.0,379.0,379.0,379.0
mean,82073.034301,81814.664908,94.187071,54.78285,52.3,14.926121,21.593931,0.388654,0.984433,1.83219,5.44723,0.766755,97.641953,5.130871,77892.448549,49425.741425,47355.754617,14267.683377,21876.833773,233.870712,611.485488,1181.345646,3922.216359,500.023747,80181.403694,3742.271768,30536.693931,47355.754617,233.870712,377.614776,569.860158,2740.870712,41.887335,52.3,0.388654,0.592612,0.845383,3.61372
std,54691.92602,54579.787696,5.414738,28.169708,28.214368,14.855957,22.950617,0.79987,1.621302,2.711307,5.113346,1.166895,3.16125,12.939627,52684.886397,47260.277314,46285.125976,21975.649783,39706.702081,454.429163,943.367329,1550.362919,3807.029031,653.602797,53068.20917,11497.188145,24986.710265,46285.125976,454.429163,517.932346,664.281431,2705.69819,25.349735,28.214368,0.79987,0.873295,1.182164,3.106772
min,1677.0,1666.0,56.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,56.5,0.0,1580.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,29.0,11.0,1666.0,0.0,1126.0,0.0,0.0,0.0,0.0,21.0,0.9,0.0,0.0,0.0,0.0,0.0
25%,48481.0,48342.0,93.0,30.5,26.1,4.55,5.35,0.0,0.1,0.4,2.2,0.2,97.3,0.0,45396.0,17642.0,15148.5,2819.0,3092.0,14.0,82.0,262.5,1628.0,127.5,47744.5,0.0,14371.0,15148.5,14.0,64.0,158.5,1242.5,19.05,26.1,0.0,0.1,0.2,1.7
50%,65648.0,65115.0,95.9,61.7,58.1,10.2,12.7,0.1,0.3,0.8,3.9,0.4,98.5,0.0,59865.0,37624.0,36302.0,6833.0,8569.0,67.0,274.0,646.0,2835.0,268.0,63930.0,0.0,24710.0,36302.0,67.0,188.0,355.0,2024.0,36.1,58.1,0.1,0.2,0.5,2.8
75%,99125.0,98969.0,97.6,78.95,77.75,20.1,25.5,0.4,1.0,2.0,6.65,0.8,99.1,0.9,94574.5,70782.0,66303.0,15438.5,23159.5,232.5,721.5,1350.5,4633.5,583.5,97291.0,668.0,38533.5,66303.0,232.5,458.0,728.5,3246.5,64.55,77.75,0.4,0.7,1.0,4.8
max,474257.0,473084.0,99.6,97.5,97.5,97.5,97.5,8.4,13.8,20.7,43.3,10.8,99.9,78.2,453773.0,427016.0,422455.0,166118.0,419053.0,4196.0,7099.0,11516.0,33878.0,4538.0,454063.0,119441.0,192679.0,422455.0,4196.0,3865.0,4417.0,27281.0,97.6,97.5,8.4,5.5,11.1,43.3


There is nothing that jumps out at glance. There are some zero values in the 'min' row, similarly to the 2019 dataset. Again, I am looking at the 'min' and 'max' values in the columns containing percentages. They seem to be within range.

Let's familiarise with the data types used for each column and check for any non-null count values.

In [19]:
fixed_coverage_2020_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 379 entries, 0 to 378
Data columns (total 40 columns):
 #   Column                                                        Non-Null Count  Dtype  
---  ------                                                        --------------  -----  
 0   laua                                                          379 non-null    object 
 1   laua_name                                                     379 non-null    object 
 2   All Premises                                                  379 non-null    int64  
 3   All Matched Premises                                          379 non-null    int64  
 4   SFBB availability (% premises)                                379 non-null    float64
 5   UFBB (100Mbit/s) availability (% premises)                    379 non-null    float64
 6   UFBB availability (% premises)                                379 non-null    float64
 7   Full Fibre availability (% premises)                          379 non-n

There don't appear to be any anomalies. The DataFrame has 379 entries and 40 columns. All columns have non-null counts matching the total number of entries, indicating no missing values. The data types of each column seem appropriate for the kind of data they represent. 

Let's ensure there are indeed no missing values.

In [20]:
fixed_coverage_2020_df.isnull().sum()

laua                                                            0
laua_name                                                       0
All Premises                                                    0
All Matched Premises                                            0
SFBB availability (% premises)                                  0
UFBB (100Mbit/s) availability (% premises)                      0
UFBB availability (% premises)                                  0
Full Fibre availability (% premises)                            0
Gigabit availability (% premises)                               0
% of premises unable to receive 2Mbit/s                         0
% of premises unable to receive 5Mbit/s                         0
% of premises unable to receive 10Mbit/s                        0
% of premises unable to receive 30Mbit/s                        0
% of premises below the USO                                     0
% of premises with NGA                                          0
% of premi

And let's check for any duplicates.

In [21]:
fixed_coverage_2019_df.duplicated().sum()

np.int64(0)

So far, so good. No missing values in the dataset, which is good news. And no duplicates detected. I am moving onto the 2021 file, applying exactly the same steps.

#### Fixed broadband - 2021 dataset

Let's check the first and last 5 lines of the 2021 file.

In [22]:
# Display the first five rows of the fixed broadband dataset for 2021.
! head -n 5 2023J_TMA02_data/Ofcom_fixed/202109_fixed_laua_coverage_r01.csv

laua,laua_name,All Premises,All Matched Premises,SFBB availability (% premises),UFBB (100Mbit/s) availability (% premises),UFBB availability (% premises),Full Fibre availability (% premises),Gigabit availability (% premises),% of premises unable to receive 2Mbit/s,% of premises unable to receive 5Mbit/s,% of premises unable to receive 10Mbit/s,% of premises unable to receive 30Mbit/s,% of premises below the USO,% of premises with NGA,% of premises able to receive decent broadband from FWA,Number of premises with SFBB availability,Number of premises with UFBB (100Mbit/s) availability,Number of premises with UFBB availability,Number of premises with Full Fibre availability,Number of premises with Gigabit availability,Number of premises unable to receive 2Mbit/s,Number of premises unable to receive 5Mbit/s,Number of premises unable to receive 10Mbit/s,Number of premises unable to receive 30Mbit/s,Number of premises below the USO,Number of premises with NGA,Number of premises able to recei

In [23]:
# Display the last five rows of the fixed broadband dataset for 2021.
! tail -n 5 2023J_TMA02_data/Ofcom_fixed/202109_fixed_laua_coverage_r01.csv

W06000006,WREXHAM,66192,65306,94.6,42.0,41.4,41.0,41.0,0.2,0.8,1.7,4.1,0.9,96.9,43.6,62589,27775,27435,27106,27106,165,553,1103,2717,622,64152,28832,35154,27435,165,388,550,1614,53.1,41.4,0.2,0.6,0.8,2.4
E07000238,WYCHAVON,63359,62695,94.2,23.8,20.0,12.8,17.5,0.2,0.7,1.2,4.7,0.3,98.4,0.0,59690,15064,12656,8099,11101,105,467,760,3005,202,62333,1,47034,12656,105,362,293,2245,74.2,20.0,0.2,0.6,0.5,3.5
E07000128,WYRE,57413,57099,95.4,46.4,46.4,46.3,46.3,0.1,0.2,0.6,4.1,0.0,98.4,10.7,54757,26622,26617,26591,26591,46,140,361,2342,23,56517,6146,28140,26617,46,94,221,1981,49.0,46.4,0.1,0.2,0.4,3.5
E07000239,WYRE FOREST,48472,48204,96.7,48.5,48.1,2.6,48.1,0.1,0.3,0.7,2.8,0.2,99.1,3.4,46859,23514,23324,1276,23324,45,141,323,1345,94,48058,1662,23535,23324,45,96,182,1022,48.6,48.1,0.1,0.2,0.4,2.1
E06000014,YORK,96147,95638,94.2,77.5,74.5,60.4,72.0,0.0,0.2,0.7,5.2,0.0,96.1,7.1,90609,74466,71630,58026,69189,41,182,668,5029,32,92376,6860,18979,71630,41,141,486,4361,19.7,74.5,0.0,0.1,0.5,4.5


No anomalies in the header and footer of the file, free of any metadata. I will proceed with loading the file into a data frame.

In [24]:
# Import the dataset for fixed coverage for 2021 into a new DataFrame.
fixed_coverage_2021_df=pd.read_csv('2023J_TMA02_data/Ofcom_fixed/202109_fixed_laua_coverage_r01.csv')
fixed_coverage_2021_df.head()

Unnamed: 0,laua,laua_name,All Premises,All Matched Premises,SFBB availability (% premises),UFBB (100Mbit/s) availability (% premises),UFBB availability (% premises),Full Fibre availability (% premises),Gigabit availability (% premises),% of premises unable to receive 2Mbit/s,% of premises unable to receive 5Mbit/s,% of premises unable to receive 10Mbit/s,% of premises unable to receive 30Mbit/s,% of premises below the USO,% of premises with NGA,% of premises able to receive decent broadband from FWA,Number of premises with SFBB availability,Number of premises with UFBB (100Mbit/s) availability,Number of premises with UFBB availability,Number of premises with Full Fibre availability,Number of premises with Gigabit availability,Number of premises unable to receive 2Mbit/s,Number of premises unable to receive 5Mbit/s,Number of premises unable to receive 10Mbit/s,Number of premises unable to receive 30Mbit/s,Number of premises below the USO,Number of premises with NGA,Number of premises able to receive decent broadband from FWA,Number of premises with 30<300Mbit/s download speed,Number of premises with >=300Mbit/s download speed,Number of premises with 0<2Mbit/s download speed,Number of premises with 2<5Mbit/s download speed,Number of premises with 5<10Mbit/s download speed,Number of premises with 10<30Mbit/s download speed,% of premises with 30<300Mbit/s download speed,% of premises with >=300Mbit/s download speed,% of premises with 0<2Mbit/s download speed,% of premises with 2<5Mbit/s download speed,% of premises with 5<10Mbit/s download speed,% of premises with 10<30Mbit/s download speed
0,S12000033,ABERDEEN CITY,127714,126771,94.7,66.6,62.1,58.4,58.4,0.1,0.2,0.5,4.6,0.1,96.4,0.0,120938,85048,79372,74618,74618,67,211,628,5833,93,123177,0,41566,79372,67,144,417,5205,32.5,62.1,0.1,0.1,0.3,4.1
1,S12000034,ABERDEENSHIRE,126481,125378,82.8,13.8,13.7,13.7,13.7,2.5,5.7,9.2,16.3,3.7,94.4,0.0,104761,17459,17366,17269,17269,3202,7186,11631,20617,4727,119421,0,87395,17366,3202,3984,4445,8986,69.1,13.7,2.5,3.1,3.5,7.1
2,E07000223,ADUR,29884,29793,98.6,85.9,85.6,1.8,1.8,0.0,0.0,0.1,1.1,0.0,99.3,0.0,29476,25660,25581,538,538,4,13,36,317,8,29670,0,3895,25581,4,9,23,281,13.0,85.6,0.0,0.0,0.1,0.9
3,E07000026,ALLERDALE,51933,51622,92.3,3.4,3.4,3.4,3.4,1.1,2.1,3.1,7.1,1.0,98.4,0.0,47922,1750,1750,1750,1750,595,1098,1612,3700,518,51077,0,46172,1750,595,503,514,2088,88.9,3.4,1.1,1.0,1.0,4.0
4,E07000032,AMBER VALLEY,61555,61161,95.1,31.4,27.9,25.2,25.2,0.1,0.3,0.6,4.3,0.2,98.7,0.0,58516,19334,17177,15512,15512,61,190,379,2645,132,60771,1,41339,17177,61,129,189,2266,67.2,27.9,0.1,0.2,0.3,3.7


In [25]:
fixed_coverage_2021_df.tail()

Unnamed: 0,laua,laua_name,All Premises,All Matched Premises,SFBB availability (% premises),UFBB (100Mbit/s) availability (% premises),UFBB availability (% premises),Full Fibre availability (% premises),Gigabit availability (% premises),% of premises unable to receive 2Mbit/s,% of premises unable to receive 5Mbit/s,% of premises unable to receive 10Mbit/s,% of premises unable to receive 30Mbit/s,% of premises below the USO,% of premises with NGA,% of premises able to receive decent broadband from FWA,Number of premises with SFBB availability,Number of premises with UFBB (100Mbit/s) availability,Number of premises with UFBB availability,Number of premises with Full Fibre availability,Number of premises with Gigabit availability,Number of premises unable to receive 2Mbit/s,Number of premises unable to receive 5Mbit/s,Number of premises unable to receive 10Mbit/s,Number of premises unable to receive 30Mbit/s,Number of premises below the USO,Number of premises with NGA,Number of premises able to receive decent broadband from FWA,Number of premises with 30<300Mbit/s download speed,Number of premises with >=300Mbit/s download speed,Number of premises with 0<2Mbit/s download speed,Number of premises with 2<5Mbit/s download speed,Number of premises with 5<10Mbit/s download speed,Number of premises with 10<30Mbit/s download speed,% of premises with 30<300Mbit/s download speed,% of premises with >=300Mbit/s download speed,% of premises with 0<2Mbit/s download speed,% of premises with 2<5Mbit/s download speed,% of premises with 5<10Mbit/s download speed,% of premises with 10<30Mbit/s download speed
369,W06000006,WREXHAM,66192,65306,94.6,42.0,41.4,41.0,41.0,0.2,0.8,1.7,4.1,0.9,96.9,43.6,62589,27775,27435,27106,27106,165,553,1103,2717,622,64152,28832,35154,27435,165,388,550,1614,53.1,41.4,0.2,0.6,0.8,2.4
370,E07000238,WYCHAVON,63359,62695,94.2,23.8,20.0,12.8,17.5,0.2,0.7,1.2,4.7,0.3,98.4,0.0,59690,15064,12656,8099,11101,105,467,760,3005,202,62333,1,47034,12656,105,362,293,2245,74.2,20.0,0.2,0.6,0.5,3.5
371,E07000128,WYRE,57413,57099,95.4,46.4,46.4,46.3,46.3,0.1,0.2,0.6,4.1,0.0,98.4,10.7,54757,26622,26617,26591,26591,46,140,361,2342,23,56517,6146,28140,26617,46,94,221,1981,49.0,46.4,0.1,0.2,0.4,3.5
372,E07000239,WYRE FOREST,48472,48204,96.7,48.5,48.1,2.6,48.1,0.1,0.3,0.7,2.8,0.2,99.1,3.4,46859,23514,23324,1276,23324,45,141,323,1345,94,48058,1662,23535,23324,45,96,182,1022,48.6,48.1,0.1,0.2,0.4,2.1
373,E06000014,YORK,96147,95638,94.2,77.5,74.5,60.4,72.0,0.0,0.2,0.7,5.2,0.0,96.1,7.1,90609,74466,71630,58026,69189,41,182,668,5029,32,92376,6860,18979,71630,41,141,486,4361,19.7,74.5,0.0,0.1,0.5,4.5


Its integrity seems ok, so I will proceed with checking the data frame's shape.

In [26]:
fixed_coverage_2021_df.shape

(374, 40)

The shape of the data frame means there are 374 rows (entries) with 40 columns. There seems to be even less entries in the 2021 dataset. However, the number of columns seems to be consistent.

In [27]:
fixed_coverage_2021_df.describe()

Unnamed: 0,All Premises,All Matched Premises,SFBB availability (% premises),UFBB (100Mbit/s) availability (% premises),UFBB availability (% premises),Full Fibre availability (% premises),Gigabit availability (% premises),% of premises unable to receive 2Mbit/s,% of premises unable to receive 5Mbit/s,% of premises unable to receive 10Mbit/s,% of premises unable to receive 30Mbit/s,% of premises below the USO,% of premises with NGA,% of premises able to receive decent broadband from FWA,Number of premises with SFBB availability,Number of premises with UFBB (100Mbit/s) availability,Number of premises with UFBB availability,Number of premises with Full Fibre availability,Number of premises with Gigabit availability,Number of premises unable to receive 2Mbit/s,Number of premises unable to receive 5Mbit/s,Number of premises unable to receive 10Mbit/s,Number of premises unable to receive 30Mbit/s,Number of premises below the USO,Number of premises with NGA,Number of premises able to receive decent broadband from FWA,Number of premises with 30<300Mbit/s download speed,Number of premises with >=300Mbit/s download speed,Number of premises with 0<2Mbit/s download speed,Number of premises with 2<5Mbit/s download speed,Number of premises with 5<10Mbit/s download speed,Number of premises with 10<30Mbit/s download speed,% of premises with 30<300Mbit/s download speed,% of premises with >=300Mbit/s download speed,% of premises with 0<2Mbit/s download speed,% of premises with 2<5Mbit/s download speed,% of premises with 5<10Mbit/s download speed,% of premises with 10<30Mbit/s download speed
count,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0
mean,83731.264706,83184.727273,94.344652,59.760963,57.726203,23.960963,38.224599,0.365775,0.913102,1.681551,4.959893,0.537166,97.48369,7.796524,79573.125668,54471.163102,52720.307487,22738.181818,38242.967914,223.21123,581.390374,1105.179144,3611.601604,328.350267,81677.283422,5694.024064,26852.818182,52720.307487,223.21123,358.179144,523.78877,2506.42246,36.614439,57.726203,0.365775,0.545989,0.763904,3.281283
std,55452.666179,55099.509244,5.246693,26.495761,26.509567,18.61083,28.511055,0.764637,1.543772,2.568474,4.877776,1.096611,3.119544,16.33091,53344.355382,48729.738264,47715.361395,28286.134527,47622.801472,440.084224,912.779805,1486.562531,3635.678583,624.196129,53682.885022,13700.191058,22248.06022,47715.361395,440.084224,499.915776,627.13287,2550.335573,23.804715,26.509567,0.764637,0.824213,1.113291,2.975075
min,1683.0,1661.0,56.5,1.1,1.1,0.9,0.9,0.0,0.0,0.0,0.0,0.0,56.5,0.0,1622.0,27.0,27.0,27.0,27.0,0.0,0.0,0.0,39.0,0.0,1661.0,0.0,1032.0,27.0,0.0,0.0,0.0,20.0,0.9,1.1,0.0,0.0,0.0,0.0
25%,49355.5,49097.75,93.225,39.275,36.875,10.225,12.9,0.0,0.1,0.3,2.0,0.0,97.225,0.0,46332.0,23072.25,20897.25,5760.0,7907.75,15.0,92.25,268.0,1488.0,16.0,48570.0,0.0,12402.0,20897.25,15.0,65.75,146.25,1108.0,16.225,36.875,0.0,0.1,0.2,1.5
50%,66536.0,66171.5,95.9,67.75,65.6,19.5,29.9,0.1,0.3,0.8,3.4,0.1,98.25,0.0,62395.0,40711.5,39457.0,13110.5,19977.5,63.5,274.0,617.5,2636.0,95.0,64352.0,1.0,21168.5,39457.0,63.5,185.5,325.0,1832.0,29.95,65.6,0.1,0.2,0.4,2.5
75%,101537.0,100732.5,97.5,82.275,80.5,33.575,66.625,0.3,0.9,1.775,6.075,0.5,98.8,4.925,96827.0,78367.25,75940.25,28009.75,51503.25,202.0,624.0,1197.75,4394.75,363.5,98358.25,3911.5,34513.0,75940.25,202.0,423.0,649.75,3020.25,54.35,80.5,0.3,0.6,0.8,4.575
max,474961.0,471159.0,99.5,97.6,97.6,97.6,97.6,8.2,13.5,20.1,42.0,8.6,99.9,84.8,452173.0,426787.0,420836.0,229819.0,417774.0,4109.0,7186.0,11631.0,36337.0,4727.0,452539.0,118161.0,155211.0,420836.0,4109.0,3984.0,4445.0,29149.0,94.8,97.6,8.2,5.3,11.0,42.0


Again, nothing out of the ordinary. It seems to be less zeros in the 'min' row, compared to 2019 and 2020. The 'min' and 'max' values in the columns containing percentages seems to be within range.

In [28]:
fixed_coverage_2021_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 374 entries, 0 to 373
Data columns (total 40 columns):
 #   Column                                                        Non-Null Count  Dtype  
---  ------                                                        --------------  -----  
 0   laua                                                          374 non-null    object 
 1   laua_name                                                     374 non-null    object 
 2   All Premises                                                  374 non-null    int64  
 3   All Matched Premises                                          374 non-null    int64  
 4   SFBB availability (% premises)                                374 non-null    float64
 5   UFBB (100Mbit/s) availability (% premises)                    374 non-null    float64
 6   UFBB availability (% premises)                                374 non-null    float64
 7   Full Fibre availability (% premises)                          374 non-n

There don't appear to be any anomalies. The data frame has 374 entries with 40 columns. All columns have non-null counts matching the total number of entries, indicating no missing values. The data types of each column seem appropriate for the kind of data they represent.

Double checking for any missing values.

In [29]:
fixed_coverage_2021_df.isnull().sum()

laua                                                            0
laua_name                                                       0
All Premises                                                    0
All Matched Premises                                            0
SFBB availability (% premises)                                  0
UFBB (100Mbit/s) availability (% premises)                      0
UFBB availability (% premises)                                  0
Full Fibre availability (% premises)                            0
Gigabit availability (% premises)                               0
% of premises unable to receive 2Mbit/s                         0
% of premises unable to receive 5Mbit/s                         0
% of premises unable to receive 10Mbit/s                        0
% of premises unable to receive 30Mbit/s                        0
% of premises below the USO                                     0
% of premises with NGA                                          0
% of premi

We can confirm there are no missing values in the data frame. Let's see if there any duplicates.

In [30]:
fixed_coverage_2021_df.duplicated().sum()

np.int64(0)

No duplicates and no missing values found. So, let's move onto the 2022 file.

#### Fixed broadband - 2022 dataset

Let's load the first and last five rows of the top and bottom end of the file.

In [31]:
# Display the first five rows of the fixed broadband dataset for 2022.
! head -n 5 2023J_TMA02_data/Ofcom_fixed/202209_fixed_laua_coverage_r02.csv

laua,laua_name,All Premises,All Matched Premises,SFBB availability (% premises),UFBB (100Mbit/s) availability (% premises),UFBB availability (% premises),Full Fibre availability (% premises),Gigabit availability (% premises),% of premises unable to receive 2Mbit/s,% of premises unable to receive 5Mbit/s,% of premises unable to receive 10Mbit/s,% of premises unable to receive 30Mbit/s,% of premises below the USO,% of premises with NGA,% of premises able to receive decent broadband from FWA,Number of premises with SFBB availability,Number of premises with UFBB (100Mbit/s) availability,Number of premises with UFBB availability,Number of premises with Full Fibre availability,Number of premises with Gigabit availability,Number of premises unable to receive 2Mbit/s,Number of premises unable to receive 5Mbit/s,Number of premises unable to receive 10Mbit/s,Number of premises unable to receive 30Mbit/s,Number of premises below the USO,Number of premises with NGA,Number of premises able to recei

In [32]:
# Display the last five rows of the fixed broadband dataset for 2022.
! tail -n 5 2023J_TMA02_data/Ofcom_fixed/202209_fixed_laua_coverage_r02.csv

W06000006,WREXHAM,66672,65735,95.2,48.6,48.1,47.6,47.6,0.2,0.7,1.4,3.4,0.8,97.2,43.3,63473,32422,32097,31766,31767,132,481,928,2262,506,64830,28852,31376,32097,132,349,447,1334,47.1,48.1,0.2,0.5,0.7,2.0
E07000238,WYCHAVON,64057,63530,95.1,35.8,31.0,29.4,30.7,0.1,0.5,0.9,4.0,0.2,98.7,0.1,60942,22960,19882,18815,19685,67,339,559,2588,130,63212,45,41060,19882,67,272,220,2029,64.1,31.0,0.1,0.4,0.3,3.2
E07000128,WYRE,58069,57900,97.0,60.3,60.3,60.2,60.3,0.1,0.2,0.5,2.7,0.1,98.9,10.6,56317,35014,35013,34984,35010,42,141,263,1583,40,57419,6144,21304,35013,42,99,122,1320,36.7,60.3,0.1,0.2,0.2,2.3
E07000239,WYRE FOREST,48894,48679,97.3,55.7,55.7,10.3,55.4,0.1,0.3,0.5,2.3,0.1,99.3,3.4,47557,27255,27255,5020,27064,49,136,250,1122,69,48549,1670,20302,27255,49,87,114,872,41.5,55.7,0.1,0.2,0.2,1.8
E06000014,YORK,96526,96317,94.7,75.8,72.6,52.3,70.0,0.0,0.2,0.7,5.0,0.0,96.6,3.8,91448,73123,70059,50466,67521,42,189,694,4869,23,93230,3707,21389,70059,42,147,505,4175,22.2,72.6,0.0,0.2,0.5,4.3


That is good news, no anomalies or present metadata. I will proceed with loading the file into a data frame and checking the top and bottom of the data frame to ensure there are no problems present when the .CSV file was loaded.

In [33]:
# Import the dataset for fixed coverage for 2022 into a new DataFrame.
fixed_coverage_2022_df=pd.read_csv('2023J_TMA02_data/Ofcom_fixed/202209_fixed_laua_coverage_r02.csv')
fixed_coverage_2022_df.head()

Unnamed: 0,laua,laua_name,All Premises,All Matched Premises,SFBB availability (% premises),UFBB (100Mbit/s) availability (% premises),UFBB availability (% premises),Full Fibre availability (% premises),Gigabit availability (% premises),% of premises unable to receive 2Mbit/s,% of premises unable to receive 5Mbit/s,% of premises unable to receive 10Mbit/s,% of premises unable to receive 30Mbit/s,% of premises below the USO,% of premises with NGA,% of premises able to receive decent broadband from FWA,Number of premises with SFBB availability,Number of premises with UFBB (100Mbit/s) availability,Number of premises with UFBB availability,Number of premises with Full Fibre availability,Number of premises with Gigabit availability,Number of premises unable to receive 2Mbit/s,Number of premises unable to receive 5Mbit/s,Number of premises unable to receive 10Mbit/s,Number of premises unable to receive 30Mbit/s,Number of premises below the USO,Number of premises with NGA,Number of premises able to receive decent broadband from FWA,Number of premises with 30<300Mbit/s download speed,Number of premises with >=300Mbit/s download speed,Number of premises with 0<2Mbit/s download speed,Number of premises with 2<5Mbit/s download speed,Number of premises with 5<10Mbit/s download speed,Number of premises with 10<30Mbit/s download speed,% of premises with 30<300Mbit/s download speed,% of premises with >=300Mbit/s download speed,% of premises with 0<2Mbit/s download speed,% of premises with 2<5Mbit/s download speed,% of premises with 5<10Mbit/s download speed,% of premises with 10<30Mbit/s download speed
0,S12000033,ABERDEEN CITY,128708,128294,95.8,79.1,77.1,74.4,74.4,0.0,0.1,0.4,3.9,0.0,97.1,0.0,123335,101776,99197,95719,95719,59,178,548,4959,55,125036,0,24138,99197,59,119,370,4411,18.8,77.1,0.0,0.1,0.3,3.4
1,S12000034,ABERDEENSHIRE,127941,127265,84.2,20.6,20.5,20.5,20.5,1.8,4.5,8.2,15.3,2.6,94.8,0.0,107742,26341,26251,26171,26171,2330,5780,10440,19523,3335,121312,0,81491,26251,2330,3450,4660,9083,63.7,20.5,1.8,2.7,3.6,7.1
2,E07000223,ADUR,29971,29920,99.1,91.0,91.0,54.5,90.1,0.0,0.0,0.2,0.8,0.0,99.5,0.0,29687,27264,27264,16333,27009,2,8,47,233,7,29826,0,2423,27264,2,6,39,186,8.1,91.0,0.0,0.0,0.1,0.6
3,E07000026,ALLERDALE,52309,52133,92.7,5.4,5.4,5.4,5.4,1.2,2.1,2.9,6.9,0.6,98.6,3.4,48499,2811,2811,2811,2811,630,1086,1541,3634,327,51583,1788,45688,2811,630,456,455,2093,87.3,5.4,1.2,0.9,0.9,4.0
4,E07000032,AMBER VALLEY,62170,61902,96.1,49.4,46.2,43.6,43.7,0.1,0.3,0.5,3.4,0.2,99.2,0.0,59760,30682,28732,27101,27186,58,161,287,2142,107,61669,0,31028,28732,58,103,126,1855,49.9,46.2,0.1,0.2,0.2,3.0


In [34]:
fixed_coverage_2022_df.tail()

Unnamed: 0,laua,laua_name,All Premises,All Matched Premises,SFBB availability (% premises),UFBB (100Mbit/s) availability (% premises),UFBB availability (% premises),Full Fibre availability (% premises),Gigabit availability (% premises),% of premises unable to receive 2Mbit/s,% of premises unable to receive 5Mbit/s,% of premises unable to receive 10Mbit/s,% of premises unable to receive 30Mbit/s,% of premises below the USO,% of premises with NGA,% of premises able to receive decent broadband from FWA,Number of premises with SFBB availability,Number of premises with UFBB (100Mbit/s) availability,Number of premises with UFBB availability,Number of premises with Full Fibre availability,Number of premises with Gigabit availability,Number of premises unable to receive 2Mbit/s,Number of premises unable to receive 5Mbit/s,Number of premises unable to receive 10Mbit/s,Number of premises unable to receive 30Mbit/s,Number of premises below the USO,Number of premises with NGA,Number of premises able to receive decent broadband from FWA,Number of premises with 30<300Mbit/s download speed,Number of premises with >=300Mbit/s download speed,Number of premises with 0<2Mbit/s download speed,Number of premises with 2<5Mbit/s download speed,Number of premises with 5<10Mbit/s download speed,Number of premises with 10<30Mbit/s download speed,% of premises with 30<300Mbit/s download speed,% of premises with >=300Mbit/s download speed,% of premises with 0<2Mbit/s download speed,% of premises with 2<5Mbit/s download speed,% of premises with 5<10Mbit/s download speed,% of premises with 10<30Mbit/s download speed
369,W06000006,WREXHAM,66672,65735,95.2,48.6,48.1,47.6,47.6,0.2,0.7,1.4,3.4,0.8,97.2,43.3,63473,32422,32097,31766,31767,132,481,928,2262,506,64830,28852,31376,32097,132,349,447,1334,47.1,48.1,0.2,0.5,0.7,2.0
370,E07000238,WYCHAVON,64057,63530,95.1,35.8,31.0,29.4,30.7,0.1,0.5,0.9,4.0,0.2,98.7,0.1,60942,22960,19882,18815,19685,67,339,559,2588,130,63212,45,41060,19882,67,272,220,2029,64.1,31.0,0.1,0.4,0.3,3.2
371,E07000128,WYRE,58069,57900,97.0,60.3,60.3,60.2,60.3,0.1,0.2,0.5,2.7,0.1,98.9,10.6,56317,35014,35013,34984,35010,42,141,263,1583,40,57419,6144,21304,35013,42,99,122,1320,36.7,60.3,0.1,0.2,0.2,2.3
372,E07000239,WYRE FOREST,48894,48679,97.3,55.7,55.7,10.3,55.4,0.1,0.3,0.5,2.3,0.1,99.3,3.4,47557,27255,27255,5020,27064,49,136,250,1122,69,48549,1670,20302,27255,49,87,114,872,41.5,55.7,0.1,0.2,0.2,1.8
373,E06000014,YORK,96526,96317,94.7,75.8,72.6,52.3,70.0,0.0,0.2,0.7,5.0,0.0,96.6,3.8,91448,73123,70059,50466,67521,42,189,694,4869,23,93230,3707,21389,70059,42,147,505,4175,22.2,72.6,0.0,0.2,0.5,4.3


Data integrity seems ok. Proceeding to checking the data frame's shape.

In [35]:
fixed_coverage_2022_df.shape

(374, 40)

We have 374 entries with 40 columns. That seems consistent with 2021 dataset.

In [36]:
fixed_coverage_2022_df.describe()

Unnamed: 0,All Premises,All Matched Premises,SFBB availability (% premises),UFBB (100Mbit/s) availability (% premises),UFBB availability (% premises),Full Fibre availability (% premises),Gigabit availability (% premises),% of premises unable to receive 2Mbit/s,% of premises unable to receive 5Mbit/s,% of premises unable to receive 10Mbit/s,% of premises unable to receive 30Mbit/s,% of premises below the USO,% of premises with NGA,% of premises able to receive decent broadband from FWA,Number of premises with SFBB availability,Number of premises with UFBB (100Mbit/s) availability,Number of premises with UFBB availability,Number of premises with Full Fibre availability,Number of premises with Gigabit availability,Number of premises unable to receive 2Mbit/s,Number of premises unable to receive 5Mbit/s,Number of premises unable to receive 10Mbit/s,Number of premises unable to receive 30Mbit/s,Number of premises below the USO,Number of premises with NGA,Number of premises able to receive decent broadband from FWA,Number of premises with 30<300Mbit/s download speed,Number of premises with >=300Mbit/s download speed,Number of premises with 0<2Mbit/s download speed,Number of premises with 2<5Mbit/s download speed,Number of premises with 5<10Mbit/s download speed,Number of premises with 10<30Mbit/s download speed,% of premises with 30<300Mbit/s download speed,% of premises with >=300Mbit/s download speed,% of premises with 0<2Mbit/s download speed,% of premises with 2<5Mbit/s download speed,% of premises with 5<10Mbit/s download speed,% of premises with 10<30Mbit/s download speed
count,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0
mean,84761.181818,84349.454545,95.325668,67.605615,66.110428,38.207754,64.590642,0.291979,0.736631,1.362299,4.164706,0.340909,97.880214,7.943583,81267.724599,60970.275401,59700.168449,34897.074866,58141.13369,184.927807,478.473262,911.596257,3081.729947,212.582888,83005.681818,6093.965241,21567.55615,59700.168449,184.927807,293.545455,433.122995,2170.13369,29.213904,66.110428,0.291979,0.437166,0.624332,2.804813
std,56035.842818,55767.364674,4.584913,22.29782,22.46434,20.553333,22.19958,0.560008,1.171155,2.090665,4.318733,0.709083,2.932801,16.094505,54151.769111,49758.46883,49116.539501,34109.839245,47537.013687,338.461553,717.449742,1221.06068,3217.899119,415.152083,54463.862699,14593.958269,18190.044767,49116.539501,338.461553,402.030606,548.774815,2335.513567,20.007487,22.46434,0.560008,0.647729,1.005039,2.783777
min,1686.0,1669.0,58.7,1.6,1.6,1.6,1.6,0.0,0.0,0.0,0.0,0.0,58.7,0.0,1631.0,27.0,27.0,27.0,27.0,0.0,0.0,0.0,19.0,0.0,1669.0,0.0,884.0,27.0,0.0,0.0,0.0,10.0,0.8,1.6,0.0,0.0,0.0,0.0
25%,50147.5,49881.25,94.5,52.95,50.775,22.3,49.9,0.0,0.1,0.3,1.6,0.0,97.6,0.0,47645.0,30737.0,29957.75,13105.75,29133.5,14.25,79.25,215.25,1297.25,15.0,49510.25,0.0,9794.25,29957.75,14.25,52.25,115.25,935.5,12.4,50.775,0.0,0.1,0.2,1.3
50%,67870.0,67463.5,96.75,74.4,72.2,37.8,70.35,0.1,0.3,0.65,2.8,0.1,98.6,0.0,63857.0,47277.0,44353.5,24359.0,44475.5,55.0,236.5,531.5,2145.0,61.5,66016.0,5.5,17269.0,44353.5,55.0,153.0,265.0,1522.5,23.8,72.2,0.1,0.2,0.35,2.1
75%,102291.25,102037.0,98.0,85.975,84.8,51.275,83.025,0.3,0.8,1.4,5.1,0.3,99.1,5.8,98492.25,80995.75,79003.0,43995.75,77578.75,186.75,523.0,1054.5,3680.75,228.5,99516.75,4393.75,27491.0,79003.0,186.75,363.0,504.25,2482.0,43.2,84.8,0.3,0.5,0.7,3.6
max,477617.0,475094.0,99.5,97.7,97.7,97.7,97.7,3.9,7.1,17.3,39.3,5.4,99.8,85.0,457684.0,434071.0,431406.0,277769.0,424717.0,2526.0,5780.0,10440.0,32558.0,3335.0,458007.0,124830.0,138951.0,431406.0,2526.0,3450.0,4660.0,26434.0,95.1,97.7,3.9,4.3,10.9,39.3


It looks ok at glance. I will check the name of the columns, missing values and data type for each column. The 'min' and 'max' values in the percentage columns seems within range. Although, the SFBB availability minimum percentage seems quite high, indicating that slightly more than half of the premises have got SFBB broadband availability.

In [37]:
fixed_coverage_2022_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 374 entries, 0 to 373
Data columns (total 40 columns):
 #   Column                                                        Non-Null Count  Dtype  
---  ------                                                        --------------  -----  
 0   laua                                                          374 non-null    object 
 1   laua_name                                                     374 non-null    object 
 2   All Premises                                                  374 non-null    int64  
 3   All Matched Premises                                          374 non-null    int64  
 4   SFBB availability (% premises)                                374 non-null    float64
 5   UFBB (100Mbit/s) availability (% premises)                    374 non-null    float64
 6   UFBB availability (% premises)                                374 non-null    float64
 7   Full Fibre availability (% premises)                          374 non-n

It seems ok. Let's ensure there is no missing values.

In [38]:
fixed_coverage_2022_df.isnull().sum()

laua                                                            0
laua_name                                                       0
All Premises                                                    0
All Matched Premises                                            0
SFBB availability (% premises)                                  0
UFBB (100Mbit/s) availability (% premises)                      0
UFBB availability (% premises)                                  0
Full Fibre availability (% premises)                            0
Gigabit availability (% premises)                               0
% of premises unable to receive 2Mbit/s                         0
% of premises unable to receive 5Mbit/s                         0
% of premises unable to receive 10Mbit/s                        0
% of premises unable to receive 30Mbit/s                        0
% of premises below the USO                                     0
% of premises with NGA                                          0
% of premi

Good news, no missing values. Let's check for any duplicates.

In [39]:
fixed_coverage_2022_df.duplicated().sum()

np.int64(0)

There are no duplicates either. I am proceeding with the last dataset, which is for 2023.

#### Fixed broadband - 2023 dataset

Just like with the previous datasets, I will check the first and last 5 lines of the file.

In [40]:
# Display the first five rows of the fixed broadband dataset for 2023.
! head -n 5 2023J_TMA02_data/Ofcom_fixed/202305_fixed_laua_coverage_r02.csv

laua,laua_name,All Premises,All Matched Premises,SFBB availability (% premises),UFBB (100Mbit/s) availability (% premises),UFBB availability (% premises),Full Fibre availability (% premises),Gigabit availability (% premises),% of premises unable to receive 2Mbit/s,% of premises unable to receive 5Mbit/s,% of premises unable to receive 10Mbit/s,% of premises unable to receive 30Mbit/s,% of premises below the USO,% of premises with NGA,% of premises able to receive decent broadband from FWA,Number of premises with SFBB availability,Number of premises with UFBB (100Mbit/s) availability,Number of premises with UFBB availability,Number of premises with Full Fibre availability,Number of premises with Gigabit availability,Number of premises unable to receive 2Mbit/s,Number of premises unable to receive 5Mbit/s,Number of premises unable to receive 10Mbit/s,Number of premises unable to receive 30Mbit/s,Number of premises below the USO,Number of premises with NGA,Number of premises able to recei

In [41]:
# Display the last five rows of the fixed broadband dataset for 2023.
! head -n 5 2023J_TMA02_data/Ofcom_fixed/202305_fixed_laua_coverage_r02.csv

laua,laua_name,All Premises,All Matched Premises,SFBB availability (% premises),UFBB (100Mbit/s) availability (% premises),UFBB availability (% premises),Full Fibre availability (% premises),Gigabit availability (% premises),% of premises unable to receive 2Mbit/s,% of premises unable to receive 5Mbit/s,% of premises unable to receive 10Mbit/s,% of premises unable to receive 30Mbit/s,% of premises below the USO,% of premises with NGA,% of premises able to receive decent broadband from FWA,Number of premises with SFBB availability,Number of premises with UFBB (100Mbit/s) availability,Number of premises with UFBB availability,Number of premises with Full Fibre availability,Number of premises with Gigabit availability,Number of premises unable to receive 2Mbit/s,Number of premises unable to receive 5Mbit/s,Number of premises unable to receive 10Mbit/s,Number of premises unable to receive 30Mbit/s,Number of premises below the USO,Number of premises with NGA,Number of premises able to recei

No metadata included, so the files are ok to be loaded in a data frame.

In [42]:
fixed_coverage_2023_df=pd.read_csv('2023J_TMA02_data/Ofcom_fixed/202305_fixed_laua_coverage_r02.csv')
fixed_coverage_2023_df.head()

Unnamed: 0,laua,laua_name,All Premises,All Matched Premises,SFBB availability (% premises),UFBB (100Mbit/s) availability (% premises),UFBB availability (% premises),Full Fibre availability (% premises),Gigabit availability (% premises),% of premises unable to receive 2Mbit/s,% of premises unable to receive 5Mbit/s,% of premises unable to receive 10Mbit/s,% of premises unable to receive 30Mbit/s,% of premises below the USO,% of premises with NGA,% of premises able to receive decent broadband from FWA,Number of premises with SFBB availability,Number of premises with UFBB (100Mbit/s) availability,Number of premises with UFBB availability,Number of premises with Full Fibre availability,Number of premises with Gigabit availability,Number of premises unable to receive 2Mbit/s,Number of premises unable to receive 5Mbit/s,Number of premises unable to receive 10Mbit/s,Number of premises unable to receive 30Mbit/s,Number of premises below the USO,Number of premises with NGA,Number of premises able to receive decent broadband from FWA,Number of premises with 30<300Mbit/s download speed,Number of premises with >=300Mbit/s download speed,Number of premises with 0<2Mbit/s download speed,Number of premises with 2<5Mbit/s download speed,Number of premises with 5<10Mbit/s download speed,Number of premises with 10<30Mbit/s download speed,% of premises with 30<300Mbit/s download speed,% of premises with >=300Mbit/s download speed,% of premises with 0<2Mbit/s download speed,% of premises with 2<5Mbit/s download speed,% of premises with 5<10Mbit/s download speed,% of premises with 10<30Mbit/s download speed
0,S12000033,ABERDEEN CITY,129315,129197,97.2,84.8,83.9,83.0,83.0,0.0,0.2,0.3,2.8,0.0,98.0,0.0,125636,109650,108514,107315,107315,58,209,431,3561,40,126770,0,17122,108514,58,151,222,3130,13.2,83.9,0.0,0.1,0.2,2.4
1,S12000034,ABERDEENSHIRE,128408,128070,85.9,25.5,25.5,25.4,25.4,1.7,4.2,7.6,13.8,2.4,95.4,0.0,110296,32794,32703,32622,32622,2214,5407,9703,17774,3047,122495,0,77593,32703,2214,3193,4296,8071,60.4,25.5,1.7,2.5,3.3,6.3
2,E07000223,ADUR,29985,29953,99.1,92.8,92.8,65.4,92.8,0.0,0.0,0.1,0.8,0.0,99.6,0.0,29727,27826,27826,19606,27826,0,9,40,226,5,29864,0,1901,27826,0,9,31,186,6.3,92.8,0.0,0.0,0.1,0.6
3,E07000026,ALLERDALE,52482,52364,93.1,6.0,6.0,6.0,6.0,1.2,2.0,2.8,6.6,0.6,98.7,3.4,48885,3127,3127,3127,3127,617,1057,1479,3479,294,51819,1785,45758,3127,617,440,422,2000,87.2,6.0,1.2,0.8,0.8,3.8
4,E07000032,AMBER VALLEY,62512,62430,97.2,62.4,60.6,59.0,59.1,0.1,0.3,0.4,2.7,0.1,99.5,0.0,60770,39001,37875,36868,36952,42,157,245,1660,54,62205,1,22895,37875,42,115,88,1415,36.6,60.6,0.1,0.2,0.1,2.3


In [43]:
fixed_coverage_2023_df.tail()

Unnamed: 0,laua,laua_name,All Premises,All Matched Premises,SFBB availability (% premises),UFBB (100Mbit/s) availability (% premises),UFBB availability (% premises),Full Fibre availability (% premises),Gigabit availability (% premises),% of premises unable to receive 2Mbit/s,% of premises unable to receive 5Mbit/s,% of premises unable to receive 10Mbit/s,% of premises unable to receive 30Mbit/s,% of premises below the USO,% of premises with NGA,% of premises able to receive decent broadband from FWA,Number of premises with SFBB availability,Number of premises with UFBB (100Mbit/s) availability,Number of premises with UFBB availability,Number of premises with Full Fibre availability,Number of premises with Gigabit availability,Number of premises unable to receive 2Mbit/s,Number of premises unable to receive 5Mbit/s,Number of premises unable to receive 10Mbit/s,Number of premises unable to receive 30Mbit/s,Number of premises below the USO,Number of premises with NGA,Number of premises able to receive decent broadband from FWA,Number of premises with 30<300Mbit/s download speed,Number of premises with >=300Mbit/s download speed,Number of premises with 0<2Mbit/s download speed,Number of premises with 2<5Mbit/s download speed,Number of premises with 5<10Mbit/s download speed,Number of premises with 10<30Mbit/s download speed,% of premises with 30<300Mbit/s download speed,% of premises with >=300Mbit/s download speed,% of premises with 0<2Mbit/s download speed,% of premises with 2<5Mbit/s download speed,% of premises with 5<10Mbit/s download speed,% of premises with 10<30Mbit/s download speed
369,W06000006,WREXHAM,66829,66176,96.0,56.2,55.7,55.2,55.2,0.2,0.6,1.3,3.0,0.6,97.8,43.1,64171,37570,37197,36894,36895,132,434,843,2005,423,65374,28830,26974,37197,132,302,409,1162,40.4,55.7,0.2,0.5,0.6,1.7
370,E07000238,WYCHAVON,64378,64205,96.2,52.4,49.5,48.4,49.4,0.1,0.4,0.8,3.6,0.1,99.3,0.0,61909,33724,31851,31187,31790,53,263,487,2296,66,63959,1,30058,31851,53,210,224,1809,46.7,49.5,0.1,0.3,0.3,2.8
371,E07000128,WYRE,58322,58243,97.3,60.8,60.8,60.8,60.8,0.1,0.3,0.5,2.6,0.0,99.0,7.8,56743,35480,35477,35450,35480,54,152,280,1500,24,57742,4557,21266,35477,54,98,128,1220,36.5,60.8,0.1,0.2,0.2,2.1
372,E07000239,WYRE FOREST,49121,49027,97.8,63.8,63.8,25.7,63.8,0.1,0.2,0.4,2.0,0.1,99.6,3.4,48048,31340,31340,12612,31340,45,115,202,979,52,48916,1669,16708,31340,45,70,87,777,34.0,63.8,0.1,0.1,0.2,1.6
373,E06000014,YORK,96725,96582,95.4,80.6,78.4,65.9,76.3,0.0,0.1,0.5,4.4,0.0,96.9,21.8,92312,78000,75825,63736,73780,43,142,465,4270,16,93763,21109,16487,75825,43,99,323,3805,17.0,78.4,0.0,0.1,0.3,3.9


All looks good. Let's check its shape.

In [44]:
fixed_coverage_2023_df.shape

(374, 40)

So, we get 374 entries spread amongst 40 columns. It seems consistent with 2021 and 2022 datasets, which is good news.

In [45]:
fixed_coverage_2023_df.describe()

Unnamed: 0,All Premises,All Matched Premises,SFBB availability (% premises),UFBB (100Mbit/s) availability (% premises),UFBB availability (% premises),Full Fibre availability (% premises),Gigabit availability (% premises),% of premises unable to receive 2Mbit/s,% of premises unable to receive 5Mbit/s,% of premises unable to receive 10Mbit/s,% of premises unable to receive 30Mbit/s,% of premises below the USO,% of premises with NGA,% of premises able to receive decent broadband from FWA,Number of premises with SFBB availability,Number of premises with UFBB (100Mbit/s) availability,Number of premises with UFBB availability,Number of premises with Full Fibre availability,Number of premises with Gigabit availability,Number of premises unable to receive 2Mbit/s,Number of premises unable to receive 5Mbit/s,Number of premises unable to receive 10Mbit/s,Number of premises unable to receive 30Mbit/s,Number of premises below the USO,Number of premises with NGA,Number of premises able to receive decent broadband from FWA,Number of premises with 30<300Mbit/s download speed,Number of premises with >=300Mbit/s download speed,Number of premises with 0<2Mbit/s download speed,Number of premises with 2<5Mbit/s download speed,Number of premises with 5<10Mbit/s download speed,Number of premises with 10<30Mbit/s download speed,% of premises with 30<300Mbit/s download speed,% of premises with >=300Mbit/s download speed,% of premises with 0<2Mbit/s download speed,% of premises with 2<5Mbit/s download speed,% of premises with 5<10Mbit/s download speed,% of premises with 10<30Mbit/s download speed
count,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0,374.0
mean,85116.336898,84886.005348,96.072995,71.809091,70.550802,47.627807,69.724866,0.274064,0.679412,1.239305,3.645455,0.268182,98.239037,8.074866,82180.959893,64431.270053,63354.839572,43038.548128,62617.510695,175.125668,447.34492,839.965241,2705.045455,165.42246,83651.620321,6410.419786,18826.120321,63354.839572,175.125668,272.219251,392.620321,1865.080214,25.524064,70.550802,0.274064,0.397594,0.560695,2.404011
std,56252.163193,56070.215034,4.211346,20.414938,20.496678,20.763319,20.390288,0.504081,1.073881,1.937932,4.006559,0.589204,2.811242,16.502087,54598.119327,50747.883493,50144.945451,38006.335278,49592.207036,304.74368,661.491698,1134.726139,2918.498044,346.219534,54873.514474,16354.704195,16354.354108,50144.945451,304.74368,378.581337,508.478333,2071.790679,18.17582,20.496678,0.504081,0.604247,0.939757,2.639301
min,1689.0,1678.0,59.1,1.8,1.8,1.8,1.8,0.0,0.0,0.0,0.0,0.0,59.2,0.0,1650.0,31.0,31.0,31.0,31.0,0.0,0.0,2.0,28.0,0.0,1678.0,0.0,233.0,31.0,0.0,0.0,0.0,26.0,0.2,1.8,0.0,0.0,0.0,0.0
25%,50502.25,50250.0,95.4,59.075,57.55,32.175,57.475,0.0,0.1,0.3,1.5,0.0,98.0,0.0,48514.75,33395.5,32716.25,19534.25,32624.25,19.0,79.5,205.5,1132.0,5.0,49961.0,0.0,8663.25,32716.25,19.0,54.25,100.0,803.0,10.575,57.55,0.0,0.1,0.1,1.1
50%,68136.5,67915.0,97.3,77.05,75.35,48.45,74.4,0.1,0.3,0.6,2.35,0.1,98.9,0.0,64934.0,49165.0,48464.0,32315.0,48262.0,57.5,218.5,504.0,1835.5,41.5,66771.0,1.0,15388.0,48464.0,57.5,137.0,240.0,1267.5,21.25,75.35,0.1,0.2,0.3,1.7
75%,102462.0,102371.25,98.3,88.175,87.2,62.05,86.85,0.3,0.7,1.2,4.3,0.2,99.3,4.475,99292.5,83674.25,82727.25,55378.25,81981.0,185.75,512.0,1020.75,3269.25,168.75,100722.5,2953.75,23219.25,82727.25,185.75,334.0,467.75,2143.75,36.6,87.2,0.3,0.4,0.6,3.075
max,478734.0,476604.0,99.8,98.4,98.4,98.4,98.4,3.3,7.3,16.7,40.0,4.5,99.9,81.9,462456.0,441888.0,438931.0,306857.0,435683.0,2568.0,5407.0,9703.0,30584.0,3047.0,462712.0,154225.0,136775.0,438931.0,2568.0,3193.0,4296.0,24111.0,95.9,98.4,3.3,4.3,10.6,39.8


At glance it looks ok.

In [46]:
fixed_coverage_2023_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 374 entries, 0 to 373
Data columns (total 40 columns):
 #   Column                                                        Non-Null Count  Dtype  
---  ------                                                        --------------  -----  
 0   laua                                                          374 non-null    object 
 1   laua_name                                                     374 non-null    object 
 2   All Premises                                                  374 non-null    int64  
 3   All Matched Premises                                          374 non-null    int64  
 4   SFBB availability (% premises)                                374 non-null    float64
 5   UFBB (100Mbit/s) availability (% premises)                    374 non-null    float64
 6   UFBB availability (% premises)                                374 non-null    float64
 7   Full Fibre availability (% premises)                          374 non-n

No missing values, the column names appear to be consisten too and the data types for each column appears to be correctly assigned. Let's ensure there are no present missing values.

In [47]:
fixed_coverage_2023_df.isnull().sum()

laua                                                            0
laua_name                                                       0
All Premises                                                    0
All Matched Premises                                            0
SFBB availability (% premises)                                  0
UFBB (100Mbit/s) availability (% premises)                      0
UFBB availability (% premises)                                  0
Full Fibre availability (% premises)                            0
Gigabit availability (% premises)                               0
% of premises unable to receive 2Mbit/s                         0
% of premises unable to receive 5Mbit/s                         0
% of premises unable to receive 10Mbit/s                        0
% of premises unable to receive 30Mbit/s                        0
% of premises below the USO                                     0
% of premises with NGA                                          0
% of premi

That is all good. And lastly, let's see if there are any duplicates.

In [48]:
fixed_coverage_2022_df.duplicated().sum()

np.int64(0)

#### Key Summaries and observations

In the individual file examination section, I have confirmed, as presented in the task description, that certain CSV files have different number of columns and rows. Here is the summary of my findings:

- 2019: 38 columns, 382 entries
- 2020: 40 columns, 379 entries
- 2021: 40 columns, 374 entries
- 2022: 40 columns, 374 entries
- 2023: 40 columns, 374 entries

We can see that the number of columns increases from year 2020, including. From 2020 till 2023 the number of columns is consistent. However, we see fluctuation in the number of entries between year 2019 - 2021. In 2019, there were 382, then in 2020 the number dropped to 379 and in 2021 the number went even lower to 374 entries. That is worth investigating further and understand why and how this difference in entry would affect the analysis I will be performing later on.

I need to identify what are the additional 2 columns added in 2020 onward files and also understand why they were added and how does that affect the ongoing files and what changes were made. I need to do the same for the fluctuations in number of entries from 2019 to 2021, where they seem to equal.

My next step would be to check and identify the differences in entries and columns across the datasets.

### Entry differences - investigation

To explore the specific differences between the files, such as which entries are missing or added in each subsequent year, I will look at the key columns that uniquely identify each entry. The dataset contains several columns, but for the purpsoe of identifying unique entries, the columns 'laua' and 'laua_name) appear to be suitable key columns. These should uniquely identify each entry across the files.

I will create a function called 'compare_dataframes', which compares two DataFrames based on the 'laua' column to find added or removed entries. I have chosen to use the 'laua' column because each entry in the DataFrame have a distinct 'laua' value, making it suitable for identifying individual records. Also, the unique identifier like 'laua' allows me to precisely track changes to specific entries across different datasets. It makes it easier to identify which entries have been added or removed over time.

The function will first extract unique values from the key column of both DataFrames. Then it finds the set difference to identify keys that are present in one DataFrame but not the other. Then the function filters the DataFrames to extract the added and removed entries based on the identified keys. Finally, the function returns the added and removed entries.



I will then use the function to compare each dataset with the following year. For example, comapring 2019 to 2020, 2020 to 2021, and so on.

In [49]:
def compare_dataframes(df1, df2, key_column='laua'):
    """
    Compare two dataframes based on a key column to find added or removed entries.
    
    Parameters:
    - df1: DataFrame from year 1.
    - df2: DataFrame from year 2.
    - key_column: The column name used as the key for comparison.
    
    Returns:
    - added_entries: Entries added in df2 compared to df1.
    - removed_entries: Entries removed from df1 compared to df2.
    """
    df1_keys = set(df1[key_column])
    df2_keys = set(df2[key_column])
    
    added_keys = df2_keys - df1_keys
    removed_keys = df1_keys - df2_keys
    
    added_entries = df2[df2[key_column].isin(added_keys)]
    removed_entries = df1[df1[key_column].isin(removed_keys)]
    
    return added_entries, removed_entries

#### 2019 - 2020 entry differences

Let's see what entries have been added and removed in the 2020 dataset.

In [50]:
added_entries, removed_entries = compare_dataframes(fixed_coverage_2019_df, fixed_coverage_2020_df, 'laua')

print("Added Entries:")
print(added_entries[['laua', 'laua_name']])
print("\nRemoved Entries:")
print(removed_entries[['laua', 'laua_name']])

Added Entries:
         laua        laua_name
48  E06000060  BUCKINGHAMSHIRE

Removed Entries:
          laua       laua_name
13   E07000004  AYLESBURY VALE
72   E07000005        CHILTERN
288  E07000006     SOUTH BUCKS
378  E07000007         WYCOMBE


We see that Buckinghamshire with 'laua' E06000060 has been added, while four other areas - Aylesbury Vale, Chiltern, South Bucks and Wycombe have been removed.

#### 2020 - 2021 entry differences

Now let's see what entries have been added and/or removed in the 2021 dataset.

In [51]:
# Example usage to compare 2020 and 2021 dataframes
added_entries, removed_entries = compare_dataframes(fixed_coverage_2020_df, fixed_coverage_2021_df, 'laua')

print("Added Entries:")
print(added_entries[['laua', 'laua_name']])
print("\nRemoved Entries:")
print(removed_entries[['laua', 'laua_name']])


Added Entries:
          laua               laua_name
226  E06000061  NORTH NORTHAMPTONSHIRE
355  E06000062   WEST NORTHAMPTONSHIRE

Removed Entries:
          laua               laua_name
79   E07000150                   CORBY
90   E07000151                DAVENTRY
110  E07000152   EAST NORTHAMPTONSHIRE
174  E07000153               KETTERING
234  E07000154             NORTHAMPTON
296  E07000155  SOUTH NORTHAMPTONSHIRE
353  E07000156          WELLINGBOROUGH


There are a few more entries that have been removed but only few added.

#### 2021 - 2022 entry differences

As we discovered earlier, there is no difference in the number of entries for year 2021, 2022, 2023. However, I would check for differences regardless because just because the number of entry is the same, it does not mean that no changes have been made. There might have been an entry removed and replace with a new one, and the number of entries would have still shown the same numbers. So, it is always good to check.

In [52]:
# Example usage to compare 2021 and 2022 dataframes
added_entries, removed_entries = compare_dataframes(fixed_coverage_2021_df, fixed_coverage_2022_df, 'laua')

print("Added Entries:")
print(added_entries[['laua', 'laua_name']])
print("\nRemoved Entries:")
print(removed_entries[['laua', 'laua_name']])


Added Entries:
Empty DataFrame
Columns: [laua, laua_name]
Index: []

Removed Entries:
Empty DataFrame
Columns: [laua, laua_name]
Index: []


No entries were added or removed in the 2022 dataset. Let's check in the 2023 dataset.

#### 2022 - 2023 entry differences

In [53]:
# Example usage to compare 2021 and 2022 dataframes
added_entries, removed_entries = compare_dataframes(fixed_coverage_2022_df, fixed_coverage_2023_df, 'laua')

print("Added Entries:")
print(added_entries[['laua', 'laua_name']])
print("\nRemoved Entries:")
print(removed_entries[['laua', 'laua_name']])

Added Entries:
Empty DataFrame
Columns: [laua, laua_name]
Index: []

Removed Entries:
Empty DataFrame
Columns: [laua, laua_name]
Index: []


That's good news. The entries in the 2023 dataset have not been amended.

#### Entry differences in the 2019 to 2021 dataset.

To conclude, there are entries added and removed in the 2019 and 2020 dataset. The rest of the datasets seems to be consistent. So, the aim is to make both datasets - 2019 and 2020 consistent to the rest of the datasets in terms of the entries. So, I will now compare 2019 to 2021 to see all changes introduced. I have chosen to make such comparison, because there are no changes introduced in 2021 onwards. 

Then, I will compare 2020 to 2021 and see what changes were made then.

In [54]:
# Example usage to compare 2019 and 2021 dataframes
added_entries, removed_entries = compare_dataframes(fixed_coverage_2019_df, fixed_coverage_2021_df, 'laua')

print("Added Entries:")
print(added_entries[['laua', 'laua_name']])
print("\nRemoved Entries:")
print(removed_entries[['laua', 'laua_name']])

Added Entries:
          laua               laua_name
48   E06000060         BUCKINGHAMSHIRE
226  E06000061  NORTH NORTHAMPTONSHIRE
355  E06000062   WEST NORTHAMPTONSHIRE

Removed Entries:
          laua               laua_name
13   E07000004          AYLESBURY VALE
72   E07000005                CHILTERN
80   E07000150                   CORBY
91   E07000151                DAVENTRY
111  E07000152   EAST NORTHAMPTONSHIRE
175  E07000153               KETTERING
235  E07000154             NORTHAMPTON
288  E07000006             SOUTH BUCKS
298  E07000155  SOUTH NORTHAMPTONSHIRE
355  E07000156          WELLINGBOROUGH
378  E07000007                 WYCOMBE


In the begining of this Jupyter notebook, we have been given a link to '2019–2023 structural changes to local government in England'. I have noted down the added and removed entries. I will consult with the given link to Wikipedia about the structural changes to local government in order to gain better understanding.

Having looked and read the Wikipedia source we have been provided, in the 2019 dataset I need to introduce three new entries - Buckinghamshire, North Northamptonshire and West Northamptonshire. 

The Buckinghamshire needs to incorporate all existing data from Aylesbury Vale, Chiltern, South Bucks, Wycombe. The local authortity area code ('laua') for Buckinghamshire is 'E06000060'. So when I create a new entry, I should use this laua code.

The North Northamptonshire needs to incorporate all existing data from Corby, East Northamptonshire, Kettering, Wellingborough. The local authority area code ('laua') for North Northamptonshire is 'E06000061'.

The West Northamptonshire entry needs to incorporate all existing data from Daventry, Northampton, South Northamptonshire. The local authority area code ('laua') for West Northamptonshire is 'E06000062'.

(Reference: https://en.wikipedia.org/wiki/2019–2023_structural_changes_to_local_government_in_England)

#### Entry differences in the 2020 to 2021 dataset.

I have already checked for removed and added entries in 2020 dataset, but I will recall the code again.

In [55]:
# Example usage to compare 2020 and 2021 dataframes
added_entries, removed_entries = compare_dataframes(fixed_coverage_2020_df, fixed_coverage_2021_df, 'laua')

print("Added Entries:")
print(added_entries[['laua', 'laua_name']])
print("\nRemoved Entries:")
print(removed_entries[['laua', 'laua_name']])


Added Entries:
          laua               laua_name
226  E06000061  NORTH NORTHAMPTONSHIRE
355  E06000062   WEST NORTHAMPTONSHIRE

Removed Entries:
          laua               laua_name
79   E07000150                   CORBY
90   E07000151                DAVENTRY
110  E07000152   EAST NORTHAMPTONSHIRE
174  E07000153               KETTERING
234  E07000154             NORTHAMPTON
296  E07000155  SOUTH NORTHAMPTONSHIRE
353  E07000156          WELLINGBOROUGH


Similarly to 2019 I need to introduce two new entries - North Northamptonshire and West Northamptonshire. 

North Northamptonshire entry should include all existing data for Corby, East Northamptonshire, Kettering, and Wellingborough.

West Northamptonshire entry should include all existing data for Daventry, Northampton, and South Northamptonshire.

### Entry differences - key summaries  

In 2019 dataset, I need to introduce three entries - Buckinghamshire, North Northamptonshire and West Northamptonshire. 

The Buckinghamshire entry needs to incorporate all values contained in the following entries:

* Aylesbury Vale
* Chiltern
* South Bucks
* Wycombe 

The North Northamptonshire entry needs to incorporate all values contained in the following entries:

* Corby
* East Northamptonshire
* Kettering
* Wellingborough

The West Northamptonshire entry needs to incorporate all values contained in the following entries:

* Daventry
* Northampton
* South Northamptonshire

For the 2020 dataset, the changes needed are only for North and West Northamptonshire described above. For both datasets, the rows for the sub-areas listed above need to be dropped.

### Column differences - investigation

After performing the individual file examination, I now know the number of differences in the columns I am looking for. So, my first step is to identify what are the differences in columns between year 2019 and 2020. I will compare the column names across the CSV files and identify the differences between them. 

First, I will read each CSV file and store it in a list named 'dataframes'. Then I will extract the set of column names for each DataFrame and store them in a list called 'column_names'. Finally, I will iterate through each combination of DataFrames and compare their column names. If there are any differences, it will print out the columns that are present in one DataFrame but not in the other, for each pair of DataFrames.

The reason why I am comparing the columns in all five files, and not just the files for 2019 and 2020 where I know there is a difference is the number of columns, is because I want to ensure that there is a consistency througout all files and their respective columns.

In [56]:
# Create sets of column headings for each DataFrame
columns_2019 = set(fixed_coverage_2019_df.columns)
columns_2020 = set(fixed_coverage_2020_df.columns)
columns_2021 = set(fixed_coverage_2021_df.columns)
columns_2022 = set(fixed_coverage_2022_df.columns)
columns_2023 = set(fixed_coverage_2023_df.columns)

In [57]:
# Create a dictionary to store column headings for each year
column_headings = {
    '2019': set(fixed_coverage_2019_df.columns),
    '2020': set(fixed_coverage_2020_df.columns),
    '2021': set(fixed_coverage_2021_df.columns),
    '2022': set(fixed_coverage_2022_df.columns),
    '2023': set(fixed_coverage_2023_df.columns)
}

In [58]:
# Compare column headings across years
for year1, columns1 in column_headings.items():
    for year2, columns2 in column_headings.items():
        if year1 != year2:
            print(f"Differences between {year1} and {year2}:")
            print("Columns in", year1, "but not in", year2, ":\n", columns1 - columns2)
            print("Columns in", year2, "but not in", year1, ":\n", columns2 - columns1)
            print()


Differences between 2019 and 2020:
Columns in 2019 but not in 2020 :
 {'Number of premises able to receive SFBB from FWA', '% of premises able to receive SFBB from FWA'}
Columns in 2020 but not in 2019 :
 {'UFBB (100Mbit/s) availability (% premises)', 'Number of premises with UFBB (100Mbit/s) availability', 'Number of premises with Gigabit availability', 'Gigabit availability (% premises)'}

Differences between 2019 and 2021:
Columns in 2019 but not in 2021 :
 {'Number of premises able to receive SFBB from FWA', '% of premises able to receive SFBB from FWA'}
Columns in 2021 but not in 2019 :
 {'UFBB (100Mbit/s) availability (% premises)', 'Number of premises with UFBB (100Mbit/s) availability', 'Number of premises with Gigabit availability', 'Gigabit availability (% premises)'}

Differences between 2019 and 2022:
Columns in 2019 but not in 2022 :
 {'Number of premises able to receive SFBB from FWA', '% of premises able to receive SFBB from FWA'}
Columns in 2022 but not in 2019 :
 {'UFB

### Column differences - Key summary

To summarise my finding about the change in columns: it seems that two columns - '% of premises able to receive SFBB from FWA' and 'Number of premises able to receive SFBB from FWA' were dropped for all files from year 2020 onwards and four new columns were introduced - 'Number of premises with UFBB (100Mbit/s) availability', 'Gigabit availability (% premises)', 'UFBB (100Mbit/s) availability (% premises)', 'Number of premises with Gigabit availability'. In the previous section where I summarised my findings about the difference in number of columns, I thought there were only two new columns introduced, as the difference in columns from 2019 to 2020 increased from 38 to 40. However, now it becomes apparent that this is not the case.

Changes to be made to 2019 dataset:

* New columns to be introduced:
    * Number of premises with Gigabit availability
    * Gigabit availability (% premises)
    * Number of premises with UFBB (100 Mbit/s) availability
    * UFBB (100 Mbit/s) availability (% premises)
        
        
* Columns to be dropped:
    * Number of premises able to receive SFBB from FWA
    * % of premises able to receive SFBB from FWA

I now feel the need to dig deeper into this dataset. I need to understand exactly what each column means. For this, I will consult with the 'Connected Nations 2019 - About this data: Fixed local and unitary authority area' document. 

Reference: https://www.ofcom.org.uk/__data/assets/pdf_file/0026/186632/connected-nations-2019-about-fixed-local-unitary-authority-area.pdf

### Columns and their relations and derivation (origin)

I feel that before I proceed any further, I need to get a good understanding of how columns relate to each other. I see that there are a lot of columns that represent categorical data. For example, there are many columns that represent a percentage about some categorical data and there are columns that represent numbers of premises able to receive various broadband speeds.

I have identified that broadband speed is classified in three main categories - Superfast Broadband (SFBB), Ultrafast Broadband (UFBB) and Full Fibre Broadband. The speed for each category of broadband is 30Mbit/s and greater for SFBB, 300Mbit/s and greater for UFBB. Regarding the speed of the Full Fibre Broadband, it is not entirely clear what speed it covers. In the 'Connected Nations 2019' metadata document, it is explained that the definition for full fibre coverage has been changed from year 2018. I carried additional checks of documents that are out of the scope of this TMA. I went to Ofcom's website and found the 'Connected Nations 2018' main report in order to find more information what exactly the definition changes for full fibre coverage are. Under 'Fixed broadband and voice service' section in the report, it is said that 'full fibre broadband' 'can offer speeds of 1Gbp/s', while in the 'Overview' section of the document it is said that the speed is 'of up to 1 Gbit/s'. So, from that, I think it is fair that we can conclude that Full Fibre Broabdand speed covers 

Summary: 

* SFBB - 30Mbit/s +
* UFBB - 300Mbit/s +
* Full Fibre - there is ambiguity here. Full Fibre is a description of the technology and infrastructure rather than a speed category. But in my understanding, the speed is typically up to 1Gbp/s or more. 


#### SFBB columns

Let's explore some of the connection between the columns. I will start with columns that are related to SFBB - 'SFBB availability (% premises)' , 'Number of premises with 30<300Mbit/s download speed', 'Number of premises with SFBB availability'. I will also include the 'All matched premises' column, because reading from the report, all columns representing the perecentage of premises receiveing X speed are calculated based on the matched premises.

In [59]:
fixed_coverage_2019_df.loc[:, ['All Matched Premises', 'SFBB availability (% premises)', 'Number of premises with 30<300Mbit/s download speed', 'Number of premises with >=300Mbit/s download speed', 'Number of premises with SFBB availability']]

Unnamed: 0,All Matched Premises,SFBB availability (% premises),Number of premises with 30<300Mbit/s download speed,Number of premises with >=300Mbit/s download speed,Number of premises with SFBB availability
0,125311,73.3,91989,25163,117152
1,124305,78.5,98180,3472,101652
2,29760,16.3,4840,24543,29383
3,51284,89.8,46137,866,47003
4,60596,67.4,40893,15339,56232
...,...,...,...,...,...
377,62114,82.9,51778,5352,57130
378,76345,68.5,52345,20219,72564
379,56280,88.4,49833,3340,53173
380,48061,49.3,23724,22479,46203


My observations here is that the Number of premises with SFBB availability represents the sum of number of properties with broadband speed from 30Mbit/s up, which is correct with what is explained in the 'Connected Nations' report. However, a simple calcuations shows that the value in the SFBB availability (% premises) column does not actually represent the total number of SFBB availability, but only the number of premises with 30<300 Mbit/s download speed. Here are my calculations:

SFBB availability (% premises) = ((Number of premises with 30<300 Mbit/s download speed + Number of premises with >=300Mbit/s download speed) / All Matched Premises)* 100)

When I substitute the numbers for each column for entry with index [0] I get:

SFBB availability (% premises) = ((91989 + 25163) / 125311)* 100 
SFBB availability (% premises) = 93.488999369568514 or 93.5 when rounded. 

When we look at the table above, we see 73.3% in the SFBB availability (% premises). There is a discrepancy of 20%. Let's see why. I will calculate the percentage using only the 30<300Mbit/s speed for entry with index [0]. Using the same method as above, it gives:

(91989 / 125311)* 100 = 73.408559503954162

That value corresponds to what is shown in the SFBB availability (% premises) column.

So we can conclude that the SFBB availability (% premises) column contains wrong values. It needs to be corrected to accomomdate values in both columns with 30<300 and >=300 Mbit/s speeds.

I will go one step forward and check if this discrepancy is present in the rest of the datasets. That will give me an idea whether the mistake is only present in the 2019 dataset or in all datasets.

In [60]:
fixed_coverage_2020_df.loc[:, ['All Matched Premises', 'SFBB availability (% premises)', 'Number of premises with 30<300Mbit/s download speed', 'Number of premises with >=300Mbit/s download speed', 'Number of premises with SFBB availability']]

Unnamed: 0,All Matched Premises,SFBB availability (% premises),Number of premises with 30<300Mbit/s download speed,Number of premises with >=300Mbit/s download speed,Number of premises with SFBB availability
0,125948,94.6,66897,52461,119358
1,125176,82.9,95600,8872,104472
2,29755,98.8,3945,25482,29427
3,51483,92.3,46227,1466,47693
4,60972,94.7,41552,16323,57875
...,...,...,...,...,...
374,65212,94.4,38034,24136,62170
375,62215,93.8,49119,9524,58643
376,56411,95.1,40887,12852,53739
377,48173,96.8,23645,23035,46680


Quick sum: Number of premises with 30<300 Mbits + >=300 Mbits = Number of premises with SFBB availability
    66897 + 52461 = 119358

Indeed that is correct. To find out what the percentage is for SFBB availability (% premises), I need to divide the number of premises with SFBB availability by the All matched premises and multiply by 100.

(119358 / 125948)* 100 = 94.767681900466859. It seems that the rounding in the table is slightly lower. Let's check the next one.

(104472 / 95600)* 100 = 83.460088195820285. Again, it seems slightly off.

(29427 / 29755)* 100 = 98.897664258107881. Again, the rounded values in the table are slightly off. But the main point is that the SFBB availability (% premises) is including any speeds from 30Mbit/s and up. 

In [61]:
fixed_coverage_2021_df.loc[:, ['All Matched Premises', 'SFBB availability (% premises)', 'Number of premises with 30<300Mbit/s download speed', 'Number of premises with >=300Mbit/s download speed', 'Number of premises with SFBB availability']]

Unnamed: 0,All Matched Premises,SFBB availability (% premises),Number of premises with 30<300Mbit/s download speed,Number of premises with >=300Mbit/s download speed,Number of premises with SFBB availability
0,126771,94.7,41566,79372,120938
1,125378,82.8,87395,17366,104761
2,29793,98.6,3895,25581,29476
3,51622,92.3,46172,1750,47922
4,61161,95.1,41339,17177,58516
...,...,...,...,...,...
369,65306,94.6,35154,27435,62589
370,62695,94.2,47034,12656,59690
371,57099,95.4,28140,26617,54757
372,48204,96.7,23535,23324,46859


I will do a quick check here too:

For the entry with index 0 :

41566 + 79372 = 120938 

So, that is correct. 

(120938 / 126771) * 100 = 95.398789944072382. Again, the rounding is off, but nevertheless the SFBB includes any speed from 30 Mbit/s upward.

In [62]:
fixed_coverage_2022_df.loc[:, ['All Matched Premises', 'SFBB availability (% premises)', 'Number of premises with 30<300Mbit/s download speed', 'Number of premises with >=300Mbit/s download speed', 'Number of premises with SFBB availability']]

Unnamed: 0,All Matched Premises,SFBB availability (% premises),Number of premises with 30<300Mbit/s download speed,Number of premises with >=300Mbit/s download speed,Number of premises with SFBB availability
0,128294,95.8,24138,99197,123335
1,127265,84.2,81491,26251,107742
2,29920,99.1,2423,27264,29687
3,52133,92.7,45688,2811,48499
4,61902,96.1,31028,28732,59760
...,...,...,...,...,...
369,65735,95.2,31376,32097,63473
370,63530,95.1,41060,19882,60942
371,57900,97.0,21304,35013,56317
372,48679,97.3,20302,27255,47557


A quick check for entry with 0 index:

24138 + 99197 = 12335

(123335 / 128294) * 100 = 96.134659454066441 

The rounding seems off, but the point is to see if the SFBB availability includes, as it should do, speeds from 30Mbit/s upwards.

And lastly, let's check the 2023 dataset.

In [63]:
fixed_coverage_2023_df.loc[:, ['All Matched Premises', 'SFBB availability (% premises)', 'Number of premises with 30<300Mbit/s download speed', 'Number of premises with >=300Mbit/s download speed', 'Number of premises with SFBB availability']]

Unnamed: 0,All Matched Premises,SFBB availability (% premises),Number of premises with 30<300Mbit/s download speed,Number of premises with >=300Mbit/s download speed,Number of premises with SFBB availability
0,129197,97.2,17122,108514,125636
1,128070,85.9,77593,32703,110296
2,29953,99.1,1901,27826,29727
3,52364,93.1,45758,3127,48885
4,62430,97.2,22895,37875,60770
...,...,...,...,...,...
369,66176,96.0,26974,37197,64171
370,64205,96.2,30058,31851,61909
371,58243,97.3,21266,35477,56743
372,49027,97.8,16708,31340,48048


For record with 0 index:

17122 + 108514 = 125636

(125636 / 129197)* 100 = 97.243744049784438

That seems right. So, the conclusion is that I need to make ammendments in the way SFBB availability (% premises) is calculated in the 2019 dataset.

#### UFBB columns

I will now perform the same checks but with columns that involve UFBB. These are - UFBB availability (% premises), Number ofpremises with UFBB availability, Number of premises with >=300Mbit/s download speed. To reitarate, UFBB covers speeds of >=300Mbit/s.

In [64]:
fixed_coverage_2019_df.loc[:, ['All Matched Premises', 'UFBB availability (% premises)', 'Number of premises with UFBB availability', 'Number of premises with >=300Mbit/s download speed']]

Unnamed: 0,All Matched Premises,UFBB availability (% premises),Number of premises with UFBB availability,Number of premises with >=300Mbit/s download speed
0,125311,20.1,25163,25163
1,124305,2.8,3472,3472
2,29760,82.4,24543,24543
3,51284,1.7,866,866
4,60596,25.3,15339,15339
...,...,...,...,...
377,62114,8.6,5352,5352
378,76345,26.5,20219,20219
379,56280,5.9,3340,3340
380,48061,46.7,22479,22479


Performing quick calculations - The number of premises with >=300Mbit/s divided by the number of all matched premises, then multiplied by 100 to convert it into a percentage gives ((25163 / 125311)*100) = 20.08% (20.1% when rounded). That seems to match what is in the UFBB availability (% premises) column.
So, the values in the UFBB availability (% premises) column are derived from the Number of premises with >=300Mbit/s.

#### Full Fibre columns

As Full Fibre does not have a speed definition but it rather concentrates on the technology of delivery, we only have two columns related to Full Fibre - Number of premises with Full Fibre availability and Full Fibre availability (% premises).

In [65]:
fixed_coverage_2019_df.loc[:, ['All Matched Premises', 'Number of premises with Full Fibre availability', 'Full Fibre availability (% premises)']]

Unnamed: 0,All Matched Premises,Number of premises with Full Fibre availability,Full Fibre availability (% premises)
0,125311,16410,13.1
1,124305,3332,2.7
2,29760,193,0.6
3,51284,866,1.7
4,60596,13412,22.1
...,...,...,...
377,62114,3162,5.1
378,76345,1664,2.2
379,56280,3108,5.5
380,48061,420,0.9


Again, to check if the percentage of premises that are able to receive full fibre, I need to divide the number of premises with full fibre availability by the number of all matched premises and then muliply by 100 to convert into percentage. That gives: ((16410 / 125311) * 100) = 13.09% (13.1% when rounded). That matches the values in the table.

#### All Matched Premises column - check

The number of all matched premises should match the total number of the following columns:
* Number of premises with 0<2Mbit/s download speed
* Number of premises with 2<5Mbit/s download speed
* Number of premises with 5<10Mbit/s download speed
* Number of premises with 10<30Mbit/s download speed
* Number of premises with 30<300Mbit/s download speed
* Number of premises with >=300Mbit/s download speed

Let's verify if this is correct.

In [66]:
fixed_coverage_2019_df.loc[:, ['All Matched Premises', 
                               'Number of premises with 0<2Mbit/s download speed', 
                               'Number of premises with 2<5Mbit/s download speed', 
                               'Number of premises with 5<10Mbit/s download speed',
                               'Number of premises with 10<30Mbit/s download speed',
                               'Number of premises with 30<300Mbit/s download speed',
                               'Number of premises with >=300Mbit/s download speed'
                               ]]

Unnamed: 0,All Matched Premises,Number of premises with 0<2Mbit/s download speed,Number of premises with 2<5Mbit/s download speed,Number of premises with 5<10Mbit/s download speed,Number of premises with 10<30Mbit/s download speed,Number of premises with 30<300Mbit/s download speed,Number of premises with >=300Mbit/s download speed
0,125311,49,170,665,7275,91989,25163
1,124305,3163,4176,4993,10321,98180,3472
2,29760,0,16,28,333,4840,24543
3,51284,619,704,550,2408,46137,866
4,60596,89,460,705,3110,40893,15339
...,...,...,...,...,...,...,...
377,62114,129,424,670,3761,51778,5352
378,76345,90,399,242,3050,52345,20219
379,56280,69,114,199,2725,49833,3340
380,48061,140,151,371,1196,23724,22479


A quick calculation summing all collumns for the first row - 49 + 170 + 665 + 7275 + 91989 + 25163 = 125311

That seems to match the value in the 'All Matched Premises'. So we can assume that these 7 columns listed in the table above are the core columns. 'Core' because all other values in the 2019 dataset's columns are derived from some form of manipulation of these 7 columns.

I will try to automate the process and see if there any discrepancies by summing the values for each row and column and then comparing it to the value in the 'All Matched Premises'.

In [67]:
columns_to_sum = [
    'Number of premises with 0<2Mbit/s download speed',
    'Number of premises with 2<5Mbit/s download speed',
    'Number of premises with 5<10Mbit/s download speed',
    'Number of premises with 10<30Mbit/s download speed',
    'Number of premises with 30<300Mbit/s download speed',
    'Number of premises with >=300Mbit/s download speed'
]

# Summing the values for each row across specified columns
sums = fixed_coverage_2019_df[columns_to_sum].sum(axis=1)

# Extracting the 'All Matched Premises' column for comparison
total_matched_premises = fixed_coverage_2019_df['All Matched Premises']

# Create a boolean mask to identify mismatched entries
mismatch_mask = sums != total_matched_premises

# Find the rows and corresponding columns where the mismatched entries occur
mismatched_entries = fixed_coverage_2019_df.loc[mismatch_mask, columns_to_sum]
mismatched_entries


Unnamed: 0,Number of premises with 0<2Mbit/s download speed,Number of premises with 2<5Mbit/s download speed,Number of premises with 5<10Mbit/s download speed,Number of premises with 10<30Mbit/s download speed,Number of premises with 30<300Mbit/s download speed,Number of premises with >=300Mbit/s download speed


It seems that all values match.

### Summary - Column relations and derivation 

To sum up the discoveries so far:
* SFBB availability (% premises) column = ((Number of premises with 30<300 Mbit/s / All matched premises)) * 100
* UFBB availability (% premises) column = ((Number of premises with >=300Mbit/s / All matched premises)) * 100
* Full Fibre availability (% premises) column = ((Number of premises with Full Fibre availability / All matched premises)) * 100

* Number of premises with SFBB availability = Number of premises with 30<300 Mbit/s + Number of premises with >=300 Mbit/s
* Number of premises with UFBB availability = Number of premises with >=300Mbit/s download speed

The main takeaway here is that the number of premises with SFBB availability contains premises able to receive any broadband speed from 30 Mbit/s and higher,clearly including UFBB speeds too. Which in fact, matches what it explained in the 2019 metadata file. That the number and percentage of SFBB availability is indeed 30Mbit/s and higher, without specifiying any higher limit, inheritently including UFBB in these figures. Which presents ambiguity in my opinion and it needs to be addressed.

The rest of the columns seem to be straightforward. 

I think I should update the 'SFBB availability (% premises)' to accomodate any premises able to receive broadband speeds from 30 Mbit/s or greater, as explained in the metadata document. I feel that this would be most accurate and consistent to include UFBB speeds in the SFBB availability (% premises). For example, the number of premises with SFBB availability already includes these two columns (30<300 and >=300Mbit/s). However, I first need to address the SFBB from FWA issue. 

### Addressing differences and inconcistencies - Combining 'Number of premises with SFBB availability' and 'Number of premises able to receive SFBB from FWA' - 2019 dataset

I will address each issue individually. So, The first step is to understand what the removed columns represent.

The columns
* Number of premises able to receive SFBB from FWA
* % of premises able to receive SFBB from FWA

were removed from the datasets from 2020 (including) onwards till 2023, including.

The 'Number of premises with SFBB from FWA' column represents the number of premises with Superfast Broadband (SFBB) from Fixed Wireless Access (FWA). In other words, that means that any premises receiving Superfast Broadband, which is anything from '30Mbit/s or above', but from Fixed Wireless Access. After a brief research, I understand that FWA is a method of delivering broadband internet services to premises using mobile netwrok technology rather than traditional wired connections such as copper or fibre optic cables. FWA can be particularly beneficial in rural or hard-to-reach areas where laying cables is impractical or too costly. 

The '% of premises able to receive SFBB from FWA' represent the percentage of premises with Superfast Broadband (SFBB) from Fixed Wireless Access (FWA). The percentage numbers contained in that column are derived by dividing the number of premises able to receive SFBB via FWA by the total number of premises. The result is multiplied by 100 to convert it into a percentage.

Since year 2020 is the first year where the columns are no longer including, I will look into the metadata file included for that year. 

It says that 'Due to changes in the collection of information from Wireless Internet Service Providers on Fixed Wireless Access (FWA) coverage, it is no longer possible to identify superfast coverage from FWA provision'. This presents an ambiguity in how and what is done with the data. However, analysing the way the sentence is worded - 'it is no longer possible to identify', in my the word 'identify' signifies that the data is collected but combined with the rest of the SFBB data. In essence, that means that there is no way of differentiation between whether premises are receiving broadband with 30Mbit/s or higher from Fixed Wireless Access (FWA) or from the traditional wired connection. 

While in 2019 that differentiation was possible and documented. The broadband speed of 30Mbit/s or higher is classified as Superfast Broadband (SFBB). So, in order to achieve consistency throughout, the 2019 dataset needs to combine the 'Number of premises able to receive SFBB from FWA' and the 'Number of premises with SFBB availability'. Based on the statement given in the 'Connected Nations 2020 - About this data: Fixed coverage, local and unitoary authority area' it likely means that any premises able to receive Superfast Broadband (SFBB) speeds throguh FWA are now included within the broader 'Number of premises with SFBB availability' metric, rather than being reported separately. In the statement it does not say that such number of premises are excluded, hence why I think they have been included in the SFBB metrics. This adjustment would potentially create a more uniform and comparable series of data across years, reflecting total SFBB availability irrespective of the delivery technology.

To address this and achieve consistency in the 2019 dataset, the numbers in the two columns - Number of premises able to receive SFBB from FWA and Number of premises with SFBB availability will be combined.

Additionally, the column representing the percentage of premises with SFBB availability needs to be updated in order to present accurately the percentage. 

In [68]:
sffb_from_fwa_columns_2019_df = pd.DataFrame(columns=['Number of premises with SFBB availability', 'Number of premises able to receive SFBB from FWA', 'Sum of SFBB and SFBB from FWA'])
sffb_from_fwa_columns_2019_df

Unnamed: 0,Number of premises with SFBB availability,Number of premises able to receive SFBB from FWA,Sum of SFBB and SFBB from FWA


In [69]:
selected_columns = ['Number of premises with SFBB availability',
                    'Number of premises able to receive SFBB from FWA']

In [70]:
sffb_from_fwa_columns_2019_df[selected_columns] = fixed_coverage_2019_df[selected_columns]
sffb_from_fwa_columns_2019_df

Unnamed: 0,Number of premises with SFBB availability,Number of premises able to receive SFBB from FWA,Sum of SFBB and SFBB from FWA
0,117152,0,
1,101652,0,
2,29383,0,
3,47003,1114,
4,56232,0,
...,...,...,...
377,57130,848,
378,72564,0,
379,53173,775,
380,46203,172,


The number of rows matches the number of rows in the 2019, so the import has been correctly done.

In [71]:
sffb_from_fwa_columns_2019_df['Sum of SFBB and SFBB from FWA'] = (sffb_from_fwa_columns_2019_df['Number of premises with SFBB availability'] +
                                             sffb_from_fwa_columns_2019_df['Number of premises able to receive SFBB from FWA'])
sffb_from_fwa_columns_2019_df

Unnamed: 0,Number of premises with SFBB availability,Number of premises able to receive SFBB from FWA,Sum of SFBB and SFBB from FWA
0,117152,0,117152
1,101652,0,101652
2,29383,0,29383
3,47003,1114,48117
4,56232,0,56232
...,...,...,...
377,57130,848,57978
378,72564,0,72564
379,53173,775,53948
380,46203,172,46375


We can see that the combining of the values from both columns is successfull. For verification, row 3, where 47003 + 1114 is 48117.

#### Updating the values in 'Number of premises with SFBB availability'.

Now I have combined the values from both columns into one. I need to update the 'Number of premises with SFBB availability' column in the fixed_coverage_2019 data frame. Before I do that, I would like to preview just that column. I am doing this so I can compare the values in the column before and after the update.

In [72]:
fixed_coverage_2019_df['Number of premises with SFBB availability'].to_frame()

Unnamed: 0,Number of premises with SFBB availability
0,117152
1,101652
2,29383
3,47003
4,56232
...,...
377,57130
378,72564
379,53173
380,46203


In [73]:
fixed_coverage_2019_df['Number of premises with SFBB availability'] = sffb_from_fwa_columns_2019_df['Sum of SFBB and SFBB from FWA']
fixed_coverage_2019_df['Number of premises with SFBB availability'].to_frame()

Unnamed: 0,Number of premises with SFBB availability
0,117152
1,101652
2,29383
3,48117
4,56232
...,...
377,57978
378,72564
379,53948
380,46375


That's great! I can see that the values in the column have been updated. Now I can proceed calculating the new values for 'SFBB availability (% premises)'.

#### Calculating 'SFBB availability (% premises)

Now, we confirmed the columns have been dropped. There is a column representing the percentage of premises with SFBB availability (SFF availability (% premises)) that needs to be updated, as its figure is derived from 'Number of premises with SFBB availability' and 'All Matched Premises' column. The figure is calculated by dividing the number of premises with SFBB availability by the number of matched properties.

In [74]:
sffb_availability_percentage_2019_df = pd.DataFrame(columns=['All Matched Premises', 'Number of premises with SFBB availability', 'SFBB availability in percentage'])
sffb_availability_percentage_2019_df

Unnamed: 0,All Matched Premises,Number of premises with SFBB availability,SFBB availability in percentage


In [75]:
selected_columns = ['All Matched Premises',
                    'Number of premises with SFBB availability']

In [76]:
sffb_availability_percentage_2019_df[selected_columns] = fixed_coverage_2019_df[selected_columns]
sffb_availability_percentage_2019_df

Unnamed: 0,All Matched Premises,Number of premises with SFBB availability,SFBB availability in percentage
0,125311,117152,
1,124305,101652,
2,29760,29383,
3,51284,48117,
4,60596,56232,
...,...,...,...
377,62114,57978,
378,76345,72564,
379,56280,53948,
380,48061,46375,


In [77]:
sffb_availability_percentage_2019_df['SFBB availability in percentage'] = round((sffb_availability_percentage_2019_df['Number of premises with SFBB availability'] /
                                             sffb_availability_percentage_2019_df['All Matched Premises']) * 100, 1)
sffb_availability_percentage_2019_df

Unnamed: 0,All Matched Premises,Number of premises with SFBB availability,SFBB availability in percentage
0,125311,117152,93.5
1,124305,101652,81.8
2,29760,29383,98.7
3,51284,48117,93.8
4,60596,56232,92.8
...,...,...,...
377,62114,57978,93.3
378,76345,72564,95.0
379,56280,53948,95.9
380,48061,46375,96.5


#### Updating the values in 'SFBB availability (% premises)

In [78]:
fixed_coverage_2019_df['SFBB availability (% premises)'].to_frame()

Unnamed: 0,SFBB availability (% premises)
0,73.3
1,78.5
2,16.3
3,89.8
4,67.4
...,...
377,82.9
378,68.5
379,88.4
380,49.3


In [79]:
fixed_coverage_2019_df['SFBB availability (% premises)'] = sffb_availability_percentage_2019_df['SFBB availability in percentage']
fixed_coverage_2019_df['SFBB availability (% premises)'].to_frame()

Unnamed: 0,SFBB availability (% premises)
0,93.5
1,81.8
2,98.7
3,93.8
4,92.8
...,...
377,93.3
378,95.0
379,95.9
380,96.5


Now having addressed the SFBB from FWA issue by adding these values to the Number of premises with SFBB availability, since SFBB from FWA is still considered SFBB (30Mbit/s and higher), I can now drop the 'Number of premises able to receive SFBB from FWA' and '% of premises able to receive SFBB from FWA'.

In [80]:
fixed_coverage_2019_df.drop(columns=['Number of premises able to receive SFBB from FWA', '% of premises able to receive SFBB from FWA'], inplace=True)

Let's verify that the columns have been indeed dropped.

In [81]:
if 'Number of premises able to receive SFBB from FWA' in fixed_coverage_2019_df.columns or '% of premises able to receive SFBB from FWA' in fixed_coverage_2019_df.columns:
    print("At least one column still exists")
else:
    print("Columns have been successfully dropped")

Columns have been successfully dropped


Let's check the shape of the dataset to ensure columns have actually been dropped.

In [82]:
fixed_coverage_2019_df.shape

(382, 36)

That seems ok. Originally, there were 38 columns. 2 were dropped, so the result of 36 is correct. Also, we know that 2020 dataset has 4 additional columns, which I will address in a minute. But it makes sense, because after I add the 4 columns existing in the 2020 dataset, the number of columns would be 40, which matches all the other datasets - 2020, 2021, 2022, 2023.

*I will re-call the 2019 data frame for convinience.

### Addressing the entry differences in 2019 dataset

Having looked deeper in the dataset and the relations between the columns and having understood how numbers in most columns are derived from, I feel confident to start making changes to the dataset. 

I will start with the 2019 dataset and addressing the issues around the difference in entries. 

In [83]:
fixed_coverage_2019_df.shape

(382, 36)

In [84]:
# List of tuples for new entries
new_entries = [
    ('E06000060', 'BUCKINGHAMSHIRE'),
    ('E06000061', 'NORTH NORTHAMPTONSHIRE'),
    ('E06000062', 'WEST NORTHAMPTONSHIRE')
]

# DataFrame constructor to create a DataFrame from the list of tuples
new_df = pd.DataFrame(new_entries, columns=['laua', 'laua_name'])

# Concatenate the new DataFrame with the existing DataFrame
fixed_coverage_2019_df = pd.concat([fixed_coverage_2019_df, new_df], ignore_index=True)

Let's confirm if the new entries have been added.

In [85]:
fixed_coverage_2019_df[fixed_coverage_2019_df['laua'].isin(['E06000060', 
                                                            'E06000061', 
                                                            'E06000062'])]

Unnamed: 0,laua,laua_name,All Premises,All Matched Premises,SFBB availability (% premises),UFBB availability (% premises),Full Fibre availability (% premises),% of premises unable to receive 2Mbit/s,% of premises unable to receive 5Mbit/s,% of premises unable to receive 10Mbit/s,% of premises unable to receive 30Mbit/s,% of premises below the USO,% of premises with NGA,% of premises able to receive decent broadband from FWA,Number of premises with SFBB availability,Number of premises with UFBB availability,Number of premises with Full Fibre availability,Number of premises unable to receive 2Mbit/s,Number of premises unable to receive 5Mbit/s,Number of premises unable to receive 10Mbit/s,Number of premises unable to receive 30Mbit/s,Number of premises below the USO,Number of premises with NGA,Number of premises able to receive decent broadband from FWA,Number of premises with 30<300Mbit/s download speed,Number of premises with >=300Mbit/s download speed,Number of premises with 0<2Mbit/s download speed,Number of premises with 2<5Mbit/s download speed,Number of premises with 5<10Mbit/s download speed,Number of premises with 10<30Mbit/s download speed,% of premises with 30<300Mbit/s download speed,% of premises with >=300Mbit/s download speed,% of premises with 0<2Mbit/s download speed,% of premises with 2<5Mbit/s download speed,% of premises with 5<10Mbit/s download speed,% of premises with 10<30Mbit/s download speed
382,E06000060,BUCKINGHAMSHIRE,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
383,E06000061,NORTH NORTHAMPTONSHIRE,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
384,E06000062,WEST NORTHAMPTONSHIRE,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


That's seems ok. I will now proceed with filling in the columns with existing information, by summing all relevant rows together. However, I will not fill in the columns containing the percentage values, because they simply need to be recalculated in order to be accurate.

#### Buckinghamshire - 2019

The areas that Buckinghamshire covers are Aylesbury Vale, Chiltern, South Bucks, Wycombe.

In [86]:
# List of region names
regions_buckinghamshire = ['AYLESBURY VALE', 
           'CHILTERN', 
           'SOUTH BUCKS', 
           'WYCOMBE']

# Filter the DataFrame for the specified regions
filtered_regions = fixed_coverage_2019_df[fixed_coverage_2019_df['laua_name'].isin(regions_buckinghamshire)]

# Get the unique 'laua' IDs for the filtered regions
laua_ids = filtered_regions['laua'].unique()

laua_ids

array(['E07000004', 'E07000005', 'E07000006', 'E07000007'], dtype=object)

In [None]:
fixed_coverage_2019_df[fixed_coverage_2019_df['laua'].isin(['E07000004', 
                                                            'E07000005', 
                                                            'E07000006', 
                                                            'E07000007'])]

Unnamed: 0,laua,laua_name,All Premises,All Matched Premises,SFBB availability (% premises),UFBB availability (% premises),Full Fibre availability (% premises),% of premises unable to receive 2Mbit/s,% of premises unable to receive 5Mbit/s,% of premises unable to receive 10Mbit/s,% of premises unable to receive 30Mbit/s,% of premises below the USO,% of premises with NGA,% of premises able to receive decent broadband from FWA,Number of premises with SFBB availability,Number of premises with UFBB availability,Number of premises with Full Fibre availability,Number of premises unable to receive 2Mbit/s,Number of premises unable to receive 5Mbit/s,Number of premises unable to receive 10Mbit/s,Number of premises unable to receive 30Mbit/s,Number of premises below the USO,Number of premises with NGA,Number of premises able to receive decent broadband from FWA,Number of premises with 30<300Mbit/s download speed,Number of premises with >=300Mbit/s download speed,Number of premises with 0<2Mbit/s download speed,Number of premises with 2<5Mbit/s download speed,Number of premises with 5<10Mbit/s download speed,Number of premises with 10<30Mbit/s download speed,% of premises with 30<300Mbit/s download speed,% of premises with >=300Mbit/s download speed,% of premises with 0<2Mbit/s download speed,% of premises with 2<5Mbit/s download speed,% of premises with 5<10Mbit/s download speed,% of premises with 10<30Mbit/s download speed
13,E07000004,AYLESBURY VALE,88230.0,88194.0,91.6,43.1,11.4,0.5,1.7,3.2,8.4,0.6,96.1,0.0,80812.0,38056.0,10085.0,449.0,1503.0,2810.0,7382.0,510.0,84800.0,0.0,42756.0,38056.0,449.0,1054.0,1307.0,4572.0,48.5,43.1,0.5,1.2,1.5,5.2
72,E07000005,CHILTERN,42619.0,42614.0,92.3,43.4,0.7,0.1,0.3,0.7,7.7,0.1,99.5,0.0,39325.0,18499.0,301.0,27.0,120.0,318.0,3289.0,56.0,42422.0,0.0,20826.0,18499.0,27.0,93.0,198.0,2971.0,48.9,43.4,0.1,0.2,0.5,7.0
288,E07000006,SOUTH BUCKS,31200.0,31197.0,89.3,36.4,12.9,0.1,0.6,2.6,10.7,0.4,97.1,0.0,27870.0,11360.0,4033.0,24.0,192.0,808.0,3327.0,131.0,30291.0,0.0,16510.0,11360.0,24.0,168.0,616.0,2519.0,52.9,36.4,0.1,0.5,2.0,8.1
378,E07000007,WYCOMBE,76433.0,76345.0,95.0,26.5,2.2,0.1,0.6,1.0,4.9,0.3,99.2,0.0,72564.0,20219.0,1664.0,90.0,489.0,731.0,3781.0,219.0,75802.0,0.0,52345.0,20219.0,90.0,399.0,242.0,3050.0,68.5,26.5,0.1,0.5,0.3,4.0


I will automate the process of summing the relevant values in the each row and column. I need to identify the names of the columns first.

In [88]:
# Extract columns containing '%' in their name
percentage_columns = [col for col in fixed_coverage_2019_df.columns if 'Number' in col]
percentage_columns


['Number of premises with SFBB availability',
 'Number of premises with UFBB availability',
 'Number of premises with Full Fibre availability',
 'Number of premises unable to receive 2Mbit/s',
 'Number of premises unable to receive 5Mbit/s',
 'Number of premises unable to receive 10Mbit/s',
 'Number of premises unable to receive 30Mbit/s',
 'Number of premises below the USO',
 'Number of premises with NGA',
 'Number of premises able to receive decent broadband from FWA',
 'Number of premises with 30<300Mbit/s download speed',
 'Number of premises with >=300Mbit/s download speed',
 'Number of premises with 0<2Mbit/s download speed',
 'Number of premises with 2<5Mbit/s download speed',
 'Number of premises with 5<10Mbit/s download speed',
 'Number of premises with 10<30Mbit/s download speed']

In [89]:
# List of 'laua' IDs to be summed
laua_ids_to_sum_buckinghamshire = ['E07000004', 
                   'E07000005', 
                   'E07000006', 
                   'E07000007']

# Columns to be included in the summation
columns_to_sum = [
    'All Premises',
    'All Matched Premises',
    'Number of premises with SFBB availability',
    'Number of premises with UFBB availability',
    'Number of premises with Full Fibre availability',
    'Number of premises unable to receive 2Mbit/s',
    'Number of premises unable to receive 5Mbit/s',
    'Number of premises unable to receive 10Mbit/s',
    'Number of premises unable to receive 30Mbit/s',
    'Number of premises below the USO',
    'Number of premises with NGA',
    'Number of premises able to receive decent broadband from FWA',
    'Number of premises with 30<300Mbit/s download speed',
    'Number of premises with >=300Mbit/s download speed',
    'Number of premises with 0<2Mbit/s download speed',
    'Number of premises with 2<5Mbit/s download speed',
    'Number of premises with 5<10Mbit/s download speed',
    'Number of premises with 10<30Mbit/s download speed'
]

# Filter the DataFrame for the specified 'laua' IDs and columns
filtered_data = fixed_coverage_2019_df[fixed_coverage_2019_df['laua'].isin(laua_ids_to_sum_buckinghamshire)]

# Calculate the sum of the columns for the filtered rows
total_sum = filtered_data[columns_to_sum].sum()

# Update the corresponding columns for the 'laua' ID 'E06000060'
fixed_coverage_2019_df.loc[fixed_coverage_2019_df['laua'] == 'E06000060', columns_to_sum] = total_sum.values

In [90]:
fixed_coverage_2019_df[fixed_coverage_2019_df['laua'].isin(['E06000060'])]

Unnamed: 0,laua,laua_name,All Premises,All Matched Premises,SFBB availability (% premises),UFBB availability (% premises),Full Fibre availability (% premises),% of premises unable to receive 2Mbit/s,% of premises unable to receive 5Mbit/s,% of premises unable to receive 10Mbit/s,% of premises unable to receive 30Mbit/s,% of premises below the USO,% of premises with NGA,% of premises able to receive decent broadband from FWA,Number of premises with SFBB availability,Number of premises with UFBB availability,Number of premises with Full Fibre availability,Number of premises unable to receive 2Mbit/s,Number of premises unable to receive 5Mbit/s,Number of premises unable to receive 10Mbit/s,Number of premises unable to receive 30Mbit/s,Number of premises below the USO,Number of premises with NGA,Number of premises able to receive decent broadband from FWA,Number of premises with 30<300Mbit/s download speed,Number of premises with >=300Mbit/s download speed,Number of premises with 0<2Mbit/s download speed,Number of premises with 2<5Mbit/s download speed,Number of premises with 5<10Mbit/s download speed,Number of premises with 10<30Mbit/s download speed,% of premises with 30<300Mbit/s download speed,% of premises with >=300Mbit/s download speed,% of premises with 0<2Mbit/s download speed,% of premises with 2<5Mbit/s download speed,% of premises with 5<10Mbit/s download speed,% of premises with 10<30Mbit/s download speed
382,E06000060,BUCKINGHAMSHIRE,238482.0,238350.0,,,,,,,,,,,220571.0,88134.0,16083.0,590.0,2304.0,4667.0,17779.0,916.0,233315.0,0.0,132437.0,88134.0,590.0,1714.0,2363.0,13112.0,,,,,,


I have manually made some calculations to ensure that the summings of each column for all entries (Aylesbury Vale, Chiltern, South Bucks, Wycombe) matches the result values in the Buckinghamshire entry. For example:

All premises: 88230 + 42619 + 31200 + 76433 = 238482

All matched premises: 88194 + 42614 + 31197 + 76345 = 238350

Number of premises with SFBB: 80812 + 39325 + 27870 + 72564 = 220571

Number of premises with UFBB: 38056 + 18499 + 11360 + 20219= 88134



#### North Northamptonshire - 2019

In [91]:
# List of region names
regions_north_northamptonshire = ['CORBY', 
           'EAST NORTHAMPTONSHIRE', 
           'KETTERING', 
           'WELLINGBOROUGH']

# Filter the DataFrame for the specified regions
filtered_regions = fixed_coverage_2019_df[fixed_coverage_2019_df['laua_name'].isin(regions_north_northamptonshire)]

# Get the unique 'laua' IDs for the filtered regions
laua_ids = filtered_regions['laua'].unique()

laua_ids

array(['E07000150', 'E07000152', 'E07000153', 'E07000156'], dtype=object)

In [92]:
fixed_coverage_2019_df[fixed_coverage_2019_df['laua'].isin(['E07000150', 
                                                            'E07000152', 
                                                            'E07000153', 
                                                            'E07000156'])]

Unnamed: 0,laua,laua_name,All Premises,All Matched Premises,SFBB availability (% premises),UFBB availability (% premises),Full Fibre availability (% premises),% of premises unable to receive 2Mbit/s,% of premises unable to receive 5Mbit/s,% of premises unable to receive 10Mbit/s,% of premises unable to receive 30Mbit/s,% of premises below the USO,% of premises with NGA,% of premises able to receive decent broadband from FWA,Number of premises with SFBB availability,Number of premises with UFBB availability,Number of premises with Full Fibre availability,Number of premises unable to receive 2Mbit/s,Number of premises unable to receive 5Mbit/s,Number of premises unable to receive 10Mbit/s,Number of premises unable to receive 30Mbit/s,Number of premises below the USO,Number of premises with NGA,Number of premises able to receive decent broadband from FWA,Number of premises with 30<300Mbit/s download speed,Number of premises with >=300Mbit/s download speed,Number of premises with 0<2Mbit/s download speed,Number of premises with 2<5Mbit/s download speed,Number of premises with 5<10Mbit/s download speed,Number of premises with 10<30Mbit/s download speed,% of premises with 30<300Mbit/s download speed,% of premises with >=300Mbit/s download speed,% of premises with 0<2Mbit/s download speed,% of premises with 2<5Mbit/s download speed,% of premises with 5<10Mbit/s download speed,% of premises with 10<30Mbit/s download speed
80,E07000150,CORBY,32358.0,32284.0,98.2,61.2,14.3,0.2,0.6,0.7,1.8,0.5,98.6,0.0,31709.0,19818.0,4624.0,51.0,194.0,215.0,575.0,163.0,31906.0,0.0,11891.0,19818.0,51.0,143.0,21.0,360.0,36.7,61.2,0.2,0.4,0.1,1.1
111,E07000152,EAST NORTHAMPTONSHIRE,42826.0,42805.0,97.0,54.8,10.3,0.1,0.4,0.7,3.0,0.3,98.8,0.0,41503.0,23474.0,4397.0,40.0,178.0,298.0,1302.0,117.0,42307.0,1.0,18029.0,23474.0,40.0,138.0,120.0,1004.0,42.1,54.8,0.1,0.3,0.3,2.3
175,E07000153,KETTERING,48617.0,48276.0,96.8,52.4,6.8,0.1,0.5,1.0,3.2,0.9,96.9,0.1,46714.0,25457.0,3291.0,41.0,234.0,497.0,1562.0,441.0,47106.0,34.0,21257.0,25457.0,41.0,193.0,263.0,1065.0,43.7,52.4,0.1,0.4,0.5,2.2
355,E07000156,WELLINGBOROUGH,37633.0,37612.0,111.5,36.7,3.5,0.0,0.5,0.9,2.5,0.1,99.0,14.2,41953.0,13826.0,1335.0,17.0,175.0,355.0,923.0,52.0,37262.0,5339.0,22863.0,13826.0,17.0,158.0,180.0,568.0,60.8,36.7,0.0,0.4,0.5,1.5


I have just noted the outlier value in the Wellingborough row under the 'SFBB availability (% premises). The value there is 111.5, which is clearly an outlier, as the column represents percentages, so the values contain should be within 0-100% range. 

We already know that the value in this column is derived from the Number of premises with SFBB availability column divided by the All Matched Premises and multiplied by 100 to convert into percentage. Let's examine the Number of premises with SFBB availability and the columns representing the number of premises with 30<300 and >=300 Mbit/s.


In [93]:
fixed_coverage_2019_df.loc[355, ['Number of premises with SFBB availability', 'Number of premises with 30<300Mbit/s download speed', 'Number of premises with >=300Mbit/s download speed']]

Number of premises with SFBB availability              41953.0
Number of premises with 30<300Mbit/s download speed    22863.0
Number of premises with >=300Mbit/s download speed     13826.0
Name: 355, dtype: object

Let's quickly sum up the values in 30<300 and >=300 Mbit/s:

22863 + 13826 = 36689

Hmm, that's a lower value than what is already entered in the SFBB availability column. It must be an outlier or a typo. When I look at the All Premises in Wellingborough, the value is 37633. So, we can not have a higher number of premises able to receive SFBB broadband than the actual premises in the area. Therefore, I think this is a mistake and I am going to re-calculate the value and substitute it into the Number of premises with SFBB availability. I will need to recalculate the percentage of premises with SFBB availability.

In [94]:
fixed_coverage_2019_df.at[355, 'Number of premises with SFBB availability'] = fixed_coverage_2019_df.at[355, 'Number of premises with 30<300Mbit/s download speed'] + fixed_coverage_2019_df.at[355, 'Number of premises with >=300Mbit/s download speed']

Let's check if the value has been updated and correctly calculated.

In [95]:
sfbb_availability_355 = fixed_coverage_2019_df.at[355, 'Number of premises with SFBB availability']
sfbb_availability_355

np.float64(36689.0)

That's great. Now, let's calculate the percentage of premises with SFBB availability based on the updated value representing the SFBB availability.

In [96]:
fixed_coverage_2019_df.at[355, 'SFBB availability (% premises)'] = round((fixed_coverage_2019_df.at[355, 'Number of premises with SFBB availability'] / fixed_coverage_2019_df.at[355, 'All Matched Premises'])*100,1)

In [97]:
sfbb_availability_355 = fixed_coverage_2019_df.at[355, 'SFBB availability (% premises)']
sfbb_availability_355

np.float64(97.5)

Let's double check:

(36689 / 37612) * 100 = 97.545995958736573 = 97.5 (1 d.p.)	

Yes, that is correct. We can proceed further now.

In [98]:
# List of 'laua' IDs to be summed
laua_ids_to_sum_north_northampshire = ['E07000150',
                                   'E07000152', 
                                   'E07000153',
                                   'E07000156']

# Columns to be included in the summation
columns_to_sum = [
    'All Premises',
    'All Matched Premises',
    'Number of premises with SFBB availability',
    'Number of premises with UFBB availability',
    'Number of premises with Full Fibre availability',
    'Number of premises unable to receive 2Mbit/s',
    'Number of premises unable to receive 5Mbit/s',
    'Number of premises unable to receive 10Mbit/s',
    'Number of premises unable to receive 30Mbit/s',
    'Number of premises below the USO',
    'Number of premises with NGA',
    'Number of premises able to receive decent broadband from FWA',
    'Number of premises with 30<300Mbit/s download speed',
    'Number of premises with >=300Mbit/s download speed',
    'Number of premises with 0<2Mbit/s download speed',
    'Number of premises with 2<5Mbit/s download speed',
    'Number of premises with 5<10Mbit/s download speed',
    'Number of premises with 10<30Mbit/s download speed'
]

# Filter the DataFrame for the specified 'laua' IDs and columns
filtered_data = fixed_coverage_2019_df[fixed_coverage_2019_df['laua'].isin(laua_ids_to_sum_north_northampshire)]

# Calculate the sum of the columns for the filtered rows
total_sum = filtered_data[columns_to_sum].sum()

# Update the corresponding columns for the 'laua' ID 'E06000060'
fixed_coverage_2019_df.loc[fixed_coverage_2019_df['laua'] == 'E06000061', columns_to_sum] = total_sum.values

In [99]:
fixed_coverage_2019_df[fixed_coverage_2019_df['laua'].isin(['E06000061'])]

Unnamed: 0,laua,laua_name,All Premises,All Matched Premises,SFBB availability (% premises),UFBB availability (% premises),Full Fibre availability (% premises),% of premises unable to receive 2Mbit/s,% of premises unable to receive 5Mbit/s,% of premises unable to receive 10Mbit/s,% of premises unable to receive 30Mbit/s,% of premises below the USO,% of premises with NGA,% of premises able to receive decent broadband from FWA,Number of premises with SFBB availability,Number of premises with UFBB availability,Number of premises with Full Fibre availability,Number of premises unable to receive 2Mbit/s,Number of premises unable to receive 5Mbit/s,Number of premises unable to receive 10Mbit/s,Number of premises unable to receive 30Mbit/s,Number of premises below the USO,Number of premises with NGA,Number of premises able to receive decent broadband from FWA,Number of premises with 30<300Mbit/s download speed,Number of premises with >=300Mbit/s download speed,Number of premises with 0<2Mbit/s download speed,Number of premises with 2<5Mbit/s download speed,Number of premises with 5<10Mbit/s download speed,Number of premises with 10<30Mbit/s download speed,% of premises with 30<300Mbit/s download speed,% of premises with >=300Mbit/s download speed,% of premises with 0<2Mbit/s download speed,% of premises with 2<5Mbit/s download speed,% of premises with 5<10Mbit/s download speed,% of premises with 10<30Mbit/s download speed
383,E06000061,NORTH NORTHAMPTONSHIRE,161434.0,160977.0,,,,,,,,,,,156615.0,82575.0,13647.0,149.0,781.0,1365.0,4362.0,773.0,158581.0,5374.0,74040.0,82575.0,149.0,632.0,584.0,2997.0,,,,,,


To ensure the summing of each row in the each column is performed correctly, I will perform some manual calculations, just like I did in the Buckinghamshire section.

All Premises: 32358 + 42826 + 48617 + 37633 = 161434

All Matched Premises: 32284 + 42805 + 48276 + 37612 = 160977

Number of premises with SFBB availability: 31709 + 41503 + 46714 + 36689 = 156615

Number of premises with UFBB availability: 19818 + 23474 + 25457 + 13826 = 82575

All matching.

#### West Northamptonshire - 2019

In [100]:
# List of region names
regions_west_northamptonshire = ['DAVENTRY', 
                                 'NORTHAMPTON', 
                                 'SOUTH NORTHAMPTONSHIRE']

# Filter the DataFrame for the specified regions
filtered_regions = fixed_coverage_2019_df[fixed_coverage_2019_df['laua_name'].isin(regions_west_northamptonshire)]

# Get the unique 'laua' IDs for the filtered regions
laua_ids = filtered_regions['laua'].unique()

laua_ids

array(['E07000151', 'E07000154', 'E07000155'], dtype=object)

In [101]:
fixed_coverage_2019_df[fixed_coverage_2019_df['laua'].isin(['E07000151', 
                                                            'E07000154', 
                                                            'E07000155'])]

Unnamed: 0,laua,laua_name,All Premises,All Matched Premises,SFBB availability (% premises),UFBB availability (% premises),Full Fibre availability (% premises),% of premises unable to receive 2Mbit/s,% of premises unable to receive 5Mbit/s,% of premises unable to receive 10Mbit/s,% of premises unable to receive 30Mbit/s,% of premises below the USO,% of premises with NGA,% of premises able to receive decent broadband from FWA,Number of premises with SFBB availability,Number of premises with UFBB availability,Number of premises with Full Fibre availability,Number of premises unable to receive 2Mbit/s,Number of premises unable to receive 5Mbit/s,Number of premises unable to receive 10Mbit/s,Number of premises unable to receive 30Mbit/s,Number of premises below the USO,Number of premises with NGA,Number of premises able to receive decent broadband from FWA,Number of premises with 30<300Mbit/s download speed,Number of premises with >=300Mbit/s download speed,Number of premises with 0<2Mbit/s download speed,Number of premises with 2<5Mbit/s download speed,Number of premises with 5<10Mbit/s download speed,Number of premises with 10<30Mbit/s download speed,% of premises with 30<300Mbit/s download speed,% of premises with >=300Mbit/s download speed,% of premises with 0<2Mbit/s download speed,% of premises with 2<5Mbit/s download speed,% of premises with 5<10Mbit/s download speed,% of premises with 10<30Mbit/s download speed
91,E07000151,DAVENTRY,37772.0,37752.0,93.2,25.4,20.5,0.5,1.3,2.4,6.8,0.2,97.5,0.0,35176.0,9595.0,7746.0,174.0,494.0,925.0,2576.0,91.0,36846.0,0.0,25581.0,9595.0,174.0,320.0,431.0,1651.0,67.7,25.4,0.5,0.8,1.1,4.4
235,E07000154,NORTHAMPTON,106786.0,104653.0,98.4,71.5,3.9,0.0,0.3,0.6,1.6,1.7,96.6,1.2,102944.0,76359.0,4128.0,16.0,372.0,605.0,1714.0,1832.0,103156.0,1258.0,26580.0,76359.0,16.0,356.0,233.0,1109.0,24.9,71.5,0.0,0.3,0.2,1.0
298,E07000155,SOUTH NORTHAMPTONSHIRE,42270.0,42248.0,92.0,17.7,13.3,0.2,2.1,3.6,8.4,0.6,95.9,0.7,38880.0,7467.0,5614.0,74.0,907.0,1519.0,3568.0,263.0,40521.0,282.0,31213.0,7467.0,74.0,833.0,612.0,2049.0,73.8,17.7,0.2,2.0,1.4,4.8


In [102]:
# List of 'laua' IDs to be summed
laua_ids_to_sum_west_northamptonshire = ['E07000151', 'E07000154', 'E07000155']

# Columns to be included in the summation
columns_to_sum = [
    'All Premises',
    'All Matched Premises',
    'Number of premises with SFBB availability',
    'Number of premises with UFBB availability',
    'Number of premises with Full Fibre availability',
    'Number of premises unable to receive 2Mbit/s',
    'Number of premises unable to receive 5Mbit/s',
    'Number of premises unable to receive 10Mbit/s',
    'Number of premises unable to receive 30Mbit/s',
    'Number of premises below the USO',
    'Number of premises with NGA',
    'Number of premises able to receive decent broadband from FWA',
    'Number of premises with 30<300Mbit/s download speed',
    'Number of premises with >=300Mbit/s download speed',
    'Number of premises with 0<2Mbit/s download speed',
    'Number of premises with 2<5Mbit/s download speed',
    'Number of premises with 5<10Mbit/s download speed',
    'Number of premises with 10<30Mbit/s download speed'
]

# Filter the DataFrame for the specified 'laua' IDs and columns
filtered_data = fixed_coverage_2019_df[fixed_coverage_2019_df['laua'].isin(laua_ids_to_sum_west_northamptonshire)]

# Calculate the sum of the columns for the filtered rows
total_sum = filtered_data[columns_to_sum].sum()

# Update the corresponding columns for the 'laua' ID 'E06000060'
fixed_coverage_2019_df.loc[fixed_coverage_2019_df['laua'] == 'E06000062', columns_to_sum] = total_sum.values

In [103]:
fixed_coverage_2019_df[fixed_coverage_2019_df['laua'].isin(['E06000062'])]

Unnamed: 0,laua,laua_name,All Premises,All Matched Premises,SFBB availability (% premises),UFBB availability (% premises),Full Fibre availability (% premises),% of premises unable to receive 2Mbit/s,% of premises unable to receive 5Mbit/s,% of premises unable to receive 10Mbit/s,% of premises unable to receive 30Mbit/s,% of premises below the USO,% of premises with NGA,% of premises able to receive decent broadband from FWA,Number of premises with SFBB availability,Number of premises with UFBB availability,Number of premises with Full Fibre availability,Number of premises unable to receive 2Mbit/s,Number of premises unable to receive 5Mbit/s,Number of premises unable to receive 10Mbit/s,Number of premises unable to receive 30Mbit/s,Number of premises below the USO,Number of premises with NGA,Number of premises able to receive decent broadband from FWA,Number of premises with 30<300Mbit/s download speed,Number of premises with >=300Mbit/s download speed,Number of premises with 0<2Mbit/s download speed,Number of premises with 2<5Mbit/s download speed,Number of premises with 5<10Mbit/s download speed,Number of premises with 10<30Mbit/s download speed,% of premises with 30<300Mbit/s download speed,% of premises with >=300Mbit/s download speed,% of premises with 0<2Mbit/s download speed,% of premises with 2<5Mbit/s download speed,% of premises with 5<10Mbit/s download speed,% of premises with 10<30Mbit/s download speed
384,E06000062,WEST NORTHAMPTONSHIRE,186828.0,184653.0,,,,,,,,,,,177000.0,93421.0,17488.0,264.0,1773.0,3049.0,7858.0,2186.0,180523.0,1540.0,83374.0,93421.0,264.0,1509.0,1276.0,4809.0,,,,,,


I will perform the basic calculations I have done for the other two areas to ensure that the summing has been performed correctly.

All Premises: 37772 + 106786 + 42270 = 186828

All Matched Premises: 37752 + 104653 + 42248 = 184653

Number of premises with SFBB availability: 35176 + 102944 + 38880 = 177000

Number of premises with UFBB availability: 9595 + 76359 + 7467 = 93421 

All matching, so I can proceed.

Yes, that is correct. Now I have the same number of entries as in 2021, 2022 and 2023 datasets. 

While I am at it, I am going to work on the 2020 dataset to update these rows.

#### Dropping redundant columns in 2019 dataset.

Since the new entries have the summed values updated, as they should, now I can drop the rows as they are no longer needed. The columns containing percentages of something can not be populated by summing each row, as it won't be accurate. The values in the percentage columns need to be recalculated based on the new values in the number columns. For now, I will leave the percentage columns with 'NaN' values and will come back to that in a minute.

I will proceed with dropping the redundant rows.

First, let's see what the shape of 2019 dataframe is after I introduced the new entries. 

In [104]:
fixed_coverage_2019_df.shape

(385, 36)

385. That seems right, as initially, the 2019 dataframe had 382 entries. I have introduced 3 entries - Buckinghamshire, North Northamptonshire, South Northamptonshire. So, 382 + 3 = 385.

I will now proceed dropping the redundant rows.

In [105]:
rows_to_drop = ['AYLESBURY VALE', 
                'CHILTERN', 
                'SOUTH BUCKS', 
                'WYCOMBE', 
                'CORBY', 
                'EAST NORTHAMPTONSHIRE', 
                'KETTERING', 
                'WELLINGBOROUGH', 
                'DAVENTRY', 
                'NORTHAMPTON', 
                'SOUTH NORTHAMPTONSHIRE']

# Drop rows where 'laua_name' is in rows_to_drop
fixed_coverage_2019_df = fixed_coverage_2019_df[~fixed_coverage_2019_df['laua_name'].isin(rows_to_drop)]

The rows are now dropped. Let's check the shape again. There are 11 rows dropped, so we should have 374 entries. (385 - 11 = 374).

In [106]:
fixed_coverage_2019_df.shape

(374, 36)

###### Disucssion

### Addressing the entry differences in 2020 dataset

#### Number of premises with Gigabit availability and Gigabit availability (% premises) - 2019

From 2020, I identified that there are 4 additional columns that are not present in 2019 dataset, due to changes of the way broadband is classified. These columns are:

* Number of premises with UFBB (100Mbit/s) availability
* UFBB (100Mbit/s) availability (% premises)
* Number of premises with Gigabit availability
* Gigabit availability (% premises)

To achieve consistency throughout the five datasets, I will need to introduce these columns in 2019. However, that presents issues especially for the UFBB.

After consulting myself with the Connected Nations Report for 2022 and the Methodology document provided, it is explained in both that Gigabit-capable broadband provides speeds of 1Gbit/s or higher. These kind of speeds are typically only available in locations with fibre-to-the-premises (FTTP), or known as 'full fibre connections'. 

As the Gigabit availability in numbers and % of premises columns are new and I need to populate the data into them, I will apply my understanding of Gigabit-capable broadband. In the 2019 dataset, we have already got a column called 'Number of premises with Full Fibre availability' as well as 'Full Fibre availability (% premises)'. Essentially, that is the same. 

Before I proceed to do anything on the 2019 dataset, I would like to see the Gigabit and Full Fibre availability columns in the 2020 dataset and compare if the values are the same. If my understanding of the Gigabit broadband and its relation to the Full Fibre, these values should be the same in both columns. Let's quickly explore it.


In [107]:
fixed_coverage_2020_df[['Number of premises with Gigabit availability', 'Number of premises with Full Fibre availability']]

Unnamed: 0,Number of premises with Gigabit availability,Number of premises with Full Fibre availability
0,44051,44051
1,8732,8732
2,189,189
3,1466,1466
4,14438,14438
...,...,...
374,23863,23863
375,5066,4988
376,12612,12612
377,23034,961


Interestingly, the values in both Gigabit and Full Fibre availability are the same.

In [108]:
fixed_coverage_2021_df[['Number of premises with Gigabit availability', 'Number of premises with Full Fibre availability']]

Unnamed: 0,Number of premises with Gigabit availability,Number of premises with Full Fibre availability
0,74618,74618
1,17269,17269
2,538,538
3,1750,1750
4,15512,15512
...,...,...
369,27106,27106
370,11101,8099
371,26591,26591
372,23324,1276


In [109]:
fixed_coverage_2022_df[['Number of premises with Gigabit availability', 'Number of premises with Full Fibre availability']]

Unnamed: 0,Number of premises with Gigabit availability,Number of premises with Full Fibre availability
0,95719,95719
1,26171,26171
2,27009,16333
3,2811,2811
4,27186,27101
...,...,...
369,31767,31766
370,19685,18815
371,35010,34984
372,27064,5020


In [110]:
fixed_coverage_2023_df[['Number of premises with Gigabit availability', 'Number of premises with Full Fibre availability']]

Unnamed: 0,Number of premises with Gigabit availability,Number of premises with Full Fibre availability
0,107315,107315
1,32622,32622
2,27826,19606
3,3127,3127
4,36952,36868
...,...,...
369,36895,36894
370,31790,31187
371,35480,35450
372,31340,12612


It seems that the values in both columns - 'Number of premises with Gigabit availability' and 'Number of premises with Full Fibre availability' match. So, I think we can conclude that they are essentially the same thing the way Ofcom reports it. So, I can actually take the values in columns 'Number of premises with Full Fibre availability' and 'Full Fibre availability (% premises)' and substitute these same figures in the newly introduced columns 'Number of premises with Gigabit availability' and 'Gigabit availability (% premises)' in 2019 dataset.

#### Introducing Gigabit availability columns in 2019 dataset.

In [111]:
fixed_coverage_2019_df['Number of premises with Gigabit availability'] = fixed_coverage_2019_df['Number of premises with Full Fibre availability']
fixed_coverage_2019_df['Gigabit availability (% premises)'] = fixed_coverage_2019_df['Full Fibre availability (% premises)']

In [112]:
fixed_coverage_2019_df.loc[:, ['Number of premises with Gigabit availability', 'Number of premises with Full Fibre availability', 'Gigabit availability (% premises)', 'Full Fibre availability (% premises)']]

Unnamed: 0,Number of premises with Gigabit availability,Number of premises with Full Fibre availability,Gigabit availability (% premises),Full Fibre availability (% premises)
0,16410.0,16410.0,13.1,13.1
1,3332.0,3332.0,2.7,2.7
2,193.0,193.0,0.6,0.6
3,866.0,866.0,1.7,1.7
4,13412.0,13412.0,22.1,22.1
...,...,...,...,...
380,420.0,420.0,0.9,0.9
381,43077.0,43077.0,43.6,43.6
382,16083.0,16083.0,,
383,13647.0,13647.0,,


In [113]:
fixed_coverage_2019_df.shape

(374, 38)

In [114]:
fixed_coverage_2020_df

Unnamed: 0,laua,laua_name,All Premises,All Matched Premises,SFBB availability (% premises),UFBB (100Mbit/s) availability (% premises),UFBB availability (% premises),Full Fibre availability (% premises),Gigabit availability (% premises),% of premises unable to receive 2Mbit/s,% of premises unable to receive 5Mbit/s,% of premises unable to receive 10Mbit/s,% of premises unable to receive 30Mbit/s,% of premises below the USO,% of premises with NGA,% of premises able to receive decent broadband from FWA,Number of premises with SFBB availability,Number of premises with UFBB (100Mbit/s) availability,Number of premises with UFBB availability,Number of premises with Full Fibre availability,Number of premises with Gigabit availability,Number of premises unable to receive 2Mbit/s,Number of premises unable to receive 5Mbit/s,Number of premises unable to receive 10Mbit/s,Number of premises unable to receive 30Mbit/s,Number of premises below the USO,Number of premises with NGA,Number of premises able to receive decent broadband from FWA,Number of premises with 30<300Mbit/s download speed,Number of premises with >=300Mbit/s download speed,Number of premises with 0<2Mbit/s download speed,Number of premises with 2<5Mbit/s download speed,Number of premises with 5<10Mbit/s download speed,Number of premises with 10<30Mbit/s download speed,% of premises with 30<300Mbit/s download speed,% of premises with >=300Mbit/s download speed,% of premises with 0<2Mbit/s download speed,% of premises with 2<5Mbit/s download speed,% of premises with 5<10Mbit/s download speed,% of premises with 10<30Mbit/s download speed
0,S12000033,ABERDEEN CITY,126176,125948,94.6,49.0,41.6,34.9,34.9,0.0,0.2,0.7,5.2,0.2,97.0,0.0,119358,61798,52461,44051,44051,55,208,881,6590,300,122434,0,66897,52461,55,153,673,5709,53.0,41.6,0.0,0.1,0.5,4.5
1,S12000034,ABERDEENSHIRE,126065,125176,82.9,7.2,7.0,6.9,6.9,2.6,5.6,9.1,16.4,3.6,94.7,0.0,104472,9118,8872,8732,8732,3234,7099,11516,20704,4538,119331,0,95600,8872,3234,3865,4417,9188,75.8,7.0,2.6,3.1,3.5,7.3
2,E07000223,ADUR,29779,29755,98.8,85.8,85.6,0.6,0.6,0.0,0.0,0.1,1.1,0.1,99.5,0.0,29427,25562,25482,189,189,0,10,34,328,33,29616,0,3945,25482,0,10,24,294,13.2,85.6,0.0,0.0,0.1,1.0
3,E07000026,ALLERDALE,51647,51483,92.3,2.8,2.8,2.8,2.8,1.2,2.3,3.3,7.3,1.2,98.6,2.2,47693,1466,1466,1466,1466,627,1173,1705,3790,634,50931,1160,46227,1466,627,546,532,2085,89.5,2.8,1.2,1.1,1.0,4.0
4,E07000032,AMBER VALLEY,61134,60972,94.7,30.2,26.7,23.6,23.6,0.1,0.5,0.9,5.1,0.4,98.9,0.0,57875,18462,16323,14438,14438,63,280,573,3097,267,60483,0,41552,16323,63,217,293,2524,68.0,26.7,0.1,0.4,0.5,4.1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
374,W06000006,WREXHAM,65867,65212,94.4,37.3,36.6,36.2,36.2,0.3,1.0,1.8,4.6,1.4,97.2,43.9,62170,24551,24136,23863,23863,194,639,1217,3042,930,64021,28917,38034,24136,194,445,578,1825,57.7,36.6,0.3,0.7,0.9,2.8
375,E07000238,WYCHAVON,62536,62215,93.8,19.2,15.2,8.0,8.1,0.1,0.7,1.3,5.7,0.8,98.6,0.4,58643,12032,9524,4988,5066,83,423,822,3572,490,61652,225,49119,9524,83,340,399,2750,78.5,15.2,0.1,0.5,0.6,4.4
376,E07000128,WYRE,56527,56411,95.1,22.7,22.7,22.3,22.3,0.1,0.3,0.7,4.7,0.2,98.8,0.3,53739,12856,12852,12612,12612,61,174,368,2672,133,55837,186,40887,12852,61,113,194,2304,72.3,22.7,0.1,0.2,0.3,4.1
377,E07000239,WYRE FOREST,48237,48173,96.8,47.9,47.8,2.0,47.8,0.2,0.5,0.8,3.1,0.4,99.5,3.5,46680,23125,23035,961,23034,84,218,389,1493,183,48001,1675,23645,23035,84,134,171,1104,49.0,47.8,0.2,0.3,0.4,2.3


I am going to re-call what the missing entries in 2020 are.

In [115]:
# Example usage to compare 2020 and 2021 dataframes
added_entries, removed_entries = compare_dataframes(fixed_coverage_2020_df, fixed_coverage_2021_df, 'laua')

print("Added Entries:")
print(added_entries[['laua', 'laua_name']])
print("\nRemoved Entries:")
print(removed_entries[['laua', 'laua_name']])


Added Entries:
          laua               laua_name
226  E06000061  NORTH NORTHAMPTONSHIRE
355  E06000062   WEST NORTHAMPTONSHIRE

Removed Entries:
          laua               laua_name
79   E07000150                   CORBY
90   E07000151                DAVENTRY
110  E07000152   EAST NORTHAMPTONSHIRE
174  E07000153               KETTERING
234  E07000154             NORTHAMPTON
296  E07000155  SOUTH NORTHAMPTONSHIRE
353  E07000156          WELLINGBOROUGH


So, I need to add North Northamptonshire entry with 'laua' E06000061, containing the following areas:

* Corby 
* East Northamptonshire
* Kettering
* Wellingborough

And West Northamptonshire entry with 'laua' E06000062, containing the following areas:

* Daventry
* Northampton
* South Northamptonshire

In [116]:
new_entries = [
    ('E06000061', 'NORTH NORTHAMPTONSHIRE'),
    ('E06000062', 'WEST NORTHAMPTONSHIRE')
]

In [117]:
# DataFrame constructor to create a DataFrame from the list of tuples
new_df = pd.DataFrame(new_entries, columns=['laua', 'laua_name'])

In [118]:
# Concatenate the new DataFrame with the existing DataFrame
fixed_coverage_2020_df = pd.concat([fixed_coverage_2020_df, new_df], ignore_index=True)

Let's confirm if the new entries have been added.

In [119]:
fixed_coverage_2020_df[fixed_coverage_2020_df['laua'].isin(['E06000061',
                                                           'E06000062'])]

Unnamed: 0,laua,laua_name,All Premises,All Matched Premises,SFBB availability (% premises),UFBB (100Mbit/s) availability (% premises),UFBB availability (% premises),Full Fibre availability (% premises),Gigabit availability (% premises),% of premises unable to receive 2Mbit/s,% of premises unable to receive 5Mbit/s,% of premises unable to receive 10Mbit/s,% of premises unable to receive 30Mbit/s,% of premises below the USO,% of premises with NGA,% of premises able to receive decent broadband from FWA,Number of premises with SFBB availability,Number of premises with UFBB (100Mbit/s) availability,Number of premises with UFBB availability,Number of premises with Full Fibre availability,Number of premises with Gigabit availability,Number of premises unable to receive 2Mbit/s,Number of premises unable to receive 5Mbit/s,Number of premises unable to receive 10Mbit/s,Number of premises unable to receive 30Mbit/s,Number of premises below the USO,Number of premises with NGA,Number of premises able to receive decent broadband from FWA,Number of premises with 30<300Mbit/s download speed,Number of premises with >=300Mbit/s download speed,Number of premises with 0<2Mbit/s download speed,Number of premises with 2<5Mbit/s download speed,Number of premises with 5<10Mbit/s download speed,Number of premises with 10<30Mbit/s download speed,% of premises with 30<300Mbit/s download speed,% of premises with >=300Mbit/s download speed,% of premises with 0<2Mbit/s download speed,% of premises with 2<5Mbit/s download speed,% of premises with 5<10Mbit/s download speed,% of premises with 10<30Mbit/s download speed
379,E06000061,NORTH NORTHAMPTONSHIRE,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
380,E06000062,WEST NORTHAMPTONSHIRE,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


That's good news. The two new entries have been added to the dataframe. Now I need to populate them with data by summing the values in the relevant rows.

#### North Northamptonshire - 2020

In [120]:
# Filter the DataFrame for the specified regions
filtered_regions = fixed_coverage_2020_df[fixed_coverage_2020_df['laua_name'].isin(regions_north_northamptonshire)]

# Get the unique 'laua' IDs for the filtered regions
laua_ids = filtered_regions['laua'].unique()

laua_ids

array(['E07000150', 'E07000152', 'E07000153', 'E07000156'], dtype=object)

In [121]:
fixed_coverage_2020_df[fixed_coverage_2020_df['laua'].isin(['E07000150', 
                                                            'E07000152', 
                                                            'E07000153', 
                                                            'E07000156'])]

Unnamed: 0,laua,laua_name,All Premises,All Matched Premises,SFBB availability (% premises),UFBB (100Mbit/s) availability (% premises),UFBB availability (% premises),Full Fibre availability (% premises),Gigabit availability (% premises),% of premises unable to receive 2Mbit/s,% of premises unable to receive 5Mbit/s,% of premises unable to receive 10Mbit/s,% of premises unable to receive 30Mbit/s,% of premises below the USO,% of premises with NGA,% of premises able to receive decent broadband from FWA,Number of premises with SFBB availability,Number of premises with UFBB (100Mbit/s) availability,Number of premises with UFBB availability,Number of premises with Full Fibre availability,Number of premises with Gigabit availability,Number of premises unable to receive 2Mbit/s,Number of premises unable to receive 5Mbit/s,Number of premises unable to receive 10Mbit/s,Number of premises unable to receive 30Mbit/s,Number of premises below the USO,Number of premises with NGA,Number of premises able to receive decent broadband from FWA,Number of premises with 30<300Mbit/s download speed,Number of premises with >=300Mbit/s download speed,Number of premises with 0<2Mbit/s download speed,Number of premises with 2<5Mbit/s download speed,Number of premises with 5<10Mbit/s download speed,Number of premises with 10<30Mbit/s download speed,% of premises with 30<300Mbit/s download speed,% of premises with >=300Mbit/s download speed,% of premises with 0<2Mbit/s download speed,% of premises with 2<5Mbit/s download speed,% of premises with 5<10Mbit/s download speed,% of premises with 10<30Mbit/s download speed
79,E07000150,CORBY,32478.0,32397.0,98.1,78.1,64.9,23.6,23.6,0.1,0.2,0.3,1.6,0.4,98.6,0.0,31864.0,25373.0,21081.0,7681.0,7681.0,20.0,81.0,105.0,533.0,128.0,32027.0,0.0,10783.0,21081.0,20.0,61.0,24.0,428.0,33.2,64.9,0.1,0.2,0.1,1.3
110,E07000152,EAST NORTHAMPTONSHIRE,43066.0,43030.0,97.9,57.9,57.6,13.1,13.1,0.1,0.3,0.5,2.0,0.3,99.0,0.0,42173.0,24915.0,24807.0,5626.0,5626.0,30.0,109.0,210.0,857.0,123.0,42631.0,1.0,17366.0,24807.0,30.0,79.0,101.0,647.0,40.3,57.6,0.1,0.2,0.2,1.5
174,E07000153,KETTERING,48831.0,48416.0,97.4,77.8,72.6,12.0,12.0,0.0,0.4,0.7,1.7,1.1,98.0,0.1,47570.0,38011.0,35437.0,5845.0,5845.0,24.0,192.0,329.0,846.0,518.0,47860.0,34.0,12133.0,35437.0,24.0,168.0,137.0,517.0,24.8,72.6,0.0,0.3,0.3,1.1
353,E07000156,WELLINGBOROUGH,37851.0,37830.0,97.5,62.7,62.6,4.1,4.1,0.1,0.5,0.9,2.5,0.1,99.0,14.1,36891.0,23738.0,23688.0,1569.0,1569.0,19.0,184.0,353.0,939.0,56.0,37474.0,5337.0,13203.0,23688.0,19.0,165.0,169.0,586.0,34.9,62.6,0.1,0.4,0.4,1.5


Earlier we have noted that there are some changes in the number and name of columns for 2020. Let's see the column names.

In [122]:
fixed_coverage_2020_df.columns

Index(['laua', 'laua_name', 'All Premises', 'All Matched Premises',
       'SFBB availability (% premises)',
       'UFBB (100Mbit/s) availability (% premises)',
       'UFBB availability (% premises)',
       'Full Fibre availability (% premises)',
       'Gigabit availability (% premises)',
       '% of premises unable to receive 2Mbit/s',
       '% of premises unable to receive 5Mbit/s',
       '% of premises unable to receive 10Mbit/s',
       '% of premises unable to receive 30Mbit/s',
       '% of premises below the USO', '% of premises with NGA',
       '% of premises able to receive decent broadband from FWA',
       'Number of premises with SFBB availability',
       'Number of premises with UFBB (100Mbit/s) availability',
       'Number of premises with UFBB availability',
       'Number of premises with Full Fibre availability',
       'Number of premises with Gigabit availability',
       'Number of premises unable to receive 2Mbit/s',
       'Number of premises unable to r

I am only going to populate the 'number' columns, and leave the percentage columns for later, as the approach there is slightly diferent as percentages can not just me summed up.

In [123]:
# List of 'laua' IDs to be summed
laua_ids_to_sum_north_northampshire = ['E07000150',
                                   'E07000152', 
                                   'E07000153',
                                   'E07000156']

# Columns to be included in the summation
columns_to_sum = [
    'All Premises',
    'All Matched Premises',
    'Number of premises with SFBB availability',
    'Number of premises with UFBB (100Mbit/s) availability',
    'Number of premises with UFBB availability',
    'Number of premises with Full Fibre availability',
    'Number of premises with Gigabit availability',
    'Number of premises unable to receive 2Mbit/s',
    'Number of premises unable to receive 5Mbit/s',
    'Number of premises unable to receive 10Mbit/s',
    'Number of premises unable to receive 30Mbit/s',
    'Number of premises below the USO', 'Number of premises with NGA',
    'Number of premises able to receive decent broadband from FWA',
    'Number of premises with 30<300Mbit/s download speed',
    'Number of premises with >=300Mbit/s download speed',
    'Number of premises with 0<2Mbit/s download speed',
    'Number of premises with 2<5Mbit/s download speed',
    'Number of premises with 5<10Mbit/s download speed',
    'Number of premises with 10<30Mbit/s download speed',
]

# Filter the DataFrame for the specified 'laua' IDs and columns
filtered_data = fixed_coverage_2020_df[fixed_coverage_2020_df['laua'].isin(laua_ids_to_sum_north_northampshire)]

# Calculate the sum of the columns for the filtered rows
total_sum = filtered_data[columns_to_sum].sum()

# Update the corresponding columns for the 'laua' ID 'E06000060'
fixed_coverage_2020_df.loc[fixed_coverage_2020_df['laua'] == 'E06000061', columns_to_sum] = total_sum.values

In [124]:
fixed_coverage_2020_df[fixed_coverage_2020_df['laua'].isin(['E06000061'])]

Unnamed: 0,laua,laua_name,All Premises,All Matched Premises,SFBB availability (% premises),UFBB (100Mbit/s) availability (% premises),UFBB availability (% premises),Full Fibre availability (% premises),Gigabit availability (% premises),% of premises unable to receive 2Mbit/s,% of premises unable to receive 5Mbit/s,% of premises unable to receive 10Mbit/s,% of premises unable to receive 30Mbit/s,% of premises below the USO,% of premises with NGA,% of premises able to receive decent broadband from FWA,Number of premises with SFBB availability,Number of premises with UFBB (100Mbit/s) availability,Number of premises with UFBB availability,Number of premises with Full Fibre availability,Number of premises with Gigabit availability,Number of premises unable to receive 2Mbit/s,Number of premises unable to receive 5Mbit/s,Number of premises unable to receive 10Mbit/s,Number of premises unable to receive 30Mbit/s,Number of premises below the USO,Number of premises with NGA,Number of premises able to receive decent broadband from FWA,Number of premises with 30<300Mbit/s download speed,Number of premises with >=300Mbit/s download speed,Number of premises with 0<2Mbit/s download speed,Number of premises with 2<5Mbit/s download speed,Number of premises with 5<10Mbit/s download speed,Number of premises with 10<30Mbit/s download speed,% of premises with 30<300Mbit/s download speed,% of premises with >=300Mbit/s download speed,% of premises with 0<2Mbit/s download speed,% of premises with 2<5Mbit/s download speed,% of premises with 5<10Mbit/s download speed,% of premises with 10<30Mbit/s download speed
379,E06000061,NORTH NORTHAMPTONSHIRE,162226.0,161673.0,,,,,,,,,,,,,158498.0,112037.0,105013.0,20721.0,20721.0,93.0,566.0,997.0,3175.0,825.0,159992.0,5372.0,53485.0,105013.0,93.0,473.0,431.0,2178.0,,,,,,


The data seems to be correctly populated. But let's manually perform few calculations to ensure the summing is performed correctly:

All premises: 32478 + 43066 + 48831 + 37851 = 162226

All matched premises: 32397 + 43030 + 48416 + 37830 = 161673

Number of premises with SFBB availability: 31864 + 42173 + 47570 + 36891 = 158498

Number of premises with UFBB (100Mbit/s availability) 25373 + 24915 + 38011 + 23737 = 112037

The values seems to match, so I assume the summing is all performed correctly. Now I will proceed with West Northamptonshire area.

#### West Northamptonshire - 2020

In [125]:
# Filter the DataFrame for the specified regions
filtered_regions = fixed_coverage_2020_df[fixed_coverage_2020_df['laua_name'].isin(regions_west_northamptonshire)]

# Get the unique 'laua' IDs for the filtered regions
laua_ids = filtered_regions['laua'].unique()

laua_ids

array(['E07000151', 'E07000154', 'E07000155'], dtype=object)

In [126]:
fixed_coverage_2020_df[fixed_coverage_2020_df['laua'].isin(['E07000151', 
                                                            'E07000154', 
                                                            'E07000155'])]

Unnamed: 0,laua,laua_name,All Premises,All Matched Premises,SFBB availability (% premises),UFBB (100Mbit/s) availability (% premises),UFBB availability (% premises),Full Fibre availability (% premises),Gigabit availability (% premises),% of premises unable to receive 2Mbit/s,% of premises unable to receive 5Mbit/s,% of premises unable to receive 10Mbit/s,% of premises unable to receive 30Mbit/s,% of premises below the USO,% of premises with NGA,% of premises able to receive decent broadband from FWA,Number of premises with SFBB availability,Number of premises with UFBB (100Mbit/s) availability,Number of premises with UFBB availability,Number of premises with Full Fibre availability,Number of premises with Gigabit availability,Number of premises unable to receive 2Mbit/s,Number of premises unable to receive 5Mbit/s,Number of premises unable to receive 10Mbit/s,Number of premises unable to receive 30Mbit/s,Number of premises below the USO,Number of premises with NGA,Number of premises able to receive decent broadband from FWA,Number of premises with 30<300Mbit/s download speed,Number of premises with >=300Mbit/s download speed,Number of premises with 0<2Mbit/s download speed,Number of premises with 2<5Mbit/s download speed,Number of premises with 5<10Mbit/s download speed,Number of premises with 10<30Mbit/s download speed,% of premises with 30<300Mbit/s download speed,% of premises with >=300Mbit/s download speed,% of premises with 0<2Mbit/s download speed,% of premises with 2<5Mbit/s download speed,% of premises with 5<10Mbit/s download speed,% of premises with 10<30Mbit/s download speed
90,E07000151,DAVENTRY,38116.0,38067.0,94.7,36.9,31.0,25.6,25.6,0.3,0.9,1.9,5.1,0.3,97.6,0.0,36109.0,14065.0,11823.0,9759.0,9759.0,109.0,358.0,743.0,1958.0,124.0,37211.0,0.0,24286.0,11823.0,109.0,249.0,385.0,1215.0,63.7,31.0,0.3,0.7,1.0,3.2
234,E07000154,NORTHAMPTON,106107.0,103913.0,97.2,85.8,83.8,13.9,13.9,0.0,0.0,0.1,0.8,2.1,97.3,1.2,103085.0,91008.0,88967.0,14744.0,14745.0,6.0,45.0,154.0,828.0,2200.0,103294.0,1262.0,14118.0,88967.0,6.0,39.0,109.0,674.0,13.3,83.8,0.0,0.0,0.1,0.6
296,E07000155,SOUTH NORTHAMPTONSHIRE,42882.0,42749.0,92.6,32.9,25.3,20.3,20.3,0.2,2.0,3.1,7.1,0.9,96.3,0.7,39724.0,14128.0,10843.0,8716.0,8716.0,88.0,875.0,1310.0,3025.0,370.0,41311.0,284.0,28881.0,10843.0,88.0,787.0,435.0,1715.0,67.3,25.3,0.2,1.8,1.0,4.0


In [127]:
# List of 'laua' IDs to be summed
laua_ids_to_sum_west_northampshire = ['E07000151', 
                                      'E07000154', 
                                      'E07000155']

# Columns to be included in the summation
columns_to_sum = [
    'All Premises',
    'All Matched Premises',
    'Number of premises with SFBB availability',
    'Number of premises with UFBB (100Mbit/s) availability',
    'Number of premises with UFBB availability',
    'Number of premises with Full Fibre availability',
    'Number of premises with Gigabit availability',
    'Number of premises unable to receive 2Mbit/s',
    'Number of premises unable to receive 5Mbit/s',
    'Number of premises unable to receive 10Mbit/s',
    'Number of premises unable to receive 30Mbit/s',
    'Number of premises below the USO', 'Number of premises with NGA',
    'Number of premises able to receive decent broadband from FWA',
    'Number of premises with 30<300Mbit/s download speed',
    'Number of premises with >=300Mbit/s download speed',
    'Number of premises with 0<2Mbit/s download speed',
    'Number of premises with 2<5Mbit/s download speed',
    'Number of premises with 5<10Mbit/s download speed',
    'Number of premises with 10<30Mbit/s download speed',
]

# Filter the DataFrame for the specified 'laua' IDs and columns
filtered_data = fixed_coverage_2020_df[fixed_coverage_2020_df['laua'].isin(laua_ids_to_sum_west_northampshire)]

# Calculate the sum of the columns for the filtered rows
total_sum = filtered_data[columns_to_sum].sum()

# Update the corresponding columns for the 'laua' ID 'E06000062'
fixed_coverage_2020_df.loc[fixed_coverage_2020_df['laua'] == 'E06000062', columns_to_sum] = total_sum.values

In [128]:
fixed_coverage_2020_df[fixed_coverage_2020_df['laua'].isin(['E06000062'])]

Unnamed: 0,laua,laua_name,All Premises,All Matched Premises,SFBB availability (% premises),UFBB (100Mbit/s) availability (% premises),UFBB availability (% premises),Full Fibre availability (% premises),Gigabit availability (% premises),% of premises unable to receive 2Mbit/s,% of premises unable to receive 5Mbit/s,% of premises unable to receive 10Mbit/s,% of premises unable to receive 30Mbit/s,% of premises below the USO,% of premises with NGA,% of premises able to receive decent broadband from FWA,Number of premises with SFBB availability,Number of premises with UFBB (100Mbit/s) availability,Number of premises with UFBB availability,Number of premises with Full Fibre availability,Number of premises with Gigabit availability,Number of premises unable to receive 2Mbit/s,Number of premises unable to receive 5Mbit/s,Number of premises unable to receive 10Mbit/s,Number of premises unable to receive 30Mbit/s,Number of premises below the USO,Number of premises with NGA,Number of premises able to receive decent broadband from FWA,Number of premises with 30<300Mbit/s download speed,Number of premises with >=300Mbit/s download speed,Number of premises with 0<2Mbit/s download speed,Number of premises with 2<5Mbit/s download speed,Number of premises with 5<10Mbit/s download speed,Number of premises with 10<30Mbit/s download speed,% of premises with 30<300Mbit/s download speed,% of premises with >=300Mbit/s download speed,% of premises with 0<2Mbit/s download speed,% of premises with 2<5Mbit/s download speed,% of premises with 5<10Mbit/s download speed,% of premises with 10<30Mbit/s download speed
380,E06000062,WEST NORTHAMPTONSHIRE,187105.0,184729.0,,,,,,,,,,,,,178918.0,119201.0,111633.0,33219.0,33220.0,203.0,1278.0,2207.0,5811.0,2694.0,181816.0,1546.0,67285.0,111633.0,203.0,1075.0,929.0,3604.0,,,,,,


All Premises: 38116 + 106107 + 42882 = 187105

All Matched Premises: 38067 + 103913 + 42749 = 184729

Number of premises with SFBB availability: 36109 + 103085 + 39724 = 178918

Number of premises with UFBB (100Mbit/s availability) = 14065 + 91008 + 14128 = 119201



The values in the respective columns do match, so I assume the rest of the summing is performed correctly.

#### Dropping redundant columns in 2020 dataset.

Before I drop the redundant columns, I would like to keep a record of the number of entries so far.

In [129]:
fixed_coverage_2020_df.shape

(381, 40)

That seems to be correct, as initially there were 379 entries. After I added two new entries, the number of entries now is 381. I am satisfied with the results, so I can proceed dropping the rows with the redundant rows.

In [130]:
rows_to_drop = ['CORBY',
                'EAST NORTHAMPTONSHIRE',
                'KETTERING',
                'WELLINGBOROUGH',
                'DAVENTRY',
                'NORTHAMPTON',
                'SOUTH NORTHAMPTONSHIRE']
# Drop rows where 'laua_name' is in rows_to_drop
fixed_coverage_2020_df = fixed_coverage_2020_df[~fixed_coverage_2020_df['laua_name'].isin(rows_to_drop)]

Let's check the number of entries after the rows have been dropped.

In [131]:
fixed_coverage_2020_df.shape

(374, 40)

374 entries seems right after removing 7 entries from 381 entries, making 374. 

#### Incorrect SFBB values - 2019 dataset

Before proceeding, because of the various inconsistencies with the SFBB in 2019 dataset, I would like to check if all values in Number of premises with SFBB availability actually match the summing of Number of premises with 30<300 and >=300. 

In [132]:
# Get the values for comparison
sfbb_availability_values = fixed_coverage_2019_df['Number of premises with SFBB availability']
sum_of_speeds = fixed_coverage_2019_df['Number of premises with 30<300Mbit/s download speed'] + fixed_coverage_2019_df['Number of premises with >=300Mbit/s download speed']

# Check if all values match
all_values_match = (sfbb_availability_values == sum_of_speeds).all()
all_values_match

if all_values_match:
    print("All values in 'Number of premises with SFBB availability' match the sum of the other columns.")
else:
    print("There are inconsistencies between 'Number of premises with SFBB availability' and the sum of the other columns.")


There are inconsistencies between 'Number of premises with SFBB availability' and the sum of the other columns.


In [133]:
# Filter the DataFrame to find entries where the values don't match
inconsistent_entries = fixed_coverage_2019_df[fixed_coverage_2019_df['Number of premises with SFBB availability'] != fixed_coverage_2019_df['Number of premises with 30<300Mbit/s download speed'] + fixed_coverage_2019_df['Number of premises with >=300Mbit/s download speed']]

# Print the inconsistent entries
inconsistent_entries


Unnamed: 0,laua,laua_name,All Premises,All Matched Premises,SFBB availability (% premises),UFBB availability (% premises),Full Fibre availability (% premises),% of premises unable to receive 2Mbit/s,% of premises unable to receive 5Mbit/s,% of premises unable to receive 10Mbit/s,% of premises unable to receive 30Mbit/s,% of premises below the USO,% of premises with NGA,% of premises able to receive decent broadband from FWA,Number of premises with SFBB availability,Number of premises with UFBB availability,Number of premises with Full Fibre availability,Number of premises unable to receive 2Mbit/s,Number of premises unable to receive 5Mbit/s,Number of premises unable to receive 10Mbit/s,Number of premises unable to receive 30Mbit/s,Number of premises below the USO,Number of premises with NGA,Number of premises able to receive decent broadband from FWA,Number of premises with 30<300Mbit/s download speed,Number of premises with >=300Mbit/s download speed,Number of premises with 0<2Mbit/s download speed,Number of premises with 2<5Mbit/s download speed,Number of premises with 5<10Mbit/s download speed,Number of premises with 10<30Mbit/s download speed,% of premises with 30<300Mbit/s download speed,% of premises with >=300Mbit/s download speed,% of premises with 0<2Mbit/s download speed,% of premises with 2<5Mbit/s download speed,% of premises with 5<10Mbit/s download speed,% of premises with 10<30Mbit/s download speed,Number of premises with Gigabit availability,Gigabit availability (% premises)
3,E07000026,ALLERDALE,51385.0,51284.0,93.8,1.7,1.7,1.2,2.6,3.6,8.3,1.2,98.3,2.3,48117.0,866.0,866.0,619.0,1323.0,1873.0,4281.0,592.0,50507.0,1164.0,46137.0,866.0,619.0,704.0,550.0,2408.0,89.8,1.7,1.2,1.4,1.1,4.7,866.0,1.7
18,E07000027,BARROW-IN-FURNESS,35421.0,35396.0,98.6,0.7,0.7,0.2,0.2,0.4,1.5,0.1,99.9,0.1,34905.0,242.0,242.0,54.0,82.0,150.0,526.0,46.0,35386.0,35.0,34628.0,242.0,54.0,28.0,68.0,376.0,97.8,0.7,0.2,0.1,0.2,1.1,242.0,0.7
21,E07000171,BASSETLAW,55314.0,55240.0,100.2,28.4,4.4,0.4,0.8,1.2,4.8,0.3,99.3,6.7,55329.0,15685.0,2415.0,240.0,433.0,665.0,2629.0,171.0,54929.0,3723.0,36926.0,15685.0,240.0,193.0,232.0,1964.0,66.8,28.4,0.4,0.3,0.4,3.6,2415.0,4.4
28,E06000008,BLACKBURN WITH DARWEN,67155.0,67114.0,120.2,61.5,3.8,0.1,0.3,0.5,2.5,0.1,99.2,46.7,80681.0,41317.0,2553.0,42.0,200.0,361.0,1656.0,62.0,66606.0,31365.0,24141.0,41317.0,42.0,158.0,161.0,1295.0,35.9,61.5,0.1,0.2,0.2,1.9,2553.0,3.8
32,E08000001,BOLTON,133392.0,133277.0,100.9,68.3,0.8,0.0,0.1,0.3,1.6,0.1,99.2,4.4,134516.0,91140.0,1052.0,32.0,148.0,410.0,2153.0,112.0,132304.0,5895.0,39984.0,91140.0,32.0,116.0,262.0,1743.0,30.0,68.3,0.0,0.1,0.2,1.3,1052.0,0.8
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
377,E07000238,WYCHAVON,62475.0,62114.0,93.3,8.6,5.1,0.2,0.9,2.0,8.0,0.9,97.8,9.3,57978.0,5352.0,3162.0,129.0,553.0,1223.0,4984.0,555.0,61070.0,5827.0,51778.0,5352.0,129.0,424.0,670.0,3761.0,82.9,8.6,0.2,0.7,1.1,6.0,3162.0,5.1
379,E07000128,WYRE,56343.0,56280.0,95.9,5.9,5.5,0.1,0.3,0.7,5.5,0.1,98.8,7.3,53948.0,3340.0,3108.0,69.0,183.0,382.0,3107.0,71.0,55673.0,4110.0,49833.0,3340.0,69.0,114.0,199.0,2725.0,88.4,5.9,0.1,0.2,0.4,4.8,3108.0,5.5
380,E07000239,WYRE FOREST,48100.0,48061.0,96.5,46.7,0.9,0.3,0.6,1.4,3.9,0.5,99.4,2.9,46375.0,22479.0,420.0,140.0,291.0,662.0,1858.0,264.0,47829.0,1412.0,23724.0,22479.0,140.0,151.0,371.0,1196.0,49.3,46.7,0.3,0.3,0.8,2.5,420.0,0.9
381,E06000014,YORK,98735.0,98548.0,97.9,70.8,43.6,0.0,0.3,0.6,6.0,0.2,95.6,7.0,96437.0,69871.0,43077.0,36.0,261.0,629.0,5922.0,191.0,94419.0,6956.0,22755.0,69871.0,36.0,225.0,368.0,5293.0,23.0,70.8,0.0,0.2,0.4,5.4,43077.0,43.6


There seems to be 124 entries which values do not match. That needs further attention. What I will do is re-calculate the values for Number of premises SFBB availability. Subsequently, I need to do the same for the column representing the percentage.

In [134]:
sfbb_availability_2019_df = pd.DataFrame(columns=['All Matched Premises', 'Number of premises with 30<300Mbit/s download speed', 'Number of premises with >=300Mbit/s download speed', 'Number of premises with SFBB availability', 'SFBB availability (% premises)'])
sfbb_availability_2019_df

Unnamed: 0,All Matched Premises,Number of premises with 30<300Mbit/s download speed,Number of premises with >=300Mbit/s download speed,Number of premises with SFBB availability,SFBB availability (% premises)


In [135]:
# Assign values from fixed_coverage_2019_df to new columns in sfbb_availability_2019_df
sfbb_availability_2019_df['All Matched Premises'] = fixed_coverage_2019_df['All Matched Premises']
sfbb_availability_2019_df['Number of premises with 30<300Mbit/s download speed'] = fixed_coverage_2019_df['Number of premises with 30<300Mbit/s download speed']
sfbb_availability_2019_df['Number of premises with >=300Mbit/s download speed'] = fixed_coverage_2019_df['Number of premises with >=300Mbit/s download speed']
sfbb_availability_2019_df

Unnamed: 0,All Matched Premises,Number of premises with 30<300Mbit/s download speed,Number of premises with >=300Mbit/s download speed,Number of premises with SFBB availability,SFBB availability (% premises)
0,125311.0,91989.0,25163.0,,
1,124305.0,98180.0,3472.0,,
2,29760.0,4840.0,24543.0,,
3,51284.0,46137.0,866.0,,
4,60596.0,40893.0,15339.0,,
...,...,...,...,...,...
380,48061.0,23724.0,22479.0,,
381,98548.0,22755.0,69871.0,,
382,238350.0,132437.0,88134.0,,
383,160977.0,74040.0,82575.0,,


In [136]:
sfbb_availability_2019_df['Number of premises with SFBB availability'] = sfbb_availability_2019_df['Number of premises with 30<300Mbit/s download speed'] + sfbb_availability_2019_df['Number of premises with >=300Mbit/s download speed']
sfbb_availability_2019_df['SFBB availability (% premises)'] = round(sfbb_availability_2019_df['Number of premises with SFBB availability'] / sfbb_availability_2019_df['All Matched Premises'] * 100, 1)
sfbb_availability_2019_df


Unnamed: 0,All Matched Premises,Number of premises with 30<300Mbit/s download speed,Number of premises with >=300Mbit/s download speed,Number of premises with SFBB availability,SFBB availability (% premises)
0,125311.0,91989.0,25163.0,117152.0,93.5
1,124305.0,98180.0,3472.0,101652.0,81.8
2,29760.0,4840.0,24543.0,29383.0,98.7
3,51284.0,46137.0,866.0,47003.0,91.7
4,60596.0,40893.0,15339.0,56232.0,92.8
...,...,...,...,...,...
380,48061.0,23724.0,22479.0,46203.0,96.1
381,98548.0,22755.0,69871.0,92626.0,94.0
382,238350.0,132437.0,88134.0,220571.0,92.5
383,160977.0,74040.0,82575.0,156615.0,97.3


In [137]:
fixed_coverage_2019_df['Number of premises with SFBB availability'] = sfbb_availability_2019_df['Number of premises with SFBB availability']
fixed_coverage_2019_df['SFBB availability (% premises)'] = sfbb_availability_2019_df['SFBB availability (% premises)']

In [138]:
fixed_coverage_2019_df

Unnamed: 0,laua,laua_name,All Premises,All Matched Premises,SFBB availability (% premises),UFBB availability (% premises),Full Fibre availability (% premises),% of premises unable to receive 2Mbit/s,% of premises unable to receive 5Mbit/s,% of premises unable to receive 10Mbit/s,% of premises unable to receive 30Mbit/s,% of premises below the USO,% of premises with NGA,% of premises able to receive decent broadband from FWA,Number of premises with SFBB availability,Number of premises with UFBB availability,Number of premises with Full Fibre availability,Number of premises unable to receive 2Mbit/s,Number of premises unable to receive 5Mbit/s,Number of premises unable to receive 10Mbit/s,Number of premises unable to receive 30Mbit/s,Number of premises below the USO,Number of premises with NGA,Number of premises able to receive decent broadband from FWA,Number of premises with 30<300Mbit/s download speed,Number of premises with >=300Mbit/s download speed,Number of premises with 0<2Mbit/s download speed,Number of premises with 2<5Mbit/s download speed,Number of premises with 5<10Mbit/s download speed,Number of premises with 10<30Mbit/s download speed,% of premises with 30<300Mbit/s download speed,% of premises with >=300Mbit/s download speed,% of premises with 0<2Mbit/s download speed,% of premises with 2<5Mbit/s download speed,% of premises with 5<10Mbit/s download speed,% of premises with 10<30Mbit/s download speed,Number of premises with Gigabit availability,Gigabit availability (% premises)
0,S12000033,ABERDEEN CITY,125441.0,125311.0,93.5,20.1,13.1,0.0,0.2,0.7,6.5,0.2,96.8,0.0,117152.0,25163.0,16410.0,49.0,219.0,884.0,8159.0,189.0,121476.0,0.0,91989.0,25163.0,49.0,170.0,665.0,7275.0,73.3,20.1,0.0,0.1,0.5,5.8,16410.0,13.1
1,S12000034,ABERDEENSHIRE,125085.0,124305.0,81.8,2.8,2.7,2.5,5.9,9.9,18.1,3.6,93.6,0.0,101652.0,3472.0,3332.0,3163.0,7339.0,12332.0,22653.0,4519.0,117051.0,0.0,98180.0,3472.0,3163.0,4176.0,4993.0,10321.0,78.5,2.8,2.5,3.3,4.0,8.3,3332.0,2.7
2,E07000223,ADUR,29770.0,29760.0,98.7,82.4,0.6,0.0,0.1,0.1,1.3,0.0,99.1,0.0,29383.0,24543.0,193.0,0.0,16.0,44.0,377.0,12.0,29514.0,0.0,4840.0,24543.0,0.0,16.0,28.0,333.0,16.3,82.4,0.0,0.1,0.1,1.1,193.0,0.6
3,E07000026,ALLERDALE,51385.0,51284.0,91.7,1.7,1.7,1.2,2.6,3.6,8.3,1.2,98.3,2.3,47003.0,866.0,866.0,619.0,1323.0,1873.0,4281.0,592.0,50507.0,1164.0,46137.0,866.0,619.0,704.0,550.0,2408.0,89.8,1.7,1.2,1.4,1.1,4.7,866.0,1.7
4,E07000032,AMBER VALLEY,60674.0,60596.0,92.8,25.3,22.1,0.1,0.9,2.1,7.2,0.7,98.2,0.0,56232.0,15339.0,13412.0,89.0,549.0,1254.0,4364.0,440.0,59578.0,0.0,40893.0,15339.0,89.0,460.0,705.0,3110.0,67.4,25.3,0.1,0.8,1.2,5.1,13412.0,22.1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
380,E07000239,WYRE FOREST,48100.0,48061.0,96.1,46.7,0.9,0.3,0.6,1.4,3.9,0.5,99.4,2.9,46203.0,22479.0,420.0,140.0,291.0,662.0,1858.0,264.0,47829.0,1412.0,23724.0,22479.0,140.0,151.0,371.0,1196.0,49.3,46.7,0.3,0.3,0.8,2.5,420.0,0.9
381,E06000014,YORK,98735.0,98548.0,94.0,70.8,43.6,0.0,0.3,0.6,6.0,0.2,95.6,7.0,92626.0,69871.0,43077.0,36.0,261.0,629.0,5922.0,191.0,94419.0,6956.0,22755.0,69871.0,36.0,225.0,368.0,5293.0,23.0,70.8,0.0,0.2,0.4,5.4,43077.0,43.6
382,E06000060,BUCKINGHAMSHIRE,238482.0,238350.0,92.5,,,,,,,,,,220571.0,88134.0,16083.0,590.0,2304.0,4667.0,17779.0,916.0,233315.0,0.0,132437.0,88134.0,590.0,1714.0,2363.0,13112.0,,,,,,,16083.0,
383,E06000061,NORTH NORTHAMPTONSHIRE,161434.0,160977.0,97.3,,,,,,,,,,156615.0,82575.0,13647.0,149.0,781.0,1365.0,4362.0,773.0,158581.0,5374.0,74040.0,82575.0,149.0,632.0,584.0,2997.0,,,,,,,13647.0,


Now I have populated the values and they should be all correct. Let's check.

In [139]:
# Get the values for comparison
sfbb_availability_values = fixed_coverage_2019_df['Number of premises with SFBB availability']
sum_of_speeds = fixed_coverage_2019_df['Number of premises with 30<300Mbit/s download speed'] + fixed_coverage_2019_df['Number of premises with >=300Mbit/s download speed']

# Check if all values match
all_values_match = (sfbb_availability_values == sum_of_speeds).all()
all_values_match

if all_values_match:
    print("All values in 'Number of premises with SFBB availability' match the sum of the other columns.")
else:
    print("There are inconsistencies between 'Number of premises with SFBB availability' and the sum of the other columns.")


All values in 'Number of premises with SFBB availability' match the sum of the other columns.


That is really good news. I certainly have achieved consistency not only in the number of entries across the 5 dataframes but also in the actual entries.

### Number of premises with UFBB (100Mbit/s) availability and UFBB (100Mbit/s) availability (% premises) columns in 2019 dataset.

It seems that from datasets from 2020 onwards, UFBB speed is broken down into two subcategories - UFBB with speeds of 100Mbit/s or greater and UFBB with speeds of 300Mbit/s or greater. As we discovered earlier, SFBB covers any speed from 30Mbit/s and greater, covering any speeds from 30MBit/s with no upper limit, while UFBB covers any speed from 300Mbit/s and greater, with no upper limit. In other words, we do not have available data in 2019 dataset that differentiates any broadband speed from 100 Mbit/s or greater. In my understanding, this new category falls between SFBB and UFBB categoreis in 2019 dataset. Simply because SFBB covers anything from 30MBit/s, inlcuding UFBB speeds into it.

In this case, one approach I can think of is to approximate the number of premises with 100 Mbit/s or more. Using the available data, to accurately represent premise with speeds of 100Mbit/s or greater, I would need to combine the counts of premises in both the '30<300Mbit/s' and '>=300Mbit/s' columns, as this would cover th entire range of speeds from just above 30Mbit/s through the highest speeds reported. However, it is worth noting that this approach would also include some premises with speeds between 30Mbit/s and 100Mbit/s. Given the absence of a direct 100Mbit/s threshold column in 2019, this is the closest approximation possible, in my opinion.

Adding the number of premises for 30<300 Mbit/s and >=300Mbit/s would give me the total number of premises above 30Mbit/s. To refine this to just those likely to be 100Mbit/s or greater, I would calculate a distribution factor for 30<300Mbit/s and >=300Mbit/s columns for each area, and estimate the number of premises with 100 Mbit/s or greater. To establish the distribution factor for each area, I would calculate the proportion of premises that have speeds of >=300Mbit/s relative to the total number of premises with speeds of >=30Mbit/s. Then I will add the number of premises with 30<300Mbit/s and >= 300Mbit/s and divide the summed number by the factor for that area. I think this approach will give me a more granular data than just summing the 30<300Mbit/s and >=300Mbit/s and finding the average value. 

*I have to mention that this approach is coming purely out of my own logic and understanding of the problem and I do not claim that it is the correct approach for this scenario. Perhaps, there might be much better statistical approaches that can be applied for such problem but I am not familiar with them.

For this purpose, I am going to create a new dataframe containing only the columns I need - 'Number of premises with 30<300Mbit/s download speed', 'Number of premises with >=300Mbit/s download speed'. I will also introduce two additional columns - 'Distribution factor' that will represent the distribtion factor and 'Estimated number of properties with 100Mbit/s or more' which will represent the approximate number of premises with 100Mbit/s or greater. Since I will be calculating the number of premises as well as the percentage for 100Mbit/s or more, I will also introduce a 'Estimated % of properties with 100Mbit/s or more' column that will represent the percentage of approximated number of premises with 100Mbit/s or more. In order to do that, I will need the values of 'All matched premises' column.

I am not going to use the SFBB availability column, because as we previously noted, that column includes both the 30<300 and >=300Mbit/s speeds. For the purpose of this excercise, I need to be able to differentiate between both speeds as I am looking for the distribution of number of premises for each category.

In [140]:
uffb_100_mbit_distribution_df = pd.DataFrame(columns=['All Matched Premises', 'Number of premises with 30<300Mbit/s download speed', 'Number of premises with >=300Mbit/s download speed', 'Distribution factor', 'Estimated number of properties with 100Mbit/s or more', 'Estimated % of properties with 100Mbit/s or more'])
uffb_100_mbit_distribution_df

Unnamed: 0,All Matched Premises,Number of premises with 30<300Mbit/s download speed,Number of premises with >=300Mbit/s download speed,Distribution factor,Estimated number of properties with 100Mbit/s or more,Estimated % of properties with 100Mbit/s or more


In [141]:
selected_columns = ['All Matched Premises',
                    'Number of premises with 30<300Mbit/s download speed',
                    'Number of premises with >=300Mbit/s download speed']

In [142]:
uffb_100_mbit_distribution_df[selected_columns] = fixed_coverage_2019_df[selected_columns]
uffb_100_mbit_distribution_df

Unnamed: 0,All Matched Premises,Number of premises with 30<300Mbit/s download speed,Number of premises with >=300Mbit/s download speed,Distribution factor,Estimated number of properties with 100Mbit/s or more,Estimated % of properties with 100Mbit/s or more
0,125311.0,91989.0,25163.0,,,
1,124305.0,98180.0,3472.0,,,
2,29760.0,4840.0,24543.0,,,
3,51284.0,46137.0,866.0,,,
4,60596.0,40893.0,15339.0,,,
...,...,...,...,...,...,...
380,48061.0,23724.0,22479.0,,,
381,98548.0,22755.0,69871.0,,,
382,238350.0,132437.0,88134.0,,,
383,160977.0,74040.0,82575.0,,,


In [143]:
uffb_100_mbit_distribution_df['Distribution factor'] = (uffb_100_mbit_distribution_df['Number of premises with 30<300Mbit/s download speed'] /
                                             uffb_100_mbit_distribution_df['Number of premises with >=300Mbit/s download speed'])
uffb_100_mbit_distribution_df

Unnamed: 0,All Matched Premises,Number of premises with 30<300Mbit/s download speed,Number of premises with >=300Mbit/s download speed,Distribution factor,Estimated number of properties with 100Mbit/s or more,Estimated % of properties with 100Mbit/s or more
0,125311.0,91989.0,25163.0,3.655725,,
1,124305.0,98180.0,3472.0,28.277650,,
2,29760.0,4840.0,24543.0,0.197205,,
3,51284.0,46137.0,866.0,53.275982,,
4,60596.0,40893.0,15339.0,2.665950,,
...,...,...,...,...,...,...
380,48061.0,23724.0,22479.0,1.055385,,
381,98548.0,22755.0,69871.0,0.325672,,
382,238350.0,132437.0,88134.0,1.502678,,
383,160977.0,74040.0,82575.0,0.896639,,


Now we have the distribution factor. Which could be interpreted as there are nearly three and a half times less the premises able to receive 300Mbit/s and more speed than the number of premises able to receive 30<300Mbit, in the first row. I am not going to round the distribution factor and keep it 'raw' because that way it will give me more accurate estimation when calculating the premises with 100 Mbit/s and more

In [144]:
uffb_100_mbit_distribution_df['Estimated number of properties with 100Mbit/s or more'] = ((uffb_100_mbit_distribution_df['Number of premises with 30<300Mbit/s download speed'] +
                                             uffb_100_mbit_distribution_df['Number of premises with >=300Mbit/s download speed']) /
                                             uffb_100_mbit_distribution_df['Distribution factor'])
uffb_100_mbit_distribution_df

Unnamed: 0,All Matched Premises,Number of premises with 30<300Mbit/s download speed,Number of premises with >=300Mbit/s download speed,Distribution factor,Estimated number of properties with 100Mbit/s or more,Estimated % of properties with 100Mbit/s or more
0,125311.0,91989.0,25163.0,3.655725,32046.176999,
1,124305.0,98180.0,3472.0,28.277650,3594.782481,
2,29760.0,4840.0,24543.0,0.197205,148997.307645,
3,51284.0,46137.0,866.0,53.275982,882.254980,
4,60596.0,40893.0,15339.0,2.665950,21092.672291,
...,...,...,...,...,...,...
380,48061.0,23724.0,22479.0,1.055385,43778.335736,
381,98548.0,22755.0,69871.0,0.325672,284415.348099,
382,238350.0,132437.0,88134.0,1.502678,146785.298021,
383,160977.0,74040.0,82575.0,0.896639,174668.876621,


Since premises can not be fractions, they need to be rounded to the nearest whole number.

In [145]:
# Round the 'Results' column to the nearest whole number and convert to integer type
uffb_100_mbit_distribution_df['Estimated number of properties with 100Mbit/s or more'] = uffb_100_mbit_distribution_df['Estimated number of properties with 100Mbit/s or more'].round().astype(int)
uffb_100_mbit_distribution_df

Unnamed: 0,All Matched Premises,Number of premises with 30<300Mbit/s download speed,Number of premises with >=300Mbit/s download speed,Distribution factor,Estimated number of properties with 100Mbit/s or more,Estimated % of properties with 100Mbit/s or more
0,125311.0,91989.0,25163.0,3.655725,32046,
1,124305.0,98180.0,3472.0,28.277650,3595,
2,29760.0,4840.0,24543.0,0.197205,148997,
3,51284.0,46137.0,866.0,53.275982,882,
4,60596.0,40893.0,15339.0,2.665950,21093,
...,...,...,...,...,...,...
380,48061.0,23724.0,22479.0,1.055385,43778,
381,98548.0,22755.0,69871.0,0.325672,284415,
382,238350.0,132437.0,88134.0,1.502678,146785,
383,160977.0,74040.0,82575.0,0.896639,174669,


###### Estimated % of properties with 100Mbit/s or more

In [146]:
uffb_100_mbit_distribution_df['Estimated % of properties with 100Mbit/s or more'] = ((uffb_100_mbit_distribution_df['Estimated number of properties with 100Mbit/s or more'] /
                                             uffb_100_mbit_distribution_df['All Matched Premises']) * 100).round(1)
uffb_100_mbit_distribution_df


Unnamed: 0,All Matched Premises,Number of premises with 30<300Mbit/s download speed,Number of premises with >=300Mbit/s download speed,Distribution factor,Estimated number of properties with 100Mbit/s or more,Estimated % of properties with 100Mbit/s or more
0,125311.0,91989.0,25163.0,3.655725,32046,25.6
1,124305.0,98180.0,3472.0,28.277650,3595,2.9
2,29760.0,4840.0,24543.0,0.197205,148997,500.7
3,51284.0,46137.0,866.0,53.275982,882,1.7
4,60596.0,40893.0,15339.0,2.665950,21093,34.8
...,...,...,...,...,...,...
380,48061.0,23724.0,22479.0,1.055385,43778,91.1
381,98548.0,22755.0,69871.0,0.325672,284415,288.6
382,238350.0,132437.0,88134.0,1.502678,146785,61.6
383,160977.0,74040.0,82575.0,0.896639,174669,108.5


I need to change my approach as I think it is incorrect and does not work. I am now seeing some odd numbers in the percentage column on the very right and in the estimated number of properties with 100 Mbit/s or more. These large figures as in the entry with index 2, 370, 372 and 373 indicates that perhaps my approach is flawed. So, I will try something else.

*I have done some research of how to address problems like this one, where there is missing data, but we have data available for later years. One approach is statistical modelling. Since we have data from multiple years, I could build a model predicting the number of UFBB premises based on other factors, and then use this model to estimate the 100 Mbit/s availability. Also, regression analysis could be suitable if I have a continuous series of data points, such as in this case. However, I was not sure if we would be allowed to use such methods and techniques, as we have not been introduced to them (yet).

So, instead, I would refine my previous approach. First, I would create 'Propotion of 30<300Mbit/s' which is the proportion of premises with speeds between 30<300Mbit/s out of the total premises that have UFBB (sum of 30<300 Mbit/s and >=300Mbit/s).

Then I will apply this proportion to the 'Number of premises with UFBB availability' to get the 'Estimated number of properties with 100Mbit/s or more'.

Finally, I will calculate the 'Estimated % of properties with 100Mbit/s or more' based on the 'All Matched Premises'. 

I hope that would work. 

In [147]:
uffb_100_mbit_distribution_df = fixed_coverage_2019_df.copy()

# Calculating the proportion of 30<300Mbit/s
uffb_100_mbit_distribution_df['Proportion of 30<300Mbit/s'] = (
    uffb_100_mbit_distribution_df['Number of premises with 30<300Mbit/s download speed'] /
    (uffb_100_mbit_distribution_df['Number of premises with 30<300Mbit/s download speed'] + 
     uffb_100_mbit_distribution_df['Number of premises with >=300Mbit/s download speed'])
)

# Estimating the number of properties with 100Mbit/s or more based on UFBB availability
uffb_100_mbit_distribution_df['Estimated number of properties with 100Mbit/s or more'] = (
    uffb_100_mbit_distribution_df['Number of premises with UFBB availability'] * 
    uffb_100_mbit_distribution_df['Proportion of 30<300Mbit/s']
)

# Calculating the estimated percentage of properties with 100Mbit/s or more
uffb_100_mbit_distribution_df['Estimated % of properties with 100Mbit/s or more'] = (
    uffb_100_mbit_distribution_df['Estimated number of properties with 100Mbit/s or more'] /
    uffb_100_mbit_distribution_df['All Matched Premises'] * 100
)

# Capping the percentage at 100 and rounding off the number of properties to the nearest integer
uffb_100_mbit_distribution_df['Estimated % of properties with 100Mbit/s or more'] = (
    uffb_100_mbit_distribution_df['Estimated % of properties with 100Mbit/s or more'].clip(upper=100).round(1)
)
uffb_100_mbit_distribution_df['Estimated number of properties with 100Mbit/s or more'] = (
    uffb_100_mbit_distribution_df['Estimated number of properties with 100Mbit/s or more'].round().astype(int)
)

uffb_100_mbit_distribution_df[['All Matched Premises', 'Number of premises with 30<300Mbit/s download speed', 
                               'Number of premises with >=300Mbit/s download speed', 'Proportion of 30<300Mbit/s', 
                               'Estimated number of properties with 100Mbit/s or more', 
                               'Estimated % of properties with 100Mbit/s or more']]


Unnamed: 0,All Matched Premises,Number of premises with 30<300Mbit/s download speed,Number of premises with >=300Mbit/s download speed,Proportion of 30<300Mbit/s,Estimated number of properties with 100Mbit/s or more,Estimated % of properties with 100Mbit/s or more
0,125311.0,91989.0,25163.0,0.785211,19758,15.8
1,124305.0,98180.0,3472.0,0.965844,3353,2.7
2,29760.0,4840.0,24543.0,0.164721,4043,13.6
3,51284.0,46137.0,866.0,0.981576,850,1.7
4,60596.0,40893.0,15339.0,0.727219,11155,18.4
...,...,...,...,...,...,...
380,48061.0,23724.0,22479.0,0.513473,11542,24.0
381,98548.0,22755.0,69871.0,0.245665,17165,17.4
382,238350.0,132437.0,88134.0,0.600428,52918,22.2
383,160977.0,74040.0,82575.0,0.472752,39037,24.3


Perhaps, I feel that this approach is more suitable. But it is worth noting that these figures are just rough estimates based on available data.

Now I have to populate the data from the 'Estimated number of properties with 100Mbit/s or more' and 'Estimated % of properties with 100Mbit/s or more' into the relevant columns in the 2019 dataset.

In [148]:
fixed_coverage_2019_df['Number of premises with UFBB (100Mbit/s) availability'] = uffb_100_mbit_distribution_df['Estimated number of properties with 100Mbit/s or more']
fixed_coverage_2019_df['UFBB (100Mbit/s) availability (% premises)'] = uffb_100_mbit_distribution_df['Estimated % of properties with 100Mbit/s or more']
fixed_coverage_2019_df

Unnamed: 0,laua,laua_name,All Premises,All Matched Premises,SFBB availability (% premises),UFBB availability (% premises),Full Fibre availability (% premises),% of premises unable to receive 2Mbit/s,% of premises unable to receive 5Mbit/s,% of premises unable to receive 10Mbit/s,% of premises unable to receive 30Mbit/s,% of premises below the USO,% of premises with NGA,% of premises able to receive decent broadband from FWA,Number of premises with SFBB availability,Number of premises with UFBB availability,Number of premises with Full Fibre availability,Number of premises unable to receive 2Mbit/s,Number of premises unable to receive 5Mbit/s,Number of premises unable to receive 10Mbit/s,Number of premises unable to receive 30Mbit/s,Number of premises below the USO,Number of premises with NGA,Number of premises able to receive decent broadband from FWA,Number of premises with 30<300Mbit/s download speed,Number of premises with >=300Mbit/s download speed,Number of premises with 0<2Mbit/s download speed,Number of premises with 2<5Mbit/s download speed,Number of premises with 5<10Mbit/s download speed,Number of premises with 10<30Mbit/s download speed,% of premises with 30<300Mbit/s download speed,% of premises with >=300Mbit/s download speed,% of premises with 0<2Mbit/s download speed,% of premises with 2<5Mbit/s download speed,% of premises with 5<10Mbit/s download speed,% of premises with 10<30Mbit/s download speed,Number of premises with Gigabit availability,Gigabit availability (% premises),Number of premises with UFBB (100Mbit/s) availability,UFBB (100Mbit/s) availability (% premises)
0,S12000033,ABERDEEN CITY,125441.0,125311.0,93.5,20.1,13.1,0.0,0.2,0.7,6.5,0.2,96.8,0.0,117152.0,25163.0,16410.0,49.0,219.0,884.0,8159.0,189.0,121476.0,0.0,91989.0,25163.0,49.0,170.0,665.0,7275.0,73.3,20.1,0.0,0.1,0.5,5.8,16410.0,13.1,19758,15.8
1,S12000034,ABERDEENSHIRE,125085.0,124305.0,81.8,2.8,2.7,2.5,5.9,9.9,18.1,3.6,93.6,0.0,101652.0,3472.0,3332.0,3163.0,7339.0,12332.0,22653.0,4519.0,117051.0,0.0,98180.0,3472.0,3163.0,4176.0,4993.0,10321.0,78.5,2.8,2.5,3.3,4.0,8.3,3332.0,2.7,3353,2.7
2,E07000223,ADUR,29770.0,29760.0,98.7,82.4,0.6,0.0,0.1,0.1,1.3,0.0,99.1,0.0,29383.0,24543.0,193.0,0.0,16.0,44.0,377.0,12.0,29514.0,0.0,4840.0,24543.0,0.0,16.0,28.0,333.0,16.3,82.4,0.0,0.1,0.1,1.1,193.0,0.6,4043,13.6
3,E07000026,ALLERDALE,51385.0,51284.0,91.7,1.7,1.7,1.2,2.6,3.6,8.3,1.2,98.3,2.3,47003.0,866.0,866.0,619.0,1323.0,1873.0,4281.0,592.0,50507.0,1164.0,46137.0,866.0,619.0,704.0,550.0,2408.0,89.8,1.7,1.2,1.4,1.1,4.7,866.0,1.7,850,1.7
4,E07000032,AMBER VALLEY,60674.0,60596.0,92.8,25.3,22.1,0.1,0.9,2.1,7.2,0.7,98.2,0.0,56232.0,15339.0,13412.0,89.0,549.0,1254.0,4364.0,440.0,59578.0,0.0,40893.0,15339.0,89.0,460.0,705.0,3110.0,67.4,25.3,0.1,0.8,1.2,5.1,13412.0,22.1,11155,18.4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
380,E07000239,WYRE FOREST,48100.0,48061.0,96.1,46.7,0.9,0.3,0.6,1.4,3.9,0.5,99.4,2.9,46203.0,22479.0,420.0,140.0,291.0,662.0,1858.0,264.0,47829.0,1412.0,23724.0,22479.0,140.0,151.0,371.0,1196.0,49.3,46.7,0.3,0.3,0.8,2.5,420.0,0.9,11542,24.0
381,E06000014,YORK,98735.0,98548.0,94.0,70.8,43.6,0.0,0.3,0.6,6.0,0.2,95.6,7.0,92626.0,69871.0,43077.0,36.0,261.0,629.0,5922.0,191.0,94419.0,6956.0,22755.0,69871.0,36.0,225.0,368.0,5293.0,23.0,70.8,0.0,0.2,0.4,5.4,43077.0,43.6,17165,17.4
382,E06000060,BUCKINGHAMSHIRE,238482.0,238350.0,92.5,,,,,,,,,,220571.0,88134.0,16083.0,590.0,2304.0,4667.0,17779.0,916.0,233315.0,0.0,132437.0,88134.0,590.0,1714.0,2363.0,13112.0,,,,,,,16083.0,,52918,22.2
383,E06000061,NORTH NORTHAMPTONSHIRE,161434.0,160977.0,97.3,,,,,,,,,,156615.0,82575.0,13647.0,149.0,781.0,1365.0,4362.0,773.0,158581.0,5374.0,74040.0,82575.0,149.0,632.0,584.0,2997.0,,,,,,,13647.0,,39037,24.3


Now I have achieved consistency in terms of the number of columns across all datasets. However, I would like to ensure that, so I will compare all columns in each dataset and make sure there are no diferences.

In [149]:
# List of your dataset variables
datasets = [fixed_coverage_2019_df, fixed_coverage_2020_df, fixed_coverage_2021_df, fixed_coverage_2022_df, fixed_coverage_2023_df]

# Dictionary to store column names for each dataset
columns_dict = {}

# Loop through each dataset and store its column names in the dictionary
for i, dataset in enumerate(datasets, start=2019):
    columns_dict[f'fixed_coverage_{i}'] = set(dataset.columns)

# Find discrepancies in column names
discrepancies = {}
for key, value in columns_dict.items():
    for other_key, other_value in columns_dict.items():
        if key != other_key:
            diff = value.symmetric_difference(other_value)
            if diff:
                discrepancies[key] = diff

discrepancies

{}

It seems all datasets now have the same number and same names for columns. Based on that, I can conclude that the datasets are now consistent.

###### Calculating percentage values for percentage columns in 2019

One last thing I have to do on the 2019 dataset is calculate all missing values in the percentage columns. I did not do that earlier because I was performing various calculations and summing from various other columns, etc. Let's have a look which columns are these.

In [150]:
# Identify missing values (NaNs)
missing_values = fixed_coverage_2019_df[fixed_coverage_2019_df.isnull().any(axis=1)]
missing_values

Unnamed: 0,laua,laua_name,All Premises,All Matched Premises,SFBB availability (% premises),UFBB availability (% premises),Full Fibre availability (% premises),% of premises unable to receive 2Mbit/s,% of premises unable to receive 5Mbit/s,% of premises unable to receive 10Mbit/s,% of premises unable to receive 30Mbit/s,% of premises below the USO,% of premises with NGA,% of premises able to receive decent broadband from FWA,Number of premises with SFBB availability,Number of premises with UFBB availability,Number of premises with Full Fibre availability,Number of premises unable to receive 2Mbit/s,Number of premises unable to receive 5Mbit/s,Number of premises unable to receive 10Mbit/s,Number of premises unable to receive 30Mbit/s,Number of premises below the USO,Number of premises with NGA,Number of premises able to receive decent broadband from FWA,Number of premises with 30<300Mbit/s download speed,Number of premises with >=300Mbit/s download speed,Number of premises with 0<2Mbit/s download speed,Number of premises with 2<5Mbit/s download speed,Number of premises with 5<10Mbit/s download speed,Number of premises with 10<30Mbit/s download speed,% of premises with 30<300Mbit/s download speed,% of premises with >=300Mbit/s download speed,% of premises with 0<2Mbit/s download speed,% of premises with 2<5Mbit/s download speed,% of premises with 5<10Mbit/s download speed,% of premises with 10<30Mbit/s download speed,Number of premises with Gigabit availability,Gigabit availability (% premises),Number of premises with UFBB (100Mbit/s) availability,UFBB (100Mbit/s) availability (% premises)
382,E06000060,BUCKINGHAMSHIRE,238482.0,238350.0,92.5,,,,,,,,,,220571.0,88134.0,16083.0,590.0,2304.0,4667.0,17779.0,916.0,233315.0,0.0,132437.0,88134.0,590.0,1714.0,2363.0,13112.0,,,,,,,16083.0,,52918,22.2
383,E06000061,NORTH NORTHAMPTONSHIRE,161434.0,160977.0,97.3,,,,,,,,,,156615.0,82575.0,13647.0,149.0,781.0,1365.0,4362.0,773.0,158581.0,5374.0,74040.0,82575.0,149.0,632.0,584.0,2997.0,,,,,,,13647.0,,39037,24.3
384,E06000062,WEST NORTHAMPTONSHIRE,186828.0,184653.0,95.7,,,,,,,,,,176795.0,93421.0,17488.0,264.0,1773.0,3049.0,7858.0,2186.0,180523.0,1540.0,83374.0,93421.0,264.0,1509.0,1276.0,4809.0,,,,,,,17488.0,,44056,23.9


So, let's start with UFBB availability (% premises). To calculate the percentage value, we need to divide the 'Number of premises with UFBB availability' by 'All Matched Premises' and then multiply by 100 to convert the value in percentage.

In [151]:
fixed_coverage_2019_df

Unnamed: 0,laua,laua_name,All Premises,All Matched Premises,SFBB availability (% premises),UFBB availability (% premises),Full Fibre availability (% premises),% of premises unable to receive 2Mbit/s,% of premises unable to receive 5Mbit/s,% of premises unable to receive 10Mbit/s,% of premises unable to receive 30Mbit/s,% of premises below the USO,% of premises with NGA,% of premises able to receive decent broadband from FWA,Number of premises with SFBB availability,Number of premises with UFBB availability,Number of premises with Full Fibre availability,Number of premises unable to receive 2Mbit/s,Number of premises unable to receive 5Mbit/s,Number of premises unable to receive 10Mbit/s,Number of premises unable to receive 30Mbit/s,Number of premises below the USO,Number of premises with NGA,Number of premises able to receive decent broadband from FWA,Number of premises with 30<300Mbit/s download speed,Number of premises with >=300Mbit/s download speed,Number of premises with 0<2Mbit/s download speed,Number of premises with 2<5Mbit/s download speed,Number of premises with 5<10Mbit/s download speed,Number of premises with 10<30Mbit/s download speed,% of premises with 30<300Mbit/s download speed,% of premises with >=300Mbit/s download speed,% of premises with 0<2Mbit/s download speed,% of premises with 2<5Mbit/s download speed,% of premises with 5<10Mbit/s download speed,% of premises with 10<30Mbit/s download speed,Number of premises with Gigabit availability,Gigabit availability (% premises),Number of premises with UFBB (100Mbit/s) availability,UFBB (100Mbit/s) availability (% premises)
0,S12000033,ABERDEEN CITY,125441.0,125311.0,93.5,20.1,13.1,0.0,0.2,0.7,6.5,0.2,96.8,0.0,117152.0,25163.0,16410.0,49.0,219.0,884.0,8159.0,189.0,121476.0,0.0,91989.0,25163.0,49.0,170.0,665.0,7275.0,73.3,20.1,0.0,0.1,0.5,5.8,16410.0,13.1,19758,15.8
1,S12000034,ABERDEENSHIRE,125085.0,124305.0,81.8,2.8,2.7,2.5,5.9,9.9,18.1,3.6,93.6,0.0,101652.0,3472.0,3332.0,3163.0,7339.0,12332.0,22653.0,4519.0,117051.0,0.0,98180.0,3472.0,3163.0,4176.0,4993.0,10321.0,78.5,2.8,2.5,3.3,4.0,8.3,3332.0,2.7,3353,2.7
2,E07000223,ADUR,29770.0,29760.0,98.7,82.4,0.6,0.0,0.1,0.1,1.3,0.0,99.1,0.0,29383.0,24543.0,193.0,0.0,16.0,44.0,377.0,12.0,29514.0,0.0,4840.0,24543.0,0.0,16.0,28.0,333.0,16.3,82.4,0.0,0.1,0.1,1.1,193.0,0.6,4043,13.6
3,E07000026,ALLERDALE,51385.0,51284.0,91.7,1.7,1.7,1.2,2.6,3.6,8.3,1.2,98.3,2.3,47003.0,866.0,866.0,619.0,1323.0,1873.0,4281.0,592.0,50507.0,1164.0,46137.0,866.0,619.0,704.0,550.0,2408.0,89.8,1.7,1.2,1.4,1.1,4.7,866.0,1.7,850,1.7
4,E07000032,AMBER VALLEY,60674.0,60596.0,92.8,25.3,22.1,0.1,0.9,2.1,7.2,0.7,98.2,0.0,56232.0,15339.0,13412.0,89.0,549.0,1254.0,4364.0,440.0,59578.0,0.0,40893.0,15339.0,89.0,460.0,705.0,3110.0,67.4,25.3,0.1,0.8,1.2,5.1,13412.0,22.1,11155,18.4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
380,E07000239,WYRE FOREST,48100.0,48061.0,96.1,46.7,0.9,0.3,0.6,1.4,3.9,0.5,99.4,2.9,46203.0,22479.0,420.0,140.0,291.0,662.0,1858.0,264.0,47829.0,1412.0,23724.0,22479.0,140.0,151.0,371.0,1196.0,49.3,46.7,0.3,0.3,0.8,2.5,420.0,0.9,11542,24.0
381,E06000014,YORK,98735.0,98548.0,94.0,70.8,43.6,0.0,0.3,0.6,6.0,0.2,95.6,7.0,92626.0,69871.0,43077.0,36.0,261.0,629.0,5922.0,191.0,94419.0,6956.0,22755.0,69871.0,36.0,225.0,368.0,5293.0,23.0,70.8,0.0,0.2,0.4,5.4,43077.0,43.6,17165,17.4
382,E06000060,BUCKINGHAMSHIRE,238482.0,238350.0,92.5,,,,,,,,,,220571.0,88134.0,16083.0,590.0,2304.0,4667.0,17779.0,916.0,233315.0,0.0,132437.0,88134.0,590.0,1714.0,2363.0,13112.0,,,,,,,16083.0,,52918,22.2
383,E06000061,NORTH NORTHAMPTONSHIRE,161434.0,160977.0,97.3,,,,,,,,,,156615.0,82575.0,13647.0,149.0,781.0,1365.0,4362.0,773.0,158581.0,5374.0,74040.0,82575.0,149.0,632.0,584.0,2997.0,,,,,,,13647.0,,39037,24.3


In [152]:
# Calculate UFBB availability (% premises)
fixed_coverage_2019_df['UFBB availability (% premises)'] = round((fixed_coverage_2019_df['Number of premises with UFBB availability'] 
                                                                  / fixed_coverage_2019_df['All Matched Premises']) * 100, 1)
fixed_coverage_2019_df

Unnamed: 0,laua,laua_name,All Premises,All Matched Premises,SFBB availability (% premises),UFBB availability (% premises),Full Fibre availability (% premises),% of premises unable to receive 2Mbit/s,% of premises unable to receive 5Mbit/s,% of premises unable to receive 10Mbit/s,% of premises unable to receive 30Mbit/s,% of premises below the USO,% of premises with NGA,% of premises able to receive decent broadband from FWA,Number of premises with SFBB availability,Number of premises with UFBB availability,Number of premises with Full Fibre availability,Number of premises unable to receive 2Mbit/s,Number of premises unable to receive 5Mbit/s,Number of premises unable to receive 10Mbit/s,Number of premises unable to receive 30Mbit/s,Number of premises below the USO,Number of premises with NGA,Number of premises able to receive decent broadband from FWA,Number of premises with 30<300Mbit/s download speed,Number of premises with >=300Mbit/s download speed,Number of premises with 0<2Mbit/s download speed,Number of premises with 2<5Mbit/s download speed,Number of premises with 5<10Mbit/s download speed,Number of premises with 10<30Mbit/s download speed,% of premises with 30<300Mbit/s download speed,% of premises with >=300Mbit/s download speed,% of premises with 0<2Mbit/s download speed,% of premises with 2<5Mbit/s download speed,% of premises with 5<10Mbit/s download speed,% of premises with 10<30Mbit/s download speed,Number of premises with Gigabit availability,Gigabit availability (% premises),Number of premises with UFBB (100Mbit/s) availability,UFBB (100Mbit/s) availability (% premises)
0,S12000033,ABERDEEN CITY,125441.0,125311.0,93.5,20.1,13.1,0.0,0.2,0.7,6.5,0.2,96.8,0.0,117152.0,25163.0,16410.0,49.0,219.0,884.0,8159.0,189.0,121476.0,0.0,91989.0,25163.0,49.0,170.0,665.0,7275.0,73.3,20.1,0.0,0.1,0.5,5.8,16410.0,13.1,19758,15.8
1,S12000034,ABERDEENSHIRE,125085.0,124305.0,81.8,2.8,2.7,2.5,5.9,9.9,18.1,3.6,93.6,0.0,101652.0,3472.0,3332.0,3163.0,7339.0,12332.0,22653.0,4519.0,117051.0,0.0,98180.0,3472.0,3163.0,4176.0,4993.0,10321.0,78.5,2.8,2.5,3.3,4.0,8.3,3332.0,2.7,3353,2.7
2,E07000223,ADUR,29770.0,29760.0,98.7,82.5,0.6,0.0,0.1,0.1,1.3,0.0,99.1,0.0,29383.0,24543.0,193.0,0.0,16.0,44.0,377.0,12.0,29514.0,0.0,4840.0,24543.0,0.0,16.0,28.0,333.0,16.3,82.4,0.0,0.1,0.1,1.1,193.0,0.6,4043,13.6
3,E07000026,ALLERDALE,51385.0,51284.0,91.7,1.7,1.7,1.2,2.6,3.6,8.3,1.2,98.3,2.3,47003.0,866.0,866.0,619.0,1323.0,1873.0,4281.0,592.0,50507.0,1164.0,46137.0,866.0,619.0,704.0,550.0,2408.0,89.8,1.7,1.2,1.4,1.1,4.7,866.0,1.7,850,1.7
4,E07000032,AMBER VALLEY,60674.0,60596.0,92.8,25.3,22.1,0.1,0.9,2.1,7.2,0.7,98.2,0.0,56232.0,15339.0,13412.0,89.0,549.0,1254.0,4364.0,440.0,59578.0,0.0,40893.0,15339.0,89.0,460.0,705.0,3110.0,67.4,25.3,0.1,0.8,1.2,5.1,13412.0,22.1,11155,18.4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
380,E07000239,WYRE FOREST,48100.0,48061.0,96.1,46.8,0.9,0.3,0.6,1.4,3.9,0.5,99.4,2.9,46203.0,22479.0,420.0,140.0,291.0,662.0,1858.0,264.0,47829.0,1412.0,23724.0,22479.0,140.0,151.0,371.0,1196.0,49.3,46.7,0.3,0.3,0.8,2.5,420.0,0.9,11542,24.0
381,E06000014,YORK,98735.0,98548.0,94.0,70.9,43.6,0.0,0.3,0.6,6.0,0.2,95.6,7.0,92626.0,69871.0,43077.0,36.0,261.0,629.0,5922.0,191.0,94419.0,6956.0,22755.0,69871.0,36.0,225.0,368.0,5293.0,23.0,70.8,0.0,0.2,0.4,5.4,43077.0,43.6,17165,17.4
382,E06000060,BUCKINGHAMSHIRE,238482.0,238350.0,92.5,37.0,,,,,,,,,220571.0,88134.0,16083.0,590.0,2304.0,4667.0,17779.0,916.0,233315.0,0.0,132437.0,88134.0,590.0,1714.0,2363.0,13112.0,,,,,,,16083.0,,52918,22.2
383,E06000061,NORTH NORTHAMPTONSHIRE,161434.0,160977.0,97.3,51.3,,,,,,,,,156615.0,82575.0,13647.0,149.0,781.0,1365.0,4362.0,773.0,158581.0,5374.0,74040.0,82575.0,149.0,632.0,584.0,2997.0,,,,,,,13647.0,,39037,24.3


In [153]:
# Calculate Full Fibre availability (% premises)
fixed_coverage_2019_df['Full Fibre availability (% premises)'] = round((fixed_coverage_2019_df['Number of premises with Full Fibre availability'] 
                                                                  / fixed_coverage_2019_df['All Matched Premises']) * 100, 1)
fixed_coverage_2019_df

Unnamed: 0,laua,laua_name,All Premises,All Matched Premises,SFBB availability (% premises),UFBB availability (% premises),Full Fibre availability (% premises),% of premises unable to receive 2Mbit/s,% of premises unable to receive 5Mbit/s,% of premises unable to receive 10Mbit/s,% of premises unable to receive 30Mbit/s,% of premises below the USO,% of premises with NGA,% of premises able to receive decent broadband from FWA,Number of premises with SFBB availability,Number of premises with UFBB availability,Number of premises with Full Fibre availability,Number of premises unable to receive 2Mbit/s,Number of premises unable to receive 5Mbit/s,Number of premises unable to receive 10Mbit/s,Number of premises unable to receive 30Mbit/s,Number of premises below the USO,Number of premises with NGA,Number of premises able to receive decent broadband from FWA,Number of premises with 30<300Mbit/s download speed,Number of premises with >=300Mbit/s download speed,Number of premises with 0<2Mbit/s download speed,Number of premises with 2<5Mbit/s download speed,Number of premises with 5<10Mbit/s download speed,Number of premises with 10<30Mbit/s download speed,% of premises with 30<300Mbit/s download speed,% of premises with >=300Mbit/s download speed,% of premises with 0<2Mbit/s download speed,% of premises with 2<5Mbit/s download speed,% of premises with 5<10Mbit/s download speed,% of premises with 10<30Mbit/s download speed,Number of premises with Gigabit availability,Gigabit availability (% premises),Number of premises with UFBB (100Mbit/s) availability,UFBB (100Mbit/s) availability (% premises)
0,S12000033,ABERDEEN CITY,125441.0,125311.0,93.5,20.1,13.1,0.0,0.2,0.7,6.5,0.2,96.8,0.0,117152.0,25163.0,16410.0,49.0,219.0,884.0,8159.0,189.0,121476.0,0.0,91989.0,25163.0,49.0,170.0,665.0,7275.0,73.3,20.1,0.0,0.1,0.5,5.8,16410.0,13.1,19758,15.8
1,S12000034,ABERDEENSHIRE,125085.0,124305.0,81.8,2.8,2.7,2.5,5.9,9.9,18.1,3.6,93.6,0.0,101652.0,3472.0,3332.0,3163.0,7339.0,12332.0,22653.0,4519.0,117051.0,0.0,98180.0,3472.0,3163.0,4176.0,4993.0,10321.0,78.5,2.8,2.5,3.3,4.0,8.3,3332.0,2.7,3353,2.7
2,E07000223,ADUR,29770.0,29760.0,98.7,82.5,0.6,0.0,0.1,0.1,1.3,0.0,99.1,0.0,29383.0,24543.0,193.0,0.0,16.0,44.0,377.0,12.0,29514.0,0.0,4840.0,24543.0,0.0,16.0,28.0,333.0,16.3,82.4,0.0,0.1,0.1,1.1,193.0,0.6,4043,13.6
3,E07000026,ALLERDALE,51385.0,51284.0,91.7,1.7,1.7,1.2,2.6,3.6,8.3,1.2,98.3,2.3,47003.0,866.0,866.0,619.0,1323.0,1873.0,4281.0,592.0,50507.0,1164.0,46137.0,866.0,619.0,704.0,550.0,2408.0,89.8,1.7,1.2,1.4,1.1,4.7,866.0,1.7,850,1.7
4,E07000032,AMBER VALLEY,60674.0,60596.0,92.8,25.3,22.1,0.1,0.9,2.1,7.2,0.7,98.2,0.0,56232.0,15339.0,13412.0,89.0,549.0,1254.0,4364.0,440.0,59578.0,0.0,40893.0,15339.0,89.0,460.0,705.0,3110.0,67.4,25.3,0.1,0.8,1.2,5.1,13412.0,22.1,11155,18.4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
380,E07000239,WYRE FOREST,48100.0,48061.0,96.1,46.8,0.9,0.3,0.6,1.4,3.9,0.5,99.4,2.9,46203.0,22479.0,420.0,140.0,291.0,662.0,1858.0,264.0,47829.0,1412.0,23724.0,22479.0,140.0,151.0,371.0,1196.0,49.3,46.7,0.3,0.3,0.8,2.5,420.0,0.9,11542,24.0
381,E06000014,YORK,98735.0,98548.0,94.0,70.9,43.7,0.0,0.3,0.6,6.0,0.2,95.6,7.0,92626.0,69871.0,43077.0,36.0,261.0,629.0,5922.0,191.0,94419.0,6956.0,22755.0,69871.0,36.0,225.0,368.0,5293.0,23.0,70.8,0.0,0.2,0.4,5.4,43077.0,43.6,17165,17.4
382,E06000060,BUCKINGHAMSHIRE,238482.0,238350.0,92.5,37.0,6.7,,,,,,,,220571.0,88134.0,16083.0,590.0,2304.0,4667.0,17779.0,916.0,233315.0,0.0,132437.0,88134.0,590.0,1714.0,2363.0,13112.0,,,,,,,16083.0,,52918,22.2
383,E06000061,NORTH NORTHAMPTONSHIRE,161434.0,160977.0,97.3,51.3,8.5,,,,,,,,156615.0,82575.0,13647.0,149.0,781.0,1365.0,4362.0,773.0,158581.0,5374.0,74040.0,82575.0,149.0,632.0,584.0,2997.0,,,,,,,13647.0,,39037,24.3


In [154]:
# Calculate % of premises unable to receive 2Mbit/s
fixed_coverage_2019_df['% of premises unable to receive 2Mbit/s'] = round((fixed_coverage_2019_df['Number of premises unable to receive 2Mbit/s'] 
                                                                  / fixed_coverage_2019_df['All Matched Premises']) * 100, 1)
fixed_coverage_2019_df

Unnamed: 0,laua,laua_name,All Premises,All Matched Premises,SFBB availability (% premises),UFBB availability (% premises),Full Fibre availability (% premises),% of premises unable to receive 2Mbit/s,% of premises unable to receive 5Mbit/s,% of premises unable to receive 10Mbit/s,% of premises unable to receive 30Mbit/s,% of premises below the USO,% of premises with NGA,% of premises able to receive decent broadband from FWA,Number of premises with SFBB availability,Number of premises with UFBB availability,Number of premises with Full Fibre availability,Number of premises unable to receive 2Mbit/s,Number of premises unable to receive 5Mbit/s,Number of premises unable to receive 10Mbit/s,Number of premises unable to receive 30Mbit/s,Number of premises below the USO,Number of premises with NGA,Number of premises able to receive decent broadband from FWA,Number of premises with 30<300Mbit/s download speed,Number of premises with >=300Mbit/s download speed,Number of premises with 0<2Mbit/s download speed,Number of premises with 2<5Mbit/s download speed,Number of premises with 5<10Mbit/s download speed,Number of premises with 10<30Mbit/s download speed,% of premises with 30<300Mbit/s download speed,% of premises with >=300Mbit/s download speed,% of premises with 0<2Mbit/s download speed,% of premises with 2<5Mbit/s download speed,% of premises with 5<10Mbit/s download speed,% of premises with 10<30Mbit/s download speed,Number of premises with Gigabit availability,Gigabit availability (% premises),Number of premises with UFBB (100Mbit/s) availability,UFBB (100Mbit/s) availability (% premises)
0,S12000033,ABERDEEN CITY,125441.0,125311.0,93.5,20.1,13.1,0.0,0.2,0.7,6.5,0.2,96.8,0.0,117152.0,25163.0,16410.0,49.0,219.0,884.0,8159.0,189.0,121476.0,0.0,91989.0,25163.0,49.0,170.0,665.0,7275.0,73.3,20.1,0.0,0.1,0.5,5.8,16410.0,13.1,19758,15.8
1,S12000034,ABERDEENSHIRE,125085.0,124305.0,81.8,2.8,2.7,2.5,5.9,9.9,18.1,3.6,93.6,0.0,101652.0,3472.0,3332.0,3163.0,7339.0,12332.0,22653.0,4519.0,117051.0,0.0,98180.0,3472.0,3163.0,4176.0,4993.0,10321.0,78.5,2.8,2.5,3.3,4.0,8.3,3332.0,2.7,3353,2.7
2,E07000223,ADUR,29770.0,29760.0,98.7,82.5,0.6,0.0,0.1,0.1,1.3,0.0,99.1,0.0,29383.0,24543.0,193.0,0.0,16.0,44.0,377.0,12.0,29514.0,0.0,4840.0,24543.0,0.0,16.0,28.0,333.0,16.3,82.4,0.0,0.1,0.1,1.1,193.0,0.6,4043,13.6
3,E07000026,ALLERDALE,51385.0,51284.0,91.7,1.7,1.7,1.2,2.6,3.6,8.3,1.2,98.3,2.3,47003.0,866.0,866.0,619.0,1323.0,1873.0,4281.0,592.0,50507.0,1164.0,46137.0,866.0,619.0,704.0,550.0,2408.0,89.8,1.7,1.2,1.4,1.1,4.7,866.0,1.7,850,1.7
4,E07000032,AMBER VALLEY,60674.0,60596.0,92.8,25.3,22.1,0.1,0.9,2.1,7.2,0.7,98.2,0.0,56232.0,15339.0,13412.0,89.0,549.0,1254.0,4364.0,440.0,59578.0,0.0,40893.0,15339.0,89.0,460.0,705.0,3110.0,67.4,25.3,0.1,0.8,1.2,5.1,13412.0,22.1,11155,18.4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
380,E07000239,WYRE FOREST,48100.0,48061.0,96.1,46.8,0.9,0.3,0.6,1.4,3.9,0.5,99.4,2.9,46203.0,22479.0,420.0,140.0,291.0,662.0,1858.0,264.0,47829.0,1412.0,23724.0,22479.0,140.0,151.0,371.0,1196.0,49.3,46.7,0.3,0.3,0.8,2.5,420.0,0.9,11542,24.0
381,E06000014,YORK,98735.0,98548.0,94.0,70.9,43.7,0.0,0.3,0.6,6.0,0.2,95.6,7.0,92626.0,69871.0,43077.0,36.0,261.0,629.0,5922.0,191.0,94419.0,6956.0,22755.0,69871.0,36.0,225.0,368.0,5293.0,23.0,70.8,0.0,0.2,0.4,5.4,43077.0,43.6,17165,17.4
382,E06000060,BUCKINGHAMSHIRE,238482.0,238350.0,92.5,37.0,6.7,0.2,,,,,,,220571.0,88134.0,16083.0,590.0,2304.0,4667.0,17779.0,916.0,233315.0,0.0,132437.0,88134.0,590.0,1714.0,2363.0,13112.0,,,,,,,16083.0,,52918,22.2
383,E06000061,NORTH NORTHAMPTONSHIRE,161434.0,160977.0,97.3,51.3,8.5,0.1,,,,,,,156615.0,82575.0,13647.0,149.0,781.0,1365.0,4362.0,773.0,158581.0,5374.0,74040.0,82575.0,149.0,632.0,584.0,2997.0,,,,,,,13647.0,,39037,24.3


In [155]:
# Calculate % of premises unable to receive 5Mbit/s
fixed_coverage_2019_df['% of premises unable to receive 5Mbit/s'] = round((fixed_coverage_2019_df['Number of premises unable to receive 5Mbit/s'] 
                                                                  / fixed_coverage_2019_df['All Matched Premises']) * 100, 1)
fixed_coverage_2019_df

Unnamed: 0,laua,laua_name,All Premises,All Matched Premises,SFBB availability (% premises),UFBB availability (% premises),Full Fibre availability (% premises),% of premises unable to receive 2Mbit/s,% of premises unable to receive 5Mbit/s,% of premises unable to receive 10Mbit/s,% of premises unable to receive 30Mbit/s,% of premises below the USO,% of premises with NGA,% of premises able to receive decent broadband from FWA,Number of premises with SFBB availability,Number of premises with UFBB availability,Number of premises with Full Fibre availability,Number of premises unable to receive 2Mbit/s,Number of premises unable to receive 5Mbit/s,Number of premises unable to receive 10Mbit/s,Number of premises unable to receive 30Mbit/s,Number of premises below the USO,Number of premises with NGA,Number of premises able to receive decent broadband from FWA,Number of premises with 30<300Mbit/s download speed,Number of premises with >=300Mbit/s download speed,Number of premises with 0<2Mbit/s download speed,Number of premises with 2<5Mbit/s download speed,Number of premises with 5<10Mbit/s download speed,Number of premises with 10<30Mbit/s download speed,% of premises with 30<300Mbit/s download speed,% of premises with >=300Mbit/s download speed,% of premises with 0<2Mbit/s download speed,% of premises with 2<5Mbit/s download speed,% of premises with 5<10Mbit/s download speed,% of premises with 10<30Mbit/s download speed,Number of premises with Gigabit availability,Gigabit availability (% premises),Number of premises with UFBB (100Mbit/s) availability,UFBB (100Mbit/s) availability (% premises)
0,S12000033,ABERDEEN CITY,125441.0,125311.0,93.5,20.1,13.1,0.0,0.2,0.7,6.5,0.2,96.8,0.0,117152.0,25163.0,16410.0,49.0,219.0,884.0,8159.0,189.0,121476.0,0.0,91989.0,25163.0,49.0,170.0,665.0,7275.0,73.3,20.1,0.0,0.1,0.5,5.8,16410.0,13.1,19758,15.8
1,S12000034,ABERDEENSHIRE,125085.0,124305.0,81.8,2.8,2.7,2.5,5.9,9.9,18.1,3.6,93.6,0.0,101652.0,3472.0,3332.0,3163.0,7339.0,12332.0,22653.0,4519.0,117051.0,0.0,98180.0,3472.0,3163.0,4176.0,4993.0,10321.0,78.5,2.8,2.5,3.3,4.0,8.3,3332.0,2.7,3353,2.7
2,E07000223,ADUR,29770.0,29760.0,98.7,82.5,0.6,0.0,0.1,0.1,1.3,0.0,99.1,0.0,29383.0,24543.0,193.0,0.0,16.0,44.0,377.0,12.0,29514.0,0.0,4840.0,24543.0,0.0,16.0,28.0,333.0,16.3,82.4,0.0,0.1,0.1,1.1,193.0,0.6,4043,13.6
3,E07000026,ALLERDALE,51385.0,51284.0,91.7,1.7,1.7,1.2,2.6,3.6,8.3,1.2,98.3,2.3,47003.0,866.0,866.0,619.0,1323.0,1873.0,4281.0,592.0,50507.0,1164.0,46137.0,866.0,619.0,704.0,550.0,2408.0,89.8,1.7,1.2,1.4,1.1,4.7,866.0,1.7,850,1.7
4,E07000032,AMBER VALLEY,60674.0,60596.0,92.8,25.3,22.1,0.1,0.9,2.1,7.2,0.7,98.2,0.0,56232.0,15339.0,13412.0,89.0,549.0,1254.0,4364.0,440.0,59578.0,0.0,40893.0,15339.0,89.0,460.0,705.0,3110.0,67.4,25.3,0.1,0.8,1.2,5.1,13412.0,22.1,11155,18.4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
380,E07000239,WYRE FOREST,48100.0,48061.0,96.1,46.8,0.9,0.3,0.6,1.4,3.9,0.5,99.4,2.9,46203.0,22479.0,420.0,140.0,291.0,662.0,1858.0,264.0,47829.0,1412.0,23724.0,22479.0,140.0,151.0,371.0,1196.0,49.3,46.7,0.3,0.3,0.8,2.5,420.0,0.9,11542,24.0
381,E06000014,YORK,98735.0,98548.0,94.0,70.9,43.7,0.0,0.3,0.6,6.0,0.2,95.6,7.0,92626.0,69871.0,43077.0,36.0,261.0,629.0,5922.0,191.0,94419.0,6956.0,22755.0,69871.0,36.0,225.0,368.0,5293.0,23.0,70.8,0.0,0.2,0.4,5.4,43077.0,43.6,17165,17.4
382,E06000060,BUCKINGHAMSHIRE,238482.0,238350.0,92.5,37.0,6.7,0.2,1.0,,,,,,220571.0,88134.0,16083.0,590.0,2304.0,4667.0,17779.0,916.0,233315.0,0.0,132437.0,88134.0,590.0,1714.0,2363.0,13112.0,,,,,,,16083.0,,52918,22.2
383,E06000061,NORTH NORTHAMPTONSHIRE,161434.0,160977.0,97.3,51.3,8.5,0.1,0.5,,,,,,156615.0,82575.0,13647.0,149.0,781.0,1365.0,4362.0,773.0,158581.0,5374.0,74040.0,82575.0,149.0,632.0,584.0,2997.0,,,,,,,13647.0,,39037,24.3


In [156]:
# Calculate % of premises unable to receive 5Mbit/s
fixed_coverage_2019_df['% of premises unable to receive 10Mbit/s'] = round((fixed_coverage_2019_df['Number of premises unable to receive 5Mbit/s'] 
                                                                  / fixed_coverage_2019_df['All Matched Premises']) * 100, 1)
fixed_coverage_2019_df

Unnamed: 0,laua,laua_name,All Premises,All Matched Premises,SFBB availability (% premises),UFBB availability (% premises),Full Fibre availability (% premises),% of premises unable to receive 2Mbit/s,% of premises unable to receive 5Mbit/s,% of premises unable to receive 10Mbit/s,% of premises unable to receive 30Mbit/s,% of premises below the USO,% of premises with NGA,% of premises able to receive decent broadband from FWA,Number of premises with SFBB availability,Number of premises with UFBB availability,Number of premises with Full Fibre availability,Number of premises unable to receive 2Mbit/s,Number of premises unable to receive 5Mbit/s,Number of premises unable to receive 10Mbit/s,Number of premises unable to receive 30Mbit/s,Number of premises below the USO,Number of premises with NGA,Number of premises able to receive decent broadband from FWA,Number of premises with 30<300Mbit/s download speed,Number of premises with >=300Mbit/s download speed,Number of premises with 0<2Mbit/s download speed,Number of premises with 2<5Mbit/s download speed,Number of premises with 5<10Mbit/s download speed,Number of premises with 10<30Mbit/s download speed,% of premises with 30<300Mbit/s download speed,% of premises with >=300Mbit/s download speed,% of premises with 0<2Mbit/s download speed,% of premises with 2<5Mbit/s download speed,% of premises with 5<10Mbit/s download speed,% of premises with 10<30Mbit/s download speed,Number of premises with Gigabit availability,Gigabit availability (% premises),Number of premises with UFBB (100Mbit/s) availability,UFBB (100Mbit/s) availability (% premises)
0,S12000033,ABERDEEN CITY,125441.0,125311.0,93.5,20.1,13.1,0.0,0.2,0.2,6.5,0.2,96.8,0.0,117152.0,25163.0,16410.0,49.0,219.0,884.0,8159.0,189.0,121476.0,0.0,91989.0,25163.0,49.0,170.0,665.0,7275.0,73.3,20.1,0.0,0.1,0.5,5.8,16410.0,13.1,19758,15.8
1,S12000034,ABERDEENSHIRE,125085.0,124305.0,81.8,2.8,2.7,2.5,5.9,5.9,18.1,3.6,93.6,0.0,101652.0,3472.0,3332.0,3163.0,7339.0,12332.0,22653.0,4519.0,117051.0,0.0,98180.0,3472.0,3163.0,4176.0,4993.0,10321.0,78.5,2.8,2.5,3.3,4.0,8.3,3332.0,2.7,3353,2.7
2,E07000223,ADUR,29770.0,29760.0,98.7,82.5,0.6,0.0,0.1,0.1,1.3,0.0,99.1,0.0,29383.0,24543.0,193.0,0.0,16.0,44.0,377.0,12.0,29514.0,0.0,4840.0,24543.0,0.0,16.0,28.0,333.0,16.3,82.4,0.0,0.1,0.1,1.1,193.0,0.6,4043,13.6
3,E07000026,ALLERDALE,51385.0,51284.0,91.7,1.7,1.7,1.2,2.6,2.6,8.3,1.2,98.3,2.3,47003.0,866.0,866.0,619.0,1323.0,1873.0,4281.0,592.0,50507.0,1164.0,46137.0,866.0,619.0,704.0,550.0,2408.0,89.8,1.7,1.2,1.4,1.1,4.7,866.0,1.7,850,1.7
4,E07000032,AMBER VALLEY,60674.0,60596.0,92.8,25.3,22.1,0.1,0.9,0.9,7.2,0.7,98.2,0.0,56232.0,15339.0,13412.0,89.0,549.0,1254.0,4364.0,440.0,59578.0,0.0,40893.0,15339.0,89.0,460.0,705.0,3110.0,67.4,25.3,0.1,0.8,1.2,5.1,13412.0,22.1,11155,18.4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
380,E07000239,WYRE FOREST,48100.0,48061.0,96.1,46.8,0.9,0.3,0.6,0.6,3.9,0.5,99.4,2.9,46203.0,22479.0,420.0,140.0,291.0,662.0,1858.0,264.0,47829.0,1412.0,23724.0,22479.0,140.0,151.0,371.0,1196.0,49.3,46.7,0.3,0.3,0.8,2.5,420.0,0.9,11542,24.0
381,E06000014,YORK,98735.0,98548.0,94.0,70.9,43.7,0.0,0.3,0.3,6.0,0.2,95.6,7.0,92626.0,69871.0,43077.0,36.0,261.0,629.0,5922.0,191.0,94419.0,6956.0,22755.0,69871.0,36.0,225.0,368.0,5293.0,23.0,70.8,0.0,0.2,0.4,5.4,43077.0,43.6,17165,17.4
382,E06000060,BUCKINGHAMSHIRE,238482.0,238350.0,92.5,37.0,6.7,0.2,1.0,1.0,,,,,220571.0,88134.0,16083.0,590.0,2304.0,4667.0,17779.0,916.0,233315.0,0.0,132437.0,88134.0,590.0,1714.0,2363.0,13112.0,,,,,,,16083.0,,52918,22.2
383,E06000061,NORTH NORTHAMPTONSHIRE,161434.0,160977.0,97.3,51.3,8.5,0.1,0.5,0.5,,,,,156615.0,82575.0,13647.0,149.0,781.0,1365.0,4362.0,773.0,158581.0,5374.0,74040.0,82575.0,149.0,632.0,584.0,2997.0,,,,,,,13647.0,,39037,24.3


In [157]:
# List of columns to compute percentages for
percentage_columns = [
    '% of premises unable to receive 2Mbit/s',
    '% of premises unable to receive 5Mbit/s',
    '% of premises unable to receive 10Mbit/s',
    '% of premises unable to receive 30Mbit/s',
    '% of premises below the USO',
    '% of premises with NGA',
    '% of premises able to receive decent broadband from FWA',
    '% of premises with 30<300Mbit/s download speed',
    '% of premises with >=300Mbit/s download speed',
    '% of premises with 0<2Mbit/s download speed',
    '% of premises with 2<5Mbit/s download speed',
    '% of premises with 5<10Mbit/s download speed',
    '% of premises with 10<30Mbit/s download speed',
]

# Compute percentages for each column
for col in percentage_columns:
    num_col = col.replace('% of premises', 'Number of premises').strip()  # Get the corresponding 'Number of premises' column name
    fixed_coverage_2019_df[col] = round((fixed_coverage_2019_df[num_col] / fixed_coverage_2019_df['All Matched Premises']) * 100, 1)

# Handle special cases
special_columns = {
    'Gigabit availability (% premises)': 'Number of premises with Gigabit availability',
    'UFBB (100Mbit/s) availability (% premises)': 'Number of premises with UFBB (100Mbit/s) availability'
}

for col, num_col in special_columns.items():
    fixed_coverage_2019_df[col] = round((fixed_coverage_2019_df[num_col] / fixed_coverage_2019_df['All Matched Premises']) * 100, 1)

fixed_coverage_2019_df


Unnamed: 0,laua,laua_name,All Premises,All Matched Premises,SFBB availability (% premises),UFBB availability (% premises),Full Fibre availability (% premises),% of premises unable to receive 2Mbit/s,% of premises unable to receive 5Mbit/s,% of premises unable to receive 10Mbit/s,% of premises unable to receive 30Mbit/s,% of premises below the USO,% of premises with NGA,% of premises able to receive decent broadband from FWA,Number of premises with SFBB availability,Number of premises with UFBB availability,Number of premises with Full Fibre availability,Number of premises unable to receive 2Mbit/s,Number of premises unable to receive 5Mbit/s,Number of premises unable to receive 10Mbit/s,Number of premises unable to receive 30Mbit/s,Number of premises below the USO,Number of premises with NGA,Number of premises able to receive decent broadband from FWA,Number of premises with 30<300Mbit/s download speed,Number of premises with >=300Mbit/s download speed,Number of premises with 0<2Mbit/s download speed,Number of premises with 2<5Mbit/s download speed,Number of premises with 5<10Mbit/s download speed,Number of premises with 10<30Mbit/s download speed,% of premises with 30<300Mbit/s download speed,% of premises with >=300Mbit/s download speed,% of premises with 0<2Mbit/s download speed,% of premises with 2<5Mbit/s download speed,% of premises with 5<10Mbit/s download speed,% of premises with 10<30Mbit/s download speed,Number of premises with Gigabit availability,Gigabit availability (% premises),Number of premises with UFBB (100Mbit/s) availability,UFBB (100Mbit/s) availability (% premises)
0,S12000033,ABERDEEN CITY,125441.0,125311.0,93.5,20.1,13.1,0.0,0.2,0.7,6.5,0.2,96.9,0.0,117152.0,25163.0,16410.0,49.0,219.0,884.0,8159.0,189.0,121476.0,0.0,91989.0,25163.0,49.0,170.0,665.0,7275.0,73.4,20.1,0.0,0.1,0.5,5.8,16410.0,13.1,19758,15.8
1,S12000034,ABERDEENSHIRE,125085.0,124305.0,81.8,2.8,2.7,2.5,5.9,9.9,18.2,3.6,94.2,0.0,101652.0,3472.0,3332.0,3163.0,7339.0,12332.0,22653.0,4519.0,117051.0,0.0,98180.0,3472.0,3163.0,4176.0,4993.0,10321.0,79.0,2.8,2.5,3.4,4.0,8.3,3332.0,2.7,3353,2.7
2,E07000223,ADUR,29770.0,29760.0,98.7,82.5,0.6,0.0,0.1,0.1,1.3,0.0,99.2,0.0,29383.0,24543.0,193.0,0.0,16.0,44.0,377.0,12.0,29514.0,0.0,4840.0,24543.0,0.0,16.0,28.0,333.0,16.3,82.5,0.0,0.1,0.1,1.1,193.0,0.6,4043,13.6
3,E07000026,ALLERDALE,51385.0,51284.0,91.7,1.7,1.7,1.2,2.6,3.7,8.3,1.2,98.5,2.3,47003.0,866.0,866.0,619.0,1323.0,1873.0,4281.0,592.0,50507.0,1164.0,46137.0,866.0,619.0,704.0,550.0,2408.0,90.0,1.7,1.2,1.4,1.1,4.7,866.0,1.7,850,1.7
4,E07000032,AMBER VALLEY,60674.0,60596.0,92.8,25.3,22.1,0.1,0.9,2.1,7.2,0.7,98.3,0.0,56232.0,15339.0,13412.0,89.0,549.0,1254.0,4364.0,440.0,59578.0,0.0,40893.0,15339.0,89.0,460.0,705.0,3110.0,67.5,25.3,0.1,0.8,1.2,5.1,13412.0,22.1,11155,18.4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
380,E07000239,WYRE FOREST,48100.0,48061.0,96.1,46.8,0.9,0.3,0.6,1.4,3.9,0.5,99.5,2.9,46203.0,22479.0,420.0,140.0,291.0,662.0,1858.0,264.0,47829.0,1412.0,23724.0,22479.0,140.0,151.0,371.0,1196.0,49.4,46.8,0.3,0.3,0.8,2.5,420.0,0.9,11542,24.0
381,E06000014,YORK,98735.0,98548.0,94.0,70.9,43.7,0.0,0.3,0.6,6.0,0.2,95.8,7.1,92626.0,69871.0,43077.0,36.0,261.0,629.0,5922.0,191.0,94419.0,6956.0,22755.0,69871.0,36.0,225.0,368.0,5293.0,23.1,70.9,0.0,0.2,0.4,5.4,43077.0,43.7,17165,17.4
382,E06000060,BUCKINGHAMSHIRE,238482.0,238350.0,92.5,37.0,6.7,0.2,1.0,2.0,7.5,0.4,97.9,0.0,220571.0,88134.0,16083.0,590.0,2304.0,4667.0,17779.0,916.0,233315.0,0.0,132437.0,88134.0,590.0,1714.0,2363.0,13112.0,55.6,37.0,0.2,0.7,1.0,5.5,16083.0,6.7,52918,22.2
383,E06000061,NORTH NORTHAMPTONSHIRE,161434.0,160977.0,97.3,51.3,8.5,0.1,0.5,0.8,2.7,0.5,98.5,3.3,156615.0,82575.0,13647.0,149.0,781.0,1365.0,4362.0,773.0,158581.0,5374.0,74040.0,82575.0,149.0,632.0,584.0,2997.0,46.0,51.3,0.1,0.4,0.4,1.9,13647.0,8.5,39037,24.3


The 2019 dataset is now ready populated with data. However, I would like to ensure there are no NaN or null values.

In [158]:
fixed_coverage_2019_df.isna().sum().sum() > 0

np.False_

That's good news. I can now move onto the 2020 dataset to update the perecentage values.

In [159]:
# List of columns to compute percentages for
percentage_columns = [
    '% of premises unable to receive 2Mbit/s',
    '% of premises unable to receive 5Mbit/s',
    '% of premises unable to receive 10Mbit/s',
    '% of premises unable to receive 30Mbit/s',
    '% of premises below the USO',
    '% of premises with NGA',
    '% of premises able to receive decent broadband from FWA',
    '% of premises with 30<300Mbit/s download speed',
    '% of premises with >=300Mbit/s download speed',
    '% of premises with 0<2Mbit/s download speed',
    '% of premises with 2<5Mbit/s download speed',
    '% of premises with 5<10Mbit/s download speed',
    '% of premises with 10<30Mbit/s download speed',
    'SFBB availability (% premises)' 
]

# Compute percentages for each column
for col in percentage_columns:
    num_col = col.replace('% of premises', 'Number of premises').strip()  # Get the corresponding 'Number of premises' column name
    fixed_coverage_2020_df[col] = round((fixed_coverage_2020_df[num_col] / fixed_coverage_2020_df['All Matched Premises']) * 100, 1)

# Handle special cases
special_columns = {
    'UFBB (100Mbit/s) availability (% premises)': 'Number of premises with UFBB (100Mbit/s) availability',
    'UFBB availability (% premises)': 'Number of premises with UFBB availability',
    'Full Fibre availability (% premises)': 'Number of premises with Full Fibre availability',
    'Gigabit availability (% premises)': 'Number of premises with Gigabit availability',
    'SFBB availability (% premises)': 'Number of premises with SFBB availability'  # Add SFBB availability special case
}

for col, num_col in special_columns.items():
    fixed_coverage_2020_df[col] = round((fixed_coverage_2020_df[num_col] / fixed_coverage_2020_df['All Matched Premises']) * 100, 1)

fixed_coverage_2020_df


Unnamed: 0,laua,laua_name,All Premises,All Matched Premises,SFBB availability (% premises),UFBB (100Mbit/s) availability (% premises),UFBB availability (% premises),Full Fibre availability (% premises),Gigabit availability (% premises),% of premises unable to receive 2Mbit/s,% of premises unable to receive 5Mbit/s,% of premises unable to receive 10Mbit/s,% of premises unable to receive 30Mbit/s,% of premises below the USO,% of premises with NGA,% of premises able to receive decent broadband from FWA,Number of premises with SFBB availability,Number of premises with UFBB (100Mbit/s) availability,Number of premises with UFBB availability,Number of premises with Full Fibre availability,Number of premises with Gigabit availability,Number of premises unable to receive 2Mbit/s,Number of premises unable to receive 5Mbit/s,Number of premises unable to receive 10Mbit/s,Number of premises unable to receive 30Mbit/s,Number of premises below the USO,Number of premises with NGA,Number of premises able to receive decent broadband from FWA,Number of premises with 30<300Mbit/s download speed,Number of premises with >=300Mbit/s download speed,Number of premises with 0<2Mbit/s download speed,Number of premises with 2<5Mbit/s download speed,Number of premises with 5<10Mbit/s download speed,Number of premises with 10<30Mbit/s download speed,% of premises with 30<300Mbit/s download speed,% of premises with >=300Mbit/s download speed,% of premises with 0<2Mbit/s download speed,% of premises with 2<5Mbit/s download speed,% of premises with 5<10Mbit/s download speed,% of premises with 10<30Mbit/s download speed
0,S12000033,ABERDEEN CITY,126176.0,125948.0,94.8,49.1,41.7,35.0,35.0,0.0,0.2,0.7,5.2,0.2,97.2,0.0,119358.0,61798.0,52461.0,44051.0,44051.0,55.0,208.0,881.0,6590.0,300.0,122434.0,0.0,66897.0,52461.0,55.0,153.0,673.0,5709.0,53.1,41.7,0.0,0.1,0.5,4.5
1,S12000034,ABERDEENSHIRE,126065.0,125176.0,83.5,7.3,7.1,7.0,7.0,2.6,5.7,9.2,16.5,3.6,95.3,0.0,104472.0,9118.0,8872.0,8732.0,8732.0,3234.0,7099.0,11516.0,20704.0,4538.0,119331.0,0.0,95600.0,8872.0,3234.0,3865.0,4417.0,9188.0,76.4,7.1,2.6,3.1,3.5,7.3
2,E07000223,ADUR,29779.0,29755.0,98.9,85.9,85.6,0.6,0.6,0.0,0.0,0.1,1.1,0.1,99.5,0.0,29427.0,25562.0,25482.0,189.0,189.0,0.0,10.0,34.0,328.0,33.0,29616.0,0.0,3945.0,25482.0,0.0,10.0,24.0,294.0,13.3,85.6,0.0,0.0,0.1,1.0
3,E07000026,ALLERDALE,51647.0,51483.0,92.6,2.8,2.8,2.8,2.8,1.2,2.3,3.3,7.4,1.2,98.9,2.3,47693.0,1466.0,1466.0,1466.0,1466.0,627.0,1173.0,1705.0,3790.0,634.0,50931.0,1160.0,46227.0,1466.0,627.0,546.0,532.0,2085.0,89.8,2.8,1.2,1.1,1.0,4.0
4,E07000032,AMBER VALLEY,61134.0,60972.0,94.9,30.3,26.8,23.7,23.7,0.1,0.5,0.9,5.1,0.4,99.2,0.0,57875.0,18462.0,16323.0,14438.0,14438.0,63.0,280.0,573.0,3097.0,267.0,60483.0,0.0,41552.0,16323.0,63.0,217.0,293.0,2524.0,68.1,26.8,0.1,0.4,0.5,4.1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
376,E07000128,WYRE,56527.0,56411.0,95.3,22.8,22.8,22.4,22.4,0.1,0.3,0.7,4.7,0.2,99.0,0.3,53739.0,12856.0,12852.0,12612.0,12612.0,61.0,174.0,368.0,2672.0,133.0,55837.0,186.0,40887.0,12852.0,61.0,113.0,194.0,2304.0,72.5,22.8,0.1,0.2,0.3,4.1
377,E07000239,WYRE FOREST,48237.0,48173.0,96.9,48.0,47.8,2.0,47.8,0.2,0.5,0.8,3.1,0.4,99.6,3.5,46680.0,23125.0,23035.0,961.0,23034.0,84.0,218.0,389.0,1493.0,183.0,48001.0,1675.0,23645.0,23035.0,84.0,134.0,171.0,1104.0,49.1,47.8,0.2,0.3,0.4,2.3
378,E06000014,YORK,95949.0,95674.0,94.4,75.7,72.1,54.9,54.9,0.0,0.2,0.8,5.6,0.3,96.2,3.7,90313.0,72415.0,68952.0,52549.0,52549.0,39.0,209.0,774.0,5361.0,296.0,92085.0,3522.0,21361.0,68952.0,39.0,170.0,565.0,4587.0,22.3,72.1,0.0,0.2,0.6,4.8
379,E06000061,NORTH NORTHAMPTONSHIRE,162226.0,161673.0,98.0,69.3,65.0,12.8,12.8,0.1,0.4,0.6,2.0,0.5,99.0,3.3,158498.0,112037.0,105013.0,20721.0,20721.0,93.0,566.0,997.0,3175.0,825.0,159992.0,5372.0,53485.0,105013.0,93.0,473.0,431.0,2178.0,33.1,65.0,0.1,0.3,0.3,1.3


### Consistency checks on entries on all datasets - 2019, 2020, 2021, 2022, 2023

I would like to check for each dataframe's shape to ensure the number of entries indeed match and is consistent now.

In [160]:
fixed_coverage_2019_df.shape

(374, 40)

In [161]:
fixed_coverage_2020_df.shape

(374, 40)

In [162]:
fixed_coverage_2021_df.shape

(374, 40)

In [163]:
fixed_coverage_2022_df.shape

(374, 40)

In [164]:
fixed_coverage_2023_df.shape

(374, 40)

That's great news! I will also reset the index for 2019 and 2020 datasets, in case the index was disrupted when I was adding and removing entries.

In [165]:
fixed_coverage_2019_df.reset_index(drop=True, inplace=True)
fixed_coverage_2020_df.reset_index(drop=True, inplace=True)

Now I have achieved consistency across the 5 datasets in terms of number of entries. However, I should not take that for granted, and I think I should go a step forward and compare all datasets for consistency in column 'laua' and 'laua_name'. That way I will ensure that all datasets contain the same local authority code and the same local authority name. 

In [166]:
# Create a list of your data frames
data_frames = [fixed_coverage_2019_df, fixed_coverage_2020_df, fixed_coverage_2021_df, fixed_coverage_2022_df, fixed_coverage_2023_df]

# Create an empty set to store unique entries across all data frames
unique_laua_entries = set()

# Iterate over each data frame to extract unique 'laua' entries
for df in data_frames:
    unique_laua_entries.update(df['laua'].unique())

In [167]:
# Convert the unique 'laua' entries from each data frame to sets
laua_sets = [set(df['laua']) for df in data_frames]

# Compute the intersection of all sets to find common 'laua' entries
common_laua_entries = set.intersection(*laua_sets)

# Check if there are any differences in 'laua' entries across the data frames
differences_exist = any(laua_set != common_laua_entries for laua_set in laua_sets)

if differences_exist:
    print("There are differences in 'laua' entries across the data frames.")
else:
    print("All 'laua' entries are consistent across the data frames.")

All 'laua' entries are consistent across the data frames.


In [168]:
# Create an empty set to store unique 'laua_name' entries across all data frames
unique_laua_name_entries = set()

# Iterate over each data frame to extract unique 'laua_name' entries
for df in data_frames:
    unique_laua_name_entries.update(df['laua_name'].unique())

# Convert the unique 'laua_name' entries from each data frame to sets
laua_name_sets = [set(df['laua_name']) for df in data_frames]

# Compute the intersection of all sets to find common 'laua_name' entries
common_laua_name_entries = set.intersection(*laua_name_sets)

# Check if there are any differences in 'laua_name' entries across the data frames
differences_exist = any(laua_name_set != common_laua_name_entries for laua_name_set in laua_name_sets)

if differences_exist:
    print("There are differences in 'laua_name' entries across the data frames.")
else:
    print("All 'laua_name' entries are consistent across the data frames.")


All 'laua_name' entries are consistent across the data frames.


Okay, now the two datasets are populated with data and are consistent with the rest of the datasets. Let's make sure that there are no missing values.

In [169]:
fixed_coverage_2019_df.isnull().sum()

laua                                                            0
laua_name                                                       0
All Premises                                                    0
All Matched Premises                                            0
SFBB availability (% premises)                                  0
UFBB availability (% premises)                                  0
Full Fibre availability (% premises)                            0
% of premises unable to receive 2Mbit/s                         0
% of premises unable to receive 5Mbit/s                         0
% of premises unable to receive 10Mbit/s                        0
% of premises unable to receive 30Mbit/s                        0
% of premises below the USO                                     0
% of premises with NGA                                          0
% of premises able to receive decent broadband from FWA         0
Number of premises with SFBB availability                       0
Number of 

In [None]:
fixed_coverage_2019_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 374 entries, 0 to 373
Data columns (total 40 columns):
 #   Column                                                        Non-Null Count  Dtype  
---  ------                                                        --------------  -----  
 0   laua                                                          374 non-null    object 
 1   laua_name                                                     374 non-null    object 
 2   All Premises                                                  374 non-null    float64
 3   All Matched Premises                                          374 non-null    float64
 4   SFBB availability (% premises)                                374 non-null    float64
 5   UFBB availability (% premises)                                374 non-null    float64
 6   Full Fibre availability (% premises)                          374 non-null    float64
 7   % of premises unable to receive 2Mbit/s                       374 non-n

It seems lots of columns to have a float data type, which is unecessary. Float data type should be assigned to only the columns containing percentages. The rest of the columns should be int64.

In [171]:
# Columns to convert from float64 to int64
columns_to_convert = [
    'All Premises',
    'All Matched Premises',
    'Number of premises with SFBB availability',
    'Number of premises with UFBB availability',
    'Number of premises with Full Fibre availability',
    'Number of premises unable to receive 2Mbit/s',
    'Number of premises unable to receive 5Mbit/s',
    'Number of premises unable to receive 10Mbit/s',
    'Number of premises unable to receive 30Mbit/s',
    'Number of premises below the USO',
    'Number of premises with NGA',
    'Number of premises able to receive decent broadband from FWA',
    'Number of premises with 30<300Mbit/s download speed',
    'Number of premises with >=300Mbit/s download speed',
    'Number of premises with 0<2Mbit/s download speed',
    'Number of premises with 2<5Mbit/s download speed',
    'Number of premises with 5<10Mbit/s download speed',
    'Number of premises with 10<30Mbit/s download speed',
    'Number of premises with Gigabit availability'
]

# Convert data types of the specified columns
for column in columns_to_convert:
    fixed_coverage_2019_df[column] = fixed_coverage_2019_df[column].astype('int64')


In [172]:
fixed_coverage_2019_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 374 entries, 0 to 373
Data columns (total 40 columns):
 #   Column                                                        Non-Null Count  Dtype  
---  ------                                                        --------------  -----  
 0   laua                                                          374 non-null    object 
 1   laua_name                                                     374 non-null    object 
 2   All Premises                                                  374 non-null    int64  
 3   All Matched Premises                                          374 non-null    int64  
 4   SFBB availability (% premises)                                374 non-null    float64
 5   UFBB availability (% premises)                                374 non-null    float64
 6   Full Fibre availability (% premises)                          374 non-null    float64
 7   % of premises unable to receive 2Mbit/s                       374 non-n

In [173]:
fixed_coverage_2019_df

Unnamed: 0,laua,laua_name,All Premises,All Matched Premises,SFBB availability (% premises),UFBB availability (% premises),Full Fibre availability (% premises),% of premises unable to receive 2Mbit/s,% of premises unable to receive 5Mbit/s,% of premises unable to receive 10Mbit/s,% of premises unable to receive 30Mbit/s,% of premises below the USO,% of premises with NGA,% of premises able to receive decent broadband from FWA,Number of premises with SFBB availability,Number of premises with UFBB availability,Number of premises with Full Fibre availability,Number of premises unable to receive 2Mbit/s,Number of premises unable to receive 5Mbit/s,Number of premises unable to receive 10Mbit/s,Number of premises unable to receive 30Mbit/s,Number of premises below the USO,Number of premises with NGA,Number of premises able to receive decent broadband from FWA,Number of premises with 30<300Mbit/s download speed,Number of premises with >=300Mbit/s download speed,Number of premises with 0<2Mbit/s download speed,Number of premises with 2<5Mbit/s download speed,Number of premises with 5<10Mbit/s download speed,Number of premises with 10<30Mbit/s download speed,% of premises with 30<300Mbit/s download speed,% of premises with >=300Mbit/s download speed,% of premises with 0<2Mbit/s download speed,% of premises with 2<5Mbit/s download speed,% of premises with 5<10Mbit/s download speed,% of premises with 10<30Mbit/s download speed,Number of premises with Gigabit availability,Gigabit availability (% premises),Number of premises with UFBB (100Mbit/s) availability,UFBB (100Mbit/s) availability (% premises)
0,S12000033,ABERDEEN CITY,125441,125311,93.5,20.1,13.1,0.0,0.2,0.7,6.5,0.2,96.9,0.0,117152,25163,16410,49,219,884,8159,189,121476,0,91989,25163,49,170,665,7275,73.4,20.1,0.0,0.1,0.5,5.8,16410,13.1,19758,15.8
1,S12000034,ABERDEENSHIRE,125085,124305,81.8,2.8,2.7,2.5,5.9,9.9,18.2,3.6,94.2,0.0,101652,3472,3332,3163,7339,12332,22653,4519,117051,0,98180,3472,3163,4176,4993,10321,79.0,2.8,2.5,3.4,4.0,8.3,3332,2.7,3353,2.7
2,E07000223,ADUR,29770,29760,98.7,82.5,0.6,0.0,0.1,0.1,1.3,0.0,99.2,0.0,29383,24543,193,0,16,44,377,12,29514,0,4840,24543,0,16,28,333,16.3,82.5,0.0,0.1,0.1,1.1,193,0.6,4043,13.6
3,E07000026,ALLERDALE,51385,51284,91.7,1.7,1.7,1.2,2.6,3.7,8.3,1.2,98.5,2.3,47003,866,866,619,1323,1873,4281,592,50507,1164,46137,866,619,704,550,2408,90.0,1.7,1.2,1.4,1.1,4.7,866,1.7,850,1.7
4,E07000032,AMBER VALLEY,60674,60596,92.8,25.3,22.1,0.1,0.9,2.1,7.2,0.7,98.3,0.0,56232,15339,13412,89,549,1254,4364,440,59578,0,40893,15339,89,460,705,3110,67.5,25.3,0.1,0.8,1.2,5.1,13412,22.1,11155,18.4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
369,E07000239,WYRE FOREST,48100,48061,96.1,46.8,0.9,0.3,0.6,1.4,3.9,0.5,99.5,2.9,46203,22479,420,140,291,662,1858,264,47829,1412,23724,22479,140,151,371,1196,49.4,46.8,0.3,0.3,0.8,2.5,420,0.9,11542,24.0
370,E06000014,YORK,98735,98548,94.0,70.9,43.7,0.0,0.3,0.6,6.0,0.2,95.8,7.1,92626,69871,43077,36,261,629,5922,191,94419,6956,22755,69871,36,225,368,5293,23.1,70.9,0.0,0.2,0.4,5.4,43077,43.7,17165,17.4
371,E06000060,BUCKINGHAMSHIRE,238482,238350,92.5,37.0,6.7,0.2,1.0,2.0,7.5,0.4,97.9,0.0,220571,88134,16083,590,2304,4667,17779,916,233315,0,132437,88134,590,1714,2363,13112,55.6,37.0,0.2,0.7,1.0,5.5,16083,6.7,52918,22.2
372,E06000061,NORTH NORTHAMPTONSHIRE,161434,160977,97.3,51.3,8.5,0.1,0.5,0.8,2.7,0.5,98.5,3.3,156615,82575,13647,149,781,1365,4362,773,158581,5374,74040,82575,149,632,584,2997,46.0,51.3,0.1,0.4,0.4,1.9,13647,8.5,39037,24.3


<a id='saving-the-cleaned-files'></a>
### Saving the cleaned files 

Now the data frames contain cleaned data, I can proceed with loading them into MongoDB. However, I am aware that it is not necessary saving the cleaned data as separate files in order to load the data into MongoDB. I can directly load the cleaned data from the data frames into MongoDB collections. But I have chosen to save the cleaned data into separate files for various reasons. 

Saving cleaned data as CSV files provides a backup for my work. If something goes wrong during the MongoDB loading process or if I need to reproduce my analysis later, I can easily refer back to the CSV files.

Also, in some cases, reading from CSV files might be faster than reading directly from data framews, especially when working with large datasets. 

In [174]:
import os

folder_name = "2023_J_TMA02_data_cleaned"
folder_path = os.path.join(os.getcwd(), folder_name)

# Create the folder if it does not exist
if not os.path.exists(folder_path):
    os.makedirs(folder_path)
    
# Save each data frame to CSV file in the folder
fixed_coverage_2019_df.to_csv(os.path.join(folder_path, "fixed_coverage_2019.csv"), index=False)
fixed_coverage_2020_df.to_csv(os.path.join(folder_path, "fixed_coverage_2020.csv"), index=False)
fixed_coverage_2021_df.to_csv(os.path.join(folder_path, "fixed_coverage_2021.csv"), index=False)
fixed_coverage_2022_df.to_csv(os.path.join(folder_path, "fixed_coverage_2022.csv"), index=False)
fixed_coverage_2023_df.to_csv(os.path.join(folder_path, "fixed_coverage_2023.csv"), index=False)