In this notebook I've been examining the columns present in the XML documents. For this I read the contents of 15 directories worth of XML files, so it is possible that there are other documents with different sets of columns. 

I am only examining the three most common types of documents:
 * Contract notices
 * Contract award notices
 * Additional information
 
The other document types are infrequent enough that they can probably be ignored.
 
Here are some statistics about the columns:
 * There are 82 columns which are present in all three document types, of which 18 always have data.
 * In the three most common document types there are 2 columns which are empty more than 50% of the time.
 * There are 165 columns which are present in both contract notices and contract award notices.
 * In contract notices and contract award notices there are 3 columns which are empty more than 50% of the time.
 * There are 398 columns which only appear in Contract award notices but NOT in contract notices, of these 368 are empty 95% or more of the time.
 * There are 457 columns which only appear in Contract notices but NOT in Contract award notices, of these 420 are empty 95% or more of the time.
 * There are 39 columns which only appear in Additional info documents, of which 26 are empty 95% or more of the time.

Note that these numbers include the three columns I manually add to the dataframe, DATE, FILE and VALUE_EUR.

In [7]:
import pandas as pd
import numpy as np

In [32]:
# load the big dataframe
df = pd.read_pickle("data/df.pkl")

# read the columns for the different file types
addtl_info_cols = pd.read_csv("additional_info_cols.txt", header=None)[0].values
contract_notice_cols = pd.read_csv("contract_notice_cols.txt", header=None)[0].values
contract_award_notice_cols = pd.read_csv("contract_award_notice_cols.txt", header=None)[0].values

# get union of columns
all_columns = df.columns

In [33]:
# see what columns are in all of the document types
counter = 0
common_to_all_cols = []
for col in all_columns:
    if col in addtl_info_cols and col in contract_notice_cols and col in contract_award_notice_cols:
        print(counter, col)
        common_to_all_cols.append(col)
        counter += 1

0 AA_AUTHORITY_TYPE
1 AA_AUTHORITY_TYPE__CODE
2 AC_AWARD_CRIT
3 AC_AWARD_CRIT__CODE
4 CATEGORY
5 COMPLEMENTARY_INFO__DATE_DISPATCH_NOTICE
6 CONTRACTING_BODY__ADDRESS_CONTRACTING_BODY_ADDITIONAL__ADDRESS
7 CONTRACTING_BODY__ADDRESS_CONTRACTING_BODY_ADDITIONAL__CONTACT_POINT
8 CONTRACTING_BODY__ADDRESS_CONTRACTING_BODY_ADDITIONAL__COUNTRY__VALUE
9 CONTRACTING_BODY__ADDRESS_CONTRACTING_BODY_ADDITIONAL__E_MAIL
10 CONTRACTING_BODY__ADDRESS_CONTRACTING_BODY_ADDITIONAL__FAX
11 CONTRACTING_BODY__ADDRESS_CONTRACTING_BODY_ADDITIONAL__NATIONALID
12 CONTRACTING_BODY__ADDRESS_CONTRACTING_BODY_ADDITIONAL__OFFICIALNAME
13 CONTRACTING_BODY__ADDRESS_CONTRACTING_BODY_ADDITIONAL__PHONE
14 CONTRACTING_BODY__ADDRESS_CONTRACTING_BODY_ADDITIONAL__POSTAL_CODE
15 CONTRACTING_BODY__ADDRESS_CONTRACTING_BODY_ADDITIONAL__TOWN
16 CONTRACTING_BODY__ADDRESS_CONTRACTING_BODY_ADDITIONAL__URL_BUYER
17 CONTRACTING_BODY__ADDRESS_CONTRACTING_BODY_ADDITIONAL__URL_GENERAL
18 CONTRACTING_BODY__ADDRESS_CONTRACTING_BODY_ADDITIO

In [34]:
# see what columns are in both the contract notice and contract award notice
counter = 0
common_to_contract_cols = []
for col in all_columns:
    if col in contract_notice_cols and col in contract_award_notice_cols:
        print(counter, col)
        common_to_contract_cols.append(col)
        counter += 1

0 AA_AUTHORITY_TYPE
1 AA_AUTHORITY_TYPE__CODE
2 AC_AWARD_CRIT
3 AC_AWARD_CRIT__CODE
4 CATEGORY
5 COMPLEMENTARY_INFO__ADDRESS_MEDIATION_BODY__ADDRESS
6 COMPLEMENTARY_INFO__ADDRESS_MEDIATION_BODY__COUNTRY__VALUE
7 COMPLEMENTARY_INFO__ADDRESS_MEDIATION_BODY__E_MAIL
8 COMPLEMENTARY_INFO__ADDRESS_MEDIATION_BODY__FAX
9 COMPLEMENTARY_INFO__ADDRESS_MEDIATION_BODY__OFFICIALNAME
10 COMPLEMENTARY_INFO__ADDRESS_MEDIATION_BODY__PHONE
11 COMPLEMENTARY_INFO__ADDRESS_MEDIATION_BODY__POSTAL_CODE
12 COMPLEMENTARY_INFO__ADDRESS_MEDIATION_BODY__TOWN
13 COMPLEMENTARY_INFO__ADDRESS_MEDIATION_BODY__URL
14 COMPLEMENTARY_INFO__ADDRESS_REVIEW_BODY__ADDRESS
15 COMPLEMENTARY_INFO__ADDRESS_REVIEW_BODY__COUNTRY__VALUE
16 COMPLEMENTARY_INFO__ADDRESS_REVIEW_BODY__E_MAIL
17 COMPLEMENTARY_INFO__ADDRESS_REVIEW_BODY__FAX
18 COMPLEMENTARY_INFO__ADDRESS_REVIEW_BODY__OFFICIALNAME
19 COMPLEMENTARY_INFO__ADDRESS_REVIEW_BODY__PHONE
20 COMPLEMENTARY_INFO__ADDRESS_REVIEW_BODY__POSTAL_CODE
21 COMPLEMENTARY_INFO__ADDRESS_REVIEW_BO

## Columns which always have values

In the three most common document types, the following columns seem to ALWAYS have data in them:

In [35]:
for i, col in enumerate(empty_cols_2[empty_cols_2 == 0].index):
    print(i, col)

0 AA_AUTHORITY_TYPE
1 AC_AWARD_CRIT
2 CATEGORY
3 DATE
4 DS_DATE_DISPATCH
5 FILE
6 HEADING
7 LG
8 LG_ORIG
9 NC_CONTRACT_NATURE
10 NO_DOC_OJS
11 ORIGINAL_CPV
12 ORIGINAL_CPV_CODE
13 ORIGINAL_CPV_TEXT
14 PR_PROC
15 REF_NO
16 RP_REGULATION
17 TD_DOCUMENT_TYPE
18 TY_TYPE_BID


## Dataframe with columns common to contract notices and contract award notices

In [36]:
df[common_to_contract_cols].head(10)

Unnamed: 0,AA_AUTHORITY_TYPE,AA_AUTHORITY_TYPE__CODE,AC_AWARD_CRIT,AC_AWARD_CRIT__CODE,CATEGORY,COMPLEMENTARY_INFO__ADDRESS_MEDIATION_BODY__ADDRESS,COMPLEMENTARY_INFO__ADDRESS_MEDIATION_BODY__COUNTRY__VALUE,COMPLEMENTARY_INFO__ADDRESS_MEDIATION_BODY__E_MAIL,COMPLEMENTARY_INFO__ADDRESS_MEDIATION_BODY__FAX,COMPLEMENTARY_INFO__ADDRESS_MEDIATION_BODY__OFFICIALNAME,...,VALUES_LIST__VALUES__SINGLE_VALUE__VALUE__CURRENCY,VALUES_LIST__VALUES__TYPE,VALUES__VALUE,VALUES__VALUE__CURRENCY,VALUES__VALUE__TYPE,VERSION,n2016:CA_CE_NUTS,n2016:CA_CE_NUTS__CODE,n2016:PERFORMANCE_NUTS,n2016:PERFORMANCE_NUTS__CODE
0,European Institution/Agency or International O...,5,The most economic tender,2,TRANSLATION,,,,,,...,,,1.0,EUR,ESTIMATED_TOTAL,,Arr. de Bruxelles-Capitale / Arr. van Brussel-...,BE100,Région de Bruxelles-Capitale / Brussels Hoofds...,BE10
1,European Institution/Agency or International O...,5,The most economic tender,2,TRANSLATION,1 avenue du Président Robert Schuman,FR,euro-ombudsman@europarl.europa.eu,+33 388179062,Médiateur européen,...,,,2400000.0,EUR,ESTIMATED_TOTAL,,Luxembourg,LU000,Luxembourg,LU000
2,European Institution/Agency or International O...,5,The most economic tender,2,TRANSLATION,,,,,,...,,,130000.0,EUR,ESTIMATED_TOTAL,,Área Metropolitana de Lisboa,PT170,PORTUGAL,PT
3,European Institution/Agency or International O...,5,The most economic tender,2,TRANSLATION,,,,,,...,,,,,,,"Frankfurt am Main, Kreisfreie Stadt",DE712,"Frankfurt am Main, Kreisfreie Stadt",DE712
4,European Institution/Agency or International O...,5,The most economic tender,2,TRANSLATION,,,,,,...,,,1500000.0,EUR,PROCUREMENT_TOTAL,,Parma,ITH52,Parma,ITH52
5,European Institution/Agency or International O...,5,Not specified,Z,ORIGINAL,,,,,,...,,,,,,R2.0.8.S03.E01,,,,
6,European Institution/Agency or International O...,5,Not specified,Z,ORIGINAL,,,,,,...,,,,,,R2.0.8.S03.E01,,,,
7,European Institution/Agency or International O...,5,Not specified,Z,TRANSLATION,,,,,,...,,,,,,R2.0.8.S03.E01,,,,
8,European Institution/Agency or International O...,5,Not specified,Z,TRANSLATION,,,,,,...,,,,,,R2.0.8.S03.E01,,,,
9,European Institution/Agency or International O...,5,Not specified,Z,TRANSLATION,,,,,,...,,,,,,R2.0.8.S03.E01,,,,


In [52]:
# find which columns are empty
contract_df = df[common_to_contract_cols]
empty_cols_1 = (contract_df.isnull().sum() / len(contract_df))

# columns which are NEVER empty
print("Columns which are NEVER empty:", len(empty_cols_1[empty_cols_1 == 0]))
empty_cols_1[empty_cols_1 == 0]

Columns which are NEVER empty: 28


AA_AUTHORITY_TYPE           0.0
AA_AUTHORITY_TYPE__CODE     0.0
AC_AWARD_CRIT               0.0
AC_AWARD_CRIT__CODE         0.0
CATEGORY                    0.0
DATE                        0.0
DS_DATE_DISPATCH            0.0
FILE                        0.0
HEADING                     0.0
ISO_COUNTRY__VALUE          0.0
LG                          0.0
LG_ORIG                     0.0
NC_CONTRACT_NATURE          0.0
NC_CONTRACT_NATURE__CODE    0.0
NO_DOC_OJS                  0.0
ORIGINAL_CPV                0.0
ORIGINAL_CPV_CODE           0.0
ORIGINAL_CPV_TEXT           0.0
ORIGINAL_CPV__CODE          0.0
PR_PROC                     0.0
PR_PROC__CODE               0.0
REF_NO                      0.0
RP_REGULATION               0.0
RP_REGULATION__CODE         0.0
TD_DOCUMENT_TYPE            0.0
TD_DOCUMENT_TYPE__CODE      0.0
TY_TYPE_BID                 0.0
TY_TYPE_BID__CODE           0.0
dtype: float64

In [38]:
# find which columns are often empty
print("Columns which are usually empty in contract notices and award notices:")
usually_empty_cols = empty_cols_1[empty_cols_1 > 0.5].sort_values(ascending=False)

usually_empty_cols

Columns which are usually empty in contract notices and award notices:


FD_OTH_NOT__CONTENTS__GR_SEQ__BLK_BTX_SEQ__MARK_LIST__MLI_OCCUR__TXT_MARK__P__ADDRESS_NOT_STRUCT__E_MAIL          0.999551
CONTRACTING_BODY__PROCUREMENT_LAW                                                                                 0.999252
FD_OTH_NOT__CONTENTS__P                                                                                           0.998953
PROCEDURE__MAIN_FEATURES_AWARD                                                                                    0.998505
FD_OTH_NOT__CONTENTS__GR_SEQ__BLK_BTX_SEQ__MARK_LIST__MLI_OCCUR__TXT_MARK__P__ADDRESS_NOT_STRUCT__TOWN            0.998056
FD_OTH_NOT__CONTENTS__GR_SEQ__BLK_BTX_SEQ__MARK_LIST__MLI_OCCUR__TXT_MARK__P__ADDRESS_NOT_STRUCT__ORGANISATION    0.998056
FD_OTH_NOT__CONTENTS__GR_SEQ__BLK_BTX_SEQ__MARK_LIST__MLI_OCCUR__TXT_MARK__P__ADDRESS_NOT_STRUCT__BLK_BTX         0.998056
FD_OTH_NOT__CONTENTS__GR_SEQ__BLK_BTX_SEQ__MARK_LIST__MLI_OCCUR__TXT_MARK                                         0.998056
FD_OTH_NOT__CONT

## Dataframe with only columns common to 3 most common document types

In [39]:
df[common_to_all_cols].head(10)

Unnamed: 0,AA_AUTHORITY_TYPE,AA_AUTHORITY_TYPE__CODE,AC_AWARD_CRIT,AC_AWARD_CRIT__CODE,CATEGORY,COMPLEMENTARY_INFO__DATE_DISPATCH_NOTICE,CONTRACTING_BODY__ADDRESS_CONTRACTING_BODY_ADDITIONAL__ADDRESS,CONTRACTING_BODY__ADDRESS_CONTRACTING_BODY_ADDITIONAL__CONTACT_POINT,CONTRACTING_BODY__ADDRESS_CONTRACTING_BODY_ADDITIONAL__COUNTRY__VALUE,CONTRACTING_BODY__ADDRESS_CONTRACTING_BODY_ADDITIONAL__E_MAIL,...,RP_REGULATION__CODE,TD_DOCUMENT_TYPE,TD_DOCUMENT_TYPE__CODE,TY_TYPE_BID,TY_TYPE_BID__CODE,VERSION,n2016:CA_CE_NUTS,n2016:CA_CE_NUTS__CODE,n2016:PERFORMANCE_NUTS,n2016:PERFORMANCE_NUTS__CODE
0,European Institution/Agency or International O...,5,The most economic tender,2,TRANSLATION,2017-12-22,"[IMI2 JU — TO 56 1049 Brussels, Belgium, IMI2 ...","[Procurement Team, Procurement Team]","[BE, BE]","[procurement@imi.europa.eu, procurement@imi.eu...",...,3,Contract notice,3,Submission for all lots,1,,Arr. de Bruxelles-Capitale / Arr. van Brussel-...,BE100,Région de Bruxelles-Capitale / Brussels Hoofds...,BE10
1,European Institution/Agency or International O...,5,The most economic tender,2,TRANSLATION,2017-12-22,,,,,...,3,Contract notice,3,Submission for one or more lots,3,,Luxembourg,LU000,Luxembourg,LU000
2,European Institution/Agency or International O...,5,The most economic tender,2,TRANSLATION,2017-12-22,,,,,...,3,Contract notice,3,Submission for all lots,1,,Área Metropolitana de Lisboa,PT170,PORTUGAL,PT
3,European Institution/Agency or International O...,5,The most economic tender,2,TRANSLATION,2017-12-22,,,,,...,3,Contract award notice,7,Not applicable,9,,"Frankfurt am Main, Kreisfreie Stadt",DE712,"Frankfurt am Main, Kreisfreie Stadt",DE712
4,European Institution/Agency or International O...,5,The most economic tender,2,TRANSLATION,2017-12-22,,,,,...,3,Contract award notice,7,Not applicable,9,,Parma,ITH52,Parma,ITH52
5,European Institution/Agency or International O...,5,Not specified,Z,ORIGINAL,,,,,,...,1,Contract award notice,7,Not applicable,9,R2.0.8.S03.E01,,,,
6,European Institution/Agency or International O...,5,Not specified,Z,ORIGINAL,,,,,,...,1,Additional information,2,Not specified,Z,R2.0.8.S03.E01,,,,
7,European Institution/Agency or International O...,5,Not specified,Z,TRANSLATION,,,,,,...,5,Prior information notice without call for comp...,0,Not applicable,9,R2.0.8.S03.E01,,,,
8,European Institution/Agency or International O...,5,Not specified,Z,TRANSLATION,,,,,,...,3,Contract notice,3,Submission for one or more lots,3,R2.0.8.S03.E01,,,,
9,European Institution/Agency or International O...,5,Not specified,Z,TRANSLATION,,,,,,...,3,Contract award notice,7,Not applicable,9,R2.0.8.S03.E01,,,,


In [40]:
# find which columns are empty
common_df = df[common_to_all_cols]
empty_cols_2 = (common_df.isnull().sum() / len(common_df))

# columns which are NEVER empty
print("Columns which are NEVER empty")
empty_cols_2[empty_cols_2 == 0]

Columns which are NEVER empty


AA_AUTHORITY_TYPE           0.0
AA_AUTHORITY_TYPE__CODE     0.0
AC_AWARD_CRIT               0.0
AC_AWARD_CRIT__CODE         0.0
CATEGORY                    0.0
DATE                        0.0
DS_DATE_DISPATCH            0.0
FILE                        0.0
HEADING                     0.0
ISO_COUNTRY__VALUE          0.0
LG                          0.0
LG_ORIG                     0.0
NC_CONTRACT_NATURE          0.0
NC_CONTRACT_NATURE__CODE    0.0
NO_DOC_OJS                  0.0
ORIGINAL_CPV                0.0
ORIGINAL_CPV_CODE           0.0
ORIGINAL_CPV_TEXT           0.0
ORIGINAL_CPV__CODE          0.0
PR_PROC                     0.0
PR_PROC__CODE               0.0
REF_NO                      0.0
RP_REGULATION               0.0
RP_REGULATION__CODE         0.0
TD_DOCUMENT_TYPE            0.0
TD_DOCUMENT_TYPE__CODE      0.0
TY_TYPE_BID                 0.0
TY_TYPE_BID__CODE           0.0
dtype: float64

In [41]:
print("Columns which are usually empty in 3 most common document types:")
# which columns are almost always empty
empty_cols_2[empty_cols_2 > 0.5].sort_values(ascending=False)

Columns which are usually empty in 3 most common document types:


FD_OTH_NOT__CONTENTS__P                                                    0.998953
CONTRACTING_BODY__ADDRESS_CONTRACTING_BODY_ADDITIONAL__NATIONALID          0.997159
CONTRACTING_BODY__ADDRESS_CONTRACTING_BODY_ADDITIONAL__FAX                 0.994019
FD_OTH_NOT__CONTENTS__P__ADDRESS_NOT_STRUCT__ORGANISATION                  0.993272
FD_OTH_NOT__CONTENTS__P__ADDRESS_NOT_STRUCT__BLK_BTX                       0.993272
FD_OTH_NOT__CONTENTS__P__ADDRESS_NOT_STRUCT__TOWN                          0.993272
FD_OTH_NOT__CONTENTS__GR_SEQ__TI_GRSEQ__BLK_BTX                            0.992673
CONTRACTING_BODY__ADDRESS_CONTRACTING_BODY_ADDITIONAL__CONTACT_POINT       0.992524
OBJECT_CONTRACT__CPV_MAIN__CPV_SUPPLEMENTARY_CODE__CODE                    0.990431
CONTRACTING_BODY__ADDRESS_CONTRACTING_BODY_ADDITIONAL__PHONE               0.988487
CONTRACTING_BODY__ADDRESS_CONTRACTING_BODY_ADDITIONAL__URL_BUYER           0.985496
FD_OTH_NOT__STI_DOC                                                        0

### Which columns exist in Contract Award notices but not in contract notices:

In [42]:
only_award_cols = [ col for col in contract_award_notice_cols if col not in contract_notice_cols ]
for i, col in enumerate(only_award_cols):
    print(i, col)

0 AWARD_CONTRACT__AWARDED_CONTRACT__CONTRACTORS__CONTRACTOR__ADDRESS_CONTRACTOR__ADDRESS
1 AWARD_CONTRACT__AWARDED_CONTRACT__CONTRACTORS__CONTRACTOR__ADDRESS_CONTRACTOR__COUNTRY__VALUE
2 AWARD_CONTRACT__AWARDED_CONTRACT__CONTRACTORS__CONTRACTOR__ADDRESS_CONTRACTOR__E_MAIL
3 AWARD_CONTRACT__AWARDED_CONTRACT__CONTRACTORS__CONTRACTOR__ADDRESS_CONTRACTOR__FAX
4 AWARD_CONTRACT__AWARDED_CONTRACT__CONTRACTORS__CONTRACTOR__ADDRESS_CONTRACTOR__NATIONALID
5 AWARD_CONTRACT__AWARDED_CONTRACT__CONTRACTORS__CONTRACTOR__ADDRESS_CONTRACTOR__OFFICIALNAME
6 AWARD_CONTRACT__AWARDED_CONTRACT__CONTRACTORS__CONTRACTOR__ADDRESS_CONTRACTOR__PHONE
7 AWARD_CONTRACT__AWARDED_CONTRACT__CONTRACTORS__CONTRACTOR__ADDRESS_CONTRACTOR__POSTAL_CODE
8 AWARD_CONTRACT__AWARDED_CONTRACT__CONTRACTORS__CONTRACTOR__ADDRESS_CONTRACTOR__TOWN
9 AWARD_CONTRACT__AWARDED_CONTRACT__CONTRACTORS__CONTRACTOR__ADDRESS_CONTRACTOR__URL
10 AWARD_CONTRACT__AWARDED_CONTRACT__CONTRACTORS__CONTRACTOR__ADDRESS_CONTRACTOR__n2016:NUTS__CODE
11 AWA

In [43]:
common_only_award = df[only_award_cols]
empty_cols_4 = (common_only_award.isnull().sum() / len(common_only_award))

pd.DataFrame(empty_cols_4[empty_cols_4 > 0.95].sort_values(ascending=False))

Unnamed: 0,0
FD_CONTRACT_AWARD_UTILITIES__OBJECT_CONTRACT_AWARD_UTILITIES__COSTS_RANGE_AND_CURRENCY_WITH_VAT_RATE__VALUE_COST,0.999850
FD_CONTRACT_AWARD__PROCEDURE_DEFINITION_CONTRACT_AWARD_NOTICE__TYPE_OF_PROCEDURE_DEF__F03_PT_NEGOTIATED_WITHOUT_COMPETITION__ANNEX_D__NO_OPEN_RESTRICTED__VALUE,0.999850
FD_CONTRACT_AWARD__AWARD_OF_CONTRACT__MORE_INFORMATION_TO_SUB_CONTRACTED__CONTRACT_LIKELY_SUB_CONTRACTED__EXCLUDING_VAT_PRCT__FMTVAL,0.999850
AWARD_CONTRACT__AWARDED_CONTRACT__VAL_BARGAIN_PURCHASE,0.999850
AWARD_CONTRACT__AWARDED_CONTRACT__VAL_BARGAIN_PURCHASE__CURRENCY,0.999850
FD_OTH_NOT__CONTENTS__GR_SEQ__BLK_BTX_SEQ__MARK_LIST__MLI_OCCUR__TXT_MARK__P__ADDRESS_NOT_STRUCT__POSTAL_CODE,0.999850
FD_CONTRACT_AWARD_DEFENCE__AWARD_OF_CONTRACT_DEFENCE__CONTRACT_VALUE_INFORMATION__COSTS_RANGE_AND_CURRENCY_WITH_VAT_RATE__INCLUDING_VAT__VAT_PRCT__FMTVAL,0.999850
FD_CONTRACT_AWARD__AWARD_OF_CONTRACT__MORE_INFORMATION_TO_SUB_CONTRACTED__CONTRACT_LIKELY_SUB_CONTRACTED__EXCLUDING_VAT_VALUE,0.999850
FD_CONTRACT_AWARD__AWARD_OF_CONTRACT__MORE_INFORMATION_TO_SUB_CONTRACTED__CONTRACT_LIKELY_SUB_CONTRACTED__EXCLUDING_VAT_VALUE__CURRENCY,0.999850
FD_CONTRACT_AWARD__AWARD_OF_CONTRACT__MORE_INFORMATION_TO_SUB_CONTRACTED__CONTRACT_LIKELY_SUB_CONTRACTED__EXCLUDING_VAT_VALUE__FMTVAL,0.999850


### Which columns only exist in Contract notices but not in Contract award notices

In [44]:
only_notice_cols = [ col for col in contract_notice_cols if col not in contract_award_notice_cols ]
for i, col in enumerate(only_notice_cols):
    print(i, col)

0 COMPLEMENTARY_INFO__ESTIMATED_TIMING
1 COMPLEMENTARY_INFO__ESTIMATED_TIMING__P
2 CONTRACTING_BODY__ADDRESS_FURTHER_INFO__ADDRESS
3 CONTRACTING_BODY__ADDRESS_FURTHER_INFO__CONTACT_POINT
4 CONTRACTING_BODY__ADDRESS_FURTHER_INFO__COUNTRY__VALUE
5 CONTRACTING_BODY__ADDRESS_FURTHER_INFO__E_MAIL
6 CONTRACTING_BODY__ADDRESS_FURTHER_INFO__FAX
7 CONTRACTING_BODY__ADDRESS_FURTHER_INFO__NATIONALID
8 CONTRACTING_BODY__ADDRESS_FURTHER_INFO__OFFICIALNAME
9 CONTRACTING_BODY__ADDRESS_FURTHER_INFO__PHONE
10 CONTRACTING_BODY__ADDRESS_FURTHER_INFO__POSTAL_CODE
11 CONTRACTING_BODY__ADDRESS_FURTHER_INFO__TOWN
12 CONTRACTING_BODY__ADDRESS_FURTHER_INFO__URL_BUYER
13 CONTRACTING_BODY__ADDRESS_FURTHER_INFO__URL_GENERAL
14 CONTRACTING_BODY__ADDRESS_FURTHER_INFO__n2016:NUTS__CODE
15 CONTRACTING_BODY__ADDRESS_PARTICIPATION__ADDRESS
16 CONTRACTING_BODY__ADDRESS_PARTICIPATION__CONTACT_POINT
17 CONTRACTING_BODY__ADDRESS_PARTICIPATION__COUNTRY__VALUE
18 CONTRACTING_BODY__ADDRESS_PARTICIPATION__E_MAIL
19 CONTRACTING

In [45]:
common_only_notice = df[only_notice_cols]
empty_cols_3 = (common_only_notice.isnull().sum() / len(common_only_notice))

pd.DataFrame(empty_cols_3[empty_cols_3 > 0.95].sort_values(ascending=False))

Unnamed: 0,0
FD_CONTRACT_DEFENCE__CONTRACTING_AUTHORITY_INFORMATION_DEFENCE__TYPE_AND_ACTIVITIES_OR_CONTRACTING_ENTITY_AND_PURCHASING_ON_BEHALF__PURCHASING_ON_BEHALF__PURCHASING_ON_BEHALF_YES__CONTACT_DATA_OTHER_BEHALF_CONTRACTING_AUTORITHY__COUNTRY__VALUE,0.999850
FD_CONTRACT__OBJECT_CONTRACT_INFORMATION__DESCRIPTION_CONTRACT_INFORMATION__F02_DIVISION_INTO_LOTS__F02_DIV_INTO_LOT_YES__F02_ANNEX_B__NATURE_QUANTITY_SCOPE__COSTS_RANGE_AND_CURRENCY__CURRENCY,0.999850
FD_CONTRACT__OBJECT_CONTRACT_INFORMATION__DESCRIPTION_CONTRACT_INFORMATION__F02_FRAMEWORK__TOTAL_ESTIMATED__COSTS_RANGE_AND_CURRENCY__RANGE_VALUE_COST__HIGH_VALUE__FMTVAL,0.999850
FD_CONTRACT__OBJECT_CONTRACT_INFORMATION__DESCRIPTION_CONTRACT_INFORMATION__F02_FRAMEWORK__TOTAL_ESTIMATED__COSTS_RANGE_AND_CURRENCY__RANGE_VALUE_COST__HIGH_VALUE,0.999850
FD_CONTRACT__OBJECT_CONTRACT_INFORMATION__DESCRIPTION_CONTRACT_INFORMATION__F02_FRAMEWORK__TOTAL_ESTIMATED__COSTS_RANGE_AND_CURRENCY__CURRENCY,0.999850
FD_CONTRACT__OBJECT_CONTRACT_INFORMATION__DESCRIPTION_CONTRACT_INFORMATION__F02_FRAMEWORK__JUSTIFICATION,0.999850
FD_CONTRACT_DEFENCE__CONTRACTING_AUTHORITY_INFORMATION_DEFENCE__NAME_ADDRESSES_CONTACT_CONTRACT__TENDERS_REQUESTS_APPLICATIONS_MUST_BE_SENT_TO__CONTACT_DATA__ORGANISATION__NATIONALID,0.999850
FD_CONTRACT__OBJECT_CONTRACT_INFORMATION__DESCRIPTION_CONTRACT_INFORMATION__F02_DIVISION_INTO_LOTS__F02_DIV_INTO_LOT_YES__F02_ANNEX_B__NATURE_QUANTITY_SCOPE__COSTS_RANGE_AND_CURRENCY__VALUE_COST__FMTVAL,0.999850
FD_CONTRACT__OBJECT_CONTRACT_INFORMATION__DESCRIPTION_CONTRACT_INFORMATION__F02_DIVISION_INTO_LOTS__F02_DIV_INTO_LOT_YES__F02_ANNEX_B__NATURE_QUANTITY_SCOPE__COSTS_RANGE_AND_CURRENCY__VALUE_COST,0.999850
FD_CONTRACT__OBJECT_CONTRACT_INFORMATION__DESCRIPTION_CONTRACT_INFORMATION__F02_DIVISION_INTO_LOTS__F02_DIV_INTO_LOT_YES__F02_ANNEX_B__ADDITIONAL_INFORMATION_ABOUT_LOTS,0.999850


### Which columns are unique to Additional Info

In [49]:
only_info_cols = [ col for col in addtl_info_cols if col not in contract_award_notice_cols and col not in contract_notice_cols ]
for i, col in enumerate(only_info_cols):
    print(i, col)

0 CHANGES__CHANGE__NEW_VALUE__DATE
1 CHANGES__CHANGE__NEW_VALUE__TEXT
2 CHANGES__CHANGE__NEW_VALUE__TEXT__P
3 CHANGES__CHANGE__NEW_VALUE__TIME
4 CHANGES__CHANGE__OLD_VALUE__DATE
5 CHANGES__CHANGE__OLD_VALUE__TEXT
6 CHANGES__CHANGE__OLD_VALUE__TEXT__P
7 CHANGES__CHANGE__OLD_VALUE__TIME
8 CHANGES__CHANGE__PUBLICATION
9 CHANGES__CHANGE__WHERE__LABEL
10 CHANGES__CHANGE__WHERE__LOT_NO
11 CHANGES__CHANGE__WHERE__SECTION
12 CHANGES__INFO_ADD
13 COMPLEMENTARY_INFO__NOTICE_NUMBER_OJ
14 FD_OTH_NOT__CONTENTS__CORREC__DEL__MIXED
15 FD_OTH_NOT__CONTENTS__CORREC__FOR_READ__FOR__INT_FOR
16 FD_OTH_NOT__CONTENTS__CORREC__FOR_READ__FOR__OLD
17 FD_OTH_NOT__CONTENTS__CORREC__FOR_READ__FOR__OLD__QUOTE
18 FD_OTH_NOT__CONTENTS__CORREC__FOR_READ__READ__INT_READ
19 FD_OTH_NOT__CONTENTS__CORREC__FOR_READ__READ__NEW
20 FD_OTH_NOT__CONTENTS__CORREC__FOR_READ__READ__NEW__P
21 FD_OTH_NOT__CONTENTS__CORREC__FOR_READ__READ__NEW__QUOTE
22 FD_OTH_NOT__CONTENTS__P__ADDRESS_NOT_STRUCT__ADDRESS
23 FD_OTH_NOT__CONTENTS__P_

In [53]:
common_only_info = df[only_info_cols]
empty_cols_4 = (common_only_info.isnull().sum() / len(common_only_info))

print("Number of columns in additional info docs that are empty 95% or more of the time:", len(empty_cols_4[empty_cols_4 > 0.95]))
pd.DataFrame(empty_cols_4[empty_cols_4 > 0.95].sort_values(ascending=False))

Number of columns in additional info docs that are empty 95% or more of the time: 26


Unnamed: 0,0
FD_OTH_NOT__CONTENTS__P__ADDRESS_NOT_STRUCT__COUNTRY__VALUE,0.99985
FD_OTH_NOT__CONTENTS__CORREC__FOR_READ__READ__NEW__P,0.99985
CHANGES__CHANGE__NEW_VALUE__TEXT__P,0.99985
CHANGES__CHANGE__OLD_VALUE__TEXT__P,0.99985
FD_OTH_NOT__OBJ_NOT__CPV__CPV_MAIN__CPV_SUPPLEMENTARY_CODE__CODE,0.99985
FD_OTH_NOT__CONTENTS__P__ADDRESS_NOT_STRUCT__POSTAL_CODE,0.999402
FD_OTH_NOT__CONTENTS__P__ADDRESS_NOT_STRUCT__ADDRESS,0.998953
CHANGES__CHANGE__WHERE__LOT_NO,0.995963
FD_OTH_NOT__CONTENTS__CORREC__DEL__MIXED,0.991926
CHANGES__INFO_ADD,0.983403


26
