# Automated Reconstruction of the Colombian Public Procurement Contracting Chain

## About the project

This project proposes a methodology to automatically reconstruct the chain of inter-administrative contracts, using the open contracting database SECOP. The chain consists of two (or more) independent contracts:

1. Contract issued by a public entity to a municipality/department
2. Contract(s) issued by the municipality/department to a third party contractor

We use two main variables to reconstruct the chain: the description of the contract and the municipality/department named in the contract. We represent the descriptions numerically with tf-idf and use cosine similarity to measure how similar two contracts are. The automated reconstruction of the contracting chain facilitates mechanisms that can increase citizen involvement in governmental processes.

## Data extraction

Our data source is the open database of the Colombian National Government (https://www.datos.gov.co/). We obtain the data through the Socrata API and extract two datasets:

1. SECOP: Electronic Public Procurement Database. All public contracts issued since 2012 by public entities, municipalities, departments, among others. (~8 million rows)
2. Official municipalities and departments names in Colombia.
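
As a rough illustration of what a Socrata query for one entity could look like, the sketch below only builds the request URL. It is not the code in data_extraction_functions: the dataset identifier `xxxx-xxxx` is a placeholder, the helper name `build_soda_url` is hypothetical, and the `/resource/<id>.json` path with `$where`, `$limit` and `$offset` parameters is assumed from the usual Socrata (SODA) conventions.

```python
from urllib.parse import urlencode

# Assumption: datasets on the portal are served from the usual SODA resource path.
BASE_URL = "https://www.datos.gov.co/resource"


def build_soda_url(dataset_id, entity_name, limit=1000, offset=0):
    """Build a query URL for the contracts issued by one entity (hypothetical helper)."""
    params = {
        # nombre_de_la_entidad is the issuing-entity column shown in the samples below.
        "$where": f"nombre_de_la_entidad = '{entity_name}'",
        # $limit and $offset page through a table with millions of rows.
        "$limit": limit,
        "$offset": offset,
    }
    return f"{BASE_URL}/{dataset_id}.json?{urlencode(params)}"


# "xxxx-xxxx" is a placeholder, not the real SECOP dataset identifier.
url = build_soda_url("xxxx-xxxx", "INSTITUTO NACIONAL DE VÍAS (INVIAS)")
print(url)
```

Paging with a limit and an offset keeps each request small, which matters for a table of roughly 8 million rows.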

From now on we test the code with one public entity, "INSTITUTO NACIONAL DE VÍAS (INVIAS)".

In [1]:
import pandas as pd

# Files with functions
import data_extraction_functions as extract
import data_cleaning_functions as clean
import string_similarity_functions as ss


### Sample of source datasets

Public entity (INVIAS) dataset:

In [2]:
df_entity_raw = extract.extract_entity_contracts('INSTITUTO NACIONAL DE VÍAS (INVIAS)')
df_entity_raw.head(2)

Unnamed: 0,uid,anno_cargue_secop,anno_firma_del_contrato,nivel_entidad,orden_entidad,nombre_de_la_entidad,nit_de_la_entidad,c_digo_de_la_entidad,id_tipo_de_proceso,tipo_de_proceso,...,espostconflicto,marcacion_adiciones,posicion_rubro,nombre_rubro,valor_rubro,sexo_replegal_entidad,pilar_acuerdo_paz,punto_acuerdo_paz,municipio_entidad,departamento_entidad
0,13-12-1708661-1600394,2013,2012.0,NACIONAL,NACIONAL CENTRALIZADO,INSTITUTO NACIONAL DE VÍAS (INVIAS),800215807,124002002,12,Contratación Directa (Ley 1150 de 2007),...,No Definido,No,No Definido,No Definido,0,ND,No Definido,No Definido,Bogotá D.C.,Bogotá D.C.
1,10-12-402771-0,2010,,NACIONAL,NACIONAL CENTRALIZADO,INSTITUTO NACIONAL DE VÍAS (INVIAS),800215807,124002002,12,Contratación Directa (Ley 1150 de 2007),...,No Definido,No,No Definido,No Definido,0,ND,No Definido,No Definido,Bogotá D.C.,Bogotá D.C.


Municipalities and departments names

In [3]:
df_names_raw = extract.extract_mun_names()
df_names_raw.head(2)

Unnamed: 0,region,c_digo_dane_del_departamento,departamento,c_digo_dane_del_municipio,municipio
0,Región Eje Cafetero - Antioquia,5,Antioquia,5001,Medellín
1,Región Eje Cafetero - Antioquia,5,Antioquia,5002,Abejorral


### Feature overview
The main features we use for the chain construction in each dataset are:

**Contracting dataset**

- _anno_firma_del_contrato_: Year when the contract was issued 
- _nombre_de_la_entidad_: Name of the entity that issued the contract (contracting entity)
- _nom_raz_social_contratista_: Name of the entity that accepted the contract (contractor)
- _detalle_del_objeto_a_contratar_: Description of the contract

**Municipality/department names dataset**
- _departamento_: Department name
- _municipio_: Municipality name

## Data preprocessing

Next we preprocess the raw data to remove unwanted rows and standardize the names of departments/municipalities. We perform the following steps:

Public entity contracting dataframe:

1.  (SECOP dataframe) Remove contracts issued by the _public entity_ that were not issued to a municipality/department or that contain inconsistent information.
2.  (SECOP dataframe) Obtain the list of all municipalities/departments that have a contract with the entity.
3.  (SECOP dataframe and Names dataframe) Standardize the name of each municipality/department to the SECOP format, using the official names dataframe.

We perform this standardization so that the dataframe of contracts issued by the public entity can be joined reliably with the contracts issued by each municipality/department.

The standardization of the description of the contract is performed in the next step.
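
To make the idea of name standardization concrete, here is a minimal sketch using only the standard library. It is not the implementation in data_cleaning_functions: the helper names `remove_accents`, `normalize_name` and `to_official` are hypothetical, and matching on an accent-free, upper-case key is an assumption about how raw names could be mapped to official spellings.

```python
import unicodedata


def remove_accents(text):
    """Drop diacritics, e.g. 'BOGOTÁ' becomes 'BOGOTA'."""
    decomposed = unicodedata.normalize("NFD", text)
    # After NFD decomposition, accents are separate combining marks (category Mn).
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def normalize_name(text):
    """Build a comparison key: no accents, upper case, single spaces."""
    return " ".join(remove_accents(text).upper().split())


def to_official(name, official_names):
    """Return the official spelling of a name, or the name unchanged if unknown."""
    lookup = {normalize_name(official): official for official in official_names}
    return lookup.get(normalize_name(name), name)


# Official spellings taken from the names sample shown earlier.
official = ["Medellín", "Abejorral"]
print(to_official("  MEDELLIN ", official))
```

Comparing on a normalized key means that differences in accents, case or spacing do not prevent two spellings of the same municipality from being joined.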

In [4]:
# Cleaning unused rows
df_entity_filter = clean.df_cleaning(df_entity_raw)
# Filter entity contracts: only contracts issued to a mun/dept.
# Also gets list of names of mun/dept. with contracts with the entity
df_entity, names_mun_list = clean.df_filter_entity(df_entity_filter)

# Get clean list of departments and municipalities of Colombia
df_names = clean.df_cleaning_names(df_names_raw)

# First standardization: names of mun/dept. with contracts with the entity
names_mun_list = [clean.strip_accents(item) for item in names_mun_list]
names_mun_standard = []
for item in names_mun_list:
    if 'MUNICIPIO' in item:
        names_mun_standard.append(clean.standarize_mun(item))
    else:
        names_mun_standard.append(clean.standarize_depto(item))

# Second standardization: reconcile the accent-stripped names with the official names
names_mun_standard = clean.standardize_accents_mun(df_names, names_mun_standard)

# Third standardization: format standardization to ensure a right joining
names_mun_standard = clean.standardize_format_mun(df_names, names_mun_standard)  

# Assign new column to entity dataframe
df_entity = df_entity.assign(nom_raz_soc_stand=names_mun_standard)
names_mun_standard_list = list(set(names_mun_standard))


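Sample of municipality names before and after standardization. The two samples are not aligned element by element, because names_mun_standard_list is built with a set, which removes duplicates and does not preserve order: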

In [5]:
print('Before standardization: ')
print(names_mun_list[10:12])
print('After standardization: ')
print(names_mun_standard_list[10:12])

Before standardization: 
['MUNICIPIO DE SANTIAGO DE CALI', 'MUNICIPIO DE SAN RAFAEL']
After standardization: 
['MUNICIPIO DE ZAPAYAN', 'CALDAS - ALCALDÍA MUNICIPIO DE VICTORIA']


## Construction of the contracting chain

We run an algorithm that constructs the contracting chain for each contract issued by the public entity. The algorithm searches two dataframes:

- Dataframe 1: Contracts issued by the public entity to the municipality/department 
- Dataframe 2: Contracts issued by the municipality/department to a private contractor

The algorithm will:

1. Select a contract issued by the __public entity__ to the __municipality/department__ (Dataframe 1)
2. Look at all the contracts issued by the __municipality/department__ to a private contractor (Dataframe 2), and find those whose description is most similar to the description of the contract selected in step 1.
3. The contracts with a similarity measure higher than 0.8 will be chained.

The resulting dataframe will have:
- One row for each pair of chained contracts, with all the information available in SECOP for both contracts
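
The chaining rule can be sketched on plain dictionaries as follows. This is a simplified illustration, not the notebook's loop: the keys `year` and `description`, the function `chain_contracts`, and the word-overlap score `token_jaccard` are hypothetical stand-ins (the project itself scores descriptions with tf-idf and cosine similarity). The year filter mirrors the one applied in the code cell further down, which keeps municipal contracts signed in or after the year of the entity contract, or with a missing year.

```python
def chain_contracts(entity_contract, mun_contracts, similarity, threshold=0.8):
    """Return (municipal contract, score) pairs chained to one entity contract."""
    chained = []
    for candidate in mun_contracts:
        year = candidate["year"]
        # Skip municipal contracts signed before the entity contract;
        # contracts with a missing year are kept.
        if year is not None and year < entity_contract["year"]:
            continue
        score = similarity(entity_contract["description"], candidate["description"])
        if score > threshold:
            chained.append((candidate, score))
    # Best matches first.
    return sorted(chained, key=lambda pair: pair[1], reverse=True)


def token_jaccard(a, b):
    """Stand-in similarity: shared distinct words over all distinct words."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


entity = {"year": 2013, "description": "mantenimiento de la via la buitrera"}
municipal = [
    {"year": 2014, "description": "mantenimiento de la via la buitrera"},
    {"year": 2011, "description": "mantenimiento de la via la buitrera"},
    {"year": None, "description": "compra de papeleria"},
]
result = chain_contracts(entity, municipal, token_jaccard)
print(len(result), result[0][1])  # prints "1 1.0"
```

Only the 2014 contract is chained: the 2011 contract is excluded by the year filter, and the stationery purchase scores far below the threshold.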

### Similarity measure
The similarity measure is computed in the following steps:
1. Transform the contract descriptions from both dataframes with __tf-idf__. We tokenize by 3-grams to reduce noise from typing errors.
2. Compute similarity scores with cosine similarity.
3. Convert the (sparse) similarity matrix to a readable dataframe.
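
The following standard-library sketch shows why 3-gram tf-idf is tolerant of typing errors. It is not the code in string_similarity_functions: it assumes character 3-grams and a smoothed idf weighting, and the helper names `char_ngrams`, `tfidf_vectors` and `cosine` are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Split a string into overlapping character n-grams."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(docs, n=3):
    """Return one L2-normalized tf-idf vector (as a dict) per document."""
    counts = [Counter(char_ngrams(doc, n)) for doc in docs]
    # Document frequency: in how many documents each n-gram appears.
    doc_freq = Counter(gram for count in counts for gram in count)
    vectors = []
    for count in counts:
        # Assumption: smoothed idf, so n-grams found in every document keep a positive weight.
        vec = {gram: tf * (math.log((1 + len(docs)) / (1 + doc_freq[gram])) + 1)
               for gram, tf in count.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({gram: w / norm for gram, w in vec.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity of two L2-normalized sparse vectors (a dot product)."""
    return sum(w * v.get(gram, 0.0) for gram, w in u.items())


docs = [
    "MANTENIMIENTO Y CONSERVACION DE LA VIA",
    "MANTENIMIENTO Y CONCERVACION DE LA VIA",  # same text with one typo
    "SUMINISTRO DE PAPELERIA",
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

A single mistyped letter changes only the three 3-grams that contain it, so the first two descriptions stay close, while the unrelated third description shares almost no 3-grams with them.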

In [12]:
# Parameters
chain_cont = 0
threshold = 0.8
chain_df = pd.DataFrame()

# Small sample of municipalities
list_mun = ['HUILA - ALCALDÍA MUNICIPIO DE NEIVA', 
            'SANTANDER - ALCALDÍA MUNICIPIO DE BUCARAMANGA', 
            'VALLE DEL CAUCA - ALCALDÍA MUNICIPIO DE PALMIRA']
n_contracts = 3
entity_contracts = df_entity

for i, item in enumerate(list_mun):
    #if i % 10 == 0:
    #    print("Iteration # " + str(i) + ";     Mun/Dept Name: " + str(item))

    mun_name = list_mun[i]

    # Subsets public entity df to contracts issued for the mun/dept.
    entity_contracts_mun = entity_contracts.loc[entity_contracts['nom_raz_soc_stand'] == item]

    mun_contracts = extract.extract_mun_contracts(mun_name)
    # If there are no contracts for the mun/dept. in SECOP continue
    if mun_contracts.empty:
        continue

    mun_contracts = clean.df_cleaning(mun_contracts)
    # If there are no contracts for the mun/dept. in the states allowed, continue
    # States = 'Liquidado', 'Terminado Sin Liquidar', 'Celebrado', 'Adjudicado', 'Convocado'
    if mun_contracts.empty:
        continue

    # Approximate string matching
    for index, entity_row in entity_contracts_mun.iterrows():

        # Only evaluate mun/dept contracts issued on or after year of the entity contract
        mun_contracts_filter = mun_contracts[
            (mun_contracts.anno_firma_del_contrato >= entity_row['anno_firma_del_contrato']) |
            (mun_contracts['anno_firma_del_contrato'].isnull())]
        mun_contracts_filter = mun_contracts_filter.reset_index(drop=True)

        # Create list with description of each contract to evaluate
        mun_description = mun_contracts_filter['detalle_del_objeto_a_contratar']
        mun_description_list = list(map(str, list(map(clean.standarize_obj, mun_description))))
        mun_description_list = pd.Series(mun_description_list)
        entity_description_list = pd.Series(clean.standarize_obj(entity_row['detalle_del_objeto_a_contratar']))

        # Paste descriptions mun/dept. and entity. Last row = entity contract
        description_list = list(mun_description_list) + list(entity_description_list)

        # String similarity algorithm -----
        # 1. Transforms strings with tf-idf algorithm to a numeric matrix
        tf_idf_matrix = ss.tf_idf(description_list)
        # 2. Gets similarity scores in a sparse matrix
        matches_sparse = ss.awesome_cossim_top(tf_idf_matrix, tf_idf_matrix.transpose(), n_contracts)
        # 3. Converts similarity matrix to a readable df
        matches_df = ss.get_matches_df(matches_sparse, description_list)

        # Chain construction ----
        # Joins the complete info of the contracts with high similarity
        for index_chain, row_chain in matches_df.loc[matches_df['pos_left'] == (len(description_list)-1)].iterrows():
            if row_chain['pos_left'] == row_chain['pos_right']: continue
            score = row_chain['similarity']
            # Joining info
            chain_entity = entity_row.to_frame().T
            if len(mun_contracts_filter) == row_chain['pos_right']: continue
            chain_mun = mun_contracts_filter.iloc[row_chain['pos_right']].to_frame().T
            chain_entity.index = [chain_cont]
            chain_mun.index = [chain_cont]
            chain_mun.columns = [str(col) + '_mun' for col in chain_mun.columns]
            chain_result = pd.concat([chain_entity, chain_mun], axis=1, join='inner', sort=True)
            # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
            chain_df = pd.concat([chain_df, chain_result], sort=False)
            chain_df.at[chain_df.index[chain_cont], 'score'] = score
            if score > threshold:
                chain_df.at[chain_df.index[chain_cont], 'valid'] = True
            else:
                chain_df.at[chain_df.index[chain_cont], 'valid'] = False
            chain_cont = chain_cont + 1

Sample of the descriptions of contracts matched by string similarity:

In [13]:
chain_df_final = chain_df.loc[chain_df["valid"]][["detalle_del_objeto_a_contratar", "detalle_del_objeto_a_contratar_mun"]]
chain_df_final.head()

Unnamed: 0,detalle_del_objeto_a_contratar,detalle_del_objeto_a_contratar_mun
0,MEJORAMIENTO MANTENIMIENTO Y CONSERVACION DE ...,MEJORAMIENTO MANTENIMIENTO Y CONSERVACIÓN DE ...
2,PAVIMENTACION DE LA RUTA DE LA FRESA SECTOR P...,PAVIMENTACIÓN DE LA RUTA DE LA FRESA SECTOR P...
4,MANTENIMIENTO Y MEJORAMIENTO DE LA VIA LA BUIT...,MANTENIMIENTO Y MEJORAMIENTO VIAS LA BUITRERA ...
6,MEJORAMIENTO MANTENIMIENTO Y CONSERVACION DE ...,MEJORAMIENTO MANTENIMIENTO Y CONSERVACIÓN DE ...
8,MEJORAMIENTO MANTENIMIENTO Y CONCERVACION DE ...,MEJORAMIENTO MANTENIMIENTO Y CONSERVACIÓN DE ...


## Conclusions and outlook
In this project we developed an algorithm that automatically reconstructs the contracting chain linking contracts issued by a public entity to a municipality/department with contracts issued by that municipality/department to a third-party contractor. The work has three stages:

1. Data extraction
2. Data preprocessing
3. Construction of the contracting chain

With this algorithm we are automating tasks that were performed manually, and we open the doors to a macroscopic evaluation of the Colombian contracting landscape.