Course: Cloud Computing.

Module: 3 Service Models.

Student name: Luis Felipe Castañeda Gallego.

## 1. Introduction to the Fiscal Year 2023 Data Loading Notebook

This Jupyter notebook loads and performs the initial processing of the fiscal year 2023 contract data. Because the dataset is large and complex, the notebook is structured around efficient loading and a preliminary inspection of the data. Its output is the basis for the more detailed analysis in the subsequent notebook '3_create_labels_for_fy_2023_companies.ipynb'.

#### Notebook Objectives
- **Data Loading**: Load the extensive CSV files containing detailed contract information for the fiscal year 2023. This process includes reading the data into a pandas DataFrame, which serves as the primary data structure for further operations.
- **Initial Data Cleaning**: Perform basic cleaning steps that prepare the data for analysis. This includes removing unnecessary columns that do not contribute to the analytical objectives.
- **Data Inspection**: Provide an initial inspection and summary of the data, allowing for an immediate understanding of its structure and key columns.

#### Why Separate Data Loading?
Separating the data loading process into its own notebook allows for more manageable code and focuses on ensuring that the data integrity is maintained from the source to the analytical environment. This separation is particularly important for performance optimization when dealing with large datasets, as it allows for specific tuning and adjustments without affecting other analytical processes.

## 2. Load and read the data

Import the pandas library.

In [1]:
import pandas as pd

Load the first part of the fiscal year 2023 contract data from a CSV file. The dataset is split across multiple files for manageability, and this first file is loaded to learn the column structure of the remaining CSV files.

In [2]:
raw_data_1 = pd.read_csv("..\\Data\\FY2023_All_Contracts_Full_20240408\\FY2023_All_Contracts_Full_20240409_1.csv")
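
If the only goal were to learn the column names, reading just the header row would avoid loading the data rows at all. The following is a minimal sketch of that idea, not the approach used in this notebook. It assumes a small in-memory CSV (`sample_csv`, a hypothetical stand-in for the real file) and uses the `nrows` parameter of `pd.read_csv`.

```python
import io

import pandas as pd

# Hypothetical stand-in for one of the large contract files.
sample_csv = io.StringIO(
    "contract_transaction_unique_key,award_id_piid,federal_action_obligation\n"
    "key_1,piid_1,100.0\n"
    "key_2,piid_2,250.5\n"
)

# nrows=0 parses only the header line, so no data rows are loaded.
header_only = pd.read_csv(sample_csv, nrows=0)
column_names = header_only.columns.tolist()

print(len(column_names))
print(column_names)
```

Note that a header-only read gives the column names but not the non-null counts or inferred dtypes that `.info()` reports on the fully loaded file.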

Configure the pandas display options to increase the maximum number of columns listed by .info(). Above this limit, .info() prints only a truncated summary instead of one line per column.

In [3]:
pd.set_option('display.max_info_columns', 300)
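
As an alternative to changing the option globally, the limit can be raised for a single call with `pd.option_context`. The sketch below illustrates the effect on a hypothetical 150-column DataFrame (`wide_df`), which is an assumption made only for this example. The helper `info_text` is also hypothetical and simply captures what `.info()` would print.

```python
import io

import pandas as pd

# Hypothetical wide frame with 150 columns.
wide_df = pd.DataFrame({f"col_{i}": [0, 1] for i in range(150)})


def info_text(df):
    """Return the text that df.info() would print."""
    buffer = io.StringIO()
    df.info(buf=buffer)
    return buffer.getvalue()


# With a limit of 100 (the pandas default), the 150-column frame
# gets only a truncated summary.
with pd.option_context("display.max_info_columns", 100):
    short_summary = info_text(wide_df)

# With a limit of 300, every column is listed. The previous setting
# is restored when the with-block exits.
with pd.option_context("display.max_info_columns", 300):
    full_summary = info_text(wide_df)

print(len(short_summary.splitlines()), len(full_summary.splitlines()))
```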

Inspect the DataFrame structure and content.

In [4]:
raw_data_1.info(show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 297 columns):
 #    Column                                                          Non-Null Count    Dtype  
---   ------                                                          --------------    -----  
 0    contract_transaction_unique_key                                 1000000 non-null  object 
 1    contract_award_unique_key                                       1000000 non-null  object 
 2    award_id_piid                                                   1000000 non-null  object 
 3    modification_number                                             1000000 non-null  object 
 4    transaction_number                                              973039 non-null   float64
 5    parent_award_agency_id                                          816678 non-null   object 
 6    parent_award_agency_name                                        816678 non-null   object 
 7    parent_award_id_p

Create a list of paths to all the CSV files that contain the fiscal year 2023 contract data.

In [5]:
file_paths = [
    "..\\Data\\FY2023_All_Contracts_Full_20240408\\FY2023_All_Contracts_Full_20240409_1.csv",
    "..\\Data\\FY2023_All_Contracts_Full_20240408\\FY2023_All_Contracts_Full_20240409_2.csv",
    "..\\Data\\FY2023_All_Contracts_Full_20240408\\FY2023_All_Contracts_Full_20240409_3.csv",
    "..\\Data\\FY2023_All_Contracts_Full_20240408\\FY2023_All_Contracts_Full_20240409_4.csv",
    "..\\Data\\FY2023_All_Contracts_Full_20240408\\FY2023_All_Contracts_Full_20240409_5.csv",
    "..\\Data\\FY2023_All_Contracts_Full_20240408\\FY2023_All_Contracts_Full_20240409_6.csv",
    "..\\Data\\FY2023_All_Contracts_Full_20240408\\FY2023_All_Contracts_Full_20240409_7.csv",
    ]
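
Since the seven file names differ only in their trailing number, the same list could also be generated rather than typed out. This is a sketch of one way to do it with `pathlib`, assuming the directory layout and naming pattern shown above. `data_dir` is a name introduced here for illustration.

```python
from pathlib import Path

# Directory that holds the seven CSV parts (same location as above).
data_dir = Path("..") / "Data" / "FY2023_All_Contracts_Full_20240408"

# Parts are numbered 1 through 7.
file_paths = [
    data_dir / f"FY2023_All_Contracts_Full_20240409_{part}.csv"
    for part in range(1, 8)
]

for path in file_paths:
    print(path)
```

`pd.read_csv` accepts `Path` objects, and `pathlib` uses the path separator of the operating system, so the same code would work outside Windows.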

After looking at the structure of raw_data_1, create a list of potential columns of interest. Specifying the columns to load keeps the analysis focused on relevant information and improves performance by reducing memory usage: only 27 of the 297 columns are read.

In [6]:
cols_to_use = [
    "federal_action_obligation",
    "total_dollars_obligated",
    "current_total_value_of_award",
    "potential_total_value_of_award",
    "period_of_performance_start_date",
    "period_of_performance_current_end_date",
    "period_of_performance_potential_end_date",
    "awarding_agency_name",
    "funding_agency_name",
    "recipient_uei",
    "recipient_name",
    "recipient_country_name",
    "recipient_state_code",
    "recipient_state_name",
    "primary_place_of_performance_country_code",
    "primary_place_of_performance_country_name",
    "primary_place_of_performance_state_code",
    "primary_place_of_performance_state_name",
    "award_or_idv_flag",
    "award_type",
    "product_or_service_code",
    "product_or_service_code_description",
    "type_of_contract_pricing",
    "naics_code",
    "naics_description",
    "extent_competed",
    "c8a_program_participant",
]
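
The memory effect of selecting columns can be checked directly with `DataFrame.memory_usage`. The sketch below uses a tiny in-memory CSV whose contents, including the `unused_notes` column, are invented for illustration. The sizes it prints say nothing about the real dataset, only that loading fewer columns uses less memory.

```python
import io

import pandas as pd

# Hypothetical sample data; the last column plays the role of an unneeded field.
csv_text = (
    "recipient_name,naics_code,federal_action_obligation,unused_notes\n"
    "Acme Corp,541511,1000.0,long free-text field that is not needed\n"
    "Globex LLC,236220,2500.5,another long free-text field that is not needed\n"
)
sample_cols = ["recipient_name", "naics_code", "federal_action_obligation"]

# Load once with every column and once with only the selected ones.
full_df = pd.read_csv(io.StringIO(csv_text))
subset_df = pd.read_csv(io.StringIO(csv_text), usecols=sample_cols)

# deep=True also counts the memory held by string values.
full_bytes = full_df.memory_usage(deep=True).sum()
subset_bytes = subset_df.memory_usage(deep=True).sum()

print(f"all columns: {full_bytes} bytes, selected columns: {subset_bytes} bytes")
```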

Load and concatenate the data from all the file paths. First, create an empty DataFrame, fy2023_all_contracts, to hold the result. Then loop over each file path and read the file with the usecols parameter, so that only the columns of interest are loaded and memory usage stays low. In each iteration, concatenate the temporary DataFrame temp_df onto fy2023_all_contracts with ignore_index=True, so that the combined DataFrame gets a new, continuous index.

In [7]:
fy2023_all_contracts = pd.DataFrame()

for path in file_paths:
    temp_df = pd.read_csv(path, usecols=cols_to_use)
    fy2023_all_contracts = pd.concat([fy2023_all_contracts, temp_df], ignore_index=True)
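
A common variation on this loop collects the per-file DataFrames in a list and calls `pd.concat` once at the end, which avoids copying the accumulated data on every iteration. The following is a sketch of that variation, not the code used in this notebook. The function name `load_and_concat` and the in-memory sample files are assumptions made for illustration.

```python
import io

import pandas as pd


def load_and_concat(sources, columns):
    """Read each source with only the given columns, then concatenate once."""
    frames = [pd.read_csv(source, usecols=columns) for source in sources]
    # ignore_index=True gives the result a new index from 0 to n - 1.
    return pd.concat(frames, ignore_index=True)


# Hypothetical stand-ins for two of the CSV parts.
part_1 = io.StringIO("recipient_name,naics_code,extra\nAcme Corp,541511,x\n")
part_2 = io.StringIO(
    "recipient_name,naics_code,extra\nGlobex LLC,236220,y\nInitech,541512,z\n"
)

combined = load_and_concat([part_1, part_2], ["recipient_name", "naics_code"])
print(combined.shape)
print(combined)
```

With the real data, the call would presumably take the form `load_and_concat(file_paths, cols_to_use)`.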

Inspect the fy2023_all_contracts DataFrame structure and content.

In [8]:
fy2023_all_contracts.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6669114 entries, 0 to 6669113
Data columns (total 27 columns):
 #   Column                                     Dtype  
---  ------                                     -----  
 0   federal_action_obligation                  float64
 1   total_dollars_obligated                    float64
 2   current_total_value_of_award               float64
 3   potential_total_value_of_award             float64
 4   period_of_performance_start_date           object 
 5   period_of_performance_current_end_date     object 
 6   period_of_performance_potential_end_date   object 
 7   awarding_agency_name                       object 
 8   funding_agency_name                        object 
 9   recipient_uei                              object 
 10  recipient_name                             object 
 11  recipient_country_name                     object 
 12  recipient_state_code                       object 
 13  recipient_state_name                      

## 3. Create the CSV file

Export the aggregated DataFrame fy2023_all_contracts, which contains all relevant data from the fiscal year 2023 contract files, to a csv file for future use.

In [9]:
fy2023_all_contracts.to_csv("..\\Data\\fy2023_all_contracts.csv", index=False)

This CSV file will be used in the next notebook.
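
Before relying on an exported file, it can be worth confirming that it reads back with the expected shape and columns. The sketch below shows such a round-trip check on a small hypothetical DataFrame written to a temporary directory. The names `contracts` and `contracts_sample.csv` are illustrative only.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical sample standing in for the aggregated contract data.
contracts = pd.DataFrame(
    {
        "recipient_name": ["Acme Corp", "Globex LLC"],
        "federal_action_obligation": [1000.0, 2500.5],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "contracts_sample.csv"
    # index=False keeps the row index out of the file, as in the export above.
    contracts.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

same_shape = reloaded.shape == contracts.shape
same_columns = list(reloaded.columns) == list(contracts.columns)
print(same_shape, same_columns)
```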