### The U.S. country-level data: Data extraction and data cleaning
- The annual total energy data from the U.S. Energy Information Administration (EIA)
- Data year range: 1949 - 2024
  - Note: Data on Energy Expenditures (#18-20 TETCVUS, TEGDSUS, TETCHUS) are available from 1970 to 2023.
- Purposes: Extracting data, cleaning data, and preparing the dataset for analyses.

#### Step 1: Extracting data
EIA's API: U.S. Total Energy, https://www.eia.gov/opendata/browser/total-energy

The features extracted: 24 features
- Total 'primary energy' production and consumption (Trillion Btu)
- Total 'end-use energy' consumption by sector: Residential, Commercial, Industrial, Transportation (Trillion Btu)
- Total 'end-use electricity' consumption by sector: Residential, Commercial, Industrial, Transportation (Trillion Btu)
- Total 'primary energy' consumed by the electric power sector (Trillion Btu)
- Energy Expenditures (Million $), GDP, Population, and their related data (e.g., energy consumption/expenditure per GDP or per capita)
- Total days: Cooling degree days (CDD) and Heating degree days (HDD)

In [1]:
import sys
import requests
import pandas as pd
import numpy as np
import time 

In [3]:
# API key and endpoint for EIA total energy data extraction
import os
api_key = os.environ.get("EIA_API_KEY", "")  # assumes the key is stored in an EIA_API_KEY environment variable; avoid hardcoding it
url_1US = "https://api.eia.gov/v2/total-energy/data/"
headers = {"Content-Type": "application/json"}

# Note: Each request returns at most 5,000 rows, so the features (analysis items)
#   are extracted through multiple calls, one MSN code per call.
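
The row cap noted above can also be handled within a single series by paging with `offset`. The following is a minimal sketch, assuming the response body exposes the same `response` → `data` list used in the extraction cells; `build_payload` and `fetch_all_rows` are hypothetical helper names, and `post_json` stands for any callable that sends the payload and returns the parsed JSON. With annual data each series has only 76 rows, so paging would matter mainly for higher-frequency requests.

```python
def build_payload(msn, start="1949", end="2024", offset=0, length=5000):
    """Build the request body used in this notebook for one MSN code."""
    return {
        "frequency": "annual",
        "data": ["value"],
        "facets": {"msn": [msn]},
        "start": start,
        "end": end,
        "sort": [{"column": "period", "direction": "asc"}],
        "offset": offset,
        "length": length,
    }


def fetch_all_rows(post_json, msn, length=5000):
    """Page through one series until a short (or empty) page comes back."""
    rows, offset = [], 0
    while True:
        body = post_json(build_payload(msn, offset=offset, length=length))
        page = body["response"]["data"]
        rows.extend(page)
        if len(page) < length:
            break
        offset += length
    return rows


# Offline demonstration with a fake transport (no network call is made).
def fake_post(payload):
    msn = payload["facets"]["msn"][0]
    data = [{"period": str(1949 + i), "msn": msn} for i in range(7)]
    start = payload["offset"]
    return {"response": {"data": data[start:start + payload["length"]]}}


rows = fetch_all_rows(fake_post, "TEPRBUS", length=3)
print(len(rows), "rows collected")
```

In real use, `post_json` could wrap the `requests.post(...)` call from the cells below and return `response.json()`.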

In [None]:
#================================================================================
# 1. Total 'primary energy' production and consumption (Trillion Btu)
#    Total 'end-use energy' consumption by sector: Residential, Commercial, Industrial, Transportation (Trillion Btu)
#    Total 'end-use electricity' consumption by sector: Residential, Commercial, Industrial, Transportation (Trillion Btu)
#    Total primary energy consumed by the electric power sector (Trillion Btu)

# (1) TEPRBUS: Total Primary Energy Production
# (2) TETCBUS: Primary Energy Consumption Total

# (3) TNTCBUS: End-Use Energy Consumed by the End-Use Sectors (total)
# (4) TNRCBUS: End-Use Energy Consumed by the Residential Sector
# (5) TNCCBUS: End-Use Energy Consumed by the Commercial Sector
# (6) TNICBUS: End-Use Energy Consumed by the Industrial Sector
# (7) TNACBUS: End-Use Energy Consumed by the Transportation Sector

# (8) ESTCBUS: Electricity Sales to Ultimate Customers in the End-Use Sectors (total)
# (9) ESRCBUS: Electricity Sales to Ultimate Customers in the Residential Sector
# (10) ESCCBUS: Electricity Sales to Ultimate Customers in the Commercial Sector
# (11) ESICBUS: Electricity Sales to Ultimate Customers in the Industrial Sector
# (12) ESACBUS: Electricity Sales to Ultimate Customers in the Transportation Sector
# (13) TXEIBUS: Total Primary Energy Consumed by the Electric Power Sector

# List to collect all data
US_data1 = []
# List of MSN codes (features) to fetch
msn_list1 = ["TEPRBUS", "TETCBUS", 
    "TNTCBUS", "TNRCBUS", "TNCCBUS", "TNICBUS", "TNACBUS",
    "ESTCBUS", "ESRCBUS", "ESCCBUS", "ESICBUS", "ESACBUS", "TXEIBUS"
    ]

# Loop through each MSN and request it individually
for msn in msn_list1:
    payload = {
        "frequency": "annual",
        "data": ["value"],
        "facets": {
            "msn": [msn]
        },
        "start": "1949",
        "end": "2024",
        "sort": [{"column": "period", "direction": "asc"}],
        "offset": 0,
        "length": 5000
    }

    try:
        response = requests.post(f"{url_1US}?api_key={api_key}", json=payload, headers=headers)
        if response.status_code == 200:
            data = response.json()['response']['data']
            US_data1.extend(data)
            print(f"✅ Retrieved data for MSN: {msn}")
        else:
            print(f"❌ Error fetching MSN {msn}: {response.status_code}")
            print(response.text)
    except Exception as e:
        print(f"❌ Exception for MSN {msn}: {e}")

    time.sleep(1)  # Be nice to the server

# Convert to DataFrame and save
df1_energy = pd.DataFrame(US_data1)
df1_energy.to_csv("USdata1_TotalEnergy_Electricity_1949-2024.csv", index=False)
print("📁 All data saved to 'USdata1_TotalEnergy_Electricity_1949-2024.csv'")

✅ Retrieved data for MSN: TEPRBUS
✅ Retrieved data for MSN: TETCBUS
✅ Retrieved data for MSN: TNTCBUS
✅ Retrieved data for MSN: TNRCBUS
✅ Retrieved data for MSN: TNCCBUS
✅ Retrieved data for MSN: TNICBUS
✅ Retrieved data for MSN: TNACBUS
✅ Retrieved data for MSN: ESTCBUS
✅ Retrieved data for MSN: ESRCBUS
✅ Retrieved data for MSN: ESCCBUS
✅ Retrieved data for MSN: ESICBUS
✅ Retrieved data for MSN: ESACBUS
✅ Retrieved data for MSN: TXEIBUS
📁 All data saved to 'USdata1_TotalEnergy_Electricity_1949-2024.csv'


In [None]:
#================================================================================
# 2. GDP, Population, Energy Expenditures, and related data; 
#    Cooling and Heating Degree Days
# Note: The data on Energy Expenditures (#18-20 TETCVUS, TEGDSUS, TETCHUS) is available from 1970 to 2023.

# (14) GDPDIUS: U.S. Gross Domestic Product Implicit Price Deflator in 2017 = 1.00000
# (15) GDPRVUS: U.S. GDP Nominal (Billion Dollars)
# (16) GDPRXUS: U.S. GDP Real (Billion Chained 2017 Dollars)
# (17) TPOPPUS: Total Resident Population, United States (Million People)

# (18) TETCVUS: Energy Expenditures (Million Nominal Dollars) (1970-2023)
# (19) TEGDSUS: Energy Expenditures as Share of GDP (Percent) (1970-2023)
# (20) TETCHUS: Energy Expenditures per Capita (Nominal Dollars) (1970-2023)

# (21) TETGRUS: Total Primary Energy Consumption per Real Dollar of GDP (Thousand Btu per Chained (2017) Dollar)
# (22) TETPRUS: Total Primary Energy Consumption per Capita (Million Btu)
# (23) ZWCDPUS: Cooling Degree-Days, United States (Number)
# (24) ZWHDPUS: Heating Degree-Days, United States (Number)

US_data2 = []
msn_list2 = ["GDPDIUS", "GDPRVUS", "GDPRXUS", "TPOPPUS", 
    "TETCVUS", "TEGDSUS", "TETCHUS", "TETGRUS", "TETPRUS", 
    "ZWCDPUS", "ZWHDPUS"]

for msn in msn_list2:
    payload = {
        "frequency": "annual",
        "data": ["value"],
        "facets": {
            "msn": [msn]
        },
        "start": "1949",
        "end": "2024",
        "sort": [{"column": "period", "direction": "asc"}],
        "offset": 0,
        "length": 5000
    }

    try:
        response = requests.post(f"{url_1US}?api_key={api_key}", json=payload, headers=headers)
        if response.status_code == 200:
            data = response.json()['response']['data']
            US_data2.extend(data)
            print(f"✅ Retrieved data for MSN: {msn}")
        else:
            print(f"❌ Error fetching MSN {msn}: {response.status_code}")
            print(response.text)
    except Exception as e:
        print(f"❌ Exception for MSN {msn}: {e}")

    time.sleep(1)  # Be nice to the server

# Convert to DataFrame and save
df2_gdp = pd.DataFrame(US_data2)
df2_gdp.to_csv("USdata2_gdp_pop_expenditures_1949-2024.csv", index=False)
print("📁 All data saved to 'USdata2_gdp_pop_expenditures_1949-2024.csv'")    

✅ Retrieved data for MSN: GDPDIUS
✅ Retrieved data for MSN: GDPRVUS
✅ Retrieved data for MSN: GDPRXUS
✅ Retrieved data for MSN: TPOPPUS
✅ Retrieved data for MSN: TETCVUS
✅ Retrieved data for MSN: TEGDSUS
✅ Retrieved data for MSN: TETCHUS
✅ Retrieved data for MSN: TETGRUS
✅ Retrieved data for MSN: TETPRUS
✅ Retrieved data for MSN: ZWCDPUS
✅ Retrieved data for MSN: ZWHDPUS
📁 All data saved to 'USdata2_gdp_pop_expenditures_1949-2024.csv'
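
The two extraction cells above repeat the same request loop. One way to share it is sketched here, under these assumptions: `fetch_series` is a hypothetical helper name, `session` is any object with a `requests`-style `post` method (for example `requests.Session()`), and the stub classes exist only so the example runs without network access.

```python
import time


def fetch_series(session, url, api_key, msn_list, start="1949", end="2024", pause=1.0):
    """Request each MSN code in turn; return (rows, failed_codes)."""
    rows, failed = [], []
    for msn in msn_list:
        payload = {
            "frequency": "annual",
            "data": ["value"],
            "facets": {"msn": [msn]},
            "start": start,
            "end": end,
            "sort": [{"column": "period", "direction": "asc"}],
            "offset": 0,
            "length": 5000,
        }
        try:
            resp = session.post(url, params={"api_key": api_key}, json=payload, timeout=30)
            resp.raise_for_status()
            rows.extend(resp.json()["response"]["data"])
        except Exception as exc:
            failed.append(msn)  # keep going, but remember which codes failed
            print(f"Failed for MSN {msn}: {exc}")
        time.sleep(pause)  # pause between calls, as in the cells above
    return rows, failed


# Stubs for an offline demonstration (assumptions, not part of any library).
class _StubResponse:
    def __init__(self, msn):
        self._msn = msn

    def raise_for_status(self):
        if self._msn == "BADCODE":
            raise RuntimeError("simulated HTTP error")

    def json(self):
        return {"response": {"data": [{"period": "1949", "msn": self._msn}]}}


class _StubSession:
    def post(self, url, params=None, json=None, timeout=None):
        return _StubResponse(json["facets"]["msn"][0])


rows, failed = fetch_series(_StubSession(), "https://example.invalid/", "KEY",
                            ["TEPRBUS", "BADCODE"], pause=0)
print(len(rows), "rows;", "failed:", failed)
```

Returning the failed codes makes it easy to retry only those series instead of rerunning the whole list.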


#### Step 2: Checking the datasets extracted 
- Combine the 2 extracted datasets into a single DataFrame.
  - They have 5 common fields: period, msn, seriesDescription, value, and unit
- Check the msn fields: unique values and counts 
- Add a new column/field ('Item'), providing a descriptive item name for each msn
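
The steps listed above can be sketched on a toy example before running them on the full files. The four rows below reuse 1949 and 1950 values shown elsewhere in this notebook; the short `names` mapping deliberately omits one code to show how unmapped MSNs surface.

```python
import pandas as pd

energy = pd.DataFrame({
    "period": [1949, 1950],
    "msn": ["TEPRBUS", "TEPRBUS"],
    "value": [30613.106, 34459.73],
})
econ = pd.DataFrame({
    "period": [1949, 1949],
    "msn": ["TETCVUS", "GDPRVUS"],
    "value": ["Not Available", "272.5"],
})

# Stack the two extracts row-wise and rebuild the index.
combined = pd.concat([energy, econ], ignore_index=True)

# Map each MSN code to a descriptive item name (GDPRVUS left out on purpose).
names = {"TEPRBUS": "TotPrimEnergy_1Production", "TETCVUS": "Energy_Expenditure"}
combined["Item"] = combined["msn"].map(names)

# Codes without a mapping show up as NaN in 'Item'.
unmapped = sorted(combined.loc[combined["Item"].isna(), "msn"].unique())
print(combined["msn"].value_counts().to_dict())
print("Unmapped MSNs:", unmapped)
```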

In [12]:
# Read the 2 data files
df1_energy = pd.read_csv("USdata1_TotalEnergy_Electricity_1949-2024.csv")
df2_gdp = pd.read_csv("USdata2_gdp_pop_expenditures_1949-2024.csv")

# Set display width for better readability  
pd.set_option('display.width', 600)  

# Data information
display(df1_energy.info())
print(df1_energy.head(2), "\n")
display(df2_gdp.info())
print(df2_gdp.head(2), "\n")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 988 entries, 0 to 987
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   period             988 non-null    int64  
 1   msn                988 non-null    object 
 2   seriesDescription  988 non-null    object 
 3   value              988 non-null    float64
 4   unit               988 non-null    object 
dtypes: float64(1), int64(1), object(3)
memory usage: 38.7+ KB


None

   period      msn                                seriesDescription      value          unit
0    1949  TEPRBUS  Total Primary Energy Production in Trillion Btu  30613.106  Trillion Btu
1    1950  TEPRBUS  Total Primary Energy Production in Trillion Btu  34459.730  Trillion Btu 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 836 entries, 0 to 835
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   period             836 non-null    int64 
 1   msn                836 non-null    object
 2   seriesDescription  836 non-null    object
 3   value              836 non-null    object
 4   unit               836 non-null    object
dtypes: int64(1), object(4)
memory usage: 32.8+ KB


None

   period      msn                                  seriesDescription   value            unit
0    1949  GDPDIUS  U.S. Gross Domestic Product Implicit Price Def...  .12046  2017 = 1.00000
1    1950  GDPDIUS  U.S. Gross Domestic Product Implicit Price Def...  .12195  2017 = 1.00000 



In [14]:
# Combine the 2 datasets into a single DataFrame
df_all_1extracted = pd.concat([df1_energy, df2_gdp], 
    axis=0,   # stack rows
    ignore_index=True)   # reset index

print("Final dataset shape:", df_all_1extracted.shape)
print(df_all_1extracted.head(3))

# Save to a single CSV
df_all_1extracted.to_csv("USdata3_all_1extracted_1949-2024.csv", index=False)


Final dataset shape: (1824, 5)
   period      msn                                seriesDescription      value          unit
0    1949  TEPRBUS  Total Primary Energy Production in Trillion Btu  30613.106  Trillion Btu
1    1950  TEPRBUS  Total Primary Energy Production in Trillion Btu   34459.73  Trillion Btu
2    1951  TEPRBUS  Total Primary Energy Production in Trillion Btu  37672.919  Trillion Btu


In [15]:
# Check the unique values and their counts: 'period' and 'msn'
print("Unique periods:", df_all_1extracted["period"].nunique())
print(df_all_1extracted["period"].value_counts().sort_index())  # sorted by year

print("\nUnique msn:", df_all_1extracted["msn"].nunique())
print(df_all_1extracted["msn"].value_counts())

Unique periods: 76
period
1949    24
1950    24
1951    24
1952    24
1953    24
        ..
2020    24
2021    24
2022    24
2023    24
2024    24
Name: count, Length: 76, dtype: int64

Unique msn: 24
msn
TEPRBUS    76
TETCBUS    76
ZWCDPUS    76
TETPRUS    76
TETGRUS    76
TETCHUS    76
TEGDSUS    76
TETCVUS    76
TPOPPUS    76
GDPRXUS    76
GDPRVUS    76
GDPDIUS    76
TXEIBUS    76
ESACBUS    76
ESICBUS    76
ESCCBUS    76
ESRCBUS    76
ESTCBUS    76
TNACBUS    76
TNICBUS    76
TNCCBUS    76
TNRCBUS    76
TNTCBUS    76
ZWHDPUS    76
Name: count, dtype: int64


In [None]:
# Add a new column, 'Item', providing a descriptive item name for each msn

# Define mapping dictionary: 24 features/items
msn_to_item = {
    "TEPRBUS": "TotPrimEnergy_1Production",
    "TETCBUS": "TotPrimEnergy_2Consumed",
    "TNTCBUS": "TotEnergy_2EndUse",
    "TNRCBUS": "EnergyUse1_Residential",
    "TNCCBUS": "EnergyUse2_Commercial",
    "TNICBUS": "EnergyUse3_Industrial",
    "TNACBUS": "EnergyUse4_Transportation",

    "ESTCBUS": "TotElectricity_2Used",
    "ESRCBUS": "ElectricityUse1_Residential",
    "ESCCBUS": "ElectricityUse2_Commercial",
    "ESICBUS": "ElectricityUse3_Industrial",
    "ESACBUS": "ElectricityUse4_Transportation",
    "TXEIBUS": "PrimEnergyUse5_ElectricPower",

    "GDPDIUS": "GDP_Deflator2017",
    "GDPRVUS": "GDP_Nominal",
    "GDPRXUS": "GDP_Real",
    "TPOPPUS": "POP_US",

    "TETCVUS": "Energy_Expenditure",
    "TEGDSUS": "pct_EnergyExpend_w_GDP",
    "TETCHUS": "EnergyExpend_per_capita",
    "TETGRUS": "PrimEnergyUse_per_GDP",
    "TETPRUS": "PrimEnergyUse_per_capita",
    "ZWCDPUS": "days_cooling",
    "ZWHDPUS": "days_heating",
}

# Add new column 'Item' by mapping msn
df_all_1extracted["Item"] = df_all_1extracted["msn"].map(msn_to_item)
print(df_all_1extracted.head(2))

# Check for any unmapped MSNs
missing_items = df_all_1extracted[df_all_1extracted["Item"].isna()]["msn"].unique()
print("\nMSNs with no mapping:", missing_items, "\n")

# Reorder columns to have 'Item' after 'msn'
df_all_1extracted = df_all_1extracted[["period", "msn", "Item", "seriesDescription", "value", "unit"]]

print(df_all_1extracted.head(2))
print("Final dataset shape:", df_all_1extracted.shape)

   period      msn                                seriesDescription      value          unit                       Item
0    1949  TEPRBUS  Total Primary Energy Production in Trillion Btu  30613.106  Trillion Btu  TotPrimEnergy_1Production
1    1950  TEPRBUS  Total Primary Energy Production in Trillion Btu   34459.73  Trillion Btu  TotPrimEnergy_1Production

MSNs with no mapping: [] 

   period      msn                       Item                                seriesDescription      value          unit
0    1949  TEPRBUS  TotPrimEnergy_1Production  Total Primary Energy Production in Trillion Btu  30613.106  Trillion Btu
1    1950  TEPRBUS  TotPrimEnergy_1Production  Total Primary Energy Production in Trillion Btu   34459.73  Trillion Btu
Final dataset shape: (1824, 6)


In [None]:
# Count missing values and data types for each column
# Note: isna() reports no missing values, but the 'value' field contains the string "Not Available"
#    (for energy expenditures). Because of these strings, the field is stored as 'object' rather than 'float64'.

print(df_all_1extracted.isna().sum(), "\n")
print(df_all_1extracted.dtypes)

# Update the csv data file
df_all_1extracted.to_csv("USdata3_all_1extracted_1949-2024.csv", index=False)

period               0
msn                  0
Item                 0
seriesDescription    0
value                0
unit                 0
dtype: int64 

period                int64
msn                  object
Item                 object
seriesDescription    object
value                object
unit                 object
dtype: object
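
As an alternative to replacing the strings later, the placeholder can be declared as missing when the file is read. This is a minimal sketch using the `na_values` parameter of `pandas.read_csv`; the in-memory CSV is a three-row stand-in built from 1949 values shown in this notebook, not the real file.

```python
import io

import pandas as pd

csv_text = """period,Item,value
1949,TotPrimEnergy_1Production,30613.106
1949,Energy_Expenditure,Not Available
1949,GDP_Nominal,272.5
"""

# Default read: the placeholder is kept as text, so isna() finds nothing.
raw = pd.read_csv(io.StringIO(csv_text))
print(raw["value"].isna().sum(), raw["value"].dtype)

# Declaring the placeholder as NA yields a numeric column with a real NaN.
clean = pd.read_csv(io.StringIO(csv_text), na_values=["Not Available"])
print(clean["value"].isna().sum(), clean["value"].dtype)
```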


#### Step 3: Preparing the dataset for data analysis
- Restructure the dataframe, using 3 columns: period, Item, and value
- Use the 'Item' column as the pivot column for reshaping the DataFrame.
- The restructured df will include 24 feature columns. 
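
The reshaping described above can be illustrated on a small long-format frame. This sketch uses `DataFrame.pivot` rather than `pivot_table`; `pivot` raises an error if a (period, Item) pair occurs more than once, which doubles as a duplicate check. The four values are taken from the 1949 and 1950 rows shown in this notebook.

```python
import pandas as pd

long_df = pd.DataFrame({
    "period": [1949, 1949, 1950, 1950],
    "Item": ["TotPrimEnergy_1Production", "TotPrimEnergy_2Consumed",
             "TotPrimEnergy_1Production", "TotPrimEnergy_2Consumed"],
    "value": [30613.106, 30866.419, 34459.73, 33527.374],
})

# One row per period, one column per Item.
wide = long_df.pivot(index="period", columns="Item", values="value").reset_index()
wide.columns.name = None  # drop the leftover 'Item' label on the column index

print(wide)
```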

In [38]:
# Read the extracted CSV file with all US data
df_all_1extracted = pd.read_csv("USdata3_all_1extracted_1949-2024.csv")

# Copy relevant columns to a new DataFrame (.copy() makes it independent of the source frame)
df_all_2Restructured = df_all_1extracted[["period", "Item", "value"]].copy()
print("Dataset shape:", df_all_2Restructured.shape)
print(df_all_2Restructured.head(2), "\n")

Dataset shape: (1824, 3)
   period                       Item      value
0    1949  TotPrimEnergy_1Production  30613.106
1    1950  TotPrimEnergy_1Production   34459.73 



In [39]:
# Count how many items exist per 'period'. It should be 24 items/features for all groups. 
item_counts = (df_all_2Restructured
    .groupby(["period"])
    ["Item"]
    .nunique()
    .reset_index(name="Item_count"))

print(item_counts.head(4))
print("\nUnique Item count:", item_counts["Item_count"].unique())
del item_counts

   period  Item_count
0    1949          24
1    1950          24
2    1951          24
3    1952          24

Unique Item count: [24]


In [41]:
# Restructure the DataFrame for analysis

# Pivot so each feature becomes a column
df_all_3ForAnalysis = df_all_2Restructured.pivot_table(
    index=["period"],               # keys that define rows
    columns="Item",                 # pivot column
    values="value",                 # values to spread
    aggfunc="first"                 # only one value per period
).reset_index()

# With a single 'values' column, pivot_table returns flat columns whose index is still named 'Item'.
# Clear that name so the header prints cleanly.
df_all_3ForAnalysis.columns.name = None

print("After pivot, data shape:", df_all_3ForAnalysis.shape)
print(df_all_3ForAnalysis.head(3))

After pivot, data shape: (76, 25)
   period ElectricityUse1_Residential ElectricityUse2_Commercial ElectricityUse3_Industrial ElectricityUse4_Transportation EnergyExpend_per_capita EnergyUse1_Residential EnergyUse2_Commercial EnergyUse3_Industrial EnergyUse4_Transportation  ... PrimEnergyUse5_ElectricPower PrimEnergyUse_per_GDP PrimEnergyUse_per_capita TotElectricity_2Used TotEnergy_2EndUse TotPrimEnergy_1Production TotPrimEnergy_2Consumed days_cooling days_heating pct_EnergyExpend_w_GDP
0    1949                     227.894                    200.104                     418.28                         22.114           Not Available               4688.482              2869.013              12979.35                  7901.625  ...                     3296.506                 13.65                      207              868.393          28438.47                 30613.106               30866.419         1103         4933          Not Available
1    1950                     246.348           

In [42]:
# Change the order of columns 
new_order = ["period", "TotPrimEnergy_1Production", "TotPrimEnergy_2Consumed", "TotEnergy_2EndUse", 
"EnergyUse1_Residential", "EnergyUse2_Commercial", "EnergyUse3_Industrial","EnergyUse4_Transportation", 
"TotElectricity_2Used", "ElectricityUse1_Residential", "ElectricityUse2_Commercial", 
"ElectricityUse3_Industrial", "ElectricityUse4_Transportation", "PrimEnergyUse5_ElectricPower", 
"GDP_Deflator2017", "GDP_Nominal", "GDP_Real", "POP_US", "Energy_Expenditure", 
"pct_EnergyExpend_w_GDP","EnergyExpend_per_capita", "PrimEnergyUse_per_GDP", "PrimEnergyUse_per_capita", 
"days_cooling", "days_heating"]

df_all_3ForAnalysis = df_all_3ForAnalysis[new_order]
print("Dataset shape:", df_all_3ForAnalysis.shape)
print(df_all_3ForAnalysis.head(3))

Dataset shape: (76, 25)
   period TotPrimEnergy_1Production TotPrimEnergy_2Consumed TotEnergy_2EndUse EnergyUse1_Residential EnergyUse2_Commercial EnergyUse3_Industrial EnergyUse4_Transportation TotElectricity_2Used ElectricityUse1_Residential  ... GDP_Nominal GDP_Real POP_US Energy_Expenditure pct_EnergyExpend_w_GDP EnergyExpend_per_capita PrimEnergyUse_per_GDP PrimEnergyUse_per_capita days_cooling days_heating
0    1949                 30613.106               30866.419          28438.47               4688.482              2869.013              12979.35                  7901.625              868.393                     227.894  ...       272.5   2261.9  149.2      Not Available          Not Available           Not Available                 13.65                      207         1103         4933
1    1950                  34459.73               33527.374         30861.148               5075.876              3059.238             14319.447                  8406.588              994.405 

In [43]:
# Count missing values and data types for each column
# Note: isna() reports no missing values, but three fields (on energy expenditures) contain the string "Not Available".
print(df_all_3ForAnalysis.isna().sum(), "\n")
print(df_all_3ForAnalysis.dtypes)

period                            0
TotPrimEnergy_1Production         0
TotPrimEnergy_2Consumed           0
TotEnergy_2EndUse                 0
EnergyUse1_Residential            0
EnergyUse2_Commercial             0
EnergyUse3_Industrial             0
EnergyUse4_Transportation         0
TotElectricity_2Used              0
ElectricityUse1_Residential       0
ElectricityUse2_Commercial        0
ElectricityUse3_Industrial        0
ElectricityUse4_Transportation    0
PrimEnergyUse5_ElectricPower      0
GDP_Deflator2017                  0
GDP_Nominal                       0
GDP_Real                          0
POP_US                            0
Energy_Expenditure                0
pct_EnergyExpend_w_GDP            0
EnergyExpend_per_capita           0
PrimEnergyUse_per_GDP             0
PrimEnergyUse_per_capita          0
days_cooling                      0
days_heating                      0
dtype: int64 

period                             int64
TotPrimEnergy_1Production         object
Tot

In [44]:
# Note: The 'energy expenditures' data are available only for 1970-2023.
# The three features related to 'energy expenditures' mark missing values as "Not Available".

# Count the times "Not Available" appears in the df
print("Count of 'Not Available':", (df_all_3ForAnalysis == "Not Available").sum().sum())

# Find where "Not Available" appears
df_all_3ForAnalysis.isin(["Not Available"])

Count of 'Not Available': 66


Unnamed: 0,period,TotPrimEnergy_1Production,TotPrimEnergy_2Consumed,TotEnergy_2EndUse,EnergyUse1_Residential,EnergyUse2_Commercial,EnergyUse3_Industrial,EnergyUse4_Transportation,TotElectricity_2Used,ElectricityUse1_Residential,...,GDP_Nominal,GDP_Real,POP_US,Energy_Expenditure,pct_EnergyExpend_w_GDP,EnergyExpend_per_capita,PrimEnergyUse_per_GDP,PrimEnergyUse_per_capita,days_cooling,days_heating
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,True,True,True,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,True,True,True,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,True,True,True,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,True,True,True,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,True,True,True,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
71,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
72,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
73,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
74,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
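
The count of 66 is consistent with the availability note: three series, each missing 1949-1969 and 2024 (22 years), gives 3 × 22 = 66. A more compact view than the full Boolean table is a list of affected years per column. The sketch below runs on a toy frame in which "ok" is a placeholder for any reported value, not real data.

```python
import pandas as pd

toy = pd.DataFrame({
    "period": [1949, 1970, 2023, 2024],
    "Energy_Expenditure": ["Not Available", "ok", "ok", "Not Available"],
    "POP_US": ["ok", "ok", "ok", "ok"],
})

# Flag placeholder cells, then collect the years per affected column.
flags = toy.drop(columns="period").eq("Not Available")
missing_years = {
    col: toy.loc[flags[col], "period"].tolist()
    for col in flags.columns
    if flags[col].any()
}
print(missing_years)
```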


In [None]:
# Replace "Not Available" with NaN across the entire DataFrame
df_all_3ForAnalysis = df_all_3ForAnalysis.replace("Not Available", np.nan)
# Count missing values
print("Count of missing values:", df_all_3ForAnalysis.isna().sum().sum())

# Convert data type to numeric for all columns except the first ('period')
cols_to_convert = df_all_3ForAnalysis.columns.difference(['period'])
df_all_3ForAnalysis[cols_to_convert] = df_all_3ForAnalysis[cols_to_convert].apply(pd.to_numeric, errors='coerce')

# Check the data types after conversion
print(df_all_3ForAnalysis.dtypes)

Count of missing values: 66
period                              int64
TotPrimEnergy_1Production         float64
TotPrimEnergy_2Consumed           float64
TotEnergy_2EndUse                 float64
EnergyUse1_Residential            float64
EnergyUse2_Commercial             float64
EnergyUse3_Industrial             float64
EnergyUse4_Transportation         float64
TotElectricity_2Used              float64
ElectricityUse1_Residential       float64
ElectricityUse2_Commercial        float64
ElectricityUse3_Industrial        float64
ElectricityUse4_Transportation    float64
PrimEnergyUse5_ElectricPower      float64
GDP_Deflator2017                  float64
GDP_Nominal                       float64
GDP_Real                          float64
POP_US                            float64
Energy_Expenditure                float64
pct_EnergyExpend_w_GDP            float64
EnergyExpend_per_capita           float64
PrimEnergyUse_per_GDP             float64
PrimEnergyUse_per_capita            int64
days_c

In [None]:
# Prepare the final dataset for analyses

# Rename the 'period' field to 'Year'
df_all_3ForAnalysis = df_all_3ForAnalysis.rename(columns={"period": "Year"})
print(df_all_3ForAnalysis.head(5))

# Save the restructured DataFrame to a CSV and an Excel file
df_all_3ForAnalysis.to_csv("USdata4_all_2ForAnalysis_1949-2024.csv", index=False) 
# df_all_3ForAnalysis.to_excel("USdata4_all_2ForAnalysis_1949-2024-2.xlsx", index=False)

   Year  TotPrimEnergy_1Production  TotPrimEnergy_2Consumed  TotEnergy_2EndUse  EnergyUse1_Residential  EnergyUse2_Commercial  EnergyUse3_Industrial  EnergyUse4_Transportation  TotElectricity_2Used  ElectricityUse1_Residential  ...  GDP_Nominal  GDP_Real  POP_US  Energy_Expenditure  pct_EnergyExpend_w_GDP  EnergyExpend_per_capita  PrimEnergyUse_per_GDP  PrimEnergyUse_per_capita  days_cooling  days_heating
0  1949                  30613.106                30866.419          28438.470                4688.482               2869.013              12979.350                   7901.625               868.393                      227.894  ...        272.5    2261.9   149.2                 NaN                     NaN                      NaN                  13.65                       207          1103          4933
1  1950                  34459.730                33527.374          30861.148                5075.876               3059.238              14319.447                   8406.588       
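
An optional sanity check on the final table, which is not part of the original workflow, is to compare each total with the sum of its four sector columns. The sketch below uses the 1949 values printed in the outputs above; `sector_gap` is a hypothetical helper name, and the 0.01 Trillion Btu tolerance is an assumption chosen to absorb rounding in the published figures.

```python
import pandas as pd

check = pd.DataFrame({
    "Year": [1949],
    "TotEnergy_2EndUse": [28438.470],
    "EnergyUse1_Residential": [4688.482],
    "EnergyUse2_Commercial": [2869.013],
    "EnergyUse3_Industrial": [12979.350],
    "EnergyUse4_Transportation": [7901.625],
    "TotElectricity_2Used": [868.393],
    "ElectricityUse1_Residential": [227.894],
    "ElectricityUse2_Commercial": [200.104],
    "ElectricityUse3_Industrial": [418.280],
    "ElectricityUse4_Transportation": [22.114],
})


def sector_gap(df, total_col, prefix):
    """Absolute difference between a total and the sum of its sector columns."""
    sector_cols = [c for c in df.columns if c.startswith(prefix)]
    return (df[sector_cols].sum(axis=1) - df[total_col]).abs()


energy_gap = sector_gap(check, "TotEnergy_2EndUse", "EnergyUse")
electricity_gap = sector_gap(check, "TotElectricity_2Used", "ElectricityUse")

tolerance = 0.01  # assumed rounding allowance, in Trillion Btu
print("End-use energy within tolerance:", bool((energy_gap < tolerance).all()))
print("Electricity within tolerance:", bool((electricity_gap < tolerance).all()))
```

Run against the full `df_all_3ForAnalysis`, the same helper would flag any year in which a sector series and its total disagree by more than the chosen tolerance.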