# Week 2: Data Exploration Summary

### Dataset: S&P 500 ESG Data

This notebook explores a dataset of S&P 500 companies, containing financial information and key Environmental, Social, and Governance (ESG) metrics.

The primary goal is to determine how many companies report their CO2 emissions, which is crucial for our Carbon Footprint Calculator project.
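
As a rough illustration of what counts as "not reported" in this notebook, the sketch below runs the same two steps (`na_values` on load, then `pd.to_numeric` with `errors='coerce'`) on a tiny in-memory CSV. The five sample rows and the `unknown` placeholder are made up for illustration; only the column name `CarbonEmissions` comes from the dataset.

```python
import io

import pandas as pd

# Hypothetical five-row sample; these values are invented, not taken from the dataset.
sample_csv = io.StringIO(
    "CompanyID,CarbonEmissions\n"
    "1,120.5\n"
    "2,\n"
    "3,N/A\n"
    "4,unknown\n"
    "5,98.0\n"
)

# Blank cells and 'N/A' become NaN while the file is parsed.
sample = pd.read_csv(sample_csv, na_values=["", " ", "N/A", "null"])

# 'unknown' is not a recognised missing marker, so coerce it to NaN explicitly.
sample["CarbonEmissions"] = pd.to_numeric(sample["CarbonEmissions"], errors="coerce")

missing_count = int(sample["CarbonEmissions"].isna().sum())
reported_count = int(sample["CarbonEmissions"].notna().sum())
print(f"missing={missing_count}, reported={reported_count}")  # missing=3, reported=2
```

Without the coercion step, the `unknown` row would be left as a string and would not be counted as missing.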

### Key Findings:

* **Total Records in Dataset:** 11,000
* **Records Missing Emissions Data:** 3,300
* **Percentage of Records Missing Emissions Data:** 30.00%
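
The three figures above are consistent with each other and with the `df.info()` output further down, which lists 7700 non-null values for `CarbonEmissions` out of 11000 rows. A minimal check of that arithmetic, with the two counts typed in by hand rather than read from the file:

```python
# Counts copied from the notebook's df.info() output.
total_rows = 11000
non_null_emissions = 7700

missing_rows = total_rows - non_null_emissions
missing_pct = missing_rows / total_rows * 100

print(f"{missing_rows} of {total_rows} rows are missing emissions data ({missing_pct:.2f}%)")
```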

In [5]:
import pandas as pd

# --- 1. Load the Dataset ---
file_name = 'dataset_with_missing_emissions.csv'

try:
    # Treat blank and placeholder strings as missing (NaN). Pandas already maps '', 'N/A',
    # and 'null' to NaN by default; listing them makes the intent explicit, and ' ' is extra.
    df = pd.read_csv(file_name, na_values=['', ' ', 'N/A', 'null'])
except FileNotFoundError:
    # Stop with a clear message instead of calling exit(), which can terminate the notebook kernel.
    raise SystemExit(f"❌ Error: Make sure '{file_name}' is in your CarbonProject folder.")

# Convert the column to numbers; any remaining non-numeric value becomes missing (NaN).
df['CarbonEmissions'] = pd.to_numeric(df['CarbonEmissions'], errors='coerce')

print(f"✅ Dataset '{file_name}' loaded successfully!")

# --- 2. Initial Exploration ---
print("\n--- Dataset Info (Columns, Nulls, Dtypes) ---")
# After the numeric conversion above, CarbonEmissions should show dtype float64
df.info()

# --- 3. Check for Missing Data ---
emissions_column = 'CarbonEmissions'
missing_emissions_count = df[emissions_column].isnull().sum()
total_rows = len(df)

if total_rows > 0:
    percentage_missing = (missing_emissions_count / total_rows) * 100
else:
    percentage_missing = 0

# --- 4. Print the Final Report ---
print("\n--- Missing Data Analysis ---")
print(f"The emissions column is named: '{emissions_column}'")
print(f"Total number of records in dataset: {total_rows}")
print(f"Number of records MISSING emissions data: {missing_emissions_count}")
print(f"Percentage of records MISSING emissions data: {percentage_missing:.2f}%")

✅ Dataset 'dataset_with_missing_emissions.csv' loaded successfully!

--- Dataset Info (Columns, Nulls, Dtypes) ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11000 entries, 0 to 10999
Data columns (total 16 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   CompanyID          11000 non-null  int64  
 1   CompanyName        11000 non-null  object 
 2   Industry           11000 non-null  object 
 3   Region             11000 non-null  object 
 4   Year               11000 non-null  int64  
 5   Revenue            11000 non-null  float64
 6   ProfitMargin       11000 non-null  float64
 7   MarketCap          11000 non-null  float64
 8   GrowthRate         10000 non-null  float64
 9   ESG_Overall        11000 non-null  float64
 10  ESG_Environmental  11000 non-null  float64
 11  ESG_Social         11000 non-null  float64
 12  ESG_Governance     11000 non-null  float64
 13  CarbonEmissions    7700 non-null   float64
 14  Wat