 "# NASA Exoplanet Data Cleaning & Wrangling\n",
    "\n",
    "**Project:** K-Means Clustering Analysis of NASA Exoplanets  \n",
    "\n",
    "---\n",
    "\n",
    "## Objectives\n",
    "\n",
    "This notebook demonstrates professional data wrangling practices:\n",
    "\n",
    "1. **Identify and analyze missing values** in raw data\n",
    "2. **Filter to complete cases** for clustering analysis\n",
    "3. **Create derived variables** (density calculation)\n",
    "4. **Standardize text fields** for consistency\n",
    "5. **Detect and remove outliers** using statistical methods\n",
    "6. **Export clean dataset** for downstream analysis\n",
    "\n",
    "---\n",
    "\n",
    "## Learning Outcomes Demonstrated\n",
    "\n",
    "- ✅ Pandas data manipulation\n",
    "- ✅ Missing value analysis and handling\n",
    "- ✅ Data quality assessment\n",
    "- ✅ Outlier detection methods\n",
    "- ✅ Feature engineering\n",
    "- ✅ Documentation and reproducibility"

 "source": [
    "## 1. Setup and Data Loading"
   ]

In [None]:
"execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": ["# Import required libraries\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "from datetime import datetime\n",
    "import warnings\n",
    "\n",
    "# Configure display settings\n",
    "pd.set_option('display.max_columns', None)\n",
    "pd.set_option('display.max_rows', 100)\n",
    "pd.set_option('display.float_format', lambda x: f'{x:.4f}')\n",
    "warnings.filterwarnings('ignore')\n",
    "\n",
    "# Set visualization style\n",
    "sns.set_style('whitegrid')\n",
    "plt.rcParams['figure.figsize'] = (12, 6)\n",
    "plt.rcParams['font.size'] = 10\n",
    "\n",
    "print(\"Libraries imported successfully!\")\n",
    "print(f\"Notebook execution started: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\")"

In [None]:
"execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load raw data\n",
    "INPUT_FILE = 'raw_exoplanets.csv'\n",
    "OUTPUT_FILE = 'cleaned_exoplanets.csv'\n",
    "\n",
    "print(f\"Loading data from: {INPUT_FILE}\")\n",
    "df_raw = pd.read_csv(INPUT_FILE)\n",
    "\n",
    "print(f\"\\n✓ Data loaded successfully!\")\n",
    "print(f\"  Shape: {df_raw.shape[0]:,} rows × {df_raw.shape[1]} columns\")\n",
    "print(f\"\\nColumn names:\\n{df_raw.columns.tolist()}\")"
   ]
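
The loading cell above assumes `raw_exoplanets.csv` exists and has the expected columns. A slightly more defensive loader might look like the sketch below. The helper name `load_raw_table` and its `required_columns` parameter are hypothetical and not part of the notebook, and the demo runs on a small temporary file rather than the real data.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_raw_table(path, required_columns=()):
    """Hypothetical helper: read a CSV and fail early with a clear message."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"Input file not found: {path.resolve()}")
    df = pd.read_csv(path)
    # Report any expected columns that are absent instead of failing later.
    missing = [col for col in required_columns if col not in df.columns]
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")
    return df


# Demo on a throwaway CSV (illustrative values only).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo.csv"
    demo_path.write_text("name,radius\nA,1.0\nB,2.5\n")
    demo_df = load_raw_table(demo_path, required_columns=["name", "radius"])

print(f"Shape: {demo_df.shape[0]:,} rows x {demo_df.shape[1]} columns")
```

Checking for the file and the required columns up front keeps later cells from failing with a less informative `KeyError` partway through the cleaning steps.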

In [None]:
"execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Display first few rows\n",
    "print(\"First 5 rows of raw data:\")\n",
    "df_raw.head()"