{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 📊 Analyzing Data with Pandas and Visualizing Results with Matplotlib\n",
    "\n",
    "**Author:** Philip Karisa  \n",
    "\n",
    "This notebook demonstrates how to:\n",
    "- Load and explore a dataset using Pandas  \n",
    "- Perform basic data analysis  \n",
    "- Visualize results using Matplotlib and Seaborn  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "from sklearn.datasets import load_iris\n",
    "\n",
    "# ---------------- Task 1: Load and Explore Dataset ----------------\n",
    "\n",
    "try:\n",
    "    # Load the Iris dataset\n",
    "    iris_data = load_iris(as_frame=True)\n",
    "    df = iris_data.frame\n",
    "    \n",
    "    # Display first rows\n",
    "    print(\"First 5 rows:\")\n",
    "    display(df.head())\n",
    "    \n",
    "    # Dataset info\n",
    "    print(\"\\nDataset Info:\")\n",
    "    print(df.info())\n",
    "    \n",
    "    # Missing values check\n",
    "    print(\"\\nMissing values:\")\n",
    "    print(df.isnull().sum())\n",
    "    \n",
    "    # Clean dataset (drop missing values if any)\n",
    "    df = df.dropna()\n",
    "    \n",
    "except FileNotFoundError:\n",
    "    print(\"Error: Dataset file not found.\")\n",
    "except Exception as e:\n",
    "    print(f\"An error occurred: {e}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### ✅ Task 1 Findings:\n",
    "- The Iris dataset contains **150 rows and 5 columns** (4 numeric features + target).  \n",
    "- No missing values were found, so no cleaning was required.  \n",
    "- Features include sepal length, sepal width, petal length, and petal width.  "
   ]
  },
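  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `target` column stores species as integer codes. As a small sketch (assuming the `iris_data` and `df` objects from Task 1 are still in memory; `code_to_species` is a name introduced here for illustration), the codes can be mapped to species names and the class balance checked:  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Map integer target codes to species names (df itself is not modified)\n",
    "code_to_species = dict(enumerate(iris_data.target_names.tolist()))\n",
    "print(code_to_species)\n",
    "\n",
    "# Count samples per species\n",
    "print(df[\"target\"].map(code_to_species).value_counts())"
   ]
  },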
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ---------------- Task 2: Basic Data Analysis ----------------\n",
    "\n",
    "# Descriptive statistics\n",
    "print(\"Basic Statistics:\")\n",
    "display(df.describe())\n",
    "\n",
    "# Grouping by species (target)\n",
    "grouped = df.groupby(\"target\").mean()\n",
    "print(\"\\nMean values grouped by species:\")\n",
    "display(grouped)\n",
    "\n",
    "# Observations\n",
    "print(\"\\nObservations:\")\n",
    "print(\"- Setosa has the smallest petals.\")\n",
    "print(\"- Virginica has the largest petals.\")\n",
    "print(\"- Sepal dimensions overlap between species.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### ✅ Task 2 Findings:\n",
    "- **Setosa** flowers have the smallest petal measurements.  \n",
    "- **Virginica** flowers have the largest petals.  \n",
    "- Sepal dimensions overlap across species, making petals more useful for distinguishing species.  "
   ]
  },
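  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To put numbers behind the overlap claim, here is a minimal sketch (assuming `df` from Task 1; `ranges` is a name introduced here): the minimum and maximum of each feature within each species. Where the min-max intervals of two species overlap, that feature alone separates them poorly.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Min and max of every feature within each species (rows are target codes)\n",
    "ranges = df.groupby(\"target\").agg([\"min\", \"max\"])\n",
    "display(ranges)"
   ]
  },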
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ---------------- Task 3: Data Visualization ----------------\n",
    "\n",
    "# 1. Line chart: Sepal length across samples\n",
    "plt.figure(figsize=(8,5))\n",
    "plt.plot(df.index, df[\"sepal length (cm)\"], label=\"Sepal Length\")\n",
    "plt.title(\"Line Chart - Sepal Length Across Samples\")\n",
    "plt.xlabel(\"Sample Index\")\n",
    "plt.ylabel(\"Sepal Length (cm)\")\n",
    "plt.legend()\n",
    "plt.show()\n",
    "\n",
    "# 2. Bar chart: Average petal length per species\n",
    "plt.figure(figsize=(8,5))\n",
    "sns.barplot(x=\"target\", y=\"petal length (cm)\", data=df, estimator=\"mean\")\n",
    "plt.title(\"Bar Chart - Average Petal Length per Species\")\n",
    "plt.xlabel(\"Species\")\n",
    "plt.ylabel(\"Average Petal Length (cm)\")\n",
    "plt.show()\n",
    "\n",
    "# 3. Histogram: Sepal width distribution\n",
    "plt.figure(figsize=(8,5))\n",
    "plt.hist(df[\"sepal width (cm)\"], bins=15, edgecolor=\"black\")\n",
    "plt.title(\"Histogram - Sepal Width Distribution\")\n",
    "plt.xlabel(\"Sepal Width (cm)\")\n",
    "plt.ylabel(\"Frequency\")\n",
    "plt.show()\n",
    "\n",
    "# 4. Scatter plot: Sepal length vs Petal length\n",
    "plt.figure(figsize=(8,5))\n",
    "sns.scatterplot(x=\"sepal length (cm)\", y=\"petal length (cm)\", hue=\"target\", data=df, palette=\"deep\")\n",
    "plt.title(\"Scatter Plot - Sepal Length vs Petal Length\")\n",
    "plt.xlabel(\"Sepal Length (cm)\")\n",
    "plt.ylabel(\"Petal Length (cm)\")\n",
    "plt.legend(title=\"Species\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### ✅ Task 3 Visualizations:\n",
    "1. **Line Chart**: Shows the variation of sepal length across all 150 samples.  \n",
    "2. **Bar Chart**: Highlights the average petal length per species (Setosa < Versicolor < Virginica).  \n",
    "3. **Histogram**: Shows sepal width distribution, roughly normal with slight skew.  \n",
    "4. **Scatter Plot**: Reveals a strong positive correlation between sepal length and petal length.  "
   ]
  },
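  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The readings of the scatter plot and histogram can be checked numerically. This is a minimal sketch, assuming `df` from Task 1; Pearson correlation (`Series.corr`) and sample skewness (`Series.skew`) are one reasonable choice of summary, not the only one.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Pearson correlation between sepal length and petal length\n",
    "r = df[\"sepal length (cm)\"].corr(df[\"petal length (cm)\"])\n",
    "print(f\"Pearson r (sepal length vs petal length): {r:.2f}\")\n",
    "\n",
    "# Skewness of sepal width (a positive value means a longer right tail)\n",
    "skew = df[\"sepal width (cm)\"].skew()\n",
    "print(f\"Sepal width skewness: {skew:.2f}\")"
   ]
  },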
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 📌 Conclusion\n",
    "- Pandas makes it easy to explore and analyze datasets.  \n",
    "- The Iris dataset shows clear differences in petal size among species, making it useful for classification.  \n",
    "- Matplotlib and Seaborn provide clear visualizations to support findings.  \n",
    "\n",
    "**End of Assignment ✅**"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
