# Flower Bloom Dataset EDA, Cleaning, and Post-cleaning EDA

**What this notebook does (step-by-step):**

1. Load dataset (tries raw CSV, then cleaned CSV). 
2. Perform *EDA before cleaning* to explore issues. 
3. Apply data cleaning steps (examples shown). 
4. Perform *EDA after cleaning* and extract insights.

# Import Libraries

In [20]:
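import os
import warnings
warnings.filterwarnings('ignore')  # silence library warnings to keep cell output readable

import pandas as pd
import numpy as np
from IPython.display import display  # explicit import so display() also works outside a notebook cell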

## Load the dataset

In [3]:
file_path = "flower_bloom_dataset.csv"
df = pd.read_csv(file_path)
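
The cell above reads a single hard-coded path. If both a raw and a cleaned export may be present, a small fallback loader is one option. This is only a sketch: `load_first_available` is a hypothetical helper, not part of the original notebook or of pandas.

```python
from pathlib import Path

import pandas as pd


def load_first_available(candidates):
    """Return (DataFrame, path) for the first candidate CSV that exists.

    Hypothetical helper for illustration; candidates are tried in order.
    """
    for candidate in candidates:
        path = Path(candidate)
        if path.is_file():
            return pd.read_csv(path), path
    raise FileNotFoundError(f"None of these files exist: {list(candidates)}")
```

It could be called as `df, used_path = load_first_available(["flower_bloom_dataset.csv", "flower_bloom_dataset_cleaned.csv"])`, where the second file name is an assumed name for a cleaned export.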

# EDA (Before cleaning)

In [27]:
df.head()

| | type | properties.id | properties.Site | properties.Type | properties.Season | properties.Area | geometry.type | geometry.coordinates |
|---|---|---|---|---|---|---|---|---|
| 0 | Feature | NaN | Chino Hills | Wild | Spring | 25382.057 | MultiPolygon | [[[[-117.68286416250682, 33.87223450781116], [... |
| 1 | Feature | NaN | Carrizo Plain National Monument | Wild | Spring | 354751.514 | MultiPolygon | [[[[-119.50857432459836, 34.87043932367128], [... |
| 2 | Feature | NaN | Antelope Valley California Poppy Reserve | Wild | Spring | 25182.333 | MultiPolygon | [[[[-118.50266751919736, 34.69779051559962], [... |
| 3 | Feature | 0.0 | Red Hills 0 | Plantation | Summer | 14874.759 | MultiPolygon | [[[[-120.239271, 33.427726]]]] |
| 4 | Feature | 1.0 | Golden Valley 1 | Wild | Winter | 364929.893 | MultiPolygon | [[[[-116.766138, 36.991884]]]] |


In [23]:
print("Shape:", df.shape)
print("\nData types:")
display(df.dtypes)

Shape: (150, 8)

Data types:


type                     object
properties.id           float64
properties.Site          object
properties.Type          object
properties.Season        object
properties.Area         float64
geometry.type            object
geometry.coordinates     object
dtype: object

In [24]:
print("\nBasic stats for numeric columns:")
display(df.describe(include=[np.number]))


Basic stats for numeric columns:


| | properties.id | properties.Area |
|---|---|---|
| count | 147.0 | 150.0 |
| mean | 73.0 | 196263.8506 |
| std | 42.579338 | 121273.033454 |
| min | 0.0 | 6036.018 |
| 25% | 36.5 | 95507.1545 |
| 50% | 73.0 | 185780.3545 |
| 75% | 109.5 | 308019.167 |
| max | 146.0 | 397053.019 |


In [25]:
print("\nMissing values per column:")
display(df.isna().sum())


Missing values per column:


type                    0
properties.id           3
properties.Site         0
properties.Type         0
properties.Season       0
properties.Area         0
geometry.type           0
geometry.coordinates    0
dtype: int64
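
To see which rows are behind a missing-value count, the rows can be filtered with `isna()`. The sketch below uses a toy frame copied from the five rows shown by `df.head()`, not the full dataset, so its numbers describe only that preview.

```python
import numpy as np
import pandas as pd

# Toy frame built from the five previewed rows (an assumption for illustration).
sample = pd.DataFrame({
    "properties.id": [np.nan, np.nan, np.nan, 0.0, 1.0],
    "properties.Site": [
        "Chino Hills",
        "Carrizo Plain National Monument",
        "Antelope Valley California Poppy Reserve",
        "Red Hills 0",
        "Golden Valley 1",
    ],
})

# Rows where the id is missing.
missing_rows = sample[sample["properties.id"].isna()]
print(missing_rows["properties.Site"].tolist())

# Share of missing values per column (0.0 to 1.0).
missing_share = sample.isna().mean()
print(missing_share)
```

On the real data the same two expressions can be applied to `df` directly.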

In [26]:
# Duplicates
print("\nDuplicate rows:", df.duplicated().sum())


Duplicate rows: 0


### 🧩 Findings Before Cleaning

After checking the data, I noticed that some columns are not really useful for my analysis.  
The **`properties.id`** column is just a row identifier (and it is missing for 3 rows), so it doesn’t help me find any insights.  
The **`type`** column shows the same value (`Feature`) in every row of the preview and doesn’t give any helpful information, so I’ll remove both of them to keep my dataset clean and focused.  
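
A quick way to back up this kind of judgement is to count distinct values per column: a column with a single value carries no information, and an identifier column is usually unique per row. The sketch below runs on a toy frame copied from the five previewed rows (an assumption for illustration); the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame built from the rows shown by df.head(); not the full dataset.
preview = pd.DataFrame({
    "type": ["Feature"] * 5,
    "properties.id": [np.nan, np.nan, np.nan, 0.0, 1.0],
    "properties.Type": ["Wild", "Wild", "Wild", "Plantation", "Wild"],
    "properties.Season": ["Spring", "Spring", "Spring", "Summer", "Winter"],
})

# Columns holding exactly one distinct value (NaN counted as a value).
constant_cols = [
    col for col in preview.columns
    if preview[col].nunique(dropna=False) == 1
]
print("Constant columns:", constant_cols)

# True when every non-missing id occurs only once.
id_is_unique = preview["properties.id"].dropna().is_unique
print("Non-missing ids unique:", id_is_unique)
```

Running the same checks on the full `df` would show whether the pattern seen in the preview holds for all 150 rows.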

I also saw that some column names are long and include prefixes like `properties.` and `geometry.`, which makes them a bit hard to read.  
So, I’ll rename them to shorter and clearer names:  
- `properties.Site` → `Site`  
- `properties.Type` → `Type`  
- `properties.Season` → `Season`  
- `properties.Area` → `Area`  
- `geometry.type` → `GeometryType`  
- `geometry.coordinates` → `Coordinates`  

# Data Cleaning

In [28]:
# Remove irrelevant columns
df = df.drop(columns=['properties.id', 'type'], errors='ignore')

In [29]:
# Rename columns to simpler names
df = df.rename(columns={
    'properties.Site': 'Site',
    'properties.Type': 'Type',
    'properties.Season': 'Season',
    'properties.Area': 'Area',
    'geometry.type': 'GeometryType',
    'geometry.coordinates': 'Coordinates'
})
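
By default `rename` silently skips keys that do not match any column, so a typo in the mapping goes unnoticed. One possible safeguard, sketched here on an empty frame that only carries the original column names, is to pass `errors="raise"`; the variable names are illustrative.

```python
import pandas as pd

column_map = {
    "properties.Site": "Site",
    "properties.Type": "Type",
    "properties.Season": "Season",
    "properties.Area": "Area",
    "geometry.type": "GeometryType",
    "geometry.coordinates": "Coordinates",
}

# Empty frame with the pre-rename column names, for illustration only.
frame = pd.DataFrame(columns=list(column_map))

# All keys exist, so this succeeds.
renamed = frame.rename(columns=column_map, errors="raise")
print(list(renamed.columns))

# A misspelled key (lowercase "site") now raises instead of being ignored.
typo_caught = False
try:
    frame.rename(columns={"properties.site": "Site"}, errors="raise")
except KeyError as exc:
    typo_caught = True
    print("Rename failed:", exc)
```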

In [30]:
# Remove any numbers at the end of the 'Site' names (for example: "Garden 2" → "Garden")
df['Site'] = df['Site'].str.replace(r'\s*\d+$', '', regex=True)
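
To illustrate what the pattern `\s*\d+$` does, the sketch below applies it to three site names from the preview plus one hypothetical name, "Route 66", which is not in the dataset and is included only to show a side effect: any name that legitimately ends in digits loses them too.

```python
import pandas as pd

sites = pd.Series([
    "Chino Hills",       # no trailing number, left unchanged
    "Red Hills 0",
    "Golden Valley 1",
    "Route 66",          # hypothetical name, not from the dataset
])

# Strip optional whitespace followed by digits at the end of the string.
cleaned = sites.str.replace(r"\s*\d+$", "", regex=True)
print(cleaned.tolist())
# ['Chino Hills', 'Red Hills', 'Golden Valley', 'Route']
```

Comparing `df['Site'].nunique()` before and after the replacement is a simple way to see how many numbered variants were merged into one name.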

### ✅ Data Cleaning Summary

In this step, I cleaned the dataset by removing unnecessary columns, renaming long column names,  
and fixing small formatting issues in the data.  

- Removed **`properties.id`** and **`type`** because they didn’t add any useful information.  
- Renamed columns like **`properties.Site`**, **`geometry.type`**, and others to shorter, clearer names.  
- Cleaned up the **`Site`** column by removing numbers at the end of site names (for example: “Garden 2” → “Garden”).  