
# CMSC 320 — Checkpoint 2: Data Preprocessing & EDA
**Dataset:** Biodiversity by County — Distribution of Animals, Plants, and Natural Communities (NY)

This notebook fulfills **Checkpoint 2 (25 pts)**:
- **Data preprocessing (5 pts):** import, parse, organize
- **Exploration & statistics (20 pts):** three conclusions using at least three methods (incl. hypothesis testing), with plots


In [None]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import re

# Render plots inline
%matplotlib inline


## 1) Data Import, Parsing, and Organization (Preprocessing)

In [None]:

# Load dataset
csv_path = "Biodiversity_by_County_-_Distribution_of_Animals__Plants_and_Natural_Communities (1).csv"
df_raw = pd.read_csv(csv_path)

# Clean column names
def clean_cols(cols):
    return [re.sub(r'[^0-9a-zA-Z_]+', '', c.strip().lower().replace(' ', '_')) for c in cols]

df = df_raw.copy()
df.columns = clean_cols(df.columns)

# Strip whitespace on key text columns
for col in ['county','category','taxonomic_group','taxonomic_subgroup','scientific_name','common_name',
            'ny_listing_status','federal_listing_status','state_conservation_rank','global_conservation_rank',
            'distribution_status']:
    # Strip only real strings so missing values stay missing rather than becoming the text "nan"
    df[col] = df[col].map(lambda v: v.strip() if isinstance(v, str) else v)

# Convert year to numeric
df['year_last_documented'] = pd.to_numeric(df['year_last_documented'], errors='coerce').astype('Int64')

print(df.shape)
df.head()
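
One detail worth checking when stripping text columns is how missing values are treated. Depending on the pandas version, casting with `astype(str)` may turn a missing entry into the literal text `nan`, which later code could mistake for a real value. The sketch below uses a small made-up Series (not rows from the dataset) to compare that cast with `.str.strip()`, which leaves missing entries missing.

```python
import numpy as np
import pandas as pd

# Hypothetical values for illustration only; not taken from the dataset
raw = pd.Series(["  Endangered ", np.nan, "not listed"])

# Cast-then-strip: in some pandas versions the missing entry becomes the text "nan"
cast_then_strip = raw.astype(str).str.strip()

# Strip without casting: the missing entry stays missing
strip_only = raw.str.strip()

print(cast_then_strip.tolist())
print(strip_only.tolist())
print("missing after strip_only:", int(strip_only.isna().sum()))
```

If the first printed list contains the text `nan`, any downstream check for "not listed" or blank values needs to account for it.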



### Feature Engineering
- **`state_rarity_score`** from `state_conservation_rank` (S1 rare → 5 … S5 common → 1; SH/SX treated as rare and scored 5; ranks with no S1–S5 digit, such as unranked entries, are left missing).
- **`federally_listed_flag`** and **`state_listed_flag`** from listing status fields (1 if listed, else 0).
- **Deduplication** at `(county, scientific_name)` to treat each species once per county.
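
To make the rank-to-score rule concrete, the sketch below restates it as a standalone function and applies it to a few rank strings. The function name `rank_to_score_demo` and the sample strings are illustrative assumptions rather than values confirmed to occur in this dataset. Compound ranks such as `S1S2` are scored here by their first digit.

```python
import math
import re

def rank_to_score_demo(rank):
    """Map a state rank string to a 1-5 rarity score (5 = rarest); NaN if unranked."""
    if not isinstance(rank, str) or rank.strip() == "":
        return math.nan
    rank = rank.strip().upper()
    match = re.search(r"S([1-5])", rank)  # first S-digit wins for compound ranks
    if match:
        return 6 - int(match.group(1))    # S1 -> 5 ... S5 -> 1
    if rank.startswith(("SH", "SX")):     # historical / extirpated: treated as rare
        return 5
    return math.nan

# Illustrative inputs only
for rank in ["S1", "S1S2", "S3", "S5", "SH", "SX", "SNR"]:
    print(rank, "->", rank_to_score_demo(rank))
```

Scoring a range rank by its first digit leans toward the rarer end of the range; taking the last digit or the midpoint would be equally defensible choices.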


In [None]:

def s_rank_to_score(s):
    if not isinstance(s, str) or s.strip() == "":
        return np.nan
    s = s.upper()
    m = re.search(r'S([1-5])', s)  # pick first digit if S1S2 etc.
    if m:
        digit = int(m.group(1))
        return 6 - digit  # S1 -> 5 (rare) ... S5 -> 1 (common)
    if s.startswith('SH') or s.startswith('SX'):
        return 5
    return np.nan

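def listed_flag(x):
    # Treat missing values as not listed, including the "nan" text that astype(str) can produce
    if pd.isna(x):
        return 0
    x = str(x).strip().upper()
    if x in ("", "NAN", "<NA>", "NOT LISTED", "NONE", "N/A", "NA"):
        return 0
    return 1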

df['state_rarity_score'] = df['state_conservation_rank'].apply(s_rank_to_score)
df['federally_listed_flag'] = df['federal_listing_status'].apply(listed_flag).astype(int)
df['state_listed_flag'] = df['ny_listing_status'].apply(listed_flag).astype(int)

# Deduplicate species per county
df_unique = df.drop_duplicates(subset=['county','scientific_name']).copy()

df_unique[['county','category','taxonomic_group','scientific_name','state_conservation_rank','state_rarity_score']].head()


## 2) Exploratory Data Analysis & Summary Statistics


### Method 1: Descriptive Statistics (Species Richness by County)
- Compute **unique species per county** (richness).
- Report summary statistics and visualize the **Top 15 counties**.
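
As a small illustration of the richness calculation, the sketch below counts unique species per county on a made-up table. The county and species names are placeholders chosen for the example, and the repeated row shows why duplicate records should not inflate a county's count.

```python
import pandas as pd

# Toy records (hypothetical); the first two rows repeat the same county-species pair
toy = pd.DataFrame({
    "county": ["A", "A", "A", "A", "B", "B"],
    "scientific_name": ["one", "one", "two", "three", "one", "two"],
})

# Richness = number of distinct species recorded in each county
richness = (
    toy.groupby("county")["scientific_name"]
       .nunique()
       .sort_values(ascending=False)
)
print(richness)
print(richness.describe())
```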


In [None]:

species_by_county = df_unique.groupby('county')['scientific_name'].nunique().sort_values(ascending=False)
display(species_by_county.describe())

top15 = species_by_county.head(15)
plt.figure()
top15.plot(kind='bar')
plt.title("Top 15 NY Counties by Unique Species (Richness)")
plt.xlabel("County")
plt.ylabel("Unique species count")
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()



### Method 2: Correlation — Total Species vs. Rare Species
- Define **rare** as `state_rarity_score >= 4` (S1–S2 or historical/possibly extirpated).
- Compute **per-county** totals and rare counts.
- Compute **Pearson correlation** and show a **scatter with a regression line**.
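
Because rare species are also counted in each county's total, some positive correlation between the two columns is built in. One way to probe this, sketched below on made-up county counts, is to correlate rare counts with the non-rare remainder as well. The numbers and the column name `non_rare_species` are assumptions for illustration and not results from the dataset.

```python
import pandas as pd
from scipy import stats

# Hypothetical per-county counts, for illustration only
toy = pd.DataFrame({
    "total_species": [120, 300, 450, 800, 950],
    "rare_species": [10, 35, 30, 90, 85],
})

# Non-rare species = total minus rare, so the two columns no longer overlap
toy["non_rare_species"] = toy["total_species"] - toy["rare_species"]

r_total, p_total = stats.pearsonr(toy["total_species"], toy["rare_species"])
r_rest, p_rest = stats.pearsonr(toy["non_rare_species"], toy["rare_species"])

print(f"rare vs total:    r={r_total:.3f}, p={p_total:.3f}")
print(f"rare vs non-rare: r={r_rest:.3f}, p={p_rest:.3f}")
```

If the second correlation stays close to the first, the relationship is not only a by-product of the overlap between the two counts.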


In [None]:

rare_mask = df_unique['state_rarity_score'] >= 4
rare_counts = df_unique[rare_mask].groupby('county')['scientific_name'].nunique()

county_df = pd.DataFrame({
    'total_species': species_by_county,
    'rare_species': rare_counts
}).fillna(0)

r, p = stats.pearsonr(county_df['total_species'], county_df['rare_species'])
print({'pearson_r': r, 'p_value': p})

x = county_df['total_species'].values
y = county_df['rare_species'].values
slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)

plt.figure()
plt.scatter(x, y)
x_line = np.linspace(x.min(), x.max(), 100)
y_line = slope * x_line + intercept
plt.plot(x_line, y_line)
plt.title("Total vs Rare Species by County")
plt.xlabel("Total unique species")
plt.ylabel("Rare species (S1–S2, SH/SX)")
plt.tight_layout()
plt.show()



### Method 3: Hypothesis Test — Are Plants and Animals Equally Rare?
We test whether the **distribution of state rarity scores** differs between **Animals** and **Plants** using a **Mann–Whitney U** test (non-parametric).
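
One design choice to consider is the unit of observation. After county-level deduplication, a species found in many counties still appears once per county, so a species-level comparison may be closer to the question in the heading. The sketch below shows that variant on made-up rows. The data are placeholders, and dropping duplicates on `scientific_name` assumes a species carries the same state rank in every county.

```python
import pandas as pd
from scipy import stats

# Hypothetical county-species rows, for illustration only
toy = pd.DataFrame({
    "county": ["A", "B", "C", "A", "B", "A", "C", "B"],
    "category": ["Animal", "Animal", "Animal", "Animal",
                 "Plant", "Plant", "Plant", "Plant"],
    "scientific_name": ["an1", "an1", "an1", "an2",
                        "pl1", "pl1", "pl2", "pl3"],
    "state_rarity_score": [5, 5, 5, 2, 3, 3, 4, 1],
})

# Keep one row per species so widespread species are not counted repeatedly
species = toy.drop_duplicates(subset="scientific_name")

animals = species.loc[species["category"] == "Animal", "state_rarity_score"]
plants = species.loc[species["category"] == "Plant", "state_rarity_score"]

result = stats.mannwhitneyu(animals, plants, alternative="two-sided")
print({"n_animals": len(animals), "n_plants": len(plants),
       "U": result.statistic, "p_value": result.pvalue})
```

With only a handful of toy rows the p-value is not meaningful; the point is the deduplication step that comes before the test.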


In [None]:

a = df_unique[df_unique['category'].str.upper()=='ANIMAL']['state_rarity_score'].dropna()
b = df_unique[df_unique['category'].str.upper()=='PLANT']['state_rarity_score'].dropna()

print({'n_animals': len(a), 'n_plants': len(b)})

U, pval = stats.mannwhitneyu(a, b, alternative='two-sided')
print({'test': 'Mann-Whitney U', 'U': U, 'p_value': pval})

plt.figure()
plt.boxplot([a.values, b.values])
plt.xticks([1, 2], ['Animal', 'Plant'])  # label the boxes directly; the `labels` keyword is deprecated in newer Matplotlib
plt.title("State Rarity Score by Category")
plt.ylabel("Rarity score (5 rare … 1 common)")
plt.tight_layout()
plt.show()



## 3) Conclusions (Three Takeaways)

1. **County species richness varies widely** (see the Top 15 plot and the `describe()` summary). Because the counts come from documented records, the differences may reflect where conservation inventories are concentrated as well as heterogeneous habitat diversity across counties.
2. **Total species count is strongly associated with rare species count** across counties (high Pearson *r*). Part of this is expected because rare species are included in the total, but it still suggests that conservation planning in species-rich counties may cover a large share of rare taxa.
3. **Animals and Plants differ in their rarity score distributions** (Mann–Whitney U test significant). Management prioritization may therefore need category-specific strategies (e.g., targeted actions for plant microhabitats vs. animal movement corridors).
