# **Project Name**    -
Bird Observation and Environmental Factors Analysis


##### **Project Type**    - EDA
##### **Contribution**    - Individual
Name - Nikhar Roy Chaudhuri
Batch - 01 June Internship Batch

# **Project Summary -**

This project analyses bird observation data collected from different habitats under varying environmental conditions. Using Exploratory Data Analysis (EDA), we identify patterns in species behaviour, abundance, and environmental correlations. Various visualisations, such as bar charts, radar plots, heatmaps, and pair plots, are used to explore relationships between factors like temperature, humidity, time of day, distance of observation, and activity type. The goal is to extract actionable insights for biodiversity monitoring, habitat management, and ecological conservation.


# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


Bird populations are key indicators of ecosystem health, but understanding their distribution and activity patterns requires detailed data analysis. The dataset contains species counts, environmental variables, and behavioural observations, but without systematic analysis it is hard to identify meaningful patterns that can guide conservation strategies. The problem is to extract, visualise, and interpret these patterns to support decision-making for wildlife management and research.

#### **Define Your Business Objective?**

The primary business objective is to provide ecologists, conservationists, and environmental planners with data-driven insights into bird species distribution and behaviour. By identifying correlations between environmental conditions and species activity, the analysis can help optimise monitoring schedules, prioritise habitats for protection, and support long-term biodiversity conservation plans. The findings can also assist in educational outreach and policy-making to enhance environmental awareness.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries

In [None]:
# ========== Import Libraries ==========
import pandas as pd              # data handling
import numpy as np               # numeric helpers
import matplotlib.pyplot as plt  # quick charts

pd.set_option("display.max_columns", None)      # show all columns
pd.set_option("display.float_format", "{:,.2f}".format)  # tidy numeric print

### Dataset Loading

In [None]:
# Load Dataset

In [None]:
# ========== Load Dataset ==========
# Reads all sheets from both Excel files, tags ecosystem + sheet and merges to one table.

def load_all_sheets(path, ecosystem):
    x = pd.read_excel(path, sheet_name=None)           # dict: {sheet_name: DataFrame}
    frames = []
    for sh, d in x.items():
        df = d.copy()
        df["Ecosystem"] = ecosystem                    # Forest / Grassland
        df["Admin_Unit_Sheet"] = sh                    # remember which sheet it came from
        if "Admin_Unit_Code" in df.columns:
            df["Admin_Unit_Code"] = df["Admin_Unit_Code"].fillna(sh)
        else:
            df["Admin_Unit_Code"] = sh
        frames.append(df)
    return pd.concat(frames, ignore_index=True)

forest_file = "Bird_Monitoring_Data_FOREST.XLSX"
grass_file  = "Bird_Monitoring_Data_GRASSLAND.XLSX"

df = pd.concat([
    load_all_sheets(forest_file, "Forest"),
    load_all_sheets(grass_file,  "Grassland")
], ignore_index=True)

# quick type coercions (errors="coerce" turns any value that cannot be parsed into NaT/NaN)
for c in ["Date", "Start_Time", "End_Time"]:
    if c in df.columns:
        df[c] = pd.to_datetime(df[c], errors="coerce")
if "Year" in df.columns:
    df["Year"] = pd.to_numeric(df["Year"], errors="coerce").astype("Int64")


### Dataset First View

In [None]:
# Dataset First Look

In [None]:
# ========== Dataset First Look ==========
df.head(10)   # quick sanity preview


### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count

In [None]:
# ========== Dataset Rows & Columns count ==========
rows, cols = df.shape
print(f"Rows: {rows:,}  |  Columns: {cols}")


### Dataset Information

In [None]:
# Dataset Info

In [None]:
# ========== Dataset Info ==========
df.info()   # column dtypes, non-null counts and memory usage


#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

In [None]:
# ========== Dataset Duplicate Value Count ==========
dup_count = df.duplicated().sum()             # number of fully duplicated rows
print(f"Duplicate rows found: {dup_count:,}")
if dup_count > 0:
    df = df.drop_duplicates().reset_index(drop=True)
    print("Duplicates removed -> new shape:", df.shape)


#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count

In [None]:
# ========== Missing Values / Null Values Count ==========
na = df.isna().sum().sort_values(ascending=False)
missing_table = pd.DataFrame({
    "missing_count": na,
    "missing_%": (na/len(df)*100).round(2)
})
missing_table[missing_table["missing_count"] > 0]

In [None]:
# Visualizing the missing values

In [None]:
# ========== Visualizing the missing values ==========
mt = (missing_table[missing_table["missing_count"] > 0]
      .sort_values("missing_%"))
plt.figure(figsize=(8, max(4, len(mt)*0.25)))
plt.barh(mt.index.astype(str), mt["missing_%"])
plt.xlabel("Missing (%)"); plt.title("Missing Values by Column")
plt.tight_layout(); plt.show()


### What did you know about your dataset?

The dataset originally contained 17,077 rows and 33 columns, which reduced to 15,372 rows after removing 1,705 duplicate records. It includes a mix of categorical, datetime, integer, and float columns covering location details (e.g., Admin_Unit_Code, Site_Name, Location_Type), observation details (e.g., Date, Start_Time, End_Time, Observer, Interval_Length), bird-related information (Common_Name, Scientific_Name, Sex, Distance), and environmental variables (Temperature, Humidity, Sky, Wind, Disturbance). While most columns are well-populated, some have very high shares of missing values, notably Start_Time and End_Time (100%), Sub_Unit_Code (95%), TaxonCode and Previously_Obs (56%), Site_Name and NPSTaxonCode (44%), and Sex (34%). These missing values will need careful handling, with decisions on whether to drop or impute them depending on analysis needs. Overall, the dataset is rich in observational and environmental details but requires cleaning for high-missing-value columns before deeper analysis.
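
The drop-or-impute decision described above can be made explicit with a small triage table. The sketch below is only an illustration: the helper name `triage_missing`, the 30% and 90% thresholds, and the demo rows are assumptions, not values taken from the bird data.

```python
import numpy as np
import pandas as pd

def triage_missing(frame, drop_above=90.0, review_above=30.0):
    """Label each column by its share of missing values (thresholds are illustrative)."""
    pct = frame.isna().mean().mul(100).round(2)
    action = pd.cut(
        pct,
        bins=[-0.01, 0.0, review_above, drop_above, 100.0],
        labels=["complete", "impute", "review", "drop"],
    )
    return pd.DataFrame({"missing_%": pct, "action": action.astype(str)})

# Tiny made-up frame that only borrows column names from the dataset
demo = pd.DataFrame({
    "Start_Time": [np.nan, np.nan, np.nan, np.nan],
    "Sex": ["Male", np.nan, np.nan, "Female"],
    "Temperature": [21.5, 19.0, np.nan, 23.1],
    "Common_Name": ["Wood Thrush", "Northern Cardinal", "Carolina Wren", "Acadian Flycatcher"],
})
report = triage_missing(demo)
print(report)
```

Calling `triage_missing(df)` on the real table would give one row per column. The thresholds are a judgment call and should be adjusted to the analysis at hand.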

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns

In [None]:
 # ===== Dataset Columns =====
df.columns  # List all column names in the dataset


In [None]:
# Dataset Describe

In [None]:
# ===== Dataset Describe =====
df.describe(include='all')  # Summary stats for numeric & categorical columns


### Variables Description

The dataset contains 33 variables capturing detailed information about bird observations across different sites and conditions:

*   Admin_Unit_Code: Code for the administrative unit of observation (11 unique values, most common “ANTI”).
*   Sub_Unit_Code: Code for the sub-unit within the administrative unit (8 unique values, most common “PISC”).
*   Site_Name: Name of the observation site (70 unique values, most common “CHOH 1”).
*   Plot_Name: Unique plot identifier (609 unique values, most common “ANTI-0163”).
*   Location_Type: Type of habitat (2 unique values, e.g., Forest).
*   Year: Year of observation (2018 in this dataset).
*   Date: Specific observation date.
*   Start_Time & End_Time: Observation start and end times.
*   Observer: Name of the observer (most frequent “Elizabeth Oswald”).
*   Visit: Visit number to the site (1 to 3).
*   Interval_Length: Observation time interval (e.g., “0-2.5 min”).
*   ID_Method: Identification method (e.g., Singing, Calling).
*   Distance: Distance of bird from observer (e.g., “50–100 Meters”).
*   Flyover_Observed: Whether the bird was observed flying overhead (TRUE/FALSE).
*   Sex: Sex of observed bird.
*   Common_Name: Common species name.
*   Scientific_Name: Scientific species name.
*   AcceptedTSN, NPSTaxonCode, AOU_Code: Species taxonomic codes.
*   PIF_Watchlist_Status: Partners in Flight Watchlist status (TRUE/FALSE).
*   Regional_Stewardship_Status: Regional conservation priority (TRUE/FALSE).
*   Temperature, Humidity, Sky, Wind, Disturbance: Environmental conditions during observation.
*   Initial_Three_Min_Cnt: Species count in first 3 minutes.
*   Ecosystem, Admin_Unit_Sheet, TaxonCode, Previously_Obs: Additional classification and historical observation details.
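
Text-coded ranges such as Interval_Length are easier to sort and plot once they are split into numeric bounds. The following is a minimal sketch, assuming labels follow the "0-2.5 min" pattern quoted above; the helper `parse_interval` and the other two demo labels are made up for illustration.

```python
import re
import pandas as pd

# Matches "<start>-<end> min" with optional spaces and either a hyphen or an en dash
_INTERVAL = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*[-–]\s*(\d+(?:\.\d+)?)\s*min", re.IGNORECASE)

def parse_interval(label):
    """Return (start_min, end_min) for labels like '0-2.5 min', else (NaN, NaN)."""
    m = _INTERVAL.match(str(label))
    if m is None:
        return (float("nan"), float("nan"))
    return (float(m.group(1)), float(m.group(2)))

# Only the first label comes from the variable description; the rest are hypothetical
labels = pd.Series(["0-2.5 min", "2.5 - 5 min", "Unknown"])
bounds = pd.DataFrame(labels.map(parse_interval).tolist(), columns=["start_min", "end_min"])
print(bounds)
```

Labels that do not match the pattern come back as NaN instead of raising, so they can be counted and inspected before any further analysis.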

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.

In [None]:
# ===== Check Unique Values for each variable =====
unique_values = {col: df[col].nunique() for col in df.columns}
unique_df = pd.DataFrame(list(unique_values.items()), columns=["Column", "Unique_Values"])
unique_df.sort_values(by="Unique_Values", ascending=False)


## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Write your code to make your dataset analysis ready.

In [None]:
# ===== Data Wrangling =====

# 1. Remove duplicate rows
df = df.drop_duplicates().reset_index(drop=True)

# 2. Handle missing values
# Example: Fill missing categorical values with 'Unknown' and numerical with median
categorical_cols = df.select_dtypes(include=['object']).columns
numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns

df[categorical_cols] = df[categorical_cols].fillna('Unknown')
df[numerical_cols] = df[numerical_cols].fillna(df[numerical_cols].median())

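# 3. Standardize text columns (remove extra spaces & make consistent case)
# Only touch real strings: the .str accessor returns NaN for non-string values (e.g. booleans)
df[categorical_cols] = df[categorical_cols].apply(
    lambda col: col.map(lambda v: v.strip().title() if isinstance(v, str) else v)
)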

# 4. Convert date/time columns to proper datetime format
date_cols = ['Date', 'Start_Time', 'End_Time']
for col in date_cols:
    if col in df.columns:
        df[col] = pd.to_datetime(df[col], errors='coerce')

# 5. Ensure correct data types
df['Year'] = df['Year'].astype('Int64', errors='ignore')
if 'Visit' in df.columns:
    df['Visit'] = df['Visit'].astype('Int64', errors='ignore')

# 6. Remove any irrelevant columns (if needed)
# df = df.drop(columns=['Column_To_Remove'], errors='ignore')

# 7. Final check after cleaning
print("Final dataset shape:", df.shape)
print("Remaining missing values:\n", df.isna().sum())


### What all manipulations have you done and insights you found?

The dataset was cleaned by removing duplicates and addressing missing values across multiple columns. Most fields were complete, but Start_Time and End_Time were entirely missing, indicating that time-based analysis might be limited unless supplemented from another source. The Previously_Obs column had a significant number of missing values (6,826 records), which may affect trend detection for repeated sightings. All other columns had no missing values, suggesting strong overall data completeness. The dataset retained 15,372 records across 33 columns after cleaning, ensuring it is well-prepared for species, spatial, and environmental analysis, with only time-related gaps requiring special handling.
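
Because the loading and wrangling steps convert columns with `errors="coerce"`, it may be worth checking whether a fully missing column was empty in the source files or lost its values during conversion. The sketch below shows one way to count such losses; `coercion_loss` is a hypothetical helper and the demo values are made up.

```python
import pandas as pd

def coercion_loss(before, after):
    """Count values that were present before a conversion but are null afterwards."""
    return int((before.notna() & after.isna()).sum())

# Made-up raw values: two parseable dates, one unparseable string, one true gap
raw = pd.Series(["2018-05-22", "2018-06-03", "not a date", None])
parsed = pd.to_datetime(raw, errors="coerce")

lost = coercion_loss(raw, parsed)
print(f"Values lost to coercion: {lost} of {raw.notna().sum()} non-null values")
```

Running the same comparison on the raw Start_Time and End_Time columns, before and after conversion, would show whether the gaps reported above come from the Excel files themselves or from the parsing step.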

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

In [None]:
# Chart - 1 visualization code

In [None]:
# ===== Chart 1: Average Bird Count by Location Type =====
# Shows the average number of birds observed in the first 3 minutes for each location type.

import matplotlib.pyplot as plt
import pandas as pd

# ===== Step 1: Work on a copy of the dataset =====
tmp = df.copy()

# ===== Step 2: Convert 'Initial_Three_Min_Cnt' to numeric =====
# Map booleans/strings to numbers for calculation
cnt_map = {True: 1, False: 0, 'TRUE': 1, 'True': 1, 'FALSE': 0, 'False': 0}
tmp['Initial_Three_Min_Cnt'] = tmp['Initial_Three_Min_Cnt'].replace(cnt_map)

# Convert to numeric (any non-numeric becomes NaN)
tmp['Initial_Three_Min_Cnt'] = pd.to_numeric(tmp['Initial_Three_Min_Cnt'], errors='coerce').astype(float)

# ===== Step 3: Group by 'Location_Type' and calculate average bird count =====
avg_counts = tmp.groupby('Location_Type', observed=True)['Initial_Three_Min_Cnt'].mean().sort_values(ascending=False)

# ===== Step 4: Create the bar chart =====
plt.figure(figsize=(8,5))
avg_counts.plot(kind='bar', color='skyblue', edgecolor='black')

# ===== Step 5: Customize chart labels and title =====
plt.ylabel("Average Bird Count (Initial 3 Min)")   # Y-axis label
plt.xlabel("Location Type")                        # X-axis label
plt.title("Average Bird Count by Location Type")   # Chart title

# ===== Step 6: Rotate x-axis labels for readability =====
plt.xticks(rotation=45, ha='right')

# ===== Step 7: Adjust layout and display chart =====
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

I chose a bar chart because it clearly compares the average bird count between different location types (Forest vs. Grassland) in a simple and visually intuitive manner. This chart type is ideal for showing differences in average counts across categories without overwhelming the viewer.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that forests have a slightly higher average bird count during the initial 3 minutes of observation compared to grasslands. This suggests that forest environments may provide more favorable conditions or support more bird activity at the start of the observation period.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insight can help in positive conservation planning by guiding resource allocation toward habitats that show higher bird activity, such as forests. It can also inform ecotourism strategies by identifying biodiversity-rich areas. There are no direct indicators of negative growth, but if grassland bird activity continues to lag, it may signal potential ecological issues in that habitat, requiring intervention.
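
A "slightly higher" average is easier to judge when the group sizes and a rough uncertainty band sit next to the means. The sketch below uses made-up rows and a normal-approximation interval (1.96 times the standard error), which is an assumption about how one might check the gap, not part of the original analysis.

```python
import pandas as pd

# Made-up observations; 1 means the bird was counted in the first 3 minutes
demo = pd.DataFrame({
    "Location_Type": ["Forest"] * 5 + ["Grassland"] * 5,
    "Initial_Three_Min_Cnt": [1, 1, 1, 0, 1, 1, 0, 1, 0, 1],
})

summary = (demo.groupby("Location_Type")["Initial_Three_Min_Cnt"]
               .agg(mean="mean", n="count", sem="sem"))
summary["ci95_half_width"] = 1.96 * summary["sem"]   # rough 95% band around each mean
print(summary.round(3))
```

If the bands of the two habitats overlap heavily, the difference seen in the bar chart should be treated as tentative.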

#### Chart - 2

In [None]:
# Chart - 2 visualization code

In [None]:
# ===== Chart 2: Species Diversity Share by Location Type (Donut Chart) =====
import pandas as pd
import matplotlib.pyplot as plt

# Step 1: Prepare data
tmp = df.copy()
tmp["Location_Type"] = tmp["Location_Type"].astype(str).str.strip()
tmp["Scientific_Name"] = tmp["Scientific_Name"].astype(str).str.strip()

diversity = tmp.groupby("Location_Type", observed=True)["Scientific_Name"].nunique()

# Step 2: Plot donut chart
plt.figure(figsize=(6,6))
colors = plt.cm.Pastel1(range(len(diversity)))

wedges, texts, autotexts = plt.pie(
    diversity,
    labels=diversity.index,
    autopct='%1.1f%%',
    startangle=90,
    colors=colors,
    wedgeprops=dict(width=0.4)  # Creates donut hole
)

plt.setp(autotexts, size=10, weight="bold", color="black")
plt.title("Share of Species Diversity by Location Type", fontsize=14)
plt.show()


##### 1. Why did you pick the specific chart?

I chose a donut chart because it effectively shows the proportional share of species diversity between forest and grassland locations in a visually engaging way. The hollow center allows the focus to be on the percentage distribution, making it easier to compare the two categories at a glance.

##### 2. What is/are the insight(s) found from the chart?

The species diversity is almost evenly split between forests (50.2%) and grasslands (49.8%), suggesting that both habitats are equally important for maintaining biodiversity in the study area.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, the insight can help create a positive impact by guiding balanced conservation efforts and resource allocation for both forest and grassland habitats. There is no indication of negative growth here; instead, the close distribution highlights the need to protect both habitat types equally to sustain biodiversity.
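
The donut chart compares the number of unique species per habitat, so a species seen in both habitats is counted in both slices. A set comparison shows how much of the diversity is shared and how much is specific to one habitat. The rows below are made up for illustration.

```python
import pandas as pd

# Made-up sightings; the real table would supply Location_Type and Scientific_Name
demo = pd.DataFrame({
    "Location_Type": ["Forest", "Forest", "Forest", "Grassland", "Grassland", "Grassland"],
    "Scientific_Name": [
        "Cardinalis cardinalis", "Hylocichla mustelina", "Thryothorus ludovicianus",
        "Cardinalis cardinalis", "Spizella pusilla", "Thryothorus ludovicianus",
    ],
})

# One set of species names per habitat
species_by_habitat = demo.groupby("Location_Type")["Scientific_Name"].apply(set)
forest = species_by_habitat["Forest"]
grass = species_by_habitat["Grassland"]

overlap = {
    "shared": len(forest & grass),
    "forest_only": len(forest - grass),
    "grassland_only": len(grass - forest),
}
print(overlap)
```

Species that occur in only one habitat are the ones a habitat-specific conservation plan would need to look at first.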

#### Chart - 3

In [None]:
# ===== Chart 3: Monthly Bird Activity Trend by Ecosystem (Line Chart) =====
# Goal: Understand temporal patterns—how bird activity changes across months in Forest vs Grassland.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# ----- Step 1: Coerce needed columns -----
tmp = df.copy()
# Count -> numeric (handle boolean-like values)
cnt_map = {True: 1, False: 0, 'TRUE': 1, 'True': 1, 'FALSE': 0, 'False': 0}
tmp['Initial_Three_Min_Cnt'] = tmp['Initial_Three_Min_Cnt'].replace(cnt_map)
tmp['Initial_Three_Min_Cnt'] = pd.to_numeric(tmp['Initial_Three_Min_Cnt'], errors='coerce').astype(float)

# Date -> datetime
if 'Date' in tmp.columns:
    tmp['Date'] = pd.to_datetime(tmp['Date'], errors='coerce')
else:
    raise KeyError("Required column missing: 'Date'")

# Ecosystem/Location_Type for grouping
group_dim = 'Ecosystem' if 'Ecosystem' in tmp.columns else 'Location_Type'
if group_dim not in tmp.columns:
    raise KeyError("Required column missing: 'Ecosystem' or 'Location_Type'")

# ----- Step 2: Derive Year-Month key and aggregate activity -----
tmp = tmp.dropna(subset=['Date', 'Initial_Three_Min_Cnt'])
tmp['YearMonth'] = tmp['Date'].dt.to_period('M').dt.to_timestamp()

# Choose metric: total activity per month (use .sum; switch to .mean() if you want average)
monthly = (tmp.groupby([group_dim, 'YearMonth'], observed=True)['Initial_Three_Min_Cnt']
             .sum()
             .reset_index())

# Pivot to wide for separate lines per ecosystem
pivot = monthly.pivot(index='YearMonth', columns=group_dim, values='Initial_Three_Min_Cnt').sort_index()

# ----- Step 3: Plot line chart -----
plt.figure(figsize=(9,5))
pivot.plot(ax=plt.gca(), marker='o')  # matplotlib default colors; one line per ecosystem
plt.title("Monthly Bird Activity Trend by Ecosystem")
plt.xlabel("Month")
plt.ylabel("Total Bird Count (Initial 3 Min)")
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

I chose a line chart because it is the most effective way to visualize changes in bird activity over time and compare trends between two ecosystems (Forest and Grassland). It clearly shows seasonal fluctuations, allowing us to observe patterns and peak activity months.

##### 2. What is/are the insight(s) found from the chart?

The Forest ecosystem shows a sharp increase in bird activity in June, peaking significantly before dropping in July, while Grassland bird activity remains relatively stable throughout the period. This suggests that forests may experience seasonal bird influxes, possibly due to breeding or migration patterns.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights can help in strategic resource allocation, such as increasing monitoring, conservation efforts, or tourism activities during peak months in forests. There is no direct indication of negative growth, but the sharp drop after June in forests might highlight a need for habitat assessment to sustain bird populations throughout the year.
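
Monthly totals rise with the number of surveys carried out, so a June peak can partly reflect more field visits in that month. One way to check is to divide each monthly total by the number of distinct surveys. The sketch below assumes a survey can be identified by Plot_Name plus Date, which is an assumption about the data; all rows are made up.

```python
import pandas as pd

demo = pd.DataFrame({
    "Ecosystem": ["Forest"] * 5 + ["Grassland"] * 3,
    "Plot_Name": ["F-1", "F-1", "F-2", "F-3", "F-1", "G-1", "G-1", "G-2"],
    "Date": pd.to_datetime(["2018-05-10", "2018-05-10", "2018-06-02", "2018-06-02",
                            "2018-06-15", "2018-05-11", "2018-06-03", "2018-06-03"]),
    "Initial_Three_Min_Cnt": [1, 1, 1, 0, 1, 1, 1, 0],
})

demo["YearMonth"] = demo["Date"].dt.strftime("%Y-%m")
# Hypothetical survey identifier: one plot visited on one date
demo["survey_key"] = demo["Plot_Name"] + "|" + demo["Date"].dt.strftime("%Y-%m-%d")

monthly = (demo.groupby(["Ecosystem", "YearMonth"])
               .agg(total=("Initial_Three_Min_Cnt", "sum"),
                    surveys=("survey_key", "nunique")))
monthly["per_survey"] = monthly["total"] / monthly["surveys"]
print(monthly)
```

If the per-survey values are flat while the totals peak, the peak is more likely driven by survey effort than by bird behaviour.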



#### Chart - 4

In [None]:
# Chart - 4 visualization code

In [None]:
# ===== Chart 4: Distribution of Bird Activity by Identification Method (Boxplot) =====
# Goal: Compare the full distribution (median, spread, outliers) of Initial_Three_Min_Cnt
#       across different ID methods (e.g., Singing, Calling, Visualization).

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# ----- Step 1: Prep a safe working copy & coerce types -----
tmp = df.copy()

# Count -> numeric (handle booleans/strings)
cnt_map = {True: 1, False: 0, 'TRUE': 1, 'True': 1, 'FALSE': 0, 'False': 0}
tmp['Initial_Three_Min_Cnt'] = tmp['Initial_Three_Min_Cnt'].replace(cnt_map)
tmp['Initial_Three_Min_Cnt'] = pd.to_numeric(tmp['Initial_Three_Min_Cnt'], errors='coerce').astype(float)

# Choose the grouping variable (ID_Method preferred; fallback to Distance)
group_col = 'ID_Method' if 'ID_Method' in tmp.columns else 'Distance'
if group_col not in tmp.columns:
    raise KeyError("Required column missing: 'ID_Method' or 'Distance'")

# Keep only needed rows
tmp = tmp[[group_col, 'Initial_Three_Min_Cnt']].dropna()

# ----- Step 2: Limit to top-k categories to avoid clutter -----
top_k = 5
top_vals = (tmp[group_col].value_counts().head(top_k)).index.tolist()
tmp = tmp[tmp[group_col].isin(top_vals)]

# Order categories by median activity (high -> low)
medians = tmp.groupby(group_col)['Initial_Three_Min_Cnt'].median().sort_values(ascending=False)
ordered = medians.index.tolist()

# Build list of arrays for matplotlib.boxplot
data_arrays = [tmp.loc[tmp[group_col] == cat, 'Initial_Three_Min_Cnt'].values for cat in ordered]
sizes = tmp[group_col].value_counts().reindex(ordered)

# ----- Step 3: Plot boxplot (distribution) -----
plt.figure(figsize=(9,5))
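bp = plt.boxplot(
    data_arrays,
    showfliers=False,   # hide extreme outliers (optional)
    patch_artist=True   # allow facecolor fill
)
# Set tick labels separately: the `labels=` keyword of boxplot is deprecated in recent Matplotlib releases
plt.xticks(range(1, len(ordered) + 1), [f"{cat}\n(n={sizes[cat]})" for cat in ordered])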

# Optional: lightly tint boxes (matplotlib default colors used)
for box in bp['boxes']:
    box.set_alpha(0.6)

plt.ylabel("Bird Count (First 3 Min)")
plt.xlabel(group_col.replace('_',' '))
plt.title("Distribution of Bird Activity by Identification Method")
plt.xticks(rotation=0, ha='center')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

I chose a boxplot because it effectively displays the distribution, spread, and median of bird counts for each identification method, allowing us to see variation within methods and compare them side by side.

##### 2. What is/are the insight(s) found from the chart?

*   Most identification methods (Singing, Calling, and Visualization) have similar median bird counts close to the maximum possible value in the dataset.
*   There is very little spread in the counts, indicating most observations record the same or very close bird numbers.
*   The Unknown category is negligible (n=2), so it is not significant for analysis.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

*   Positive impact: Knowing that singing and calling detection methods yield consistently high counts can help focus survey efforts on these techniques for more reliable monitoring.
*   Negative impact: The lack of variation in counts may indicate limitations in the data collection method or insufficient sensitivity to detect differences, which could limit deeper ecological insights.
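
Medians at the maximum value with almost no spread are what one would expect if Initial_Three_Min_Cnt behaves like a yes/no flag, which the boolean mapping in the chart code hints at but does not confirm. A quick check such as the sketch below can help decide whether a boxplot or a simple share per group is the better summary. The helper `is_binary` and the demo rows are illustrative assumptions.

```python
import pandas as pd

def is_binary(series):
    """True if every non-null value is 0 or 1 (booleans count as 0/1)."""
    numeric = pd.to_numeric(series, errors="coerce").dropna()
    values = {float(v) for v in numeric.unique()}
    return len(values) > 0 and values <= {0.0, 1.0}

# Made-up rows using ID method names mentioned in the analysis
demo = pd.DataFrame({
    "ID_Method": ["Singing", "Singing", "Singing", "Singing",
                  "Calling", "Calling", "Visualization", "Visualization"],
    "Initial_Three_Min_Cnt": [True, True, False, True, True, False, True, True],
})

if is_binary(demo["Initial_Three_Min_Cnt"]):
    # For a 0/1 flag the group mean is the share of observations made in the first 3 minutes
    share = demo.groupby("ID_Method")["Initial_Three_Min_Cnt"].mean()
    print(share)
```

When the check returns True on the real column, a bar chart of these shares carries the same information as the boxplot in a form that is easier to read.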

#### Chart - 5

In [None]:
# Chart - 5 visualization code

In [None]:
# ===== Chart 5: Temperature–Humidity Map colored by Mean Bird Activity (Hexbin) =====
# Goal: Understand how bird activity varies across joint environmental conditions.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# ---- Step 1: Prepare data & coerce types ----
tmp = df.copy()
cnt_map = {True: 1, False: 0, 'TRUE': 1, 'True': 1, 'FALSE': 0, 'False': 0}
tmp['Initial_Three_Min_Cnt'] = tmp['Initial_Three_Min_Cnt'].replace(cnt_map)

for c in ['Initial_Three_Min_Cnt', 'Temperature', 'Humidity']:
    tmp[c] = pd.to_numeric(tmp[c], errors='coerce')

data = tmp[['Temperature', 'Humidity', 'Initial_Three_Min_Cnt']].dropna()

if data.empty:
    print("Not enough numeric data in Temperature/Humidity/Initial_Three_Min_Cnt to plot.")
else:
    # ---- Step 2: Hexbin of Temp vs Humidity, color = mean bird count ----
    plt.figure(figsize=(8,5))
    hb = plt.hexbin(
        data['Temperature'],
        data['Humidity'],
        C=data['Initial_Three_Min_Cnt'],
        reduce_C_function=np.mean,   # color represents mean count in each hex cell
        gridsize=25,
        mincnt=5,                    # require at least a few points for stability
        linewidths=0.2
    )
    cb = plt.colorbar(hb)
    cb.set_label("Mean Bird Count (First 3 Min)")

    # ---- Step 3: Labels & layout ----
    plt.xlabel("Temperature")
    plt.ylabel("Humidity")
    plt.title("Temperature–Humidity Map colored by Mean Bird Activity")
    plt.tight_layout()
    plt.show()


##### 1. Why did you pick the specific chart?

I chose a hexbin chart because it shows how temperature and humidity jointly relate to bird activity without the overplotting that a scatter plot of thousands of points would suffer from. The color of each hexagon represents the mean bird count (first 3 minutes) for that temperature and humidity range: brighter yellow areas indicate higher mean bird activity, while darker blue areas indicate lower activity.

##### 2. What is/are the insight(s) found from the chart?

Bird activity appears to be higher in moderate temperature ranges (around 20–30°C) combined with relatively high humidity (above ~60%). Very high or very low humidity levels correspond to reduced bird counts. There is no strong activity cluster in extremely hot and dry or cold and dry conditions.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Understanding the environmental conditions that maximize bird activity can help in planning observation schedules, improving conservation strategies, and predicting when and where birds are likely to be most active. This could guide fieldwork timing for ornithologists and wildlife enthusiasts.
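
The ranges read off the hexbin colors can be cross-checked with a plain table of temperature and humidity bands. The sketch below uses `pd.cut` with band edges chosen only for illustration and made-up rows; the `min_n` filter plays the same role as `mincnt` in the hexbin call.

```python
import pandas as pd

demo = pd.DataFrame({
    "Temperature": [12.0, 14.5, 22.0, 24.0, 26.5, 31.0],
    "Humidity": [45.0, 72.0, 65.0, 80.0, 55.0, 40.0],
    "Initial_Three_Min_Cnt": [0, 1, 1, 1, 0, 0],
})

# Illustrative band edges, not derived from the real data
demo["temp_band"] = pd.cut(demo["Temperature"], bins=[10, 20, 30, 40],
                           labels=["10-20", "20-30", "30-40"])
demo["hum_band"] = pd.cut(demo["Humidity"], bins=[0, 60, 100],
                          labels=["<=60", ">60"])

grid = (demo.groupby(["temp_band", "hum_band"], observed=True)["Initial_Three_Min_Cnt"]
            .agg(mean="mean", n="count")
            .reset_index())

min_n = 2                              # keep only cells with enough observations
stable = grid[grid["n"] >= min_n]
print(stable)
```

Reporting the cell counts next to the means makes it clear which of the apparent hot spots rest on only a handful of observations.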

#### Chart - 6

In [None]:
# Chart - 6 visualization code

In [None]:
# ===== Chart 6: Composition of ID Methods by Location Type (100% Stacked Bar) =====
# Goal: Compare HOW birds are detected (Singing / Calling / Visualization / Other)
#       within each habitat, as percentages (composition).

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# ----- Step 1: Prep data & tidy categories -----
tmp = df.copy()

# keep only needed cols
need = ['Location_Type', 'ID_Method']
for c in need:
    if c not in tmp.columns:
        raise KeyError(f"Required column missing: {c}")

tmp = tmp[need].dropna()

# standardize text a bit
tmp['Location_Type'] = tmp['Location_Type'].astype(str).str.strip().str.title()
tmp['ID_Method'] = tmp['ID_Method'].astype(str).str.strip().str.title()

# collapse to Top-3 ID methods + 'Other' to keep chart clean
top_k = 3
top_methods = tmp['ID_Method'].value_counts().head(top_k).index.tolist()
tmp['ID_Method_Clean'] = np.where(tmp['ID_Method'].isin(top_methods), tmp['ID_Method'], 'Other')

# ----- Step 2: Build 100% stacked table (percent by Location_Type) -----
ct = (tmp
      .groupby(['Location_Type', 'ID_Method_Clean'], observed=True)
      .size()
      .unstack(fill_value=0)
      .sort_index())

percent = ct.div(ct.sum(axis=1), axis=0) * 100  # row-wise percent

# ensure a consistent order of stacks (top methods first, then Other)
cols_order = [m for m in top_methods if m in percent.columns] + \
             [c for c in percent.columns if c not in top_methods]
percent = percent[cols_order]

# ----- Step 3: Plot 100% stacked bar -----
fig, ax = plt.subplots(figsize=(9,5))

bottom = np.zeros(len(percent))
x = np.arange(len(percent.index))

for col in percent.columns:
    ax.bar(x, percent[col].values, bottom=bottom, label=col, edgecolor='black')
    bottom += percent[col].values

ax.set_xticks(x)
ax.set_xticklabels(percent.index, rotation=0)
ax.set_ylabel("Share of Observations (%)")
ax.set_xlabel("Location Type")
ax.set_title("Composition of Identification Methods by Location Type (100% Stacked)")
ax.legend(title="ID Method", bbox_to_anchor=(1.02, 1), loc="upper left")
ax.set_ylim(0, 100)

plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

I chose this 100% stacked bar chart because it clearly compares the proportional use of different bird identification methods across two location types (Forest and Grassland), making it easy to see differences in detection patterns between habitats.

##### 2. What is/are the insight(s) found from the chart?

*   Singing is the dominant identification method in both habitats, making up over 60% of observations.
*   Forests have a significantly higher proportion of Calling-based identifications than Grasslands.
*   Grasslands rely more on Visualization methods than Forests.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes, these insights can help optimize bird monitoring strategies by tailoring detection methods to habitat type. For example, in Forests, investing in acoustic monitoring tools would be more effective, while in Grasslands, visual survey methods may yield better results. This targeted approach can improve efficiency, reduce resource wastage, and lead to better conservation outcomes. No direct negative growth is indicated, but misallocating resources to less effective detection methods for a given habitat could reduce monitoring accuracy and waste funding.
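
The percentages behind a 100% stacked bar can also be produced in one step with `pd.crosstab` and `normalize="index"`, which is handy for quoting exact shares next to the chart. The rows below are made up; only the column names and method labels follow the analysis.

```python
import pandas as pd

demo = pd.DataFrame({
    "Location_Type": ["Forest"] * 4 + ["Grassland"] * 4,
    "ID_Method": ["Singing", "Singing", "Calling", "Calling",
                  "Singing", "Singing", "Singing", "Visualization"],
})

# Row-wise shares: each Location_Type row sums to 100
percent = pd.crosstab(demo["Location_Type"], demo["ID_Method"], normalize="index") * 100
print(percent.round(1))
```

The result has one row per habitat and one column per method, so it can be passed straight to `percent.plot(kind="bar", stacked=True)` if a chart is needed.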

#### Chart - 7

In [None]:
# Chart - 7 visualization code

In [None]:
# ===== Chart 7: Monthly Activity Heatmap for Top Species =====
# Goal: Show how activity for the most observed species varies across months (temporal patterns).
# Note: Uses only matplotlib (no seaborn).

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# ----- Step 1: Prep and coerce types -----
tmp = df.copy()

# counts -> numeric
cnt_map = {True: 1, False: 0, 'TRUE': 1, 'True': 1, 'FALSE': 0, 'False': 0}
tmp['Initial_Three_Min_Cnt'] = tmp['Initial_Three_Min_Cnt'].replace(cnt_map)
tmp['Initial_Three_Min_Cnt'] = pd.to_numeric(tmp['Initial_Three_Min_Cnt'], errors='coerce')

# choose species name column
name_col = 'Common_Name' if 'Common_Name' in tmp.columns else 'Scientific_Name'
if name_col not in tmp.columns:
    raise KeyError("Required column missing: 'Common_Name' or 'Scientific_Name'")

# date -> month label
if 'Date' not in tmp.columns:
    raise KeyError("Required column missing: 'Date'")
tmp['Date'] = pd.to_datetime(tmp['Date'], errors='coerce')
tmp = tmp.dropna(subset=['Date', 'Initial_Three_Min_Cnt', name_col])

tmp['Month'] = tmp['Date'].dt.to_period('M').dt.to_timestamp()

# ----- Step 2: Pick top-N species overall -----
top_n = 12
top_species = (tmp.groupby(name_col, observed=True)['Initial_Three_Min_Cnt']
                 .sum()
                 .sort_values(ascending=False)
                 .head(top_n)
                 .index)

sub = tmp[tmp[name_col].isin(top_species)]

# ----- Step 3: Build species x month matrix -----
pivot = (sub
         .groupby([name_col, 'Month'], observed=True)['Initial_Three_Min_Cnt']
         .sum()
         .unstack(fill_value=0))

# order rows by total activity; order columns chronologically
pivot = pivot.loc[pivot.sum(axis=1).sort_values(ascending=True).index]  # least at top
pivot = pivot.sort_index(axis=1)

# convert to numpy array for imshow
data = pivot.values

# ----- Step 4: Plot heatmap -----
fig, ax = plt.subplots(figsize=(10, 6))
im = ax.imshow(data, aspect='auto')

# axes ticks & labels
ax.set_yticks(np.arange(pivot.shape[0]))
ax.set_yticklabels(pivot.index)
ax.set_xticks(np.arange(pivot.shape[1]))
ax.set_xticklabels([pd.to_datetime(c).strftime('%b %Y') for c in pivot.columns], rotation=45, ha='right')

ax.set_xlabel("Month")
ax.set_ylabel("Species")
ax.set_title(f"Monthly Activity Heatmap for Top {top_n} Species")

# colorbar
cbar = plt.colorbar(im, ax=ax)
cbar.set_label("Total Bird Count (Initial 3 Min)")

plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here-I chose a heatmap because it’s ideal for visualizing temporal patterns across multiple species simultaneously. It allows quick identification of seasonal peaks and low-activity periods for each species, which would be harder to detect in traditional bar or line charts.

##### 2. What is/are the insight(s) found from the chart?

Answer Here-Several species, such as the Northern Cardinal and Carolina Wren, show clear spikes in activity during June 2018.
Species like Acadian Flycatcher have low counts early in the season but higher activity mid-season.
Some species (e.g., Red-Bellied Woodpecker) maintain relatively consistent, lower activity levels throughout the observed months.
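
The seasonal peaks read off the heatmap can also be checked numerically. The sketch below is illustrative only: it uses a small hypothetical species-by-month table (`pivot_demo`, with made-up species names and counts) in the same shape as the `pivot` matrix built above, and reports each species' peak month and the share of its total that falls in that month.

```python
import pandas as pd

# Hypothetical species x month matrix (made-up counts, not taken from the dataset)
months = pd.to_datetime(["2018-05-01", "2018-06-01", "2018-07-01"])
pivot_demo = pd.DataFrame(
    [[3, 9, 4],
     [1, 2, 6],
     [2, 2, 2]],
    index=["Species A", "Species B", "Species C"],
    columns=months,
)

# Month with the highest count for each species (first month wins on ties)
peak_month = pivot_demo.idxmax(axis=1)

# Share of each species' total that falls in its peak month;
# values close to 1 / number_of_months indicate a flat, non-seasonal profile
peak_share = pivot_demo.max(axis=1) / pivot_demo.sum(axis=1)

summary = pd.DataFrame({
    "peak_month": peak_month.map(lambda ts: ts.strftime("%Y-%m")),
    "peak_share": peak_share.round(2),
})
print(summary)
```

A species with a high `peak_share` has a clear spike, while a value near one third (for three months) corresponds to the "relatively consistent" pattern described above.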

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here-Positive impact: Seasonal peaks can guide targeted conservation, birdwatching tourism planning, and educational programs to align with high-activity months for specific species.
Negative insight: Low or declining counts for certain species across months could indicate habitat loss or environmental challenges, signaling the need for intervention to prevent further population decline.

#### Chart - 8

In [None]:
# Chart - 8 visualization code

In [None]:
# ===== Chart - 8: Violin Plot of Bird Activity by Location Type =====
# Goal: Compare full distribution (shape, spread, median) of Initial_Three_Min_Cnt across habitats.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# --- Prep: coerce to numeric & filter out extreme outliers ---
tmp = df.copy()
cnt_map = {True: 1, False: 0, 'TRUE': 1, 'True': 1, 'FALSE': 0, 'False': 0}
tmp['Initial_Three_Min_Cnt'] = pd.to_numeric(tmp['Initial_Three_Min_Cnt'].replace(cnt_map), errors='coerce')

if 'Location_Type' not in tmp.columns:
    raise KeyError("Required column missing: 'Location_Type'")

tmp = tmp.dropna(subset=['Initial_Three_Min_Cnt', 'Location_Type'])
upper = np.nanpercentile(tmp['Initial_Three_Min_Cnt'], 99)
tmp = tmp[tmp['Initial_Three_Min_Cnt'] <= upper]

# --- Build data arrays per location type ---
cats = sorted(tmp['Location_Type'].astype(str).unique())
data = [tmp.loc[tmp['Location_Type'].astype(str) == c, 'Initial_Three_Min_Cnt'].values for c in cats]

# --- Plot: violin (distribution) ---
plt.figure(figsize=(8,5))
parts = plt.violinplot(
    dataset=data,
    showmeans=False,
    showmedians=True,     # show medians
    showextrema=False
)

# Labels & layout
plt.xticks(np.arange(1, len(cats)+1), cats, rotation=0)
plt.ylabel("Bird Count (First 3 Min)")
plt.xlabel("Location Type")
plt.title("Distribution of Bird Activity by Location Type (Violin Plot)")
plt.grid(axis='y', linestyle=':', alpha=0.4)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here-I chose a violin plot because it visualizes the distribution and density of bird counts in different habitats, showing not just averages but also the spread, peaks, and potential multiple activity levels in forests and grasslands. This helps identify patterns that simple averages or bar charts might miss.

##### 2. What is/are the insight(s) found from the chart?

Answer Here-Both forests and grasslands show a wide range of bird counts, with observations concentrated near the lower and upper ends of the range, which suggests considerable variability between observation sessions. Median counts are similar and the overall shape of the distribution is comparable in both habitats, with a few extreme observations in each.
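
A violin plot is easier to interpret when it is read next to a few summary numbers. The sketch below is a minimal example on hypothetical counts (the `demo` frame is made up; only the column names follow the dataset). It computes the quartiles and interquartile range per habitat, which is one way to back up a statement such as "median counts are similar".

```python
import pandas as pd

# Hypothetical counts per habitat (illustrative only)
demo = pd.DataFrame({
    "Location_Type": ["Forest"] * 5 + ["Grassland"] * 5,
    "Initial_Three_Min_Cnt": [0, 1, 1, 2, 3, 0, 0, 1, 2, 5],
})

# Quartiles per habitat: one row per habitat, one column per quantile
quantiles = (demo.groupby("Location_Type")["Initial_Three_Min_Cnt"]
                 .quantile([0.25, 0.5, 0.75])
                 .unstack())
quantiles.columns = ["q25", "median", "q75"]

# Interquartile range as a simple measure of spread
quantiles["iqr"] = quantiles["q75"] - quantiles["q25"]
print(quantiles)
```

In this toy example the medians match while the spread differs, which is exactly the kind of difference a bar chart of averages would hide.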

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here-Yes. Understanding the distribution helps optimize fieldwork strategy. Similar distributions suggest that sampling methods can remain consistent across habitats, and knowing how much counts vary helps in planning more balanced observation effort. There is no direct negative impact, but ignoring the spread of the distribution could lead to biased results and poor planning.



#### Chart - 9

In [None]:
# Chart - 9 visualization code

In [None]:
# ===== Chart 9: Temperature vs Bird Activity with Habitat-Specific Trend =====
# Goal: Visualize the relationship between Temperature and Initial_Three_Min_Cnt,
#       colored by Location_Type, with a quadratic trend line for each habitat.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

tmp = df.copy()

# --- Coerce count to numeric (handle booleans/strings) ---
cnt_map = {True: 1, False: 0, 'TRUE': 1, 'True': 1, 'FALSE': 0, 'False': 0}
tmp['Initial_Three_Min_Cnt'] = pd.to_numeric(
    tmp['Initial_Three_Min_Cnt'].replace(cnt_map), errors='coerce'
)

# --- Keep required columns and drop NA; drop counts above the 99th percentile to reduce noise ---
need_cols = ['Temperature', 'Initial_Three_Min_Cnt', 'Location_Type']
for c in need_cols:
    if c not in tmp.columns:
        raise KeyError(f"Required column missing: {c}")

tmp = tmp.dropna(subset=need_cols)
upper = np.nanpercentile(tmp['Initial_Three_Min_Cnt'], 99)
tmp = tmp[tmp['Initial_Three_Min_Cnt'] <= upper]

# --- Plot scatter + quadratic trend per habitat ---
habitats = tmp['Location_Type'].astype(str).unique()

plt.figure(figsize=(9,6))
colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#9467bd', '#8c564b']  # safe palette

for i, hab in enumerate(habitats):
    sub = tmp[tmp['Location_Type'].astype(str) == hab]
    x = sub['Temperature'].astype(float).values
    y = sub['Initial_Three_Min_Cnt'].values

    # scatter
    plt.scatter(x, y, alpha=0.25, s=18, label=f"{hab} (points)", color=colors[i % len(colors)])

    # quadratic trend (degree=2)
    if np.unique(x).size >= 3:  # a quadratic fit needs at least 3 distinct temperatures
        xs = np.linspace(np.nanmin(x), np.nanmax(x), 200)
        coeffs = np.polyfit(x, y, deg=2)
        ys = np.polyval(coeffs, xs)
        plt.plot(xs, ys, linewidth=2.2, color=colors[i % len(colors)], label=f"{hab} trend")

plt.xlabel("Temperature (°C)")
plt.ylabel("Bird Count (First 3 Min)")
plt.title("Temperature vs Bird Activity by Habitat with Quadratic Trend")
plt.grid(True, axis='both', linestyle=':', alpha=0.4)
plt.legend(frameon=False)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here-I chose a scatter plot with habitat-specific quadratic trend lines to show how bird activity in the first three minutes changes with temperature in different environments. The quadratic fit highlights potential non-linear patterns that a simple line might miss, allowing us to see if there are optimal temperature ranges for peak activity in each habitat.

##### 2. What is/are the insight(s) found from the chart?

Answer Here-In forest habitats, bird activity slightly increases with temperature until a certain point, after which it declines.
In grasslands, bird activity shows a slight decrease as temperatures rise, with the highest activity at lower to mid temperatures.
Both habitats have more variability in counts at moderate temperatures compared to very high or very low temperatures.
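
The "certain point" after which activity declines can be estimated from the fitted quadratic itself: for a curve a·x² + b·x + c with a < 0, the maximum lies at x = −b / (2a). The sketch below is a minimal illustration on synthetic, noise-free data; the helper name `quadratic_peak` and the numbers are assumptions for demonstration and are not results from the dataset.

```python
import numpy as np

# Synthetic, noise-free data (hypothetical): activity peaks at 20 degrees
temps = np.linspace(10, 30, 21)
counts = -0.05 * (temps - 20.0) ** 2 + 4.0

# Same fit as in the chart code: coefficients come back highest degree first
a, b, c = np.polyfit(temps, counts, deg=2)

def quadratic_peak(a, b, lo, hi):
    """Return the x-position of the maximum of a*x^2 + b*x + c within [lo, hi].

    Returns None when the curve opens upward (no maximum) or when the
    vertex falls outside the observed range, where the fit is unreliable.
    """
    if a >= 0:
        return None
    x_star = -b / (2 * a)
    return x_star if lo <= x_star <= hi else None

peak_temp = quadratic_peak(a, b, temps.min(), temps.max())
print(peak_temp)
```

Returning `None` outside the observed temperature range is a deliberate choice: a vertex that lies beyond the data is an extrapolation and should not be reported as an optimal temperature.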

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here-Yes, these insights can have a positive impact by helping conservation managers and ecotourism planners schedule birdwatching or monitoring activities during temperature ranges when activity is highest for each habitat. This can improve visitor satisfaction and data collection efficiency.
If ignored, unfavorable temperature conditions might lead to lower sightings, potentially reducing engagement in wildlife programs and affecting revenue from eco-tourism or research grants.

#### Chart - 10

In [None]:
# Chart - 10 visualization code

In [None]:
# ===== Chart 10: Bubble map of Temperature–Humidity by Habitat =====
# Goal: Show how observation density and mean activity vary across
#       (Temperature, Humidity) space, separately for Forest and Grassland.
#       Bubble SIZE = number of observations in the bin,
#       Bubble COLOR = mean Initial_Three_Min_Cnt in that bin.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

tmp = df.copy()

# ---- Coerce activity to numeric (handles booleans/strings) ----
cnt_map = {True: 1, False: 0, 'TRUE': 1, 'True': 1, 'FALSE': 0, 'False': 0}
tmp['Initial_Three_Min_Cnt'] = pd.to_numeric(
    tmp['Initial_Three_Min_Cnt'].replace(cnt_map), errors='coerce'
)

need = ['Temperature', 'Humidity', 'Location_Type', 'Initial_Three_Min_Cnt']
for c in need:
    if c not in tmp.columns:
        raise KeyError(f"Missing required column: {c}")
tmp = tmp.dropna(subset=need)

# ---- Optional: drop rows outside the 1st–99th percentile of temperature/humidity for a cleaner view ----
tmin, tmax = np.nanpercentile(tmp['Temperature'], [1, 99])
hmin, hmax = np.nanpercentile(tmp['Humidity'], [1, 99])
tmp = tmp[(tmp['Temperature'].between(tmin, tmax)) & (tmp['Humidity'].between(hmin, hmax))]

# ---- Bin the space into a manageable grid ----
# (Use 6 x 6 bins; adjust if you want a finer grid)
t_bins = np.linspace(tmp['Temperature'].min(), tmp['Temperature'].max(), 7)
h_bins = np.linspace(tmp['Humidity'].min(), tmp['Humidity'].max(), 7)

tmp['T_bin'] = pd.cut(tmp['Temperature'], bins=t_bins, include_lowest=True)
tmp['H_bin'] = pd.cut(tmp['Humidity'], bins=h_bins, include_lowest=True)

# Representative bin positions for plotting (mean of the observed values in each bin)
t_centers = tmp.groupby('T_bin', observed=True)['Temperature'].mean()
h_centers = tmp.groupby('H_bin', observed=True)['Humidity'].mean()

# ---- Aggregate: count for bubble size, mean activity for color ----
agg = (tmp
       .groupby(['Location_Type', 'T_bin', 'H_bin'], observed=True)
       .agg(n=('Initial_Three_Min_Cnt', 'size'),
            mean_cnt=('Initial_Three_Min_Cnt', 'mean'))
       .reset_index())

# Map bin labels to numerical centers
agg['T_center'] = agg['T_bin'].map(t_centers).astype(float)
agg['H_center'] = agg['H_bin'].map(h_centers).astype(float)

# ---- Plot: two panels, one per habitat ----
habitats = sorted(agg['Location_Type'].astype(str).unique())
n_panels = len(habitats)

fig, axes = plt.subplots(1, n_panels, figsize=(12, 5), sharex=True, sharey=True)

if n_panels == 1:
    axes = [axes]

for ax, hab in zip(axes, habitats):
    sub = agg[agg['Location_Type'].astype(str) == hab]

    # Scale bubble size (area) by observation count
    # Add a small offset so tiny bins are still visible
    s = (sub['n'].values / sub['n'].max()) * 800 + 30

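    sc = ax.scatter(
        sub['T_center'].values,
        sub['H_center'].values,
        s=s,
        c=sub['mean_cnt'].values,  # color encodes mean activity
        vmin=agg['mean_cnt'].min(),  # shared color limits so the single
        vmax=agg['mean_cnt'].max(),  # colorbar is valid for every panel
        alpha=0.8
    )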

    ax.set_title(hab)
    ax.set_xlabel("Temperature (°C)")
    ax.set_ylabel("Humidity")

# Colorbar for mean activity
cbar = fig.colorbar(sc, ax=axes, shrink=0.9)
cbar.set_label("Mean Bird Count (First 3 Min)")

fig.suptitle("Bubble Map: Temp–Humidity Space by Habitat\nSize = # Observations, Color = Mean Activity", y=1.02)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here-I chose a bubble map because it displays four variables at once: temperature (x-axis), humidity (y-axis), mean bird activity (color), and observation density (bubble size). This multi-variable view allows quick detection of the habitat-specific environmental conditions associated with higher bird activity.

##### 2. What is/are the insight(s) found from the chart?

Answer Here-In both forest and grassland habitats, bird activity tends to be higher in moderate temperature and humidity ranges.
Larger bubbles in the mid-temperature (18°C–24°C) and mid-to-high humidity ranges indicate more frequent observations under these conditions.
Forest habitats show slightly higher mean activity (color) at higher humidity, while grasslands show more consistent activity across a broader humidity range.
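
When reading the bubble colors, bins with very few observations can show a high mean purely by chance. The sketch below is a hedged illustration on a hypothetical per-bin table (`bins_demo`, shaped like the `agg` table above but with made-up values). It keeps only bins with a minimum number of observations before picking the highest-activity bin per habitat. The threshold `MIN_OBS` is an assumption, not a value from the analysis.

```python
import pandas as pd

# Hypothetical per-bin summary: n = observations in the bin,
# mean_cnt = mean activity in the bin (all values made up)
bins_demo = pd.DataFrame({
    "Location_Type": ["Forest", "Forest", "Forest", "Grassland", "Grassland"],
    "T_center": [16.0, 21.0, 27.0, 18.0, 24.0],
    "H_center": [60.0, 75.0, 85.0, 55.0, 70.0],
    "n": [40, 120, 3, 80, 60],
    "mean_cnt": [0.50, 0.62, 0.95, 0.48, 0.55],
})

MIN_OBS = 30  # assumed threshold; sparser bins are treated as unreliable

# Drop sparse bins, then take the bin with the highest mean per habitat
reliable = bins_demo[bins_demo["n"] >= MIN_OBS]
best = reliable.loc[reliable.groupby("Location_Type")["mean_cnt"].idxmax()]
print(best[["Location_Type", "T_center", "H_center", "n", "mean_cnt"]])
```

In this toy table the Forest bin with the highest mean (0.95) rests on only three observations, so it is excluded and the well-sampled 21 °C bin is reported instead.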

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here-Yes, these insights can have a positive impact. For example, if the goal is to optimize birdwatching tours, habitat restoration, or environmental research, knowing the temperature–humidity “sweet spots” for activity can guide scheduling, marketing, or conservation actions.
There is no direct negative growth insight, but ignoring environmental ranges where bird activity is lower (e.g., extreme temperatures or very low humidity) could result in inefficient resource allocation.

#### Chart - 11

In [None]:
# Chart - 11 visualization code

In [None]:
# ===== Chart 11: Pareto of Observer Contributions =====
# Goal: Show which observers contribute the most activity.
#       Bars = total bird activity recorded by each observer (descending),
#       Line = cumulative percentage of total (Pareto curve).
#       Insight: concentration of effort / potential observer bias.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

tmp = df.copy()

# Coerce activity to numeric (booleans and boolean-like strings become 1/0)
cnt_map = {True: 1, False: 0, 'TRUE': 1, 'True': 1, 'FALSE': 0, 'False': 0}
tmp['Initial_Three_Min_Cnt'] = pd.to_numeric(
    tmp['Initial_Three_Min_Cnt'].replace(cnt_map), errors='coerce'
)

# Keep required columns, drop missing
need_cols = ['Observer', 'Initial_Three_Min_Cnt']
tmp = tmp[need_cols].dropna()

# Aggregate total activity per observer
obs_sum = (tmp.groupby('Observer', as_index=False)['Initial_Three_Min_Cnt']
           .sum()
           .rename(columns={'Initial_Three_Min_Cnt': 'total_activity'}))

# Sort descending and compute cumulative %
obs_sum = obs_sum.sort_values('total_activity', ascending=False).reset_index(drop=True)
obs_sum['cum_pct'] = obs_sum['total_activity'].cumsum() / obs_sum['total_activity'].sum() * 100

# (Optional) focus on top N observers for readability
N = 20
top_obs = obs_sum.head(N)

# --- Plot: Pareto (bars + cumulative % line on secondary axis) ---
fig, ax1 = plt.subplots(figsize=(12, 6))

# Bars
bars = ax1.bar(range(len(top_obs)), top_obs['total_activity'])
ax1.set_xlabel('Observer (Top {0})'.format(N))
ax1.set_ylabel('Total Bird Activity (Initial 3 Min)')
ax1.set_xticks(range(len(top_obs)))
ax1.set_xticklabels(top_obs['Observer'], rotation=45, ha='right')

# Cumulative % line
ax2 = ax1.twinx()
ax2.plot(range(len(top_obs)), top_obs['cum_pct'], marker='o')
ax2.set_ylabel('Cumulative % of Total Activity')
ax2.set_ylim(0, 110)
ax2.axhline(80, linestyle='--')  # 80/20 reference line

plt.title('Pareto Analysis: Observer Contribution to Total Bird Activity')
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here-I chose a Pareto chart because it effectively highlights the “vital few” observers contributing the majority of bird activity records. This makes it easier to identify potential biases in the dataset and understand where observation efforts are most concentrated.

##### 2. What is/are the insight(s) found from the chart?

Answer Here-A small number of observers (Elizabeth Oswald, Kimberly Serno, and Brian Swimelar) account for nearly all bird activity records.
The cumulative percentage curve shows that the top two observers alone contribute over 80% of the total recorded activity, indicating a high concentration of observations among a few individuals.
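
The 80% reading from the Pareto curve can be computed directly instead of being read off the chart. The sketch below uses hypothetical observer labels and totals (not the real observers or their counts) to show one way to find how many observers are needed to reach a given cumulative share.

```python
import pandas as pd

# Hypothetical totals per observer (labels and numbers are made up)
totals = pd.Series({
    "Observer A": 500,
    "Observer B": 350,
    "Observer C": 100,
    "Observer D": 50,
})

# Sort descending and compute the cumulative percentage, as in the chart
sorted_totals = totals.sort_values(ascending=False)
cum_pct = sorted_totals.cumsum() / sorted_totals.sum() * 100

# Number of observers needed to reach at least 80% of all activity:
# count those still below 80%, then add the one that crosses the line
THRESHOLD = 80
n_for_threshold = int((cum_pct < THRESHOLD).sum()) + 1

print(cum_pct.round(1))
print(f"Observers needed for {THRESHOLD}%: {n_for_threshold}")
```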

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here-Positive Impact:
These insights can guide resource allocation and training by showing which observers are most active. Organizations could leverage the expertise of top contributors to mentor others, improving data quality and coverage.
Negative Impact Risk:
Heavy reliance on a small group of observers can create bias in spatial or temporal coverage, potentially skewing research conclusions. If these few observers stop contributing, the volume and diversity of data could decline significantly. Diversifying observer participation is crucial to mitigate this risk.

#### Chart - 12

In [None]:
# Chart - 12 visualization code

In [None]:
# ===== Chart 12: Rose Diagram (Polar Bar) — Bird Activity by Hour of Day (robust) =====
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

tmp = df.copy()

# 1) Safe numeric conversion for Initial_Three_Min_Cnt (handles True/False/'TRUE'/'False'/etc.)
bool_map = {True: 1, False: 0, 'TRUE': 1, 'True': 1, 'FALSE': 0, 'False': 0}
# try mapping booleans first, then numeric coerce
col = tmp['Initial_Three_Min_Cnt'].map(bool_map).where(
    tmp['Initial_Three_Min_Cnt'].isin(bool_map), other=tmp['Initial_Three_Min_Cnt']
)
tmp['Initial_Three_Min_Cnt'] = pd.to_numeric(col, errors='coerce')

# 2) Extract hour from Start_Time
tmp['Start_Time'] = pd.to_datetime(tmp['Start_Time'], errors='coerce')
tmp = tmp.dropna(subset=['Start_Time'])
tmp['hour'] = tmp['Start_Time'].dt.hour

# 3) Primary metric: mean Initial_Three_Min_Cnt by hour
hourly_mean = tmp.groupby('hour')['Initial_Three_Min_Cnt'].mean()

# 4) Fallback if the hourly means are all NaN: use normalized hourly observation counts instead
if hourly_mean.notna().sum() == 0:
    hourly_counts = tmp.groupby('hour').size().reindex(range(24), fill_value=0).astype(float)
    # normalize to 0–1 so the rose shape is visible regardless of dataset size
    if (hourly_counts.max() - hourly_counts.min()) > 0:
        hourly_values = (hourly_counts - hourly_counts.min()) / (hourly_counts.max() - hourly_counts.min())
    else:
        hourly_values = hourly_counts / (hourly_counts.max() if hourly_counts.max() != 0 else 1.0)
else:
    hourly_values = hourly_mean.reindex(range(24), fill_value=np.nan)

# Ensure finite numbers for plotting
r = np.nan_to_num(hourly_values.values, nan=0.0)

# 5) Build polar bars
theta = np.linspace(0.0, 2*np.pi, 24, endpoint=False)
width = 2*np.pi / 24

plt.figure(figsize=(8, 8))
ax = plt.subplot(111, polar=True)
bars = ax.bar(theta, r, width=width, align='edge', edgecolor='white')

# Color scale based on height (avoid all-NaN / all-zero issues)
den = r.max() if r.max() > 0 else 1.0
for b, val in zip(bars, r):
    b.set_facecolor(plt.cm.Blues(0.3 + 0.7*(val/den)))

ax.set_theta_direction(-1)            # clockwise
ax.set_theta_offset(np.pi/2)          # start at top (midnight)
ax.set_xticks(theta)                  # hour ticks
ax.set_xticklabels([f'{h:02d}:00' for h in range(24)], fontsize=8)
ax.set_yticklabels([])                # hide radial labels
ax.grid(alpha=0.3)

title_metric = "mean Initial_Three_Min_Cnt" if hourly_mean.notna().sum() > 0 else "normalized hourly observation count"
plt.title(f'Rose Diagram: Bird Activity by Hour of Day\n(metric: {title_metric})', y=1.08)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

Answer Here-I chose a rose diagram because it arranges the 24 hours of the day in a circular, clock-like layout, which emphasizes daily cycles. Each bar’s length represents the relative activity for that hour; in this run the chart fell back to the normalized hourly observation count as its metric.

##### 2. What is/are the insight(s) found from the chart?

Answer Here-All bars are absent, meaning no rows with a valid time remained after processing. Because the code drops every row whose Start_Time cannot be parsed, this points to Start_Time values that are missing or stored in a format the parser did not recognize, or to no recorded observations in the processed file.
The result highlights a possible data quality issue rather than an actual pattern in bird activity.
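
One possible cause of the empty chart is the storage format of `Start_Time`: any value that `pd.to_datetime(..., errors='coerce')` cannot interpret becomes `NaT` and is then dropped. The sketch below is a hedged alternative for extracting the hour. It assumes the column may hold `datetime.time` objects, `HH:MM[:SS]` strings, or full timestamps; the source does not state the actual format, and the helper name `extract_hour` is hypothetical.

```python
import datetime as dt

import pandas as pd

def _hour_of(value):
    """Return the hour (0-23) of a single value as a float, or NaN if unknown."""
    if pd.isna(value):
        return float("nan")
    # datetime.time and datetime.datetime (including pd.Timestamp) expose .hour
    if isinstance(value, (dt.time, dt.datetime)):
        return float(value.hour)
    text = str(value).strip()
    # Plain time-of-day strings
    for fmt in ("%H:%M:%S", "%H:%M"):
        try:
            return float(dt.datetime.strptime(text, fmt).hour)
        except ValueError:
            continue
    # Last resort: let pandas try to parse a full date-time string
    parsed = pd.to_datetime(text, errors="coerce")
    return float(parsed.hour) if pd.notna(parsed) else float("nan")

def extract_hour(values: pd.Series) -> pd.Series:
    """Vectorized wrapper: hour of day for each entry, NaN where parsing fails."""
    return values.map(_hour_of).astype(float)

# Mixed hypothetical inputs covering the assumed formats
sample = pd.Series(
    [dt.time(5, 35), "06:10:00", "2018-05-22 07:45", None, "n/a"],
    dtype=object,
)
hours = extract_hour(sample)
print(hours)
```

Checking `hours.isna().mean()` before plotting shows how many rows would be lost, which makes a parsing problem visible before it produces an empty chart.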

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here-No. In its current form, the chart provides no actionable insight because the rose diagram is empty. This outcome points to missing or improperly processed time data rather than an actual pattern of bird activity. The indirect benefit is that it exposes a data quality gap; closing that gap will improve the reliability of future analyses and reports.
The absence of data could also lead to negative growth if ignored:

Field survey schedules or resource allocation decisions based on incomplete time data could miss peak activity periods, lowering survey efficiency and accuracy.
It reflects a data management inefficiency that can delay the delivery of insights, reduce stakeholder confidence, and potentially lead to misinformed operational or conservation decisions.

#### Chart - 13

In [None]:
# Chart - 13 visualization code

In [None]:
# ============================== Chart – 13: Radar (Spider) Chart ==============================
# Compare key metrics between Forest and Grassland in one view

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# --- Helpers / cleaning used just for this chart ---
tmp = df.copy()

# Initial_Three_Min_Cnt may be bool/strings; coerce to numeric
bool_map = {True: 1, False: 0, 'TRUE': 1, 'FALSE': 0, 'True': 1, 'False': 0}
tmp['Initial_Three_Min_Cnt'] = (
    pd.to_numeric(tmp['Initial_Three_Min_Cnt'].replace(bool_map), errors='coerce')
)

# Normalize some text columns to avoid tiny spelling variants
for c in ['Location_Type','ID_Method','Distance','Flyover_Observed']:
    if c in tmp:
        tmp[c] = tmp[c].astype(str).str.strip().str.title()

# Keep only the two location types we’re comparing
tmp = tmp[tmp['Location_Type'].isin(['Forest','Grassland'])].copy()

def share(series, mask):
    denom = series.shape[0]
    return float(mask.sum())/denom if denom else np.nan

# --- Build KPIs per habitat ---
kpis = []
for habitat, g in tmp.groupby('Location_Type'):
    kpis.append({
        'Habitat': habitat,
        'Avg Bird Count (Initial 3 Min)': g['Initial_Three_Min_Cnt'].mean(skipna=True),
        '% Singing': share(g['ID_Method'], g['ID_Method'].eq('Singing')),
        '% Calling': share(g['ID_Method'], g['ID_Method'].eq('Calling')),
        '% Visualization': share(g['ID_Method'], g['ID_Method'].str.contains('Visual', na=False)),
        '% Flyover': share(g['Flyover_Observed'], g['Flyover_Observed'].isin(['True','Yes'])),
        '% ≤50m Distance': share(g['Distance'], g['Distance'].str.contains('<= 50', na=False))
    })

radar_df = pd.DataFrame(kpis).set_index('Habitat')

# Scale metrics to 0–1 so different units fit on one radar
scaled = radar_df.copy()
scaled['Avg Bird Count (Initial 3 Min)'] = scaled['Avg Bird Count (Initial 3 Min)'] / scaled['Avg Bird Count (Initial 3 Min)'].max()

# Radar categories (order you want them to appear around the circle)
cats = [
    'Avg Bird Count (Initial 3 Min)',
    '% Singing',
    '% Calling',
    '% Visualization',
    '% Flyover',
    '% ≤50m Distance'
]

# Close the radar loop by repeating first angle/value at the end
angles = np.linspace(0, 2*np.pi, len(cats), endpoint=False).tolist()
angles += angles[:1]

def values_for(hab):
    vals = scaled.loc[hab, cats].values.astype(float).tolist()
    return vals + vals[:1]

forest_vals = values_for('Forest') if 'Forest' in scaled.index else None
grass_vals = values_for('Grassland') if 'Grassland' in scaled.index else None

# --- Plot ---
plt.figure(figsize=(8, 8))
ax = plt.subplot(111, polar=True)

ax.set_theta_offset(np.pi / 2)
ax.set_theta_direction(-1)

ax.set_xticks(angles[:-1])
ax.set_xticklabels(cats, fontsize=10)

ax.set_rlabel_position(0)
ax.set_yticks([0.25, 0.5, 0.75, 1.0])
ax.set_yticklabels(['0.25','0.50','0.75','1.00'], fontsize=9)
ax.set_ylim(0, 1.05)

if forest_vals:
    ax.plot(angles, forest_vals, linewidth=2, linestyle='-', label='Forest')
    ax.fill(angles, forest_vals, alpha=0.15)

if grass_vals:
    ax.plot(angles, grass_vals, linewidth=2, linestyle='--', label='Grassland')
    ax.fill(angles, grass_vals, alpha=0.10)

plt.title('Radar Comparison of Key Metrics: Forest vs Grassland', pad=20)
plt.legend(loc='upper right', bbox_to_anchor=(1.25, 1.10))
plt.tight_layout()
plt.show()

# ----------------------------------------------------------------------------------------------
# Reading tip:
# • All spokes are scaled 0–1. Higher values mean higher average/percentage for that metric.
# • “Avg Bird Count” is rescaled by the max across habitats so it’s comparable on the same axes.


##### 1. Why did you pick the specific chart?

Answer Here-I chose this radar chart because it allows for a multi-metric visual comparison between forest and grassland habitats in a single view. It clearly highlights strengths and differences across various bird activity metrics, making it easy to identify habitat-specific trends that may not be apparent in individual charts.

##### 2. What is/are the insight(s) found from the chart?

Answer Here-Forest habitats have a slightly higher Average Bird Count and % Calling.
Grasslands show slightly higher % Visualization and % Flyover, likely due to open terrain.
% Singing is similar for both habitats, suggesting that vocal detection rates are consistent regardless of habitat type.
% ≤50m Distance is higher in forests, indicating closer proximity detections there.
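
The percentage spokes of the radar chart are row-wise shares of each identification method within a habitat. As a cross-check, the same shares can be obtained in one step with `pd.crosstab(..., normalize="index")`. The sketch below uses a small hypothetical table (made-up rows; only the column names follow the dataset).

```python
import pandas as pd

# Hypothetical observations (illustrative only)
demo = pd.DataFrame({
    "Location_Type": ["Forest"] * 4 + ["Grassland"] * 4,
    "ID_Method": ["Singing", "Calling", "Singing", "Visualization",
                  "Singing", "Visualization", "Visualization", "Calling"],
})

# Share of each ID method within each habitat (each row sums to 1)
shares = pd.crosstab(demo["Location_Type"], demo["ID_Method"], normalize="index")

# Positive values: method is relatively more common in Forest than in Grassland
difference = shares.loc["Forest"] - shares.loc["Grassland"]

print(shares)
print(difference)
```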

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Answer Here-Yes. These insights can guide habitat-specific survey planning, optimize resource allocation for bird monitoring, and support targeted conservation efforts, which can improve efficiency and impact in ecological projects.
No direct indicators of negative growth were observed. However, the slightly lower Average Bird Count in grasslands compared to forests may reflect less dense bird populations or lower detectability there; a single habitat comparison cannot show a decline over time, but it may warrant additional monitoring focus.

#### Chart - 14 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code

In [None]:
# ===== Chart 14 — Correlation Heatmap =====
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# -------- 1) Build a numeric-only view (coerce safely) --------
tmp = df.copy()

# Booleans/strings that act like booleans → numeric
bool_map = {
    True: 1, False: 0,
    "TRUE": 1, "FALSE": 0,
    "True": 1, "False": 0,
    "Yes": 1, "No": 0, "yes": 1, "no": 0
}
for col in ["Flyover_Observed", "PIF_Watchlist_Status", "Regional_Stewardship_Status"]:
    if col in tmp.columns:
        tmp[col] = tmp[col].replace(bool_map)

# Distances like "<= 50 Meters" → approximate numeric buckets (meters)
if "Distance" in tmp.columns:
    dist_map = {
        "<= 50 Meters": 25,
        "50 - 100 Meters": 75,
        "> 100 Meters": 125
    }
    tmp["Distance_m"] = tmp["Distance"].map(dist_map)

# Times → numeric hour (0–23)
for tcol in ["Start_Time", "End_Time"]:
    if tcol in tmp.columns:
        tmp[tcol] = pd.to_datetime(tmp[tcol], errors="coerce")
        tmp[tcol + "_Hour"] = tmp[tcol].dt.hour

# Coerce selected likely-numeric columns
likely_numeric = [
    "Year", "Temperature", "Humidity", "Wind",
    "AcceptedTSN", "NPSTaxonCode", "TaxonCode",
    "Previously_Obs", "Initial_Three_Min_Cnt",
    "Distance_m", "Start_Time_Hour", "End_Time_Hour"
]
present = [c for c in likely_numeric if c in tmp.columns]
tmp[present] = tmp[present].apply(pd.to_numeric, errors="coerce")

# Optionally, add simple encodings for a couple of categorical fields (if you want them correlated)
# Example: ID_Method, Location_Type → one-hot then include means via numeric aggregation (optional)
# Skipping one-hot here to keep the heatmap focused and readable.

num_df = tmp[present].copy()

# Drop all-empty columns (in case some weren’t present/coercible)
num_df = num_df.dropna(axis=1, how="all")

# If fewer than two usable numeric columns remain (rare), bail gracefully
if num_df.shape[1] < 2:
    print("Not enough numeric columns to compute a correlation matrix.")
else:
    # -------- 2) Correlation --------
    corr = num_df.corr()

    # -------- 3) Plot (Matplotlib only, no custom colors) --------
    fig, ax = plt.subplots(figsize=(9, 7))
    im = ax.imshow(corr, aspect="auto", vmin=-1, vmax=1)  # default colormap, scale fixed to the full correlation range

    # Axes ticks and labels
    ax.set_xticks(np.arange(corr.shape[1]))
    ax.set_yticks(np.arange(corr.shape[0]))
    ax.set_xticklabels(corr.columns, rotation=45, ha="right")
    ax.set_yticklabels(corr.index)

    # Colorbar
    cbar = plt.colorbar(im, ax=ax)
    cbar.set_label("Pearson correlation", rotation=90)

    ax.set_title("Correlation Heatmap (numeric features)")
    fig.tight_layout()
    plt.show()


##### 1. Why did you pick the specific chart?

Answer Here-I chose the correlation heatmap because it is one of the most effective ways to quickly identify relationships between multiple numeric variables in the dataset. It visually highlights both strong and weak correlations, making it easier to decide which variables may influence each other and guide further analysis or feature selection.

##### 2. What is/are the insight(s) found from the chart?

Answer Here-Temperature and Humidity show a slight negative correlation, meaning as one increases, the other tends to decrease.
AcceptedTSN and NPSTaxonCode have a strong positive correlation. Both are taxonomic identifier codes rather than measurements, so this most likely reflects how the codes are assigned to species, not an ecological relationship.
Most other numeric features show weak correlations with each other, indicating they capture largely independent aspects of the data.
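
Reading individual cells off a heatmap is error-prone, so it can help to list the variable pairs ranked by the strength of their correlation. The sketch below is a minimal example on hypothetical values (the `demo` frame is made up). It visits each unordered pair once and sorts by absolute Pearson correlation.

```python
from itertools import combinations

import pandas as pd

# Hypothetical measurements (illustrative only)
demo = pd.DataFrame({
    "Temperature": [12.0, 15.0, 18.0, 21.0, 24.0, 27.0],
    "Humidity":    [88.0, 80.0, 79.0, 70.0, 66.0, 60.0],
    "Wind":        [2.0, 1.0, 3.0, 2.0, 1.0, 3.0],
})

corr = demo.corr()  # Pearson correlation by default

# One entry per unordered pair of columns (skips the diagonal and mirrored cells)
pairs = pd.Series({
    (a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)
})

# Strongest relationships first, regardless of sign
ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked.round(2))
```

Sorting by absolute value keeps strong negative relationships, such as the temperature and humidity pair in this toy data, at the top of the list.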

#### Chart - 15 - Pair Plot

In [None]:
# Pair Plot visualization code

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Check available columns
print(df.columns)

# Select candidate numeric columns, keeping only those that exist in the dataset
candidate_cols = ['Temperature', 'Humidity', 'Initial_Three_Min_Cnt',
                  'AcceptedTSN', 'NPSTaxonCode', 'TaxonCode']
numeric_cols = [c for c in candidate_cols if c in df.columns]

# Create the pair plot
sns.pairplot(data=df[numeric_cols], diag_kind='kde', plot_kws={'alpha': 0.6})

# Add a title
plt.suptitle("Pair Plot of Selected Numeric Features", y=1.02, fontsize=14)
plt.show()



##### 1. Why did you pick the specific chart?

Answer Here-I chose the pair plot because it allows simultaneous visualization of relationships between multiple numeric variables. It provides both scatter plots for pairwise relationships and distribution plots for individual features, helping detect trends, correlations, and outliers in one view.

##### 2. What is/are the insight(s) found from the chart?

Answer Here-Temperature and humidity appear to have a negative relationship — as temperature increases, humidity generally decreases.
Identifier columns such as AcceptedTSN, NPSTaxonCode, and TaxonCode take discrete, clustered values, which suggests they are categorical codes rather than continuous measurements.
The Initial_Three_Min_Cnt is heavily skewed towards zero, indicating lower bird counts in most observations.
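
The skew towards zero seen on the diagonal of the pair plot can be summarized with two numbers: the share of zero counts and the sample skewness. The sketch below uses a hypothetical series of counts (made-up values) purely to illustrate the calculation.

```python
import pandas as pd

# Hypothetical counts (illustrative only): mostly zeros with a few larger values
counts = pd.Series([0, 0, 0, 0, 0, 0, 1, 1, 2, 5])

zero_share = (counts == 0).mean()  # fraction of observations with a zero count
skewness = counts.skew()           # positive values indicate a long right tail

print(f"Share of zero counts: {zero_share:.0%}")
print(f"Skewness: {skewness:.2f}")
```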

## **5. Solution to Business Objective**

#### What do you suggest the client to achieve Business Objective ?
Explain Briefly.

Answer Here-In brief, most bird activity data comes from a few top observers, so training and retaining them, while recruiting additional observers, can improve coverage. Forest habitats generally show higher bird counts and closer sightings, so conservation efforts there may yield better outcomes. Timing observations to peak hours could increase data collection, but the hour-of-day chart was empty, so the time data must be repaired before peak hours can be identified. Including environmental factors such as temperature and humidity in the analysis may improve prediction accuracy. Finally, standardizing observation methods across all observers will improve data consistency and reliability.

# **Conclusion**

Write the conclusion here-The analysis reveals that recorded bird activity is concentrated among a few top contributors, with Elizabeth Oswald, Kimberly Serno, and Brian Swimelar accounting for most observations. Forest habitats generally show higher bird counts, closer detection distances, and a higher share of call-based detections than grasslands, making them key focus areas for conservation and monitoring. Aligning survey timings with peak activity hours is likely to improve data capture, although the hour-of-day analysis returned no usable records and needs to be rerun once the time data is fixed. Environmental factors such as temperature and humidity show habitat-specific associations with bird activity, indicating their relevance for understanding and predicting trends. By supporting skilled observers while broadening participation, optimizing observation timing, prioritizing high-activity habitats, and standardizing data collection methods, the client can achieve more accurate insights, support targeted conservation efforts, and enhance long-term ecological monitoring.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***