# Day 04 – Weather Data Analyzer (Pandas + CSV)

This notebook loads a weather CSV, imputes missing values, finds the hottest day (overall and per city), computes the average temperature per city, and plots the result.

## 1) Imports and settings

We use Pandas for data handling and Matplotlib for plotting.

In [None]:
import pandas as pd
import numpy as np
from pathlib import Path
import matplotlib.pyplot as plt

DATA_FILE = Path('weather.csv')  # preferred filename
SAMPLE_FILE = Path('sample_weather.csv')

pd.options.display.max_rows = 20
pd.options.display.max_columns = 10


## 2) Load data

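If `weather.csv` exists, it is used; otherwise the sample CSV is copied to `weather.csv` first. After loading, column names are stripped of whitespace and lowercased, `date` is parsed to datetime, and `temp` and `humidity` are coerced to numeric. Values that cannot be parsed become `NaT` or `NaN` instead of raising an error.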

In [None]:
# Load CSV (use user's weather.csv if present; otherwise use sample)
path = DATA_FILE
if not path.exists():
    if SAMPLE_FILE.exists():
        df = pd.read_csv(SAMPLE_FILE)
        df.to_csv(path, index=False)
        print("No weather.csv found — copied sample_weather.csv to weather.csv")
    else:
        raise FileNotFoundError("No weather.csv or sample_weather.csv found in folder.")

df = pd.read_csv(path)
# normalize columns
df.columns = [c.strip().lower() for c in df.columns]
# parse date
if 'date' in df.columns:
    df['date'] = pd.to_datetime(df['date'], errors='coerce')
# coerce numeric
for col in ['temp','humidity']:
    if col in df.columns:
        df[col] = pd.to_numeric(df[col], errors='coerce')

print('Loaded rows:', len(df))
df.head()
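
To see what the normalization and coercion steps do, here is a minimal sketch on a hypothetical in-memory CSV. The city name and values are invented for illustration; only the column names follow the notebook.

```python
import io
import pandas as pd

# Hypothetical raw CSV: untidy headers, one unparseable date, one non-numeric temp
raw = io.StringIO(
    " Date ,City, Temp ,Humidity\n"
    "2024-01-01,Pune,31.5,40\n"
    "not a date,Pune,hot,55\n"
)
demo = pd.read_csv(raw)

# Same steps as the loading cell: strip and lowercase headers, then coerce types
demo.columns = [c.strip().lower() for c in demo.columns]
demo['date'] = pd.to_datetime(demo['date'], errors='coerce')
for col in ['temp', 'humidity']:
    demo[col] = pd.to_numeric(demo[col], errors='coerce')

print(demo.dtypes)
print(demo)
```

The second row ends up with `NaT` in `date` and `NaN` in `temp`, which is why the missing-value steps that follow matter even for files that look complete.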

## 3) Inspect missing values

Count the missing values in each column, then preview the first 10 rows that contain at least one missing value.

In [None]:
# Missing values summary
missing_summary = df.isna().sum()
print(missing_summary)

# Show rows with any missing values (first 10)
df[df.isna().any(axis=1)].head(10)
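
Raw counts are easier to judge next to the share of rows they represent. A minimal sketch, assuming a small hypothetical frame (`demo` and its values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in temp and humidity
demo = pd.DataFrame({
    'city': ['Pune', 'Pune', 'Delhi', 'Delhi'],
    'temp': [30.0, np.nan, 35.0, np.nan],
    'humidity': [40.0, 50.0, np.nan, 60.0],
})

# isna().mean() is the fraction of missing values per column
missing = pd.DataFrame({
    'missing': demo.isna().sum(),
    'percent': (demo.isna().mean() * 100).round(1),
})
print(missing)
```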

## 4) Handle missing values

Strategy used here:
- For `temp`: fill missing values with **mean temperature per city** (fallback to global mean if city mean is NaN).
- For `humidity`: fill missing values with **median humidity per city** (fallback to global median if needed).

You can swap in a different strategy (dropping rows, forward-fill, interpolation) depending on your use case.
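
The alternative strategies could look roughly like this. This is a sketch on a hypothetical single-city frame, not part of the notebook's pipeline; forward-fill and interpolation assume the rows are ordered by date, so the frame is sorted first.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two consecutive days with no temperature reading
demo = pd.DataFrame({
    'date': pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-03', '2024-01-04']),
    'city': ['Pune'] * 4,
    'temp': [30.0, np.nan, np.nan, 33.0],
})
ordered = demo.sort_values('date')

# Option 1: drop rows that have no temperature
dropped = demo.dropna(subset=['temp'])

# Option 2: carry the last known value forward within each city
ffilled = ordered.groupby('city')['temp'].ffill()

# Option 3: linear interpolation between known values within each city
interp = ordered.groupby('city')['temp'].transform(lambda s: s.interpolate())

print(len(dropped), ffilled.tolist(), interp.tolist())
```

Dropping loses rows, forward-fill repeats stale readings, and interpolation assumes a smooth change between readings, so the right choice depends on how the data will be used.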

In [None]:
# Fill missing temp by city mean, fallback to global mean
if 'temp' in df.columns and 'city' in df.columns:
    city_temp_mean = df.groupby('city')['temp'].transform('mean')
    global_temp_mean = df['temp'].mean()
    df['temp'] = df['temp'].fillna(city_temp_mean)
    df['temp'] = df['temp'].fillna(global_temp_mean)

# Fill missing humidity by city median, fallback to global median
if 'humidity' in df.columns and 'city' in df.columns:
    city_hum_median = df.groupby('city')['humidity'].transform('median')
    global_hum_median = df['humidity'].median()
    df['humidity'] = df['humidity'].fillna(city_hum_median)
    df['humidity'] = df['humidity'].fillna(global_hum_median)

print('After imputation, missing values:')
print(df.isna().sum())

df.head()
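
The two-step fill is easiest to follow on a small example. In this sketch (hypothetical cities and values), Pune has readings to average, while Delhi has none, so its city mean is `NaN` and the global mean is used instead.

```python
import numpy as np
import pandas as pd

# Hypothetical data: Pune has one gap, Delhi has no readings at all
demo = pd.DataFrame({
    'city': ['Pune', 'Pune', 'Pune', 'Delhi', 'Mumbai'],
    'temp': [30.0, 34.0, np.nan, np.nan, 20.0],
})

# transform('mean') returns one value per row, aligned with the original index
city_mean = demo.groupby('city')['temp'].transform('mean')
global_mean = demo['temp'].mean()  # computed before any filling

filled = demo['temp'].fillna(city_mean).fillna(global_mean)
print(filled.tolist())
```

The Pune gap is filled with the Pune mean (32.0), and the Delhi gap with the global mean of the observed values (28.0).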

## 5) Find the hottest day (overall)

We locate the row with the maximum temperature.

In [None]:
# Hottest day overall
if 'temp' in df.columns and df['temp'].notna().any():
    hottest_idx = df['temp'].idxmax()  # label of the first row with the maximum temp
    # Double brackets return a one-row DataFrame that keeps the original dtypes
    hottest_row = df.loc[[hottest_idx]]
    display(hottest_row)
    # Save as CSV
    hottest_row.to_csv('hottest_day_overall.csv', index=False)
else:
    print('No usable temp column present.')
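
`idxmax` returns only the first row that holds the maximum, so tied days are not reported. A minimal sketch on a hypothetical three-row frame (values invented for illustration) contrasts that with selecting every row equal to the maximum:

```python
import pandas as pd

# Hypothetical data: two days tie at 41.0
demo = pd.DataFrame({
    'date': pd.to_datetime(['2024-05-01', '2024-05-02', '2024-05-03']),
    'city': ['Pune', 'Delhi', 'Pune'],
    'temp': [38.0, 41.0, 41.0],
})

# First row with the maximum temperature
first_hottest = demo.loc[demo['temp'].idxmax()]

# Every row that reaches the maximum temperature
all_hottest = demo[demo['temp'] == demo['temp'].max()]

print(first_hottest['city'], len(all_hottest))
```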

## 6) Hottest day per city

Group by city and find the day with the maximum temperature for each city.

In [None]:
# Hottest day per city
hottest_per_city = None
if 'temp' in df.columns and 'city' in df.columns:
    # index of max temp per city
    idx = df.groupby('city')['temp'].idxmax()
    hottest_per_city = df.loc[idx].sort_values('city').reset_index(drop=True)
    display(hottest_per_city)
    hottest_per_city.to_csv('hottest_day_per_city.csv', index=False)
else:
    print('Required columns (city, temp) not found.')
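
As a worked example of the per-city lookup, the sketch below applies the same `groupby(...).idxmax()` pattern to a hypothetical four-row frame. `idxmax` yields one index label per city, and `loc` then pulls the full rows.

```python
import pandas as pd

# Hypothetical data: two readings per city
demo = pd.DataFrame({
    'date': pd.to_datetime(['2024-05-01', '2024-05-02', '2024-05-03', '2024-05-04']),
    'city': ['Pune', 'Delhi', 'Pune', 'Delhi'],
    'temp': [38.0, 41.0, 39.5, 40.0],
})

# Index label of the hottest row within each city
idx = demo.groupby('city')['temp'].idxmax()

# Full rows for those labels, ordered by city
per_city = demo.loc[idx].sort_values('city').reset_index(drop=True)
print(per_city)
```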

## 7) Average temperature per city

Compute the mean temperature for each city, rounded to two decimals and sorted from hottest to coolest, and save the result to `avg_temp_by_city.csv`.

In [None]:
# Average temp per city
avg_temp = None
if 'temp' in df.columns and 'city' in df.columns:
    avg_temp = df.groupby('city')['temp'].mean().round(2).sort_values(ascending=False)
    display(avg_temp)
    avg_temp.to_csv('avg_temp_by_city.csv', header=['avg_temp'])
else:
    print('Required columns (city, temp) not found.')
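
An average says little without the number of readings behind it. As an optional variation (not what the cell above does), named aggregation can return the mean and the row count together; the data below is hypothetical.

```python
import pandas as pd

# Hypothetical data: two readings for Pune, three for Delhi
demo = pd.DataFrame({
    'city': ['Pune', 'Pune', 'Delhi', 'Delhi', 'Delhi'],
    'temp': [38.0, 39.5, 41.0, 40.0, 42.0],
})

# One row per city: mean temperature and number of non-missing readings
summary = (
    demo.groupby('city')['temp']
    .agg(avg_temp='mean', days='count')
    .round({'avg_temp': 2})
    .sort_values('avg_temp', ascending=False)
)
print(summary)
```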

## 8) Plot average temperature by city (Matplotlib)

A bar chart of the per-city averages from the previous section, drawn with Matplotlib's default colors.

In [None]:
# Plot average temp by city
if avg_temp is not None:
    ax = avg_temp.plot(kind='bar')
    ax.set_title('Average Temperature by City')
    ax.set_ylabel('Temperature (°C)')
    ax.set_xlabel('City')
    plt.tight_layout()
    plt.show()
else:
    print('No average temperature data available to plot.')

## 9) Save cleaned dataset (optional)

Save the cleaned and imputed dataset for further use.

In [None]:
# Save cleaned dataset
df.to_csv('weather_cleaned.csv', index=False)
print(f'Saved weather_cleaned.csv (rows: {len(df)})')
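
CSV stores dates as text, so the datetime dtype is lost on save. When reloading the cleaned file, `parse_dates` restores it. The sketch below uses an in-memory buffer and hypothetical rows in place of `weather_cleaned.csv`.

```python
import io
import pandas as pd

# Hypothetical cleaned data
demo = pd.DataFrame({
    'date': pd.to_datetime(['2024-01-01', '2024-01-02']),
    'city': ['Pune', 'Delhi'],
    'temp': [30.0, 35.5],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
demo.to_csv(buffer, index=False)

# A plain reload leaves the dates as text
buffer.seek(0)
plain = pd.read_csv(buffer)

# parse_dates converts the named column back to datetime
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=['date'])

print(plain['date'].dtype, parsed['date'].dtype)
```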