# 📊 Analyzing simple weather data 
**Objective**: 

- Summarize weather patterns by city and season to support high-level operational planning.
- Identify time periods or cities with extreme or volatile weather conditions that may affect public services or infrastructure readiness.
- Detect trends and shifts in climate patterns using rolling averages and seasonal decomposition.
- Explore relationships between temperature, humidity, wind, and precipitation to uncover potential drivers of extreme conditions.
- Present insights visually and narratively in a way that is clear, accessible, and decision-focused.

**Dataset(s)**: 

Found on Kaggle ([source](https://www.kaggle.com/datasets/prasad22/weather-data)), this dataset contains 1 million records in a 99 MB `CSV` file. 

Each record contains a location, a timestamp, temperature (°C), humidity (%), precipitation (mm), and wind speed (km/h).

**Tools Used**:

Python
- pandas
- matplotlib
- numpy
- seaborn

## 1. 🧭 Introduction

This project explores a fictional but realistic dataset containing weather measurements across multiple cities, including temperature, humidity, precipitation, wind speed, and timestamped records. The goal is to simulate a real-world data analysis workflow that supports actionable insights for stakeholders, such as city planners, business operators, or logistics managers.

Through this project, I demonstrate core competencies in exploratory data analysis (EDA), time-series analysis, data visualization, and stakeholder-style reporting. Emphasis is placed on identifying temporal trends, geographic differences, extreme events, and meaningful correlations between weather variables.

In [1]:
# Add the project root to sys.path
import sys
import os
sys.path.append(os.path.abspath(".."))

# Import packages and variables
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
from IPython.display import HTML
from projectconfig.config import raw_data_path, processed_data_path

# Visualization settings
sns.set(style="whitegrid")
plt.rcParams["figure.figsize"] = (10, 6)

## 2. 🗂️ Data Understanding

In [2]:
# Load data
df = pd.read_csv(raw_data_path, parse_dates=['Date_Time'])

# Preview data
df.head()

| | Location | Date_Time | Temperature_C | Humidity_pct | Precipitation_mm | Wind_Speed_kmh |
|---|---|---|---|---|---|---|
| 0 | San Diego | 2024-01-14 21:12:46 | 10.683001 | 41.195754 | 4.020119 | 8.23354 |
| 1 | San Diego | 2024-05-17 15:22:10 | 8.73414 | 58.319107 | 9.111623 | 27.715161 |
| 2 | San Diego | 2024-05-11 09:30:59 | 11.632436 | 38.820175 | 4.607511 | 28.732951 |
| 3 | Philadelphia | 2024-02-26 17:32:39 | -8.628976 | 54.074474 | 3.18372 | 26.367303 |
| 4 | San Antonio | 2024-04-29 13:23:51 | 39.808213 | 72.899908 | 9.598282 | 29.898622 |


In [6]:
# Structure and summary
df.info()
df.describe(include='all')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 6 columns):
 #   Column            Non-Null Count    Dtype         
---  ------            --------------    -----         
 0   Location          1000000 non-null  object        
 1   Date_Time         1000000 non-null  datetime64[ns]
 2   Temperature_C     1000000 non-null  float64       
 3   Humidity_pct      1000000 non-null  float64       
 4   Precipitation_mm  1000000 non-null  float64       
 5   Wind_Speed_kmh    1000000 non-null  float64       
dtypes: datetime64[ns](1), float64(4), object(1)
memory usage: 45.8+ MB


| | Location | Date_Time | Temperature_C | Humidity_pct | Precipitation_mm | Wind_Speed_kmh |
|---|---|---|---|---|---|---|
| count | 1000000 | 1000000 | 1000000.0 | 1000000.0 | 1000000.0 | 1000000.0 |
| unique | 10 | | | | | |
| top | Phoenix | | | | | |
| freq | 100209 | | | | | |
| mean | | 2024-03-10 10:40:58.896321792 | 14.779705 | 60.02183 | 5.109639 | 14.997598 |
| min | | 2024-01-01 00:00:06 | -19.969311 | 30.000009 | 9e-06 | 5.1e-05 |
| 25% | | 2024-02-04 16:28:23.750000128 | 2.269631 | 45.0085 | 2.580694 | 7.490101 |
| 50% | | 2024-03-10 11:43:28 | 14.778002 | 60.018708 | 5.109917 | 14.993777 |
| 75% | | 2024-04-14 03:51:32.500000 | 27.270489 | 75.043818 | 7.61375 | 22.51411 |
| max | | 2024-05-18 19:44:10 | 39.999801 | 89.999977 | 14.971583 | 29.999973 |


### Summary

There do not seem to be any issues with the data.

The data types make sense.

No record is missing a value in any column, and the range of each numeric column contains no unusual or impossible values.

One point worth noting is that the mean and median of each numeric column are nearly identical. In a real-world setting, I think that would warrant closer scrutiny, as such symmetrical weather data seems unlikely.
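The mean-versus-median observation can be checked more systematically. The sketch below is one possible way to do it, not part of the original analysis: `symmetry_report` is a hypothetical helper, and `sample` is a synthetic stand-in (random draws over roughly the temperature and humidity ranges shown above), not the real data. With the real `df`, the same function could be called directly.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the weather data (an assumption for illustration only).
rng = np.random.default_rng(0)
sample = pd.DataFrame({
    "Temperature_C": rng.uniform(-20, 40, 10_000),
    "Humidity_pct": rng.uniform(30, 90, 10_000),
})

def symmetry_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Compare mean and median, and report skewness, for each numeric column."""
    numeric = frame.select_dtypes(include="number")
    report = pd.DataFrame({
        "mean": numeric.mean(),
        "median": numeric.median(),
        "skew": numeric.skew(),
    })
    # Gap between mean and median, expressed as a percentage of the column's range
    report["mean_median_gap_pct"] = (
        (report["mean"] - report["median"]).abs()
        / (numeric.max() - numeric.min()) * 100
    )
    return report

print(symmetry_report(sample))
```

A skewness close to zero together with a small mean-median gap would support the impression of symmetry; it does not by itself say which distribution generated the values.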

## 3. 🧹 Data Cleaning & Preparation
- Handle missing values
- Convert data types
- Feature creation if needed
- Explain logic behind each step

In [None]:
# Rename columns to ease use
names = {
    'Location': 'location',
    'Date_Time': 'datetime',
    'Temperature_C': 'temperature',
    'Humidity_pct': 'humidity',
    'Precipitation_mm': 'precipitation',
    'Wind_Speed_kmh': 'windspeed'
}
df.rename(inplace=True, columns=names)
print(df.dtypes)

# Check for duplicate records
duplicates = df[df.duplicated()]
display(duplicates)

location                 object
datetime         datetime64[ns]
temperature             float64
humidity                float64
precipitation           float64
windspeed               float64
dtype: object

| | location | datetime | temperature | humidity | precipitation | windspeed |
|---|---|---|---|---|---|---|


### Summary

I renamed the columns to make working with the data (a little) easier.

There are no missing values.

There are no duplicated rows.

The data type of each column is appropriate for analysis, although `location` (10 unique values) could be stored as a `category` to reduce memory use.
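Since the objectives call for summaries by city and season, a season feature is a natural candidate for the "feature creation" step. The following is a minimal sketch under stated assumptions: `demo` is a small hypothetical frame built from the first rows previewed earlier, and `SEASON_BY_MONTH` assumes Northern Hemisphere meteorological seasons (the cities in the preview are all in the US). Neither name appears in the original notebook.

```python
import pandas as pd

# Hypothetical mini-frame using the renamed columns; replace with the real `df`.
demo = pd.DataFrame({
    "location": ["San Diego", "Philadelphia", "San Antonio", "San Diego"],
    "datetime": pd.to_datetime([
        "2024-01-14 21:12:46", "2024-02-26 17:32:39",
        "2024-04-29 13:23:51", "2024-05-17 15:22:10",
    ]),
    "temperature": [10.683001, -8.628976, 39.808213, 8.73414],
})

# Assumption: meteorological seasons, Northern Hemisphere.
SEASON_BY_MONTH = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}

# Derive the season from the month of each timestamp
demo["season"] = demo["datetime"].dt.month.map(SEASON_BY_MONTH)

# Store the low-cardinality city column as a category
demo["location"] = demo["location"].astype("category")

# Mean temperature per city and season (only combinations that occur)
summary = demo.groupby(["location", "season"], observed=True)["temperature"].mean()
print(summary)
```

Because the records run from January to mid-May 2024, only winter and spring would appear when this mapping is applied to the full dataset.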

## 4. 🔎 Exploratory Data Analysis (EDA)
- Distribution of variables
- Relationships between key features
- Business questions explored

In [None]:
# Distribution of temperature across all records
sns.histplot(df['temperature'], bins=50)
plt.title('Distribution of Temperature (°C)')
plt.show()

In [None]:
# Correlation heatmap
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title('Feature Correlation')
plt.show()
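One of the stated objectives is to identify cities with extreme conditions. A simple way to approach this, sketched here as an assumption rather than the notebook's actual method, is to flag readings that fall outside a per-city quantile band. `flag_extremes` and its 1%/99% defaults are hypothetical choices, and `demo` is synthetic data that mimics the renamed columns.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the cleaned data (illustration only).
rng = np.random.default_rng(42)
n = 2_000
demo = pd.DataFrame({
    "location": rng.choice(["Phoenix", "San Diego"], size=n),
    "temperature": rng.uniform(-20, 40, size=n),
})

def flag_extremes(frame, column, lower_q=0.01, upper_q=0.99):
    """Return a boolean Series marking rows outside each city's quantile band."""
    grouped = frame.groupby("location")[column]
    # transform broadcasts each city's threshold back onto its rows
    low = grouped.transform(lambda s: s.quantile(lower_q))
    high = grouped.transform(lambda s: s.quantile(upper_q))
    return (frame[column] < low) | (frame[column] > high)

demo["temp_extreme"] = flag_extremes(demo, "temperature")

# Share of flagged readings per city
extreme_share = demo.groupby("location")["temp_extreme"].mean()
print(extreme_share)
```

By construction, quantile thresholds flag roughly the same share of rows in every city, so this approach is better suited to locating when extremes occur than to ranking cities against each other. Fixed thresholds (for example, a temperature cutoff chosen with stakeholders) would be needed for cross-city comparison.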

## 5. 📈 Deeper Analysis or Modeling
- Trend analysis, segment analysis, KPIs
- Simple models if needed
- Explain business implications of findings

In [None]:
# Example: mean temperature by city, highest first
df.groupby('location')['temperature'].mean().sort_values(ascending=False)
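For the trend analysis mentioned in the objectives, a rolling average of daily means is a common starting point. The sketch below assumes a 7-day window and uses synthetic hourly readings for two cities; `demo`, `daily`, and the window length are illustrative choices, not results from the dataset. Seasonal decomposition is not shown, since it would need a library outside the tools listed above.

```python
import numpy as np
import pandas as pd

# Synthetic hourly readings for January 2024 (illustration only).
rng = np.random.default_rng(7)
hours = pd.Timestamp("2024-01-01") + pd.to_timedelta(np.arange(31 * 24), unit="h")
demo = pd.DataFrame({
    "location": np.repeat(["Phoenix", "San Diego"], len(hours)),
    "datetime": hours.append(hours),
    "temperature": rng.uniform(-20, 40, size=2 * len(hours)),
})

# Step 1: aggregate to one mean value per city per calendar day
daily = (
    demo.groupby(["location", pd.Grouper(key="datetime", freq="D")])["temperature"]
        .mean()
        .rename("daily_mean")
        .reset_index()
)

# Step 2: 7-day rolling mean within each city (NaN until 7 days are available)
daily["rolling_7d"] = (
    daily.groupby("location")["daily_mean"]
         .transform(lambda s: s.rolling(window=7).mean())
)

print(daily.head(10))
```

Aggregating to daily means first keeps the window length meaningful when timestamps are irregular, as they are in the previewed rows.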

## 6. 📌 Insights & Recommendations
- 🔹 Insight 1: ...
- 🔹 Insight 2: ...
- 🔹 Insight 3: ...
- ✅ Recommended Actions:
    - ...
    - ...

## 7. 🧾 Conclusion
- Recap of goals and findings
- Limitations
- Suggestions for future work

## 8. 📎 Appendix
- Additional charts
- Definitions
- Notes on calculations