## 📌 Objectives

In this lab, you will:

- Simulate time-series data for environmental sensors in greenhouses
- Apply **rolling averages** to smooth sensor data and **rolling standard deviation** to track its variability
- Calculate **correlation coefficients** between variables such as temperature, humidity, and CO₂
- Create **scatter plots** and overlay **trend lines** to visualize relationships
- Perform and interpret **linear regression** to quantify the strength and direction of correlations


## 🧪 Simulating Time-Series Environmental Sensor Data

Before analyzing real-world greenhouse sensor data, we first simulate a dataset representing 30 days of environmental readings. This simulation includes **temperature (°C)**, **humidity (%)**, and **CO₂ concentration (ppm)**.

### 🔍 Explanation of Each Step:

1. **Import necessary Python libraries**:
   - `pandas`: for creating and managing the DataFrame (tabular data)
   - `numpy`: for numerical calculations and random data generation
   - `matplotlib.pyplot`: for plotting graphs
   - `seaborn`: for advanced statistical visualizations
   - `scipy.stats.linregress`: for linear regression analysis

2. **Set a random seed**:
   - `np.random.seed(42)` ensures reproducibility: every time you run this code, it generates the same random data.

3. **Generate a date range**:
   - `pd.date_range(start='2024-01-01', periods=30, freq='D')` creates 30 consecutive daily timestamps starting from January 1, 2024.

4. **Simulate sensor values**:
   - `temperature`: Normally distributed around 25°C with a standard deviation of 2°C
   - `humidity`: Normally distributed around 60% with a standard deviation of 5%
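   - `co2`: A 400 ppm baseline plus 1.5 ppm per °C of temperature, with normally distributed noise (standard deviation of 5 ppm). The positive link to temperature is a simulation assumption, built in so that later steps have a relationship to detect.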

5. **Create a DataFrame**:
   - A `pandas.DataFrame` is constructed from the simulated arrays.
   - Columns include `Temperature`, `Humidity`, and `CO2`.
   - The `Date` column is set as the index to enable time-series operations.

6. **Preview the data**:
   - `df.head()` shows the first 5 rows of the dataset so we can verify its structure.


```python
# 📦 Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import linregress

# 🎲 Set random seed for reproducibility
np.random.seed(42)

# 📆 Create a time series index (30 days starting from Jan 1, 2024)
date_rng = pd.date_range(start='2024-01-01', periods=30, freq='D')

# 🌡️ Simulate sensor data
temperature = np.random.normal(loc=25, scale=2, size=30)  # °C
humidity = np.random.normal(loc=60, scale=5, size=30)     # %
co2 = 400 + temperature * 1.5 + np.random.normal(0, 5, size=30)  # ppm, linearly related to temp + noise

# 🧱 Create a DataFrame
df = pd.DataFrame({
    'Date': date_rng,
    'Temperature': temperature,
    'Humidity': humidity,
    'CO2': co2
})

# 🗂️ Set Date as index
df.set_index('Date', inplace=True)

# 👀 Display the first few rows
df.head()
```


## 📊 Rolling Statistics

In greenhouse environments, sensor data can fluctuate due to natural variation or measurement noise. To observe longer-term trends and reduce short-term fluctuations, we use **rolling statistics** such as the **rolling mean** and **rolling standard deviation**.

### 🔄 What is a Rolling Window?

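A *rolling window* computes a statistic over a sliding subset of consecutive data points. For example, the **3-day rolling mean** at a given date is the average of that date and the previous 2 days, which smooths out sudden spikes or drops. With pandas' default settings, the first 2 dates have no result (`NaN`) because the window is not yet full.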
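
To make the window mechanics concrete, here is a minimal sketch on a toy series. The values and the names `readings`, `rolling_default`, and `rolling_partial` are made up for illustration and are not part of the lab's dataset. It contrasts the default behavior with `min_periods=1`, which averages whatever values are available so far.

```python
import pandas as pd

# Toy readings (hypothetical values, not the lab's simulated data)
readings = pd.Series([20.0, 22.0, 24.0, 26.0, 28.0])

# Default: a result appears only once the window holds 3 values,
# so the first two entries are NaN
rolling_default = readings.rolling(window=3).mean()

# min_periods=1: average however many values are available so far
rolling_partial = readings.rolling(window=3, min_periods=1).mean()

print(rolling_default.tolist())
print(rolling_partial.tolist())
```

Whether to keep the leading `NaN` values or fill them with partial-window averages is a judgment call: partial windows are less smoothed, so the first few points are not directly comparable to the rest.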

### 📘 Common Use Cases:
- Detect gradual changes in temperature or humidity
- Remove sensor noise before control decisions
- Spot seasonality or patterns over days or weeks

### 🧮 Operations We'll Perform:
- Compute **3-day** and **7-day** rolling averages for temperature
- Compute the **3-day rolling standard deviation** for temperature
- Compare raw data with smoothed data visually


```python
# 🔁 Calculate rolling statistics
df['Temp_3d'] = df['Temperature'].rolling(window=3).mean()      # 3-day rolling mean
df['Temp_7d'] = df['Temperature'].rolling(window=7).mean()      # 7-day rolling mean
df['Temp_std_3d'] = df['Temperature'].rolling(window=3).std()   # 3-day rolling std deviation

# 📈 Plot the original and smoothed temperature data
df[['Temperature', 'Temp_3d', 'Temp_7d']].plot(figsize=(10, 5), title='Temperature: Raw vs. Rolling Averages')
plt.xlabel('Date')
plt.ylabel('Temperature (°C)')
plt.grid(True)
plt.legend()
plt.show()
```

## 📈 Correlation Analysis

Understanding how environmental variables move together helps us form hypotheses about what drives what and tune control strategies in greenhouses. Keep in mind that correlation alone does not establish cause and effect.

### 🔗 What is Correlation?

**Correlation**, measured here by the Pearson coefficient that `DataFrame.corr()` computes by default, quantifies the strength and direction of a *linear* relationship between two variables.

- A correlation coefficient (**r**) ranges from **-1** to **1**:
  - `+1` = perfect positive linear relationship (both increase together)
  - `0` = no linear relationship (a nonlinear relationship may still exist)
  - `-1` = perfect negative linear relationship (one increases as the other decreases)
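
The definition above can be sketched directly with NumPy. The arrays below are toy values chosen for illustration, and all variable names are hypothetical. The first part computes Pearson's r from its formula and compares it with `np.corrcoef`; the second part shows that r can be zero even when one variable is fully determined by the other, because the relationship is not linear.

```python
import numpy as np

# Toy paired measurements (hypothetical values for illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Pearson's r: sum of products of deviations, scaled by both spreads
x_dev = x - x.mean()
y_dev = y - y.mean()
r_manual = (x_dev * y_dev).sum() / np.sqrt((x_dev**2).sum() * (y_dev**2).sum())

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is r
r_numpy = np.corrcoef(x, y)[0, 1]
print(f"manual r = {r_manual:.4f}, numpy r = {r_numpy:.4f}")

# A perfect but nonlinear relationship: y = x**2 on a symmetric range
x_sym = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
r_quadratic = np.corrcoef(x_sym, x_sym**2)[0, 1]
print(f"r for y = x^2: {r_quadratic:.4f}")
```

This is why a scatter plot is worth checking alongside the coefficient: a value of r near zero rules out a straight-line trend, not every kind of dependence.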

### 🧪 What We'll Do:
We will compute the pairwise correlation between:
- Temperature (°C)
- Humidity (%)
- CO₂ concentration (ppm)

This will help us understand which environmental factors move together.


```python
# 🔗 Calculate correlation coefficients
correlation_matrix = df[['Temperature', 'Humidity', 'CO2']].corr()

# 🖨️ Display the correlation matrix
print("Correlation Matrix:")
print(correlation_matrix)
```


## 📉 Scatter Plot and Trend Line

After calculating correlation coefficients, it's helpful to **visualize** the relationships between variables. One of the best tools for this is the **scatter plot**.

### 🔍 Why Use a Scatter Plot?

- Each point represents a pair of values (e.g., Temperature and CO₂ on the same day).
- Patterns show whether a relationship exists.
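- Adding a **regression line** shows the direction of the trend, and how tightly the points cluster around the line indicates its strength.

In this lab, we'll plot **Temperature vs. CO₂ concentration** and use the built-in trend line from seaborn's `regplot`, which by default also draws a shaded 95% confidence band around the fitted line.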


```python
# 📉 Create scatter plot with regression line
sns.regplot(data=df, x='Temperature', y='CO2')
plt.title('Scatter Plot with Regression Line: Temperature vs. CO₂')
plt.xlabel('Temperature (°C)')
plt.ylabel('CO₂ (ppm)')
plt.grid(True)
plt.show()
```


## 📐 Linear Regression with `scipy`

While a scatter plot with a trend line gives us visual insight, we often need **numerical results** to quantify relationships. This is where **linear regression** comes in.

### 🧠 What is Linear Regression?

Linear regression fits a straight line through the data and returns:

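- **Slope**: the rate of change (how much CO₂ changes, in ppm, per 1 °C increase in temperature)
- **Intercept**: the predicted CO₂ level when temperature is 0 °C; this lies far outside the simulated temperature range, so treat it as an extrapolation rather than a physical estimate
- **R-value (r)**: the Pearson correlation coefficient, giving the strength and direction of the linear relationship
- **R-squared (R²)**: proportion of the variation in CO₂ explained by temperature
- **P-value**: result of a two-sided test of the null hypothesis that the slope is zero; small values indicate the slope is unlikely to be zero
- **Standard Error**: uncertainty in the slope estimate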

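We'll apply regression between **Temperature** and **CO₂** using `scipy.stats.linregress`. Because CO₂ was simulated as 400 + 1.5 × Temperature plus noise, the fitted slope and intercept should land in the neighborhood of 1.5 and 400, although with only 30 noisy points the estimates can differ noticeably from those values.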

> 🔍 Tip: An R² value closer to 1 means the fitted line explains more of the variation in CO₂. It does not by itself show that temperature causes the change.
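
To see where the slope, intercept, and R² come from, the sketch below computes them by hand with the ordinary least squares formulas on toy data. The values and variable names are hypothetical. For the same inputs, `linregress` should report a matching slope and intercept, and an `r_value` whose square equals the R² computed here; `np.polyfit` is used as an independent cross-check.

```python
import numpy as np

# Toy data (hypothetical): y is roughly 2*x + 1 with small deviations
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.0])

x_dev = x - x.mean()
y_dev = y - y.mean()

# Least squares slope: co-variation of x and y divided by variation of x
slope = (x_dev * y_dev).sum() / (x_dev**2).sum()

# The fitted line always passes through the point (mean of x, mean of y)
intercept = y.mean() - slope * x.mean()

# R² = 1 - (unexplained variation / total variation)
residuals = y - (slope * x + intercept)
r_squared = 1 - (residuals**2).sum() / (y_dev**2).sum()

# Cross-check against NumPy's degree-1 polynomial fit
check_slope, check_intercept = np.polyfit(x, y, 1)

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}, R² = {r_squared:.3f}")
print(f"polyfit: slope = {check_slope:.2f}, intercept = {check_intercept:.2f}")
```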


```python
# 📐 Prepare variables (drop rows where either value is missing so x and y stay aligned)
pairs = df[['Temperature', 'CO2']].dropna()
x = pairs['Temperature']
y = pairs['CO2']

# 🔍 Perform linear regression
slope, intercept, r_value, p_value, std_err = linregress(x, y)

# 📊 Print regression results
print(f"Slope: {slope:.2f}")
print(f"Intercept: {intercept:.2f}")
print(f"R-value: {r_value:.2f}")
print(f"R-squared: {r_value**2:.2f}")
print(f"P-value: {p_value:.4f}")
print(f"Standard Error (slope): {std_err:.2f}")

# 🖼️ Plot with regression line
plt.scatter(x, y, label='Data')
plt.plot(x, slope * x + intercept, color='red', label='Regression Line')
plt.xlabel('Temperature (°C)')
plt.ylabel('CO₂ (ppm)')
plt.title('Linear Regression: Temperature vs. CO₂')
plt.legend()
plt.grid(True)
plt.show()
```


## 📝 Exercises

Try the following exercises to practice and reinforce what you've learned. These problems will help you apply rolling statistics, correlation analysis, and regression techniques to greenhouse sensor data.

---

### 🔁 1. Rolling Statistics
- Plot a **7-day rolling average** and **7-day rolling standard deviation** for **humidity**.
- What can you learn from the smoothed trend?

---

### 🔍 2. Scatter Plot and Correlation
- Create a **scatter plot** between **Humidity** and **CO₂**.
- Add a trend line.
- Visually assess the strength and direction of their relationship.

---

### 📐 3. Linear Regression (Humidity vs CO₂)
- Perform a linear regression between **Humidity** and **CO₂** using `linregress()`.
- Print out:
  - Slope
  - Intercept
  - R²
- Compare the R² value to that from **Temperature vs CO₂**. Which variable is the better predictor of CO₂, and is that what you would expect from how the data was simulated?

---

### 📊 4. Outlier Detection (Advanced)
- Find the date where **temperature** has the **largest absolute deviation** from its **3-day rolling mean**.
- What does this tell you about fluctuations in greenhouse conditions?

---

### 🧠 5. Bonus Challenge
- Try using `.resample('W')` to calculate **weekly averages** for each variable.
- Plot the weekly-averaged values to see smoother trends over time.
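
As a hint for the bonus challenge, here is a minimal sketch of how `.resample('W')` groups daily data, shown on a toy series rather than the lab's `df`. The values and the names `daily` and `weekly` are hypothetical. With the `'W'` rule, pandas forms weeks that end on Sunday and labels each weekly value with that Sunday's date.

```python
import numpy as np
import pandas as pd

# Toy daily series (hypothetical values): 14 days starting Monday, Jan 1, 2024
idx = pd.date_range(start='2024-01-01', periods=14, freq='D')
daily = pd.Series(np.arange(14, dtype=float), index=idx)

# 'W' bins the days into weeks ending on Sunday; each bin is labeled by its Sunday
weekly = daily.resample('W').mean()

print(weekly)
```

When you apply this to the 30-day dataset, note that the final bin covers only the last two days (January 29 and 30), so its average rests on far fewer readings than the full weeks.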

---
