# Real-World Problem: **Analyzing and Modeling Sensor Data for a Smart Factory**

### Background

You are a data scientist working for a smart manufacturing factory that collects sensor data from multiple machines on the production line. The goal is to analyze the sensor data to detect anomalies, understand machine behavior, and help the engineering team optimize the system.

---

### Dataset

* You receive raw sensor readings as a **1D list** of 10,000 integer temperature measurements, encoded as °C × 100 (e.g., 2535 means 25.35°C) and sampled once per second. The readings from all sensors are concatenated into this single list.
* The sensors belong to 20 machines, with 500 readings per machine.
* Some readings may be noisy or have outliers.
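
To make the encoding and layout concrete, here is a minimal sketch on a toy input. The values and the 2×3 shape are hypothetical and only stand in for the real 20×500 data.

```python
import numpy as np

# Hypothetical toy input: 2 machines x 3 readings, in hundredths of a degree C.
raw = [2535, 2540, 2528, 2610, 2605, 2598]

readings_c = np.asarray(raw, dtype=float) / 100.0  # decode: 2535 -> 25.35
by_machine = readings_c.reshape(2, 3)              # one row per machine

print(by_machine)
```

Because the list is concatenated machine by machine, a plain row-major `reshape` puts each machine's readings in its own row.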

---

### Tasks

#### 1. **Data Preparation and Cleaning**

* Convert the raw list into a NumPy array.
* Reshape the 1D data into a 20×500 2D matrix, where each row corresponds to one machine’s sensor readings.
* For each machine, use `floor()`, `ceil()`, and `isqrt()` on the first 10 readings and interpret the results.
* Detect and remove outliers from each machine's data using a statistical threshold (e.g., values more than 3 standard deviations from that machine's mean).
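
The sketch below illustrates what the three integer functions report for a single reading, and one possible way to drop outliers with the 3-standard-deviation rule. The helper `drop_outliers` and the sample row are hypothetical; the population standard deviation (`pstdev`) is an assumption chosen to match NumPy's default `np.std`.

```python
from math import floor, ceil, isqrt
from statistics import mean, pstdev

encoded = 2535           # one reading in hundredths of a degree C
celsius = encoded / 100  # 25.35

# On the integer encoding, floor and ceil return the value unchanged.
same = (floor(encoded), ceil(encoded))                  # (2535, 2535)
# On the decoded value, they bracket the reading to whole degrees.
whole_degree_bounds = (floor(celsius), ceil(celsius))   # (25, 26)
# isqrt returns the largest integer k with k*k <= n.
root = isqrt(encoded)                                   # 50, since 2500 <= 2535 < 2601


def drop_outliers(values, k=3.0):
    """Keep only values within k population standard deviations of the mean."""
    m = mean(values)
    s = pstdev(values)
    return [v for v in values if abs(v - m) <= k * s]


# Hypothetical machine row: 19 ordinary readings plus one spike.
row = [2500 + d for d in range(-9, 10)] + [9000]
kept = drop_outliers(row)
```

Note that dropping values leaves rows of unequal length, so a rectangular matrix requires either masking or replacing the flagged values instead.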

#### 2. **Statistical Analysis**

* Calculate the mean, median, and variance of temperature readings for each machine.
* Find the machine with the highest average temperature and the one with the lowest.
* Compute the sine of the readings after normalizing them to \[0, π], and inspect the result for patterns.
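
As a small illustration of the normalization step, the following sketch maps a hypothetical row of readings to \[0, π] and takes the sine. It assumes the row is not constant, so the range is non-zero.

```python
import numpy as np

# Hypothetical readings for one machine (hundredths of a degree C).
row = np.array([2400.0, 2450.0, 2500.0, 2550.0, 2600.0])

span = row.max() - row.min()                   # assumed to be > 0
normalized = (row - row.min()) / span * np.pi  # min -> 0, max -> pi
sine = np.sin(normalized)
```

The sine reaches 1 for readings in the middle of the range and falls toward 0 at both extremes, so it reflects how close a reading is to the midpoint rather than how large it is.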

#### 3. **Linear Algebra and Matrix Operations**

* Compute the 20×20 covariance matrix, which captures how the readings of each pair of machines vary together.
* Find eigenvalues and eigenvectors of the covariance matrix using SciPy.
* Identify the principal components (top 2 eigenvectors) that explain most variance in sensor readings.
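
One way to judge how much variance the top components explain is to divide each eigenvalue by the sum of all eigenvalues. The sketch below does this on hypothetical synthetic data (4 machines, 200 readings). It uses `np.linalg.eigh` to stay NumPy-only; `scipy.linalg.eigh` likewise returns eigenvalues in ascending order.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for cleaned data: 4 machines x 200 readings.
base = rng.normal(size=200)
data = np.vstack([
    base + 0.1 * rng.normal(size=200),  # machines 0 and 1 share a signal
    base + 0.1 * rng.normal(size=200),
    rng.normal(size=200),               # machines 2 and 3 are independent
    rng.normal(size=200),
])

cov = np.cov(data)                      # rows are variables -> 4x4 matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # symmetric input, ascending eigenvalues
order = np.argsort(eigvals)[::-1]       # reorder to descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = eigvals / eigvals.sum()     # fraction of total variance per component
top2 = eigvecs[:, :2]                   # columns are the two leading directions

print("Explained variance ratio:", explained)
```

Large entries in a leading eigenvector indicate which machines contribute most to that direction of variation.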

#### 4. **Matrix Subsetting and Determinants**

* Extract a 5×5 submatrix from the covariance matrix representing a group of 5 related machines.
* Calculate row-wise and column-wise sums of this submatrix.
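* Calculate its determinant and eigenvalues to check whether these machines are strongly correlated (a determinant that is small relative to the product of the variances indicates near-linear dependence).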
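
The raw determinant of a covariance block depends on the units of the data, so it can help to rescale the block to a correlation matrix first. Its determinant then lies between 0 and 1: it equals 1 when the machines are uncorrelated and approaches 0 as they become linearly dependent. The helper `to_correlation` and both example matrices below are hypothetical.

```python
import numpy as np


def to_correlation(cov):
    """Rescale a covariance matrix so that its diagonal becomes 1."""
    std = np.sqrt(np.diag(cov))
    return cov / np.outer(std, std)


# Hypothetical 3x3 covariance blocks for two groups of machines.
independent = np.diag([4.0, 9.0, 1.0])   # no covariance between machines
coupled = np.array([[4.0, 5.4, 1.8],     # every pairwise correlation is 0.9
                    [5.4, 9.0, 2.7],
                    [1.8, 2.7, 1.0]])

det_independent = np.linalg.det(to_correlation(independent))
det_coupled = np.linalg.det(to_correlation(coupled))
eig_coupled = np.linalg.eigvalsh(to_correlation(coupled))  # ascending order

print(det_independent, det_coupled, eig_coupled)
```

In the coupled case one eigenvalue dominates and the others are small, which is the same signal seen from the eigenvalue side.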

#### 5. **Optimization and Insights**

* Use `gcd()` and `isqrt()` to analyze periodic patterns in the sensor readings (e.g., the gcd of the time intervals between spikes).
* Summarize all findings and suggest which machines require maintenance or closer monitoring based on statistical and linear algebra insights.
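
For the gcd idea, a minimal sketch is to take the differences between consecutive spike times and reduce them with `gcd`. The spike timestamps below are hypothetical and assume one sample per second.

```python
from functools import reduce
from math import gcd, isqrt

# Hypothetical spike timestamps in seconds.
spike_times = [30, 90, 150, 270, 330]

# Gaps between consecutive spikes: [60, 60, 120, 60]
intervals = [b - a for a, b in zip(spike_times, spike_times[1:])]
base_period = reduce(gcd, intervals)  # 60
root = isqrt(base_period)             # 7, since 49 <= 60 < 64

print(intervals, base_period, root)
```

A gcd well above 1 suggests the spikes align with a shared cycle; here the 120-second gap is consistent with one missed spike on a 60-second cycle.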

---


In [1]:
import numpy as np
import pandas as pd
from scipy import linalg
from math import floor, ceil, isqrt, gcd, sin, pi

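# Simulate 10,000 raw readings in hundredths of a degree C (mean 25.00°C, std 1.00°C);
# astype(int) truncates the fractional part toward zero
np.random.seed(42)
data = np.random.normal(loc=2500, scale=100, size=10000).astype(int)
# Round-trip through CSV to mimic loading the readings from a file
df = pd.DataFrame(data)
df.to_csv('sensor_data.csv', index=False, header=False)

df = pd.read_csv('sensor_data.csv', header=None)
sensor_array = df[0].to_numpy()
# One row per machine: 20 machines x 500 readings
sensor_matrix = sensor_array.reshape((20, 500))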

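# The readings are already integers, so floor and ceil return them unchanged
floor_vals = np.floor(sensor_matrix[:, :10]).astype(int)
ceil_vals = np.ceil(sensor_matrix[:, :10]).astype(int)
# math.isqrt requires a non-negative integer, hence abs()
isqrt_vals = np.array([[isqrt(abs(x)) for x in row] for row in sensor_matrix[:, :10]])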

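# Replace (rather than drop) readings beyond mean ± 3 std with the machine's mean,
# so that every row keeps its 500 values
cleaned_matrix = np.copy(sensor_matrix).astype(float)
for i in range(20):
    m = np.mean(cleaned_matrix[i])
    s = np.std(cleaned_matrix[i])
    mask = (cleaned_matrix[i] < m - 3*s) | (cleaned_matrix[i] > m + 3*s)
    cleaned_matrix[i, mask] = m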

means = np.mean(cleaned_matrix, axis=1)
medians = np.median(cleaned_matrix, axis=1)
variances = np.var(cleaned_matrix, axis=1)
highest_avg_machine = np.argmax(means)
lowest_avg_machine = np.argmin(means)

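# Scale all readings to [0, pi] using the global min and max, then take the sine
norm_matrix = (cleaned_matrix - cleaned_matrix.min()) / (cleaned_matrix.max() - cleaned_matrix.min()) * pi
sin_matrix = np.sin(norm_matrix)

# np.cov treats each row as a variable, giving a 20x20 matrix
cov_matrix = np.cov(cleaned_matrix)

# eigh returns eigenvalues in ascending order; reorder to descending
eigvals, eigvecs = linalg.eigh(cov_matrix)
sorted_indices = np.argsort(eigvals)[::-1]
eigvals = eigvals[sorted_indices]
eigvecs = eigvecs[:, sorted_indices]

# Columns of eigvecs are eigenvectors; keep the two with the largest eigenvalues
top2_pc = eigvecs[:, :2]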

# Covariance block for the first five machines, used here as the example group
submatrix = cov_matrix[:5, :5]
row_sums = np.sum(submatrix, axis=1)
col_sums = np.sum(submatrix, axis=0)
det_submatrix = linalg.det(submatrix)
eigvals_sub, _ = linalg.eigh(submatrix)

# Example spike intervals in seconds (hard-coded, not derived from the data)
spike_intervals = [20, 40, 60, 80, 100]
g = spike_intervals[0]
for val in spike_intervals[1:]:
    g = gcd(g, val)
int_sqrt_g = isqrt(g)

print("Floor values (first 10 readings of machine 1):", floor_vals[0])
print("Ceil values (first 10 readings of machine 1):", ceil_vals[0])
print("Integer sqrt values (first 10 readings of machine 1):", isqrt_vals[0])
print("Machine with highest avg temp:", highest_avg_machine)
print("Machine with lowest avg temp:", lowest_avg_machine)
print("Mean temperatures:", means)
print("Median temperatures:", medians)
print("Variance temperatures:", variances)
print("Covariance matrix shape:", cov_matrix.shape)
print("Top 2 principal components:\n", top2_pc)
print("Row sums of 5x5 submatrix:", row_sums)
print("Column sums of 5x5 submatrix:", col_sums)
print("Determinant of 5x5 submatrix:", det_submatrix)
print("Eigenvalues of 5x5 submatrix:", eigvals_sub)
print("GCD of spike intervals:", g)
print("Integer sqrt of GCD:", int_sqrt_g)

Floor values (first 10 readings of machine 1): [2549 2486 2564 2652 2476 2476 2657 2576 2453 2554]
Ceil values (first 10 readings of machine 1): [2549 2486 2564 2652 2476 2476 2657 2576 2453 2554]
Integer sqrt values (first 10 readings of machine 1): [50 49 50 51 49 49 51 50 49 50]
Machine with highest avg temp: 2
Machine with lowest avg temp: 8
Mean temperatures: [2499.459152 2502.668    2510.352    2502.74448  2498.339448 2501.057676
 2497.658    2496.330472 2490.096    2499.092512 2495.018808 2492.80888
 2497.108    2495.546516 2500.76624  2504.178832 2499.85872  2502.93712
 2498.832    2498.557552]
Median temperatures: [2500.596 2502.    2511.5   2500.5   2499.    2500.    2497.5   2499.5
 2494.5   2499.064 2495.5   2494.5   2493.    2495.    2500.06  2501.
 2498.    2506.    2500.    2500.   ]
Variance temperatures: [ 8913.99770462  9548.325776   10183.624096    8924.14099833
  8813.15129923  9820.16207797 10760.421036    9902.66125995
  9809.502784    9039.0914743   9030.00342548