# Project assignment
## Part 1

We performed concrete quality tests at multiple sites and collected the data. We told everyone which columns to prepare (and they complied), but unfortunately we did not specify the file format or filename, so some people sent xlsx files and others sent CSV files.

Since we expect to receive data like this many times, write a program (or a Colab notebook) that does the following:

- traverse the entire directory tree and open every data file
- for xlsx files use openpyxl or pandas; for CSV files use the csv module or pandas
- read the data and merge it into a single dataset
- for each data row, also record which file it came from (so we can notify the sender if it is faulty)
- validate the data with Pydantic; if the file is invalid, catch the error, log it or ignore it, and continue
- store the result in a database or a pandas DataFrame
- if you use a DataFrame, save it in Parquet format; with SQLite the data is already persisted.
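
The traversal and reading steps above could be sketched roughly as follows. This is only one possible approach, not the expected solution: the function names (`read_csv_rows`, `read_xlsx_rows`, `collect_rows`) and the `source_file` key are hypothetical, the files are assumed to be UTF-8 encoded, and xlsx files are assumed to keep their header in the first row of the first worksheet.

```python
import csv
from pathlib import Path


def read_csv_rows(path):
    """Yield each data row of a CSV file as a dict keyed by the header."""
    # "utf-8-sig" also accepts files that start with a byte order mark.
    with open(path, newline="", encoding="utf-8-sig") as handle:
        yield from csv.DictReader(handle)


def read_xlsx_rows(path):
    """Yield each data row of the active worksheet as a dict keyed by the header."""
    from openpyxl import load_workbook  # imported lazily; only needed for xlsx

    workbook = load_workbook(path, read_only=True, data_only=True)
    try:
        rows = workbook.active.iter_rows(values_only=True)
        header = next(rows, None)
        if header is None:  # empty sheet
            return
        for values in rows:
            yield dict(zip(header, values))
    finally:
        workbook.close()


# Map file extensions to reader functions; other files are skipped.
READERS = {".csv": read_csv_rows, ".xlsx": read_xlsx_rows}


def collect_rows(root):
    """Walk `root` recursively and return (rows, failures)."""
    rows, failures = [], []
    for path in sorted(Path(root).rglob("*")):
        reader = READERS.get(path.suffix.lower())
        if reader is None or not path.is_file():
            continue
        try:
            for row in reader(path):
                row["source_file"] = str(path)  # remember where the row came from
                rows.append(row)
        except Exception as exc:  # unreadable file: record it and continue
            failures.append((str(path), repr(exc)))
    return rows, failures
```

With the example data extracted, `rows, failures = collect_rows("adatok")` would return the merged rows plus a list of files that could not be read. Values read with the csv module arrive as strings, so the conversion to numbers is left to the validation step.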

Extra:
- plot strength as a function of cement and age in a scatter plot, with strength shown as a heatmap-style color scale
- handle common exceptions so that the user receives meaningful error messages.
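
For the plotting task, one possible reading is cement on the x axis, age on the y axis, and strength as the point color. The sketch below follows that reading; the function name `plot_strength` and the axis assignment are assumptions, and the three points are the sample rows shown in this assignment.

```python
import matplotlib.pyplot as plt


def plot_strength(cement, age, strength):
    """Scatter cement against age, coloring each point by strength."""
    fig, ax = plt.subplots(figsize=(8, 6))
    points = ax.scatter(cement, age, c=strength, cmap="viridis", s=40, alpha=0.85)
    fig.colorbar(points, ax=ax, label="Strength")  # color scale legend
    ax.set_xlabel("Cement")
    ax.set_ylabel("Age")
    ax.set_title("Strength by cement and age")
    fig.tight_layout()
    return fig, ax


# Values taken from the three sample CSV rows in this assignment.
fig, ax = plot_strength(
    cement=[141.3, 168.9, 250],
    age=[28, 14, 28],
    strength=[29.89, 23.51, 29.22],
)
```

With a DataFrame, the same call would be `plot_strength(df["cement"], df["age"], df["strength"])`.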


A CSV file begins like this:
```CSV
date,cement,slag,ash,water,superplastic,coarseagg,fineagg,age,strength
2025-03-22,141.3,212,0,203.5,0,971.8,748.5,28,29.89
2025-07-04,168.9,42.2,124.3,158.3,10.8,1080.8,796.2,14,23.51
2025-10-14,250,0,95.7,187.4,5.5,956.9,861.2,28,29.22
```
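
If you choose the database route, rows in this layout could be stored with the standard-library `sqlite3` module. The sketch below is only an illustration: the table name `experiments`, the function `store_rows`, and the extra `source_file` column are hypothetical choices, and the in-memory database is used only to keep the example self-contained.

```python
import sqlite3

# Columns from the CSV header plus a hypothetical source_file column.
COLUMNS = [
    "date", "cement", "slag", "ash", "water", "superplastic",
    "coarseagg", "fineagg", "age", "strength", "source_file",
]


def store_rows(connection, rows):
    """Create the table if needed and insert the given row dicts."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS experiments ("
        "date TEXT, cement REAL, slag REAL, ash REAL, water REAL, "
        "superplastic REAL, coarseagg REAL, fineagg REAL, age REAL, "
        "strength REAL, source_file TEXT)"
    )
    # Named placeholders (:date, :cement, ...) are filled from each dict.
    placeholders = ", ".join(f":{name}" for name in COLUMNS)
    connection.executemany(
        f"INSERT INTO experiments VALUES ({placeholders})", rows
    )
    connection.commit()


connection = sqlite3.connect(":memory:")  # pass a file path to persist the data
store_rows(connection, [{
    "date": "2025-03-22", "cement": 141.3, "slag": 212, "ash": 0,
    "water": 203.5, "superplastic": 0, "coarseagg": 971.8,
    "fineagg": 748.5, "age": 28, "strength": 29.89,
    "source_file": "example.csv",
}])
count = connection.execute("SELECT COUNT(*) FROM experiments").fetchone()[0]
```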

If you want to do this in Colab (recommended, because all the packages are already available there), you need to upload the zip archive to the runtime. Use the folder icon on the left (Files), or run the command below, which downloads and extracts it.

If the runtime disconnects, the files are lost and you will have to upload them again.

Feel free to use AI assistance (Colab even has it built in; you can see it at the bottom), but try to understand every line you create! (Running code when you have no idea what it does is not a good habit.)

In [None]:
# Run this cell to download and extract the example data.
!wget -q https://github.com/goteguru/kmooc_example_data/archive/refs/heads/main.zip && unzip -o main.zip && mv kmooc_example_data-main adatok && rm -f main.zip
# After refreshing the file browser on the left, the 'adatok' directory appears there.

In [None]:
#
# Your implementation goes here!
#
#

In [None]:
# Use this validation model (min and max values for the inputs):
from datetime import date
from pydantic import BaseModel, Field

class ConcreteExperiment(BaseModel):
    date: date
    cement: float       = Field(ge=0, le=600)
    slag: float         = Field(ge=0, le=400)
    ash: float          = Field(ge=0, le=300)
    water: float        = Field(ge=0, le=400)
    superplastic: float = Field(ge=0, le=50)
    coarseagg: float    = Field(ge=0, le=1500)
    fineagg: float      = Field(ge=0, le=1000)
    age: float          = Field(ge=0, le=365)
    strength: float     = Field(ge=0, le=100)
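
A minimal sketch of how such a model might be applied row by row, assuming Pydantic v2 (`model_validate`, `model_dump`). The two-field `Sample` model is only a stand-in that keeps the example short; in the real pipeline you would pass `ConcreteExperiment` instead. The helper `validate_rows`, the logger name, and the `source_file` key are hypothetical.

```python
import logging

from pydantic import BaseModel, Field, ValidationError

logger = logging.getLogger("concrete_import")  # hypothetical logger name


class Sample(BaseModel):
    # Stand-in with two fields; pass ConcreteExperiment in the real pipeline.
    cement: float = Field(ge=0, le=600)
    strength: float = Field(ge=0, le=100)


def validate_rows(rows, model, source_key="source_file"):
    """Return (valid, rejected); rejected pairs the source file with the errors."""
    valid, rejected = [], []
    for row in rows:
        source = row.get(source_key, "<unknown>")
        data = {key: value for key, value in row.items() if key != source_key}
        try:
            record = model.model_validate(data)
        except ValidationError as exc:
            # Log the problem and continue with the next row.
            logger.warning("Invalid row from %s: %s", source, exc.errors())
            rejected.append((source, exc.errors()))
            continue
        valid.append({**record.model_dump(), source_key: source})
    return valid, rejected


valid, rejected = validate_rows(
    [
        {"cement": "141.3", "strength": "29.89", "source_file": "a.csv"},
        {"cement": "-5", "strength": "29.89", "source_file": "b.csv"},  # below ge=0
    ],
    Sample,
)
```

Here the first row passes (the strings are converted to floats) and the second is rejected, so `rejected` records that `b.csv` needs to be reported to its sender.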


## Part 2

An allegedly knowledgeable person claimed that the concrete curing strength data can be visualized well with a t-SNE embedding colored as a heatmap.

They also provided sample code that does this (see below). The code expects the data in a DataFrame named `df`. If you used a different name, the simplest approach is to rename your variable before running it.

- Modify the code so that it saves the image under a filename containing today's date!
- Export the result (three values per measurement) to a CSV or XLSX file!
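
Both tasks come down to building a filename from today's date and calling the matching save function. The sketch below shows the idea on a tiny stand-in frame with arbitrary placeholder numbers; the `tsne_<date>` naming scheme is an assumption, and in the real notebook you would use the `df_tsne` frame and the figure produced by the sample code.

```python
from datetime import date

import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for df_tsne; the embedding coordinates are arbitrary placeholders.
df_tsne = pd.DataFrame(
    {
        "TSNE1": [0.1, -1.2, 2.3],
        "TSNE2": [1.0, 0.4, -0.7],
        "strength": [29.89, 23.51, 29.22],
    }
)

fig, ax = plt.subplots()
ax.scatter(df_tsne["TSNE1"], df_tsne["TSNE2"], c=df_tsne["strength"], cmap="viridis")

stamp = date.today().isoformat()  # ISO format, YYYY-MM-DD
image_name = f"tsne_{stamp}.png"  # hypothetical naming scheme
table_name = f"tsne_{stamp}.csv"

fig.savefig(image_name, dpi=150)  # save before calling plt.show()
df_tsne.to_csv(table_name, index=False)  # three columns per measurement
```

Saving before `plt.show()` matters with some backends, where the figure is released once it has been shown and a later save produces an empty image. For an XLSX export, `df_tsne.to_excel(...)` could be used instead, provided openpyxl is installed.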

In [None]:
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE
import pandas as pd

# ---- Input variables ----

X_cols = [
    "cement",
    "slag",
    "ash",
    "water",
    "superplastic",
    "coarseagg",
    "fineagg",
    "age",
]

X = df[X_cols].copy()
y = df["strength"].values  # for coloring

# ---- Standardize the data ----
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# ---- t-SNE 2D ----
tsne = TSNE(
    n_components=2,
    perplexity=30,
    learning_rate="auto",
    init="pca",
    random_state=42,
)

tsne_components = tsne.fit_transform(X_scaled)

df_tsne = pd.DataFrame(
    tsne_components,
    columns=["TSNE1", "TSNE2"]
)
df_tsne["strength"] = y

# ---- 2D "heatmap" scatter ----
plt.figure(figsize=(10, 7))

scatter = plt.scatter(
    df_tsne["TSNE1"],
    df_tsne["TSNE2"],
    c=df_tsne["strength"],
    cmap="viridis",
    s=60,
    alpha=0.85,
)

cbar = plt.colorbar(scatter)
cbar.set_label("Strength")

plt.xlabel("t-SNE dim 1")
plt.ylabel("t-SNE dim 2")
plt.title("t-SNE 2D mapping - Strength heatmap coloring")
plt.tight_layout()
plt.show()
