#### Label Encoding:
- Needed for algorithms that operate on numbers: linear regression, logistic regression, kNN, NN
- With one-hot encoding, drop one dummy column per feature (`drop='first'`) to avoid multicollinearity
#### Feature Scaling-Normalization: 
Important because it
- Makes sure features contribute equally, so that no single feature dominates the others
- Helps gradient descent converge faster
- Helps avoid exploding/vanishing gradients
- Is needed for distance-based algorithms (kNN, k-means) and for regularized or gradient-trained models (logistic regression, NN)



In [None]:
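#ENCODER
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, f1_score, roc_auc_score
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# cat_col: list of categorical column names, assumed to be defined earlier
# sparse_output must be the boolean False, not the string 'False'
encoder = OneHotEncoder(drop='first', sparse_output=False)
X_train_enc = encoder.fit_transform(X_train[cat_col])
X_test_enc = encoder.transform(X_test[cat_col])

#create dataframe for encoding (each frame keeps its own index)
feature_names = encoder.get_feature_names_out()
X_train_df_enc = pd.DataFrame(data=X_train_enc, columns=feature_names, index=X_train.index)
X_test_df_enc = pd.DataFrame(data=X_test_enc, columns=feature_names, index=X_test.index)

#remove old cat columns and add the encoded columns (same steps for train and test)
X_train = pd.concat([X_train.drop(columns=cat_col), X_train_df_enc], axis=1)
X_test = pd.concat([X_test.drop(columns=cat_col), X_test_df_enc], axis=1)
#SCALE
scaler = StandardScaler()
scaler.fit(X_train)  # fit on the training data only to avoid leakage
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
#BUILDMODEL
logreg = LogisticRegression(C=0.0001, random_state=42)  # small C means strong L2 regularization
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
y_pred_proba = logreg.predict_proba(X_test)[:, 1]  # probability of the positive class

#EVALUATION
print(classification_report(y_test, y_pred))
print("accuracy:", accuracy_score(y_test, y_pred))
print("f1:", f1_score(y_test, y_pred))
print("roc_auc:", roc_auc_score(y_test, y_pred_proba))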

In [None]:
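#PCA tSNE
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

#DATASET: reduce 
cell_type = df['cell_type']  # keep the labels for coloring the plots
df = df.drop(columns='cell_type')
X = df.loc[:, df.sum(axis=0) > 10]  # keep only columns whose total exceeds 10
#Scale
# StandardScaler accepts a DataFrame directly and returns a NumPy array
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
#PCA
pca_50 = PCA(n_components=50)
X_pca_50 = pca_50.fit_transform(X_scaled)

#plot
sns.scatterplot(x=X_pca_50[:, 0], y=X_pca_50[:, 1], s=10)

#tSNE (run on the 50 PCs rather than on the raw features)
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_tsne = tsne.fit_transform(X_pca_50)
sns.scatterplot(x=X_tsne[:, 0], y=X_tsne[:, 1], s=10, hue=cell_type.to_numpy())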

#tsne draw with diff seed (random state)
fig, axs = plt.subplots(1, 3, figsize=(18, 5))
for i, seed in enumerate([1, 42, 99]):
    tsne = TSNE(n_components=2, random_state=seed, perplexity=30, max_iter=500, init="random")
    X_tsne = tsne.fit_transform(X_pca_50)
    
    sns.scatterplot(x=X_tsne[:, 0], y=X_tsne[:, 1], s=10, ax=axs[i], legend=False)
    axs[i].set_title(f"t-SNE with seed={seed}")
    axs[i].set_xlabel("t-SNE 1")
    axs[i].set_ylabel("t-SNE 2")

plt.tight_layout()
plt.show()

### PCA: dimension reduction
- scale/standardize the data
- calculate the eigenvalues and eigenvectors of the covariance matrix; the first few components capture the most variance
- project the data onto the top principal components
- top 2 for visualization, top 10 for downstream analysis
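
The PCA steps above can be sketched directly in NumPy. This is a minimal illustration on made-up toy data (the data and all variable names are assumptions); in practice `sklearn.decomposition.PCA` does the same job.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 200 samples, 3 features; feature 1 is strongly correlated with feature 0
x0 = rng.normal(size=200)
X = np.column_stack([x0, 2 * x0 + 0.1 * rng.normal(size=200), rng.normal(size=200)])

# 1. Standardize each feature (zero mean, unit variance)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Eigen decomposition of the covariance matrix (eigh is meant for symmetric matrices)
cov = np.cov(X_std, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 3. Sort components by decreasing eigenvalue (eigh returns ascending order)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# 4. Project the data onto the top 2 principal components
X_pca = X_std @ eigenvectors[:, :2]

# Share of the total variance captured by each component
explained_variance_ratio = eigenvalues / eigenvalues.sum()
print(explained_variance_ratio)
```

Because two of the three toy features carry almost the same information, the first component should capture most of the variance.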

### Reduce collinearity
- Inspect pairwise correlations with `df.corr()`
- Feature selection
- Regularization (L1, L2)
- Dimension reduction (PCA)
- Model-agnostic checks: permutation importance, SHAP
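
As one possible way to combine `df.corr()` with feature selection, the sketch below drops one feature from every highly correlated pair. The toy frame `df_demo` and the 0.9 threshold are assumptions chosen for illustration, not values from the exercises.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=100)
df_demo = pd.DataFrame({
    "a": a,
    "a_copy": a * 3 + rng.normal(scale=0.01, size=100),  # nearly collinear with "a"
    "b": rng.normal(size=100),
})

threshold = 0.9  # assumed cut-off for "too correlated"
corr = df_demo.corr().abs()

# Keep only the upper triangle so that every pair is inspected once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop the later column of each pair whose absolute correlation exceeds the threshold
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
df_reduced = df_demo.drop(columns=to_drop)
print(to_drop)
```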


In [None]:
#7. CLUSTER ENGINEERING
import pandas as pd
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

#Feature Engineering
feature_cols = ['Recency', 'Frequency', 'Monetary']
sns.histplot(rfm['Recency'])
#remove outliers using IQR, with separate bounds for each feature
iqr_factor = 1.5  # assumption: the conventional IQR multiplier
for col in feature_cols:
    Q1 = rfm[col].quantile(0.25)
    Q3 = rfm[col].quantile(0.75)
    IQR = Q3 - Q1
    lowerbound = Q1 - iqr_factor * IQR
    upperbound = Q3 + iqr_factor * IQR
    rfm = rfm[(rfm[col] > lowerbound) & (rfm[col] < upperbound)]
rfm = rfm.copy()  # work on a copy of the filtered frame before adding columns
#Scaling
scaler = StandardScaler()
rfm_scaled = scaler.fit_transform(rfm[feature_cols])
#Create df
rfm_scaled_df = pd.DataFrame(data=rfm_scaled, columns=feature_cols, index=rfm.index)
#Train
#Find optimal k 
inertia = []
k_range = range(2, 11)
for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(rfm_scaled_df)
    inertia.append(kmeans.inertia_)

optimal_k = 4  # chosen by inspecting the elbow of the inertia curve
kmeans = KMeans(n_clusters=optimal_k, random_state=42)
#create a new df col ['Cluster']
rfm['Cluster'] = kmeans.fit_predict(rfm_scaled_df)

#EVALUATION
#groupby Cluster column

df_cluster = rfm.groupby('Cluster').agg({
    'Recency': 'mean',
    'Monetary': 'mean',
    'Frequency': 'mean'
}).round(2)

df_cluster /= df_cluster.max(axis=0)  # rescale each column so that its maximum is 1

sns.heatmap(df_cluster, annot=True)

In [None]:
#HEATMAP
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV

sns.heatmap(df.corr())
#Train test split
#Scale
#Linear regression model with default params
lr = LinearRegression()
lr.fit(X_train_scaled, y_train)
y_pred = lr.predict(X_test_scaled)
#r2 score, mse
r2 = r2_score(y_test, y_pred)  # a separate name, so the r2_score function is not shadowed
mse = mean_squared_error(y_test, y_pred)
#Plot scatterplot 
plt.scatter(y_test, y_pred, alpha=0.6)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--', label='Ideal Fit')
plt.legend()
#Find coefficient
feature_names = X_train.columns
coefficients = lr.coef_

#create df
coef_df = pd.DataFrame({
    'Feature': feature_names,
    'Coefficient': coefficients
})
coef_df['Coefficient_Abs'] = np.abs(coef_df.Coefficient)
#plot using barplot
sns.barplot(x='Coefficient', y='Feature', data=coef_df)

#LASSO
alphas = np.logspace(-4, 0, 50)
model = Lasso(max_iter=10000)
lasso_cv = GridSearchCV(
    model,
    param_grid={'alpha': alphas},
    cv=5
)
#fit lasso
lasso_cv.fit(X_train_scaled, y_train)
lasso_cv.best_params_
#lasso df
best_model = lasso_cv.best_estimator_
df_best = pd.DataFrame({
    'Feature': feature_names,
    'Coefficient': best_model.coef_,
    'Abs_Coefficient': np.abs(best_model.coef_)
})
##barplot
sns.barplot(x='Coefficient', y='Feature', data=df_best)

In [None]:
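#Permutation importances
import pandas as pd
import seaborn as sns
import shap
from sklearn.inspection import permutation_importance

result = permutation_importance(model, X_test, y_test, scoring='f1', n_repeats=10, random_state=4)
#df for permu importance and sorted
df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': result.importances_mean
}).sort_values(by='Importance', ascending=True)
#visualize barplot
sns.barplot(x='Importance', y='Feature', data=df)
#shap
explainer = shap.TreeExplainer(model)  # model is assumed to be a tree-based classifier
shap_values = explainer.shap_values(X_test)
# [:, :, 1] selects the positive class; this assumes shap_values is a 3-D array
# (samples, features, classes), which depends on the installed shap version
shap.summary_plot(shap_values[:, :, 1], X_test, plot_type='bar')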

In [None]:
#REBUILD WITH OPTIMIZER'S learning_rate = 0.05

#1: Build model Sequential
import keras
from keras import Sequential
from keras.layers import Dense, Flatten
nn2 = Sequential([
    keras.Input(shape = (32,32,3)),
    Flatten(),
    Dense(128, activation ='relu'),
    Dense(10, activation = 'softmax')
])
nn2.summary()

#2: Compile model
from keras.optimizers import SGD
nn2.compile(
    optimizer = SGD(learning_rate = 0.05),
    loss = 'categorical_crossentropy',
    metrics = ['accuracy']
)
nn2.summary()

#3: Train model
history2 = nn2.fit(
    X_train, y_train_cat,
    validation_data = (X_test, y_test_cat),
    verbose = 1,
    epochs = 50,
    batch_size = 64
)

#4: Evaluate model
nn2.evaluate(X_test, y_test_cat)

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
clf = DecisionTreeClassifier(criterion = "entropy", min_samples_leaf = 3)
# Lots of parameters: criterion = "gini" / "entropy";
# max_depth;
# min_impurity_decrease (min_impurity_split was removed in scikit-learn 1.0);
clf.fit(X, y) # It can only handle numerical attributes!
# Categorical attributes need to be encoded, see LabelEncoder and OneHotEncoder
clf.predict([x]) # Predict class for x
clf.feature_importances_ # Importance of each feature
clf.tree_ # The underlying tree object
clf = RandomForestClassifier(n_estimators = 20) # Random Forest with 20 trees

# DM
### Domain Knowledge
- Reduce the risk of a Type II error (false negative): increase the sample size.
- Class imbalance: the model tends to be biased towards the majority class, especially on small datasets.
- Handle imbalance: downsample the majority class or upsample the minority class.
- Ensure accuracy, generalization and computational feasibility from the start.
- Binning (bucketing): turn numerical data into categorical data.
- Feature scaling: required for distance-based models (kNN, clustering); it makes features contribute equally so that none dominates the others, helps gradient descent converge faster, and helps avoid exploding/vanishing gradients.
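
Binning, mentioned in the list above, can be illustrated with pandas. The ages, bin edges and labels below are made-up values for illustration only.

```python
import pandas as pd

ages = pd.Series([5, 17, 25, 42, 67, 80], name="age")

# Binning with explicit edges; intervals are right-closed by default,
# so (0, 18] contains 17 and 18 but not 0
age_group = pd.cut(ages, bins=[0, 18, 65, 120], labels=["child", "adult", "senior"])

# Quantile binning: each bin receives roughly the same number of samples
age_tercile = pd.qcut(ages, q=3, labels=["low", "mid", "high"])

print(pd.DataFrame({"age": ages, "group": age_group, "tercile": age_tercile}))
```

`pd.cut` suits cases where the edges have a domain meaning, while `pd.qcut` is a reasonable default when only balanced bin sizes matter.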

### Dimensional Reduction

PCA
- Reduces dimensionality by reducing the number of features while preserving as much variance as possible.
- Standardization -> find the principal components (PCs) -> project the points onto the most prominent PCs
- good at preserving global structure
- linear transformation
- Centering ensures that the PCs describe the true shape and spread of the data.
- The variance captured by each PC is its eigenvalue.

tSNE
- Converts distances into probability distributions, then tries to match the 2D distribution to the high-dimensional one using attractive/repulsive forces
- preserves local structure
- Compute similarities in high dimension (Euclidean distance between every pair of points, converted to probabilities using a Gaussian distribution) -> place the points randomly in low dimension -> compute similarities in low dimension -> compare the high- and low-dimensional similarities using KL divergence -> move the points with gradient descent (the forces) and repeat

Perplexity: 
- controls roughly how many neighbors each point considers
- balances local and global structure
- controls how tSNE defines neighborhoods
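
The similarity and KL-divergence steps of tSNE can be sketched in NumPy as follows. This is a simplified illustration, not the full algorithm: it assumes one fixed Gaussian width `sigma` for all points, whereas real tSNE tunes the width per point to match the chosen perplexity, and it only evaluates the cost once instead of optimizing it. All function names are hypothetical.

```python
import numpy as np

def pairwise_sq_dists(Z):
    """Squared Euclidean distance between every pair of rows of Z."""
    diff = Z[:, None, :] - Z[None, :, :]
    return (diff ** 2).sum(axis=-1)

def high_dim_affinities(X, sigma=1.0):
    """Symmetric Gaussian affinities P (fixed sigma is a simplification)."""
    logits = -pairwise_sq_dists(X) / (2 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)          # a point is not its own neighbor
    cond = np.exp(logits)
    cond /= cond.sum(axis=1, keepdims=True)    # conditional p(j | i)
    return (cond + cond.T) / (2 * len(X))      # symmetrized, sums to 1 overall

def low_dim_affinities(Y):
    """Student-t (one degree of freedom) affinities Q in the embedding."""
    inv = 1.0 / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(inv, 0.0)
    return inv / inv.sum()

def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q), the cost that tSNE minimizes with gradient descent."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / (Q[mask] + eps))))

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 10))   # toy high-dimensional data
Y = rng.normal(size=(30, 2))    # random initial 2D embedding
P, Q = high_dim_affinities(X), low_dim_affinities(Y)
print(kl_divergence(P, Q))
```

A full implementation would repeatedly move the rows of `Y` along the negative gradient of this cost until the two distributions match as closely as possible.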
### Validation method
- Precision/recall/accuracy/specificity
- F1 score = 2 * Precision * Recall / (Precision + Recall)
- ROC AUC
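
To make the definitions concrete, the metrics above can be computed from the four confusion-matrix counts. The counts in this sketch are invented for illustration, and the helper name is hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the common metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)       # of the predicted positives, how many are correct
    recall = tp / (tp + fn)          # of the actual positives, how many are found
    specificity = tn / (tn + fp)     # of the actual negatives, how many are found
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "accuracy": accuracy,
        "f1": f1,
    }

metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```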
### Training:
- predict (forward pass) >> compute the loss function >> update weights and biases >> repeat
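
The training loop can be sketched with plain gradient descent on a one-feature linear model. The toy data, the learning rate and the number of epochs are assumptions picked for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 3.0 * X[:, 0] + 1.0 + 0.1 * rng.normal(size=200)   # true weight 3, true bias 1

w, b = 0.0, 0.0          # model parameters, learned by the optimizer
learning_rate = 0.1      # hyperparameter, chosen by hand here
losses = []

for epoch in range(200):
    y_hat = w * X[:, 0] + b                  # 1. predict (forward pass)
    error = y_hat - y
    losses.append(np.mean(error ** 2))       # 2. compute the loss (MSE)
    grad_w = 2 * np.mean(error * X[:, 0])    # 3. gradients of the loss
    grad_b = 2 * np.mean(error)
    w -= learning_rate * grad_w              # 4. update weight and bias
    b -= learning_rate * grad_b

print(round(w, 2), round(b, 2))
```

The learned `w` and `b` should end up close to the values used to generate the data, and the recorded losses should shrink over the epochs.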
### Hyperparams vs model params
- Hyperparameters (learning rate, n_clusters) are set before training and tuned using a validation method; model parameters (weights/biases) are learned by the optimization algorithm
### Entropy: 
- measures the uncertainty (impurity) of a dataset; lower entropy means purer labels
### Information Gain: 
- the reduction in entropy achieved by a split
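
Entropy and information gain can be computed by hand for a small split. The labels below are a made-up example, and the helper names are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["yes"] * 5 + ["no"] * 5       # perfectly mixed: entropy of 1 bit
left = ["yes"] * 4 + ["no"]             # mostly "yes" after the split
right = ["yes"] + ["no"] * 4            # mostly "no" after the split

gain = information_gain(parent, [left, right])
print(f"parent entropy: {entropy(parent):.3f}, information gain: {gain:.3f}")
```

A decision tree with `criterion = "entropy"` picks, at each node, the split with the highest information gain.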
### Regularization:
- reduces model complexity by adding a penalty.
### Decision Tree:
- Regularization: PRUNING, i.e. removing branches in a bottom-up manner whenever this decreases the validation error.

### Random Forest: 
- ensemble of many decision trees
- Key ideas: bagging, voting, random subspace method
- Bagging: each tree is trained on a random subset of the data
- Voting: each tree votes; the majority vote is the final prediction
- Model complexity: full trees, with depth as a hyperparameter
- More robust to overfitting thanks to bagging
- Less sensitive to noise and outliers thanks to the averaging process.

### Ensemble Learning:
- methods that combine multiple learning algorithms to improve overall performance
- Bagging: train multiple models on random subsets of the dataset
- Random Subspace Method: train multiple models on random subsets of the features
- Boosting: train models iteratively, making each new model learn from the previous models' mistakes by increasing the weight of misclassified samples.
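
Bagging and majority voting can be sketched with very simple base models. The example below uses one-threshold "stumps" on toy 1-D data instead of full decision trees, which is an assumption made to keep the code short; the accuracy is measured on the training data only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy 1-D data: class 0 centred at -1, class 1 centred at +1
X = np.concatenate([rng.normal(-1, 1, 200), rng.normal(1, 1, 200)])
y = np.concatenate([np.zeros(200, dtype=int), np.ones(200, dtype=int)])

def fit_stump(x, labels):
    """Pick the threshold (among the sample values) with the best training accuracy."""
    thresholds = np.unique(x)
    accuracies = [np.mean((x > t).astype(int) == labels) for t in thresholds]
    return thresholds[int(np.argmax(accuracies))]

# Bagging: every stump is trained on a bootstrap sample (drawn with replacement)
n_models = 25  # odd, so a majority vote can never tie
stump_thresholds = []
for _ in range(n_models):
    idx = rng.integers(0, len(X), size=len(X))
    stump_thresholds.append(fit_stump(X[idx], y[idx]))

# Voting: every stump predicts, and the majority class wins
votes = np.array([(X > t).astype(int) for t in stump_thresholds])  # (n_models, n_samples)
y_pred = (votes.mean(axis=0) > 0.5).astype(int)
accuracy = np.mean(y_pred == y)
print(accuracy)
```

A random forest follows the same pattern with decision trees as base models and adds a random subset of features at each split.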
### Clustering algorithm:
- Quantify the similarity between data points and centroids using the squared Euclidean distance $||x - c||^2$
- Pick the number of clusters k >> place the centroids randomly >> calculate the distance between data points and cluster centroids >> assign each point to the nearest cluster >> recalculate the centroids >> repeat until the assignments stop changing or the max iteration is reached.
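
The loop described above can be written in a few lines of NumPy. This is a bare-bones sketch on toy data with hypothetical helper names; `sklearn.cluster.KMeans` adds better initialization and multiple restarts.

```python
import numpy as np

def assign(X, centroids):
    """Return the nearest centroid index and the squared distance to it for every point."""
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1), d2.min(axis=1)

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Random initialization: pick k distinct data points as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels, _dist = assign(X, centroids)
        # Recompute each centroid as the mean of its points (keep it if the cluster is empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):   # converged
            break
        centroids = new_centroids
    labels, d2_min = assign(X, centroids)
    inertia = float(d2_min.sum())                   # within-cluster sum of squares
    return labels, centroids, inertia

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)), rng.normal(5, 0.5, size=(50, 2))])
labels, centroids, inertia = kmeans(X, k=2)
print(centroids, inertia)
```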
#### Limitation of Random Initialization:
- The algorithm can get stuck in a local optimum: it is good at refining a solution locally but not globally.
- Improve: run with different seeds and choose the solution with the most compact clusters.
- Measure cluster compactness: use inertia, also called the within-cluster sum of squares (WCSS)
#### Elbow method
- Calculate the inertia for a range of cluster counts k.
- Select the elbow point, since adding more clusters beyond it does not improve the model much
- Not an optimal solution: there are no ground-truth labels (unsupervised learning), no universal objective that defines the best clusters, and the number of clusters k is not uniquely determined.

#### Gap Statistics
- tells us how far away our data clustering is from what we would expect by chance.
- a higher gap means the clusters are well separated, so by maximizing the gap we can find a good cluster configuration
- Compute log(W) where W is the WCSS >> compute log(W) on uniformly sampled reference data and average >> repeat for each k and calculate Gap(k) >> pick k as the smallest value whose gap is within one standard error of the gap at k+1
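
A possible implementation of these steps with scikit-learn is sketched below. It assumes uniform reference data drawn inside the bounding box of the real data and uses the selection rule "smallest k with Gap(k) >= Gap(k+1) - s(k+1)"; the blob data, the number of reference sets and the helper names are illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def log_wcss(X, k, seed=0):
    """log of the within-cluster sum of squares (inertia) for k clusters."""
    model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    return np.log(model.inertia_)

def gap_statistic(X, k_values, n_refs=10, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    gaps, errors = [], []
    for k in k_values:
        # Reference datasets: uniform samples inside the bounding box of X
        ref = np.array([log_wcss(rng.uniform(lo, hi, size=X.shape), k) for _ in range(n_refs)])
        gaps.append(ref.mean() - log_wcss(X, k))
        errors.append(ref.std() * np.sqrt(1 + 1 / n_refs))
    return np.array(gaps), np.array(errors)

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)
k_values = list(range(1, 7))
gaps, errors = gap_statistic(X, k_values)

# Smallest k whose gap is within one standard error of the gap at k + 1
best_k = next(
    (k for i, k in enumerate(k_values[:-1]) if gaps[i] >= gaps[i + 1] - errors[i + 1]),
    k_values[-1],
)
print(best_k)
```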
#### Clustering Limitation
- Performs poorly with outliers
- Sensitive to feature scaling
- Assumes spherical, equally sized clusters
#### Clustering purpose
- for explainability, not for predicting outcomes
### DL special
- Recognizes patterns by itself using self-learned features
### Fit the model
- Means finding the parameters that keep the error within an acceptable margin while remaining reliable.

#### Why we go deep in NN ?
- To learn the hierarchical features that help the prediction
#### Choose right network size
- Network size is the number of layers and the number of neurons in each layer.
- It is just a hyperparameter to tune; other methods such as early stopping and regularization can also be considered.
#### Training difficulty
- Exploding/vanishing gradients; vanishing gradients are typical with saturating activation functions like sigmoid or tanh
#### Optimization tricks
- Weight initialization
- Learning rate decay during training
- SGD, mini-batch GD
- Gradient clipping
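
Two of the tricks above, gradient clipping and learning rate decay, are small enough to sketch directly. The functions below are illustrative helpers with assumed names and values, not the Keras implementations.

```python
import numpy as np

def clip_by_norm(grad, max_norm):
    """Rescale the gradient so that its L2 norm never exceeds max_norm."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

def exponential_decay(initial_lr, decay_rate, step):
    """Learning rate after `step` decay steps."""
    return initial_lr * decay_rate ** step

grad = np.array([3.0, 4.0])                  # L2 norm is 5
clipped = clip_by_norm(grad, max_norm=1.0)   # same direction, norm reduced to 1
print(clipped)

learning_rates = [exponential_decay(0.05, 0.9, step) for step in range(3)]
print(learning_rates)
```

Clipping bounds the size of each update, which guards against exploding gradients, while decay takes large steps early and smaller steps as training settles.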
#### Build NN from scratch
1. Choose a model: NN, CNN, RNN, ...
2. Build the model
3. Define the cost function
4. Select the learning rate and optimizer
5. Apply backprop
6. Do hyperparameter optimization

#### RAG
- Embedding method: turn text chunks into vectors for semantic search
- Use sentence embeddings
- Vector DB: store and retrieve text by vectors
- Most similar chunks: found with a similarity or distance metric (cosine, dot product, Euclidean, Manhattan)
- Contextual answer: the retrieved chunks plus the question are passed to the LLM, which produces a contextual answer
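
The retrieval step can be sketched with cosine similarity over a handful of vectors. The 3-dimensional embeddings below are hand-written placeholders; a real pipeline would obtain them from a sentence-embedding model and keep them in a vector DB.

```python
import numpy as np

chunks = [
    "PCA preserves global structure",
    "t-SNE preserves local structure",
    "Random forests use bagging",
]
# Hypothetical embeddings, one row per chunk
chunk_vectors = np.array([
    [0.9, 0.1, 0.0],
    [0.7, 0.6, 0.1],
    [0.0, 0.2, 0.9],
])
question = "What does t-SNE preserve?"
query_vector = np.array([0.8, 0.5, 0.0])   # hypothetical embedding of the question

def cosine_similarity(matrix, vector):
    """Cosine similarity between every row of matrix and vector."""
    return matrix @ vector / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(vector))

scores = cosine_similarity(chunk_vectors, query_vector)
top_k = np.argsort(scores)[::-1][:2]        # indices of the 2 most similar chunks
retrieved = [chunks[i] for i in top_k]

# The retrieved chunks and the question together form the prompt for the LLM
prompt = "Context:\n" + "\n".join(retrieved) + "\n\nQuestion: " + question
print(prompt)
```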
#### Application of Explainability: Feature Importance Analysis


In [None]:
nn.compile(
    optimizer = SGD(learning_rate = 0.05),
    loss = 'categorical_crossentropy',
    metrics = ['accuracy']
)
nn.summary()
nn.fit(
    X_train, y_train_cat,
    validation_data = (X_test, y_test_cat),
    verbose = 1,
    epochs = 50,
    batch_size = 64
)
nn.evaluate(X_test, y_test_cat)


# Data Mining / Data Science Project

A structured collection of **machine learning, data mining, and deep learning exercises** implemented in Python using Jupyter Notebooks, completed as a three-month project. This repository demonstrates **end-to-end ML pipelines**, from data preprocessing and classical models to **CNNs and explainable AI**.

# Introduction & Goals

Throughout the project, the standard workflow is discussed and practised, including:
- Problem understanding.
- Data exploration & preprocessing.
- Feature engineering. 
- Model selection & training.
- Validation & parameter tuning.
- Testing & communicating results.
- Deployment & monitoring. 


# Contents

Each notebook focuses on **both implementation and conceptual understanding**.

## Repository Structure
```text
├── Exercise_2_Pandas_Introduction.ipynb
├── Exercise_3_4_DataMiningPipeline_LR.ipynb
├── Exercise_5_PCA_tSNE.ipynb
├── Exercise_6_ClassificationPipeline_RDF.ipynb
├── Exercise_7_clustering.ipynb
├── Exercise_8_HeatingLoadRegression.ipynb
├── Exercise_9_10_NeuralNetWorks_fromLR_to_NN.ipynb
├── Exercise_10_plus_ImageRecognition_CNN.ipynb
└── Exercise_11_Explainability_Analysis.ipynb
```

# Exercise Highlights
## 📗 Exercise 2 – Pandas Introduction
**File:** [`Exercise_2_Pandas_Introduction.ipynb`](Exercise_2_Pandas_Introduction.ipynb)


### Overview
Introduction to **Pandas** for tabular data manipulation and exploration.

### What I have learned
- How to create and manipulate `Series` and `DataFrame`
- The difference between `loc` and `iloc` indexing
- How to filter data using logical conditions
- How to handle missing values and basic data cleaning
- Why Pandas is the foundation of all data science and ML workflows

## 📗 Exercise 3 & 4 – Data Mining Pipeline & Logistic Regression

**File:** [`Exercise_3_4_DataMiningPipeline_LR.ipynb`](Exercise_3_4_DataMiningPipeline_LR.ipynb)


### Overview
Implementation of a complete machine learning pipeline using **Logistic Regression**.

### What I have learned
- How to structure an end-to-end machine learning pipeline
- The importance of proper train/test splitting
- How Logistic Regression performs binary classification
- The difference between class prediction and probability prediction
- Why metrics like **ROC-AUC** are preferred over accuracy for imbalanced datasets
- How data preparation impacts model performance more than the model itself

## 📗 Exercise 5 – PCA & t-SNE

**File:** [`Exercise_5_PCA_tSNE.ipynb`](Exercise_5_PCA_tSNE.ipynb)

### Overview
Dimensionality reduction techniques applied to high-dimensional biological data.

### What I have learned
- Why standardization is needed before applying PCA when features are on different scales
- How PCA reduces dimensionality while preserving variance
- How to interpret PCA scatter plots
- Why t-SNE is useful for visualization but not for modeling
- The conceptual difference between linear (PCA) and non-linear (t-SNE) methods

## 📗 Exercise 6 – Classification Pipeline with Random Forest

**File:** [`Exercise_6_ClassificationPipeline_RDF.ipynb`](Exercise_6_ClassificationPipeline_RDF.ipynb)

### Overview
Supervised classification using **Random Forest** and ensemble learning.

### What I have learned
- How ensemble methods improve model robustness
- Why Random Forests are less prone to overfitting
- That tree-based models do not require feature scaling
- How feature importance is computed in tree-based models
- The trade-off between model performance and interpretability

## 📗 Exercise 7 – Clustering

**File:** [`Exercise_7_clustering.ipynb`](Exercise_7_clustering.ipynb)

### Overview
Unsupervised learning using **K-Means clustering**.

### What I have learned
- The difference between supervised and unsupervised learning
- Why scaling is critical for distance-based clustering algorithms
- How the Elbow Method helps estimate the optimal number of clusters
- What inertia measures and its limitations
- That clustering results are heuristic and context-dependent

## 📗 Exercise 8 – Heating Load Regression

**File:** [`Exercise_8_HeatingLoadRegression.ipynb`](Exercise_8_HeatingLoadRegression.ipynb)

### Overview
Regression modeling for predicting heating load based on building features.

### What I have learned
- How linear regression models continuous target variables
- How to analyze correlations between features and target
- What R² score represents in regression tasks
- Why multicollinearity affects interpretability
- How MSE penalizes large prediction errors

## 📗 Exercise 9 & 10 – From Logistic Regression to Neural Networks

**File:** [`Exercise_9_10_NeuralNetWorks_fromLR_to_NN.ipynb`](Exercise_9_10_NeuralNetWorks_fromLR_to_NN.ipynb)

### Overview
Transition from traditional machine learning to **neural networks**.

### What I have learned
- Logistic Regression can be seen as a single-layer neural network
- Why one-hot encoding is required for multi-class classification
- How fully connected neural networks process image data
- The role of activation functions and loss functions
- The conceptual difference between forward propagation and backpropagation

## 📗 Exercise 10+ – Image Recognition with CNN

**File:** [`Exercise_10_plus_ImageRecognition_CNN.ipynb`](Exercise_10_plus_ImageRecognition_CNN.ipynb)

### Overview
Image classification using **Convolutional Neural Networks (CNNs)**.

### What I have learned
- How convolutional layers extract spatial features
- Why CNNs outperform fully connected networks on image data
- How pooling layers reduce dimensionality and overfitting
- The concept of parameter sharing in CNNs
- How CNNs learn hierarchical features from edges to objects

## 📗 Exercise 11 – Explainability Analysis

**File:** [`Exercise_11_Explainability_Analysis.ipynb`](Exercise_11_Explainability_Analysis.ipynb)

### Overview
Explainable AI techniques applied to classification models.

### What I have learned
- Why model accuracy alone is not sufficient
- The importance of explainability in real-world ML systems
- The difference between global and local explanations
- How different feature importance methods yield different insights
- Why trust and transparency are essential for deploying ML models
