Classification_Clustering_Freq_Pattern_Mining

2019-2020 Fall CSE4063 - Data Mining

Three projects covering classification, clustering analysis and frequent pattern mining, prepared for the Data Mining course at Marmara University. The notebooks were written on the Kaggle platform, so their online versions are recommended for better visuals.

Online Notebooks:

Repo Content and Implementation Steps:

1.phishing-websites.ipynb

Six classifiers (CART, C4.5, Naive Bayes, Support Vector Machine, a neural network with 1 hidden layer and a neural network with 2 hidden layers) are trained on the Phishing Websites dataset. Hyperparameters are tuned with 5-fold cross-validation, together with the necessary preprocessing steps.
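
The tuning step can be sketched as follows. This is a minimal illustration, not the notebook's code: it assumes scikit-learn, uses synthetic data from `make_classification` in place of the Phishing Websites dataset, tunes only a CART-style `DecisionTreeClassifier`, and the scaler and parameter grid are hypothetical choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real dataset (assumption).
X, y = make_classification(n_samples=300, n_features=10, n_informative=5, random_state=0)

# Preprocessing lives inside the pipeline so it is re-fit on every training fold.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("clf", DecisionTreeClassifier(random_state=0)),
])

# Hypothetical grid; the notebook's actual grid is not given in this README.
param_grid = {"clf__max_depth": [3, 5, None], "clf__min_samples_leaf": [1, 5]}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="accuracy")
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

Keeping the scaler inside the `Pipeline` avoids leaking statistics from the validation fold into training, which is the main reason to combine preprocessing and tuning this way.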

2.absenteeism-at-work-clustering.ipynb

Clustering analysis is performed on the Absenteeism at Work dataset. First, EDA, outlier detection (IQR), normalization (Min-Max Scaler) and feature selection with Random Forest (and Permutation Importance) are completed.
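
A rough sketch of these preprocessing steps, assuming NumPy and scikit-learn. The data is synthetic, the 1.5 x IQR fence is the conventional choice rather than one confirmed by the notebook, and the target `y` used for ranking features is hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:5, 0] = 15.0  # inject obvious outliers into feature 0
y = 3.0 * X[:, 1] + rng.normal(scale=0.1, size=200)  # hypothetical target driven by feature 1

# IQR rule: keep rows whose every feature lies within [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = np.percentile(X, [25, 75], axis=0)
iqr = q3 - q1
inside = np.all((X >= q1 - 1.5 * iqr) & (X <= q3 + 1.5 * iqr), axis=1)
X_clean, y_clean = X[inside], y[inside]

# Min-max normalization rescales each feature to [0, 1].
X_scaled = MinMaxScaler().fit_transform(X_clean)

# Rank features by impurity-based and permutation importance.
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_scaled, y_clean)
perm = permutation_importance(forest, X_scaled, y_clean, n_repeats=5, random_state=0)
ranking = np.argsort(perm.importances_mean)[::-1]

print("rows kept:", int(inside.sum()), "feature ranking:", ranking.tolist())
```
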
K-means clusters are visualized with 3D t-SNE plots after searching for possible elbow points (based on the inertia attribute). After that, a PCA + K-means pipeline is tested. PCA discards most of the valuable information in the data, so the resulting graphs look incomplete. Running k-means on the original 7-dimensional data and then plotting with t-SNE gives better results.
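
The elbow search and the t-SNE projection might look like the sketch below. It assumes scikit-learn and uses 7-dimensional synthetic blobs in place of the real data; the range of `k` and the t-SNE perplexity are illustrative values, and plotting is left out.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Synthetic 7-dimensional stand-in for the selected features (assumption).
X, _ = make_blobs(n_samples=150, n_features=7, centers=3, random_state=0)

# Record inertia for each k; the "elbow" is where the decrease flattens out.
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

# Cluster in the original space, then project to 3D only for visualization.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
embedding = TSNE(n_components=3, perplexity=30, random_state=0).fit_transform(X)

for k, value in inertias.items():
    print(k, round(value, 1))
```

Clustering before projecting keeps the cluster assignment independent of the distortions that any 3D embedding introduces.
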
The AgglomerativeClustering class has no inertia attribute (the sum of squared distances of samples to their closest cluster center), so the silhouette coefficient (best: 1, worst: -1) is used to select the number of clusters for AGNES. Again, 3D t-SNE clusters and a dendrogram are plotted.
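
A minimal sketch of choosing the cluster count by silhouette, assuming scikit-learn and SciPy. The blob centers, the Ward linkage and the candidate range of `k` are illustrative assumptions; the dendrogram is computed with `no_plot=True` so no plotting backend is needed.

```python
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated 7-dimensional blobs as a stand-in dataset (assumption).
centers = [[0.0] * 7, [10.0] * 7, [-10.0] * 7]
X, _ = make_blobs(n_samples=150, centers=centers, cluster_std=1.0, random_state=0)

# Score each candidate number of clusters with the silhouette coefficient.
scores = {}
for k in range(2, 7):
    labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)

# Build the merge tree; pass the result to a plotting call to draw it.
tree = dendrogram(linkage(X, method="ward"), no_plot=True)

print("best k:", best_k)
```
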
A DBSCAN model is also implemented, and the best values for the eps and min_samples parameters are found by grid search using the silhouette coefficient. The best model is again visualized in 3D t-SNE plot.ly graphs.
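
The parameter search could be sketched as below, assuming scikit-learn. The candidate values are hypothetical, and two choices here are assumptions rather than the notebook's confirmed method: noise points are excluded before scoring, and settings that yield fewer than two clusters are skipped because the silhouette coefficient is undefined for them.

```python
from itertools import product

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import MinMaxScaler

centers = [[0.0] * 7, [10.0] * 7, [-10.0] * 7]
X, _ = make_blobs(n_samples=150, centers=centers, cluster_std=1.0, random_state=0)
X = MinMaxScaler().fit_transform(X)

best = None
for eps, min_samples in product([0.05, 0.1, 0.2, 0.3], [3, 5, 10]):
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    clustered = labels != -1  # DBSCAN marks noise points with -1
    n_clusters = len(set(labels[clustered]))
    if n_clusters < 2:
        continue  # silhouette needs at least two clusters
    score = silhouette_score(X[clustered], labels[clustered])
    if best is None or score > best["score"]:
        best = {"eps": eps, "min_samples": min_samples,
                "n_clusters": n_clusters, "score": score}

print(best)
```
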
Finally, in the evaluation step, the best models from the 3 methods are compared using 9 metrics:
- Estimated number of clusters
- Estimated number of noise points
- Homogeneity
- Completeness
- V-measure
- Adjusted Rand Index
- Adjusted Mutual Information
- Fowlkes-Mallows score
- Silhouette Coefficient
Explanations and comments on the results can be found in the notebooks.
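
The nine metrics listed above can be collected with scikit-learn roughly as follows. This is a sketch under assumptions: `summarize` is a hypothetical helper, the data is synthetic, and `labels_true` stands for whatever reference labels the notebook compares against, which this README does not specify.

```python
import numpy as np
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs


def summarize(X, labels_true, labels_pred):
    """Return the nine evaluation metrics for one clustering result."""
    labels_pred = np.asarray(labels_pred)
    n_noise = int(np.sum(labels_pred == -1))  # only DBSCAN produces -1 labels
    n_clusters = len(set(labels_pred.tolist())) - (1 if n_noise else 0)
    report = {
        "Estimated number of clusters": n_clusters,
        "Estimated number of noise points": n_noise,
        "Homogeneity": metrics.homogeneity_score(labels_true, labels_pred),
        "Completeness": metrics.completeness_score(labels_true, labels_pred),
        "V-measure": metrics.v_measure_score(labels_true, labels_pred),
        "Adjusted Rand Index": metrics.adjusted_rand_score(labels_true, labels_pred),
        "Adjusted Mutual Information": metrics.adjusted_mutual_info_score(labels_true, labels_pred),
        "Fowlkes-Mallows score": metrics.fowlkes_mallows_score(labels_true, labels_pred),
    }
    # The silhouette coefficient needs no reference labels but at least two groups.
    if len(set(labels_pred.tolist())) > 1:
        report["Silhouette Coefficient"] = metrics.silhouette_score(X, labels_pred)
    return report


centers = [[0.0] * 7, [10.0] * 7, [-10.0] * 7]
X, y_true = make_blobs(n_samples=150, centers=centers, random_state=0)
y_pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
report = summarize(X, y_true, y_pred)

for name, value in report.items():
    print(f"{name}: {value}")
```

Apart from the silhouette coefficient and the two counts, every metric here compares predicted labels with reference labels, so the numbers are only as meaningful as the reference labeling.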

3.frequent-pattern-miningv2.ipynb

Association rules for a given dataset are extracted using the Apriori, FP-Growth and ECLAT algorithms of the mlxtend library after preprocessing with TransactionEncoder. The models are compared by memory usage and runtime.
