These are class projects for USC DSCI553: Foundations and Applications of Data Mining. To process large-scale data efficiently, all projects are implemented in Python with Spark. The dataset for each project can be found here.

| Topic | Code | Keyword | Data Size |
|---|---|---|---|
| Identifying_Frequent_Itemsets | Python | PCY, Apriori, SON | 9.20M Lines (5.59GB) |
| Recommendation_Systems | Python | Collaborative Filtering, MinHash, LSH | 0.62M Lines (528MB) |
| Community_Detection_Algorithm | Python | Betweenness, Community Detection, Girvan-Newman Algorithm | 38.7k Lines (1.8MB) |
| Clustering_Algorithm | Python | K-Means, Bradley-Fayyad-Reina (BFR) Algorithm, NMI | 1.46M Lines (666MB) |
| Mining_Streaming_Data | Python | Bloom Filter, Flajolet-Martin Algorithm, Twitter Streaming, Reservoir Sampling | 0.38M Lines (293MB) & streaming data |
| Hybrid_Recommendation_System | Python | Item-Based Collaborative Filtering, Switching, Cascade | 1.39M Lines (1.31GB) |