This project uses unsupervised learning to explore the data space and find patterns within it.
Several clustering algorithms are tested on datasets that are easy to cluster and on a harder movies dataset. The movies data combines two Kaggle datasets. The resulting dataset contains the following features:
- Title of the movie.
- Rating, a categorical variable indicating the Motion Picture Association film rating.
- Genres, a categorical variable indicating the main genre of the movie.
- Duration of the movie in minutes.
- Number of years since the film premiere.
- Votes in IMDb.
- Score in IMDb.
- Country, 1 if the country is the United States, 0 otherwise.
- Budget of the film.
All the unsupervised learning algorithms in this project were implemented from scratch, in order to understand them in depth.
The algorithms are in the notebook Unsupervised_Learning_Juliana_Henao.ipynb, which includes:
- K-means clustering
- Fuzzy C-means
- C-means clustering
- Trimmed k-means clustering
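As an illustration of the first algorithm in the list, here is a minimal sketch of plain k-means (Lloyd's algorithm) in NumPy. It is not the code from the notebook: the function name `kmeans`, its parameters, and the random initialisation from data points are assumptions made for this example.

```python
import numpy as np

def kmeans(X, k, n_iter=100, tol=1e-6, seed=0):
    """Illustrative k-means (hypothetical helper, not the notebook's code).

    Returns (centroids, labels) for an (n_samples, n_features) array X.
    """
    rng = np.random.default_rng(seed)
    # Initialise centroids with k distinct points drawn from the data.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: squared Euclidean distance from every point
        # to every centroid, then pick the nearest centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its members.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:  # stop once the centroids no longer move
            break
    return centroids, labels
```

Calling `kmeans(X, k=2)` on two well-separated groups of points should put each group in its own cluster. The other variants in the list change one of the two steps, for example by replacing hard assignments with membership degrees or by ignoring the most distant points when updating the centroids.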
The notebook also explores different similarity metrics and hyperparameter settings. The algorithms are described in the paper in this repository, Trabajo_No_supervisado_Juliana_Henao.pdf.
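To give an idea of how the choice of similarity metric affects clustering, the sketch below computes point-to-centroid distances under three common metrics and assigns each point to its nearest centroid. The metric names and the helpers `pairwise_distance` and `assign` are assumptions for this example; the metrics actually used in the notebook may differ.

```python
import numpy as np

def pairwise_distance(X, C, metric="euclidean"):
    """Distance from each row of X to each row of C (hypothetical helper).

    Returns an array of shape (len(X), len(C)).
    """
    if metric == "euclidean":
        return np.sqrt(((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2))
    if metric == "manhattan":
        return np.abs(X[:, None, :] - C[None, :, :]).sum(axis=2)
    if metric == "cosine":
        # Assumes no all-zero rows; a zero vector has no defined direction.
        Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
        Cn = C / np.linalg.norm(C, axis=1, keepdims=True)
        return 1.0 - Xn @ Cn.T
    raise ValueError(f"unknown metric: {metric}")

def assign(X, C, metric="euclidean"):
    """Index of the nearest centroid for each point under the given metric."""
    return pairwise_distance(X, C, metric).argmin(axis=1)
```

The metric can change the result: for a point at the origin and centroids at (3, 3) and (5, 0), the Euclidean metric picks the first centroid (distance about 4.24 versus 5), while the Manhattan metric picks the second (distance 6 versus 5).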
To replicate the results of this project, download the data folder and run the notebook. The notebook already includes a requirements.txt generator.
The description, analysis and conclusions of the results are also in the paper mentioned above.