Malaria is a life-threatening disease caused by parasites transmitted to people through the bites of infected female Anopheles mosquitoes. Malaria is preventable and curable.
This project was created to enable identification of disease through automated reading of images of tissues with either infected or parasite-free cells. The dataset was taken from https://www.kaggle.com/iarunava/cell-images-for-detecting-malaria.
The k-means algorithm divides data into k number of clusters and then, by calculating distances between new data and centroid, finds which centroid each image refers to. https://towardsdatascience.com/understanding-k-means-clustering-in-machine-learning-6a6e67336aa1.
First of all, each image was converted from RGB into white and black format (8-bit color, from 0 to 256). Then, the histogram was taken and stored in the array. Next, each histogram was equalized to obtain better accuracy. https://docs.opencv.org/2.4/doc/tutorials/imgproc/histograms/histogram_equalization/histogram_equalization.html.
https://towardsdatascience.com/understanding-k-means-clustering-in-machine-learning-6a6e67336aa1. Each picture's histogram has 256 digits that represent the number of pixels of the single code (0-256). For k-means we used 256-dimensional space, where each picture represents a point of 256 number of coordinates. The number of centroids was set to 2 (for parasitized and uninfected). Finally, the whole data was processed by the algorithm, which outputs two centroids.
For prediction, a new picture was taken, converted into black and white, equalized, and used to obtain a point of 256 coordinates. Then, by looking at which centroid was the picture's histogram the closest (by euclidian distance), we could find its reference cluster. This told us whether or not the image contained parasitized cells. Finally, repeating the prediction algorithm, we could find its reference class.
The accuracy of prediction was insufficient for functional use. The k-means algorithm was not good for image classification. For more algorithms, one can look to the kaggle kernels of malaria detection https://www.kaggle.com/iarunava/cell-images-for-detecting-malaria. To increase predicting accuracy, the following adaptations were made:
- Color of # 0 was eliminated from all images, because of the color’s extreme variation.
- Histograms were equalized.
- The number of training data was increased.
In the end, the accuracy did not increase sufficiently.
OpenCV, NumPy, Pandas, Matplotlib
$ pip install -r requirement.txt
Free Software, Hell Yeah!