No numpy, no pandas, no sklearn. Only Hardcore.
My own from-scratch implementations of AI/ML algorithms, and their testing.
The reason I do this: to learn the internals of ML algorithms.
Implemented algorithms:
- Regression: linear, non-linear, multi-variable.
- Logistic Regression
- K-Nearest Neighbors
- K-Means
- Principal Component Analysis
To support data manipulation, I developed the data_reader.py package instead of using pandas. Data is stored in a table-like data structure in main memory, which works well for data sets that fit into memory.
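Below is a minimal sketch of the idea behind such a table-like structure. It is illustrative only: the class name, methods, and the CSV file/column names used here are assumptions for the sketch, not the actual data_reader.py API.

```python
# Minimal sketch of an in-memory, table-like structure (illustrative only;
# the real data_reader.py API may differ).
import csv

class Table:
    def __init__(self, headers, rows):
        self.headers = headers  # list of column names
        self.rows = rows        # list of rows, all held in main memory

    @classmethod
    def from_csv(cls, path):
        with open(path, newline="") as f:
            reader = csv.reader(f)
            headers = next(reader)
            rows = list(reader)
        return cls(headers, rows)

    def column(self, name):
        i = self.headers.index(name)
        return [row[i] for row in self.rows]

# Hypothetical usage (file and column names are made up):
# table = Table.from_csv("data/fuel_consumption.csv")
# co2 = [float(v) for v in table.column("CO2EMISSIONS")]
```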
- Regression implementation.
- About the data used.
- Results
Regression was implemented with:
- optimization algorithm: gradient descent
- configurable learning rate
- L2 regularization (Ridge penalty)
- a configurable number of iterations
- log writing for further debugging / plotting (a training-loop sketch is shown below)
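As an illustration, here is a minimal pure-Python sketch of such a training loop for single-feature linear regression: gradient descent with an L2 penalty and periodic cost logging. Function and parameter names are assumptions for this sketch, not the repository's actual code.

```python
# Sketch: gradient descent for y = w*x + b with an L2 (Ridge) penalty.
def train_linear(xs, ys, lr=0.01, l2=0.1, iterations=1000, log_every=100):
    w, b = 0.0, 0.0
    n = len(xs)
    for it in range(iterations):
        dw, db = 0.0, 0.0
        for x, y in zip(xs, ys):
            err = (w * x + b) - y
            dw += err * x
            db += err
        # L2 adds l2 * w to the weight gradient; the bias is not penalized.
        w -= lr * (dw / n + l2 * w)
        b -= lr * (db / n)
        if it % log_every == 0:  # logs for debugging / plotting
            cost = sum(((w * x + b) - y) ** 2 for x, y in zip(xs, ys)) / (2 * n)
            print(f"iter={it} cost={cost:.4f}")
    return w, b
```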
Fuel Consumption vs CO2 EMISSIONS in /data
The trained model achieves a prediction error of only ~3%.
The workflow can be found here:
https://kotsky.github.io/projects/ai_from_scratch/regression_workflow.html
- Logistic regression implementation.
- About the data used.
- Results
Logistic regression was implemented with:
- optimization algorithm: gradient descent
- configurable learning rate
- L2 regularization (Ridge penalty)
- a configurable number of iterations
- log writing for further debugging / plotting
- an adjustable threshold on the logistic output for prediction
- evaluation with a confusion matrix, precision and recall (a sketch is shown below)
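To make the pieces concrete, here is a minimal single-feature sketch of this setup: sigmoid output, gradient descent with an L2 penalty, and threshold-based evaluation with precision and recall. All names are illustrative assumptions, not the repository's actual code.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, ys, lr=0.1, l2=0.01, iterations=1000):
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(iterations):
        dw, db = 0.0, 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y  # gradient of the log-loss
            dw += err * x
            db += err
        w -= lr * (dw / n + l2 * w)  # L2 penalty on the weight
        b -= lr * (db / n)
    return w, b

def evaluate(xs, ys, w, b, threshold=0.5):
    # Confusion-matrix counts at an adjustable prediction threshold.
    tp = fp = tn = fn = 0
    for x, y in zip(xs, ys):
        pred = 1 if sigmoid(w * x + b) >= threshold else 0
        if pred == 1 and y == 1: tp += 1
        elif pred == 1 and y == 0: fp += 1
        elif pred == 0 and y == 0: tn += 1
        else: fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```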
Loan data in /data
The trained model has 71% accuracy at predicting whether people from the test data set will repay a previously taken loan.
The F1 score identified the best logistic threshold at 0.27, which gives the best precision (74%) and recall (95%); a threshold-sweep sketch is shown below.
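One way to find such a threshold is a simple grid sweep over candidate thresholds, keeping the one with the highest F1. A hedged sketch (the function name and grid size are assumptions):

```python
def best_threshold(probs, labels, steps=100):
    # probs: predicted probabilities in [0, 1]; labels: 0/1 ground truth.
    best_f1, best_t = 0.0, 0.5
    for i in range(1, steps):
        t = i / steps
        tp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 1)
        fp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 0)
        fn = sum(1 for p, y in zip(probs, labels) if p < t and y == 1)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        if f1 > best_f1:
            best_f1, best_t = f1, t
    return best_t, best_f1

# t, f1 = best_threshold(probs, labels)  # e.g. t came out near 0.27 above
```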
The workflow can be found here:
https://kotsky.github.io/projects/ai_from_scratch/logistic_regression_workflow.html
For comparison, 78% accuracy was achieved using standard libraries (sklearn). That workflow can be found here:
https://github.com/kotsky/ai-studies/blob/main/Projects/Project%20Loan/Loan%20Model.ipynb
- KNN implementation.
- About the data used.
- Results
A standard KNN algorithm, with storage and comparison of the nearest points optimized via a k-size max-heap data structure (see the sketch below).
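The heap idea in a nutshell: Python's heapq is a min-heap, so distances are negated to keep a max-heap of the k closest points seen so far; the current farthest of those k sits at the root and is evicted in O(log k) when a closer point arrives. A minimal sketch (names are illustrative, not the repository's code):

```python
import heapq

def k_nearest(train, query, k):
    """train: list of (point, label) pairs; query: list of feature values."""
    heap = []  # holds (-distance, label); root = farthest of the kept k
    for point, label in train:
        d = sum((a - b) ** 2 for a, b in zip(point, query)) ** 0.5
        if len(heap) < k:
            heapq.heappush(heap, (-d, label))
        elif -d > heap[0][0]:  # closer than the current farthest kept point
            heapq.heapreplace(heap, (-d, label))
    return [label for _, label in heap]

# Majority vote over the k nearest labels:
# labels = k_nearest(train, query, k=4)
# prediction = max(set(labels), key=labels.count)
```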
Loan data in /data, chosen so that results can be compared with the logistic regression model.
Roughly, k = 4 is the best for this model, giving 72% accuracy, 74% precision and 94% recall.
Its workflow can be found here:
https://kotsky.github.io/projects/ai_from_scratch/knn_workflow.html
- K-Means implementation.
- About the data used.
- Results
A standard K-Means algorithm based on distance calculations between centroids and training points.
The data set describes customers of a store, so we try to identify clusters of these customers.
With proper visualization and cost-function analysis, the best K turns out to be 5 for the features Income vs Years Employed.
Moreover, the model can predict which cluster fits a new customer best (see the sketch below).
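For illustration, here is a minimal pure-Python sketch of the K-Means loop: assign points to the nearest centroid, recompute centroids, report the cost (inertia) used to compare values of K, and assign a new point to its nearest centroid. Names and initialization details are assumptions, not the repository's exact code.

```python
import random

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=100, seed=0):
    random.seed(seed)
    centroids = random.sample(points, k)  # naive random initialization
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:  # assign each point to its nearest centroid
            clusters[min(range(k), key=lambda i: dist2(p, centroids[i]))].append(p)
        for i, cluster in enumerate(clusters):  # recompute centroids
            if cluster:
                centroids[i] = [sum(v) / len(cluster) for v in zip(*cluster)]
    # Inertia: total squared distance to the nearest centroid; comparing it
    # across K = 1..10 (the "elbow") is one way to justify K = 5.
    inertia = sum(min(dist2(p, c) for c in centroids) for p in points)
    return centroids, inertia

def predict(point, centroids):
    # A new customer joins the cluster of the nearest centroid.
    return min(range(len(centroids)), key=lambda i: dist2(point, centroids[i]))
```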
Its workflow can be found here:
https://kotsky.github.io/projects/ai_from_scratch/kmean_workflow.html