
📚Awesome-ML-Lessons-Learnt📚

A list of lessons learnt and things to watch out for in the Data Science and Machine Learning worlds.


Data

  • Don't assume the data is available.
  • Don't assume the data is easy to get.
  • Don't assume that the labels are perfect (human bias while labelling).
  • Train-test leakage: leaking information while splitting the data. For example, computing the mean over the entire dataset, subtracting it from every image, and then splitting into train/val/test would be a mistake. Instead, the mean must be computed only over the training data and then subtracted equally from all splits (train/val/test). | Ref | Notes
  • Not handling outliers in datasets properly. Outliers can either be noise to ignore or signal to take into account. ML algorithms differ in their sensitivity to outliers: AdaBoost is more sensitive than XGBoost, which in turn is more sensitive than a decision tree, which would simply count an outlier as a single misclassification. | Blog article
  • Using normalisation instead of standardisation. This relates to the issue of how to scale features. To bring features onto the same scale, use normalisation (MinMaxScaler) when the data is uniformly distributed and standardisation (StandardScaler) when the feature is approximately Gaussian. In either case, fit the scaler on the training data only, as in the sketch after this list. | Blog article
  • Not checking for duplicates in the training dataset. Double-checking often reveals that many of the examples in the test set are duplicates of examples in the training set. In such scenarios, the measurements of model generalisation are non-deterministic (or meaningless). | Blog article
  • No unit tests for validating input data. In traditional software development, it is best practice to write unit tests to validate code dependencies. In ML projects, a similar practice should be applied to continuously test, verify, and monitor all input datasets. This includes ensuring that test sets yield statistically meaningful results and are representative of the dataset as a whole. | Blog article
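
A minimal sketch of leakage-free scaling, assuming a scikit-learn workflow and a synthetic dataset for illustration: the scaler's statistics come from the training split only and are then reused for the test split.

```python
# A minimal sketch of leakage-free feature scaling (assumes scikit-learn).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics from train only
X_test_scaled = scaler.transform(X_test)        # reuse the train statistics
```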

Feature engineering

  • Models like Random Forest, XGBoost, LightGBM, Naive Bayes, and ElasticNet all perform better on small datasets if you pull out irrelevant features beforehand. This means you should be using some sort of stepwise technique when adding or removing features, but there is another thing to be cognisant of: the order in which you add or remove features can greatly affect your model when doing stepwise featurisation with any of them. Accuracy differences of 10% or more are not uncommon. Thus, when working with small data, in addition to hyperparameter tuning, you should also add stepwise featurisation (retrain and score the model many times with different features, keeping or removing a feature only when model performance improves) and randomise the order of the features you add or remove; a sketch follows below. Doing so will produce the best models. | LinkedIn post
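
A minimal sketch of randomised forward stepwise feature selection. The helper name `forward_select` is hypothetical; it assumes a pandas DataFrame and any scikit-learn estimator. Run it with several seeds and keep the best-scoring feature set.

```python
import random
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def forward_select(model, X, y, cv=5, seed=0):
    """Greedily add features in random order, keeping a feature only
    when it improves the cross-validated score."""
    candidates = list(X.columns)
    random.Random(seed).shuffle(candidates)  # order strongly affects the result
    selected, best_score = [], float("-inf")
    for feat in candidates:
        trial = selected + [feat]
        score = cross_val_score(model, X[trial], y, cv=cv).mean()
        if score > best_score:  # keep the feature only if it helps
            best_score, selected = score, trial
    return selected, best_score

X_arr, y = make_classification(n_samples=300, n_features=15, n_informative=5,
                               random_state=0)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(15)])
features, score = forward_select(RandomForestClassifier(random_state=0), X, y)
```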

Project management

  • Don't fail to start over and reassess when you realise the project is going in a weird direction.
  • Don't fail to connect the model's objective to the way the model will actually be used.
  • Don't fail to appreciate that the core of the problem may effectively be predicting something far bigger and more complex.
  • Put usability and production needs before model accuracy, not vice versa.
  • Don't start the ML project until you have agreed on the scope with the client; otherwise, the client may move the goalposts later.
  • Don't touch the model until you understand what the real KPIs for the business are.
  • Don't start modelling until you have performed a thorough EDA!

Model training

  • Don't train the model on the entire dataset. Split it into train, validation, and test sets: fit your model on train, optimise on validation, and measure on test (see the sketch after this list).
  • Don't throw too many uncorrelated and unnecessary features at the model. Reduce, reduce, reduce down to the essentials!
  • Don't celebrate too early. Watch out for data leakage and an overfitted model.
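
A minimal sketch of a three-way split with scikit-learn; the 60/20/20 proportions and the synthetic dataset are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)

X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, random_state=42)          # 60% train
X_valid, X_test, y_valid, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=42)  # 20% valid, 20% test
# Fit on train, tune hyperparameters on valid, report once on test.
```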

Classification

  • In fraud modelling, AUC is not the only metric that matters. You also have to consider the fraud loss in $ and the lifetime value ($) lost should you ban the wrong user.
  • Inflating recall on imbalanced datasets. If the data is resampled (oversampled) before being split into training and test sets, the recall score will be inflated. The appropriate strategy is to split the data into training and test sets first and only then resample the training data, as in the sketch below.
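
A minimal sketch of split-then-resample, assuming the imbalanced-learn package is installed; the synthetic 95/5 class imbalance is illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                           random_state=0)

# Split FIRST, so the test set keeps the true class distribution
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# Resample the training data only; never touch the test set
X_train_res, y_train_res = RandomOverSampler(random_state=0).fit_resample(
    X_train, y_train)
```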

Recommender system

  • In a recommender system, CTR and NDCG are not the only metrics that matter. You also have to consider candidate diversity, feedback bias, and the $ monetisation generated from purchases or ads.

Time series

  • If the time series is a random walk, using the R2 metric as a proxy for the model's predictive power may suggest overly optimistic results when in reality the model has no predictive power at all. | Blog article | Notes
  • Target leakage: think of it in terms of the timing or chronological order in which data becomes available, not merely whether a feature helps make good predictions. | Blog article
  • Temporal leakage: if you're trying to predict the future given the past (for example, tomorrow's weather, stock movements, and so on), you should not randomly shuffle your data before splitting it, because doing so creates a temporal leak: your model will effectively be trained on data from the future. In such situations, always make sure all data in your test set is posterior to the data in the training set; a sketch follows below.
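
A minimal sketch of chronological (non-shuffled) splitting with scikit-learn's TimeSeriesSplit, assuming the rows are already sorted by time.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(100).reshape(-1, 1)  # stand-in for time-ordered features

# Each test fold is strictly later in time than its training fold
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    assert train_idx.max() < test_idx.min()
```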

Paradoxes


Biases

  • Self-selection bias
  • Omitted variable bias
  • Sponsorship or funding bias
  • Sampling bias (also known as distribution shift)
  • Prejudice or stereotype bias
  • Systematic value distortion
  • Experimenter bias
  • Labeling bias
  • Confirmation bias: people's tendency to process information by looking for, or interpreting, information that is consistent with their existing beliefs.
  • Survivorship bias: the phenomenon where only those that ‘survived’ a long process are included in an analysis, thus creating a biased sample.

Deep Learning

  • Avoid using fully-connected layers, or MLPs in general. For any structured data, like spatial functions or field data in general, convolutions are preferable and less likely to overfit. For example, you’ll notice that CNNs typically don’t need dropout, as they are nicely regularised by construction; for MLPs, you typically need quite a bit of dropout to avoid overfitting. A sketch contrasting the two follows below.
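
A minimal sketch contrasting the two architectures, assuming Keras (TensorFlow) is available; the layer sizes and 28x28 input are illustrative.

```python
from tensorflow import keras
from tensorflow.keras import layers

# CNN: weight sharing and local connectivity act as built-in regularisation
cnn = keras.Sequential([
    layers.Input((28, 28, 1)),
    layers.Conv2D(16, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation="relu"),
    layers.GlobalAveragePooling2D(),
    layers.Dense(10),
])

# MLP on the same input: typically needs explicit dropout to avoid overfitting
mlp = keras.Sequential([
    layers.Input((28, 28, 1)),
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(10),
])
```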
