# Common Problems in Machine Learning

Machine learning practitioners face many challenges when developing their skills and building an application from scratch.


Some of the most common are:

1. Poor Data Quality or Imbalanced Dataset
2. Underfitting of Training Data
3. Overfitting of Training Data
4. Feature Selection and Dimensionality Reduction
5. Data Leakage

## Overfitting

Overfitting occurs when a machine learning model learns the training data too well and fails to generalize to new, unseen data. 

The model becomes overly complex and captures noise or random fluctuations in the training data, resulting in poor performance on the test data.



<img src="https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fdiegolosey.com%2Fwp-content%2Fuploads%2F2020%2F07%2Funderfitting-1536x477.png&f=1&nofb=1&ipt=d2421880db5ffa5dffe78180f78f7c043b560f614fbd375ff4d52e73df02c169&ipo=images">

## Underfitting 

Underfitting happens when a machine learning model is too simple to capture the underlying patterns and relationships in the training data.

It fails to learn the essential characteristics of the data, leading to poor performance on both the training and test sets.

In [1]:
# Contrast case: a linear model on perfectly linear data is NOT underfit (MSE is ~0)
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Generate a linear dataset
X = [[1], [2], [3]]
y = [2, 4, 6]

# Fit a linear regression model
model = LinearRegression()
model.fit(X, y)

# Make predictions
y_pred = model.predict(X)

# Calculate mean squared error
mse = mean_squared_error(y, y_pred)
print("Mean Squared Error:", mse)


Mean Squared Error: 3.2869204384208827e-31
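
In the cell above the data is perfectly linear, so the linear model fits it almost exactly and nothing is underfit. Below is a minimal sketch of actual underfitting, assuming a made-up quadratic dataset: a straight line is too simple for a curved relationship, so even its training error stays high, while a model with a squared feature fits well. The synthetic data and the variable names are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic quadratic data (an assumption for illustration): y = x^2
X_curve = np.arange(-5, 6).reshape(-1, 1).astype(float)
y_curve = X_curve.ravel() ** 2

# A straight line is too simple for this curve, so it underfits
linear_model = LinearRegression().fit(X_curve, y_curve)
mse_linear = mean_squared_error(y_curve, linear_model.predict(X_curve))

# Adding a squared feature gives the model enough capacity for the curve
quadratic_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
quadratic_model.fit(X_curve, y_curve)
mse_quadratic = mean_squared_error(y_curve, quadratic_model.predict(X_curve))

print("Linear model training MSE:", mse_linear)
print("Quadratic model training MSE:", mse_quadratic)
```

A large error on the training data itself is the telltale sign here: the model cannot even describe the data it was fitted on.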


## How to Detect Overfitting in Machine Learning

A key challenge with overfitting, and with machine learning in general, is that we can’t know how well our model will perform on new data until we actually test it.

To address this, we can split our initial dataset into separate training and test subsets.

<img src="https://elitedatascience.com/wp-content/uploads/2017/06/Train-Test-Split-Diagram-768x266.jpg">


> If our model does much better on the training set than on the test set, then we’re likely overfitting.
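
The train-versus-test comparison described above can be sketched as follows. This is only an illustration under stated assumptions: the noisy synthetic dataset from make_classification, the unrestricted decision tree, and all variable names are chosen here to make the gap easy to see and do not come from the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic noisy data (an assumption for illustration);
# flip_y assigns a random label to about 20% of the samples
X_noisy, y_noisy = make_classification(
    n_samples=500, n_features=20, n_informative=5, flip_y=0.2, random_state=0
)

# Hold out 30% of the data as a test set
X_tr, X_te, y_tr, y_te = train_test_split(
    X_noisy, y_noisy, test_size=0.3, random_state=0
)

# An unrestricted decision tree can memorize the training set, noise included
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

train_acc = tree.score(X_tr, y_tr)
test_acc = tree.score(X_te, y_te)
print("Training accuracy:", train_acc)
print("Test accuracy:", test_acc)
print("Gap:", train_acc - test_acc)
```

A wide gap between the two accuracies is the signal to look for; limiting the tree (for example with max_depth) would be one way to narrow it.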



Further reading: https://elitedatascience.com/overfitting-in-machine-learning

### Overcome Overfitting with Cross-validation
Cross-validation is a powerful preventative measure against overfitting.

The idea is clever: Use your initial training data to generate multiple mini train-test splits. Use these splits to tune your model.

In standard k-fold cross-validation, we partition the data into k subsets, called folds. Then, we iteratively train the algorithm on k-1 folds while using the remaining fold as the test set (called the “holdout fold”).
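
To make the fold-by-fold procedure concrete, here is a minimal hand-written sketch of the loop, assuming the iris dataset and a logistic regression model. The shuffling, the random_state, the raised max_iter, and the variable names are assumptions made for this illustration. Helper functions such as cross_val_score may split the data differently, so the numbers need not match theirs.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X_iris, y_iris = load_iris(return_X_y=True)

# Shuffle before splitting because the iris rows are ordered by class
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, holdout_idx in kfold.split(X_iris):
    # Train a fresh model on the k-1 training folds
    # (max_iter is raised to give the solver more room to converge)
    fold_model = LogisticRegression(max_iter=1000)
    fold_model.fit(X_iris[train_idx], y_iris[train_idx])
    # Score it on the remaining holdout fold
    fold_scores.append(fold_model.score(X_iris[holdout_idx], y_iris[holdout_idx]))

print("Per-fold accuracy:", np.round(fold_scores, 3))
print("Mean accuracy:", np.mean(fold_scores))
```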


<img src="https://elitedatascience.com/wp-content/uploads/2017/06/Cross-Validation-Diagram-768x295.jpg">

In [5]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

# Load the iris dataset
data = load_iris()
X, y = data.data, data.target

# Create a logistic regression model
model = LogisticRegression()

# Perform cross-validation
cv_scores = cross_val_score(model, X, y, cv=5)

# Print the cross-validation scores
print("Cross-Validation Scores:", cv_scores)
print("Mean Score:", cv_scores.mean())


Cross-Validation Scores: [0.96666667 1.         0.93333333 0.96666667 1.        ]
Mean Score: 0.9733333333333334


Note: with the default settings, this cell also emits a scikit-learn convergence warning (STOP: TOTAL NO. of ITERATIONS REACHED LIMIT), meaning the solver hit its iteration limit before converging. The warning suggests increasing the number of iterations (max_iter) or scaling the data, as shown in https://scikit-learn.org/stable/modules/preprocessing.html, and points to the documentation for alternative solver options: https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


In the above code:

* The iris dataset is loaded using the load_iris() function; the input features are stored in X and the target variable in y.
* We create a LogisticRegression model object.
* The cross_val_score function performs the cross-validation. We pass the model (model), input features (X), and target variable (y) as arguments. The cv parameter is set to 5, which requests 5-fold cross-validation.
* The cross-validation scores are stored in cv_scores.
* Finally, we print the cross-validation scores and the mean score across all folds.


The cross-validation scores are displayed as an array of values, with each value representing the model's accuracy on a specific fold of the data. In this case, there are five folds, and the model achieved accuracy values of approximately 0.97, 1.0, 0.93, 0.97, and 1.0 on each respective fold.

The mean score is calculated by taking the average of all the cross-validation scores. In this case, the mean score is approximately 0.973, indicating an overall high accuracy of the logistic regression model across all folds.

> A mean cross-validation score of about 0.973, with fold scores between 0.93 and 1.0, suggests that the model generalizes well to held-out data and performs fairly consistently across subsets of the dataset. Because every fold is scored on data the model did not see during training, a high score here is evidence that the model is capturing patterns that carry over to new data points rather than overfitting the training data.
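
The cross-validation run above also printed a solver warning that recommends more iterations or scaled data. One way to act on it, sketched here under the assumption that standardizing the features is acceptable for this problem, is to wrap a StandardScaler and the model in a pipeline so that the scaler is re-fit on the training folds of every split. The pipeline and variable names are illustrative, and the resulting scores are not claimed to match the ones shown above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_iris, y_iris = load_iris(return_X_y=True)

# Scaling inside the pipeline means each split fits the scaler
# on its training folds only, never on the holdout fold
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

scaled_scores = cross_val_score(scaled_model, X_iris, y_iris, cv=5)
print("Cross-Validation Scores:", scaled_scores)
print("Mean Score:", scaled_scores.mean())
```

Keeping the scaler inside the pipeline, rather than scaling the whole dataset up front, also avoids leaking holdout-fold statistics into training.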

## Feature Selection and Dimensionality Reduction


Feature selection and dimensionality reduction techniques are used to identify and retain the most informative features in the dataset. 

Having too many irrelevant or redundant features can negatively impact model performance, increase training time, and introduce noise into the model.
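
As a rough illustration of the two ideas, the sketch below applies one technique of each kind to the iris dataset: SelectKBest keeps a subset of the original features, while PCA builds new features as combinations of all of them. The choice of these two techniques, of k=2 and n_components=2, and of the f_classif scoring function are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X_iris, y_iris = load_iris(return_X_y=True)

# Feature selection: keep the 2 original features with the highest ANOVA F-scores
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X_iris, y_iris)
print("Selected feature indices:", selector.get_support(indices=True))

# Dimensionality reduction: project all 4 features onto 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_iris)
print("Explained variance ratio:", pca.explained_variance_ratio_)

print("Shapes:", X_iris.shape, X_selected.shape, X_reduced.shape)
```

Feature selection keeps the result interpretable because the surviving columns are original measurements; PCA components are mixtures of all inputs and are harder to read directly.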

## Data Leakage
 
Data leakage occurs when information from the test set leaks into the training process, leading to overly optimistic performance estimates.

It can also happen when features that would not be available at prediction time in a real-world scenario are used during training.
 
**To avoid data leakage:**

* Split your dataset into training and test sets before any preprocessing or feature engineering, that is, perform the train-test split on the raw, unprocessed data. This prevents information from the test set from influencing the preprocessing steps.

In [2]:
from sklearn.model_selection import train_test_split

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit preprocessing steps on the training data only, then apply the fitted steps to both sets
# ...
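
The elided preprocessing step could look like the following minimal sketch. It assumes the iris dataset and a StandardScaler as the preprocessing step; both are stand-ins chosen for illustration, as are the variable names. The scaler learns its statistics from the training data only and is then reused, unchanged, on the test data.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_raw, y_raw = load_iris(return_X_y=True)

# Split the raw, unprocessed data first
X_train, X_test, y_train, y_test = train_test_split(
    X_raw, y_raw, test_size=0.2, random_state=42
)

# Fit the scaler on the training data only ...
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# ... then reuse the training statistics to transform the test data
X_test_scaled = scaler.transform(X_test)

print("Scaler means (from training data only):", scaler.mean_)
print("Training-set means after scaling:", X_train_scaled.mean(axis=0).round(6))
```

Calling fit_transform on the full dataset before splitting would let the test rows influence the learned mean and standard deviation, which is exactly the leak this ordering avoids.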


Image Credit: elitedatascience