# **Machine Learning Questions and Answers**

### 1. What is a parameter?

- In machine learning, a parameter refers to a configuration variable that is internal to the model and whose value is estimated from the training data.

- Parameters are learned from data during training and are essential for making predictions.

> Key Characteristics of Parameters:

  1. They define the model.

  2. Their values are learned automatically during training.

  3. They directly affect predictions.

- Ex-

  1. In linear regression, the parameters are the weights (coefficients) and bias (intercept) of the line.

  2. In a neural network, the parameters are the weights and biases of the neurons.

  3. In decision trees, the learned parameters are the split features and thresholds chosen during training. Because the number of splits grows with the data instead of being fixed in advance, decision trees are described as non-parametric models, which is not the same as parameter-free.
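As a minimal sketch of the linear regression example above, the snippet below fits scikit-learn's LinearRegression on toy data (the values are made up for illustration) and reads back the learned weight and bias from the coef_ and intercept_ attributes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data generated from y = 2x + 1 (illustrative values, not a real dataset)
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2 * X.ravel() + 1

model = LinearRegression().fit(X, y)

# The learned parameters: one weight per feature plus a bias term
weight = model.coef_[0]
bias = model.intercept_
print(f"weight={weight:.2f}, bias={bias:.2f}")
```

Nothing in the code sets the weight or bias by hand; both are estimated from the data during fit().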

--

### 2. What is correlation? What does negative correlation mean?

- In machine learning, correlation refers to a statistical relationship between two variables—how one variable tends to change when another does.

- Correlation measures the strength and direction of a linear relationship between two variables.

- It's expressed with a value called the correlation coefficient (r), which ranges from -1 to +1: -1 is a perfect negative correlation, +1 is a perfect positive correlation, and 0 means no linear correlation.

> What Negative Correlation Means

  - Negative correlation means that as one variable increases, the other tends to decrease, i.e. they move in opposite directions.

  - Suppose you're building a model to predict house prices. If the distance from the city center has a negative correlation with price, then as distance increases, price tends to decrease.
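The house price example can be checked numerically. This is a small sketch with hypothetical distance and price values (not real data), using NumPy's corrcoef to compute r.

```python
import numpy as np

# Hypothetical data: distance from the city center (km) and price (in thousands)
distance_km = np.array([1, 3, 5, 8, 12, 15])
price_k = np.array([500, 450, 400, 330, 260, 210])

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is r
r = np.corrcoef(distance_km, price_k)[0, 1]
print(f"r = {r:.3f}")  # negative, because price falls as distance grows
```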

> Why is Correlation Important in Machine Learning?

  1. Helps in feature selection: Highly correlated features can be redundant.

  2. Can reveal multicollinearity, which makes coefficient estimates in linear models unstable and hard to interpret.

  3. Gives insight into data relationships before modeling.


--

### 3. Define Machine Learning. What are the main components in Machine Learning?

- Machine Learning (ML) is a subfield of artificial intelligence (AI) that focuses on building systems that can learn from data and make decisions or predictions without being explicitly programmed for every task.

> Main Components in Machine Learning

  1. Data - The foundation of ML. It can be structured (tables), unstructured (text, images), or semi-structured, and includes input features and (sometimes) output labels.

  2. Model - The mathematical structure that maps input data to output predictions. Ex- linear regression, decision trees, neural networks.

  3. Features - The input variables (also called attributes or predictors).

  4. Labels/Targets (for supervised learning) - The output the model is supposed to predict. They are used during training to learn the correct mapping.

  5. Learning Algorithm - The method used to adjust the model parameters based on the data.

  6. Loss Function (Cost Function) - Measures the difference between the predicted output and the actual output.

  7. Training Process - The phase where the model learns from the data.

  8. Evaluation - Measures the model's performance using metrics like accuracy, precision, recall, or RMSE. It often involves a separate test or validation dataset.


--

### 4. How does loss value help in determining whether the model is good or not?

- In machine learning, the loss value is a numerical measure of how well (or poorly) a model's predictions match the actual outcomes. It plays a critical role in both training and evaluating a model.

> Role of Loss Value:

  1. Measures Model Error - The loss function computes the difference between the predicted output and the true output. A lower loss indicates better predictions; a higher loss indicates more error.

  2. Guides the Learning Process - During training, the model uses the loss to adjust its parameters (via optimization algorithms like gradient descent). The goal is to minimize the loss.

  3. Indicates Model Quality (to some extent) - Consistently low loss on both training and validation data usually means the model is learning well. Very low training loss combined with high validation loss may signal overfitting.

- The loss value tells you how far off your model's predictions are from the true values. Lower loss generally means a better model, but loss values are only comparable for the same loss function and the same data, so loss should be evaluated alongside other metrics (like accuracy or precision) to fully judge model performance.
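To make "lower loss means smaller errors" concrete, here is a small sketch that computes mean squared error (one common loss function, chosen here for illustration) for two hypothetical sets of predictions against the same targets.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average of the squared differences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

y_true = [3.0, 5.0, 7.0]
good_preds = [2.9, 5.1, 7.2]   # close to the targets
poor_preds = [1.0, 8.0, 4.0]   # far from the targets

loss_good = mse(y_true, good_preds)
loss_poor = mse(y_true, poor_preds)
print(loss_good, loss_poor)
```

The same comparison applied to training loss versus validation loss is what reveals the overfitting pattern described in point 3.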

--

### 5. What are continuous and categorical variables?

- In machine learning, features (or variables) are often divided into continuous and categorical types.

- Understanding the difference helps in choosing the right preprocessing methods and algorithms.

> Continuous Variables

  - Variables that can take any numerical value within a range (including decimals).

  - Infinite possible values.

  - Usually measurable quantities.

  - Can be used directly or normalized/scaled.

  - When the target is continuous, the task is a regression problem.

  - Ex- Height = 170.5 cm, Temperature = 22.3°C, Age = 25.6 years


> Categorical Variables

  - Variables that represent categories or groups and have a finite number of distinct values.

  - Two types:
    1. Nominal: No natural order. Ex- color: red, blue, green

    2. Ordinal: Has an order. Ex- rating: low, medium, high

  - Need to be encoded into numbers, e.g. with one-hot encoding or label encoding.

  - When the target is categorical, the task is a classification problem.

  - Ex- Gender (male, female)
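In practice, a first pass at separating the two types often relies on column dtypes. The sketch below uses a hypothetical pandas DataFrame and select_dtypes; the column names are illustrative.

```python
import pandas as pd

# Hypothetical dataset mixing both variable types
df = pd.DataFrame({
    "height_cm": [170.5, 162.0, 181.3],    # continuous
    "temperature_c": [22.3, 19.8, 25.1],   # continuous
    "color": ["red", "blue", "green"],     # categorical (nominal)
    "rating": ["low", "high", "medium"],   # categorical (ordinal)
})

continuous_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()
print(continuous_cols)
print(categorical_cols)
```

Dtype alone is only a heuristic: a category stored as integer codes (for example, a zip code) would be picked up as numeric and needs to be reclassified by hand.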

--

### 6. How do we handle categorical variables in Machine Learning? What are the common techniques?

- Categorical variables must be converted into numerical form because most machine learning algorithms work with numbers, not text or labels. This process is called encoding.

> Common Techniques for Handling Categorical Variables:

  1. Label Encoding

    - Assigns a unique integer to each category.

    - Pros: Simple, memory-efficient

    - Cons: Implies ordinal relationship (e.g., Blue > Green), which may not be true

    - Best for: Tree-based models like decision trees, random forests

    - Ex- Color: Red, Green, Blue → Red=0, Green=1, Blue=2

  2. One-Hot Encoding

    - Creates binary (0/1) columns for each category.

    - Pros: No assumption of order

    - Cons: Increases dimensionality (especially with many categories)

    - Best for: Linear models, neural networks

    - Ex- Color: Red → [1, 0, 0], Green → [0, 1, 0], Blue → [0, 0, 1]

  3. Ordinal Encoding

    - Like label encoding, but used only when the categories have a natural order.

    - Best for: When the order matters

    - Ex- Size: Small=0, Medium=1, Large=2

  4. Target Encoding (Mean Encoding)

    - Replaces categories with the mean of the target variable for each category.

    - ex- Job: Engineer → 60K, Teacher → 40K, Doctor → 80K

    - Pros: Reduces dimensionality

    - Cons: Can cause data leakage if not used carefully (needs cross-validation)

    - Best for: High-cardinality categorical features

  5. Binary Encoding / Hash Encoding

    - Converts categories to binary or hashed codes.

    - Pros: Handles high cardinality better than one-hot

    - Best for: Large categorical features
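Two of the techniques above can be sketched in a few lines. This example assumes a small made-up DataFrame and uses pandas get_dummies for one-hot encoding and scikit-learn's OrdinalEncoder, with the category order passed explicitly, for ordinal encoding.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    "color": ["Red", "Green", "Blue", "Green"],
    "size": ["Small", "Large", "Medium", "Small"],
})

# One-hot encoding: one 0/1 column per color category
one_hot = pd.get_dummies(df["color"], prefix="color", dtype=int)

# Ordinal encoding with an explicit order, so Small < Medium < Large
encoder = OrdinalEncoder(categories=[["Small", "Medium", "Large"]])
size_codes = encoder.fit_transform(df[["size"]]).ravel()

print(one_hot)
print(size_codes)
```

Passing the order explicitly matters: without it, the encoder sorts the categories, which would place Large before Medium.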

--

### 7. What do you mean by training and testing a dataset?

- In machine learning, the data used to build and evaluate a model is typically divided into two main parts: the training dataset and the testing dataset.

  1. Training Dataset

    - Used to train the model, i.e., to learn patterns and adjust parameters.

    - The model sees this data during learning.

    - It includes both input features and output labels (in supervised learning).

    - Ex- If you're predicting house prices, the training data would include house features (size, location, etc.) and known prices.

  2. Testing Dataset

    - Used to evaluate the model's performance on unseen data.

    - The model does not see this data during training.

    - It simulates how the model will perform on real-world, new data.

    - It is used to check generalization: how well the model performs beyond the training data.

> Why This Split Matters:

  - Reveals overfitting (the model memorizing training data), which cannot be detected from training performance alone.

  - Ensures fair evaluation on new, unseen data.

  - Helps in measuring true model performance.

--

### 8. What is sklearn.preprocessing?

- In machine learning, raw data often needs to be transformed or scaled before it can be used effectively by algorithms.

- The module **sklearn.preprocessing** in Scikit-learn provides tools to preprocess this data.

> Purpose of sklearn.preprocessing:

  1. Prepare data for training and testing.

  2. Improve model performance.

  3. Ensure features are on the same scale.

  4. Convert categorical data to numerical format.

- sklearn.preprocessing is a toolbox for transforming raw data into a format suitable for machine learning algorithms, including scaling, encoding, and feature generation.
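As a brief sketch of two of these purposes (scaling and encoding), the example below applies MinMaxScaler and OneHotEncoder from sklearn.preprocessing to small made-up arrays.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

ages = np.array([[18.0], [30.0], [42.0]])          # numeric feature
colors = np.array([["red"], ["blue"], ["red"]])    # categorical feature

# Rescale the numeric column to the [0, 1] range
scaled_ages = MinMaxScaler().fit_transform(ages)

# Expand the categorical column into 0/1 indicator columns
encoder = OneHotEncoder()
encoded_colors = encoder.fit_transform(colors).toarray()

print(scaled_ages.ravel())
print(encoder.categories_[0], encoded_colors)
```

OneHotEncoder returns a sparse matrix by default, which is why the result is converted with toarray() before printing.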

--

### 9. What is a Test set?

- In machine learning, a test set is a portion of the dataset that is used to evaluate the final performance of a trained model.

- It helps determine how well the model generalizes to new, unseen data.

- Used only after training is complete.

- The model has never seen this data during training.

- Helps measure true performance on real-world data.

- Contains both input features and true output labels (in supervised learning).

> Why Use a Test Set?

  1. To detect overfitting: a model may perform well on training data but poorly on unseen data.

  2. To validate the model's effectiveness in practice.

  3. To ensure that improvements in training don't just reflect memorization.

- Ex- If you have 1,000 samples of data, you might split them as follows: Training Set: 800 samples, used to train the model; Test Set: 200 samples, used to test the model's accuracy.

- A test set is a reserved portion of the dataset that is used only at the end to evaluate how well a machine learning model performs on unseen data.

- It ensures the model is generalizing, not just memorizing.
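The 800/200 split and the overfitting check can be sketched as follows. The data here is synthetic (generated with make_classification, with some label noise added), and the unconstrained decision tree is chosen only because it memorizes training data easily.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with label noise (flip_y), so memorization does not generalize
X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# An unconstrained tree can fit the training set almost perfectly
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)
test_acc = tree.score(X_test, y_test)
print(f"train accuracy={train_acc:.2f}, test accuracy={test_acc:.2f}")
```

A large gap between the two scores is the signal that the model has memorized the training set; only the test score says anything about unseen data.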


--

### 10. How do we split data for model fitting (training and testing) in Python? How do you approach a Machine Learning problem?

- In Python, especially with Scikit-learn, the most common way to split data is train_test_split from sklearn.model_selection:

      from sklearn.model_selection import train_test_split

      # X = features, y = labels
      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

  - test_size=0.2 → 20% of the data goes to the test set, 80% to training.

  - random_state=42 → fixes the random shuffle so the split is reproducible.

> Approaching an ML problem involves structured thinking and step-by-step execution. Here's a typical workflow:

  1. Understand the Problem

  2. Collect and Explore the Data

  3. Preprocess the Data - Split the data into training and testing sets, then handle missing values, encode categorical variables, and scale or normalize numerical features. Fit these steps on the training set only, so no information from the test set leaks into training.

  4. Choose a Model - Select a suitable algorithm (e.g. Linear Regression, Decision Tree, SVM) based on the problem type and data size.

  5. Train the Model - Fit the model using the training data (model.fit(X_train, y_train)).

  6. Evaluate the Model - Use the test set (X_test, y_test) to evaluate performance. Use metrics like accuracy, precision, recall, F1-score, MSE, etc.

  7. Tune the Model (Optional)

  8. Deploy or Interpret Results
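The workflow above can be sketched end to end on a small built-in dataset. The choices here (the Iris dataset, StandardScaler, LogisticRegression, accuracy as the metric) are illustrative assumptions, not the only reasonable ones.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Steps 2-3: load the data and hold out a test set
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Steps 3-5: scale the features and train a model; the pipeline fits
# the scaler on the training data only
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Step 6: evaluate on the held-out test set
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {accuracy:.2f}")
```

Wrapping the scaler and the model in one pipeline keeps the preprocessing and the model together, so the same transformations are applied automatically at prediction time.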


--

### 11. Why do we have to perform EDA before fitting a model to the data?

- Exploratory Data Analysis (EDA) is the process of examining and understanding the dataset before building any machine learning model.

> Key Reasons to Perform EDA:

  1. Understand Data Characteristics

  2. Detect Data Quality Issues

  3. Feature Selection and Engineering

  4. Choose Appropriate Models and Techniques

  5. Inform Data Preprocessing Steps

  6. Prevent Model Failures

- EDA is critical because it provides a deep understanding of the data, ensures data quality, guides feature engineering, and helps in selecting and tuning the right model — all of which lead to better, more reliable machine learning outcomes.
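A first EDA pass often takes only a few pandas calls. The sketch below uses a hypothetical housing DataFrame that contains one missing value and one suspicious outlier, the kind of issues EDA is meant to surface.

```python
import numpy as np
import pandas as pd

# Hypothetical housing data with a missing value and a suspicious outlier
df = pd.DataFrame({
    "size_sqft": [850, 900, np.nan, 1200, 1100],
    "price_k": [200, 210, 250, 290, 5000],
    "city": ["A", "B", "A", "A", "B"],
})

print(df.dtypes)                    # data type of each column
print(df.describe())                # summary statistics for numeric columns
missing = df.isna().sum()           # number of missing values per column
print(missing)
print(df["city"].value_counts())    # frequency of each category
```

Here describe() would show a maximum price far above the other values, and isna().sum() would flag the missing size, both of which need a decision before any model is fit.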

--

### 12. What is correlation?

- Correlation is a statistical measure that describes the strength and direction of a linear relationship between two variables.

- It's expressed with a value called the correlation coefficient (r), which ranges from -1 to +1: -1 is a perfect negative correlation, +1 is a perfect positive correlation, and 0 means no linear correlation.
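For reference, the most common correlation coefficient is Pearson's r. For n paired observations $(x_i, y_i)$ with sample means $\bar{x}$ and $\bar{y}$ (notation introduced here for illustration), it is defined as:

$$
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
$$

The numerator is positive when the two variables tend to be above or below their means together and negative when they move in opposite directions; the denominator rescales the result to lie between -1 and +1.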

> Importance in Machine Learning:

  1. Helps identify how features relate to each other and to the target variable.

  2. Can be used for feature selection, e.g. removing redundant features.

  3. Detects multicollinearity, which can affect model performance.

  4. Guides understanding of data patterns before modeling.

- Correlation quantifies the degree to which two variables move together in a linear fashion. It is a fundamental concept to analyze relationships in data for machine learning.

--

### 13. What does negative correlation mean?

> What Negative Correlation Means

  - Negative correlation means that as one variable increases, the other tends to decrease, i.e. they move in opposite directions.

  - Suppose you're building a model to predict house prices. If the distance from the city center has a negative correlation with price, then as distance increases, price tends to decrease.

> Why is Correlation Important in Machine Learning?

  1. Helps in feature selection: Highly correlated features can be redundant.

  2. Can reveal multicollinearity, which makes coefficient estimates in linear models unstable and hard to interpret.

  3. Gives insight into data relationships before modeling.

--

### 14. How can you find correlation between variables in Python?

- In machine learning and data analysis, you can find the correlation between variables using libraries like Pandas and NumPy.

  1. Common Method Using Pandas:

    - Assuming you have a dataset loaded as a Pandas DataFrame called df:

          # Calculate the correlation matrix for all numerical columns
          correlation_matrix = df.corr(numeric_only=True)
          print(correlation_matrix)

    - This computes the Pearson correlation coefficient between each pair of numeric variables. Passing numeric_only=True skips non-numeric columns, which would otherwise raise an error in pandas 2.0 and later.

    - The result is a square matrix showing correlations between every pair of variables.

  2. To Find Correlation Between Two Specific Variables:

          correlation = df['variable1'].corr(df['variable2'])
          print(correlation)

  3. Visualizing Correlation:

          import seaborn as sns
          import matplotlib.pyplot as plt

          sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
          plt.show()

- In Python, the easiest way to find correlations between variables is the Pandas .corr() method, which computes pairwise correlation coefficients. Visualization tools like Seaborn help interpret these relationships better.


--

### 15. What is causation? Explain difference between correlation and causation with an example.

- Causation (or cause-and-effect) means that one event directly causes another to happen.

- In other words, a change in variable A produces a change in variable B.

> Correlation

  - There is a positive correlation between ice cream sales and drowning incidents. Both increase in the summer, but buying ice cream does not cause drowning.

  - Correlation helps in feature selection but does not guarantee that the feature causes the outcome.


> Causation

  - Smoking causes an increased risk of lung cancer. Here, smoking is a direct cause affecting health outcomes.

  - Inferring causation requires controlled experiments or causal inference methods, which go beyond simple correlation analysis.

- Correlation means two variables move together, but causation means one actually causes the other.

- Understanding this distinction is crucial to avoid misleading conclusions in machine learning and data analysis.
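The ice cream example can be simulated. In this sketch, all numbers are invented: a hidden third variable (temperature) drives both ice cream sales and drownings, and neither affects the other. The two still correlate strongly until temperature is controlled for.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Simulated confounder: daily temperature drives both quantities
temperature = rng.normal(25, 5, n)
ice_cream_sales = 10 * temperature + rng.normal(0, 10, n)
drownings = 0.2 * temperature + rng.normal(0, 0.5, n)

def residuals(y, x):
    """Remove the linear effect of x from y."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

raw_r = np.corrcoef(ice_cream_sales, drownings)[0, 1]

# Correlation that remains after controlling for temperature
partial_r = np.corrcoef(
    residuals(ice_cream_sales, temperature),
    residuals(drownings, temperature),
)[0, 1]
print(f"raw r={raw_r:.2f}, after controlling for temperature r={partial_r:.2f}")
```

The raw correlation is high even though the simulation contains no causal link between sales and drownings; once the shared cause is removed, the remaining correlation is close to zero.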

--

### 16. What is an Optimizer? What are different types of optimizers? Explain each with an example.

- An optimizer is an algorithm or method used to adjust the parameters (like weights and biases) of a machine learning model during training to minimize the loss function.

- Its goal is to find the best set of parameters that make the model predictions as accurate as possible.

> Why Do We Need Optimizers?

  - To minimize the error between predicted and true outputs.

  - To improve model performance.

  - To efficiently navigate the complex space of model parameters.

> Common Types of Optimizers

  1. Gradient Descent (GD) - Calculates the gradient of the loss function with respect to the parameters using the entire training dataset and updates the parameters in the opposite direction of the gradient. Ex- Basic linear regression or small neural networks.

  2. Stochastic Gradient Descent (SGD) - Updates parameters using the gradient from only one training example at a time. Ex- Large datasets, online learning.

  3. Mini-batch Gradient Descent - Uses a small subset (batch) of data points to calculate the gradient and update parameters. Ex- The usual way deep learning models are trained in practice.

  4. Momentum - Accelerates SGD by adding a fraction of the previous update to the current one, which helps to move through flat regions and dampen oscillations. Ex- Deep learning tasks requiring faster convergence.

  5. Adagrad - Adapts the learning rate for each parameter individually, giving smaller updates to parameters tied to frequent features and larger updates to those tied to infrequent ones. Ex- Sparse data, NLP.

  6. RMSprop - Modifies Adagrad by using a moving average of squared gradients to normalize the gradient, preventing the learning rate from shrinking too much. Ex- Recurrent Neural Networks (RNNs).

  7. Adam (Adaptive Moment Estimation) - Combines momentum and RMSprop by keeping exponentially decaying averages of past gradients and past squared gradients. Ex- Widely used in deep learning.
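To compare the update rules side by side, here is a minimal sketch of three of them (gradient descent, momentum, and Adam) on a one-parameter toy loss f(w) = (w - 3)^2. The loss, starting point, and hyperparameter values are assumptions chosen for illustration; real frameworks apply the same rules to millions of parameters.

```python
import math

def grad(w):
    """Gradient of the toy loss f(w) = (w - 3)^2, which is minimized at w = 3."""
    return 2 * (w - 3)

def gradient_descent(w=0.0, lr=0.1, steps=500):
    for _ in range(steps):
        w -= lr * grad(w)
    return w

def momentum(w=0.0, lr=0.1, beta=0.9, steps=500):
    update = 0.0
    for _ in range(steps):
        # Keep a fraction (beta) of the previous update
        update = beta * update - lr * grad(w)
        w += update
    return w

def adam(w=0.0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g        # moving average of gradients
        v = beta2 * v + (1 - beta2) * g * g    # moving average of squared gradients
        m_hat = m / (1 - beta1 ** t)           # bias correction for early steps
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

results = {"gd": gradient_descent(), "momentum": momentum(), "adam": adam()}
print(results)
```

All three should end near w = 3; they differ in the path taken and in how the effective step size is chosen, which matters far more on realistic, high-dimensional losses than on this toy problem.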

--

### 17. What is sklearn.linear_model ?

- sklearn.linear_model is a module in Scikit-learn that provides linear models for both regression and classification tasks.

- These models assume a linear relationship between the input features and the target variable (for classifiers such as logistic regression, between the features and the log-odds of the class).

- It is used to fit linear models to data.

- Useful for interpretable, efficient, and often surprisingly effective models, especially with high-dimensional data.

- Ex- Simple Linear Regression

      from sklearn.linear_model import LinearRegression

      model = LinearRegression()
      model.fit(X_train, y_train)           # learn coefficients from the training data
      predictions = model.predict(X_test)   # apply them to unseen data

- sklearn.linear_model provides a variety of linear algorithms for regression and classification, often used as baseline models or in cases where speed and interpretability are important.

--

### 18. What does model.fit() do? What arguments must be given?

- In machine learning (especially with Scikit-learn), the .fit() method is used to train the model on the given data.

- It learns the patterns in the data by adjusting the model's internal parameters (like weights).

- It fits the model to the training data, meaning the model is now ready to make predictions.

- Syntax: model.fit(X, y)

- X: Input features (also called independent variables or predictors), usually a 2-D array of shape (n_samples, n_features). y: Target variable (also called labels or dependent variable). Supervised models require both; unsupervised models such as KMeans need only X.

- Ex-

      from sklearn.linear_model import LinearRegression

      model = LinearRegression()
      model.fit(X_train, y_train)  # X_train: features, y_train: target

      predictions = model.predict(X_test)

- Depending on the model, .fit() might also accept additional arguments, such as sample_weight (to give different weights to training examples) or epochs and batch_size (in deep learning frameworks like TensorFlow or Keras).

- model.fit(X, y) is the step where the model learns from data by adjusting internal parameters based on the input features X and their corresponding target values y.
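A short sketch of what fit() changes on the model object, using toy data and the optional sample_weight argument (the weight values are arbitrary):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X_train = np.array([[1.0], [2.0], [3.0], [4.0]])   # shape (n_samples, n_features)
y_train = np.array([2.0, 4.0, 6.0, 8.0])           # shape (n_samples,)

model = LinearRegression()
fitted_before = hasattr(model, "coef_")   # False: nothing has been learned yet

# Optional per-sample weights (illustrative values)
weights = np.array([1.0, 1.0, 2.0, 2.0])
returned = model.fit(X_train, y_train, sample_weight=weights)

fitted_after = hasattr(model, "coef_")    # True: the parameters now exist
print(fitted_before, fitted_after, returned is model)
```

In scikit-learn, fit() stores the learned parameters on the estimator (attributes ending in an underscore) and returns the estimator itself, which is why calls can be chained as LinearRegression().fit(X, y).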


--

### 19. What does model.predict() do? What arguments must be given?

- In machine learning, the model.predict() method is used to generate predictions from a trained model. After training the model using model.fit(), you use predict() to apply the learned patterns to new, unseen data.

- It is used to make predictions on input features.

- Uses the model's learned parameters to compute output values.

- Syntax: model.predict(X), where X is the input data (features) for which predictions are to be made. It must have the same number of feature columns as the data passed to fit().

- Ex-

      from sklearn.linear_model import LinearRegression

      model = LinearRegression()
      model.fit(X_train, y_train)

      predictions = model.predict(X_test)  # Make predictions on new data

- model.predict(X) takes input features and returns predicted outputs using the patterns the model has learned during training. In a typical workflow, these predictions are then compared against known labels to evaluate the model.
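One detail that often trips people up is the expected input shape. The sketch below, on toy data, shows that predict() takes a 2-D array and returns one prediction per row.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X_train = np.array([[1.0], [2.0], [3.0]])
y_train = np.array([10.0, 20.0, 30.0])
model = LinearRegression().fit(X_train, y_train)

# predict() expects a 2-D array of shape (n_samples, n_features),
# even when predicting for a single sample
X_new = np.array([4.0, 5.0]).reshape(-1, 1)
predictions = model.predict(X_new)
print(predictions.shape, predictions)
```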

--

### 20. What are continuous and categorical variables?

- In machine learning, input features (also called variables) are typically classified into two main types: continuous and categorical.

> Continuous Variables

  - Variables that can take any numerical value within a range (including decimals).

  - Infinite possible values.

  - Usually measurable quantities.

  - Can be used directly or normalized/scaled.

  - When the target is continuous, the task is a regression problem.

  - Ex- Height = 170.5 cm, Temperature = 22.3°C, Age = 25.6 years


> Categorical Variables

  - Variables that represent categories or groups and have a finite number of distinct values.

  - Two types:
    1. Nominal: No natural order. Ex- color: red, blue, green

    2. Ordinal: Has an order. Ex- rating: low, medium, high

  - Need to be encoded into numbers, e.g. with one-hot encoding or label encoding.

  - When the target is categorical, the task is a classification problem.

  - Ex- Gender (male, female)

--

### 21. What is feature scaling? How does it help in Machine Learning?

- Feature scaling is a preprocessing technique used to normalize or standardize the range of independent variables (features) in a dataset. It ensures that all features contribute equally to the learning process.

- Many machine learning algorithms are sensitive to the scale of features. If one feature has a much larger range than others, it can dominate the model's learning process, leading to biased or inaccurate results.

> How Does Feature Scaling Help?

  1. Speeds up gradient descent convergence.

  2. Can improve accuracy for scale-sensitive models by preventing one feature from dominating.

  3. Helps in distance-based models like KNN and SVM, where feature magnitudes directly affect the distances between points.

  4. Makes coefficients more comparable across features in linear models.

- Feature scaling brings all features to the same scale, which helps many ML algorithms learn more effectively and fairly. It's a crucial step in preprocessing, especially when using algorithms sensitive to the magnitude of values.
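The "one feature dominates" problem is easy to see with distances. In this sketch (hypothetical age and income values), the raw Euclidean distance between the first two people is almost exactly their income gap, and the 35-year age gap barely registers until the features are standardized.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features: [age in years, income in dollars]
X = np.array([
    [25.0, 50_000.0],
    [60.0, 50_500.0],
    [26.0, 52_000.0],
])

def distances_from_first(data):
    """Euclidean distance from the first row to each of the other rows."""
    return np.linalg.norm(data[1:] - data[0], axis=1)

raw = distances_from_first(X)
scaled = distances_from_first(StandardScaler().fit_transform(X))
print("raw:", raw)
print("scaled:", scaled)
```

After scaling, both features are measured in standard deviations, so a large age difference and a large income difference contribute on comparable terms.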


--

### 22. How do we perform scaling in Python?

- In machine learning, feature scaling is commonly performed using Scikit-learn's sklearn.preprocessing module.

- It provides various scaling methods that transform features to the appropriate scale before training a model.

> Common Scaling Techniques in Python:

  1. Standardization (Z-score Scaling) - Transforms features to have mean = 0 and standard deviation = 1.

  2. Min-Max Scaling (Normalization) - Scales features to a fixed range, usually [0, 1].

  3. Robust Scaling - Uses median and interquartile range, making it less sensitive to outliers.

  4. Normalizer - Scales each individual sample (row) to unit norm (a vector of length 1), unlike the scalers above, which work per feature (column).

> General Process for Scaling:

  1. Import the scaler (e.g., StandardScaler, MinMaxScaler)

  2. Fit the scaler on the training data to learn scaling parameters (mean, std, min, etc.).

  3. Transform the data using the fitted scaler.

  4. Use the same scaler to transform test data (to avoid data leakage).

- Feature scaling in Python is typically done using Scikit-learn scalers like StandardScaler or MinMaxScaler, depending on the type and distribution of the data.
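The four-step process above can be sketched with StandardScaler and MinMaxScaler on small made-up arrays. The key point is that the scaler is fitted once, on the training data, and then reused for the test data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test = np.array([[4.0, 400.0]])

# Fit on the training data only, then reuse the fitted scaler for the test data
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)   # uses the training mean and std

# Min-max scaling follows the same fit/transform pattern
minmax = MinMaxScaler().fit(X_train)
X_test_minmax = minmax.transform(X_test)

print(X_train_std.mean(axis=0), X_train_std.std(axis=0))
print(X_test_std, X_test_minmax)
```

Because the test row lies outside the training range, its min-max value falls above 1. That is expected: the scaler only knows the minimum and maximum it saw during fitting.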

--

### 23. What is sklearn.preprocessing?

- sklearn.preprocessing is a module in Scikit-learn that provides tools for data preprocessing, a crucial step in any machine learning workflow.

- Its purpose is to transform raw data into a suitable format before feeding it into a machine learning model. It includes techniques for:

  1. Scaling numerical features

  2. Encoding categorical variables

  3. Normalizing feature values

  4. Generating polynomial features

- Missing values and class imbalance are handled elsewhere: imputation lives in sklearn.impute, and sklearn.preprocessing has no resampling tools for imbalanced data.

- sklearn.preprocessing is a preprocessing toolbox in Scikit-learn that helps prepare data for machine learning by scaling, encoding, transforming, or normalizing features.

--

### 24. How do we split data for model fitting (training and testing) in Python?

- In machine learning, we split the dataset into two main parts:

  - Training set - used to train (fit) the model

  - Testing set - used to evaluate the model's performance on unseen data

- This helps check that the model generalizes well and is not overfitting.

- Common Tool Used: train_test_split() from sklearn.model_selection

- Arguments - train_test_split(X, y, test_size=0.2, random_state=42). Only the arrays to split (X and y) are required; test_size and random_state are optional, and test_size defaults to 0.25 when neither it nor train_size is given.

- In Python, data is split for training and testing using train_test_split() from Scikit-learn, which separates your features and labels into two sets: one for model training and one for performance evaluation.

--

### 25. Explain data encoding?

- Data encoding is the process of converting categorical (non-numeric) data into a numerical format so that it can be used in machine learning models, which typically require numerical input.

- Most ML algorithms (like linear regression, SVM, neural networks) cannot handle non-numeric data directly.

- Encoding allows the model to understand and process categorical features effectively.

> Common Data Encoding Techniques:

  1. Label Encoding
  
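    - Assigns a unique integer to each category. Ex- {"red": 0, "green": 1, "blue": 2}

    - Suitable for ordinal data only if the integers follow the category order; Scikit-learn's LabelEncoder, for example, assigns integers in sorted order and is intended for target labels.

    - Not ideal for nominal data (unordered), as models may assume numeric order.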

  2. One-Hot Encoding

    - Creates binary columns for each category.

    - Suitable for nominal data.

    - Can lead to high dimensionality if there are many unique categories.

  3. Ordinal Encoding - Similar to label encoding but used for ordered categories.

  4. Binary Encoding, Target Encoding

    - Binary Encoding: Encodes categories as binary numbers.

    - Target Encoding: Replaces each category with the average target value for that category (used in certain supervised learning tasks).

- Data encoding is essential for converting categorical data into numeric form so that machine learning models can process it.

- Choosing the right encoding method depends on whether the categories are ordered and how many there are.
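Target encoding is the technique most prone to leakage, so a sketch helps. This example uses made-up job and salary data and plain pandas; the fallback to the global mean for unseen categories is one common convention, not the only option.

```python
import pandas as pd

# Hypothetical training data: job title and salary (in thousands)
train = pd.DataFrame({
    "job": ["Engineer", "Teacher", "Doctor", "Engineer", "Teacher"],
    "salary_k": [62, 41, 80, 58, 39],
})
test = pd.DataFrame({"job": ["Doctor", "Engineer", "Nurse"]})

# Learn the per-category target mean from the training rows only
category_means = train.groupby("job")["salary_k"].mean()
global_mean = train["salary_k"].mean()

# Apply the mapping; categories unseen in training fall back to the global mean
test["job_encoded"] = test["job"].map(category_means).fillna(global_mean)
print(test)
```

Computing the means from the training rows only is what keeps test-set targets from leaking into the features. On the training set itself, the means are usually computed out-of-fold for the same reason.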

--