1. What is a parameter?

A parameter is a numerical value that describes a characteristic of a population (not just a sample).

In Machine Learning, a parameter usually refers to the internal values of a model that are learned from the training data.

Example: In linear regression, y = mx + c, the slope m and the intercept c are parameters that the model learns.
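
As a minimal sketch of the example above, the snippet below fits scikit-learn's LinearRegression to a few made-up points that lie exactly on y = 2x + 1 and reads back the learned slope and intercept. The data values and variable names are illustrative assumptions, not part of the original notes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up points that lie exactly on y = 2x + 1 (illustrative data)
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])

model = LinearRegression().fit(X, y)

m = model.coef_[0]     # learned slope
c = model.intercept_   # learned intercept
print(f"m = {m:.2f}, c = {c:.2f}")  # m = 2.00, c = 1.00
```

The values of m and c are never typed in by hand; they are recovered from the data, which is what "learned parameters" means here.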

2. What is correlation?

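Correlation is a statistical measure that describes the strength and direction of the linear relationship between two variables.

It is represented by a correlation coefficient (usually r) ranging from -1 to +1. Values near -1 or +1 indicate a strong relationship, and values near 0 indicate little or no linear relationship.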

3. What does negative correlation mean?

Negative correlation means that when one variable increases, the other decreases.

The correlation coefficient r will be less than 0 (between -1 and 0).
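
A small sketch of a negative correlation, assuming made-up numbers for hours of TV watched and exam scores (both variable names and values are hypothetical). It uses NumPy's corrcoef, which returns a 2x2 matrix whose off-diagonal entry is r.

```python
import numpy as np

# Illustrative data: as hours of TV go up, the exam score goes down
hours_tv = np.array([1, 2, 3, 4, 5])
exam_score = np.array([90, 85, 70, 65, 50])

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is r
r = np.corrcoef(hours_tv, exam_score)[0, 1]
print(f"r = {r:.3f}")  # a negative value close to -1
```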

4. Define Machine Learning. What are the main components in Machine Learning?

Definition:
Machine Learning (ML) is a branch of Artificial Intelligence where systems learn from data and improve their performance without being explicitly programmed.

Main Components of ML:

Data → The input (training data + test data).

Model → The algorithm or mathematical structure that makes predictions (e.g., Linear Regression, Neural Network).

Parameters → Internal variables learned by the model (like weights in neural networks).

Loss Function → Measures how far predictions are from actual values.

Optimizer/Training Algorithm → Adjusts parameters to minimize the loss (e.g., Gradient Descent).

Evaluation Metrics → Used to measure model performance (e.g., Accuracy, RMSE).

5. How does loss value help in determining whether the model is good or not?

Loss value = A number that tells us how well (or badly) the model is performing.

If the loss value is:

High → Model predictions are far from actual values (bad model).

Low → Predictions are close to actual values (good model), provided the loss is also low on unseen test data and not only on the training data.

Example:

In regression, Mean Squared Error (MSE) is commonly used as the loss.

A smaller MSE means the model predicts values more accurately.
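
To make the comparison concrete, here is a short sketch that computes MSE for two hypothetical sets of predictions against the same actual values. The numbers are made up for illustration; mean_squared_error comes from sklearn.metrics.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, 5.0, 7.0, 9.0])

# Two hypothetical models: one close to the truth, one far from it
good_pred = np.array([3.1, 4.9, 7.2, 8.8])
bad_pred = np.array([5.0, 2.0, 10.0, 6.0])

mse_good = mean_squared_error(y_true, good_pred)
mse_bad = mean_squared_error(y_true, bad_pred)

# The same quantity computed by hand: the mean of the squared errors
mse_manual = np.mean((y_true - good_pred) ** 2)

print(f"good model MSE: {mse_good:.3f}")  # 0.025
print(f"bad model MSE:  {mse_bad:.3f}")   # 7.750
```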

6. What are continuous and categorical variables?

Continuous Variables

Can take an infinite number of numerical values within a range.

Example: Height (170.5 cm), Temperature (36.8°C), Weight (65.2 kg).

Categorical Variables

Represent categories, labels, or groups.

Values are discrete, not continuous.

Example: Gender (Male/Female), Colors (Red, Blue, Green), City names.

7. How do we handle categorical variables in Machine Learning? What are the common techniques?

Categorical variables are non-numeric labels (e.g., "Male", "Female", "Red", "Blue").
Most ML models work only with numbers, so we must convert categories into numbers.

Common Techniques:

1) Label Encoding
Converts categories into integers.
"Male" → 1, "Female" → 0

2) One-Hot Encoding
Creates one new binary column for each category.

Example:
"Color" column with {Red, Blue, Green} →

Red → [1,0,0]
Blue → [0,1,0]
Green → [0,0,1]

3) Ordinal Encoding (when categories have order)
Example: "Low < Medium < High" → 1, 2, 3.
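
The three techniques above can be sketched with scikit-learn's encoders, under a few assumptions: the sample values are made up, scikit-learn sorts categories alphabetically (so the one-hot column order is Blue, Green, Red rather than the order in the illustration), and OrdinalEncoder codes start at 0 rather than 1.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder

# 1) Label encoding: classes are sorted, so "Female" -> 0 and "Male" -> 1
gender = ["Male", "Female", "Female", "Male"]
gender_codes = LabelEncoder().fit_transform(gender)

# 2) One-hot encoding: one binary column per category (Blue, Green, Red)
colors = np.array([["Red"], ["Blue"], ["Green"]])
color_onehot = OneHotEncoder().fit_transform(colors).toarray()

# 3) Ordinal encoding: the order is passed explicitly; codes start at 0
levels = np.array([["Low"], ["High"], ["Medium"]])
encoder = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
level_codes = encoder.fit_transform(levels)

print(gender_codes)         # [1 0 0 1]
print(color_onehot)         # rows: Red, Blue, Green
print(level_codes.ravel())  # [0. 2. 1.]
```

In scikit-learn, LabelEncoder is intended mainly for the target column, while OneHotEncoder and OrdinalEncoder are meant for input features.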

8. What do you mean by training and testing a dataset?

Training dataset:
Subset of data used to train the model (fit parameters).

The model learns patterns from this data.

Testing dataset:
Separate unseen data used to evaluate the model.

Helps check if the model can generalize, not just memorize.

9. What is sklearn.preprocessing?

sklearn.preprocessing is a module in scikit-learn that provides tools for preprocessing data (such as scaling and encoding) before it is fed into ML models.

10. What is a Test set?

A test set is a portion of the dataset that is kept aside to evaluate the final performance of the trained model.

It simulates new/unseen data.

It ensures the model is not just memorizing (overfitting).

👉 Example: If we have 1000 rows of data, we may keep 80% (800 rows) for training and 20% (200 rows) for testing.

11. How do we split data for model fitting (training and testing) in Python?

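We use train_test_split from sklearn.model_selection.

Example:
from sklearn.model_selection import train_test_split

# X = features, y = target; test_size=0.2 keeps 20% of the rows for testing,
# random_state=42 makes the shuffle reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)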


12. How do you approach a Machine Learning problem?

Step-by-step approach:

1) Understand the problem (classification, regression, clustering).

2) Collect and prepare data (clean missing values, handle outliers).

3) Perform EDA (Exploratory Data Analysis) to understand trends, distributions, correlations.

4) Preprocess data (scaling, encoding, splitting).

5) Select a suitable model (e.g., Logistic Regression, Decision Tree).

6) Train the model using training data.

7) Evaluate the model using metrics (Accuracy, RMSE, Precision, Recall).

8) Tune hyperparameters (e.g., GridSearchCV, RandomizedSearchCV).

9) Deploy the model for real-world use.

👉 Key Idea: The ML workflow is a cycle: Data → Train → Test → Improve → Deploy.
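
A compact sketch of steps 4 to 8, assuming scikit-learn's built-in Iris dataset stands in for a real problem. The dataset, the pipeline step names ("scale", "clf"), and the grid of C values are illustrative choices, not part of the original notes.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Steps 1-2: a classification problem on a built-in, already clean dataset
X, y = load_iris(return_X_y=True)

# Step 4: hold out 20% of the rows for the final evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Steps 4-5: scaling and the model chained in one pipeline
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Steps 6 and 8: train while tuning C with 5-fold cross-validation
search = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

# Step 7: evaluate once on the untouched test set
accuracy = accuracy_score(y_test, search.predict(X_test))
print("best C:", search.best_params_["clf__C"])
print(f"test accuracy: {accuracy:.2f}")
```

Putting the scaler inside the pipeline means it is refit on each cross-validation fold, so no information from the validation rows leaks into training.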

13. Why do we have to perform EDA before fitting a model to the data?

EDA (Exploratory Data Analysis) = analyzing the dataset before training.

Importance of EDA:

Detect missing values and decide how to handle them.

Identify outliers that may affect model performance.

Understand data distribution (skewness, imbalance).

Check correlations between variables (important for feature selection).

Decide preprocessing techniques (scaling, encoding).

Without EDA, we might blindly train a model that performs poorly because of unclean or biased data.
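
The checks listed above might look like this in pandas. This is a sketch on a tiny made-up DataFrame (the column names and values are hypothetical), using the common 1.5 × IQR rule as one possible way to flag outliers.

```python
import numpy as np
import pandas as pd

# Small made-up dataset with one missing value and one extreme value
df = pd.DataFrame({
    "age": [22, 25, 27, np.nan, 30, 95],
    "income": [30, 35, 38, 40, 45, 48],
    "bought": ["no", "no", "no", "yes", "no", "no"],
})

missing_counts = df.isna().sum()             # missing values per column
summary = df.describe()                      # count, mean, std, quartiles
age_skew = df["age"].skew()                  # > 0 suggests a long right tail
class_counts = df["bought"].value_counts()   # class imbalance check
correlations = df[["age", "income"]].corr()  # pairwise correlations

# Flag outliers in "age" with the 1.5 * IQR rule
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)]

print(missing_counts)
print(class_counts)
print(outliers)
```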

14. How can you find correlation between variables in Python?

We usually use Pandas (DataFrame.corr()) or NumPy (np.corrcoef()).

import pandas as pd

# Example DataFrame
data = {'Height': [150, 160, 170, 180, 190],
        'Weight': [50, 60, 70, 80, 90]}
df = pd.DataFrame(data)

# Correlation matrix
print(df.corr())


Output:
        Height  Weight
Height     1.0     1.0
Weight     1.0     1.0


15. What is causation? Difference between correlation and causation

Correlation = Two variables are related (move together), but one may not cause the other.

Causation = One variable directly influences the other.
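
One way to see the difference is a small simulation, sketched here under stated assumptions: a hypothetical third variable (temperature) drives two quantities that never influence each other. All names and numbers are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical confounder: daily temperature drives both quantities
temperature = rng.normal(size=1000)
ice_cream_sales = temperature + rng.normal(scale=0.5, size=1000)
sunburn_cases = temperature + rng.normal(scale=0.5, size=1000)

# Neither variable is computed from the other, yet they correlate strongly
r = np.corrcoef(ice_cream_sales, sunburn_cases)[0, 1]
print(f"r = {r:.2f}")
```

The strong correlation comes entirely from the shared cause, so it would be wrong to conclude that ice cream sales cause sunburn.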


16. What is an Optimizer? Types of Optimizers

Optimizer: Algorithm that updates model parameters (weights) to minimize the loss function.

Common Types of Optimizers:

>Gradient Descent

Adjusts weights step-by-step in the opposite direction of the gradient, computed over the entire training set.

Example: Linear regression training.

>Stochastic Gradient Descent (SGD)

Updates weights after each training example (faster but noisy).

>Mini-Batch Gradient Descent

Updates weights using small batches of data (balance between speed & stability).

>Adam (Adaptive Moment Estimation)

Combines momentum + adaptive learning rate.

One of the most widely used optimizers and a common default choice in deep learning.
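
A minimal sketch of plain (full-batch) gradient descent, assuming a linear model y = mx + c trained with MSE loss on made-up data. The learning rate and iteration count are hand-picked assumptions.

```python
import numpy as np

# Made-up data on the line y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

m, c = 0.0, 0.0        # parameters start at zero
learning_rate = 0.05   # step size (a hyperparameter, chosen by hand)

for _ in range(2000):
    error = (m * x + c) - y
    # Gradients of MSE = mean(error**2) with respect to m and c
    grad_m = 2.0 * np.mean(error * x)
    grad_c = 2.0 * np.mean(error)
    # Step in the opposite direction of the gradient
    m -= learning_rate * grad_m
    c -= learning_rate * grad_c

print(f"m = {m:.3f}, c = {c:.3f}")  # m = 2.000, c = 1.000
```

SGD and mini-batch gradient descent use the same update rule but compute the gradient from one example or a small batch per step instead of the full dataset.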

17. What is sklearn.linear_model?

A module in scikit-learn that provides linear models for regression and classification.

Common classes:

LinearRegression() → for regression problems.

LogisticRegression() → for classification problems.

Ridge, Lasso → regularized linear regression (L2 and L1 penalties, respectively).

18. What does model.fit() do?

Trains the model by finding the best parameters from the training data, e.g., model.fit(X_train, y_train).

19. What does model.predict() do?

Makes predictions on new/unseen data using the trained model, e.g., model.predict(X_test).
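
The two calls are typically used together, as in this sketch on noise-free made-up data following y = 3x + 2 (the data and variable names are assumptions for illustration).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Made-up data: y = 3x + 2 with no noise
X = np.arange(20, dtype=float).reshape(-1, 1)
y = 3.0 * X.ravel() + 2.0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)      # learns coef_ and intercept_ from training rows
y_pred = model.predict(X_test)   # applies them to rows the model has not seen

print(X_train.shape, X_test.shape)  # (16, 1) (4, 1)
print(np.allclose(y_pred, y_test))  # True, because the data has no noise
```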

20. What are continuous and categorical variables?

Continuous Variables → Can take infinite numerical values.

Example: Height, Weight, Temperature.

Categorical Variables → Represent groups/labels.

Example: Gender (Male/Female), City (Delhi, Mumbai).

21. What is feature scaling? How does it help in Machine Learning?

Feature scaling = Transforming data so that all features are on a similar scale/range.

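Helps ML algorithms that depend on distances or gradients (e.g., k-nearest neighbors, SVM, models trained with gradient descent), because otherwise features with large ranges dominate those with small ranges.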

22. How do we perform scaling in Python?

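import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Example feature matrix (rows = samples, columns = features)
X = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])

# Standardization: each column gets mean 0 and standard deviation 1
X_standardized = StandardScaler().fit_transform(X)

# Normalization: each column is rescaled to the range 0 to 1
X_normalized = MinMaxScaler().fit_transform(X)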
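
When the data is split into training and test sets, a common practice is to fit the scaler on the training rows only and reuse it on the test rows. The sketch below assumes a made-up feature matrix (age and income) to show the pattern.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up features on very different scales (age in years, income in dollars)
X = np.array([[25, 30000], [32, 48000], [47, 82000], [51, 61000],
              [38, 55000], [29, 39000], [60, 90000], [44, 72000]], dtype=float)

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training rows
X_test_scaled = scaler.transform(X_test)        # reuse them; no refitting

print(X_train_scaled.mean(axis=0))  # close to 0 for each column
print(X_train_scaled.std(axis=0))   # close to 1 for each column
```

Calling fit_transform on the test set instead would let test-set statistics influence preprocessing, which makes the evaluation look better than it really is.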


23. What is sklearn.preprocessing?

A module in scikit-learn for preprocessing tasks.

Common classes:

StandardScaler → Standardization.

MinMaxScaler → Normalization.

LabelEncoder, OneHotEncoder → Encoding categories.

PolynomialFeatures → Feature transformation (generates polynomial and interaction terms).

24. How do we split data for model fitting in Python?

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)


25. Explain data encoding

Data Encoding = Converting categorical (non-numeric) data into numerical form so ML models can understand it.

Types of Encoding:

Label Encoding → Assigns an integer to each category (Female=0, Male=1).

One-Hot Encoding → Creates binary columns for each category.

Ordinal Encoding → Numbers with order (Low=1, Medium=2, High=3).