# **Assignment Questions**


1. What is a parameter?
* A parameter is a value that defines certain characteristics of a machine learning model and is learned from the training data; for example, in linear regression, the slope and intercept are parameters that the algorithm adjusts to minimize prediction error.
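
As a minimal sketch of what "learned from the training data" means, the snippet below estimates the slope and intercept of a simple linear regression with the closed-form least-squares formulas. The data points are made up for illustration and lie exactly on the line y = 2x + 1.

```python
# Hypothetical toy data lying exactly on the line y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form least-squares estimates for simple linear regression
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
intercept = mean_y - slope * mean_x

print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

Here `slope` and `intercept` are the parameters: they are not set by hand but computed from the data.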

2. What is correlation?
* Correlation is a statistical measure that quantifies the strength and direction of a linear relationship between two variables, typically expressed by a coefficient (such as Pearson’s r) ranging from –1 (perfect negative correlation) through 0 (no linear correlation) to +1 (perfect positive correlation).

  What does negative correlation mean?
* Negative correlation means that as one variable increases, the other tends to decrease; for instance, if the correlation between daily exercise hours and resting heart rate is –0.7, more exercise is generally associated with a lower resting heart rate.
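
To make the coefficient concrete, here is a small sketch that computes Pearson's r directly from its definition. The helper name `pearson_r` and the exercise and heart-rate values are invented for illustration only.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up measurements: daily exercise hours and resting heart rate (bpm)
exercise_hours = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
resting_hr = [78, 75, 74, 70, 66, 65]

r = pearson_r(exercise_hours, resting_hr)
print(f"r = {r:.2f}")  # negative: more exercise goes with a lower heart rate
```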

3. Define Machine Learning. What are the main components in Machine Learning?
* Machine Learning is a field of artificial intelligence where algorithms learn patterns from data to make predictions or decisions without being explicitly programmed; its main components include the dataset (inputs/features and labels), the model (algorithm structure, such as decision tree, SVM, or neural network), the loss function (which measures how far predictions deviate from true labels), the optimizer (which updates model parameters to minimize loss), and evaluation metrics (which quantify performance on validation or test data).

4. How does loss value help in determining whether the model is good or not?
* The loss value quantifies how well a model’s predictions match the true outputs during training—for example, mean squared error for regression or cross-entropy for classification; a lower training loss indicates that the model has learned to fit the training data better, although generalization also must be checked via validation or test metrics to ensure the model is truly good.
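
As an illustration, the sketch below computes mean squared error for two hypothetical sets of predictions on the same made-up targets; the set with the lower loss fits these targets more closely.

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [3.0, 5.0, 7.0, 9.0]          # made-up targets
preds_model_a = [2.5, 5.5, 7.0, 8.0]   # hypothetical model A
preds_model_b = [1.0, 8.0, 4.0, 12.0]  # hypothetical model B

loss_a = mean_squared_error(y_true, preds_model_a)
loss_b = mean_squared_error(y_true, preds_model_b)
print(f"MSE A = {loss_a}, MSE B = {loss_b}")
```

A lower value on the training targets alone does not prove the model is good; the same comparison should be repeated on held-out data.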

5. What are continuous and categorical variables?
* Continuous variables are numeric features that can take any value within a range (e.g., temperature, height), whereas categorical variables represent discrete groups or classes (e.g., color = {red, green, blue} or diet type = {vegan, vegetarian, omnivore}).

6. How do we handle categorical variables in Machine Learning? What are the common techniques?
* Categorical variables are encoded into numeric form before training most algorithms; common techniques include one-hot encoding (creating a binary column for each category), label encoding (assigning a unique integer to each category), ordinal encoding (mapping categories to integers when order matters), and target/frequency encoding (replacing each category with the mean of the target variable or its frequency), chosen based on algorithm requirements and dataset size.
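
A short sketch of two of these techniques with Scikit-learn follows. The feature values and the `low < medium < high` ordering are assumptions made up for the example.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Made-up nominal feature (no inherent order)
colors = np.array([["red"], ["green"], ["blue"], ["green"]])
# The default output is a sparse matrix, so convert it for display.
one_hot = OneHotEncoder().fit_transform(colors).toarray()
# Columns follow the sorted category order: blue, green, red

# Made-up ordinal feature with an explicit order
sizes = np.array([["low"], ["high"], ["medium"], ["low"]])
ordinal = OrdinalEncoder(categories=[["low", "medium", "high"]]).fit_transform(sizes)

print(one_hot)
print(ordinal.ravel())
```

One-hot encoding avoids implying an order between colors, while the ordinal encoder keeps the order that the size levels are assumed to have.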

7. What do you mean by training and testing a dataset?
* Training means using a subset of the data (the training set) to fit the model by adjusting its parameters to minimize loss, while testing means evaluating the trained model on a separate subset (the test set) that it has never seen, in order to estimate how well it generalizes to new data.

8. What is sklearn.preprocessing?
* `sklearn.preprocessing` is a module within Scikit-learn that provides utilities for transforming and scaling data before feeding it into machine learning models, including classes like `StandardScaler`, `MinMaxScaler`, and `OneHotEncoder` to prepare features in the appropriate numeric format, and `LabelEncoder` to encode target labels.

9. What is a Test set?
* The test set is the portion of data withheld from training that is used only to assess the final model’s performance; it provides an unbiased estimate of how the model will perform on unseen real-world data.

10. How do we split data for model fitting (training and testing) in Python?
* In Python’s Scikit-learn, you typically use `train_test_split` from `sklearn.model_selection`, for example:
  ```
  from sklearn.model_selection import train_test_split
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
  ```
  which randomly reserves 20% of the data for testing and 80% for training.

  How do you approach a Machine Learning problem?
* A typical approach is: (1) Define the problem and objectives, (2) Collect and explore data (EDA), (3) Split data into training/validation/test sets, (4) Preprocess and clean data (handle missing values, encode categoricals, scale features), fitting each transformation on the training split only to avoid data leakage, (5) Select appropriate models and hyperparameters, (6) Train and validate models using cross-validation, (7) Evaluate final performance on the test set, and (8) Deploy the model and monitor its real-world performance.
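
The middle steps of this workflow can be sketched with Scikit-learn as follows. The Iris dataset, the logistic regression model, and the 80/20 split are illustrative choices, not a recommendation for any particular problem.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Collect data (a built-in toy dataset stands in for real data here)
X, y = load_iris(return_X_y=True)

# Hold out a test set before any fitting happens
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Preprocessing and model in one pipeline, so the scaler is fit on training data only
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Validate with 5-fold cross-validation on the training set
cv_scores = cross_val_score(model, X_train, y_train, cv=5)

# Fit on the full training set and evaluate once on the test set
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

print(f"CV accuracy: {cv_scores.mean():.3f}, test accuracy: {test_accuracy:.3f}")
```

Wrapping the scaler and the classifier in a pipeline is one way to keep preprocessing inside each cross-validation fold.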

11. Why do we have to perform EDA before fitting a model to the data?
* Exploratory Data Analysis (EDA) helps ensure that you understand data distributions, detect outliers, identify missing values, visualize relationships and correlations, and uncover potential biases or anomalies, enabling you to choose appropriate features, preprocessing steps, and algorithms—ultimately leading to more accurate and robust models.
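
A few of these checks can be sketched with Pandas. The DataFrame below is invented for the example and deliberately contains one missing value and one implausible weight.

```python
import numpy as np
import pandas as pd

# Made-up dataset with one missing value and one suspicious outlier
df = pd.DataFrame(
    {
        "height_cm": [160.0, 172.0, 168.0, np.nan, 181.0, 175.0],
        "weight_kg": [55.0, 70.0, 65.0, 80.0, 300.0, 74.0],
        "diet": ["vegan", "omnivore", "omnivore", "vegetarian", "omnivore", "vegan"],
    }
)

summary = df.describe()                  # count, mean, std, quartiles per numeric column
missing_counts = df.isna().sum()         # missing values per column
diet_counts = df["diet"].value_counts()  # class balance of a categorical column

# Flag values outside 1.5 * IQR, a common rule of thumb for outliers
q1, q3 = df["weight_kg"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["weight_kg"] < q1 - 1.5 * iqr) | (df["weight_kg"] > q3 + 1.5 * iqr)]

print(summary)
print(missing_counts)
print(diet_counts)
print(outliers)
```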

12. What is correlation?
* Correlation is a statistical measure that quantifies the strength and direction of a linear relationship between two variables, typically expressed by a coefficient (such as Pearson’s r) ranging from –1 (perfect negative correlation) through 0 (no linear correlation) to +1 (perfect positive correlation).

13. What does negative correlation mean?
* Negative correlation means that as one variable increases, the other tends to decrease; for instance, if the correlation between daily exercise hours and resting heart rate is –0.7, more exercise is generally associated with a lower resting heart rate.

14. How can you find correlation between variables in Python?
* In Python, you can compute pairwise correlation coefficients using Pandas with `df.corr()`, which returns a correlation matrix (Pearson by default), or use `scipy.stats.pearsonr(x, y)` to get the Pearson correlation coefficient and p-value for any two numeric arrays.
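
Both calls are shown in the sketch below on a small made-up DataFrame; the column names and values are assumptions for illustration.

```python
import pandas as pd
from scipy.stats import pearsonr

# Made-up data for illustration
df = pd.DataFrame(
    {
        "exercise_hours": [0.0, 0.5, 1.0, 1.5, 2.0, 2.5],
        "resting_hr": [78, 75, 74, 70, 66, 65],
        "sleep_hours": [6.0, 7.5, 6.5, 8.0, 7.0, 7.5],
    }
)

# Pairwise correlation matrix for all numeric columns
corr_matrix = df.corr()

# Coefficient and p-value for one specific pair
r, p_value = pearsonr(df["exercise_hours"], df["resting_hr"])

print(corr_matrix)
print(f"r = {r:.3f}, p = {p_value:.4f}")
```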

15. What is causation? Explain difference between correlation and causation with an example.
* Causation means that changes in one variable directly cause changes in another, whereas correlation only indicates that two variables move together without implying cause. For example, ice cream sales and drowning incidents are positively correlated (both peak in summer), but buying ice cream does not cause drowning: hot weather, a third variable, drives both.
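
The ice cream example can be mimicked with a small simulation. All numbers below are synthetic, and the linear relationships are assumptions chosen only to show how a shared cause produces correlation between two variables that do not affect each other.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Simulated confounder: daily temperature drives both outcomes
temperature = rng.normal(25.0, 5.0, n)
ice_cream_sales = 2.0 * temperature + rng.normal(0.0, 5.0, n)
drownings = 0.5 * temperature + rng.normal(0.0, 2.0, n)

# The two outcomes are correlated even though neither influences the other
raw_corr = np.corrcoef(ice_cream_sales, drownings)[0, 1]


def residuals(y, x):
    """Part of y not explained by a straight-line fit on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)


# Correlation after removing the temperature effect from both variables
partial_corr = np.corrcoef(
    residuals(ice_cream_sales, temperature), residuals(drownings, temperature)
)[0, 1]

print(f"raw correlation: {raw_corr:.2f}")
print(f"after controlling for temperature: {partial_corr:.2f}")
```

Once temperature is accounted for, the remaining correlation is close to zero, which is what one would expect if the weather alone links the two series.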

16. What is an Optimizer? What are different types of optimizers? Explain each with an example.
* An optimizer is an algorithm that updates a model’s parameters to minimize (or maximize) its loss function during training. Common optimizers include:
  - **Gradient Descent**: computes the full gradient over the entire training set at each step.
  - **Stochastic Gradient Descent (SGD)**: updates parameters using the gradient from a single randomly chosen sample, e.g., `SGDClassifier` in Scikit-learn.
  - **Momentum**: accelerates SGD by adding a fraction of the previous update to the current one to smooth updates.
  - **AdaGrad**: adapts the learning rate per parameter based on the sum of past squared gradients.
  - **RMSProp**: similar to AdaGrad but uses a moving average of squared gradients to avoid diminishing learning rates.
  - **Adam**: combines momentum and RMSProp by maintaining running averages of both gradients and squared gradients; often used in deep learning, e.g., `keras.optimizers.Adam()`.
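
Three of these update rules are sketched below in plain Python on a one-dimensional toy loss, f(w) = (w - 3)^2. These are simplified textbook versions with hand-picked hyperparameters, not the implementations used by Scikit-learn or Keras.

```python
import math


def grad(w):
    """Gradient of the toy loss f(w) = (w - 3)^2, which is minimized at w = 3."""
    return 2.0 * (w - 3.0)


def gradient_descent(w=0.0, lr=0.1, steps=1000):
    for _ in range(steps):
        w -= lr * grad(w)  # step against the gradient
    return w


def momentum(w=0.0, lr=0.1, beta=0.9, steps=1000):
    velocity = 0.0
    for _ in range(steps):
        velocity = beta * velocity + grad(w)  # accumulate past gradients
        w -= lr * velocity
    return w


def adam(w=0.0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    m = 0.0  # running average of gradients
    v = 0.0  # running average of squared gradients
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias correction for the zero start
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


w_gd = gradient_descent()
w_momentum = momentum()
w_adam = adam()
print(w_gd, w_momentum, w_adam)  # each should end up close to 3
```

With a single deterministic loss there is no sampling, so this shows the update rules themselves rather than the stochastic behavior of SGD.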

17. What is sklearn.linear_model?
* `sklearn.linear_model` is a Scikit-learn submodule that implements various linear algorithms for regression and classification, such as `LinearRegression`, `LogisticRegression`, `Ridge`, `Lasso`, `SGDRegressor`, and `SGDClassifier`, providing a consistent interface for fitting and predicting with linear models.

18. What does model.fit() do? What arguments must be given?
* The `model.fit()` method trains a Scikit-learn estimator by learning its parameters from the provided data. Supervised estimators require two arguments: `X` (typically `X_train`, a 2D array or DataFrame of feature values) and `y` (typically `y_train`, a 1D array or Series of target values); unsupervised estimators and transformers need only `X`. Some estimators also accept optional arguments such as `sample_weight`.

19. What does model.predict() do? What arguments must be given?
* The `model.predict()` method uses a trained estimator to generate predictions for new data; it requires one argument: `X_test` (a 2D array or DataFrame of feature values with the same preprocessing as training), and returns predicted labels (for classifiers) or continuous values (for regressors).
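
A minimal sketch of the `fit` and `predict` calls is given below, using `LinearRegression` on made-up data that follows y = 2x + 1. The choice of estimator is arbitrary; other supervised Scikit-learn estimators follow the same pattern.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data following y = 2x + 1
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])  # 2D: (n_samples, n_features)
y_train = np.array([1.0, 3.0, 5.0, 7.0])          # 1D: (n_samples,)

model = LinearRegression()
model.fit(X_train, y_train)  # learns coef_ and intercept_ from the data

X_test = np.array([[4.0], [5.0]])  # new inputs with the same number of features
y_pred = model.predict(X_test)     # one prediction per row of X_test

print(model.coef_, model.intercept_, y_pred)
```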

20. What are continuous and categorical variables?
* Continuous variables are numeric features that can take any value within a range (e.g., temperature, height), whereas categorical variables represent discrete groups or classes (e.g., color = {red, green, blue} or diet type = {vegan, vegetarian, omnivore}).

21. What is feature scaling? How does it help in Machine Learning?
* Feature scaling transforms numeric features to a common scale—usually by standardizing (subtracting the mean and dividing by the standard deviation) or normalizing (rescaling to a fixed range, such as [0,1])—which helps algorithms that rely on distance metrics (e.g., k-nearest neighbors, SVM) or gradient-based optimization (e.g., logistic regression, neural networks) converge faster and avoid bias toward features with larger magnitudes.

22. How do we perform scaling in Python?
* In Python’s Scikit-learn, you create a scaler (for example, `StandardScaler()` or `MinMaxScaler()` from `sklearn.preprocessing`), then call `scaler.fit_transform(X_train)` to compute scaling parameters from the training set and apply the transformation, and subsequently call `scaler.transform(X_test)` to apply the same scaling to test or new data.
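
The fit-on-train, transform-on-test pattern can be sketched as follows; the arrays are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up features on very different scales
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])
X_test = np.array([[2.5, 250.0], [5.0, 500.0]])

# Standardization: mean and standard deviation come from the training set only
std_scaler = StandardScaler()
X_train_std = std_scaler.fit_transform(X_train)
X_test_std = std_scaler.transform(X_test)

# Min-max normalization: minimum and maximum come from the training set only
minmax_scaler = MinMaxScaler()
X_train_mm = minmax_scaler.fit_transform(X_train)
X_test_mm = minmax_scaler.transform(X_test)

print(X_train_std)
print(X_test_mm)  # test values may fall outside [0, 1]
```

Because the scaling parameters come from the training data, a test value larger than the training maximum maps above 1 under min-max scaling.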

23. What is sklearn.preprocessing?
* `sklearn.preprocessing` is a module within Scikit-learn that provides utilities for transforming and scaling data before feeding it into machine learning models, including classes like `StandardScaler`, `MinMaxScaler`, and `OneHotEncoder` to prepare features in the appropriate numeric format, and `LabelEncoder` to encode target labels.

24. How do we split data for model fitting (training and testing) in Python?
* In Python’s Scikit-learn, you typically use `train_test_split` from `sklearn.model_selection`, for example:
  ```
  from sklearn.model_selection import train_test_split
  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
  ```
  which randomly reserves 20% of the data for testing and 80% for training.

25. Explain data encoding.
* Data encoding refers to converting categorical or non-numeric variables into numeric representations so that machine learning algorithms can process them; techniques include one-hot encoding (`OneHotEncoder`), label encoding (`LabelEncoder`, intended for target labels), ordinal encoding (`OrdinalEncoder`, mapping ordered categories to integers), and target/frequency encoding (replacing each category with a summary statistic from the target variable or its frequency).
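
Frequency and target encoding have no dedicated class among those named above, so here is a minimal Pandas sketch. The `city` and `purchased` columns are hypothetical.

```python
import pandas as pd

# Made-up data: a categorical feature and a binary target
df = pd.DataFrame(
    {
        "city": ["A", "B", "A", "C", "B", "A"],
        "purchased": [1, 0, 1, 0, 1, 0],
    }
)

# Frequency encoding: share of rows belonging to each category
freq = df["city"].value_counts(normalize=True)
df["city_freq"] = df["city"].map(freq)

# Target (mean) encoding: average target value per category
target_mean = df.groupby("city")["purchased"].mean()
df["city_target"] = df["city"].map(target_mean)

print(df)
```

In practice the target means should be computed from the training data only, since computing them on the full dataset leaks target information into the features.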
