# ***Assignment Questions***

---



1. **What is a parameter**?

 - In machine learning, a parameter usually refers to a variable that the model learns from the training data.

 - Example: In linear regression, the slope (weights) and intercept (bias) are parameters.

  - In logistic regression, the coefficients for each feature are parameters.

 - In neural networks, the weights and biases across layers are parameters.
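
As a minimal sketch of the linear regression case (the toy data and variable names are made up for illustration), scikit-learn's `LinearRegression` exposes its learned parameters as `coef_` and `intercept_` after fitting:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data generated from y = 2x + 1 (illustrative, no noise)
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2 * X.ravel() + 1

model = LinearRegression().fit(X, y)

slope = model.coef_[0]        # learned weight
intercept = model.intercept_  # learned bias
print(slope, intercept)       # should be very close to 2 and 1
```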

 ---


2. **What is correlation? What does negative correlation mean?**

- **Correlation:-**

  - Correlation is a statistical measure that explains the relationship between two variables. It indicates whether and how strongly pairs of variables are related. The value of correlation coefficient (r) always lies between -1 and +1.

  - r = +1 → Perfect positive correlation

  - r = -1 → Perfect negative correlation

  - r = 0 → No linear correlation (the variables may still be related in a non-linear way)

- **Negative Correlation :-**

    - Negative correlation means that when one variable increases, the other decreases, and vice versa. In other words, the two variables move in opposite directions.

    - Example:

    - Price of a product and its demand → as price increases, demand decreases.

    - Number of hours spent exercising and body fat percentage → more exercise, less body fat.

    ---

3. **Define Machine Learning. What are the main components in Machine Learning?**

 - Machine Learning (ML) is a branch of Artificial Intelligence (AI) that enables systems to learn automatically from data and improve their performance without being explicitly programmed.
It focuses on building algorithms that can identify patterns, make predictions, or take decisions based on input data.

 - Arthur Samuel (1959) defined ML as:
"Machine Learning is the field of study that gives computers the ability to learn without being explicitly programmed."

**Main Components of Machine Learning**

1. **Dataset (Input Data):**

    - The foundation of ML.

    - Data is collected from various sources (images, text, numbers, sensor data, etc.).

    - It is divided into training data and testing data.

2. **Features (Input Variables):**

    - Features are the measurable properties or characteristics of the data.

    - Example: In predicting house prices, features may include size, location, number of rooms.

3. **Model (Algorithm):**

    - A mathematical representation or function that maps input features to output.

    - Example: Linear Regression, Decision Trees, Neural Networks.

4. **Training Process:**

    - The process of feeding data into the model so it can learn patterns.

    - The model adjusts its parameters (like weights) to minimize error.

5. **Loss Function (Error Function):**

    - Measures how far the predicted output is from the actual result.

    - Example: Mean Squared Error (MSE) in regression.

6. **Optimizer (Learning Algorithm):**

    - Updates model parameters to reduce the error.

    - Example: Gradient Descent, Adam Optimizer.

7. **Evaluation (Testing):**

    - After training, the model is tested on unseen data to check its accuracy and performance.

   ---

4. **How does loss value help in determining whether the model is good or not?**
 - Loss value is a measure of how well or poorly a machine learning model is performing. It is calculated by comparing the model’s predicted outputs with the actual outputs.

A low loss value means the model's predictions are close to the actual values on the data being evaluated, which suggests a good fit.

A high loss value means the predictions are far from the actual values, which suggests a poor fit.

If the loss decreases during training, it shows that the model is learning properly.

If the training loss is low but the testing loss is high, it means the model is overfitting.

If both training and testing losses are high, it means the model is underfitting.

Comparing loss values across different models (computed with the same loss function on the same data) also helps to identify which model performs better.
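
To make the training-versus-testing comparison concrete, here is a small sketch on synthetic data (the data, the choice of a decision tree, and all variable names are assumptions for illustration). An unconstrained tree tends to memorize the training set, so its training loss is expected to be far lower than its testing loss:

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy data (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X.ravel()) + rng.normal(0, 0.3, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# A tree with no depth limit can fit the training points exactly
model = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

train_loss = mean_squared_error(y_train, model.predict(X_train))
test_loss = mean_squared_error(y_test, model.predict(X_test))
print(f"train MSE = {train_loss:.3f}, test MSE = {test_loss:.3f}")
```

A training MSE near zero next to a clearly higher testing MSE is the overfitting pattern described above.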

---

5. **What are continuous and categorical variables?**
- Continuous and Categorical Variables

1. Continuous Variables:

  - Continuous variables are numerical variables that can take any value within a range.

  - They are measurable and often involve fractions or decimals.

  - Examples:

  -  Height of a person (e.g., 170.5 cm)

  - Temperature (e.g., 36.6°C)

  - Salary (e.g., 45000.75 INR)

2. Categorical Variables:

  - Categorical variables represent distinct categories or groups.

  - They are qualitative and usually cannot be measured numerically.

  - Examples:

  - Gender (Male, Female, Other)

  - Blood Group (A, B, AB, O)

  - Product Category (Electronics, Clothing, Furniture)

  ---

6. **How do we handle categorical variables in Machine Learning? What are the common techniques?**

Categorical variables represent discrete groups or labels. Most machine learning algorithms cannot work directly with categorical data, so we need to convert them into numerical form.

Common Techniques to Handle Categorical Variables:

Label Encoding:

Assigns a unique integer to each category.

Example:

Gender: Male → 0, Female → 1

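Best suited to ordinal data (categories with a natural order, such as Low < Medium < High), because the integers imply an order that nominal categories like Gender do not have.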

One-Hot Encoding:

Converts each category into a binary column (0 or 1).

Example:

Color: Red, Green, Blue →

Red → [1, 0, 0]

Green → [0, 1, 0]

Blue → [0, 0, 1]

Useful for nominal data (no specific order).

Binary Encoding:

Converts categories into binary numbers and represents them in columns.

Useful when there are many categories (high cardinality).

Target Encoding (Mean Encoding):

Replace each category with the mean of the target variable for that category.

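Often used in regression problems and for high-cardinality features. The means should be computed on the training data only; otherwise information from the target leaks into the features.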

Frequency Encoding:

Replace each category with its frequency/count in the dataset.

Helps algorithms understand importance/popularity of categories.
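
A short sketch of the first two techniques using pandas (the column names, category order, and data are hypothetical). It maps an ordinal column to integers with an explicit order and one-hot encodes a nominal column with `pd.get_dummies`:

```python
import pandas as pd

# Hypothetical data: "size" is ordinal, "color" is nominal
df = pd.DataFrame({
    "size": ["small", "large", "medium", "small"],
    "color": ["Red", "Green", "Blue", "Red"],
})

# Label (ordinal) encoding with an explicit, meaningful order
size_order = {"small": 0, "medium": 1, "large": 2}
df["size_encoded"] = df["size"].map(size_order)

# One-hot encoding: one 0/1 column per color
one_hot = pd.get_dummies(df["color"], prefix="color", dtype=int)

encoded = pd.concat([df[["size_encoded"]], one_hot], axis=1)
print(encoded)
```

Scikit-learn's `OrdinalEncoder` and `OneHotEncoder` do the same job inside a pipeline; pandas is used here only to keep the example short.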

---

7. **What do you mean by training and testing a dataset?**
 - In machine learning, a dataset is usually divided into two parts: training set and testing set. This helps in building a model that learns patterns and can generalize well to new data.

 1. Training Dataset:

The part of the dataset used to train the machine learning model.

The model learns patterns, relationships, and parameters from this data.

Example: If we have 1000 data points, 70% (700 points) may be used for training.

2. Testing Dataset:

The part of the dataset used to evaluate the performance of the trained model.

The model has never seen this data before, so it helps check if the model can generalize to new, unseen data.

Example: Remaining 30% (300 points) of the data is used for testing.

---

8. **What is sklearn.preprocessing?**
sklearn.preprocessing is a module in Scikit-learn (Python library) that provides tools to prepare and transform data before feeding it into machine learning models. Proper preprocessing helps models learn better and perform efficiently.

---

9. **What is a Test set?**
- A test set is a portion of a dataset that is kept separate from the training data and is used to evaluate the performance of a trained machine learning model.

---

10. **How do we split data for model fitting (training and testing) in Python?
How do you approach a Machine Learning problem?**

- In machine learning, data is usually split into training set and testing set so that the model can learn from one part and be evaluated on the other.

**Common Steps:**

Use the train_test_split function from sklearn.model_selection.

Decide the proportion of data for training and testing (commonly 70%-80% for training and 20%-30% for testing).

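Shuffle the data before splitting to ensure randomness; train_test_split does this by default (shuffle=True), and setting random_state makes the split reproducible.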
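
A minimal sketch of these steps, assuming a toy array of 10 samples (the data and the 70/30 ratio are illustrative choices):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# 70% training, 30% testing; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(X_train.shape, X_test.shape)
```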



**Approaching a machine learning problem generally involves the following steps**:

**Define the Problem:**

Understand the goal (classification, regression, clustering).

Example: Predict house prices (regression) or detect spam emails (classification).

**Collect Data:**

Gather relevant datasets from sources like CSV files, databases, or APIs.

**Explore & Understand Data (EDA):**

Analyze the data using statistics and visualization.

Check for missing values, outliers, and feature distributions.

**Preprocess Data:**

Handle missing values

Encode categorical variables

Scale or normalize features

**Split Data:**

Divide the data into training set and testing set.

**Choose a Model:**

Select an appropriate algorithm (e.g., Linear Regression, Decision Tree, SVM).

**Train the Model:**

Fit the model on the training data.

**Evaluate the Model:**

Use the test set and metrics like accuracy, MSE, F1-score to check performance.

**Tune Hyperparameters:**

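Adjust hyperparameters (settings chosen before training, such as tree depth or regularization strength) to improve performance, ideally using a validation set or cross-validation rather than the test set.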

**Deploy & Monitor:**

Use the trained model in real-world applications.

Monitor performance and retrain if necessary.
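
The core of this workflow (preprocess, split, train, evaluate) can be sketched in a few lines. The dataset (scikit-learn's bundled Iris data) and the model choice are assumptions made for illustration, not part of the steps themselves:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Collect data: a small classification dataset shipped with scikit-learn
X, y = load_iris(return_X_y=True)

# Split data, keeping class proportions the same in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Preprocess + choose a model: scaling and the classifier in one pipeline
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Train on the training set only
model.fit(X_train, y_train)

# Evaluate on the held-out test set
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy = {accuracy:.2f}")
```

Putting the scaler inside the pipeline means it is fitted on the training data only, so nothing about the test set leaks into preprocessing.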


---

11. **Why do we have to perform EDA before fitting a model to the data?**
Exploratory Data Analysis (EDA) is the process of analyzing and visualizing data to understand its structure, patterns, and quality before applying machine learning models.

**Reasons to perform EDA:**


**Understand the Data:**

Identify the types of variables (continuous, categorical).

Discover relationships and correlations between features.

**Detect Missing Values:**

Find and handle missing or null values that can affect model performance.

**Identify Outliers:**

Detect unusual data points that may distort the model.

**Check Feature Distribution:**

Understand how data is spread (normal, skewed, uniform).

Helps in deciding scaling or transformation techniques.

**Select Important Features:**

Identify which features are most relevant for prediction.

Reduces model complexity and improves accuracy.

**Avoid Garbage In, Garbage Out:**

Ensures the model is trained on clean, meaningful, and relevant data, not raw unprocessed data.

---

12. **What is correlation?**
 - Correlation is a statistical measure that describes the strength and direction of a relationship between two variables.

 ---

13. **What does negative correlation mean?**
 - Negative correlation occurs when two variables move in opposite directions.

 ---

14. **How can you find correlation between variables in Python?**
- In Python, we can find correlation between variables using libraries like Pandas or NumPy. Correlation shows the strength and direction of a relationship between two variables.

**Common Methods:**

**Using Pandas corr() function:**

df.corr() calculates the pairwise correlation (Pearson by default) between the columns of a DataFrame; pass numeric_only=True if the DataFrame also contains non-numeric columns.

Returns a correlation matrix showing correlation coefficients.

**Using NumPy corrcoef() function:**

np.corrcoef(x, y) calculates correlation between two arrays/lists.

Returns a 2×2 correlation matrix; the correlation coefficient (r value) is the off-diagonal entry, e.g. np.corrcoef(x, y)[0, 1].
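
Both methods can be tried on a small made-up DataFrame (the column names and values below are invented for illustration):

```python
import numpy as np
import pandas as pd

# Invented example data
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5],
    "exam_score": [52, 58, 65, 70, 78],
    "hours_gaming": [9, 7, 6, 4, 2],
})

# Pandas: full correlation matrix for all numeric columns
corr_matrix = df.corr()
print(corr_matrix.round(2))

# NumPy: corrcoef returns a 2x2 matrix, so take the off-diagonal entry
r = np.corrcoef(df["hours_studied"], df["exam_score"])[0, 1]
print(round(r, 3))
```

Here `hours_studied` and `exam_score` rise together (positive r), while `hours_studied` and `hours_gaming` move in opposite directions (negative r).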

---

15. **What is causation? Explain difference between correlation and causation with an example.**
- Causation means that one variable directly affects or causes a change in another variable.

It shows a cause-and-effect relationship.

Example: Smoking causes lung cancer → Smoking (cause) directly increases the risk of lung cancer (effect).

| Feature      | Correlation                                                             | Causation                                                         |
| ------------ | ----------------------------------------------------------------------- | ----------------------------------------------------------------- |
| Meaning      | Measures how two variables **move together**                            | Shows that **one variable causes a change** in another            |
| Relationship | Variables may move together **without causing each other**              | There is a **direct cause-and-effect** relationship               |
| Value        | Correlation coefficient (r) indicates strength and direction            | No coefficient; it’s about **direct effect**                      |
| Example      | Ice cream sales and drowning cases are correlated (both rise in summer) | Smoking causes lung cancer (smoking directly affects cancer risk) |
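
The ice cream example can be imitated with simulated data. In this sketch every number is invented: temperature drives both ice cream sales and drownings, and neither affects the other. The two still correlate strongly until temperature is accounted for:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Simulated (not real) data: temperature is the common cause
temperature = rng.normal(25, 5, size=n)
ice_cream_sales = 10 * temperature + rng.normal(0, 20, size=n)
drownings = 0.5 * temperature + rng.normal(0, 1, size=n)

# Raw correlation between the two effects
r_raw = np.corrcoef(ice_cream_sales, drownings)[0, 1]

def residuals(values, predictor):
    """Remove the straight-line effect of predictor from values."""
    slope, intercept = np.polyfit(predictor, values, deg=1)
    return values - (slope * predictor + intercept)

# Correlation after removing the effect of temperature from both
r_partial = np.corrcoef(
    residuals(ice_cream_sales, temperature),
    residuals(drownings, temperature),
)[0, 1]

print(f"raw r = {r_raw:.2f}, controlling for temperature r = {r_partial:.2f}")
```

The raw correlation is high, but it largely disappears once temperature is controlled for, which is what we expect when a third variable causes both.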


---

16. **What is an Optimizer? What are different types of optimizers? Explain each with an example**
- An optimizer is an algorithm used to update the parameters (weights and biases) of a machine learning model during training in order to minimize the loss function.

The goal of an optimizer is to help the model learn efficiently and reach the best possible performance.

| Optimizer                             | Description                                                                                                          | Example                                                                                    |
| ------------------------------------- | -------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------ |
| **Gradient Descent (GD)**             | Updates all parameters using the gradient of the loss function w.r.t. the parameters, computed over all training data. | Linear regression model: weights are updated after computing gradient over entire dataset. |
| **Stochastic Gradient Descent (SGD)** | Updates parameters using **one training sample at a time**. Each update is cheap, and the noise in the updates can help escape shallow local minima. | Training a neural network where weights are updated after each data point.                 |
| **Mini-Batch Gradient Descent**       | Combines GD and SGD: updates parameters using a **small batch of data**.                                             | Neural networks with batch size 32 or 64 for training.                                     |
| **Adam (Adaptive Moment Estimation)** | Combines **Momentum and RMSProp**. Maintains adaptive learning rates for each parameter and accelerates convergence. | Deep learning models in TensorFlow/Keras or PyTorch.                                       |
| **RMSProp**                           | Maintains **adaptive learning rate** for each parameter based on recent gradients. Prevents oscillations.            | Training RNNs where gradients can vanish or explode.                                       |
| **Momentum**                          | Accelerates GD by adding a **fraction of the previous update** to the current update. Helps escape local minima.     | Training deep neural networks where convergence is slow.                                   |
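
As a rough sketch of the first row of the table, here is plain (full-batch) gradient descent fitting a one-feature linear model with NumPy. The synthetic data, learning rate, and iteration count are arbitrary choices for illustration:

```python
import numpy as np

# Synthetic data from y = 3x + 0.5 plus a little noise
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=100)
y = 3.0 * X + 0.5 + rng.normal(0, 0.1, size=100)

w, b = 0.0, 0.0        # parameters to learn
learning_rate = 0.1

for _ in range(2000):
    error = (w * X + b) - y
    # Gradients of the mean squared error over the whole dataset
    grad_w = 2 * np.mean(error * X)
    grad_b = 2 * np.mean(error)
    # Step against the gradient to reduce the loss
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"w = {w:.2f}, b = {b:.2f}")  # should end up near 3 and 0.5
```

SGD and mini-batch variants use the same update rule but compute the gradients on one sample or a small batch instead of the full arrays.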


---

17. **What is sklearn.linear_model ?**
- sklearn.linear_model is a module in Scikit-learn (Python library) that provides algorithms for linear models.

Linear models are used to predict a target variable based on one or more input features.

The module contains regression and classification algorithms that assume a linear relationship between input features and output.

| Algorithm                        | Type                        | Description                                                                     | Example Use                                           |
| -------------------------------- | --------------------------- | ------------------------------------------------------------------------------- | ----------------------------------------------------- |
| **LinearRegression**             | Regression                  | Predicts a continuous target variable using a linear equation.                  | Predict house prices based on size, location, etc.    |
| **LogisticRegression**           | Classification              | Predicts a binary or multi-class outcome using a logistic function.             | Spam email detection (spam or not spam)               |
| **Ridge**                        | Regression                  | Linear regression with L2 regularization to prevent overfitting.                | Predicting stock prices with many correlated features |
| **Lasso**                        | Regression                  | Linear regression with L1 regularization, performs feature selection.           | Predicting medical costs with many features           |
| **ElasticNet**                   | Regression                  | Combines L1 and L2 regularization for balanced feature selection and shrinkage. | Predicting insurance claims                           |
| **SGDClassifier / SGDRegressor** | Classification / Regression | Uses stochastic gradient descent for linear models.                             | Large datasets where batch gradient descent is slow   |
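
To illustrate the difference between L2 and L1 regularization, this sketch fits `Ridge` and `Lasso` on synthetic data where only the first of two features matters (the data and `alpha` values are assumptions chosen to make the effect visible):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: the target depends on feature 0 only
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] + rng.normal(0, 0.5, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2: shrinks coefficients
lasso = Lasso(alpha=0.5).fit(X, y)   # L1: can set coefficients to exactly 0

print("Ridge coefficients:", ridge.coef_)
print("Lasso coefficients:", lasso.coef_)
```

With this setup Lasso is expected to drop the irrelevant second feature (coefficient exactly 0), while Ridge keeps a small non-zero value for it.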


---


18. **What does model.fit() do? What arguments must be given?**
- In Scikit-learn (Python), model.fit() is the method used to train a machine learning model on a dataset.

It makes the model learn patterns and relationships between input features and target variables.

During fit(), the model adjusts its parameters (weights, biases, etc.) to minimize the loss function.

**Arguments for model.fit()**

**The typical arguments are:**

**X (Features / Input Data):**

A 2D array or DataFrame containing the input variables.

Example: Columns like height, weight, age.

**y (Target / Output Data):**

A 1D array, Series, or DataFrame containing the target variable.

Example: marks scored, house price, spam/not spam.

**Optional Arguments:**

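Some estimators accept extra fit() arguments such as sample_weight. Settings like the number of iterations or the regularization strength are hyperparameters passed to the model's constructor, not to fit(). Unsupervised estimators (e.g., KMeans) need only X.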
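
A small sketch of `fit(X, y)` with and without the optional `sample_weight` argument, using `LinearRegression` on made-up data in which the last point is an outlier:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])  # 2D: (n_samples, n_features)
y = np.array([2.0, 4.0, 6.0, 20.0])         # 1D: (n_samples,), last value is an outlier

# Required arguments only
plain = LinearRegression().fit(X, y)

# Optional argument: give the outlier zero weight
weights = np.array([1.0, 1.0, 1.0, 0.0])
weighted = LinearRegression().fit(X, y, sample_weight=weights)

print(plain.coef_[0], weighted.coef_[0])
```

`fit()` returns the fitted estimator itself, which is why the call can be chained directly after the constructor.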

---

19. **What does model.predict() do? What arguments must be given?**
- In Scikit-learn (Python), model.predict() is the method used to make predictions using a trained machine learning model.

After the model has been trained using model.fit(), it can predict outputs for new, unseen input data.

It uses the patterns learned during training to generate predictions.


**X (Features / Input Data):**

A 2D array or DataFrame containing the new input variables for which predictions are needed.

The number of columns/features must match the features used during training.

**Optional Arguments:**

Most Scikit-learn models only require X.

A few estimators accept extra optional arguments, for example GaussianProcessRegressor.predict(X, return_std=True), but these are rare.
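
A minimal sketch of `predict()` on new samples, assuming a tiny invented dataset and a 1-nearest-neighbour classifier (any fitted estimator would work the same way):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Invented training data: [height_cm, weight_kg] -> label
X_train = np.array([[150, 50], [160, 55], [180, 85], [190, 95]])
y_train = np.array(["small", "small", "large", "large"])

model = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

# New samples must have the same two features, in the same order
X_new = np.array([[155, 52], [185, 90]])
predictions = model.predict(X_new)
print(predictions)

# Even a single sample is passed as a 2D array: shape (1, n_features)
single = model.predict(np.array([[170, 60]]))
print(single)
```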


---

20. **What are continuous and categorical variables ?**


**1. Continuous Variables:**

Continuous variables are numerical variables that can take any value within a range.

They are measurable and often include fractions or decimals.

Examples:

Height of a person (e.g., 170.5 cm)

Temperature (e.g., 36.6°C)

Salary (e.g., 45000.75 INR)

**2. Categorical Variables:**

Categorical variables represent distinct categories or groups.

They are qualitative and usually cannot be measured numerically.

Examples:

Gender (Male, Female, Other)

Blood Group (A, B, AB, O)

Product Category (Electronics, Clothing, Furniture)

---

21. **What is feature scaling? How does it help in Machine Learning?**

- Feature scaling is a technique used to normalize or standardize the range of independent variables (features) in a dataset.

Machine learning algorithms often perform better when all features are on a similar scale.

Without scaling, features with larger numerical values can dominate the learning process.

Common Methods of Feature Scaling:

Min-Max Scaling (Normalization):

Scales features to a fixed range, usually 0 to 1.

Formula:

$$X_{\text{scaled}} = \frac{X - X_{\min}}{X_{\max} - X_{\min}}$$

Standardization (Z-score Scaling):

Centers the data around mean = 0 and standard deviation = 1.

Formula:

$$X_{\text{scaled}} = \frac{X - \mu}{\sigma}$$

---


22. **How do we perform scaling in Python?**
- In Python, we usually perform feature scaling using Scikit-learn’s preprocessing module.

Min-Max Scaling (Normalization):

Use MinMaxScaler from sklearn.preprocessing

Scales features to a fixed range (0 to 1)

Standardization (Z-score Scaling):

Use StandardScaler from sklearn.preprocessing

Centers features around mean = 0 and standard deviation = 1

Max Abs Scaling:

Use MaxAbsScaler for data that is already centered at 0

Scales features by their maximum absolute value

Robust Scaling:

Use RobustScaler to reduce the impact of outliers

Centers using median and scales according to interquartile range (IQR)
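
A short sketch of the first two scalers on a made-up array, with a manual NumPy calculation alongside to show that they apply the usual min-max and z-score formulas column by column:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up data: two features on very different scales
X = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])

X_minmax = MinMaxScaler().fit_transform(X)      # each column mapped to [0, 1]
X_standard = StandardScaler().fit_transform(X)  # each column: mean 0, std 1

# The same results computed by hand
manual_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
manual_standard = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_minmax)
print(X_standard.round(3))
```

In a real project the scaler is usually fitted on the training data only (`fit` or `fit_transform`) and then applied to the test data with `transform`, so that test-set statistics do not influence training.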

---

23. **What is sklearn.preprocessing?**
- sklearn.preprocessing is a module in Scikit-learn (Python library) that provides tools to prepare and transform data before feeding it into machine learning models.

Raw data often contains features with different scales, missing values, or categorical variables.

sklearn.preprocessing helps to normalize, scale, encode, and transform such data; imputing missing values is handled by the separate sklearn.impute module.

Proper preprocessing helps models learn better and perform efficiently.



| Function / Class | Purpose                                                  |
| ---------------- | -------------------------------------------------------- |
| `StandardScaler` | Standardizes features to mean = 0, std = 1               |
| `MinMaxScaler`   | Scales features to a fixed range (0 to 1)                |
| `RobustScaler`   | Scales using median and IQR to reduce outlier impact     |
| `Normalizer`     | Scales individual samples to unit norm                   |
| `LabelEncoder`   | Converts target labels (y) to integers; `OrdinalEncoder` does the same for input features |
| `OneHotEncoder`  | Converts categorical features to binary columns          |
| `Binarizer`      | Converts numerical values into binary based on threshold |


---

24. **How do we split data for model fitting (training and testing) in Python?**

- In machine learning, it is important to split the dataset into training and testing sets so that the model can learn patterns and be evaluated on unseen data.

Use the train_test_split function from sklearn.model_selection.

Specify the proportion of data for training and testing:

Common split: 70%-80% for training, 20%-30% for testing.

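Shuffle the data before splitting to ensure random distribution of samples; train_test_split does this by default (shuffle=True), and random_state makes the split reproducible.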

Assign the split datasets to variables:

X_train, X_test → Features for training and testing

y_train, y_test → Target labels for training and testing
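
For classification with imbalanced classes, a plain random split can leave the test set with too few examples of the rare class. One option, sketched here on invented labels, is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both parts:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Invented imbalanced data: 80 samples of class 0, 20 of class 1
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Share of class 1 in each part
print(y_train.mean(), y_test.mean())
```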


---

25. **Explain data encoding?**

- Data encoding is the process of converting categorical (non-numerical) data into numerical format so that machine learning algorithms can process it.

Most ML algorithms can only work with numerical data, so encoding is essential.

Encoding preserves the information in categorical variables while making it usable for models.

| Technique                  | Description                                          | Example                                                |
| -------------------------- | ---------------------------------------------------- | ------------------------------------------------------ |
| **Label Encoding**         | Assigns a unique integer to each category            | Gender: Male → 0, Female → 1                           |
| **One-Hot Encoding**       | Creates binary columns for each category             | Color: Red, Green, Blue → \[1,0,0], \[0,1,0], \[0,0,1] |
| **Binary Encoding**        | Converts categories to binary numbers                | Category A → 01, B → 10, C → 11                        |
| **Target / Mean Encoding** | Replaces category with mean of target variable       | Product Type → Average Sales per Product Type          |
| **Frequency Encoding**     | Replaces category with occurrence count or frequency | City → Number of times it appears in dataset           |
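
The last two rows of the table can be sketched with pandas alone (the city names and sales figures are invented). For brevity the target means are computed on the whole frame; with a real model they would be computed on the training split only:

```python
import pandas as pd

# Invented data
df = pd.DataFrame({
    "city": ["Delhi", "Mumbai", "Delhi", "Pune", "Delhi", "Mumbai"],
    "sales": [100, 200, 120, 50, 140, 220],
})

# Frequency encoding: replace each city with how often it appears
df["city_freq"] = df["city"].map(df["city"].value_counts())

# Target (mean) encoding: replace each city with its average sales
df["city_target"] = df["city"].map(df.groupby("city")["sales"].mean())

print(df)
```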



---