# Q1. Scaler Trouble

Consider a dataset with two features: 'Weight' and 'Height'.

'Weight' values range from 50 to 90, and
'Height' values range from 160 to 190.
Data:
```python
Weight = [70, 80, 60, 90, 50]
Height = [170, 180, 160, 190, 165]
```
Which scaler should we choose to preprocess these features for our machine-learning model?

A. Min-Max Scaler for 'Weight' and Standard Scaler for 'Height'
B. Standard Scaler for 'Weight' and Min-Max Scaler for 'Height'
C. Min-Max Scaler for both 'Weight' and 'Height'
D. Standard Scaler for both 'Weight' and 'Height'

### Correct Option: D. Standard Scaler for both 'Weight' and 'Height'

### Explanation:

**Standard Scaler** standardizes data by subtracting the mean and dividing by the standard deviation. This is generally preferred for machine learning models as it:

- Makes all features have zero mean and unit variance, which can improve model performance.
- Ensures all features contribute equally to the model, regardless of their original units or scales.

**Min-Max Scaler** rescales the data to a fixed range, typically 0 to 1. This is useful in certain cases, but it can be problematic for machine learning models because:

- It depends only on the minimum and maximum values, so it ignores how the data is spread around the mean, which can be important for certain models.
- It is sensitive to outliers: a single extreme value sets the minimum or maximum and compresses the remaining values into a narrow part of the range.

In this particular case, both 'Weight' and 'Height' are continuous features with a well-defined range, so neither feature needs special treatment.

Therefore, using Standard Scaler ensures that both features are standardized and contribute equally to the machine learning model.
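To make the two formulas concrete, here is a minimal sketch that applies both scalings to the data from the question using only the standard library. The helper names `standardize` and `min_max` are illustrative, not library functions. The population standard deviation is used because that is what scikit-learn's `StandardScaler` computes.

```python
from statistics import mean, pstdev

weight = [70, 80, 60, 90, 50]
height = [170, 180, 160, 190, 165]

def standardize(values):
    """Z-score scaling: subtract the mean, divide by the population std."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def min_max(values):
    """Min-max scaling: map the smallest value to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

for name, values in (("Weight", weight), ("Height", height)):
    print(name, "standardized:", [round(z, 3) for z in standardize(values)])
    print(name, "min-max:     ", [round(m, 3) for m in min_max(values)])
```

After standardization both features have mean 0 and standard deviation 1, so they sit on a comparable scale regardless of their original units.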

**A and B are incorrect because:**

Applying different scaling methods to different features can lead to inconsistencies in the data and potentially harm the performance of the machine learning model.
In this case, there is no specific reason to choose Min-Max Scaler over Standard Scaler for either feature.

**C is incorrect because:**

While Min-Max Scaler can be useful in certain situations, it is not the best approach for this specific case as explained above. Standard Scaler is a more general-purpose scaler and is preferred for machine-learning models.

# Q2. Outlier Trouble

A dataset contains exam scores for students, ranging from 40 to 100. However, there are a few outliers with scores exceeding 120.

Which method is more suitable for identifying outliers in this scenario?

### Correct Option: Interquartile Range (IQR)

### Explanation:

**IQR method:**

- This method identifies outliers based on the quartiles of the data.
- It is more resistant to outliers than mean-based methods such as the Z-score, which makes it better suited to skewed data like these exam scores, where a few very high values can shift the mean significantly.

**Z-score method:**

- This method identifies outliers by how many standard deviations they lie from the mean.
- However, the mean and standard deviation are themselves influenced by outliers, which leads to inaccurate detection when the data is skewed.
- In this scenario, the few very high scores inflate the standard deviation, potentially masking outliers or misidentifying valid data points as outliers.
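The difference between the two methods can be seen on a small example. The scores below are made up for illustration, and the cutoffs (1.5 × IQR beyond the quartiles, and |z| > 3) are common conventions assumed here rather than values taken from the question.

```python
from statistics import mean, pstdev, quantiles

# Hypothetical exam scores: mostly 40-100, plus two outliers above 120.
scores = [40, 50, 55, 60, 62, 65, 68, 70, 72, 75, 78, 80, 100, 125, 130]

# IQR method: flag values beyond 1.5 * IQR from the quartiles.
q1, _, q3 = quantiles(scores, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = [s for s in scores if s < lower or s > upper]

# Z-score method: flag values more than 3 standard deviations from the mean.
mu, sigma = mean(scores), pstdev(scores)
z_outliers = [s for s in scores if abs(s - mu) / sigma > 3]

print("IQR fences:", lower, upper)
print("IQR outliers:", iqr_outliers)
print("Z-score outliers:", z_outliers)
```

On this sample the IQR fences flag both scores above 120, while the Z-score rule flags nothing, because the outliers themselves inflate the standard deviation.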

# Q3. Imputer works 2

What does the following code snippet do?
```python
from sklearn.impute import SimpleImputer
imp_mean = SimpleImputer(strategy='most_frequent')
imp_mean.fit(data)
imputed_train_df = imp_mean.transform(data)
```

### Correct option: Calculates the most frequent value among the non-missing values in each column and replaces the missing values in that column with it.


### Explanation:

- **SimpleImputer()** replaces missing values using a descriptive statistic (e.g. mean, median, or most frequent) along each column, or using a constant value.

- When **strategy='most_frequent'** is passed, it calculates the mode of the non-missing values in each column and replaces the missing values in that column with it. Each column is handled separately.
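A small runnable sketch of this behaviour follows. The array `data` is a made-up example, since the question does not show its contents.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing value (NaN) in each column.
data = np.array([
    [1.0, 7.0],
    [1.0, np.nan],
    [np.nan, 9.0],
    [3.0, 9.0],
])

imputer = SimpleImputer(strategy="most_frequent")
imputed = imputer.fit_transform(data)

print(imputer.statistics_)  # the per-column mode learned during fit
print(imputed)
```

The mode of the first column is 1 and the mode of the second is 9, so each missing entry is filled with the mode of its own column.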

# Q4. Handling Categorical features

Which of the following statements are true about handling categorical values?


a) One-hot encoding is the most effective approach for handling all categorical data.

b) One-hot encoding increases the dimensionality of the data.

c) Target encoding uses the target variable to estimate the value of each category in a categorical variable.

d) Label encoding can be useful for features with a large number of categories.

e) No encoding is also sometimes the best option for handling categorical variables.

f) Label encoding assigns arbitrary numerical values to categories in a categorical variable.

### Correct Option: b, c, f

### Explanation:

The true statements about handling categorical values are:

b) One-hot encoding increases the dimensionality of the data. This is because every category in the original variable is represented by a new binary feature in the one-hot encoded data.

c) Target encoding uses the target variable to estimate the value of each category in a categorical variable. This is done by calculating the average target value for each category and using that value to represent the category.

f) Label encoding assigns arbitrary numerical values to categories in a categorical variable. This is done by assigning a unique integer to each category, but the order of the integers does not necessarily reflect the order of the categories.

The incorrect statements are:

a) One-hot encoding is the most effective approach for handling all categorical data.
- This is not true, as other encoding techniques, such as label encoding and target encoding, may be more suitable for certain types of categorical data.

d) Label encoding can be useful for features with a large number of categories.
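- This is not true. Label encoding keeps a single column, so it does not add dimensionality, but the integers it assigns are arbitrary. With many categories, a model that treats those integers as numeric can read a spurious order and distance into them, which makes the feature difficult for the model to learn from.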

e) No encoding is also sometimes the best option for handling categorical variables.
- This is not true. Most machine learning models require numeric input, so encoding is generally necessary for them to interpret categorical data correctly.
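
The three encodings discussed above can be sketched in plain Python. The `colors` feature and the binary `target` are made-up illustrative data, and the target encoding shown is the simple per-category mean without any smoothing.

```python
from statistics import mean

# Hypothetical categorical feature and binary target.
colors = ["red", "green", "blue", "green", "red"]
target = [1, 0, 1, 0, 0]

categories = sorted(set(colors))  # ['blue', 'green', 'red']

# Label encoding: one arbitrary integer per category (still a single column).
label_map = {cat: i for i, cat in enumerate(categories)}
label_encoded = [label_map[c] for c in colors]

# One-hot encoding: one new binary column per category.
one_hot = [[int(c == cat) for cat in categories] for c in colors]

# Target encoding: replace each category with the mean target for that category.
target_means = {
    cat: mean(t for c, t in zip(colors, target) if c == cat)
    for cat in categories
}
target_encoded = [target_means[c] for c in colors]

print(label_encoded)
print(one_hot)
print(target_encoded)
```

One-hot encoding turns the single column into three, which is the dimensionality increase described in statement b, while label and target encoding each keep one column.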