import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Q1. Co2 Emissions

The data below represents carbon dioxide emissions from vehicles in grams per km (g/km). A few values are missing.

Which imputation strategy should we use?

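```python
import numpy as np
from sklearn.impute import SimpleImputer

# CO2 emissions in g/km; np.nan marks the missing readings
data = [50, 196, 221, 136, 255, np.nan, 230, 252, 267, 212, np.nan, 359, 328, 200, 500, 624, np.nan, 236, 289, 300, 366]
data = np.array(data)

# Fill in the blank with the chosen strategy
imputer = SimpleImputer(strategy="________")

# SimpleImputer expects 2D input, so reshape the values into a single column
data1 = imputer.fit_transform(data.reshape(-1, 1))
```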

### Correct option: median

### Explanation:

![](https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/003/012/original/missing_values.PNG?1648541322)


The data contains 3 outliers, so the median is the appropriate imputation strategy.

In the presence of outliers, or extreme values, the median is preferred over the mean.
The mean can be “dragged” up or down by extreme values, whereas the median is simply the middle value of the sorted data, so a few outliers barely affect it.
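To make the difference concrete, here is a minimal sketch that compares the mean and the median of the observed readings and then fills the gaps with the median. It uses plain NumPy (`np.nanmean`, `np.nanmedian`) instead of `SimpleImputer`, purely to keep the example short; the variable names are illustrative.

```python
import numpy as np

# Same readings as in the question; np.nan marks the missing values
data = np.array([50, 196, 221, 136, 255, np.nan, 230, 252, 267, 212, np.nan,
                 359, 328, 200, 500, 624, np.nan, 236, 289, 300, 366])

mean_value = np.nanmean(data)      # pulled upward by the largest readings
median_value = np.nanmedian(data)  # middle of the 18 observed readings

# Median imputation: replace every NaN with the median of the observed values
imputed = np.where(np.isnan(data), median_value, data)

print(f"mean = {mean_value:.2f}, median = {median_value:.2f}")
# mean = 278.94, median = 253.50
```

The mean sits about 25 g/km above the median, which is the kind of shift the explanation describes.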

# Q2. Missing_values

You are cleaning up a DataFrame that has almost 5000 observations and you notice that one of the categorical columns contains 1512 missing values.

What strategy should you apply to deal with these missing values?

### Correct option: Impute the missing values with the most_frequent value.


### Explanation:

- Since the column is categorical, we can replace the missing values with the most frequent value (the mode).
- Dropping every row with a missing value would discard 1512 of the roughly 5000 observations, which is a large loss of information.
- Replacing the missing values with randomly selected values has no statistical justification.
- Dropping the entire column may throw away a feature that carries important information.
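As a rough illustration of most-frequent imputation, the sketch below fills the gaps in a categorical column with its mode using pandas. The column name `fuel_type` and its values are hypothetical and are not taken from the question.

```python
import pandas as pd

# Hypothetical categorical column with missing entries
df = pd.DataFrame({"fuel_type": ["petrol", "diesel", None, "petrol", None, "petrol", "diesel"]})

# mode() ignores missing values and returns a Series; take its first entry
most_frequent = df["fuel_type"].mode()[0]

# Fill the gaps with that value
df["fuel_type"] = df["fuel_type"].fillna(most_frequent)

print(most_frequent)
print(df["fuel_type"].tolist())
```

The same result could be obtained with `SimpleImputer(strategy='most_frequent')`; pandas is used here only for brevity.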

# Q3. Imputer Works

What does the following code snippet do?
```python
from sklearn.impute import SimpleImputer

# `data` is assumed to be a 2D array or DataFrame that contains missing values
imp_mean = SimpleImputer(strategy='mean')
imp_mean.fit(data)
imputed_train_df = imp_mean.transform(data)
```

### Correct option: Calculates the mean of the non-missing values in each column and replaces the missing values in that column with it

### Explanation:

`SimpleImputer()` replaces missing values using a descriptive statistic (e.g. mean, median, or most frequent) along each column, or using a constant value.

When `strategy='mean'` is passed, it calculates the mean of the non-missing values in each column and then replaces the missing values in that column with that mean.
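The per-column behaviour described above can be sketched in plain NumPy. This is not the scikit-learn implementation, only a small illustration of the same idea on a hypothetical 3x2 array.

```python
import numpy as np

# Hypothetical 2D data: two columns, each with one missing value
X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, np.nan]])

# Mean of the non-missing values, computed per column (axis=0)
col_means = np.nanmean(X, axis=0)

# Replace each NaN with the mean of its own column (col_means broadcasts across rows)
X_imputed = np.where(np.isnan(X), col_means, X)

print(col_means)   # [ 2. 15.]
print(X_imputed)
```

Each column is handled independently: the gap in the first column becomes 2.0 and the gap in the second becomes 15.0.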

# Q4. Car company

The below data represents the Carbon Dioxide emissions from a vehicle.

![](https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/003/013/original/co2_em.PNG?1648541586)


There are 11 features and 1 target column. Among the independent variables, “Make” contains 42 unique car companies.

Which feature engineering technique can be applied to this column?

### Correct option: Feature Binning

### Explanation:


One-hot encoding this column would add one column per category (42 here), and most of the resulting values would be zeros, giving sparse, high-dimensional data.
We cannot apply feature scaling or a numeric transformation because “Make” is not a numeric feature.
We can apply feature binning by grouping the 42 unique car companies into 4 categories, for example Luxury, Sports, Premium, and General cars.

Code:
```python
# `data` here is the DataFrame shown above
print('Initial column:\n', data['Make'].unique())

# Group each make into one of four broader categories
data['Make_Type'] = data['Make'].replace(['BUGATTI', 'PORSCHE', 'MASERATI', 'ASTON MARTIN', 'LAMBORGHINI', 'JAGUAR', 'SRT'], 'Sports')

data['Make_Type'] = data['Make_Type'].replace(['ALFA ROMEO', 'AUDI', 'BMW', 'BUICK', 'CADILLAC', 'CHRYSLER', 'DODGE', 'GMC', 'INFINITI', 'JEEP', 'LAND ROVER', 'LEXUS', 'MERCEDES-BENZ', 'MINI', 'SMART', 'VOLVO'], 'Premium')

data['Make_Type'] = data['Make_Type'].replace(['ACURA', 'BENTLEY', 'LINCOLN', 'ROLLS-ROYCE', 'GENESIS'], 'Luxury')

data['Make_Type'] = data['Make_Type'].replace(['CHEVROLET', 'FIAT', 'FORD', 'KIA', 'HONDA', 'HYUNDAI', 'MAZDA', 'MITSUBISHI', 'NISSAN', 'RAM', 'SCION', 'SUBARU', 'TOYOTA', 'VOLKSWAGEN'], 'General')

print('Final column:\n', data['Make_Type'].unique())
```

### Output:

```
Initial column:
array(['ACURA', 'ALFA ROMEO', 'ASTON MARTIN', 'AUDI', 'BENTLEY', 'BMW', 'BUICK', 'CADILLAC', 'CHEVROLET', 'CHRYSLER', 'DODGE', 'FIAT', 'FORD', 'GMC', 'HONDA', 'HYUNDAI', 'INFINITI', 'JAGUAR', 'JEEP', 'KIA', 'LAMBORGHINI', 'LAND ROVER', 'LEXUS', 'LINCOLN', 'MASERATI', 'MAZDA', 'MERCEDES-BENZ', 'MINI', 'MITSUBISHI', 'NISSAN', 'PORSCHE', 'RAM', 'ROLLS-ROYCE', 'SCION', 'SMART', 'SRT', 'SUBARU', 'TOYOTA', 'VOLKSWAGEN', 'VOLVO', 'GENESIS', 'BUGATTI'], dtype=object)

Final column:
array(['Sports', 'Premium', 'Luxury', 'General'], dtype=object)
```

The initial column has 42 unique values; the final column has 4.
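An alternative way to express the same binning is a lookup dictionary combined with `Series.map`. The sketch below uses only a small subset of the makes listed above, and the sample rows (including the unlisted make `'TESLA'`) are hypothetical, chosen to show what happens to a make that has no group.

```python
import pandas as pd

# Small illustrative subset of the groups used above
groups = {
    'Sports': ['BUGATTI', 'PORSCHE'],
    'Premium': ['AUDI', 'BMW'],
    'Luxury': ['BENTLEY', 'ROLLS-ROYCE'],
    'General': ['FORD', 'TOYOTA'],
}

# Invert to a make -> group lookup table
make_to_type = {make: group for group, makes in groups.items() for make in makes}

# Hypothetical sample rows; 'TESLA' is deliberately absent from the lookup
df = pd.DataFrame({'Make': ['BMW', 'FORD', 'BUGATTI', 'BENTLEY', 'TESLA']})

# map() yields NaN for makes missing from the lookup, so gaps are easy to spot
df['Make_Type'] = df['Make'].map(make_to_type)

print(df)
```

With `replace`, a make that is missing from every list keeps its original name in `Make_Type`; with `map`, it shows up as NaN, which can make an incomplete grouping easier to detect.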

# Q5. Text messages

Data on the number of text messages sent one weekend by girls and boys in school is summarized as follows:

![](https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/002/865/original/text_msg.PNG?1647936406)

A Statistics student checking the calculations finds that the message counts for all the students were underreported by 5.

If the numbers are corrected, what are the corrected IQR and standard deviation?

### Correct option: IQR = 76 and standard deviation = 42.4

### Explanation:

Adding the same constant to every value will increase the Min, Q1, Median, Q3, Max, and Mean by that constant.

However, measures of variability, including the IQR and the standard deviation, will remain unchanged.

If you add a constant to every value, the distance between values does not change.
As a result, all of the measures of variability (range, interquartile range, standard deviation, and variance) remain the same.

On the other hand, suppose you multiply every value by a constant. This has the effect of multiplying the range, interquartile range (IQR), and standard deviation by that constant.

Thus, IQR = Q3 - Q1 = 90 - 14 = 76 and standard deviation = 42.4, both unchanged by the correction.
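The shift and scale behaviour described above can be checked numerically. The counts below are hypothetical, not the data from the question, and the `iqr` helper is defined here only for this illustration.

```python
import numpy as np

# Hypothetical message counts (not the data from the question)
counts = np.array([2, 14, 30, 55, 90, 120, 160], dtype=float)

def iqr(values):
    """Interquartile range using NumPy's default (linear) percentile method."""
    q1, q3 = np.percentile(values, [25, 75])
    return q3 - q1

shifted = counts + 5   # every count corrected upward by 5
scaled = counts * 2    # every count doubled

# Adding a constant moves the mean but leaves IQR and standard deviation alone;
# multiplying by a constant scales both by that constant
print(counts.mean(), shifted.mean())
print(iqr(counts), iqr(shifted), iqr(scaled))
print(np.std(counts), np.std(shifted), np.std(scaled))
```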

# Q6. Limits for outlier

For a certain array [0, 1, 2, 3, 4, 5, 10], we decided to plot a boxplot as below:

![](https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/026/837/original/yticks.PNG?1677576157)

Using the above plot, find the upper limit, the median, and the lower limit, where any data point outside the two limits is considered an outlier.

### Correct option: 9, 3, -3

### Explanation:

From the boxplot, Q3 = 4.5 and Q1 = 1.5.
Therefore, IQR = Q3 - Q1 = 3

upper limit = Q3 + 1.5(IQR) = 4.5 + 1.5(3) = 9

lower limit = Q1 - 1.5(IQR) = 1.5 - 1.5(3) = -3

Any observation greater than the upper limit or lower than the lower limit is considered an outlier, so 10 is the only outlier in this array.

The median can easily be confirmed from the boxplot as 3.
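The same limits can be reproduced in a few lines of NumPy. This sketch assumes NumPy's default (linear interpolation) percentile method, which gives the quartiles read from the plot for this array.

```python
import numpy as np

values = np.array([0, 1, 2, 3, 4, 5, 10])

# Quartiles and median with NumPy's default linear interpolation
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# 1.5 * IQR rule for the outlier limits
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Points outside the limits are flagged as outliers
outliers = values[(values < lower_limit) | (values > upper_limit)]

print(q1, median, q3)            # 1.5 3.0 4.5
print(lower_limit, upper_limit)  # -3.0 9.0
print(outliers)                  # [10]
```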

# Q7. Standardization vs Normalization

Read the statements below about two data transformation techniques, standardization and normalization.

- A: Normalization forces all features to come down to the same range.
- B: Standardization computes the z-score of all values, which makes the feature mean = 0.

Mark statements A and B as True or False.

### Correct Option: A : True, B : True

### Explanation:
- Statement A:

  True.
  Normalization is a technique that scales the individual features to have the same range.
  It brings the values of different features into a comparable range, often between 0 and 1.
- Statement B:

  True.
  Standardization (or z-score normalization) scales the features in such a way that they have a mean of 0 and a standard deviation of 1.
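Both statements can be checked on a small example. The sketch below applies min-max normalization and z-score standardization to one hypothetical feature; the values and variable names are illustrative only.

```python
import numpy as np

# Hypothetical feature values
x = np.array([20.0, 35.0, 50.0, 65.0, 80.0])

# Statement A: min-max normalization rescales the feature to the range [0, 1]
x_norm = (x - x.min()) / (x.max() - x.min())

# Statement B: z-scores have mean 0 (and standard deviation 1)
x_std = (x - x.mean()) / x.std()

print(x_norm.min(), x_norm.max())
print(x_std.mean(), x_std.std())
```

Because of floating-point rounding, the printed mean of `x_std` may be a very small number rather than exactly 0.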

### Correct Option: Both features will have a mean of 0 and a standard deviation of 1.

### Explanation:
When you apply standard scaling (standardization), for example with scikit-learn's `StandardScaler`, the data is transformed so that each resulting feature has a mean of 0 and a standard deviation of 1.
This is achieved by subtracting the feature's mean from each data point and dividing the result by the feature's standard deviation.

In this scenario:

- ‘Age’ will be transformed to have a mean of 0 and a standard deviation of 1.
- ‘Salary’ will be transformed to have a mean of 0 and a standard deviation of 1.

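After standardization the values are no longer in their original units: each one is a unitless z-score that says how many standard deviations the original value lies from its feature's mean. Centering every feature around 0 with a standard deviation of 1 makes the features easier to compare and analyze.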
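As a minimal sketch of this scenario, the example below standardizes two columns on very different scales. The Age and Salary values are hypothetical, and the scaling is written out in NumPy using the population standard deviation, which is the same convention `StandardScaler` uses.

```python
import numpy as np

# Hypothetical values: column 0 is Age, column 1 is Salary
X = np.array([[25.0, 30000.0],
              [32.0, 48000.0],
              [47.0, 90000.0],
              [51.0, 62000.0]])

# Standardize each column with its own mean and population standard deviation
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled.mean(axis=0))  # approximately [0, 0]
print(X_scaled.std(axis=0))   # approximately [1, 1]
```

Although the raw columns differ by several orders of magnitude, both end up on the same scale after the transformation.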