# Preprocessing

### Access the dataset here:

https://drive.google.com/file/d/1XztaPmhMMhBoEp7XuyDGS5Il_kLUlEEl/view?usp=sharing

### Notes:
* This exam consists of a **Regression** problem.  
* The **target** feature is '**cltv**'.
* **Random state** should be taken as **42** wherever applicable.

In [1]:

import pandas as pd
import numpy as np
np.random.seed(42)
df = pd.read_csv("V1.csv")


In [2]:
df.head()  # Display the first few rows of the DataFrame

| | id | gender | area | qualification | income | marital_status | vintage | claim_amount | num_policies | policy | type_of_policy | cltv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 27529 | Male | Urban | High School | 42.99 | 0 | 0 | 3849.0 | 1 | A | Platinum | 66816 |
| 1 | 27116 | Male | Rural | Bachelor | 5.33 | 1 | 6 | 3006.0 | More than 1 | A | Gold | 67164 |
| 2 | 6499 | Female | Urban | High School | 2.26 | 1 | 2 | NaN | More than 1 | A | Platinum | 68076 |
| 3 | 61863 | Male | Rural | High School | 20.29 | 1 | 8 | 2844.0 | More than 1 | A | Platinum | 63276 |
| 4 | 25045 | Female | Urban | High School | 5.63 | 0 | 6 | 6370.0 | More than 1 | A | Platinum | 245844 |


# Metadata

1. **id**-Unique identifier of a customer  
2. **gender**-Gender of the customer   
3. **area**-Area of the customer   
4. **qualification**-Highest Qualification of the customer  
5. **income**-Income earned in a year (in rupees).   
6. **marital_status**- 0:Single, 1: Married
7. **vintage**-No. of years since the first policy date.  
8. **claim_amount**-Total Amount Claimed by the customer (in rupees)
9. **num_policies**-Total no. of policies issued to the customer
10. **policy**-Active policy of the customer
11. **type_of_policy**-Type of active policy
12. **cltv**- Customer lifetime value: the total amount of money a customer is expected to spend with your business, or on your products, over the lifetime of an average business relationship. **[TARGET]**

### Q.2 [Marks: 2] How many features (excluding the target variable) are there in the dataset?
Options

A) 1000

B) 11

C) 12

D) 10

`Ans: 11 input features`

In [8]:
# Solution

df.shape[1]-1

11

### Q.3 [Marks: 2] What are the unique values of the feature `type_of_policy` in the dataset?

A) ['Bronze', 'Gold']

B) ['Gold', 'Silver']

C) ['Platinum', 'Gold', 'Silver', 'Bronze']

D) ['Platinum', 'Gold', 'Silver']


`Ans: ['Platinum', 'Gold', 'Silver'] (D)`

In [5]:
df.columns

Index(['id', 'gender', 'area', 'qualification', 'income', 'marital_status',
       'vintage', 'claim_amount', 'num_policies', 'policy', 'type_of_policy',
       'cltv'],
      dtype='object')

In [4]:
# Solution
df['type_of_policy'].unique()

array(['Platinum', 'Gold', 'Silver'], dtype=object)

### Q.4 [Marks: 3] Which of the following columns contain categorical data? [MSQ]

A) income

B) id

C) area

D) claim_amount

E) qualification

`Ans: area (C), qualification (E)`

In [None]:
# Solution
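One possible approach, sketched on a small made-up frame rather than `V1.csv` (the values below are assumptions for illustration): non-numeric columns can be picked out with `select_dtypes`. With the real data, `df` would take the place of `toy`.

```python
import pandas as pd

# Toy stand-in for the exam data; values are made up for illustration.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "area": ["Urban", "Rural", "Urban"],
    "qualification": ["High School", "Bachelor", "High School"],
    "income": [42.99, 5.33, 2.26],
    "claim_amount": [3849.0, 3006.0, None],
})

# Everything that is not a numeric dtype is treated as categorical here.
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(categorical_cols)
```

Note that a numeric column such as `id` or `marital_status` can still be categorical in meaning; a dtype check only catches string-valued columns.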

### Q.5 [Marks: 4] Plot the correlation `heatmap` and mark the pair of columns with the highest positive correlation. [MCQ]

A) claim_amount & income

B) income & cltv

C) vintage & income

D) claim_amount & cltv

`Ans: claim_amount and cltv (D)`

In [None]:
# Solution
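A minimal sketch of finding the strongest positive pair numerically, using made-up numbers instead of the exam data. The correlation matrix `corr` is what a heatmap would display (for example via `seaborn.heatmap(corr, annot=True)`); the masking step is just one way to read off the largest off-diagonal entry.

```python
import numpy as np
import pandas as pd

# Toy numeric frame (made-up values) standing in for the numeric columns of df.
toy = pd.DataFrame({
    "income": [1.0, 2.0, 3.0, 4.0, 5.0],
    "vintage": [5.0, 3.0, 4.0, 1.0, 2.0],
    "claim_amount": [10.0, 30.0, 20.0, 50.0, 40.0],
    "cltv": [12.0, 28.0, 22.0, 51.0, 39.0],
})

corr = toy.corr()  # Pearson correlation matrix

# Keep only the strictly upper triangle so each pair appears once and the
# diagonal (always 1.0) is ignored.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

best_pair = pairs.idxmax()  # (row label, column label) of the largest value
print(best_pair, round(pairs.max(), 4))
```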

### Q.6 [Marks: 2] Which of the following features have `missing` values? [MSQ]

Options:

A) gender

B) area

C) qualification

D) income

E) claim_amount

F) policy

`Ans: area (B), income (D), claim_amount (E), policy (F)`

In [None]:
# Solution
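A short sketch of counting missing values per column, again on a made-up frame (the values are assumptions, not rows from `V1.csv`):

```python
import numpy as np
import pandas as pd

# Toy frame with a few deliberately missing entries.
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Male"],
    "area": ["Urban", None, "Rural"],
    "income": [42.99, np.nan, 2.26],
    "claim_amount": [3849.0, 3006.0, np.nan],
})

missing_counts = toy.isna().sum()  # number of NaN/None entries per column
cols_with_missing = missing_counts[missing_counts > 0].index.tolist()
print(cols_with_missing)
```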

### Q.7 [Marks: 4] Split the dataset into features (`X`) and label (`y`), where the column `cltv` goes to `y` and the remaining columns go to `X`. What is the average value of the `cltv` column? [NAT]


`Ans: 97788.08`

In [None]:
# Solution
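The split itself could look like the sketch below. The frame is a tiny stand-in built from a few values in the preview above, so its mean is not the exam answer.

```python
import pandas as pd

# Tiny stand-in frame; only three of the real columns are included.
toy = pd.DataFrame({
    "id": [27529, 27116, 6499, 61863],
    "income": [42.99, 5.33, 2.26, 20.29],
    "cltv": [66816, 67164, 68076, 63276],
})

X = toy.drop(columns="cltv")  # every column except the target
y = toy["cltv"]               # the target column
print(X.shape, round(y.mean(), 2))
```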

### Q.8 [Marks: 3] Split the dataset into training and test sets in a `70:30` ratio using `train_test_split` with `random_state=42`. What is the shape of the training set (`X_train`)? [MCQ]


A) (4379, 11)

B) (4392, 13)

C) (4340, 11)

D) (4379, 15)

`Ans: (4379, 11) (A)`

In [None]:
# Solution
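A minimal sketch of the 70:30 split on synthetic data (ten made-up rows with hypothetical columns `f1` and `f2`), assuming `test_size=0.3` is how the ratio is expressed:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic features and target; the real X and y come from V1.csv.
X = pd.DataFrame({"f1": np.arange(10), "f2": np.arange(10) * 2.0})
y = pd.Series(np.arange(10) * 3.0, name="cltv")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(X_train.shape, X_test.shape)
```

When `test_size` is a fraction, scikit-learn rounds the test size up and gives the remaining rows to the training set.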

### Q.9 [Marks: 2] Drop the `id` column from the train and test data because it is not useful for model training. How many feature columns remain in the training dataset? [NAT]

`Ans: 10 features`

In [None]:
# Solution

### Q.10 [Marks: 2] Compute the median of the `income` column of `X_train`, ignoring missing values, and enter it (up to two decimal places). Replace all NaN values in the `income` column of `X_train` and `X_test` with this median computed from `X_train`. [NAT]

`Ans: 7.04`

In [None]:
# Solution
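A sketch of median imputation on made-up numbers. The point it illustrates is that the median is learned from `X_train` only and then reused for `X_test`.

```python
import numpy as np
import pandas as pd

# Made-up income values with gaps.
X_train = pd.DataFrame({"income": [2.0, np.nan, 8.0, 6.0]})
X_test = pd.DataFrame({"income": [np.nan, 5.0]})

income_median = X_train["income"].median()  # NaN values are skipped by default

# Fill both splits with the statistic computed on the training split.
X_train["income"] = X_train["income"].fillna(income_median)
X_test["income"] = X_test["income"].fillna(income_median)
print(income_median)
```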

### Q.11 [Marks: 2] Which is the most frequent value in the `policy` column of `X_train`? Replace all NaN values in the `policy` column of `X_train` and `X_test` with the most frequent value from `X_train`. [MCQ]

A) 'A'

B) 'B'

C) 'C'

D) None of the above

`Ans: A`

In [None]:
# Solution
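Most-frequent imputation can be sketched the same way, here on a made-up `policy` column. The same pattern would apply to any other categorical column.

```python
import pandas as pd

# Made-up policy labels with gaps.
X_train = pd.DataFrame({"policy": ["A", "B", "A", None, "C"]})
X_test = pd.DataFrame({"policy": [None, "B"]})

# mode() ignores missing values and returns the modes sorted; take the first.
most_frequent = X_train["policy"].mode()[0]

X_train["policy"] = X_train["policy"].fillna(most_frequent)
X_test["policy"] = X_test["policy"].fillna(most_frequent)
print(most_frequent)
```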

### Q.12 [Marks: 2] Which is the most frequent value in the `area` column of `X_train`? Replace all NaN values in the `area` column of `X_train` and `X_test` with the most frequent value from `X_train`. [MCQ]

A) 'Urban'

B) 'Rural'

C) 'Semi-Urban'

D) None of the above

`Ans: Urban (A)`

In [None]:
# Solution

### Q.13 [Marks: 2] Replace all NaN values in the `claim_amount` column of `X_train` and `X_test` with 0 (zero). After the replacement, what is the standard deviation of the `claim_amount` column in `X_train` (correct up to two decimal places)? [NAT]

`Ans: 3358.66`

In [None]:
# Solution
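A sketch on made-up numbers. One detail worth checking when comparing against an answer key: pandas `.std()` uses the sample formula (`ddof=1`) by default, while `np.std` uses the population formula (`ddof=0`). Which one the expected answer is based on is not stated here.

```python
import numpy as np
import pandas as pd

# Made-up claim amounts with one gap.
X_train = pd.DataFrame({"claim_amount": [4.0, np.nan, 8.0, 0.0]})
X_train["claim_amount"] = X_train["claim_amount"].fillna(0)

sample_std = X_train["claim_amount"].std()                    # ddof=1
population_std = np.std(X_train["claim_amount"].to_numpy())   # ddof=0
print(round(sample_std, 2), round(population_std, 2))
```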

### Q.14 [Marks: 4] Apply `MinMaxScaler` to the `income` column of `X_train`. What is the median of the scaled `income` column (correct up to 2 decimal places)? [NAT]

`Ans: 0.07`

In [None]:
# Solution
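A minimal sketch of min-max scaling a single column, on made-up income values. `MinMaxScaler` expects 2-D input, hence the double brackets.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up income values.
X_train = pd.DataFrame({"income": [2.0, 4.0, 10.0, 6.0, 3.0]})

scaler = MinMaxScaler()  # maps the column minimum to 0 and the maximum to 1
scaled = scaler.fit_transform(X_train[["income"]])

scaled_median = float(np.median(scaled))
print(round(scaled_median, 2))
```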

## Apply preprocessing to the features of X_train and X_test.

### For Categorical Features

* Apply `OneHotEncoder` from the `sklearn` library to all categorical features (object columns). Encode them in the order of the following list:

  `Categorical Features = ['gender', 'area','qualification','marital_status', 'num_policies', 'policy', 'type_of_policy']`

Let's call the transformed categorical feature matrix $X_1$.

### For Numerical Features

- Apply `MinMaxScaler` to transform the numerical features. Scale them in the order of the following list:

  `Numerical Features =  [ 'income', 'vintage', 'claim_amount' ]`


  - Let's call the transformed numerical feature matrix $X_2$.

### **Concatenate**(One Hot Encoded Features, Scaled Numerical Features)

After combining the transformed categorical feature matrix ($X_1$) and the transformed numerical feature matrix ($X_2$) side by side, in that order, the output will be $X = [X_1 \; X_2]$.

### Hints
* Apply `ColumnTransformer` to encode the categorical columns and scale the numerical columns with the required preprocessors.

* Another way is to encode the categorical columns and scale the numerical columns separately, then concatenate both horizontally (`hstack`). Keep the categorical columns in front of the numerical ones when concatenating.


* The transformed (as described by the steps above) X_train and X_test should be considered as X_train and X_test henceforth.


In [None]:
# Solution
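The first hint can be sketched as follows. The frames are made up and use only two categorical and two numerical columns; with the real data, `cat_cols` and `num_cols` would be the full lists given above. `handle_unknown="ignore"` and `sparse_threshold=0` are choices made for this sketch, not requirements from the exam text.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

cat_cols = ["gender", "area"]     # shortened list for the toy example
num_cols = ["income", "vintage"]  # shortened list for the toy example

X_train = pd.DataFrame({
    "gender": ["Male", "Female", "Male", "Female"],
    "area": ["Urban", "Rural", "Urban", "Urban"],
    "income": [2.0, 4.0, 10.0, 6.0],
    "vintage": [0, 6, 2, 8],
})
X_test = pd.DataFrame({
    "gender": ["Female", "Male"],
    "area": ["Rural", "Urban"],
    "income": [5.0, 3.0],
    "vintage": [4, 1],
})

# Transformers are applied in list order, so the one-hot block (X_1)
# comes before the scaled numerical block (X_2) in the output.
preprocessor = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
        ("num", MinMaxScaler(), num_cols),
    ],
    sparse_threshold=0,  # always return a dense NumPy array
)

X_train_t = preprocessor.fit_transform(X_train)  # fit on training data only
X_test_t = preprocessor.transform(X_test)        # reuse the fitted transformers
print(X_train_t.shape, X_test_t.shape)
```

The number of output columns is the total number of distinct categories across the encoded columns plus the number of numerical columns.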

### Q.15 [Marks: 10] How many features do you get after preprocessing? [MCQ]

[Options]

A) 13

B) 20

C) 25

D) 01

`Ans: 20 features (B)`

In [None]:
# Solution

# Model Building

### Q.16 [Marks: 5] Apply the `SequentialFeatureSelector` transformer with `direction='forward'` and a `LinearRegression()` estimator, and select 5 features by fitting it to X_train and y_train.

  `Use cv = KFold(n_splits=5,random_state=42,shuffle=True) in SequentialFeatureSelector.`

### Which of the following options gives the correct integer indices of the selected features?


A) [ 6  9 12 13 19]

B) [ 3  6  9 13 19]

C) [ 8  9 12 14 19]

D) [ 1  2  9 13 19]

E) [ 3  7 10 13 19]


`Ans: [6 9 12 13 19](A)`

In [None]:
# Solution
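A sketch of forward selection on synthetic data in which only two of six columns carry signal, so the example selects 2 features rather than the 5 asked for above. The data and the target formula are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(42)
X_train = rng.normal(size=(100, 6))
# Only columns 1 and 4 drive the synthetic target.
y_train = 3.0 * X_train[:, 1] - 2.0 * X_train[:, 4] + rng.normal(scale=0.1, size=100)

cv = KFold(n_splits=5, shuffle=True, random_state=42)
sfs = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=2, direction="forward", cv=cv
)
sfs.fit(X_train, y_train)

# Integer positions of the columns that were kept.
selected_idx = sfs.get_support(indices=True)
print(selected_idx)
```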

### Q.17 [Marks: 3] Fit `LinearRegression` on the training set (`X_train` and `y_train`). What is the `R2 score` on the test set (`X_test` and `y_test`), up to 4 digits after the decimal point? [NAT]


`Ans: 0.0821`

In [None]:
# Solution
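A minimal sketch on synthetic data (the coefficients and noise level are made up). It also shows that `model.score` and `r2_score` report the same quantity for a regressor.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
test_r2 = r2_score(y_test, model.predict(X_test))
print(round(test_r2, 4))
```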

### Q.18 [Marks: 6] Using the `LinearRegression` model, compute the `cross-validation scores` for `5 splits` on the training data (X_train and y_train) using `cross_val_score`. Enter the maximum `R2 score` obtained (up to 4 digits after the decimal point). [NAT]

`Use cv = KFold(n_splits=5,random_state=42,shuffle=True) in cross_val_score.`

(**Hint**: By default, `cross_val_score` uses the estimator's own scoring metric, which for `LinearRegression` is the R2 score.)

`Ans: 0.1816 (Range: 0.1790-0.1845)`

In [None]:
# Solution
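A sketch of the cross-validation step on synthetic data (made-up coefficients and noise), using the `KFold` settings given above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(42)
X_train = rng.normal(size=(100, 3))
y_train = X_train @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=100)

cv = KFold(n_splits=5, shuffle=True, random_state=42)
# No scoring argument: the estimator's own score method (R2) is used.
scores = cross_val_score(LinearRegression(), X_train, y_train, cv=cv)
print(scores.round(4), round(scores.max(), 4))
```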

### Q.19 [Marks: 5] Apply `Ridge` regression with **random_state=42** and the default penalty value on the training set (`X_train and y_train`) and calculate the R2 score on the test set (`X_test and y_test`). What is the score (up to 4 digits after the decimal point)? [NAT]

`Ans: 0.0890`

In [None]:
# Solution

### Q.20 [Marks: 6] Apply `Lasso` regression with **random_state=42** and **regularization strength (`alpha`) = 0.1** on the training data (`X_train & y_train`). Enter the value of the intercept, correct up to 2 digits after the decimal point. [NAT]

`Ans: 103168.82`

In [None]:
# Solution
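A sketch of reading the intercept from a fitted `Lasso`, on synthetic data (the target formula is an assumption for illustration). In scikit-learn the regularization strength is the `alpha` parameter, and `random_state` only has an effect when `selection='random'`.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(42)
X_train = rng.uniform(size=(200, 3))
# The middle column has no effect on the synthetic target.
y_train = 10.0 + X_train @ np.array([4.0, 0.0, -3.0]) + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1, random_state=42)
lasso.fit(X_train, y_train)

print(round(float(lasso.intercept_), 2), lasso.coef_.round(2))
```

Because the L1 penalty shrinks the coefficients, the intercept absorbs part of the signal and will generally differ from the intercept of an unpenalized fit.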

### Q.21 [Marks: 5] Fit an `SGDRegressor(random_state=42)` estimator on the training data (`X_train & y_train`) and predict labels for the test data (`X_test`); call the predictions `y_test_predict`. All other parameters keep their default values. Calculate the `mean_absolute_error` between `y_test` and `y_test_predict` (correct up to two decimal places). [NAT]

`Ans: 53085.49`

In [None]:
# Solution

### Q.22 [Marks: 6] Use `SGDRegressor(random_state=42)` as the estimator and train it for exactly 10 iterations. What is the R2 score on the test data (correct up to 4 decimal places)? [NAT]

`Ans: 0.1359`

In [None]:
# Solution
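One way to read "exactly 10 iterations" is `max_iter=10` with the stopping tolerance disabled (`tol=None`), so that all 10 epochs run. That reading is an assumption; the expected answer may have been produced with `max_iter=10` alone. The sketch below uses synthetic data.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_absolute_error, r2_score

rng = np.random.default_rng(42)
w = np.array([2.0, -1.0, 0.5])  # made-up true coefficients
X_train = rng.uniform(size=(300, 3))
y_train = X_train @ w + rng.normal(scale=0.05, size=300)
X_test = rng.uniform(size=(100, 3))
y_test = X_test @ w + rng.normal(scale=0.05, size=100)

# tol=None switches off the convergence check, so training runs all 10 epochs.
sgd = SGDRegressor(max_iter=10, tol=None, random_state=42)
sgd.fit(X_train, y_train)

y_test_predict = sgd.predict(X_test)
test_r2 = r2_score(y_test, y_test_predict)
test_mae = mean_absolute_error(y_test, y_test_predict)
print(sgd.n_iter_, round(test_r2, 4), round(test_mae, 2))
```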

# (Common Instructions for Questions 23 and 24)

### Create a pipeline using `PolynomialFeatures` as the transformer and `Lasso` as the estimator. Use `GridSearchCV` with this pipeline and the following hyperparameter values, and fit it on the training data (X_train, y_train).
```
1. Keep polynomial degree as : [1, 2]
2. alpha value to be taken as : np.logspace(-3, 0, num=5)
3. scoring : neg_mean_absolute_error .
```
(**Note**: Kindly ignore the warning.)

In [None]:
# Solution
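A sketch of the pipeline and grid search on synthetic data. The step names `poly` and `lasso` are arbitrary labels chosen here, and the cross-validation setting is left at the `GridSearchCV` default because the instructions do not specify one.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)
X_train = rng.uniform(size=(120, 3))
y_train = X_train @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=120)

pipe = Pipeline([
    ("poly", PolynomialFeatures()),
    ("lasso", Lasso()),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
param_grid = {
    "poly__degree": [1, 2],
    "lasso__alpha": np.logspace(-3, 0, num=5),
}

search = GridSearchCV(pipe, param_grid, scoring="neg_mean_absolute_error")
search.fit(X_train, y_train)
print(search.best_params_)
```

The best `alpha` and degree are then available as `search.best_params_["lasso__alpha"]` and `search.best_params_["poly__degree"]`.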

### Q.23 [Marks: 6] Mark the best `alpha` value obtained using the above instructions. [MCQ]

A) 0.001

B) 0.00562341

C) 0.03162278

D) 0.17782794

E) 1.00



`Ans: 1.00 (E)`

In [None]:
# Solution

### Q.24 [Marks: 6] Enter the best polynomial degree obtained using the above instructions. [NAT]




`Ans: 1`

In [None]:
# Solution

# (Common Instructions for Questions 25 and 26)
### Reduce the number of dimensions of the training data with PCA. Fit the PCA model on the training data using the following parameter values.
```
n_components=5
svd_solver='full'
whiten=True
random_state=42
```

In [None]:
# Solution
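A sketch of fitting PCA with the parameters listed above, on synthetic data with eight made-up columns:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X_train = rng.normal(size=(100, 8))  # synthetic stand-in for the preprocessed X_train

pca = PCA(n_components=5, svd_solver="full", whiten=True, random_state=42)
X_train_pca = pca.fit_transform(X_train)

# Fraction of the total variance retained by the 5 kept components.
explained = pca.explained_variance_ratio_.sum()
print(X_train_pca.shape, round(explained, 4))
```

With `whiten=True`, each retained component of the transformed training data is rescaled to unit variance.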

### Q.25 [Marks: 5] What is the sum of `explained_variance_ratio_`? [NAT]

`Ans: 0.6591`

In [None]:
# Solution

### Q.26 [Marks: 6] Fit a `RidgeCV` estimator with `alphas=[0.001, 0.01, 0.1, 1]` on the PCA-transformed training data from the previous question and y_train. What is the R2 score of the model on the PCA-transformed X_test (up to 4 decimal places)? [NAT]

`Ans: 0.0759`

In [None]:
# Solution
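A sketch of chaining the PCA projection with `RidgeCV`, on synthetic data (the weights `w` and noise level are made up). The PCA is fitted on the training split only and then applied unchanged to the test split.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(42)
X_train = rng.normal(size=(150, 8))
X_test = rng.normal(size=(50, 8))
w = rng.normal(size=8)  # made-up true coefficients
y_train = X_train @ w + rng.normal(scale=0.1, size=150)
y_test = X_test @ w + rng.normal(scale=0.1, size=50)

pca = PCA(n_components=5, svd_solver="full", whiten=True, random_state=42)
X_train_pca = pca.fit_transform(X_train)  # fit on training data only
X_test_pca = pca.transform(X_test)        # reuse the fitted projection

ridge_cv = RidgeCV(alphas=[0.001, 0.01, 0.1, 1])
ridge_cv.fit(X_train_pca, y_train)

test_r2 = ridge_cv.score(X_test_pca, y_test)  # R2 on the transformed test set
print(ridge_cv.alpha_, round(test_r2, 4))
```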