# Practice assignment: Handling categorical data

In this programming assignment, you are going to work with the following dataset from Kaggle:

https://www.kaggle.com/arashnic/hr-analytics-job-change-of-data-scientists

This assignment is graded by your `submission.json`.

The cell below creates a valid `submission.json` file; fill in your answers there.

You can press "Submit Assignment" at any time to submit partial progress.

## Dataset description

Suppose that a company working with Big Data and Data Science wants to hire data scientists among people who have successfully passed some courses conducted by the company. Many people sign up for their training. The company wants to focus on the candidates who really want to work for them after training. Information related to demographics, education, and experience is provided by the candidates via a sign-up form.

This dataset is designed to help understand the factors that lead a person to leave their current job, which is useful for HR research. Based on the provided data, you are going to predict whether a candidate is looking for a job change.

The data is split into train and test parts. It contains several categorical features, which need to be encoded.

## Feature description

- `enrollee_id`: Unique ID for a candidate
- `city`: City code
- `city_development_index`: Development index of the city (scaled)
- `gender`: Gender of a candidate
- `relevent_experience`: Relevant experience of a candidate
- `enrolled_university`: Type of university course enrolled in, if any
- `education_level`: Education level of a candidate
- `major_discipline`: Education major discipline of a candidate
- `experience`: Candidate total experience in years
- `company_size`: Number of employees at the current employer
- `company_type`: Type of the current employer
- `last_new_job`: Difference in years between previous and current jobs
- `training_hours`: Training hours completed
- `target`: 0 – Not looking for job change, 1 – Looking for a job change


In [2]:
import numpy as np
import pandas as pd

In [3]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

In [4]:
train.head(5)

Unnamed: 0,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,28120,city_16,0.91,,Has relevent experience,no_enrollment,High School,,19,100-500,Pvt Ltd,2,20,0.0
1,31820,city_21,0.624,,No relevent experience,no_enrollment,Masters,STEM,2,50-99,Early Stage Startup,1,10,1.0
2,4277,city_71,0.884,Male,Has relevent experience,Part time course,Graduate,STEM,>20,,,2,6,1.0
3,3379,city_159,0.843,Male,Has relevent experience,no_enrollment,Masters,STEM,>20,<10,Pvt Ltd,>4,56,0.0
4,10821,city_16,0.91,Male,Has relevent experience,no_enrollment,Masters,STEM,6,,,4,15,0.0


In [5]:
test.head(5)

Unnamed: 0,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,32403,city_41,0.827,Male,Has relevent experience,Full time course,Graduate,STEM,9,<10,,1,21,1.0
1,9858,city_103,0.92,Female,Has relevent experience,no_enrollment,Graduate,STEM,5,,Pvt Ltd,1,98,0.0
2,31806,city_21,0.624,Male,No relevent experience,no_enrollment,High School,,<1,,Pvt Ltd,never,15,1.0
3,27385,city_13,0.827,Male,Has relevent experience,no_enrollment,Masters,STEM,11,10/49,Pvt Ltd,1,39,0.0
4,27724,city_103,0.92,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,10000+,Pvt Ltd,>4,72,1.0


In [6]:
train.columns[train.dtypes == 'object']

Index(['city', 'gender', 'relevent_experience', 'enrolled_university',
       'education_level', 'major_discipline', 'experience', 'company_size',
       'company_type', 'last_new_job'],
      dtype='object')

In [7]:
train.dtypes

enrollee_id                 int64
city                       object
city_development_index    float64
gender                     object
relevent_experience        object
enrolled_university        object
education_level            object
major_discipline           object
experience                 object
company_size               object
company_type               object
last_new_job               object
training_hours              int64
target                    float64
dtype: object

In [8]:
len(train.columns[train.dtypes == 'object'])

10

## 1

Before modifying the features, let's study our dataframe a bit. First, call `.dtypes` on the `train` dataframe.

**q1:** How many columns in `train` contain data types different from numbers (ints and floats)?

In [1]:
# # your code here
# len(train.columns[train.dtypes == 'object'])

## 2

**q2:** How many unique categories does the feature `'city'` contain in the train data?

In [2]:
# # your code here
# len(train.city.unique())

## 3

**q3:** How many features in train data contain missing values (NaN)?

In [3]:
# # your code here
# len(train.columns[train.isnull().sum() != 0])

## 4

Some features have a relatively small number of missing values. For instance, `'experience'` has only 65 missing values in the train data. Replacing these missing values with a special category might introduce bias to the data. However, since the number of missing values is not so big, we might be OK with it.

Replace missing values in `'experience'` feature with a category -1.

**q4:** How many categories does this feature contain in the train data now?

_(hint: you might want to use `fillna` function from `pandas` library in this task, but remember that it's not an in-place function by default. Or you can use `SimpleImputer` from `sklearn`)_
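The hint about `fillna` not being in-place can be illustrated on a toy column. This is a minimal sketch: the `toy` dataframe and its values are hypothetical and are not taken from `train`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the 'experience' column (hypothetical values).
toy = pd.DataFrame({"experience": ["5", ">20", np.nan, "<1", np.nan]})

# fillna returns a new Series; the dataframe is unchanged until we assign back.
filled = toy["experience"].fillna(-1)
n_missing_before = int(toy["experience"].isna().sum())  # still 2 at this point

# Assign the result back to make the imputation stick.
toy["experience"] = filled
n_categories = toy["experience"].nunique()  # '5', '>20', '<1' and -1
```

After the assignment, the missing values form their own category, so the number of distinct categories grows by one compared to the non-missing categories alone.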

In [12]:
# your code here
train["experience"] = train["experience"].fillna(-1)

## 5

`'education_level'` is an example of an ordinal feature, since we can define an order on it. For instance, `'High School'` is "bigger" than `'Primary School'`, and `'Phd'` is "bigger" than `'Graduate'`. We can encode this feature with integers that respect this order.

In this task, apply a correct mapping for the values of `'education_level'` feature. The mapping should be the following:

* `'Primary School'` -> 0
* `'High School'` -> 1
* `'Graduate'` -> 2
* `'Masters'` -> 3
* `'Phd'` -> 4

At the same time, impute missing values in `'education_level'` with a new category -1. So another part of the mapping would be:

* `np.nan` -> -1

**q5:** What will be the mean value of this feature in the train data after the encoding? Provide the answer, rounded to the nearest TWO decimal places (e.g. 12.3456789 -> 12.35).

_(hint: you might want to use `map` function from `pandas` in this task)_
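One possible way to combine the ordinal mapping with the `np.nan` -> -1 rule is to chain `map` and `fillna`. The sketch below runs on a hypothetical toy Series; `map_education` is just an illustrative name.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the 'education_level' column (hypothetical values).
toy = pd.Series(["Graduate", "Phd", np.nan, "Primary School", "Masters"])

map_education = {
    "Primary School": 0,
    "High School": 1,
    "Graduate": 2,
    "Masters": 3,
    "Phd": 4,
}

# map leaves NaN as NaN, so fillna(-1) afterwards creates the new category.
encoded = toy.map(map_education).fillna(-1).astype(int)
mean_value = encoded.mean()  # (2 + 4 - 1 + 0 + 3) / 5 = 1.6
```

Note that with this approach any value missing from the dictionary would also end up as -1, so it is worth checking the set of unique values first.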

In [13]:
# Ordinal mapping for 'education_level'; missing values are imputed in the next cell.
size_mapping = {
    'Primary School': 0,
    'High School': 1,
    'Graduate': 2,
    'Masters': 3,
    'Phd': 4,
}

train['education_level'] = train['education_level'].map(size_mapping)
train

Unnamed: 0,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,28120,city_16,0.910,,Has relevent experience,no_enrollment,1.0,,19,100-500,Pvt Ltd,2,20,0.0
1,31820,city_21,0.624,,No relevent experience,no_enrollment,3.0,STEM,2,50-99,Early Stage Startup,1,10,1.0
2,4277,city_71,0.884,Male,Has relevent experience,Part time course,2.0,STEM,>20,,,2,6,1.0
3,3379,city_159,0.843,Male,Has relevent experience,no_enrollment,3.0,STEM,>20,<10,Pvt Ltd,>4,56,0.0
4,10821,city_16,0.910,Male,Has relevent experience,no_enrollment,3.0,STEM,6,,,4,15,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19153,8241,city_16,0.910,,Has relevent experience,no_enrollment,,,11,,,1,4,0.0
19154,18441,city_103,0.920,,Has relevent experience,no_enrollment,2.0,STEM,>20,,,>4,56,1.0
19155,29117,city_11,0.550,,No relevent experience,Full time course,2.0,STEM,<1,,,never,12,0.0
19156,27929,city_103,0.920,Male,No relevent experience,no_enrollment,2.0,STEM,15,10000+,Pvt Ltd,2,16,1.0


In [14]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='constant',fill_value=-1, missing_values=np.nan)
imputer = imputer.fit(train[['education_level']])
train[['education_level']] = imputer.transform(train[['education_level']])
len(train.experience.unique())

23

In [15]:
# size_mapping = {
#     np.nan : -1
# }
# train.education_level[train.education_level.isnull()] = train.education_level[train.education_level.isnull()].map(size_mapping)
train.education_level

0        1.0
1        3.0
2        2.0
3        3.0
4        3.0
        ... 
19153   -1.0
19154    2.0
19155    2.0
19156    2.0
19157    2.0
Name: education_level, Length: 19158, dtype: float64

In [4]:
# train.education_level.mean()

## 6

Feature `'relevent_experience'` is an example of a binary feature. You can also check that it has no missing values, which makes its encoding pretty easy.

Encode this feature with the following mapping:

*   `"No relevent experience"` -> 0
*   `"Has relevent experience"` -> 1

**q6:** What will be the mean value of this feature in the train data after the encoding? Provide the answer, rounded to the nearest TWO decimal places (e.g. 12.3456789 -> 12.35).

In [17]:
train['relevent_experience']

0        Has relevent experience
1         No relevent experience
2        Has relevent experience
3        Has relevent experience
4        Has relevent experience
                  ...           
19153    Has relevent experience
19154    Has relevent experience
19155     No relevent experience
19156     No relevent experience
19157    Has relevent experience
Name: relevent_experience, Length: 19158, dtype: object

In [18]:
map_relevent_experience = {
    "No relevent experience" : 0,
    "Has relevent experience" : 1
}
# your code here
train['relevent_experience'] = train['relevent_experience'].map(map_relevent_experience)

In [5]:
# train['relevent_experience'].mean()

## 7

In our case, `'gender'` is an example of a nominal feature (notice that it is not a binary feature because it contains three categories plus NaNs). We will use One-Hot encoding to encode it. There are several ways to do this in practice: for example, the `get_dummies` function from `pandas` or `OneHotEncoder` from `sklearn.preprocessing`. Here, we will go with the first option.

If you want to encode a whole dataset with One-Hot encoding, you can pass it directly into either of these methods. But here we want to encode only one feature. We suggest doing it in four steps:

1. Obtain a One-Hot encoding dataframe for the feature `'gender'`: apply `pd.get_dummies` to this feature. As a result, you should obtain a dataframe with three features (three different categories for gender). Don't add a separate `'NaN'` feature: missing values are already represented by the encoding (0, 0, 0).
2. Change the column names of this dataframe, so that a feature `'<category_name>'` becomes `'gender-<category_name>'`. This step is not strictly necessary, but it is more convenient to work with the full dataframe when the names show that these One-Hot encoded features originally came from the feature `'gender'`. On the technical side, you can do this by changing the `df_gender.columns` list.
3. Concatenate the original and new dataframes by calling the `pd.concat` function. Don't forget to set the `axis` parameter, so that you concatenate dataframes by columns, not by rows. As a result of this step, you should obtain a new `train` dataframe with three new columns named like `'gender-<category_name>'`.
4. Finally, drop the `'gender'` feature because we have already encoded it.

As a result of these four steps, you should obtain a new version of the `train` dataframe, with the `'gender'` feature dropped and three new features named like `'gender-<category_name>'`. The total number of columns at this point should be 16.

**q7:** What is the total number of zero values which appear in the columns starting with `'gender-'`?

_Example: suppose that you obtain the following dataframe:_

| gender-category1 | gender-category2   | gender-category3|
|------|------|------|
|   0  | 1| 0|
|0|0|0|

_Then the answer for the question above should be 5._
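The four steps and the zero count can be sketched on a toy dataframe as follows. The data is hypothetical, and the `prefix`/`prefix_sep` arguments of `pd.get_dummies` are used here as a shortcut for the renaming in step 2.

```python
import numpy as np
import pandas as pd

# Toy data with one missing gender (hypothetical values).
toy = pd.DataFrame({"id": [1, 2, 3, 4],
                    "gender": ["Male", np.nan, "Female", "Other"]})

# Steps 1-2: one-hot encode; NaN rows become all zeros, columns get a 'gender-' prefix.
df_gender = pd.get_dummies(toy["gender"], prefix="gender", prefix_sep="-", dtype=int)

# Steps 3-4: concatenate by columns, then drop the original feature.
toy = pd.concat([toy, df_gender], axis=1).drop(columns="gender")

# Count zeros in all columns that start with 'gender-'.
gender_cols = [c for c in toy.columns if c.startswith("gender-")]
n_zeros = int((toy[gender_cols] == 0).sum().sum())  # 9 zeros in this toy example
```

Each non-missing row contributes two zeros and the missing row contributes three, which gives 2 + 3 + 2 + 2 = 9 here.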



In [20]:
pd.get_dummies(train['gender'])

Unnamed: 0,Female,Male,Other
0,0,0,0
1,0,0,0
2,0,1,0
3,0,1,0
4,0,1,0
...,...,...,...
19153,0,0,0
19154,0,0,0
19155,0,0,0
19156,0,1,0


In [21]:
# your code here
gender1 = pd.get_dummies(train['gender'])
gender1=gender1.rename(columns={"Female": "gender-category1", "Male": "gender-category2", 'Other': 'gender-category3'})

In [22]:
# Count zero entries across the three one-hot gender columns.
(gender1 == 0).sum().sum()

42824

In [23]:
train = pd.concat([gender1,train], axis=1)

In [24]:
train = train.drop(columns='gender')

In [25]:
train[train['gender-category1']== 0]

Unnamed: 0,gender-category1,gender-category2,gender-category3,enrollee_id,city,city_development_index,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,0,0,0,28120,city_16,0.910,1,no_enrollment,1.0,,19,100-500,Pvt Ltd,2,20,0.0
1,0,0,0,31820,city_21,0.624,0,no_enrollment,3.0,STEM,2,50-99,Early Stage Startup,1,10,1.0
2,0,1,0,4277,city_71,0.884,1,Part time course,2.0,STEM,>20,,,2,6,1.0
3,0,1,0,3379,city_159,0.843,1,no_enrollment,3.0,STEM,>20,<10,Pvt Ltd,>4,56,0.0
4,0,1,0,10821,city_16,0.910,1,no_enrollment,3.0,STEM,6,,,4,15,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19152,0,0,0,7670,city_103,0.920,0,Full time course,2.0,STEM,9,,,1,28,1.0
19153,0,0,0,8241,city_16,0.910,1,no_enrollment,-1.0,,11,,,1,4,0.0
19154,0,0,0,18441,city_103,0.920,1,no_enrollment,2.0,STEM,>20,,,>4,56,1.0
19155,0,0,0,29117,city_11,0.550,0,Full time course,2.0,STEM,<1,,,never,12,0.0


In [6]:
# 4508 * 3

## 8

Perform One-Hot encoding for the feature `'enrolled_university'`, using a procedure similar to the one in the previous task:

1. Obtain a One-Hot encoding dataframe for `'enrolled_university'`.
2. Rename its columns.
3. Concatenate original and One-Hot encoding dataframes.
4. Drop `'enrolled_university'` column.

The total number of columns at this point should be 18.

**q8:** What is the total number of zero values which appear in the columns starting with `'enrolled_university-'`?

In [27]:
# your code here
enrolled_university1 = pd.get_dummies(train['enrolled_university'])
enrolled_university1= enrolled_university1.rename(columns={"Full time course": "enrolled_university-1", "Part time course": "enrolled_university-2", 'no_enrollment': 'enrolled_university-3'})
enrolled_university1

Unnamed: 0,enrolled_university-1,enrolled_university-2,enrolled_university-3
0,0,0,1
1,0,0,1
2,0,1,0
3,0,0,1
4,0,0,1
...,...,...,...
19153,0,0,1
19154,0,0,1
19155,1,0,0
19156,0,0,1


In [28]:
train = pd.concat([enrolled_university1,train], axis=1)

In [29]:
train[train['enrolled_university-3']==0]

Unnamed: 0,enrolled_university-1,enrolled_university-2,enrolled_university-3,gender-category1,gender-category2,gender-category3,enrollee_id,city,city_development_index,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
2,0,1,0,0,1,0,4277,city_71,0.884,1,Part time course,2.0,STEM,>20,,,2,6,1.0
7,1,0,0,0,1,0,26537,city_11,0.550,1,Full time course,2.0,STEM,16,<10,Early Stage Startup,2,30,1.0
21,1,0,0,0,1,0,13109,city_103,0.920,1,Full time course,2.0,STEM,4,10/49,Pvt Ltd,1,40,1.0
22,1,0,0,0,1,0,31939,city_45,0.890,1,Full time course,2.0,STEM,2,50-99,Pvt Ltd,1,86,0.0
24,1,0,0,0,1,0,15008,city_61,0.913,0,Full time course,2.0,Other,1,,,1,112,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19146,0,1,0,0,1,0,20440,city_138,0.836,0,Part time course,1.0,,>20,50-99,Pvt Ltd,2,108,0.0
19150,1,0,0,0,0,0,7125,city_114,0.926,0,Full time course,1.0,,5,<10,Public Sector,never,109,0.0
19151,1,0,0,0,1,0,3242,city_165,0.903,0,Full time course,1.0,,2,<10,Pvt Ltd,1,10,0.0
19152,1,0,0,0,0,0,7670,city_103,0.920,0,Full time course,2.0,STEM,9,,,1,28,1.0


In [7]:
# 15401 + 17960 + 5341

In [31]:
train = train.drop(columns='enrolled_university')

## 9

Encode the feature `'city'` using frequency encoding. Map each category `'city_i'` to its count (the total number of observations in `train` with `city == 'city_i'`). Save this mapping, since you will later apply the same mapping to the test data.

As a result of this task, feature `'city'` should be encoded with category counts in `train`. 

**q9:** What will be the mean value of this feature in the train data after the encoding? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).
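A minimal sketch of frequency encoding on hypothetical toy data is shown below. It also shows what happens when the saved mapping meets a category that never occurred in the train part.

```python
import pandas as pd

# Toy train/test parts (hypothetical city codes).
toy_train = pd.DataFrame({"city": ["city_1", "city_2", "city_1", "city_3", "city_1"]})
toy_test = pd.DataFrame({"city": ["city_2", "city_1", "city_9"]})

# The mapping is computed on train only: category -> number of train rows.
map_city = toy_train["city"].value_counts()

toy_train["city"] = toy_train["city"].map(map_city)  # [3, 1, 3, 1, 3]
# The same mapping is reused for test; 'city_9' is unseen in train and becomes NaN.
toy_test["city"] = toy_test["city"].map(map_city)
```

Unseen categories end up as missing values, so they have to be handled by a later imputation step or by an explicit `fillna`.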


In [32]:
map_city = train.groupby('city').size()
# your code here
train['city'] = train['city'].map(map_city)
train.head(10)

Unnamed: 0,enrolled_university-1,enrolled_university-2,enrolled_university-3,gender-category1,gender-category2,gender-category3,enrollee_id,city,city_development_index,relevent_experience,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,0,0,1,0,0,0,28120,1533,0.91,1,1.0,,19,100-500,Pvt Ltd,2,20,0.0
1,0,0,1,0,0,0,31820,2702,0.624,0,3.0,STEM,2,50-99,Early Stage Startup,1,10,1.0
2,0,1,0,0,1,0,4277,266,0.884,1,2.0,STEM,>20,,,2,6,1.0
3,0,0,1,0,1,0,3379,94,0.843,1,3.0,STEM,>20,<10,Pvt Ltd,>4,56,0.0
4,0,0,1,0,1,0,10821,1533,0.91,1,3.0,STEM,6,,,4,15,0.0
5,0,0,1,0,1,0,27591,1533,0.91,0,2.0,STEM,10,500-999,Pvt Ltd,3,8,0.0
6,0,0,1,0,1,0,1305,4355,0.92,1,2.0,Arts,9,<10,NGO,1,37,0.0
7,1,0,0,0,1,0,26537,247,0.55,1,2.0,STEM,16,<10,Early Stage Startup,2,30,1.0
8,0,0,1,0,1,0,16448,2702,0.624,1,2.0,STEM,4,1000-4999,Pvt Ltd,1,8,1.0
9,0,0,1,0,1,0,23079,1533,0.91,0,3.0,STEM,10,,,>4,67,0.0


In [8]:
# train.city.mean()

## 10

Encode the feature `'last_new_job'` with plain target encoding (no modifications). First, impute missing values in this feature with a new category `'-1'`. Then, map each category of `'last_new_job'` to the mean target value of the observations with that category. Save this mapping, since you will later apply the same mapping to the test data.

**q10:** What will be the maximum value of this feature in the train data after the encoding? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).

_(hint: you might want to use `groupby` function from `pandas` in this task)_
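As a rough illustration of the `groupby` hint, plain target encoding can be sketched like this on a hypothetical toy dataframe (`map_last_new_job` is an illustrative name):

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values).
toy = pd.DataFrame({
    "last_new_job": ["1", "never", np.nan, "1", ">4", np.nan],
    "target": [1, 0, 1, 0, 0, 0],
})

# Missing values become their own category '-1' before encoding.
toy["last_new_job"] = toy["last_new_job"].fillna("-1")

# Category -> mean target of the rows in that category; keep it for the test data.
map_last_new_job = toy.groupby("last_new_job")["target"].mean()
toy["last_new_job"] = toy["last_new_job"].map(map_last_new_job)
```

In this toy example the categories `'1'` and `'-1'` are both encoded as 0.5, while `'never'` and `'>4'` are encoded as 0.0.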

In [34]:
df = train.copy()

In [35]:
# Impute missing values in 'last_new_job' with the new category '-1'.
df['last_new_job'] = df['last_new_job'].fillna('-1')
df

Unnamed: 0,enrolled_university-1,enrolled_university-2,enrolled_university-3,gender-category1,gender-category2,gender-category3,enrollee_id,city,city_development_index,relevent_experience,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,0,0,1,0,0,0,28120,1533,0.910,1,1.0,,19,100-500,Pvt Ltd,2,20,0.0
1,0,0,1,0,0,0,31820,2702,0.624,0,3.0,STEM,2,50-99,Early Stage Startup,1,10,1.0
2,0,1,0,0,1,0,4277,266,0.884,1,2.0,STEM,>20,,,2,6,1.0
3,0,0,1,0,1,0,3379,94,0.843,1,3.0,STEM,>20,<10,Pvt Ltd,>4,56,0.0
4,0,0,1,0,1,0,10821,1533,0.910,1,3.0,STEM,6,,,4,15,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19153,0,0,1,0,0,0,8241,1533,0.910,1,-1.0,,11,,,1,4,0.0
19154,0,0,1,0,0,0,18441,4355,0.920,1,2.0,STEM,>20,,,>4,56,1.0
19155,1,0,0,0,0,0,29117,247,0.550,0,2.0,STEM,<1,,,never,12,0.0
19156,0,0,1,0,1,0,27929,4355,0.920,0,2.0,STEM,15,10000+,Pvt Ltd,2,16,1.0


In [36]:
Mean_encoded_subject = df.groupby(['last_new_job'])['target'].mean().to_dict()
  
df['last_new_job'] =  df['last_new_job'].map(Mean_encoded_subject)
  
print(df['last_new_job'])

0        0.241379
1        0.264303
2        0.241379
3        0.182371
4        0.221574
           ...   
19153    0.264303
19154    0.182371
19155    0.301387
19156    0.241379
19157    0.241379
Name: last_new_job, Length: 19158, dtype: float64


In [9]:
# df.last_new_job.max()

In [38]:
train = df

In [39]:
train

Unnamed: 0,enrolled_university-1,enrolled_university-2,enrolled_university-3,gender-category1,gender-category2,gender-category3,enrollee_id,city,city_development_index,relevent_experience,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,0,0,1,0,0,0,28120,1533,0.910,1,1.0,,19,100-500,Pvt Ltd,0.241379,20,0.0
1,0,0,1,0,0,0,31820,2702,0.624,0,3.0,STEM,2,50-99,Early Stage Startup,0.264303,10,1.0
2,0,1,0,0,1,0,4277,266,0.884,1,2.0,STEM,>20,,,0.241379,6,1.0
3,0,0,1,0,1,0,3379,94,0.843,1,3.0,STEM,>20,<10,Pvt Ltd,0.182371,56,0.0
4,0,0,1,0,1,0,10821,1533,0.910,1,3.0,STEM,6,,,0.221574,15,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19153,0,0,1,0,0,0,8241,1533,0.910,1,-1.0,,11,,,0.264303,4,0.0
19154,0,0,1,0,0,0,18441,4355,0.920,1,2.0,STEM,>20,,,0.182371,56,1.0
19155,1,0,0,0,0,0,29117,247,0.550,0,2.0,STEM,<1,,,0.301387,12,0.0
19156,0,0,1,0,1,0,27929,4355,0.920,0,2.0,STEM,15,10000+,Pvt Ltd,0.241379,16,1.0


## 11

Encode feature `'experience'` with M-estimate encoding. Map each category of `'experience'` according to the following formula:

$$
\hat{x_{ij}} = \frac{\text{target}\left(j, x_{ij}\right) + m\times y_{\text{mean}}}{\text{count}\left(j, x_{ij}\right) + m}\quad,
$$

where

* $x_{ij}$ is a category being encoded,
* $\hat{x_{ij}}$ is its corresponding M-estimate encoding value,
* $\text{count}\left(j, x_{ij}\right)$ is the total number of times $x_{ij}$ appears in `train`,
* $\text{target}\left(j, x_{ij}\right)$ is the sum of the target values over the observations with the corresponding category,
* $y_{\text{mean}}$ is the mean target value over all observations in `train`,
* $m$ is a parameter.

In this task, set $m = 0.5$. 

**q11:** What will be the maximum value of this feature in the train data after the encoding? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).
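The formula can be checked by hand on a small example. The sketch below assumes that $\text{target}\left(j, x_{ij}\right)$ is the sum of target values within the category (the usual M-estimate form) and that $y_{\text{mean}}$ is the mean target over the whole train part; the toy data and the name `map_feature` are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical values).
toy = pd.DataFrame({
    "feature": ["A", "A", "B", "B", "A"],
    "target": [1, 0, 1, 0, 0],
})

m = 0.5
y_mean = toy["target"].mean()  # 2 / 5 = 0.4

# Per-category sum of targets and number of rows.
stats = toy.groupby("feature")["target"].agg(["sum", "count"])

# M-estimate: (sum + m * y_mean) / (count + m)
map_feature = (stats["sum"] + m * y_mean) / (stats["count"] + m)
# A -> (1 + 0.2) / 3.5 ~ 0.34286, B -> (1 + 0.2) / 2.5 = 0.48

toy["feature_encoded"] = toy["feature"].map(map_feature)
```

The larger $m$ is, the more every category is pulled toward the global mean, which matters most for rare categories.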

In [40]:
test

Unnamed: 0,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,32403,city_41,0.827,Male,Has relevent experience,Full time course,Graduate,STEM,9,<10,,1,21,1.0
1,9858,city_103,0.920,Female,Has relevent experience,no_enrollment,Graduate,STEM,5,,Pvt Ltd,1,98,0.0
2,31806,city_21,0.624,Male,No relevent experience,no_enrollment,High School,,<1,,Pvt Ltd,never,15,1.0
3,27385,city_13,0.827,Male,Has relevent experience,no_enrollment,Masters,STEM,11,10/49,Pvt Ltd,1,39,0.0
4,27724,city_103,0.920,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,10000+,Pvt Ltd,>4,72,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2124,1289,city_103,0.920,Male,No relevent experience,no_enrollment,Graduate,Humanities,16,,Public Sector,4,15,0.0
2125,195,city_136,0.897,Male,Has relevent experience,no_enrollment,Masters,STEM,18,,,2,30,1.0
2126,31762,city_100,0.887,Male,No relevent experience,no_enrollment,Primary School,,3,,Pvt Ltd,never,18,0.0
2127,7873,city_102,0.804,Male,Has relevent experience,Full time course,High School,,7,100-500,Public Sector,1,84,0.0


In [41]:
train["experience"].value_counts()

>20    3286
5      1430
4      1403
3      1354
6      1216
2      1127
7      1028
10      985
9       980
8       802
15      686
11      664
14      586
1       549
<1      522
16      508
12      494
13      399
17      342
19      304
18      280
20      148
-1       65
Name: experience, dtype: int64

In [48]:
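from category_encoders.m_estimate import MEstimateEncoder

# Fit on the 'experience' column and the train target only; m=0.5 as the task requires.
# Cast to str so that the imputed -1 and the string categories share one dtype.
MEE_encoder = MEstimateEncoder(cols=['experience'], m=0.5)
train_mee = MEE_encoder.fit_transform(train[['experience']].astype(str), train['target'])
# The test column needs the same imputation (task 4) before it is transformed.
test_mee = MEE_encoder.transform(test[['experience']].fillna(-1).astype(str))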

In [None]:
map_experience = # your code here
# your code here

## 12

Encode feature `'major_discipline'` with Leave-One-Out encoding. Remember that this technique is similar to target encoding, but here while computing the encoding for a particular observation, we exclude it from the target encoding formula. Therefore a category $x_{ij}$ for the $i$-th observation will be encoded according to the following formula:

$$
\hat{x_{ij}} = \frac{\text{target}\left(j, x_{ij}\right) - y_i}{\text{count}\left(j, x_{ij}\right) - 1}\quad,
$$

where

* $x_{ij}$ is a category being encoded,
* $\hat{x_{ij}}$ is its corresponding Leave-One-Out encoding value,
* $\text{count}\left(j, x_{ij}\right)$ is the total number of times $x_{ij}$ appears in `train`,
* $\text{target}\left(j, x_{ij}\right)$ is the sum of the target values over the observations with the corresponding category,
* $y_i$ is the target value of the $i$-th observation.

For example, suppose that you have the following train data:

|feature|target|
|-|-|
|A|1|
|A|0|
|B|1|
|B|0|
|A|0|

Then you obtain the following Leave-One-Out encoding:

|feature|feature_loo_encoded|
|-|-|
|A|0.0|
|A|0.5|
|B|0.0|
|B|1.0|
|A|0.5|

It is very important to notice that in this method the same category can be encoded differently for different observations. Thus, after encoding the train part of data, you should create a mapping which will help to encode the test data. In order to do this, simply average train encoding values within each category to obtain the final encoding. For instance, suppose that you obtain the following train dataframe:

|feature|feature_loo_encoded|
|-|-|
|A|0.2|
|A|0.6|
|B|0.3|
|B|0.7|
|A|0.4|

Then, for the test data, you should obtain the following mapping:

* A -> 0.4 (because 0.4 is a mean value of the encoded values for the category A)
* B -> 0.5 (because 0.5 is a mean value of the encoded values for the category B)

Don't impute any missing values in this task. After completing this task, drop the feature `'major_discipline'` and rename `'major_discipline_loo_encoded'` to `'major_discipline'`.

**q12:** What will be the maximum value of the encoding values for `'major_discipline'` for the TEST data in the mapping described above? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).
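The first toy example above can be reproduced with `groupby(...).transform`. This is a sketch of one possible approach, using the same five-row table; the variable names are illustrative.

```python
import pandas as pd

# The toy train data from the example above.
toy = pd.DataFrame({
    "feature": ["A", "A", "B", "B", "A"],
    "target": [1, 0, 1, 0, 0],
})

grp = toy.groupby("feature")["target"]
cat_sum = grp.transform("sum")      # per-row sum of targets in the row's category
cat_count = grp.transform("count")  # per-row size of the row's category

# Leave-One-Out: exclude the row's own target from the category statistics.
toy["feature_loo_encoded"] = (cat_sum - toy["target"]) / (cat_count - 1)

# Mapping for the test data: average the train encodings within each category.
map_feature = toy.groupby("feature")["feature_loo_encoded"].mean()
```

The encoded column comes out as 0.0, 0.5, 0.0, 1.0, 0.5, matching the table above, and the test mapping is A -> 1/3 and B -> 0.5 for this data.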

In [None]:
train['major_discipline_loo_encoded'] = # your code here
map_major_discipline = # your code here
# your code here

## 13

Encode feature `'company_size'` with Catboost encoding. The technique was described in the lecture. Here, for the sake of simplicity, let's use the implementation from `category_encoders` library.

As you may remember, Catboost encoding depends on how the data is ordered. Normally, you should shuffle the data one time or even several times. In this task, we will assume that the data has already been shuffled, so you don't need to shuffle it again.

Take `CatBoostEncoder` and use the default values for all its parameters, except `handle_missing`: set it to `'return_nan'` so that your encoder doesn't do anything with missing values. Fit it on `'company_size'` (train data) and `'target'`, and transform this feature. Save the encoder, since it will be used later to transform this feature in the test data.

Don't impute any missing values in this task.

**q13:** What will be the most popular value of this feature in the train data after the encoding? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).
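To see why the row order matters, here is a hand-rolled sketch of the ordered target statistic behind this kind of encoding. It is not the `category_encoders` implementation: the smoothing weight `a = 1.0` and the use of the global target mean as the prior are assumptions made for illustration, and the toy data is hypothetical.

```python
import pandas as pd

# Toy data in its (already shuffled) order; hypothetical values.
toy = pd.DataFrame({
    "feature": ["A", "A", "B", "A", "B"],
    "target": [1, 0, 1, 1, 0],
})

a = 1.0                        # smoothing weight (assumed)
prior = toy["target"].mean()   # global mean, 3 / 5 = 0.6

grp = toy.groupby("feature")["target"]
prev_sum = grp.cumsum() - toy["target"]  # targets of earlier rows in the same category
prev_count = grp.cumcount()              # number of earlier rows in the same category

# Each row only "sees" the rows that came before it.
toy["feature_cb_encoded"] = (prev_sum + a * prior) / (prev_count + a)
# -> 0.6, 0.8, 0.6, 0.53333..., 0.8
```

The first occurrence of every category receives the prior, and reordering the rows changes the encoded values, which is why the data is normally shuffled first.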

In [None]:
from category_encoders.cat_boost import CatBoostEncoder
# your code here

## 14

Encode feature `'company_type'` with Weight of Evidence (WoE) encoding. A category $x_{ij}$ will be encoded according to the following formula:

$$
\hat{x_{ij}} = \ln\left(\frac{\mathbb{P}\left(x_{ij}\mid y=1\right)}{\mathbb{P}\left(x_{ij}\mid y=0\right)}\right)\quad.
$$

Here:

$$
\mathbb{P}\left(x_{ij}\mid y=1\right) = \frac{\text{count}\left(y=1\mid x_{ij}\right)}{\text{count}\left(y=1\right)}
$$
$$
\mathbb{P}\left(x_{ij}\mid y=0\right) = \frac{\text{count}\left(y=0\mid x_{ij}\right)}{\text{count}\left(y=0\right)}
$$

The notation means the following:

* $\text{count}\left(y=1\mid x_{ij}\right)$ denotes the number of observations with the category $x_{ij}$ where the target value is equal to $1$;
* $\text{count}\left(y=0\mid x_{ij}\right)$ denotes the same but for the target value $0$;
* $\text{count}\left(y=1\right)$ denotes the number of observations with the target value equal to $1$;
* $\text{count}\left(y=0\right)$ denotes the same but for the target value $0$.


For example, suppose that you have the following train data:

|feature|target|
|-|-|
|A|1|
|A|0|
|B|1|
|B|0|
|A|0|

Then you obtain the following WoE encoding mapping:

* A -> $\ln\left(\frac{\frac{1}{2}}{\frac{2}{3}}\right) \approx -0.288$ 
* B -> $\ln\left(\frac{\frac{1}{2}}{\frac{1}{3}}\right) \approx 0.405$ 

Don't impute any missing values in this task.

**q14:** What will be the most popular value of this feature in the train data after the encoding? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).
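The toy WoE example above can be reproduced with a contingency table. This is a sketch of one possible approach; `map_feature` is an illustrative name.

```python
import numpy as np
import pandas as pd

# The toy train data from the example above.
toy = pd.DataFrame({
    "feature": ["A", "A", "B", "B", "A"],
    "target": [1, 0, 1, 0, 0],
})

# Rows: categories; columns: target values 0 and 1; cells: observation counts.
counts = pd.crosstab(toy["feature"], toy["target"])

p_given_1 = counts[1] / counts[1].sum()  # P(x | y = 1)
p_given_0 = counts[0] / counts[0].sum()  # P(x | y = 0)

map_feature = np.log(p_given_1 / p_given_0)
# A -> ln(0.5 / (2/3)) ~ -0.288, B -> ln(0.5 / (1/3)) ~ 0.405

toy["feature_woe"] = toy["feature"].map(map_feature)
```

If a category never occurs together with one of the target values, the ratio becomes 0 or infinite, so real data may need a check for such categories.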

In [None]:
map_company_type = # your code here
# your code here

## 15

We have encoded all categorical features. Next, we drop `'enrollee_id'` because it is an identifier rather than an informative feature. After this, we split the train part of the data into a dataframe which contains only features (without the target) and the target array.

Before training the models, we should impute the remaining missing values. You might have noticed that we didn't impute missing values for the features `'major_discipline'`, `'company_size'` and `'company_type'`. This is because the number of missing values in these features is relatively large (you can check it yourself). In practice, you might just create a special category (like `'-1'`) for each of these features before the encoding. However, in this task, let's perform the imputation using a KNN approach, where we impute missing values by looking at similar observations.

Import `KNNImputer` from `sklearn`. It works only with numeric data, which is why we didn't run it before the encoding. Set `n_neighbors=3`, and let other parameters have the default values. Fit it on the train dataframe with features, and then transform it. Notice that the transformation returns a `numpy.ndarray`: convert it back into a `pandas.DataFrame` with the same columns as before.

Save the KNN imputer - you will need it to process the test data. Check that there are no missing values in the train data anymore.

**q15:** What is the mean value of the `'company_size'` feature in the train data after the imputation? Provide the answer, rounded to the nearest FIVE decimal places (e.g. 12.3456789 -> 12.34568).
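The fit, transform and dataframe reconstruction can be sketched as follows. The toy dataframe and its column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy numeric features with gaps (hypothetical values).
toy = pd.DataFrame({
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "f2": [10.0, np.nan, 30.0, 40.0, np.nan],
})

knn_imputer = KNNImputer(n_neighbors=3)

# fit_transform returns a NumPy array, so the column names are lost.
imputed = knn_imputer.fit_transform(toy)

# Rebuild a dataframe with the original columns and index.
toy_imputed = pd.DataFrame(imputed, columns=toy.columns, index=toy.index)
n_missing = int(toy_imputed.isna().sum().sum())
```

For the test data, the already fitted `knn_imputer` would be reused with `transform` only, so that the neighbours still come from the train part.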

In [None]:
X_train = train.drop(['enrollee_id', 'target'], axis=1)
y_train = train['target']

X_train.shape, y_train.shape

In [None]:
from sklearn.impute import KNNImputer
# your code here

## 16

Finally, train a Random Forest classifier from `sklearn` on the train data. Set `n_estimators=500`, `max_depth=8`, `random_state=13`, and let other parameters have the default values. Check feature importances. 

**q16:** What is the name of the most important feature for this model? Provide the name of the feature.
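Checking feature importances usually comes down to pairing `feature_importances_` with the column names. Below is a sketch on synthetic data; the features `signal` and `noise` are hypothetical, and the forest is smaller than the one required by the task so that the example runs quickly.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: the label depends only on 'signal' (hypothetical features).
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({"signal": rng.normal(size=200),
                      "noise": rng.normal(size=200)})
y_toy = (X_toy["signal"] > 0).astype(int)

rf = RandomForestClassifier(n_estimators=50, max_depth=4, random_state=13)
rf.fit(X_toy, y_toy)

# Pair importances with column names and sort from most to least important.
importances = pd.Series(rf.feature_importances_, index=X_toy.columns)
importances = importances.sort_values(ascending=False)
top_feature = importances.index[0]
```

The importances sum to one, so they can be read as relative shares of the total impurity reduction.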

In [None]:
from sklearn.ensemble import RandomForestClassifier
# your code here

## 17

In this last task, process the test data so that the Random Forest model can make predictions on it. Perform operations similar to those for the train data, but remember that you now work with test observations, so all operations run in inference mode: reuse the mappings, encoders and imputer fitted on the train data instead of fitting them again.

1. (task 4) Feature `'experience'`: impute missing values by -1. 
2. (task 5) Feature `'education_level'`: perform ordinal encoding mapping.
3. (task 6) Feature `'relevent_experience'`: perform binary mapping.
4. (task 7) Feature `'gender'`: perform One-Hot encoding and obtain three new features starting with `'gender-'`. Drop `'gender'` feature.
5. (task 8) Feature `'enrolled_university'`: perform One-Hot encoding and obtain three new features starting with `'enrolled_university-'`. Drop `'enrolled_university'` feature.
6. (task 9) Feature `'city'`: perform frequency encoding mapping.
7. (task 10) Feature `'last_new_job'`: impute missing values by -1 and perform target encoding mapping.
8. (task 11) Feature `'experience'`: perform M-estimate encoding mapping.
9. (task 12) Feature `'major_discipline'`: perform Leave-One-Out encoding mapping.
10. (task 13) Feature `'company_size'`: perform Catboost encoding mapping.
11. (task 14) Feature `'company_type'`: perform WoE encoding mapping.
12. (task 15) Split `test` into `X_test` (a dataframe with no `'enrollee_id'` and `'target'`) and `y_test` (an array with target values). Impute missing values by using KNN imputer which you used before (now only in a transform mode).

As a result of the operations described above, you should obtain `X_test` which is a `pandas.DataFrame` with a shape (2129, 16). Calculate the predictions of the trained Random Forest model on it. Check the accuracy of the predictions on test data.

Then calculate the predictions of the same model on `X_train` and check the accuracy there. Compare the accuracies on the train and test data. Do you notice anything? What, in your opinion, caused such a difference in the accuracies?

**q17:** As a result of this task, provide the accuracy for the test data, rounded to the nearest TWO decimal places (e.g. 12.3456789 -> 12.35).
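One detail of inference mode that is easy to miss concerns One-Hot encoding: if a category does not occur in the test part, `pd.get_dummies` will not create its column. A possible safeguard, sketched below on hypothetical data, is to reindex the test dummies to the column list saved from the train part (`train_gender_cols` is an illustrative name).

```python
import numpy as np
import pandas as pd

# Column names saved while encoding the train part (hypothetical).
train_gender_cols = ["gender-Female", "gender-Male", "gender-Other"]

# Toy test part where only one category is present.
toy_test = pd.DataFrame({"gender": ["Male", np.nan, "Male"]})

df_gender = pd.get_dummies(toy_test["gender"], prefix="gender",
                           prefix_sep="-", dtype=int)

# Align with train: add absent columns filled with 0 and fix the column order.
df_gender = df_gender.reindex(columns=train_gender_cols, fill_value=0)
```

After the alignment the test features have the same columns, in the same order, as the train features, which is what the fitted imputer and model expect.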

In [None]:
# your code here

In [None]:
X_test.shape

In [None]:
from sklearn.metrics import accuracy_score
# your code here