## Data Cleaning & Pre-processing
![data-cleaning-in-python](https://daxg39y63pxwu.cloudfront.net/images/blog/data-cleaning-in-python/data-cleaning-in-python.png)

The first step of an analytics project is to clean the datasets and pre-process them so that they are suitable for use by analytical models and visualization.


In [3]:
# import basic libraries
import pandas as pd
import numpy as np
import seaborn as sb

### 1. Import dataset into the notebook

In [4]:
train_df = pd.read_csv('archive/aug_train.csv')
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19158 entries, 0 to 19157
Data columns (total 14 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   enrollee_id             19158 non-null  int64  
 1   city                    19158 non-null  object 
 2   city_development_index  19158 non-null  float64
 3   gender                  14650 non-null  object 
 4   relevent_experience     19158 non-null  object 
 5   enrolled_university     18772 non-null  object 
 6   education_level         18698 non-null  object 
 7   major_discipline        16345 non-null  object 
 8   experience              19093 non-null  object 
 9   company_size            13220 non-null  object 
 10  company_type            13018 non-null  object 
 11  last_new_job            18735 non-null  object 
 12  training_hours          19158 non-null  int64  
 13  target                  19158 non-null  float64
dtypes: float64(2), int64(2), object(10)

In [5]:
test_df = pd.read_csv('archive/aug_test.csv')
test_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2129 entries, 0 to 2128
Data columns (total 13 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   enrollee_id             2129 non-null   int64  
 1   city                    2129 non-null   object 
 2   city_development_index  2129 non-null   float64
 3   gender                  1621 non-null   object 
 4   relevent_experience     2129 non-null   object 
 5   enrolled_university     2098 non-null   object 
 6   education_level         2077 non-null   object 
 7   major_discipline        1817 non-null   object 
 8   experience              2124 non-null   object 
 9   company_size            1507 non-null   object 
 10  company_type            1495 non-null   object 
 11  last_new_job            2089 non-null   object 
 12  training_hours          2129 non-null   int64  
dtypes: float64(1), int64(2), object(10)
memory usage: 216.4+ KB


In [6]:
combined_df = pd.concat([test_df, train_df])
combined_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 21287 entries, 0 to 19157
Data columns (total 14 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   enrollee_id             21287 non-null  int64  
 1   city                    21287 non-null  object 
 2   city_development_index  21287 non-null  float64
 3   gender                  16271 non-null  object 
 4   relevent_experience     21287 non-null  object 
 5   enrolled_university     20870 non-null  object 
 6   education_level         20775 non-null  object 
 7   major_discipline        18162 non-null  object 
 8   experience              21217 non-null  object 
 9   company_size            14727 non-null  object 
 10  company_type            14513 non-null  object 
 11  last_new_job            20824 non-null  object 
 12  training_hours          21287 non-null  int64  
 13  target                  19158 non-null  float64
dtypes: float64(2), int64(2), object(10)

In [7]:
shuffled = combined_df.sample(frac=1, random_state=1).reset_index()

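- Combine the train and test datasets, then reshuffle the rows (`random_state=1` keeps the shuffle reproducible).
- `reset_index()` is called without `drop=True`, so the previous row labels are kept as a new `index` column.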
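The combine-and-shuffle step can be sketched on toy data as follows. This is a minimal sketch, not the notebook's own cell: the `source` column and the tiny `train` / `test` frames are assumptions added for illustration, so that the unlabeled test rows stay identifiable after shuffling.

```python
import pandas as pd

# Toy stand-ins for aug_train.csv / aug_test.csv (hypothetical values)
train = pd.DataFrame({"enrollee_id": [1, 2, 3], "target": [0.0, 1.0, 0.0]})
test = pd.DataFrame({"enrollee_id": [4, 5]})  # the test file has no `target` column

# Tag each row with its origin, then stack; ignore_index avoids duplicate row labels
combined = pd.concat(
    [test.assign(source="test"), train.assign(source="train")],
    ignore_index=True,
)

# Shuffle reproducibly and discard the old labels instead of keeping an `index` column
shuffled_demo = combined.sample(frac=1, random_state=1).reset_index(drop=True)
print(shuffled_demo)
```

Rows that come from the test file carry `NaN` in `target`, which is why `target` shows 2129 missing values in the combined data.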

**About this dataset**
- `enrollee_id` : Candidate's unique ID (Identity Document)
- `city` : City code
- `city_development_index`: Development index of the city (scaled) (0.45: less developed to 0.95: highly developed)
- `gender`: Gender of a candidate
- `relevent_experience` : Relevant experience of a candidate (the column name is misspelled in the dataset itself)
- `enrolled_university` : Type of University course enrolled if any
- `education_level` : Education level of candidate
- `major_discipline` : Education major discipline of candidate
    - STEM: Candidate's degree programme falls under the umbrella of Science, Technology, Engineering or Math
    - Humanities: Candidates with an interdisciplinary degree in fields such as Literature, English, Arts or History
    - Business Degree: Candidates who hold bachelor of business degree
    - Other: Candidates who hold other degrees apart from STEM, Humanities and Business. 
    - No Major: Candidates who do not have a degree
- `experience` : Candidate total experience in years
- `company_size` : Number of employees in current employer's company
- `company_type` : Type of current employer
    - Pvt Ltd: Private Limited company
    - Public Sector: Organisations that are owned and operated by the government
    - Funded Startup: A company that has received funding from investors like venture capitalists/angel investors
    - Early Stage Startup: Newly founded company that is in the early stages of development
    - NGO: Non-Government Organisation 
- `last_new_job` : Difference in years between previous job and current job
- `training_hours`: training hours completed
- `target`: 0 – Not looking for job change, 1 – Looking for a job change


### 2. Data Cleaning & Pre-processing 

#### **2.1 Removing rows with NA values (if they exist)**
 - Alternatively, use an algorithm to fill in the missing values (e.g. KNN imputation)
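As a rough illustration of the KNN imputation option, the sketch below uses scikit-learn's `KNNImputer` on a small numeric array. The array values are made up, and this is not what the notebook does further down. `KNNImputer` works on numeric data only, so the string-valued columns of this dataset would have to be encoded before it could be applied.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Small numeric example with two missing entries (hypothetical values)
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each missing entry is replaced by the mean of that feature
# over the 2 nearest rows that have the feature observed
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled)
```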

In [6]:
print('Number of NA entries: ', shuffled.isna().sum().sum())

Number of NA entries:  25066


In [7]:
shuffled.isnull().sum()

enrollee_id                  0
city                         0
city_development_index       0
gender                    5016
relevent_experience          0
enrolled_university        417
education_level            512
major_discipline          3125
experience                  70
company_size              6560
company_type              6774
last_new_job               463
training_hours               0
target                    2129
dtype: int64

In [8]:
shuffled.isnull().sum()/len(shuffled)  # fraction of missing values per column (multiply by 100 for a percentage)

enrollee_id               0.000000
city                      0.000000
city_development_index    0.000000
gender                    0.235637
relevent_experience       0.000000
enrolled_university       0.019589
education_level           0.024052
major_discipline          0.146803
experience                0.003288
company_size              0.308169
company_type              0.318222
last_new_job              0.021750
training_hours            0.000000
target                    0.100014
dtype: float64

##### Rows with NA values are not removed at this point, because doing so would remove the majority of the rows
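The count and share of missing values, and the number of rows a blanket `dropna()` would cost, can be put side by side as in the sketch below. The `demo` frame is a made-up stand-in for `shuffled`.

```python
import pandas as pd

# Toy frame standing in for `shuffled` (hypothetical values)
demo = pd.DataFrame({
    "gender": ["Male", None, "Female", None],
    "training_hours": [10, 20, 30, 40],
})

# Missing count and percentage per column, worst column first
summary = pd.DataFrame({
    "missing": demo.isna().sum(),
    "percent": demo.isna().mean() * 100,
}).sort_values("missing", ascending=False)

# Rows that would be lost if every row with any NA were dropped
rows_lost = len(demo) - len(demo.dropna())
print(summary)
print("Rows lost by dropna():", rows_lost)
```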

#### Identify Continuous Variables

In [9]:
continuous_var = shuffled.select_dtypes(include = ['int64','float64']).columns.values
continuous_var

array(['index', 'enrollee_id', 'city_development_index', 'training_hours',
       'target'], dtype=object)

#### Identify Categorical Variables

In [10]:
categorical_var = shuffled.select_dtypes(include = ['object']).columns.values
categorical_var

array(['city', 'gender', 'relevent_experience', 'enrolled_university',
       'education_level', 'major_discipline', 'experience',
       'company_size', 'company_type', 'last_new_job'], dtype=object)
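Selecting by dtype also picks up `index`, `enrollee_id` and `target`, which are numeric but are a row label, an identifier and the response rather than continuous features. A minimal sketch of filtering them out is shown below; the `non_features` list is an assumption about how these columns should be treated, and the frame is toy data.

```python
import pandas as pd

# Toy frame with the same numeric columns as `shuffled` (hypothetical values)
demo = pd.DataFrame({
    "index": [0, 1],
    "enrollee_id": [101, 102],
    "city_development_index": [0.92, 0.62],
    "training_hours": [36, 47],
    "target": [1.0, 0.0],
    "city": ["city_103", "city_40"],
})

# Assumption: these numeric columns are labels/identifiers, not features
non_features = ["index", "enrollee_id", "target"]

numeric_cols = demo.select_dtypes(include="number").columns
continuous_features = [col for col in numeric_cols if col not in non_features]
print(continuous_features)
```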

### Cleaning & Ordering Categorical Variables 

In [19]:
# Standardise labels: capitalise, and fix the '10/49' typo in company_size
shuffled['last_new_job'] = shuffled['last_new_job'].replace('never', 'Never')
shuffled['enrolled_university'] = shuffled['enrolled_university'].replace('no_enrollment', 'No Enrollment')
shuffled['company_size'] = shuffled['company_size'].replace('10/49', '10-49')

# Collapse the open-ended experience bins onto their boundary values
shuffled['experience'] = shuffled['experience'].replace({'<1': '0', '>20': '20'})

# Fill missing categories with explicit placeholder labels.
# Assigning the result back avoids chained in-place fillna, which newer pandas versions warn about.
shuffled['company_size'] = shuffled['company_size'].fillna('0')
shuffled['company_type'] = shuffled['company_type'].fillna('Unknown')
shuffled['major_discipline'] = shuffled['major_discipline'].fillna('Unknown')
shuffled['gender'] = shuffled['gender'].fillna('Not provided')

In [20]:
ed_order = ['Primary School','High School','Graduate','Masters','Phd']
enroll_order = ['No Enrollment','Part time course','Full time course']
disc_order = ['STEM','Unknown','Humanities','Other','Business Degree','Arts','No Major']
exp_yrs_order = ['<1','1','2','3','4','5','6','7','8','9','10','11','12','13','14','15','16','17','18','19','20','>20']
exp_yrs_order_2 = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]
size_order = ['0','<10', '10-49', '50-99', '100-500', '500-999', '1000-4999', '5000-9999', '10000+']
job_order = ['Never', '1', '2', '3', '4', '>4']
exp_order =['No relevant experience','Has relevant experience']
gender_order = ['Male','Female','Other','Not provided']
company_order = ['Pvt Ltd','Unknown','Funded Startup','Public Sector','Early Stage Startup','NGO','Other']
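One way such order lists can be put to use is through an ordered categorical dtype, sketched below on a few made-up values. This is an illustration only; the notebook does not state how the lists are consumed later.

```python
import pandas as pd

ed_order = ['Primary School', 'High School', 'Graduate', 'Masters', 'Phd']

# A few example values (hypothetical)
levels = pd.Series(['Graduate', 'Phd', 'High School', 'Masters', 'Graduate'])

# Attach the order so that sorting and comparisons follow ed_order, not the alphabet
ed_dtype = pd.CategoricalDtype(categories=ed_order, ordered=True)
ordered_levels = levels.astype(ed_dtype)

# Integer position of each value within ed_order
codes = ordered_levels.cat.codes
print(ordered_levels.sort_values().tolist())
print(codes.tolist())
```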

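### Drop rows with remaining missing values

`dropna` removes every row that still has a missing value in any column. This includes the rows without a `target`, and the rows with missing `enrolled_university`, `education_level`, `experience` or `last_new_job`.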

In [21]:
shuffled.dropna(inplace=True)
shuffled.isna().sum()/len(shuffled)

index                     0.0
enrollee_id               0.0
city                      0.0
city_development_index    0.0
gender                    0.0
relevent_experience       0.0
enrolled_university       0.0
education_level           0.0
major_discipline          0.0
experience                0.0
company_size              0.0
company_type              0.0
last_new_job              0.0
training_hours            0.0
target                    0.0
dtype: float64
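The difference between dropping on any column and dropping on selected columns only can be seen in the sketch below. It uses toy data, and `subset=["target"]` is just an example choice, not the notebook's setting.

```python
import numpy as np
import pandas as pd

# Toy frame: one row lacks `experience`, another lacks `target` (hypothetical values)
demo = pd.DataFrame({
    "experience": ["5", None, "20", "3"],
    "target": [1.0, 0.0, np.nan, 0.0],
})

# Drops a row if ANY column is missing (what a bare dropna() does)
all_dropped = demo.dropna()

# Drops a row only if `target` is missing; other gaps are kept for later handling
labelled_only = demo.dropna(subset=["target"])

print(len(all_dropped), len(labelled_only))
```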

#### **2.2 Cleaning up continuous variables**

- Filter out outliers using a 3-sigma range (µ ± 3σ) per column
- Analyse each column separately
- Give reasons for dropping or keeping values



In [23]:
shuffled['city_development_index'].describe()

count    18014.000000
mean         0.831728
std          0.122115
min          0.448000
25%          0.745000
50%          0.910000
75%          0.920000
max          0.949000
Name: city_development_index, dtype: float64

#### Only values inside the 3-sigma range (µ - 3σ, µ + 3σ) were kept; under a normal distribution this range covers roughly 99.7% of values rather than 95%. Values outside it were treated as outliers and removed for both city_development_index and training_hours.

In [24]:
# 3-sigma range: (µ - 3σ, µ + 3σ); np.std uses the population standard deviation (ddof=0)
mu, sigma = np.mean(shuffled['city_development_index']), np.std(shuffled['city_development_index'])
conf_interval = (mu - 3 * sigma, mu + 3 * sigma)
print('Confidence Interval:', conf_interval)

Confidence Interval: (0.46539202585612793, 1.1980638418023752)
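For reference, the share of a normal distribution that falls within µ ± kσ can be computed from the error function. The sketch below assumes normality, which the columns here may not satisfy, and the helper name `normal_coverage` is made up for illustration.

```python
import math

def normal_coverage(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

coverage_3_sigma = normal_coverage(3.0)   # the range used in this notebook
coverage_95 = normal_coverage(1.96)       # the multiplier that corresponds to about 95%

print(f"±3σ    covers {coverage_3_sigma:.4f}")
print(f"±1.96σ covers {coverage_95:.4f}")
```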


In [26]:
numofRowsBefore = len(shuffled)
print('Number of rows before removing outliers:', numofRowsBefore)

(lower, upper) = conf_interval
shuffled.drop(shuffled[shuffled['city_development_index'] < lower].index, inplace=True)
shuffled.drop(shuffled[shuffled['city_development_index'] > upper].index, inplace=True)

numofRowsAfter = len(shuffled)
print('Number of rows after removing outliers:', numofRowsAfter)
print('Number of rows dropped:', numofRowsBefore-numofRowsAfter)

Number of rows before removing outliers: 18014
Number of rows after removing outliers: 17999
Number of rows dropped: 15
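The same bound-and-filter steps are repeated for each column, so they could be wrapped in small helpers. The sketch below is one possible shape for that. The function names and the toy data are assumptions, and it returns a filtered copy rather than dropping rows in place.

```python
import numpy as np
import pandas as pd

def k_sigma_bounds(values, k=3.0):
    """Return (mean - k*std, mean + k*std) using the population std (ddof=0)."""
    mu = np.mean(values)
    sigma = np.std(values)
    return mu - k * sigma, mu + k * sigma

def keep_within_k_sigma(df, column, k=3.0):
    """Return only the rows whose `column` value lies inside the k-sigma range."""
    lower, upper = k_sigma_bounds(df[column], k)
    return df[df[column].between(lower, upper)]

# Toy data: twenty ordinary values and one extreme value (hypothetical)
demo = pd.DataFrame({"training_hours": [10] * 20 + [1000]})
kept = keep_within_k_sigma(demo, "training_hours")
print("Rows before:", len(demo), "after:", len(kept))
```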


In [28]:
shuffled['training_hours'].describe()

count    17999.000000
mean        65.363465
std         60.073649
min          1.000000
25%         23.000000
50%         47.000000
75%         88.000000
max        336.000000
Name: training_hours, dtype: float64

In [29]:
# 3-sigma range: (µ - 3σ, µ + 3σ); np.std uses the population standard deviation (ddof=0)
mu, sigma = np.mean(shuffled['training_hours']), np.std(shuffled['training_hours'])
conf_interval = (mu - 3 * sigma, mu + 3 * sigma)
print('Confidence Interval:', conf_interval)

Confidence Interval: (-114.85247646869902, 245.57940574254758)


In [31]:
numofRowsBefore = len(shuffled)
print('Number of rows before removing outliers:', numofRowsBefore)

(lower, upper) = conf_interval
shuffled.drop(shuffled[shuffled['training_hours'] < lower].index, inplace=True)
shuffled.drop(shuffled[shuffled['training_hours'] > upper].index, inplace=True)

numofRowsAfter = len(shuffled)
print('Number of rows after removing outliers:', numofRowsAfter)
print('Number of rows dropped:', numofRowsBefore-numofRowsAfter)

Number of rows before removing outliers: 17999
Number of rows after removing outliers: 17575
Number of rows dropped: 424
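For `training_hours` the lower bound of the 3-sigma range is negative while the minimum is 1, so only the upper bound removes rows. For comparison, the sketch below computes the 1.5×IQR fences from the quartiles printed by `describe()` above (25%: 23, 75%: 88). The IQR rule is a different technique from the one used in this notebook and is shown only as a point of reference.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences q1 - k*IQR and q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles of training_hours taken from the describe() output above
lower_fence, upper_fence = iqr_fences(23.0, 88.0)
print(lower_fence, upper_fence)
```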


#### **2.3 Cleaning up categorical variables**

- Are the categorical variables consistent in their values?
- Are there missing values?
- Are there unreasonable values (e.g. free text)?

In [32]:
# Print the value counts of every categorical (object-dtype) column
for feature in shuffled.select_dtypes(include=['object']).columns:
    print(shuffled[feature].value_counts(), end='\n\n')

city_103    4109
city_21     2396
city_16     1427
city_114    1230
city_160     795
            ... 
city_111       3
city_8         2
city_129       2
city_171       1
city_140       1
Name: city, Length: 122, dtype: int64

Male            12468
Not provided     3755
Female           1182
Other             170
Name: gender, dtype: int64

Has relevent experience    12869
No relevent experience      4706
Name: relevent_experience, dtype: int64

No Enrollment       13030
Full time course     3433
Part time course     1112
Name: enrolled_university, dtype: int64

Graduate          10911
Masters            4133
High School        1861
Phd                 386
Primary School      284
Name: education_level, dtype: int64

STEM               13645
Unknown             2167
Humanities           640
Other                356
Business Degree      315
Arts                 245
No Major             207
Name: major_discipline, dtype: int64

20    3257
5     1302
4     1266
3     1190
6     1117
2      

- all the categorical variables are consistent in their values
- there are no missing values

Therefore, no further data cleaning is needed for the categorical variables. However, they must be one-hot encoded into numeric form so that analytical models can operate on them.
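To make concrete what one-hot encoding produces, here is a minimal sketch on a single toy column using `pd.get_dummies`. The notebook itself uses scikit-learn's `OneHotEncoder`; `get_dummies` is used here only because it is the shortest way to show the resulting shape.

```python
import pandas as pd

# Toy column (hypothetical values)
demo = pd.DataFrame({"gender": ["Male", "Female", "Other", "Male"]})

# One new 0/1 column per category; each row has exactly one 1
encoded = pd.get_dummies(demo, columns=["gender"], dtype=float)
print(encoded)
```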

#### **2.4 Encoding nominal (unordered) categorical variables using `OneHotEncoding` for predictors & `Integer Encoding` for response**
The `shuffled` dataset contains 10 categorical predictor variables:
- gender
- city
- relevent_experience
- enrolled_university
- education_level
- major_discipline
- experience
- company_size
- company_type
- last_new_job

And 1 categorical response variable:
- target


In [52]:
pip install -U scikit-learn

Successfully installed joblib-1.2.0 scikit-learn-1.2.2
Note: you may need to restart the kernel to use updated packages.


**2.4 a) OneHotEncoding**

In [8]:
# Import the OneHotEncoder from sklearn

from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder()

# OneHotEncoding of categorical predictors

cat_variables = [
    'gender', 'city', 'relevent_experience', 'enrolled_university',
    'education_level', 'major_discipline', 'experience', 'company_size', 'company_type',
     'last_new_job'
]

shuffled_cat = shuffled[cat_variables]

ohe.fit(shuffled_cat)
shuffled_cat_ohe = pd.DataFrame(
    ohe.transform(shuffled_cat).toarray(),
    columns=ohe.get_feature_names_out(shuffled_cat.columns))

#Check the encoded variables
shuffled_cat_ohe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21287 entries, 0 to 21286
Columns: 192 entries, gender_Female to last_new_job_nan
dtypes: float64(192)
memory usage: 31.2 MB


In [9]:
shuffled_cat_ohe

Unnamed: 0,gender_Female,gender_Male,gender_Other,gender_nan,city_city_1,city_city_10,city_city_100,city_city_101,city_city_102,city_city_103,...,company_type_Public Sector,company_type_Pvt Ltd,company_type_nan,last_new_job_1,last_new_job_2,last_new_job_3,last_new_job_4,last_new_job_>4,last_new_job_never,last_new_job_nan
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21282,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
21283,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
21284,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
21285,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


**2.4 b) Combine encoded dataframe with continuous variables**

In [10]:
# Every column that was not one-hot encoded is treated as numeric
num_variable = [col for col in shuffled.columns if col not in cat_variables]
num_variable

['index', 'enrollee_id', 'city_development_index', 'training_hours', 'target']

In [11]:
# Combining Numeric features with the OHE Categorical features
shuffled_num = shuffled[num_variable]
shuffled_cat_ohe_df = pd.concat(
    [shuffled_num.reset_index(drop=True), shuffled_cat_ohe.reset_index(drop=True)],
    sort=False,
    axis=1)

# Check the final dataframe
shuffled_cat_ohe_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21287 entries, 0 to 21286
Columns: 197 entries, index to last_new_job_nan
dtypes: float64(194), int64(3)
memory usage: 32.0 MB
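The `reset_index(drop=True)` calls matter because `pd.concat(..., axis=1)` aligns rows by index label, not by position. The sketch below shows the effect on toy data. Positional alignment is only valid when both frames hold the same rows in the same order, which holds here as long as the encoded frame was produced directly from the same `shuffled` rows.

```python
import pandas as pd

# Numeric part keeps old row labels; encoded part has a fresh 0..n-1 index (toy values)
left = pd.DataFrame({"training_hours": [36, 47]}, index=[5, 9])
right = pd.DataFrame({"gender_Male": [1.0, 0.0]})

# Label-based alignment: no labels match, so rows are padded with NaN
misaligned = pd.concat([left, right], axis=1)

# Resetting both indexes aligns the rows by position instead
aligned = pd.concat(
    [left.reset_index(drop=True), right.reset_index(drop=True)],
    axis=1,
)
print(misaligned)
print(aligned)
```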


In [13]:
for col in shuffled_cat_ohe_df.columns:
    print(col, end=',  ')

index,  enrollee_id,  city_development_index,  training_hours,  target,  gender_Female,  gender_Male,  gender_Other,  gender_nan,  city_city_1,  city_city_10,  city_city_100,  city_city_101,  city_city_102,  city_city_103,  city_city_104,  city_city_105,  city_city_106,  city_city_107,  city_city_109,  city_city_11,  city_city_111,  city_city_114,  city_city_115,  city_city_116,  city_city_117,  city_city_118,  city_city_12,  city_city_120,  city_city_121,  city_city_123,  city_city_126,  city_city_127,  city_city_128,  city_city_129,  city_city_13,  city_city_131,  city_city_133,  city_city_134,  city_city_136,  city_city_138,  city_city_139,  city_city_14,  city_city_140,  city_city_141,  city_city_142,  city_city_143,  city_city_144,  city_city_145,  city_city_146,  city_city_149,  city_city_150,  city_city_152,  city_city_155,  city_city_157,  city_city_158,  city_city_159,  city_city_16,  city_city_160,  city_city_162,  city_city_165,  city_city_166,  city_city_167,  city_city_171

In [14]:
shuffled_cat_ohe_df

Unnamed: 0,index,enrollee_id,city_development_index,training_hours,target,gender_Female,gender_Male,gender_Other,gender_nan,city_city_1,...,company_type_Public Sector,company_type_Pvt Ltd,company_type_nan,last_new_job_1,last_new_job_2,last_new_job_3,last_new_job_4,last_new_job_>4,last_new_job_never,last_new_job_nan
0,17483,4204,0.924,8,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
1,15606,30934,0.804,57,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
2,8848,527,0.920,32,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
3,832,29537,0.856,118,,0.0,1.0,0.0,0.0,0.0,...,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
4,13397,20188,0.920,310,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21282,8826,20604,0.910,142,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
21283,15160,29294,0.910,146,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
21284,3063,1426,0.910,72,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
21285,10043,17227,0.847,9,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


In [15]:
## For `target`:
#  - Looking for job (Yes): 1
#  - Not looking for job (No): 0
# `target` is already stored as 0.0 / 1.0 here, so this mapping matches no rows
# and leaves the column unchanged; it only applies to "Yes" / "No" string labels.

mapping = {
    "Yes": 1,
    "No": 0
}

for val in mapping:
    rows = shuffled_cat_ohe_df['target'] == val
    shuffled_cat_ohe_df.loc[rows, 'target'] = mapping[val]

shuffled_cat_ohe_df['target'].unique()

array([ 0., nan,  1.])

In [16]:
shuffled_cat_ohe_df

Unnamed: 0,index,enrollee_id,city_development_index,training_hours,target,gender_Female,gender_Male,gender_Other,gender_nan,city_city_1,...,company_type_Public Sector,company_type_Pvt Ltd,company_type_nan,last_new_job_1,last_new_job_2,last_new_job_3,last_new_job_4,last_new_job_>4,last_new_job_never,last_new_job_nan
0,17483,4204,0.924,8,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
1,15606,30934,0.804,57,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
2,8848,527,0.920,32,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
3,832,29537,0.856,118,,0.0,1.0,0.0,0.0,0.0,...,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
4,13397,20188,0.920,310,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21282,8826,20604,0.910,142,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
21283,15160,29294,0.910,146,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
21284,3063,1426,0.910,72,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
21285,10043,17227,0.847,9,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


**2.4 c) Export encoded `shuffled` dataframe as csv**

In [17]:
shuffled_cat_ohe_df.to_csv('archive/shuffled_encoded.csv', index=False)

#### **2.5 For correlation, apply `Integer Encoding` to the string-valued categorical variables and the categorical response**
The `shuffled` dataset contains 7 categorical predictor variables that are strings:
- 

**2.5 a) IntegerEncoding**

Example (to be edited):
For `target`, `relevent_experience`, `enrolled_university`, `gender`, `education_level`, `last_new_job`, `company_type`, `major_discipline`, `company_size`, each category label is replaced by an integer code, for example:
- Yes: 1
- No: 0

`experience` and `city` are left out of this mapping.

In [49]:
shuffled['last_new_job'].value_counts()

1        7592
>4       3146
2        2752
Never    2132
4         986
3         967
Name: last_new_job, dtype: int64

In [18]:
shuffled_cor_df = shuffled.copy()

mapping = {
    'target': {
        "Yes": 1,
        "No": 0
    },
    'relevent_experience': {
        "Has relevent experience": 1,
        "No relevent experience": 0
    },
    'enrolled_university': {
        "Full time course": 2,
        "Part time course": 1,
        "No Enrollment": 0
    },
    'gender': {
        "Male": 3,
        "Female": 2,
        "Other": 1,
        "Not provided": 0
    },
    'education_level': {
        # codes follow ed_order: Primary School < High School < Graduate < Masters < Phd
        "Phd": 4,
        "Masters": 3,
        "Graduate": 2,
        "High School": 1,
        "Primary School": 0,
    },
    'last_new_job': {
        ">4": 5,
        "4": 4,
        "3": 3,
        "2": 2,
        "1": 1,
        "Never": 0
    },
    'company_type': {
        "Pvt Ltd": 6, 
        "Funded Startup": 5, 
        "Public Sector": 4, 
        "Early Stage Startup": 3, 
        "NGO": 2,
        "Other": 1,
        "Unknown": 0
    },
    'major_discipline': {
        "STEM": 6,
        "Humanities": 5, 
        "Business Degree": 4,
        "Arts": 3,
        "Other": 2,
        "Unknown": 1,
        "No Major": 0
    },
    'company_size': {
                "0": 0,
              "<10": 1,
            "10-49": 2,
            "50-99": 3,
          "100-500": 4,
          "500-999": 5,
        "1000-4999": 6,
        "5000-9999": 7,
           "10000+": 8,
    }
}

for mapping_type in mapping:
    for val in mapping[mapping_type]:
        condition = shuffled_cor_df[mapping_type] == val
        shuffled_cor_df.loc[condition, mapping_type] = mapping[mapping_type][val]

    print(mapping_type, ':', shuffled_cor_df[mapping_type].unique())
        



target : [ 0. nan  1.]
relevent_experience : [1 0]
enrolled_university : [2 'no_enrollment' 1 nan]
gender : [3 2 nan 1]
education_level : [0 3 2 4 1 nan]
last_new_job : [1 2 5 'never' 3 4 nan]
company_type : [nan 2 6 3 4 5 1]
major_discipline : [nan 6 3 0 5 2 4]
company_size : [nan 3 6 8 4 '10/49' 5 1 7]
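Labels that are absent from a mapping are left untouched by the loop above, so they can end up as strings mixed in with the integer codes. A hedged alternative is `Series.map`, which turns every unmapped label into `NaN` and so makes such labels easy to list. The sketch below uses toy values and is not the notebook's method.

```python
import pandas as pd

mapping = {"Full time course": 2, "Part time course": 1, "No Enrollment": 0}

# Toy column with one label that is not in the mapping and one missing value
col = pd.Series(["Full time course", "no_enrollment", "Part time course", None])

# map() returns NaN for anything that is not a key of the mapping
encoded = col.map(mapping)

# Labels that were present in the data but had no code assigned
unmapped = col[encoded.isna() & col.notna()].unique()
print(encoded.tolist())
print(list(unmapped))
```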


**2.5 b) Export encoded `shuffled` dataframe as csv**

In [19]:
shuffled_cor_df.to_csv('archive/shuffled_correlation.csv', index=False)
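Before correlations are computed on the integer-encoded columns, it may be worth converting them to a numeric dtype explicitly, since assigning codes into a string column can leave it with `object` dtype. The sketch below shows one way to do that on toy data; the values are made up and the conversion step is a suggestion rather than part of this notebook.

```python
import pandas as pd

# Toy frame whose columns hold numbers but are stored as object dtype
demo = pd.DataFrame(
    {"education_level": [0, 3, 2, 4, 1], "target": [0.0, 1.0, 0.0, 1.0, 0.0]},
    dtype=object,
)

# Convert every column to a numeric dtype; anything non-numeric becomes NaN
numeric = demo.apply(pd.to_numeric, errors="coerce")

# Pearson correlation matrix of the encoded columns
corr = numeric.corr()
print(corr)
```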

---

#### Datasets created from this notebook:

    archive
    ├── aug_train.csv                 # original training dataset
    ├── aug_test.csv                  # original test dataset
    ├── shuffled_encoded.csv          # for analytical models (OneHotEncoding done)
    └── shuffled_correlation.csv      # for correlation analysis (Integer Encoding done)
 