## Data Preprocessing

Data preprocessing consists of the following steps:
- Getting the dataset
- Importing the libraries and the dataset
- Finding missing values
- Encoding categorical data
- Splitting into training and test datasets
- Feature scaling

### Importing the pandas and NumPy libraries

In [2]:
import pandas as pd
import numpy as np

Importing the dataset

In [3]:
df = pd.read_csv("../datasets/employees.csv")
df.head()

Unnamed: 0,EMPLOYEE_ID,FIRST_NAME,LAST_NAME,EMAIL,PHONE_NUMBER,HIRE_DATE,JOB_ID,SALARY,COMMISSION_PCT,MANAGER_ID,DEPARTMENT_ID
0,198,Donald,OConnell,DOCONNEL,650.507.9833,21-JUN-07,SH_CLERK,2600,-,124,50
1,199,Douglas,Grant,DGRANT,650.507.9844,13-JAN-08,SH_CLERK,2600,-,124,50
2,200,Jennifer,Whalen,JWHALEN,515.123.4444,17-SEP-03,AD_ASST,4400,-,101,10
3,201,Michael,Hartstein,MHARTSTE,515.123.5555,17-FEB-04,MK_MAN,13000,-,100,20
4,202,Pat,Fay,PFAY,603.123.6666,17-AUG-05,MK_REP,6000,-,201,20
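The preview shows `-` in the `COMMISSION_PCT` column. Assuming that this dash marks a missing value (the source does not say so), pandas will not count it as missing unless told to. The sketch below uses a small inline CSV, invented for illustration, to show how `na_values` and `isna()` can be used to find missing values.

```python
import io

import pandas as pd

# Illustrative data only: three rows shaped like the employees table.
csv_text = """EMPLOYEE_ID,SALARY,COMMISSION_PCT
198,2600,-
199,,0.1
200,4400,-
"""

# Default parsing: empty cells become NaN, but "-" stays a plain string.
raw = pd.read_csv(io.StringIO(csv_text))
missing_raw = raw.isna().sum()

# Treat "-" as a missing-value marker (an assumption about this dataset).
marked = pd.read_csv(io.StringIO(csv_text), na_values=["-"])
missing_marked = marked.isna().sum()

print(missing_raw)
print(missing_marked)
```

Counting missing values per column before imputing shows which columns actually need attention.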


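Select the columns to work with. The usual goal at this point is to separate the independent variables from the dependent variable; here all selected columns are kept together in `x`.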

In [13]:
x = df[['EMPLOYEE_ID', 'FIRST_NAME', 'LAST_NAME' , 'EMAIL', 'PHONE_NUMBER', 'JOB_ID' , 'HIRE_DATE','SALARY']]

Next, convert the selected columns to a NumPy array. Note that `y` holds the same data as `x` in array form, not a separate dependent variable.

In [14]:
y = x.values
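For a supervised model, the independent variables (features) and the dependent variable (target) would normally end up in two different arrays. The sketch below assumes, purely for illustration, that `SALARY` is the target; the sample rows and the names `feature_array` and `target_array` are hypothetical.

```python
import pandas as pd

# Illustrative rows shaped like the employees table.
sample = pd.DataFrame({
    "EMPLOYEE_ID": [198, 199, 200],
    "JOB_ID": ["SH_CLERK", "SH_CLERK", "AD_ASST"],
    "SALARY": [2600, 2600, 4400],
})

# Features: every column except the assumed target.
feature_array = sample.drop(columns=["SALARY"]).values

# Target: the assumed dependent variable as a 1-D array.
target_array = sample["SALARY"].values

print(feature_array.shape, target_array.shape)
```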

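Handle the missing values in the data. `SimpleImputer` with `strategy='mean'` replaces each `NaN` in a column with that column's mean; it is applied here to the `SALARY` column (index 7).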

In [15]:
from sklearn.impute import SimpleImputer

In [16]:
imputer = SimpleImputer(missing_values=np.nan,strategy='mean')

In [21]:
imput = imputer.fit(y[:,7:8])

In [24]:
y[:,7:8] = imput.transform(y[:,7:8])

In [25]:
y

array([[198, 'Donald', 'OConnell', 'DOCONNEL', '650.507.9833',
        'SH_CLERK', '21-JUN-07', 2600.0],
       [199, 'Douglas', 'Grant', 'DGRANT', '650.507.9844', 'SH_CLERK',
        '13-JAN-08', 2600.0],
       [200, 'Jennifer', 'Whalen', 'JWHALEN', '515.123.4444', 'AD_ASST',
        '17-SEP-03', 4400.0],
       [201, 'Michael', 'Hartstein', 'MHARTSTE', '515.123.5555',
        'MK_MAN', '17-FEB-04', 13000.0],
       [202, 'Pat', 'Fay', 'PFAY', '603.123.6666', 'MK_REP', '17-AUG-05',
        6000.0],
       [203, 'Susan', 'Mavris', 'SMAVRIS', '515.123.7777', 'HR_REP',
        '07-JUN-02', 6500.0],
       [204, 'Hermann', 'Baer', 'HBAER', '515.123.8888', 'PR_REP',
        '07-JUN-02', 10000.0],
       [205, 'Shelley', 'Higgins', 'SHIGGINS', '515.123.8080', 'AC_MGR',
        '07-JUN-02', 12008.0],
       [206, 'William', 'Gietz', 'WGIETZ', '515.123.8181', 'AC_ACCOUNT',
        '07-JUN-02', 8300.0],
       [100, 'Steven', 'King', 'SKING', '515.123.4567', 'AD_PRES',
        '17-JUN-03', 24
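To see what mean imputation does, here is a minimal sketch on a single numeric column with one missing entry. The salary values are illustrative, not taken from a real gap in the dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One column, one missing value (NaN).
salaries = np.array([[2600.0], [np.nan], [4400.0], [13000.0]])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# fit learns the column mean from the non-missing values;
# transform replaces each NaN with that mean.
filled = imputer.fit_transform(salaries)

print(filled.ravel())
```

The missing entry is replaced by the mean of the three known values, (2600 + 4400 + 13000) / 3.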

### Encoding categorical data

Categorical data is data whose values are labels rather than numbers. It is encoded as numbers because most algorithms cannot work with categorical data directly. Some algorithms, such as decision trees, can work with such data, but many work only with numeric columns.

In [26]:
from sklearn.preprocessing import LabelEncoder

`LabelEncoder` converts each distinct label into an integer code.

In [27]:
label_encoder_y = LabelEncoder()

In [35]:
y[:,1] = label_encoder_y.fit_transform(y[:,1])

In [36]:
y

array([[198, 7, 'OConnell', 'DOCONNEL', '650.507.9833', 'SH_CLERK',
        '21-JUN-07', 2600.0],
       [199, 8, 'Grant', 'DGRANT', '650.507.9844', 'SH_CLERK',
        '13-JAN-08', 2600.0],
       [200, 16, 'Whalen', 'JWHALEN', '515.123.4444', 'AD_ASST',
        '17-SEP-03', 4400.0],
       [201, 28, 'Hartstein', 'MHARTSTE', '515.123.5555', 'MK_MAN',
        '17-FEB-04', 13000.0],
       [202, 32, 'Fay', 'PFAY', '603.123.6666', 'MK_REP', '17-AUG-05',
        6000.0],
       [203, 41, 'Mavris', 'SMAVRIS', '515.123.7777', 'HR_REP',
        '07-JUN-02', 6500.0],
       [204, 11, 'Baer', 'HBAER', '515.123.8888', 'PR_REP', '07-JUN-02',
        10000.0],
       [205, 36, 'Higgins', 'SHIGGINS', '515.123.8080', 'AC_MGR',
        '07-JUN-02', 12008.0],
       [206, 44, 'Gietz', 'WGIETZ', '515.123.8181', 'AC_ACCOUNT',
        '07-JUN-02', 8300.0],
       [100, 40, 'King', 'SKING', '515.123.4567', 'AD_PRES', '17-JUN-03',
        24000.0],
       [101, 31, 'Kochhar', 'NKOCHHAR', '515.123.4568', '
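The integer codes come from the sorted list of distinct labels. A small sketch on a few first names (a hand-picked sample, not the full column) shows the mapping and how to reverse it:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

names = np.array(["Donald", "Douglas", "Jennifer", "Donald"])

encoder = LabelEncoder()
codes = encoder.fit_transform(names)

# classes_ holds the distinct labels in sorted order;
# each code is the index of its label in classes_.
print(encoder.classes_)
print(codes)

# inverse_transform maps the codes back to the original labels.
decoded = encoder.inverse_transform(codes)
print(decoded)
```

Repeated labels receive the same code, which is why the two `Donald` entries match.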

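Label encoding assigns integers, which a model may interpret as an order or magnitude relationship between categories. Dummy (one-hot) encoding avoids this by giving each category its own binary column.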

In [37]:
from sklearn.preprocessing import OneHotEncoder

In [38]:
one_hot_encoder = OneHotEncoder()

In [40]:
z = one_hot_encoder.fit_transform(df.FIRST_NAME.values.reshape(-1,1)).toarray()

In [41]:
z.shape

(50, 45)
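A small sketch of one-hot encoding on a few job IDs taken from the table above. Note that `toarray` is a method and must be called with parentheses to get a dense array; `fit_transform` on its own returns a sparse matrix by default.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# The encoder expects a 2-D input: one row per sample, one column per feature.
job_ids = np.array(["SH_CLERK", "AD_ASST", "MK_MAN", "SH_CLERK"]).reshape(-1, 1)

encoder = OneHotEncoder()
dense = encoder.fit_transform(job_ids).toarray()

# One output column per distinct category, in sorted order.
print(encoder.categories_)
print(dense)
```

Each row contains exactly one 1, in the column of that row's category.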

### Training and testing data

Divide the dataset into a training set and a test set.

In [42]:
from sklearn.model_selection import train_test_split

In [43]:
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.2, random_state=0)

The test size is 20 percent of the dataset (10 of the 50 rows). Other values can be used.

`random_state` fixes the seed used for shuffling, so the same split is produced on every run.

In [44]:
x_train

Unnamed: 0,EMPLOYEE_ID,FIRST_NAME,LAST_NAME,EMAIL,PHONE_NUMBER,JOB_ID,HIRE_DATE,SALARY
33,124,Kevin,Mourgos,KMOURGOS,650.123.5234,ST_MAN,16-NOV-07,5800
35,126,Irene,Mikkilineni,IMIKKILI,650.124.1224,ST_CLERK,28-SEP-06,2700
26,117,Sigal,Tobias,STOBIAS,515.127.4564,PU_CLERK,24-JUL-05,2800
34,125,Julia,Nayer,JNAYER,650.124.1214,ST_CLERK,16-JUL-05,3200
18,109,Daniel,Faviet,DFAVIET,515.124.4169,FI_ACCOUNT,16-AUG-02,9000
7,205,Shelley,Higgins,SHIGGINS,515.123.8080,AC_MGR,07-JUN-02,12008
14,105,David,Austin,DAUSTIN,590.423.4569,IT_PROG,25-JUN-05,4800
45,136,Hazel,Philtanker,HPHILTAN,650.127.1634,ST_CLERK,06-FEB-08,2200
48,139,John,Seo,JSEO,650.121.2019,ST_CLERK,12-FEB-06,2700
29,120,Matthew,Weiss,MWEISS,650.123.1234,ST_MAN,18-JUL-04,8000


In [45]:
y_test

array([[119, 21, 'Colmenares', 'KCOLMENA', '515.127.4566', 'PU_CLERK',
        '10-AUG-07', 2500.0],
       [102, 25, 'De Haan', 'LDEHAAN', '515.123.4569', 'AD_VP',
        '13-JAN-01', 17000.0],
       [101, 31, 'Kochhar', 'NKOCHHAR', '515.123.4568', 'AD_VP',
        '21-SEP-05', 17000.0],
       [132, 42, 'Olson', 'TJOLSON', '650.124.8234', 'ST_CLERK',
        '10-APR-07', 2100.0],
       [200, 16, 'Whalen', 'JWHALEN', '515.123.4444', 'AD_ASST',
        '17-SEP-03', 4400.0],
       [118, 9, 'Himuro', 'GHIMURO', '515.127.4565', 'PU_CLERK',
        '15-NOV-06', 2600.0],
       [129, 24, 'Bissot', 'LBISSOT', '650.124.5234', 'ST_CLERK',
        '20-AUG-05', 3300.0],
       [122, 33, 'Kaufling', 'PKAUFLIN', '650.123.3234', 'ST_MAN',
        '01-MAY-03', 7900.0],
       [113, 26, 'Popp', 'LPOPP', '515.124.4567', 'FI_ACCOUNT',
        '07-DEC-07', 6900.0],
       [202, 32, 'Fay', 'PFAY', '603.123.6666', 'MK_REP', '17-AUG-05',
        6000.0]], dtype=object)
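The sketch below repeats the split on synthetic data with a separate feature matrix and target vector, which is the more common setup. The arrays are made up for illustration; only the 50-row size and the `test_size` and `random_state` values mirror the cells above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 50 samples, 2 features, and one target value per sample.
features = np.arange(100).reshape(50, 2)
target = np.arange(50)

X_tr, X_te, y_tr, y_te = train_test_split(
    features, target, test_size=0.2, random_state=0
)

# Same random_state, same split.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    features, target, test_size=0.2, random_state=0
)

print(X_tr.shape, X_te.shape)
print(np.array_equal(X_te, X_te2))
```

Rows of the features and the target are shuffled together, so each sample keeps its own target value after the split.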

### Feature scaling

Feature scaling is the last step of data preprocessing. We use it when some features have values on a much larger scale than others, so that those features do not dominate the rest.

In [46]:
from sklearn.preprocessing import StandardScaler

In [47]:
sc_x = StandardScaler()

In [50]:
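# Commented out because x_train contains non-numeric columns
# x_train = sc_x.fit_transform(x_train)

`StandardScaler` can be applied only to numeric values.

Scale `x_test` in the same way, using `transform` so that the test set is scaled with the statistics learned from the training set.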
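A minimal sketch of scaling only the numeric columns, fitting on the training data and reusing the fitted scaler for the test data. The two small DataFrames are illustrative stand-ins, and selecting columns with `select_dtypes` is one possible approach, not necessarily what the notebook intended.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Illustrative stand-ins for the training and test sets.
train = pd.DataFrame({
    "FIRST_NAME": ["Kevin", "Irene", "Sigal", "Julia"],
    "SALARY": [5800, 2700, 2800, 3200],
})
test = pd.DataFrame({"FIRST_NAME": ["Pat"], "SALARY": [6000]})

# Keep only numeric columns; StandardScaler cannot handle strings.
numeric_cols = train.select_dtypes(include="number").columns

scaler = StandardScaler()
# Learn mean and standard deviation from the training set only.
train_scaled = scaler.fit_transform(train[numeric_cols])
# Reuse those statistics for the test set.
test_scaled = scaler.transform(test[numeric_cols])

print(train_scaled.ravel())
print(test_scaled.ravel())
```

After scaling, the training column has zero mean and unit standard deviation; the test values are expressed relative to the training statistics.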