## Import Library

In [141]:
import pandas as pd

## Loading the CSV

In [142]:
df = pd.read_csv("salary.csv")
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


# Splitting the Data

Here we split the data into a training set containing 70% of the rows and a test set containing the remaining 30%.

In [143]:
from sklearn.model_selection import train_test_split
x = df.drop(columns="salary")
y = df["salary"]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)

print("x_train shape:", x_train.shape)
print("x_test shape:", x_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

x_train shape: (22792, 14)
x_test shape: (9769, 14)
y_train shape: (22792,)
y_test shape: (9769,)
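
The split above is random on every run, so the rows that land in each subset change between executions. A minimal sketch, assuming a small made-up frame standing in for `salary.csv` (the `demo` values are hypothetical), of fixing `random_state` for reproducibility and using `stratify` to keep the class proportions similar in both subsets:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for salary.csv: 10 rows with an imbalanced target
demo = pd.DataFrame({
    "age": [39, 50, 38, 53, 28, 37, 49, 52, 31, 42],
    "salary": ["<=50K"] * 7 + [">50K"] * 3,
})

x_demo = demo.drop(columns="salary")
y_demo = demo["salary"]

# random_state makes the shuffle repeatable; stratify preserves class ratios
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print(x_tr.shape, x_te.shape)
print(y_te.value_counts())
```

Whether stratification is appropriate depends on how imbalanced the `salary` classes are in the real data.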


# Normalizing the Data

We make a copy of the DataFrame, then choose the column we want to normalize (here, `age`).

In [144]:
from sklearn.preprocessing import MinMaxScaler

df2 = df.copy()
MinMax = MinMaxScaler()
mmsData = MinMax.fit_transform(df2[["age"]])
df2["age"] = mmsData

print("age column before scaling:")
print(df[["age"]])
print("age column after scaling:")
print(df2[["age"]])

age column before scaling:
       age
0       39
1       50
2       38
3       53
4       28
...    ...
32556   27
32557   40
32558   58
32559   22
32560   52

[32561 rows x 1 columns]
age column after scaling:
            age
0      0.301370
1      0.452055
2      0.287671
3      0.493151
4      0.150685
...         ...
32556  0.136986
32557  0.315068
32558  0.561644
32559  0.068493
32560  0.479452

[32561 rows x 1 columns]
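
Min-max scaling maps each value to `(x - min) / (max - min)`, so the smallest value becomes 0 and the largest becomes 1. As a small check, assuming a hypothetical four-value sample whose minimum is 17 and maximum is 90 (chosen because they are consistent with age 39 mapping to 0.301370 in the output above), the manual formula and `MinMaxScaler` agree:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical sample: min 17 and max 90 are assumptions for illustration
ages = np.array([[17.0], [39.0], [50.0], [90.0]])

# Manual min-max formula: (x - min) / (max - min)
manual = (ages - ages.min()) / (ages.max() - ages.min())

# Same computation through scikit-learn
scaled = MinMaxScaler().fit_transform(ages)

print(manual.ravel())
print(np.allclose(manual, scaled))
```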


# Standardizing the Data

As above, we first make a copy of the DataFrame, then choose the column we want to standardize.

In [145]:
from sklearn.preprocessing import StandardScaler
import numpy as np

df3 = df.copy()
ss = StandardScaler()
scaled_data = ss.fit_transform(df3[["age"]])

print("\n Scaled data values")
print(scaled_data)
print("Standard deviation:", np.std(scaled_data))


 Scaled data values
[[ 0.03067056]
 [ 0.83710898]
 [-0.04264203]
 ...
 [ 1.42360965]
 [-1.21564337]
 [ 0.98373415]]
Standard deviation: 1.0
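
Standardization computes `(x - mean) / std`. `StandardScaler` uses the population standard deviation (`ddof=0`), which matches the default of `np.std`. The cell above fits the scaler on the full column; when the scaled values feed a model, a common alternative is to fit on the training split only and reuse those statistics for the test split. A minimal sketch on made-up numbers (the arrays `train_age` and `test_age` are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training and test values for a single column
train_age = np.array([[20.0], [30.0], [40.0], [50.0]])
test_age = np.array([[35.0], [60.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_age)  # learns mean and std from train only
test_scaled = scaler.transform(test_age)        # reuses the training statistics

# Manual z-score using the training mean and population std (ddof=0)
mean = train_age.mean()
std = train_age.std()
manual_test = (test_age - mean) / std

print(np.allclose(test_scaled, manual_test))
print(np.std(train_scaled))
```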


## Introducing Null Values and Handling Them

Because the original data has no null values and no float64 columns, I first create float values by converting `fnlwgt` from int to float, as shown below.

In [146]:
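from sklearn.impute import SimpleImputer

# astype returns a new Series, so assign it back to keep the conversion
df2["fnlwgt"] = df2["fnlwgt"].astype("float64")
df2["fnlwgt"]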


0         77516.0
1         83311.0
2        215646.0
3        234721.0
4        338409.0
           ...   
32556    257302.0
32557    154374.0
32558    151910.0
32559    201490.0
32560    287927.0
Name: fnlwgt, Length: 32561, dtype: float64

After creating the float values, I introduce null values into columns of several types: int, float, and object.

In [147]:
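# Set every row from index 100 onward to NaN in three columns
df2.loc[100:, ["age"]] = np.nan     # int in the source data (already float in df2 after scaling)
df2.loc[100:, ["fnlwgt"]] = np.nan  # float
df2.loc[100:, ["sex"]] = np.nan     # object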

df2.isna().sum()

age               32461
workclass             0
fnlwgt            32461
education             0
education-num         0
marital-status        0
occupation            0
relationship          0
race                  0
sex               32461
capital-gain          0
capital-loss          0
hours-per-week        0
native-country        0
salary                0
dtype: int64

After introducing the null values, we handle them with the `SimpleImputer` class and its strategies, which include `median`, `mean`, and `most_frequent`. We then call `df2.isna().sum()` to check whether the null values are gone.

In [148]:
stra_med = SimpleImputer(strategy = "median")
stra_mean = SimpleImputer(strategy = "mean")
stra_most = SimpleImputer(strategy = "most_frequent")

df2["age"] = stra_med.fit_transform(df2[["age"]])
df2["fnlwgt"] = stra_mean.fit_transform(df2[["fnlwgt"]])
df2["sex"] = stra_most.fit_transform(df2[["sex"]])
df2.isna().sum()

age               0
workclass         0
fnlwgt            0
education         0
education-num     0
marital-status    0
occupation        0
relationship      0
race              0
sex               0
capital-gain      0
capital-loss      0
hours-per-week    0
native-country    0
salary            0
dtype: int64
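
To make the three strategies concrete, here is a small sketch on a made-up frame (the `toy` frame and its values are hypothetical). `median` and `mean` apply to numeric columns, while `most_frequent` also works on text columns:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing value per column
toy = pd.DataFrame({
    "age": [20.0, 30.0, 70.0, np.nan],
    "fnlwgt": [100.0, 200.0, 600.0, np.nan],
    "sex": pd.Series(["Male", "Male", "Female", np.nan], dtype=object),
})

# fit_transform returns a 2D array; ravel() flattens it back to one column
toy["age"] = SimpleImputer(strategy="median").fit_transform(toy[["age"]]).ravel()
toy["fnlwgt"] = SimpleImputer(strategy="mean").fit_transform(toy[["fnlwgt"]]).ravel()
toy["sex"] = SimpleImputer(strategy="most_frequent").fit_transform(toy[["sex"]]).ravel()

print(toy)
```

The missing `age` is filled with the median of 20, 30, and 70; the missing `fnlwgt` with the mean of 100, 200, and 600; and the missing `sex` with the most common label.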

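## Introducing Duplicate Values and Handling Them

First we check whether any rows in the data are identical. If there are, we can use the pandas function `drop_duplicates` to remove them. Most of the duplicates counted here are likely a side effect of the previous step: rows from index 100 onward now share the same imputed `age`, `fnlwgt`, and `sex`, so rows that differed only in those columns have become identical.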

In [149]:
print("Number of duplicates before handling:", df2.duplicated().sum())
df2.drop_duplicates(inplace=True)
print("Number of duplicates after handling:", df2.duplicated().sum())


Number of duplicates before handling: 13923
Number of duplicates after handling: 0
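
`duplicated()` flags a row only when every column matches an earlier row, and `drop_duplicates()` keeps the first occurrence by default. A short sketch on a made-up frame (the `toy` values are hypothetical) showing the default behavior and the `subset` parameter, which restricts the comparison to chosen columns:

```python
import pandas as pd

# Hypothetical data: row 1 repeats row 0 exactly
toy = pd.DataFrame({
    "age": [39, 39, 50, 50],
    "sex": ["Male", "Male", "Male", "Female"],
})

n_dup = toy.duplicated().sum()                # counts fully identical repeats
deduped = toy.drop_duplicates()               # compares all columns
by_age = toy.drop_duplicates(subset=["age"])  # compares only the age column

print(n_dup)
print(deduped)
print(by_age)
```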


# Changing the Data Type of an int Attribute

We first check which columns have an int type. Once we know, we can convert them to the type we want; here I convert `fnlwgt` to float.

In [150]:
df3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education-num   32561 non-null  int64 
 5   marital-status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital-gain    32561 non-null  int64 
 11  capital-loss    32561 non-null  int64 
 12  hours-per-week  32561 non-null  int64 
 13  native-country  32561 non-null  object
 14  salary          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


In [151]:
df3["fnlwgt"] = df3["fnlwgt"].astype("float64")
df3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             32561 non-null  int64  
 1   workclass       32561 non-null  object 
 2   fnlwgt          32561 non-null  float64
 3   education       32561 non-null  object 
 4   education-num   32561 non-null  int64  
 5   marital-status  32561 non-null  object 
 6   occupation      32561 non-null  object 
 7   relationship    32561 non-null  object 
 8   race            32561 non-null  object 
 9   sex             32561 non-null  object 
 10  capital-gain    32561 non-null  int64  
 11  capital-loss    32561 non-null  int64  
 12  hours-per-week  32561 non-null  int64  
 13  native-country  32561 non-null  object 
 14  salary          32561 non-null  object 
dtypes: float64(1), int64(5), object(9)
memory usage: 3.7+ MB
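
Rather than reading the `info()` output by eye, the integer columns can also be listed programmatically. A minimal sketch on a made-up frame (the `toy` frame is hypothetical), using `select_dtypes` to find them and a dictionary passed to `astype` to convert several columns at once:

```python
import pandas as pd

# Hypothetical data with two int64 columns and one text column
toy = pd.DataFrame({
    "age": [39, 50],
    "fnlwgt": [77516, 83311],
    "workclass": ["State-gov", "Private"],
})

# Collect the names of all int64 columns
int_cols = toy.select_dtypes(include="int64").columns.tolist()

# astype accepts a column-to-dtype mapping and returns a new DataFrame
toy = toy.astype({col: "float64" for col in int_cols})

print(int_cols)
print(toy.dtypes)
```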


# One-Hot Encoding

Here we convert the `sex` attribute to numeric form by creating a new column for each category.

In [152]:
from sklearn.preprocessing import OneHotEncoder

# scikit-learn >= 1.2 uses sparse_output; older versions use sparse=False
onehot = OneHotEncoder(sparse_output=False)
sex = onehot.fit_transform(df[["sex"]])
sex = pd.DataFrame(sex)

# Categories are sorted alphabetically: column 0 is Female, column 1 is Male
df["male"] = sex[1]
df["female"] = sex[0]
df.head()



Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary,male,female
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K,1.0,0.0
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K,1.0,0.0
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K,1.0,0.0
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K,1.0,0.0
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K,0.0,1.0
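
`OneHotEncoder` orders its output columns by the sorted category values, which it stores in `categories_` after fitting. A minimal sketch on made-up values (the `toy` frame and the `sex_` column prefix are illustrative assumptions) that names the new columns from `categories_` instead of relying on column positions. The default sparse result is converted with `toarray()`, so the snippet does not depend on the `sparse` or `sparse_output` parameter name:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data with two categories
toy = pd.DataFrame({"sex": ["Male", "Male", "Female"]}, dtype=object)

encoder = OneHotEncoder()
encoded = encoder.fit_transform(toy[["sex"]]).toarray()

# categories_[0] holds the sorted labels of the first (only) encoded column
col_names = [f"sex_{c}" for c in encoder.categories_[0]]
onehot_df = pd.DataFrame(encoded, columns=col_names, index=toy.index)

toy = pd.concat([toy, onehot_df], axis=1)
print(toy)
```

Deriving the names from the fitted encoder keeps each label attached to the right indicator column even if the set of categories changes.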
