# 1. Missing Values

Data is often corrupted or missing. We need to deal with this first, because many later processing steps do not work when values are missing or incomplete.

## Imputing missing values with Imputer

In [1]:
import pandas as pd
from sklearn.preprocessing import Imputer

In [2]:
df = pd.read_csv('Data.csv')
df

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


In [3]:
df.isnull().sum()

Country      0
Age          1
Salary       1
Purchased    0
dtype: int64

In [4]:
df.dropna()

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
5,France,35.0,58000.0,Yes
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


In [5]:
# drop only the rows where a specific column (here 'Age') contains NaN
df.dropna(subset=['Age'])

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


In [6]:
# replace every occurrence of missing_values with a value chosen by strategy,
# which can be 'mean', 'median', or 'most_frequent'.
# axis=0 imputes along columns (each column uses its own statistic); axis=1 imputes along rows

imputer = Imputer(missing_values='NaN', strategy='mean', axis = 0)
df.iloc[:, 1:3] = imputer.fit_transform(df.iloc[:, 1:3])
df

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,63777.777778,Yes
5,France,35.0,58000.0,Yes
6,Spain,38.777778,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes
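
The `Imputer` class used above was deprecated in scikit-learn 0.20 and removed in 0.22, so the cell fails on current releases. The following is a minimal sketch of the same mean imputation with `SimpleImputer` from `sklearn.impute`. It assumes the table shown earlier and rebuilds it inline instead of reading `Data.csv`, so that it runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Rebuild the example table inline (assumption: same values as Data.csv above)
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
    "Purchased": ["No", "Yes", "No", "No", "Yes",
                  "Yes", "No", "Yes", "No", "Yes"],
})

# SimpleImputer always works column by column, so there is no axis argument
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
df[["Age", "Salary"]] = imputer.fit_transform(df[["Age", "Salary"]])

print(df.loc[[4, 6], ["Age", "Salary"]])
```

Selecting the columns by name rather than by position (`iloc[:, 1:3]`) keeps the code correct if the column order changes.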


# 2. Encoding Categorical Data

In [7]:
# LabelEncoder replaces each category with an integer code. Useful for replacing yes with 1 and no with 0.
# OneHotEncoder creates a separate column for each category and sets it to 1 where that category is present
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

In [15]:
label_encoder = LabelEncoder()
temp = df.copy()
temp.iloc[:, 0] = label_encoder.fit_transform(df.iloc[:, 0])
temp.head()

Unnamed: 0,Country,Age,Salary,Purchased
0,0,44.0,72000.0,No
1,2,27.0,48000.0,Yes
2,1,30.0,54000.0,No
3,2,38.0,61000.0,No
4,1,40.0,63777.777778,Yes
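
The integer codes above are assigned in sorted order of the category names, which is why France becomes 0, Germany 1, and Spain 2. A small sketch of how to inspect and reverse that mapping follows. The short `countries` list is an illustrative stand-in for the first five rows of the Country column.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative stand-in for the first five values of the Country column
countries = ["France", "Spain", "Germany", "Spain", "Germany"]

encoder = LabelEncoder()
codes = encoder.fit_transform(countries)

# classes_ lists the categories; the position of each one is its integer code
print(list(encoder.classes_))
print(codes)

# inverse_transform maps the integer codes back to the original labels
decoded = encoder.inverse_transform(codes)
print(list(decoded))
```

Because the codes are ordered, a model may read Spain (2) as "greater than" France (0). That is usually acceptable for a binary target and less so for an unordered feature.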


In [9]:
# you can pass an array of indices of categorical features
# one_hot_encoder = OneHotEncoder(categorical_features=[0])
# temp = df.copy()
# temp.iloc[:, 0] = one_hot_encoder.fit_transform(df.iloc[:, 0])

# you can achieve the same thing using get_dummies
pd.get_dummies(df.iloc[:, :-1])

Unnamed: 0,Age,Salary,Country_France,Country_Germany,Country_Spain
0,44.0,72000.0,1,0,0
1,27.0,48000.0,0,0,1
2,30.0,54000.0,0,1,0
3,38.0,61000.0,0,0,1
4,40.0,63777.777778,0,1,0
5,35.0,58000.0,1,0,0
6,38.777778,52000.0,0,0,1
7,48.0,79000.0,1,0,0
8,50.0,83000.0,0,1,0
9,37.0,67000.0,1,0,0
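
The commented-out `OneHotEncoder` lines above rely on the `categorical_features` argument, which is no longer available in recent scikit-learn releases. As a hedged sketch, the encoder can instead be applied directly to the categorical column. The four-row DataFrame here is an illustrative stand-in for the Country column.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Illustrative stand-in for the Country column
df = pd.DataFrame({"Country": ["France", "Spain", "Germany", "Spain"]})

# The encoder expects 2-D input, hence the double brackets
encoder = OneHotEncoder()
encoded = encoder.fit_transform(df[["Country"]]).toarray()

# categories_[0] gives the column order of the encoded matrix
print(list(encoder.categories_[0]))
print(encoded)
```

Unlike `pd.get_dummies`, a fitted encoder remembers its categories, so the same columns can be produced later for new data with `encoder.transform`.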


# 3. Binarizing

Binarizing converts data into 0s and 1s.
We will try another dataset, the iris dataset that ships with the scikit-learn library. (https://archive.ics.uci.edu/ml/datasets/iris)

In [19]:
from sklearn.datasets import load_iris

iris_dataset = load_iris()
X = iris_dataset.data
y = iris_dataset.target
feature_names = iris_dataset.feature_names
print(feature_names)

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


In [21]:
X[:, 1]

array([3.5, 3. , 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.4, 3. ,
       3. , 4. , 4.4, 3.9, 3.5, 3.8, 3.8, 3.4, 3.7, 3.6, 3.3, 3.4, 3. ,
       3.4, 3.5, 3.4, 3.2, 3.1, 3.4, 4.1, 4.2, 3.1, 3.2, 3.5, 3.1, 3. ,
       3.4, 3.5, 2.3, 3.2, 3.5, 3.8, 3. , 3.8, 3.2, 3.7, 3.3, 3.2, 3.2,
       3.1, 2.3, 2.8, 2.8, 3.3, 2.4, 2.9, 2.7, 2. , 3. , 2.2, 2.9, 2.9,
       3.1, 3. , 2.7, 2.2, 2.5, 3.2, 2.8, 2.5, 2.8, 2.9, 3. , 2.8, 3. ,
       2.9, 2.6, 2.4, 2.4, 2.7, 2.7, 3. , 3.4, 3.1, 2.3, 3. , 2.5, 2.6,
       3. , 2.6, 2.3, 2.7, 3. , 2.9, 2.9, 2.5, 2.8, 3.3, 2.7, 3. , 2.9,
       3. , 3. , 2.5, 2.9, 2.5, 3.6, 3.2, 2.7, 3. , 2.5, 2.8, 3.2, 3. ,
       3.8, 2.6, 2.2, 3.2, 2.8, 2.8, 2.7, 3.3, 3.2, 2.8, 3. , 2.8, 3. ,
       2.8, 3.8, 2.8, 2.8, 2.6, 3. , 3.4, 3.1, 3. , 3.1, 3.1, 3.1, 2.7,
       3.2, 3.3, 3. , 2.5, 3. , 3.4, 3. ])

We will set each value to 0 if it is at or below the mean, and to 1 if it is above the mean.

In [24]:
from sklearn.preprocessing import Binarizer
X[:, 1:2] = Binarizer(threshold=X[:, 1].mean()).fit_transform(X[:, 1].reshape(-1, 1))
X[:, 1]

array([1., 0., 1., 1., 1., 1., 1., 1., 0., 1., 1., 1., 0., 0., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 0., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 0., 1., 1., 0., 1., 1., 1., 0., 1., 1., 1., 1., 1.,
       1., 1., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
       0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.,
       0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 1., 0., 1., 0.,
       0., 1., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.,
       1., 1., 0., 1., 1., 1., 0., 1., 1., 0., 0., 0., 1., 0.])
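
The rule that `Binarizer` applies is a strict comparison: values greater than the threshold become 1, and everything else becomes 0. The sketch below reproduces it with a plain NumPy comparison on five illustrative sepal widths (not the full iris column).

```python
import numpy as np

# Five illustrative sepal widths in cm (a small sample, not the full column)
widths = np.array([3.5, 3.0, 3.2, 2.3, 2.9])

# Use the sample mean as the threshold, as in the cell above
threshold = widths.mean()

# Strictly greater than the threshold -> 1.0, otherwise 0.0
binary = (widths > threshold).astype(float)

print(threshold)
print(binary)
```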

# 4. Feature Scaling

In [25]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler, Normalizer, MinMaxScaler

df = pd.read_csv('Data.csv').dropna()
print(df)
X = df[["Age", "Salary"]].values.astype(np.float64)

   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
5   France  35.0  58000.0       Yes
7   France  48.0  79000.0       Yes
8  Germany  50.0  83000.0        No
9   France  37.0  67000.0       Yes


In [26]:
standard_scaler = StandardScaler()
normalizer = Normalizer()
min_max_scaler = MinMaxScaler()

print("Standardization")
print(standard_scaler.fit_transform(X))

print("Normalizing")
print(normalizer.fit_transform(X))

print("MinMax Scaling")
print(min_max_scaler.fit_transform(X))

Standardization
[[ 0.69985807  0.58989097]
 [-1.51364653 -1.50749915]
 [-1.12302807 -0.98315162]
 [-0.08137885 -0.37141284]
 [-0.47199731 -0.6335866 ]
 [ 1.22068269  1.20162976]
 [ 1.48109499  1.55119478]
 [-0.211585    0.1529347 ]]
Normalizing
[[6.11110997e-04 9.99999813e-01]
 [5.62499911e-04 9.99999842e-01]
 [5.55555470e-04 9.99999846e-01]
 [6.22950699e-04 9.99999806e-01]
 [6.03448166e-04 9.99999818e-01]
 [6.07594825e-04 9.99999815e-01]
 [6.02409529e-04 9.99999819e-01]
 [5.52238722e-04 9.99999848e-01]]
MinMax Scaling
[[0.73913043 0.68571429]
 [0.         0.        ]
 [0.13043478 0.17142857]
 [0.47826087 0.37142857]
 [0.34782609 0.28571429]
 [0.91304348 0.88571429]
 [1.         1.        ]
 [0.43478261 0.54285714]]
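
To make the three transformations concrete, here is a sketch that recomputes them by hand with NumPy on the same eight complete rows. It assumes the default settings used above: `StandardScaler` divides by the population standard deviation, `Normalizer` rescales each row to unit L2 norm, and `MinMaxScaler` maps each column onto [0, 1].

```python
import numpy as np

# Age and Salary of the eight complete rows printed above
X = np.array([
    [44, 72000], [27, 48000], [30, 54000], [38, 61000],
    [35, 58000], [48, 79000], [50, 83000], [37, 67000],
], dtype=np.float64)

# Standardization: subtract the column mean, divide by the column std (ddof=0)
standardized = (X - X.mean(axis=0)) / X.std(axis=0)

# Normalizing: divide each ROW by its own L2 norm, so every row has length 1
normalized = X / np.linalg.norm(X, axis=1, keepdims=True)

# Min-max scaling: map each column linearly so its min is 0 and its max is 1
min_max = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(standardized[0])
print(normalized[0])
print(min_max[0])
```

Note that `Normalizer` works per row while the other two work per column. Since Salary is about a thousand times larger than Age, every normalized row ends up close to (0, 1).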


# 5. Feature Extraction
In the previous session you tried writing a WordCount program. WordCount is a feature extraction technique. However, you do not need to build it yourself, because scikit-learn already provides it. Feature extraction will later be useful for classification, clustering, and other machine learning techniques.
## 5.1 Count Vectorizer

In [27]:
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Mayur is a nice boy.", "Mayur rock! wohooo!", "My name is Mayur, and I am a Pythonista!"]
cv = CountVectorizer()
X = cv.fit_transform(docs)
print(X)
print(cv.vocabulary_)
print(X.todense())

  (0, 2)	1
  (0, 7)	1
  (0, 3)	1
  (0, 4)	1
  (1, 10)	1
  (1, 9)	1
  (1, 4)	1
  (2, 8)	1
  (2, 0)	1
  (2, 1)	1
  (2, 6)	1
  (2, 5)	1
  (2, 3)	1
  (2, 4)	1
{'mayur': 4, 'is': 3, 'nice': 7, 'boy': 2, 'rock': 9, 'wohooo': 10, 'my': 5, 'name': 6, 'and': 1, 'am': 0, 'pythonista': 8}
[[0 0 1 1 1 0 0 1 0 0 0]
 [0 0 0 0 1 0 0 0 0 1 1]
 [1 1 0 1 1 1 1 0 1 0 0]]
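
For comparison with a hand-written WordCount, the sketch below rebuilds the same matrix in plain Python. It assumes the `CountVectorizer` defaults: text is lowercased, tokens are runs of two or more word characters (so "a" and "I" are dropped), and vocabulary indices follow alphabetical order.

```python
import re
from collections import Counter

docs = ["Mayur is a nice boy.", "Mayur rock! wohooo!",
        "My name is Mayur, and I am a Pythonista!"]

# Tokens of two or more word characters, mirroring the default token pattern
token_pattern = re.compile(r"\b\w\w+\b")
tokenized = [token_pattern.findall(doc.lower()) for doc in docs]

# Vocabulary: every distinct token, indexed in alphabetical order
all_terms = sorted({term for tokens in tokenized for term in tokens})
vocabulary = {term: index for index, term in enumerate(all_terms)}

# One row per document, one column per vocabulary term
matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    matrix.append([counts[term] for term in all_terms])

print(vocabulary)
for row in matrix:
    print(row)
```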


## Dict Vectorizer

DictVectorizer maps word-count dictionaries to vectors.

In [28]:
from sklearn.feature_extraction import DictVectorizer

docs = [{"Aku": 1, "suka": 1, "makan": 2}, {"Aku": 1, "tidak": 1, "suka": 2, "makan": 3, "kambing": 1, "bakar": 2, "madu": 3}]
dv = DictVectorizer()
X = dv.fit_transform(docs)
print(X)
print(X.todense())

  (0, 0)	1.0
  (0, 4)	2.0
  (0, 5)	1.0
  (1, 0)	1.0
  (1, 1)	2.0
  (1, 2)	1.0
  (1, 3)	3.0
  (1, 4)	3.0
  (1, 5)	2.0
  (1, 6)	1.0
[[1. 0. 0. 0. 2. 1. 0.]
 [1. 2. 1. 3. 3. 2. 1.]]
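
The output above does not show which column belongs to which word. As a small sketch, the fitted vectorizer's `feature_names_` attribute lists the keys in column order. Passing `sparse=False` is a choice made here only to get a dense array that prints directly.

```python
from sklearn.feature_extraction import DictVectorizer

docs = [{"Aku": 1, "suka": 1, "makan": 2},
        {"Aku": 1, "tidak": 1, "suka": 2, "makan": 3,
         "kambing": 1, "bakar": 2, "madu": 3}]

# sparse=False returns a dense NumPy array instead of a sparse matrix
dv = DictVectorizer(sparse=False)
X = dv.fit_transform(docs)

# feature_names_ holds the dictionary keys in column order (sorted by default)
print(dv.feature_names_)
print(X)
```

The keys are sorted as plain strings, so the capitalized "Aku" comes before all lowercase words.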


## TfIdf Vectorizer
A weighted word count: the term frequency is multiplied by the inverse document frequency.

Tutorials are available at the following links:
https://datascience.mipa.ugm.ac.id/id/representasi-teks-dalam-vektor-part-1/
https://datascience.mipa.ugm.ac.id/id/representasi-teks-dalam-vektor-part-2/

In [29]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

tfidf_vectorizer = TfidfVectorizer()
cv_vectorizer = CountVectorizer()
docs = ["Mayur is a Guitarist", "Mayur is Musician", "Mayur is also a programmer"]
X_idf = tfidf_vectorizer.fit_transform(docs)
X_cv = cv_vectorizer.fit_transform(docs)
print(X_idf.todense())
print(tfidf_vectorizer.vocabulary_)
print(X_cv.todense())

[[0.         0.76749457 0.45329466 0.45329466 0.         0.        ]
 [0.         0.         0.45329466 0.45329466 0.76749457 0.        ]
 [0.6088451  0.         0.35959372 0.35959372 0.         0.6088451 ]]
{'mayur': 3, 'is': 2, 'guitarist': 1, 'musician': 4, 'also': 0, 'programmer': 5}
[[0 1 1 1 0 0]
 [0 0 1 1 1 0]
 [1 0 1 1 0 1]]
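
The TF-IDF weights above can be reproduced by hand. The sketch below assumes the `TfidfVectorizer` defaults: a smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by scaling each document row to unit L2 norm. The tokenization mirrors the default pattern (lowercase, tokens of two or more word characters).

```python
import math
import re

docs = ["Mayur is a Guitarist", "Mayur is Musician", "Mayur is also a programmer"]

# Lowercase and keep tokens of two or more word characters ("a" is dropped)
token_pattern = re.compile(r"\b\w\w+\b")
tokenized = [token_pattern.findall(doc.lower()) for doc in docs]
vocab = sorted({term for tokens in tokenized for term in tokens})

# Document frequency: in how many documents each term appears
n_docs = len(docs)
doc_freq = {term: sum(term in tokens for tokens in tokenized) for term in vocab}

# Smoothed IDF (assumed default): ln((1 + n) / (1 + df)) + 1
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocab}

# TF * IDF per document, then divide each row by its L2 norm
rows = []
for tokens in tokenized:
    weights = [tokens.count(term) * idf[term] for term in vocab]
    norm = math.sqrt(sum(w * w for w in weights))
    rows.append([w / norm for w in weights])

print(vocab)
for row in rows:
    print([round(w, 8) for w in row])
```

Words that occur in every document ("mayur", "is") get the minimum idf of 1, so rarer words such as "guitarist" receive the larger weight within each row.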


# GROUP PROJECT

The goal of this project is to apply what has been learned in each Digital Talent session to a chosen "big dataset", and ultimately to find "insight" in that data.

Each group consists of 4-5 people, assigned at random by the instructor.

For project and dataset inspiration, please visit the following links:

- https://www.kaggle.com/datasets
- https://github.com/awesomedata/awesome-public-datasets
- https://data.go.id/
- https://www.quora.com/What-are-some-good-data-science-projects
- https://www.analyticsvidhya.com/blog/2018/05/24-ultimate-data-science-projects-to-boost-your-knowledge-and-skills/
- https://www.yelp.com/dataset/challenge

There are milestones that must be reported each week as a PDF:

Milestone 1:
- Project and dataset description
- Data exploration

Milestone 2:
- Exploration with descriptive statistics
- Research question

Milestone 3:
- Machine learning models that could be applied
- Rationale for the choice of method

Milestone 4:
- Discussion of the results of the research carried out

Milestone 5:
- Data visualization with the tools taught
- Answering the research question using the machine learning model

Milestone 6:
- Final report
- Preparing the presentation and poster
- Publication as a paper (optional)
