# MFM: Introduction to Text Mining

In [1]:
# for Python 2: use print only as a function
from __future__ import print_function

## Part 1: Machine Learning with Scikit-Learn (review)

In [2]:
# load the iris dataset
from sklearn.datasets import load_iris
iris = load_iris()

In [3]:
# store the feature matrix X and the target vector y
X = iris.data
y = iris.target

**"Features"** are also called attributes, predictors, or inputs. The **"target"** is also called the label.

In [4]:
# check the shapes of X and y
print(X.shape)
print(y.shape)

(150, 4)
(150,)


**"Observations"** are also called samples. Here `X` holds 150 observations (rows), each with 4 features (columns).

In [5]:
# view the first 5 rows of the feature matrix
import pandas as pd
pd.DataFrame(X, columns=iris.feature_names).head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


In [6]:
# view the label vector and the class names
print(y)
iris.target_names

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]


array(['setosa', 'versicolor', 'virginica'], 
      dtype='<U10')

To **build a model**, the features must be **numeric**, and every sample must have **the same features in the same order**.

In [7]:
# import the classifier
from sklearn.neighbors import KNeighborsClassifier

# instantiate the model with default parameters
knn = KNeighborsClassifier()

# train the model
knn.fit(X, y)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

To **make a prediction**, a new observation must have **the same features as the training data**, both in number and in meaning.

In [8]:
# predict the class of a new observation
knn.predict([[1, 1, 1, 1]])

array([0])
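
The requirement above can be checked directly. The following is a minimal sketch, assuming the same iris data and a default `KNeighborsClassifier`; the names `new_obs`, `pred_name`, and `shape_error` are illustrative and not part of the original notebook. It maps the predicted class index to its name and shows that an observation with the wrong number of features is rejected.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
knn = KNeighborsClassifier().fit(iris.data, iris.target)

# A valid observation: four numeric features in the training order
# (sepal length, sepal width, petal length, petal width).
new_obs = [[1, 1, 1, 1]]
pred = knn.predict(new_obs)
pred_name = iris.target_names[pred[0]]  # translate class index to class name
print(pred, pred_name)

# An observation with only three features does not match the training data,
# so the model cannot use it.
try:
    knn.predict([[1, 1, 1]])
    shape_error = None
except ValueError as err:
    shape_error = err
print("rejected:", shape_error is not None)
```

The model has no way to tell which feature is missing, which is why both the number and the order of the features must match the training data.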

## Part 2: The Bag of Words Model

In [32]:
# example texts for training the model
corpus = [
    'kami sedang belajar data science',
    'kami mempelajari Machine Learning untuk data text',
    'Data science adalah ilmu data',
    'kami sangat antusias'
]

From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):

> Text Analysis is a major application field for machine learning algorithms. However the raw data, a sequence of symbols cannot be fed directly to the algorithms themselves as most of them expect **numerical feature vectors with a fixed size** rather than the **raw text documents with variable length**.

We will use [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) to convert text into a matrix of token counts:

In [33]:
# import the bag-of-words vectorizer
from sklearn.feature_extraction.text import CountVectorizer

In [46]:
# instantiate the vectorizer with default parameters
vact = CountVectorizer()

In [47]:
# learn the vocabulary of the corpus
vact.fit(corpus)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [48]:
# view the vocabulary (on scikit-learn >= 1.2, use get_feature_names_out() instead)
vact.get_feature_names()

['adalah',
 'antusias',
 'belajar',
 'data',
 'ilmu',
 'kami',
 'learning',
 'machine',
 'mempelajari',
 'sangat',
 'science',
 'sedang',
 'text',
 'untuk']

In [49]:
# transform the corpus into a document-term matrix
X_vect = vact.transform(corpus)
X_vect

<4x14 sparse matrix of type '<class 'numpy.int64'>'
	with 19 stored elements in Compressed Sparse Row format>

In [50]:
# convert the sparse matrix to a dense matrix
X_vect.toarray()

array([[0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0],
       [0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1],
       [1, 0, 0, 2, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0],
       [0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0]], dtype=int64)

In [51]:
# view the matrix with the vocabulary as column names, using a pandas DataFrame
import pandas as pd
pd.DataFrame(X_vect.toarray(), columns=vact.get_feature_names())

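| | adalah | antusias | belajar | data | ilmu | kami | learning | machine | mempelajari | sangat | science | sedang | text | untuk |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 |
| 2 | 1 | 0 | 0 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |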


From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):

> In this scheme, features and samples are defined as follows:

> - Each individual token occurrence frequency (normalized or not) is treated as a **feature**.
> - The vector of all the token frequencies for a given document is considered a multivariate **sample**.

> A **corpus of documents** can thus be represented by a matrix with **one row per document** and **one column per token** (e.g. word) occurring in the corpus.

> We call **vectorization** the general process of turning a collection of text documents into numerical feature vectors. This specific strategy (tokenization, counting and normalization) is called the **Bag of Words** or "Bag of n-grams" representation. Documents are described by word occurrences while completely ignoring the relative position information of the words in the document.
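
To make the quoted scheme concrete, here is a minimal pure-Python sketch of the same counting process. It assumes the default `CountVectorizer` behavior seen in this notebook (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary); the helper names `tokenize` and `count_vector` are illustrative and not part of scikit-learn.

```python
import re
from collections import Counter

corpus = [
    'kami sedang belajar data science',
    'kami mempelajari Machine Learning untuk data text',
    'Data science adalah ilmu data',
    'kami sangat antusias',
]

# Same token pattern as the CountVectorizer default: 2+ word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(doc):
    """Lowercase the document and split it into tokens."""
    return TOKEN_PATTERN.findall(doc.lower())

# "fit": collect every distinct token and sort alphabetically.
vocabulary = sorted({tok for doc in corpus for tok in tokenize(doc)})

def count_vector(doc):
    """"transform": count how often each vocabulary token occurs in doc."""
    counts = Counter(tokenize(doc))
    return [counts.get(tok, 0) for tok in vocabulary]

# One row per document, one column per token; word order is discarded.
matrix = [count_vector(doc) for doc in corpus]
for row in matrix:
    print(row)
```

Each row has the same fixed length (the vocabulary size) regardless of how long the document is, which is exactly what a downstream model needs.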

In [52]:
# check the type of the corpus
type(corpus)

list

In [53]:
# print the corpus
print(corpus)

['kami sedang belajar data science', 'kami mempelajari Machine Learning untuk data text', 'Data science adalah ilmu data', 'kami sangat antusias']


From the [scikit-learn documentation](http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):

> As most documents will typically use a very small subset of the words used in the corpus, the resulting matrix will have **many feature values that are zeros** (typically more than 99% of them).

> For instance, a collection of 10,000 short text documents (such as emails) will use a vocabulary with a size in the order of 100,000 unique words in total while each document will use 100 to 1000 unique words individually.

> In order to be able to **store such a matrix in memory** but also to **speed up operations**, implementations will typically use a **sparse representation** such as the implementations available in the `scipy.sparse` package.
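
As a rough illustration of the sparse representation, the sketch below rebuilds the 4x14 count matrix shown earlier as a `scipy.sparse.csr_matrix` and inspects what is actually stored. The variable names (`dense`, `csr`, `density`) are illustrative only.

```python
import numpy as np
from scipy import sparse

# The dense document-term matrix for the four example documents.
dense = np.array([
    [0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0],
    [0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1],
    [1, 0, 0, 2, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0],
])

# Compressed Sparse Row format keeps only the non-zero entries.
csr = sparse.csr_matrix(dense)

n_cells = dense.shape[0] * dense.shape[1]
density = csr.nnz / n_cells  # fraction of cells that are non-zero

print(csr.shape, csr.nnz, round(density, 3))
print(csr.data)     # the non-zero values, row by row
print(csr.indices)  # the column index of each stored value
print(csr.indptr)   # where each row starts and ends inside data/indices
```

On this toy corpus about a third of the cells are non-zero, so the saving is small; with a vocabulary of thousands of tokens most cells are zero and the sparse format becomes essential.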

In [54]:
# example text for testing the model
simple_test = ["saya adalah data scientist"]

To **make a prediction**, a new observation must have **the same features as the training data**, both in number and in meaning.

In [56]:
# transform the new text into a document-term matrix using the fitted vocabulary
vect_test = vact.transform(simple_test)
vect_test

<1x14 sparse matrix of type '<class 'numpy.int64'>'
	with 2 stored elements in Compressed Sparse Row format>

In [57]:
# view the result as a pandas DataFrame
pd.DataFrame(vect_test.toarray(), columns=vact.get_feature_names())

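| | adalah | antusias | belajar | data | ilmu | kami | learning | machine | mempelajari | sangat | science | sedang | text | untuk |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |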


**Summary:**

- `vect.fit(train)` **learns the vocabulary** of the training data
- `vect.transform(train)` uses the **fitted vocabulary** to build a document-term matrix from the training data
- `vect.transform(test)` uses the **same fitted vocabulary** to build a document-term matrix from the test data; tokens not seen during fitting are ignored
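
The three steps can be sketched end to end as follows, reusing the example corpus from this part. This is an illustrative sketch rather than a cell from the original notebook; the names `train`, `test`, and `unseen` are assumptions made for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer

train = [
    'kami sedang belajar data science',
    'kami mempelajari Machine Learning untuk data text',
    'Data science adalah ilmu data',
    'kami sangat antusias',
]
test = ['saya adalah data scientist']

vect = CountVectorizer()
vect.fit(train)                    # learn the vocabulary from train only

train_dtm = vect.transform(train)  # 4 documents x 14 tokens
test_dtm = vect.transform(test)    # 1 document x the same 14 tokens

# Tokens in the test text that the fitted vocabulary does not contain.
analyzer = vect.build_analyzer()
unseen = [tok for tok in analyzer(test[0]) if tok not in vect.vocabulary_]

print(train_dtm.shape, test_dtm.shape)
print(unseen)
```

Because the test matrix is built from the training vocabulary, it has the same columns as the training matrix, which is what a fitted model expects.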

## Part 3: Loading the Data

In [60]:
# read the tab-separated file into a DataFrame
path = 'sms.tsv'
sms = pd.read_table(path, header=None, names=['label', 'message'])

In [61]:
# check the shape
sms.shape

(5572, 2)

In [62]:
# view the first rows
sms.head()

| | label | message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [64]:
# view the class distribution
sms.label.value_counts()

ham     4825
spam     747
Name: label, dtype: int64

In [93]:
# convert the label to a numeric variable (ham = 0, spam = 1)
sms['label_num'] = sms.label.map({'ham': 0, 'spam': 1})
sms.head()

| | label | message | label num | label_num |
|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | 0 | 0 |
| 1 | ham | Ok lar... Joking wif u oni... | 0 | 0 |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 1 | 1 |
| 3 | ham | U dun say so early hor... U c already then say... | 0 | 0 |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | 0 | 0 |


In [108]:
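# split the data into training and test sets (default: 75% train, 25% test)
# note: sklearn.cross_validation was removed in scikit-learn 0.20; use sklearn.model_selection
from sklearn.model_selection import train_test_split
X = sms.message
y = sms.label_num
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)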
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(4179,)
(1393,)
(4179,)
(1393,)


In [109]:
X_train.head()

710     4mths half price Orange line rental & latest c...
3740                           Did you stitch his trouser
2711    Hope you enjoyed your new content. text stop t...
3155    Not heard from U4 a while. Call 4 rude chat pr...
3748    Ü neva tell me how i noe... I'm not at home in...
Name: message, dtype: object

## Part 4: Vectorization

In [110]:
# instantiate the vectorizer
vect = CountVectorizer()

In [111]:
# learn the vocabulary of the training data, then build its document-term matrix
vect.fit(X_train)
X_train_vect = vect.transform(X_train)

In [112]:
# equivalent one-step alternative: fit and transform together
X_train_vect = vect.fit_transform(X_train)

In [113]:
# examine the document-term matrix
X_train_vect

<4179x7456 sparse matrix of type '<class 'numpy.int64'>'
	with 55209 stored elements in Compressed Sparse Row format>

In [114]:
# transform the test data with the vocabulary learned from the training data (transform only, no refit)
X_test_vec = vect.transform(X_test)

## Part 5: Classification

As an example, we will use [multinomial Naive Bayes](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html):

> The multinomial Naive Bayes classifier is suitable for classification with **discrete features** (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.
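
To show how word counts drive the classifier, here is a minimal from-scratch sketch of multinomial Naive Bayes with add-one (Laplace) smoothing. It is not scikit-learn's implementation: the functions `fit` and `predict` and the four training messages are hypothetical and exist only for this illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training set (NOT taken from the SMS data): 1 = spam, 0 = ham.
train = [
    ("win free prize now", 1),
    ("free entry win cash", 1),
    ("are you coming home", 0),
    ("see you at home now", 0),
]

def fit(docs, alpha=1.0):
    """Estimate log priors and smoothed log word likelihoods per class."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        tokens = text.split()
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    n_docs = sum(class_docs.values())
    log_prior = {c: math.log(n / n_docs) for c, n in class_docs.items()}
    log_likelihood = {}
    for c in class_docs:
        # Smoothed denominator: total word count in class + alpha per vocabulary word.
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return log_prior, log_likelihood

def predict(text, log_prior, log_likelihood):
    """Return the class with the highest log posterior score."""
    scores = {}
    for c in log_prior:
        # Words outside the training vocabulary are skipped.
        word_score = sum(
            log_likelihood[c][w] for w in text.split() if w in log_likelihood[c]
        )
        scores[c] = log_prior[c] + word_score
    return max(scores, key=scores.get)

log_prior, log_likelihood = fit(train)
spam_pred = predict("free cash now", log_prior, log_likelihood)
ham_pred = predict("you at home", log_prior, log_likelihood)
print(spam_pred, ham_pred)
```

Smoothing keeps a word that never appeared in one class from forcing that class's probability to zero, and working in log space avoids numerical underflow on long messages.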

In [115]:
# import and instantiate a multinomial Naive Bayes model
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB()

In [116]:
# train the model and time it with the IPython %time magic
%time nb.fit(X_train_vect,y_train)

Wall time: 5.05 ms


MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [117]:
# make class predictions for the test set
y_pred_class = nb.predict(X_test_vec)

In [118]:
# compute the accuracy (fraction of correct predictions)
import numpy as np
np.mean(y_pred_class==y_test)

0.98851399856424982

In [119]:
# compute the confusion matrix (rows: true class, columns: predicted class)
from sklearn import metrics
metrics.confusion_matrix(y_test,y_pred_class)

array([[1203,    5],
       [  11,  174]], dtype=int64)
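
The confusion matrix carries more information than accuracy alone. The arithmetic below is a small sketch that derives the usual metrics from the four counts reported above, assuming scikit-learn's convention that rows are true classes and columns are predicted classes (0 = ham, 1 = spam).

```python
# Counts copied from the confusion matrix above.
tn, fp = 1203, 5    # true ham:  predicted ham, predicted spam
fn, tp = 11, 174    # true spam: predicted ham, predicted spam

total = tn + fp + fn + tp

accuracy = (tp + tn) / total      # share of all messages classified correctly
recall = tp / (tp + fn)           # share of spam that was caught
precision = tp / (tp + fp)        # share of spam predictions that were right
specificity = tn / (tn + fp)      # share of ham that was left alone

# Baseline: accuracy obtained by always predicting the majority class (ham).
null_accuracy = max(tn + fp, fn + tp) / total

print(round(accuracy, 4), round(recall, 4), round(precision, 4))
print(round(specificity, 4), round(null_accuracy, 4))
```

Since most messages are ham, always predicting ham already gives a high accuracy, so recall and precision on the spam class are the more informative figures here.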

In [120]:
# compute predicted class probabilities; the columns are [P(ham), P(spam)]
y_pred_prob = nb.predict_proba(X_test_vec)
y_pred_prob[0]

array([ 0.99712255,  0.00287745])

## Part 6: Inference

In [121]:
# classify new, previously unseen messages
new_sms = ['hello, how are you?',
           'get your yoor entry now, click here!']
new_sms_vect = vect.transform(new_sms)
nb.predict(new_sms_vect)

array([0, 1], dtype=int64)
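
One way to make sure new messages always pass through the fitted vectorizer before they reach the classifier is to bundle both steps in a scikit-learn `Pipeline`. This is a sketch of that option, not the approach used in the notebook; the training messages are hypothetical toy data, not the SMS dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data for illustration only: 1 = spam, 0 = ham.
train_texts = [
    "win free prize now",
    "free entry win cash",
    "are you coming home",
    "see you at home now",
]
train_labels = [1, 1, 0, 0]

# fit() learns the vocabulary and trains the classifier in one call.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# predict() applies transform() with the training vocabulary, then classifies.
new_texts = ["free cash now", "you at home"]
preds = model.predict(new_texts)
probs = model.predict_proba(new_texts)

print(preds)
print(probs.round(3))
```

With a pipeline, raw text goes in and labels come out, so there is no separate `vect.transform` call to forget at inference time.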

In [123]:
# instantiate a vectorizer that removes English stop words
vect = CountVectorizer(stop_words='english')
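
The effect of `stop_words='english'` can be seen by comparing vocabularies with and without it. The two documents below are hypothetical examples, and the exact set of removed words depends on scikit-learn's built-in English stop word list.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical example documents (not from the SMS data).
docs = ["the prize is free", "reply to claim the prize"]

# Vocabulary with the default settings.
vocab_all = set(CountVectorizer().fit(docs).vocabulary_)

# Vocabulary after removing scikit-learn's built-in English stop words.
vocab_stop = set(CountVectorizer(stop_words='english').fit(docs).vocabulary_)

removed = sorted(vocab_all - vocab_stop)
print(sorted(vocab_all))
print(sorted(vocab_stop))
print(removed)
```

Removing very common words shrinks the vocabulary, at the risk of discarding words that happen to be informative for a particular task.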