### Task: Random Forest
Split the dataset into training and test sets, remembering to stratify. Then train a Random Forest model and extract its feature importances. Use them to perform feature selection, keeping only the features whose importance is greater than 0.001. Train a new Random Forest model with hyperparameter tuning using GridSearch. Use the vectorization techniques you have learned.

Submit the results to your Mentor as a Jupyter Notebook placed in your GitHub.

In [208]:
import numpy as np
import pandas as pd
import string
import nltk
import itertools
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer

In [209]:
spam_dataset = pd.read_csv('spam.csv', encoding="ISO-8859-1", usecols=[0, 1],
                           names=['Spam', 'Text'],
                           skiprows=1)
spam_dataset['Spam'] = spam_dataset['Spam'].replace(['ham', 'spam'], [0, 1])
spam_dataset

Unnamed: 0,Spam,Text
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,1,This is the 2nd time we have tried 2 contact u...
5568,0,Will Ì_ b going to esplanade fr home?
5569,0,"Pity, * was in mood for that. So...any other s..."
5570,0,The guy did some bitching but I acted like i'd...


In [210]:
print(spam_dataset['Spam'].value_counts(normalize=True))

0    0.865937
1    0.134063
Name: Spam, dtype: float64


### Removing punctuation

In [211]:
def remove_punctuation(text):
    # Iterating over a string yields single characters, so this filters char by char
    cleaned = ''.join([char for char in text if char not in string.punctuation])
    return cleaned
spam_dataset['Cleaned_Text'] = spam_dataset['Text'].apply(remove_punctuation)
spam_dataset

Unnamed: 0,Spam,Text,Cleaned_Text
0,0,"Go until jurong point, crazy.. Available only ...",Go until jurong point crazy Available only in ...
1,0,Ok lar... Joking wif u oni...,Ok lar Joking wif u oni
2,1,Free entry in 2 a wkly comp to win FA Cup fina...,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...,U dun say so early hor U c already then say
4,0,"Nah I don't think he goes to usf, he lives aro...",Nah I dont think he goes to usf he lives aroun...
...,...,...,...
5567,1,This is the 2nd time we have tried 2 contact u...,This is the 2nd time we have tried 2 contact u...
5568,0,Will Ì_ b going to esplanade fr home?,Will Ì b going to esplanade fr home
5569,0,"Pity, * was in mood for that. So...any other s...",Pity was in mood for that Soany other suggest...
5570,0,The guy did some bitching but I acted like i'd...,The guy did some bitching but I acted like id ...
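
The same cleaning step can also be written with `str.translate`, which removes all punctuation characters in a single pass. This is only a sketch of an alternative; the name `remove_punctuation_fast` is hypothetical and does not appear elsewhere in this notebook.

```python
import string

# Translation table that maps every ASCII punctuation character to None
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def remove_punctuation_fast(text):
    """Hypothetical helper: drop all characters found in string.punctuation."""
    return text.translate(_PUNCT_TABLE)

sample = "Ok lar... Joking wif u oni..."
cleaned_sample = remove_punctuation_fast(sample)
print(cleaned_sample)
```

On a pandas column it could be applied as `spam_dataset['Text'].apply(remove_punctuation_fast)`.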


### Tokenization

In [212]:
def tokenize(text):

    # Convert to lowercase
    clean_text = text.lower()

    # Tokenization
    tokenized_text = nltk.word_tokenize(clean_text)
    return tokenized_text

spam_dataset['Tokenized_Text'] = spam_dataset['Cleaned_Text'].apply(lambda x: tokenize(x))
spam_dataset


Unnamed: 0,Spam,Text,Cleaned_Text,Tokenized_Text
0,0,"Go until jurong point, crazy.. Available only ...",Go until jurong point crazy Available only in ...,"[go, until, jurong, point, crazy, available, o..."
1,0,Ok lar... Joking wif u oni...,Ok lar Joking wif u oni,"[ok, lar, joking, wif, u, oni]"
2,1,Free entry in 2 a wkly comp to win FA Cup fina...,Free entry in 2 a wkly comp to win FA Cup fina...,"[free, entry, in, 2, a, wkly, comp, to, win, f..."
3,0,U dun say so early hor... U c already then say...,U dun say so early hor U c already then say,"[u, dun, say, so, early, hor, u, c, already, t..."
4,0,"Nah I don't think he goes to usf, he lives aro...",Nah I dont think he goes to usf he lives aroun...,"[nah, i, dont, think, he, goes, to, usf, he, l..."
...,...,...,...,...
5567,1,This is the 2nd time we have tried 2 contact u...,This is the 2nd time we have tried 2 contact u...,"[this, is, the, 2nd, time, we, have, tried, 2,..."
5568,0,Will Ì_ b going to esplanade fr home?,Will Ì b going to esplanade fr home,"[will, ì, b, going, to, esplanade, fr, home]"
5569,0,"Pity, * was in mood for that. So...any other s...",Pity was in mood for that Soany other suggest...,"[pity, was, in, mood, for, that, soany, other,..."
5570,0,The guy did some bitching but I acted like i'd...,The guy did some bitching but I acted like id ...,"[the, guy, did, some, bitching, but, i, acted,..."


In [213]:
stopwords = nltk.corpus.stopwords.words("english")

def remove_stopwords(text):
    without_stopwords = [word for word in text if word not in stopwords]
    return without_stopwords
spam_dataset['WithoutStop_Text'] = spam_dataset['Tokenized_Text'].apply(lambda x: remove_stopwords(x))
spam_dataset

Unnamed: 0,Spam,Text,Cleaned_Text,Tokenized_Text,WithoutStop_Text
0,0,"Go until jurong point, crazy.. Available only ...",Go until jurong point crazy Available only in ...,"[go, until, jurong, point, crazy, available, o...","[go, jurong, point, crazy, available, bugis, n..."
1,0,Ok lar... Joking wif u oni...,Ok lar Joking wif u oni,"[ok, lar, joking, wif, u, oni]","[ok, lar, joking, wif, u, oni]"
2,1,Free entry in 2 a wkly comp to win FA Cup fina...,Free entry in 2 a wkly comp to win FA Cup fina...,"[free, entry, in, 2, a, wkly, comp, to, win, f...","[free, entry, 2, wkly, comp, win, fa, cup, fin..."
3,0,U dun say so early hor... U c already then say...,U dun say so early hor U c already then say,"[u, dun, say, so, early, hor, u, c, already, t...","[u, dun, say, early, hor, u, c, already, say]"
4,0,"Nah I don't think he goes to usf, he lives aro...",Nah I dont think he goes to usf he lives aroun...,"[nah, i, dont, think, he, goes, to, usf, he, l...","[nah, dont, think, goes, usf, lives, around, t..."
...,...,...,...,...,...
5567,1,This is the 2nd time we have tried 2 contact u...,This is the 2nd time we have tried 2 contact u...,"[this, is, the, 2nd, time, we, have, tried, 2,...","[2nd, time, tried, 2, contact, u, u, å£750, po..."
5568,0,Will Ì_ b going to esplanade fr home?,Will Ì b going to esplanade fr home,"[will, ì, b, going, to, esplanade, fr, home]","[ì, b, going, esplanade, fr, home]"
5569,0,"Pity, * was in mood for that. So...any other s...",Pity was in mood for that Soany other suggest...,"[pity, was, in, mood, for, that, soany, other,...","[pity, mood, soany, suggestions]"
5570,0,The guy did some bitching but I acted like i'd...,The guy did some bitching but I acted like id ...,"[the, guy, did, some, bitching, but, i, acted,...","[guy, bitching, acted, like, id, interested, b..."


### Lemmatization

In [214]:
lemmater = nltk.WordNetLemmatizer()
def lemmatizing(text):
    lemmatized_words = [lemmater.lemmatize(word) for word in text]
    return lemmatized_words
spam_dataset['Lemmatized_Text'] = spam_dataset['WithoutStop_Text'].apply(lambda x: lemmatizing(x))
spam_dataset

Unnamed: 0,Spam,Text,Cleaned_Text,Tokenized_Text,WithoutStop_Text,Lemmatized_Text
0,0,"Go until jurong point, crazy.. Available only ...",Go until jurong point crazy Available only in ...,"[go, until, jurong, point, crazy, available, o...","[go, jurong, point, crazy, available, bugis, n...","[go, jurong, point, crazy, available, bugis, n..."
1,0,Ok lar... Joking wif u oni...,Ok lar Joking wif u oni,"[ok, lar, joking, wif, u, oni]","[ok, lar, joking, wif, u, oni]","[ok, lar, joking, wif, u, oni]"
2,1,Free entry in 2 a wkly comp to win FA Cup fina...,Free entry in 2 a wkly comp to win FA Cup fina...,"[free, entry, in, 2, a, wkly, comp, to, win, f...","[free, entry, 2, wkly, comp, win, fa, cup, fin...","[free, entry, 2, wkly, comp, win, fa, cup, fin..."
3,0,U dun say so early hor... U c already then say...,U dun say so early hor U c already then say,"[u, dun, say, so, early, hor, u, c, already, t...","[u, dun, say, early, hor, u, c, already, say]","[u, dun, say, early, hor, u, c, already, say]"
4,0,"Nah I don't think he goes to usf, he lives aro...",Nah I dont think he goes to usf he lives aroun...,"[nah, i, dont, think, he, goes, to, usf, he, l...","[nah, dont, think, goes, usf, lives, around, t...","[nah, dont, think, go, usf, life, around, though]"
...,...,...,...,...,...,...
5567,1,This is the 2nd time we have tried 2 contact u...,This is the 2nd time we have tried 2 contact u...,"[this, is, the, 2nd, time, we, have, tried, 2,...","[2nd, time, tried, 2, contact, u, u, å£750, po...","[2nd, time, tried, 2, contact, u, u, å£750, po..."
5568,0,Will Ì_ b going to esplanade fr home?,Will Ì b going to esplanade fr home,"[will, ì, b, going, to, esplanade, fr, home]","[ì, b, going, esplanade, fr, home]","[ì, b, going, esplanade, fr, home]"
5569,0,"Pity, * was in mood for that. So...any other s...",Pity was in mood for that Soany other suggest...,"[pity, was, in, mood, for, that, soany, other,...","[pity, mood, soany, suggestions]","[pity, mood, soany, suggestion]"
5570,0,The guy did some bitching but I acted like i'd...,The guy did some bitching but I acted like id ...,"[the, guy, did, some, bitching, but, i, acted,...","[guy, bitching, acted, like, id, interested, b...","[guy, bitching, acted, like, id, interested, b..."


In [215]:
# define dataset
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(spam_dataset['Lemmatized_Text'].apply(lambda x: ' '.join(x)))
print(X.shape)
y = spam_dataset['Spam']
print(y.shape)

(5572, 8841)
(5572,)


In [297]:
feature_names = tfidf.get_feature_names_out()
feature_names

array(['008704050406', '0089my', '0121', ..., 'ûïharry', 'ûò', 'ûówell'],
      dtype=object)

In [217]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, 
                                    random_state=42, stratify=y)
print('Training observations: %d\nTest observations: %d' % (X_train.shape[0],
                                                             X_test.shape[0]))

Training observations: 4457
Test observations: 1115
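
To illustrate what `stratify=y` does, the sketch below splits a small synthetic label vector with roughly the same imbalance as the spam data and compares the share of positives in both parts. The arrays `X_demo` and `y_demo` are made up for the illustration and are not the spam dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels (illustrative only): 870 zeros and 130 ones
y_demo = np.array([0] * 870 + [1] * 130)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratification both parts keep (almost exactly) the original class ratio
train_share = y_tr.mean()
test_share = y_te.mean()
print(f'positive share in train: {train_share:.3f}, in test: {test_share:.3f}')
```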


In [218]:
from sklearn.ensemble import RandomForestClassifier

# define the model
clf = RandomForestClassifier()
# fit the model
clf.fit(X_train, y_train)


In [219]:
print(f'model score on training data: {clf.score(X_train, y_train)}')
print(f'model score on testing data: {clf.score(X_test, y_test)}')

model score on training data: 1.0
model score on testing data: 0.9766816143497757


#### The score on the training set is perfect (1.0), while the score on the test set is lower (about 0.977). The gap suggests that the model is overfitting to some degree.
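
A training score is an optimistic estimate because the model has already seen those rows. One common way to get a less optimistic number without touching the test set is k-fold cross-validation. The sketch below uses synthetic data from `make_classification`, not the spam data, so the numbers it prints say nothing about this notebook's model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced stand-in data (assumption: illustration only)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, weights=[0.87, 0.13], random_state=0
)

forest = RandomForestClassifier(n_estimators=50, random_state=0)

# Accuracy on 5 held-out folds; each fold is scored by a model that never saw it
cv_scores = cross_val_score(forest, X_demo, y_demo, cv=5, scoring='accuracy')

# Accuracy on the very data the model was fitted on
train_score = forest.fit(X_demo, y_demo).score(X_demo, y_demo)

print(f'training accuracy: {train_score:.3f}')
print(f'mean CV accuracy:  {cv_scores.mean():.3f}')
```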

In [220]:
# get importance
importances = clf.feature_importances_

In [221]:
clf.feature_importances_

array([9.08151593e-05, 3.68479994e-05, 1.31057009e-04, ...,
       8.80853268e-05, 9.29887075e-06, 0.00000000e+00])

In [222]:
features = pd.DataFrame({'names':feature_names, 'importances':importances})

In [223]:
importances2 = features[features['importances']>0.001]
importances2

Unnamed: 0,names,importances
48,0800,0.001763
51,08000839402,0.001810
52,08000930705,0.003180
291,100,0.005590
292,1000,0.008115
...,...,...
8406,weekly,0.002570
8418,welcome,0.001012
8495,win,0.011967
8504,winner,0.001831


In [289]:
from sklearn.model_selection import GridSearchCV
Tfidf = TfidfVectorizer()
parameters = {
        'min_df': [0.001, 0.002, 0.003],
        'max_df': [0.3, 0.4, 0.5],
        'use_idf':[True, False]
        }
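# NOTE: TfidfVectorizer is a transformer with no predict() method, so the
# 'f1_macro' scorer has nothing to evaluate here; the search is also fitted on
# the whole DataFrame rather than on the text column and the labels.
gridsearch = GridSearchCV(Tfidf,
                          parameters,
                          scoring='f1_macro')
gridsearch.fit(spam_dataset)
print('\nBest hyperparameter:', gridsearch.best_params_)
Tfidf_grid = gridsearch.best_estimator_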


Best hyperparameter: {'max_df': 0.3, 'min_df': 0.001, 'use_idf': True}
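
Vectorizer parameters are usually tuned together with a classifier, because a scorer such as `f1_macro` needs predictions. A minimal sketch of that setup with a `Pipeline` is shown below. The tiny corpus, the step names `tfidf` and `clf`, and the parameter grid are all made up for the illustration and are not the settings used in this notebook.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny made-up corpus (assumption: stands in for the lemmatized messages)
texts = [
    'free entry win cup final', 'win cash prize call now',
    'urgent winner claim prize', 'free txt win voucher',
    'call now free ringtone', 'claim free cash today',
    'ok see you later', 'going home now', 'think he lives around here',
    'lunch at noon today', 'call me when you get home', 'see you at lunch',
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', RandomForestClassifier(n_estimators=20, random_state=0)),
])

# The 'tfidf__' prefix routes each parameter to the vectorizer step
param_grid = {
    'tfidf__use_idf': [True, False],
    'tfidf__ngram_range': [(1, 1), (1, 2)],
}

search = GridSearchCV(pipe, param_grid, scoring='f1_macro', cv=3)
search.fit(texts, labels)
print(search.best_params_)
```

With this layout the vectorizer is refitted inside every cross-validation fold, so the vocabulary is never learned from the held-out rows.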


In [290]:
tfidf2 = TfidfVectorizer(max_df = 0.3, min_df = 0.001, use_idf = True)
X = tfidf2.fit_transform(spam_dataset['Lemmatized_Text'].apply(lambda x: ' '.join(x)))
print(X.shape)
y = spam_dataset['Spam']
print(y.shape)

(5572, 1364)
(5572,)


In [291]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, 
                                    random_state=42, stratify=y)
print('Training observations: %d\nTest observations: %d' % (X_train.shape[0],
                                                             X_test.shape[0]))

Training observations: 4457
Test observations: 1115


In [282]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

clf_2 = RandomForestClassifier()
parameters = {
        'max_depth': [5, 10, 15],
        'min_samples_leaf': [1, 3, 5],
        'n_estimators':[100, 500, 1000]
        }
gridsearch = GridSearchCV(clf_2,
                        parameters,
                        scoring='f1_macro'
                        )
gridsearch.fit(X_train, y_train)
print('\nBest hyperparameter:', gridsearch.best_params_)
clf_2_grid = gridsearch.best_estimator_


Best hyperparameter: {'max_depth': 15, 'min_samples_leaf': 1, 'n_estimators': 100}


In [284]:
from sklearn.ensemble import RandomForestClassifier
clf_2 = RandomForestClassifier(n_estimators=100, max_depth=15, min_samples_leaf=1, random_state=0)
clf_2.fit(X_train, y_train)
clf_2.score(X_test, y_test)

0.9183856502242153

In [285]:
print(f'model score on training data: {clf_2.score(X_train, y_train)}')
print(f'model score on testing data: {clf_2.score(X_test, y_test)}')

model score on training data: 0.9322414179941665
model score on testing data: 0.9183856502242153


In [287]:
clf_3 = RandomForestClassifier(n_estimators=100, max_depth=15, min_samples_leaf=1, random_state=0)
clf_3.fit(X, y)
# NOTE: this scores clf_2 (fitted on the training split), not the newly fitted clf_3
clf_2.score(X, y)

0.9294687724335966

In [None]:
tfidf2 = TfidfVectorizer(max_df=0.5, min_df=0.001, use_idf=True)
X = tfidf2.fit_transform(spam_dataset['Lemmatized_Text'].apply(lambda x: ' '.join(x)))
print(X.shape)
y = spam_dataset['Spam']
print(y.shape)

(5572, 1364)
(5572,)


In [293]:
from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer(max_df = 0.3, min_df = 0.001)
X_count = count.fit_transform(spam_dataset['Lemmatized_Text'].apply(lambda x: ' '.join(x)))
X_count

<5572x1364 sparse matrix of type '<class 'numpy.int64'>'
	with 35142 stored elements in Compressed Sparse Row format>

In [294]:
count.vocabulary_

{'go': 495,
 'point': 901,
 'crazy': 295,
 'available': 136,
 'bugis': 198,
 'great': 507,
 'world': 1325,
 'la': 635,
 'cine': 246,
 'got': 504,
 'wat': 1273,
 'ok': 829,
 'lar': 640,
 'joking': 613,
 'wif': 1299,
 'free': 461,
 'entry': 395,
 'wkly': 1313,
 'comp': 266,
 'win': 1303,
 'cup': 299,
 'final': 438,
 'may': 726,
 'text': 1153,
 'receive': 947,
 'txt': 1218,
 'apply': 115,
 'dun': 369,
 'say': 992,
 'early': 373,
 'already': 97,
 'nah': 790,
 'dont': 356,
 'think': 1164,
 'usf': 1241,
 'life': 665,
 'around': 123,
 'though': 1170,
 'freemsg': 462,
 'hey': 543,
 'darling': 313,
 'week': 1283,
 'word': 1321,
 'back': 148,
 'id': 578,
 'like': 668,
 'fun': 476,
 'still': 1103,
 'xxx': 1339,
 'std': 1102,
 'send': 1012,
 '150': 19,
 'even': 399,
 'brother': 192,
 'speak': 1083,
 'treat': 1203,
 'per': 869,
 'request': 962,
 'set': 1019,
 'caller': 206,
 'press': 912,
 'copy': 282,
 'friend': 467,
 'winner': 1305,
 'valued': 1248,
 'network': 799,
 'customer': 302,
 'selected':

In [295]:
clf_v2 = RandomForestClassifier(max_depth=2, random_state=0)
clf_v2.fit(X_count, y)
clf_v2.score(X_count, y)

0.8659368269921034
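
The accuracy above matches the share of class 0 printed earlier (0.865937), which would be consistent with a model that predicts only the majority class. Comparing against a majority-class baseline makes this kind of result easy to spot. The sketch below uses synthetic labels rather than the spam data, and `DummyClassifier` serves only as the reference point.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic, imbalanced labels (illustrative only): 870 zeros and 130 ones
y_demo = np.array([0] * 870 + [1] * 130)
X_demo = np.zeros((len(y_demo), 1))  # features are ignored by the baseline

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_demo, y_demo)

baseline_acc = baseline.score(X_demo, y_demo)
print(f'majority-class accuracy: {baseline_acc:.3f}')
```

A classifier whose accuracy equals this baseline has learned nothing useful about the minority class, which is why metrics such as `f1_macro` are often preferred on imbalanced data.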

In [296]:
print(X_count.toarray()[:5])

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
