### Ensemble Method

Create **Multiple** Models and Combines them to Produce Better Results than any of the **Single Individual** Model.

**Random Forest**

1. Constructs a **Collection** of `Decision Trees` 
 
2. **Aggregate** the Predictions of each tree to determine `Final Prediction`

`Import` Data

In [1]:
import nltk
import pandas as pd
import re
from sklearn.feature_extraction.text import TfidfVectorizer
import string

df = pd.read_csv('../Data/SMSSpamCollection.tsv', sep='\t', header=None, names=['Label','SMS'])
df.head()

Unnamed: 0,Label,SMS
0,ham,I've been searching for the right words to tha...
1,spam,Free entry in 2 a wkly comp to win FA Cup fina...
2,ham,"Nah I don't think he goes to usf, he lives aro..."
3,ham,Even my brother is not like to speak with me. ...
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!


In [2]:
def count_punctuation(text):
    count = sum([1 for char in text if char in string.punctuation]) 
    return round(count/(len(text) - text.count(' ')),3)*100 # Excluding Whitespace

df['SMS_Length'] = df['SMS'].apply(lambda x : len(x) - x.count(' ')) # Excluding Whitespace
df['Punctuation%'] = df['SMS'].apply(lambda x : count_punctuation(x))
df.head()

Unnamed: 0,Label,SMS,SMS_Length,Punctuation%
0,ham,I've been searching for the right words to tha...,160,2.5
1,spam,Free entry in 2 a wkly comp to win FA Cup fina...,128,4.7
2,ham,"Nah I don't think he goes to usf, he lives aro...",49,4.1
3,ham,Even my brother is not like to speak with me. ...,62,3.2
4,ham,I HAVE A DATE ON SUNDAY WITH WILL!!,28,7.1


`Clean` Data

In [3]:
stopwords = nltk.corpus.stopwords.words('english')
ps = nltk.PorterStemmer()

In [4]:
def clean_text(text):
    no_punctuation = ''.join([word.lower() for word in text if word not in string.punctuation])
    tokens = re.split('\W+', text)
    stems = [ps.stem(word) for word in tokens if word not in stopwords] # Remove Stopwords
    return stems

`Apply` Vectorizer

In [5]:
tfidf = TfidfVectorizer(analyzer=clean_text)
tfidf_vector = tfidf.fit_transform(df['SMS'])

tfidf_vector_df = pd.DataFrame(tfidf_vector.toarray())

# Create Feature

X = pd.concat([df['SMS_Length'], df['Punctuation%'], tfidf_vector_df], axis=1)

X.head()

Unnamed: 0,SMS_Length,Punctuation%,0,1,2,3,4,5,6,7,...,7521,7522,7523,7524,7525,7526,7527,7528,7529,7530
0,160,2.5,0.053151,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,128,4.7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,49,4.1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,62,3.2,0.074069,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,28,7.1,0.092792,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


`Explore` Random Forest Classifier `Attributes` and `Hyperparameters`

In [6]:
from sklearn.ensemble import RandomForestClassifier
print(dir(RandomForestClassifier))

['__abstractmethods__', '__annotations__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__len__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_abc_impl', '_check_n_features', '_estimator_type', '_get_param_names', '_get_tags', '_make_estimator', '_more_tags', '_repr_html_', '_repr_html_inner', '_repr_mimebundle_', '_required_parameters', '_set_oob_score', '_validate_X_predict', '_validate_data', '_validate_estimator', '_validate_y_class_weight', 'apply', 'decision_path', 'feature_importances_', 'fit', 'get_params', 'predict', 'predict_log_proba', 'predict_proba', 'score', 'set_params']


1. `feature_importances` : Value of Each Feature to the Model
    
2. `fit` : Train the Model and can be stored as `Object`
    
3. `predict` : Make Predictions    

`Explore` Random Forest Classifier through `Cross Validation`

In [7]:
from sklearn.model_selection import KFold, cross_val_score

In [8]:
rfc = RandomForestClassifier(n_jobs=-1) # Using All Processor Run Fast
Fold5 = KFold(n_splits=5)
cross_val_score(estimator=rfc, 
                X=X, 
                y=df['Label'], 
                cv=Fold5, 
                scoring='accuracy', 
                n_jobs=-1)

array([0.98025135, 0.98384201, 0.98025135, 0.97574124, 0.98113208])

`Explore` Random Forest on a `Holdout` Test Set

In [9]:
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, df['Label'], test_size=0.2, random_state=42)

In [10]:
rfc = RandomForestClassifier(n_estimators=50, max_depth=20, n_jobs=-1)

model = rfc.fit(X_train, y_train)

In [11]:
model.feature_importances_

array([0.04042974, 0.00751891, 0.01138249, ..., 0.00036286, 0.        ,
       0.        ])

In [12]:
sorted(zip(model.feature_importances_, X_train.columns), reverse=True)[0:10]

[(0.0443379238680936, 1897),
 (0.040429735979576435, 'SMS_Length'),
 (0.03938943085378318, 6905),
 (0.02810931484955495, 2087),
 (0.026927786805321527, 4573),
 (0.026416914937376138, 7413),
 (0.024824519403682434, 3129),
 (0.019652087205164627, 6622),
 (0.01896476084000703, 6933),
 (0.015984736377569756, 5932)]

`Prediction`

In [13]:
y_pred = model.predict(X_test)

`Performance`

In [14]:
precision, recall, fscore, support = score(y_test, y_pred, pos_label='spam', average='binary')

In [15]:
print(f'Precision : {precision*100:.0f}%')
print(f'Recall : {round(recall,2)*100:.2f}%')
print(f'Accuracy : {((y_pred==y_test).sum() / len(y_pred))*100:.2f}%')

Precision : 100%
Recall : 66.00%
Accuracy : 95.42%


`Precision` : All Mail in the `Spam` Folder is Actually `Spam`

`Recall` : All Spam that has come into `Email` was properly placed in the `Spam` Folder.

`Accuracy` : Emails that have Arrived into Inbox were Correctly Identified as `Spam` or `Ham`.