<p style="color:#153462; 
          font-weight: bold; 
          font-size: 30px; 
          font-family: Gill Sans, sans-serif; 
          text-align: center;">
          Building a Random Forest Model Using TF-IDF Vectors</p>

### Importing Required Modules

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier

### Reading the Data

In [7]:
X_train = pd.read_csv("data/X_train.csv")
X_test = pd.read_csv("data/X_test.csv")
y_train = pd.read_csv("data/y_train.csv")
y_test = pd.read_csv("data/y_test.csv")

In [10]:
X_train["clean_text"]

0       ['nothing', 'really', 'making', 'sure', 'every...
1       ['urgent', 'urgent', '800', 'free', 'flights',...
2       ['cashbincouk', 'get', 'lots', 'cash', 'weeken...
3       ['thinking', 'going', 'reg', 'pract', 'lessons...
4        ['moji', 'informed', 'saved', 'lives', 'thanks']
                              ...                        
4452                                  ['good', 'evening']
4453                    ['sorry', 'ill', 'call', 'later']
4454    ['mind', 'blastin', 'tsunamis', 'occur', 'rajn...
4455    ['hi', 'darlin', 'im', 'missin', 'u', 'hope', ...
4456    ['must', 'come', 'later', 'normally', 'bathe',...
Name: clean_text, Length: 4457, dtype: object

### Vectorizing the Data

In [12]:
# Fit on the training text only, then apply the same vocabulary to both splits
tf_idf_vect = TfidfVectorizer()
tf_idf_vect.fit(X_train["clean_text"])
X_train_vect = tf_idf_vect.transform(X_train["clean_text"])
X_test_vect = tf_idf_vect.transform(X_test["clean_text"])
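
The fit-then-transform pattern used here can be illustrated on a toy corpus. This is a minimal sketch, not the notebook's data: `train_docs` and `test_docs` are made-up messages. It shows that the vocabulary comes from the training text only, so a word seen only at test time gets no column.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical, not the SMS data used in this notebook)
train_docs = ["free cash prize", "call me later", "free call now"]
test_docs = ["free lottery prize"]  # "lottery" never appears in train_docs

vect = TfidfVectorizer()
vect.fit(train_docs)  # learns the vocabulary and IDF weights from training text only
train_matrix = vect.transform(train_docs)
test_matrix = vect.transform(test_docs)

print(sorted(vect.vocabulary_))               # terms learned from train_docs
print(train_matrix.shape, test_matrix.shape)  # both splits share the same columns
print("lottery" in vect.vocabulary_)          # unseen test words are dropped
```

Because both matrices share the same columns, a model trained on one can score the other directly.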

In [13]:
# Show the first 10 (term, column index) pairs from the learned vocabulary
for key, value in list(tf_idf_vect.vocabulary_.items())[:10]:
    print(key, value)

nothing 5222
really 6023
making 4671
sure 7089
everybodys 2869
speed 6815
urgent 7700
800 710
free 3193
flights 3105


In [15]:
# The vectorized data is stored as a sparse matrix
X_test_vect[0]

<1x8236 sparse matrix of type '<class 'numpy.float64'>'
	with 16 stored elements in Compressed Sparse Row format>

In [16]:
# Convert the sparse row to a dense array
X_test_vect[0].toarray()

array([[0., 0., 0., ..., 0., 0., 0.]])

In [19]:
type(X_test_vect)

scipy.sparse._csr.csr_matrix

### Fitting a Random Forest Model on the Data

In [21]:
rf = RandomForestClassifier()
# y_train is read as a one-column DataFrame; ravel() flattens it to the 1-D array fit() expects
rf_model = rf.fit(X_train_vect, y_train.values.ravel())

In [23]:
pred = rf_model.predict(X_test_vect)
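
The vectorizing and fitting steps can also be chained so the same transformation is always applied before prediction. The following is a minimal sketch under assumptions: the texts and labels are invented, and `n_estimators=50` and `random_state=0` are illustrative choices, not settings used in this notebook. Fixing `random_state` makes repeated runs give the same forest.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 1 marks a spam-like message, 0 a normal one
texts = ["free cash prize now", "urgent free flights", "sorry call later",
         "good evening", "win cash urgent", "see you at home"]
labels = [1, 1, 0, 0, 1, 0]

def build_and_fit(seed):
    # The pipeline vectorizes raw text, then passes the result to the forest
    pipe = make_pipeline(
        TfidfVectorizer(),
        RandomForestClassifier(n_estimators=50, random_state=seed),
    )
    return pipe.fit(texts, labels)

new_messages = ["free prize", "call you later"]
pred_a = build_and_fit(0).predict(new_messages)
pred_b = build_and_fit(0).predict(new_messages)
print(pred_a, pred_b)  # same seed, same predictions
```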

In [25]:
print(classification_report(y_test, pred))

              precision    recall  f1-score   support

           0       0.97      1.00      0.98       937
           1       1.00      0.81      0.90       178

    accuracy                           0.97      1115
   macro avg       0.98      0.91      0.94      1115
weighted avg       0.97      0.97      0.97      1115
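
In the report above, class 1 has precision 1.00 and recall 0.81. So nothing was wrongly flagged as class 1, but roughly 144 of the 178 true class 1 messages (0.81 × 178) were found. Class 1 is presumably the spam class, though the notebook does not state the label encoding. The sketch below shows how these two numbers come out of a confusion matrix, using invented labels rather than the notebook's predictions.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: five positives, one of them missed, no false alarms
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 0]

# For binary labels, ravel() yields the counts in this order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

recall = tp / (tp + fn)     # share of true positives that were found
precision = tp / (tp + fp)  # share of predicted positives that were correct
print(recall, recall_score(y_true, y_pred))
print(precision, precision_score(y_true, y_pred))
```

A pattern like this one (perfect precision, lower recall) means the model is conservative: it misses some positives rather than raising false alarms.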

