### Random Forest Classifier based on NLP:

#### Aim:---

1.  Given a text as input, predict whether its sentiment is positive or negative using a Random Forest Classifier.

#### Steps used in this Algorithm:---

1.  Import all the necessary libraries

2.  Download all the necessary NLTK libraries

3.  Create Sample Dataset

4.  Perform the Text Preprocessing Function

5.  Apply Cleaning

6.  Convert Text to Numerical Features (TF-IDF)

7.  Divide the data into the independent and dependent features

8.  Perform the Train-Test Split

9.  Train the Random Forest Classifier Model

10.  Make Predictions for the model

11. Evaluate the Model

12. Test with Custom Sentence

### Step 1:  Import all the necessary libraries

In [651]:
import  numpy              as   np
import  pandas             as   pd
import  matplotlib.pyplot  as  plt
import  seaborn            as  sns

import  nltk
from    nltk.tokenize                    import word_tokenize, RegexpTokenizer
from    nltk.corpus                      import stopwords


from    sklearn.feature_extraction.text  import TfidfVectorizer

from    sklearn.model_selection          import train_test_split
from    sklearn.preprocessing            import StandardScaler
from    sklearn.metrics                  import accuracy_score, confusion_matrix, classification_report
from    sklearn.ensemble                 import RandomForestClassifier


### OBSERVATIONS:

1.  numpy  ------------------>  Computation of the numerical array

2.  pandas ------------------>  Data Cleaning and Manipulation

3.  matplotlib -------------->  Data Visualization

4.  seaborn   --------------->  Statistical data visualization built on top of matplotlib

5.  nltk -------------------->  Toolkit that bundles the text preprocessing utilities used below

6.  tokenize ---------------->  breaks the text into smaller units (tokens)

7.  word_tokenize ----------->  breaks the text into words

8.  corpus ------------------>  gives access to bundled text collections and word lists

9.  stopwords --------------->  very common words (such as "the", "is", "in") that carry little meaning on their own

10. feature_extraction ------>  extracting all the essential information from the features

11. TfidfVectorizer  -------->  converts the text into Tfidf score matrix

12. train_test_split -------->  split the data into training and testing data

13. StandardScaler ---------->  standardizes each feature to zero mean and unit variance (imported here but not used in this notebook)

14. RandomForestClassifier ------>  combines many decision trees to improve on the performance of a single tree

15. RegexpTokenizer ----------> splits the text into tokens using a regular expression pattern

### Step 2: Download all the necessary NLTK libraries

In [652]:
nltk.download('punkt_tab')
nltk.download('average_perceptron_tagger_eng')
nltk.download('stopwords')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Error loading average_perceptron_tagger_eng: Package
[nltk_data]     'average_perceptron_tagger_eng' not found in index
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

### OBSERVATIONS:

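1.  punkt_tab --------------->   Tokenizer model used by word_tokenize

2.  average_perceptron_tagger_eng ------> Intended to be the POS tagging model, but the download fails because the package name is misspelled; the NLTK resource is named averaged_perceptron_tagger_eng. POS tagging is not used in this notebook, so the failed download does not affect the later steps.

3.  stopwords         -------------------> Stop word lists for several languages, including English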

### Step 3: Create Sample Dataset

In [653]:
corpus = [
    "I love data science",
    "Machine learning is amazing",
    "Python is great for AI",
    "Deep learning is powerful",
    "I enjoy coding",
    "This project is fantastic",
    "I hate bugs in code",
    "Debugging is frustrating",
    "Errors make me angry",
    "This is a terrible experience",
    "I dislike bad code",
    "The results are disappointing"
]

labels = [1,1,1,1,1,1,0,0,0,0,0,0]


### Construct the dataframe from the corpus and labels

df = pd.DataFrame(corpus,columns=['text'])

In [654]:
### Add the labels to the dataframe

df['label'] = labels

In [655]:
df

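|    | text                          | label |
|----|-------------------------------|-------|
| 0  | I love data science           | 1     |
| 1  | Machine learning is amazing   | 1     |
| 2  | Python is great for AI        | 1     |
| 3  | Deep learning is powerful     | 1     |
| 4  | I enjoy coding                | 1     |
| 5  | This project is fantastic     | 1     |
| 6  | I hate bugs in code           | 0     |
| 7  | Debugging is frustrating      | 0     |
| 8  | Errors make me angry          | 0     |
| 9  | This is a terrible experience | 0     |
| 10 | I dislike bad code            | 0     |
| 11 | The results are disappointing | 0     |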


### OBSERVATIONS:

1. The dataset is constructed.

2. It has two columns. One is text and the other is label.

3. The input column represents the text.

4. The output column represents the label.

5. The text is stored as strings, so it must be converted into numerical form before a machine learning model can be trained on it.

### Step 4:  Perform the Text Preprocessing Function

In [656]:
reg = RegexpTokenizer(r'\w+')

In [657]:
### define the text-cleaning function

def clean_text(text):
    ### convert the text to lower case
    text = text.lower()

    ### keep only runs of word characters, which drops punctuation and special symbols
    tokens = reg.tokenize(text)

    ### join the tokens back into a single string
    ans = " ".join(tokens)

    ### tokenize the punctuation-free string into words
    words = word_tokenize(ans)

    ### load the English stop words as a set for fast membership tests
    english_stopwords = set(stopwords.words("english"))

    ### drop every token that is a stop word
    res = [x for x in words if x not in english_stopwords]

    ### join the remaining tokens back into a single string
    return " ".join(res)

### OBSERVATIONS:

1. The "clean_text" function is defined.

2. It accepts one text string as input.

3. The text is first converted to lower case.

4. The regular expression tokenizer keeps only runs of word characters, which removes punctuation and special symbols.

5. The resulting list of tokens is joined back into a single string.

6. Word tokenization is applied to that string to split it into words.

7. The list of English stop words is loaded.

8. All English stop words are removed from the tokenized words to get the filtered text.

9. The filtered list of words is joined back into a single string, which is returned.
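
The steps above can be sketched without NLTK, which may help when checking the logic in isolation. This is only an approximation of the notebook's function: `clean_text_sketch` and the tiny `SAMPLE_STOPWORDS` set are hypothetical names introduced here, and NLTK's real English stop word list is much longer.

```python
import re

# Hypothetical, deliberately small stop word set (assumption for illustration);
# NLTK's English list contains many more entries.
SAMPLE_STOPWORDS = {"i", "is", "in", "a", "the", "this", "for", "me", "are"}

def clean_text_sketch(text, stop_words=SAMPLE_STOPWORDS):
    """Lowercase, keep only word-character tokens, drop stop words, rejoin."""
    tokens = re.findall(r"\w+", text.lower())
    kept = [tok for tok in tokens if tok not in stop_words]
    return " ".join(kept)

print(clean_text_sketch("I hate bugs in code!"))
print(clean_text_sketch("This is a terrible experience."))
```

With this stop word set the two calls print `hate bugs code` and `terrible experience`, the same cleaned strings the notebook produces for those rows.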



### Step 5:  Apply Cleaning

In [658]:
### define the function call

df['clean_text'] = df['text'].apply(clean_text)

In [659]:
df['clean_text']

0            love data science
1     machine learning amazing
2              python great ai
3       deep learning powerful
4                 enjoy coding
5            project fantastic
6               hate bugs code
7        debugging frustrating
8            errors make angry
9          terrible experience
10            dislike bad code
11       results disappointing
Name: clean_text, dtype: object

### OBSERVATIONS:

1. The function 'clean_text' is applied to every row of the 'text' column, and the result is stored in a new 'clean_text' column.

### Step 6: Convert Text to Numerical Features (TF-IDF)

Q.>  Why is TF-IDF vectorization used with the Random Forest Classifier ?

Ans:>

1.   A Random Forest needs numerical input, and TF-IDF turns each text into a vector of numbers.

     After applying the TF-IDF vectorizer, the text becomes high-dimensional sparse data.

     scikit-learn's RandomForestClassifier accepts this sparse matrix directly as training input.


2.  TF-IDF assigns higher weights to words that are distinctive for a document and lower weights to words that are common across documents, which reduces the importance of the common words.


3.  By default, TfidfVectorizer also L2-normalizes each row, so long and short texts produce vectors on the same scale.

    Tree-based models do not require feature scaling, so the weighting and normalization matter less for a Random Forest than for a linear model such as Logistic Regression.

    TF-IDF is still a convenient default representation for text.
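
The weighting described above can be reproduced by hand. The sketch below assumes TfidfVectorizer's default settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalization of each row) and uses the cleaned sentences from Step 5; the helper name `tfidf_row` is hypothetical.

```python
import math
from collections import Counter

# Cleaned sentences from Step 5 of the notebook.
docs = [
    "love data science", "machine learning amazing", "python great ai",
    "deep learning powerful", "enjoy coding", "project fantastic",
    "hate bugs code", "debugging frustrating", "errors make angry",
    "terrible experience", "dislike bad code", "results disappointing",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents each term appears in.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_row(tokens):
    """Raw count times smoothed IDF, followed by L2 normalization."""
    counts = Counter(tokens)
    raw = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0)
           for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

row = tfidf_row(tokenized[1])  # "machine learning amazing"
for term, weight in sorted(row.items()):
    print(f"{term:10s} {weight:.3f}")
print("vocabulary size:", len(doc_freq))
```

In this row, "learning" receives a lower weight than "machine" or "amazing" because it appears in two documents rather than one.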

In [660]:
from sklearn.feature_extraction.text  import  TfidfVectorizer

### Create an object for Tfidf vectorizer

tfidf = TfidfVectorizer()

### drop the raw 'text' column; only 'clean_text' and 'label' are needed from here on

df.drop(columns='text', inplace=True)

In [661]:
dfa = df

In [662]:
df = dfa[['clean_text','label']] 

In [663]:
df

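|    | clean_text               | label |
|----|--------------------------|-------|
| 0  | love data science        | 1     |
| 1  | machine learning amazing | 1     |
| 2  | python great ai          | 1     |
| 3  | deep learning powerful   | 1     |
| 4  | enjoy coding             | 1     |
| 5  | project fantastic        | 1     |
| 6  | hate bugs code           | 0     |
| 7  | debugging frustrating    | 0     |
| 8  | errors make angry        | 0     |
| 9  | terrible experience      | 0     |
| 10 | dislike bad code         | 0     |
| 11 | results disappointing    | 0     |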


In [664]:
### Apply Tfidf vectorizer to the clean_text

ans = tfidf.fit_transform(df['clean_text'])

In [665]:
ans

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 31 stored elements and shape (12, 29)>

### OBSERVATIONS:

1.  The input is the 'clean_text' column.

2.  The TF-IDF vectorizer is fitted on 'clean_text' and transforms it into a sparse matrix of TF-IDF scores.

3.  The resulting matrix has 12 rows (one per sentence) and 29 columns (one per distinct word), with 31 non-zero entries.

### Step 7: Divide the data into the independent and dependent features

In [666]:
X = ans

In [667]:
Y = df['label']

In [668]:
X    

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 31 stored elements and shape (12, 29)>

In [669]:
Y

0     1
1     1
2     1
3     1
4     1
5     1
6     0
7     0
8     0
9     0
10    0
11    0
Name: label, dtype: int64

In [670]:
Y.value_counts()

label
1    6
0    6
Name: count, dtype: int64

### OBSERVATIONS:

1. The dataset is separated into independent and dependent features.

2. The input is X, which is the sparse TF-IDF matrix.

3. The output is Y, which is the label column (the dependent feature). Both classes have 6 rows, so the data is balanced.

### Step 8: Perform the Train-Test Split

In [671]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size=0.2,random_state=42,stratify=Y)

In [672]:
X_train

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 24 stored elements and shape (9, 29)>

In [673]:
X_test

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 7 stored elements and shape (3, 29)>

In [674]:
print("Shape of the input training data is:", X_train.shape)
print("Shape of the input testing  data is:", X_test.shape)

Shape of the input training data is: (9, 29)
Shape of the input testing  data is: (3, 29)


In [675]:
Y_train

6     0
7     0
5     1
0     1
8     0
3     1
2     1
11    0
1     1
Name: label, dtype: int64

In [676]:
Y_test

4     1
9     0
10    0
Name: label, dtype: int64

In [677]:
print("Shape of the output training data is:", Y_train.shape)
print("Shape of the output testing  data is:", Y_test.shape)

Shape of the output training data is: (9,)
Shape of the output testing  data is: (3,)


### OBSERVATIONS:

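1. The train_test_split function divides the input and the output into training and testing data, stratified by label.

    test_size=0.2 requests a 20 % test set, but with only 12 rows the test set is rounded up to 3 rows.

    Training data is 9 rows (75 %).

    Testing  data is 3 rows (25 %).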
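
The shapes printed above (9 training rows and 3 testing rows) follow from how the split size is rounded. A minimal sketch of that arithmetic, assuming the test size is rounded up and the training set takes the remaining rows, which is what scikit-learn does when only a fractional `test_size` is given:

```python
import math

n_samples = 12
test_size = 0.2

# A fractional test size is rounded up to a whole number of rows.
n_test = math.ceil(test_size * n_samples)   # 2.4 rounds up to 3
n_train = n_samples - n_test                # the remaining 9 rows

print(n_train, n_test)
print(f"effective test share: {n_test / n_samples:.0%}")
```

This prints `9 3` and an effective test share of 25%, so on very small datasets the realized split can differ noticeably from the requested fraction.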


### Step 9: Train the Random Forest Classifier Model

In [678]:
from sklearn.ensemble import RandomForestClassifier

### create an object for Random Forest Classifier

ran = RandomForestClassifier(
    n_estimators = 100,         ### total number of decision trees
    max_depth    = None,        ### no depth restriction
    class_weight = 'balanced',  ### weight classes inversely to their frequency
    random_state = 42           ### fixed seed for reproducible results
)

In [679]:
### using the object  for Random Forest Classifier, train the model
ran.fit(X_train, Y_train)

### OBSERVATIONS:

1. First the object for Random Forest Classifier is created.

2. Using the object for Random Forest Classifier, the model is trained using the training data.
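
Once the forest is trained, the individual trees have to be combined into one prediction. scikit-learn's RandomForestClassifier does this by averaging the class probabilities predicted by the trees and choosing the class with the highest mean. The sketch below illustrates only that combination step; the per-tree probabilities in `tree_probas` are hypothetical values, not outputs of the model trained above.

```python
# Hypothetical per-tree probabilities [P(negative), P(positive)] for one sample.
tree_probas = [
    [0.0, 1.0],
    [1.0, 0.0],
    [0.0, 1.0],
    [0.4, 0.6],
    [1.0, 0.0],
]
n_trees = len(tree_probas)
n_classes = 2

# Average each class probability over all trees (soft voting).
avg_proba = [sum(p[k] for p in tree_probas) / n_trees for k in range(n_classes)]

# The predicted class is the one with the highest average probability.
predicted_class = max(range(n_classes), key=lambda k: avg_proba[k])

print("average probabilities:", [round(p, 2) for p in avg_proba])
print("predicted class:", predicted_class)
```

Here the averages are 0.48 and 0.52, so the combined prediction is class 1 even though two of the five trees voted confidently for class 0.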

### Step 10:  Make Predictions for the model

In [680]:
Y_pred = ran.predict(X_test)

In [681]:
Y_pred

array([1, 1, 0])

In [682]:
Y_test

4     1
9     0
10    0
Name: label, dtype: int64

### OBSERVATIONS:

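1. The model predicts a label for each of the three test sentences.

2. Comparing Y_pred ([1, 1, 0]) with Y_test ([1, 0, 0]) shows two correct predictions and one error: row 9 ("This is a terrible experience") is negative but was predicted as positive.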

### Step 11:  Evaluate the Model

In [683]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

ac = accuracy_score(Y_test, Y_pred)*100.0

print("Accuracy of the model is:", ac)

Accuracy of the model is: 66.66666666666666


In [684]:
cm = confusion_matrix(Y_test, Y_pred)

print("Confusion Matrix is:",cm)

Confusion Matrix is: [[1 1]
 [0 1]]


In [685]:
cr = classification_report(Y_test, Y_pred)

print("classification report is:",cr)

classification report is:               precision    recall  f1-score   support

           0       1.00      0.50      0.67         2
           1       0.50      1.00      0.67         1

    accuracy                           0.67         3
   macro avg       0.75      0.75      0.67         3
weighted avg       0.83      0.67      0.67         3
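
The figures in the report above can be recomputed by hand from the three test labels, which makes it clearer where each number comes from. This is a plain-Python sketch; the helper name `class_metrics` is hypothetical, and the label lists are copied from Y_test and Y_pred.

```python
y_true = [1, 0, 0]   # Y_test from the notebook
y_pred = [1, 1, 0]   # Y_pred from the notebook

def class_metrics(y_true, y_pred, cls):
    """Return (precision, recall, f1), treating `cls` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy: {accuracy:.2f}")
for cls in (0, 1):
    p, r, f = class_metrics(y_true, y_pred, cls)
    print(f"class {cls}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

The output matches the report: accuracy 0.67, class 0 with precision 1.00 and recall 0.50, class 1 with precision 0.50 and recall 1.00. With only three test rows, a single misclassification moves every metric by a large amount, so these numbers say little about how the model would perform on more data.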



### Step 12: Test with Custom Sentence

In [686]:
new_text = ["I really love this amazing course"]


### clean the text

cleaned_text = [clean_text(x) for x in new_text]


### Transform the text

transformed = tfidf.transform(cleaned_text)

### predict the sentiment with the trained model

prediction = ran.predict(transformed)

print("Prediction:", "Positive" if prediction[0] == 1 else "Negative")

Prediction: Positive
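
When a fitted vectorizer transforms a new sentence, words that were not seen during fitting are ignored, so only the known words can influence the prediction. The sketch below shows which words of the custom sentence fall into each group. It assumes the cleaned form of the sentence is "really love amazing course", and the names `known` and `unknown` are introduced here for illustration.

```python
# Cleaned training sentences from Step 5; their words form the vocabulary.
train_docs = [
    "love data science", "machine learning amazing", "python great ai",
    "deep learning powerful", "enjoy coding", "project fantastic",
    "hate bugs code", "debugging frustrating", "errors make angry",
    "terrible experience", "dislike bad code", "results disappointing",
]
vocabulary = {term for doc in train_docs for term in doc.split()}

# Assumed output of clean_text for "I really love this amazing course".
new_tokens = "really love amazing course".split()

known = [t for t in new_tokens if t in vocabulary]
unknown = [t for t in new_tokens if t not in vocabulary]

print("used by the model:   ", known)
print("ignored (not in vocabulary):", unknown)
```

Only "love" and "amazing" are in the vocabulary, and both appear solely in positive training sentences, which is consistent with the positive prediction. A sentence made up entirely of unseen words would be transformed into an all-zero vector.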
