### Aim:---

1.  To perform an end to end sentimental analysis implementation on an imdb movie review dataset and predict whether the review is positive or negative.


### Steps used in this Algorithm :----

1.  Import all the necessary Libraries

2.  Download the necessary NLTK libraries

3.  Create the Sample Dataset

4.  Perform the Text Preprocessing Function

5.  Apply Cleaning

6.  Convert Text to Numerical Form (TF-IDF)

7.  Divide the dataset into independent and dependent variables

8.  Train-Test Split

9.  Train Logistic Regression Model

10. Make Predictions

11. Evaluate the Model

12. Predict on New Review

### Step 1: Import all the necessary Libraries

In [485]:
import  numpy               as   np
import  pandas              as   pd
import  matplotlib.pyplot   as   plt
import  seaborn             as   sns

import  nltk

from    nltk.tokenize   import  RegexpTokenizer, word_tokenize
from    nltk.corpus     import  stopwords

from    sklearn.feature_extraction.text import TfidfVectorizer
from    sklearn.model_selection         import train_test_split
from    sklearn.preprocessing           import StandardScaler
from    sklearn.linear_model            import LogisticRegression
from    sklearn.metrics                 import accuracy_score, confusion_matrix, classification_report

### OBSERVATIONS:

1.  numpy  ------------------>  Computation of the numerical array

2.  pandas ------------------>  Data Cleaning and Manipulation

3.  matplotlib -------------->  Data Visualization

4.  seaborn   --------------->  Data Correlation

5.  nltk -------------------->  Contains all the library for text preprocessing

6.  tokenize ---------------->  breaks the text into sub parts

7.  word_tokenize ----------->  breaks the text into words

8.  corpus ------------------>  contains a series of text

9.  stopwords --------------->  words having no meaning

10. feature_extraction ------>  extracting all the essential information from the features

11. TfidfVectorizer  -------->  converts the text into Tfidf score matrix

12. train_test_split -------->  split the data into training and testing data

13. StandardScaler ---------->  sclaes the data in one range between 0 to 1

14. LogisticRegression ------>  provides the result in 1 or 0 in a binary classification problem

15. RegexpTokenizer ----------> Tokenizing the Regular expression statement

16. metrics ------------------> evaluate the performance of the model

### Step 2: Download the necessary NLTK libraries

In [486]:
import nltk
nltk.download('punkt_tab')
nltk.download('average_perceptron_tagger_eng')
nltk.download('stopwords')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Error loading average_perceptron_tagger_eng: Package
[nltk_data]     'average_perceptron_tagger_eng' not found in index
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

### OBSERVATIONS:

1.  punkt_tab --------------->   It represents the tokenization model

2.  average_perceptron_tagger_eng ------> It represents the POS Tagging Model

3.  stopwords         -------------------> It represents the model for stopwords

### Step 3:  Create the Sample Dataset

In [487]:
data = {
    "review": [
        "I love this product, it is amazing!",
        "Worst experience ever, very bad service.",
        "Absolutely fantastic quality.",
        "I hate this item, waste of money.",
        "Very happy with the purchase.",
        "Terrible, I will never buy again."
    ],
    "sentiment": [1, 0, 1, 0, 1, 0]  # 1 = Positive, 0 = Negative
}


### Construct the DataFrame from the above data

df = pd.DataFrame(data)


print(df)

                                     review  sentiment
0       I love this product, it is amazing!          1
1  Worst experience ever, very bad service.          0
2             Absolutely fantastic quality.          1
3         I hate this item, waste of money.          0
4             Very happy with the purchase.          1
5         Terrible, I will never buy again.          0


### OBSERVATIONS:

1. The dataset is constructed.

2. It has two columns. One is review and the other is label.

3. The input column represents the text review.

4. The output column represents the label.

5. The review text is string in nature. So we need to convert it into the numerical foem so that it can be easily be trained by the machine learning model.

### Step 4: Perform the Text Preprocessing Function

In [488]:
reg = RegexpTokenizer(r'\w+')

In [489]:
### define the function
def clean_text(text):
    ### convert the text into the lower case
    text = text.lower()
    ### perform the Regular expression tokenization on the text to remove all the punctuationa and special symbols
    text = reg.tokenize(text)
    ### convert the text in lists to words
    text = " ".join(text)
    ### perform the word tokenization on the text
    words = word_tokenize(text)

    ### define the english stopwords
    english_stopwords = stopwords.words("english")

    ### filter out all the stopwords from each text
    res = [x for x in words if(x not in english_stopwords)]

    ### convert all the words from list to words
    res = " ".join(res)


    return(res)

### OBSERVATIONS:

1. The "clean_text" function is defined.

2. First each input text is accepted as an input.

3. This input text is first converted into lower case.

4. Then regular expression is applied on each and every text to remove all the punctuations and special symbols from the text.

5. The regularized words in list is again converted into words.

6. Then the word tokenization is applied on the regularized words to convert each text into words.

7. All the stopwords of english data is defined.

8. Then all the english stopwords are removed from the tokenized words to get a fileterd text data.

9. The fileterd text data in list is converted back into the words.

### Step 5:  Apply Cleaning

In [490]:
### call the function
df['clean_review'] = df['review'].apply(clean_text)

In [491]:
df['clean_review']

0                 love product amazing
1    worst experience ever bad service
2         absolutely fantastic quality
3                hate item waste money
4                       happy purchase
5                   terrible never buy
Name: clean_review, dtype: object

### OBSERVATIONS:

1. The function 'clean_text' is called where the text preprocessing is applied to each and every row of the dataset.

### Step 6: Convert Text to Numerical Form (TF-IDF)

Q.>  Why Tfidf Vectoization is used in Logistic Regression ?

Ans:>

1.   Logistic Regression can easily work with the high dimensional data and sparse matrix.

     After applying Tfidf vectorizer on the text, the text becomes high dimendional sparse data.

     So here Tfidf vectorizer can easily be used in Logistic Regression.


2.  Logistic Regression is a linear model.

    Tfidf vectorizer produces the output in the form of the numerical data. The most important words have been assigned with the higher weights and the common words have been given lower weights, so as to reduce its imprtance.


3. Tfidf Vectorizer reduces the noise. It is used in Logistic Regression so as to reduce the influence of common words.


4.  Logistic Regression works well with L2-regularized data.

    Tfidf Vectorizer even produces normalized results so as to improve the generalization of the model.

    So Tfidf Vectorizer is used in Logistic Regression.

In [492]:
from    sklearn.feature_extraction.text import TfidfVectorizer

### Create an object for Tfidf Vectorizer

tfidf = TfidfVectorizer(
    ngram_range=(1,2),      # Use unigrams + bigrams
    min_df=1,
    max_df=0.9
)

### using the object of tfidf, transform the inputs

X_tfidf = tfidf.fit_transform(df['clean_review'])



In [493]:
X_tfidf

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 34 stored elements and shape (6, 34)>

In [494]:
### convert to numpy array for better view and visibility

X_vectorized = X_tfidf.toarray()

In [495]:
X_vectorized

array([[0.        , 0.        , 0.4472136 , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.4472136 , 0.4472136 ,
        0.        , 0.        , 0.        , 0.4472136 , 0.4472136 ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.33333333, 0.33333333,
        0.        , 0.33333333, 0.33333333, 0.33333333, 0.33333333,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.33333333, 0.        , 0.        ,
        0.        , 0.        , 0.33333333, 0.33333333],
       [0.4472136 , 0.4472136 , 0.        , 0.        , 0.        ,
  

### OBSERVATIONS:

1.  We have the input in the form of 'clean_review'.

2.  Then tfidf vectorizer is applied on the clean_review to produce the sparse matrix of tfidf scores

3.  Then this sparse matrix is converted into numpy array for better view and visibility

4.  All the important words in the numpy array have been assigned with weights greater than 0 and all the common words have been assigned with the weights as 0.

### Step 7:  Divide the dataset into independent and dependent variables

In [496]:
### Independent features 

X = X_vectorized

print(X)

### Dependent features

Y = df['sentiment']

print(Y)

[[0.         0.         0.4472136  0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.4472136  0.4472136  0.         0.         0.         0.4472136
  0.4472136  0.         0.         0.         0.         0.
  0.         0.         0.         0.        ]
 [0.         0.         0.         0.33333333 0.33333333 0.
  0.33333333 0.33333333 0.33333333 0.33333333 0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.33333333 0.         0.
  0.         0.         0.33333333 0.33333333]
 [0.4472136  0.4472136  0.         0.         0.         0.
  0.         0.         0.         0.         0.4472136  0.4472136
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.4472136  0.         0.    

In [497]:
Y.value_counts()

sentiment
1    3
0    3
Name: count, dtype: int64

### Step 8: Train-Test Split

In [498]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size=0.2,random_state=42,stratify=Y)

In [499]:
X_train

array([[0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.37796447,
        0.37796447, 0.37796447, 0.37796447, 0.        , 0.        ,
        0.37796447, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.37796447, 0.37796447, 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.57735027, 0.57735027, 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        , 0.        ,
        0.57735027, 0.        , 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.33333333, 0.33333333,
  

In [500]:
X_test

array([[0.       , 0.       , 0.4472136, 0.       , 0.       , 0.       ,
        0.       , 0.       , 0.       , 0.       , 0.       , 0.       ,
        0.       , 0.       , 0.       , 0.       , 0.       , 0.       ,
        0.4472136, 0.4472136, 0.       , 0.       , 0.       , 0.4472136,
        0.4472136, 0.       , 0.       , 0.       , 0.       , 0.       ,
        0.       , 0.       , 0.       , 0.       ],
       [0.       , 0.       , 0.       , 0.       , 0.       , 0.4472136,
        0.       , 0.       , 0.       , 0.       , 0.       , 0.       ,
        0.       , 0.       , 0.       , 0.       , 0.       , 0.       ,
        0.       , 0.       , 0.       , 0.4472136, 0.4472136, 0.       ,
        0.       , 0.       , 0.       , 0.       , 0.4472136, 0.4472136,
        0.       , 0.       , 0.       , 0.       ]])

In [501]:
print("Shape of the input training data is:", X_train.shape)
print("Shape of the input testing  data is:", X_test.shape)

Shape of the input training data is: (4, 34)
Shape of the input testing  data is: (2, 34)


In [502]:
Y_train

3    0
4    1
1    0
2    1
Name: sentiment, dtype: int64

In [503]:
Y_test

0    1
5    0
Name: sentiment, dtype: int64

In [504]:
print("Shape of the output training data is:", Y_train.shape)
print("Shape of the output testing  data is:", Y_test.shape)

Shape of the output training data is: (4,)
Shape of the output testing  data is: (2,)


### OBSERVATIONS:

1. After applying the train_test_split function of the input and the output, the data is divided into the training and the testing data.

    Training data is 80 %

    Testing  data is 20 %.

### Step 9: Train Logistic Regression Model

In [505]:
from sklearn.linear_model import LogisticRegression

### Create the object for Logistic Regression

log = LogisticRegression()

### using the object for Logistic Regression, train the data

log.fit(X_train, Y_train)

### OBSERVATIONS:

1. First the object for Logistic Regression is created.

2. Using the object for Logistic Regression, the model is trained using the training data.

### Step 10: Make Predictions

In [506]:
Y_pred = log.predict(X_test)

In [507]:
Y_pred

array([1, 1])

### OBSERVATIONS:

1. Based on the test data, the predicted output is obtaiend.

In [508]:
Y_test

0    1
5    0
Name: sentiment, dtype: int64

### Step 11: Evaluate the Model

In [509]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

ac = accuracy_score(Y_test, Y_pred)*100.0

print("Accuracy of the model is:", ac)

Accuracy of the model is: 50.0


In [510]:
cm = confusion_matrix(Y_test, Y_pred)

print("Confusion Matrix is:",cm)

Confusion Matrix is: [[0 1]
 [0 1]]


In [511]:
cr = classification_report(Y_test, Y_pred)

print("classification report is:",cr)

classification report is:               precision    recall  f1-score   support

           0       0.00      0.00      0.00         1
           1       0.50      1.00      0.67         1

    accuracy                           0.50         2
   macro avg       0.25      0.50      0.33         2
weighted avg       0.25      0.50      0.33         2



  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


### Step 12: Predict on New Review

In [512]:
new_review = ["This product is really awesome and worth buying"]

### Clean the text

cleaned = [clean_text(new_review[0])]

### Convert the text into tfidf vector

transformed = tfidf.transform(cleaned)

print(transformed)

### Predict the model

predictions = log.predict(transformed)

print(predictions)

if predictions[0] == 1:
    print("Positive Sentiment ðŸ˜Š")
else:
    print("Negative Sentiment ðŸ˜ž")

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 1 stored elements and shape (1, 34)>
  Coords	Values
  (0, 23)	1.0
[1]
Positive Sentiment ðŸ˜Š
