### Support Vector Machine (SVM) based on NLP:

#### Aim:---

1.  We have a text as an input and we need to predict whether its sentiment is positive or negative using an SVM (Support Vector Machine)

#### Steps used in this Algorithm:---

1.  Import all the necessary libraries

2.  Download all the necessary NLTK libraries

3.  Create Sample Dataset

4.  Perform the Text Preprocessing Function

5.  Apply Cleaning

6.  Convert Text to Numerical Features (TF-IDF)

7.  Perform the Train-Test Split

8.  Train Support Vector Classifier Model

9.  Make Predictions for the model

10. Evaluate the Model

11. Test with Custom Sentence

### Step 1: Import all the necessary libraries

In [368]:
import  numpy              as   np
import  pandas             as   pd
import  matplotlib.pyplot  as  plt
import  seaborn            as  sns

import  nltk
from    nltk.tokenize      import word_tokenize
from    nltk.tokenize      import RegexpTokenizer
from    nltk.corpus        import stopwords

from sklearn.feature_extraction.text   import  TfidfVectorizer

from  sklearn.model_selection          import  train_test_split
from  sklearn.preprocessing            import  StandardScaler
from  sklearn.svm                      import  SVC

### OBSERVATIONS:

1.  numpy  ------------------>  Numerical computation on arrays

2.  pandas ------------------>  Data cleaning and manipulation

3.  matplotlib -------------->  Data visualization

4.  seaborn   --------------->  Statistical data visualization built on matplotlib

5.  nltk -------------------->  Natural Language Toolkit; provides the text preprocessing tools

6.  tokenize ---------------->  NLTK module that breaks text into smaller parts (tokens)

7.  word_tokenize ----------->  Breaks text into words

8.  corpus ------------------>  NLTK module that gives access to collections of text and word lists

9.  stopwords --------------->  Very common words (for example "is", "the", "this") that carry little meaning for classification

10. feature_extraction ------>  scikit-learn module that converts raw data such as text into numerical features

11. TfidfVectorizer  -------->  Converts text into a matrix of TF-IDF scores

12. train_test_split -------->  Splits the data into training and testing sets

13. StandardScaler ---------->  Standardizes each feature to zero mean and unit variance. It does not rescale to the range 0 to 1 (that is MinMaxScaler). It is imported here but not used in the later steps.

14. SVC            -----------> Support Vector Classifier; finds the separating hyperplane with the largest margin between the two classes

15. RegexpTokenizer ----------> Splits text into the tokens that match a regular expression pattern
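
The difference between standardization and rescaling to the range 0 to 1 can be checked on a small example. This is only a sketch: the `values` array is made-up data and is not part of the notebook's dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single-feature data, used only for illustration.
values = np.array([[1.0], [2.0], [3.0], [4.0], [10.0]])

# StandardScaler: subtract the mean, divide by the standard deviation.
standardized = StandardScaler().fit_transform(values)

# MinMaxScaler: map the smallest value to 0 and the largest to 1.
min_maxed = MinMaxScaler().fit_transform(values)

print("StandardScaler mean and std:", standardized.mean(), standardized.std())
print("StandardScaler min and max :", standardized.min(), standardized.max())
print("MinMaxScaler   min and max :", min_maxed.min(), min_maxed.max())
```

The standardized values include negative numbers, so they are not confined to the range 0 to 1.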

### Step 2: Download all the necessary NLTK libraries

In [369]:
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('average_perceptron_tagger_eng')   ### name is misspelled: the package is 'averaged_perceptron_tagger_eng', so this download fails

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Error loading average_perceptron_tagger_eng: Package
[nltk_data]     'average_perceptron_tagger_eng' not found in index


False

### OBSERVATIONS:

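1.  punkt_tab --------------->   Tokenizer data used by word_tokenize

2.  average_perceptron_tagger_eng ------> Intended to be the POS tagging model. The name is misspelled (the package is 'averaged_perceptron_tagger_eng'), so the download fails and returns False. POS tagging is not used in this notebook, so the later steps are not affected.

3.  stopwords         -------------------> The stopword lists (including English) read by stopwords.words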

### Step 3: Create Sample Dataset

In [370]:
data = {
    'text': [
        "I love machine learning",
        "This movie is terrible",
        "Data science is amazing",
        "I hate this product",
        "This is the best course",
        "Worst experience ever"
    ],
    'label': [1, 0, 1, 0, 1, 0]   # 1 = Positive, 0 = Negative
}

print(data)

### Construct the Dataframe from this data

df = pd.DataFrame(data)

print(df)

{'text': ['I love machine learning', 'This movie is terrible', 'Data science is amazing', 'I hate this product', 'This is the best course', 'Worst experience ever'], 'label': [1, 0, 1, 0, 1, 0]}
                      text  label
0  I love machine learning      1
1   This movie is terrible      0
2  Data science is amazing      1
3      I hate this product      0
4  This is the best course      1
5    Worst experience ever      0


### OBSERVATIONS:

1. The dataset is constructed.

2. It has two columns. One is text and the other is label.

3. The input column represents the text.

4. The output column represents the label.

5. The text is a string, so we need to convert it into numerical form before a machine learning model can be trained on it.

### Step 4: Perform the Text Preprocessing Function

In [371]:
### Create a tokenizer that keeps only runs of word characters (letters, digits, underscore)

reg = RegexpTokenizer(r'\w+')

In [372]:
### define the cleaning function
def clean_text(text):
    ### convert the text to lower case
    text = text.lower()

    ### drop punctuation and special symbols: keep only the tokens matched by the regular expression
    tokens = reg.tokenize(text)

    ### join the tokens back into a string and apply word tokenization
    text = " ".join(tokens)
    words = word_tokenize(text)

    ### load the English stopwords (a set makes the membership test fast)
    english_stopwords = set(stopwords.words("english"))
    ### remove the English stopwords from the tokenized words
    res = [x for x in words if x not in english_stopwords]

    ### return the result as a single string
    return " ".join(res)

### OBSERVATIONS:

1. The "clean_text" function is defined.

2. It accepts one text string as input.

3. The text is first converted to lower case.

4. A regular expression tokenizer is then applied to remove all punctuation and special symbols.

5. The resulting list of tokens is joined back into a single string.

6. Word tokenization is applied to this string to split it into words.

7. The list of English stopwords is loaded.

8. The stopwords are removed from the tokenized words to get the filtered words.

9. The filtered words are joined back into a single string, which is returned.
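
The same steps can be sketched with only the Python standard library, which makes the order of operations easy to follow. This is not the notebook's function: `clean_text_sketch` and `SAMPLE_STOPWORDS` are hypothetical names, and the tiny stopword set stands in for the much longer NLTK English list.

```python
import re

# Hypothetical, deliberately tiny stopword set; NLTK's English list is much longer.
SAMPLE_STOPWORDS = {"i", "this", "is", "the", "a", "an"}

def clean_text_sketch(text):
    """Lowercase the text, keep only runs of word characters, drop stopwords."""
    # Steps 3 to 6: lower case, then extract word tokens (punctuation is dropped).
    tokens = re.findall(r"\w+", text.lower())
    # Steps 7 and 8: filter out the stopwords.
    kept = [tok for tok in tokens if tok not in SAMPLE_STOPWORDS]
    # Step 9: join the remaining words back into one string.
    return " ".join(kept)

print(clean_text_sketch("This movie is TERRIBLE!!!"))
```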

### Step 5: Apply Cleaning

In [373]:
### perform the function call
df['clean_text'] = df['text'].apply(clean_text)

In [374]:
df['clean_text']

0    love machine learning
1           movie terrible
2     data science amazing
3             hate product
4              best course
5    worst experience ever
Name: clean_text, dtype: object

### OBSERVATIONS:

1. The function 'clean_text' is called where the text preprocessing is applied to each and every row of the dataset.

### Step 6: Convert Text to Numerical Features (TF-IDF)


Q.>  Why is TF-IDF vectorization used with a linear SVM ?

Ans:>

1.   A linear SVM works well with high-dimensional, sparse data.

     After applying the TF-IDF vectorizer, the text becomes a high-dimensional sparse matrix with one column per vocabulary word.

     So the TF-IDF output can be passed to the SVM directly.


2.  An SVM with a linear kernel is a linear model, so it needs numerical input.

    The TF-IDF vectorizer produces numerical data. Words that are frequent in a text but rare across the dataset get higher weights, and words that are common across many texts get lower weights, which reduces their importance.


3. The TF-IDF vectorizer reduces noise: down-weighting common words reduces their influence on the classifier.


4.  An SVM is sensitive to the scale of its input features.

    By default the TF-IDF vectorizer L2-normalizes every row (norm='l2'), so long and short texts are represented on the same scale.

    This is separate from the L2 regularization of the model itself, which SVC controls through its C parameter.
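
To make the weighting concrete, the scores can be computed by hand. The sketch below assumes the smoothed formula that scikit-learn documents for its default settings, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The three short documents and the helper names are made up for illustration.

```python
import math
from collections import Counter

# Hypothetical tokenized documents; "movie" and "best" each occur in two of them.
docs = [["movie", "terrible"], ["best", "movie"], ["best", "course", "ever"]]
n_docs = len(docs)

# Document frequency: in how many documents does each term occur?
df_counts = Counter(term for doc in docs for term in set(doc))

def idf(term):
    """Smoothed inverse document frequency (assumed to match sklearn's default)."""
    return math.log((1 + n_docs) / (1 + df_counts[term])) + 1.0

def tfidf_row(doc):
    """Term count times idf, then scale the row to unit L2 length."""
    counts = Counter(doc)
    raw = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(weight ** 2 for weight in raw.values()))
    return {term: weight / norm for term, weight in raw.items()}

rows = [tfidf_row(doc) for doc in docs]
for doc, row in zip(docs, rows):
    print(doc, {term: round(weight, 3) for term, weight in row.items()})
```

In the first document, "terrible" (found in one document) receives a higher weight than "movie" (found in two).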

In [375]:
from  sklearn.feature_extraction.text  import TfidfVectorizer

### Create an object for Tf-iDF Vectorizer

tfidf = TfidfVectorizer()

### Remove the text column from the dataset

df.drop(columns='text', inplace=True)

In [376]:
dfa = df[['clean_text','label']]

In [377]:
df = dfa

In [378]:
df

              clean_text  label
0  love machine learning      1
1         movie terrible      0
2   data science amazing      1
3           hate product      0
4            best course      1
5  worst experience ever      0


In [379]:
### Apply Tfidf Vectorizer on the clean text to obtain the matrix of tf-idf scores

X = tfidf.fit_transform(df['clean_text'])

In [380]:
Y = df['label']

In [381]:
X_vectorized = X

In [382]:
X_vectorized 

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 15 stored elements and shape (6, 15)>

In [383]:
Y

0    1
1    0
2    1
3    0
4    1
5    0
Name: label, dtype: int64

In [384]:
Y.value_counts()

label
1    3
0    3
Name: count, dtype: int64

### OBSERVATIONS:

1.  The input is the 'clean_text' column.

2.  The TF-IDF vectorizer is applied to clean_text to produce a sparse matrix of TF-IDF scores.

3.  The matrix has shape (6, 15): 6 texts and 15 vocabulary words. It is kept in sparse form; it can be converted into a NumPy array with X_vectorized.toarray() for easier viewing.

4.  In each row, the words that occur in that text have weights greater than 0 and every other vocabulary word has weight 0. Only 15 of the 90 entries are non-zero.

5.  The labels are balanced: 3 positive and 3 negative.

### Step 7: Perform the Train-Test Split

In [385]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X_vectorized, Y, test_size = 0.3, random_state=42,stratify = Y)

In [386]:
X_train

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 9 stored elements and shape (4, 15)>

In [387]:
X_test

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 6 stored elements and shape (2, 15)>

In [388]:
print("Shape of the input training data is:", X_train.shape)
print("Shape of the input testing  data is:", X_test.shape)

Shape of the input training data is: (4, 15)
Shape of the input testing  data is: (2, 15)


In [389]:
Y_train

3    0
4    1
1    0
2    1
Name: label, dtype: int64

In [390]:
Y_test

0    1
5    0
Name: label, dtype: int64

In [391]:
print("Shape of the output training data is:", Y_train.shape)
print("Shape of the output testing  data is:", Y_test.shape)

Shape of the output training data is: (4,)
Shape of the output testing  data is: (2,)


### OBSERVATIONS:

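1. After applying train_test_split to the input and the output, the data is divided into training and testing data.

    test_size = 0.3 requests a 70 % / 30 % split. With only 6 samples this gives:

    Training data: 4 samples (about 67 %)

    Testing  data: 2 samples (about 33 %)

2. stratify = Y keeps the classes balanced: the training data has 2 positive and 2 negative labels, and the testing data has 1 of each.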
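
The split sizes follow from rounding: the number of test samples is rounded up, and the rest go to training. The sketch below repeats the split on the raw texts to check the counts; the variable names are chosen for this example only.

```python
import math
from sklearn.model_selection import train_test_split

texts = [
    "I love machine learning",
    "This movie is terrible",
    "Data science is amazing",
    "I hate this product",
    "This is the best course",
    "Worst experience ever",
]
labels = [1, 0, 1, 0, 1, 0]

# 0.3 * 6 = 1.8 test samples, rounded up to 2; the remaining 4 are for training.
expected_test = math.ceil(0.3 * len(texts))

train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.3, random_state=42, stratify=labels
)

print("expected test size:", expected_test)
print("train / test sizes:", len(train_texts), len(test_texts))
print("test labels:", sorted(test_labels))
```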

### Step 8:  Train Support Vector Classifier Model

In [392]:
from sklearn.svm import SVC

### Create an object for SVC

svc = SVC(kernel='linear')   ### linear kernel: the decision boundary is a hyperplane


### Train the model using the SVC object

svc.fit(X_train, Y_train)

### OBSERVATIONS:

1. First the object for Support Vector Classifier is created.

2. Using the object for SVC, the model is trained using the training data.
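
In this notebook the TF-IDF vectorizer is fitted on all six texts before the split, so its vocabulary and IDF weights also see the test rows. A possible alternative, shown here only as a sketch, is to split the texts first and wrap the vectorizer and the classifier in a scikit-learn pipeline so that both are fitted on the training texts only. The names `cleaned`, `labels` and `model` are chosen for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# The six cleaned texts from Step 5 and their labels.
cleaned = [
    "love machine learning",
    "movie terrible",
    "data science amazing",
    "hate product",
    "best course",
    "worst experience ever",
]
labels = [1, 0, 1, 0, 1, 0]

# Split the texts first, so the vectorizer never sees the test rows.
train_texts, test_texts, train_labels, test_labels = train_test_split(
    cleaned, labels, test_size=0.3, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF and the linear SVC on the training texts only.
model = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))
model.fit(train_texts, train_labels)

predictions = model.predict(test_texts)
vocabulary = model.named_steps["tfidfvectorizer"].vocabulary_

print("vocabulary size:", len(vocabulary))
print("predictions:", predictions)
```

The vocabulary is now smaller than 15 words, because the words of the two test texts are not learned.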

### Step 9: Make Predictions for the model

In [393]:
Y_pred = svc.predict(X_test)

In [394]:
Y_pred

array([1, 1])

### OBSERVATIONS:

1. Based on the test data, the predicted output is obtained. Both test samples are predicted as class 1 (Positive).

### Step 10: Evaluate the Model

In [395]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

ac = accuracy_score(Y_test, Y_pred)*100.0

print("Accuracy of the model is:", ac)

Accuracy of the model is: 50.0


In [396]:
cm = confusion_matrix(Y_test, Y_pred)

print("Confusion Matrix is:",cm)

Confusion Matrix is: [[0 1]
 [0 1]]


In [397]:
cr = classification_report(Y_test, Y_pred)

print("classification report is:",cr)

classification report is:               precision    recall  f1-score   support

           0       0.00      0.00      0.00         1
           1       0.50      1.00      0.67         1

    accuracy                           0.50         2
   macro avg       0.25      0.50      0.33         2
weighted avg       0.25      0.50      0.33         2
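
The numbers in the report can be reproduced from the confusion matrix. The sketch below uses the two test labels and the two predictions shown above; `zero_division=0` is passed so that the precision of class 0, which was never predicted, is reported as 0 without a warning.

```python
from sklearn.metrics import classification_report, confusion_matrix

y_true = [1, 0]   # Y_test from Step 7
y_pred = [1, 1]   # Y_pred from Step 9

cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision_pos = tp / (tp + fp)   # of the texts predicted positive, how many are positive
recall_pos = tp / (tp + fn)      # of the positive texts, how many were found

print("accuracy:", accuracy)
print("precision (class 1):", precision_pos)
print("recall    (class 1):", recall_pos)

# Class 0 has no predicted samples, so its precision is undefined (0 / 0).
print(classification_report(y_true, y_pred, zero_division=0))
```

With only 2 test samples, each sample changes the accuracy by 50 percentage points, so these scores say little about the model.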





### Step 11: Test with Custom Sentence

In [398]:
new_text = ["I really love this amazing course"]


### clean the text

cleaned_text = [clean_text(x) for x in new_text]


### Transform the text

transformed = tfidf.transform(cleaned_text)

### predict the model

prediction = svc.predict(transformed)

print("Prediction:", "Positive" if prediction[0] == 1 else "Negative")

Prediction: Positive


### OBSERVATIONS:

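1.  The custom sentence is cleaned with the same clean_text function and transformed with the already fitted tfidf object (transform, not fit_transform), so it is mapped onto the same 15 features used in training. Words that were not seen during fitting are ignored.

2.  The SVC model in sklearn requires consistent input formats during training and prediction. If the model is trained on dense NumPy arrays but receives a sparse matrix during prediction, it raises a ValueError, because sparse and dense inputs use different internal computation routines. Here both training and prediction use sparse TF-IDF matrices, so no error occurs.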
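
The effect of unseen words can be checked directly on the vectorizer. This sketch refits a vectorizer on the six cleaned texts from Step 5; the two input strings are assumed examples (the first is what the custom sentence would look like after cleaning, if "really" is not a stopword).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

cleaned = [
    "love machine learning",
    "movie terrible",
    "data science amazing",
    "hate product",
    "best course",
    "worst experience ever",
]

vectorizer = TfidfVectorizer().fit(cleaned)

# Assumed cleaned form of the custom sentence: "really" is not in the vocabulary.
known = vectorizer.transform(["really love amazing course"])

# A text made only of unseen words becomes an all-zero row.
unknown = vectorizer.transform(["completely unrelated sentence"])

print("shape:", known.shape)
print("non-zero entries (known words):  ", known.nnz)
print("non-zero entries (unknown words):", unknown.nnz)
```

A prediction for an all-zero row depends only on the model's intercept, so it should not be read as a real sentiment.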