# **Text Analytics and Preprocessing**
# This week we learn how to process text data using methods including text parsing, tokenization, Bag of Words, TF, IDF, TF-IDF representation, and sentiment analysis.
# We start with a worked example on each topic, after which students complete a set of related exercises.

# **Example: Text Tokenization**
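**Description:** Tokenization is the process of breaking down a text into individual words or tokens. In this example, we tokenize a given text with Python's built-in `str.split()`, which splits on whitespace only and therefore leaves punctuation attached to neighbouring words (for example `'linguistics,'`).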

In [1]:
# Example text
text = "Natural language processing (NLP) is a subfield of linguistics, " \
       "computer science, and artificial intelligence."

# Tokenize the text
tokens = text.split()

# Print the tokens
print(tokens)


['Natural', 'language', 'processing', '(NLP)', 'is', 'a', 'subfield', 'of', 'linguistics,', 'computer', 'science,', 'and', 'artificial', 'intelligence.']
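
Because `str.split()` splits only on whitespace, tokens such as `'(NLP)'` and `'linguistics,'` keep their punctuation. The following is a minimal sketch of one alternative, assuming that lowercasing and keeping only runs of word characters suits the analysis; the pattern `\w+` is an illustrative choice, not the only reasonable one.

```python
import re

text = ("Natural language processing (NLP) is a subfield of linguistics, "
        "computer science, and artificial intelligence.")

# Lowercase the text, then keep runs of letters, digits, and underscores
tokens = re.findall(r"\w+", text.lower())

print(tokens)
```

With this pattern, `'(NLP)'` becomes `'nlp'` and trailing commas and periods are dropped, at the cost of splitting contractions such as "didn't" into two tokens.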


# **Example: Removing Stopwords**
**Description:** Stopwords are common words such as "the", "is", and "and" that carry little meaning on their own in text analysis. In this example, we remove stopwords from a given text to preprocess it for further analysis.

In [2]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

# Example text
text = "Text analytics is the process of analyzing unstructured text data to extract meaningful insights."

# Tokenize the text
tokens = text.split()

# Remove stopwords
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]

# Print the filtered tokens
print(filtered_tokens)


['Text', 'analytics', 'process', 'analyzing', 'unstructured', 'text', 'data', 'extract', 'meaningful', 'insights.']


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\pandeym\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
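
In the output above, `'insights.'` keeps its trailing period, and a stopword followed by punctuation (for example `"is,"`) would not match the stopword list. A minimal sketch of stripping punctuation before the comparison follows. `SMALL_STOP_WORDS` is a hypothetical hand-written list that stands in for `stopwords.words('english')` so the snippet runs without the NLTK download.

```python
import string

# Hypothetical mini stop list, standing in for stopwords.words('english')
SMALL_STOP_WORDS = {"is", "the", "of", "to", "a", "and"}

text = "Text analytics is the process of analyzing unstructured text data to extract meaningful insights."

filtered_tokens = []
for word in text.split():
    # Remove leading/trailing punctuation, then compare in lowercase
    cleaned = word.strip(string.punctuation).lower()
    if cleaned and cleaned not in SMALL_STOP_WORDS:
        filtered_tokens.append(cleaned)

print(filtered_tokens)
```

Replacing `SMALL_STOP_WORDS` with the NLTK set should give the same structure of result, though the exact tokens removed depend on the list used.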


# **Text Representation and Feature Extraction**
# **Example: Implementing Bag-of-Words Representation**
**Description:** Bag-of-Words (BoW) representation is a simple technique for converting text data into numerical form. In this example, we show how to implement a BoW representation for a corpus of text documents using Python.

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

# Example corpus
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?'
]

# Create CountVectorizer object
vectorizer = CountVectorizer()

# Fit and transform the corpus
X = vectorizer.fit_transform(corpus)

# Print the vocabulary
print("Vocabulary:", vectorizer.get_feature_names_out())

# Print the Bag-of-Words matrix
print("Bag-of-Words Matrix:")
print(X.toarray())


Vocabulary: ['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
Bag-of-Words Matrix:
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]
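
To make the counting explicit, here is a minimal standard-library sketch of the same representation. It assumes `CountVectorizer`'s default settings (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary); the helper name `tokenize` is illustrative.

```python
import re
from collections import Counter

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?'
]

def tokenize(doc):
    # Lowercase and keep tokens of two or more word characters,
    # mirroring CountVectorizer's default token pattern
    return re.findall(r"\b\w\w+\b", doc.lower())

# Count terms per document
doc_counts = [Counter(tokenize(doc)) for doc in corpus]

# Vocabulary: all distinct terms in alphabetical order
vocabulary = sorted(set().union(*doc_counts))

# One row per document, one column per vocabulary term
bow_matrix = [[counts[term] for term in vocabulary] for counts in doc_counts]

print("Vocabulary:", vocabulary)
for row in bow_matrix:
    print(row)
```

Each row counts how often each vocabulary term occurs in one document, so word order is discarded: the first and fourth documents contain the same words and get identical rows.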


# **Example: Implementing TF-IDF Representation**
**Description:** TF-IDF (Term Frequency-Inverse Document Frequency) is a popular technique for text representation that considers both the frequency of a term in a document and its rarity across all documents. In this example, we compute a TF-IDF representation for a given corpus of text documents using scikit-learn's `TfidfVectorizer`.

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Example corpus
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?'
]

# Create TfidfVectorizer object
vectorizer = TfidfVectorizer()

# Fit and transform the corpus
X = vectorizer.fit_transform(corpus)

# Print the vocabulary
print("Vocabulary:", vectorizer.get_feature_names_out())

# Print the TF-IDF matrix
print("TF-IDF Matrix:")
print(X.toarray())


Vocabulary: ['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
TF-IDF Matrix:
[[0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]
 [0.         0.6876236  0.         0.28108867 0.         0.53864762
  0.28108867 0.         0.28108867]
 [0.51184851 0.         0.         0.26710379 0.51184851 0.
  0.26710379 0.51184851 0.26710379]
 [0.         0.46979139 0.58028582 0.38408524 0.         0.
  0.38408524 0.         0.38408524]]
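
The numbers above can be reproduced by hand. The sketch below assumes `TfidfVectorizer`'s default settings: raw term counts as TF, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1` where `n` is the number of documents and `df` the number of documents containing the term, and L2 normalization of each row. The variable names are illustrative.

```python
import math
import re
from collections import Counter

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?'
]

# Term counts per document (lowercased, tokens of two or more characters)
docs = [Counter(re.findall(r"\b\w\w+\b", d.lower())) for d in corpus]
vocabulary = sorted(set().union(*docs))
n_docs = len(docs)

# Document frequency: number of documents containing each term
df = {t: sum(1 for d in docs if t in d) for t in vocabulary}

# Smoothed inverse document frequency
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}

tfidf_matrix = []
for d in docs:
    # Weight each count by the term's IDF, then scale the row to unit length
    weights = [d[t] * idf[t] for t in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    tfidf_matrix.append([w / norm for w in weights])

for row in tfidf_matrix:
    print([round(w, 4) for w in row])
```

In the first document, "first" appears in only two documents and therefore receives a higher weight than "is", "the", and "this", which appear in all four.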


# **Text Classification**
# **Example: Text Classification using Naive Bayes Classifier**
**Description:** Text classification involves categorizing text documents into predefined classes or categories. In this example, we demonstrate how to perform text classification using a Naive Bayes classifier, a popular algorithm for text classification tasks.

In [5]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# Load 20newsgroups dataset
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)

# Create TF-IDF vectorizer
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(twenty_train.data)
y_train = twenty_train.target

# Train Naive Bayes classifier
clf = MultinomialNB()
clf.fit(X_train, y_train)

# Evaluate the classifier
twenty_test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=42)
X_test = vectorizer.transform(twenty_test.data)
y_test = twenty_test.target
y_pred = clf.predict(X_test)

# Print classification report
print(classification_report(y_test, y_pred, target_names=twenty_test.target_names))


                        precision    recall  f1-score   support

           alt.atheism       0.97      0.60      0.74       319
         comp.graphics       0.96      0.89      0.92       389
               sci.med       0.97      0.81      0.88       396
soc.religion.christian       0.65      0.99      0.78       398

              accuracy                           0.83      1502
             macro avg       0.89      0.82      0.83      1502
          weighted avg       0.88      0.83      0.84      1502
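
To show what the classifier computes, here is a minimal from-scratch sketch of multinomial Naive Bayes scoring with Laplace smoothing (`alpha=1.0`). It works on raw word counts rather than the TF-IDF features used above, and the toy documents, labels, and the `predict` helper are hypothetical illustrations, not part of the 20 Newsgroups data.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training set: (tokens, label)
train = [
    (["image", "render", "pixel"], "graphics"),
    (["pixel", "shader", "image"], "graphics"),
    (["patient", "dose", "doctor"], "med"),
    (["doctor", "symptom", "patient"], "med"),
]

# Number of documents per class and word counts per class
class_docs = Counter(label for _, label in train)
word_counts = defaultdict(Counter)
for tokens, label in train:
    word_counts[label].update(tokens)
vocab = {w for tokens, _ in train for w in tokens}

def predict(tokens, alpha=1.0):
    scores = {}
    for label in class_docs:
        total = sum(word_counts[label].values())
        # Log prior: share of training documents in this class
        score = math.log(class_docs[label] / len(train))
        for w in tokens:
            if w in vocab:  # words never seen in training are ignored
                # Smoothed log likelihood of the word given the class
                score += math.log(
                    (word_counts[label][w] + alpha) / (total + alpha * len(vocab))
                )
        scores[label] = score
    # Return the class with the highest log posterior score
    return max(scores, key=scores.get)

print(predict(["pixel", "image", "doctor"]))
```

The smoothing term keeps a single unseen word ("doctor" in the graphics class) from driving the class probability to zero.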



# **Example: Text Classification using Support Vector Machine (SVM)**
**Description:** Support Vector Machine (SVM) is another commonly used algorithm for text classification tasks. In this example, we train an SVM classifier with a linear kernel on the same four 20 Newsgroups categories used in the Naive Bayes example and classify the test documents into their respective categories.

In [6]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Load 20newsgroups dataset
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)

# Create TF-IDF vectorizer
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(twenty_train.data)
y_train = twenty_train.target

# Train SVM classifier
clf = SVC(kernel='linear')
clf.fit(X_train, y_train)

# Evaluate the classifier
twenty_test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=42)
X_test = vectorizer.transform(twenty_test.data)
y_test = twenty_test.target
y_pred = clf.predict(X_test)

# Print classification report
print(classification_report(y_test, y_pred, target_names=twenty_test.target_names))


                        precision    recall  f1-score   support

           alt.atheism       0.96      0.83      0.89       319
         comp.graphics       0.90      0.96      0.93       389
               sci.med       0.94      0.91      0.93       396
soc.religion.christian       0.89      0.96      0.93       398

              accuracy                           0.92      1502
             macro avg       0.93      0.92      0.92      1502
          weighted avg       0.92      0.92      0.92      1502



# **Example: Sentiment Analysis on product reviews**
This example builds a small dataset of five product reviews, each labeled positive, negative, or neutral. After basic preprocessing and TF-IDF feature extraction, we train a logistic regression model on part of the data and test it on the rest. With only five reviews, the 20% test split holds a single review, so the reported metrics are illustrative only.

In [7]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Sample data
data = {
    "Review": [
        "This product is amazing! I love it.",
        "The quality of the product is poor. Disappointed.",
        "It meets my expectations. Good value for money.",
        "I don't like it. Waste of money.",
        "Neutral opinion about the product."
    ],
    "Sentiment": ["Positive", "Negative", "Positive", "Negative", "Neutral"]
}

# Create DataFrame
df = pd.DataFrame(data)

# Split data into train and test sets
train_texts, test_texts, train_labels, test_labels = train_test_split(df['Review'], df['Sentiment'], test_size=0.2, random_state=42)

# Text preprocessing and feature extraction (TF-IDF)
vectorizer = TfidfVectorizer(stop_words='english')
train_features = vectorizer.fit_transform(train_texts)
test_features = vectorizer.transform(test_texts)

# Model training (Logistic Regression)
model = LogisticRegression(max_iter=1000)
model.fit(train_features, train_labels)

# Model evaluation
predictions = model.predict(test_features)
accuracy = accuracy_score(test_labels, predictions)
print(f"Accuracy: {accuracy:.2f}")

# Classification report
print(classification_report(test_labels, predictions))


Accuracy: 0.00
              precision    recall  f1-score   support

    Negative       0.00      0.00      0.00       1.0
    Positive       0.00      0.00      0.00       0.0

    accuracy                           0.00       1.0
   macro avg       0.00      0.00      0.00       1.0
weighted avg       0.00      0.00      0.00       1.0
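
The zero rows in the report above follow from how the per-class metrics are defined. A minimal sketch, assuming (as the support column suggests) that the single test review was labeled Negative and predicted Positive. `per_class_metrics` is a hypothetical helper, and undefined ratios are set to 0, which is what the report displays.

```python
def per_class_metrics(y_true, y_pred, label):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # true positives
    fp = sum(t != label and p == label for t, p in pairs)  # false positives
    fn = sum(t == label and p != label for t, p in pairs)  # false negatives
    # Ratios with a zero denominator are undefined; report them as 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    support = tp + fn  # number of true instances of this class
    return precision, recall, f1, support

# Assumed single test example, inferred from the report above
y_true = ["Negative"]
y_pred = ["Positive"]

for label in ("Negative", "Positive"):
    print(label, per_class_metrics(y_true, y_pred, label))
```

Negative has no predicted instances (precision undefined) and Positive has no true instances (recall undefined), which is why every metric is reported as 0.00 on a one-review test set.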





# **---------------------------------------------------------------------------**

# **Exercises**
# Please complete the following exercises

# **Text Tokenization Exercise:**
# Scenario:
You have a dataset containing customer feedback for a restaurant. Each feedback entry is a string representing a customer's comment. Your task is to tokenize each feedback into individual words to analyze the most commonly mentioned aspects of the restaurant.
# Dataset: Customer Feedback for a Restaurant

1. "The food was delicious and the service was excellent."
2. "We had a great time dining at this restaurant."
3. "The ambiance was cozy and welcoming."
4. "The staff was friendly but the food took too long to arrive."
5. "I didn't enjoy the meal as much as I had hoped."
6. "The desserts were divine!"
7. "The portions were generous, and the prices were reasonable."
8. "The restaurant was crowded, and we had to wait for a table."
9. "The presentation of the dishes was beautiful."
10. "Overall, it was a pleasant dining experience."


In [10]:
# Customer feedback entries
feedbacks = ["The food was delicious and the service was excellent.",
"We had a great time dining at this restaurant.",
"The ambiance was cozy and welcoming.",
"The staff was friendly but the food took too long to arrive.",
"I didn't enjoy the meal as much as I had hoped.",
"The desserts were divine!",
"The portions were generous, and the prices were reasonable.",
"The restaurant was crowded, and we had to wait for a table.",
"The presentation of the dishes was beautiful.",
"Overall, it was a pleasant dining experience."]

for text in feedbacks:
    # Tokenize the text
    tokens = text.split()
    # Print the tokens
    print(tokens)


['The', 'food', 'was', 'delicious', 'and', 'the', 'service', 'was', 'excellent.']
['We', 'had', 'a', 'great', 'time', 'dining', 'at', 'this', 'restaurant.']
['The', 'ambiance', 'was', 'cozy', 'and', 'welcoming.']
['The', 'staff', 'was', 'friendly', 'but', 'the', 'food', 'took', 'too', 'long', 'to', 'arrive.']
['I', "didn't", 'enjoy', 'the', 'meal', 'as', 'much', 'as', 'I', 'had', 'hoped.']
['The', 'desserts', 'were', 'divine!']
['The', 'portions', 'were', 'generous,', 'and', 'the', 'prices', 'were', 'reasonable.']
['The', 'restaurant', 'was', 'crowded,', 'and', 'we', 'had', 'to', 'wait', 'for', 'a', 'table.']
['The', 'presentation', 'of', 'the', 'dishes', 'was', 'beautiful.']
['Overall,', 'it', 'was', 'a', 'pleasant', 'dining', 'experience.']
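
The scenario asks for the most commonly mentioned aspects, which needs a frequency count on top of the tokens. A minimal sketch using `collections.Counter`, assuming lowercased tokens made of letters and apostrophes; the regular expression is an illustrative choice.

```python
import re
from collections import Counter

feedbacks = [
    "The food was delicious and the service was excellent.",
    "We had a great time dining at this restaurant.",
    "The ambiance was cozy and welcoming.",
    "The staff was friendly but the food took too long to arrive.",
    "I didn't enjoy the meal as much as I had hoped.",
    "The desserts were divine!",
    "The portions were generous, and the prices were reasonable.",
    "The restaurant was crowded, and we had to wait for a table.",
    "The presentation of the dishes was beautiful.",
    "Overall, it was a pleasant dining experience.",
]

counts = Counter()
for text in feedbacks:
    # Lowercase and keep runs of letters/apostrophes so "didn't" stays whole
    counts.update(re.findall(r"[a-z']+", text.lower()))

# Show the ten most frequent tokens with their counts
for token, n in counts.most_common(10):
    print(token, n)
```

The top of the list is dominated by words such as "the" and "was", so removing stopwords first is one way to make content words such as "food" stand out.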


# **Removing Stopwords Exercise:**
# Scenario:
You are analyzing movie reviews for a film festival. The dataset consists of reviews from various movie critics. Before sentiment analysis, you need to preprocess the text data by removing stopwords to focus on the meaningful content of the reviews.
# Dataset: Movie Reviews for Film Festival

1. "The cinematography in this film is absolutely stunning, and the acting performances are top-notch."
2. "I found the plot to be quite predictable, but the visual effects were impressive."
3. "The dialogue felt natural, and the chemistry between the lead actors was palpable."
4. "This movie kept me on the edge of my seat from start to finish. A thrilling experience!"
5. "I was disappointed by the lack of character development. The story felt shallow and unconvincing."
6. "The soundtrack perfectly complemented the mood of the film. A standout aspect for me."
7. "The pacing of the movie was uneven, making it difficult to fully engage with the story."
8. "Despite its flaws, this film offers a unique perspective on a familiar genre."
9. "The ending was satisfying and left me contemplating its deeper meaning long after the credits rolled."
10. "Overall, a thought-provoking film that stays with you."


In [16]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

# Example text
reviews = [
    "The cinematography in this film is absolutely stunning, and the acting performances are top-notch.",
    "I found the plot to be quite predictable, but the visual effects were impressive.",
    "The dialogue felt natural, and the chemistry between the lead actors was palpable.",
    "This movie kept me on the edge of my seat from start to finish. A thrilling experience!",
    "I was disappointed by the lack of character development. The story felt shallow and unconvincing.",
    "The soundtrack perfectly complemented the mood of the film. A standout aspect for me.",
    "The pacing of the movie was uneven, making it difficult to fully engage with the story.",
    "Despite its flaws, this film offers a unique perspective on a familiar genre.",
    "The ending was satisfying and left me contemplating its deeper meaning long after the credits rolled.",
    "Overall, a thought-provoking film that stays with you."
]

# Build the stopword set once, outside the loop
stop_words = set(stopwords.words('english'))

for text in reviews:
    print("Review: '" + text + "'")
    # Tokenize the text
    tokens = text.split()

    # Remove stopwords (case-insensitive comparison)
    filtered_tokens = [word for word in tokens if word.lower() not in stop_words]

    # Print the filtered tokens
    print("Tokens without stop words: ", filtered_tokens)

Review: 'The cinematography in this film is absolutely stunning, and the acting performances are top-notch.'
Tokens without stop words:  ['cinematography', 'film', 'absolutely', 'stunning,', 'acting', 'performances', 'top-notch.']
Review: 'I found the plot to be quite predictable, but the visual effects were impressive.'
Tokens without stop words:  ['found', 'plot', 'quite', 'predictable,', 'visual', 'effects', 'impressive.']
Review: 'The dialogue felt natural, and the chemistry between the lead actors was palpable.'
Tokens without stop words:  ['dialogue', 'felt', 'natural,', 'chemistry', 'lead', 'actors', 'palpable.']
Review: 'This movie kept me on the edge of my seat from start to finish. A thrilling experience!'
Tokens without stop words:  ['movie', 'kept', 'edge', 'seat', 'start', 'finish.', 'thrilling', 'experience!']
Review: 'I was disappointed by the lack of character development. The story felt shallow and unconvincing.'
Tokens without stop words:  ['disappointed', 'lack', 'ch

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\pandeym\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


# **Bag-of-Words Representation Exercise:**
# Scenario:
You are working on a sentiment analysis project for product reviews. You have a dataset containing reviews for electronic gadgets. Convert the dataset into a Bag-of-Words representation, treating each unique word in the reviews as a feature, to train a sentiment classifier.
# Dataset: Product Reviews for Electronic Gadgets

1. "The smartphone has a sleek design and great performance."
2. "The battery life of this tablet is disappointing."
3. "I love the camera quality of this digital camera."
4. "The sound quality of these headphones is amazing."
5. "The laptop froze multiple times within the first week of use."
6. "This smartwatch is easy to use and has useful features."
7. "The touchscreen of this e-reader is unresponsive at times."
8. "The gaming console heats up quickly during extended use."
9. "The voice recognition feature of this smart speaker is impressive."
10. "The software update improved the functionality of this fitness tracker."


In [17]:
from sklearn.feature_extraction.text import CountVectorizer

# Example corpus
reviews = [
    "The smartphone has a sleek design and great performance.",
    "The battery life of this tablet is disappointing.",
    "I love the camera quality of this digital camera.",
    "The sound quality of these headphones is amazing.",
    "The laptop froze multiple times within the first week of use.",
    "This smartwatch is easy to use and has useful features.",
    "The touchscreen of this e-reader is unresponsive at times.",
    "The gaming console heats up quickly during extended use.",
    "The voice recognition feature of this smart speaker is impressive.",
    "The software update improved the functionality of this fitness tracker."
]

# Create CountVectorizer object
vectorizer = CountVectorizer()

# Fit and transform the corpus
X = vectorizer.fit_transform(reviews)

# Print the vocabulary
print("Vocabulary:", vectorizer.get_feature_names_out())

# Print the Bag-of-Words matrix
print("Bag-of-Words Matrix:")
print(X.toarray())

Vocabulary: ['amazing' 'and' 'at' 'battery' 'camera' 'console' 'design' 'digital'
 'disappointing' 'during' 'easy' 'extended' 'feature' 'features' 'first'
 'fitness' 'froze' 'functionality' 'gaming' 'great' 'has' 'headphones'
 'heats' 'impressive' 'improved' 'is' 'laptop' 'life' 'love' 'multiple'
 'of' 'performance' 'quality' 'quickly' 'reader' 'recognition' 'sleek'
 'smart' 'smartphone' 'smartwatch' 'software' 'sound' 'speaker' 'tablet'
 'the' 'these' 'this' 'times' 'to' 'touchscreen' 'tracker' 'unresponsive'
 'up' 'update' 'use' 'useful' 'voice' 'week' 'within']
Bag-of-Words Matrix:
[[0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
  1 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0
  0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 2 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0
  0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
 [1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

# **TF-IDF Implementation Exercise:**
# Scenario:
You are working on a project to classify news articles into different categories. Implement TF-IDF representation for a corpus of news articles to prepare them for classification.
# Dataset: News Articles for Classification

Category: Politics
1. "The government announced new policies to address economic challenges."
2. "Opposition leaders criticized the proposed budget for its lack of transparency."
3. "Political tensions rise as the election date approaches."

Category: Technology
4. "Apple unveils its latest iPhone model with advanced features."
5. "Google launches a new artificial intelligence research lab."
6. "Microsoft announces plans to acquire a leading cybersecurity firm."

Category: Sports
7. "The home team secures a decisive victory in the championship game."
8. "A star athlete signs a record-breaking contract with a professional team."
9. "An underdog team surprises everyone by advancing to the finals."

Category: Health
10. "Researchers discover a potential breakthrough in cancer treatment."
11. "Health officials warn of a spike in flu cases during the winter season."
12. "The benefits of regular exercise on mental health are highlighted in a new study."


In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Corpus of news articles (as a list of strings)
corpus = [
    "The government announced new policies to address economic challenges.",
    "Opposition leaders criticized the proposed budget for its lack of transparency.",
    "Political tensions rise as the election date approaches.",
    "Apple unveils its latest iPhone model with advanced features.",
    "Google launches a new artificial intelligence research lab.",
    "Microsoft announces plans to acquire a leading cybersecurity firm.",
    "The home team secures a decisive victory in the championship game.",
    "A star athlete signs a record-breaking contract with a professional team.",
    "An underdog team surprises everyone by advancing to the finals.",
    "Researchers discover a potential breakthrough in cancer treatment.",
    "Health officials warn of a spike in flu cases during the winter season.",
    "The benefits of regular exercise on mental health are highlighted in a new study."
]

# Initialize TfidfVectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the corpus to get the TF-IDF matrix
X = vectorizer.fit_transform(corpus)

# Convert the sparse TF-IDF matrix to a dense NumPy array
dense_matrix = X.toarray()

# Create a DataFrame to view the TF-IDF scores for each term in each document
import pandas as pd
df_tfidf = pd.DataFrame(dense_matrix, columns=vectorizer.get_feature_names_out())

# Show the TF-IDF matrix
print(df_tfidf)


     acquire   address  advanced  advancing        an  announced  announces  \
0   0.000000  0.367145    0.0000   0.000000  0.000000   0.367145   0.000000   
1   0.000000  0.000000    0.0000   0.000000  0.000000   0.000000   0.000000   
2   0.000000  0.000000    0.0000   0.000000  0.000000   0.000000   0.000000   
3   0.000000  0.000000    0.3435   0.000000  0.000000   0.000000   0.000000   
4   0.000000  0.000000    0.0000   0.000000  0.000000   0.000000   0.000000   
5   0.363324  0.000000    0.0000   0.000000  0.000000   0.000000   0.363324   
6   0.000000  0.000000    0.0000   0.000000  0.000000   0.000000   0.000000   
7   0.000000  0.000000    0.0000   0.000000  0.000000   0.000000   0.000000   
8   0.000000  0.000000    0.0000   0.344651  0.344651   0.000000   0.000000   
9   0.000000  0.000000    0.0000   0.000000  0.000000   0.000000   0.000000   
10  0.000000  0.000000    0.0000   0.000000  0.000000   0.000000   0.000000   
11  0.000000  0.000000    0.0000   0.000000  0.00000

# **Text Classification Exercise:**
# Scenario:
You are building a spam email classifier. The dataset consists of emails labeled as spam or non-spam. Train a text classification model to classify the emails into spam and non-spam categories based on their content.
# Dataset: Spam Email Classifier
| Email Content                                            | Label    |
|----------------------------------------------------------|----------|
| Congratulations! You've won a free vacation!             | Spam     |
| Important notice: Your account password has expired.     | Non-Spam |
| Click here to claim your prize money now!                | Spam     |
| Reminder: Your subscription renewal is due next week.    | Non-Spam |
| Get rich quick with this amazing investment opportunity. | Spam     |
| Your monthly newsletter is now available. Click to read. | Non-Spam |
| Discount offer: Save 50% on all purchases today!         | Spam     |
| Urgent: Your package delivery has been delayed.          | Non-Spam |
| Make money from home with our easy work-from-home jobs.  | Spam     |
| Thank you for your recent purchase. Here's your receipt. | Non-Spam |


In [26]:
import pandas as pd

# Class labels used in this exercise
categories = ['Spam', 'Non-Spam']

# Create a dictionary with 'Email Content' and 'Label' columns
data = {
    'Email Content': [
        "Congratulations! You've won a free vacation!",
        "Important notice: Your account password has expired.",
        "Click here to claim your prize money now!",
        "Reminder: Your subscription renewal is due next week.",
        "Get rich quick with this amazing investment opportunity.",
        "Your monthly newsletter is now available. Click to read.",
        "Discount offer: Save 50% on all purchases today!",
        "Urgent: Your package delivery has been delayed.",
        "Make money from home with our easy work-from-home jobs.",
        "Thank you for your recent purchase. Here's your receipt."
    ],
    'Label': [
        'Spam',
        'Non-Spam',
        'Spam',
        'Non-Spam',
        'Spam',
        'Non-Spam',
        'Spam',
        'Non-Spam',
        'Spam',
        'Non-Spam'
    ]
}

# Convert the dictionary into a pandas DataFrame
df = pd.DataFrame(data)

# Display the DataFrame
print(df)

                                       Email Content     Label
0       Congratulations! You've won a free vacation!      Spam
1  Important notice: Your account password has ex...  Non-Spam
2          Click here to claim your prize money now!      Spam
3  Reminder: Your subscription renewal is due nex...  Non-Spam
4  Get rich quick with this amazing investment op...      Spam
5  Your monthly newsletter is now available. Clic...  Non-Spam
6   Discount offer: Save 50% on all purchases today!      Spam
7    Urgent: Your package delivery has been delayed.  Non-Spam
8  Make money from home with our easy work-from-h...      Spam
9  Thank you for your recent purchase. Here's you...  Non-Spam


In [27]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
import pandas as pd


# Encode the labels: 'Spam' -> 1, 'Non-Spam' -> 0
df['Label'] = df['Label'].map({'Spam': 1, 'Non-Spam': 0})

# Split the data into features and labels
X = df['Email Content']
y = df['Label']

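# Create a TF-IDF vectorizer to transform the text data into feature vectors
vectorizer = TfidfVectorizer()

# Transform the email content into TF-IDF features.
# Note: fitting on all emails before splitting lets test-set vocabulary and
# document frequencies influence the features; for a real evaluation, split
# first and fit the vectorizer on the training emails only.
X_tfidf = vectorizer.fit_transform(X)

# Split the dataset into training and testing sets (80% train, 20% test)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=42)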

# Train the Naive Bayes classifier
clf = MultinomialNB()
clf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = clf.predict(X_test)

# Print classification report for model evaluation
print(classification_report(y_test, y_pred, target_names=['Non-Spam', 'Spam']))


              precision    recall  f1-score   support

    Non-Spam       1.00      1.00      1.00         1
        Spam       1.00      1.00      1.00         1

    accuracy                           1.00         2
   macro avg       1.00      1.00      1.00         2
weighted avg       1.00      1.00      1.00         2



# **Sentiment Analysis Exercise:**
# Scenario:
You are analyzing customer feedback for a product. Perform sentiment analysis on the feedback to identify whether each review is positive, negative, or neutral.

# Dataset: Customer feedback for a product

| Review                                                    | Sentiment |
|-----------------------------------------------------------|-----------|
| "The product exceeded my expectations! Highly recommend." | Positive  |
| "I'm satisfied with the quality of the product."          | Positive  |
| "This product is average, nothing special."               | Neutral   |
| "Disappointed with the performance. Would not buy again." | Negative  |
| "Great value for the price. Will purchase again."         | Positive  |
| "The product arrived damaged. Poor quality control."      | Negative  |
| "So-so product. Not worth the money."                     | Neutral   |
| "Absolutely love this product! It's a game-changer."      | Positive  |
| "The product did not meet my expectations."               | Negative  |
| "Excellent customer service. Resolved my issue quickly."  | Positive  |


In [28]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Data as a dictionary
data = {
    'Review': [
        "The product exceeded my expectations! Highly recommend.",
        "I'm satisfied with the quality of the product.",
        "This product is average, nothing special.",
        "Disappointed with the performance. Would not buy again.",
        "Great value for the price. Will purchase again.",
        "The product arrived damaged. Poor quality control.",
        "So-so product. Not worth the money.",
        "Absolutely love this product! It's a game-changer.",
        "The product did not meet my expectations.",
        "Excellent customer service. Resolved my issue quickly."
    ],
    'Sentiment': [
        'Positive',
        'Positive',
        'Neutral',
        'Negative',
        'Positive',
        'Negative',
        'Neutral',
        'Positive',
        'Negative',
        'Positive'
    ]
}


# Create DataFrame
df = pd.DataFrame(data)

print(df)

# Split data into train and test sets
train_texts, test_texts, train_labels, test_labels = train_test_split(df['Review'], df['Sentiment'], test_size=0.2, random_state=42)

# Text preprocessing and feature extraction (TF-IDF)
vectorizer = TfidfVectorizer(stop_words='english')
train_features = vectorizer.fit_transform(train_texts)
test_features = vectorizer.transform(test_texts)

# Model training (Logistic Regression)
model = LogisticRegression(max_iter=1000)
model.fit(train_features, train_labels)

# Model evaluation
predictions = model.predict(test_features)
accuracy = accuracy_score(test_labels, predictions)
print(f"Accuracy: {accuracy:.2f}")

# Classification report
print(classification_report(test_labels, predictions))

                                              Review Sentiment
0  The product exceeded my expectations! Highly r...  Positive
1     I'm satisfied with the quality of the product.  Positive
2          This product is average, nothing special.   Neutral
3  Disappointed with the performance. Would not b...  Negative
4    Great value for the price. Will purchase again.  Positive
5  The product arrived damaged. Poor quality cont...  Negative
6                So-so product. Not worth the money.   Neutral
7  Absolutely love this product! It's a game-chan...  Positive
8          The product did not meet my expectations.  Negative
9  Excellent customer service. Resolved my issue ...  Positive
Accuracy: 0.50
              precision    recall  f1-score   support

    Negative       0.00      0.00      0.00         1
    Positive       0.50      1.00      0.67         1

    accuracy                           0.50         2
   macro avg       0.25      0.50      0.33         2
weighted avg       0

