<div style="  background: linear-gradient(145deg, #0f172a, #1e293b);  border: 4px solid transparent;  border-radius: 14px;  padding: 18px 22px;  margin: 12px 0;  font-size: 26px;  font-weight: 600;  color: #f8fafc;  box-shadow: 0 6px 14px rgba(0,0,0,0.25);  background-clip: padding-box;  position: relative;">  <div style="    position: absolute;    inset: 0;    padding: 4px;    border-radius: 14px;    background: linear-gradient(90deg, #06b6d4, #3b82f6, #8b5cf6);    -webkit-mask:       linear-gradient(#fff 0 0) content-box,       linear-gradient(#fff 0 0);    -webkit-mask-composite: xor;    mask-composite: exclude;    pointer-events: none;  "></div>    <b>SENTIMENT ANALYSIS IN PYTHON</b>    <br/>  <span style="color:#9ca3af; font-size: 18px; font-weight: 400;">(Let's predict the sentiment!)</span></div>

## Table of Contents

1. [Classification Problems](#section-1)
2. [Linear and Logistic Regressions](#section-2)
3. [The Logistic Function](#section-3)
4. [Logistic Regression in Python](#section-4)
5. [Measuring Model Performance](#section-5)
6. [Using Accuracy Score](#section-6)
7. [Train/Test Split](#section-7)
8. [Logistic Regression with Train/Test Split](#section-8)
9. [Confusion Matrix](#section-9)
10. [Complex Models and Regularization](#section-10)
11. [Predicting Probability vs. Class](#section-11)
12. [The Sentiment Analysis Problem](#section-12)
13. [Exploration of Reviews](#section-13)
14. [Numeric Transformations (Vectorization)](#section-14)
15. [Arguments of the Vectorizers](#section-15)
16. [The Automated Sentiment Analysis System](#section-16)
17. [Conclusion](#section-17)

---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 1. Classification Problems</span><br>

Sentiment analysis is fundamentally a classification problem where we attempt to categorize text data into specific sentiments.

### Types of Classification
1.  **Binary Classification**:
    *   Used for product and movie reviews.
    *   Outcome: **Positive** or **Negative**.
2.  **Multi-class Classification**:
    *   Used for more nuanced scenarios, such as tweets about airline companies.
    *   Outcome: **Positive**, **Neutral**, or **Negative**.

To demonstrate the concepts in this notebook, we will first generate some synthetic data to simulate a classification problem.



In [1]:
# Synthetic Data Generation for Demonstration
import numpy as np
from sklearn.datasets import make_classification

# Generate synthetic features (X) and labels (y)
# X represents numeric features extracted from text
# y represents sentiment (0 = Negative, 1 = Positive)
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

print(f"Features shape: {X.shape}")
print(f"Labels shape: {y.shape}")
print(f"Sample labels: {y[:10]}")

Features shape: (1000, 10)
Labels shape: (1000,)
Sample labels: [0 1 1 0 1 0 0 1 1 0]



---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 2. Linear and Logistic Regressions</span><br>

When approaching classification, it is important to distinguish between Linear and Logistic regression.

*   **Linear Regression**: Fits a straight line through the data. It is designed to predict a continuous numeric outcome (e.g., price, temperature). It is unbounded, meaning predictions can go from $-\infty$ to $+\infty$.
*   **Logistic Regression**: Fits an S-shaped curve (sigmoid function). It is designed for classification. It outputs values between 0 and 1, which can be interpreted as probabilities.
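The contrast between the two models can be sketched on a tiny one-dimensional dataset. The data and the names `x`, `labels`, and `x_new` are made up for illustration; the point is only that the linear model's outputs leave the $[0, 1]$ range while the logistic model's probabilities cannot.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy 1-D data (illustrative only): small values are class 0, large values are class 1
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

lin = LinearRegression().fit(x, labels)
logit = LogisticRegression().fit(x, labels)

# Evaluate both models far outside the training range
x_new = np.array([[-10.0], [20.0]])
lin_pred = lin.predict(x_new)                  # unbounded: can fall below 0 or above 1
logit_prob = logit.predict_proba(x_new)[:, 1]  # probability of class 1, always in [0, 1]

print("Linear regression outputs:  ", lin_pred)
print("Logistic regression outputs:", logit_prob)
```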

---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 3. The Logistic Function</span><br>

While linear regression predicts a numeric outcome, logistic regression predicts a **probability**.

In the context of sentiment analysis, we are calculating the probability that a given review is positive:

$$ P(\text{sentiment} = \text{positive} \mid \text{review}) $$

*   **Y-axis**: Probability (0 to 1).
*   **X-axis**: Input features.
*   **Threshold**: Typically, if the probability is $> 0.5$, we classify it as Positive (1); otherwise, Negative (0).
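As a minimal sketch, the logistic (sigmoid) function and the 0.5 threshold can be written in a few lines of NumPy. Here `z` stands for the model's linear score (a weighted sum of the features plus an intercept); the example values are arbitrary.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Arbitrary example scores, from strongly negative to strongly positive
z = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
probabilities = sigmoid(z)

# Apply the 0.5 threshold: strictly above 0.5 becomes Positive (1), otherwise Negative (0)
predicted_class = (probabilities > 0.5).astype(int)

print("Probabilities:", probabilities.round(3))
print("Classes:      ", predicted_class)
```

A score of exactly 0 maps to a probability of exactly 0.5, which the strict `>` comparison assigns to the negative class.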

---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 4. Logistic Regression in Python</span><br>

We use the `scikit-learn` library to implement Logistic Regression. The standard workflow involves importing the class, instantiating it, and fitting it to the data.

### Original Code Snippet


In [2]:
from sklearn.linear_model import LogisticRegression

# Assuming X and y are defined
# log_reg = LogisticRegression().fit(X, y)



### Enhanced Executable Code
Below, we train a model using the synthetic data generated in Section 1.



In [3]:
from sklearn.linear_model import LogisticRegression

# Instantiate the model
log_reg = LogisticRegression()

# Fit the model to the data
log_reg.fit(X, y)

print("Model trained successfully.")

Model trained successfully.



---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 5. Measuring Model Performance</span><br>

**Accuracy** is the fraction of predictions our model got right.
*   Accuracy ranges from 0 to 1; the closer it is to 1, the better.

We can calculate accuracy directly using the `.score()` method of the trained model.

### Code Implementation



In [4]:
# Accuracy using the .score() method
score = log_reg.score(X, y)

print(f"Accuracy score: {score}")

Accuracy score: 0.859



<div style="background: #e0f2fe; border-left: 16px solid #0284c7; padding: 14px 18px; border-radius: 8px; font-size: 18px; color: #075985;"> 💡 <b>Tip:</b> In the original presentation, the example accuracy was 0.9009. Your result above depends on the synthetic data generation. </div>

---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 6. Using Accuracy Score</span><br>

Alternatively, we can predict the labels first and then use the `accuracy_score` function from `sklearn.metrics`. This is useful when you want to compare predicted vectors against actual vectors explicitly.

### Code Implementation



In [5]:
from sklearn.metrics import accuracy_score

# 1. Predict labels for the data
y_predicted = log_reg.predict(X)

# 2. Calculate accuracy by comparing actual (y) vs predicted (y_predicted)
accuracy = accuracy_score(y, y_predicted)

print(f"Accuracy using accuracy_score: {accuracy}")

Accuracy using accuracy_score: 0.859
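To make the definition concrete, accuracy is just the fraction of positions where the predicted label equals the actual label. The sketch below checks this on two short, made-up label vectors (`y_true` and `y_pred` are illustrative, not taken from the model above).

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels for illustration
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Element-wise comparison gives True/False; the mean of that is the accuracy
manual_accuracy = np.mean(y_true == y_pred)
library_accuracy = accuracy_score(y_true, y_pred)

print(f"Manual accuracy:  {manual_accuracy}")
print(f"accuracy_score:   {library_accuracy}")
```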



---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 7. Train/Test Split</span><br>

To properly evaluate a model, we should not test it on the same data used for training: doing so gives an overly optimistic score and hides overfitting, because the model has already seen those examples. We must split the data.

*   **Training set**: Used to train the model (typically 70-80% of data).
*   **Testing set**: Used to evaluate performance on unseen data.

### Key Parameters for `train_test_split`
*   `X`: Features.
*   `y`: Labels.
*   `test_size`: Proportion of data used for testing (e.g., 0.2 for 20%).
*   `random_state`: Seed for the random number generator, so the split is reproducible.
*   `stratify`: Ensures the proportion of classes (positive/negative) in the split matches the original data.

### Code Implementation



In [6]:
from sklearn.model_selection import train_test_split

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, 
    y, 
    test_size=0.2, 
    random_state=123, 
    stratify=y
)

print(f"Training set shape: {X_train.shape}")
print(f"Testing set shape: {X_test.shape}")

Training set shape: (800, 10)
Testing set shape: (200, 10)
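The effect of `stratify` is easiest to see on imbalanced labels. The following sketch uses a hypothetical dataset with 10% positive examples (the names `labels` and `features` are made up for illustration) and checks that both splits keep that proportion.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 negative (0) and 10 positive (1) examples
labels = np.array([0] * 90 + [1] * 10)
features = np.arange(100).reshape(-1, 1)  # placeholder features, one column

_, _, y_tr, y_te = train_test_split(
    features,
    labels,
    test_size=0.2,
    random_state=123,
    stratify=labels,  # keep the 90/10 class ratio in both splits
)

print(f"Positive share in training set: {y_tr.mean():.2f}")
print(f"Positive share in testing set:  {y_te.mean():.2f}")
```

Without `stratify`, a random 20-row test set drawn from this data could easily contain zero positives, which would make the test score uninformative for that class.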



---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 8. Logistic Regression with Train/Test Split</span><br>

Now we retrain the model using **only** the training data and evaluate it on both sets to check for overfitting.

### Code Implementation



In [7]:
# Train on training data
log_reg = LogisticRegression().fit(X_train, y_train)

# Check accuracy on training data
train_acc = log_reg.score(X_train, y_train)
print(f'Accuracy on training data: {train_acc}')

# Check accuracy on testing data
test_acc = log_reg.score(X_test, y_test)
print(f'Accuracy on testing data: {test_acc}')

Accuracy on training data: 0.86625
Accuracy on testing data: 0.85



<div style="background: #e0f2fe; border-left: 16px solid #0284c7; padding: 14px 18px; border-radius: 8px; font-size: 18px; color: #075985;"> 💡 <b>Tip:</b> If Training Accuracy is significantly higher than Testing Accuracy, the model is likely <b>overfitting</b>. </div>

---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 9. Confusion Matrix</span><br>

Accuracy alone can be misleading, especially when one class is much more common than the other. A confusion matrix gives a detailed breakdown of correct and incorrect predictions for each class.

### The Matrix Structure

| | **Actual = 1** (Positive) | **Actual = 0** (Negative) |
| :--- | :--- | :--- |
| **Predicted = 1** | **True Positive** (Correct) | **False Positive** (Type I Error) |
| **Predicted = 0** | **False Negative** (Type II Error) | **True Negative** (Correct) |

### Code Implementation

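Note that scikit-learn's `confusion_matrix` places **actual** classes in rows and **predicted** classes in columns, with labels in sorted order. For 0/1 labels the output therefore reads `[[TN, FP], [FN, TP]]`, which is arranged differently from the table above. Dividing the matrix by the size of the test set converts the counts into proportions.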



In [8]:
from sklearn.metrics import confusion_matrix

# Predict on test set
y_predicted = log_reg.predict(X_test)

# Generate confusion matrix
conf_matrix = confusion_matrix(y_test, y_predicted)

# Print raw counts
print("Confusion Matrix (Counts):")
print(conf_matrix)

# Print proportions (as shown in the PDF)
print("\nConfusion Matrix (Proportions):")
print(conf_matrix / len(y_test))

Confusion Matrix (Counts):
[[85 15]
 [15 85]]

Confusion Matrix (Proportions):
[[0.425 0.075]
 [0.075 0.425]]
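A small sketch shows why accuracy alone can mislead. It uses made-up, heavily imbalanced labels and a "model" that always predicts the negative class; the accuracy looks strong, while the confusion matrix reveals that no positive example was found. Unpacking with `.ravel()` relies on scikit-learn's row = actual, column = predicted layout for binary 0/1 labels.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up imbalanced labels: 95 negative (0) and 5 positive (1)
y_true = np.array([0] * 95 + [1] * 5)

# A trivial "model" that always predicts negative
y_pred = np.zeros(100, dtype=int)

acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)

# For binary 0/1 labels the flattened order is TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()

print(f"Accuracy: {acc}")
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```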



---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 10. Complex Models and Regularization</span><br>

### The Problem: Overfitting
A complex model might capture "noise" in the data rather than the underlying pattern. This often happens when the number of features is large relative to the number of training examples.

### The Solution: Regularization
Regularization simplifies the model to prevent overfitting.

*   **L2 Regularization**: Shrinks all coefficients towards zero (but not exactly zero).
*   **C Parameter**: Inverse of regularization strength.
    *   **High C**: Low penalization (Model fits training data very closely, risk of overfitting).
    *   **Low C**: High penalization (Model is less flexible, simpler).

### Code Implementation



In [9]:
# Logistic Regression with L2 penalty and specific C value
# Note: 'l2' is the default penalty in scikit-learn; we specify it explicitly here for clarity.
log_reg_regularized = LogisticRegression(penalty='l2', C=1.0)

log_reg_regularized.fit(X_train, y_train)
print("Regularized model trained.")

Regularized model trained.
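One way to see the role of `C` is to compare the size of the learned coefficients at different settings. The sketch below regenerates synthetic data with the same `make_classification` arguments used in Section 1 (under the illustrative names `X_demo` and `y_demo`) and prints the L2 norm of the coefficient vector; the particular `C` values are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, generated the same way as in Section 1
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=42)

coef_norms = {}
for C in [0.001, 1.0, 1000.0]:
    # L2 is the default penalty; only the strength (via C) changes here
    model = LogisticRegression(C=C, max_iter=1000).fit(X_demo, y_demo)
    coef_norms[C] = np.linalg.norm(model.coef_)
    print(f"C={C:<8} coefficient norm={coef_norms[C]:.4f}")
```

A small `C` (strong penalty) should pull the coefficients close to zero, while a large `C` leaves them almost unconstrained.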





---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 11. Predicting Probability vs. Class</span><br>

Sometimes we want the raw probability score rather than just the class label (0 or 1).

*   `predict()`: Returns the class (0 or 1).
*   `predict_proba()`: Returns the probability for each class.

### Code Implementation



In [10]:
# Predict labels
y_predicted = log_reg.predict(X_test)

# Predict probabilities
# Returns array of shape (n_samples, 2) -> [prob_class_0, prob_class_1]
y_probab = log_reg.predict_proba(X_test)

print("Probabilities (first 5 samples):\n", y_probab[:5])

# Select probabilities for Class 1 (Positive Sentiment)
y_probab_class1 = y_probab[:, 1]
print("\nProbabilities for Class 1 (first 5 samples):\n", y_probab_class1[:5])

Probabilities (first 5 samples):
 [[0.99733605 0.00266395]
 [0.95713098 0.04286902]
 [0.34869794 0.65130206]
 [0.55542056 0.44457944]
 [0.19333358 0.80666642]]

Probabilities for Class 1 (first 5 samples):
 [0.00266395 0.04286902 0.65130206 0.44457944 0.80666642]



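**Note on Metrics**: Standard metrics like `accuracy_score` and `confusion_matrix` require **class labels**, not probabilities. If you pass continuous probabilities instead, scikit-learn raises a `ValueError`, so convert probabilities to classes first (for example, by applying a threshold).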
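The following sketch illustrates both the error and the fix. The probabilities are the rounded class-1 values printed above; the `y_true` labels are hypothetical and exist only so the metric has something to compare against.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true labels (illustrative only)
y_true = np.array([0, 0, 1, 0, 1])
# Rounded class-1 probabilities from the output above
prob_class1 = np.array([0.003, 0.043, 0.651, 0.445, 0.807])

# Passing continuous probabilities to a class-based metric fails
raised = False
try:
    accuracy_score(y_true, prob_class1)
except ValueError as err:
    raised = True
    print("ValueError:", err)

# Convert probabilities to class labels with a 0.5 threshold, then score
y_pred = (prob_class1 > 0.5).astype(int)
acc = accuracy_score(y_true, y_pred)
print("Predicted classes:", y_pred)
print("Accuracy:", acc)
```

The threshold does not have to be 0.5; raising or lowering it trades false positives against false negatives.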

---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 12. The Sentiment Analysis Problem</span><br>

Sentiment analysis is the process of understanding the opinion of an author about a subject. Common data sources include:
*   Movie reviews (IMDb).
*   Amazon product reviews.
*   Twitter airline sentiment.
*   Emotionally charged literary examples.

The goal is to map these text inputs to sentiment labels.

---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 13. Exploration of Reviews</span><br>

Before modeling, we explore the text data. Common techniques include:

1.  **Basic Info**: Size of reviews.
2.  **Word Clouds**: Visualizing frequent words.
3.  **Length Features**:
    *   Number of words.
    *   Number of sentences.
4.  **Language Detection**: Identifying the language of the review.
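The length features from the list above can be sketched with the standard library alone. The two reviews and the helper `length_features` are hypothetical, and the sentence split on `.`, `!`, and `?` is deliberately naive; real reviews may need a proper tokenizer.

```python
import re

# Made-up reviews for illustration
reviews = [
    "Great movie. I loved the acting!",
    "Terrible plot. Would not watch again. Avoid it.",
]

def length_features(text):
    """Return simple size features for one review (hypothetical helper)."""
    words = text.split()  # whitespace-separated words
    # Naive sentence split on ., ! or ?; empty fragments are discarded
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "n_chars": len(text),
        "n_words": len(words),
        "n_sentences": len(sentences),
    }

features = [length_features(review) for review in reviews]
for review, feats in zip(reviews, features):
    print(feats, "<-", review)
```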

---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 14. Numeric Transformations (Vectorization)</span><br>

Machine learning models cannot understand raw text. We must convert text into numbers.

### Methods
1.  **Bag-of-words**: Counts word occurrences.
2.  **TF-IDF Vectorization**: Weights each word by how often it occurs in a document and how rare it is across all documents (Term Frequency - Inverse Document Frequency).

### Code Implementation
We will create a small dummy text dataset to demonstrate vectorization.



In [11]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Dummy text data
data_text = [
    "The movie was great and I loved it",
    "The movie was terrible and I hated it",
    "It was an okay movie, not great not bad"
]

# Initialize Vectorizer
vect = CountVectorizer()

# Fit and Transform
# fit: learns the vocabulary
# transform: converts text to matrix
vect.fit(data_text)
X_text = vect.transform(data_text)

print("Vocabulary:", vect.get_feature_names_out())
print("\nTransformed Shape:", X_text.shape)
print("\nArray Representation:\n", X_text.toarray())

Vocabulary: ['an' 'and' 'bad' 'great' 'hated' 'it' 'loved' 'movie' 'not' 'okay'
 'terrible' 'the' 'was']

Transformed Shape: (3, 13)

Array Representation:
 [[0 1 0 1 0 1 1 1 0 0 0 1 1]
 [0 1 0 0 1 1 0 1 0 0 1 1 1]
 [1 0 1 1 0 1 0 1 2 1 0 0 1]]



---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 15. Arguments of the Vectorizers</span><br>

We can tune the vectorizers to improve performance and reduce noise.

### Key Arguments
*   `stop_words`: Removes non-informative, frequent words (e.g., "the", "is", "and").
*   `ngram_range`: Captures phrases instead of just single words (e.g., "not good").
    *   `(1, 1)`: Unigrams only.
    *   `(1, 2)`: Unigrams and Bigrams.
*   `max_features`: Limits the vocabulary to the top N most frequent words.
*   `max_df` / `min_df`: Ignores terms that appear in too many or too few documents.
*   `token_pattern`: Regular expression that defines what counts as a token (e.g., a pattern that excludes digits).

**Note**: Lemmas and stems are important NLP concepts but are **NOT** direct arguments to the standard sklearn vectorizers (they usually require a custom tokenizer).

### Code Implementation



In [12]:
# Vectorizer with arguments
vect_tuned = CountVectorizer(
    stop_words='english',  # Remove English stop words
    ngram_range=(1, 2),    # Use unigrams and bigrams
    max_features=100       # Limit vocabulary size
)

vect_tuned.fit(data_text)
X_tuned = vect_tuned.transform(data_text)

print("Tuned Vocabulary:", vect_tuned.get_feature_names_out())

Tuned Vocabulary: ['bad' 'great' 'great bad' 'great loved' 'hated' 'loved' 'movie'
 'movie great' 'movie terrible' 'okay' 'okay movie' 'terrible'
 'terrible hated']



---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 16. The Automated Sentiment Analysis System</span><br>

The complete workflow for an automated sentiment analysis system combines all the steps we have discussed:

1.  **Exploration & Feature Engineering**: Understanding the data and creating new features (like review length).
2.  **Numeric Transformation**: Converting text to numbers using `CountVectorizer` or `TfidfVectorizer`.
3.  **Classification**: Using a supervised learning model (like `LogisticRegression`) to predict the sentiment.

This pipeline allows us to take raw text input and output a sentiment prediction (Positive/Negative).
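As a rough sketch of how the vectorization and classification steps fit together, scikit-learn's `make_pipeline` can chain a vectorizer and a classifier into a single object. The four training sentences, their labels, and the name `sentiment_model` are invented for illustration; a real system would train on thousands of labeled reviews.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny made-up training set (1 = Positive, 0 = Negative)
train_texts = [
    "great movie loved it",
    "great acting loved the story",
    "terrible movie hated it",
    "terrible acting hated the story",
]
train_labels = [1, 1, 0, 0]

# Step 1: text -> word counts; Step 2: word counts -> sentiment
sentiment_model = make_pipeline(CountVectorizer(), LogisticRegression())
sentiment_model.fit(train_texts, train_labels)

# Raw text goes in, class labels and probabilities come out
new_reviews = ["great story, loved it", "terrible story, hated it"]
predictions = sentiment_model.predict(new_reviews)
probabilities = sentiment_model.predict_proba(new_reviews)

for review, label in zip(new_reviews, predictions):
    print(f"{review!r} -> {'Positive' if label == 1 else 'Negative'}")
```

Wrapping both steps in one pipeline means the vocabulary is learned only from the data passed to `fit`, which helps keep test data out of the training process.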

---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 17. Conclusion</span><br>

In this notebook, we have covered the essential building blocks of **Sentiment Analysis in Python**:

1.  **Problem Definition**: We identified sentiment analysis as a classification problem (Binary or Multi-class).
2.  **Modeling**: We explored **Logistic Regression** as a probabilistic classifier suitable for this task.
3.  **Evaluation**: We learned to measure success using **Accuracy** and the **Confusion Matrix**, and emphasized the importance of the **Train/Test Split**.
4.  **Text Processing**: We demonstrated how to convert raw text into numerical features using **Vectorization** (Bag-of-words/TF-IDF) and how to tune these vectorizers.

**Next Steps for the Learner:**
*   Apply these techniques to a real-world dataset (e.g., the IMDb movie review dataset).
*   Experiment with different `C` values in Logistic Regression to observe the effect of regularization.
*   Try `TfidfVectorizer` instead of `CountVectorizer` to see if it improves performance.
*   Explore more complex models like Naive Bayes or Support Vector Machines (SVM).
