<div style="  background: linear-gradient(145deg, #0f172a, #1e293b);  border: 4px solid transparent;  border-radius: 14px;  padding: 18px 22px;  margin: 12px 0;  font-size: 26px;  font-weight: 600;  color: #f8fafc;  box-shadow: 0 6px 14px rgba(0,0,0,0.25);  background-clip: padding-box;  position: relative;">  <div style="    position: absolute;    inset: 0;    padding: 4px;    border-radius: 14px;    background: linear-gradient(90deg, #06b6d4, #3b82f6, #8b5cf6);    -webkit-mask:       linear-gradient(#fff 0 0) content-box,       linear-gradient(#fff 0 0);    -webkit-mask-composite: xor;    mask-composite: exclude;    pointer-events: none;  "></div>    <b>Classifying Fake News Using Supervised Learning with NLP</b>    <br/>  <span style="color:#9ca3af; font-size: 18px; font-weight: 400;">(Introduction to Natural Language Processing in Python)</span></div>

## Table of Contents
1. [What is Supervised Learning?](#section-1)
2. [The IMDB Movie Dataset](#section-2)
3. [Supervised Learning Steps](#section-3)
4. [Building Word Count Vectors with Scikit-Learn](#section-4)
5. [Naive Bayes Classifier](#section-5)
6. [Model Evaluation & Confusion Matrix](#section-6)
7. [Simple NLP, Complex Problems](#section-7)
8. [Conclusion](#section-8)

***

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 1. What is Supervised Learning?</span><br>

### Understanding the Basics
Supervised learning is a fundamental form of machine learning in which the model is trained on predefined data where every example comes with a known **label** (or outcome) that you want the model to learn to predict.

In classification problems, the goal is to form good hypotheses about which class (for example, which species) an example belongs to, based on its input features.

### Example: The Iris Dataset
A classic example of supervised learning is classifying flower species based on geometric features. Below is a representation of such a dataset, where `Sepal length`, `Sepal width`, `Petal length`, and `Petal width` are features, and `Species` is the label.

| Sepal length | Sepal width | Petal length | Petal width | Species |
| :--- | :--- | :--- | :--- | :--- |
| 5.1 | 3.5 | 1.4 | 0.2 | I. setosa |
| 7.0 | 3.2 | 4.7 | 1.4 | I. versicolor |
| 6.3 | 3.3 | 6.0 | 2.5 | I. virginica |
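
The split between features and label in the table above can be seen directly in code. The following is a minimal sketch that assumes the copy of the Iris data bundled with `scikit-learn` (`load_iris`); the variable names `X` and `y` are a common convention, not something the original material prescribes.

```python
from sklearn.datasets import load_iris

# Load the Iris data that ships with scikit-learn (no download needed).
iris = load_iris()

X = iris.data    # feature matrix: one row per flower, four measurements each
y = iris.target  # label vector: integer-encoded species (0, 1, 2)

print(iris.feature_names)  # names of the four measurement columns
print(iris.target_names)   # species names that the integers 0, 1, 2 stand for
print(X.shape, y.shape)    # number of rows and columns in features and labels

# First flower: its four features and the species label attached to it
print(X[0], "->", iris.target_names[y[0]])
```

A supervised model is trained to map each row of `X` to the matching entry of `y`.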

### Supervised Learning with NLP
When applying supervised learning to Natural Language Processing (NLP), we face a unique challenge: we must use **language** instead of geometric features.

*   **Library**: `scikit-learn` is a powerful open-source library used for this purpose.
*   **Data Creation**: To create supervised learning data from text, we typically use:
    *   Bag-of-words models
    *   tf-idf (Term Frequency-Inverse Document Frequency) as features.

***

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 2. The IMDB Movie Dataset</span><br>

To demonstrate text classification, we will look at the IMDB Movie Dataset. The goal is to predict the movie genre based on the plot summary.

### Dataset Structure
The dataset consists of movie plots and corresponding categorical labels (Sci-Fi or Action).

| Plot | Sci-Fi | Action |
| :--- | :--- | :--- |
| In a post-apocalyptic world in human decay, a ... | 1 | 0 |
| Mohei is a wandering swordsman. He arrives in ... | 0 | 1 |
| #137 is a SCI/FI thriller about a girl, Marla,... | 1 | 0 |

<div style="background: #e0f2fe; border-left: 16px solid #0284c7; padding: 14px 18px; border-radius: 8px; font-size: 18px; color: #075985;"> 💡 <b>Tip:</b> In this dataset, categorical features are generated using preprocessing. A '1' indicates the movie belongs to that genre, while a '0' indicates it does not.</div>
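
The source does not say how the 0/1 genre columns were produced. One common way to get such indicator columns is `pandas.get_dummies`, sketched below. The `Genre` column and the `raw` DataFrame are hypothetical and assume each plot starts with a single genre string; the plot texts are the truncated snippets from the table above.

```python
import pandas as pd

# Hypothetical raw data: one genre string per plot (an assumption, not the real file layout)
raw = pd.DataFrame({
    "Plot": [
        "In a post-apocalyptic world in human decay, a ...",
        "Mohei is a wandering swordsman. He arrives in ...",
        "#137 is a SCI/FI thriller about a girl, Marla,...",
    ],
    "Genre": ["Sci-Fi", "Action", "Sci-Fi"],
})

# One indicator column per genre: 1 if the plot has that genre, else 0
indicators = pd.get_dummies(raw["Genre"], dtype=int)

# Put the plot text back next to the indicator columns
encoded = pd.concat([raw[["Plot"]], indicators], axis=1)
print(encoded)
```

Either indicator column can then serve as the binary label `y` for a classifier.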

***

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 3. Supervised Learning Steps</span><br>

To build a successful NLP model, we follow a structured pipeline:

1.  **Collect and preprocess our data**: Gather text and clean it.
2.  **Determine a label**: Identify what we are predicting (Example: Movie genre).
3.  **Split data**: Divide the dataset into **training** and **test** sets to ensure fair evaluation.
4.  **Extract features**: Convert text into numerical vectors to help predict the label.
    *   *Note*: A bag-of-words vectorizer (`CountVectorizer`) is built into `scikit-learn`.
5.  **Train and evaluate**: Fit a model on the training set, then use the trained model on the test set to measure performance.
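
The steps above can be sketched end to end in a few lines. This is an illustrative outline rather than the notebook's own code: the plots and labels are invented, and chaining the vectorizer and classifier with `make_pipeline` is a convenience chosen here, not something the original material uses.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Steps 1-2: toy plots (invented for illustration) and a binary label, 1 = Sci-Fi
plots = [
    "A robot explores a distant planet.",
    "A swordsman defends his village.",
    "Aliens invade earth with laser ships.",
    "A detective chases criminals through the city.",
    "A space crew fights cyborgs on a station.",
    "A martial artist fights for honor.",
    "Time travelers rescue a galactic colony.",
    "A soldier escapes from an enemy prison.",
]
is_scifi = [1, 0, 1, 0, 1, 0, 1, 0]

# Step 3: hold out a test set; stratify keeps both genres in each split
X_train, X_test, y_train, y_test = train_test_split(
    plots, is_scifi, test_size=0.25, random_state=0, stratify=is_scifi
)

# Step 4 plus training: bag-of-words features feed a Naive Bayes classifier
model = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
model.fit(X_train, y_train)

# Step 5: accuracy on plots the model has never seen
test_accuracy = model.score(X_test, y_test)
print("Test accuracy:", test_accuracy)
```

Wrapping both stages in one pipeline means the vectorizer is fitted only on the training plots, which avoids leaking test vocabulary into the features.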

***

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 4. Building Word Count Vectors with Scikit-Learn</span><br>

The first step in our pipeline (after loading data) is converting text to numbers using the `CountVectorizer`.

### Original Code (From PDF)
The following code demonstrates how to split the data and vectorize it.



In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

# df = ... # Load data into DataFrame
# y = df['Sci-Fi']

# X_train, X_test, y_train, y_test = train_test_split(
#     df['plot'], y,
#     test_size=0.33,
#     random_state=53)

# count_vectorizer = CountVectorizer(stop_words='english')
# count_train = count_vectorizer.fit_transform(X_train.values)
# count_test = count_vectorizer.transform(X_test.values)



### Enhanced Executable Code
Below is a fully working example where we create a dummy dataset to simulate the IMDB data, allowing you to run the vectorization process.



In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

# 1. Create Dummy Data (Simulating the IMDB Dataset)
data = {
    'plot': [
        'In a post-apocalyptic world in human decay, a robot finds hope.',
        'Mohei is a wandering swordsman. He arrives in a village.',
        '#137 is a SCI/FI thriller about a girl, Marla, who travels space.',
        'The galactic empire strikes back with laser cannons.',
        'A martial artist fights for honor in the ancient arena.',
        'Aliens invade earth and a hero must save the planet.',
        'A detective solves a crime in the city underworld.',
        'Space station alpha is under attack by cyborgs.'
    ],
    'Sci-Fi': [1, 0, 1, 1, 0, 1, 0, 1] # 1 = Sci-Fi, 0 = Not Sci-Fi (Action/Other)
}

df = pd.DataFrame(data)

# 2. Define Label
y = df['Sci-Fi']

# 3. Split Data
X_train, X_test, y_train, y_test = train_test_split(
    df['plot'], y,
    test_size=0.33,
    random_state=53
)

# 4. Initialize CountVectorizer
# stop_words='english' removes common words like 'the', 'is', 'in'
count_vectorizer = CountVectorizer(stop_words='english')

# 5. Fit and Transform Training Data
count_train = count_vectorizer.fit_transform(X_train.values)

# 6. Transform Test Data (Do NOT fit on test data)
count_test = count_vectorizer.transform(X_test.values)

print("Vocabulary size:", len(count_vectorizer.get_feature_names_out()))
print("Training Matrix Shape:", count_train.shape)
print("Test Matrix Shape:", count_test.shape)


Vocabulary size: 26
Training Matrix Shape: (5, 26)
Test Matrix Shape: (3, 26)



***

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 5. Naive Bayes Classifier</span><br>

### Theory
The **Naive Bayes Model** is commonly used for testing NLP classification problems. It has its basis in probability: given a particular piece of data, how likely is a particular outcome?

**Examples:**
*   If the plot has a "spaceship", how likely is it to be sci-fi?
*   Given a "spaceship" **and** an "alien", how likely **now** is it sci-fi?

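Each word from the `CountVectorizer` acts as a feature. The model is "naive" because it treats these word features as independent of one another given the class. Naive Bayes is favored because it is simple and effective.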
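
The two "spaceship" questions above can be worked through by hand with Bayes' rule. The sketch below uses invented counts and a simplified presence/absence view of each word. It is not the exact computation `MultinomialNB` performs (which works on word counts with smoothing), only an illustration of the underlying idea.

```python
# Hypothetical training counts, invented for illustration
n_plots = {"sci-fi": 10, "action": 10}
plots_with_word = {
    "spaceship": {"sci-fi": 6, "action": 1},
    "alien": {"sci-fi": 5, "action": 1},
}

def posterior_scifi(words):
    """P(sci-fi | all given words appear), treating words as independent given the genre."""
    total = sum(n_plots.values())
    score = {}
    for genre, n in n_plots.items():
        p = n / total  # prior: P(genre)
        for word in words:
            p *= plots_with_word[word][genre] / n  # likelihood: P(word | genre)
        score[genre] = p
    # Normalize so the two genre scores sum to 1
    return score["sci-fi"] / sum(score.values())

p_one = posterior_scifi(["spaceship"])
p_two = posterior_scifi(["spaceship", "alien"])
print(round(p_one, 3), round(p_two, 3))
```

With these made-up counts, seeing "alien" in addition to "spaceship" raises the estimated probability of sci-fi, which is the behavior the second question describes.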

### Original Code (From PDF)


In [3]:
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

# nb_classifier = MultinomialNB()
# nb_classifier.fit(count_train, y_train)
# pred = nb_classifier.predict(count_test)
# metrics.accuracy_score(y_test, pred)



### Enhanced Executable Code
We will now train the model using the vectors created in the previous section.



In [4]:
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

# 1. Initialize the Classifier
nb_classifier = MultinomialNB()

# 2. Fit the classifier to the training data
nb_classifier.fit(count_train, y_train)

# 3. Generate predictions on the test data
pred = nb_classifier.predict(count_test)

# 4. Calculate Accuracy
accuracy = metrics.accuracy_score(y_test, pred)

print(f"Predictions: {pred}")
print(f"Actual Labels: {y_test.values}")
print(f"Accuracy Score: {accuracy}")


Predictions: [1 1 1]
Actual Labels: [0 1 1]
Accuracy Score: 0.6666666666666666



***

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 6. Model Evaluation & Confusion Matrix</span><br>

Accuracy is useful, but a **Confusion Matrix** gives deeper insight into where the model is making mistakes (e.g., confusing Action for Sci-Fi).

### Confusion Matrix Structure (From PDF)
The PDF presents an example confusion matrix array and table. Rows are the actual genre and columns are the predicted genre:

`array([[6410, 563], [ 864, 2242]])`

| | Action | Sci-Fi |
| :--- | :--- | :--- |
| **Action** | 6410 | 563 |
| **Sci-Fi** | 864 | 2242 |

*   **6410**: Correctly predicted Action.
*   **2242**: Correctly predicted Sci-Fi.
*   **563**: Action movies incorrectly predicted as Sci-Fi.
*   **864**: Sci-Fi movies incorrectly predicted as Action.
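
The four cells listed above are enough to derive the usual summary metrics. The following sketch computes accuracy, precision, and recall from the example array, treating Sci-Fi as the positive class; that choice of positive class is an assumption made here for illustration.

```python
import numpy as np

# Example matrix from above: rows = actual genre, columns = predicted genre
cm = np.array([[6410, 563],
               [864, 2242]])

# With labels ordered [Action, Sci-Fi] and Sci-Fi as the positive class
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all predictions that were correct
precision = tp / (tp + fp)        # of plots predicted Sci-Fi, share that really are
recall = tp / (tp + fn)           # of actual Sci-Fi plots, share the model found

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

Accuracy comes out at roughly 0.86, while recall for Sci-Fi is only about 0.72. That gap is the kind of detail a single accuracy score hides.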

### Original Code (From PDF)


In [5]:
# metrics.confusion_matrix(y_test, pred, labels=[0,1])



### Enhanced Executable Code
Generating a confusion matrix for our dummy dataset.



In [6]:
from sklearn.metrics import confusion_matrix

# Generate confusion matrix
# labels=[0, 1] ensures the matrix is ordered by [Not Sci-Fi, Sci-Fi]
conf_matrix = confusion_matrix(y_test, pred, labels=[0, 1])

print("Confusion Matrix Array:")
print(conf_matrix)

# Visualizing the matrix simply
print("\n--- Interpretation ---")
print(f"True Negatives (Action predicted as Action): {conf_matrix[0][0]}")
print(f"False Positives (Action predicted as Sci-Fi): {conf_matrix[0][1]}")
print(f"False Negatives (Sci-Fi predicted as Action): {conf_matrix[1][0]}")
print(f"True Positives (Sci-Fi predicted as Sci-Fi): {conf_matrix[1][1]}")


Confusion Matrix Array:
[[0 1]
 [0 2]]

--- Interpretation ---
True Negatives (Action predicted as Action): 0
False Positives (Action predicted as Sci-Fi): 1
False Negatives (Sci-Fi predicted as Action): 0
True Positives (Sci-Fi predicted as Sci-Fi): 2



***

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 7. Simple NLP, Complex Problems</span><br>

While the mechanics of NLP (vectorization, classification) can be straightforward, the problems they solve are complex and prone to nuance.

### 1. Translation
Translation is difficult because context matters.
*   **Example**: A tweet "god bless the german language" accompanied by a complex German sentence illustrating how difficult it is to translate specific economic terms.
*   **Source**: [Twitter Link](https://twitter.com/Lupintweets/status/865533182455685121)

### 2. Sentiment Analysis
Words have different sentiments depending on the context (community or topic).
*   **Example**: The word "soft".
    *   In `r/sports`: "big men are very **soft**" (Negative sentiment).
    *   In `r/TwoX`: "some **soft** pajamas" (Positive sentiment).
*   **Source**: [Stanford NLP Project](https://nlp.stanford.edu/projects/socialsent/)

### 3. Language Biases
Models trained on human data inherit human biases.
*   **Example**: Google Translate (Turkish to English).
    *   Turkish: "O bir profesör. O bir bebek bakıcısı." (Gender neutral pronouns).
    *   English Translation: "He's a professor. She's a babysitter."
    *   The model assumed the gender based on the profession, reflecting societal bias in the training data.
*   **Related Talk**: [YouTube Link](https://www.youtube.com/watch?v=j7FwpZB1hWc)

***

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 8. Conclusion</span><br>

In this notebook, we explored the fundamentals of classifying text using supervised learning.

**Key Takeaways:**
1.  **Supervised Learning**: Requires labeled data. In NLP, we convert text labels (like "Sci-Fi") into numerical categories.
2.  **Feature Extraction**: We cannot feed raw text into a model. We used `CountVectorizer` (Bag-of-Words) to turn plot summaries into numerical vectors.
3.  **Modeling**: We used the **Naive Bayes** classifier, a probabilistic model highly effective for text data.
4.  **Evaluation**: We used accuracy scores and the **Confusion Matrix** to understand prediction errors.
5.  **Ethics**: We acknowledged that NLP models can struggle with context and perpetuate real-world biases (e.g., gender bias in translation).

**Next Steps:**
*   Try using `TfidfVectorizer` instead of `CountVectorizer` to see if down-weighting words that appear in most documents improves accuracy.
*   Experiment with different classifiers like Logistic Regression or Support Vector Machines (SVM).
*   Audit your datasets for potential biases before training models for production.
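
As a starting point for the first two suggestions, the sketch below swaps in `TfidfVectorizer` and `LogisticRegression`. The training plots are a handful of invented sentences in the style of the dummy data used earlier, so the prediction only shows the mechanics and says nothing about real accuracy.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data, invented for illustration; 1 = Sci-Fi, 0 = Not Sci-Fi
train_plots = [
    "The galactic empire strikes back with laser cannons.",
    "A martial artist fights for honor in the ancient arena.",
    "Aliens invade earth and a hero must save the planet.",
    "A detective solves a crime in the city underworld.",
    "Space station alpha is under attack by cyborgs.",
    "Mohei is a wandering swordsman. He arrives in a village.",
]
train_labels = [1, 0, 1, 0, 1, 0]

# Same pipeline shape as before, with tf-idf features and a different classifier
model = make_pipeline(TfidfVectorizer(stop_words="english"), LogisticRegression())
model.fit(train_plots, train_labels)

# Predict the genre of a new, unseen plot
new_pred = model.predict(["Aliens attack the space station."])
print(new_pred)
```

Because only the two pipeline stages change, the split, fit, and evaluation code from the earlier sections can be reused as is when comparing models.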
