<a href="https://colab.research.google.com/github/jimmy-pink/colab-machinelearning-playground/blob/main/kaggle/Sentiment140.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ⚜️ Sentiment140
[kaggle - sentiment140](https://www.kaggle.com/datasets/kazanova/sentiment140)  
[tensorflow-dataset-sentiment140 (variation)](https://www.tensorflow.org/datasets/catalog/sentiment140)

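### Problem Analysis

- Task  
  - Input: the text of a single tweet (e.g. "I love this movie! #happy")
  - Output: a binary sentiment label (0 = negative, 1 = positive)

- Data characteristics  
  - The dataset is large (1.6 million tweets), which makes it good practice for large-scale text processing.
  - Tweets are noisy (emoji, hashtags, @mentions, and so on) and need cleaning.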

In [None]:
!pip install tensorflow

In [None]:
import pickle
import pandas as pd
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

EMBEDDING_DIM = 100
MAX_LENGTH = 32
TRAINING_SPLIT = 0.9
BATCH_SIZE = 128

### Data Preparation

- Remap the target labels from 0 and 4 to 0 and 1
- Clean the tweet text

In [2]:
import tensorflow_datasets as tfds

ds = tfds.load('sentiment140', split='train', shuffle_files=True)

# Convert a tf.data.Dataset into a pandas DataFrame
def dataset_to_dataframe(dataset):
    # Materialize every example as a dict of NumPy values, one row per example
    df = pd.DataFrame(list(dataset.as_numpy_iterator()))
    return df

# Convert the full training split
df = dataset_to_dataframe(ds)

# Preview the first rows
df.head(2)



Dataset sentiment140 downloaded and prepared to /root/tensorflow_datasets/sentiment140/1.0.0. Subsequent calls will reuse this data.


Unnamed: 0,date,polarity,query,text,user
0,b'Mon Jun 01 18:08:26 PDT 2009',4,b'NO_QUERY',"b""i'm 10x cooler than all of you! """,b'katie4593'
1,b'Mon Jun 01 23:55:43 PDT 2009',0,b'NO_QUERY',b'O.kk? Thats weird I cant stop following peop...,b'migaruler'


#### Data Cleaning

In [3]:
df.polarity.value_counts()

polarity,count
4,800000
0,800000


In [4]:
# df['polarity'] = df.polarity.apply(lambda x: 0 if x == 0 else 1).to_numpy()
df["polarity"] = df["polarity"].replace(4, 1)
df.polarity.value_counts()

polarity,count
1,800000
0,800000


In [5]:
import re
texts = df["text"].values

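# Text-cleaning function
def clean_text(text):
    # Decode bytes to str (TFDS returns the tweet text as bytes)
    text = text.decode('utf-8') if isinstance(text, bytes) else text
    # Remove @mentions
    text = re.sub(r"@\w+", "", text)
    # Remove URLs
    text = re.sub(r"http\S+|www\S+|https\S+", "", text)
    # Keep only ASCII letters and whitespace (digits, punctuation, and emoji are dropped)
    text = re.sub(r"[^a-zA-Z\s]", "", text)
    # Lowercase and trim surrounding whitespace
    text = text.lower().strip()
    return text

# Apply the cleaning to every tweet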
cleaned_texts = [clean_text(text) for text in texts]
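
To see what these rules do to a single tweet, here is a standalone sketch of the same regex steps. The function name `clean_tweet` and the sample tweet are made up for illustration, and the whitespace-collapsing step is an addition that `clean_text` does not perform.

```python
import re

def clean_tweet(text):
    """Standalone re-implementation of the cleaning steps (hypothetical helper)."""
    if isinstance(text, bytes):
        text = text.decode("utf-8")
    text = re.sub(r"@\w+", "", text)            # drop @mentions
    text = re.sub(r"http\S+|www\S+", "", text)  # drop URLs
    text = re.sub(r"[^a-zA-Z\s]", "", text)     # keep ASCII letters and whitespace only
    text = re.sub(r"\s+", " ", text)            # assumption: collapse whitespace runs (not in clean_text)
    return text.lower().strip()

sample = b"@katie I'm 10x cooler: http://t.co/x #win"
print(clean_tweet(sample))
```

Digits are removed along with punctuation, which is why "10x" shows up as "x" in the cleaned dataset. If numbers carry signal for the task, the character class would need to be widened.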

In [6]:
sentences = np.array(cleaned_texts)
labels = df['polarity'].to_numpy()
dataset = tf.data.Dataset.from_tensor_slices((sentences, labels))

for i, (sentence, label) in enumerate(dataset.take(2)):
    print(f"Sample {i+1}:")
    print(f"Sentence: {sentence.numpy().decode('utf-8')}")
    print(f"Label: {label.numpy()}")
    print("-" * 50)

Sample 1:
Sentence: im x cooler than all of you
Label: 1
--------------------------------------------------
Sample 2:
Sentence: okk thats weird i cant stop following people on twitter i have tons of people to unfollow
Label: 0
--------------------------------------------------


In [7]:
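# Compute the training and validation set sizes
total_size = len(sentences)
train_size = int(total_size * TRAINING_SPLIT)
val_size = total_size - train_size

# Shuffle once with a buffer that covers the whole dataset.
# reshuffle_each_iteration=False keeps the order fixed, so take()/skip() below
# yield the same disjoint split on every pass (no train/validation leakage).
dataset = dataset.shuffle(buffer_size=total_size, reshuffle_each_iteration=False)

# Split into training and validation sets
train_dataset = dataset.take(train_size)
validation_dataset = dataset.skip(train_size)
PREFETCH_BUFFER_SIZE = tf.data.AUTOTUNE
# Cache first, then reshuffle the training examples each epoch; batch before prefetching
train_dataset = (train_dataset
                   .cache()
                   .shuffle(10000)
                   .batch(BATCH_SIZE)
                   .prefetch(buffer_size=PREFETCH_BUFFER_SIZE)
                   )
validation_dataset = (validation_dataset
                  .cache()
                  .batch(BATCH_SIZE)
                  .prefetch(buffer_size=PREFETCH_BUFFER_SIZE)
                  )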

print(f"There are {len(train_dataset)} batches for a total of {BATCH_SIZE*len(train_dataset)} elements for training.\n")
print(f"There are {len(validation_dataset)} batches for a total of {BATCH_SIZE*len(validation_dataset)} elements for validation.\n")

There are 11250 batches for a total of 1440000 elements for training.

There are 1250 batches for a total of 160000 elements for validation.
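
The batch counts above follow directly from the split sizes. The sketch below reproduces that arithmetic and also shows one way to build a fixed, non-overlapping split by permuting indices once with NumPy. The seed value and the helper name `split_indices` are illustrative assumptions, not part of the notebook.

```python
import math
import numpy as np

TOTAL, TRAINING_SPLIT, BATCH_SIZE = 1_600_000, 0.9, 128

def split_indices(n, train_fraction, seed=0):
    """Permute the indices once, then cut: the two parts can never overlap."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n)
    cut = int(n * train_fraction)
    return order[:cut], order[cut:]

train_idx, val_idx = split_indices(TOTAL, TRAINING_SPLIT)

# The number of batches is ceil(examples / batch size)
train_batches = math.ceil(len(train_idx) / BATCH_SIZE)
val_batches = math.ceil(len(val_idx) / BATCH_SIZE)
print(train_batches, val_batches)
```

Because the permutation is drawn once, an example assigned to validation can never appear in a training batch on a later epoch.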



### NLP


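#### Vectorization
- Goal: convert the text into a numeric representation

1. The fit_vectorizer function
- **TextVectorization** is a Keras text preprocessing layer that turns text into numbers: it maps each word to an integer index.
    
- **output_sequence_length=MAX_LENGTH**: every output sequence is padded or truncated to MAX_LENGTH tokens. Longer texts are truncated; shorter ones are padded with zeros.
    
- **standardize='lower_and_strip_punctuation'**: the standardization applied before splitting into tokens: lowercase the text and strip punctuation.
    
- **dataset.map(lambda x, y: x)**: extracts the text part of dataset (assuming each element is a (text, label) tuple) and produces a new dataset, full_tokens, that contains only text.
    
- **vectorizer.adapt(full_tokens)**: iterates over the text in full_tokens, builds a vocabulary, and assigns each word an index. Index 0 is reserved for padding and index 1 for out-of-vocabulary tokens.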
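
The behaviour described above can be tried on a toy corpus. This is a minimal sketch: the two sentences and the sequence length of 4 are made up for illustration, and it assumes TensorFlow is installed.

```python
import tensorflow as tf

# Toy corpus (made up): "good" is the most frequent word
corpus = tf.constant(["good good movie", "good bad film with far too many words"])

vectorizer = tf.keras.layers.TextVectorization(
    output_sequence_length=4,  # pad or truncate every sequence to 4 tokens
    standardize="lower_and_strip_punctuation",
)
vectorizer.adapt(corpus)

vocab = vectorizer.get_vocabulary()
print(vocab[:3])   # padding token, out-of-vocabulary token, then the most frequent word

encoded = vectorizer(corpus).numpy()
print(encoded.shape)
print(encoded[0])  # three word indices followed by one padding zero
```

The first sentence has three tokens and is padded with a zero; the second has eight and is cut to its first four.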


In [8]:
def fit_vectorizer(dataset):
    vectorizer = tf.keras.layers.TextVectorization(
        # max_tokens=10000,  # optional cap on the vocabulary size
        output_sequence_length=MAX_LENGTH,
        standardize='lower_and_strip_punctuation'
    )
    full_tokens = dataset.map(lambda x, y: x)
    vectorizer.adapt(full_tokens)
    return vectorizer

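2. Using the vectorizer
- fit_vectorizer(train_dataset) builds the vectorizer from the training set train_dataset only, so the vocabulary does not depend on validation text.
    
- vectorizer.vocabulary_size() returns the size of the learned vocabulary, i.e. how many distinct tokens were assigned an index (including the padding and out-of-vocabulary entries).


- The training and validation sets are then vectorized. map(lambda x, y: (vectorizer(x), y)) converts each text x into its sequence of integer token indices and leaves the label y unchanged.
    
- train_dataset_vectorized and validation_dataset_vectorized are the vectorized datasets: x is now a sequence of integer indices, ready to be fed to the model for training or validation.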

In [9]:
# Adapt the vectorizer to the training sentences
vectorizer = fit_vectorizer(train_dataset)
# Check size of vocabulary
vocab_size = vectorizer.vocabulary_size()
print(f"Vocabulary size: {vocab_size}")
train_dataset_vectorized = train_dataset.map(lambda x,y: (vectorizer(x), y))
validation_dataset_vectorized = validation_dataset.map(lambda x,y: (vectorizer(x), y))

Vocabulary size: 389332


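#### Word Embeddings
- Word embeddings are learned by training a model: each token (here, a word) is assigned a fixed-dimensional vector (e.g. 100 or 300 dimensions) that encodes its semantic and contextual information.

- The dataset vocabulary is available via vectorizer.get_vocabulary().


> Use pretrained GloVe (Global Vectors for Word Representation) vectors: map each token in the vectorizer's vocabulary to its GloVe vector and assemble the results into an embedding matrix.


1. **Download the GloVe vectors**:
    
    - The code first checks whether glove.6B.100d.txt exists; if it does not, it downloads the file from Hugging Face with wget.
        
    
2. **Load the GloVe vectors**:
    
    - It then reads glove.6B.100d.txt and splits each line into a word and its 100-dimensional vector.
        
    - glove_embeddings is a dictionary that stores each word together with its 100-dimensional vector.
        
    
3. **Build the vocabulary index**:
    
    - word_index is built from vectorizer.get_vocabulary(), which returns the vocabulary ordered by descending frequency, after the reserved padding and out-of-vocabulary entries.
        
    - word_index is a dictionary that maps each word to its index in the vocabulary.
        
    
4. **Build the embedding matrix**:
    
    - embeddings_matrix is a matrix of shape (vocab_size, EMBEDDING_DIM) that holds the vector for every word in the vocabulary. vocab_size is the vocabulary size and EMBEDDING_DIM is the GloVe vector dimension (100 here).
        
    - The code then iterates over every word in word_index. If the word exists in the GloVe dictionary (i.e. glove_embeddings.get(word) is not None), its vector is written to the corresponding row of embeddings_matrix. Words missing from GloVe keep an all-zero row.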

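The result looks like this (a simplified illustration; in the real vocabulary, indices 0 and 1 are reserved for the padding and out-of-vocabulary tokens):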

```python
word_index = {'cat': 0, 'dog': 1, 'fish': 2}
embeddings_matrix = [
    [0.1, 0.2, ..., 0.99],  # 100-dimensional vector for cat
    [0.3, 0.4, ..., 0.88],  # 100-dimensional vector for dog
    [0.2, 0.1, ..., 0.75]   # 100-dimensional vector for fish
]
```
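
The four steps can be sketched end to end on toy data. Everything here is made up for illustration: the three-dimensional vectors, the in-memory stand-in for the GloVe file, and the tiny vocabulary. The sketch also counts how many vocabulary entries were found, since any word missing from GloVe starts from a zero vector.

```python
import io
import numpy as np

EMBEDDING_DIM = 3  # toy dimension; the notebook uses 100

# Made-up stand-in for a GloVe file: one word followed by its vector per line
toy_glove = io.StringIO("cat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n")
glove_embeddings = {}
for line in toy_glove:
    values = line.split()
    glove_embeddings[values[0]] = np.asarray(values[1:], dtype="float32")

# Vocabulary in the layout TextVectorization uses: padding, OOV, then words
vocabulary = ["", "[UNK]", "cat", "dog", "lolcat"]
word_index = {word: i for i, word in enumerate(vocabulary)}

embeddings_matrix = np.zeros((len(vocabulary), EMBEDDING_DIM))
hits = 0
for word, i in word_index.items():
    vector = glove_embeddings.get(word)
    if vector is not None:
        embeddings_matrix[i] = vector
        hits += 1

print(embeddings_matrix.shape)
print(f"coverage: {hits / len(vocabulary):.0%}")
```

Running the same coverage count on the real vocabulary is a quick way to judge how much of the embedding layer actually starts from pretrained values.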

In [10]:

import os

# glove.6B.100d.txt holds 400,000 words, each with a 100-dimensional vector.
glove_file = './data/glove.6B.100d.txt'
# Check whether the file already exists; download it otherwise
if os.path.exists(glove_file):
    print(f"File found: {glove_file}")
else:
    !wget https://huggingface.co/arjahojnik/LSTM-sentiment-model/resolve/af8ba3a2939dd573a5d9384efecbe7f04137860b/glove.6B.100d.txt -P ./data/


# Initialize an empty embeddings index dictionary
glove_embeddings = {}

# Read file and fill glove_embeddings with its contents
with open(glove_file, encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        glove_embeddings[word] = coefs


In [11]:
word_index = {x:i for i,x in enumerate(vectorizer.get_vocabulary())}

In [12]:
embeddings_matrix = np.zeros((vocab_size, EMBEDDING_DIM))
# Iterate all of the words in the vocabulary and if the vector representation for
# each word exists within GloVe's representations, save it in the embeddings_matrix array
for word, i in word_index.items():
    embedding_vector = glove_embeddings.get(word)
    if embedding_vector is not None:
        embeddings_matrix[i] = embedding_vector

### 建模

In [13]:
from keras import regularizers
def create_model(vocab_size, pretrained_embeddings):
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(None,)),
        tf.keras.layers.Embedding(
            input_dim=vocab_size,
            output_dim=EMBEDDING_DIM,
            weights=[pretrained_embeddings],
            trainable=True  # allow the word vectors to be fine-tuned
        ),
        tf.keras.layers.SpatialDropout1D(0.2),
        tf.keras.layers.Conv1D(64, 3, activation='relu', padding='same'),
        tf.keras.layers.MaxPooling1D(2),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(
            32,
            dropout=0.4,
            recurrent_dropout=0.4,
            return_sequences=True
        )),
        tf.keras.layers.GlobalMaxPooling1D(),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(32, activation='relu',
                              kernel_regularizer=regularizers.l2(0.02)),
        tf.keras.layers.Dense(1, activation='sigmoid')
    ])

    # Learning-rate decay schedule
    initial_learning_rate = 0.001
    decay_steps = 1000
    decay_rate = 0.9
    learning_rate_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
        initial_learning_rate, decay_steps, decay_rate
    )

    # Compile the model
    model.compile(
        loss='binary_crossentropy',
        optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate_schedule),
        metrics=['accuracy']
    )

    return model
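
To get a feel for how quickly this schedule shrinks the learning rate, the decay formula can be evaluated by hand. This sketch assumes the default continuous (non-staircase) form, `initial_learning_rate * decay_rate ** (step / decay_steps)`, and uses the 11250 batches per epoch reported earlier; the helper name `decayed_lr` is made up.

```python
INITIAL_LR = 0.001
DECAY_STEPS = 1000
DECAY_RATE = 0.9
STEPS_PER_EPOCH = 11250  # training batches per epoch reported earlier

def decayed_lr(step):
    """Continuous (non-staircase) exponential decay."""
    return INITIAL_LR * DECAY_RATE ** (step / DECAY_STEPS)

for epoch in range(5):
    step = epoch * STEPS_PER_EPOCH
    print(f"after {epoch} epoch(s): lr = {decayed_lr(step):.2e}")
```

With these settings the rate falls by roughly a factor of three per epoch, so later epochs take very small steps. A larger decay_steps value would slow that down.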

In [None]:
model = create_model(vocab_size, embeddings_matrix)

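class myCallback(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        # Stop once training accuracy reaches 0.95 or validation loss drops below 0.35
        if logs.get('accuracy', 0.0) >= 0.95 or logs.get('val_loss', float('inf')) < 0.35:
            self.model.stop_training = True
callbacks = myCallback()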

history = model.fit(
	train_dataset_vectorized,
	epochs=20,
	validation_data=validation_dataset_vectorized,
	callbacks=[callbacks]
)

Epoch 1/20
11250/11250 ━━━━━━━━━━━━━━━━━━━━ 1656s 146ms/step - accuracy: 0.7536 - loss: 0.5487 - val_accuracy: 0.8271 - val_loss: 0.3870
Epoch 2/20
11250/11250 ━━━━━━━━━━━━━━━━━━━━ 1654s 147ms/step - accuracy: 0.8188 - loss: 0.4036 - val_accuracy: 0.8370 - val_loss: 0.3685
Epoch 3/20
11250/11250 ━━━━━━━━━━━━━━━━━━━━ 1679s 149ms/step - accuracy: 0.8281 - loss: 0.3860 - val_accuracy: 0.8396 - val_loss: 0.3634
Epoch 4/20
11250/11250 ━━━━━━━━━━━━━━━━━━━━ 1709s 150ms/step - accuracy: 0.8316 - loss: 0.3805 - val_accuracy: 0.8405 - val_loss: 0.3617
Epoch 5/20
 7383/11250 ━━━━━━━━━━━━━━━━━━━━ 9:33 148ms/step - accuracy: 0.8325 - loss: 0.3781

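The stopping rule used by the callback can be checked against the logged metrics without rerunning training. This is a small sketch: `should_stop` is a hypothetical helper that mirrors the callback's condition, with defaults so that a missing metric never triggers a stop.

```python
def should_stop(logs):
    """Return True when the stopping rule fires; missing metrics never trigger it."""
    logs = logs or {}
    accuracy = logs.get("accuracy", 0.0)
    val_loss = logs.get("val_loss", float("inf"))
    return accuracy >= 0.95 or val_loss < 0.35

# Metrics copied from the first four epochs of the training log
epoch_logs = [
    {"accuracy": 0.7536, "val_loss": 0.3870},
    {"accuracy": 0.8188, "val_loss": 0.3685},
    {"accuracy": 0.8281, "val_loss": 0.3634},
    {"accuracy": 0.8316, "val_loss": 0.3617},
]
print([should_stop(logs) for logs in epoch_logs])
```

None of the four completed epochs meets either threshold, which matches the log: training continued into epoch 5.
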
### Evaluation

In [None]:
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']

# Get number of epochs
epochs = range(len(acc))

fig, ax = plt.subplots(1, 2, figsize=(10, 5))
fig.suptitle('Training and validation performance')

for i, (data, label) in enumerate(zip([(acc, val_acc), (loss, val_loss)], ["Accuracy", "Loss"])):
    ax[i].plot(epochs, data[0], 'r', label="Training " + label)
    ax[i].plot(epochs, data[1], 'b', label="Validation " + label)
    ax[i].legend()
    ax[i].set_xlabel('epochs')