## Project Goal ##
Build a sentiment analysis model that can determine whether a piece of text is positive, negative, or neutral.

## Dependencies ##
1. Kaggle API: pip install kaggle
2. Other packages: pip install scikit-learn flask pandas


## 1. Import Data from Kaggle ## 

In [2]:
import kaggle
# Download the dataset
kaggle.api.dataset_download_files("kazanova/sentiment140", path="./data", unzip=True)

Dataset URL: https://www.kaggle.com/datasets/kazanova/sentiment140
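
The download call above needs Kaggle API credentials to be configured first. Below is a minimal sketch of a pre-flight check, assuming the commonly documented credential locations: a `kaggle.json` file under `~/.kaggle` (or under the directory named by `KAGGLE_CONFIG_DIR`), or the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables. The helper name `has_kaggle_credentials` is hypothetical and not part of the `kaggle` package.

```python
import os
from pathlib import Path


def has_kaggle_credentials(env=None, home=None):
    """Return True if Kaggle credentials appear to be configured.

    Hypothetical helper: it checks only the environment variables and the
    kaggle.json location assumed in the surrounding text.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)

    # Option 1: username and key supplied through environment variables.
    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return True

    # Option 2: a kaggle.json token file in the config directory.
    config_dir = Path(env.get("KAGGLE_CONFIG_DIR") or home / ".kaggle")
    return (config_dir / "kaggle.json").is_file()


if __name__ == "__main__":
    if not has_kaggle_credentials():
        print("No Kaggle credentials found; the download will likely fail.")
```

Running this check before `import kaggle` may give a clearer message than the error raised by the package itself.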


In [None]:
import pandas as pd
# Load the dataset
data = pd.read_csv("./data/training.1600000.processed.noemoticon.csv", encoding="latin-1", header=None)

## 2. Data Preparation ##

In [5]:
# Rename columns
data.columns = ["target", "ids", "date", "flag", "user", "text"]
# Map labels: 0 -> 0 (negative), 2 -> 1 (neutral), 4 -> 2 (positive)
data["target"] = data["target"].map({0: 0, 2: 1, 4: 2})
print(data["target"].value_counts())

target
0    800000
2    800000
Name: count, dtype: int64
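
`Series.map` with a dict turns any value that is missing from the dict into `NaN`, so a quick check after mapping can catch unexpected labels. A minimal sketch on a toy frame follows; the values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the real data; label 3 is deliberately not in the mapping.
toy = pd.DataFrame({"target": [0, 4, 4, 0, 3]})
label_map = {0: 0, 2: 1, 4: 2}

mapped = toy["target"].map(label_map)

# Values absent from label_map become NaN after map().
n_unmapped = int(mapped.isna().sum())
print("unmapped rows:", n_unmapped)

# Which of the mapped classes actually occur in the data.
present = sorted(mapped.dropna().astype(int).unique().tolist())
print("classes present:", present)
```

On the real data, the `value_counts()` output above lists only classes 0 and 2, so this training file contains no neutral rows.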


## 3. Train the model ##
After processing, the target column contains only the values 0 and 2 (negative and positive), so this is a binary classification problem. We therefore train a logistic regression model.
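
For intuition, binary logistic regression scores a feature vector with a weighted sum plus a bias and passes that score through the sigmoid function to obtain a probability for the positive class. The sketch below uses made-up weights and feature values; all numbers are hypothetical and not learned from the dataset.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba_positive(features, weights, bias):
    """Probability of the positive class for one feature vector."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(score)


# Hypothetical TF-IDF values for the terms ["love", "hate"] and made-up weights.
weights = [2.0, -2.0]
bias = 0.0
p = predict_proba_positive([0.7, 0.0], weights, bias)

# Predict "positive" when the probability reaches 0.5.
label = "positive" if p >= 0.5 else "negative"
print(label)
```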

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# 1. Data preprocessing
X = data["text"]
y = data["target"]
vectorizer = TfidfVectorizer(max_features=5000)
X_tfidf = vectorizer.fit_transform(X)

# 2. Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, y, test_size=0.2, random_state=42)

# 3. Train the model
model = LogisticRegression()
model.fit(X_train, y_train)

# 4. Evaluate the model
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred, target_names=["negative", "positive"]))

Accuracy: 0.790815625
Classification Report:
               precision    recall  f1-score   support

    negative       0.80      0.78      0.79    159494
    positive       0.78      0.80      0.79    160506

    accuracy                           0.79    320000
   macro avg       0.79      0.79      0.79    320000
weighted avg       0.79      0.79      0.79    320000
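
The training cell above fits the TF-IDF vectorizer on all rows before splitting, so the test split influences the vocabulary and IDF weights. One possible alternative, sketched here on a tiny made-up corpus, wraps the vectorizer and the classifier in a scikit-learn `Pipeline` so that both are fitted on the training split only, and then scores new text. The corpus, the `max_iter` value, and the example sentences are assumptions for illustration; results on the real dataset are not shown here and may differ from the numbers above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Tiny made-up corpus with the same label convention (0 = negative, 2 = positive).
texts = [
    "i love this", "what a great day", "so happy right now", "this is awesome",
    "i hate this", "what a terrible day", "so sad right now", "this is awful",
]
labels = [2, 2, 2, 2, 0, 0, 0, 0]

# Split the raw text first, so the vectorizer never sees the test rows.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

clf = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000)),
    ("logreg", LogisticRegression(max_iter=1000)),
])
clf.fit(X_train, y_train)

# The fitted pipeline accepts raw strings directly.
preds = clf.predict(["i love this day", "this is terrible"])
print(preds)
```

Because the pipeline takes raw strings, the same fitted object can be reused for inference without keeping a separate vectorizer in sync.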

