<a href="https://colab.research.google.com/github/psl-skok/PoliticalStanceClassification/blob/main/PoliticalStanceClassification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Political Stance Classification with TF-IDF and Logistic Regression

This notebook builds an interpretable text classification model to distinguish
between Liberal and Conservative Reddit posts. The emphasis is on understanding
modeling choices, feature engineering, and interpretability rather than treating
machine learning as a black box.


# **Data Loading & Cleaning**

In [14]:
import pandas as pd

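# Load the raw posts; the file must contain "Text" and "Political Lean" columns
df = pd.read_csv("reddit_political_posts.csv")

# Drop rows where the post text or the label is missing
df = df.dropna(subset=["Text", "Political Lean"])

# Keep only posts longer than 20 characters
df = df[df["Text"].str.len() > 20]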


We remove rows with a missing text or label and posts of 20 characters or fewer to reduce noise.
Class balance is inspected to ensure stratified splitting is appropriate.
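
The class-balance check itself is not shown in the cell above. A minimal sketch of how it could be done follows; the rows in `toy_df` are made up for illustration, and only the column names come from this notebook.

```python
import pandas as pd

# Hypothetical stand-in for reddit_political_posts.csv; only the column
# names ("Text", "Political Lean") are taken from the notebook.
toy_df = pd.DataFrame({
    "Text": [
        "Taxes should be lower and markets should be free.",
        "Workers deserve stronger unions and better pay.",
        "short",
        None,
        "Public healthcare is a right for every community.",
    ],
    "Political Lean": ["Conservative", "Liberal", "Liberal", "Conservative", "Liberal"],
})

# Same cleaning steps as above: drop missing rows, then very short posts.
toy_df = toy_df.dropna(subset=["Text", "Political Lean"])
toy_df = toy_df[toy_df["Text"].str.len() > 20]

# Absolute counts and relative shares per class.
class_counts = toy_df["Political Lean"].value_counts()
class_share = toy_df["Political Lean"].value_counts(normalize=True)
print(class_counts)
print(class_share.round(2))
```

On the real data, replacing `toy_df` with `df` would give the counts that motivate the stratified split.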

# **Train–Test Split**

In [3]:
from sklearn.model_selection import train_test_split

X = df["Text"]
y = df["Political Lean"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

Stratification preserves class proportions, which is important given mild class imbalance.
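
One way to confirm that stratification behaves as described is to compare the class shares in each split. The sketch below uses hypothetical placeholder texts and a made-up 40/60 label imbalance rather than the real posts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical labels with a 40/60 imbalance; the texts are placeholders.
labels = pd.Series(["Conservative"] * 40 + ["Liberal"] * 60)
texts = pd.Series([f"post {i}" for i in range(len(labels))])

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

# With stratify=labels, both splits keep the original class shares.
train_share = y_tr.value_counts(normalize=True)
test_share = y_te.value_counts(normalize=True)
print(train_share)
print(test_share)
```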


# **Feature Engineering: TF-IDF Vectorization**

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

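tfidf = TfidfVectorizer(
    max_features=20000,   # cap the vocabulary at the 20,000 most frequent terms
    ngram_range=(1, 2),   # unigrams and bigrams
    min_df=5,             # ignore terms that appear in fewer than 5 posts
    max_df=0.8,           # ignore terms that appear in more than 80% of posts
)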

X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

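`max_features=20000` caps the vocabulary at the 20,000 most frequent terms,
`ngram_range=(1,2)` adds bigrams so the model can capture short phrases,
`min_df=5` drops terms that appear in fewer than 5 posts, and `max_df=0.8`
drops terms that appear in more than 80% of posts. Together these reduce noise
and help generalization. The vectorizer is fit on the training split only, so
no vocabulary or IDF statistics from the test set leak into training.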


# **Model Training: Logistic Regression**

In [5]:
from sklearn.linear_model import LogisticRegression

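clf = LogisticRegression(
    max_iter=1000,            # upper bound on solver iterations
    class_weight="balanced",  # reweight classes inversely to their frequency
    solver="liblinear",       # solver suited to small datasets and binary problems
    C=2.0,                    # inverse regularization strength (default is 1.0)
)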

clf.fit(X_train_tfidf, y_train)
y_pred = clf.predict(X_test_tfidf)

Logistic Regression is well-suited for high-dimensional, sparse TF-IDF features
and allows for explicit regularization and class imbalance handling.
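
As a rough illustration of what `class_weight="balanced"` does, the sketch below computes the per-class weights as `n_samples / (n_classes * class_count)`. The training-set counts are not printed in this notebook, so the labels here reuse the test-set supports reported in the evaluation (185 Conservative, 291 Liberal) as a stand-in; because the split is stratified, the training proportions should be approximately the same.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Stand-in labels built from the reported test-set supports (an assumption;
# the actual training counts are not shown in the notebook).
y_demo = np.array(["Conservative"] * 185 + ["Liberal"] * 291)
classes = np.unique(y_demo)

# "balanced" weights follow n_samples / (n_classes * count_per_class).
weights_auto = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)
weights_manual = len(y_demo) / (len(classes) * np.array([185, 291]))

print(dict(zip(classes, weights_auto.round(3))))
```

The minority class gets a weight above 1 and the majority class a weight below 1, so errors on Conservative posts cost more during training.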


# **Model Evaluation**

In [8]:
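import numpy as np

# Feature names and coefficients from the fitted vectorizer and classifier.
# For a binary problem clf.coef_ has a single row: positive weights push
# toward clf.classes_[1] ("Liberal"), negative weights toward clf.classes_[0].
feature_names = tfidf.get_feature_names_out()
weights = clf.coef_[0]

# Get sorted indices
sorted_idx = np.argsort(weights)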

top_conservative_idx = sorted_idx[:20]
top_liberal_idx = sorted_idx[-20:][::-1]  # reverse for descending order

print("Top Conservative terms:")
for idx in top_conservative_idx:
    print(f"{feature_names[idx]}: {weights[idx]:.4f}")

print("\nTop Liberal terms:")
for idx in top_liberal_idx:
    print(f"{feature_names[idx]}: {weights[idx]:.4f}")


Top Conservative terms:
libertarian: -3.6826
libertarians: -3.1578
government: -2.9725
capitalism: -2.7640
ancap: -2.4398
property: -1.9309
not: -1.7547
libertarianism: -1.6273
free market: -1.5956
russia: -1.5888
anarcho: -1.5422
capitalist: -1.3476
free: -1.2767
said: -1.2508
ukraine: -1.2381
only: -1.2223
money: -1.1839
x200b: -1.1799
anarchocapitalism: -1.1753
billion: -1.1661

Top Liberal terms:
social: 3.2829
democratic: 2.6921
women: 2.2324
party: 2.2045
and: 1.8750
socialist: 1.8163
workers: 1.7692
in: 1.6531
but: 1.6294
am: 1.5582
social democracy: 1.5075
my: 1.3927
community: 1.3646
union: 1.3415
progressive: 1.3412
working: 1.3119
hi: 1.2728
have been: 1.2701
policy: 1.2608
left: 1.2460


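Many of the learned coefficients align with real-world political framing:
libertarian and free-market language is associated with Conservative posts,
and labor, social democracy, and community-oriented language is associated
with Liberal posts. The lists also contain function words (`not`, `and`, `in`,
`but`) and the token `x200b`, most likely residue of a zero-width space
character in the raw posts. This suggests that stop-word filtering and extra
text cleaning would make the coefficients easier to interpret.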


In [7]:
from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

# Layout of the confusion matrix printed below (rows = actual, columns = predicted):
#                      Pred: Conservative   Pred: Liberal
# Actual Conservative         112                 73
# Actual Liberal               65                226

              precision    recall  f1-score   support

Conservative       0.63      0.61      0.62       185
     Liberal       0.76      0.78      0.77       291

    accuracy                           0.71       476
   macro avg       0.69      0.69      0.69       476
weighted avg       0.71      0.71      0.71       476

[[112  73]
 [ 65 226]]
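
The figures in the classification report can be recomputed directly from the confusion matrix printed above. The following sketch does this with plain NumPy, assuming rows are actual labels and columns are predicted labels in the order (Conservative, Liberal), which is scikit-learn's convention for sorted labels.

```python
import numpy as np

# Confusion matrix printed above: rows are actual labels, columns are
# predicted labels, both in the order (Conservative, Liberal).
cm = np.array([[112, 73],
               [65, 226]])

precision = np.diag(cm) / cm.sum(axis=0)  # correct / all predicted as the class
recall = np.diag(cm) / cm.sum(axis=1)     # correct / all actually in the class
f1 = 2 * precision * recall / (precision + recall)
accuracy = np.trace(cm) / cm.sum()

for name, p, r, f in zip(["Conservative", "Liberal"], precision, recall, f1):
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
print(f"accuracy={accuracy:.2f}")
```

For example, Conservative recall is 112 / (112 + 73) = 112 / 185, which rounds to the 0.61 shown in the report.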


## Appendix: Pipeline-Based Baseline (Earlier Approach)

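This section shows an earlier pipeline-based implementation used for comparison.
It differs from the final model in several ways: a 10% test split without a
fixed `random_state`, no `min_df`/`max_df` pruning, and default
`LogisticRegression` settings (no class weighting). The results are therefore
not directly comparable, but the low Conservative recall (0.37 versus 0.61
above) is consistent with an unweighted model favoring the majority class.
The final model above was implemented manually to better understand each step.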

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X_train, X_test, y_train, y_test = train_test_split(
    df["Text"], df["Political Lean"], test_size=0.1, stratify=df["Political Lean"]
)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=20_000, ngram_range=(1,2))),
    ("clf", LogisticRegression(max_iter=1000))
])

pipeline.fit(X_train, y_train)


In [None]:
from sklearn.metrics import classification_report, confusion_matrix

y_pred = pipeline.predict(X_test)

print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

Conservative       0.74      0.37      0.49        93
     Liberal       0.69      0.92      0.79       145

    accuracy                           0.70       238
   macro avg       0.72      0.64      0.64       238
weighted avg       0.71      0.70      0.67       238



# Conclusions and Next Steps

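This project demonstrates how traditional NLP techniques combined with
careful feature engineering and regularization can produce interpretable
and reasonably effective political text classifiers: the final model reaches
0.71 accuracy and 0.69 macro F1 on the held-out test set, and its coefficients
can be read directly as term-level evidence for each class.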

Future work includes cross-validation, hyperparameter tuning, and comparison
to additional baselines or transformer-based models.
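
As a starting point for the cross-validation and tuning mentioned above, here is a minimal sketch using `GridSearchCV` over a pipeline. The toy corpus and the parameter grid are assumptions chosen for illustration, not tuned settings from this project.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for the Reddit posts.
texts = [
    "free market capitalism and property rights",
    "lower taxes and smaller government",
    "private property and free enterprise",
    "deregulation helps small business owners",
    "workers need stronger unions",
    "social democracy and public healthcare",
    "progressive policy for working families",
    "community programs and union jobs",
] * 3
labels = (["Conservative"] * 4 + ["Liberal"] * 4) * 3

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced",
                               solver="liblinear")),
])

# Illustrative grid; the values are assumptions, not recommendations.
param_grid = {"clf__C": [0.5, 2.0, 8.0], "tfidf__min_df": [1, 2]}

search = GridSearchCV(
    pipe,
    param_grid,
    scoring="f1_macro",
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
)
search.fit(texts, labels)
print(search.best_params_, round(search.best_score_, 3))
```

Keeping the vectorizer inside the pipeline means the vocabulary and IDF weights are refit on each training fold, so the validation folds do not leak into feature extraction.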
