<a href="https://colab.research.google.com/github/psl-skok/PoliticalStanceClassification/blob/main/PoliticalStanceClassification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Political Stance Classification with TF-IDF and Logistic Regression

This notebook builds an interpretable text classification model to distinguish
between Liberal and Conservative Reddit posts. The emphasis is on understanding
modeling choices, feature engineering, and interpretability rather than treating
machine learning as a black box. It is a continuation of the NLP - Political Sentiment Analysis project, which can also be found in my GitHub portfolio.


# **Data Loading & Cleaning**

In [10]:
import pandas as pd

df = pd.read_csv("reddit_political_posts.csv")

df = df.dropna(subset=["Text", "Political Lean"])

df = df[df["Text"].str.len() > 20]

df["Political Lean"].value_counts(normalize=True)


Political Lean  proportion
Liberal           0.610761
Conservative      0.389239


We drop rows with missing text or labels and remove posts of 20 characters or fewer to reduce noise. We also inspect the class ratio: roughly 61% of posts are Liberal and 39% are Conservative, an imbalance that motivates the stratified split used later (see the Train–Test Split section for details).
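
As a minimal sketch of what the cleaning step does, the same two filters can be applied to a tiny hand-made DataFrame. The rows below are hypothetical and only the column names are taken from the notebook's dataset.

```python
import pandas as pd

# Hypothetical toy rows; only the column names match the real dataset.
toy = pd.DataFrame({
    "Text": [
        "Short post",                                      # 10 characters: dropped
        "A longer post about healthcare policy reform.",   # kept
        None,                                              # missing text: dropped
        "Another sufficiently long post about tax cuts.",  # kept
    ],
    "Political Lean": ["Liberal", "Liberal", "Conservative", "Conservative"],
})

# Same two filters as the notebook cell above.
cleaned = toy.dropna(subset=["Text", "Political Lean"])
cleaned = cleaned[cleaned["Text"].str.len() > 20]

print(len(toy), "->", len(cleaned))
print(cleaned["Political Lean"].value_counts(normalize=True))
```

Note that the length filter counts characters, not words, so a threshold of 20 removes only very short fragments.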


# **Train–Test Split**

In [11]:
from sklearn.model_selection import train_test_split

X = df["Text"]
y = df["Political Lean"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=10
)

The dataset is split into training and test sets to evaluate how well the model generalizes to unseen data.

Stratification preserves class proportions in both the training and test sets, which is important given the class imbalance found above. Fixing random_state makes the split reproducible across runs.
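
To illustrate both points, here is a small sketch on hypothetical labels with a 60/40 imbalance (not the real dataset). It checks that the stratified test set keeps the same ratio and that reusing the seed reproduces the split.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical toy data with a 60/40 class imbalance.
labels = ["Liberal"] * 60 + ["Conservative"] * 40
texts = [f"post {i}" for i in range(len(labels))]

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=10
)

train_counts = Counter(y_tr)
test_counts = Counter(y_te)
print("train:", dict(train_counts))
print("test: ", dict(test_counts))

# Repeating the call with the same random_state yields the same test set.
_, X_te_again, _, _ = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=10
)
print("same split:", X_te == X_te_again)
```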


# **Feature Engineering: TF-IDF Vectorization**

In [29]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(
    max_features=20000,
    ngram_range=(1, 2),
    min_df=2,
    max_df=0.8,
    stop_words="english",
)


X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

These parameters reduce noise, control vocabulary size, and let the model
capture short phrases, which should help generalization.

**max_features:** caps the vocabulary at the 20,000 terms with the highest frequency across the training corpus

**ngram_range:** includes both single words and two-word phrases (e.g. gun control, climate change)

**min_df & max_df:** drops terms that appear in fewer than 2 documents or in more than 80% of documents

**stop_words:** removes common English filler words (e.g. and, but, in)

# **Model Training: Logistic Regression**

In [30]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(
    max_iter=1000,
    class_weight="balanced",
    solver="liblinear",
    C=2.0,
)

clf.fit(X_train_tfidf, y_train)
y_pred = clf.predict(X_test_tfidf)

Logistic Regression is well-suited for high-dimensional, sparse TF-IDF features
and allows for explicit regularization and class imbalance handling.

**class_weight:** "balanced" weights each class inversely to its frequency, so errors on the minority class (Conservative) are penalized more than errors on the majority class (Liberal).

**solver:** liblinear works well for small datasets and binary classification.

**C:** The inverse of the L2 regularization strength. Raising C from the default of 1.0 to 2.0 weakens the penalty slightly, letting the model assign larger weights to strong word signals while still keeping some regularization against overfitting.
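
To make the "balanced" setting concrete, the weights can be computed directly with scikit-learn's `compute_class_weight`, which uses `n_samples / (n_classes * count_per_class)`. The label counts below are hypothetical (a 60/40 split close to this dataset's ratio), not the actual training counts.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels with roughly this dataset's 60/40 imbalance.
y_toy = np.array(["Liberal"] * 60 + ["Conservative"] * 40)
classes = np.unique(y_toy)  # sorted: ['Conservative', 'Liberal']

weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)
weight_by_class = dict(zip(classes, weights))

# Conservative: 100 / (2 * 40) = 1.25
# Liberal:      100 / (2 * 60) = 0.8333...
for name, w in weight_by_class.items():
    print(f"{name}: {w:.4f}")
```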

# **Model Evaluation**

In [31]:
import numpy as np

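feature_names = tfidf.get_feature_names_out()
# coef_[0] holds the weights for the positive class, clf.classes_[1] ("Liberal"),
# so negative weights push a post toward "Conservative".
weights = clf.coef_[0]

# Indices of features sorted from most negative to most positive weight
sorted_idx = np.argsort(weights)

top_conservative_idx = sorted_idx[:20]
top_liberal_idx = sorted_idx[-20:][::-1]  # reverse for descending order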

print("Top Conservative terms:")
for idx in top_conservative_idx:
    print(f"{feature_names[idx]}: {weights[idx]:.4f}")

print("\nTop Liberal terms:")
for idx in top_liberal_idx:
    print(f"{feature_names[idx]}: {weights[idx]:.4f}")


Top Conservative terms:
libertarian: -4.2029
libertarians: -3.5067
government: -2.9100
ancap: -2.7181
capitalism: -2.3628
property: -2.0785
libertarianism: -2.0079
free market: -1.7931
free: -1.5817
freedom: -1.4470
anarcho: -1.3454
russia: -1.3180
capitalist: -1.2872
money: -1.2752
anarchocapitalism: -1.2597
trudeau: -1.2215
commies: -1.1652
people: -1.1627
nap: -1.1111
anarcho capitalism: -1.0737

Top Liberal terms:
social: 3.3842
democratic: 2.8024
socialist: 2.7333
workers: 2.2978
women: 2.2807
hi: 2.0340
social democracy: 1.9270
progressive: 1.9126
know: 1.6915
healthcare: 1.6625
party: 1.6451
new: 1.5824
working: 1.5011
union: 1.4577
feminist: 1.4360
read: 1.4058
community: 1.3899
socialism: 1.3611
recently: 1.3397
issues: 1.3264


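The learned coefficients largely align with real-world political framing:
libertarian and free-market language is associated with Conservative posts,
while labor, social democracy, and community-oriented language is associated
with Liberal posts. A few high-weight Liberal terms (hi, know, new, read,
recently) are generic rather than political, which may reflect writing style
or dataset quirks rather than ideology.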


In [32]:
from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

# Confusion matrix layout (rows = actual, columns = predicted):
#                       Pred: Conservative   Pred: Liberal
# Actual Conservative          ___                ___
# Actual Liberal               ___                ___

              precision    recall  f1-score   support

Conservative       0.70      0.71      0.70       185
     Liberal       0.81      0.81      0.81       291

    accuracy                           0.77       476
   macro avg       0.76      0.76      0.76       476
weighted avg       0.77      0.77      0.77       476

[[131  54]
 [ 56 235]]
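
The figures in the report can be recovered from the confusion matrix by hand. The sketch below hard-codes the matrix printed above, with rows as actual classes and columns as predicted classes, and recomputes per-class precision, recall, and overall accuracy.

```python
import numpy as np

# Confusion matrix copied from the output above.
# Rows: actual (Conservative, Liberal); columns: predicted (Conservative, Liberal).
cm = np.array([[131, 54],
               [56, 235]])
class_names = ["Conservative", "Liberal"]

correct = cm.diagonal()
recall = correct / cm.sum(axis=1)     # correct / actual count per class
precision = correct / cm.sum(axis=0)  # correct / predicted count per class
accuracy = correct.sum() / cm.sum()

for name, p, r in zip(class_names, precision, recall):
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
print(f"accuracy={accuracy:.2f}")
```

For example, Conservative recall is 131 / 185, meaning 54 of the 185 Conservative test posts were labeled Liberal.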


# Conclusions and Next Steps

This project demonstrates how traditional NLP techniques, combined with
careful feature engineering and regularization, can produce an interpretable
political text classifier that reaches 77% accuracy and a macro F1 of 0.76
on the held-out test set.

Future work includes cross-validation, hyperparameter tuning, and comparison
to additional baselines or transformer-based models.
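
One possible shape for the cross-validation and tuning step is sketched below. It is not something the notebook has run. The toy corpus, the grid of C values, and the two-fold setup are all assumptions chosen to keep the example small. Wrapping the vectorizer in a Pipeline means the TF-IDF vocabulary is refit on each training fold, which avoids leaking validation text into the features.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus; replace with df["Text"] and df["Political Lean"].
toy_texts = [
    "free market and property rights", "lower taxes and free market",
    "capitalism and property", "free market capitalism works",
    "workers deserve a union", "healthcare for workers",
    "union and healthcare", "workers union solidarity",
]
toy_labels = ["Conservative"] * 4 + ["Liberal"] * 4

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", LogisticRegression(
        solver="liblinear", class_weight="balanced", max_iter=1000
    )),
])

# Assumed search space: only the regularization strength is tuned here.
param_grid = {"clf__C": [0.5, 1.0, 2.0, 4.0]}
cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=10)

search = GridSearchCV(pipe, param_grid, cv=cv, scoring="f1_macro")
search.fit(toy_texts, toy_labels)

print(search.best_params_)
print(round(search.best_score_, 2))
```

On the real data, a larger number of folds and a grid that also covers min_df, max_df, and ngram_range would likely be more informative.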


---

## Appendix: Pipeline-Based Implementation

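This section shows an earlier pipeline-based implementation used for comparison.
The model above was implemented manually to better understand each step and to improve accuracy.
Note that this version uses a 10% test split with no fixed random_state, default regularization, and no class weighting or stop-word and document-frequency filtering, so its scores are not directly comparable and will vary between runs.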

In [33]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X_train, X_test, y_train, y_test = train_test_split(
    df["Text"], df["Political Lean"], test_size=0.1, stratify=df["Political Lean"]
)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=20_000, ngram_range=(1,2))),
    ("clf", LogisticRegression(max_iter=1000))
])

pipeline.fit(X_train, y_train)


In [34]:
from sklearn.metrics import classification_report, confusion_matrix

y_pred = pipeline.predict(X_test)

print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

Conservative       0.83      0.38      0.52        93
     Liberal       0.70      0.95      0.81       145

    accuracy                           0.73       238
   macro avg       0.77      0.66      0.66       238
weighted avg       0.75      0.73      0.70       238

