# News Article Classification by Topic

In [8]:
# Import libraries
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
import warnings
warnings.filterwarnings('ignore')

# Load the data
train = pd.read_csv('News_Category_Sample_train_topcats.csv')
test = pd.read_csv('News_Category_Sample_test_topcats.csv')

# Initialize CountVectorizer for unigrams 
vectorizer = CountVectorizer()

# Fit and transform the 'snippet' column
train_features = vectorizer.fit_transform(train['snippet']) # Tokenizes text to unigrams
test_features = vectorizer.transform(test['snippet']) 

# View shape of resulting BoW matrix
print("BoW matrix shape:", train_features.shape)

# Create labels
train_labels = train['category'].tolist()
test_labels = test['category'].tolist()

# L2-regularized (ridge) logistic regression
lr_ridge = LogisticRegression(penalty="l2", max_iter=1000).fit(train_features, train_labels)
test_preds_ridge = lr_ridge.predict(test_features)
print("Ridge Logistic Regression test set accuracy: ", accuracy_score(test_labels, test_preds_ridge))

# Hyperparameter tuning 
ridge_params = {'C': [0.01, 0.1, 1, 10]}
ridge_cv = GridSearchCV(LogisticRegression(penalty="l2", max_iter=1000), ridge_params, cv=5, scoring="accuracy")
ridge_cv.fit(train_features, train_labels)
best_ridge = ridge_cv.best_estimator_
test_preds_ridge = best_ridge.predict(test_features)
print("Best Ridge Logistic Regression C:", ridge_cv.best_params_)

BoW matrix shape: (72498, 53695)
Ridge Logistic Regression test set accuracy:  0.9048275862068965
Best Ridge Logistic Regression C: {'C': 1}


The matrix shape gives the number of documents as rows and the number of unique unigrams (the vocabulary size) as columns. Here we have 72,498 text samples and 53,695 unique unigram tokens.<br>
The model reaches a test-set accuracy of 90.48%. The hyperparameter C is the inverse of the regularization strength: it controls the trade-off between fitting the training data well and keeping the model weights small, with smaller values of C meaning stronger regularization. Cross-validation over the grid [0.01, 0.1, 1, 10] selected C = 1, which is also scikit-learn's default, so the tuned model is the same as the first one and there is no need to rerun it.
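
To make the role of C more concrete, here is a small sketch on synthetic data (an assumption for illustration; it uses `make_classification` rather than the news snippets, and `norms` is a hypothetical helper dictionary). It fits the same kind of L2-regularized logistic regression for each value in the grid and prints the size of the learned weight vector, which should grow as C increases and regularization weakens.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (stand-in for the real features)
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

norms = {}
for C in [0.01, 0.1, 1, 10]:
    # The default penalty is L2; smaller C means stronger regularization
    model = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    norms[C] = float(np.linalg.norm(model.coef_))
    print(f"C={C}: L2 norm of weights = {norms[C]:.3f}")
```

The exact numbers depend on the data, but the trend (smaller weights for smaller C) is what the grid search is trading off against training fit.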