<a href="https://colab.research.google.com/github/mdkamrulhasan/machine_learning_concepts/blob/master/notebooks/supervised/sample_soln_fake_news_detection_challenge.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## The goal of this notebook is to showcase


1.   How to load and preprocess the classification challenge data
2.   How to build a baseline model
3.   How to prepare results to submit to the Kaggle challenge


## Loading necessary packages

In [None]:
import os
import pandas as pd

## Mounting Google Drive space for data access:

- Download the following files from the [Kaggle challenge site](https://www.kaggle.com/t/f1ab2f7d90714c6b80c03c3ca19043ed) and upload them to your Google Drive:
  -  fake-news-challenge_train.csv
  -  fake-news-challenge_test.csv

In [33]:
# Mounting your Google drive space
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [28]:
# Path in your Google Drive where you uploaded fake-news-challenge_train.csv and fake-news-challenge_test.csv
base_path = 'drive/MyDrive/data/fake-news/v1.1-Mon Sep 29 04:31:12 2025'

## Loading Classification Challenge data

In [None]:
df_train = pd.read_csv(os.path.join(base_path, 'fake-news-challenge_train.csv'))
df_test = pd.read_csv(os.path.join(base_path, 'fake-news-challenge_test.csv'))

In [10]:
df_train.head()

Unnamed: 0,news-text,is_news_real
0,BERLIN (Reuters) - Germany s Social Democrats ...,1
1,ROME (Reuters) - Italy’s preparations for host...,1
2,WASHINGTON (Reuters) - Republican U.S. Preside...,1
3,A disgusting black sludge is coming out of res...,0
4,WASHINGTON (Reuters) - U.S. President Donald T...,1


In [None]:
df_train['news-text'][0]

'BERLIN (Reuters) - Germany s Social Democrats (SPD) faced pressure on Wednesday to consider offering coalition talks to Chancellor Angela Merkel s conservatives to settle the worst political crisis in modern German history. A leader of the smaller Free Democrats (FDP) also raised the possibility of reviving coalition talks with the conservatives and Greens that collapsed at the weekend raising fears across Europe of stalemate in the EU s economic and political powerhouse. But the party chief later appeared to ruled it out.  The signs of possible flexibility came after President Frank-Walter Steinmeier, in a move unprecedented for a largely ceremonial position, intervened to promote talks that could avert a disruptive early repeat election. SPD leader Martin Schulz, whose party had governed in coalition under Merkel since 2013, wants to go into opposition after September polls that knocked its support to the lowest levels since formation of the modern German republic in 1949. But the m

## Data Preprocessing and Feature Extraction

In [16]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction import text

# Stop Words
custom_stop_words = list(text.ENGLISH_STOP_WORDS)

# Apply CountVectorizer with stop word removal and basic token pattern
count_vect = CountVectorizer(
    stop_words=custom_stop_words,
    ngram_range=(1, 1),  # Unigrams
    token_pattern=r'\b[a-zA-Z]{2,}\b'  # Only words with 2+ alphabetic characters
)

X_train_counts = count_vect.fit_transform(df_train['news-text'].values)
print(X_train_counts.shape)


(35082, 96495)
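
To make the vectorizer settings above more concrete, here is a minimal sketch that mimics the same steps with the standard library only: lowercasing (the `CountVectorizer` default), tokenizing with the same regular expression, dropping stop words, and counting. The function name `toy_bag_of_words` and the tiny `TOY_STOP_WORDS` set are assumptions made for illustration; the notebook itself relies on scikit-learn's much larger `ENGLISH_STOP_WORDS` list.

```python
import re
from collections import Counter

# Same token pattern as the CountVectorizer call above:
# only runs of 2+ alphabetic characters count as tokens.
TOKEN_PATTERN = re.compile(r'\b[a-zA-Z]{2,}\b')

# Hypothetical mini stop-word list, for illustration only.
TOY_STOP_WORDS = {'the', 'to', 'on', 'in', 'of', 'and'}

def toy_bag_of_words(document):
    """Lowercase, tokenize, drop stop words, and count the remaining tokens."""
    tokens = TOKEN_PATTERN.findall(document.lower())
    return Counter(t for t in tokens if t not in TOY_STOP_WORDS)

# Opening words of the first training article shown earlier.
sample = 'BERLIN (Reuters) - Germany s Social Democrats (SPD) faced pressure on Wednesday'
counts = toy_bag_of_words(sample)
print(sorted(counts))
```

Note that the stray `s` left over from "Germany s" is discarded by the 2+ character rule, and `on` is removed as a stop word. Each distinct surviving token corresponds to one column of the count matrix.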


In [30]:
# Extracting labels
y = df_train['is_news_real'].values

In [29]:
set(y)

{np.int64(0), np.int64(1)}

## Feature Extraction for the Test Data

In [18]:
X_test_counts = count_vect.transform(df_test['news-text'].values)

## Training a Baseline kNN (k-Nearest Neighbors) Classifier

In [21]:
from sklearn.neighbors import KNeighborsClassifier
baseline_model = KNeighborsClassifier(n_neighbors=5)
baseline_model.fit(X_train_counts, y)
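
With `n_neighbors=5`, the classifier labels an article by a majority vote among its five closest training articles; by default scikit-learn measures closeness with Euclidean distance and gives each neighbor an equal vote. The sketch below illustrates that idea in plain Python on a hypothetical toy data set. The helper names (`euclidean`, `knn_predict`) and the toy vectors are assumptions for illustration, not scikit-learn's implementation, and ties are resolved simply by sort order.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two sparse count vectors stored as dicts."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0) - b.get(k, 0)) ** 2 for k in keys))

def knn_predict(train_vectors, train_labels, query, k=5):
    """Return the majority label among the k training vectors closest to query."""
    order = sorted(range(len(train_vectors)),
                   key=lambda i: euclidean(train_vectors[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical word counts: label 1 = real news, label 0 = fake news.
toy_vectors = [
    {'reuters': 1, 'president': 1},
    {'reuters': 1, 'election': 1},
    {'reuters': 1, 'talks': 1},
    {'shocking': 1, 'sludge': 1},
    {'shocking': 1, 'disgusting': 1},
]
toy_labels = [1, 1, 1, 0, 0]

real_like = knn_predict(toy_vectors, toy_labels, {'reuters': 1, 'coalition': 1}, k=3)
fake_like = knn_predict(toy_vectors, toy_labels,
                        {'shocking': 1, 'sludge': 1, 'disgusting': 1}, k=3)
print(real_like, fake_like)
```

Because kNN stores the training data and defers all work to prediction time, `fit` is fast while `predict` on a large matrix can be slow.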

## Performance Check (Training Data)

In [20]:
accuracy = baseline_model.score(X_train_counts, y)
print(f'Training Accuracy: {accuracy}')

Training Accuracy: 0.8758052562567699
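
Training accuracy tends to be optimistic for kNN, because every training article counts itself (at distance zero) among its own nearest neighbors. A held-out validation split usually gives a more honest estimate. The sketch below shows one dependency-free way to produce such a split; the function name `holdout_indices`, the 20% fraction, and the seed are assumptions chosen for illustration, and scikit-learn's `train_test_split` serves the same purpose.

```python
import random

def holdout_indices(n_rows, validation_fraction=0.2, seed=0):
    """Shuffle row indices reproducibly and split them into train/validation lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_val = int(round(n_rows * validation_fraction))
    return indices[n_val:], indices[:n_val]

# 35082 is the number of training rows reported by X_train_counts.shape above.
train_idx, val_idx = holdout_indices(35082)
print(len(train_idx), len(val_idx))
```

A second `KNeighborsClassifier` could then be fitted on `X_train_counts[train_idx]` and `y[train_idx]` and scored on `X_train_counts[val_idx]` and `y[val_idx]`, leaving `baseline_model` itself untouched.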


## Making predictions and preparing the submission file for the Kaggle challenge

In [31]:
# Making predictions
test_predictions = baseline_model.predict(X_test_counts)

In [24]:
# Pack results in a pandas dataframe
df_predictions = pd.DataFrame({
    'rowId': range(0, len(test_predictions)),
    'label': test_predictions.flatten()})

In [25]:
df_predictions.head()

Unnamed: 0,rowId,label
0,0,0
1,1,0
2,2,1
3,3,0
4,4,0


## Saving results to upload to the Kaggle challenge site

In [32]:
# Exporting results
df_predictions.to_csv(os.path.join(base_path, 'my_baseline_results.csv'), index=False)
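
Before uploading, it can be worth re-reading the exported file to confirm it has the shape this notebook writes. The sketch below is a hypothetical helper (`check_submission` is not part of any library) that assumes the submission should contain exactly the `rowId` and `label` columns built above, with consecutive row ids and binary labels; the challenge page remains the authority on the required format.

```python
import csv
import os
import tempfile

def check_submission(path, expected_rows):
    """Return a list of problems found in a submission CSV (empty if none)."""
    problems = []
    with open(path, newline='') as f:
        reader = csv.DictReader(f)
        if reader.fieldnames != ['rowId', 'label']:
            problems.append(f'unexpected header: {reader.fieldnames}')
            return problems
        rows = list(reader)
    if len(rows) != expected_rows:
        problems.append(f'expected {expected_rows} rows, found {len(rows)}')
    if [r['rowId'] for r in rows] != [str(i) for i in range(len(rows))]:
        problems.append('rowId is not 0..n-1 in order')
    if any(r['label'] not in {'0', '1'} for r in rows):
        problems.append('label contains values other than 0 and 1')
    return problems

# Demo on a small temporary file that mimics the layout written above.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo_submission.csv')
    with open(demo_path, 'w', newline='') as f:
        f.write('rowId,label\n0,0\n1,0\n2,1\n')
    demo_problems = check_submission(demo_path, expected_rows=3)
print(demo_problems)
```

In the notebook, the same helper could be called with the path of `my_baseline_results.csv` and `len(df_test)` as the expected row count.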