# **STILL IN DEVELOPMENT**

# Clickbait Detection using Turkish News Dataset

## 1. Introduction

Clickbait, the practice of using misleading headlines to attract more visitors to a website, is common in the news industry. It benefits the websites, which gain traffic, but not the readers. The main goal of this project is to detect clickbait headlines using a Turkish news dataset.

## 2. Dataset Characterization

The dataset was collected from several websites and contains 20,036 headlines and their labels. The labels are binary: 1 for clickbait and 0 for non-clickbait. The dataset is available on [Kaggle](https://www.kaggle.com/datasets/suleymancan/turkishnewstitle20000clickbaitclassified).

# Let's start with the code

## 1. Importing Libraries

In [24]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import string

import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
from nltk.stem import PorterStemmer

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

from scipy import sparse

import warnings
#warnings.filterwarnings('ignore')

[nltk_data] Downloading package punkt to /Users/mert/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/mert/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [25]:
df = pd.read_csv('https://raw.githubusercontent.com/merttezcan/clickbait-detection-turkish-news/main/code/data/20000_turkish_news_title.csv')
print("Shape of the dataset:", df.shape)

print("Number of the clickbait and non-clickbait titles:")
print(df.clickbait.value_counts())

Shape of the dataset: (20038, 4)
Number of the clickbait and non-clickbait titles:
1.0    10030
0.0    10006
Name: clickbait, dtype: int64


Let's check what our dataset looks like.

In [26]:
df.head()

Unnamed: 0,id,clickbait,site,title
0,25892,1.0,hurriyet.com.tr,İhracatta Türkiye'nin yarısını geçti
1,25893,1.0,hurriyet.com.tr,Borsa İstanbul günü düşüşle tamamladı
2,25894,1.0,hurriyet.com.tr,Londra ve Manchester uçuşlarında yolcu rekoru ...
3,4,0.0,nayn.co,CHP’li İlgezdi’den partisine ve Kılıçdaroğlu’n...
4,5,0.0,nayn.co,Vatandaşın derdine ortak olmaya soyunan bir cu...


## 2. Data Preprocessing

Our dataset contains 4 columns: "id", "clickbait", "site", and "title". The "id" column is a unique identifier for each headline, "clickbait" is its label, "site" is the website where the headline was published, and "title" is the headline itself. We only need "clickbait" (the label) and "title" (the text we want to analyze); "id" and "site" are not meaningful for our project.

So, let's drop the other columns.

In [27]:
# Drop the identifier and source columns by name (positions 0 and 2)
df = df.drop(columns=['id', 'site'])
df

Unnamed: 0,clickbait,title
0,1.0,İhracatta Türkiye'nin yarısını geçti
1,1.0,Borsa İstanbul günü düşüşle tamamladı
2,1.0,Londra ve Manchester uçuşlarında yolcu rekoru ...
3,0.0,CHP’li İlgezdi’den partisine ve Kılıçdaroğlu’n...
4,0.0,Vatandaşın derdine ortak olmaya soyunan bir cu...
...,...,...
20033,1.0,"Kılıçdaroğlu, 8 Mart Dünya Emekçi Kadınlar Gün..."
20034,1.0,Abdest nasıl alınır? Abdest alırken hangi dual...
20035,1.0,"Sıla Hanım, Ahmet Bey’in inançlarına saygısızmış!"
20036,1.0,Binali Yıldırım: Hiçbir şekilde hakkınızın kay...


We can also reorder the columns, since it is common practice to put the label column last, and rename the "clickbait" column to "label".

In [28]:
df = df[['title', 'clickbait']].rename(columns={'clickbait': 'label'})
df

Unnamed: 0,title,label
0,İhracatta Türkiye'nin yarısını geçti,1.0
1,Borsa İstanbul günü düşüşle tamamladı,1.0
2,Londra ve Manchester uçuşlarında yolcu rekoru ...,1.0
3,CHP’li İlgezdi’den partisine ve Kılıçdaroğlu’n...,0.0
4,Vatandaşın derdine ortak olmaya soyunan bir cu...,0.0
...,...,...
20033,"Kılıçdaroğlu, 8 Mart Dünya Emekçi Kadınlar Gün...",1.0
20034,Abdest nasıl alınır? Abdest alırken hangi dual...,1.0
20035,"Sıla Hanım, Ahmet Bey’in inançlarına saygısızmış!",1.0
20036,Binali Yıldırım: Hiçbir şekilde hakkınızın kay...,1.0


The dataset has 20,038 rows, but only 20,036 labels were counted above, so some values are probably missing. Let's check.

In [29]:
df.isna().sum()

title    0
label    2
dtype: int64

In [30]:
# isnull() is an alias of isna(), so this repeats the check above
df.isnull().sum()

title    0
label    2
dtype: int64

Only two values are missing, both in the "label" column, so we can drop those rows.

In [31]:
df = df.dropna()
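
Because the missing values forced pandas to store the labels as floats (1.0 and 0.0), it can be convenient to cast them to integers once the incomplete rows are gone. This is an optional extra step, not part of the original notebook. A minimal sketch on a toy frame; the `toy` name and its rows are made up for illustration:

```python
import pandas as pd

# Toy stand-in for the real data: one row has a missing label
toy = pd.DataFrame({
    "title": ["headline a", "headline b", "headline c"],
    "label": [1.0, None, 0.0],
})

# Drop rows without a label, then store the labels as integers
toy = toy.dropna(subset=["label"]).astype({"label": int})

print(toy["label"].tolist())
```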

## 3. Feature Engineering

We can try to extract meaningful features from the "title" column, which is free text. For example, let's flag the headlines that contain exclamation marks or question marks in two new features named contains_exclamation and contains_question_mark.

In [32]:
def contains_exclamation(title):
    # 1 if the headline contains an exclamation mark, else 0
    return int("!" in title)

def contains_question_mark(title):
    # 1 if the headline contains a question mark, else 0
    return int("?" in title)

In [33]:
df['contains_exclamation']=df['title'].apply(contains_exclamation)  

df.contains_exclamation.value_counts()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['contains_exclamation']=df['title'].apply(contains_exclamation)


0    17291
1     2745
Name: contains_exclamation, dtype: int64

In [34]:
df['contains_question_mark']=df['title'].apply(contains_question_mark) 

df.contains_question_mark.value_counts()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['contains_question_mark']=df['title'].apply(contains_question_mark)


0    18537
1     1499
Name: contains_question_mark, dtype: int64

We have 2,745 headlines that contain an exclamation mark and 1,499 headlines that contain a question mark.
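
The SettingWithCopyWarning messages printed above mean pandas could not tell whether `df` is an independent frame or a view of an earlier one. One common way to avoid them, sketched here under the assumption that the warning comes from assigning to a frame produced by filtering, is to take an explicit copy before adding columns. The vectorized `str.contains` call is an alternative to `apply` that yields the same 0/1 flags. The `raw` and `df_clean` names and the toy rows are invented for illustration.

```python
import pandas as pd

raw = pd.DataFrame({
    "title": ["Şok gelişme!", "Borsa günü düşüşle tamamladı", None],
    "label": [1.0, 0.0, None],
})

# .copy() makes the filtered result an independent frame
df_clean = raw.dropna().copy()

# regex=False treats "!" and "?" as literal characters
df_clean["contains_exclamation"] = (
    df_clean["title"].str.contains("!", regex=False).astype(int)
)
df_clean["contains_question_mark"] = (
    df_clean["title"].str.contains("?", regex=False).astype(int)
)

print(df_clean[["contains_exclamation", "contains_question_mark"]])
```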

Now we should also lowercase the headlines and remove URLs, bracketed text, and punctuation to get clean text. Digits are kept.

In [35]:
def clean_text(text):
    text = text.lower()
    # Replace newlines with spaces and collapse double spaces
    text = re.sub(r'\n', ' ', text)
    text = re.sub(r'  ', ' ', text)
    # Remove URLs
    text = re.sub(r'^https?:\/\/.*[\r\n]*', '', text, flags=re.MULTILINE)
    text = re.sub(r'\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*', '', text)
    # Remove text in square brackets (raw string avoids an invalid-escape warning)
    text = re.sub(r'\[.*?\]', ' ', text)
    # Remove ASCII punctuation
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    # Remove typographic quotes and en dashes, which string.punctuation does not cover
    text = re.sub('[“”’–‘]', '', text)

    return text
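
Note that replacing two spaces with one only halves a longer run of spaces, and that removing punctuation afterwards can leave new gaps. If normalized spacing matters, one option (an addition, not part of the original function) is a final pass that collapses any run of whitespace. `normalize_spaces` is a hypothetical helper name:

```python
import re

def normalize_spaces(text):
    # Collapse every run of whitespace into one space and trim the ends
    return re.sub(r"\s+", " ", text).strip()

sample = "borsa    istanbul \n günü"
print(repr(re.sub("  ", " ", sample)))   # a single pairwise pass leaves extra spaces
print(repr(normalize_spaces(sample)))
```

Since `word_count` relies on `str.split()`, which already ignores repeated whitespace, the word counts are unaffected either way.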

In [36]:
# Apply the cleaning function to every headline
df['title'] = df['title'].apply(clean_text)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value


In [37]:
df

Unnamed: 0,title,label,contains_exclamation,contains_question_mark
0,i̇hracatta türkiyenin yarısını geçti,1.0,0,0
1,borsa i̇stanbul günü düşüşle tamamladı,1.0,0,0
2,londra ve manchester uçuşlarında yolcu rekoru ...,1.0,0,0
3,chpli i̇lgezdiden partisine ve kılıçdaroğluna ...,0.0,0,0
4,vatandaşın derdine ortak olmaya soyunan bir cu...,0.0,0,0
...,...,...,...,...
20033,kılıçdaroğlu 8 mart dünya emekçi kadınlar günü...,1.0,0,0
20034,abdest nasıl alınır abdest alırken hangi duala...,1.0,0,1
20035,sıla hanım ahmet beyin inançlarına saygısızmış,1.0,1,0
20036,binali yıldırım hiçbir şekilde hakkınızın kayb...,1.0,0,0
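
The cleaned titles above show `i̇hracatta` and `i̇stanbul`: Python's `str.lower()` turns the Turkish capital `İ` into `i` followed by a combining dot, and it maps `I` to `i` rather than `ı`. A possible workaround, not used in this notebook, is to map the two capitals by hand before lowercasing. `turkish_lower` is a hypothetical helper name:

```python
def turkish_lower(text):
    # Handle the dotted and dotless capital I explicitly, then lowercase the rest
    return text.replace("İ", "i").replace("I", "ı").lower()

print(len("İ".lower()))            # 2: "i" plus U+0307 (combining dot above)
print(turkish_lower("İhracatta"))
print(turkish_lower("IŞIK"))
```

The trade-off is that non-Turkish words written with a capital `I` would also end up with a dotless `ı`.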


We can also use the number of words in each headline as a feature. Let's create a new feature named "word_count" by splitting each headline on whitespace and counting the resulting words. Headlines left with no words after cleaning are dropped.

In [38]:
df['word_count'] = df['title'].apply(lambda x: len(x.split()))

df = df[df['word_count'] != 0]
df

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['word_count'] = df['title'].apply(lambda x: len(x.split()))


Unnamed: 0,title,label,contains_exclamation,contains_question_mark,word_count
0,i̇hracatta türkiyenin yarısını geçti,1.0,0,0,4
1,borsa i̇stanbul günü düşüşle tamamladı,1.0,0,0,5
2,londra ve manchester uçuşlarında yolcu rekoru ...,1.0,0,0,7
3,chpli i̇lgezdiden partisine ve kılıçdaroğluna ...,0.0,0,0,13
4,vatandaşın derdine ortak olmaya soyunan bir cu...,0.0,0,0,11
...,...,...,...,...,...
20033,kılıçdaroğlu 8 mart dünya emekçi kadınlar günü...,1.0,0,0,9
20034,abdest nasıl alınır abdest alırken hangi duala...,1.0,0,1,8
20035,sıla hanım ahmet beyin inançlarına saygısızmış,1.0,1,0,6
20036,binali yıldırım hiçbir şekilde hakkınızın kayb...,1.0,0,0,8


To analyze the text, we are going to use NLTK, a natural language processing library. It can tokenize the text, remove stopwords, and stem the words.

In [39]:
def tokenize(text):
    text = [word_tokenize(x) for x in text]
    return text

#df.title = tokenize(df.title)

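The tokenizer is defined, but the call that applies it is left commented out, so the "title" column still holds plain strings. Next, we load the Turkish stopword list. The line that would remove stopwords from tokenized headlines is commented out as well; the list is passed to the vectorizer in the model-building step instead.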

In [40]:
stopwords_list = stopwords.words('turkish')
#df.title = df['title'].apply(lambda x: [item for item in x if item not in stopwords_list])
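
If the commented-out steps were enabled, stopword removal on tokenized headlines would look roughly like this. The sketch uses `str.split()` and a tiny hand-picked stopword set as stand-ins for `word_tokenize` and `stopwords.words('turkish')`, so it runs without NLTK data. Both stand-ins, and the `remove_stopwords` helper name, are assumptions for illustration.

```python
# Tiny illustrative stopword set; the real list would come from NLTK
sample_stopwords = {"ve", "bir", "bu"}

def remove_stopwords(tokens, stopword_set):
    # Set membership keeps the lookup fast for long stopword lists
    return [tok for tok in tokens if tok not in stopword_set]

tokens = "londra ve manchester uçuşlarında yolcu rekoru".split()
print(remove_stopwords(tokens, sample_stopwords))
```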

## 4. Exploratory Data Analysis (EDA)

coming soon...

## 5. Model Building

We can now start building our machine learning models. We are going to train several different models and compare their performance:

- Logistic Regression
- Naive Bayes
- Support Vector Machine
- Random Forest
- XGBoost
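
A comparison like this could be organized as a loop over a dictionary of estimators. The sketch below uses synthetic data from `make_classification` rather than the headlines, takes absolute values so the features are non-negative like TF-IDF weights (MultinomialNB requires that), and leaves out XGBoost because it lives in a separate package. The chosen estimators, hyperparameters, and variable names are assumptions, not the notebook's final setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Synthetic stand-in data; abs() keeps features non-negative for MultinomialNB
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=42)
X_demo = np.abs(X_demo)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": MultinomialNB(),
    "Support Vector Machine": LinearSVC(),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

# Fit each model on the same split and record its test accuracy
results = {}
for name, model in models.items():
    model.fit(Xtr, ytr)
    results[name] = accuracy_score(yte, model.predict(Xte))

for name, acc in results.items():
    print(f"{name}: {acc:.3f}")
```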

In [41]:
X = df.drop(columns='label')
y = df['label']

y.value_counts()

1.0    10030
0.0    10006
Name: label, dtype: int64

In [42]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
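
The classes here are almost perfectly balanced, so a plain random split works. If they were not, passing `stratify` would keep the class ratio the same in both parts. A small sketch with made-up labels (`X_toy` and `y_toy` are illustrative names):

```python
from sklearn.model_selection import train_test_split

X_toy = [[i] for i in range(100)]
y_toy = [0] * 50 + [1] * 50

# stratify=y_toy preserves the 50/50 class ratio in train and test
_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
print(sum(y_tr), sum(y_te))
```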


We are going to use the sklearn feature_extraction module to convert the text into a numeric representation that the machine learning models can work with. Let's start with the TfidfVectorizer, using unigrams and bigrams and the Turkish stopword list.

In [43]:
tfidf = TfidfVectorizer(stop_words = stopwords_list,ngram_range = (1,2))
tfidf_text_train = tfidf.fit_transform(X_train['title'])
tfidf_text_test = tfidf.transform(X_test['title'])

X_train_ef = X_train.drop(columns='title')
X_test_ef = X_test.drop(columns='title')

Now we can combine the features we created with the TfidfVectorizer output.

In [44]:
X_train = sparse.hstack([X_train_ef, tfidf_text_train]).tocsr()
X_test = sparse.hstack([X_test_ef, tfidf_text_test]).tocsr()

In [45]:
print(X_train.shape)
print(X_test.shape)

(16028, 132311)
(4008, 132311)
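
The 132,311 columns are the 3 engineered features plus 132,308 unigram and bigram terms from the vectorizer. The same stacking on a toy corpus, with invented titles and feature values, makes the shape arithmetic easy to check:

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

titles = [
    "borsa istanbul günü düşüşle tamamladı",
    "abdest nasıl alınır",
    "yolcu rekoru kırıldı",
]
# Invented engineered features: contains_exclamation, contains_question_mark, word_count
extra = np.array([[0, 0, 5], [0, 1, 3], [0, 0, 3]])

vec = TfidfVectorizer(ngram_range=(1, 2))
text_matrix = vec.fit_transform(titles)

# Wrap the dense block as sparse, stack column-wise, and convert to CSR for the model
combined = sparse.hstack([sparse.csr_matrix(extra), text_matrix]).tocsr()
print(extra.shape[1], text_matrix.shape[1], combined.shape)
```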


### 5.1. Logistic Regression

Let's start with the Logistic Regression model.

In [46]:
lr_model = LogisticRegression()

lr_model.fit(X_train, y_train)

lr_train_pred = lr_model.predict(X_train)
lr_test_pred = lr_model.predict(X_test)

print("Train Accuracy:", accuracy_score(y_train, lr_train_pred))
print("Test Accuracy:", accuracy_score(y_test, lr_test_pred))

print("Confusion Matrix:")
print(confusion_matrix(y_test, lr_test_pred))

print("Classification Report:")

print(classification_report(y_test, lr_test_pred))


Train Accuracy: 0.916645869727976
Test Accuracy: 0.8667664670658682
Confusion Matrix:
[[1695  309]
 [ 225 1779]]
Classification Report:
              precision    recall  f1-score   support

         0.0       0.88      0.85      0.86      2004
         1.0       0.85      0.89      0.87      2004

    accuracy                           0.87      4008
   macro avg       0.87      0.87      0.87      4008
weighted avg       0.87      0.87      0.87      4008
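
Every number in the report can be recomputed from the confusion matrix, whose rows are the true classes and whose columns are the predicted classes. The following sketch redoes the arithmetic for accuracy and for the clickbait class; the variable names are illustrative.

```python
# Confusion matrix from the output above: rows = true class, columns = predicted class
tn, fp = 1695, 309   # true non-clickbait: predicted 0, predicted 1
fn, tp = 225, 1779   # true clickbait: predicted 0, predicted 1

accuracy = (tn + tp) / (tn + fp + fn + tp)
precision_clickbait = tp / (tp + fp)   # share of predicted clickbait that is clickbait
recall_clickbait = tp / (tp + fn)      # share of actual clickbait that was found

print(round(accuracy, 4), round(precision_clickbait, 2), round(recall_clickbait, 2))
```

The gap between the train accuracy (about 0.92) and the test accuracy (about 0.87) suggests some overfitting, which is not surprising with more than 132,000 features.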

