# Data Mining: exam text analysis

- Pablo Vizán Siso
- 7438214

## Question 1

#### Compare the use of clickbait titles between democrats and republicans in framing.p. How many times do democrats refer to an article with a clickbait title and how many times do republicans do? Inspect the titles in the dataset that were classified as clickbait and try to explain the results.

Below, I will import the necessary packages for the analysis

In [None]:
# numpy
import numpy as np

# pandas
import pandas as pd

# scikit-learn
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report 

# tensorflow
import tensorflow as tf

# transformers
!pip install transformers
from transformers import DistilBertTokenizerFast, TFDistilBertForSequenceClassification

First step will be opening the file from which we will be extracting the necessary data for the analysis, framing.p, as a pandas DataFrame

In [3]:
df = pd.read_pickle('framing.p')

I will call the head method on the just created DataFrame to take a first look at the data

In [4]:
df.head()

Unnamed: 0,tweet_id,date,user,party,state,chamber,tweet,news_mention,url_reference,netloc,title,description,label
0,1325914751495499776,2020-11-09 21:34:45,SenShelby,R,Alabama,Senator,ICYMI – @BusinessInsider declared #Huntsville ...,businessinsider,https://www.businessinsider.com/personal-finan...,www.businessinsider.com,The 10 best US cities to move to if you want t...,The best US cities to move to if you want to s...,
1,1294021087118987264,2020-08-13 21:20:43,SenShelby,R,Alabama,Senator,Great news! Today @mazda_toyota announced an a...,,https://pressroom.toyota.com/mazda-and-toyota-...,pressroom.toyota.com,Mazda and Toyota Further Commitment to U.S. Ma...,"HUNTSVILLE, Ala., (Aug. 13, 2020) – Today, Maz...",
2,1323340848130609156,2020-11-02 19:06:59,DougJones,D,Alabama,Senator,He’s already quitting on the folks of Alabama ...,,https://apnews.com/article/c73f0dfe8008ebaf85e...,apnews.com,"Tuberville, Jones fight for Senate seat in Ala...","GARDENDALE, Ala. (AP) — U.S. Sen. Doug Jones, ...",
3,1323004075831709698,2020-11-01 20:48:46,DougJones,D,Alabama,Senator,I know you guys are getting bombarded with fun...,,https://secure.actblue.com/donate/djfs-close?r...,secure.actblue.com,I just gave!,Join us! Contribute today.,negiotated
4,1322567531320717314,2020-10-31 15:54:06,DougJones,D,Alabama,Senator,"Well looky here folks, his own players don’t t...",,https://slate.com/culture/2020/10/tommy-tuberv...,slate.com,What Tommy Tuberville’s Former Auburn Players ...,"""All I could think is, why?""",


In order to determine whether or not each tweet is a clickbait, we will make use of a dataset of 16,086 titles labeled as clickbait (1) or not clickbait (0)

In [5]:
DATASET_URL = 'https://gist.githubusercontent.com/amitness/0a2ddbcb61c34eab04bad5a17fd8c86b/raw/66ad13dfac4bd1201e09726677dd8ba8048bb8af/clickbait.csv'
data = pd.read_csv(DATASET_URL)
data.head(5)

Unnamed: 0,title,label
0,"15 Highly Important Questions About Adulthood,...",1
1,250 Nuns Just Cycled All The Way From Kathmand...,1
2,"Australian comedians ""could have been shot"" du...",0
3,Lycos launches screensaver to increase spammer...,0
4,Fußball-Bundesliga 2008–09: Goalkeeper Butt si...,0


While logistic regression is also a valid choice, I will opt to use a BERT model, which already has an initial understanding of the language unlike logistic regression. More particularly, I will us a DistilBERT model, a smaller and faster implementation of the BERT model. The first step of the process will be to split the dataset into training, test and validation sets.

In [6]:
X = list(data.title.values) 
y = list(data.label.values) 
labels = ['not clickbait', 'clickbait']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_test, X_val, y_test, y_val = train_test_split(X_test, y_test, test_size=0.5, random_state=1)

After that, the step of vectorization comes. I will use the tokenizer of the model to extract the tokens from the dataset.

In [7]:
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

train_encodings = tokenizer(X_train, truncation=True, padding=True, max_length=128) # convert input strings to BERT encodings
test_encodings = tokenizer(X_test, truncation=True, padding=True,  max_length=128)
val_encodings = tokenizer(X_val, truncation=True, padding=True, max_length=128)

train_dataset = tf.data.Dataset.from_tensor_slices((
    dict(train_encodings),
    y_train
)).shuffle(1000).batch(16) # convert the encodings to Tensorflow objects
val_dataset = tf.data.Dataset.from_tensor_slices((
    dict(val_encodings),
    y_val
)).batch(64)
test_dataset = tf.data.Dataset.from_tensor_slices((
    dict(test_encodings),
    y_test
)).batch(64)

HBox(children=(FloatProgress(value=0.0, description='Downloading', max=231508.0, style=ProgressStyle(descripti…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=466062.0, style=ProgressStyle(descripti…




After that, the model will be ready to be loaded and compiled. I will make use of EarlyStopping: the model will be evaluated each epoch (one iteration through the training set), and when its performance on the validation set doesn't improve three times in a row, the training will be stopped early, and the best model will be returned.

In [8]:
model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-cased', 
                                                           num_labels=len(labels))
callbacks = [
        tf.keras.callbacks.EarlyStopping(
            monitor='val_loss', min_delta=0, patience=3, verbose=0, mode='min', baseline=None, restore_best_weights=True)
]

optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

model.compile(optimizer=optimizer, loss=loss)

HBox(children=(FloatProgress(value=0.0, description='Downloading', max=411.0, style=ProgressStyle(description_…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=354041576.0, style=ProgressStyle(descri…




Some layers from the model checkpoint at distilbert-base-cased were not used when initializing TFDistilBertForSequenceClassification: ['vocab_transform', 'vocab_layer_norm', 'vocab_projector', 'activation_13']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['pre_classifier', 'classifier', 'dropout_19']
You should probably TRAIN this model on a down-stream task to be able to use it fo

Now, everything is set for the model to be finally trained!

In [9]:
model.fit(train_dataset, epochs=10, callbacks=callbacks, validation_data=val_dataset, batch_size=16)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10


<tensorflow.python.keras.callbacks.History at 0x7ff0b22c0ac8>

I will quickly check how good this model is looking at its classification report

In [10]:
logits = model.predict(test_dataset)
y_preds = np.argmax(logits[0], axis=1)
print(classification_report(y_test, y_preds))

              precision    recall  f1-score   support

           0       0.98      0.97      0.98      1586
           1       0.97      0.98      0.98      1613

    accuracy                           0.98      3199
   macro avg       0.98      0.98      0.98      3199
weighted avg       0.98      0.98      0.98      3199



The model seems reliable as its accuracy f1-score is very high (0.98!)

Now, it is time to classify the tweets in the data as clickbait or not clickbait.

In [11]:
titles = df.title.to_list()
titles_encodings = tokenizer(titles, truncation=True, padding=True)
titles_encodings = tf.data.Dataset.from_tensor_slices((dict(titles_encodings))).batch(64)
pred_logits = model.predict(titles_encodings)

I just used the model to make the prediction. Now, I am going to append the prediction to the DataFrame as a new column

In [12]:
clickbait = [labels[np.argmax(logits)] for _, logits in enumerate(pred_logits[0])]
df['clickbait'] = clickbait

In [13]:
df.head()

Unnamed: 0,tweet_id,date,user,party,state,chamber,tweet,news_mention,url_reference,netloc,title,description,label,clickbait
0,1325914751495499776,2020-11-09 21:34:45,SenShelby,R,Alabama,Senator,ICYMI – @BusinessInsider declared #Huntsville ...,businessinsider,https://www.businessinsider.com/personal-finan...,www.businessinsider.com,The 10 best US cities to move to if you want t...,The best US cities to move to if you want to s...,,clickbait
1,1294021087118987264,2020-08-13 21:20:43,SenShelby,R,Alabama,Senator,Great news! Today @mazda_toyota announced an a...,,https://pressroom.toyota.com/mazda-and-toyota-...,pressroom.toyota.com,Mazda and Toyota Further Commitment to U.S. Ma...,"HUNTSVILLE, Ala., (Aug. 13, 2020) – Today, Maz...",,not clickbait
2,1323340848130609156,2020-11-02 19:06:59,DougJones,D,Alabama,Senator,He’s already quitting on the folks of Alabama ...,,https://apnews.com/article/c73f0dfe8008ebaf85e...,apnews.com,"Tuberville, Jones fight for Senate seat in Ala...","GARDENDALE, Ala. (AP) — U.S. Sen. Doug Jones, ...",,not clickbait
3,1323004075831709698,2020-11-01 20:48:46,DougJones,D,Alabama,Senator,I know you guys are getting bombarded with fun...,,https://secure.actblue.com/donate/djfs-close?r...,secure.actblue.com,I just gave!,Join us! Contribute today.,negiotated,clickbait
4,1322567531320717314,2020-10-31 15:54:06,DougJones,D,Alabama,Senator,"Well looky here folks, his own players don’t t...",,https://slate.com/culture/2020/10/tommy-tuberv...,slate.com,What Tommy Tuberville’s Former Auburn Players ...,"""All I could think is, why?""",,not clickbait


Great! The prediction for each tweet can be checked looking at its value for the 'clickbait' field. Now, I will divide the dataset in two subsets: tweets made by republicans, and tweets made by democrats

In [14]:
df_dem = df[df.party == 'D']
df_rep = df[df.party == 'R']

Finally, we will check how many times did republican politicians tweet the title of a clickbait article, and how many times did democrat politicians do that as well

In [15]:
print(f"Number of times democrats have shared articles with clickbait titles: {df_dem[df_dem.clickbait == 'clickbait'].shape[0]}")
print(f"Number of times republicans have shared articles with clickbait titles: {df_rep[df_rep.clickbait == 'clickbait'].shape[0]}")

Number of times democrats have shared articles with clickbait titles: 2615
Number of times republicans have shared articles with clickbait titles: 1134


Interesting! Democrats initially seem to have shared significantly more articles with clickbait titles. However, if the question we wanted to answer was if democrats share a higher amount of clickbait articles proportionally than republicans, we would have to calculate the percentage over the total.

In [16]:
print(f"Percentage of tweets made by democrat politicians that shared a clickbait article: {round(100 * (df_dem[df_dem.clickbait == 'clickbait'].shape[0] / df_dem.shape[0]), 2)}")
print(f"Percentage of tweets made by republican politicians that shared a clickbait article:  {round(100 * (df_rep[df_rep.clickbait == 'clickbait'].shape[0] / df_rep.shape[0]), 2)}")

Percentage of tweets made by democrat politicians that shared a clickbait article: 16.8
Percentage of tweets made by republican politicians that shared a clickbait article:  14.59


It seems like proportionally democrats still link to clickbait articles in their tweets more often than republicans do, even if the difference now doesn't seem as drastic as it did when I compared the absolute count of the tweets of both subsets.

I also tried to inspect the dataset of tweets that were labelled as clickbait in case there was any strange pattern that could explain the numbers. I did not find any relevant result, except for the fact that around 30 tweets labelled as clickbait articles are actually links to donation, and except for one they all were shared by democrats. However, this doesn't affect the result very much, as it's a very small number compared with the total number of tweets made by democrats.

In [26]:
import re
df[df.clickbait == "clickbait"][df.title.str.match('I just gave') == True].sample(5)

  


Unnamed: 0,tweet_id,date,user,party,state,chamber,tweet,news_mention,url_reference,netloc,title,description,label,clickbait
4527,1317488586845876224,2020-10-17 15:32:11,MichaelBennet,D,Colorado,U.S. Senator,My friend Doug Jones needs your help. Please ...,,https://secure.actblue.com/donate/jones-homepa...,secure.actblue.com,I just gave to Doug Jones for Senate,Help Doug Jones continue to work for working p...,,clickbait
8,1319698596074262530,2020-10-23 17:53:58,DougJones,D,Alabama,Senator,"We are excited to have @DebraMessing, @SeanHay...",,http://secure.actblue.com/donate/10.23.20wg,secure.actblue.com,I just gave!,Join us! Contribute today.,,clickbait
21474,1325916582086840327,2020-11-09 21:42:02,repblumenauer,D,Oregon 3rd District,U.S. Representative,Senate control hinges on Georgia's special Sen...,,http://GASenate.com,GASenate.com,I just gave to Fair Fight! Lead with Stacey Ab...,Together we are leading the fight for fair ele...,,clickbait
5076,1313538908509143041,2020-10-06 17:57:34,ChrisMurphyCT,D,Connecticut,U.S. Senator,"Listen, Kansas is winnable. They just elected ...",,https://secure.actblue.com/donate/natlla2020,secure.actblue.com,I just gave to Barbara Bollier!,Join us! Contribute today.,,clickbait
15587,1296592851397234688,2020-08-20 23:39:59,Maggie_Hassan,D,New Hampshire,U.S. Senator,On the fourth and final night of the #DemConve...,,https://secure.actblue.com/donate/2020-mwh4bid...,secure.actblue.com,I just gave to Maggie Hassan!,Join us! Contribute today.,,clickbait


In [30]:
print(f"Number of tweets that link to a donation whose title matches the pattern 'I just gave...': " + str(df[df.clickbait == "clickbait"][df.title.str.match('I just gave') == True].shape[0]))

Number of tweets that link to a donation whose title matches the pattern 'I just gave...': 33


  """Entry point for launching an IPython kernel.


The last possibility I considered was that the representation of democrats and republicans in the dataset could have any relevance, since there are as twice democrats as there are republicans. However it is not sufficient to explain why democrats share in proportion more clickbait articles than republicans. The tendency, in general, is not to share clickbaits, rather than the opposite.

In [34]:
df[df.party.isin(["D", "R"])].groupby(by="party").count()['title']

party
D    15564
R     7771
Name: title, dtype: int64

In [35]:
df.groupby(by="clickbait").count()['title']

clickbait
clickbait         3768
not clickbait    19680
Name: title, dtype: int64

## Conclusion

The DistilBert model I have trained with a dataset of already labeled article titles has predicted a 'clickbait' label for more articles shared by democrats than articles shared by republicans: this means, more democrats than republicans have shared clickbait articles, in absolute terms (2615 and 1134, respectively). In relative terms, democrats have also shared more clickbait articles than republicans: 16.8% of them, whereas 14.59% of the republicans did as well. Some particularities of the data are: links to fundraisers that are mislabelled as clickbait articles, and a higher representation of democrat politicians in the data than of republican ones. However, any of these can explain the difference among the numbers of ones and the others.