# Reimplementation of ML Model by Hou et al.

This notebook implements a model presented by Hou et al. in [Towards Automatic Detection of Misinformation in Online Medical Videos](https://arxiv.org/pdf/1909.01543.pdf).

## Model
We train and evaluate the following models:

1. The original version published by Hou et al. with an SVM classifier: the `LinearSVC` model from `sklearn` with `C=1` and an L2 normalizer applied to the features. We compare the binary, binary-with-neutral, and ternary variants of the model.
2. A modified version with an XGBoost classifier (binary and binary-with-neutral variants): `XGBClassifier` from `xgboost` with the following hyperparameters: `'booster': 'gbtree', 'random_state': 0, 'objective': 'binary:logistic', 'learning_rate': 0.1, 'n_estimators': 500, 'max_depth': 10, 'min_child_weight': 1, 'gamma': 0, 'subsample': 0.8, 'colsample_bytree': 0.8`. The L2 normalizer is applied to the features as well.
3. A modified version with an XGBoost classifier (ternary variant): `XGBClassifier` from `xgboost` with the following hyperparameters: `'booster': 'gbtree', 'random_state': 0, 'objective': 'multi:softprob', 'learning_rate': 0.1, 'n_estimators': 500, 'max_depth': 10, 'min_child_weight': 1, 'gamma': 0, 'subsample': 0.8, 'colsample_bytree': 0.8, 'eval_metric': 'mlogloss'`. The L2 normalizer is applied to the features as well.
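
All three variants apply an L2 normalizer to the features. As a reminder of what that step does, `Normalizer(norm='l2')` rescales each sample (row) to unit Euclidean length; it does not standardize each feature (column). A minimal sketch on made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Two made-up samples with features on very different scales.
X = np.array([
    [3.0, 4.0],
    [300.0, 400.0],
])

X_unit = Normalizer(norm='l2').fit_transform(X)

# Each row is divided by its own Euclidean length, so both rows
# end up as the same unit vector.
row_norms = np.linalg.norm(X_unit, axis=1)
print(X_unit)
print(row_norms)
```

Because the scaling is per row, the relative proportions of the features within one video are kept, while the overall magnitude of the feature vector is discarded.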

## Features

### Stats

* view count per day
* comment count
* like count
* dislike count
* duration in seconds

Missing:

* categories were not used since they were missing in our data

### Linguistic

* n-grams – `TfidfVectorizer` with English stop words removed, using 1-grams and 2-grams, limited to 1000 features
* readability – all measures from the `readability` library
* liwc – percentage of token counts by category in the LIWC lexicon

### Acoustic – not implemented

Although the paper also used acoustic features, we did not evaluate them because we did not collect audio from the YouTube videos.

## Import libraries

In [1]:
import pandas as pd
from nltk import ngrams
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
import isodate
from sklearn.svm import LinearSVC
from sklearn.preprocessing import Normalizer
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import svm
from imblearn.pipeline import make_pipeline
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from collections import Counter
import liwc
from nltk.tokenize import word_tokenize
import readability
from xgboost import XGBClassifier

## Load dataset

Load a dataset of videos into the `videos` pandas DataFrame. The provided training data consists of our seed and encountered videos that we manually annotated and for which we were able to obtain metadata via the YouTube API. We publish only the `youtube_id` and `annotation` columns. For the rest, please use the official YouTube API (note that some videos might no longer be available).
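
Once the metadata has been fetched separately, it has to be joined back onto the annotations. The sketch below shows one way to do that with a left join; the `metadata` frame, its values, and the identifiers are hypothetical, and the fetching step itself is not shown.

```python
import pandas as pd

# Published part of the dataset: identifiers and labels only.
videos = pd.DataFrame({
    'youtube_id': ['id_a', 'id_b', 'id_c'],
    'annotation': ['promoting', 'debunking', 'neutral'],
})

# Hypothetical metadata fetched separately; 'id_c' is missing, as if
# the video were no longer available.
metadata = pd.DataFrame({
    'youtube_id': ['id_a', 'id_b'],
    'view_count': [1200, 50],
    'duration': ['PT4M13S', 'PT1H2M'],
})

# A left join keeps every annotated video; unavailable ones get NaN.
merged = videos.merge(metadata, on='youtube_id', how='left')
print(merged)
```

In the published `train.csv` the metadata columns already exist but are empty, so they would need to be dropped or overwritten before such a merge.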

In [2]:
videos = pd.read_csv('../Data/normalized_data/train.csv')

In [3]:
videos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2622 entries, 0 to 2621
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   youtube_id       2622 non-null   object 
 1   published_at     0 non-null      float64
 2   updated_at       0 non-null      float64
 3   view_count       0 non-null      float64
 4   like_count       0 non-null      float64
 5   dislike_count    0 non-null      float64
 6   favourite_count  0 non-null      float64
 7   comment_count    0 non-null      float64
 8   duration         0 non-null      float64
 9   transcript       0 non-null      float64
 10  annotation       2622 non-null   object 
dtypes: float64(9), object(2)
memory usage: 225.5+ KB


In [4]:
videos.head()

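| | youtube_id | published_at | updated_at | view_count | like_count | dislike_count | favourite_count | comment_count | duration | transcript | annotation |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1w0_kazbb_U | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | promoting |
| 1 | R9oqi6HteJg | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | debunking |
| 2 | 67ZKmVWB3tY | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | promoting |
| 3 | zw0nYNMUIfA | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | promoting |
| 4 | e20vaAtncsM | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | debunking |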


In [5]:
videos['annotation'].value_counts()

neutral      1459
debunking     758
promoting     405
Name: annotation, dtype: int64

## Preprocess dataset

In [None]:
import re

def remove_tags(text):
    """
    Remove vtt markup tags
    """
    tags = [
        r'</c>',
        r'<c(\.color\w+)?>',
        r'<\d{2}:\d{2}:\d{2}\.\d{3}>',

    ]

    for pat in tags:
        text = re.sub(pat, '', text)

    # extract timestamp, only keep HH:MM
    text = re.sub(
        r'(\d{2}:\d{2}):\d{2}\.\d{3} --> .* align:start position:0%',
        r'\g<1>',
        text
    )

    text = re.sub(r'^\s+$', '', text, flags=re.MULTILINE)
    return text

def remove_header(lines):
    """
    Remove vtt file header
    """
    pos = -1
    for mark in ('##', 'Language: en',):
        if mark in lines:
            pos = lines.index(mark)
    lines = lines[pos+1:]
    return lines


def merge_duplicates(lines):
    """
    Remove duplicated subtitles. Duplicates are always adjacent.
    """
    last_timestamp = ''
    last_cap = ''
    for line in lines:
        if line == "":
            continue
        if re.match(r'^\d{2}:\d{2}$', line):
            if line != last_timestamp:
                last_timestamp = line
        else:
            if line != last_cap:
                yield line
                last_cap = line


def merge_short_lines(lines):
    buffer = ''
    for line in lines:
        if line == "" or re.match(r'^\d{2}:\d{2}$', line):
            yield '\n' + line
            continue

        if len(line+buffer) < 80:
            buffer += ' ' + line
        else:
            yield buffer.strip()
            buffer = line
    yield buffer


def parse_transcript(text):
    text = remove_tags(text)
    lines = text.splitlines()
    lines = remove_header(lines)
    lines = merge_duplicates(lines)
    lines = list(lines)
    lines = merge_short_lines(lines)
    lines = list(lines)
    result = ' '.join(lines)
    return re.sub(r'\d{2}:\d{2}:\d{2}\.\d{3} --> \d{2}:\d{2}:\d{2}\.\d{3} ', '', result)

videos['transcript'] = videos['transcript'].fillna('')
videos['clean_transcript'] = videos['transcript'].apply(parse_transcript)
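
To illustrate the tag-removal step in isolation, the sketch below applies the three patterns from `remove_tags` to a single made-up caption line written in the style of auto-generated WebVTT subtitles. The input line is an assumption, not taken from the dataset.

```python
import re

# A made-up caption line in the style of auto-generated WebVTT subtitles.
raw = 'so<00:00:01.230><c> today</c><00:00:01.560><c.colorE5E5E5> we</c> talk'

# The same three patterns as in remove_tags above.
patterns = [
    r'</c>',
    r'<c(\.color\w+)?>',
    r'<\d{2}:\d{2}:\d{2}\.\d{3}>',
]

clean = raw
for pattern in patterns:
    clean = re.sub(pattern, '', clean)

print(clean)  # so today we talk
```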

In [None]:
count_cols = ['view_count', 'like_count', 'dislike_count', 'favourite_count', 'comment_count']
videos[count_cols] = videos[count_cols].fillna(0)

## Calculate counts of word classes in transcript using the LIWC lexicon

In [None]:
parse, category_names = liwc.load_token_parser('LIWC2007_English100131.dic')
lexicon, _ = liwc.dic.read_dic('LIWC2007_English100131.dic')

liwc_category_counts = Counter(
    value
    for key, values in lexicon.items()
    for value in values
)

def compute_liwc_transcript_counts(videos):
    liwc_transcript_counts = videos['clean_transcript'].apply(
        lambda transcript: pd.DataFrame({
            (category, token)
            for token in word_tokenize(transcript)
            for category in parse(token.lower())
        }, columns=['category', 'token']).groupby('category').size()
    ).fillna(0)

    for column in liwc_transcript_counts.columns:
        liwc_transcript_counts[column] = liwc_transcript_counts[column] / liwc_category_counts[column]
    
    return liwc_transcript_counts
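
As written, the function above appears to collect `(category, token)` pairs in a set, so each distinct token counts once per category, and then divides by the number of lexicon entries in that category. The sketch below mimics that calculation in plain Python on a toy lexicon. The lexicon, the category names, and the helper `category_ratios` are hypothetical, and only exact matches are handled.

```python
from collections import Counter

# Toy lexicon: token -> categories. Real LIWC dictionaries are licensed
# and also contain wildcard entries, which this sketch ignores.
toy_lexicon = {
    'happy': ['posemo', 'affect'],
    'glad': ['posemo', 'affect'],
    'sad': ['negemo', 'affect'],
    'doctor': ['health'],
}

# Number of lexicon entries per category (the denominator).
category_sizes = Counter(
    category
    for categories in toy_lexicon.values()
    for category in categories
)

def category_ratios(text):
    # Distinct (category, token) pairs, so repeated tokens count once.
    pairs = {
        (category, token)
        for token in text.lower().split()
        for category in toy_lexicon.get(token, [])
    }
    matched = Counter(category for category, _ in pairs)
    return {
        category: matched[category] / size
        for category, size in category_sizes.items()
    }

ratios = category_ratios('Happy happy doctor says sad news')
print(ratios)
```

Here `happy` occurs twice but contributes once, so `posemo` ends up at 1 of 2 lexicon entries.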

## Calculate readability of transcript using the readability package

In [None]:
def compute_readability(videos):
    readability_scores = videos['clean_transcript'].apply(
        lambda transcript: pd.Series({
            f'{k1}-{k2}': v
            for k1, vs in readability.getmeasures(transcript, lang='en').items()
            for k2, v in vs.items()
        } if len(transcript) > 0 else {}, dtype='float64')
    ).fillna(0)
    readability_scores.index = videos.index

    return readability_scores
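
The dictionary comprehension in `compute_readability` flattens a nested mapping of measure groups into one flat set of column names. A small sketch of that flattening on a hand-written dictionary follows; the group names, measure names, and values are made up for illustration and are not output of the `readability` package.

```python
# Hypothetical nested result; the group names, measure names, and
# values are made up for illustration.
measures = {
    'readability grades': {'Kincaid': 7.2, 'ARI': 8.1},
    'sentence info': {'words': 120, 'sentences': 9},
}

flat = {
    f'{group}-{name}': value
    for group, metrics in measures.items()
    for name, value in metrics.items()
}

print(flat)
```

Prefixing each measure with its group keeps column names unique even if two groups contain a measure with the same name.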

## Compute the statistical features

In [None]:
def compute_stats(videos):
    videos['num_tracked_days'] = (
        pd.to_datetime(videos['updated_at'], utc=True) - pd.to_datetime(videos['published_at'], utc=True)
    ).dt.days

    return pd.DataFrame({
        'view_count': videos['view_count'] / videos['num_tracked_days'],
        'comment_count': videos['comment_count'],
        'like_count': videos['like_count'],
        'dislike_count': videos['dislike_count'],
        'duration': videos['duration'].apply(isodate.parse_duration).dt.total_seconds(),
        'clean_transcript': videos['clean_transcript']
    }).fillna(0)
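
One thing to watch in the view-count-per-day feature: if a video was collected on the day it was published, `num_tracked_days` is 0 and the division yields infinity, which `fillna(0)` does not replace. The sketch below shows one possible guard on made-up data. Clipping the denominator to at least one day is our assumption and is not part of the original pipeline.

```python
import pandas as pd

# Made-up metadata for two videos; the second was collected on the
# day it was published.
videos = pd.DataFrame({
    'published_at': ['2020-01-01T00:00:00Z', '2020-03-01T08:00:00Z'],
    'updated_at': ['2020-01-11T00:00:00Z', '2020-03-01T20:00:00Z'],
    'view_count': [5000, 300],
})

days = (
    pd.to_datetime(videos['updated_at'], utc=True)
    - pd.to_datetime(videos['published_at'], utc=True)
).dt.days

# Treat anything younger than one day as one day to avoid dividing by zero.
views_per_day = videos['view_count'] / days.clip(lower=1)
print(views_per_day.tolist())  # [500.0, 300.0]
```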

## The machine learning pipeline for different combinations of features

In [None]:
def make_classifier(clf):
    if clf['clf_type'] == 'svm':
        return LinearSVC(**clf['params'])
    elif clf['clf_type'] == 'xgboost':
        return XGBClassifier(**clf['params'])

    # default classifier
    return LinearSVC(random_state=0, C=1)

def clf_pipeline(column_transformer, classifier, sampler=None):
    
    if sampler:
        return make_pipeline(
            sampler,
            column_transformer,
            make_classifier(classifier)
        )
    else:
        return make_pipeline(
            column_transformer,
            make_classifier(classifier)
        )

def make_clf_pipelines(X, stats, readability_scores, liwc_transcript_counts, samplers=('no-sampling',), col_transformers=('full',), classifiers=('svm',)):
    all_samplers = {
        'no-sampling':   None,
        'oversampling':  RandomOverSampler(sampling_strategy='not majority'),
        'undersampling': RandomUnderSampler(sampling_strategy='not minority', replacement=False)
    }

    all_column_transformers = {
        'full': make_column_transformer(
                (
                    make_pipeline(
                        TfidfVectorizer(
                            stop_words='english',
                            ngram_range=(1, 2),
                            max_features=1000
                        ),
                        Normalizer(norm='l2')
                    ),
                    'clean_transcript'
                ),
                (
                    Normalizer(norm='l2'),
                    [column for column in X.columns if column != 'clean_transcript']
                )
        ),
        'ngrams': make_column_transformer(
               (
                   make_pipeline(
                       TfidfVectorizer(
                           stop_words='english',
                           ngram_range=(1, 2),
                           max_features=1000
                       ),
                       Normalizer(norm='l2')
                   ),
                   'clean_transcript'
               )
        ),
        'stats': make_column_transformer(
               (
                   Normalizer(norm='l2'),
                   [column for column in stats.columns if column != 'clean_transcript']
               )
        ),
        'readability': make_column_transformer(
               (
                   Normalizer(norm='l2'),
                   list(readability_scores.columns)
               )
        ),
        'liwc': make_column_transformer(
               (
                   Normalizer(norm='l2'),
                   list(liwc_transcript_counts.columns)
               )
        )
    }

    all_classifiers = {
        'svm': {
           'clf_type': 'svm',
           'params': {'random_state': 0, 'C': 1}
        },
        'xgboost_binary': {
           'clf_type': 'xgboost',
           'params': {'booster': 'gbtree', 'random_state': 0, 'objective': 'binary:logistic', 'learning_rate': 0.1, 
           'n_estimators': 500, 'max_depth': 10, 'min_child_weight': 1, 'gamma': 0, 'subsample': 0.8, 
           'colsample_bytree': 0.8}
        },
        'xgboost_ternary': {
            'clf_type': 'xgboost',
            'params': {'booster': 'gbtree', 'random_state': 0, 'objective': 'multi:softprob', 'learning_rate': 0.1, 
            'n_estimators': 500, 'max_depth': 10, 'min_child_weight': 1, 'gamma': 0, 'subsample': 0.8, 
            'colsample_bytree': 0.8, 'eval_metric': 'mlogloss'}
        }
    }

    clfs = {}

    for sampler_key, sampler in all_samplers.items():
        if sampler_key not in samplers:
            continue

        for col_transformer_key, col_transformer in all_column_transformers.items():
            if col_transformer_key not in col_transformers:
                continue
            
            for clf_key, classifier in all_classifiers.items():
                if clf_key not in classifiers:
                    continue
                clfs[f"{sampler_key}_{col_transformer_key}_{clf_key}"] = clf_pipeline(col_transformer, classifier, sampler)

    return clfs
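
The `'full'` configuration combines a text branch and a numeric branch in one column transformer. The following stripped-down sketch shows the same idea on made-up data. It uses `make_pipeline` from `sklearn` instead of `imblearn` because no sampler is involved, and the column values and labels are assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn.svm import LinearSVC

# Made-up data: one text column and two numeric columns.
X = pd.DataFrame({
    'clean_transcript': [
        'this cure works miracles', 'studies show no effect',
        'miracle cure revealed', 'doctors debunk the cure',
    ],
    'like_count': [10, 200, 5, 150],
    'duration': [60.0, 300.0, 45.0, 240.0],
})
y = [1, 0, 1, 0]

numeric_columns = [c for c in X.columns if c != 'clean_transcript']

features = make_column_transformer(
    # A single column name (string) passes a 1-D series of documents
    # to the vectorizer.
    (TfidfVectorizer(ngram_range=(1, 2)), 'clean_transcript'),
    # A list of names passes a 2-D block of numeric features.
    (Normalizer(norm='l2'), numeric_columns),
)

model = make_pipeline(features, LinearSVC(random_state=0, C=1))
model.fit(X, y)
predictions = model.predict(X)
print(predictions)
```

The distinction between a string and a list as the column selector matters: `TfidfVectorizer` expects an iterable of documents, whereas `Normalizer` expects a two-dimensional array.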

## Cross-validate the pipelines and output the classification report

In [None]:
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import classification_report

def compute_cross_val_predictions(X, y, clfs):
    print('Classification reports')
    print('----------------------------')
    print()

    predicted = {}
    for label, clf in clfs.items():
        print(label)
        predicted[label] = cross_val_predict(clf, X, y, cv=5)
        print(classification_report(y, predicted[label]))
        print()

    return predicted
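
`cross_val_predict` returns one out-of-fold prediction per sample, so the classification report is computed over the whole dataset rather than averaged per fold. With an integer `cv` and a classifier, scikit-learn uses stratified folds without shuffling. A minimal sketch on synthetic data, which stands in for the video features:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_predict
from sklearn.svm import LinearSVC

# Synthetic two-class data standing in for the video features.
X, y = make_classification(n_samples=100, n_features=10, random_state=0)

# Every sample is predicted exactly once, by the model trained on the
# four folds that do not contain it.
predicted = cross_val_predict(LinearSVC(random_state=0, C=1), X, y, cv=5)

print(classification_report(y, predicted))
```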

## Binary (without neutral)

In [None]:
videos_binary = videos.loc[videos['annotation'].isin(['promoting', 'debunking'])].copy()

In [None]:
print(videos_binary.shape[0], 'videos')

In [None]:
videos_binary.head()

In [None]:
readability_scores_binary = compute_readability(videos_binary)

In [None]:
liwc_transcript_counts_binary = compute_liwc_transcript_counts(videos_binary)

In [None]:
stats_binary = compute_stats(videos_binary)

In [None]:
X_binary = pd.concat([stats_binary, readability_scores_binary, liwc_transcript_counts_binary], axis=1)
y_binary = videos_binary['annotation']
X_binary.head()

In [None]:
samplers_binary = ['no-sampling']
# samplers_binary = ['no-sampling', 'oversampling', 'undersampling']

In [None]:
col_transformers_binary = ['full']
# col_transformers_binary = ['full', 'ngrams', 'stats', 'readability', 'liwc']

In [None]:
classifiers_binary = ['svm', 'xgboost_binary']

In [None]:
clfs_binary = make_clf_pipelines(
    X_binary, stats_binary, readability_scores_binary, liwc_transcript_counts_binary, 
    samplers=samplers_binary, col_transformers=col_transformers_binary, classifiers=classifiers_binary)
clfs_binary.keys()

In [None]:
y_binary_transformed = list(map(lambda x: 1 if x == 'promoting' else 0, y_binary))

In [None]:
predicted_binary = compute_cross_val_predictions(X_binary, y_binary_transformed, clfs_binary)

## Binary (with neutral)

In [None]:
videos_binary_neutral = videos.copy()

In [None]:
videos_binary_neutral.loc[videos_binary_neutral['annotation'] == 'neutral', ['annotation']] = 'debunking'

In [None]:
videos_binary_neutral = videos_binary_neutral.loc[videos_binary_neutral['annotation'].isin(['promoting', 'debunking'])].copy()

In [None]:
videos_binary_neutral['annotation'].value_counts()
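
To see what folding `neutral` into `debunking` does to the class balance, the sketch below rebuilds a label series from the counts reported by `value_counts()` earlier in the notebook and applies the same relabelling. Only the counts come from the notebook; the series itself is synthetic.

```python
import pandas as pd

# Label counts as reported by value_counts() earlier in the notebook.
annotation = pd.Series(
    ['neutral'] * 1459 + ['debunking'] * 758 + ['promoting'] * 405
)

# Fold 'neutral' into 'debunking', then encode 'promoting' as 1.
merged = annotation.replace({'neutral': 'debunking'})
y = (merged == 'promoting').astype(int)

print(merged.value_counts())
print(y.mean())
```

After merging, 2217 videos fall into the negative class and 405 into the positive class, so roughly 15% of the samples are positive.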

In [None]:
print(videos_binary_neutral.shape[0], 'videos')

In [None]:
readability_scores_binary_neutral = compute_readability(videos_binary_neutral)

In [None]:
liwc_transcript_counts_binary_neutral = compute_liwc_transcript_counts(videos_binary_neutral)

In [None]:
stats_binary_neutral = compute_stats(videos_binary_neutral)

In [None]:
X_binary_neutral = pd.concat([
    stats_binary_neutral, readability_scores_binary_neutral, liwc_transcript_counts_binary_neutral
], axis=1)
y_binary_neutral = videos_binary_neutral['annotation']
X_binary_neutral.head()

In [None]:
samplers_binary_neutral = ['no-sampling']
# samplers_binary_neutral = ['no-sampling', 'oversampling', 'undersampling']

In [None]:
col_transformers_binary_neutral = ['full']
# col_transformers_binary_neutral = ['full', 'ngrams', 'stats', 'readability', 'liwc']

In [None]:
classifiers_binary_neutral = ['svm', 'xgboost_binary']

In [None]:
clfs_binary_neutral = make_clf_pipelines(
    X_binary_neutral, stats_binary_neutral, readability_scores_binary_neutral, 
    liwc_transcript_counts_binary_neutral,
    samplers=samplers_binary_neutral, col_transformers=col_transformers_binary_neutral, classifiers=classifiers_binary_neutral
)
clfs_binary_neutral.keys()

In [None]:
y_binary_neutral_transformed = list(map(lambda x: 1 if x == 'promoting' else 0, y_binary_neutral))

In [None]:
predicted_binary_neutral = compute_cross_val_predictions(X_binary_neutral, y_binary_neutral_transformed, clfs_binary_neutral)

## Ternary (three classes)

In [None]:
videos_ternary = videos.copy()

In [None]:
videos_ternary = videos_ternary.loc[videos_ternary['annotation'].isin(['promoting', 'debunking', 'neutral'])].copy()

In [None]:
videos_ternary['annotation'].value_counts()

In [None]:
print(videos_ternary.shape[0], 'videos')

In [None]:
readability_scores_ternary = compute_readability(videos_ternary)

In [None]:
liwc_transcript_counts_ternary = compute_liwc_transcript_counts(videos_ternary)

In [None]:
stats_ternary = compute_stats(videos_ternary)

In [None]:
X_ternary = pd.concat([
    stats_ternary, readability_scores_ternary, liwc_transcript_counts_ternary
], axis=1)
y_ternary = videos_ternary['annotation']
X_ternary.head()

In [None]:
samplers_ternary = ['no-sampling']
# samplers_ternary = ['no-sampling', 'oversampling', 'undersampling']

In [None]:
col_transformers_ternary = ['full']
# col_transformers_ternary = ['full', 'ngrams', 'stats', 'readability', 'liwc']

In [None]:
classifiers_ternary = ['svm', 'xgboost_ternary']

In [None]:
clfs_ternary = make_clf_pipelines(
    X_ternary, stats_ternary, readability_scores_ternary,liwc_transcript_counts_ternary,
    samplers=samplers_ternary, col_transformers=col_transformers_ternary, classifiers=classifiers_ternary
)
clfs_ternary.keys()

In [None]:
def map_ternary_labels(label):
    if label == 'neutral':
        return 0
    if label == 'debunking':
        return 1
    if label == 'promoting':
        return 2


y_ternary_transformed = list(map(map_ternary_labels, y_ternary))
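
The same encoding can also be written as a lookup table, which makes unexpected labels easy to detect. This is a sketch of an alternative, not a change to the cell above; the example series is made up.

```python
import pandas as pd

# Same encoding as map_ternary_labels above, written as a lookup table.
label_to_id = {'neutral': 0, 'debunking': 1, 'promoting': 2}

y = pd.Series(['promoting', 'neutral', 'debunking', 'neutral'])
encoded = y.map(label_to_id)

# Unknown labels become NaN, which is easy to check for up front.
assert encoded.notna().all()
print(encoded.tolist())  # [2, 0, 1, 0]
```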

In [None]:
predicted_ternary = compute_cross_val_predictions(X_ternary, y_ternary_transformed, clfs_ternary)