## Fake News Detection

The dataset contains two types of articles, fake and real news, collected from real-world sources. The real articles were obtained by crawling Reuters.com (a news website). The fake articles were collected from unreliable websites flagged by Politifact and Wikipedia.

The dataset is split into two files:

1.   Fake.csv
2.   True.csv

Load the data.

In [1]:
import pandas as pd
import numpy as np

fake = pd.read_csv('./Dataset/Fake.csv')
true = pd.read_csv('./Dataset/True.csv')

An example of a fake article:

In [7]:
fake.loc[0, 'text'][:1000]

'Donald Trump just couldn t wish all Americans a Happy New Year and leave it at that. Instead, he had to give a shout out to his enemies, haters and  the very dishonest fake news media.  The former reality show star had just one job to do and he couldn t do it. As our Country rapidly grows stronger and smarter, I want to wish all of my friends, supporters, enemies, haters, and even the very dishonest Fake News Media, a Happy and Healthy New Year,  President Angry Pants tweeted.  2018 will be a great year for America! As our Country rapidly grows stronger and smarter, I want to wish all of my friends, supporters, enemies, haters, and even the very dishonest Fake News Media, a Happy and Healthy New Year. 2018 will be a great year for America!  Donald J. Trump (@realDonaldTrump) December 31, 2017Trump s tweet went down about as welll as you d expect.What kind of president sends a New Year s greeting like this despicable, petty, infantile gibberish? Only Trump! His lack of decency won t ev

An example of a real article:

In [8]:
true.loc[0, 'text'][:1000]

'WASHINGTON (Reuters) - The head of a conservative Republican faction in the U.S. Congress, who voted this month for a huge expansion of the national debt to pay for tax cuts, called himself a “fiscal conservative” on Sunday and urged budget restraint in 2018. In keeping with a sharp pivot under way among Republicans, U.S. Representative Mark Meadows, speaking on CBS’ “Face the Nation,” drew a hard line on federal spending, which lawmakers are bracing to do battle over in January. When they return from the holidays on Wednesday, lawmakers will begin trying to pass a federal budget in a fight likely to be linked to other issues, such as immigration policy, even as the November congressional election campaigns approach in which Republicans will seek to keep control of Congress. President Donald Trump and his Republicans want a big budget increase in military spending, while Democrats also want proportional increases for non-defense “discretionary” spending on programs that support educat

We only need the article text and the label (0 = fake, 1 = real).

In [9]:
fake['label'] = 0
true['label'] = 1

df = pd.concat([fake, true], axis=0)
df.reset_index(drop=True, inplace=True)
df = df[['text', 'label']]

In [10]:
df['label'].value_counts()

label
0    23481
1    21417
Name: count, dtype: int64

The classes are roughly balanced (23,481 fake vs. 21,417 real articles).
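To put a number on "roughly balanced", the sketch below recomputes the class shares from the counts printed above and derives the accuracy of a classifier that always predicts the most frequent class. The counts are copied by hand from the output; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 23481, 1: 21417}  # 0 = fake, 1 = real

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial model that always predicts the most frequent class.
majority_baseline = max(shares.values())

for label, share in shares.items():
    print(f'label {label}: {share:.3f}')
print(f'majority-class baseline accuracy: {majority_baseline:.3f}')
```

Since neither class dominates, accuracy is a reasonable headline metric here, and any useful model should score well above this baseline.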

Define functions for text cleaning (lowercase the text and drop every character except letters, digits, and whitespace) and for word-level tokenization.

In [11]:
import re

def preprocessor(text):
    text = text.lower()
    text = re.sub(r'[^a-z0-9\s]', '', text)  # Remove everything except letters, digits, and whitespace
    text = re.sub(r'\s+', ' ', text).strip() # Collapse repeated whitespace and trim the ends

    return text

def tokenizer(text):
    return text.split(' ')

In [12]:
df['text'] = df['text'].apply(preprocessor)
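A quick check of what the cleaning step does to a single string can help when reading the learned features later. The sketch below repeats the two functions so that it runs on its own; the sample strings are made up for illustration.

```python
import re

def preprocessor(text):
    text = text.lower()
    text = re.sub(r'[^a-z0-9\s]', '', text)   # keep letters, digits, whitespace
    text = re.sub(r'\s+', ' ', text).strip()  # collapse whitespace
    return text

def tokenizer(text):
    return text.split(' ')

# Hypothetical input with an apostrophe, punctuation, and extra spaces.
sample = "Trump's tweet:   2018 will be GREAT!"
cleaned = preprocessor(sample)
tokens = tokenizer(cleaned)
print(cleaned)
print(tokens)

# When the apostrophe is already a space in the raw text (as in the fake
# example above, "couldn t"), the leftover letter survives as its own token.
fragments = tokenizer(preprocessor('couldn t do it'))
print(fragments)
```

Two details follow from this: an apostrophe inside a word is dropped, so "Trump's" becomes `trumps`, while text that already has a space in place of the apostrophe yields single-letter tokens such as `t` and `s`. Also note that `tokenizer('')` returns `['']`, so an article that is empty after cleaning contributes one empty-string token.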

Split the dataset into training and test sets.

In [13]:
from sklearn.model_selection import train_test_split

x, y = df['text'].values, df['label'].values
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2,
                                                    random_state=0, stratify=y)

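The main model is logistic regression, with features extracted from the text using TF-IDF. We tune the TF-IDF parameters (n-gram size) and the logistic regression parameters (regularization type and strength) with a grid search. In the grid used here only the L2 penalty is listed, so the search effectively covers the n-gram size and the regularization strength `C`.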

In [18]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

tfidf = TfidfVectorizer(strip_accents=None, lowercase=False,
                        preprocessor=None, tokenizer=tokenizer, token_pattern=None)

param_grid = [
    {
        'tfidf__ngram_range': [(1, 1), (2, 2)],
        'log_reg__penalty': ['l2'],
        'log_reg__C':[1.0, 10.0]
    }
]

lr_tfidf = Pipeline([
    ('tfidf', tfidf),
    ('log_reg', LogisticRegression(solver='liblinear'))
])

gs_lr_tfidf = GridSearchCV(
    estimator=lr_tfidf,
    param_grid=param_grid,
    scoring='accuracy',
    cv=5,
    n_jobs=-1,
    verbose=2,
    error_score='raise'
)
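The two `ngram_range` values in the grid differ in what counts as a feature: `(1, 1)` uses single tokens, `(2, 2)` uses only pairs of adjacent tokens. The helper below is a plain-Python illustration of that idea, not scikit-learn's implementation; `word_ngrams` is a hypothetical name.

```python
def word_ngrams(tokens, n):
    """Return the space-joined n-grams of a token list, in order."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = 'trump said on friday'.split(' ')

unigrams = word_ngrams(tokens, 1)  # corresponds to ngram_range=(1, 1)
bigrams = word_ngrams(tokens, 2)   # corresponds to ngram_range=(2, 2)

print(unigrams)
print(bigrams)
```

Bigram vocabularies are typically much larger than unigram ones, which is consistent with the longer fit times reported for the `(2, 2)` candidates.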

In [19]:
%%time
gs_lr_tfidf.fit(x_train, y_train)

Fitting 5 folds for each of 4 candidates, totalling 20 fits
[CV] END log_reg__C=1.0, log_reg__penalty=l2, tfidf__ngram_range=(2, 2); total time= 1.8min
[CV] END log_reg__C=10.0, log_reg__penalty=l2, tfidf__ngram_range=(2, 2); total time= 1.1min
CPU times: user 45.1 s, sys: 4.41 s, total: 49.6 s
Wall time: 3min 13s
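The "4 candidates, 20 fits" line in the log follows directly from the grid: every combination of parameter values is one candidate, and each candidate is fitted once per fold. A small sketch that reproduces the arithmetic with the grid values copied from the cell above:

```python
from itertools import product

# Grid values copied from the param_grid defined earlier.
param_grid = {
    'tfidf__ngram_range': [(1, 1), (2, 2)],
    'log_reg__penalty': ['l2'],
    'log_reg__C': [1.0, 10.0],
}
cv = 5  # number of cross-validation folds

# Every combination of values is one candidate configuration.
candidates = [dict(zip(param_grid, values))
              for values in product(*param_grid.values())]
total_fits = len(candidates) * cv

print(len(candidates), total_fits)
```

With the default `refit=True`, `GridSearchCV` additionally refits the best configuration on the whole training set, which is what `best_estimator_` refers to.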


In [20]:
print(f'Best parameters: {gs_lr_tfidf.best_params_}')
print(f'Best score: {gs_lr_tfidf.best_score_:.3f}')

clf = gs_lr_tfidf.best_estimator_
print(f'Test accuracy: {clf.score(x_test, y_test):.3f}')

Best parameters: {'log_reg__C': 10.0, 'log_reg__penalty': 'l2', 'tfidf__ngram_range': (1, 1)}
Best score: 0.995
Test accuracy: 0.996


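Find the words most characteristic of fake and real news. Because real articles are labeled 1, large positive coefficients point toward real news and large negative ones toward fake news.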

In [21]:
feature_names = clf.named_steps['tfidf'].get_feature_names_out()  # n-grams in the fitted vocabulary
coefficients = clf.named_steps['log_reg'].coef_.flatten()  # logistic regression coefficients
sorted_indices = np.argsort(coefficients)  # indices in ascending order of coefficient

top_n = 20

print(f'Top {top_n} important n-grams for TRUE news:\n')

for i in sorted_indices[-top_n:]:
    print(f'{feature_names[i]}: {coefficients[i]:.2f}')

print(f'\n\nTop {top_n} important n-grams for FAKE news:\n')

for i in sorted_indices[:top_n]:
    print(f'{feature_names[i]}: {coefficients[i]:.2f}')

Top 20 important n-grams for TRUE news:

comment: 5.38
nov: 5.46
est: 5.60
dont: 5.60
reporters: 5.67
friday: 5.70
obamas: 5.71
saying: 5.75
its: 6.04
thursday: 6.26
tuesday: 6.60
edt: 6.63
in: 6.93
wednesday: 7.32
washington: 7.35
minister: 7.66
on: 13.26
trumps: 14.11
said: 29.85
reuters: 47.27


Top 20 important n-grams for FAKE news:

s: -21.05
t: -15.78
via: -14.23
this: -9.36
obama: -9.03
re: -7.35
america: -7.15
just: -7.05
is: -6.83
image: -6.80
mr: -6.71
american: -6.63
that: -6.56
gop: -6.45
hillary: -5.91
these: -5.88
daily: -5.86
wire: -5.76
21st: -5.57
ve: -5.34
