# 03) Improve label quality

A key problem with the headlines data is how it was collected: online news sources were scraped, the headlines were keyword matched, and the keyword matches were sent to Google Gemini for labelling. Everything not keyword matched is assumed to be a non-risk headline, so any risk headline that the keyword matching missed is a false negative, and there are likely many of them. To tackle this problem, a logistic regression model for each language (Spanish and Portuguese) is trained on half of the data at a time to generate predictions for the other half's non-risk headlines. Only a percentage of the lowest-probability non-risk headlines from each half is kept, eliminating many false negatives and tackling the class imbalance problem (see notebook 1).

## Read-in data

Separate dataframes are created for each language (Spanish & Portuguese).

In [1]:
import pandas as pd
import numpy as np

# read-in data
df = pd.read_csv('../Data/original_headlines.csv', encoding='utf-8')
print(str(round(len(df)/1000, 1)) + 'K Total headlines')

# include only spanish 
spanish_df = df[df.country.isin(['Argentina', 'Colombia', 'Mexico'])].reset_index(drop=True)
print(str(round(len(spanish_df)/1000, 1)) + 'K Spanish headlines')

# include only portuguese 
portuguese_df = df[df.country == 'Brazil'].reset_index(drop=True)
print(str(round(len(portuguese_df)/1000, 1)) + 'K Portuguese headlines')

85.8K Total headlines
69.3K Spanish headlines
13.9K Portuguese headlines


## Remove duplicates & thumbnails

Headlines containing the word "thumbnail" are normally videos, which cannot be scraped. Many of them share a similar format, so in the non-risk headlines data they add little variety and are therefore removed.

In [2]:
# remove duplicates
spanish_df.drop_duplicates(subset='headline', inplace=True)
portuguese_df.drop_duplicates(subset='headline', inplace=True)

# remove thumbnails
spanish_df = spanish_df[~spanish_df['headline'].str.lower().str.contains('thumbnail', na=False)]
portuguese_df = portuguese_df[~portuguese_df['headline'].str.lower().str.contains('thumbnail', na=False)]

# removes english headlines from the spanish dataset
spanish_df = spanish_df.loc[~spanish_df.website.isin(['Colombia Reports'])]

# reset index
spanish_df.reset_index(drop=True, inplace=True)
portuguese_df.reset_index(drop=True, inplace=True)

## Clean text

The text is lowercased and stripped of punctuation and stop words, which reduces the dimensionality of the vocabulary used later on.

In [3]:
import string
from nltk.corpus import stopwords
import nltk

nltk.download('stopwords')
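# NLTK stop-word file ids are lowercase; capitalised names only resolve on case-insensitive file systems
spanish_stop_words = set(stopwords.words('spanish'))
portuguese_stop_words = set(stopwords.words('portuguese'))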

# common text cleaning techniques
def clean_text(text, language):
    text = text.strip()
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation + '¡¿'))

    if language=='Spanish':
        text = ' '.join([word for word in text.split() if word not in spanish_stop_words])
    elif language=='Portuguese':
        text = ' '.join([word for word in text.split() if word not in portuguese_stop_words])
    
    return text

spanish_df['headline'] = [clean_text(x, 'Spanish') for x in spanish_df['headline']]
portuguese_df['headline'] = [clean_text(x, 'Portuguese') for x in portuguese_df['headline']]

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Jack-\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
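
To make the effect of the cleaning steps concrete, here is a minimal sketch that applies the same three operations (lowercasing, punctuation removal, stop-word removal) to a single made-up headline. The function name `clean_headline` and the tiny `toy_stop_words` set are assumptions for illustration only; the notebook itself uses NLTK's full stop-word lists.

```python
import string

# Hypothetical, tiny stop-word set used only for this illustration;
# the notebook relies on NLTK's full Spanish list instead.
toy_stop_words = {"el", "la", "de", "en", "y", "a", "por"}

def clean_headline(text, stop_words):
    # strip surrounding whitespace and lowercase
    text = text.strip().lower()
    # remove ASCII punctuation plus the Spanish inverted marks
    text = text.translate(str.maketrans('', '', string.punctuation + '¡¿'))
    # drop stop words
    return ' '.join(word for word in text.split() if word not in stop_words)

# made-up raw headline
raw = "  ¡Incendios arrasan 1.200 hectáreas de bosques en la Argentina!  "
cleaned = clean_headline(raw, toy_stop_words)
print(cleaned)
```

Note that `string.punctuation` only covers ASCII characters, so typographic marks such as curly quotes (“ ”) or the ellipsis character (…) pass through this cleaning unchanged.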


## Put aside data

A random percentage of the data from each dataframe is put aside. It is useful to do this now, before any headlines are dropped; otherwise we may end up evaluating on an artificially easy dataset in which most of the more difficult edge cases for non-risk headlines have been removed.

In [4]:
import random 

# returns the main dataframe and a random sample to be put aside for evaluation
# (call random.seed(...) beforehand if the split needs to be reproducible)
def put_aside_random_percent(df, percent):
    indices = list(df.index)
    divisor = int(100 / percent)
    sample_size = int(np.floor(len(indices) / divisor))
    random_sample = random.sample(indices, sample_size)
    # the sample holds index labels, so select with .loc rather than .iloc
    put_aside_headlines = df.loc[random_sample, :].reset_index(drop=True)
    df = df.loc[~df.index.isin(random_sample)].reset_index(drop=True)
    return df, put_aside_headlines

spanish_df_post_sample, spanish_put_aside_df = put_aside_random_percent(spanish_df, 10)
portuguese_df_post_sample, portuguese_put_aside_df = put_aside_random_percent(portuguese_df, 10)
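
Because the put-aside sample is drawn at random, the evaluation numbers further down will vary from run to run. One possible way to make the hold-out reproducible is sketched below with pandas' `DataFrame.sample` and a fixed `random_state`. The function name `put_aside_fraction`, the seed value, and the toy dataframe are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

def put_aside_fraction(df, frac, seed=42):
    # draw a reproducible random sample of rows to hold out
    held_out = df.sample(frac=frac, random_state=seed)
    # everything that was not sampled stays in the main dataframe
    remaining = df.drop(index=held_out.index)
    return remaining.reset_index(drop=True), held_out.reset_index(drop=True)

# toy data standing in for the headlines dataframe
toy_df = pd.DataFrame({"headline": [f"headline {i}" for i in range(20)]})
remaining_df, held_out_df = put_aside_fraction(toy_df, 0.1)
print(len(remaining_df), len(held_out_df))
```

One difference from the function above: `DataFrame.sample` rounds the sample size, whereas `put_aside_random_percent` floors it, so the two can differ by one row.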

## Split dataframe

Each dataframe is randomly split into two sets so a model can be trained on each set and used to predict headlines for the other.

In [5]:
# randomly split a dataframe into 2 equal size groups 
def split_dataframes(df):
    population = list(range(len(df)))
    half_headlines = int(np.floor(len(population) / 2))
    random_samples = random.sample(population, half_headlines)
    return df.loc[random_samples,:].reset_index(drop=True), df.loc[~df.index.isin(random_samples), :].reset_index(drop=True)

## Fit model

A model is fit using TF-IDF vectors and logistic regression. Logistic regression is used because it outputs class probabilities, so the classification threshold can be easily varied.

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# vectorizes data and fits a model 
def fit_model(df):
    X, y = df.headline, [int(pd.notna(x)) for x in df.risk_type]
    vectorizer = TfidfVectorizer()
    train_tfidf = vectorizer.fit_transform(X)
    model = LogisticRegression()
    model.fit(train_tfidf, y)
    return vectorizer, model

## Generate predictions

A model trained on one half of each dataset generates predictions for the other half, and those predictions are added to that half's dataframe.

In [7]:
# returns predictions as binary decisions and probabilities
def predict_headlines(df, vectorizer, model):
    tfidf_vectors = vectorizer.transform(df.headline)
    y_preds = model.predict(tfidf_vectors)
    # probability of the risk class (column 1) for every headline in one vectorized call
    y_pred_prob = model.predict_proba(tfidf_vectors)[:, 1]
    return y_preds, y_pred_prob

# adds the predictions for each half of the data to their respective dfs
def add_predictions_to_df(primary_df, secondary_df):
    vectorizer, model = fit_model(secondary_df)
    y_preds, y_pred_prob = predict_headlines(primary_df, vectorizer, model)
    primary_df['y_pred'], primary_df['y_pred_prob'] = y_preds, y_pred_prob
    primary_df.sort_values('y_pred_prob', ascending=False, inplace=True)
    return primary_df

# returns a headlines dataframe along with their predictions for viewing
# this is useful so we can set appropriate upper and lower limits for the 
# slice of low probability non-risk headlines we will select later on
def view_headline_preds(df, language):
    df_1, df_2 = split_dataframes(df) 
    df_1, df_2 = add_predictions_to_df(df_1, df_2), add_predictions_to_df(df_2, df_1)
    #df_1, df_2 =  drop_non_risk_headlines(df_1, language), drop_non_risk_headlines(df_2, language)
    return df_1, df_2

spanish_df_1, spanish_df_2 = view_headline_preds(spanish_df_post_sample, 'Spanish')
portuguese_df_1, portuguese_df_2 = view_headline_preds(portuguese_df_post_sample, 'Portuguese')
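
The split, fit, and predict steps above amount to two-fold out-of-fold prediction: every headline is scored by a model that never saw it during training. As a point of comparison only, the sketch below produces the same kind of scores with scikit-learn's `cross_val_predict`. It is not the notebook's implementation: the toy headlines and labels are made up, and the folds here are stratified rather than plain random halves.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import make_pipeline

# made-up cleaned headlines; 1 = risk, 0 = non-risk
headlines = [
    "protestas bloquean carretera principal",
    "equipo local gana campeonato",
    "huelga paraliza puerto comercio exterior",
    "festival musica reune miles",
    "bloqueo carretera afecta transporte carga",
    "cantante recibe reconocimiento ciudad",
    "protestas huelga afectan puerto",
    "campeonato futbol inicia semana",
]
labels = np.array([1, 0, 1, 0, 1, 0, 1, 0])

# the vectorizer sits inside the pipeline so it is refit on each training fold
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
folds = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)

# column 1 holds the out-of-fold probability of the risk class
oof_probs = cross_val_predict(
    pipeline, headlines, labels, cv=folds, method="predict_proba"
)[:, 1]

# list headlines from the highest to the lowest out-of-fold risk probability
for prob, headline in sorted(zip(oof_probs, headlines), reverse=True):
    print(round(float(prob), 3), headline)
```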

## Find false negatives threshold

The non-risk headlines are inspected manually, ordered by predicted probability, to locate the point below which false negatives are no longer common. That point is used as the upper limit in the next step.

In [8]:
# this function shows the non-risk labelled headlines ordered
# from the lowest to the highest probability score. An ML practitioner
# can use this to find the rank (the `end` argument) beyond which
# false negatives become common; the fraction end / total is returned
def view_nonrisk_highest_rows(df, start, end):
    temp_df = df.loc[pd.isna(df.risk_type)].sort_values('y_pred_prob')
    print()
    print('Non-risk headlines: ' + str(len(temp_df)))
    print()
    selected_index_df = temp_df.iloc[start:end, :]
    for i in range(len(selected_index_df)):
        print(str(selected_index_df.index[i]) + ':   ' + selected_index_df.headline.values[i])
    percent = end / len(temp_df)
    return percent
    
spanish_percent = view_nonrisk_highest_rows(spanish_df_1, 7990, 8000)
portuguese_percent = view_nonrisk_highest_rows(portuguese_df_1, 1990, 2000)


Non-risk headlines: 23816

13088:   molestia taxis “piratas” tizimín
26832:   atención secundarios sólo tres días clases semana
14649:   infraestructura vial portuaria dolor cabeza comercio exterior
25893:   casi duplican casos hepatitis c coahuila 102 positivosunos 80 corresponden hombres 22 mujeres enero 13 julio 2024 casi duplicó cifra personas diagnosticadas hepatitis c coahuila respecto mismo periodo año pasado el…
16889:   lamenta gobernadora muerte joven paola bañuelos estatal jueves 11 julio
2384:   carlos loret mola timoratos
10364:   reconocimiento cantante chalo botero día manzanareño
1694:   regularán uso celulares escuelas buscan promover mejor concentracióna partir próximo ciclo escolar 20242025 iniciará control uso teléfonos celulares salones clase escuelas públicas educación básica durango partir próximo ciclo escolar 20242025 iniciará control uso teléfonos celulares salones clase escuelas públicas educación básica estado…
5459:   joven picado dos veces alacrán dormía 
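
The value returned by `view_nonrisk_highest_rows` is simply the chosen cutoff rank divided by the number of non-risk headlines. For the Spanish call above, the arithmetic works out as follows (both numbers are taken from the call and its printed output):

```python
n_non_risk = 23816   # "Non-risk headlines" count printed for spanish_df_1
cutoff_rank = 8000   # the `end` argument passed to view_nonrisk_highest_rows

# fraction of the lowest-scoring non-risk headlines that will be kept
keep_fraction = cutoff_rank / n_non_risk
print(round(keep_fraction, 3))  # 0.336
```

In other words, roughly the lowest-scoring third of the Spanish non-risk headlines is kept in the next step.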

## Drop headlines

All non-risk headlines scoring above the threshold found in the previous step are dropped, eliminating many false negatives and tackling the class imbalance problem (see notebook 1).

In [9]:
# drops a number of non-risk headlines based on their prediction scores
def drop_non_risk_headlines(df, percent, language):   
    non_risk_df = df.loc[pd.isna(df.risk_type)]
    risk_df = df.loc[~pd.isna(df.risk_type)]
    lower_limit, upper_limit = 0, int(np.floor(len(non_risk_df) * percent))
    low_score_non_risk = non_risk_df.iloc[(len(non_risk_df)-upper_limit):(len(non_risk_df)-lower_limit),:]
    return pd.concat([risk_df, low_score_non_risk])

# creates a filtered dataframe combining both halves of the data 
# after dropping headlines with high predictions
def create_filtered_df(df_1, df_2, percent, language):
    df_1, df_2 =  drop_non_risk_headlines(df_1, percent, language), drop_non_risk_headlines(df_2, percent, language)
    return pd.concat([df_1, df_2])

filtered_spanish_df = create_filtered_df(spanish_df_1, spanish_df_2, spanish_percent, language='Spanish')
filtered_portuguese_df = create_filtered_df(portuguese_df_1, portuguese_df_2, portuguese_percent, language='Portuguese')
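
`drop_non_risk_headlines` relies on the dataframe already being sorted by `y_pred_prob` in descending order (done in `add_predictions_to_df`), which is why it slices from the end. The sketch below shows the same selection with an explicit ascending sort, which may be easier to follow. The function name `keep_low_score_non_risk`, the toy data, and the risk type labels are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# toy data: risk_type is missing for non-risk headlines (labels are made up)
toy_df = pd.DataFrame({
    "headline": list("abcdefgh"),
    "risk_type": ["crime", None, None, None, "political", None, None, None],
    "y_pred_prob": [0.9, 0.8, 0.1, 0.4, 0.7, 0.05, 0.6, 0.2],
})

def keep_low_score_non_risk(df, keep_fraction):
    # labelled risk headlines are always kept
    risk_rows = df[df["risk_type"].notna()]
    # non-risk headlines, least risk-like first
    non_risk_rows = df[df["risk_type"].isna()].sort_values("y_pred_prob")
    # keep only the lowest-scoring fraction; the rest are likely false negatives
    n_keep = int(np.floor(len(non_risk_rows) * keep_fraction))
    return pd.concat([risk_rows, non_risk_rows.head(n_keep)])

filtered = keep_low_score_non_risk(toy_df, 0.5)
print(filtered)
```

Here the two risk headlines survive along with the three non-risk headlines that scored lowest; the three highest-scoring non-risk headlines are dropped.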

## Train test split

Creates a train test split for a given dataframe.

In [10]:
from sklearn.model_selection import train_test_split

# returns a train test split
def split_data(df, test_size=0.25):
    X = df.headline
    y = [int(pd.notna(x)) for x in df.risk_type]
    return train_test_split(X, y, test_size=test_size, stratify=y)

## Evaluate model

Prints the accuracy and classification report for a given model on a given set of headlines.

In [11]:
from sklearn.metrics import classification_report, accuracy_score

# evaluates the model's performance and prints the results
def evaluate_model(model, X_test_tfidf, y_test):
    y_pred = model.predict(X_test_tfidf)
    y_pred_prob = model.predict_proba(X_test_tfidf)[:, 1] 
    print("Accuracy:", round(accuracy_score(y_test, y_pred),3))
    print("Classification Report:\n", classification_report(y_test, y_pred))
    print()
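
Since recall on risk headlines is the metric this notebook cares about most, it may help to see how the precision and recall columns of the classification report are derived. The sketch below computes both for the risk class from raw labels; the helper name `precision_recall` and the toy labels are assumptions for illustration.

```python
def precision_recall(y_true, y_pred, positive=1):
    pairs = list(zip(y_true, y_pred))
    # true positives: risk headlines predicted as risk
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    # false positives: non-risk headlines predicted as risk
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # false negatives: risk headlines the model missed
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# toy labels: 4 risk headlines, 6 non-risk headlines
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)  # 0.6 0.75
```

A model can reach high recall with low precision by flagging many headlines as risk, which is the trade-off to keep in mind when reading the reports below.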

## Check results

Trains a model on the refined dataset and evaluates it on the put-aside data. As we can see, the recall for risk headlines (the main business objective of this project) has improved dramatically for both languages compared to that of notebook 2.

In [12]:
# evaluates a filtered dataset against new headlines
def check_results(train_df, put_aside_df, language):
    print()
    print('*** ' + language + ' ***')
    print()

    X_train, X_test, y_train, y_test = split_data(train_df, test_size=0.01)
    
    vectorizer = TfidfVectorizer()
    X_train_tfidf = vectorizer.fit_transform(X_train)
    
    model = LogisticRegression()
    model.fit(X_train_tfidf, y_train)

    print(str(round(len(put_aside_df)/1000, 2)) + 'K put aside headlines')
    
    X_test_tfidf = vectorizer.transform(put_aside_df.headline)
    y_test = [int(pd.notna(x)) for x in put_aside_df.risk_type]
    
    evaluate_model(model, X_test_tfidf, y_test)

check_results(filtered_spanish_df, spanish_put_aside_df, language='Spanish')
check_results(filtered_portuguese_df, portuguese_put_aside_df, language='Portuguese')


*** Spanish ***

6.11K put aside headlines
Accuracy: 0.831
Classification Report:
               precision    recall  f1-score   support

           0       0.98      0.82      0.89      5291
           1       0.44      0.88      0.58       818

    accuracy                           0.83      6109
   macro avg       0.71      0.85      0.74      6109
weighted avg       0.91      0.83      0.85      6109



*** Portuguese ***

1.32K put aside headlines
Accuracy: 0.786
Classification Report:
               precision    recall  f1-score   support

           0       0.95      0.77      0.85      1032
           1       0.51      0.85      0.63       291

    accuracy                           0.79      1323
   macro avg       0.73      0.81      0.74      1323
weighted avg       0.85      0.79      0.80      1323




## Repeat the process...

Given that the process above improved the model, it stands to reason that repeating it with predictions based on the newly refined dataset could improve the model even further. The results below show that the overall accuracy of both models declined (Spanish from 0.831 to 0.772, Portuguese from 0.786 to 0.701), likely because we are dropping badly needed data, and that precision on risk headlines fell as well (0.44 to 0.36 and 0.51 to 0.42). However, the most important metric from a business perspective, risk headlines recall, went up in both datasets (Spanish from 0.88 to 0.92, Portuguese from 0.85 to 0.88).

### Generate new predictions

In [13]:
# resets the indices
filtered_spanish_df.reset_index(drop=True, inplace=True)
filtered_portuguese_df.reset_index(drop=True, inplace=True)

# creates additional dataframes with new predictions based on a model trained on the newly filtered dataframe
spanish_df_3, spanish_df_4 = view_headline_preds(filtered_spanish_df, 'Spanish')
portuguese_df_3, portuguese_df_4 = view_headline_preds(filtered_portuguese_df, 'Portuguese')

### Find new false negatives threshold

In [14]:
# reuses view_nonrisk_highest_rows on the re-scored data to inspect the
# non-risk headlines around the candidate cutoff and to compute the new
# fraction of low probability non-risk headlines to keep
new_spanish_percent = view_nonrisk_highest_rows(spanish_df_4, 7001, 7011)
new_portuguese_percent = view_nonrisk_highest_rows(portuguese_df_4, 1681, 1691)


Non-risk headlines: 8034

8963:   rubén solís anuncia obras fpytch arranque semestre 0403
1968:   penaltis premian fe uruguay sentencian brasil
8675:   polémica presunto maltrato ensayos sanjuanero
2447:   donald trump reúne benjamin netanyahu
7941:   acuerdo culpabilidad marca fin saga legal assange pasó cinco años cárcel británica alta seguridad
9559:   mary bianco fundadoras madres cumpliría 100 años
2090:   incendios arrasan 1200 hectáreas bosques argentina
2139:   contenedores 6 características contenedores basura comunitarios
8481:   regidores tizimín irían cárcel desacato
2306:   rodolfo hernández internado uci esposa dio detalles batalla cáncer

Non-risk headlines: 2000

939:   bolsonarista ameaçou colapsar sistema 8 janeiro virá ré processo stf 1h agência estado
781:   alexandre moraes determina facebook cancele conta anderson torres hackeada
762:   professor agredido aluno 17 anos mataleão dentro sala aula paraná
884:   faz home office direito trabalhar onde quiser inclusive

### Drop headlines and evaluate new results

In [15]:
# create new filtered dfs with fewer non-risk headlines
new_filtered_spanish_df = create_filtered_df(spanish_df_3, spanish_df_4, new_spanish_percent, language='Spanish')
new_filtered_portuguese_df = create_filtered_df(portuguese_df_3, portuguese_df_4, new_portuguese_percent, language='Portuguese')

# evaluates the results
check_results(new_filtered_spanish_df, spanish_put_aside_df, language='Spanish')
check_results(new_filtered_portuguese_df, portuguese_put_aside_df, language='Portuguese')


*** Spanish ***

6.11K put aside headlines
Accuracy: 0.772
Classification Report:
               precision    recall  f1-score   support

           0       0.98      0.75      0.85      5291
           1       0.36      0.92      0.52       818

    accuracy                           0.77      6109
   macro avg       0.67      0.84      0.69      6109
weighted avg       0.90      0.77      0.81      6109



*** Portuguese ***

1.32K put aside headlines
Accuracy: 0.701
Classification Report:
               precision    recall  f1-score   support

           0       0.95      0.65      0.77      1032
           1       0.42      0.88      0.56       291

    accuracy                           0.70      1323
   macro avg       0.68      0.77      0.67      1323
weighted avg       0.83      0.70      0.73      1323




## Save dataframes

Finally, the refined dataframes are saved as CSV files for further use in additional notebooks.

In [16]:
new_filtered_spanish_df.to_csv('../Data/spanish_df.csv', index=False)
new_filtered_portuguese_df.to_csv('../Data/portuguese_df.csv', index=False)