# Machine Learning Pipeline for Topic Modelling

The dataset provided here was scraped from various RSS feeds between 06-2022 and 09-2023 as the basis for a Data Science and Machine Learning project. The project focuses on exploratory data analysis, gaining insights from the data, topic modelling, and learning basic techniques.

The dataset is stored in CSV text files as well as in a PostgreSQL database.
It consists of the following columns:
- id:
- date:
- title:
- description:
- author:
- category:
- copyright:
- url:
- text:
- source:


This pipeline loads the data from a PostgreSQL database, performs feature engineering, and builds an ML model that clusters the news into topics (unsupervised learning) and compares the resulting clusters with the labeled categories.

## Imports

In [1]:
# data manipulation and plotting
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# loading data from postgresql database 
import sqlalchemy as sql

from datetime import datetime

# saving the pipeline
import joblib

# from scikit-learn
from sklearn.model_selection import train_test_split

# from feature-engine
from feature_engine.imputation import CategoricalImputer, AddMissingIndicator
from feature_engine.encoding import RareLabelEncoder
from feature_engine.selection import DropFeatures

# from preprocessors
import preprocessors as pp

## Load the data from the database

The dataset contains entries from June 2022 onwards.
The model is trained and tested on data from 01.06.2022 to 30.09.2023. Data from 01.10.2023 onwards is treated as new data and used only for prediction.

In [2]:
# connect to db
engine = sql.create_engine('postgresql+psycopg2://news:news@localhost:5432/news')
con = engine.connect()

start_date = datetime(2022, 6, 1, 0, 0, 0)
end_date = datetime(2023, 9, 30, 23, 59, 59)

with con:
    
    # query data for model training and testing
    query = sql.text("""
        SELECT *
        FROM headlines
        WHERE (date >= :start_date
        AND date <= :end_date)
        ORDER BY date ASC
        """)
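    # bind parameters are passed as a dict (keyword arguments are not accepted by SQLAlchemy 2.x)
    result = con.execute(query, {"start_date": start_date, "end_date": end_date})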
    train_test = pd.DataFrame(result.fetchall(), columns=result.keys())

    # query data for prediction
    query = sql.text("""
        SELECT *
        FROM headlines
        WHERE (date > :end_date)
        ORDER BY date ASC
        """)
    # bind parameters are passed as a dict (keyword arguments are not accepted by SQLAlchemy 2.x)
    result = con.execute(query, {"end_date": end_date})
    pred = pd.DataFrame(result.fetchall(), columns=result.keys())
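
The date boundaries used in the two queries can also be illustrated without a database. The following is a minimal sketch, assuming a small hypothetical `toy` DataFrame (not the real `headlines` table) and pandas only. It mirrors the inclusive range used for training and testing and the strict `>` comparison used for prediction data.

```python
from datetime import datetime

import pandas as pd

# Same boundaries as in the queries above
start_date = datetime(2022, 6, 1, 0, 0, 0)
end_date = datetime(2023, 9, 30, 23, 59, 59)

# Hypothetical toy data standing in for the `headlines` table
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "date": pd.to_datetime([
        "2022-05-31 23:59:59",  # before the window: ends up in neither set
        "2022-06-01 00:00:00",  # first second of the window
        "2023-09-30 23:59:59",  # last second of the window
        "2023-10-01 00:00:00",  # new data, prediction only
    ]),
})

# between() is inclusive on both ends, like `>= :start_date AND <= :end_date`
in_window = toy["date"].between(start_date, end_date)
toy_train_test = toy[in_window].sort_values("date")
toy_pred = toy[toy["date"] > end_date].sort_values("date")

print(toy_train_test["id"].tolist())
print(toy_pred["id"].tolist())
```

Rows dated before `start_date` fall into neither frame, which matches the behaviour of the SQL filters.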


In [3]:
train_test.head()

| | id | date | title | description | author | category | copyright | url | text | source |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 71650 | 2022-06-01 00:13:42 | Preise: Grüne halten Senkung der Spritsteuer f... | Heute tritt die Steuersenkung auf Kraftstoffe ... | | Steuersenkung, Bundestag, Katharina Dröge, Spr... | | https://www.stern.de/politik/deutschland/preis... | | stern |
| 1 | 71649 | 2022-06-01 01:55:03 | Biden warnt Putin: USA liefern moderne Raketen... | Die USA rüsten die Ukraine mit fortschrittlich... | | Ukraine, USA, Joe Biden, Russland, Raketensyst... | | https://www.stern.de/politik/ausland/biden-war... | | stern |
| 2 | 71648 | 2022-06-01 02:04:08 | Soziale Medien: FDP-Politiker Kuhle: Internet-... | Eine «ZDF Magazin Royale»-Recherche beschäftig... | | Konstantin Kuhle, FDP, Straftat, Berlin, ZDF, ... | | https://www.stern.de/politik/deutschland/sozia... | | stern |
| 3 | 71675 | 2022-06-01 02:26:58 | Liveblog: ++ Zwei von drei ukrainischen Kinder... | Rund zwei von drei Mädchen und Jungen in der U... | | | | https://www.tagesschau.de/newsticker/liveblog-... | | Tagesschau |
| 4 | 71647 | 2022-06-01 02:31:43 | Finanzen: Dänemark stimmt über EU-Verteidigung... | Vorbehalt verteidigen oder Verteidigung ohne V... | | Dänemark, EU, Volksabstimmung, Finanzen, Ukrai... | | https://www.stern.de/politik/ausland/finanzen-... | | stern |


In [4]:
print(train_test.shape)

(75461, 10)


In [5]:
pred.head()

| | id | date | title | description | author | category | copyright | url | text | source |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 85639 | 2023-10-01 09:03:00 | Frauen für den Frieden | Jolina und Louisa setzen sich in Nordirland fü... | | 37 Grad Leben | | https://www.zdf.de/dokumentation/37-grad-leben... | | ZDF heute |
| 1 | 85434 | 2023-10-01 09:18:18 | PKK hatte sich bekannt - Türkei greift nach An... | Die türkische Hauptstadt Ankara ist am Sonntag... | | Ausland | | https://www.focus.de/politik/ausland/tuerkisch... | | Focus |
| 2 | 85435 | 2023-10-01 11:25:35 | Gastbeitrag von Gabor Steingart - Unbequeme Pu... | Für 2023 erwartet Russland ein Wirtschaftswach... | | Ausland | | https://www.focus.de/politik/ausland/gastbeitr... | | Focus |
| 3 | 85601 | 2023-10-01 12:06:00 | Was ist dran an Söders Berlin-Bashing? | "Wir sind solidarisch, aber nicht naiv", sagt ... | | Politik | | https://www.zdf.de/nachrichten/politik/laender... | | ZDF heute |
| 4 | 85651 | 2023-10-01 14:11:00 | Trübe Wirtschaftslage - Lotto-Boom in China | Chinas Wirtschaft kämpft mit einem geringeren ... | | Hohe Jugendarbeitslosigkeit | | https://www.zdf.de/nachrichten/wirtschaft/chin... | | ZDF heute |


In [6]:
print(pred.shape)

(262, 10)


## Save raw data for train_test and pred to csv

In [7]:
train_test.to_csv('../data/00_train_test_raw.csv')
pred.to_csv('../data/00_pred_raw.csv')
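
By default, `to_csv` writes the DataFrame index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back without further arguments. Below is a minimal sketch of one way to reload such a file. It assumes a hypothetical two-row frame and an in-memory buffer instead of the real files under `../data/`.

```python
import io

import pandas as pd

# Hypothetical stand-in for the raw data
raw = pd.DataFrame(
    {
        "id": [71650, 71649],
        "date": pd.to_datetime(["2022-06-01 00:13:42", "2022-06-01 01:55:03"]),
        "source": ["stern", "stern"],
    }
)

buffer = io.StringIO()
raw.to_csv(buffer)  # same call style as above: the index is written as well
buffer.seek(0)

# index_col=0 restores the index, parse_dates restores the datetime dtype
reloaded = pd.read_csv(buffer, index_col=0, parse_dates=["date"])
print(reloaded.dtypes)
```

Without `parse_dates`, the `date` column would come back as plain strings, since CSV files carry no type information.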

## Feature Engineering on train_test

### Split the data into train and test set

In [8]:
train, test = train_test_split(train_test, test_size=0.2, random_state=42)

In [9]:
train.head()

| | id | date | title | description | author | category | copyright | url | text | source |
|---|---|---|---|---|---|---|---|---|---|---|
| 69435 | 79401 | 2023-08-21 12:13:29 | F-16 für Ukraine - Lawrow warnt Westen vor „in... | Während der Westen plant, F-16-Kampfflugzeuge ... | | Ukraine-Krise | | https://www.focus.de/politik/ausland/ukraine-k... | | Focus |
| 40748 | 35571 | 2023-01-21 10:11:00 | Nur Wellinger überzeugt, Kraft siegt | Nachrichtlich texten | | Sport \| Wintersport | | https://www.zdf.de/sport/wintersport/skispring... | | ZDF heute |
| 57437 | 52069 | 2023-04-29 20:10:21 | Angela Merkel: Angela Merkel verteidigt ihre R... | Trotz des Ukraine-Kriegs hält die Altkanzlerin... | | Deutschland | | https://www.zeit.de/politik/deutschland/2023-0... | | Zeit |
| 7205 | 2350 | 2022-06-27 06:51:59 | G7-Gipfel in Elmau im Newsticker - G7-Gegner d... | Der 48. G7-Gipfel findet vom 26. bis 28. Juni ... | | Politik | | https://www.focus.de/politik/g7-gipfel-in-elma... | | Focus |
| 4879 | 76333 | 2022-06-20 05:25:57 | Russische Invasion: Krieg in der Ukraine: So i... | | | News | | https://www.zeit.de/news/2022-06/20/krieg-in-d... | | Zeit |


In [10]:
print(train.shape)

(60368, 10)


In [11]:
print(train.info())

<class 'pandas.core.frame.DataFrame'>
Index: 60368 entries, 69435 to 15795
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   id           60368 non-null  int64         
 1   date         60368 non-null  datetime64[ns]
 2   title        60231 non-null  object        
 3   description  57221 non-null  object        
 4   author       1185 non-null   object        
 5   category     39009 non-null  object        
 6   copyright    0 non-null      object        
 7   url          60235 non-null  object        
 8   text         0 non-null      object        
 9   source       60368 non-null  object        
dtypes: datetime64[ns](1), int64(1), object(8)
memory usage: 5.1+ MB
None


In [12]:
test.head()

| | id | date | title | description | author | category | copyright | url | text | source |
|---|---|---|---|---|---|---|---|---|---|---|
| 29368 | 24143 | 2022-10-09 19:49:00 | SPD-Kandidat gewinnt Oberbürgermeisterwahl in ... | In Cottbus hat SPD-Kandidat Tobias Schick die ... | | Deutschland | | https://www.welt.de/politik/deutschland/articl... | | Welt |
| 34216 | 28909 | 2022-11-30 03:00:34 | Als es um den Ukraine-Krieg geht, gerät Alice ... | In der Talkshow von Sandra Maischberger bekräf... | | Panorama | | https://www.welt.de/vermischtes/article2424008... | | Welt |
| 75397 | 85305 | 2023-09-29 05:43:55 | ARD-Deutschlandtrend: Zwei Drittel der Deutsch... | Weniger als jeder Fünfte ist laut einer Umfrag... | | Deutschland | | https://www.zeit.de/politik/deutschland/2023-0... | | Zeit |
| 46409 | 41081 | 2023-02-24 18:25:44 | Energiemonitor: Die wichtigsten Daten zur Ener... | Ukraine-Krieg und Klimakrise: Deutschland muss... | | Wirtschaft | | https://www.zeit.de/wirtschaft/energiemonitor-... | | Zeit |
| 16931 | 11693 | 2022-07-28 10:18:36 | Öffentliche Schulden Ende 2021 auf Höchststand | Die öffentliche Verschuldung ist Ende 2021 auf... | | | | https://www.tagesschau.de/wirtschaft/konjunktu... | | Tagesschau |


In [13]:
print(test.shape)

(15093, 10)
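
The row counts reported for `train` and `test` follow from how `train_test_split` handles a fractional `test_size`: the test set size is rounded up and the remaining rows go to the training set. The sketch below reproduces the counts on placeholder data. Only the number of rows (75461) is taken from this notebook; the frame itself is a stand-in.

```python
import math

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

n_rows = 75461  # rows in train_test, as printed above
placeholder = pd.DataFrame({"id": np.arange(n_rows)})  # stand-in, not the real data

train_part, test_part = train_test_split(
    placeholder, test_size=0.2, random_state=42
)

# the test size is the ceiling of test_size * n_rows
n_test = math.ceil(0.2 * n_rows)
print(len(train_part), len(test_part), n_test)
```

Because the split shuffles rows by default, the original index values are kept but appear in random order, as seen in `train.head()` and `test.head()`.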


In [14]:
print(test.info())

<class 'pandas.core.frame.DataFrame'>
Index: 15093 entries, 29368 to 64527
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   id           15093 non-null  int64         
 1   date         15093 non-null  datetime64[ns]
 2   title        15065 non-null  object        
 3   description  14323 non-null  object        
 4   author       284 non-null    object        
 5   category     9785 non-null   object        
 6   copyright    0 non-null      object        
 7   url          15065 non-null  object        
 8   text         0 non-null      object        
 9   source       15093 non-null  object        
dtypes: datetime64[ns](1), int64(1), object(8)
memory usage: 1.3+ MB
None
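
The non-null counts above translate directly into missing-value shares, which are a common basis for deciding which columns to drop or impute. The following is a minimal sketch using the training-set counts printed by `train.info()`. The counts are typed in by hand here rather than derived from the real data.

```python
import pandas as pd

n_train = 60368  # number of rows in the training set

# Non-null counts copied from the train.info() output above
non_null = pd.Series({
    "id": 60368, "date": 60368, "title": 60231, "description": 57221,
    "author": 1185, "category": 39009, "copyright": 0, "url": 60235,
    "text": 0, "source": 60368,
})

# share of missing values per column, largest first
missing_share = (1 - non_null / n_train).sort_values(ascending=False)
print(missing_share.round(3))
```

On a real DataFrame the same figures are available via `train.isna().mean()`. In this training set, `copyright` and `text` are entirely empty, `author` is missing in about 98% of rows, and `category` in about 35%.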


### Configuration

In [15]:
# features to drop
DROP_FEATURES = ['id', 'copyright', 'author', 'url']

# variables with duplicates
VARS_WITH_DUPLICATES = ['title', 'description']

# variables with NA and frequent values in train set
VARS_WITH_NA_FREQUENT = ['category']

# variables with NA in train set that will be filled with 'Missing' value
VARS_WITH_NA_MISSING = ['source', 'title', 'description', 'text']

# variables to be combined 
VARS_TO_COMBINE = ('title_description_text', ['title', 'description', 'text'])

# features that are used for topic modelling (one model is trained per feature)
FEATURES = ['title', 'title_description_text']
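
How these constants are consumed depends on the custom `preprocessors` module, which is not shown here. As a rough illustration only, the following pandas sketch applies comparable steps to a hypothetical two-row frame: filling missing text fields with `'Missing'`, joining the columns named in `VARS_TO_COMBINE`, and dropping `DROP_FEATURES`. The helper `apply_config` and the order of the steps are assumptions made for this example, not part of the pipeline.

```python
import pandas as pd

DROP_FEATURES = ['id', 'copyright', 'author', 'url']
VARS_WITH_NA_MISSING = ['source', 'title', 'description', 'text']
VARS_TO_COMBINE = ('title_description_text', ['title', 'description', 'text'])


def apply_config(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: a pandas-only stand-in for the real pipeline steps."""
    out = df.copy()
    # fill missing values with the literal string 'Missing'
    out[VARS_WITH_NA_MISSING] = out[VARS_WITH_NA_MISSING].fillna('Missing')
    # join the source columns into one combined text feature
    new_col, parts = VARS_TO_COMBINE
    out[new_col] = out[parts].agg(' '.join, axis=1)
    # drop columns that are not used for modelling
    return out.drop(columns=DROP_FEATURES)


# Hypothetical toy rows, not taken from the dataset
toy = pd.DataFrame({
    'id': [1, 2],
    'title': ['Headline A', None],
    'description': [None, 'Teaser B'],
    'author': [None, None],
    'copyright': [None, None],
    'url': ['https://example.org/a', 'https://example.org/b'],
    'text': [None, None],
    'source': ['stern', 'Zeit'],
})

result = apply_config(toy)
print(result['title_description_text'].tolist())
```

With this ordering the placeholder string ends up inside the combined text, so a real pipeline may prefer to combine the columns before imputing or to use an empty string instead.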
