# Data Preparation & Exploration

In this notebook, I load, clean and join the relevant data files, then save the result for further analysis.

The steps that are performed in this notebook are the following:   
   
1) Data pre-processing   
    - Loading packages   
    - Reading in the data and adding variables where necessary   
    - Selecting only political online articles   
    - Joining the two dataframes   
    - Removing duplicates    
    - Transforming column types    
    - Cleaning up the text data   
    - Updating article length   
    - Removing articles that could not be scraped due to them being behind a paywall   
       
2) NLP to stem text and remove stopwords  

3) Creating dummy variables

4) Saving the clean data

5) Exploring the distribution of the data among different modalities and outlets

### 1) Data pre-processing   
   
#### a) Loading packages

In [100]:
#import pandas
import pandas as pd
from pandas import read_excel
#import numpy
import numpy as np
#load SpaCy
import spacy
#import German language model
import de_core_news_md
#define nlp pipe
nlp = de_core_news_md.load()

#### b) Reading in the data and adding variables where necessary

In [101]:
#read print data
printdata = pd.read_csv("printdata_19Apr_to_08June.csv", encoding = "ISO-8859-1")
#transform data into a dataframe for easier analysis
df_print = pd.DataFrame(printdata)
#add relevant columns for later joining
df_print["Modality"] = "print"
df_print["Teaser"] = ""
df_print["url"] = "none"
#adapt the length column by applying a function
#defining the function
def removewords(string):
    return string.split(" ")[0].strip()
#applying the function
df_print["Length"] = df_print["Length"].apply(removewords)
#rename columns for later joining
df_print = df_print.rename(columns={"Section":"Category"})
#select relevant columns
df_print = df_print[["ID", "Newspaper", "Date", "Length", "Category", "Author", "Headline", "Teaser", "Article", "Modality", "url"]] 
df_print.head(3)

Unnamed: 0,ID,Newspaper,Date,Length,Category,Author,Headline,Teaser,Article,Modality,url
0,1,Aachener Zeitung,2020-04-29,787,MEINUNG UND HINTERGRUND; S. 4,,Appelle für den Wahlrecht-Kraftakt; Den Partei...,,Von Gregor Mayntz Berlin Wenn man Abgeordnet...,print,none
1,2,Aachener Zeitung,2020-04-22,509,POLITIK; S. 2,,Petition soll Regierung Netanjahu verhindern; ...,,Von Stefanie Järkel Jerusalem Die künftige...,print,none
2,3,Aachener Zeitung,2020-04-29,598,DÜREN; S. 11,,13 Tage für Griechenland statt Inden aktiv; Di...,,"Von Guido Jansen Inden Ist es nötig, dass ...",print,none
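
The length clean-up above keeps only the first token of each value. As a minimal sketch, the same step can be written with vectorised string methods. The raw values below (e.g. "787 words") are an assumption about the format of the original column, and `raw_length` is a hypothetical name.

```python
import pandas as pd

# Hypothetical raw values; the real format of the "Length" column is an assumption.
raw_length = pd.Series(["787 words", "509 words", "598 words"])

# Keep the first space-separated token and convert it to a number.
length = pd.to_numeric(raw_length.str.split(" ").str[0].str.strip())
print(length.tolist())
```

Converting to a number at this point would also make the later `astype(int)` step unnecessary for the print data.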


In [102]:
df_print.groupby("Newspaper").count()

Newspaper,ID,Date,Length,Category,Author,Headline,Teaser,Article,Modality,url
Aachener Zeitung,991,984,991,991,0,991,991,991,991,991
Der Tagesspiegel,1298,1293,1298,1298,398,1298,1298,1298,1298,1298
Die Welt,833,833,833,833,657,833,833,833,833,833
Rheinische Post,3285,3269,3285,3285,1875,3285,3285,3285,3285,3285
Stuttgarter Zeitung,1294,1294,1294,1294,1116,1294,1294,1294,1294,1294
Süddeutsche Zeitung (inkl. Regionalausgaben),3875,3862,3875,3875,3723,3875,3875,3875,3875,3875


In [103]:
len(df_print)

11576

In [104]:
#read online data
df_online = read_excel("online_data_complete.xlsx")
#rename columns
df_online = df_online.rename(columns={"_source.title_rss": "Headline", "_source.doctype": "Newspaper", "_source.publication_date": "Date", "_source.teaser_rss": "Teaser", "_source.text": "Article", "_source.category": "Category", "_source.url":"url"})
#add columns for later joining
df_online["Length"] = 0 
df_online["Modality"] = "online"
#reset index and add unique ID column
df_online = df_online.reset_index(drop=True)
df_online["ID"] = range(100000, 100000+len(df_online))
df_online["Author"] = ""
#select relevant columns
df_online = df_online[["ID", "Newspaper", "Date", "Length", "Category", "Author", "Headline", "Teaser", "Article", "Modality", "url"]]
df_online.head(3)

Unnamed: 0,ID,Newspaper,Date,Length,Category,Author,Headline,Teaser,Article,Modality,url
0,100000,rheinische post (www),2020-05-27T22:15:00,0,Panorama,,Kritik an schleppender Digitalisierung: „Dem D...,"<img src=""https://rp-online.de/imgs/32/8/4/1/6...",erD efCh sde euDhcstne unBdsemtenab ...,online,https://rp-online.de/panorama/coronavirus/coro...
1,100001,aachener zeitung (www),2020-05-28T15:47:26,0,Sport,,DFB-Sportgericht: Kölner Bornauw nach Rot in S...,"<img src=""https://www.aachener-zeitung.de/imgs...",Der 21 Jahre alte Defensivspieler hatte in der...,online,https://www.aachener-zeitung.de/sport/fussball...
2,100002,aachener zeitung (www),2020-05-28T15:27:00,0,Sport,,Niederlage bei der TSG Hoffenheim: Wenn sich d...,"<img src=""https://www.aachener-zeitung.de/imgs...","sE gtib cahu rüf lreluabßF edeis ,agT...",online,https://www.aachener-zeitung.de/sport/fussball...


In [105]:
len(df_online)

7992

In [106]:
df_online.groupby("Newspaper").count()

Newspaper,ID,Date,Length,Category,Author,Headline,Teaser,Article,Modality,url
aachener zeitung (www),1463,1463,1463,1439,1463,1463,1463,1446,1463,1463
der tagesspiegel (www),1084,1084,1084,1084,1084,1084,1084,1074,1084,1084
die welt (www),1705,1705,1705,1660,1705,1705,1696,845,1705,1705
rheinische post (www),1528,1528,1528,1526,1528,1528,1528,1503,1528,1528
stuttgarter zeitung (www),1029,1029,1029,992,1029,1029,1029,987,1029,1029
sueddeutsche (www),53,53,53,0,53,53,53,0,53,53
sueddeutschet politik (www),1130,1130,1130,110,1130,1130,1130,1126,1130,1130


#### c) Selecting only political online articles

In [107]:
len(df_online)

7992

In [108]:
#first I remove articles that weren't parsed correctly
df_online = df_online[~df_online["Article"].str.startswith('    ', na=False)]
#next I remove Impressum articles from die Süddeutsche
df_online = df_online[~df_online["Article"].str.contains("HERAUSGEGEBEN VOM SÜDDEUTSCHEN VERLAG", na= False)]
#after that I do the same for Stuttgarter Zeitung
df_online = df_online[~df_online["Article"].str.contains("Impressum Stuttgarter Zeitung Verlagsgesellschaft", na= False)]
#next I include only political articles for the online dataset
relevant_articles = df_online[df_online['url'].astype(str).str.contains("politik")]
relevant_articles2 = df_online[df_online["Category"].astype(str).str.contains("Politik")]
df_online = pd.concat([relevant_articles, relevant_articles2])
len(df_online)

2236
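
Note that concatenating the two selections keeps an article twice when both its url and its category mark it as political. These repeats are caught later by the duplicate removal on "Article". A possible alternative, sketched here on invented toy rows, is to combine both conditions in one boolean mask so that each row is selected at most once. The resulting row count would likely differ from the one shown above.

```python
import pandas as pd

# Toy stand-in for df_online; the rows are invented for illustration.
toy = pd.DataFrame({
    "url": [
        "https://example.org/politik/a",
        "https://example.org/sport/b",
        "https://example.org/politik/c",
        "https://example.org/x/d",
    ],
    "Category": ["Politik", "Sport", None, "Politik"],
})

# True where either the url or the category marks the article as political.
is_political = (
    toy["url"].str.contains("politik", na=False)
    | toy["Category"].str.contains("Politik", na=False)
)
political = toy[is_political]
print(len(political))
```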

#### d) Joining the two dataframes

In [138]:
#join the two datasets
df = pd.concat([df_online, df_print])
len(df)

13812

In [139]:
#inspect df
df.head()

Unnamed: 0,ID,Newspaper,Date,Length,Category,Author,Headline,Teaser,Article,Modality,url
6,100006,sueddeutschet politik (www),2020-05-28T15:34:08,0,,,SZ Espresso: Nachrichten kompakt - die Übersic...,<p>Was heute wichtig war - und was Sie auf SZ....,Das Wichtigste zum Coronavirus. Berufstätige M...,online,https://www.sueddeutsche.de/politik/nachrichte...
8,100008,sueddeutschet politik (www),2020-05-28T17:01:43,0,,,Kommunalpolitik: Abgeblendet,<p>Bayreuths Stadtrat im Stream</p>,"Livestream aus dem Stadtrat, das klingt transp...",online,https://www.sueddeutsche.de/bayern/kommunalpol...
21,100021,die welt (www),2020-05-28T17:49:45,0,Politik,,"Wie Söder es schaffte, die AfD zu halbieren",Die Corona-Krise lässt die CSU erstarken und b...,,online,https://www.welt.de/politik/deutschland/plus20...
24,100024,aachener zeitung (www),2020-05-28T03:01:52,0,Politik,,Länder planen Öffnung: Streit über Schulen und...,"<img src=""https://www.aachener-zeitung.de/imgs...",Der Streit über die Wiederöffnung von Schulen ...,online,https://www.aachener-zeitung.de/politik/deutsc...
28,100028,sueddeutschet politik (www),2020-05-28T03:12:25,0,,,Mecklenburg-Vorpommern: Grundgesetz als Maßstab,"<img src=""https://media-cdn.sueddeutsche.de/im...",An diesem Donnerstag trifft sich das neue Land...,online,https://www.sueddeutsche.de/politik/barbara-bo...


#### e) Remove duplicates

In [140]:
#check the size of the dataset
len(df)

13812

In [141]:
#create a list of duplicates
duplicates = df[df.duplicated(["Article"])]
len(duplicates)

2198

In [142]:
#remove duplicates
df.drop_duplicates(subset ="Article", keep = "first", inplace = True) 
len(df)

11614

In [143]:
#check for duplicates
duplicates = df[df.duplicated(["Article"])]
len(duplicates)

0

#### f) Transform column types 

In [144]:
df.dtypes

ID            int64
Newspaper    object
Date         object
Length       object
Category     object
Author       object
Headline     object
Teaser       object
Article      object
Modality     object
url          object
dtype: object

In [145]:
df["Article"] = df["Article"].astype(str)
df["Length"] = df["Length"].astype(int)
df["Category"] = df["Category"].astype(str)
df["Author"] = df["Author"].astype(str)
df["Modality"] = df["Modality"].astype(str)
#see if it worked
df.dtypes

ID            int64
Newspaper    object
Date         object
Length        int64
Category     object
Author       object
Headline     object
Teaser       object
Article      object
Modality     object
url          object
dtype: object
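
One caveat with this conversion: depending on the pandas version, `astype(str)` can turn a missing value into the literal text "nan", which would later be counted as a one-word article. A minimal sketch of a more defensive cast, using an invented toy column, fills missing values with an empty string first.

```python
import numpy as np
import pandas as pd

# Toy column with one missing article text (invented for illustration).
articles = pd.Series(["Ein kurzer Text.", np.nan])

# Replace missing values with an empty string before casting to str.
articles_clean = articles.fillna("").astype(str)
print(articles_clean.tolist())
```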

#### g) Cleaning up the text data

In [146]:
#show the 1000 most frequent words in the df, in order to spot potential issues that require cleaning
from collections import Counter
Counter(" ".join(df["Article"]).split()).most_common(1000)

[('die', 184266),
 ('der', 179854),
 ('und', 120816),
 ('in', 96811),
 ('den', 65387),
 ('zu', 56257),
 ('das', 50980),
 ('für', 49871),
 ('von', 49312),
 ('mit', 42321),
 ('nicht', 41844),
 ('sich', 41765),
 ('im', 39368),
 ('auf', 37731),
 ('ist', 36925),
 ('es', 36292),
 ('Die', 34960),
 ('eine', 34752),
 ('des', 33530),
 ('auch', 33193),
 ('dass', 32647),
 ('dem', 30987),
 ('ein', 29845),
 ('als', 27989),
 ('an', 22839),
 ('-', 21629),
 ('sie', 20705),
 ('wie', 20130),
 ('am', 19750),
 ('bei', 18261),
 ('er', 17646),
 ('um', 17202),
 ('einer', 16893),
 ('hat', 16837),
 ('noch', 16615),
 ('aus', 16336),
 ('Das', 16122),
 ('aber', 15784),
 ('nach', 15740),
 ('einen', 15709),
 ('Der', 15069),
 ('sind', 14097),
 ('vor', 14094),
 ('einem', 13697),
 ('so', 13417),
 ('nur', 13073),
 ('oder', 12800),
 ('über', 12787),
 ('werden', 12650),
 ('wird', 12195),
 ('mehr', 11740),
 ('zum', 11662),
 ('man', 11294),
 ('wieder', 10825),
 ('wir', 10725),
 ('sei', 10655),
 ('zur', 10312),
 ('In', 10204

Result: There are several formatting errors. As a result, strings like "\xa96" are very common.

In [147]:
#import ftfy to fix formatting errors
import ftfy

#define function to clean the text with 
def fixtext(string):
    return ftfy.fix_text(string)

#apply function
df["Article"] = df["Article"].apply(fixtext)

#check if it worked 
df[df['Article'].str.contains("\xa96")]

Unnamed: 0,ID,Newspaper,Date,Length,Category,Author,Headline,Teaser,Article,Modality,url


Result: It seems to have worked. I checked this string as well as several others.   
However, there are still some other issues that I noticed while manually exploring some of the articles (not included in this notebook for simplicity). For example, articles from "Der Tagesspiegel" often include a graphic and its title at the end. Issues like this are addressed below.

In [148]:
#remove everything after and including "Graphic" -> removes pictures and their headings from Tagesspiegel articles
def removegraphic(string):
    return string.split("Graphic")[0].strip()
#explore how prevalent this issue is
print(len(df[df['Article'].str.contains("Graphic")]))

df["Article"] = df["Article"].apply(removegraphic)

#see if the code worked
print(len(df[df['Article'].str.contains("Graphic")]))

3284
0


There are also some more formatting errors that ftfy did not capture. They are addressed in the following.

In [149]:
#removes the \xa0 string from the text
def reformat(string):
    return string.replace("\xa0", " ")

#see how prevalent this issue is
print(len(df[df['Article'].str.contains("\xa0")]))

#apply the function
df["Article"] = df["Article"].apply(reformat)

#see if it worked
print(len(df[df['Article'].str.contains("\xa0")]))

56
0


In [150]:
#replaces the \xad string (soft hyphen) with a space
def reformat2(string):
    return string.replace("\xad", " ")

#see how prevalent this issue is
print(len(df[df['Article'].str.contains("\xad")]))

#apply the function
df["Article"] = df["Article"].apply(reformat2)

#see if it worked
print(len(df[df['Article'].str.contains("\xad")]))

408
0


In [151]:
#replaces all instances of the \' string (apostrophe) with a space. This string denotes the German genitive ending
#This ending is not important for any of the following analyses and might distort readability metrics
def reformat3(string):
    return string.replace("\'", " ")

#see how prevalent this issue is
print(len(df[df['Article'].str.contains("\'")]))

#apply the function
df["Article"] = df["Article"].apply(reformat3)

#see if it worked
print(len(df[df['Article'].str.contains("\'")]))

3960
0


In [152]:
#replaces three spaces with just one
def reformat4(string):
    return string.replace("   ", " ")

#see how prevalent this issue is
print(len(df[df['Article'].str.contains("   ")]))

#apply the function
df["Article"] = df["Article"].apply(reformat4)

#see if it worked
print(len(df[df['Article'].str.contains("   ")]))

3265
11
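
The 11 remaining rows presumably contain longer runs of spaces: a single pass that replaces three spaces with one still leaves three spaces behind when a run has seven or more. A minimal sketch of a regex-based alternative that collapses a run of any length in one pass (`collapse_spaces` is a hypothetical helper, not part of the notebook):

```python
import re

def collapse_spaces(text):
    # Replace every run of two or more spaces with a single space.
    return re.sub(r" {2,}", " ", text)

print(collapse_spaces("a   b       c"))
```

It could be applied in the same way as the functions above, via `df["Article"].apply(collapse_spaces)`.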


In [153]:
#removes "dpa", which is often named as a source at the end of an article
def reformat5(string):
    return string.replace("dpa", "")

#see how prevalent this issue is
print(len(df[df['Article'].str.contains("dpa")]))

#apply the function
df["Article"] = df["Article"].apply(reformat5)

#see if it worked
print(len(df[df['Article'].str.contains("dpa")]))

969
0
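
A plain substring replacement also removes the letters "dpa" when they occur inside a longer word. If that is a concern, a word-boundary pattern restricts the removal to the standalone abbreviation. This is only a sketch: `remove_dpa` is a hypothetical helper and the second test string is made up.

```python
import re

def remove_dpa(text):
    # Remove "dpa" only where it stands alone as a word.
    return re.sub(r"\bdpa\b", "", text)

print(remove_dpa("Berlin (dpa)"))
print(remove_dpa("xdpay"))  # made-up string: left untouched
```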


In [154]:
#replaces fullstops with a fullstop followed by a space in order to avoid confounding the readability measure
def reformat6(string):
    return string.replace(".", ". ")

#apply the function
df["Article"] = df["Article"].apply(reformat6)

#replace double spaces with single spaces - this way this operation only affected the cases where a fullstop was not followed by a space
def reformat7(string):
    return string.replace("  ", " ")

#apply the function
df["Article"] = df["Article"].apply(reformat7)
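
The two-step approach above also collapses double spaces that have nothing to do with full stops. As a sketch of an alternative, a regular expression with a lookahead can insert the space only where a full stop is directly followed by a non-space character (`space_after_fullstop` is a hypothetical helper). Like the original, it would also split decimals and abbreviations such as "3.5".

```python
import re

def space_after_fullstop(text):
    # Add a space after a full stop only when a non-space character follows it.
    return re.sub(r"\.(?=\S)", ". ", text)

print(space_after_fullstop("Ende.Anfang. Weiter."))
```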

In [155]:
df.head(5)

Unnamed: 0,ID,Newspaper,Date,Length,Category,Author,Headline,Teaser,Article,Modality,url
6,100006,sueddeutschet politik (www),2020-05-28T15:34:08,0,,,SZ Espresso: Nachrichten kompakt - die Übersic...,<p>Was heute wichtig war - und was Sie auf SZ....,Das Wichtigste zum Coronavirus. Berufstätige M...,online,https://www.sueddeutsche.de/politik/nachrichte...
8,100008,sueddeutschet politik (www),2020-05-28T17:01:43,0,,,Kommunalpolitik: Abgeblendet,<p>Bayreuths Stadtrat im Stream</p>,"Livestream aus dem Stadtrat, das klingt transp...",online,https://www.sueddeutsche.de/bayern/kommunalpol...
21,100021,die welt (www),2020-05-28T17:49:45,0,Politik,,"Wie Söder es schaffte, die AfD zu halbieren",Die Corona-Krise lässt die CSU erstarken und b...,,online,https://www.welt.de/politik/deutschland/plus20...
24,100024,aachener zeitung (www),2020-05-28T03:01:52,0,Politik,,Länder planen Öffnung: Streit über Schulen und...,"<img src=""https://www.aachener-zeitung.de/imgs...",Der Streit über die Wiederöffnung von Schulen ...,online,https://www.aachener-zeitung.de/politik/deutsc...
28,100028,sueddeutschet politik (www),2020-05-28T03:12:25,0,,,Mecklenburg-Vorpommern: Grundgesetz als Maßstab,"<img src=""https://media-cdn.sueddeutsche.de/im...",An diesem Donnerstag trifft sich das neue Land...,online,https://www.sueddeutsche.de/politik/barbara-bo...


In [156]:
df["Article"][6]

6    Das Wichtigste zum Coronavirus. Berufstätige M...
6    Wenn jeder nur an sich denkt, dann ist ja an a...
Name: Article, dtype: object
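
The lookup returns two rows because the joined dataframe keeps the original index labels of both sources, so label 6 exists once in the online data and once in the print data. A minimal sketch, using invented toy frames, of how `ignore_index=True` in `pd.concat` yields a unique index instead:

```python
import pandas as pd

# Two toy frames that share the index label 6 (invented for illustration).
online = pd.DataFrame({"Article": ["online text"]}, index=[6])
printed = pd.DataFrame({"Article": ["print text"]}, index=[6])

# Plain concatenation keeps both labels, so .loc[[6]] matches two rows.
combined = pd.concat([online, printed])
print(len(combined.loc[[6]]))

# ignore_index=True renumbers the rows from 0, giving unique labels.
combined_unique = pd.concat([online, printed], ignore_index=True)
print(combined_unique.index.is_unique)
```

The "ID" column remains the safer way to refer to a single article in either case.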

#### h) Update article length

In [157]:
#function that counts the number of words in articles
def count_words(text):
    doc = nlp(text)
    tokens = [token.text for token in doc]
    tokens_no_punctuation = []
    punctuation = [",", ".", ":", "!", "?", "'", ",", "  ", ")", "(", "-", '"', "   ", ";", "»", "«", "'", "/"]
    for w in tokens:
        if w not in punctuation:
            tokens_no_punctuation.append(w)
    return len(tokens_no_punctuation)

#apply function to df
df["Length"] = [count_words(text) for text in df["Article"].astype(str)]
                   
#inspect data
df.head(3)

Unnamed: 0,ID,Newspaper,Date,Length,Category,Author,Headline,Teaser,Article,Modality,url
6,100006,sueddeutschet politik (www),2020-05-28T15:34:08,367,,,SZ Espresso: Nachrichten kompakt - die Übersic...,<p>Was heute wichtig war - und was Sie auf SZ....,Das Wichtigste zum Coronavirus. Berufstätige M...,online,https://www.sueddeutsche.de/politik/nachrichte...
8,100008,sueddeutschet politik (www),2020-05-28T17:01:43,200,,,Kommunalpolitik: Abgeblendet,<p>Bayreuths Stadtrat im Stream</p>,"Livestream aus dem Stadtrat, das klingt transp...",online,https://www.sueddeutsche.de/bayern/kommunalpol...
21,100021,die welt (www),2020-05-28T17:49:45,1,Politik,,"Wie Söder es schaffte, die AfD zu halbieren",Die Corona-Krise lässt die CSU erstarken und b...,,online,https://www.welt.de/politik/deutschland/plus20...


#### i) Remove articles that could not be scraped due to them being behind a paywall

In [158]:
#check how many articles have fewer than 40 words.
len(df[df["Length"] <40])

117

In [159]:
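#remove the short articles (note: the filter uses <=40, so articles with exactly 40 words are removed as well; 11614 - 11491 = 123 rows in total)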
df = df[~(df["Length"] <=40)]
#check if the code worked
print(len(df), len(df[df["Length"] <40]))

11491 0


### 2) NLP to stem text and remove stopwords

In [160]:
from nltk.corpus import stopwords 
stop_words = set(stopwords.words('german')) 
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("german")
from collections import Counter

#function to remove stopwords
def remove_stopwords_and_stem(text):
    #apply SpaCy to tokenise
    doc = nlp(text)
    #create a list of tokens
    tokens = [token.text for token in doc]
    #create an empty list for all words in a text that are not stopwords
    no_stopwords = []
    #loop over the tokens list and append non-stopwords to the no_stopwords list
    for w in tokens: 
        if w not in stop_words: 
            no_stopwords.append(w)
    #stem all words in the list
    stems=""
    for word in no_stopwords:
        stems=stems + stemmer.stem(word) + " "
    #return the final stems as a single string
    return(stems)
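
NLTK's German stop-word list is lower-case, while the tokens are compared as written. Capitalised stop words at the start of a sentence ("Das", "Der") therefore pass the filter, which is visible in the cleaned text below. A minimal sketch of a case-insensitive comparison follows. `drop_stopwords` and the tiny stop-word set are hypothetical stand-ins, not the notebook's actual objects.

```python
def drop_stopwords(tokens, stop_words):
    # Compare the lower-cased token so that "Das" matches the stop word "das".
    return [t for t in tokens if t.lower() not in stop_words]

# Tiny illustrative stop-word set; the notebook uses NLTK's German list instead.
example_stop_words = {"das", "der", "und"}
print(drop_stopwords(["Das", "Wichtigste", "und", "mehr"], example_stop_words))
```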

In [161]:
#apply functions in order to...

#... create a column with the cleaned article text (where stopwords are removed and all words are stemmed)
df["clean text"] = [remove_stopwords_and_stem(text) for text in df["Article"]]

#... create a column with the number of words for the cleaned text
df["words in clean text"] = [count_words(text) for text in df["clean text"]]

#inspect data
df.head(3)

Unnamed: 0,ID,Newspaper,Date,Length,Category,Author,Headline,Teaser,Article,Modality,url,clean text,words in clean text
6,100006,sueddeutschet politik (www),2020-05-28T15:34:08,367,,,SZ Espresso: Nachrichten kompakt - die Übersic...,<p>Was heute wichtig war - und was Sie auf SZ....,Das Wichtigste zum Coronavirus. Berufstätige M...,online,https://www.sueddeutsche.de/politik/nachrichte...,"das wichtig coronavirus . berufstat mutt vat ,...",224
8,100008,sueddeutschet politik (www),2020-05-28T17:01:43,200,,,Kommunalpolitik: Abgeblendet,<p>Bayreuths Stadtrat im Stream</p>,"Livestream aus dem Stadtrat, das klingt transp...",online,https://www.sueddeutsche.de/bayern/kommunalpol...,"livestream stadtrat , klingt transparent erstr...",104
24,100024,aachener zeitung (www),2020-05-28T03:01:52,512,Politik,,Länder planen Öffnung: Streit über Schulen und...,"<img src=""https://www.aachener-zeitung.de/imgs...",Der Streit über die Wiederöffnung von Schulen ...,online,https://www.aachener-zeitung.de/politik/deutsc...,der streit wiederoffn schul kindergart kris ve...,318


### 3) Creating dummy variables

#### a) for reach

In [162]:
#regional outlets (print and online names); all other outlets are treated as national
regional = {"Aachener Zeitung", "aachener zeitung (www)", "Stuttgarter Zeitung", "stuttgarter zeitung (www)", "Rheinische Post", "rheinische post (www)"}

#returns 0 for regional outlets and 1 for national outlets
def create_reach_dummy(row):
    if row in regional:
        return 0
    else:
        return 1

df["reach_dummy"] = [create_reach_dummy(row) for row in df["Newspaper"]]

#### b) for modality

In [163]:
cond1 = df.Modality.str.contains("print")
cond2 = df.Modality.str.contains("online")
df["modality_dummy"] = np.where(cond1, 1, np.where(cond2, 0, 2) )
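
Both dummies can also be derived with vectorised pandas operations. The sketch below uses an invented toy frame and a hypothetical `regional_outlets` list. Unlike the `np.where` version above, it codes every modality other than "print" as 0 and has no fallback value of 2.

```python
import pandas as pd

# Toy frame with one regional print row and one national online row.
toy = pd.DataFrame({
    "Newspaper": ["Aachener Zeitung", "die welt (www)"],
    "Modality": ["print", "online"],
})

# Outlet names treated as regional, print and online spellings.
regional_outlets = [
    "Aachener Zeitung", "aachener zeitung (www)",
    "Stuttgarter Zeitung", "stuttgarter zeitung (www)",
    "Rheinische Post", "rheinische post (www)",
]

# 0 = regional, 1 = national
toy["reach_dummy"] = (~toy["Newspaper"].isin(regional_outlets)).astype(int)
# 1 = print, 0 = anything else
toy["modality_dummy"] = (toy["Modality"] == "print").astype(int)
print(toy[["reach_dummy", "modality_dummy"]])
```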

### 4) Saving the clean and complete datafile for further analysis

In [164]:
df.to_excel("complete_data_cleaned.xlsx")

### 5) Exploring the data distribution

In [165]:
df.groupby("Newspaper").count()

Newspaper,ID,Date,Length,Category,Author,Headline,Teaser,Article,Modality,url,clean text,words in clean text,reach_dummy,modality_dummy
Aachener Zeitung,970,963,970,970,970,970,970,970,970,970,970,970,970,970
Der Tagesspiegel,1286,1281,1286,1286,1286,1286,1286,1286,1286,1286,1286,1286,1286,1286
Die Welt,831,831,831,831,831,831,831,831,831,831,831,831,831,831
Rheinische Post,2375,2365,2375,2375,2375,2375,2375,2375,2375,2375,2375,2375,2375,2375
Stuttgarter Zeitung,1237,1237,1237,1237,1237,1237,1237,1237,1237,1237,1237,1237,1237,1237
Süddeutsche Zeitung (inkl. Regionalausgaben),3720,3708,3720,3720,3720,3720,3720,3720,3720,3720,3720,3720,3720,3720
aachener zeitung (www),168,168,168,168,168,168,168,168,168,168,168,168,168,168
der tagesspiegel (www),264,264,264,264,264,264,264,264,264,264,264,264,264,264
die welt (www),177,177,177,177,177,177,177,177,177,177,177,177,177,177
rheinische post (www),173,173,173,173,173,173,173,173,173,173,173,173,173,173


In [166]:
len(df)

11491