# Extract reliable news in Spanish, preferably from Spain

The fake news extracted from the FNC corpus leaves the dataset clearly short of true news. To compensate, we will extract reliable news from trusted newspapers.

We will extract the news from this dataset of 342,000 Spanish-language news articles: https://webhose.io/free-datasets/spanish-news-articles/. The articles were crawled in 2016 and are distributed as a zip archive containing 342,000 JSON files.

1. Unzip and merge all the json files into a pandas dataframe.

2. Drop the unnecessary data

3. Explore the news sources

4. Take only the reliable sources

5. Classify the news according to its topic, which is usually specified in the URL

6. Drop the news that are too short and export as csv

### 1. Unzip and merge all the json files into a pandas dataframe.

In [2]:
%%time
import pandas as pd
import json  
import zipfile

list_result = []

with zipfile.ZipFile("../data/645_webhose-2016-10_20170904091850.zip", "r") as z:
    for file in z.namelist():  
        with z.open(file) as f:  
            data = f.read()  
            list_result.append(json.loads(data.decode("utf-8")))
            
df = pd.DataFrame(list_result)

CPU times: user 59.4 s, sys: 40.5 s, total: 1min 39s
Wall time: 1min 49s


In [3]:
df.head()

Unnamed: 0,organizations,uuid,thread,author,url,ord_in_thread,title,locations,entities,highlightText,language,persons,text,external_links,published,crawled,highlightTitle
0,[],f9ddd9641788b2f5fcadbae6e2ce7a7f84d555b7,"{'social': {'gplus': {'shares': 0}, 'pinterest...",lanacion.com (noreply@lanacionline.com.ar),http://www.lanacion.com.ar/1943326-detras-de-e...,0,"Detrás de escena, Moyano, Barrionuevo y Caló j...",[],"{'persons': [], 'locations': [], 'organization...",,spanish,[],Recibí por mail las noticias que impactan Detr...,[],2016-10-02T11:00:00.000+03:00,2016-10-02T08:38:50.999+03:00,
1,[],e442424ed01c6c617b46f5b76307c9423dbb0fe5,"{'social': {'gplus': {'shares': 0}, 'pinterest...",,http://www.coches.net/nissan-terrano-30di-spor...,0,nissan terrano 4x4 3.0di sport 5p Diesel de co...,[],"{'persons': [], 'locations': [], 'organization...",,spanish,[],Garantía 12 meses (1 año) Comentarios del anun...,[],2016-10-02T08:00:00.000+03:00,2016-10-02T14:02:34.228+03:00,
2,[],b5eee3b98a2740402ef08a334ca83f796601b55a,"{'social': {'gplus': {'shares': 0}, 'pinterest...",el-nacional.com,http://www.el-nacional.com/GDA/Hijos-Chapo-sos...,0,"Hijos de El Chapo, sospechosos de emboscada qu...",[],"{'persons': [], 'locations': [], 'organization...",,spanish,[],Cinco militares mexicanos murieron y 10 result...,[],2016-10-02T05:34:00.000+03:00,2016-10-02T05:07:30.864+03:00,
3,[],b0e5edd25798fdab23983913a329caf1ef6f851f,"{'social': {'gplus': {'shares': 0}, 'pinterest...",El Nacional Web,http://www.el-nacional.com/escenas/Pelicula-Am...,0,Película El Amparo ganó premio en Francia,[],"{'persons': [], 'locations': [], 'organization...",,spanish,[],EL NACIONAL WEB 1 de octubre 2016 - 03:58 pm L...,[],2016-10-02T04:02:00.000+03:00,2016-10-01T23:02:25.864+03:00,
4,[],03d016c67ff762b0e6ff03b5d9e72aec76e60721,"{'social': {'gplus': {'shares': 0}, 'pinterest...",,http://mx.reuters.com/article/topNews/idMXL2N1...,0,ACTUALIZA 3-Clinton busca mantener a Trump a l...,[],"{'persons': [], 'locations': [], 'organization...",,spanish,[],ACTUALIZA 3-Clinton busca mantener a Trump a l...,[],2016-09-28T10:46:00.000+03:00,2016-10-02T10:48:11.329+03:00,


### 2. Drop the unnecessary data

In [4]:
df.columns

Index(['organizations', 'uuid', 'thread', 'author', 'url', 'ord_in_thread',
       'title', 'locations', 'entities', 'highlightText', 'language',
       'persons', 'text', 'external_links', 'published', 'crawled',
       'highlightTitle'],
      dtype='object')

In [6]:
# errors='ignore' keeps this cell safe to re-run after the columns have already been dropped
df = df.drop(columns=['organizations', 'uuid', 'thread', 'ord_in_thread', 'locations', 'entities', 'highlightText', 'language',
                      'persons', 'external_links', 'published', 'crawled', 'highlightTitle'], errors='ignore')

In [7]:
df.head()

Unnamed: 0,author,url,title,text
0,lanacion.com (noreply@lanacionline.com.ar),http://www.lanacion.com.ar/1943326-detras-de-e...,"Detrás de escena, Moyano, Barrionuevo y Caló j...",Recibí por mail las noticias que impactan Detr...
1,,http://www.coches.net/nissan-terrano-30di-spor...,nissan terrano 4x4 3.0di sport 5p Diesel de co...,Garantía 12 meses (1 año) Comentarios del anun...
2,el-nacional.com,http://www.el-nacional.com/GDA/Hijos-Chapo-sos...,"Hijos de El Chapo, sospechosos de emboscada qu...",Cinco militares mexicanos murieron y 10 result...
3,El Nacional Web,http://www.el-nacional.com/escenas/Pelicula-Am...,Película El Amparo ganó premio en Francia,EL NACIONAL WEB 1 de octubre 2016 - 03:58 pm L...
4,,http://mx.reuters.com/article/topNews/idMXL2N1...,ACTUALIZA 3-Clinton busca mantener a Trump a l...,ACTUALIZA 3-Clinton busca mantener a Trump a l...
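
Since only `author`, `url`, `title` and `text` survive this step, the loading loop could keep just those keys while reading the archive, so the unused fields are never held in memory. The following is a minimal sketch under that assumption, not the notebook's actual method; `load_records` and `KEEP_FIELDS` are hypothetical names, and the tiny in-memory archive holds made-up content that only stands in for the real dataset.

```python
import io
import json
import zipfile

# The four fields the notebook keeps after step 2.
KEEP_FIELDS = ("author", "url", "title", "text")


def load_records(zip_source, fields=KEEP_FIELDS):
    """Read every JSON member of a zip archive, keeping only `fields`.

    `zip_source` may be a path or a file-like object.
    Keys missing from an article are stored as None.
    """
    records = []
    with zipfile.ZipFile(zip_source, "r") as z:
        for name in z.namelist():
            if name.endswith("/"):  # skip directory entries
                continue
            with z.open(name) as f:
                article = json.loads(f.read().decode("utf-8"))
            records.append({key: article.get(key) for key in fields})
    return records


# Tiny in-memory archive with made-up content, standing in for the real zip.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as z:
    z.writestr("a.json", json.dumps({
        "author": "EFE", "url": "http://example.com/a",
        "title": "t", "text": "x", "uuid": "123",
    }))
buffer.seek(0)

demo_records = load_records(buffer)
print(demo_records)
```

The resulting list of dictionaries can be passed to `pd.DataFrame(...)` exactly as `list_result` is in step 1.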


### 3. Explore the news sources

In [9]:
df.groupby('author').size().sort_values(ascending = False)

author
                                                       164840
EUROPA PRESS                                            11174
Biobiochile - La Red De Prensa M S Grande De Chile.      5380
Europa Press                                             4493
lanacion.com (noreply@lanacionline.com.ar)               4331
                                                        ...  
Mariao1                                                     1
Mariano Vidal                                               1
mansanrod                                                   1
Mariano S Nchez Soler                                       1
ne0bi0                                                      1
Length: 21381, dtype: int64
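
The counts above list `EUROPA PRESS` and `Europa Press` as separate authors, so the same source is split across several rows. A minimal sketch of how the variants could be merged before counting, assuming that case and surrounding whitespace are the only differences we want to ignore; `normalize_author` and `count_sources` are hypothetical helpers and the sample list is made up.

```python
from collections import Counter


def normalize_author(name):
    """Collapse case and extra whitespace so spelling variants compare equal."""
    return " ".join(name.split()).casefold()


def count_sources(authors):
    """Count articles per normalized author, most frequent first.

    Empty or whitespace-only author strings are skipped.
    """
    counts = Counter(normalize_author(a) for a in authors if a and a.strip())
    return counts.most_common()


sample_authors = ["EUROPA PRESS", "Europa Press", "", "EFE", "europa press "]
source_counts = count_sources(sample_authors)
print(source_counts)
```

On the real data the same function could be applied with `df['author'].map(normalize_author)` before grouping. It would not merge variants that differ by more than case, such as `Redacción` and `Redacci N`.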

In [15]:
# We want reliable news from Spain, so we drop the news that is not from Spain
# and the newspapers that are not reliable.

authors_to_drop = [
    '',
    'Biobiochile - La Red De Prensa M S Grande De Chile.',
    'lanacion.com (noreply@lanacionline.com.ar)',
    'esglobal',
    'marca.com',
    'ap',
    'prensalibre.com',
    '(lavozdigital)',
    'excelsior.com.mx',
    'abc.es',
    'el-nacional.com',
    'Clarin.com',
    'Ol',
    'emol.com',
    'lanacion',
    'lanacion.com',
    'Casa Editorial El Tiempo',
    'nacion.com',
    'Redacción',
    'El Nacional Web',
    'Noticia Al Dia',
    'RT en español',
    'Milenio Digital',
    'mundod',
    'Cadena Ser',
    'Sevilla',
    '(sevilla)',
    'Todo Noticias',
    'AFP',
    'Associated Press',
    'sojourngsd.org',
    'hola.com',
    'Redacci N',
    'Agencias',
    'Noticieros Televisa',
    'Getty Images',
    '(abc)',
]

# Keep every row whose author is not in the list above
df2 = df[~df['author'].isin(authors_to_drop)].reset_index(drop=True)

df2.groupby('author').size().sort_values(ascending = False)

author
EUROPA PRESS                       11174
Europa Press                        4493
EFE                                 2132
El País                              830
Redacción , Barcelona                566
                                   ...  
PATRICK WHITTLE                        1
PATRICIA MARTÍN / MADRID               1
PATRICIA MARTÍN                        1
PATRICIA LÓPEZ\n@patricialopezl        1
"BSS"                                  1
Length: 21346, dtype: int64

### 4. Take only the reliable sources

In [16]:
# Keep only Europa Press articles (both spellings of the author name)
df2 = df[df['author'].isin(['EUROPA PRESS', 'Europa Press'])].reset_index(drop=True)
df2

Unnamed: 0,author,url,title,text
0,Europa Press,http://www.eldiario.es/norte/euskadi/entidades...,Más de cien entidades participarán en el Foro ...,Euskadi Más de cien entidades participarán en ...
1,Europa Press,http://www.eldiario.es/norte/euskadi/Portugale...,Portugalete acoge este domingo la Gladiator He...,Euskadi Portugalete acoge este domingo la Glad...
2,EUROPA PRESS,http://www.europapress.es/navarra/noticia-lune...,El lunes se iniciará el vaciado del lago de Me...,"PAMPLONA, 2 Oct. (EUROPA PRESS) - El Ayunta..."
3,EUROPA PRESS,http://www.20minutos.es/noticia/2852917/0/fest...,El festival multidisciplinar 'Granada Noir' ar...,"PP pide explicaciones a Junta por ""recorte"" de..."
4,Europa Press,http://www.europapress.es/videos/video-javier-...,"Javier Fernández, presidente de la gestora del...",Hemeroteca \nPortal de actualidad y noticias d...
...,...,...,...,...
15662,EUROPA PRESS,http://www.europapress.es/economia/macroeconom...,Las ventas del comercio minorista moderan su c...,Las ventas del comercio minorista aumentaron u...
15663,Europa Press,http://www.europapress.es/desconecta/curiosity...,5 vídeos de sustos aterradores para celebrar H...,5 vídeos de sustos aterradores para celebrar H...
15664,EUROPA PRESS,http://www.europapress.es/cultura/exposiciones...,Un español en la National Gallery de Escocia e...,El dramaturgo español Pablo Valcarce (Pablo Fe...
15665,EUROPA PRESS,http://www.europapress.es/deportes/baloncesto-...,"Pau Ribas, nueve meses de baja",El jugador del FC Barcelona Lassa Pau Ribas fu...


### 5. Classify the news according to its topic, specified in the url

In [31]:
# Categorize each article by the first topic keyword found in its URL
# and drop the articles whose URL matches none of the keywords.
# The order matters: earlier keywords take precedence.
topic_keywords = [
    ('politica', 'Politics'),
    ('educacion', 'Education'),
    ('salud', 'Health'),
    ('internacional', 'Politics'),
    ('economia', 'Economy'),
    ('deportes', 'Sports'),
    ('ciencia', 'Science'),
    ('cultura', 'Entertainment'),
    ('sociedad', 'Society'),
]

rows = []
for _, row in df2.iterrows():
    for keyword, topic in topic_keywords:
        if keyword in row['url']:
            rows.append({'Topic': topic,
                         'Source': row['author'],
                         'Headline': row['title'],
                         'Text': row['text'],
                         'Link': row['url']})
            break

df3 = pd.DataFrame(rows, columns=['Topic', 'Source', 'Headline', 'Text', 'Link'])
df3['Source'] = df3['Source'].str.replace('EUROPA PRESS', 'Europa Press')

df3

Unnamed: 0,Topic,Source,Headline,Text,Link
0,Entertainment,Europa Press,El festival multidisciplinar 'Granada Noir' ar...,"PP pide explicaciones a Junta por ""recorte"" de...",http://www.20minutos.es/noticia/2852917/0/fest...
1,Economy,Europa Press,"La confianza del consumidor baja 6,3 puntos en...","La confianza del consumidor baja 6,3 puntos en...",http://www.europapress.es/economia/macroeconom...
2,Politics,Europa Press,Ban Ki Moon pide reanudar las conversaciones d...,"04 Octubre 2016 \n""Insto firmemente a reanudar...",http://www.europapress.es/internacional/notici...
3,Society,Europa Press,"Víctimas y automovilistas ven ""positivo"" el de...","Asociaciones de víctimas, automovilistas y mot...",http://www.europapress.es/sociedad/noticia-vic...
4,Politics,Europa Press,La CUP pedirá adelantar el referéndum a julio ...,"El diputado de la CUP, Albert Botrán, ha expli...",http://www.europapress.es/catalunya/noticia-cu...
...,...,...,...,...,...
4601,Economy,Europa Press,Los TCP rechazan la oferta de Eurowings y amen...,El sindicato de tripulantes de cabina alemán U...,http://www.europapress.es/economia/red-empresa...
4602,Sports,Europa Press,Gasquet logra el título en Amberes,El tenista francés Richard Gasquet se ha impue...,http://www.europapress.es/deportes/tenis-00166...
4603,Economy,Europa Press,Las ventas del comercio minorista moderan su c...,Las ventas del comercio minorista aumentaron u...,http://www.europapress.es/economia/macroeconom...
4604,Entertainment,Europa Press,Un español en la National Gallery de Escocia e...,El dramaturgo español Pablo Valcarce (Pablo Fe...,http://www.europapress.es/cultura/exposiciones...
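
The URL-keyword rule used above can also be isolated into a small function, which makes it easy to check on individual URLs. This is a sketch rather than the notebook's exact logic: it looks for the keywords only in the path of the URL (an assumption, so a keyword in the domain name would not count), `topic_from_url` and `TOPIC_KEYWORDS` are hypothetical names, and the `example.com` URLs are made up.

```python
from urllib.parse import urlparse

# Keyword order mirrors the notebook: the first keyword found wins.
TOPIC_KEYWORDS = [
    ("politica", "Politics"),
    ("educacion", "Education"),
    ("salud", "Health"),
    ("internacional", "Politics"),
    ("economia", "Economy"),
    ("deportes", "Sports"),
    ("ciencia", "Science"),
    ("cultura", "Entertainment"),
    ("sociedad", "Society"),
]


def topic_from_url(url):
    """Return the topic of the first keyword found in the URL path, else None."""
    path = urlparse(url).path.lower()
    for keyword, topic in TOPIC_KEYWORDS:
        if keyword in path:
            return topic
    return None


# Made-up URLs, only to exercise the function.
print(topic_from_url("http://example.com/deportes/noticia-demo.html"))
print(topic_from_url("http://example.com/internacional/noticia-demo.html"))
print(topic_from_url("http://example.com/videos/video-demo.html"))
```

Articles for which the function returns `None` correspond to the rows that the classification step leaves out.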


In [32]:
df3['Category'] = 'True'
df3

Unnamed: 0,Topic,Source,Headline,Text,Link,Category
0,Entertainment,Europa Press,El festival multidisciplinar 'Granada Noir' ar...,"PP pide explicaciones a Junta por ""recorte"" de...",http://www.20minutos.es/noticia/2852917/0/fest...,True
1,Economy,Europa Press,"La confianza del consumidor baja 6,3 puntos en...","La confianza del consumidor baja 6,3 puntos en...",http://www.europapress.es/economia/macroeconom...,True
2,Politics,Europa Press,Ban Ki Moon pide reanudar las conversaciones d...,"04 Octubre 2016 \n""Insto firmemente a reanudar...",http://www.europapress.es/internacional/notici...,True
3,Society,Europa Press,"Víctimas y automovilistas ven ""positivo"" el de...","Asociaciones de víctimas, automovilistas y mot...",http://www.europapress.es/sociedad/noticia-vic...,True
4,Politics,Europa Press,La CUP pedirá adelantar el referéndum a julio ...,"El diputado de la CUP, Albert Botrán, ha expli...",http://www.europapress.es/catalunya/noticia-cu...,True
...,...,...,...,...,...,...
4601,Economy,Europa Press,Los TCP rechazan la oferta de Eurowings y amen...,El sindicato de tripulantes de cabina alemán U...,http://www.europapress.es/economia/red-empresa...,True
4602,Sports,Europa Press,Gasquet logra el título en Amberes,El tenista francés Richard Gasquet se ha impue...,http://www.europapress.es/deportes/tenis-00166...,True
4603,Economy,Europa Press,Las ventas del comercio minorista moderan su c...,Las ventas del comercio minorista aumentaron u...,http://www.europapress.es/economia/macroeconom...,True
4604,Entertainment,Europa Press,Un español en la National Gallery de Escocia e...,El dramaturgo español Pablo Valcarce (Pablo Fe...,http://www.europapress.es/cultura/exposiciones...,True


### 6. Drop the news that are too short and export as csv

In [33]:
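import re
import spacy

nlp = spacy.load('es_core_news_lg')

rows_to_drop = []
for n, row in df3.iterrows():
    # Strip URLs so they are not counted as words
    text = re.sub(r"http\S+", "", row['Text'])
    text = re.sub(r"http", "", text)
    doc = nlp(text)

    # Count sentences and tokens as segmented by spaCy
    sents = 0
    words = 0
    for sentence in doc.sents:
        sents += 1
        words += len(sentence)

    # An article is too short when it has at most 536 tokens and at most 16 sentences
    if words <= 536 and sents <= 16:
        rows_to_drop.append(n)

# Drop after the loop instead of mutating df3 while iterating over it
df3 = df3.drop(rows_to_drop).reset_index(drop=True)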
df3

Unnamed: 0,Topic,Source,Headline,Text,Link,Category
0,Entertainment,Europa Press,El festival multidisciplinar 'Granada Noir' ar...,"PP pide explicaciones a Junta por ""recorte"" de...",http://www.20minutos.es/noticia/2852917/0/fest...,True
1,Society,Europa Press,"Víctimas y automovilistas ven ""positivo"" el de...","Asociaciones de víctimas, automovilistas y mot...",http://www.europapress.es/sociedad/noticia-vic...,True
2,Economy,Europa Press,Nueve secretos para invertir con éxito de Garc...,Nueve secretos para invertir con éxito de Garc...,http://www.europapress.es/economia/noticia-nue...,True
3,Education,Europa Press,"El Consell Escolar inicia un debate ""sin restr...",El Consell Escolar de Catalunya (CEC) ha abier...,http://www.europapress.es/catalunya/noticia-co...,True
4,Politics,Europa Press,Los colombianos comienzan a votar este domingo...,"Los colombianos, con su presidente, Juan Manue...",http://www.europapress.es/internacional/notici...,True
...,...,...,...,...,...,...
1470,Sports,Europa Press,Hamilton manda en los primeros libres; Sainz y...,El piloto inglés Lewis Hamilton (Mercedes) fue...,http://www.europapress.es/deportes/formula1-00...,True
1471,Economy,Europa Press,El superávit por cuenta corriente se duplica e...,"La balanza por cuenta corriente, que mide los ...",http://www.europapress.es/economia/macroeconom...,True
1472,Sports,Europa Press,"Clasificación de la carrera, del Mundial de Pi...",Esta es la clasificación del Gran Premio de Ja...,http://www.europapress.es/deportes/formula1-00...,True
1473,Entertainment,Europa Press,El Reina Sofía expone su 'fondo de armario' de...,El Museo Reina Sofía inaugura este martes 25 d...,http://www.europapress.es/cultura/exposiciones...,True
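
Note that the rule above drops an article only when both counts are at or below their limits, so a text with few sentences but more than 536 tokens is kept. The sketch below isolates that rule and pairs it with a rough counter based on whitespace and punctuation. That counter is a stand-in we assume for illustration, not spaCy's tokenizer, so its counts will differ from the notebook's (spaCy also counts punctuation marks as tokens). `rough_counts` and `is_too_short` are hypothetical names.

```python
import re

MAX_WORDS = 536  # thresholds taken from the notebook
MAX_SENTS = 16


def rough_counts(text):
    """Approximate (words, sentences) without spaCy.

    Words are whitespace-separated chunks; sentences are split on ., ! and ?.
    """
    text = re.sub(r"http\S+", "", text)
    text = re.sub(r"http", "", text)
    words = len(text.split())
    sents = len([s for s in re.split(r"[.!?]+", text) if s.strip()])
    return words, sents


def is_too_short(words, sents, max_words=MAX_WORDS, max_sents=MAX_SENTS):
    """Mirror the notebook's rule: drop only when BOTH counts are within the limits."""
    return words <= max_words and sents <= max_sents


demo_counts = rough_counts("Hola mundo. Visita http://example.com ahora. Fin")
print(demo_counts, is_too_short(*demo_counts))
```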


In [34]:
df3.to_csv('../data/reliable_spanish_news.csv', encoding = 'utf-8', index = False)