# Determining the Origin Site of Spanish News Articles
## Data Preparation
Spanish news articles are obtained from Webhose (https://webhose.io/free-datasets/spanish-news-articles/) and used to produce CSV files (https://s3.console.aws.amazon.com/s3/buckets/sbd-projects/RocketHall/customersegmentation/SpanishNews/?region=us-east-2&tab=overview).
The articles are in JSON format; this notebook combines them into CSV files for training and evaluation.

In [1]:
import glob
import json
import tqdm 
import pandas as pd
from sklearn.model_selection import train_test_split

In [2]:
# Set the path to the directory containing the JSON files
list_json=glob.glob("PATH_TO_JSON/*")

In [3]:
# Collect the relevant fields of each JSON file into a list of rows
rows=[]
for j in tqdm.tqdm_notebook(list_json):
    with open(j) as json_file:
        data = json.load(json_file)
    thread = data['thread']  # site-level metadata for the article
    row=[thread['country'],thread['domain_rank'],thread['site'],thread['site_type'],data['author'],data['text']]
    rows.append(row)
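
The loop above assumes every file contains all six fields; a record missing `thread` or `author` would raise a `KeyError`. Below is a minimal sketch of a more tolerant flattening step, assuming missing fields should simply become `None`. `record_to_row` is a hypothetical helper, not part of the original notebook, and the sample record is invented for illustration.

```python
import json

def record_to_row(data):
    """Flatten one article record into the six-column row used above.

    Hypothetical helper: missing keys become None instead of raising KeyError.
    """
    thread = data.get("thread") or {}
    return [
        thread.get("country"),
        thread.get("domain_rank"),
        thread.get("site"),
        thread.get("site_type"),
        data.get("author"),
        data.get("text"),
    ]

# Invented sample record with the fields the notebook reads; 'author' is absent.
sample = json.loads(
    '{"thread": {"country": "ES", "domain_rank": 120, '
    '"site": "example.com", "site_type": "news"}, "text": "Hola"}'
)
row = record_to_row(sample)
print(row)
```

Whether to keep such rows or skip them depends on how the downstream model treats missing authors or texts; the original loop fails fast instead.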

In [5]:
# Convert the list of rows to a DataFrame (the 'domain' column holds the domain_rank value)
df = pd.DataFrame(rows, columns=['pais','domain','site','tipo','autor','texto'])

In [4]:
len(rows)

In [7]:
# Save the DataFrame
# Replace OUTPUT_PATH with the directory where the data should be saved
df.to_csv("OUTPUT_PATH/articles.csv")

In [9]:
# Uncomment to reload the DataFrame if articles.csv has already been created
#df=pd.read_csv("OUTPUT_PATH/articles.csv" , index_col=0)

In [10]:
df.pais.value_counts()

ES    139622
US    110548
AR     21878
MX     15529
IE     14937
PE     14289
EU      5091
CL      3626
RU      2483
NI      2441
GB      1697
AT      1497
FR      1227
CN      1008
SV       685
MN       353
NL       208
IS       177
IT       146
TV       121
DE       109
VE       102
CH        93
ID        73
CO        64
IN        54
PR        51
BR        44
PT        11
LU        11
CR        10
PL         8
JP         6
SK         5
TH         5
TR         4
CA         4
BA         4
NO         3
ME         2
IO         2
KR         2
RS         2
EG         1
AU         1
RO         1
ZA         1
AE         1
BE         1
SG         1
Name: pais, dtype: int64

## Create a version with sites that have more than 5000 articles

In [12]:
sites=df.site.value_counts()

In [13]:
big_sites=list(sites[sites>5000].index)

In [14]:
df_sites=df[df.site.isin(big_sites)]
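
The three cells above keep only the sites with strictly more than 5000 articles, so a site with exactly 5000 is dropped. The same logic on a toy list can be sketched with the standard library instead of pandas; the site names, texts, and the lowered threshold are made up for illustration.

```python
from collections import Counter

# Toy (site, text) pairs standing in for the DataFrame rows; all values are invented.
articles = [("a.com", "t1"), ("a.com", "t2"), ("a.com", "t3"),
            ("b.com", "t4"), ("b.com", "t5"),
            ("c.com", "t6")]
threshold = 2  # the notebook uses 5000

# Equivalent of df.site.value_counts()
site_counts = Counter(site for site, _ in articles)

# Equivalent of selecting the sites whose count is strictly above the threshold
big_sites = {site for site, n in site_counts.items() if n > threshold}

# Equivalent of df[df.site.isin(big_sites)]
kept = [(site, text) for site, text in articles if site in big_sites]
print(sorted(big_sites), len(kept))
```

Here `b.com` has exactly `threshold` articles and is excluded, which mirrors the strict `>` comparison in the notebook.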

In [15]:
# Split into train (80%) and test (20%) sets, stratified by site
df_sites_train, df_sites_test= train_test_split(df_sites, random_state=1977, test_size=0.2, stratify=df_sites.site)
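
Passing `stratify=df_sites.site` asks `train_test_split` to keep each site's share of articles roughly the same in the train and test sets. The idea can be sketched by hand with the standard library. This is an illustration of stratification on invented labels, not scikit-learn's actual implementation, and `stratified_split` is a hypothetical helper.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=1977):
    """Return (train_idx, test_idx), sampling test_size of each label's items."""
    # Group the row indices by label (here, by site)
    by_label = defaultdict(list)
    for i, label in enumerate(labels):
        by_label[label].append(i)

    rng = random.Random(seed)  # seeded for reproducibility
    train_idx, test_idx = [], []
    for label in sorted(by_label):
        idx = by_label[label]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_size)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return train_idx, test_idx

# Invented labels: three sites with 50, 30 and 20 articles
labels = ["a.com"] * 50 + ["b.com"] * 30 + ["c.com"] * 20
train_idx, test_idx = stratified_split(labels)
print(Counter(labels[i] for i in test_idx))
```

Each site contributes 20% of its articles to the test set, so the class proportions the classifier sees at evaluation time match those seen during training.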

In [20]:
# Save the articles from sites with more than 5000 articles, plus the train and test splits
# Replace OUTPUT_PATH with the directory where the data should be saved
df_sites.to_csv("OUTPUT_PATH/articles_sites.csv")
df_sites_train.to_csv("OUTPUT_PATH/articles_sites_train.csv")
df_sites_test.to_csv("OUTPUT_PATH/articles_sites_test.csv")