# Clean and feature extraction v3

## Clean text, extract stylometric features and create a new dataset

In [2]:
import pandas as pd

In [3]:
df = pd.read_csv('../data/corpus_spanish.csv')

In [4]:
df.head()

Unnamed: 0,Id,Category,Topic,Source,Headline,Text,Link
0,641,True,Entertainment,Caras,Sofía Castro y Alejandro Peña Pretelini: una i...,Sofía Castro y Alejandro Peña Pretelini: una i...,https://www.caras.com.mx/sofia-castro-alejandr...
1,6,True,Education,Heraldo,Un paso más cerca de hacer los exámenes 'online',Un paso más cerca de hacer los exámenes 'onlin...,https://www.heraldo.es/noticias/suplementos/he...
2,141,True,Science,HUFFPOST,Esto es lo que los científicos realmente piens...,Esto es lo que los científicos realmente piens...,https://www.huffingtonpost.com/entry/scientist...
3,394,True,Politics,El financiero,Inicia impresión de boletas para elección pres...,Inicia impresión de boletas para elección pres...,http://www.elfinanciero.com.mx/elecciones-2018...
4,139,True,Sport,FIFA,A *NUMBER* día del Mundial,A *NUMBER* día del Mundial\nFIFA.com sigue la ...,https://es.fifa.com/worldcup/news/a-1-dia-del-...


In [5]:
df.shape

(971, 7)

In [6]:
df.dtypes

Id           int64
Category    object
Topic       object
Source      object
Headline    object
Text        object
Link        object
dtype: object

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 971 entries, 0 to 970
Data columns (total 7 columns):
Id          971 non-null int64
Category    971 non-null object
Topic       971 non-null object
Source      971 non-null object
Headline    971 non-null object
Text        971 non-null object
Link        971 non-null object
dtypes: int64(1), object(6)
memory usage: 53.2+ KB


## We are using `spacy`: The NLP *Ruby on Rails* 

[spacy](http://www.spacy.io/) is a library of natural language processing, robust, fast, easy to install and to use. It can be used with other NLP and Deep Learning Libraries.

With its pre-trained models in spanish language, we can operate the typical NLP jobs: Sentences segmentation, tokenization, POS tag, etc...

We are going to use the `es_core_news_lg` pre-trained model to make pos tagging:

In [1]:
import spacy

ModuleNotFoundError: No module named 'spacy'

In [10]:
!sudo apt-get install build-essential python-dev git

[sudo] password for pipe11: 


In [1]:
!python -m spacy download es_core_news_lg

Collecting es_core_news_lg==2.3.1 from https://github.com/explosion/spacy-models/releases/download/es_core_news_lg-2.3.1/es_core_news_lg-2.3.1.tar.gz#egg=es_core_news_lg==2.3.1
  Downloading https://github.com/explosion/spacy-models/releases/download/es_core_news_lg-2.3.1/es_core_news_lg-2.3.1.tar.gz (573.1MB)
[K    100% |████████████████████████████████| 573.1MB 1.6MB/s ta 0:00:0111��█████████                      | 178.2MB 52.8MB/s eta 0:00:08��█████▌                    | 206.2MB 27.4MB/s eta 0:00:140MB 23.3MB/s eta 0:00:13.5MB 54.1MB/s eta 0:00:05 341.3MB 109.5MB/s eta 0:00:03██████▏            | 343.6MB 106.8MB/s eta 0:00:03�█████▍            | 346.5MB 90.2MB/s eta 0:00:03��██████████████████           | 376.0MB 25.2MB/s eta 0:00:08��██████████████████           | 377.9MB 29.5MB/s eta 0:00:07/s eta 0:00:08/s eta 0:00:07/s eta 0:00:05B/s eta 0:00:02█████ | 554.9MB 13.9MB/s eta 0:00:02
[?25hCollecting spacy<2.4.0,>=2.3.0 (from es_core_news_lg==2.3.1)
  Downloading https://files.pyt

  Downloading https://files.pythonhosted.org/packages/e9/45/9c82d3666af4ef9f221cbb954e1d77ddbb513faf552aea6df5f37f1a4859/pathlib2-2.3.5-py2.py3-none-any.whl
Collecting six (from pathlib2; python_version < "3"->importlib-metadata>=0.20; python_version < "3.8"->catalogue<1.1.0,>=0.0.7->spacy<2.4.0,>=2.3.0->es_core_news_lg==2.3.1)
  Downloading https://files.pythonhosted.org/packages/ee/ff/48bde5c0f013094d729fe4b0316ba2a24774b3ff1c52d924a8a4cb04078a/six-1.15.0-py2.py3-none-any.whl
Collecting scandir; python_version < "3.5" (from pathlib2; python_version < "3"->importlib-metadata>=0.20; python_version < "3.8"->catalogue<1.1.0,>=0.0.7->spacy<2.4.0,>=2.3.0->es_core_news_lg==2.3.1)
  Downloading https://files.pythonhosted.org/packages/df/f5/9c052db7bd54d0cbf1bc0bb6554362bba1012d03e5888950a4f5c5dadc4e/scandir-1.10.0.tar.gz
Installing collected packages: murmurhash, cymem, preshed, numpy, blis, wasabi, pathlib, srsly, contextlib2, zipp, configparser, six, scandir, pathlib2, importlib-metadata, 

In [4]:
# cargamos el modelo entrenado en español
nlp_spacy = spacy.load('es_core_news_lg')

NameError: name 'spacy' is not defined

## Clean and complexity features

## Label Encoding

We are using this encoding technique for the target label instead one hot encoding, reasons:

 - The categorical features are binary
 - Not problem with features being ordinal

In [10]:
%%time

import itertools
import pandas as pd
import numpy as np

import nltk
from nltk.corpus import stopwords  
from nltk import word_tokenize, sent_tokenize  
from string import punctuation
from sklearn.preprocessing import LabelEncoder


labelencoder = LabelEncoder()
df['Label'] = labelencoder.fit_transform(df['Category'])

df_features = pd.DataFrame()

list_text = []
list_headline = []
list_sentences_t = []
list_words_t = []
list_words_sent_t = []
list_word_size_t = []
list_ttr_t = []
list_sentences_h = []
list_words_h = []
list_words_sent_h = []
list_word_size_h = []
list_ttr_h = []

for n, row in df.iterrows():
    
    headline = df['Headline'].iloc[n]
    text = df['Text'].iloc[n]
    
    text = text.replace(r"http\S+", "")
    text = text.replace(r"http", "")
    text = text.replace(r"@\S+", "")
    text = text.replace(r"(?<!\n)\n(?!\n)", " ")
    text = text.lower()
        
    headline = headline.replace(r"http\S+", "")
    headline = headline.replace(r"http", "")
    headline = headline.replace(r"@\S+", "")
    headline = headline.replace(r"(?<!\n)\n(?!\n)", " ")
    headline = headline.lower()

    sent_tokens_text = nltk.sent_tokenize(text)
    sent_tokens_headline = nltk.sent_tokenize(headline)

    # number of sentences
    n_sentences_text = len(sent_tokens_text)
    n_sentences_headline = len(sent_tokens_headline)

    word_tokens_text = nltk.word_tokenize(text)
    word_tokens_headline = nltk.word_tokenize(headline)

    stop_words = stopwords.words('spanish')
    stop_words.extend(list(punctuation))
    stop_words.extend(['¿', '¡', '"', '``']) 
    stop_words.extend(map(str,range(10)))

    filtered_tokens_text = [n for n in word_tokens_text if n not in stop_words]
    filtered_tokens_headline = [n for n in word_tokens_headline if n not in stop_words]

    # number of tokens/words
    n_words_text = len(filtered_tokens_text)
    n_words_headline = len(filtered_tokens_headline)

    # average words per sentence
    avg_word_sentences_text = (float(n_words_text) / n_sentences_text)
#     avg_word_sentences_headline = (float(n_words_headline) / n_sentences_headline)

    # average word size
    word_size_text = sum(len(word) for word in filtered_tokens_text) / n_words_text
    word_size_headline = sum(len(word) for word in filtered_tokens_headline) / n_words_headline

    # type token ratio
    types_text = nltk.Counter(filtered_tokens_text)
    ttr_text = (len(types_text) / n_words_text) * 100
    
    types_headline = nltk.Counter(filtered_tokens_headline)
    ttr_headline = (len(types_headline) / n_words_headline) * 100
    
    # text
    list_text.append(text)
    list_sentences_t.append(n_sentences_text)
    list_words_t.append(n_words_text)
    list_words_sent_t.append(avg_word_sentences_text)
    list_word_size_t.append(word_size_text)
    list_ttr_t.append(ttr_text)
    
    # headline
    list_headline.append(headline)    
#     list_sentences_h.append(n_sentences_headline) #irrelevant
    list_words_h.append(n_words_headline)
#     list_words_sent_h.append(avg_word_sentences_headline) # irrelevant
    list_word_size_h.append(word_size_headline)
    list_ttr_h.append(ttr_headline)

df_features['headline'] = list_headline
df_features['text'] = list_text
df_features['n_sentences_text'] = list_sentences_t
df_features['n_words_text'] = list_words_t
df_features['avg_words_sent_text'] = list_words_sent_t
df_features['avg_word_size_text'] = list_word_size_t
df_features['ttr_text'] = list_ttr_t
# df_features['n_sentences_headline'] # list_sentences_h # irrelevant
df_features['n_words_headline'] = list_words_h
# df_features['avg_words_sent_headline'] = list_words_sent_h # irrelevant
df_features['avg_word_size_headline'] = list_word_size_h
df_features['ttr_headline'] = list_ttr_h
df_features['label'] = df['Label']

df_features.to_csv('../data/spanish_corpus_features_v2.csv', encoding = 'utf-8', index = False)

CPU times: user 5.5 s, sys: 281 ms, total: 5.78 s
Wall time: 5.8 s


In [11]:
df_features.head(10)

Unnamed: 0,headline,text,n_sentences_text,n_words_text,avg_words_sent_text,avg_word_size_text,ttr_text,n_words_headline,avg_word_size_headline,ttr_headline,label
0,sofía castro y alejandro peña pretelini: una i...,sofía castro y alejandro peña pretelini: una i...,5,123,24.6,6.398374,69.105691,8,7.5,100.0,1
1,un paso más cerca de hacer los exámenes 'online',un paso más cerca de hacer los exámenes 'onlin...,8,224,28.0,7.205357,77.232143,5,5.8,100.0,1
2,esto es lo que los científicos realmente piens...,esto es lo que los científicos realmente piens...,29,467,16.103448,7.573876,64.668094,4,9.5,100.0,1
3,inicia impresión de boletas para elección pres...,inicia impresión de boletas para elección pres...,10,167,16.7,7.964072,63.473054,5,8.4,100.0,1
4,a *number* día del mundial,a *number* día del mundial\nfifa.com sigue la ...,4,57,14.25,7.368421,84.210526,3,5.333333,100.0,1
5,interpol ordena detención inmediata de osorio ...,interpol ordena detención inmediata de osorio ...,3,56,18.666667,7.732143,89.285714,8,7.75,100.0,0
6,"""los ninis"" más ricos y poderosos del país: hi...","""los ninis"" más ricos y poderosos del país: hi...",5,81,16.2,6.358025,81.481481,7,4.857143,100.0,0
7,gobierno de alfredo del mazo inició con récord...,"para todo sacan lo del populismo, ni siquiera ...",11,183,16.636364,6.677596,79.234973,6,6.5,100.0,1
8,conapred investiga acto de racismo en el pumas...,conapred investiga acto de racismo en el pumas...,6,105,17.5,6.419048,74.285714,7,6.0,100.0,1
9,cristiano ronaldo acepta dos años de prisión,cristiano ronaldo acepta dos años de prisión\n...,16,270,16.875,6.918519,59.259259,6,6.0,100.0,1


It is interesting because the type token ratio seems to be 100 in every headline, except this 31 headlines:
Looking at them we realize there are the double type token ration less than 100 in fake news than the real ones.

In [47]:
df_features[df_features['ttr_headline'] < 100.0].groupby('label').count()

Unnamed: 0_level_0,headline,text,n_sentences_text,n_words_text,avg_words_sent_text,avg_word_size_text,ttr_text,n_words_headline,avg_word_size_headline,ttr_headline
label,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
0,20,20,20,20,20,20,20,20,20,20
1,11,11,11,11,11,11,11,11,11,11
