# Clean and feature extraction v4

## Clean text, extract stylometric, lexical and complexity features and create a new csv

## We are using `spacy`: The NLP *Ruby on Rails* 

[spacy](http://www.spacy.io/) is a library of natural language processing, robust, fast, easy to install and to use. It can be used with other NLP and Deep Learning Libraries.

With its pre-trained models in spanish language, we can operate the typical NLP jobs: Sentences segmentation, tokenization, POS tag, etc...

We are going to use the `es_core_news_lg` pre-trained model to make pos tagging:

## Also extracting headline features

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('../data/corpus_spanish.csv')

In [3]:
df.head()

Unnamed: 0,Id,Category,Topic,Source,Headline,Text,Link
0,641,True,Entertainment,Caras,Sofía Castro y Alejandro Peña Pretelini: una i...,Sofía Castro y Alejandro Peña Pretelini: una i...,https://www.caras.com.mx/sofia-castro-alejandr...
1,6,True,Education,Heraldo,Un paso más cerca de hacer los exámenes 'online',Un paso más cerca de hacer los exámenes 'onlin...,https://www.heraldo.es/noticias/suplementos/he...
2,141,True,Science,HUFFPOST,Esto es lo que los científicos realmente piens...,Esto es lo que los científicos realmente piens...,https://www.huffingtonpost.com/entry/scientist...
3,394,True,Politics,El financiero,Inicia impresión de boletas para elección pres...,Inicia impresión de boletas para elección pres...,http://www.elfinanciero.com.mx/elecciones-2018...
4,139,True,Sport,FIFA,A *NUMBER* día del Mundial,A *NUMBER* día del Mundial\nFIFA.com sigue la ...,https://es.fifa.com/worldcup/news/a-1-dia-del-...


In [4]:
df.shape

(971, 7)

In [5]:
df.dtypes

Id           int64
Category    object
Topic       object
Source      object
Headline    object
Text        object
Link        object
dtype: object

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 971 entries, 0 to 970
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Id        971 non-null    int64 
 1   Category  971 non-null    object
 2   Topic     971 non-null    object
 3   Source    971 non-null    object
 4   Headline  971 non-null    object
 5   Text      971 non-null    object
 6   Link      971 non-null    object
dtypes: int64(1), object(6)
memory usage: 53.2+ KB


In [71]:
df.sample(5)

Unnamed: 0,Id,Category,Topic,Source,Headline,Text,Link
272,151,Fake,Science,Publimetro,Tendremos otra Era de Hielo en *NUMBER* según ...,Tendremos otra Era de Hielo en *NUMBER* según ...,https://www.publimetro.cl/cl/mundo/2016/10/05/...
157,534,True,Politics,El Universal,AMLO dispone de hasta *NUMBER* millones de pes...,AMLO dispone de hasta *NUMBER* millones de pes...,http://www.eluniversal.com.mx/elecciones-2018/...
228,98,Fake,Society,El Dizque,Un grupo de expertos descubre los pasos a segu...,Un grupo de expertos descubre los pasos a segu...,https://www.eldizque.com/un-grupo-de-expertos-...
448,166,True,Politics,El país,Túnez ya cuenta con la primera alcaldesa elect...,Túnez ya cuenta con la primera alcaldesa elect...,https://elpais.com/internacional/2018/07/03/ac...
280,504,True,Politics,Excelsior,La CDMX enfrenta contingencia por falta de agua,La CDMX enfrenta contingencia por falta de agu...,https://www.excelsior.com.mx/comunidad/la-cdmx...


In [76]:
text = df['Text'].iloc[280]

text

'La CDMX enfrenta contingencia por falta de agua\nLa presencia de un canal de alta presión y las altas temperaturas afectaron la distribución a *NUMBER* mil habitantes en *NUMBER* delegaciones\nVecinos de la delegación Tláhuac cerraron un tramo de la avenida del mismo nombre y Calle *NUMBER* para manifestar su inconformidad por la falta de agua potable que padecen en ésa y otras zonas de la Ciudad de México.\nEn las delegaciones Cuauhtémoc, Benito Juárez, Iztapalapa, Iztacalco, Venustiano Carranza, Azcapotzalco y Tlalpan se enfrenta una contingencia por escasez y por baja presión de agua.\nDe acuerdo con las autoridades, aproximadamente, *NUMBER* mil habitantes de las siete demarcaciones padecen la escasez debido a la presencia atípica de un canal de alta presión que originó que se dejaran de recibir *NUMBER* mil litros de agua por segundo del Sistema Cutzamala, además de que, por las altas temperaturas, en algunas zonas, se incrementó hasta *NUMBER* por ciento el consumo del líquido, 

In [77]:
text = re.sub(r"http\S+", "", text)
text = re.sub(r"@\S+", "", text)
text = re.sub("\n", " ", text)
text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
text = text.replace(r"*NUMBER*", "número")
text = text.replace(r"*PHONE*", "número")
text = text.replace(r"*EMAIL*", "email")
text = text.replace(r"*URL*", "url")
text = text.lower()
text

'la cdmx enfrenta contingencia por falta de agua la presencia de un canal de alta presión y las altas temperaturas afectaron la distribución a número mil habitantes en número delegaciones vecinos de la delegación tláhuac cerraron un tramo de la avenida del mismo nombre y calle número para manifestar su inconformidad por la falta de agua potable que padecen en ésa y otras zonas de la ciudad de méxico. en las delegaciones cuauhtémoc, benito juárez, iztapalapa, iztacalco, venustiano carranza, azcapotzalco y tlalpan se enfrenta una contingencia por escasez y por baja presión de agua. de acuerdo con las autoridades, aproximadamente, número mil habitantes de las siete demarcaciones padecen la escasez debido a la presencia atípica de un canal de alta presión que originó que se dejaran de recibir número mil litros de agua por segundo del sistema cutzamala, además de que, por las altas temperaturas, en algunas zonas, se incrementó hasta número por ciento el consumo del líquido, situación que me

In [44]:
from syltippy import syllabize

In [45]:
syllables= syllabize(text)

In [49]:
print(syllables)

(['la', ' u', 'nam', ' lim', 'pia', ' au', 'las ', 'de', ' a', 'lum', 'nos', ' re', 'za', 'ga', 'dos', ' a ', 'par', 'tir ', 'de ', 'nú', 'me', 'ro,', ' las ', 'nue', 'vas', ' re', 'glas ', 'de', ' in', 'gre', 'so ', 'y ', 'per', 'ma', 'nen', 'cia ', 'co', 'men', 'za', 'ron', ' a', ' a', 'li', 'ge', 'rar', ' la ', 'car', 'ga', ' ad', 'mi', 'nis', 'tra', 'ti', 'va', ' en', ' la', ' u', 'nam;', ' ter', 'mi', 'nan ', 'con ', 'nú', 'me', 'ro% ', 'de ', 'fó', 'si', 'les', ' en ', 'ca', 'si ', 'nú', 'me', 'ro', ' a', 'ños ', 'ciu', 'dad ', 'de ', 'mé', 'xi', 'co, ', 'nú', 'me', 'ro ', 'de ', 'sep', 'tiem', 'bre.-', ' a', 'go', 'bia', 'da ', 'du', 'ran', 'te ', 'dé', 'ca', 'das ', 'por', ' la', ' e', 'xis', 'ten', 'cia ', 'de ', 'cer', 'ca ', 'de ', 'me', 'dio ', 'mi', 'llón ', 'de', ' a', 'lum', 'nos ', 'que ', 'se', ' e', 'ter', 'ni', 'za', 'ban', ' en ', 'sus', ' au', 'las,', ' de ', 'los ', 'cua', 'les ', 'po', 'co ', 'más ', 'de ', 'nú', 'me', 'ro ', 'mil ', 'man', 'te', 'ní', 'an ', 'vi

In [51]:
n_syllables = syllables[-1]

2464

In [None]:
score = ((0.39 * n_words) / n_sents) + ((11.8 * n_syllables) / n_words) - 15.59

In [None]:
score = (0.39 * len(text.split()) / len(text.split('.')) ) + 11.8 * ( sum(list(map(lambda x: 1 if x in ["a","i","e","o","u","y","A","E","I","O","U","y"] else 0,text))) / len(text.split())) - 15.59

In [90]:
import nltk
import spacy
from nltk import FreqDist
from nltk.corpus import stopwords  
from nltk import word_tokenize, sent_tokenize  
from string import punctuation
from lexical_diversity import lex_div as ld

nlp = spacy.load('es_core_news_lg')

text = re.sub(r"http\S+", "", text)
text = re.sub(r"@\S+", "", text)
text = re.sub("\n", " ", text)
text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
text = text.replace(r"*NUMBER*", "número")
text = text.replace(r"*PHONE*", "número")
text = text.replace(r"*EMAIL*", "email")
text = text.replace(r"*URL*", "url")
text = text.lower()

doc = nlp(text)

syllables= syllabize(text)
list_tokens = []
list_tags = []
list_entities = []
n_sents = 0

for entity in doc.ents:
    list_entities.append(entity.label_)

for sentence in doc.sents:
    n_sents += 1
    for token in sentence:
        list_tokens.append(token.text)
        list_tags.append(token.pos_)
        
n_entities = len(list_entities)
n_tags = nltk.Counter(list_tags)
fdist = FreqDist(list_tokens)
n_syllables = syllables[-1]
n_words = len(list_tokens)
        
# complexity features
avg_word_sentence = round(n_words / n_sents, 2)
avg_word_size = round(sum(len(word) for word in list_tokens) / n_words, 2)
avg_syllables_word = round(n_syllables / n_words, 2)
unique_words = round((len(fdist.hapaxes()) / n_words) * 100, 2)
ttr = round(ld.ttr(list_tokens) * 100, 2)
mltd = round(ld.mtld(list_tokens), 2)

# readability spanish test
i_fernandez_huerta = round(206.84 - (60 * avg_syllables_word) - (1.02 * avg_word_sentence), 2)

# stylometric features
entity_ratio = (n_entities / n_words) * 100
propn_ratio = (n_tags['PROPN'] / n_words) * 100 
noun_ratio = (n_tags['NOUN'] / n_words) * 100 
adp_ratio = (n_tags['ADP'] / n_words) * 100
det_ratio = (n_tags['DET'] / n_words) * 100
punct_ratio = (n_tags['PUNCT'] / n_words) * 100 
pron_ratio = (n_tags['PRON'] / n_words) * 100
verb_ratio = (n_tags['VERB'] / n_words) * 100
adv_ratio = (n_tags['ADV'] / n_words) * 100

print(n_words, n_sents, n_syllables, avg_word_sentences, avg_syllables_word, i_fernandez_huerta)


553 13 1048 42.54 1.9 49.45


## Apply it to the full corpus with iterrows()

In [20]:
%%time

import itertools
import pandas as pd
import numpy as np

import nltk
import spacy
from nltk import FreqDist
from sklearn.preprocessing import LabelEncoder
from lexical_diversity import lex_div as ld
nlp = spacy.load('es_core_news_lg')

df = pd.read_csv('../data/corpus_spanish.csv')

labelencoder = LabelEncoder()
df['Label'] = labelencoder.fit_transform(df['Category'])

# empty lists and df
df_features = pd.DataFrame()
list_text = []
list_nsentences = []
list_nwords = []
list_words_sent = []
list_word_size = []
list_unique_words = []
list_ttr = []
list_mltd = []
list_entity_ratio = []
list_nquotes = []
list_quotes_ratio = []
list_propn_ratio = [] 
list_noun_ratio = []
list_adp_ratio = []
list_det_ratio = []
list_punct_ratio = []
list_pron_ratio = []
list_verb_ratio = []
list_adv_ratio = []
list_sym_ratio = []

list_headline = []
list_words_h = []
list_word_size_h = []
list_ttr_h = []
list_mltd_h = []
list_unique_words_h = []

# df iteration
for n, row in df.iterrows():
    
    ## headline ##
    text_h = df['Headline'].iloc[n]
    text_h = text_h.replace(r"http\S+", "")
    text_h = text_h.replace(r"http", "")
    text_h = text_h.replace(r"@\S+", "")
    text_h = text_h.replace(r"(?<!\n)\n(?!\n)", " ")
    text_h = text_h.lower()
    doc_h = nlp(text_h)

    list_tokens_h = []
    list_tags_h = []
    n_sents_h = 0

    for sentence_h in doc_h.sents:
        n_sents_h += 1
        for token in sentence_h:
            list_tokens_h.append(token.text)

    fdist_h = FreqDist(list_tokens_h)
    
    # headline complexity features
    n_words_h = len(list_tokens_h)
    word_size_h = sum(len(word) for word in list_tokens_h) / n_words_h
    unique_words_h = (len(fdist_h.hapaxes()) / n_words_h) * 100
    ttr_h = ld.ttr(list_tokens_h) * 100
    mltd_h = ld.mtld(list_tokens_h)
    
    ## text content##   
    text = df['Text'].iloc[n]  
    text = text.replace(r"http\S+", "")
    text = text.replace(r"http", "")
    text = text.replace(r"@\S+", "")
    text = text.replace(r"(?<!\n)\n(?!\n)", " ")
    text = text.lower()
    doc = nlp(text)

    list_tokens = []
    list_pos = []
    list_tag = []
    list_entities = []
    n_sents = 0
    
    for entity in doc.ents:
        list_entities.append(entity.label_)

    for sentence in doc.sents:
        n_sents += 1
        for token in sentence:
            list_tokens.append(token.text)
            list_pos.append(token.pos_)
            list_tag.append(token.tag_)
    
    n_entities = len(list_entities)
    n_pos = nltk.Counter(list_pos)
    n_tag = nltk.Counter(list_tag)
    fdist = FreqDist(list_tokens)

    # complexity features
    n_words = len(list_tokens)
    avg_word_sentences = (float(n_words) / n_sents)
    word_size = sum(len(word) for word in list_tokens) / n_words
    unique_words = (len(fdist.hapaxes()) / n_words) * 100
    ttr = ld.ttr(list_tokens) * 100
    mltd = ld.mtld(list_tokens)

    # stylometric features
    entity_ratio = (n_entities / n_words) * 100
    n_quotes = n_tag['PUNCT__PunctType=Quot']
    quotes_ratio = (n_quotes / n_words) * 100
    propn_ratio = (n_pos['PROPN'] / n_words) * 100 
    noun_ratio = (n_pos['NOUN'] / n_words) * 100 
    adp_ratio = (n_pos['ADP'] / n_words) * 100
    det_ratio = (n_pos['DET'] / n_words) * 100
    punct_ratio = (n_pos['PUNCT'] / n_words) * 100 
    pron_ratio = (n_pos['PRON'] / n_words) * 100
    verb_ratio = (n_pos['VERB'] / n_words) * 100
    adv_ratio = (n_pos['ADV'] / n_words) * 100
    sym_ratio = (n_tag['SYM'] / n_words) * 100
    
    # appending on lists
    list_text.append(text)
    list_nsentences.append(n_sents)
    list_nwords.append(n_words)
    list_words_sent.append(avg_word_sentences)
    list_word_size.append(word_size)
    list_unique_words.append(unique_words)
    list_ttr.append(ttr)
    list_mltd.append(mltd)
    list_headline.append(text_h)
    list_words_h.append(n_words_h)
    list_word_size_h.append(word_size_h)
    list_unique_words_h.append(unique_words_h)
    list_ttr_h.append(ttr_h)
    list_mltd_h.append(mltd_h)
    list_entity_ratio.append(entity_ratio)
    list_nquotes.append(n_quotes)
    list_quotes_ratio.append(quotes_ratio)
    list_propn_ratio.append(propn_ratio)
    list_noun_ratio.append(noun_ratio)
    list_adp_ratio.append(adp_ratio)
    list_det_ratio.append(det_ratio)
    list_punct_ratio.append(punct_ratio)
    list_pron_ratio.append(pron_ratio)
    list_verb_ratio.append(verb_ratio)
    list_adv_ratio.append(adv_ratio)
    list_sym_ratio.append(sym_ratio)
    
# dataframe
df_features['text'] = list_text
df_features['headline'] = list_headline
df_features['n_sents'] = list_nsentences
df_features['n_words'] = list_nwords
df_features['avg_words_sents'] = list_words_sent
df_features['word_size'] = list_word_size
df_features['unique_words'] = list_unique_words
df_features['ttr'] = list_ttr
df_features['mltd'] = list_mltd
df_features['n_words_h'] = list_words_h
df_features['word_size_h'] = list_word_size_h
df_features['unique_words_h'] = list_unique_words_h
df_features['ttr_h'] = list_ttr_h
df_features['mltd_h'] = list_mltd_h
df_features['entity_ratio'] = list_entity_ratio
df_features['n_quotes'] = list_nquotes
df_features['quotes_ratio'] = list_quotes_ratio
df_features['propn_ratio'] = list_propn_ratio
df_features['noun_ratio'] = list_noun_ratio
df_features['adp_ratio'] = list_adp_ratio
df_features['det_ratio'] = list_det_ratio
df_features['punct_ratio'] = list_punct_ratio
df_features['pron_ratio'] = list_pron_ratio
df_features['verb_ratio'] = list_verb_ratio
df_features['adv_ratio'] = list_adv_ratio
df_features['sym_ratio'] = list_sym_ratio
df_features['label'] = df['Label']

df_features.to_csv('../data/spanish_corpus_features_v4.csv', encoding = 'utf-8', index = False)

CPU times: user 1min 11s, sys: 1.52 s, total: 1min 12s
Wall time: 1min 12s


In [21]:
df_features.head(10)

Unnamed: 0,text,headline,n_sents,n_words,avg_words_sents,word_size,unique_words,ttr,mltd,n_words_h,...,propn_ratio,noun_ratio,adp_ratio,det_ratio,punct_ratio,pron_ratio,verb_ratio,adv_ratio,sym_ratio,label
0,sofía castro y alejandro peña pretelini: una i...,sofía castro y alejandro peña pretelini: una i...,8,252,31.5,4.190476,34.920635,50.0,58.600559,12,...,15.079365,15.079365,13.888889,11.507937,9.52381,5.555556,6.349206,3.174603,0.0,1
1,un paso más cerca de hacer los exámenes 'onlin...,un paso más cerca de hacer los exámenes 'online',9,486,54.0,4.255144,32.716049,44.238683,41.283136,11,...,16.666667,15.226337,12.345679,11.111111,18.930041,1.234568,3.292181,1.646091,1.851852,1
2,esto es lo que los científicos realmente piens...,esto es lo que los científicos realmente piens...,31,980,31.612903,4.815306,26.020408,38.571429,80.551467,12,...,4.795918,17.857143,13.979592,12.755102,11.836735,2.55102,10.306122,4.693878,0.510204,1
3,inicia impresión de boletas para elección pres...,inicia impresión de boletas para elección pres...,11,369,33.545455,4.728997,22.764228,37.398374,50.995314,7,...,4.065041,22.493225,18.157182,15.176152,11.111111,2.710027,7.317073,0.271003,1.355014,1
4,a *number* día del mundial\nfifa.com sigue la ...,a *number* día del mundial,5,130,26.0,4.461538,48.461538,60.0,47.081602,7,...,14.615385,14.615385,17.692308,13.846154,9.230769,3.076923,4.615385,1.538462,2.307692,1
5,interpol ordena detención inmediata de osorio ...,interpol ordena detención inmediata de osorio ...,4,116,29.0,4.793103,48.275862,63.793103,63.02595,13,...,12.068966,16.37931,18.965517,12.931034,12.068966,2.586207,5.172414,0.0,0.862069,0
6,"""los ninis"" más ricos y poderosos del país: hi...","""los ninis"" más ricos y poderosos del país: hi...",5,211,42.2,3.668246,35.07109,50.236967,50.913142,14,...,6.635071,16.113744,11.848341,11.374408,9.004739,2.369668,6.161137,5.687204,0.473934,0
7,"para todo sacan lo del populismo, ni siquiera ...",gobierno de alfredo del mazo inició con récord...,12,416,34.666667,4.139423,32.932692,46.875,66.856042,11,...,6.009615,17.067308,15.625,9.855769,12.980769,4.807692,10.576923,2.403846,0.480769,1
8,conapred investiga acto de racismo en el pumas...,conapred investiga acto de racismo en el pumas...,6,227,37.833333,4.101322,35.682819,51.54185,66.635851,10,...,10.572687,15.859031,13.656388,11.013216,13.215859,3.0837,9.251101,2.643172,0.0,1
9,cristiano ronaldo acepta dos años de prisión\n...,cristiano ronaldo acepta dos años de prisión,17,590,34.705882,4.276271,24.576271,35.932203,46.584855,7,...,5.084746,17.966102,14.237288,12.711864,10.677966,1.864407,8.474576,3.050847,2.372881,1
