# Cleaning and feature extraction v3

## Clean text, extract stylometric features and create a new dataset

In [3]:
import pandas as pd

In [4]:
df = pd.read_csv('../data/corpus_spanish.csv')

In [5]:
df.head()

| | Id | Category | Topic | Source | Headline | Text | Link |
|---|---|---|---|---|---|---|---|
| 0 | 641 | True | Entertainment | Caras | Sofía Castro y Alejandro Peña Pretelini: una i... | Sofía Castro y Alejandro Peña Pretelini: una i... | https://www.caras.com.mx/sofia-castro-alejandr... |
| 1 | 6 | True | Education | Heraldo | Un paso más cerca de hacer los exámenes 'online' | Un paso más cerca de hacer los exámenes 'onlin... | https://www.heraldo.es/noticias/suplementos/he... |
| 2 | 141 | True | Science | HUFFPOST | Esto es lo que los científicos realmente piens... | Esto es lo que los científicos realmente piens... | https://www.huffingtonpost.com/entry/scientist... |
| 3 | 394 | True | Politics | El financiero | Inicia impresión de boletas para elección pres... | Inicia impresión de boletas para elección pres... | http://www.elfinanciero.com.mx/elecciones-2018... |
| 4 | 139 | True | Sport | FIFA | A *NUMBER* día del Mundial | A *NUMBER* día del Mundial\nFIFA.com sigue la ... | https://es.fifa.com/worldcup/news/a-1-dia-del-... |


In [6]:
df.shape

(971, 7)

In [7]:
df.dtypes

Id           int64
Category    object
Topic       object
Source      object
Headline    object
Text        object
Link        object
dtype: object

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 971 entries, 0 to 970
Data columns (total 7 columns):
Id          971 non-null int64
Category    971 non-null object
Topic       971 non-null object
Source      971 non-null object
Headline    971 non-null object
Text        971 non-null object
Link        971 non-null object
dtypes: int64(1), object(6)
memory usage: 53.2+ KB


## We are using `spaCy`: the *Ruby on Rails* of NLP

[spaCy](http://www.spacy.io/) is a natural language processing library that is robust, fast, and easy to install and use. It can be combined with other NLP and deep learning libraries.

With its pre-trained Spanish models, we can run the typical NLP tasks: sentence segmentation, tokenization, POS tagging, and so on.

We are going to use the `es_core_news_lg` pre-trained model for POS tagging:

In [9]:
import spacy

In [15]:
# load the pre-trained Spanish model

nlp = spacy.load('es_core_news_lg')

In [21]:
text = df['Text'].iloc[0]
print(text)

Sofía Castro y Alejandro Peña Pretelini: una inigualable relación de hermanos
La actriz compartió a través de sus redes sociales una tierna imagen con el hijo de Enrique Peña Nieto
Los hermanos suelen convertirse en grandes amistades para muchos, esto pasa con Sofía Castro y Alejandro Peña Pretellini, quienes no dudan en compartir la buena amistad que hay entre ellos.
A través de sus redes sociales, la actriz de *NUMBER* años compartió una fotografía junto al hijo de Enrique Peña Nieto y la acompañó de un cariñoso mensaje, ?Lo mejor de mi vida?, escribió.
El hijo de Enrique Peña Nieto y la hija de Angélica Rivera, más que hermanos, se han convertido en grandes amigos y las fotos que comparten en sus respectivas cuentas en Instagram muestran lo bien que se llevan.
Si bien la hija mayor de la primera dama se lleva de maravilla con los tres hijos del esposo de su mamá, la poca diferencia de edad con Alejandro Peña, con quien se lleva sólo un año, hace que los jóvenes se entiendan a la per

## Clean text for Spanish

In [22]:
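def text_clean(text):
    # NOTE: str.replace matches its first argument literally, so the
    # regex-style patterns below are not interpreted as regular expressions.
    # In practice only the literal "http" removal and the lowercasing take
    # effect, which is what the outputs recorded in this notebook reflect.
    text = text.replace(r"http\S+", "")
    text = text.replace(r"http", "")
    text = text.replace(r"@\S+", "")
    text = text.replace(r"(?<!\n)\n(?!\n)", " ")
    text = text.lower()
    
    # run the spaCy pipeline (tokenization, POS tagging, sentence segmentation)
    doc = nlp(text)
    
    return doc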
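
The patterns passed to `str.replace` in `text_clean` look like regular expressions, but `str.replace` only matches literal substrings. A minimal sketch of a regex-based variant follows, assuming the intent was to strip URLs and @-mentions and to join single line breaks. The name `clean_text_regex` and the sample string are hypothetical, and swapping this in would change the tokenization and the feature values recorded in this notebook.

```python
import re

# Hypothetical helper, not part of the original notebook.
URL_RE = re.compile(r"http\S+")                   # http:// and https:// links
MENTION_RE = re.compile(r"@\S+")                  # @-mentions
SINGLE_NEWLINE_RE = re.compile(r"(?<!\n)\n(?!\n)")  # a lone line break


def clean_text_regex(text: str) -> str:
    """Strip URLs and @-mentions, join single line breaks, lowercase."""
    text = URL_RE.sub("", text)
    text = MENTION_RE.sub("", text)
    text = SINGLE_NEWLINE_RE.sub(" ", text)
    return text.lower()


sample = "Visita http://example.com ahora\nGracias @usuario\n\nFin"
print(repr(clean_text_regex(sample)))
```

The returned string can be passed to `nlp(...)` exactly as before; blank-line paragraph breaks (`\n\n`) are left in place by the third pattern.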

In [23]:
doc = text_clean(text)

### We can easily iterate over the sentences and walk through their tokens to access the morpho-syntactic information:

In [26]:
for sentence in doc.sents:
    print("Oración: {}".format(sentence))
    for token in sentence:
        # surface form and coarse-grained POS tag
        print(token.text, token.pos_)

Oración: sofía castro y alejandro peña pretelini: una inigualable relación de hermanos

sofía PROPN
castro PROPN
y CCONJ
alejandro PROPN
peña PROPN
pretelini PROPN
: PUNCT
una DET
inigualable ADJ
relación NOUN
de ADP
hermanos NOUN

 SPACE
Oración: la actriz compartió a través de sus redes sociales una tierna imagen con el hijo de enrique peña nieto

la DET
actriz NOUN
compartió VERB
a ADP
través INTJ
de ADP
sus DET
redes NOUN
sociales ADJ
una DET
tierna ADJ
imagen NOUN
con ADP
el DET
hijo NOUN
de ADP
enrique PROPN
peña PROPN
nieto PROPN

 SPACE
Oración: los hermanos suelen convertirse en grandes amistades para muchos, esto pasa con sofía castro y alejandro peña pretellini, quienes no dudan en compartir la buena amistad que hay entre ellos.

los DET
hermanos NOUN
suelen VERB
convertirse VERB
en ADP
grandes ADJ
amistades NOUN
para ADP
muchos PRON
, PUNCT
esto PRON
pasa VERB
con ADP
sofía PROPN
castro PROPN
y CCONJ
alejandro PROPN
peña PROPN
pretellini PROPN
, PUNCT
quienes PRON
no ADV
du

In [73]:
import nltk
from nltk import FreqDist
from nltk.corpus import stopwords  
from nltk import word_tokenize, sent_tokenize  
from string import punctuation

from lexical_diversity import lex_div as ld

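# same steps as text_clean; note that str.replace matches these patterns
# literally, not as regular expressions
text = text.replace(r"http\S+", "")
text = text.replace(r"http", "")
text = text.replace(r"@\S+", "")
text = text.replace(r"(?<!\n)\n(?!\n)", " ")
text = text.lower()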

doc = nlp(text)

list_tokens = []
list_tags = []
n_sents = 0

for sentence in doc.sents:
    n_sents += 1
    for token in sentence:
        list_tokens.append(token.text)
        list_tags.append(token.pos_)

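n_tags = FreqDist(list_tags)  # POS tag counts (FreqDist is a Counter subclass)
fdist = FreqDist(list_tokens)
        
# complexity features
n_words = len(list_tokens)  # counts every token, including punctuation and whitespace tokens
avg_word_sentences = n_words / n_sents
word_size = sum(len(word) for word in list_tokens) / n_words
unique_words = (len(fdist.hapaxes()) / n_words) * 100  # % of tokens that occur exactly once
ttr = ld.ttr(list_tokens) * 100  # type-token ratio, as a percentage
mltd = ld.mtld(list_tokens)  # MTLD score (variable name kept as in the original)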

# lexical features
propn_ratio = (n_tags['PROPN'] / n_words) * 100 
noun_ratio = (n_tags['NOUN'] / n_words) * 100 
adp_ratio = (n_tags['ADP'] / n_words) * 100
det_ratio = (n_tags['DET'] / n_words) * 100
punct_ratio = (n_tags['PUNCT'] / n_words) * 100 
pron_ratio = (n_tags['PRON'] / n_words) * 100
verb_ratio = (n_tags['VERB'] / n_words) * 100
adv_ratio = (n_tags['ADV'] / n_words) * 100

print(n_words, n_sents, avg_word_sentences, word_size, unique_words, ttr, mltd, propn_ratio, noun_ratio, adp_ratio, det_ratio, punct_ratio, 
      pron_ratio, verb_ratio, adv_ratio)

252 8 31.5 4.190476190476191 34.92063492063492 50.0 58.60055865921788 15.079365079365079 15.079365079365079 13.88888888888889 11.507936507936508 9.523809523809524 5.555555555555555 6.349206349206349 3.1746031746031744
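
To make the arithmetic behind these numbers easier to follow, here is a minimal standard-library sketch of a subset of the features. It assumes TTR is the number of distinct tokens divided by the total number of tokens, and that the "unique words" figure is the share of tokens occurring exactly once. The function name `stylometric_features` and the toy token list are hypothetical, and MTLD is left out.

```python
from collections import Counter

POS_TAGS = ("PROPN", "NOUN", "ADP", "DET", "PUNCT", "PRON", "VERB", "ADV")


def stylometric_features(tokens, tags):
    """Compute a subset of the notebook's features from parallel token/tag lists."""
    n_words = len(tokens)
    if n_words == 0:
        raise ValueError("need at least one token")

    token_counts = Counter(tokens)
    tag_counts = Counter(tags)
    # hapax legomena: token types that occur exactly once
    hapaxes = sum(1 for count in token_counts.values() if count == 1)

    features = {
        "n_words": n_words,
        "word_size": sum(len(t) for t in tokens) / n_words,
        "unique_words": hapaxes / n_words * 100,
        "ttr": len(token_counts) / n_words * 100,
    }
    # share of each POS tag, as a percentage of all tokens
    for tag in POS_TAGS:
        features[tag.lower() + "_ratio"] = tag_counts[tag] / n_words * 100
    return features


# Toy input (hypothetical), standing in for spaCy's token.text / token.pos_
tokens = ["la", "actriz", "compartió", "la", "imagen", "."]
tags = ["DET", "NOUN", "VERB", "DET", "NOUN", "PUNCT"]
feats = stylometric_features(tokens, tags)
print(feats["ttr"], feats["unique_words"], feats["det_ratio"])
```

Every ratio uses the total token count as its denominator, so punctuation and whitespace tokens lower all the other ratios, as in the cell above.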


## Apply it to the full corpus with iterrows()

In [80]:
%%time

import itertools
import pandas as pd
import numpy as np

import nltk
import spacy
from nltk import FreqDist
from sklearn.preprocessing import LabelEncoder
from lexical_diversity import lex_div as ld
nlp = spacy.load('es_core_news_lg')

# encode Category as integers; LabelEncoder assigns codes in sorted order of the class values
labelencoder = LabelEncoder()
df['Label'] = labelencoder.fit_transform(df['Category'])

# empty lists and df
df_features = pd.DataFrame()
list_text = []
list_nsentences = []
list_nwords = []
list_words_sent = []
list_word_size = []
list_unique_words = []
list_ttr = []
list_mltd = []
list_propn_ratio = [] 
list_noun_ratio = []
list_adp_ratio = []
list_det_ratio = []
list_punct_ratio = []
list_pron_ratio = []
list_verb_ratio = []
list_adv_ratio = []

# df iteration
for _, row in df.iterrows():
    text = row['Text']
    
    # NOTE: str.replace matches these patterns literally, not as regular expressions
    text = text.replace(r"http\S+", "")
    text = text.replace(r"http", "")
    text = text.replace(r"@\S+", "")
    text = text.replace(r"(?<!\n)\n(?!\n)", " ")
    text = text.lower()

    doc = nlp(text)

    list_tokens = []
    list_tags = []
    n_sents = 0

    for sentence in doc.sents:
        n_sents += 1
        for token in sentence:
            list_tokens.append(token.text)
            list_tags.append(token.pos_)

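    n_tags = FreqDist(list_tags)  # POS tag counts (FreqDist is a Counter subclass)
    fdist = FreqDist(list_tokens)

    # complexity features
    n_words = len(list_tokens)  # counts every token, including punctuation and whitespace tokens
    avg_word_sentences = n_words / n_sents
    word_size = sum(len(word) for word in list_tokens) / n_words
    unique_words = (len(fdist.hapaxes()) / n_words) * 100  # % of tokens that occur exactly once
    ttr = ld.ttr(list_tokens) * 100  # type-token ratio, as a percentage
    mltd = ld.mtld(list_tokens)  # MTLD score (variable name kept as in the original)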

    # lexical features
    propn_ratio = (n_tags['PROPN'] / n_words) * 100 
    noun_ratio = (n_tags['NOUN'] / n_words) * 100 
    adp_ratio = (n_tags['ADP'] / n_words) * 100
    det_ratio = (n_tags['DET'] / n_words) * 100
    punct_ratio = (n_tags['PUNCT'] / n_words) * 100 
    pron_ratio = (n_tags['PRON'] / n_words) * 100
    verb_ratio = (n_tags['VERB'] / n_words) * 100
    adv_ratio = (n_tags['ADV'] / n_words) * 100
    
    # appending on lists
    list_text.append(text)
    list_nsentences.append(n_sents)
    list_nwords.append(n_words)
    list_words_sent.append(avg_word_sentences)
    list_word_size.append(word_size)
    list_unique_words.append(unique_words)
    list_ttr.append(ttr)
    list_mltd.append(mltd)
    list_propn_ratio.append(propn_ratio)
    list_noun_ratio.append(noun_ratio)
    list_adp_ratio.append(adp_ratio)
    list_det_ratio.append(det_ratio)
    list_punct_ratio.append(punct_ratio)
    list_pron_ratio.append(pron_ratio)
    list_verb_ratio.append(verb_ratio)
    list_adv_ratio.append(adv_ratio)
    
# dataframe
df_features['text'] = list_text
df_features['n_sents'] = list_nsentences
df_features['n_words'] = list_nwords
df_features['avg_words_sents'] = list_words_sent
df_features['word_size'] = list_word_size
df_features['unique_words'] = list_unique_words
df_features['ttr'] = list_ttr
df_features['mltd'] = list_mltd
df_features['propn_ratio'] = list_propn_ratio
df_features['noun_ratio'] = list_noun_ratio
df_features['adp_ratio'] = list_adp_ratio
df_features['det_ratio'] = list_det_ratio
df_features['punct_ratio'] = list_punct_ratio
df_features['pron_ratio'] = list_pron_ratio
df_features['verb_ratio'] = list_verb_ratio
df_features['adv_ratio'] = list_adv_ratio
df_features['label'] = df['Label']

df_features.to_csv('../data/spanish_corpus_features_v3.csv', encoding = 'utf-8', index = False)

CPU times: user 57.5 s, sys: 1.45 s, total: 59 s
Wall time: 59 s
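
The full-corpus loop above keeps sixteen parallel lists in sync. One alternative, sketched here under stated assumptions, is to collect one dict per document and build the table from the list of dicts. `toy_features` and its whitespace tokenizer are hypothetical stand-ins for the spaCy-based extraction, used only so the snippet runs without the model, and the two-item `corpus` is made up for illustration.

```python
import csv
import io


def toy_features(text):
    # Hypothetical stand-in for the spaCy-based feature extraction.
    lowered = text.lower()
    tokens = lowered.split()
    n_words = len(tokens)
    return {
        "text": lowered,
        "n_words": n_words,
        "word_size": sum(len(t) for t in tokens) / n_words,
    }


# (text, label) pairs standing in for the rows of the corpus DataFrame
corpus = [("Un paso más cerca", 1), ("Interpol ordena detención", 0)]

rows = []
for text, label in corpus:
    row = toy_features(text)
    row["label"] = label
    rows.append(row)  # one dict per document keeps all columns aligned

# Serialize the rows as CSV using the dict keys as the header
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

With pandas, `pd.DataFrame(rows)` builds the equivalent table from the same list of dicts, so the column names are defined once, where each feature is computed.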


In [81]:
df_features.head(10)

| | text | n_sents | n_words | avg_words_sents | word_size | unique_words | ttr | mltd | propn_ratio | noun_ratio | adp_ratio | det_ratio | punct_ratio | pron_ratio | verb_ratio | adv_ratio | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | sofía castro y alejandro peña pretelini: una i... | 8 | 252 | 31.5 | 4.190476 | 34.920635 | 50.0 | 58.600559 | 15.079365 | 15.079365 | 13.888889 | 11.507937 | 9.52381 | 5.555556 | 6.349206 | 3.174603 | 1 |
| 1 | un paso más cerca de hacer los exámenes 'onlin... | 9 | 486 | 54.0 | 4.255144 | 32.716049 | 44.238683 | 41.283136 | 16.666667 | 15.226337 | 12.345679 | 11.111111 | 18.930041 | 1.234568 | 3.292181 | 1.646091 | 1 |
| 2 | esto es lo que los científicos realmente piens... | 31 | 980 | 31.612903 | 4.815306 | 26.020408 | 38.571429 | 80.551467 | 4.795918 | 17.857143 | 13.979592 | 12.755102 | 11.836735 | 2.55102 | 10.306122 | 4.693878 | 1 |
| 3 | inicia impresión de boletas para elección pres... | 11 | 369 | 33.545455 | 4.728997 | 22.764228 | 37.398374 | 50.995314 | 4.065041 | 22.493225 | 18.157182 | 15.176152 | 11.111111 | 2.710027 | 7.317073 | 0.271003 | 1 |
| 4 | a *number* día del mundial\nfifa.com sigue la ... | 5 | 130 | 26.0 | 4.461538 | 48.461538 | 60.0 | 47.081602 | 14.615385 | 14.615385 | 17.692308 | 13.846154 | 9.230769 | 3.076923 | 4.615385 | 1.538462 | 1 |
| 5 | interpol ordena detención inmediata de osorio ... | 4 | 116 | 29.0 | 4.793103 | 48.275862 | 63.793103 | 63.02595 | 12.068966 | 16.37931 | 18.965517 | 12.931034 | 12.068966 | 2.586207 | 5.172414 | 0.0 | 0 |
| 6 | "los ninis" más ricos y poderosos del país: hi... | 5 | 211 | 42.2 | 3.668246 | 35.07109 | 50.236967 | 50.913142 | 6.635071 | 16.113744 | 11.848341 | 11.374408 | 9.004739 | 2.369668 | 6.161137 | 5.687204 | 0 |
| 7 | para todo sacan lo del populismo, ni siquiera ... | 12 | 416 | 34.666667 | 4.139423 | 32.932692 | 46.875 | 66.856042 | 6.009615 | 17.067308 | 15.625 | 9.855769 | 12.980769 | 4.807692 | 10.576923 | 2.403846 | 1 |
| 8 | conapred investiga acto de racismo en el pumas... | 6 | 227 | 37.833333 | 4.101322 | 35.682819 | 51.54185 | 66.635851 | 10.572687 | 15.859031 | 13.656388 | 11.013216 | 13.215859 | 3.0837 | 9.251101 | 2.643172 | 1 |
| 9 | cristiano ronaldo acepta dos años de prisión\n... | 17 | 590 | 34.705882 | 4.276271 | 24.576271 | 35.932203 | 46.584855 | 5.084746 | 17.966102 | 14.237288 | 12.711864 | 10.677966 | 1.864407 | 8.474576 | 3.050847 | 1 |
