# Exploratory Data Analysis

This notebook contains the code and visualizations for the exploratory data analysis (EDA) phase of the project. It examines the structure, size, and value distributions of the dataset, and covers univariate analysis, missing-value treatment, language detection, and data cleaning. The initial insights gathered here guide the next steps of the project, such as topic modeling and model selection.

## Libraries

In [None]:
!pip3 install nltk

In [None]:
!pip3 install openpyxl

In [2]:
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

import re
from unicodedata import normalize
import nltk
import ssl

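# Workaround for SSL certificate errors when downloading NLTK data.
# Note that it disables HTTPS certificate verification for the whole session.
try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

# Called without arguments, nltk.download() opens the interactive downloader
nltk.download()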

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [76]:
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/pautrejo/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
!git clone https://github.com/jhliu17/emoji-to-lang.git
# Each `!` command runs in its own subshell, so `cd` and the install are chained on one line
!cd emoji-to-lang && python3 setup.py install

In [None]:
cd emoji-to-lang

In [None]:
!python3 setup.py install

In [1]:
import emoji

In [None]:
!pip3 install unidecode

In [73]:
import unidecode

In [None]:
!pip3 install langdetect

In [3]:
from langdetect import detect

In [None]:
!python3 -m spacy download es_core_news_lg

In [None]:
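import spacy
import spacy_spanish_lemmatizer
# Load the Spanish model downloaded above; the lemmatizer cell calls it as `nlp`
nlp = spacy.load('es_core_news_lg')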

## Dataset

In [4]:
ruta = 'fakenews_socialmedia.xlsx'
ds = pd.read_excel(ruta)
ds.head()

Unnamed: 0,id_chequeo,organizacion,pais,titulo_chequeo,fecha_chequeo,link_chequeo,texto_chequeo,id_desinformacion,titulo_desinformacion,fecha_desinformacion,link_desinformacion,texto_desinformacion,author_desinformacion,topic_desinformacion_raw,topic_desinformacion,label_desinformacion_raw,label_desinformacion,source_desinformacion_raw,source_desinformacion,formato_desinformacion,total_comments,total_reactions,total_shares,augmented,clean_texto,feat_word_count,feat_word_density,feat_characters,feat_upper_case,feat_hashtag,feat_find_url,feat_sentiment,feat_count_sentences,feat_polarity_score,feat_count_punctuation,feat_count_syllables,feat_count_emoticons,feat_ari,feature_ttr,feat_flesch_reading,count_organizations,count_persons,count_locations,count_pronouns,count_verbs,prop_organizations,prop_persons,prop_locations,prop_pronouns,prop_verbs,feat_25common_bigrams,feat_50common_bigrams,feat_100common_bigrams,feat_25common_unigrams,feat_50common_unigrams,feat_100common_unigrams
0,,verificado,mexico,,,,,6b63a3ff-3b8a-32e9-b1fc-feb36b0e6e59,,2017-01-16 00:00:00,https://www.facebook.com/DisculpeLasMolestiasEstoEsUnaRevolucion/photos/basw.AboXK-oDbmrcxLersagjQXeXHA-ZS0zCSzaH0gy5bXMZlI5Wk71Af1feCnoyq6NPT0IboUbmxaQHMvZ5QRPYnHQuIDd3Jjo0fjvja1i6fx5j0_llxDtwEo_ycQJ7qC7ouKtcDQIrqdNMrc1gwSxffIq-tEFRTNDTJbIWmSy3yCvCU1QMynYX-M_B_eX_57evA4mrx8TiSFsN2HXYSc_qhyNt.1368609479857562.1048961835213764.1061098397352852.309848959410452.1893115200922435.351691281967239.1245400692208477.1372432589474736.1200289516693529/1893115200922435/?type=1&opaqueCursor=AbozcLnGbT-ciSJc8dmiONwtJ0r91aQoPfbPAWDUFkHZkYg1YTdnTUrmPGGTNSnoa_L3U8t03r8rYOACb-9ZFhsUzbDnYBkK-hYsM0kKafH7pKt2kaxQuynx6y94tkdPnlgqbUqwi75TR2BU9QCIgNZzIgoUfn81rbX9i2swvp-z2cRiRVePL3a9U_RbnImZnV4roxE993FS9IKA8j9TAUZrcu7BVlt0ajdfg-mhyq5vuzYpooRFsDcASJhNLHr5bc6nPzcZYvd6--bGXaO7o54O8NBIQAxEENLbxkwEaG5GZ0TOwkbIQadypCvcQlrnJWCMR7vX5R3oF64Ibf5tbyJXwsnBbXgx5oAc9iUr0OrGNg-KCgfqubpuGZRal_68Gw9u57yWgkT63pLYecFiy0gIomP4i3NgOA8pvwTt4uVEag1z5ODSgzBIcpiSgKm1Mb1tqKjgRYdBa3omF75RuALCy-fRziTUu0aDnEl4YidG7oPz4j6Us_TUVCKgOr2JORurJPtUHCT6H7e8_1TmhMg1BQOopxWj6pJPhdFn-56QXBS-ADpW2osxvd4dd4JCC-UiB2l_nuv5vuL_DMheichmIzfMj-FWrxQaNL7Dcx9YuFHuCsdDabMRSbkIxfu6ZR3Gq-oi0HZ8PTBRhgqD3bew&theater,"El Senador del ""PAN"" Javier Lozano, lo dejo bien Claro: ""Si me bajan el salario me pongo a robar"" Algo que no es novedad, SIEMPRE HA ROBADO!!\n\n¿Que les parece?",,politica,politica nacional,falso,fake,facebook,redes_sociales,texto,401,1186,6907,0,senador pan javier lozano dejar bajar salariar poner a robar novedad robadoque,12,1.0,78,26,0,0,2.832097e-06,2,-1,11,50,0,14.2,0.933333,74.87,2,2,0,1,0,0.066667,0.066667,0.0,0.033333,0.0,0.0,0.0,0.0,0.0,1.0,1.0
1,,verificado,mexico,,,,,307ec6db-f1b3-3adc-8b8a-3e63a0328a45,,2018-03-18 00:00:00,https://www.facebook.com/novoideass/photos/p.1942992509350897/1942992509350897/?type=1&opaqueCursor=AbpIpJLSjxhIhpJK5i_XKBoi48PCLT8p6dDpUNNcPA6FYx62sGmL3nNoayp0mDFqcRNr14S4zbD97S0bR1xz-gJSMSuO6iEvvNrTSbQvTEOPgEDmxyX2gBjMxhXkaKgY6XPXyjvUZKDlK11wLCSGBSksJOL0kYCUJuXSQ8fDYMGYhBdCSiMKfKeiWV4jU3yXA3-_TqTNmlzdde79PpMshGFgfQ-hfvSPH2-HK1ibC5b0rtXMmdD_tpm4uTxUvWU4k1pgMKefjTIFGPQJNznJjiXbA_J5hBSqxBv7NrccXIR-qu_cu2rgCZltJREHTi9UAmq0vgcWWTeOKQywpV-Q5ZtuzFSse29UalEpofAd1q3l_t7bA_GH4V_zjRWJaBSyR0LTMiMGaqL9JhRy2jmkqKCMrn1t_rLyGUdGsl8efZSv4A&theater,"El actor Gael García Bernal ha mostrado su simpatía por AMLO y hasta se enfrentó en una discusión con Javier Lozano Alarcón en Twitter. Además pidió a la Comisión de Derechos Humanos de la ONU que intervenga en la situación de inseguridad que se vive en México.\n-Imágen: Redes Sociales.\n""Libertad y República""",,politica,politica nacional,falso,fake,facebook,redes_sociales,texto,0,7,29,0,actor gael garcia bernal mostrar simpatia amlo y enfrentar discusion javier lozano alarcon twitter pidio a comision derecho humano onu intervenir situacion inseguridad vivir mexicoimagen redar socialeslibertad y republica,29,1.035714,221,25,0,0,1.011422e-14,4,-1,6,99,0,14.9,0.811321,53.04,1,3,6,1,0,0.018868,0.056604,0.113208,0.018868,0.0,0.0,0.0,0.0,0.0,1.0,1.0
2,,verificado,mexico,,,,,fcf22028-42f0-3e7a-9664-81456787f006,,2018-03-18 00:00:00,https://www.facebook.com/watch/?v=519968068404562,"AMLO desaparecerá al Ejercito Mexicano.\nEl tabasqueño quiere perpetuar al Ejército y a la Marina en las calles, bajo el eufemismo “Guardia Nacional”, que para efectos prácticos, sería una policía militarizada.",,politica,politica nacional,falso,fake,facebook,redes_sociales,"texto, video",1,5,18,0,"amlo desaparecera ejercitar mexicanoel tabasqueno perpetuar ejercitar y a marino calle eufemismo "" guardia nacional "" efecto practicos serio policia militarizar",19,1.055556,160,11,0,0,0.0001419783,2,-1,5,73,0,13.7,0.935484,47.28,2,2,0,1,0,0.064516,0.064516,0.0,0.032258,0.0,0.0,0.0,0.0,1.0,1.0,1.0
3,,verificado,mexico,,,,,3c1ddb38-b4f0-3211-ae46-88b6615d1b04,,2018-03-17 00:00:00,https://www.facebook.com/watch/?v=571767869867769,"AMLO\nEl mandamás de Femsa (Coca Cola, Oxxos) se lanza contra AMLO, se opone a que AMLO toque la Reforma Energética ya que afectaría a los Oxxo Gas.\n\n#Recomendado\nFuente: Regeneración",,politica,politica nacional,falso,fake,facebook,redes_sociales,texto,2467,12658,58997,0,amloel mandamas femsa cocar cola oxxos lanzar amlo oponer a amlo tocar reformar energetica afectaria a oxxo gasrecomendadofuente regeneracion,19,1.117647,141,24,1,0,0.005612344,2,-1,4,62,0,17.6,0.83871,48.47,0,6,3,0,0,0.0,0.193548,0.096774,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,,verificado,mexico,,,,,a661a7e2-d01b-324a-afbe-8b0f69ca2410,,2018-03-23 00:00:00,https://www.facebook.com/vamosporella/photos/rpp.389844434467087/1605812676203584/?type=3&theater,"LUIS DONALDO COLOSIO RIOJAS, UN PREPARADO Y HONORABLE JOVEN QUE JUNTO A TODA SU FAMILIA FUE VICTIMA DEL SISTEMA CORRUPTO AVASALLADOR. \nDONDE EL PEJE ERA GRAN PROTAGONISTA.",,politica,politica nacional,falso,fake,facebook,redes_sociales,texto,4,59,28,0,luis donaldo colosio riojas preparar y honorable joven a familia victimar sistema corrupto avasallador peje protagonista,16,1.0,120,141,0,0,0.03263706,2,-1,3,60,0,17.4,1.0,27.15,4,0,0,2,0,0.148148,0.0,0.0,0.074074,0.0,0.0,0.0,0.0,0.0,0.0,1.0


Dimensions of the original dataset

In [6]:
ds.shape

(10695, 56)

Number of unique posts from social media

In [7]:
ds.texto_desinformacion.nunique()

8807

## Data Preparation


1.   Convert emojis to text
2.   Detect the language
3.   Remove all kinds of links (http, www, .com)
4.   Remove special characters and numbers
5.   Remove @mentions and #hashtags
6.   Map Covid-19 synonyms to a single term
7.   Handle particular cases
8.   Lemmatize

Select the columns of interest

In [5]:
col = ['organizacion','link_desinformacion',
       'texto_desinformacion', 
       'topic_desinformacion_raw', 'topic_desinformacion',
       'label_desinformacion_raw', 'label_desinformacion',
       'source_desinformacion_raw', 'source_desinformacion', 'formato_desinformacion', 'total_comments', 'total_reactions',
       'total_shares', 'augmented', 'clean_texto', 'feat_word_count',
       'feat_word_density', 'feat_characters', 'feat_upper_case',
       'feat_hashtag', 'feat_find_url', 'feat_sentiment',
       'feat_count_sentences', 'feat_polarity_score', 'feat_count_punctuation',
       'feat_count_syllables', 'feat_count_emoticons', 'feat_ari',
       'feature_ttr', 'feat_flesch_reading', 'count_organizations',
       'count_persons', 'count_locations', 'count_pronouns', 'count_verbs']
dsi = ds[col].copy()
dsi.head()

Unnamed: 0,organizacion,link_desinformacion,texto_desinformacion,topic_desinformacion_raw,topic_desinformacion,label_desinformacion_raw,label_desinformacion,source_desinformacion_raw,source_desinformacion,formato_desinformacion,total_comments,total_reactions,total_shares,augmented,clean_texto,feat_word_count,feat_word_density,feat_characters,feat_upper_case,feat_hashtag,feat_find_url,feat_sentiment,feat_count_sentences,feat_polarity_score,feat_count_punctuation,feat_count_syllables,feat_count_emoticons,feat_ari,feature_ttr,feat_flesch_reading,count_organizations,count_persons,count_locations,count_pronouns,count_verbs
0,verificado,https://www.facebook.com/DisculpeLasMolestiasEstoEsUnaRevolucion/photos/basw.AboXK-oDbmrcxLersagjQXeXHA-ZS0zCSzaH0gy5bXMZlI5Wk71Af1feCnoyq6NPT0IboUbmxaQHMvZ5QRPYnHQuIDd3Jjo0fjvja1i6fx5j0_llxDtwEo_ycQJ7qC7ouKtcDQIrqdNMrc1gwSxffIq-tEFRTNDTJbIWmSy3yCvCU1QMynYX-M_B_eX_57evA4mrx8TiSFsN2HXYSc_qhyNt.1368609479857562.1048961835213764.1061098397352852.309848959410452.1893115200922435.351691281967239.1245400692208477.1372432589474736.1200289516693529/1893115200922435/?type=1&opaqueCursor=AbozcLnGbT-ciSJc8dmiONwtJ0r91aQoPfbPAWDUFkHZkYg1YTdnTUrmPGGTNSnoa_L3U8t03r8rYOACb-9ZFhsUzbDnYBkK-hYsM0kKafH7pKt2kaxQuynx6y94tkdPnlgqbUqwi75TR2BU9QCIgNZzIgoUfn81rbX9i2swvp-z2cRiRVePL3a9U_RbnImZnV4roxE993FS9IKA8j9TAUZrcu7BVlt0ajdfg-mhyq5vuzYpooRFsDcASJhNLHr5bc6nPzcZYvd6--bGXaO7o54O8NBIQAxEENLbxkwEaG5GZ0TOwkbIQadypCvcQlrnJWCMR7vX5R3oF64Ibf5tbyJXwsnBbXgx5oAc9iUr0OrGNg-KCgfqubpuGZRal_68Gw9u57yWgkT63pLYecFiy0gIomP4i3NgOA8pvwTt4uVEag1z5ODSgzBIcpiSgKm1Mb1tqKjgRYdBa3omF75RuALCy-fRziTUu0aDnEl4YidG7oPz4j6Us_TUVCKgOr2JORurJPtUHCT6H7e8_1TmhMg1BQOopxWj6pJPhdFn-56QXBS-ADpW2osxvd4dd4JCC-UiB2l_nuv5vuL_DMheichmIzfMj-FWrxQaNL7Dcx9YuFHuCsdDabMRSbkIxfu6ZR3Gq-oi0HZ8PTBRhgqD3bew&theater,"El Senador del ""PAN"" Javier Lozano, lo dejo bien Claro: ""Si me bajan el salario me pongo a robar"" Algo que no es novedad, SIEMPRE HA ROBADO!!\n\n¿Que les parece?",politica,politica nacional,falso,fake,facebook,redes_sociales,texto,401,1186,6907,0,senador pan javier lozano dejar bajar salariar poner a robar novedad robadoque,12,1.0,78,26,0,0,2.832097e-06,2,-1,11,50,0,14.2,0.933333,74.87,2,2,0,1,0
1,verificado,https://www.facebook.com/novoideass/photos/p.1942992509350897/1942992509350897/?type=1&opaqueCursor=AbpIpJLSjxhIhpJK5i_XKBoi48PCLT8p6dDpUNNcPA6FYx62sGmL3nNoayp0mDFqcRNr14S4zbD97S0bR1xz-gJSMSuO6iEvvNrTSbQvTEOPgEDmxyX2gBjMxhXkaKgY6XPXyjvUZKDlK11wLCSGBSksJOL0kYCUJuXSQ8fDYMGYhBdCSiMKfKeiWV4jU3yXA3-_TqTNmlzdde79PpMshGFgfQ-hfvSPH2-HK1ibC5b0rtXMmdD_tpm4uTxUvWU4k1pgMKefjTIFGPQJNznJjiXbA_J5hBSqxBv7NrccXIR-qu_cu2rgCZltJREHTi9UAmq0vgcWWTeOKQywpV-Q5ZtuzFSse29UalEpofAd1q3l_t7bA_GH4V_zjRWJaBSyR0LTMiMGaqL9JhRy2jmkqKCMrn1t_rLyGUdGsl8efZSv4A&theater,"El actor Gael García Bernal ha mostrado su simpatía por AMLO y hasta se enfrentó en una discusión con Javier Lozano Alarcón en Twitter. Además pidió a la Comisión de Derechos Humanos de la ONU que intervenga en la situación de inseguridad que se vive en México.\n-Imágen: Redes Sociales.\n""Libertad y República""",politica,politica nacional,falso,fake,facebook,redes_sociales,texto,0,7,29,0,actor gael garcia bernal mostrar simpatia amlo y enfrentar discusion javier lozano alarcon twitter pidio a comision derecho humano onu intervenir situacion inseguridad vivir mexicoimagen redar socialeslibertad y republica,29,1.035714,221,25,0,0,1.011422e-14,4,-1,6,99,0,14.9,0.811321,53.04,1,3,6,1,0
2,verificado,https://www.facebook.com/watch/?v=519968068404562,"AMLO desaparecerá al Ejercito Mexicano.\nEl tabasqueño quiere perpetuar al Ejército y a la Marina en las calles, bajo el eufemismo “Guardia Nacional”, que para efectos prácticos, sería una policía militarizada.",politica,politica nacional,falso,fake,facebook,redes_sociales,"texto, video",1,5,18,0,"amlo desaparecera ejercitar mexicanoel tabasqueno perpetuar ejercitar y a marino calle eufemismo "" guardia nacional "" efecto practicos serio policia militarizar",19,1.055556,160,11,0,0,0.0001419783,2,-1,5,73,0,13.7,0.935484,47.28,2,2,0,1,0
3,verificado,https://www.facebook.com/watch/?v=571767869867769,"AMLO\nEl mandamás de Femsa (Coca Cola, Oxxos) se lanza contra AMLO, se opone a que AMLO toque la Reforma Energética ya que afectaría a los Oxxo Gas.\n\n#Recomendado\nFuente: Regeneración",politica,politica nacional,falso,fake,facebook,redes_sociales,texto,2467,12658,58997,0,amloel mandamas femsa cocar cola oxxos lanzar amlo oponer a amlo tocar reformar energetica afectaria a oxxo gasrecomendadofuente regeneracion,19,1.117647,141,24,1,0,0.005612344,2,-1,4,62,0,17.6,0.83871,48.47,0,6,3,0,0
4,verificado,https://www.facebook.com/vamosporella/photos/rpp.389844434467087/1605812676203584/?type=3&theater,"LUIS DONALDO COLOSIO RIOJAS, UN PREPARADO Y HONORABLE JOVEN QUE JUNTO A TODA SU FAMILIA FUE VICTIMA DEL SISTEMA CORRUPTO AVASALLADOR. \nDONDE EL PEJE ERA GRAN PROTAGONISTA.",politica,politica nacional,falso,fake,facebook,redes_sociales,texto,4,59,28,0,luis donaldo colosio riojas preparar y honorable joven a familia victimar sistema corrupto avasallador peje protagonista,16,1.0,120,141,0,0,0.03263706,2,-1,3,60,0,17.4,1.0,27.15,4,0,0,2,0


Replace the **misleading** label with **fake**, then drop duplicates

In [6]:
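# Relabel 'misleading' as 'fake', assigning the result back to the column
dsi['label_desinformacion'] = dsi['label_desinformacion'].replace('misleading', 'fake')
# Drop rows that repeat the same text with the same label
dsi.drop_duplicates(subset=['texto_desinformacion', 'label_desinformacion'], inplace=True)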
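
The order of the two operations matters: relabelling first lets a post that appears once as fake and once as misleading collapse into a single row. Here is a minimal sketch on a toy frame. The rows and the `true` label are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy data (hypothetical): "post a" appears as both fake and misleading
toy = pd.DataFrame({
    "texto_desinformacion": ["post a", "post a", "post b", "post b"],
    "label_desinformacion": ["fake", "misleading", "fake", "true"],
})

# Step 1: merge the 'misleading' label into 'fake'
toy["label_desinformacion"] = toy["label_desinformacion"].replace("misleading", "fake")

# Step 2: drop rows that repeat the same (text, label) pair
deduped = toy.drop_duplicates(subset=["texto_desinformacion", "label_desinformacion"])

print(deduped)
```

Only the second copy of "post a" is dropped. "post b" keeps both rows because its two labels differ, which is the effect of including the label in `subset`.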

From emoji to Spanish text

In [7]:
dsi['texto_emoji'] = dsi.texto_desinformacion.map(lambda x: emoji.demojize(x,language='es'))

Language detection

In [8]:
dsi['lang'] = dsi.texto_desinformacion.map(lambda x:detect(x))

In [9]:
dsi.lang.value_counts()

es    8495
pt      86
en      73
ca      51
it      34
de      25
id       7
fr       6
tl       6
ro       5
da       3
nl       3
hu       2
so       2
sv       2
vi       1
lv       1
af       1
lt       1
sw       1
sk       1
et       1
Name: lang, dtype: int64

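Spanish dominates: 8,495 of the 8,807 posts (about 96%) are detected as Spanish, followed at a distance by Portuguese (86) and English (73). The remaining low-frequency labels are reviewed next to help focus the cleaning.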
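
As a quick check, the counts printed above can be converted into percentages. The sketch below copies them into a plain dictionary instead of recomputing them from `dsi`. The names `lang_counts` and `shares` are introduced here for illustration.

```python
# Counts copied from the value_counts() output above
lang_counts = {
    "es": 8495, "pt": 86, "en": 73, "ca": 51, "it": 34, "de": 25,
    "id": 7, "fr": 6, "tl": 6, "ro": 5, "da": 3, "nl": 3,
    "hu": 2, "so": 2, "sv": 2, "vi": 1, "lv": 1, "af": 1,
    "lt": 1, "sw": 1, "sk": 1, "et": 1,
}

total = sum(lang_counts.values())  # 8807 posts in total

# Share of each detected language, in percent
shares = {lang: round(100 * n / total, 2) for lang, n in lang_counts.items()}

for lang in ("es", "pt", "en"):
    print(lang, shares[lang])
```

On the DataFrame itself, `dsi.lang.value_counts(normalize=True)` should give the same proportions directly.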

In [39]:
lang_review = ['it', 'de', 'id', 'fr', 'tl', 'ro', 'da', 'nl', 'hu', 'so', 'sv', 'vi', 'lv', 'af', 'lt', 'sw', 'sk', 'et']

In [53]:
dsi[dsi.lang.isin(lang_review)][['texto_desinformacion','lang']].sample(5)

Unnamed: 0,texto_desinformacion,lang
8748,Secretario general de OEA visita Colombia para evaluar ola migratoria venezolana,it
9915,Focus,ro
1464,"VIDAL.\nPRIMER GOBIERNO EN LA HISTORIA DE LA PROVINCIA, QUE ENTREGARÁ SU MANDATO CON MENOS ESCUELAS DE LAS RECIBIDAS.\nDEPLORABLE.",de
10363,Timeline Photos,et
8813,Maestros robots,sk


In [69]:
dsi[dsi.lang.isin(lang_review)][['texto_desinformacion','lang']].sample(5)

Unnamed: 0,texto_desinformacion,lang
10645,u.afp.com,ro
988,PEPE Y MONI EN LAS ISLAS DE CAPRI...\nELLOS PIDEN EMPATÍA\nAPLAUDEN AHORA LOS PERONCHOS…,de
7708,Maria,sw
8866,PHOTOLARTE,tl
10526,"Spoleto, la ciudad con un ""metro"" para peatones",it
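
Several of the sampled rows are very short strings ("Focus", "Maria", "Maestros robots"), for which a language label is unlikely to be reliable. One possible guard, sketched here under assumptions, is to wrap the detector and return a placeholder for short texts or failed detections. The helper name `detect_or_unknown`, the `min_letters` threshold of 20, and the stub detectors are all hypothetical and not part of the notebook.

```python
def detect_or_unknown(text, detector, min_letters=20):
    """Return detector(text), or 'unknown' when the text is too short or detection fails."""
    letters = sum(ch.isalpha() for ch in text)
    if letters < min_letters:
        return "unknown"
    try:
        return detector(text)
    except Exception:  # e.g. a detector that raises when the text has no usable features
        return "unknown"


def stub_detector(text):
    # Stand-in for a real detector so the sketch runs without extra packages
    return "es"


def failing_detector(text):
    # Simulates a detector that cannot handle the input
    raise ValueError("no features in text")


samples = [
    "Focus",
    "Maestros robots",
    "Secretario general de OEA visita Colombia para evaluar ola migratoria venezolana",
]
for sample in samples:
    print(repr(sample[:20]), detect_or_unknown(sample, stub_detector))
```

With langdetect, `detector` would be `detect`. Setting `DetectorFactory.seed = 0` (from `langdetect`) before the first call makes the results reproducible, since the library is non-deterministic by default. The threshold would need tuning against the data, and longer misdetected texts such as the all-caps examples above would still pass through.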


Data cleaning

In [77]:
def clean_text(sentence):
  sentence = re.sub(r'<[^>]+>', '', sentence)  # HTML tags
  sentence = re.sub(r'https?:\/\/\S*', '', sentence)  # http(s) links
  sentence = re.sub(r'www\.\S*', '', sentence)  # www links
  sentence = re.sub(r'\S+\.com\S+', '', sentence)  # .com links
  sentence = unidecode.unidecode(sentence)  # strip accents (transliterate to ASCII)
  sentence = sentence.lower()  # lowercase
  sentence = re.sub(r'\@\S+', ' ', sentence)  # remove @mentions
  sentence = re.sub(r'\#\S+', ' ', sentence)  # remove #hashtags
  sentence = re.sub('[^a-zA-Z]', ' ', sentence)  # punctuation and numbers
  new_stopwords = set(stopwords.words('spanish')) - {'no'}  # Spanish stop words, keeping 'no' in the text
  sentence = ' '.join([word for word in sentence.split() if word not in new_stopwords])  # remove stop words
  sentence = re.sub(r'\b\w{1,1}\b', '', sentence)  # single-character tokens
  sentence = re.sub(r'\s+', ' ', sentence).strip()  # collapse whitespace and trim the ends
  return sentence
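
To see the effect of these steps on a single post, here is a reduced, standard-library-only sketch. It is not the notebook's function: it swaps `unidecode` for `unicodedata.normalize`, uses a tiny made-up stop-word set (`TOY_STOPWORDS`) instead of NLTK's Spanish list, and the sample sentence is invented.

```python
import re
from unicodedata import normalize

# Hypothetical mini stop-word list; the notebook uses NLTK's Spanish list minus 'no'
TOY_STOPWORDS = {"el", "la", "de", "que", "en", "y", "a"}


def strip_accents(text):
    # NFKD splits "á" into "a" plus a combining accent; the ASCII encode then drops the accent
    return normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")


def clean_text_sketch(sentence):
    sentence = re.sub(r"https?://\S*", "", sentence)  # http(s) links
    sentence = re.sub(r"www\.\S*", "", sentence)      # www links
    sentence = strip_accents(sentence).lower()        # accents and case
    sentence = re.sub(r"[@#]\S+", " ", sentence)      # @mentions and #hashtags
    sentence = re.sub(r"[^a-z]", " ", sentence)       # punctuation and digits
    # Drop stop words and single-character tokens, then rejoin with single spaces
    words = [w for w in sentence.split() if w not in TOY_STOPWORDS and len(w) > 1]
    return " ".join(words)


sample = "¡La vacuna NO funciona! Más info en https://example.com/x #Recomendado @usuario"
print(clean_text_sketch(sample))  # vacuna no funciona mas info
```

The negation "no" survives because it is not in the stop-word set, which mirrors the reason the notebook subtracts `{'no'}` from the NLTK list.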

In [78]:
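dsi['texto_emoji_clean'] = dsi.texto_emoji.map(clean_text)
dsi['texto_clean'] = dsi.texto_desinformacion.map(clean_text)
# Covid-19 synonym: map 'coronavirus' to 'covid'
dsi['texto_clean'] = dsi['texto_clean'].str.replace('coronavirus', 'covid', regex=False)
# Particular cases: blank out texts that consist only of this slogan (whole-value match)
dsi['texto_clean'] = dsi['texto_clean'].replace('yoestoyconelcambio', '')
# Remove the standalone token 'ano' ('año' after accent stripping)
dsi['texto_clean'] = dsi['texto_clean'].map(lambda x: re.sub(r"\bano\b", "", x))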
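
The `\b` word boundaries in the last substitution are what keep the removal from touching longer words that merely contain "ano". A small sketch of the difference follows. The helper `drop_word` and the sample sentence are illustrative only.

```python
import re


def drop_word(text, word):
    # Remove `word` only where it stands alone, then tidy the leftover spaces
    pattern = r"\b" + re.escape(word) + r"\b"
    without = re.sub(pattern, " ", text)
    return re.sub(r"\s+", " ", without).strip()


sample = "hace un ano el ser humano"

print(drop_word(sample, "ano"))   # hace un el ser humano
print(sample.replace("ano", ""))  # plain replace also truncates "humano"
```

A plain substring replacement would turn "humano" into "hum", which is why the regex with boundaries is the safer choice for this kind of particular case.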

In [28]:
# Drop rows whose cleaned text is empty or whitespace only
dsi = dsi[dsi.texto_clean.str.strip() != '']

## Lemmatizer

spaCy lemmatizer

In [None]:
def lemantizar(sentence):
  # `nlp` must be a loaded spaCy Spanish pipeline (e.g. es_core_news_lg)
  doc = nlp(sentence)
  sentence = ' '.join([token.lemma_ for token in doc])  # lemmatize the text
  new_stopwords = set(stopwords.words('spanish'))  # full stop-word list: 'no' is not excluded here
  sentence = ' '.join([word for word in sentence.split() if word not in new_stopwords])  # remove stop words
  sentence = re.sub(r"\s+[a-zA-Z]\s+", ' ', sentence)  # single character between spaces, e.g. ' a '
  sentence = re.sub(r"^[a-zA-Z]\s+", ' ', sentence)  # single character at the start, e.g. 'a '
  sentence = re.sub(r'\s+', ' ', sentence).strip()  # collapse whitespace and trim the ends
  return sentence

In [None]:
dsi['texto_lem_spacy'] = dsi.texto_clean.map(lambda x: lemantizar(x))

    Saved file: df_lem_spacy_ready.csv