In [64]:
import os
from itertools import product

**IMPORTANT:** spaCy version 2.3.0 is required.

In [25]:
import spacy
nlp = spacy.load("es_core_news_lg")
spacy.__version__

'2.3.0'

We load the tweets that were already preprocessed in another notebook.

In [3]:
import pandas as pd

tweets = pd.read_feather('../data/tweets_limpios')

In [4]:
tweets.shape

(25365, 2)

In [5]:
tweets

Unnamed: 0,text,index
0,promotoresods desear feliz año necesitar esper...,6
1,aplicación regar aguar depurar enzima natural ...,7
2,lunes estrenar esperar 19:30 _ 13c viajar para...,8
3,década cumplir objetivo agenda2030 preparar en...,9
4,nº 329 homenaje municipalismo 40añosdedemocrac...,10
...,...,...
25360,documento cepal examinar tendencia económico s...,43690
25361,ámsterdam demostrar quimera salir entrar crisi...,43693
25362,aprovecha confinamiento blog encontrar 40 artí...,43694
25363,pilar mandato fundación promover debatir ue al...,43695


Some quick tests with spaCy's `similarity` method.

In [6]:
doc0 = nlp(tweets['text'][0])
doc1 = nlp(tweets['text'][1])
doc0.similarity(doc1)

0.7585479797048875
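For context, with vector-based models spaCy's `Doc.similarity` defaults to the cosine similarity between the two document vectors, and a document vector defaults to the average of its token vectors. The sketch below reproduces that calculation by hand with made-up 3-dimensional vectors. The vectors and the helper names `mean_vector` and `cosine` are illustrative assumptions, not values or functions from `es_core_news_lg`.

```python
import math

def mean_vector(vectors):
    """Average a list of equal-length vectors component by component."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy "token vectors" for two short documents (invented values).
doc_a_tokens = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
doc_b_tokens = [[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

doc_a = mean_vector(doc_a_tokens)  # [0.5, 0.5, 0.0]
doc_b = mean_vector(doc_b_tokens)  # [0.5, 0.5, 0.5]
sim = cosine(doc_a, doc_b)
print(round(sim, 4))
```

Because the document vector is an average, word order plays no role, which is why a plain bag of dictionary words can be compared against a tweet in the cells that follow.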

In [7]:
doc0

promotoresods desear feliz año necesitar esperar contar apoyar lograr objetivo agenda2030 planeta saludable sustentable esperamos contar ayuda nodejaranadieatras

In [8]:
doc1

aplicación regar aguar depurar enzima natural grava-cemento bayas castrillón resultar pavimentar ecológico proteger firmar prolongar vida útil sostenibilidad innovación pavitek10años pavimentación workinprogress pavitek

### Similarity between the tweets and the dictionary words

We load the list of characteristic words for each ODS (Sustainable Development Goal). The idea is to compare each tweet with the words of each ODS.

In [9]:
palabras = pd.read_feather('../data/diccionario_palabras_ods')

In [10]:
palabras[['ODS.1', 'PALABRAS']].dropna().groupby('ODS.1')['PALABRAS'].apply(list)

ODS.1
1     [ALIMENTOS ASEQUIBLES, CAMPAÑAS SOLIDARIAS, DO...
2     [AGRICULTURA ECOLÓGICA, AGRICULTURA SOSTENIBLE...
3     [ACCIDENTE LABORAL, ACTIVIDAD FÍSICA, ALIMENTO...
4     [ACCESO A INTERNET, APRENDER, APRENDIZAJE, APR...
5     [CONDICIONES LABORABLES JUSTAS, CONTRATACION I...
6     [AGUA, AGUA LIMPIA, CONSUMO AGUA, DISPONIBILID...
7     [AHORRO ENERGETICO, DOMÓTICA, EFICIENCIA ENERG...
8     [BECAS, CONTRATACION IGUALITARIA, CONCILIACION...
9     [CREATIVIDAD, ECONOMIA CIRCULAR, DIGITALIZACIO...
10    [CRISIS HUMANITARIAS, DESIGUALDADES, EMPLEO IN...
11    [CIUDADES , CIUDADES SOSTENIBLES, CONCENTRACIO...
12    [AHORRAR RECURSOS, AHORRO COSTES, CONSUMO, CON...
13    [AHORRO ENERGIA, ANDAR, CAMBIO CLIMATICO, CLIM...
14    [CONSERVACION MARINA, CONTAMINACION MARINA, CO...
15    [BOSQUES, CATÁSTROFE NATURAL, CONSERVACION, DE...
16    [ACCESO A LA JUSTICIA, ANTI-CORRUPCION, ASESIN...
17    [ALIANZA MUNDIAL, ALIANZAS, ASOCIACIONES, CNMC...
Name: PALABRAS, dtype: object

In [11]:
palabras17 = "ALIANZA MUNDIAL, ALIANZAS, ASOCIACIONES, CNMC, FUNDACIONES, ITA".split(',')
palabras17 = " ".join(palabras17).lower().strip()
palabras17

'alianza mundial  alianzas  asociaciones  cnmc  fundaciones  ita'
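The double spaces in the output above come from `split(',')` leaving the space after each comma attached to the next piece. A possible variant, shown only as a sketch, strips each piece before joining. The names `raw` and `palabras17_clean` are hypothetical and not part of the notebook.

```python
raw = "ALIANZA MUNDIAL, ALIANZAS, ASOCIACIONES, CNMC, FUNDACIONES, ITA"

# Strip each piece so the space after every comma is not carried into the join.
palabras17_clean = " ".join(piece.strip() for piece in raw.split(",")).lower()
print(palabras17_clean)
```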

In [12]:
doc_palabras17 = nlp(palabras17)

In [13]:
doc0.similarity(doc_palabras17)

0.44297837000781914

In [14]:
for text in tweets['text'][:5]:
    doc = nlp(text)
    print(doc.similarity(doc_palabras17))

0.44297837000781914
0.4257914698963234
0.32967012851285526
0.5414404397117567
0.25415340236926104


In [15]:
nlp(tweets['text'][0]).similarity(doc_palabras17)

0.44297837000781914

This step takes about an hour. For that reason, the result is saved to `../data` on the first run and read back from that file on subsequent runs.
To regenerate it, just delete the file `../data/similarity`.

In [None]:
%%time

if os.path.exists('../data/similarity'):
    similarity = pd.read_feather('../data/similarity')
else:
    words_list = palabras[['ODS.1', 'PALABRAS']].dropna().groupby('ODS.1')['PALABRAS'].apply(list)
    similarity = pd.DataFrame([])
    for ods, words in enumerate([' '.join(words).lower() for words in words_list]):
        similarity[str(ods+1)] = pd.Series([nlp(tweet).similarity(nlp(words)) for tweet in tweets['text']])
    similarity.to_feather('../data/similarity')

similarity.head(10)
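The compute-once-then-cache pattern used in this cell can be isolated in a small helper. The sketch below is an illustration under assumptions: it uses `pickle` and a temporary directory instead of Feather and `../data` so that it runs with the standard library only, and `load_or_compute` and `expensive` are hypothetical names.

```python
import os
import pickle
import tempfile

def load_or_compute(path, compute):
    """Return the cached result at `path`, computing and saving it if absent."""
    if os.path.exists(path):
        with open(path, "rb") as fh:
            return pickle.load(fh)
    result = compute()
    with open(path, "wb") as fh:
        pickle.dump(result, fh)
    return result

calls = []

def expensive():
    """Stand-in for the slow similarity computation (invented values)."""
    calls.append(1)
    return {"1": [0.1, 0.2], "2": [0.3, 0.4]}

with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "similarity.pkl")
    first = load_or_compute(cache_path, expensive)   # computes and writes the file
    second = load_or_compute(cache_path, expensive)  # reads the file, no recomputation

print(len(calls), first == second)
```

As in the notebook, deleting the cached file is what forces a recomputation on the next call.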

**Added on 30/06/2020.**

The main problem with the previous code is that it processes the same text with `nlp` over and over.
We have therefore separated text processing from the similarity computation.

In [50]:
%%time

tweets_nlp = tweets['text'].apply(nlp)

words_list = palabras[['ODS.1', 'PALABRAS']].dropna().groupby('ODS.1')['PALABRAS'].apply(list)
words_list_nlp = [nlp(' '.join(words).lower()) for words in words_list]

CPU times: user 3min 4s, sys: 337 ms, total: 3min 4s
Wall time: 3min 5s
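To illustrate why this separation helps, the toy sketch below counts how often a parser is called under each approach. `fake_nlp` is a hypothetical stand-in for `nlp` that only splits on whitespace, and the sample texts are invented.

```python
calls = {"nlp": 0}

def fake_nlp(text):
    """Hypothetical stand-in for `nlp`: counts calls and returns the tokens."""
    calls["nlp"] += 1
    return text.split()

tweets_toy = ["agua limpia", "energia solar", "empleo digno"]
ods_toy = ["agua saneamiento", "energia renovable"]

# Original approach: both texts are parsed again for every (tweet, ODS) pair.
for words in ods_toy:
    for tweet in tweets_toy:
        fake_nlp(tweet)
        fake_nlp(words)
naive_calls = calls["nlp"]

# Revised approach: parse every text once, then only compare parsed objects.
calls["nlp"] = 0
tweets_parsed = [fake_nlp(t) for t in tweets_toy]
ods_parsed = [fake_nlp(w) for w in ods_toy]
split_calls = calls["nlp"]

print(naive_calls, split_calls)
```

With the sizes used in this notebook (25,365 tweets and 17 ODS), the same count gives 2 × 25,365 × 17 = 862,410 parser calls for the original loop against 25,365 + 17 = 25,382 after the separation.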


In [68]:
%%time

if os.path.exists('../data/similarity'):
    similarity = pd.read_feather('../data/similarity')
else:
    
    # One column per ODS: compare every parsed tweet with that ODS word list.
    similarity = pd.DataFrame([])
    for ods, words in enumerate(words_list_nlp):
        similarity[str(ods+1)] = [tweet.similarity(words) for tweet in tweets_nlp]
    similarity.to_feather('../data/similarity')

similarity.head(10)

CPU times: user 1.85 s, sys: 20.4 ms, total: 1.87 s
Wall time: 1.92 s


Unnamed: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
0,0.60275,0.668933,0.659126,0.641406,0.565214,0.410802,0.557114,0.645977,0.5828,0.596844,0.503234,0.617844,0.540369,0.432259,0.500043,0.543211,0.442978
1,0.629167,0.777066,0.815426,0.659907,0.619371,0.682512,0.664201,0.703563,0.71881,0.673982,0.671245,0.737869,0.679991,0.587008,0.721592,0.630925,0.425791
2,0.285393,0.28701,0.266649,0.290361,0.298029,0.213132,0.222479,0.335018,0.265226,0.27311,0.44041,0.244629,0.319887,0.343346,0.329822,0.25654,0.32967
3,0.58786,0.53599,0.536301,0.602412,0.609603,0.292556,0.408285,0.627351,0.528717,0.608762,0.461606,0.533556,0.427018,0.353381,0.431755,0.585437,0.54144
4,0.331859,0.313931,0.331852,0.356891,0.369388,0.236997,0.238984,0.423133,0.34004,0.395937,0.338398,0.374853,0.299977,0.264312,0.313376,0.382091,0.254153
5,0.537358,0.50982,0.521213,0.506818,0.549486,0.382609,0.366114,0.607022,0.532024,0.57398,0.584403,0.566641,0.508576,0.424571,0.511043,0.591148,0.434117
6,0.708737,0.667735,0.697512,0.722985,0.68835,0.437536,0.523591,0.765595,0.65323,0.69934,0.590015,0.674642,0.56669,0.478601,0.580832,0.639949,0.575177
7,0.645947,0.754703,0.760511,0.656067,0.638458,0.603507,0.606025,0.737414,0.674608,0.716104,0.601072,0.763501,0.648473,0.502616,0.6089,0.61788,0.430274
8,0.448265,0.499427,0.444971,0.475499,0.412952,0.243447,0.457789,0.492458,0.463821,0.432507,0.386663,0.459945,0.428865,0.330608,0.348773,0.378121,0.416675
9,0.726144,0.731134,0.784925,0.790074,0.784964,0.520035,0.60296,0.836955,0.789187,0.803966,0.673148,0.738163,0.639828,0.566259,0.691333,0.766898,0.588068


In [67]:
similarity.shape

(25365, 17)
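One way to read the table more easily is to extract, for each tweet, the ODS with the highest score. The sketch below does this in plain Python for tweets 0 and 2, with the values copied from the rows shown above; `best_ods` is a hypothetical helper and not part of the notebook.

```python
# Similarity scores for tweets 0 and 2, copied from the table above
# (columns are ODS 1 to 17, in order).
scores = {
    0: [0.60275, 0.668933, 0.659126, 0.641406, 0.565214, 0.410802,
        0.557114, 0.645977, 0.5828, 0.596844, 0.503234, 0.617844,
        0.540369, 0.432259, 0.500043, 0.543211, 0.442978],
    2: [0.285393, 0.28701, 0.266649, 0.290361, 0.298029, 0.213132,
        0.222479, 0.335018, 0.265226, 0.27311, 0.44041, 0.244629,
        0.319887, 0.343346, 0.329822, 0.25654, 0.32967],
}

def best_ods(row):
    """Return (ODS number, score) for the highest score in a row."""
    best_index = max(range(len(row)), key=row.__getitem__)
    return best_index + 1, row[best_index]

for tweet_id, row in scores.items():
    print(tweet_id, best_ods(row))
```

On the full DataFrame, the column label of each row's maximum could be obtained with `similarity.idxmax(axis=1)`.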

We print the content of the tweets to check whether they are consistent with the values obtained.

In [19]:
for i, tweet in enumerate(tweets['text'][:10]):
    print('{}: {}\n'.format(i, tweet))

0: promotoresods desear feliz año necesitar esperar contar apoyar lograr objetivo agenda2030 planeta saludable sustentable esperamos contar ayuda nodejaranadieatras

1: aplicación regar aguar depurar enzima natural grava-cemento bayas castrillón resultar pavimentar ecológico proteger firmar prolongar vida útil sostenibilidad innovación pavitek10años pavimentación workinprogress pavitek

2: lunes estrenar esperar 19:30 _ 13c viajar paralelo 63 polo norte ciudad universitario trondheim agenda2030

3: década cumplir objetivo agenda2030 preparar enfrentar reto desigualdad cambioclimático conocer desafío enfrentar mundo 2019 lomásleído

4: nº 329 homenaje municipalismo 40añosdedemocracialocal femp cop25 poniendo caro ods caso práctico ods2

5: nº 328 mayores gobiernos locales poniendo caro ods caso práctico ods1

6: feliz2020 mejorar deseo año aacid continuar trabajar junto contribuir cumplimiento agenda2030 ods contamos andalucia comprometida pacode

7: 14% alimento producir perder cosecha