# Keywords
## This notebook defines four core functions:
### (1) graph_weighted(text,K,filter_nouns_adj,digraph) takes a text as a string (text), the size of the co-occurrence window as an integer (K), a boolean selecting whether to keep only nouns and adjectives (filter_nouns_adj), and a boolean selecting whether the graph is directed (digraph). It returns the co-occurrence graph of the text.
### (2) keywords_pagerank(text,number_keywords,filter_nouns_adj,K,digraph) takes the same arguments as (1) plus the number of keywords to return (number_keywords). It returns the keywords of the text with the highest PageRank scores.
### (3) keywords_kcore(text,filter_nouns_adj,K,digraph) takes the same arguments as (1) and returns the nodes of the graph's main core as the keywords.
### (4) bigram_keywords(text,filter_nouns_adj,K,digraph) takes the same arguments as (1) and returns keyword bigrams ordered from most to least frequent, using the co-occurrence edge weight as the frequency.

In [2]:
import operator
import re

import networkx as nx
import spacy
from nltk import sent_tokenize
from nltk.corpus import stopwords

# spaCy 3 no longer accepts the 'en' shortcut; load the small English model by name
nlp = spacy.load('en_core_web_sm')
stop_words = set(stopwords.words('english'))

## Example

In [3]:
text = '''Sequencing the Neanderthal genome, the Denisovan genome, and several early modern human genomes from Eurasia has confirmed that archaic hominins left their mark in the genomes of modern humans. Present-day individuals in Eurasia inherited ∼2% of their genome from Neanderthals, and individuals from Oceania inherited ∼5% of their genome from Denisovans. Suggestive evidence indicates that admixture from other unidentified hominin species occurred in Africa. To understand the functional, phenotypic, and evolutionary consequences of archaic admixture, it is necessary to identify the specific haplotypes and alleles that were inherited from archaic hominin ancestors. Approaches to identifying introgressed haplotypes include methods that specifically incorporate reference archaic hominin genome sequences and reference-free methods that do not utilize such information. An example of the former category is the method of Sankararaman et al., which identifies archaic haplotypes by comparing modern human haplotypes to a reference archaic sequence. The latter category of methods include the S∗ statistic, which searches for the mutational signature that ancient admixture leaves in the genomes of present-day humans. The S∗ approach is powerful for finding introgressed haplotypes in the absence of an archaic reference genome because it leverages the unusual mutational characteristics of introgressed haplotypes. Because of the long divergence time between Neanderthals and modern humans, Neanderthals carry many alleles that are specific to their lineage. Such alleles are present on introgressed haplotypes but are absent or rare in African genomes. Further, based on the recent timing of admixture, introgressed haplotypes are expected to be maintained without recombination over distances of approximately 50 kb on average (Sankararaman et al., 2012), resulting in high levels of linkage disequilibrium (LD) between Neanderthal-specific alleles in non-African human genomes.
In this study, we develop an S∗-like method that has increased power and is suitable for large-scale genome-wide data. We apply the method to large sets of sequenced data from Eurasia and Oceania and identify putative archaic-specific alleles. We examine the rate at which these alleles match the sequenced archaic genomes and the role of the genes containing these alleles, to obtain insights into the history of the admixture events and their impact on modern human genomes.'''

## (0) Text-cleaning function

In [4]:
def clean(text, filter_nouns_adj):
    """Return the lemmas of text as a flat list, with stop words removed."""
    # drop anything inside parentheses or square brackets (citations, abbreviations)
    text = re.sub(r"[\(\[].*?[\)\]]", "", text)
    sentences = [nlp(sentence) for sentence in sent_tokenize(text)]
    print(sentences[0])  # debug output: the first parsed sentence
    if filter_nouns_adj:
        # keep only nouns (NN, NNS) and adjectives (JJ)
        sentences = [[token.lemma_ for token in sentence if token.tag_ in ('NN', 'NNS', 'JJ')] for sentence in sentences]
    else:
        sentences = [[token.lemma_ for token in sentence] for sentence in sentences]

    # flatten the per-sentence lists and remove stop words
    text = [item for sublist in sentences for item in sublist]
    text = [word for word in text if word not in stop_words]
    return text
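
The effect of the regular expression at the top of `clean` is easiest to see on a short input. The sketch below applies the same pattern to a fragment of the example text, using only the standard library. The names `BRACKETED`, `fragment` and `stripped` are illustrative and do not appear elsewhere in the notebook.

```python
import re

# Same pattern as in clean(): lazily match from an opening ( or [
# up to the nearest closing ) or ]
BRACKETED = re.compile(r"[\(\[].*?[\)\]]")

fragment = ("approximately 50 kb on average (Sankararaman et al., 2012), "
            "resulting in high levels of linkage disequilibrium (LD)")

# Both bracketed spans are removed; the surrounding spaces are left behind
stripped = BRACKETED.sub("", fragment)
print(repr(stripped))
```

The pattern removes the brackets and their contents but not the whitespace around them, so a stray space remains before the comma. That is harmless here because the text is tokenized afterwards.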


## (1) Word graph

In [5]:
# K is the length of the co-occurrence window
# filter_nouns_adj selects the filtering: True keeps only nouns and adjectives
# digraph selects the graph type: True = directed, False = undirected
def graph_weighted(text, K, filter_nouns_adj, digraph):
    text = clean(text, filter_nouns_adj)
    unique_words = list(set(text))
    G = nx.DiGraph() if digraph else nx.Graph()
    G.add_nodes_from(unique_words)
    for word in unique_words:
        # indices of every occurrence of word in the text
        index_word = [index for index, value in enumerate(text) if value == word]
        # link word to each of the next K words (forward window)
        for index in index_word:
            for k in range(1, K + 1):
                if index + k < len(text):
                    neighbour = text[index + k]
                    if G.has_edge(word, neighbour):
                        G[word][neighbour]['weight'] += 1
                    else:
                        G.add_edge(word, neighbour, weight=1)

    return G

In [6]:
graph_weighted(text,4,True,False)

Sequencing the Neanderthal genome, the Denisovan genome, and several early modern human genomes from Eurasia has confirmed that archaic hominins left their mark in the genomes of modern humans.


<networkx.classes.graph.Graph at 0x7fb9375088d0>
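
The returned graph object does not show how the edge weights come about, so here is a minimal sketch of the same forward-window counting on a toy token list. It assumes the undirected case and uses a `Counter` from the standard library instead of a networkx graph. The function `cooccurrence_weights` and the token list are hypothetical and only serve the illustration.

```python
from collections import Counter

def cooccurrence_weights(tokens, K):
    """Count how often two tokens fall within K positions of each other.

    Pairs are stored as sorted tuples, so direction is ignored,
    which mirrors the undirected (digraph=False) case.
    """
    weights = Counter()
    for i, word in enumerate(tokens):
        for k in range(1, K + 1):
            if i + k < len(tokens):
                pair = tuple(sorted((word, tokens[i + k])))
                weights[pair] += 1
    return weights

tokens = ["modern", "human", "genome", "modern", "human"]
weights = cooccurrence_weights(tokens, K=2)
for pair, weight in weights.most_common():
    print(pair, weight)
```

With `K=2`, "modern" and "human" are counted three times: twice as direct neighbours and once across the intervening "genome". A larger window therefore raises weights for words that are close without being adjacent.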

## (2) Keywords by PageRank

In [7]:
# K is the length of the co-occurrence window
# filter_nouns_adj selects the filtering: True keeps only nouns and adjectives
# number_keywords is the number of keywords to return
def keywords_pagerank(text, number_keywords, filter_nouns_adj, K, digraph):
    G = graph_weighted(text, K, filter_nouns_adj, digraph)
    scores = nx.pagerank(G, alpha=0.85, weight='weight')
    # sort the words by descending PageRank score and keep the top ones
    ranked = sorted(scores.items(), key=operator.itemgetter(1), reverse=True)
    return [word for word, _ in ranked[:number_keywords]]

In [9]:
keywords_pagerank(text,10,True,4,True)

Sequencing the Neanderthal genome, the Denisovan genome, and several early modern human genomes from Eurasia has confirmed that archaic hominins left their mark in the genomes of modern humans.


['genome',
 'archaic',
 'haplotype',
 'allele',
 'human',
 'method',
 'modern',
 'introgressed',
 'specific',
 'admixture']
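
To give an idea of what the ranking measures, the following is a rough sketch of weighted PageRank as a plain power iteration over a directed edge list. It is not the networkx implementation that `keywords_pagerank` calls, only an illustration under simple assumptions: a fixed number of iterations, uniform teleportation, and rank from nodes without outgoing edges spread evenly. The function `pagerank_sketch` and the three-node graph are hypothetical.

```python
def pagerank_sketch(edges, alpha=0.85, iterations=100):
    """edges maps (source, target) to a positive weight (directed)."""
    nodes = sorted({node for edge in edges for node in edge})
    n = len(nodes)
    # total outgoing weight per node, used to normalise the transitions
    out_weight = {node: 0.0 for node in nodes}
    for (src, _), w in edges.items():
        out_weight[src] += w
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # rank held by nodes with no outgoing edges is shared by everyone
        dangling = sum(rank[node] for node in nodes if out_weight[node] == 0)
        new_rank = {node: (1 - alpha) / n + alpha * dangling / n for node in nodes}
        for (src, dst), w in edges.items():
            new_rank[dst] += alpha * rank[src] * w / out_weight[src]
        rank = new_rank
    return rank

edges = {("a", "c"): 2, ("b", "c"): 1, ("c", "a"): 1}
rank = pagerank_sketch(edges)
ordered = sorted(rank, key=rank.get, reverse=True)
print(ordered)
```

In this toy graph "c" receives links from both other nodes and ends up first, "a" is second because it inherits rank from "c", and "b" has no incoming links and keeps only the teleportation share.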

## (3) Keywords by main core

In [10]:
# K is the length of the co-occurrence window
# filter_nouns_adj selects the filtering: True keeps only nouns and adjectives
# digraph selects the graph type: True = directed, False = undirected

def keywords_kcore(text, filter_nouns_adj, K, digraph):
    G = graph_weighted(text, K, filter_nouns_adj, digraph)
    # k_core does not accept graphs with self-loops, so remove them first
    G.remove_edges_from(nx.selfloop_edges(G))
    # with no k given, k_core returns the main core (largest core number)
    return list(nx.k_core(G).nodes())

In [11]:
keywords_kcore(text,True,4,True)


Sequencing the Neanderthal genome, the Denisovan genome, and several early modern human genomes from Eurasia has confirmed that archaic hominins left their mark in the genomes of modern humans.


['genome',
 'archaic',
 'human',
 'reference',
 'hominin',
 'haplotype',
 'modern',
 'method']
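
The main core is the largest subgraph in which every node keeps at least k neighbours, for the highest k where such a subgraph still exists. A minimal sketch of the usual peeling procedure follows, for an undirected graph without self-loops given as a dictionary of neighbour sets. It is an illustration only, not the networkx code behind `nx.k_core` (which, for directed graphs, documents the degree as in-degree plus out-degree). The function `main_core_sketch` and the toy graph are hypothetical.

```python
def main_core_sketch(adjacency):
    """adjacency maps each node to the set of its neighbours."""
    remaining = {node: set(nbrs) for node, nbrs in adjacency.items()}
    k = 0
    core = set(remaining)  # every node belongs to the 0-core
    while remaining:
        # nodes with degree <= k cannot be part of the (k+1)-core
        low = [node for node, nbrs in remaining.items() if len(nbrs) <= k]
        if not low:
            # all remaining nodes have degree > k: they form the (k+1)-core
            k += 1
            core = set(remaining)
            continue
        for node in low:
            for nbr in remaining.pop(node):
                if nbr in remaining:
                    remaining[nbr].discard(node)
    return k, core

# A triangle a-b-c with one pendant node d attached to a
adjacency = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
k, core = main_core_sketch(adjacency)
print(k, sorted(core))
```

The pendant node "d" is peeled off first, leaving the triangle as the 2-core. No node has three neighbours inside the triangle, so that is the main core.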

## (4) Keyword bigrams

In [12]:
def bigram_keywords(text, filter_nouns_adj, K, digraph):
    G = graph_weighted(text, K, filter_nouns_adj, digraph)
    G.remove_edges_from(nx.selfloop_edges(G))
    main_core = nx.k_core(G)  # main core of the co-occurrence graph
    # list of words, filtered the same way as the graph nodes
    words = clean(text, filter_nouns_adj)
    adjacent = set(zip(words, words[1:]))  # pairs of consecutive words in the text
    bigram = []
    # candidate bigrams are the edges of the main core
    for key_1, key_2, data in main_core.edges(data=True):
        # keep the edge only if its two words appear next to each other in the text
        if (key_1, key_2) in adjacent:
            bigram.append((key_1 + ' ' + key_2, data['weight']))
    # sort by edge weight (co-occurrences within the window), highest first
    bigram.sort(key=lambda tup: tup[1], reverse=True)
    return tuple(phrase for phrase, _ in bigram)

In [13]:
bigram_keywords(text,True,4,True)

Sequencing the Neanderthal genome, the Denisovan genome, and several early modern human genomes from Eurasia has confirmed that archaic hominins left their mark in the genomes of modern humans.
Sequencing the Neanderthal genome, the Denisovan genome, and several early modern human genomes from Eurasia has confirmed that archaic hominins left their mark in the genomes of modern humans.


('modern human',
 'archaic genome',
 'haplotype reference',
 'genome modern',
 'archaic hominin',
 'human genome',
 'archaic reference',
 'human haplotype',
 'reference archaic',
 'reference genome',
 'hominin genome',
 'haplotype modern',
 'method haplotype',
 'genome archaic',
 'haplotype method',
 'method reference')
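
The ordering above relies on the edge weight, which counts co-occurrences anywhere inside the window of length K and not only direct neighbours. If a strict adjacent-bigram frequency is wanted, one possible alternative is to count consecutive pairs directly. The sketch below assumes a token list like the one `clean` returns; the function `adjacent_bigram_counts` and the toy tokens are hypothetical.

```python
from collections import Counter

def adjacent_bigram_counts(tokens):
    """Count ordered pairs of consecutive tokens."""
    return Counter(zip(tokens, tokens[1:]))

tokens = ["modern", "human", "genome", "modern", "human"]
counts = adjacent_bigram_counts(tokens)
for (first, second), count in counts.most_common():
    print(first, second, count)
```

Here "modern human" is counted twice, while the same pair would get a higher weight in a windowed graph with `K=2`, because the occurrence separated by "genome" would be added.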