---

#### Python notebook to read all the text files from a text dataset into a Neo4j database

--- 


1. List all the text files in the sub-directories of your dataset.
2. Read all the files.
3. Create nodes, where n(nodes) = n(files).
4. Dump the text files into individual nodes, where every node is a document, and process them with GraphAware's NLP pipeline.
5. Sample scripts for entity extraction
---


Imports<br>glob --> for iterating through the folders and sub-folders

In [9]:
import glob
import csv

Specify the path to the files. The wildcards at the end of the path tell glob which entries to match: one `*` matches everything directly inside a folder, so reaching the files in the sub-directories of the bbc folder takes one more level (`/*/*`).

In [10]:
folder_path = '/your/path/here/*'
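
When the files sit at varying depths, a recursive pattern can collect them all in one call. The following is a minimal sketch, assuming the files end in `.txt`; the helper name `list_text_files` and the sample file names are hypothetical and not part of the original notebook.

```python
import glob
import os
import tempfile


def list_text_files(root):
    """Return the sorted paths of all .txt files under root, at any depth."""
    pattern = os.path.join(root, "**", "*.txt")
    # recursive=True lets "**" match zero or more directory levels.
    return sorted(p for p in glob.glob(pattern, recursive=True) if os.path.isfile(p))


# Small demonstration on a throwaway directory tree (made-up file names).
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "business"))
    os.makedirs(os.path.join(root, "sport"))
    for parts in (("business", "001.txt"), ("sport", "001.txt"), ("README.md",)):
        with open(os.path.join(root, *parts), "w", encoding="utf-8") as f:
            f.write("sample text")
    found = [os.path.relpath(p, root) for p in list_text_files(root)]
    print(found)  # only the two .txt files are listed
```

The `os.path.isfile` check skips any directory whose name happens to end in `.txt`, so every returned path can be passed to `open()`.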

glob iterates over every path that matches the pattern, wildcards included. py2neo provides the `Node` and `Graph` classes used to write the files into the database.

In [11]:
from py2neo import *

In [12]:
from neo4j import GraphDatabase
uri = "bolt://localhost:7687"
driver = GraphDatabase.driver(uri, auth=("neo4j", "neo4j"))

In [13]:
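# py2neo v3 API: authenticate() registers the credentials for the HTTP endpoint.
authenticate("localhost:7474", "neo4j", "neo4j")
graph = Graph("http://localhost:7474/db/data/")

# From py2neo v4 on, authenticate() is gone and credentials are passed to Graph() directly:
#graph = Graph(host='localhost', user='neo4j',password='password')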
tx = graph.begin()

In [17]:
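for filename in glob.glob(folder_path):
    with open(filename, 'r') as f:
        file_contents = f.read()
    # One Article node per file; the Cypher scripts further down read its Text property.
    node = Node("Article", Text=file_contents, path=filename)
    print(node)
    # Stage the node in the open transaction (creating it once, not twice).
    tx.create(node)

# Nothing is written to the database until the transaction is committed.
tx.commit()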

(d2caff1:Article {Text:"Data growth and availability as well as data democratization have radically changed data exploration in the last 10 years. Many different data sets, generated by users, systems and sensors, are continuously being collected. These data sets contain information about scientific experiments, health, energy, education etc., and they are highly heterogeneous in nature, ranging from highly structured data in tabular form to unstructured text, images or videos. Furthermore, especially online content, is no longer the purview of large organizations. Open data repositories are made public and can benefit more types of users, from analysts exploring data sets for insight, scientists looking for patterns, to dashboard interactors and consumers looking for information. As a result, the benefit of data exploration becomes increasingly more prominent. However, the volume and complexity of data make it difficult for most users to access data in an easy way.\n\nIn this project 
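
The `driver` created with the official `neo4j` package earlier is never used by the loop. As a sketch of how the same load could go through that driver instead of py2neo, the rows can be built in plain Python and sent in one parameterized `UNWIND` query. The function names `build_article_rows` and `load_articles`, the UTF-8 encoding, and the sample file are assumptions; the `Article` label and the `Text` and `path` properties mirror the loop above.

```python
import os
import tempfile


def build_article_rows(paths):
    """Read each file and return one parameter dict per future Article node."""
    rows = []
    for path in paths:
        # Assumption: files are UTF-8; undecodable bytes are replaced, not fatal.
        with open(path, "r", encoding="utf-8", errors="replace") as f:
            rows.append({"Text": f.read(), "path": path})
    return rows


# One query creates every node; values travel as parameters, not as string-built Cypher.
CREATE_ARTICLES = """
UNWIND $rows AS row
CREATE (a:Article {Text: row.Text, path: row.path})
"""


def load_articles(rows, uri="bolt://localhost:7687", auth=("neo4j", "neo4j")):
    """Send the rows to Neo4j (assumes a running server and the neo4j package)."""
    from neo4j import GraphDatabase  # imported here so the helpers above work without it

    driver = GraphDatabase.driver(uri, auth=auth)
    try:
        with driver.session() as session:
            # consume() waits for the query to finish before the session closes.
            session.run(CREATE_ARTICLES, rows=rows).consume()
    finally:
        driver.close()


# Demonstration of the row-building step only (no database needed).
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.txt")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write("A short sample document.")
    print(build_article_rows([sample_path]))
```

Calling `load_articles(build_article_rows(paths))` would then create all nodes in a single round trip; for very large datasets the rows could be sent in smaller chunks.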

CALL ga.nlp.processor.addPipeline({
name:"pipeline",
textProcessor: 'com.graphaware.nlp.processor.stanford.ee.processor.EnterpriseStanfordTextProcessor',
processingSteps: {tokenize:true, ner:true, dependencies:true, relations:true, open:true, sentiment:true}
}) 

In [None]:
#NEO4J scripts (non-Python code)

CALL ga.nlp.processor.addPipeline({
    name: 'pipeline2', 
    textProcessor: 'com.graphaware.nlp.processor.stanford.StanfordTextProcessor', 
    processingSteps: {tokenizerAndSentiment:true, ner: true, dependency: true}})


In [None]:
#NEO4J scripts (non-Python code)

CALL apoc.periodic.iterate(
'MATCH (n:Article) RETURN n',
'CALL ga.nlp.annotate({
        	text: n.Text,
        	id: id(n),
        	pipeline: "pipeline2",
        	checkLanguage:false
})
YIELD result MERGE (n)-[:HAS_ANNOTATED_TEXT]->(result)',
{batchSize:1, iterateList:false})


In [None]:
#NEO4J scripts (non-Python code)

MATCH (s:TagOccurrence)<-[]-(a:Sentence)-[]->(v:TagOccurrence),
(a)-[]->(o:TagOccurrence)
WHERE s.pos IN [['NNP']] AND v.pos IN [['VBZ']] AND o.pos IN [['NN']]
RETURN DISTINCT s.value, v.value, o.value

In [None]:
#NEO4J scripts (non-Python code)

MATCH (s:TagOccurrence)<-[]-(a:Sentence)-[]->(v:TagOccurrence),
(a)-[]->(o:TagOccurrence)
WHERE s.pos IN [['NNP']] AND v.pos IN [['VBZ']] AND o.pos IN [['NN']] AND abs(v.startPosition-s.endPosition)<10 AND abs(o.startPosition-v.endPosition)<10 
RETURN DISTINCT s.value, v.value, o.value, v.startPosition-s.endPosition
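
The extra `abs(...) < 10` conditions keep only those triples whose subject, verb and object lie close together in the sentence. A pure-Python sketch of the same filter, run on hypothetical token dictionaries from a single sentence rather than on real `TagOccurrence` nodes, may make the logic easier to follow. The field names mirror the properties used in the query; the function name `nearby_triples` and the sample tokens are made up for illustration.

```python
from itertools import product


def nearby_triples(tokens, max_gap=10):
    """Return (subject, verb, object) values that pass the POS and distance checks."""
    # pos is stored as a list, so the comparison is against a one-element list,
    # which is what `pos IN [['NNP']]` expresses in the Cypher query.
    subjects = [t for t in tokens if t["pos"] == ["NNP"]]
    verbs = [t for t in tokens if t["pos"] == ["VBZ"]]
    objects = [t for t in tokens if t["pos"] == ["NN"]]
    triples = set()  # a set plays the role of RETURN DISTINCT
    for s, v, o in product(subjects, verbs, objects):
        close_sv = abs(v["startPosition"] - s["endPosition"]) < max_gap
        close_vo = abs(o["startPosition"] - v["endPosition"]) < max_gap
        if close_sv and close_vo:
            triples.add((s["value"], v["value"], o["value"]))
    return sorted(triples)


# Hypothetical tokens: "place" is a noun too, but it starts far from the verb.
tokens = [
    {"value": "Neo4j", "pos": ["NNP"], "startPosition": 0, "endPosition": 5},
    {"value": "stores", "pos": ["VBZ"], "startPosition": 6, "endPosition": 12},
    {"value": "data", "pos": ["NN"], "startPosition": 13, "endPosition": 17},
    {"value": "place", "pos": ["NN"], "startPosition": 60, "endPosition": 65},
]
print(nearby_triples(tokens))
```

Without the distance checks, every proper noun, verb and noun in a sentence would be combined, which is why the first query returns many more (and noisier) triples than the second.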

In [None]:
#NEO4J scripts (non-Python code)

MATCH p= (ar:Article)-[:HAS_ANNOTATED_TEXT]->(an:AnnotatedText)-[:CONTAINS_SENTENCE]->(se:Sentence)-[:SENTENCE_TAG_OCCURRENCE]-(s:TagOccurrence)-[:NSUBJ]-(v:TagOccurrence)-[:DOBJ]-(o:TagOccurrence)
OPTIONAL MATCH (o:TagOccurrence)-[:COMPOUND]-(co:TagOccurrence)
OPTIONAL MATCH (o:TagOccurrence)-[:AMOD]-(am:TagOccurrence)
OPTIONAL MATCH (o:TagOccurrence)-[:NMOD]-(nm:TagOccurrence)
OPTIONAL MATCH (nm:TagOccurrence)-[:APPOS]-(apr:TagOccurrence)


RETURN se.text as Text, s.value as Subject, v.value as Predicate, am.value as Desc1, nm.value as Desc2,co.value as Desc3,  apr.value as Prop,  o.value as Object LIMIT 200
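
The notebook imports `csv` without using it. One possible use, offered only as a sketch, is to save the rows returned by the query above to a file. The column names follow the `RETURN` aliases; the function names, the output file name, and the sample row are hypothetical, and `fetch_rows` assumes a driver object from the official `neo4j` package like the one created earlier.

```python
import csv
import os
import tempfile

COLUMNS = ["Text", "Subject", "Predicate", "Desc1", "Desc2", "Desc3", "Prop", "Object"]


def fetch_rows(driver, query):
    """Run a Cypher query and return each record as a plain dict."""
    with driver.session() as session:
        return [record.data() for record in session.run(query)]


def write_triples_csv(rows, out_path):
    """Write the rows to out_path; missing or null values become empty cells."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS, restval="", extrasaction="ignore")
        writer.writeheader()
        for row in rows:
            writer.writerow({k: ("" if v is None else v) for k, v in row.items()})
    return out_path


# Demonstration with one made-up row; OPTIONAL MATCH columns may come back as None.
sample = [{"Text": "Neo4j stores data.", "Subject": "Neo4j", "Predicate": "stores",
           "Desc1": None, "Desc2": None, "Desc3": None, "Prop": None, "Object": "data"}]
out_file = write_triples_csv(sample, os.path.join(tempfile.gettempdir(), "triples_sample.csv"))
with open(out_file, encoding="utf-8") as f:
    print(f.read())
```

With a live database, `write_triples_csv(fetch_rows(driver, query), "triples.csv")` would export up to the 200 rows that the query's `LIMIT` allows.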