<a href="https://colab.research.google.com/github/jacomyma/mapping-controversies/blob/main/notebooks/Documents_with_text_to_word_list.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🍕 Documents with text to word list

**Input:** a list of documents with their text content (CSV).

**Outputs:**
* a list of words with scores (CSV)
* a list of document-word pairs (CSV)

This script extracts so-called [named entities](https://en.wikipedia.org/wiki/Named-entity_recognition): words or groups of words that name people, organizations, locations, and so on.

## How to use

1. Put your input file in the same folder as the notebook
1. Edit the settings if needed
1. Run all the cells
1. Take ALL the output files from the notebook folder
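
As an illustration of the expected input, here is a minimal sketch that builds a small CSV in memory. The `Title` column and the two sentences are made up for the example; only the `Text` column name comes from the default settings. Fields are written quoted because the notebook reads the file with `csv.QUOTE_NONNUMERIC`.

```python
import csv
import io

# Hypothetical documents: only the "Text" column is needed by the default settings.
rows = [
    {"Title": "Doc A", "Text": "Ada Lovelace worked with Charles Babbage in London."},
    {"Title": "Doc B", "Text": 'The report mentioned "London" twice.'},
]

# Write every non-numeric field in quotes, header included.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["Title", "Text"], quoting=csv.QUOTE_NONNUMERIC)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Read the CSV back to check that the text column survives the round trip.
parsed = list(csv.DictReader(io.StringIO(csv_text)))
print(parsed[1]["Text"])
```

Saving `csv_text` to a file named `documents.csv` next to the notebook would give an input matching the default settings.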

# SETTINGS

In [None]:
# Input file
input_file = "documents.csv"

# Which column contains the text?
text_column = "Text"

# Output files
output_file_words = "words.csv"
output_file_pairs = "words-and-documents.csv"


# SCRIPT

### Install and import libraries
This notebook relies on existing libraries (pandas and spaCy).
You can ignore the output of this cell.

In [None]:
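# Install (if needed)
!pip install pandas
!pip install spacy
# Download the English model loaded further down (harmless if already installed)
!python -m spacy download en_core_web_sm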

# Import
import csv
import pandas as pd
import spacy
from spacy import displacy

print("Done.")

### Read the input file

In [None]:
doc_df = pd.read_csv(input_file, quotechar='"', encoding='utf8', doublequote=True, quoting=csv.QUOTE_NONNUMERIC, dtype=object)
print("Preview of the document list:")
doc_df

### Extract named entities
We use spaCy. More fun stuff to do with it is described in [this tutorial](https://www.analyticsvidhya.com/blog/2021/06/nlp-application-named-entity-recognition-ner-in-python-with-spacy/).
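
The per-document counting done in the cell below can be tried out without spaCy. This is a minimal sketch, assuming a hand-written list of (text, label) pairs that stands in for `nertxt.ents`; the pairs are invented, not real model output. It shows why the signature combines text and type: the same string tagged with two different types is counted as two separate entities.

```python
# Stand-in for `nertxt.ents`: hand-written (text, label) pairs, NOT real spaCy output.
fake_entities = [
    ("Paris", "GPE"),
    ("Paris", "GPE"),
    ("Paris", "PERSON"),
    ("London", "GPE"),
]

entities = {}
for ent_text, ent_label in fake_entities:
    # Signature combining the entity text and its type, as in the notebook
    entsign = ent_text + "-" + ent_label
    if entsign not in entities:
        entities[entsign] = {"NE-text": ent_text, "NE-type": ent_label, "NE-count": 0}
    entities[entsign]["NE-count"] += 1

for entsign, record in entities.items():
    print(entsign, record["NE-count"])
```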

In [None]:
# Named entities recognition engine
NER = spacy.load("en_core_web_sm")

doc_index = {}
count = 0
print("Extracting named entities from "+str(len(doc_df.index))+" documents. This might take a while...")
for index, row in doc_df.iterrows():
  text = row[text_column]
  # Missing text is read as NaN (a float); treat it as an empty string
  if not isinstance(text, str):
    text = ""

  nertxt = NER(text)
  entities = {}
  for ne in nertxt.ents:
    # Signature combining the entity text and its type
    entsign = ne.text + '-' + ne.label_
    if entsign not in entities:
      entities[entsign] = {'NE-text': ne.text, 'NE-type':ne.label_, 'NE-count':0}
    entities[entsign]['NE-count'] += 1
  doc_index[index] = entities

  # Report progress every 10 documents
  count += 1
  if count % 10 == 0:
    print("Named entities harvested from "+str(count)+" documents out of "+str(len(doc_df.index))+". Continuing...")

print("Done.")

### Aggregate pairs into a dataframe
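Each output row combines a document's metadata (every column except the text) with one of the entities found in it. A minimal sketch of that merge, assuming a made-up `Title` column and invented counts:

```python
# Hypothetical document row, with the text column already dropped.
doc_metadata = {"Title": "Doc A"}

# Hypothetical entities found in that document (same record shape as the notebook).
doc_entities = {
    "Paris-GPE": {"NE-text": "Paris", "NE-type": "GPE", "NE-count": 2},
    "London-GPE": {"NE-text": "London", "NE-type": "GPE", "NE-count": 1},
}

# One row per document-entity pair: metadata keys first, then entity keys.
pair_list = [{**doc_metadata, **entity} for entity in doc_entities.values()]

for pair in pair_list:
    print(pair)
```

If a metadata column shared a name with one of the `NE-` keys, the entity value would overwrite it, since later keys win in a dictionary merge.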

In [None]:
pair_list = []
doc_notxt_df = doc_df.drop(columns=[text_column])
for index, row in doc_notxt_df.iterrows():
  for entsign in doc_index[index]:
    new_row = {**row, **doc_index[index][entsign]}
    pair_list.append(new_row)

pair_df = pd.DataFrame(pair_list)
print("Done.")
print("Preview of the pairs list:")
pair_df

### Aggregate words into a dataframe
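Each word gets two scores: the total number of occurrences across all documents, and the number of documents it appears in. A minimal sketch of that aggregation on invented data, keeping the column names used by the notebook:

```python
# Hypothetical per-document entity index (document id -> entities found).
doc_index = {
    0: {"Paris-GPE": {"NE-text": "Paris", "NE-type": "GPE", "NE-count": 2}},
    1: {
        "Paris-GPE": {"NE-text": "Paris", "NE-type": "GPE", "NE-count": 1},
        "London-GPE": {"NE-text": "London", "NE-type": "GPE", "NE-count": 3},
    },
}

word_index = {}
for entities in doc_index.values():
    for entsign, ne in entities.items():
        if entsign not in word_index:
            word_index[entsign] = {
                "text": ne["NE-text"],
                "type": ne["NE-type"],
                "count-occurences-total": 0,
                "count-documents": 0,
            }
        # Add this document's occurrences, and count the document once.
        word_index[entsign]["count-occurences-total"] += ne["NE-count"]
        word_index[entsign]["count-documents"] += 1

for word in word_index.values():
    print(word)
```

In this toy data both words occur three times in total, but only "Paris" appears in two documents, which is the kind of difference the two scores are meant to expose.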

In [None]:
word_index = {}
for index, row in doc_notxt_df.iterrows():
  for entsign in doc_index[index]:
    ne = doc_index[index][entsign]
    if entsign not in word_index:
      word_index[entsign] = {'text': ne['NE-text'], 'type': ne['NE-type'], 'count-occurences-total':0, 'count-documents':0}
    word_index[entsign]['count-occurences-total'] += ne['NE-count']
    word_index[entsign]['count-documents'] += 1

word_df = pd.DataFrame(word_index.values())
print("Done.")
print("Preview of the words list:")
word_df

### Save as CSV

In [None]:
try:
  pair_df.to_csv(output_file_pairs, index = False, encoding='utf-8')
except IOError:
  print("/!\\ Error while writing the pairs output file")

try:
  word_df.to_csv(output_file_words, index = False, encoding='utf-8')
except IOError:
  print("/!\\ Error while writing the words output file")
print("Done.")