<a href="https://colab.research.google.com/github/jacomyma/mapping-controversies/blob/main/notebooks/Words_and_documents_with_text_to_document_list_with_words.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🍒 Words and documents with text to document list with words

**Inputs:**
* a list of documents with their text content (CSV)
* a small list of words, like a dozen (CSV)

**Outputs:**
* a list of documents with words as columns (CSV)
* a list of document-word pairs (CSV)
* a bipartite network of documents and words (GEXF)

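This script counts how many times each word occurs in each document. Each word becomes a column, which is why you should only use a few of them; the number of documents can be large. Words can also be multi-word expressions (e.g., named entities). Matching is case-insensitive and based on substrings, so "europe" is also counted inside "european".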

If you have many words and just want the network, check [this notebook](https://colab.research.google.com/github/jacomyma/mapping-controversies/blob/main/notebooks/Words_and_documents_with_text_to_network.ipynb).
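
To make the first two outputs concrete, here is a minimal sketch in plain Python. The two documents and three words are made up for illustration and do not come from the input files; only the column names `Article`, `term` and `term-count` mirror the ones used in this notebook.

```python
# Toy data: hypothetical documents and words, not taken from the real input files.
documents = {
    "Doc A": "google and yahoo compete; google leads",
    "Doc B": "the ftc fined nobody",
}
words = ["google", "yahoo", "ftc"]

# Wide format: one row per document, one column per word.
wide_rows = []
# Long format: one row per (document, word) pair that actually occurs.
pair_rows = []

for doc_id, text in documents.items():
    counts = {word: text.lower().count(word) for word in words}
    wide_rows.append({"Article": doc_id, **counts})
    for word, count in counts.items():
        if count > 0:
            pair_rows.append({"Article": doc_id, "term": word, "term-count": count})

for row in wide_rows:
    print(row)
for row in pair_rows:
    print(row)
```

The wide rows keep the zeros, while the pair rows only list words that were found. That is also why the pair list is the natural starting point for the network.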

## How to use

1. Put your input files in the same folder as the notebook
1. Edit the settings if needed. CHECK THE COLUMN NAMES! They must match the headers of your CSV files exactly, including capitalization.
1. Run all the cells
1. Take ALL the output files from the notebook folder
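
Since a wrong column name is the most likely reason for a run to fail, a quick check before running everything can help. The sketch below is not part of the notebook: `missing_columns` is a hypothetical helper, and the in-memory sample stands in for a real file such as `documents.csv`.

```python
import csv
import io

def missing_columns(csv_file, required_columns):
    """Return the required column names that are absent from the CSV header."""
    reader = csv.reader(csv_file)
    header = next(reader, [])  # an empty file yields an empty header
    return [name for name in required_columns if name not in header]

# Hypothetical in-memory file standing in for documents.csv
sample = io.StringIO('"Article","Text"\n"Spam blog","a spam blog is..."\n')
print(missing_columns(sample, ["Article", "Text"]))  # nothing is missing

sample.seek(0)
print(missing_columns(sample, ["Article", "text"]))  # wrong capitalization
```

With a real file, you would pass `open("documents.csv", newline="", encoding="utf8")` instead of the sample. Note that the default settings use `"Text"` for documents and `"text"` for words, so capitalization matters.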

# SETTINGS

In [None]:
# Input file 1: documents
input_file_documents = "documents.csv"
# Which column contains the text?
documents_text_column = "Text"
# Which column contains the document name or ID?
documents_id_column = "Article"

# Input file 2: small list of words
input_file_words = "words-small-list.csv"
# Which column contains the words?
words_text_column = "text"

# Leave documents that contain none of the words out of the output files?
discard_unrelated_documents = True

# Output files
output_file_documents = "documents-with-terms.csv"
output_file_pairs = "terms-and-documents.csv"
output_file_network = "terms-document-network.gexf"


# SCRIPT

### Install and import libraries
This notebook relies on existing libraries.
You can ignore the output of this cell.

In [None]:
# Install (if needed)
!pip install pandas
!pip install spacy
!pip install networkx

# Import
import csv
import pandas as pd
import networkx as nx

print("Done.")

Done.


### Read input file 1 (documents)

In [None]:
doc_df = pd.read_csv(input_file_documents, quotechar='"', encoding='utf8', doublequote=True, quoting=csv.QUOTE_NONNUMERIC, dtype=object)
print("Preview of the document list:")
doc_df

Preview of the document list:


Unnamed: 0,Article,Text
0,Search engine privacy,search engine privacy is a subset of internet ...
1,Member Berries,"""member berries"" is the first episode in the t..."
2,Real-name system,a real-name system is a system in which users ...
3,CSipSimple,csipsimple is a voice over internet protocol (...
4,Spam blog,"a spam blog, also known as an auto blog or the..."
5,Worldwide Protests for Free Expression in Bang...,the worldwide protests for free expression in ...
6,FTC fair information practice,the united states federal trade commission's ...
7,Spam mass,"spam mass is defined as ""the measure of the im..."


### Read input file 2 (words)

In [None]:
word_df = pd.read_csv(input_file_words, quotechar='"', encoding='utf8', doublequote=True, quoting=csv.QUOTE_NONNUMERIC, dtype=object)
print("Preview of the word list:")
word_df

Preview of the word list:


Unnamed: 0,text,type,count-occurences-total,count-documents
0,congress,ORG,5,3
1,the united states,GPE,4,3
2,yahoo,ORG,19,2
3,europe,LOC,2,2
4,the european union,ORG,2,2
5,google,ORG,28,1
6,ftc,ORG,10,1
7,aol,ORG,7,1
8,doubleclick,PERSON,5,1
9,oecd,ORG,5,1


### Wrangle the data

In [None]:
# Get a set of the words
words = set()
for index, row in word_df.iterrows():
  words.add(row[words_text_column])

# Init data for output
document_list = []
pair_list = []
network_doc_set = set()
network_word_set = set()
network_edge_list = []

# Search words in documents
for index, row in doc_df.iterrows():
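  raw_text = row[documents_text_column]
  # Missing text is read as NaN (a float), so treat anything that is not a string as empty
  text = raw_text.lower() if isinstance(raw_text, str) else ""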
  count_per_word = {}
  flag = False
  for word in words:
    count = text.count(word.lower())
    count_per_word[word] = count
    if count > 0:
      flag = True

  if flag or not discard_unrelated_documents:
    # output 1
    doc_new_row = {**row, **count_per_word}
    document_list.append(doc_new_row)
    # output 2
    for word in words:
      count = count_per_word[word]
      if count > 0:
        pair_new_row = {**row, 'term':word, 'term-count':count}
        pair_list.append(pair_new_row)
    # output 3
    doc_id = row[documents_id_column]
    network_doc_set.add(doc_id)
    for word in words:
      count = count_per_word[word]
      if count > 0:
        network_word_set.add(word)
        network_edge_list.append((doc_id,word,{"count":count}))
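
The cell above counts with `str.count`, which finds non-overlapping substrings. This means a short word is also counted inside longer ones. The sketch below contrasts that behavior with a whole-word alternative based on regular expressions. The function names and the example sentence are made up for illustration, and the whole-word variant is not what this notebook does.

```python
import re

def count_substring(text, word):
    """Count the way the cell above does: non-overlapping substring matches."""
    return text.lower().count(word.lower())

def count_whole_word(text, word):
    """Alternative (not what the notebook does): match only at word boundaries."""
    pattern = r"\b" + re.escape(word.lower()) + r"\b"
    return len(re.findall(pattern, text.lower()))

text = "The European Union and Europe agreed."
print(count_substring(text, "europe"))   # also counts the "europe" inside "european"
print(count_whole_word(text, "europe"))  # only counts the standalone word
```

Whether substring matching is a problem depends on your word list. It is convenient for catching plurals and derived forms, but short acronyms may match inside unrelated words.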



### Make output 1 (documents with words as columns)

In [None]:
output_doc_df = pd.DataFrame(document_list)
output_doc_df = output_doc_df.drop(columns=[documents_text_column])
print("Done.")
print("Preview of the document list:")
output_doc_df

Done.
Preview of the document list:


Unnamed: 0,Article,ftc,the united states,congress,europe,doubleclick,the european union,oecd,yahoo,google,aol
0,Search engine privacy,0,3,3,7,6,1,0,21,50,7
1,Member Berries,0,1,1,0,0,0,0,0,0,0
2,CSipSimple,0,0,0,0,0,0,0,0,1,0
3,Spam blog,0,0,0,0,0,0,0,0,1,0
4,FTC fair information practice,12,2,2,7,0,2,10,0,0,0
5,Spam mass,0,0,0,0,0,0,0,1,0,0


### Make output 2 (document-word pairs)

In [None]:
output_pair_df = pd.DataFrame(pair_list)
output_pair_df = output_pair_df.drop(columns=[documents_text_column])
print("Done.")
print("Preview of the pair list:")
output_pair_df

Done.
Preview of the pair list:


Unnamed: 0,Article,term,term-count
0,Search engine privacy,the united states,3
1,Search engine privacy,congress,3
2,Search engine privacy,europe,7
3,Search engine privacy,doubleclick,6
4,Search engine privacy,the european union,1
5,Search engine privacy,yahoo,21
6,Search engine privacy,google,50
7,Search engine privacy,aol,7
8,Member Berries,the united states,1
9,Member Berries,congress,1


### Save the CSVs

In [None]:
try:
  output_doc_df.to_csv(output_file_documents, index = False, encoding='utf-8')
except IOError:
  print("/!\\ Error while writing the documents output file")

try:
  output_pair_df.to_csv(output_file_pairs, index = False, encoding='utf-8')
except IOError:
  print("/!\\ Error while writing the pairs output file")
print("Done.")

Done.


### Make and save output 3 (network)

In [None]:
# Build the nodes
nodes = []
doc_df_no_text = doc_df.drop(columns=[documents_text_column])
for index, row in doc_df_no_text.iterrows():
  if row[documents_id_column] in network_doc_set:
    nodes.append((row[documents_id_column], {**row, 'label':row[documents_id_column], 'type':'document'}))

for index, row in word_df.iterrows():
  if row[words_text_column] in network_word_set:
    nodes.append((row[words_text_column], {**row, 'label':row[words_text_column], 'type':'term'}))

# Build edges
edges = network_edge_list

G = nx.Graph()
G.add_nodes_from(nodes)
G.add_edges_from(edges)
nx.write_gexf(G, output_file_network)
print("Done.")

Done.
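
Before opening the GEXF file in a network tool, it can be useful to sanity-check the edge list. The sketch below assumes edges in the same `(document, word, {"count": n})` shape as `network_edge_list`. The three edges are copied from the preview of output 1 above, not from the full data.

```python
from collections import Counter

# A few edges in the same shape as network_edge_list (values from the preview above)
edges = [
    ("Search engine privacy", "google", {"count": 50}),
    ("CSipSimple", "google", {"count": 1}),
    ("Search engine privacy", "yahoo", {"count": 21}),
]

# Degree of each term node: in how many documents does the term appear?
documents_per_term = Counter(word for _doc, word, _attrs in edges)

# Sum of the edge "count" attribute: how many occurrences in total?
occurrences_per_term = Counter()
for _doc, word, attrs in edges:
    occurrences_per_term[word] += attrs["count"]

print(documents_per_term.most_common())
print(dict(occurrences_per_term))
```

In the notebook itself you could run the same two counters on `network_edge_list` and compare the results with the columns of output 1.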
