In [13]:
import os
import re
import pandas as pd

# Metadata

In research projects based on Text and Data Mining, it is common to examine differences and similarities between separate groups of texts. To establish such groups, it can be useful to create a CSV file with metadata describing these texts. Next to basic information about the titles and the filenames of the texts, such metadata files ought to provide values for categorical variables, i.e. variables which can take only a limited number of values. This notebook offers instructions on how you can create such a CSV file containing metadata.


## Structure of the CSV file


The CSV file containing the metadata needs to contain at least a `title` field. This field will be used as an identifier for the text.

The CSV file below is an example.


```
path,title,year_of_publication,class
Corpus/Ulysses.txt,Ulysses,1922,A
Corpus/ThroughtheLookingGlass.txt,ThroughtheLookingGlass,1871,A
Corpus/HeartofDarkness.txt,HeartofDarkness,1899,A
Corpus/ARoomWithaView.txt,ARoomWithaView,1908,B
Corpus/ATaleofTwoCities.txt,ATaleofTwoCities,1859,B
Corpus/PrideandPrejudice.txt,PrideandPrejudice,1813,B
```
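To illustrate how such a file can be used to establish groups, the sketch below loads a shortened copy of the example with pandas and lists the titles per value of `class`. The CSV text is inlined via `io.StringIO` only so that the example runs without any files on disk; the names `example_csv`, `metadata` and `groups` are assumptions made for this illustration.

```python
import io
import pandas as pd

# A shortened copy of the example metadata, inlined so that no file is needed
example_csv = """path,title,year_of_publication,class
Corpus/ThroughtheLookingGlass.txt,ThroughtheLookingGlass,1871,A
Corpus/HeartofDarkness.txt,HeartofDarkness,1899,A
Corpus/ARoomWithaView.txt,ARoomWithaView,1908,B
Corpus/PrideandPrejudice.txt,PrideandPrejudice,1813,B
"""

metadata = pd.read_csv(io.StringIO(example_csv))

# Collect the titles belonging to each value of the categorical variable
groups = metadata.groupby('class')['title'].apply(list).to_dict()
print(groups)
```

With a real metadata file, `io.StringIO(example_csv)` would be replaced by the path to that file.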


The CSV file can of course be created manually. The remainder of this notebook also contains some code which can help you to make such a file. 


## Collect all the file names

First, if you have made a directory containing all the files in your corpus, you can collect the paths to all of these files and save them in a list named `corpus`.

In [3]:
corpus_dir = 'Corpus'
corpus = []

for file in os.listdir(corpus_dir):
    # Skip hidden files, whose names start with a dot (e.g. .DS_Store)
    if not file.startswith('.'):
        path = os.path.join(corpus_dir, file)
        corpus.append(path)
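If the directory may also contain subdirectories or files that are not plain text, a somewhat stricter variant can be useful. The sketch below is one possible approach using `pathlib`; the function name `list_corpus_files` and its `suffix` parameter are assumptions made for this illustration and are not used elsewhere in the notebook. It is demonstrated on a temporary directory, so it does not depend on the `Corpus` folder.

```python
import tempfile
from pathlib import Path

def list_corpus_files(directory, suffix='.txt'):
    """Return the sorted paths of all non-hidden files with the given suffix."""
    paths = []
    for entry in Path(directory).iterdir():
        is_hidden = entry.name.startswith('.')
        if entry.is_file() and not is_hidden and entry.suffix == suffix:
            paths.append(str(entry))
    # os.listdir and iterdir return entries in arbitrary order, so sort explicitly
    return sorted(paths)

# Demonstration on a temporary directory with hypothetical file names
with tempfile.TemporaryDirectory() as tmp:
    for name in ['B.txt', 'A.txt', '.DS_Store', 'notes.md']:
        Path(tmp, name).write_text('', encoding='utf-8')
    found = [Path(p).name for p in list_corpus_files(tmp)]

print(found)
```

Sorting the paths makes the order of the rows in the metadata file reproducible across machines.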

## Collect all the titles

If the file names reflect the titles of your texts, these titles can be extracted using the function defined below.

In [5]:
def extract_title(path):
    # Drop the directory part, then the '.txt' extension
    title = os.path.basename(path)
    title = re.sub(r'[.]txt$', '', title)
    return title
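The function above assumes that all files end in `.txt`. If a corpus mixes extensions, a variant based on `pathlib` may be more convenient. This is only a sketch; the name `extract_title_any_ext` and the example paths are hypothetical.

```python
from pathlib import Path

def extract_title_any_ext(path):
    """Return the file name without its directory and without its last extension."""
    return Path(path).stem

# Hypothetical paths, used only to show the behaviour
titles = [extract_title_any_ext(p) for p in ['Corpus/Ulysses.txt', 'Corpus/Emma.xml']]
print(titles)
```

Note that only the last extension is removed, so a name such as `archive.tar.gz` would yield `archive.tar`.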

Using the list named `corpus` that was created earlier, the CSV file can already be partly generated.

In [7]:
# The header
print('path,title')

# Each item in corpus is a path; the title is derived from it
for path in corpus:
    print(f'{path},{extract_title(path)}')

path,title
Corpus/Ulysses.txt,Ulysses
Corpus/ThroughtheLookingGlass.txt,ThroughtheLookingGlass
Corpus/HeartofDarkness.txt,HeartofDarkness
Corpus/ARoomWithaView.txt,ARoomWithaView
Corpus/ATaleofTwoCities.txt,ATaleofTwoCities
Corpus/PrideandPrejudice.txt,PrideandPrejudice
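Printing the fields with an f-string works as long as no path or title contains a comma. For corpora where that cannot be guaranteed, the standard `csv` module can take care of the quoting. The sketch below is one way to do this; the second path is a hypothetical example, and the module is imported under the alias `csv_module` because this notebook later uses the name `csv` for a list.

```python
import csv as csv_module  # aliased, since the notebook uses `csv` as a variable name
import io
import os

# The second path is hypothetical and contains a comma on purpose
paths = ['Corpus/Ulysses.txt', 'Corpus/Emma, a Novel.txt']

buffer = io.StringIO()
writer = csv_module.writer(buffer, lineterminator='\n')
writer.writerow(['path', 'title'])
for path in paths:
    title = os.path.splitext(os.path.basename(path))[0]
    # Fields containing the delimiter are quoted automatically
    writer.writerow([path, title])

output = buffer.getvalue()
print(output)
```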


## Adding additional fields

The prompts below can help you to add further fields.

In [8]:
nr_columns = int(input( "How many columns would you like to add?\n"))

How many columns would you like to add?
1


In [9]:
column_names = []
for column in range(1,nr_columns+1):
    column_name = input( f"Name of column {column}:\n")
    column_names.append(column_name)
    

Name of column 1:
Class


In [10]:
csv = []
for file in corpus:
    title = extract_title(file)
    print(f'{title}:')
    # Every row starts with the path and the title
    row = [file, title]
    # Ask for a value for each of the additional columns
    for column_name in column_names:
        value = input(f"{column_name}: ")
        row.append(value)
    csv.append(row)

Ulysses:
Class: A
ThroughtheLookingGlass:
Class: A
HeartofDarkness:
Class: A
ARoomWithaView:
Class: B
ATaleofTwoCities:
Class: B
PrideandPrejudice:
Class: B
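For larger corpora, typing every value at a prompt can become tedious and is hard to repeat. A possible alternative is to record the values in a dictionary and to build the rows from it. The sketch below assumes a hypothetical dictionary `class_by_title` and hypothetical paths; titles without an entry receive an empty value, so they can be listed and completed afterwards.

```python
import os

# Hypothetical corpus paths and class labels, used only for illustration
paths = ['Corpus/Ulysses.txt', 'Corpus/ARoomWithaView.txt', 'Corpus/Unlabelled.txt']
class_by_title = {'Ulysses': 'A', 'ARoomWithaView': 'B'}

rows = []
for path in paths:
    title = os.path.splitext(os.path.basename(path))[0]
    # Fall back to an empty string when no label has been recorded
    rows.append([path, title, class_by_title.get(title, '')])

# Titles that still need a value
unlabelled = [row[1] for row in rows if row[2] == '']
print(rows)
print(unlabelled)
```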


In [11]:
column_names = ['path','title'] + column_names
print(column_names)

['path', 'title', 'Class']


The values that were collected in this way will finally be saved as a CSV file named `metadata.csv`. 

In [16]:
df = pd.DataFrame(csv, columns = column_names )
df.to_csv('metadata.csv' , index=False)
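After saving, it can be worthwhile to read the file back and check that every text received a value. The sketch below shows one way to do this on a small hypothetical sample, written to a temporary directory so that it does not overwrite `metadata.csv`; the names `sample`, `reloaded`, `missing` and `counts` are assumptions made for this illustration.

```python
import os
import tempfile
import pandas as pd

# Hypothetical rows; the last one deliberately lacks a value for Class
sample_rows = [
    ['Corpus/Ulysses.txt', 'Ulysses', 'A'],
    ['Corpus/PrideandPrejudice.txt', 'PrideandPrejudice', 'B'],
    ['Corpus/HeartofDarkness.txt', 'HeartofDarkness', ''],
]
sample = pd.DataFrame(sample_rows, columns=['path', 'title', 'Class'])

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, 'metadata.csv')
    sample.to_csv(csv_path, index=False)
    reloaded = pd.read_csv(csv_path)

# Empty fields are read back as missing values (NaN)
missing = reloaded[reloaded['Class'].isna()]['title'].tolist()
# Number of texts per value of the categorical variable
counts = reloaded['Class'].value_counts().to_dict()
print(missing)
print(counts)
```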