The code below creates a data set in CSV format for your research project. It applies most of the methods discussed in the course "Digital Text and Data Processing".

Requirements:

* The texts that you want to mine need to be available in a directory named 'Corpus'. If your files are in a different directory, change the value of the 'dir' variable below.
* The files "Religion.txt" and "Politics.txt" need to be available in the same directory as this notebook. It is possible, of course, to work with other lexicons as well. Lexicon files can be found at https://github.com/peterverhaar/dtdp/tree/master/Texts
* A file named "metadata.csv" needs to be available in the same directory as this notebook. This CSV file needs to list the names of all the files in your corpus in a column named 'title', together with the values for the categorical variables that you want to explore in your project. The code below reads one such variable from a column named 'century'.
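
Before running the code, it can help to check that "metadata.csv" matches the corpus. The sketch below is one possible check and not part of the course code: the function name `checkMetadata` is hypothetical, and it assumes that the 'title' column holds the file names including the '.txt' extension, because the code below looks up each file by its file name.

```python
import csv
import os

def checkMetadata( metadataFile , corpusDir , required = ( 'title' , 'century' ) ):
    ## hypothetical helper: returns the missing columns and the
    ## corpus files that have no row in the metadata file
    with open( metadataFile , newline = '' , encoding = 'utf-8' ) as mdFile:
        reader = csv.DictReader( mdFile )
        columns = reader.fieldnames or []
        rows = list( reader )

    missingColumns = [ c for c in required if c not in columns ]

    ## assumption: 'title' holds the file name, including '.txt'
    listed = set( row.get( 'title' ) for row in rows )
    unlisted = sorted( name for name in os.listdir( corpusDir )
                       if name.endswith( '.txt' ) and name not in listed )

    return missingColumns , unlisted
```

Calling `checkMetadata( 'metadata.csv' , 'Corpus' )` returns two lists. Both are empty when the required columns are present and every text file has a row in the metadata file.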

In [None]:

import re
import os
from os.path import join
import dtdpTdm as dtdp
import pandas as pd

md = pd.read_csv( 'metadata.csv' )

## this dictionary, which links titles to
## row indices, is used later to look up the century
mdIndices = dict()
for index , row in md.iterrows():
    mdIndices[ row['title'] ] = index


out = open( 'data.csv' , 'w' )

dir = 'Corpus'

## make a header
out.write( 'title,tokens,ttr,sentences,syllables,nouns,adjectives,adverbs,fk,religion,politics,century\n' )


for file in os.listdir( dir ):
    if re.search( r'[.]txt$' , file ):

        ## extract title from filename by removing extension
        title = re.sub( r'[.]txt$' , '' , file )

        # full path to file in directory
        fileName = join( dir , file )
        print("Analysing " + title + " ...")
        tokens = dtdp.numberOfTokens( fileName )

        ## short texts are disregarded
        if tokens > 50:
            out.write( title + ',' )
            print("Number of tokens")
            out.write( str( tokens ) )

            out.write( ',')

            ### type-token ratio
            print("Calculating type-token ratio")

            ## the 'with' statement closes the file after reading
            with open( fileName ) as textFile:
                fullText = textFile.read()

            words = dtdp.tokenise( fullText )
            if len(words) > 1000:
                words = words[0:1000]
                tokenCount = 1000
            else:
                tokenCount = len(words)
            freq = dict()

            for w in words:
                freq[w] = freq.get( w , 0 ) + 1

            typeCount = len( freq )

            ttr = typeCount / tokenCount

            out.write( str( ttr ) )
            out.write( ',')

            print("Number of sentences")
            out.write( str( dtdp.numberOfSentences( fileName ) ) )
            out.write( ',')
            print("Number of syllables")
            out.write( str( dtdp.numberOfSyllables( fileName ) ) )
            out.write( ',')
            print("Relative frequency of nouns")
            out.write( str( dtdp.countPosTag( fileName , 'NN')  / tokens ) )
            out.write( ',')
            print("Relative frequency of adjectives")
            out.write( str( dtdp.countPosTag( fileName , 'JJ') / tokens  ) )
            out.write( ',')
            print("Relative frequency of adverbs")
            out.write( str( dtdp.countPosTag( fileName , 'RB') / tokens  ) )
            out.write( ',')
            print("Flesch-Kincaid")
            out.write( str( dtdp.fleschKincaid( fileName )   ) )
            out.write( ',')
            print("References to religion")
            out.write( str( dtdp.countOccurrencesLexicon( fileName , 'Religion.txt') / tokens  ) )
            out.write( ',')
            print("References to politics")
            out.write( str( dtdp.countOccurrencesLexicon( fileName , 'Politics.txt') / tokens  ) )
            out.write( ',')
            out.write( str( md.iloc[ mdIndices[file] ]['century' ] ) )
            out.write( '\n')

out.close()

print("\nThe data set has been created. The name of the file is 'data.csv' ")
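
The type-token ratio step in the cell above limits each text to its first 1000 tokens because the ratio falls as a text gets longer, so texts of different lengths are otherwise hard to compare. A minimal, self-contained sketch of that step follows. The names `simpleTokenise` and `typeTokenRatio` are hypothetical, and the regular-expression tokeniser is only a stand-in, since how `dtdp.tokenise` splits a text is not shown here.

```python
import re

def simpleTokenise( text ):
    ## stand-in tokeniser (assumption): lowercased runs of word characters
    return re.findall( r'\w+' , text.lower() )

def typeTokenRatio( text , limit = 1000 ):
    ## use at most the first 'limit' tokens, as in the cell above
    words = simpleTokenise( text )[ :limit ]
    if len( words ) == 0:
        return 0.0
    ## number of distinct words divided by number of words
    return len( set( words ) ) / len( words )

print( typeTokenRatio( 'the cat sat on the mat' ) )
```

In the example sentence, five distinct words occur among six tokens, so the function divides 5 by 6.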


To download the lexicon files that are referenced in the code above, run the code below.

In [None]:
import urllib.request

def download( url ):

    ## fetch the file and decode it as UTF-8
    with urllib.request.urlopen( url ) as response:
        fullText = response.read().decode( 'utf-8' )

    ## the file name is the last part of the URL
    fileName = url.split( '/' )[-1]

    ## save the text under that name in the current directory
    with open( fileName , 'w' , encoding = 'utf-8' ) as out:
        out.write( fullText )



urls = [ 'https://raw.githubusercontent.com/peterverhaar/dtdp/master/Texts/Politics.txt' ,
'https://raw.githubusercontent.com/peterverhaar/dtdp/master/Texts/Religion.txt' 
]

for item in urls:
    print("downloading " + item + " ...")
    download(item)
    
print('All files have been downloaded.')
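
Once the lexicon files are in place, a lexicon-based count amounts to tallying the tokens of a text that also occur in the lexicon. The sketch below illustrates the idea only; it is not the implementation of `dtdp.countOccurrencesLexicon`. It assumes a lexicon file with one term per line, and the function names and the simple tokeniser are hypothetical.

```python
import re

def readLexicon( lexiconFile ):
    ## assumption: one term per line; blank lines are skipped
    with open( lexiconFile , encoding = 'utf-8' ) as lexFile:
        return set( line.strip().lower() for line in lexFile if line.strip() )

def countLexiconTerms( text , lexicon ):
    ## count the tokens of the text that occur in the lexicon
    tokens = re.findall( r'\w+' , text.lower() )
    return sum( 1 for t in tokens if t in lexicon )
```

Dividing the result by the total number of tokens gives a relative frequency, comparable to the 'religion' and 'politics' columns in the data set.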