<a href="https://colab.research.google.com/github/iued-uni-heidelberg/compling2021/blob/main/session00introduction_homework_solution.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Downloading / uploading a text corpus

## Homework task: creating a frequency dictionary of multiword expressions

The sections below have been modified to build a frequency dictionary of multiword expressions (MWEs), treated here as sequences of N consecutive words.

## George Orwell, *1984* (novel)
https://heibox.uni-heidelberg.de/d/d65daff8341e467c82b1/

The texts are available in English, German, French, Spanish and Italian. You can also search for a freely available text in your own language.

## Wikipedia corpus
This site contains plain-text versions of Wikipedia:
https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2735#

You can download the version for your favourite language(s).

1. Download the novel *1984* to your local drive.
2. Upload it to the Colab file system:
   - click the *Files* button on the left;
   - click the *Upload to session storage* button.
3. Examine the file in Colab.
4. Write a command to download it automatically. Tip: look up "wget" and use it in Colab.
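
Step 4 can also be done from Python rather than from the shell. The following is a minimal sketch that uses `urllib.request.urlretrieve` from the standard library; the helper name `download_corpus` is an illustrative assumption and not part of the original notebook.

```python
from urllib.request import urlretrieve


def download_corpus(url, target_name):
    """Download `url`, save it under `target_name` and return the saved path.

    `download_corpus` is a hypothetical helper name used for illustration.
    """
    saved_path, _headers = urlretrieve(url, target_name)
    return saved_path


# Example call (commented out so that nothing is downloaded by accident):
# download_corpus("https://heibox.uni-heidelberg.de/f/00ee04d9b9544c298be7/?dl=1", "go1984.txt")
```

Passing the target name explicitly avoids the separate renaming step that is needed after `wget`.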



In [None]:
# Example: downloading files for different languages, e.g. the German file from https://heibox.uni-heidelberg.de/f/ea06aa47fe2d49959a62/
# Uncomment the line for the language you want to work with and comment out the others

# German
# !wget https://heibox.uni-heidelberg.de/f/ea06aa47fe2d49959a62/?dl=1

# Armenian
# !wget https://heibox.uni-heidelberg.de/f/3255f672ff7b4190828a/?dl=1

# Georgian
# !wget https://heibox.uni-heidelberg.de/f/318b32556cdc44d38238/?dl=1

# French
# !wget https://heibox.uni-heidelberg.de/f/b0cc03fbdb6248cab544/?dl=1

# Spanish
# !wget https://heibox.uni-heidelberg.de/f/585ee5e9eb3548219c34/?dl=1

# Italian
# !wget https://heibox.uni-heidelberg.de/f/fe1ae20b08b240f3a4f0/?dl=1

# English
# !wget https://heibox.uni-heidelberg.de/f/00ee04d9b9544c298be7/?dl=1

# Georgian 'Brown corpus'
# !wget https://heibox.uni-heidelberg.de/f/d5603814da69440aadf4/?dl=1

# English Brown corpus (text)
!wget https://heibox.uni-heidelberg.de/f/d2c3543b757d49839ac8/?dl=1


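# rename the downloaded file (wget saves it as 'index.html?dl=1'; the quotes keep the shell from treating '?' as a wildcard)
!mv "index.html?dl=1" go1984.txt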

Now let's make a frequency dictionary from our file!

In [None]:
# step 1: import the libraries we need (standard library plus scipy and matplotlib)
import sys, re, os
import scipy.stats as ss
import matplotlib.pyplot as plt
import math
import csv

In [None]:
# step 2: open your file for reading (change the file name for a different language / corpus)
# the corpus files are assumed to be UTF-8 encoded
# FileInput = open("go1984en.txt", 'r', encoding='utf-8')
FileInput = open("go1984.txt", 'r', encoding='utf-8')
# open another file for writing (our output file)
FileOutput = open("go1984-MWE-frq.txt", 'w', encoding='utf-8')

In [None]:
# step 3: create an empty frequency dictionary: MWEs will be 'keys', frequencies will be 'values'
DictionaryFrq = {}

In [None]:
# step 4: read each line, clean it up, split it into words and count the N-word sequences (MWEs):
INGram = 2 # length of an MWE in words (2 = word pairs)
for Line in FileInput:
    Line = re.sub('<.*?>', '', Line) # remove html/xml tags
    Line = Line.lower() # convert to lower case
    Line = Line.strip() # remove leading and final white spaces
    ListOfWords = re.split(r'[ ,.:;!()"\[\]«»\-?]+', Line) # tokenize: split on spaces and punctuation (raw string avoids invalid-escape warnings)
    # remove empty words: https://stackoverflow.com/questions/3845423/remove-empty-strings-from-a-list-of-strings
    ListOfWords = list(filter(None, ListOfWords))   
    # sys.stdout.write( str( ListOfWords ) + '\n' )

    # If the MWE is already in the dictionary, add 1 to its frequency; if it is not, set its frequency to 1
    # { 'mwe one' : 4 , 'mwe two' : 4, 'mwe three' : 2 }
    # -- homework -- how to do MWEs
    # we use list slices, a[start:stop], to take INGram consecutive words starting at position i
    # 
    # for Word in ListOfWords:
    for i in range(len(ListOfWords)):
        # Word = ListOfWords[i]
        # stop when fewer than INGram words remain on the line
        if i+INGram > len(ListOfWords): break
        LWordsMWE = ListOfWords[i:i+INGram] # slicing never raises an exception, so no try/except is needed

        SWordsMWE = ' '.join(LWordsMWE)
        # dict.get returns 0 for an MWE that has not been seen yet
        DictionaryFrq[SWordsMWE] = DictionaryFrq.get(SWordsMWE, 0) + 1
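
The counting loop above can also be packaged as a reusable function. The sketch below is one possible variant and not the notebook's own code: the names `tokenize` and `count_ngrams` are hypothetical, and `collections.Counter` stands in for the hand-written dictionary update.

```python
import re
from collections import Counter

# same separators as in step 4: spaces and common punctuation marks
SEPARATORS = re.compile(r'[ ,.:;!()"\[\]«»\-?]+')


def tokenize(line):
    """Remove tags, lower-case the line and split it into a list of words."""
    line = re.sub(r'<.*?>', '', line).lower().strip()
    return [word for word in SEPARATORS.split(line) if word]


def count_ngrams(lines, n=2):
    """Count every sequence of n consecutive words within each line."""
    counts = Counter()
    for line in lines:
        words = tokenize(line)
        # the last valid start position is len(words) - n
        for i in range(len(words) - n + 1):
            counts[' '.join(words[i:i + n])] += 1
    return counts


sample = ["Big Brother is watching you.", "Big Brother is everywhere!"]
print(count_ngrams(sample, n=2).most_common(2))
# [('big brother', 2), ('brother is', 2)]
```

As in the loop above, sequences are counted within a line only, so an MWE never spans a line break.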

In [None]:
# step 5: save the frequency dictionary to the output file, sorted by decreasing frequency
# FileOutput.write( str( DictionaryFrq ) + '\n' )
for Word, Frq in sorted( DictionaryFrq.items() , key=lambda x: x[1], reverse=True) :
    FileOutput.write(Word + '\t' + str(Frq) + '\n')
FileOutput.close() # closing the file writes all pending output to disk
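
As an alternative to the manual `write` calls, the `csv` module imported in step 1 can produce the same tab-separated file. This is a minimal sketch under stated assumptions: the function name `save_frequencies`, the toy dictionary and the file name `toy-MWE-frq.txt` are made up for illustration.

```python
import csv


def save_frequencies(frequencies, path):
    """Write (item, frequency) rows to `path`, most frequent first, tab-separated."""
    rows = sorted(frequencies.items(), key=lambda item: item[1], reverse=True)
    # newline='' is what the csv module documentation recommends for output files;
    # the `with` block closes the file when it ends
    with open(path, 'w', encoding='utf-8', newline='') as out:
        writer = csv.writer(out, delimiter='\t', lineterminator='\n')
        writer.writerows(rows)


save_frequencies({'big brother': 2, 'is watching': 1, 'of the': 5}, 'toy-MWE-frq.txt')
```

Opening the file in a `with` block means it is closed even if an error occurs while writing.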

## Tasks
1. Examine the frequency file.
2. Download it to your local machine.
3. Change the programme to create a frequency dictionary from another file / corpus / language.
4. Change the programme to preserve lower/upper-case letters. How would you print out only the entries with frequency > 1?
5. Run it again and compare the results (save the results in another file).
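
For the second part of task 4, one possible approach is to filter the dictionary before printing. The sketch below uses a small hand-made dictionary in place of `DictionaryFrq`; the names `frequent_items` and `toy_frequencies` are made up for this example.

```python
def frequent_items(frequencies, min_frq=2):
    """Return (item, frequency) pairs with frequency >= min_frq, most frequent first."""
    kept = [(item, frq) for item, frq in frequencies.items() if frq >= min_frq]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


toy_frequencies = {'big brother': 2, 'is watching': 1, 'of the': 5}
for item, frq in frequent_items(toy_frequencies):
    print(item + '\t' + str(frq))
```

In the notebook itself, the call would be `frequent_items(DictionaryFrq)`.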


In [None]:
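# rank the entries by decreasing frequency (rank 1 = most frequent)
LRanks = []
LFrqs = []
LCoef = []
rank = 0
for Word, Frq in sorted(DictionaryFrq.items() , key=lambda x: x[1], reverse=True):
    rank +=1
    # coefficient: rank * frequency (roughly constant if frequency is inversely proportional to rank)
    coef = rank * Frq
    LRanks.append(rank)
    LFrqs.append(Frq)
    LCoef.append(coef)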

print(len(LRanks))
print(len(LFrqs))
print(len(LCoef))

Let's plot frequency against rank, together with our coefficient (rank × frequency), on a log-log scale.

In [None]:
# red circles: log(frequency) against log(rank)
plt.plot([math.log(c) for c in LRanks], [math.log(c) for c in LFrqs], 'ro')
# blue squares: log(rank * frequency) against log(rank)
plt.plot([math.log(c) for c in LRanks], [math.log(c) for c in LCoef], 'bs')
# plt.plot([c for c in LRanks], [c for c in LFrqs], 'bs')
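
If frequency is roughly inversely proportional to rank (Zipf's law), the log-log points for frequency lie close to a straight line with slope near -1, and the coefficient rank × frequency stays roughly constant. The following sketch estimates that slope with an ordinary least-squares fit on the logged values. It uses only the `math` module; the function name `loglog_slope` and the toy frequencies are assumptions for illustration, not results from the corpus.

```python
import math


def loglog_slope(ranks, frqs):
    """Least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(r) for r in ranks]
    ys = [math.log(f) for f in frqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    return sum_xy / sum_xx


# toy data that follows frequency = 1200 / rank exactly,
# so the fitted slope is -1 up to floating-point rounding
toy_ranks = [1, 2, 3, 4, 5, 6]
toy_frqs = [1200 / r for r in toy_ranks]
print(round(loglog_slope(toy_ranks, toy_frqs), 3))
```

With the notebook's own lists, the call would be `loglog_slope(LRanks, LFrqs)`; how close the result is to -1 depends on the corpus and on the MWE length.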
