<a href="https://colab.research.google.com/github/iued-uni-heidelberg/DAAD-Training-2021/blob/main/Terminologieextraktion.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Navigating Google Colab notebooks: 

1. You can click the folder icon on the left to **see/upload/download files**
2. There are two types of cells: **code and text**. 
3. You can **edit/change/update** any cell by clicking on it. The code runs in your own personal account on Google Cloud; other participants work in their own spaces with their own copy of the same code.
4. You can **download/save a copy of your Notebook** or the Python file to your local computer or your Google Drive via the *File* menu
5. To **run a code cell**, either click the “run” icon to its left or press *Ctrl+Enter*.
6. To add a new code/text cell below the current cell, click the *+Code* or *+Text* button.
7. To **convert between code and text**: 
- *Ctrl+M M* converts a code cell to a text cell.
- *Ctrl+M Y* converts a text cell to a code cell.
8. Under *“Runtime > Change runtime type”* you can request a **CPU-only, GPU or TPU environment** (depending on the code you run: some Machine Learning packages need a GPU for faster processing).
9. A *Python notebook* is opened by default. **To create an *R notebook***, open https://colab.research.google.com/#create=true&language=r or https://colab.to/r ; you can check that you are running R under *“Runtime > Change runtime type”* (R or Python 3 will be the choices). GPU and TPU are available for R notebooks as well.
10. Saving your work: go to *"File > Save a copy in Drive"*
- For the exercises, please run the cells in the order in which they appear.
- Some cells (for training neural networks or downloading large models) will run for about 10-15 min, so just grab a cup of coffee!
- Experiment with your own examples.
- The environment is cleared when you leave the Colab space, so make sure you save your changes to Google Drive.

Handout: https://docs.google.com/document/d/1S2fPnFuv5tbLWsS2Wsr8HeZ9CUQZR4100npmSRXuip0/edit?usp=sharing



## Try it out: convert this section to code and run it. It should print a message

my_name = 'First Last'

print(f'Hello from {my_name}!\n')

# Downloading / uploading a text corpus

## George Orwell, 1984 novel: 
https://heibox.uni-heidelberg.de/d/d65daff8341e467c82b1/

(texts in en, de, fr, es, it. You can search for a freely-available text in your own language).

## Wikipedia corpus
This site contains plain-text versions of Wikipedia:
https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2735#

You can download the version for your favourite language(s).

1. Download the "1984" novel to your local drive
2. Upload it onto the Colab file system:
- *Files* button to the left 
- *Upload to Session storage* button
3. Examine the file on Colab
4. Write a command to download it onto your system automatically. Tip: search for "wget" and use it in Colab.
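As an alternative to a shell command, the download in step 4 can also be done from Python. The following is a minimal sketch that assumes the standard-library function `urllib.request.urlretrieve` is sufficient for these links; the helper name `download` is made up for this illustration and is not part of the course code.

```python
import urllib.request
from pathlib import Path

def download(url, target):
    """Fetch `url`, store it under the local file name `target`,
    and return the size of the saved file in bytes."""
    urllib.request.urlretrieve(url, target)
    return Path(target).stat().st_size

# Example call (needs network access), using the Grundgesetz link from this notebook:
# download('https://heibox.uni-heidelberg.de/f/d6c5b31edc84422d9e14/?dl=1', 'GG_l.txt')
```

Because the target file name is passed explicitly, no separate renaming step is needed afterwards.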



In [None]:
# Armenian Wikipedia
# (the URL and the downloaded file name are quoted so that the shell does not treat "?" as a wildcard)
!wget "https://heibox.uni-heidelberg.de/f/206d85a0270943e4b87b/?dl=1"
# renaming file
!mv "index.html?dl=1" WPhy_vert.txt

# Constitution of the Republic of Armenia
!wget "https://heibox.uni-heidelberg.de/f/bf54977b17604ec592cd/?dl=1"
# renaming file
!mv "index.html?dl=1" Const_RA_l.txt

# Grundgesetz
!wget "https://heibox.uni-heidelberg.de/f/d6c5b31edc84422d9e14/?dl=1"
# renaming file
!mv "index.html?dl=1" GG_l.txt

# On the Origin of Species
!wget "https://heibox.uni-heidelberg.de/f/befc6dbe718b4a37ba74/?dl=1"
# renaming file
!mv "index.html?dl=1" OOOS_l.txt

# Europarl DE
!wget "https://heibox.uni-heidelberg.de/f/3ba6122e744e4b7f9c14/?dl=1"
# renaming file
!mv "index.html?dl=1" EP_DE_l.txt


In [None]:
# Terminology extraction
import re

class clProcCorpus(object):
  '''Read a tagged corpus line by line and build a frequency dictionary
  of term candidates (uninterrupted runs of nouns and adjectives).
  Expected line format: word <TAB> PoS tag <TAB> lemma.
  The dictionary can be sorted later...
  '''

  def __init__(self, FileIN):
    self.DictFrq = {}
    self.processCorpus(FileIN)

  def processCorpus(self, FileIN):
    LTerm = []  # lemmas of the term candidate currently being collected
    for Line in FileIN:
      LLine = Line.strip().split('\t')
      if len(LLine) >= 3:
        Word, PoS, Lemma = LLine[0], LLine[1], LLine[2]
      else:
        # blank or malformed line: treat it as a term boundary
        Word, PoS, Lemma = '', '', ''

      # Select the tags for your language
      #if re.match('N.*', PoS) or re.match('A.*', PoS): #Arm
      if re.match('N.*', PoS) or re.match('ADJ.*', PoS): #DE
      #if re.match('N.*', PoS) or re.match('J.*', PoS): #EN
        # Terms as words or lemmas: append Word instead of Lemma to keep word forms
        LTerm.append(Lemma)
      else:
        self.countTerm(LTerm)
        LTerm = []

    # count a candidate that ends at the last line of the file
    self.countTerm(LTerm)

  def countTerm(self, LTerm):
    # skip empty candidates (e.g. two consecutive non-term tokens)
    if LTerm:
      STerm = ' '.join(LTerm)
      self.DictFrq[STerm] = self.DictFrq.get(STerm, 0) + 1
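To see what the class above does, it helps to run the same idea on a few lines by hand. The sketch below is a simplified standalone version, not the course code: the function name `extract_terms` and the sample sentence are invented for illustration, and the sample assumes STTS-style German tags (`NN`, `ADJA`, `ART`, ...) in the word / PoS / lemma format.

```python
import re
from collections import Counter

# Invented sample in vertical format: word <TAB> PoS tag <TAB> lemma
SAMPLE = [
    "Die\tART\tdie",
    "freie\tADJA\tfrei",
    "Entfaltung\tNN\tEntfaltung",
    "ist\tVAFIN\tsein",
    "ein\tART\tein",
    "Recht\tNN\tRecht",
    ".\t$.\t.",
    "Die\tART\tdie",
    "freie\tADJA\tfrei",
    "Entfaltung\tNN\tEntfaltung",
]

def extract_terms(lines, tag_pattern=r'(N|ADJ)'):
    """Count uninterrupted runs of noun/adjective lemmas."""
    counts = Counter()
    current = []  # lemmas of the run being collected
    for line in lines:
        fields = line.strip().split('\t')
        if len(fields) >= 3 and re.match(tag_pattern, fields[1]):
            current.append(fields[2])
        else:
            # any other token closes the current run
            if current:
                counts[' '.join(current)] += 1
            current = []
    if current:  # a run that reaches the end of the input
        counts[' '.join(current)] += 1
    return counts

terms = extract_terms(SAMPLE)
print(terms.most_common())
# [('frei Entfaltung', 2), ('Recht', 1)]
```

Each article, verb or punctuation mark ends the current run, so "frei Entfaltung" is counted once per occurrence and the single noun "Recht" forms a one-word candidate.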



In [None]:
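# Choose ONE input/output pair: uncomment the pair you need and comment out the others.
# The Grundgesetz pair is active here because the tag pattern in clProcCorpus is set to DE.
FileIN = open('GG_l.txt', 'r', encoding='utf-8')
FileOut = open('GG_term.txt', 'w', encoding='utf-8')

#FileIN = open('EP_DE_l.txt', 'r', encoding='utf-8')
#FileOut = open('EP_DE_term.txt', 'w', encoding='utf-8')

#FileIN = open('OOOS_l.txt', 'r', encoding='utf-8')
#FileOut = open('OOOS_term.txt', 'w', encoding='utf-8')

#FileIN = open('Const_RA_l.txt', 'r', encoding='utf-8')
#FileOut = open('Const_RA_term.txt', 'w', encoding='utf-8')

# Const_RF_l.txt is not downloaded by the cells above; upload it yourself before using this pair
#FileIN = open('Const_RF_l.txt', 'r', encoding='utf-8')
#FileOut = open('Const_RF_term.txt', 'w', encoding='utf-8')

# Georgian Brown corpus (lemmatised)
!wget "https://heibox.uni-heidelberg.de/f/d306be8b559849e79826/?dl=1"
# renaming file
!mv "index.html?dl=1" ge_Brown_lem.txt

#FileIN = open('WPhy_vert.txt', 'r', encoding='utf-8')
#FileOut = open('WPhy-terms.txt', 'w', encoding='utf-8')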

--2021-09-01 11:08:44--  https://heibox.uni-heidelberg.de/f/d306be8b559849e79826/?dl=1
Resolving heibox.uni-heidelberg.de (heibox.uni-heidelberg.de)... 129.206.7.113
Connecting to heibox.uni-heidelberg.de (heibox.uni-heidelberg.de)|129.206.7.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://heibox.uni-heidelberg.de/seafhttp/files/86b13802-0034-44e1-b27d-4dc23002612f/ge_Brown_lem.txt [following]
--2021-09-01 11:08:44--  https://heibox.uni-heidelberg.de/seafhttp/files/86b13802-0034-44e1-b27d-4dc23002612f/ge_Brown_lem.txt
Reusing existing connection to heibox.uni-heidelberg.de:443.
HTTP request sent, awaiting response... 200 OK
Length: 17565804 (17M) [text/plain]
Saving to: ‘index.html?dl=1’


2021-09-01 11:08:46 (13.6 MB/s) - ‘index.html?dl=1’ saved [17565804/17565804]



In [None]:
# save the frequency dictionary into a file, sorted by decreasing frequency
# FileOut.write( str( DictionaryFrq ) + '\n' )

OCorpus = clProcCorpus(FileIN)
DictionaryFrq = OCorpus.DictFrq

for Word, Frq in sorted(DictionaryFrq.items(), key=lambda x: x[1], reverse=True):
  # keep only multi-word terms (those containing a space)
  if ' ' in Word:
    FileOut.write(Word + '\t' + str(Frq) + '\n')

# close both files so that the output is fully written to disk
FileIN.close()
FileOut.close()
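The sorting and filtering step in the cell above can be tried in isolation on a small dictionary. This is a sketch with invented frequencies; the variable names and the output file `demo_terms.txt` (written to the system's temporary directory) are assumptions made for the example. It uses a `with` block, which closes the file automatically.

```python
import os
import tempfile

# Invented frequency dictionary: term -> number of occurrences
term_frq = {'frei Entfaltung': 2, 'Recht': 5, 'öffentlich Gewalt': 3}

# Sort (term, frequency) pairs by frequency, highest first
rows = sorted(term_frq.items(), key=lambda item: item[1], reverse=True)

# Keep only multi-word terms, i.e. those containing a space
multiword = [(term, frq) for term, frq in rows if ' ' in term]

out_path = os.path.join(tempfile.gettempdir(), 'demo_terms.txt')
with open(out_path, 'w', encoding='utf-8') as file_out:
    for term, frq in multiword:
        file_out.write(f'{term}\t{frq}\n')

print(multiword)
# [('öffentlich Gewalt', 3), ('frei Entfaltung', 2)]
```

The `key` function tells `sorted` to compare the second element of each pair (the frequency), and `reverse=True` puts the most frequent terms first.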

## Tasks
1. Examine the frequencies file.
2. Download it onto your local machine.
3. Change the programme to create a frequency dictionary from another file / corpus / language.
4. Change the programme to preserve lower/upper-case letters. How would you print out only words with frequency > 1?
5. Run it again and compare the results (save the results in another file).


# Foundations of Python:
1. Variables

2. Data types
- integers, floating point numbers
- strings
- lists, tuples
- dictionaries, multidimensional dictionaries
- files

3. Regular expressions
- Match and search
- Capturing expressions in context

4. Control flow statements
- if, elif, else
- while, for
- break, continue
- range

5. Functions
- standard functions (e.g., sorted)
- writing your own functions

6. Classes and objects
- Object packages and libraries 
- Ideas of object-oriented programming

7. Writing reusable code and using others' libraries
- Principles of software engineering
- Using objects for Machine Learning (TensorFlow, PyTorch), text processing (NLTK), data visualisation
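As a small illustration of item 3 in the outline (regular expressions), the sketch below contrasts `re.match`, which only looks at the beginning of a string, with `re.search`, which looks anywhere in it, and then captures the three fields of a corpus line. The example strings are invented and follow the word / PoS / lemma format used in this notebook.

```python
import re

pos = 'ADJA'

# re.match succeeds only if the pattern fits at the START of the string
starts_with_adj = re.match(r'ADJ', pos) is not None
starts_with_n = re.match(r'N', pos) is not None

# re.search succeeds if the pattern fits ANYWHERE in the string
contains_space = re.search(r' ', 'frei Entfaltung') is not None

# Parentheses capture parts of the match; groups() returns them as a tuple
m = re.search(r'^(\S+)\t(\S+)\t(\S+)$', 'Rechte\tNN\tRecht')
word, tag, lemma = m.groups()

print(starts_with_adj, starts_with_n, contains_space)
# True False True
print(word, tag, lemma)
# Rechte NN Recht
```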