<a href="https://colab.research.google.com/github/iued-uni-heidelberg/DAAD-Training-2021/blob/main/session00introduction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Navigating Google Colab notebooks: 

1. You can click the folder icon on the left to **see/upload/download files**
2. There are two types of cells: **code and text**. 
3. You can **edit/change/update** any cell by clicking on it. The code runs in your own personal account on Google cloud; other participants work in their own spaces with the same code.
4. You can **download/save a copy of your Notebook** or the Python file to your local computer or your Google Drive via the *File* menu
5. To **run a code cell**, either click the “run” icon to its left or press *Ctrl+Enter*
6. To add a new code/text cell below the current cell, click the *+Text* or *+Code* button
7. To **convert between code and text**: 
- *Ctrl+M M* (press Ctrl+M, then M) converts a code cell to a text cell. 
- *Ctrl+M Y* (press Ctrl+M, then Y) converts a text cell to a code cell.
8. Under *“Runtime > Change runtime type”* you can request a **CPU-only, GPU or TPU environment** (depending on the code you run: some Machine Learning packages need a GPU for faster processing).
9. A *Python notebook* is opened by default. **To create an *R notebook***, open https://colab.research.google.com/#create=true&language=r or https://colab.to/r ; you can check that you are running R in *“Runtime > Change runtime type”* (R or Python3 will be the choices); GPU and TPU will be available for R notebooks as well.
10. Saving your work: go to *"File > Save a copy in Drive"*
- For the exercises, please run the cells in the order in which they appear. 
- Some cells (for training neural networks or downloading large models) will run for about 10-15 min -- just grab a cup of coffee!
- Experiment with your own examples
- The environment is cleared when you leave the Colab space, so please make sure you save your changes to Google Drive

Handout: https://docs.google.com/document/d/1S2fPnFuv5tbLWsS2Wsr8HeZ9CUQZR4100npmSRXuip0/edit?usp=sharing



## Try it out: convert this section to code and run it. It should print a message

my_name = 'First Last'

print(f'Hello from {my_name}!\n')

# Downloading / uploading a text corpus

## George Orwell, 1984 novel: 
https://heibox.uni-heidelberg.de/d/d65daff8341e467c82b1/

(Texts in en, de, fr, es, it. You can also search for a freely available text in your own language.)

## Wikipedia corpus
This site contains plain-text versions of Wikipedia:
https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2735#

You can download the version for your favourite language(s).

1. Download the "1984" novel to your local drive
2. Upload it to the Colab file system:
- *Files* button to the left 
- *Upload to session storage* button
3. Examine the file in Colab
4. Write a command that downloads it automatically. Tip: search for "wget" and use it in Colab



In [None]:
# example of downloading the files for different languages, e.g. the German file from https://heibox.uni-heidelberg.de/f/ea06aa47fe2d49959a62/
# comment / uncomment the lines below to choose the language you want to work with

# German
# !wget https://heibox.uni-heidelberg.de/f/ea06aa47fe2d49959a62/
# !wget https://heibox.uni-heidelberg.de/f/ea06aa47fe2d49959a62/?dl=1

# Armenian
# !wget https://heibox.uni-heidelberg.de/f/3255f672ff7b4190828a/?dl=1

# Georgian
# !wget https://heibox.uni-heidelberg.de/f/318b32556cdc44d38238/?dl=1

# French
# !wget https://heibox.uni-heidelberg.de/f/b0cc03fbdb6248cab544/?dl=1

# Spanish
# !wget https://heibox.uni-heidelberg.de/f/585ee5e9eb3548219c34/?dl=1

# Italian
# !wget https://heibox.uni-heidelberg.de/f/fe1ae20b08b240f3a4f0/?dl=1

# English
!wget https://heibox.uni-heidelberg.de/f/00ee04d9b9544c298be7/?dl=1

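# rename the downloaded file (wget saves it as 'index.html?dl=1');
# the quotes stop the shell from treating '?' as a wildcard
!mv "index.html?dl=1" go1984.txt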
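
The download and rename steps above can also be done from Python in a single call. The following is a minimal sketch using `urllib.request.urlretrieve` from the standard library; the helper name `download_as` is hypothetical, and the sketch assumes the `?dl=1` link returns the text file directly, as the `wget` cell implies.

```python
from urllib.request import urlretrieve

def download_as(url, filename):
    """Download `url` and save it under `filename` (hypothetical helper)."""
    # urlretrieve returns the local file name and the response headers.
    saved_path, _headers = urlretrieve(url, filename)
    return saved_path

# Usage in Colab (English text, URL taken from the cell above):
# download_as("https://heibox.uni-heidelberg.de/f/00ee04d9b9544c298be7/?dl=1", "go1984.txt")
```

Saving straight to the target name avoids the separate renaming step.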

Now let's make a frequency dictionary from our file!

In [None]:
# step 1: importing standard Python libraries
import sys, re, os

In [None]:
# step 2: open your file for reading (change the file name for a different language / corpus)
# the explicit encoding assumes the texts are stored as UTF-8
# FileInput = open("go1984en.txt", 'r', encoding='utf-8')
FileInput = open("go1984.txt", 'r', encoding='utf-8')
# open another file for writing (our output file)
FileOutput = open("go1984en-frq.txt", 'w', encoding='utf-8')

In [None]:
# step 3: create an empty frequency dictionary: words will be 'keys', frequencies will be 'values'
DictionaryFrq = {}

In [None]:
# step 4: read each line, clean it up, split it into words and count unique words:

for Line in FileInput:
    Line = re.sub(r'<.*?>', '', Line) # remove html/xml tags
    Line = Line.lower() # convert to lower case
    Line = Line.strip() # remove leading and trailing white space
    ListOfWords = re.split(r'[\s,.:;!()"\[\]]+', Line) # tokenize: split on white space and punctuation
    # sys.stdout.write( str( ListOfWords ) + '\n' )

    # If word exists, add 1 to its existing frequency; if it doesn't, then set frequency to 1
    # { word1 : 4 , word2 : 4, word3 : 2 }
    for Word in ListOfWords:
        if not Word: # skip empty strings produced by the split
            continue
        try:
            DictionaryFrq[Word] += 1
        except KeyError: # first occurrence of this word
            DictionaryFrq[Word] = 1
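
The counting loop in step 4 can also be written with `collections.Counter` from the standard library, which handles the "first occurrence" case by itself. This is a sketch of an alternative, not the notebook's own method; the helper name `count_words` and the two sample lines are made up for illustration, and the split pattern is similar to the one used in step 4.

```python
import re
from collections import Counter

# Same idea as step 4: strip tags, lower-case, split on white space and punctuation.
SPLIT_PATTERN = re.compile(r'[\s,.:;!()"\[\]]+')

def count_words(lines):
    """Return a Counter mapping each word to its frequency (hypothetical helper)."""
    frequencies = Counter()
    for line in lines:
        line = re.sub(r'<.*?>', '', line).lower().strip()
        # Drop the empty strings that re.split yields at the line edges.
        words = [word for word in SPLIT_PATTERN.split(line) if word]
        frequencies.update(words)
    return frequencies

# Made-up sample lines standing in for the corpus file.
example = count_words(["The cat, the dog.", "<p>A cat!</p>"])
print(example.most_common(2))
```

Because an open file can be iterated line by line, `count_words(FileInput)` would work on the corpus in the same way.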

In [None]:
# step 5: save the frequency dictionary to the output file, in order of decreasing frequency
# FileOutput.write( str( DictionaryFrq ) + '\n' )
for Word, Frq in sorted( DictionaryFrq.items() , key=lambda x: x[1], reverse=True) :
    FileOutput.write(Word + '\t' + str(Frq) + '\n')
# close both files so that the output is fully written to disk
FileInput.close()
FileOutput.close()
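
A file opened with `open()` has to be closed before all of its buffered output is guaranteed to reach the disk. The sketch below shows step 5 written with a `with` block, which closes the file automatically; the function name `write_frequencies`, the sample counts and the example file name are hypothetical.

```python
import os
import tempfile

def write_frequencies(frequencies, path):
    """Write word<TAB>frequency lines to `path`, most frequent first."""
    ranked = sorted(frequencies.items(), key=lambda item: item[1], reverse=True)
    # The file is closed automatically when the with block ends.
    with open(path, 'w', encoding='utf-8') as output_file:
        for word, frequency in ranked:
            output_file.write(f'{word}\t{frequency}\n')

# Made-up counts, written to a temporary location for demonstration.
example_path = os.path.join(tempfile.gettempdir(), 'example-frq.txt')
write_frequencies({'a': 1, 'b': 3, 'c': 2}, example_path)

with open(example_path, encoding='utf-8') as check_file:
    print(check_file.read())
```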

## Tasks
1. Examine the frequency file.
2. Download it to your local machine.
3. Change the programme to create a frequency dictionary from another file / corpus / language.
4. Change the programme to preserve lower/upper-case letters.
5. Run it again and compare the results (save the results in another file).
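
For tasks 4 and 5, one way to compare the two runs is to check which case-preserving entries collapse into the same lower-case word. The sketch below is one possible approach rather than a required solution; the function name `group_case_variants` and the sample counts are invented for illustration.

```python
from collections import defaultdict

def group_case_variants(cased_frequencies):
    """Group case-preserving entries by their lower-case form.

    Returns only the groups that contain more than one spelling.
    """
    groups = defaultdict(dict)
    for word, frequency in cased_frequencies.items():
        groups[word.lower()][word] = frequency
    return {key: variants for key, variants in groups.items() if len(variants) > 1}

# Invented sample counts, standing in for a case-preserving frequency dictionary.
sample = {'The': 2, 'the': 5, 'cat': 1, 'War': 1, 'war': 3}
variants = group_case_variants(sample)
print(variants)
```

Summing the values in each group gives the frequency that the lower-cased run reports for that word.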
