# 1b Files and Strings

In [None]:
import pathlib  # pathlib ships with Python (no installation needed), but we still need to import it to be able to use it in our code.

infile = pathlib.Path('UD_English-GUM', 'en_gum-ud-dev.conllu')
with open(infile) as f:               # Opens the file at the path stored in "infile" and makes it available as the variable "f".
    for line in f:                    # Loops over individual lines in the file.
        line = line.strip()           # Removes leading and trailing whitespace, including the newline character (\n).
        if line.startswith('#'):      # Checks if a line starts with '#'.
            continue                  # If we do encounter such a line, we skip it. The "continue" statement immediately starts the next iteration of the loop.
        if line:                      # Otherwise, if we read a non-empty line, ...
            print(line)               # ... print it.
        else:                         # The only other possible case is that we encounter an empty line.
            break                     # Then we break out of the loop.

### Exercise

How can we change this code so that it doesn't stop after the first sentence (that is, at the first empty line) but keeps looping through the whole document?

## The CoNLL File Format

CoNLL stands for the Conference on Computational Natural Language Learning. It regularly hosts competitions called Shared Tasks, which are often about parsing. The CoNLL file format has been very popular as the input and output format for these tasks, because it presents a lot of information at once and is readable for both humans and computers.
CoNLL is a flavor of TSV (tab-separated values, similar to CSV, comma-separated values), i.e. a tabular format with one data point per row and a fixed number of data fields per item, each in its own column.
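
To make the tabular layout concrete, here is a minimal sketch that splits one line into its columns. The sample line is made up for illustration (it is not copied from the corpus), and the names `COLUMNS`, `sample_line` and `sample_fields` are hypothetical. The ten column names are those of CoNLL-U, the variant of the format that `.conllu` files use.

```python
# Column names of the CoNLL-U format, in order.
COLUMNS = ['ID', 'FORM', 'LEMMA', 'UPOS', 'XPOS', 'FEATS', 'HEAD', 'DEPREL', 'DEPS', 'MISC']

# A made-up example line (an assumption for illustration, not taken from the corpus).
sample_line = '2\tcats\tcat\tNOUN\tNNS\tNumber=Plur\t3\tnsubj\t_\t_'

sample_fields = sample_line.split('\t')         # One string per column.
for name, value in zip(COLUMNS, sample_fields):
    print(f'{name:>6}: {value}')                # ">6" right-aligns the column name in 6 characters.
```

An underscore (`_`) marks a field that has no value for this word.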

We already know that each line contains a single data point, which in our case is a word. So in each iteration of the for-loop, the "line" variable holds one word together with its annotations.

Now we are going to access individual fields by splitting the line at each tab-character:

In [None]:
fields = line.split('\t')
print(fields)

In the context of our loop, it looks like this:

In [None]:
with open(infile) as f:
    for line in f:
        line = line.strip()
        if line.startswith('#'):
            continue
        if line:
            fields = line.split('\t')
            word = fields[1]
            lemma = fields[2]
            upos = fields[3]
            xpos = fields[4]
            morph = fields[5]
            head = fields[6]
            deprel = fields[7]
            print('word is', word, '; lemma is', lemma, '; part-of-speech is', upos)

As we've seen before, lines starting with # are metadata lines. See if you can make sense of them.

Another important property of CoNLL you might have noticed by now is that there are empty lines between sentences.
Also, sentences are grouped into documents (the actual original documents they were taken from), which is reflected in the metadata (newdoc id and sent_id).
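
Since empty lines separate sentences, they can be used to group words into sentences. The following is only a sketch under stated assumptions: `sample_text` is a tiny made-up excerpt in the same layout (not real corpus data), and `sentences` and `current` are hypothetical variable names.

```python
# A tiny made-up excerpt in CoNLL-U style (an assumption for illustration, not real corpus data).
sample_text = (
    '# sent_id = demo-1\n'
    '1\tHello\thello\tINTJ\tUH\t_\t0\troot\t_\t_\n'
    '2\t!\t!\tPUNCT\t.\t_\t1\tpunct\t_\t_\n'
    '\n'
    '# sent_id = demo-2\n'
    '1\tBye\tbye\tINTJ\tUH\t_\t0\troot\t_\t_\n'
    '\n'
)

sentences = []       # All finished sentences, each a list of word forms.
current = []         # Words of the sentence we are currently reading.
for line in sample_text.splitlines():
    line = line.strip()
    if line.startswith('#'):      # Skip metadata lines.
        continue
    if line:                      # A word line: collect the word form (second column).
        current.append(line.split('\t')[1])
    elif current:                 # An empty line ends the current sentence.
        sentences.append(current)
        current = []
if current:                       # Keep the last sentence if no empty line follows it.
    sentences.append(current)

print(sentences)                  # [['Hello', '!'], ['Bye']]
```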

### Exercise

How would you go about parsing / extracting the metadata for each sentence and for each document?
What kinds of research questions could it be useful for?

## Strings

Python is an object-oriented programming language. That means that everything that has a "value" is an object. (The rest are "keywords", but don't worry about that for now.)

Each object has a type. Above, "f" is a variable pointing to a file object. And "line" is a variable that points to a different string object in each iteration, i.e. an object of the type "str" (Python's name for strings).
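
The built-in `type()` function shows which type an object has. In this small sketch, `io.StringIO` serves as an in-memory stand-in for an opened text file, so that the example runs without any file on disk; the variable names and values are made up.

```python
import io

text = 'a string'
print(type(text))                 # <class 'str'>
print(isinstance(text, str))      # True

# io.StringIO behaves like an opened text file, but lives in memory
# (an assumption made here so that no real file is needed).
fake_file = io.StringIO('first line\nsecond line\n')
print(type(fake_file))

line_types = [type(line) for line in fake_file]   # Iterating yields one string per line.
print(line_types)
```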

Below are a bunch of useful methods for handling and manipulating strings.

In [None]:
string1 = "This is a string.\n"
string2 = 'This is also a string.\n'
string3 = """This
is
a
multi-line
string
."""

# Here we are just assigning some values to some variables.
# Press Ctrl+Enter or Shift+Enter to execute the code block.

In [None]:
# Printing

# We can print strings to the console.
print(string1)
print(string2)
print(string3)

# Press Ctrl+Enter or Shift+Enter to execute the code block.

In [None]:
# Concatenating strings

new_string1 = string1 + string2
new_string2 = 'abc' + 'd 123'
new_string3 = 'Hello' 'World'
new_string4 = ['The', 'brown', 'fox', 'jumped']

print(new_string1)
print(new_string2)
print(new_string3)
print('_'.join(new_string4))

In [None]:
# Format strings

import math

print('combining "{}" and "{}"'.format(new_string1, new_string2))  # str.format() fills in "{}" placeholders (not "%s").
print(f'combining "{new_string3}" and "{new_string4}"')

some_text = 'f-strings are cool because they are easy to write and can quickly print pretty numbers like ...'

# len(some_text) tells us that the "some_text" string is 95 characters long - try it out with a print() call!

print(f'{some_text} {math.pi:.5f}')  # ".5f" means "a float, rounded to five decimal places"
#                  ^
# If we want all the ones, tens, etc to line up, we need to add 1 for the empty space, 
# and 1 extra so that we align the rightmost digit with Pi's 3 before the decimal.
# 95 + 1 + 1 = 97!

print(f'{42:97d}')
print(f'{4321:97d}')
print(f'{1:97d}')

# If we are printing floats (f) again instead of integers (d), we need to add the number of decimal digits (2) and the decimal point (1).
# 97 + 2 + 1 = 100!

print(f'{42:100.2f}')
print(f'{4321:100.2f}')
print(f'{1.23:100.2f}')
print(f'{0.1:100.2f}')
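
The file-reading loop at the top of this notebook already used a few string methods (`strip`, `startswith`, `split`). As a short recap sketch with a made-up value, here they are side by side with some other commonly used ones:

```python
raw = '  Hello, World!\n'                  # Made-up example value.

clean = raw.strip()                        # Removes leading and trailing whitespace, including '\n'.
print(clean)                               # Hello, World!
print(clean.lower())                       # hello, world!
print(clean.upper())                       # HELLO, WORLD!
print(clean.startswith('Hello'))           # True
print(clean.endswith('?'))                 # False
print(clean.replace('World', 'Python'))    # Hello, Python!

parts = clean.split(', ')                  # Splits at every ', '.
print(parts)                               # ['Hello', 'World!']
print(' / '.join(parts))                   # Hello / World!
print(len(clean))                          # 13
```

None of these methods changes the original string; each one returns a new object.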


### Exercise

Now it's your turn!
Write some code that prints "Hello NAME! Today is DATE.", where NAME and DATE should be replaced with an arbitrary name and a date, which are each specified in variables beforehand.

Once you are done with these basic requirements, feel free to experiment and make the code more complex!

You can use the input() function to prompt a user for custom input, which will be stored in a new string.
See [https://www.w3schools.com/python/ref_func_input.asp](https://www.w3schools.com/python/ref_func_input.asp) for help.

You can use the datetime module to get real current time information.
The module needs to be imported first (but does not need to be installed).
See [https://www.w3schools.com/python/python_datetime.asp](https://www.w3schools.com/python/python_datetime.asp) for help.

In [None]:
# Your code here