# IT Skills for linguists 2
## UAM, Faculty of English, 2BA
### Topic: *Input - output*
#### Poznań, 12.12.2022
#### Teacher: mgr inż. Michał Junczyk




- Variables (data structures) plus control structures give us the full power of Python.
- Yet to do anything useful, programs must run on actual data.
- For us as linguists, actual data means words, sentences, texts, sounds, etc.
- So far, all data has been hard-coded in the program itself (e.g. the word in the vowel-counting example).
- Goal for today: open-ended programs that respond to files or user-entered data rather than precoded values.

# Input interfaces
- Ways of **inputting** data:
  - Command line - data is entered on the command line when the program is invoked.
  - Standard input - the program takes its input from another program.
  - Keyboard input - the user enters data when prompted by the program.
  - File input–output - the program reads data from or writes data to files.
  - Web - the program reads data from web pages.

## Command-Line Input
- input data can be provided on the command line, e.g.<br> *python myprogram.py my_input.txt*
- the values are stored in the predefined **sys.argv** variable, a list of strings
- to use it, the **sys** module must be imported first


#### io1.py content:

import sys <br>
print(sys.argv)

> python io1.py<br>
['io1.py']<br><br>
> python io1.py nouns<br>
['io1.py', 'nouns']<br><br>
> python io1.py 3<br>
['io1.py', '3']<br><br>
> python io1.py this is a cat<br>
['io1.py', 'this', 'is', 'a', 'cat']<br><br>
> python io1.py '3 > 1'<br>
['io1.py', '3 > 1']<br><br>

- input arguments are separated by whitespace; quote an argument to keep spaces inside it (as in '3 > 1' above).
- arguments are **positional**: there is no explicit assignment (input_param1=value1, input_param2=value2), and every value arrives as a string.
- sys.argv[0] stores the name of the script
- sys.argv[1] stores the value of first input argument
- sys.argv[2] stores the value of second input argument
- etc.
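
Because the arguments are positional, asking for sys.argv[1] when no argument was given raises an IndexError. The following is a minimal sketch of one way to check first; the helper name `first_argument` is hypothetical and not part of the course scripts.

```python
import sys

def first_argument(argv):
    """Return the first command-line argument, or None if none was given."""
    # argv[0] is the script name, so a real argument needs len(argv) >= 2
    if len(argv) < 2:
        return None
    return argv[1]

if __name__ == '__main__':
    word = first_argument(sys.argv)
    if word is None:
        print('Usage: python', sys.argv[0], 'WORD')
    else:
        print('First argument:', word)
```

Passing the list in as a parameter, rather than reading sys.argv inside the function, makes the check easy to try out with hand-made lists.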

In [None]:
!cd 4-standalone-scripts-and-files && python3 io1.py test test2 test3

In [None]:
!cd 4-standalone-scripts-and-files && python3 io1.py 1 2 3 4 5 6

### Command-line input - single argument

In [None]:
import sys         #make sys.argv available

vowels = 'aeiou'   #define vowels
# notice that previously word value was coded in the program itself
word = sys.argv[1] #get word from command-line

counter = 0        #proceed as before...
vowelcount = 0
while counter < len(word):
	if word[counter] in vowels:
		vowelcount += 1
	counter += 1
else:
	print('There are',vowelcount,
		'vowels in this word')



In [None]:
!cd 4-standalone-scripts-and-files && python3 io2.py abc

In [None]:
!cd 4-standalone-scripts-and-files && python3 io2.py abcde

### Command-line input - multiple arguments

In [None]:
import sys       #make sys.argv available

vowels = 'aeiou' #define vowels
#iterate over all words in list
for word in sys.argv[1:]:
	counter = 0   #proceed as before
	vowelcount = 0
	while counter < len(word):
		if word[counter] in vowels:
			vowelcount += 1
		counter += 1
	else:
		print('There are',vowelcount,
			'vowels in',word)



In [None]:
!cd 4-standalone-scripts-and-files && python3 io3.py abc abcde abcdefghi

## Standard input (STDIN)

- all programs can receive input data via standard input (stdin)
- any program can produce output on standard output (stdout)
- output is usually printed to the screen
- output can also be passed as input to other programs

- with a pipe (|), the output of one program is fed directly to a second program
- in Python, standard input is available via a variable from the sys module: **sys.stdin**
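
When a program iterates over sys.stdin, each line arrives with its trailing newline still attached, so printing it unchanged produces an extra blank line. The sketch below shows one way to remove it. It uses io.StringIO as a stand-in for piped input (an assumption made so that the example runs without a pipe), and the helper name `strip_newlines` is hypothetical.

```python
import io

def strip_newlines(stream):
    """Return the lines of a text stream without their trailing newline."""
    cleaned = []
    for line in stream:
        cleaned.append(line.rstrip('\n'))
    return cleaned

# io.StringIO stands in for sys.stdin here
fake_stdin = io.StringIO('column1\tcolumn2\nvalue1\tvalue2\n')
for line in strip_newlines(fake_stdin):
    print(line)
```

In a real script, `strip_newlines(sys.stdin)` would process whatever is piped in.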

In [None]:
#io4.py

import sys

for line in sys.stdin:
	print(line)


In [None]:
!cd 4-standalone-scripts-and-files && echo -e "column1\tcolumn2\nvalue1\tvalue2" | python3 io4.py
!cd 4-standalone-scripts-and-files && echo -e "1 2 3" | python3 io4.py

Let's change our example so that it takes multiple words from stdin.


In [None]:
#io5.py

import sys

vowels = 'aeiou'              #define vowels
#get each line in stdin
for words in sys.stdin:
	for word in words.split(): #break into words
		#do same as before to each
		counter = 0
		vowelcount = 0
		while counter < len(word):
			if word[counter] in vowels:
				vowelcount += 1
			counter += 1
		else:
			print('There are',vowelcount,
				'vowels in',word)



In [None]:
!cd 4-standalone-scripts-and-files && echo -e "example" | python3 io5.py

In [None]:
!echo -e "first\nsecond\n"
!cd 4-standalone-scripts-and-files && echo -e "first\nsecond" | python3 io5.py

Let's also print the line number.

In [None]:
#io6.py

import sys

vowels = 'aeiou'    #vowels
line = 1            #line number
#for each line in stdin
for words in sys.stdin:
	#print line number
	print('This is line',line)
	line += 1        #increment line count
	#break line into words
	for word in words.split():
		counter = 0   #continue as before
		vowelcount = 0
		while counter < len(word):
			if word[counter] in vowels:
				vowelcount += 1
			counter += 1
		else:
			print('\tThere are ',vowelcount,
				' vowels in "',word,'"',sep='')



In [None]:
!echo -e "first\nsecond\n"
!cd 4-standalone-scripts-and-files && echo -e "first\nsecond" | python3 io6.py

In [None]:
!echo -e "first line\nsecond line\n"
!cd 4-standalone-scripts-and-files && echo -e "first line\nsecond line" | python3 io6.py

### Reading files via STDIN

In [None]:
#linux
!cat 4-standalone-scripts-and-files/test.txt
#windows
#!copy 
#!type 

In [None]:
!cd 4-standalone-scripts-and-files && cat test.txt | python3 io6.py

### STDIN - take-aways
- stdin can accommodate multiple lines
- files can contain any amount of data
- variable assignment and control structures give us the full power of Python
- unbounded input such as stdin lets us apply that power to a problem of any size
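
The vowel-counting logic from io5.py can also be wrapped in functions that accept any stream of lines. This is a sketch of an alternative style, not the course scripts themselves: the function names are hypothetical, a for loop over the letters replaces the while loop with a counter, and io.StringIO stands in for sys.stdin.

```python
import io

VOWELS = 'aeiou'

def count_vowels(word):
    """Count the vowels in a single word."""
    total = 0
    for letter in word:
        if letter in VOWELS:
            total += 1
    return total

def vowels_per_word(stream):
    """Return (word, vowel count) pairs for every word in a stream of lines."""
    results = []
    for line in stream:
        for word in line.split():
            results.append((word, count_vowels(word)))
    return results

# io.StringIO stands in for sys.stdin or an open file
sample = io.StringIO('first line\nsecond line\n')
for word, n in vowels_per_word(sample):
    print('There are', n, 'vowels in', word)
```

The same function works on one line or on a whole book, because it never needs to know how much input there is.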

## Keyboard input

- input data can be requested from the user. 
- programs may pause at some point and wait for the user to enter data.
- function input()
  - takes a single string argument.
  - prints argument and returns what the user types in response as a string.

In [None]:
theInput = input('Type something: ')
print('You typed "',theInput,'"',sep='')



In [None]:
# input() always returns a string
# if numbers are wanted, the string must be converted
#collect two numbers
n1 = input('Enter a number: ')
n2 = input('Enter another number: ')
#convert to integers and add
n3 = int(n1) + int(n2)
print('The sum is:',n3)  #print result
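
If the user types something that is not a whole number, int() raises a ValueError and the program stops. The sketch below shows one way to guard the conversion. The helper name `to_int_or_none` is hypothetical, and the strings in the list stand in for what input() would return in an interactive program.

```python
def to_int_or_none(text):
    """Convert text to an int, or return None if it is not a whole number."""
    try:
        return int(text)
    except ValueError:
        return None

# in an interactive program these strings would come from input()
for typed in ['42', ' 7 ', 'seven']:
    print(repr(typed), '->', to_int_or_none(typed))
```

A program could then repeat the prompt for as long as the helper returns None.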

### Guessing game

In [None]:
import random

letters = 'abcdefghijklmnopqrstuvwxyz'

#get random letter
letter = letters[random.randint(0,25)]

while True:              #loop until correct
	#prompt them to type a letter
	guess = input('Type a lower-case letter: ')
	#check that it's exactly one lower-case letter
	if len(guess) != 1 or guess not in letters:
		print("That's not a lower-case letter.")
		continue
	if guess == letter:   #if they're right
		print("That's right!")
		break
	#give them a hint if they're wrong
	if guess > letter:
		print("It's earlier in the alphabet.")
	else:
		print("It's later in the alphabet.")



### Eval function for interpreting input

- the input() function returns a string, which can be converted to a number
- Can the user also refer to actual Python variables or functions?


For example, imagine we have three variables x, y, and z. <br>
We want the user to select one to print its content. <br>


In [None]:
# incorrect code
# not what we want!
x = 'Tom'
y = 'Dick'
z = 'Harry'

result = input('Type x, y, or z: ')

print(result)



In [None]:
#correct code
#set up three variables
x = 'Tom'
y = 'Dick'
z = 'Harry'
#collect user input
result = input('Type x, y, or z: ')

#evaluate and print result
print(eval(result))
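
eval() runs whatever expression the user types, not only a variable name, so it is best kept for input you trust. A common alternative, swapped in here for illustration and not part of the original example, is to store the values in a dictionary and look the user's choice up as a key. The names `names` and `look_up` are hypothetical.

```python
names = {'x': 'Tom', 'y': 'Dick', 'z': 'Harry'}

def look_up(choice):
    """Return the name stored under choice, or None for any other input."""
    return names.get(choice.strip())

# in an interactive program the choice would come from input()
for choice in ['x', 'z', 'len(names)']:
    print(choice, '->', look_up(choice))
```

Only the three keys are accepted; anything else, including Python expressions, simply yields None.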



## File Input–Output

- the usual way to input or output large amounts of data is from or to files.
- a program should be written to handle any amount of data.
- a program can read and process (valid) data all at once or chunk by chunk.

- Writing to files is, in principle, a **dangerous operation**: opening an existing file in write mode ('w') erases its previous contents.
- Best practices:
  - Do not experiment with important files. **Create toy files to play with.**
  - When you do want to start working on your own files, use **copies**, not original files.
  - Create a new directory for learning file operations, and keep only unimportant files or copies in it.

### Text files


- Python can work with many different file formats.
- Yet, simple text files are best to start with.


#### Example - saving text to file

- Create a stream (a pathway) to a file
- Write to that stream
- Close the stream

In [None]:
#open file stream
outFile = open('testfile.txt','w')
#write to it
outFile.write('some text!\n')
outFile.write('...and some more text!\n')
outFile.close()   #close file stream

In [None]:
!cat testfile.txt

#### Example - reading from file

In [None]:
#open file stream
inFile = open('testfile.txt','r')
stuff = inFile.read() #read from it
inFile.close()        #close stream
print(stuff)          #print contents



#### Example - reading from file line by line

In [None]:
#open file
inFile = open('testfile.txt','r')
stuff = inFile.read()     #read file contents
inFile.close()            #close file

# first read then split into lines
lines = stuff.split('\n') #split into lines
#print lines and lengths
for line in lines:
	print(len(line),': ',line,sep='')



In [None]:
# read file line by line - more efficient for large file inputs
# open file
inFile = open('testfile.txt','r')
#read from stream line by line
for line in inFile:
	#print length of line and the line
	print(len(line),': ',line,sep='',end='')
inFile.close()   #close file stream
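
The open / use / close pattern above can also be written with a with statement, which closes the file automatically when the block ends, even if an error occurs inside it. The sketch below is one possible version, not the course's own code: it works in a temporary directory so that no existing file is overwritten, and the file name is hypothetical.

```python
import os
import tempfile

# work in a throwaway directory so no existing file is overwritten
folder = tempfile.mkdtemp()
path = os.path.join(folder, 'toyfile.txt')   # hypothetical file name

# the file is closed automatically when the with block ends
with open(path, 'w') as out_file:
    out_file.write('some text!\n')
    out_file.write('...and some more text!\n')

lengths = []
with open(path, 'r') as in_file:
    for line in in_file:
        text = line.rstrip('\n')        # drop the trailing line break
        lengths.append(len(text))
        print(len(text), ': ', text, sep='')
```

Stripping the line break before measuring means the reported length counts only the visible characters.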



### Different file formats
 - Python can handle specialized or proprietary file formats as well.
 - For example:
   - for audio: wave (wav)
   - for tabular data: Microsoft Excel files (xls, xlsx)
 - These usually require specialized modules.
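
As an illustration of a specialized module, the sketch below uses the wave module from Python's standard library to write a tiny audio file and read its header back. The file name and the audio parameters (mono, 16-bit, 8000 samples per second) are arbitrary choices made for this example. Excel files would need a third-party package and are not shown.

```python
import os
import tempfile
import wave

# hypothetical toy file in a throwaway directory
path = os.path.join(tempfile.mkdtemp(), 'silence.wav')

# write one second of silence: mono, 16-bit samples, 8000 samples per second
with wave.open(path, 'wb') as out_wav:
    out_wav.setnchannels(1)
    out_wav.setsampwidth(2)          # 2 bytes = 16 bits per sample
    out_wav.setframerate(8000)
    out_wav.writeframes(b'\x00\x00' * 8000)

# read the header information back
with wave.open(path, 'rb') as in_wav:
    channels = in_wav.getnchannels()
    rate = in_wav.getframerate()
    frames = in_wav.getnframes()

print('channels:', channels)
print('sample rate:', rate)
print('duration in seconds:', frames / rate)
```

Note the 'wb' and 'rb' modes: audio data is binary, so the file is not opened as text.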

## Example of lexical statistics - Alice’s Adventures in Wonderland 

### Reading file

In [None]:
count = 0                 #counter for lines
f = open('alice.txt','r') #open the file
for line in f:            #read line by line
	count += 1
f.close()                 #close file
print('lines:',count)     #print line count



### Saving lines in a list

In [None]:
count = 0      #counter for lines
lines = []     #list for line contents
#open file
f = open('alice.txt','r')
for line in f: #read it line by line
	count += 1  #add 1 to counter
	#add current line to list
	lines.append(line)
f.close()      #close the file
#print number of lines read
print('lines:',count)
#print number of lines saved
print('saved lines:',len(lines))



### Printing first few lines

In [None]:
lines = []     #list to save lines
#open file
f = open('alice.txt','r')
for line in f: #read line by line
	#save each line in list
	lines.append(line)
f.close()      #close file
i = 0          #print first 10 lines
while i < 10:
	print(lines[i])
	i += 1



### Printing first few lines - without extra line break

In [None]:
lines = []     #list to save lines
#open file
f = open('alice.txt','r')
for line in f: #read line by line
	#save lines to list
	lines.append(line)
f.close()      #close file
i = 0          #print first 10 lines
while i < 10:
	#don't add a return to the line!
	print(lines[i],end='')
	i += 1



### Skip the Project Gutenberg header (255 lines)

In [None]:
lines = []     #list for lines
#open file
f = open('alice.txt','r')
for line in f: #read lines one by one
	#add lines to list
	lines.append(line)
f.close()      #close file
#strip off first 255 lines
lines = lines[255:]
i = 0          #print first 15 lines
while i < 15:
	#still don't add a return!
	print(lines[i],end='')
	i += 1



### Analysis of the lexical content of the file
- Is there a correlation between word length and word frequency?
- Are more frequent words shorter than less frequent words?

#### Get a list of words

In [None]:
words = []     #list of all words
lines = []     #list of all lines

#open file
f = open('alice.txt','r')
for line in f: #save lines one by one
	lines.append(line)
f.close()      #close file

#remove Gutenberg header
lines = lines[255:]

#go through lines one by one
for line in lines:
	#break each line into words
	wds = line.split()
	#add words to list
	words += wds

i = 0 #print first 10 words
while i < 10:
	print(i,words[i])
	i += 1



### Count the length of each word without punctuation


In [None]:
words = []     #list of all words
lines = []     #list of all lines

f = open('alice.txt','r')
for line in f: #save lines one by one
	lines.append(line)
f.close()      #close file

lines = lines[255:]

for line in lines:
	wds = line.split()      #break into words
	words += wds            #add to list

#print first 10 words and letter counts
i = 0
while i < 10:
	#store the count for the current word
	count = 0
	#convert the current word to lowercase
	word = words[i].lower()
	#go through word letter by letter
	#if it is a letter (not punctuation), add 1 to count
	for l in word:
		if l in "abcdefghijklmnopqrstuvwxyz":
			count += 1
	print(i,words[i],count) #print it all
	i += 1



### Count how often words of each length occur

In [None]:
words = []       #list of all words
lines = []       #list of all lines
wordlengths = {} #dictionary of word lengths

#open file
f = open('alice.txt','r')
for line in f:   #save lines one by one
	lines.append(line)
f.close()        #close file

#remove Gutenberg header
lines = lines[255:]
for line in lines:
	#break each line into words
	wds = line.split()
	#add the words to the list
	words += wds
for wd in words:
	count = 0     #count for current word
	#convert current word to lowercase
	word = wd.lower()
	#go through word letter by letter
	#if it is a letter (not punctuation), add 1 to count
	for l in word:
		if l in "abcdefghijklmnopqrstuvwxyz":
			count += 1
	#check if we've seen this length before
	if count in wordlengths:
		#if so add 1
		wordlengths[count] += 1
	else:
		#if not, set to 1
		wordlengths[count] = 1

#print out counts for each word length
for c in wordlengths:
	print(c,wordlengths[c])
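
The same tally can be written more compactly with collections.Counter from the standard library, which starts every count at zero so the "seen before?" check is not needed. This is a sketch of an alternative, not the course's own code: the function name `length_counts` is hypothetical, a short sample sentence is used instead of alice.txt, and the results are printed in order of word length.

```python
from collections import Counter

LETTERS = 'abcdefghijklmnopqrstuvwxyz'

def length_counts(text):
    """Count how many words of each letter-length occur in text."""
    counts = Counter()
    for word in text.lower().split():
        # count only letters, so punctuation does not add to the length
        length = sum(1 for ch in word if ch in LETTERS)
        counts[length] += 1
    return counts

# short sample sentence used instead of alice.txt
sample = 'Alice was beginning to get very tired of sitting by her sister'
counts = length_counts(sample)
for length in sorted(counts):
    print(length, counts[length])
```

Sorting the lengths before printing makes it easier to see at a glance whether short words are the most common ones.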



### Save results to file


In [None]:
words = []       #list of all words
lines = []       #list of all lines
wordlengths = {} #dictionary of word lengths

#open file
f = open('alice.txt','r')
#save lines one by one
for line in f:
	lines.append(line)
f.close()        #close the file

#remove Gutenberg header
lines = lines[255:]

#go through lines one by one
for line in lines:
	#break each line into words
	wds = line.split()
	#add words to the list
	words += wds

for wd in words:
	count = 0     #count for current word
	#convert current word to lowercase
	word = wd.lower()
	#go through word letter by letter
	#if it is a letter (not punctuation), add 1 to count
	for l in word:
		if l in "abcdefghijklmnopqrstuvwxyz":
			count += 1
	#check if we've seen this length already
	if count in wordlengths:
		#if so add 1
		wordlengths[count] += 1
	else:
		#if not, set to 1
		wordlengths[count] = 1

#open output file
g = open('res26.txt','w')
#print out counts for each word length
for c in wordlengths:
	clen = str(wordlengths[c])
	res = str(c) + ': ' + clen + '\n'
	g.write(res)
g.close()        #close output file



In [None]:
!cat res26.txt | head -n 12

### Take-aways from the lexical analysis example

- Each program element was introduced separately <br>(reading the file, splitting it into lines, counting word lengths, etc.).
- This way it's easier to examine and understand what each element is doing.
- Stepwise construction is how you should write your own programs.
- Build step by step, checking at each point that the program performs as you want.
- You check this by:
  - printing the values of variables at each point
  - checking that they are what you expect.
  - If they are, strip out those print statements and go on to the next step,<br> printing out the new variables of interest.
- This style of building programs is not just for beginners. 
- Get comfortable with it and make it part of your programming habits.