Python is a popular programming language for processing text data. It is also relatively easy to learn enough Python to do some basic text analysis (["We chose Python because it has a shallow learning curve, its syntax and semantics are transparent, and it has good string-handling functionality."](https://www.nltk.org/book/ch00.html)). It is free, available on all the major operating systems, and backed by a large ecosystem of libraries and community support. So, I will use Python in this course, and this "Notebook" illustrates some basic functionalities of Python. Check out https://www.python.org/about/gettingstarted/ for how to install Python.

There are many ways of writing Python code and running it on your computer. Some people write the code in plain text files and run it from their terminal. Some use Integrated Development Environments (e.g., PyCharm, Spyder, Atom). More recently, people have been using interactive environments called "Notebooks" (e.g., Jupyter, Google Colab). This is a Jupyter Notebook.

The goal of this notebook is to introduce some basic Python concepts. 

Readings: Chapters 1--10 in [Python for Everybody book](https://www.py4e.com/html3/) 
 

# Variables

(Chapters 1-2 in Py4E)

In [1]:
# The two lines below tell us which version of Python we are using. Sometimes, this is important to know.
import sys
print(sys.version)

3.9.6 (default, Aug  5 2022, 15:21:02) 
[Clang 14.0.0 (clang-1400.0.29.102)]


In [2]:
#print() is a useful first function to know. 
print("Hello!")
print("\"Hello!\"")

print(2+3)

Hello!
"Hello!"
5


In [3]:
# Python can also "remember" a bit.
x=2
print(x)
y=x*3
print(y)

"""
In these 4 lines, x and y are "variables" and we are assigning some values to them. print is a "function". 
x=2 is a statement which assigns the value 2 to x. print() is also a statement. 
A statement is a unit of code that is executed. 

* is an operator. 

Triple quotes create a multi-line string, which is often used as a multi-line comment.
# starts a single-line comment.
"""


2
6


'\nIn these 4 lines, x and y are "variables" and we are assigning some values to them. print is a "function". \nx=2 is a statement which assigns the value 2 to x. print() is also a statement. \nA statement is a unit of code that is executed. \n\n* is an operator. \n\nTriple quotes create a multi-line string, which is often used as a multi-line comment.\n# starts a single-line comment.\n'

In [4]:
# Type of a variable can be known using the built in "type" function.

x = "123"
print(type(x))
x = 1
print(type(x))
print(type(3.2))

<class 'str'>
<class 'int'>
<class 'float'>


In [5]:
"""
When you type a large integer, you might be tempted to use commas between groups of three digits, 
as in 1,000,000. Python does not read this as one integer, but it won't throw an error either:
it treats the commas as separators between three arguments to print().
"""
print(1,000,000)

1 0 0


I am using x and y, but you can have any names for your variables, as long as you follow Python's naming rules.
- They can contain letters, numbers, and underscores, but they cannot start with a number.  
- Don't use one of [Python's 35 "keywords"](https://www.w3schools.com/python/python_ref_keywords.asp) as a variable name.
- Variables need to be defined first, before they can be used. 

Try running the lines below, and see what errors you get. 

In [6]:
76trombones = 'big parade'


SyntaxError: invalid syntax (943658999.py, line 1)

In [7]:
more@ = 1000000


FileNotFoundError: [Errno 2] No such file or directory: '@ = 1000000'

(In a plain Python interpreter, this line raises a SyntaxError, because @ is not allowed in a variable name. The FileNotFoundError above most likely appears because Jupyter treats "more" as one of its own "magic" commands for displaying a file.)

In [8]:
class = 'Advanced Theoretical Zymurgy'

SyntaxError: invalid syntax (3803549429.py, line 1)

In [9]:
principal = 327.68
interest = principle * rate

NameError: name 'principle' is not defined

# Operators, Conditions and Loops

(Chapter 3 and 5 in Py4E)

In [10]:
# the Boolean operator == compares two operands and produces True if they are equal and False otherwise
5 == 5
"And" == "and"

x=3
y=2
print(x >= y)

#note: ordering comparisons such as >= do not work between some types, e.g., an int and a str.
z="a"
print(x >=z)

True


TypeError: '>=' not supported between instances of 'int' and 'str'

In [11]:
#Logical operators:

a=3
b=4
c=5
print(a>b and b>c)
print(a<b and b<c)


False
True


In [12]:
# Conditions:
if x > 0 :
    print('x is positive')
    print(x)
else:
    print('x is not positive')

x is positive
3


Where did this x come from? I did not declare it in this block, but it was declared earlier in this notebook. 
Clearing the outputs of previous cells does not remove it. If you restart the kernel, which clears all variables, and run only the above cell, you will get an error, 
because there won't be a variable x.

Note the indentation. That is Python's way of grouping code (some other languages use braces etc.).

In [13]:
x = -12
if x > 0 :
    print('x is positive')
    print(x)
else:
    print('x is not positive')

x is not positive


You can write chained conditionals (if, elif, else) or nested conditionals (a conditional inside another conditional) too.
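
As a minimal sketch of both forms, here is a small function that uses a chained conditional with a nested conditional inside its first branch. The name `describe_number` and its messages are made up for illustration; they are not from Py4E.

```python
def describe_number(x):
    """Return a short description of x using chained and nested conditionals."""
    # Chained conditional: exactly one of the three branches runs.
    if x > 0:
        # Nested conditional: only checked when x is positive.
        if x % 2 == 0:
            return "positive and even"
        else:
            return "positive and odd"
    elif x == 0:
        return "zero"
    else:
        return "negative"


for value in [4, 7, 0, -12]:
    print(value, "is", describe_number(value))
```

Note how each level of nesting adds one more level of indentation.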

In [14]:
#Catching errors and acting on them.
while True: #While is a keyword used to implement a loop.
    inp = input('Enter Fahrenheit Temperature:')
    #this takes input from the user. 
    try:
        if inp == "exit":
            print("Exiting this loop")
            break #Breaks out of the loop if we enter "exit"
        fahr = float(inp)
        cel = (fahr - 32.0) * 5.0 / 9.0
        print(cel)
    except ValueError: #float() raises ValueError when the input is not a number
        print('Not a valid number. Please enter a valid number')

# Code slightly edited from: http://www.py4e.com/code3/fahren2.py

Enter Fahrenheit Temperature:exit
Exiting this loop


Loops are used to iterate through a piece of code (e.g., process a list of numbers or strings, enumerate a large list of items etc).  You can use a "for" loop or a "while" loop in Python. You can "break" out of a loop, or "continue". You can also easily fall into an infinite loop that never stops, if you make an error in programming your loop.
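
The snippet below is a small sketch of these ideas: a "for" loop that uses "continue" and "break", followed by a "while" loop. The word list and variable names are invented for this example.

```python
words = ["text", "", "analysis", "stop", "ignored"]

# for loop: visit each item in the list, one at a time.
kept = []
for word in words:
    if word == "":
        continue  # skip empty strings and move on to the next item
    if word == "stop":
        break     # leave the loop entirely; "ignored" is never visited
    kept.append(word)
print(kept)

# while loop: repeat as long as the condition is true.
countdown = 3
while countdown > 0:
    print(countdown)
    countdown = countdown - 1  # without this line, the loop would never end
print("done")
```

Removing the line that decreases `countdown` is an easy way to see what an infinite loop looks like (you can interrupt the kernel to stop it).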

Read [Chapter 5 in Py4e](https://www.py4e.com/html3/05-iterations) to get a good picture of what loops are and how to implement them.

### Programming Exercise

Do the exercises 1--3 in Chapter 3 and 1--2 in Chapter 5 of Py4E for practice.

# Functions

Chapter 4 in Py4E

We repeated the same chunk of code twice in the two blocks that illustrated conditionals. Such repetitive code can be made into a "function" for reuse. "print" and "type", which we have seen so far, are built-in Python functions. There are several more built-in functions, and many more pre-implemented ones that we can install as libraries, import into our code, and use.

Functions are defined using the def keyword.


In [15]:
#Here is a small function which takes a value and prints whether it is a positive or negative integer. 
def isPositive(x):
    if type(x) == int:
        if x > 0 :
            print('x is positive')
        else:
            print('x is not positive')
    else:
        print("x is not an integer. So, there is no positive or negative.")

In [16]:
x = 24
isPositive(x)
x = "a"
isPositive(x)
x = -2
isPositive(x)

x is positive
x is not an integer. So, there is no positive or negative.
x is not positive


Functions can sometimes "return" something instead of just printing outputs, which can be put to use for something else later. For example, consider the built-in function len. It returns an integer - length of the variable it takes as an input "argument". The following snippet illustrates how it works:

In [17]:
x = "trial"
print(len(x))
y = len(x)
print(y > 4)

5
True


In [18]:
# Now, let me change the "isPositive" function to "return" something.
#This version returns True for a positive integer, False for any other integer, and None for non-integers.
def isPositive(x):
    if type(x) == int:
        if x > 0 :
            return True
        else:
            return False
    else:
        print("x is not an integer. So, there is no positive or negative.")
        return None  #None is used to define a Null value in Python.

In [19]:
x = 2
if isPositive(x):
    print("We will see this message")
else:
    print("We won't see this message as long as we don't change x")

We will see this message


## Exercise:

Read Chapter 4 in Py4E and do the exercises at the end.

# Reading and Writing text in Python

Chapter 7 in Py4E


When we want to read or write a file (e.g., on your hard drive), we must first open it. Opening the file communicates with your operating system, which knows where the data for each file is stored. When you open a file for reading, you are asking the operating system to find the file by name and make sure the file exists. In this example, we open the file mbox.txt, which should be stored in the same folder that you are in when you start Python. You can download this file from www.py4e.com/code3/mbox.txt

Once you download the file and store it in the same folder as this Notebook, run the following code.

In [25]:
fhand = open('mbox.txt')
print(fhand)


<_io.TextIOWrapper name='mbox.txt' mode='r' encoding='UTF-8'>


If the open is successful, the operating system returns us a file handle. The file handle is not the actual data contained in the file, but instead it is a “handle” that we can use to read the data. You are given a handle if the requested file exists and you have the proper permissions to read the file.

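Now, let us print 10 lines of the file. Reading starts from the handle's current position, so you will see the first 10 lines only if the file has just been opened.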

In [27]:
for i in range(0,10):
    print(fhand.readline())
fhand.close() #closes the file we opened. 

	by flawless.mail.umich.edu () with ESMTP id m05EEFR1013674;

	Sat, 5 Jan 2008 09:14:15 -0500

Received: FROM paploo.uhi.ac.uk (app1.prod.collab.uhi.ac.uk [194.35.219.184])

	BY holes.mr.itd.umich.edu ID 477F90B0.2DB2F.12494 ; 

	 5 Jan 2008 09:14:10 -0500

Received: from paploo.uhi.ac.uk (localhost [127.0.0.1])

	by paploo.uhi.ac.uk (Postfix) with ESMTP id 5F919BC2F2;

	Sat,  5 Jan 2008 14:10:05 +0000 (GMT)

Message-ID: <200801051412.m05ECIaH010327@nakamura.uits.iupui.edu>

Mime-Version: 1.0
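
The blank lines in the output above appear because each line returned by readline() still ends with a newline character, and print() adds another one. Here is a minimal sketch of one way to avoid that. It uses the "with" statement, which closes the file automatically. The file name `sample_lines.txt` and its contents are made up, so that the example runs even without mbox.txt.

```python
# Create a small file so that this example is self-contained.
with open("sample_lines.txt", "w") as fhand:
    fhand.write("first line\nsecond line\nthird line\n")

# "with" closes the file for us, even if an error occurs inside the block.
lines = []
with open("sample_lines.txt") as fhand:
    for line in fhand:                   # a file handle can be looped over line by line
        lines.append(line.rstrip("\n"))  # drop the trailing newline character

for line in lines:
    print(line)
```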



In [32]:
fhand = open('mbox.txt')
entire_contents = fhand.read()
print(len(entire_contents)) #prints length in number of characters.
fhand.close()

fhand = open('mbox.txt')
entire_contents = fhand.readlines()
print(len(entire_contents)) #prints length in number of lines.
fhand.close()

#Why did I open the file twice here?? 



6687002
132045


In [35]:
#Writing into files.

fhand = open("temp.txt", "w")

for i in range(0,10):
    fhand.write("a line")
    fhand.write("\n")

fhand.close()

#Now, you will see a file temp.txt in this folder on your hard disk, with "a line" printed 10 times. 

These are just a few basic file operations. The encoding of a text file can play an important role when handling text with accents, other scripts, etc.
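
As a small sketch of why encoding matters, the snippet below writes and reads a string with accented characters, naming the encoding explicitly. The file name `accents.txt` and the sample text are made up for this example. UTF-8 is assumed here because it is a common choice, not because this course requires it.

```python
text = "café naïve Zürich"

# Passing encoding explicitly avoids relying on the platform's default encoding.
with open("accents.txt", "w", encoding="utf-8") as fhand:
    fhand.write(text)

with open("accents.txt", encoding="utf-8") as fhand:
    read_back = fhand.read()

print(read_back == text)

# The number of characters and the number of bytes differ,
# because each accented character takes more than one byte in UTF-8.
print(len(text), len(text.encode("utf-8")))
```

If a file is read with a different encoding than the one it was written with, you may get an error or garbled characters.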


# Python's data structures: Strings, Lists, Dictionaries, Tuples

Chapters 6, 8, 9, 10 in Py4E

## Strings

A string is a sequence of characters. You can access the characters one at a time with the bracket operator, using an index (indexing starts at 0, not at 1, as you may have noticed earlier). We can traverse the string with a loop, "slice" a string, use some built-in string functions to do other operations on it, compare two strings, etc.

Here are some examples. 



In [38]:
mystr = "banana"

#accesses characters by index:
print(mystr[0])
print(mystr[1])
print(mystr[-1]) #not an error: negative indices count from the end of the string
print(mystr[12]) #error, because string isn't that long.


b
a
a


IndexError: string index out of range

In [39]:

#length of the string
print(len(mystr))

#prints all characters in the string one by one
for character in mystr:
    print(character)

6
b
a
n
a
n
a


In [42]:
#slicing a string:
s = 'Monty Python'
print(s[0:5])
print(s[6:12])

#If you omit the first index (before the colon), the slice starts at the beginning of the string. 
#If you omit the second index, the slice goes to the end of the string:
print(s[:4])
print(s[7:])

Monty
Python
Mont
ython


In [43]:
# "in" operator
"on" in s #checks if the substring "on" appears in s ("Monty Python")

True

There are a lot of pre-implemented string methods (i.e., functions that are tied to specific objects like strings).
Some examples are in the [Python documentation](https://docs.python.org/3/library/stdtypes.html#string-methods). Here are a few:

In [44]:
mystr = "This is a string with some spaces, commas, and a full stop."
print(mystr.upper())
print(mystr.find("is"))
print(mystr.startswith("This"))
print(mystr.endswith("This"))

THIS IS A STRING WITH SOME SPACES, COMMAS, AND A FULL STOP.
2
True
False


There are a lot more. Strings are a useful data structure when working with NLP, for obvious reasons. Check out Chapter 6 in Py4E for a more detailed introduction, and try doing those exercises.
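
Here is a short sketch of a few more string methods that often come up when cleaning text: strip, lower, replace, count, and split. The sample sentence and the variable names are invented for illustration.

```python
raw = "  Hello, World! Hello, Python.  "

cleaned = raw.strip()                 # remove leading and trailing whitespace
lowered = cleaned.lower()             # returns a new string; raw itself is unchanged
no_commas = lowered.replace(",", "")  # replace every comma with nothing

print(no_commas)
print(lowered.count("hello"))         # number of non-overlapping occurrences
print(no_commas.split())              # with no argument, split() splits on any whitespace
```

String methods never modify the original string; they return a new one, which is why each result is stored in a new variable.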

## Lists in Python

Chapter 8 in Py4E

Like a string, a list is a sequence of values. In a string, the values are characters; in a list, they can be any type. The values in a list are called elements or sometimes items.

There are several ways to create a new list; the simplest is to enclose the elements in square brackets [].
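
As a minimal sketch, here are a few ways of creating lists. The variable names and values are made up for illustration.

```python
numbers = [10, 20, 30]               # a list of integers
mixed = ["spam", 2.0, 5, [10, 20]]   # mixed types, including a list inside a list
empty = []                           # an empty list
letters = list("nlp")                # a string turned into a list of characters

print(numbers, mixed, empty, letters)

numbers[1] = 25                      # lists are mutable: an element can be replaced
print(numbers)
print(mixed[3][0])                   # index into the inner list
```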

We can use a loop to traverse a list, we can change elements in a list, and we can use a list within a list. We can concatenate two lists, append an item to a list, extend a list, slice it, and do a lot more! 

Lists are another very useful data structure while dealing with textual data. A string can also be made into a list of characters. 

Here are a few examples of using a list.

In [46]:
mystr = "This is a string with some spaces, commas, and a full stop."
mylist = mystr.split(" ") #splitting the string into a list, by space character

print(len(mylist))
print(mylist[2:6])

for i in range(0, len(mylist)):
    print(mylist[i])
    
for i in mylist:
    print(i)
    
mylist.append("adding this")
print(mylist)

mylist.append(["adding", "this"])
print(mylist)

mylist.extend(["adding", "this"])
print(mylist)

12
['a', 'string', 'with', 'some']
This
is
a
string
with
some
spaces,
commas,
and
a
full
stop.
This
is
a
string
with
some
spaces,
commas,
and
a
full
stop.
['This', 'is', 'a', 'string', 'with', 'some', 'spaces,', 'commas,', 'and', 'a', 'full', 'stop.', 'adding this']
['This', 'is', 'a', 'string', 'with', 'some', 'spaces,', 'commas,', 'and', 'a', 'full', 'stop.', 'adding this', ['adding', 'this']]
['This', 'is', 'a', 'string', 'with', 'some', 'spaces,', 'commas,', 'and', 'a', 'full', 'stop.', 'adding this', ['adding', 'this'], 'adding', 'this']


In [47]:
alist = [1,2,3]
blist = ["a","b","c"]
print(alist+blist)


[1, 2, 3, 'a', 'b', 'c']


In [48]:
print(blist*3)

['a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c']


In [51]:
mystr = "This is an example"
mylist = mystr.split(" ")
mystr2 = " ".join(mylist)
print(mystr)
print(mylist)
print(mystr2)

This is an example
['This', 'is', 'an', 'example']
This is an example


There is, of course, a lot more than what I quickly introduced. Read Chapter 8 in Py4E and do the exercises. We can have some discussion early next week on these.

## Dictionaries

You can think of a dictionary as a mapping between a set of indices (which are called keys) and a set of values. Each key maps to a value. The association of a key and a value is called a key-value pair or sometimes an item.

As an example, we’ll build a dictionary that maps users to their email addresses.

The function dict creates a new dictionary with no items. Because dict is the name of a built-in function, you should avoid using it as a variable name.



In [54]:
emailmap = dict()
print(emailmap)

emailmap["user1"] = "user1@email.com"
emailmap["user2"] = "user2@email.com"
print(emailmap)

emailmap.update({"user3":"user3@gmail.com", "user4":"user4@gmail.com"})
print(emailmap)

{}
{'user1': 'user1@email.com', 'user2': 'user2@email.com'}
{'user1': 'user1@email.com', 'user2': 'user2@email.com', 'user3': 'user3@gmail.com', 'user4': 'user4@gmail.com'}


In [55]:
print(emailmap["user1"])
print(emailmap["user5"]) #throws an error as there is no user5

user1@email.com


KeyError: 'user5'
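
One way to avoid this KeyError is to check for the key first, or to use the dictionary's get() method with a default value. This is a minimal sketch; the dictionary is re-created here so that the example runs on its own, and the default value "unknown" is an arbitrary choice.

```python
emailmap = {"user1": "user1@email.com", "user2": "user2@email.com"}

# The "in" operator checks whether a key exists in the dictionary.
if "user5" in emailmap:
    print(emailmap["user5"])
else:
    print("user5 is not in the dictionary")

# get() returns the given default instead of raising a KeyError.
address = emailmap.get("user5", "unknown")
print(address)

# items() gives the key-value pairs, which is handy in a loop.
for user, email in emailmap.items():
    print(user, "->", email)
```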

In [56]:
print(emailmap.keys())

dict_keys(['user1', 'user2', 'user3', 'user4'])


In [57]:
print(emailmap.values())

dict_values(['user1@email.com', 'user2@email.com', 'user3@gmail.com', 'user4@gmail.com'])


You can use dictionaries, for example, to count the frequency of words in a file (or a bunch of files). You can combine them with other Python data structures and their methods, other Python functions, etc., to build pretty sophisticated programs already! I won't discuss them further, but take a look at Chapters 9 and 10 in Py4E to get a fuller picture, and try the exercises!
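
As a rough sketch of the word-counting idea, the snippet below counts words in a short string rather than a file, so that it runs on its own. The sample sentence is made up, and splitting on whitespace is a simplifying assumption; real text would also need handling of punctuation and capitalization.

```python
text = "the cat sat on the mat and the dog sat too"

counts = dict()
for word in text.split():
    # get() returns 0 the first time we see a word.
    counts[word] = counts.get(word, 0) + 1

# Sort the (word, count) pairs by count, largest first.
top = sorted(counts.items(), key=lambda item: item[1], reverse=True)
for word, count in top[:3]:
    print(word, count)
```

To count words in a file instead, the same loop could run over each line read from a file handle.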

# Concluding remarks:

This is a really quick overview of the basics of Python programming. My goal is just to introduce the basic syntax and the different building blocks of Python code. With this, you will already be able to read some of the simpler Python programs. In the next part of today's lecture, we will look into installing NLTK, one of the well-known text processing libraries, and see how we can use it to do some preliminary corpus analysis.