![Introduction to Programming with Python](figs/python_workshop.png "Introduction to Programming with Python")

# Overview


* Instructor: **Milena Tsvetkova**
* Teaching Assistant: **Yuanmo He**

**10:00 – 12:00 CET**

* Straight-line programming
  * Data types
  * Operations and methods
* Control flow
  * Conditional statements


**12:00 – 12:15 CET** Break

**12:15 – 14:15 CET**

* Control flow
  * Iteration
  * Functions
* Classes
* Concluding remarks


## Why Do Social Scientists Need Computer Programming?

* Collect data
  * Crawling websites and using APIs
  * Online surveys and experiments
  * Computational models and simulations
* Manage, analyze, and visualize data
  * Large data
  * Non-rectangular data (e.g. networks, text)

* Be autonomous and work independently
* Learn from and collaborate with engineers and scientists

* Generate and share reproducible workflows

## Markup vs. Programming Languages


|              | Markup Languages | Programming Languages   
| :----------- |:---------------- | :----------------------
|  |![Markup languages](figs/markup_lang.png "Markup languages") | ![Programming languages](figs/program_lang.png "Programming languages")
| **Examples** | TeX, HTML, XML, **Markdown**   | C, Java, JavaScript, R, **Python**           
| **Use**      | Structure and present data | Transform and generate data  
| **Execution**| Program (e.g. a browser)   | Computer hardware 
| **Structure**| Inline tags    | Primitive constructs, syntax, static semantics, semantics 

(Image sources: Wikimedia)

A programming language is a formal language used to specify a set of instructions for a computer to execute. It has:

* Primitive constructs – literals (characters, numbers) and operators
* Syntax – rules for putting primitives together
* Static semantics – rules for forming meaningful commands
* Semantics – the meaning of commands

## Why Python?

![Python](figs/python.png "Python")

* Open-source – free and well-documented
* Simple and concise syntax
* Many useful libraries
* Cross-platform
* [Widely used in industry and science](https://youtu.be/cKzP61Gjf00)

## Programming with Python on Google Colab

A Jupyter notebook is an open document format based on JSON that combines live code, equations, visualizations, and explanatory text.  

Google Colab allows you to run a Jupyter notebook in the cloud via your browser, no installation required.

**Go to https://github.com/social-research/python-workshop and open the [Google Colab link](https://drive.google.com/file/d/1ou6DKxFaAVKBrn9AEgWEjdpNe76bltP9/view?usp=sharing) under Software.**

(Alternatively, if you have Jupyter pre-installed, you can clone the repository locally and run the Jupyter server to open the file `python_intro.ipynb`.)

## Using Colab Notebooks

* Text cells
    * Double-click to inspect and edit Markdown
    * See cheatsheet: https://github.com/adam-p/markdown-here/wiki/Markdown-Cheatsheet
* Code cells
    * Press the Play button (or `CTRL/CMD + ENTER`) to run
    * If you run code above, you can use the results below
* Use `+ Code` and `+ Text` buttons to add new cells
* If you get in trouble: `Runtime` &rarr; `Interrupt execution`

# Objects, Data Types, and Expressions

* Computer programs manipulate data in the form of objects
* Objects have types
  * Scalar — indivisible
  * Non-scalar — with internal structure, can be ordered/unordered and mutable/immutable
* We can do things with objects
    * Use variables to associate them with names
    * Combine objects and operators to evaluate expressions
    * Call methods on objects
    * Pass objects to functions

## Data Types in Python


| Type     | Scalar     | Mutability | Order   
| :------: |:----------:|:----------:| :---------:
| `int`    | scalar     | immutable  |             
| `float`  | scalar     | immutable  |  
| `bool`   | scalar     | immutable  | 
| `None`   | scalar     | immutable  | 
| `str`    | non-scalar | immutable  | ordered
| `tuple`  | non-scalar | immutable  | ordered
| `list`   | non-scalar | mutable    | ordered
| `set`    | non-scalar | mutable    | unordered
| `dict`   | non-scalar | mutable    | unordered

## Scalar Data Types

* Integer
* Float
* Boolean
* NoneType

In [9]:
int_var = 2  # int  <-- text after the hashtag # is a comment and will not be executed as code
float_var = 0.125  # float
true_var = True  # bool 
none_var = None  # NoneType

true_var  # Returns the value of the variable

True

In [11]:
print(float_var) # Prints a string representation of the value of the variable

0.125
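
To check which type an object has, the built-in `type()` function can be used. A minimal sketch, reusing the variable names from the cells above:

```python
int_var = 2
float_var = 0.125
true_var = True
none_var = None

# type() returns the class that an object belongs to
print(type(int_var))    # <class 'int'>
print(type(float_var))  # <class 'float'>
print(type(true_var))   # <class 'bool'>
print(type(none_var))   # <class 'NoneType'>
```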


## Non-Scalar Data Types

* String – sequence of characters (immutable, ordered)
* List – sequence of values (mutable, ordered)
* Tuple – sequence of values (immutable, ordered)
* Set – collection of unique values (mutable, unordered)
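* Dictionary – a set of key/value pairs (mutable, unordered: items are accessed by key rather than by position, although insertion order is preserved since Python 3.7)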

In [7]:
str_var = 'This is a string.' # str
list_var = [1, 2, 2, 'a', 'a']  # list
tuple_var = (1, 2, 'a', 'b')  # tuple
set_var = {1, 2, 2, 'a', 'b'}  # set 
dict_var = {1: 'a', 2: 'b', 3: ['c', 'd']}  # dict
print(list_var, set_var)

[1, 2, 2, 'a', 'a'] {'b', 1, 2, 'a'}


## Using Operators with Objects

* Arithmetic: `+`, `-`, `*`, `/`, `**` exponent, `%` modulus, `//` floor division
* Boolean: `and`, `or`, `not`
* Comparison: `==`, `!=` does not equal, `>`, `<`, `>=`, `<=`
* Assignment: `=` , `+=`, `-=`
* Membership: `in`

In [7]:
# Note that the arithmetic operators + and * have different meanings 
# depending on the types of objects with which they are used
print(2 + 2)
print('a' + 'bc')
print(3*2)
print(3*'a' + 'h!')

4
abc
6
aaah!


In [8]:
# Boolean operators return bool
print(True and False)
print(True or False)
print(not False)

False
True
True


In [12]:
a = 2 # This is assignment
a += 3 # This assignment is equivalent to a = a + 3
print(a)

print(a == 1) # This is test for equality. It returns bool.

5
False
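
The operators listed above that the cells do not exercise can be sketched as follows. The values are arbitrary and chosen only for illustration:

```python
print(7 / 2)    # true division always returns a float: 3.5
print(7 // 2)   # floor division: 3
print(7 % 2)    # modulus (remainder): 1
print(2 ** 3)   # exponent: 8
print(3 <= 3)   # comparison returns bool: True
print('a' in 'abc')     # membership in a string: True
print(4 in [1, 2, 3])   # membership in a list: False
```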


## Unordered Types vs. Sequences

* Unordered types: `set`, `dict`
* Ordered types (sequences): `str`, `list`, `tuple`
  

In [9]:
st = {1, 2, 2, 'a', 'b'} # sets are unordered
print(st)

{'b', 1, 2, 'a'}


## Dictionary Operations: Indexing

* Dictionaries are indexed by keys

In [20]:
mydic = {'Howard': 'aerospace engineer', 'Leonard': 'physicist', 'Sheldon': 'physicist', 
         'Penny': 'waitress', 'Raj': 'astrophysicist'}
print(mydic['Raj'])

astrophysicist
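
Indexing with a key that is not in the dictionary raises a `KeyError`. A minimal sketch of two common ways to guard against this, using a shortened version of the dictionary above and a hypothetical missing key `'Amy'`:

```python
mydic = {'Howard': 'aerospace engineer', 'Raj': 'astrophysicist'}

# The membership operator tests the keys, not the values
print('Raj' in mydic)   # True
print('Amy' in mydic)   # False

# mydic['Amy'] would raise a KeyError because the key does not exist
print(mydic.get('Amy'))              # .get() returns None instead
print(mydic.get('Amy', 'unknown'))   # or a fallback value of your choice
```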


## Sequence Operations: Indexing and Slicing

* Lists, tuples, and strings are indexed by numbers. **Indexing in Python starts from 0!**
* Use `elem[index]` to extract individual sub-elements
* Use `elem[start:end]` to get sub-sequence starting from index `start` and ending at index `end-1`
* Use `elem[start:end:step]` to get sub-sequence starting from index `start`, in steps of `step`, ending at index `end-1`

In [42]:
print( 'abc'[0] ) 
print( ('a', 'b', 'c')[-1]) # use negative numbers to index from the end
print( ['a', 'b', 'c'][3])  # raises IndexError: the valid indices are 0, 1, and 2

a
c


IndexError: list index out of range

In [1]:
ls = [10, 20, 30, 40, 50]
print( ls[1:4] ) 
print( ls[:3] )
print( ls[1:] )

[20, 30, 40]
[10, 20, 30]
[20, 30, 40, 50]


In [2]:
ls = [10, 20, 30, 40, 50]
print( ls[::2] ) # get elements with even indices
print( ls[::-1] ) # get elements in reverse order
print( ls[:] ) # get a copy of the sequence

[10, 30, 50]
[50, 40, 30, 20, 10]
[10, 20, 30, 40, 50]


>## EXAMPLE PROJECT: Comparing Trump's and Biden's Inaugural Speeches
>
>We will use a mini-project as an extended practical example to demonstrate the concepts we are learning. The project aims to analyze and compare the inaugural speeches of the current and last US presidents.
>
>The speech transcripts were obtained from https://millercenter.org/the-presidency/presidential-speeches and copied in the text files `biden_inauguration_millercenter.txt` and `trump_inauguration_millercenter.txt` in the `data` folder.

In [5]:
# Open one of the files and get the text into a string variable called txt
with open('data/trump_inauguration_millercenter.txt') as f:
    txt = f.read()
    
txt[:500] # Show the first 500 characters of the txt variable


'Chief Justice Roberts, President Carter, President Clinton, President Bush, President Obama, fellow Americans, and people of the world: thank you.\n\nWe, the citizens of America, are now joined in a great national effort to rebuild our country and to restore its promise for all of our people.\n\nTogether, we will determine the course of America and the world for years to come.\n\nWe will face challenges. We will confront hardships. But we will get the job done.\n\nEvery four years, we gather on these st'

## Evaluating Functions with Objects 

* Use the name of a type to convert values to that type
* The `len()` function returns the number of items in a non-scalar object

In [6]:
a = float(123)
b = int('32')
print(a, b)

123.0 32


In [12]:
c = tuple([1, 2, 3]) 
d = dict( [(1, 'a'), (2, 'b'), (3, 'c')] )
print(c, d)

(1, 2, 3) {1: 'a', 2: 'b', 3: 'c'}


In [13]:
print( len( [0, 1, 2] ) )
print( len('ab') )
print( len( (1, 2, 3, 4, 'a') ) )
print( len( {1:'a', 2:'b'} ) )

3
2
5
2
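
Type conversion has a few behaviors that are easy to overlook. The following sketch uses arbitrary values to illustrate them:

```python
print(int(3.9))        # conversion to int truncates towards zero: 3
print(str(12) + '3')   # '123' (string concatenation, not addition)
print(list('abc'))     # ['a', 'b', 'c']
print(bool(0), bool(''), bool([]))   # zero and empty objects convert to False
print(bool(-1), bool('a'))           # everything else converts to True

# Not every conversion is possible: int('3.5') and int('abc') raise a ValueError
```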


## Calling Methods on Objects

### `object.method()`

Use the period `.` to link the method to the object.

In [11]:
string1 = 'Hello'

string1 + '!'   # This is an operator. Operators combine objects in expressions.
len(string1)   # This is a function. Functions take objects as arguments.
string1.upper()   # This is a method. Methods are attached to objects.

'HELLO'

## [String Methods](http://docs.python.org/3/library/stdtypes.html#string-methods)

* `S.upper()` – change to upper case
* `S.lower()` – change to lower case
* `S.capitalize()` – capitalize the first character and convert the rest to lower case
* `S.find(S1)` – return the index of the first instance of `S1` (or `-1` if it is not found)
* `S.replace(S1, S2)` – replace all instances of `S1` with `S2`
* `S.strip()` – remove whitespace characters from the beginning and end of a string (useful when reading in from a file); `S.strip(S1)` removes the characters in `S1` instead
* `S.split(S1)` – split the string into a list, using `S1` as the separator
* `S.join(L)` – combine the strings in the sequence `L` into a single string, with `S` as the separator

In [15]:
print('Make me scream!'.upper())
x = 'make this into a proper sentence'
print(x.capitalize() + '.')

print('Find the first "i" in this sentence.'.find('i'))

MAKE ME SCREAM!
Make this into a proper sentence.
1


In [16]:
x = ' This is a long sentence that we will use as an example.\n'
print(x.replace('s', 'S'))
print(x.strip())
print(x.replace(' ', ''))

 ThiS iS a long Sentence that we will uSe aS an example.

This is a long sentence that we will use as an example.
Thisisalongsentencethatwewilluseasanexample.



In [17]:
x = 'this is a collection of words i would like to break it into tokens'
y = x.split()    # default is to split on whitespace
print(y)
print(x.split('o')) 

x_new = '-'.join(y)
print(x_new)

['this', 'is', 'a', 'collection', 'of', 'words', 'i', 'would', 'like', 'to', 'break', 'it', 'into', 'tokens']
['this is a c', 'llecti', 'n ', 'f w', 'rds i w', 'uld like t', ' break it int', ' t', 'kens']
this-is-a-collection-of-words-i-would-like-to-break-it-into-tokens


>## EXAMPLE PROJECT: Comparing Trump's and Biden's Inaugural Speeches
>
>Remember, our goal is to analyze and compare the inaugural speeches of the current and last US presidents. Using string methods, we can:
>
>1. Clean up the text
>2. Extract a list of all the words used in the speech
>3. Estimate the length of the speech
>4. Estimate the number of unique words used in the speech

In [4]:
# Open one of the files and get the text into a string variable called txt
with open('data/trump_inauguration_millercenter.txt') as f:
    txt = f.read()
    
# Remove paragraphs and format consistently
txt = txt.strip().replace('\n', ' ').replace("’", "'")

# Get rid of possessives and expand contractions
txt = txt.replace("'s", '').replace("'ve", ' have').replace("'re", ' are')
txt = txt.replace("can't", 'can not').replace("n't", ' not')

# Remove punctuation
txt = txt.replace('—', '').replace('–', '')
txt = txt.replace('.', '').replace(',', '').replace(':', '').replace(';', '').replace('…', '')
txt = txt.replace("”", '').replace("“", '')

# Convert to lower-case
txt = txt.lower()

# Break into words
wrds = txt.split()
print(sorted(wrds)[:100])

# Count the number of words in the speech
print(len(wrds))

# Count the number of unique words
print(len(set(wrds)))


['2017', '20th', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'a', 'about', 'about', 'accept', 'across', 'across', 'across', 'across', 'across', 'action', 'action', 'administration', 'affairs', 'again', 'again', 'again', 'again', 'again', 'again', 'again', 'again', 'again', 'against', 'aid', 'airports', 'all', 'all', 'all', 'all', 'all', 'all', 'all', 'all', 'all', 'all', 'all', 'all', 'allegiance', 'allegiance', 'alliances', 'allowing', 'almighty', 'along', 'always', 'always', 'america', 'america', 'america', 'america', 'america', 'america', 'america', 'america', 'america', 'america', 'america', 'america', 'america', 'america', 'america', 'america', 'america', 'america', 'american', 'american', 'american', 'american', 'american', 'american', 'american', 'american', 'american', 'american', 'american', 'american', 'americans', 'americans', 'americans', 'americans', 'an', 'an', 'an', 'and', 'and']
1436
536


## Set Methods

![Set operations](figs/sets.png "Set operations")

* `S1.union(S2)`, `S1|S2` — elements in S1 or S2, or both
* `S1.intersection(S2)`, `S1&S2` — elements in both S1 and S2
* `S1.difference(S2)`, `S1-S2` — elements in S1 but not in S2
* `S1.symmetric_difference(S2)`, `S1^S2` — elements in S1 or S2 but not both

In [19]:
st1 = set('homophily')
st2 = set('heterophily')
print(st1^st2)

{'m', 'e', 't', 'r'}
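
The other three set operations can be sketched with the same two sets. `sorted()` is used here only to make the printed order predictable, since sets themselves have no fixed order:

```python
st1 = set('homophily')
st2 = set('heterophily')

print(sorted(st1 | st2))   # union: ['e', 'h', 'i', 'l', 'm', 'o', 'p', 'r', 't', 'y']
print(sorted(st1 & st2))   # intersection: ['h', 'i', 'l', 'o', 'p', 'y']
print(sorted(st1 - st2))   # difference: ['m']

# The method and the operator give the same result
print(st1.union(st2) == st1 | st2)   # True
```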


## Mutability

* Immutable types: `str`, `tuple`, and all scalars
* Mutable types: `list`, `set`, `dict`

**Objects of mutable types can be modified after they are created.**

In [25]:
dic = {1:'a', 2:'b'}
dic[3] = 'c'
print(dic)

ls = [5, 4, 1, 3, 2]
ls.sort()
print(ls)

{1: 'a', 2: 'b', 3: 'c'}
[1, 2, 3, 4, 5]


## [List Methods](http://docs.python.org/3/library/stdtypes.html#mutable-sequence-types)

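* `L.append(e)` – add element `e` to the end of the list
* `L.insert(i, e)` – insert element `e` at index `i`
* `L.remove(e)` – remove the first occurrence of `e`
* `L.extend(L1)` – add all elements of `L1` to the end of the list
* `L.pop(i)` – remove and return the element at index `i` (the last element if `i` is omitted)
* `L.sort()` – sort the list in place
* `L.reverse()` – reverse the list in place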

In [13]:
ls1 = [1, 2, 3]
ls1.append(4)
print(ls1)

ls1.extend([5, 6, 7, 8, 9, 10])
print(ls1)

[1, 2, 3, 4]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]


In [13]:
mylist = [1, 2, 3, 4]

mylist.remove(1)
print(mylist)

popped = mylist.pop(1)
print(popped, mylist)

[2, 3, 4]
3 [2, 4]


In [31]:
mylist = [4, 5, 2, 1, 3]
mylist.sort()  # Sorts in-place. It is more efficient but overwrites the input.
print(mylist)

mylist = [10, 9, 6, 8, 7]
sorted(mylist)  # Returns a new sorted list, which is discarded here because it is not assigned
print(mylist)

newlist = sorted(mylist)  # Creates a new list that is sorted, not changing the original.
print(mylist, newlist)

[1, 2, 3, 4, 5]
[10, 9, 6, 8, 7]
[10, 9, 6, 8, 7] [6, 7, 8, 9, 10]
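
The two list methods not shown in the cells above, `insert()` and `reverse()`, can be sketched as follows. The example also illustrates a common pitfall: in-place methods return `None`, so their result should not be assigned.

```python
mylist = [1, 2, 4]
mylist.insert(2, 3)    # insert the value 3 at index 2
print(mylist)          # [1, 2, 3, 4]

mylist.reverse()       # reverses in place, like sort()
print(mylist)          # [4, 3, 2, 1]

# In-place methods return None, so do not assign their result
result = mylist.sort()
print(result, mylist)  # None [1, 2, 3, 4]
```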


## Mutability Can Be Dangerous

In [14]:
ls1 = [1, 2, 3]
ls2 = [4, 5, 6, 7]

ls1.append(ls2)
print(ls1)

ls2.extend([8, 9, 10])
print(ls1)

[1, 2, 3, [4, 5, 6, 7]]
[1, 2, 3, [4, 5, 6, 7, 8, 9, 10]]


## Aliasing vs. Cloning

![Aliasing](figs/aliasing.png "Aliasing")

In [16]:
ls1 = [1, 2, 3]
ls2 = ls1[:]  # Using [:] is one way to clone

ls1.reverse()
print(ls2)

[1, 2, 3]
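
Slicing is not the only way to clone a list. The sketch below shows two further options and uses the `is` operator, which is not covered elsewhere in this notebook, to confirm that a clone is a separate object. All three approaches copy only the outer list, so nested lists inside it would still be shared.

```python
ls1 = [1, 2, 3]

clone_a = ls1[:]       # slicing
clone_b = list(ls1)    # type conversion
clone_c = ls1.copy()   # list method

# == compares values, `is` checks whether two names refer to the same object
print(clone_a == ls1, clone_a is ls1)   # True False

ls1.append(4)
print(clone_a, clone_b, clone_c)        # the clones are unaffected by the change
```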


>## QUIZ QUESTION
>
>What will the following program print?
>
>```
>ls1 = [1, 2, 3, 4, 5]
>ls2 = ls1
>ls2[2] = 0
>print(ls1)
>```
>
>* (A) `[1, 2, 3, 4, 5]`
>* (B) `[1, 0, 3, 4, 5]`
>* (C) `[1, 2, 0, 4, 5]`
>* (D) `0`

## So Far, We Learned How to Write Straight-Line Programs

In [1]:
s = 'All animals are equal, but some animals are more equal than others.'
s = s.rstrip('.').lower()
s_tokens = s.split()
print('There are', len(s_tokens), 'words in the sentence.')


There are 12 words in the sentence.


In straight-line programs, code is executed line by line from top to bottom and, within a line, from left to right (unless operator precedence or parentheses dictate otherwise).

Statements can be executed in more complex order, however, and the control flow determines how this is done.

## [Control Flow](https://www.youtube.com/watch?v=k0xgjUhEG3U)

* Control flow is the order in which statements are executed or evaluated
* In Python, there are three main categories of control flow:
  * **Branches** (conditional statements) – execute only if some condition is met
  * **Loops** (iteration) – execute repeatedly 
  * **Function calls** – execute a set of distant statements and then return to the point of the call

![Three categories of control flow](figs/control_flow.png "Three categories of control flow")


# Conditional Statements

![Conditional statements](figs/conditional_statements.png "Conditional statements")

## Conditional Statements

```
if *Boolean expression*:
    *block of code*
```

```
if *Boolean expression*:
    *block of code*
else:
    *block of code*
```

```
if *Boolean expression*:
    *block of code*
elif *Boolean expression*:
    *block of code*
else:
    *block of code*
```

In [4]:
x = -2
if x > 0:
    print('Positive')
elif x < 0:
    print('Negative')
else:
    print('Zero')
    

Negative


## Indentation in Python Code

* Indentation is semantically meaningful in Python
* You can use [tabs or spaces](https://www.youtube.com/watch?v=SsoOG6ZeyUI), but Python 3 does not allow mixing them inconsistently within a block

* Obviously(!), tabs are preferable
* However, it does not really matter in Jupyter as Jupyter converts tabs to spaces by default

## You Can Nest Conditional Statements


In [6]:
x = -100

if type(x) == int or type(x) == float:
    if x >= 0:
        print('This is a nonnegative number.')
    else:
        print('This is a negative number.')
elif type(x) == str:
    print('This is a string.')
else:
    print("I don't know what this is.")
    

This is a negative number.


# Iteration

![Iteration](figs/iteration.png "Iteration")

## Iteration: `while` vs. `for`

```
while *Boolean expression*:
    *block of code*
```

```
for *element* in *sequence*:
    *block of code*
```

## Iteration: `while` with decrementing function

A decrementing function maps the program's variables to an integer that starts out non-negative and decreases with every pass through the loop; the loop ends when that integer reaches 0.

In [7]:
# decrementing function: 5 - x
x = 0
while x < 5: 
    x += 1
    print(x)
    

1
2
3
4
5


## Iteration: `while` with conditional statements


In [8]:
correct = 25
repeat = True

while repeat:
    guess = int(input("Guess which number from 1 to 100 I'm thinking of? "))
    
    if guess > correct + 10 or guess < correct - 10:
        print("You are quite far. Try again.")
    elif guess != correct:
        print("You are very close. Try again.")
    else:
        print("That's right!")
        repeat = False
        

Guess which number from 1 to 100 I'm thinking of? 70
You are quite far. Try again.
Guess which number from 1 to 100 I'm thinking of? 24
You are very close. Try again.
Guess which number from 1 to 100 I'm thinking of? 25
That's right!


## Iteration: `for` with sequences

In [9]:
for i in [1, 2, 3, 4, 5]:
    print(i, end=' ') 
    # Note that the "end" parameter replaces the default new line with a space
    # This allows us to print on the same line
    

1 2 3 4 5 

## Iteration: `for` with `range()`

* Built-in function that produces an immutable, ordered, non-scalar object of type `range`
* Call as `range([start], stop, [step])`. If omitted, `start = 0` and `step = 1`. 
* Produces the progression of integers `[start, start + step, start + 2*step, ..., start + i*step]`, stopping before `stop` is reached

In [10]:
print(range(6))
print(list(range(6)))


range(0, 6)
[0, 1, 2, 3, 4, 5]


In [11]:
for i in range(6):
    print(i, end=' ')
print() 

for i in range(1, 6):
    print(i, end=' ')
print()
    
for i in range(1, 6, 2):
    print(i, end=' ')
    

0 1 2 3 4 5 
1 2 3 4 5 
1 3 5 

## Indexing Lists with `range(len(L))`

In [12]:
mylist = ['a', 'b', 'c', 'd']
for i in range(len(mylist)):
    print('index', i, '-', mylist[i])
        

index 0 - a
index 1 - b
index 2 - c
index 3 - d


* This is especially useful when you need to go simultaneously over two different lists of the same length

In [13]:
mylist1 = ['a', 'b', 'c', 'd']
mylist2 = [1, 2, 3, 4]
for i in range(len(mylist1)):
    print(mylist1[i] + str(mylist2[i]), end=', ')

a1, b2, c3, d4, 
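
The two loops above can also be written without explicit indexing. The following sketch uses the built-in functions `enumerate()` and `zip()`, which are not introduced elsewhere in this notebook, as one possible alternative:

```python
mylist1 = ['a', 'b', 'c', 'd']
mylist2 = [1, 2, 3, 4]

# enumerate() yields (index, element) pairs
for i, elem in enumerate(mylist1):
    print('index', i, '-', elem)

# zip() pairs up the elements of two sequences
for letter, number in zip(mylist1, mylist2):
    print(letter + str(number), end=', ')
```

The output matches that of the `range(len(...))` versions. Note that `zip()` stops at the end of the shorter sequence if the lengths differ.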

## Iteration: `break` and `continue`

* Use `break` to exit a loop 
* Use `continue` to go directly to next iteration

In [14]:
for i in range(5):
    if i == 2:
        continue  # Now try with break
    print(i)
    

0
1
3
4
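
For comparison, here is the same loop with `break` in place of `continue`. The final `print()` is added only to show where the loop stopped:

```python
for i in range(5):
    if i == 2:
        break  # exit the loop entirely; 2, 3, and 4 are never printed
    print(i)

print('Stopped at', i)  # the loop variable keeps its last value: 2
```

This prints `0` and `1`, followed by `Stopped at 2`.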


>## EXAMPLE PROJECT: Comparing Trump's and Biden's Inaugural Speeches
>
>Above, we already cleaned up the text in Trump's speech and saved a list of all the words used in the speech in the variable `wrds`. Next, we will:
>
>1. Count the number of times each unique word is mentioned in the speech
>2. Exclude non-meaningful words such as articles and prepositions
>3. Identify the most commonly used meaningful words to reveal the theme and tone of the speech

In [6]:
# Create dictionary with word:count
word_counts = {}

for i in wrds:
    if i not in word_counts:
        word_counts[i] = 1
    else:
        word_counts[i] += 1

# Print the words with counts in decreasing order of popularity
# Note this produces a list of tuples
sorted_word_counts = sorted(word_counts.items(), key=lambda i: i[1], reverse=True)

sorted_word_counts[:10]

[('and', 74),
 ('the', 70),
 ('we', 49),
 ('of', 48),
 ('our', 48),
 ('will', 40),
 ('to', 37),
 ('is', 21),
 ('america', 18),
 ('a', 15)]
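
The counting loop above can be shortened in at least two ways. The sketch below uses a short hypothetical word list in place of `wrds`, and `collections.Counter` from the standard library, which the project code itself does not use:

```python
from collections import Counter

# A short hypothetical word list standing in for `wrds`
words = ['we', 'will', 'make', 'america', 'great', 'again', 'we', 'will', 'we']

# Option 1: dict.get() with a default of 0 replaces the if/else
word_counts = {}
for w in words:
    word_counts[w] = word_counts.get(w, 0) + 1

# Option 2: Counter does the counting in one step
counter = Counter(words)

print(word_counts == dict(counter))  # True
print(counter.most_common(2))        # [('we', 3), ('will', 2)]
```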

In [7]:
# We will create a list of all words mentioned more than once, excluding stop words
# Stop words are common words that are not meaningful in this context
stop_words = ['a', 'about', 'across', 'after', 'an', 'and', 'any', 'are', 'as', 'at', 
              'be', 'because', 'but', 'by', 'did', 'do', 'does', 'for', 'from',
              'get', 'has', 'have', 'if', 'in', 'is', 'it', 'its',
              'many', 'more', 'much', 'no', 'not', 'of', 'on', 'or', 'out',
              'so', 'some', 'than', 'the', 'this', 'that', 'those', 'through', 'to',
              'very', 'what', 'where', 'whether', 'which', 'while', 'who', 'with']

common_words = []
for i in sorted_word_counts:
    if i[0] not in stop_words:
        if i[1] > 1:
            common_words.append(i)
        else:
            break
        
common_words[:10]

[('we', 49),
 ('our', 48),
 ('will', 40),
 ('america', 18),
 ('you', 12),
 ('all', 12),
 ('american', 12),
 ('their', 11),
 ('your', 11),
 ('people', 9)]

# List Comprehensions

```
L = [*object, expression, or function* for *element* in *sequence*]
L = [*object, expression, or function* for *element* in *sequence* if *Boolean expression*]
L = [*object, expression, or function* for *element* in *sequence* for *element2* in *sequence2*]
```

* Provide a concise way to create lists
* Usually faster than an equivalent `for` loop with `append()`, because the loop runs with less interpreter overhead
* Nested list comprehensions can be somewhat confusing


## List Comprehensions

In [15]:
print([x**2 for x in range(1, 11)])

ans = []
for x in range(1, 11):
    ans.append(x**2)
print(ans)


[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
[1, 4, 9, 16, 25, 36, 49, 64, 81, 100]


In [16]:
print([x**2 for x in range(1, 11) if x%2 == 0])
print([x + y for x in ['a', 'b', 'c'] for y in ['1','2', '3']])


[4, 16, 36, 64, 100]
['a1', 'a2', 'a3', 'b1', 'b2', 'b3', 'c1', 'c2', 'c3']


## Dictionary and Set Comprehensions

In [17]:
print({x: x**2 for x in range(1, 11)})
print({x.lower(): y for x, y in [('A', 1), ('b', 2), ('C', 2)]})

print({x.lower() for x in 'SomeRandomSTRING'})


{1: 1, 2: 4, 3: 9, 4: 16, 5: 25, 6: 36, 7: 49, 8: 64, 9: 81, 10: 100}
{'a': 1, 'b': 2, 'c': 2}
{'o', 's', 't', 'm', 'n', 'g', 'e', 'a', 'r', 'i', 'd'}


>## EXAMPLE PROJECT: Comparing Trump's and Biden's Inaugural Speeches
>
>We can use a list comprehension to rewrite some of the code we wrote above:
>
>```
>common_words = []
>for i in sorted_word_counts:
>    if i[0] not in stop_words:
>        if i[1] > 1:
>            common_words.append(i)
>        else:
>            break
>```

In [8]:
common_words = [i for i in sorted_word_counts if i[0] not in stop_words and i[1] > 1]        
common_words[:10]

[('we', 49),
 ('our', 48),
 ('will', 40),
 ('america', 18),
 ('you', 12),
 ('all', 12),
 ('american', 12),
 ('their', 11),
 ('your', 11),
 ('people', 9)]

A list comprehension cannot use `break`, so it always iterates to the end of `sorted_word_counts`. For most data the extra iterations cost little, and the comprehension is still likely to run faster than the explicit loop.
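
If stopping early does matter, one option is `itertools.takewhile()` from the standard library, which the project code does not use. A minimal sketch, with small hypothetical stand-ins for `sorted_word_counts` and `stop_words`:

```python
from itertools import takewhile

# Hypothetical stand-ins for `sorted_word_counts` and `stop_words`
sorted_counts = [('and', 5), ('we', 4), ('the', 3), ('will', 2), ('great', 1), ('again', 1)]
stops = ['and', 'the']

# takewhile() stops at the first pair whose count is not above 1,
# which works only because the list is sorted by decreasing count
frequent = takewhile(lambda pair: pair[1] > 1, sorted_counts)
common = [pair for pair in frequent if pair[0] not in stops]
print(common)   # [('we', 4), ('will', 2)]
```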

# Functions

![Functions](figs/functions.png "Functions") 


* Built-in
  * `len()`, `max()`, `range()`, `open()`, etc.
* User-defined
  * By you, collaborators, or the open-source community

## Defining and Calling Functions

**Defining a function**

```
def *function_name*(*list of parameters*):
    *body of function*
```

**Calling a function**

```
*function_name*(*arguments*)
```


## When the Function is Used, the Parameters are Bound to the Arguments

```
def *function_name*(*list of parameters*):
    *body of function*

*function_name*(*arguments*)
```


In [1]:
def get_larger(x, y):
    """Assumes x and y are of numeric type.
    Returns the larger of x and y.
    """
    if x > y:
        # The execution of a `return` statement terminates the function call
        return x
    else:
        return y
    
m = get_larger(3, 4)
print(m)

4


## A Function Call Always Returns a Value

* The execution of a `return` statement terminates the function call
* The function call also terminates when there are no more statements to execute
* If no expression follows `return` or there is no `return` statement, the function returns `None`       

In [11]:
def get_larger(x, y):
    if x > y:
        return x
    if y > x:
        return y

ex1 = get_larger(3, 5)
ex2 = get_larger(6, 4)
ex3 = get_larger(3, 3)

print(ex1, ex2, ex3)

5 6 None


## Functions Can Return Multiple Values

In [12]:
def double_one(a):
    return 2*a

def double_two(a, b):
    return 2*a, 2*b

x = double_two(5, 3)
print(x)

# You can unpack the tuple in two separate variables
x1, x2 = double_two(5, 3)
print(x1)
print(x2)

(10, 6)
10
6


## Positional vs. Keyword Arguments

In [13]:
def print_reverse(first, second, third):
    print(third, second, first)
    
print_reverse(1, 2, 3)
print_reverse(third=3, second=2, first=1)
print_reverse(1, second=2, third=3)

# Gives a syntax error because keyword arguments cannot come before positional arguments
# print_reverse(first=1, 2, 3)  

3 2 1
3 2 1
3 2 1


## Default Parameter Values

* Default values allow you to call a function with fewer arguments than it has parameters
* Parameters with default values cannot come before parameters without default values

In [2]:
def pretty_print(lst, sep, fullstop=True, capitalize=True):
    toprint = sep.join(lst)
    if fullstop:
        toprint += '.'
    if capitalize:
        toprint = toprint.capitalize()
    print(toprint)

wordlst = ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']  # an English pangram

pretty_print(wordlst, ' ', True, True)
pretty_print(wordlst, ' ')
pretty_print(wordlst, ' ', False)


The quick brown fox jumps over the lazy dog.
The quick brown fox jumps over the lazy dog.
The quick brown fox jumps over the lazy dog
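
Keyword arguments make it possible to override one default while keeping the others. The sketch below uses `pretty_join()`, a hypothetical variant of `pretty_print()` that returns the string instead of printing it:

```python
def pretty_join(lst, sep, fullstop=True, capitalize=True):
    """Hypothetical variant of pretty_print() that returns the string."""
    text = sep.join(lst)
    if fullstop:
        text += '.'
    if capitalize:
        text = text.capitalize()
    return text

words = ['the', 'lazy', 'dog']

# Keep the default for fullstop but override capitalize by name
print(pretty_join(words, ' ', capitalize=False))     # the lazy dog.
print(pretty_join(words, sep='-', fullstop=False))   # The-lazy-dog
```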


## A Function Defines a New Scope

* Scope = namespace
* This means you can reuse your favorite variable names in different functions

In [16]:
def func(x, y):
    x += 1
    # x is a parameter, z is a local variable
    z = x + y   # z, x, and y exist only in the scope of the definition of func
    return z

x = 1
res = func(x, 5)

print(x)  # x has not changed 
#print(z)  # Raises a NameError because z exists only inside func


1


## The Global Scope

In [17]:
GLOBVAR = 3 # It is conventional to use CAPITALS to name global variables

def print_global():
    # Since GLOBVAR is not defined in the function, it is treated as global
    print(GLOBVAR)  

print_global()

3
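
Reading a global variable inside a function works as shown, but assigning to the same name inside a function creates a new local variable instead. A minimal sketch with hypothetical names:

```python
COUNT = 3

def shadow():
    COUNT = 10      # creates a new local variable; the global is untouched
    return COUNT

def read_global():
    return COUNT + 1    # only reads the global, so no local is created

print(shadow())       # 10
print(read_global())  # 4
print(COUNT)          # 3
```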


>## EXAMPLE PROJECT: Comparing Trump's and Biden's Inaugural Speeches
>
>With functions, we can make the code we have so far more modular so that we can easily apply it to multiple data files. Below, we will:
>
>1. Create a function to extract words from a text
>2. Create another function to count words in a text
>3. Apply the functions to each president's speech
>4. Compare the length and repetitiveness of the speeches, the most common words, and the unique words

In [9]:
import string  # See https://docs.python.org/3/library/string.html

# This will now be a global variable so we will follow the convention and 
# name it in all caps
STOP_WORDS = ['a', 'about', 'across', 'after', 'an', 'and', 'any', 'are', 'as', 'at', 
              'be', 'because', 'but', 'by', 'did', 'do', 'does', 'for', 'from',
              'get', 'has', 'have', 'if', 'in', 'is', 'it', 'its',
              'many', 'more', 'much', 'no', 'not', 'of', 'on', 'or', 'out',
              'so', 'some', 'than', 'the', 'this', 'that', 'those', 'through', 'to',
              'very', 'what', 'where', 'whether', 'which', 'while', 'who', 'with']

def get_tokens(fname):
    """Read given text file and return a list with all words in lowercase
    in the order they appear in the text. Common contractions are expanded
    and hyphenated words are combined in one word.
    """
    with open(fname) as f:
        txt = f.read()
        
    # Remove paragraphs and format consistently
    txt = txt.strip().replace('\n', ' ').replace("’", "'")
    
    # Get rid of possessives and expand contractions
    txt = txt.replace("'s", '').replace("'ve", ' have').replace("'re", ' are')
    txt = txt.replace("can't", 'can not').replace("n't", ' not')

    # Remove punctuation and convert to lower-case
    exclude = set(string.punctuation) | {"”", "“", "…", '–'}
    txt = ''.join(ch.lower() for ch in txt if ch not in exclude)

    # Break into words
    wrds = txt.split()
    
    return wrds


def get_word_counts(tokens):
    """Take tokens and return a list of (word, count) tuples sorted by
    decreasing count, where count is the number of times the word appears.
    Words in STOP_WORDS are excluded.
    """
    # Create dictionary with word:count
    word_counts = {}

    for i in tokens:
        if i not in STOP_WORDS:
            if i not in word_counts:
                word_counts[i] = 1
            else:
                word_counts[i] += 1

    # Get the words with counts in decreasing order of popularity
    # Note this produces a list of tuples
    sorted_word_counts = sorted(word_counts.items(), key=lambda i: i[1], reverse=True)
    
    return sorted_word_counts


trump_tokens = get_tokens('data/trump_inauguration_millercenter.txt')
biden_tokens = get_tokens('data/biden_inauguration_millercenter.txt')

In [10]:
# Biden's speech is longer
print(len(trump_tokens), len(biden_tokens))
print(len(set(trump_tokens)), len(set(biden_tokens)))
# Biden's speech is also more repetitive
print(len(trump_tokens)/len(set(trump_tokens)), len(biden_tokens)/len(set(biden_tokens)))

print() # Add an empty line to separate results

# Get the word counts for Trump and Biden (the 20 most common are printed below)
trump_wcounts = get_word_counts(trump_tokens)
biden_wcounts = get_word_counts(biden_tokens)

# Biden's speech is more self-centered
print(trump_wcounts[:20])

print() # Add an empty line to separate results

print(biden_wcounts[:20])

1436 2382
536 721
2.6791044776119404 3.30374479889043

[('we', 49), ('our', 48), ('will', 40), ('america', 18), ('you', 12), ('all', 12), ('american', 12), ('their', 11), ('your', 11), ('people', 9), ('country', 9), ('nation', 9), ('again', 9), ('one', 8), ('every', 7), ('world', 6), ('now', 6), ('great', 6), ('back', 6), ('never', 6)]

[('we', 91), ('our', 43), ('will', 33), ('i', 33), ('us', 27), ('my', 20), ('america', 20), ('can', 18), ('you', 17), ('all', 17), ('one', 15), ('nation', 14), ('democracy', 11), ('me', 11), ('must', 10), ('americans', 9), ('today', 9), ('people', 9), ('american', 9), ('story', 9)]


## Modules

* For large programs, store different parts in `.py` files
* Get access using `import` statements

In [1]:
import speech_analysis

trump_tokens = speech_analysis.get_tokens('data/trump_inauguration_millercenter.txt')
print(trump_tokens[:20])


['chief', 'justice', 'roberts', 'president', 'carter', 'president', 'clinton', 'president', 'bush', 'president', 'obama', 'fellow', 'americans', 'and', 'people', 'of', 'the', 'world', 'thank', 'you']


In [2]:
import speech_analysis as sa

trump_tokens = sa.get_tokens('data/trump_inauguration_millercenter.txt')


In [3]:
# You should be careful with this one: there will be a conflict if you
# import a different module that also has a function called get_tokens()
from speech_analysis import *

trump_tokens = get_tokens('data/trump_inauguration_millercenter.txt')
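
The name conflict mentioned in the comment above can be reproduced with two standard-library modules that both define `sqrt()`. The sketch uses named imports so that it runs anywhere; `from module import *` causes the same shadowing for every public name in the module.

```python
from math import sqrt
print(sqrt(4))           # real-valued version from math

from cmath import sqrt   # same name, different module
shadowed = sqrt(4)       # silently uses the complex-valued version now
print(shadowed)

# Safer: keep each module's names behind its own prefix
import math
import cmath

real_root = math.sqrt(4)
complex_root = cmath.sqrt(-4)
print(real_root, complex_root)
```

With the prefixed form, it is always clear which module a function comes from, at the cost of slightly longer names.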


## Useful Python Modules

https://docs.python.org/3/library/

* `re` – Regular expression operations
* `datetime` – Basic date and time types
* `math` – Mathematical functions
* `random` – Generate pseudo-random numbers
* `os.path` – Common pathname manipulations
* `pickle` – Python object serialization
* `csv` – CSV file reading and writing
* `json` – JSON encoder and decoder
* ...
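
As a small illustration of two of these modules, the sketch below tokenizes a sentence with `re` and saves the resulting counts as a JSON string with `json`. The sentence is made up for the example.

```python
import json
import re

# Illustrative text (not from the workshop data)
text = "The quick brown fox jumps over the lazy dog. The dog sleeps."

# re.findall() returns every run of lower-case letters as a word
words = re.findall(r"[a-z]+", text.lower())

# Count the words with a plain dictionary
counts = {}
for w in words:
    counts[w] = counts.get(w, 0) + 1

# json.dumps() turns the dictionary into a string; json.loads() reverses it
encoded = json.dumps(counts)
decoded = json.loads(encoded)
print(decoded['the'], decoded['dog'])
```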

## Useful Python Packages

* `numpy` – Scientific computing with multi-dimensional arrays
* `pandas` – Data analysis with table-like structures (R, pretty much)
* `statsmodels` – Statistical data analysis with linear models
* `scikit-learn` – Data mining and machine learning
* `networkx` – Network analysis
* `matplotlib` – Plotting
* ...

## Decomposition and Abstraction

![Decomposition and abstraction](figs/decomposition_abstraction.png "Decomposition and abstraction")

* **Decomposition creates structure** – it allows us to break the program into self-contained parts
* **Abstraction hides detail** – it allows us to use code as if it were a black box

We can achieve decomposition and abstraction with:

* Functions
* Classes
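
A minimal sketch of both ideas using functions: the task is decomposed into three small parts, and the caller of `top_words()` does not need to know how the other two work. The function names and the sample sentence are illustrative, not part of the workshop code.

```python
import string

def normalize(text):
    """Lower-case the text and strip ASCII punctuation."""
    return ''.join(ch for ch in text.lower() if ch not in string.punctuation)

def count_words(tokens):
    """Return a dictionary mapping each token to its count."""
    counts = {}
    for token in tokens:
        counts[token] = counts.get(token, 0) + 1
    return counts

def top_words(text, n):
    """Return the n most frequent words as (word, count) tuples."""
    counts = count_words(normalize(text).split())
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)[:n]

result = top_words("We will build. We will rise. We will lead!", 2)
print(result)
```

Each function can be tested and replaced on its own, which is the practical benefit of decomposition.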

# Object-Oriented Programming

A programming paradigm based on the concept of "objects"

An object is a **data abstraction** that captures:

* **Internal representation** (data attributes)
* **Interface** for interacting with the object (methods)


## Procedural vs. Object-Oriented Programming

![Procedural vs. object-oriented programming](figs/procedural_object-oriented.png "Procedural vs. object-oriented programming")

## Everything in Python Is an Object!

* Objects have types (belong to classes)
* Objects also have a set of procedures for interacting with them (methods)

In [8]:
s = 'some string'
print(type(s))
print(s.upper())

<class 'str'>
SOME STRING
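
The same holds beyond strings. As a quick sketch, numbers, lists, and even built-in functions have a type, and numbers and lists carry their own methods:

```python
values = [42, 3.14, [1, 2], len]

# Every value reports the class it belongs to
names = [type(v).__name__ for v in values]
print(names)

# Methods are available on integers and lists too
bits = (255).bit_length()    # number of bits needed to represent 255
threes = [3, 1, 3].count(3)  # how many times 3 appears in the list
print(bits, threes)
```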


## Defining Classes in Python


In [1]:
from datetime import date

class Person(object):
        
    def __init__(self, f_name, l_name):
        """Creates a person using first and last names."""
        self.first_name = f_name
        self.last_name = l_name
        self.birthdate = None
    
    def get_name(self):
        """Gets self's full name."""
        return self.first_name + ' ' + self.last_name
    
    def get_age(self):
        """Gets self's age in years."""
        return date.today().year - self.birthdate.year
    
    def set_birthdate(self, dob):
        """Assumes dob is of type date.
        Sets self's birthdate to dob.
        """
        self.birthdate = dob
    
    def __str__(self):
        """Returns self's full name."""
        return self.first_name + ' ' + self.last_name
    
p1 = Person('Greta', 'Thunberg')
p1.set_birthdate(date(2003, 1, 3))
print(p1, p1.get_age())

Greta Thunberg 16
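
Note that `get_age()` subtracts calendar years only, so it overstates the age by one year until the birthday has passed, and it raises an `AttributeError` if `set_birthdate()` was never called. A possible refinement, sketched here as a standalone helper (the name `age_on` and the explicit `today` parameter are assumptions made for the example), checks whether the birthday has already occurred:

```python
from datetime import date

def age_on(birthdate, today):
    """Return the number of completed years between birthdate and today."""
    # Tuples compare element by element: first month, then day
    had_birthday = (today.month, today.day) >= (birthdate.month, birthdate.day)
    return today.year - birthdate.year - (0 if had_birthday else 1)

dob = date(2003, 1, 3)
before = age_on(dob, date(2019, 1, 2))  # the day before the birthday
on_day = age_on(dob, date(2019, 1, 3))  # on the birthday
print(before, on_day)
```

Passing `today` as an argument, rather than calling `date.today()` inside, makes the result reproducible.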


## Defining Classes in Python

* Data attributes – `first_name`, `last_name`, `birthdate`
* Methods
  * `get_name()`, `get_age()`, `set_birthdate()`
  * `__init__()` – called when a class is instantiated
  * `__str__()` – called by `print()` and `str()`
  
---

* Operations
  * Instantiation: `p1 = Person('Greta', 'Thunberg')` calls method `__init__()`
  * Attribute/method reference: `p1.get_age()`

## Classes vs. Objects

* `Person` is a class
* `p1` is an instance of the class `Person`; it is an object of type `Person`
* Similarly, `str` is a class and `'Greta Thunberg'` is an object of type `str`

![Class vs. object](figs/person_greta.png "Class vs. object")

By Anders Hellberg - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=77270098
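
The class/object distinction can be checked directly with `type()` and `isinstance()`. The sketch below uses a trimmed-down version of `Person`; the second person is an illustrative example only.

```python
class Person:
    def __init__(self, f_name, l_name):
        self.first_name = f_name
        self.last_name = l_name

    def __str__(self):
        return self.first_name + ' ' + self.last_name

# One class, two separate objects
p1 = Person('Greta', 'Thunberg')
p2 = Person('Ada', 'Lovelace')

print(type(p1) is Person)        # p1 is an object of type Person
print(isinstance(str(p1), str))  # str(p1) is an object of type str
print(p1 is p2)                  # two instances are distinct objects
```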



>## EXAMPLE PROJECT: Comparing Trump's and Biden's Inaugural Speeches
>
>We can rebundle the code we have written so far into a class, following the object-oriented programming paradigm. The data and the functions are then encapsulated together: the functions become methods that belong to this particular data type, so we cannot call them independently, for example on other data types.


In [3]:
import string

STOP_WORDS = ['a', 'about', 'across', 'after', 'an', 'and', 'any', 'are', 'as', 'at', 
              'be', 'because', 'but', 'by', 'did', 'do', 'does', 'for', 'from',
              'get', 'has', 'have', 'if', 'in', 'is', 'it', 'its',
              'many', 'more', 'much', 'no', 'not', 'of', 'on', 'or', 'out',
              'so', 'some', 'than', 'the', 'this', 'that', 'those', 'through', 'to',
              'very', 'what', 'where', 'whether', 'which', 'while', 'who', 'with']

class Speech(object):
        
    def __init__(self, fname):
        """Creates a speech using the text in file fname."""
        
        with open(fname) as f:
            self.txt = f.read()
        self.tokens = None
        self.word_counts = None
        
        # Populate the empty attributes above by processing the text
        self.process_tokens()        
        self.process_word_counts()
    
    
    # The following two methods are called when you initialize a new object
        
    def process_tokens(self):
        """Extracts the tokens in the text and assigns them to 
        the attribute 'tokens'. 'tokens' is a list of strings.
        """
        
        # Remove paragraph breaks and normalize apostrophes
        txt = self.txt.strip().replace('\n', ' ').replace("’", "'")
        
        # Get rid of possessives and expand contractions
        txt = txt.replace("'s", '').replace("'ve", ' have').replace("'re", ' are')
        txt = txt.replace("can't", 'can not').replace("n't", ' not')
    
        # Remove punctuation and convert to lower-case
        exclude = set(string.punctuation) | {"”", "“", "…", '–'}
        txt = ''.join(ch.lower() for ch in txt if ch not in exclude)

        # Break into words
        wrds = txt.split()

        self.tokens = wrds
        
        
    def process_word_counts(self):
        """Counts the number of times each word, excluding stop words,
        appears in the speech and assigns the counts to the attribute 'word_counts'.
        'word_counts' is a list of tuples in the form (token, count).
        """
        # Create dictionary with word:count
        word_counts = {}

        for i in self.tokens:
            if i not in STOP_WORDS:
                if i not in word_counts:
                    word_counts[i] = 1
                else:
                    word_counts[i] += 1

        # Get the words with counts in decreasing order of popularity
        # Note this produces a list of tuples
        sorted_word_counts = sorted(word_counts.items(), key=lambda i: i[1], reverse=True)
        self.word_counts = sorted_word_counts
    
    
    # Use get and set methods to provide interface for interacting with the objects
        
    def get_text(self):
        """Get the raw text of the speech."""
        return self.txt
        
    def get_tokens(self):
        """Get the tokens in the speech as a list of strings."""
        # Return a copy so that callers cannot modify the internal list
        return self.tokens[:]
    
    def get_word_counts(self):
        """Get each unique word in the speech and the number of times it appears in the speech.
        Return a list of tuples in the form (token, count).
        """
        # Return a copy so that callers cannot modify the internal list
        return self.word_counts[:]
    
    # You can make your code even more interactive by providing extra methods for
    # common and useful operations
    
    def get_speech_length(self):
        """Get the number of tokens in the speech."""
        return len(self.tokens)
    
    def get_number_unique_tokens(self):
        """Gets the number of unique words used in the speech,
        including stop words.
        """
        return len(set(self.tokens))
    
    def __str__(self):
        """Returns the first 200 characters of the speech."""
        return self.txt[:200] + '...'

    
# Create an object of class Speech for Trump's inaugural speech
trump = Speech('data/trump_inauguration_millercenter.txt')
print(trump)
# The text was already processed in __init__(); get the length of the speech
print(trump.get_speech_length())

print()

# Create another Speech object for Biden's inaugural speech
biden = Speech('data/biden_inauguration_millercenter.txt')
print(biden)
print(biden.get_speech_length())

Chief Justice Roberts, President Carter, President Clinton, President Bush, President Obama, fellow Americans, and people of the world: thank you.

We, the citizens of America, are now joined in a gre...
1436

Chief Justice Roberts, Vice President Harris, Speaker Pelosi, Leader Schumer, Leader McConnell, Vice President Pence, distinguished guests, and my fellow Americans.

This is America’s day.

This is de...
2382
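
The getters in `Speech` return `self.tokens[:]` rather than `self.tokens`. The following sketch shows why, using a hypothetical stand-in class (`TokenBox`) so that it runs without the data files:

```python
class TokenBox:
    """Hypothetical stand-in for Speech that stores a list of tokens."""

    def __init__(self, tokens):
        self.tokens = list(tokens)

    def get_tokens(self):
        # The slice creates a new list with the same elements
        return self.tokens[:]

box = TokenBox(['we', 'the', 'people'])

# Changing the returned list leaves the object's own list untouched
borrowed = box.get_tokens()
borrowed.append('oops')

print(borrowed)
print(box.get_tokens())
```

A slice is a shallow copy. That is sufficient here because the elements are strings (or tuples, in the case of `word_counts`), which cannot be modified in place.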


# Next Steps

* Make use of other resources online
    * [Coursera](https://www.coursera.org/)
    * [MIT OpenCourseWare](https://ocw.mit.edu/index.htm)
    * [Code School](http://tryr.codeschool.com/)
    * ... and [many others](https://github.com/social-research/python-workshop/blob/main/RESOURCES.md)
* Write code at any opportunity
* Practice, practice, practice

## Learn from Other Programmers

![Not sure if I am a good programmer or just good at googling](figs/good_programmer.jpg "Not sure if I am a good programmer or just good at googling") 