<div class="frontmatter text-center">
<h1>Introduction to Data Science and Programming</h1>
<h2>Lecture 6: Python Crash Course - Strings, text, and IO</h2>
<h3>IT University of Copenhagen, Fall 2023</h3>
<h3>Instructor: Anastassia Vybornova</h3>
</div>

# Recap of last time

* modulo `%` operator: `x % y` returns the remainder of the division `x/y` 
* assignment operators `+=, -=, *=, /=`: "add/subtract/multiply/divide" and assign
* `None` object (absence of everything)



# Recap of last time

## Update on functions
* returning multiple values: `return value1, value2`
* **"parameter"** = temporary variable declared in the function **definition** `def function_name(parameter_name): ...`
* **"argument"** = value passed into the function **call** `function_name(argument_value)`
* functions can take several parameters `def function_name(param1, param2, param3, ...)`
* there are required and optional parameters 
    * required: `def function_name(req_param)`
    * optional: `def function_name(opt_param=default_value)` - optional parameters need a default value!
* mixing required and optional parameters in a function call - **watch out for the right order!**
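
A minimal sketch of the points above. The function `greet` and its parameter names are made up for illustration. In the definition, required parameters must come before optional ones.

```python
# Hypothetical example: "name" is required, "greeting" is optional (it has a default value)
def greet(name, greeting="Hello"):
    return greeting + ", " + name + "!"

# only the required argument is given: the default value is used
default_greeting = greet("Ada")
# two positional arguments: they are matched to the parameters BY ORDER
custom_greeting = greet("Ada", "Good morning")
# keyword arguments: the order does not matter, because the names are given
keyword_greeting = greet(greeting="Hi", name="Ada")

print(default_greeting)
print(custom_greeting)
print(keyword_greeting)
```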

In [None]:
def sum_and_product(a, b, c):
    return a+b+c, a*b*c

sum_and_product(1,3,10) 

<hr>

Today we will learn:

* Built-in modules
* Text processing

**Introducing built-in modules**

# Some easy tasks in Python?
* Find the square root of 8.
* Randomly choose a number from 1 to 6. ("dice throw")
* Print the contents of your working directory (folder).

**"I'm sure there is a function to do this!"**

https://docs.python.org/3/library/functions.html

# Some easy tasks in Python?
* Find the square root of 8.
* Print the contents of your working directory (folder).
* Randomly choose a number from 1 to 6. ("dice throw")

**"I'm sure there is a** ~~function~~ **function in a module to do this!"**


There are many built-in modules in Python for different purposes: https://docs.python.org/3/py-modindex.html

A module is a file containing Python definitions and statements; you can also make your own modules: https://docs.python.org/3/tutorial/modules.html


# Some easy tasks in Python?
* Find the square root of 8. --> `math` module
* Randomly choose a number from 1 to 6. ("dice throw") --> `random` module
* Print the contents of your working directory (folder). --> `os` (operating system) module



In [None]:
# the built-in modules "come" with Python, but we need to import them first:
import math 
math

In [None]:
# the built-in modules "come" with Python, but we need to import them first:
import math 
type(math)

In [None]:
# now we can use all functions within math, for example math.sqrt:
math.sqrt(9)
# the syntax is: module_name.function_name

# Q: how do I know which functions even exist?

* Scroll through the documentation of the module: https://docs.python.org/3/library/math.html
* Google + [StackOverflow](https://stackoverflow.com) it
* If you are using an IDE (e.g. PyCharm, Visual Studio Code): you will get suggestions as you start typing the module name

In [None]:
# if we only need 1 function from the entire module, we can also just import that function:
from math import sqrt
# now we can use the function sqrt "directly" in our code:
sqrt(25)

In [None]:
# if we are lazy typers, we can come up with an "alias" for a module:
import math as mm
# now, to call a function from that module, we just need to type the alias (not the entire module name):
mm.sqrt(16)

# Python syntax to import modules

```python
# to import an ENTIRE module
import math
# to use a function from that module
math.sqrt(9)

# to import only some FUNCTIONS from a module
from math import sqrt
# now we can use the function "directly"
sqrt(9)

# to import a module with a "nickname"
import math as mm
# now we use "mm" to refer to that module's functions
mm.sqrt(9)
```


# Try it out yourself!

(our "easy tasks" from before)
* Find the square root of 8, with the `math` module and its `sqrt` function
* Randomly choose a number from 1 to 6. ("simulate a dice throw"), with the  `random` module and its `choice` function
* Print the contents of your working directory (folder), with the `os` (operating system) module and its `listdir` function

If needed, you can check out the documentation of each module: look it up in the [Python Module Index](https://docs.python.org/3/py-modindex.html).

In [None]:
# find the square root of 8, using math and sqrt

In [None]:
# randomly choose a number from 1 to 6, using random and choice

In [None]:
# print the contents of the folder that your notebook is saved in, with os and listdir
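
One possible way to solve the three tasks, as a sketch (try it yourself first!). The variable names are made up. Calling `os.listdir()` without an argument lists the current working directory.

```python
import math
import os
import random

# square root of 8
root = math.sqrt(8)
# dice throw: pick one element from the list at random
dice = random.choice([1, 2, 3, 4, 5, 6])
# list of all file and folder names in the current working directory
contents = os.listdir()

print(root)
print(dice)
print(contents)
```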

**Introducing Python packages**

# Some slightly more complicated tasks in Python

* Process tabular data (like an Excel sheet).
* Approximate a set of data points with a Gaussian curve.
* Speed up the computation time for millions of data points.
* Work with geospatial data (for example a map of land use patterns)
* Make a line plot from a data set.
* Make a very fancy line plot from a data set. 
* Find the shortest path from A to B on a network.

# Some slightly more complicated tasks in Python

**I'm sure there's a** ~~function~~ ~~module~~ **`package` for this!**

* Process tabular data (like an Excel sheet). [**`pandas`**](https://pandas.pydata.org)
* Approximate a set of data points with a Gaussian curve. [**`scipy`**](https://scipy.org)
* Speed up the computation time for millions of data points. [**`numpy`**](https://numpy.org)
* Work with geospatial data (for example a map of land use patterns) [**`geopandas`**](https://geopandas.org/en/stable/)
* Make a line plot from a data set. [**`matplotlib`**](https://matplotlib.org)
* Make a very fancy line plot from a data set. [**`seaborn`**](https://seaborn.pydata.org)
* Find the shortest path from A to B on a network. [**`networkx`**](https://networkx.org)

### These are NOT built-in, and need to be installed separately.

# Packages in Python - soft intro

A package is (put very simply) a collection of modules and functions, for a specific & specialized purpose.

It doesn't come with Python - you need to **install** it separately on your computer before using it.

(Incomplete) list of Python packages: https://pypi.org

To install packages, you can either do it completely manually ("build from source") or use a **package manager**, for example **`conda`**. 

We will work with only two Python packages: **`pandas`** and **`matplotlib`**.

## OMG: The Anaconda distribution contains both Python AND many additional packages.

Thanks to Anaconda, both `pandas` and `matplotlib` should already be installed (and you should be able to import them), so you don't need to install anything else.

# Text processing in Python

1. Formatting strings
2. String methods + sidenote on string comparisons
3. Regular expressions (`re`) module
4. Reading and writing files

# Text processing in Python

## Pt 1: Formatting strings

In [None]:
name = "Anastassia"
greeting = "My name is " + name + ". Nice to meet you!"
greeting
# There MUST be a better way of doing this

In [None]:
name = "Anastassia"
greeting = f"My name is {name}. Nice to meet you!"
greeting

In [None]:
# this comes in handy if I need to repeat the same thing several times.
# Let's say I want to say no to everyone.
for person in ["mum", "my friend", "my teacher", "little cat"]:
    print(f"No, {person}, I don't want to!")

# String formatting - the easiest (?) way

```python
# to insert the value of a variable into a text,
# use f"" instead of "" to create the string,
# and add curly brackets around the variable name:
f"text {variable} text"
```
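
A small sketch that goes one step further than the syntax above: the curly brackets can hold any Python expression, not just a variable name, and an optional format specifier after `:` controls how a number is displayed. The variable names and values are made up for illustration.

```python
item = "coffee"
price = 3.5
quantity = 3

# any Python expression can go inside the curly brackets
total_line = f"{quantity} x {item} costs {price * quantity}"
# ":.2f" means: display the number with exactly 2 decimal places
pretty_line = f"Total: {price * quantity:.2f} DKK"

print(total_line)
print(pretty_line)
```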

In [None]:
# Try it out yourself!

# define a string variable 

# insert that variable in a string f"...{variable}..." and print the string

# Text processing in Python

## Pt 2: String methods

* `.count()`
* `.index()`
* `.upper()` `.lower()` `.capitalize()` `.title()` 
* `.replace()`
* `.split()`
* `.format()`

... and many more: https://docs.python.org/3/library/stdtypes.html#string-methods 


In [2]:
# let's define a string that contains a proverb:
proverb = 'actions speak LOUDER than words.'

In [None]:
# .count() counts the number of times a character (or sequence of characters) appears
proverb.count("a")
#proverb.count("actions")

In [None]:
# .index() tells you the index of the FIRST occurrence of a character (or a sequence of characters)
proverb.index("w")
# proverb.index("actions")

In [None]:
# .upper() converts all characters into UPPER case
proverb.upper() # this does NOT change the variable, but instead RETURNS the changed string!

In [None]:
# proverb_upper = proverb.upper()
# print(proverb)
# print(proverb_upper)

In [None]:
# .lower() converts all characters into lower case
proverb.lower() # this does NOT change the object

In [None]:
# .capitalize() capitalizes ONLY the first character & changes all others into lower case
proverb.capitalize() # this does NOT change the object

In [None]:
# .title() capitalizes ALL words (what is separated by white space)
proverb.title() # this does NOT change the object

In [None]:
# .replace() replaces one sequence by another
proverb.replace("words", "music") # this does NOT change the object

In [None]:
# another example for replace: remove smth by replacing it "with nothing":
proverb.replace(" ", "") # all whitespaces (" ") will be replaced by "an empty string"

In [None]:
# another example for "replace": i'm missing a white space:
proverb_without_whitespace = "ActionsSpeak louder than words"
proverb_without_whitespace.replace("S", " s") # replace the capital S by whitespace and small s

In [6]:
# .split(character) will split the string at each appearance of the character
proverb.split(" ") # the character itself (here a white space) is not included anymore!

['actions', 'speak', 'LOUDER', 'than', 'words.']

In [None]:
# .format() is a different way of inserting variable values into literal text
# "... speak louder than ..." 
# let's say we want to insert "apes" and "wolves" in the string

# BY ORDER:
"{} speak louder than {}".format("apes", "wolves") # by order (with empty curly braces)

# by POSITION:
"{1} speak louder than {0}".format("wolves", "apes")

# by NAME:
"{x} speak louder than {y}".format(x = "apes", y = "wolves") # by variable name


In [None]:
# compare all 3 options above (variables defined in .format() ) 
# to the option we learned before (variables need to be predefined):
x = "apes"
y = "wolves"
f"{x} speak louder than {y}"

### Side note on string comparisons

In [None]:
# statement with a comparison operator; evaluates to True OR False
7 > 3

In [None]:
# Can Python tell me whether ABBA is greater than the Beatles?
"ABBA" > "The Beatles"

In [None]:
# comparing two strings == comparing their unicode order (case sensitive!)
"z" > "abc"
#"word" == "word"
# "Виктор Цой" > "Θεσσαλονίκη"
# max(["monday", "marshmallow", "42", "pirate", "Astrid"]) 

If you're curious about string comparisons: See the [Introduction to unicode in Python](https://docs.python.org/3/howto/unicode.html#introduction-to-unicode).

If you're not too curious:

### Main takeaway: BEWARE when interpreting string comparison as alphabetical comparison (it is UNICODE comparison)
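
A short sketch of why this matters when sorting. All uppercase letters A-Z come before all lowercase letters a-z in Unicode, so `"Zebra"` sorts before `"apple"`. One common workaround (an assumption about what you want, not the only option) is to compare lowercased versions with `key=str.lower`.

```python
words = ["banana", "Zebra", "apple"]

# ord() returns the Unicode code point of a single character
code_Z = ord("Z")
code_a = ord("a")

# default sorting compares code points: uppercase comes first
unicode_sorted = sorted(words)
# sorting by the lowercased version of each word gives the alphabetical order
alphabetical = sorted(words, key=str.lower)

print(code_Z, code_a)
print(unicode_sorted)
print(alphabetical)
```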

# Text processing in Python

## Pt 3: Regular expressions (`re`) module

https://docs.python.org/3/howto/regex.html

Regular expressions are...

* "**tiny, highly specialized programming language** where you can specify the rules for the set of possible **strings that you want to match**"
* a well-known concept in computer science
* used by many programming languages (Perl, Java, Python, ...)
* all about **finding matching string patterns**
* useless to learn by heart (imho) - just know that they **exist**, in case you might need them

```python
# to use the re module, import it first:
import re
```

```python
# string: the string we're searching within
# pattern: the pattern we're trying to find
# it can be LITERAL (explicitly stating the pattern), or 
# expressed with REGEX - see list below
re.search(pattern, string) # will show you whether pattern is in the string
re.findall(pattern, string) # will return all matches of the pattern in the string (as a list)

```

**How to make your own regex patterns (some EXAMPLES)**
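* `\d` Matches any digit (0-9)
* `\D` Matches any non-digit
* `\w` Matches any "word" character (letter, digit, or underscore)
* `\W` Matches any non-word character
* `\s` Matches any whitespace character
* `\S` Matches any non-whitespace character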
* `+` matches the preceding character 1 or more times
* `[]` specifies a class you want to match
* `-` within `[]` indicates a range
* `[a-z]` matches any single lowercase letter from a to z
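
A quick sketch of some of these patterns in action with Python's `re` module, where `\w` matches word characters and `\s` matches whitespace. The patterns are written as raw strings (`r"..."`) so that Python passes the backslash on to `re` unchanged. The variable names are made up.

```python
import re

text = "Wakanda Forever, part 2"

# \w+ : one or more consecutive word characters (letters, digits, underscore)
words = re.findall(r"\w+", text)
# \s : a single whitespace character
spaces = re.findall(r"\s", text)
# [a-z]+ : one or more consecutive lowercase letters
lowercase_runs = re.findall(r"[a-z]+", text)

print(words)
print(spaces)
print(lowercase_runs)
```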

In [None]:
# first, we need to import the re module
import re
# now we can use it in this notebook

In [None]:
# LITERAL
# re.search(pattern, string) tells us whether a pattern is contained in a string 
mypattern = "a" # try out other values
mystring = "Wakanda Forever, part 2"
re.search(mypattern, mystring)
# if yes: we get back a "Match object" (don't worry about it); if no: we get back None (nothing)
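
If you do want to look inside the Match object, here is a minimal sketch. `.group()` returns the matched text and `.start()` returns the index where the match begins. The variable names are made up.

```python
import re

mystring = "Wakanda Forever, part 2"

# the pattern is found: we get a Match object back
match = re.search("a", mystring)
matched_text = match.group()   # the text that was matched
match_position = match.start() # index of the FIRST match in the string

# the pattern is not found: we get None back
no_match = re.search("z", mystring)

print(matched_text, match_position)
print(no_match)
```

Because a failed search returns `None`, a check like `if match is not None:` is a common way to branch on the result.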

In [None]:
# LITERAL
# re.findall(pattern, string) RETURNS all pattern matches in a string 
mypattern = "a" # try out other values
mystring = "Wakanda Forever, part 2"
re.findall(mypattern, mystring)

In [None]:
# BY PATTERN (REGEX!!)
# find only the numbers in the string: "\d" pattern
mypattern = r"\d" # this means: a single character which is a digit (the r makes it a "raw" string, so the backslash is kept as is)
mystring = "Wakanda Forever, part 2"
re.findall(mypattern, mystring) # we found the number 2 in the string

In [None]:
# BY PATTERN (REGEX)
# find only the numbers in the string: "\d" pattern
mypattern = r"\d" # this means: a single character which is a digit
mystring = "12 plus 5 is 17"
re.findall(mypattern, mystring) # we found each digit separately ('1', '2', '5', '1', '7'), but "12" and "17" should actually stay together

In [None]:
# BY PATTERN
# find only the numbers in the string: "\d" pattern
mypattern = r"\d+" # this means: ONE OR MORE CONSECUTIVE characters which are digits
mystring = "12 plus 5 is 17"
re.findall(mypattern, mystring) # now consecutive digits stay together: '12', '5', '17'

In [None]:
# find everything which is NOT a number, 
mypattern = r"\D" # this means: every single character that is NOT a digit
mystring = "Wakanda Forever, part 2"
re.findall(mypattern, mystring)

In [None]:
# find everything which is NOT a number, 
mypattern = r"\D+" # this means: ONE OR MORE CONSECUTIVE characters that are NOT digits
mystring = "Wakanda Forever, part 2"
re.findall(mypattern, mystring)

In [None]:
# specify a class within a certain range
mypattern = "[A-Z]" # all single uppercase letters
mystring = "Wakanda Forever, part 2"
re.findall(mypattern, mystring)

Regex can get quite complicated; here's an example (see this [blogpost](https://www.sitepoint.com/demystifying-regex-with-practical-examples/#matching-password))


Imagine the rules for a password pattern are:
* contain at least 6 and at most 12 characters;
* have at least 1 uppercase and at least 1 lowercase letter;
* contain at least 1 digit;
* may also contain special characters

Then a regular expression that matches only passwords following these rules is:

```python
mypattern = r"^(?=.*[a-z])(?=.*[A-Z])(?=.*\d).{6,12}$"
```


you don't need to understand this >> just know it's possible


In [None]:
mypattern = r"^(?=.*[a-z])(?=.*[A-Z])(?=.*\d).{6,12}$"
# mypassword = "kittycat" # NO MATCH
# mypassword = "KittyCat15!" # MATCH
# re.search(mypattern, mypassword)
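
To make the long pattern less mysterious, here is a sketch that checks the same rules one at a time with simple patterns. The function `is_valid_password` is a made-up helper, and it is only meant to behave like the pattern above for ordinary one-line passwords.

```python
import re

def is_valid_password(password):
    """Hypothetical helper: check each password rule separately."""
    has_lower = re.search(r"[a-z]", password) is not None  # at least 1 lowercase letter
    has_upper = re.search(r"[A-Z]", password) is not None  # at least 1 uppercase letter
    has_digit = re.search(r"\d", password) is not None     # at least 1 digit
    right_length = 6 <= len(password) <= 12                # between 6 and 12 characters
    return has_lower and has_upper and has_digit and right_length

# the single-pattern version, for comparison
mypattern = r"^(?=.*[a-z])(?=.*[A-Z])(?=.*\d).{6,12}$"

for mypassword in ["kittycat", "KittyCat15!"]:
    regex_says = re.search(mypattern, mypassword) is not None
    print(mypassword, is_valid_password(mypassword), regex_says)
```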

# Regular expressions (patterns) you will need for Exercise 06

* `"\d"` - find a single digit
* `"\d+"` - find one or more consecutive digits
* `"[A-Z]"` - find a single capital letter

# Text processing in Python

## Pt 4: Reading and writing files (Input/Output = IO)

(will also use packages for this later, mostly pandas, but for now - let's do it DIY style)

In [None]:
# open a file
f = open("data/poem.txt", "r") # r is the mode (reading)
f

In [None]:
# open a file and read it
f = open("data/poem.txt", "r")
f.read() # the opened file has the .read() method

In [None]:
# open file, read it, AND SAVE THE TEXT TO A VARIABLE
f = open("data/poem.txt", "r")
my_text = f.read()
print(my_text)

In [None]:
# open file, read it, save text to a variable, and then CLOSE IT!
f = open("data/poem.txt", "r")
my_text = f.read() # the opened file has the .read() method
f.close()

In [None]:
# now we have the variable my_text...
print(my_text[0:21])
# and trying to access f (the opened file we defined before) will not work:
# the file is closed, so Python raises a ValueError
f.read()

In [None]:
# more common syntax:
# use the "with" keyword, then file will be automatically closed once you're "done"
# (after the indentation has ended)
with open("data/poem.txt", "r") as opened_file:
    my_text = opened_file.read()
my_text

In [None]:
# my text is a long string; linebreaks in original file are marked with "\n" in the string
# type(my_text)
# len(my_text)
# print(my_text)

In [None]:
# the .readlines() method
with open("data/poem.txt", "r") as opened_file:
    my_text = opened_file.readlines()
my_text
# my_text is a list of strings - each element in the list is one line from the original file
# type(my_text)
# len(my_text)
# print(my_text)
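
Each line returned by `.readlines()` still ends with the linebreak `"\n"`. A sketch of one way to remove it, using the string method `.rstrip()` (not covered above). The file name and content are made up, and the file is written to a temporary folder so that nothing in your working directory is touched.

```python
import os
import tempfile

# create a small example file in a temporary folder
folder = tempfile.mkdtemp()
path = os.path.join(folder, "example_poem.txt")
with open(path, "w") as opened_file:
    opened_file.write("Roses are red\nViolets are blue\n")

# read it back in as a list of lines
with open(path, "r") as opened_file:
    raw_lines = opened_file.readlines()

# remove the trailing linebreak from each line
clean_lines = []
for line in raw_lines:
    clean_lines.append(line.rstrip("\n"))

print(raw_lines)
print(clean_lines)
```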

In [None]:
# Write files: same logic. open the file in "w" (Writing) mode and use the .write() method:
file_content = "just sitting here and writing files"

with open('myfile1.txt', 'w') as opened_file:
    opened_file.write(file_content)
# this is used if file_content is one single string

In [None]:
# OR use the .writelines() method
# this is used if file_content is a list of strings
file_content_list = ["just", "sitting", "here" , "and", "writing", "files"]

with open('myfile2.txt', 'w') as opened_file:
    opened_file.writelines(file_content_list)
# be aware that the linebreak "\n" is NOT automatically added!
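
A sketch of what that means in practice, and one way to get each string on its own line: glue the list together with `"\n".join(...)` (a string method not covered above) and write the result with `.write()`. The file names are made up, and the files go into a temporary folder.

```python
import os
import tempfile

file_content_list = ["just", "sitting", "here"]
folder = tempfile.mkdtemp()

# .writelines() writes the strings one after another, with nothing in between
path_a = os.path.join(folder, "no_linebreaks.txt")
with open(path_a, "w") as opened_file:
    opened_file.writelines(file_content_list)

# joining with "\n" first puts a linebreak between the strings
path_b = os.path.join(folder, "with_linebreaks.txt")
with open(path_b, "w") as opened_file:
    opened_file.write("\n".join(file_content_list))

# read both files back in to compare
with open(path_a) as opened_file:
    text_a = opened_file.read()
with open(path_b) as opened_file:
    text_b = opened_file.read()

print(repr(text_a))
print(repr(text_b))
```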

# Syntax for reading in & writing files

```python
### READING
with open("filepath", "r") as opened_file:
    my_text = opened_file.read() 
# .read() will give you 1 single string

with open("filepath", "r") as opened_file:
    my_text = opened_file.readlines() 
# .readlines() will give you 1 string for each line

### WRITING
with open("filepath", "w") as opened_file:
    opened_file.write(my_text) # if text is 1 single string

with open("filepath", "w") as opened_file:
    opened_file.writelines(my_text) # if text is a list of strings (linebreaks are NOT added automatically)
```

Note: the default mode is `"r"` (reading), so you don't *need* to specify it when reading in files.

# Try it out yourself!

**Decode the secret message**

* Read in the file `secret_message.txt`
* Replace all numbers 4 with the letter "a"
* Replace all numbers 9 with the letter "n"
* Replace all numbers 1 with the letter "e"
* Replace all hashtags (#) with the letter "m" 
* Replace all @ with the letter "s"
* Replace all !! with the letter "o"
* Replace all "zyz" with whitespaces (" ")
* Write the decoded message to a text file called `decoded_message.txt` 

In [None]:
# Read in the file
with open("data/secret_message.txt") as opened_file:
    mess = opened_file.read()
print("Encoded message:", mess)


In [None]:
# Replace all characters, following the instructions
mess = mess.replace("4", "a") # first replacement as an example; add the remaining ones below
# ...
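
If writing one `.replace()` per line feels repetitive, here is a sketch of the same idea with a loop. The encoded string and the replacement pairs below are made up (they are not the ones from the exercise).

```python
# hypothetical toy message and replacement rules: (old, new)
encoded = "h3ll0_w0rld"
replacements = [("3", "e"), ("0", "o"), ("_", " ")]

decoded = encoded
for old, new in replacements:
    # .replace() returns a NEW string, so we save it back into the variable
    decoded = decoded.replace(old, new)

print("Encoded message:", encoded)
print("Decoded message:", decoded)
```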


In [None]:
# When you're done with the replacements, print the decoded message:
print("Decoded message:", mess)

In [None]:
# Save it to a file:
with open("data/decoded_message.txt", "w") as opened_file:
    opened_file.write(mess)

In [None]:
# open the file "decoded_message" (not in Python - just by double clicking on it)
# to check if it looks like you expected