# DIS08 - Basic Data Processing with Python

Let's recap the things you should have learned in your previous Python course so far. As in DIS06, we are working with the textbook [Automate the Boring Stuff with Python](https://automatetheboringstuff.com).

## Python Basics (chapter 1)

You can compute expressions with a calculator or type string concatenations with a word processor. You can even do string replication easily by copying and pasting text. But expressions, and their component values—operators, variables, and function calls—are the basic building blocks that make programs. Once you know how to handle these elements, you will be able to instruct Python to operate on large amounts of data for you.

It is good to remember the different types of operators (+, -, \*, /, //, %, and ** for math operations, and + and * for string operations) and the three data types (integers, floating-point numbers, and strings) introduced in this chapter.
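As a quick reminder of how these operators behave, here is a minimal sketch. The variable names and values are made up for illustration and do not come from the book.

```python
# Math operators on numbers
total = 2 + 3 * 4        # multiplication is evaluated before addition
quotient = 7 / 2         # true division yields a float
floored = 7 // 2         # integer (floor) division
remainder = 7 % 2        # modulus (remainder of the division)
power = 2 ** 10          # exponent

# The same + and * symbols act differently on strings
greeting = 'Hello' + ' ' + 'World'   # string concatenation
echo = 'ab' * 3                      # string replication

print(total, quotient, floored, remainder, power)
print(greeting, echo)
```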

A few different functions were introduced as well. The print() and input() functions handle simple text output (to the screen) and input (from the keyboard). The len() function takes a string and evaluates to an int of the number of characters in the string. The str(), int(), and float() functions will evaluate to the string, integer, or floating-point number form of the value they are passed.
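The conversion functions could be sketched like this. The value of `typed` is a hypothetical stand-in for whatever input() would return, so the snippet runs without keyboard input.

```python
typed = '42'                      # hypothetical stand-in for a value returned by input()

number = int(typed)               # string -> integer
quarter = float(typed) / 4        # string -> float, then a division
label = 'Characters typed: ' + str(len(typed))   # integer -> string for concatenation

print(number, quarter)
print(label)
```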

In a nutshell, you learnt about

 * math operators,
 * data types (integers, floats, strings, ...),
 * string concatenation and replication,
 * variables,
 * running code within the Spyder IDE / Jupyter Notebooks,
 * basic functions (print(), input(), len(), str(), int(), and float()).

### Small exercise on data types, math, and string concatenations

In [None]:
zwei = 2.04

print(float(zwei)-float(1))

In [None]:
bruch = ''' asdfasdf
asdf
asdfa sdfasdf
asdfaaa

asdf'''
print('This is a multiline string: ' + bruch)

In [None]:
# Where is the bug in the following code snippet? How can you correct it?
currentAge = 21
print('You will be ' + currentAge + 1 + ' in one year.')

### Where to look for help?
* https://www.w3schools.com/python/default.asp
* https://automatetheboringstuff.com

## Flow Control (chapter 2)
By using expressions that evaluate to True or False (also called conditions), you can write programs that make decisions on what code to execute and what code to skip. You can also execute code over and over again in a loop while a certain condition evaluates to True. The break and continue statements are useful if you need to exit a loop or jump back to the start. These flow control statements will let you write much more intelligent programs. 
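To illustrate the difference between break and continue, here is a small sketch with a made-up loop that collects the odd numbers below 10.

```python
found = []
number = 0

while True:
    number += 1
    if number % 2 == 0:
        continue          # skip even numbers and jump back to the start of the loop
    if number > 9:
        break             # leave the loop entirely
    found.append(number)

print(found)
```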

In a nutshell, you learnt about

 * boolean values, comparison operators, boolean operators,
 * elements of flow control (conditions, blocks of code),
 * flow control statements (if, else, elif, while, for) and range(),
 * importing modules
 
### Small exercises and examples

In [None]:
# Mixing comparison and boolean operators
if 2 + 2 == 4 and not 2 + 2 == 5 and 2 * 2 == 2 + 2:
    print('True')

In [None]:
# basic flow control
name = 'Mary'
password = 'swordfish'

if name == 'Mary':
    print('Hello Mary')
    if password == 'swordfish':
        print('Access granted.')
    else:
        print('Wrong password.')

In [None]:
# importing modules and range()
import random
for i in range(5):
    print(str(i+1) + " : " + str(random.randint(1, 10)))

In [None]:
def umrechnung(wertFahrenheit):
    # convert a temperature from Fahrenheit to Celsius
    celsius = (wertFahrenheit - 32) * 5/9
    return celsius

for i in range(20,70,10):
    print(str(i) + ' degrees Fahrenheit in Celsius: ' + str(umrechnung(i)))

## Functions (chapter 3)

Functions are the primary way to compartmentalize your code into logical groups. Since the variables in functions exist in their own local scopes, the code in one function cannot directly affect the values of variables in other functions. This limits what code could be changing the values of your variables, which can be helpful when it comes to debugging your code.

Functions are a great tool to help you organize your code. You can think of them as black boxes: They have inputs in the form of parameters and outputs in the form of return values, and the code in them doesn’t affect variables in other functions.
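The black-box idea can be sketched as follows. The function `double_price()` is a hypothetical example, not taken from the book. Note that the local variable `total` inside the function and the global variable `total` are two separate variables.

```python
def double_price(price):
    total = price * 2        # local variable: exists only inside double_price()
    return total

total = 100                  # global variable that happens to share the name
result = double_price(total)

print(total)                 # the global total is not changed by the function call
print(result)
```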

In a nutshell, you learnt about

 * def statements with parameters
 * return values and return statements
 * keyword arguments and print()
 * local and global scope
 
### Small exercises and examples

In [None]:
# This is a guess the number game.
import random
secretNumber = random.randint(1, 20)
print('I am thinking of a number between 1 and 20.')

# Ask the player to guess 6 times.
for guessesTaken in range(1, 7):
    print('Take a guess.')
    guess = int(input())

    if guess < secretNumber:
        print('Your guess is too low.')
    elif guess > secretNumber:
        print('Your guess is too high.')
    else:
        break    # This condition is the correct guess!

if guess == secretNumber:
    print('Good job! You guessed my number in ' + str(guessesTaken) + ' guesses!')
else:
    print('Nope. The number I was thinking of was ' + str(secretNumber))


## Lists (chapter 4)

Lists are useful data types since they allow you to write code that works on a modifiable number of values in a single variable. Later in this book, you will see programs using lists to do things that would be difficult or impossible to do without them.

__Lists are mutable__, meaning that their contents can change. __Tuples and strings__, although list-like in some respects, __are immutable__ and cannot be changed. A variable that contains a tuple or string value can be overwritten with a new tuple or string value, but this is not the same thing as modifying the existing value in place—like, say, the append() or remove() methods do on lists.
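A minimal sketch of the difference, using made-up values: the list is changed in place, the string method returns a new string, and assigning to a tuple item raises a TypeError.

```python
animals = ['cat', 'bat']
animals.append('dog')          # modifies the existing list in place
animals.remove('bat')

name = 'Zophie'
shouted = name.upper()         # returns a new string; name itself is unchanged

point = (1, 2)
try:
    point[0] = 99              # tuples do not support item assignment
except TypeError as error:
    tuple_error = str(error)

print(animals, name, shouted)
print(tuple_error)
```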

Key topics in this chapter were

* the list data type, len(), slices
* changing values, list concatenation and list replication, removing values
* using for loops with lists
* the in and not in operators
* methods and the index(), append(), insert(), remove(), sort() list methods
* list-like types: strings and tuples

### Small exercises and examples

In [None]:
import random

messages = ['It is certain',
    'It is decidedly so',
    'Yes definitely',
    'Reply hazy try again',
    'Ask again later',
    'Concentrate and ask again',
    'My reply is no',
    'Outlook not so good',
    'Very doubtful']

pos = random.randint(0, len(messages) - 1)
print(pos)
print(messages[pos])

In [None]:
for m in messages:
    print(m)

## Dictionaries and Structuring Data (chapter 5)

You learned all about dictionaries in this chapter. Lists and dictionaries are values that can contain multiple values, including other lists and dictionaries. Dictionaries are useful because you can map one item (the key) to another (the value), as opposed to lists, which simply contain a series of values in order. Values inside a dictionary are accessed using square brackets just as with lists. Instead of an integer index, dictionaries can have keys of a variety of data types: integers, floats, strings, or tuples. By organizing a program’s values into data structures, you can create representations of real-world objects.
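Nested data structures and tuple keys might look like the following sketch. The guest data, the board, and the helper `total_brought()` are hypothetical and only serve as an illustration.

```python
# Hypothetical data: each guest maps to a dictionary of items and amounts
guests = {
    'Alice': {'apples': 5, 'pretzels': 12},
    'Bob': {'ham sandwiches': 3, 'apples': 2},
}

def total_brought(guests, item):
    # sum up how many of the given item all guests bring together
    count = 0
    for items in guests.values():
        count += items.get(item, 0)   # get() falls back to 0 if the key is missing
    return count

# Tuples can serve as dictionary keys, e.g. for board coordinates
board = {(0, 0): 'X', (1, 1): 'O'}

print(total_brought(guests, 'apples'))
print(board[(1, 1)])
```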

In this chapter you learnt about

* the dictionary data type
* the keys(), values(), items(), and get() methods
* using data structures to model real-world things

### Small exercises and examples

In [None]:
picnicItems = {'apples': 5, 'cups': 2}

for key, value in picnicItems.items():
    print('key: ' + str(key) + ' - value: ' + str(value))

# this works...
print('I am bringing ' + str(picnicItems.get('cups', 0)) + ' cups.')
print('I am bringing ' + str(picnicItems['cups']) + ' cups.')
# one of the next two lines fails... which one, and why?
print('I am bringing ' + str(picnicItems.get('eggs', 0)) + ' eggs.')
print('I am bringing ' + str(picnicItems['eggs']) + ' eggs.')

## Manipulating Strings (chapter 6)

Text is a common form of data, and Python comes with many helpful string methods to process the text stored in string values. You will make use of indexing, slicing, and string methods in almost every Python program you write.

The programs you are writing now don’t seem too sophisticated—they don’t have graphical user interfaces with images and colorful text. So far, you’re displaying text with print() and letting the user enter text with input(). However, the user can quickly enter large amounts of text through the clipboard. This ability provides a useful avenue for writing programs that manipulate massive amounts of text. These text-based programs might not have flashy windows or graphics, but they can get a lot of useful work done quickly.

Topics covered in this chapter:

* double quotes, escape characters, multiline strings
* the upper(), lower(), isupper(), and islower() string methods
* the startswith() and endswith(), join() and split() string methods
* removing whitespace with strip(), rstrip(), and lstrip()
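The string methods listed above can be tried out with a short sketch. The sample text is made up.

```python
raw = '   Hello World, hello Python   '

clean = raw.strip()                 # remove whitespace on both sides
words = clean.split(' ')            # split into a list at each space
joined = '-'.join(words)            # glue the list back together with dashes

print(clean.upper())
print(clean.startswith('Hello'), clean.endswith('Python'))
print(clean.islower(), 'hello'.islower())
print(words)
print(joined)
```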

### Small exercises and examples

In [None]:
#! python3
# bulletPointAdder.py - Adds Wikipedia bullet points to the start
# of each line of text.
import pprint

text = 'Lists of animals\nLists of aquarium life\nLists of biologists by author abbreviation\nLists of cultivars'

# Separate lines and add stars.
lines = text.split('\n')

for i in range(len(lines)):    # loop through all indexes for "lines" list
    lines[i] = '* ' + lines[i] # add star to each string in "lines" list

text = '\n'.join(lines)

pprint.pprint(text)

## Regular Expressions in Python (chapter 7)

You already know how to write regular expressions and use them with grep. Python supports regular expressions as well, via the re module. Just have a look at the next example.

In [None]:
import re

# define a regex pattern for licence plate numbers
# (\D matches any non-digit character, \w any letter, digit, or underscore)
licenceRegex = re.compile(r'\D{1,3} \D{1,2} \w{1,4}')

testCases = 'SU BW 1234, BN-XX-123, BER X 1, K FC d1'

# use the search() method, which returns the *first* match
# mo = match object
mo = licenceRegex.search(testCases)
mo.group()

## What did we do here?
While there are several steps to using regular expressions in Python, each step is fairly simple.

1. Import the regex module with import re.
2. Create a Regex object with the re.compile() function. (Remember to use a raw string.)
3. Pass the string you want to search into the Regex object’s search() method. This returns a Match object.
4. Call the Match object’s group() method to return a string of the actual matched text.

But there is more... 

In [None]:
# find all licence plate numbers 
licenceRegex.findall(testCases)
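Keep in mind that \D matches any non-digit character, including spaces and commas, so the pattern above can pick up more than just letters (for example a leading comma and space). The following sketch shows a stricter variant. It assumes a plate format of one to three capital letters, one or two capital letters, and one to four digits. This format and the name `strictRegex` are assumptions for illustration, not part of the original example.

```python
import re

testCases = 'SU BW 1234, BN-XX-123, BER X 1, K FC d1'

# Assumed plate format: 1-3 capital letters, 1-2 capital letters, 1-4 digits;
# \b anchors the match at word boundaries
strictRegex = re.compile(r'\b[A-Z]{1,3} [A-Z]{1,2} \d{1,4}\b')

strictMatches = strictRegex.findall(testCases)
print(strictMatches)
```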

## Substitute Strings with the sub() Method

Regular expressions can not only find text patterns but can also substitute new text in place of those patterns. The sub() method for Regex objects is passed two arguments. The first argument is a string to replace any matches. The second is the string to which the regular expression is applied. The sub() method returns a string with the substitutions applied.

In [None]:
# Substitute a pattern
secretNameRegex = re.compile(r'Agent \w+')
text = 'Agent Alice gave the secret documents to Agent Bob.'
censoredText = secretNameRegex.sub('CENSORED', text)
print(censoredText)

Sometimes you may need to use the matched text itself as part of the substitution. In the first argument to sub(), you can type \1, \2, \3, and so on, to mean “Enter the text of group 1, 2, 3, and so on, in the substitution.”

For example, say you want to censor the names of the secret agents by showing just the first letters of their names. To do this, you could use the regex Agent (\w)\w* and pass r'\1****' as the first argument to sub(). The \1 in that string will be replaced by whatever text was matched by group 1, that is, the (\w) group of the regular expression.


In [None]:
# Reusing parts of the matched text
agentNamesRegex = re.compile(r'Agent (\w)\w*')
pseudoCensoredText = agentNamesRegex.sub(r'\1****', text)
print(pseudoCensoredText)

## Reading and Writing Files (chapter 8)

Variables are a fine way to store data while your program is running, but if you want your data to persist even after your program has finished, you need to save it to a file. You can think of a file's contents as a single string value, potentially gigabytes in size. In this section, you will learn how to use Python to create, read, and save files on the hard drive.

__Watch out__: Windows and Unix-based systems use different path separators: a backslash (\\) on Windows and a forward slash (/) on OS X and Linux. There are some tricks to work around these differences. For more details, which are beyond the scope of this tutorial, check [chapter 8](https://automatetheboringstuff.com/chapter8/), subsection "Backslash on Windows and Forward Slash on OS X and Linux".
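One common way to avoid hard-coding the separator is to let Python build the path. The sketch below uses os.path.join() and, as an alternative, pathlib from the standard library. The folder names are hypothetical.

```python
import os
from pathlib import Path

# Hypothetical folder and file names, used only for illustration
joined = os.path.join('data', 'lotr', 'lotr_clean.csv')   # uses the separator of the current OS
path = Path('data') / 'lotr' / 'lotr_clean.csv'           # pathlib alternative with the / operator

print(joined)
print(path)
print(path.name)
```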

In [None]:
# where are you right now?
import os
os.getcwd()

In [None]:
# open a CSV file, read the content, filter some lines, and write the results to a file

# the Lord of the Rings file from the last exercise (columns are separated by semicolons)
with open('lotr_clean.csv') as tsvFile:
    # read each line and put it in a list
    lines = tsvFile.readlines()

# open the result file in append mode - new content does not overwrite old content
# (the with statement closes both files automatically)
with open('frodo.csv', 'a') as resultFile:
    # iterate over all lines
    for line in lines:
        cols = line.rstrip('\n').split(';') # drop the trailing newline, then split the line
        # skip lines with fewer than three columns
        if len(cols) > 2 and cols[1] == 'FRODO':
            print(cols[1] + ';' + cols[2])
            # write the same line into the resultFile
            resultFile.write(cols[1] + ';' + cols[2] + '\n') # remember the newline at the end!

# Assignment 2, exercise 1

We continue with the Disney Plus data set and try to reproduce some of the exercises we did with shell and grep. Work with [Pandas](https://pandas.pydata.org/) when not stated otherwise!

0. Read chapter 8 to learn more about reading and writing files. I skipped a lot of details.
1. Write a Python program: Read the Disney Plus data as a Pandas DataFrame.
2. Make your program extract a full list of all genre names that are listed in column "listed_in". Make sure to extract only the genres and clean the data from non-letter characters if necessary. Save as a list.
3. Extend your program to count the distinct (unique) genre names.
4. Next, find how many entries in the Disney Plus catalog belong to each of the genres. Save the genres and corresponding counts in a dictionary.
5. Write the results in a new CSV file that contains the genres and the counts.
6. Count the occurrences of the terms "Disney" and "Marvel" in the column "description", save the counts in a dictionary, and export them as a CSV file (as in the previous steps). Think about different name variations (uppercase, lowercase, etc.).

Commit your Python program and the resulting CSV files. 