# DIS08 / OR92 Data Modeling: Python - Introduction (Setup, file management, etc.)

In this lecture, we’ll cover and recap the basics of Python programming, including:

- Python data structures: lists, tuples, dictionaries, and sets.
- Basic file management: reading from and writing to files.
- String handling: manipulation of text data and regular expressions.

In the lab assignments, you will set up your Python development environment, rerun this notebook, and get started with the `pandas` library.

If you would like to study Python more in-depth, we recommend the book [Automate the Boring Stuff With Python](https://automatetheboringstuff.com/). This book also inspired some of the contents in this notebook.

## 1. Basic Syntax

In [1]:
# This is a comment in Python

In [2]:
# Print "Hello, World!" to the console
print("Hello, World!")  

Hello, World!


In [3]:
# Variable assignment
x = 5  
x

5

## 2. Data Types

In [4]:
# Integer type
integer_num = 10
type(integer_num)

int

In [5]:
# Float type
float_num = 3.14
type(float_num)

float

In [6]:
# String type
str_val = "Python"
type(str_val)

str

## 3. Basic Operators

In [7]:
a = 10
b = 3

In [8]:
# Addition
print(a + b)

13


In [9]:
# Subtraction
print(a - b)

7


In [10]:
# Multiplication
print(a * b)

30


In [11]:
# Division
print(a / b)

3.3333333333333335


In [12]:
# Modulus (remainder)
print(a % b)

1


In [13]:
# Exponentiation
print(a ** b)

1000


## 4. Control Structures

In [14]:
# If-Else statement
x = 5
if x > 0:
    print("Positive")
elif x < 0:
    print("Negative")
else:
    print("Zero")


Positive


In [15]:
# For loop
for i in range(5):
    print(i)

0
1
2
3
4


In [16]:
# While loop
i = 0
while i < 5:
    print(i)
    i += 1

0
1
2
3
4


## 5. Data Structures

In [17]:
# List
my_list = [1, 2, 3, 4, 5]
print(my_list[0])  # Accessing elements

# Tuple (immutable list)
my_tuple = (6, 7, 8)
print(my_tuple[0])

# Dictionary (key-value pairs)
my_dict = {"name": "John", "age": 30, "city": "New York"}
print(my_dict["name"])

# Set
my_set = {"apple", "banana", "cherry"}
print(my_set)



1
6
John
{'apple', 'banana', 'cherry'}


A list is a collection which is ordered and changeable. It allows duplicate members.

In [18]:
# Creating a list
fruits = ["apple", "banana", "cherry"]
print(fruits)

# Adding an item
fruits.append("orange")
print(fruits)

# Accessing items
print(fruits[1])

# Removing an item
fruits.remove("banana")
print(fruits)

['apple', 'banana', 'cherry']
['apple', 'banana', 'cherry', 'orange']
banana
['apple', 'cherry', 'orange']


A tuple is a collection which is ordered but unchangeable (immutable). It allows duplicate members.

In [19]:
# Creating a tuple
my_tuple = ("apple", "banana", "cherry")
print(my_tuple)

# Accessing items
print(my_tuple[1])

# Tuples are immutable, so the following code will cause an error:
# my_tuple[1] = "orange"

('apple', 'banana', 'cherry')
banana
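
As a minimal sketch of what "immutable" means in practice, the cell below attempts the assignment from the commented-out line and catches the resulting error. The names `fruits_tuple` and `updated_tuple` are illustrative and not part of the original notebook.

```python
# Illustrative names; not part of the original notebook
fruits_tuple = ("apple", "banana", "cherry")

try:
    fruits_tuple[1] = "orange"  # item assignment is not allowed on tuples
except TypeError as error:
    print(f"TypeError: {error}")

# To "change" a tuple, build a new one from slices of the old one
updated_tuple = fruits_tuple[:1] + ("orange",) + fruits_tuple[2:]
print(updated_tuple)
```

The original tuple is left untouched; only the new tuple contains "orange".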


A dictionary is a collection of key-value pairs which is changeable and indexed by key. Since Python 3.7, it preserves the order in which keys were inserted. It does not allow duplicate keys.

In [20]:
# Creating a dictionary
person = {
    "name": "Alice",
    "age": 25,
    "city": "New York"
}
print(person)

# Accessing values
print(person["name"])

# Modifying a value
person["age"] = 26
print(person)

{'name': 'Alice', 'age': 25, 'city': 'New York'}
Alice
{'name': 'Alice', 'age': 26, 'city': 'New York'}
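
The following sketch illustrates three dictionary behaviours that are easy to miss: a repeated key keeps only the last value, keys stay in insertion order, and `get()` avoids a `KeyError` for a missing key. The name `person_demo` and its values are illustrative assumptions.

```python
# Illustrative data; not part of the original notebook
person_demo = {"name": "Alice", "age": 25, "age": 26}  # repeated key: the last value wins
print(person_demo)

# Keys keep their insertion order (guaranteed since Python 3.7)
person_demo["city"] = "New York"
print(list(person_demo.keys()))

# get() returns a default value instead of raising KeyError
print(person_demo.get("height", "unknown"))
```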


## 6. Functions

In [21]:
def greet(name):
    """This function greets the person passed in as parameter"""
    print(f"Hello, {name}!")

greet("Alice")  # Output: Hello, Alice!

Hello, Alice!


A set is a collection which is unordered, unindexed, and does not allow duplicate members.

In [22]:
# Creating a set
my_set = {"apple", "banana", "cherry"}
print(my_set)

# Adding an item
my_set.add("orange")
print(my_set)

# Removing an item
my_set.remove("banana")
print(my_set)

{'apple', 'banana', 'cherry'}
{'orange', 'apple', 'banana', 'cherry'}
{'orange', 'apple', 'cherry'}
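
Because a set keeps each member only once, it is a convenient way to remove duplicates from a list. The sketch below uses an illustrative list named `basket`; since sets have no guaranteed order, the results are sorted before printing.

```python
# Illustrative data; not part of the original notebook
basket = ["apple", "banana", "apple", "cherry", "banana"]

unique_fruits = set(basket)  # duplicates collapse into a single member
print(len(basket), len(unique_fruits))

# Sort before printing to get a stable, reproducible order
print(sorted(unique_fruits))

# Membership test and basic set operations
print("apple" in unique_fruits)
print(sorted(unique_fruits & {"cherry", "orange"}))  # intersection
print(sorted(unique_fruits | {"orange"}))            # union
```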


## Basic File Management in Python

You can also use many command line tools in a Jupyter notebook by prefixing them with `!`. We use `echo` with `>>` to create a first text file (if the file already exists, the text is appended to it).

In [23]:
!echo "This is the beginning of a new file!\n" >> example.txt

To read a file, use the open() function in reading mode ('r').

In [24]:
# Reading from a file
file = open("example.txt", "r")

# Read the entire content of the file
content = file.read()
print(content)

# Always remember to close the file after you're done
file.close()

"This is the beginning of a new file!\n" 



To write to a file, use the open() function in write mode ('w').

In [25]:
# Writing to a file
file = open("example.txt", "w")
file.write("Hello, World!\nThis is a new line.")
file.close()

You can also append data to an existing file using append mode ('a').

In [26]:
# Appending to a file
file = open("example.txt", "a")
file.write("\nThis is an appended line.")
file.close()

It’s a good practice to use the `with` statement when working with files to ensure the file is properly closed after the operations.

In [27]:
# Using 'with' statement to open and read a file
with open("example.txt", "r") as file:
    content = file.read()
    print(content)

Hello, World!
This is a new line.
This is an appended line.
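
To see that `with` really closes the file, the sketch below checks the `closed` attribute after the block ends. It writes to a temporary directory so that `example.txt` is not modified; the file name `demo.txt` is a hypothetical choice.

```python
import os
import tempfile

# Work in a temporary directory so no existing file is touched
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.txt")  # hypothetical file name

    with open(path, "w", encoding="utf-8") as file:
        file.write("first line\nsecond line\n")

    # The file object still exists after the block, but it is closed
    print(file.closed)

    # Iterating over a file yields one line at a time
    with open(path, "r", encoding="utf-8") as file:
        lines = [line.rstrip("\n") for line in file]

print(lines)
```

Reading line by line, as in the second block, avoids loading a large file into memory all at once.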


## String Handling in Python

Python provides a variety of methods for manipulating and handling strings.

In [28]:
# Concatenating strings
greeting = "Hello"
name = "Alice"
message = greeting + ", " + name + "!"
print(message)

Hello, Alice!


**String Formatting:** You can format strings using the format() method or f-strings (in Python 3.6+).

In [29]:
# Using format method
message = "Hello, {}!".format(name)
print(message)

# Using f-strings (Python 3.6+)
message = f"Hello, {name}!"
print(message)

Hello, Alice!
Hello, Alice!


**String Methods:** Some common string methods in Python include upper(), lower(), replace(), and split().

In [30]:
# Changing case
text = "Hello, World!"
print(text.upper())
print(text.lower())

# Replacing parts of a string
print(text.replace("World", "Python"))

# Splitting a string
words = text.split(", ")
print(words)

HELLO, WORLD!
hello, world!
Hello, Python!
['Hello', 'World!']


**Regular expressions:** The re.search() function searches for a match in a string.

In [31]:
import re

# Search for the word 'Python' in a string
text = "Welcome to Python programming!"
pattern = "Python"

# Search for the pattern
match = re.search(pattern, text)

# The re.search() function looks for the word “Python” in the text. If found, it returns a match object, otherwise None.
if match:
    print("Match found!")
else:
    print("No match.")

Match found!
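
The example above searches for a literal word. As a small sketch of what patterns add, the cell below extracts four-digit numbers, reads groups from a match object, and replaces matches with `re.sub()`. The sample sentence is illustrative.

```python
import re

# Illustrative text; not part of the original notebook
text = "Python 2.7 was released in 2010, Python 3.12 in 2023."

# \d{4} matches exactly four digits; \b marks a word boundary
years = re.findall(r"\b\d{4}\b", text)
print(years)

# Parentheses create groups that can be read from the match object
match = re.search(r"Python (\d+)\.(\d+)", text)
if match:
    print(match.group(0))                  # the whole match
    print(match.group(1), match.group(2))  # the two captured groups

# re.sub() replaces every match of the pattern
masked = re.sub(r"\d{4}", "YYYY", text)
print(masked)
```

Writing patterns as raw strings (`r"..."`) keeps backslashes such as `\d` from being interpreted as string escapes.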


## Quiz time!

**Question:** What will be the output of the following code?

In [32]:
fruits = ["apple", "banana", "cherry"]
fruits.append("orange")
print(fruits[1])

banana


A) "apple"  
B) "banana"  
C) "cherry"  
D) ["apple", "banana", "cherry", "orange"]

**Question:** What error will the following code produce?

In [33]:
colors = ("red", "green", "blue")
colors[1] = "yellow"

TypeError: 'tuple' object does not support item assignment

A) TypeError: 'tuple' object does not support item assignment  
B) IndexError: tuple index out of range  
C) No error, it works fine  
D) AttributeError: 'tuple' object has no attribute 'append'

**Question:** What will be the value of car["year"] after running the following code?

In [34]:
car = {
    "brand": "Ford",
    "model": "Mustang",
    "year": 1964
}
car["year"] = 2022
print(car["year"])

2022


A) 1964  
B) 2022  
C) None  
D) KeyError: 'year'

**Question:** What will happen if the following code is executed, and example.txt doesn’t exist in the directory?

In [35]:
with open("example.txt", "r") as file:
    content = file.read()
    print(content)

Hello, World!
This is a new line.
This is an appended line.


A) It will create a new file named example.txt.  
B) It will throw a FileNotFoundError.  
C) It will print an empty string.  
D) It will print None.  

**Question:** What will be the value of message in the following code?

In [36]:
name = "Alice"
greeting = "Hello"
message = greeting + ", " + name + "!"
print(message)

Hello, Alice!


A) "Hello Alice"  
B) "Hello, Alice!"  
C) "Hello Alice!"  
D) ", Alice!"

**Question:** What will the following code output?

In [37]:
text = "Welcome to Python"
print(text.upper())

WELCOME TO PYTHON


A) WELCOME TO PYTHON  
B) welcome to python  
C) Welcome To Python  
D) Welcome to python

**Question:** What happens if you try to add a duplicate item to a set?

In [38]:
my_set = {"apple", "banana", "cherry"}
my_set.add("banana")
print(my_set)

{'apple', 'banana', 'cherry'}


A) The set will now contain two "banana" items.  
B) The set will raise an error.  
C) The set will remain unchanged.   
D) The set will reorder itself.

**Question:** What is the correct way to format a string using f-strings in the following code?

In [39]:
name = "Alice"
age = 25
message = f"My name is ... and I am ... years old."
print(message)

My name is ... and I am ... years old.


A) f"My name is {name} and I am {age} years old."  
B) f"My name is name and I am age years old."  
C) "My name is {name} and I am {age} years old."  
D) "My name is name and I am age years old."

**Question:** What will be the output of the following code?

In [40]:
numbers = [10, 20, 30, 40, 50]
print(numbers[-2])

40


A) 20  
B) 40  
C) 30  
D) 50


**Question:** What will the following code output?

In [41]:
person = {
    "name": "John",
    "age": 30
}
print("height" in person)

False


A) True  
B) False  
C) None  
D) KeyError: 'height'

## Lab Assignments

Download this notebook from Moodle and rerun the examples from today's lecture. If you can execute all of the code cells above you are done! The setup guide below will help you to get started!

Once you have set up your Python/Jupyter environment, please complete the following tasks:

- Download this [dataset](https://librarycarpentry.org/lc-python-intro/files/data.zip) and extract it (do not commit it to this repository).
- Get familiar with the [pandas](https://pandas.pydata.org/) library and install it in your Python environment.
- Load one or more of the CSV files in the data/ directory as a DataFrame.
- Use the functions `info()`, `head()`, `tail()`, `describe()`, and the variable `columns` to get some basic information about the data. Also document and describe the outputs.
- How do you get the first/last ten rows of the DataFrame?
- How do you get the rows between row 30 and row 40?
- How do you get a specific column, e.g., 'year'? What `type()` does the column have?
- What is the `*.pkl` file? How do you load it into your program and what data does it contain?
- What is the purpose of `.loc[]` and `.iloc[]`?
- How do you sort values in a column?

When you have completed the tasks, please commit this notebook to your GitHub repository in the directory `assignments/06/`.


In [48]:
import pandas as pd
df22 = pd.read_csv("data/2022_circ.csv")
df11 = pd.read_csv("data/2011_circ.csv")

In [53]:
# Overview of the data structure
df22.info()

# Show the first 5 rows
df22.head()

# Show the last 5 rows
df22.tail()

# Statistical summary
df22.describe()

# List the column names
df22.columns

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 81 entries, 0 to 80
Data columns (total 17 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   branch     81 non-null     object 
 1   address    81 non-null     object 
 2   city       81 non-null     object 
 3   zip code   81 non-null     float64
 4   january    81 non-null     int64  
 5   february   81 non-null     int64  
 6   march      81 non-null     int64  
 7   april      81 non-null     int64  
 8   may        81 non-null     int64  
 9   june       81 non-null     int64  
 10  july       81 non-null     int64  
 11  august     81 non-null     int64  
 12  september  81 non-null     int64  
 13  october    81 non-null     int64  
 14  november   81 non-null     int64  
 15  december   81 non-null     int64  
 16  ytd        81 non-null     int64  
dtypes: float64(1), int64(13), object(3)
memory usage: 10.9+ KB


Index(['branch', 'address', 'city', 'zip code', 'january', 'february', 'march',
       'april', 'may', 'june', 'july', 'august', 'september', 'october',
       'november', 'december', 'ytd'],
      dtype='object')

In [52]:
# Overview of the data structure
df11.info()

# Show the first 5 rows
df11.head()

# Show the last 5 rows
df11.tail()

# Statistical summary
df11.describe()

# List the column names
df11.columns

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 80 entries, 0 to 79
Data columns (total 17 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   branch     80 non-null     object 
 1   address    80 non-null     object 
 2   city       80 non-null     object 
 3   zip code   80 non-null     float64
 4   january    80 non-null     int64  
 5   february   80 non-null     int64  
 6   march      80 non-null     int64  
 7   april      80 non-null     int64  
 8   may        80 non-null     int64  
 9   june       80 non-null     int64  
 10  july       80 non-null     int64  
 11  august     80 non-null     int64  
 12  september  80 non-null     int64  
 13  october    80 non-null     int64  
 14  november   80 non-null     int64  
 15  december   80 non-null     int64  
 16  ytd        80 non-null     int64  
dtypes: float64(1), int64(13), object(3)
memory usage: 10.8+ KB


Index(['branch', 'address', 'city', 'zip code', 'january', 'february', 'march',
       'april', 'may', 'june', 'july', 'august', 'september', 'october',
       'november', 'december', 'ytd'],
      dtype='object')

* How do you get the first/last ten rows of the DataFrame?

* How do you get the rows between row 30 and row 40?

* How do you get a specific column, e.g., 'year'? What `type()` does the column have?

* What is the `*.pkl` file? How do you load it into your program and what data does it contain?

* What is the purpose of `.loc[]` and `.iloc[]`?

* How do you sort values in a column?

# How do you get the first/last ten rows of the DataFrame?

In [54]:
# A cell displays only its last expression, so the output below is df11.tail(10)

# First 10 rows of 2022_circ.csv
df22.head(10)
# Last 10 rows of 2022_circ.csv
df22.tail(10)

# First 10 rows of 2011_circ.csv
df11.head(10)
# Last 10 rows of 2011_circ.csv
df11.tail(10)

Unnamed: 0,branch,address,city,zip code,january,february,march,april,may,june,july,august,september,october,november,december,ytd
70,Water Works,163 E. Pearson St.,Chicago,60611.0,2609,2870,3718,3529,3669,3666,4099,4292,4138,4321,4219,3917,45047
71,West Belmont,3104 N. Narragansett Ave.,Chicago,60634.0,8889,8470,11577,11686,9020,10337,11273,10405,9659,11313,11445,9835,123909
72,West Chicago Avenue,4856 W. Chicago Ave.,Chicago,60651.0,1277,934,1961,2131,2091,2068,1884,2170,2147,2207,1612,1609,22091
73,West Englewood,1745 W. 63rd St.,Chicago,60636.0,1522,1422,1878,1975,1724,1819,1815,2193,2401,2578,2097,1834,23258
74,West Lawn,4020 W. 63rd St.,Chicago,60629.0,7606,6076,8552,8896,7313,8150,5703,7723,8201,10621,8101,7980,94922
75,West Pullman,830 W. 119th St.,Chicago,60643.0,3312,2713,3495,3550,3010,2968,3844,3811,3209,3923,3162,3147,40144
76,West Town,1625 W. Chicago Ave.,Chicago,60622.0,9030,7727,10450,10607,10139,10410,10601,11311,11084,10657,10797,9275,122088
77,"Whitney M. Young, Jr.",7901 S. King Dr.,Chicago,60619.0,2588,2033,3099,3087,3005,2911,3123,3644,3547,3848,3324,3190,37399
78,Woodson Regional,9525 S. Halsted St.,Chicago,60628.0,10564,8874,10948,9299,9025,10020,10366,10892,10901,13272,11421,9474,125056
79,Wrightwood-Ashburn,8530 S. Kedzie Ave.,Chicago,60652.0,3062,2780,3334,3279,3036,3801,4600,3953,3536,4093,3583,3200,42257


# How do you get the rows between row 30 and row 40?

In [55]:
# Rows 30 to 40 of 2022_circ.csv (the end position 41 is excluded)
df22.iloc[30:41]

# Rows 30 to 40 of 2011_circ.csv (the end position 41 is excluded)
df11.iloc[30:41]

Unnamed: 0,branch,address,city,zip code,january,february,march,april,may,june,july,august,september,october,november,december,ytd
30,Hall,4801 S. Michigan Ave.,Chicago,60615.0,2416,1914,2720,2644,2693,3134,3373,2953,2894,2944,2836,2875,33396
31,Harold Washington Library Center,400 S. State St.,Chicago,60605.0,79210,67574,89122,88527,82581,82100,80219,85193,81400,82236,79702,68856,966720
32,Hegewisch,3048 E. 130th St.,Chicago,60633.0,3221,2749,3668,3492,3181,3324,3852,3934,3570,3755,3617,3073,41436
33,Humboldt Park,1605 N. Troy St.,Chicago,60647.0,6251,5736,7798,8184,7509,7502,7019,7120,6609,7424,6943,6730,84825
34,Independence,3548 W. Irving Park Rd.,Chicago,60618.0,8860,7653,9773,10897,9401,11998,12958,11913,10682,11534,10603,9614,125886
35,Jefferson Park,5363 W. Lawrence Ave.,Chicago,60630.0,8819,8308,10663,11099,9422,10735,10591,11447,10447,11033,10757,9755,123076
36,Jeffery Manor,2401 E. 100th St.,Chicago,60617.0,1223,1115,1757,1811,1429,1531,1437,1702,1554,1684,1613,1463,18319
37,Kelly,6151 S. Normal Boulevard,Chicago,60621.0,963,827,1315,1143,1004,1384,1585,1918,1705,2200,1729,1575,17348
38,King,3436 S. King Dr.,Chicago,60616.0,1903,1795,2586,2875,3108,3255,3446,3732,3505,3711,3257,3107,36280
39,Legler Regional,115 S. Pulaski Rd.,Chicago,60624.0,1344,1080,1635,1699,1425,1690,1485,1075,1866,1791,1795,1602,18487
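
The end of a slice is handled differently by `.iloc[]` and `.loc[]`. The sketch below, which assumes a small synthetic DataFrame `df_demo` with the default integer index, shows that `iloc[30:41]` and `loc[30:40]` select the same eleven rows.

```python
import pandas as pd

# Synthetic DataFrame with 50 rows; the column name is illustrative
df_demo = pd.DataFrame({"value": range(100, 150)})

by_position = df_demo.iloc[30:41]  # positions 30..40, end position 41 is excluded
by_label = df_demo.loc[30:40]      # labels 30..40, end label 40 is included

print(len(by_position), len(by_label))
print(by_position.index.min(), by_position.index.max())
print(by_position.equals(by_label))
```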


# How do you get a specific column, e.g., 'year'? What `type()` does the column have?

In [62]:
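# `df` is assumed to be a DataFrame loaded in an earlier cell that is not shown here.
# The columns listed above contain no 'year' column, so 'address' is used instead.
df["address"]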

0       4856 W. Chicago Ave.
1            415 E. 79th St.
2        9525 S. Halsted St.
3           1962 W. 95th St.
4        5055 S. Archer Ave.
               ...          
76    2100 S. Wentworth Ave.
77          1350 W. 89th St.
78       4314 S. Archer Ave.
79      9055 S. Houston Ave.
80         3647 S. State St.
Name: address, Length: 81, dtype: object

In [63]:
type(df["address"])

pandas.core.series.Series
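
Single brackets return a `Series`, while double brackets return a one-column `DataFrame`. A minimal sketch with an illustrative DataFrame `df_demo`:

```python
import pandas as pd

# Synthetic data; names and values are illustrative
df_demo = pd.DataFrame({"branch": ["A", "B"], "ytd": [10, 20]})

column = df_demo["ytd"]       # single brackets: a Series
sub_frame = df_demo[["ytd"]]  # double brackets: a DataFrame with one column

print(type(column).__name__)
print(type(sub_frame).__name__)
print(sub_frame.shape)
```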

# What is the `*.pkl` file? How do you load it into your program and what data does it contain?

In [81]:
# Explanation: .pkl files are serialized ("pickled") Python objects, here pandas DataFrames. They are used to store data structures efficiently and restore them later.
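
One way to load such a file is `pd.read_pickle()`. Since the name of the `.pkl` file in the dataset is not given here, the sketch below writes a synthetic DataFrame to a temporary file called `demo.pkl` (a hypothetical name) and reads it back.

```python
import os
import tempfile

import pandas as pd

# Synthetic stand-in data; not taken from the dataset
df_original = pd.DataFrame({"branch": ["A", "B"], "ytd": [1, 2]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.pkl")  # hypothetical file name
    df_original.to_pickle(path)               # serialize the DataFrame
    df_loaded = pd.read_pickle(path)          # restore it, including index and dtypes

print(type(df_loaded).__name__)
print(df_loaded.equals(df_original))
```

Pickle files can execute arbitrary code when loaded, so only open `.pkl` files from sources you trust.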

# What is the purpose of `.loc[]` and `.iloc[]`?

In [72]:
# Selects the row with index label 5
df.loc[5]
# Selects the row at integer position 5 (the sixth row, since positions start at 0)
df.iloc[5]

branch               Manning
address      6 S. Hoyne Ave.
city                 Chicago
zip code             60612.0
january                  574
february                 652
march                   1026
april                    924
may                     1185
june                    1063
july                    1401
august                  1218
september               1080
october                  976
november                 972
december                 798
ytd                    11869
Name: 5, dtype: object
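
When a DataFrame has the default integer index, label 5 and position 5 point to the same row, so the difference between the two indexers is not visible. The sketch below uses a synthetic DataFrame `df_demo` with deliberately unordered labels to make it visible.

```python
import pandas as pd

# Synthetic data with index labels that do not match the row positions
df_demo = pd.DataFrame(
    {"branch": ["North", "South", "East", "West"]},
    index=[10, 5, 7, 0],
)

label_five = df_demo.loc[5, "branch"]       # row whose label is 5
position_zero = df_demo.iloc[0]["branch"]   # first row, whatever its label
label_zero = df_demo.loc[0, "branch"]       # row whose label is 0

print(label_five, position_zero, label_zero)
```

In short, `.loc[]` selects by label and `.iloc[]` selects by integer position.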

# How do you sort values in a column?

In [79]:
df.sort_values(by="branch", ascending=False)

Unnamed: 0,branch,address,city,zip code,january,february,march,april,may,june,july,august,september,october,november,december,ytd
48,Wrightwood-Ashburn,8530 S. Kedzie Ave.,Chicago,60652.0,744,581,816,1114,689,819,482,860,766,733,671,575,8850
2,Woodson Regional,9525 S. Halsted St.,Chicago,60628.0,1891,1810,2255,2429,2264,2207,2150,1960,2021,2257,1970,1788,25002
1,"Whitney M. Young, Jr.",415 E. 79th St.,Chicago,60619.0,674,579,772,651,619,648,835,713,660,703,602,585,8041
74,West Town,1625 W. Chicago Ave.,Chicago,60622.0,4375,3887,4775,4450,4199,4959,4901,4700,4107,4546,4642,3918,53459
18,West Pullman,830 W. 119th St.,Chicago,60643.0,592,700,686,817,641,750,732,720,900,761,577,578,8454
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
46,Austin-Irving,6100 W. Irving Park Rd.,Chicago,60634.0,5054,4857,5787,5604,4626,5883,6068,5834,5419,4917,4976,4457,63482
40,Austin,5615 W. Race Ave.,Chicago,60644.0,521,576,667,644,524,643,592,573,747,683,666,753,7589
4,Archer Heights,5055 S. Archer Ave.,Chicago,60632.0,2341,2131,2628,2690,2049,2432,2505,2377,2259,2054,2099,1778,27343
21,Altgeld,955 E. 131st St.,Chicago,60827.0,150,139,179,289,297,198,257,229,314,223,173,193,2641
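
A short sketch of a few related sorting details, using a synthetic DataFrame `df_demo`: `sort_values()` returns a new DataFrame, leaves the original unchanged, and keeps the old index labels unless the index is reset.

```python
import pandas as pd

# Synthetic data; names and values are illustrative
df_demo = pd.DataFrame({"branch": ["West", "Austin", "Kelly"], "ytd": [300, 100, 200]})

# Sort by a numeric column, largest value first
sorted_desc = df_demo.sort_values(by="ytd", ascending=False)
print(sorted_desc["branch"].tolist())
print(sorted_desc.index.tolist())  # the original index labels travel with the rows

# The original DataFrame is not modified
print(df_demo["branch"].tolist())

# reset_index(drop=True) renumbers the rows from 0
renumbered = sorted_desc.reset_index(drop=True)
print(renumbered.index.tolist())
```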
