In [None]:
import csv
import json
import pandas as pd

If `import pandas as pd` doesn't work:

Go to http://jupyter.org/try-jupyter/lab

and upload the notebook there.

# Python Crash Course IDSP 2025 - Lecture 07

Contents:
* list comprehension
* try/except for error message checking
* `pandas`!

# List comprehension

A powerful, "Pythonic" way to create lists.

In [None]:
# how to create a list of all numbers from 1 to 10?
my_list = []
for i in range(1,11):
    my_list.append(i)
my_list
# this is perfectly valid, but we could also just do...

In [None]:
# LIST COMPREHENSION:
[i for i in range(1,11)]

In [None]:
# how to create a list of all SQUARES 
# of the numbers from 1 to 10?
my_list = []
for i in range(1,11):
    my_list.append(i**2)
my_list
# this is perfectly valid, but...

In [None]:
# we could also use list comprehension:
[i**2 for i in range(1,11)]

In [None]:
# [expression for member in iterable], here with:
[i**2 for i in range(1,11)]
# expression ... i**2
# member ... i
# iterable ... range(1,11)

In [None]:
# list of all numbers from 0 to 19 (range(20) stops BEFORE 20)
[i for i in range(20)]

In [None]:
# list of all EVEN numbers from 0 to 19
[i for i in range(20) if i % 2 == 0]
# we can also add a condition!

In [None]:
# list of all EVEN numbers from 0 to 19
[i for i in range(20) if i % 2 == 0]
# [expression for member in iterable if condition]
# expression... i
# member ... i
# iterable ... range(20)
# condition .... i%2 == 0 ("i is even")

In [None]:
# ... and as usual, it works for all other iterables, e.g. strings
# create a list of all letters in a word that are NOT vowels
word = "ambivalence"
[letter for letter in word if letter not in ["a", "e", "i", "o", "u"]]

# List comprehension

A powerful, "Pythonic" way to create lists (and to replace for-loops).

```python
[expression for member in iterable]
[expression for member in iterable if condition]
```
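As a small sketch of how the pattern with a condition maps onto an ordinary for-loop (the `words` list is made up for illustration and is not part of the lecture data):

```python
# hypothetical example data
words = ["pandas", "is", "a", "python", "package"]

# for-loop version: lengths of all words longer than 2 letters
lengths_loop = []
for word in words:
    if len(word) > 2:
        lengths_loop.append(len(word))

# list comprehension version:
# [expression for member in iterable if condition]
lengths_comp = [len(word) for word in words if len(word) > 2]

print(lengths_loop)  # [6, 6, 7]
print(lengths_comp)  # [6, 6, 7]
```

Both versions build the same list; the comprehension just packs the loop, the condition, and the `append` into one line.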

# Try it out yourself!

Use list comprehension to...
* create a list of all numbers from 50 to 65
* create a list of all numbers from 50 to 65 that are divisible by 3
* create a list of **SQUARES** of all numbers from 50 to 65
* create a list of **EVEN squares** of all numbers from 50 to 65

In [None]:
# list of all numbers from 50 to 65

In [None]:
# list of all numbers from 50 to 65 that are divisible by 3

In [None]:
# list of all squares of numbers from 50 to 65 

In [None]:
# list of EVEN squares of numbers from 50 to 65

# try/except for error handling

```python
try:
    # plan A
except:
    # plan B - in case plan A raises an error
```

In [None]:
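# try/except statements
a = 10
b = 0
try:
    print(a/b)  # plan A: raises a ZeroDivisionError because b is 0
except ZeroDivisionError:  # naming the error type avoids hiding other bugs
    print("division not possible!")  # plan B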
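A minimal sketch of the same idea wrapped in a function, with one `except` branch per error type. The helper name `safe_divide` is hypothetical and only used for illustration:

```python
def safe_divide(a, b):
    """Hypothetical helper: return a / b, or None if the division fails."""
    try:
        return a / b  # plan A
    except ZeroDivisionError:
        print("division by zero is not possible!")  # plan B for b == 0
    except TypeError:
        print("both values must be numbers!")  # plan B for non-numeric input
    return None

print(safe_divide(10, 2))    # 5.0
print(safe_divide(10, 0))    # prints the zero-division message, then None
print(safe_divide(10, "a"))  # prints the type message, then None
```

Catching specific error types means that an unexpected error (for example a typo in a variable name) still shows up instead of being silently swallowed.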

# The pandas package

* motivation for pandas
* lingo: library/package, DataFrame, Series, Axis, record, dtype
* creating a DataFrame
* reading in a file into a DataFrame
* exploring DataFrame contents
* accessing DataFrame contents
* sorting DataFrame records
* adding columns
* combining 2 DataFrames
* subsetting DataFrames (filtering by condition)

# `pandas` for tabular data. Why?

## Common (?) file formats for tabular data
* `.txt` plain text file, not formatted
* `.csv` text in comma-separated values, not formatted
* `.xls`, `.xlsx` Microsoft Excel worksheets
* `.json` "JavaScript Object Notation"

## `csv`: "comma" separated values

`.csv` is often used even when the separator is not a comma, but a tab, a whitespace, a semicolon, ...

<p style="text-align:left;">
    <img src="images/csv.png" alt="csv file" width=1000px>
</p>
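A short sketch of reading "csv" text whose separator is a semicolon, using the built-in `csv` module. The text itself is made up for illustration; `io.StringIO` is only there so that a string can be read like an opened file:

```python
import csv
import io

# made-up example data with ";" as separator
semicolon_text = "name;age\nAda;36\nAlan;41\n"

# delimiter tells csv.reader which character separates the values
reader = csv.reader(io.StringIO(semicolon_text), delimiter=";")
rows = [row for row in reader]
print(rows)  # [['name', 'age'], ['Ada', '36'], ['Alan', '41']]
```

Note that every value comes back as a string, including the numbers.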

## `json` JavaScript Object Notation

* format understood by many programming languages (not only JavaScript!)
* can store different (tree-like) data structures, not only tables
* often used for server-web application data transfer
* data types allowed in json: numbers, strings, booleans, `null`, "arrays" (similar to lists in Python), "objects" (name-value pair collections, similar to dictionaries in Python)

<p style="text-align:left;">
    <img src="images/json.png" alt="json file" width=1000px>
</p>
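To make the mapping between json and Python data types concrete, here is a small sketch with a made-up json string (the field names are assumptions for illustration):

```python
import json

# made-up JSON text, for illustration only
json_text = '{"name": "Ada", "age": 36, "is_passenger": false, "languages": ["en", "fr"]}'

record = json.loads(json_text)     # parse a JSON string into Python objects
print(type(record))                # <class 'dict'>  (JSON object)
print(type(record["languages"]))   # <class 'list'>  (JSON array)
print(record["is_passenger"])      # False           (JSON boolean)
```

`json.loads()` reads from a string, while `json.load()` reads from an opened file.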


# A Jupyter notebook is actually a json file, too!

<p style="text-align:left;">
    <img src="images/ipynb.png" alt="ipynb file opened with text editor" width=1000px>
</p>


# Our table of the day: Titanic passengers

[(Link to raw data)](https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/titanic.csv)

same data, stored in different file formats, in `data` folder

In [None]:
# reading in tabular data in csv format - the painful way
with open("data/titanic.csv", "r") as opened_file:
    my_reader = csv.reader(opened_file)
    rows = [row for row in my_reader]
rows

In [None]:
# reading in tabular data in json format - the painful way
with open("data/titanic.json", "r") as opened_file:
    my_json = json.load(opened_file)
my_json

# Enter: `pandas`

for tabular data

In [None]:
pd.read_csv("data/titanic.csv")

# `pandas` lingo

* `DataFrame`
* `Series`
* `Axis`
* `record`
* `dtype`

In [None]:
df = pd.read_csv("data/titanic.csv")
df

In [None]:
# DataFrame is a data type defined in pandas:
type(df)

In [None]:
# each column of the DataFrame is a pandas Series 
# (one-dimensional, labeled array)
df.Name
# type(df.Name)

### Axis : row or column
* Axis = 0: row
* Axis = 1: column

### Record : a single row ("one observation")

### dtype: data type of a Series (each column of a DataFrame has its own dtype)
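A small sketch that ties the terms together on a made-up table (the DataFrame `toy` and its values are assumptions for illustration, not the Titanic data):

```python
import pandas as pd

# made-up example table
toy = pd.DataFrame({"name": ["Ada", "Alan", "Grace"], "age": [36, 41, 85]})

ages = toy["age"]          # one column: a Series
print(type(ages))
print(ages.dtype)          # data type of that column (an integer type)

record = toy.loc[0]        # one record (row), returned as a Series as well
print(record)

fewer_rows = toy.drop(labels=0, axis=0)      # axis=0: the label refers to a row
fewer_cols = toy.drop(labels="age", axis=1)  # axis=1: the label refers to a column
print(fewer_rows.shape)    # (2, 2)
print(fewer_cols.shape)    # (3, 1)
```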

# Creating a `pandas` DataFrame

* from scratch
* by importing a file


In [None]:
# from scratch
pd.DataFrame(
    {
        "greek_letters": ["alpha", "beta", "gamma"],
        "latin_letters": ["a", "b", "g"],
        "some_random_numbers": [10,8,27]
    }
)

In [None]:
?pd.DataFrame

In [None]:
# by using .read_csv()
df = pd.read_csv("data/titanic.csv")
df

In [None]:
# read_csv(filepath) reads in a csv from a file and returns a pandas DataFrame:
pd.read_csv("data/titanic.csv")

In [None]:
# .read_csv() assumes that sep=",", so a semicolon-separated file ends up in ONE single column:
pd.read_csv("data/titanic_semicolon.csv")

In [None]:
# .read_csv() assumes that sep=","... 
# but I can indicate a different separator:
pd.read_csv("data/titanic_semicolon.csv", sep = ";")
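The effect of `sep` can be checked without any file on disk. This is a sketch with made-up semicolon-separated text; `io.StringIO` stands in for the file:

```python
import io
import pandas as pd

# made-up semicolon-separated text
semicolon_text = "Name;Age\nAda;36\nAlan;41\n"

wrong = pd.read_csv(io.StringIO(semicolon_text))           # default sep=","
right = pd.read_csv(io.StringIO(semicolon_text), sep=";")  # correct separator

print(wrong.shape)          # (2, 1): everything lands in a single column
print(right.shape)          # (2, 2): two proper columns
print(list(right.columns))  # ['Name', 'Age']
```

Checking `.shape` or `.columns` right after reading a file is a quick way to notice a wrong separator.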

In [None]:
# let's save the pandas DataFrame into a variable, df:
df = pd.read_csv("data/titanic.csv")

# Exploring DataFrame contents

In [None]:
# now we can display the first (by default 5) rows:
df.head(3)
# if you provide an integer argument n, it will display the first n rows

In [None]:
# to display the last (by default 5) rows:
df.tail(10) # if you provide an integer argument n, it will display the last n rows


In [None]:
# length of the dataframe = number of ROWS
len(df)

In [None]:
# the type of the variable is "pandas dataframe":
type(df)

In [None]:
# this object (the pandas DataFrame) has ATTRIBUTES:
# characteristics accessible by .attributename
df.dtypes # .dtypes contains the data types of all columns

In [None]:
# another attribute is .index (containing the ROW LABELS)
df.index

In [None]:
# another attribute is .columns (containing the COLUMN LABELS)
df.columns

In [None]:
# .shape contains the shape (nrows, ncols) of the dataframe
df.shape

In [None]:
# .describe() provides some summary statistics
df.describe()

# Accessing DataFrame contents

# Accessing specific columns in the data frame

...a little bit like indexing:

## `df[]`

#### `df[columnlabel]` with single column label
#### `df[[col1, col2, col3]]`  with list of column labels


In [None]:
# access the columns separately with square brackets 
# and their column name ("label"):
df["Survived"] # returns ONLY the column "Survived"

In [None]:
# access the columns separately with square brackets 
# and their column name ("label"):
df["Age"] # returns ONLY the column "Age"

In [None]:
# access specific columns by giving a list of column names as index: 
df[["Age", "Name", "Survived"]]

# Accessing specific rows in the data frame

## `df.loc[]` 

#### `df.loc[rowlabel]` with single row label (index)
#### `df.loc[[row1,row2,row3]]` with list of row labels

In [None]:
# remember, our row labels (in this case) are simply integer numbers:
df.head()

In [None]:
# accessing rows by index: df.loc[] with index (row label) as argument
df.loc[0] # returns only first row

In [None]:
# accessing rows by index: df.loc[] with list of indices (row labels) as argument
df.loc[[0,1,2]] # returns first 3 rows

## Accessing specific rows AND columns in the data frame

#### `df.loc[rowlabels, columnlabels]` 


In [None]:
# row labels: first three rows; column label: "Name"
df.loc[[0,1,2], "Name"]

In [None]:
# row labels: first three rows; column labels: "Name" & "Age"
df.loc[[0,1,2], ["Name", "Age"]]

In [None]:
# row labels: first row; column labels: ["Name", "Sex"]
df.loc[0, ["Name", "Sex"]]

In [None]:
# accessing one single value
df.loc[0, "Name"]
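`.loc[]` works with row *labels*, which only happen to be integers in the Titanic table. A sketch with a made-up DataFrame whose row labels are strings (all names and values are assumptions for illustration):

```python
import pandas as pd

# made-up example table with string row labels
toy = pd.DataFrame(
    {"age": [36, 41, 85], "city": ["London", "Manchester", "New York"]},
    index=["ada", "alan", "grace"],
)

one_row = toy.loc["alan"]                       # one row, by label
two_cities = toy.loc[["ada", "grace"], "city"]  # two rows, one column
one_value = toy.loc["grace", "age"]             # a single value

print(one_row)
print(two_cities)
print(one_value)  # 85
```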

# Try it out yourself

Each of the tasks below is 1 line of code!

* Access only the column "Fare" `[]`
* Access only the columns "Fare" and "Age" `[[]]`
* Access only the rows with row labels 3, 4, and 5 `.loc[]`
* Access only the rows 3, 4, 5, and only the columns "Survived" and "Name"
* Access the name of the last passenger in the dataframe

In [None]:
# YOUR CODE HERE

# Sorting DataFrame records

### `df.sort_values(by=<columnname>)`

In [None]:
df.sort_values(by="Age") # this method returns a new, sorted DataFrame,
# but does NOT change the original DataFrame!

In [None]:
df.sort_values(by="Age", ascending=True)
# "ascending" parameter is True by default

In [None]:
# overwriting with sorted version, by ascending "Fare"
df = df.sort_values(by="Fare")
df

In [None]:
# resetting the index
df = df.reset_index(drop=True)
df.head()
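A sketch on a made-up table that shows why the index is reset after sorting: the sorted rows keep their old row labels (the DataFrame `toy` is an assumption for illustration):

```python
import pandas as pd

# made-up example table
toy = pd.DataFrame({"name": ["Ada", "Alan", "Grace"], "fare": [30.0, 7.5, 12.0]})

sorted_toy = toy.sort_values(by="fare", ascending=False)  # highest fare first
print(list(toy["name"]))       # ['Ada', 'Alan', 'Grace']: original is unchanged
print(list(sorted_toy.index))  # [0, 2, 1]: rows keep their old labels

sorted_toy = sorted_toy.reset_index(drop=True)  # drop=True discards the old labels
print(list(sorted_toy.index))  # [0, 1, 2]
```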

# Adding columns to a DataFrame

In [None]:
# adding a new column
df["had_a_bad_trip"] = True # adds a new column, with the same value in ALL rows
df.head(3)

In [None]:
df["linkedin"] = None
df.head(3)

In [None]:
# adding a new column based on another column
df["Fare_DKK"] = df["Fare"] * 900 # multiply the value in each row of "Fare" by 900
df["Fare_EUR"] = df["Fare"] * 120 # multiply the value in each row of "Fare" by 120
df.head()

# Deleting rows and columns with `.drop()`

In [None]:
# remove a row: axis = 0
# df.drop(labels=3, axis=0)
# df.head()
# this removes the row with label 3

In [None]:
# remove a column: axis = 1
df.drop(labels="Age", axis = 1)
# this removes the column "Age"

In [None]:
# how to CHANGE the dataframe itself instead of only getting a modified COPY?
df = df.drop(labels="Age", axis = 1)
# either overwrite the variable

In [None]:
# how to CHANGE the dataframe itself instead of only getting a modified COPY?
df.drop(labels="Name", axis = 1, inplace=True)
df.head()
# OR set inplace=True

In [None]:
# now both "Name" and "Age" columns are removed:
df.head(3)
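A sketch on a made-up table showing that `.drop()` without `inplace=True` leaves the original untouched, and that `labels` can also be a list (the DataFrame `toy` is an assumption for illustration):

```python
import pandas as pd

# made-up example table
toy = pd.DataFrame({"name": ["Ada", "Alan"], "age": [36, 41], "fare": [30.0, 7.5]})

without_age = toy.drop(labels="age", axis=1)  # returns a new DataFrame
print(list(toy.columns))          # ['name', 'age', 'fare']: toy is unchanged
print(list(without_age.columns))  # ['name', 'fare']

# several columns at once: pass a list of labels
only_name = toy.drop(labels=["age", "fare"], axis=1)
print(list(only_name.columns))    # ['name']
```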

In [None]:
# let's read in the data one more time, after we messed with it:
df = pd.read_csv("data/titanic.csv")

# Subsetting DataFrames

* Boolean indexing (filtering by condition)
* bitwise operators!

# Boolean indexing (filtering by condition)

##### `df[columnlabel]` can be combined with comparison `> < == !=` operators
##### `df[condition]` returns only those rows where condition is True
##### `df[(condition1) & (condition2)]` returns only those rows where both conditions are True

In [None]:
# let's see what happens if we use a comparison operator with a single column:
# returns for each row either True or False
df["Age"] > 18 # was this person over 18 on that ship?

In [None]:
# we can use that condition to index only rows where condition is True:
my_condition = df["Age"] > 18 # who was over 18 on that ship?
df[my_condition]

In [None]:
# a shorter (but perhaps more confusing at first) way to write this: 
# df[condition] (where condition contains a df column)
df[ df["Age"]>18 ]

In [None]:
# filtering by several conditions:
# put each condition inside () round brackets
# combine them with & (meaning "and") or | (meaning "or")
# everyone that was over 18 AND survived 
df[ (df["Age"]>18) & (df["Survived"]==1) ]

In [None]:
# everyone that is male and under 25
df[ (df["Sex"]=="male") & (df["Age"]<25) ]
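The cells above combine conditions with `&`; here is a sketch of the `|` ("or") case on a made-up table (the DataFrame `toy` is an assumption for illustration):

```python
import pandas as pd

# made-up example table
toy = pd.DataFrame(
    {
        "name": ["Ada", "Alan", "Grace", "Tim"],
        "age": [36, 12, 85, 17],
        "survived": [1, 0, 1, 1],
    }
)

# everyone who was under 18 OR did not survive
young_or_lost = toy[(toy["age"] < 18) | (toy["survived"] == 0)]
print(list(young_or_lost["name"]))  # ['Alan', 'Tim']
```

The round brackets around each condition are needed because `&` and `|` bind more tightly than comparison operators such as `<` and `==`.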

# Combining several DataFrames

### `pd.concat([<list-of-dataframes>])`

In [None]:
# along the row axis (0)
adults = df[df.Age >= 18].copy()
kids = df[df.Age < 18].copy()
# pd.concat([adults,kids], axis=0)

In [None]:
# along the column axis (1)
ages = df["Age"].copy()
rest = df.drop(labels="Age", axis = 1).copy()
pd.concat([ages,rest], axis=1)
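A sketch of concatenating along the row axis with two made-up DataFrames. By default the old row labels are kept, which can lead to repeated labels; `ignore_index=True` creates fresh ones:

```python
import pandas as pd

# made-up example tables with the same columns
first = pd.DataFrame({"name": ["Ada", "Alan"], "age": [36, 41]})
second = pd.DataFrame({"name": ["Grace"], "age": [85]})

stacked = pd.concat([first, second], axis=0)
print(list(stacked.index))     # [0, 1, 0]: the old row labels are kept

renumbered = pd.concat([first, second], axis=0, ignore_index=True)
print(list(renumbered.index))  # [0, 1, 2]
```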

# Saving DataFrames to file

### `df.to_csv(filename)`

In [None]:
# by default, the row labels (index) are written as an extra first column
df.to_csv("titanic-pandas.csv")

In [None]:
# index=False leaves the row labels out of the file
df.to_csv("titanic-pandas.csv", index=False)
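The difference between the two calls is easiest to see in the header line of the output. A sketch with a made-up table; calling `to_csv()` without a filename returns the csv text as a string instead of writing a file:

```python
import pandas as pd

# made-up example table
toy = pd.DataFrame({"name": ["Ada", "Alan"], "age": [36, 41]})

with_index = toy.to_csv()                # row labels become a first, unnamed column
without_index = toy.to_csv(index=False)  # row labels are left out

print(with_index.splitlines()[0])     # ,name,age
print(without_index.splitlines()[0])  # name,age
```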

# .apply() method for pandas dataframe columns

...and very brief intro to inline (lambda) functions!

In [None]:
df.Fare

In [None]:
# apply the "int()" function to each row
df.Fare.apply(int)

In [None]:
# apply the "type()" function to each row
df.Fare.apply(type)

In [None]:
# multiply each row's Fare value by 900
def multiply_by_900(myvalue):
    return myvalue * 900
df.Fare.apply(multiply_by_900)

In [None]:
# multiply each row's Fare value by 900 - with
# INLINE FUNCTION (LAMBDA)
df.Fare.apply(lambda x: x*900)

In [None]:
# convert the Survived 0/1 to boolean
df.Survived.apply(lambda x: bool(x))

In [None]:
# split each string by " "
df.Name.apply(lambda x: x.split(" "))
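A sketch showing that a named function and a `lambda` give the same result with `.apply()`. The Series below holds two example strings in the "Surname, Title. First names" style, and `get_surname` is a hypothetical helper:

```python
import pandas as pd

# example strings in the style of the "Name" column
names = pd.Series(["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"])

def get_surname(full_name):
    # everything before the first comma
    return full_name.split(",")[0]

surnames_def = names.apply(get_surname)
surnames_lambda = names.apply(lambda x: x.split(",")[0])

print(list(surnames_def))     # ['Braund', 'Heikkinen']
print(list(surnames_lambda))  # ['Braund', 'Heikkinen']
```

A `lambda` is handy for short one-off operations; for anything longer, a named function is usually easier to read and to test.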

# Try it out yourself!

Write one line of code each, `.apply()`ing `lambda` functions...
* to `df.Fare`, so that for each record we get the text "I want my {insert Fare} dollars back"
* to `df.Age`, so that for each record we get the age in *months*
* to `df.Sex`, so that for each record we show only the first letter for this column ("m" or "f")

In [None]:
# YOUR CODE HERE