<div class="frontmatter text-center">
<h1>Introduction to Data Science and Programming</h1>
<h2>Lecture 18: Induction and command line tools</h2>
<h3>IT University of Copenhagen, Fall 2023</h3>
<h3>Instructor: Michael Szell</h3>
</div>

In [None]:
def csum(n):
    """Calculate cumulative sum of integers 1 to n"""
    if n <= 1:
        return n
    else:
        return n + csum(n-1)

In [None]:
csum(100)
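The recursion in `csum` has the same shape as a proof by induction: a base case (`n <= 1`) and a step that reduces the problem for `n` to the problem for `n-1`. As a quick sanity check, the sketch below compares `csum` with the closed form n(n+1)/2. The upper limit of 200 is an arbitrary choice that stays well below Python's default recursion limit.

```python
def csum(n):
    """Calculate cumulative sum of integers 1 to n"""
    if n <= 1:
        return n          # base case
    return n + csum(n - 1)  # inductive step


# Compare the recursive result with the closed form n(n+1)/2
all_match = all(csum(n) == n * (n + 1) // 2 for n in range(1, 200))
print(all_match)
```

Much larger inputs would need an iterative version, since every recursive call adds a frame to the call stack.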

## Imports

In [None]:
import numpy as np

# Loading data

### Constants

Constants are written in all caps: https://www.python.org/dev/peps/pep-0008/#constants

In [None]:
FILENAME = "accidents.csv"

### Load raw data

The data were downloaded from here on Jan 4th 2021: https://data.gov.uk/dataset/road-accidents-safety-data  
That page was updated afterwards (Jan 8th 2021), so local and online data may be inconsistent.

In [None]:
# First version, just using the accident table
data = np.genfromtxt("files/"+FILENAME, delimiter=',', dtype=None, names=True, encoding='utf8')

It is always good to start with a "sneak preview":

In [None]:
data[:5]

Reminder and documentation on structured arrays:  
https://numpy.org/devdocs/user/basics.rec.html

#### Insight: Mixed variable types

Accident records have mixed data types, including strings, floats, and integers. Categorical variables are encoded as integers. The meaning of these categories can be looked up in `files/variable_lookup.xls`.

Number of records

In [None]:
data.shape

Number of fields

In [None]:
len(data.dtype)

In [None]:
data.dtype

**Why is the first field "\ufeffAccident_Index" and not "Accident_Index"?**
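One likely explanation, sketched on a made-up two-line CSV string rather than the real file: the file starts with a byte order mark (BOM). Python's `utf-8` codec keeps it as the character `\ufeff`, while the `utf-8-sig` codec strips it.

```python
# Hypothetical miniature CSV; encoding with utf-8-sig prepends the BOM bytes EF BB BF
raw = "Accident_Index,Speed_limit\nA1,30\n".encode("utf-8-sig")

first_field_utf8 = raw.decode("utf-8").split(",")[0]
first_field_sig = raw.decode("utf-8-sig").split(",")[0]

print(repr(first_field_utf8))  # the BOM survives as a leading character
print(repr(first_field_sig))   # the BOM has been removed
```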

Fields

In [None]:
header = data.dtype.names
header

# Command line: Sanity checks and missing data handling

Command line tools are often a faster way to get basic insights into a new data set than numpy or pandas.

Let's get a first overview using `head`:

In [None]:
!head -n 6 "files/accidents.csv"

### General insights

#### Dimensions

Number of records (plus header)

https://en.wikipedia.org/wiki/Wc_(Unix)

In [None]:
!wc -l "files/accidents.csv"

Number of fields (in first line)

https://www.geeksforgeeks.org/awk-command-unixlinux-examples/

In [None]:
!head -1 "files/accidents.csv" | awk -F ',' '{print NF}'

See and count all fields

https://en.wikipedia.org/wiki/Tr_(Unix)
https://en.wikipedia.org/wiki/Nl_(Unix)

In [None]:
!head -1 "files/accidents.csv" | tr ',' '\n' | nl

### Sanity checks

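Does each record have the same number of fields? The command below counts how many lines have each field count, so a single line of output means yes.

https://shapeshed.com/unix-uniq/  
https://www.putorius.net/uniq-command-linux.html

In [None]:
!awk -F ',' '{print NF}' "files/accidents.csv" | sort -n | uniq -c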
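Note that splitting on `,` with `awk` also splits commas that sit inside quoted fields. As a hedged cross-check, the same count can be sketched in Python with the `csv` module, which respects quotes. The sample text below is made up for illustration and is not taken from `accidents.csv`.

```python
import csv
import io
from collections import Counter

# Made-up sample: the second line has a quoted comma, the last line lacks a field
sample = io.StringIO('id,place,severity\n1,"Leeds, UK",2\n2,York,3\n3,Hull\n')

# Map "number of fields" to "number of lines with that many fields"
field_counts = Counter(len(row) for row in csv.reader(sample))
print(field_counts)
```

More than one key in `field_counts` signals records with inconsistent lengths.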

How many duplicate lines are there? (If more than 0, there could be a problem)

In [None]:
!sort "files/accidents.csv" | uniq -d  | wc -l

More advanced stuff with `awk`: https://datafix.com.au/BASHing/2020-05-20.html

## Dealing with missing data

Using a masked array:  
https://numpy.org/devdocs/reference/maskedarray.baseclass.html#numpy.ma.MaskedArray

In [None]:
data_masked = np.genfromtxt("files/"+FILENAME, delimiter=',', dtype=None, names=True, encoding='utf-8-sig', usemask=True)

In [None]:
data_masked[:5]

In [None]:
data_masked.mask[:5]

The first 5 rows seem complete. What about the rest?

In [None]:
np.count_nonzero(data_masked.mask)

Oh oh, values are missing in 5776 rows! In which rows?

In [None]:
rows_incomplete = np.where(data_masked.mask)[0]
print(rows_incomplete)

How many values in total?  
Which fields are missing?

In [None]:
missingpositions = {}
missingvalues = 0
missingconfigurations = set()
for rowpos in rows_incomplete:
    missingpositions_thisrow = list(np.where(list(data_masked.mask[rowpos]))[0])
    missingpositions[rowpos] = missingpositions_thisrow
    missingvalues += len(missingpositions_thisrow)
    missingconfigurations.add((tuple(missingpositions_thisrow)))

In [None]:
print(missingpositions) # Don't do this if you have more than a few thousand rows, or Jupyter might crash.

Summary of missing values:

In [None]:
print("Incomplete rows: " + str(np.count_nonzero(data_masked.mask)))
print("Missing values: " + str(missingvalues))
print("\nMissing field configurations: " + str(missingconfigurations))

In [None]:
print("Missing field configurations (names): ")
missingfieldnames = [np.array(header)[c] for c in [list(b) for b in missingconfigurations]]
for i in missingfieldnames:
    print(i)
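An alternative sketch for counting missing values per field, shown on a small made-up CSV string instead of `files/accidents.csv`. It assumes that the mask of a structured masked array can be indexed by field name, which gives one boolean column per field. The names `sample`, `small`, and `missing_per_field` are illustrative only.

```python
import io
import numpy as np

# Made-up sample with two empty cells; not the real accident data
sample = io.StringIO("id,speed,severity\n1,30,2\n2,,3\n3,50,\n")
small = np.genfromtxt(sample, delimiter=',', dtype=None, names=True,
                      encoding='utf-8', usemask=True)

# Count masked (missing) entries separately for every field
missing_per_field = {name: int(small.mask[name].sum()) for name in small.dtype.names}
print(missing_per_field)
```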

*Back to presentation*

<hr>

# Python tips & tricks

## args, kwargs, and unpacking

Source: https://realpython.com/python-kwargs-and-args/

Summing a list of integers:

In [None]:
def my_sum(my_integers):
    result = 0
    for x in my_integers:
        result += x
    return result

list_of_integers = [1, 2, 3]
print(my_sum(list_of_integers))

But what if you just want to sum a number of things?

In [None]:
def my_sum(*args):
    result = 0
    # Iterating over the Python args tuple
    for x in args:
        result += x
    return result

print(my_sum(1, 2, 3))

The `*` in `*args` lets the function accept any number of positional arguments (in this case the integers 1, 2, and 3), which are packed into a tuple called `args`. In a function call, `*` does the opposite and unpacks an iterable into separate arguments. Here `*` tells `print()` to unpack the list first:

In [None]:
my_list = [1, 2, 3]
print(*my_list)
print(my_list)

For keyword arguments, you use `**kwargs` instead: the arguments are packed into a dictionary called `kwargs`. Example:

In [None]:
def concatenate(**kwargs):
    result = ""
    # Iterating over the Python kwargs dictionary
    for arg in kwargs.values():
        result += arg
    return result

print(concatenate(a="Real", b="Python", c="Is", d="Great", e="!"))
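The same operators also work in the other direction, at the call site. A minimal sketch, using a hypothetical function `describe` that is not part of the lecture: `*` unpacks a tuple into positional arguments and `**` unpacks a dictionary into keyword arguments.

```python
def describe(course, lecture, year):
    """Hypothetical helper that formats three values into one string."""
    return f"{course}: lecture {lecture}, {year}"


positional = ("IDSP", 18)    # unpacked into course and lecture
keywords = {"year": 2023}    # unpacked into year=2023

line = describe(*positional, **keywords)
print(line)
```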

## zip

Source: https://realpython.com/python-zip-function/

If you use `zip()` with n arguments, then the function will return an iterator that generates tuples of length n:

In [None]:
numbers = [1, 2, 3]
letters = ['a', 'b', 'c']
zipped = zip(numbers, letters)

print(zipped)  # Holds an iterator object
print(type(zipped))
print(list(zipped))

This is useful when you want to iterate over multiple lists together, for example:

In [None]:
for n,l in zip(numbers, letters):
    print("The number is: " + str(n))
    print("The letter is: " + l + "\n")

Beware of sets: they are unordered, so the order in which their elements are zipped together is arbitrary and should not be relied on:

In [None]:
numbers = {2, 3, 1}
letters = {'b', 'a', 'c'}
list(zip(numbers, letters))
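If a reproducible pairing is needed, one option is to impose an order before zipping. A minimal sketch, assuming that sorted order is the pairing you actually want:

```python
numbers = {2, 3, 1}
letters = {'b', 'a', 'c'}

# sorted() returns lists, so the pairing no longer depends on set iteration order
ordered_pairs = list(zip(sorted(numbers), sorted(letters)))
print(ordered_pairs)
```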

If you zip iterables of unequal length, `zip()` stops at the shortest one and silently drops all unmatched elements:

In [None]:
list(zip(range(5), range(100,-1,-1)))
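When dropping elements is not acceptable, `itertools.zip_longest` from the standard library pads the shorter iterable instead. A small sketch with made-up lists `long_list` and `short_list`:

```python
from itertools import zip_longest

long_list = [1, 2, 3]
short_list = ['a', 'b']

truncated = list(zip(long_list, short_list))                      # stops at the shortest
padded = list(zip_longest(long_list, short_list, fillvalue=None))  # pads with fillvalue

print(truncated)
print(padded)
```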

## lambda functions

https://realpython.com/python-lambda/

Lambda functions have a long history in computer science, originating from a model of computation called *lambda calculus*. In Python, a lambda is an anonymous function whose body is a single expression. For example, instead of writing:

In [None]:
def add_one(x):
    return x+1

you can shorten this to:

In [None]:
add_one = lambda x: x + 1

A lambda is handy when you don't need to reuse the function, so it can stay *anonymous* (without a name):

In [None]:
(lambda x: 2*x)(5)

or if you care about reusing it:

In [None]:
double_value = lambda x: 2*x
double_value(5)

However, PEP 8 discourages binding a lambda to a name; use a `def` statement in that case: https://peps.python.org/pep-0008/#programming-recommendations

A classical application for lambdas in data science is for sorting:

In [None]:
ids = ['id1', 'id2', 'id30', 'id3', 'id22', 'id100']
print(sorted(ids)) # Lexicographic sort

sorted_ids = sorted(ids, key=lambda x: int(x[2:])) # Integer sort
print(sorted_ids)

## map, filter, reduce

Source: https://realpython.com/python-lambda/#map

The built-in function `map()` takes a function as a first argument and applies it to each of the elements of its second argument, an iterable. Example:

In [None]:
list(map(lambda x: x.capitalize(), ['cat', 'dog', 'cow']))

Alternative way with list comprehensions, avoiding `map` and `lambda`:

In [None]:
[x.capitalize() for x in ['cat', 'dog', 'cow']]

The built-in function `filter()` takes a predicate as a first argument and an iterable as a second argument. It builds an iterator containing all the elements of the initial collection that satisfy the predicate function. Example:

In [None]:
even = lambda x: x%2 == 0
list(filter(even, range(11)))

Alternative way with list comprehensions, avoiding `filter` and `lambda`:

In [None]:
[x for x in range(11) if x%2 == 0]
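One practical difference to the list comprehension: `map()` and `filter()` return lazy iterators, which can be consumed only once. A short sketch of that behavior:

```python
evens = filter(lambda x: x % 2 == 0, range(11))

first_pass = list(evens)   # consumes the iterator
second_pass = list(evens)  # nothing left to yield

print(first_pass)
print(second_pass)
```

Wrap the result in `list()` once and keep the list if you need the values more than once.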

`reduce()` is a function in the `functools` module. Like `map()` and `filter()`, its first two arguments are a function and an iterable, respectively. It may also take an initializer as a third argument, which is used as the initial value of the accumulator. For each element of the iterable, `reduce()` applies the function to the accumulator and the element; the accumulated result is returned once the iterable is exhausted. Example: reduce a list of pairs to the sum of the first item of each pair:

In [None]:
import functools
pairs = [(1, 'a'), (2, 'b'), (3, 'c')]
functools.reduce(lambda acc, pair: acc + pair[0], pairs, 0)
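To make the accumulation explicit, the sketch below spells out the same computation as a plain loop and compares it with `reduce()` and with the built-in `sum()`, which is usually the more readable choice for this particular task. The variable names are illustrative.

```python
import functools

pairs = [(1, 'a'), (2, 'b'), (3, 'c')]

# What reduce does step by step: start from the initializer, then fold in each element
acc = 0
for pair in pairs:
    acc = acc + pair[0]

reduced = functools.reduce(lambda acc, pair: acc + pair[0], pairs, 0)
summed = sum(first for first, _ in pairs)

print(acc, reduced, summed)
```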

Advanced: More information on `reduce`: https://realpython.com/python-reduce-function/