# Session 2

By **Paul Rognon & Miquel Torrens i Dinarès & Maxim Fedotov**

*Barcelona School of Economics* –
*Data Science Center*

March 15th, 2023

---
## Modules and imports

*Modules* are Python files, that is, files with the extension `.py`. Data and functions defined in a module can become part of the namespace by using `import`.

In [1]:
x = sin(5)  # The "sin" function does not exist

NameError: name 'sin' is not defined

The function `sin` is part of the module `math`.

In [2]:
import math  # You need to import the module "math"
x = math.sin(5)  # Functions in a module are accessed as module.function
print(x)

-0.9589242746631385


There are some other useful tricks to import data and methods: you can import a single function from a module, or rename it.

In [3]:
from math import sin  # Imports a single function
from math import sin as sinus  # Nickname, useful when you import something with a long name
print(sinus(3))

0.1411200080598672


Let us discuss here a couple of fundamental modules in Python:

### `numpy` (Numerical Python)

*   User's manual: https://numpy.org/doc/stable/

This is Python's stack for scientific computing. The fundamental new data type is that of a **`numpy` `array`**. It is Python's matrix-type object and is used in the majority of modules for data analysis, statistics and machine learning.

These arrays contain data that must all be of the same type (dtype):

In [3]:
# Import module 
import numpy as np  # usually imported with name "np"

# Creating an array
A = np.array([[1, 2, 3, 4, 5, 6], 
              [42, 53, 43 ,62, 7, 4], 
              [-3, -1, -4 ,-8, -52, -4], 
              [10, 0, 4 , 1, 0, 1]])
print(A)
print(A.dtype)

[[  1   2   3   4   5   6]
 [ 42  53  43  62   7   4]
 [ -3  -1  -4  -8 -52  -4]
 [ 10   0   4   1   0   1]]
int64


We can access the elements in an array using multi-index notation (counting starts from 0, slicing `a:b` is inclusive:exclusive, negative indices, etc.)

In [None]:
print(A)
# Print from row 3 onwards, columns 3 and 5
print(A[2:, [2, 4]])
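The indexing rules above can be sketched on the same matrix `A`. This is only an illustration; the variable names (`top_left`, `last_column`, `picked`) are made up for this example.

```python
import numpy as np

A = np.array([[1, 2, 3, 4, 5, 6],
              [42, 53, 43, 62, 7, 4],
              [-3, -1, -4, -8, -52, -4],
              [10, 0, 4, 1, 0, 1]])

top_left = A[:2, :3]      # rows 0-1, columns 0-2 (the end of a slice is excluded)
last_column = A[:, -1]    # every row, last column (negative index)
picked = A[2:, [2, 4]]    # rows 2 onwards, columns 2 and 4

print(top_left)
print(last_column)
print(picked)
```

Note that a slice such as `:2` keeps the dimension, while a single integer such as `-1` drops it, which is why `last_column` is one-dimensional.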

In [6]:
# Attributes
A.shape  # dimension
A.min()  # minimum (you can similarly use max, sum, etc.)
A.diagonal()  # diagonal
B = A.transpose()  # transposing
C = A.reshape(6, 4)  # rearrange values to change dimension (CAREFUL!)
print(A.shape)
print(B.shape)
print(C.shape)
A.dot(B)  # dot-product (with array B)


(4, 6)
(6, 4)
(6, 4)


array([[   91,   584,  -333,    32],
       [  584, 10331, -1227,   658],
       [ -333, -1227,  2810,   -58],
       [   32,   658,   -58,   118]])

#### `array` operations

Mathematical symbols take on mathematical meanings in `numpy`, and so the `+` operator between two `np.array` just tries to add them together elementwise (it works differently to `list`-type objects). To concatenate you need a specific function.


In [8]:
# Numpy array addition: 
a, b = np.array([1, 2, 3]), np.array([4, 5, 6])
print(a + b)  # Addition
print(np.concatenate([a,b]))  # Concatenation

[5 7 9]
[1 2 3 4 5 6]
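To make the contrast with `list` objects explicit, here is a small side-by-side sketch. The variable names are chosen only for this illustration.

```python
import numpy as np

list_a, list_b = [1, 2, 3], [4, 5, 6]
arr_a, arr_b = np.array(list_a), np.array(list_b)

list_sum = list_a + list_b   # lists: "+" concatenates
arr_sum = arr_a + arr_b      # arrays: "+" adds elementwise
arr_prod = arr_a * arr_b     # arrays: "*" multiplies elementwise
arr_scaled = 2 * arr_a       # a scalar is applied to every element

print(list_sum)
print(arr_sum, arr_prod, arr_scaled)
```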


### `pandas` (Panel Data Structures)

This is the module in Python for doing rectangular-data management, analysis and plotting. It provides tools to read and write external data.


In [1]:
import pandas as pd  # Usually imported with name "pd"

file_path = "https://raw.githubusercontent.com/barcelonagse-datascience/academic_files/master/data/tips.csv"
tips = pd.read_csv(file_path)  # read_csv allows importing .CSV files
print(tips.head(5))  # head gives first rows (arg = number of rows)

# Data formats in pandas
print(type(tips))
print(type(tips['tip']))

   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>


There are two basic data structures in `pandas`: `Series` and `DataFrame`. A `Series` is the equivalent of a vector in linear algebra. A `DataFrame` is the equivalent of a rectangular data table. These types are equipped with several attributes and methods useful for data management and analysis. In the tips example, the object `tips` is a `DataFrame`, while any individual column is a `Series`.

---
## Structured data operations (`pandas`)

### `Series`

They are accessed via their index, which is a label (a number or a string). They are typically obtained by reading a dataset from an external file or when operating on a `DataFrame`, but they can be defined manually:

In [14]:
print(type(tips['total_bill']))

# Here no indices are specified (by default they are numerical)
a_series = pd.Series([1, 15, -5, None, 4, 123, 0, 78, 0, 1, -4])
a_series

<class 'pandas.core.series.Series'>


0       1.0
1      15.0
2      -5.0
3       NaN
4       4.0
5     123.0
6       0.0
7      78.0
8       0.0
9       1.0
10     -4.0
dtype: float64

In [14]:
# Accessing a certain value via the index
a_series[0]

1.0

In [4]:
# .values returns a numpy.ndarray of the values
print(a_series.values)
# .index returns the index of the series as an Index object
print(type(a_series.index))
# .index.values returns the index values of the series in a numpy array
print(a_series.index.values)  # default index values are 0 to (number of entries - 1)

[  1.  15.  -5.  nan   4. 123.   0.  78.   0.   1.  -4.]
<class 'pandas.core.indexes.range.RangeIndex'>
[ 0  1  2  3  4  5  6  7  8  9 10]


In [24]:
# You can overwrite the index directly: 
a_series.index = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k"]
a_series.index.values

array(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k'],
      dtype=object)

Accessing values via the index can be very useful, but sometimes you want to access the values by their order in the series as if it was a Python list. In other words "I want the first value!", without having to know the name of the label. You can do this with `.iloc`:


In [25]:
a_series.iloc[0], a_series["a"], a_series.iloc[-1]


(1.0, 1.0, -4.0)

In [28]:
# This just resets the index to default values
a_series = a_series.reset_index(drop = True)

# now if I sort the values from smaller to greater:
x = a_series.sort_values()
print(x)
x[0], x.iloc[0] 
# the indices remain, so x indexed by 0 is 1
# the ordering changes, so the first element of x is -5.0, the minimum

2      -5.0
10     -4.0
6       0.0
8       0.0
0       1.0
9       1.0
4       4.0
1      15.0
7      78.0
5     123.0
3       NaN
dtype: float64


(1.0, -5.0)

`.index` and `.values` are attributes of `pandas` series. Some other attributes and methods that are worth highlighting:

*   `.map`
*   `.corr`
*   `.describe`
*   `.hist`
*   `.plot`
*   `.size`
*   `.value_counts`
*   `.sort_values`

For example:

In [33]:
a_series.describe()

count     10.000000
mean      21.300000
std       43.410316
min       -5.000000
25%        0.000000
50%        1.000000
75%       12.250000
max      123.000000
dtype: float64

### Operations with `Series`

`Series` are based on `numpy` arrays. Like arrays, we can operate on series element-wise. The result is another series with data type depending on the type of operations performed.


In [5]:
series1 = pd.Series([1, 3, 5, 7])
series2 = pd.Series([0, 10, -1, 6])

series3 = 2 * series1 + abs(series2)
series4 = series1 > series2 

print(series3)
print(series4)
# Take a look at the different Series objects!

0     2
1    16
2    11
3    20
dtype: int64
0     True
1    False
2     True
3     True
dtype: bool


What goes on in the previous example is more subtle than it looks. How does `pandas` know which elements of each series to combine in the operation? The answer is that it matches them by index, and here the indices happened to be the same. So when we ask something like

`series3 = series1 + series2`,

`pandas` looks for entries in each series with the same index label, sums them elementwise, and stores each result under that same label in `series3`.

Consider instead the following example:


In [9]:
series1 = pd.Series([1, 10], index=["A", "B"])
series2 = pd.Series([4, -1], index=["C", "D"])
series3 = series1 + series2
print(series3)

A   NaN
B   NaN
C   NaN
D   NaN
dtype: float64


This aspect makes it very easy to work with series that we have sorted or manipulated otherwise; there is always the address to access a value. This helps prevent accidentally combining values we didn't mean to combine!
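A minimal sketch of this label alignment, using made-up series (`s1`, `s2`, `s3` are names introduced only for this illustration). It also shows `Series.add` with `fill_value`, one possible way to treat a label missing on one side as a chosen value instead of producing `NaN`.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["c", "a", "b"])  # same labels, different order

total = s1 + s2   # aligned by label, not by position
print(total)

# Partially overlapping labels: unmatched labels give NaN ...
s3 = pd.Series([100, 200], index=["a", "z"])
partial = s1 + s3
print(partial)

# ... unless a fill value stands in for the missing side
filled = s1.add(s3, fill_value=0)
print(filled)
```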

In [37]:
# accessing by list of index labels
a_series.index = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K"]
x = a_series[["A", "K"]]

In [38]:
a_series

A      1.0
B     15.0
C     -5.0
D      NaN
E      4.0
F    123.0
G      0.0
H     78.0
I      0.0
J      1.0
K     -4.0
dtype: float64

In [39]:
x

A    1.0
K   -4.0
dtype: float64

Notice the index of x is a SUBSET of the index of "a_series". 
This can be useful when needing to relate values back to the original "a_series"!

In [6]:
# Getting a boolean-valued series by checking a condition
choose = (a_series == 0.0)
choose

0     False
1     False
2     False
3     False
4     False
5     False
6      True
7     False
8      True
9     False
10    False
dtype: bool

We often use boolean masks to filter data in Pandas. We also get special boolean algebra operators to use in `numpy`/`pandas`, distinct from the and/or/not you will use in regular Python: `&` (AND), `|` (OR), `~` (NOT).


In [7]:
x = a_series[choose]
print(x)
# or the complement
a_series[~choose]

6    0.0
8    0.0
dtype: float64


0       1.0
1      15.0
2      -5.0
3       NaN
4       4.0
5     123.0
7      78.0
9       1.0
10     -4.0
dtype: float64
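The three operators `&`, `|` and `~` can be combined as in the following sketch. The series `values` repeats the numbers used earlier; the mask names are made up for this illustration.

```python
import pandas as pd

values = pd.Series([1, 15, -5, None, 4, 123, 0, 78, 0, 1, -4])

positive = values > 0        # comparisons with NaN evaluate to False
small = values.abs() < 10

both = values[positive & small]         # positive AND small
either = values[positive | small]       # positive OR small
neither = values[~(positive | small)]   # NOT (positive OR small)

print(both)
print(neither)
```

When the conditions are written inline, each one needs its own parentheses, e.g. `values[(values > 0) & (values < 10)]`, because `&` and `|` bind more tightly than comparison operators.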

### Missing values

A series object in `pandas` can help us deal with missing data.

In [10]:
print(series1)
print(series2)
series3 = series1 + series2
print(series3)

A     1
B    10
dtype: int64
C    4
D   -1
dtype: int64
A   NaN
B   NaN
C   NaN
D   NaN
dtype: float64


What happened here is that no labels could be matched between the two series, so for every label `pandas` summed a numeric value with a missing value, and the result is a missing value.
The way to manually specify in `pandas` that a value is missing is to use `None`, as below:

In [11]:
temp = pd.Series([1, None, 2])
print(temp)

0    1.0
1    NaN
2    2.0
dtype: float64


`pandas` coerces `None` values to `NaN` ("Not a Number") values. We can create boolean masks on the basis of such values. To identify `NaN` or `None` values in a `Series`, use either of the two equivalent methods `.isna()` and `.isnull()`; their opposites are `.notna()` and `.notnull()`.

In [10]:
print(temp.isna())
print(temp.isnull())

0    False
1     True
2    False
dtype: bool
0    False
1     True
2    False
dtype: bool


In [11]:
temp.notna()
temp.notnull()

0     True
1    False
2     True
dtype: bool
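As a sketch of what such masks are typically used for, the snippet below filters out, drops, fills and counts missing values. `dropna` and `fillna` are standard `Series` methods not introduced above; the fill value `0` is an arbitrary choice for this illustration.

```python
import pandas as pd

temp = pd.Series([1, None, 2])

observed = temp[temp.notna()]   # keep only non-missing entries via a mask
dropped = temp.dropna()         # same result with a dedicated method
filled = temp.fillna(0)         # replace missing entries with a chosen value
n_missing = temp.isna().sum()   # True counts as 1, so this counts the NaNs

print(observed)
print(filled)
print(n_missing)
```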

### `DataFrame`

This is the `pandas` model for rectangular data. Operationally it is similar to a dictionary of `Series`: each column of the `DataFrame` is a `Series` object, and comes with all the attributes and methods of a `Series`. An implication is that the data type is common within each column, but it can differ across columns.

In [12]:
# Recall our "tips" dataset
tips.head(5)  # head returns the first rows of a data frame

   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4


In [13]:
# other important attributes: name of rows and columns
print(tips.shape)
print(tips.index)
print(tips.columns)

(244, 7)
RangeIndex(start=0, stop=244, step=1)
Index(['total_bill', 'tip', 'sex', 'smoker', 'day', 'time', 'size'], dtype='object')


To access the columns of a `DataFrame` there are two standard ways.

In [None]:
tips.tip  # As if it were an attribute (NOT recommended)
tips["size"]  # As if it were a named index (RECOMMENDED)

The latter is recommended because, for example, the column `size` cannot be accessed via the first option (it would be confused with the `DataFrame` attribute `.size`). The result of this operation is an object of type `Series`.
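The name clash can be seen on a tiny, hypothetical data frame (the values below are made up and merely mimic two columns of `tips`):

```python
import pandas as pd

# Hypothetical mini data set with a column called "size"
df = pd.DataFrame({"tip": [1.01, 1.66, 3.50], "size": [2, 3, 3]})

by_attribute = df.tip   # works, because "tip" clashes with nothing
by_key = df["size"]     # always returns the column
clash = df.size         # NOT the column: the number of cells (rows * columns)

print(by_key.tolist())
print(clash)
```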

We can access several columns at a time by supplying a list of column names. The result is a dataframe with the same index as the original and the chosen subset of columns.

In [None]:
tips[["tip", "size", "sex"]].head(5)

Similarly, you can access rows instead of columns, using their index labels with `.loc` or their integer positions with `.iloc`.

*   Using a list of index labels: `tips.loc[ [index1, index2, ...] ]`
*   Using a list of integer index location (i-loc): `tips.iloc[ [integer1, integer2, ...] ]`



In [None]:
# Accessing rows AND columns!
# Example of 2-dimension loc
tips.loc[[1, 3], ['sex', 'smoker']]

In [14]:
# Accessing rows AND columns!
# Example of 2-dimensional iloc
tips.iloc[[1, 3], 2:5]

    sex smoker  day
1  Male     No  Sun
3  Male     No  Sun
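In `tips` the index labels coincide with the integer positions, which hides the difference between `.loc` and `.iloc`. The sketch below uses a hypothetical data frame whose labels (10, 20, 30) deliberately differ from the positions (0, 1, 2).

```python
import pandas as pd

df = pd.DataFrame(
    {"sex": ["Female", "Male", "Male"], "day": ["Sun", "Sat", "Sun"]},
    index=[10, 20, 30],  # labels deliberately differ from positions
)

by_label = df.loc[20, "sex"]    # row labelled 20
by_position = df.iloc[1, 0]     # second row, first column
label_slice = df.loc[10:20]     # label slices include the end point
position_slice = df.iloc[0:1]   # position slices exclude it

print(by_label, by_position)
print(len(label_slice), len(position_slice))
```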


Note that certain operations are interchangeable: the 3rd element of column "sex" can be obtained in any of the following ways:

In [None]:
tips.sex[2]  # Access col as series, then the 3rd element of that
tips.loc[2, "sex"]  # Access the entry in DF by giving the index labels of row and col
tips.loc[2]["sex"]  # Accessing the whole row as a series, then using the column name as index label

As with `Series`, we can use a boolean-valued series to index a `DataFrame` provided that they share the same index labels. The simplest instance of this is to use a series produced as a boolean mask of a column of the dataframe. The output of this *filtering* operation is a dataframe with the subset of rows corresponding to the `True` values in the boolean mask.

In [17]:
# Creates a boolean series with the same index labels as the data frame tips
tips['sex'] == "Male"

0      False
1       True
2       True
3       True
4      False
       ...  
239     True
240    False
241     True
242     True
243    False
Name: sex, Length: 244, dtype: bool

In [16]:
tips[tips['sex'] == "Male"].head(5)

   total_bill   tip   sex smoker  day    time  size
1       10.34  1.66  Male     No  Sun  Dinner     3
2       21.01  3.50  Male     No  Sun  Dinner     3
3       23.68  3.31  Male     No  Sun  Dinner     2
5       25.29  4.71  Male     No  Sun  Dinner     4
6        8.77  2.00  Male     No  Sun  Dinner     2


In [18]:
tips[(tips['sex'] == "Male") & (tips['day'] == "Sun")].head(5)  # Multiple booleans ("&", "|", "~")

   total_bill   tip   sex smoker  day    time  size
1       10.34  1.66  Male     No  Sun  Dinner     3
2       21.01  3.50  Male     No  Sun  Dinner     3
3       23.68  3.31  Male     No  Sun  Dinner     2
5       25.29  4.71  Male     No  Sun  Dinner     4
6        8.77  2.00  Male     No  Sun  Dinner     2


A `DataFrame` comes with several methods for computing column-wise statistics and summaries. We highlight some of them:

*   `.boxplot` (check out the `by = ` option)
*   `.corr` and `.corrwith` (within and across `DataFrame`s)
*   `.dot`
*   `.mean/median/max/quantile/sum`, etc.
*   `.sample`
*   `.sort_values`
*   `.nunique` (and `.unique` on a single column, i.e. on a `Series`)


### `GroupBy`

This `DataFrame` method groups the `DataFrame` according to the values of a column, treating them as categorical values. It returns a groupby object.

In [19]:
# Group tips DataFrame by size of table
by_size = tips.groupby("size")
print(by_size)

# If we coerce it to a list, we see something interesting: 
# it's basically a list of tuples
# The first element is the "category" variable, the second
# is a dataframe. 
list(by_size)

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7efd26d9a690>


[(1,      total_bill   tip     sex smoker   day    time  size
  67         3.07  1.00  Female    Yes   Sat  Dinner     1
  82        10.07  1.83  Female     No  Thur   Lunch     1
  111        7.25  1.00  Female     No   Sat  Dinner     1
  222        8.58  1.92    Male    Yes   Fri   Lunch     1),
 (2,      total_bill   tip     sex smoker   day    time  size
  0         16.99  1.01  Female     No   Sun  Dinner     2
  3         23.68  3.31    Male     No   Sun  Dinner     2
  6          8.77  2.00    Male     No   Sun  Dinner     2
  8         15.04  1.96    Male     No   Sun  Dinner     2
  9         14.78  3.23    Male     No   Sun  Dinner     2
  ..          ...   ...     ...    ...   ...     ...   ...
  237       32.83  1.17    Male    Yes   Sat  Dinner     2
  240       27.18  2.00  Female    Yes   Sat  Dinner     2
  241       22.67  2.00    Male    Yes   Sat  Dinner     2
  242       17.82  1.75    Male     No   Sat  Dinner     2
  243       18.78  3.00  Female     No  Thur  Di

In [20]:
list(tips.groupby("sex"))

[('Female',      total_bill   tip     sex smoker   day    time  size
  0         16.99  1.01  Female     No   Sun  Dinner     2
  4         24.59  3.61  Female     No   Sun  Dinner     4
  11        35.26  5.00  Female     No   Sun  Dinner     4
  14        14.83  3.02  Female     No   Sun  Dinner     2
  16        10.33  1.67  Female     No   Sun  Dinner     3
  ..          ...   ...     ...    ...   ...     ...   ...
  226       10.09  2.00  Female    Yes   Fri   Lunch     2
  229       22.12  2.88  Female    Yes   Sat  Dinner     2
  238       35.83  4.67  Female     No   Sat  Dinner     3
  240       27.18  2.00  Female    Yes   Sat  Dinner     2
  243       18.78  3.00  Female     No  Thur  Dinner     2
  
  [87 rows x 7 columns]),
 ('Male',      total_bill   tip   sex smoker  day    time  size
  1         10.34  1.66  Male     No  Sun  Dinner     3
  2         21.01  3.50  Male     No  Sun  Dinner     3
  3         23.68  3.31  Male     No  Sun  Dinner     2
  5         25.29  4.

In [21]:
# We can iterate through the groupby just like we would through a list of tuples!
for sex, data in tips.groupby("sex"):
    print(sex)
    print(data.mean(numeric_only=True))  # average the numeric columns only


Female
total_bill    18.056897
tip            2.833448
size           2.459770
dtype: float64
Male
total_bill    20.744076
tip            3.089618
size           2.630573
dtype: float64




We `groupby` to perform *some* operation on each group, to *map* over the groups, applying a function to each element. Very often this function is itself an aggregation (*reduction*). We want to somehow aggregate each group into a value or set of values that *describe* it.

To apply functions to each element of a `groupby`, we use `.apply`:


In [None]:
# Get the maximum bill by gender: 
def max_bill(df):
    return df['total_bill'].max()

tips.groupby("sex").apply(max_bill)

Many aggregation functions that exist on Series and `DataFrames` (`mean`, `max`, `min`, etc.) can be called directly via the groupby object:

In [None]:
print(tips.groupby("sex").max())
print(tips.groupby("sex").mean(numeric_only=True))  # numeric columns only

We can actually `groupby` more than one column:

In [None]:
tips.groupby(["sex", "day"])['tip'].mean()
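Since the cells above are shown without output, here is a self-contained sketch of the same group-and-aggregate pattern. The data frame is hypothetical (made-up values loosely mimicking the `tips` columns), and `.agg` with a list of function names is one possible way to compute several summaries at once.

```python
import pandas as pd

# Hypothetical data, loosely mimicking the "tips" columns
df = pd.DataFrame({
    "sex": ["Female", "Male", "Male", "Female", "Male"],
    "day": ["Sun", "Sun", "Sat", "Sat", "Sat"],
    "total_bill": [10.0, 20.0, 30.0, 40.0, 50.0],
    "tip": [1.0, 2.0, 3.0, 4.0, 5.0],
})

max_bill = df.groupby("sex")["total_bill"].max()         # one value per group
mean_tip = df.groupby(["sex", "day"])["tip"].mean()      # one value per (sex, day) pair
summary = df.groupby("sex")["tip"].agg(["mean", "max"])  # several summaries at once

print(max_bill)
print(mean_tip)
print(summary)
```

Grouping by two columns returns a series indexed by pairs, so a single value is accessed with a tuple, e.g. `mean_tip[("Male", "Sat")]`.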

### Combining `DataFrame`s

There are many ways to combine various `DataFrame`s into a new one, extending in many ways what we already saw for operations on `Series`. The main ways of doing this are:

*   **Concatenate**: paste row-wise or column-wise, taking action on `NaN`s (this works more on the rectangular structure of the data)
*   **Merge**: combine `DataFrame`s using a common piece of information, e.g. an identifier column (this works more as a database operation)



**Concatenate**

In [22]:
# Concatenate
df1 = pd.DataFrame({"A": pd.Series([1, 2, 3]), "B": pd.Series([4, 5, 6])})
df2 = pd.DataFrame({"A": pd.Series([4]), "C": pd.Series([7])})
pd.concat([df1, df2], axis = 0)
# axis: 0 for pasting below, 1 for pasting on the side

   A    B    C
0  1  4.0  NaN
1  2  5.0  NaN
2  3  6.0  NaN
0  4  NaN  7.0


Concatenation is mostly used when the rows index or columns index is shared.  
For example, you might have data with the same columns and want to concatenate them on axis 0:

In [28]:
df1 = pd.DataFrame({"A": pd.Series([1, 2, 3]), "B": pd.Series([4, 5, 6])})
df2 = pd.DataFrame({"A": pd.Series([4]), "B": pd.Series([7])})
df3 = pd.concat([df1, df2], axis = 0)
df3

   A  B
0  1  4
1  2  5
2  3  6
0  4  7


Note what happened to the index of our concatenated dataframe above: we might want to reset it.

In [32]:
df3.reset_index()

   index  A  B
0      0  1  4
1      1  2  5
2      2  3  6
3      0  4  7
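Resetting the index keeps the old labels as a new column called `index`. If the old labels are not needed, two common alternatives are sketched below: `reset_index(drop=True)`, or asking `concat` to renumber the rows directly with `ignore_index=True`.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
df2 = pd.DataFrame({"A": [4], "B": [7]})

stacked = pd.concat([df1, df2], axis=0)   # index is 0, 1, 2, 0
clean = stacked.reset_index(drop=True)    # discard the old labels
direct = pd.concat([df1, df2], axis=0, ignore_index=True)  # renumber while concatenating

print(stacked.index.tolist())
print(clean)
```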


Similarly, you might have data with the same rows and different columns:

In [31]:
df1 = pd.DataFrame({"A": pd.Series([1, 2, 3]), "B": pd.Series([4, 5, 6])})
df2 = pd.DataFrame({"C": pd.Series([4,5,8]), "D": pd.Series([7,12,1])})
pd.concat([df1, df2], axis = 1)

   A  B  C   D
0  1  4  4   7
1  2  5  5  12
2  3  6  8   1


Note what happens if the rows do not align and you concatenate on axis 1 (side by side):

In [30]:
df1 = pd.DataFrame({"A": pd.Series([1, 2, 3]), "B": pd.Series([4, 5, 6])})
df2 = pd.DataFrame({"C": pd.Series([7]), "D": pd.Series([10])})
pd.concat([df1, df2], axis = 1)

   A  B    C     D
0  1  4  7.0  10.0
1  2  5  NaN   NaN
2  3  6  NaN   NaN


Taking action on `NaN`s that may appear as a result of `concat` is done by setting the `join` argument to `inner` or `outer` (it defaults to `outer`).

In [27]:
df1 = pd.DataFrame({"A": pd.Series([1, 2, 3]), "B": pd.Series([4, 5, 6])})
df2 = pd.DataFrame({"A": pd.Series([4]), "C": pd.Series([7])})
print(pd.concat([df1, df2], axis = 1, join = "outer"))  # "outer" join
print(pd.concat([df1, df2], axis = 1, join = "inner"))  # "inner" join

   A  B    A    C
0  1  4  4.0  7.0
1  2  5  NaN  NaN
2  3  6  NaN  NaN
   A  B  A  C
0  1  4  4  7


**Merge**

`merge` is commonly used when your two `DataFrame`s are connected but do not share an index or columns. With `merge` we connect two `DataFrame`s on some common piece of information, e.g. a common column. To tell `merge` what to do with `NaN`s we specify a type of `join`, just as with `concat`. There are four types of `join` operations:

*   `inner`-join: **intersection** of *keys*
*   `outer`-join: **union** of *keys*
*   `left`-join: use *keys* from **left only**
*   `right`-join: use *keys* from **right only**

In [37]:
df1 = pd.DataFrame({"A": pd.Series([1, 2, 3]), "B": pd.Series([4, 5, 6])})
df2 = pd.DataFrame({"A": pd.Series([3, 4]), "C": pd.Series([7, 8])})
print(df1)
print('\n')
print(df2)

   A  B
0  1  4
1  2  5
2  3  6


   A  C
0  3  7
1  4  8


In [38]:
# Merging, "on" defines on what piece of information the DataFrame's will merge
pd.merge(df1, df2, on = 'A', how = 'right')  # if column names differ, use "left_on" and "right_on"

   A    B  C
0  3  6.0  7
1  4  NaN  8


In [39]:
pd.merge(df1, df2, on = 'A', how = 'left')

   A  B    C
0  1  4  NaN
1  2  5  NaN
2  3  6  7.0


In [40]:
pd.merge(df1, df2, on = 'A', how = 'outer')

   A    B    C
0  1  4.0  NaN
1  2  5.0  NaN
2  3  6.0  7.0
3  4  NaN  8.0


In [41]:
pd.merge(df1, df2, on='A', how = 'inner')

   A  B  C
0  3  6  7


Note that when merging on columns, the `DataFrame` indexes are ignored and a new index is created.
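As a sketch of the `left_on`/`right_on` variant mentioned in the comment above, the example below merges two hypothetical tables whose key columns are named differently (`id` and `key` are made-up names). It also passes `indicator=True`, which adds a `_merge` column recording where each row came from.

```python
import pandas as pd

# Hypothetical tables whose key columns have different names
left = pd.DataFrame({"id": [1, 2, 3], "B": [4, 5, 6]})
right = pd.DataFrame({"key": [3, 4], "C": [7, 8]})

merged = pd.merge(left, right, left_on="id", right_on="key",
                  how="outer", indicator=True)
print(merged)

# Rows found in both tables are exactly those an inner join would keep
matched = merged[merged["_merge"] == "both"]
print(matched)
```

Both key columns are kept in the result, since their names differ.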

---
## String manipulation

### Simple operations

You may find yourself working with text data, a.k.a. strings. 

Strings are actually iterables, just like lists. They can be sliced in the same way:

In [None]:
x = "one python string"
x[4:10]

In [43]:
# You can also turn a string into a list of strings via the "split" method:
x = "one python string"
y = x.split(" ")
y == ["one", "python", "string"]

True

In [44]:
# The reverse is also possible via the "join" method:
space = " "
z = space.join(y)
z == x

True

You can also make everything lower (or upper) case, replace certain substrings with other substrings, and check for the existence of a substring with `in`:

In [None]:
z = "My Python String"
z.lower()

In [None]:
z.upper()

In [45]:
w = z.replace("Python", "R")
print(w)
"Python" in w

My R String


False

There are many more easy-to-use, built-in tools for working with text data in Python. You can read more here:

*   https://docs.python.org/3/library/stdtypes.html#string-methods

### Regular expressions

A *regular expression* is a special sequence of characters that helps you match or find other strings or sets of strings, using a specialized syntax held in a pattern. Python uses the module `re` to deal with them.

Let's pretend we have an `html` text. Find **all** words tagged as bold (`<b>word</b>`) in `html`, turn them to italics (`<i>word</i>`) and add the word "freaking" before them.

In [52]:
import re

a_text = 'Why am I taking this <b>course</b>? I want to go back to the <b>holidays</b>.'

# Prefix 'r' marks a raw string: backslashes are passed to the regex engine unchanged
the_regex = r'<b>([a-z\s]+)</b>'
replacement = r'<i>freaking \1</i>' #\1 is for group 1, we are asking to replace by the matched string

# Function to replace a regular expression with another
re.sub(the_regex, replacement, a_text)

'Why am I taking this <i>freaking course</i>? I want to go back to the <i>freaking holidays</i>.'

Some helpful functions for working with regular expressions in `re`:

*  `re.search(pattern, string)`: scan through `string` looking for locations where the regular expression `pattern` produces a match, and return a corresponding match object.
*  `re.match(pattern, string)`: if zero or more characters at the beginning of `string` match the regular expression `pattern`, return a corresponding match object.
*  `re.split(pattern, string)`: split `string` by the occurrences of `pattern`
*  `re.sub(pattern, repl, string)`: return the string obtained by replacing the leftmost non-overlapping occurrences of `pattern` in `string` by the replacement `repl`
*  `re.findall(pattern, string)`: return *all* non-overlapping occurrences of `pattern` in `string` as a list

Regular expressions are relatively complex and there is a long list of combinations we can make, which are well outside the scope of this introduction. We limit ourselves to mentioning some special characters used to form regular expressions:

*  `^`: Start of string
*  `$`: End of string
*  `.`: One character, no matter which (except line break)
*  `*`: match 0 or more repetitions of the preceding RE
*  `+`: match 1 or more repetitions of the preceding RE
*  `?`: match 0 or 1 repetition of the preceding RE
*  `{m,n}`: Causes the resulting RE to match from m to n repetitions of the preceding RE. The comma and the n are optional depending on the case
*  `|`: OR. E.g. A|B means match RE A or B
*  `(...)`: Matches whatever regular expression is inside the parentheses
*  `[]`: Defines a subset of characters to match
*  `\`: Either escapes special characters (permitting you to match characters like '*', '?', and so forth), or signals a special sequence (e.g. \s means a white space).
*  `\w`: Matches a single word character (a letter, a digit or an underscore; for ASCII text, equivalent to `[a-zA-Z0-9_]`)
*  `\W`: Matches a single non-word character (for ASCII text, equivalent to `[^a-zA-Z0-9_]`)
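The functions and special characters listed above can be tried out on a short, made-up sentence. The sketch also uses `\d`, a special sequence not listed above that matches a single digit.

```python
import re

text = "Order 66 shipped on 2023-03-15, order 7 is pending."

first_number = re.search(r"\d+", text).group()   # first match anywhere in the string
at_start = re.match(r"\d+", text)                # None: the text does not START with digits
all_numbers = re.findall(r"\d+", text)           # every non-overlapping match
date_parts = re.search(r"(\d{4})-(\d{2})-(\d{2})", text).groups()  # one entry per group
words = re.split(r"\W+", "one, two;three")       # split on runs of non-word characters

print(first_number)
print(at_start)
print(all_numbers)
print(date_parts)
print(words)
```

`re.search` and `re.match` return `None` when there is no match, so in real code it is safer to check the result before calling `.group()`.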

---
## Some advanced concepts


### Copy vs. assignment

Look at the following example:

In [59]:
a = [1, 2, 5]
b = a
b[2] = 10

In [60]:
print(a)
print(b)

[1, 2, 10]
[1, 2, 10]


What happens is that really `a` and `b` point to the same place in the memory and share the same data.

In [61]:
print(id(a))
print(id(b))

139625708448256
139625708448256


The way to create an object that *copies* the data in `a` but does not *share* it with `a` is to use the method `copy()`.

In [62]:
a = [1, 2, 5]
b = a.copy()
print(id(a))
print(id(b))

139625708445936
139625708446096


In [63]:
b[2] = 10
print(a)
print(b)

[1, 2, 5]
[1, 2, 10]


This goes a bit deeper into the concept of how `python` uses memory in your computer, but it is important to have a grasp of it in order to avoid potential problems when memory becomes an issue.

Note that the behaviour we observed above for lists would not be the same for integers. For example:

In [86]:
a = 2
b = a
a += 1
print(a)
print(b)
print(id(a))
print(id(b))

3
2
11126752
11126720


Things also become a little trickier when you deal with lists of lists. 

In [79]:
a = [1, [2, 3], 5, 'abc'] 
b = a.copy()
b[1].append(100)

In [77]:
print(a)
print(b)

[1, [2, 3, 100], 5, 'abc']
[1, [2, 3, 100], 5, 'abc']


In [78]:
print(id(a))
print(id(b))

139625708513232
139625708515072


What happened here is that `copy()` makes a *shallow* copy: `a` and `b` are different lists, but both still point to the same inner list `[2, 3]`. For this reason there is also `deepcopy`, which copies nested objects as well.

In [74]:
# Try now 
from copy import deepcopy
a = [1, [2, 3], 5, 'abc'] 
b = deepcopy(a)
b[1].append(100)
print(a)
print(b)

[1, [2, 3], 5, 'abc']
[1, [2, 3, 100], 5, 'abc']


In [75]:
print(id(a))
print(id(b))

139625708411552
139625708427376


To conclude, be careful when you write statements such as `name1 = name2`.
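Instead of comparing `id` values by eye, the `is` operator tests directly whether two names refer to the same object. The sketch below summarises the three cases (assignment, shallow copy, deep copy); the names `alias`, `shallow` and `deep` are chosen only for this illustration.

```python
from copy import deepcopy

a = [1, [2, 3], 5]
alias = a            # same object under a second name
shallow = a.copy()   # new outer list, shared inner list
deep = deepcopy(a)   # new outer list, new inner list

print(alias is a)           # True
print(shallow is a)         # False
print(shallow[1] is a[1])   # True: the inner list is shared
print(deep[1] is a[1])      # False
```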

### Error handling

Sometimes problems come up in the code that make it not function properly. When that happens and no error is raised, the problem could be anywhere, which makes it very hard to fix.

Setting up informative *errors* for when something is not working well, on the other hand, tells us where things go wrong and points us to what we need to fix.

In [None]:
# Errors in Python are called Exceptions. 
e = Exception()
type(e)

In [None]:
# Exceptions are raised with the "raise" keyword: 
raise Exception('Oops!')

Every *part* of your program, for example each function, must be in charge of things going as expected inside its body. If something goes wrong, it should tell us what happened.

One way to do this is to check for possible problems before they occur:

In [None]:
def age_a_person(person):
    if not hasattr(person, 'age'):
        raise Exception(f'The person must have an age attribute! Given: {person}')
    return person.age + 1

age_a_person('notaperson')

Python encourages a different style: first try the operation, then *catch* any expected errors that occur and handle them. By *catching an error* we mean the following:


In [None]:
def age_a_person(person):
    try:
        return person.age + 1
    except AttributeError as e:
        raise Exception(f'The person must have an age attribute! Given: {person}') from e

age_a_person('notaperson')
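The cell above ends by raising an error. To see both the successful path and a caller that handles the failure, here is a self-contained sketch. The `Person` class is hypothetical, and the sketch raises the more specific built-in `ValueError` instead of the generic `Exception`; that is a choice made for this illustration, not something the code above requires.

```python
class Person:
    """Hypothetical class, used only for this illustration."""

    def __init__(self, age):
        self.age = age


def age_a_person(person):
    try:
        return person.age + 1
    except AttributeError as e:
        # Re-raise with a clearer message, keeping the original error as the cause
        raise ValueError(f"The person must have an age attribute! Given: {person!r}") from e


print(age_a_person(Person(30)))   # works: the object has an "age"

try:
    age_a_person("notaperson")    # fails: a string has no "age"
except ValueError as err:
    message = str(err)
    print("Handled:", message)
```

Catching a specific exception type means unrelated errors are not silently swallowed by the same `except` clause.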

### Default values in functions

In `python` we get to assign default values to inputs of functions. For example:

In [88]:
# New function "f"
def f(a = 1, b = 2):
    return a + b


In [89]:
# Function "f" can validly be called in the following ways
print(f())
print(f(10))
print(f(b = 4))
print(f(10, 4))
print(f(a = 10, b = 4))
print(f(b = 4, a = 10))

3
12
5
14
14
14


In [93]:
# But NOT like this
f(a = 10, 4)

SyntaxError: positional argument follows keyword argument

In [91]:
# however this works
f(10, b = 4)

14

Once you have passed an argument by name, you cannot pass a later argument without its name: positional arguments must come before keyword arguments.

---
## Particular topics

### Date and time objects

We now introduce how to deal with time and date objects in Python. To do this, we need a module called `datetime`, which contains the class `datetime`.

In [94]:
from datetime import datetime  # This is a module and a class

# Current date
current_time = datetime.now()
print(current_time)
print(type(current_time))

2022-10-25 16:04:27.514660
<class 'datetime.datetime'>


`current_time` is of class `datetime`, but this is just one of several time-object classes in the module. The four we use here are:

*   `datetime`: allows working with times and dates together (year, month, day, hour, minute, second, microsecond).
*   `date`: works with dates only (year, month, day), independent of time. 
*   `time`: works with time only (hour, minute, second, microsecond), independent of date. 
*   `timedelta`: a duration of time, used for measuring the distance between two time points.

The most common scenario with time data is converting from and to regular strings, which is the format we will most often encounter when importing times and dates. Two functions do this: `strptime()` (string to time object) and `strftime()` (time object to string). We start with `strptime()`:

In [96]:
today_date = '2022-01-04'

# Create date object in given time format yyyy-mm-dd
today_date = datetime.strptime(today_date, '%Y-%m-%d')

print(today_date)
print(type(today_date))

2022-01-04 00:00:00
<class 'datetime.datetime'>


Note we used the *pattern* `%Y-%m-%d` to describe the year-month-day format of the string being parsed. A full list of available format codes can be found in the library's [documentation](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior).

In [None]:
print('* Month:', today_date.month)  # Get month from date
print('* Year:', today_date.year)  # Get year from date
print('* Day of month:', today_date.day)  # ...
print('* Day of Week (number):', today_date.weekday())  # Monday is 0 (recall zero-based indexing)

In [None]:
print('* Hour: ', current_time.hour)
print('* Minute: ', current_time.minute)
print(current_time.isocalendar())  # Returns (year, # week, # day)

As mentioned, you can convert from a time object to a string with the method `strftime()`. Its `format` argument sets the date format of the resulting string.

In [99]:
today_str = today_date.strftime('%d-%m-%Y')  # Day-month-year format
print(type(today_str))
print(today_str)

<class 'str'>
04-01-2022


Recall that to measure time spans, or to add and subtract dates or times, we can use `timedelta` objects. These objects need not be anchored to a specific date: they can represent a generic time span.

In [102]:
from datetime import timedelta

# timedelta objects
three_weeks = timedelta(weeks = 3)
one_year = timedelta(days = 365)  # Approximation: ignores leap years

print(three_weeks)
print(type(three_weeks))
print(three_weeks.days)
print(one_year.days)

21 days, 0:00:00
<class 'datetime.timedelta'>
21
365
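
Note that `.days` only reports the whole-day component of a span. The following sketch, using a hypothetical span chosen for illustration, shows how to read the remaining seconds and how to express the full duration in a single unit.

```python
from datetime import timedelta

# Hypothetical span, chosen only for illustration
span = timedelta(weeks=3, hours=5, minutes=30)

# A timedelta is stored as days, seconds and microseconds
print(span.days, span.seconds)

# total_seconds() expresses the whole span in seconds
total = span.total_seconds()
print(total)

# Dividing by another timedelta gives a plain float, here a number of hours
hours = span / timedelta(hours=1)
print(hours)
```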


Let us now operate on these objects.

In [103]:
from datetime import datetime, timedelta

# Current time
now = datetime.now()
print("Today's date: ", str(now))

# Add three weeks to current date
now_in_3weeks = now + three_weeks
print('Date after three weeks: ', now_in_3weeks)

# Subtract one year from current date
one_year_ago = now - one_year
print('Date one year ago: ', one_year_ago)
print(type(one_year_ago))

Today's date:  2022-10-25 16:09:12.213578
Date after three weeks:  2022-11-15 16:09:12.213578
Date one year ago:  2021-10-25 16:09:12.213578
<class 'datetime.datetime'>


In [105]:
from datetime import date

# Create two dates
date1 = date(2011, 5, 28)
date2 = date(2015, 6, 6)
# Create two datetimes with year, month, day, hour, minute, and second
date1b = datetime(2011, 5, 28, 23, 1, 0)
date2b = datetime(2015, 6, 6, 22, 52, 10)

# Difference between two dates
date_diff = date2 - date1
date_diffb = date2b - date1b
print("Time difference (days): ", date_diff.days)
print("Time difference: ", date_diffb)
print(type(date_diff))

Time difference (days):  1470
Time difference:  1469 days, 23:51:10
<class 'datetime.timedelta'>


In [106]:
# To work with time zones:
from pytz import timezone

# Create timezone US/Eastern
est = timezone('US/Eastern')

# Attach the US/Eastern time zone to a naive datetime
loc_time = est.localize(datetime(2015, 6, 6, 22, 52, 10))
print(loc_time)

2015-06-06 22:52:10-04:00
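
Once a datetime carries a time zone, it can be converted to another one with `astimezone()`. The sketch below uses only the standard library and, as an assumption made to keep it self-contained, replaces the `pytz` zone with a fixed UTC-4 offset; a fixed offset does not follow daylight saving changes, so it is not a full substitute for `US/Eastern`.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC-4 offset, an assumption standing in for US/Eastern in summer;
# unlike a pytz time zone it does not follow daylight saving changes
utc_minus_4 = timezone(timedelta(hours=-4))

loc_time = datetime(2015, 6, 6, 22, 52, 10, tzinfo=utc_minus_4)
utc_time = loc_time.astimezone(timezone.utc)  # same instant, expressed in UTC

print(loc_time)
print(utc_time)
```

Both objects describe the same instant, so they compare as equal even though they print differently.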


You can also work with time objects using `pandas`, which stores them as `Timestamp` and `Timedelta` objects. Two functions handle the conversion:

*  `to_datetime()`: converts strings (or other date-like input) to `pandas` datetime objects.
*  `to_timedelta()`: converts strings or numbers to time differences (`Timedelta` objects).
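
As a small sketch of both functions on a whole column at once, the example below converts a hypothetical `Series` of strings. The `format` and `errors` arguments are optional; here `errors='coerce'` is used so that an unparseable entry becomes `NaT` instead of raising an error.

```python
import pandas as pd

# Hypothetical raw strings, including one that is not a valid date
raw = pd.Series(['2015-06-06', '2015-06-07', 'not a date'])

# format states the expected pattern; errors='coerce' turns failures into NaT
dates = pd.to_datetime(raw, format='%Y-%m-%d', errors='coerce')
print(dates)

# to_timedelta() accepts strings such as '3 days' or a number plus a unit
shift = pd.to_timedelta('3 days')
print(dates + shift)
```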


In [107]:
import pandas as pd
import numpy as np

# String to datetime
good_date = pd.to_datetime("6th of June, 2015")
print(good_date)

# Create a date series by adding offsets of 0 to 11 days, built with numpy and to_timedelta()
date_series = good_date + pd.to_timedelta(np.arange(12), 'D')
print(date_series)

# Create date series using date_range() function
date_series = pd.date_range('06/06/2015', periods = 12, freq = 'D')
print(date_series)

2015-06-06 00:00:00
DatetimeIndex(['2015-06-06', '2015-06-07', '2015-06-08', '2015-06-09',
               '2015-06-10', '2015-06-11', '2015-06-12', '2015-06-13',
               '2015-06-14', '2015-06-15', '2015-06-16', '2015-06-17'],
              dtype='datetime64[ns]', freq=None)
DatetimeIndex(['2015-06-06', '2015-06-07', '2015-06-08', '2015-06-09',
               '2015-06-10', '2015-06-11', '2015-06-12', '2015-06-13',
               '2015-06-14', '2015-06-15', '2015-06-16', '2015-06-17'],
              dtype='datetime64[ns]', freq='D')


In [108]:
# Create a DataFrame with date as a column
data = pd.DataFrame()
data['date'] = date_series
data.head()

        date
0 2015-06-06
1 2015-06-07
2 2015-06-08
3 2015-06-09
4 2015-06-10


In [None]:
# .dt gives access to the datetime properties of a Series that holds datetimes
# Extract year, month, day, hour, and minute; and assign to new columns
data['year'] = data['date'].dt.year
data['month'] = data['date'].dt.month
data['day'] = data['date'].dt.day
data['hour'] = data['date'].dt.hour
data['minute'] = data['date'].dt.minute
data.head()

### Web scraping and `html` parsing

Web *scraping* deals with the retrieval of information featured in a web page. It basically consists of reading the content at a URL and then *parsing* it to extract the information that you are looking for. This is an extensive topic in itself, so we will just introduce the basics to get you started.

Before anything else, note that most of the information in an HTML source is of little use to us, since it only serves to render and format the web page itself. It is therefore wise to familiarise yourself with HTML tags as and when needed. We must also consider some design aspects when creating a *spider* that will *crawl* the target web page and get us the desired content:

1.   Identify the tags that contain useful information
2.   Add randomised waiting periods between successive requests to the website
3.   Make use of logs to monitor progress
4.   Regularly write the collected data to an external file
5.   **Always** respect the `robots.txt` file defined by each website
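
Points 2, 3 and 5 of the list above can be sketched with the standard library alone. The `robots.txt` rules and the URLs below are hypothetical and are parsed offline; a real spider would instead download the file from the target site (for instance with `RobotFileParser.set_url()` followed by `.read()`) and wait for seconds rather than hundredths of a second.

```python
import logging
import random
import time
from urllib.robotparser import RobotFileParser

logging.basicConfig(level=logging.INFO, format='%(asctime)s %(message)s')

# Hypothetical robots.txt content, parsed offline for this example
robots_lines = [
    'User-agent: *',
    'Disallow: /private/',
]
robots = RobotFileParser()
robots.parse(robots_lines)

# Hypothetical URLs to visit
urls = [
    'https://example.org/wiki/Main_Page',
    'https://example.org/private/data',
]

allowed = []
for url in urls:
    if not robots.can_fetch('*', url):
        logging.info('Skipping (disallowed by robots.txt): %s', url)
        continue
    logging.info('Fetching: %s', url)
    allowed.append(url)  # the actual download would go here
    # Randomised pause; kept tiny here, a real crawler would wait seconds
    time.sleep(random.uniform(0.01, 0.05))

print(allowed)
```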

We use the English Wikipedia as an example. The goal is to retrieve the summary of Wikipedia's featured article along with all the URLs in it.

In [1]:
from bs4 import BeautifulSoup
import urllib.request

url = 'https://en.wikipedia.org/wiki/Main_Page'
# Download the raw HTML of the page
with urllib.request.urlopen(url) as url_content:
    html = url_content.read()

# BeautifulSoup parses HTML/XML into a tree that can be searched
soup = BeautifulSoup(html, 'html.parser')
print(soup)

<!DOCTYPE html>

<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>Wikipedia, the free encyclopedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"0866973d-23a0-4c51-a8d4-66c7ca9021e1","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Main_Page","wgTitle":"Main Page","wgCurRevisionId":1114291180,"wgRevisionId":1114291180,"wgArticleId":15580374,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":[],"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgRelevantPageName":"Main_Page","wgRelevantArticleId":15580374,"wgIsProbablyEditable":fal

The `prettify` method turns your soup into a more nicely formatted string: it is indented, with a separate line for each tag and each string.

In [3]:
print(type(soup))
print(soup.prettify()[0:5000])

<class 'bs4.BeautifulSoup'>
<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Wikipedia, the free encyclopedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":false,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"0866973d-23a0-4c51-a8d4-66c7ca9021e1","wgCSPNonce":false,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Main_Page","wgTitle":"Main Page","wgCurRevisionId":1114291180,"wgRevisionId":1114291180,"wgArticleId":15580374,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":[],"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgRelevantPageName":"Main_Page","wgRelevantA

The text is structured with HTML tags that are nested in a systematic way. This tree structure is referred to as the *HTML DOM* (Document Object Model).
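
To make the tree structure concrete, here is a minimal sketch on a hypothetical HTML snippet (not the Wikipedia page), showing how `BeautifulSoup` lets us move from a tag to its ancestors and to its direct children.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, used only to illustrate the tree structure
html_doc = '<html><body><p>Read the <a href="/wiki/E3">E3</a> page.</p></body></html>'
doc = BeautifulSoup(html_doc, 'html.parser')

link = doc.find('a')          # first <a> tag in the document
print(link.name)              # tag name
print(link.parent.name)       # tag that directly contains the link
print([parent.name for parent in link.parents])  # all ancestors, innermost first

# Direct children of <p>: plain strings and nested tags
for child in doc.p.children:
    print(repr(child))
```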

We are interested in grabbing the summary of the article of the day from the Wikipedia main page, along with all the links in it. So we will use `BeautifulSoup`'s functions to find the HTML tags that contain this information or lead us to it in the form of embedded URLs.

On inspection of the HTML source, we find that the article summary is enclosed in the first `p` tag (`p` stands for paragraph). The idea is as follows:

1.  Use `BeautifulSoup`'s `.find_all()` method to search for all tags of type `p`.
2.  Identify the correct paragraph and extract the URLs.
3.  Extract the text.



In [4]:
# 1. Search for all tags of the type p
paragraphs_list = soup.find_all(name = "p")
paragraphs_list

[<p><i><b><a href="/wiki/Mischief_Makers" title="Mischief Makers">Mischief Makers</a></b></i> is a 1997 <a href="/wiki/Side-scrolling_video_game" title="Side-scrolling video game">side-scrolling</a> platform video game, the first for the <a href="/wiki/Nintendo_64" title="Nintendo 64">Nintendo 64</a> <i>(pictured)</i>, developed by <a href="/wiki/Treasure_(company)" title="Treasure (company)">Treasure</a> and published by <a href="/wiki/Enix" title="Enix">Enix</a> and <a href="/wiki/Nintendo" title="Nintendo">Nintendo</a>. The player assumes the role of Marina, a robot who grabs, shakes, and throws objects in her journey to rescue her creator from the planet's emperor. The game is presented in <a href="/wiki/2.5D" title="2.5D">2.5D</a>, with pre-rendered 3D backgrounds behind 2D gameplay. A 12-person team developed the game over two years as Treasure's first title for a Nintendo console. It was shown at the 1997 <a href="/wiki/E3" title="E3">Electronic Entertainment Expo</a> before its

In [6]:
# 2. Identify the correct paragraph
paragraph = soup.find_all(name = 'p')[0]  # find_all() is the current name of findAll()
paragraph

<p><i><b><a href="/wiki/Mischief_Makers" title="Mischief Makers">Mischief Makers</a></b></i> is a 1997 <a href="/wiki/Side-scrolling_video_game" title="Side-scrolling video game">side-scrolling</a> platform video game, the first for the <a href="/wiki/Nintendo_64" title="Nintendo 64">Nintendo 64</a> <i>(pictured)</i>, developed by <a href="/wiki/Treasure_(company)" title="Treasure (company)">Treasure</a> and published by <a href="/wiki/Enix" title="Enix">Enix</a> and <a href="/wiki/Nintendo" title="Nintendo">Nintendo</a>. The player assumes the role of Marina, a robot who grabs, shakes, and throws objects in her journey to rescue her creator from the planet's emperor. The game is presented in <a href="/wiki/2.5D" title="2.5D">2.5D</a>, with pre-rendered 3D backgrounds behind 2D gameplay. A 12-person team developed the game over two years as Treasure's first title for a Nintendo console. It was shown at the 1997 <a href="/wiki/E3" title="E3">Electronic Entertainment Expo</a> before its 

In [7]:
# 2. Extract URLs
urls = [tag['href'] for tag in paragraph.find_all('a', href = True)]
for url in urls:
    print(url)

print('\n')

/wiki/Mischief_Makers
/wiki/Side-scrolling_video_game
/wiki/Nintendo_64
/wiki/Treasure_(company)
/wiki/Enix
/wiki/Nintendo
/wiki/2.5D
/wiki/E3
/wiki/Boss_(video_games)
/wiki/Replay_value
/wiki/Sound_bite
/wiki/Mischief_Makers
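
The extracted links are relative to the site root. If absolute URLs are needed, for example to visit those pages next, one option is `urljoin()` from the standard library. The sketch below applies it to a few of the links printed above; the variable names are illustrative.

```python
from urllib.parse import urljoin

base_url = 'https://en.wikipedia.org/wiki/Main_Page'

# A few of the relative links printed above
relative_urls = ['/wiki/Mischief_Makers', '/wiki/Nintendo_64', '/wiki/2.5D']

# urljoin() resolves each relative path against the page it was found on
absolute_urls = [urljoin(base_url, rel) for rel in relative_urls]
for link in absolute_urls:
    print(link)
```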




An HTML document is a collection of nodes with (or without) child nodes. To build the text of the article summary, we loop through all the children and concatenate the text.

In [8]:
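# 3. Build the text of the article summary by looping through all the children and concatenating the text
text = ''
for ch in paragraph.children:
    # .string is None for tags with several children, so fall back to get_text()
    text = text + (ch.string if ch.string is not None else ch.get_text())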

print(text + '\n')

Mischief Makers is a 1997 side-scrolling platform video game, the first for the Nintendo 64 (pictured), developed by Treasure and published by Enix and Nintendo. The player assumes the role of Marina, a robot who grabs, shakes, and throws objects in her journey to rescue her creator from the planet's emperor. The game is presented in 2.5D, with pre-rendered 3D backgrounds behind 2D gameplay. A 12-person team developed the game over two years as Treasure's first title for a Nintendo console. It was shown at the 1997 Electronic Entertainment Expo before its release. Reviews were mixed, with praise for its inventiveness, personality, and boss fights, but criticism for its brevity, low difficulty, low replay value, sound, and harsh introductory learning curve. Retrospective reviewers disagreed with the originally poor reception, and several highlighted Marina's signature "Shake, shake!" sound bite. In 2009, GamesRadar called it "possibly the most underrated and widely ignored game on the N