# Working with data in Python

Python is an interpreted, interactive, object-oriented programming language. 

It incorporates modules, exceptions, dynamic typing, very high level dynamic data types, and classes. 

It supports multiple programming paradigms beyond object-oriented programming, such as procedural and functional programming.

Although this is not a Python lecture, we will use Python intensively. 

So let's first get familiar with some commonly used data types, and with Jupyter Notebook along the way.

First, we import some libraries (these should already be installed in your working environment):

In [1]:
import numpy as np
import pandas as pd

## Lists, Arrays, Series

Assume that you have the following observations of a feature: 2, 4, 6, Null, 8, and you want to assign them to one variable. There are at least three alternatives:
- Python list `ls1 = [2, 4, 6, None, 8]`
- NumPy array `arr1 = np.array([2, 4, 6, np.nan, 8])`
- Pandas Series `s1 = pd.Series([2, 4, 6, np.nan, 8])`

*Insert a new code cell below, copy and paste the statements above, and check the values of `ls1`, `arr1` and `s1`.*

*To check the values, you can use `print(f"Some text here {variable}")`, or simply type the variable name on the last line of the cell.*
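As a minimal sketch of this exercise (the dtype checks are an addition, not part of the original task), the three containers could be created and inspected like this:

```python
import numpy as np
import pandas as pd

# The same five observations stored in three different containers
ls1 = [2, 4, 6, None, 8]                # plain Python list, None marks the gap
arr1 = np.array([2, 4, 6, np.nan, 8])   # NumPy array, np.nan marks the gap
s1 = pd.Series([2, 4, 6, np.nan, 8])    # pandas Series with a default integer index

print(f"ls1 = {ls1}")
print(f"arr1 = {arr1} (dtype {arr1.dtype})")
print(f"s1 =\n{s1}")
```

Because `np.nan` is a floating-point value, both the array and the Series store all five observations as floats.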

### Python lists

A Python list may contain items of different types (including other lists), but usually all items have the same type.

It is similar to an array in JavaScript: it has methods such as `append`, `insert` and `index`.

The next cell shows some examples.
 
More information can be found here: https://docs.python.org/3/tutorial/datastructures.html

Run the next cell and check the outputs.

In [2]:
ls2 = [2, 'Sunny', 6, None, [1,2,3]]
print(f"Python list ls2= {ls2}")

ls2.append(199)
print(f"Updated ls2= {ls2}")

id199 =ls2.index(199)
print(f"Index of 199= {id199}")

ls234 = ls2[1:4]
print(f"The 2nd to the 4th items are {ls234}")

item_5_3=ls2[4][2]
print(f"The 3rd item in the 5th item of ls2 is {item_5_3}")

ls10 = [1,2,3]
ls11 = [4,5,6]
ls12 = ls10+ls11
print(f"+ concatenates two lists: ls10+ls11={ls12}")

Python list ls2= [2, 'Sunny', 6, None, [1, 2, 3]]
Updated ls2= [2, 'Sunny', 6, None, [1, 2, 3], 199]
Index of 199= 5
The 2nd to the 4th items are ['Sunny', 6, None]
The 3rd item in the 5th item of ls2 is 3
+ concatenates two lists: ls10+ls11=[1, 2, 3, 4, 5, 6]


In [23]:
ls69 = [69, "Möp", "text", None, [6, 9, 6, 9], [69], [[[69]]]]
print(f"Python list ls69= {ls69}")

ls69.append(69)
ls69.index(69)       # first match only, although 69 now appears twice
ls69.index("text")

ls69cutlow = ls69[1:3]
print(f"ls69cutlow= {ls69cutlow}, [1:3]: index 3 is not included")

ls69cuthigh = ls69[3:]
print(f"ls69cuthigh= {ls69cuthigh}, [3:]: index 3 is included")

ls69all = ls10 + ls69 + ls12 + ls10*2

print(f"ls69all= {ls69all}")

Python list ls69= [69, 'Möp', 'text', None, [6, 9, 6, 9], [69], [[[69]]]]
ls69cutlow= ['Möp', 'text'], [1:3]: index 3 is not included
ls69cuthigh= [None, [6, 9, 6, 9], [69], [[[69]]], 69], [3:]: index 3 is included
ls69all= [1, 2, 3, 69, 'Möp', 'text', None, [6, 9, 6, 9], [69], [[[69]]], 69, 1, 2, 3, 4, 5, 6, 1, 2, 3, 1, 2, 3]

The `zip()` function returns a zip object: an iterator of tuples in which the first items of all passed iterables are paired together, then the second items, and so on.

In [24]:
ls13 = zip(ls10, ls11)
print(f"zip(ls10, ls11)={tuple(ls13)}")

zip(ls10, ls11)=((1, 4), (2, 5), (3, 6))


Add a new cell below and check the value of `tuple(ls13)` again. What do you get? Why?
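One way to reason about this question is sketched below. The names `pairs`, `first_pass`, `second_pass` and `pairs_list` are illustrative and not part of the notebook; the point is that a zip object is an iterator and can only be consumed once.

```python
ls10 = [1, 2, 3]
ls11 = [4, 5, 6]

pairs = zip(ls10, ls11)       # lazy iterator, no tuples are built yet
first_pass = tuple(pairs)     # consumes the iterator
second_pass = tuple(pairs)    # nothing is left to yield

print(first_pass)    # ((1, 4), (2, 5), (3, 6))
print(second_pass)   # ()

# Materialise the pairs once if they are needed more than once
pairs_list = list(zip(ls10, ls11))
print(pairs_list)    # [(1, 4), (2, 5), (3, 6)]
```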

### Numpy arrays 

NumPy supports large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions that operate on these arrays. In other words, it offers specialized, high-performance functions for numerical operations.

Unlike a Python list, all items in a NumPy array must have the same type, and all rows along a dimension must have the same length. As a result, an array consumes less memory than a list holding the same values.
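The single-type rule can be observed directly. The following is a small sketch with made-up values (`ints`, `mixed` and `texts` are illustrative names): when the inputs have different types, NumPy converts all of them to one common type.

```python
import numpy as np

ints = np.array([1, 2, 3])
mixed = np.array([1, 2.5, 3])      # one float turns every element into a float
texts = np.array([1, "two", 3])    # one string turns every element into a string

print(ints.dtype, mixed.dtype, texts.dtype)
print(mixed)    # [1.  2.5 3. ]
print(texts)    # ['1' 'two' '3']

# Every element occupies the same number of bytes
print(mixed.itemsize, mixed.nbytes)   # 8 24
```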

https://numpy.org/doc/stable/reference/generated/numpy.array.html

#### First, let's create some arrays and retrieve some items from them.

In [29]:
d1 = np.array([1, 2, 3, 4, 5])
print(f"1D array: \n", d1, "\n")

d2 = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(f"2D array: \n", d2)
print('2nd item in the 1st item: ', d2[0][1])
print('Row 1, Column 2: ', d2[0, 1])
print(f"Row 3, Column 2: ", d2[2,1], "\n")

d3 = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])
print(f"3D array: \n", d3, "\n")

# Error in recent NumPy versions: the inner lists have different lengths
#d3 = np.array([[[1, 2, 3, 100], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])

# What is the result of the following statements?
x = np.arange(12).reshape((3,4))
display(x)
x.sum(axis=1)

1D array: 
 [1 2 3 4 5] 

2D array: 
 [[1 2 3]
 [4 5 6]
 [7 8 9]]
2nd item in the 1st item:  2
Row 1, Column 2:  2
Row 3, Column 2:  8 

3D array: 
 [[[1 2 3]
  [4 5 6]]

 [[1 2 3]
  [4 5 6]]] 


array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

array([ 6, 22, 38])
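The `axis` argument decides which dimension is collapsed. As a short sketch on the same 3-by-4 array (the names `row_sums`, `col_sums` and `total` are illustrative):

```python
import numpy as np

x = np.arange(12).reshape((3, 4))   # 3 rows, 4 columns, values 0..11

row_sums = x.sum(axis=1)   # collapse the columns: one sum per row
col_sums = x.sum(axis=0)   # collapse the rows: one sum per column
total = x.sum()            # no axis given: sum of all elements

print(row_sums)   # [ 6 22 38]
print(col_sums)   # [12 15 18 21]
print(total)      # 66
```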

#### You can operate on arrays as vectors/matrices.

https://numpy.org/doc/stable/reference/routines.linalg.html

*Run the following cells. What are the results of these calculations?*

In [31]:
d4 = np.array([10, 10, 10, 10, 10])
d5 = d1+d4

In [32]:
d6 = 2*d1

In [33]:
d7 = d1*d4

In [34]:
d8 = np.dot(d1, d4)

In [38]:
from numpy import linalg as LA
n4 = LA.norm(d4) 
n4

22.360679774997898
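For reference, the cells above can be condensed into one sketch that prints each result, which makes the difference between element-wise operations and the dot product visible:

```python
import numpy as np

d1 = np.array([1, 2, 3, 4, 5])
d4 = np.array([10, 10, 10, 10, 10])

print(d1 + d4)              # element-wise sum: [11 12 13 14 15]
print(2 * d1)               # scalar multiple: [ 2  4  6  8 10]
print(d1 * d4)              # element-wise product: [10 20 30 40 50]
print(np.dot(d1, d4))       # dot product: 150
print(np.linalg.norm(d4))   # Euclidean length, i.e. the square root of 500
```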

### Pandas series

A pandas Series is a ***one-dimensional labeled*** array holding data of any type, such as integers, strings, or Python objects.
- Pandas Series is one dimensional.
- Pandas Series has an explicitly defined index associated with values.
- Pandas Series allows different data types, including objects.

https://pandas.pydata.org/docs/reference/series.html

*Run the following cell. Check the difference between s1 and any NumPy array. Compare d1 and s2.*


In [43]:
# One dimension
s1 = pd.Series([[1, 'a', 3], [4, 5, 6], [7, 8, 9]])
display(s1)

# You may convert an array into a Series
print(f"Original d1={d1}")
s2 = pd.Series(d1)  
display(s2)

0    [1, a, 3]
1    [4, 5, 6]
2    [7, 8, 9]
dtype: object

Original d1=[1 2 3 4 5]


0    1
1    2
2    3
3    4
4    5
dtype: int64

You may define the index when creating a Series `s2 = pd.Series([2, 4, 6, np.nan, 8], index=['a', 'b', 'c', 'd', 'e'])`. 

*Check the value of s2:*

In [40]:
s2 = pd.Series([2, 4, 'Hallo', np.nan, 8], index=['a', 'b', 'c', 'd', 'e'])
s2

a        2
b        4
c    Hallo
d      NaN
e        8
dtype: object

You may also create Series by passing a dictionary `s3 = pd.Series(data={'a': 1, 'b': 2, 'c': 3})`

In [47]:
s3 = pd.Series(data={'a': 1, 'b': 2, 'c': 3})
s3

a    1
b    2
c    3
dtype: int64

*How to retrieve value 2 of s3? How to retrieve values indexed by b, c, d of s2?*

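In [49]:
display(s3['b'])
s2['b':'d']

2

b        4
c    Hallo
d      NaN
dtype: object

If the second statement raises `TypeError: cannot do slice indexing on RangeIndex with these indexers [b] of type str`, then `s2` still refers to the integer-indexed Series created from `d1`. Re-run the cell that defines `s2` with the labels `a` to `e` first.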

In [52]:
# np.nan is a float, so the integer values are stored as floats as well
s69 = pd.Series([2, 4, 6, np.nan, 8], index=['a', 'b', 'c', 'd', 'e'])
s69 

a    2.0
b    4.0
c    6.0
d    NaN
e    8.0
dtype: float64

## DataFrame

A DataFrame is a two-dimensional data structure that holds data like a two-dimensional array or a table with rows and columns.

https://pandas.pydata.org/docs/reference/frame.html

There are various ways to initialize a dataframe. 

*Run the following cells and check the outcome.*


In [55]:
data1 = {'Name':["Tom","Jerry","Emily","Amy"], 'Age':[18,24,18,22]}
df1 = pd.DataFrame(data1)
display(df1)

Unnamed: 0,Name,Age
0,Tom,18
1,Jerry,24
2,Emily,18
3,Amy,22


In [56]:
data2 = {'col1': [10, 11, 12, 13], 'col2': pd.Series(['A', 'B'], index=[2, 3])}

# Error: without an explicit index, the 4-item list does not fit the 2-label index taken from the Series
#pd.DataFrame(data=data2)

dfWithNaA = pd.DataFrame(data=data2, index=[0, 1, 2, 3])
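What happens in this cell can be made visible with a short sketch (the name `df_aligned` is illustrative). The behaviour shown is an assumption about recent pandas versions: the Series is aligned to the requested index by label, and labels without a value are filled with NaN.

```python
import pandas as pd

data2 = {'col1': [10, 11, 12, 13],
         'col2': pd.Series(['A', 'B'], index=[2, 3])}

# Without an index, the constructor is expected to raise a ValueError
try:
    pd.DataFrame(data=data2)
except ValueError as err:
    print(f"Without an index: {err}")

# With an explicit index, the Series is aligned by label;
# labels 0 and 1 have no entry in col2 and become NaN
df_aligned = pd.DataFrame(data=data2, index=[0, 1, 2, 3])
print(df_aligned)
print(df_aligned['col2'].isna().tolist())   # [True, True, False, False]
```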

In [59]:
df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                   columns=['a', 'b', 'c'])

#### To use NumPy functions, you can convert a DataFrame into an array

In [60]:
df2.to_numpy()

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

If the DataFrame contains columns of different data types, the resulting array has dtype `object`:

In [61]:
df3 = pd.DataFrame(
    {
        "A": 1,
        "B": pd.Timestamp("20240223"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)
display(df3)
df3.to_numpy()

Unnamed: 0,A,B,C,D,E,F
0,1,2024-02-23,1.0,3,test,foo
1,1,2024-02-23,1.0,3,train,foo
2,1,2024-02-23,1.0,3,test,foo
3,1,2024-02-23,1.0,3,train,foo


array([[1, Timestamp('2024-02-23 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1, Timestamp('2024-02-23 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1, Timestamp('2024-02-23 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1, Timestamp('2024-02-23 00:00:00'), 1.0, 3, 'train', 'foo']],
      dtype=object)
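An `object` array is of little use for numerical work. One possible workaround, sketched here on a small hypothetical frame `df_mixed`, is to select only the numeric columns before converting:

```python
import pandas as pd

df_mixed = pd.DataFrame({"A": [1, 2], "C": [1.0, 2.0], "F": ["foo", "bar"]})

print(df_mixed.to_numpy().dtype)        # object: no common numeric type

# Keep only the numeric columns, then convert
numeric = df_mixed.select_dtypes(include="number")
print(numeric.to_numpy().dtype)         # float64: the ints are upcast to float
print(numeric.to_numpy().sum(axis=0))   # column sums: [3. 3.]
```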

### Overview of the datafile

With a DataFrame, you can conveniently load data and get an overview of it.

In [63]:
df = pd.read_csv("../data/advertising.csv")
display(df)

Unnamed: 0,TV,Radio,Newspaper,Sales
0,230.1,37.8,69.2,22.1
1,44.5,39.3,45.1,10.4
2,17.2,45.9,69.3,12.0
3,151.5,41.3,58.5,16.5
4,180.8,10.8,58.4,17.9
...,...,...,...,...
195,38.2,3.7,13.8,7.6
196,94.2,4.9,8.1,14.0
197,177.0,9.3,6.4,14.8
198,283.6,42.0,66.2,25.5
199,232.1,8.6,8.7,18.4


*Try the following commands. What do they do?*

In [64]:
display(df.head(5))

Unnamed: 0,TV,Radio,Newspaper,Sales
0,230.1,37.8,69.2,22.1
1,44.5,39.3,45.1,10.4
2,17.2,45.9,69.3,12.0
3,151.5,41.3,58.5,16.5
4,180.8,10.8,58.4,17.9


In [65]:
display(df.tail(10))

Unnamed: 0,TV,Radio,Newspaper,Sales
190,39.5,41.1,5.8,10.8
191,75.5,10.8,6.0,11.9
192,17.2,4.1,31.6,5.9
193,166.8,42.0,3.6,19.6
194,149.7,35.6,6.0,17.3
195,38.2,3.7,13.8,7.6
196,94.2,4.9,8.1,14.0
197,177.0,9.3,6.4,14.8
198,283.6,42.0,66.2,25.5
199,232.1,8.6,8.7,18.4


In [66]:
display(df.index)


RangeIndex(start=0, stop=200, step=1)

In [67]:

display(df.dtypes)


TV           float64
Radio        float64
Newspaper    float64
Sales        float64
dtype: object

In [68]:
df.describe()

Unnamed: 0,TV,Radio,Newspaper,Sales
count,200.0,200.0,200.0,200.0
mean,147.0425,23.264,30.554,15.1305
std,85.854236,14.846809,21.778621,5.283892
min,0.7,0.0,0.3,1.6
25%,74.375,9.975,12.75,11.0
50%,149.75,22.9,25.75,16.0
75%,218.825,36.525,45.1,19.05
max,296.4,49.6,114.0,27.0


### Select data from dataframe

Selecting data from a DataFrame is very convenient. In addition to selecting by index and slicing, you can easily add conditional filters.

*How to get the data in Column TV?*

In [69]:
display(df)

print(f"Column TV: \n{df['TV']}", "\n")

print(f"Column TV: \n{df.TV}", "\n")

Unnamed: 0,TV,Radio,Newspaper,Sales
0,230.1,37.8,69.2,22.1
1,44.5,39.3,45.1,10.4
2,17.2,45.9,69.3,12.0
3,151.5,41.3,58.5,16.5
4,180.8,10.8,58.4,17.9
...,...,...,...,...
195,38.2,3.7,13.8,7.6
196,94.2,4.9,8.1,14.0
197,177.0,9.3,6.4,14.8
198,283.6,42.0,66.2,25.5
199,232.1,8.6,8.7,18.4


Column TV: 
0      230.1
1       44.5
2       17.2
3      151.5
4      180.8
       ...  
195     38.2
196     94.2
197    177.0
198    283.6
199    232.1
Name: TV, Length: 200, dtype: float64 

Column TV: 
0      230.1
1       44.5
2       17.2
3      151.5
4      180.8
       ...  
195     38.2
196     94.2
197    177.0
198    283.6
199    232.1
Name: TV, Length: 200, dtype: float64 


*How to get the data of Column TV and Column Newspaper?*

In [70]:
df[["TV", "Newspaper"]]

Unnamed: 0,TV,Newspaper
0,230.1,69.2
1,44.5,45.1
2,17.2,69.3
3,151.5,58.5
4,180.8,58.4
...,...,...
195,38.2,13.8
196,94.2,8.1
197,177.0,6.4
198,283.6,66.2
199,232.1,8.7


#### loc and iloc

- `loc` selects rows and/or columns by label.

- `iloc` selects rows and/or columns by integer position.
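The difference is easiest to see on a tiny frame. The sketch below uses a hypothetical `df_demo` with string row labels; note that label slices include the end label, while position slices exclude the end position.

```python
import pandas as pd

df_demo = pd.DataFrame({"A": [10, 20, 30], "B": [1.5, 2.5, 3.5]},
                       index=["x", "y", "z"])

print(df_demo.loc["y", "B"])    # by label: row "y", column "B" -> 2.5
print(df_demo.iloc[1, 1])       # by position: second row, second column -> 2.5

# Label slices include the end label; position slices exclude the end position
print(df_demo.loc["x":"y", "A"].tolist())   # [10, 20]
print(df_demo.iloc[0:1, 0].tolist())        # [10]
```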

To test them, let's create a new dataframe with random data:

In [72]:
dates = pd.date_range('2/1/2024', periods=8)

#8-by-4 array of samples from the normal distribution
df = pd.DataFrame(np.random.randn(8, 4),  
                  index=dates, 
                  columns=['A', 'B', 'C', 'D'])

display(df)

Unnamed: 0,A,B,C,D
2024-02-01,-1.489353,0.74022,0.718577,-0.13023
2024-02-02,0.321152,-1.206719,-1.518101,-1.310413
2024-02-03,0.623487,0.262852,0.062015,0.662257
2024-02-04,0.447203,-1.11538,-0.137813,-0.181104
2024-02-05,1.989134,0.915686,-0.603898,-0.800401
2024-02-06,0.165332,1.972802,-1.283837,-1.657675
2024-02-07,-0.81329,0.134247,-2.065386,-0.149419
2024-02-08,-0.502359,-0.21941,-2.059755,1.002937


*Show the value of B on 4 Feb 2024*

*Hint: `df.loc[row, column]`*

In [80]:
df.loc["2024-02-04", "B"]

-1.1153802895261948

*Show all values on 4 Feb 2024*

In [81]:
df.loc["2024-02-04"]

A    0.447203
B   -1.115380
C   -0.137813
D   -0.181104
Name: 2024-02-04 00:00:00, dtype: float64

*Show all values of B by using df.loc[]*

In [88]:
df.loc[:, "B"]

2024-02-01    0.740220
2024-02-02   -1.206719
2024-02-03    0.262852
2024-02-04   -1.115380
2024-02-05    0.915686
2024-02-06    1.972802
2024-02-07    0.134247
2024-02-08   -0.219410
Freq: D, Name: B, dtype: float64

*Show the values of B and C, before (including) 4 Feb 2024*

In [91]:
df.loc[:"2024-02-04", "B":"C"]

Unnamed: 0,B,C
2024-02-01,0.74022,0.718577
2024-02-02,-1.206719,-1.518101
2024-02-03,0.262852,0.062015
2024-02-04,-1.11538,-0.137813


*Show the second row by using df.iloc[row, column]*

In [95]:
df.iloc[1, :]

A    0.321152
B   -1.206719
C   -1.518101
D   -1.310413
Name: 2024-02-02 00:00:00, dtype: float64

*Show the value in the second row and the second column*

In [96]:
df.iloc[1,1]

-1.2067191579958343

*Show the first two columns of the 4th and the 5th row*

In [101]:
df.iloc[3:5, :2]

Unnamed: 0,A,B
2024-02-04,0.447203,-1.11538
2024-02-05,1.989134,0.915686


*Show all the rows that have positive B values*

In [102]:
df[df["B"] > 0]

Unnamed: 0,A,B,C,D
2024-02-01,-1.489353,0.74022,0.718577,-0.13023
2024-02-03,0.623487,0.262852,0.062015,0.662257
2024-02-05,1.989134,0.915686,-0.603898,-0.800401
2024-02-06,0.165332,1.972802,-1.283837,-1.657675
2024-02-07,-0.81329,0.134247,-2.065386,-0.149419


*Show all the rows that have positive B values and positive A values*

In [115]:
df[(df["A"]> 0) & (df["B"]> 0)]

Unnamed: 0,A,B,C,D
2024-02-03,0.623487,0.262852,0.062015,0.662257
2024-02-05,1.989134,0.915686,-0.603898,-0.800401
2024-02-06,0.165332,1.972802,-1.283837,-1.657675


*Show all the data later than 2024-02-02*

*Reminder: `dates` is the index.*

In [118]:
df["2024-02-02" < dates]

Unnamed: 0,A,B,C,D
2024-02-03,0.623487,0.262852,0.062015,0.662257
2024-02-04,0.447203,-1.11538,-0.137813,-0.181104
2024-02-05,1.989134,0.915686,-0.603898,-0.800401
2024-02-06,0.165332,1.972802,-1.283837,-1.657675
2024-02-07,-0.81329,0.134247,-2.065386,-0.149419
2024-02-08,-0.502359,-0.21941,-2.059755,1.002937
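The same filter can also be written against the frame's own index, which avoids depending on the separate `dates` variable. This is a sketch on a hypothetical frame `df_dates` with one column; both variants should select the same six rows.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2/1/2024', periods=8)
df_dates = pd.DataFrame({"A": np.arange(8)}, index=dates)

# Boolean mask built from the index itself
later = df_dates[df_dates.index > "2024-02-02"]

# Label slice with loc; the start label is included
also_later = df_dates.loc["2024-02-03":]

print(len(later), len(also_later))   # 6 6
print(later["A"].tolist())           # [2, 3, 4, 5, 6, 7]
```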


Now let's manipulate a DataFrame.

*After the following operations, what is the shape of `df4`? (How many rows? How many columns?)*

In [121]:
df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                   columns=['a', 'b', 'c'])

df4 = df2.copy()

df4.loc[len(df4.index)] = ['A', 89, 93] 

df4['type'] = ["cat", "dog", "cat", "bird"]

df4


Unnamed: 0,a,b,c,type
0,1,2,3,cat
1,4,5,6,dog
2,7,8,9,cat
3,A,89,93,bird
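The shape can be checked programmatically with the `shape` attribute. The sketch below repeats the operations from the cell above so that it runs on its own:

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                   columns=['a', 'b', 'c'])
df4 = df2.copy()
df4.loc[len(df4.index)] = ['A', 89, 93]        # adds a fourth row
df4['type'] = ["cat", "dog", "cat", "bird"]    # adds a fourth column

print(df4.shape)   # (4, 4)
print(df2.shape)   # (3, 3): the copy left the original untouched
```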


*Return all the data points of cats and dogs*

In [124]:
df4[(df4["type"] == "cat") | (df4["type"] == "dog")]

Unnamed: 0,a,b,c,type
0,1,2,3,cat
1,4,5,6,dog
2,7,8,9,cat
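When a column is compared against several allowed values, `isin` is a compact alternative to chaining conditions with `|`. A minimal sketch on a hypothetical frame `pets`:

```python
import pandas as pd

pets = pd.DataFrame({"b": [2, 5, 8, 89],
                     "type": ["cat", "dog", "cat", "bird"]})

# True for every row whose type is one of the listed values
cats_and_dogs = pets[pets["type"].isin(["cat", "dog"])]
print(cats_and_dogs)
```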


*Sort all the rows based on the values in Column type*

In [125]:
df4.sort_values(by="type")

Unnamed: 0,a,b,c,type
3,A,89,93,bird
0,1,2,3,cat
2,7,8,9,cat
1,4,5,6,dog
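`sort_values` returns a new DataFrame and leaves the original unchanged. It also accepts several sort keys with individual directions, as sketched here on a hypothetical frame `pets`:

```python
import pandas as pd

pets = pd.DataFrame({"b": [2, 5, 8, 89],
                     "type": ["cat", "dog", "cat", "bird"]})

# Sort by type (ascending), then by b (descending) within each type
ordered = pets.sort_values(by=["type", "b"], ascending=[True, False])

print(ordered.index.tolist())   # [3, 2, 0, 1]
print(pets.index.tolist())      # [0, 1, 2, 3]: the original order is kept
```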
