# Working with data in Python

Python is an interpreted, interactive, object-oriented programming language. 

It incorporates modules, exceptions, dynamic typing, very high level dynamic data types, and classes. 

It supports multiple programming paradigms beyond object-oriented programming, such as procedural and functional programming.

Although this is not a Python lecture, we will use Python intensively.

So let's first get familiar with some frequently used data types, and with Jupyter Notebook along the way.

First, we need to import some libraries (they should already be installed in your working environment):

In [1]:
import numpy as np
import pandas as pd

## Lists, Arrays, Series

Assume that you have the following observations of a feature: 2, 4, 6, Null, 8, and you want to store them in one variable. There are at least three alternatives:
- Python list `ls1 = [2, 4, 6, None, 8]`
- NumPy array `arr1 = np.array([2, 4, 6, np.nan, 8])`
- Pandas Series `s1 = pd.Series([2, 4, 6, np.nan, 8])`

*Insert a new cell below, copy paste the above statements, and check the value of ls1, arr1 and s1.*

In [2]:
ls1 = [2, 4, 6, None, 8]
print(f"Python list ls1= {ls1}")

# Note: None (rather than np.nan) makes NumPy create an object-dtype array
arr1 = np.array([2, 4, 6, None, 8])
print(f"Numpy array arr1= {arr1}")

s1 = pd.Series([2, 4, 6, np.nan, 8])
print(f"Pandas series s1: \n{s1}")


Python list ls1= [2, 4, 6, None, 8]
Numpy array arr1= [2 4 6 None 8]
Pandas series s1: 
0    2.0
1    4.0
2    6.0
3    NaN
4    8.0
dtype: float64
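The choice between `None` and `np.nan` affects the data type of the result. The following is a minimal sketch of that difference; the variable names `arr_obj`, `arr_nan`, and `s` are illustrative and not part of the lecture code.

```python
import numpy as np
import pandas as pd

# None is a generic Python object, so NumPy falls back to an object array.
arr_obj = np.array([2, 4, 6, None, 8])

# np.nan is a float, so the whole array becomes float64.
arr_nan = np.array([2, 4, 6, np.nan, 8])

print(arr_obj.dtype)
print(arr_nan.dtype)

# NaN-aware reductions such as np.nansum skip the missing value.
print(np.nansum(arr_nan))

# pandas converts None to NaN for numeric data and skips it in sum() by default.
s = pd.Series([2, 4, 6, None, 8])
print(s.dtype, s.sum())
```

A float array with `np.nan` is usually the more convenient form for numerical work, because vectorized functions can operate on it directly.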


### Python lists

A Python list may contain items of different types (including other lists), but usually all items have the same type.

It is similar to an array in JavaScript: it has methods such as append, insert, and index.

The next cell shows some examples.
 
More information can be found here: https://docs.python.org/3/tutorial/datastructures.html


In [3]:
ls2 = [2, 'Sunny', 6, None, [1,2,3]]
print(f"Python list ls2= {ls2}")

ls2.append(199)
print(f"Updated ls2= {ls2}")

id199 = ls2.index(199)
print(f"Index of 199= {id199}")

ls234 = ls2[1:4]
print(f"The 2nd to the 4th items are {ls234}")

item_5_3 = ls2[4][2]
print(f"The 3rd item in the 5th item of ls2 is {item_5_3}")

ls10 = [1,2,3]
ls11 = [4,5,6]
ls12 = ls10+ls11
print(f"+ concatenates two lists: ls10+ls11={ls12}")

Python list ls2= [2, 'Sunny', 6, None, [1, 2, 3]]
Updated ls2= [2, 'Sunny', 6, None, [1, 2, 3], 199]
Index of 199= 5
The 2nd to the 4th items are ['Sunny', 6, None]
The 3rd item in the 5th item of ls2 is 3
+ concatenates two lists: ls10+ls11=[1, 2, 3, 4, 5, 6]
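The `+` operator, `append`, and `extend` are easy to confuse. A short sketch of how they differ, using illustrative names (`a`, `b`, `a_app`, `a_ext`) that do not appear in the cells above:

```python
a = [1, 2, 3]
b = [4, 5, 6]

# + builds a new list and leaves both operands unchanged.
c = a + b

# append adds its argument as ONE item, so the list b becomes a nested list.
a_app = a.copy()
a_app.append(b)

# extend adds each item of b individually, in place.
a_ext = a.copy()
a_ext.extend(b)

print(c)      # [1, 2, 3, 4, 5, 6]
print(a_app)  # [1, 2, 3, [4, 5, 6]]
print(a_ext)  # [1, 2, 3, 4, 5, 6]
```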


The zip() function returns a zip object, an iterator of tuples: the first items of all passed iterables are paired together, then the second items, and so on.

In [4]:
ls13 = zip(ls10, ls11)
print(f"zip(ls10, ls11)={tuple(ls13)}")

zip(ls10, ls11)=((1, 4), (2, 5), (3, 6))


Add a new cell below and check the value of tuple(ls13). What do you get? Why?

In [5]:
tuple(ls13)

()

tuple(ls13) consumes all items from the iterator to build a tuple, which print() then displays.
Afterwards, the iterator is exhausted, so calling tuple(ls13) again returns an empty tuple.
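If the paired values are needed more than once, one option is to materialize the iterator a single time and reuse the result. A small sketch with illustrative names (`pairs`, `first_pass`, `second_pass`, `short`):

```python
pairs = zip([1, 2, 3], [4, 5, 6])

# The first pass consumes the iterator.
first_pass = list(pairs)
# The second pass finds nothing left.
second_pass = list(pairs)

print(first_pass)   # [(1, 4), (2, 5), (3, 6)]
print(second_pass)  # []

# zip stops at the shortest input, so the extra item 3 is dropped here.
short = list(zip([1, 2, 3], ["a", "b"]))
print(short)        # [(1, 'a'), (2, 'b')]
```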

### Numpy arrays 

NumPy supports large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions that operate on them. It therefore offers specialized, high-performance functions for numerical operations.

Unlike a Python list, all items in a NumPy array must have the same type (and therefore the same size in memory), so an array typically consumes less memory than an equivalent list.

https://numpy.org/doc/stable/reference/generated/numpy.array.html
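To see the single-type rule in action, the sketch below builds arrays from mixed inputs and inspects the resulting `dtype`. The names `ints`, `mixed`, and `text` are illustrative; the exact integer width (for example int32 or int64) depends on the platform.

```python
import numpy as np

ints = np.array([1, 2, 3])       # all integers: an integer dtype
mixed = np.array([1, 2.5, 3])    # integers are upcast to float64
text = np.array([1, "a"])        # everything is converted to a string dtype

print(ints.dtype, mixed.dtype, text.dtype)

# Every item occupies the same number of bytes, so the total size is
# simply (number of items) * (bytes per item).
print(mixed.itemsize, mixed.size, mixed.nbytes)
```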

#### First, let's create some arrays and retrieve some items from them.

In [6]:
d1 = np.array([1, 2, 3, 4, 5])
print(f"1D array: \n", d1, "\n")

d2 = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(f"2D array: \n", d2)
print('2nd item in the 1st item: ', d2[0][1])
print('Row 1, Column 2: ', d2[0, 1])
print(f"Row 3, Column 2: ", d2[2,1], "\n")

d3 = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])
print(f"3D array: \n", d3, "\n")

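# Error in recent NumPy versions: the inner lists have different lengths (4 vs. 3),
# so they cannot form a regular 3D array:
#d3 = np.array([[[1, 2, 3, 100], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])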

# Check what is the result: 
x = np.arange(12).reshape((3,4))
display(x)
x.sum(axis=1)

1D array: 
 [1 2 3 4 5] 

2D array: 
 [[1 2 3]
 [4 5 6]
 [7 8 9]]
2nd item in the 1st item:  2
Row 1, Column 2:  2
Row 3, Column 2:  8 

3D array: 
 [[[1 2 3]
  [4 5 6]]

 [[1 2 3]
  [4 5 6]]] 


array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

array([ 6, 22, 38])

In [7]:
x.sum(axis=0)

array([12, 15, 18, 21])
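The `axis` argument names the dimension that is collapsed by the reduction. A brief sketch that repeats the sums above and checks the row sums against an explicit loop (`col_sums`, `row_sums`, and `manual_row_sums` are illustrative names):

```python
import numpy as np

x = np.arange(12).reshape((3, 4))

# axis=0 collapses the rows: one sum per column, shape (4,).
col_sums = x.sum(axis=0)

# axis=1 collapses the columns: one sum per row, shape (3,).
row_sums = x.sum(axis=1)

# The same row sums computed with an explicit Python loop.
manual_row_sums = np.array([row.sum() for row in x])

print(col_sums)  # [12 15 18 21]
print(row_sums)  # [ 6 22 38]
```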

#### You can operate on arrays as vectors/matrices.

https://numpy.org/doc/stable/reference/routines.linalg.html

In [8]:
d4 = np.array([10, 10, 10, 10, 10])
print(f"d1={d1}")
print(f"d4={d4}")
d5 = d1+d4
print(f"Sum of two vectors d1+d4={d5}")

d1=[1 2 3 4 5]
d4=[10 10 10 10 10]
Sum of two vectors d1+d4=[11 12 13 14 15]


In [9]:
d6 = 2*d1
print(f"Scale a vector by scalar 2: 2*d1={d6}")

Scale a vector by scalar 2: 2*d1=[ 2  4  6  8 10]


In [10]:
d7 = d1*d4
print(f"Hadamard product of two vectors: d1*d4={d7}")

Hadamard product of two vectors: d1*d4=[10 20 30 40 50]


In [11]:
d8 = np.dot(d1, d4)
print(f"Dot product of two vectors: np.dot(d1, d4)={d8}")

Dot product of two vectors: np.dot(d1, d4)=150


In [12]:
from numpy import linalg as LA
n4 = LA.norm(d4) 
# With the default ord=None, LA.norm returns the 2-norm (Euclidean norm) of a vector
# and the Frobenius norm of a matrix
print(f"Norm of d4: {n4}")


Norm of d4: 22.360679774997898
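As a cross-check, the Euclidean norm of a vector equals the square root of its dot product with itself. The sketch below verifies this and shows two other uses of `LA.norm`; the names `v`, `m`, `euclid`, `manual`, `l1`, and `fro` are illustrative.

```python
import numpy as np
from numpy import linalg as LA

v = np.array([10, 10, 10, 10, 10])

euclid = LA.norm(v)              # default for a vector: 2-norm
manual = np.sqrt(np.dot(v, v))   # sqrt(v . v) gives the same value
l1 = LA.norm(v, ord=1)           # 1-norm: sum of absolute values

m = np.array([[1, 2], [3, 4]])
fro = LA.norm(m)                 # default for a matrix: Frobenius norm

print(euclid, manual, l1, fro)
```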


### Pandas series

A pandas Series is a ***one-dimensional labeled*** array holding data of any type, such as integers, strings, or Python objects.
- A pandas Series is one-dimensional.
- A pandas Series has an explicitly defined index associated with its values.
- A pandas Series allows different data types, including objects.

https://pandas.pydata.org/docs/reference/series.html


In [13]:
# One dimension
s1 = pd.Series([[1, 'a', 3], [4, 5, 6], [7, 8, 9]])
display(s1)

# You may convert an array into a Series
print(f"Original d1={d1}")
s2 = pd.Series(d1)  
display(s2)

0    [1, a, 3]
1    [4, 5, 6]
2    [7, 8, 9]
dtype: object

Original d1=[1 2 3 4 5]


0    1
1    2
2    3
3    4
4    5
dtype: int32

You may define the index when creating a Series `s2 = pd.Series([2, 4, 6, np.nan, 8], index=['a', 'b', 'c', 'd', 'e'])`. 

*Check the value of s2 (the cell below uses a string in place of 6, so the dtype becomes object):*

In [14]:
s2 = pd.Series([2, 4, 'Hallo', np.nan, 8], index=['a', 'b', 'c', 'd', 'e'])
s2

a        2
b        4
c    Hallo
d      NaN
e        8
dtype: object

You may also create Series by passing a dictionary `s3 = pd.Series(data={'a': 1, 'b': 2, 'c': 3})`

In [15]:
s3 = pd.Series(data={'a': 1, 'b': 2, 'c': 3})
s3

a    1
b    2
c    3
dtype: int64

*How do you retrieve the value 2 from s3? How do you retrieve the values of s2 indexed by b, c, and d?*

In [16]:
display(s3['b'])
s2['b':'d']

2

b        4
c    Hallo
d      NaN
dtype: object
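Note that a label slice such as `'b':'d'` includes its end label, unlike a positional slice. The sketch below contrasts the two explicitly with `.loc` (labels) and `.iloc` (positions) on an illustrative Series `s` that is not part of the cells above.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

# Label-based slice: both end points are included.
by_label = s.loc["b":"c"]

# Position-based slice: the end position is excluded, as in Python lists.
by_pos = s.iloc[1:3]

print(by_label.tolist())  # [20, 30]
print(by_pos.tolist())    # [20, 30]
print(s.loc["b"], s.iloc[1])
```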

## DataFrame

A DataFrame is a two-dimensional data structure that holds data like a two-dimensional array or a table with rows and columns.

https://pandas.pydata.org/docs/reference/frame.html


In [17]:
data1 = {'Name':["Tom","Jerry","Emily","Amy"], 'Age':[18,24,18,22]}
df1 = pd.DataFrame(data1)
display(df1)

Unnamed: 0,Name,Age
0,Tom,18
1,Jerry,24
2,Emily,18
3,Amy,22


In [18]:
data2 = {'col1': [10, 11, 12, 13], 'col2': pd.Series(['A', 'B'], index=[2, 3])}

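# Error: without an explicit index, pandas takes the index [2, 3] from the Series,
# which does not match the length of the 4-item list
#pd.DataFrame(data=data2)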

dfWithNaA = pd.DataFrame(data=data2, index=[0, 1, 2, 3])
dfWithNaA

Unnamed: 0,col1,col2
0,10,
1,11,
2,12,A
3,13,B
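The empty cells in `col2` come from index alignment: pandas matches values by index label, not by position, and fills labels with no value with NaN. A minimal sketch of the same idea with two Series (the names `col1`, `col2`, and `aligned` are illustrative):

```python
import pandas as pd

col1 = pd.Series([10, 11, 12, 13])           # index 0, 1, 2, 3
col2 = pd.Series(["A", "B"], index=[2, 3])   # index 2, 3 only

# The DataFrame index is the union of both indexes; col2 has no value
# for labels 0 and 1, so those cells are filled with NaN.
aligned = pd.DataFrame({"col1": col1, "col2": col2})

print(aligned)
print(aligned["col2"].isna().tolist())  # [True, True, False, False]
```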


In [19]:
df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                   columns=['a', 'b', 'c'])
df2

Unnamed: 0,a,b,c
0,1,2,3
1,4,5,6
2,7,8,9


#### To use NumPy functions, you can convert a DataFrame into an array

In [20]:
df2.to_numpy()

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

If the columns have different data types with no common numeric type (for example numbers, timestamps, and strings), the resulting array has dtype `object`:

In [21]:
df3 = pd.DataFrame(
    {
        "A": 1,
        "B": pd.Timestamp("20240223"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)
display(df3)
df3.to_numpy()

Unnamed: 0,A,B,C,D,E,F
0,1,2024-02-23,1.0,3,test,foo
1,1,2024-02-23,1.0,3,train,foo
2,1,2024-02-23,1.0,3,test,foo
3,1,2024-02-23,1.0,3,train,foo


array([[1, Timestamp('2024-02-23 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1, Timestamp('2024-02-23 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1, Timestamp('2024-02-23 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1, Timestamp('2024-02-23 00:00:00'), 1.0, 3, 'train', 'foo']],
      dtype=object)
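An object array is of limited use for numerical work. One possible approach, sketched here on an illustrative DataFrame `mixed_df` with made-up values, is to keep only the numeric columns with `select_dtypes` before converting.

```python
import numpy as np
import pandas as pd

# Hypothetical example data, not taken from the lecture files.
mixed_df = pd.DataFrame(
    {
        "count": [1, 2, 3],
        "ratio": [0.5, 1.5, 2.5],
        "label": ["x", "y", "z"],
    }
)

# Converting everything yields a generic object array.
all_values = mixed_df.to_numpy()

# Keeping only numeric columns yields a float64 array (int is upcast to float).
numeric_values = mixed_df.select_dtypes(include="number").to_numpy()

print(all_values.dtype)      # object
print(numeric_values.dtype)  # float64
print(numeric_values.mean(axis=0))
```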

### Overview of the datafile

With a DataFrame, you can conveniently load data and get an overview of it.

In [22]:
df = pd.read_csv("Data/advertising.csv")
display(df)

Unnamed: 0,TV,Radio,Newspaper,Sales
0,230.1,37.8,69.2,22.1
1,44.5,39.3,45.1,10.4
2,17.2,45.9,69.3,12.0
3,151.5,41.3,58.5,16.5
4,180.8,10.8,58.4,17.9
...,...,...,...,...
195,38.2,3.7,13.8,7.6
196,94.2,4.9,8.1,14.0
197,177.0,9.3,6.4,14.8
198,283.6,42.0,66.2,25.5
199,232.1,8.6,8.7,18.4


*Try the following commands. What are their functions?*

In [23]:
display(df.head(5))

Unnamed: 0,TV,Radio,Newspaper,Sales
0,230.1,37.8,69.2,22.1
1,44.5,39.3,45.1,10.4
2,17.2,45.9,69.3,12.0
3,151.5,41.3,58.5,16.5
4,180.8,10.8,58.4,17.9


In [24]:
display(df.tail(10))

Unnamed: 0,TV,Radio,Newspaper,Sales
190,39.5,41.1,5.8,10.8
191,75.5,10.8,6.0,11.9
192,17.2,4.1,31.6,5.9
193,166.8,42.0,3.6,19.6
194,149.7,35.6,6.0,17.3
195,38.2,3.7,13.8,7.6
196,94.2,4.9,8.1,14.0
197,177.0,9.3,6.4,14.8
198,283.6,42.0,66.2,25.5
199,232.1,8.6,8.7,18.4


In [25]:
display(df.index)


RangeIndex(start=0, stop=200, step=1)

In [26]:

display(df.dtypes)


TV           float64
Radio        float64
Newspaper    float64
Sales        float64
dtype: object

In [27]:

df.describe()

Unnamed: 0,TV,Radio,Newspaper,Sales
count,200.0,200.0,200.0,200.0
mean,147.0425,23.264,30.554,15.1305
std,85.854236,14.846809,21.778621,5.283892
min,0.7,0.0,0.3,1.6
25%,74.375,9.975,12.75,11.0
50%,149.75,22.9,25.75,16.0
75%,218.825,36.525,45.1,19.05
max,296.4,49.6,114.0,27.0


*The above commands display the following information about df:*
- *The first 5 rows (plus the header)*
- *The last 10 rows (plus the header)*
- *The index*
- *The data type of each column*
- *Summary statistics for each numeric column*
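The same overview commands can be tried without the data file. The sketch below uses a tiny hypothetical DataFrame `demo` with made-up values and also shows how to read a single statistic out of the `describe()` result.

```python
import pandas as pd

# Hypothetical stand-in for the advertising data (values are made up).
demo = pd.DataFrame(
    {"TV": [10.0, 20.0, 30.0, 40.0], "Sales": [1.0, 2.0, 3.0, 4.0]}
)

# shape is a (rows, columns) tuple.
n_rows, n_cols = demo.shape
print(n_rows, n_cols)

# describe() returns a DataFrame itself, so single values can be
# selected by row label (statistic) and column label.
summary = demo.describe()
print(summary.loc["mean", "TV"])    # 25.0
print(summary.loc["50%", "Sales"])  # 2.5

# head(n) and tail(n) return the first and last n rows.
print(demo.head(2))
print(demo.tail(1))
```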

### Select data from dataframe

Selecting data from a DataFrame is very convenient. In addition to selecting by index and slicing, you can easily add conditional filters.

*How to get the data in Column TV?*

In [28]:
display(df)

print(f"Column TV: \n{df['TV']}", "\n")
print(f"Column TV: \n{df.TV}", "\n")

Unnamed: 0,TV,Radio,Newspaper,Sales
0,230.1,37.8,69.2,22.1
1,44.5,39.3,45.1,10.4
2,17.2,45.9,69.3,12.0
3,151.5,41.3,58.5,16.5
4,180.8,10.8,58.4,17.9
...,...,...,...,...
195,38.2,3.7,13.8,7.6
196,94.2,4.9,8.1,14.0
197,177.0,9.3,6.4,14.8
198,283.6,42.0,66.2,25.5
199,232.1,8.6,8.7,18.4


Column TV: 
0      230.1
1       44.5
2       17.2
3      151.5
4      180.8
       ...  
195     38.2
196     94.2
197    177.0
198    283.6
199    232.1
Name: TV, Length: 200, dtype: float64 

Column TV: 
0      230.1
1       44.5
2       17.2
3      151.5
4      180.8
       ...  
195     38.2
196     94.2
197    177.0
198    283.6
199    232.1
Name: TV, Length: 200, dtype: float64 


*How to get the data of Column TV and Column Newspaper?*

In [29]:
df[['TV', 'Newspaper']]

Unnamed: 0,TV,Newspaper
0,230.1,69.2
1,44.5,45.1
2,17.2,69.3
3,151.5,58.5
4,180.8,58.4
...,...,...
195,38.2,13.8
196,94.2,8.1
197,177.0,6.4
198,283.6,66.2
199,232.1,8.7


#### loc and iloc

- loc gets rows and/or columns with particular labels.

- iloc gets rows and/or columns at integer locations.
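The difference is easiest to see when the row labels are not simply 0, 1, 2, and so on. A small sketch with an illustrative DataFrame `labeled` whose index labels are 10, 20, 30, 40:

```python
import pandas as pd

labeled = pd.DataFrame(
    {"x": [1, 2, 3, 4], "y": [5, 6, 7, 8]},
    index=[10, 20, 30, 40],
)

# loc uses labels: the row labeled 20, the column labeled "y".
by_label = labeled.loc[20, "y"]

# iloc uses integer positions: second row, second column.
by_position = labeled.iloc[1, 1]

# Label slices include the end label; position slices exclude the end.
label_slice = labeled.loc[10:30]   # rows labeled 10, 20, 30
pos_slice = labeled.iloc[0:2]      # rows at positions 0 and 1

print(by_label, by_position)
print(label_slice.index.tolist(), pos_slice.index.tolist())
```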

To test them, let's create a new dataframe with random data:

In [30]:
dates = pd.date_range('2/1/2024', periods=8)

# 8-by-4 array of samples from the standard normal distribution
df = pd.DataFrame(np.random.randn(8, 4),  
                  index=dates, 
                  columns=['A', 'B', 'C', 'D'])
df

Unnamed: 0,A,B,C,D
2024-02-01,1.388559,-1.08109,1.059009,-1.844721
2024-02-02,0.094614,-0.107355,-0.481956,-1.415501
2024-02-03,-0.768443,0.651123,-0.67683,-1.31375
2024-02-04,0.731718,1.839382,0.338888,-0.195298
2024-02-05,0.088568,0.493426,2.487158,-0.257452
2024-02-06,-0.338782,0.712679,-1.26985,-1.748766
2024-02-07,0.284759,0.895761,0.185038,1.601746
2024-02-08,0.293341,0.055059,1.149262,-0.368709


*Show the value of B on 4 Feb 2024*
*Hint: df.loc[row, column]*

In [31]:
df.loc['2024-02-04', 'B']

1.8393823689340623

*Show all values on 4 Feb 2024*

In [32]:
df.loc['2024-02-04']

A    0.731718
B    1.839382
C    0.338888
D   -0.195298
Name: 2024-02-04 00:00:00, dtype: float64

*Show all values of B by using df.loc[]*

In [33]:
df.loc[:, 'B']

2024-02-01   -1.081090
2024-02-02   -0.107355
2024-02-03    0.651123
2024-02-04    1.839382
2024-02-05    0.493426
2024-02-06    0.712679
2024-02-07    0.895761
2024-02-08    0.055059
Freq: D, Name: B, dtype: float64

*Show the values of B and C up to and including 4 Feb 2024*

In [34]:
df.loc[:'2024-02-04', 'B':'C']

Unnamed: 0,B,C
2024-02-01,-1.08109,1.059009
2024-02-02,-0.107355,-0.481956
2024-02-03,0.651123,-0.67683
2024-02-04,1.839382,0.338888


*Show the second row by using df.iloc[row, column]*

In [35]:
print(f"Second record/Row 2: \n{df.iloc[1]}", "\n")

Second record/Row 2: 
A    0.094614
B   -0.107355
C   -0.481956
D   -1.415501
Name: 2024-02-02 00:00:00, dtype: float64 


*Show the value in the second row and the second column*

In [36]:
df.iloc[1, 1]

-0.10735514136439861

*Show the first two columns of the 4th and the 5th row*

In [37]:
df.iloc[3:5, 0:2]

Unnamed: 0,A,B
2024-02-04,0.731718,1.839382
2024-02-05,0.088568,0.493426


*Show all the rows that have positive B values*

In [38]:
df[df["B"] > 0]

Unnamed: 0,A,B,C,D
2024-02-03,-0.768443,0.651123,-0.67683,-1.31375
2024-02-04,0.731718,1.839382,0.338888,-0.195298
2024-02-05,0.088568,0.493426,2.487158,-0.257452
2024-02-06,-0.338782,0.712679,-1.26985,-1.748766
2024-02-07,0.284759,0.895761,0.185038,1.601746
2024-02-08,0.293341,0.055059,1.149262,-0.368709


*Show all the rows that have positive B values and positive A values*

In [39]:
df[(df["B"] > 0) & (df["A"]>0)]

Unnamed: 0,A,B,C,D
2024-02-04,0.731718,1.839382,0.338888,-0.195298
2024-02-05,0.088568,0.493426,2.487158,-0.257452
2024-02-07,0.284759,0.895761,0.185038,1.601746
2024-02-08,0.293341,0.055059,1.149262,-0.368709
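Conditions are combined element-wise with `&` (and), `|` (or), and `~` (not). Each comparison needs its own parentheses because these operators bind more tightly than `>`. A sketch on an illustrative DataFrame `signs` with fixed values, so the results are reproducible:

```python
import pandas as pd

signs = pd.DataFrame({"A": [1, -1, 2, -2], "B": [1, 1, -1, -1]})

a_pos = signs["A"] > 0
b_pos = signs["B"] > 0

both = signs[a_pos & b_pos]         # A and B positive
either = signs[a_pos | b_pos]       # at least one positive
neither = signs[~(a_pos | b_pos)]   # none positive

# query() expresses the same filter as a string.
both_query = signs.query("A > 0 and B > 0")

print(both.index.tolist())     # [0]
print(either.index.tolist())   # [0, 1, 2]
print(neither.index.tolist())  # [3]
```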


*Show all the data later than 2024-02-02*

*Reminder: dates is index*

In [40]:
df[dates > '2024-02-02']

Unnamed: 0,A,B,C,D
2024-02-03,-0.768443,0.651123,-0.67683,-1.31375
2024-02-04,0.731718,1.839382,0.338888,-0.195298
2024-02-05,0.088568,0.493426,2.487158,-0.257452
2024-02-06,-0.338782,0.712679,-1.26985,-1.748766
2024-02-07,0.284759,0.895761,0.185038,1.601746
2024-02-08,0.293341,0.055059,1.149262,-0.368709
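With a date index, the same kind of selection can also be written as a label slice with `loc`, where date strings act as labels. A sketch under the assumption of a daily index like the one above; `days` and `daily` are illustrative names, and the column holds simple counters instead of random numbers.

```python
import numpy as np
import pandas as pd

days = pd.date_range("2024-02-01", periods=8)
daily = pd.DataFrame({"v": np.arange(8)}, index=days)

# Boolean mask on the index: strictly later than 2 Feb.
after_mask = daily[daily.index > "2024-02-02"]

# Label slice: from 3 Feb onwards (both ends of a label slice are inclusive).
after_slice = daily.loc["2024-02-03":]

# A closed window from 3 Feb to 5 Feb.
window = daily.loc["2024-02-03":"2024-02-05"]

print(after_mask["v"].tolist())
print(window["v"].tolist())  # [2, 3, 4]
```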


Now let's manipulate a dataframe.

*After the following operations, what is the shape of df4? (How many rows? How many columns?)*

In [41]:
df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                   columns=['a', 'b', 'c'])

df4 = df2.copy()

df4.loc[len(df4.index)] = ['A', 89, 93] 

df4['type'] = ["cat", "dog", "cat", "bird"]


In [42]:
display(df2)

display(df4)


Unnamed: 0,a,b,c
0,1,2,3
1,4,5,6
2,7,8,9


Unnamed: 0,a,b,c,type
0,1,2,3,cat
1,4,5,6,dog
2,7,8,9,cat
3,A,89,93,bird
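Adding the row `['A', 89, 93]` has a side effect: column `a` now mixes integers and a string, so it is no longer a numeric column. The sketch below reproduces this on a smaller illustrative DataFrame (`base` and `grown` are not names from the cells above).

```python
import pandas as pd

base = pd.DataFrame({"a": [1, 4, 7], "b": [2, 5, 8]})
grown = base.copy()

# Assigning to a new label appends a row; len(grown) is the next free label.
grown.loc[len(grown)] = ["A", 89]

# Assigning to a new column name adds a column; the list length must
# match the number of rows.
grown["type"] = ["cat", "dog", "cat", "bird"]

print(base.shape)   # (3, 2): the copy protects the original
print(grown.shape)  # (4, 3)

# Column "a" now holds both integers and a string, so it is not numeric.
print(pd.api.types.is_numeric_dtype(base["a"]))
print(pd.api.types.is_numeric_dtype(grown["a"]))
```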


*Return all the data points of cats and dogs*

In [43]:
df4[df4["type"].isin(["cat", "dog"])]

Unnamed: 0,a,b,c,type
0,1,2,3,cat
1,4,5,6,dog
2,7,8,9,cat


*Sort all the rows based on the values in Column type*

In [44]:
df4.sort_values(by=['type'])

Unnamed: 0,a,b,c,type
3,A,89,93,bird
0,1,2,3,cat
2,7,8,9,cat
1,4,5,6,dog
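`sort_values` returns a new DataFrame and leaves the original untouched. It also accepts several sort keys with a separate direction for each. A sketch on an illustrative DataFrame `pets` with made-up values:

```python
import pandas as pd

pets = pd.DataFrame(
    {"type": ["cat", "dog", "cat", "bird"], "b": [2, 5, 8, 89]}
)

# Sort by type (ascending), then by b (descending) within each type.
ordered = pets.sort_values(by=["type", "b"], ascending=[True, False])

# The original row labels are kept; reset_index(drop=True) renumbers them.
renumbered = ordered.reset_index(drop=True)

print(ordered.index.tolist())     # [3, 2, 0, 1]
print(renumbered["b"].tolist())   # [89, 8, 2, 5]
print(pets.index.tolist())        # [0, 1, 2, 3]: original unchanged
```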
