# Pandas

Pandas is a powerful library for data manipulation and analysis in Python. It is widely used in a range of fields, including data science, finance, and statistics.

## 003. Series Basics

A pandas Series is a one-dimensional labelled array: a sequence of values, each
paired with an index label. Series support calculations on the data, such as
sum, mean, and standard deviation, and can be plotted using the built-in
plotting functions in pandas.
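
As a minimal sketch of those calculations, the snippet below builds a Series from the ages listed in the Assets section and computes the three statistics mentioned. The variable names (`ages`, `total`, `average`, `spread`) are illustrative and not part of the kata.

```python
import pandas as pd

# Ages from the assets table, labelled by player name
ages = pd.Series(
    [23, 31, 33, 34, 35],
    index=["Mbappé", "De Bruyne", "Lewandowski", "Benzema", "Messi"],
    name="Age",
)

total = ages.sum()     # sum of all values
average = ages.mean()  # arithmetic mean
spread = ages.std()    # sample standard deviation (ddof=1 is the pandas default)

print(total, average, round(spread, 2))
```

Note that `Series.std()` uses the sample formula by default, so it differs from NumPy's `np.std`, which defaults to the population formula.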

## 003.000 Assets

Some assets to avoid too much typing

| Name        | Age |
|-------------|-----|
| Mbappé      | 23 |
| De Bruyne   | 31 |
| Lewandowski | 33 |
| Benzema     | 34 |
| Messi       | 35 |

In [1]:
import sys
from pathlib import Path

current_dir = Path().resolve()
while current_dir != current_dir.parent and current_dir.name != "katas":
    current_dir = current_dir.parent
if current_dir != current_dir.parent:
    sys.path.append(current_dir.as_posix())


In [2]:
import pandas as pd
from lib.utils import fresh_df
from IPython.core.interactiveshell import InteractiveShell

pd.set_option('display.max_rows', None)
InteractiveShell.ast_node_interactivity = "all"

names_as_list = ["Mbappé", "De Bruyne", "Lewandowski", "Benzema", "Messi"]
ages_as_list = [ 23,       31,          33,             34,        35]


### 003.001 Extract series from DataFrame

1. Load the TSV file into a DF. Extract the "DOB" column as a Series without using loc. Print it and notice that it shows both the row labels and the values
1. Extract the DOBs as a list, and as a NumPy array
1. Extract the row labels as a list, and as an array
1. Extract the "DOB" column as a Series, this time using loc


In [3]:
datafile = "002.tsv"
df = fresh_df(src=datafile, id="Name")
# solution

1
s = df["DOB"]
s

2
s.tolist()
s.values

3
s.index.tolist()
s.index.values

4
df.loc[:, "DOB"]

1

Name
Mbappé         1998-12-20
De Bruyne      1991-06-28
Lewandowski    1988-08-21
Benzema        1987-12-19
Messi          1987-06-24
Name: DOB, dtype: object

2

['1998-12-20', '1991-06-28', '1988-08-21', '1987-12-19', '1987-06-24']

array(['1998-12-20', '1991-06-28', '1988-08-21', '1987-12-19',
       '1987-06-24'], dtype=object)

3

['Mbappé', 'De Bruyne', 'Lewandowski', 'Benzema', 'Messi']

array(['Mbappé', 'De Bruyne', 'Lewandowski', 'Benzema', 'Messi'],
      dtype=object)

4

Name
Mbappé         1998-12-20
De Bruyne      1991-06-28
Lewandowski    1988-08-21
Benzema        1987-12-19
Messi          1987-06-24
Name: DOB, dtype: object
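
The solution above relies on the local `fresh_df` helper and on `002.tsv`. As a self-contained sketch, the same four steps can be reproduced on a small frame typed in by hand; that frame is a stand-in for the file, not its actual loader. The sketch uses `to_numpy()` where the solution uses `.values`, since the pandas documentation recommends it for new code.

```python
import pandas as pd

# Hand-typed stand-in for the frame loaded from 002.tsv (an assumption, not the real loader)
df = pd.DataFrame(
    {"DOB": ["1998-12-20", "1991-06-28", "1988-08-21", "1987-12-19", "1987-06-24"]},
    index=pd.Index(
        ["Mbappé", "De Bruyne", "Lewandowski", "Benzema", "Messi"], name="Name"
    ),
)

s = df["DOB"]                     # 1. column as a Series, without loc
dob_list = s.tolist()             # 2. values as a plain Python list
dob_array = s.to_numpy()          #    values as a NumPy array
labels = s.index.tolist()         # 3. row labels as a list
label_array = s.index.to_numpy()  #    row labels as a NumPy array
s_loc = df.loc[:, "DOB"]          # 4. the same column, this time via loc

print(s)
```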

### 003.002 Sort Series

1. Load the TSV file into a DF. Extract the "DOB" column as a Series by making a copy; otherwise the in-place sorts below may fail, because the Series can be a view of the DataFrame's data
1. Sort by row labels
1. Sort by values with default ordering. Then sort it again, reversing it
1. Get another df, this time without id. Print it, and then print it sorted by name


In [4]:
datafile = "002.tsv"
df = fresh_df(src=datafile, id="Name")
# solution

1
s = df["DOB"].copy()
s

2
s.sort_index(inplace=True)
s

3
s.sort_values(inplace=True)
s
s.sort_values(inplace=True, ascending=False)
s

4
df = fresh_df(src=datafile)
df
df.sort_values(by="Name", inplace=True)
df

1

Name
Mbappé         1998-12-20
De Bruyne      1991-06-28
Lewandowski    1988-08-21
Benzema        1987-12-19
Messi          1987-06-24
Name: DOB, dtype: object

2

Name
Benzema        1987-12-19
De Bruyne      1991-06-28
Lewandowski    1988-08-21
Mbappé         1998-12-20
Messi          1987-06-24
Name: DOB, dtype: object

3

Name
Messi          1987-06-24
Benzema        1987-12-19
Lewandowski    1988-08-21
De Bruyne      1991-06-28
Mbappé         1998-12-20
Name: DOB, dtype: object

Name
Mbappé         1998-12-20
De Bruyne      1991-06-28
Lewandowski    1988-08-21
Benzema        1987-12-19
Messi          1987-06-24
Name: DOB, dtype: object

4

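|   | Name        | DOB        |
|---|-------------|------------|
| 0 | Mbappé      | 1998-12-20 |
| 1 | De Bruyne   | 1991-06-28 |
| 2 | Lewandowski | 1988-08-21 |
| 3 | Benzema     | 1987-12-19 |
| 4 | Messi       | 1987-06-24 |

|   | Name        | DOB        |
|---|-------------|------------|
| 3 | Benzema     | 1987-12-19 |
| 1 | De Bruyne   | 1991-06-28 |
| 2 | Lewandowski | 1988-08-21 |
| 0 | Mbappé      | 1998-12-20 |
| 4 | Messi       | 1987-06-24 |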
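
A variant worth comparing with the solution above: without `inplace=True`, the sort methods return new objects, so no explicit copy is needed and the source frame stays untouched. This is a sketch on a hand-typed stand-in for `002.tsv`. The names `by_label`, `oldest_first`, and `youngest_first` are illustrative.

```python
import pandas as pd

# Hand-typed stand-in for the frame loaded from 002.tsv (an assumption)
df = pd.DataFrame(
    {"DOB": ["1998-12-20", "1991-06-28", "1988-08-21", "1987-12-19", "1987-06-24"]},
    index=pd.Index(
        ["Mbappé", "De Bruyne", "Lewandowski", "Benzema", "Messi"], name="Name"
    ),
)

by_label = df["DOB"].sort_index()                         # alphabetical by row label
oldest_first = df["DOB"].sort_values()                    # ascending is the default
youngest_first = df["DOB"].sort_values(ascending=False)   # reversed order

# ISO-formatted date strings sort chronologically even though the dtype is not datetime
print(oldest_first)
```

Because each call returns a new Series, `df` keeps its original row order after all three sorts.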


### 003.003 File for seaborn

1. Read the data_file
   1. It doesn't have a header 
   1. The index is the first column. Rename it to 'name'
   1. The rows need to be sorted by index
1. Extract the first row as a Series into a variable named `row`
   1. Ignore the NaN values
   1. ...and split each cell into a list, splitting by '|'
   1. ...making sure even empty values are replaced by a suitable list
   1. ...and turn each item in the list into a column (Series)
   1. ...and save the result into a new DataFrame called `new_df`, rename its columns to "Time" and `name`, then make "Time" the index
   1. Append ':00' to each value in the `name` column, and convert the column to timedeltas
1. Create an empty DataFrame called `katas_df`. Then loop through each row of `df` and apply the transformations above to it. At the end of each iteration, append the newly created DataFrame to `katas_df`
1. Make "Time" the index column, sort ascending by it, and save the result to the `save_to` path as a tab-separated file
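
The steps above can be sketched end to end on a tiny inline frame. This is a hedged illustration, not the kata's reference solution: `raw` is a hypothetical two-row stand-in for `katas.tsv`, and the result is rendered to a string instead of being written to `save_to`. The frames are collected in a list and concatenated once, which avoids growing a DataFrame inside the loop. Stripping whitespace from "Time" is an added assumption.

```python
import pandas as pd

# Hypothetical stand-in for katas.tsv: index = kata name, cells = "start|duration"
raw = pd.DataFrame(
    {
        1: ["2023-01-06 08:00|00:22", "2023-01-13 23:41|00:06"],
        2: ["2023-01-08 13:11|00:22", None],
    },
    index=pd.Index(["PANDAS 001", "PANDAS 003"], name="name"),
).sort_index()

frames = []
for kata, row in raw.iterrows():
    cells = row.dropna()                              # ignore unused slots
    pairs = [cell.split("|") for cell in cells]       # [[start, duration], ...]
    part = pd.DataFrame(pairs, columns=["Time", kata])
    part["Time"] = part["Time"].str.strip()           # assumption: surrounding spaces are noise
    part[kata] = pd.to_timedelta(part[kata] + ":00")  # "00:22" -> "00:22:00" -> timedelta
    frames.append(part)

# One concat at the end, then index and sort by "Time"
katas_df = pd.concat(frames, axis=0).set_index("Time").sort_index()

# Render as tab-separated text; pass a file path to to_csv to save instead
tsv_text = katas_df.to_csv(sep="\t")
print(katas_df)
```

Each kata contributes its own column, so rows belonging to other katas are filled with `NaT` after the concat.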


In [5]:
import os
import re
data_file = "katas.tsv"
name = "PANDAS XXX"
save_to = re.sub(r'/katas/.+$', '/katas/seaborn_katas/solutions/katas.tsv', os.path.abspath(''))
# solution

1
df = fresh_df(data_file, sep="\t", index_col=0, header=None)
df.index = df.index.rename("name")
df.sort_index(inplace=True)
df

2
row = df.iloc[0]
row.dropna()
row.dropna().apply(lambda x: x.split("|"))
row.dropna().apply(lambda x: x.split("|")).apply(pd.Series)
new_df = row.dropna().apply(lambda x: x.split("|")).apply(pd.Series)
new_df.columns = ["Time", name]
new_df = new_df.set_index("Time")
new_df[name] = pd.to_timedelta(new_df[name].map(lambda cell: cell + ":00"))
new_df

3
katas_df = pd.DataFrame()
for name, row in df.iterrows():
    new_df = row.dropna().apply(lambda x: x.split("|") if x else [None, None]).apply(pd.Series)
    new_df.columns = ["Time", name]
    new_df.set_index("Time")  # result is not assigned, so "Time" stays a regular column for the concat below
    new_df[name] = pd.to_timedelta(new_df[name].map(lambda cell: cell + ":00"))
    katas_df = pd.concat([katas_df, new_df], axis=0)
    
katas_df = katas_df.set_index("Time")
katas_df.sort_index(inplace=True, ascending=True)
katas_df
katas_df.to_csv(save_to, index=True, index_label="Time", sep="\t")

1

name,1,2,3,4,5,6,7,8
PANDAS 001,2023-01-06 08:00|00:22,2023-01-08 13:11|00:22,2023-01-12 00:52|00:31,2023-01-17 23:55 |00:23,2023-01-22 11:10 |00:31,2023-01-26 21:02 |00:25,2023-02-10 23:30 |00:36,
PANDAS 002,2023-01-06 13:08|00:12,2023-01-07 23:05|00:19,2023-01-10 21:45|00:10,2023-01-12 20:16|00:22,2023-01-16 23:29 |00:10,2023-01-21 14:10 |00:26,2023-01-26 00:29 |00:15,2023-02-01 23:00 |00:10
PANDAS 003,2023-01-13 23:41|00:06,2023-01-18 22:02 |00:05,2023-01-23 23:18 |00:12,2023-01-28 11:43 |01:19,,,,
PANDAS 004,2023-01-20 22:41|00:23,2023-01-24 22:58 |00:25,2023-01-27 21:56 |00:36,2023-01-31 22:18 |00:20,,,,
PYTHON 002,2023-01-29 23:08 |01:00,2023-02-09 23:54 |00:40,,,,,,


2

1     2023-01-06 08:00|00:22
2     2023-01-08 13:11|00:22
3     2023-01-12 00:52|00:31
4    2023-01-17 23:55 |00:23
5    2023-01-22 11:10 |00:31
6    2023-01-26 21:02 |00:25
7    2023-02-10 23:30 |00:36
Name: PANDAS 001, dtype: object

1     [2023-01-06 08:00, 00:22]
2     [2023-01-08 13:11, 00:22]
3     [2023-01-12 00:52, 00:31]
4    [2023-01-17 23:55 , 00:23]
5    [2023-01-22 11:10 , 00:31]
6    [2023-01-26 21:02 , 00:25]
7    [2023-02-10 23:30 , 00:36]
Name: PANDAS 001, dtype: object

|   | 0                | 1     |
|---|------------------|-------|
| 1 | 2023-01-06 08:00 | 00:22 |
| 2 | 2023-01-08 13:11 | 00:22 |
| 3 | 2023-01-12 00:52 | 00:31 |
| 4 | 2023-01-17 23:55 | 00:23 |
| 5 | 2023-01-22 11:10 | 00:31 |
| 6 | 2023-01-26 21:02 | 00:25 |
| 7 | 2023-02-10 23:30 | 00:36 |

| Time             | PANDAS XXX      |
|------------------|-----------------|
| 2023-01-06 08:00 | 0 days 00:22:00 |
| 2023-01-08 13:11 | 0 days 00:22:00 |
| 2023-01-12 00:52 | 0 days 00:31:00 |
| 2023-01-17 23:55 | 0 days 00:23:00 |
| 2023-01-22 11:10 | 0 days 00:31:00 |
| 2023-01-26 21:02 | 0 days 00:25:00 |
| 2023-02-10 23:30 | 0 days 00:36:00 |
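
As an alternative sketch for step 2, the vectorised string accessor can do the split and the column expansion in one call, replacing the two `apply` passes. Some of the raw cells shown above carry a space before the `|`. Whether that space should be kept is not stated in the source, so stripping it here is an assumption. The `row` below is a hand-typed stand-in for the first row of the data.

```python
import pandas as pd

# Hand-typed stand-in for the "PANDAS 001" row, including a trailing space and a missing cell
row = pd.Series(
    ["2023-01-06 08:00|00:22", "2023-01-17 23:55 |00:23", None],
    index=[1, 2, 3],
    name="PANDAS 001",
)

# Split every cell on "|" and expand the pieces into columns 0 and 1
parts = row.dropna().str.split("|", expand=True)
parts.columns = ["Time", row.name]

parts["Time"] = parts["Time"].str.strip()  # assumption: the stray space is noise
parts[row.name] = pd.to_timedelta(parts[row.name] + ":00")

new_df = parts.set_index("Time")
print(new_df)
```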


3

| Time             | PANDAS 001 |
|------------------|------------|
| 2023-01-06 08:00 | 00:22      |
| 2023-01-08 13:11 | 00:22      |
| 2023-01-12 00:52 | 00:31      |
| 2023-01-17 23:55 | 00:23      |
| 2023-01-22 11:10 | 00:31      |
| 2023-01-26 21:02 | 00:25      |
| 2023-02-10 23:30 | 00:36      |

| Time             | PANDAS 002 |
|------------------|------------|
| 2023-01-06 13:08 | 00:12      |
| 2023-01-07 23:05 | 00:19      |
| 2023-01-10 21:45 | 00:10      |
| 2023-01-12 20:16 | 00:22      |
| 2023-01-16 23:29 | 00:10      |
| 2023-01-21 14:10 | 00:26      |
| 2023-01-26 00:29 | 00:15      |
| 2023-02-01 23:00 | 00:10      |

| Time             | PANDAS 003 |
|------------------|------------|
| 2023-01-13 23:41 | 00:06      |
| 2023-01-18 22:02 | 00:05      |
| 2023-01-23 23:18 | 00:12      |
| 2023-01-28 11:43 | 01:19      |

| Time             | PANDAS 004 |
|------------------|------------|
| 2023-01-20 22:41 | 00:23      |
| 2023-01-24 22:58 | 00:25      |
| 2023-01-27 21:56 | 00:36      |
| 2023-01-31 22:18 | 00:20      |

| Time             | PYTHON 002 |
|------------------|------------|
| 2023-01-29 23:08 | 01:00      |
| 2023-02-09 23:54 | 00:40      |


Time,PANDAS 001,PANDAS 002,PANDAS 003,PANDAS 004,PYTHON 002
2023-01-06 08:00,0 days 00:22:00,NaT,NaT,NaT,NaT
2023-01-06 13:08,NaT,0 days 00:12:00,NaT,NaT,NaT
2023-01-07 23:05,NaT,0 days 00:19:00,NaT,NaT,NaT
2023-01-08 13:11,0 days 00:22:00,NaT,NaT,NaT,NaT
2023-01-10 21:45,NaT,0 days 00:10:00,NaT,NaT,NaT
2023-01-12 00:52,0 days 00:31:00,NaT,NaT,NaT,NaT
2023-01-12 20:16,NaT,0 days 00:22:00,NaT,NaT,NaT
2023-01-13 23:41,NaT,NaT,0 days 00:06:00,NaT,NaT
2023-01-16 23:29,NaT,0 days 00:10:00,NaT,NaT,NaT
2023-01-17 23:55,0 days 00:23:00,NaT,NaT,NaT,NaT
