# Pandas

Pandas is a powerful library for data manipulation and analysis in Python. It is widely used in a range of fields, including data science, finance, and statistics.

## 003. Series Basics

Pandas Series can be used for a variety of data manipulation and
analysis tasks. For example, you can use them to perform calculations on the
data, such as sum, mean, and standard deviation, or to plot the data using the
built-in plotting functions in pandas.

## 003.000 Assets

Some assets to avoid too much typing

| Name        | Age|
|-------------|----|
| Mbappé      | 23 |
| De Bruyne   | 31 |
| Lewandowski | 33 |
| Benzema     | 34 |
| Messi       | 35 |

In [1]:
import sys
from pathlib import Path

current_dir = Path().resolve()
while current_dir != current_dir.parent and current_dir.name != "katas":
    current_dir = current_dir.parent
if current_dir != current_dir.parent:
    sys.path.append(current_dir.as_posix())


In [2]:
import pandas as pd
from lib.utils import fresh_df
from IPython.core.interactiveshell import InteractiveShell

pd.set_option('display.max_rows', None)
InteractiveShell.ast_node_interactivity = "all"

names_as_list = ["Mbappé", "De Bruyne", "Lewandowski", "Benzema", "Messi"]
ages_as_list = [ 23,       31,          33,             34,        35]


### 003.001 Extract series from DataFrame

1. Load the TSV file into a DF. Extract the "DOB" column as Series without using loc. Print it, notice it has the row indices and values
1. Extract the dobs as a list, and as a NumPy array
1. Extract the row labels as a list, and as array
1. Extract the "DOB" column as Series, this time using loc


In [11]:
datafile = "002.tsv"
df = fresh_df(src=datafile, id="Name")
# solution

df["DOB"]

df["DOB"].values
df["DOB"].to_list()

df["DOB"].index.values
df["DOB"].index.to_list()

df.loc[:, "DOB"]


Name
Mbappé         1998-12-20
De Bruyne      1991-06-28
Lewandowski    1988-08-21
Benzema        1987-12-19
Messi          1987-06-24
Name: DOB, dtype: object

array(['1998-12-20', '1991-06-28', '1988-08-21', '1987-12-19',
       '1987-06-24'], dtype=object)

['1998-12-20', '1991-06-28', '1988-08-21', '1987-12-19', '1987-06-24']

array(['Mbappé', 'De Bruyne', 'Lewandowski', 'Benzema', 'Messi'],
      dtype=object)

['Mbappé', 'De Bruyne', 'Lewandowski', 'Benzema', 'Messi']

Name
Mbappé         1998-12-20
De Bruyne      1991-06-28
Lewandowski    1988-08-21
Benzema        1987-12-19
Messi          1987-06-24
Name: DOB, dtype: object

### 003.002 Sort Series

1. Load the TSV file into a DF. Extract the "DOB" column as Series by making a copy, otherwise it wont' work
1. Sort by row labels
1. Sort by values with default ordering. Then sort it again, reversing it
1. Get another df, this time without id. Print it, and then print it sorted by name


In [23]:
datafile = "002.tsv"
df = fresh_df(src=datafile, id="Name")
# solution

dob = df["DOB"].copy()
dob.sort_index(ascending=True)
dob.sort_values()
dob.sort_values(ascending=False)

df = fresh_df(src=datafile)
df
df.sort_values(by="Name")

Name
Benzema        1987-12-19
De Bruyne      1991-06-28
Lewandowski    1988-08-21
Mbappé         1998-12-20
Messi          1987-06-24
Name: DOB, dtype: object

Name
Messi          1987-06-24
Benzema        1987-12-19
Lewandowski    1988-08-21
De Bruyne      1991-06-28
Mbappé         1998-12-20
Name: DOB, dtype: object

Name
Mbappé         1998-12-20
De Bruyne      1991-06-28
Lewandowski    1988-08-21
Benzema        1987-12-19
Messi          1987-06-24
Name: DOB, dtype: object

Unnamed: 0,Name,DOB
0,Mbappé,1998-12-20
1,De Bruyne,1991-06-28
2,Lewandowski,1988-08-21
3,Benzema,1987-12-19
4,Messi,1987-06-24


Unnamed: 0,Name,DOB
3,Benzema,1987-12-19
1,De Bruyne,1991-06-28
2,Lewandowski,1988-08-21
0,Mbappé,1998-12-20
4,Messi,1987-06-24


### 003.003 File for seaborn

1. Read the data_file
   1. It doesn't have a header 
   1. The index is the first column. Rename it to 'Name'
   1. The rows need to be sorted by index
1. Extract the 1st row as a series to a variable row
   1. Ignore the NaN values
   1. ...and split each cell into a list, splitting by '|'
   1. ...and turn each item in the list into a column (Series)
   1. ...and save into a new df called new_df, and rename the columns "Time" and name, then make "Time" the index
   1. Append ':00' to each 'name' column, and turn into a timedelta
1. Create an empty dataframe called katas_df. Then loop through each row of df, and apply the transformations above to it. At the end of the loop, append the newly created DF to katas_df. Note that you don't need to set index in the row
1. Make "Time" the index column, sort ascending by it, and save to the `save_to` csv


In [52]:
import os
import re
data_file = "katas.tsv"
name = "PANDAS XXX"
save_to = re.sub(r'/katas/.+$', '/katas/seaborn_katas/solutions/katas.tsv', os.path.abspath(''))
# solution

df = pd.read_csv(data_file, sep="\t", index_col=0, header=None)
df.index.name = "Name"
df.sort_index(inplace=True)
# df

row = df.iloc[0]
new_df = row.dropna().apply(lambda x: x.split("|")).apply(pd.Series)
new_df.columns = ["Time", name]
new_df.set_index("Time")
new_df = new_df[name].apply(lambda x: pd.Timedelta(x + ":00"))

katas_df = pd.DataFrame()
for name, row in df.iterrows():
    row = df.iloc[0]
    new_df = row.dropna().apply(lambda x: x.split("|")).apply(pd.Series)
    new_df.columns = ["Time", name]
    new_df = new_df[name].apply(lambda x: pd.Timedelta(x + ":00"))
    katas_df = pd.concat([katas_df, new_df], axis="columns")
    

katas_df = katas_df.set_index("Time")
katas_df


Unnamed: 0_level_0,PANDAS XXX
Time,Unnamed: 1_level_1
2023-04-23 23:21,01:52
2023-04-24 23:46,01:07
2023-04-25 22:20,00:21
2023-05-04 00:05,00:22
2023-05-26 00:08,00:26
2023-08-03 23:11,00:44
2023-08-28 22:18,00:27


Unnamed: 0,GOOGLEAPPS 001,PANDAS 001,PANDAS 002,PANDAS 003,PANDAS 004,PANDAS 006 anki,PANDAS 007 Car,PANDAS 008 anki nlp,PYTHON_REAL_PYTHON 001,PYTHON_REAL_PYTHON 002
1,0 days 01:52:00,0 days 01:52:00,0 days 01:52:00,0 days 01:52:00,0 days 01:52:00,0 days 01:52:00,0 days 01:52:00,0 days 01:52:00,0 days 01:52:00,0 days 01:52:00
2,0 days 01:07:00,0 days 01:07:00,0 days 01:07:00,0 days 01:07:00,0 days 01:07:00,0 days 01:07:00,0 days 01:07:00,0 days 01:07:00,0 days 01:07:00,0 days 01:07:00
3,0 days 00:21:00,0 days 00:21:00,0 days 00:21:00,0 days 00:21:00,0 days 00:21:00,0 days 00:21:00,0 days 00:21:00,0 days 00:21:00,0 days 00:21:00,0 days 00:21:00
4,0 days 00:22:00,0 days 00:22:00,0 days 00:22:00,0 days 00:22:00,0 days 00:22:00,0 days 00:22:00,0 days 00:22:00,0 days 00:22:00,0 days 00:22:00,0 days 00:22:00
5,0 days 00:26:00,0 days 00:26:00,0 days 00:26:00,0 days 00:26:00,0 days 00:26:00,0 days 00:26:00,0 days 00:26:00,0 days 00:26:00,0 days 00:26:00,0 days 00:26:00
6,0 days 00:44:00,0 days 00:44:00,0 days 00:44:00,0 days 00:44:00,0 days 00:44:00,0 days 00:44:00,0 days 00:44:00,0 days 00:44:00,0 days 00:44:00,0 days 00:44:00
7,0 days 00:27:00,0 days 00:27:00,0 days 00:27:00,0 days 00:27:00,0 days 00:27:00,0 days 00:27:00,0 days 00:27:00,0 days 00:27:00,0 days 00:27:00,0 days 00:27:00
