# Pandas

Pandas is a powerful library for data manipulation and analysis in Python. It is widely used in a range of fields, including data science, finance, and statistics.

## 001. DataFrame Importing / Exporting

### 001.000 Assets

Shared assets used by the exercises below, to avoid retyping the data:

| Name        | Age|
|-------------|----|
| Mbappé      | 23 |
| De Bruyne   | 31 |
| Lewandowski | 33 |
| Benzema     | 34 |
| Messi       | 35 |

In [1]:
import sys
from pathlib import Path

current_dir = Path().resolve()
while current_dir != current_dir.parent and current_dir.name != "katas":
    current_dir = current_dir.parent
if current_dir != current_dir.parent:
    sys.path.append(current_dir.as_posix())

In [15]:
import pandas as pd
import numpy as np
from lib.utils import fresh_df
from IPython.core.interactiveshell import InteractiveShell

pd.set_option('display.max_rows', None)
InteractiveShell.ast_node_interactivity = "all"
names_as_list = ["Mbappé", "De Bruyne", "Lewandowski", "Benzema", "Messi"]
ages_as_list = [ 23,       31,          33,             34,        35]


### 001.001 Dictionary of arrays

1. Replicate this data as a DataFrame built from a dictionary of arrays with the `pd.DataFrame` constructor, and print it
1. Do the same, but using `from_dict`
1. Extract the dict from the DataFrame again with `to_dict`


In [12]:
# solution
1
data = {
    "Name": names_as_list,
    "Age": ages_as_list
}
pd.DataFrame(data)

2
df = pd.DataFrame.from_dict(data)
df

3
df.to_dict(orient="list")

1

|   | Name        | Age |
|---|-------------|-----|
| 0 | Mbappé      | 23  |
| 1 | De Bruyne   | 31  |
| 2 | Lewandowski | 33  |
| 3 | Benzema     | 34  |
| 4 | Messi       | 35  |


2

|   | Name        | Age |
|---|-------------|-----|
| 0 | Mbappé      | 23  |
| 1 | De Bruyne   | 31  |
| 2 | Lewandowski | 33  |
| 3 | Benzema     | 34  |
| 4 | Messi       | 35  |


3

{'Name': ['Mbappé', 'De Bruyne', 'Lewandowski', 'Benzema', 'Messi'],
 'Age': [23, 31, 33, 34, 35]}
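
The `orient` argument decides the shape of what `to_dict` returns. The following is a minimal sketch on a two-row frame (the shortened data is an assumption made for brevity, not part of the exercise) comparing the orientations most relevant to these katas:

```python
import pandas as pd

# Shortened sample data, assumed for illustration only
df = pd.DataFrame({"Name": ["Mbappé", "Messi"], "Age": [23, 35]})

as_lists = df.to_dict(orient="list")        # {column: [values]}
as_records = df.to_dict(orient="records")   # [{column: value}, ...], one dict per row
as_index = df.to_dict(orient="index")       # {index label: {column: value}}
as_default = df.to_dict()                   # default orient="dict": {column: {index label: value}}

print(as_lists)
print(as_records)
print(as_index)
print(as_default)
```

`orient="list"` gives back the dictionary-of-arrays shape used to build the frame, which is why the solution above uses it.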

### 001.002 List of dictionaries

1. Create a list where every entry is a dict with keys "Name" and "Age", print it, and build a DataFrame from it with the constructor
1. Do the same, but using `from_records`
1. Extract the list from the DataFrame again with `to_dict` (not `to_records`). Note that, with the right parameters, `to_dict` can return a list rather than a dict


In [14]:
# solution
1
data = [{ "Name": name, "Age": age} for name, age in zip(names_as_list, ages_as_list)]
data
pd.DataFrame(data)

2
df = pd.DataFrame.from_records(data)
df

3
df.to_dict(orient="records")


1

[{'Name': 'Mbappé', 'Age': 23},
 {'Name': 'De Bruyne', 'Age': 31},
 {'Name': 'Lewandowski', 'Age': 33},
 {'Name': 'Benzema', 'Age': 34},
 {'Name': 'Messi', 'Age': 35}]

|   | Name        | Age |
|---|-------------|-----|
| 0 | Mbappé      | 23  |
| 1 | De Bruyne   | 31  |
| 2 | Lewandowski | 33  |
| 3 | Benzema     | 34  |
| 4 | Messi       | 35  |


2

|   | Name        | Age |
|---|-------------|-----|
| 0 | Mbappé      | 23  |
| 1 | De Bruyne   | 31  |
| 2 | Lewandowski | 33  |
| 3 | Benzema     | 34  |
| 4 | Messi       | 35  |


3

[{'Name': 'Mbappé', 'Age': 23},
 {'Name': 'De Bruyne', 'Age': 31},
 {'Name': 'Lewandowski', 'Age': 33},
 {'Name': 'Benzema', 'Age': 34},
 {'Name': 'Messi', 'Age': 35}]
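
A quick way to confirm that the list of dictionaries survives the trip through a DataFrame is to compare the extracted records with the input. This is a small sketch on shortened data (an assumption for brevity); the variable names are illustrative only:

```python
import pandas as pd

# Shortened sample data, assumed for illustration only
records = [{"Name": "Mbappé", "Age": 23}, {"Name": "Messi", "Age": 35}]

via_constructor = pd.DataFrame(records)
via_from_records = pd.DataFrame.from_records(records)

# Both construction paths should yield the same column labels
same_columns = list(via_constructor.columns) == list(via_from_records.columns)

# orient="records" returns one dict per row, i.e. the original list shape
round_trip = via_from_records.to_dict(orient="records")

print(same_columns)
print(round_trip == records)
```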

### 001.003 2D NumPy array

1. Print a NumPy 2D array with the same data as before, plus an extra "Last Updated" column holding a `pd.Timestamp`, and create a DataFrame from it. Make sure the column names are "Name", "Age", "Last Updated"
1. Update the "Last Updated" column with the current timestamp
1. Convert the DataFrame to a NumPy array and print it, proving the timestamp is different


In [16]:
# solution

1
data = np.array([ [name, age, pd.Timestamp.now()] for name, age in zip(names_as_list, ages_as_list)])
data
df = pd.DataFrame(data, columns=["Name", "Age", "Last Updated"])
df

2
df["Last Updated"] = pd.Timestamp.now()

3
data = df.to_numpy()
data


1

array([['Mbappé', 23, Timestamp('2023-01-22 10:50:54.764612')],
       ['De Bruyne', 31, Timestamp('2023-01-22 10:50:54.764627')],
       ['Lewandowski', 33, Timestamp('2023-01-22 10:50:54.764630')],
       ['Benzema', 34, Timestamp('2023-01-22 10:50:54.764632')],
       ['Messi', 35, Timestamp('2023-01-22 10:50:54.764634')]],
      dtype=object)

|   | Name        | Age | Last Updated               |
|---|-------------|-----|----------------------------|
| 0 | Mbappé      | 23  | 2023-01-22 10:50:54.764612 |
| 1 | De Bruyne   | 31  | 2023-01-22 10:50:54.764627 |
| 2 | Lewandowski | 33  | 2023-01-22 10:50:54.764630 |
| 3 | Benzema     | 34  | 2023-01-22 10:50:54.764632 |
| 4 | Messi       | 35  | 2023-01-22 10:50:54.764634 |


2

3

array([['Mbappé', 23, Timestamp('2023-01-22 10:50:54.769727')],
       ['De Bruyne', 31, Timestamp('2023-01-22 10:50:54.769727')],
       ['Lewandowski', 33, Timestamp('2023-01-22 10:50:54.769727')],
       ['Benzema', 34, Timestamp('2023-01-22 10:50:54.769727')],
       ['Messi', 35, Timestamp('2023-01-22 10:50:54.769727')]],
      dtype=object)
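
One side effect worth knowing: a mixed-type NumPy array has `dtype=object`, so a numeric column such as "Age" arrives in the DataFrame as `object` rather than an integer dtype. The sketch below, on shortened data and with a fixed timestamp (both assumptions for illustration), shows one way to restore explicit dtypes afterwards:

```python
import numpy as np
import pandas as pd

# Fixed timestamp and shortened data, assumed for illustration only
stamp = pd.Timestamp("2023-01-22 10:50:54")
data = np.array([["Mbappé", 23, stamp], ["Messi", 35, stamp]], dtype=object)

df = pd.DataFrame(data, columns=["Name", "Age", "Last Updated"])
age_dtype_before = str(df["Age"].dtype)  # integers stored as generic Python objects

# Cast the columns explicitly once the frame exists
typed = df.astype({"Age": "int64"})
typed["Last Updated"] = pd.to_datetime(typed["Last Updated"])
age_dtype_after = str(typed["Age"].dtype)

print(age_dtype_before, age_dtype_after)
print(typed.dtypes)
```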

### 001.004 Pandas Series

1. Create two pandas Series from the two lists, and then concatenate them into a DataFrame. Make sure the column names are still "Name" and "Age"

In [17]:
# solution
1
s1 = pd.Series(names_as_list)
s2 = pd.Series(ages_as_list)
pd.concat([s1, s2], keys=["Name", "Age"], axis=1)


1

|   | Name        | Age |
|---|-------------|-----|
| 0 | Mbappé      | 23  |
| 1 | De Bruyne   | 31  |
| 2 | Lewandowski | 33  |
| 3 | Benzema     | 34  |
| 4 | Messi       | 35  |
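
An alternative to passing `keys` is to name each Series up front; with `axis=1`, `pd.concat` then uses the Series names as column labels. A minimal sketch on shortened data (assumed for brevity):

```python
import pandas as pd

# Shortened sample data, assumed for illustration only
names = pd.Series(["Mbappé", "Messi"], name="Name")
ages = pd.Series([23, 35], name="Age")

# axis=1 places the Series side by side; their names become the column labels
df = pd.concat([names, ages], axis=1)
print(df)
```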


### 001.005 SQL (sqlite)

1. Create a DataFrame by reading from table 'players' in SQLite DB "001.005.sqlite" with the built-in SQLite module. Note that the index column is called 'id'
1. Change the Last Updated column to the current timestamp in the DataFrame. Write the updated DataFrame data to the database. HINT: Make sure you tell the function to use the 'id' column for indices
1. Load the data again into a new DataFrame, and confirm it's the same as in step (1) except for the timestamp



In [18]:
# solution
import sqlite3 as sqlite

1
con = sqlite.connect("001.005.sqlite")
df = pd.read_sql("SELECT * FROM players", con=con, index_col="id")
df

2
df["Last Updated"] = pd.Timestamp.now()
df.to_sql("players", con=con, if_exists="replace", index=True, index_label="id")

3
pd.read_sql("SELECT * FROM players", con=con, index_col="id")





1

| id | Name        | Age | Last Updated               |
|----|-------------|-----|----------------------------|
| 0  | Mbappé      | 23  | 2023-01-21 22:22:15.222182 |
| 1  | De Bruyne   | 31  | 2023-01-21 22:22:15.222182 |
| 2  | Lewandowski | 33  | 2023-01-21 22:22:15.222182 |
| 3  | Benzema     | 34  | 2023-01-21 22:22:15.222182 |
| 4  | Messi       | 35  | 2023-01-21 22:22:15.222182 |


2

5

3

| id | Name        | Age | Last Updated               |
|----|-------------|-----|----------------------------|
| 0  | Mbappé      | 23  | 2023-01-22 10:57:51.580700 |
| 1  | De Bruyne   | 31  | 2023-01-22 10:57:51.580700 |
| 2  | Lewandowski | 33  | 2023-01-22 10:57:51.580700 |
| 3  | Benzema     | 34  | 2023-01-22 10:57:51.580700 |
| 4  | Messi       | 35  | 2023-01-22 10:57:51.580700 |
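
To experiment with the same round trip without the kata's database file, an in-memory SQLite database works as a stand-in. The sketch below makes several assumptions for illustration: the data is shortened, the timestamp is fixed, and `parse_dates` is added so the timestamp column comes back as a datetime rather than text:

```python
import sqlite3

import pandas as pd

# ":memory:" creates a throwaway database, so nothing is written to disk
con = sqlite3.connect(":memory:")

original = pd.DataFrame(
    {"Name": ["Mbappé", "Messi"], "Age": [23, 35]},
    index=pd.Index([0, 1], name="id"),
)
original["Last Updated"] = pd.Timestamp("2023-01-21 22:22:15")

# index=True with index_label="id" stores the index as a column named "id"
original.to_sql("players", con=con, if_exists="replace", index=True, index_label="id")

# index_col="id" turns that column back into the index on the way in
restored = pd.read_sql(
    "SELECT * FROM players",
    con=con,
    index_col="id",
    parse_dates=["Last Updated"],
)
con.close()

print(restored)
```

In recent pandas versions `to_sql` returns the number of rows written, which is where a bare integer in a notebook output for that step comes from.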


### 001.006 SQL (sqlalchemy)

1. Replicate 001.005 but use sqlalchemy to connect to the DB

(Note: you will probably get warnings, but you can ignore them)

In [8]:
# solution
from sqlalchemy import create_engine

engine = create_engine("sqlite:///001.005.sqlite")
conn = engine.raw_connection()
df = pd.read_sql_query('SELECT * FROM players', conn, index_col='id')
df.head()

df["Last Updated"] = pd.Timestamp.now()
df.to_sql('players', con=conn, if_exists='replace', index=True, index_label='id')

df = pd.read_sql_query('SELECT * FROM players', conn, index_col='id')
df.head()



| id | Name        | Age | Last Updated               |
|----|-------------|-----|----------------------------|
| 0  | Mbappé      | 23  | 2023-01-21 22:22:15.119076 |
| 1  | De Bruyne   | 31  | 2023-01-21 22:22:15.119076 |
| 2  | Lewandowski | 33  | 2023-01-21 22:22:15.119076 |
| 3  | Benzema     | 34  | 2023-01-21 22:22:15.119076 |
| 4  | Messi       | 35  | 2023-01-21 22:22:15.119076 |




5



| id | Name        | Age | Last Updated               |
|----|-------------|-----|----------------------------|
| 0  | Mbappé      | 23  | 2023-01-21 22:22:15.222182 |
| 1  | De Bruyne   | 31  | 2023-01-21 22:22:15.222182 |
| 2  | Lewandowski | 33  | 2023-01-21 22:22:15.222182 |
| 3  | Benzema     | 34  | 2023-01-21 22:22:15.222182 |
| 4  | Messi       | 35  | 2023-01-21 22:22:15.222182 |


### 001.007 TSV file

1. Create a DataFrame by reading from TSV file in `data_file`. Note that the index column is called 'id'
1. Change the Last Updated column to the current timestamp in the DataFrame. Write the updated DataFrame data to the same file. HINT: Make sure you tell the function to use the 'id' column for indices
1. Load the data again into a new DataFrame, and confirm it's the same as in step (1) except for the timestamp

In [19]:
data_file = "001.007.tsv"
# solution

1
df = pd.read_csv(data_file, sep="\t", index_col="id")
df

2
df["Last Updated"] = pd.Timestamp.now()
df.to_csv(data_file, index=True, index_label="id", sep="\t")

3
pd.read_csv(data_file, sep="\t", index_col="id")

1

| id | Name        | Age | Last Updated               |
|----|-------------|-----|----------------------------|
| 0  | Mbappé      | 23  | 2023-01-21 22:22:15.249774 |
| 1  | De Bruyne   | 31  | 2023-01-21 22:22:15.249774 |
| 2  | Lewandowski | 33  | 2023-01-21 22:22:15.249774 |
| 3  | Benzema     | 34  | 2023-01-21 22:22:15.249774 |
| 4  | Messi       | 35  | 2023-01-21 22:22:15.249774 |


2

3

| id | Name        | Age | Last Updated               |
|----|-------------|-----|----------------------------|
| 0  | Mbappé      | 23  | 2023-01-22 11:05:16.009416 |
| 1  | De Bruyne   | 31  | 2023-01-22 11:05:16.009416 |
| 2  | Lewandowski | 33  | 2023-01-22 11:05:16.009416 |
| 3  | Benzema     | 34  | 2023-01-22 11:05:16.009416 |
| 4  | Messi       | 35  | 2023-01-22 11:05:16.009416 |
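
The TSV round trip can also be tried without touching a file by reading from and writing to in-memory text buffers. In this sketch the inline TSV text, the fixed timestamp, and the use of `parse_dates` are assumptions added for illustration; without `parse_dates`, the timestamp column is read back as plain strings:

```python
import io

import pandas as pd

# Inline stand-in for the TSV file, assumed for illustration only
tsv_text = (
    "id\tName\tAge\tLast Updated\n"
    "0\tMbappé\t23\t2023-01-21 22:22:15.249774\n"
    "1\tMessi\t35\t2023-01-21 22:22:15.249774\n"
)

df = pd.read_csv(io.StringIO(tsv_text), sep="\t", index_col="id", parse_dates=["Last Updated"])
df["Last Updated"] = pd.Timestamp("2023-01-22 11:05:16.009416")

# Write back with the index included under the "id" header
buffer = io.StringIO()
df.to_csv(buffer, sep="\t", index=True, index_label="id")

restored = pd.read_csv(
    io.StringIO(buffer.getvalue()),
    sep="\t",
    index_col="id",
    parse_dates=["Last Updated"],
)
print(restored)
```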


### 001.008 JSON file

1. Create a DataFrame by reading from JSON file in `data_file`
1. Change the Last Updated column to the current timestamp in the DataFrame. Write the updated DataFrame data to the same file. HINT: Make sure you tell the function to use the 'id' column for indices
1. Load the data again into a new DataFrame, and confirm it's the same as in step (1) except for the timestamp. Make sure non-ASCII characters are readable

In [20]:
data_file = "001.008.json"
# solution

1
df = pd.read_json(data_file)
df

2
df["Last Updated"] = pd.Timestamp.now()
df.to_json(data_file, force_ascii=False)

3
pd.read_json(data_file)

1

|   | Name        | Age | Last Updated  |
|---|-------------|-----|---------------|
| 0 | Mbappé      | 23  | 1674339735275 |
| 1 | De Bruyne   | 31  | 1674339735275 |
| 2 | Lewandowski | 33  | 1674339735275 |
| 3 | Benzema     | 34  | 1674339735275 |
| 4 | Messi       | 35  | 1674339735275 |


2

3

|   | Name        | Age | Last Updated  |
|---|-------------|-----|---------------|
| 0 | Mbappé      | 23  | 1674385606998 |
| 1 | De Bruyne   | 31  | 1674385606998 |
| 2 | Lewandowski | 33  | 1674385606998 |
| 3 | Benzema     | 34  | 1674385606998 |
| 4 | Messi       | 35  | 1674385606998 |
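
By default `to_json` writes datetimes as epoch milliseconds, and `read_json` only converts columns whose names look date-like, so a column called "Last Updated" comes back as plain integers. The sketch below shows one way to get datetimes back by naming the column in `convert_dates`. The shortened data, the fixed timestamp, the in-memory string instead of a file, and the explicit `date_unit="ms"` are all assumptions made for illustration:

```python
import io

import pandas as pd

# Fixed timestamp and shortened data, assumed for illustration only
stamp = pd.Timestamp("2023-01-22 11:05:16.009416")
df = pd.DataFrame({"Name": ["Mbappé", "Messi"], "Age": [23, 35]})
df["Last Updated"] = stamp
df["Last Updated"] = df["Last Updated"].astype("datetime64[ns]")

# force_ascii=False keeps "é" readable instead of escaping it as \u00e9
json_text = df.to_json(force_ascii=False, date_unit="ms")

# Naming the column in convert_dates asks read_json to parse it as a datetime
restored = pd.read_json(
    io.StringIO(json_text),
    convert_dates=["Last Updated"],
    date_unit="ms",
)
print(json_text)
print(restored.dtypes)
```

Because the timestamps are stored at millisecond resolution in this sketch, anything finer (microseconds here) is lost in the round trip.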


### 001.009 From a table in an HTML page

1. Create a DataFrame by reading from the HTML file in `data_file`. Use the BeautifulSoup flavor. Note that the index column is called "ID"
1. Change the Last Updated column to the current timestamp in the DataFrame. Write the updated DataFrame data to the same file. HINT: Make sure you "flatten" the indices, because `to_html` writes two rows for the headers exactly as you see in the output
1. Load the data again into a new DataFrame, and confirm it's the same as in step (1) except for the timestamp

In [21]:
data_file = "001.009.html"
# solution

1
df = pd.read_html(data_file, flavor="bs4", index_col="ID")[0]
df

2
df["Last Updated"] = pd.Timestamp.now()
df.columns.name = "ID"
df.index.name = None
df.to_html(data_file, index=True, header=True)  # header is a boolean flag here; the "ID" label comes from df.columns.name

3
pd.read_html(data_file, flavor="bs4", index_col="ID")[0]

1

| ID | Name        | Age | Last Updated               |
|----|-------------|-----|----------------------------|
| 0  | Mbappé      | 23  | 2023-01-21 22:22:15.383428 |
| 1  | De Bruyne   | 31  | 2023-01-21 22:22:15.383428 |
| 2  | Lewandowski | 33  | 2023-01-21 22:22:15.383428 |
| 3  | Benzema     | 34  | 2023-01-21 22:22:15.383428 |
| 4  | Messi       | 35  | 2023-01-21 22:22:15.383428 |


2

3

| ID | Name        | Age | Last Updated               |
|----|-------------|-----|----------------------------|
| 0  | Mbappé      | 23  | 2023-01-22 11:10:38.173231 |
| 1  | De Bruyne   | 31  | 2023-01-22 11:10:38.173231 |
| 2  | Lewandowski | 33  | 2023-01-22 11:10:38.173231 |
| 3  | Benzema     | 34  | 2023-01-22 11:10:38.173231 |
| 4  | Messi       | 35  | 2023-01-22 11:10:38.173231 |
