# Pandas

Pandas is a powerful library for data manipulation and analysis in Python. It is widely used in a range of fields, including data science, finance, and statistics.

## 001. DataFrame Importing / Exporting

### 001.000 Assets

Shared data and setup, to avoid too much typing in the exercises below.

| Name        | Age |
|-------------|-----|
| Mbappé      | 23  |
| De Bruyne   | 31  |
| Lewandowski | 33  |
| Benzema     | 34  |
| Messi       | 35  |

In [1]:
import sys
from pathlib import Path

current_dir = Path().resolve()
while current_dir != current_dir.parent and current_dir.name != "katas":
    current_dir = current_dir.parent
if current_dir != current_dir.parent:
    sys.path.append(current_dir.as_posix())

In [2]:
import pandas as pd
import numpy as np
from lib.utils import fresh_df
from IPython.core.interactiveshell import InteractiveShell

pd.set_option('display.max_rows', None)
InteractiveShell.ast_node_interactivity = "all"
names_as_list = ["Mbappé", "De Bruyne", "Lewandowski", "Benzema", "Messi"]
ages_as_list = [ 23,       31,          33,             34,        35]


### 001.001 Dictionary of arrays

1. Replicate this data as a DataFrame from a dictionary of arrays, using the constructor, and print it
1. Do the same, but using `from_dict`
1. Extract the dict out of the DataFrame again with `to_dict`


In [16]:
# solution
data = {
    "Name": names_as_list,
    "Age": ages_as_list
}
df = pd.DataFrame(data)
df 


df = pd.DataFrame.from_dict(data=data)
df 

df.to_dict(orient="list")




,Name,Age
0,Mbappé,23
1,De Bruyne,31
2,Lewandowski,33
3,Benzema,34
4,Messi,35


,Name,Age
0,Mbappé,23
1,De Bruyne,31
2,Lewandowski,33
3,Benzema,34
4,Messi,35


{'Name': ['Mbappé', 'De Bruyne', 'Lewandowski', 'Benzema', 'Messi'],
 'Age': [23, 31, 33, 34, 35]}
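The `orient` argument controls the shape of the data in both directions. A minimal sketch on a two-row subset of the data; the variable names are illustrative and not part of the kata:

```python
import pandas as pd

data = {"Name": ["Mbappé", "Messi"], "Age": [23, 35]}
df = pd.DataFrame.from_dict(data)  # default orient="columns": keys become columns

# orient="list" returns the same column -> list mapping that went in
as_lists = df.to_dict(orient="list")

# the default orient ("dict") nests each column as {index label: value}
as_nested = df.to_dict()

# from_dict(orient="index") treats the keys as row labels instead
by_row = pd.DataFrame.from_dict(
    {0: ["Mbappé", 23], 1: ["Messi", 35]}, orient="index", columns=["Name", "Age"]
)

print(as_lists)
print(as_nested)
print(by_row)
```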

### 001.002 List of dictionaries

1. Create a list where every entry is a dict with keys "Name" and "Age", and print
1. Create a df from it
1. Do the same, but using from_records
1. Extract the list out of the df again with to_dict (not to_records). Note that you can actually get a list and not a dict, with the right parameters...


In [18]:
# solution

data = [{"Name": name, "Age": age} for name, age in zip(names_as_list, ages_as_list)]
data

df = pd.DataFrame(data)
df 


df = pd.DataFrame.from_records(data=data)
df 

df.to_dict(orient="records")

[{'Name': 'Mbappé', 'Age': 23},
 {'Name': 'De Bruyne', 'Age': 31},
 {'Name': 'Lewandowski', 'Age': 33},
 {'Name': 'Benzema', 'Age': 34},
 {'Name': 'Messi', 'Age': 35}]

,Name,Age
0,Mbappé,23
1,De Bruyne,31
2,Lewandowski,33
3,Benzema,34
4,Messi,35


,Name,Age
0,Mbappé,23
1,De Bruyne,31
2,Lewandowski,33
3,Benzema,34
4,Messi,35


[{'Name': 'Mbappé', 'Age': 23},
 {'Name': 'De Bruyne', 'Age': 31},
 {'Name': 'Lewandowski', 'Age': 33},
 {'Name': 'Benzema', 'Age': 34},
 {'Name': 'Messi', 'Age': 35}]
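For contrast with the `to_records` method mentioned in the task, here is a small sketch of what each call returns. It uses a two-row stand-in for the data, and the variable names are assumptions:

```python
import pandas as pd

records = [{"Name": "Mbappé", "Age": 23}, {"Name": "Messi", "Age": 35}]
df = pd.DataFrame.from_records(records)

# to_dict(orient="records") gives back a plain list of dicts
as_dicts = df.to_dict(orient="records")

# to_records returns a NumPy record array; index=False leaves the row labels out
as_tuples = df.to_records(index=False).tolist()

print(as_dicts)
print(as_tuples)
```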

### 001.003 2D NumPy array

1. Print a NumPy 2D array with the same data as before, but with an extra column for "Last Updated" with a pd.Timestamp, and create a df with it. Make sure the column names are still "Name", "Age", "Last Updated"
1. Update the "Last Updated" column with the current timestamp
1. Convert to a NumPy array and print it, proving the timestamp is different


In [22]:
# solution
data = np.array([[name, age, pd.Timestamp.now()] for name, age in zip(names_as_list, ages_as_list)])
data

df = pd.DataFrame(data, columns=["Name", "Age", "Last Updated"])
df 

df["Last Updated"] = pd.Timestamp.now()

df.to_numpy()

array([['Mbappé', 23, Timestamp('2023-03-29 00:25:41.002063')],
       ['De Bruyne', 31, Timestamp('2023-03-29 00:25:41.002079')],
       ['Lewandowski', 33, Timestamp('2023-03-29 00:25:41.002081')],
       ['Benzema', 34, Timestamp('2023-03-29 00:25:41.002083')],
       ['Messi', 35, Timestamp('2023-03-29 00:25:41.002084')]],
      dtype=object)

,Name,Age,Last Updated
0,Mbappé,23,2023-03-29 00:25:41.002063
1,De Bruyne,31,2023-03-29 00:25:41.002079
2,Lewandowski,33,2023-03-29 00:25:41.002081
3,Benzema,34,2023-03-29 00:25:41.002083
4,Messi,35,2023-03-29 00:25:41.002084


array([['Mbappé', 23, Timestamp('2023-03-29 00:25:41.006848')],
       ['De Bruyne', 31, Timestamp('2023-03-29 00:25:41.006848')],
       ['Lewandowski', 33, Timestamp('2023-03-29 00:25:41.006848')],
       ['Benzema', 34, Timestamp('2023-03-29 00:25:41.006848')],
       ['Messi', 35, Timestamp('2023-03-29 00:25:41.006848')]],
      dtype=object)
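Because the array mixes strings, integers and timestamps, NumPy stores it with `dtype=object`, and the resulting "Age" column is not an integer column. A minimal sketch of checking and fixing that; the fixed timestamp and the two-row subset are assumptions made to keep the example repeatable:

```python
import numpy as np
import pandas as pd

stamp = pd.Timestamp("2023-03-29 00:25:41")  # fixed value so the example is repeatable
rows = [["Mbappé", 23, stamp], ["Messi", 35, stamp]]
data = np.array(rows, dtype=object)  # mixed types need dtype=object

df = pd.DataFrame(data, columns=["Name", "Age", "Last Updated"])
print(df.dtypes)  # "Age" is stored as object, not as an integer dtype

# cast explicitly once the frame is built
typed = df.astype({"Age": "int64"})
print(typed.dtypes)
```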

### 001.004 Pandas Series

1. Create two pandas series from the two lists, and then concatenate them into a df. Make sure the column names are still "Name" and "Age"
1. Do the same thing, but differently - this time make sure the names are already in the Series

In [25]:
# solution
s1 = pd.Series(names_as_list)
s2 = pd.Series(ages_as_list)

df = pd.concat([s1, s2], keys=["Name", "Age"], axis=1)
df

df = pd.concat([s1, s2], axis=1)
df.columns = ["Name", "Age"]
df

,Name,Age
0,Mbappé,23
1,De Bruyne,31
2,Lewandowski,33
3,Benzema,34
4,Messi,35


,Name,Age
0,Mbappé,23
1,De Bruyne,31
2,Lewandowski,33
3,Benzema,34
4,Messi,35
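The second task asks for the names to live on the Series themselves. One way to read that, sketched here on a two-row subset, is to pass `name=` when building each Series so that `pd.concat` picks the column labels up without `keys` or a later `df.columns` assignment:

```python
import pandas as pd

names = pd.Series(["Mbappé", "Messi"], name="Name")
ages = pd.Series([23, 35], name="Age")

# with axis=1, each Series name becomes the column label
df = pd.concat([names, ages], axis=1)
print(df)
```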


### 001.005 SQL (sqlite)

1. Create a DataFrame by reading from the table 'players' in the SQLite DB "001.005.sqlite" with the built-in SQLite module. Note that the index column is called 'id'
1. Change the Last Updated column to the current timestamp in the DataFrame. Write the updated DataFrame data to the database. HINT: Make sure you tell the function to use the 'id' column for indices
1. Load the data again in a new DataFrame, and confirm it's the same as for step (1) except for the timestamp



In [32]:
# solution
import sqlite3 as sqlite

con = sqlite.connect("001.005.sqlite")
df = pd.read_sql("SELECT * FROM players", con=con, index_col="id")
df


df["Last Updated"] = pd.Timestamp.now()
df.to_sql("players", con=con, index=True, index_label="id", if_exists="replace")

pd.read_sql("SELECT * FROM players", con=con, index_col="id")

,Name,Age,Last Updated
id,,,
0,Mbappé,23,2023-03-21 23:06:10.611184
1,De Bruyne,31,2023-03-21 23:06:10.611184
2,Lewandowski,33,2023-03-21 23:06:10.611184
3,Benzema,34,2023-03-21 23:06:10.611184
4,Messi,35,2023-03-21 23:06:10.611184


5

,Name,Age,Last Updated
id,,,
0,Mbappé,23,2023-03-29 00:30:50.013150
1,De Bruyne,31,2023-03-29 00:30:50.013150
2,Lewandowski,33,2023-03-29 00:30:50.013150
3,Benzema,34,2023-03-29 00:30:50.013150
4,Messi,35,2023-03-29 00:30:50.013150
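The "confirm it's the same except for the timestamp" step can also be checked in code rather than by eye. The sketch below is self-contained. It uses an in-memory database as a stand-in for "001.005.sqlite", a two-row `players` table, and timestamps stored as plain strings; all three are assumptions made for illustration:

```python
import sqlite3

import pandas as pd

con = sqlite3.connect(":memory:")  # throwaway stand-in for the kata's DB file

# seed a small 'players' table whose index column is called 'id'
seed = pd.DataFrame({"Name": ["Mbappé", "Messi"], "Age": [23, 35]})
seed["Last Updated"] = "2023-03-21 23:06:10"
seed.index.name = "id"
seed.to_sql("players", con=con, index=True)

before = pd.read_sql("SELECT * FROM players", con=con, index_col="id")

after = before.copy()
after["Last Updated"] = str(pd.Timestamp.now())
after.to_sql("players", con=con, index=True, index_label="id", if_exists="replace")

reloaded = pd.read_sql("SELECT * FROM players", con=con, index_col="id")
con.close()

# compare everything except the column that is expected to change
same_data = before.drop(columns="Last Updated").equals(
    reloaded.drop(columns="Last Updated")
)
timestamp_changed = (before["Last Updated"] != reloaded["Last Updated"]).all()
print(same_data, timestamp_changed)
```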


### 001.006 SQL (sqlalchemy)

1. Replicate 001.005 but use sqlalchemy to connect to the DB

(Note: you will probably get warnings, but you can ignore them)

In [33]:
# solution
from sqlalchemy import create_engine

engine = create_engine("sqlite:///001.005.sqlite")
con = engine.raw_connection()
df = pd.read_sql("SELECT * FROM players", con=con, index_col="id")
df


df["Last Updated"] = pd.Timestamp.now()
df.to_sql("players", con=con, index=True, index_label="id", if_exists="replace")

pd.read_sql("SELECT * FROM players", con=con, index_col="id")


,Name,Age,Last Updated
id,,,
0,Mbappé,23,2023-03-29 00:30:50.013150
1,De Bruyne,31,2023-03-29 00:30:50.013150
2,Lewandowski,33,2023-03-29 00:30:50.013150
3,Benzema,34,2023-03-29 00:30:50.013150
4,Messi,35,2023-03-29 00:30:50.013150



5


,Name,Age,Last Updated
id,,,
0,Mbappé,23,2023-03-29 00:31:56.821521
1,De Bruyne,31,2023-03-29 00:31:56.821521
2,Lewandowski,33,2023-03-29 00:31:56.821521
3,Benzema,34,2023-03-29 00:31:56.821521
4,Messi,35,2023-03-29 00:31:56.821521
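The warnings most likely come from handing pandas a raw DBAPI connection obtained through `raw_connection()`. Passing the SQLAlchemy engine itself is an alternative worth trying. The sketch below assumes a reasonably recent pandas and SQLAlchemy pair, and uses a hypothetical throwaway database file, a two-row table, and string timestamps instead of the kata's "001.005.sqlite":

```python
import tempfile
from pathlib import Path

import pandas as pd
from sqlalchemy import create_engine

with tempfile.TemporaryDirectory() as tmp:
    db_path = Path(tmp, "players.sqlite")  # hypothetical throwaway file
    engine = create_engine(f"sqlite:///{db_path.as_posix()}")

    # seed a small 'players' table whose index column is called 'id'
    seed = pd.DataFrame({"Name": ["Mbappé", "Messi"], "Age": [23, 35]})
    seed["Last Updated"] = "2023-03-21 23:06:10"
    seed.index.name = "id"
    seed.to_sql("players", con=engine, index=True, if_exists="replace")

    # the engine is passed directly; pandas opens and closes connections itself
    df = pd.read_sql("SELECT * FROM players", con=engine, index_col="id")
    df["Last Updated"] = str(pd.Timestamp.now())
    df.to_sql("players", con=engine, index=True, index_label="id", if_exists="replace")

    reloaded = pd.read_sql("SELECT * FROM players", con=engine, index_col="id")
    engine.dispose()  # release the file before the temp directory is removed

print(reloaded)
```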


### 001.007 TSV file

1. Create a DataFrame by reading from the TSV file in `data_file`. Note that the index column is called 'id'
1. Change the Last Updated column to the current timestamp in the DataFrame. Write the updated DataFrame data to the same file. HINT: Make sure you tell the function to use the 'id' column for indices
1. Load the data again in a new DataFrame, and confirm it's the same as for step (1) except for the timestamp

In [35]:
data_file = "001.007.tsv"
# solution

df = pd.read_csv(data_file, sep="\t", index_col="id")
df

df["Last Updated"] = pd.Timestamp.now()
df.to_csv(data_file, sep="\t", index=True, index_label="id")

pd.read_csv(data_file, sep="\t", index_col="id")

,Name,Age,Last Updated
id,,,
0,Mbappé,23,2023-03-29 00:34:15.229704
1,De Bruyne,31,2023-03-29 00:34:15.229704
2,Lewandowski,33,2023-03-29 00:34:15.229704
3,Benzema,34,2023-03-29 00:34:15.229704
4,Messi,35,2023-03-29 00:34:15.229704


,Name,Age,Last Updated
id,,,
0,Mbappé,23,2023-03-29 00:34:23.276768
1,De Bruyne,31,2023-03-29 00:34:23.276768
2,Lewandowski,33,2023-03-29 00:34:23.276768
3,Benzema,34,2023-03-29 00:34:23.276768
4,Messi,35,2023-03-29 00:34:23.276768
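By default `read_csv` leaves "Last Updated" as text. If real timestamps are wanted after the round trip, `parse_dates` is one option. A minimal sketch that uses in-memory buffers in place of "001.007.tsv", with a two-row sample that is assumed for illustration:

```python
import io

import pandas as pd

tsv = (
    "id\tName\tAge\tLast Updated\n"
    "0\tMbappé\t23\t2023-03-29 00:34:15.229704\n"
    "1\tMessi\t35\t2023-03-29 00:34:15.229704\n"
)

df = pd.read_csv(
    io.StringIO(tsv), sep="\t", index_col="id", parse_dates=["Last Updated"]
)
df["Last Updated"] = pd.Timestamp.now()

buffer = io.StringIO()
df.to_csv(buffer, sep="\t")  # the index name 'id' is written as its header

reloaded = pd.read_csv(
    io.StringIO(buffer.getvalue()),
    sep="\t",
    index_col="id",
    parse_dates=["Last Updated"],
)
print(reloaded.dtypes)
```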


### 001.008 JSON file

1. Create a DataFrame by reading from the JSON file in `data_file`
1. Change the Last Updated column to the current timestamp in the DataFrame. Write the updated DataFrame data to the same file. HINT: Make sure you tell the function to use the 'id' column for indices
1. Load the data again in a new DataFrame, and confirm it's the same as for step (1) except for the timestamp. Make sure non-ASCII characters are readable

In [10]:
data_file = "001.008.json"
# solution

df = pd.read_json(data_file)
df

df["Last Updated"] = pd.Timestamp.now()
df.to_json(data_file, force_ascii=False)

pd.read_json(data_file)
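The effect of `force_ascii=False` is easiest to see on the serialized text itself. The sketch below works on strings rather than on "001.008.json", and the two-row frame is an assumed stand-in for the file's contents:

```python
import io

import pandas as pd

df = pd.DataFrame({"Name": ["Mbappé", "Messi"], "Age": [23, 35]})

escaped = df.to_json()                    # default force_ascii=True escapes "é"
readable = df.to_json(force_ascii=False)  # keeps "é" as a literal character

print(escaped)
print(readable)

# both forms decode to the same data when read back
restored = pd.read_json(io.StringIO(readable))
print(restored)
```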


### 001.009 From a table in an HTML page

1. Create a DataFrame by reading from the HTML file in `data_file`. Use the BeautifulSoup flavor. Note that the index column is called "ID"
1. Change the Last Updated column to the current timestamp in the DataFrame. Write the updated DataFrame data to the same file. HINT: Make sure you "flatten" the indices, because `to_html` writes two rows for the headers exactly as you see in the output
1. Load the data again in a new DataFrame, and confirm it's the same as for step (1) except for the timestamp

In [11]:
data_file = "001.009.html"
# solution
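No solution is recorded for this kata. What follows is a sketch of the "flatten" hint only, not the author's answer. It assumes a two-row frame with an index named "ID" and inspects the HTML text directly. Reading the file back would presumably use `pd.read_html(data_file, flavor="bs4")`, which returns a list of DataFrames and needs `beautifulsoup4` and `html5lib` installed; that call is not exercised here.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Mbappé", "Messi"], "Age": [23, 35]})
df.index.name = "ID"


def header_rows(html: str) -> int:
    """Count the <tr> elements inside the table's <thead>."""
    thead = html.split("</thead>")[0]
    return thead.count("<tr")


# a named index makes to_html emit a second header row for the index name
nested = df.to_html()

# turning the index into an ordinary column leaves a single header row
flat = df.reset_index().to_html(index=False)

print(header_rows(nested), header_rows(flat))
```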


### 001.010 Markdown

(requires the tabulate package)

1. There is no read_markdown method. Use the generic method to load tabular data, making sure that
   1. It is aware of | as delimiter
   1. It handles the index as the 1st (zero-based) column
   1. It treats the 1st and last delimiters as _separators_, meaning it puts an empty column either side. Remove them
   1. It treats the `------` row as a normal data row. Remove it
1. There is a `to_markdown` method, but it doesn't write to a file like the similarly named ones. You'll have to do it old-school
   1. Overwrite `Last Updated` with the current timestamp
   1. Write a file using the output of to_markdown, using the 'github' table format
1. Reload the table again and prove it was overwritten correctly



In [12]:
data_file = "001.010.md"
# solution
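No solution is recorded for this kata either. The sketch below covers only the loading steps and is one possible reading of them. It parses an inline string that stands in for "001.010.md", and the table contents are an assumption. Writing back could then be done with something like `Path(data_file).write_text(df.to_markdown(tablefmt="github"), encoding="utf-8")`, which needs `tabulate` and is not exercised here.

```python
import io

import pandas as pd

# stand-in for the contents of a github-format Markdown table
md = (
    "|   id | Name      |   Age |\n"
    "|------|-----------|-------|\n"
    "|    0 | Mbappé    |    23 |\n"
    "|    1 | De Bruyne |    31 |\n"
)

# column 0 is the empty field before the first pipe, so the index is column 1
raw = pd.read_csv(io.StringIO(md), sep="|", index_col=1, skipinitialspace=True)
raw = raw.dropna(axis=1, how="all")  # empty columns outside the outer pipes
raw = raw.iloc[1:]                   # the |------| separator row

# headers and cells still carry padding, and every value is a string
raw.columns = raw.columns.str.strip()
df = raw.apply(lambda col: col.str.strip()).astype({"Age": "int64"})
df.index = pd.Index(raw.index.str.strip().astype("int64"), name="id")

print(df)
```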
