# Pandas

Pandas is a powerful library for data manipulation and analysis in Python. It is widely used in a range of fields, including data science, finance, and statistics.

## 001. DataFrame Importing / Exporting

### 001.000 Assets

Some shared assets to save typing:

| Name        | Age |
|-------------|-----|
| Mbappé      | 23 |
| De Bruyne   | 31 |
| Lewandowski | 33 |
| Benzema     | 34 |
| Messi       | 35 |

In [1]:
import sys
from pathlib import Path

current_dir = Path().resolve()
while current_dir != current_dir.parent and current_dir.name != "katas":
    current_dir = current_dir.parent
if current_dir != current_dir.parent:
    sys.path.append(current_dir.as_posix())

In [2]:
import pandas as pd
import numpy as np
from lib.utils import fresh_df
from IPython.core.interactiveshell import InteractiveShell

pd.set_option('display.max_rows', None)
InteractiveShell.ast_node_interactivity = "all"
names_as_list = ["Mbappé", "De Bruyne", "Lewandowski", "Benzema", "Messi"]
ages_as_list = [ 23,       31,          33,             34,        35]


### 001.001 Dictionary of arrays

1. Replicate this data as a DataFrame built from a dictionary of arrays, using the constructor, and print it
1. Do the same, but using `from_dict`
1. Extract the dict from the df again with `to_dict`
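
One possible approach to the steps above, given as a minimal sketch rather than the reference solution. It assumes the two lists from the assets cell; the variable names `data`, `df_ctor`, `df_from_dict` and `round_trip` are illustrative only.

```python
import pandas as pd

names_as_list = ["Mbappé", "De Bruyne", "Lewandowski", "Benzema", "Messi"]
ages_as_list = [23, 31, 33, 34, 35]

# A dictionary of arrays: each key becomes a column name.
data = {"Name": names_as_list, "Age": ages_as_list}

# Step 1: plain constructor.
df_ctor = pd.DataFrame(data)
print(df_ctor)

# Step 2: the from_dict class method (default orient="columns").
df_from_dict = pd.DataFrame.from_dict(data)

# Step 3: orient="list" maps each column name to a list of its values.
round_trip = df_ctor.to_dict(orient="list")
```

The default `to_dict()` orientation returns a dict of dicts keyed by index label, so `orient="list"` is what gets back to the original shape.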


In [3]:
# solution


### 001.002 List of dictionaries

1. Create a list where every entry is a dict with keys "Name" and "Age", and print
1. Create a df from it
1. Do the same, but using from_records
1. Extract the list out of the df again with to_dict (not to_records). Note that you can actually get a list and not a dict, with the right parameters...
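
A minimal sketch of one way to go through these steps, assuming the same two lists as in the assets cell. The names `records`, `df`, `df_records` and `back` are illustrative.

```python
import pandas as pd

names_as_list = ["Mbappé", "De Bruyne", "Lewandowski", "Benzema", "Messi"]
ages_as_list = [23, 31, 33, 34, 35]

# Step 1: one dict per row.
records = [{"Name": n, "Age": a} for n, a in zip(names_as_list, ages_as_list)]
print(records)

# Step 2: the constructor accepts a list of dicts directly.
df = pd.DataFrame(records)

# Step 3: from_records builds the same frame.
df_records = pd.DataFrame.from_records(records)

# Step 4: orient="records" makes to_dict return a list of row dicts.
back = df.to_dict(orient="records")
```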


In [4]:
# solution


### 001.003 2D NumPy array

1. Print a NumPy 2D array with the same data as before, but with an extra column for "Last Updated" with a pd.Timestamp, and create a df with it. Make sure the column names are still "Name", "Age", "Last Updated"
1. Update the "Last Updated" column with the current timestamp
1. Convert the DataFrame to a NumPy array and print it, showing that the timestamp has changed
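
The steps above could be sketched as follows. This is only one possible approach; the initial timestamp `2023-01-01` is an arbitrary placeholder chosen for illustration, and `dtype=object` is used so that strings, integers and timestamps can share one array.

```python
import numpy as np
import pandas as pd

names_as_list = ["Mbappé", "De Bruyne", "Lewandowski", "Benzema", "Messi"]
ages_as_list = [23, 31, 33, 34, 35]

# Arbitrary placeholder timestamp (an assumption, not from the exercise data).
stamp = pd.Timestamp("2023-01-01")

# Step 1: a 2D object array, one row per player.
arr = np.array(
    [[n, a, stamp] for n, a in zip(names_as_list, ages_as_list)],
    dtype=object,
)
print(arr)
df = pd.DataFrame(arr, columns=["Name", "Age", "Last Updated"])

# Step 2: overwrite the whole column with the current time.
df["Last Updated"] = pd.Timestamp.now()

# Step 3: back to NumPy; the last column now holds the new timestamp.
arr_after = df.to_numpy()
print(arr_after)
```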


In [5]:
# solution


### 001.004 Pandas Series

1. Create two pandas series from the two lists, and then concatenate them into a df. Make sure the column names are still "Name" and "Age"
1. Do the same thing, but differently: this time make sure the names are already set on the Series
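
A minimal sketch of both variants, assuming the lists from the assets cell. The first passes the column names to `pd.concat` through `keys`; the second names each Series up front. Other routes are possible.

```python
import pandas as pd

names_as_list = ["Mbappé", "De Bruyne", "Lewandowski", "Benzema", "Messi"]
ages_as_list = [23, 31, 33, 34, 35]

# Variant 1: unnamed Series, column names supplied at concat time.
s_names = pd.Series(names_as_list)
s_ages = pd.Series(ages_as_list)
df_keys = pd.concat([s_names, s_ages], axis=1, keys=["Name", "Age"])

# Variant 2: the Series carry their names, so concat picks them up.
named_names = pd.Series(names_as_list, name="Name")
named_ages = pd.Series(ages_as_list, name="Age")
df_named = pd.concat([named_names, named_ages], axis=1)

print(df_keys)
print(df_named)
```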

In [6]:
# solution


### 001.005 SQL (sqlite)

1. Create a DataFrame by reading from table 'players' in SQLite DB "001.005.sqlite" with the built-in SQLite module. Note that the index column is called 'id'
1. Change the Last Updated column to the current timestamp in the DataFrame. Write the updated DataFrame data to the database. HINT: Make sure you tell the function to use the 'id' column for indices
1. Load the data again into a new DataFrame, and confirm it is the same as in step (1) except for the timestamp
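
The round trip above might look like the sketch below. To stay self-contained it uses an in-memory database seeded with two rows instead of "001.005.sqlite"; the seed data, the text format of the timestamp, and the choice of `if_exists="replace"` are all assumptions made for illustration.

```python
import sqlite3

import pandas as pd

# In-memory stand-in for "001.005.sqlite" (assumption: the real file has a
# 'players' table with an 'id' index column, as the exercise states).
conn = sqlite3.connect(":memory:")
seed = pd.DataFrame(
    {
        "Name": ["Mbappé", "Messi"],
        "Age": [23, 35],
        "Last Updated": ["2023-01-01 00:00:00"] * 2,
    },
    index=pd.Index([1, 2], name="id"),
)
seed.to_sql("players", conn, index_label="id")

# Step 1: read the table, using 'id' as the index.
df = pd.read_sql_query("SELECT * FROM players", conn, index_col="id")

# Step 2: update the timestamp and write back, keeping 'id' as the index label.
updated = df.copy()
updated["Last Updated"] = pd.Timestamp.now().strftime("%Y-%m-%d %H:%M:%S")
updated.to_sql("players", conn, if_exists="replace", index_label="id")

# Step 3: reload and compare everything except the timestamp column.
reloaded = pd.read_sql_query("SELECT * FROM players", conn, index_col="id")
same_otherwise = reloaded.drop(columns="Last Updated").equals(
    df.drop(columns="Last Updated")
)
print(same_otherwise)
conn.close()
```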



In [7]:
# solution


### 001.006 SQL (sqlalchemy)

1. Replicate 001.005 but use sqlalchemy to connect to the DB

(Note: you will probably get warnings, but you can ignore them)

In [8]:
# solution


### 001.007 TSV file

1. Create a DataFrame by reading from TSV file in `data_file`. Note that the index column is called 'id'
1. Change the Last Updated column to the current timestamp in the DataFrame. Write the updated DataFrame data to the same file. HINT: Make sure you tell the function to use the 'id' column for indices
1. Load the data again into a new DataFrame, and confirm it is the same as in step (1) except for the timestamp
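
A minimal sketch of the TSV round trip. It writes a small two-row file to a temporary directory rather than touching "001.007.tsv"; the file name `players.tsv` and its contents are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for the exercise file.
path = Path(tempfile.mkdtemp()) / "players.tsv"
path.write_text(
    "id\tName\tAge\tLast Updated\n"
    "1\tMbappé\t23\t2023-01-01 00:00:00\n"
    "2\tMessi\t35\t2023-01-01 00:00:00\n",
    encoding="utf-8",
)

# Step 1: a TSV is a CSV with a tab separator.
df = pd.read_csv(path, sep="\t", index_col="id")

# Step 2: update the timestamp and overwrite the same file.
updated = df.copy()
updated["Last Updated"] = pd.Timestamp.now().strftime("%Y-%m-%d %H:%M:%S")
updated.to_csv(path, sep="\t", index_label="id")

# Step 3: reload and compare everything except the timestamp column.
reloaded = pd.read_csv(path, sep="\t", index_col="id")
same_otherwise = reloaded.drop(columns="Last Updated").equals(
    df.drop(columns="Last Updated")
)
print(same_otherwise)
```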

In [9]:
data_file = "001.007.tsv"
# solution


### 001.008 JSON file

1. Create a DataFrame by reading from JSON file in `data_file`
1. Change the Last Updated column to the current timestamp in the DataFrame. Write the updated DataFrame data to the same file. HINT: Make sure you tell the function to use the 'id' column for indices
1. Load the data again into a new DataFrame, and confirm it is the same as in step (1) except for the timestamp. Make sure non-ASCII characters are readable
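
One way this could go, sketched under a strong assumption about the file layout: the JSON is taken to be a list of records that each carry an `id` field. The actual layout of "001.008.json" is not shown here, so the `orient` value may need to change. `force_ascii=False` is what keeps characters such as `é` readable in the written file.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for the exercise file, in records layout.
records = [
    {"id": 1, "Name": "Mbappé", "Age": 23, "Last Updated": "2023-01-01 00:00:00"},
    {"id": 2, "Name": "Messi", "Age": 35, "Last Updated": "2023-01-01 00:00:00"},
]
path = Path(tempfile.mkdtemp()) / "players.json"
path.write_text(json.dumps(records, ensure_ascii=False), encoding="utf-8")

# Step 1: read the file and use 'id' as the index.
df = pd.read_json(path).set_index("id")

# Step 2: update the timestamp and overwrite the file without escaping non-ASCII.
updated = df.copy()
updated["Last Updated"] = pd.Timestamp.now().strftime("%Y-%m-%d %H:%M:%S")
updated.reset_index().to_json(path, orient="records", force_ascii=False)

# Step 3: reload; the raw file text should still contain the accented name.
reloaded = pd.read_json(path).set_index("id")
raw_text = path.read_text(encoding="utf-8")
print(reloaded)
```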

In [10]:
data_file = "001.008.json"
# solution


### 001.009 From a table in an HTML page

1. Create a DataFrame by reading from the HTML file in `data_file`. Use the beautifulsoup flavor. Note that the ID column is called "ID"
1. Change the Last Updated column to the current timestamp in the DataFrame. Write the updated DataFrame data to the same file. HINT: Make sure you "flatten" the indices, because `to_html` writes two rows for the headers exactly as you see in the output
1. Load the data again into a new DataFrame, and confirm it is the same as in step (1) except for the timestamp

In [11]:
data_file = "001.009.html"
# solution


### 001.010 Markdown

(requires the tabulate package)

1. There is no `read_markdown` method. Use the generic method to load tabular data, making sure that
   1. It is aware of | as delimiter
   1. It handles the index as the 1st (zero-based) column
   1. It treats the first and last delimiters as _separators_, meaning it puts an empty column on either side. Remove them
   1. It treats the `------` row as a normal data row. Remove it
1. There is a `to_markdown` method. Here, take the string it returns and write the file yourself, old-school
   1. Overwrite `Last Updated` with the current timestamp
   1. Write a file using the output of to_markdown, using the 'github' table format
1. Reload the table again and prove it was overwritten correctly
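
The reading half of this exercise can be sketched as below. This is only one possible approach: it parses an inline two-row table instead of "001.010.md", uses a regular-expression separator so the padding around each `|` is stripped, and sets the index after cleaning rather than through `index_col`. The table contents are hypothetical.

```python
import io

import pandas as pd

# Hypothetical markdown table in the 'github' format.
md_text = (
    "|   id | Name   |   Age | Last Updated        |\n"
    "|------|--------|-------|---------------------|\n"
    "|    1 | Mbappé |    23 | 2023-01-01 00:00:00 |\n"
    "|    2 | Messi  |    35 | 2023-01-01 00:00:00 |\n"
)

# A regex separator needs the python engine; it also eats surrounding spaces.
raw = pd.read_csv(io.StringIO(md_text), sep=r"\s*\|\s*", engine="python")

df = (
    raw.dropna(axis=1, how="all")      # empty columns left by the outer pipes
    .iloc[1:]                          # the '------' row parsed as data
    .astype({"id": int, "Age": int})   # numeric columns were read as strings
    .set_index("id")
)
print(df)
```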



In [12]:
data_file = "001.010.md"
# solution
