# Pandas

Pandas is a powerful library for data manipulation and analysis in Python. It is widely used in a range of fields, including data science, finance, and statistics.

## 001. DataFrame Importing / Exporting

### 001.000 Assets

Some shared assets, defined once to avoid retyping the same data in every exercise:

| Name        | Age|
|-------------|----|
| Mbappé      | 23 |
| De Bruyne   | 31 |
| Lewandowski | 33 |
| Benzema     | 34 |
| Messi       | 35 |

In [1]:
import sys
from pathlib import Path

current_dir = Path().resolve()
while current_dir != current_dir.parent and current_dir.name != "katas":
    current_dir = current_dir.parent
if current_dir != current_dir.parent:
    sys.path.append(current_dir.as_posix())

In [2]:
import pandas as pd
from lib.utils import fresh_df
from IPython.core.interactiveshell import InteractiveShell

pd.set_option('display.max_rows', None)
InteractiveShell.ast_node_interactivity = "all"
names_as_list = ["Mbappé", "De Bruyne", "Lewandowski", "Benzema", "Messi"]
ages_as_list = [ 23,       31,          33,             34,        35]


### 001.001 Dictionary of arrays

1. Replicate this data as a dataframe with a dictionary of arrays, using a constructor, and print it
1. Do the same, but using from_dict
1. Extract the dict out of the df again with to_dict


In [3]:
# solution
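
One possible approach to the three steps above, as a minimal sketch. It redefines the two lists from the assets cell so that it runs on its own; the variable names `data`, `df_from_dict`, and `round_trip` are only illustrative.

```python
import pandas as pd

names_as_list = ["Mbappé", "De Bruyne", "Lewandowski", "Benzema", "Messi"]
ages_as_list = [23, 31, 33, 34, 35]

# A dictionary of arrays: each key is a column name, each value a column.
data = {"Name": names_as_list, "Age": ages_as_list}

# 1. Build the DataFrame with the constructor.
df = pd.DataFrame(data)
print(df)

# 2. Build the same DataFrame with from_dict.
df_from_dict = pd.DataFrame.from_dict(data)
print(df_from_dict)

# 3. Get the dictionary of lists back; orient="list" mirrors the input shape.
round_trip = df.to_dict(orient="list")
print(round_trip)
```

Without `orient="list"`, `to_dict` returns a nested dict keyed by column and then by index label, which is not the shape the DataFrame was built from.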


### 001.002 List of dictionaries

1. Create a list where every entry is a dict with keys "Name" and "Age", build a DataFrame from it with the constructor, and print it
1. Do the same, but using from_records
1. Extract the list out of the df again with to_dict (not to_records). Note that, with the right parameters, to_dict returns a list rather than a dict


In [4]:
# solution
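
A minimal sketch of one way to do this, assuming the same two lists as in the assets cell (repeated here so the block runs on its own). The names `records` and `back_to_list` are illustrative.

```python
import pandas as pd

names_as_list = ["Mbappé", "De Bruyne", "Lewandowski", "Benzema", "Messi"]
ages_as_list = [23, 31, 33, 34, 35]

# A list of dictionaries: one dict per row.
records = [{"Name": n, "Age": a} for n, a in zip(names_as_list, ages_as_list)]

# 1. Build the DataFrame with the constructor.
df = pd.DataFrame(records)
print(df)

# 2. Build the same DataFrame with from_records.
df_from_records = pd.DataFrame.from_records(records)
print(df_from_records)

# 3. orient="records" makes to_dict return a list with one dict per row.
back_to_list = df.to_dict(orient="records")
print(back_to_list)
```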


### 001.003 2D Numpy array

1. Import numpy
1. Create a NumPy 2D array with the same data as before, but with an extra column for "Last Updated" holding a pd.Timestamp
1. Print the NumPy array
1. Import it as a pandas DataFrame
1. Update the Last Updated column to the current timestamp
1. Make sure the column names are still "Name", "Age", "Last Updated"
1. Export the DataFrame back to a NumPy 2D array
1. Print the exported array, proving the timestamp is different


In [5]:
# solution
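
The steps above could be sketched as follows. The initial timestamp `created` is an arbitrary value chosen for illustration, and `dtype=object` is one way (an assumption, not the only option) to keep strings, integers, and timestamps side by side in a single array.

```python
import numpy as np
import pandas as pd

names_as_list = ["Mbappé", "De Bruyne", "Lewandowski", "Benzema", "Messi"]
ages_as_list = [23, 31, 33, 34, 35]

# Arbitrary starting timestamp, chosen only for this illustration.
created = pd.Timestamp("2023-01-01")

# An object array can hold mixed Python objects in each cell.
array = np.array(
    [[name, age, created] for name, age in zip(names_as_list, ages_as_list)],
    dtype=object,
)
print(array)

# A bare NumPy array carries no column names, so pass them explicitly.
columns = ["Name", "Age", "Last Updated"]
df = pd.DataFrame(array, columns=columns)

# Assigning a scalar overwrites every row of the column.
df["Last Updated"] = pd.Timestamp.now()

# Export back to a 2D NumPy array and show the new timestamps.
exported = df.to_numpy()
print(exported)
```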


### 001.004 Pandas Series

1. Replicate 001.003 but create two pandas series from the two lists, and then concatenate them
1. Make sure the column names are still "Name" and "Age"

In [6]:
# solution
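
A minimal sketch of one approach: naming each Series lets the concatenation pick up the column names. The lists are repeated from the assets cell so the block is self-contained.

```python
import pandas as pd

names_as_list = ["Mbappé", "De Bruyne", "Lewandowski", "Benzema", "Messi"]
ages_as_list = [23, 31, 33, 34, 35]

# The name of each Series becomes its column name after concatenation.
names = pd.Series(names_as_list, name="Name")
ages = pd.Series(ages_as_list, name="Age")

# axis=1 places the Series side by side as columns instead of stacking rows.
df = pd.concat([names, ages], axis=1)
print(df)
```

With the default `axis=0`, the two Series would be stacked into a single ten-element Series instead.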


### 001.005 SQL (sqlite)

1. Create a DataFrame by reading from table 'players' in SQLite DB "001.005.sqlite" using Python's built-in sqlite3 module. Note that the index column is called 'id'
1. Show it with df.head()
1. Change the Last Updated column to the current timestamp in the DataFrame
1. Write the updated DataFrame data to the database. Make sure you tell the function to use the 'id' column for indices
1. Load the data again in a new DataFrame
1. Show it with df.head(), and confirm it's different from the previous one.
1. Also confirm there are no extra index columns



In [7]:
# solution
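
A sketch of the read, update, write, and reload cycle. Because "001.005.sqlite" is not reproduced here, the block seeds an in-memory database as a stand-in; the schema (columns 'id', 'Name', 'Age', 'Last Updated') and the seed timestamp are assumptions, so adjust them to the real file.

```python
import sqlite3

import pandas as pd

# Stand-in for sqlite3.connect("001.005.sqlite"); the schema is assumed.
conn = sqlite3.connect(":memory:")
seed = pd.DataFrame(
    {
        "Name": ["Mbappé", "De Bruyne", "Lewandowski", "Benzema", "Messi"],
        "Age": [23, 31, 33, 34, 35],
        "Last Updated": "2023-01-01 00:00:00",  # assumed initial value
    }
)
seed.to_sql("players", conn, index=True, index_label="id")

# Read the table, using the 'id' column as the DataFrame index.
df = pd.read_sql_query("SELECT * FROM players", conn, index_col="id")
print(df.head())

# Update the timestamp; stored as text to sidestep datetime adapters.
df["Last Updated"] = pd.Timestamp.now().strftime("%Y-%m-%d %H:%M:%S")

# Replace the table and write the index back under the name 'id'.
df.to_sql("players", conn, if_exists="replace", index=True, index_label="id")

# Reload and check: new timestamps, and no extra index column.
df_reloaded = pd.read_sql_query("SELECT * FROM players", conn, index_col="id")
print(df_reloaded.head())
conn.close()
```

If `index_col="id"` is omitted when reading, 'id' comes back as a regular column and a later `to_sql` call writes an additional index column next to it.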


### 001.006 SQL (sqlalchemy)

1. Replicate 001.005 but use sqlalchemy to connect to the DB

(Note: you will probably get warnings, but you can ignore them)

In [8]:
# solution


### 001.007 TSV file

1. Replicate 001.005 but with the TSV file "[001.007.tsv](001.007.tsv)". Save that file location to a variable
1. A TSV file is a CSV file with "\t" as the separator
1. Bear in mind the same index column issues as in 001.005

In [9]:
data_file = "001.007.tsv"
# solution
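
A sketch of the same cycle for a tab-separated file. It writes a temporary stand-in file rather than touching "001.007.tsv"; the column layout ('id', 'Name', 'Age', 'Last Updated') and the file name `players.tsv` are assumptions for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

seed = pd.DataFrame(
    {
        "Name": ["Mbappé", "De Bruyne", "Lewandowski", "Benzema", "Messi"],
        "Age": [23, 31, 33, 34, 35],
        "Last Updated": "2023-01-01 00:00:00",  # assumed initial value
    }
)

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-in for the real data file.
    tsv_path = Path(tmp) / "players.tsv"
    seed.to_csv(tsv_path, sep="\t", index_label="id")

    # sep="\t" is the only difference from reading a regular CSV file.
    df = pd.read_csv(tsv_path, sep="\t", index_col="id")
    print(df.head())

    df["Last Updated"] = pd.Timestamp.now().strftime("%Y-%m-%d %H:%M:%S")

    # Write the index under the name 'id' so no unnamed column appears.
    df.to_csv(tsv_path, sep="\t", index_label="id")

    df_reloaded = pd.read_csv(tsv_path, sep="\t", index_col="id")
    print(df_reloaded.head())
```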


### 001.008 JSON file

1. Replicate 001.005 but with the JSON file "[001.008.json](001.008.json)". Save that file location to a variable
1. Make sure non-ASCII characters are readable

In [10]:
data_file = "001.008.json"
# solution
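
A sketch focused on the non-ASCII requirement. By default `to_json` escapes characters such as "é" as `\u00e9`; passing `force_ascii=False` keeps them readable in the file. The temporary file name and the `orient="records"` layout are assumptions, since the real "001.008.json" may be structured differently.

```python
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["Mbappé", "De Bruyne", "Lewandowski", "Benzema", "Messi"],
        "Age": [23, 31, 33, 34, 35],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-in for the real data file.
    json_path = Path(tmp) / "players.json"

    # force_ascii=False writes "é" as-is instead of the escape \u00e9.
    df.to_json(json_path, orient="records", force_ascii=False)

    # Inspect the raw text to confirm the characters are readable.
    raw = json_path.read_text(encoding="utf-8")
    print(raw)

    # Read the file back with the same orientation it was written in.
    df_reloaded = pd.read_json(json_path, orient="records")
    print(df_reloaded.head())
```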


### 001.009 From a table in an HTML page

1. Replicate 001.005 but with the HTML file "[001.009.html](001.009.html)". Use the beautifulsoup4 ("bs4") flavor
1. Make sure there is only a single header row, ids are shown, and the ID column is called 'ID'

In [11]:
data_file = "001.009.html"
# solution
