# Pandas

Pandas is a powerful library for data manipulation and analysis in Python. It is widely used in a range of fields, including data science, finance, and statistics.

1. Import pandas, plus numpy and sqlalchemy, which later exercises use
1. Enable matplotlib to render plots inline

In [1]:
import pandas as pd
import numpy as np
import sqlalchemy
%matplotlib inline

## 000. Assets

Shared lists that the exercises below reuse, to avoid retyping the data

In [2]:
names_as_list = ["Mbappé", "De Bruyne", "Lewandowski", "Benzema", "Messi"]
ages_as_list = [ 23,       31,          33,             34,        35]


## 001. Importing / Exporting data

### 001.001 Dictionary of arrays

1. Replicate this data as a dataframe with a dictionary of arrays
1. Show it with head()

| Name        | Age|
|-------------|----|
| Mbappé      | 23 |
| De Bruyne   | 31 |
| Lewandowski | 33 |
| Benzema     | 34 |
| Messi       | 35 |


In [3]:
data = {
    'Name': names_as_list, 
    'Age':  ages_as_list
}
df = pd.DataFrame(data)
df.head()

|   | Name        | Age |
|---|-------------|-----|
| 0 | Mbappé      | 23  |
| 1 | De Bruyne   | 31  |
| 2 | Lewandowski | 33  |
| 3 | Benzema     | 34  |
| 4 | Messi       | 35  |


### 001.002 List of dictionaries

1. Create a list where every entry is a dict with keys "Name" and "Age"
1. Continue as per 001.001


In [4]:
data_as_list_of_dicts = [{ "Name": name, "Age": age } for name, age in zip(names_as_list, ages_as_list) ]
df = pd.DataFrame(data_as_list_of_dicts)
df.head()

|   | Name        | Age |
|---|-------------|-----|
| 0 | Mbappé      | 23  |
| 1 | De Bruyne   | 31  |
| 2 | Lewandowski | 33  |
| 3 | Benzema     | 34  |
| 4 | Messi       | 35  |

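As a quick check that 001.001 and 001.002 build the same frame, the sketch below constructs both and compares them with `DataFrame.equals`. The names `df_from_dict`, `records` and `df_from_records` are illustrative and are not part of the exercises.

```python
import pandas as pd

names_as_list = ["Mbappé", "De Bruyne", "Lewandowski", "Benzema", "Messi"]
ages_as_list = [23, 31, 33, 34, 35]

# Dictionary of arrays: one key per column.
df_from_dict = pd.DataFrame({"Name": names_as_list, "Age": ages_as_list})

# List of dictionaries: one dict per row.
records = [{"Name": n, "Age": a} for n, a in zip(names_as_list, ages_as_list)]
df_from_records = pd.DataFrame(records)

# equals() compares values, dtypes and labels in one call.
same = df_from_dict.equals(df_from_records)
print(same)
```

Both constructors infer the column order from the keys, so neither needs an explicit `columns=` argument here.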

### 001.003 2D Numpy array

1. Create a NumPy 2D array with the same data
2. Continue as per 001.002
3. Make sure the column names are still "Name" and "Age"


In [5]:
data_as_list_of_lists = [[ name, age ] for name, age in zip(names_as_list, ages_as_list) ]

data = np.array(data_as_list_of_lists)
df = pd.DataFrame(data, columns=["Name", "Age"])
df.head()

|   | Name        | Age |
|---|-------------|-----|
| 0 | Mbappé      | 23  |
| 1 | De Bruyne   | 31  |
| 2 | Lewandowski | 33  |
| 3 | Benzema     | 34  |
| 4 | Messi       | 35  |

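One caveat with the NumPy route: an array holds a single dtype, so mixing names and ages coerces every value to a string, and the `Age` column arrives in the DataFrame as text. The sketch below shows one way to restore integers with `astype`. The variable `rows` is illustrative.

```python
import numpy as np
import pandas as pd

names_as_list = ["Mbappé", "De Bruyne", "Lewandowski", "Benzema", "Messi"]
ages_as_list = [23, 31, 33, 34, 35]

rows = [[name, age] for name, age in zip(names_as_list, ages_as_list)]

# Mixed str/int input is coerced to one Unicode string dtype.
data = np.array(rows)
print(data.dtype)

df = pd.DataFrame(data, columns=["Name", "Age"])

# Convert the ages back to integers so arithmetic works as expected.
df["Age"] = df["Age"].astype(int)
print(df.dtypes)
```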

### 001.004 Pandas Series

1. Replicate 001.003 but create two pandas Series from the two lists, and then concatenate them
1. Make sure the column names are still "Name" and "Age"

In [6]:
s1 = pd.Series(names_as_list)
s2 = pd.Series(ages_as_list)
df = pd.concat([s1, s2], keys=["Name", "Age"], axis=1)
df.head()


|   | Name        | Age |
|---|-------------|-----|
| 0 | Mbappé      | 23  |
| 1 | De Bruyne   | 31  |
| 2 | Lewandowski | 33  |
| 3 | Benzema     | 34  |
| 4 | Messi       | 35  |

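An alternative to `keys=` is to name each Series when it is created. With `axis=1`, `pd.concat` then uses those names as the column labels. This is a sketch of that variant, not a replacement for the solution above.

```python
import pandas as pd

names_as_list = ["Mbappé", "De Bruyne", "Lewandowski", "Benzema", "Messi"]
ages_as_list = [23, 31, 33, 34, 35]

# Each Series carries its own column label via name=.
s1 = pd.Series(names_as_list, name="Name")
s2 = pd.Series(ages_as_list, name="Age")

# Concatenating along axis=1 places the Series side by side as columns.
df = pd.concat([s1, s2], axis=1)
print(df.head())
```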

### 001.005 SQL (sqlite)

1. Create a DataFrame by reading from table 'players' in SQLite DB "001.005.sqlite" with the built-in SQLite module. Note that the index column is called 'id'
1. Show it with df.head()
1. Change the "Last Updated" column to the current timestamp in the DataFrame
1. Write the updated DataFrame data to the database. Make sure you tell the function to use the 'id' column for indices
1. Load the data again in a new DataFrame
1. Show it with df.head(), and confirm it's different from the previous one.
1. Also confirm there are no extra index columns



In [8]:
import sqlite3

conn = sqlite3.connect("001.005.sqlite")
df = pd.read_sql_query('SELECT * FROM players', conn, index_col='id')
print(df.head())

df["Last Updated"] = pd.Timestamp.now()
df.to_sql('players', con=conn, if_exists='replace', index=True, index_label='id')

df = pd.read_sql_query('SELECT * FROM players', conn, index_col='id')
print(df.head())


           Name  Age                Last Updated
id                                              
0        Mbappé   23  2022-12-28 19:46:34.172276
1     De Bruyne   31  2022-12-28 19:46:34.172276
2   Lewandowski   33  2022-12-28 19:46:34.172276
3       Benzema   34  2022-12-28 19:46:34.172276
4         Messi   35  2022-12-28 19:46:34.172276
           Name  Age                Last Updated
id                                              
0        Mbappé   23  2022-12-28 22:30:28.068699
1     De Bruyne   31  2022-12-28 22:30:28.068699
2   Lewandowski   33  2022-12-28 22:30:28.068699
3       Benzema   34  2022-12-28 22:30:28.068699
4         Messi   35  2022-12-28 22:30:28.068699

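To try the round trip without touching "001.005.sqlite", the sketch below uses a throwaway in-memory database and then checks the last two steps programmatically. The in-memory database, the two-row `seed` frame and the `reloaded` name are assumptions made for the example. `parse_dates` is optional and only turns the stored text back into datetimes.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")  # throwaway DB, not 001.005.sqlite

# Hypothetical seed data standing in for the 'players' table.
seed = pd.DataFrame({"Name": ["Mbappé", "Messi"], "Age": [23, 35]})
seed.to_sql("players", con=conn, index=True, index_label="id")

df = pd.read_sql_query("SELECT * FROM players", conn, index_col="id")
df["Last Updated"] = pd.Timestamp.now()

# index_label='id' writes the index under the same name it was read with,
# so no extra index column is created.
df.to_sql("players", con=conn, if_exists="replace", index=True, index_label="id")

reloaded = pd.read_sql_query(
    "SELECT * FROM players", conn, index_col="id", parse_dates=["Last Updated"]
)
print(list(reloaded.columns))
print(reloaded.index.name)
conn.close()
```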

### 001.006 SQL (sqlalchemy)

1. Replicate 001.005 but use sqlalchemy to connect to the DB

(Note: you will probably get warnings, but you can ignore them)

In [9]:
from sqlalchemy import create_engine

engine = create_engine("sqlite:///001.005.sqlite")
conn = engine.raw_connection()
df = pd.read_sql_query('SELECT * FROM players', conn, index_col='id')
print(df.head())

df["Last Updated"] = pd.Timestamp.now()
df.to_sql('players', con=conn, if_exists='replace', index=True, index_label='id')

df = pd.read_sql_query('SELECT * FROM players', conn, index_col='id')
print(df.head())

           Name  Age                Last Updated
id                                              
0        Mbappé   23  2022-12-28 22:30:28.068699
1     De Bruyne   31  2022-12-28 22:30:28.068699
2   Lewandowski   33  2022-12-28 22:30:28.068699
3       Benzema   34  2022-12-28 22:30:28.068699
4         Messi   35  2022-12-28 22:30:28.068699
           Name  Age                Last Updated
id                                              
0        Mbappé   23  2022-12-28 22:30:28.085769
1     De Bruyne   31  2022-12-28 22:30:28.085769
2   Lewandowski   33  2022-12-28 22:30:28.085769
3       Benzema   34  2022-12-28 22:30:28.085769
4         Messi   35  2022-12-28 22:30:28.085769

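The warnings most likely come from passing the raw DBAPI connection to pandas. Passing a SQLAlchemy `Connection` (or the engine itself) should avoid them, assuming pandas and SQLAlchemy versions that are compatible with each other. The sketch below uses an in-memory database and a two-row `seed` frame, both of which are assumptions made for the example.

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite:///:memory:")  # throwaway DB for the sketch

# Hypothetical seed data standing in for the 'players' table.
seed = pd.DataFrame({"Name": ["Mbappé", "Messi"], "Age": [23, 35]})

# engine.begin() yields a SQLAlchemy Connection and commits on exit.
with engine.begin() as conn:
    seed.to_sql("players", con=conn, index=True, index_label="id")

    df = pd.read_sql_query("SELECT * FROM players", conn, index_col="id")
    df["Last Updated"] = pd.Timestamp.now()
    df.to_sql("players", con=conn, if_exists="replace", index=True, index_label="id")

    reloaded = pd.read_sql_query("SELECT * FROM players", conn, index_col="id")

print(reloaded.head())
```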



### 001.007 TSV file

1. Replicate 001.005 but with the TSV file "[001.007.tsv](001.007.tsv)". Save that file location to a variable
1. TSV is a CSV but with "\t" as separator
1. Bear in mind the same issues as 001.005 for index columns

In [10]:
data_file = "001.007.tsv"
df = pd.read_csv(data_file, sep="\t", index_col='id')
print(df.head())

df["Last Updated"] = pd.Timestamp.now()
df.to_csv(data_file, sep="\t", index=True, index_label='id')

df = pd.read_csv(data_file, sep="\t", index_col='id')
print(df.head())

           Name  Age                Last Updated
id                                              
0        Mbappé   23  2022-12-28 19:43:52.811925
1     De Bruyne   31  2022-12-28 19:43:52.811925
2   Lewandowski   33  2022-12-28 19:43:52.811925
3       Benzema   34  2022-12-28 19:43:52.811925
4         Messi   35  2022-12-28 19:43:52.811925
           Name  Age                Last Updated
id                                              
0        Mbappé   23  2022-12-28 22:30:28.097741
1     De Bruyne   31  2022-12-28 22:30:28.097741
2   Lewandowski   33  2022-12-28 22:30:28.097741
3       Benzema   34  2022-12-28 22:30:28.097741
4         Messi   35  2022-12-28 22:30:28.097741

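The index-column issue is easy to reproduce in memory. In the sketch below, writing an unnamed index and reading it back without `index_col` produces a stray "Unnamed: 0" column, while labelling the index on write and naming it on read keeps the columns clean. The two-row frame and the use of `io.StringIO` instead of a file are assumptions made for the example.

```python
import io

import pandas as pd

df = pd.DataFrame({"Name": ["Mbappé", "Messi"], "Age": [23, 35]})

# Written with an unnamed index, the header row starts with an empty field.
unlabelled = df.to_csv(sep="\t", index=True)
stray = pd.read_csv(io.StringIO(unlabelled), sep="\t")

# Labelling the index on write and naming it on read avoids the extra column.
labelled = df.to_csv(sep="\t", index=True, index_label="id")
clean = pd.read_csv(io.StringIO(labelled), sep="\t", index_col="id")

print(list(stray.columns))
print(list(clean.columns))
```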

### 001.008 JSON file

1. Replicate 001.005 but with the JSON file "[001.008.json](001.008.json)". Save that file location to a variable
1. Make sure non-ASCII characters are readable

In [11]:
data_file = "001.008.json"
df = pd.read_json(data_file)
print(df.head())

df["Last Updated"] = pd.Timestamp.now()
df.to_json(data_file, force_ascii=False)

df = pd.read_json(data_file)
print(df.head())

          Name  Age   Last Updated
0       Mbappé   23  1672257117428
1    De Bruyne   31  1672257117428
2  Lewandowski   33  1672257117428
3      Benzema   34  1672257117428
4        Messi   35  1672257117428
          Name  Age   Last Updated
0       Mbappé   23  1672266628108
1    De Bruyne   31  1672266628108
2  Lewandowski   33  1672266628108
3      Benzema   34  1672266628108
4        Messi   35  1672266628108

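The "Last Updated" values above come back as epoch milliseconds. This is most likely because `to_json` writes datetimes as epoch values by default, and `read_json` only auto-converts columns whose names look date-like. One possible variant, sketched here with an in-memory buffer and a two-row frame (both assumptions), writes ISO 8601 strings and asks for the column to be parsed on read.

```python
import io

import pandas as pd

df = pd.DataFrame({"Name": ["Mbappé", "Messi"], "Age": [23, 35]})
df["Last Updated"] = pd.Timestamp.now()

# date_format="iso" writes readable timestamps; force_ascii=False keeps "é" as is.
json_text = df.to_json(force_ascii=False, date_format="iso")

# convert_dates lists the columns that read_json should parse as datetimes.
reloaded = pd.read_json(io.StringIO(json_text), convert_dates=["Last Updated"])
print(reloaded.dtypes)
```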

### 001.009 From a table in an HTML page

1. Replicate 001.005 but with the HTML file "[001.009.html](001.009.html)". Save that file location to a variable
1. Make sure there is only a single header row, ids are shown, and the ID column is called 'ID'

In [12]:
data_file = "001.009.html"
df = pd.read_html(data_file, flavor="bs4", index_col="ID")[0]
print(df.head())

df["Last Updated"] = pd.Timestamp.now()
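# Naming the columns axis 'ID' and clearing the index name puts 'ID' in the
# top-left header cell, so the table keeps a single header row.
df.columns.name = 'ID'
df.index.name = None
# index, index_names and header are boolean flags in to_html
df.to_html(data_file, index=True, index_names=True, header=True)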

df = pd.read_html(data_file, flavor="bs4", index_col="ID")[0]
print(df.head())

           Name  Age                Last Updated
ID                                              
0        Mbappé   23  2022-12-28 21:52:19.292355
1     De Bruyne   31  2022-12-28 21:52:19.292355
2   Lewandowski   33  2022-12-28 21:52:19.292355
3       Benzema   34  2022-12-28 21:52:19.292355
4         Messi   35  2022-12-28 21:52:19.292355
           Name  Age                Last Updated
ID                                              
0        Mbappé   23  2022-12-28 22:30:28.166824
1     De Bruyne   31  2022-12-28 22:30:28.166824
2   Lewandowski   33  2022-12-28 22:30:28.166824
3       Benzema   34  2022-12-28 22:30:28.166824
4         Messi   35  2022-12-28 22:30:28.166824
