<img src="images/bwHPC_Logo_cmyk.svg" width="200" /> <img src="images/HochschuleEsslingen_Logo_RGB_DE.png" width="200" /> <img src="images/Konstanz_Logo.svg" width="200" /> <img src="images/KIT_Logo.png" width="200" />

## Pandas

* Library for data analysis
* Creating and working with Series and DataFrames
* Reading and writing large data sets in different formats

### Pandas-Series
* One-dimensional
* Like a NumPy array, it has a numerical (positional) index
* In addition to the position, every element has a label (any hashable data type, e.g. datetime, string, ...)

(For the definition of hashable, see https://docs.python.org/3/glossary.html#term-hashable)

#### Creating a Pandas-Series from a list

In [None]:
import pandas as pd

index = ['Germany', 'France', 'Netherlands']
data = [83000000, 67000000, 17000000]

series = pd.Series(data=data) # Creating Pandas-Series without label
print("series without label:")
display(series)

series = pd.Series(data=data, index=index) # Creating Pandas-Series with Label
print("\nseries with label:")
display(series)

print("\nposition 0:")
display(series.iloc[0]) # Access the first element by position (series[0] is deprecated for label-indexed Series)

print("\nlabel 'Germany':")
display(series['Germany']) # Access the first element using a label
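Labels do not have to be strings. The following is a minimal sketch, assuming made-up temperature values, of a Series that uses datetime labels; the names `dates` and `temperatures` are purely illustrative.

```python
import pandas as pd

# Hypothetical example data: three daily temperatures labelled by date
dates = pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03'])
temperatures = pd.Series(data=[2.5, 3.1, 1.8], index=dates)

print(temperatures.iloc[0])                          # access by position
print(temperatures.loc[pd.Timestamp('2023-01-02')])  # access by label
```

Using `.iloc` for positions and `.loc` for labels keeps the two access paths unambiguous.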

#### Creating a Pandas-Series from a Python Dictionary

In [None]:
import pandas as pd

age = {'Pia':20, 'Felix':26} # Python Dictionary
pd.Series(age) # Converting dict into Pandas-Series

#### Operations on Pandas Series

In [None]:
import pandas as pd

year_1990 = {'Germany':70000000, 'France': 50000000, 'Netherlands': 12000000}
year_2021 = {'Germany':83000000, 'France': 67000000, 'Netherlands': 17000000, 'Greece': 13000000}

inhabitants_1990 = pd.Series(year_1990) # Converting dict into Pandas-Series
inhabitants_2021 = pd.Series(year_2021)

print("Inhabitants in 1990:")
display(inhabitants_1990)

print("\nAccess via Label 'France':")
display(inhabitants_1990['France'])

print("\nDivision:")
display(inhabitants_1990 / inhabitants_2021)

print("\nDivision with fill_value:")
display(inhabitants_1990.div(inhabitants_2021, fill_value=13000000)) # Labels missing in one Series are filled with the supplied value before dividing (avoids NaN in the result)
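
Arithmetic between two Series is aligned by label, not by position. The following small sketch, using two made-up Series `a` and `b`, shows where the `NaN` comes from and what `fill_value` changes.

```python
import pandas as pd

# Hypothetical toy data: label 'z' exists only in b
a = pd.Series({'x': 10, 'y': 20})
b = pd.Series({'x': 2, 'y': 4, 'z': 5})

plain = a / b                    # 'z' is missing in a -> result is NaN
filled = a.div(b, fill_value=1)  # missing 'z' in a is treated as 1 -> 1 / 5

print(plain)
print(filled)
```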

### Pandas DataFrames
* Two-dimensional
* Collection of Pandas-Series (the columns) that share the same index (row labels)
* Every column has a name (label) by which the corresponding Series can be selected

<p style="text-align: center;"> Pandas-Series 1: </p>

| Index       | Inhabitants 2021 |
|-------------|------------------|
| Germany     | 83000000         |
| France      | 67000000         |
| Netherlands | 17000000         |

<p style="text-align: center;"> Pandas-Series 2: </p>

| Index       | Inhabitants 1990 |
|-------------|------------------|
| Germany     | 70000000         |
| France      | 50000000         |
| Netherlands | 12000000         |

<p style="text-align: center;"> Pandas-DataFrame assembled from the two Pandas-Series:</p>

| Index       | Inhabitants 1990 | Inhabitants 2021 |
|-------------|------------------|------------------|
| Germany     | 70000000         | 83000000         |
| France      | 50000000         | 67000000         |
| Netherlands | 12000000         | 17000000         |
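
The tables above can be reproduced in code. This is a minimal sketch, assuming the two Series share the same labels; passing a dict of Series to `pd.DataFrame` is one of several possible ways to assemble them.

```python
import pandas as pd

index = ['Germany', 'France', 'Netherlands']
inhabitants_1990 = pd.Series([70000000, 50000000, 12000000], index=index)
inhabitants_2021 = pd.Series([83000000, 67000000, 17000000], index=index)

# The dict keys become the column names, the shared labels become the row index
df = pd.DataFrame({'Inhabitants 1990': inhabitants_1990,
                   'Inhabitants 2021': inhabitants_2021})

print(df)
print(df['Inhabitants 2021'])  # selecting one column returns a Series again
```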

#### Creating Pandas-DataFrames from a NumPy array

In [None]:
import numpy as np

np.random.seed(42)
data = np.random.randint(0, 101, (4, 3)) # NumPy 4x3 matrix with random integers from 0 to 100 (the upper bound 101 is exclusive)
display(data)

In [None]:
import pandas as pd
import numpy as np

np.random.seed(42)
data = np.random.randint(0, 101, (4, 3))

index = ['Berlin', 'BW', 'Bayern', 'Hessen']
columns = ['Jan', 'Feb', 'Mar']

df = pd.DataFrame(data=data, index=index, columns=columns) # Create Pandas DataFrame
display(df)

df.info() # Show more info on the Pandas DataFrame (info() prints directly and returns None)

#### Read a Pandas DataFrame from File

In [None]:
import pandas as pd

#df = pd.read_csv('s3://nyc-tlc/trip data/green_tripdata_2019-02.csv', nrows=1000) # Read CSV from S3; limited to the first 1000 rows
## S3 requires an AWS account; therefore the data used here is stored locally.
## Please use the provided Parquet file (see below).

# https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
df = pd.read_parquet('./files/green_tripdata_2023-01.parquet', engine='pyarrow')
display(df)

print("\nColumns:")
display(df.columns) # show columns

print("\nIndex:")
display(df.index) # show index

print("\nthe first six rows")
display(df.head(6)) # show the first six rows

print("\nStatistical overview:")
display(df.describe().transpose()) # show statistical overview (count, mean, std, min, max, quartiles)

#### Working with columns

In [None]:
import pandas as pd
df = pd.read_parquet('./files/green_tripdata_2023-01.parquet', engine='pyarrow')

display(df['tip_amount']) # Output a specific column
display(type(df['tip_amount'])) # Every single column of a Pandas DataFrame is a Series

In [None]:
import pandas as pd
df = pd.read_parquet('./files/green_tripdata_2023-01.parquet', engine='pyarrow')

cols = ['tip_amount','total_amount']
display(df[cols]) # Show multiple columns
display(type(df[cols])) # A selection of several columns is again a DataFrame

In [None]:
import pandas as pd
df = pd.read_parquet('./files/green_tripdata_2023-01.parquet', engine='pyarrow')

display(100 * df['tip_amount'] / df['total_amount']) # Tip as a percentage of the total amount

In [None]:
import pandas as pd
df = pd.read_parquet('./files/green_tripdata_2023-01.parquet', engine='pyarrow')

df['tip_percentage'] = 100 * df['tip_amount'] / df['total_amount'] # Insert a new column into the DataFrame
display(df.head())

display(df.drop('tip_percentage', axis=1)) # Delete column, but only in the DataFrame returned by drop()
display(df) # 'tip_percentage' is still available in the original DataFrame

df = df.drop('tip_percentage', axis=1) # Only here 'tip_percentage' is finally deleted
display(df)

display(df.shape) # Dimension of DataFrames

display(df.shape[0]) # Index=0 --> Rows

display(df.shape[1]) # Index=1 --> Columns

#### Working with rows

In [None]:
import pandas as pd
df = pd.read_parquet('./files/green_tripdata_2023-01.parquet', engine='pyarrow')

display(df.index)

In [None]:
df = df.set_index("lpep_pickup_datetime") # change the index
# Attention: "lpep_pickup_datetime" is not a column anymore

In [None]:
df.head()

In [None]:
df.reset_index() # returns a new DataFrame with the default index; df itself is unchanged

In [None]:
df.iloc[0:6] # return the first six rows

In [None]:
display(df.index)
df.loc[[pd.to_datetime('2023-01-01 00:26:10'),pd.to_datetime('2023-01-01 00:35:12')]] # return only the rows with exactly these two labels (not the range between them)
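
Passing a list to `.loc` selects exactly the listed labels. To select everything between two timestamps, a label slice can be used instead. The sketch below uses a made-up miniature DataFrame `rides` with a sorted datetime index; for the real data set it is assumed that the index would first have to be sorted, e.g. with `sort_index()`.

```python
import pandas as pd

# Hypothetical miniature data set with a sorted datetime index
times = pd.to_datetime(['2023-01-01 00:10:00', '2023-01-01 00:26:10',
                        '2023-01-01 00:30:00', '2023-01-01 00:35:12',
                        '2023-01-01 00:50:00'])
rides = pd.DataFrame({'tip_amount': [0.0, 4.03, 2.64, 1.7, 0.0]}, index=times)

start = pd.Timestamp('2023-01-01 00:26:10')
end = pd.Timestamp('2023-01-01 00:35:12')

exact = rides.loc[[start, end]]  # list: only the two listed labels
between = rides.loc[start:end]   # slice: all rows from start to end, both inclusive

print(exact)
print(between)
```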

In [None]:
df.drop(pd.to_datetime('2023-01-01 00:26:10'), axis=0) # Drop all rows with this label (returns a new DataFrame)

In [None]:
row = df.iloc[0] # Select a single row by position (returned as a Series)

In [None]:
row # show the row

In [None]:
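df = pd.concat([df, df.iloc[[0]]]) # Append the row at the end; DataFrame.append() was removed in pandas 2.0 (df.iloc[[0]] is the same row as a one-row DataFrame)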

In [None]:
df

#### Filtering with conditions

Every column is called a feature of the data. Every row is called an instance of the data.

In [None]:
import pandas as pd
df = pd.read_parquet('./files/green_tripdata_2023-01.parquet', engine='pyarrow') # read anew

In [None]:
df["tip_amount"] > 3 # Where is 'tip_amount' larger than 3?

In [None]:
var1 = df["tip_amount"] > 3 # create a Boolean mask that is True for rows where tip_amount is larger than 3

In [None]:
df[var1] # and filter DataFrame to only these values

In [None]:
df[(df["tip_amount"] > 3) & (df["total_amount"] > 40)] # Combining two conditions in a single line

In [None]:
allowed_counts = [2, 5] # Values to select: 2 and 5 (avoids shadowing the built-in filter())
df['passenger_count'].isin(allowed_counts).iloc[500:510] # Boolean mask for rows 500 to 509: True for rides with 2 or 5 passengers

In [None]:
df.iloc[501] # Row 501 is False since 'passenger_count'=1 (i.e. neither 2 nor 5)
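
The Boolean mask returned by `isin()` can be used like any other condition to filter the DataFrame. A minimal sketch on a made-up DataFrame `rides` (the values are illustrative only):

```python
import pandas as pd

# Hypothetical toy data
rides = pd.DataFrame({'passenger_count': [1, 2, 5, 3, 2],
                      'tip_amount': [0.0, 4.0, 2.5, 1.0, 3.0]})

mask = rides['passenger_count'].isin([2, 5])  # True where the value is 2 or 5
selected = rides[mask]                        # keep only the matching rows

print(mask)
print(selected)
```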

In [None]:
df.head()

In [None]:
df.info() # Information about the type

In [None]:
def last_two(number): # Return the last two digits of a number
    return int(str(number)[-2:])

In [None]:
last_two(1234567)

In [None]:
df['PULocationID'].apply(last_two) # Apply the function to every value of a column in the DataFrame

In [None]:
df['PULocationID_last_two'] = df['PULocationID'].apply(last_two)

In [None]:
df.head()

In [None]:
df['total_amount'].mean()