# Introduction to Pandas
<hr style="border:2px solid black">

## 1. Python Built-in Data Structures

### 1.1 `list`

**multiple values in ordered sequence**

In [None]:
my_list = [1,2,3,4,5]
my_list

**indexed**

In [None]:
my_list[0]

**mutable**

In [None]:
my_list[0] = 0
my_list

### 1.2 `tuple`

**multiple values in ordered sequence**

In [None]:
my_tuple = ('a','b','c','d','e')
my_tuple

**indexed**

In [None]:
my_tuple[-1]

**immutable**

In [None]:
my_tuple[-1] = 'f'  # raises TypeError: tuples do not support item assignment
my_tuple

### 1.3 `dict`

**collection of key-value pairs**

In [None]:
my_dictionary = {'a': 1,'b': 2,'c': 3, 'd': 4}
my_dictionary

**key serves as index**

In [None]:
my_dictionary['c']

**mutable**

In [None]:
my_dictionary['e'] = 5
my_dictionary

### 1.4 `set`

**unordered collection of unique values**

In [None]:
my_set = set([1,1,2,3,4,5,5,0,0,0])
my_set

**no index**

In [None]:
my_set[1]  # raises TypeError: sets have no order, so they cannot be indexed

**mutable**

In [None]:
my_set.add(6)
my_set

<hr style="border:2px solid black">

## 2. Pandas

>- open-source library for processing and analyzing tabular data
>- built on top of the Python programming language
>- fast, powerful, flexible and user-friendly

### 2.1 [Pandas Series](https://pandas.pydata.org/docs/reference/api/pandas.Series.html)

- one dimensional array with axis labels
- holds values of a single data type (`dtype`); mixed values fall back to the generic `object` dtype
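
As a minimal sketch of the two points above, the cell below builds a Series with explicit axis labels and checks its dtype. The values and labels are made up for illustration.

```python
import pandas as pd

# Hypothetical data: three values labelled by letter.
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

print(s['b'])    # look a value up by its axis label
print(s.dtype)   # one dtype for the whole Series

# Mixing integers and floats upcasts every element to float64.
mixed = pd.Series([1, 2.5, 3])
print(mixed.dtype)
```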

In [None]:
import pandas as pd

In [None]:
my_series = pd.Series(my_list)
my_series

In [None]:
my_series.mean()

### 2.2 [Pandas DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html)

- two-dimensional, size-mutable, potentially heterogeneous tabular data
- has `pd.Series` as columns
- rows contain observations; columns contain variables/features
- can load data from various data sources (e.g. csv, excel)
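
A small sketch of these properties, using a hand-made DataFrame rather than a file. The column names and values are invented for illustration and are not taken from the penguins data.

```python
import pandas as pd

# Hypothetical toy data: one row per observation, one column per variable.
df = pd.DataFrame({
    'name': ['x', 'y', 'z'],
    'length_mm': [39.1, 46.5, 50.0],
    'mass_g': [3750, 3500, 5700],
})

print(df.dtypes)            # heterogeneous: each column keeps its own dtype
print(type(df['mass_g']))   # a single column is a pd.Series

# Size-mutable: a new column can be added after creation.
df['mass_kg'] = df['mass_g'] / 1000
print(df.shape)
```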

### Caution

**Avoid looping through the rows of a DataFrame! Operations on whole columns are faster and easier to read.**
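
To illustrate the caution, the sketch below computes the same result twice on made-up data: once with a row loop and once with a single column expression. Both give identical values; the column version is the one pandas is designed for.

```python
import pandas as pd

df = pd.DataFrame({'mass_g': [3750, 3500, 5700]})  # hypothetical toy data

# Row-by-row loop: works, but runs Python code once per row.
looped = []
for _, row in df.iterrows():
    looped.append(row['mass_g'] / 1000)

# Vectorized: one expression applied to the whole column at once.
vectorized = df['mass_g'] / 1000

print(vectorized.tolist() == looped)
```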

<hr style="border:2px solid black">

## 3. Working with Pandas 

**import pandas**

In [None]:
import pandas as pd

**reading a file**

In [None]:
penguins = pd.read_csv('../data/penguins_simple.csv', sep=';')

**view dataframe content**

In [None]:
penguins

**data type**

In [None]:
type(penguins)

### 3.1 Examining DataFrames

**show the first 3 lines**

In [None]:
penguins.head(3)

**show the last 3 lines**

In [None]:
penguins.tail(3)

**number of rows and columns**

In [None]:
penguins.shape

In [None]:
penguins.shape[0]

In [None]:
penguins.shape[1]

**column information**

In [None]:
penguins.info()

**missing values**

In [None]:
penguins.isna().sum()

**description of numerical columns**

In [None]:
penguins.describe()

**column mean**

In [None]:
penguins['Culmen Length (mm)'].mean()

**distinct values in a column**

In [None]:
penguins['Species'].unique()

**number of distinct values in a column**

In [None]:
penguins['Species'].nunique()

**occurrences of each unique value in a column**

In [None]:
penguins['Species'].value_counts()

**apply a calculation to a column**

In [None]:
penguins['Body Mass (kg)'] = penguins['Body Mass (g)']/1000

**check head**

In [None]:
penguins.head()

**drop a column**

In [None]:
penguins.drop('Body Mass (kg)', axis=1, inplace=True)

In [None]:
penguins.head()

### 3.2 Selecting Rows and Columns

**index range**

In [None]:
penguins.index

**column names**

In [None]:
penguins.columns

**choose a column**

In [None]:
type(penguins['Flipper Length (mm)'])

In [None]:
penguins['Flipper Length (mm)']

In [None]:
type(penguins[['Flipper Length (mm)']])

In [None]:
penguins[['Flipper Length (mm)']]

**choose multiple columns**

In [None]:
penguins[['Flipper Length (mm)', 'Body Mass (g)']]

**filter rows by a condition**

In [None]:
penguins[penguins['Body Mass (g)'] > 5000]

In [None]:
penguins[(penguins['Body Mass (g)'] > 5000) & (penguins['Culmen Depth (mm)'] < 15.0)] 

In [None]:
penguins[(penguins['Body Mass (g)'] > 5000) | (penguins['Culmen Depth (mm)'] < 15.0)] 

In [None]:
penguins[~(penguins['Body Mass (g)'] > 5000)]
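
The filters above all work the same way: a comparison produces a boolean Series, and indexing with it keeps the rows marked `True`. The sketch below spells this out on a made-up frame (the column names are illustrative). Each comparison needs its own parentheses when combined inline, because `&`, `|` and `~` bind more tightly than `>` and `<`.

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({
    'mass_g': [3750, 5200, 5700, 4100],
    'depth_mm': [18.7, 14.5, 16.0, 14.0],
})

heavy = df['mass_g'] > 5000       # boolean Series, one value per row
shallow = df['depth_mm'] < 15.0

both = df[heavy & shallow]        # rows where both conditions hold
either = df[heavy | shallow]      # rows where at least one condition holds
not_heavy = df[~heavy]            # rows where the condition is False

print(len(both), len(either), len(not_heavy))
```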

In [None]:
penguins.head()

**label-based (`loc`) and position-based (`iloc`) selection**

In [None]:
penguins.loc[1]

In [None]:
penguins.iloc[1]

In [None]:
penguins.loc[3:7]

In [None]:
penguins.iloc[3:7]
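
The two slices above do not return the same rows. A minimal sketch on a made-up frame with the default integer index: `loc` slices by label and includes the end label, while `iloc` slices by position and excludes the end position, like a Python list.

```python
import pandas as pd

df = pd.DataFrame({'value': range(10)})   # default index with labels 0..9

by_label = df.loc[3:7]       # labels 3, 4, 5, 6, 7 (end label included)
by_position = df.iloc[3:7]   # positions 3, 4, 5, 6 (end position excluded)

print(len(by_label), len(by_position))
```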

### 3.3 Summarizing Data

**converting to numpy array**

In [None]:
penguins.to_numpy()  # the pandas docs recommend to_numpy() over .values

In [None]:
type(penguins.to_numpy())

**cumulative sum**

In [None]:
penguins['Body Mass (g)'].cumsum()

**groupby**

In [None]:
penguins.groupby('Sex')[['Body Mass (g)']].mean()
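
As a sketch of what `groupby` does, the cell below splits a made-up frame by a category column, then reduces each group. The `agg` call with several function names is one way to get more than one statistic per group; the data and column names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({
    'sex': ['F', 'M', 'F', 'M'],
    'mass_g': [3000, 4000, 3500, 5000],
})

# One statistic per group: the result is a Series indexed by group key.
means = df.groupby('sex')['mass_g'].mean()

# Several statistics per group: the result is a DataFrame.
summary = df.groupby('sex')['mass_g'].agg(['mean', 'max', 'count'])

print(means)
print(summary)
```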

**sort values**

In [None]:
penguins.sort_values(by=['Species', 'Body Mass (g)'])

**creating new column out of an existing one**

In [None]:
def get_initial(s):
    return s[0]

penguins['initial'] = penguins['Sex'].apply(get_initial)
penguins

In [None]:
penguins['initial2'] = penguins['Species'].apply(lambda s: s[0])
penguins
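
For string columns, the `.str` accessor is an alternative to `apply` that avoids writing a function at all. The sketch below, on made-up data, shows both approaches giving the same initials.

```python
import pandas as pd

df = pd.DataFrame({'species': ['Adelie', 'Gentoo', 'Chinstrap']})  # toy data

with_apply = df['species'].apply(lambda s: s[0])   # call a function per value
with_str = df['species'].str[0]                    # index every string at once

print(with_apply.tolist() == with_str.tolist())
```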

**stacking**

In [None]:
penguins.stack()
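
A minimal sketch of what `stack` returns, on a made-up 2x2 frame: the column labels move into an inner index level, so the result is a Series with a two-level index (row label, column label).

```python
import pandas as pd

# Hypothetical toy data with named rows.
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]}, index=['r1', 'r2'])

stacked = df.stack()

print(stacked.index.nlevels)       # two index levels: row label, column label
print(stacked.loc[('r2', 'a')])    # value from row 'r2', column 'a'
```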

**transposition**

In [None]:
penguins.transpose()

**pandas built-in plotting option**

In [None]:
penguins['Body Mass (g)'].hist();

In [None]:
penguins.plot(x='Culmen Depth (mm)', y='Culmen Length (mm)', style='o');

### 3.4 Writing to Disk

In [None]:
penguins.to_csv('../data/new_penguin_data.csv')

**read the saved file**

In [None]:
df = pd.read_csv('../data/new_penguin_data.csv', index_col=0)

In [None]:
df.head()
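
The `index_col=0` argument matters because `to_csv` writes the index as an unnamed first column by default. The sketch below shows the round trip on made-up data, using an in-memory buffer instead of a file so that it does not depend on any path.

```python
import io

import pandas as pd

# Hypothetical toy data with a non-default index.
df = pd.DataFrame({'mass_g': [3750, 3500]}, index=[10, 20])

buffer = io.StringIO()
df.to_csv(buffer)            # the index is written as the first column

buffer.seek(0)
naive = pd.read_csv(buffer)                   # index returns as a data column

buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)   # first column becomes the index

print(naive.columns.tolist())
print(restored.index.tolist())
```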

### BONUS: Selecting rows and columns

In [None]:
gentoo = penguins[penguins['Species'] == 'Gentoo'].copy()  # copy, so later in-place changes do not refer back to penguins

In [None]:
gentoo

In [None]:
gentoo['Species'].unique()

**Selecting by label (`loc`)**

In [None]:
gentoo.loc[216]

**Selecting by integer position (`iloc`)**

In [None]:
gentoo.iloc[118]

In [None]:
gentoo.loc[216]==gentoo.iloc[2]
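
The comparison above holds because filtering keeps the original row labels. A minimal sketch on made-up data: after the filter, the first remaining row sits at position 0 but still carries its old label.

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({'species': ['Adelie', 'Adelie', 'Gentoo', 'Gentoo', 'Gentoo']})
subset = df[df['species'] == 'Gentoo']

print(subset.index.tolist())        # original labels survive the filter

first_by_label = subset.loc[2]      # label 2 is the first Gentoo row
first_by_position = subset.iloc[0]  # position 0 is the same row

try:
    subset.loc[0]                   # label 0 was filtered out
except KeyError:
    print('label 0 is not in the filtered frame')
```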

**reset index**

In [None]:
gentoo.reset_index(drop=True, inplace=True)

In [None]:
gentoo.head()

**save to disk**

In [None]:
gentoo.to_csv('../data/gentoo_data.csv', index=False)

**read saved file**

In [None]:
pd.read_csv('../data/gentoo_data.csv')

**set index**

In [None]:
penguins.set_index('Species')
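
Note that `set_index` returns a new DataFrame and leaves the original unchanged unless the result is assigned. The sketch below, on made-up data, shows this and then uses the new index for a label lookup.

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({'species': ['Adelie', 'Gentoo'], 'mass_g': [3750, 5700]})

indexed = df.set_index('species')        # new frame; df itself is unchanged

print(df.columns.tolist())               # 'species' is still a column in df
print(indexed.loc['Gentoo', 'mass_g'])   # look up a row by its species label
```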