# Table of Contents

1. [Import Pandas Library](#import)
2. [Create Pandas from dictionary, column-oriented](#from_dict_col)
3. [Create Pandas from dictionary, row-oriented](#from_dict_row)
4. [Create Pandas from list, column-oriented](#from_list_col)
5. [Create Pandas from list, row-oriented](#from_list_row)
6. [Pandas Save](#save)
7. [Pandas Load](#load)
8. [Understanding table, head, tail, describe, shape](#understanding)
9. [Rename columns](#rename_col)
10. [Drop columns](#drop_col)
11. [Handling Missing Data, Impute mean using df.loc](#missing_loc)
12. [Handling Missing Data, Impute mean using df.fillna](#missing_fillna)
13. [Filter DataFrame Rows](#filter)
14. [Groupby Aggregation](#groupby)
15. [Sort DataFrame by columns](#sort)
16. [Row/Col extraction using df.loc](#loc)
17. [Row/Col extraction using df.iloc](#iloc)
18. [Calculations or Transforming data in DataFrame using apply](#apply)

<a id='import'></a>
# Import Pandas Library

In [1]:
import pandas as pd
import numpy as np

<a id='from_dict_col'></a>
# Create Pandas from dictionary, key-value pair column-oriented

Passing columns= lets you set the order of the columns in the DataFrame.

In [2]:
raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy','Thomas', 'Bryan'], 
        'last_name': ['Miller', 'Jacobson', "Johnson", 'Milner', 'Cooze', 'Edison', 'Adams'], 
        'age': [42, 52, np.nan, 24, 73, 15, 65], 
        'gender': ['M','F', 'F', 'M', 'F','M', 'M'],
        'score': [90, 85, 57, 62, np.nan , 60, 55]}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'gender', 'score'])
df

Unnamed: 0,first_name,last_name,age,gender,score
0,Jason,Miller,42.0,M,90.0
1,Molly,Jacobson,52.0,F,85.0
2,Tina,Johnson,,F,57.0
3,Jake,Milner,24.0,M,62.0
4,Amy,Cooze,73.0,F,
5,Thomas,Edison,15.0,M,60.0
6,Bryan,Adams,65.0,M,55.0
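
To illustrate what columns= does beyond reordering, here is a minimal sketch on a smaller, made-up dictionary (the names data, reordered, subset and extra, and the 'city' column, are assumptions for illustration only). It suggests that columns= also selects which keys are kept, and that a name missing from the dictionary yields a column of NaN.

```python
import numpy as np
import pandas as pd

# Small illustrative dictionary (not the notebook's full data)
data = {'first_name': ['Jason', 'Molly'],
        'age': [42, np.nan],
        'score': [90, 85]}

# Column order follows the columns= list, not the dictionary order
reordered = pd.DataFrame(data, columns=['score', 'first_name', 'age'])

# Keys that are left out of columns= are dropped
subset = pd.DataFrame(data, columns=['first_name', 'score'])

# A name that is not a dictionary key becomes an all-NaN column
extra = pd.DataFrame(data, columns=['first_name', 'city'])

print(list(reordered.columns))
print(list(subset.columns))
print(extra['city'].isna().all())
```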


<a id='from_dict_row'></a>
# Create Pandas from dictionary, key-value pair row-oriented

In [3]:
raw_data = [{'first_name':'Jason', 'last_name':'Miller','age':42,'gender':'M','score':90},
            {'first_name':'Molly', 'last_name':'Jacobson','age':52,'gender':'F','score':85},
            {'first_name':'Tina', 'last_name':'Johnson','age':np.nan,'gender':'F','score':57},
            {'first_name':'Jake', 'last_name':'Milner','age':24,'gender':'M','score':62},
            {'first_name':'Amy', 'last_name':'Cooze','age':73,'gender':'F','score':np.nan},
            {'first_name':'Thomas', 'last_name':'Edison','age':15,'gender':'M','score':60},
            {'first_name':'Bryan', 'last_name':'Adams','age':65,'gender':'M','score':55}]

df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'gender', 'score'])
df

Unnamed: 0,first_name,last_name,age,gender,score
0,Jason,Miller,42.0,M,90.0
1,Molly,Jacobson,52.0,F,85.0
2,Tina,Johnson,,F,57.0
3,Jake,Milner,24.0,M,62.0
4,Amy,Cooze,73.0,F,
5,Thomas,Edison,15.0,M,60.0
6,Bryan,Adams,65.0,M,55.0


<a id='from_list_col'></a>
# Create Pandas from List Vertical, i.e. column-oriented

Put each column in its own list, build a dictionary from the lists (the same column-oriented layout as above), and store the result in df.

In [4]:
fname_arr = ['Jason', 'Molly', 'Tina', 'Jake', 'Amy','Thomas', 'Bryan']
lname_arr = ['Miller', 'Jacobson', "Johnson", 'Milner', 'Cooze','Edison', 'Adams']
age_arr = [42, 52, np.nan, 24, 73, 15, 65]
gender_arr = ['M','F', 'F', 'M', 'F','M', 'M']
score_arr = [90, 85, 57, 62, np.nan, 60, 55]

d = {'first_name' : fname_arr, 'last_name': lname_arr, 'age': age_arr, 'gender': gender_arr, 'score': score_arr}

df = pd.DataFrame(d)
df

Unnamed: 0,first_name,last_name,age,gender,score
0,Jason,Miller,42.0,M,90.0
1,Molly,Jacobson,52.0,F,85.0
2,Tina,Johnson,,F,57.0
3,Jake,Milner,24.0,M,62.0
4,Amy,Cooze,73.0,F,
5,Thomas,Edison,15.0,M,60.0
6,Bryan,Adams,65.0,M,55.0


<a id='from_list_row'></a>
# Create Pandas from List Horizontal, i.e. row-oriented

In [5]:
person_list = [['Jason','Miller',42,'M',90],
               ['Molly', 'Jacobson',52,'F',85],
               ['Tina','Johnson',np.nan,'F',57],
               ['Jake','Milner',24,'M',62],
               ['Amy','Cooze',73,'F',np.nan],
               ['Thomas','Edison',15,'M',60],
               ['Bryan','Adams',65,'M',55]]

columns_name = ['first_name','last_name','age','gender','score']

df = pd.DataFrame(person_list, columns=columns_name)
df

Unnamed: 0,first_name,last_name,age,gender,score
0,Jason,Miller,42.0,M,90.0
1,Molly,Jacobson,52.0,F,85.0
2,Tina,Johnson,,F,57.0
3,Jake,Milner,24.0,M,62.0
4,Amy,Cooze,73.0,F,
5,Thomas,Edison,15.0,M,60.0
6,Bryan,Adams,65.0,M,55.0


<a id='save'></a>
# Pandas Save

In [6]:
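# Save as CSV (the index is written as an extra unnamed first column)
df.to_csv('csv_pandas_notes.csv')
# Save as HDF5 (requires the optional PyTables package)
df.to_hdf('hdf_pandas_notes.h5', key='df', mode='w')
# Save as pickle (pandas handles this itself, no separate pickle import is needed)
df.to_pickle('pickle_pandas_notes.pkl')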

<a id='load'></a>
# Pandas Load

In [7]:
# Read CSV (the saved index comes back as an "Unnamed: 0" column)
df_csv = pd.read_csv('csv_pandas_notes.csv', delimiter=',')
# Read HDF5
df_hdf = pd.read_hdf('hdf_pandas_notes.h5', key='df')
# Read pickle
df_pickle = pd.read_pickle('pickle_pandas_notes.pkl')

In [8]:
df_csv

Unnamed: 0.1,Unnamed: 0,first_name,last_name,age,gender,score
0,0,Jason,Miller,42.0,M,90.0
1,1,Molly,Jacobson,52.0,F,85.0
2,2,Tina,Johnson,,F,57.0
3,3,Jake,Milner,24.0,M,62.0
4,4,Amy,Cooze,73.0,F,
5,5,Thomas,Edison,15.0,M,60.0
6,6,Bryan,Adams,65.0,M,55.0
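
The extra "Unnamed: 0" column in df_csv is the index that to_csv wrote to the file. The cells that follow keep that column (it is renamed to Id and then dropped), but if you would rather avoid it, the sketch below shows two common options. It uses an in-memory buffer (io.StringIO) instead of a file, and the names df_demo, loaded_default, loaded_no_index and loaded_index_col are assumptions for illustration.

```python
import io
import pandas as pd

df_demo = pd.DataFrame({'first_name': ['Jason', 'Molly'], 'score': [90, 85]})

# Default behaviour: the index is written as an unnamed first column
csv_with_index = df_demo.to_csv()
loaded_default = pd.read_csv(io.StringIO(csv_with_index))

# Option A: do not write the index at all
csv_no_index = df_demo.to_csv(index=False)
loaded_no_index = pd.read_csv(io.StringIO(csv_no_index))

# Option B: keep the index in the file and read it back as the index
loaded_index_col = pd.read_csv(io.StringIO(csv_with_index), index_col=0)

print(list(loaded_default.columns))
print(list(loaded_no_index.columns))
print(list(loaded_index_col.columns))
```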


In [9]:
df_hdf

Unnamed: 0,first_name,last_name,age,gender,score
0,Jason,Miller,42.0,M,90.0
1,Molly,Jacobson,52.0,F,85.0
2,Tina,Johnson,,F,57.0
3,Jake,Milner,24.0,M,62.0
4,Amy,Cooze,73.0,F,
5,Thomas,Edison,15.0,M,60.0
6,Bryan,Adams,65.0,M,55.0


In [10]:
df_pickle

Unnamed: 0,first_name,last_name,age,gender,score
0,Jason,Miller,42.0,M,90.0
1,Molly,Jacobson,52.0,F,85.0
2,Tina,Johnson,,F,57.0
3,Jake,Milner,24.0,M,62.0
4,Amy,Cooze,73.0,F,
5,Thomas,Edison,15.0,M,60.0
6,Bryan,Adams,65.0,M,55.0


<a id='understanding'></a>
# Understanding table, head, tail, describe, shape

In [11]:
df = pd.DataFrame(df_csv)

#Prints the first 5 rows
df.head()

Unnamed: 0.1,Unnamed: 0,first_name,last_name,age,gender,score
0,0,Jason,Miller,42.0,M,90.0
1,1,Molly,Jacobson,52.0,F,85.0
2,2,Tina,Johnson,,F,57.0
3,3,Jake,Milner,24.0,M,62.0
4,4,Amy,Cooze,73.0,F,


In [12]:
#Prints the last 5 rows
df.tail()

Unnamed: 0.1,Unnamed: 0,first_name,last_name,age,gender,score
2,2,Tina,Johnson,,F,57.0
3,3,Jake,Milner,24.0,M,62.0
4,4,Amy,Cooze,73.0,F,
5,5,Thomas,Edison,15.0,M,60.0
6,6,Bryan,Adams,65.0,M,55.0


In [13]:
#Prints a summary of the table for numeric columns
df.describe()

Unnamed: 0.1,Unnamed: 0,age,score
count,7.0,6.0,6.0
mean,3.0,45.166667,68.166667
std,2.160247,22.728103,15.250137
min,0.0,15.0,55.0
25%,1.5,28.5,57.75
50%,3.0,47.0,61.0
75%,4.5,61.75,79.25
max,6.0,73.0,90.0


In [14]:
#Prints the (rows,columns) of the table
df.shape

(7, 6)

<a id='rename_col'></a>
# Rename columns

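Method 1: Assign a full list to df.columns. You must supply a name for every column, including the ones you are not changing.

Method 2: Use df.rename to rename specific columns only. Set inplace=True to modify df in place; otherwise a new DataFrame is returned and df is left unchanged.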

In [15]:
df.columns = ['Id','first_name','last_name','Age','Gender','Score']
df

Unnamed: 0,Id,first_name,last_name,Age,Gender,Score
0,0,Jason,Miller,42.0,M,90.0
1,1,Molly,Jacobson,52.0,F,85.0
2,2,Tina,Johnson,,F,57.0
3,3,Jake,Milner,24.0,M,62.0
4,4,Amy,Cooze,73.0,F,
5,5,Thomas,Edison,15.0,M,60.0
6,6,Bryan,Adams,65.0,M,55.0


In [16]:
df.rename(columns={'first_name':'First Name', 'last_name': 'Last Name'}, inplace=True)
df

Unnamed: 0,Id,First Name,Last Name,Age,Gender,Score
0,0,Jason,Miller,42.0,M,90.0
1,1,Molly,Jacobson,52.0,F,85.0
2,2,Tina,Johnson,,F,57.0
3,3,Jake,Milner,24.0,M,62.0
4,4,Amy,Cooze,73.0,F,
5,5,Thomas,Edison,15.0,M,60.0
6,6,Bryan,Adams,65.0,M,55.0
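
As a complement to Method 2, here is a minimal sketch of df.rename without inplace=True, on a made-up one-row frame (df_demo and renamed are illustrative names). It returns a new DataFrame, so the result has to be assigned to a variable.

```python
import pandas as pd

df_demo = pd.DataFrame({'first_name': ['Jason'],
                        'last_name': ['Miller'],
                        'Age': [42]})

# Without inplace=True, rename returns a new DataFrame
renamed = df_demo.rename(columns={'first_name': 'First Name',
                                  'last_name': 'Last Name'})

print(list(renamed.columns))   # renamed copy
print(list(df_demo.columns))   # original is unchanged
```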


<a id='drop_col'></a>
# Drop columns

In [17]:
df.drop(columns=['Id'], inplace=True)
df

Unnamed: 0,First Name,Last Name,Age,Gender,Score
0,Jason,Miller,42.0,M,90.0
1,Molly,Jacobson,52.0,F,85.0
2,Tina,Johnson,,F,57.0
3,Jake,Milner,24.0,M,62.0
4,Amy,Cooze,73.0,F,
5,Thomas,Edison,15.0,M,60.0
6,Bryan,Adams,65.0,M,55.0


<a id='missing_loc'></a>
# Handling Missing Data, Impute mean using df.loc

Step 1: Get T/F for rows in which the value is missing/NA

In [18]:
df['Score'].isna()

0    False
1    False
2    False
3    False
4     True
5    False
6    False
Name: Score, dtype: bool

Step 2: Get the mean of the values. mean() skips missing values, so here it is the sum of all non-NA values divided by the non-NA count (409/6), not by the total row count (409/7).

In [19]:
df['Score'].mean()

68.16666666666667
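
The arithmetic in Step 2 can be checked by hand. This sketch rebuilds the Score values as a standalone Series (score, total, n_valid and manual_mean are illustrative names) and compares sum divided by non-NA count with mean().

```python
import numpy as np
import pandas as pd

score = pd.Series([90, 85, 57, 62, np.nan, 60, 55], name='Score')

total = score.sum()      # NaN is skipped: 409.0
n_valid = score.count()  # counts non-NA values only: 6
manual_mean = total / n_valid

print(total, n_valid)
print(np.isclose(manual_mean, score.mean()))

# With skipna=False, a single missing value makes the mean NaN
mean_with_nan = score.mean(skipna=False)
print(mean_with_nan)
```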

Step 3: Impute Missing values with mean

In [20]:
#df.loc[ROW] finds the specific rows where a condition is fulfilled
df.loc[df['Score'].isna()]

Unnamed: 0,First Name,Last Name,Age,Gender,Score
4,Amy,Cooze,73.0,F,


In [21]:
#df.loc[ROW, COL] finds the rows where a condition is fulfilled, and goes to the stated columns
df.loc[df['Score'].isna(), ['Score']]

Unnamed: 0,Score
4,


In [22]:
#Assigning to df.loc[ROW, COL] writes the value into the rows where the condition is fulfilled, in the stated columns
df.loc[df['Score'].isna(), ['Score']] = df['Score'].mean()
df

Unnamed: 0,First Name,Last Name,Age,Gender,Score
0,Jason,Miller,42.0,M,90.0
1,Molly,Jacobson,52.0,F,85.0
2,Tina,Johnson,,F,57.0
3,Jake,Milner,24.0,M,62.0
4,Amy,Cooze,73.0,F,68.166667
5,Thomas,Edison,15.0,M,60.0
6,Bryan,Adams,65.0,M,55.0


<a id='missing_fillna'></a>
# Handling Missing Data, Impute mean using df.fillna()

In [23]:
df['Age'].fillna(df['Age'].mean())

0    42.000000
1    52.000000
2    45.166667
3    24.000000
4    73.000000
5    15.000000
6    65.000000
Name: Age, dtype: float64

In [24]:
df

Unnamed: 0,First Name,Last Name,Age,Gender,Score
0,Jason,Miller,42.0,M,90.0
1,Molly,Jacobson,52.0,F,85.0
2,Tina,Johnson,,F,57.0
3,Jake,Milner,24.0,M,62.0
4,Amy,Cooze,73.0,F,68.166667
5,Thomas,Edison,15.0,M,60.0
6,Bryan,Adams,65.0,M,55.0
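
Note that the table above still shows a missing Age for Tina: fillna returns a new Series and does not modify df. A minimal sketch of making the change stick, using a standalone frame with the same Age values (df_demo, filled, missing_before and missing_after are illustrative names), could look like this:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'Age': [42, 52, np.nan, 24, 73, 15, 65]})

# fillna returns a new Series; df_demo itself is not changed
filled = df_demo['Age'].fillna(df_demo['Age'].mean())
missing_before = int(df_demo['Age'].isna().sum())

# Assign the result back to the column to keep the imputed values
df_demo['Age'] = filled
missing_after = int(df_demo['Age'].isna().sum())

print(missing_before, missing_after)
```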


<a id='filter'></a>
# Filter DataFrame Rows

In [25]:
df_male, df_female = df[df['Gender'] == 'M'], df[df['Gender'] == 'F']
df_female 

Unnamed: 0,First Name,Last Name,Age,Gender,Score
1,Molly,Jacobson,52.0,F,85.0
2,Tina,Johnson,,F,57.0
4,Amy,Cooze,73.0,F,68.166667
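
The same boolean-mask filtering extends to several conditions. In the sketch below (a small made-up frame; df_demo, female_high, young_or_top and not_male are illustrative names), conditions are combined with & (and), | (or) and ~ (not). Each condition needs its own parentheses because these operators bind more tightly than comparisons.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'Gender': ['M', 'F', 'F', 'M'],
                        'Age': [42.0, 52.0, np.nan, 24.0],
                        'Score': [90.0, 85.0, 57.0, 62.0]})

# AND: female rows with a score above 60
female_high = df_demo[(df_demo['Gender'] == 'F') & (df_demo['Score'] > 60)]

# OR: younger than 30, or a score of at least 90 (a missing Age compares as False)
young_or_top = df_demo[(df_demo['Age'] < 30) | (df_demo['Score'] >= 90)]

# NOT: every row that is not male
not_male = df_demo[~(df_demo['Gender'] == 'M')]

print(female_high.index.tolist())
print(young_or_top.index.tolist())
print(not_male.index.tolist())
```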


<a id='groupby'></a>
# Groupby aggregations

In [26]:
df.groupby(['Gender'])['Score'].mean()

Gender
F    70.055556
M    66.750000
Name: Score, dtype: float64

This returns a Series indexed by Gender, not a DataFrame. To turn it into a DataFrame with Gender as a regular column, call reset_index().

In [27]:
df_gb_gender = df.groupby(['Gender'])['Score'].mean().reset_index()
df_gb_gender

Unnamed: 0,Gender,Score
0,F,70.055556
1,M,66.75
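
As a sketch of two related options on a small made-up frame (df_demo, mean_by_gender, as_frame and summary are illustrative names): passing as_index=False to groupby gives a DataFrame directly, and agg with a list of function names computes several statistics at once.

```python
import pandas as pd

df_demo = pd.DataFrame({'Gender': ['M', 'F', 'F', 'M'],
                        'Score': [90.0, 85.0, 57.0, 62.0]})

# Selecting one column and aggregating gives a Series indexed by Gender
mean_by_gender = df_demo.groupby('Gender')['Score'].mean()

# as_index=False keeps Gender as a regular column, so the result is a DataFrame
as_frame = df_demo.groupby('Gender', as_index=False)['Score'].mean()

# Several aggregations at once, then reset_index to get a flat DataFrame
summary = df_demo.groupby('Gender')['Score'].agg(['mean', 'count']).reset_index()

print(type(mean_by_gender).__name__)
print(as_frame)
print(summary)
```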


<a id='sort'></a>
# Sort DataFrame by columns

To sort by multiple columns, pass the column names in a list. Here the rows are sorted by Gender in ascending order, then by Age in descending order within each gender. Rows with a missing Age are placed last within their group.

In [28]:
df.sort_values(['Gender', 'Age'], ascending=[True, False])

Unnamed: 0,First Name,Last Name,Age,Gender,Score
4,Amy,Cooze,73.0,F,68.166667
1,Molly,Jacobson,52.0,F,85.0
2,Tina,Johnson,,F,57.0
6,Bryan,Adams,65.0,M,55.0
0,Jason,Miller,42.0,M,90.0
3,Jake,Milner,24.0,M,62.0
5,Thomas,Edison,15.0,M,60.0
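
In the output above, Tina's missing Age ends up last among the F rows even though Age is sorted in descending order. The sketch below, on a small made-up frame (df_demo, default_sort and nan_first are illustrative names), shows the na_position parameter of sort_values, which controls where missing values go. It also shows that sort_values returns a new DataFrame and leaves the original row order alone.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'Gender': ['M', 'F', 'F', 'M'],
                        'Age': [42.0, 52.0, np.nan, 24.0]})

# Default: missing values go last, whatever the sort direction
default_sort = df_demo.sort_values(['Gender', 'Age'], ascending=[True, False])

# na_position='first' puts missing values first instead
nan_first = df_demo.sort_values(['Gender', 'Age'], ascending=[True, False],
                                na_position='first')

print(default_sort.index.tolist())
print(nan_first.index.tolist())
print(df_demo.index.tolist())  # the original frame is not reordered
```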


<a id='loc'></a>
# Row/Col extraction using df.loc

df.loc[row, col] is label based.

In [29]:
df.head()

Unnamed: 0,First Name,Last Name,Age,Gender,Score
0,Jason,Miller,42.0,M,90.0
1,Molly,Jacobson,52.0,F,85.0
2,Tina,Johnson,,F,57.0
3,Jake,Milner,24.0,M,62.0
4,Amy,Cooze,73.0,F,68.166667


df.loc[:, ['Age', 'Score']] extracts all rows (:) of the columns Age and Score (['Age', 'Score'])

In [30]:
df.loc[:, ['Age', 'Score']]

Unnamed: 0,Age,Score
0,42.0,90.0
1,52.0,85.0
2,,57.0
3,24.0,62.0
4,73.0,68.166667
5,15.0,60.0
6,65.0,55.0


df.loc[2:4] or df.loc[2:4,:] extracts the rows labelled 2 to 4, with both endpoints included

In [31]:
df.loc[2:4]

Unnamed: 0,First Name,Last Name,Age,Gender,Score
2,Tina,Johnson,,F,57.0
3,Jake,Milner,24.0,M,62.0
4,Amy,Cooze,73.0,F,68.166667


df.loc[df['Age'] > 50] extracts the rows where Age is greater than 50. More precisely, it keeps the rows where the boolean mask is True; a missing Age compares as False, so that row is left out

In [32]:
df['Age'] > 50

0    False
1     True
2    False
3    False
4     True
5    False
6     True
Name: Age, dtype: bool

In [33]:
df.loc[df['Age'] > 50]

Unnamed: 0,First Name,Last Name,Age,Gender,Score
1,Molly,Jacobson,52.0,F,85.0
4,Amy,Cooze,73.0,F,68.166667
6,Bryan,Adams,65.0,M,55.0


<a id='iloc'></a>
# Row/Col extraction using df.iloc

df.iloc[row, col] is integer-position based.

df.iloc[:, [2,4]] extracts all rows (:) of the columns at positions 2 and 4, which are Age and Score

In [34]:
df.iloc[:, [2,4]]

Unnamed: 0,Age,Score
0,42.0,90.0
1,52.0,85.0
2,,57.0
3,24.0,62.0
4,73.0,68.166667
5,15.0,60.0
6,65.0,55.0


df.iloc[2:4] or df.iloc[2:4,:] extracts the rows at positions 2 and 3; the end position 4 is excluded

In [35]:
df.iloc[2:4]

Unnamed: 0,First Name,Last Name,Age,Gender,Score
2,Tina,Johnson,,F,57.0
3,Jake,Milner,24.0,M,62.0
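
With the default 0, 1, 2, ... index, labels and positions coincide, which can hide the difference between df.loc and df.iloc. This sketch uses a made-up frame with a non-default index (df_demo, by_label and by_position are illustrative names) to make the difference visible.

```python
import pandas as pd

# Index labels deliberately differ from the row positions
df_demo = pd.DataFrame({'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy']},
                       index=[10, 20, 30, 40, 50])

# loc slices by label and includes both endpoints: labels 20, 30, 40
by_label = df_demo.loc[20:40]

# iloc slices by position and excludes the end: positions 1 and 2
by_position = df_demo.iloc[1:3]

print(by_label['name'].tolist())
print(by_position['name'].tolist())
```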


<a id='apply'></a>
# Calculations or Transforming data in DataFrame using apply

In [36]:
df

Unnamed: 0,First Name,Last Name,Age,Gender,Score
0,Jason,Miller,42.0,M,90.0
1,Molly,Jacobson,52.0,F,85.0
2,Tina,Johnson,,F,57.0
3,Jake,Milner,24.0,M,62.0
4,Amy,Cooze,73.0,F,68.166667
5,Thomas,Edison,15.0,M,60.0
6,Bryan,Adams,65.0,M,55.0


apply passes each value of a Series (or each row or column of a DataFrame) to a function. In this case, each value in the Score column is passed to a function that halves it

In [37]:
half_val = lambda x : x / 2
df['Altered Score'] = df['Score'].apply(half_val)

#You may write it as a single line as shown below

# df['Altered Score'] = df['Score'].apply(lambda x : x / 2)

In [38]:
df

Unnamed: 0,First Name,Last Name,Age,Gender,Score,Altered Score
0,Jason,Miller,42.0,M,90.0,45.0
1,Molly,Jacobson,52.0,F,85.0,42.5
2,Tina,Johnson,,F,57.0,28.5
3,Jake,Milner,24.0,M,62.0,31.0
4,Amy,Cooze,73.0,F,68.166667,34.083333
5,Thomas,Edison,15.0,M,60.0,30.0
6,Bryan,Adams,65.0,M,55.0,27.5


By using axis=1, each row is passed into a function to calculate the person's full name.

In [39]:
fullname = lambda row: row['First Name'] + ' ' + row['Last Name']
df['Full Name'] = df.apply(fullname, axis=1)

#You may write it as a single line as shown below

# df['Full Name'] = df.apply(lambda row: row['First Name'] + ' ' + row['Last Name'], axis=1)

In [40]:
df

Unnamed: 0,First Name,Last Name,Age,Gender,Score,Altered Score,Full Name
0,Jason,Miller,42.0,M,90.0,45.0,Jason Miller
1,Molly,Jacobson,52.0,F,85.0,42.5,Molly Jacobson
2,Tina,Johnson,,F,57.0,28.5,Tina Johnson
3,Jake,Milner,24.0,M,62.0,31.0,Jake Milner
4,Amy,Cooze,73.0,F,68.166667,34.083333,Amy Cooze
5,Thomas,Edison,15.0,M,60.0,30.0,Thomas Edison
6,Bryan,Adams,65.0,M,55.0,27.5,Bryan Adams
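
For simple arithmetic and string concatenation like the two examples above, pandas column operations can produce the same result without apply. The sketch below (a small made-up frame; df_demo, via_apply and via_apply_names are illustrative names) shows the column-wise versions and checks them against apply. It is offered as an alternative, not as a replacement for apply in cases that need custom per-row logic.

```python
import pandas as pd

df_demo = pd.DataFrame({'First Name': ['Jason', 'Molly'],
                        'Last Name': ['Miller', 'Jacobson'],
                        'Score': [90.0, 85.0]})

# Column arithmetic works on the whole column at once
df_demo['Altered Score'] = df_demo['Score'] / 2

# String columns can be concatenated with +
df_demo['Full Name'] = df_demo['First Name'] + ' ' + df_demo['Last Name']

# Same results as the apply-based versions
via_apply = df_demo['Score'].apply(lambda x: x / 2)
via_apply_names = df_demo.apply(
    lambda row: row['First Name'] + ' ' + row['Last Name'], axis=1)

print(df_demo['Altered Score'].equals(via_apply))
print(df_demo['Full Name'].tolist() == via_apply_names.tolist())
```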
