# Introduction

This article gives a quick overview of how to set up and get started with the `pandas` data analysis library. It also collects common code snippets for loading, inspecting, and transforming data.

# Installing and Importing

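First, we need to install Python and the pip package manager. The notebook runs in Google Colab, so it starts by mounting Google Drive and switching to the working directory.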

In [0]:
from google.colab import drive
drive.mount('/content/drive')

In [1]:
# Setting path
import os

print("Current directory is :" , os.getcwd())
path = "/content/drive/My Drive/Introduction to Data Science - Python edition/Module 1"
os.chdir(path)

print("Current directory is :" , os.getcwd())

Current directory is : /content


FileNotFoundError: ignored

In [0]:
# installing and importing pandas
#!pip install pandas
#!conda install pandas
import pandas as pd

print("Version of Pandas is :", pd.__version__)

Version of Pandas is : 0.25.3


# Creating Data Frames

Data frames are the central concept in pandas. In essence, a data frame is a table with labeled rows and columns. Data frames can be created from multiple sources, e.g. CSV files, Excel files, and JSON.

## Hardcoded Dataframes

Hardcoded data frames can be constructed from a dictionary that maps column names to their values.

In [0]:
import numpy as np

df = pd.DataFrame({
   'col1': ['Item0', 'Item0', 'Item1', 'Item1'],
   'col2': ['Gold', 'Bronze', 'Gold', 'Silver'],
   'col3': [1, 2, np.nan, 4]
})

df.head()

| | col1 | col2 | col3 |
|---|---|---|---|
| 0 | Item0 | Gold | 1.0 |
| 1 | Item0 | Bronze | 2.0 |
| 2 | Item1 | Gold | NaN |
| 3 | Item1 | Silver | 4.0 |
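
The same frame can also be built row by row. The following is a minimal sketch, assuming a list of dictionaries (one per row) fits the data better than a dictionary of columns; the names `records` and `df_rows` are only illustrative.

```python
import numpy as np
import pandas as pd

# One dictionary per row; the keys become the column names
records = [
    {'col1': 'Item0', 'col2': 'Gold', 'col3': 1},
    {'col1': 'Item0', 'col2': 'Bronze', 'col3': 2},
    {'col1': 'Item1', 'col2': 'Gold', 'col3': np.nan},
    {'col1': 'Item1', 'col2': 'Silver', 'col3': 4},
]
df_rows = pd.DataFrame(records)

# The missing value forces col3 to a floating-point dtype
print(df_rows.dtypes)
```

Both constructions should give a frame with the same shape and column names; which one reads better depends on how the source data is organised.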


## Loading CSV files

Loading a CSV file as a data frame takes a single call to `pd.read_csv`.

In [0]:
df = pd.read_csv("./iris.csv")
df.head()

| | sepal.length | sepal.width | petal.length | petal.width | variety |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Setosa |
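
Real CSV files often need a few extra options. The sketch below is an illustration rather than part of the original notebook: it reads from an in-memory string (so it runs without `iris.csv`), and the semicolon delimiter and the name `csv_text` are assumptions made for the example.

```python
import io
import pandas as pd

# In-memory stand-in for a CSV file, using a few rows from the iris data
csv_text = """sepal.length;sepal.width;variety
5.1;3.5;Setosa
4.9;3.0;Setosa
6.3;2.5;Virginica
"""

sample = pd.read_csv(
    io.StringIO(csv_text),
    sep=';',                               # field delimiter
    usecols=['sepal.length', 'variety'],   # load only these columns
    dtype={'variety': 'category'},         # store the labels as a categorical
)
print(sample.dtypes)
```

Passing a file path instead of the `StringIO` object works the same way.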


# Previewing Data

To preview the data and the metadata of a dataframe you can use the following methods and attributes:

## First and last rows

In [0]:
# Displays the first 5 rows. The optional int parameter sets the number of rows to show
df.head(5)

| | sepal.length | sepal.width | petal.length | petal.width | variety |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Setosa |


In [0]:
# Similar to head, but displays the last rows
df.tail()

| | sepal.length | sepal.width | petal.length | petal.width | variety |
|---|---|---|---|---|---|
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | Virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | Virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | Virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | Virginica |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | Virginica |


## Shape of data

In [0]:
# The dimensions of the dataframe as a (rows, cols) tuple
df.shape

(150, 5)

## Length of data

In [0]:
# The number of rows. Equal to df.shape[0]
len(df)

150

## Columns of data

In [0]:
# An Index object holding the column names
df.columns

Index(['sepal.length', 'sepal.width', 'petal.length', 'petal.width',
       'variety'],
      dtype='object')

In [0]:
var_name = df.columns.tolist()
var_name

['sepal.length', 'sepal.width', 'petal.length', 'petal.width', 'variety']

## Variable types

In [0]:
# Columns and their types
df.dtypes

sepal.length    float64
sepal.width     float64
petal.length    float64
petal.width     float64
variety          object
dtype: object

## Quick glance of data

In [0]:
# Information of data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
sepal.length    150 non-null float64
sepal.width     150 non-null float64
petal.length    150 non-null float64
petal.width     150 non-null float64
variety         150 non-null object
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


## Missing value summary

In [0]:
# Number of missing values in each column
df.isnull().sum()

sepal.length    0
sepal.width     0
petal.length    0
petal.width     0
variety         0
dtype: int64

In [0]:
# Total number of missing values in the whole frame
df.isnull().sum().sum()

0
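
The iris data has no missing values, so both counts are zero. As a hedged illustration, the sketch below applies the same calls to the hardcoded frame from earlier, which has one `NaN`; the variable names and the percentage calculation are additions for this example.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'col1': ['Item0', 'Item0', 'Item1', 'Item1'],
    'col2': ['Gold', 'Bronze', 'Gold', 'Silver'],
    'col3': [1, 2, np.nan, 4],
})

missing_per_column = demo.isnull().sum()        # Series: one count per column
missing_total = int(missing_per_column.sum())   # scalar: count over the whole frame
missing_share = demo.isnull().mean() * 100      # percentage of missing cells per column

print(missing_per_column)
print(missing_total)
print(missing_share)
```

`isnull()` returns a frame of booleans, so `sum()` counts the `True` cells and `mean()` gives their share.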

## Converting the frame to a NumPy array

In [0]:
# Returns the frame's data as a two-dimensional NumPy array
df.values

array([[5.1, 3.5, 1.4, 0.2, 'Setosa'],
       [4.9, 3.0, 1.4, 0.2, 'Setosa'],
       [4.7, 3.2, 1.3, 0.2, 'Setosa'],
       [4.6, 3.1, 1.5, 0.2, 'Setosa'],
       [5.0, 3.6, 1.4, 0.2, 'Setosa'],
       [5.4, 3.9, 1.7, 0.4, 'Setosa'],
       [4.6, 3.4, 1.4, 0.3, 'Setosa'],
       [5.0, 3.4, 1.5, 0.2, 'Setosa'],
       [4.4, 2.9, 1.4, 0.2, 'Setosa'],
       [4.9, 3.1, 1.5, 0.1, 'Setosa'],
       [5.4, 3.7, 1.5, 0.2, 'Setosa'],
       [4.8, 3.4, 1.6, 0.2, 'Setosa'],
       [4.8, 3.0, 1.4, 0.1, 'Setosa'],
       [4.3, 3.0, 1.1, 0.1, 'Setosa'],
       [5.8, 4.0, 1.2, 0.2, 'Setosa'],
       [5.7, 4.4, 1.5, 0.4, 'Setosa'],
       [5.4, 3.9, 1.3, 0.4, 'Setosa'],
       [5.1, 3.5, 1.4, 0.3, 'Setosa'],
       [5.7, 3.8, 1.7, 0.3, 'Setosa'],
       [5.1, 3.8, 1.5, 0.3, 'Setosa'],
       [5.4, 3.4, 1.7, 0.2, 'Setosa'],
       [5.1, 3.7, 1.5, 0.4, 'Setosa'],
       [4.6, 3.6, 1.0, 0.2, 'Setosa'],
       [5.1, 3.3, 1.7, 0.5, 'Setosa'],
       [4.8, 3.4, 1.9, 0.2, 'Setosa'],
       [5.0, 3.0, 1.6, 0.
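
Newer pandas versions (0.24 and later) also offer `DataFrame.to_numpy()` for the same conversion. A minimal sketch on a small made-up frame, showing how the column types determine the dtype of the resulting array:

```python
import pandas as pd

demo = pd.DataFrame({'length': [5.1, 4.9], 'variety': ['Setosa', 'Setosa']})

mixed = demo.to_numpy()                 # mixed column types -> dtype object
numeric = demo[['length']].to_numpy()   # numeric columns only -> float64

print(mixed.dtype, mixed.shape)
print(numeric.dtype, numeric.shape)
```

Selecting only the numeric columns before converting avoids the slower `object` array.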

## Descriptive Statistics

In [0]:
# Displays descriptive stats for the numeric columns (transposed: one row per column)
df.describe().transpose()

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| sepal.length | 150.0 | 5.843333 | 0.828066 | 4.3 | 5.1 | 5.8 | 6.4 | 7.9 |
| sepal.width | 150.0 | 3.057333 | 0.435866 | 2.0 | 2.8 | 3.0 | 3.3 | 4.4 |
| petal.length | 150.0 | 3.758 | 1.765298 | 1.0 | 1.6 | 4.35 | 5.1 | 6.9 |
| petal.width | 150.0 | 1.199333 | 0.762238 | 0.1 | 0.3 | 1.3 | 1.8 | 2.5 |
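
By default `describe()` skips non-numeric columns such as `variety`. The sketch below, on a small made-up frame, shows one way to include them with `include='all'`; the values are illustrative only.

```python
import pandas as pd

demo = pd.DataFrame({
    'petal_length': [1.4, 4.7, 6.0, 5.1],
    'variety': ['Setosa', 'Versicolor', 'Virginica', 'Virginica'],
})

numeric_stats = demo.describe()            # numeric columns only
all_stats = demo.describe(include='all')   # adds unique, top and freq for text columns

print(numeric_stats)
print(all_stats)
```

Statistics that do not apply to a column (for example `mean` for text) appear as `NaN` in the combined table.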


In [0]:
# Absolute frequency of each category
df['variety'].value_counts(normalize = False)

Versicolor    50
Setosa        50
Virginica     50
Name: variety, dtype: int64

In [0]:
# Relative frequency of each category, in percent
df['variety'].value_counts(normalize = True)*100

Versicolor    33.333333
Setosa        33.333333
Virginica     33.333333
Name: variety, dtype: float64

In [0]:
# Number of unique values
len(df['variety'].unique())

3
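
`Series.nunique()` is a shortcut for the same count, with one difference worth knowing: it ignores missing values by default, while `unique()` keeps them. A small sketch on a made-up series:

```python
import pandas as pd

variety = pd.Series(['Setosa', 'Versicolor', 'Virginica', 'Setosa', None])

n_with_len = len(variety.unique())          # unique() keeps the missing value
n_default = variety.nunique()               # nunique() skips missing values by default
n_with_na = variety.nunique(dropna=False)   # count the missing value as well

print(n_with_len, n_default, n_with_na)
```

On a column without missing values, such as `variety` in the iris data, all three give the same number.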

# Data Manipulation

## Sorting

The `sort_index` method sorts the frame by the labels along one axis: `axis=0` sorts the rows by the row index, `axis=1` sorts the columns by their names. To sort by the values of one or more columns, use `sort_values`.

In [0]:
# Sort rows in descending order of the index
df = df.sort_index(axis=0, ascending=False)
df.head()

| | sepal.length | sepal.width | petal.length | petal.width | variety |
|---|---|---|---|---|---|
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | Virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | Virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | Virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | Virginica |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | Virginica |


In [0]:
# Sort by column values; the second column breaks ties in the first
df = df.sort_values(by=['sepal.length', 'sepal.width'], ascending=False)
df.head()

| | sepal.length | sepal.width | petal.length | petal.width | variety |
|---|---|---|---|---|---|
| 131 | 7.9 | 3.8 | 6.4 | 2.0 | Virginica |
| 117 | 7.7 | 3.8 | 6.7 | 2.2 | Virginica |
| 135 | 7.7 | 3.0 | 6.1 | 2.3 | Virginica |
| 122 | 7.7 | 2.8 | 6.7 | 2.0 | Virginica |
| 118 | 7.7 | 2.6 | 6.9 | 2.3 | Virginica |
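
The sort direction can also differ per column. A minimal sketch, assuming we want the longest sepals first and ties broken by the narrowest sepal; the small frame reuses a few values from the outputs above.

```python
import pandas as pd

demo = pd.DataFrame({
    'sepal.length': [7.7, 7.9, 7.7, 5.1],
    'sepal.width': [3.0, 3.8, 2.6, 3.5],
})

# One ascending flag per column in `by`
ordered = demo.sort_values(
    by=['sepal.length', 'sepal.width'],
    ascending=[False, True],
)
print(ordered)

# Sorting keeps the original index labels; reset_index renumbers the rows
renumbered = ordered.reset_index(drop=True)
print(renumbered)
```

Both `sort_values` and `sort_index` return a new frame, which is why the notebook assigns the result back to `df`.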


## Renaming columns

In [0]:
# Renaming columns with rename(); returns a new frame and leaves df unchanged
new_data = df.rename(columns = {"sepal.length": "sepal_length", 
                                  "sepal.width":"sepal_width", 
                                  "petal.length": "petal_length",
                                  "petal.width": "petal_width"})

new_data.head()

| | sepal_length | sepal_width | petal_length | petal_width | variety |
|---|---|---|---|---|---|
| 131 | 7.9 | 3.8 | 6.4 | 2.0 | Virginica |
| 117 | 7.7 | 3.8 | 6.7 | 2.2 | Virginica |
| 135 | 7.7 | 3.0 | 6.1 | 2.3 | Virginica |
| 122 | 7.7 | 2.8 | 6.7 | 2.0 | Virginica |
| 118 | 7.7 | 2.6 | 6.9 | 2.3 | Virginica |


In [0]:
# Renaming all columns in place by assigning a list to the .columns attribute
df.columns = ['sepal_length','sepal_width', 'petal_length', 'petal_width', 'variety'] 

df.head()

| | sepal_length | sepal_width | petal_length | petal_width | variety |
|---|---|---|---|---|---|
| 131 | 7.9 | 3.8 | 6.4 | 2.0 | Virginica |
| 117 | 7.7 | 3.8 | 6.7 | 2.2 | Virginica |
| 135 | 7.7 | 3.0 | 6.1 | 2.3 | Virginica |
| 122 | 7.7 | 2.8 | 6.7 | 2.0 | Virginica |
| 118 | 7.7 | 2.6 | 6.9 | 2.3 | Virginica |
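
When every column follows the same pattern, typing out each name is unnecessary. The sketch below shows two possible shortcuts for replacing the dots with underscores; the empty frame `demo` exists only to carry the column names.

```python
import pandas as pd

demo = pd.DataFrame(columns=['sepal.length', 'sepal.width', 'variety'])

# Option 1: pass a function that is applied to every column name
renamed = demo.rename(columns=lambda name: name.replace('.', '_'))

# Option 2: string replacement on the column index itself
# (regex=False so that the dot is treated as a literal character)
demo.columns = demo.columns.str.replace('.', '_', regex=False)

print(list(renamed.columns))
print(list(demo.columns))
```

Option 1 returns a new frame, while option 2 changes `demo` in place, mirroring the two approaches used above.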


## Dropping columns

In [0]:
new_data = df.drop(['variety'], axis=1)
new_data.head()

| | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|
| 131 | 7.9 | 3.8 | 6.4 | 2.0 |
| 117 | 7.7 | 3.8 | 6.7 | 2.2 |
| 135 | 7.7 | 3.0 | 6.1 | 2.3 |
| 122 | 7.7 | 2.8 | 6.7 | 2.0 |
| 118 | 7.7 | 2.6 | 6.9 | 2.3 |
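
`drop` also accepts the `columns=` and `index=` keywords, which some find easier to read than `axis`. A short sketch on a made-up two-row frame:

```python
import pandas as pd

demo = pd.DataFrame({
    'sepal_length': [7.9, 7.7],
    'petal_length': [6.4, 6.7],
    'variety': ['Virginica', 'Virginica'],
})

# columns= is equivalent to passing the labels with axis=1
without_label = demo.drop(columns=['variety'])

# index= drops rows by their index label instead
without_first_row = demo.drop(index=[0])

print(without_label)
print(without_first_row)
```

In both cases `demo` itself is untouched; the result has to be assigned to a name to be kept.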


## Filtering

### Filtering by columns

Individual columns can be selected with the [] operator or directly as attributes.

In [0]:
# Selects only the column named 'sepal_length' (returns a Series)
temp = df.sepal_length
temp.head()

131    7.9
117    7.7
135    7.7
122    7.7
118    7.7
Name: sepal_length, dtype: float64

In [0]:
# Same column, but a list of names returns a one-column DataFrame instead of a Series
temp = df[['sepal_length']]
temp.head()

| | sepal_length |
|---|---|
| 131 | 7.9 |
| 117 | 7.7 |
| 135 | 7.7 |
| 122 | 7.7 |
| 118 | 7.7 |


In [0]:
# Select two columns
temp = df[['sepal_length', 'sepal_width']]
temp.head()

| | sepal_length | sepal_width |
|---|---|---|
| 131 | 7.9 | 3.8 |
| 117 | 7.7 | 3.8 |
| 135 | 7.7 | 3.0 |
| 122 | 7.7 | 2.8 |
| 118 | 7.7 | 2.6 |
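
The difference between single and double brackets is easy to miss, so here is a small sketch that makes the return types explicit. The frame is made up for the example.

```python
import pandas as pd

demo = pd.DataFrame({'sepal_length': [7.9, 7.7], 'sepal_width': [3.8, 3.8]})

as_series = demo['sepal_length']     # single label   -> Series
as_frame = demo[['sepal_length']]    # list of labels -> DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```

Attribute access (`demo.sepal_length`) only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute, so the bracket form is the safer default.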


### Filtering by rows

Rows and cells can also be selected by their integer position in the frame with `iloc`. Positions are zero-based, and the end of a slice is excluded:

In [0]:
# Selects second row
temp1 = df.iloc[1]
# Selects the rows at positions 1 and 2 (the end of the slice is excluded)
temp2 = df.iloc[1:3]
# First row, first column
temp3 = df.iloc[0,0]
# First 4 rows and first 2 columns
temp4 = df.iloc[0:4, 0:2]

print(temp1)
print(temp2)
print(temp3)
print(temp4)

sepal_length          7.7
sepal_width           3.8
petal_length          6.7
petal_width           2.2
variety         Virginica
Name: 117, dtype: object
     sepal_length  sepal_width  petal_length  petal_width    variety
117           7.7          3.8           6.7          2.2  Virginica
135           7.7          3.0           6.1          2.3  Virginica
7.9
     sepal_length  sepal_width
131           7.9          3.8
117           7.7          3.8
135           7.7          3.0
122           7.7          2.8
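
After sorting, the index labels (131, 117, ...) no longer match the row positions, which is where `iloc` and its label-based counterpart `loc` diverge. A minimal sketch, assuming a three-row frame with labels copied from the output above:

```python
import pandas as pd

# Index labels deliberately differ from positions, as they do after sorting
demo = pd.DataFrame(
    {'sepal_length': [7.9, 7.7, 7.7], 'variety': ['Virginica'] * 3},
    index=[131, 117, 135],
)

by_position = demo.iloc[1]         # second row, whatever its label
by_label = demo.loc[117]           # row whose index label is 117

pos_slice = demo.iloc[0:2]         # positions 0 and 1 (end excluded)
label_slice = demo.loc[131:117]    # labels 131 through 117 (end included)

print(by_position)
print(label_slice)
```

Here both slices return the same two rows, but for different reasons: `iloc` stops before position 2, while `loc` includes the end label.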


Most often, we need to select by a condition on the cell values. To do so, we provide a boolean array denoting which rows will be selected.

In [0]:
# Query by a single column value
temp5 = df[df.petal_length > 2] 

# Query by a single column, if it is in a list of predefined values
temp6 = df[df['variety'].isin(['Virginica', 'Versicolor'])] 

# A conjunction query using two columns
temp7 = df[(df['petal_length'] > 2) & (df['variety'] == 'Virginica')] 

# A disjunction query using two columns
temp8 = df[(df['petal_length'] > 2) | (df['variety'] == 'Virginica')]

# A query checking the textual content of the cells
temp9 = df[df.variety.str.contains('Virgini')]

print(temp5.head())
print(temp6.head())
print(temp7.head())
print(temp8.head())
print(temp9.head())

     sepal_length  sepal_width  petal_length  petal_width    variety
131           7.9          3.8           6.4          2.0  Virginica
117           7.7          3.8           6.7          2.2  Virginica
135           7.7          3.0           6.1          2.3  Virginica
122           7.7          2.8           6.7          2.0  Virginica
118           7.7          2.6           6.9          2.3  Virginica
     sepal_length  sepal_width  petal_length  petal_width    variety
131           7.9          3.8           6.4          2.0  Virginica
117           7.7          3.8           6.7          2.2  Virginica
135           7.7          3.0           6.1          2.3  Virginica
122           7.7          2.8           6.7          2.0  Virginica
118           7.7          2.6           6.9          2.3  Virginica
     sepal_length  sepal_width  petal_length  petal_width    variety
131           7.9          3.8           6.4          2.0  Virginica
117           7.7          3.8    
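
To see what the boolean array looks like, the sketch below builds the mask as a separate step on a small frame with made-up values. It also shows `DataFrame.query` as an alternative way to write the same filter.

```python
import pandas as pd

demo = pd.DataFrame({
    'petal_length': [1.4, 4.7, 6.4, 1.9],
    'variety': ['Setosa', 'Versicolor', 'Virginica', 'Virginica'],
})

# Each comparison is evaluated element-wise and yields a boolean Series
mask = (demo['petal_length'] > 2) & (demo['variety'] == 'Virginica')
print(mask.tolist())

selected = demo[mask]     # keeps the rows where the mask is True
rejected = demo[~mask]    # ~ negates the mask

# The same filter expressed as a query string
queried = demo.query("petal_length > 2 and variety == 'Virginica'")
print(queried)
```

The parentheses around each comparison are required, because `&` and `|` bind more tightly than `>` and `==` in Python.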

## Creating New columns

In [0]:
# Assigning a constant value to every row
temp10 = df.copy()

temp10['flag'] = 1
temp10.head()

| | sepal_length | sepal_width | petal_length | petal_width | variety | flag |
|---|---|---|---|---|---|---|
| 131 | 7.9 | 3.8 | 6.4 | 2.0 | Virginica | 1 |
| 117 | 7.7 | 3.8 | 6.7 | 2.2 | Virginica | 1 |
| 135 | 7.7 | 3.0 | 6.1 | 2.3 | Virginica | 1 |
| 122 | 7.7 | 2.8 | 6.7 | 2.0 | Virginica | 1 |
| 118 | 7.7 | 2.6 | 6.9 | 2.3 | Virginica | 1 |


In [0]:
# Deriving values from a condition: 1 for Virginica, 0 otherwise
temp10 = df.copy()

temp10['Target'] = np.where(df['variety'] == 'Virginica', 1,0)
temp10.head()

| | sepal_length | sepal_width | petal_length | petal_width | variety | Target |
|---|---|---|---|---|---|---|
| 131 | 7.9 | 3.8 | 6.4 | 2.0 | Virginica | 1 |
| 117 | 7.7 | 3.8 | 6.7 | 2.2 | Virginica | 1 |
| 135 | 7.7 | 3.0 | 6.1 | 2.3 | Virginica | 1 |
| 122 | 7.7 | 2.8 | 6.7 | 2.0 | Virginica | 1 |
| 118 | 7.7 | 2.6 | 6.9 | 2.3 | Virginica | 1 |


In [0]:
# Nested conditions: 1 for Virginica, 2 for Versicolor, 3 otherwise
temp10['Target'] = np.where(df['variety'] == 'Virginica', 1, 
                            np.where(df['variety'] == 'Versicolor', 2,3))
temp10.Target.value_counts()

3    50
2    50
1    50
Name: Target, dtype: int64
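
Nested `np.where` calls become hard to read beyond two or three conditions. As a hedged alternative (not what the notebook uses), `np.select` takes the conditions and their values as parallel lists, and `Series.map` handles plain value-to-value lookups. The frame and the dictionary `codes` are made up for this sketch.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'variety': ['Virginica', 'Versicolor', 'Setosa', 'Virginica']})

# np.select: the first matching condition wins, otherwise the default is used
conditions = [demo['variety'] == 'Virginica', demo['variety'] == 'Versicolor']
choices = [1, 2]
demo['Target'] = np.select(conditions, choices, default=3)

# Series.map: look each value up in a dictionary
codes = {'Virginica': 1, 'Versicolor': 2, 'Setosa': 3}
demo['Target_map'] = demo['variety'].map(codes)

print(demo)
```

Note one difference: values missing from the dictionary become `NaN` with `map`, whereas `np.select` falls back to `default`.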

## Merging two dataframes

`pd.merge` is the main tool for merging two dataframes. It supports SQL-like joins (`left`, `right`, `inner`, `outer`), which gives flexibility in how rows are matched.

In [0]:
# Creating an ID column with the values 1.0, 2.0, ..., 150.0
df['ID'] = np.linspace(start = 1, stop = 151, num = 150, endpoint = False)

temp12 = df.copy()
temp12.columns = ['sepallength','sepalwidth','petallength', 'petalwidth','Target','ID']
temp12 = pd.merge(temp12, df, on = 'ID', how = 'left')

temp12.head()

| | sepallength | sepalwidth | petallength | petalwidth | Target | ID | sepal_length | sepal_width | petal_length | petal_width | variety |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.9 | 3.8 | 6.4 | 2.0 | Virginica | 1.0 | 7.9 | 3.8 | 6.4 | 2.0 | Virginica |
| 1 | 7.7 | 3.8 | 6.7 | 2.2 | Virginica | 2.0 | 7.7 | 3.8 | 6.7 | 2.2 | Virginica |
| 2 | 7.7 | 3.0 | 6.1 | 2.3 | Virginica | 3.0 | 7.7 | 3.0 | 6.1 | 2.3 | Virginica |
| 3 | 7.7 | 2.8 | 6.7 | 2.0 | Virginica | 4.0 | 7.7 | 2.8 | 6.7 | 2.0 | Virginica |
| 4 | 7.7 | 2.6 | 6.9 | 2.3 | Virginica | 5.0 | 7.7 | 2.6 | 6.9 | 2.3 | Virginica |
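
Because both frames above contain exactly the same IDs, the choice of `how` makes no visible difference there. The sketch below uses two small made-up frames with only partly overlapping IDs to show how the join types differ; `indicator=True` adds a `_merge` column recording where each row came from.

```python
import pandas as pd

left = pd.DataFrame({'ID': [1, 2, 3], 'sepal_length': [7.9, 7.7, 7.7]})
right = pd.DataFrame({'ID': [2, 3, 4], 'variety': ['Virginica', 'Virginica', 'Setosa']})

left_join = pd.merge(left, right, on='ID', how='left')     # every row of left
inner_join = pd.merge(left, right, on='ID', how='inner')   # IDs present in both
outer_join = pd.merge(left, right, on='ID', how='outer', indicator=True)

print(left_join)
print(inner_join)
print(outer_join)
```

Rows without a match on the other side are filled with `NaN`, which is a quick way to spot keys that failed to join.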
