# Introduction to Pandas

Pandas is a high-level data manipulation package built on top of NumPy. Its two core data structures are the Series and the DataFrame.

## Series

A Series is a one-dimensional array with axis labels (an index).

In [2]:
# importing libraries and packages 
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

In [3]:
# creating a Series from a list 
x = pd.Series([10,20,30,40,50])
x

0    10
1    20
2    30
3    40
4    50
dtype: int64

We can access the different components of a Series separately:

In [6]:
# accessing the index
x.index

RangeIndex(start=0, stop=5, step=1)

In [7]:
# accessing the values 
x.values

array([10, 20, 30, 40, 50])

In [9]:
# accessing the datatype 
# a Series is backed by a single array, so it has exactly one dtype; mixed values are stored under the generic object dtype
x.dtype

dtype('int64')
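
To make the single-dtype rule concrete, here is a minimal sketch. The names `ints` and `mixed` are illustrative and not part of the original notebook, and the exact integer width can depend on the platform and NumPy version.

```python
import pandas as pd

# A list of integers is stored with a numeric dtype.
ints = pd.Series([10, 20, 30])

# Mixing integers, strings and floats falls back to the generic object dtype,
# so the Series still has exactly one dtype.
mixed = pd.Series([10, "twenty", 30.5])

print(ints.dtype)
print(mixed.dtype)
```

An object Series can hold anything, but numeric operations on it are slower and can fail on the non-numeric entries.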

In [12]:
# create a Series with an index
data = [450, 650, 870]
sales = Series(data, index=['Don', 'Mike', 'Edwin'])
sales

Don      450
Mike     650
Edwin    870
dtype: int64

In [13]:
# check the type 
type(sales)

pandas.core.series.Series

In [14]:
# the index of sales now holds the labels we passed in (strings, hence dtype='object') instead of a RangeIndex
sales.index

Index(['Don', 'Mike', 'Edwin'], dtype='object')

### Accessing Values 

In [15]:
# you can access values using the index name
sales['Don']

np.int64(450)

In [16]:
# you can still access values by position, using .iloc
# (plain sales[0] on a labelled index is deprecated and emits a FutureWarning)
sales.iloc[0]

np.int64(450)
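
The two access styles have explicit accessors: `.loc` for labels and `.iloc` for positions. The sketch below rebuilds the same `sales` Series so it runs on its own; the other variable names are illustrative.

```python
import pandas as pd

sales = pd.Series([450, 650, 870], index=["Don", "Mike", "Edwin"])

by_label = sales.loc["Mike"]           # label-based lookup
by_position = sales.iloc[1]            # position-based lookup (0-indexed)
last_two = sales.iloc[-2:]             # positional slice, end exclusive
label_slice = sales.loc["Don":"Mike"]  # label slice, end inclusive

print(by_label, by_position)
print(last_two)
print(label_slice)
```

Note that label slices include the end label, while positional slices follow the usual Python convention and exclude the end position.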

### Checking for conditions 

In [18]:
# you can filter based on conditions 
sales>500
# this returns a boolean Series with one True/False per entry

Don      False
Mike      True
Edwin     True
dtype: bool

In [19]:
# a list of booleans selects the entries where the value is True
sales[[False, True, True]]

Mike     650
Edwin    870
dtype: int64

In [20]:
# if we want to see values greater than 500, we can use those booleans
sales[sales>500]

Mike     650
Edwin    870
dtype: int64
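
Boolean masks can also be combined. A short sketch, assuming the same `sales` data as above (`mid_range` and `not_high` are illustrative names): use `&`, `|` and `~` rather than `and`, `or` and `not`, and wrap each comparison in parentheses.

```python
import pandas as pd

sales = pd.Series([450, 650, 870], index=["Don", "Mike", "Edwin"])

# Both conditions must hold: greater than 500 AND less than 800.
mid_range = sales[(sales > 500) & (sales < 800)]

# ~ negates a mask: everything that is NOT greater than 500.
not_high = sales[~(sales > 500)]

print(mid_range)
print(not_high)
```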

In [21]:
# checking the names in the index
'Don' in sales

True

In [22]:
# false example
'Sally' in sales

False

In [23]:
# what about this?
450 in sales
# 450 is a value, not an index label; `in` only checks the index, so this returns False

False
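
To test for a value rather than a label, check against `.values` or use `isin`. A minimal sketch with the same `sales` data (the result names are illustrative):

```python
import pandas as pd

sales = pd.Series([450, 650, 870], index=["Don", "Mike", "Edwin"])

has_label = "Don" in sales          # membership test against the index
has_value = 450 in sales.values     # membership test against the values
mask = sales.isin([450, 870])       # element-wise membership, returns a boolean Series

print(has_label, has_value)
print(mask)
```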

### Working with Dictionaries 

In [24]:
# converting a series to a dictionary 
sales_dict = sales.to_dict()
sales_dict

{'Don': 450, 'Mike': 650, 'Edwin': 870}

In [25]:
# converting a dict to a series 
sales_series = Series(sales_dict)
sales_series

Don      450
Mike     650
Edwin    870
dtype: int64

### Adding entries and working with NaN/null values

In [29]:
# we can create a new Series from an existing Series 
# if we specify index labels that were NOT in the original Series, those entries are filled with NaN
new_sales = Series(sales, index=['Don', 'Mike', 'Sally','Edwin','Lucy'])
new_sales

Don      450.0
Mike     650.0
Sally      NaN
Edwin    870.0
Lucy       NaN
dtype: float64
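
The dtype changed from int64 to float64 because NaN is a floating-point value. A sketch of the same operation written with `reindex`, which also accepts a `fill_value` for the new labels (the names `reindexed` and `filled` are illustrative):

```python
import pandas as pd

sales = pd.Series([450, 650, 870], index=["Don", "Mike", "Edwin"])
labels = ["Don", "Mike", "Sally", "Edwin", "Lucy"]

# Missing labels become NaN, which forces a float dtype.
reindexed = sales.reindex(labels)

# Supplying a fill value avoids NaN for the new labels.
filled = sales.reindex(labels, fill_value=0)

print(reindexed.dtype)
print(filled)
```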

In [30]:
# we can check if there are any NaN values in a Series 
# for this we use numpy! 
np.isnan(new_sales)

Don      False
Mike     False
Sally     True
Edwin    False
Lucy      True
dtype: bool

In [31]:
# pandas offers pd.isnull (an alias of pd.isna), which also handles non-numeric data
pd.isnull(new_sales)

Don      False
Mike     False
Sally     True
Edwin    False
Lucy      True
dtype: bool
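
The two checks agree on numeric data but differ elsewhere. The sketch below uses a made-up object-dtype Series (`names`) to show that `pd.isna` handles missing strings, while `np.isnan` only supports numeric input.

```python
import numpy as np
import pandas as pd

# An object-dtype Series with one missing entry (illustrative data).
names = pd.Series(["Don", None, "Edwin"], dtype=object)

# pd.isna treats None and NaN as missing, whatever the dtype.
missing = pd.isna(names)

# np.isnan is a numeric ufunc and rejects object data.
try:
    np.isnan(names)
    numpy_ok = True
except TypeError:
    numpy_ok = False

print(missing.tolist())
print(numpy_ok)
```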

### Naming components in a Series 

In [33]:
# name an index 
sales.index.name = 'Sales person'
sales

Sales person
Don      450
Mike     650
Edwin    870
dtype: int64

In [34]:
# naming a Series 
sales.name = 'Total tv sales'
sales

Sales person
Don      450
Mike     650
Edwin    870
Name: Total tv sales, dtype: int64

## DataFrames

A DataFrame is a two-dimensional, size-mutable tabular data structure whose columns can have different dtypes. It has two labeled axes: rows (the index) and columns.

### Creating a DataFrame 

In [39]:
# creating a DataFrame from a list 
data = [['Adrian', 20], ['Bethany', 23], ['Chloe', 41]]

# when we create a DataFrame, we can specify the column names (columns=) and, optionally, a data type (dtype=)

df = pd.DataFrame(data, columns=['Name', 'Age'])
df

      Name  Age
0   Adrian   20
1  Bethany   23
2    Chloe   41
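
Each column of a DataFrame is itself a Series with its own dtype. A quick sketch for inspecting the frame built above (the variable `age_column` is illustrative):

```python
import pandas as pd

data = [["Adrian", 20], ["Bethany", 23], ["Chloe", 41]]
df = pd.DataFrame(data, columns=["Name", "Age"])

print(df.shape)    # (rows, columns)
print(df.dtypes)   # one dtype per column

# Selecting a single column returns a Series.
age_column = df["Age"]
print(type(age_column))
```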


In [64]:
# creating a DataFrame from a dictionary
data = {'Name': 'Adrian', 'Age':20, 'Job':'Data Engineer'}
df_from_dict = pd.DataFrame([data])
df_from_dict

     Name  Age            Job
0  Adrian   20  Data Engineer


In [73]:
# adding custom indexes
# index= sets custom row labels; pass a list-like (list, tuple, or array) with one label per row
df = pd.DataFrame([{'Name': 'Adrian', 'Age':20, 'Job':'Data Engineer'}], index=['Person1'])
df

           Name  Age            Job
Person1  Adrian   20  Data Engineer


In [100]:
# creating a DataFrame from a list of dictionaries is very similar: 

data = [
    {'Name': 'Adrian', 'Age':20, 'Job':'Data Engineer'},
    {'Name': 'Alex', 'Age':30, 'Job':'Software Engineer'},
    {'Name': 'Harold', 'Age':27, 'Job':'Cloud Engineer'}
]

df_people = pd.DataFrame(data, index=['Person1', 'Person2', 'Person3'])
df_people

           Name  Age                Job
Person1  Adrian   20      Data Engineer
Person2    Alex   30  Software Engineer
Person3  Harold   27     Cloud Engineer


In [88]:
# creating a DataFrame from a Series: wrapping the Series in a list turns it into a single row, labelled with the Series name
df_sales = pd.DataFrame([sales])
df_sales

Sales person    Don  Mike  Edwin
Total tv sales  450   650    870
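
A Series can become either a row or a column of a DataFrame. The sketch below rebuilds the named `sales` Series and contrasts the two (`as_row` and `as_column` are illustrative names):

```python
import pandas as pd

sales = pd.Series([450, 650, 870], index=["Don", "Mike", "Edwin"],
                  name="Total tv sales")

# A list of Series: each Series becomes one row, labelled by its name.
as_row = pd.DataFrame([sales])

# to_frame(): the Series becomes one column, named after the Series.
as_column = sales.to_frame()

print(as_row.shape)
print(as_column.shape)
```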


In [101]:
# adding a Series to an existing DataFrame

new_series = Series({'Name': 'Adrian2', 'Age':20, 'Job':'Data Engineer'})
# to_frame() turns the Series into a one-column DataFrame
# .T transposes it, so the Series index becomes the columns and the Series values become a single row
# ignore_index=True discards the existing row labels and renumbers the rows from 0
df_people = pd.concat([df_people, new_series.to_frame().T], ignore_index=True)
df_people

      Name  Age                Job
0   Adrian   20      Data Engineer
1     Alex   30  Software Engineer
2   Harold   27     Cloud Engineer
3  Adrian2   20      Data Engineer
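
One side effect of appending a row this way is worth checking: the new Series mixes strings and integers, so it has object dtype, and after the concat the `Age` column is object as well. A minimal sketch with a shortened two-row frame; restoring the dtype with `astype` is one possible fix, not the only one.

```python
import pandas as pd

df_people = pd.DataFrame([
    {"Name": "Adrian", "Age": 20, "Job": "Data Engineer"},
    {"Name": "Alex", "Age": 30, "Job": "Software Engineer"},
])
new_row = pd.Series({"Name": "Adrian2", "Age": 20, "Job": "Data Engineer"})

combined = pd.concat([df_people, new_row.to_frame().T], ignore_index=True)

# The transposed row carries object dtype, which propagates to the column.
age_dtype_before = combined["Age"].dtype

# Convert the column back to integers once all values are known to be numeric.
combined["Age"] = combined["Age"].astype("int64")

print(age_dtype_before, combined["Age"].dtype)
```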


In [81]:
# shifting/changing a DataFrame's index
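
The cell above is left empty in the notebook. As a hedged sketch of what changing an index can look like, the example below uses `set_index`, `reset_index`, direct assignment to `.index`, and `reindex` on a small made-up frame. These are standard pandas calls, but the choice of operations is an assumption about what the cell was meant to cover.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Adrian", "Alex", "Harold"], "Age": [20, 30, 27]})

# Promote a column to be the index.
by_name = df.set_index("Name")
age_of_alex = by_name.loc["Alex", "Age"]

# Move the index back into a regular column.
restored = by_name.reset_index()

# Replace the row labels outright (one label per row).
relabeled = df.copy()
relabeled.index = ["Person1", "Person2", "Person3"]

# Reorder or select rows by label.
reordered = relabeled.reindex(["Person3", "Person1"])

print(by_name)
print(reordered)
```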

In [82]:
# Filling in missing values
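
This cell is also empty in the notebook. A minimal sketch of filling missing values with `fillna`, using made-up sales figures; whether a constant or the mean is the right replacement depends on the data.

```python
import numpy as np
import pandas as pd

sales = pd.Series([450, np.nan, 870], index=["Don", "Mike", "Edwin"])

# Replace every missing value with the same constant.
filled_zero = sales.fillna(0)

# Or replace it with a statistic computed from the non-missing values.
filled_mean = sales.fillna(sales.mean())

print(filled_zero)
print(filled_mean)
```

`fillna` returns a new Series and leaves the original unchanged.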

In [None]:
# what if we do not want to fill every value with the same data?
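# bfill (also called backfill): fill each gap with the next valid value
# ffill (also called pad): fill each gap with the previous valid value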
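
A short sketch of forward and backward filling on made-up data (`readings` is an illustrative name). It uses the `ffill()` and `bfill()` methods, which are the preferred spelling in recent pandas versions.

```python
import numpy as np
import pandas as pd

readings = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

# Forward fill: carry the last valid value forward.
forward = readings.ffill()

# Backward fill: pull the next valid value backward.
# The final entry stays NaN because nothing follows it.
backward = readings.bfill()

print(forward.tolist())
print(backward.tolist())
```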