# Introduction to the Pandas Library

<strong>Pandas</strong> is a popular Python library for data analysis. It provides two main data structures for manipulating data: <strong>DataFrame</strong> and <strong>Series</strong>.

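In order to use the Pandas library, we need to install it first. The <code>xlrd</code> and <code>openpyxl</code> packages are optional dependencies that Pandas uses to read Excel files.

In [None]:
!pip install pandas
!pip install xlrd
!pip install openpyxl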


In order to use the built-in functions in the Pandas library, we need to import Pandas (assuming that it is installed).

In [1]:
import pandas as pd

<hr>

# DataFrame

A <strong>DataFrame</strong> is a two-dimensional data structure consisting of rows and columns. It can be created by loading data from existing storage such as SQL databases, CSV files, and Excel files; it can also be created from lists and dictionaries.

## Creating DataFrames

We can create a DataFrame from a dictionary using the <code>pd.DataFrame()</code> constructor in Pandas. The keys in the dictionary become the column labels, and each value is a list holding that column's data, one entry per row.

In [2]:
# Define a dictionary with employee info
employeeInfo = {
    'Name': ['Rose', 'John', 'Jane', 'Mary', 'Michael'],
    'ID': [1, 2, 3, 4, 5],
    'Department': ['Human Resources', 'Information Technology', 'Marketing', 'Accounting', 'Information Technology'],
    'Salary': [39000, 56000, 42000, 53000, 48000],
}

# Convert the dictionary to a DataFrame
employeeData = pd.DataFrame(employeeInfo)

# Display the DataFrame
employeeData

,Name,ID,Department,Salary
0,Rose,1,Human Resources,39000
1,John,2,Information Technology,56000
2,Jane,3,Marketing,42000
3,Mary,4,Accounting,53000
4,Michael,5,Information Technology,48000
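
Since a DataFrame can also be built from lists, here is a minimal sketch of two alternatives to the dictionary-of-lists form: a list of dictionaries (one per row) and a list of lists with explicit column labels. The variable names and the shortened sample data are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Hypothetical records: each dictionary describes one row
records = [
    {'Name': 'Rose', 'ID': 1, 'Salary': 39000},
    {'Name': 'John', 'ID': 2, 'Salary': 56000},
]
from_records = pd.DataFrame(records)

# The same data as a list of lists, with column labels supplied explicitly
rows = [['Rose', 1, 39000], ['John', 2, 56000]]
from_rows = pd.DataFrame(rows, columns=['Name', 'ID', 'Salary'])

# Both constructions should describe the same table
print(from_records)
print(from_rows)
```

The dictionary-of-lists form used above is organized by column, while these two forms are organized by row; which one is more convenient depends on how the data arrives.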


We can also use the built-in functions in Pandas to read data directly from databases or files.

In [11]:
csv_path = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0101EN-SkillsNetwork/labs/Module%204/data/TopSellingAlbums.csv'
df = pd.read_csv(csv_path)

We can use the <code>.head()</code> method to display the first 5 rows of data.

In [12]:
df.head()

,Artist,Album,Released,Length,Genre,Music Recording Sales (millions),Claimed Sales (millions),Released.1,Soundtrack,Rating
0,Michael Jackson,Thriller,1982,0:42:19,"pop, rock, R&B",46.0,65,30-Nov-82,,10.0
1,AC/DC,Back in Black,1980,0:42:11,hard rock,26.1,50,25-Jul-80,,9.5
2,Pink Floyd,The Dark Side of the Moon,1973,0:42:49,progressive rock,24.2,45,01-Mar-73,,9.0
3,Whitney Houston,The Bodyguard,1992,0:57:44,"R&B, soul, pop",27.4,44,17-Nov-92,Y,8.5
4,Meat Loaf,Bat Out of Hell,1977,0:46:33,"hard rock, progressive rock",20.6,43,21-Oct-77,,8.0
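
To try <code>pd.read_csv()</code> and <code>.head()</code> without network access, the sketch below reads a small CSV from an in-memory string instead of a URL. The <code>csv_text</code> string is a hypothetical stand-in that reuses three rows and three columns from the table above.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV text standing in for a file path or URL
csv_text = """Artist,Album,Released
Michael Jackson,Thriller,1982
AC/DC,Back in Black,1980
Pink Floyd,The Dark Side of the Moon,1973
"""

# read_csv accepts a file-like object as well as a path or URL
albums = pd.read_csv(io.StringIO(csv_text))

# head(n) returns the first n rows; with no argument it returns up to 5
first_two = albums.head(2)
print(first_two)
```

Passing an explicit row count to <code>.head()</code> is handy when the default of 5 rows is more or less than we want to inspect.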


## Selecting Data

We can select column(s) using column names.

In [3]:
employeeList = employeeData[['ID', 'Name', 'Department']]
employeeList

,ID,Name,Department
0,1,Rose,Human Resources
1,2,John,Information Technology
2,3,Jane,Marketing
3,4,Mary,Accounting
4,5,Michael,Information Technology
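
One detail worth checking when selecting by column name is the type of the result. The sketch below, using a shortened hypothetical copy of the employee data, contrasts a single label with a list of labels.

```python
import pandas as pd

# Shortened, illustrative copy of the employee data
employees = pd.DataFrame({
    'Name': ['Rose', 'John', 'Jane'],
    'ID': [1, 2, 3],
    'Department': ['Human Resources', 'Information Technology', 'Marketing'],
})

names_series = employees['Name']    # a single label returns a Series
names_frame = employees[['Name']]   # a list of labels returns a DataFrame

print(type(names_series).__name__)
print(type(names_frame).__name__)
```

The double brackets in <code>employeeData[['ID', 'Name', 'Department']]</code> are therefore an outer indexing operator wrapped around an inner Python list of column names.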


We can also select data using the <code>.loc[]</code> and <code>.iloc[]</code> indexers. Both take square brackets rather than parentheses.

The <code>.loc[row_label, column_label]</code> indexer is <strong>label-based</strong>, which means we must pass the row and column labels to select data. Both the start bound and the stop bound are inclusive for the selected range.

The <code>.iloc[row_index, column_index]</code> indexer is <strong>index-based</strong>, which means we must pass the integer positions of the row and column (counting from 0) to select data. The start bound is inclusive, but the stop bound is exclusive for the selected range.

In [4]:
# Access the data in the first row and the "Department" column
employeeData.loc[0, 'Department']

'Human Resources'

In [5]:
# Access the data in the first row and third column
employeeData.iloc[0,2]

'Human Resources'
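
In the examples above the row labels happen to equal the row positions, so the two indexers look interchangeable. The sketch below uses a hypothetical DataFrame with row labels 10, 20, and 30 to make the difference visible.

```python
import pandas as pd

# Hypothetical DataFrame whose row labels do not match row positions
staff = pd.DataFrame(
    {'Name': ['Rose', 'John', 'Jane'], 'Salary': [39000, 56000, 42000]},
    index=[10, 20, 30],
)

by_label = staff.loc[20, 'Name']   # the row labeled 20, column labeled 'Name'
by_position = staff.iloc[1, 0]     # the second row, first column

print(by_label, by_position)

# There is no row labeled 0 here, so staff.loc[0, 'Name'] would raise a KeyError
print(0 in staff.index)
```

Both lookups reach the same cell, but only because label 20 sits at position 1.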

We can set a column as the index using the <code>.set_index()</code> method. Then we can access rows using the values of that column as labels.

In [6]:
# set_index() returns a new DataFrame; employeeData itself is unchanged
employeeData1 = employeeData.set_index('Name')
employeeData1.head()

Name,ID,Department,Salary
Rose,1,Human Resources,39000
John,2,Information Technology,56000
Jane,3,Marketing,42000
Mary,4,Accounting,53000
Michael,5,Information Technology,48000


In [7]:
employeeData1.loc['Jane', 'Salary']

42000
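
As a small check of how <code>.set_index()</code> behaves, the sketch below (with shortened, illustrative data) confirms that it returns a new DataFrame by default and that <code>.reset_index()</code> moves the index back into an ordinary column.

```python
import pandas as pd

# Shortened, illustrative copy of the employee data
employees = pd.DataFrame({
    'Name': ['Rose', 'John', 'Jane'],
    'Salary': [39000, 56000, 42000],
})

by_name = employees.set_index('Name')

# The original keeps its 'Name' column; the new frame uses it as the index
print(list(employees.columns))
print(list(by_name.columns))

# reset_index() turns the index back into a regular column
restored = by_name.reset_index()
print(list(restored.columns))
```

Because the original DataFrame is left untouched, the result of <code>.set_index()</code> has to be assigned to a variable to be used later.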

We can pass slices of the form <code>[row_start:row_stop, column_start:column_stop]</code> to <code>.iloc</code> or <code>.loc</code> to select a rectangular region of the DataFrame.

In [8]:
employeeData.iloc[0:2, 0:3]

,Name,ID,Department
0,Rose,1,Human Resources
1,John,2,Information Technology


In [9]:
employeeData.loc[0:1, 'Name':'Department']

,Name,ID,Department
0,Rose,1,Human Resources
1,John,2,Information Technology


In [10]:
employeeData1.loc['Rose':'John', 'ID':'Department']

Name,ID,Department
Rose,1,Human Resources
John,2,Information Technology
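
The slices above use different stop values (<code>0:2</code> for <code>.iloc</code>, <code>0:1</code> for <code>.loc</code>) yet return the same rows. The sketch below, on a shortened illustrative DataFrame, applies the same slice <code>0:2</code> to both indexers to show the inclusive and exclusive stop bounds side by side.

```python
import pandas as pd

# Shortened, illustrative copy of the employee data with the default 0..3 index
employees = pd.DataFrame({
    'Name': ['Rose', 'John', 'Jane', 'Mary'],
    'ID': [1, 2, 3, 4],
})

label_slice = employees.loc[0:2]       # labels 0, 1, and 2 (stop is inclusive)
position_slice = employees.iloc[0:2]   # positions 0 and 1 (stop is exclusive)

print(len(label_slice), len(position_slice))
```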


<hr>

# Series

A <strong>Series</strong> is a one-dimensional array of indexed data. It consists of an array of values and an associated array of index labels.
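
A minimal sketch of these two parts, assuming a hypothetical Series of salaries labeled by employee name:

```python
import pandas as pd

# Hypothetical salaries labeled by employee name
salaries = pd.Series(
    [39000, 56000, 42000],
    index=['Rose', 'John', 'Jane'],
    name='Salary',
)

print(salaries.values)    # the underlying array of data
print(salaries.index)     # the associated index labels
print(salaries['John'])   # label-based lookup
print(salaries.iloc[0])   # position-based lookup
```

Each column of a DataFrame is itself a Series, so the label-based and position-based lookups shown here mirror <code>.loc</code> and <code>.iloc</code> on a DataFrame.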