<h2 id="pandas">Introduction to <code>Pandas</code></h2>



**Pandas** is a popular data analysis library built on top of the Python programming language. It provides two main data structures for manipulating data:

*   DataFrame
*   Series

A **DataFrame** is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns.

*   A Pandas DataFrame is typically created by loading a dataset from existing storage.
*   Storage can be a SQL database, a CSV file, an Excel file, etc.
*   It can also be created from lists, a dictionary, or a list of dictionaries.
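As a minimal sketch of the last bullet, the snippet below builds the same small table from a list of lists and from a list of dictionaries. The sample values and the variable names (`rows`, `records`) are made up for illustration and are not part of the albums dataset.

```python
import pandas as pd

# Hypothetical sample data, not taken from the albums dataset
rows = [["Rose", 1], ["John", 2]]

# From a list of lists: column names must be supplied explicitly
df_from_lists = pd.DataFrame(rows, columns=["Name", "ID"])

# From a list of dictionaries: keys become the column names
records = [{"Name": "Rose", "ID": 1}, {"Name": "John", "ID": 2}]
df_from_records = pd.DataFrame(records)

# Both constructions describe the same table
print(df_from_lists.equals(df_from_records))
```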

A **Series** represents a one-dimensional array of indexed data.
It has two main components:

1.  An array of actual data.
2.  An associated array of indexes or data labels.

The index is used to access individual data values. You can also get a column of a DataFrame as a **Series**. Conceptually, a Pandas Series is like a single column of a DataFrame.
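The two components of a Series can be inspected directly. The following is a small sketch: the Series `s` and the frame `df_small` are illustrative names, and the values are a few sales figures paired with album titles as labels.

```python
import pandas as pd

# A Series built from an array of data plus an array of labels
s = pd.Series(
    [46.0, 26.1, 24.2],
    index=["Thriller", "Back in Black", "The Dark Side of the Moon"],
)

print(s.values)       # the underlying array of data
print(s.index)        # the associated labels
print(s["Thriller"])  # look up a single value by its label

# Selecting one column of a DataFrame with single brackets returns a Series
df_small = pd.DataFrame({"Sales": [46.0, 26.1]})
print(type(df_small["Sales"]))
```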


In [None]:
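# Install the dependencies needed to read Excel files
# (recent pandas versions use openpyxl for .xlsx; xlrd handles legacy .xls)

!pip install xlrd openpyxl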

In [3]:
# Import required library

import pandas as pd

In [4]:
# Read data from CSV file

csv_path = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-PY0101EN-SkillsNetwork/labs/Module%204/data/TopSellingAlbums.csv'
df = pd.read_csv(csv_path)

In [5]:
# Print first five rows of the dataframe

df.head()

Unnamed: 0,Artist,Album,Released,Length,Genre,Music Recording Sales (millions),Claimed Sales (millions),Released.1,Soundtrack,Rating
0,Michael Jackson,Thriller,1982,0:42:19,"pop, rock, R&B",46.0,65,30-Nov-82,,10.0
1,AC/DC,Back in Black,1980,0:42:11,hard rock,26.1,50,25-Jul-80,,9.5
2,Pink Floyd,The Dark Side of the Moon,1973,0:42:49,progressive rock,24.2,45,01-Mar-73,,9.0
3,Whitney Houston,The Bodyguard,1992,0:57:44,"R&B, soul, pop",27.4,44,17-Nov-92,Y,8.5
4,Meat Loaf,Bat Out of Hell,1977,0:46:33,"hard rock, progressive rock",20.6,43,21-Oct-77,,8.0


In [6]:
# Read data from Excel File directly and print the first five rows

xlsx_path = 'https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/PY0101EN/Chapter%204/Datasets/TopSellingAlbums.xlsx'

df = pd.read_excel(xlsx_path)
df.head()

Unnamed: 0,Artist,Album,Released,Length,Genre,Music Recording Sales (millions),Claimed Sales (millions),Released.1,Soundtrack,Rating
0,Michael Jackson,Thriller,1982,00:42:19,"pop, rock, R&B",46.0,65,1982-11-30,,10.0
1,AC/DC,Back in Black,1980,0:42:11,hard rock,26.1,50,1980-07-25,,9.5
2,Pink Floyd,The Dark Side of the Moon,1973,0:42:49,progressive rock,24.2,45,1973-03-01,,9.0
3,Whitney Houston,The Bodyguard,1992,0:57:44,"R&B, soul, pop",27.4,44,1992-11-17,Y,8.5
4,Meat Loaf,Bat Out of Hell,1977,0:46:33,"hard rock, progressive rock",20.6,43,1977-10-21,,8.0


### Viewing and Accessing Data


In [7]:
# Access a specific column, e.g. Length (double brackets return a DataFrame)

x = df[['Length']]
x

Unnamed: 0,Length
0,00:42:19
1,0:42:11
2,0:42:49
3,0:57:44
4,0:46:33
5,0:43:08
6,01:15:54
7,0:40:01


In [8]:
# Get the column as a series

x = df['Length']
x

0    00:42:19
1     0:42:11
2     0:42:49
3     0:57:44
4     0:46:33
5     0:43:08
6    01:15:54
7     0:40:01
Name: Length, dtype: object

In [9]:
x = df[['Artist']]
type(x)

pandas.core.frame.DataFrame

In [10]:
# Access multiple columns

y = df[['Artist','Length','Genre']]
y

Unnamed: 0,Artist,Length,Genre
0,Michael Jackson,00:42:19,"pop, rock, R&B"
1,AC/DC,0:42:11,hard rock
2,Pink Floyd,0:42:49,progressive rock
3,Whitney Houston,0:57:44,"R&B, soul, pop"
4,Meat Loaf,0:46:33,"hard rock, progressive rock"
5,Eagles,0:43:08,"rock, soft rock, folk rock"
6,Bee Gees,01:15:54,disco
7,Fleetwood Mac,0:40:01,soft rock


One way to access individual elements is the <code>iloc</code> indexer.

In [11]:
# Access the value on the first row and the first column

df.iloc[0, 0]

'Michael Jackson'

In [12]:
# Access the value on the second row and the third column
df.iloc[1,2]

np.int64(1980)

In [13]:
# Access a value using the row label and the column name

df.loc[1, 'Artist']

'AC/DC'

### Slicing using both the index and the name of the column:


<code>loc[]</code> is a label-based selection method, which means we pass the names (labels) of the rows or columns we want to select. A slice passed to it includes its last element.

* Syntax: loc\[row_label, column_label]

<code>iloc[]</code> is an integer-position-based selection method, which means we pass integer positions to select specific rows and columns. A slice passed to it does not include its last element.

* Syntax: iloc\[row_index, column_index]
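The difference in slice endpoints is easier to see when the row labels are not integers. The sketch below uses a hypothetical frame `df_demo` with string row labels, so label slices and position slices cannot be confused.

```python
import pandas as pd

# Hypothetical frame with string row labels, for illustration only
df_demo = pd.DataFrame(
    {"A": [10, 20, 30, 40], "B": [1, 2, 3, 4]},
    index=["w", "x", "y", "z"],
)

# Label slice: the end label "y" is included
by_label = df_demo.loc["w":"y", "A"]

# Position slice: the end position 2 is excluded
by_position = df_demo.iloc[0:2, 0]

print(len(by_label))     # 3 rows: w, x, y
print(len(by_position))  # 2 rows: w, x
```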


In [14]:
# Slicing the dataframe

df.iloc[0:2, 0:3]

Unnamed: 0,Artist,Album,Released
0,Michael Jackson,Thriller,1982
1,AC/DC,Back in Black,1980


In [15]:
# Slicing the dataframe using name

df.loc[0:2, 'Artist':'Released']

Unnamed: 0,Artist,Album,Released
0,Michael Jackson,Thriller,1982
1,AC/DC,Back in Black,1980
2,Pink Floyd,The Dark Side of the Moon,1973


Creating a DataFrame from a dictionary:


In [16]:
# Define a dictionary 'abc' that maps column names to lists of values

abc = {'Name': ['Rose', 'John', 'Jane', 'Mary'],
       'ID': [1, 2, 3, 4],
       'Department': ['Architect Group', 'Software Group', 'Design Team', 'Infrastructure'],
       'Salary': [100000, 80000, 50000, 60000]}

# Convert the dictionary to a DataFrame
df = pd.DataFrame(abc)

# Display the resulting DataFrame
df

Unnamed: 0,Name,ID,Department,Salary
0,Rose,1,Architect Group,100000
1,John,2,Software Group,80000
2,Jane,3,Design Team,50000
3,Mary,4,Infrastructure,60000
