# Pandas


Pandas is a library built using NumPy specifically for data analysis. You'll be using Pandas heavily for data manipulation, visualisation, building machine learning models, etc. 

Pandas has two main data structures: Series and Dataframes. Data is usually stored in dataframes, so manipulating dataframes quickly is probably the most important skill for data analysis. 

<img src="pandas_1.png">

*source: https://pandas.pydata.org/pandas-docs/stable/overview.html*

In this section, you will study:
1. The pandas Series (similar to a numpy array)
    * Creating a pandas series
    * Indexing series
2. Dataframes 
    * Creating dataframes from dictionaries
    * Importing CSV data files as pandas dataframes
    * Reading and summarising dataframes
    * Sorting dataframes 

### 1. The Pandas Series 

A series is similar to a 1-D numpy array, and contains scalar values of the same type (numeric, character, datetime etc.). 
A dataframe is simply a table where each column is a pandas series.


#### Creating Pandas Series

Series are one-dimensional array-like structures, though unlike numpy arrays, they often contain non-numeric data (characters, dates, time, booleans etc.)

You can create pandas series from array-like objects using ```pd.Series()```.

In [2]:
# import pandas, pd is an alias
import pandas as pd

# Creating a numeric pandas series
first_series = pd.Series([14, 8, 7, 29, 2, 37,1,5])
print(first_series)
print(type(first_series))

0    14
1     8
2     7
3    29
4     2
5    37
6     1
7     5
dtype: int64
<class 'pandas.core.series.Series'>


Note that each element in the Series has an index, and the index starts at 0 as usual.
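
The index does not have to be the default 0, 1, 2, ... sequence. As a minimal sketch (the labels `'a'`, `'b'`, `'c'` and the name `labelled_series` are made up for illustration), you can pass your own labels through the `index` argument of `pd.Series()`:

```python
import pandas as pd

# Hypothetical labels, chosen only for illustration
labelled_series = pd.Series([14, 8, 7], index=['a', 'b', 'c'])

print(labelled_series.index.tolist())  # the labels attached to the values
print(labelled_series['b'])            # look up a value by its label
```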

In [4]:
# Character series

character_series = pd.Series(['y', 't', 'abc', 'we'])
character_series

0      y
1      t
2    abc
3     we
dtype: object

In [3]:
object_series = pd.Series(['a', 3, 4.5, 'we'])
object_series

0      a
1      3
2    4.5
3     we
dtype: object

In [6]:
# creating a range of datetime values
# note that pd.date_range() returns a DatetimeIndex, not a Series

datetime_series = pd.date_range(start = '5-8-2018', end = '12-31-2025', freq='y')
print(datetime_series)
type(datetime_series)


DatetimeIndex(['2018-12-31', '2019-12-31', '2020-12-31', '2021-12-31',
               '2022-12-31', '2023-12-31', '2024-12-31', '2025-12-31'],
              dtype='datetime64[ns]', freq='A-DEC')


pandas.core.indexes.datetimes.DatetimeIndex
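
Since `pd.date_range()` returns a `DatetimeIndex`, one way to get an actual Series of datetime values is to wrap the result in `pd.Series()`. The sketch below uses an arbitrary three-day range and hypothetical variable names:

```python
import pandas as pd

# A short daily range; the dates are arbitrary illustration values
date_index = pd.date_range(start='2018-01-01', periods=3, freq='D')

# Wrapping the DatetimeIndex in pd.Series() gives a Series of datetime values
date_series = pd.Series(date_index)

print(type(date_index).__name__)
print(type(date_series).__name__)
print(date_series.dtype)
```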

#### Indexing Series

Indexing a series with its default index works the same way as indexing a 1-D numpy array: the index starts at 0.

In [7]:
# Indexing pandas series: Similar to indexing 1-d numpy arrays or lists
# accessing the third element
first_series[2]



7

In [8]:
# accessing elements starting index = 1 till the end
first_series[1:]

1     8
2     7
3    29
4     2
5    37
6     1
7     5
dtype: int64

In [9]:
# accessing the fifth and the eighth elements (indices 4 and 7)
# note that first_series[4, 7] will not work, you need to pass the indices [4, 7] as a list inside the original []
first_series[[4,7]]

4    2
7    5
dtype: int64
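
Plain `[]` on a Series looks values up by index label, which only coincides with position when the index is the default 0, 1, 2, ... sequence. A small sketch, assuming a hypothetical Series `s` with integer labels 10 to 40, shows the explicit alternatives `.iloc` (by position) and `.loc` (by label):

```python
import pandas as pd

# Hypothetical integer labels that differ from the positions
s = pd.Series([14, 8, 7, 29], index=[10, 20, 30, 40])

by_position = s.iloc[2]    # third element, regardless of labels
by_label = s.loc[30]       # element whose label is 30
several = s.iloc[[0, 3]]   # first and fourth elements, returned as a Series

print(by_position, by_label)
print(several)
```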

Usually, you will work with Series only as a part of dataframes. Let's study the basics of dataframes.

### 2. The Pandas Dataframe

The dataframe is the most widely used data structure in data analysis. It is a table with rows and columns, where rows have an index and columns have meaningful names.

#### Creating dataframes from dictionaries

There are various ways of creating dataframes, such as building them from dictionaries or JSON objects, or reading them from text and CSV files. 

In [10]:
# keys become column names; every list must have the same length
df = pd.DataFrame({'name': ['Vinay', 'Kushal', 'Aman', 'Saif'], 
                   'age': [22, 25, 24, 28], 
                   'occupation': ['engineer', 'doctor', 'data analyst', 'teacher']})
df

Unnamed: 0,name,age,occupation
0,Vinay,22,engineer
1,Kushal,25,doctor
2,Aman,24,data analyst
3,Saif,28,teacher
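
Each column of a dataframe is itself a pandas Series. The sketch below builds a small two-row frame (the name `people` is only for illustration) and pulls one column out of it:

```python
import pandas as pd

# Small illustrative frame; all lists have the same length
people = pd.DataFrame({'name': ['Vinay', 'Kushal'], 'age': [22, 25]})

# Single brackets with one column name return that column as a Series
age_column = people['age']

print(type(age_column).__name__)
print(people.shape)
```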

#### Importing CSV data files as pandas dataframes 

For the upcoming exercises, we will use a dataset of support tickets (incidents), with details such as when each ticket was opened, its priority, state, assignment group and assignee. 



In [11]:
# reading a CSV file as a dataframe
ticket_df = pd.read_csv("./global_sales_data/Ticket_data_New.csv")

# head() returns the first 5 rows by default; pass a number to get that many rows
ticket_df.head(8)

Unnamed: 0,Number,Opened,Task type,Priority,State,Assignment group,Assigned to,Updated,Updated by,Closed,Closed by
0,INC0073572,18/04/2018 19:07:53,Incident,2 - High,On Hold,FMS Support Team,Babu,18/04/2018 21:10:27,Babu,,Babu
1,INC0073024,17/04/2018 15:08:44,Incident,4 - Low,On Hold,MSS Support Team,Rahul,17/04/2018 16:09:35,Rahul,,Rahul
2,INC0073023,17/04/2018 15:08:31,Incident,4 - Low,On Hold,FMS Support Team,Saif,18/04/2018 10:39:59,Saif,,Saif
3,INC0072924,17/04/2018 09:35:10,Incident,1 - Critical,Resolved,FMS Support Team,Krishna,17/04/2018 20:31:43,Krishna,,Krishna
4,INC0072821,17/04/2018 00:26:02,Incident,3 - Moderate,Resolved,FMS Support Team,Vamsi,17/04/2018 22:09:47,Vamsi,,Vamsi
5,INC0072519,16/04/2018 09:13:41,Incident,3 - Moderate,Closed,FMS Support Team,Rajesh,16/04/2018 15:35:18,Rajesh,16/04/2018 15:35:18,Rajesh
6,INC0072056,13/04/2018 14:52:34,Incident,3 - Moderate,On Hold,FMS Support Team,Saif,13/04/2018 15:56:49,Saif,,Saif
7,INC0071701,43438.51061,Incident,2 - High,Closed,FMS Support Team,Rahul,17/04/2018 19:00:06,Rahul,17/04/2018 19:00:06,Rahul


In [12]:
ticket_df.tail(10)

Unnamed: 0,Number,Opened,Task type,Priority,State,Assignment group,Assigned to,Updated,Updated by,Closed,Closed by
16,INC0070574,43347.59802,Incident,4 - Low,Closed,FMS Support Team,Krishna,15/04/2018 22:00:06,Krishna,15/04/2018 22:00:06,Krishna
17,INC0070332,43316.55322,Incident,4 - Low,Closed,FMS Support Team,Krishna,14/04/2018 01:00:03,Krishna,14/04/2018 01:00:03,Krishna
18,INC0070018,43255.72841,Incident,3 - Moderate,Closed,FMS Support Team,Vamsi,43408.52895,Vamsi,43408.52895,Vamsi
19,INC0069946,43255.55411,Incident,4 - Low,Closed,FMS Support Team,Rahul,18/04/2018 17:00:03,Rahul,18/04/2018 17:00:03,Rahul
20,INC0069940,43255.53656,Incident,4 - Low,On Hold,FMS Support Team,Vamsi,13/04/2018 18:01:33,Vamsi,,Vamsi
21,INC0069939,43255.52418,Incident,4 - Low,Closed,FMS Support Team,Krishna,18/04/2018 17:00:08,Krishna,18/04/2018 17:00:08,Krishna
22,INC0069907,43255.42479,Incident,4 - Low,On Hold,MMS Support Team,Rahul,43255.75743,Rahul,,Rahul
23,INC0069608,43224.57965,Incident,4 - Low,Closed,FMS Support Team,Vamsi,43377.62538,Vamsi,43377.62538,Vamsi
24,INC0069516,43224.21356,Incident,3 - Moderate,Closed,FMS Support Team,Vamsi,43438.0002,Vamsi,43438.0002,Vamsi
25,INC0069438,43194.91313,Incident,4 - Low,Closed,MMS Support Team,Krishna,43347.95846,Krishna,43347.95846,Krishna
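
To try `read_csv()`, `head()` and `tail()` without the data file, you can read CSV text from memory. This is a minimal sketch: the three rows and the names `csv_text` and `sample_df` are hypothetical, loosely modelled on the ticket data.

```python
import io
import pandas as pd

# Hypothetical three-row CSV, loosely modelled on the ticket data
csv_text = """Number,Priority,State
INC001,2 - High,On Hold
INC002,4 - Low,Closed
INC003,3 - Moderate,Resolved
"""

# read_csv accepts a file path or any file-like object
sample_df = pd.read_csv(io.StringIO(csv_text))

first_two = sample_df.head(2)  # first 2 rows
last_one = sample_df.tail(1)   # last row only

print(first_two)
print(last_one)
```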


Dataframes are usually created by importing CSV files, but sometimes it is more convenient to convert dictionaries 
into dataframes. For example, when the raw data is in JSON format (which is not uncommon), you can easily convert it into a dictionary, and then into a dataframe. 

You will learn how to convert JSON objects to dataframes later.

#### Reading and Summarising Dataframes

After you import a dataframe, you'd want to quickly understand its structure, shape, meanings of rows and columns etc. Further, you may want to look at summary statistics - such as mean, percentiles etc.

In [None]:
# tail() gives the last 5 rows by default; here we ask for the last 10
ticket_df.tail(10)

Here, each row represents one incident ticket. Notice the index associated with each row: it starts at 0 and ends at 25, implying that there are 26 tickets.

In [13]:
# Looking at the datatypes of each column
ticket_df.info()

# Note that each column is basically a pandas Series of length 26
# All 11 columns are 'objects', i.e. they are being read as strings
# 'Closed' has only 15 non-null values, i.e. 11 tickets have no closing time


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26 entries, 0 to 25
Data columns (total 11 columns):
Number              26 non-null object
Opened              26 non-null object
Task type           26 non-null object
Priority            26 non-null object
State               26 non-null object
Assignment group    26 non-null object
Assigned to         26 non-null object
Updated             26 non-null object
Updated by          26 non-null object
Closed              15 non-null object
Closed by           26 non-null object
dtypes: object(11)
memory usage: 2.3+ KB
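
The non-null counts reported by `info()` can also be checked directly. The sketch below uses a hypothetical three-row frame (`tickets`) in which `None` marks a ticket that has not been closed, and counts the missing values per column with `isna().sum()`:

```python
import pandas as pd

# Hypothetical mini-frame: None marks a ticket that has not been closed
tickets = pd.DataFrame({
    'Number': ['INC001', 'INC002', 'INC003'],
    'Closed': ['16/04/2018 15:35:18', None, None],
})

missing_per_column = tickets.isna().sum()  # number of missing values in each column
column_types = tickets.dtypes              # data type of each column

print(missing_per_column)
print(column_types)
```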


In [25]:
# describe() summarises the numeric columns by default
# Since every column here is of type object, it reports count, unique, top and freq instead
ticket_df.describe()

Unnamed: 0,Number,Opened,Task type,Priority,State,Assignment group,Assigned to,Updated,Updated by,Closed,Closed by
count,26,26.0,26,26,26,26,26,26,26,15,26
unique,26,26.0,1,4,4,3,6,26,6,15,6
top,INC0070332,43347.67469,Incident,4 - Low,Closed,FMS Support Team,Krishna,15/04/2018 21:00:08,Krishna,15/04/2018 21:00:08,Krishna
freq,1,1.0,26,13,14,23,8,1,8,1,8
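
To see the difference between the two kinds of summary, here is a small sketch on a hypothetical frame (`mixed`) with one text column and one made-up numeric column, `hours_open`:

```python
import pandas as pd

mixed = pd.DataFrame({
    'State': ['Closed', 'Closed', 'On Hold'],  # text column
    'hours_open': [2.0, 4.0, 6.0],             # hypothetical numeric column
})

# With a numeric column present, describe() reports mean, std, percentiles etc. for it only
numeric_summary = mixed.describe()

# With only text columns, describe() falls back to count, unique, top and freq
text_summary = mixed[['State']].describe()

print(numeric_summary)
print(text_summary)
```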


In [26]:
# Column names
ticket_df.columns

Index(['Number', 'Opened', 'Task type', 'Priority', 'State',
       'Assignment group', 'Assigned to', 'Updated', 'Updated by', 'Closed',
       'Closed by'],
      dtype='object')

In [27]:
# The number of rows and columns
ticket_df.shape

(26, 11)

In [28]:
# You can extract the values of a dataframe as a numpy array using df.values; here we print only the first 2 rows
ticket_df.values[0:2]

array([['INC0073572', '18/04/2018 19:07:53', 'Incident', '2 - High',
        'On Hold', 'FMS Support Team', 'Babu', '18/04/2018 21:10:27',
        'Babu', nan, 'Babu'],
       ['INC0073024', '17/04/2018 15:08:44', 'Incident', '4 - Low',
        'On Hold', 'MSS Support Team', 'Rahul', '17/04/2018 16:09:35',
        'Rahul', nan, 'Rahul']], dtype=object)

#### Indices 

An important concept in pandas dataframes is that of *row indices*. By default, rows are assigned indices starting from 0, which are shown on the left side of the dataframe. 

In [31]:
# An error is expected here: df[2] looks for a column labelled 2, not the row at position 2
print(ticket_df[2])

KeyError: 2
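
A minimal sketch of the distinction, on a hypothetical three-row frame: plain `[]` with a label selects a column, while `.iloc` selects a row by its position.

```python
import pandas as pd

tickets = pd.DataFrame({
    'Number': ['INC001', 'INC002', 'INC003'],  # hypothetical ticket ids
    'State': ['On Hold', 'Closed', 'Resolved'],
})

state_column = tickets['State']  # [] with a label selects a column
third_row = tickets.iloc[2]      # .iloc selects a row by position

print(state_column)
print(third_row)
```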

In [32]:
# Sorting by the 'Number' column in descending order
sort_df_number = ticket_df.sort_values(by='Number', ascending=False)
sort_df_number

Unnamed: 0,Number,Opened,Task type,Priority,State,Assignment group,Assigned to,Updated,Updated by,Closed,Closed by
0,INC0073572,18/04/2018 19:07:53,Incident,2 - High,On Hold,FMS Support Team,Babu,18/04/2018 21:10:27,Babu,,Babu
1,INC0073024,17/04/2018 15:08:44,Incident,4 - Low,On Hold,MSS Support Team,Rahul,17/04/2018 16:09:35,Rahul,,Rahul
2,INC0073023,17/04/2018 15:08:31,Incident,4 - Low,On Hold,FMS Support Team,Saif,18/04/2018 10:39:59,Saif,,Saif
3,INC0072924,17/04/2018 09:35:10,Incident,1 - Critical,Resolved,FMS Support Team,Krishna,17/04/2018 20:31:43,Krishna,,Krishna
4,INC0072821,17/04/2018 00:26:02,Incident,3 - Moderate,Resolved,FMS Support Team,Vamsi,17/04/2018 22:09:47,Vamsi,,Vamsi
5,INC0072519,16/04/2018 09:13:41,Incident,3 - Moderate,Closed,FMS Support Team,Rajesh,16/04/2018 15:35:18,Rajesh,16/04/2018 15:35:18,Rajesh
6,INC0072056,13/04/2018 14:52:34,Incident,3 - Moderate,On Hold,FMS Support Team,Saif,13/04/2018 15:56:49,Saif,,Saif
7,INC0071701,43438.51061,Incident,2 - High,Closed,FMS Support Team,Rahul,17/04/2018 19:00:06,Rahul,17/04/2018 19:00:06,Rahul
8,INC0071459,43408.78726,Incident,3 - Moderate,Resolved,FMS Support Team,Krishna,18/04/2018 19:31:41,Krishna,,Krishna
9,INC0071094,43377.84138,Incident,4 - Low,Closed,FMS Support Team,Saif,15/04/2018 21:00:08,Saif,15/04/2018 21:00:08,Saif


Now, arbitrary numeric indices are difficult to read and work with. Thus, you may want to change the indices of the df to something more meaningful.

Let's change the index to the incident number (`Number`, the unique id of each ticket), so that you can select rows using the incident numbers directly.

In [33]:
# Setting index to Number
ticket_df.set_index('Number', inplace = True)
ticket_df.head()

Unnamed: 0_level_0,Opened,Task type,Priority,State,Assignment group,Assigned to,Updated,Updated by,Closed,Closed by
Number,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
INC0073572,18/04/2018 19:07:53,Incident,2 - High,On Hold,FMS Support Team,Babu,18/04/2018 21:10:27,Babu,,Babu
INC0073024,17/04/2018 15:08:44,Incident,4 - Low,On Hold,MSS Support Team,Rahul,17/04/2018 16:09:35,Rahul,,Rahul
INC0073023,17/04/2018 15:08:31,Incident,4 - Low,On Hold,FMS Support Team,Saif,18/04/2018 10:39:59,Saif,,Saif
INC0072924,17/04/2018 09:35:10,Incident,1 - Critical,Resolved,FMS Support Team,Krishna,17/04/2018 20:31:43,Krishna,,Krishna
INC0072821,17/04/2018 00:26:02,Incident,3 - Moderate,Resolved,FMS Support Team,Vamsi,17/04/2018 22:09:47,Vamsi,,Vamsi


Having meaningful row labels as indices helps you to select (subset) dataframes easily. You will study selecting dataframes in the next section. 

#### Sorting dataframes

You can sort dataframes in two ways - 1) by the indices and 2) by the values.  


In [34]:
# An error is expected: df['INC0072821'] looks for a column with that name
# To select a row by its index label, use ticket_df.loc['INC0072821'] instead
ticket_df['INC0072821']

KeyError: 'INC0072821'
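
As a small sketch of label-based row selection (the ticket ids here are hypothetical), `.loc` looks a row up by its index label once the index has been set, and `reset_index()` turns the index back into an ordinary column:

```python
import pandas as pd

tickets = pd.DataFrame({
    'Number': ['INC001', 'INC002'],  # hypothetical ticket ids
    'State': ['On Hold', 'Closed'],
}).set_index('Number')

row = tickets.loc['INC002']       # row whose index label is 'INC002'
restored = tickets.reset_index()  # moves 'Number' back to an ordinary column

print(row)
print(restored.columns.tolist())
```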

In [35]:
# Sorting by index
# axis = 0 indicates that you want to sort rows (use axis=1 for columns)
sort_df = ticket_df.sort_index(axis = 0, ascending = True)
sort_df.head()

Unnamed: 0_level_0,Opened,Task type,Priority,State,Assignment group,Assigned to,Updated,Updated by,Closed,Closed by
Number,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
INC0069438,43194.91313,Incident,4 - Low,Closed,MMS Support Team,Krishna,43347.95846,Krishna,43347.95846,Krishna
INC0069516,43224.21356,Incident,3 - Moderate,Closed,FMS Support Team,Vamsi,43438.0002,Vamsi,43438.0002,Vamsi
INC0069608,43224.57965,Incident,4 - Low,Closed,FMS Support Team,Vamsi,43377.62538,Vamsi,43377.62538,Vamsi
INC0069907,43255.42479,Incident,4 - Low,On Hold,MMS Support Team,Rahul,43255.75743,Rahul,,Rahul
INC0069939,43255.52418,Incident,4 - Low,Closed,FMS Support Team,Krishna,18/04/2018 17:00:08,Krishna,18/04/2018 17:00:08,Krishna


In [36]:
# Sorting by values

# Sorting in increasing order of Priority
sort_df_prior = ticket_df.sort_values(by='Priority')
sort_df_prior.head(12)

Unnamed: 0_level_0,Opened,Task type,Priority,State,Assignment group,Assigned to,Updated,Updated by,Closed,Closed by
Number,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
INC0072924,17/04/2018 09:35:10,Incident,1 - Critical,Resolved,FMS Support Team,Krishna,17/04/2018 20:31:43,Krishna,,Krishna
INC0073572,18/04/2018 19:07:53,Incident,2 - High,On Hold,FMS Support Team,Babu,18/04/2018 21:10:27,Babu,,Babu
INC0070604,43347.64711,Incident,2 - High,Canceled,FMS Support Team,Vamsi,43347.79096,Vamsi,43347.79096,Vamsi
INC0070631,43347.67469,Incident,2 - High,Closed,FMS Support Team,Krishna,43347.74354,Krishna,43347.74354,Krishna
INC0071701,43438.51061,Incident,2 - High,Closed,FMS Support Team,Rahul,17/04/2018 19:00:06,Rahul,17/04/2018 19:00:06,Rahul
INC0070694,43347.75521,Incident,2 - High,Resolved,FMS Support Team,Rahul,17/04/2018 13:39:33,Rahul,,Rahul
INC0072056,13/04/2018 14:52:34,Incident,3 - Moderate,On Hold,FMS Support Team,Saif,13/04/2018 15:56:49,Saif,,Saif
INC0070018,43255.72841,Incident,3 - Moderate,Closed,FMS Support Team,Vamsi,43408.52895,Vamsi,43408.52895,Vamsi
INC0072519,16/04/2018 09:13:41,Incident,3 - Moderate,Closed,FMS Support Team,Rajesh,16/04/2018 15:35:18,Rajesh,16/04/2018 15:35:18,Rajesh
INC0069516,43224.21356,Incident,3 - Moderate,Closed,FMS Support Team,Vamsi,43438.0002,Vamsi,43438.0002,Vamsi


In [37]:
# Sorting in decreasing order of State
sort_df_state = ticket_df.sort_values(by='State', ascending = False)
sort_df_state.head(12)

Unnamed: 0_level_0,Opened,Task type,Priority,State,Assignment group,Assigned to,Updated,Updated by,Closed,Closed by
Number,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
INC0071459,43408.78726,Incident,3 - Moderate,Resolved,FMS Support Team,Krishna,18/04/2018 19:31:41,Krishna,,Krishna
INC0072924,17/04/2018 09:35:10,Incident,1 - Critical,Resolved,FMS Support Team,Krishna,17/04/2018 20:31:43,Krishna,,Krishna
INC0072821,17/04/2018 00:26:02,Incident,3 - Moderate,Resolved,FMS Support Team,Vamsi,17/04/2018 22:09:47,Vamsi,,Vamsi
INC0070694,43347.75521,Incident,2 - High,Resolved,FMS Support Team,Rahul,17/04/2018 13:39:33,Rahul,,Rahul
INC0069907,43255.42479,Incident,4 - Low,On Hold,MMS Support Team,Rahul,43255.75743,Rahul,,Rahul
INC0069940,43255.53656,Incident,4 - Low,On Hold,FMS Support Team,Vamsi,13/04/2018 18:01:33,Vamsi,,Vamsi
INC0073024,17/04/2018 15:08:44,Incident,4 - Low,On Hold,MSS Support Team,Rahul,17/04/2018 16:09:35,Rahul,,Rahul
INC0070699,43347.76417,Incident,4 - Low,On Hold,FMS Support Team,Krishna,43347.80557,Krishna,,Krishna
INC0073572,18/04/2018 19:07:53,Incident,2 - High,On Hold,FMS Support Team,Babu,18/04/2018 21:10:27,Babu,,Babu
INC0072056,13/04/2018 14:52:34,Incident,3 - Moderate,On Hold,FMS Support Team,Saif,13/04/2018 15:56:49,Saif,,Saif


In [None]:
# Sorting by two or more columns

# Sorting in descending order of State, then in descending order of Priority within each State
sort_df_state_prior= ticket_df.sort_values(by=['State', 'Priority'], ascending = False)
sort_df_state_prior.head(12)
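
When sorting by several columns, `ascending` can also be a list with one flag per column. The sketch below uses a hypothetical four-row frame; note that `Priority` is text here, so it is sorted alphabetically rather than numerically.

```python
import pandas as pd

tickets = pd.DataFrame({
    'State': ['On Hold', 'Closed', 'On Hold', 'Closed'],
    'Priority': ['4 - Low', '2 - High', '2 - High', '3 - Moderate'],
})

# State ascending (A to Z), then Priority descending within each State
ordered = tickets.sort_values(by=['State', 'Priority'], ascending=[True, False])

print(ordered)
```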