The first main pandas data type we will learn about is the Series. Let's import pandas and explore the Series object.

A Series is very similar to a NumPy array (in fact, it is built on top of the NumPy array object). What differentiates a Series from a NumPy array is that a Series can have axis labels, meaning it can be indexed by a label instead of just a numeric position. It also doesn't need to hold numeric data; it can hold any arbitrary Python object.

Let's explore this concept through some examples:

In [1]:
import numpy as np
import pandas as pd

### Creating a Series

You can convert a list, NumPy array, or dictionary to a Series:

In [2]:
labels = ['a','b','c']
my_list = [10,20,30]
arr = np.array([10,20,30])
d = {'a':10,'b':20,'c':30}

**Using Lists**

In [3]:
ser=pd.Series(data=labels)

In [6]:
ser

0    a
1    b
2    c
dtype: object

In [7]:
pd.Series(data=my_list,index=labels)

a    10
b    20
c    30
dtype: int64

In [8]:
pd.Series(data=my_list,
    index=labels)

a    10
b    20
c    30
dtype: int64

In [9]:
pd.Series(my_list,labels)

a    10
b    20
c    30
dtype: int64

**NumPy Arrays**

In [13]:
pd.Series(arr)

0    10
1    20
2    30
dtype: int32

In [14]:
pd.Series(arr,labels)

a    10
b    20
c    30
dtype: int32

**Dictionary**

In [11]:
d

{'a': 10, 'b': 20, 'c': 30}

In [13]:
pd.Series(data=d,index=['a','e','f'])

a    10.0
e     NaN
f     NaN
dtype: float64
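The result above follows from how pandas combines a dictionary with an explicit `index`: the index acts as a selection of keys, and any requested label that is missing from the dictionary is filled with `NaN`, which in turn forces a `float64` dtype. A minimal sketch of that behavior (the variable names `from_dict` and `selected` are only illustrative):

```python
import pandas as pd

d = {'a': 10, 'b': 20, 'c': 30}

# Without an index, the dictionary keys become the labels.
from_dict = pd.Series(d)

# With an index, only the requested labels are kept; labels that are
# missing from the dictionary ('e' and 'f' here) are filled with NaN.
selected = pd.Series(d, index=['a', 'e', 'f'])

print(from_dict.index.tolist())   # labels taken from the dictionary keys
print(selected.isna().tolist())   # which requested labels were missing
print(selected.dtype)             # float64, because NaN is a float
```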

### Data in a Series

A pandas Series can hold a variety of object types:

In [16]:
pd.Series(data=labels)

0    a
1    b
2    c
dtype: object

In [14]:
# Even functions (although unlikely that you will use this)
pd.Series([sum,print,len])

0      <built-in function sum>
1    <built-in function print>
2      <built-in function len>
dtype: object

## Using an Index

The key to using a Series is understanding its index. pandas uses these index labels (names or numbers) for fast lookups, much like a hash table or dictionary.

Let's see some examples of how to grab information from a Series. Let us create two Series, ser1 and ser2:

In [15]:
ser1 = pd.Series([1,2,3,4], index=['USA', 'Germany', 'USSR', 'Japan'])

In [16]:
ser1

USA        1
Germany    2
USSR       3
Japan      4
dtype: int64

In [17]:
ser2 = pd.Series([1,2,5,4], index=['USA', 'Germany', 'Italy', 'Japan'])

In [18]:
ser2

USA        1
Germany    2
Italy      5
Japan      4
dtype: int64

In [22]:
ser1['USA']

1
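The bracket lookup above is label-based. As a small sketch, the same Series can also be queried explicitly by label with `.loc`, by position with `.iloc`, and tested for membership with `in` (the variable names below are only illustrative):

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4], index=['USA', 'Germany', 'USSR', 'Japan'])

by_label = ser1.loc['Germany']         # label-based lookup
by_position = ser1.iloc[1]             # position-based lookup (second element)
several = ser1.loc[['USA', 'Japan']]   # a list of labels returns a Series
has_italy = 'Italy' in ser1            # `in` checks the index labels, not the values

print(by_label, by_position, has_italy)
print(several.tolist())
```

Using `.loc` and `.iloc` makes the intent unambiguous, which matters most when the index labels are themselves integers.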

Operations between Series are also aligned on the index; a label that appears in only one of the two Series produces NaN in the result:

In [23]:
ser3 = ser1 + ser2

In [25]:
ser1,ser2

(USA        1
 Germany    2
 USSR       3
 Japan      4
 dtype: int64,
 USA        1
 Germany    2
 Italy      5
 Japan      4
 dtype: int64)

In [24]:
ser3

Germany    4.0
Italy      NaN
Japan      8.0
USA        2.0
USSR       NaN
dtype: float64
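The `NaN` entries for Italy and USSR appear because each label exists in only one of the two Series, and the result becomes `float64` because `NaN` is a float. If a missing label should instead be treated as a default value, the method form `Series.add` accepts a `fill_value`. A minimal sketch using the same data (the names `plain` and `filled` are only illustrative):

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4], index=['USA', 'Germany', 'USSR', 'Japan'])
ser2 = pd.Series([1, 2, 5, 4], index=['USA', 'Germany', 'Italy', 'Japan'])

# Operator form: labels present in only one Series become NaN.
plain = ser1 + ser2

# Method form: a label missing from one side is treated as 0 before adding.
filled = ser1.add(ser2, fill_value=0)

print(plain.isna().sum())                 # number of unmatched labels
print(filled['Italy'], filled['USSR'])    # values carried over from one side
```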

In [26]:
s_attr_methods = set(dir(pd.Series))
len(s_attr_methods)  # far too many attributes and methods to memorize

421

### Calling Series Methods

In [30]:
sal = pd.read_csv('Salaries.csv')
sal.head()

Unnamed: 0,Id,EmployeeName,JobTitle,BasePay,OvertimePay,OtherPay,Benefits,TotalPay,TotalPayBenefits,Year,Notes,Agency,Status
0,1,NATHANIEL FORD,GENERAL MANAGER-METROPOLITAN TRANSIT AUTHORITY,167411.18,0.0,400184.25,,567595.43,567595.43,2011,,San Francisco,
1,2,GARY JIMENEZ,CAPTAIN III (POLICE DEPARTMENT),155966.02,245131.88,137811.38,,538909.28,538909.28,2011,,San Francisco,
2,3,ALBERT PARDINI,CAPTAIN III (POLICE DEPARTMENT),212739.13,106088.18,16452.6,,335279.91,335279.91,2011,,San Francisco,
3,4,CHRISTOPHER CHONG,WIRE ROPE CABLE MAINTENANCE MECHANIC,77916.0,56120.71,198306.9,,332343.61,332343.61,2011,,San Francisco,
4,5,PATRICK GARDNER,"DEPUTY CHIEF OF DEPARTMENT,(FIRE DEPARTMENT)",134401.6,9737.0,182234.59,,326373.19,326373.19,2011,,San Francisco,


In [37]:
Job_title = sal['JobTitle']
Empl_name = sal['EmployeeName']
Base_pay = sal['BasePay']

In [39]:
sal.EmployeeName.dtype

dtype('O')

In [40]:
Base_pay.dtype

dtype('float64')

In [41]:
Base_pay.head()

0    167411.18
1    155966.02
2    212739.13
3     77916.00
4    134401.60
Name: BasePay, dtype: float64

In [42]:
Base_pay.sample(n=6,random_state=31)

48611      66009.98
102961     22652.00
144632      5143.60
2693      117450.79
124322     79281.03
113235    146560.57
Name: BasePay, dtype: float64

In [43]:
Base_pay.value_counts()

0.00        1298
54703.00     338
55026.00     297
48472.40     210
65448.00     153
            ... 
65402.73       1
68818.72       1
73810.54       1
56242.68       1
15.50          1
Name: BasePay, Length: 109489, dtype: int64

In [44]:
Base_pay.value_counts(normalize=True)

0.00        0.008768
54703.00    0.002283
55026.00    0.002006
48472.40    0.001418
65448.00    0.001033
              ...   
65402.73    0.000007
68818.72    0.000007
73810.54    0.000007
56242.68    0.000007
15.50       0.000007
Name: BasePay, Length: 109489, dtype: float64

In [45]:
Base_pay.count()

148045

In [47]:
Base_pay.min(),Base_pay.max(),Base_pay.median()

(-166.01, 319275.01, 65007.45)

In [48]:
Base_pay.describe()

count    148045.000000
mean      66325.448841
std       42764.635495
min        -166.010000
25%       33588.200000
50%       65007.450000
75%       94691.050000
max      319275.010000
Name: BasePay, dtype: float64

In [50]:
len(Base_pay.isna())  # length of the boolean mask = all rows, including NaN; count() skips NaN

148654

In [61]:
Base_pay.hasnans
# equivalent to Base_pay.isna().any()

True
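The numbers above fit together: 148654 total rows minus 148045 non-missing values leaves 609 missing entries. The relationship between `len()`, `count()`, and `isna().sum()` can be sketched on a small, made-up Series (the data in `pay` is a hypothetical stand-in, not taken from `Salaries.csv`):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for Base_pay, with two missing values.
pay = pd.Series([100.0, np.nan, 250.0, np.nan, 0.0])

total_rows = len(pay)         # every row, missing or not
non_missing = pay.count()     # rows that hold a value
missing = pay.isna().sum()    # rows that are NaN (True counts as 1)

print(total_rows, non_missing, missing)
print(pay.hasnans)            # True as soon as at least one NaN exists
```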

In [60]:
Base_pay.dropna()

0         167411.18
1         155966.02
2         212739.13
3          77916.00
4         134401.60
            ...    
148645         0.00
148647         0.00
148648         0.00
148649         0.00
148653         0.00
Name: BasePay, Length: 148045, dtype: float64

In [66]:
Base_pay.notna().describe()

count     148654
unique         2
top         True
freq      148045
Name: BasePay, dtype: object

### Series Operations
Series and DataFrames support many of the Python operators. Typically, a new Series or DataFrame is returned when using an operator.
Basic operators work element-wise, in much the same way as they do with scalar quantities.
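As a small illustration of these two points, the sketch below applies floor division to a made-up Series (the values in `pay` are a hypothetical sample) and checks that the result is a new object while the original stays unchanged:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing value.
pay = pd.Series([167411.18, 77916.00, np.nan], name='BasePay')

# The operator is applied to each element and returns a new Series.
in_thousands = pay // 1000

print(in_thousands.tolist())    # NaN stays NaN
print(in_thousands is pay)      # False: a new object was created
print(pay.iloc[0])              # the original value is untouched
```

Missing values simply propagate through arithmetic, so no special handling is needed at this step.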

In [69]:
Base_pay//1000  # salary in thousands

0         167.0
1         155.0
2         212.0
3          77.0
4         134.0
          ...  
148649      0.0
148650      NaN
148651      NaN
148652      NaN
148653      0.0
Name: BasePay, Length: 148654, dtype: float64

In [71]:
Base_pay//1000 >200

0         False
1         False
2          True
3         False
4         False
          ...  
148649    False
148650    False
148651    False
148652    False
148653    False
Name: BasePay, Length: 148654, dtype: bool

In [72]:
Base_pay.gt(200000)

0         False
1         False
2          True
3         False
4         False
          ...  
148649    False
148650    False
148651    False
148652    False
148653    False
Name: BasePay, Length: 148654, dtype: bool

<table>
    <tr>
        <td>Operator group</td>
        <td>Operator</td>
        <td>Series method name</td>
    </tr>
    <tr>
        <td>Arithmetic</td>
        <td>+, -, *, /, //, \**, %</td>
        <td>.add, .sub, .mul, .div, .floordiv, .pow, .mod</td>
    </tr>
    <tr>
        <td>Comparison</td>
        <td>&lt;, &gt;, &lt;=, &gt;=, ==, !=</td>
        <td>.lt, .gt, .le, .ge, .eq, .ne</td>
    </tr>
</table>
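Each operator and its method counterpart give the same result; the method form is mainly convenient when chaining several steps. A minimal sketch on made-up data (the values in `pay` are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing value.
pay = pd.Series([167411.18, 212739.13, np.nan])

# Operator form and method form produce identical results.
same_comparison = (pay > 200000).equals(pay.gt(200000))
same_floordiv = (pay // 1000).equals(pay.floordiv(1000))  # equals() treats matching NaN as equal

# Method form reads left to right in a chain, without extra parentheses.
over_200k = pay.floordiv(1000).gt(200)

print(same_comparison, same_floordiv)
print(over_200k.tolist())   # a comparison against NaN yields False
```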

In [81]:
Base_pay.isna().sum()  # count the NaN entries
Base_pay.fillna(0).astype(int).isna().sum()  # replace every NaN with 0, so no NaN entries remain

0
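The order of the chain matters: an integer dtype cannot represent `NaN`, so `fillna` has to run before `astype(int)`. The sketch below shows this on a small, made-up Series (the data in `pay` and the name `cleaned` are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing value.
pay = pd.Series([1500.75, np.nan, 0.0])

# Filling first makes the integer conversion possible.
cleaned = pay.fillna(0).astype(int)   # fractional parts are truncated

# Converting without filling raises, because NaN has no integer equivalent.
try:
    pay.astype(int)
    conversion_failed = False
except ValueError:
    conversion_failed = True

print(cleaned.tolist())
print(conversion_failed)
```

Note that the conversion also drops the cents, so it is only appropriate when whole numbers are acceptable.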

#### Debugging Chained Operations

In [83]:
def debug_ser(ser):
    print("Before")
    print(ser)
    print("After")
    return ser

In [84]:
Base_pay.fillna(0).pipe(debug_ser).astype(int).isna().sum()

Before
0         167411.18
1         155966.02
2         212739.13
3          77916.00
4         134401.60
            ...    
148649         0.00
148650         0.00
148651         0.00
148652         0.00
148653         0.00
Name: BasePay, Length: 148654, dtype: float64
After


0
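Because `pipe` forwards extra arguments to the function it calls, the same idea can be reused at several points in one chain. The helper below is a hypothetical variation (the name `peek`, its `label` parameter, and the sample data are assumptions for illustration) that prints a one-line summary instead of the whole Series:

```python
import numpy as np
import pandas as pd

def peek(ser, label=""):
    """Print a short summary of the Series and return it unchanged."""
    print(f"{label}: dtype={ser.dtype}, length={len(ser)}, missing={ser.isna().sum()}")
    return ser

# Hypothetical sample with one missing value.
pay = pd.Series([1500.75, np.nan, 0.0])

result = (
    pay.pipe(peek, label="raw")
       .fillna(0)
       .pipe(peek, label="after fillna")
       .astype(int)
       .isna()
       .sum()
)

print(result)
```

Since the helper returns its input untouched, the `pipe` calls can be removed afterwards without changing the result.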

Let's stop here for now and move on to DataFrames, which will expand on the concept of Series!
# Great Job!