In [1]:
!pip install numpy
!pip install pandas


# DataFrames in pandas
A set of examples that exhibit some of the core features of the [DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) data type in the `pandas` module.

In [1]:
import numpy as np
import pandas as pd

## Basic concept
A DataFrame is a two-dimensional tabular data structure. It is easily visualized as a spreadsheet, with rows and columns.

In [2]:
# create a DataFrame from a dictionary containing labeled pandas Series
df = pd.DataFrame({
    'name': pd.Series( ['Foo', 'Bar', 'Baz', 'Bum', 'Buddle'] ),
    'email': pd.Series( ['fo1258@foo.edu', 'br9876@foo.edu', 'bz2292@foo.edu', 'bm4567@foo.edu', 'bp987@foo.edu'] ),
    'midterm exam': pd.Series( [99, 64, 87, 64, 72] ),
    'final exam': pd.Series( [94, 72, 81, 59, 88] )
})
df

Unnamed: 0,name,email,midterm exam,final exam
0,Foo,fo1258@foo.edu,99,94
1,Bar,br9876@foo.edu,64,72
2,Baz,bz2292@foo.edu,87,81
3,Bum,bm4567@foo.edu,64,59
4,Buddle,bp987@foo.edu,72,88


In [3]:
# get the DataFrame's schema
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   name          5 non-null      object
 1   email         5 non-null      object
 2   midterm exam  5 non-null      int64 
 3   final exam    5 non-null      int64 
dtypes: int64(2), object(2)
memory usage: 288.0+ bytes


### Columns as Series
Each column is a named `pandas` Series.

To access a single column, simply supply the column's name.

In [4]:
df['midterm exam']

0    99
1    64
2    87
3    64
4    72
Name: midterm exam, dtype: int64

To access multiple columns, supply a list of the column names.

In [5]:
df[ ['name', 'midterm exam'] ]

Unnamed: 0,name,midterm exam
0,Foo,99
1,Bar,64
2,Baz,87
3,Bum,64
4,Buddle,72


In [6]:
# selecting multiple columns with a list returns a DataFrame, not a Series
type( df[['midterm exam', 'final exam'] ] )

pandas.core.frame.DataFrame
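
The difference between a string key and a list key can be checked directly. The following is a minimal sketch that uses a small hypothetical `scores` frame (not the `df` defined above) so that it runs on its own:

```python
import pandas as pd

# hypothetical two-column frame, used only for this illustration
scores = pd.DataFrame({'midterm exam': [99, 64], 'final exam': [94, 72]})

one_column = scores['midterm exam']          # a string key returns a Series
one_column_frame = scores[['midterm exam']]  # a list key returns a DataFrame, even with one name

print(type(one_column).__name__)
print(type(one_column_frame).__name__)
```

A one-element list is a handy way to keep a single column in DataFrame form.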

Create a new column by assigning a Series to a new column name.

In [7]:
# let's create an 'overall score' column that is 40% the midterm score + 60% the final exam score.
df['overall score'] = 0.4*df['midterm exam'] + 0.6*df['final exam']
df

Unnamed: 0,name,email,midterm exam,final exam,overall score
0,Foo,fo1258@foo.edu,99,94,96.0
1,Bar,br9876@foo.edu,64,72,68.8
2,Baz,bz2292@foo.edu,87,81,83.4
3,Bum,bm4567@foo.edu,64,59,61.0
4,Buddle,bp987@foo.edu,72,88,81.6


In [8]:
# generate a comment column explaining the scores in nice friendly human language.
# set axis=1 to have the lambda function receive each row (as a Series) as its argument
df['comment'] = df.apply(lambda row: "Your overall score in the course is " + str(row['overall score']), axis=1 )
df

Unnamed: 0,name,email,midterm exam,final exam,overall score,comment
0,Foo,fo1258@foo.edu,99,94,96.0,Your overall score in the course is 96.0
1,Bar,br9876@foo.edu,64,72,68.8,Your overall score in the course is 68.8
2,Baz,bz2292@foo.edu,87,81,83.4,Your overall score in the course is 83.4
3,Bum,bm4567@foo.edu,64,59,61.0,Your overall score in the course is 61.0
4,Buddle,bp987@foo.edu,72,88,81.6,Your overall score in the course is 81.6
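
For simple string building like this, a column-wise expression usually gives the same result as `apply` with `axis=1`. The sketch below assumes a hypothetical `demo` frame that stands in for the 'overall score' column above:

```python
import pandas as pd

# hypothetical frame standing in for the 'overall score' column above
demo = pd.DataFrame({'overall score': [96.0, 68.8]})

prefix = "Your overall score in the course is "

# row-wise: apply calls the lambda once per row
by_row = demo.apply(lambda row: prefix + str(row['overall score']), axis=1)

# column-wise: convert the whole column to str, then concatenate
vectorized = prefix + demo['overall score'].astype(str)

print(vectorized.tolist())
```

The column-wise form avoids a Python-level function call per row, which tends to matter on larger frames.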


## Rows
Each row is also considered a `pandas` Series.

In [9]:
# get a row by its index
df.loc[1]

name                                                  Bar
email                                      br9876@foo.edu
midterm exam                                           64
final exam                                             72
overall score                                        68.8
comment          Your overall score in the course is 68.8
Name: 1, dtype: object

In [10]:
# prove that a row of a DataFrame is a Series
type( df.loc[1] )

pandas.core.series.Series

In [11]:
# get a row by its integer position (this works even when the index labels are strings)
df.iloc[2]

name                                                  Baz
email                                      bz2292@foo.edu
midterm exam                                           87
final exam                                             81
overall score                                        83.4
comment          Your overall score in the course is 83.4
Name: 2, dtype: object

Get multiple rows by supplying a list of integer indices:

In [12]:
df.iloc[ [1, 2]  ]

Unnamed: 0,name,email,midterm exam,final exam,overall score,comment
1,Bar,br9876@foo.edu,64,72,68.8,Your overall score in the course is 68.8
2,Baz,bz2292@foo.edu,87,81,83.4,Your overall score in the course is 83.4


Get multiple rows by supplying a range:

In [13]:
df.iloc[ 1:3 ] # exclusive of index #3

Unnamed: 0,name,email,midterm exam,final exam,overall score,comment
1,Bar,br9876@foo.edu,64,72,68.8,Your overall score in the course is 68.8
2,Baz,bz2292@foo.edu,87,81,83.4,Your overall score in the course is 83.4


Note that the same range, when using `loc` instead of `iloc`, will be inclusive of the upper bound:

In [14]:
df.loc[ 1:3 ] # inclusive of index #3

Unnamed: 0,name,email,midterm exam,final exam,overall score,comment
1,Bar,br9876@foo.edu,64,72,68.8,Your overall score in the course is 68.8
2,Baz,bz2292@foo.edu,87,81,83.4,Your overall score in the course is 83.4
3,Bum,bm4567@foo.edu,64,59,61.0,Your overall score in the course is 61.0
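
The difference between position-based and label-based slicing is easier to see when the index labels are not integers. A minimal sketch, assuming a hypothetical `people` frame with string labels:

```python
import pandas as pd

# hypothetical frame with string labels as its index
people = pd.DataFrame({'score': [99, 64, 87, 64]}, index=['foo', 'bar', 'baz', 'bum'])

by_position = people.iloc[1:3]      # positions 1 and 2; position 3 is excluded
by_label = people.loc['bar':'bum']  # labels 'bar' through 'bum'; the end label is included

print(list(by_position.index))
print(list(by_label.index))
```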


In [15]:
df.shape

(5, 6)

## Getting a subset of a dataframe

Get a set of rows and columns by their indexes:

In [16]:
df.loc[ [1,2], ['name', 'overall score'] ]

Unnamed: 0,name,overall score
1,Bar,68.8
2,Baz,83.4


In [17]:
# a more verbose syntax for doing the same thing
df.loc[ [1,2] ][ ['name', 'overall score'] ]

Unnamed: 0,name,overall score
1,Bar,68.8
2,Baz,83.4


You can specify a range of rows and/or a range of columns:

In [18]:
df.loc[ 1:3, 'name':'midterm exam' ]

Unnamed: 0,name,email,midterm exam
1,Bar,br9876@foo.edu,64
2,Baz,bz2292@foo.edu,87
3,Bum,bm4567@foo.edu,64


Note that accessing rows and columns like this results in a `DataFrame`, not a `Series`.

In [19]:
type(df.loc[ [1,2], ['name', 'overall score'] ]) 

pandas.core.frame.DataFrame

## Sorting

In [20]:
# sort by a column's value
df.sort_values(by='name', ascending=True)

Unnamed: 0,name,email,midterm exam,final exam,overall score,comment
1,Bar,br9876@foo.edu,64,72,68.8,Your overall score in the course is 68.8
2,Baz,bz2292@foo.edu,87,81,83.4,Your overall score in the course is 83.4
4,Buddle,bp987@foo.edu,72,88,81.6,Your overall score in the course is 81.6
3,Bum,bm4567@foo.edu,64,59,61.0,Your overall score in the course is 61.0
0,Foo,fo1258@foo.edu,99,94,96.0,Your overall score in the course is 96.0


In [21]:
# add a new row with the same name as an existing row, but a different email
new_row = pd.Series({ 
    'name': 'Baz', 
    'email': 'bz2289@foo.edu',
    'midterm exam': 88,
    'final exam': 74
})

# append the new row and automatically assign it an index
# (DataFrame.append was removed in pandas 2.0, so use pd.concat instead)
df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)

# sort by primary and secondary columns
df.sort_values(by=['name', 'email'], ascending=True)


Unnamed: 0,name,email,midterm exam,final exam,overall score,comment
1,Bar,br9876@foo.edu,64,72,68.8,Your overall score in the course is 68.8
5,Baz,bz2289@foo.edu,88,74,,
2,Baz,bz2292@foo.edu,87,81,83.4,Your overall score in the course is 83.4
4,Buddle,bp987@foo.edu,72,88,81.6,Your overall score in the course is 81.6
3,Bum,bm4567@foo.edu,64,59,61.0,Your overall score in the course is 61.0
0,Foo,fo1258@foo.edu,99,94,96.0,Your overall score in the course is 96.0


In [22]:
# sort by index
df.sort_index(ascending=False)

Unnamed: 0,name,email,midterm exam,final exam,overall score,comment
5,Baz,bz2289@foo.edu,88,74,,
4,Buddle,bp987@foo.edu,72,88,81.6,Your overall score in the course is 81.6
3,Bum,bm4567@foo.edu,64,59,61.0,Your overall score in the course is 61.0
2,Baz,bz2292@foo.edu,87,81,83.4,Your overall score in the course is 83.4
1,Bar,br9876@foo.edu,64,72,68.8,Your overall score in the course is 68.8
0,Foo,fo1258@foo.edu,99,94,96.0,Your overall score in the course is 96.0
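
When sorting by several columns, `ascending` can also be a list with one flag per column. A small sketch with a hypothetical `roster` frame:

```python
import pandas as pd

# hypothetical frame with a repeated name
roster = pd.DataFrame({
    'name': ['Baz', 'Bar', 'Baz'],
    'midterm exam': [87, 64, 88],
})

# sort names A-Z, and within each name put the highest midterm score first
ranked = roster.sort_values(by=['name', 'midterm exam'], ascending=[True, False])

print(ranked)
```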


## Filtering rows


In [23]:
# match a criterion
df[ df['name'] == 'Bar' ]

Unnamed: 0,name,email,midterm exam,final exam,overall score,comment
1,Bar,br9876@foo.edu,64,72,68.8,Your overall score in the course is 68.8


In [24]:
# match multiple criteria using & or | logic operators
df[ (df['name'] != 'Bar') & (df['midterm exam'] > 50) ]

Unnamed: 0,name,email,midterm exam,final exam,overall score,comment
0,Foo,fo1258@foo.edu,99,94,96.0,Your overall score in the course is 96.0
2,Baz,bz2292@foo.edu,87,81,83.4,Your overall score in the course is 83.4
3,Bum,bm4567@foo.edu,64,59,61.0,Your overall score in the course is 61.0
4,Buddle,bp987@foo.edu,72,88,81.6,Your overall score in the course is 81.6
5,Baz,bz2289@foo.edu,88,74,,
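
The other boolean operators follow the same pattern. The sketch below, on a hypothetical `students` frame, shows `|` (or), `~` (not), and `isin()` for membership tests. Each comparison needs its own parentheses, because `&` and `|` bind more tightly than `==` or `<`.

```python
import pandas as pd

# hypothetical frame used only for this illustration
students = pd.DataFrame({
    'name': ['Foo', 'Bar', 'Baz', 'Bum'],
    'midterm exam': [99, 64, 87, 64],
})

# | means "or": keep rows where either condition holds
either = students[(students['name'] == 'Foo') | (students['midterm exam'] < 70)]

# ~ negates a condition; isin() tests membership in a list of values
not_listed = students[~students['name'].isin(['Foo', 'Bar'])]

print(list(either['name']))
print(list(not_listed['name']))
```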


## Filtering columns

Extracting a **single column** is straightforward with square bracket syntax.

In [25]:
# fetch the 'name' column - this returns a Series
df['name']

0       Foo
1       Bar
2       Baz
3       Bum
4    Buddle
5       Baz
Name: name, dtype: object

The easiest way to extract **multiple columns** from a dataframe is by supplying a list of column names.

In [26]:
# fetch the 'name' and 'final exam' columns - this returns a DataFrame
df[ ['name', 'final exam'] ]

Unnamed: 0,name,final exam
0,Foo,94
1,Bar,72
2,Baz,81
3,Bum,59
4,Buddle,88
5,Baz,74


## Filtering rows and columns

It is possible to use two sets of brackets to perform both row and column filters in one expression.

In [27]:
# find one row by its index, and fetch one column from the results - this returns a single value
df.loc[2]['final exam']

81

In [28]:
# filter rows by criteria, and fetch one column from the results - this returns a Series
df[ df['name'] != 'Baz'][ 'name' ]

0       Foo
1       Bar
3       Bum
4    Buddle
Name: name, dtype: object

In [29]:
# filter rows, and fetch multiple columns from the results - this returns a DataFrame
df[ df['name'] != 'Baz'][ ['name', 'midterm exam'] ] 

Unnamed: 0,name,midterm exam
0,Foo,99
1,Bar,64
3,Bum,64
4,Buddle,72
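
The same selection can be written as a single `.loc` call that takes a row filter and a column list together. A minimal sketch, assuming a hypothetical `students` frame:

```python
import pandas as pd

# hypothetical frame used only for this illustration
students = pd.DataFrame({
    'name': ['Foo', 'Bar', 'Baz'],
    'midterm exam': [99, 64, 87],
    'final exam': [94, 72, 81],
})

mask = students['name'] != 'Baz'

chained = students[mask][['name', 'midterm exam']]     # two steps: filter rows, then pick columns
single = students.loc[mask, ['name', 'midterm exam']]  # one step: rows first, then columns

print(single)
```

For reading data the two forms are interchangeable. When assigning values, the single `.loc` form is generally the safer choice, since chained indexing may operate on a temporary copy.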


## Basic operations

In [30]:
# give a flat 2-point curve to all students on the midterm exam
# update the midterm exam column
df['midterm exam'] = df['midterm exam'] + 2
df

Unnamed: 0,name,email,midterm exam,final exam,overall score,comment
0,Foo,fo1258@foo.edu,101,94,96.0,Your overall score in the course is 96.0
1,Bar,br9876@foo.edu,66,72,68.8,Your overall score in the course is 68.8
2,Baz,bz2292@foo.edu,89,81,83.4,Your overall score in the course is 83.4
3,Bum,bm4567@foo.edu,66,59,61.0,Your overall score in the course is 61.0
4,Buddle,bp987@foo.edu,74,88,81.6,Your overall score in the course is 81.6
5,Baz,bz2289@foo.edu,90,74,,


In [31]:
# add a new column to the dataframe...

# first, generate a Series of fake student ids
n_numbers = pd.Series(100000000 * np.random.random(7) ) # generate a series of random numbers
n_numbers = n_numbers.astype(int) # convert to a simple int to remove decimal place
n_numbers = 'N' + n_numbers.map(str) # add the letter 'N' in front of each number (first convert each to str)

# add to dataframe as a new column
df['n number'] = n_numbers
df

Unnamed: 0,name,email,midterm exam,final exam,overall score,comment,n number
0,Foo,fo1258@foo.edu,101,94,96.0,Your overall score in the course is 96.0,N37419271
1,Bar,br9876@foo.edu,66,72,68.8,Your overall score in the course is 68.8,N71172550
2,Baz,bz2292@foo.edu,89,81,83.4,Your overall score in the course is 83.4,N7408211
3,Bum,bm4567@foo.edu,66,59,61.0,Your overall score in the course is 61.0,N45718482
4,Buddle,bp987@foo.edu,74,88,81.6,Your overall score in the course is 81.6,N58780095
5,Baz,bz2289@foo.edu,90,74,,,N33025817
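
Assigning a Series as a column matches values by index label, not by position, which is why a Series of seven random numbers can be assigned to a six-row frame: the value with no matching label is simply dropped. A small sketch with hypothetical `frame` and `ids` objects:

```python
import pandas as pd

frame = pd.DataFrame({'name': ['Foo', 'Bar', 'Baz']})  # index labels 0, 1, 2

# a Series with labels 1..3: label 0 is missing, and label 3 has no matching row
ids = pd.Series(['N111', 'N222', 'N333'], index=[1, 2, 3])

frame['n number'] = ids  # values are matched by label, not by position

print(frame)
```

Row 0 ends up with a missing value, and the value labeled 3 is discarded.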


## Merging two dataframes

In [32]:
# let's first create a second dataframe with some more information about each student
# note that the index labels in this dataframe are in a different order than in the other dataframe
df2 = pd.DataFrame({
    'major': ['Math', 'Computer Science', 'Philosophy', 'Organic Gardening', 'Organic Gardening', 'Sociology'],
    'minor': ['Art History', 'Linguistics', 'Music Performance', 'Theater Lighting', 'Theater Lighting', 'Mathematics']
}, index = [3, 0, 2, 1, 4, 5])

df2

Unnamed: 0,major,minor
3,Math,Art History
0,Computer Science,Linguistics
2,Philosophy,Music Performance
1,Organic Gardening,Theater Lighting
4,Organic Gardening,Theater Lighting
5,Sociology,Mathematics


In [33]:
# join on the index - join() performs a "left join" by default, keeping every row of df
df.join(df2)

Unnamed: 0,name,email,midterm exam,final exam,overall score,comment,n number,major,minor
0,Foo,fo1258@foo.edu,101,94,96.0,Your overall score in the course is 96.0,N37419271,Computer Science,Linguistics
1,Bar,br9876@foo.edu,66,72,68.8,Your overall score in the course is 68.8,N71172550,Organic Gardening,Theater Lighting
2,Baz,bz2292@foo.edu,89,81,83.4,Your overall score in the course is 83.4,N7408211,Philosophy,Music Performance
3,Bum,bm4567@foo.edu,66,59,61.0,Your overall score in the course is 61.0,N45718482,Math,Art History
4,Buddle,bp987@foo.edu,74,88,81.6,Your overall score in the course is 81.6,N58780095,Organic Gardening,Theater Lighting
5,Baz,bz2289@foo.edu,90,74,,,N33025817,Sociology,Mathematics


In [34]:
# concatenate column-wise, aligning on the index - concat() performs an "outer join" by default
pd.concat( [df, df2], axis=1)

Unnamed: 0,name,email,midterm exam,final exam,overall score,comment,n number,major,minor
0,Foo,fo1258@foo.edu,101,94,96.0,Your overall score in the course is 96.0,N37419271,Computer Science,Linguistics
1,Bar,br9876@foo.edu,66,72,68.8,Your overall score in the course is 68.8,N71172550,Organic Gardening,Theater Lighting
2,Baz,bz2292@foo.edu,89,81,83.4,Your overall score in the course is 83.4,N7408211,Philosophy,Music Performance
3,Bum,bm4567@foo.edu,66,59,61.0,Your overall score in the course is 61.0,N45718482,Math,Art History
4,Buddle,bp987@foo.edu,74,88,81.6,Your overall score in the course is 81.6,N58780095,Organic Gardening,Theater Lighting
5,Baz,bz2289@foo.edu,90,74,,,N33025817,Sociology,Mathematics
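
Because the two frames above share exactly the same index labels, every join type produces the same rows. The differences only show up when the labels do not fully overlap, as in this sketch with hypothetical `left` and `right` frames:

```python
import pandas as pd

# hypothetical frames whose index labels only partly overlap
left = pd.DataFrame({'name': ['Foo', 'Bar', 'Baz']}, index=[0, 1, 2])
right = pd.DataFrame({'major': ['Math', 'Art', 'Music']}, index=[1, 2, 3])

left_join = left.join(right)                # default how='left': keeps labels 0, 1, 2
inner_join = left.join(right, how='inner')  # only labels present in both: 1, 2

outer_concat = pd.concat([left, right], axis=1)                # default join='outer': 0, 1, 2, 3
inner_concat = pd.concat([left, right], axis=1, join='inner')  # only labels present in both: 1, 2

print(left_join)
print(outer_concat)
```

Rows without a match on the other side are filled with missing values.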


## Setting the index
It's possible to change which column is used as the index.

In [35]:
# set the index to be the new n number
df.set_index('n number')

n number,name,email,midterm exam,final exam,overall score,comment
N37419271,Foo,fo1258@foo.edu,101,94,96.0,Your overall score in the course is 96.0
N71172550,Bar,br9876@foo.edu,66,72,68.8,Your overall score in the course is 68.8
N7408211,Baz,bz2292@foo.edu,89,81,83.4,Your overall score in the course is 83.4
N45718482,Bum,bm4567@foo.edu,66,59,61.0,Your overall score in the course is 61.0
N58780095,Buddle,bp987@foo.edu,74,88,81.6,Your overall score in the course is 81.6
N33025817,Baz,bz2289@foo.edu,90,74,,
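
Note that `set_index()` returns a new DataFrame and leaves the original unchanged unless the result is assigned back. A minimal sketch, using a hypothetical two-row `frame`:

```python
import pandas as pd

# hypothetical frame used only for this illustration
frame = pd.DataFrame({'n number': ['N1', 'N2'], 'name': ['Foo', 'Bar']})

indexed = frame.set_index('n number')  # returns a new DataFrame

print(frame.index.tolist())       # the original still has its default integer index
print(indexed.loc['N2', 'name'])  # rows can now be looked up by n number

restored = indexed.reset_index()  # move the index back into a regular column
```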


## Importing data from files
Pandas can import from a variety of common data file formats, including CSV, JSON, fixed-width column text, and more.

In [36]:
# open data about NYC jobs from https://data.cityofnewyork.us/City-Government/NYC-Jobs/kpav-sd4t
df = pd.read_csv('./data/NYC_Jobs.csv')
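
To try `read_csv` without the data file, an in-memory string can stand in for a file on disk. The sketch below uses a tiny made-up CSV (the values are hypothetical, not taken from the NYC dataset):

```python
import io
import pandas as pd

# a tiny in-memory CSV standing in for a file on disk (hypothetical data)
csv_text = "Job ID,Agency,Salary Range To\n1,PARKS,50000\n2,FINANCE,\n"

jobs = pd.read_csv(io.StringIO(csv_text))

print(jobs.dtypes)
print(jobs)
```

The empty salary field is read as a missing value, which is why that column is typed as a float rather than an integer.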

In [37]:
# get the DataFrame's schema - notice the auto-detection of data types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2823 entries, 0 to 2822
Data columns (total 30 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Job ID                         2823 non-null   int64  
 1   Agency                         2823 non-null   object 
 2   Posting Type                   2823 non-null   object 
 3   # Of Positions                 2823 non-null   int64  
 4   Business Title                 2823 non-null   object 
 5   Civil Service Title            2823 non-null   object 
 6   Title Classification           2823 non-null   object 
 7   Title Code No                  2823 non-null   object 
 8   Level                          2823 non-null   object 
 9   Job Category                   2821 non-null   object 
 10  Full-Time/Part-Time indicator  2680 non-null   object 
 11  Career Level                   2821 non-null   object 
 12  Salary Range From              2823 non-null   f

In [38]:
# show a few randomly-sampled rows
df.sample(5)

Unnamed: 0,Job ID,Agency,Posting Type,# Of Positions,Business Title,Civil Service Title,Title Classification,Title Code No,Level,Job Category,...,Additional Information,To Apply,Hours/Shift,Work Location 1,Recruitment Contact,Residency Requirement,Posting Date,Post Until,Posting Updated,Process Date
72,483730,TAXI & LIMOUSINE COMMISSION,External,2,Data Scientist - Policy Analytics,CITY RESEARCH SCIENTIST,Non-Competitive-5,21744,2,"Technology, Data & Innovation Policy, Research...",...,,"Click, APPLY NOW Current city employees must a...",,"33 Beaver St, New York NY",,New York City residency is generally required ...,09/13/2021,,09/14/2021,02/15/2022
1973,490660,DEPT OF ENVIRONMENT PROTECTION,Internal,23,Engineering Technician II,ENGINEERING TECHNICIAN,Competitive-1,20113,2,"Engineering, Architecture, & Planning",...,Appointments are subject to OMB approval. For ...,Click the Apply Now button,,,,New York City residency is generally required ...,10/08/2021,,10/21/2021,02/15/2022
1519,519921,OFFICE OF MANAGEMENT & BUDGET,Internal,1,Summer Undergraduate Intern Community Board R...,SUMMER COLLEGE INTERN,Non-Competitive-5,10234,0,"Finance, Accounting, & Procurement",...,REQUIREMENTS: Undergraduate interns should ha...,Please go to www.nyc.gov/careers and search fo...,,255 Greenwich Street,,You must be legally eligible to work in the Un...,02/09/2022,,02/09/2022,02/15/2022
1480,517184,HOUSING PRESERVATION & DVLPMNT,External,1,"Director, Brooklyn Planning",ASSOCIATE HOUSING DEVELOPMENT,Competitive-1,22508,0,"Engineering, Architecture, & Planning",...,,Apply online.,,100 Gold Street,,New York City residency is generally required ...,01/25/2022,24-FEB-2022,01/25/2022,02/15/2022
46,458082,DEPT OF HEALTH/MENTAL HYGIENE,Internal,1,City Custodial Assistant,CITY CUSTODIAL ASSISTANT,Labor-3,90644,0,Health Building Operations & Maintenance,...,**IMPORTANT NOTES TO ALL CANDIDATES: Please n...,Apply online with a cover letter to https://a1...,,,,New York City residency is generally required ...,01/05/2022,05-MAY-2022,01/05/2022,02/15/2022


In [39]:
# look for good-paying jobs (at least $200,000) available for external candidates
df_external_jobs = df[ df['Posting Type'] == 'External' ]
df_external_annual_jobs = df_external_jobs[ df_external_jobs['Salary Frequency'] == 'Annual' ]
df_external_annual_jobs_over_200k = df_external_annual_jobs[ df_external_annual_jobs['Salary Range To'] >= 200000 ]
df_external_annual_jobs_over_200k.sample(5)

Unnamed: 0,Job ID,Agency,Posting Type,# Of Positions,Business Title,Civil Service Title,Title Classification,Title Code No,Level,Job Category,...,Additional Information,To Apply,Hours/Shift,Work Location 1,Recruitment Contact,Residency Requirement,Posting Date,Post Until,Posting Updated,Process Date
521,440527,NYC EMPLOYEES RETIREMENT SYS,External,1,ADMINISTRATIVE MANAGEMENT AUDITOR,ADMINISTRATIVE MANAGEMENT AUDI,Competitive-1,10010,M5,Administration & Human Resources,...,,"TO APPLY FOR CONSIDERATION, PLEASE FORWARD A C...",,,,New York City residency is generally required ...,06/24/2020,,07/07/2020,02/15/2022
1625,468218,DEPARTMENT OF CORRECTION,External,1,General Counsel,DEPUTY COMMISSIONER (DOC),Non-Competitive-5,95043,M5,Legal Affairs,...,,For City employees: Go to Employee Self-Servic...,,,,New York City residency is generally required ...,07/13/2021,,07/13/2021,02/15/2022
1953,502421,OFFICE OF MANAGEMENT & BUDGET,External,1,Assistant Director Education,BUDGET ANALYST (OMB)-MANAGERIA,Pending Classification-2,0608A,M4,"Finance, Accounting, & Procurement Policy, Res...",...,"REQUIREMENTS: Assistant Director ($141,766): ...","For City employees, please go to Employee Self...",,255 Greenwich Street,,New York City residency is generally required ...,11/19/2021,,11/19/2021,02/15/2022
1782,520572,OFFICE OF LABOR RELATIONS,External,1,General Counsel,COUNSEL (OLR),Non-Competitive-5,30100,M6,Legal Affairs,...,"The salary being offered is $180,000 to $230,0...",To apply please submit your resume and cover l...,,,,New York City residency is generally required ...,02/14/2022,22-FEB-2022,02/14/2022,02/15/2022
1788,468476,NYC HOUSING AUTHORITY,External,1,Director of Public Housing Tenancy Operations,ADMINISTRATIVE HOUSING MANAGER,Competitive-1,10018,M4,"Policy, Research & Analysis",...,Preference will be given to employees who have...,Click the Apply Now button.,,,,NYCHA has no residency requirements.,08/04/2021,,08/30/2021,02/15/2022


In [40]:
df_external_annual_jobs_over_200k[ ["Business Title", "Salary Range From", "Salary Range To"] ].sort_values(by="Salary Range To", ascending=False).head(5)

Unnamed: 0,Business Title,Salary Range From,Salary Range To
2615,"Head, Real Estate Investments Portfolio",250000.0,265000.0
2288,Chief Risk Officer,250000.0,265000.0
2671,Assistant to the Deputy Borough President.,58700.0,252165.0
1837,Assistant to the Ptesident,58700.0,252165.0
397,Deputy Commissioner,106729.0,241434.0


In [41]:
# the same query as above, just in one line
df[ (df['Posting Type'] == 'External') & (df['Salary Frequency'] == 'Annual') & (df['Salary Range To'] >= 200000) ].sample(5)

Unnamed: 0,Job ID,Agency,Posting Type,# Of Positions,Business Title,Civil Service Title,Title Classification,Title Code No,Level,Job Category,...,Additional Information,To Apply,Hours/Shift,Work Location 1,Recruitment Contact,Residency Requirement,Posting Date,Post Until,Posting Updated,Process Date
642,373748,NYC EMPLOYEES RETIREMENT SYS,External,1,ADMINISTRATIVE MANAGEMENT AUDITOR,ADMINISTRATIVE MANAGEMENT AUDI,Competitive-1,10010,M7,"Administration & Human Resources Finance, Acco...",...,,"TO APPLY FOR CONSIDERATION, PLEASE FORWARD A C...",,,,New York City residency is generally required ...,11/07/2018,,11/07/2018,02/15/2022
1782,520572,OFFICE OF LABOR RELATIONS,External,1,General Counsel,COUNSEL (OLR),Non-Competitive-5,30100,M6,Legal Affairs,...,"The salary being offered is $180,000 to $230,0...",To apply please submit your resume and cover l...,,,,New York City residency is generally required ...,02/14/2022,22-FEB-2022,02/14/2022,02/15/2022
2525,519718,DEPARTMENT OF TRANSPORTATION,External,1,Assistant Commissioner - Office of Special Events,ADMINISTRATIVE COMMUNITY RELAT,Competitive-1,10022,M4,Administration & Human Resources Constituent S...,...,***In order to be considered for this position...,***In order to be considered for this position...,Mon-Fri 9am - 5pm,55 Water St Ny Ny,,New York City residency is generally required ...,02/10/2022,23-FEB-2022,02/11/2022,02/15/2022
1063,441706,TAXI & LIMOUSINE COMMISSION,External,1,General Counsel/Deputy Commissioner for Legal ...,EXECUTIVE AGENCY COUNSEL,Non-Competitive-5,95005,M6,Legal Affairs,...,"As of August 2, 2021, all new hires must be va...","Click, APPLY NOW Current city employees must a...",,"33 Beaver St, New York Ny",,New York City residency is generally required ...,07/13/2020,,08/16/2021,02/15/2022
2402,453677,NYC EMPLOYEES RETIREMENT SYS,External,1,ADMINISTRATIVE MANAGEMENT AUDITOR,ADMINISTRATIVE MANAGEMENT AUDI,Competitive-1,10010,M5,Administration & Human Resources,...,,"TO APPLY FOR CONSIDERATION, PLEASE FORWARD A C...",,,,New York City residency is generally required ...,11/17/2020,,11/17/2020,02/15/2022


## Deal with missing values

In [42]:
# find any rows with missing values
bad_rows = df[ df.isnull().any(axis=1) ]
bad_rows.head(3)

Unnamed: 0,Job ID,Agency,Posting Type,# Of Positions,Business Title,Civil Service Title,Title Classification,Title Code No,Level,Job Category,...,Additional Information,To Apply,Hours/Shift,Work Location 1,Recruitment Contact,Residency Requirement,Posting Date,Post Until,Posting Updated,Process Date
0,517168,HOUSING PRESERVATION & DVLPMNT,External,1,"Deputy Director, Pre-Development Planning",ASSOCIATE HOUSING DEVELOPMENT,Competitive-1,22508,0,"Engineering, Architecture, & Planning",...,,Apply online.,,100 Gold Street,,New York City residency is generally required ...,01/25/2022,24-FEB-2022,01/25/2022,02/15/2022
1,501645,ADMIN FOR CHILDREN'S SVCS,Internal,2,Senior Stationary Engineer,SENIOR STATIONARY ENGINEER,Competitive-1,91638,0,Building Operations & Maintenance Social Services,...,Section 424-A of the New York Social Services ...,Click on the Apply to button,,,,New York City Residency is not required for th...,02/04/2022,06-MAR-2022,02/04/2022,02/15/2022
2,520016,NYC EMPLOYEES RETIREMENT SYS,External,1,Disability Case Management Supervisor,COMMUNITY COORDINATOR,Non-Competitive-5,56058,0,Constituent Services & Community Programs Comm...,...,,"TO APPLY FOR CONSIDERATION, PLEASE FORWARD A C...",,,,New York City residency is generally required ...,02/10/2022,25-FEB-2022,02/10/2022,02/15/2022


In [43]:
# drop a few rows with missing data manually
new_df = df.drop( df.index[ [0, 1, 2] ] )

# look for missing values again
bad_rows = new_df[ new_df.isnull().any(axis=1) ]
bad_rows.head(3)

Unnamed: 0,Job ID,Agency,Posting Type,# Of Positions,Business Title,Civil Service Title,Title Classification,Title Code No,Level,Job Category,...,Additional Information,To Apply,Hours/Shift,Work Location 1,Recruitment Contact,Residency Requirement,Posting Date,Post Until,Posting Updated,Process Date
3,515139,FIRE DEPARTMENT,Internal,4,Chief Dispatcher,SUPERVISING FIRE ALARM DISPATC,Competitive-1,71060,2,Communications & Intergovernmental Affairs,...,NOTE: This position is open to qualified perso...,CITY EMPLOYEES MUST APPLY VIA EMPLOYEE SELF SE...,Supervising Fire Alarm Dispatchers may be requ...,Fire Dispatch Operations - PSAC 1,,New York City residency is generally required ...,02/03/2022,24-FEB-2022,02/03/2022,02/15/2022
4,497303,DEPT OF PARKS & RECREATION,Internal,1,Recreation Supervisor,RECREATION SUPERVISOR,Competitive-1,60440,0,Constituent Services & Community Programs,...,Fees: Hired candidates will be subject to a pr...,Please submit a cover letter and resume. Park...,,"Jackie Robinson Recreation Center, Manhattan",,"Residency in New York City, Nassau, Orange, Ro...",02/11/2022,11-MAR-2022,02/11/2022,02/15/2022
5,497303,DEPT OF PARKS & RECREATION,Internal,1,Recreation Supervisor,RECREATION SUPERVISOR,Competitive-1,60440,0,Constituent Services & Community Programs,...,Fees: Hired candidates will be subject to a pr...,Please submit a cover letter and resume. Park...,,"Jackie Robinson Recreation Center, Manhattan",,"Residency in New York City, Nassau, Orange, Ro...",02/11/2022,11-MAR-2022,02/11/2022,02/15/2022


In [44]:
# drop rows with any missing values
new_df.dropna()

Unnamed: 0,Job ID,Agency,Posting Type,# Of Positions,Business Title,Civil Service Title,Title Classification,Title Code No,Level,Job Category,...,Additional Information,To Apply,Hours/Shift,Work Location 1,Recruitment Contact,Residency Requirement,Posting Date,Post Until,Posting Updated,Process Date


In [45]:
# oops... we were too aggressive.  Let's drop just those rows with missing salary info
df = df.dropna(subset=['Salary Range From', 'Salary Range To'])
df.sample(3)

Unnamed: 0,Job ID,Agency,Posting Type,# Of Positions,Business Title,Civil Service Title,Title Classification,Title Code No,Level,Job Category,...,Additional Information,To Apply,Hours/Shift,Work Location 1,Recruitment Contact,Residency Requirement,Posting Date,Post Until,Posting Updated,Process Date
1217,442559,NYC HOUSING AUTHORITY,Internal,100,CARETAKER X (HA),CARETAKER (HA),Labor-3,90645,0,Building Operations & Maintenance,...,NYCHA residents are encouraged to apply.,Click the Apply Now button.,,,,NYCHA has no residency requirements.,09/04/2020,,09/04/2020,02/15/2022
564,517984,DEPT OF ENVIRONMENT PROTECTION,Internal,3,Clerical Associate III,CLERICAL ASSOCIATE,Competitive-1,10251,3,Administration & Human Resources,...,Appointments are subject to OMB approval. For...,To Apply: Click Apply Now button,35 hours,Various Locations at Wastewater Treatment Plan...,,New York City residency is generally required ...,02/04/2022,18-FEB-2022,02/04/2022,02/15/2022
2016,511510,HRA/DEPT OF SOCIAL SERVICES,Internal,1,TIMEKEEPER,CLERICAL ASSOCIATE,Competitive-1,10251,4,Administration & Human Resources Social Services,...,The federal government provides student loan f...,CANDIDATE MUST BE PERMANENT IN THE CLERICAL AS...,,"33 Beaver St, New York, NY",,New York City residency is generally required ...,12/23/2021,25-FEB-2022,02/11/2022,02/15/2022


In [46]:
# fill in missing values with zeros
df = df.fillna(0)
df.sample(3)

Unnamed: 0,Job ID,Agency,Posting Type,# Of Positions,Business Title,Civil Service Title,Title Classification,Title Code No,Level,Job Category,...,Additional Information,To Apply,Hours/Shift,Work Location 1,Recruitment Contact,Residency Requirement,Posting Date,Post Until,Posting Updated,Process Date
834,495413,OFFICE OF EMERGENCY MANAGEMENT,Internal,1,HEALTH & MEDICAL SPECIALIST,EMERGENCY PREPAREDNESS SPECIAL,Pending Classification-2,6766,1,Health,...,New York City Emergency Management (NYCEM) hel...,Current City Employees: Apply via Employee Sel...,Monday-Friday 9-5,"165 Cadman Plaza East Brooklyn, NY 11201",0.0,New York City residency is generally required ...,11/15/2021,0,11/16/2021,02/15/2022
1053,509209,DEPARTMENT FOR THE AGING,Internal,1,Data Accuracy Specialist (Per-Diem),CITY RESEARCH SCIENTIST,Non-Competitive-5,21744,3,"Policy, Research & Analysis Social Services",...,This Per Diem position is full-time (35 hours ...,Please be sure to submit a resume & cover lett...,0,0,0.0,New York City residency is generally required ...,02/09/2022,10-MAY-2022,02/09/2022,02/15/2022
2238,474116,DEPT OF DESIGN & CONSTRUCTION,Internal,1,Capital Payments Auditor,ACCOUNTANT,Competitive-1,40510,2,"Finance, Accounting, & Procurement",...,0,"For City Employees, please go to Employee Self...",35 Hours,"30-30 Thomson Avenue, LIC, NY",0.0,New York City residency is generally required ...,12/22/2021,27-FEB-2022,01/28/2022,02/15/2022
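
Filling every column with zero also puts `0` into text and date columns, as the sample above shows. One alternative is to pass `fillna()` a dictionary with a separate fill value per column. A sketch, assuming a hypothetical `postings` frame that borrows two column names from the dataset:

```python
import numpy as np
import pandas as pd

# hypothetical frame with one text column and one numeric column
postings = pd.DataFrame({
    'Post Until': ['24-FEB-2022', np.nan],
    'Recruitment Contact': [np.nan, np.nan],
})

# choose a fill value that suits each column
filled = postings.fillna({'Post Until': 'Not specified', 'Recruitment Contact': 0})

print(filled)
```

Columns not named in the dictionary are left untouched.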


## Basic statistics

In [47]:
# get an overview of most common stats
df.describe()

Unnamed: 0,Job ID,# Of Positions,Salary Range From,Salary Range To,Recruitment Contact
count,2823.0,2823.0,2823.0,2823.0,2823.0
mean,484959.478215,2.733971,63866.675553,88589.563539,0.0
std,51846.706525,11.006525,33002.429986,46742.505962,0.0
min,87990.0,1.0,0.0,15.45,0.0
25%,471473.0,1.0,49950.0,62215.0,0.0
50%,502085.0,1.0,62397.0,83981.0,0.0
75%,514317.5,1.0,79620.0,110000.0,0.0
max,520640.0,250.0,250000.0,265000.0,0.0


In [48]:
# the same, but just for the 'Salary Range To' field (which is a Series, of course)
df['Salary Range To'].describe()

count      2823.000000
mean      88589.563539
std       46742.505962
min          15.450000
25%       62215.000000
50%       83981.000000
75%      110000.000000
max      265000.000000
Name: Salary Range To, dtype: float64

In [49]:
# get just the median from the column
df['Salary Range To'].median()

83981.0

The other statistics functions - `min()`, `max()`, `mean()`, `std()`, `count()` - work similarly.
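
A quick sketch of these functions on a hypothetical three-value Series (the numbers are made up for illustration):

```python
import pandas as pd

# hypothetical salary values
salaries = pd.Series([50000.0, 62000.0, 110000.0], name='Salary Range To')

summary = {
    'min': salaries.min(),
    'max': salaries.max(),
    'mean': salaries.mean(),
    'median': salaries.median(),
    'std': salaries.std(),
    'count': salaries.count(),  # number of non-missing values
}

print(summary)
```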

## Count values
The `value_counts()` function of a Series returns the number of times each value occurs.

In [50]:
df['Career Level'].value_counts()

Experienced (non-manager)    1933
Manager                       385
Entry-Level                   362
Executive                      80
Student                        61
0                               2
Name: Career Level, dtype: int64

In [51]:
df['Full-Time/Part-Time indicator'].value_counts()

F    2551
0     143
P     129
Name: Full-Time/Part-Time indicator, dtype: int64
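
When proportions are more useful than raw counts, `value_counts()` accepts `normalize=True`. The sketch below uses a small made-up Series rather than the jobs data, so the category labels are only illustrative.

```python
import pandas as pd

# made-up career levels for illustration only
levels = pd.Series(['Manager', 'Entry-Level', 'Manager', 'Executive', 'Manager'])

# raw counts per distinct value, most frequent first
counts = levels.value_counts()

# the same, expressed as a fraction of all non-null values
shares = levels.value_counts(normalize=True)

print(counts)
print(shares)
```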

## Grouping by a column

In [52]:
# count how many jobs are in each agency
df.groupby("Agency")['Agency'].count()

Agency
ADMIN FOR CHILDREN'S SVCS          27
ADMIN TRIALS AND HEARINGS          28
BOROUGH PRESIDENT-QUEENS            4
BUSINESS INTEGRITY COMMISSION       7
CIVILIAN COMPLAINT REVIEW BD       17
CONFLICTS OF INTEREST BOARD         8
CONSUMER AFFAIRS                   48
DEPARTMENT FOR THE AGING           26
DEPARTMENT OF BUILDINGS            15
DEPARTMENT OF BUSINESS SERV.        8
DEPARTMENT OF CITY PLANNING        27
DEPARTMENT OF CORRECTION          213
DEPARTMENT OF FINANCE              31
DEPARTMENT OF INVESTIGATION        24
DEPARTMENT OF PROBATION             6
DEPARTMENT OF SANITATION           32
DEPARTMENT OF TRANSPORTATION      161
DEPT OF CITYWIDE ADMIN SVCS        37
DEPT OF DESIGN & CONSTRUCTION      55
DEPT OF ENVIRONMENT PROTECTION    356
DEPT OF HEALTH/MENTAL HYGIENE     203
DEPT OF INFO TECH & TELECOMM       27
DEPT OF PARKS & RECREATION         66
DEPT OF YOUTH & COMM DEV SRVS      29
DEPT. OF HOMELESS SERVICES          2
DISTRICT ATTORNEY KINGS COUNTY      4
DISTR

In [53]:
# calculate the mean top salary within each agency
df.groupby("Agency")['Salary Range To'].mean()

Agency
ADMIN FOR CHILDREN'S SVCS          75985.334815
ADMIN TRIALS AND HEARINGS          52919.282857
BOROUGH PRESIDENT-QUEENS          252165.000000
BUSINESS INTEGRITY COMMISSION      87857.142857
CIVILIAN COMPLAINT REVIEW BD       81148.529412
CONFLICTS OF INTEREST BOARD        73727.500000
CONSUMER AFFAIRS                   80520.537500
DEPARTMENT FOR THE AGING           50468.342308
DEPARTMENT OF BUILDINGS            75645.866667
DEPARTMENT OF BUSINESS SERV.      128485.375000
DEPARTMENT OF CITY PLANNING        82946.000000
DEPARTMENT OF CORRECTION           96335.074123
DEPARTMENT OF FINANCE              94341.318065
DEPARTMENT OF INVESTIGATION        78847.250000
DEPARTMENT OF PROBATION            72848.333333
DEPARTMENT OF SANITATION           62130.343750
DEPARTMENT OF TRANSPORTATION      107995.890497
DEPT OF CITYWIDE ADMIN SVCS        88593.827027
DEPT OF DESIGN & CONSTRUCTION     110570.036364
DEPT OF ENVIRONMENT PROTECTION     94913.979372
DEPT OF HEALTH/MENTAL HYGIENE    

In [54]:
# show just the 10 agencies with the highest mean top salary (listed in ascending order)
df.groupby("Agency")['Salary Range To'].mean().sort_values().tail(10)

Agency
DEPT. OF HOMELESS SERVICES        109409.000000
DEPT OF DESIGN & CONSTRUCTION     110570.036364
OFFICE OF THE COMPTROLLER         111008.448846
MAYORS OFFICE OF CONTRACT SVCS    112500.000000
DEPT OF INFO TECH & TELECOMM      116225.518519
NYC EMPLOYEES RETIREMENT SYS      122217.108648
DEPARTMENT OF BUSINESS SERV.      128485.375000
NYC FIRE PENSION FUND             192152.000000
OFFICE OF LABOR RELATIONS         231974.000000
BOROUGH PRESIDENT-QUEENS          252165.000000
Name: Salary Range To, dtype: float64
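
One possible alternative to `sort_values().tail(10)` is `Series.nlargest()`, which returns the largest values in descending order. The sketch below compares the two on a made-up Series; the agency names and salaries are hypothetical.

```python
import pandas as pd

# hypothetical mean top salaries per agency, for illustration only
mean_top_salary = pd.Series({
    'Agency A': 90000.0,
    'Agency B': 120000.0,
    'Agency C': 75000.0,
    'Agency D': 110000.0,
})

# ascending sort, then take the last two rows (largest value ends up last)
via_sort = mean_top_salary.sort_values().tail(2)

# ask directly for the two largest values (largest value comes first)
via_nlargest = mean_top_salary.nlargest(2)

print(via_sort)
print(via_nlargest)
```

Both approaches select the same rows; they differ only in the order in which those rows are listed.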

In [55]:
# find the agencies with the largest average gap between the top and bottom of the salary range
df['Salary Range'] = df['Salary Range To'] - df['Salary Range From']
df.groupby("Agency")['Salary Range'].mean().sort_values().tail(10)[::-1]

Agency
BOROUGH PRESIDENT-QUEENS         193465.000000
OFFICE OF LABOR RELATIONS        134144.000000
NYC FIRE PENSION FUND            120114.000000
DEPARTMENT OF BUSINESS SERV.      72978.000000
DEPT OF INFO TECH & TELECOMM      55328.702222
DEPT OF DESIGN & CONSTRUCTION     45076.872727
DEPARTMENT OF TRANSPORTATION      44444.558509
NYC EMPLOYEES RETIREMENT SYS      43930.259740
FINANCIAL INFO SVCS AGENCY        38006.085306
NYC HOUSING AUTHORITY             37707.622407
Name: Salary Range, dtype: float64
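
Several statistics can also be computed per group in one step by passing a list of function names to `agg()`. The following is a sketch on a small made-up DataFrame that reuses the column names from the cells above; the agencies and salaries are hypothetical.

```python
import pandas as pd

# made-up job postings for illustration only
jobs = pd.DataFrame({
    'Agency': ['A', 'A', 'B', 'B', 'B'],
    'Salary Range From': [40000.0, 50000.0, 60000.0, 70000.0, 80000.0],
    'Salary Range To': [60000.0, 90000.0, 65000.0, 75000.0, 100000.0],
})

# width of each posting's salary range
jobs['Salary Range'] = jobs['Salary Range To'] - jobs['Salary Range From']

# one row per agency, one column per statistic
summary = jobs.groupby('Agency')['Salary Range'].agg(['count', 'mean', 'max'])

# agencies ordered from widest to narrowest average range
widest = summary.sort_values('mean', ascending=False)
print(widest)
```

Including the count next to the mean makes it easier to spot agencies whose average rests on only a handful of postings.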

## Shape

In [56]:
# how many rows and columns?
df.shape

(2823, 31)

In [57]:
# remind ourselves of the look of the data
df.sample(3)

Unnamed: 0,Job ID,Agency,Posting Type,# Of Positions,Business Title,Civil Service Title,Title Classification,Title Code No,Level,Job Category,...,To Apply,Hours/Shift,Work Location 1,Recruitment Contact,Residency Requirement,Posting Date,Post Until,Posting Updated,Process Date,Salary Range
1915,470130,DEPARTMENT OF CORRECTION,External,1,Director of Data Analytics & Research,CITY RESEARCH SCIENTIST,Non-Competitive-5,21744,4A,"Technology, Data & Innovation",...,For City employees: Go to Employee Self-Servic...,0,0,0.0,New York City residency is generally required ...,07/27/2021,0,07/27/2021,02/15/2022,14143.0
2048,510370,NYC EMPLOYEES RETIREMENT SYS,External,1,ADMINISTRATIVE RETIREMENTS BENEFITS SPECIALIST...,ADMINISTRATIVE RETIREMENTS BEN,Competitive-1,8298C,00,"Technology, Data & Innovation",...,NYCERS is an Equal Opportunity Employer Intern...,0,0,0.0,New York City residency is generally required ...,12/21/2021,0,12/22/2021,02/15/2022,15000.0
1397,519059,DEPT OF DESIGN & CONSTRUCTION,External,1,Engineer-In-Charge,CIVIL ENGINEER,Competitive-1,20215,02,"Engineering, Architecture, & Planning",...,"For City Employees, please go to Employee Self...",35 Hours,"30-30 Thomson Avenue, Long Island City, NY 11101",0.0,New York City Residency is not required for th...,02/04/2022,05-MAY-2022,02/07/2022,02/15/2022,31360.0


In [58]:
# flip the dataframe so columns become rows and rows become columns
df.transpose().head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2813,2814,2815,2816,2817,2818,2819,2820,2821,2822
Job ID,517168,501645,520016,515139,497303,497303,460742,465916,462614,496445,...,435593,507116,496359,518449,508500,483210,518950,484984,520249,518700
Agency,HOUSING PRESERVATION & DVLPMNT,ADMIN FOR CHILDREN'S SVCS,NYC EMPLOYEES RETIREMENT SYS,FIRE DEPARTMENT,DEPT OF PARKS & RECREATION,DEPT OF PARKS & RECREATION,NYC HOUSING AUTHORITY,DEPT OF HEALTH/MENTAL HYGIENE,DEPARTMENT OF CORRECTION,NYC HOUSING AUTHORITY,...,DEPARTMENT OF CORRECTION,TEACHERS RETIREMENT SYSTEM,DEPARTMENT OF TRANSPORTATION,TAXI & LIMOUSINE COMMISSION,DEPARTMENT FOR THE AGING,DEPT OF PARKS & RECREATION,NYC HOUSING AUTHORITY,FINANCIAL INFO SVCS AGENCY,POLICE DEPARTMENT,POLICE DEPARTMENT
Posting Type,External,Internal,External,Internal,Internal,Internal,External,Internal,External,External,...,Internal,External,Internal,Internal,Internal,Internal,External,External,Internal,Internal
# Of Positions,1,2,1,4,1,1,3,1,8,1,...,1,1,1,1,1,1,1,1,1,1
Business Title,"Deputy Director, Pre-Development Planning",Senior Stationary Engineer,Disability Case Management Supervisor,Chief Dispatcher,Recreation Supervisor,Recreation Supervisor,Environmental Compliance Analyst,Environmental Health Scientist,Agency Attorney,AGENCY ATTORNEY INTERNE,...,Policy Analyst,Agency Attorney,Associate Project Manager 3,Database Administrator,"Senior Director of Budgets, Project Management...","Landscape Architect for Forestry, Horticulture...",Board Committee Meeting Coordinator,FMS ARCHITECT/DEVELOPER,Senior Police Administrative Aide,Police Administrative Aide
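
To make the relationship between `shape` and `transpose()` concrete, here is a minimal sketch on a small made-up DataFrame. It also uses `.T`, the shorthand attribute for `transpose()`.

```python
import pandas as pd

# made-up data for illustration only
small = pd.DataFrame({
    'name': ['Foo', 'Bar', 'Baz'],
    'score': [99, 64, 87],
})

# shape is a (rows, columns) tuple, so it can be unpacked
rows, cols = small.shape

# .T is shorthand for .transpose(); the former column labels become the index
flipped = small.T

print(rows, cols)
print(flipped.shape)
print(flipped)
```

After transposing, each new column holds one original row, so a column may mix strings and numbers.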
