**<center><font size="40">pandas analysis</font></center>**

This is a cheat sheet for the pandas API in Python. Most of the examples come from the pandas documentation, various websites, and the DataCamp Python Career Track courses.

In [1]:
import pandas as pd

# Datasets

In [2]:
police = pd.read_csv('data/police.csv')

In [17]:
police_raw = pd.read_csv('data/police_raw.csv')

# EDA
Exploratory Data Analysis techniques, functions, methods, and attributes.

## head() / tail()
Return the first or last n rows of a DF.

In [8]:
pd.concat([police.head(2),police.tail(2)], axis='index')

Unnamed: 0,stop_datetime,driver_gender,driver_race,violation,search_conducted,search_type,stop_outcome,is_arrested,stop_duration,drugs_related_stop,district,Frisk
0,2005-01-04 12:55:00,M,White,Equipment,False,,Citation,False,0-15 Min,False,Zone X4,False
1,2005-01-23 23:15:00,M,White,Speeding,False,,Citation,False,0-15 Min,False,Zone K3,False
86534,2015-12-31 22:09:00,F,Hispanic,Equipment,False,,Warning,False,0-15 Min,False,Zone K3,False
86535,2015-12-31 22:47:00,M,White,Registration/plates,False,,Citation,False,0-15 Min,False,Zone X4,False


## info()
Concise summary of a DF.

In [5]:
police.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 86536 entries, 0 to 86535
Data columns (total 12 columns):
stop_datetime         86536 non-null object
driver_gender         86536 non-null object
driver_race           86536 non-null object
violation             86536 non-null object
search_conducted      86536 non-null bool
search_type           3307 non-null object
stop_outcome          86536 non-null object
is_arrested           86536 non-null bool
stop_duration         86536 non-null object
drugs_related_stop    86536 non-null bool
district              86536 non-null object
Frisk                 86536 non-null bool
dtypes: bool(4), object(8)
memory usage: 5.6+ MB


## describe()
Summary descriptive statistics for each column of a DF.

In [10]:
police.iloc[:, :5].describe()

Unnamed: 0,stop_datetime,driver_gender,driver_race,violation,search_conducted
count,86536,86536,86536,86536,86536
unique,84429,2,5,6,2
top,2015-01-10 09:11:00,M,White,Speeding,False
freq,8,62762,61870,48423,83229


## isna()
Detect missing values.

In [11]:
police.isna().sum()

stop_datetime             0
driver_gender             0
driver_race               0
violation                 0
search_conducted          0
search_type           83229
stop_outcome              0
is_arrested               0
stop_duration             0
drugs_related_stop        0
district                  0
Frisk                     0
dtype: int64

## value_counts()
Counts of unique values for a column.

In [14]:
police.driver_race.value_counts()

White       61870
Black       12285
Hispanic     9727
Asian        2389
Other         265
Name: driver_race, dtype: int64

Relative frequencies (fractions) instead of counts:

In [15]:
police.driver_race.value_counts(normalize=True)

White       0.714963
Black       0.141964
Hispanic    0.112404
Asian       0.027607
Other       0.003062
Name: driver_race, dtype: float64

# Preprocess

## NaN-s
NaN-s or missing values can be:
* **Removed** - If only a small fraction exists
* **Imputed** - Replaced with an estimate
* **Factored** - Use it as a value itself
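
The three strategies above can be sketched on a small toy DF. The column names and values below are made up for illustration and are not part of the police dataset; the median is just one possible estimate for imputing.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical, not the police dataset)
df = pd.DataFrame({
    'age': [25.0, np.nan, 40.0, 31.0],
    'gender': ['M', 'F', None, 'F'],
})

# Removed: drop every row that has at least one missing value
removed = df.dropna()

# Imputed: replace missing ages with the column median
imputed = df.assign(age=df['age'].fillna(df['age'].median()))

# Factored: treat "missing" as a value of its own
factored = df.assign(gender=df['gender'].fillna('Unknown'))

print(removed.shape)
print(imputed['age'].tolist())
print(factored['gender'].tolist())
```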

### isna()
Find missing values.

In [41]:
police_raw[police_raw.driver_gender.isna()].iloc[::2000,:]

Unnamed: 0,state,stop_date,stop_time,county_name,driver_gender,driver_race,violation_raw,violation,search_conducted,search_type,stop_outcome,is_arrested,stop_duration,drugs_related_stop,district
15,RI,2005-09-03,16:02,,,,,,False,,,,,False,Zone K3
35464,RI,2009-06-14,17:11,,,,,,False,,,,,False,Zone X4
65902,RI,2012-12-11,22:24,,,,,,False,,,,,False,Zone K2


### drop()
Remove entire columns or rows from a DF.

In [30]:
police_raw.head(1)

Unnamed: 0,state,stop_date,stop_time,county_name,driver_gender,driver_race,violation_raw,violation,search_conducted,search_type,stop_outcome,is_arrested,stop_duration,drugs_related_stop,district
0,RI,2005-01-04,12:55,,M,White,Equipment/Inspection Violation,Equipment,False,,Citation,False,0-15 Min,False,Zone X4


In [33]:
police_raw[['county_name']].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 91741 entries, 0 to 91740
Data columns (total 1 columns):
county_name    0 non-null float64
dtypes: float64(1)
memory usage: 716.9 KB


In [34]:
#remove 'county_name' column since all values are NaN-s
police_raw.drop('county_name', axis='columns').head(2)

Unnamed: 0,state,stop_date,stop_time,driver_gender,driver_race,violation_raw,violation,search_conducted,search_type,stop_outcome,is_arrested,stop_duration,drugs_related_stop,district
0,RI,2005-01-04,12:55,M,White,Equipment/Inspection Violation,Equipment,False,,Citation,False,0-15 Min,False,Zone X4
1,RI,2005-01-23,23:15,M,White,Speeding,Speeding,False,,Citation,False,0-15 Min,False,Zone K3


### dropna()
Remove missing values.

In [25]:
#sample data
p_raw_sub = police_raw[['stop_date', 'driver_gender', 'stop_duration']]
p_raw_sub.head(2)

Unnamed: 0,stop_date,driver_gender,stop_duration
0,2005-01-04,M,0-15 Min
1,2005-01-23,M,0-15 Min


In [27]:
p_raw_sub.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 91741 entries, 0 to 91740
Data columns (total 3 columns):
stop_date        91741 non-null object
driver_gender    86536 non-null object
stop_duration    86539 non-null object
dtypes: object(3)
memory usage: 2.1+ MB


In [28]:
#dropping rows that contain missing values
p_raw_sub.dropna(axis='index',how='any').shape

(86536, 3)
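
`dropna()` can also be restricted to all-NaN columns or to selected columns. A minimal sketch on a toy DF (values are made up; only the column names mirror the dataset):

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    'county_name': [np.nan, np.nan, np.nan],
    'driver_gender': ['M', np.nan, 'F'],
    'stop_duration': ['0-15 Min', np.nan, np.nan],
})

# Drop columns in which every value is missing
no_empty_cols = toy.dropna(axis='columns', how='all')

# Drop rows only when driver_gender is missing
has_gender = no_empty_cols.dropna(subset=['driver_gender'])

print(no_empty_cols.columns.tolist())
print(has_gender.index.tolist())
```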

## Mapping

### apply()

In [43]:
police_raw.violation_raw.value_counts(normalize=True, dropna=False)

Speeding                            0.527834
Other Traffic Violation             0.176846
Equipment/Inspection Violation      0.119053
NaN                                 0.056703
Registration Violation              0.040364
Seatbelt Violation                  0.031131
Special Detail/Directed Patrol      0.026891
Call for Service                    0.015173
Motorist Assist/Courtesy            0.002235
Violation of City/Town Ordinance    0.001973
APB                                 0.000992
Suspicious Person                   0.000610
Warrant                             0.000196
Name: violation_raw, dtype: float64

In [44]:
#dict for replacing values
replace_dict = {'Speeding':'Speeding',
                'Other Traffic Violation':'Moving',
                'Equipment/Inspection Violation':'Equipment',
                'Registration Violation':'Registration',
                'Seatbelt Violation':'Seat belt'}

police_raw.violation_raw.apply(lambda x: replace_dict.get(x, 'Other')).value_counts(normalize=True)

Speeding        0.527834
Moving          0.176846
Equipment       0.119053
Other           0.104773
Registration    0.040364
Seat belt       0.031131
Name: violation_raw, dtype: float64
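
As an alternative to `apply()` with a lambda, `Series.map()` accepts a dict directly. A small sketch on a toy Series (the values and the shortened dict are assumptions for illustration): values without a key in the dict become NaN, which can then be filled.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical, not the full violation_raw column)
raw = pd.Series(['Speeding', 'Seatbelt Violation', 'APB', np.nan, 'Speeding'])
mapping = {'Speeding': 'Speeding', 'Seatbelt Violation': 'Seat belt'}

# Values without a key in the dict (including NaN) become NaN, then 'Other'
mapped = raw.map(mapping).fillna('Other')

print(mapped.tolist())
```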

## Data Types

### dtypes

In [55]:
#determine data dtype(s)
police.dtypes

stop_datetime         object
driver_gender         object
driver_race           object
violation             object
search_conducted        bool
search_type           object
stop_outcome          object
is_arrested             bool
stop_duration         object
drugs_related_stop      bool
district              object
Frisk                   bool
dtype: object

### astype()
Cast a Series or DF column(s) to another data type.

In [57]:
police.driver_race.astype('category').dtype

CategoricalDtype(categories=['Asian', 'Black', 'Hispanic', 'Other', 'White'], ordered=False)
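
One common reason to use `category` is memory: the labels are stored once and each row holds a small integer code. A sketch on a made-up Series (exact byte counts depend on the pandas version, so none are shown):

```python
import pandas as pd

# Toy data (hypothetical): few distinct labels repeated many times
races = pd.Series(['White', 'Black', 'White', 'Asian'] * 1000)
as_category = races.astype('category')

# Categories are stored once, sorted; each row holds an integer code
print(as_category.cat.categories.tolist())
print(as_category.cat.codes.head(4).tolist())

# Compare memory footprints in bytes
print(races.memory_usage(deep=True), as_category.memory_usage(deep=True))
```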

## Indexing
Key building blocks:
* **Indexes**: Sequence of labels
* **Series**: 1D array with Index
* **DataFrames**: 2D array with Series as columns
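
How the three building blocks relate can be checked on a tiny DF. The index name `stop_id` and the values are hypothetical, chosen only for illustration.

```python
import pandas as pd

# Toy DF with a named index (hypothetical data)
df = pd.DataFrame(
    {'violation': ['Speeding', 'Equipment'], 'is_arrested': [False, True]},
    index=pd.Index(['a', 'b'], name='stop_id'),
)

# A single column of a DF is a Series
col = df['violation']
print(type(col).__name__)

# The Series shares the DF's index
print(col.index.equals(df.index))

# The index carries its own name
print(df.index.name)
```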

In [64]:
#accessing index name
print(police.index.name)

None


### set_index()

In [58]:
police.head(2)

Unnamed: 0,stop_datetime,driver_gender,driver_race,violation,search_conducted,search_type,stop_outcome,is_arrested,stop_duration,drugs_related_stop,district,Frisk
0,2005-01-04 12:55:00,M,White,Equipment,False,,Citation,False,0-15 Min,False,Zone X4,False
1,2005-01-23 23:15:00,M,White,Speeding,False,,Citation,False,0-15 Min,False,Zone K3,False


In [61]:
police_index = police.head(2).set_index('stop_datetime')
police_index

stop_datetime,driver_gender,driver_race,violation,search_conducted,search_type,stop_outcome,is_arrested,stop_duration,drugs_related_stop,district,Frisk
2005-01-04 12:55:00,M,White,Equipment,False,,Citation,False,0-15 Min,False,Zone X4,False
2005-01-23 23:15:00,M,White,Speeding,False,,Citation,False,0-15 Min,False,Zone K3,False


### reset_index()

In [62]:
police_index.reset_index()

Unnamed: 0,stop_datetime,driver_gender,driver_race,violation,search_conducted,search_type,stop_outcome,is_arrested,stop_duration,drugs_related_stop,district,Frisk
0,2005-01-04 12:55:00,M,White,Equipment,False,,Citation,False,0-15 Min,False,Zone X4,False
1,2005-01-23 23:15:00,M,White,Speeding,False,,Citation,False,0-15 Min,False,Zone K3,False


### MultiIndexing
Using more than one column as index levels.

In [108]:
#generate an example DF (copy so that the later inplace set_index does not act on a slice)
sample = police.loc[:5,'stop_datetime':'search_conducted'].copy()
sample

Unnamed: 0,stop_datetime,driver_gender,driver_race,violation,search_conducted
0,2005-01-04 12:55:00,M,White,Equipment,False
1,2005-01-23 23:15:00,M,White,Speeding,False
2,2005-02-17 04:15:00,M,White,Speeding,False
3,2005-02-20 17:15:00,M,White,Other,False
4,2005-02-24 01:20:00,F,White,Speeding,False
5,2005-03-14 10:00:00,F,White,Speeding,False


In [109]:
#setting multiindex
sample.set_index(['stop_datetime','violation'], inplace=True)

In [110]:
sample

stop_datetime,violation,driver_gender,driver_race,search_conducted
2005-01-04 12:55:00,Equipment,M,White,False
2005-01-23 23:15:00,Speeding,M,White,False
2005-02-17 04:15:00,Speeding,M,White,False
2005-02-20 17:15:00,Other,M,White,False
2005-02-24 01:20:00,Speeding,F,White,False
2005-03-14 10:00:00,Speeding,F,White,False


In [73]:
#access the index with two levels (stop_datetime and violation)
sample.index

MultiIndex([('2005-01-04 12:55:00', 'Equipment'),
            ('2005-01-23 23:15:00',  'Speeding'),
            ('2005-02-17 04:15:00',  'Speeding'),
            ('2005-02-20 17:15:00',     'Other'),
            ('2005-02-24 01:20:00',  'Speeding'),
            ('2005-03-14 10:00:00',  'Speeding')],
           names=['stop_datetime', 'violation'])
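
Selecting rows from a MultiIndex can be sketched on a toy DF. The `stops` column and all values are made up; sorting the index first keeps tuple lookups unambiguous and fast.

```python
import pandas as pd

# Toy DF with a two-level index (hypothetical data)
toy = pd.DataFrame({
    'district': ['K1', 'K1', 'K2', 'K2'],
    'violation': ['Speeding', 'Other', 'Speeding', 'Other'],
    'stops': [10, 3, 7, 5],
}).set_index(['district', 'violation']).sort_index()

# Select all rows for one outer-level label
k1 = toy.loc['K1']

# Select a single value with a full (outer, inner) tuple
k2_speeding = toy.loc[('K2', 'Speeding'), 'stops']

# Select by the inner level only with xs()
speeding = toy.xs('Speeding', level='violation')

print(k1)
print(k2_speeding)
print(speeding)
```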

#### swap levels

In [111]:
sample.swaplevel(0,1)

violation,stop_datetime,driver_gender,driver_race,search_conducted
Equipment,2005-01-04 12:55:00,M,White,False
Speeding,2005-01-23 23:15:00,M,White,False
Speeding,2005-02-17 04:15:00,M,White,False
Other,2005-02-20 17:15:00,M,White,False
Speeding,2005-02-24 01:20:00,F,White,False
Speeding,2005-03-14 10:00:00,F,White,False


### sort_index()

In [75]:
sample.sort_index(level='violation')

stop_datetime,violation,driver_gender,driver_race,search_conducted
2005-01-04 12:55:00,Equipment,M,White,False
2005-02-20 17:15:00,Other,M,White,False
2005-01-23 23:15:00,Speeding,M,White,False
2005-02-17 04:15:00,Speeding,M,White,False
2005-02-24 01:20:00,Speeding,F,White,False
2005-03-14 10:00:00,Speeding,F,White,False


### Datetime

In [89]:
#sample data
sample = police.loc[:6, 'stop_datetime':'search_conducted'].copy()

#set stop_datetime dtype to datetime
sample['stop_datetime'] = pd.to_datetime(sample.stop_datetime)

sample.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 5 columns):
stop_datetime       7 non-null datetime64[ns]
driver_gender       7 non-null object
driver_race         7 non-null object
violation           7 non-null object
search_conducted    7 non-null bool
dtypes: bool(1), datetime64[ns](1), object(3)
memory usage: 359.0+ bytes


In [94]:
sample.set_index('stop_datetime', inplace=True)
sample

stop_datetime,driver_gender,driver_race,violation,search_conducted
2005-01-04 12:55:00,M,White,Equipment,False
2005-01-23 23:15:00,M,White,Speeding,False
2005-02-17 04:15:00,M,White,Speeding,False
2005-02-20 17:15:00,M,White,Other,False
2005-02-24 01:20:00,F,White,Speeding,False
2005-03-14 10:00:00,F,White,Speeding,False
2005-03-29 21:55:00,M,White,Speeding,False


#### by date

In [98]:
#selecting by date (partial string: all rows in February 2005)
sample.loc['2005-02']

stop_datetime,driver_gender,driver_race,violation,search_conducted
2005-02-17 04:15:00,M,White,Speeding,False
2005-02-20 17:15:00,M,White,Other,False
2005-02-24 01:20:00,F,White,Speeding,False
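
Partial date strings also work as slice bounds on a DatetimeIndex. A minimal sketch on toy data (timestamps borrowed from the sample, trimmed to four rows):

```python
import pandas as pd

# Toy DF with a sorted DatetimeIndex
ts = pd.DataFrame(
    {'violation': ['Equipment', 'Speeding', 'Other', 'Speeding']},
    index=pd.to_datetime(['2005-01-04 12:55', '2005-02-17 04:15',
                          '2005-02-20 17:15', '2005-03-14 10:00']),
)

# All rows in February 2005
feb = ts.loc['2005-02']

# Slice between two dates; both ends are inclusive
jan_to_feb = ts.loc['2005-01-01':'2005-02-18']

print(len(feb), len(jan_to_feb))
```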


#### by time

In [99]:
#selecting by time
sample.between_time('12:00:00','22:00:00')

stop_datetime,driver_gender,driver_race,violation,search_conducted
2005-01-04 12:55:00,M,White,Equipment,False
2005-02-20 17:15:00,M,White,Other,False
2005-03-29 21:55:00,M,White,Speeding,False


#### Resampling

Resampling changes the frequency of a time series, e.g. aggregating daily data to weekly (downsampling).
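
A small-scale sketch of daily-to-weekly downsampling. The dates and counts are made up for illustration; `'W'` bins end on Sunday by default.

```python
import pandas as pd

# Hypothetical daily counts for two full weeks (Mon 2024-01-01 to Sun 2024-01-14)
daily = pd.Series(1, index=pd.date_range('2024-01-01', periods=14, freq='D'))

# Each weekly bin is labeled by the Sunday that closes it
weekly = daily.resample('W').sum()

print(weekly)
```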

In [101]:
#sample data
sample = police.copy()
sample['stop_datetime'] = pd.to_datetime(sample.stop_datetime)
sample.set_index('stop_datetime', inplace=True)
sample.head(2)

stop_datetime,driver_gender,driver_race,violation,search_conducted,search_type,stop_outcome,is_arrested,stop_duration,drugs_related_stop,district,Frisk
2005-01-04 12:55:00,M,White,Equipment,False,,Citation,False,0-15 Min,False,Zone X4,False
2005-01-23 23:15:00,M,White,Speeding,False,,Citation,False,0-15 Min,False,Zone K3,False


In [104]:
#calculate monthly searches conducted
sample.resample('M').search_conducted.sum() # 'M' stands for month end (spelled 'ME' in pandas 2.2+)

stop_datetime
2005-01-31     0.0
2005-02-28     0.0
2005-03-31     0.0
2005-04-30     0.0
2005-05-31     0.0
              ... 
2015-08-31    25.0
2015-09-30    18.0
2015-10-31    20.0
2015-11-30    11.0
2015-12-31    13.0
Freq: M, Name: search_conducted, Length: 132, dtype: float64

In [106]:
#maximum number of searches conducted in a single week
sample.resample('W').search_conducted.sum().max()

18.0

In [107]:
#same query for 2 week interval
sample.resample('2W').search_conducted.sum().max()

26.0

# Analysis

## Categorical 

## Frequency table
Cross-tabulate two categorical Series into a frequency table.

In [4]:
police = pd.read_csv('data/police_stops.csv', index_col=0, parse_dates=['stop_datetime'])
police.head(2)

stop_datetime,driver_gender,driver_race,violation,search_conducted,search_type,stop_outcome,is_arrested,stop_duration,drugs_related_stop,district,Frisk
2005-01-04 12:55:00,M,White,Equipment,False,,Citation,False,0-15 Min,False,Zone X4,False
2005-01-23 23:15:00,M,White,Speeding,False,,Citation,False,0-15 Min,False,Zone K3,False


Examine *violations* in different *districts*

In [5]:
pd.crosstab(index=police.violation, columns=police.district)

district,Zone K1,Zone K2,Zone K3,Zone X1,Zone X3,Zone X4
violation,,,,,,
Equipment,672,2061,2302,296,2049,3541
Moving violation,1254,2962,2898,671,3086,5353
Other,290,942,705,143,769,1560
Registration/plates,120,768,695,38,671,1411
Seat belt,0,481,638,74,820,843
Speeding,5960,10448,12322,1119,8779,9795
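
Raw counts are hard to compare across rows of different size. A sketch of row-wise shares with `normalize='index'`, on a toy DF with made-up values:

```python
import pandas as pd

# Toy data (hypothetical, not the police dataset)
toy = pd.DataFrame({
    'violation': ['Speeding', 'Speeding', 'Other', 'Speeding'],
    'district': ['K1', 'K2', 'K1', 'K1'],
})

# normalize='index' makes every row sum to 1
shares = pd.crosstab(index=toy.violation, columns=toy.district, normalize='index')

print(shares)
```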
