# (25') Using pandas

### [GitHub repository](https://github.com/maurolepore/using-pandas)  |  [Feedback](https://github.com/maurolepore/using-pandas/issues/1)

### Setup

* Create a new Jupyter notebook in the same directory as the data.

### Curriculum

Module 1 > Python for data science

> Students will be familiarized with popular Python libraries that are used in Data Science, such as Pandas.

![](https://i.imgur.com/zsnUvnj.png)

### Requirements

* You already know the basics of [pandas](http://pandas.pydata.org/pandas-docs/stable/).

### Additional resources

* [How to do stuff with pandas](http://nbviewer.jupyter.org/github/justmarkham/pandas-videos/blob/master/pandas.ipynb).
* Cheatsheets.

### Objectives

* Import and explore a real-world dataset with pandas, then use it to answer a data science question.

----

This material was adapted from the tutorial [_Using pandas for Better (and Worse) Data Science_](https://github.com/justmarkham/pycon-2018-tutorial), presented by [Kevin Markham](http://www.dataschool.io/about/) at PyCon on May 10, 2018. Kevin Markham is the founder of [Data School](http://www.dataschool.io/), an online school for learning data science with Python.

----

### Import

In [1]:
# Import pandas as usual
import pandas as pd

`police.csv` contains data on traffic stops made by police in Rhode Island. It is adapted from the [Stanford Open Policing Project](https://openpolicing.stanford.edu/) and is available under the [Open Data Commons Attribution License](https://opendatacommons.org/licenses/by/summary/).

In [16]:
url = 'https://raw.githubusercontent.com/maurolepore/using-pandas/master/police.csv'
# ri stands for Rhode Island
# ri = pd.____(url)
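
One possible way to fill in the blank, assuming the file is a plain comma-separated file, is `pd.read_csv`, which accepts a URL, a local path, or a file-like object. The sketch below reads a tiny made-up sample from memory so that it runs offline; `sample_csv`, `sample`, and the three rows are hypothetical and only mimic a few columns of `police.csv`.

```python
import io

import pandas as pd

# Hypothetical three-row sample that mimics a few columns of police.csv
sample_csv = io.StringIO(
    "stop_date,driver_gender,violation\n"
    "2005-01-02,M,Speeding\n"
    "2005-01-18,F,Speeding\n"
    "2005-02-20,M,Other\n"
)

# read_csv accepts a URL, a local path, or a file-like object
sample = pd.read_csv(sample_csv)
print(sample)
```

Passing `url` instead of `sample_csv` should load the full dataset in the same way.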

In [18]:
# Explore the first few rows. What does each row represent?

In [19]:
# How many columns and rows does the data contain?

In [20]:
# What types of data does this dataset have? What do they mean?
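
The three exploration questions above map onto `.head()`, `.shape`, and `.dtypes`. Here is a minimal sketch on a made-up miniature of the data; `demo` and its values are hypothetical, so try the same calls on `ri` to answer the questions for real.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the traffic-stop data
demo = pd.DataFrame({
    "driver_gender": ["M", "F", "M", None],
    "driver_age": [20.0, 40.0, np.nan, 33.0],
    "is_arrested": [False, False, True, False],
})

first_rows = demo.head(2)    # the first two rows
n_rows, n_cols = demo.shape  # (number of rows, number of columns)
column_types = demo.dtypes   # one dtype per column

print(first_rows, n_rows, n_cols, column_types, sep="\n")
```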

* What does `NaN` mean?
* Why might a value be missing?
* Why mark missing data as `NaN`? Why not `0`, `' '`, or `'Unknown'`?
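
One way to see why `0` is a poor stand-in for a missing value: pandas skips `NaN` when it aggregates, whereas a `0` is treated as a real observation. The sketch below uses made-up ages (`ages_nan` and `ages_zero` are hypothetical) to show the effect on a mean.

```python
import numpy as np
import pandas as pd

# Same two known ages; the third is missing (NaN) or wrongly coded as 0
ages_nan = pd.Series([20.0, 40.0, np.nan])
ages_zero = pd.Series([20.0, 40.0, 0.0])

mean_nan = ages_nan.mean()    # NaN is skipped: (20 + 40) / 2
mean_zero = ages_zero.mean()  # 0 counts as data: (20 + 40 + 0) / 3

print(mean_nan, mean_zero)
```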

In [21]:
# How many missing values are there in each column?
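
A common idiom for counting missing values per column is to chain `.isnull()` with `.sum()`, because `True` counts as 1 when summed. This is a sketch on a hypothetical frame (`demo` is made up); the same chain should work on `ri`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one column fully missing, one partly, one complete
demo = pd.DataFrame({
    "driver_gender": ["M", None, "F"],
    "search_type": [np.nan, np.nan, np.nan],
    "is_arrested": [False, True, False],
})

# isnull() marks each missing cell as True; sum() adds them up per column
missing_per_column = demo.isnull().sum()
print(missing_per_column)
```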

### Tidy

In [8]:
# Drop the county_name column; inplace=True modifies ri directly
ri.drop('county_name', axis='columns', inplace=True)

In [9]:
# Confirm with .shape
ri.shape

(91741, 14)

In [10]:
# Confirm with .columns
ri.columns

Index(['stop_date', 'stop_time', 'driver_gender', 'driver_age_raw',
       'driver_age', 'driver_race', 'violation_raw', 'violation',
       'search_conducted', 'search_type', 'stop_outcome', 'is_arrested',
       'stop_duration', 'drugs_related_stop'],
      dtype='object')

How else could you do the same? (The Tab key may help you discover useful methods.)

In [11]:
# Drop every column in which all values are missing
ri.dropna(axis='columns', how='all', inplace=True)
ri.shape

(91741, 14)

Takeaways:

* Pay attention to default arguments.
* Check that your code did what you expected.
* There is more than one way to do everything.
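
To illustrate the point about default arguments: `dropna` behaves as `how='any'` when `how` is not given, so leaving it out would also drop columns that are only partly missing. The sketch below uses a hypothetical frame (`demo` and its values are made up) and returns new frames instead of using `inplace=True`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with an empty column and a partly missing column
demo = pd.DataFrame({
    "county_name": [np.nan, np.nan, np.nan],     # entirely missing
    "driver_age": [20.0, np.nan, 33.0],          # partly missing
    "violation": ["Speeding", "Other", "Speeding"],
})

# how='all' drops only the columns in which every value is missing
only_empty_dropped = demo.dropna(axis="columns", how="all")

# Without how, any column containing at least one NaN is dropped
any_missing_dropped = demo.dropna(axis="columns")

print(list(only_empty_dropped.columns))
print(list(any_missing_dropped.columns))
```

Checking `.shape` or `.columns` afterwards, as in the cells above, is a quick way to catch this kind of surprise.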

### Transform

#### Do males or females speed more often?

* Pay special attention to the columns 'violation' and 'driver_gender'.
* There are at least two ways to understand and answer this question.

In [12]:
# Let's see the first few rows again
ri.head()

| | stop_date | stop_time | driver_gender | driver_age_raw | driver_age | driver_race | violation_raw | violation | search_conducted | search_type | stop_outcome | is_arrested | stop_duration | drugs_related_stop |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2005-01-02 | 01:55 | M | 1985.0 | 20.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False |
| 1 | 2005-01-18 | 08:15 | M | 1965.0 | 40.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False |
| 2 | 2005-01-23 | 23:15 | M | 1972.0 | 33.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False |
| 3 | 2005-02-20 | 17:15 | M | 1986.0 | 19.0 | White | Call for Service | Other | False | NaN | Arrest Driver | True | 16-30 Min | False |
| 4 | 2005-03-14 | 10:00 | F | 1984.0 | 21.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False |


#### 2.a. When someone is stopped for speeding, how often is it a male or a female?

In [13]:
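# Keep only the stops made for speeding
speeding = ri[ri.violation == 'Speeding']
# normalize=True returns proportions rather than raw counts
count_by_gender = speeding.driver_gender.value_counts(normalize=True)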
count_by_gender

M    0.680527
F    0.319473
Name: driver_gender, dtype: float64

Does this prove that males/females speed more?

#### 2.b. When a male is pulled over, how often is it for speeding? (Repeat for females.)

In [14]:
# Try using .groupby()
ri.groupby('driver_gender').violation.value_counts(normalize=True)

driver_gender  violation          
F              Speeding               0.658500
               Moving violation       0.136277
               Equipment              0.105780
               Registration/plates    0.043086
               Other                  0.029348
               Seat belt              0.027009
M              Speeding               0.524350
               Moving violation       0.207012
               Equipment              0.135671
               Other                  0.057668
               Registration/plates    0.038461
               Seat belt              0.036839
Name: violation, dtype: float64

Is this result consistent with 2.a.?

Takeaway:

* There is more than one way to understand and answer a question.
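
The two framings (2.a. and 2.b.) can be placed side by side with `pd.crosstab`, normalizing within each violation or within each gender. This is only a sketch on made-up stops (`demo` and its six rows are hypothetical), chosen so that the two answers visibly differ.

```python
import pandas as pd

# Hypothetical stops: 4 males and 2 females, half of each stopped for speeding
demo = pd.DataFrame({
    "driver_gender": ["M", "M", "M", "M", "F", "F"],
    "violation": ["Speeding", "Speeding", "Other", "Other", "Speeding", "Other"],
})

# 2.a. framing: of all speeding stops, what share is each gender?
by_violation = pd.crosstab(
    demo["driver_gender"], demo["violation"], normalize="columns"
)

# 2.b. framing: of all stops of each gender, what share is for speeding?
by_gender = pd.crosstab(
    demo["driver_gender"], demo["violation"], normalize="index"
)

print(by_violation)
print(by_gender)
```

In this toy data, males account for two thirds of the speeding stops only because more males were stopped overall; within each gender, half of the stops are for speeding.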

### Communicate

```
git add .
git commit -m "End demo"
git push
```

[Now see this report on GitHub](https://github.com/maurolepore/using-pandas/blob/master/using-pandas.ipynb)