# (25') Using pandas

2019-05-20, Mauro Lepore (maurolepore@gmail.com)

# http://bit.ly/using-pandas

---

![](https://i.imgur.com/zsnUvnj.png)

> Module 1: Students will be familiarized with popular Python libraries that are used in Data Science

---

# Setup

* Create a new jupyter notebook.
* Issues? Fork this notebook: http://bit.ly/using-pandas-kaggle

# Requirements

* You already know the basics of [pandas](http://pandas.pydata.org/pandas-docs/stable/).

# Additional resources

* [pandas documentation](http://pandas.pydata.org/pandas-docs/stable/).
* [How to do stuff with pandas](http://nbviewer.jupyter.org/github/justmarkham/pandas-videos/blob/master/pandas.ipynb).
* Search for cheatsheets in the wild.

# Objectives

* Import and explore a real-world dataset with pandas, and use it to answer a data science question.

----

This material was adapted from the tutorial [_Using pandas for Better (and Worse) Data Science_](https://github.com/justmarkham/pycon-2018-tutorial), presented by [Kevin Markham](http://www.dataschool.io/about/) at PyCon on May 10, 2018. Kevin Markham is the founder of [Data School](http://www.dataschool.io/), an online school for learning data science with Python.

----

---
# Questions?
---

### Import

In [1]:
# Import pandas as usual

<details> 
  <summary>Hint</summary>
    <code>`import ____ as ____`</code> 
</details> 

<details> 
  <summary>Solution</summary> 
    <code>`import pandas as pd`</code> 
</details> 

`police.csv` contains data on traffic stops made by police in Rhode Island. It is adapted from the [Stanford Open Policing Project](https://openpolicing.stanford.edu/), and is available under the [Open Data Commons Attribution License](https://opendatacommons.org/licenses/by/summary/).

In [2]:
# Data: http://bit.ly/police-ri
# Horrible url: https://raw.githubusercontent.com/maurolepore/using-pandas/master/police.csv
# Use `ri` for Rhode Island

<details> 
  <summary>Hint</summary>
    <code>`ri = pd.____(url)`</code> 
</details> 
  
<details> 
  <summary>Solution</summary> 
    <code>`ri = pd.read_csv('https://raw.githubusercontent.com/maurolepore/using-pandas/master/police.csv')`</code> 
</details> 

In [3]:
# Explore the first few rows. What does each row represent?

<details> 
  <summary>Hint</summary>
    <code>`ri.____()`</code> 
</details> 
  
<details> 
  <summary>Solution</summary> 
    <code>`ri.head()`</code> 
</details> 

In [4]:
# How many columns and rows does the data contain?

<details> 
  <summary>Hint</summary>
    <code>`____.shape`</code> 
</details> 
  
<details> 
  <summary>Solution</summary> 
    <code>`ri.shape`</code> 
</details> 

In [5]:
# What types of data does this dataset have? What do they mean?

<details> 
  <summary>Hint</summary>
    <code>`____.dtypes`</code> 
</details> 
  
<details> 
  <summary>Solution</summary> 
    <code>`ri.dtypes`</code> 
</details> 

* What does `NaN` mean?
* Why might a value be missing?
* Why mark missing data as `NaN`? Why not `0`, `' '`, or `'Unknown'`?
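As a minimal sketch of why the marker matters, the example below compares a mean computed with `NaN` against one computed with a `0` placeholder. The `search_duration` values are made up for illustration and are not taken from `police.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements; the third one was never recorded.
with_nan = pd.Series([10.0, 20.0, np.nan], name="search_duration")
with_zero = with_nan.fillna(0)

# pandas skips NaN by default, so the mean uses only the observed values.
mean_nan = with_nan.mean()    # (10 + 20) / 2
# A 0 placeholder counts as a real observation and drags the mean down.
mean_zero = with_zero.mean()  # (10 + 20 + 0) / 3

# NaN also keeps the missing value countable.
n_missing = with_nan.isnull().sum()

print(mean_nan, mean_zero, n_missing)
```

With `NaN`, the missing value stays visible and is excluded from the statistic; with `0`, it silently becomes data.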

In [6]:
# How many missing values are there in each column?

<details> 
  <summary>Hint</summary>
    <code>`ri.isnull().____()`</code> 
</details> 
  
<details> 
  <summary>Solution</summary> 
    <code>`ri.isnull().sum()`</code> 
</details> 

---
# Questions?
---

### Tidy

In [7]:
# Drop the column that contains only missing values

<details> 
  <summary>Hint</summary>
    <code>`ri.drop('____', axis='____', inplace=____)`</code> 
</details> 
  
<details> 
  <summary>Solution</summary> 
    <code>
        ```
        ri.drop('county_name', axis='columns', inplace=True)
        # Same result: drop every column that contains only missing values
        ri.dropna(axis='columns', how='all', inplace=True)
        ```
    </code>
</details> 

In [8]:
# Confirm that the column is gone

<details> 
  <summary>Hint</summary>
    <code>`ri.____`</code> 
</details> 
  
<details> 
  <summary>Solution</summary> 
    <code>`ri.shape`</code> 
</details> 

---
Takeaways:

* Pay attention to default arguments.
* Check that your code did what you expected.
* There is more than one way to do everything.
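The first two points can be illustrated with a small sketch on a toy DataFrame. The column names `stop_id` and `empty` are hypothetical and chosen only for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; `empty` contains only missing values.
df = pd.DataFrame({"stop_id": [1, 2, 3], "empty": [np.nan, np.nan, np.nan]})

# By default inplace=False: drop() returns a new DataFrame and leaves df alone.
dropped = df.drop("empty", axis="columns")
unchanged = "empty" in df.columns   # df still has the column

# With inplace=True, df itself is modified and drop() returns None.
result = df.drop("empty", axis="columns", inplace=True)
gone = "empty" not in df.columns    # now the column is really gone

print(dropped.shape, unchanged, result, gone, df.shape)
```

Checking `df.columns` (or `df.shape`) after each step is a cheap way to confirm that the call did what you expected.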

---
# Questions?
---

### Transform

#### Do males or females speed more often?

Pay special attention to the columns `violation` and `driver_gender`.

There are at least two ways to understand and answer this question:

#### 1. When someone is stopped for speeding, how often is it a male or a female?

In [9]:
# 1. Pick speeding rows
# 2. Select gender column
# 3. Count values by gender (as a proportion)

<details> 
  <summary>Hint</summary>
    <code>
        ```
        speeding = ri[ri.____ == 'Speeding']
        ____.____.value_counts(normalize=True)
        ```
    </code> 
</details> 
  
<details> 
  <summary>Solution</summary> 
    <code>
        ```
        speeding = ri[ri.violation == 'Speeding']
        speeding.driver_gender.value_counts(normalize=True)
        ```
    </code> 
</details> 

Does this prove that males/females speed more?

#### 2. When a male is pulled over, how often is it for speeding? (Repeat for females.)

In [10]:
# 1. Group the data by gender
# 2. Select the violation column
# 3. Count values by violation (as a proportion)

<details> 
  <summary>Hint</summary>
    <code>`ri.groupby('____').____.value_counts(normalize=True)`</code> 
</details> 
  
<details> 
  <summary>Solution</summary> 
    <code>`ri.groupby('driver_gender').violation.value_counts(normalize=True)`</code> 
</details> 

Are the two answers consistent?

Takeaway:

* There is more than one way to understand and answer a question.
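To see how the two readings of the question can point in different directions, here is a sketch on invented data. The eight rows below are hypothetical and are not drawn from `police.csv`; only the column names match the dataset.

```python
import pandas as pd

# Hypothetical stops: 6 males (3 speeding, 3 equipment) and 2 females (both speeding).
toy = pd.DataFrame({
    "driver_gender": ["M"] * 6 + ["F"] * 2,
    "violation": ["Speeding"] * 3 + ["Equipment"] * 3 + ["Speeding"] * 2,
})

# Reading 1: among speeding stops, what share belongs to each gender?
speeding = toy[toy.violation == "Speeding"]
by_gender = speeding.driver_gender.value_counts(normalize=True)

# Reading 2: within each gender, what share of stops is for speeding?
by_violation = toy.groupby("driver_gender").violation.value_counts(normalize=True)

print(by_gender["M"], by_gender["F"])
print(by_violation.loc[("M", "Speeding")], by_violation.loc[("F", "Speeding")])
```

In this toy data, males account for most speeding stops, yet a larger share of female stops is for speeding. Neither number alone says who speeds more, because the two groups are stopped at different rates.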

### Communicate

```
git add .
git commit -m "End demo"
git push
```

[See this report on GitHub](https://github.com/maurolepore/using-pandas/blob/master/using-pandas.ipynb)

---
# Questions?
---

# [Poll](https://github.com/maurolepore/using-pandas/issues/1)