<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# `pandas` Data Munging Overview: Part 2


---

### Lesson Guide
- [Exercise #3](#exercise-3)
- [Split-Apply-Combine](#split-apply-combine)
    - [`.groupby()`](#groupby)
    - [Apply Functions to Groups and Combine](#apply-combine)
- [Exercise #4](#exercise-4)
- [Indexing](#indexing)
    - [Location Indexing With `.loc`](#loc)
    - [Position Indexing With `.iloc`](#iloc)
- [Other Frequently Used Features](#frequent)
    - [Using Map Functions With Replacement Dictionaries](#map-dict)
    - [Encoding Strings as Integers With `.factorize()`](#factorize)
    - [Determining Unique Values](#unique)
    - [Replacing Values With `.replace()`](#replace)
    - [Series String Methods With `.str`](#series-str)
    - [Datetime Conversion and Arithmetic](#datetime)
    - [Setting and Resetting the Index](#set-reset-index)
    - [Sorting by Index](#sort-by-index)
    - [Changing the Data Type of a Column](#change-dtype)
    - [Creating Dummy-Coded Columns](#dummy)
    - [Concatenating DataFrames](#concatenate)
    - [Detecting and Dropping Duplicate Rows](#duplicate-rows)
    - [Writing a DataFrame to a `.csv`](#write-csv)
    - [Pickling a DataFrame](#pickle)
    - [Randomly Sampling a DataFrame](#sample)
- [Infrequently Used Features](#infrequent)
    - [Creating DataFrames From Dictionaries and Lists of Lists](#toy-dataframes)
    - [Performing Cross-Tabulations](#crosstab)
    - [Query-Filtering Syntax](#query)
    - [Calculating Memory Usage](#memory-usage)
    - [Converting Column to Category Type](#category-type)
    - [Creating Columns With `.assign()`](#assign)
    - [Limiting the Number of Rows to Load in a File Read](#limit-rows-read)
    - [Manually Setting the Number of Rows and Columns to Print](#manual-print)

In [2]:
import pandas as pd

<a id='exercise-3'></a>
## Exercise #3

---

**Using the UFO data provided below:**
1. Read in the data.
2. Check the shape and describe the columns.
3. Find the four most frequently reported colors.
4. Find the most frequent city for reports in state `VA`.
5. Find only UFO reports from Arlington, VA.
6. Find the number of missing values in each column.
7. Show only UFO reports where `city` is missing.
8. Count the number of rows with no null values.
9. Amend column names with spaces to have underscores.
10. Make a new column that is a combination of `city` and `state`.

In [3]:
ufo_csv = '../../../../resource-datasets/ufo_sightings/ufo.csv'

In [4]:
# Reuse the path defined in the previous cell.
ufo_data = pd.read_csv(ufo_csv)

In [16]:
print(ufo_data.shape)
print(ufo_data.describe())

(80543, 5)
           City Colors Reported Shape Reported  State            Time
count     80496           17034          72141  80543           80543
unique    13504              31             27     52           68901
top     Seattle          ORANGE          LIGHT     CA  7/4/2014 22:00
freq        646            5216          16332  10743              45


In [None]:
#renamed_drinks = drinks.rename(columns={'beer_servings':'beer', 'wine_servings':'wine'})

In [26]:
ufo_data = ufo_data.rename(columns={"Colors Reported":"colors","Shape Reported":"shape"})

In [27]:
ufo_data.head()

Unnamed: 0,City,colors,shape,State,Time
0,Ithaca,,TRIANGLE,NY,6/1/1930 22:00
1,Willingboro,,OTHER,NJ,6/30/1930 20:00
2,Holyoke,,OVAL,CO,2/15/1931 14:00
3,Abilene,,DISK,KS,6/1/1931 13:00
4,New York Worlds Fair,,LIGHT,NY,4/18/1933 19:00


In [32]:
ufo_data["shape"].value_counts().head()

LIGHT       16332
TRIANGLE     7816
CIRCLE       7725
FIREBALL     6249
OTHER        5506
Name: shape, dtype: int64

In [34]:
ufo_data["State"].unique()

array(['NY', 'NJ', 'CO', 'KS', 'ND', 'CA', 'MI', 'AK', 'OR', 'AL', 'SC',
       'IA', 'GA', 'TN', 'NE', 'LA', 'KY', 'WV', 'NM', 'UT', 'RI', 'FL',
       'VA', 'NC', 'TX', 'WA', 'ME', 'IL', 'AZ', 'OH', 'PA', 'MN', 'WI',
       'MD', 'SD', 'NV', 'ID', 'MO', 'OK', 'IN', 'CT', 'MS', 'AR', 'WY',
       'MA', 'MT', 'DE', 'NH', 'VT', 'HI', 'Ca', 'Fl'], dtype=object)

In [36]:
mask_va=ufo_data[ufo_data["State"]=="VA"]

In [43]:
mask_va["City"].value_counts().head()

Virginia Beach    110
Richmond           92
Alexandria         48
Roanoke            35
Chesapeake         33
Name: City, dtype: int64

In [45]:
ufo_data.head()

Unnamed: 0,City,colors,shape,State,Time
0,Ithaca,,TRIANGLE,NY,6/1/1930 22:00
1,Willingboro,,OTHER,NJ,6/30/1930 20:00
2,Holyoke,,OVAL,CO,2/15/1931 14:00
3,Abilene,,DISK,KS,6/1/1931 13:00
4,New York Worlds Fair,,LIGHT,NY,4/18/1933 19:00


In [48]:
# Pattern: df[df["col"] > 20][["col3", "col5"]]

ufo_data[(ufo_data["State"]=="VA")&(ufo_data["City"]=="Arlington")].head()

Unnamed: 0,City,colors,shape,State,Time
202,Arlington,GREEN,OVAL,VA,7/13/1952 21:00
6300,Arlington,,CHEVRON,VA,5/5/1990 21:40
10278,Arlington,,DISK,VA,5/27/1997 15:30
14527,Arlington,,OTHER,VA,9/10/1999 21:41
17984,Arlington,RED,DISK,VA,11/19/2000 22:00


In [50]:
ufo_data.isnull().sum()

City         47
colors    63509
shape      8402
State         0
Time          0
dtype: int64

In [53]:
ufo_data.shape[0]

80543

In [63]:
ufo_data[~ufo_data["City"].isnull()].head()

Unnamed: 0,City,colors,shape,State,Time
0,Ithaca,,TRIANGLE,NY,6/1/1930 22:00
1,Willingboro,,OTHER,NJ,6/30/1930 20:00
2,Holyoke,,OVAL,CO,2/15/1931 14:00
3,Abilene,,DISK,KS,6/1/1931 13:00
4,New York Worlds Fair,,LIGHT,NY,4/18/1933 19:00


In [64]:
# `.isnull` is a method, so it must be called with parentheses to produce the Boolean mask.
ufo_data[ufo_data.City.isnull()].head()

In [60]:
#users[users.occupation.isin(['doctor', 'lawyer'])].head()
# `.isin()` expects a list-like, and the string "NaN" is not a missing value;
# use `.isnull()` to find rows that contain at least one null.
ufo_data[ufo_data.isnull().any(axis=1)].head()
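Steps 7 through 10 of the exercise can be sketched on a small stand-in frame. The frame `toy_ufo`, its values, and the new column name `Location` are assumptions made for illustration; they are not taken from `ufo.csv`.

```python
import pandas as pd

# Hypothetical stand-in for the UFO data; the values are invented.
toy_ufo = pd.DataFrame({
    "City": ["Ithaca", None, "Arlington", "Arlington"],
    "Colors Reported": [None, "RED", "GREEN", None],
    "State": ["NY", "VA", "VA", "TX"],
})

# Step 7: rows where City is missing (note the parentheses on .isnull()).
missing_city = toy_ufo[toy_ufo["City"].isnull()]

# Step 8: number of rows with no null value in any column.
complete_rows = toy_ufo.dropna().shape[0]

# Step 9: replace spaces in column names with underscores.
toy_ufo.columns = toy_ufo.columns.str.replace(" ", "_", regex=False)

# Step 10: combine City and State into one new column.
toy_ufo["Location"] = toy_ufo["City"] + ", " + toy_ufo["State"]
```

Rows with a missing `City` end up with a missing `Location` as well, because string concatenation propagates nulls.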

<a id='split-apply-combine'></a>
## Split-Apply-Combine

---

![](assets/split_apply_combine.png)

<a id='groupby'></a>
### `.groupby()`

**Q.1** Using the `drinks` DataFrame, calculate the mean `beer` servings by continent.

In [5]:
drinks = pd.read_csv('../../../../resource-datasets/alcohol_by_country/drinks.csv')

In [8]:
continent = drinks.continent.unique()

In [9]:
drinks.head()

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
0,Afghanistan,0,0,0,0.0,AS
1,Albania,89,132,54,4.9,EU
2,Algeria,25,0,14,0.7,AF
3,Andorra,245,138,312,12.4,EU
4,Angola,217,57,45,5.9,AF


In [11]:
drinks.beer_servings.mean()

106.16062176165804

In [12]:
mask_as = drinks[drinks["continent"] == "AS"]

In [15]:
mask_as.beer_servings.mean()

37.04545454545455

**Q.2** Describe the `beer` column by continent.

In [6]:
# A:
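One way to approach both questions is to let `.groupby()` do the splitting instead of building one mask per continent. The sketch below uses an invented miniature frame, `toy_drinks`, rather than the real `drinks.csv`.

```python
import pandas as pd

# Hypothetical miniature of the drinks data; values are invented.
toy_drinks = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "beer_servings": [10, 30, 100, 200],
    "continent": ["AS", "AS", "EU", "EU"],
})

# Split by continent, apply mean to one column, combine into a Series.
beer_mean = toy_drinks.groupby("continent")["beer_servings"].mean()

# describe() returns count, mean, std, min, quartiles, and max per group.
beer_summary = toy_drinks.groupby("continent")["beer_servings"].describe()
```

`beer_mean["AS"]` here plays the same role as `mask_as.beer_servings.mean()` in the cells above, but every continent is computed in one pass.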

<a id='apply-combine'></a>
### Apply Functions to Groups and Combine

**Q.1** Find the `count`, `mean`, `minimum`, and `maximum` of the `beer` column by continent.

In [7]:
# A:

**Q.2** Perform the same task as in Q.1, but now sort the output by the `mean` column.

In [8]:
# A:
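A possible shape for Q.1 and Q.2 together, assuming a small invented frame (`toy_drinks` is hypothetical): `.agg()` accepts a list of function names, and the resulting columns can be sorted like any other DataFrame.

```python
import pandas as pd

# Hypothetical data; the numbers are invented for illustration.
toy_drinks = pd.DataFrame({
    "beer_servings": [25, 217, 0, 10, 89, 245],
    "continent": ["AF", "AF", "AS", "AS", "EU", "EU"],
})

# Apply several aggregations to each group at once.
beer_stats = (
    toy_drinks.groupby("continent")["beer_servings"]
    .agg(["count", "mean", "min", "max"])
)

# Sort the combined result by the "mean" column (ascending by default).
beer_stats_sorted = beer_stats.sort_values("mean")
```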

**Q.3** Apply a custom function to all columns of the `drinks` DataFrame, grouping by continent.

In [9]:
# A:
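As an illustration of a custom function, the sketch below computes the range (maximum minus minimum) of each numeric column per continent. The function name `value_range` and the toy data are assumptions, and the numeric columns are selected explicitly so the subtraction is never attempted on strings.

```python
import pandas as pd

# Hypothetical data; values are invented.
toy_drinks = pd.DataFrame({
    "beer_servings": [25, 217, 89, 245],
    "wine_servings": [14, 45, 54, 312],
    "continent": ["AF", "AF", "EU", "EU"],
})


def value_range(column):
    """Return the spread of one column within one group."""
    return column.max() - column.min()


# The function is called once per column per group.
ranges = (
    toy_drinks.groupby("continent")[["beer_servings", "wine_servings"]]
    .agg(value_range)
)
```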

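**Q.4** **Note:** If you don't select a column before aggregating, the function is applied to every non-grouping column. Older pandas versions silently dropped non-numeric columns; from pandas 2.0 onward, pass `numeric_only=True` to aggregations such as `.mean()` to restrict them to numeric columns.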

In [10]:
# A:
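A short sketch of aggregating without selecting a column first, on an invented frame (`toy_drinks` is hypothetical). Passing `numeric_only=True` keeps the string column `country` out of the mean.

```python
import pandas as pd

# Hypothetical data with one non-numeric column besides the grouping key.
toy_drinks = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "beer_servings": [10, 30, 100, 200],
    "continent": ["AS", "AS", "EU", "EU"],
})

# No column selected: the mean is taken over every numeric column.
group_means = toy_drinks.groupby("continent").mean(numeric_only=True)
```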

<a id='exercise-4'></a>

## Exercise #4

---

**Using the `users` DataFrame**:
1. Count the number of distinct occupations in `users`.
2. Calculate the mean age by occupation.
3. Calculate the minimum and maximum age by occupation.
4. Calculate the mean age by cross-sections of `occupation` and `gender`.

> **Tip**: Multiple columns can be passed to the `.groupby()` function for more granular cross-sections.

In [11]:
# A:
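The four tasks could look like the following on a stand-in for `users`. The frame `users_toy` and its contents are invented; only the column names `occupation`, `gender`, and `age` are implied by the exercise text.

```python
import pandas as pd

# Hypothetical stand-in for the users DataFrame.
users_toy = pd.DataFrame({
    "occupation": ["doctor", "doctor", "lawyer", "lawyer", "lawyer"],
    "gender": ["F", "M", "F", "F", "M"],
    "age": [30, 40, 25, 35, 45],
})

# 1. Number of distinct occupations.
n_occupations = users_toy["occupation"].nunique()

# 2. Mean age by occupation.
mean_age = users_toy.groupby("occupation")["age"].mean()

# 3. Minimum and maximum age by occupation.
age_range = users_toy.groupby("occupation")["age"].agg(["min", "max"])

# 4. Mean age for each (occupation, gender) pair.
mean_age_by_both = users_toy.groupby(["occupation", "gender"])["age"].mean()
```

Grouping by a list of columns produces a result indexed by both keys, so a single cell is addressed with a tuple such as `("lawyer", "F")`.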

<a id='indexing'></a>
## Indexing

---
<a id='loc'></a>
### Location Indexing With `.loc`

**Q.1** Select all rows and the `city` column from the UFO data set using `.loc`. Note that `.loc` is an indexer, so it takes square brackets rather than parentheses.

In [12]:
# A:

**Q.2** Select all rows and columns in `city` and `state`.

In [13]:
# A:

**Q.3** Select all rows and columns from `city` *through* `state`.

In [14]:
# A:

**Q.4** Select:
- All columns at row 0.
- All columns at rows 0:2.
- Columns `city` through `state` at rows 0:2.

In [15]:
# A:
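A sketch of the label-based selections asked for in this section, using the first four rows shown in the `ufo_data.head()` output above as a hand-typed frame (`toy_ufo` is an assumed name). With `.loc`, both ends of a slice are included.

```python
import pandas as pd

# Rows copied by hand from the head() output earlier in the notebook.
toy_ufo = pd.DataFrame({
    "City": ["Ithaca", "Willingboro", "Holyoke", "Abilene"],
    "colors": [None, None, None, None],
    "shape": ["TRIANGLE", "OTHER", "OVAL", "DISK"],
    "State": ["NY", "NJ", "CO", "KS"],
    "Time": ["6/1/1930 22:00", "6/30/1930 20:00",
             "2/15/1931 14:00", "6/1/1931 13:00"],
})

city_col = toy_ufo.loc[:, "City"]                    # one column, all rows
city_and_state = toy_ufo.loc[:, ["City", "State"]]   # a list of columns
city_through_state = toy_ufo.loc[:, "City":"State"]  # a label slice, inclusive
row_0 = toy_ufo.loc[0, :]                            # all columns at row 0
rows_0_to_2 = toy_ufo.loc[0:2, :]                    # rows 0, 1, and 2
subset = toy_ufo.loc[0:2, "City":"State"]            # both slices combined
```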

<a id='iloc'></a>
### Position Indexing With `.iloc`

**Q.1** Select all rows and columns in position 0 and 3.

In [16]:
# A:

**Q.2** Select all rows and columns in positions 0 through 4.

In [17]:
# A:

**Q.3** Select rows in positions 0:3, along with all columns.

In [18]:
# A:
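The positional counterparts might look like this on a small invented frame (`toy` is hypothetical). Unlike `.loc`, an `.iloc` slice excludes its end position, so `0:4` covers positions 0, 1, 2, and 3.

```python
import pandas as pd

# Hypothetical five-column frame; the values are placeholders.
toy = pd.DataFrame({
    "City": ["Ithaca", "Willingboro", "Holyoke", "Abilene"],
    "colors": [None, None, None, None],
    "shape": ["TRIANGLE", "OTHER", "OVAL", "DISK"],
    "State": ["NY", "NJ", "CO", "KS"],
    "Time": ["t0", "t1", "t2", "t3"],
})

cols_0_and_3 = toy.iloc[:, [0, 3]]   # columns at positions 0 and 3
cols_0_to_3 = toy.iloc[:, 0:4]       # positions 0 up to, not including, 4
first_three_rows = toy.iloc[0:3, :]  # rows at positions 0, 1, and 2
```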

<a id='frequent'></a>
## Other Frequently Used Features

---
<a id='map-dict'></a>
### Using Map Functions With Replacement Dictionaries

In [19]:
# A:
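A minimal sketch of `.map()` with a replacement dictionary. The Series and the dictionary `continent_names` are invented for illustration.

```python
import pandas as pd

continents = pd.Series(["AS", "EU", "AF", "EU"])

# Hypothetical lookup table; "AF" is deliberately left out.
continent_names = {"AS": "Asia", "EU": "Europe"}

# Each value is looked up in the dictionary; keys that are absent become NaN.
mapped = continents.map(continent_names)
```

Because unmatched values turn into `NaN`, `.map()` suits cases where the dictionary is meant to cover every value.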

<a id='factorize'></a>
### Encoding Strings as Integers With `.factorize()`

In [20]:
# A:
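As a rough illustration, `pd.factorize()` returns an array of integer codes plus the unique values those codes refer to, numbered in order of first appearance. The input Series is invented.

```python
import pandas as pd

continents = pd.Series(["EU", "AS", "EU", "AF"])

# codes[i] is the position of continents[i] within uniques.
codes, uniques = pd.factorize(continents)
```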

<a id='unique'></a>
### Determining Unique Values

In [21]:
# A:

<a id='replace'></a>
### Replacing Values With `.replace()`

In [22]:
# A:
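The `State` values listed earlier include the mixed-case codes `'Ca'` and `'Fl'`, which is the kind of problem `.replace()` handles. The sketch below works on an invented Series rather than the real column.

```python
import pandas as pd

states = pd.Series(["CA", "Ca", "FL", "Fl", "NY"])

# Only values that appear as dictionary keys change; everything else is kept.
cleaned = states.replace({"Ca": "CA", "Fl": "FL"})
```

Unlike `.map()`, values that are missing from the dictionary pass through unchanged.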

<a id='series-str'></a>
### Series String Methods With `.str`

In [23]:
# A:
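A few common `.str` methods, sketched on an invented Series that includes one missing value:

```python
import pandas as pd

cities = pd.Series(["Ithaca", "New York Worlds Fair", None])

# Vectorized string methods skip over missing values.
upper = cities.str.upper()

# na=False makes missing entries count as "no match" in the Boolean result.
has_york = cities.str.contains("York", na=False)

# regex=False treats the pattern as a literal string.
underscored = cities.str.replace(" ", "_", regex=False)
```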

<a id='datetime'></a>
### Datetime Conversion and Arithmetic

In [24]:
# A:
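The `Time` column holds strings such as `6/1/1930 22:00`. One way to convert them, assuming that month/day/year layout holds throughout, is `pd.to_datetime()` with an explicit format; the two timestamps below are copied from the `head()` output.

```python
import pandas as pd

times = pd.Series(["6/1/1930 22:00", "6/30/1930 20:00"])

# Parse the strings into datetime64 values.
parsed = pd.to_datetime(times, format="%m/%d/%Y %H:%M")

# The .dt accessor exposes components such as the year.
years = parsed.dt.year

# Subtracting two timestamps yields a Timedelta.
gap = parsed.iloc[1] - parsed.iloc[0]
```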

<a id='set-reset-index'></a>
### Setting and Resetting the Index

In [25]:
# A:
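A small sketch of moving a column into the index and back again, using two rows from the `drinks.head()` output (`toy_drinks` is an assumed name):

```python
import pandas as pd

toy_drinks = pd.DataFrame({
    "country": ["Albania", "Andorra"],
    "beer_servings": [89, 245],
})

# Use country names as row labels instead of 0, 1, ...
indexed = toy_drinks.set_index("country")
albania_beer = indexed.loc["Albania", "beer_servings"]

# Move the index back into an ordinary column.
restored = indexed.reset_index()
```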

<a id='sort-by-index'></a>
### Sorting by Index

In [26]:
# A:

<a id='change-dtype'></a>
### Changing the Data Type of a Column

In [27]:
# A:

<a id='dummy'></a>
### Creating Dummy-Coded Columns

In [28]:
# A:
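Dummy coding can be sketched with `pd.get_dummies()`, which creates one indicator column per distinct value. The data and the `prefix` choice are illustrative.

```python
import pandas as pd

toy_drinks = pd.DataFrame({"continent": ["EU", "AS", "EU", "AF"]})

# One column per category; each row has exactly one indicator set.
dummies = pd.get_dummies(toy_drinks["continent"], prefix="continent")
```

The indicator values are Boolean or 0/1 integers depending on the pandas version, so code that relies on a specific dtype may need an explicit `.astype()`.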

<a id='concatenate'></a>
### Concatenating DataFrames

In [29]:
# A:
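`pd.concat()` can stack frames vertically or place them side by side. The three small frames below are invented for illustration.

```python
import pandas as pd

top = pd.DataFrame({"country": ["Albania"], "beer_servings": [89]})
bottom = pd.DataFrame({"country": ["Andorra"], "beer_servings": [245]})
extra = pd.DataFrame({"wine_servings": [54]})

# Stack rows; ignore_index=True renumbers the result 0, 1, ...
stacked = pd.concat([top, bottom], ignore_index=True)

# axis=1 aligns on the index and appends columns instead.
side_by_side = pd.concat([top, extra], axis=1)
```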

<a id='duplicate-rows'></a>
### Detecting and Dropping Duplicate Rows

In [30]:
# A:
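A minimal sketch of finding and removing duplicates on an invented frame:

```python
import pandas as pd

toy_ufo = pd.DataFrame({
    "City": ["Arlington", "Arlington", "Richmond"],
    "State": ["VA", "VA", "VA"],
})

# True for every row that repeats an earlier row in full.
dup_mask = toy_ufo.duplicated()

# Keep only the first occurrence of each distinct row.
deduped = toy_ufo.drop_duplicates()

# Restrict the comparison to a subset of columns.
dup_state = toy_ufo.duplicated(subset=["State"])
```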

<a id='write-csv'></a>
### Writing a DataFrame to a `.csv`
```python
# Write a DataFrame out to a `.csv`.
drinks.to_csv('drinks_updated.csv')  # Index is used as the first column
drinks.to_csv('drinks_updated.csv', index=False) # Ignore index
```

<a id='pickle'></a>
### Pickling a DataFrame
```python
# Save a DataFrame to disk (a.k.a., "pickle") and read it from disk (a.k.a., "unpickle").
drinks.to_pickle('drinks_pickle')
pd.read_pickle('drinks_pickle')
```

<a id='sample'></a>
### Randomly Sampling a DataFrame

In [31]:
# A:
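Random sampling can be sketched with `.sample()`, either by row count or by fraction. The frame is invented, and `random_state=1` is an arbitrary seed chosen so the draw is repeatable.

```python
import pandas as pd

toy = pd.DataFrame({"x": range(10)})

# Draw a fixed number of rows without replacement.
two_rows = toy.sample(n=2, random_state=1)

# Draw a fraction of the rows instead.
half = toy.sample(frac=0.5, random_state=1)
```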

<a id='infrequent'></a>
## Infrequently Used Features

---

<a id='toy-dataframes'></a>
### Creating DataFrames From Dictionaries and Lists of Lists

In [32]:
# A:

In [33]:
# A:
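Both constructions can be sketched as follows; the values are copied from the `drinks.head()` output, and the variable names are assumptions.

```python
import pandas as pd

# From a dictionary: keys become column names, values become columns.
from_dict = pd.DataFrame({
    "country": ["Albania", "Andorra"],
    "beer_servings": [89, 245],
})

# From a list of lists: each inner list is one row, so names are passed separately.
from_rows = pd.DataFrame(
    [["Albania", 89], ["Andorra", 245]],
    columns=["country", "beer_servings"],
)
```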

<a id='crosstab'></a>
### Performing Cross-Tabulations

In [34]:
# A:
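A cross-tabulation counts how often each pair of values occurs. The sketch assumes a stand-in `users_toy` frame with invented `occupation` and `gender` values.

```python
import pandas as pd

users_toy = pd.DataFrame({
    "occupation": ["doctor", "doctor", "lawyer", "lawyer", "lawyer"],
    "gender": ["F", "M", "F", "F", "M"],
})

# Rows are occupations, columns are genders, cells are counts.
counts = pd.crosstab(users_toy["occupation"], users_toy["gender"])
```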

<a id='query'></a>
### Query-Filtering Syntax

In [35]:
# A:
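`.query()` expresses a filter as a string, which can read more cleanly than chained Boolean masks. The frame below is invented; the equivalent mask is included for comparison.

```python
import pandas as pd

toy_drinks = pd.DataFrame({
    "beer_servings": [89, 245, 25],
    "continent": ["EU", "EU", "AF"],
})

# Column names are used directly inside the query string.
heavy_eu = toy_drinks.query("beer_servings > 100 and continent == 'EU'")

# The same filter written with a Boolean mask.
heavy_eu_mask = toy_drinks[
    (toy_drinks["beer_servings"] > 100) & (toy_drinks["continent"] == "EU")
]
```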

<a id='memory-usage'></a>
### Calculating Memory Usage

In [36]:
# A:

<a id='category-type'></a>
### Converting Column to Category Type

In [37]:
# A:
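Converting a repetitive string column to the `category` dtype stores each distinct value once and keeps integer codes per row. A sketch on an invented Series:

```python
import pandas as pd

continents = pd.Series(["EU", "AS", "EU", "AF"])

as_category = continents.astype("category")

# Categories are the sorted distinct values; codes index into them.
categories = list(as_category.cat.categories)
codes = list(as_category.cat.codes)
```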

<a id='assign'></a>
### Creating Columns With `.assign()`

In [38]:
# A:
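`.assign()` returns a new DataFrame with the extra column and leaves the original untouched. The column name `beer_and_wine` and the data are illustrative.

```python
import pandas as pd

toy_drinks = pd.DataFrame({
    "beer_servings": [89, 245],
    "wine_servings": [54, 312],
})

# The lambda receives the DataFrame, so the new column can use existing ones.
with_total = toy_drinks.assign(
    beer_and_wine=lambda df: df["beer_servings"] + df["wine_servings"]
)
```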

<a id='limit-rows-read'></a>
### Limiting the Number of Rows to Load in a File Read

In [39]:
# A:
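The `nrows` parameter of `pd.read_csv()` caps how many data rows are read. To keep the sketch self-contained it reads from an in-memory string instead of a file; the CSV text is invented.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = "country,beer_servings\nAfghanistan,0\nAlbania,89\nAlgeria,25\n"

# Only the first two data rows are parsed; the header row is not counted.
first_two = pd.read_csv(io.StringIO(csv_text), nrows=2)
```

With a real file, the first argument would be the path, for example the `ufo_csv` variable defined earlier.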

<a id='manual-print'></a>
### Manually Setting the Number of Rows and Columns to Print

In [40]:
# A:

In [41]:
# A: