# Data Manipulation with pandas

#### Course Description
pandas is the world's most popular Python library, used for everything from data manipulation to data analysis. In this course, you'll learn how to manipulate DataFrames, as you extract, filter, and transform real-world datasets for analysis. Using pandas you’ll explore all the core data science concepts. Using real-world data, including Walmart sales figures and global temperature time series, you’ll learn how to import, clean, calculate statistics, and create visualizations—using pandas to add to the power of Python!

---

## Course Outline

#### Chapter 1: DataFrames
- Sorting and subsetting
- Creating new columns

#### Chapter 2: Aggregating Data
- Summary statistics
- Counting
- Grouped summary statistics

#### Chapter 3: Slicing and Indexing Data
- Subsetting using slicing
- Indexes and subsetting using indexes

#### Chapter 4: Creating and Visualizing Data
- Plotting
- Handling missing data
- Reading data into a DataFrame

---

## [1] Transforming DataFrames
Let’s master the pandas basics. Learn how to inspect DataFrames and perform fundamental manipulations, including sorting rows, subsetting, and adding new columns.

### Introducing DataFrames
- pandas is built on NumPy and Matplotlib
    - NumPy provides multidimensional array objects for easy data manipulation, which pandas uses to store data
    - Matplotlib has powerful data visualization capabilities that pandas takes advantage of


#### Store Data for Analysis
- Rectangular Data (Tabular Data) - most common form

#### Exploring a DataFrame 

`.head()`
- a method that returns the first few rows of the DataFrame

`.info()`
- a method that displays the names of the columns, the data types they contain, and whether they have any missing values

`.shape`
- an attribute that contains a tuple that holds the number of rows followed by the number of columns
- since it is an attribute, not a method, it is written without parentheses

`.describe()`
- a method that computes some summary statistics for numerical columns, like mean and median
- good for quick overview of numerical variables

#### Components of a DataFrame:
A DataFrame consists of three components, each accessible as an attribute.

`.values`
- this attribute contains the data values in a 2-dimensional NumPy array

`.columns`
- this attribute contains column names

`.index`
- this attribute contains row numbers or row names
- allows flexibility in labels
- **row labels are stored in .index, and not in .rows**

---

**Inspecting a DataFrame**

When you get a new DataFrame to work with, the first thing you need to do is explore it and see what it contains. There are several useful methods and attributes for this.

`.head()` returns the first few rows (the “head” of the DataFrame). <br>
`.info()` shows information on each of the columns, such as the data type and number of missing values. <br>
`.shape` returns the number of rows and columns of the DataFrame. <br>
`.describe()` calculates a few summary statistics for each column.

`homelessness` is a DataFrame containing estimates of homelessness in each U.S. state in 2018. The `individuals` column is the number of homeless individuals who are not part of a family with children. The `family_members` column is the number of homeless individuals who are part of a family with children. The `state_pop` column is the state's total population.

`pandas` is imported for you.

In [1]:
import pandas as pd

# Load the dataset, then print the head of the `homelessness` DataFrame.
homelessness = pd.read_csv('dataset/homelessness.csv')
homelessness.head()

# Print information about homelessness
homelessness.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51 entries, 0 to 50
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Unnamed: 0      51 non-null     int64  
 1   region          51 non-null     object 
 2   state           51 non-null     object 
 3   individuals     51 non-null     float64
 4   family_members  51 non-null     float64
 5   state_pop       51 non-null     int64  
dtypes: float64(2), int64(2), object(2)
memory usage: 2.5+ KB


In [2]:
# Print the shape of homelessness
homelessness.shape

(51, 6)

In [3]:
# Print a description of homelessness
homelessness.describe()

Unnamed: 0.1,Unnamed: 0,individuals,family_members,state_pop
count,51.0,51.0,51.0,51.0
mean,25.0,7225.784314,3504.882353,6405637.0
std,14.866069,15991.025083,7805.411811,7327258.0
min,0.0,434.0,75.0,577601.0
25%,12.5,1446.5,592.0,1777414.0
50%,25.0,3082.0,1482.0,4461153.0
75%,37.5,6781.5,3196.0,7340946.0
max,50.0,109008.0,52070.0,39461590.0


**Parts of a DataFrame**

To better understand DataFrame objects, it's useful to know that they consist of three components, stored as attributes:

- `.values`: A two-dimensional NumPy array of values.
- `.columns`: An index of columns: the column names.
- `.index`: An index for the rows: either row numbers or row names.

You can usually think of indexes as a list of strings or numbers, though the pandas `Index` data type allows for more sophisticated options. (These will be covered later in the course.)
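
As a small sketch of the three components, the toy DataFrame below uses names as row labels instead of the default row numbers. The `dogs` data is hypothetical and is not one of the course datasets.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
dogs = pd.DataFrame(
    {"breed": ["Labrador", "Poodle", "Beagle"], "height_cm": [56, 43, 38]},
    index=["Bella", "Charlie", "Lucy"],  # row names instead of row numbers
)

values = dogs.values               # 2-D NumPy array holding the data
column_names = list(dogs.columns)  # column labels
row_names = list(dogs.index)       # row labels

print(values.shape)
print(column_names)
print(row_names)
```

Converting `.columns` and `.index` to lists is only done here to make them easy to print; both are pandas `Index` objects.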

`homelessness` is available.

In [4]:
# Import pandas using the alias pd
import pandas as pd

# Print the values of homelessness
print(homelessness.values)

print(" ")

# Print the column index of homelessness
print(homelessness.columns)

print(" ")

# Print the row index of homelessness
print(homelessness.index)

[[0 'East South Central' 'Alabama' 2570.0 864.0 4887681]
 [1 'Pacific' 'Alaska' 1434.0 582.0 735139]
 [2 'Mountain' 'Arizona' 7259.0 2606.0 7158024]
 [3 'West South Central' 'Arkansas' 2280.0 432.0 3009733]
 [4 'Pacific' 'California' 109008.0 20964.0 39461588]
 [5 'Mountain' 'Colorado' 7607.0 3250.0 5691287]
 [6 'New England' 'Connecticut' 2280.0 1696.0 3571520]
 [7 'South Atlantic' 'Delaware' 708.0 374.0 965479]
 [8 'South Atlantic' 'District of Columbia' 3770.0 3134.0 701547]
 [9 'South Atlantic' 'Florida' 21443.0 9587.0 21244317]
 [10 'South Atlantic' 'Georgia' 6943.0 2556.0 10511131]
 [11 'Pacific' 'Hawaii' 4131.0 2399.0 1420593]
 [12 'Mountain' 'Idaho' 1297.0 715.0 1750536]
 [13 'East North Central' 'Illinois' 6752.0 3891.0 12723071]
 [14 'East North Central' 'Indiana' 3776.0 1482.0 6695497]
 [15 'West North Central' 'Iowa' 1711.0 1038.0 3148618]
 [16 'West North Central' 'Kansas' 1443.0 773.0 2911359]
 [17 'East South Central' 'Kentucky' 2735.0 953.0 4461153]
 [18 'West South Cen

--- 
### Sorting and Subsetting

#### Sorting
`.sort_values()`
- change the order of the rows by passing in a column name to sort

In [5]:
# Import pandas using the alias pd
import pandas as pd

homelessness.head()

# Sort A-Z
homelessness.sort_values("state").head()

Unnamed: 0.1,Unnamed: 0,region,state,individuals,family_members,state_pop
0,0,East South Central,Alabama,2570.0,864.0,4887681
1,1,Pacific,Alaska,1434.0,582.0,735139
2,2,Mountain,Arizona,7259.0,2606.0,7158024
3,3,West South Central,Arkansas,2280.0,432.0,3009733
4,4,Pacific,California,109008.0,20964.0,39461588


In [6]:
# Sort Z - A
homelessness.sort_values("state", ascending = False).head()

Unnamed: 0.1,Unnamed: 0,region,state,individuals,family_members,state_pop
50,50,Mountain,Wyoming,434.0,205.0,577601
49,49,East North Central,Wisconsin,2740.0,2167.0,5807406
48,48,South Atlantic,West Virginia,1021.0,222.0,1804291
47,47,Pacific,Washington,16424.0,5880.0,7523869
46,46,South Atlantic,Virginia,3928.0,2047.0,8501286


#### Sort by multiple variables
- passing in a list of column names to .sort_values()

In [7]:
homelessness.sort_values(["state", "state_pop"]).head()

Unnamed: 0.1,Unnamed: 0,region,state,individuals,family_members,state_pop
0,0,East South Central,Alabama,2570.0,864.0,4887681
1,1,Pacific,Alaska,1434.0,582.0,735139
2,2,Mountain,Arizona,7259.0,2606.0,7158024
3,3,West South Central,Arkansas,2280.0,432.0,3009733
4,4,Pacific,California,109008.0,20964.0,39461588


In [8]:
# specifies the sort direction for each variable
homelessness.sort_values(["state", "state_pop"], ascending = [True, False]).head()

Unnamed: 0.1,Unnamed: 0,region,state,individuals,family_members,state_pop
0,0,East South Central,Alabama,2570.0,864.0,4887681
1,1,Pacific,Alaska,1434.0,582.0,735139
2,2,Mountain,Arizona,7259.0,2606.0,7158024
3,3,West South Central,Arkansas,2280.0,432.0,3009733
4,4,Pacific,California,109008.0,20964.0,39461588


### Subsetting columns

In [9]:
# Subsetting ONE column
homelessness["state"].to_frame().head()

Unnamed: 0,state
0,Alabama
1,Alaska
2,Arizona
3,Arkansas
4,California


### Subsetting multiple columns
- the OUTER brackets are responsible for subsetting the DataFrame
- the INNER brackets create a list of column names to subset

> You could also store the list of column names in a variable and then use that variable to perform the same subsetting.

In [10]:
# Subsetting MULTIPLE columns
homelessness[['state', 'state_pop']].head()

Unnamed: 0,state,state_pop
0,Alabama,4887681
1,Alaska,735139
2,Arizona,7158024
3,Arkansas,3009733
4,California,39461588


In [11]:
# same as the one above, with the column list stored in a variable
cols_to_subset = ['state', 'state_pop']
homelessness[cols_to_subset].head()

Unnamed: 0,state,state_pop
0,Alabama,4887681
1,Alaska,735139
2,Arizona,7158024
3,Arkansas,3009733
4,California,39461588


#### Subsetting rows
- creating a logical condition to filter against

In [12]:
homelessness[homelessness['state_pop'] >  10000000]

Unnamed: 0.1,Unnamed: 0,region,state,individuals,family_members,state_pop
4,4,Pacific,California,109008.0,20964.0,39461588
9,9,South Atlantic,Florida,21443.0,9587.0,21244317
10,10,South Atlantic,Georgia,6943.0,2556.0,10511131
13,13,East North Central,Illinois,6752.0,3891.0,12723071
32,32,Mid-Atlantic,New York,39827.0,52070.0,19530351
33,33,South Atlantic,North Carolina,6451.0,2817.0,10381615
35,35,East North Central,Ohio,6929.0,3320.0,11676341
38,38,Mid-Atlantic,Pennsylvania,8163.0,5349.0,12800922
43,43,West South Central,Texas,19199.0,6111.0,28628666


### Subsetting based on text data
- uses double equal sign in the logical condition to filter

In [13]:
homelessness[homelessness['region'] == "Mid-Atlantic"]

Unnamed: 0.1,Unnamed: 0,region,state,individuals,family_members,state_pop
30,30,Mid-Atlantic,New Jersey,6048.0,3350.0,8886025
32,32,Mid-Atlantic,New York,39827.0,52070.0,19530351
38,38,Mid-Atlantic,Pennsylvania,8163.0,5349.0,12800922


#### Subsetting based on dates
- dates are in quotes and are written as 'year-month-day' -> international standard date format
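
The notes give no example for date filtering, so here is a minimal sketch. It assumes a hypothetical `dogs` DataFrame whose `date_of_birth` column holds dates as 'year-month-day' text.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
dogs = pd.DataFrame(
    {
        "name": ["Bella", "Charlie", "Lucy"],
        "date_of_birth": ["2013-07-01", "2016-09-16", "2014-08-25"],
    }
)

# Year-month-day strings sort in chronological order,
# so comparing against a quoted date behaves like a date comparison
born_before_2015 = dogs[dogs["date_of_birth"] < "2015-01-01"]
print(born_before_2015)
```

This is one reason the year-month-day format is convenient: the comparison gives the chronological answer even when the column is stored as text.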

#### Subsetting based on multiple conditions
- combine conditions using logical operators
- in the example, only rows that meet both conditions will be subsetted

In [14]:
mid = homelessness['region'] == "Mid-Atlantic"
pop10 = homelessness['state_pop'] >  10000000
homelessness[mid & pop10]

Unnamed: 0.1,Unnamed: 0,region,state,individuals,family_members,state_pop
32,32,Mid-Atlantic,New York,39827.0,52070.0,19530351
38,38,Mid-Atlantic,Pennsylvania,8163.0,5349.0,12800922


In [15]:
# same as above but in one line only
# wrap each condition in parentheses
homelessness[(homelessness['region'] == "Mid-Atlantic") & (homelessness['state_pop'] >  10000000)]

Unnamed: 0.1,Unnamed: 0,region,state,individuals,family_members,state_pop
32,32,Mid-Atlantic,New York,39827.0,52070.0,19530351
38,38,Mid-Atlantic,Pennsylvania,8163.0,5349.0,12800922


#### Subsetting using .isin()
- used to filter on multiple values of a categorical variable
- takes in a list of values to filter for

In [16]:
is_pacific_or_mountain = homelessness['region'].isin(['Pacific', 'Mountain'])
homelessness[is_pacific_or_mountain]

Unnamed: 0.1,Unnamed: 0,region,state,individuals,family_members,state_pop
1,1,Pacific,Alaska,1434.0,582.0,735139
2,2,Mountain,Arizona,7259.0,2606.0,7158024
4,4,Pacific,California,109008.0,20964.0,39461588
5,5,Mountain,Colorado,7607.0,3250.0,5691287
11,11,Pacific,Hawaii,4131.0,2399.0,1420593
12,12,Mountain,Idaho,1297.0,715.0,1750536
26,26,Mountain,Montana,983.0,422.0,1060665
28,28,Mountain,Nevada,7058.0,486.0,3027341
31,31,Mountain,New Mexico,1949.0,602.0,2092741
37,37,Pacific,Oregon,11139.0,3337.0,4181886


**Sorting rows**

Finding interesting bits of data in a DataFrame is often easier if you change the order of the rows. You can sort the rows by passing a column name to `.sort_values()`.

In cases where rows have the same value (this is common if you sort on a categorical variable), you may wish to break the ties by sorting on another column. You can sort on multiple columns in this way by passing a list of column names.

<br>

<table>
    <tr>
        <th> Sort on </th>
        <th> Syntax </th>
    </tr>
    <tr>
        <td> one column </td>
        <td> df.sort_values("breed") </td>
    </tr>
    <tr>
        <td> multiple columns </td>
        <td> df.sort_values(["breed", "weight_kg"]) </td>
    </tr>
</table>

<br>

By combining `.sort_values()` with `.head()`, you can answer questions in the form, **"What are the top cases where…?"**.
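
The tie-breaking and "top cases" ideas can be sketched on a toy table. The `dogs` data and its column values are hypothetical, chosen to match the `breed` and `weight_kg` names used in the syntax table above.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
dogs = pd.DataFrame(
    {
        "name": ["Bella", "Charlie", "Lucy", "Cooper"],
        "breed": ["Labrador", "Poodle", "Labrador", "Poodle"],
        "weight_kg": [25, 23, 29, 20],
    }
)

# Sort by breed A-Z, then break ties within a breed with the heaviest dog first
sorted_dogs = dogs.sort_values(["breed", "weight_kg"], ascending=[True, False])

# "What are the top cases where...?": the two heaviest dogs overall
heaviest_two = dogs.sort_values("weight_kg", ascending=False).head(2)

print(sorted_dogs)
print(heaviest_two)
```

Note that `.sort_values()` returns a new DataFrame and leaves `dogs` in its original order.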

- Sort `homelessness` by the number of homeless individuals, from smallest to largest, and save this as `homelessness_ind`.
- Print the head of the sorted DataFrame.

In [17]:
# Sort homelessness by individuals
homelessness_ind = homelessness.sort_values('individuals')

# Print the top few rows
homelessness_ind.head()

Unnamed: 0.1,Unnamed: 0,region,state,individuals,family_members,state_pop
50,50,Mountain,Wyoming,434.0,205.0,577601
34,34,West North Central,North Dakota,467.0,75.0,758080
7,7,South Atlantic,Delaware,708.0,374.0,965479
39,39,New England,Rhode Island,747.0,354.0,1058287
45,45,New England,Vermont,780.0,511.0,624358


In [18]:
homelessness.head()

Unnamed: 0.1,Unnamed: 0,region,state,individuals,family_members,state_pop
0,0,East South Central,Alabama,2570.0,864.0,4887681
1,1,Pacific,Alaska,1434.0,582.0,735139
2,2,Mountain,Arizona,7259.0,2606.0,7158024
3,3,West South Central,Arkansas,2280.0,432.0,3009733
4,4,Pacific,California,109008.0,20964.0,39461588


- Sort `homelessness` by the number of homeless `family_members` in descending order, and save this as `homelessness_fam`.
- Print the head of the sorted DataFrame.

In [19]:
# Sort homelessness by descending family members
homelessness_fam = homelessness.sort_values('family_members', ascending = False)

# Print the top few rows
homelessness_fam.head()

Unnamed: 0.1,Unnamed: 0,region,state,individuals,family_members,state_pop
32,32,Mid-Atlantic,New York,39827.0,52070.0,19530351
4,4,Pacific,California,109008.0,20964.0,39461588
21,21,New England,Massachusetts,6811.0,13257.0,6882635
9,9,South Atlantic,Florida,21443.0,9587.0,21244317
43,43,West South Central,Texas,19199.0,6111.0,28628666


- Sort `homelessness` first by region (ascending), and then by number of family members (descending). Save this as `homelessness_reg_fam`.
- Print the head of the sorted DataFrame.

In [20]:
# Sort homelessness by region, then descending family members
homelessness_reg_fam = homelessness.sort_values(['region', 'family_members' ], ascending = [True, False])

# Print the top few rows
homelessness_reg_fam.head()

Unnamed: 0.1,Unnamed: 0,region,state,individuals,family_members,state_pop
13,13,East North Central,Illinois,6752.0,3891.0,12723071
35,35,East North Central,Ohio,6929.0,3320.0,11676341
22,22,East North Central,Michigan,5209.0,3142.0,9984072
49,49,East North Central,Wisconsin,2740.0,2167.0,5807406
14,14,East North Central,Indiana,3776.0,1482.0,6695497


**Subsetting columns**

When working with data, you may not need all of the variables in your dataset. Square brackets (`[]`) can be used to select only the columns that matter to you in an order that makes sense to you. To select only `"col_a"` of the DataFrame df, use

`df["col_a"]`


To select "col_a" and "col_b" of df, use

`df[["col_a", "col_b"]]`
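
One detail worth seeing in code: single brackets with one column name give back a Series, while double brackets give back a DataFrame, even for a single column. The sketch below uses a hypothetical `df` with made-up values.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
df = pd.DataFrame(
    {"col_a": [1, 2, 3], "col_b": [4.0, 5.0, 6.0], "col_c": ["x", "y", "z"]}
)

one_col = df["col_a"]              # single brackets: a Series
one_col_frame = df[["col_a"]]      # double brackets: a one-column DataFrame
two_cols = df[["col_b", "col_a"]]  # columns come back in the order requested

print(type(one_col).__name__)
print(type(one_col_frame).__name__)
print(list(two_cols.columns))
```

This is why some cells in these notes call `.to_frame()` on a single selected column before displaying it.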

- Create a DataFrame called `individuals` that contains only the `individuals` column of `homelessness`.
- Print the head of the result.

In [21]:
# Select the individuals column
individuals = homelessness['individuals']

# Print the head of the result
individuals.head().to_frame()

Unnamed: 0,individuals
0,2570.0
1,1434.0
2,7259.0
3,2280.0
4,109008.0


- Create a DataFrame called `state_fam` that contains only the `state` and `family_members` columns of `homelessness`, in that order.

In [22]:
# Select the state and family_members columns
state_fam = homelessness[['state', 'family_members']]

# Print the head of the result
state_fam.head()

Unnamed: 0,state,family_members
0,Alabama,864.0
1,Alaska,582.0
2,Arizona,2606.0
3,Arkansas,432.0
4,California,20964.0


- Create a DataFrame called `ind_state` that contains the `individuals` and `state` columns of `homelessness`, in that order.

In [23]:
# Select only the individuals and state columns, in that order
ind_state = homelessness[['individuals', 'state']]

# Print the head of the result
ind_state.head()

Unnamed: 0,individuals,state
0,2570.0,Alabama
1,1434.0,Alaska
2,7259.0,Arizona
3,2280.0,Arkansas
4,109008.0,California


**Subsetting rows**

A large part of data science is about finding which bits of your dataset are interesting. One of the simplest techniques for this is to find a subset of rows that match some criteria. This is sometimes known as *filtering rows* or *selecting rows*.

There are many ways to subset a DataFrame. Perhaps the most common is to use relational operators to return `True` or `False` for each row, then pass the result inside square brackets.

`dogs[dogs["height_cm"] > 60]` <br>
`dogs[dogs["color"] == "tan"]`

You can filter for multiple conditions at once by using the "bitwise and" operator, `&`.

`dogs[(dogs["height_cm"] > 60) & (dogs["color"] == "tan")]`
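
To make the mechanics visible, the sketch below builds each condition as its own True/False Series before combining them. The `dogs` values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
dogs = pd.DataFrame(
    {
        "name": ["Bella", "Charlie", "Lucy"],
        "height_cm": [56, 43, 62],
        "color": ["brown", "tan", "tan"],
    }
)

is_tall = dogs["height_cm"] > 60  # Series of True/False, one value per row
is_tan = dogs["color"] == "tan"

# Keep only the rows where both conditions are True
tall_and_tan = dogs[is_tall & is_tan]

print(is_tall.tolist())
print(tall_and_tan)
```

In Python, `&` binds more tightly than `>` and `==`, which is why each comparison needs its own parentheses when everything is written on one line.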

- Filter `homelessness` for cases where the number of individuals is greater than ten thousand, assigning to `ind_gt_10k`. View the printed result.

In [24]:
# Filter for rows where individuals is greater than 10000
ind_gt_10k = homelessness[homelessness['individuals'] > 10000]

# See the result
pd.DataFrame(ind_gt_10k)

Unnamed: 0.1,Unnamed: 0,region,state,individuals,family_members,state_pop
4,4,Pacific,California,109008.0,20964.0,39461588
9,9,South Atlantic,Florida,21443.0,9587.0,21244317
32,32,Mid-Atlantic,New York,39827.0,52070.0,19530351
37,37,Pacific,Oregon,11139.0,3337.0,4181886
43,43,West South Central,Texas,19199.0,6111.0,28628666
47,47,Pacific,Washington,16424.0,5880.0,7523869


- Filter `homelessness` for cases where the USA Census region is `"Mountain"`, assigning to `mountain_reg`. View the printed result.

In [25]:
# Filter for rows where region is Mountain
mountain_reg = homelessness[homelessness['region'] == 'Mountain']

# See the result
mountain_reg

Unnamed: 0.1,Unnamed: 0,region,state,individuals,family_members,state_pop
2,2,Mountain,Arizona,7259.0,2606.0,7158024
5,5,Mountain,Colorado,7607.0,3250.0,5691287
12,12,Mountain,Idaho,1297.0,715.0,1750536
26,26,Mountain,Montana,983.0,422.0,1060665
28,28,Mountain,Nevada,7058.0,486.0,3027341
31,31,Mountain,New Mexico,1949.0,602.0,2092741
44,44,Mountain,Utah,1904.0,972.0,3153550
50,50,Mountain,Wyoming,434.0,205.0,577601


- Filter `homelessness` for cases where the number of `family_members` is less than one thousand and the `region` is "Pacific", assigning to `fam_lt_1k_pac`.

In [26]:
# Filter for rows where family_members is less than 1000 
# and region is Pacific
fam_lt_1k_pac = homelessness[(homelessness['family_members'] < 1000) & (homelessness['region'] == 'Pacific')]

# See the result
print(fam_lt_1k_pac)

   Unnamed: 0   region   state  individuals  family_members  state_pop
1           1  Pacific  Alaska       1434.0           582.0     735139


**Subsetting rows by categorical variables**

Subsetting data based on a categorical variable often involves using the "or" operator (`|`) to select rows from multiple categories. This can get tedious when you want all states in one of three different regions, for example. Instead, use the `.isin()` method, which will allow you to tackle this problem by writing one condition instead of three separate ones.

```
colors = ["brown", "black", "tan"]
condition = dogs["color"].isin(colors)
dogs[condition]
```
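
As a quick check that the two approaches agree, the sketch below writes the same filter both ways on hypothetical `dogs` data and compares the results.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
dogs = pd.DataFrame(
    {
        "name": ["Bella", "Charlie", "Lucy", "Cooper"],
        "color": ["brown", "black", "white", "tan"],
    }
)

# One condition per category, joined with the "or" operator
with_or = dogs[
    (dogs["color"] == "brown") | (dogs["color"] == "black") | (dogs["color"] == "tan")
]

# The same filter written as a single .isin() condition
with_isin = dogs[dogs["color"].isin(["brown", "black", "tan"])]

print(with_or.equals(with_isin))
```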

- Filter `homelessness` for cases where the USA census region is "South Atlantic" or it is "Mid-Atlantic", assigning to `south_mid_atlantic`. View the printed result.

In [27]:
# Subset for rows in South Atlantic or Mid-Atlantic regions
south_mid_atlantic = homelessness[homelessness['region'].isin(['South Atlantic', 'Mid-Atlantic'])]

# See the result
pd.DataFrame(south_mid_atlantic)

Unnamed: 0.1,Unnamed: 0,region,state,individuals,family_members,state_pop
7,7,South Atlantic,Delaware,708.0,374.0,965479
8,8,South Atlantic,District of Columbia,3770.0,3134.0,701547
9,9,South Atlantic,Florida,21443.0,9587.0,21244317
10,10,South Atlantic,Georgia,6943.0,2556.0,10511131
20,20,South Atlantic,Maryland,4914.0,2230.0,6035802
30,30,Mid-Atlantic,New Jersey,6048.0,3350.0,8886025
32,32,Mid-Atlantic,New York,39827.0,52070.0,19530351
33,33,South Atlantic,North Carolina,6451.0,2817.0,10381615
38,38,Mid-Atlantic,Pennsylvania,8163.0,5349.0,12800922
40,40,South Atlantic,South Carolina,3082.0,851.0,5084156


- Filter `homelessness` for cases where the USA census `state` is in the list of Mojave states, `canu`, assigning to `mojave_homelessness`. View the printed result.

In [28]:
# The Mojave Desert states
canu = ["California", "Arizona", "Nevada", "Utah"]

# Filter for rows in the Mojave Desert states
mojave_homelessness = homelessness[homelessness['state'].isin(canu)]

# See the result
pd.DataFrame(mojave_homelessness)

Unnamed: 0.1,Unnamed: 0,region,state,individuals,family_members,state_pop
2,2,Mountain,Arizona,7259.0,2606.0,7158024
4,4,Pacific,California,109008.0,20964.0,39461588
28,28,Mountain,Nevada,7058.0,486.0,3027341
44,44,Mountain,Utah,1904.0,972.0,3153550


--- 
### New Columns

#### Adding New Columns

In [29]:
# Make a copy so the original homelessness DataFrame is left unchanged

homelessness1 = homelessness.copy()
homelessness1.head()

Unnamed: 0.1,Unnamed: 0,region,state,individuals,family_members,state_pop
0,0,East South Central,Alabama,2570.0,864.0,4887681
1,1,Pacific,Alaska,1434.0,582.0,735139
2,2,Mountain,Arizona,7259.0,2606.0,7158024
3,3,West South Central,Arkansas,2280.0,432.0,3009733
4,4,Pacific,California,109008.0,20964.0,39461588


In [30]:
# Note: despite its name, this column holds state_pop in thousands, not a percentage
homelessness1['percent_state_pop'] = homelessness1['state_pop'] / 1000
homelessness1.head()

Unnamed: 0.1,Unnamed: 0,region,state,individuals,family_members,state_pop,percent_state_pop
0,0,East South Central,Alabama,2570.0,864.0,4887681,4887.681
1,1,Pacific,Alaska,1434.0,582.0,735139,735.139
2,2,Mountain,Arizona,7259.0,2606.0,7158024,7158.024
3,3,West South Central,Arkansas,2280.0,432.0,3009733,3009.733
4,4,Pacific,California,109008.0,20964.0,39461588,39461.588


**Adding new columns**

You aren't stuck with just the data you are given. Instead, you can add new columns to a DataFrame. This has many names, such as transforming, mutating, and feature engineering.

You can create new columns from scratch, but it is also common to derive them from other columns, for example, by adding columns together or by changing their units.
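
Both kinds of derived column can be sketched on a toy table: one that changes units and one that combines existing columns. The `dogs` data and the `height_m` and `bmi` column names are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
dogs = pd.DataFrame(
    {"name": ["Bella", "Charlie"], "height_cm": [56, 43], "weight_kg": [25, 23]}
)

# Change units: centimetres to metres
dogs["height_m"] = dogs["height_cm"] / 100

# Combine columns: weight divided by height squared
dogs["bmi"] = dogs["weight_kg"] / dogs["height_m"] ** 2

print(dogs)
```

Assigning to a column name that does not exist yet appends the new column on the right; the arithmetic is applied to every row at once.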

- Add a new column to `homelessness`, named `total`, containing the sum of the `individuals` and `family_members` columns.
- Add another column to `homelessness`, named `p_individuals`, containing the proportion of homeless people in each state who are individuals.

In [31]:
# Add total col as sum of individuals and family_members
homelessness1['total'] = homelessness1['individuals'] + homelessness1['family_members']

# Add p_individuals col as proportion of total that are individuals
homelessness1['p_individuals'] = homelessness1['individuals'] / homelessness1['total']

# See the result
pd.DataFrame(homelessness1)

Unnamed: 0.1,Unnamed: 0,region,state,individuals,family_members,state_pop,percent_state_pop,total,p_individuals
0,0,East South Central,Alabama,2570.0,864.0,4887681,4887.681,3434.0,0.748398
1,1,Pacific,Alaska,1434.0,582.0,735139,735.139,2016.0,0.71131
2,2,Mountain,Arizona,7259.0,2606.0,7158024,7158.024,9865.0,0.735834
3,3,West South Central,Arkansas,2280.0,432.0,3009733,3009.733,2712.0,0.840708
4,4,Pacific,California,109008.0,20964.0,39461588,39461.588,129972.0,0.838704
5,5,Mountain,Colorado,7607.0,3250.0,5691287,5691.287,10857.0,0.700654
6,6,New England,Connecticut,2280.0,1696.0,3571520,3571.52,3976.0,0.573441
7,7,South Atlantic,Delaware,708.0,374.0,965479,965.479,1082.0,0.654344
8,8,South Atlantic,District of Columbia,3770.0,3134.0,701547,701.547,6904.0,0.54606
9,9,South Atlantic,Florida,21443.0,9587.0,21244317,21244.317,31030.0,0.691041


**Combo-attack!**

You've seen the four most common types of data manipulation: sorting rows, subsetting columns, subsetting rows, and adding new columns. In a real-life data analysis, you can mix and match these four manipulations to answer a multitude of questions.

In this exercise, you'll answer the question, **"Which state has the highest number of homeless individuals per 10,000 people in the state?"**

In [32]:
# Create indiv_per_10k col as homeless individuals per 10k state pop
homelessness1["indiv_per_10k"] = 10000 * homelessness1['individuals'] / homelessness1['state_pop'] 

# Subset rows for indiv_per_10k greater than 20
high_homelessness = homelessness1[homelessness1['indiv_per_10k'] > 20]

# Sort high_homelessness by descending indiv_per_10k
high_homelessness_srt = high_homelessness.sort_values('indiv_per_10k', ascending = False)

# From high_homelessness_srt, select the state and indiv_per_10k cols
result = high_homelessness_srt[['state' , 'indiv_per_10k']]

# See the result
pd.DataFrame(result)

Unnamed: 0,state,indiv_per_10k
8,District of Columbia,53.738381
11,Hawaii,29.079406
4,California,27.623825
37,Oregon,26.636307
28,Nevada,23.314189
47,Washington,21.829195
32,New York,20.392363


----

## [2] Aggregating DataFrames

- In this chapter, you’ll calculate summary statistics on DataFrame columns, and master grouped summary statistics and pivot tables.

### Summary statistics
- numbers that summarize a dataset and tell you about it


- `.mean()`
  - tells where the "center" of the data is
- `.median()` 
- `.mode()`
- `.min()`
- `.max()`
- `.std()`
- `.var()`
- `.sum()`
- `.quantile()`
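
A minimal sketch of several of these methods on one small column follows. The `weights` Series is hypothetical data chosen so the results are easy to check by hand.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
weights = pd.Series([20, 23, 25, 25, 32], name="weight_kg")

summary = {
    "mean": weights.mean(),
    "median": weights.median(),
    "mode": weights.mode().tolist(),  # .mode() returns a Series because ties are possible
    "min": weights.min(),
    "max": weights.max(),
    "sum": weights.sum(),
    "q25": weights.quantile(0.25),    # 25th percentile
}
print(summary)
```

Each method collapses the whole column into a single value, except `.mode()`, which can return more than one value when several values are equally common.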

### Summarizing dates
- `.min()` and `.max()` also work on date columns and show the time range the data covers

### The .agg() method
- a method that allows you to compute custom summary statistics

**A function called pct30 that computes the thirtieth percentile of a DataFrame column**
- takes in a column and spits out the column's 30th percentile

In [33]:
def pct30(column):
    return column.quantile(0.3)

# gives the 30th percentile of one column
homelessness['state_pop'].agg(pct30)

2092741.0

In [34]:
# gives the 30th percentile of multiple columns
pd.DataFrame(homelessness[['state_pop', 'individuals']].agg(pct30))

Unnamed: 0,0
state_pop,2092741.0
individuals,1745.0


In [35]:
# Multiple summaries
def pct40(column):
    return column.quantile(0.4)

homelessness['state_pop'].agg([pct30, pct40])

pct30    2092741.0
pct40    3148618.0
Name: state_pop, dtype: float64

### Computing Cumulative Statistics

#### Cumulative sum
- `.cumsum()`
  - calling `.cumsum()` on a column returns not just one number but a number for each row of the DataFrame
- `.cummax()`
- `.cummin()`
- `.cumprod()`

In [36]:
# row 0 is just the value of row 0
# row 1 is row 0 + row 1, and so on down the column
pd.DataFrame(homelessness['individuals'].cumsum()).head()

Unnamed: 0,individuals
0,2570.0
1,4004.0
2,11263.0
3,13543.0
4,122551.0


In [37]:
pd.DataFrame(homelessness['individuals'].cummax())

Unnamed: 0,individuals
0,2570.0
1,2570.0
2,7259.0
3,7259.0
4,109008.0
5,109008.0
6,109008.0
7,109008.0
8,109008.0
9,109008.0


In [38]:
pd.DataFrame(homelessness['individuals'].cummin())

Unnamed: 0,individuals
0,2570.0
1,1434.0
2,1434.0
3,1434.0
4,1434.0
5,1434.0
6,1434.0
7,708.0
8,708.0
9,708.0


---

#### Walmart (sales) dataset

- contains weekly sales in US dollars in various stores
- each store has an ID number and a specific store type
- sales are separated by department ID
- includes information on: 
    - whether it is holiday week or not
    - the average temperature during the week in that location
    - the average fuel price in dollars per liter that week
    - the national unemployment rate that week

In [39]:
# import dataset
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

sales = pd.read_csv("dataset/walmart.csv", index_col = 0)
sales.head()

Unnamed: 0,store,type,department,date,weekly_sales,is_holiday,temperature_c,fuel_price_usd_per_l,unemployment
0,1,A,1,2010-02-05,24924.5,False,5.727778,0.679451,8.106
1,1,A,1,2010-03-05,21827.9,False,8.055556,0.693452,8.106
2,1,A,1,2010-04-02,57258.43,False,16.816667,0.718284,7.808
3,1,A,1,2010-05-07,17413.94,False,22.527778,0.748928,7.808
4,1,A,1,2010-06-04,17558.09,False,27.05,0.714586,7.808


**Mean and median**

Summary statistics are exactly what they sound like - they summarize many numbers in one statistic. For example, mean, median, minimum, maximum, and standard deviation are summary statistics. Calculating summary statistics allows you to get a better sense of your data, even if there's a lot of it.

- Explore your new DataFrame first by printing the first few rows of the `sales` DataFrame.
- Print information about the columns in `sales`.
- Print the mean of the `weekly_sales` column.
- Print the median of the `weekly_sales` column.

In [40]:
# Print the head of the sales DataFrame
sales.head()

Unnamed: 0,store,type,department,date,weekly_sales,is_holiday,temperature_c,fuel_price_usd_per_l,unemployment
0,1,A,1,2010-02-05,24924.5,False,5.727778,0.679451,8.106
1,1,A,1,2010-03-05,21827.9,False,8.055556,0.693452,8.106
2,1,A,1,2010-04-02,57258.43,False,16.816667,0.718284,7.808
3,1,A,1,2010-05-07,17413.94,False,22.527778,0.748928,7.808
4,1,A,1,2010-06-04,17558.09,False,27.05,0.714586,7.808


In [41]:
# Print the info about the sales DataFrame
sales.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10774 entries, 0 to 10773
Data columns (total 9 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   store                 10774 non-null  int64  
 1   type                  10774 non-null  object 
 2   department            10774 non-null  int64  
 3   date                  10774 non-null  object 
 4   weekly_sales          10774 non-null  float64
 5   is_holiday            10774 non-null  bool   
 6   temperature_c         10774 non-null  float64
 7   fuel_price_usd_per_l  10774 non-null  float64
 8   unemployment          10774 non-null  float64
dtypes: bool(1), float64(4), int64(2), object(2)
memory usage: 768.1+ KB


In [42]:
# Print the mean of weekly_sales
print(sales['weekly_sales'].mean())

23843.950148505668


In [43]:
# Print the median of weekly_sales
print(sales['weekly_sales'].median())

12049.064999999999


> The mean weekly sales amount is almost double the median weekly sales amount! This can tell you that there are a few very high sales weeks that are making the mean so much higher than the median.

**Summarizing dates**

Summary statistics can also be calculated on date columns that have values with the data type `datetime64`. Some summary statistics — like mean — don't make a ton of sense on dates, but others are super helpful, for example, minimum and maximum, which allow you to see what time range your data covers.

- Print the `maximum` of the `date` column.
- Print the `minimum` of the `date` column.

In [44]:
# Print the maximum of the date column
print(sales['date'].max())

# Print the minimum of the date column
print(sales['date'].min())

2012-10-26
2010-02-05


> There are data from February of 2010 to October of 2012.
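
In the `.info()` output above, the `date` column is listed with dtype `object`, meaning the dates are stored as text. The minimum and maximum still come out right because year-month-day text sorts chronologically. If true `datetime64` values are wanted, one option is `pd.to_datetime()`. The sketch below uses a hypothetical `toy_sales` DataFrame with three made-up rows rather than the real dataset.

```python
import pandas as pd

# Hypothetical toy rows that mimic the text format of the date column
toy_sales = pd.DataFrame({"date": ["2010-02-05", "2012-10-26", "2011-06-17"]})

# Parse the text column into datetime64 values
toy_sales["date"] = pd.to_datetime(toy_sales["date"])

earliest = toy_sales["date"].min()
latest = toy_sales["date"].max()
span = latest - earliest  # subtracting two dates gives a Timedelta

print(earliest.date(), latest.date(), span.days)
```

With real datetime values, date arithmetic such as the span between the first and last week becomes available, which plain text cannot offer.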

**Efficient summaries**

While pandas and NumPy have tons of functions, sometimes, you may need a different function to summarize your data.

The `.agg()` method allows you to apply your own custom functions to a DataFrame, as well as apply functions to more than one column of a DataFrame at once, making your aggregations super-efficient. For example,

`df['column'].agg(function)`


In the custom function for this exercise, "IQR" is short for inter-quartile range, which is the 75th percentile minus the 25th percentile. It's an alternative to standard deviation that is helpful if your data contains outliers.

- Use the custom `iqr` function defined for you along with `.agg()` to print the IQR of the `temperature_c` column of `sales`.

In [45]:
# A custom IQR function
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)
    
# Print IQR of the temperature_c column
print(sales['temperature_c'].agg(iqr))

16.58333333333334


- Update the column selection to use the custom `iqr` function with `.agg()` to print the IQR of `temperature_c`, `fuel_price_usd_per_l`, and `unemployment`, in that order.

In [46]:
# A custom IQR function
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)

# Update to print IQR of temperature_c, fuel_price_usd_per_l, & unemployment
print(sales[["temperature_c", 'fuel_price_usd_per_l', 'unemployment']].agg(iqr))

temperature_c           16.583333
fuel_price_usd_per_l     0.073176
unemployment             0.565000
dtype: float64


- Update the aggregation functions called by `.agg()`: include `iqr` and `np.median` in that order.

In [47]:
# Import NumPy and create custom IQR function
import numpy as np
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)

# Update to print IQR and median of temperature_c, fuel_price_usd_per_l, & unemployment
print(sales[["temperature_c", "fuel_price_usd_per_l", "unemployment"]].agg([iqr, np.median]))

        temperature_c  fuel_price_usd_per_l  unemployment
iqr         16.583333              0.073176         0.565
median      16.966667              0.743381         8.099
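
`.agg()` also accepts a dictionary, which is handy when each column needs a different function. The sketch below uses a hypothetical `toy` DataFrame with made-up values; only the column names are borrowed from `sales`:

```python
import pandas as pd

# Same custom IQR function as in the exercise
def iqr(column):
    return column.quantile(0.75) - column.quantile(0.25)

# Hypothetical data (not the real sales values)
toy = pd.DataFrame({
    "temperature_c": [10.0, 12.0, 14.0, 16.0, 18.0],
    "unemployment": [8.0, 8.0, 7.5, 7.5, 7.0],
})

# Map each column name to the function to apply to it
summary = toy.agg({"temperature_c": iqr, "unemployment": "median"})
print(summary)
```

The result is a Series indexed by column name, with one value per column.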


**Cumulative statistics**

Cumulative statistics can also be helpful in tracking summary statistics over time. In this exercise, you'll calculate the cumulative sum and cumulative max of a department's weekly sales, which will allow you to identify what the total sales were so far as well as what the highest weekly sales were so far.

A DataFrame called `sales_1_1` has been created for you, which contains the sales data for department 1 of store 1.

In [48]:
sales_1_1 = sales[(sales['store'] == 1) & (sales['department'] == 1)]
sales_1_1

Unnamed: 0,store,type,department,date,weekly_sales,is_holiday,temperature_c,fuel_price_usd_per_l,unemployment
0,1,A,1,2010-02-05,24924.5,False,5.727778,0.679451,8.106
1,1,A,1,2010-03-05,21827.9,False,8.055556,0.693452,8.106
2,1,A,1,2010-04-02,57258.43,False,16.816667,0.718284,7.808
3,1,A,1,2010-05-07,17413.94,False,22.527778,0.748928,7.808
4,1,A,1,2010-06-04,17558.09,False,27.05,0.714586,7.808
5,1,A,1,2010-07-02,16333.14,False,27.172222,0.705076,7.787
6,1,A,1,2010-08-06,17508.41,False,30.644444,0.69398,7.787
7,1,A,1,2010-09-03,16241.78,False,27.338889,0.680772,7.787
8,1,A,1,2010-10-01,20094.19,False,22.161111,0.68764,7.838
9,1,A,1,2010-11-05,34238.88,False,14.855556,0.710359,7.838


In [49]:
# Sort sales_1_1 by date
sales_1_1 = sales_1_1.sort_values("date", ascending = True)
sales_1_1

Unnamed: 0,store,type,department,date,weekly_sales,is_holiday,temperature_c,fuel_price_usd_per_l,unemployment
0,1,A,1,2010-02-05,24924.5,False,5.727778,0.679451,8.106
1,1,A,1,2010-03-05,21827.9,False,8.055556,0.693452,8.106
2,1,A,1,2010-04-02,57258.43,False,16.816667,0.718284,7.808
3,1,A,1,2010-05-07,17413.94,False,22.527778,0.748928,7.808
4,1,A,1,2010-06-04,17558.09,False,27.05,0.714586,7.808
5,1,A,1,2010-07-02,16333.14,False,27.172222,0.705076,7.787
6,1,A,1,2010-08-06,17508.41,False,30.644444,0.69398,7.787
7,1,A,1,2010-09-03,16241.78,False,27.338889,0.680772,7.787
8,1,A,1,2010-10-01,20094.19,False,22.161111,0.68764,7.838
9,1,A,1,2010-11-05,34238.88,False,14.855556,0.710359,7.838


In [50]:
# Get the cumulative sum of weekly_sales, add as cum_weekly_sales col
sales_1_1['cum_weekly_sales'] = sales_1_1['weekly_sales'].cumsum()
sales_1_1['cum_weekly_sales']

0      24924.50
1      46752.40
2     104010.83
3     121424.77
4     138982.86
5     155316.00
6     172824.41
7     189066.19
8     209160.38
9     243399.26
10    265916.82
11    281901.06
Name: cum_weekly_sales, dtype: float64

In [51]:
# Sort sales_1_1 by date
sales_1_1 = sales_1_1.sort_values("date", ascending = True)

# Get the cumulative sum of weekly_sales, add as cum_weekly_sales col
sales_1_1['cum_weekly_sales'] = sales_1_1['weekly_sales'].cumsum()

# Get the cumulative max of weekly_sales, add as cum_max_sales col
sales_1_1['cum_max_sales'] = sales_1_1['weekly_sales'].cummax()

# See the columns you calculated
print(sales_1_1[["date", "weekly_sales", "cum_weekly_sales", "cum_max_sales"]])

          date  weekly_sales  cum_weekly_sales  cum_max_sales
0   2010-02-05      24924.50          24924.50       24924.50
1   2010-03-05      21827.90          46752.40       24924.50
2   2010-04-02      57258.43         104010.83       57258.43
3   2010-05-07      17413.94         121424.77       57258.43
4   2010-06-04      17558.09         138982.86       57258.43
5   2010-07-02      16333.14         155316.00       57258.43
6   2010-08-06      17508.41         172824.41       57258.43
7   2010-09-03      16241.78         189066.19       57258.43
8   2010-10-01      20094.19         209160.38       57258.43
9   2010-11-05      34238.88         243399.26       57258.43
10  2010-12-03      22517.56         265916.82       57258.43
11  2011-01-07      15984.24         281901.06       57258.43


> You've accumulated success! Not all functions that calculate on columns return a single number. Some, like the cumulative statistic functions, return a whole column.
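
The same steps can be sketched on a tiny made-up DataFrame. The `toy` data and the `store_1` name are hypothetical. Calling `.copy()` after subsetting is an optional precaution: it makes the subset an independent DataFrame before new columns are added to it.

```python
import pandas as pd

# Hypothetical mini version of the sales data (values are made up)
toy = pd.DataFrame({
    "store": [1, 1, 2, 1],
    "date": ["2010-03-05", "2010-02-05", "2010-02-05", "2010-04-02"],
    "weekly_sales": [200.0, 100.0, 999.0, 50.0],
})

# Subset store 1; .copy() gives an independent DataFrame to add columns to
store_1 = toy[toy["store"] == 1].copy()

# Sort by date so the running values follow calendar order
store_1 = store_1.sort_values("date")
store_1["cum_weekly_sales"] = store_1["weekly_sales"].cumsum()
store_1["cum_max_sales"] = store_1["weekly_sales"].cummax()

print(store_1[["date", "weekly_sales", "cum_weekly_sales", "cum_max_sales"]])
```

Sorting first matters: the cumulative methods run in row order, so unsorted dates would give running totals that do not follow time.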

---

### Counting
- summarize categorical data using counting

#### Dropping duplicate row entry

In [52]:
sales_1_1.drop_duplicates(subset = 'date')

Unnamed: 0,store,type,department,date,weekly_sales,is_holiday,temperature_c,fuel_price_usd_per_l,unemployment,cum_weekly_sales,cum_max_sales
0,1,A,1,2010-02-05,24924.5,False,5.727778,0.679451,8.106,24924.5,24924.5
1,1,A,1,2010-03-05,21827.9,False,8.055556,0.693452,8.106,46752.4,24924.5
2,1,A,1,2010-04-02,57258.43,False,16.816667,0.718284,7.808,104010.83,57258.43
3,1,A,1,2010-05-07,17413.94,False,22.527778,0.748928,7.808,121424.77,57258.43
4,1,A,1,2010-06-04,17558.09,False,27.05,0.714586,7.808,138982.86,57258.43
5,1,A,1,2010-07-02,16333.14,False,27.172222,0.705076,7.787,155316.0,57258.43
6,1,A,1,2010-08-06,17508.41,False,30.644444,0.69398,7.787,172824.41,57258.43
7,1,A,1,2010-09-03,16241.78,False,27.338889,0.680772,7.787,189066.19,57258.43
8,1,A,1,2010-10-01,20094.19,False,22.161111,0.68764,7.838,209160.38,57258.43
9,1,A,1,2010-11-05,34238.88,False,14.855556,0.710359,7.838,243399.26,57258.43


#### Dropping duplicate pairs

In [53]:
sales_1_1.drop_duplicates(subset = ['date', 'temperature_c'])

Unnamed: 0,store,type,department,date,weekly_sales,is_holiday,temperature_c,fuel_price_usd_per_l,unemployment,cum_weekly_sales,cum_max_sales
0,1,A,1,2010-02-05,24924.5,False,5.727778,0.679451,8.106,24924.5,24924.5
1,1,A,1,2010-03-05,21827.9,False,8.055556,0.693452,8.106,46752.4,24924.5
2,1,A,1,2010-04-02,57258.43,False,16.816667,0.718284,7.808,104010.83,57258.43
3,1,A,1,2010-05-07,17413.94,False,22.527778,0.748928,7.808,121424.77,57258.43
4,1,A,1,2010-06-04,17558.09,False,27.05,0.714586,7.808,138982.86,57258.43
5,1,A,1,2010-07-02,16333.14,False,27.172222,0.705076,7.787,155316.0,57258.43
6,1,A,1,2010-08-06,17508.41,False,30.644444,0.69398,7.787,172824.41,57258.43
7,1,A,1,2010-09-03,16241.78,False,27.338889,0.680772,7.787,189066.19,57258.43
8,1,A,1,2010-10-01,20094.19,False,22.161111,0.68764,7.838,209160.38,57258.43
9,1,A,1,2010-11-05,34238.88,False,14.855556,0.710359,7.838,243399.26,57258.43
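
A small sketch of how `subset` changes which rows count as duplicates, using a hypothetical `toy` DataFrame with made-up values:

```python
import pandas as pd

# Hypothetical data: store 1 appears three times, the (1, 1) pair twice
toy = pd.DataFrame({
    "store": [1, 1, 1, 2],
    "department": [1, 1, 2, 1],
    "weekly_sales": [10.0, 20.0, 30.0, 40.0],
})

# One row per store: the first occurrence of each store is kept
by_store = toy.drop_duplicates(subset="store")

# One row per store/department pair
by_pair = toy.drop_duplicates(subset=["store", "department"])

print(by_store)
print(by_pair)
```

By default the first matching row is kept, so the values in the other columns come from that first row.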


#### Counting - Easy as 1, 2, 3

In [54]:
sales['store'].value_counts().to_frame()

Unnamed: 0,store
13,913
20,910
19,906
10,902
4,901
1,901
27,900
2,897
6,894
31,890


In [55]:
sales['store'].value_counts(sort = True).to_frame()

Unnamed: 0,store
13,913
20,910
19,906
10,902
4,901
1,901
27,900
2,897
6,894
31,890


#### Proportions
- `normalize`
    - an argument of `.value_counts()`; passing `normalize = True` turns counts into proportions of the total

In [56]:
sales['store'].value_counts(normalize = True).to_frame()

Unnamed: 0,store
13,0.084741
20,0.084463
19,0.084091
10,0.08372
4,0.083627
1,0.083627
27,0.083534
2,0.083256
6,0.082978
31,0.082606
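
A minimal sketch of counts versus proportions on a made-up Series (the `store_type` values are hypothetical):

```python
import pandas as pd

# Hypothetical store types: three "A" stores and one "B" store
store_type = pd.Series(["A", "A", "A", "B"], name="type")

counts = store_type.value_counts()               # raw counts, largest first
props = store_type.value_counts(normalize=True)  # counts divided by the total

print(counts)
print(props)
```

The proportions always sum to 1, which makes them easier to compare across datasets of different sizes.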


**Dropping duplicates**

Removing duplicates is an essential skill to get accurate counts because often, you don't want to count the same thing multiple times. In this exercise, you'll create some new DataFrames using unique values from `sales`.

- Remove rows of `sales` with duplicate pairs of `store` and `type` and save as `store_types` and print the head.
- Remove rows of `sales` with duplicate pairs of `store` and `department` and save as `store_depts` and print the head.
- Subset the rows that are holiday weeks using the `is_holiday` column, and drop the duplicate `date`s, saving as `holiday_dates`.
- Select the `date` column of `holiday_dates`, and print.

In [57]:
# Drop duplicate store/type combinations
store_types = sales.drop_duplicates(subset = ['store', 'type'])
print(store_types.head())

# Drop duplicate store/department combinations
store_depts = sales.drop_duplicates(subset = ['store', 'department'])
print(store_depts.head())

# Subset the rows where is_holiday is True and drop duplicate dates
holiday_dates = sales[sales['is_holiday']].drop_duplicates(subset = 'date')

# Print date col of holiday_dates
print(holiday_dates['date'])

      store type  department        date  weekly_sales  is_holiday  \
0         1    A           1  2010-02-05      24924.50       False   
901       2    A           1  2010-02-05      35034.06       False   
1798      4    A           1  2010-02-05      38724.42       False   
2699      6    A           1  2010-02-05      25619.00       False   
3593     10    B           1  2010-02-05      40212.84       False   

      temperature_c  fuel_price_usd_per_l  unemployment  
0          5.727778              0.679451         8.106  
901        4.550000              0.679451         8.324  
1798       6.533333              0.686319         8.623  
2699       4.683333              0.679451         7.259  
3593      12.411111              0.782478         9.765  
    store type  department        date  weekly_sales  is_holiday  \
0       1    A           1  2010-02-05      24924.50       False   
12      1    A           2  2010-02-05      50605.27       False   
24      1    A           3 

> Dazzling duplicate dropping! The holiday weeks correspond to the Super Bowl in February, Labor Day in September, Thanksgiving in November, and Christmas in December.

**Counting categorical variables**

Counting is a great way to get an overview of your data and to spot curiosities that you might not notice otherwise. In this exercise, you'll count the number of each type of store and the number of each department number using the DataFrames you created in the previous exercise:

```
# Drop duplicate store/type combinations
store_types = sales.drop_duplicates(subset=["store", "type"])

# Drop duplicate store/department combinations
store_depts = sales.drop_duplicates(subset=["store", "department"])
```

<br>

- Count the number of stores of each store `type` in `store_types`.
- Count the proportion of stores of each store `type` in `store_types`.
- Count the number of different `department`s in `store_depts`, sorting the counts in descending order.
- Count the proportion of different `department`s in `store_depts`, sorting the proportions in descending order.

In [58]:
# Count the number of stores of each type
store_counts = store_types['type'].value_counts()
print(store_counts)

# Get the proportion of stores of each type
store_props = store_types['type'].value_counts(normalize = True)
print(store_props)

# Count the number of each department number and sort
dept_counts_sorted = store_depts['department'].value_counts(sort = True)
print(dept_counts_sorted)

# Get the proportion of departments of each number and sort
dept_props_sorted = store_depts['department'].value_counts(sort = True, normalize = True)
print(dept_props_sorted)

A    11
B     1
Name: type, dtype: int64
A    0.916667
B    0.083333
Name: type, dtype: float64
41    12
30    12
23    12
24    12
25    12
      ..
37    10
48     8
50     6
39     4
43     2
Name: department, Length: 80, dtype: int64
41    0.012917
30    0.012917
23    0.012917
24    0.012917
25    0.012917
        ...   
37    0.010764
48    0.008611
50    0.006459
39    0.004306
43    0.002153
Name: department, Length: 80, dtype: float64


> Great counting! It looks like department 43 only exists in two stores.

---

### Grouped summary statistics

In [90]:
sales.groupby('store')['weekly_sales'].mean().to_frame()

Unnamed: 0_level_0,weekly_sales
store,Unnamed: 1_level_1
1,20896.941787
2,26517.435162
4,26126.986071
6,21561.186477
10,25696.67837
13,25664.149474
14,30384.003017
19,19930.838157
20,28382.766385
27,24207.474711


In [60]:
sales.groupby('store')['department'].count().to_frame()

Unnamed: 0_level_0,department
store,Unnamed: 1_level_1
1,901
2,897
4,901
6,894
10,902
13,913
14,885
19,906
20,910
27,900


In [94]:
# multiple grouped summaries
sales.groupby('store')['weekly_sales'].agg([min, max, sum])

Unnamed: 0_level_0,min,max,sum
store,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,-698.0,140504.41,18828144.55
2,-1098.0,178982.89,23786139.34
4,-88.0,165765.5,23540414.45
6,-698.0,119812.38,19275700.71
10,-798.0,232558.51,23178403.89
13,-98.0,166872.27,23431368.47
14,-498.0,293966.05,26889842.67
19,-449.0,147449.06,18057339.37
20,-598.0,195223.84,25828317.41
27,-409.0,177605.86,21786727.24
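
The cell above passes the Python built-ins `min`, `max`, and `sum` to `.agg()`. An equivalent spelling is to pass the function names as strings, which some newer pandas versions appear to prefer (they may warn when built-ins are passed). A sketch on a hypothetical `toy` DataFrame:

```python
import pandas as pd

# Hypothetical data (values are made up)
toy = pd.DataFrame({
    "store": [1, 1, 2, 2],
    "weekly_sales": [10.0, 30.0, 5.0, 15.0],
})

# String names select the built-in pandas aggregations
stats = toy.groupby("store")["weekly_sales"].agg(["min", "max", "sum"])
print(stats)
```

The result has one row per store and one column per requested statistic.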


In [95]:
# Grouping by multiple variables

# Average Weekly Sales per Department per Store
sales.groupby(['store', 'department'])['weekly_sales'].mean().to_frame()

Unnamed: 0_level_0,Unnamed: 1_level_0,weekly_sales
store,department,Unnamed: 2_level_1
1,1,23491.755000
1,2,47421.124167
1,3,12872.590000
1,4,38382.255833
1,5,23761.120000
...,...,...
39,95,97174.169167
39,96,24947.875833
39,97,23002.670000
39,98,9089.097500


In [98]:
# many groups, many summaries

sales.groupby(['store', 'department'])[['weekly_sales', 'unemployment']].mean()

Unnamed: 0_level_0,Unnamed: 1_level_0,weekly_sales,unemployment
store,department,Unnamed: 2_level_1,Unnamed: 3_level_1
1,1,23491.755000,7.854417
1,2,47421.124167,7.854417
1,3,12872.590000,7.854417
1,4,38382.255833,7.854417
1,5,23761.120000,7.854417
...,...,...,...
39,95,97174.169167,8.450250
39,96,24947.875833,8.450250
39,97,23002.670000,8.450250
39,98,9089.097500,8.450250


In [118]:
# count distinct values

sales.groupby(['store', 'type']).nunique()

Unnamed: 0_level_0,Unnamed: 1_level_0,department,date,weekly_sales,is_holiday,temperature_c,fuel_price_usd_per_l,unemployment
store,type,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
1,A,77,46,891,2,46,45,12
2,A,78,45,886,2,45,43,12
4,A,78,48,891,2,48,47,12
6,A,77,43,889,2,43,42,11
10,B,77,33,891,2,33,33,9
13,A,79,39,907,2,38,38,12
14,A,77,49,871,2,49,47,12
19,A,79,51,893,2,51,50,12
20,A,78,57,893,2,57,54,12
27,A,78,53,886,2,53,50,12


In [120]:
# count ALL values

sales.groupby(['store', 'type']).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,department,date,weekly_sales,is_holiday,temperature_c,fuel_price_usd_per_l,unemployment
store,type,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
1,A,901,901,901,901,901,901,901
2,A,897,897,897,897,897,897,897
4,A,901,901,901,901,901,901,901
6,A,894,894,894,894,894,894,894
10,B,902,902,902,902,902,902,902
13,A,913,913,913,913,913,913,913
14,A,885,885,885,885,885,885,885
19,A,906,906,906,906,906,906,906
20,A,910,910,910,910,910,910,910
27,A,900,900,900,900,900,900,900


--- 

**What percent of sales occurred at each store type?**

While `.groupby()` is useful, you can calculate grouped summary statistics without it.

Walmart distinguishes three types of stores: "supercenters," "discount stores," and "neighborhood markets," encoded in this dataset as type "A," "B," and "C." In this exercise, you'll calculate the total sales made at each store type, without using `.groupby()`. You can then use these numbers to see what proportion of Walmart's total sales were made at each type.

- Calculate the total `weekly_sales` over the whole dataset.
- Subset for `type "A"` stores, and calculate their total weekly sales.
- Do the same for `type "B"` and `type "C"` stores.
- Combine the A/B/C results into a list, and divide by `sales_all` to get the proportion of sales by type.

In [121]:
# Calc total weekly sales
sales_all = sales["weekly_sales"].sum()

# Subset for type A stores, calc total weekly sales
sales_A = sales[sales["type"] == "A"]["weekly_sales"].sum()

# Subset for type B stores, calc total weekly sales
sales_B = sales[sales["type"] == "B"]["weekly_sales"].sum()

# Subset for type C stores, calc total weekly sales
sales_C = sales[sales["type"] == "C"]["weekly_sales"].sum()

# Get proportion for each type
sales_propn_by_type = [sales_A, sales_B, sales_C] / sales_all
print(sales_propn_by_type)

[0.9097747 0.0902253 0.       ]


> Marvelous mathematics! About 91% of sales occurred in stores of type A, 9% in stores of type B, and there are no sales records for stores of type C.

**Calculations with .groupby()**

The `.groupby()` method makes life much easier. In this exercise, you'll perform the same calculations as last time, except you'll use the `.groupby()` method. You'll also perform calculations on data grouped by two variables to see if sales differ by store type depending on if it's a holiday week or not.

- Group `sales` by `"type"`, take the sum of `"weekly_sales"`, and store as `sales_by_type`.
- Calculate the proportion of sales at each store type by dividing by the sum of `sales_by_type`. Assign to `sales_propn_by_type`.

In [132]:
# Group by type; calc total weekly sales
sales_by_type = sales.groupby("type")["weekly_sales"].sum()

# Get proportion for each type
sales_propn_by_type = sales_by_type / sum(sales_by_type)
pd.DataFrame(sales_propn_by_type)

Unnamed: 0_level_0,weekly_sales
type,Unnamed: 1_level_1
A,0.909775
B,0.090225


- Group sales by `"type"` and `"is_holiday"`, take the sum of `weekly_sales`, and store as `sales_by_type_is_holiday`.

In [131]:
# From previous step
sales_by_type = sales.groupby("type")["weekly_sales"].sum()

# Group by type and is_holiday; calc total weekly sales
sales_by_type_is_holiday = sales.groupby(['type',  'is_holiday'])['weekly_sales'].sum()
pd.DataFrame(sales_by_type_is_holiday)

Unnamed: 0_level_0,Unnamed: 1_level_0,weekly_sales
type,is_holiday,Unnamed: 2_level_1
A,False,233692700.0
A,True,23601.81
B,False,23176780.0
B,True,1621.41


#### Multiple grouped summaries

Earlier in this chapter, you saw that the `.agg()` method is useful to compute multiple statistics on multiple variables. It also works with grouped data. NumPy, which is imported as `np`, has many different summary statistics functions, including: `np.min`, `np.max`, `np.mean`, and `np.median`.

- Get the min, max, mean, and median of `weekly_sales` for each store type using `.groupby()` and `.agg()`. Store this as `sales_stats`. Make sure to use `numpy` functions!
- Get the min, max, mean, and median of `unemployment` and `fuel_price_usd_per_l` for each store type. Store this as `unemp_fuel_stats`.

In [139]:
# Import numpy with the alias np
import numpy as np

# For each store type, aggregate weekly_sales: get min, max, mean, and median
sales_stats = sales.groupby('type')['weekly_sales'].agg([np.min, np.max, np.mean, np.median])

# Print sales_stats
pd.DataFrame(sales_stats)


Unnamed: 0_level_0,amin,amax,mean,median
type,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
A,-1098.0,293966.05,23674.667242,11943.92
B,-798.0,232558.51,25696.67837,13336.08


In [150]:
# For each store type, aggregate store and weekly_sales: get min, max, mean, and median
sales_stats = sales.groupby('type')[['store', 'weekly_sales']].agg([np.min, np.max, np.mean, np.median])
pd.DataFrame(sales_stats)

Unnamed: 0_level_0,store,store,store,store,weekly_sales,weekly_sales,weekly_sales,weekly_sales
Unnamed: 0_level_1,amin,amax,mean,median,amin,amax,mean,median
type,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
A,1,39,15.939121,14,-1098.0,293966.05,23674.667242,11943.92
B,10,10,10.0,10,-798.0,232558.51,25696.67837,13336.08


In [153]:
# For each store type, count the number of distinct stores
sales_stats = sales.groupby('type')['store'].nunique()
pd.DataFrame(sales_stats)

Unnamed: 0_level_0,store
type,Unnamed: 1_level_1
A,11
B,1


In [138]:
# For each store type, aggregate unemployment and fuel_price_usd_per_l: get min, max, mean, and median
unemp_fuel_stats = sales.groupby('type')[['unemployment', 'fuel_price_usd_per_l']].agg([np.min, np.max, np.mean, np.median])


# Print unemp_fuel_stats
pd.DataFrame(unemp_fuel_stats)

Unnamed: 0_level_0,unemployment,unemployment,unemployment,unemployment,fuel_price_usd_per_l,fuel_price_usd_per_l,fuel_price_usd_per_l,fuel_price_usd_per_l
Unnamed: 0_level_1,amin,amax,mean,median,amin,amax,mean,median
type,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
A,3.879,8.992,7.972611,8.067,0.664129,1.10741,0.744619,0.735455
B,7.17,9.765,9.279323,9.199,0.760023,1.107674,0.805858,0.803348
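
Passing a list of functions produces two-level column headers, as in the output above. When flat, self-describing column names are preferred, pandas' named aggregation is one option. A sketch on a hypothetical `toy` DataFrame; the output names `unemp_min`, `unemp_median`, and `fuel_mean` are made up for illustration:

```python
import pandas as pd

# Hypothetical data (values are made up)
toy = pd.DataFrame({
    "type": ["A", "A", "B", "B"],
    "unemployment": [8.0, 7.0, 9.0, 10.0],
    "fuel_price_usd_per_l": [0.70, 0.80, 0.80, 0.90],
})

# Named aggregation: new_column_name=(source_column, function_name)
stats = toy.groupby("type").agg(
    unemp_min=("unemployment", "min"),
    unemp_median=("unemployment", "median"),
    fuel_mean=("fuel_price_usd_per_l", "mean"),
)
print(stats)
```

Each keyword becomes a single-level column, so the result is easier to select from than a multi-level header.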


----

### Pivot Tables

- another way of calculating grouped summary statistics

In [142]:
sales.groupby('store')['weekly_sales'].mean()

store
1     20896.941787
2     26517.435162
4     26126.986071
6     21561.186477
10    25696.678370
13    25664.149474
14    30384.003017
19    19930.838157
20    28382.766385
27    24207.474711
31    18178.932225
39    18414.938423
Name: weekly_sales, dtype: float64

#### Group by to pivot table

**values**
- the `values` argument is the column that you want to summarize

**index**
- the `index` argument is the column that you want to group by

> By default, pivot_table takes the mean value for each group.
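
A quick sketch of the equivalence on a hypothetical `toy` DataFrame (made-up values): grouping by `store` and taking the mean gives the same numbers as a pivot table with `index = 'store'`.

```python
import pandas as pd

# Hypothetical data (values are made up)
toy = pd.DataFrame({
    "store": [1, 1, 2],
    "weekly_sales": [10.0, 20.0, 40.0],
})

# Grouped mean: returns a Series indexed by store
grouped = toy.groupby("store")["weekly_sales"].mean()

# Pivot table with the default aggregation (mean): returns a DataFrame
pivoted = toy.pivot_table(values="weekly_sales", index="store")

print(grouped)
print(pivoted)
```

The main visible difference is the return type: `.pivot_table()` gives a DataFrame, while selecting one column after `.groupby()` gives a Series.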

In [154]:
sales.pivot_table(values = 'weekly_sales', index = 'store')

Unnamed: 0_level_0,weekly_sales
store,Unnamed: 1_level_1
1,20896.941787
2,26517.435162
4,26126.986071
6,21561.186477
10,25696.67837
13,25664.149474
14,30384.003017
19,19930.838157
20,28382.766385
27,24207.474711


#### Different Statistics

`aggfunc`
- used to get a different summary statistic
- pass in a function

In [155]:
sales.pivot_table(values = 'weekly_sales', index = 'store', aggfunc = np.median)

Unnamed: 0_level_0,weekly_sales
store,Unnamed: 1_level_1
1,9775.17
2,13764.94
4,13063.93
6,13200.78
10,13336.08
13,13050.06
14,14792.99
19,11091.77
20,14706.675
27,12689.85


#### Multiple statistics

In [156]:
sales.pivot_table(values = 'weekly_sales', index = 'store', aggfunc = [np.median, np.mean])

Unnamed: 0_level_0,median,mean
Unnamed: 0_level_1,weekly_sales,weekly_sales
store,Unnamed: 1_level_2,Unnamed: 2_level_2
1,9775.17,20896.941787
2,13764.94,26517.435162
4,13063.93,26126.986071
6,13200.78,21561.186477
10,13336.08,25696.67837
13,13050.06,25664.149474
14,14792.99,30384.003017
19,11091.77,19930.838157
20,14706.675,28382.766385
27,12689.85,24207.474711


#### Pivot on two variables
- to group by two variables, pass a second column name to the `columns` argument
- to replace the missing values that appear for combinations with no data, use the `fill_value` argument
- with `margins = True`, the last row and last column (`All`) contain the mean of all the values in that column or row, not including the missing values that were filled in with 0s
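
The points above can be sketched on a hypothetical `toy` DataFrame (made-up values) in which store 1 only has type "A" rows and store 2 only has type "B" rows:

```python
import pandas as pd

# Hypothetical data: each store has a single type, so some cells are empty
toy = pd.DataFrame({
    "store": [1, 1, 2],
    "type": ["A", "A", "B"],
    "weekly_sales": [10.0, 20.0, 40.0],
})

table = toy.pivot_table(
    values="weekly_sales",
    index="store",
    columns="type",
    fill_value=0,   # empty store/type combinations show 0 instead of NaN
    margins=True,   # adds the "All" row and column
)
print(table)
```

In this sketch the `All` entry for type A is 15 (the mean of the two real type A values), not 7.5, which is what averaging the displayed 15 and the filled-in 0 would give.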

In [162]:
sales.pivot_table(values = 'weekly_sales', index = 'store', columns = 'type', fill_value = 0, margins = True)

type,A,B,All
store,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,20896.941787,0.0,20896.941787
2,26517.435162,0.0,26517.435162
4,26126.986071,0.0,26126.986071
6,21561.186477,0.0,21561.186477
10,0.0,25696.67837,25696.67837
13,25664.149474,0.0,25664.149474
14,30384.003017,0.0,30384.003017
19,19930.838157,0.0,19930.838157
20,28382.766385,0.0,28382.766385
27,24207.474711,0.0,24207.474711


**Pivoting on one variable**

Pivot tables are the standard way of aggregating data in spreadsheets. In pandas, pivot tables are essentially just another way of performing grouped calculations. That is, the `.pivot_table()` method is just an alternative to `.groupby()`.

In this exercise, you'll perform calculations using `.pivot_table()` to replicate the calculations you performed in the last lesson using `.groupby()`.

- Get the mean `weekly_sales` by type using `.pivot_table()` and store as `mean_sales_by_type`.

In [167]:
# Pivot for mean weekly_sales for each store type
mean_sales_by_type = sales.pivot_table(values = 'weekly_sales', index = 'type')

# Print mean_sales_by_type
pd.DataFrame(mean_sales_by_type)

Unnamed: 0_level_0,weekly_sales
type,Unnamed: 1_level_1
A,23674.667242
B,25696.67837


- Get the mean and median (using NumPy functions) of `weekly_sales` by `type` using `.pivot_table()` and store as `mean_med_sales_by_type`.

In [166]:
# Import NumPy as np
import numpy as np

# Pivot for mean and median weekly_sales for each store type
mean_med_sales_by_type = sales.pivot_table(values = 'weekly_sales', index = 'type', aggfunc = [np.mean, np.median])

# Print mean_med_sales_by_type
pd.DataFrame(mean_med_sales_by_type)

Unnamed: 0_level_0,mean,median
Unnamed: 0_level_1,weekly_sales,weekly_sales
type,Unnamed: 1_level_2,Unnamed: 2_level_2
A,23674.667242,11943.92
B,25696.67837,13336.08


- Get the mean of `weekly_sales` by `type` and `is_holiday` using `.pivot_table()` and store as `mean_sales_by_type_holiday`.

In [169]:
# Pivot for mean weekly_sales by store type and holiday 
mean_sales_by_type_holiday = sales.pivot_table(values = 'weekly_sales', index = 'type', columns = 'is_holiday')

# Print mean_sales_by_type_holiday
pd.DataFrame(mean_sales_by_type_holiday)

is_holiday,False,True
type,Unnamed: 1_level_1,Unnamed: 2_level_1
A,23768.583523,590.04525
B,25751.980533,810.705


> Pivot tables are another way to do the same thing as a group-by-then-summarize.
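
For two variables, the matching group-by version needs one extra step: `.unstack()` moves the second grouping level into the columns. A sketch on a hypothetical `toy` DataFrame (made-up values):

```python
import pandas as pd

# Hypothetical data: type "B" has no holiday rows
toy = pd.DataFrame({
    "type": ["A", "A", "B", "B"],
    "is_holiday": [False, True, False, False],
    "weekly_sales": [10.0, 30.0, 20.0, 40.0],
})

# Pivot table: rows are types, columns are is_holiday values
pivoted = toy.pivot_table(values="weekly_sales", index="type", columns="is_holiday")

# Group by both variables, then move is_holiday from the index to the columns
unstacked = toy.groupby(["type", "is_holiday"])["weekly_sales"].mean().unstack()

print(pivoted)
print(unstacked)
```

Both tables have a missing value for type "B" in holiday weeks, since no such rows exist in this made-up data.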

**Fill in missing values and sum values with pivot tables**

The `.pivot_table()` method has several useful arguments, including `fill_value` and `margins`.

- `fill_value` replaces missing values with a real value (known as *imputation*). What to replace missing values with is a topic big enough to have its own course (Dealing with Missing Data in Python), but the simplest thing to do is to substitute a dummy value.
- `margins` is a shortcut for when you pivoted by two variables, but also wanted to pivot by each of those variables separately: it adds an `All` row and column that apply the same aggregation (the mean by default) across each row and column of the pivot table.


In this exercise, you'll practice using these arguments to up your pivot table skills, which will help you crunch numbers more efficiently!
- Print the mean `weekly_sales` by department and type, filling in any missing values with `0`.

In [171]:
# Print mean weekly_sales by department and type; fill missing values with 0
pd.DataFrame(sales.pivot_table(values = 'weekly_sales', index = 'type', columns = 'department', fill_value = 0))

department,1,2,3,4,5,6,7,8,9,10,...,90,91,92,93,94,95,96,97,98,99
type,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
A,30961.725379,67600.158788,17160.002955,44285.399091,34821.011364,7136.292652,38454.336818,48583.475303,30120.449924,30930.456364,...,85776.905909,70423.165227,139722.204773,53413.633939,60081.155303,123933.787121,21367.042857,28471.26697,12875.423182,379.123659
B,44050.626667,112958.526667,30580.655,51219.654167,63236.875,10717.2975,52909.653333,90733.753333,66679.301667,48595.126667,...,14780.21,13199.6025,50859.278333,1466.274167,161.445833,77082.1025,9528.538333,5828.873333,217.428333,0.0


- Print the mean `weekly_sales` by `department` and `type`, filling in any missing values with `0` and adding the `All` row and column with `margins`.

In [173]:
# Print the mean weekly_sales by department and type; fill missing values with 0s; add the All row and column
pd.DataFrame(sales.pivot_table(values = "weekly_sales", index = "department", columns = "type", fill_value = 0, margins = True))

type,A,B,All
department,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,30961.725379,44050.626667,32052.467153
2,67600.158788,112958.526667,71380.022778
3,17160.002955,30580.655000,18278.390625
4,44285.399091,51219.654167,44863.253681
5,34821.011364,63236.875000,37189.000000
...,...,...,...
96,21367.042857,9528.538333,20337.607681
97,28471.266970,5828.873333,26584.400833
98,12875.423182,217.428333,11820.590278
99,379.123659,0.000000,379.123659
