# Transforming DataFrames

# Outline
- [1 Inspecting a DataFrame](#inspect)
- [2 Parts of a DataFrame](#df-parts)
- [3 Sorting Rows](#sort)
- [4 Subsetting](#subset)
- [&nbsp;&nbsp;4.1 Subsetting Columns](#subset-cols)
- [&nbsp;&nbsp;4.2 Subsetting Rows (Filtering)](#subset-rows)
- [&nbsp;&nbsp;4.3 Subsetting Rows by Categorical Values](#subset-rows-catego)
- [5 Adding New Columns](#new-cols)
- [6 Mixing all Together](#mix)


Import pandas and load the data into homelessness.

In [None]:
import pandas as pd
homelessness = pd.read_csv("./../../data/homelessness.csv", index_col=0)

<a id="inspect"></a>
# __1 Inspecting a DataFrame__
When you get a new DataFrame to work with, the first thing you need to do is explore it and see what it contains. There are several useful methods and attributes for this.

- [.head()](#head) returns the first few rows (the “head” of the DataFrame).
- [.info()](#info) shows information on each of the columns, such as the data type and number of missing values.
- [.shape](#shape) returns the number of rows and columns of the DataFrame.
- [.describe()](#describe) calculates a few summary statistics for each column.

homelessness is a DataFrame containing estimates of homelessness in each U.S. state in 2018. The individuals column is the number of homeless individuals not part of a family with children. The family_members column is the number of homeless individuals part of a family with children. The state_pop column is the state's total population.


<a id="head"></a>
Print the head of the homelessness data

In [None]:
homelessness.head()

<a id="info"></a>
Print information about the column types and missing values in homelessness.

In [None]:
homelessness.info()

<a id="shape"></a>
Print the number of rows and columns in homelessness.

In [None]:
homelessness.shape

<a id="describe"></a>
Print some summary statistics that describe the homelessness DataFrame.

In [None]:
homelessness.describe()

<a id="df-parts"></a>
# __2 Parts of a DataFrame__
To better understand DataFrame objects, it's useful to know that they consist of three components, stored as attributes:

- [.values](#vals): A two-dimensional NumPy array of values.
- [.columns](#cols): An index of columns: the column names.
- [.index](#idx): An index for the rows: either row numbers or row names.

You can usually think of indexes as a list of strings or numbers, though the pandas Index data type allows for more sophisticated options. (These will be covered later in the course.)
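
To make the three components concrete, here is a minimal sketch on a tiny, made-up `dogs` DataFrame (the data and the row labels `"a"`, `"b"`, `"c"` are assumptions for illustration, not part of the homelessness dataset). It pulls out each attribute and checks what kind of object comes back.

```python
import pandas as pd

# Made-up toy data, used only for illustration
dogs = pd.DataFrame(
    {
        "name": ["Bella", "Max", "Luna"],
        "height_cm": [56, 43, 46],
    },
    index=["a", "b", "c"],  # custom row labels instead of 0, 1, 2
)

values = dogs.values    # 2D NumPy array holding the cell values
columns = dogs.columns  # Index object holding the column names
index = dogs.index      # Index object holding the row labels

print(values.shape)   # (3, 2)
print(list(columns))  # ['name', 'height_cm']
print(list(index))    # ['a', 'b', 'c']
```

Because the columns mix text and numbers, the array returned by `.values` has the generic `object` dtype, which is also why the homelessness output below shows strings and floats side by side.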


<a id="vals"></a>
Print a 2D NumPy array of the values in homelessness.

In [25]:
homelessness.values

array([['East South Central', 'Alabama', 2570.0, 864.0, 4887681, 3434.0,
        0.7483983692486895, 5.258117295298118],
       ['Pacific', 'Alaska', 1434.0, 582.0, 735139, 2016.0,
        0.7113095238095238, 19.50651509442432],
       ['Mountain', 'Arizona', 7259.0, 2606.0, 7158024, 9865.0,
        0.7358337557019767, 10.141066864263097],
       ['West South Central', 'Arkansas', 2280.0, 432.0, 3009733, 2712.0,
        0.8407079646017699, 7.575422803285209],
       ['Pacific', 'California', 109008.0, 20964.0, 39461588, 129972.0,
        0.8387037208014034, 27.62382497126066],
       ['Mountain', 'Colorado', 7607.0, 3250.0, 5691287, 10857.0,
        0.7006539559731049, 13.366045325073221],
       ['New England', 'Connecticut', 2280.0, 1696.0, 3571520, 3976.0,
        0.5734406438631791, 6.383836573783711],
       ['South Atlantic', 'Delaware', 708.0, 374.0, 965479, 1082.0,
        0.6543438077634011, 7.333147587881249],
       ['South Atlantic', 'District of Columbia', 3770.0, 3134.0, 

<a id="cols"></a>
Print the column names of homelessness.

In [26]:
homelessness.columns

Index(['region', 'state', 'individuals', 'family_members', 'state_pop',
       'total', 'p_individuals', 'indiv_per_10k'],
      dtype='object')

<a id="idx"></a>
Print the index of homelessness.

In [27]:
homelessness.index

Index([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
       36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50],
      dtype='int64')

<a id="sort"></a>
# __3 Sorting rows__
Finding interesting bits of data in a DataFrame is often easier if you change the order of the rows. You can sort the rows by passing a column name to [.sort_values()](#row-sort).

In cases where rows have the same value (this is common if you sort on a categorical variable), you may wish to break the ties by sorting on another column. You can sort on multiple columns in this way by passing a list of column names.


|Sort on...|Syntax|
|---|---|
|one column|df.sort_values("breed")|
|multiple columns|df.sort_values(["breed", "weight_kg"])|

By combining .sort_values() with .head(), you can answer questions in the form, "What are the top cases where…?".
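
The sketch below tries both patterns from the table on a small, made-up `dogs` DataFrame (the names, breeds, and weights are invented for illustration). It sorts on two columns with a different direction for each, then combines `.sort_values()` with `.head()` to answer a "top cases" question.

```python
import pandas as pd

# Made-up toy data, used only for illustration
dogs = pd.DataFrame(
    {
        "name": ["Bella", "Max", "Luna", "Rocky"],
        "breed": ["Labrador", "Poodle", "Labrador", "Poodle"],
        "weight_kg": [25, 23, 29, 21],
    }
)

# Sort by breed (A to Z), breaking ties by weight from heaviest to lightest
by_breed_weight = dogs.sort_values(["breed", "weight_kg"], ascending=[True, False])

# "Which are the two heaviest dogs?": sort descending, then take the head
heaviest_two = dogs.sort_values("weight_kg", ascending=False).head(2)

print(by_breed_weight["name"].tolist())  # ['Luna', 'Bella', 'Max', 'Rocky']
print(heaviest_two["name"].tolist())     # ['Luna', 'Bella']
```

Note that `.sort_values()` returns a new DataFrame and leaves `dogs` in its original order, so the result has to be assigned to a variable if you want to keep it.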

<a id="row-sort"></a>
Sort homelessness by the number of homeless individuals, from smallest to largest, and save this as homelessness_ind.
Print the head of the sorted DataFrame.

In [28]:
homelessness_ind = homelessness.sort_values("individuals", ascending=True)
homelessness_ind

| |region|state|individuals|family_members|state_pop|total|p_individuals|indiv_per_10k|
|---|---|---|---|---|---|---|---|---|
|50|Mountain|Wyoming|434.0|205.0|577601|639.0|0.679186|7.513837|
|34|West North Central|North Dakota|467.0|75.0|758080|542.0|0.861624|6.1603|
|7|South Atlantic|Delaware|708.0|374.0|965479|1082.0|0.654344|7.333148|
|39|New England|Rhode Island|747.0|354.0|1058287|1101.0|0.678474|7.058577|
|45|New England|Vermont|780.0|511.0|624358|1291.0|0.604183|12.492833|
|29|New England|New Hampshire|835.0|615.0|1353465|1450.0|0.575862|6.169351|
|41|West North Central|South Dakota|836.0|323.0|878698|1159.0|0.721311|9.514077|
|26|Mountain|Montana|983.0|422.0|1060665|1405.0|0.699644|9.267771|
|48|South Atlantic|West Virginia|1021.0|222.0|1804291|1243.0|0.8214|5.658732|
|24|East South Central|Mississippi|1024.0|328.0|2981020|1352.0|0.757396|3.435066|


Sort homelessness by the number of homeless family_members in descending order, and save this as homelessness_fam.
Print the head of the sorted DataFrame.

In [29]:
homelessness_fam = homelessness.sort_values("family_members", ascending=False)
homelessness_fam

| |region|state|individuals|family_members|state_pop|total|p_individuals|indiv_per_10k|
|---|---|---|---|---|---|---|---|---|
|32|Mid-Atlantic|New York|39827.0|52070.0|19530351|91897.0|0.433387|20.392363|
|4|Pacific|California|109008.0|20964.0|39461588|129972.0|0.838704|27.623825|
|21|New England|Massachusetts|6811.0|13257.0|6882635|20068.0|0.339396|9.895919|
|9|South Atlantic|Florida|21443.0|9587.0|21244317|31030.0|0.691041|10.093523|
|43|West South Central|Texas|19199.0|6111.0|28628666|25310.0|0.758554|6.706215|
|47|Pacific|Washington|16424.0|5880.0|7523869|22304.0|0.73637|21.829195|
|38|Mid-Atlantic|Pennsylvania|8163.0|5349.0|12800922|13512.0|0.60413|6.376884|
|13|East North Central|Illinois|6752.0|3891.0|12723071|10643.0|0.634408|5.306895|
|30|Mid-Atlantic|New Jersey|6048.0|3350.0|8886025|9398.0|0.643541|6.806193|
|37|Pacific|Oregon|11139.0|3337.0|4181886|14476.0|0.769481|26.636307|


Sort homelessness first by region (ascending), and then by number of family members (descending). Save this as homelessness_reg_fam.
Print the head of the sorted DataFrame.

In [30]:
homelessness_reg_fam = homelessness.sort_values(["region", "family_members"], ascending=[True, False])

homelessness_reg_fam.head()

| |region|state|individuals|family_members|state_pop|total|p_individuals|indiv_per_10k|
|---|---|---|---|---|---|---|---|---|
|13|East North Central|Illinois|6752.0|3891.0|12723071|10643.0|0.634408|5.306895|
|35|East North Central|Ohio|6929.0|3320.0|11676341|10249.0|0.676066|5.934222|
|22|East North Central|Michigan|5209.0|3142.0|9984072|8351.0|0.623758|5.21731|
|49|East North Central|Wisconsin|2740.0|2167.0|5807406|4907.0|0.558386|4.718113|
|14|East North Central|Indiana|3776.0|1482.0|6695497|5258.0|0.718144|5.639611|


<a id="subset"></a>
# __4 Subsetting__
<a id="subset-cols"></a>
## 4.1 Subsetting Columns
When working with data, you may not need all of the variables in your dataset. Square brackets (`[]`) can be used to select only the columns that matter to you in an order that makes sense to you. To select only "col_a" of the DataFrame `df`, use
<br>`df["col_a"]`<br>
To select "col_a" and "col_b" of df, use
<br>`df[["col_a", "col_b"]]`<br>
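
One detail worth checking for yourself is what type each form returns. The sketch below uses a small, made-up `dogs` DataFrame (invented for illustration) to compare single brackets, double brackets with one name, and double brackets with two names in a custom order.

```python
import pandas as pd

# Made-up toy data, used only for illustration
dogs = pd.DataFrame(
    {
        "name": ["Bella", "Max", "Luna"],
        "breed": ["Labrador", "Poodle", "Labrador"],
        "height_cm": [56, 43, 46],
    }
)

one_col = dogs["height_cm"]         # single brackets: a Series
one_col_df = dogs[["height_cm"]]    # list with one name: a one-column DataFrame
two_cols = dogs[["breed", "name"]]  # columns come back in the order requested

print(type(one_col).__name__)     # Series
print(type(one_col_df).__name__)  # DataFrame
print(list(two_cols.columns))     # ['breed', 'name']
```

The outer pair of brackets does the subsetting, and the inner pair is an ordinary Python list of column names, which is why the order of that list controls the order of the columns in the result.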

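Create a variable called individuals that contains only the individuals column of homelessness. (Selecting a single column with single square brackets returns a pandas Series rather than a DataFrame.)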
Print the head of the result.

In [None]:
individuals = homelessness["individuals"]
individuals.head()

Create a DataFrame called state_fam that contains only the state and family_members columns of homelessness, in that order.
Print the head of the result.

In [31]:
state_fam = homelessness[["state", "family_members"]]
state_fam.head()

| |state|family_members|
|---|---|---|
|0|Alabama|864.0|
|1|Alaska|582.0|
|2|Arizona|2606.0|
|3|Arkansas|432.0|
|4|California|20964.0|


Create a DataFrame called ind_state that contains the individuals and state columns of homelessness, in that order.
Print the head of the result.

In [32]:
ind_state = homelessness[["individuals", "state"]]
ind_state.head()

| |individuals|state|
|---|---|---|
|0|2570.0|Alabama|
|1|1434.0|Alaska|
|2|7259.0|Arizona|
|3|2280.0|Arkansas|
|4|109008.0|California|


<a id="subset-rows"></a>
## 4.2 Subsetting Rows
A large part of data science is about finding which bits of your dataset are interesting. One of the simplest techniques for this is to find a subset of rows that match some criteria. This is sometimes known as filtering rows or selecting rows.

There are many ways to subset a DataFrame. Perhaps the most common is to use relational operators, which return `True` or `False` for each row, and then pass the result inside square brackets.

`dogs[dogs["height_cm"] > 60]` <br>
`dogs[dogs["color"] == "tan"]` <br>
You can filter for multiple conditions at once by using the "bitwise and" operator, `&`.

`dogs[(dogs["height_cm"] > 60) & (dogs["color"] == "tan")]`
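
It can help to look at the intermediate `True`/`False` values before using them to filter. The following sketch, on a small made-up `dogs` DataFrame (data invented for illustration), stores each condition in its own variable and then combines them; the names `is_tall` and `is_tan` are just illustrative.

```python
import pandas as pd

# Made-up toy data, used only for illustration
dogs = pd.DataFrame(
    {
        "name": ["Bella", "Max", "Luna", "Rocky"],
        "color": ["tan", "black", "tan", "brown"],
        "height_cm": [65, 43, 58, 70],
    }
)

# Each comparison yields a boolean Series with one True/False per row
is_tall = dogs["height_cm"] > 60
is_tan = dogs["color"] == "tan"

# Combine the two masks element-wise, then keep only the rows marked True
tall_tan = dogs[is_tall & is_tan]

# Equivalent one-liner; each condition needs its own parentheses
tall_tan_inline = dogs[(dogs["height_cm"] > 60) & (dogs["color"] == "tan")]

print(is_tall.tolist())           # [True, False, False, True]
print(tall_tan["name"].tolist())  # ['Bella']
```

The parentheses in the one-line form are required because `&` binds more tightly than `>` and `==` in Python, so without them the expression is grouped in a way you did not intend.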


Filter homelessness for cases where the number of individuals is greater than ten thousand, assigning to ind_gt_10k. View the printed result.

In [None]:
ind_gt_10k = homelessness[homelessness["individuals"] > 10_000]
ind_gt_10k

Filter homelessness for cases where the USA Census region is "Mountain", assigning to mountain_reg. View the printed result.

In [None]:
mountain_reg = homelessness[homelessness["region"] == "Mountain"]
mountain_reg

Filter homelessness for cases where the number of family_members is less than one thousand and the region is "Pacific", assigning to fam_lt_1k_pac. View the printed result.

In [None]:
fam_lt_1k_pac = homelessness[(homelessness["family_members"] < 1_000) & (homelessness["region"] == "Pacific")]
fam_lt_1k_pac

<a id="subset-rows-catego"></a>
## 4.3 Subsetting rows by categorical variables
Subsetting data based on a categorical variable often involves using the "or" operator (`|`) to select rows from multiple categories. This can get tedious when you want all states in one of three different regions, for example. Instead, use the `.isin()` method, which will allow you to tackle this problem by writing one condition instead of three separate ones.

`colors = ["brown", "black", "tan"]` <br>
`condition = dogs["color"].isin(colors)` <br>
`dogs[condition]` <br>
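
As a rough comparison of the two approaches, the sketch below filters a small, made-up `dogs` DataFrame (data invented for illustration) once with chained `|` conditions and once with `.isin()`, and confirms they select the same rows. It also shows `~`, which negates a boolean condition, to keep the rows that are not in the list.

```python
import pandas as pd

# Made-up toy data, used only for illustration
dogs = pd.DataFrame(
    {
        "name": ["Bella", "Max", "Luna", "Rocky", "Coco"],
        "color": ["tan", "black", "white", "brown", "gray"],
    }
)

colors = ["brown", "black", "tan"]

# Three separate conditions joined with "or"
with_or = dogs[
    (dogs["color"] == "brown") | (dogs["color"] == "black") | (dogs["color"] == "tan")
]

# One condition using .isin()
with_isin = dogs[dogs["color"].isin(colors)]

# ~ flips True and False, keeping rows whose color is NOT in the list
not_in = dogs[~dogs["color"].isin(colors)]

print(with_isin["name"].tolist())  # ['Bella', 'Max', 'Rocky']
print(with_or.equals(with_isin))   # True
print(not_in["name"].tolist())     # ['Luna', 'Coco']
```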


Filter homelessness for cases where the U.S. Census region is "South Atlantic" or "Mid-Atlantic", assigning to south_mid_atlantic. View the printed result.

In [None]:
south_mid_atlantic = homelessness[homelessness["region"].isin(["South Atlantic", "Mid-Atlantic"])]
south_mid_atlantic

Filter homelessness for cases where the state is in the list of Mojave Desert states, canu, assigning to mojave_homelessness. View the printed result.

In [None]:
# The Mojave Desert states
canu = ["California", "Arizona", "Nevada", "Utah"]

mojave_homelessness = homelessness[homelessness["state"].isin(canu)]
mojave_homelessness

<a id="new-cols"></a>
# __5 Adding new columns__
You aren't stuck with just the data you are given. Instead, you can add new columns to a DataFrame. Adding columns goes by many names, such as transforming, mutating, and feature engineering.

You can create new columns from scratch, but it is also common to derive them from other columns, for example, by adding columns together or by changing their units.
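
Here is a minimal sketch of both kinds of derived column on a small, made-up `dogs` DataFrame (the data and the column names `height_m` and `bmi` are assumptions chosen purely for illustration): one column changes units, and the other combines two existing columns.

```python
import pandas as pd

# Made-up toy data, used only for illustration
dogs = pd.DataFrame(
    {
        "name": ["Bella", "Max", "Luna"],
        "height_cm": [56, 43, 46],
        "weight_kg": [25, 23, 22],
    }
)

# Change units: centimeters to meters
dogs["height_m"] = dogs["height_cm"] / 100

# Combine columns: weight divided by height squared, computed row by row
dogs["bmi"] = dogs["weight_kg"] / dogs["height_m"] ** 2

print(list(dogs.columns))  # ['name', 'height_cm', 'weight_kg', 'height_m', 'bmi']
print(dogs["height_m"].tolist())  # [0.56, 0.43, 0.46]
```

Assigning to a column name that does not exist yet creates the column; assigning to an existing name overwrites it, so it is worth double-checking the name before running the cell.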

Add a new column to `homelessness`, named `total`, containing the sum of the `individuals` and `family_members` columns.
Add another column to `homelessness`, named `p_individuals`, containing the proportion of homeless people in each state who are individuals.

In [None]:
homelessness["total"] = homelessness["individuals"] + homelessness["family_members"]
homelessness["p_individuals"] = homelessness["individuals"] / homelessness["total"]

homelessness

<a id="mix"></a>
# __6 Combo-attack!__
You've seen the four most common types of data manipulation: sorting rows, subsetting columns, subsetting rows, and adding new columns. In a real-life data analysis, you can mix and match these four manipulations to answer a multitude of questions.

In this exercise, you'll answer the question, "Which state has the highest number of homeless individuals per 10,000 people in the state?" Combine your new `pandas` skills to find out.

Add a column to `homelessness`, `indiv_per_10k`, containing the number of homeless individuals per ten thousand people in each state.

In [None]:
homelessness["indiv_per_10k"] = homelessness["individuals"] / homelessness["state_pop"] * 10_000

Subset rows where `indiv_per_10k` is higher than `20`, assigning to `high_homelessness`.

In [None]:
high_homelessness = homelessness[homelessness["indiv_per_10k"] > 20]

Sort `high_homelessness` by descending `indiv_per_10k`, assigning to `high_homelessness_srt`.

In [None]:
high_homelessness_srt = high_homelessness.sort_values("indiv_per_10k", ascending=False)
high_homelessness_srt

Select only the `state` and `indiv_per_10k` columns of `high_homelessness_srt` and save as `result`. Look at the `result`.

In [None]:
result = high_homelessness_srt[["state", "indiv_per_10k"]]

result
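
The four steps above can also be written as one chained expression. The sketch below is one possible way to do that, not the only one, and it runs on made-up numbers for three fictional states (so the values say nothing about real homelessness figures). It uses `.assign()` to add the column without modifying the original DataFrame and `.loc` with a function to filter on the newly created column.

```python
import pandas as pd

# Made-up numbers for three fictional states, for illustration only
toy = pd.DataFrame(
    {
        "state": ["State A", "State B", "State C"],
        "individuals": [3000, 500, 8000],
        "state_pop": [1_000_000, 1_000_000, 2_000_000],
    }
)

result = (
    toy
    # 1. Add the derived column (returns a new DataFrame; toy is unchanged)
    .assign(indiv_per_10k=lambda d: d["individuals"] / d["state_pop"] * 10_000)
    # 2. Keep rows where the new column is above 20
    .loc[lambda d: d["indiv_per_10k"] > 20]
    # 3. Sort from highest to lowest
    .sort_values("indiv_per_10k", ascending=False)
    # 4. Keep only the two columns of interest
    [["state", "indiv_per_10k"]]
)

print(result["state"].tolist())  # ['State C', 'State A']
```

Whether to chain or to save each intermediate result, as the cells above do, is largely a matter of taste: separate variables are easier to inspect step by step, while a chain avoids leaving many temporary DataFrames around.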