# Transforming DataFrames

Also known as updating or feature engineering.

In [7]:
from IPython.display import IFrame

# Display PDF with responsive width and height
IFrame("https://projector-video-pdf-converter.datacamp.com/22066/chapter1.pdf", width="100%", height="600px")

## Introducing DataFrames

pandas is built on two packages:
- NumPy: for powerful data manipulation
- Matplotlib: for data visualization

Explore a DataFrame with these methods and attributes:
- head() : method : returns the first n rows of the DataFrame
    - example: df.head(5)
    [![https://imgur.com/YV4KpeZ.png](https://imgur.com/YV4KpeZ.png)](https://imgur.com/YV4KpeZ.png)
- tail() : method : returns the last n rows of the DataFrame
    - example: df.tail(5)
    [![https://imgur.com/Bno9zKj.png](https://imgur.com/Bno9zKjl.png)](https://imgur.com/Bno9zKjl.png)
- shape : attribute : returns a tuple representing the dimensions of the DataFrame
    - example: df.shape
    [![https://imgur.com/MfPpIjW.png](https://imgur.com/MfPpIjW.png)](https://imgur.com/MfPpIjW.png)
- info() : method : provides a concise summary of the DataFrame
    - example: df.info()
    [![https://imgur.com/FhgZwz9.png](https://imgur.com/FhgZwz9.png)](https://imgur.com/FhgZwz9.png)
- describe() : method : generates descriptive statistics of the DataFrame
    - example: df.describe()
    [![https://imgur.com/UqvlgK8.png](https://imgur.com/UqvlgK8l.png)](https://imgur.com/UqvlgK8l.png)
- columns : attribute : returns the column labels of the DataFrame
    - example: df.columns
    [![https://imgur.com/6Ra2zVY.png](https://imgur.com/6Ra2zVY.png)](https://imgur.com/6Ra2zVY.png)

- [![https://imgur.com/x8J9pAo.png](https://imgur.com/x8J9pAo.png)](https://imgur.com/x8J9pAo.png)
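
The methods and attributes listed above can be tried out on a small table. This is a minimal sketch that assumes a made-up `dogs` DataFrame; the names and numbers are hypothetical and are not the data shown in the screenshots.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max", "Stella"],
    "breed": ["Labrador", "Poodle", "Chow Chow", "Schnauzer", "Labrador", "Chihuahua"],
    "weight_kg": [25, 23, 22, 17, 29, 2],
})

first_two = dogs.head(2)        # method: first 2 rows
last_two = dogs.tail(2)         # method: last 2 rows
n_rows, n_cols = dogs.shape     # attribute: no parentheses
col_names = list(dogs.columns)  # attribute: column labels
stats = dogs.describe()         # method: summary statistics of numeric columns

print(first_two)
print(last_two)
print(n_rows, n_cols)
print(col_names)
print(stats)
dogs.info()                     # method: prints dtypes and non-null counts
```

Methods are called with parentheses, while attributes such as `shape` and `columns` are read without them.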


### Sorting and subsetting

Sorting rows
[![https://imgur.com/PJrIwyz.png](https://imgur.com/PJrIwyz.png)](https://imgur.com/PJrIwyz.png)

Finding interesting bits of data in a DataFrame is often easier if you change the order of the rows. You can sort the rows by passing a column name to .sort_values().

In cases where rows have the same value (this is common if you sort on a categorical variable), you may wish to break the ties by sorting on another column. You can sort on multiple columns in this way by passing a list of column names.

| Sort on … | Syntax |
| --- | --- |
| one column | `df.sort_values("breed")` |
| multiple columns | `df.sort_values(["breed", "weight_kg"])` |

By combining .sort_values() with .head(), you can answer questions in the form, "What are the top cases where…?".
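
As a rough illustration of these sorting patterns, the sketch below uses a made-up `dogs` DataFrame. The column names `breed` and `weight_kg` come from the syntax table; the rows themselves are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Max", "Lucy", "Stella"],
    "breed": ["Labrador", "Poodle", "Labrador", "Poodle", "Chihuahua"],
    "weight_kg": [25, 23, 29, 20, 2],
})

# Sort on one column (ascending by default)
by_breed = dogs.sort_values("breed")

# Break ties within a breed: breed ascending, then weight descending
by_breed_weight = dogs.sort_values(["breed", "weight_kg"], ascending=[True, False])

# "What are the top cases where...?": the three heaviest dogs
heaviest = dogs.sort_values("weight_kg", ascending=False).head(3)

print(by_breed_weight)
print(heaviest)
```

Passing a list to `ascending` sets the direction for each sort column separately, in the same order as the column list.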

homelessness is available and pandas is loaded as pd.

In [8]:
import pandas as pd
homelessness = pd.read_csv('./data/homelessness.csv')

In [9]:
# Sort homelessness by individuals
homelessness_ind = homelessness.sort_values('individuals')

# Print the top few rows
print(homelessness_ind.head())

    Unnamed: 0              region         state  individuals  family_members  \
50          50            Mountain       Wyoming        434.0           205.0   
34          34  West North Central  North Dakota        467.0            75.0   
7            7      South Atlantic      Delaware        708.0           374.0   
39          39         New England  Rhode Island        747.0           354.0   
45          45         New England       Vermont        780.0           511.0   

    state_pop  
50     577601  
34     758080  
7      965479  
39    1058287  
45     624358  


In [10]:
# Sort homelessness by descending family members
homelessness_fam = homelessness.sort_values('family_members', ascending=False)

print(homelessness_fam.head())

    Unnamed: 0              region          state  individuals  \
32          32        Mid-Atlantic       New York      39827.0   
4            4             Pacific     California     109008.0   
21          21         New England  Massachusetts       6811.0   
9            9      South Atlantic        Florida      21443.0   
43          43  West South Central          Texas      19199.0   

    family_members  state_pop  
32         52070.0   19530351  
4          20964.0   39461588  
21         13257.0    6882635  
9           9587.0   21244317  
43          6111.0   28628666  


In [11]:
# Sort homelessness by region, then descending family members
homelessness_reg_fam = homelessness.sort_values(['region', 'family_members'], ascending=[True, False])

# Print the top few rows
print(homelessness_reg_fam.head())

    Unnamed: 0              region      state  individuals  family_members  \
13          13  East North Central   Illinois       6752.0          3891.0   
35          35  East North Central       Ohio       6929.0          3320.0   
22          22  East North Central   Michigan       5209.0          3142.0   
49          49  East North Central  Wisconsin       2740.0          2167.0   
14          14  East North Central    Indiana       3776.0          1482.0   

    state_pop  
13   12723071  
35   11676341  
22    9984072  
49    5807406  
14    6695497  


## Subsetting

[![https://imgur.com/QrjfgeM.png](https://imgur.com/QrjfgeM.png)](https://imgur.com/QrjfgeM.png)

[![https://imgur.com/Ejiaw1G.png](https://imgur.com/Ejiaw1G.png)](https://imgur.com/Ejiaw1G.png)


[![https://imgur.com/Kta1y1F.png](https://imgur.com/Kta1y1F.png)](https://imgur.com/Kta1y1F.png)
[![https://imgur.com/MHhGUQF.png](https://imgur.com/MHhGUQF.png)](https://imgur.com/MHhGUQF.png)

[![https://imgur.com/dJFPaHw.png](https://imgur.com/dJFPaHw.png)](https://imgur.com/dJFPaHw.png)

[![https://imgur.com/6Luu9Qz.png](https://imgur.com/6Luu9Qz.png)](https://imgur.com/6Luu9Qz.png)
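
The usual ways of subsetting columns and rows might look like the following. This is a minimal sketch on a made-up `dogs` DataFrame (hypothetical names and values), not a transcription of the slides above.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Max", "Lucy", "Stella"],
    "breed": ["Labrador", "Poodle", "Labrador", "Poodle", "Chihuahua"],
    "weight_kg": [25, 23, 29, 20, 2],
})

# Subsetting columns
weights = dogs["weight_kg"]                 # one column name -> Series
name_weight = dogs[["name", "weight_kg"]]   # list of names -> DataFrame

# Subsetting rows with a boolean condition
heavy = dogs[dogs["weight_kg"] > 22]

# Combining conditions: wrap each one in parentheses and join with &
heavy_labs = dogs[(dogs["breed"] == "Labrador") & (dogs["weight_kg"] > 26)]

# Matching any of several categorical values
small_breeds = dogs[dogs["breed"].isin(["Poodle", "Chihuahua"])]

print(heavy)
print(heavy_labs)
print(small_breeds)
```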

## Adding new columns

You aren't stuck with just the data you are given. Instead, you can add new columns to a DataFrame. This has many names, such as transforming, mutating, and feature engineering.

You can create new columns from scratch, but it is also common to derive them from other columns, for example, by adding columns together or by changing their units.
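
For instance, a unit conversion and a column derived from two others could be sketched as below. The `dogs` DataFrame and its `height_cm`, `height_m`, `bmi`, and `species` columns are hypothetical names chosen for illustration.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy"],
    "height_cm": [56, 43, 46],
    "weight_kg": [25, 23, 22],
})

# New column from scratch: the same value for every row
dogs["species"] = "dog"

# Change units: centimetres to metres
dogs["height_m"] = dogs["height_cm"] / 100

# Derive a column from two existing ones
dogs["bmi"] = dogs["weight_kg"] / dogs["height_m"] ** 2

print(dogs)
```

Assigning to a column name that does not exist yet creates the column; assigning to an existing name overwrites it.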

homelessness is a DataFrame containing estimates of homelessness in each U.S. state in 2018. The individuals column is the number of homeless individuals who are not part of a family with children. The family_members column is the number of homeless individuals who are part of a family with children. The state_pop column is the state's total population.

In [12]:


# Add total col as sum of individuals and family_members
homelessness['total'] = homelessness['individuals'] + homelessness['family_members']
 
# Add p_homeless col as proportion of total homeless population to the state population
homelessness['p_homeless'] = homelessness['total'] / homelessness['state_pop']

# See the result
print(homelessness.head())

   Unnamed: 0              region       state  individuals  family_members  \
0           0  East South Central     Alabama       2570.0           864.0   
1           1             Pacific      Alaska       1434.0           582.0   
2           2            Mountain     Arizona       7259.0          2606.0   
3           3  West South Central    Arkansas       2280.0           432.0   
4           4             Pacific  California     109008.0         20964.0   

   state_pop     total  p_homeless  
0    4887681    3434.0    0.000703  
1     735139    2016.0    0.002742  
2    7158024    9865.0    0.001378  
3    3009733    2712.0    0.000901  
4   39461588  129972.0    0.003294  


## Combo-attack!

You've seen the four most common types of data manipulation: sorting rows, subsetting columns, subsetting rows, and adding new columns. In a real-life data analysis, you can mix and match these four manipulations to answer a multitude of questions.

In this exercise, you'll answer the question, "Which state has the highest number of homeless individuals per 10,000 people in the state?" Combine your new pandas skills to find out.

In [13]:
# Create indiv_per_10k col as homeless individuals per 10k state pop
homelessness["indiv_per_10k"] = 10000 * homelessness['individuals'] / homelessness['state_pop']

# Subset rows for indiv_per_10k greater than 20
high_homelessness = homelessness[homelessness['indiv_per_10k'] > 20]

# Sort high_homelessness by descending indiv_per_10k
high_homelessness_srt = high_homelessness.sort_values('indiv_per_10k', ascending=False)

# From high_homelessness_srt, select the state and indiv_per_10k cols
result = high_homelessness_srt[['state', 'indiv_per_10k']]

# See the result
print(result)

                   state  indiv_per_10k
8   District of Columbia      53.738381
11                Hawaii      29.079406
4             California      27.623825
37                Oregon      26.636307
28                Nevada      23.314189
47            Washington      21.829195
32              New York      20.392363
