In [3]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [14]:
# Load the homelessness dataset
homelessness = pd.read_csv('./Data set/homelessness.csv')

## Inspecting a DataFrame

When you get a new DataFrame to work with, the first thing you need to do is explore it and see what it contains. There are several useful methods and attributes for this.

* `.head()` returns the first few rows (the “head” of the DataFrame).
* `.info()` shows information on each of the columns, such as the data type and number of missing values.
* `.shape` returns the number of rows and columns of the DataFrame.
* `.describe()` calculates a few summary statistics for each column.

In [3]:
# Print the head of the homelessness data
print(homelessness.head())

                         state  individuals  family_members  state_pop
region                                                                
East South Central     Alabama       2570.0           864.0    4887681
Pacific                 Alaska       1434.0           582.0     735139
Mountain               Arizona       7259.0          2606.0    7158024
West South Central    Arkansas       2280.0           432.0    3009733
Pacific             California     109008.0         20964.0   39461588


In [4]:
# Print information about homelessness
# (.info() prints its report itself and returns None, so no print() is needed)
homelessness.info()

<class 'pandas.core.frame.DataFrame'>
Index: 51 entries, East South Central to Mountain
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   state           51 non-null     object 
 1   individuals     51 non-null     float64
 2   family_members  51 non-null     float64
 3   state_pop       51 non-null     int64  
dtypes: float64(2), int64(1), object(1)
memory usage: 2.0+ KB


In [5]:
# Print the shape of homelessness
print(homelessness.shape)

(51, 4)


In [6]:
# Print a description of homelessness
print(homelessness.describe())

         individuals  family_members     state_pop
count      51.000000       51.000000  5.100000e+01
mean     7225.784314     3504.882353  6.405637e+06
std     15991.025083     7805.411811  7.327258e+06
min       434.000000       75.000000  5.776010e+05
25%      1446.500000      592.000000  1.777414e+06
50%      3082.000000     1482.000000  4.461153e+06
75%      6781.500000     3196.000000  7.340946e+06
max    109008.000000    52070.000000  3.946159e+07


## Parts of a DataFrame
To better understand DataFrame objects, it's useful to know that they consist of three components, stored as attributes:

* `.values`: A two-dimensional NumPy array of values.
* `.columns`: An index of columns: the column names.
* `.index`: An index for the rows: either row numbers or row names.

You can usually think of indexes as a list of strings or numbers, though the pandas Index data type allows for more sophisticated options.
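
To see that these three attributes really do hold everything in a DataFrame, here is a minimal sketch that pulls them apart and puts them back together. The `small` DataFrame is an assumption made for illustration: it copies three rows from the `.head()` output shown earlier, and `rebuilt` is a hypothetical name.

```python
import pandas as pd

# Three rows copied from the homelessness output shown earlier
small = pd.DataFrame(
    {
        "state": ["Alabama", "Alaska", "Arizona"],
        "individuals": [2570.0, 1434.0, 7259.0],
        "family_members": [864.0, 582.0, 2606.0],
    },
    index=pd.Index(["East South Central", "Pacific", "Mountain"], name="region"),
)

# Pull the three components apart
values = small.values      # 2-D NumPy array holding the cell values
columns = small.columns    # Index of column names
index = small.index        # Index of row labels

# Put them back together; the labels line up with the values again
rebuilt = pd.DataFrame(values, columns=columns, index=index)

print(values.shape)
print(list(columns))
print(list(index))
print(rebuilt.loc["Pacific", "state"])
```

Because the columns here mix strings and numbers, `.values` has to store everything in one array, so the rebuilt DataFrame may not keep the original per-column data types.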

In [7]:
# Import pandas using the alias pd
import pandas as pd

# Print the values of homelessness
print(homelessness.values)

# Print the column index of homelessness
print(homelessness.columns)

# Print the row index of homelessness
print(homelessness.index)

[['Alabama' 2570.0 864.0 4887681]
 ['Alaska' 1434.0 582.0 735139]
 ['Arizona' 7259.0 2606.0 7158024]
 ['Arkansas' 2280.0 432.0 3009733]
 ['California' 109008.0 20964.0 39461588]
 ['Colorado' 7607.0 3250.0 5691287]
 ['Connecticut' 2280.0 1696.0 3571520]
 ['Delaware' 708.0 374.0 965479]
 ['District of Columbia' 3770.0 3134.0 701547]
 ['Florida' 21443.0 9587.0 21244317]
 ['Georgia' 6943.0 2556.0 10511131]
 ['Hawaii' 4131.0 2399.0 1420593]
 ['Idaho' 1297.0 715.0 1750536]
 ['Illinois' 6752.0 3891.0 12723071]
 ['Indiana' 3776.0 1482.0 6695497]
 ['Iowa' 1711.0 1038.0 3148618]
 ['Kansas' 1443.0 773.0 2911359]
 ['Kentucky' 2735.0 953.0 4461153]
 ['Louisiana' 2540.0 519.0 4659690]
 ['Maine' 1450.0 1066.0 1339057]
 ['Maryland' 4914.0 2230.0 6035802]
 ['Massachusetts' 6811.0 13257.0 6882635]
 ['Michigan' 5209.0 3142.0 9984072]
 ['Minnesota' 3993.0 3250.0 5606249]
 ['Mississippi' 1024.0 328.0 2981020]
 ['Missouri' 3776.0 2107.0 6121623]
 ['Montana' 983.0 422.0 1060665]
 ['Nebraska' 1745.0 676.0 192

## Sorting rows

Finding interesting bits of data in a DataFrame is often easier if you change the order of the rows. You can sort the rows by passing a column name to `.sort_values()`.

In cases where rows have the same value (this is common if you sort on a categorical variable), you may wish to break the ties by sorting on another column. You can sort on multiple columns in this way by passing a list of column names.

| Sort on … | Syntax |
| --- | --- |
| one column | `df.sort_values("breed")` |
| multiple columns | `df.sort_values(["breed", "weight_kg"])` |

By combining `.sort_values()` with `.head()`, you can answer questions in the form, "What are the top cases where…?".
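
As a rough illustration of tie-breaking and of the sort-then-head pattern, the sketch below uses a tiny `dogs` DataFrame. The dog names and numbers are made up for this example; only the `breed` and `weight_kg` column names come from the syntax table.

```python
import pandas as pd

# Hypothetical dogs data, invented for illustration only
dogs = pd.DataFrame(
    {
        "name": ["Bella", "Charlie", "Lucy", "Cooper"],
        "breed": ["Labrador", "Poodle", "Labrador", "Poodle"],
        "weight_kg": [25, 23, 29, 20],
    }
)

# Sort by breed (A to Z) and break ties by weight, heaviest first.
# The ascending list has one entry per column being sorted on.
by_breed_weight = dogs.sort_values(["breed", "weight_kg"], ascending=[True, False])
print(by_breed_weight)

# "Which are the two heaviest dogs?": sort descending, then take the head
heaviest_two = dogs.sort_values("weight_kg", ascending=False).head(2)
print(heaviest_two)
```

Note that `.sort_values()` returns a new DataFrame and leaves `dogs` itself in its original order.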


In [5]:
# Sort homelessness by individuals
homelessness_ind = homelessness.sort_values('individuals')

# Print the top few rows
print(homelessness_ind.head())

                           state  individuals  family_members  state_pop
region                                                                  
Mountain                 Wyoming        434.0           205.0     577601
West North Central  North Dakota        467.0            75.0     758080
South Atlantic          Delaware        708.0           374.0     965479
New England         Rhode Island        747.0           354.0    1058287
New England              Vermont        780.0           511.0     624358


In [6]:
# Sort homelessness by descending family members
homelessness_fam = homelessness.sort_values("family_members", ascending=False)

# Print the top few rows
print(homelessness_fam.head())

                            state  individuals  family_members  state_pop
region                                                                   
Mid-Atlantic             New York      39827.0         52070.0   19530351
Pacific                California     109008.0         20964.0   39461588
New England         Massachusetts       6811.0         13257.0    6882635
South Atlantic            Florida      21443.0          9587.0   21244317
West South Central          Texas      19199.0          6111.0   28628666


In [7]:
# Sort homelessness by region, then descending family members
homelessness_reg_fam = homelessness.sort_values(["region", "family_members"], ascending=[True, False])

# Print the top few rows
print(homelessness_reg_fam.head())

                        state  individuals  family_members  state_pop
region                                                               
East North Central   Illinois       6752.0          3891.0   12723071
East North Central       Ohio       6929.0          3320.0   11676341
East North Central   Michigan       5209.0          3142.0    9984072
East North Central  Wisconsin       2740.0          2167.0    5807406
East North Central    Indiana       3776.0          1482.0    6695497


## Subsetting columns

When working with data, you may not need all of the variables in your dataset. Square brackets (`[]`) can be used to select only the columns that matter to you in an order that makes sense to you.
To select only `"col_a"` of the DataFrame `df`, use

    df["col_a"]

To select `"col_a"` and `"col_b"` of `df`, use

    df[["col_a", "col_b"]]
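
One detail worth seeing in code is that single and double square brackets give back different types. The sketch below assumes a small made-up `df` with a hypothetical third column, `col_c`, added to show reordering.

```python
import pandas as pd

# Made-up DataFrame; col_c is a hypothetical extra column for illustration
df = pd.DataFrame({"col_a": [1, 2], "col_b": [3, 4], "col_c": [5, 6]})

# Single brackets with one name return a pandas Series
one_col = df["col_a"]

# Double brackets (a list of names) return a DataFrame, even for one column
one_col_frame = df[["col_a"]]

# The order of names in the list sets the column order of the result
reordered = df[["col_c", "col_a"]]

print(type(one_col).__name__)
print(type(one_col_frame).__name__)
print(list(reordered.columns))
```

This matches the printed outputs in the cells that follow: the single-column selection prints with a `Name:` and `dtype:` footer, while the two-column selections print as tables.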

In [8]:
# Select the individuals column
individuals = homelessness['individuals']

# Print the head of the result
print(individuals.head())

region
East South Central      2570.0
Pacific                 1434.0
Mountain                7259.0
West South Central      2280.0
Pacific               109008.0
Name: individuals, dtype: float64


In [9]:
# Select the state and family_members columns
state_fam = homelessness[['state','family_members']]

# Print the head of the result
print(state_fam.head())

                         state  family_members
region                                        
East South Central     Alabama           864.0
Pacific                 Alaska           582.0
Mountain               Arizona          2606.0
West South Central    Arkansas           432.0
Pacific             California         20964.0


In [10]:
# Select only the individuals and state columns, in that order
ind_state = homelessness[['individuals','state']]

# Print the head of the result
print(ind_state.head())

                    individuals       state
region                                     
East South Central       2570.0     Alabama
Pacific                  1434.0      Alaska
Mountain                 7259.0     Arizona
West South Central       2280.0    Arkansas
Pacific                109008.0  California


## Subsetting rows

A large part of data science is about finding which bits of your dataset are interesting. One of the simplest techniques for this is to find a subset of rows that match some criteria. This is sometimes known as *filtering rows* or *selecting rows*.

There are many ways to subset a DataFrame. Perhaps the most common is to use relational operators to return `True` or `False` for each row, then pass that inside square brackets.

    dogs[dogs["height_cm"] > 60]
    dogs[dogs["color"] == "tan"]

You can filter for multiple conditions at once by using the "bitwise and" operator, `&`.

    dogs[(dogs["height_cm"] > 60) & (dogs["color"] == "tan")]
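
It can help to look at the intermediate `True`/`False` values before using them as a filter. The following is a minimal sketch with a made-up `dogs` DataFrame; the names and measurements are invented, and `is_tall` and `is_tan` are hypothetical variable names.

```python
import pandas as pd

# Hypothetical dogs data, invented for illustration only
dogs = pd.DataFrame(
    {
        "name": ["Bella", "Charlie", "Lucy", "Cooper"],
        "height_cm": [56, 43, 61, 65],
        "color": ["brown", "black", "tan", "gray"],
    }
)

# Each comparison produces a boolean Series with one value per row
is_tall = dogs["height_cm"] > 60
is_tan = dogs["color"] == "tan"
print(is_tall)

# Combining the named masks with & keeps rows where both are True
tall_tan = dogs[is_tall & is_tan]
print(tall_tan)

# Written inline, each condition needs its own parentheses, because
# & binds more tightly than > and == in Python
tall_tan_inline = dogs[(dogs["height_cm"] > 60) & (dogs["color"] == "tan")]
```

Storing each condition in a named variable is one way to keep longer filters readable and to avoid forgetting the parentheses.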

In [11]:
# Filter for rows where individuals is greater than 10000
ind_gt_10k = homelessness[homelessness['individuals'] > 10000]

# See the result
print(ind_gt_10k)

                         state  individuals  family_members  state_pop
region                                                                
Pacific             California     109008.0         20964.0   39461588
South Atlantic         Florida      21443.0          9587.0   21244317
Mid-Atlantic          New York      39827.0         52070.0   19530351
Pacific                 Oregon      11139.0          3337.0    4181886
West South Central       Texas      19199.0          6111.0   28628666
Pacific             Washington      16424.0          5880.0    7523869


In [15]:
# Filter for rows where region is Mountain
mountain_reg = homelessness[homelessness['region'] == 'Mountain']

# See the result
print(mountain_reg)

      region       state  individuals  family_members  state_pop
2   Mountain     Arizona       7259.0          2606.0    7158024
5   Mountain    Colorado       7607.0          3250.0    5691287
12  Mountain       Idaho       1297.0           715.0    1750536
26  Mountain     Montana        983.0           422.0    1060665
28  Mountain      Nevada       7058.0           486.0    3027341
31  Mountain  New Mexico       1949.0           602.0    2092741
44  Mountain        Utah       1904.0           972.0    3153550
50  Mountain     Wyoming        434.0           205.0     577601


In [16]:
# Filter for rows where family_members is less than 1000 
# and region is Pacific
fam_lt_1k_pac = homelessness[(homelessness['family_members'] < 1000) & (homelessness['region'] == "Pacific")]

# See the result
print(fam_lt_1k_pac)

    region   state  individuals  family_members  state_pop
1  Pacific  Alaska       1434.0           582.0     735139


## Subsetting rows by categorical variables

Subsetting data based on a categorical variable often involves using the "or" operator (`|`) to select rows from multiple categories. This can get tedious when you want all states in one of three different regions, for example.
Instead, use the `.isin()` method, which will allow you to tackle this problem by writing one condition instead of three separate ones.

    colors = ["brown", "black", "tan"]
    condition = dogs["color"].isin(colors)
    dogs[condition]
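
To make the comparison concrete, the sketch below writes the same filter both ways and checks that the results match. The `dogs` data is made up for illustration, and the last step, negating the condition with `~`, is an extra that goes slightly beyond the text above.

```python
import pandas as pd

# Hypothetical dogs data, invented for illustration only
dogs = pd.DataFrame(
    {
        "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max"],
        "color": ["brown", "black", "tan", "gray", "white"],
    }
)

colors = ["brown", "black", "tan"]

# The long way: one condition per category, joined with |
with_or = dogs[
    (dogs["color"] == "brown")
    | (dogs["color"] == "black")
    | (dogs["color"] == "tan")
]

# The short way: a single .isin() condition
with_isin = dogs[dogs["color"].isin(colors)]

# Both approaches select the same rows
same = with_or.equals(with_isin)
print(same)

# Prefixing the condition with ~ inverts it: rows whose color is NOT in the list
not_listed = dogs[~dogs["color"].isin(colors)]
print(not_listed)
```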


In [17]:
# Subset for rows in South Atlantic or Mid-Atlantic regions
south_mid_atlantic = homelessness[homelessness['region'].isin(["South Atlantic", "Mid-Atlantic"])]

# See the result
print(south_mid_atlantic)

            region                 state  individuals  family_members  \
7   South Atlantic              Delaware        708.0           374.0   
8   South Atlantic  District of Columbia       3770.0          3134.0   
9   South Atlantic               Florida      21443.0          9587.0   
10  South Atlantic               Georgia       6943.0          2556.0   
20  South Atlantic              Maryland       4914.0          2230.0   
30    Mid-Atlantic            New Jersey       6048.0          3350.0   
32    Mid-Atlantic              New York      39827.0         52070.0   
33  South Atlantic        North Carolina       6451.0          2817.0   
38    Mid-Atlantic          Pennsylvania       8163.0          5349.0   
40  South Atlantic        South Carolina       3082.0           851.0   
46  South Atlantic              Virginia       3928.0          2047.0   
48  South Atlantic         West Virginia       1021.0           222.0   

    state_pop  
7      965479  
8      701547  
9 

In [18]:
# The Mojave Desert states
canu = ["California", "Arizona", "Nevada", "Utah"]

# Filter for rows in the Mojave Desert states
mojave_homelessness = homelessness[homelessness['state'].isin(canu)]

# See the result
print(mojave_homelessness)

      region       state  individuals  family_members  state_pop
2   Mountain     Arizona       7259.0          2606.0    7158024
4    Pacific  California     109008.0         20964.0   39461588
28  Mountain      Nevada       7058.0           486.0    3027341
44  Mountain        Utah       1904.0           972.0    3153550


## Adding new columns

You aren't stuck with just the data you are given. Instead, you can add new columns to a DataFrame. This has many names, such as *transforming*, *mutating*, and *feature engineering*.

You can create new columns from scratch, but it is also common to derive them from other columns, for example, by adding columns together or by changing their units.
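
As a small sketch of both ideas, a unit change and a column built from other columns, the example below uses a made-up `dogs` DataFrame. The data and the `height_m` and `bmi` column names are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical dogs data, invented for illustration only
dogs = pd.DataFrame(
    {
        "name": ["Bella", "Charlie"],
        "height_cm": [56, 43],
        "weight_kg": [25, 23],
    }
)

# Change units: centimetres to metres
dogs["height_m"] = dogs["height_cm"] / 100

# Combine columns: body mass index from weight and the new height column
dogs["bmi"] = dogs["weight_kg"] / dogs["height_m"] ** 2

print(dogs)
```

Assigning to `dogs["new_name"]` adds the column to the existing DataFrame, so later cells see the extra columns too.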

In [19]:
# Add total col as sum of individuals and family_members
homelessness['total'] = homelessness['individuals'] + homelessness['family_members']

# Add p_individuals col as proportion of total that are individuals
homelessness['p_individuals'] = homelessness['individuals'] / homelessness['total']

# See the result
print(homelessness)

                region                 state  individuals  family_members  \
0   East South Central               Alabama       2570.0           864.0   
1              Pacific                Alaska       1434.0           582.0   
2             Mountain               Arizona       7259.0          2606.0   
3   West South Central              Arkansas       2280.0           432.0   
4              Pacific            California     109008.0         20964.0   
5             Mountain              Colorado       7607.0          3250.0   
6          New England           Connecticut       2280.0          1696.0   
7       South Atlantic              Delaware        708.0           374.0   
8       South Atlantic  District of Columbia       3770.0          3134.0   
9       South Atlantic               Florida      21443.0          9587.0   
10      South Atlantic               Georgia       6943.0          2556.0   
11             Pacific                Hawaii       4131.0          2399.0   

## Combo-attack!

You've seen the four most common types of data manipulation: sorting rows, subsetting columns, subsetting rows, and adding new columns. In a real-life data analysis, you can mix and match these four manipulations to answer a multitude of questions.

In this exercise, you'll answer the question, "Which state has the highest number of homeless individuals per 10,000 people in the state?" Combine your new `pandas` skills to find out.

In [21]:
# Create indiv_per_10k col as homeless individuals per 10k state pop
homelessness["indiv_per_10k"] = 10000 * homelessness['individuals'] / homelessness["state_pop"]

# Subset rows for indiv_per_10k greater than 20
high_homelessness = homelessness[homelessness["indiv_per_10k"] > 20]

# Sort high_homelessness by descending indiv_per_10k
high_homelessness_srt = high_homelessness.sort_values("indiv_per_10k", ascending=False)

# From high_homelessness_srt, select the state and indiv_per_10k cols
result = high_homelessness_srt[['state','indiv_per_10k']]

# See the result
print(result)

                   state  indiv_per_10k
8   District of Columbia      53.738381
11                Hawaii      29.079406
4             California      27.623825
37                Oregon      26.636307
28                Nevada      23.314189
47            Washington      21.829195
32              New York      20.392363
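
The steps in the cell above can also be written as one chain once the new column exists. This is only a sketch of an alternative style, not the exercise's intended solution. The `sample` DataFrame is an assumption: it holds four rows copied from outputs shown earlier in this notebook so that the block runs without the CSV file.

```python
import pandas as pd

# Four rows copied from outputs shown earlier in this notebook
sample = pd.DataFrame(
    {
        "state": ["Alabama", "California", "District of Columbia", "Hawaii"],
        "individuals": [2570.0, 109008.0, 3770.0, 4131.0],
        "state_pop": [4887681, 39461588, 701547, 1420593],
    }
)

# Add the new column first so the later steps can refer to it
sample["indiv_per_10k"] = 10000 * sample["individuals"] / sample["state_pop"]

# Subset rows, sort, then select columns, all in one expression
result = (
    sample[sample["indiv_per_10k"] > 20]
    .sort_values("indiv_per_10k", ascending=False)
    .loc[:, ["state", "indiv_per_10k"]]
)

print(result)
```

Keeping the intermediate variables, as the exercise does, makes each step easier to inspect; the chained form is shorter once you trust each step.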
