<img src="https://pandas.pydata.org/static/img/pandas.svg" width="250">

## <center> Categorizing and Labelling Data </center>
    
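+ **pd.cut()**: bins numeric values into labelled buckets
+ **df[col_name].map()**: maps each value in a column to a new value, for example through a dictionary
+ **pd.Categorical()**: converts a column to a categorical type, optionally with ordered categories
+ **pd.get_dummies()**: one-hot encoding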

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.DataFrame({
    "Species":['Chinook','Chum','Coho','Steelhead','Bull Trout'],
     "Population":['Skokomish','Lower Skokomish','Skokomish','Skokomish','SF Skokomish'],
     "Count":[1208,2396,3220,6245,8216]
})

In [3]:
df

Unnamed: 0,Species,Population,Count
0,Chinook,Skokomish,1208
1,Chum,Lower Skokomish,2396
2,Coho,Skokomish,3220
3,Steelhead,Skokomish,6245
4,Bull Trout,SF Skokomish,8216


---------


# Binning Numerical Data with `pd.cut()`

### Create the buckets

In [4]:
bins = [0, 2000, 4000, 6000, 8000, np.inf] # bin edges: 6 edges define 5 bins

labels = ['Low Return', 'Below Avg Return', 'Avg Return', 'Above Avg Return', 'High Return'] # one label per bin

### Add the bucket column (`Count Category`, derived from each species' `Count`)

In [5]:
pd.cut(x=df['Count'], bins=bins, labels=labels)

0          Low Return
1    Below Avg Return
2    Below Avg Return
3    Above Avg Return
4         High Return
Name: Count, dtype: category
Categories (5, object): ['Low Return' < 'Below Avg Return' < 'Avg Return' < 'Above Avg Return' < 'High Return']

In [6]:
df['Count Category'] = pd.cut(x=df['Count'], bins=bins, labels=labels)

df.head()

Unnamed: 0,Species,Population,Count,Count Category
0,Chinook,Skokomish,1208,Low Return
1,Chum,Lower Skokomish,2396,Below Avg Return
2,Coho,Skokomish,3220,Below Avg Return
3,Steelhead,Skokomish,6245,Above Avg Return
4,Bull Trout,SF Skokomish,8216,High Return
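None of the counts above falls exactly on a bin edge, so the result does not show which side of each interval is closed. The sketch below uses made-up edge values (`edge_counts` is not part of the dataset) to illustrate the `right` and `include_lowest` parameters of `pd.cut()`; by default the intervals are closed on the right.

```python
import numpy as np
import pandas as pd

bins = [0, 2000, 4000, 6000, 8000, np.inf]
labels = ['Low Return', 'Below Avg Return', 'Avg Return', 'Above Avg Return', 'High Return']

# Hypothetical counts that sit exactly on bin edges (not from the dataset)
edge_counts = pd.Series([2000, 4000, 8000])

# Default right=True: intervals are (0, 2000], (2000, 4000], ...
right_closed = pd.cut(edge_counts, bins=bins, labels=labels)

# right=False: intervals are [0, 2000), [2000, 4000), ...
left_closed = pd.cut(edge_counts, bins=bins, labels=labels, right=False)

print(right_closed.tolist())  # ['Low Return', 'Below Avg Return', 'Above Avg Return']
print(left_closed.tolist())   # ['Below Avg Return', 'Avg Return', 'High Return']

# With the default settings a count of exactly 0 falls outside (0, 2000] and becomes NaN;
# include_lowest=True closes the first interval on the left as well
zero_default = pd.cut(pd.Series([0]), bins=bins, labels=labels)
zero_included = pd.cut(pd.Series([0]), bins=bins, labels=labels, include_lowest=True)

print(zero_default.isna().tolist())  # [True]
print(zero_included.tolist())        # ['Low Return']
```

Which side to close depends on how the thresholds are meant to be read, so it is worth checking whenever the data can land exactly on an edge.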


--------

# Map Species to Endangered Status

In [7]:
fed_status = {
    "Chinook":"Threatened",
    "Chum":"Not Warranted",
    "Coho":"Not Warranted",
    "Steelhead":"Threatened"
}

In [8]:
# look up each species name in fed_status and store the result in a new column
df['Federal Status'] = df['Species'].map(fed_status)

In [9]:
df.head() # "Bull Trout" has no entry in fed_status, so its status is NaN

Unnamed: 0,Species,Population,Count,Count Category,Federal Status
0,Chinook,Skokomish,1208,Low Return,Threatened
1,Chum,Lower Skokomish,2396,Below Avg Return,Not Warranted
2,Coho,Skokomish,3220,Below Avg Return,Not Warranted
3,Steelhead,Skokomish,6245,Above Avg Return,Threatened
4,Bull Trout,SF Skokomish,8216,High Return,
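Keys missing from the dictionary silently become `NaN`, so it can help to list the unmapped values before deciding how to handle them. The following is a minimal sketch of that check; the `"Unknown"` fill value is a placeholder chosen for illustration, not an actual federal listing status.

```python
import pandas as pd

species = pd.Series(['Chinook', 'Chum', 'Coho', 'Steelhead', 'Bull Trout'])

fed_status = {
    "Chinook": "Threatened",
    "Chum": "Not Warranted",
    "Coho": "Not Warranted",
    "Steelhead": "Threatened",
}

status = species.map(fed_status)

# Species that have no entry in the dictionary
unmapped = species[status.isna()].tolist()
print(unmapped)  # ['Bull Trout']

# "Unknown" is a placeholder label for this sketch, not a real status
status_filled = status.fillna("Unknown")
print(status_filled.tolist())
# ['Threatened', 'Not Warranted', 'Not Warranted', 'Threatened', 'Unknown']
```

Leaving the value as `NaN` is also a reasonable choice when a missing status should stay visibly missing.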


------

# Introducing the `categorical` data type using `pd.Categorical()`

### Making ordinal categories for the "Count Category" column

In [10]:
labels # this is the ranking that we want to use

['Low Return',
 'Below Avg Return',
 'Avg Return',
 'Above Avg Return',
 'High Return']

In [12]:
df['Count Category'] = pd.Categorical(df['Count Category'], 
                                     ordered=True, # treat the order of `labels` as the ranking, lowest to highest
                                     categories=labels)

In [13]:
df['Count Category']

0          Low Return
1    Below Avg Return
2    Below Avg Return
3    Above Avg Return
4         High Return
Name: Count Category, dtype: category
Categories (5, object): ['Low Return' < 'Below Avg Return' < 'Avg Return' < 'Above Avg Return' < 'High Return']
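The `<` signs in the output above indicate that the categories are ranked. As a rough illustration of what that ranking enables, the sketch below rebuilds the same column as a standalone Series (named `cat` here for convenience) and shows the integer codes, a comparison against a category, and `min()`/`max()`.

```python
import pandas as pd

labels = ['Low Return', 'Below Avg Return', 'Avg Return', 'Above Avg Return', 'High Return']

# Same values as the "Count Category" column, rebuilt so the example runs on its own
cat = pd.Series(pd.Categorical(
    ['Low Return', 'Below Avg Return', 'Below Avg Return', 'Above Avg Return', 'High Return'],
    categories=labels,
    ordered=True,
))

# Each value is stored as the position of its label in `labels`
codes = cat.cat.codes.tolist()
print(codes)  # [0, 1, 1, 3, 4]

# Ordered categoricals support comparisons against one of their categories
at_least_avg = (cat >= 'Avg Return').tolist()
print(at_least_avg)  # [False, False, False, True, True]

# min() and max() follow the category ranking, not alphabetical order
print(cat.min(), cat.max())  # Low Return High Return
```

These comparisons raise an error on an unordered categorical, which is why `ordered=True` matters here.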

#### We can test the ordering with `sort_values`

In [15]:
df

Unnamed: 0,Species,Population,Count,Count Category,Federal Status
0,Chinook,Skokomish,1208,Low Return,Threatened
1,Chum,Lower Skokomish,2396,Below Avg Return,Not Warranted
2,Coho,Skokomish,3220,Below Avg Return,Not Warranted
3,Steelhead,Skokomish,6245,Above Avg Return,Threatened
4,Bull Trout,SF Skokomish,8216,High Return,


#### Now the data is returned sorted from the highest category to the lowest

In [14]:
df.sort_values(by='Count Category', ascending=False)

Unnamed: 0,Species,Population,Count,Count Category,Federal Status
4,Bull Trout,SF Skokomish,8216,High Return,
3,Steelhead,Skokomish,6245,Above Avg Return,Threatened
1,Chum,Lower Skokomish,2396,Below Avg Return,Not Warranted
2,Coho,Skokomish,3220,Below Avg Return,Not Warranted
0,Chinook,Skokomish,1208,Low Return,Threatened
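To see why the ordered categorical type matters for sorting, it may help to compare it with the same labels stored as plain strings. This is a small sketch with the values copied from the column above; plain strings sort alphabetically, while the ordered categorical sorts by rank.

```python
import pandas as pd

labels = ['Low Return', 'Below Avg Return', 'Avg Return', 'Above Avg Return', 'High Return']
values = ['Low Return', 'Below Avg Return', 'Below Avg Return', 'Above Avg Return', 'High Return']

as_strings = pd.Series(values)
as_ordered = pd.Series(pd.Categorical(values, categories=labels, ordered=True))

# Plain strings: descending alphabetical order, which puts "Low Return" first
string_sorted = as_strings.sort_values(ascending=False).tolist()
print(string_sorted)
# ['Low Return', 'High Return', 'Below Avg Return', 'Below Avg Return', 'Above Avg Return']

# Ordered categorical: descending by rank, which puts "High Return" first
ordered_sorted = as_ordered.sort_values(ascending=False).tolist()
print(ordered_sorted)
# ['High Return', 'Above Avg Return', 'Below Avg Return', 'Below Avg Return', 'Low Return']
```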


------

# Use `get_dummies()` to Convert a Categorical Variable into a Dummy Variable

In [16]:
pd.get_dummies(df['Count Category'])

Unnamed: 0,Low Return,Below Avg Return,Avg Return,Above Avg Return,High Return
0,1,0,0,0,0
1,0,1,0,0,0
2,0,1,0,0,0
3,0,0,0,1,0
4,0,0,0,0,1
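The 0/1 values shown above depend on the pandas version: from pandas 2.0 onward, to the best of my knowledge, `get_dummies()` returns boolean columns by default. The sketch below passes `dtype=int` to request integers explicitly and also encodes the column in place within a small DataFrame; the `Return` prefix is an arbitrary choice for this example.

```python
import pandas as pd

labels = ['Low Return', 'Below Avg Return', 'Avg Return', 'Above Avg Return', 'High Return']

# Reduced copy of the DataFrame so the example runs on its own
df = pd.DataFrame({
    "Species": ['Chinook', 'Chum', 'Coho', 'Steelhead', 'Bull Trout'],
    "Count Category": pd.Categorical(
        ['Low Return', 'Below Avg Return', 'Below Avg Return', 'Above Avg Return', 'High Return'],
        categories=labels,
        ordered=True,
    ),
})

# dtype=int gives 0/1 columns regardless of the version's default
dummies = pd.get_dummies(df['Count Category'], dtype=int)
print(dummies['Below Avg Return'].tolist())  # [0, 1, 1, 0, 0]

# Unused categories still get a column, filled with zeros
print(dummies['Avg Return'].tolist())  # [0, 0, 0, 0, 0]

# Encode the column inside the DataFrame and keep the other columns;
# the "Return" prefix is an illustrative choice
encoded = pd.get_dummies(df, columns=['Count Category'], prefix='Return', dtype=int)
print(encoded.columns.tolist())
```

Passing the whole DataFrame with `columns=` avoids a separate join step when the dummy columns need to sit next to the original data.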
