# Working with Categorical Data

The example code is adapted from https://pbpython.com/pandas_dtypes_cat.html and illustrates how to work efficiently with categorical data in a pandas DataFrame. Categorical data takes its values from a limited set of categories or groups, such as colors, countries, or product categories.

In [1]:
#Working with categorical data - Example adapted from: https://pbpython.com/pandas_dtypes_cat.html
# import required modules
import pandas as pd
import numpy as np
import requests
from io import StringIO
from io import BytesIO
from zipfile import ZipFile

- requests: A third-party library for sending HTTP requests and receiving responses from web servers. It can be used to fetch data from a remote location.
- StringIO: A class from the `io` module that provides an in-memory buffer for text data, so a string can be handled as if it were a file.
- BytesIO: A class from the `io` module that provides an in-memory buffer for binary data, so bytes can be handled as if they were a file.
- ZipFile: A class from the `zipfile` module for reading and writing ZIP archives, which is useful when dealing with compressed data.
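
As a rough illustration of how `BytesIO` and `ZipFile` fit together, the sketch below builds a small ZIP archive in memory and reads a CSV member from it with pandas. The archive contents, the member name `medical.csv`, and the two toy columns are assumptions made for this example; in practice the bytes might come from a download (for instance the `.content` of a `requests` response) rather than being built locally.

```python
from io import BytesIO
from zipfile import ZipFile

import pandas as pd

# Build a tiny ZIP archive in memory (a stand-in for downloaded bytes).
buffer = BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("medical.csv", "Recipient_State,Amount\nMA,21.0\nFL,101.25\n")

# Rewind the buffer, open the archive, and read the CSV member as a file.
buffer.seek(0)
with ZipFile(buffer) as archive:
    with archive.open("medical.csv") as member:
        sample_df = pd.read_csv(member)

print(sample_df)
```

When the ZIP file already sits on disk and holds a single CSV, `pd.read_csv(path, compression='zip')` does the same job in one call.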

In [2]:
#Defining location of dataset 
filepath="/opt/datasets/ist652/Categories/medical.zip"

In [3]:
df=pd.read_csv(filepath,compression='zip')

In [5]:
df.head(2)

Unnamed: 0,Change_Type,Covered_Recipient_Type,Recipient_Primary_Business_Street_Address_Line1,Recipient_City,Recipient_State,Recipient_Zip_Code,Recipient_Country,Principal_Investigator_1_Profile_ID,Principal_Investigator_1_First_Name,Principal_Investigator_1_Last_Name,...,Total_Amount_of_Payment_USDollars,Date_of_Payment,Form_of_Payment_or_Transfer_of_Value,Preclinical_Research_Indicator,Delay_in_Publication_Indicator,Name_of_Study,Dispute_Status_for_Publication,Record_ID,Program_Year,Payment_Publication_Date
0,UNCHANGED,Covered Recipient Teaching Hospital,450 Brookline Ave,Boston,MA,2215,United States,754443.0,OSAMA,RAHMA,...,21.0,12/08/2018,Cash or cash equivalent,No,No,"Safety, Pharmacokinetics, and Pharmacodynamics...",No,576946373,2018,01/22/2021
1,UNCHANGED,Covered Recipient Teaching Hospital,601 EAST ROLLINS STREET,ORLANDO,FL,32803,United States,155977.0,NAUSHAD,SHAIK,...,101.25,10/10/2018,Cash or cash equivalent,No,No,QP ExCELs,No,609099103,2018,01/22/2021


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 611664 entries, 0 to 611663
Data columns (total 34 columns):
 #   Column                                                            Non-Null Count   Dtype  
---  ------                                                            --------------   -----  
 0   Change_Type                                                       611664 non-null  object 
 1   Covered_Recipient_Type                                            611664 non-null  object 
 2   Recipient_Primary_Business_Street_Address_Line1                   611460 non-null  object 
 3   Recipient_City                                                    611460 non-null  object 
 4   Recipient_State                                                   610452 non-null  object 
 5   Recipient_Zip_Code                                                610452 non-null  object 
 6   Recipient_Country                                                 611460 non-null  object 
 7   Principal_Investigat

#### Let's see which columns may be good candidates for a categorical data type by counting the unique values in each column. The same counts also show which columns carry no useful information.


#### Code Overview
1. Creates a DataFrame `unique_counts` using the `pd.DataFrame.from_records()` method.
2. The `from_records()` method takes a list of tuples as input, where each tuple represents a row in the DataFrame. In this case, the list of tuples is generated using a list comprehension.
3. The list comprehension iterates over the columns of the DataFrame `df` and creates a tuple for each column. The tuple contains the column name and the number of unique values in that column (`df[col].nunique()`).
4. The resulting list of tuples is passed to the `from_records()` method, which creates the DataFrame `unique_counts`.
5. The `columns` parameter is used to specify the column names of the resulting DataFrame.
6. The `sort_values()` method is called on the `unique_counts` DataFrame to sort it in ascending order based on the `Num_Unique` column.

In summary, the code takes a DataFrame `df` as input and creates a new DataFrame `unique_counts` with two columns: 'Column_Name' and 'Num_Unique'. Each row of `unique_counts` represents one column of `df`, and 'Num_Unique' holds the count of unique values in that column. The result is sorted in ascending order by the number of unique values, which makes it easy to spot the columns with the fewest unique values.


In [6]:
unique_counts = pd.DataFrame.from_records([(col, df[col].nunique()) for col in df.columns],
                          columns=['Column_Name', 'Num_Unique']).sort_values(by=['Num_Unique'])

In [7]:
unique_counts

|    | Column_Name | Num_Unique |
|----|-------------|------------|
| 33 | Payment_Publication_Date | 1 |
| 28 | Delay_in_Publication_Indicator | 1 |
| 32 | Program_Year | 1 |
| 30 | Dispute_Status_for_Publication | 2 |
| 27 | Preclinical_Research_Indicator | 2 |
| 26 | Form_of_Payment_or_Transfer_of_Value | 2 |
| 23 | Related_Product_Indicator | 2 |
| 0  | Change_Type | 3 |
| 1  | Covered_Recipient_Type | 4 |
| 15 | Principal_Investigator_1_Primary_Type | 6 |


 `unique_counts = pd.DataFrame.from_records([(col, df[col].nunique()) for col in df.columns], columns=['Column_Name', 'Num_Unique']).sort_values(by=['Num_Unique'])`:

   a. `pd.DataFrame.from_records(...)`: This part of the code creates a new DataFrame from a list of records. The records are generated using a list comprehension, which iterates over the columns of the original DataFrame `df`.

   b. `(col, df[col].nunique())`: For each column `col` in the DataFrame `df`, a tuple `(col, df[col].nunique())` is created. The first element of the tuple is the column name (`col`), and the second element is the number of unique values in that column, obtained using the `nunique()` method of pandas.

   c. `columns=['Column_Name', 'Num_Unique']`: This specifies the column names for the new DataFrame. The first column will be named 'Column_Name', and the second column will be named 'Num_Unique'.

   d. `.sort_values(by=['Num_Unique'])`: After creating the DataFrame, this part sorts the DataFrame based on the 'Num_Unique' column in ascending order. This means that the DataFrame will be ordered from the columns with the least unique values to the columns with the most unique values.
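
The same counts can also be obtained more compactly with `DataFrame.nunique()`, which returns one count per column. The sketch below runs both approaches on a small made-up frame (the `toy` data is an assumption for illustration, not part of the dataset) so that the results can be compared side by side.

```python
import pandas as pd

# Hypothetical toy data with 1, 3, and 4 unique values per column.
toy = pd.DataFrame({
    "Program_Year": [2018, 2018, 2018, 2018],
    "Change_Type": ["NEW", "UNCHANGED", "NEW", "CHANGED"],
    "Record_ID": [101, 102, 103, 104],
})

# Approach from the notebook: one (name, count) record per column.
unique_counts = pd.DataFrame.from_records(
    [(col, toy[col].nunique()) for col in toy.columns],
    columns=["Column_Name", "Num_Unique"],
).sort_values(by=["Num_Unique"])

# Compact alternative: a Series indexed by column name.
compact = toy.nunique().sort_values()

print(unique_counts)
print(compact)
```

The `from_records` version yields a DataFrame with named columns, while the compact version yields a Series; which one is more convenient depends on what is done with the counts afterwards.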




In [8]:
#drop columns that don't bring any new information
df.drop(['Payment_Publication_Date','Delay_in_Publication_Indicator','Program_Year'],axis=1,inplace=True)

#### Dropping Columns:
The main purpose of this code is to drop specific columns from the DataFrame `df`. The columns being dropped are `'Payment_Publication_Date'`, `'Delay_in_Publication_Indicator'`, and `'Program_Year'`.

#### Syntax of the `drop()` method:

The `drop()` method is used to remove rows or columns from a DataFrame. Its syntax, abbreviated to the parameters discussed here, is as follows:

```python
DataFrame.drop(labels, axis=0, index=None, columns=None, inplace=False)
```

- `labels`: This parameter specifies the rows or columns to be removed. In our case, it's the list of column names `['Payment_Publication_Date', 'Delay_in_Publication_Indicator', 'Program_Year']`.
- `axis`: This parameter indicates whether we want to drop rows (axis=0) or columns (axis=1). In this code, `axis=1` means we are dropping columns.
- `index` and `columns`: These parameters are an alternative to `labels` plus `axis`. Passing `columns=[...]` is equivalent to passing `labels=[...]` with `axis=1`, and `index=[...]` is equivalent to `labels=[...]` with `axis=0`. The code above uses the `labels` and `axis=1` form, so `columns` is left at its default.
- `inplace`: This parameter determines whether the DataFrame is modified in place (True) or a new DataFrame with the specified changes is returned (False). In this case, `inplace=True`, so the changes will be made to the original `df` DataFrame.
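
A minimal sketch of the two equivalent ways to drop a column, using a small made-up frame (the `toy` data is an assumption for illustration). It also shows that, without `inplace=True`, the original frame is left unchanged.

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({
    "Program_Year": [2018, 2018],
    "Recipient_State": ["MA", "FL"],
    "Amount": [21.0, 101.25],
})

# Form used in the notebook: labels plus axis=1.
by_axis = toy.drop(["Program_Year"], axis=1)

# Equivalent keyword form: no axis argument needed.
by_keyword = toy.drop(columns=["Program_Year"])

print(by_axis.equals(by_keyword))  # True
print(list(toy.columns))           # original still has all three columns
```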

#### Functionality:

In this notebook the three columns are dropped because each one contains a single unique value across all rows (see `unique_counts` above), so they cannot help distinguish one record from another. More generally, common reasons for dropping columns include:

1. **Redundant Information**: If columns contain redundant or unnecessary information that is already captured in other columns, it makes sense to drop them to reduce data redundancy and save memory.

2. **Missing or Irrelevant Data**: If columns have a significant number of missing values or contain data that is irrelevant to the current analysis, dropping them can improve the quality and relevance of the remaining data.

3. **Data Privacy**: In some cases, certain columns may contain sensitive or personally identifiable information (PII). To protect data privacy, those columns might be dropped.

4. **Model Training**: When preparing data for machine learning models, dropping irrelevant or non-predictive columns can improve model performance and reduce overfitting.
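
Columns with a single unique value can also be located programmatically instead of being typed out by hand. The following is a sketch under the assumption that "no new information" means "at most one distinct value"; the `toy` frame is made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: two constant columns and one that varies.
toy = pd.DataFrame({
    "Program_Year": [2018, 2018, 2018],
    "Delay_in_Publication_Indicator": ["No", "No", "No"],
    "Recipient_State": ["MA", "FL", "NY"],
})

# nunique() ignores missing values by default, so a column that is
# constant apart from NaN entries is also picked up here.
constant_cols = [col for col in toy.columns if toy[col].nunique() <= 1]

trimmed = toy.drop(columns=constant_cols)
print(constant_cols)
print(list(trimmed.columns))
```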



In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 611664 entries, 0 to 611663
Data columns (total 31 columns):
 #   Column                                                            Non-Null Count   Dtype  
---  ------                                                            --------------   -----  
 0   Change_Type                                                       611664 non-null  object 
 1   Covered_Recipient_Type                                            611664 non-null  object 
 2   Recipient_Primary_Business_Street_Address_Line1                   611460 non-null  object 
 3   Recipient_City                                                    611460 non-null  object 
 4   Recipient_State                                                   610452 non-null  object 
 5   Recipient_Zip_Code                                                610452 non-null  object 
 6   Recipient_Country                                                 611460 non-null  object 
 7   Principal_Investigat

### There is a big jump in the number of unique values once we pass 670. We will use a slightly higher value, 700, as the threshold for converting a column to a categorical type (except for columns that hold date/time information).

In [10]:
cols_to_exclude = ['Date_of_Payment']
for col in df.columns:
    if df[col].nunique() < 700 and col not in cols_to_exclude:
        df[col] = df[col].astype('category')

### Explanation:

1. `cols_to_exclude = ['Date_of_Payment']`: This line initializes a list `cols_to_exclude` containing the column names that should be excluded from the process. In this case, it contains only one item, which is `'Date_of_Payment'`. The purpose of this list is to prevent the column with the name `'Date_of_Payment'` from being converted to the 'category' data type.

2. `for col in df.columns:`: This line starts a loop that iterates over each column in the DataFrame `df`.

3. `if df[col].nunique() < 700 and col not in cols_to_exclude:`: This is an `if` statement that checks two conditions:
   - `df[col].nunique() < 700`: It checks if the number of unique values in the current column (`col`) is less than 700. The `nunique()` method is used to count the number of unique values in a column.
   - `col not in cols_to_exclude`: It checks if the current column (`col`) is not in the `cols_to_exclude` list. This condition ensures that the column `'Date_of_Payment'` is not processed.

4. `df[col] = df[col].astype('category')`: If both conditions in the `if` statement are true, this line converts the data type of the current column (`col`) to `'category'`. The 'category' data type in pandas is a special data type used for categorical variables, which can significantly reduce memory usage and speed up certain operations on the column.

### How it works:

The code's main purpose is to reduce memory usage and improve performance for the DataFrame `df`. It does this by converting columns with a low number of unique values (fewer than 700) to the 'category' data type, which lets pandas represent heavily repeated values efficiently.

The code iterates through each column in the DataFrame `df`. For each column, it checks two conditions:
- Whether the number of unique values in the column is less than 700.
- Whether the column is not named `'Date_of_Payment'`.

If both conditions are met, the code converts the data type of that column to `'category'`.

### Functionality:

1. **Memory optimization**: By converting columns with low cardinality (few unique values) to the 'category' data type, the code helps reduce memory usage. The 'category' type uses integer-based codes internally to represent the categories, rather than storing each category as a string, leading to more efficient memory utilization.

2. **Performance improvement**: Using the 'category' data type for columns can speed up certain operations in pandas. For example, grouping, filtering, and merging data frames with categorical columns can be faster due to the internal integer representation of categories.
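
The integer-code representation described above can be inspected through the `.cat` accessor. The sketch below uses a short made-up Series of state abbreviations (an assumption for illustration).

```python
import pandas as pd

states = pd.Series(["MA", "FL", "MA", "NY", "FL", "MA"])
as_cat = states.astype("category")

# The distinct labels are stored once...
print(list(as_cat.cat.categories))  # ['FL', 'MA', 'NY']
# ...and each row holds a small integer code pointing at a label.
print(as_cat.cat.codes.tolist())    # [1, 0, 1, 2, 0, 1]
print(as_cat.cat.codes.dtype)       # int8
```

With only a few categories the codes fit into one byte per row, which is where most of the memory saving comes from.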



In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 611664 entries, 0 to 611663
Data columns (total 34 columns):
 #   Column                                                            Non-Null Count   Dtype   
---  ------                                                            --------------   -----   
 0   Change_Type                                                       611664 non-null  category
 1   Covered_Recipient_Type                                            611664 non-null  category
 2   Recipient_Primary_Business_Street_Address_Line1                   611460 non-null  object  
 3   Recipient_City                                                    611460 non-null  object  
 4   Recipient_State                                                   610452 non-null  category
 5   Recipient_Zip_Code                                                610452 non-null  object  
 6   Recipient_Country                                                 611460 non-null  category
 7   Principal_I

Note that converting these columns to categorical types substantially reduces the memory use of the DataFrame; the `memory usage` line at the end of the full `df.info()` output reports the total.
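
One way to measure the effect directly is `memory_usage(deep=True)`, which also counts the bytes held by the string objects themselves. The sketch below compares an `object` column with its categorical version on synthetic data; the row count is an assumption chosen for illustration, and the exact byte counts will vary with the pandas version and platform.

```python
import pandas as pd

# 300,000 rows but only three distinct labels, stored as Python objects.
labels = pd.Series(
    ["Covered Recipient Physician",
     "Covered Recipient Teaching Hospital",
     "Non-covered Recipient Entity"] * 100_000,
    dtype=object,
)

object_bytes = labels.memory_usage(deep=True)
category_bytes = labels.astype("category").memory_usage(deep=True)

print(f"object:   {object_bytes:,} bytes")
print(f"category: {category_bytes:,} bytes")
```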

In [11]:
#Summary of total payments made by covered recipient type
df.groupby('Covered_Recipient_Type')['Total_Amount_of_Payment_USDollars'].sum().to_frame()


| Covered_Recipient_Type | Total_Amount_of_Payment_USDollars |
|-----------------------|----------------------------------|
| Covered Recipient Physician | 97156870.0 |
| Covered Recipient Teaching Hospital | 1258553000.0 |
| Non-covered Recipient Entity | 3922970000.0 |
| Non-covered Recipient Individual | 773178.4 |


`groupby('Covered_Recipient_Type')`: This part of the code is using the `groupby()` function of Pandas. It groups the rows of the DataFrame `df` based on the unique values in the 'Covered_Recipient_Type' column. This operation creates separate groups for each unique recipient type in the DataFrame.

`['Total_Amount_of_Payment_USDollars']`: This part specifies the column that we want to aggregate, in this case, the 'Total_Amount_of_Payment_USDollars' column. We want to calculate the sum of payments for each recipient type.

`.sum()`: This function is applied to each group of recipients. It calculates the sum of the 'Total_Amount_of_Payment_USDollars' column within each group, effectively giving us the total payment made to each covered recipient type.

`.to_frame()`: This method converts the resulting Series into a pandas DataFrame. The DataFrame has a single column, 'Total_Amount_of_Payment_USDollars', holding the total payment for each type. The recipient types themselves are not a column: they form the index, which is named 'Covered_Recipient_Type'.

The final output of the code is a summary table showing the total payments for each recipient type.

Simplified Example:
Suppose the original DataFrame `df` looks like this:

| Index | Covered_Recipient_Type | Total_Amount_of_Payment_USDollars |
|-------|-----------------------|----------------------------------|
| 0     | Physician             | 2000                             |
| 1     | Hospital              | 1500                             |
| 2     | Physician             | 3000                             |
| 3     | Pharmacy              | 500                              |
| 4     | Hospital              | 1000                             |

After applying the given code snippet, the output DataFrame will look like this (`groupby()` sorts the group keys by default, so for plain string values the rows appear in alphabetical order):

| Covered_Recipient_Type | Total_Amount_of_Payment_USDollars |
|-----------------------|----------------------------------|
| Hospital              | 2500                             |
| Pharmacy              | 500                              |
| Physician             | 5000                             |
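
The simplified example can be run directly. The sketch below rebuilds the five-row frame and applies the same chain of calls. Because `groupby()` sorts the group keys by default, the string labels come out in alphabetical order.

```python
import pandas as pd

toy = pd.DataFrame({
    "Covered_Recipient_Type": ["Physician", "Hospital", "Physician", "Pharmacy", "Hospital"],
    "Total_Amount_of_Payment_USDollars": [2000, 1500, 3000, 500, 1000],
})

summary = (
    toy.groupby("Covered_Recipient_Type")["Total_Amount_of_Payment_USDollars"]
    .sum()
    .to_frame()
)

print(summary)
# The group labels form the index; the frame has a single data column.
print(list(summary.index))    # ['Hospital', 'Pharmacy', 'Physician']
print(list(summary.columns))  # ['Total_Amount_of_Payment_USDollars']
```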


To change the order of *Covered_Recipient_Type*, we first list the categories in the desired order and create a *CategoricalDtype* from that list:

In [9]:
from pandas.api.types import CategoricalDtype
cats_in_order = ["Non-covered Recipient Entity", "Covered Recipient Teaching Hospital",
                 "Covered Recipient Physician", "Non-covered Recipient Individual"]
covered_type = CategoricalDtype(categories=cats_in_order, ordered=True)

#### Structure and Functionality:

1. Importing necessary libraries:
   The code starts by importing the required functionality from the `pandas` library. Specifically, it imports the `CategoricalDtype` class, which allows us to define custom categorical data types with ordered categories.

2. Defining the categories and order:
   The code defines a list called `cats_in_order`, which contains the names of the categories that will be used for the custom categorical data type. In this case, the categories are:
   - "Non-covered Recipient Entity"
   - "Covered Recipient Teaching Hospital"
   - "Covered Recipient Physician"
   - "Non-covered Recipient Individual"

3. Creating the custom ordered categorical data type:
   The code then creates a new `CategoricalDtype` object named `covered_type`. This object is initialized with the `categories` parameter set to the `cats_in_order` list, and the `ordered` parameter set to `True`. This means that the categories in the data type have a specific order, and operations involving this data type will take this order into account.
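
A `CategoricalDtype` such as `covered_type` can be applied to a column with `astype()`. The sketch below shows this on a short made-up Series; the `"Unknown"` label is a hypothetical value added to illustrate that entries outside the declared categories become missing values.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

cats_in_order = ["Non-covered Recipient Entity", "Covered Recipient Teaching Hospital",
                 "Covered Recipient Physician", "Non-covered Recipient Individual"]
covered_type = CategoricalDtype(categories=cats_in_order, ordered=True)

# "Unknown" is a hypothetical label that is not in cats_in_order.
raw = pd.Series(["Covered Recipient Physician",
                 "Non-covered Recipient Entity",
                 "Unknown"])

typed = raw.astype(covered_type)
print(typed.cat.ordered)      # True
print(typed.isna().tolist())  # [False, False, True]
```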



In [10]:
covered_type

CategoricalDtype(categories=['Non-covered Recipient Entity',
                  'Covered Recipient Teaching Hospital',
                  'Covered Recipient Physician',
                  'Non-covered Recipient Individual'],
, ordered=True)

In [11]:
df['Covered_Recipient_Type'] = df['Covered_Recipient_Type'].cat.reorder_categories(cats_in_order, ordered=True)

Let's break down the code into its individual parts:

1. `df['Covered_Recipient_Type']`: This selects the column named `'Covered_Recipient_Type'` from the DataFrame `df`.

2. `.cat`: The `.cat` attribute is used to access categorical data functionalities in pandas. It is used when the column contains categorical data, i.e., data with a limited and fixed number of unique values.

3. `.reorder_categories(cats_in_order, ordered=True)`: This is a method of the categorical data accessor (`.cat`) that allows reordering the categories of the categorical column.

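    - `cats_in_order`: contains the categories in the desired order. It must contain exactly the categories already present in the column, no more and no fewer; otherwise pandas raises a `ValueError`.
    
    - `ordered=True`: marks the categories as having a meaningful order, which enables operations such as sorting and comparisons that follow that order.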
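
The behavior of `reorder_categories()` can be sketched on a small made-up frame (the shortened labels and amounts are assumptions for illustration). The example also passes `observed=False` explicitly to `groupby()`, which is a choice made here to keep the behavior the same across pandas versions.

```python
import pandas as pd

toy = pd.DataFrame({
    "Covered_Recipient_Type": pd.Categorical(["Physician", "Hospital", "Physician"]),
    "Total_Amount_of_Payment_USDollars": [2000, 1500, 3000],
})

# The new order must list exactly the existing categories.
toy["Covered_Recipient_Type"] = toy["Covered_Recipient_Type"].cat.reorder_categories(
    ["Physician", "Hospital"], ordered=True
)

# Group output now follows the category order rather than alphabetical order.
totals = toy.groupby("Covered_Recipient_Type", observed=False)[
    "Total_Amount_of_Payment_USDollars"
].sum()
print(totals.index.tolist())  # ['Physician', 'Hospital']

# A list that omits (or adds) a category is rejected.
rejected = False
try:
    toy["Covered_Recipient_Type"].cat.reorder_categories(["Physician"])
except ValueError:
    rejected = True
print(rejected)  # True
```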

### Notable Features/Functionality:
1. **Categorical Data**: The code deals with categorical data in the DataFrame column `'Covered_Recipient_Type'`. Categorical data is a type of data that consists of categories or groups rather than numerical values.

2. **Reordering Categories**: The main functionality of this code is to reorder the categories of the `'Covered_Recipient_Type'` column. It is useful when you want to control the display order of the categories in plots or when performing aggregations or analysis based on the category order.

3. **Ordered Categories**: The code sets `ordered=True`, indicating that the categories have a meaningful order. This enables operations that depend on the order, such as `min()`, `max()`, sorting, and comparisons with `<` and `>`. Arithmetic-based statistics such as the mean or median are not available for categorical data.
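
A brief sketch of what an ordered categorical makes possible. The `sizes` Series and its labels are hypothetical and were chosen because their order is easy to see at a glance.

```python
import pandas as pd

# Hypothetical ordered categories: small < medium < large.
sizes = pd.Series(pd.Categorical(
    ["medium", "small", "large", "small"],
    categories=["small", "medium", "large"],
    ordered=True,
))

print(sizes.min(), sizes.max())      # small large
print(sizes.sort_values().tolist())  # ['small', 'small', 'medium', 'large']
print((sizes > "small").tolist())    # [True, False, True, False]
```

Sorting and comparisons follow the declared category order, not the alphabetical order of the labels.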



In [12]:
df.groupby('Covered_Recipient_Type')['Total_Amount_of_Payment_USDollars'].sum().to_frame()

| Covered_Recipient_Type | Total_Amount_of_Payment_USDollars |
|-----------------------|----------------------------------|
| Non-covered Recipient Entity | 3922970000.0 |
| Covered Recipient Teaching Hospital | 1258553000.0 |
| Covered Recipient Physician | 97156870.0 |
| Non-covered Recipient Individual | 773178.4 |
