##### Tutorial 07: Working with Categorical Data in SQLite

In this tutorial, we will explore how to work with categorical data in SQLite. Categorical data refers to variables whose values are labels drawn from a limited set, such as "Yes" or "No", or categories like "Red", "Blue", or "Green". In SQLite, categorical data is usually stored as text, or as integer codes that stand for the different categories.

**Working with Categorical Data in SQLite**
SQLite does not have a built-in data type specifically for categorical data. However, categorical data can be represented in the following ways:

- **TEXT**: Used to store string values that represent different categories.
- **INTEGER**: Used to store numerical values that represent different categories (e.g., 1 for "Yes" and 0 for "No").
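
The two representations can be compared side by side. The sketch below is a minimal illustration on an in-memory database; the `survey` table and its `answer_text` and `answer_code` columns are hypothetical names invented for this example and are not part of the tutorial database.

```python
import sqlite3

# In-memory database so the sketch does not touch any real file.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical table: the same answer stored as TEXT and as an INTEGER code.
    CREATE TABLE survey (
        id          INTEGER PRIMARY KEY,
        answer_text TEXT,
        answer_code INTEGER CHECK (answer_code IN (0, 1))
    );
    INSERT INTO survey (answer_text, answer_code) VALUES
        ('Yes', 1), ('No', 0), ('Yes', 1);
""")

# Grouping by the TEXT column gives readable labels directly.
text_counts = conn.execute(
    "SELECT answer_text, COUNT(*) FROM survey "
    "GROUP BY answer_text ORDER BY answer_text"
).fetchall()

# Grouping by the INTEGER column needs a CASE (or a lookup table) to recover labels.
code_counts = conn.execute("""
    SELECT CASE answer_code WHEN 1 THEN 'Yes' ELSE 'No' END AS label,
           COUNT(*)
    FROM survey
    GROUP BY label
    ORDER BY label
""").fetchall()
conn.close()

print(text_counts)
print(code_counts)
```

Both queries return the same counts. TEXT is easier to read but sensitive to spelling and spacing variants, while INTEGER codes are compact but need a mapping back to labels.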


**Example 1**: Remove leading and trailing spaces from the categorical columns and update the table.

**Example 2**: What is the gender distribution of participants applying for the program? Count how many participants identify as male, female, or other.

**Example 3**: How many participants from each state/region are applying for the program? Combine the data from all regions and count the total number of participants.

**Example 4**: What is the employment status distribution of participants? Count how many participants are employed, unemployed, or in another situation.

**Example 5**: How many participants use each type of internet (e.g., Wi-Fi, mobile data)? Group the participants based on their reported internet connection type.

**Example 6**: Combine data from State_Region and Gender to find out how many male participants are applying from each region.

**Example 7**: How many participants wish to join each course? Group participants by their preferred course.

**Example 8**: How many participants have personal goals related to career advancement, education, or other categories?

**Example 9**: How many participants belong to a school with "Tech" in the name?

**Example 10**: How many participants are in an academic career that includes the word "Graduate"?




In [None]:
import sqlite3
import pandas as pd

db_path = './database/mmdt.db3'


In [None]:
update_query = """
UPDATE participants
    SET State_Region = TRIM(State_Region),
        Current_Situation = TRIM(Current_Situation),
        Type_of_Internet = TRIM(Type_of_Internet),
        Device_used = TRIM(Device_used),
        School_Name = TRIM(School_Name),
        Gender = TRIM(Gender),
        Academic_career = TRIM(Academic_career),
        Personal_Professional_Goals = TRIM(Personal_Professional_Goals),
        Reason_Right_Person = TRIM(Reason_Right_Person),
        Personal_Professional_Challenges = TRIM(Personal_Professional_Challenges);
    """

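# Run the update inside a transaction: `with conn` commits on success and
# rolls back on error; the connection is closed in either case.
conn = sqlite3.connect(db_path)
try:
    with conn:
        conn.execute(update_query)
finally:
    conn.close()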
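
One way to confirm that trimming had an effect is to count the rows whose value differs from its trimmed form before and after the update. The sketch below does this on an in-memory database; the `demo` table is a hypothetical stand-in, not the tutorial's `participants` table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (Gender TEXT)")  # hypothetical toy table
conn.executemany(
    "INSERT INTO demo VALUES (?)",
    [(" Male",), ("Male ",), ("Female",), (None,)],
)

# Rows whose stored value still carries leading or trailing spaces.
# NULL values are not counted because NULL <> NULL evaluates to NULL.
check = "SELECT COUNT(*) FROM demo WHERE Gender <> TRIM(Gender)"
before = conn.execute(check).fetchone()[0]

conn.execute("UPDATE demo SET Gender = TRIM(Gender)")
conn.commit()
after = conn.execute(check).fetchone()[0]

# After trimming, ' Male' and 'Male ' fall into the same group.
groups = conn.execute(
    "SELECT Gender, COUNT(*) FROM demo GROUP BY Gender ORDER BY Gender"
).fetchall()
conn.close()

print(before, after)
print(groups)
```

Note that `TRIM(X)` with a single argument removes only space characters, so tabs or newlines at the ends of a value would need to be listed explicitly in a second argument.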

In [None]:
# Example 2: gender distribution. Labels are normalized case-insensitively,
# falling back to the bhutan table when participants has no value, so that
# fallback values are normalized as well.
query = """
        SELECT
            CASE
                WHEN LOWER(COALESCE(p.Gender, b.Gender)) IN ('man', 'male') THEN 'Male'
                WHEN LOWER(COALESCE(p.Gender, b.Gender)) = 'female' THEN 'Female'
                ELSE 'Unknown'
            END AS gender_group,
            COUNT(*) AS number
        FROM participants AS p
        LEFT JOIN bhutan AS b
        USING (ID)
        GROUP BY gender_group
        ORDER BY number DESC;
        """

df = pd.read_sql_query(query, f'sqlite:///{db_path}')
df
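
To see how a `CASE` expression collapses spelling variants into a few groups, the following sketch runs the same kind of normalization on a small in-memory table. The `demo` table and its sample values are made up for illustration, and the use of `LOWER` for case-insensitive matching is one possible approach, not necessarily the only one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (Gender TEXT)")  # hypothetical toy table
conn.executemany(
    "INSERT INTO demo VALUES (?)",
    [("Man",), ("male",), ("Male",), ("Female",), ("Other",), (None,)],
)

rows = conn.execute("""
    SELECT
        CASE
            WHEN LOWER(Gender) IN ('man', 'male') THEN 'Male'
            WHEN LOWER(Gender) = 'female' THEN 'Female'
            ELSE 'Unknown'          -- unmatched labels and NULL end up here
        END AS gender_group,
        COUNT(*) AS number
    FROM demo
    GROUP BY gender_group
    ORDER BY number DESC, gender_group
""").fetchall()
conn.close()

print(rows)
```

The three spellings of "male" are counted together, and both the unmatched label and the NULL value fall through to the `ELSE` branch.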

In [None]:
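# Example 3: participants per country and state/region. COALESCE takes the
# value from participants and falls back to bhutan when it is NULL.
query = """
        SELECT
            COALESCE(p.Country, b.Country) AS resident,
            COALESCE(p.State_Region, b.State_Region) AS state,
            COUNT(*) AS number
        FROM participants AS p
        LEFT JOIN bhutan AS b
        USING (ID)
        GROUP BY resident, state
        ORDER BY number DESC;
        """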

df = pd.read_sql_query(query, f'sqlite:///{db_path}')
df

In [None]:
# Example 4: employment status distribution, with the same fallback to bhutan.
query = """
        SELECT
            COALESCE(p.Current_Situation, b.Current_Situation) AS employed_status,
            COUNT(*) AS number
        FROM participants AS p
        LEFT JOIN bhutan AS b
        USING (ID)
        GROUP BY employed_status
        ORDER BY number DESC;
        """

df = pd.read_sql_query(query, f'sqlite:///{db_path}')
df
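
The `COALESCE` plus `LEFT JOIN ... USING (ID)` pattern used in the queries above can be tried in isolation. The sketch below uses two hypothetical in-memory tables, `main_t` and `extra_t`, standing in for a primary table and a supplementary one; the names and rows are invented for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables: main_t is the primary source, extra_t fills gaps.
    CREATE TABLE main_t  (ID INTEGER PRIMARY KEY, Current_Situation TEXT);
    CREATE TABLE extra_t (ID INTEGER PRIMARY KEY, Current_Situation TEXT);
    INSERT INTO main_t  VALUES (1, 'Employed'), (2, NULL), (3, NULL), (4, 'Student');
    INSERT INTO extra_t VALUES (2, 'Employed'), (4, 'Unemployed');
""")

rows = conn.execute("""
    SELECT COALESCE(m.Current_Situation, e.Current_Situation) AS status,
           COUNT(*) AS number
    FROM main_t AS m
    LEFT JOIN extra_t AS e USING (ID)   -- keeps every row of main_t
    GROUP BY status
    ORDER BY number DESC, status
""").fetchall()
conn.close()

print(rows)
```

Row 2 takes its value from `extra_t`, row 4 keeps the value from `main_t` because `COALESCE` returns the first non-NULL argument, and row 3 stays NULL because neither table has a value for it.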