# Working With Strings

The following dataset contains users and codes. It comes from a real Data Science project I worked on, where having tools for working with strings was fundamental. The idea is simple: each user in the list has a code. The code carries several meanings, because each letter represents something different depending on its position. For this example we focus on a single one, the program name. The rules:

- The code should always have 5 letters; a position with no information holds `N`.
- If the first letter is `A` (account) then the third letter contains the Program name.
- The Program names are: `G` Gold, `P` Platinum, and `B` Black.
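
To make the rules concrete, here is a minimal plain-Python sketch of how a single, already cleaned code could be decoded. The function name `decode_program` and the choice to return `None` for non-account codes are assumptions made for illustration; they are not part of yeast or of the original project.

```python
from typing import Optional

# Third-letter values and the program they stand for (from the rules above)
PROGRAM_NAMES = {"G": "Gold", "P": "Platinum", "B": "Black"}


def decode_program(code: str) -> Optional[str]:
    """Return the program name for a 5-letter code, or None if there is none.

    Hypothetical helper: assumes the code has already been cleaned.
    """
    if len(code) != 5:
        raise ValueError(f"expected a 5-letter code, got {code!r}")
    # Only codes starting with 'A' (account) carry a program in position 3
    if code[0] != "A":
        return None
    # Unknown third letters (for example 'N') also yield None
    return PROGRAM_NAMES.get(code[2])


print(decode_program("ANPNN"))
print(decode_program("NNNNN"))
```

The real data is not this tidy, which is why the cleaning steps later in this section are needed before any decoding.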

Let's answer the most basic question: How many accounts per type do they have?

## Importing the Libraries

In [1]:
# General Libraries
import pandas as pd

In [2]:
# Yeast-specific classes
from yeast import Recipe
from yeast.steps import *
from yeast.transformers import *
from yeast.aggregations import *

## Getting the Data

In [3]:
codes = pd.read_csv('string_codes.csv')
codes.head()

|   | user | code  |
|---|------|-------|
| 0 | 0    | NNNNN |
| 1 | 1    |       |
| 2 | 2    | ANPNN |
| 3 | 3    | A B   |
| 4 | 4    | ANPNN |


## Cleaning the Data
### Defining the processing Recipe

In [9]:
recipe = Recipe([
    # Trap: in the CSV the "code" column is actually named "  code"
    # (leading whitespace); cleaning the column names fixes this
    CleanColumnNamesStep('snake'),
    # Replace the missing values by 'NNNNN' (no code)
    ReplaceNAStep('code', 'NNNNN'),
    # Let's clean the Code according to the business rules:
    MutateStep({
        # Transform the "code" column
        'code': [
            # No whitespace to the left or right of the string
            StrTrim(),
            # The code must have 5 characters, 'N' if no information
            StrPad(5, side='right', pad='N'),
            # Any remaining whitespace is also coded as 'N'
            StrReplaceAll(' ', 'N')
        ],
        # Extract the first letter of the code (Account)
        'code_account': StrSlice(0, 1, column='code'),
        # Extract the third letter of the code (Account Type); only meaningful if Account == 'A'
        'code_type': StrSlice(2, 3, column='code'),
        # Map the codes to the correct program name
        'program_name': MapValues({
            'G': 'Gold',
            'P': 'Platinum',
            'B': 'Black'
        }, column='code_type')
    })
])

In [10]:
recipe = recipe.prepare(codes)

In [11]:
clean_codes = recipe.bake(codes)
clean_codes.head()

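|   | user | code  | code_account | code_type | program_name |
|---|------|-------|--------------|-----------|--------------|
| 0 | 0    | NNNNN | N            | N         | N            |
| 1 | 1    | NNNNN | N            | N         | N            |
| 2 | 2    | ANPNN | A            | P         | Platinum     |
| 3 | 3    | ANBNN | A            | B         | Black        |
| 4 | 4    | ANPNN | A            | P         | Platinum     |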
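
For readers more familiar with plain pandas, the same cleaning can be approximated with the `.str` accessor. This is a sketch, not a description of what yeast does internally. The `raw` frame is a hypothetical sample based on the first rows shown above, and `Series.replace` is used for the mapping on the assumption that unmapped values such as `N` should pass through unchanged, as they do in the output above.

```python
import pandas as pd

# Hypothetical sample mirroring the first rows of the dataset,
# including the column name with leading whitespace
raw = pd.DataFrame({
    "user": [0, 1, 2, 3, 4],
    "  code": ["NNNNN", None, "ANPNN", "A B", "ANPNN"],
})

clean = raw.copy()
# Normalize the column names ("  code" becomes "code")
clean.columns = clean.columns.str.strip().str.lower()

clean["code"] = (
    clean["code"]
    .fillna("NNNNN")                          # missing value means "no code"
    .str.strip()                              # no whitespace on either side
    .str.pad(5, side="right", fillchar="N")   # always 5 characters
    .str.replace(" ", "N", regex=False)       # inner whitespace becomes 'N'
)

# First letter: account flag; third letter: program code
clean["code_account"] = clean["code"].str.slice(0, 1)
clean["code_type"] = clean["code"].str.slice(2, 3)

# Map known program codes; anything else is left as-is
clean["program_name"] = clean["code_type"].replace(
    {"G": "Gold", "P": "Platinum", "B": "Black"}
)

print(clean)
```

The order of the string steps matters: padding before replacing the inner whitespace is what turns `A B` into `ANBNN`.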


### How many accounts per type do they have?

In [15]:
group_recipe = Recipe([
    # Keep Only Accounts with Type
    FilterStep('code_account == "A"'),
    # Group by Type
    GroupByStep('program_name'),
    # Count the types
    SummarizeStep({
        'program_name_count': AggCount('code_type')
    }),
    # Sort by count
    SortStep('program_name_count', ascending=False)
])

In [16]:
group_codes = group_recipe.bake(clean_codes)
group_codes.head(n=15)

|   | program_name | program_name_count |
|---|--------------|--------------------|
| 0 | Gold         | 6                  |
| 1 | Platinum     | 4                  |
| 2 | Black        | 3                  |
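
The filter, group, count, and sort sequence can also be sketched in plain pandas. The `clean` frame below is a small hypothetical sample with the same columns as `clean_codes`, so its counts differ from the real result above.

```python
import pandas as pd

# Hypothetical cleaned sample with the same columns as clean_codes
clean = pd.DataFrame({
    "code_account": ["A", "A", "N", "A", "A"],
    "code_type": ["P", "B", "N", "P", "G"],
    "program_name": ["Platinum", "Black", "N", "Platinum", "Gold"],
})

counts = (
    clean[clean["code_account"] == "A"]        # keep only accounts
    .groupby("program_name")["code_type"]      # group by program
    .count()                                   # count rows per program
    .reset_index(name="program_name_count")
    .sort_values("program_name_count", ascending=False)
    .reset_index(drop=True)
)

print(counts)
```

Filtering on `code_account` first keeps the `N` placeholder rows out of the counts, which is the same role `FilterStep` plays in the recipe.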
