# String Expression Operators

---
### Connecting to MongoDB using Pymongo
----

In [1]:
# Importing the required libraries
import pymongo
import pprint as pp

pp.sorted = lambda x, key=None: x

In [2]:
# Connect to the MongoDB server (a local instance here; an Atlas URI also works)
client = pymongo.MongoClient('mongodb://localhost:27017/')

In [3]:
# training dataset
db = client.training

In [4]:
# Sample hr document
pp.pprint(
    db.hr.find_one()
)

{'_id': ObjectId('60bc95fb12d1778df87722e2'),
 'enrollee_id': 23798,
 'gender': 'Male',
 'date_of_enrollment': datetime.datetime(2016, 8, 4, 8, 4, 14, 780000),
 'city': {'name': 'city_149', 'development_index': 0.689},
 'education': {'level': 'Graduate', 'discipline': 'STEM'},
 'experience': {'years': 3,
                'company_type': 'Pvt Ltd',
                'last_new_job': 1,
                'relevent_experience': 1},
 'training_hours': 106}


---
### [String operators](https://docs.mongodb.com/manual/reference/operator/aggregation/#string-expression-operators)

String expressions let us manipulate string values.

---

For example, suppose we want to return all documents where `training_hours` is greater than 100, with `education.discipline` and `education.level` concatenated into a single field.

We can do this with the [$concat](https://docs.mongodb.com/manual/reference/operator/aggregation/concat/#-concat--aggregation-) operator.

---

In [5]:
# Concat strings

result = db.hr.aggregate(
        # Pipeline
        [
            # Stage 1
            {
                '$match':{'training_hours':{'$gt':100}}
            },
            # Stage 2
            {
                '$project':{
                                '_id':0,
                                'Course':{
                                            '$concat':['$education.discipline',
                                                       '_',
                                                       '$education.level']
                                         },
                                'Training': '$training_hours'
                            }
            },
            # Stage 3
            {
                '$limit': 10
            }
        ])

# Print results
for doc in result:
    pp.pprint(doc)

{'Course': 'STEM_Graduate', 'Training': 106}
{'Course': 'STEM_Graduate', 'Training': 106}
{'Course': 'STEM_Graduate', 'Training': 106}
{'Course': 'STEM_Graduate', 'Training': 298}
{'Course': 'Arts_Graduate', 'Training': 101}
{'Course': 'STEM_Masters', 'Training': 114}
{'Course': 'STEM_Graduate', 'Training': 104}
{'Course': 'STEM_Graduate', 'Training': 109}
{'Course': 'STEM_Graduate', 'Training': 262}
{'Course': 'STEM_Graduate', 'Training': 112}
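
One caveat with `$concat`: if any argument resolves to null or refers to a missing field, the whole result is null. The sketch below shows one possible way to guard against that by wrapping each field in `$ifNull`. The helper name `build_course_pipeline` and the `'Unknown'` placeholder are illustrative assumptions, not part of the original notebook.

```python
def build_course_pipeline(min_hours=100, limit=10, placeholder='Unknown'):
    """Return an aggregation pipeline that concatenates discipline and level.

    Each field is wrapped in $ifNull so that a null or missing value is
    replaced by `placeholder` instead of turning the whole result into null.
    """
    def or_placeholder(field_path):
        # $ifNull returns the first expression unless it is null or missing
        return {'$ifNull': [field_path, placeholder]}

    return [
        {'$match': {'training_hours': {'$gt': min_hours}}},
        {'$project': {
            '_id': 0,
            'Course': {'$concat': [or_placeholder('$education.discipline'),
                                   '_',
                                   or_placeholder('$education.level')]},
            'Training': '$training_hours',
        }},
        {'$limit': limit},
    ]


pipeline = build_course_pipeline()
# With a live connection, this could be run as: db.hr.aggregate(pipeline)
```

Building the pipeline in a function keeps the threshold and limit adjustable without editing the nested dictionaries by hand.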


----
Similarly, there are other string operators, such as:

- [$toUpper](https://docs.mongodb.com/manual/reference/operator/aggregation/toUpper/#mongodb-expression-exp.-toUpper) - Converts a string to uppercase.

- [$toLower](https://docs.mongodb.com/manual/reference/operator/aggregation/toLower/#mongodb-expression-exp.-toLower) - Converts a string to lowercase.

- [$substrCP](https://docs.mongodb.com/manual/reference/operator/aggregation/substrCP/#-substrcp--aggregation-) - Returns a substring of a string, given a starting code point index and a code point count.

- [$split](https://docs.mongodb.com/manual/reference/operator/aggregation/split/#mongodb-expression-exp.-split) - Splits a string into substrings based on a delimiter. Returns an array of substrings.

- [$strLenCP](https://docs.mongodb.com/manual/reference/operator/aggregation/strLenCP/#mongodb-expression-exp.-strLenCP) - Returns the length of a string in UTF-8 code points.


For example, let's work with the `experience.company_type` field.

----

In [6]:
# String operators

result = db.hr.aggregate(
        # Pipeline
        [
            # Stage 1
            {
                '$match':{'training_hours':{'$gt':200}}
            },
            # Stage 2
            {
                '$project':{
                                '_id':0,
                                
                                # Casing
                                'Upper_case':{'$toUpper':'$experience.company_type'},
                                'Lower_case':{'$toLower':'$experience.company_type'},
                                
                                # Substring
                                'Substr':{'$substrCP':['$experience.company_type', 0, 1]},
                                
                                # Split string on delimiter
                                'Split_on_Space':{'$split':['$experience.company_type', ' ']},
                                
                                # String length
                                'String_Length':{'$strLenCP':'$experience.company_type'}
                            }
            },
            # Stage 3
            {
                '$limit': 1
            }
        ])

# Print results
for doc in result:
    pp.pprint(doc)

{'Upper_case': 'PVT LTD',
 'Lower_case': 'pvt ltd',
 'Substr': 'P',
 'Split_on_Space': ['Pvt', 'Ltd'],
 'String_Length': 7}
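
As a cross-check, these operators have close counterparts in plain Python. This is only an analogy for building intuition, not how MongoDB evaluates them, and it holds for ASCII strings like the one here. The sample value `'Pvt Ltd'` comes from the document above, while `'café'` is an added example to show that `$strLenCP` counts code points rather than bytes (the byte count is what `$strLenBytes` reports).

```python
company_type = 'Pvt Ltd'

python_equivalents = {
    'Upper_case': company_type.upper(),         # like $toUpper
    'Lower_case': company_type.lower(),         # like $toLower
    'Substr': company_type[0:0 + 1],            # like $substrCP with index 0, count 1
    'Split_on_Space': company_type.split(' '),  # like $split on ' '
    'String_Length': len(company_type),         # like $strLenCP (code points)
}
print(python_equivalents)

# Code points versus bytes for a non-ASCII string
word = 'caf\u00e9'                       # 'café' with a single composed é
code_points = len(word)                  # comparable to $strLenCP
utf8_bytes = len(word.encode('utf-8'))   # comparable to $strLenBytes
print(code_points, utf8_bytes)
```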


----
**Regex**

Regex operators can also be used in an aggregation pipeline.

[$regexMatch](https://docs.mongodb.com/manual/reference/operator/aggregation/regexMatch/#-regexmatch--aggregation-) returns a boolean value indicating whether a string matches a regex pattern.

----

In [7]:
# Regex

result = db.hr.aggregate(
        # Pipeline
        [
            # Stage 1
            {
                '$project':{
                                '_id':0,
                                'String':'$experience.company_type',
                                # regex
                                'Regex':{
                                            '$regexMatch':{
                                                            'input':"$experience.company_type",
                                                            'regex':'LTD',
                                                            'options':'i'
                                                        }
                                        }
                            }
            },
            # Stage 2
            {
                '$limit': 5
            }
        ])

# Print results
for doc in result:
    pp.pprint(doc)

{'String': 'Pvt Ltd', 'Regex': True}
{'String': 'Funded Startup', 'Regex': False}
{'String': 'Public Sector', 'Regex': False}
{'String': 'Pvt Ltd', 'Regex': True}
{'String': 'Funded Startup', 'Regex': False}


---
[$regexFind](https://docs.mongodb.com/manual/reference/operator/aggregation/regexFind/#-regexfind--aggregation-) returns information about the first match in a string. If no match is found, it returns null.

---

In [8]:
# Regex

result = db.hr.aggregate(
        # Pipeline
        [
            # Stage 1
            {
                '$project':{
                                '_id':0,
                                'String':'$experience.company_type',
                                # regex
                                'Regex':{
                                            '$regexFind':{
                                                            'input':'$experience.company_type',
                                                            'regex':'^P',
                                                            'options':'i'
                                                        }
                                        }
                            }
            },
            # Stage 2
            {
                '$limit': 10
            }
        ])

# Print results
for doc in result:
    pp.pprint(doc)

{'String': 'Pvt Ltd', 'Regex': {'match': 'P', 'idx': 0, 'captures': []}}
{'String': 'Funded Startup', 'Regex': None}
{'String': 'Public Sector', 'Regex': {'match': 'P', 'idx': 0, 'captures': []}}
{'String': 'Pvt Ltd', 'Regex': {'match': 'P', 'idx': 0, 'captures': []}}
{'String': 'Funded Startup', 'Regex': None}
{'String': 'Pvt Ltd', 'Regex': {'match': 'P', 'idx': 0, 'captures': []}}
{'String': 'Pvt Ltd', 'Regex': {'match': 'P', 'idx': 0, 'captures': []}}
{'String': 'Pvt Ltd', 'Regex': {'match': 'P', 'idx': 0, 'captures': []}}
{'String': 'Pvt Ltd', 'Regex': {'match': 'P', 'idx': 0, 'captures': []}}
{'String': 'Pvt Ltd', 'Regex': {'match': 'P', 'idx': 0, 'captures': []}}


---

The [captures array](https://docs.mongodb.com/manual/reference/operator/aggregation/regexFind/#captures-output-behavior) in the results contains the substrings matched by the capture groups in the pattern. Capture groups are specified with unescaped parentheses `()` in the regex pattern.

---

In [9]:
# Regex

result = db.hr.aggregate(
        # Pipeline
        [
            # Stage 1
            {
                '$project':{
                                '_id':0,
                                'String':'$experience.company_type',
                                # regex
                                'Regex':{
                                            '$regexFind':{
                                                            'input':'$experience.company_type',
                                                            'regex':'^P(vt|ub)',
                                                            'options':'i'
                                                        }
                                        }
                            }
            },
            # Stage 2
            {
                '$limit': 10
            }
        ])

# Print results
for doc in result:
    pp.pprint(doc)

{'String': 'Pvt Ltd', 'Regex': {'match': 'Pvt', 'idx': 0, 'captures': ['vt']}}
{'String': 'Funded Startup', 'Regex': None}
{'String': 'Public Sector',
 'Regex': {'match': 'Pub', 'idx': 0, 'captures': ['ub']}}
{'String': 'Pvt Ltd', 'Regex': {'match': 'Pvt', 'idx': 0, 'captures': ['vt']}}
{'String': 'Funded Startup', 'Regex': None}
{'String': 'Pvt Ltd', 'Regex': {'match': 'Pvt', 'idx': 0, 'captures': ['vt']}}
{'String': 'Pvt Ltd', 'Regex': {'match': 'Pvt', 'idx': 0, 'captures': ['vt']}}
{'String': 'Pvt Ltd', 'Regex': {'match': 'Pvt', 'idx': 0, 'captures': ['vt']}}
{'String': 'Pvt Ltd', 'Regex': {'match': 'Pvt', 'idx': 0, 'captures': ['vt']}}
{'String': 'Pvt Ltd', 'Regex': {'match': 'Pvt', 'idx': 0, 'captures': ['vt']}}
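
To see how `match`, `idx`, and `captures` relate to the pattern, the result shape can be mimicked with Python's `re` module. The helper `regex_find` below is a hypothetical stand-in written for illustration. MongoDB uses its own regex engine, so behavior may differ for more advanced patterns.

```python
import re


def regex_find(text, pattern, ignore_case=True):
    """Mimic the shape of a $regexFind result using Python's re module."""
    flags = re.IGNORECASE if ignore_case else 0
    m = re.search(pattern, text, flags)
    if m is None:
        return None  # $regexFind yields null when nothing matches
    return {
        'match': m.group(0),           # the matched substring
        'idx': m.start(),              # position where the match begins
        'captures': list(m.groups()),  # one entry per capture group
    }


for value in ['Pvt Ltd', 'Funded Startup', 'Public Sector']:
    print(value, regex_find(value, '^P(vt|ub)'))
```

With the pattern `'^P'`, which has no parentheses, the same helper returns an empty `captures` list, in line with the earlier output.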


---
**Using [$cond](https://docs.mongodb.com/manual/reference/operator/aggregation/cond/#mongodb-expression-exp.-cond) conditional operator we can label encode string values.**


`$cond` evaluates a boolean expression and returns one of two specified expressions depending on the result.


For example, we project a new field and return 1 if `experience.company_type` contains `ltd` and 0 otherwise.

-----

In [10]:
# Label encoding with $cond and $regexMatch

result = db.hr.aggregate(
        # Pipeline
        [
            # Stage 1
            {
            '$project':{
                        '_id':0,
                        'Type':'$experience.company_type',
                        'Encoded':{
                                    '$cond':{
                                                'if':{
                                                        '$regexMatch':{
                                                                            'input':'$experience.company_type',
                                                                            'regex':'ltd',
                                                                            'options':'i'
                                                                        }
                                                    },
                                                'then':1,
                                                'else':0
                                                }
                                }
                        }
            },
            # Stage 2
            {
                '$limit': 10
            }
        ])

# Print results
for doc in result:
    pp.pprint(doc)

{'Type': 'Pvt Ltd', 'Encoded': 1}
{'Type': 'Funded Startup', 'Encoded': 0}
{'Type': 'Public Sector', 'Encoded': 0}
{'Type': 'Pvt Ltd', 'Encoded': 1}
{'Type': 'Funded Startup', 'Encoded': 0}
{'Type': 'Pvt Ltd', 'Encoded': 1}
{'Type': 'Pvt Ltd', 'Encoded': 1}
{'Type': 'Pvt Ltd', 'Encoded': 1}
{'Type': 'Pvt Ltd', 'Encoded': 1}
{'Type': 'Pvt Ltd', 'Encoded': 1}
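
In plain Python terms, the `$cond` and `$regexMatch` combination behaves much like a conditional expression wrapped around a regex search. The function name `encode_ltd` is a hypothetical helper used only for this illustration.

```python
import re


def encode_ltd(company_type):
    """Return 1 if the value contains 'ltd' (case-insensitive), else 0."""
    # re.search plays the role of $regexMatch; the if/else plays the role of $cond
    return 1 if re.search('ltd', company_type, re.IGNORECASE) else 0


samples = ['Pvt Ltd', 'Funded Startup', 'Public Sector']
print([(s, encode_ltd(s)) for s in samples])
```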


----

### Exercise 1

Look for those documents that have more than 5 years of total experience and whose `education.level` contains the `school` substring.

----

----

### Exercise 2

Output a new field `Encoded` whenever the `education.level` field contains the substring `school`.

----