### Regex

Links: 

https://docs.python.org/3/library/re.html
https://en.wikipedia.org/wiki/Regular_expression
https://www.w3schools.com/python/python_regex.asp

In [1]:
wiki = """
A regular expression (shortened as regex or regexp),[1] sometimes referred to as rational expression,[2][3] is a sequence of characters that specifies a match pattern in text. Usually such patterns are used by string-searching algorithms for "find" or "find and replace" operations on strings, or for input validation. Regular expression techniques are developed in theoretical computer science and formal language theory.

The concept of regular expressions began in the 1950s, when the American mathematician Stephen Cole Kleene formalized the concept of a regular language. They came into common use with Unix text-processing utilities. Different syntaxes for writing regular expressions have existed since the 1980s, one being the POSIX standard and another, widely used, being the Perl syntax.

Regular expressions are used in search engines, in search and replace dialogs of word processors and text editors, in text processing utilities such as sed and AWK, and in lexical analysis. Regular expressions are supported in many programming languages.
"""

In [2]:
import re

In [16]:
re.findall(r'\w+', wiki) # Finds all runs of word characters (letters, digits, underscore)

['A',
 'regular',
 'expression',
 'shortened',
 'as',
 'regex',
 'or',
 'regexp',
 '1',
 'sometimes',
 'referred',
 'to',
 'as',
 'rational',
 'expression',
 '2',
 '3',
 'is',
 'a',
 'sequence',
 'of',
 'characters',
 'that',
 'specifies',
 'a',
 'match',
 'pattern',
 'in',
 'text',
 'Usually',
 'such',
 'patterns',
 'are',
 'used',
 'by',
 'string',
 'searching',
 'algorithms',
 'for',
 'find',
 'or',
 'find',
 'and',
 'replace',
 'operations',
 'on',
 'strings',
 'or',
 'for',
 'input',
 'validation',
 'Regular',
 'expression',
 'techniques',
 'are',
 'developed',
 'in',
 'theoretical',
 'computer',
 'science',
 'and',
 'formal',
 'language',
 'theory',
 'The',
 'concept',
 'of',
 'regular',
 'expressions',
 'began',
 'in',
 'the',
 '1950s',
 'when',
 'the',
 'American',
 'mathematician',
 'Stephen',
 'Cole',
 'Kleene',
 'formalized',
 'the',
 'concept',
 'of',
 'a',
 'regular',
 'language',
 'They',
 'came',
 'into',
 'common',
 'use',
 'with',
 'Unix',
 'text',
 'processing',
 'uti

In [144]:
re.findall(r'\br\w+', wiki) # Finds all words starting with r

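['regular',
 'regex',
 'regexp',
 'referred',
 'rational',
 'replace',
 'regular',
 'regular',
 'regular',
 'replace']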
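
The `r` prefix on the pattern matters for `\b`. In an ordinary string literal Python reads `\b` as a backspace character, so the regex engine never sees a word-boundary anchor and the search silently returns nothing. A minimal sketch on a made-up sentence (`text` is a hypothetical sample, not the `wiki` string):

```python
import re

text = "a regular expression, or regex, is a rational tool"  # hypothetical sample

# Raw string: \b reaches the regex engine as a word-boundary anchor
raw = re.findall(r'\br\w+', text)

# Non-raw string: '\b' is a backspace character (\x08), which never occurs in text
not_raw = re.findall('\br\\w+', text)

print(raw)      # ['regular', 'regex', 'rational']
print(not_raw)  # []
```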

In [17]:
re.search(r'\w+', wiki) # Finds the first occurrence matching the pattern 

<re.Match object; span=(1, 2), match='A'>
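
`re.search` returns a `Match` object, or `None` when nothing matches, so it is worth checking the result before using it. A small sketch of pulling the text and position out of a match (`sample` is a hypothetical string chosen for illustration):

```python
import re

sample = "began in the 1950s"  # hypothetical sample text

match = re.search(r'\d{4}', sample)
if match is not None:
    year = match.group()        # the matched text
    start, end = match.span()   # offsets of the match within sample

# No four-digit run here, so search returns None instead of a Match
missing = re.search(r'\d{4}', "no digits here")

print(year, start, end)  # 1950 13 17
print(missing)           # None
```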

In [145]:
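re.split(r'\s', wiki) # splits on each whitespace character (not displayed: a cell only shows its last expression)
re.split(r'\w+,', wiki) # splits on each word immediately followed by a comma; the word and the comma are consumed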

['\nA regular expression (shortened as regex or regexp),[1] sometimes referred to as rational ',
 '[2][3] is a sequence of characters that specifies a match pattern in text. Usually such patterns are used by string-searching algorithms for "find" or "find and replace" operations on ',
 ' or for input validation. Regular expression techniques are developed in theoretical computer science and formal language theory.\n\nThe concept of regular expressions began in the ',
 ' when the American mathematician Stephen Cole Kleene formalized the concept of a regular language. They came into common use with Unix text-processing utilities. Different syntaxes for writing regular expressions have existed since the ',
 ' one being the POSIX standard and ',
 ' widely ',
 ' being the Perl syntax.\n\nRegular expressions are used in search ',
 ' in search and replace dialogs of word processors and text ',
 ' in text processing utilities such as sed and ',
 ' and in lexical analysis. Regular expressions a
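
Two other behaviours of `re.split` that may be useful: `maxsplit` limits the number of splits, and a capturing group in the pattern keeps the delimiters in the result. A sketch on hypothetical sample strings:

```python
import re

tools = "sed, AWK ,grep"  # hypothetical sample

# Split on commas, swallowing any whitespace around them
parts = re.split(r'\s*,\s*', tools)

# Stop after the first split; the remainder is left untouched
first_only = re.split(r'\s*,\s*', tools, maxsplit=1)

# A capturing group keeps the delimiter in the output list
with_delims = re.split(r'(,)', "a,b")

print(parts)        # ['sed', 'AWK', 'grep']
print(first_only)   # ['sed', 'AWK ,grep']
print(with_delims)  # ['a', ',', 'b']
```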

In [50]:
emails = """

email addresses are: oliv.ier@example.com. -- yaro@example.gov peter@example.ca 
drew@example.co.uk-,
marco@example.org


"""

In [51]:
# Super common use-case: extracting email addresses from a string
re.findall(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,7}\b', emails)

['oliv.ier@example.com',
 'yaro@example.gov',
 'peter@example.ca',
 'drew@example.co.uk',
 'marco@example.org']
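
A detail worth knowing when writing the top-level-domain part of such a pattern: inside a character class, `|` is a literal pipe, not alternation. The sketch below compares two hypothetical patterns (`loose` and `strict` are illustrative names) to show the difference:

```python
import re

loose = re.compile(r'[A-Z|a-z]{2,7}')   # the class also contains a literal '|'
strict = re.compile(r'[A-Za-z]{2,7}')   # letters only

# 'c|m' is not a plausible domain suffix, yet the loose class accepts it
loose_accepts = loose.fullmatch('c|m') is not None
strict_accepts = strict.fullmatch('c|m') is not None

print(loose_accepts)   # True
print(strict_accepts)  # False
```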

### Lambda review

In [58]:
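# Invalid syntax: a lambda body must be a single expression, so statements such as `if ...:` and `return` are not allowed
# plus_one = lambda variable: (if variable == 5: 
#                              return 'yes! it\'s 5')

# A conditional expression (`x if cond else y`) works instead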

plus_one_or_five = lambda variable: "yes! it's 5" if variable == 5 else variable + 1

In [59]:
plus_one_or_five(5)

"yes! it's 5"

In [146]:
plus_one_or_five(6)

7
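
For comparison, the same logic written as a regular function, which is usually easier to read once the body needs a branch. The name `plus_one_or_five_def` is hypothetical, chosen here only to avoid clashing with the lambda:

```python
def plus_one_or_five_def(variable):
    """Return a message for 5, otherwise the input plus one."""
    if variable == 5:
        return "yes! it's 5"
    return variable + 1

print(plus_one_or_five_def(5))  # yes! it's 5
print(plus_one_or_five_def(6))  # 7
```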

### Data cleaning

Links:
https://pandas.pydata.org/docs/reference/api/pandas.Series.str.extract.html
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html
https://www.kaggle.com/datasets/borapajo/food-choices?select=food_coded.csv

In [102]:
import pandas as pd

In [103]:
food = pd.read_csv('food_coded.csv')

In [104]:
food.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 125 entries, 0 to 124
Data columns (total 61 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   GPA                           123 non-null    object 
 1   Gender                        125 non-null    int64  
 2   breakfast                     125 non-null    int64  
 3   calories_chicken              125 non-null    int64  
 4   calories_day                  106 non-null    float64
 5   calories_scone                124 non-null    float64
 6   coffee                        125 non-null    int64  
 7   comfort_food                  124 non-null    object 
 8   comfort_food_reasons          124 non-null    object 
 9   comfort_food_reasons_coded    106 non-null    float64
 10  cook                          122 non-null    float64
 11  comfort_food_reasons_coded.1  125 non-null    int64  
 12  cuisine                       108 non-null    float64
 13  diet_

In [105]:
food.GPA[:10]

0      2.4
1    3.654
2      3.3
3      3.2
4      3.5
5     2.25
6      3.8
7      3.3
8      3.3
9      3.3
Name: GPA, dtype: object

In [106]:
food.GPA.value_counts()

3.5           13
3             11
3.2           10
3.7           10
3.3            9
3.4            9
3.6            7
3.9            7
3.8            6
2.8            5
4              4
3.1            3
2.9            2
3.83           2
2.6            2
2.4            1
3.79 bitch     1
3.73           1
2.71           1
3.92           1
3.68           1
3.75           1
Unknown        1
3.77           1
3.63           1
3.67           1
3.89           1
Personal       1
3.35           1
3.292          1
3.605          1
3.654          1
3.65           1
3.87           1
2.2            1
3.904          1
2.25           1
3.882          1
Name: GPA, dtype: int64

In [107]:
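# Note: the pattern requires a decimal part, so whole-number entries such as '3' and '4' do not match and become NaN
food['GPA_clean'] = food.GPA.str.extract(r'(\d+\.\d+)').astype(float)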
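
One possible refinement, sketched on a small hypothetical Series rather than the real `food` data: making the decimal part optional lets whole-number entries survive the extraction. The names `raw`, `strict` and `lenient` are illustrative assumptions.

```python
import pandas as pd

# Hypothetical sample mimicking the kinds of values seen in the GPA column
raw = pd.Series(['2.4', '3', '3.79 extra text', 'Unknown', None, '4'])

# Decimal part required: '3' and '4' fail to match and become NaN
strict = raw.str.extract(r'(\d+\.\d+)')[0].astype(float)

# Decimal part optional: whole numbers are kept
lenient = raw.str.extract(r'(\d+(?:\.\d+)?)')[0].astype(float)

print(strict.isna().sum())   # 4
print(lenient.isna().sum())  # 2
```

With the lenient pattern only the non-numeric and missing entries end up as NaN.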

In [108]:
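# The two arrays have different lengths (38 vs 34 unique values), so this is not a valid element-wise comparison
food.GPA.value_counts().values == food.GPA_clean.value_counts().values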

False
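
Comparing value counts only signals that something changed. A more direct check is to list the raw entries that were present before cleaning but came out as NaN. This is a sketch on hypothetical data (`raw`, `clean` and `lost` are illustrative names), not the notebook's actual result:

```python
import pandas as pd

raw = pd.Series(['2.4', '3', 'Unknown', '3.5', None])  # hypothetical sample

# Same style of extraction as in the cleaning step
clean = raw.str.extract(r'(\d+\.\d+)')[0].astype(float)

# Entries that had a value originally but are NaN after cleaning
lost = raw[clean.isna() & raw.notna()]

print(lost.tolist())  # ['3', 'Unknown']
```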

In [109]:
food.GPA.value_counts().values

array([13, 11, 10, 10,  9,  9,  7,  7,  6,  5,  4,  3,  2,  2,  2,  1,  1,
        1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,
        1,  1,  1,  1])

In [110]:
food.GPA_clean.value_counts()

3.500    13
3.200    10
3.700    10
3.400     9
3.300     9
3.900     7
3.600     7
3.800     6
2.800     5
3.100     3
2.600     2
2.900     2
3.830     2
3.790     1
2.710     1
3.730     1
2.400     1
3.920     1
3.680     1
3.750     1
3.770     1
3.630     1
3.670     1
3.890     1
3.350     1
3.292     1
3.605     1
3.654     1
3.650     1
3.870     1
2.200     1
3.904     1
2.250     1
3.882     1
Name: GPA_clean, dtype: int64

In [111]:
food.columns

Index(['GPA', 'Gender', 'breakfast', 'calories_chicken', 'calories_day',
       'calories_scone', 'coffee', 'comfort_food', 'comfort_food_reasons',
       'comfort_food_reasons_coded', 'cook', 'comfort_food_reasons_coded.1',
       'cuisine', 'diet_current', 'diet_current_coded', 'drink',
       'eating_changes', 'eating_changes_coded', 'eating_changes_coded1',
       'eating_out', 'employment', 'ethnic_food', 'exercise',
       'father_education', 'father_profession', 'fav_cuisine',
       'fav_cuisine_coded', 'fav_food', 'food_childhood', 'fries', 'fruit_day',
       'grade_level', 'greek_food', 'healthy_feeling', 'healthy_meal',
       'ideal_diet', 'ideal_diet_coded', 'income', 'indian_food',
       'italian_food', 'life_rewarding', 'marital_status',
       'meals_dinner_friend', 'mother_education', 'mother_profession',
       'nutritional_check', 'on_off_campus', 'parents_cook', 'pay_meal_out',
       'persian_food', 'self_perception_weight', 'soup', 'sports', 'thai_food',
       

In [112]:
to_drop = ['calories_day', 'comfort_food_reasons_coded', 'cuisine', 'type_sports', 'GPA']

In [113]:
food = food.drop(columns=to_drop)

In [119]:
food.isna().sum()

Gender                           0
breakfast                        0
calories_chicken                 0
calories_scone                   1
coffee                           0
comfort_food                     1
comfort_food_reasons             1
cook                             3
comfort_food_reasons_coded.1     0
diet_current                     1
diet_current_coded               0
drink                            2
eating_changes                   3
eating_changes_coded             0
eating_changes_coded1            0
eating_out                       0
employment                       9
ethnic_food                      0
exercise                        13
father_education                 1
father_profession                3
fav_cuisine                      2
fav_cuisine_coded                0
fav_food                         2
food_childhood                   1
fries                            0
fruit_day                        0
grade_level                      0
greek_food          

In [123]:
import numpy as np

In [132]:
food_gpa_mean = food.GPA_clean.mean()

In [134]:
food_gpa_mean

3.4401603773584903

In [152]:
food.GPA_clean.apply(
    lambda gpa: food_gpa_mean if not isinstance(gpa, float) else gpa
) # does not work: NaN is itself a float, so the isinstance check never catches it

food.GPA_clean.apply(
    lambda gpa: food_gpa_mean if str(gpa) == 'nan' else gpa
) # works, but comparing string representations is not best practice

0      2.40000
1      3.65400
2      3.30000
3      3.20000
4      3.50000
        ...   
120    3.50000
121    3.44016
122    3.88200
123    3.44016
124    3.90000
Name: GPA_clean, Length: 125, dtype: float64

In [127]:
np.nan == np.nan # NaN never compares equal to itself, so == cannot be used to detect it

False
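
Because of this, missing values need a dedicated test such as `math.isnan` (or `np.isnan` / `pd.isna`). A minimal standard-library sketch, with a hypothetical list `values` standing in for a column:

```python
import math

nan = float('nan')

print(nan == nan)       # False
print(math.isnan(nan))  # True

values = [2.4, nan, 3.5]  # hypothetical sample with one missing entry

# Compute the mean over the values that are actually present
present = [v for v in values if not math.isnan(v)]
mean = sum(present) / len(present)

# Replace each NaN with that mean
filled = [mean if math.isnan(v) else v for v in values]
```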

In [136]:
# Rather than apply, use the built-in pandas method
food.GPA_clean = food.GPA_clean.fillna(food_gpa_mean)
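
To see that `fillna` does the same job as an `apply` with a proper missing-value test, here is a sketch on a small hypothetical Series (`gpa` is made-up data, not the `food` column):

```python
import pandas as pd

gpa = pd.Series([2.0, 3.0, None, 4.0])  # hypothetical sample with one missing value
mean = gpa.mean()                        # NaN is skipped by default

via_fillna = gpa.fillna(mean)
via_apply = gpa.apply(lambda g: mean if pd.isna(g) else g)

print(via_fillna.tolist())           # [2.0, 3.0, 3.0, 4.0]
print(via_fillna.equals(via_apply))  # True
```

`fillna` states the intent directly and avoids calling a Python function once per row.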