# Basic Data Science Maneuvers with Pandas

### Welcome to the introductory lesson on data science. Data science is an interdisciplinary field that combines computer science (computational tools), mathematical and statistical knowledge, and domain expertise (the area being explored). Essentially, we create and use computational tools to process and analyze huge volumes of data, allowing us to make more informed decisions and draw conclusions that can impact any aspect of our lives.

### One of the most important aspects of data science is the ability to handle large volumes of data. With the vast amount of data being collected every second, processing each data point manually is an impossible task. This is where computer programs come to the rescue. With only a few lines of code, even the simplest program can help us generate new insights from a large dataset. The focus of this lesson is to learn how to handle and manipulate data, which lays the foundation needed to produce graphical visualizations, conduct statistical analysis, and much more (explored in later lessons).

### The most common tool/library that is used to handle datasets in Python is Pandas. 

## Introduction

### In this notebook, we will explore a sample dataset on medical insurance to practice handling data using Pandas.

## Set Up

In [2]:
import pandas as pd
import numpy as np

## Pandas Basics 

### Datasets can come in many forms and structures. One common form of storing data is a CSV file. A CSV (comma-separated values) file is a text file where commas are used to separate values. Think of it as a spreadsheet where each individual cell is separated by a comma. When a spreadsheet program (like Excel) opens such a file, it knows what to display in each individual cell.

### Pandas has various functions to read dataset files of various forms and structures. In this notebook, we will focus on working with CSV files. After Pandas finishes reading the file, it stores the data in a dataframe. You can think of a dataframe as a table with rows and columns.

### As noted above, the first step after obtaining data is to read it into the Python notebook. The `read_csv()` function reads a CSV file and returns the corresponding dataframe. Note that in the code below, we pass the relative path (from the working directory/folder of this notebook) to the CSV file. This function also accepts absolute paths (starting from the root directory/folder).

### Evaluating the variable that stores the dataframe displays the dataframe with its values (truncated if there are too many rows/columns). This is a way to check your work as you make more and more changes to the dataframe.

In [3]:
# Read the CSV file into a Pandas dataframe.
insuranceDF = pd.read_csv("insurance_modified.csv")

# Preview the dataframe by evaluating the variable (useful for double-checking your work).
# Note: Missing data is normally shown as NaN (similar to null).
insuranceDF

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.900,,yes,southwest,16884.92400
1,18,male,33.770,1.0,no,southeast,1725.55230
2,28,male,33.000,3.0,no,southeast,4449.46200
3,33,male,22.705,,no,northwest,21984.47061
4,32,male,28.880,,no,northwest,3866.85520
...,...,...,...,...,...,...,...
1333,50,male,30.970,3.0,no,northwest,10600.54830
1334,18,female,31.920,,no,northeast,2205.98080
1335,18,female,36.850,,no,southeast,1629.83350
1336,21,female,25.800,,no,southwest,2007.94500
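
### To try `read_csv()` without the insurance file, the sketch below reads a few made-up rows from an in-memory string. The variable names (`csv_text`, `sample_df`) and the values are hypothetical and only serve as an illustration; `io.StringIO` is used as a stand-in for a file on disk.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file on disk (illustrative values only).
csv_text = """age,sex,bmi,children,smoker
19,female,27.9,,yes
18,male,33.77,1,no
28,male,33.0,3,no
"""

# read_csv accepts a path or a file-like object such as StringIO.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df)
print(sample_df.shape)               # (number of rows, number of columns)
print(sample_df["children"].isna())  # the empty cell was read as NaN
```

### Because one value in `children` is missing, Pandas stores the whole column as floating-point numbers, which is why the counts appear as `1.0` and `3.0`.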


### The index of a dataframe is the sequence of labels used to identify each row.

### Sometimes you might want to use one of the columns as the index. In this case, we can use the `set_index()` method to change the index.

In [4]:
# When you hear the word 'index', think 'row labels'. That is what it is!
# Note: The index can be non-numerical and non-unique.  

# Set the "sex" column to be the index of the dataframe.
insuranceDF.set_index("sex", inplace=True)
insuranceDF

sex,age,bmi,children,smoker,region,charges
female,19,27.900,,yes,southwest,16884.92400
male,18,33.770,1.0,no,southeast,1725.55230
male,28,33.000,3.0,no,southeast,4449.46200
male,33,22.705,,no,northwest,21984.47061
male,32,28.880,,no,northwest,3866.85520
...,...,...,...,...,...,...
male,50,30.970,3.0,no,northwest,10600.54830
female,18,31.920,,no,northeast,2205.98080
female,18,36.850,,no,southeast,1629.83350
female,21,25.800,,no,southwest,2007.94500


### We can also reset the index by using the `reset_index()` method.

In [5]:
# For our use case, we would like to keep the default index setting.

insuranceDF.reset_index(inplace=True)
insuranceDF

Unnamed: 0,sex,age,bmi,children,smoker,region,charges
0,female,19,27.900,,yes,southwest,16884.92400
1,male,18,33.770,1.0,no,southeast,1725.55230
2,male,28,33.000,3.0,no,southeast,4449.46200
3,male,33,22.705,,no,northwest,21984.47061
4,male,32,28.880,,no,northwest,3866.85520
...,...,...,...,...,...,...,...
1333,male,50,30.970,3.0,no,northwest,10600.54830
1334,female,18,31.920,,no,northeast,2205.98080
1335,female,18,36.850,,no,southeast,1629.83350
1336,female,21,25.800,,no,southwest,2007.94500
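
### Both methods can also be used without `inplace=True`, in which case they return a new dataframe and leave the original unchanged. The sketch below shows this on a small made-up dataframe; the names `people_df`, `by_sex_df`, and `restored_df` are hypothetical.

```python
import pandas as pd

# Small illustrative dataframe (hypothetical values).
people_df = pd.DataFrame({"sex": ["female", "male", "male"], "age": [19, 18, 28]})

# Without inplace=True, set_index returns a new dataframe.
by_sex_df = people_df.set_index("sex")
print(by_sex_df.index.tolist())    # the row labels are now the "sex" values
print(people_df.columns.tolist())  # the original still has both columns

# reset_index moves the index back into a regular column.
restored_df = by_sex_df.reset_index()
print(restored_df.columns.tolist())
```

### Assigning the result to a new variable makes it easy to compare the dataframe before and after the change.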


### What is the use of storing data if we do not have a way to retrieve it? Fortunately, dataframes have many ways to retrieve/view data.

### The `loc["row range", "column range"]` attribute allows us to retrieve data by label. A range can be a single label, a contiguous slice of labels (written as `"starting label":"ending label"`, with both ends included), or a list of labels.

### - If multiple rows and columns are selected, a dataframe (2D/tabular) is returned.
### - If only a single row or column is selected, a series (1D/list) is returned.
### - If only a single row and column is selected, a singular value is returned.
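
### The three cases above can be checked on a small made-up dataframe. This is only a sketch; `toy_df` and the other variable names are hypothetical.

```python
import pandas as pd

# Small illustrative dataframe (hypothetical values).
toy_df = pd.DataFrame(
    {"age": [19, 18, 28], "bmi": [27.9, 33.77, 33.0], "smoker": ["yes", "no", "no"]}
)

block = toy_df.loc[0:1, ["age", "bmi"]]  # several rows and columns -> dataframe
column = toy_df.loc[:, "bmi"]            # a single column -> series
row = toy_df.loc[0]                      # a single row -> series
value = toy_df.loc[0, "bmi"]             # one row and one column -> single value

print(type(block).__name__)
print(type(column).__name__)
print(type(row).__name__)
print(value)
```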

### If we want to make a selection only on the rows, we only pass in the "row range" to loc.

In [14]:
# Use loc to select rows/columns by label name.

# View the information for person number 5 to 10 (inclusive) in the df.
insuranceDF.loc[5:10]

Unnamed: 0,sex,age,bmi,children,smoker,region,charges
5,female,31,25.74,,no,southeast,3756.6216
6,female,46,33.44,1.0,no,southeast,8240.5896
7,female,37,27.74,3.0,no,northwest,7281.5056
8,male,37,29.83,2.0,no,northeast,6406.4107
9,female,60,25.84,,no,northwest,28923.13692
10,male,25,26.22,,no,northeast,2721.3208


### If we want to make a selection on both the rows and columns, we need to pass in both "row range" and "column range".

In [7]:
# View only the demographic information for person number 5 to 10 (inclusive).

# The syntax here is df.loc[row_start:row_end, col_start:col_end], with both ends included.
# This is a very useful way to view and work with your data!
insuranceDF.loc[5:10, "sex":"region"] 

Unnamed: 0,sex,age,bmi,children,smoker,region
5,female,31,25.74,,no,southeast
6,female,46,33.44,1.0,no,southeast
7,female,37,27.74,3.0,no,northwest
8,male,37,29.83,2.0,no,northeast
9,female,60,25.84,,no,northwest
10,male,25,26.22,,no,northeast


### If we want to make a selection only on the columns, we need to pass "row range" as `:` and "column range".

### Using `:` as the "range" means that we are selecting the entire list (in this case, the entire row label list).

In [8]:
# View only the sex, bmi, and smoker information for everyone.

# Same syntax as above, but using only the colon for the row
# section means we display all of the rows. We only show the columns listed
insuranceDF.loc[:, ["sex", "bmi", "smoker"]] 

Unnamed: 0,sex,bmi,smoker
0,female,27.900,yes
1,male,33.770,no
2,male,33.000,no
3,male,22.705,no
4,male,28.880,no
...,...,...,...
1333,male,30.970,no
1334,female,31.920,no
1335,female,36.850,no
1336,female,25.800,no


### We can also retrieve/view data using the `iloc["row range", "column range"]` attribute. It works like `loc`, except that it selects rows and columns by integer position instead of by label, and the end of a slice is excluded.

In [15]:
# Pandas can also use iloc to select rows/columns by position.
# Note: Counting starts at 0 and the end of a slice is not included.

# View only the sex, bmi, and smoker information for the 5th to 10th person (inclusive).
insuranceDF.iloc[4:10, [0, 2, 4]]

Unnamed: 0,sex,bmi,smoker
4,male,28.88,no
5,female,25.74,no
6,female,33.44,no
7,female,27.74,no
8,male,29.83,no
9,female,25.84,no
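
### The difference between label-based and position-based slicing is easiest to see when the row labels do not start at 0. The sketch below uses a made-up dataframe (`letters_df` is a hypothetical name) whose index runs from 10 to 15.

```python
import pandas as pd

# Illustrative dataframe whose row labels differ from the row positions.
letters_df = pd.DataFrame({"letter": list("abcdef")}, index=[10, 11, 12, 13, 14, 15])

by_label = letters_df.loc[11:13]    # labels 11, 12, and 13 (end label included)
by_position = letters_df.iloc[1:3]  # positions 1 and 2 (end position excluded)

print(by_label["letter"].tolist())
print(by_position["letter"].tolist())
```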


### We can create filters on the dataframe by writing a conditional statement. A conditional statement uses comparison operators to return either `True` or `False` values. Typically, we filter the dataframe based on a column or row.

### The first step is to create the conditional statement on either a row or column. This produces a series/list of `True`/`False` values, which we can then use to select from the original dataframe.

In [16]:
# Check everyone to see if they are a smoker.
# This compares each value in the 'smoker' column with 'yes', giving True where it matches and False otherwise.
# The result is called a boolean mask, which is a very helpful concept.

# Note: Selecting a single column with [] is equivalent to selecting it with loc.
insuranceDF["smoker"] == "yes"

# This is the same as
### insuranceDF.loc[:, "smoker"] == "yes"

0        True
1       False
2       False
3       False
4       False
        ...  
1333    False
1334    False
1335    False
1336    False
1337     True
Name: smoker, Length: 1338, dtype: bool

### The next step is to pass the conditional statement as either the "row range" or the "column range", which filters the view to show only the rows or columns for which the condition is met (that is, where the conditional statement returns `True`).

In [17]:
# Filter the dataframe to show only the first 10 smokers on the list.
insuranceDF.loc[insuranceDF["smoker"] == "yes"].head(10)

# This is the same as
### insuranceDF.loc[insuranceDF["smoker"] == "yes"].iloc[:10]

Unnamed: 0,sex,age,bmi,children,smoker,region,charges
0,female,19,27.9,,yes,southwest,16884.924
11,female,62,26.29,,yes,southeast,27808.7251
14,male,27,42.13,,yes,southeast,39611.7577
19,male,30,35.3,,yes,southwest,36837.467
23,female,34,31.92,1.0,yes,northeast,37701.8768
29,male,31,36.3,2.0,yes,southwest,38711.0
30,male,22,35.6,,yes,southwest,35585.576
34,male,28,36.4,1.0,yes,southwest,51194.55914
38,male,35,36.67,1.0,yes,northeast,39774.2763
39,male,60,39.9,,yes,southwest,48173.361
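
### Boolean masks can also be combined. The sketch below, on a small made-up dataframe, uses `&` (and), `|` (or), and `~` (not); the variable names are hypothetical. Each comparison is stored in its own variable first, which avoids operator-precedence surprises.

```python
import pandas as pd

# Small illustrative dataframe (hypothetical values).
toy_df = pd.DataFrame(
    {"age": [19, 62, 27, 45], "smoker": ["yes", "yes", "no", "no"]}
)

is_smoker = toy_df["smoker"] == "yes"
is_over_40 = toy_df["age"] > 40

older_smokers = toy_df.loc[is_smoker & is_over_40]    # both conditions hold
smoker_or_older = toy_df.loc[is_smoker | is_over_40]  # at least one condition holds
non_smokers = toy_df.loc[~is_smoker]                  # condition does not hold

print(older_smokers)
print(smoker_or_older)
print(non_smokers)
```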


### Now that we have covered the bare basics of reading data into dataframes and retrieving/viewing data from them, we will move on to some dataframe methods used for quick analysis and data manipulation.

### The first thing we should do once we have a dataframe is to look at its structure. The `info()` method outputs a concise summary of the dataframe, including the number of rows, the number of columns, and the non-null count and data type of each column. This information helps us plan what type of data manipulation and/or analysis we can perform with the data.

In [21]:
# Summary of the dataframe
insuranceDF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   sex       1338 non-null   object 
 1   age       1338 non-null   int64  
 2   bmi       1338 non-null   float64
 3   children  764 non-null    float64
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(3), int64(1), object(3)
memory usage: 73.3+ KB


### The `describe()` method outputs the statistics for each column. Note that the information given by the `describe()` method describes the data within each column while the `info()` method describes the structure of each column.

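### When Pandas makes numerical calculations, NaN values (or null values) are skipped by default. For a sum, this gives the same result as treating them as 0; for a count or a mean, they are simply left out (note that the `children` column below has a count of 764, not 1338). This behavior is similar to the way we learn to handle missing data when doing the calculations manually.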

### By default, Pandas will only output statistics for numerical columns.

In [22]:
# NaN values are skipped by default in these calculations.

# Statistics that are reported by default for dataframes.
insuranceDF.describe()

Unnamed: 0,age,bmi,children,charges
count,1338.0,1338.0,764.0,1338.0
mean,39.207025,30.663397,1.917539,13270.422265
std,14.04996,6.098187,0.983351,12110.011237
min,18.0,15.96,1.0,1121.8739
25%,27.0,26.29625,1.0,4740.28715
50%,39.0,30.4,2.0,9382.033
75%,51.0,34.69375,3.0,16639.912515
max,64.0,53.13,5.0,63770.42801
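
### To see how missing values affect these statistics, the sketch below computes a few of them on a small made-up series (`children` here is a hypothetical stand-in, not the insurance column). It also shows that replacing NaN with 0 before averaging gives a different answer than skipping it.

```python
import numpy as np
import pandas as pd

# Illustrative series with one missing value.
children = pd.Series([1.0, np.nan, 3.0])

print(children.count())             # 2: the NaN is not counted
print(children.sum())               # 4.0: the NaN is skipped
print(children.mean())              # 2.0: average of 1 and 3 only
print(children.mean(skipna=False))  # nan: NaN is kept, so the result is NaN
print(children.fillna(0).mean())    # 4 / 3: the NaN now counts as a 0
```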


### We can also use the `describe()` method to get the statistics for a single column. Note that the statistics returned are different for numerical and non-numerical columns.

In [23]:
# Statistics that are reported by default for series (numerical).
insuranceDF["bmi"].describe()

count    1338.000000
mean       30.663397
std         6.098187
min        15.960000
25%        26.296250
50%        30.400000
75%        34.693750
max        53.130000
Name: bmi, dtype: float64

In [24]:
# Statistics that are reported by default for series (non-numerical).
insuranceDF["smoker"].describe()

count     1338
unique       2
top         no
freq      1064
Name: smoker, dtype: object

### One way we can manipulate the dataframe to help us analyze the data is to sort the values of any given column(s) within a dataframe. This is done with the `sort_values("columns")` method. By default, the values are sorted in ascending order. If we wish to sort the values in descending order, we need to state it explicitly by passing `ascending=False`.

### Here we are also using the `head(x)` method immediately after the `sort_values` method, which returns only the top x rows of the dataframe. These methods can be called one right after another in a single expression because each takes in a dataframe and outputs a dataframe.

In [33]:
# Obtain the demographics of the 10 people with the lowest insurance charges.
chargesSortedInsuranceDF = insuranceDF.sort_values("charges", ascending=True).head(10)
chargesSortedInsuranceDF

Unnamed: 0,sex,age,bmi,children,smoker,region,charges
940,male,18,23.21,,no,southeast,1121.8739
808,male,18,30.14,,no,southeast,1131.5066
1244,male,18,33.33,,no,southeast,1135.9407
663,male,18,33.66,,no,southeast,1136.3994
22,male,18,34.1,,no,southeast,1137.011
194,male,18,34.43,,no,southeast,1137.4697
866,male,18,37.29,,no,southeast,1141.4451
781,male,18,41.14,,no,southeast,1146.7966
442,male,18,43.01,,no,southeast,1149.3959
1317,male,18,53.13,,no,southeast,1163.4627
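
### Sorting by more than one column works the same way: pass a list of column names and, optionally, a matching list of sort directions. The sketch below uses a small made-up dataframe; the names and values are hypothetical.

```python
import pandas as pd

# Small illustrative dataframe (hypothetical values).
toy_df = pd.DataFrame(
    {"age": [18, 18, 30, 25], "charges": [1200.0, 1100.0, 900.0, 5000.0]}
)

# Sort by age (ascending), breaking ties by charges (descending).
sorted_df = toy_df.sort_values(["age", "charges"], ascending=[True, False])

# Chain head() to keep only the first two rows of the sorted result.
youngest_two = sorted_df.head(2)

print(sorted_df)
print(youngest_two)
```

### Note that the original row labels travel with the rows, so the index of the sorted dataframe is no longer in increasing order.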


### If we want to group the data, we can use the `groupby("columns")` method. The `groupby()` method DOES NOT return a dataframe. Instead, it returns a "groupby object" that contains information about each grouping. This "groupby object" is not useful on its own; however, we can apply an aggregate function to it to produce aggregated statistics/descriptions for each grouping.

### Examples of methods available on a "groupby object" are `apply()`, `agg()`, `filter()`, `sum()`, `mean()`, etc. Methods such as `sum()` and `mean()` reduce each grouping to a single value per column.

### The `agg()` method applies one or more aggregation functions to a dataframe or to the "groupby object".

In [34]:
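# Group all the people by region, aggregate by median value.
# Only the numerical columns are selected, since a median is not defined for text columns.
medianByRegionInsuranceDF = insuranceDF.groupby("region")[["age", "bmi", "children", "charges"]].agg("median")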
medianByRegionInsuranceDF

region,age,bmi,children,charges
northeast,39.5,28.88,2.0,10057.652025
northwest,39.0,28.88,2.0,8965.79575
southeast,39.0,33.33,2.0,9294.13195
southwest,39.0,30.3,2.0,8798.593
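
### `agg()` also accepts a list of function names, which produces one output column per function. The sketch below shows this on a small made-up dataframe with hypothetical region names and values.

```python
import pandas as pd

# Small illustrative dataframe (hypothetical values).
toy_df = pd.DataFrame(
    {
        "region": ["north", "north", "south", "south", "south"],
        "charges": [100.0, 300.0, 50.0, 150.0, 400.0],
    }
)

# Select one column from the groupby object, then compute several statistics per group.
summary_df = toy_df.groupby("region")["charges"].agg(["count", "mean", "median"])
print(summary_df)
```

### The result is a regular dataframe indexed by the grouping column, so it can be inspected with `loc` like any other dataframe.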


### Here, we are using the `filter()` method to drop the groupings that do not satisfy the criterion set by the conditional statement. Every row of the remaining groupings is kept.

### In this case, we are applying the criterion, defined by our own function, to each grouping.

In [36]:
# Keep only regions where the mean BMI is greater than 30.

def isHighMeanBMI(df):
    return df["bmi"].mean() > 30

highMeanBMIInsuranceDF = insuranceDF.groupby("region").filter(isHighMeanBMI)
highMeanBMIInsuranceDF

Unnamed: 0,sex,age,bmi,children,smoker,region,charges
0,female,19,27.90,,yes,southwest,16884.92400
1,male,18,33.77,1.0,no,southeast,1725.55230
2,male,28,33.00,3.0,no,southeast,4449.46200
5,female,31,25.74,,no,southeast,3756.62160
6,female,46,33.44,1.0,no,southeast,8240.58960
...,...,...,...,...,...,...,...
1330,female,57,25.74,2.0,no,southeast,12629.16560
1331,female,23,33.40,,no,southwest,10795.93733
1332,female,52,44.70,3.0,no,southwest,11411.68500
1335,female,18,36.85,,no,southeast,1629.83350
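
### The same idea is easier to verify by hand on a tiny made-up dataframe. In the sketch below (hypothetical names and values), the "north" group has a mean BMI of 26 and the "south" group has a mean BMI of 33, so only the "south" rows survive the filter.

```python
import pandas as pd

# Small illustrative dataframe (hypothetical values).
toy_df = pd.DataFrame(
    {
        "region": ["north", "north", "south", "south"],
        "bmi": [24.0, 28.0, 31.0, 35.0],
    }
)

def has_high_mean_bmi(group_df):
    # Called once per group; must return a single True or False.
    return group_df["bmi"].mean() > 30

high_bmi_df = toy_df.groupby("region").filter(has_high_mean_bmi)
print(high_bmi_df)
```

### Notice that a row with a BMI above 30 would still be dropped if its group's mean were 30 or lower: the decision is made per group, not per row.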


### Generally speaking, NaN values (or null values) indicate that data is missing or erroneous. Keeping these NaN values can affect our analysis and lead us to incorrect conclusions. There are many ways to handle NaN values, but the approach must be reasonable given the way the data was collected.

### The `isna()` method detects whether each value is missing. By summing the results, we can see how many missing values exist in each column.

In [28]:
# Check how many NaN values we have for each column.
insuranceDF.isna().sum()

sex           0
age           0
bmi           0
children    574
smoker        0
region        0
charges       0
dtype: int64

### Sometimes, the data in the dataset can contain values that are unrealistic given their real-world meaning. Therefore, it is also important to inspect the data stored in each column to ensure there are no "incorrect" values.

### The `unique()` method outputs all the unique values for a given column (series).

In [30]:
# Check the unique values in the children column.
insuranceDF["children"].unique()

array([nan,  1.,  3.,  2.,  5.,  4.])

### One way to handle missing data is to drop the rows or columns with missing values. We should be careful when dropping rows or columns, as that may introduce bias (especially when there are intrinsic patterns/reasons behind the missing values).

### Calling `dropna(axis=0)` drops the rows that contain NaN values. Calling `dropna(axis=1)` drops the columns that contain NaN values.

In [37]:
# One way to handle missing data is to remove rows where there are NaN values.

# Drop rows where NaN values exist.
droppedRowsInsuranceDF = insuranceDF.dropna(axis=0)
droppedRowsInsuranceDF

Unnamed: 0,sex,age,bmi,children,smoker,region,charges
1,male,18,33.770,1.0,no,southeast,1725.55230
2,male,28,33.000,3.0,no,southeast,4449.46200
6,female,46,33.440,1.0,no,southeast,8240.58960
7,female,37,27.740,3.0,no,northwest,7281.50560
8,male,37,29.830,2.0,no,northeast,6406.41070
...,...,...,...,...,...,...,...
1328,female,23,24.225,2.0,no,northeast,22395.74424
1329,male,52,38.600,2.0,no,southwest,10325.20600
1330,female,57,25.740,2.0,no,southeast,12629.16560
1332,female,52,44.700,3.0,no,southwest,11411.68500


In [38]:
# Another way to handle missing data is to remove columns where there are NaN values.

# Drop columns where NaN values exist.
droppedColumnsInsuranceDF = insuranceDF.dropna(axis=1)
droppedColumnsInsuranceDF

Unnamed: 0,sex,age,bmi,smoker,region,charges
0,female,19,27.900,yes,southwest,16884.92400
1,male,18,33.770,no,southeast,1725.55230
2,male,28,33.000,no,southeast,4449.46200
3,male,33,22.705,no,northwest,21984.47061
4,male,32,28.880,no,northwest,3866.85520
...,...,...,...,...,...,...
1333,male,50,30.970,no,northwest,10600.54830
1334,female,18,31.920,no,northeast,2205.98080
1335,female,18,36.850,no,southeast,1629.83350
1336,female,21,25.800,no,southwest,2007.94500
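
### When only some columns matter for the analysis, `dropna()` can be limited to them with the `subset` parameter. The sketch below compares the three variants on a small made-up dataframe; all names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Small illustrative dataframe with missing values in two columns.
toy_df = pd.DataFrame(
    {
        "age": [19, 18, 28],
        "children": [np.nan, 1.0, 3.0],
        "region": ["southwest", None, "southeast"],
    }
)

print(toy_df.isna().sum())  # number of missing values per column

any_missing_dropped = toy_df.dropna(axis=0)          # drops rows 0 and 1
children_known = toy_df.dropna(subset=["children"])  # drops only row 0
complete_columns = toy_df.dropna(axis=1)             # keeps only the "age" column

print(any_missing_dropped)
print(children_known)
print(complete_columns)
```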


### Another way to handle missing data is to replace the NaN values with reasonable values. This is done with the `fillna(x)` method, which replaces all missing values with x. This method can be called on an entire dataframe or on individual columns. Generally speaking, it is better to replace values one column at a time, as reasonable values can differ between columns.

### The `replace("to replace", "replacement")` method functions similarly to `fillna()`, but we explicitly state which value to replace instead of replacing only missing values.

In [41]:
# Another way to handle missing data is to replace the NaN values with a reasonable value.

# Create a copy of insuranceDF for demo purposes.
filledInsuranceDF = insuranceDF.copy() 

# Replace the NaN values in "children" column with 0.
filledInsuranceDF["children"] = filledInsuranceDF["children"].fillna(0)
filledInsuranceDF

# This is the same as
### filledInsuranceDF["children"] = filledInsuranceDF["children"].replace(np.nan, 0)

Unnamed: 0,sex,age,bmi,children,smoker,region,charges
0,female,19,27.900,0.0,yes,southwest,16884.92400
1,male,18,33.770,1.0,no,southeast,1725.55230
2,male,28,33.000,3.0,no,southeast,4449.46200
3,male,33,22.705,0.0,no,northwest,21984.47061
4,male,32,28.880,0.0,no,northwest,3866.85520
...,...,...,...,...,...,...,...
1333,male,50,30.970,3.0,no,northwest,10600.54830
1334,female,18,31.920,0.0,no,northeast,2205.98080
1335,female,18,36.850,0.0,no,southeast,1629.83350
1336,female,21,25.800,0.0,no,southwest,2007.94500
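
### `fillna()` also accepts a dictionary that maps column names to fill values, which is one way to give each column its own replacement. The sketch below is an illustration on made-up data; filling `bmi` with the column median is an assumption chosen for the example, not a recommendation for the insurance dataset.

```python
import numpy as np
import pandas as pd

# Small illustrative dataframe with one missing value per column.
toy_df = pd.DataFrame(
    {"children": [np.nan, 1.0, 3.0], "bmi": [27.9, np.nan, 33.0]}
)

# Fill each column with its own value: 0 for children, the median for bmi.
filled_df = toy_df.fillna({"children": 0, "bmi": toy_df["bmi"].median()})

print(filled_df)
print(toy_df)  # the original dataframe still contains the NaN values
```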


### The most flexible way to handle and manipulate dataframes is to write our own function and apply it directly to the dataframe. This can be done at the dataframe level (all rows and/or columns) or at the series level (single row or column).

### The `apply("function")` method applies the function to either a series or a dataframe. If we want to manipulate a single row or column, we first apply the function to the selected row or column and then assign the result back into the dataframe.

In [None]:
# Add a column named "parent" that determines whether a given person is a parent 
# given the number of children they have.

# Create a copy of insuranceDF for demo purposes.
withParentInsuranceDF = insuranceDF.copy() 

def isParent(children):
    if children > 0:
        return "yes"
    else:
        return "no"

withParentInsuranceDF["parent"] = withParentInsuranceDF["children"].apply(isParent)
withParentInsuranceDF
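
### The sketch below runs the same kind of function on a small made-up series so the result can be checked by hand, and compares it with a vectorized alternative based on `np.where`. The names `is_parent`, `applied`, and `vectorized` are hypothetical. Note how the missing value is handled: a comparison with NaN is always False, so a person with an unknown number of children is labelled "no".

```python
import numpy as np
import pandas as pd

# Illustrative series with one missing value (hypothetical values).
children = pd.Series([np.nan, 1.0, 3.0, 0.0])

def is_parent(n_children):
    # NaN > 0 evaluates to False, so missing values are labelled "no" here.
    return "yes" if n_children > 0 else "no"

# Element-by-element: apply calls is_parent once per value.
applied = children.apply(is_parent)

# Vectorized alternative: np.where evaluates the condition for the whole series at once.
vectorized = pd.Series(np.where(children > 0, "yes", "no"), index=children.index)

print(applied.tolist())
print(vectorized.tolist())
```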

### We have just introduced some of the basic operations for handling data in a Pandas dataframe. There are many more dataframe functions/methods to explore. In addition, there are many more customizations you can make with the functions/methods we have introduced. This information can generally be found in the API documentation.

## Text Wrangling and Regex Basics

### As you may have already noticed, there are many different forms of data, including numerical, boolean (true or false), text, etc. In general, text is one of the more complex data types to work with. Text may contain a wealth of information that requires additional processing before further analysis. For example, textual data may contain a lot of structure that can be extracted to create new features (columns). Sometimes, certain words or phrases need to be converted to a standard format so that values can be grouped properly.

### The following section will show ways to manipulate strings (textual data) on Pandas series. 

### The `str` attribute allows us to access the values of the series as strings, which lets us perform common string manipulations on the entire series at once. We need to use the `str` attribute every time we want to work with strings, including at each step of a chain of string manipulations.

### The `upper()` method converts the entire string to upper case. We can concatenate the text using the `+` operation.

In [43]:
# Change the text for the "region" column to be upper case and append the word "region" at the end.
insuranceDF["region"] = insuranceDF["region"].str.upper() + " region"
insuranceDF

Unnamed: 0,sex,age,bmi,children,smoker,region,charges
0,female,19,27.900,,yes,SOUTHWEST region,16884.92400
1,male,18,33.770,1.0,no,SOUTHEAST region,1725.55230
2,male,28,33.000,3.0,no,SOUTHEAST region,4449.46200
3,male,33,22.705,,no,NORTHWEST region,21984.47061
4,male,32,28.880,,no,NORTHWEST region,3866.85520
...,...,...,...,...,...,...,...
1333,male,50,30.970,3.0,no,NORTHWEST region,10600.54830
1334,female,18,31.920,,no,NORTHEAST region,2205.98080
1335,female,18,36.850,,no,SOUTHEAST region,1629.83350
1336,female,21,25.800,,no,SOUTHWEST region,2007.94500


### The `str.replace("to replace", "replacement")` method allows us to replace characters within the text.

In [44]:
# Replace the space character with the hyphen character for the "region" column.
insuranceDF["region"] = insuranceDF["region"].str.replace(' ', '-')
insuranceDF

Unnamed: 0,sex,age,bmi,children,smoker,region,charges
0,female,19,27.900,,yes,SOUTHWEST-region,16884.92400
1,male,18,33.770,1.0,no,SOUTHEAST-region,1725.55230
2,male,28,33.000,3.0,no,SOUTHEAST-region,4449.46200
3,male,33,22.705,,no,NORTHWEST-region,21984.47061
4,male,32,28.880,,no,NORTHWEST-region,3866.85520
...,...,...,...,...,...,...,...
1333,male,50,30.970,3.0,no,NORTHWEST-region,10600.54830
1334,female,18,31.920,,no,NORTHEAST-region,2205.98080
1335,female,18,36.850,,no,SOUTHEAST-region,1629.83350
1336,female,21,25.800,,no,SOUTHWEST-region,2007.94500


### The `split("pattern")` method splits each text value on the given pattern, producing a list of pieces for every value in the series.

### We can select a specific piece from each resulting list by using the `str` attribute again and retrieving the value at a specified index.

In [45]:
# Split the "region" column by the hyphen character and take only the first element of the result.
insuranceDF["region"] = insuranceDF["region"].str.split('-').str[0]
insuranceDF

Unnamed: 0,sex,age,bmi,children,smoker,region,charges
0,female,19,27.900,,yes,SOUTHWEST,16884.92400
1,male,18,33.770,1.0,no,SOUTHEAST,1725.55230
2,male,28,33.000,3.0,no,SOUTHEAST,4449.46200
3,male,33,22.705,,no,NORTHWEST,21984.47061
4,male,32,28.880,,no,NORTHWEST,3866.85520
...,...,...,...,...,...,...,...
1333,male,50,30.970,3.0,no,NORTHWEST,10600.54830
1334,female,18,31.920,,no,NORTHEAST,2205.98080
1335,female,18,36.850,,no,SOUTHEAST,1629.83350
1336,female,21,25.800,,no,SOUTHWEST,2007.94500
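
### To make the intermediate result of `split()` visible, the sketch below works on a small made-up series (`regions` is a hypothetical name). It also shows the `expand=True` option, which returns the pieces as separate columns instead of lists.

```python
import pandas as pd

# Illustrative series (hypothetical values).
regions = pd.Series(["SOUTHWEST-region", "NORTHEAST-region"])

pieces = regions.str.split("-")                   # each value becomes a list of pieces
first_piece = pieces.str[0]                       # element 0 of every list
as_columns = regions.str.split("-", expand=True)  # one column per piece

print(pieces.tolist())
print(first_piece.tolist())
print(as_columns)
```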


### As with regular Python strings, we can take substrings by using the `str` attribute and then slicing with the `["start index":"end index"]` operation.

In [46]:
# Take the first character of each value in the "sex" column and convert it to upper case.
insuranceDF["sex"] = insuranceDF["sex"].str[0:1].str.upper()
insuranceDF

Unnamed: 0,sex,age,bmi,children,smoker,region,charges
0,F,19,27.900,,yes,SOUTHWEST,16884.92400
1,M,18,33.770,1.0,no,SOUTHEAST,1725.55230
2,M,28,33.000,3.0,no,SOUTHEAST,4449.46200
3,M,33,22.705,,no,NORTHWEST,21984.47061
4,M,32,28.880,,no,NORTHWEST,3866.85520
...,...,...,...,...,...,...,...
1333,M,50,30.970,3.0,no,NORTHWEST,10600.54830
1334,F,18,31.920,,no,NORTHEAST,2205.98080
1335,F,18,36.850,,no,SOUTHEAST,1629.83350
1336,F,21,25.800,,no,SOUTHWEST,2007.94500


### We can create a conditional statement with strings by checking if the text contains the specified pattern. This is done with the `str` attribute followed by the `contains("pattern")` method.

In [47]:
# Check if the region is in the north.
insuranceDF["In North Region"] = insuranceDF["region"].str.contains("NORTH")
insuranceDF

Unnamed: 0,sex,age,bmi,children,smoker,region,charges,In North Region
0,F,19,27.900,,yes,SOUTHWEST,16884.92400,False
1,M,18,33.770,1.0,no,SOUTHEAST,1725.55230,False
2,M,28,33.000,3.0,no,SOUTHEAST,4449.46200,False
3,M,33,22.705,,no,NORTHWEST,21984.47061,True
4,M,32,28.880,,no,NORTHWEST,3866.85520,True
...,...,...,...,...,...,...,...,...
1333,M,50,30.970,3.0,no,NORTHWEST,10600.54830,True
1334,F,18,31.920,,no,NORTHEAST,2205.98080,True
1335,F,18,36.850,,no,SOUTHEAST,1629.83350,False
1336,F,21,25.800,,no,SOUTHWEST,2007.94500,False
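
### `contains()` is case-sensitive by default and returns a missing value for missing text. The sketch below, on a made-up series with hypothetical values, uses the `case` and `na` parameters to control both behaviors.

```python
import pandas as pd

# Illustrative series with mixed case and one missing value.
regions = pd.Series(["NORTHWEST", "southeast", None])

# na=False makes missing text count as "does not contain the pattern".
exact_case = regions.str.contains("NORTH", na=False)

# case=False ignores upper/lower case when matching.
any_case = regions.str.contains("SOUTH", case=False, na=False)

print(exact_case.tolist())
print(any_case.tolist())
```

### Setting `na=False` keeps the result a clean True/False mask, which matters if it is going to be used to filter rows.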


### The `len()` method allows us to get the length of the text. 

In [48]:
# Check the length of the string of the "smoker" column.
insuranceDF["smoker.len"] = insuranceDF["smoker"].str.len()
insuranceDF

Unnamed: 0,sex,age,bmi,children,smoker,region,charges,In North Region,smoker.len
0,F,19,27.900,,yes,SOUTHWEST,16884.92400,False,3
1,M,18,33.770,1.0,no,SOUTHEAST,1725.55230,False,2
2,M,28,33.000,3.0,no,SOUTHEAST,4449.46200,False,2
3,M,33,22.705,,no,NORTHWEST,21984.47061,True,2
4,M,32,28.880,,no,NORTHWEST,3866.85520,True,2
...,...,...,...,...,...,...,...,...,...
1333,M,50,30.970,3.0,no,NORTHWEST,10600.54830,True,2
1334,F,18,31.920,,no,NORTHEAST,2205.98080,True,2
1335,F,18,36.850,,no,SOUTHEAST,1629.83350,False,2
1336,F,21,25.800,,no,SOUTHWEST,2007.94500,False,2


### Regex (regular expression) describes a sequence of characters that specifies a search pattern. Regex is a powerful way to search for specific patterns within text when done correctly, but it can be quite complex/confusing.

### It is easy to make unintended errors when creating a regex. Therefore, we should always test a regex thoroughly before applying it.

### For more information: https://docs.python.org/3/howto/regex.html
### Website to check/test your regex expression: https://regex101.com

In [50]:
# Pattern: 
# - First character is 'S'.
# - Followed by exactly two characters of any kind.
# - Followed by any character that is not 'a' to 'z' at least once.
# - Followed by any word character zero times or more.
# - Followed by 'T'.
pattern = r"S.{2}[^a-z]+\w*T"

# Find all matches to the above pattern within the 'region' column.
insuranceDF["regex check"] = insuranceDF["region"].str.findall(pattern)
insuranceDF

Unnamed: 0,sex,age,bmi,children,smoker,region,charges,In North Region,smoker.len,regex check
0,F,19,27.900,,yes,SOUTHWEST,16884.92400,False,3,[SOUTHWEST]
1,M,18,33.770,1.0,no,SOUTHEAST,1725.55230,False,2,[SOUTHEAST]
2,M,28,33.000,3.0,no,SOUTHEAST,4449.46200,False,2,[SOUTHEAST]
3,M,33,22.705,,no,NORTHWEST,21984.47061,True,2,[]
4,M,32,28.880,,no,NORTHWEST,3866.85520,True,2,[]
...,...,...,...,...,...,...,...,...,...,...
1333,M,50,30.970,3.0,no,NORTHWEST,10600.54830,True,2,[]
1334,F,18,31.920,,no,NORTHEAST,2205.98080,True,2,[]
1335,F,18,36.850,,no,SOUTHEAST,1629.83350,False,2,[SOUTHEAST]
1336,F,21,25.800,,no,SOUTHWEST,2007.94500,False,2,[SOUTHWEST]
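
### One simple way to test a regex before applying it to a whole column is to try it on a handful of known strings with Python's built-in `re` module. The sketch below checks the same pattern against the four region names; the `matches` dictionary is a hypothetical helper for this illustration.

```python
import re

# Same pattern as above.
pattern = r"S.{2}[^a-z]+\w*T"

# Try the pattern on each known region name and collect the matches.
region_names = ["SOUTHWEST", "SOUTHEAST", "NORTHWEST", "NORTHEAST"]
matches = {name: re.findall(pattern, name) for name in region_names}

for name, found in matches.items():
    print(name, found)
```

### The northern regions do contain an 'S', but it is followed by only one more character, so the "exactly two characters" part of the pattern cannot be satisfied and no match is found.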


### We have shown some of the most commonly used Pandas operations/functions, but we have barely scratched the surface! These Pandas dataframe functions/methods/manipulations serve as the foundation for the next part of the lesson, including data visualization and statistical analysis. To learn more about the other Pandas functions, check the following:

### - User Guide (Pandas): https://pandas.pydata.org/docs/user_guide/index.html#
### - API Reference (Pandas): https://pandas.pydata.org/docs/reference/index.html
### - Pandas Tutor (you will need to write out the sample data you are working with): https://pandastutor.com/vis.html 

**Source:**


Module adapted from Kaggle: https://www.kaggle.com/code/mariapushkareva/medical-insurance-cost-with-linear-regression/notebook

Dataset source: https://github.com/stedy/Machine-Learning-with-R-datasets