# Missing Values

In the previous lecture we saw how pandas handles missing values using the `None` type and NumPy `NaN` values. Missing values are common in data cleaning activities, and they can occur for any number of reasons.

For instance, if you are running a survey and a respondent didn't answer a question, the missing value is actually an omission. This kind of missing data is called **Missing at Random (MAR)** if there are other variables that might be used to predict the variable which is missing. In my work, when I deliver surveys, I often find that missing data, say the interest in being involved in a follow-up study, has some correlation with another data field, like gender or ethnicity. If there is no relationship to other variables, then we call this data **Missing Completely at Random (MCAR)**.

These are just two examples of missing data, and there are many more. For instance, data might be missing because it wasn't collected, either by the process responsible for collecting that data, such as a researcher, or because it wouldn't make sense if it were collected. This last example is extremely common when you start joining DataFrames together from multiple sources, such as joining a list of people at a university with a list of offices in the university (students generally don't have offices).
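To make the joining case concrete, here is a minimal sketch of how a join creates missing values. The names, roles, and office numbers are made up for illustration and are not part of the lecture data.

```python
import pandas as pd

# Hypothetical data: every name, role, and office number here is invented.
people = pd.DataFrame({"name": ["Alice", "Bob", "Carol"],
                       "role": ["staff", "student", "faculty"]})
offices = pd.DataFrame({"name": ["Alice", "Carol"],
                        "office": ["B101", "C202"]})

# A left join keeps every person; anyone without an office gets NaN.
joined = people.merge(offices, on="name", how="left")
print(joined)

# Count how many people ended up with no office.
missing_offices = joined["office"].isna().sum()
print(missing_offices)
```

Here the missing office for the student is not an error to repair. The value simply should not exist, which is why filling it in would be misleading.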

Let's look at some ways of handling missing data in pandas.

**Learning Objectives**

* Handling missing values for ordered/sequential data, for example a time series


In [1]:
# Let's import pandas
import pandas as pd

In [2]:


# It's sometimes useful to consider missing values as actually carrying information.
# Consider a dataset of logs from online learning systems, for example video use in lecture
# capture systems. In these systems it's common for the player to have a heartbeat
# functionality where playback statistics are sent to the server every so often, maybe every
# 30 seconds. These heartbeats can get big as they can carry the whole state of the playback
# system, such as where the video play head is at, what the video size is, which video is
# being rendered to the screen, and how loud the volume is.

# If we load the data file log.csv, we can see an example of what this might look like.
df = pd.read_csv("C:/Users/Break/DATA-301/log.csv")
df.head(20)

Unnamed: 0,time,user,video,playback position,paused,volume
0,1469974424,cheryl,intro.html,5,False,10.0
1,1469974454,cheryl,intro.html,6,,
2,1469974544,cheryl,intro.html,9,,
3,1469974574,cheryl,intro.html,10,,
4,1469977514,bob,intro.html,1,,
5,1469977544,bob,intro.html,1,,
6,1469977574,bob,intro.html,1,,
7,1469977604,bob,intro.html,1,,
8,1469974604,cheryl,intro.html,11,,
9,1469974694,cheryl,intro.html,14,,


In [None]:
# In this data the first column is a timestamp in the Unix epoch format. The next column is the user name,
# followed by the web page they're visiting and the video that they're playing. Each row of the DataFrame has a
# playback position, and we can see that as the playback position increases by one, the timestamp increases
# by about 30 seconds.

# Except for user Bob. It turns out that Bob has paused his playback, so as time increases the playback
# position doesn't change. Note too how difficult it is for us to derive this knowledge from the data,
# because it's not sorted by timestamp as one might expect. This is actually not uncommon on systems which
# have a high degree of parallelism. There are a lot of missing values in the paused and volume columns. It's
# not efficient to send this information across the network if it hasn't changed, so this particular system
# just inserts null values into the database if there are no changes.
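Unix epoch timestamps are hard to read by eye. As a small sketch, assuming the values are seconds since 1970-01-01 UTC (which matches the roughly 30 second gaps in the log), we can convert two of them to readable datetimes:

```python
import pandas as pd

# Two timestamps copied from the log output, assumed to be in seconds.
times = pd.Series([1469974424, 1469974454])

# unit="s" tells pandas the integers are seconds rather than nanoseconds.
readable = pd.to_datetime(times, unit="s")
print(readable)

# The gap between consecutive heartbeats, in seconds.
gaps = times.diff()
print(gaps)
```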

In [3]:


# In Pandas we can sort either by index or by values. Here we'll just promote the time stamp to an index then
# sort on the index.
df = df.set_index('time')
df = df.sort_index() #default is ascending=True
df.head(20)

Unnamed: 0_level_0,user,video,playback position,paused,volume
time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1469974424,cheryl,intro.html,5,False,10.0
1469974424,sue,advanced.html,23,False,10.0
1469974454,cheryl,intro.html,6,,
1469974454,sue,advanced.html,24,,
1469974484,cheryl,intro.html,7,,
1469974514,cheryl,intro.html,8,,
1469974524,sue,advanced.html,25,,
1469974544,cheryl,intro.html,9,,
1469974554,sue,advanced.html,26,,
1469974574,cheryl,intro.html,10,,
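For comparison, here is a short sketch of both sorting styles on a few rows copied from the log. The small `log` DataFrame is built by hand only so the example runs on its own.

```python
import pandas as pd

# A few rows from the log, deliberately out of order.
log = pd.DataFrame({
    "time": [1469974454, 1469974424, 1469974424],
    "user": ["cheryl", "sue", "cheryl"],
    "playback position": [6, 23, 5],
})

# Sort by values: time first, then user name to break ties.
by_value = log.sort_values(["time", "user"])
print(by_value)

# Sort by index: promote time to the index, then sort on it.
by_index = log.set_index("time").sort_index()
print(by_index)
```

Sorting by values leaves `time` as an ordinary column, which can be convenient if you still need to compute on it.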


In [5]:
# If we look closely at the output though, we'll notice that the index
# isn't really unique. Two users seem to be able to use the system at the same
# time. Again, a very common case. To deal with that, let's reset the index and
# use multi-level indexing on time AND user together, promoting the user name
# to a second level of the index.

df = df.reset_index()
df = df.set_index(['time', 'user'])
df

Unnamed: 0_level_0,Unnamed: 1_level_0,video,playback position,paused,volume
time,user,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1469974424,cheryl,intro.html,5,False,10.0
1469974424,sue,advanced.html,23,False,10.0
1469974454,sue,advanced.html,24,,
1469974454,cheryl,intro.html,6,,
1469974484,cheryl,intro.html,7,,
1469974514,cheryl,intro.html,8,,
1469974524,sue,advanced.html,25,,
1469974544,cheryl,intro.html,9,,
1469974554,sue,advanced.html,26,,
1469974574,cheryl,intro.html,10,,
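We can check the uniqueness claim directly. The sketch below rebuilds four rows of the log by hand and uses `index.is_unique` before and after adding `user` as a second index level.

```python
import pandas as pd

log = pd.DataFrame({
    "time": [1469974424, 1469974424, 1469974454, 1469974454],
    "user": ["cheryl", "sue", "cheryl", "sue"],
    "playback position": [5, 23, 6, 24],
})

# With time alone as the index, two users share each timestamp.
time_only = log.set_index("time")
print(time_only.index.is_unique)   # False

# Adding user as a second level makes each (time, user) pair unique.
indexed = log.set_index(["time", "user"]).sort_index()
print(indexed.index.is_unique)     # True

# Select all rows for one user across every timestamp.
cheryl = indexed.xs("cheryl", level="user")
print(cheryl)
```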


**Methods for handling missing values in time series data**

 * `ffill` (forward fill) and `bfill` (backward fill) are common methods used for imputations. Imputation refers to the process of filling in missing or incomplete data in a dataset with substitute values.

- **`ffill` (Forward Fill)**: This method propagates the last valid observation forward to the next missing value. It is useful when you assume that the value of the time series remains constant until a new data point is observed.
  
  Example:
  - Input: `[10, NaN, NaN, 20]`
  - After `ffill`: `[10, 10, 10, 20]`
  
- **`bfill` (Backward Fill)**: This method fills the missing value with the next valid observation. It assumes that future values can help fill the missing data points before them.
  
  Example:
  - Input: `[10, NaN, NaN, 20]`
  - After `bfill`: `[10, 20, 20, 20]`

Both methods are useful in time series data when you want to impute missing values without introducing external data or complex interpolation methods.
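The two examples above can be reproduced with a small Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([10, np.nan, np.nan, 20])

forward = s.ffill()    # carry the last valid value forward
backward = s.bfill()   # pull the next valid value backward

print(forward.tolist())   # [10.0, 10.0, 10.0, 20.0]
print(backward.tolist())  # [10.0, 20.0, 20.0, 20.0]
```

The results are floats because a Series holding `NaN` is stored with a floating point dtype.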

* `ffill`/`bfill` work well when the data is expected to remain relatively constant over time.

* Linear interpolation is suitable for data with a linear trend.

* KNN and multiple imputation are more advanced techniques that estimate a missing value from its relationship to other variables, which helps when the values are not missing completely at random.

* Time series models like ARIMA are suited to forecast-based imputation when you want to leverage the temporal dependencies in the data.
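As a quick sketch of the linear interpolation option, `Series.interpolate()` fills gaps along a straight line between the surrounding valid values. The numbers are made up so the arithmetic is easy to follow.

```python
import numpy as np
import pandas as pd

s = pd.Series([10, np.nan, np.nan, 40])

# The default method is "linear": evenly spaced steps between 10 and 40.
linear = s.interpolate()
print(linear.tolist())   # [10.0, 20.0, 30.0, 40.0]

# Forward fill on the same data, for comparison.
held = s.ffill()
print(held.tolist())     # [10.0, 10.0, 10.0, 40.0]
```

Forward fill assumes the value held steady until the next reading, while interpolation assumes it changed gradually. Which one is right depends on what the column measures.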

In [4]:
# Now that we have the data indexed and sorted appropriately, we can fill the missing data using ffill.
# The two common fill methods are ffill and bfill. ffill is forward filling: it updates an NA value
# for a particular cell with the value from the previous row. bfill is backward filling, which is the
# opposite of ffill: it fills the missing values with the next valid value.
# It's important to note that your data needs to be sorted in order for this to have the effect you might
# want. Data which comes from traditional database management systems usually has no order guarantee, just
# like this data, so be careful.
# It's also good to remember that you can deal with individual columns or sets of columns by projecting
# them, so you don't have to fix all missing values in one command.

# fillna(method='ffill') is deprecated in recent pandas releases; ffill() is the equivalent call.
df = df.ffill()
df.head()



Unnamed: 0_level_0,user,video,playback position,paused,volume
time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1469974424,cheryl,intro.html,5,False,10.0
1469974424,sue,advanced.html,23,False,10.0
1469974454,cheryl,intro.html,6,False,10.0
1469974454,sue,advanced.html,24,False,10.0
1469974484,cheryl,intro.html,7,False,10.0
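One caveat with a plain forward fill on this log: rows from different users are interleaved, so a gap in one user's row can be filled from another user's row. A possible way around this, sketched below, is to project the column and fill within each user. Sue's volume of 5.0 is a made-up value so that the two users differ.

```python
import numpy as np
import pandas as pd

# Rows modeled on the log; sue's volume of 5.0 is invented for illustration.
log = pd.DataFrame({
    "time":   [1469974424, 1469974424, 1469974454, 1469974454],
    "user":   ["cheryl", "sue", "cheryl", "sue"],
    "volume": [10.0, 5.0, np.nan, np.nan],
}).sort_values(["time", "user"])

# Plain ffill copies from the previous row, whoever it belongs to.
naive = log["volume"].ffill()
print(naive.tolist())      # [10.0, 5.0, 5.0, 5.0]

# Grouping by user first fills each gap from that user's own history.
per_user = log.groupby("user")["volume"].ffill()
print(per_user.tolist())   # [10.0, 5.0, 10.0, 5.0]
```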


In [5]:
# We can also do customized fill-in to replace values with the replace() function. It allows replacement
# using several approaches: value-to-value, list, dictionary, and regex. Let's generate a simple example.
df = pd.DataFrame({'A': [1, 1, 2, 3, 4],
                   'B': [3, 6, 3, 8, 9],
                   'C': ['a', 'b', 'c', 'd', 'e']})
df

Unnamed: 0,A,B,C
0,1,3,a
1,1,6,b
2,2,3,c
3,3,8,d
4,4,9,e


In [6]:
# We can replace 1's with 100. Let's try the value-to-value approach.
df.replace(1, 100)

Unnamed: 0,A,B,C
0,100,3,a
1,100,6,b
2,2,3,c
3,3,8,d
4,4,9,e


In [9]:
# How about changing two values? Let's try the list approach. For example, we want to change 1's to 100
# and 3's to 300.
df.replace([1, 3], [100, 300])

Unnamed: 0,A,B,C
0,100,300,a
1,100,6,b
2,2,300,c
3,300,8,d
4,4,9,e
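The dictionary approach mentioned earlier works along the same lines. This sketch rebuilds the same small DataFrame under the name `df_small` (a name chosen here only to keep the example self-contained).

```python
import pandas as pd

df_small = pd.DataFrame({'A': [1, 1, 2, 3, 4],
                         'B': [3, 6, 3, 8, 9],
                         'C': ['a', 'b', 'c', 'd', 'e']})

# A flat dictionary maps old values to new values in every column.
everywhere = df_small.replace({1: 100, 3: 300})
print(everywhere)

# A nested dictionary limits the replacement to the named column.
only_b = df_small.replace({'B': {3: 300}})
print(only_b)
```

The nested form is handy when the same value means different things in different columns.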


In [12]:
# To replace using a regex we make the first parameter to replace the regex pattern we want to match, the
# second parameter the value we want to emit upon match, and then we pass in a third parameter "regex=True".

# Think about this problem: imagine we want to detect all html pages in
# the "video" column, let's say that just means they end with ".html", and we want to overwrite that with the
# keyword "webpage". How could we accomplish this?

In [None]:
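# Here's my solution, first matching any number of characters then ending in .html. Note that df was
# reassigned to the small A/B/C example above, so reload the log data first to get the "video" column back.
df = pd.read_csv("C:/Users/Break/DATA-301/log.csv")
df.replace(to_replace=".*.html$", value="webpage", regex=True)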
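If you don't have the log file at hand, the same idea can be tried on a tiny hand-built DataFrame. This sketch uses a slightly stricter variant of the pattern with the dot escaped, and `notes.pdf` is a made-up entry added to show a non-match.

```python
import pandas as pd

# notes.pdf is a hypothetical entry, included only as a non-matching case.
videos = pd.DataFrame({"user": ["cheryl", "sue", "bob"],
                       "video": ["intro.html", "advanced.html", "notes.pdf"]})

# regex=True makes replace() treat to_replace as a pattern, not a literal value.
replaced = videos.replace(to_replace=r".*\.html$", value="webpage", regex=True)
print(replaced)

# The same replacement restricted to one column via the .str accessor.
video_only = videos["video"].str.replace(r".*\.html$", "webpage", regex=True)
print(video_only.tolist())
```

Calling `replace()` on the whole DataFrame checks every string column, so restricting to one column is safer when other columns might contain similar text.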

## Regular expressions (regex):

Regular expressions are a powerful tool used for searching, manipulating, and matching patterns within strings (text). They define a search pattern that can be used to identify specific sequences of characters, validate inputs, or replace substrings in a string.

### Writing a regular expression:

In regular expressions, the **caret (`^`)** and **dollar sign (`$`)** are **anchors** that match the beginning and end of a string, respectively. They are not characters to be matched literally; they constrain where in the string the rest of the pattern must match.



1. **Caret (`^`)**: Matches the **beginning** of a string.
   - This means that the pattern following the caret must appear at the start of the string.
   - Example:
     - **`^Hello`** will match any string that starts with the word "Hello", but not "Say Hello" or "A Hello".
     - It will match "Hello there!" but **not** "Say Hello there!".
   
2. **Dollar sign (`$`)**: Matches the **end** of a string.
   - This means that the pattern preceding the dollar sign must appear at the end of the string.
   - Example:
     - **`world$`** will match any string that ends with the word "world", such as "hello world", but not "worldwide".
     - It will match "Goodbye world" but **not** "world hello".


When used together, `^` and `$` ensure that the pattern matches the **entire string**, from start to finish.

- **`^Hello$`**: This pattern will only match the string "Hello" exactly, and **not** any string that contains "Hello" as a substring. It requires that the string starts with "Hello" and ends with "Hello", and there should be no other characters before or after it.

### Examples:

1. **Pattern**: `^cat$`
   - **Matches**: `"cat"`
   - **Doesn't match**: `"catalog"`, `"the cat"`, `"scat"`

2. **Pattern**: `^\d{3}$`
   - **Matches**: `"123"` (a string of exactly 3 digits)
   - **Doesn't match**: `"12"`, `"1234"`, `"abc123"`

3. **Pattern**: `^Hello world$`
   - **Matches**: `"Hello world"`
   - **Doesn't match**: `"Hello world!"`, `"Say Hello world"`
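These anchor examples can be checked with Python's built-in `re` module. A minimal sketch:

```python
import re

candidates = ["cat", "catalog", "the cat", "scat"]

# re.search looks for the pattern anywhere, so the anchors do the restricting.
whole = [s for s in candidates if re.search(r"^cat$", s)]
starts = [s for s in candidates if re.search(r"^cat", s)]
ends = [s for s in candidates if re.search(r"cat$", s)]

print(whole)    # ['cat']
print(starts)   # ['cat', 'catalog']
print(ends)     # ['cat', 'the cat', 'scat']

# Exactly three digits, nothing before or after.
three_digits = [s for s in ["123", "12", "1234", "abc123"]
                if re.search(r"^\d{3}$", s)]
print(three_digits)   # ['123']
```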




In regular expressions, the OR operation is represented by the pipe character `|`. It allows you to match one pattern or another. For example, if you want to match either "cat" or "dog", you would write the following regex:
```
cat|dog
```

This will match either the word "cat" or the word "dog" in a string.

Example: Matching a number or a word
```
\d+|[a-zA-Z]+
```
This pattern matches either:

* One or more digits (`\d+`), or

* One or more letters (`[a-zA-Z]+`).

Example: Matching one of two words as the entire string
```
^(cat|dog)$
```
This will match either "cat" or "dog" if they are the only content of the string (because of the `^` start anchor and `$` end anchor).

### Grouping and OR
You can group multiple patterns using parentheses `()`, which limits the `|` operator to the alternatives inside the group. For example:
```
(green|grey) wool
```

This will match either "green wool" or "grey wool".
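Here is a short sketch of alternation and grouping in Python. The sample strings are made up for illustration.

```python
import re

# Plain alternation: find every "cat" or "dog" in the text.
animals = re.findall(r"cat|dog", "a cat chased a dog")
print(animals)   # ['cat', 'dog']

# Digits or letters: each run is returned as a separate match.
tokens = re.findall(r"\d+|[a-zA-Z]+", "abc123 x9")
print(tokens)    # ['abc', '123', 'x', '9']

# With the group, "wool" is required after either colour.
grouped = re.compile(r"(green|grey) wool")
print(bool(grouped.search("grey wool socks")))   # True
print(bool(grouped.search("green silk")))        # False

# Without the group, the pattern means "green" OR "grey wool".
ungrouped = re.compile(r"green|grey wool")
print(bool(ungrouped.search("green silk")))      # True
```

The last case shows why the parentheses matter: `|` has low precedence, so without a group it splits the whole pattern in two.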


Regular expressions are widely used in programming languages, text editors, and for tasks such as:

1. **Pattern matching**: Identifying specific patterns within a string (e.g., finding all email addresses in a document).
2. **Validation**: Checking if a string follows a certain format, such as validating phone numbers or email addresses.
3. **String manipulation**: Replacing or extracting specific parts of a string.

A regex pattern can include special characters, such as:

- `.` (dot): Matches any single character (except newline).
- `*` (asterisk): Matches 0 or more of the preceding element.
- `+` (plus): Matches 1 or more of the preceding element.
- `?` (question mark): Makes the preceding element optional (0 or 1 occurrence).
- `[]` (brackets): Defines a character class, matching any of the characters within the brackets.
- `\d`: Matches any digit (equivalent to `[0-9]`).
- `\w`: Matches any word character (letters, digits, and underscores).
- `^`: Matches the start of a string.
- `$`: Matches the end of a string.
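A few of these special characters in action, as a small sketch using Python's `re` module with made-up sample strings:

```python
import re

# "?" makes the preceding "u" optional.
print(bool(re.fullmatch(r"colou?r", "color")))    # True
print(bool(re.fullmatch(r"colou?r", "colour")))   # True

# "*" allows zero repetitions, "+" requires at least one.
print(bool(re.fullmatch(r"ab*", "a")))            # True
print(bool(re.fullmatch(r"ab+", "a")))            # False

# "." stands in for any single character.
print(bool(re.fullmatch(r"h.t", "hat")))          # True

# "\d" and "\w" are shorthand character classes.
digits = re.findall(r"\d+", "room 12, floor 3")
words = re.findall(r"\w+", "play_head: 42")
print(digits)   # ['12', '3']
print(words)    # ['play_head', '42']

# Square brackets list the characters allowed at one position.
vowels = re.findall(r"[aeiou]", "video")
print(vowels)   # ['i', 'e', 'o']
```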

Example: If you wanted to match an email address, you might use the regex pattern `^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$`.

In `[a-zA-Z0-9.-]`, the hyphen plays two roles:

* Between two characters inside a character class (`[]`), the hyphen specifies a range. For example, `a-z` means any lowercase letter from a to z, `0-9` means any digit from 0 to 9, and `[a-zA-Z]` matches any uppercase or lowercase letter.

* When placed at the beginning or end of the character class, the hyphen is treated literally, meaning it matches the hyphen (`-`) character itself.

The remaining pieces of the email pattern:

* `{2,}` specifies that the preceding character class (letters) must appear at least twice, which corresponds to the domain extension (e.g., .com, .org).

* `$` asserts the end of the string.

Example: `[a-zA-Z0-9.-]+` matches one or more of the following characters:

* Lowercase letters (`a-z`; because the hyphen sits between two characters, it defines a range),

* Uppercase letters (`A-Z`),

* Digits (`0-9`),

* The dot (`.`),

* The hyphen (`-`).
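The email pattern above can be tried out in Python as follows. This is a simplified check rather than a complete validator for every legal address, and both the addresses and the helper name `looks_like_email` are made up for illustration.

```python
import re

EMAIL = re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")

def looks_like_email(text):
    """Return True if text fits the simplified email pattern."""
    return EMAIL.search(text) is not None

print(looks_like_email("cheryl@example.com"))   # True
print(looks_like_email("cheryl@example"))       # False: no domain extension
print(looks_like_email("cheryl@example.c"))     # False: extension too short
print(looks_like_email("@example.com"))         # False: nothing before the @
```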

The regular expression `.*.html$` can be broken down into the following components:

1. **`.`** (dot): This matches **any single character** except for a newline.
   - Example: It could match a letter, number, space, or special character.

2. **`*`** (asterisk): This means **zero or more** occurrences of the preceding character or pattern.
   - In this case, it means "zero or more of any character."
   - So, `.*` means "any sequence of characters (including none)."

3. **`.`** (second dot): Because it is not escaped, this also matches **any single character**, not just a literal period.
   - To require a literal period before "html", escape it as `\.`, giving `.*\.html$`.

4. **`html`**: This is a literal string that matches exactly the characters "html".

5. **`$`** (dollar sign): This is an **anchor** that matches the **end** of the string.
   - This means that the string must end with the pattern specified before the `$`.
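To see what difference escaping the dot makes, here is a small comparison in Python. The file names other than `intro.html` and `advanced.html` are made up for illustration.

```python
import re

loose = r".*.html$"     # the dot before "html" matches any character
strict = r".*\.html$"   # the escaped dot matches only a literal period

names = ["intro.html", "advanced.html", "xhtml", "intro.htm"]

loose_hits = [n for n in names if re.search(loose, n)]
strict_hits = [n for n in names if re.search(strict, n)]

print(loose_hits)    # ['intro.html', 'advanced.html', 'xhtml']
print(strict_hits)   # ['intro.html', 'advanced.html']
```

For the log data both patterns give the same result, since every page name there ends in `.html`, but the strict form is the safer habit.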




One last note on missing values. When you use statistical functions on DataFrames, these functions typically ignore missing values. For instance, if you calculate the mean value of a DataFrame column, pandas will skip missing values by default (`skipna=True`). This is usually what you want, but you should be aware that values are being excluded. Why you have missing values really matters, depending upon the problem you are trying to solve. It might be unreasonable to infer missing values, for instance, if the data shouldn't exist in the first place.
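A quick sketch of that behaviour, using a three-value Series with one missing entry:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# By default the NaN is skipped, so the mean is taken over two values.
default_mean = s.mean()
print(default_mean)   # 2.0

# skipna=False propagates the missing value instead.
strict_mean = s.mean(skipna=False)
print(strict_mean)    # nan

# count() reports how many values were actually used.
print(s.count(), "of", len(s))
```

Comparing `count()` with `len()` is a cheap way to see how much data a statistic silently left out.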