# Data Cleaning and Preparation

### First we include our imports.

### For more on random seeds, see https://www.sharpsightlabs.com/blog/numpy-random-seed/


In [1]:
import numpy as np
import pandas as pd
np.random.seed(12345) # seed the pseudo-random number generator so results are reproducible
np.set_printoptions(precision=4, suppress=True) # controls how NumPy arrays are displayed
# suppress=True disables scientific notation

#### Line 3: np.random.seed(12345) sets the seed value for NumPy's random number generator. The seed ensures that the pseudo-random numbers generated by NumPy will be the same every time the code is executed with the same seed value. This is useful for reproducibility when working with random numbers.

#### Line 4: np.set_printoptions(precision=4, suppress=True) sets the display options for NumPy arrays when printed. The precision parameter controls the number of decimal places to display, and suppress=True disables the use of scientific notation when displaying very large or small numbers. By default, NumPy may use scientific notation for numbers outside a certain range to maintain clarity.
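A minimal sketch of what seeding buys us: resetting the generator to the same seed reproduces the same draws. The names `first` and `second` are illustrative only and are not part of the notebook.

```python
import numpy as np

np.random.seed(12345)
first = np.random.randn(3)    # three draws from the standard normal

np.random.seed(12345)         # reset the generator to the same seed
second = np.random.randn(3)   # the same three draws again

print(np.array_equal(first, second))  # True
```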

## Handling Missing Data

#### Here is a pandas Series. Using `np.nan` is just one of the ways pandas can represent missing or undefined data.

In [2]:
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data

0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

###  The `isnull()` method detects missing values in a pandas DataFrame or Series. It returns a boolean object of the same shape, holding True for every missing value (NaN or None) and False for every non-missing value.

In [6]:
string_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool

#### How it Works

Let's break down how the `isnull()` function works:

1. The function is applied to the `string_data`.
2. For each element in `string_data`, the `isnull()` function checks if the value is missing (i.e., NaN - Not a Number or None).
3. It creates a boolean mask of the same shape as `string_data`, where each element in the mask is `True` if the corresponding element in `string_data` is null and `False` otherwise.

### Example

To further illustrate the functionality of the `isnull()` function, let's consider an example:

```python
import pandas as pd

# Sample data for a DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', None],
    'Age': [25, 30, None, 40],
    'City': ['New York', 'London', 'Paris', 'Sydney']
}

# Create a DataFrame from the sample data
df = pd.DataFrame(data)

# Check for null values using `isnull()`
null_mask = df.isnull()

print(null_mask)
```

Output:
```
    Name    Age   City
0  False  False  False
1  False  False  False
2  False   True  False
3   True  False  False
```

In this example, we create a DataFrame `df` with three columns: 'Name', 'Age', and 'City'. The `isnull()` method identifies the missing values (None) in the DataFrame and returns a boolean mask, where `True` corresponds to a null value.

### Use Cases

The `isnull()` function is commonly used in data preprocessing and data quality checking tasks. Some use cases include:

1. Identifying missing data: It helps to locate missing values in datasets and allows further analysis or data imputation as needed.
2. Filtering and cleaning data: By using the boolean mask returned by `isnull()`, you can filter out or remove rows or columns with missing data.
3. Statistical analysis: It can be used to assess the impact of missing values on statistical calculations and ensure data integrity.
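The first two use cases can be sketched as follows, assuming a small frame like the one in the example above. The names `people`, `missing_per_column`, and `rows_with_missing` are illustrative.

```python
import numpy as np
import pandas as pd

people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', None],
    'Age': [25, 30, np.nan, 40],
    'City': ['New York', 'London', 'Paris', 'Sydney'],
})

# Summing a boolean mask counts the True values, i.e. the nulls per column
missing_per_column = people.isnull().sum()

# any(axis=1) flags rows that have at least one null; use it to select them
rows_with_missing = people[people.isnull().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```

Here `Name` and `Age` each report one missing value, and the selected rows are those at index 2 and 3.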



In [7]:
np.count_nonzero(string_data.isnull())  #counts number of null entries

1

### Let us replace np.nan with a value

In [9]:
string_data[2] = 'apple'

# Index assignment on the string_data Series: it replaces the element at index 2 with 'apple'.



In [10]:
np.count_nonzero(string_data.isnull())

# Now there are no null values

0

## Filtering Out Missing Data

In [14]:
from numpy import nan as NA  # imports the nan constant from the NumPy library and aliases it as NA
# with this we don't have to repeatedly write np.nan

data = pd.Series([1, NA, 3.5, NA, 7])
data

0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64

This code uses `NA` instead of `np.nan` to represent missing values, which can make the code slightly more concise and easier to read.

Use Case:
A use case for this code could be handling missing data in a dataset. When working with real-world data, it's common to encounter missing values, and using `NA` (or `np.nan`) allows you to represent and process missing data effectively. Pandas provides various methods to handle missing data, such as filling missing values, dropping rows or columns with missing values, or interpolating missing values based on existing data. For example:

```python
# Import Pandas and NumPy libraries
import pandas as pd
from numpy import nan as NA

# Sample data with missing values
data = pd.Series([1, NA, 3.5, NA, 7])

# Filling missing values with the mean of the non-missing values
filled_data = data.fillna(data.mean())

print(filled_data)
```

This code will fill the missing values (NA) in the Series `data` with the mean of the non-missing values, providing a cleaned dataset ready for analysis.

###  The dropna() method removes the rows that contain missing values. Use it cautiously.

##### The dropna() method returns a new object unless the inplace parameter is set to True, in which case it removes the rows from the original object.

##### See https://sparkbyexamples.com/python/pandas-dropna-usage-examples/

In [16]:
data

0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64

In [17]:
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

In [18]:
#Another approach 
data[data.notnull()] # include values that are not null

0    1.0
2    3.5
4    7.0
dtype: float64

### How It Works:

1. `data.notnull()`: 
   This pandas method returns a boolean Series of the same shape as `data`. It holds `True` where the corresponding value is not null and `False` where the value is null.

2. `data[data.notnull()]`:
   Here the boolean Series is used to index the original Series `data`, which keeps only the entries where the mask is `True`. For a Series, the result is the same as `data.dropna()`.

### Example and Use Cases:

With a DataFrame, indexing with a boolean DataFrame (`data[data.notnull()]`) does not drop any rows; it returns a frame of the same shape. To keep only the complete rows, reduce the mask to one boolean per row with `all(axis=1)`:

```python
import pandas as pd

# Sample DataFrame with some null values
data = pd.DataFrame({'A': [1, 2, None, 4],
                     'B': [5, None, 7, 8],
                     'C': [9, 10, 11, None]})

# Keep only the rows in which every column is non-null
filtered_data = data[data.notnull().all(axis=1)]

print(filtered_data)
```

Output:
```
     A    B    C
0  1.0  5.0  9.0
```

In this example, the original DataFrame 'data' contains null values in some cells. The code filters out rows containing null values and creates a new DataFrame 'filtered_data' with only the rows that have non-null values in every column. Rows 1, 2, and 3 each contain a null value in one column, so only row 0 remains.



In [14]:
data.dropna(inplace=True)
data

0    1.0
2    3.5
4    7.0
dtype: float64

By specifying `inplace=True`, the original object is modified in place, and no new object is created.

In our example, `data` is a Series with missing values (NaN) at index 1 and 3. When we apply `data.dropna(inplace=True)`, those entries are dropped and `data` itself is updated to hold only the non-missing values.

## Notable Features 

Handling missing data is a common data preprocessing step in data analysis and machine learning workflows. The `dropna()` method offers a simple and effective way to remove rows with missing values. 

1. **Data Cleaning**: When working with real-world datasets, it's common to encounter missing data. By using `dropna()`, we can clean the data by removing incomplete rows.

2. **Preparation for Analysis**: Many statistical analysis methods do not handle missing data well. In such cases, dropping rows with missing values is one way to make the analysis possible.

3. **Machine Learning**: Many machine learning algorithms cannot handle missing data directly, so removing rows with missing values (or imputing them) is often a crucial step before training models.

4. **Filtering Data**: When conducting specific analyses, researchers might only be interested in complete data. The `dropna()` method helps filter out incomplete cases.

5. **Trade-offs**: While dropping rows with missing values ensures complete data, it may result in losing valuable information. In some cases, other techniques like data imputation may be used to handle missing values more strategically.
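To make the trade-off in the last point concrete, here is a small sketch that compares dropping with a simple mean imputation. The `readings` frame and its column names are made up for illustration.

```python
import pandas as pd
from numpy import nan as NA

# Hypothetical sensor readings with one gap in each column
readings = pd.DataFrame({'temp': [20.5, NA, 22.0, 21.5],
                         'humidity': [0.40, 0.45, NA, 0.50]})

dropped = readings.dropna()                  # keeps only the complete rows
imputed = readings.fillna(readings.mean())   # keeps every row, fills gaps with column means

print(len(readings), len(dropped), len(imputed))  # 4 2 4
```

Dropping discards half of this tiny dataset, while imputation keeps all four rows at the cost of introducing estimated values.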



In [19]:
data

0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64

In [20]:
#New data set 
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
                     [NA, NA, NA], [NA, 6.5, 3.]])
data


Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


By default *dropna()* will drop any row that has missing values

In [21]:
cleaned = data.dropna()
cleaned

Unnamed: 0,0,1,2
0,1.0,6.5,3.0


In [22]:
data.dropna(how='all') #drop row only if ALL values in the row are missing

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0


In [24]:
data[4] = NA # add a new column labeled 4 that contains only missing values
data

Unnamed: 0,0,1,2,4
0,1.0,6.5,3.0,
1,1.0,,,
2,,,,
3,,6.5,3.0,


In [20]:
data.dropna(axis=1, how='all')  # drop columns (axis=1) in which ALL values are missing

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [25]:
# Another example 

df = pd.DataFrame(np.random.randn(7, 3))
df.iloc[:4, 1] = NA # adding 4 missing values in column 1
df.iloc[:2, 2] = NA
df

Unnamed: 0,0,1,2
0,-0.204708,,
1,-0.55573,,
2,0.092908,,0.769023
3,1.246435,,-1.296221
4,0.274992,0.228913,1.352917
5,0.886429,-2.001637,-0.371843
6,1.669025,-0.43857,-0.539741


#### Code Explanation:

1. `df = pd.DataFrame(np.random.randn(7, 3))`: 
   - The code starts by creating a DataFrame named `df` using the Pandas library.
   - The DataFrame is initialized with random numbers drawn from a standard normal distribution (mean 0, standard deviation 1).
   - The DataFrame has 7 rows and 3 columns.

2. `df.iloc[:4, 1] = NA`: 
   - This line of code sets the values in the column at index 1 (the second column) of the DataFrame to missing values (NA).
   - It uses the `.iloc` indexer, which allows you to access DataFrame elements by their integer-based positions.
   - The `[:4, 1]` part of the code selects the first four rows (index 0 to 3) in the second column (index 1) and assigns them as missing values.
   - By default, missing values in Pandas are represented by `NaN`.

3. `df.iloc[:2, 2] = NA`: 
   - Similarly to the previous line, this line sets the values in the column at index 2 (the third column) of the DataFrame to missing values.
   - It selects the first two rows (index 0 to 1) in the third column (index 2) and assigns them as missing values.






In [24]:
df.dropna() # every row that contains a NaN is removed

Unnamed: 0,0,1,2
4,1.327195,-0.919262,-1.549106
5,0.022185,0.758363,-0.660524
6,0.86258,-0.010032,0.050009


In [25]:
df.dropna(thresh=2) # keep only the rows that have at least 2 non-NaN values

Unnamed: 0,0,1,2
2,0.331286,,0.069877
3,0.246674,,1.004812
4,1.327195,-0.919262,-1.549106
5,0.022185,0.758363,-0.660524
6,0.86258,-0.010032,0.050009
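The `thresh` argument counts non-missing values, not missing ones. A minimal sketch with a hand-built frame (the name `frame` and its contents are illustrative):

```python
import pandas as pd
from numpy import nan as NA

frame = pd.DataFrame([[1.0, NA, NA],     # 1 non-NaN value
                      [2.0, 3.0, NA],    # 2 non-NaN values
                      [4.0, 5.0, 6.0],   # 3 non-NaN values
                      [NA, NA, NA]])     # 0 non-NaN values

# A row survives only if it has at least `thresh` non-NaN values
at_least_two = frame.dropna(thresh=2)
at_least_one = frame.dropna(thresh=1)

print(list(at_least_two.index))  # [1, 2]
print(list(at_least_one.index))  # [0, 1, 2]
```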



### Filling In or Imputing Missing Data
#### Sometimes we want to impute missing data, that is, replace it with substituted values. To do this we use fillna()

####  See https://sparkbyexamples.com/pandas/pandas-dataframe-fillna-fill-nan-column-values-2/


In [26]:
df

Unnamed: 0,0,1,2
0,-0.204708,,
1,-0.55573,,
2,0.092908,,0.769023
3,1.246435,,-1.296221
4,0.274992,0.228913,1.352917
5,0.886429,-2.001637,-0.371843
6,1.669025,-0.43857,-0.539741


In [27]:
df.fillna(0) # substituting values with zeros

Unnamed: 0,0,1,2
0,-1.541996,0.0,0.0
1,0.28635,0.0,0.0
2,0.331286,0.0,0.069877
3,0.246674,0.0,1.004812
4,1.327195,-0.919262,-1.549106
5,0.022185,0.758363,-0.660524
6,0.86258,-0.010032,0.050009


The `fillna()` method is used to fill missing (NaN) values in a DataFrame or Series with specified values. In this case, the code is using `fillna(0)` to replace all NaN values in the DataFrame `df` with zeros (0).

### How It Works:

1. The `fillna(0)` method is called on the DataFrame `df`.
2. The `fillna()` method processes the DataFrame and identifies any missing (NaN) values.
3. Each NaN value found is replaced with the value provided in the argument, which is `0` in this case.

### Functionality:

#### 1. Replacing NaN Values:
The primary purpose of this code is to handle missing data (NaN) in the DataFrame. Replacing NaN values with zeros (0) could be useful in various scenarios, such as when performing numerical calculations, aggregations, or visualizations that require complete data.

#### 2. Numeric Data Consideration:
Since the code replaces NaN values with zeros (0), it is suitable for numeric data types. If the DataFrame contains non-numeric data types (e.g., strings), replacing NaN with zeros might not be appropriate, and alternative strategies like filling with an empty string or a placeholder could be more appropriate.
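As a sketch of that alternative, a dictionary passed to `fillna()` can give a text column a placeholder and a numeric column a zero. The `inventory` frame, its column names, and the `'unknown'` placeholder are assumptions made for this example.

```python
import numpy as np
import pandas as pd

inventory = pd.DataFrame({'item': ['pen', None, 'ink'],
                          'quantity': [3.0, np.nan, 7.0]})

# Text column gets a placeholder string, numeric column gets 0
filled = inventory.fillna({'item': 'unknown', 'quantity': 0})

print(filled)
```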





In [28]:
df # note the original stays the same

Unnamed: 0,0,1,2
0,-1.541996,,
1,0.28635,,
2,0.331286,,0.069877
3,0.246674,,1.004812
4,1.327195,-0.919262,-1.549106
5,0.022185,0.758363,-0.660524
6,0.86258,-0.010032,0.050009


In [29]:
df.fillna({1: 0.5, 2: 0}) # specify the columns and values 

Unnamed: 0,0,1,2
0,-1.541996,0.5,0.0
1,0.28635,0.5,0.0
2,0.331286,0.5,0.069877
3,0.246674,0.5,1.004812
4,1.327195,-0.919262,-1.549106
5,0.022185,0.758363,-0.660524
6,0.86258,-0.010032,0.050009



`.fillna({1: 0.5, 2: 0})`: This is the `fillna()` method applied to the DataFrame `df`. It takes a dictionary as an argument. The keys in the dictionary are column labels, and the values are the fill values used for the missing data in each of those columns. Columns that are not listed (column 0 here) are left untouched.

### Functionality
1. **Selective Replacement**: The `fillna()` method allows you to specify which columns to process and what values to use for replacement. In the example given, it targets columns 1 and 2 and replaces missing values in those columns with 0.5 and 0, respectively.

2. **Different Replacement Values**: The code illustrates that you can use different values for different columns. This flexibility is helpful when different columns might require different imputation strategies based on their nature and context.

### Use Cases

1. **Data Cleaning**: When working with real-world datasets, it is common to encounter missing data. The `fillna()` method allows you to handle such cases and ensure that your analysis or machine learning algorithms can work with complete data.

2. **Imputation Strategies**: Different columns might require different imputation strategies. Some columns may be better suited to be filled with the mean, median, or mode, while others might benefit from specific domain-specific values.

3. **Data Preprocessing Pipelines**: The `fillna()` method can be incorporated into data preprocessing pipelines to ensure that missing values are handled consistently and effectively before further analysis or modeling.
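The per-column imputation idea can be sketched by computing a fill value for each column first and then passing them as a dictionary. The `scores` frame and the choice of mean for one column and median for the other are illustrative assumptions, not a recommendation.

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({'math': [90.0, np.nan, 70.0, 80.0],
                       'art': [np.nan, 60.0, 65.0, 100.0]})

# Choose a statistic per column; both ignore NaN when computed
fill_values = {'math': scores['math'].mean(),    # (90 + 70 + 80) / 3 = 80.0
               'art': scores['art'].median()}    # median of 60, 65, 100 = 65.0

imputed = scores.fillna(fill_values)
print(imputed)
```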



In [30]:
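# With inplace=True, fillna modifies df directly instead of returning a filled copy.
# Assigning the return value to _ simply discards it. In pandas 1.x/2.x that value is None,
# so this is NOT equivalent to df = df.fillna(0, inplace=True), which would set df to None.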

_ = df.fillna(0, inplace=True)
df

Unnamed: 0,0,1,2
0,-1.541996,0.0,0.0
1,0.28635,0.0,0.0
2,0.331286,0.0,0.069877
3,0.246674,0.0,1.004812
4,1.327195,-0.919262,-1.549106
5,0.022185,0.758363,-0.660524
6,0.86258,-0.010032,0.050009


In Python, using an underscore as a variable name is often used as a convention to indicate that the variable will not be used further. It acts as a placeholder for a value that the programmer doesn't need or want to name explicitly. In this case, the variable `_` is used to capture the output of the `fillna` method, but it is not intended to be used later in the code.
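The difference between the two calling styles can be sketched on a small Series. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# Without inplace: a new Series is returned and s keeps its NaN
filled_copy = s.fillna(0)
print(s.isnull().sum())            # 1

# With inplace: s itself is modified; the return value is discarded via _
_ = s.fillna(0, inplace=True)
print(s.isnull().sum())            # 0
```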




In [31]:
df

Unnamed: 0,0,1,2
0,-1.541996,0.0,0.0
1,0.28635,0.0,0.0
2,0.331286,0.0,0.069877
3,0.246674,0.0,1.004812
4,1.327195,-0.919262,-1.549106
5,0.022185,0.758363,-0.660524
6,0.86258,-0.010032,0.050009


In [34]:
# Let us create a new dataframe

df = pd.DataFrame(np.random.randn(6, 3)) # six rows and three columns
df.iloc[2:, 1] = NA  # set column 1 to NaN from row 2 onward
df.iloc[4:, 2] = NA  # set column 2 to NaN from row 4 onward
df

Unnamed: 0,0,1,2
0,0.152677,-1.565657,-0.56254
1,-0.032664,-0.929006,-0.482573
2,-0.036264,,0.980928
3,-0.589488,,-0.528735
4,0.457002,,
5,-1.022487,,


#### Forward fill is used to fill the missing values in the DataFrame. ‘ffill’ stands for ‘forward fill’ and propagates the last valid observation forward.


In [35]:
df.fillna(method='ffill')  # newer pandas versions prefer the equivalent df.ffill()

Unnamed: 0,0,1,2
0,0.152677,-1.565657,-0.56254
1,-0.032664,-0.929006,-0.482573
2,-0.036264,-0.929006,0.980928
3,-0.589488,-0.929006,-0.528735
4,0.457002,-0.929006,-0.528735
5,-1.022487,-0.929006,-0.528735


When the `'ffill'` method is used, Pandas replaces missing values with the last non-null value in each column. If there are consecutive missing values, they will be filled with the closest preceding non-null value.

### Use Cases:
1. **Time Series Data**: The forward fill method is commonly used with time series data, where missing values are filled with the last available observation in the time series.

2. **Data Preprocessing**: Before performing calculations or analysis on a DataFrame, it's essential to handle missing values. The forward fill method can be useful in certain situations where it makes sense to propagate the last known value.

3. **Data Imputation**: Data imputation refers to filling missing values with estimated values. Forward fill can be one of the techniques used for this purpose.

## Another Example:
Suppose we have the following DataFrame `df`:

|   A   |   B   |   C   |
|-------|-------|-------|
|  10   |  30   |  50   |
|  NaN  |  40   |  NaN  |
|  15   |  NaN  |  70   |
|  NaN  |  20   |  NaN  |

Calling `df.fillna(method='ffill')` will result in the following DataFrame:

|   A   |   B   |   C   |
|-------|-------|-------|
|  10   |  30   |  50   |
|  10   |  40   |  50   |
|  15   |  40   |  70   |
|  15   |  20   |  70   |

## Note:
It's essential to understand that the forward fill method might not always be appropriate for all datasets. In some cases, other imputation methods like backward fill ('bfill') or using statistical measures might be more suitable. Additionally, forward filling might not be suitable for certain data patterns or types of analysis. It's always essential to carefully consider the characteristics of the data and the specific use case before applying any data imputation technique.
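A brief sketch contrasting forward and backward fill on the same data, using the `ffill()` and `bfill()` methods. The `prices` Series is made up for illustration.

```python
import numpy as np
import pandas as pd

prices = pd.Series([np.nan, 10.0, np.nan, 12.0, np.nan])

forward = prices.ffill()    # NaN, 10, 10, 12, 12  (leading NaN has nothing before it)
backward = prices.bfill()   # 10, 10, 12, 12, NaN  (trailing NaN has nothing after it)

print(forward.tolist())
print(backward.tolist())
```

Neither direction can fill a gap at the end it starts from, which is one reason to check for remaining NaN values after filling.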

In [36]:
df.fillna(method='ffill', limit=2) # forward fill at most 2 consecutive missing values in each column

Unnamed: 0,0,1,2
0,0.152677,-1.565657,-0.56254
1,-0.032664,-0.929006,-0.482573
2,-0.036264,-0.929006,0.980928
3,-0.589488,-0.929006,-0.528735
4,0.457002,,-0.528735
5,-1.022487,,-0.528735
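The `limit` applies to each run of consecutive missing values, as this small sketch shows. The `gaps` Series is illustrative, and `ffill(limit=...)` is used here as the method form of the same operation.

```python
import numpy as np
import pandas as pd

gaps = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0, np.nan])

# Only the first 2 NaN values of each run are filled
limited = gaps.ffill(limit=2)

print(limited.isnull().tolist())  # [False, False, False, True, False, False]
```

The run of three NaN values is only partially filled, while the single NaN after 5.0 is filled completely.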


In [37]:
# Another dataset

data = pd.Series([1., NA, 3.5, NA, 7])
data

0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64

In [38]:
data.fillna(data.mean()) # fill with the mean value

0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64

- `data.mean()`: calculates the mean of the values in the `data` Series, ignoring NaN values in the calculation. Here that is (1 + 3.5 + 7) / 3, approximately 3.8333.

Functionality
The primary functionality of this code is to handle missing values in the `data` Series by filling them with the mean of the non-missing values. This helps to avoid potential issues that may arise when analyzing or modeling data with missing values, as many algorithms are not able to handle missing data.



## Data Transformation

### Removing Duplicates

In [27]:
# New Dataset to add duplicate values

data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})
data

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4
6,two,4


### Creating the DataFrame

The DataFrame is created using the `pd.DataFrame` function, which takes a dictionary as an argument. The dictionary contains column names as keys and corresponding lists as values. Each key-value pair represents a column in the DataFrame, where the key is the column name and the value is a list of data points for that column.

The DataFrame `data` consists of two columns: 'k1' and 'k2'. Let's break down each column:

- 'k1': It contains data points from the list ['one', 'two'] repeated 3 times, and then followed by a single occurrence of 'two'. So, the column looks like: ['one', 'two', 'one', 'two', 'one', 'two', 'two'].

- 'k2': It contains data points from the list [1, 1, 2, 3, 3, 4, 4]. So, the column looks like: [1, 1, 2, 3, 3, 4, 4].



#### The pandas duplicated() method helps in analyzing duplicate values. It returns a boolean Series that is True for each row that duplicates an earlier row and False otherwise.

In [40]:
data.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool



The `duplicated()` method returns a boolean Series indicating whether each row in the original DataFrame (or each element of a Series) is a duplicate (True) or not (False).

**Functionality:**
   - When `data.duplicated()` is executed, it examines the rows of the DataFrame or Series and checks for duplicates.
   - By default (`keep='first'`), the method returns `True` for a row if an identical row appears earlier in the data, and `False` for unique rows and for the first occurrence of a repeated row.

**Use Cases:**
   - Identifying Duplicate Rows: The primary use case of `data.duplicated()` is to find and mark duplicate rows in a dataset. This can be useful for data cleaning, validation, or identifying potential data entry errors.
   - Filtering Duplicate Rows: By using the `data.duplicated()` method in combination with DataFrame/Series indexing, you can filter and extract duplicate rows from the original dataset.
   - Dropping Duplicate Rows: To remove duplicate rows from the DataFrame, use the `drop_duplicates()` method, which applies the same duplicate detection as `duplicated()`.
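A short sketch of these use cases, showing how the `keep` argument changes which rows are flagged. The `pairs` frame is a made-up example.

```python
import pandas as pd

pairs = pd.DataFrame({'k1': ['one', 'two', 'one', 'two'],
                      'k2': [1, 1, 1, 2]})

later_copies = pairs.duplicated()            # flags repeats only: row 2
all_copies = pairs.duplicated(keep=False)    # flags every member of a repeated group: rows 0 and 2

duplicate_rows = pairs[all_copies]           # extract the duplicated rows
deduplicated = pairs.drop_duplicates()       # keep the first occurrence of each row

print(later_copies.tolist())   # [False, False, True, False]
print(all_copies.tolist())     # [True, False, True, False]
```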





#### More info on detecting duplicates: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.duplicated.html

In [41]:
data.iloc[3,:]  #content of row with index 3

k1    two
k2      3
Name: 3, dtype: object

In [42]:
data.iloc[3,:]=['one', 1] # let us change the value
data

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,one,1
4,one,3
5,two,4
6,two,4


#### For each set of duplicated values, the first occurrence is marked False and all others True.

In [43]:
data.duplicated() 

0    False
1    False
2    False
3     True
4    False
5    False
6     True
dtype: bool

In [44]:
data.drop_duplicates() #drops rows where ALL values are duplicates of the values in another row

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
4,one,3
5,two,4


In [45]:
#Adding a column to the dataset  - let's call it v1
data['v1'] = range(7)
data

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1
2,one,2,2
3,one,1,3
4,one,3,4
5,two,4,5
6,two,4,6


In [46]:
data.drop_duplicates(['k1'])

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1


In [47]:
data # note the original dataset 

Unnamed: 0,k1,k2,v1
0,one,1,0
1,two,1,1
2,one,2,2
3,one,1,3
4,one,3,4
5,two,4,5
6,two,4,6


In [48]:
data.drop_duplicates(['k1', 'k2'], keep='last') # Drop duplicates except for the last occurrence.

Unnamed: 0,k1,k2,v1
1,two,1,1
2,one,2,2
3,one,1,3
4,one,3,4
6,two,4,6


### Transforming Data Using a Function or Mapping

In [29]:
data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon',
                              'Pastrami', 'corned beef', 'Bacon',
                              'pastrami', 'honey ham', 'nova lox'],
                     'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data

Unnamed: 0,food,ounces
0,bacon,4.0
1,pulled pork,3.0
2,bacon,12.0
3,Pastrami,6.0
4,corned beef,7.5
5,Bacon,8.0
6,pastrami,3.0
7,honey ham,5.0
8,nova lox,6.0


### A dictionary to map meat source

In [30]:
# Dictionary
meat_to_animal = {
  'bacon': 'pig',
  'pulled pork': 'pig',
  'pastrami': 'cow',
  'corned beef': 'cow',
  'honey ham': 'pig',
  'nova lox': 'salmon'
}
meat_to_animal

{'bacon': 'pig',
 'pulled pork': 'pig',
 'pastrami': 'cow',
 'corned beef': 'cow',
 'honey ham': 'pig',
 'nova lox': 'salmon'}

In [31]:
lowercased = data['food'].str.lower()
lowercased

0          bacon
1    pulled pork
2          bacon
3       pastrami
4    corned beef
5          bacon
6       pastrami
7      honey ham
8       nova lox
Name: food, dtype: object

In [32]:
data['animal'] = lowercased.map(meat_to_animal) # add column
data

Unnamed: 0,food,ounces,animal
0,bacon,4.0,pig
1,pulled pork,3.0,pig
2,bacon,12.0,pig
3,Pastrami,6.0,cow
4,corned beef,7.5,cow
5,Bacon,8.0,pig
6,pastrami,3.0,cow
7,honey ham,5.0,pig
8,nova lox,6.0,salmon


#### Explanation of the code


1. `data['animal']`: This indicates that we are adding a new column to the Pandas DataFrame named `data`. 

2. `lowercased.map(meat_to_animal)`: This code uses the `map()` method to convert each value of `lowercased` to its corresponding value in the dictionary `meat_to_animal`.

3. `meat_to_animal`: A pre-defined mapping dictionary that maps meat names to their corresponding animal names.
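The mapping step can be sketched on a smaller Series. It also shows two details worth knowing: a value with no entry in the dictionary maps to a missing value, and `map()` accepts a function as well as a dictionary. The value `'tofu'` and the shortened dictionary are assumptions made for this example.

```python
import pandas as pd

foods = pd.Series(['bacon', 'Pastrami', 'tofu'])
meat_to_animal = {'bacon': 'pig', 'pastrami': 'cow'}   # shortened for the example

# Dictionary mapping: lowercase first, then look each value up
by_dict = foods.str.lower().map(meat_to_animal)

# Function mapping: the same lookup written as a function of one value
by_func = foods.map(lambda name: meat_to_animal.get(name.lower()))

print(by_dict.tolist()[:2])        # ['pig', 'cow']
print(by_dict.isnull().tolist())   # [False, False, True]
```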


### Use Case:
This code is useful when a DataFrame has a column of names (here, foods) and you want to add a column holding a corresponding value for each one (here, the animal each food comes from). This can be helpful when analyzing consumption data or other patterns related to the origin of different foods.



### Replacing Values

#### See https://datatofish.com/replace-values-pandas-dataframe/

In [34]:
# New Data Series

data = pd.Series([1., -999., 2., -999., -1000., 3.])
data

0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64

In [54]:
data.replace(-999, np.nan)

0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64



### Code Explanation:

The code uses the `replace` method to replace occurrences of the value `-999` with `np.nan` (which stands for "Not a Number") in the Series named `data`.


### Code Functionality:

The purpose of this code is to handle missing or invalid data represented by the value `-999` in the `data` variable. It replaces all occurrences of `-999` with `np.nan`, effectively converting the invalid data points to a standard representation of missing data.


### Use Cases:

The code is particularly useful in data analysis and data preprocessing tasks. Some potential use cases include:

1. Data Cleaning: This code can be used to replace specific sentinel values (like `-999`) with `np.nan`, making it easier to perform data cleaning operations.

2. Statistical Analysis: When performing statistical calculations, `np.nan` values are typically treated as missing data and are automatically excluded from most calculations, such as mean, median, and standard deviation.

3. Data Visualization: Data visualization libraries like Matplotlib and Seaborn can handle `np.nan` values gracefully, ensuring that missing data points are not plotted, and the visual representation is not affected by invalid values.

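4. Machine Learning: Some machine learning algorithms, such as certain gradient-boosted tree implementations, accept `np.nan` values directly, while many others require missing values to be imputed or removed first. In both cases, converting sentinels to `np.nan` makes the missing data explicit.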
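To illustrate the statistical point, here is a small sketch of how a sentinel distorts a mean and how replacing it with `np.nan` fixes that. The `readings` Series is a made-up example.

```python
import numpy as np
import pandas as pd

readings = pd.Series([1., -999., 2., -999., 3.])

raw_mean = readings.mean()                           # sentinels counted as real values
clean_mean = readings.replace(-999, np.nan).mean()   # NaN is skipped: (1 + 2 + 3) / 3

print(raw_mean)     # about -398.4
print(clean_mean)   # 2.0
```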



In [55]:
data.replace([-999, -1000], np.nan) 

0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64


## How it Works

In this specific code snippet:
```python
data.replace([-999, -1000], np.nan)
```

- The `data` variable contains some data that may have the values `-999` or `-1000`.
- The `replace` method is called on `data` to replace the values `-999` and `-1000` with `np.nan`, which is a special constant representing "Not a Number" or missing data in the NumPy library.
- The method returns a modified version of `data` with the specified replacements, but since `inplace` is not set to `True`, the original `data` is not modified.



In [56]:
data.replace([-999, -1000], [np.nan, 0])

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

**How it Works:**
When `data.replace` is called with the provided arguments, it searches for occurrences of the values [-999, -1000] within the `data` object and replaces them with the corresponding values from the replacement list [np.nan, 0].

- Any occurrence of -999 in `data` will be replaced with `np.nan`. 
- Any occurrence of -1000 in `data` will be replaced with 0. 



In [57]:
data.replace({-999: np.nan, -1000: 0})

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

- Any occurrence of -999 in the 'data' variable will be replaced by `np.nan`, which is commonly used to represent missing or undefined values in pandas DataFrames or Series.

- Any occurrence of -1000 in the 'data' variable will be replaced by 0.


### Renaming Axis Indexes

#### NumPy arange() is one of the array creation routines based on numerical ranges. It creates an instance of ndarray with evenly spaced values and returns the reference to it. See https://realpython.com/how-to-use-numpy-arange/

#### The reshape() function allows us to reshape an array in Python. Reshaping basically means, changing the shape of an array.

In [35]:
data = pd.DataFrame(np.arange(12).reshape((3, 4)))
data

Unnamed: 0,0,1,2,3
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11



Create the DataFrame:

- The NumPy function `np.arange(12)` generates an array of the integers from 0 to 11.
- The array is then reshaped into a 3x4 matrix by calling its `reshape((3, 4))` method.
- The resulting 3x4 matrix is passed to the pandas DataFrame constructor: `data = pd.DataFrame(...)`



In [37]:
# Specifying the index name and the columns

data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=['Ohio', 'colorado', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
colorado,4,5,6,7
New York,8,9,10,11


### Modifying Dataframe

In [38]:
data.rename(index=str.title, columns=str.upper) # changing case

Unnamed: 0,ONE,TWO,THREE,FOUR
Ohio,0,1,2,3
Colorado,4,5,6,7
New York,8,9,10,11


In [39]:
data.rename(index={'Ohio': 'INDIANA'},
            columns={'three': 'peekaboo'})

# Rename a row label and a column label. In this case Ohio is renamed INDIANA and three is now peekaboo

Unnamed: 0,one,two,peekaboo,four
INDIANA,0,1,2,3
colorado,4,5,6,7
New York,8,9,10,11


In [62]:
data # original

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
colorado,4,5,6,7
New York,8,9,10,11


In [63]:
data.rename(index={'Ohio': 'INDIANA'}, inplace=True) # make changes to the original
data

Unnamed: 0,one,two,three,four
INDIANA,0,1,2,3
colorado,4,5,6,7
New York,8,9,10,11


In [64]:
data # the original has now been modified

Unnamed: 0,one,two,three,four
INDIANA,0,1,2,3
colorado,4,5,6,7
New York,8,9,10,11


### Discretization and Binning

In [40]:
# Here are ages of individuals in a room 

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
ages

[20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

#### Pandas cut() function is used to separate the array elements into different bins

In [66]:
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)
cats   #This is now a "Categorical" object

[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64, right]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]


### Code Explanation:

1. `bins = [18, 25, 35, 60, 100]`: This line defines a list named `bins` containing five integer edges that delimit four age intervals. With the default `right=True`, these intervals are (18, 25], (25, 35], (35, 60], (60, 100], where each interval is open on the left (exclusive) and closed on the right (inclusive). This means an age of 25 is included in the first interval, but an age of 18 is not. The last interval covers ages above 60 up to and including 100; an age outside the bin edges (18 or younger, or over 100) is assigned NaN.

2. `cats = pd.cut(ages, bins)`: This line uses the `pd.cut()` function to categorize  ages into the bins defined in the `bins` list. The result of this line is a new pandas Categorical object, which represents the ages grouped into the specified bins. Each age in the `ages` array will be assigned to one of the bins based on its value.

   For example, if `ages = [20, 30, 40, 50, 70]`, the `cats` Categorical object might look like:
   ```
   [(18, 25], (25, 35], (35, 60], (35, 60], (60, 100]]
   Categories (4, interval[int64, right]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
   ```
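
To check which side of each interval is closed, the sketch below bins a few boundary ages. The list `edge_ages` is made up for illustration; `include_lowest` is an existing `pd.cut()` parameter.

```python
import pandas as pd

bins = [18, 25, 35, 60, 100]
edge_ages = [18, 25, 26, 100]  # hypothetical values sitting on or near bin edges

# Default right=True: each interval excludes its left edge and includes its right edge
edge_cats = pd.cut(edge_ages, bins)
print(edge_cats)
print(list(edge_cats.codes))  # a code of -1 marks a value that fell into no bin

# include_lowest=True additionally lets the very first edge (18) fall into the first bin
with_lowest = pd.cut(edge_ages, bins, include_lowest=True)
print(list(with_lowest.codes))
```

With the default settings, 18 lands in no bin while 25 lands in the first one, which matches the `(18, 25]` notation in the output.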


### Notable Features and Use Cases:

`pd.cut()` is especially useful in data preprocessing and analysis, where continuous data needs to be grouped into distinct categories for further analysis, visualization, or machine learning tasks. Its notable features and use cases include:

1. **Data Binning**: `pd.cut()` allows you to group continuous data into discrete bins or intervals. This is helpful when you want to analyze data based on categories rather than individual data points.

2. **Custom Binning**: The `bins` list provides flexibility in defining custom bin intervals, allowing you to create categories based on domain-specific requirements.

3. **Categorical Object**: The result of `pd.cut()` is a Categorical object, which is a pandas data structure specifically designed to handle categorical data efficiently. It not only stores the category labels but also maintains an underlying numerical representation for efficient computations.

4. **Labeling Data**: The categories generated by `pd.cut()` can be used for labeling data points, enabling easy grouping and summarization of data based on these labels.

5. **Visualization**: Categorical data is often used for creating visualizations such as bar charts or histograms, where data is represented in discrete categories rather than individual data points.

6. **Statistical Analysis**: Categorical data allows for various statistical analyses, such as calculating counts, frequencies, and other aggregate measures for each category.

7. **Machine Learning**: Binning data into categories can be useful for certain machine learning algorithms that work better with categorical data or require reduced dimensionality.



In [77]:
cats.categories # the interval covered by each bin

IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]], dtype='interval[int64, right]')

In [68]:
cats.codes # the index of the bin each age belongs to

array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)

In [78]:
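pd.value_counts(cats) # number of ages per bin; recent pandas versions deprecate this in favor of pd.Series(cats).value_counts()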

(18, 25]     5
(25, 35]     3
(35, 60]     3
(60, 100]    1
dtype: int64

In [71]:
pd.cut(ages, [18, 26, 36, 61, 100], right=False) # right=False closes each interval on the left and opens it on the right

[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 12
Categories (4, interval[int64, left]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]

In [72]:
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
pd.cut(ages, bins, labels=group_names)

['Youth', 'Youth', 'Youth', 'YoungAdult', 'Youth', ..., 'YoungAdult', 'Senior', 'MiddleAged', 'MiddleAged', 'YoungAdult']
Length: 12
Categories (4, object): ['Youth' < 'YoungAdult' < 'MiddleAged' < 'Senior']
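
As a small sketch of how the labeled bins can be summarized, the block below wraps the result in a Series and counts the ages per label. The variable names `labeled` and `counts` are assumptions introduced for this example.

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']

# Wrap the Categorical in a Series so the usual Series methods are available
labeled = pd.Series(pd.cut(ages, bins, labels=group_names))

# Count how many ages fall under each label, ordered by category
counts = labeled.value_counts().sort_index()
print(counts)
```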

#### Passing a number to the cut function creates that number of equal-sized bins based on the maximum and minimum values of the data

In [79]:
data = np.random.rand(20)
pd.cut(data, 4, precision=2)  

[(0.75, 1.0], (0.75, 1.0], (0.25, 0.5], (0.75, 1.0], (0.75, 1.0], ..., (0.25, 0.5], (0.75, 1.0], (-0.00089, 0.25], (-0.00089, 0.25], (0.75, 1.0]]
Length: 20
Categories (4, interval[float64, right]): [(-0.00089, 0.25] < (0.25, 0.5] < (0.5, 0.75] < (0.75, 1.0]]
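
A hedged sketch of how the integer form picks its edges. `retbins=True` asks `pd.cut()` to also return the edges it computed. The array `values` and the seed are assumptions. The small downward nudge of the first edge is behavior observed in pandas rather than something this notebook states, though it is consistent with the `-0.00089` edge in the output above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # hypothetical seed, only for repeatability
values = rng.random(20)

# Ask for four bins and also get back the computed edges
cats, edges = pd.cut(values, 4, retbins=True)

print(edges)           # five edges delimit four bins
print(np.diff(edges))  # bin widths: equal, apart from the slightly widened first bin
print(values.min(), values.max())
```

The first edge sits at or just below the minimum so that the smallest value is not excluded by the open left side of the first interval.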

### Detecting and Filtering Outliers

In [41]:
data = pd.DataFrame(np.random.randn(1000, 4))  #values follow a Normal distribution
data.describe()

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,0.066679,0.020188,-0.000651,-0.067915
std,0.992674,1.004784,0.995859,0.995834
min,-3.548824,-3.184377,-3.745356,-3.428254
25%,-0.596286,-0.648915,-0.642609,-0.77489
50%,0.094503,-0.001593,-0.012007,-0.117489
75%,0.780282,0.674685,0.654328,0.616366
max,2.653656,3.260383,3.927528,3.366626


In [42]:
data.head()

Unnamed: 0,0,1,2,3
0,0.476985,3.248944,-1.021228,-0.577087
1,0.124121,0.302614,0.523772,0.00094
2,1.34381,-0.713544,-0.831154,-2.370232
3,-1.860761,-0.860757,0.560145,-1.265934
4,0.119827,-1.063512,0.332883,-2.359419


In [82]:
col = data[1]
col[np.abs(col) > 3]

487   -3.428254
864    3.366626
Name: 1, dtype: float64

### Explanation:

1. `col = data[1]`: assigns the value of `data[1]` to the variable `col`. Because `data` is a DataFrame, `data[1]` selects the column labeled `1` (the second column) and returns it as a Series.

2. `col[np.abs(col) > 3]`: uses boolean indexing along with NumPy's absolute value function to filter elements in `col`. The condition `np.abs(col) > 3` checks whether the absolute value of each element in `col` is greater than 3. The result of this condition is a boolean Series with the same index as `col`, where `True` indicates that the corresponding element in `col` satisfies the condition, and `False` otherwise.

The code then uses this boolean Series to filter the elements of `col`. It returns a new Series containing only those elements for which the corresponding boolean value is `True`, keeping their original index labels. In other words, it selects only the elements in `col` that have an absolute value greater than 3.

The code can be useful in scenarios where you need to filter elements from a dataset based on a specific condition. For example, if `data` represents measurements or observations, using `col[np.abs(col) > 3]` can help identify outliers or extreme values that deviate significantly from the rest of the data.
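
The two steps above can be sketched on a tiny Series so the mask is easy to read. The values here are made up for illustration and are not taken from `data`.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements; two of them lie more than 3 away from zero
col = pd.Series([0.5, -3.2, 1.1, 3.7, -0.4])

mask = np.abs(col) > 3  # boolean Series, True where |value| > 3
outliers = col[mask]    # keeps only the rows where the mask is True

print(mask.tolist())
print(outliers)
```

The filtered result keeps the original index labels, which makes it easy to locate the flagged rows in the source data.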



In [45]:
data[(np.abs(data) > 3).any(axis=1)] # select all rows that contain a value above 3 or below -3



Unnamed: 0,0,1,2,3
0,0.476985,3.248944,-1.021228,-0.577087
92,0.552936,0.106061,3.927528,-0.255126
97,-0.56523,3.176873,0.959533,-0.97534
300,0.457246,-0.025907,-3.399312,-0.974657
319,1.951312,3.260383,0.963301,1.201206
395,0.508391,-0.196713,-3.745356,-1.520113
494,-0.242459,-3.05699,1.918403,-0.578828
517,0.682841,0.326045,0.425384,-3.428254
581,1.179227,-3.184377,1.369891,-1.074833
803,-3.548824,1.553205,-2.186301,1.277104


The `.any()` call with axis 1 checks, for each row, whether any element across the columns is True. In this context, it checks each row to see if there is at least one value whose magnitude is greater than 3. The result is a one-dimensional boolean Series of length n_rows, where n_rows is the number of rows in the original `data` DataFrame. Recent pandas versions require the axis to be passed by keyword, as in `.any(axis=1)`.

More information on *.any()* https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.any.html?highlight=any#pandas.DataFrame.any
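
A minimal sketch of the row-wise check on a three-row frame, assuming made-up values and the hypothetical names `small`, `exceeds`, and `row_mask`:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: rows 0 and 1 each contain one value beyond +/-3
small = pd.DataFrame({'a': [0.1, 3.5, -0.2],
                      'b': [-4.0, 0.3, 0.9],
                      'c': [1.0, 2.0, 2.9]})

exceeds = np.abs(small) > 3      # boolean DataFrame, same shape as `small`
row_mask = exceeds.any(axis=1)   # one boolean per row: True if any column is True
flagged = small[row_mask]        # rows with at least one extreme value

print(row_mask.tolist())
print(flagged)
```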

### Computing Indicator/Dummy Variables

#### When you have categorical variables, dummy variables are useful because they allow us to include those variables in our analysis.

#### `pd.get_dummies()` converts categorical data into dummy (indicator) variables.

In [46]:
df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'data1': range(6)})
df

Unnamed: 0,key,data1
0,b,0
1,b,1
2,a,2
3,c,3
4,a,4
5,b,5


In [47]:
pd.get_dummies(df['key'])

Unnamed: 0,a,b,c
0,0,1,0
1,0,1,0
2,1,0,0
3,0,0,1
4,1,0,0
5,0,1,0


 `pd.get_dummies()` is a common technique used in data preprocessing to convert categorical data into a format that can be used for machine learning models. It's particularly useful when dealing with categorical variables that don't have a natural ordinal relationship, as it avoids introducing any numerical order that could impact the model's performance.

### Input:
The code takes a DataFrame `df` as input, and it is assumed that the DataFrame contains a column labeled 'key' that contains categorical data.

### Output:
The output of the code is a DataFrame that contains the one-hot encoded representation of the 'key' column.


**One-Hot Encoding Process:**
   - The `pd.get_dummies()` function will analyze the unique values in the 'key' column and create new binary columns for each unique value.
   - Each binary column represents a unique category from the 'key' column.
   - If a row in the original 'key' column contains a particular category, the corresponding binary column will have a '1' in that row, and '0' otherwise.

### Example:

Suppose we have the following DataFrame `df`:

|    | key   |
|----|-------|
| 0  | A     |
| 1  | B     |
| 2  | C     |
| 3  | A     |
| 4  | B     |

The code `pd.get_dummies(df['key'])` will transform the 'key' column into one-hot encoded format:

|    | A   | B   | C   |
|----|-----|-----|-----|
| 0  | 1   | 0   | 0   |
| 1  | 0   | 1   | 0   |
| 2  | 0   | 0   | 1   |
| 3  | 1   | 0   | 0   |
| 4  | 0   | 1   | 0   |

### Use Cases:

One-hot encoding is commonly used in machine learning workflows when dealing with categorical features:
- Training machine learning models that require numeric inputs, such as logistic regression, support vector machines, and neural networks.
- Avoiding the introduction of unintended ordinal relationships between categories, which could lead to incorrect model predictions.
- Handling nominal data in a way that makes it more suitable for distance-based algorithms, like k-nearest neighbors (KNN).

It's important to note that one-hot encoding can lead to a significant increase in the number of columns, especially if the original categorical column has many unique categories. Therefore, it's crucial to consider the dimensionality and potential multicollinearity issues when using one-hot encoding in machine learning models.

In [76]:
dummies = pd.get_dummies(df['key'], prefix='key') # adds specified prefix 
dummies

Unnamed: 0,key_a,key_b,key_c
0,0,1,0
1,0,1,0
2,1,0,0
3,0,0,1
4,1,0,0
5,0,1,0
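
A short sketch of two follow-up steps that often accompany `get_dummies()`: joining the indicator columns back onto the other columns, and dropping one indicator to limit the multicollinearity mentioned earlier. `drop_first` and `dtype` are existing `pd.get_dummies()` parameters. `dtype=int` is passed here only so the columns show 0/1 as in the tables above, since newer pandas versions may display booleans by default. The names `df_with_dummy` and `reduced` are assumptions.

```python
import pandas as pd

df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'data1': range(6)})

# Indicator columns with a prefix, shown as 0/1 integers
dummies = pd.get_dummies(df['key'], prefix='key', dtype=int)

# Attach the indicators to the remaining column(s) of the original frame
df_with_dummy = df[['data1']].join(dummies)
print(df_with_dummy)

# drop_first=True omits the first category; it is implied when all others are 0
reduced = pd.get_dummies(df['key'], prefix='key', drop_first=True, dtype=int)
print(reduced)
```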


### Regular Expressions: Extracting Information From Texts

Regular expressions - quick tutorial

https://medium.com/factory-mind/regex-tutorial-a-simple-cheatsheet-by-examples-649dc1c3f285

The *re* library and Regular expression syntax 

https://docs.python.org/3/library/re.html

Some more advanced use and syntax
https://www.tutorialspoint.com/python/python_reg_expressions.htm

In [48]:
import re

text = "foo    bar\t car  \tmap"


In [49]:
text

'foo    bar\t car  \tmap'

In [50]:
re.split(r'\s+', text)  # split a string on runs of one or more whitespace characters (raw string keeps the backslash literal)

['foo', 'bar', 'car', 'map']

Here the `re` (regular expression) module splits a given string (`text`) based on a regular expression pattern. The goal is to split the string wherever there are one or more whitespace characters, including spaces, tabs, or newline characters.


### How it Works

The `re.split()` function takes two arguments: the regular expression pattern to split on, and the input string to be split. In this case, the pattern is `'\s+'`, which is a regular expression that matches one or more whitespace characters.

When the `re.split()` function encounters one or more consecutive whitespace characters in the input string, it will split the string at those positions and return a list of substrings obtained from the split.

#### Use Case:

This code can be useful in text processing tasks where you want to tokenize a given text into individual words or chunks based on whitespaces. It can be used in natural language processing (NLP) tasks, data cleaning, or text analysis tasks where you need to process text data and convert it into meaningful units for further analysis.

In [51]:
regex = re.compile(r'\s+')  # compile the pattern once so it can be reused
regex.split(text)

['foo', 'bar', 'car', 'map']
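
A compiled pattern pays off when the same expression is applied many times. The sketch below reuses one compiled object across several strings and also lists the whitespace runs it matched. The list `lines` and the name `whitespace` are assumptions made for this example.

```python
import re

whitespace = re.compile(r'\s+')  # compiled once, reused below

# Hypothetical inputs with irregular spacing
lines = ["foo    bar\t car  \tmap", "alpha \t beta", "one two"]

# Split every line with the same compiled pattern
tokens = [whitespace.split(line) for line in lines]
print(tokens)

# findall returns the separators themselves instead of the pieces between them
gaps = whitespace.findall(lines[0])
print(gaps)
```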

Explore the following and additional examples from https://www.programiz.com/python-programming/regex

In [52]:

string = 'hello 12 hi 89. Howdy 34'
string

'hello 12 hi 89. Howdy 34'

In [53]:
# Extract numbers from a string

pattern = r'\d+'  # raw string: one or more consecutive digits

result = re.findall(pattern, string) 
print(result)

['12', '89', '34']




1. `pattern`: This line defines the regular expression pattern that will be used for the search. In regular expressions, `\d` is a special sequence that matches any digit (0-9), and the `+` indicates that the preceding element (in this case, `\d`) should appear one or more times. So, the pattern `\d+` matches each run of one or more consecutive digits in the string. Writing the pattern as a raw string (`r'\d+'`) keeps Python from treating the backslash as a string escape.

2. `result = re.findall(pattern, string)`: This line uses the `re.findall()` function from the `re` module to search for all occurrences of the pattern in the given `string`. The `re.findall()` function returns a list containing all the matches found in the string.


**Example:**

Suppose we have the following `string`:

```python
string = "The 2020 Olympics took place from 23rd July to 8th August."
```

When we run the provided code with this `string`, it will output:

```
['2020', '23', '8']
```

**Use Case:**

The provided code is useful when you need to extract numerical data from a text that follows a specific pattern. For example, it can be used to extract:

- Dates: If the dates in the text are represented consistently, such as 'DD/MM/YYYY' or 'YYYY-MM-DD', you can create an appropriate regular expression pattern to extract all the dates from the text.

- Numbers from a string: If you have a text containing numbers with a consistent format (e.g., phone numbers, IDs), you can use regular expressions to extract them.

- Parsing data: When processing data in a text file or log, you can use regular expressions to extract relevant information like timestamps, error codes, or other structured data.
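
As an illustration of the date use case, the sketch below pulls `YYYY-MM-DD` dates out of a sentence. The text in `log_text` and the choice of date layout are assumptions. The pattern checks only the shape of the date, not whether it is a valid calendar date.

```python
import re

# Hypothetical text mixing two date layouts
log_text = "Started 2023-01-15, paused 2023-02-03, resumed 15/03/2023."

# Four digits, a hyphen, two digits, a hyphen, two digits
iso_dates = re.findall(r'\d{4}-\d{2}-\d{2}', log_text)
print(iso_dates)

# With capture groups, findall returns one tuple of groups per match
parts = re.findall(r'(\d{4})-(\d{2})-(\d{2})', log_text)
print(parts)
```

The date written as `15/03/2023` is skipped because it does not follow the hyphenated layout the pattern describes.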



In [54]:
result = re.split(pattern, string) 
print(result)

['hello ', ' hi ', '. Howdy ', '']



`re.split(pattern, string)`: This function is used to split a given `string` based on the specified `pattern` using regular expressions. It returns a list of substrings.

   - `pattern`: The regular expression pattern to be used for splitting the `string`.
   - `string`: The input string that needs to be split.

 `result = re.split(pattern, string)`: The code performs the split operation using the provided `pattern` and `string`, and stores the resulting list of substrings in the variable `result`.
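
The trailing empty string in the output above appears because the input ends with a match. A small sketch of two common ways to deal with that, using the optional `maxsplit` argument of `re.split()` and a list comprehension (the names `pieces`, `non_empty`, and `limited` are assumptions):

```python
import re

string = 'hello 12 hi 89. Howdy 34'

pieces = re.split(r'\d+', string)
print(pieces)  # the last element is empty because the string ends with digits

# Drop empty pieces and trim the surrounding spaces
non_empty = [p.strip() for p in pieces if p.strip()]
print(non_empty)

# maxsplit limits how many splits are performed
limited = re.split(r'\d+', string, maxsplit=1)
print(limited)
```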

### Another Example


```python
import re

# Example Input
pattern = r'\W+'  # \W matches any non-word character (e.g., space, punctuation)
string = "Hello, World! How are you?"

# Splitting the string based on the pattern
result = re.split(pattern, string)

# Output
print(result)
```

**Output:**
```
['Hello', 'World', 'How', 'are', 'you', '']
```

In this example, we are using the regular expression `r'\W+'` as the `pattern`, which matches one or more non-word characters. The input string is `"Hello, World! How are you?"`. After applying the split operation, the result is a list of substrings: `['Hello', 'World', 'How', 'are', 'you', '']`. The split occurred at every run of spaces and punctuation (the comma, the exclamation mark, and the question mark), removing them from the resulting substrings. The final empty string appears because the input ends with a match (the question mark), leaving nothing after the last split point.



In [55]:
replace = ""
new_string = re.sub(pattern, replace, string) 
print(new_string)

hello  hi . Howdy 


1. `replace = ""`: This line initializes a variable called `replace` and assigns an empty string to it.

2. `new_string = re.sub(pattern, replace, string)`: This line uses the `re.sub()` function to replace occurrences of a specified pattern in a given string with the content of the `replace` variable. Here's an explanation of each argument:
   - `pattern`: This is the regular expression pattern that defines what should be replaced in the `string`. 
   - `replace`: This is the replacement string. Any match found by the `pattern` will be replaced with the content of this variable, which, in this case, is an empty string. By using an empty string, the matched pattern will effectively be removed from the original `string`.
   - `string`: This is the input string in which we want to perform the replacements.

3. `print(new_string)`: Finally, the code prints the resulting `new_string`, which is the original `string` after applying the replacement operation specified by the `re.sub()` function.
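
A brief sketch of a few variations on the same call, assuming the same input string. `count` is an existing optional argument of `re.sub()`; the names `masked`, `first_only`, and `tidy` are made up for this example.

```python
import re

string = 'hello 12 hi 89. Howdy 34'

# Replace every run of digits with a placeholder instead of deleting it
masked = re.sub(r'\d+', '#', string)
print(masked)

# count=1 replaces only the first match
first_only = re.sub(r'\d+', '#', string, count=1)
print(first_only)

# Removing digits leaves doubled spaces; a second pass collapses them
tidy = re.sub(r'\s+', ' ', re.sub(r'\d+', '', string)).strip()
print(tidy)
```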





In [57]:
import re

string = "Python is fun"

# check whether 'Python' appears anywhere in the string
match = re.search('Python', string)

if match:
    print("pattern found inside the string")
else:
    print("pattern not found")

pattern found inside the string


#### Pattern Matching with re.search():

`match = re.search('Python', string)`

The `re.search()` function is used to search for the pattern 'Python' in the input string `string`. In general, `re.search(pattern, string)` scans `string` for the first location where `pattern` matches and returns a match object, or `None` if there is no match. In this case, the pattern 'Python' is being searched for in the string.

In this example, the input string is "Python is fun," and the code searches for the pattern "Python" within the string. Since "Python" is present at the beginning of the string, a match is found, and the output is "pattern found inside the string."
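
Since `re.search()` looks anywhere in the string, a pattern does not have to sit at the beginning to be found. The sketch below contrasts it with `re.match()`, which only tries to match at the start. The variable names are assumptions for this example.

```python
import re

string = "Python is fun"

found_anywhere = re.search('fun', string)        # scans the whole string
found_at_start = re.match('fun', string)         # only tries position 0
starts_with_python = re.match('Python', string)  # succeeds: the string starts with it

print(found_anywhere.span())       # start and end positions of the match
print(found_at_start)              # None, because 'fun' is not at the start
print(starts_with_python.group())  # the matched text
```

To require a match at the beginning while still using `re.search()`, the pattern can be anchored with `^`, as in `re.search('^Python', string)`.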

