## Types of Constraints

* type
* data range
* uniqueness
* membership

## str.strip(`char to remove`)
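Removes the characters specified in the argument from both ends of the string. With no argument, it removes leading and trailing whitespace.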

```
'bone'.strip('e')
# returns 'bon'
```

## .astype(`datatype`)
Converts datatype of column as specified.
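
The two methods above are often used together. A minimal sketch, assuming a hypothetical `price` column stored as text with a trailing `$`:

```python
import pandas as pd

# Hypothetical data: prices stored as text with a currency symbol
df = pd.DataFrame({'price': ['10$', '25$', '7$']})

# Strip the '$' from both ends of every value, then convert text to integers
df['price'] = df['price'].str.strip('$').astype('int')

print(df['price'].sum())
```

Until the symbol is stripped, `astype('int')` would raise a `ValueError`, so the order of the two calls matters.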

## Categorical Data

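If a column looks numerical but actually encodes categories (for example, codes such as 0, 1, 2), its datatype should be `category`. Summary statistics from `.describe()` then report the count, number of unique values, and most frequent value instead of a meaningless mean.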
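
A small sketch of the difference, assuming a hypothetical `status` column whose integers are answer codes rather than quantities:

```python
import pandas as pd

# Hypothetical survey column: 0/1/2 are codes for answers, not quantities
df = pd.DataFrame({'status': [0, 1, 1, 2, 1]})

# As integers, describe() reports mean, std and quartiles (not meaningful here)
numeric_summary = df['status'].describe()

# As a category, describe() reports count, unique, top and freq instead
df['status'] = df['status'].astype('category')
category_summary = df['status'].describe()
print(category_summary)
```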

## Converting to Pandas Datetime

```
# Parse the column first; the .dt accessor (e.g. .dt.date) only works on datetime columns
df['date'] = pd.to_datetime(df['date'])
```

## Checking and Dropping Duplicates
**To check:**
```
duplicates = df.duplicated()
df[duplicates].sort_values(by = col_name)
```

**Parameters:**
* `subset`: list of column names to check for duplication
* `keep`: which occurrence to treat as the original: `'first'`, `'last'`, or `False` (flag every duplicate)

To drop complete duplicates (all columns match):
```
df.drop_duplicates(inplace=True)
```

**Parameters:**
* `subset`: as above
* `keep`: as above
* `inplace`: `True` to drop duplicates directly in `df`; `False` (the default) to return a new DataFrame
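
To see how these parameters behave, here is a short sketch on made-up data (the `name` and `height` columns are hypothetical):

```python
import pandas as pd

# Hypothetical data: rows 0 and 2 are complete duplicates,
# rows 1 and 3 share a name but differ in height
df = pd.DataFrame({
    'name':   ['Ann', 'Bob', 'Ann', 'Bob'],
    'height': [160, 175, 160, 180],
})

# keep=False flags every member of a duplicate group, not just the repeats
complete = df.duplicated(keep=False)
by_name = df.duplicated(subset=['name'], keep=False)

# Without inplace=True, drop_duplicates returns a new DataFrame
deduped = df.drop_duplicates()
print(len(df), len(deduped))
```

Only the complete duplicate is dropped here; the two `Bob` rows differ in `height`, which is the case the aggregation approach is meant for.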

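To aggregate column-wise to remove duplicates:
```
# Columns that identify a duplicate record
column_names = ['a','b']
# How to combine the remaining columns ('c' and 'd' are placeholder names)
summaries = {
    'c': 'max',
    'd': 'mean'
}
df = df.groupby(by = column_names).agg(summaries).reset_index()

# Verify that no duplicates remain (should return an empty DataFrame)
duplicates = df.duplicated(subset = column_names, keep = False)
df[duplicates].sort_values(by = column_names)
```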

## Working with Categorical Data
Finding inconsistent rows:
```
# categories is a df or series of all the correct categories for a column
inconsistent_categories = set(df[col]).difference(categories[col])
inconsistent_rows = df[col].isin(inconsistent_categories)
df[inconsistent_rows]
```

Dropping inconsistent rows:
```
consistent_data = df[~inconsistent_rows]
```
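
Put together on toy data, the membership check might look like this (the `blood_type` column and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical reference table of valid categories, and data to check
categories = pd.DataFrame({'blood_type': ['A', 'B', 'AB', 'O']})
df = pd.DataFrame({'blood_type': ['A', 'Z', 'O', 'B', 'Z']})
col = 'blood_type'

# Values present in the data but absent from the reference table
inconsistent_categories = set(df[col]).difference(categories[col])

# Boolean mask of rows holding one of those values
inconsistent_rows = df[col].isin(inconsistent_categories)

# Keep only the rows whose value is a valid category
consistent_data = df[~inconsistent_rows]
print(inconsistent_categories)
```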

## Value Consistency

For a Series, use `.value_counts()` to count how often each value occurs. For a DataFrame, use `groupby()` with `count()`.

* To convert to uppercase: `str.upper()`
* To convert to lowercase: `str.lower()`
* To remove leading and trailing spaces: `str.strip()`
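
As an illustration, the following sketch cleans a made-up Series in which the same two values appear with inconsistent case and spacing:

```python
import pandas as pd

# Hypothetical column: two real categories written five different ways
s = pd.Series([' married', 'Married ', 'MARRIED', 'single', 'Single'])

# Before cleaning, every spelling is counted as its own value
counts_before = s.value_counts()

# Strip surrounding spaces and lowercase everything, then count again
cleaned = s.str.strip().str.lower()
counts_after = cleaned.value_counts()
print(counts_after)
```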

To collapse data into categories:
```
group_names = ['a', 'b', 'c']
df['categories'] = pd.qcut(df[col], q = 3, labels = group_names)
```
```
# Ranges specifies the cutoff points for the bins
ranges = [0,20,50,np.inf]
group_names = ['a','b','c']
df['categories'] = pd.cut(df[col], bins = ranges, labels = group_names)
```
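
The two functions can give different groupings for the same data. A sketch on a made-up `income` column:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one large outlier
df = pd.DataFrame({'income': [5, 10, 15, 18, 30, 200]})
group_names = ['a', 'b', 'c']

# qcut: bin edges are chosen so each bin holds about the same number of rows
df['by_quantile'] = pd.qcut(df['income'], q=3, labels=group_names)

# cut: bin edges are fixed in advance: (0, 20], (20, 50], (50, inf]
df['by_range'] = pd.cut(df['income'], bins=[0, 20, 50, np.inf], labels=group_names)
print(df)
```

With `qcut` each label gets two rows, while with `cut` four of the six rows fall into the first range.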

To collapse more categories into fewer, make a mapping:
```
mapping = {
    'a': 'A',
    'b': 'A',
    'c': 'A',
    'd': 'B',
    'e': 'B'
}
df[col] = df[col].replace(mapping)
```

## Unit Uniformity

Numerical conversions can be done as follows:
1. Subset on the values that need to be converted.
2. Compute the conversion.
3. Replace the subset with the converted values.
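
The three steps above can be sketched as follows, assuming a hypothetical table where some temperatures were recorded in Fahrenheit and a `unit` column says which:

```python
import pandas as pd

# Hypothetical data: temperatures mostly in Celsius, some in Fahrenheit
df = pd.DataFrame({'temp': [20.0, 68.0, 25.0, 77.0],
                   'unit': ['C', 'F', 'C', 'F']})

# 1. Subset on the values that need to be converted
in_fahrenheit = df['unit'] == 'F'

# 2. Compute the conversion
converted = (df.loc[in_fahrenheit, 'temp'] - 32) * 5 / 9

# 3. Replace the subset with the converted values
df.loc[in_fahrenheit, 'temp'] = converted
df.loc[in_fahrenheit, 'unit'] = 'C'
print(df)
```

In real data there is often no `unit` column, and the subset has to be picked with a plausibility rule instead (for example, values above a threshold).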

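To parse dates written in inconsistent formats, turning unparseable values into `NaT`:

`pd.to_datetime(df[col], infer_datetime_format = True, errors = 'coerce')`

Note that `infer_datetime_format` is deprecated as of pandas 2.0, where format inference is the default, so the argument can be omitted there.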
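
A brief sketch of what `errors = 'coerce'` does, using made-up strings (the `infer_datetime_format` argument is left out here so the example runs without warnings on recent pandas versions):

```python
import pandas as pd

# Hypothetical column: one valid date, one impossible date, one non-date
df = pd.DataFrame({'date': ['2021-03-04', '2021-13-45', 'not a date']})

# Values that cannot be parsed become NaT instead of raising an error
df['date'] = pd.to_datetime(df['date'], errors='coerce')
print(df['date'].isna())
```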

## Cross Field Validation

To compute aggregations amongst multiple columns:
    
`df[[col1,col2,col3]].sum(axis=1)`
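
For example, a row-wise sum can be compared against a total column to find rows that do not add up. A sketch with hypothetical column names:

```python
import pandas as pd

# Hypothetical data: seats per class should add up to the total
df = pd.DataFrame({
    'economy':  [100, 80],
    'business': [20, 10],
    'first':    [5, 5],
    'total':    [125, 100],
})
parts = ['economy', 'business', 'first']

# True where the parts add up to the recorded total
sum_matches = df[parts].sum(axis=1) == df['total']

consistent = df[sum_matches]
inconsistent = df[~sum_matches]
print(inconsistent)
```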

Useful functions for dates:

* `dt.date.today()`: today's date (assuming `import datetime as dt`)
* `df['col'].dt.year`: the year of each value in a datetime column
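
These two can be combined to validate an age column against a birthday column. A rough sketch with hypothetical `birthday` and `age` columns; the age computed here is approximate because it ignores whether the birthday has already passed this year:

```python
import datetime as dt
import pandas as pd

today = dt.date.today()

# Hypothetical data: the second recorded age disagrees with the birthday
df = pd.DataFrame({
    'birthday': pd.to_datetime([f'{today.year - 30}-01-01',
                                f'{today.year - 40}-01-01']),
    'age': [30, 25],
})

# Approximate age in years from the birth year
age_from_birthday = today.year - df['birthday'].dt.year

# True where the recorded age agrees with the birthday
age_matches = age_from_birthday == df['age']
print(df[~age_matches])
```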

## Completeness
Missing data is usually due to technical or human error.

To check for missing values:

`df.isna().sum()`

To drop missing values:

`df.dropna(subset = [col])`

Replacing missing values:

`df.fillna({col: value})`
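
The three calls above, applied to a small made-up table (filling with the column mean is just one possible choice of replacement value):

```python
import numpy as np
import pandas as pd

# Hypothetical sensor data with one missing value per column
df = pd.DataFrame({'temp': [20.0, np.nan, 24.0],
                   'co2':  [1.0, 2.0, np.nan]})

# Number of missing values in each column
missing_per_column = df.isna().sum()

# Drop rows where 'temp' is missing (other columns may still have NaN)
dropped = df.dropna(subset=['temp'])

# Fill missing 'temp' values with the column mean; 'co2' is left untouched
filled = df.fillna({'temp': df['temp'].mean()})
print(filled)
```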

To visualize missing data:

```
import missingno as msno
import matplotlib.pyplot as plt

msno.matrix(df)
plt.show()
```

## Minimum Edit Distance
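A systematic way to measure how close two strings are: the minimum number of single-character edits needed to turn one string into the other. The possible edits are:
* insertion
* deletion
* substitution
* transposition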
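
To make the four operations concrete, here is a pure-Python sketch of one common variant, the optimal string alignment distance. The function name `edit_distance` is made up for this example, and this is not necessarily the algorithm any particular package uses internally.

```python
def edit_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: insert, delete, substitute, transpose."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = number of edits needed to turn a[:i] into b[:j]
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # transposition of two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(edit_distance('Reeding', 'Reading'))
```

A lower distance means the strings are closer; a single substitution turns `'Reeding'` into `'Reading'`.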

We can use the `thefuzz` package in Python.



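```
from thefuzz import fuzz

# Similarity score from 0 (no match) to 100 (identical)
fuzz.WRatio('Reeding', 'Reading')
```

```
from thefuzz import process

# Return the n closest matches to string among choices
matches = process.extract(string, choices, limit=n)
```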

`matches` is a list of tuples, one per match, sorted from best to worst: `matches[0][0]` is the best-matching string and `matches[0][1]` is its WRatio score.

## Record Linkage
Joins data sources whose records are written slightly differently but actually refer to the same entity.

```
import recordlinkage

# Generate candidate pairs, pairing only rows that share the same value in col (blocking)
indexer = recordlinkage.Index()
indexer.block(col)
pairs = indexer.index(df1, df2)

compare_cl = recordlinkage.Compare()

# Find exact matches for columns with dates and categories
compare_cl.exact(col_name1, col_name2, label=col_name)

# Find similar matches for columns with strings
compare_cl.string(col_name1, col_name2, threshold=0.85, label=col_name)

# Compute the comparisons for every candidate pair
potential_matches = compare_cl.compute(pairs, df1, df2)

# Probable matches: pairs that agree on at least n columns
matches = potential_matches[potential_matches.sum(axis = 1) >= n]
```

```
# Get the indices of the matched rows from the 2nd df only
# (matches is the filtered potential_matches from the previous step)
duplicate_rows = matches.index.get_level_values(1)

# Get the duplicates in df2
df2_duplicates = df2[df2.index.isin(duplicate_rows)]

# Find the non-duplicates in df2
df2_new = df2[~df2.index.isin(duplicate_rows)]

# Link the DataFrames (DataFrame.append was removed in pandas 2.0)
df_linked = pd.concat([df1, df2_new])
```
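
The same workflow (block, compare, filter, link) can be illustrated with plain pandas on toy data. This is only a sketch of the idea, not the `recordlinkage` API: the tables, column names, and the merge-based blocking are all assumptions made for the example, and only exact comparisons are used.

```python
import pandas as pd

# Hypothetical restaurant tables that partly overlap
df1 = pd.DataFrame({'city': ['NY', 'NY', 'LA'],
                    'name': ['Cafe Uno', 'Deli Two', 'Taco Tres'],
                    'type': ['cafe', 'deli', 'mexican']})
df2 = pd.DataFrame({'city': ['NY', 'LA', 'LA'],
                    'name': ['Cafe Uno', 'Taco Tres', 'Sushi Four'],
                    'type': ['cafe', 'mexican', 'japanese']})

# Blocking: only pair rows that share the same city
pairs = df1.reset_index().merge(df2.reset_index(), on='city',
                                suffixes=('_1', '_2'))

# Exact comparison per column: 1 if the pair agrees, 0 otherwise
pairs['name_match'] = (pairs['name_1'] == pairs['name_2']).astype(int)
pairs['type_match'] = (pairs['type_1'] == pairs['type_2']).astype(int)

# Probable matches: pairs that agree on both compared columns
matches = pairs[pairs[['name_match', 'type_match']].sum(axis=1) >= 2]

# Rows of df2 already present in df1, and the rest
duplicate_rows = matches['index_2']
df2_new = df2[~df2.index.isin(duplicate_rows)]

# Link the DataFrames without the duplicates
df_linked = pd.concat([df1, df2_new])
print(df_linked)
```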