# Data Auditing 

<div style=" color:black; text-shadow: 1px 1px brown; font-size:2em;  background:url(style/images/Lucerne3.jpg)">
    <h1 align="center">Scientific Python
    <img src="style/images/kundalini_pythons_gold_outline.png" style="height:360px; align:center; " ></h1>
    </div>


## 1. Data cleansing process:

Data cleansing is an iterative process, and its first step is data auditing. In this step, we identify the types of anomalies that reduce data quality. Data auditing means programmatically checking the data against pre-specified validation rules and then reporting on the quality of the data and its problems. We often apply statistical tests in this step to examine the data.

Data anomalies can be classified at a high level into three categories:

1. **Syntactic Anomalies**: 
describe characteristics concerning the format and values used to represent the entities. Syntactic anomalies include lexical errors, domain format errors, syntactical errors and irregularities.
2. **Semantic Anomalies**: 
hinder the data collection from being a comprehensive and non-redundant representation of the mini-world. These anomalies include integrity constraint violations, contradictions, duplicates and invalid tuples.
3. **Coverage Anomalies**: 
decrease the number of entities and entity properties from the mini-world that are represented in the data collection. Coverage anomalies are categorized as missing values and missing tuples.

In this part, we give examples of the auditing process applied to discover different anomalies in data.
***


## Wrangling Titanic Data

The Titanic data is the data set provided in the Kaggle competition "Titanic: Machine Learning from Disaster". The competition has been available from 28 Sep 2012 with more than 4000 teams joining the competition. 

"The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships. One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class. In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy". For more details, please refer to "https://www.kaggle.com/c/titanic" 

The focus here is not the analysis of the data. Instead, we will concentrate on identifying errors in the data that might cause problems in the analysis. This data set contains the following variables:
* <font color="blue">survived</font>: a boolean variable indicating whether the passenger survived or not.
* <font color="blue">pclass</font>: Passenger's Cabin Class 
* <font color="blue">sex</font>: the gender of a passenger
* <font color="blue">age</font>: Age
* <font color="blue">sibsp</font>: Number of Siblings/Spouses Aboard 
* <font color="blue">parch</font>: Number of Parents/Children Aboard 
* <font color="blue">fare</font>: Passenger ticket Fare
* <font color="blue">embarked</font>: abbreviation of Port of Embarkation
* <font color="blue">class</font>: the passenger's cabin class
* <font color="blue">who</font>: a variable takes values in {man, woman, child}
* <font color="blue">adult_male</font>: a boolean variable
* <font color="blue">deck</font>: the deck
* <font color="blue">embark_town</font>: the name of the port of embarkation 
* <font color="blue">alive</font>: whether or not the passenger was alive
* <font color="blue">alone</font>: a boolean variable indicating whether the passenger traveled alone.
* <font color="blue">name</font>: Name of the passenger

For convenience, we will use the demo version of the Titanic data included in the seaborn library: https://github.com/mwaskom/seaborn-data/blob/master/titanic.csv. For the purpose of demonstration, some errors have been introduced into the data.
Note that this task was developed based on the materials provided on the Kaggle website. 

In [1]:
#Basic scientific python libs
import pandas as pd
# Visualisation
import matplotlib as mpl
import matplotlib.pyplot as plt
# Configure visualisations
%matplotlib inline
mpl.style.use( 'ggplot' )
# Notebook display settings
from IPython.display import HTML
css = open('style/style-table.css').read() + open('style/style-notebook.css').read()
HTML('<style>{}</style>'.format(css))

### First, load the data using the pandas library 

As we discussed in the lecture, the first thing you should do is inspect the file and figure out the file format. It is not hard to see that the Titanic data is stored in a CSV file, so we can use the <a href="http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html">read_csv()</a> function.

Write your code below to load the CSV file.

In [None]:
titanic =
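
As a hedged illustration of how `read_csv()` behaves, the sketch below loads a tiny made-up CSV from memory so that it runs anywhere. The CSV text and the name `sample` are assumptions for demonstration only; in the notebook you would pass the path or URL of your own copy of the Titanic file instead.

```python
import io
import pandas as pd

# Tiny made-up CSV standing in for the real file (illustration only).
csv_text = """survived,pclass,sex,age
0,3,male,22.0
1,1,female,38.0
1,3,female,
"""

# read_csv accepts a file path, a URL, or a file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)   # (rows, columns)
print(sample.dtypes)  # inferred data type of each column
```

Note that the empty `age` field in the last row is read as a missing value (NaN).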

Now, the data has been loaded and stored in a pandas DataFrame. We can take an overview of the data. For example, you might need to know 
* the number of columns, i.e., attributes, and what the attributes are
* the number of rows, i.e., passengers
* the data type of each attribute

and so on.

We start by looking at the dimensionality of the data and a few lines of the data.

In [None]:
print(titanic.shape)
titanic.head(10)

You can also print out the last few rows with the <font color="orange">tail()</font> function. By observing the data, we have got a sense of the variables, their types, and the first few observations of each. We know we're working with 892 observations of 15 variables. 

Next, we have a look at some key information about each variable to answer the following questions
* **Which features are categorical?** These values classify the samples into sets of similar samples. Within categorical features, are the values nominal, ordinal, ratio, or interval based? 
    * Categorical variables: <font color="blue">survived</font>, <font color="blue">sex</font>, <font color="blue">embarked</font>, <font color="blue">who</font>, <font color="blue">embark_town</font>, <font color="blue">alive</font>, <font color="blue">alone</font>, <font color="blue">name</font>, and <font color="blue">deck</font>
    * Ordinal variables: <font color="blue">pclass</font>, <font color="blue">class</font>
* **Which features are numerical?** Within numerical features, are the values discrete, continuous, or time-series based?
    * Continuous: <font color="blue">age</font>, <font color="blue">fare</font>.
    * Discrete: <font color="blue">sibsp</font>, <font color="blue">parch</font>.

Answering these questions will help us select the appropriate methods (e.g., plots) to audit the data.

In [None]:
titanic.info()

What is the distribution of the numerical values across the samples? This helps us determine, among other early insights, how representative the training dataset is of the actual problem domain.

Use the <font color="orange">describe()</font> function to get the distribution of each variable.
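
A minimal sketch of what `describe()` returns for numerical columns, using a small made-up DataFrame (`toy` and its values are illustrative assumptions, not the Titanic data):

```python
import pandas as pd

# Made-up numeric data with one missing age (illustration only).
toy = pd.DataFrame({
    "age": [22.0, 38.0, None, 0.42, 80.0],
    "fare": [7.25, 71.28, 8.05, 0.0, 512.33],
    "parch": [0, 0, 0, 2, 0],
})

# describe() reports count, mean, std, min, quartiles and max per column.
summary = toy.describe()
print(summary)
```

The `count` row excludes missing values, so a count lower than the number of rows is a quick hint that a column contains missing data.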

The observation tells us that 
* The sample contains 892 records, about 40% of the actual number of people on board the Titanic (2,224).
* Survived is a categorical feature with 0 or 1 values.
* Most passengers (> 75%) did not travel with parents or children.
* Nearly 30% of the passengers had siblings and/or a spouse aboard.
* Fares varied significantly, with a few passengers paying as much as $512 and some as little as 0.0.
* There are few elderly passengers in the 65-80 age range.
* The minimum age is 0.42.
* Some missing values exist in the "age" column.

Again, what is the distribution of the categorical variables?
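
For non-numerical columns, `describe()` reports the count, the number of unique values, the most frequent value and its frequency. The sketch below shows this on a small made-up DataFrame; `toy` and its values are assumptions for illustration.

```python
import pandas as pd

# Made-up categorical data (illustration only).
toy = pd.DataFrame({
    "sex": ["male", "female", "M", "male", "F"],
    "embarked": ["S", "C", "S", None, "Q"],
})

# On text columns, describe() returns count, unique, top and freq.
cat_summary = toy[["sex", "embarked"]].describe()
print(cat_summary)
```

Comparing `count` with the number of rows reveals missing values, and an unexpectedly large `unique` value hints at inconsistent spellings.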

What the summary table tells us:
* Names are not unique across the dataset (count != unique); "Behr, Mr. Karl Howell" appears twice.
* The sex variable has four possible values, with 574 males, which is suspicious. 
* Embarked takes three possible values, whereas embark_town takes seven. Since the two variables should represent the same information, this mismatch is suspicious.
* Alive is a boolean variable.
* There are a lot of missing values in deck, and 2 in both <font color="blue">embarked</font> and <font color="blue">embark_town</font>.

It is clear that the summary statistics of each variable's distribution give us a lot of information about the variable. Before we continue our auditing process, we can further split the "name" column into more meaningful columns for better analysis. 

In [None]:
# Let's separate the title, last name and first name from the name
coltitle = titanic['name'].apply(lambda s: pd.Series({
    'title': s.split(',')[1].split('.')[0].strip(),
    'lastName': s.split(',')[0].strip(),
    'firstName': s.split(',')[1].split('.')[1].strip()}))
# Add the new columns to the titanic DataFrame
titanic = pd.concat([titanic, coltitle], axis=1)
# We could drop the name column here, but we choose to keep it for the moment.
#titanic.drop('name', axis=1, inplace=True)
titanic.head()

Notice that we kept the "name" column for now, as we might need to check whether we have correctly split it into three columns.

### Identify Syntactical Anomalies 
In this section, we will demonstrate how to audit the data to identify some syntactical errors. 

#### Are all the titles consistent?

Let's start by checking the <font color="blue">title</font> column, as we have just extracted it from the <font color="blue">name</font> column. We used the <font color="orange">split()</font> function together with the following delimiters: "," and ".". Is it possible that the split method we used gave us some erroneous extractions? 

Write your code to count the frequency of each unique value of title. (Hint: use the <font color="orange">value_counts()</font> function.)
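
A minimal sketch of the hint, run on a short made-up list of titles rather than the real column (`titles` is an assumption for illustration):

```python
import pandas as pd

# Made-up titles (illustration only).
titles = pd.Series(["Mr", "Mrs", "Miss", "Mr", "Rev", "Mlle", "Mr"], name="title")

# value_counts() returns each distinct value with its frequency,
# sorted from most to least frequent.
counts = titles.value_counts()
print(counts)
```

Rare values at the bottom of the output are the natural candidates for closer inspection.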

We have got 17 different titles. We might ask whether it is plausible to have 17 different titles, as those most often used are Mr, Mrs, Miss, Ms, and Dr. What is the meaning of the following titles?
* Rev 
* Mlle
* Jonkheer
* Don
* Mme
* the Countess

Is it possible that the pattern we used to extract the title is not applicable to all the records?

In this case, we might need to have a look at the rows whose title is equal to one of the titles listed above. For example, we look at "Rev".

Write your code below to print those rows:
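
One way to select such rows is boolean indexing. The sketch below uses a small made-up DataFrame; `toy` and the names in it are hypothetical.

```python
import pandas as pd

# Made-up records (illustration only).
toy = pd.DataFrame({
    "title": ["Mr", "Rev", "Miss", "Rev", "Mlle"],
    "lastName": ["Alpha", "Bravo", "Charlie", "Delta", "Echo"],
})

# Rows whose title is exactly "Rev".
rev_rows = toy[toy["title"] == "Rev"]
print(rev_rows)

# isin() selects rows matching any of several titles at once.
rare_rows = toy[toy["title"].isin(["Rev", "Mlle"])]
print(rare_rows)
```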

There are six rows in the DataFrame that contain "Rev". It seems that "Rev" is not a random lexical error; instead, it might be a valid title that is not often used nowadays. We can check whether "Rev" is a title by searching for it online. What we get from Wikipedia is
> The Reverend is an honorific style most often placed before the names of Christian clergy and ministers. There are sometimes differences in the way the style is used in different countries and church traditions

We can confirm that Rev is a valid title that is not often used nowadays. Similarly, you can check the other titles as well. It is interesting that "the Countess" and "Mlle" are titles for females and "Don" is a title for males. Should we unify these titles? For instance, suppose we unify the title values by replacing "the Countess", "Lady", "Mme" and "Mlle" with "Miss", and "Don" with "Mr". How should we do it?

Write your code to replace 
* "Mlle", "the Countess", "Lady" and "Mme" with "Miss";
* "Don" with "Mr".

Now, we can drop the <font color="blue">name</font> column.
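
The title unification and the column drop described above can be sketched as follows. The DataFrame `toy` and the mapping name `title_map` are assumptions for illustration; the replacement rules themselves are the ones listed in the exercise.

```python
import pandas as pd

# Made-up records (illustration only).
toy = pd.DataFrame({
    "name": ["n1", "n2", "n3", "n4", "n5", "n6"],
    "title": ["Mlle", "the Countess", "Lady", "Mme", "Don", "Mr"],
})

# Map every title that should be unified to its replacement.
title_map = {
    "Mlle": "Miss",
    "the Countess": "Miss",
    "Lady": "Miss",
    "Mme": "Miss",
    "Don": "Mr",
}
toy["title"] = toy["title"].replace(title_map)

# Once the split has been verified, the original column can be dropped.
toy = toy.drop(columns="name")
print(toy)
```

Values that do not appear in the mapping are left unchanged by `replace()`.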

#### Are there any lexical errors in the data?
Typos are the most common error, particularly when the data collection process involves humans. While collecting the data, someone might have mistyped the name of the embark_town. It is always a good idea to check the categorical variables to make sure their values are spelled correctly. Let's look at <font color="blue">embark_town</font>. You can use either the <font color="orange">value_counts()</font> function or the <font color="orange">unique()</font> function.

The output shows:
* Typos:
    * Cherbourg vs. Cherborg
    * Southampton vs. Southamtpon
    * Cherbourg vs. Cherbourge
* Inconsistent spelling:
    * Queenstown vs. queenstown

The assumption we made here is that it is less likely that the spelling with large counts is wrong. You can also check  <font color="blue">embark_town</font> against <font color="blue">embarked</font>. 

Now, replace the typos with the corresponding correct spelling.
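
A hedged sketch of the inspect-then-fix step, using a made-up Series that contains the misspellings listed above (`towns`, `fixes` and `cleaned` are illustrative names):

```python
import pandas as pd

# Made-up values reproducing the kinds of errors described above.
towns = pd.Series(
    ["Southampton", "Cherbourg", "Cherborg", "Southamtpon",
     "queenstown", "Queenstown", "Cherbourge", None],
    name="embark_town",
)

# Inspect the distinct spellings; dropna=False also counts missing values.
print(towns.value_counts(dropna=False))

# Map each wrong spelling to the spelling assumed to be correct.
fixes = {
    "Cherborg": "Cherbourg",
    "Cherbourge": "Cherbourg",
    "Southamtpon": "Southampton",
    "queenstown": "Queenstown",
}
cleaned = towns.replace(fixes)
print(cleaned.value_counts(dropna=False))
```

Missing values are left untouched here; they are a coverage problem rather than a spelling problem.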

The cross-tabulation of <font color="blue">embark_town</font> and <font color="blue">embarked</font> confirms the correspondence between the values of <font color="blue">embark_town</font> and those of <font color="blue">embarked</font>.

Write your code to generate the cross-tabulation. (Hint: use the <font color="orange">crosstab()</font> function.)
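
A minimal sketch of the hint on made-up data (`toy` is an assumption for illustration). If the two columns carry the same information, each town should co-occur with exactly one port code.

```python
import pandas as pd

# Made-up, already-cleaned records (illustration only).
toy = pd.DataFrame({
    "embark_town": ["Southampton", "Cherbourg", "Queenstown",
                    "Cherbourg", "Southampton"],
    "embarked": ["S", "C", "Q", "C", "S"],
})

# Rows are towns, columns are port codes, cells are co-occurrence counts.
table = pd.crosstab(toy["embark_town"], toy["embarked"])
print(table)

# Consistency check: every town maps to exactly one code.
one_code_per_town = ((table > 0).sum(axis=1) == 1).all()
print(one_code_per_town)
```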

#### Furthermore, are there any other inconsistent spellings?
Here, we are going to use the <font color="blue">sex</font> variable as an example. We expect it to be a binary variable that takes two values, i.e., <font color="blue">male</font> and <font color="blue">female</font>, in lower-case letters. Let's check the unique values in the <font color="blue">sex</font> column.

The output shows that the number of unique values in "sex" is 4, although it is supposed to be 2. The inconsistency here is obvious. We can either replace male/female with M/F or M/F with male/female. 

Write your code to replace "M" with "male" and "F" with "female".

You can check all the other categorical variables in a similar way.


### Semantic errors: 
Variables can be correlated with each other. One variable might provide information that we can use to validate another variable. In this task, we will check whether or not the data contain
* integrity constraint violations
* contradictions
* duplicates

We first check the integrity constraints. Given the variable description, one can figure out that "age", "who" and "adult_male" are correlated. For example, **if we assume all children should be under 18, and both men and women should be 18 or above**,
* Were all children's ages under 18? 
* Were the ages of all men and women greater than or equal to 18?

To answer the questions, we need to compute the summary statistics individually for passengers marked as child, man and woman. 

One way is to use the <font color="orange">describe()</font> function together with the <font color="orange">groupby()</font> function.
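
A minimal sketch of combining the two functions, on made-up data (`toy` is an assumption for illustration):

```python
import pandas as pd

# Made-up records, including deliberately inconsistent ones.
toy = pd.DataFrame({
    "who": ["child", "child", "man", "man", "woman", "woman"],
    "age": [4.0, 25.0, 16.0, 40.0, 17.0, 30.0],
})

# Summary statistics of age, computed separately for each value of who.
age_by_who = toy.groupby("who")["age"].describe()
print(age_by_who)
```

Under the assumption above, a `max` of 18 or more in the child row, or a `min` below 18 in the man or woman rows, signals a violation.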

The statistics show that there are 83 children, 413 men and 218 women. 

Write your code to show the passengers satisfying the following conditions:
* titanic.who = man or woman
* titanic.age < 18

There are 30 passengers under 18 who are classified as man or woman. Given the assumption we made, we can now replace the value of <font color="blue">who</font> for these records with "child".

Change the value of <font color="blue">who</font> to "child" for the rows you found above.
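
The selection and the update described above can be sketched with a boolean mask and `.loc`. The data below is made up, and the names `toy`, `mask` and `misclassified` are illustrative.

```python
import pandas as pd

# Made-up records (illustration only); the last age is missing.
toy = pd.DataFrame({
    "who": ["man", "woman", "man", "child", "woman"],
    "age": [16.0, 17.0, 40.0, 4.0, None],
})

# Passengers labelled man or woman whose age is under 18.
mask = toy["who"].isin(["man", "woman"]) & (toy["age"] < 18)
misclassified = toy[mask]
print(misclassified)

# Relabel only the selected rows.
toy.loc[mask, "who"] = "child"
print(toy)
```

Comparisons with a missing age evaluate to False, so rows without an age are not relabelled by this rule.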

Instead of using the <font color="orange">describe()</font> function, you can also choose to use a plot. For example,

In [None]:
titanic.hist(by="who", column="age")

There is still one error in the <font color="blue">child</font> group: one passenger classified as a child is 25 years old.

Write your code to find the passenger.

In this case, we have to use the value of <font color="blue">sex</font> to figure out the value of <font color="blue">who</font>. 

Now change the value of <font color="blue">who</font> for this record.
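
One possible sketch of finding and fixing such a record, assuming (as above) that anyone aged 18 or over is an adult and that `sex` decides between "man" and "woman". The data is made up.

```python
import pandas as pd

# Made-up records (illustration only).
toy = pd.DataFrame({
    "who": ["child", "child", "man"],
    "age": [4.0, 25.0, 40.0],
    "sex": ["female", "male", "male"],
})

# Passengers labelled child although they are 18 or older.
mask = (toy["who"] == "child") & (toy["age"] >= 18)
print(toy[mask])

# Derive the adult label from sex for the selected rows only.
toy.loc[mask, "who"] = toy.loc[mask, "sex"].map({"male": "man", "female": "woman"})
print(toy)
```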

Now, let's compute a simple cross-tabulation of two factors, i.e., <font color="blue">sex</font> and <font color="blue">who</font>:

The tabulation shows that all the women have gender "female" and all the men have gender "male".

#### Are the values of "adult_male" consistent with the values of "sex" and "who"?

Now, let's look at <font color="blue">adult_male</font>, the value of which should be "True" if a passenger is male and adult, and "False" if a passenger is female or a male child. In other words, we need to check the consistency among three variables. What should we do? 

According to our assumption on the age of children, we need to change the value of <font color="blue">adult_male</font> from True to False for the inconsistent records. 

Write your code below:
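
A hedged sketch of this consistency check: recompute the expected flag from `sex` and `who`, compare it with the stored flag, and overwrite the stored flag where they disagree. The data and the name `expected` are assumptions for illustration.

```python
import pandas as pd

# Made-up records; the second row has an inconsistent adult_male flag.
toy = pd.DataFrame({
    "sex": ["male", "male", "female", "male"],
    "who": ["man", "child", "woman", "child"],
    "adult_male": [True, True, False, False],
})

# adult_male should be True only for male passengers labelled as man.
expected = (toy["sex"] == "male") & (toy["who"] == "man")

# Rows where the stored flag disagrees with the derived one.
inconsistent = toy[toy["adult_male"] != expected]
print(inconsistent)

# Overwrite the flag with the derived value.
toy["adult_male"] = expected
```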

Now, another question is whether it is possible for a child less than 10 years old to be on board the ship alone. Let's check whether we have any record that satisfies 
* titanic.age < 10
* titanic.alone == True

The output shows that there was a 5-year-old girl who was on board alone and survived. Should we change the value of <font color="blue">alone</font>? The value is consistent with the values of <font color="blue">sibsp</font> and <font color="blue">parch</font>. In this case, we might choose to keep it as it is.
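
The check and the consistency argument above can be sketched as follows. The rule that `alone` should be True exactly when `sibsp + parch` is 0 is an assumption made here for illustration, and the data is made up.

```python
import pandas as pd

# Made-up records (illustration only).
toy = pd.DataFrame({
    "age": [5.0, 30.0, 8.0],
    "sibsp": [0, 1, 1],
    "parch": [0, 0, 2],
    "alone": [True, False, False],
})

# Children under 10 who are recorded as travelling alone.
young_alone = toy[(toy["age"] < 10) & toy["alone"]]
print(young_alone)

# Assumed rule: a passenger is alone when no relatives are aboard.
derived_alone = (toy["sibsp"] + toy["parch"]) == 0
print((toy["alone"] == derived_alone).all())
```

When the stored flag agrees with the derived one, as it does here, there is no internal evidence that the record is wrong, which supports keeping it unchanged.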

#### Are there any duplicated records?
If we assume that <font color="blue">firstName</font>, <font color="blue">lastName</font> and <font color="blue">age</font> can uniquely identify a passenger, we can then use the three values to check whether or not the dataset contains duplicated records.

Write your code to find the duplicates:
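
A minimal sketch using `duplicated()` and `drop_duplicates()` with the three key columns. The records below are made up, and the names `key`, `dupes` and `deduped` are illustrative.

```python
import pandas as pd

# Made-up records; the first two share the same key values.
toy = pd.DataFrame({
    "firstName": ["Alice", "Alice", "Bob"],
    "lastName": ["Example", "Example", "Sample"],
    "age": [26.0, 26.0, 40.0],
    "survived": [1, 1, 0],
    "alive": ["yes", "no", "no"],
})

key = ["firstName", "lastName", "age"]

# keep=False marks every member of a duplicated group, not just the repeats.
dupes = toy[toy.duplicated(subset=key, keep=False)]
print(dupes)

# Keep the first record of each group and drop the rest.
deduped = toy.drop_duplicates(subset=key, keep="first")
print(deduped)
```

Inspecting all members of a duplicated group before dropping anything makes it possible to decide which copy is the more trustworthy one.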

The output above shows that there are two duplicated records. If you check the two records carefully, you will see that the second one contains inconsistent values. For example, <font color="blue">survived</font> = 1, but <font color="blue">alive</font> = no, and <font color="blue">embarked</font> = C, but <font color="blue">embark_town</font> = Cherbourg. Taking these two observations into account, we can choose to remove the second record and keep only the first one.

## Summary
In this tutorial, we demonstrated how to identify and correct some syntactic and semantic data errors. We will cover missing values and outliers separately in the tutorials of the following two weeks.