# Identify Bad Data

Data is often flawed. Much of it is typed into computer systems by many different people, which leads to data entry errors and inconsistencies from one person to the next. Quality can also degrade when data is translated between formats or consolidated carelessly. Ultimately, these discrepancies affect every process or analysis done down the line, which is one of the primary reasons data cleansing is so important in data preparation. Detecting errors is not the only goal, however. Cleansing also involves knowing which information to include so that the insights derived from it are meaningful, especially those used in business decisions.

In this section and the next, you'll learn to identify and address issues with missing values, outliers, and unnecessary and inconsistent data. This section focuses on identification and the next focuses on remediation. 

The scenario for this section: the head of recruiting at a tech company has asked you to analyze the data science jobs advertised on LinkedIn and predict the most competitive salary bands the company should offer to attract new hires and retain current employees. The data was web-scraped from LinkedIn, so there are no guarantees as to its quality, and we first need to prepare the dataset.

In this section, we are going to use pandas and matplotlib (specifically the pyplot submodule), so we'll start by importing those libraries using aliases:

In [2]:
import pandas as pd
import matplotlib.pyplot as plt

Let’s import and preview the web-scraped LinkedIn dataset: 

In [3]:
# read the CSV file into a pandas DataFrame 
jobs = pd.read_csv("../datasets/li-jobs-usa.csv")

# display the first five rows
jobs.head()

| Unnamed: 0 | title | company | description | onsite_remote | salary | sign_on_bonus | annual_bonus | location | criteria | posted_date |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Data Analyst - Recent Graduate | paypal | At PayPal (NASDAQ: PYPL), we believe that ever... | onsite | `$75,000.00\n -\n $95,000.00` | 9000 | 2400.0 | Buffalo-Niagara Falls Area | [{'Seniority level': 'Not Applicable'}, {'Empl... | NaN |
| 1 | Data Analyst - Recent Graduate | paypal | At PayPal (NASDAQ: PYPL), we believe that ever... | onsite | NaN | 4000 | 5400.0 | San Jose, CA | [{'Seniority level': 'Not Applicable'}, {'Empl... | NaN |
| 2 | Data Analyst | paypal | At PayPal (NASDAQ: PYPL), we believe that ever... | onsite | NaN | 2000 | 1200.0 | Texas, United States | [{'Seniority level': 'Not Applicable'}, {'Empl... | NaN |
| 3 | Data Analyst | PayPal | At PayPal (NASDAQ: PYPL), we believe that ever... | onsite | NaN | 3000 | 1800.0 | Illinois, United States | [{'Seniority level': 'Not Applicable'}, {'Empl... | NaN |
| 4 | Entry-Level Data Analyst | The Federal Savings Bank | The Federal Savings Bank, a national bank and ... | onsite | NaN | 4000 | 2400.0 | Chicago, IL | [{'Seniority level': 'Entry level'}, {'Employm... | NaN |


### Part I: Missing Values

A quick way to find the columns in our dataset that have missing values is to use the `info()` method, which displays summary information about the DataFrame:

`df_name.info()`

So let's look at the information for our LinkedIn dataset:

In [None]:
# display DataFrame information
jobs.info()

- `RangeIndex` shows the total number of rows (entries) in the DataFrame: 2851 in our case.
- The `Non-Null Count` column shows the number of non-null (non-missing) values in each column.
- Several columns have fewer non-null values than there are rows, which means they have missing values:
    - `salary`: 934/2851
    - `annual_bonus`: 2843/2851
    - `location`: 2820/2851
    - `posted_date`: 570/2851


- Remember, we're trying to predict salary bands, so the fact that `salary` is missing so many values is a little disconcerting.
- Having identified a likely problem, let's get a handle on how big it is.
- The `isnull()` method marks every missing value as `True`, so chaining it with the familiar `mean()` method gives the fraction of missing values in each problematic column (multiply by 100 to get a percentage):

`df_name[["variable_name", "variable_name", "variable_name"]].isnull().mean()` 



In [None]:
# select the columns that info() flagged as having missing values
jobs_missing = jobs[['salary', 'annual_bonus', 'location', 'posted_date']]

# isnull() marks each missing value as True; mean() gives the fraction of
# True values per column, and * 100 converts that fraction to a percentage
missing_percentages = jobs_missing.isnull().mean() * 100

missing_percentages

Our `salary` variable is missing in 67% of cases! That's a lot for a column that we are directly interested in. We'll see some options for what to do about missing data in the next section. 
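To see what `isnull().mean()` is doing, here is a minimal sketch on a made-up DataFrame. The `toy` DataFrame and its values are invented for illustration and are not part of the LinkedIn dataset.

```python
import numpy as np
import pandas as pd

# hypothetical toy data, not from the LinkedIn dataset
toy = pd.DataFrame({
    "salary": [50000.0, np.nan, np.nan, 65000.0],
    "location": ["Chicago, IL", "San Jose, CA", None, "Austin, TX"],
})

# number of non-missing values per column
# (the same figure info() reports as Non-Null Count)
non_null_counts = toy.count()

# isnull() gives True/False per cell; the mean of booleans
# is the fraction of True values in each column
missing_fraction = toy.isnull().mean()

# convert the fractions to percentages
missing_percent = missing_fraction * 100

print(non_null_counts)
print(missing_percent)
```

Here `salary` has two of four values missing (50%) and `location` has one of four missing (25%).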

### Part II: Outliers

An outlier is a data point that differs significantly from the other data points in a data set. Outliers are unusually large or small, so they stand out from the rest of the data.

To understand this concept, let's consider a simple example:

Let's say you're a teacher and you have ten students. You give them a test and the scores are as follows:

- Student 1: 85
- Student 2: 90
- Student 3: 88
- Student 4: 92
- Student 5: 87
- Student 6: 91
- Student 7: 100
- Student 8: 86
- Student 9: 89
- Student 10: 250

In this case, most of the scores are in the range of 85-100, which seems normal. However, the score of 250 for Student 10 is much higher than the rest. This would be considered an outlier because it is significantly different from the other scores.

Detecting and managing outliers is an important part of data analysis as they can heavily skew results and interpretations, and could potentially indicate issues with data collection or entry. In some cases, outliers may be excluded from data analyses, but in others, they could provide valuable information and thus would be kept and examined further. It's important to understand the reasons behind an outlier before deciding how to handle it.
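As a rough illustration of how a single outlier can skew a result, the sketch below compares the mean and median of the ten test scores above, with and without Student 10's score of 250. The variable names are chosen for this example.

```python
import pandas as pd

# the ten test scores from the example above
scores = pd.Series([85, 90, 88, 92, 87, 91, 100, 86, 89, 250])

# the same scores without the suspicious value of 250
scores_without = scores[scores != 250]

# the mean is pulled up sharply by the single extreme score
mean_all = scores.mean()              # 105.8
mean_without = scores_without.mean()  # about 89.8

# the median barely moves
median_all = scores.median()              # 89.5
median_without = scores_without.median()  # 89.0

print(mean_all, mean_without)
print(median_all, median_without)
```

With the outlier included, the mean (105.8) is higher than every score except the outlier itself, so it no longer describes a typical student.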

To identify outliers, the easiest way to get started is with boxplots. Our handy dandy `boxplot` function in matplotlib's pyplot module can give us a nice picture, like this:

`plt.boxplot(df_name['variable_name'])`

It's also always a good idea to give your viewers some clue about what they're looking at with titles, axis labels, etc.:

`plt.title('Title')`

`plt.ylabel("y-axis label")`

The boxplot of the 'sign_on_bonus' variable in the jobs data shows us two outliers (the dots in the figure: 9000 and 8000 USD):



In [None]:
# create a boxplot 
plt.boxplot(jobs['sign_on_bonus'])

# Title
plt.title('Boxplot for sign_on_bonus')

# label y-axis
plt.ylabel("USD")


To check your data overall, use:

`df_name.boxplot()`

This will give you boxplots for all numerical variables.

In [None]:
jobs.boxplot()
plt.title("Boxplot of `jobs` DataFrame Variables")
plt.ylabel("USD")
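
`jobs.boxplot()` draws a box for every numeric column, which can include columns that aren't measured in USD, such as an index column like `Unnamed: 0`. One option, sketched below on made-up data, is to use the `column` parameter of `DataFrame.boxplot()` to limit the plot to the columns you care about. The `toy` DataFrame, its `row_id` column, and its values are invented for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# hypothetical toy data, not from the LinkedIn dataset
toy = pd.DataFrame({
    "row_id": [0, 1, 2, 3, 4, 5],
    "sign_on_bonus": [2000, 3000, 4000, 3000, 2000, 9000],
    "annual_bonus": [1200.0, 1800.0, 2400.0, 5400.0, 2400.0, 1800.0],
})

# plot only the two bonus columns and skip row_id
ax = toy.boxplot(column=["sign_on_bonus", "annual_bonus"])
ax.set_title("Boxplot of bonus columns")
ax.set_ylabel("USD")

plt.show()
```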


#### Interquartile Range (IQR)

Calculating the IQR is one of the main statistical techniques used to identify outliers.


##### What's a Quartile?

In calculating the IQR, we use these funny-sounding things called quartiles. Quartiles are the cut points that split our sorted data into four equal parts. The first quartile (Q1) is the value below which 25% of our data falls, and the third quartile (Q3) is the value below which 75% of our data falls. To find outliers, we first calculate the interquartile range (IQR) by subtracting Q1 from Q3, so we need to find Q1 and Q3. In pandas, the `quantile()` method returns the value at any quantile, given as a fraction between 0 and 1. To find Q1 we use `quantile(0.25)`, and to find Q3 we use `quantile(0.75)`:


`Q1 = df_name['variable_name'].quantile(0.25)`

`Q3 = df_name['variable_name'].quantile(0.75)`

`IQR = Q3 - Q1`
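
As a small illustration of what `quantile()` returns, the sketch below uses a made-up Series of the numbers 1 through 9 (not taken from the jobs data). By default, pandas interpolates linearly when a quantile falls between two data points; with nine evenly spaced values, the quartiles land exactly on data points.

```python
import pandas as pd

# hypothetical toy data: the integers 1 through 9
values = pd.Series(range(1, 10))

Q1 = values.quantile(0.25)  # 3.0: a quarter of the way through the sorted data
Q3 = values.quantile(0.75)  # 7.0: three quarters of the way through
IQR = Q3 - Q1               # 4.0: the spread of the middle half of the data

print(Q1, Q3, IQR)
```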

Next we need to define the boundaries for outliers. The rule of thumb is that an outlier is any value above Q3 + 1.5 × IQR or below Q1 - 1.5 × IQR:


`lower_bound = Q1 - 1.5 * IQR`

`upper_bound = Q3 + 1.5 * IQR`

Finally, we use our rockin' pandas skills to select the values that fall outside those boundaries:


`outliers = df_name['variable_name'][(df_name['variable_name'] < lower_bound) | (df_name['variable_name'] > upper_bound)]`


Let's find the outliers in the 'sign_on_bonus' data:


In [None]:
# calculate the IQR

Q1 = jobs['sign_on_bonus'].quantile(0.25)
Q3 = jobs['sign_on_bonus'].quantile(0.75)
IQR = Q3 - Q1

# Define the boundaries for outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Identify the outliers
outliers = jobs['sign_on_bonus'][(jobs['sign_on_bonus'] < lower_bound) | (jobs['sign_on_bonus'] > upper_bound)] 

# display the outliers
outliers
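
The same steps can be wrapped in a small helper so they're easy to reuse on other columns. The function name `iqr_outliers` and its `k` parameter are hypothetical, introduced here for illustration. The sketch tries the helper on the ten test scores from the teacher example rather than on the jobs data.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return the values of `series` that fall outside the IQR-based bounds."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - k * iqr
    upper_bound = q3 + k * iqr
    # keep only the values below the lower bound or above the upper bound
    return series[(series < lower_bound) | (series > upper_bound)]

# the ten test scores from the teacher example
scores = pd.Series([85, 90, 88, 92, 87, 91, 100, 86, 89, 250])

score_outliers = iqr_outliers(scores)
print(score_outliers)
```

On these scores the bounds work out to 80.5 and 98.5, so the rule flags both 250 and 100. A score of 100 is perfectly plausible on a test, which is a reminder that the 1.5 × IQR rule is only a rule of thumb and that flagged values still need a closer look before you decide how to handle them.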
