# What are the different types of data issues?

* Data issues can arise in an organization when collecting data, combining multiple datasets, receiving data from clients, customers or other departments, and entering data. Some examples include:


## Incomplete data

* Incomplete data has missing fields or rows: no value is stored for an attribute in an observation. Missing data is common and can significantly affect the insights that can be drawn from the data.
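
As a minimal sketch of how missing values can be quantified, the snippet below counts them per column with `isna()`. The `toy` DataFrame and its values are made up for illustration and are not taken from the students dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy records; the values are made up for illustration only.
toy = pd.DataFrame({
    "names": ["A", "B", "C", "D"],
    "house": ["Elgon", None, None, "Nandi"],
    "english": [81.0, np.nan, 54.0, 71.0],
})

# Number of missing values in each column.
missing_counts = toy.isna().sum()

# Share of missing values in each column, as a percentage.
missing_pct = toy.isna().mean() * 100

print(missing_counts)
print(missing_pct)
```

The same two calls should work unchanged on the `data` DataFrame loaded later in this notebook.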


## Duplicated entries

* A duplicate is an entry that inadvertently repeats another entry in a database, i.e. a complete carbon copy. Duplicate entries are also common in datasets.


## Invalid Data

* Invalid data has attributes that do not conform to the dataset's logical mapping (its expected types and formats). This includes wrong data types and wrong data formats, which interfere with analysis. For example, a value stored as `95%` is read as a string, not as a number.
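
One possible way to turn percent strings into numbers is sketched below. The `scores` Series is a hypothetical stand-in for a column such as `Creative Arts`. It assumes that a trailing `%` is the only formatting to strip and that anything still non-numeric (for example `99&`) should become missing rather than be guessed at.

```python
import pandas as pd

# Hypothetical sample of percent strings, including one malformed entry
# and one missing value.
scores = pd.Series(["99%", "38%", "86%", "99&", None])

# Remove the trailing percent sign; other characters are left untouched.
cleaned = scores.str.rstrip("%")

# Convert to numbers; values that cannot be parsed become NaN.
numeric = pd.to_numeric(cleaned, errors="coerce")

print(numeric)
```

Coercing to `NaN` keeps malformed entries visible as missing values, so they can be reviewed later instead of being silently repaired.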


## Conflicting data

* Conflicting data occurs when records meant to capture the same real-world entity have different attribute values. Such deviations can mislead any analysis done on the data.
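
A minimal sketch of one way to detect conflicts: group records by an identifier and flag identifiers that map to more than one distinct attribute value. The `records` DataFrame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical records: admission number 101 appears with two different houses.
records = pd.DataFrame({
    "admission number": [101, 101, 102, 103, 103],
    "house": ["Elgon", "Nandi", "Elgon", "Nandi", "Nandi"],
})

# Count the distinct house values recorded for each admission number.
houses_per_id = records.groupby("admission number")["house"].nunique()

# Identifiers with more than one distinct house are in conflict.
conflicting_ids = houses_per_id[houses_per_id > 1].index.tolist()

print(conflicting_ids)
```

Admission number 103 is repeated but consistent, so it counts as a duplicate rather than a conflict.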


## Import Python Libraries

In [1]:
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

## Load the data 

In [4]:
# index_col=0 uses the first column as the index
data = pd.read_csv("Data/students_data.csv", index_col = 0) 

## Understanding the Data

In [5]:
# preview the first five rows 
data.head()

Unnamed: 0,names,admission number,house,balance,english,kiswahili,mathematics,science,sst/cre,Creative Arts,music
0,"JERIEL NDEDA, OBURA",13259.0,,,81.0,39.0,50.0,30.0,59.0,99%,80%
1,"MUKUHA TIMOTHY, KAMAU",13243.0,,,85.0,74.0,68.0,49.0,78.0,38%,86%
2,"JOB, NGARA",13307.0,,,54.0,49.0,53.0,59.0,72.0,86%,62%
3,"CHEGE DAVID, KAMAU",13258.0,,,71.0,97.0,92.0,41.0,81.0,77%,80%
4,"RAMADHAN MUSA, TEPO",13363.0,,,40.0,84.0,74.0,82.0,89.0,64%,46%


In [6]:
data.tail()

Unnamed: 0,names,admission number,house,balance,english,kiswahili,mathematics,science,sst/cre,Creative Arts,music
142,"TIMOTHY NDEDA, OBURA",13322634.0,Elgon,0.0,-78.0,40.0,99.0,70.0,49.0,99&,92&
143,"MUKUHA JERIEL, NGARA",1932845.0,Cherangani,321.0,94.0,780.0,420.0,71.0,88.0,56%,76%
144,"JOB, KAMAU",1430232.0,Nandi,43200.0,98.0,80.0,86.0,64.0,99.0,49%,69%
145,"CHEGE, KAMAU",159.0,Nandi,,508.0,409.0,77.0,58.0,56.0,88%,84%
146,"RAMADHAN, MUSA",87.0,Cherangani,,81.0,70.0,64.0,680.0,88.0,76%,72%


In [7]:
# column names 
data.columns

Index(['names', 'admission number', 'house', 'balance', 'english', 'kiswahili',
       'mathematics', 'science', 'sst/cre', 'Creative Arts', 'music'],
      dtype='object')

In [8]:
# shape 
data.shape

(147, 11)

In [9]:
# overview of the data 
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 147 entries, 0 to 146
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   names             147 non-null    object 
 1   admission number  124 non-null    float64
 2   house             26 non-null     object 
 3   balance           58 non-null     object 
 4   english           121 non-null    float64
 5   kiswahili         119 non-null    float64
 6   mathematics       130 non-null    float64
 7   science           117 non-null    float64
 8   sst/cre           132 non-null    float64
 9   Creative Arts     143 non-null    object 
 10  music             147 non-null    object 
dtypes: float64(6), object(5)
memory usage: 13.8+ KB


## Duplicates and Unwanted Observations

In [18]:
# duplicates 
data.duplicated().any()

True

In [49]:
# variable to store number of duplicates
no_true = 0

# loop through a bool Series, where True is duplicated and False is not duplicated
for val in data.duplicated():
    if val:
        # increment the count by one upon finding a duplicate
        no_true += 1

# number of duplicated rows, then the same figure as a percentage of all rows
print(no_true)
print(f"{ np.round(((no_true/len(data)) * 100), 4) } %")

8
5.4422 %


In [27]:
# Proportion of duplicated rows (a fraction between 0 and 1)
percentage = no_true / len(data)
print(percentage)

0.05442176870748299


In [46]:
# Converting to percentage
conv_per = percentage * 100 
print(conv_per)

5.442176870748299


In [47]:
# round off
round_off = np.round(conv_per, 4)
print(round_off)

5.4422


In [48]:
# display the percentage 
print(f"{round_off}% of the data is duplicated")

5.4422% of the data is duplicated
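
The loop and the step-by-step conversion above can also be written with vectorized pandas calls. The sketch below uses a hypothetical `toy` DataFrame: summing the boolean mask counts the duplicates, and its mean gives the duplicated fraction.

```python
import pandas as pd

# Hypothetical toy data: rows 2 and 4 repeat rows 0 and 1.
toy = pd.DataFrame({
    "names": ["A", "B", "A", "C", "B"],
    "english": [81, 85, 81, 54, 85],
})

# True for every row that repeats an earlier row.
dup_mask = toy.duplicated()

# True counts as 1, so the sum is the number of duplicated rows.
n_duplicates = int(dup_mask.sum())

# The mean of a boolean mask is the fraction of True values.
pct_duplicates = round(dup_mask.mean() * 100, 4)

print(n_duplicates)
print(f"{pct_duplicates}% of the data is duplicated")
```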


In [52]:
data.drop_duplicates(subset = None, keep = "first", inplace = True)

In [53]:
data.duplicated().any()

False

### Duplicates in Specific Columns

* Unique identifiers should not be duplicated.

In [56]:
data.shape

(139, 11)

In [55]:
data.duplicated(subset = ['admission number']).any()

True

In [58]:
admission_col = data.duplicated(subset = ['admission number'])
type(admission_col)

pandas.core.series.Series

In [59]:
no_true = 0

for val in admission_col:
    if val == True:
        no_true += 1
        
print(no_true)

47


In [None]:
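# collect the positions (not index labels) of rows flagged as duplicates;
# after drop_duplicates the index labels have gaps, so use these with data.iloc
shared_admission_index = []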

for index, val in enumerate(admission_col):
    if (val == True):
        shared_admission_index.append(index)
        
print(shared_admission_index)

[12, 16, 28, 37, 43, 44, 50, 52, 61, 63, 66, 70, 72, 74, 75, 78, 81, 82, 90, 91, 93, 94, 95, 100, 102, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133]


In [68]:
len(shared_admission_index)

47
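
Two details are worth checking before acting on this count. First, `duplicated()` treats missing values as equal to each other, and `data.info()` reported only 124 non-null admission numbers out of 147 rows, so some of the flagged rows may be repeated missing values rather than genuinely shared numbers. Second, index labels and row positions differ once rows have been dropped. The sketch below illustrates both points on a hypothetical `toy` DataFrame; it is not run on the students dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data with non-contiguous index labels, as left behind by
# drop_duplicates, and two missing admission numbers.
toy = pd.DataFrame(
    {"admission number": [13259.0, np.nan, 13259.0, np.nan, 13307.0]},
    index=[0, 3, 5, 8, 9],
)

# NaN values are treated as equal, so the second NaN is flagged as well.
mask_all = toy.duplicated(subset=["admission number"])

# Keep only duplicates where the admission number is actually present.
mask_known = toy["admission number"].notna() & mask_all

# Index labels of the flagged rows (for use with .loc).
dup_labels = mask_known[mask_known].index.tolist()

# Row positions of the flagged rows (for use with .iloc).
dup_positions = np.flatnonzero(mask_known.to_numpy()).tolist()

print(int(mask_all.sum()), dup_labels, dup_positions)
```

Whether rows with a missing identifier should count as duplicates is a judgment call that depends on how missing values are handled in a later step.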