# Data Cleaning with Python 

## 1. Standardisation

#### <font color="blue">Pre-requisites</font>

In [1]:
# Pre-requisite 1
# ---
# Importing pandas library
# ---
# -> This is a data analysis and manipulation library with Python.
# ---
# OUR CODE GOES BELOW
#
import pandas as pd

In [2]:
# Pre-requisite 2
# ---
# Importing the numpy library
# -> This is a library for scientific computing with Python.
# -> It simply allows us to perform complex mathematical operations.
# ---
# OUR CODE GOES BELOW
#
import numpy as np

#### <font color="blue">Tasks</font>

##### <font color="blue">Task 1</font>

In [3]:
# Task 1
# ---
# Renaming column names
# ---
# Dataset url = http://bit.ly/DataCleaningDataset
# ---
# OUR CODE GOES BELOW
#

# Reading our dataset from the url
# ---
# We also specify the character ; as our separator
# ---
#
df = pd.read_csv('http://bit.ly/DataCleaningDataset', sep=';')
df.head()

Unnamed: 0,NAME,CITY,COUNTRY,HEIGHT,WEIGHT,ACCOUNT A,ACCOUNT B,TOTAL ACCOUNT
0,Adi Dako,LISBON,PORTUGAL,56,132.0,2390.0,4340,6730
1,John Paul,LONDON,UNITED KINGDOM,62,165.0,4500.0,34334,38834
2,Cindy Jules,Stockholm,Sweden,48,117.0,,5504,8949
3,Arthur Kegels,BRUSSELS,BELGIUM,59,121.0,4344.0,8999,300
4,Freya Bismark,Berlin,GERMANYY,53,126.0,7000.0,19000,26000


In [4]:
# Task 1a
# ---
# In this task, we rename all our columns at once, which is handy when there are many of them.
# We will use the str.strip(), str.lower() and str.replace() functions
# to ensure that our column names are in a lowercase format that is easy to work with.
# ---
# str.strip() - We use this function to remove leading and trailing characters (whitespace by default).
# str.lower() - This function converts all characters to lowercase.
# str.replace() - This function is used to replace text with some other text.
# ---
#

# Then preview our dataframe
# ---
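
A minimal sketch of the chained renaming described in Task 1a. It assumes a small hand-built frame in place of the remote dataset: the two rows are copied from the preview above, and the padding around `" NAME "` is a hypothetical addition so that `str.strip()` has something to remove.

```python
import pandas as pd

# Stand-in for the remote dataset: two rows copied from the preview above.
# The padding around " NAME " is hypothetical, added to exercise str.strip().
df = pd.DataFrame({
    " NAME ": ["Adi Dako", "John Paul"],
    "CITY": ["LISBON", "LONDON"],
    "ACCOUNT A": [2390.0, 4500.0],
    "TOTAL ACCOUNT": [6730, 38834],
})

# Strip outer whitespace, lowercase, then replace inner spaces with underscores
df.columns = (
    df.columns.str.strip()
    .str.lower()
    .str.replace(" ", "_", regex=False)
)

# Then preview our dataframe
print(df.head())
```

Replacing spaces with underscores is one possible choice; it matches the `account_a` style of name used later in this notebook.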


In [5]:
# Task 1b
# ---
# Alternatively we can rename column names in a dataframe manually by
# specifying the column names that we would like to have.
# Something to note is that this method becomes cumbersome when the no. of variables/features increases.
# ---
#

# We will reload our dataset again for this task and create a new dataframe.
# ---
#
df2 = pd.read_csv('http://bit.ly/DataCleaningDataset', sep = ';')

# We then specify our columns names, store them in a list, then afterwards
# assign the list to the column labels. By doing this, we replace the original
# columns with our new column names stored in the list.
# ---
#

# We then preview our dataframe as shown to confirm our changes
# ---
#
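
One way the manual approach could look, sketched on a three-column stand-in frame (rows copied from the Task 1 preview) rather than the full remote dataset. The list name `new_columns` is an assumption made for illustration.

```python
import pandas as pd

# Stand-in for the reloaded dataset: two rows copied from the Task 1 preview
df2 = pd.DataFrame({
    "NAME": ["Adi Dako", "John Paul"],
    "CITY": ["LISBON", "LONDON"],
    "COUNTRY": ["PORTUGAL", "UNITED KINGDOM"],
})

# The list must hold exactly one name per existing column, in the same order
new_columns = ["name", "city", "country"]
df2.columns = new_columns

# Preview the dataframe to confirm the change
print(df2.head())
```

If the list length differs from the number of columns, pandas raises a `ValueError`, which is part of why this approach gets cumbersome for wide datasets.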


##### <font color="blue">Task 2</font>

In [6]:
# Task 2
# ---
# During standardisation, we can also perform string conversion.
# In this task, we will convert the values of the column city to lowercase values.
# From the previous task 1, we can see that the city column/feature has values
# in both uppercase and sentence case.
# ---
# OUR CODE GOES BELOW
#

# Let's convert the city column so that it contains only lowercase characters
# ---
#
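
A short sketch of the string conversion, assuming the columns have already been renamed to lowercase as in Task 1. The three city values are taken from the Task 1 preview.

```python
import pandas as pd

# Stand-in frame with mixed-case city values from the Task 1 preview
df = pd.DataFrame({
    "name": ["Adi Dako", "John Paul", "Cindy Jules"],
    "city": ["LISBON", "LONDON", "Stockholm"],
})

# Convert every value in the city column to lowercase
df["city"] = df["city"].str.lower()

print(df)
```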


##### <font color="blue">Task 3</font>

In [7]:
# Task 3
# ---
# We can also perform other kinds of conversion, e.g. unit (metric) conversion.
# In this task, we convert our height values from inches to centimeters, keeping in mind
# that 1 inch = 2.54 cm.
# ---
# Dataset url = http://bit.ly/DataCleaningDataset
# ---
#

# We perform our conversion across the column that we would want
# then replace the column with the outcome of our conversion.
# ---
#
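
The unit conversion can be sketched as a single vectorised multiplication. This assumes the heights are recorded in inches and that the column has already been renamed to `height`; the three values come from the Task 1 preview.

```python
import pandas as pd

# Stand-in frame: heights (assumed to be in inches) from the Task 1 preview
df = pd.DataFrame({
    "name": ["Adi Dako", "John Paul", "Cindy Jules"],
    "height": [56, 62, 48],
})

# 1 inch = 2.54 cm, so multiply the whole column and overwrite it
df["height"] = df["height"] * 2.54

print(df)
```

The multiplication turns the integer column into a float column, which is what the datatype conversion in the next task then deals with.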


##### <font color="blue">Task 4</font>

In [8]:
# Task 4
# ---
# We can also perform other types of conversion such as datatype conversion 
# ---


# Let's first determine the column/feature datatypes
# ---
#


In [9]:
# We then perform a conversion by converting our column/feature (height)
# through the use of the apply() function, passing the numerical
# type (integer) provided by numpy.
# To get an understanding of other datatypes provided by numpy we can visit:
# https://docs.scipy.org/doc/numpy/user/basics.types.html
# ---
#


# Let's now check whether our conversion happened by checking our updated datatypes.
# We want to see whether the height feature was converted from float to integer.
# ---
#
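
A sketch of the datatype check and conversion, assuming the height column already holds centimetre values as floats (the figures below are the Task 1 heights multiplied by 2.54).

```python
import numpy as np
import pandas as pd

# Stand-in frame: heights already converted to centimetres (floats)
df = pd.DataFrame({
    "name": ["Adi Dako", "John Paul", "Cindy Jules"],
    "height": [142.24, 157.48, 121.92],
})

# Determine the column/feature datatypes before the conversion
print(df.dtypes)

# Apply numpy's 64-bit integer type to each value; the fractional part is truncated
df["height"] = df["height"].apply(np.int64)

# Check the updated datatypes
print(df.dtypes)
```

`df["height"].astype(np.int64)` is a common alternative that gives the same result here. Either way the decimals are dropped rather than rounded, so rounding first may be preferable if precision matters.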


In [10]:
# We can also refer to our previous values of the height feature in task 3,
# and we will see that our height values now consist only of integers.
# Let's now inspect and see whether our changes took place.
# We will sample 5 records from our dataset.
# ---
#


## 2. Syntax Errors

#### <font color="blue">Tasks</font>

##### <font color="blue">Task 1</font>

In [11]:
# Task 1
# ---
# While performing our analysis, we can get to a point where we need to
# fix spelling mistakes or typos. 
# ---
# Dataset url = http://bit.ly/DataCleaningDataset
# ---
# OUR CODE GOES BELOW
#

# Let's replace any value "GERMANYY" with the correct value "GERMANY".
# We use the string replace() function to perform our operation as shown.
# ---
#
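
A minimal sketch of the typo fix, using three country values from the Task 1 preview and assuming the column has been renamed to `country`.

```python
import pandas as pd

# Stand-in frame containing the misspelt value from the Task 1 preview
df = pd.DataFrame({
    "name": ["Arthur Kegels", "Freya Bismark", "Adi Dako"],
    "country": ["BELGIUM", "GERMANYY", "PORTUGAL"],
})

# Replace the literal text "GERMANYY" with "GERMANY" (no regular expression needed)
df["country"] = df["country"].str.replace("GERMANYY", "GERMANY", regex=False)

print(df)
```

`str.replace()` matches substrings, so it would also change a longer value that merely contains "GERMANYY". For whole-value replacement, `df["country"].replace("GERMANYY", "GERMANY")` is an alternative.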


##### <font color="blue">Task 2</font>

In [12]:
# Task 2
# ---
# We can also decide to strip or remove leading spaces (spaces in front)
# and trailing spaces (spaces at the end) in our dataset by using the
# string strip() function covered in this example.
# ---
# Dataset = http://bit.ly/DataCleaningDataset
# ---
# OUR CODE GOES BELOW
#

# We first load our dataframe column with the intention to observe leading
# and trailing spaces in the city column
# ---
#


In [13]:
# Then later we strip the leading and trailing spaces and lastly
# confirm our changes by previewing the city column
# ---
#
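
The stripping step might look like the sketch below. The padded city values are hypothetical, made up to show the effect, since the preview above does not reveal where the real dataset has stray spaces.

```python
import pandas as pd

# Hypothetical city values with leading and trailing spaces
df = pd.DataFrame({"city": ["  LISBON", "LONDON  ", " Stockholm "]})

# Printing the values as a list makes the stray spaces visible
print(df["city"].tolist())

# Strip leading and trailing whitespace from every value
df["city"] = df["city"].str.strip()

# Confirm the change
print(df["city"].tolist())
```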


## 3. Irrelevant Data

#### <font color="blue">Tasks</font>

##### <font color="blue">Task 1</font>

In [14]:
# Task 1
# ---
# We can also delete/drop irrelevant columns/features.
# By irrelevant we mean dataset features/columns that we don't need
# to answer a research question.
# ---
# Dataset url = http://bit.ly/DataCleaningDataset
# ---
# OUR CODE GOES BELOW
#


In [15]:
# Deleting an irrelevant column, i.e. if we don't require the column city
# to answer our research question.
# ---
# While dropping/deleting the column:
# a) We set axis = 1
#    A dataframe has two axes: “axis 0” and “axis 1”.
#    “axis 0” represents rows and “axis 1” represents columns.
# b) We can also set inplace = True.
#    This means the changes would be made to the original dataframe.
# Dropping the irrelevant column, i.e. city
# 


# And preview our resulting dataset
# ---
#
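
A sketch of dropping a single column with the two settings described above, on a stand-in frame built from the Task 1 preview.

```python
import pandas as pd

# Stand-in frame with lowercase column names
df = pd.DataFrame({
    "name": ["Adi Dako", "John Paul"],
    "city": ["LISBON", "LONDON"],
    "country": ["PORTUGAL", "UNITED KINGDOM"],
})

# axis=1 targets columns; inplace=True modifies df itself and returns None
df.drop("city", axis=1, inplace=True)

# Preview the resulting dataset
print(df)
```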


In [16]:
# We can also drop multiple columns - drop height and weight columns
# ---
#

# And preview our resulting dataset
# ---
#
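
Dropping several columns works the same way with a list of labels. This sketch assigns the result back instead of using `inplace=True`, which is a stylistic choice rather than a requirement.

```python
import pandas as pd

# Stand-in frame built from the Task 1 preview
df = pd.DataFrame({
    "name": ["Adi Dako", "John Paul"],
    "height": [56, 62],
    "weight": [132.0, 165.0],
    "country": ["PORTUGAL", "UNITED KINGDOM"],
})

# Pass a list of labels to drop more than one column at once
df = df.drop(["height", "weight"], axis=1)

# Preview the resulting dataset
print(df)
```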


##### <font color="blue">Task 2</font>

In [17]:
# Task 2
# ---
# We can also fix in-record & cross-dataset errors.
# These kinds of errors result from having two or more values in the same row
# or across datasets contradicting each other.
# ---
# Dataset = http://bit.ly/DataCleaningDataset
# ---
# OUR CODE GOES BELOW
#
# Create a column that stores the sum of account_a and account_b



# Previewing our resulting dataframe
# ---
#
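
A sketch of the sum column on three rows from the Task 1 preview. The column name `account_sum` is an assumption; the notebook does not say what the new column should be called.

```python
import pandas as pd

# Stand-in frame: account values from the Task 1 preview (one account_a value is missing)
df = pd.DataFrame({
    "name": ["Adi Dako", "Cindy Jules", "Arthur Kegels"],
    "account_a": [2390.0, float("nan"), 4344.0],
    "account_b": [4340, 5504, 8999],
    "total_account": [6730, 8949, 300],
})

# Row-wise sum of the two accounts; a missing operand yields NaN
df["account_sum"] = df["account_a"] + df["account_b"]

# Preview the resulting dataframe
print(df)
```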


In [18]:
# Create another column to tell us whether the two columns match.
# We will use the numpy library, accessed through its alias np.
# ---
#


# Previewing our resulting dataframe
# ---
#
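
One way to build the match flag is `np.where`, sketched below. The column names `account_sum` and `accounts_match` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Stand-in frame with the computed sum next to the recorded total
df = pd.DataFrame({
    "name": ["Adi Dako", "Cindy Jules", "Arthur Kegels"],
    "account_sum": [6730.0, float("nan"), 13343.0],
    "total_account": [6730, 8949, 300],
})

# np.where(condition, value_if_true, value_if_false), evaluated row by row
df["accounts_match"] = np.where(df["account_sum"] == df["total_account"], True, False)

# Preview the resulting dataframe
print(df)
```

A comparison involving NaN is always false, so the record with the missing account value is flagged as a mismatch as well.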


In [19]:
# Let's now select the records which don't match
# ---
#
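
Selecting the mismatched records can be sketched with a boolean mask. The name `mismatched` is hypothetical, and the comparison is recomputed here so that the example stands on its own.

```python
import pandas as pd

# Stand-in frame: three rows from the Task 1 preview
df = pd.DataFrame({
    "name": ["Adi Dako", "Cindy Jules", "Arthur Kegels"],
    "account_a": [2390.0, float("nan"), 4344.0],
    "account_b": [4340, 5504, 8999],
    "total_account": [6730, 8949, 300],
})

# Keep only the rows where the computed sum differs from the recorded total
mask = (df["account_a"] + df["account_b"]) != df["total_account"]
mismatched = df[mask]

print(mismatched)

# len() gives the number of mismatched records
print(len(mismatched))
```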


In [20]:
# At this point we can do several things
# 1. Correct the values,
# 2. Drop/Delete the values,
# 3. Or even decide to leave them as they are for certain reasons
# ---
# If we had a large dataset, we could get the no. of records using len();
# this would help us in our decision-making process.
# ---
#


## 4. Duplicates

#### <font color="blue">Tasks</font>

##### <font color="blue">Task 1</font>

In [21]:
# Task 1
# ---
# Finding duplicate records
# -> Duplicate records are repeated records in a dataset.
# ---
# Dataset url = http://bit.ly/NBABasketballDataset
# ---
# OUR CODE GOES BELOW
#

df_nba = pd.read_csv('http://bit.ly/NBABasketballDataset')

# Again, we first explore our dataset by determining the shape of
# our dataset (records/instances, columns/variables)
# ---
#
df_nba.shape

(458, 9)

In [22]:
# We can then identify which observations are duplicates
# through the duplicated() function and sum() to know how many
# duplicate records there are.
# Normally, duplicate records are dropped from the dataset.
# But in our case we don't have any duplicate records.
# ---
#


In [23]:
# Finding the no. of duplicates
# ---
#
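
A sketch of `duplicated()` and `sum()` on a small hypothetical frame. The columns and values below are invented for illustration and are not taken from the NBA dataset, which according to the notes above contains no duplicates.

```python
import pandas as pd

# Hypothetical frame in which the third row repeats the first
df_demo = pd.DataFrame({
    "name": ["Player A", "Player B", "Player A"],
    "team": ["Team X", "Team Y", "Team X"],
})

# duplicated() marks every repeat of an earlier row as True
is_duplicate = df_demo.duplicated()
print(is_duplicate)

# Summing the boolean series counts the duplicate records
n_duplicates = is_duplicate.sum()
print(n_duplicates)
```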


##### <font color="blue">Task 2</font>

In [24]:
# Task 2
# ---
# Dropping duplicate records
# ---
# Dataset = http://bit.ly/DataCleaningDataset
# ---
# OUR CODE GOES BELOW
#

# In our previous dataset, if there were duplicates we
# could have dropped them through the use of the drop_duplicates() function
# 
# ---
#
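
A minimal sketch of `drop_duplicates()`, again on a hypothetical frame with one repeated row.

```python
import pandas as pd

# Hypothetical frame in which the third row repeats the first
df_demo = pd.DataFrame({
    "name": ["Player A", "Player B", "Player A"],
    "team": ["Team X", "Team Y", "Team X"],
})

# Keep the first occurrence of each fully identical row
df_unique = df_demo.drop_duplicates()

print(df_unique.shape)
```

By default the original index labels are preserved; adding `.reset_index(drop=True)` renumbers the rows if that is needed.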


##### <font color="blue">Task 3</font>

In [25]:
# Task 3
# ---
# Dropping duplicates in a specific column
# ---
# Dataset url = http://bit.ly/DataCleaningDataset
# ---
#

# We can also consider records that repeat a value in a given variable/column
# as duplicates and deal with them. For example, we can
# identify duplicates in our dataset based on country.
# ---
#


In [26]:
# Then dropping the duplicates
# NB: We will create a new dataframe object which will contain only unique records,
# i.e. it won't have any duplicates.
# ---
#

# Determining the size of our new dataset
# We note that the two records were dropped from our original dataset
# ---
#
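
Column-based duplicates can be handled with the `subset` parameter, as in the sketch below. The second PORTUGAL record is hypothetical, added only so that there is something to drop.

```python
import pandas as pd

# Stand-in frame; "Example Person" is a hypothetical extra record from PORTUGAL
df = pd.DataFrame({
    "name": ["Adi Dako", "John Paul", "Example Person"],
    "country": ["PORTUGAL", "UNITED KINGDOM", "PORTUGAL"],
})

# Identify records whose country already appeared in an earlier row
print(df.duplicated(subset=["country"]))

# Create a new dataframe that keeps the first record per country
df_unique = df.drop_duplicates(subset=["country"])

# Determine the size of the new dataset
print(df_unique.shape)
```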


## 5. Missing Data

#### <font color="blue">Tasks</font>

##### <font color="blue">Task 1</font>

In [27]:
# Task 1
# ---
# Finding records with missing data
# ---
# Dataset url = http://bit.ly/DataCleaningDataset
# ---
# OUR CODE GOES BELOW
#

# We can check if there are any missing values in the entire dataframe using the isnull() function
# NB: This method may not be the most convenient. Why?
# ---
#



In [28]:
# We can also check for missing values in each column
# NB: This method may not be the most convenient. Why?
#
# ---
#


In [29]:
# We can check how many missing values there are across each variable/column 
# ---
#
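
A sketch of the cell-level check and the per-column count, on three rows from the Task 1 preview where one `account_a` value is missing.

```python
import pandas as pd

# Stand-in frame: one account_a value is missing, as in the Task 1 preview
df = pd.DataFrame({
    "name": ["Adi Dako", "John Paul", "Cindy Jules"],
    "account_a": [2390.0, 4500.0, float("nan")],
    "account_b": [4340, 34334, 5504],
})

# isnull() returns a True/False frame of the same shape; hard to scan on large data
print(df.isnull())

# Summing per column gives the number of missing values in each variable
missing_per_column = df.isnull().sum()
print(missing_per_column)
```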


In [30]:
# We can also check to see if we have any missing values in the dataframe 
# ---
#


In [31]:
# Lastly, we can also get a total count of missing values
#
# ---
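
The dataframe-wide checks can be sketched as below, using the same kind of stand-in frame. The variable names `has_missing` and `total_missing` are assumptions.

```python
import pandas as pd

# Stand-in frame: one account_a value is missing, as in the Task 1 preview
df = pd.DataFrame({
    "name": ["Adi Dako", "John Paul", "Cindy Jules"],
    "account_a": [2390.0, 4500.0, float("nan")],
    "account_b": [4340, 34334, 5504],
})

# True if at least one cell anywhere in the dataframe is missing
has_missing = df.isnull().values.any()
print(has_missing)

# The first sum() counts per column, the second adds those counts together
total_missing = df.isnull().sum().sum()
print(total_missing)
```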


##### <font color="blue">Task 2</font>

In [32]:
# Task 2
# ---
# Dealing with the missing data
# ---
#

# We can drop rows where all cells in that row are NA
#
# NB: We don't have these rows in our dataset
# ---
#


In [33]:
# We can also drop columns if they only contain missing values
# NB: We don't have these columns in our dataset
# ---
#
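
A sketch of dropping fully empty rows and fully empty columns with `dropna(how="all")`. Since the real dataset has neither, the empty row and the `notes` column below are hypothetical.

```python
import pandas as pd

nan = float("nan")

# Hypothetical frame: the second row and the "notes" column are entirely missing
df = pd.DataFrame({
    "name": ["Adi Dako", None, "Cindy Jules"],
    "account_a": [2390.0, nan, nan],
    "notes": [nan, nan, nan],
})

# Drop rows in which every cell is NA
rows_kept = df.dropna(how="all")
print(rows_kept)

# Drop columns in which every cell is NA
cols_kept = df.dropna(axis=1, how="all")
print(cols_kept)
```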


In [34]:
# We can drop rows that contain fewer than five observations
# NB: We don't have these rows in our dataset
# ---
#


In [35]:
# Lastly, we can also drop the missing observations
# ---
#
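
A sketch contrasting `dropna(thresh=5)`, which keeps rows with at least five non-missing values, with a plain `dropna()`, which drops any row with a missing value. The third row is hypothetical, added so that the threshold has something to remove.

```python
import pandas as pd

nan = float("nan")

# Rows 0 and 1 come from the Task 1 preview; row 2 is a hypothetical sparse record
df = pd.DataFrame({
    "name": ["Adi Dako", "Cindy Jules", "Example Person"],
    "city": ["LISBON", "Stockholm", None],
    "country": ["PORTUGAL", "Sweden", None],
    "height": [56, 48, nan],
    "weight": [132.0, 117.0, 120.0],
    "account_a": [2390.0, nan, nan],
})

# Keep only rows with at least five non-missing values
at_least_five = df.dropna(thresh=5)
print(at_least_five)

# Drop every row that has any missing value
complete_rows = df.dropna()
print(complete_rows)
```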




##### <font color="blue">Task 3</font>

In [36]:
# Task 3
# ---
# Flag missing values
# ---
# Dataset url = http://bit.ly/DataCleaningDataset
# ---
#

# We can also fill in missing data with zeros 
# NB: First create a copy of the original dataframe
# 
# ---
#
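
Filling missing data with zeros on a copy might look like the sketch below. The name `df_filled` is an assumption; the original frame is left untouched so the two can be compared.

```python
import pandas as pd

# Stand-in frame: one account_a value is missing, as in the Task 1 preview
df = pd.DataFrame({
    "name": ["Adi Dako", "John Paul", "Cindy Jules"],
    "account_a": [2390.0, 4500.0, float("nan")],
})

# First create a copy of the original dataframe
df_filled = df.copy()

# Then replace every missing value in the copy with zero
df_filled = df_filled.fillna(0)

print(df_filled)
```

Whether zero is a sensible fill value depends on the column. For an account balance, a zero could be mistaken for a real figure, so a separate indicator column is sometimes kept alongside it.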
