#### Why Data Cleaning is Important
Without properly cleaned data, the results of any data analysis or machine learning model could be inaccurate. It is often said that data scientists spend around 80% of their time cleaning, manipulating and transforming data into the shape they need before they can carry out accurate analysis.

##### Some common problems with data:
- Column headers are variables, not variable names
- Multiple variables are stored in one column
- Variables are stored in both rows and columns
- Multiple types of observational units are stored in the same table
- A single observational unit is stored in multiple tables

In this project, we deal with and overcome some common untidy data problems.
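
To make the first problem in the list concrete, here is a minimal sketch of column headers that are really values of a variable. The table, its column names and its numbers are hypothetical and are not taken from the ride sharing dataset; `DataFrame.melt` is one common way to reshape such a table.

```python
import pandas as pd

# Hypothetical untidy table: the year columns are values of a "year" variable
untidy = pd.DataFrame({
    "station": ["A", "B"],
    "2017": [120, 95],
    "2018": [150, 110],
})

# Melt the year columns into a single "year" column, one row per observation
tidy = untidy.melt(id_vars="station", var_name="year", value_name="rides")
print(tidy)
```

After melting, each row holds one station, one year and one ride count, so the year can be filtered and grouped like any other variable.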

In [1]:
#Importing required libraries
import numpy as np
import pandas as pd
from random import randint
import matplotlib.pyplot as plt

In [2]:
#Loading the dataset into a dataframe called ride_sharing
ride_sharing = pd.read_csv('../Datasets/ride_sharing_new.csv')
#Viewing the first 3 rows of the dataset
ride_sharing.head(3)

Unnamed: 0.1,Unnamed: 0,duration,station_A_id,station_A_name,station_B_id,station_B_name,bike_id,user_type,user_birth_year,user_gender
0,0,12 minutes,81,Berry St at 4th St,323,Broadway at Kearny,5480,2,1959,Male
1,1,24 minutes,3,Powell St BART Station (Market St at 4th St),118,Eureka Valley Recreation Center,5193,2,1965,Male
2,2,8 minutes,67,San Francisco Caltrain Station 2 (Townsend St...,23,The Embarcadero at Steuart St,3652,3,1993,Male


The dataset above contains bicycle ride sharing data from San Francisco. It includes the start and end stations, the trip duration, and some user information for a bike sharing service.

The user_type column indicates the type of user (for example, whether the ride is free) and takes one of the following values:

- 1 for free riders.

- 2 for pay per ride.

- 3 for monthly subscribers.  

In [3]:
#Printing information about the ride_sharing dataset
print(ride_sharing.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25760 entries, 0 to 25759
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Unnamed: 0       25760 non-null  int64 
 1   duration         25760 non-null  object
 2   station_A_id     25760 non-null  int64 
 3   station_A_name   25760 non-null  object
 4   station_B_id     25760 non-null  int64 
 5   station_B_name   25760 non-null  object
 6   bike_id          25760 non-null  int64 
 7   user_type        25760 non-null  int64 
 8   user_birth_year  25760 non-null  int64 
 9   user_gender      25760 non-null  object
dtypes: int64(6), object(4)
memory usage: 2.0+ MB
None


It can be observed that the dataset has 10 columns, 25760 rows and mixed data types. The user_type column should be categorical, not integer.

In [4]:
#Summary statistics of the user_type column
print(ride_sharing['user_type'].describe())

count    25760.000000
mean         2.008385
std          0.704541
min          1.000000
25%          2.000000
50%          2.000000
75%          3.000000
max          3.000000
Name: user_type, dtype: float64


The summary statistics shown above are not very informative, because the mean and quartiles of a category code have no meaningful interpretation.

In [5]:
#Converting the user_type column to category and storing it in a new column 'user_type_cat'
ride_sharing['user_type_cat'] = ride_sharing['user_type'].astype('category')

#Confirming the change with an assert statement. The assert statement does nothing if the assertion is true and raises an AssertionError otherwise
assert ride_sharing['user_type_cat'].dtype == 'category'

print(ride_sharing['user_type_cat'].describe())

count     25760
unique        3
top           2
freq      12972
Name: user_type_cat, dtype: int64


The summary statistics printed now are more useful because they give the following information:
- there are 3 unique categories
- the most frequently occurring category is 2 and it occurs 12972 times
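
The category codes can also be given readable labels. The sketch below uses a small illustrative series rather than the real column, and the label names are hypothetical, chosen from the description of user_type given earlier.

```python
import pandas as pd

# Small stand-in for the user_type column (values are illustrative only)
user_type = pd.Series([2, 3, 1, 2, 2], name="user_type").astype("category")

# Hypothetical label names based on the code description in the text
labels = {1: "free", 2: "pay_per_ride", 3: "monthly_subscriber"}
user_type_labeled = user_type.cat.rename_categories(labels)

print(user_type_labeled.value_counts())
```

Renaming the categories leaves the underlying data unchanged but makes summaries such as `describe()` and `value_counts()` easier to read.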

In [6]:
#Looking at the 'duration' column in the dataset (first 5 rows)
ride_sharing['duration'][0:5]

0    12 minutes
1    24 minutes
2     8 minutes
3     4 minutes
4    11 minutes
Name: duration, dtype: object

It is observed that the duration is measured in minutes, and as data scientists we would perhaps want to perform some numerical computations such as finding the mean or the sum. This is not possible while the column is stored with a string data type. In order to recode the column to a numerical one, the 'minutes' string needs to be stripped off.

In [7]:
#Stripping off the minutes in the duration column and saving it in a new column 'duration_trim'
ride_sharing['duration_trim'] = ride_sharing['duration'].str.strip('minutes')

#Convert the new column into one with dtype of integer 
ride_sharing['duration_time'] = ride_sharing['duration_trim'].astype('int')

#Confirming the change with an assert statement
assert ride_sharing['duration_time'].dtype == 'int'

print(ride_sharing['duration_time'][0:5])
print('The average duration time is {:.2f} minutes'.format(np.mean(ride_sharing['duration_time'])))

0    12
1    24
2     8
3     4
4    11
Name: duration_time, dtype: int64
The average duration time is 11.39 minutes
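
One caveat about the stripping step: `str.strip('minutes')` treats its argument as a set of characters to remove from both ends, not as a literal suffix. It works here because every value starts with digits. The sketch below, on illustrative values in the same format, shows what the strip leaves behind and one possible alternative that captures the leading digits explicitly with `str.extract`.

```python
import pandas as pd

# Illustrative values in the same "<number> minutes" format as the duration column
duration = pd.Series(["12 minutes", "24 minutes", "8 minutes"])

# strip removes any of the characters m, i, n, u, t, e, s from both ends;
# the space before "minutes" is not in that set, so it is left in place
stripped = duration.str.strip("minutes")

# Alternative: capture the leading digits explicitly, then convert to integers
extracted = duration.str.extract(r"^(\d+)", expand=False).astype("int64")

print(extracted.tolist())
```

The trailing space left by the strip does not break the integer conversion, but the explicit pattern states the intent more directly.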


We later received some extra data on the bicycle tire sizes, stored as a list in another file. This list must be added as a column to our dataset. The name of the file is extras.py.

In [8]:
#Importing the file 
from extras import tire_sizes

ride_sharing['tire_sizes'] = tire_sizes
ride_sharing['tire_sizes'][0:5]

0    26
1    26
2    28
3    29
4    27
Name: tire_sizes, dtype: int64
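
Assigning a plain list as a column only works when the list has exactly one value per row; otherwise pandas raises a ValueError. A minimal sketch of checking this first, using a small stand-in dataframe and a hypothetical list in place of the one imported from extras:

```python
import pandas as pd

# Small stand-in for the ride_sharing dataframe
rides = pd.DataFrame({"bike_id": [5480, 5193, 3652]})

# Hypothetical extra data; in the notebook this comes from the extras module
tire_sizes = [26, 26, 28]

# A list of the wrong length would raise a ValueError, so check first
assert len(tire_sizes) == len(rides), "tire_sizes must have one value per row"
rides["tire_sizes"] = tire_sizes
print(rides)
```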

In [9]:
#Printing information about the ride_sharing dataset
print(ride_sharing.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25760 entries, 0 to 25759
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype   
---  ------           --------------  -----   
 0   Unnamed: 0       25760 non-null  int64   
 1   duration         25760 non-null  object  
 2   station_A_id     25760 non-null  int64   
 3   station_A_name   25760 non-null  object  
 4   station_B_id     25760 non-null  int64   
 5   station_B_name   25760 non-null  object  
 6   bike_id          25760 non-null  int64   
 7   user_type        25760 non-null  int64   
 8   user_birth_year  25760 non-null  int64   
 9   user_gender      25760 non-null  object  
 10  user_type_cat    25760 non-null  category
 11  duration_trim    25760 non-null  object  
 12  duration_time    25760 non-null  int64   
 13  tire_sizes       25760 non-null  int64   
dtypes: category(1), int64(8), object(5)
memory usage: 2.6+ MB
None


In [10]:
#Converting tire sizes dtype to category
ride_sharing['tire_sizes'] = ride_sharing['tire_sizes'].astype('category')

#Confirming the change with an assert statement
assert ride_sharing['tire_sizes'].dtype == 'category'

Bicycle tire sizes take only a few distinct values (the first five rows above show 26″, 27″, 28″ and 29″), so they are correctly stored here as a categorical variable. In an effort to cut maintenance costs, the ride sharing provider decided to set the maximum tire size to 27″. Let's select bicycles with tire sizes above 27 and set them to 27. Before that can be done, the tire_sizes column must be converted to integer, because ordering comparisons such as > are not defined for an unordered categorical column.

In [14]:
# Converting tire_sizes to integer
ride_sharing['tire_sizes'] = ride_sharing['tire_sizes'].astype('int')

# Set all values above 27 to 27
ride_sharing.loc[ride_sharing['tire_sizes'] > 27, 'tire_sizes'] = 27
print(ride_sharing[ride_sharing['tire_sizes'] > 27])

Empty DataFrame
Columns: [Unnamed: 0, duration, station_A_id, station_A_name, station_B_id, station_B_name, bike_id, user_type, user_birth_year, user_gender, user_type_cat, duration_trim, duration_time, tire_sizes]
Index: []
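
The same cap can also be written in a single chain with `Series.clip`. This is a sketch on illustrative values, not a replacement for the cell above; the variable names are hypothetical.

```python
import pandas as pd

# Illustrative tire sizes stored as a categorical column
tire_sizes = pd.Series([26, 26, 28, 29, 27]).astype("category")

# Convert to integers, cap at 27 with clip, then convert back to category
capped = tire_sizes.astype("int64").clip(upper=27).astype("category")

print(capped.tolist())
```

Because the result is converted to category only after clipping, it contains just the categories that remain in the data.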


In [18]:
# Reconverting tire_sizes back to categorical
ride_sharing['tire_sizes'] = ride_sharing['tire_sizes'].astype('category')

# Print tire size description
print(ride_sharing['tire_sizes'].describe())

print('The different tire sizes present are {}'.format(ride_sharing['tire_sizes'].unique()))

count     25760
unique        2
top          27
freq      19224
Name: tire_sizes, dtype: int64
The different tire sizes present are [26, 27]
Categories (2, int64): [26, 27]
