# Table of Contents
### 01. Importing Libraries
### 02. Importing Data
### 03. Data Cleaning & Consistency Checks
### 04. Exporting Dataset

# 01. Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import os

# 02. Importing Data

In [2]:
path = r'/Users/lianabulte/Career Foundry/2023 Boat Sales Analysis'

In [3]:
df = pd.read_csv(os.path.join(path, '02 Data', 'Original Data', 'boat_data.csv'))

# 03. Data Cleaning & Consistency Checks

In [4]:
# review initial data
df.head()

Unnamed: 0,Price,Boat Type,Manufacturer,Type,Year Built,Length,Width,Material,Location,Number of views last 7 days
0,CHF 3337,Motor Yacht,Rigiflex power boats,new boat from stock,2017,4.0,1.9,,Switzerland Â» Lake Geneva Â» VÃ©senaz,226
1,EUR 3490,Center console boat,Terhi power boats,new boat from stock,2020,4.0,1.5,Thermoplastic,Germany Â» BÃ¶nningstedt,75
2,CHF 3770,Sport Boat,Marine power boats,new boat from stock,0,3.69,1.42,Aluminium,Switzerland Â» Lake of Zurich Â» StÃ¤fa ZH,124
3,DKK 25900,Sport Boat,Pioner power boats,new boat from stock,2020,3.0,1.0,,Denmark Â» Svendborg,64
4,EUR 3399,Fishing Boat,Linder power boats,new boat from stock,2019,3.55,1.46,Aluminium,Germany Â» Bayern Â» MÃ¼nchen,58


In [5]:
# review shape
df.shape

(9888, 10)

In [7]:
# checking column types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9888 entries, 0 to 9887
Data columns (total 10 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Price                        9888 non-null   object 
 1   Boat Type                    9888 non-null   object 
 2   Manufacturer                 8550 non-null   object 
 3   Type                         9882 non-null   object 
 4   Year Built                   9888 non-null   int64  
 5   Length                       9879 non-null   float64
 6   Width                        9832 non-null   float64
 7   Material                     8139 non-null   object 
 8   Location                     9852 non-null   object 
 9   Number of views last 7 days  9888 non-null   int64  
dtypes: float64(2), int64(2), object(6)
memory usage: 772.6+ KB


The non-null counts differ between columns, so several columns contain missing values. The data types look appropriate.

In [9]:
# drilling down to missing data
df.isna().sum()

Price                             0
Boat Type                         0
Manufacturer                   1338
Type                              6
Year Built                        0
Length                            9
Width                            56
Material                       1749
Location                         36
Number of views last 7 days       0
dtype: int64

Manufacturer, Type, and Material are categorical, so their missing values cannot be sensibly imputed. They are replaced with 'Not Disclosed' to preserve as much of the data as possible.

In [11]:
df[['Manufacturer','Type','Material']] = df[['Manufacturer','Type','Material']].fillna('Not Disclosed')

In [12]:
# verifying the above worked
df.isna().sum()

Price                           0
Boat Type                       0
Manufacturer                    0
Type                            0
Year Built                      0
Length                          9
Width                          56
Material                        0
Location                       36
Number of views last 7 days     0
dtype: int64

The remaining NaNs in Length, Width, and Location affect only a small portion of the dataset, so the rows containing them are removed.
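Before dropping rows, it can help to quantify how many are actually affected, since the per-column NaN counts may overlap on the same row. The following is a minimal sketch on a made-up toy frame (the values are illustrative, not taken from the boat data); `rows_with_nan`, `n_affected`, and `share_affected` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the boat data (values are made up for illustration)
toy = pd.DataFrame({
    'Length': [4.0, np.nan, 3.69, 3.0],
    'Width': [1.9, 1.5, np.nan, 1.0],
    'Location': ['Denmark', 'Germany', None, 'Italy'],
})

# True for every row that has at least one NaN in the columns of interest
rows_with_nan = toy[['Length', 'Width', 'Location']].isna().any(axis=1)

n_affected = int(rows_with_nan.sum())
share_affected = n_affected / len(toy)
print(f'{n_affected} of {len(toy)} rows affected ({share_affected:.1%})')
```

Run against the real frame, the same mask gives the number of rows that `dropna()` would remove.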

In [14]:
#dropping remaining rows with NaN values
df2 = df.dropna()

In [15]:
# verifying the above worked
df2.isna().sum()

Price                          0
Boat Type                      0
Manufacturer                   0
Type                           0
Year Built                     0
Length                         0
Width                          0
Material                       0
Location                       0
Number of views last 7 days    0
dtype: int64

In [17]:
# verifying total rows have decreased
df2.shape

(9796, 10)

In [21]:
# reviewing data for numerical column year built
df2['Year Built'].describe()

count    9796.000000
mean     1892.353205
std       461.835773
min         0.000000
25%      1996.000000
50%      2007.000000
75%      2017.000000
max      2021.000000
Name: Year Built, dtype: float64

The statistics above show a minimum of 0, so some boats have 0 recorded as the year built. The next step is to check how many rows are affected.

In [25]:
# identifying what the total counts are per year
df2['Year Built'].value_counts()

2020    1277
2019     660
0        550
2008     454
2007     389
        ... 
1914       1
1895       1
1885       1
1931       1
1900       1
Name: Year Built, Length: 122, dtype: int64

There are 550 rows with 0 as the year built. These rows are removed so that they do not distort later analysis.
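Because `value_counts()` truncates its display for long results, the number of zero-year rows can also be read off directly with a boolean mask. This is a small sketch on made-up values; `zero_year`, `n_zero`, and `share_zero` are hypothetical names.

```python
import pandas as pd

# Toy column standing in for 'Year Built' (values are made up for illustration)
toy = pd.DataFrame({'Year Built': [2017, 0, 2020, 0, 1999]})

# Boolean mask marking placeholder years
zero_year = toy['Year Built'] == 0

n_zero = int(zero_year.sum())   # number of rows with year 0
share_zero = zero_year.mean()   # fraction of rows with year 0
print(n_zero, share_zero)
```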

In [27]:
# removing rows with 0 value for the year built column
# (filtering into a new copy so the change is not applied to a slice of another frame)
df2 = df2[df2['Year Built'] != 0].copy()


In [28]:
# checking the rows were removed
df2.shape

(9246, 10)
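Depending on the pandas version, in-place operations on a frame that was derived from another frame can emit a `SettingWithCopyWarning`. One common way to sidestep it is to filter with a boolean mask and take an explicit `.copy()`, so that later edits clearly act on an independent frame. The sketch below uses made-up values; `toy` and `clean` are hypothetical names.

```python
import pandas as pd

# Toy frame standing in for the boat data (values are made up for illustration)
toy = pd.DataFrame({
    'Year Built': [2017, 0, 2020],
    'Length': [4.0, 3.69, 3.0],
})

# Keep only rows with a real year and detach the result from `toy`
clean = toy[toy['Year Built'] != 0].copy()

# Later in-place edits act on `clean` only; `toy` is left untouched
clean.rename(columns={'Year Built': 'Year'}, inplace=True)
print(clean)
```

The original row labels are kept, so the index of `clean` has gaps where rows were removed.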

In [29]:
# reviewing data for numerical column year built
df2['Year Built'].describe()

count    9246.000000
mean     2004.920182
std        16.406383
min      1885.000000
25%      1999.000000
50%      2008.000000
75%      2018.000000
max      2021.000000
Name: Year Built, dtype: float64

In [30]:
#checking updated data set
df2.head(15)

Unnamed: 0,Price,Boat Type,Manufacturer,Type,Year Built,Length,Width,Material,Location,Number of views last 7 days
0,CHF 3337,Motor Yacht,Rigiflex power boats,new boat from stock,2017,4.0,1.9,Not Disclosed,Switzerland Â» Lake Geneva Â» VÃ©senaz,226
1,EUR 3490,Center console boat,Terhi power boats,new boat from stock,2020,4.0,1.5,Thermoplastic,Germany Â» BÃ¶nningstedt,75
3,DKK 25900,Sport Boat,Pioner power boats,new boat from stock,2020,3.0,1.0,Not Disclosed,Denmark Â» Svendborg,64
4,EUR 3399,Fishing Boat,Linder power boats,new boat from stock,2019,3.55,1.46,Aluminium,Germany Â» Bayern Â» MÃ¼nchen,58
6,CHF 3600,Catamaran,Not Disclosed,"Used boat,Unleaded",1999,6.2,2.38,Aluminium,Switzerland Â» Neuenburgersee Â» Yvonand,474
8,EUR 3333,Fishing Boat,Crescent power boats,new boat from stock,2019,3.64,1.37,Not Disclosed,Germany Â» Bayern Â» Boote+service Oberbayern,45
9,EUR 3300,Pontoon Boat,Whaly power boats,new boat from stock,2018,4.35,1.73,Not Disclosed,Italy Â» Dormelletto,180
10,CHF 3500,Fishing Boat,Terhi power boats,"Used boat,Electric",1987,4.35,1.75,GRP,Switzerland Â» Seengen,239
12,EUR 3500,Sport Boat,GS Nautica power boats,Used boat,2004,4.7,2.0,GRP,Italy Â» Lake Garda Â» Moniga del Garda (BS),69
13,CHF 4600,Runabout,Kimple power boats,new boat from stock,2020,4.4,1.65,Aluminium,Switzerland Â» Zugersee Â» Neuheim,113


In [31]:
# Changing 'Type' header to 'Boat Condition' to clarify column
df2 = df2.rename(columns={'Type': 'Boat Condition'})


In [32]:
# Changing 'Number of views last 7 days' header to 'Views from last 7 days' to clarify column
df2 = df2.rename(columns={'Number of views last 7 days': 'Views from last 7 days'})


In [33]:
# checking the update worked
df2.head()

Unnamed: 0,Price,Boat Type,Manufacturer,Boat Condition,Year Built,Length,Width,Material,Location,Views from last 7 days
0,CHF 3337,Motor Yacht,Rigiflex power boats,new boat from stock,2017,4.0,1.9,Not Disclosed,Switzerland Â» Lake Geneva Â» VÃ©senaz,226
1,EUR 3490,Center console boat,Terhi power boats,new boat from stock,2020,4.0,1.5,Thermoplastic,Germany Â» BÃ¶nningstedt,75
3,DKK 25900,Sport Boat,Pioner power boats,new boat from stock,2020,3.0,1.0,Not Disclosed,Denmark Â» Svendborg,64
4,EUR 3399,Fishing Boat,Linder power boats,new boat from stock,2019,3.55,1.46,Aluminium,Germany Â» Bayern Â» MÃ¼nchen,58
6,CHF 3600,Catamaran,Not Disclosed,"Used boat,Unleaded",1999,6.2,2.38,Aluminium,Switzerland Â» Neuenburgersee Â» Yvonand,474
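Both headers can also be renamed in a single call by passing one mapping to `rename`. A minimal sketch on a made-up one-row frame (`toy` and `renamed` are hypothetical names):

```python
import pandas as pd

# Toy frame with the two original headers (values are made up for illustration)
toy = pd.DataFrame({
    'Type': ['Used boat'],
    'Number of views last 7 days': [69],
})

# One mapping covers both renames; rename returns a new frame
renamed = toy.rename(columns={
    'Type': 'Boat Condition',
    'Number of views last 7 days': 'Views from last 7 days',
})
print(renamed.columns.tolist())
```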


In [35]:
# dividing the price column to separate the currency from the numeric amount
price_parts = df2['Price'].str.split(' ', expand=True)
df2 = df2.assign(Currency=price_parts[0], Cost=price_parts[1])


In [36]:
df2.head()

Unnamed: 0,Price,Boat Type,Manufacturer,Boat Condition,Year Built,Length,Width,Material,Location,Views from last 7 days,Currency,Cost
0,CHF 3337,Motor Yacht,Rigiflex power boats,new boat from stock,2017,4.0,1.9,Not Disclosed,Switzerland Â» Lake Geneva Â» VÃ©senaz,226,CHF,3337
1,EUR 3490,Center console boat,Terhi power boats,new boat from stock,2020,4.0,1.5,Thermoplastic,Germany Â» BÃ¶nningstedt,75,EUR,3490
3,DKK 25900,Sport Boat,Pioner power boats,new boat from stock,2020,3.0,1.0,Not Disclosed,Denmark Â» Svendborg,64,DKK,25900
4,EUR 3399,Fishing Boat,Linder power boats,new boat from stock,2019,3.55,1.46,Aluminium,Germany Â» Bayern Â» MÃ¼nchen,58,EUR,3399
6,CHF 3600,Catamaran,Not Disclosed,"Used boat,Unleaded",1999,6.2,2.38,Aluminium,Switzerland Â» Neuenburgersee Â» Yvonand,474,CHF,3600
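The split relies on every price having exactly one space between the currency code and the amount. A slightly more defensive sketch, assuming that same 'CODE amount' layout, limits the split to the first space with `n=1` so the result always has two columns. The frame below is a toy example; `parts` and `result` are hypothetical names.

```python
import pandas as pd

# Toy prices in the same 'CODE amount' layout as the Price column
toy = pd.DataFrame({'Price': ['CHF 3337', 'EUR 3490', 'DKK 25900']})

# Split on the first space only, so at most two pieces are produced
parts = toy['Price'].str.split(' ', n=1, expand=True)
parts.columns = ['Currency', 'Cost']

# Attach the new columns; Cost is still text at this point
result = toy.join(parts)
print(result)
```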


In [38]:
# check for duplicates
df2_dups = df2[df2.duplicated()]

In [39]:
df2_dups

Unnamed: 0,Price,Boat Type,Manufacturer,Boat Condition,Year Built,Length,Width,Material,Location,Views from last 7 days,Currency,Cost


The result is empty, so there are no duplicate rows.
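Instead of inspecting the filtered frame visually, the duplicate count can be checked as a single number. A minimal sketch on made-up rows (`n_dups` and `deduped` are hypothetical names):

```python
import pandas as pd

# Toy frame with one fully repeated row (values are made up for illustration)
toy = pd.DataFrame({
    'Boat Type': ['Sport Boat', 'Sport Boat', 'Catamaran'],
    'Cost': [3500, 3500, 3600],
})

# duplicated() flags every repeat of an earlier, identical row
n_dups = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row
deduped = toy.drop_duplicates()
print(n_dups, len(deduped))
```

A count of 0 on the real frame corresponds to the empty result shown above.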

In [41]:
# reverifying cleaned dataset 
df2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9246 entries, 0 to 9887
Data columns (total 12 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Price                   9246 non-null   object 
 1   Boat Type               9246 non-null   object 
 2   Manufacturer            9246 non-null   object 
 3   Boat Condition          9246 non-null   object 
 4   Year Built              9246 non-null   int64  
 5   Length                  9246 non-null   float64
 6   Width                   9246 non-null   float64
 7   Material                9246 non-null   object 
 8   Location                9246 non-null   object 
 9   Views from last 7 days  9246 non-null   int64  
 10  Currency                9246 non-null   object 
 11  Cost                    9246 non-null   object 
dtypes: float64(2), int64(2), object(8)
memory usage: 939.0+ KB


In [46]:
# changing 'Cost' data type from object to integer
df2 = df2.astype({'Cost': 'int64'})


In [47]:
#validating change in dtypes
df2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9246 entries, 0 to 9887
Data columns (total 12 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Price                   9246 non-null   object 
 1   Boat Type               9246 non-null   object 
 2   Manufacturer            9246 non-null   object 
 3   Boat Condition          9246 non-null   object 
 4   Year Built              9246 non-null   int64  
 5   Length                  9246 non-null   float64
 6   Width                   9246 non-null   float64
 7   Material                9246 non-null   object 
 8   Location                9246 non-null   object 
 9   Views from last 7 days  9246 non-null   int64  
 10  Currency                9246 non-null   object 
 11  Cost                    9246 non-null   int64  
dtypes: float64(2), int64(3), object(7)
memory usage: 939.0+ KB
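`astype('int64')` raises an error if any value in the column is not a clean integer string. When that is a possibility, `pd.to_numeric` with `errors='coerce'` can be used first to locate the offending values. The sketch below uses a made-up series with one deliberately bad entry; `cost_text`, `cost_num`, and `bad` are hypothetical names.

```python
import pandas as pd

# Toy text column; 'n/a' is a deliberately invalid entry for illustration
cost_text = pd.Series(['3337', '3490', 'n/a'])

# Values that cannot be parsed become NaN instead of raising
cost_num = pd.to_numeric(cost_text, errors='coerce')

# Original text of the entries that failed to parse
bad = cost_text[cost_num.isna()]
print(bad.tolist())
```

If `bad` is empty, the direct integer conversion is safe.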


In [48]:
# removing 'Price' column from data set
df2 = df2.drop(columns=['Price'])


In [49]:
df2.head()

Unnamed: 0,Boat Type,Manufacturer,Boat Condition,Year Built,Length,Width,Material,Location,Views from last 7 days,Currency,Cost
0,Motor Yacht,Rigiflex power boats,new boat from stock,2017,4.0,1.9,Not Disclosed,Switzerland Â» Lake Geneva Â» VÃ©senaz,226,CHF,3337
1,Center console boat,Terhi power boats,new boat from stock,2020,4.0,1.5,Thermoplastic,Germany Â» BÃ¶nningstedt,75,EUR,3490
3,Sport Boat,Pioner power boats,new boat from stock,2020,3.0,1.0,Not Disclosed,Denmark Â» Svendborg,64,DKK,25900
4,Fishing Boat,Linder power boats,new boat from stock,2019,3.55,1.46,Aluminium,Germany Â» Bayern Â» MÃ¼nchen,58,EUR,3399
6,Catamaran,Not Disclosed,"Used boat,Unleaded",1999,6.2,2.38,Aluminium,Switzerland Â» Neuenburgersee Â» Yvonand,474,CHF,3600


In [51]:
df2.shape

(9246, 11)

# 04. Exporting Dataset

In [52]:
#exporting cleaned data set
df2.to_csv(os.path.join(path, '02 Data', 'Prepared Data', 'clean_boat_data.csv'))
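By default `to_csv` also writes the row index, which shows up as an extra `Unnamed: 0` column when the file is read back with `read_csv`. Whether that is wanted depends on the later analysis. The sketch below compares both options on a made-up frame, writing to in-memory buffers rather than to disk (all names are hypothetical).

```python
import io
import pandas as pd

# Toy frame with a gapped index, as left behind after dropping rows
toy = pd.DataFrame({'Currency': ['CHF', 'EUR'], 'Cost': [3337, 3490]}, index=[0, 3])

# Default export: the index is written as an unnamed first column
buf_with = io.StringIO()
toy.to_csv(buf_with)

# index=False: only the data columns are written
buf_without = io.StringIO()
toy.to_csv(buf_without, index=False)

with_index = pd.read_csv(io.StringIO(buf_with.getvalue()))
without_index = pd.read_csv(io.StringIO(buf_without.getvalue()))
print(with_index.columns.tolist())
print(without_index.columns.tolist())
```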