<h1 style="text-align:center;">Data Cleaning Practice</h1>

- Let's begin by importing the necessary libraries and the dataset:

In [326]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv("all_bikez_curated.csv")
df.sample(5)

Unnamed: 0,Brand,Model,Year,Category,Displacement (ccm),Power (hp),Engine cylinder,Engine stroke,Gearbox,Fuel capacity (lts),Fuel system,Fuel control,Cooling system,Transmission type,Dry weight (kg),Wheelbase (mm),Seat height (mm)
2309,benelli,752 s,2018,Naked bike,750.0,81.6,Twin,four-stroke,6-speed,15.0,Injection,,Liquid,Chain,,,
19926,kreidler,supermoto 250 dd,2012,Super motard,250.0,19.7,Single cylinder,four-stroke,6-speed,10.5,Carburettor,,Liquid,Chain,144.0,1450.0,860.0
3677,bmw,k 75 rt,1991,Touring,740.0,75.0,In-line three,four-stroke,5-speed,21.0,,Double Overhead Cams/Twin Cam (DOHC),Liquid,Shaft drive,,,
24487,mv agusta,brutale 675,2015,Naked bike,675.0,108.0,In-line three,four-stroke,6-speed,16.6,Injection,Double Overhead Cams/Twin Cam (DOHC),Liquid,Chain,167.0,1380.0,810.0
35866,yamaha,royal star venture 1300,2005,Touring,1294.0,98.0,V4,four-stroke,5-speed,22.71,Carburettor,Double Overhead Cams/Twin Cam (DOHC),Liquid,Shaft drive,,1704.0,749.0


### 💾 The data:
- "Brand" - brand name of the motorcycle.
- "Model" - model name of the motorcycle.
- "Year" - year the motorcycle was built.
- "Category" - sub-class the motorcycle belongs to in the market (style of motorcycle).
- "Displacement (ccm)" - engine size of the motorcycle in cubic centimeters (ccm).
- "Power (hp)" - maximum power output in horsepower (hp).
- "Engine cylinder" - number of cylinders in the engine and their configuration.
- "Engine stroke" - number of strokes needed to complete one power cycle of the engine.
- "Gearbox" - number of gears in the transmission.
- "Transmission type" - type of final drive of the motorcycle (for example chain or shaft drive).
- "Dry weight (kg)" - weight of the motorcycle, without any fluids, in kilograms (kg).
- "Wheelbase (mm)" - distance between the points where the front and rear wheels touch the ground, in millimeters (mm).
- "Fuel capacity (lts)" - maximum capacity of the fuel tank in liters (lts).
- "Fuel system" - fuel delivery system into the engine.
- "Fuel control" - valve configuration of the engine.
- "Seat height (mm)" - height of the seat above the ground in millimeters (mm).
- "Cooling system" - engine cooling system.



## Inspecting and Cleaning

In [327]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38543 entries, 0 to 38542
Data columns (total 17 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Brand                38543 non-null  object 
 1   Model                38515 non-null  object 
 2   Year                 38543 non-null  object 
 3   Category             38543 non-null  object 
 4   Displacement (ccm)   37514 non-null  float64
 5   Power (hp)           26162 non-null  float64
 6   Engine cylinder      38527 non-null  object 
 7   Engine stroke        38532 non-null  object 
 8   Gearbox              32733 non-null  object 
 9   Fuel capacity (lts)  31747 non-null  float64
 10  Fuel system          27893 non-null  object 
 11  Fuel control         22032 non-null  object 
 12  Cooling system       34322 non-null  object 
 13  Transmission type    32912 non-null  object 
 14  Dry weight (kg)      22534 non-null  float64
 15  Wheelbase (mm)       25530 non-null 

In [328]:
df.columns = ['brand', 'model', 'year', 'category', 'displacement', 'power',
              'engine_cylinder', 'engine_stroke', 'gear_box', 'fuel_capacity',
              'fuel_system', 'fuel_control', 'cooling_system', 'trans_type',
              'dry_weight', 'wheelbase', 'seat_height']
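
As an alternative to typing the list by hand, the new names could be derived from the original headers. The sketch below assumes a hypothetical helper, `to_snake_case`, and a small demo frame. It produces names such as `displacement_ccm`, which differ from the shorter names chosen in this notebook.

```python
import re
import pandas as pd

def to_snake_case(name: str) -> str:
    """Hypothetical helper: lower-case a header and join its words with underscores."""
    words = re.findall(r"[A-Za-z0-9]+", name)
    return "_".join(word.lower() for word in words)

# Demo frame with a few of the original headers (no rows needed)
demo = pd.DataFrame(columns=["Brand", "Displacement (ccm)", "Fuel capacity (lts)"])
demo = demo.rename(columns=to_snake_case)
print(list(demo.columns))
```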

In [329]:
df.sample()

Unnamed: 0,brand,model,year,category,displacement,power,engine_cylinder,engine_stroke,gear_box,fuel_capacity,fuel_system,fuel_control,cooling_system,trans_type,dry_weight,wheelbase,seat_height
13235,honda,crf450x,2017,Enduro / offroad,449.0,44.5,Single cylinder,four-stroke,5-speed,7.19,Carburettor. Keihin® 40mm flat-slide carbureto...,Single Overhead Cams (SOHC),Liquid,Chain,,1481.0,963.0


In [330]:
df.year.unique()

array(['2011', '2007', '2021', '2016', '2018', '2020', '2022', '1923',
       '1924', '1925', '1926', '1927', '2009', '2010', '2014', '2008',
       '2019', '2012', '2013', '2040', '1957', '1958', '1955', '1956',
       '1952', '1953', '1954', '1959', '1960', '2077', '2003', '3019',
       '1961', '1962', '1963', '1964', '1965', '1966', '1967', '1968',
       '1969', '1971', '1972', '1970', '1974', '1962 CE', '1963 CE',
       '2005', '2015', '2018 CE', '2006', '1922', '1928', '1949', '1950',
       '1951', '2001', '2017', '1986', '1999', '1985', '1987', '1996',
       '1991', '1988', '1989', '1990', '1992', '2004', '2002', '1997',
       '1998', '2000', '1993', '1984', '1995', '1948', '1937', '1938',
       '1931', '1901', '1902', '1903', '1913', '1914', '1915', '1916',
       '1918', '1932', '1933', '1934', '1935', '1936', '1939', '1940',
       '1941', '1942', '1943', '1944', '1945', '1946', '1947', '1898',
       '1910', '1929', '1975', '1976', '1973', '1977', '1978', '1979',
     

The column is stored as text: some entries carry a ' CE' suffix and a few years are impossible (2040, 2077, 3019), so it needs cleaning before it can be converted to integer.

In [331]:
# Remove the ' CE' suffix (plain text match, not a regular expression)
df.year = df.year.astype('str').str.replace(' CE', '', regex=False)
# Correct the mistyped years
df.year = df.year.replace({'3019': '2019', '2077': '2017', '2040': '2004'})
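
The same clean-up can be sketched on a toy Series. The values are made up to mimic the problems seen in the column, and the plausible range of 1885 to 2024 is an assumption chosen for the demo.

```python
import pandas as pd

# Toy values that mimic the problems seen in the column
years = pd.Series(["2011", "1962 CE", "3019", "2077", "1998"])

# Keep only the four-digit year and convert it to a number
cleaned = pd.to_numeric(years.str.extract(r"(\d{4})")[0])

# Flag values outside the assumed plausible range instead of silently keeping them
suspicious = cleaned[(cleaned < 1885) | (cleaned > 2024)]
print(suspicious.tolist())
```

Flagged values still need a manual decision, as with the three corrections made in this notebook.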

In [332]:
df.year.unique()

array(['2011', '2007', '2021', '2016', '2018', '2020', '2022', '1923',
       '1924', '1925', '1926', '1927', '2009', '2010', '2014', '2008',
       '2019', '2012', '2013', '2004', '1957', '1958', '1955', '1956',
       '1952', '1953', '1954', '1959', '1960', '2017', '2003', '1961',
       '1962', '1963', '1964', '1965', '1966', '1967', '1968', '1969',
       '1971', '1972', '1970', '1974', '2005', '2015', '2006', '1922',
       '1928', '1949', '1950', '1951', '2001', '1986', '1999', '1985',
       '1987', '1996', '1991', '1988', '1989', '1990', '1992', '2002',
       '1997', '1998', '2000', '1993', '1984', '1995', '1948', '1937',
       '1938', '1931', '1901', '1902', '1903', '1913', '1914', '1915',
       '1916', '1918', '1932', '1933', '1934', '1935', '1936', '1939',
       '1940', '1941', '1942', '1943', '1944', '1945', '1946', '1947',
       '1898', '1910', '1929', '1975', '1976', '1973', '1977', '1978',
       '1979', '1980', '1982', '1983', '1981', '1917', '1919', '1994',
      

In [333]:
df.year = df.year.astype('int')

print((df.year > 2024).any())   # any year in the future?
print(df.year.isna().any())     # any missing year?

False
False
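
A comparison with `==` cannot find missing values, because NaN is not equal to anything, including itself. A small sketch on made-up values shows the difference:

```python
import numpy as np
import pandas as pd

values = pd.Series([2011.0, np.nan, 1998.0])

# Equality against NaN is always False, even for the NaN entry
found_by_equality = (values == np.nan).any()

# isna() is the reliable check for missing values
found_by_isna = values.isna().any()

print(found_by_equality, found_by_isna)
```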


In [334]:
df.category.value_counts(dropna=False)

Scooter                      6704
Sport                        5557
Enduro / offroad             4273
Custom / cruiser             4161
Naked bike                   3251
Allround                     3141
Classic                      1885
Super motard                 1634
Touring                      1573
ATV                          1489
Sport touring                1364
Cross / motocross            1243
Unspecified category          806
Trial                         560
Minibike, cross               529
Prototype / concept model     210
Minibike, sport               142
Speedway                       21
Name: category, dtype: int64

In [335]:
df.category = df.category.astype('category')
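
Converting a repetitive text column to `category` stores each distinct label once and keeps a small integer code per row, which usually reduces memory. A sketch on made-up data:

```python
import pandas as pd

# 30,000 rows but only three distinct labels
labels = pd.Series(["Scooter", "Sport", "Touring"] * 10_000)
as_category = labels.astype("category")

# deep=True counts the actual string storage, not just the pointers
object_bytes = labels.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(object_bytes, category_bytes)
```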

In [336]:
df.engine_cylinder.value_counts(dropna=False)

Single cylinder         20717
V2                       7456
Twin                     3392
In-line four             3169
Electric                  997
Two cylinder boxer        936
In-line three             852
V4                        459
Six cylinder boxer        137
In-line six               114
V8                         79
Four cylinder boxer        53
V6                         42
Diesel                     37
Square four cylinder       33
Gas turbine                19
NaN                        16
Dual disk Wankel           13
Radial                     10
Single disk Wankel          8
V3                          3
V10                         1
Name: engine_cylinder, dtype: int64

In [337]:
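# Calling dropna() on a single column does not remove rows from df.
# The 16 rows with a missing engine_cylinder are removed later by df.dropna().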

In [338]:
df.engine_stroke.value_counts(dropna=False)

 four-stroke            30721
 two-stroke              6694
Electric                  997
Diesel                     37
Square four cylinder       33
Gas turbine                19
Dual disk Wankel           13
NaN                        11
Radial                     10
Single disk Wankel          8
Name: engine_stroke, dtype: int64

In [339]:
# Remove the leading space and capitalise the two stroke labels
df.engine_stroke = df.engine_stroke.replace({' four-stroke': 'Four-stroke', ' two-stroke': 'Two-stroke'})
# The 11 rows with a missing engine_stroke are removed later by df.dropna().
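
Calling `dropna` on a single column returns a shorter Series but leaves the DataFrame's rows in place. To remove the rows themselves, `DataFrame.dropna` with `subset` can be used. A sketch on a made-up frame:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "engine_cylinder": ["Twin", np.nan, "V4"],
    "engine_stroke": ["Four-stroke", "Two-stroke", np.nan],
})

# Series-level dropna: the frame still has all three rows
shorter_series = demo["engine_cylinder"].dropna()

# Frame-level dropna restricted to the two columns of interest
demo_clean = demo.dropna(subset=["engine_cylinder", "engine_stroke"])
print(len(demo), len(shorter_series), len(demo_clean))
```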

In [340]:
df.gear_box.value_counts(dropna=False)

6-speed                 11912
5-speed                 11002
Automatic                6050
NaN                      5810
4-speed                  2799
3-speed                   478
1-speed                   289
2-speed                    93
4-speed with reverse       49
7-speed                    24
100-speed                  15
2-speed automatic           8
5-speed with reverse        5
10-speed                    4
8-speed                     3
6-speed with reverse        1
3-speed automatic           1
Name: gear_box, dtype: int64

In [341]:
# Treat the rare 8-, 10- and 100-speed entries as 7-speed
df.gear_box = df.gear_box.replace(['100-speed', '8-speed', '10-speed'], '7-speed')

In [342]:
df[df.power.isnull()]

Unnamed: 0,brand,model,year,category,displacement,power,engine_cylinder,engine_stroke,gear_box,fuel_capacity,fuel_system,fuel_control,cooling_system,trans_type,dry_weight,wheelbase,seat_height
28,access,xtreme s 480,2021,ATV,449.0,,Single cylinder,Four-stroke,Automatic,14.0,,Overhead Cams (OHC),Liquids,Chain,236.0,,
30,ace,exp-4,1923,Sport,1299.0,,In-line four,Four-stroke,,,Carburettor,,,Chain,,,
31,ace,standard,1924,Allround,1234.0,,Twin,Four-stroke,,,Carburettor,,,Chain,165.0,,
32,ace,standard,1925,Allround,1234.0,,Twin,Four-stroke,,,Carburettor,,,Chain,165.0,,
33,ace,standard,1926,Allround,1234.0,,Twin,Four-stroke,,,Carburettor,,,Chain,165.0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
38506,beta,alp 125,2006,Trial,124.0,,Single cylinder,Four-stroke,5-speed,6.8,Carburettor. Mikuni UCAL,Overhead Cams (OHC),Air,Chain,101.0,1350.0,850.0
38507,beta,alp 125,2007,Trial,124.0,,Single cylinder,Four-stroke,5-speed,6.8,Carburettor. Mikuni UCAL 5Nh ø26-38,,Air,Chain,101.0,1350.0,850.0
38509,nipponia,status,2021,Scooter,299.0,,Single cylinder,Four-stroke,4-speed,12.7,Carburettor,,Air,,200.0,,710.0
38533,avon,e-mate,2015,Scooter,,,Electric,Electric,Automatic,,,,,,,,


In [343]:
# Use 0 as a placeholder for unknown power
df.power = df.power.fillna(0)
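
Filling missing power with 0 makes the column complete, but the zeros then count as real measurements in any statistic. The sketch below, on made-up numbers, shows the effect on the mean. Whether that matters depends on the later analysis.

```python
import numpy as np
import pandas as pd

power = pd.Series([80.0, np.nan, 100.0, np.nan])

mean_ignoring_missing = power.mean()            # NaN values are skipped
mean_after_zero_fill = power.fillna(0).mean()   # zeros pull the mean down
print(mean_ignoring_missing, mean_after_zero_fill)
```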

In [344]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38543 entries, 0 to 38542
Data columns (total 17 columns):
 #   Column           Non-Null Count  Dtype   
---  ------           --------------  -----   
 0   brand            38543 non-null  object  
 1   model            38515 non-null  object  
 2   year             38543 non-null  int32   
 3   category         38543 non-null  category
 4   displacement     37514 non-null  float64 
 5   power            38543 non-null  float64 
 6   engine_cylinder  38527 non-null  object  
 7   engine_stroke    38532 non-null  object  
 8   gear_box         32733 non-null  object  
 9   fuel_capacity    31747 non-null  float64 
 10  fuel_system      27893 non-null  object  
 11  fuel_control     22032 non-null  object  
 12  cooling_system   34322 non-null  object  
 13  trans_type       32912 non-null  object  
 14  dry_weight       22534 non-null  float64 
 15  wheelbase        25530 non-null  float64 
 16  seat_height      24225 non-null  float64

In [345]:
# Fill categorical gaps with the most frequent value (mode) and the numeric gap with the median.
# Series.mode() returns a Series, so take its first entry as the fill value.
df.gear_box = df.gear_box.fillna(df.gear_box.mode()[0])
df.fuel_capacity = df.fuel_capacity.fillna(df.fuel_capacity.median())
df.cooling_system = df.cooling_system.fillna(df.cooling_system.mode()[0])
df.trans_type = df.trans_type.fillna(df.trans_type.mode()[0])

# Drop the columns with too many missing values
df.drop(columns=['dry_weight', 'wheelbase', 'seat_height', 'fuel_system', 'fuel_control'], inplace=True)

In [346]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38543 entries, 0 to 38542
Data columns (total 12 columns):
 #   Column           Non-Null Count  Dtype   
---  ------           --------------  -----   
 0   brand            38543 non-null  object  
 1   model            38515 non-null  object  
 2   year             38543 non-null  int32   
 3   category         38543 non-null  category
 4   displacement     37514 non-null  float64 
 5   power            38543 non-null  float64 
 6   engine_cylinder  38527 non-null  object  
 7   engine_stroke    38532 non-null  object  
 8   gear_box         38543 non-null  object  
 9   fuel_capacity    38543 non-null  float64 
 10  cooling_system   38543 non-null  object  
 11  trans_type       38543 non-null  object  
dtypes: category(1), float64(3), int32(1), object(7)
memory usage: 3.1+ MB


In [347]:
# Remove every row that still has a missing value
df.dropna(inplace=True)

In [349]:
df.engine_cylinder = df.engine_cylinder.astype('category')
df.engine_stroke = df.engine_stroke.astype('category')
df.gear_box = df.gear_box.astype('category')

In [350]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 37470 entries, 1 to 38542
Data columns (total 12 columns):
 #   Column           Non-Null Count  Dtype   
---  ------           --------------  -----   
 0   brand            37470 non-null  object  
 1   model            37470 non-null  object  
 2   year             37470 non-null  int32   
 3   category         37470 non-null  category
 4   displacement     37470 non-null  float64 
 5   power            37470 non-null  float64 
 6   engine_cylinder  37470 non-null  category
 7   engine_stroke    37470 non-null  category
 8   gear_box         37470 non-null  category
 9   fuel_capacity    37470 non-null  float64 
 10  cooling_system   37470 non-null  object  
 11  trans_type       37470 non-null  object  
dtypes: category(4), float64(3), int32(1), object(4)
memory usage: 2.6+ MB


### Recap
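- We replaced the column names with shorter snake_case names.
- We fixed the 'year' column by removing the ' CE' suffix, correcting the mistyped years (3019, 2077, 2040) and converting it to integer.
- We converted 'category', 'engine_cylinder', 'engine_stroke' and 'gear_box' to the category data type.
- We normalized the 'engine_stroke' labels and replaced the rare 8-, 10- and 100-speed 'gear_box' entries with '7-speed'.
- We filled missing values of 'power' with 0.
- We filled missing values of 'gear_box', 'cooling_system' and 'trans_type' with the mode of each column.
- We filled missing values of 'fuel_capacity' with the column median.
- We dropped the 'dry_weight', 'wheelbase', 'seat_height', 'fuel_system' and 'fuel_control' columns as they have too many missing values.
- We dropped rows with missing values in 'displacement', 'model', 'engine_cylinder' and 'engine_stroke'.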
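
The mode and median fills listed above can be sketched as follows. The frame is made up for illustration. `Series.mode()` returns a Series because there may be ties, so the sketch takes its first entry.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "gear_box": ["6-speed", "6-speed", np.nan, "5-speed"],
    "fuel_capacity": [10.0, np.nan, 14.0, 22.0],
})

# Categorical column: fill with the most frequent value
demo["gear_box"] = demo["gear_box"].fillna(demo["gear_box"].mode()[0])

# Numeric column: fill with the median, which keeps the float dtype
demo["fuel_capacity"] = demo["fuel_capacity"].fillna(demo["fuel_capacity"].median())
print(demo)
```

Passing a function object such as `np.median` to `fillna` would not compute anything, so the fill value has to be calculated first.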

<h2 style="text-align:center;">Thank You</h2>