<h1 align=center>LOADING AND DATA WRANGLING</h1>
<p>
The main objective here is to load the data file successfully and check it for missing values. After cleaning, the data is saved to a new file for the next part, i.e. the analysis process.
</p>

# Importing libraries and data

In [1]:
# importing all the important libraries at once
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from scipy import stats
import statsmodels.formula.api as smf
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
#ignore all the warnings
import warnings
warnings.filterwarnings('ignore')

In [2]:
# file location on my computer
file = 'D:/Python_DataScience/pandas-demo/Project1/cars_project_1.csv'

# reading the data without a header row and assigning it to the variable df
df = pd.read_csv(file, header = None)

In [3]:
# showing the first 5 rows of the dataframe df
df.head(5)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,16,17,18,19,20,21,22,23,24,25
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450


In [4]:
# creating headers list
headers = ["symboling","normalized-losses","make","fuel-type","aspiration", "num-of-doors","body-style",
         "drive-wheels","engine-location","wheel-base", "length","width","height","curb-weight","engine-type",
         "num-of-cylinders", "engine-size","fuel-system","bore","stroke","compression-ratio","horsepower",
         "peak-rpm","city-mpg","highway-mpg","price"]
print("headers:\n", headers )

headers:
 ['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration', 'num-of-doors', 'body-style', 'drive-wheels', 'engine-location', 'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-type', 'num-of-cylinders', 'engine-size', 'fuel-system', 'bore', 'stroke', 'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg', 'highway-mpg', 'price']


In [5]:
# printing the number of rows and columns
df.shape

(205, 26)

In [6]:
# adding the headers as Column names for our dataframe df
df.columns =headers
# looking at the dataframe with newly added columns
df.head(5)

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450


# INITIATING DATA WRANGLING PROCESS

### The dataset contains many missing values (marked by "?" in the cells) that need to be filled in or discarded.

Several approaches are available for filling the empty cells. Some of them are as follows:<ul>
<li>Filling the empty value with the mean, median, mode or a regression estimate if the data is quantitative.</li>
<li>Filling the empty value with the mode if the data is qualitative.</li>
<li>Filling the empty value using data analysis or domain experience. (This is the preferred approach.)</li>
<li>Finally, discarding the data (most often by deleting an entire row or column) if it does not contribute to model development.</li></ul>
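
As a rough illustration of the options listed above, here is a minimal sketch on a small made-up frame. The values are hypothetical and are not taken from the cars dataset; only the column names are borrowed from it.

```python
import numpy as np
import pandas as pd

# Toy data with made-up values (not rows from the cars dataset)
toy = pd.DataFrame({
    "bore": [3.2, np.nan, 3.6, 3.4],
    "num-of-doors": ["four", "two", np.nan, "four"],
    "price": [10000.0, 12000.0, np.nan, 9000.0],
})

filled = toy.copy()

# quantitative column: fill with the column mean
filled["bore"] = filled["bore"].fillna(filled["bore"].mean())

# qualitative column: fill with the most frequent category (the mode)
filled["num-of-doors"] = filled["num-of-doors"].fillna(filled["num-of-doors"].mode()[0])

# discard rows where the target itself is missing, then renumber the rows
filled = filled.dropna(subset=["price"]).reset_index(drop=True)

print(filled)
```

Which option is appropriate depends on the column, so the choice is made per variable in the cells that follow.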

In [7]:
# replacing every "?" with NaN (np.nan)
df.replace("?",np.nan,inplace= True)
# printing the dataframe
df.head(5)

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
0,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164.0,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164.0,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450
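
As an aside, the "?" markers can also be converted while reading the file. The sketch below assumes a tiny in-memory CSV with a shortened, illustrative column list rather than the real file; `na_values` and `names` are standard `pd.read_csv` parameters.

```python
import io
import pandas as pd

# Two illustrative rows in the same headerless, "?"-marked style as the raw file
raw = io.StringIO("3,?,alfa-romero,13495\n2,164,audi,?\n")
cols = ["symboling", "normalized-losses", "make", "price"]  # shortened list, for illustration only

# na_values="?" turns the markers into NaN at load time, so numeric
# columns are parsed as numbers instead of strings
demo = pd.read_csv(raw, header=None, names=cols, na_values="?")

print(demo.dtypes)
print(demo.isnull().sum())
```

With this approach the separate replace step and most of the later dtype corrections would probably not be needed.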


In [8]:
# obtaining info about null values and data type of each column
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   symboling          205 non-null    int64  
 1   normalized-losses  164 non-null    object 
 2   make               205 non-null    object 
 3   fuel-type          205 non-null    object 
 4   aspiration         205 non-null    object 
 5   num-of-doors       203 non-null    object 
 6   body-style         205 non-null    object 
 7   drive-wheels       205 non-null    object 
 8   engine-location    205 non-null    object 
 9   wheel-base         205 non-null    float64
 10  length             205 non-null    float64
 11  width              205 non-null    float64
 12  height             205 non-null    float64
 13  curb-weight        205 non-null    int64  
 14  engine-type        205 non-null    object 
 15  num-of-cylinders   205 non-null    object 
 16  engine-size        205 non

### Null cells summary:
A summary of the missing values is given below:<ul>
<li><h5>normalized-losses: 41 missing values</h5>The best approach for this variable would be to discard the entire column due to the large number of missing values</li>
<li><h5>num-of-doors: 2 missing values</h5>The best approach would be the mode or simple observation</li>
<li><h5>bore: 4 missing values</h5>The best approach would be linear regression or the mean</li>
<li><h5>stroke: 4 missing values</h5>The best approach would be linear regression or the mean</li>
<li><h5>horsepower: 2 missing values</h5>The best approach would be linear regression or the mean</li>
<li><h5>peak-rpm: 2 missing values</h5>The best approach would be linear regression or the mean</li>
<li><h5>price: 4 missing values</h5>This is the target variable, so the best approach would be to remove the affected rows</li>
</ul>

Also, the summary shows that the dtypes of several columns are wrong (numeric columns stored as object) and need to be corrected.
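
One possible way to convert several columns at once is `pd.to_numeric`, sketched here on a small made-up frame. The column list `numeric_cols` is an assumption for the example; `errors="coerce"` turns anything unparseable into NaN instead of raising.

```python
import numpy as np
import pandas as pd

# Illustrative frame: numeric values stored as strings, NaN for missing cells
demo = pd.DataFrame({
    "bore": ["3.47", "2.68", np.nan],
    "horsepower": ["111", np.nan, "154"],
    "make": ["alfa-romero", "audi", "mazda"],
})

numeric_cols = ["bore", "horsepower"]  # columns assumed to hold numbers

# convert each listed column; unparseable entries become NaN
demo[numeric_cols] = demo[numeric_cols].apply(pd.to_numeric, errors="coerce")

print(demo.dtypes)
```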

In [9]:
# correcting the datatype of  columns 
df["normalized-losses"] = df["normalized-losses"].astype("float")
df["bore"] = df["bore"].astype("float")
df["stroke"] = df["stroke"].astype("float")
df["horsepower"] = df["horsepower"].astype("float")
df["peak-rpm"] = df["peak-rpm"].astype("float")
df["price"] = df["price"].astype("float")

# looking at the descriptive statistics of corrected dtypes  of continuous variables
df.describe()

Unnamed: 0,symboling,normalized-losses,wheel-base,length,width,height,curb-weight,engine-size,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
count,205.0,164.0,205.0,205.0,205.0,205.0,205.0,205.0,201.0,201.0,205.0,203.0,203.0,205.0,205.0,201.0
mean,0.834146,122.0,98.756585,174.049268,65.907805,53.724878,2555.565854,126.907317,3.329751,3.255423,10.142537,104.256158,5125.369458,25.219512,30.75122,13207.129353
std,1.245307,35.442168,6.021776,12.337289,2.145204,2.443522,520.680204,41.642693,0.273539,0.316717,3.97204,39.714369,479.33456,6.542142,6.886443,7947.066342
min,-2.0,65.0,86.6,141.1,60.3,47.8,1488.0,61.0,2.54,2.07,7.0,48.0,4150.0,13.0,16.0,5118.0
25%,0.0,94.0,94.5,166.3,64.1,52.0,2145.0,97.0,3.15,3.11,8.6,70.0,4800.0,19.0,25.0,7775.0
50%,1.0,115.0,97.0,173.2,65.5,54.1,2414.0,120.0,3.31,3.29,9.0,95.0,5200.0,24.0,30.0,10295.0
75%,2.0,150.0,102.4,183.1,66.9,55.5,2935.0,141.0,3.59,3.41,9.4,116.0,5500.0,30.0,34.0,16500.0
max,3.0,256.0,120.9,208.1,72.3,59.8,4066.0,326.0,3.94,4.17,23.0,288.0,6600.0,49.0,54.0,45400.0


#### Dealing with missing "bore" and "stroke" values

Let us begin by identifying the missing cells for "bore" and "stroke" variables and try to fill in the values using regression. 

In [10]:
# identifying all the rows that contain null values 
is_NaN = df.isnull()
row_has_NaN = is_NaN.any(axis=1)
rows_with_NaN = df[row_has_NaN]

print(rows_with_NaN)

     symboling  normalized-losses           make fuel-type aspiration  \
0            3                NaN    alfa-romero       gas        std   
1            3                NaN    alfa-romero       gas        std   
2            1                NaN    alfa-romero       gas        std   
5            2                NaN           audi       gas        std   
7            1                NaN           audi       gas        std   
9            0                NaN           audi       gas      turbo   
14           1                NaN            bmw       gas        std   
15           0                NaN            bmw       gas        std   
16           0                NaN            bmw       gas        std   
17           0                NaN            bmw       gas        std   
27           1              148.0          dodge       gas      turbo   
43           0                NaN          isuzu       gas        std   
44           1                NaN          isuzu   

[46 rows x 26 columns]


Going through the rows with missing values, we find that rows 55 to 58 have null values in both the "bore" and "stroke" columns. Let's print those rows for analysis.

In [11]:
# printing the required rows
df.iloc[55:59,:]

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
55,3,150.0,mazda,gas,std,two,hatchback,rwd,front,95.3,...,70,4bbl,,,9.4,101.0,6000.0,17,23,10945.0
56,3,150.0,mazda,gas,std,two,hatchback,rwd,front,95.3,...,70,4bbl,,,9.4,101.0,6000.0,17,23,11845.0
57,3,150.0,mazda,gas,std,two,hatchback,rwd,front,95.3,...,70,4bbl,,,9.4,101.0,6000.0,17,23,13645.0
58,3,150.0,mazda,gas,std,two,hatchback,rwd,front,95.3,...,80,mpfi,,,9.4,135.0,6000.0,16,23,15645.0
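
The affected rows do not have to be found by reading through the printout. A minimal sketch, using a small made-up frame, of counting missing values per column and locating the rows where two columns are missing together:

```python
import numpy as np
import pandas as pd

# Made-up values; only the column names follow the dataset
demo = pd.DataFrame({
    "bore": [3.19, np.nan, np.nan, 3.46],
    "stroke": [3.40, np.nan, np.nan, 3.90],
    "horsepower": [102.0, 101.0, 135.0, np.nan],
})

# number of missing values in each column
missing_counts = demo.isnull().sum()

# row labels where both "bore" and "stroke" are missing
bore_stroke_rows = demo.index[demo[["bore", "stroke"]].isnull().all(axis=1)].tolist()

print(missing_counts)
print(bore_stroke_rows)
```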


Since "bore" and "stroke" are continuous variables, they can be filled in by regression. Identifying which variables correlate with them makes it easier to choose the predictors.
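
A quick way to screen candidate predictors is to rank the numeric columns by their absolute correlation with the target. The sketch below uses synthetic data with hypothetical column names (`x1`, `x2`, `target`), so it only illustrates the mechanics and says nothing about the cars data itself.

```python
import numpy as np
import pandas as pd

# Synthetic data: "target" depends on x1, while x2 is unrelated noise
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
demo = pd.DataFrame({
    "x1": x1,
    "x2": x2,
    "target": 2.0 * x1 + 0.1 * rng.normal(size=200),
})

# absolute Pearson correlation of every other column with the target, strongest first
corr_with_target = (
    demo.corr()["target"]
        .drop("target")
        .abs()
        .sort_values(ascending=False)
)

print(corr_with_target)
```

Correlation only captures linear, pairwise association, so the regression summary is still worth checking afterwards.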

In [12]:
# selecting appropriate variables for regression
var = [ "length","peak-rpm","city-mpg"]

In [13]:
# preparing the test dataset for multiple linear regression
x_test = df.iloc[55:59,:]
x_test = x_test.drop(["bore","stroke"],axis = 1)
x_test = x_test[var]
x_test

Unnamed: 0,length,peak-rpm,city-mpg
55,169.0,6000.0,17
56,169.0,6000.0,17
57,169.0,6000.0,17
58,169.0,6000.0,16


In [14]:
# preparing the train dataset
x_train = df
x_train = x_train.dropna(axis = 0)
y_train = x_train["bore"]
x_train = x_train[var]
x_train.head(5)

Unnamed: 0,length,peak-rpm,city-mpg
3,176.6,5500.0,24
4,176.6,5500.0,18
6,192.7,5500.0,19
8,192.7,5500.0,17
10,176.8,5800.0,23


In [15]:
# applying linear regression to fit the values
lm = LinearRegression()
lm.fit(x_train,y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [16]:
# predicting the values
y_test = lm.predict(x_test)
y_test

array([3.30160722, 3.30160722, 3.30160722, 3.31821145])

In [17]:
# the variables "length", "peak-rpm" and "city-mpg" have the lowest p-values, which makes them significant predictors for the missing "bore" values
X = x_train
Y = y_train
X = sm.add_constant(X)
model1= sm.OLS(Y,X)
result = model1.fit()
print(result.summary())

                            OLS Regression Results                            
Dep. Variable:                   bore   R-squared:                       0.506
Model:                            OLS   Adj. R-squared:                  0.496
Method:                 Least Squares   F-statistic:                     52.87
Date:                Mon, 21 Dec 2020   Prob (F-statistic):           1.36e-23
Time:                        21:24:29   Log-Likelihood:                 40.677
No. Observations:                 159   AIC:                            -73.35
Df Residuals:                     155   BIC:                            -61.08
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          3.2573      0.537      6.066      0.0

In [18]:
# assigning y_test values for "bore" to the original table
# (.loc assigns in a single step; the label slice 55:58 is inclusive, matching iloc[55:59])
df.loc[55:58, "bore"] = y_test
df.iloc[55:59, :]

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
55,3,150.0,mazda,gas,std,two,hatchback,rwd,front,95.3,...,70,4bbl,3.301607,,9.4,101.0,6000.0,17,23,10945.0
56,3,150.0,mazda,gas,std,two,hatchback,rwd,front,95.3,...,70,4bbl,3.301607,,9.4,101.0,6000.0,17,23,11845.0
57,3,150.0,mazda,gas,std,two,hatchback,rwd,front,95.3,...,70,4bbl,3.301607,,9.4,101.0,6000.0,17,23,13645.0
58,3,150.0,mazda,gas,std,two,hatchback,rwd,front,95.3,...,80,mpfi,3.318211,,9.4,135.0,6000.0,16,23,15645.0
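
The fit, predict and assign steps used for "bore" follow the same pattern for every variable, so they could be wrapped in a helper. The sketch below is one possible version: the function name `impute_by_regression` is hypothetical, and it uses NumPy's least-squares solver in place of scikit-learn's `LinearRegression` to stay dependency-light (both fit an ordinary least-squares model with an intercept). Unlike the cells above, it only drops training rows that have NaN in the target or the predictors, not in any column.

```python
import numpy as np
import pandas as pd

def impute_by_regression(frame, target, predictors):
    """Return a copy of `frame` with NaNs in `target` replaced by
    least-squares predictions from the `predictors` columns."""
    out = frame.copy()
    # rows usable for fitting: target and all predictors present
    known = out.dropna(subset=[target] + predictors)
    # rows to fill: target missing, predictors present
    missing = out[out[target].isnull()].dropna(subset=predictors)
    if missing.empty:
        return out
    # design matrices with a leading column of ones for the intercept
    a_known = np.column_stack([np.ones(len(known)), known[predictors].to_numpy(dtype=float)])
    a_missing = np.column_stack([np.ones(len(missing)), missing[predictors].to_numpy(dtype=float)])
    coef, *_ = np.linalg.lstsq(a_known, known[target].to_numpy(dtype=float), rcond=None)
    out.loc[missing.index, target] = a_missing @ coef
    return out

# Synthetic check: y = 1 + 2*a + 3*b exactly, with one value removed
demo = pd.DataFrame({"a": [0.0, 1.0, 2.0, 3.0, 4.0], "b": [1.0, 0.0, 2.0, 1.0, 3.0]})
demo["y"] = 1 + 2 * demo["a"] + 3 * demo["b"]
demo.loc[4, "y"] = np.nan

filled = impute_by_regression(demo, "y", ["a", "b"])
print(filled)
```

On the real data the call would look something like `impute_by_regression(df, "bore", var)`, though the fitted values may differ slightly from the ones above because the training rows are selected differently.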


#### Repeating the same procedure to calculate the "stroke" missing values 

In [19]:
var1=["engine-size","peak-rpm","highway-mpg"]

In [20]:
# preparing the test dataset for multiple linear regression
x_test_1 = df.iloc[55:59,:]
x_test_1= x_test_1[var1]
x_test_1

Unnamed: 0,engine-size,peak-rpm,highway-mpg
55,70,6000.0,23
56,70,6000.0,23
57,70,6000.0,23
58,80,6000.0,23


In [21]:
# preparing the train dataset
x_train_1 = df
x_train_1 = x_train_1.dropna(axis = 0)
y_train_1 = x_train_1["stroke"]
x_train_1 = x_train_1[var1]
x_train_1.tail(5)

Unnamed: 0,engine-size,peak-rpm,highway-mpg
200,141,5400.0,28
201,141,5300.0,25
202,173,5500.0,23
203,145,4800.0,27
204,141,5400.0,25


In [22]:
lm = LinearRegression()
lm.fit(x_train_1,y_train_1)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [23]:
# predicting the values
y_test_1 = lm.predict(x_test_1)
y_test_1

array([2.79690285, 2.79690285, 2.79690285, 2.86719203])

In [24]:
# the variables "engine-size", "peak-rpm" and "highway-mpg" have the lowest p-values, which makes them significant predictors for the missing "stroke" values
X = x_train_1
Y = y_train_1
X = sm.add_constant(X)
model1= sm.OLS(Y,X)
result = model1.fit()
print(result.summary())

                            OLS Regression Results                            
Dep. Variable:                 stroke   R-squared:                       0.209
Model:                            OLS   Adj. R-squared:                  0.193
Method:                 Least Squares   F-statistic:                     13.62
Date:                Mon, 21 Dec 2020   Prob (F-statistic):           6.26e-08
Time:                        21:24:29   Log-Likelihood:                -12.350
No. Observations:                 159   AIC:                             32.70
Df Residuals:                     155   BIC:                             44.98
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
const           0.9607      0.447      2.148      

In [25]:
# assigning y_test_1 values for "stroke" to the original table
# (.loc assigns in a single step; the label slice 55:58 is inclusive, matching iloc[55:59])
df.loc[55:58, "stroke"] = y_test_1
df.iloc[55:59, :]

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
55,3,150.0,mazda,gas,std,two,hatchback,rwd,front,95.3,...,70,4bbl,3.301607,2.796903,9.4,101.0,6000.0,17,23,10945.0
56,3,150.0,mazda,gas,std,two,hatchback,rwd,front,95.3,...,70,4bbl,3.301607,2.796903,9.4,101.0,6000.0,17,23,11845.0
57,3,150.0,mazda,gas,std,two,hatchback,rwd,front,95.3,...,70,4bbl,3.301607,2.796903,9.4,101.0,6000.0,17,23,13645.0
58,3,150.0,mazda,gas,std,two,hatchback,rwd,front,95.3,...,80,mpfi,3.318211,2.867192,9.4,135.0,6000.0,16,23,15645.0


#### Dealing with "num-of-doors" missing values

Let us resolve the missing values of the "num-of-doors" variable. The values are missing at row indices 27 and 63, and the "body-style" for both rows is sedan. Observing the data, we find that sedans typically have four doors, which coincides with our knowledge that sedans usually have four doors. Hence, "four" is used for both missing values in the "num-of-doors" column.
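
The observation about sedans can be checked with a cross-tabulation, and the fill can be derived from the most common value per body style. A minimal sketch on made-up rows (the counts here are illustrative, not the real ones):

```python
import numpy as np
import pandas as pd

# Made-up rows; only the column names and categories follow the dataset
demo = pd.DataFrame({
    "body-style": ["sedan", "sedan", "sedan", "hatchback", "hatchback", "sedan"],
    "num-of-doors": ["four", "four", "two", "two", "two", np.nan],
})

# how door counts are distributed within each body style
print(pd.crosstab(demo["body-style"], demo["num-of-doors"]))

# most common door count per body style (mode ignores NaN)
mode_by_style = demo.groupby("body-style")["num-of-doors"].agg(lambda s: s.mode().iloc[0])

# fill each missing cell with the mode of its own body style
demo["num-of-doors"] = demo["num-of-doors"].fillna(demo["body-style"].map(mode_by_style))

print(demo)
```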

In [26]:
# filling the two missing "num-of-doors" cells (rows 27 and 63) with "four"
df.loc[[27, 63], "num-of-doors"] = "four"
df.iloc[[27,63],[5]]

Unnamed: 0,num-of-doors
27,four
63,four


#### Dealing with "horsepower" and "peak-rpm" missing values using Multiple Linear Regression 

In [27]:
df.iloc[[130,131],]

Unnamed: 0,symboling,normalized-losses,make,fuel-type,aspiration,num-of-doors,body-style,drive-wheels,engine-location,wheel-base,...,engine-size,fuel-system,bore,stroke,compression-ratio,horsepower,peak-rpm,city-mpg,highway-mpg,price
130,0,,renault,gas,std,four,wagon,fwd,front,96.1,...,132,mpfi,3.46,3.9,8.7,,,23,31,9295.0
131,2,,renault,gas,std,two,hatchback,fwd,front,96.1,...,132,mpfi,3.46,3.9,8.7,,,23,31,9895.0


In [28]:
var2 = ["height","curb-weight","engine-size","compression-ratio","city-mpg"]

In [29]:
# preparing the test dataset for multiple linear regression
x_test_2 = df.iloc[[130,131],:]
x_test_2= x_test_2[var2]
x_test_2

Unnamed: 0,height,curb-weight,engine-size,compression-ratio,city-mpg
130,55.2,2579,132,8.7,23
131,50.5,2460,132,8.7,23


In [30]:
# preparing the train dataset
x_train_2 = df
x_train_2 = x_train_2.dropna(axis = 0)
y_train_2 = x_train_2["horsepower"]
x_train_2 = x_train_2[var2]
x_train_2.head(5)

Unnamed: 0,height,curb-weight,engine-size,compression-ratio,city-mpg
3,54.3,2337,109,10.0,24
4,54.3,2824,136,8.0,18
6,55.7,2844,136,8.5,19
8,55.9,3086,131,8.3,17
10,54.3,2395,108,8.8,23


In [31]:
lm = LinearRegression()
lm.fit(x_train_2,y_train_2)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [32]:
# predicting the values
y_test_2 = lm.predict(x_test_2)
y_test_2

array([107.2168916 , 115.18587829])

In [33]:
X = x_train_2
Y = y_train_2
X = sm.add_constant(X)
model1= sm.OLS(Y,X)
result = model1.fit()
print(result.summary())

                            OLS Regression Results                            
Dep. Variable:             horsepower   R-squared:                       0.829
Model:                            OLS   Adj. R-squared:                  0.824
Method:                 Least Squares   F-statistic:                     153.4
Date:                Mon, 21 Dec 2020   Prob (F-statistic):           9.56e-59
Time:                        21:24:30   Log-Likelihood:                -647.31
No. Observations:                 164   AIC:                             1307.
Df Residuals:                     158   BIC:                             1325.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
const               187.8441     25.82

In [34]:
# assigning y_test_2 values for "horsepower" to the original table
# (.loc assigns in a single step; the label slice 130:131 is inclusive, matching iloc[130:132])
df.loc[130:131, "horsepower"] = y_test_2
df.iloc[130:132, :]["horsepower"]

130    107.216892
131    115.185878
Name: horsepower, dtype: float64

In [35]:
var3 = ["engine-size","bore","compression-ratio","horsepower"]

In [36]:
x_test_3 = df.iloc[[130,131],:]
x_test_3= x_test_3[var3]
x_test_3

Unnamed: 0,engine-size,bore,compression-ratio,horsepower
130,132,3.46,8.7,107.216892
131,132,3.46,8.7,115.185878


In [37]:
# preparing the train dataset
x_train_3 = df
x_train_3 = x_train_3.dropna(axis = 0)
y_train_3 = x_train_3["peak-rpm"]
x_train_3 = x_train_3[var3]
x_train_3.head(5)

Unnamed: 0,engine-size,bore,compression-ratio,horsepower
3,109,3.19,10.0,102.0
4,136,3.19,8.0,115.0
6,136,3.19,8.5,110.0
8,131,3.13,8.3,140.0
10,108,3.5,8.8,101.0


In [38]:
lm = LinearRegression()
lm.fit(x_train_3,y_train_3)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [39]:
# predicting the values
y_test_3 = lm.predict(x_test_3)
y_test_3

array([5050.34257672, 5154.5495386 ])

In [40]:
X = x_train_3
Y = y_train_3
X = sm.add_constant(X)
model1= sm.OLS(Y,X)
result = model1.fit()
print(result.summary())

                            OLS Regression Results                            
Dep. Variable:               peak-rpm   R-squared:                       0.490
Model:                            OLS   Adj. R-squared:                  0.477
Method:                 Least Squares   F-statistic:                     38.23
Date:                Mon, 21 Dec 2020   Prob (F-statistic):           2.17e-22
Time:                        21:24:30   Log-Likelihood:                -1189.3
No. Observations:                 164   AIC:                             2389.
Df Residuals:                     159   BIC:                             2404.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
const              7360.9508    370.03

In [41]:
# assigning y_test_3 values for "peak-rpm" to the original table
# (.loc assigns in a single step; the label slice 130:131 is inclusive, matching iloc[130:132])
df.loc[130:131, "peak-rpm"] = y_test_3
df.iloc[130:132, :]["peak-rpm"]

130    5050.342577
131    5154.549539
Name: peak-rpm, dtype: float64
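
Writing predictions back into the frame is safest as a single indexing step. The sketch below shows a one-step `.loc` assignment on a toy frame with placeholder predictions. Chained forms such as `df.iloc[...][col] = ...` may end up writing to a temporary copy, depending on the pandas version, so the result should always be checked.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"horsepower": [111.0, np.nan, np.nan, 102.0]})
predicted = np.array([107.2, 115.2])  # placeholder predictions, not model output

# label-based assignment in one step; with the default RangeIndex
# the label slice 1:2 includes both end points
demo.loc[1:2, "horsepower"] = predicted

print(demo)
```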

In [42]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   symboling          205 non-null    int64  
 1   normalized-losses  164 non-null    float64
 2   make               205 non-null    object 
 3   fuel-type          205 non-null    object 
 4   aspiration         205 non-null    object 
 5   num-of-doors       205 non-null    object 
 6   body-style         205 non-null    object 
 7   drive-wheels       205 non-null    object 
 8   engine-location    205 non-null    object 
 9   wheel-base         205 non-null    float64
 10  length             205 non-null    float64
 11  width              205 non-null    float64
 12  height             205 non-null    float64
 13  curb-weight        205 non-null    int64  
 14  engine-type        205 non-null    object 
 15  num-of-cylinders   205 non-null    object 
 16  engine-size        205 non

#### Dealing with "normalized-losses" variable and "price" variable missing values:
The "normalized-losses" variable has many missing values, so it is better to remove the entire column.
The target variable "price" has four missing values, so it is better to remove the rows that contain them, as they won't contribute to model building.
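
A minimal sketch of these two removals on a made-up frame, chaining the column drop, the row drop restricted to the target column, and the index reset:

```python
import numpy as np
import pandas as pd

# Made-up rows; only the column names follow the dataset
demo = pd.DataFrame({
    "normalized-losses": [np.nan, 164.0, np.nan],
    "make": ["alfa-romero", "audi", "isuzu"],
    "price": [13495.0, 13950.0, np.nan],
})

cleaned = (
    demo.drop(columns="normalized-losses")   # remove the sparse column
        .dropna(subset=["price"])            # remove rows with no target value
        .reset_index(drop=True)              # renumber rows from 0
)

print(cleaned)
```

Restricting `dropna` to `subset=["price"]` makes the intent explicit and avoids dropping rows because of missing values in unrelated columns.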

In [43]:
# removing the "normalized-losses" variable
df.drop(columns="normalized-losses", inplace=True)
# removing rows having missing values for the "price" variable
df.dropna(subset=["price"], inplace=True)
# resetting the index (row labels)
df.reset_index(drop=True, inplace=True)

In [44]:
# rechecking that all missing values have been dealt with
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 201 entries, 0 to 200
Data columns (total 25 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   symboling          201 non-null    int64  
 1   make               201 non-null    object 
 2   fuel-type          201 non-null    object 
 3   aspiration         201 non-null    object 
 4   num-of-doors       201 non-null    object 
 5   body-style         201 non-null    object 
 6   drive-wheels       201 non-null    object 
 7   engine-location    201 non-null    object 
 8   wheel-base         201 non-null    float64
 9   length             201 non-null    float64
 10  width              201 non-null    float64
 11  height             201 non-null    float64
 12  curb-weight        201 non-null    int64  
 13  engine-type        201 non-null    object 
 14  num-of-cylinders   201 non-null    object 
 15  engine-size        201 non-null    int64  
 16  fuel-system        201 non

In [45]:
# saving the clean data to a new file
df.to_csv( "clean0_project_1.csv",header=True)
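
Note that `to_csv` writes the row index as an extra unnamed first column by default. If that column is not wanted in the next part, one option is `index=False`, sketched here as a round trip through a temporary directory with a hypothetical file name:

```python
import os
import tempfile

import pandas as pd

demo = pd.DataFrame({"make": ["audi", "mazda"], "price": [13950.0, 10945.0]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_clean.csv")  # hypothetical file name
    # index=False keeps the row labels out of the file
    demo.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(reloaded)
```

If the index column is kept instead, it can be read back with `pd.read_csv(..., index_col=0)`.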