## Dealing with missing values in the dataset

Drop the variable or drop the data entry 
### Replace the missing values 
Replace it with the average values, frequency or based on other functions

### Dropping missing values in pandas
use dataframes.dropna()  
df.dropna(subset = ["price"], axis = 0, inplace = true)
#### Notes: 
Axis = 0 drops the entire row, 
Axis = 1 drops the entire column.

In [2]:
import pandas as pd
import numpy as np
df = pd.read_csv(r"C:\Shreyas(Courses)\Data Analysis (IBM)\Dataset Used\cardata.csv", header = None)
headers = [ "number", "symboling","normalized-losses","make","fuel-type","aspiration", "num-of-doors","body-style",
         "drive-wheels","engine-location","wheel-base", "length","width","height","curb-weight","engine-type",
         "num-of-cylinders", "engine-size","fuel-system","bore","stroke","compression-ratio","horsepower",
         "peak-rpm","city-mpg","highway-mpg","price"]
df.columns = headers
df.replace("?", np.nan , inplace = True )
df.dropna(subset = ["price"], axis = 0 , inplace = True)




## Replacing the missing values
use dataframe. replace(missing value , new value )  
Generally fillna is considered a better option  
The fillna statement is given below

In [3]:
mean = df["normalized-losses"].mean
df["normalized-losses"].fillna(mean, inplace= True)


## Data formatting 
Ensures that the data is consistent and easily understandable  
### Applying calculations 

For example, we need to convert city miles per gallon to liter/100km   
code used :   
df["city mpg"] = 235/ df["city mpg"]

In [4]:
df["city-mpg"] = 235/df["city-mpg"]


### ⚠️ Important note
When performing calculations on a DataFrame column, avoid overwriting the same column 
in the same operation without cleaning the data first. This can sometimes cause 
unexpected behavior or recursion errors in pandas. 

Best practice:
- Convert raw data to numeric before calculations.
- Write results to a new column (e.g., `df["city-L/100km"]`) instead of replacing 
  the original column directly.


## Incorrect datatypes
Sometimes the wrong datatypes are assigned to a feature  
## Converting

df["price"].astype("dataype")  

This converts the object type variable into int


In [5]:
df["price"] = df["price"].astype("int")

## Data normalization
Uniform the features value with different ranges.  
converting it into normal ranges before performing any calculations, makes the calculations easy

## Methods of normalization '
### Simple feature scaling
This divides all the values of the column by the maximum value
### Min - max method 
x.old - x.min /x.max - x.min  
### Z score or standard score 
x.old - average of all the entries /standard deviation


In [6]:
#The Methods are applied below
#Simple feature scaling
df['length'] = df["length"]/df['length'].max()

#The min max method
df["length"] = (df["length"]-df["length"].min()) /(df["length"].max() - df["length"].min())

#The zscore method
df["length"] = df["length"] - df["length"].mean()/df["length"].std()

### Binning the data
Binning is grouping the data to improve accuracy of models  
Binning the car dataset into high,low and medium priced cars 

Thus we need three bins and 4 dividers equal distance apart.

In [7]:

bins = np.linspace(min(df["price"]), max(df["price"]), 4)
group_names = ["Low", "Medium", "High"]
df["Price_Binned"]= pd.cut(df["price"], bins, labels= group_names, include_lowest=True)


### How to turn categorical values into quantitative values

Solution: Add dummy variables for each unique catrgory  
Assign 0 or 1 in each category. 


In [8]:
pd.get_dummies(df["fuel-type"])
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 202 entries, 0 to 205
Data columns (total 28 columns):
 #   Column             Non-Null Count  Dtype   
---  ------             --------------  -----   
 0   number             201 non-null    float64 
 1   symboling          202 non-null    int64   
 2   normalized-losses  202 non-null    object  
 3   make               202 non-null    object  
 4   fuel-type          202 non-null    object  
 5   aspiration         202 non-null    object  
 6   num-of-doors       200 non-null    object  
 7   body-style         202 non-null    object  
 8   drive-wheels       202 non-null    object  
 9   engine-location    202 non-null    object  
 10  wheel-base         202 non-null    float64 
 11  length             202 non-null    float64 
 12  width              202 non-null    float64 
 13  height             202 non-null    float64 
 14  curb-weight        202 non-null    int64   
 15  engine-type        202 non-null    object  
 16  num-of-cylind

In [9]:
df.iloc(0,3)

TypeError: _LocationIndexer.__call__() takes from 1 to 2 positional arguments but 3 were given