# Pandas
------

## 1. Dataset Creation
We create a CSV dataset that will be used for **EDA and price prediction**.

| Brand   | Age (years) | Owner | Kms Driven | Price (Target)    |
|---------|-------------|-------|------------|-------------------|
| Maruti  | 3           | 1     | 25000      | 450000            |
| Hyundai | 5           | 2     | 60000      | 350000            |
| Tata    | 2           | 1     | 18000      | 520000            |
| Honda   | 7           | 3     | 85000      | 280000            |
| Toyota  | 4           | 1     | 40000      | 480000            |
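
As a minimal sketch of this step, the table above can be written to CSV and read back with pandas. The column names (`brand`, `age`, `owner`, `kms_driven`, `price`) are illustrative choices, and an in-memory buffer stands in for a real file.

```python
import io
import pandas as pd

# Rows copied from the table above; the column names are illustrative.
cars = pd.DataFrame({
    "brand": ["Maruti", "Hyundai", "Tata", "Honda", "Toyota"],
    "age": [3, 5, 2, 7, 4],
    "owner": [1, 2, 1, 3, 1],
    "kms_driven": [25000, 60000, 18000, 85000, 40000],
    "price": [450000, 350000, 520000, 280000, 480000],
})

# Write to an in-memory buffer; pass a path such as "cars.csv" to write a file.
buffer = io.StringIO()
cars.to_csv(buffer, index=False)

# Read it back the same way a CSV file on disk would be loaded.
buffer.seek(0)
cars_loaded = pd.read_csv(buffer)
print(cars_loaded)
```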

---

## 2. Problem Statement
Predict the **price of a car** based on its features using **regression**.

---

## 3. Data Collection
- Data is collected in **CSV format**
- It is loaded into a **Pandas DataFrame**

---

## 4. Exploratory Data Analysis (EDA)
EDA helps understand the dataset before applying any ML model.

### EDA includes:
- **Preprocessing**
- **Data manipulation**
- **Data cleansing**

---

## 5. Data Cleaning
- Remove unnecessary columns  
  (e.g., `city`, `name` – not useful for prediction)
- Identify important features affecting price:
  - Brand
  - Age
  - Owner
  - Kilometers driven
  - Power (if available)

❌ Unnecessary features:
- Name
- City
- Registration details

---

## 6. Dependent & Independent Variables
- **Dependent variable (Y):**
  - Price
- **Independent variables (X):**
  - Brand
  - Age
  - Owner
  - Kms driven
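
In pandas terms, this split can be sketched as selecting a list of feature columns for X and a single column for Y. The tiny frame and its column names below are stand-ins, not the notebook's real data.

```python
import pandas as pd

# Tiny stand-in frame; a real run would use the cleaned DataFrame.
df = pd.DataFrame({
    "brand": [1, 2, 1],
    "age": [3.0, 4.0, 8.0],
    "owner": [1, 1, 2],
    "kms_driven": [17654.0, 11000.0, 110.0],
    "price": [35000.0, 119900.0, 600000.0],
})

feature_cols = ["brand", "age", "owner", "kms_driven"]
X = df[feature_cols]   # independent variables (2-D DataFrame)
y = df["price"]        # dependent variable (1-D Series)

print(X.shape, y.shape)
```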

---

## 7. Model Selection
We use **Linear Regression** to predict price.

### Linear Regression Formula:
``` y = mx + c ```
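
Here `m` is the slope and `c` is the intercept. As a rough illustration only, the sketch below fits a single-feature line (price against age) to the five sample rows from section 1 using NumPy's least-squares polynomial fit. Using only one feature and predicting for a 6-year-old car are assumptions made for the example.

```python
import numpy as np

# Age (years) and price taken from the sample table in section 1.
age = np.array([3, 5, 2, 7, 4], dtype=float)
price = np.array([450000, 350000, 520000, 280000, 480000], dtype=float)

# Least-squares fit of price = m * age + c (a degree-1 polynomial).
m, c = np.polyfit(age, price, deg=1)

# Predict the price of a hypothetical 6-year-old car.
predicted = m * 6 + c
print(f"m = {m:.1f}, c = {c:.1f}, prediction for age 6 = {predicted:.1f}")
```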
---

## 8. Pandas Usage
All the above steps are performed using **Pandas**.

### Installation:
Run this in a notebook cell; the leading `!` passes the command to the system shell.
```bash
!pip install pandas
```
----
#### Useful Jupyter shell commands
```
!dir              # list files (Windows)
%cd folder_name   # change directory (!cd runs in a subshell and does not persist)
!mkdir data       # create folder
!del file.txt     # delete file (Windows)
```

In [1]:
import pandas as pd
import numpy as np 

In [4]:
df = pd.read_csv("Used_Bikes.csv")
df

Unnamed: 0,bike_name,price,city,kms_driven,owner,age,power,brand
0,TVS Star City Plus Dual Tone 110cc,35000.0,Ahmedabad,17654.0,First Owner,3.0,110.0,TVS
1,Royal Enfield Classic 350cc,119900.0,Delhi,11000.0,First Owner,4.0,350.0,Royal Enfield
2,Triumph Daytona 675R,600000.0,Delhi,110.0,First Owner,8.0,675.0,Triumph
3,TVS Apache RTR 180cc,65000.0,Bangalore,16329.0,First Owner,4.0,180.0,TVS
4,Yamaha FZ S V 2.0 150cc-Ltd. Edition,80000.0,Bangalore,10000.0,First Owner,3.0,150.0,Yamaha
...,...,...,...,...,...,...,...,...
32643,Hero Passion Pro 100cc,39000.0,Delhi,22000.0,First Owner,4.0,100.0,Hero
32644,TVS Apache RTR 180cc,30000.0,Karnal,6639.0,First Owner,9.0,180.0,TVS
32645,Bajaj Avenger Street 220,60000.0,Delhi,20373.0,First Owner,6.0,220.0,Bajaj
32646,Hero Super Splendor 125cc,15600.0,Jaipur,84186.0,First Owner,16.0,125.0,Hero


## ***LINEAR REGRESSION*** ```y=mx+c```

In [5]:
df.info() # summary of the DataFrame: column names, non-null counts, and data types

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32648 entries, 0 to 32647
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   bike_name   32648 non-null  object 
 1   price       32648 non-null  float64
 2   city        32648 non-null  object 
 3   kms_driven  32648 non-null  float64
 4   owner       32648 non-null  object 
 5   age         32648 non-null  float64
 6   power       32648 non-null  float64
 7   brand       32648 non-null  object 
dtypes: float64(4), object(4)
memory usage: 2.0+ MB


In [7]:
df.columns #Returns a Pandas Index object containing all column names.

Index(['bike_name', 'price', 'city', 'kms_driven', 'owner', 'age', 'power',
       'brand'],
      dtype='object')

In [8]:
df.drop_duplicates() #Returns a new DataFrame with duplicate rows removed, doesn't impact original

Unnamed: 0,bike_name,price,city,kms_driven,owner,age,power,brand
0,TVS Star City Plus Dual Tone 110cc,35000.0,Ahmedabad,17654.0,First Owner,3.0,110.0,TVS
1,Royal Enfield Classic 350cc,119900.0,Delhi,11000.0,First Owner,4.0,350.0,Royal Enfield
2,Triumph Daytona 675R,600000.0,Delhi,110.0,First Owner,8.0,675.0,Triumph
3,TVS Apache RTR 180cc,65000.0,Bangalore,16329.0,First Owner,4.0,180.0,TVS
4,Yamaha FZ S V 2.0 150cc-Ltd. Edition,80000.0,Bangalore,10000.0,First Owner,3.0,150.0,Yamaha
...,...,...,...,...,...,...,...,...
9362,Hero Hunk Rear Disc 150cc,25000.0,Delhi,48587.0,First Owner,8.0,150.0,Hero
9369,Bajaj Avenger 220cc,35000.0,Bangalore,60000.0,First Owner,9.0,220.0,Bajaj
9370,Harley-Davidson Street 750 ABS,450000.0,Jodhpur,3430.0,First Owner,4.0,750.0,Harley-Davidson
9371,Bajaj Dominar 400 ABS,139000.0,Hyderabad,21300.0,First Owner,4.0,400.0,Bajaj


In [9]:
df.duplicated() #used to identify duplicate rows in a DataFrame with boolean result

0        False
1        False
2        False
3        False
4        False
         ...  
32643     True
32644     True
32645     True
32646     True
32647     True
Length: 32648, dtype: bool

In [11]:
df.duplicated().sum() #count the duplicates

np.int64(25324)

In [18]:
df.drop_duplicates(inplace=True) # inplace=True removes duplicates permanently from the original DataFrame


```df.drop_duplicates(inplace=True)```
- Removes duplicate rows
- Modifies the original DataFrame df permanently
- Returns None

### Another method
```df = df.drop_duplicates() ```
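
The difference between the two forms can be checked on a small made-up frame (the `toy` data below is illustrative, not taken from the bikes dataset):

```python
import pandas as pd

toy = pd.DataFrame({"brand": ["TVS", "TVS", "Hero"], "price": [35000, 35000, 20000]})

# Option 1: returns a new DataFrame; `toy` itself is unchanged.
deduped = toy.drop_duplicates()

# Option 2: modifies a frame in place and returns None.
in_place = toy.copy()
result = in_place.drop_duplicates(inplace=True)

print(len(toy), len(deduped), len(in_place), result)  # 3 2 2 None
```

Assigning the result of an `inplace=True` call back to `df` would replace the DataFrame with `None`, so use one form or the other, not both.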



In [19]:
df.duplicated().sum() # 0, because all duplicate rows have been removed

np.int64(0)

In [20]:
df

Unnamed: 0,bike_name,price,city,kms_driven,owner,age,power,brand
0,TVS Star City Plus Dual Tone 110cc,35000.0,Ahmedabad,17654.0,First Owner,3.0,110.0,TVS
1,Royal Enfield Classic 350cc,119900.0,Delhi,11000.0,First Owner,4.0,350.0,Royal Enfield
2,Triumph Daytona 675R,600000.0,Delhi,110.0,First Owner,8.0,675.0,Triumph
3,TVS Apache RTR 180cc,65000.0,Bangalore,16329.0,First Owner,4.0,180.0,TVS
4,Yamaha FZ S V 2.0 150cc-Ltd. Edition,80000.0,Bangalore,10000.0,First Owner,3.0,150.0,Yamaha
...,...,...,...,...,...,...,...,...
9362,Hero Hunk Rear Disc 150cc,25000.0,Delhi,48587.0,First Owner,8.0,150.0,Hero
9369,Bajaj Avenger 220cc,35000.0,Bangalore,60000.0,First Owner,9.0,220.0,Bajaj
9370,Harley-Davidson Street 750 ABS,450000.0,Jodhpur,3430.0,First Owner,4.0,750.0,Harley-Davidson
9371,Bajaj Dominar 400 ABS,139000.0,Hyderabad,21300.0,First Owner,4.0,400.0,Bajaj


In [14]:
df.isnull() # boolean mask: True where a value is missing

Unnamed: 0,bike_name,price,city,kms_driven,owner,age,power,brand
0,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...
9362,False,False,False,False,False,False,False,False
9369,False,False,False,False,False,False,False,False
9370,False,False,False,False,False,False,False,False
9371,False,False,False,False,False,False,False,False


##### Returns a DataFrame of the same shape as df

- True → value is missing (NaN, None, NaT)

- False → value is present

In [21]:
df.isnull().sum()

bike_name     0
price         0
city          0
kms_driven    0
owner         0
age           0
power         0
brand         0
dtype: int64

In [None]:
df.dropna(inplace=True) # drops rows containing null values, modifies df in place, and returns None

In [24]:
# Count null values per column
df.isnull().sum() # all zeros: any rows with null values have been dropped

bike_name     0
price         0
city          0
kms_driven    0
owner         0
age           0
power         0
brand         0
dtype: int64

In [25]:
df.isnull().count()

bike_name     7324
price         7324
city          7324
kms_driven    7324
owner         7324
age           7324
power         7324
brand         7324
dtype: int64

```df.isnull().count()``` counts the total number of rows per column, not the missing values.
Use ```df.isnull().sum()``` to count null entries.
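
A small made-up frame with two missing values makes the difference visible (the `toy` data is an assumption for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"price": [35000.0, np.nan, 20000.0], "age": [3.0, 4.0, np.nan]})

mask = toy.isnull()      # same shape as toy, True where a value is missing
print(mask.sum())        # missing values per column (True counts as 1)
print(mask.count())      # entries in the mask itself, i.e. the row count
print(toy.count())       # non-missing values per column of the original frame
```

The mask contains only `True` and `False`, never a missing value, which is why `count()` on it always equals the number of rows.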

In [26]:
df

Unnamed: 0,bike_name,price,city,kms_driven,owner,age,power,brand
0,TVS Star City Plus Dual Tone 110cc,35000.0,Ahmedabad,17654.0,First Owner,3.0,110.0,TVS
1,Royal Enfield Classic 350cc,119900.0,Delhi,11000.0,First Owner,4.0,350.0,Royal Enfield
2,Triumph Daytona 675R,600000.0,Delhi,110.0,First Owner,8.0,675.0,Triumph
3,TVS Apache RTR 180cc,65000.0,Bangalore,16329.0,First Owner,4.0,180.0,TVS
4,Yamaha FZ S V 2.0 150cc-Ltd. Edition,80000.0,Bangalore,10000.0,First Owner,3.0,150.0,Yamaha
...,...,...,...,...,...,...,...,...
9362,Hero Hunk Rear Disc 150cc,25000.0,Delhi,48587.0,First Owner,8.0,150.0,Hero
9369,Bajaj Avenger 220cc,35000.0,Bangalore,60000.0,First Owner,9.0,220.0,Bajaj
9370,Harley-Davidson Street 750 ABS,450000.0,Jodhpur,3430.0,First Owner,4.0,750.0,Harley-Davidson
9371,Bajaj Dominar 400 ABS,139000.0,Hyderabad,21300.0,First Owner,4.0,400.0,Bajaj


In [None]:
df['brand'] # select a single column (returns a Series)

0                   TVS
1         Royal Enfield
2               Triumph
3                   TVS
4                Yamaha
             ...       
9362               Hero
9369              Bajaj
9370    Harley-Davidson
9371              Bajaj
9372              Bajaj
Name: brand, Length: 7324, dtype: object

In [27]:
df['brand'].unique() # array of the distinct values in the column

array(['TVS', 'Royal Enfield', 'Triumph', 'Yamaha', 'Honda', 'Hero',
       'Bajaj', 'Suzuki', 'Benelli', 'KTM', 'Mahindra', 'Kawasaki',
       'Ducati', 'Hyosung', 'Harley-Davidson', 'Jawa', 'BMW', 'Indian',
       'Rajdoot', 'LML', 'Yezdi', 'MV', 'Ideal'], dtype=object)

In [39]:
df['brand'].nunique() # number of distinct values

23

In [28]:
df[['brand','owner']] # select multiple columns with a list (returns a DataFrame)

Unnamed: 0,brand,owner
0,TVS,First Owner
1,Royal Enfield,First Owner
2,Triumph,First Owner
3,TVS,First Owner
4,Yamaha,First Owner
...,...,...
9362,Hero,First Owner
9369,Bajaj,First Owner
9370,Harley-Davidson,First Owner
9371,Bajaj,First Owner


In [41]:
df['owner'].unique()

array(['First Owner', 'Second Owner', 'Third Owner',
       'Fourth Owner Or More'], dtype=object)

In [31]:
df.describe() # how much data there is, what it looks like, and how it is spread (count, mean, std, quartiles)

Unnamed: 0,price,kms_driven,age,power
count,7324.0,7324.0,7324.0,7324.0
mean,84883.9,23910.496587,6.656472,228.133397
std,120966.2,27317.594631,3.605299,158.324219
min,4400.0,1.0,1.0,100.0
25%,30000.0,10155.75,4.0,125.0
50%,55000.0,19000.0,6.0,160.0
75%,100000.0,30112.0,8.0,350.0
max,1900000.0,750000.0,63.0,1800.0


In [32]:
df

Unnamed: 0,bike_name,price,city,kms_driven,owner,age,power,brand
0,TVS Star City Plus Dual Tone 110cc,35000.0,Ahmedabad,17654.0,First Owner,3.0,110.0,TVS
1,Royal Enfield Classic 350cc,119900.0,Delhi,11000.0,First Owner,4.0,350.0,Royal Enfield
2,Triumph Daytona 675R,600000.0,Delhi,110.0,First Owner,8.0,675.0,Triumph
3,TVS Apache RTR 180cc,65000.0,Bangalore,16329.0,First Owner,4.0,180.0,TVS
4,Yamaha FZ S V 2.0 150cc-Ltd. Edition,80000.0,Bangalore,10000.0,First Owner,3.0,150.0,Yamaha
...,...,...,...,...,...,...,...,...
9362,Hero Hunk Rear Disc 150cc,25000.0,Delhi,48587.0,First Owner,8.0,150.0,Hero
9369,Bajaj Avenger 220cc,35000.0,Bangalore,60000.0,First Owner,9.0,220.0,Bajaj
9370,Harley-Davidson Street 750 ABS,450000.0,Jodhpur,3430.0,First Owner,4.0,750.0,Harley-Davidson
9371,Bajaj Dominar 400 ABS,139000.0,Hyderabad,21300.0,First Owner,4.0,400.0,Bajaj


### Dropping Unnecessary Columns using Pandas

```python
df.drop(['city'], axis=1, inplace=True)
```

```
axis = 0 → rows (labels are looked up in the index)
axis = 1 → columns (labels are looked up in the column names)
```
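
To see both axes side by side, here is a short sketch on a two-row made-up frame. It also shows the `columns=` and `index=` keywords of `DataFrame.drop`, which avoid having to remember the axis numbers.

```python
import pandas as pd

toy = pd.DataFrame(
    {"price": [35000, 119900], "city": ["Ahmedabad", "Delhi"]},
    index=[0, 1],
)

no_city = toy.drop(["city"], axis=1)   # axis=1: "city" is a column label
no_row0 = toy.drop([0], axis=0)        # axis=0: 0 is an index (row) label

# Keyword forms that give the same results.
same_cols = toy.drop(columns=["city"])
same_rows = toy.drop(index=[0])

print(list(no_city.columns), list(no_row0.index))
```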

In [33]:
df.drop(['city'],axis=1,inplace=True) #axis=1 is for column and axis=0 is for row

In [34]:
df # the city column has been dropped from df

Unnamed: 0,bike_name,price,kms_driven,owner,age,power,brand
0,TVS Star City Plus Dual Tone 110cc,35000.0,17654.0,First Owner,3.0,110.0,TVS
1,Royal Enfield Classic 350cc,119900.0,11000.0,First Owner,4.0,350.0,Royal Enfield
2,Triumph Daytona 675R,600000.0,110.0,First Owner,8.0,675.0,Triumph
3,TVS Apache RTR 180cc,65000.0,16329.0,First Owner,4.0,180.0,TVS
4,Yamaha FZ S V 2.0 150cc-Ltd. Edition,80000.0,10000.0,First Owner,3.0,150.0,Yamaha
...,...,...,...,...,...,...,...
9362,Hero Hunk Rear Disc 150cc,25000.0,48587.0,First Owner,8.0,150.0,Hero
9369,Bajaj Avenger 220cc,35000.0,60000.0,First Owner,9.0,220.0,Bajaj
9370,Harley-Davidson Street 750 ABS,450000.0,3430.0,First Owner,4.0,750.0,Harley-Davidson
9371,Bajaj Dominar 400 ABS,139000.0,21300.0,First Owner,4.0,400.0,Bajaj


In [37]:
df.drop(['bike_name'],axis=1,inplace=True) # remove the bike_name column

In [38]:
df

Unnamed: 0,price,kms_driven,owner,age,power,brand
0,35000.0,17654.0,First Owner,3.0,110.0,TVS
1,119900.0,11000.0,First Owner,4.0,350.0,Royal Enfield
2,600000.0,110.0,First Owner,8.0,675.0,Triumph
3,65000.0,16329.0,First Owner,4.0,180.0,TVS
4,80000.0,10000.0,First Owner,3.0,150.0,Yamaha
...,...,...,...,...,...,...
9362,25000.0,48587.0,First Owner,8.0,150.0,Hero
9369,35000.0,60000.0,First Owner,9.0,220.0,Bajaj
9370,450000.0,3430.0,First Owner,4.0,750.0,Harley-Davidson
9371,139000.0,21300.0,First Owner,4.0,400.0,Bajaj


In [39]:
df['brand'].value_counts() # number of bikes (rows) per brand, largest first

brand
Bajaj              2081
Royal Enfield      1346
Hero               1142
Honda               676
Yamaha              651
TVS                 481
KTM                 375
Suzuki              203
Harley-Davidson      91
Kawasaki             61
Hyosung              53
Mahindra             50
Benelli              46
Triumph              21
Ducati               20
BMW                  10
Jawa                  7
Indian                3
MV                    3
Rajdoot               1
LML                   1
Yezdi                 1
Ideal                 1
Name: count, dtype: int64
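
If shares are more useful than raw counts, `value_counts` also accepts `normalize=True`. A brief sketch on a made-up Series:

```python
import pandas as pd

brands = pd.Series(["Bajaj", "Hero", "Bajaj", "TVS"], name="brand")

counts = brands.value_counts()                 # rows per brand, largest first
shares = brands.value_counts(normalize=True)   # same, as fractions that sum to 1

print(counts)
print(shares)
```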

### Filtering Data Based on a Condition (Boolean Indexing)

**Data Cleaning** is the process of removing unwanted columns, duplicate records, and missing values from the data.
- That is what was done to `df` above: unnecessary columns, duplicate rows, and null values were removed.

------
```python
TVS_bike = df[df['brand'] == 'TVS']
```
##### Another method 
```python
TVS_bike = df.query("brand == 'TVS'")
```



Question-
- brand = 'TVS', owner = 'First Owner'
- age <= 2 yrs, price <= 50k

In [None]:
TVS_bike = df[df['brand']=='TVS']  #only TVS bikes

In [42]:
TVS_bike

Unnamed: 0,price,kms_driven,owner,age,power,brand
0,35000.0,17654.0,First Owner,3.0,110.0,TVS
3,65000.0,16329.0,First Owner,4.0,180.0,TVS
52,60000.0,30000.0,First Owner,5.0,160.0,TVS
114,69900.0,8700.0,First Owner,3.0,160.0,TVS
130,21500.0,10500.0,First Owner,5.0,125.0,TVS
...,...,...,...,...,...,...
9247,70000.0,4116.0,First Owner,3.0,160.0,TVS
9307,30000.0,30000.0,First Owner,10.0,160.0,TVS
9312,65450.0,9238.0,First Owner,3.0,200.0,TVS
9320,20000.0,84916.0,First Owner,14.0,150.0,TVS


In [43]:
TVS_bike= TVS_bike[TVS_bike['owner']=='First Owner'] #tvs bikes with first owner only

In [44]:
TVS_bike

Unnamed: 0,price,kms_driven,owner,age,power,brand
0,35000.0,17654.0,First Owner,3.0,110.0,TVS
3,65000.0,16329.0,First Owner,4.0,180.0,TVS
52,60000.0,30000.0,First Owner,5.0,160.0,TVS
114,69900.0,8700.0,First Owner,3.0,160.0,TVS
130,21500.0,10500.0,First Owner,5.0,125.0,TVS
...,...,...,...,...,...,...
9247,70000.0,4116.0,First Owner,3.0,160.0,TVS
9307,30000.0,30000.0,First Owner,10.0,160.0,TVS
9312,65450.0,9238.0,First Owner,3.0,200.0,TVS
9320,20000.0,84916.0,First Owner,14.0,150.0,TVS


In [46]:
TVS_bike = TVS_bike[TVS_bike['price']<=50000] # first owner of tvs bikes with price <= 50k

In [47]:
TVS_bike

Unnamed: 0,price,kms_driven,owner,age,power,brand
0,35000.0,17654.0,First Owner,3.0,110.0,TVS
130,21500.0,10500.0,First Owner,5.0,125.0,TVS
131,40000.0,20000.0,First Owner,6.0,160.0,TVS
215,28000.0,28428.0,First Owner,9.0,110.0,TVS
235,28000.0,36000.0,First Owner,5.0,100.0,TVS
...,...,...,...,...,...,...
9155,14000.0,17602.0,First Owner,13.0,110.0,TVS
9157,32000.0,17870.0,First Owner,7.0,110.0,TVS
9158,18000.0,13673.0,First Owner,7.0,110.0,TVS
9307,30000.0,30000.0,First Owner,10.0,160.0,TVS


In [48]:
TVS_bike=TVS_bike[TVS_bike['age']<=2] # first owner of tvs bikes with price <= 50k and age <=2

In [49]:
TVS_bike # only 1 bike meets all the conditions

Unnamed: 0,price,kms_driven,owner,age,power,brand
7055,46000.0,6222.0,First Owner,2.0,100.0,TVS
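
The four filtering steps above can also be written as a single boolean mask. The sketch below uses a small made-up frame in place of the real `df`.

```python
import pandas as pd

# Small stand-in for the cleaned DataFrame used above.
df = pd.DataFrame({
    "price": [35000.0, 46000.0, 48000.0, 70000.0],
    "owner": ["First Owner", "First Owner", "First Owner", "Second Owner"],
    "age": [3.0, 2.0, 2.0, 1.0],
    "brand": ["TVS", "TVS", "Honda", "TVS"],
})

# Each comparison needs its own parentheses; & means element-wise AND.
mask = (
    (df["brand"] == "TVS")
    & (df["owner"] == "First Owner")
    & (df["price"] <= 50000)
    & (df["age"] <= 2)
)
tvs_match = df[mask]
print(tvs_match)
```

Filtering in one expression leaves `df` untouched and avoids reassigning an intermediate variable several times.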


In [50]:
df['owner'].unique()

array(['First Owner', 'Second Owner', 'Third Owner',
       'Fourth Owner Or More'], dtype=object)

Question-
- brand = 'Honda', owner = 'First Owner'
- age <= 2 yrs, price <= 50k
- kms_driven <= 40,000 km

In [52]:
df 

Unnamed: 0,price,kms_driven,owner,age,power,brand
0,35000.0,17654.0,First Owner,3.0,110.0,TVS
1,119900.0,11000.0,First Owner,4.0,350.0,Royal Enfield
2,600000.0,110.0,First Owner,8.0,675.0,Triumph
3,65000.0,16329.0,First Owner,4.0,180.0,TVS
4,80000.0,10000.0,First Owner,3.0,150.0,Yamaha
...,...,...,...,...,...,...
9362,25000.0,48587.0,First Owner,8.0,150.0,Hero
9369,35000.0,60000.0,First Owner,9.0,220.0,Bajaj
9370,450000.0,3430.0,First Owner,4.0,750.0,Harley-Davidson
9371,139000.0,21300.0,First Owner,4.0,400.0,Bajaj


In [53]:
HONDA = df[df['brand']=='Honda'] #honda bikes only

In [54]:
HONDA

Unnamed: 0,price,kms_driven,owner,age,power,brand
6,85000.0,8200.0,First Owner,3.0,160.0,Honda
27,20800.0,30500.0,Second Owner,7.0,125.0,Honda
29,81200.0,9100.0,First Owner,2.0,160.0,Honda
34,40000.0,30000.0,First Owner,8.0,150.0,Honda
37,65000.0,43000.0,First Owner,6.0,150.0,Honda
...,...,...,...,...,...,...
9250,80000.0,8000.0,First Owner,10.0,250.0,Honda
9258,65000.0,18000.0,First Owner,6.0,160.0,Honda
9260,120000.0,14000.0,Second Owner,4.0,250.0,Honda
9340,34400.0,24513.0,Second Owner,8.0,150.0,Honda


In [55]:
HONDA=HONDA[HONDA['price']<=50000] #honda bikes with price <= 50000

In [56]:
HONDA

Unnamed: 0,price,kms_driven,owner,age,power,brand
27,20800.0,30500.0,Second Owner,7.0,125.0,Honda
34,40000.0,30000.0,First Owner,8.0,150.0,Honda
53,21900.0,30000.0,Second Owner,7.0,125.0,Honda
56,34500.0,17056.0,First Owner,5.0,110.0,Honda
82,38000.0,33000.0,First Owner,7.0,125.0,Honda
...,...,...,...,...,...,...
9209,25000.0,27000.0,First Owner,8.0,110.0,Honda
9210,24990.0,39000.0,First Owner,10.0,125.0,Honda
9234,46000.0,9000.0,First Owner,5.0,110.0,Honda
9239,35000.0,14992.0,First Owner,8.0,110.0,Honda


In [57]:
HONDA=HONDA[HONDA['age']<=2] #honda bikes with price <= 50000,age <=2

In [58]:
HONDA

Unnamed: 0,price,kms_driven,owner,age,power,brand
2349,48000.0,7119.0,First Owner,2.0,110.0,Honda


In [None]:
HONDA=HONDA[HONDA['owner']== 'First Owner'] #honda bikes with price <= 50000,age <=2,first owner

In [60]:
HONDA

Unnamed: 0,price,kms_driven,owner,age,power,brand
2349,48000.0,7119.0,First Owner,2.0,110.0,Honda


In [62]:
HONDA=HONDA[HONDA['kms_driven']<= 40000] #honda bikes with price <= 50000,age <=2,first owner,kms driven<=40k

In [63]:
HONDA

Unnamed: 0,price,kms_driven,owner,age,power,brand
2349,48000.0,7119.0,First Owner,2.0,110.0,Honda


Question- 
- brand = 'Hero'
- kms_driven <= 50k

In [65]:
Hero = df[(df['brand']=='Hero') & (df['kms_driven']<=50000)] # combine both conditions with &; df.query() is an alternative

In [66]:
Hero

Unnamed: 0,price,kms_driven,owner,age,power,brand
7,45000.0,12645.0,First Owner,3.0,100.0,Hero
22,46500.0,3500.0,First Owner,2.0,110.0,Hero
26,20000.0,29305.0,First Owner,16.0,125.0,Hero
48,37000.0,10800.0,First Owner,8.0,150.0,Hero
66,12200.0,46643.0,First Owner,14.0,100.0,Hero
...,...,...,...,...,...,...
9315,20000.0,5000.0,First Owner,10.0,100.0,Hero
9316,37000.0,28478.0,First Owner,5.0,125.0,Hero
9339,11400.0,20000.0,Second Owner,17.0,100.0,Hero
9341,25000.0,11122.0,First Owner,11.0,100.0,Hero
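
The same two-condition filter can be expressed with `df.query()`. The sketch below compares both forms on a made-up three-row frame.

```python
import pandas as pd

df = pd.DataFrame({
    "brand": ["Hero", "Hero", "TVS"],
    "kms_driven": [12645.0, 60000.0, 17654.0],
})

# Boolean-mask form, as used in the cell above.
by_mask = df[(df["brand"] == "Hero") & (df["kms_driven"] <= 50000)]

# query() form: column names are written directly inside the string.
by_query = df.query("brand == 'Hero' and kms_driven <= 50000")

print(by_query)
```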


- ```df.shape[0]``` → number of rows

- ```df.shape[1]``` → number of columns 

In [67]:
Hero.shape[1]

6

In [68]:
Hero.shape[0]

1008

##### ```df.info()```
###### Shows how many rows there are, which columns exist, what their data types are, and whether any values are null.

In [69]:
df.info() 

<class 'pandas.core.frame.DataFrame'>
Index: 7324 entries, 0 to 9372
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   price       7324 non-null   float64
 1   kms_driven  7324 non-null   float64
 2   owner       7324 non-null   object 
 3   age         7324 non-null   float64
 4   power       7324 non-null   float64
 5   brand       7324 non-null   object 
dtypes: float64(4), object(2)
memory usage: 400.5+ KB


In [70]:
df

Unnamed: 0,price,kms_driven,owner,age,power,brand
0,35000.0,17654.0,First Owner,3.0,110.0,TVS
1,119900.0,11000.0,First Owner,4.0,350.0,Royal Enfield
2,600000.0,110.0,First Owner,8.0,675.0,Triumph
3,65000.0,16329.0,First Owner,4.0,180.0,TVS
4,80000.0,10000.0,First Owner,3.0,150.0,Yamaha
...,...,...,...,...,...,...
9362,25000.0,48587.0,First Owner,8.0,150.0,Hero
9369,35000.0,60000.0,First Owner,9.0,220.0,Bajaj
9370,450000.0,3430.0,First Owner,4.0,750.0,Harley-Davidson
9371,139000.0,21300.0,First Owner,4.0,400.0,Bajaj


In [167]:
# Column encoding: change the data type of brand from object to int

### Column Encoding (Categorical → Numerical)

Most machine learning models cannot work directly with **text (object) data**.
So we convert categorical columns like `brand` from **object → int**.
This process is called **Column Encoding**.

#### Why encoding is required
- ML models understand **numbers**, not strings
- Converts categorical data into numerical form
- Required before applying regression or classification models

---

#### Example: Encoding `brand` column

##### Before encoding
```python
df['brand'].dtype
```

In [72]:
print(df['brand'].dtype)


object


In [74]:
df['brand'].unique()

array(['TVS', 'Royal Enfield', 'Triumph', 'Yamaha', 'Honda', 'Hero',
       'Bajaj', 'Suzuki', 'Benelli', 'KTM', 'Mahindra', 'Kawasaki',
       'Ducati', 'Hyosung', 'Harley-Davidson', 'Jawa', 'BMW', 'Indian',
       'Rajdoot', 'LML', 'Yezdi', 'MV', 'Ideal'], dtype=object)

In [75]:
dct ={ 'TVS':1, 'Royal Enfield':2, 'Triumph':3, 'Yamaha':4, 'Honda':5, 'Hero':6,
       'Bajaj':7, 'Suzuki':8, 'Benelli':9, 'KTM':10, 'Mahindra':11, 'Kawasaki':12,
       'Ducati':13, 'Hyosung':14, 'Harley-Davidson':15, 'Jawa':16, 'BMW':17, 'Indian':18,
       'Rajdoot':19, 'LML':20, 'Yezdi':21, 'MV':22, 'Ideal':23}
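
Typing the dictionary by hand is error-prone when there are many categories. As one possible shortcut (not what the notebook does), the mapping can be built from `unique()`, which keeps values in order of first appearance. The Series and the name `brand_codes` below are illustrative.

```python
import pandas as pd

brands = pd.Series(["TVS", "Royal Enfield", "Triumph", "TVS", "Yamaha"], name="brand")

# unique() keeps first-appearance order, so codes start at 1 in that order.
brand_codes = {name: code for code, name in enumerate(brands.unique(), start=1)}
encoded = brands.map(brand_codes)

print(brand_codes)
print(encoded.tolist())
```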

In [76]:
df['brand']= df['brand'].map(dct) #brand names get converted into integer 1,2,3 using mapping from dictionary

In [None]:
print(df['brand'].dtype) #Column datatype changes from object → int


int64


In [78]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7324 entries, 0 to 9372
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   price       7324 non-null   float64
 1   kms_driven  7324 non-null   float64
 2   owner       7324 non-null   object 
 3   age         7324 non-null   float64
 4   power       7324 non-null   float64
 5   brand       7324 non-null   int64  
dtypes: float64(4), int64(1), object(1)
memory usage: 400.5+ KB
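
Two caveats are worth checking after a dictionary mapping. First, `map()` returns NaN for any value that is missing from the dictionary. Second, integer codes imply an order (for example, Yamaha = 4 being "greater" than TVS = 1), and a linear model will treat that order as meaningful. The sketch below is an optional alternative, not part of the original workflow: it applies an ordinal mapping to `owner`, where an order does exist, and one-hot encodes `brand` with `pd.get_dummies`. The toy data is made up.

```python
import pandas as pd

toy = pd.DataFrame({
    "brand": ["TVS", "Hero", "Jawa"],
    "owner": ["First Owner", "Second Owner", "First Owner"],
})

# map() returns NaN for any value missing from the dictionary.
partial = toy["brand"].map({"TVS": 1, "Hero": 6})
print(partial.isnull().sum())   # number of rows the dictionary did not cover

# owner has a natural order, so an ordinal mapping is a reasonable choice.
owner_codes = {"First Owner": 1, "Second Owner": 2, "Third Owner": 3,
               "Fourth Owner Or More": 4}
toy["owner"] = toy["owner"].map(owner_codes)

# One-hot encoding: one 0/1 column per brand, with no implied ordering.
one_hot = pd.get_dummies(toy, columns=["brand"], dtype=int)
print(one_hot)
```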


In [None]:
df 

Unnamed: 0,price,kms_driven,owner,age,power,brand
0,35000.0,17654.0,First Owner,3.0,110.0,1
1,119900.0,11000.0,First Owner,4.0,350.0,2
2,600000.0,110.0,First Owner,8.0,675.0,3
3,65000.0,16329.0,First Owner,4.0,180.0,1
4,80000.0,10000.0,First Owner,3.0,150.0,4
...,...,...,...,...,...,...
9362,25000.0,48587.0,First Owner,8.0,150.0,6
9369,35000.0,60000.0,First Owner,9.0,220.0,7
9370,450000.0,3430.0,First Owner,4.0,750.0,15
9371,139000.0,21300.0,First Owner,4.0,400.0,7


> Because all cells in a Jupyter notebook share one kernel, a change made to a DataFrame in one cell is visible in every cell run afterwards, until the data is reloaded or the kernel is restarted.
> So after the mapping above, `df['brand']` is `int64` everywhere in this notebook.

To keep the original `df` unchanged, apply the mapping to a copy instead:
```python
df2 = df.copy()
df2['brand'] = df2['brand'].map(dct)
```
- Done this way, the original `df` is not affected.
- To get the original values back, reload the data with ```df = pd.read_csv("Used_Bikes.csv")```, or restart the kernel and re-run the cells.
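
A quick way to confirm that a copy is independent is to encode the copy and then inspect both frames. The two-row frame and the shortened dictionary below are made up for the check.

```python
import pandas as pd

df = pd.DataFrame({"brand": ["TVS", "Hero"]})
dct = {"TVS": 1, "Hero": 6}

df2 = df.copy()                       # independent copy of the data
df2["brand"] = df2["brand"].map(dct)  # only df2 is encoded

print(df["brand"].tolist())    # the original strings are still there
print(df2["brand"].tolist())   # the copy holds the integer codes
```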

