# Pandas
------
### 🔹 Machine Learning Type

- Supervised Learning → Uses labeled data
- Unsupervised Learning → Uses unlabeled data

##### 👉 This project uses Supervised Learning (Regression) because the target variable (price) is already known.

### 🔹 Machine Learning Workflow
- Problem definition
- Data collection
- Data processing
- Data encoding & EDA
- Data splitting (training & testing)
- Model creation
- Model evaluation
#### Note: in ML terminology, the input columns are called features
## 1. Dataset Creation
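We create a CSV dataset that will be used for **EDA and price prediction**. The table is a small illustrative example; the notebook cells further down work on `Used_Bikes.csv`.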

| Brand   | Age (years) | Owner | Kms Driven | Price (Target)    |
|---------|-------------|-------|------------|-------------------|
| Maruti  | 3           | 1     | 25000      | 450000            |
| Hyundai | 5           | 2     | 60000      | 350000            |
| Tata    | 2           | 1     | 18000      | 520000            |
| Honda   | 7           | 3     | 85000      | 280000            |
| Toyota  | 4           | 1     | 40000      | 480000            |
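
As a minimal sketch, the table above could be built as a DataFrame and round-tripped through CSV text. The column names (`brand`, `age`, `owner`, `kms_driven`, `price`) are assumptions chosen for illustration, and an in-memory buffer stands in for a file on disk.

```python
import io
import pandas as pd

# Toy data copied from the table above; column names are illustrative.
cars = pd.DataFrame({
    "brand": ["Maruti", "Hyundai", "Tata", "Honda", "Toyota"],
    "age": [3, 5, 2, 7, 4],
    "owner": [1, 2, 1, 3, 1],
    "kms_driven": [25000, 60000, 18000, 85000, 40000],
    "price": [450000, 350000, 520000, 280000, 480000],
})

# Write to CSV text without the index, then read it back.
buffer = io.StringIO()
cars.to_csv(buffer, index=False)
buffer.seek(0)
loaded = pd.read_csv(buffer)

print(loaded.shape)
```

Passing `index=False` keeps the row index out of the file, so no extra unnamed column appears when the CSV is read again.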

---

## 2. Problem Statement
Predict the **price of a car** based on its features using **regression**.

---

## 3. Data Collection
- Data is collected in **CSV format**
- Loaded into a **Pandas DataFrame**

---

## 4. Exploratory Data Analysis (EDA)
EDA helps understand the dataset before applying any ML model.

### EDA includes:
- **Preprocessing**
- **Data manipulation**
- **Data cleansing**

---

## 5. Data Cleaning
- Remove unnecessary columns  
  (e.g., `city` and `name`, which are not useful for prediction)
- Identify important features affecting price:
  - Brand
  - Age
  - Owner
  - Kilometers driven
  - Power (if available)

❌ Unnecessary features:
- Name
- City
- Registration details

---

## 6. Dependent & Independent Variables
- **Dependent variable (Y):**
  - Price
- **Independent variables (X):**
  - Brand
  - Age
  - Owner
  - Kms driven

---
## 7. Data Encoding
- Categorical features (e.g., Brand) are converted into numerical form
- Encoding is necessary for machine learning models to work properly
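
A small sketch of one way to do this: build the brand-to-integer mapping from the distinct values, then apply it with `map`. The sample brands and the variable names are made up for illustration.

```python
import pandas as pd

brands = pd.Series(["Maruti", "Hyundai", "Tata", "Maruti"], name="brand")

# Assign each distinct brand an integer code, in order of first appearance.
codes = {name: i for i, name in enumerate(brands.unique(), start=1)}
encoded = brands.map(codes)

print(codes)
print(encoded.tolist())
```

Any value missing from the dictionary would become `NaN` after `map`, so it is worth checking `encoded.isnull().sum()` afterwards.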

## 8. Data Splitting (Training & Testing)
- The dataset is divided into two parts:
  - **Training Data**: used to train (fit) the model
  - **Testing Data**: used to test the model's performance

This way the model is evaluated on data it has not seen, which shows how well it generalizes.


## 9. Model Selection
We use **Linear Regression** to predict price.

### Linear Regression Formula:
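``` y = mx + c ```

Here `y` is the predicted value (price), `x` is a feature, `m` is the slope (coefficient), and `c` is the intercept. With several features, each feature gets its own coefficient: `y = m1*x1 + m2*x2 + ... + c`.

---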
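
To make the formula concrete, here is a small worked sketch in plain Python that recovers `m` and `c` from a few points using the closed-form least-squares formulas. The points are made up (generated from `y = 2x + 1`) and are not taken from the dataset.

```python
# Fit y = m*x + c to a few points with the closed-form least-squares formulas.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# slope = sum of co-deviations / sum of squared x deviations
m = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
# the fitted line always passes through (mean_x, mean_y)
c = mean_y - m * mean_x

print(m, c)
```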

## 10. Pandas Usage
The following steps are performed using **Pandas**:
- Data loading
- Cleaning
- EDA
- Feature selection

Data splitting is done with scikit-learn's `train_test_split`.

### Installation:
```bash
!pip install pandas
```
----
#### Useful Jupyter shell commands
```
!dir              # list files (Windows)
%cd folder_name   # change directory (!cd runs in a subshell and does not persist)
!mkdir data       # create folder
!del file.txt     # delete file (Windows)
```

In [1]:
import pandas as pd
import numpy as np 

In [2]:
df = pd.read_csv("Used_Bikes.csv")
df

Unnamed: 0,bike_name,price,city,kms_driven,owner,age,power,brand
0,TVS Star City Plus Dual Tone 110cc,35000.0,Ahmedabad,17654.0,First Owner,3.0,110.0,TVS
1,Royal Enfield Classic 350cc,119900.0,Delhi,11000.0,First Owner,4.0,350.0,Royal Enfield
2,Triumph Daytona 675R,600000.0,Delhi,110.0,First Owner,8.0,675.0,Triumph
3,TVS Apache RTR 180cc,65000.0,Bangalore,16329.0,First Owner,4.0,180.0,TVS
4,Yamaha FZ S V 2.0 150cc-Ltd. Edition,80000.0,Bangalore,10000.0,First Owner,3.0,150.0,Yamaha
...,...,...,...,...,...,...,...,...
32643,Hero Passion Pro 100cc,39000.0,Delhi,22000.0,First Owner,4.0,100.0,Hero
32644,TVS Apache RTR 180cc,30000.0,Karnal,6639.0,First Owner,9.0,180.0,TVS
32645,Bajaj Avenger Street 220,60000.0,Delhi,20373.0,First Owner,6.0,220.0,Bajaj
32646,Hero Super Splendor 125cc,15600.0,Jaipur,84186.0,First Owner,16.0,125.0,Hero


## ***LINEAR REGRESSION*** ```y=mx+c```

In [3]:
df.info() #describes each column data type

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32648 entries, 0 to 32647
Data columns (total 8 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   bike_name   32648 non-null  object 
 1   price       32648 non-null  float64
 2   city        32648 non-null  object 
 3   kms_driven  32648 non-null  float64
 4   owner       32648 non-null  object 
 5   age         32648 non-null  float64
 6   power       32648 non-null  float64
 7   brand       32648 non-null  object 
dtypes: float64(4), object(4)
memory usage: 2.0+ MB


In [4]:
df.columns #Returns a Pandas Index object containing all column names.

Index(['bike_name', 'price', 'city', 'kms_driven', 'owner', 'age', 'power',
       'brand'],
      dtype='object')

In [5]:
df.drop_duplicates() #Returns a new DataFrame with duplicate rows removed, doesn't impact original

Unnamed: 0,bike_name,price,city,kms_driven,owner,age,power,brand
0,TVS Star City Plus Dual Tone 110cc,35000.0,Ahmedabad,17654.0,First Owner,3.0,110.0,TVS
1,Royal Enfield Classic 350cc,119900.0,Delhi,11000.0,First Owner,4.0,350.0,Royal Enfield
2,Triumph Daytona 675R,600000.0,Delhi,110.0,First Owner,8.0,675.0,Triumph
3,TVS Apache RTR 180cc,65000.0,Bangalore,16329.0,First Owner,4.0,180.0,TVS
4,Yamaha FZ S V 2.0 150cc-Ltd. Edition,80000.0,Bangalore,10000.0,First Owner,3.0,150.0,Yamaha
...,...,...,...,...,...,...,...,...
9362,Hero Hunk Rear Disc 150cc,25000.0,Delhi,48587.0,First Owner,8.0,150.0,Hero
9369,Bajaj Avenger 220cc,35000.0,Bangalore,60000.0,First Owner,9.0,220.0,Bajaj
9370,Harley-Davidson Street 750 ABS,450000.0,Jodhpur,3430.0,First Owner,4.0,750.0,Harley-Davidson
9371,Bajaj Dominar 400 ABS,139000.0,Hyderabad,21300.0,First Owner,4.0,400.0,Bajaj


In [6]:
df.duplicated() #used to identify duplicate rows in a DataFrame with boolean result

0        False
1        False
2        False
3        False
4        False
         ...  
32643     True
32644     True
32645     True
32646     True
32647     True
Length: 32648, dtype: bool

In [7]:
df.duplicated().sum() #count the duplicates

np.int64(25324)

In [8]:
df.drop_duplicates(inplace=True) # inplace=True removes duplicates permanently from the original DataFrame


```df.drop_duplicates(inplace=True)```
- Removes duplicate rows
- Modifies the original DataFrame df permanently
- Returns None

### Another method
```df = df.drop_duplicates() ```
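
The difference between the two forms can be checked on a tiny made-up DataFrame (the name `df_demo` is illustrative):

```python
import pandas as pd

df_demo = pd.DataFrame({"brand": ["TVS", "TVS", "Hero"], "price": [35000, 35000, 45000]})

deduped = df_demo.drop_duplicates()             # new DataFrame; df_demo is unchanged
print(len(df_demo), len(deduped))

result = df_demo.drop_duplicates(inplace=True)  # modifies df_demo; returns None
print(result, len(df_demo))
```

Because the `inplace=True` form returns `None`, writing `df = df.drop_duplicates(inplace=True)` would overwrite `df` with `None`.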



In [9]:
df.duplicated().sum() #0 now, because all duplicates were removed

np.int64(0)

In [10]:
df

Unnamed: 0,bike_name,price,city,kms_driven,owner,age,power,brand
0,TVS Star City Plus Dual Tone 110cc,35000.0,Ahmedabad,17654.0,First Owner,3.0,110.0,TVS
1,Royal Enfield Classic 350cc,119900.0,Delhi,11000.0,First Owner,4.0,350.0,Royal Enfield
2,Triumph Daytona 675R,600000.0,Delhi,110.0,First Owner,8.0,675.0,Triumph
3,TVS Apache RTR 180cc,65000.0,Bangalore,16329.0,First Owner,4.0,180.0,TVS
4,Yamaha FZ S V 2.0 150cc-Ltd. Edition,80000.0,Bangalore,10000.0,First Owner,3.0,150.0,Yamaha
...,...,...,...,...,...,...,...,...
9362,Hero Hunk Rear Disc 150cc,25000.0,Delhi,48587.0,First Owner,8.0,150.0,Hero
9369,Bajaj Avenger 220cc,35000.0,Bangalore,60000.0,First Owner,9.0,220.0,Bajaj
9370,Harley-Davidson Street 750 ABS,450000.0,Jodhpur,3430.0,First Owner,4.0,750.0,Harley-Davidson
9371,Bajaj Dominar 400 ABS,139000.0,Hyderabad,21300.0,First Owner,4.0,400.0,Bajaj


In [11]:
df.isnull() #boolean mask: True where a value is missing

Unnamed: 0,bike_name,price,city,kms_driven,owner,age,power,brand
0,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...
9362,False,False,False,False,False,False,False,False
9369,False,False,False,False,False,False,False,False
9370,False,False,False,False,False,False,False,False
9371,False,False,False,False,False,False,False,False


##### Returns a DataFrame of the same shape as df

- True → value is missing (NaN, None, NaT)

- False → value is present

In [12]:
df.isnull().sum()

bike_name     0
price         0
city          0
kms_driven    0
owner         0
age           0
power         0
brand         0
dtype: int64

In [13]:
df.dropna(inplace=True) #drops rows that contain null values; modifies df in place and returns None

In [14]:
# Count NULL values per column
df.isnull().sum() #all zeros: any rows with nulls have been dropped

bike_name     0
price         0
city          0
kms_driven    0
owner         0
age           0
power         0
brand         0
dtype: int64

In [15]:
df.isnull().count()

bike_name     7324
price         7324
city          7324
kms_driven    7324
owner         7324
age           7324
power         7324
brand         7324
dtype: int64

```df.isnull().count()``` counts the total number of rows in each column, not the missing values.
Use ```df.isnull().sum()``` to count null entries.
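
A short sketch on made-up data with two missing values shows the difference (`df_demo` is an illustrative name):

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({"age": [3.0, np.nan, 5.0], "power": [110.0, 150.0, np.nan]})

mask = df_demo.isnull()
print(mask.count().to_dict())  # rows per column, regardless of missing values
print(mask.sum().to_dict())    # number of missing values per column
```

`sum()` works here because `True` counts as 1 and `False` as 0.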

In [16]:
df

Unnamed: 0,bike_name,price,city,kms_driven,owner,age,power,brand
0,TVS Star City Plus Dual Tone 110cc,35000.0,Ahmedabad,17654.0,First Owner,3.0,110.0,TVS
1,Royal Enfield Classic 350cc,119900.0,Delhi,11000.0,First Owner,4.0,350.0,Royal Enfield
2,Triumph Daytona 675R,600000.0,Delhi,110.0,First Owner,8.0,675.0,Triumph
3,TVS Apache RTR 180cc,65000.0,Bangalore,16329.0,First Owner,4.0,180.0,TVS
4,Yamaha FZ S V 2.0 150cc-Ltd. Edition,80000.0,Bangalore,10000.0,First Owner,3.0,150.0,Yamaha
...,...,...,...,...,...,...,...,...
9362,Hero Hunk Rear Disc 150cc,25000.0,Delhi,48587.0,First Owner,8.0,150.0,Hero
9369,Bajaj Avenger 220cc,35000.0,Bangalore,60000.0,First Owner,9.0,220.0,Bajaj
9370,Harley-Davidson Street 750 ABS,450000.0,Jodhpur,3430.0,First Owner,4.0,750.0,Harley-Davidson
9371,Bajaj Dominar 400 ABS,139000.0,Hyderabad,21300.0,First Owner,4.0,400.0,Bajaj


In [17]:
df['brand'] #select a single column (returns a Series)

0                   TVS
1         Royal Enfield
2               Triumph
3                   TVS
4                Yamaha
             ...       
9362               Hero
9369              Bajaj
9370    Harley-Davidson
9371              Bajaj
9372              Bajaj
Name: brand, Length: 7324, dtype: object

In [18]:
df['brand'].unique() #array of the distinct values

array(['TVS', 'Royal Enfield', 'Triumph', 'Yamaha', 'Honda', 'Hero',
       'Bajaj', 'Suzuki', 'Benelli', 'KTM', 'Mahindra', 'Kawasaki',
       'Ducati', 'Hyosung', 'Harley-Davidson', 'Jawa', 'BMW', 'Indian',
       'Rajdoot', 'LML', 'Yezdi', 'MV', 'Ideal'], dtype=object)

In [19]:
df['brand'].nunique() #count total unique value 

23

In [20]:
df[['brand','owner']] #for multiple columns

Unnamed: 0,brand,owner
0,TVS,First Owner
1,Royal Enfield,First Owner
2,Triumph,First Owner
3,TVS,First Owner
4,Yamaha,First Owner
...,...,...
9362,Hero,First Owner
9369,Bajaj,First Owner
9370,Harley-Davidson,First Owner
9371,Bajaj,First Owner


In [21]:
df['owner'].unique()

array(['First Owner', 'Second Owner', 'Third Owner',
       'Fourth Owner Or More'], dtype=object)

In [22]:
df.describe() # shows how much data there is, what it looks like, and how spread out it is

Unnamed: 0,price,kms_driven,age,power
count,7324.0,7324.0,7324.0,7324.0
mean,84883.9,23910.496587,6.656472,228.133397
std,120966.2,27317.594631,3.605299,158.324219
min,4400.0,1.0,1.0,100.0
25%,30000.0,10155.75,4.0,125.0
50%,55000.0,19000.0,6.0,160.0
75%,100000.0,30112.0,8.0,350.0
max,1900000.0,750000.0,63.0,1800.0


In [23]:
df

Unnamed: 0,bike_name,price,city,kms_driven,owner,age,power,brand
0,TVS Star City Plus Dual Tone 110cc,35000.0,Ahmedabad,17654.0,First Owner,3.0,110.0,TVS
1,Royal Enfield Classic 350cc,119900.0,Delhi,11000.0,First Owner,4.0,350.0,Royal Enfield
2,Triumph Daytona 675R,600000.0,Delhi,110.0,First Owner,8.0,675.0,Triumph
3,TVS Apache RTR 180cc,65000.0,Bangalore,16329.0,First Owner,4.0,180.0,TVS
4,Yamaha FZ S V 2.0 150cc-Ltd. Edition,80000.0,Bangalore,10000.0,First Owner,3.0,150.0,Yamaha
...,...,...,...,...,...,...,...,...
9362,Hero Hunk Rear Disc 150cc,25000.0,Delhi,48587.0,First Owner,8.0,150.0,Hero
9369,Bajaj Avenger 220cc,35000.0,Bangalore,60000.0,First Owner,9.0,220.0,Bajaj
9370,Harley-Davidson Street 750 ABS,450000.0,Jodhpur,3430.0,First Owner,4.0,750.0,Harley-Davidson
9371,Bajaj Dominar 400 ABS,139000.0,Hyderabad,21300.0,First Owner,4.0,400.0,Bajaj


### Dropping Unnecessary Columns using Pandas

```python
df.drop(['city'], axis=1, inplace=True)
```

```
axis = 0 → rows (labels are looked up in the index)
axis = 1 → columns (labels are looked up in the column names)
```
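
A minimal sketch of both directions on a made-up two-row DataFrame (`df_demo` and its index labels are illustrative):

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"city": ["Delhi", "Jaipur"], "price": [39000, 15600]},
    index=[10, 11],
)

no_city = df_demo.drop(["city"], axis=1)   # drop a column by name
same = df_demo.drop(columns=["city"])      # equivalent keyword form
no_row = df_demo.drop([10], axis=0)        # drop a row by index label

print(list(no_city.columns), list(no_row.index))
```

The `columns=` keyword avoids having to remember which axis number means what.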

In [24]:
df.drop(['city'],axis=1,inplace=True) #axis=1 is for column and axis=0 is for row

In [25]:
df #city has been dropped from df

Unnamed: 0,bike_name,price,kms_driven,owner,age,power,brand
0,TVS Star City Plus Dual Tone 110cc,35000.0,17654.0,First Owner,3.0,110.0,TVS
1,Royal Enfield Classic 350cc,119900.0,11000.0,First Owner,4.0,350.0,Royal Enfield
2,Triumph Daytona 675R,600000.0,110.0,First Owner,8.0,675.0,Triumph
3,TVS Apache RTR 180cc,65000.0,16329.0,First Owner,4.0,180.0,TVS
4,Yamaha FZ S V 2.0 150cc-Ltd. Edition,80000.0,10000.0,First Owner,3.0,150.0,Yamaha
...,...,...,...,...,...,...,...
9362,Hero Hunk Rear Disc 150cc,25000.0,48587.0,First Owner,8.0,150.0,Hero
9369,Bajaj Avenger 220cc,35000.0,60000.0,First Owner,9.0,220.0,Bajaj
9370,Harley-Davidson Street 750 ABS,450000.0,3430.0,First Owner,4.0,750.0,Harley-Davidson
9371,Bajaj Dominar 400 ABS,139000.0,21300.0,First Owner,4.0,400.0,Bajaj


In [26]:
df.drop(['bike_name'],axis=1,inplace=True) #removes the bike_name column

In [27]:
df

Unnamed: 0,price,kms_driven,owner,age,power,brand
0,35000.0,17654.0,First Owner,3.0,110.0,TVS
1,119900.0,11000.0,First Owner,4.0,350.0,Royal Enfield
2,600000.0,110.0,First Owner,8.0,675.0,Triumph
3,65000.0,16329.0,First Owner,4.0,180.0,TVS
4,80000.0,10000.0,First Owner,3.0,150.0,Yamaha
...,...,...,...,...,...,...
9362,25000.0,48587.0,First Owner,8.0,150.0,Hero
9369,35000.0,60000.0,First Owner,9.0,220.0,Bajaj
9370,450000.0,3430.0,First Owner,4.0,750.0,Harley-Davidson
9371,139000.0,21300.0,First Owner,4.0,400.0,Bajaj


In [28]:
df['brand'].value_counts() #counts the number of bikes per brand

brand
Bajaj              2081
Royal Enfield      1346
Hero               1142
Honda               676
Yamaha              651
TVS                 481
KTM                 375
Suzuki              203
Harley-Davidson      91
Kawasaki             61
Hyosung              53
Mahindra             50
Benelli              46
Triumph              21
Ducati               20
BMW                  10
Jawa                  7
Indian                3
MV                    3
Rajdoot               1
LML                   1
Yezdi                 1
Ideal                 1
Name: count, dtype: int64

### Filtering Data Based on a Condition (Boolean Indexing)

###### **Data Cleaning** is the process of removing unwanted columns, duplicate records, and missing values from data.
- This is what was done to `df` above: unnecessary columns, duplicates, and null values were removed.

------
```python
TVS_bike = df[df['brand'] == 'TVS']
```
##### Another method 
```python
TVS_bike = df.query("brand == 'TVS'")
```
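
To see what happens underneath, the sketch below prints the boolean mask and checks that both methods select the same rows. The four-row `bikes` DataFrame is made up for illustration.

```python
import pandas as pd

bikes = pd.DataFrame({
    "brand": ["TVS", "Hero", "TVS", "Honda"],
    "price": [35000, 45000, 65000, 48000],
})

mask = bikes["brand"] == "TVS"          # boolean Series, one entry per row
by_mask = bikes[mask]                   # keeps rows where the mask is True
by_query = bikes.query("brand == 'TVS'")

print(mask.tolist())
print(by_mask.index.tolist(), by_query.index.tolist())
```

Note that the original index labels are kept after filtering, which is why the filtered outputs in this notebook show gaps in the index.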



Question- 
- brand = 'TVS', owner = 'First Owner'
- age <= 2 yrs, price <= 50k

In [29]:
TVS_bike = df[df['brand']=='TVS']  #only TVS bikes

In [30]:
TVS_bike

Unnamed: 0,price,kms_driven,owner,age,power,brand
0,35000.0,17654.0,First Owner,3.0,110.0,TVS
3,65000.0,16329.0,First Owner,4.0,180.0,TVS
52,60000.0,30000.0,First Owner,5.0,160.0,TVS
114,69900.0,8700.0,First Owner,3.0,160.0,TVS
130,21500.0,10500.0,First Owner,5.0,125.0,TVS
...,...,...,...,...,...,...
9247,70000.0,4116.0,First Owner,3.0,160.0,TVS
9307,30000.0,30000.0,First Owner,10.0,160.0,TVS
9312,65450.0,9238.0,First Owner,3.0,200.0,TVS
9320,20000.0,84916.0,First Owner,14.0,150.0,TVS


In [31]:
TVS_bike= TVS_bike[TVS_bike['owner']=='First Owner'] #tvs bikes with first owner only

In [32]:
TVS_bike

Unnamed: 0,price,kms_driven,owner,age,power,brand
0,35000.0,17654.0,First Owner,3.0,110.0,TVS
3,65000.0,16329.0,First Owner,4.0,180.0,TVS
52,60000.0,30000.0,First Owner,5.0,160.0,TVS
114,69900.0,8700.0,First Owner,3.0,160.0,TVS
130,21500.0,10500.0,First Owner,5.0,125.0,TVS
...,...,...,...,...,...,...
9247,70000.0,4116.0,First Owner,3.0,160.0,TVS
9307,30000.0,30000.0,First Owner,10.0,160.0,TVS
9312,65450.0,9238.0,First Owner,3.0,200.0,TVS
9320,20000.0,84916.0,First Owner,14.0,150.0,TVS


In [33]:
TVS_bike = TVS_bike[TVS_bike['price']<=50000] # first owner of tvs bikes with price <= 50k

In [34]:
TVS_bike

Unnamed: 0,price,kms_driven,owner,age,power,brand
0,35000.0,17654.0,First Owner,3.0,110.0,TVS
130,21500.0,10500.0,First Owner,5.0,125.0,TVS
131,40000.0,20000.0,First Owner,6.0,160.0,TVS
215,28000.0,28428.0,First Owner,9.0,110.0,TVS
235,28000.0,36000.0,First Owner,5.0,100.0,TVS
...,...,...,...,...,...,...
9155,14000.0,17602.0,First Owner,13.0,110.0,TVS
9157,32000.0,17870.0,First Owner,7.0,110.0,TVS
9158,18000.0,13673.0,First Owner,7.0,110.0,TVS
9307,30000.0,30000.0,First Owner,10.0,160.0,TVS


In [35]:
TVS_bike=TVS_bike[TVS_bike['age']<=2] # first owner of tvs bikes with price <= 50k and age <=2

In [36]:
TVS_bike #only 1 bike meets all the conditions

Unnamed: 0,price,kms_driven,owner,age,power,brand
7055,46000.0,6222.0,First Owner,2.0,100.0,TVS


In [37]:
df['owner'].unique()

array(['First Owner', 'Second Owner', 'Third Owner',
       'Fourth Owner Or More'], dtype=object)

Question- 
- brand = 'Honda', owner = 'First Owner'
- age <= 2 yrs, price <= 50k
- kms_driven <= 40k

In [38]:
df 

Unnamed: 0,price,kms_driven,owner,age,power,brand
0,35000.0,17654.0,First Owner,3.0,110.0,TVS
1,119900.0,11000.0,First Owner,4.0,350.0,Royal Enfield
2,600000.0,110.0,First Owner,8.0,675.0,Triumph
3,65000.0,16329.0,First Owner,4.0,180.0,TVS
4,80000.0,10000.0,First Owner,3.0,150.0,Yamaha
...,...,...,...,...,...,...
9362,25000.0,48587.0,First Owner,8.0,150.0,Hero
9369,35000.0,60000.0,First Owner,9.0,220.0,Bajaj
9370,450000.0,3430.0,First Owner,4.0,750.0,Harley-Davidson
9371,139000.0,21300.0,First Owner,4.0,400.0,Bajaj


In [39]:
HONDA = df[df['brand']=='Honda'] #honda bikes only

In [40]:
HONDA

Unnamed: 0,price,kms_driven,owner,age,power,brand
6,85000.0,8200.0,First Owner,3.0,160.0,Honda
27,20800.0,30500.0,Second Owner,7.0,125.0,Honda
29,81200.0,9100.0,First Owner,2.0,160.0,Honda
34,40000.0,30000.0,First Owner,8.0,150.0,Honda
37,65000.0,43000.0,First Owner,6.0,150.0,Honda
...,...,...,...,...,...,...
9250,80000.0,8000.0,First Owner,10.0,250.0,Honda
9258,65000.0,18000.0,First Owner,6.0,160.0,Honda
9260,120000.0,14000.0,Second Owner,4.0,250.0,Honda
9340,34400.0,24513.0,Second Owner,8.0,150.0,Honda


In [41]:
HONDA=HONDA[HONDA['price']<=50000] #honda bikes with price <= 50000

In [42]:
HONDA

Unnamed: 0,price,kms_driven,owner,age,power,brand
27,20800.0,30500.0,Second Owner,7.0,125.0,Honda
34,40000.0,30000.0,First Owner,8.0,150.0,Honda
53,21900.0,30000.0,Second Owner,7.0,125.0,Honda
56,34500.0,17056.0,First Owner,5.0,110.0,Honda
82,38000.0,33000.0,First Owner,7.0,125.0,Honda
...,...,...,...,...,...,...
9209,25000.0,27000.0,First Owner,8.0,110.0,Honda
9210,24990.0,39000.0,First Owner,10.0,125.0,Honda
9234,46000.0,9000.0,First Owner,5.0,110.0,Honda
9239,35000.0,14992.0,First Owner,8.0,110.0,Honda


In [43]:
HONDA=HONDA[HONDA['age']<=2] #honda bikes with price <= 50000,age <=2

In [44]:
HONDA

Unnamed: 0,price,kms_driven,owner,age,power,brand
2349,48000.0,7119.0,First Owner,2.0,110.0,Honda


In [45]:
HONDA=HONDA[HONDA['owner']== 'First Owner'] #honda bikes with price <= 50000,age <=2,first owner

In [46]:
HONDA

Unnamed: 0,price,kms_driven,owner,age,power,brand
2349,48000.0,7119.0,First Owner,2.0,110.0,Honda


In [47]:
HONDA=HONDA[HONDA['kms_driven']<= 40000] #honda bikes with price <= 50000,age <=2,first owner,kms driven<=40k

In [48]:
HONDA

Unnamed: 0,price,kms_driven,owner,age,power,brand
2349,48000.0,7119.0,First Owner,2.0,110.0,Honda
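
The step-by-step filtering above can also be written as one combined mask. This is a sketch on four made-up rows, not the real dataset; the first row mirrors the single match found above.

```python
import pandas as pd

bikes = pd.DataFrame({
    "brand": ["Honda", "Honda", "Hero", "Honda"],
    "owner": ["First Owner", "Second Owner", "First Owner", "First Owner"],
    "age": [2.0, 2.0, 1.0, 8.0],
    "price": [48000.0, 30000.0, 40000.0, 40000.0],
    "kms_driven": [7119.0, 9000.0, 5000.0, 30000.0],
})

# Each comparison is wrapped in parentheses because & binds tighter than == and <=.
selected = bikes[
    (bikes["brand"] == "Honda")
    & (bikes["owner"] == "First Owner")
    & (bikes["age"] <= 2)
    & (bikes["price"] <= 50000)
    & (bikes["kms_driven"] <= 40000)
]

print(selected.index.tolist())
```

A single mask avoids reassigning the same variable several times, so the intermediate results cannot be mixed up when cells are re-run out of order.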


Question- 
- brand = 'Hero'
- kms_driven <=50k

In [49]:
Hero = df[(df['brand']=='Hero') & (df['kms_driven']<=50000)] 

In [50]:
Hero

Unnamed: 0,price,kms_driven,owner,age,power,brand
7,45000.0,12645.0,First Owner,3.0,100.0,Hero
22,46500.0,3500.0,First Owner,2.0,110.0,Hero
26,20000.0,29305.0,First Owner,16.0,125.0,Hero
48,37000.0,10800.0,First Owner,8.0,150.0,Hero
66,12200.0,46643.0,First Owner,14.0,100.0,Hero
...,...,...,...,...,...,...
9315,20000.0,5000.0,First Owner,10.0,100.0,Hero
9316,37000.0,28478.0,First Owner,5.0,125.0,Hero
9339,11400.0,20000.0,Second Owner,17.0,100.0,Hero
9341,25000.0,11122.0,First Owner,11.0,100.0,Hero


- ```df.shape[0]``` → number of rows

- ```df.shape[1]``` → number of columns

In [51]:
Hero.shape[1]

6

In [52]:
Hero.shape[0]

1008

##### ```df.info()```
###### Shows how many rows there are, which columns exist, each column's data type, and whether any values are null.

In [53]:
df.info() 

<class 'pandas.core.frame.DataFrame'>
Index: 7324 entries, 0 to 9372
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   price       7324 non-null   float64
 1   kms_driven  7324 non-null   float64
 2   owner       7324 non-null   object 
 3   age         7324 non-null   float64
 4   power       7324 non-null   float64
 5   brand       7324 non-null   object 
dtypes: float64(4), object(2)
memory usage: 400.5+ KB


In [54]:
df

Unnamed: 0,price,kms_driven,owner,age,power,brand
0,35000.0,17654.0,First Owner,3.0,110.0,TVS
1,119900.0,11000.0,First Owner,4.0,350.0,Royal Enfield
2,600000.0,110.0,First Owner,8.0,675.0,Triumph
3,65000.0,16329.0,First Owner,4.0,180.0,TVS
4,80000.0,10000.0,First Owner,3.0,150.0,Yamaha
...,...,...,...,...,...,...
9362,25000.0,48587.0,First Owner,8.0,150.0,Hero
9369,35000.0,60000.0,First Owner,9.0,220.0,Bajaj
9370,450000.0,3430.0,First Owner,4.0,750.0,Harley-Davidson
9371,139000.0,21300.0,First Owner,4.0,400.0,Bajaj


In [55]:
#COLUMNS ENCODING , changing data type of brand from object to int

### Column Encoding (Categorical → Numerical)

In machine learning, most models cannot work directly with **text (object) data**.
So we convert categorical columns like `brand` from **object → int**.
This process is called **Column Encoding**.

#### Why encoding is required
- ML models understand **numbers**, not strings
- Converts categorical data into numerical form
- Required before applying regression or classification models
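
One caveat: mapping brands to integers (1, 2, 3, ...) implies an order between brands, and a linear model will treat that order as meaningful. One-hot encoding is a common alternative. The sketch below uses `pd.get_dummies` on made-up rows; it is not the method used in this notebook.

```python
import pandas as pd

bikes = pd.DataFrame({"brand": ["TVS", "Hero", "TVS"], "age": [3, 8, 4]})

# One column per brand; dtype=int gives 0/1 instead of booleans.
one_hot = pd.get_dummies(bikes, columns=["brand"], dtype=int)

print(list(one_hot.columns))
print(one_hot["brand_TVS"].tolist())
```

The trade-off is width: with 23 brands, one-hot encoding adds 23 columns where integer mapping keeps one.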

---

#### Example: Encoding `brand` column

##### Before encoding
```python
df['brand'].dtype
```


In [56]:
print(df['brand'].dtype)


object


In [57]:
df['brand'].unique()

array(['TVS', 'Royal Enfield', 'Triumph', 'Yamaha', 'Honda', 'Hero',
       'Bajaj', 'Suzuki', 'Benelli', 'KTM', 'Mahindra', 'Kawasaki',
       'Ducati', 'Hyosung', 'Harley-Davidson', 'Jawa', 'BMW', 'Indian',
       'Rajdoot', 'LML', 'Yezdi', 'MV', 'Ideal'], dtype=object)

In [58]:
dct ={ 'TVS':1, 'Royal Enfield':2, 'Triumph':3, 'Yamaha':4, 'Honda':5, 'Hero':6,
       'Bajaj':7, 'Suzuki':8, 'Benelli':9, 'KTM':10, 'Mahindra':11, 'Kawasaki':12,
       'Ducati':13, 'Hyosung':14, 'Harley-Davidson':15, 'Jawa':16, 'BMW':17, 'Indian':18,
       'Rajdoot':19, 'LML':20, 'Yezdi':21, 'MV':22, 'Ideal':23}

In [59]:
df['brand']= df['brand'].map(dct) #brand names get converted into integer 1,2,3 using mapping from dictionary

In [60]:
print(df['brand'].dtype) #Column datatype changes from object → int


int64


In [61]:
df['owner'].unique()

array(['First Owner', 'Second Owner', 'Third Owner',
       'Fourth Owner Or More'], dtype=object)

In [62]:
dct2 ={'First Owner':1, 'Second Owner':2, 'Third Owner':3,
       'Fourth Owner Or More':4}

In [63]:
df['owner']= df['owner'].map(dct2) 

In [64]:
df

Unnamed: 0,price,kms_driven,owner,age,power,brand
0,35000.0,17654.0,1,3.0,110.0,1
1,119900.0,11000.0,1,4.0,350.0,2
2,600000.0,110.0,1,8.0,675.0,3
3,65000.0,16329.0,1,4.0,180.0,1
4,80000.0,10000.0,1,3.0,150.0,4
...,...,...,...,...,...,...
9362,25000.0,48587.0,1,8.0,150.0,6
9369,35000.0,60000.0,1,9.0,220.0,7
9370,450000.0,3430.0,1,4.0,750.0,15
9371,139000.0,21300.0,1,4.0,400.0,7


In [65]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7324 entries, 0 to 9372
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   price       7324 non-null   float64
 1   kms_driven  7324 non-null   float64
 2   owner       7324 non-null   int64  
 3   age         7324 non-null   float64
 4   power       7324 non-null   float64
 5   brand       7324 non-null   int64  
dtypes: float64(4), int64(2)
memory usage: 400.5 KB


In [66]:
df 

Unnamed: 0,price,kms_driven,owner,age,power,brand
0,35000.0,17654.0,1,3.0,110.0,1
1,119900.0,11000.0,1,4.0,350.0,2
2,600000.0,110.0,1,8.0,675.0,3
3,65000.0,16329.0,1,4.0,180.0,1
4,80000.0,10000.0,1,3.0,150.0,4
...,...,...,...,...,...,...
9362,25000.0,48587.0,1,8.0,150.0,6
9369,35000.0,60000.0,1,9.0,220.0,7
9370,450000.0,3430.0,1,4.0,750.0,15
9371,139000.0,21300.0,1,4.0,400.0,7


> Because all cells in a Jupyter Notebook share one kernel, a change made to a DataFrame in one cell is visible in every cell run afterwards, until the data is reloaded or the kernel is restarted.
> So from this point on in the notebook, `df['brand']` holds `int64` codes instead of brand names.

To keep the original `df` unchanged, encode a copy instead:
```python
df2 = df.copy()
df2['brand'] = df2['brand'].map(dct)
```
- This leaves the original `df` untouched.
- To get the original data back after encoding `df` in place, reload it with ```df = pd.read_csv("Used_Bikes.csv")```, or restart the kernel and run the cells again.
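
A quick check of this behaviour on a made-up two-row DataFrame (the names `original` and `working` are illustrative):

```python
import pandas as pd

original = pd.DataFrame({"brand": ["TVS", "Hero"]})

working = original.copy()                       # independent copy of the data
working["brand"] = working["brand"].map({"TVS": 1, "Hero": 6})

print(original["brand"].tolist())
print(working["brand"].tolist())
```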



In [67]:
df.to_csv("updated_used_bike.csv") #save the encoded data to a new CSV file

> Now divide data into two parts (dependent and independent variables)

In [68]:
df

Unnamed: 0,price,kms_driven,owner,age,power,brand
0,35000.0,17654.0,1,3.0,110.0,1
1,119900.0,11000.0,1,4.0,350.0,2
2,600000.0,110.0,1,8.0,675.0,3
3,65000.0,16329.0,1,4.0,180.0,1
4,80000.0,10000.0,1,3.0,150.0,4
...,...,...,...,...,...,...
9362,25000.0,48587.0,1,8.0,150.0,6
9369,35000.0,60000.0,1,9.0,220.0,7
9370,450000.0,3430.0,1,4.0,750.0,15
9371,139000.0,21300.0,1,4.0,400.0,7


In [69]:
x=df.drop('price',axis=1) #independent variables: every column except price

In [70]:
y= df[['price']] #dependent variable: price only (double brackets return a one-column DataFrame)

In [71]:
x

Unnamed: 0,kms_driven,owner,age,power,brand
0,17654.0,1,3.0,110.0,1
1,11000.0,1,4.0,350.0,2
2,110.0,1,8.0,675.0,3
3,16329.0,1,4.0,180.0,1
4,10000.0,1,3.0,150.0,4
...,...,...,...,...,...
9362,48587.0,1,8.0,150.0,6
9369,60000.0,1,9.0,220.0,7
9370,3430.0,1,4.0,750.0,15
9371,21300.0,1,4.0,400.0,7


In [72]:
y

Unnamed: 0,price
0,35000.0
1,119900.0
2,600000.0
3,65000.0
4,80000.0
...,...
9362,25000.0
9369,35000.0
9370,450000.0
9371,139000.0
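
The number of brackets decides what kind of object `y` is. A small sketch on made-up values (`df_demo` is an illustrative name):

```python
import pandas as pd

df_demo = pd.DataFrame({"price": [35000.0, 119900.0], "age": [3.0, 4.0]})

as_series = df_demo["price"]     # single brackets: 1-D Series
as_frame = df_demo[["price"]]    # double brackets: 2-D DataFrame with one column

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```

Scikit-learn accepts either form as the target; with a one-column DataFrame, the predictions come back as a 2-D array with one column.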


## 🔹 Data Splitting (Training & Testing)

Data splitting means dividing the dataset into **two parts** so that a machine learning model can be **trained and tested properly**.

---

## 🧩 What are X and Y?
- **X** → Features (input data)
  - Example: Brand, Age, Owner, Kms Driven
- **Y** → Target / Output
  - Example: Price

---

## 📚 Training Data (Learning Phase)

### 🔹 X_train
- Contains **input features** used to train the model
- The model learns patterns from this data

### 🔹 Y_train
- Contains **correct output values** for X_train
- Helps the model understand the correct relationship between features and target

👉 Together, **X_train + Y_train** are used to **create the model**

---

## 📝 Testing Data (Evaluation Phase)

### 🔹 X_test
- Contains **new input features**
- The model has **never seen this data before**

### 🔹 Y_test
- Contains **actual output values** for X_test
- Used to compare with the model's predictions

👉 Together, **X_test + Y_test** are used to **check model performance**

---

## ⚖️ Why Data Splitting is Important
- Reveals whether the model has only memorized the training data
- Tests performance on unseen data
- Helps detect overfitting
- Gives a more realistic estimate of real-world performance

---

## 📊 Common Split Ratio
- Training Data → 70–80%
- Testing Data → 20–30%

---

## 🧠 Simple One-Line Explanation
Data splitting ensures the model **learns from one part of the data and is evaluated on another part**.

---

## ✅ Summary
- **X_train** → Features used to train the model
- **Y_train** → Correct answers during training
- **X_test** → Features used to test the model
- **Y_test** → Correct answers to evaluate predictions
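
The four pieces can be produced in one call with scikit-learn's `train_test_split`. This is a sketch on five made-up rows (the `cars` DataFrame is illustrative), using an 80/20 split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

cars = pd.DataFrame({
    "age": [3, 5, 2, 7, 4],
    "kms_driven": [25000, 60000, 18000, 85000, 40000],
    "price": [450000, 350000, 520000, 280000, 480000],
})

X = cars.drop("price", axis=1)
y = cars["price"]

# 20% of 5 rows gives 1 test row; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```

Every row ends up in exactly one of the two sets, and the index labels of `X_train` and `y_train` stay aligned.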


## 🔹 What Does Scikit-learn Do?

**Scikit-learn** is a Python library used to **build, train, test, and evaluate machine learning models** easily.

---

## 📌 Main Uses of Scikit-learn

### 1️⃣ Data Splitting
- Splits data into **training and testing sets**
- Example: `train_test_split()`

---

### 2️⃣ Machine Learning Models
Scikit-learn provides built-in models like:
- Linear Regression
- Logistic Regression
- Decision Tree
- Random Forest
- KNN
- SVM

---

### 3️⃣ Model Training
- Trains the model using training data
- Learns relationships between **features (X)** and **target (Y)**

---

### 4️⃣ Prediction
- Uses trained models to **predict output** for new input data

---

### 5️⃣ Model Evaluation
- Measures how good the model is
- Common metrics:
  - Accuracy (classification)
  - Mean Absolute Error (MAE, regression)
  - Mean Squared Error (MSE, regression)
  - R² Score (regression)
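
To show what the regression metrics measure, here is a worked sketch in plain Python on three made-up values. Scikit-learn provides ready-made versions in `sklearn.metrics`; the hand-written formulas are only for illustration.

```python
# Regression metrics computed by hand on a tiny made-up example.
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]

n = len(y_true)
errors = [t - p for t, p in zip(y_true, y_pred)]

mae = sum(abs(e) for e in errors) / n        # mean absolute error
mse = sum(e ** 2 for e in errors) / n        # mean squared error

mean_true = sum(y_true) / n
ss_res = sum(e ** 2 for e in errors)         # residual sum of squares
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot                     # coefficient of determination

print(mae, mse, r2)
```

MAE and MSE are in price units (and squared price units), so lower is better. R² has no unit: 1 means a perfect fit, and 0 means no better than always predicting the mean.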

---

## ⚙️ Works With Pandas & NumPy
- Accepts **Pandas DataFrames**
- Uses **NumPy arrays** internally
- Integrates smoothly with data analysis workflows

---

## 🧠 Simple One-Line Explanation
**Scikit-learn helps us train, test, predict, and evaluate machine learning models easily.**

---

## ✅ Example Tasks Scikit-learn Handles
- Splitting data
- Creating ML models
- Training models
- Making predictions
- Checking performance

---

## 📦 Installation
```bash
pip install scikit-learn
```


In [73]:
import sklearn #scikit-learn is imported under the name sklearn (sns is the usual alias for seaborn)

#### Linear regression based Model

In [74]:
from sklearn.linear_model import LinearRegression
#create a regression model for prediction

from sklearn.model_selection import train_test_split
#divide data into training and testing sets

x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=42)
#test_size=0.2 keeps 20% of the rows for testing; random_state=42 makes the shuffle reproducible, so the same rows land in each set on every run

In [75]:
x #independent variable

Unnamed: 0,kms_driven,owner,age,power,brand
0,17654.0,1,3.0,110.0,1
1,11000.0,1,4.0,350.0,2
2,110.0,1,8.0,675.0,3
3,16329.0,1,4.0,180.0,1
4,10000.0,1,3.0,150.0,4
...,...,...,...,...,...
9362,48587.0,1,8.0,150.0,6
9369,60000.0,1,9.0,220.0,7
9370,3430.0,1,4.0,750.0,15
9371,21300.0,1,4.0,400.0,7


In [76]:
y #dependent variable

Unnamed: 0,price
0,35000.0
1,119900.0
2,600000.0
3,65000.0
4,80000.0
...,...
9362,25000.0
9369,35000.0
9370,450000.0
9371,139000.0


In [77]:
x_train #Features for training

Unnamed: 0,kms_driven,owner,age,power,brand
5789,68857.0,1,15.0,110.0,1
3451,5740.0,1,6.0,500.0,2
735,28329.0,1,6.0,110.0,6
7533,42966.0,1,7.0,150.0,4
8461,8000.0,1,5.0,500.0,2
...,...,...,...,...,...
6522,56000.0,1,9.0,200.0,7
6583,22493.0,1,5.0,500.0,2
6856,17477.0,1,5.0,150.0,7
1028,14836.0,1,8.0,150.0,4


In [78]:
x_test #Features for testing

Unnamed: 0,kms_driven,owner,age,power,brand
4909,22500.0,1,6.0,350.0,2
1942,3198.0,1,5.0,500.0,2
5763,15000.0,1,6.0,220.0,7
4800,27000.0,1,12.0,150.0,4
7614,16764.0,1,9.0,100.0,6
...,...,...,...,...,...
5653,16523.0,1,5.0,750.0,15
609,2881.0,1,7.0,250.0,14
4211,23833.0,1,4.0,150.0,8
6379,9282.0,1,7.0,500.0,2


In [79]:
y_train #Target values for training

Unnamed: 0,price
5789,18000.0
3451,140000.0
735,32000.0
7533,40000.0
8461,160000.0
...,...
6522,48928.0
6583,114000.0
6856,41000.0
1028,40000.0


In [80]:
y_test #Actual values for testing

Unnamed: 0,price
4909,88400.0
1942,102850.0
5763,67000.0
4800,30000.0
7614,20000.0
...,...
5653,395000.0
609,140000.0
4211,54500.0
6379,114000.0


In [81]:
x_train.shape , y_train.shape , x_test.shape, y_test.shape #rows & columns

((5859, 5), (5859, 1), (1465, 5), (1465, 1))

```python
from sklearn.linear_model import LinearRegression
# sklearn = library, linear_model = module, LinearRegression = class
```

In [82]:

lr=LinearRegression() #creation of object


In [83]:
lr.fit(x_train, y_train) #fits the model on the training data so it can predict prices for new data


LinearRegression(fit_intercept=True, copy_X=True, tol=1e-06, n_jobs=None, positive=False)


> The model is now trained.

```lr.predict(x_test)``` generates predicted values, which are compared with y_test to evaluate model performance.

In [84]:
pred=lr.predict(x_test) #predicted prices; the actual prices are in y_test

In [85]:
print(y_test,pred) #y_test is a one-column pandas DataFrame, whereas pred is a 2D NumPy array (values only)

         price
4909   88400.0
1942  102850.0
5763   67000.0
4800   30000.0
7614   20000.0
...        ...
5653  395000.0
609   140000.0
4211   54500.0
6379  114000.0
647    36000.0

[1465 rows x 1 columns] [[125382.54523531]
 [216422.20333617]
 [ 96966.48595241]
 ...
 [ 72233.83615101]
 [208856.67620686]
 [  9305.60087569]]


> ## A clearer way to compare actual vs predicted prices

In [86]:

comparison = pd.DataFrame({
    "Actual Price": y_test.squeeze().values,
    "Predicted Price": pred.squeeze() #converts 2D data (one column) into 1D data.
})

comparison.head(20) #preview the first few rows(20) of a DataFrame to verify the data.


Unnamed: 0,Actual Price,Predicted Price
0,88400.0,125382.545235
1,102850.0,216422.203336
2,67000.0,96966.485952
3,30000.0,7571.713344
4,20000.0,8532.577046
5,141000.0,137496.720345
6,35000.0,81236.222895
7,140000.0,211759.152593
8,32500.0,42517.729764
9,85000.0,18789.647139


> ```R²``` tells how well the model explains the data.
- The best possible value is 1; it can drop below 0 when the model does worse than always predicting the mean
- Closer to 1 → better model
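
As a rough illustration of what R² measures, the sketch below computes it by hand as 1 − SS_res / SS_tot and checks the result against `sklearn.metrics.r2_score`. The toy arrays and the helper name `r2_manual` are made up for this example. The second case shows that the score goes below zero when predictions are worse than always guessing the mean.

```python
import numpy as np
from sklearn.metrics import r2_score

def r2_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error left over after the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of always predicting the mean
    return 1.0 - ss_res / ss_tot

# Toy values (illustrative only, not taken from the dataset)
actual = np.array([100.0, 200.0, 300.0, 400.0])
good_pred = np.array([110.0, 190.0, 310.0, 390.0])
bad_pred = np.array([400.0, 300.0, 200.0, 100.0])

r2_good = r2_manual(actual, good_pred)
r2_bad = r2_manual(actual, bad_pred)
print(r2_good, r2_score(actual, good_pred))  # close to 1
print(r2_bad, r2_score(actual, bad_pred))    # negative
```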

In [87]:
lr.score(x_train,y_train) #gives R², not accuracy
#The model explains about 70% of the price variance in the training data.

0.7053826605671762

In [88]:
lr.score(x_test,y_test) 
#The model explains about 76% of the price variance in unseen (test) data.

0.7586900869386206

| Metric   | Used For          | Meaning                                   |
|----------|-------------------|-------------------------------------------|
| Accuracy | Classification    | Percentage of correct predictions         |
| R²       | Regression        | How well the model explains the data      |


##### More data = more variation, so the score can drop.


In [89]:
import joblib #joblib is used to save and load machine learning models
joblib.dump(lr,"model.lb") #lr is the trained model object; "model.lb" is the file it is saved to

['model.lb']

```joblib.dump()``` : Save the object to a file. 


```joblib.load()``` : Load the saved object from a file.

In [90]:
model=joblib.load('model.lb') 


In [91]:
model #the loaded model can make predictions directly (no retraining needed)

LinearRegression parameters:

| Parameter     | Value |
|---------------|-------|
| fit_intercept | True  |
| copy_X        | True  |
| tol           | 1e-06 |
| n_jobs        | None  |
| positive      | False |


In [92]:
model.predict(x_test)

array([[125382.54523531],
       [216422.20333617],
       [ 96966.48595241],
       ...,
       [ 72233.83615101],
       [208856.67620686],
       [  9305.60087569]], shape=(1465, 1))

In [93]:
model.score(x_test, y_test)


0.7586900869386206

In [111]:
model.predict([[10000.0,1,3.0,150.0,4]]) #one row of features as it would arrive from the frontend (price removed), taken from the encoded CSV file (updated_used_bike)
#        kms_driven,owner,age,power,brand



array([[40708.09891636]])

## 🔹 SVM (Support Vector Machine) Algorithm

- **Supervised learning algorithm** (works on labeled data)
- Used for **both classification and regression**
- **Mainly used for classification problems**

---

## 🔹 Types of SVM
1. **Linear SVM**
2. **Non-linear SVM**

---

## 🔹 Linear Regression vs Linear SVM

### Linear Regression
- Formula:  
```y = mx + c```

- Tries to **fit a single best-fit line**
- Prediction accuracy decreases when:
    - Data points are far from the straight line
    - Relationship is not perfectly linear

---

### Linear SVM
- Draws a **decision boundary (hyperplane)**
- Uses **two parallel boundary lines (margins)**, one on each side of the hyperplane
- Goal:
    - Maximize the margin, i.e. the distance between the hyperplane and the nearest data points
    - Distance from the hyperplane is what the model optimizes
- More robust than linear regression for classification

---

## 🔹 Non-linear SVM
- Used when data is **not linearly separable**
- Example:
    - Data points arranged in **circular or complex patterns**
    - No straight line can separate the data
- Uses **kernel functions** to transform data into higher dimensions
- After transformation, a separating hyperplane can be formed

---

## 🔑 Summary (Very Simple)
- **Linear SVM** → straight-line separation  
- **Non-linear SVM** → complex / circular data separation  
- **SVM focuses on margins**, not just fitting a line 
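
To make the linear vs non-linear distinction concrete, here is a small sketch that fits two `SVC` models on circular data: one with a linear kernel and one with the default RBF kernel. The dataset comes from scikit-learn's `make_circles`; the sample size, noise level and split are arbitrary choices for this illustration.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: no straight line can separate the classes
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Same algorithm, two kernels
linear_acc = SVC(kernel="linear").fit(X_tr, y_tr).score(X_te, y_te)
rbf_acc = SVC(kernel="rbf").fit(X_tr, y_tr).score(X_te, y_te)

print(f"linear kernel accuracy: {linear_acc:.2f}")
print(f"rbf kernel accuracy:    {rbf_acc:.2f}")
```

On data like this the RBF kernel should score far higher, because the kernel lets the model draw a curved boundary between the rings.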

In [95]:
from sklearn.svm import SVR,SVC  
# SVC → Support Vector Classifier (for classification problems)
# SVR → Support Vector Regressor (for regression problems)

> ##  REGRESSION PROBLEM (SVR)

In [96]:
svm =SVR()
svm.fit(x_train,y_train.squeeze()) #trained with data; squeeze() turns the one-column y_train into the 1-D target SVR expects


SVR parameters:

| Parameter  | Value   |
|------------|---------|
| kernel     | 'rbf'   |
| degree     | 3       |
| gamma      | 'scale' |
| coef0      | 0.0     |
| tol        | 0.001   |
| C          | 1.0     |
| epsilon    | 0.1     |
| shrinking  | True    |
| cache_size | 200     |
| verbose    | False   |


In [110]:
import joblib #joblib is used to save and load machine learning models
joblib.dump(svm,"model2.lb") #svm is the trained model object; "model2.lb" is the file it is saved to

['model2.lb']

In [112]:
model2=joblib.load('model2.lb') 

In [113]:
model2.predict([[10000.0,1,3.0,150.0,4]])



array([55002.98759728])

The SVR model is now ready to make predictions.

In [97]:
print(svm.score(x_train,y_train))
print(svm.score(x_test,y_test))

-0.0614777613879014
-0.058810958372743416


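```.score()``` in SVR gives the R² score, which shows how well the model explains the data for training and testing sets. Both values here are slightly below 0, which means this SVR with default settings does slightly worse than always predicting the mean price.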

In [98]:
svm.predict(x_test) #predicts output values for the test data.

array([54600.4698477 , 55129.47131119, 54861.78539028, ...,
       54550.9225554 , 55020.38248866, 55131.93435183], shape=(1465,))
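
The predictions above are nearly constant (all close to 55,000). One possible contributor is scale: SVR is sensitive to the magnitude of both features and target, and the default `C=1.0` and `epsilon=0.1` are tiny next to prices in the tens of thousands. The sketch below uses synthetic stand-in data with made-up coefficients, not the real dataset, and compares a default `SVR` on raw values with the same model wrapped so that features and target are standardized first.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the bike data (made-up relationship, not the real dataset)
rng = np.random.default_rng(0)
n = 600
kms = rng.uniform(1_000, 100_000, n)
age = rng.uniform(1, 15, n)
power = rng.uniform(100, 750, n)
X = np.column_stack([kms, age, power])
price = 200_000 - 0.5 * kms - 5_000 * age + 300 * power + rng.normal(0, 5_000, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, price, test_size=0.2, random_state=0)

# 1) SVR with default settings on raw values
raw_r2 = SVR().fit(X_tr, y_tr).score(X_te, y_te)

# 2) Same SVR, but features and target are standardized first
scaled_svr = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), SVR()),
    transformer=StandardScaler(),
)
scaled_r2 = scaled_svr.fit(X_tr, y_tr).score(X_te, y_te)

print(f"raw R2:    {raw_r2:.3f}")
print(f"scaled R2: {scaled_r2:.3f}")
```

Whether scaling alone would fix the score on the real data is not something this sketch can show. Tuning `C`, `epsilon` and the kernel may also be needed.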

> ## CLASSIFICATION PROBLEM (SVC)

In [99]:
from sklearn.datasets import load_iris #iris dataset, used to try out a classification problem with SVC
from sklearn.svm import SVC 

x,y= load_iris(return_X_y=True) #load iris as features (x) and target labels (y)
df = pd.DataFrame(x, columns=[
    "sepal_length",
    "sepal_width",
    "petal_length",
    "petal_width"
])

df["target"] = y
df.head(10) #shows only the first 10 rows

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,target
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0
5,5.4,3.9,1.7,0.4,0
6,4.6,3.4,1.4,0.3,0
7,5.0,3.4,1.5,0.2,0
8,4.4,2.9,1.4,0.2,0
9,4.9,3.1,1.5,0.1,0


In [100]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=42)
#split the data into training (80%) and testing (20%) sets


In [101]:
x_train.shape,y_train.shape,x_test.shape,y_test.shape #for rows,columns

((120, 4), (120,), (30, 4), (30,))

In [102]:
svm2=SVC() #creating object/model
svm2.fit(x_train,y_train) #give training data to train the model

SVC parameters:

| Parameter    | Value   |
|--------------|---------|
| C            | 1.0     |
| kernel       | 'rbf'   |
| degree       | 3       |
| gamma        | 'scale' |
| coef0        | 0.0     |
| shrinking    | True    |
| probability  | False   |
| tol          | 0.001   |
| cache_size   | 200     |
| class_weight | None    |


In [103]:
print(svm2.score(x_train,y_train)) #.score() returns accuracy here because this is a classification problem
print(svm2.score(x_test,y_test))

0.975
1.0


In [104]:
svm2.predict(x_test) #returns the predicted class labels for the test data.

array([1, 0, 2, 1, 1, 0, 1, 2, 1, 1, 2, 0, 0, 0, 0, 1, 2, 1, 1, 2, 0, 2,
       0, 2, 2, 2, 2, 2, 0, 0])

In [105]:
y_test #actual labels; they match the predictions above exactly, which is why the test accuracy is 1.0

array([1, 0, 2, 1, 1, 0, 1, 2, 1, 1, 2, 0, 0, 0, 0, 1, 2, 1, 1, 2, 0, 2,
       0, 2, 2, 2, 2, 2, 0, 0])

## 🔹 KNN (K-Nearest Neighbors)

- **KNN** is a **Supervised Learning algorithm**
- It is used for **both Classification and Regression**
- It works based on **distance (nearness)** between data points

---

### 🔹 How KNN Works
- For a new data point:
  - Select the value of **K**
  - Calculate the distance between the new point and all training points  
    (commonly **Euclidean Distance**) 

$d = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_n - y_n)^2}$

  - Choose the **K nearest neighbors**

- **For Classification**:
  - The class with **maximum votes** among the K neighbors is predicted

- **For Regression**:
  - The **average (mean)** value of the K nearest neighbors is predicted
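
The steps above can be sketched in a few lines of NumPy. This is a minimal from-scratch version for a single query point, not how scikit-learn implements it; the function name `knn_predict` and the toy points are made up for the example.

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, query, k=3, task="classification"):
    """Predict for one query point from its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    query = np.asarray(query, dtype=float)
    # Euclidean distance from the query to every training point
    distances = np.sqrt(((X_train - query) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    neighbor_targets = [y_train[i] for i in nearest]
    if task == "classification":
        # Majority vote among the neighbors
        return Counter(neighbor_targets).most_common(1)[0][0]
    # Regression: mean of the neighbors' target values
    return float(np.mean(neighbor_targets))

# Made-up 2-D points: class "A" near the origin, class "B" near (5, 5)
points = [[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]]
labels = ["A", "A", "A", "B", "B", "B"]
values = [10.0, 12.0, 14.0, 50.0, 52.0, 54.0]

label = knn_predict(points, labels, [0.5, 0.5], k=3)
value = knn_predict(points, values, [5.5, 5.5], k=3, task="regression")
print(label, value)  # A 52.0
```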

---

### 🔹 Simple Explanation
> **The closer a data point is, the more influence it has on the prediction**

---

### 🔹 Important Points
- KNN is a **lazy learner** (no real training phase; fitting mostly just stores the data)
- Choosing the right **K value** is important:
  - Small K → more noise sensitive
  - Large K → smoother but less flexible
- **Feature scaling** is important because KNN is distance-based
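
One common way to pick K, shown here as a sketch rather than a fixed recipe, is to compare a few candidate values with cross-validation while scaling the features inside a pipeline. The candidate values and the 5-fold setting are arbitrary choices for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

cv_scores = {}
for k in (1, 5, 15, 45):
    # Scaling lives inside the pipeline, so each fold is scaled using only its training part
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_scores[k] = cross_val_score(model, X, y, cv=5).mean()

for k, score in cv_scores.items():
    print(f"K={k:>2}  mean CV accuracy={score:.3f}")
```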

---

### 🔹 Advantages
- Simple and easy to understand
- No training required
- Works well with small datasets

---

### 🔹 Disadvantages
- Slow prediction for large datasets
- High memory usage
- Sensitive to noisy data

---


In [106]:
from sklearn.neighbors import KNeighborsClassifier,KNeighborsRegressor #KNN for classification and for regression
from sklearn.datasets import load_iris #iris dataset, used to try out a classification problem with KNN


x,y= load_iris(return_X_y=True) #load iris as features (x) and target labels (y)

from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=42)
#split the data into training (80%) and testing (20%) sets

x_train.shape,y_train.shape,x_test.shape,y_test.shape #for rows,columns



((120, 4), (120,), (30, 4), (30,))

In [107]:
svm3=KNeighborsClassifier() #creating object/model (the variable is named svm3, but it holds a KNN classifier)
svm3.fit(x_train,y_train) #give training data to train the model

KNeighborsClassifier parameters:

| Parameter     | Value       |
|---------------|-------------|
| n_neighbors   | 5           |
| weights       | 'uniform'   |
| algorithm     | 'auto'      |
| leaf_size     | 30          |
| p             | 2           |
| metric        | 'minkowski' |
| metric_params | None        |
| n_jobs        | None        |


In [108]:
print(svm3.score(x_train,y_train)) #here checking accuracy as it is a classification problem
print(svm3.score(x_test,y_test))

0.9666666666666667
1.0


In [109]:
svm3.predict(x_test) #returns the predicted class labels for the test data.

array([1, 0, 2, 1, 1, 0, 1, 2, 1, 1, 2, 0, 0, 0, 0, 1, 2, 1, 1, 2, 0, 2,
       0, 2, 2, 2, 2, 2, 0, 0])

# 🌳 Decision Tree & Ensemble Learning Notes

---

## 🔹 Decision Tree (DT)

### 1️⃣ If–Else Ladder Concept
- Decision Tree works like an **if–else ladder**
- At each node, a **condition** is checked
- Based on the condition, data is **split** into branches

---

### 2️⃣ Splitting Criteria (Impurity Measures)

Decision Tree splits data based on:

- **Entropy**
- **Information Gain**
- **Gini Impurity**

These measures decide the **best feature** for splitting.

---

### 3️⃣ Tree Structure Terminology
- **Root Node** → First split (top of tree)
- **Internal Nodes** → Intermediate decision points
- **Leaf Node** → Final output / prediction
- **Splitting** → Dividing data based on feature condition

---

### 4️⃣ CART Algorithm
- **CART** = Classification And Regression Tree
- Supports **both Classification & Regression**
- Uses:
  - **Gini Index** → Classification
  - **MSE / Variance Reduction** → Regression
- Binary tree (only 2 branches per split)
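
To see the if–else ladder and the binary splits on a real tree, the sketch below fits a shallow `DecisionTreeClassifier` on the iris data and prints it with `export_text`. The depth limit of 2 is an arbitrary choice to keep the printout short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# criterion="gini" is scikit-learn's default; max_depth=2 keeps the printout short
tree = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# export_text prints the fitted tree as nested conditions, one branch per line
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Each condition in the printout has exactly two branches (`<=` and `>`), and each branch ends in a `class:` line, which is a leaf node.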

---

### 5️⃣ Advantages of Decision Tree
- Easy to understand & visualize
- No feature scaling required
- Works with both numerical & categorical data

### 6️⃣ Disadvantages of Decision Tree
- Overfitting problem
- Sensitive to small data changes
- Less accurate alone compared to ensembles

---

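## 🔹 Formulas (All Three)

Here $p_i$ is the fraction of samples in class $i$, $n_i$ is the number of samples in child node $i$, and $n$ is the number of samples in the parent node.

### 🔸 Entropy
$Entropy = -\sum_i p_i \log_2 p_i$

### 🔸 Information Gain
$IG = Entropy(parent) - \sum_i \left(\frac{n_i}{n}\right) Entropy(child_i)$

### 🔸 Gini Impurity
$Gini = 1 - \sum_i p_i^2$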
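
The three formulas translate almost directly into Python. The sketch below is a plain-Python illustration; the function names and the tiny yes/no label lists are made up for the example.

```python
import math
from collections import Counter

def class_probabilities(labels):
    """Fraction of samples belonging to each class."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]

def entropy(labels):
    return sum(-p * math.log2(p) for p in class_probabilities(labels))

def gini(labels):
    return 1.0 - sum(p ** 2 for p in class_probabilities(labels))

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = ["yes", "yes", "no", "no"]
perfect_split = [["yes", "yes"], ["no", "no"]]   # each child is pure
useless_split = [["yes", "no"], ["yes", "no"]]   # children look like the parent

print(entropy(parent), gini(parent))              # 1.0 0.5
print(information_gain(parent, perfect_split))    # 1.0
print(information_gain(parent, useless_split))    # 0.0
```

A split that produces pure children gains the full parent entropy, while a split that leaves the class mix unchanged gains nothing.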


---

## 🔹 Ensemble Learning

### 📌 What is Ensemble Learning?
- Combines **multiple models**
- Final prediction is **more accurate & stable**
- Based on idea:
> **Multiple weak models together make a strong model**

---

### 📌 Key Ideas
- Multiple models work together
- Reduces overfitting
- Improves generalization
- Prediction closer to **true value**

---

### 📌 Example
**Placement Prediction**
- Inputs: IQ, CGPA
- Output:
  - **Yes / No** → Classification
  - **Salary (₹)** → Regression

Different models:
- KNN
- SVM
- Linear Regression
- Decision Tree

Each gives a different output → the ensemble combines them

---

## 🔹 Types of Ensemble Learning (4)

1. **Voting**
2. **Stacking**
3. **Bagging**
4. **Boosting**

---

## 🔹 Voting Ensemble Learning (Classification and Regression)

### 🔸 Models Used
- Linear Regression
- Decision Tree
- KNN
- SVM

### 🔸 Classification Case
- Each model predicts **Yes / No**
- Final output based on:
  - **Majority voting**
  - OR **Priority of models**

### 🔸 Regression Case
- Each model predicts a value
- Final output = **Average of all predictions**
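
A minimal sketch of the classification case with scikit-learn's `VotingClassifier`, using the iris data for convenience. One substitution to note: `LogisticRegression` stands in for Linear Regression here, because a voting classifier needs models that output class labels.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

voter = VotingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=1000)),
        ("tree", DecisionTreeClassifier(random_state=0)),
        ("knn", KNeighborsClassifier()),
        ("svm", SVC()),
    ],
    voting="hard",  # majority vote on the predicted labels
)
voter.fit(X_tr, y_tr)
voting_acc = voter.score(X_te, y_te)
print(voting_acc)
```

The optional `weights` argument gives some models more say in the vote, which corresponds to the "priority of models" idea. For the regression case, `sklearn.ensemble.VotingRegressor` averages the individual predictions.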

---

## 🔹 Stacking Ensemble Learning (Classification and Regression)

### 🔸 Step-by-Step
1. Train base models:
   - LR, DT, KNN, SVM
2. Each model gives prediction:
   - Example: Yes, No, Yes, Yes
3. Form a **new dataset (meta-table)** from predictions
4. Apply **Meta Model** (e.g., Decision Tree / LR)
5. Meta model gives **final prediction**

✔ Same approach used for **Regression**
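
These steps map onto scikit-learn's `StackingClassifier`, sketched below on the iris data. The choice of base models and of `LogisticRegression` as the meta model is an assumption made for this illustration. The class builds the meta-table internally from cross-validated predictions of the base models.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

stack = StackingClassifier(
    estimators=[                      # steps 1-2: base models and their predictions
        ("tree", DecisionTreeClassifier(random_state=0)),
        ("knn", KNeighborsClassifier()),
        ("svm", SVC()),
    ],
    final_estimator=LogisticRegression(max_iter=1000),  # step 4: the meta model
    cv=5,                             # step 3: meta-table built from out-of-fold predictions
)
stack.fit(X_tr, y_tr)
stacking_acc = stack.score(X_te, y_te)  # step 5: final prediction comes from the meta model
print(stacking_acc)
```

`sklearn.ensemble.StackingRegressor` follows the same pattern for regression.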

---

## 🔹 Bagging vs Boosting

### 🔸 Common Points
- Use **same algorithm repeatedly**
- Mostly use **Decision Tree**
- Models are **weak learners**

---

## 🔹 Bagging (Bootstrap Aggregation)

- Example: **Random Forest**
- Uses **multiple decision trees**
- Data sampled with replacement
- Models run **in parallel (vertically)**
- Reduces **variance**
- Each tree independent

✔ Fast & stable

---

## 🔹 Boosting

- Models run **sequentially (horizontally)**
- Full data given initially
- First model learns → errors passed to next model
- Next model focuses more on **wrong predictions**
- Error keeps reducing:
  - M1 → M2 → M3 → M4
- Weak learners improve step by step

✔ High accuracy  
❌ Sensitive to noise
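
As a side-by-side sketch of the two ideas, the code below trains a Random Forest (bagging) and an AdaBoost model (boosting) on scikit-learn's built-in breast cancer dataset. The dataset and the number of estimators are arbitrary choices for the illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# Bagging: many trees, each fit independently on a bootstrap sample of the data
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Boosting: shallow trees fit one after another, each giving more weight to earlier mistakes
boosted = AdaBoostClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

forest_acc = forest.score(X_te, y_te)
boost_acc = boosted.score(X_te, y_te)
print(f"Random Forest accuracy: {forest_acc:.3f}")
print(f"AdaBoost accuracy:      {boost_acc:.3f}")
```

Because the forest's trees do not depend on each other, they can be trained in parallel (for example with `n_jobs=-1`), while the boosted trees must be trained in order.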

---

## 🔹 Summary Table

| Method | Parallel | Error Focus | Example |
|------|---------|------------|--------|
| Voting | Yes | No | Soft / Hard Voting |
| Stacking | Partial | Yes | Meta Learning |
| Bagging | Yes | No | Random Forest |
| Boosting | No | Yes | AdaBoost, XGBoost |

---

## ✅ Key Exam Tip
- **Decision Tree alone → weak**
- **Ensemble + Decision Tree → powerful**

---


# 🌐 Flask – Theory Notes

## 🔹 What is Flask?
- **Flask** is a **lightweight Python web framework**
- Used to build **web applications and APIs**
- Helps in **communication between frontend and backend**
- Considered an **alternative to Django**

---

## 🔹 Why Flask is called a Micro Framework?
- Flask provides only **core features**
- No built-in:
  - Database ORM
  - Authentication system
  - Admin panel
- Developers can add features as needed

---

## 🔹 Flask Architecture
- Based on **WSGI (Web Server Gateway Interface)**
- Follows **MVC pattern** (loosely):
  - Model → Data / Logic
  - View → HTML templates
  - Controller → Flask routes

---

## 🔹 Frontend–Backend Communication
- Flask allows:
  - Frontend → Backend (form data, requests)
  - Backend → Frontend (responses, results)
- Supports **GET and POST methods**

---

## 🔹 Routing in Flask
- Routing maps **URL paths to Python functions**
- Each route represents a **web page or API endpoint**
- Example routes:
  - Home
  - About
  - Contact
  - Project
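
A minimal routing sketch, assuming Flask is installed. The route paths, function names and page texts are placeholders. The routes are exercised with Flask's built-in test client, so no server needs to be started.

```python
from flask import Flask

app = Flask(__name__)

@app.route("/")
def home():
    return "Home page"

@app.route("/about")
def about():
    return "About page"

# Dynamic URL: the number in the path is passed to the function as an int
@app.route("/project/<int:project_id>")
def project(project_id):
    return f"Project {project_id}"

# Exercise the routes without starting a server
client = app.test_client()
home_response = client.get("/")
project_response = client.get("/project/7")
missing_response = client.get("/missing")

print(home_response.status_code, home_response.get_data(as_text=True))
print(project_response.get_data(as_text=True))
print(missing_response.status_code)  # no route matches, so Flask answers 404
```

During development the app would typically be started with `app.run(debug=True)`. That call is left out here so the snippet finishes instead of blocking on a running server.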

---

## 🔹 Templates in Flask
- Flask uses **Jinja2 templating engine**
- Allows:
  - Dynamic HTML
  - Variable passing
  - Conditional statements
- Makes frontend pages reusable
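
A small sketch of variable passing and a conditional in a Jinja2 template. The template is kept inline with `render_template_string` for brevity; a project would more commonly keep it in a file under `templates/` and call `render_template`. The route name and the price value are placeholders.

```python
from flask import Flask, render_template_string

app = Flask(__name__)

# Inline template: {{ ... }} prints a variable, {% ... %} holds logic
RESULT_TEMPLATE = """
<h1>Price estimate</h1>
{% if price is not none %}
  <p>Estimated price: {{ price }}</p>
{% else %}
  <p>No estimate available.</p>
{% endif %}
"""

@app.route("/result")
def result():
    # 40708 is only a placeholder value passed into the template
    return render_template_string(RESULT_TEMPLATE, price=40708)

page = app.test_client().get("/result").get_data(as_text=True)
print(page)
```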

---

## 🔹 Request & Response Cycle
- User sends request from browser
- Flask processes the request
- Flask sends response back to browser
- Response can be:
  - HTML page
  - Text
  - JSON data
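
The cycle can be sketched as a small JSON endpoint: the frontend POSTs feature values, Flask processes them, and a JSON response goes back. Everything named here is an assumption for illustration: the `/predict` path, the field names, and `predict_price`, a hypothetical stand-in with a made-up formula where a loaded model's `predict` call would normally go.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

FEATURES = ["kms_driven", "owner", "age", "power", "brand"]

def predict_price(row):
    """Hypothetical stand-in for a trained model (made-up formula)."""
    kms_driven, owner, age, power, brand = row
    return 50_000 + 100 * power - 2_000 * age - 0.1 * kms_driven

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json(silent=True) or {}
    missing = [name for name in FEATURES if name not in payload]
    if missing:
        # Bad request: tell the client which fields are absent
        return jsonify({"error": f"missing fields: {missing}"}), 400
    row = [float(payload[name]) for name in FEATURES]
    return jsonify({"predicted_price": predict_price(row)})

# Simulate the browser side with the test client
client = app.test_client()
ok = client.post(
    "/predict",
    json={"kms_driven": 10000, "owner": 1, "age": 3, "power": 150, "brand": 4},
)
bad = client.post("/predict", json={"age": 3})
print(ok.status_code, ok.get_json())
print(bad.status_code, bad.get_json())
```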

---

## 🔹 URL Handling
- Flask supports dynamic URLs
- Uses function-based routing
- Helps in clean and readable URLs

---

## 🔹 Debug Mode
- Used during development
- Automatically reloads server on code change
- Displays error messages in browser
- Should be turned off in production

---

## 🔹 Advantages of Flask
- Easy to learn
- Lightweight and flexible
- Good for beginners
- Ideal for small to medium projects
- Widely used in ML model deployment

---

## 🔹 Disadvantages of Flask
- Not suitable for very large applications alone
- Manual structure required
- Fewer built-in features compared to Django

---

## 🔹 Flask vs Django (Theory)

| Flask | Django |
|------|--------|
| Lightweight | Heavy framework |
| Flexible | Opinionated |
| Simple structure | Predefined structure |
| Best for small apps | Best for large apps |

---

## 🔹 Applications of Flask
- Web applications
- REST APIs
- Backend services
- Machine Learning deployment
- Prototyping projects

---

## 🔹 One-Line Definition (Exam)
> **Flask is a lightweight Python web framework used to develop web applications and APIs with simple frontend–backend interaction.**

---
