# 🚀 Week 02: Training Machine Learning Models 🎯  

This week, we dive deeper into **Machine Learning** by learning how to **train models** for different types of problems, including **classification** (predicting categories) and **regression** (predicting continuous values) tasks. 🧠💡  

### What You’ll Learn:  
🔹 How to **train models** using real-world data.  
🔹 The role of **features** in improving predictions.  
🔹 How to **evaluate model performance** and fine-tune it.  
🔹 The impact of **hyperparameters** like learning rate and epochs.  

By the end of this week, you'll be able to build, train, and test models confidently. Get ready to experiment, analyze, and improve your models! 🚀🔥

## Exercise 01 : **Predicting Trip Fare using Linear Regression**  
---

### **Objective**  
The goal of this task is to build a machine learning model that predicts the fare price of a trip based on selected features. You will go through the full machine learning workflow, including data preprocessing, model training, evaluation, and visualization.  

### **Tasks Overview**  

Follow these steps to train and evaluate a machine learning model:  

1. **Load and Explore the Data** 📝  
2. **Prepare the Data** 🔧  
3. **Train the Model** 🎯  
4. **Make Predictions** 🔮  
5. **Evaluate the Model** 📊  

### **Resources**  
[A Visual Introduction To (Almost) Everything You Should Know](https://mlu-explain.github.io/linear-regression/)

---

### 1. Load and Explore the Data 📝

- Load the dataset using pandas and inspect its structure.
- Check for missing values and basic statistics.
- Visualize key features to understand their relationship with the target variable.
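
As a rough illustration of these exploration steps, here is a minimal sketch on a tiny hand-made table. The column names mirror the taxi dataset, but the values are invented for the example, so treat the printed numbers as placeholders rather than facts about the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the taxi data (values are made up for illustration).
trips = pd.DataFrame({
    "TRIP_MILES": [1.2, 3.5, 0.8, 10.1, 5.0, np.nan],
    "TRIP_SECONDS": [420, 900, 300, 1800, 1100, 600],
    "FARE": [7.5, 14.0, 6.0, 31.5, 18.25, 9.0],
})

# Structure: number of rows/columns and the type of each column
print(trips.shape)
print(trips.dtypes)

# Missing values per column
missing_per_column = trips.isna().sum()
print(missing_per_column)

# Basic statistics (count, mean, std, min, quartiles, max)
print(trips.describe())

# Correlation of every column with the target, strongest first
fare_corr = trips.corr()["FARE"].sort_values(ascending=False)
print(fare_corr)
```

On the real dataset, the same calls can be applied to the DataFrame returned by `pd.read_csv`; the correlation column is one quick way to decide which features are worth plotting against `FARE`.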

In [72]:
import pandas as pd
import seaborn as sns 
import matplotlib.pyplot as plt
from sklearn.preprocessing import OneHotEncoder

data_url = "https://download.mlcc.google.com/mledu-datasets/chicago_taxi_train.csv"

# read_csv already returns a DataFrame, so no extra pd.DataFrame(...) wrapper is needed
df = pd.read_csv(data_url)
# Drop every row that contains at least one missing value
df = df.dropna()

# One-hot encode the text (object) columns so that every column is numeric
categorical = df.select_dtypes(include=['object']).columns.tolist()
encoder = OneHotEncoder(sparse_output=False)
onehot = encoder.fit_transform(df[categorical])
feature_names = encoder.get_feature_names_out(categorical)

onehot_df = pd.DataFrame(onehot, columns=feature_names, index=df.index)

# Replace the original text columns with their one-hot encoded versions
df = df.drop(columns=categorical)
df = pd.concat([df, onehot_df], axis=1)
df



| | TRIP_START_HOUR | TRIP_SECONDS | TRIP_MILES | TRIP_SPEED | PICKUP_CENSUS_TRACT | DROPOFF_CENSUS_TRACT | PICKUP_COMMUNITY_AREA | DROPOFF_COMMUNITY_AREA | FARE | TIPS | ... | COMPANY_Petani Cab Corp | COMPANY_Setare Inc | COMPANY_Star North Taxi Management Llc | COMPANY_Sun Taxi | COMPANY_Taxi Affiliation Services | COMPANY_Taxicab Insurance Agency Llc | COMPANY_Taxicab Insurance Agency, LLC | COMPANY_Top Cab | COMPANY_Top Cab Affiliation | COMPANY_U Taxicab |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 17.25 | 1173 | 1.29 | 4.0 | 1.703132e+10 | 1.703108e+10 | 32.0 | 8.0 | 10.25 | 0.00 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 18.00 | 3360 | 3.70 | 4.0 | 1.703132e+10 | 1.703124e+10 | 32.0 | 24.0 | 23.75 | 0.00 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 17.00 | 1044 | 1.15 | 4.0 | 1.703132e+10 | 1.703108e+10 | 32.0 | 8.0 | 10.00 | 0.00 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | 17.50 | 1251 | 1.38 | 4.0 | 1.703108e+10 | 1.703128e+10 | 8.0 | 28.0 | 11.00 | 3.00 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 6 | 17.00 | 1813 | 2.00 | 4.0 | 1.703108e+10 | 1.703128e+10 | 8.0 | 28.0 | 14.50 | 0.00 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 31640 | 5.00 | 1138 | 18.10 | 57.3 | 1.703132e+10 | 1.703198e+10 | 32.0 | 76.0 | 44.00 | 9.30 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 31641 | 17.25 | 2760 | 43.90 | 57.3 | 1.703198e+10 | 1.703132e+10 | 76.0 | 32.0 | 48.50 | 10.60 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 31647 | 6.00 | 1110 | 17.82 | 57.8 | 1.703184e+10 | 1.703198e+10 | 32.0 | 76.0 | 43.25 | 9.94 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 31675 | 12.75 | 1920 | 32.10 | 60.2 | 1.703198e+10 | 1.703184e+10 | 76.0 | 32.0 | 42.25 | 9.75 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |


### 2. Prepare the Data 🔧  

- Select the most relevant features for training.  
- Handle missing values if needed.  
- Normalize or scale the data to improve performance if needed.  
- Split the dataset into **training (80%)** and **testing (20%)** sets.  
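
One possible way to combine the last two steps is sketched below. It uses synthetic data (two invented features standing in for miles and seconds), and it assumes you want to fit the scaler on the training rows only so that no information from the test rows leaks into training. The variable names are illustrative, not part of the exercise.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two features (miles, seconds) and a fare target.
rng = np.random.default_rng(0)
X = rng.normal(loc=[5.0, 900.0], scale=[3.0, 400.0], size=(100, 2))
y = 3.0 + 2.5 * X[:, 0] + 0.005 * X[:, 1] + rng.normal(scale=1.0, size=100)

# 80% of the rows for training, 20% held out for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

print(X_train.shape, X_test.shape)
print(X_train_scaled.mean(axis=0).round(6), X_train_scaled.std(axis=0).round(6))
```

The target `y` is left unscaled here so that later error metrics stay in the original units; scaling it as well is a design choice you would need to undo when reporting results.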

> 🙋 **Why do we need to split our dataset** ❓

In [77]:
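from sklearn.preprocessing import StandardScaler

# Standardize every column to zero mean and unit variance.
# Note: this rescales the target column (FARE) too, and it uses all rows,
# i.e. it runs before any train/test split has been made.
scaler = StandardScaler()
scaled = scaler.fit_transform(df)
scaler_data = pd.DataFrame(scaled, columns=df.columns, index=df.index)

scaler_data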

| | TRIP_START_HOUR | TRIP_SECONDS | TRIP_MILES | TRIP_SPEED | PICKUP_CENSUS_TRACT | DROPOFF_CENSUS_TRACT | PICKUP_COMMUNITY_AREA | DROPOFF_COMMUNITY_AREA | FARE | TIPS | ... | COMPANY_Petani Cab Corp | COMPANY_Setare Inc | COMPANY_Star North Taxi Management Llc | COMPANY_Sun Taxi | COMPANY_Taxi Affiliation Services | COMPANY_Taxicab Insurance Agency Llc | COMPANY_Taxicab Insurance Agency, LLC | COMPANY_Top Cab | COMPANY_Top Cab Affiliation | COMPANY_U Taxicab |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 0.653745 | -0.019189 | -0.752914 | -1.260901 | -0.496568 | -0.929232 | -0.154205 | -0.933166 | -0.595243 | -0.860709 | ... | -0.036958 | -0.020034 | -0.194879 | 2.492465 | -0.405512 | -0.211577 | -0.247797 | -0.020034 | -0.169233 | -0.128032 |
| 3 | 0.830052 | 2.157662 | -0.435883 | -1.260901 | -0.496568 | -0.459744 | -0.154205 | -0.063975 | 0.179974 | -0.860709 | ... | -0.036958 | -0.020034 | -0.194879 | -0.401209 | -0.405512 | -0.211577 | -0.247797 | -0.020034 | -0.169233 | -0.128032 |
| 4 | 0.594976 | -0.147590 | -0.771330 | -1.260901 | -0.496568 | -0.927477 | -0.154205 | -0.933166 | -0.609599 | -0.860709 | ... | -0.036958 | -0.020034 | -0.194879 | -0.401209 | -0.405512 | -0.211577 | -0.247797 | -0.020034 | -0.169233 | -0.128032 |
| 5 | 0.712514 | 0.058449 | -0.741074 | -1.260901 | -1.127700 | -0.346327 | -1.102058 | 0.153323 | -0.552176 | -0.129405 | ... | -0.036958 | -0.020034 | -0.194879 | -0.401209 | -0.405512 | -0.211577 | -0.247797 | -0.020034 | -0.169233 | -0.128032 |
| 6 | 0.594976 | 0.617841 | -0.659515 | -1.260901 | -1.128491 | -0.341052 | -1.102058 | 0.153323 | -0.351193 | -0.860709 | ... | -0.036958 | -0.020034 | -0.194879 | 2.492465 | -0.405512 | -0.211577 | -0.247797 | -0.020034 | -0.169233 | -0.128032 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 31640 | -2.225939 | -0.054026 | 1.458407 | 4.019506 | -0.496568 | 1.704841 | -0.154205 | 2.760895 | 1.342801 | 1.406333 | ... | -0.036958 | -0.020034 | -0.194879 | -0.401209 | -0.405512 | -0.211577 | -0.247797 | -0.020034 | -0.169233 | -0.128032 |
| 31641 | 0.653745 | 1.560447 | 4.852344 | 4.019506 | 1.248967 | -0.229101 | 1.583526 | 0.370620 | 1.601207 | 1.723231 | ... | -0.036958 | -0.020034 | -0.194879 | -0.401209 | -0.405512 | -0.211577 | 4.035556 | -0.020034 | -0.169233 | -0.128032 |
| 31647 | -1.990863 | -0.081896 | 1.421574 | 4.069041 | 0.876265 | 1.704841 | -0.154205 | 2.760895 | 1.299733 | 1.562344 | ... | -0.036958 | -0.020034 | -0.194879 | -0.401209 | -0.405512 | 4.726410 | -0.247797 | -0.020034 | -0.169233 | -0.128032 |
| 31675 | -0.404098 | 0.724345 | 3.300078 | 4.306808 | 1.248967 | 1.291912 | 1.583526 | 0.370620 | 1.242310 | 1.516028 | ... | -0.036958 | -0.020034 | -0.194879 | -0.401209 | -0.405512 | -0.211577 | 4.035556 | -0.020034 | -0.169233 | -0.128032 |


### 3. Train the Model 🎯  

Train a **linear regression model** using `sklearn` (you can also try other alternatives).  

🔧 **Experiment with Different Features**  
- Start with a few features and observe the model's performance.  
- Try adding or removing features to see how it affects accuracy.  
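
The feature experiment could look roughly like the sketch below. It runs on synthetic data in which `miles` and `minutes` drive the fare and `hour` is pure noise; these names and the generating formula are assumptions made for the example, not properties of the taxi dataset. The helper `held_out_r2` is hypothetical as well.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: fare depends on miles and minutes, not on hour.
rng = np.random.default_rng(1)
n = 500
miles = rng.uniform(0.5, 20.0, n)
minutes = miles * 3.0 + rng.normal(0.0, 4.0, n)
hour = rng.integers(0, 24, n).astype(float)
fare = 3.0 + 2.0 * miles + 0.3 * minutes + rng.normal(0.0, 1.5, n)

features = {"miles": miles, "minutes": minutes, "hour": hour}


def held_out_r2(names):
    """Train on 80% of the rows with the given features; return R² on the other 20%."""
    X = np.column_stack([features[name] for name in names])
    X_train, X_test, y_train, y_test = train_test_split(
        X, fare, test_size=0.2, random_state=0
    )
    model = LinearRegression().fit(X_train, y_train)
    return model.score(X_test, y_test)


candidate_sets = [("hour",), ("miles",), ("miles", "minutes"), ("miles", "minutes", "hour")]
scores = {names: held_out_r2(names) for names in candidate_sets}

for names, r2 in scores.items():
    print(f"{', '.join(names):<24} test R² = {r2:.3f}")
```

Comparing the scores on held-out rows, rather than on the training rows, is what makes it possible to see whether an extra feature actually helps.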

> 🙋 **What are hyperparameters, and how do learning rate and epochs affect training** ❓
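
To make the roles of learning rate and epochs concrete, here is a small gradient-descent sketch for a one-feature linear model, written with NumPy on synthetic data. Note that `sklearn`'s `LinearRegression` solves the least-squares problem directly and has no learning rate or epochs; the function `train_linear_gd` below is a hypothetical helper written only for this illustration.

```python
import numpy as np


def train_linear_gd(x, y, learning_rate, epochs):
    """Fit y ≈ w * x + b by batch gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    losses = []
    for _ in range(epochs):
        error = (w * x + b) - y
        losses.append(float(np.mean(error ** 2)))        # loss before this update
        w -= learning_rate * 2.0 * np.mean(error * x)    # d(MSE)/dw
        b -= learning_rate * 2.0 * np.mean(error)        # d(MSE)/db
    return w, b, losses


# Synthetic data generated from y = 4x + 1 plus a little noise
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 200)
y = 4.0 * x + 1.0 + rng.normal(0.0, 0.1, 200)

results = {}
for learning_rate in (0.01, 0.5):
    w, b, losses = train_linear_gd(x, y, learning_rate=learning_rate, epochs=200)
    results[learning_rate] = (w, b, losses)
    print(f"lr={learning_rate}: w={w:.2f}, b={b:.2f}, final loss={losses[-1]:.4f}")

# A learning rate that is too large makes the loss grow instead of shrink.
_, _, diverging_losses = train_linear_gd(x, y, learning_rate=1.0, epochs=50)
print(f"lr=1.0: first loss={diverging_losses[0]:.2f}, last loss={diverging_losses[-1]:.2e}")
```

With the same number of epochs, the small learning rate has not yet reached the true slope and intercept, the moderate one has, and the large one diverges. The `losses` list is also the data you would plot as a loss curve.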

> 🙋 **Does using more features always improve the model** ❓

In [None]:
# add your code here

#### 🎉 Congratulations! 🎉

**🚀You’ve just trained your first Machine Learning model! 🚀**

### 4. Make Predictions 🔮  

Use the trained model to predict values on the **testing data**.  
Compare the predictions with the actual values to assess accuracy.  
You can also try making predictions on the **training data** to see how well the model memorized the patterns.  
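
A minimal sketch of this comparison, again on synthetic data (one invented `miles` feature and a noisy fare), might look like this. The noise term is there on purpose: it represents variation that no feature explains, which is one reason predictions differ from the actual values even on rows the model was trained on.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: fare is a linear function of miles plus unexplained noise.
rng = np.random.default_rng(2)
miles = rng.uniform(0.5, 20.0, size=(300, 1))
fare = 3.0 + 2.5 * miles[:, 0] + rng.normal(0.0, 2.0, 300)

X_train, X_test, y_train, y_test = train_test_split(
    miles, fare, test_size=0.2, random_state=0
)
model = LinearRegression().fit(X_train, y_train)

train_pred = model.predict(X_train)
test_pred = model.predict(X_test)

# Side-by-side view of a few test rows
for actual, predicted in zip(y_test[:5], test_pred[:5]):
    print(f"actual={actual:6.2f}  predicted={predicted:6.2f}  error={predicted - actual:+.2f}")

# Mean absolute error on both splits
train_mae = float(np.mean(np.abs(train_pred - y_train)))
test_mae = float(np.mean(np.abs(test_pred - y_test)))
print(f"train MAE={train_mae:.2f}, test MAE={test_mae:.2f}")
```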

> 🙋 **Why doesn’t the model predict exact values, even when using the training data** ❓

In [None]:
# add your code here

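### 5. Evaluate the Model 📊

- Measure performance using metrics like MSE, RMSE, and R² Score.
- Plot a loss curve to track training progress (this applies when the model is trained iteratively, for example with gradient descent; `sklearn`'s `LinearRegression` fits in a single step and produces no loss history).
- Create a scatter plot to compare actual vs. predicted values.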
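
For reference, the three metrics can be computed by hand as in the sketch below. The four actual/predicted pairs are made-up numbers, and `regression_metrics` is a hypothetical helper; `sklearn.metrics` provides `mean_squared_error` and `r2_score` if you prefer library functions.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE and R² for two equally long sequences of numbers."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = float(np.mean(residuals ** 2))            # average squared error
    rmse = float(np.sqrt(mse))                      # same units as the target
    ss_res = float(np.sum(residuals ** 2))          # error left after the model
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))  # error of predicting the mean
    r2 = 1.0 - ss_res / ss_tot
    return {"mse": mse, "rmse": rmse, "r2": r2}


# Made-up fares and predictions, for illustration only
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 39.0]

metrics = regression_metrics(y_true, y_pred)
print(metrics)  # mse = 4.5, rmse ≈ 2.121, r2 = 0.964
```

Here the squared errors are 4, 4, 9 and 1, so MSE is 18 / 4 = 4.5; R² is 1 − 18 / 500 because always predicting the mean (25) would give a total squared error of 500.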

>🙋 **What do MSE, RMSE, and R² Score tell us about the model's performance**❓

>🙋 **How can you tell if your model is overfitting or underfitting**❓
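
One common way to approach this question is to compare the error on the training set with the error on unseen data. The sketch below does that for polynomial models of increasing degree on synthetic data; the data, the degrees 1, 4 and 15, and the use of `PolynomialFeatures` are all choices made for this illustration, not part of the exercise.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures


def target(x):
    """Hypothetical true relationship between the single feature and the label."""
    return np.sin(3.0 * x[:, 0])


# Small noisy training set, larger test set drawn from the same process
rng = np.random.default_rng(3)
x_train = rng.uniform(-1.0, 1.0, size=(20, 1))
x_test = rng.uniform(-1.0, 1.0, size=(200, 1))
y_train = target(x_train) + rng.normal(0.0, 0.2, 20)
y_test = target(x_test) + rng.normal(0.0, 0.2, 200)

results = {}
for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(x_train))
    test_mse = mean_squared_error(y_test, model.predict(x_test))
    results[degree] = (train_mse, test_mse)
    print(f"degree={degree:2d}  train MSE={train_mse:.4f}  test MSE={test_mse:.4f}")
```

As a rule of thumb, a high error on both sets suggests underfitting, while a low training error paired with a much higher test error suggests overfitting.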


In [None]:
# add your code here

---
## Exercise 02 : **University Admission Prediction Challenge**

😇 I know that the exercise is difficult, but you will practice what you learned last week along with your first classification algorithm.

### 🎯 Objective

Your mission is to predict whether a student will be admitted to their desired university based on various academic and application-related factors. You’ll use **Logistic Regression** to build a predictive model and discuss its strengths and limitations.

### 📊 The Dataset

[Admission_Predict.csv](https://github.com/1337-Artificial-Intelligence/Entry-Level-ML-Engineer-Bootcamp/blob/main/Week02/Admission_Prediction_Challenge.csv)

The dataset contains information on **400 students** with the following attributes:

- **GRE Score** 🎓
- **TOEFL Score** 📚
- **University Rating** 🏛️
- **Statement of Purpose (SOP) Score** ✍️
- **Letter of Recommendation (LOR) Score** 📩
- **Cumulative Grade Point Average (CGPA)** 🎯
- **Research Experience (Yes/No)** 🔬
- **Chance of Admission (Target Variable: 0 or 1)**

### 🛠️ Steps to Follow

1. **Load & Explore the Data**: Understand the dataset, check for missing values, and analyze distributions.
2. **Feature Selection & Processing**: Identify relevant features and scale them if needed.
3. **Train a Logistic Regression Model**: Implement Logistic Regression (writing it **from scratch** is optional) to classify students into "Admitted" or "Not Admitted."
4. **Evaluate the Model**: Measure accuracy, precision, recall, and other key metrics (do some research 🙂).
5. **Discuss Limitations**: Explore cases where Logistic Regression may struggle and suggest improvements (e.g., feature engineering, alternative models).
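
As a starting point, steps 2 to 4 could be sketched as follows. The example does not load the real CSV: it generates 400 synthetic students with two invented features (`gre`, `cgpa`) and a hypothetical admission rule, so the resulting scores say nothing about the actual dataset. The pipeline itself (scaling, `LogisticRegression`, classification metrics) is standard `sklearn`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic students; the labelling rule below is an assumption for illustration.
rng = np.random.default_rng(4)
n = 400
gre = rng.normal(316.0, 11.0, n)
cgpa = rng.normal(8.6, 0.6, n)
signal = 0.08 * (gre - 316.0) + 2.0 * (cgpa - 8.6) + rng.normal(0.0, 0.8, n)
admitted = (signal > 0).astype(int)  # 1 = admitted, 0 = not admitted

X = np.column_stack([gre, cgpa])
X_train, X_test, y_train, y_test = train_test_split(
    X, admitted, test_size=0.2, random_state=0, stratify=admitted
)

# Scale the features, then fit the classifier; the pipeline applies both steps in order
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
matrix = confusion_matrix(y_test, y_pred)

print(f"accuracy={accuracy:.2f}  precision={precision:.2f}  recall={recall:.2f}")
print(matrix)  # rows: actual class, columns: predicted class
```

Wrapping the scaler and the model in one pipeline means the scaler is fitted on the training rows only, and `stratify` keeps the admitted/not-admitted ratio similar in both splits.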

### **Resources**  
[Logistic Regression](https://mlu-explain.github.io/logistic-regression/)

### 🎨 Bonus: Visual Exploration

Use **histograms, correlation heatmaps, and scatter plots** to gain insights before modeling.

🔎 **Can you build a model that accurately predicts student admissions?** Let's find out! 🚀

In [None]:
# add your code here

---
## 🎉 **Congratulations!** 🎉  

You've successfully trained your first **Linear Regression** and **Logistic Regression** models! 🚀  

Through this journey, you've learned:  
✅ How to **prepare and preprocess data** for training.  
✅ The importance of **choosing the right features** and tuning **hyperparameters**.  
✅ How to **train, predict, and evaluate models** using key metrics.  
✅ The difference between **regression (predicting continuous values)** and **classification (predicting categories)**.  

This is a **big step** in your Machine Learning journey! 💡 But ML is much more than just linear and logistic regression—there are many other models and techniques to explore.  

🔎 **Next Challenge:**  
- Research other **types of ML models** (e.g., Decision Trees, SVMs, Neural Networks).  
- Try solving different **real-world problems** using what you've learned.  

👏 Keep experimenting, keep learning, and welcome to the world of Machine Learning! 🚀🔥