# 🚀 Week 02: Training Machine Learning Models 🎯  

This week, we dive deeper into **Machine Learning** by learning how to **train models** for different types of problems, including **classification** and **prediction** tasks. 🧠💡  

### What You’ll Learn:  
🔹 How to **train models** using real-world data.  
🔹 The role of **features** in improving predictions.  
🔹 How to **evaluate model performance** and fine-tune it.  
🔹 The impact of **hyperparameters** like learning rate and epochs.  

By the end of this week, you'll be able to build, train, and test models confidently. Get ready to experiment, analyze, and improve your models! 🚀🔥

## Exercise 01 : **Predicting Trip Fare using Linear Regression**  
---

### **Objective**  
The goal of this task is to build a machine learning model that predicts the fare price of a trip based on selected features. You will go through the full machine learning workflow, including data preprocessing, model training, evaluation, and visualization.  

### **Tasks Overview**  

Follow these steps to train and evaluate a machine learning model:  

1. **Load and Explore the Data** 📝  
2. **Prepare the Data** 🔧  
3. **Train the Model** 🎯  
4. **Make Predictions** 🔮  
5. **Evaluate the Model** 📊  

### **Resources**  
[A Visual Introduction To (Almost) Everything You Should Know](https://mlu-explain.github.io/linear-regression/)

---

### 1. Load and Explore the Data 📝

- Load the dataset using pandas and inspect its structure.
- Check for missing values and basic statistics.
- Visualize key features to understand their relationship with the target variable.
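
The checks listed above can be sketched on a small stand-in table. The column names mirror the taxi dataset, but the rows below are made up for illustration; in the notebook you would run the same calls on the DataFrame loaded from the CSV.

```python
import pandas as pd

# Hypothetical stand-in rows; the notebook loads chicago_taxi_train.csv instead.
df = pd.DataFrame({
    "TRIP_MILES": [1.2, 3.4, 0.8, 10.5, None],
    "TRIP_SECONDS": [420, 900, 300, 2100, 600],
    "FARE": [6.5, 12.0, 5.25, 28.75, 8.0],
})

print(df.shape)    # number of rows and columns
print(df.dtypes)   # column types (object columns need encoding later)

missing = df.isna().sum()   # missing values per column
print(missing)

print(df.describe())        # count, mean, std, min, quartiles, max

# Correlation of each numeric column with the target gives a first hint
# of which features carry signal.
corr_with_fare = df.corr(numeric_only=True)["FARE"].sort_values(ascending=False)
print(corr_with_fare)
```

A scatter plot of the most correlated feature against `FARE` is a natural next step for the visualization part.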

In [9]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

data_url = "https://download.mlcc.google.com/mledu-datasets/chicago_taxi_train.csv"

# Load the dataset and drop rows with missing values.
df = pd.read_csv(data_url)
df = df.dropna()

# One-hot encode every text (object) column so the model receives only numbers.
categorical = df.select_dtypes(include=['object']).columns.tolist()
encoder = OneHotEncoder(sparse_output=False)
onehot = encoder.fit_transform(df[categorical])
feature_names = encoder.get_feature_names_out(categorical)
onehot_df = pd.DataFrame(onehot, columns=feature_names, index=df.index)

# Replace the original text columns with their encoded versions.
df = df.drop(columns=categorical)
df = pd.concat([df, onehot_df], axis=1)

### 2. Prepare the Data 🔧  
   - Select the most relevant features for training.  
   - Handle missing values if needed.  
   - Normalize or scale the data to improve performance if needed.  
   - Split the dataset into **training (80%)** and **testing (20%)** sets.  

>🙋 **Why do we need to split our dataset**❓
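
One way to see why a held-out set matters is to train a model that can memorize its training data and then score it on rows it has never seen. The sketch below uses made-up data and a decision tree rather than linear regression, only because a tree makes memorization easy to observe.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Made-up data: a linear signal plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 2.0 * X[:, 0] + rng.normal(0, 3.0, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# An unconstrained tree can memorize every training point, noise included.
tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

train_r2 = tree.score(X_train, y_train)
test_r2 = tree.score(X_test, y_test)
print(f"train R^2: {train_r2:.3f}, test R^2: {test_r2:.3f}")
```

The training score alone would suggest a near-perfect model; only the test score shows how it behaves on new data.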

In [10]:
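# Standardize every column (zero mean, unit variance).
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

scaled = scaler.fit_transform(df)
scaler_data = pd.DataFrame(scaled, columns=df.columns, index=df.index)

# Separate the features from the target. FARE was standardized too,
# so predictions and errors are in standardized units, not dollars.
X = scaler_data.drop(columns='FARE')
y = scaler_data['FARE']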
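
The cell above fits the scaler on the whole table before splitting, so statistics from the future test rows influence the transformation. A common alternative, sketched here on made-up data, is to split first, fit the scaler on the training rows only, and reuse it on the test rows. The target is left unscaled in this sketch so that errors stay in the original units.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Made-up features and target, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(500, 3))
y = X @ np.array([0.5, -0.2, 0.1]) + rng.normal(0, 1.0, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # learn mean/std from training rows only
X_test_scaled = scaler.transform(X_test)         # reuse those statistics on the test rows

print(X_train_scaled.mean(axis=0).round(3))  # approximately 0 for each column
print(X_test_scaled.mean(axis=0).round(3))   # close to 0, but generally not exactly 0
```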

In [11]:
from sklearn.model_selection import train_test_split

# Hold out 20% as the test set, then carve a validation set out of the remainder.
# 0.25 of the remaining 80% is 20% of the full data, giving a 60/20/20 split.
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)


print("X_train:")
print(X_train.head())

print("Y_train:")
print(y_train.head())


X_train:
       TRIP_START_HOUR  TRIP_SECONDS  TRIP_MILES  TRIP_SPEED  \
23909        -0.462867      1.101586    1.409734    1.096954   
23043         1.006360      1.142396    1.340014    0.968164   
232           0.830052      0.007686   -0.698979   -1.151925   
7448         -1.755787     -0.427286   -0.622681   -0.587228   
13045        -0.991789     -0.698024   -0.662146   -0.220671   

       PICKUP_CENSUS_TRACT  DROPOFF_CENSUS_TRACT  PICKUP_COMMUNITY_AREA  \
23909             1.248967             -0.928942               1.583526   
23043             1.248967             -0.346327               1.583526   
232              -0.597612              1.289567              -0.312180   
7448             -0.597612             -0.928646              -0.312180   
13045            -1.127960              1.291912              -1.102058   

       DROPOFF_COMMUNITY_AREA      TIPS  TIP_RATE  ...  \
23909               -0.933166  1.540405  0.212445  ...   
23043                0.153323  1.430710

### 3. Train the Model 🎯  

Train a **linear regression model** using `sklearn` (you can also try other alternatives).  

🔧 **Experiment with Different Features**  
- Start with a few features and observe the model's performance.  
- Try adding or removing features to see how it affects accuracy.  

> 🙋 **What are hyperparameters, and how do learning rate and epochs affect training** ❓
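
scikit-learn's `LinearRegression` solves the least-squares problem directly, so it exposes no learning rate or epoch count. To see those two hyperparameters at work, here is a minimal gradient-descent sketch on made-up data; `train_linear_gd` is a hypothetical helper written for this illustration, not part of any library.

```python
import numpy as np

def train_linear_gd(X, y, learning_rate, epochs):
    """Fit y ~ X @ w + b with batch gradient descent; return (w, b, loss history)."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    losses = []
    for _ in range(epochs):
        error = X @ w + b - y
        losses.append(float(np.mean(error ** 2)))   # MSE before this update
        w -= learning_rate * (2.0 / n_samples) * (X.T @ error)
        b -= learning_rate * 2.0 * float(np.mean(error))
    return w, b, losses

# One standardized, made-up feature with true slope 3 and intercept 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 3.0 * X[:, 0] + 1.0 + rng.normal(0, 0.1, size=200)

for lr in (0.001, 0.1):
    w, b, losses = train_linear_gd(X, y, learning_rate=lr, epochs=100)
    print(f"lr={lr}: final MSE={losses[-1]:.4f}, w={w[0]:.3f}, b={b:.3f}")
```

With the same number of epochs, the smaller learning rate leaves the model far from the true slope, while the larger one gets close. Plotting `losses` against the epoch index gives a loss curve.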

> 🙋 **Does using more features always improve the model** ❓
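
A quick experiment can help answer this. The sketch below uses made-up data: one informative feature plus a growing number of columns that are pure noise. For least squares, the training score cannot go down when columns are added, so the number to watch is the test score.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 60
signal = rng.normal(size=(n, 1))                     # the only useful feature
y = 3.0 * signal[:, 0] + rng.normal(0, 1.0, size=n)
noise_features = rng.normal(size=(n, 30))            # unrelated to y

results = []
for k in (0, 10, 30):
    X = np.hstack([signal, noise_features[:, :k]])
    # Same random_state and row count, so every run uses the same row split.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
    model = LinearRegression().fit(X_tr, y_tr)
    results.append((k, model.score(X_tr, y_tr), model.score(X_te, y_te)))

for k, train_r2, test_r2 in results:
    print(f"{k:2d} noise features: train R^2={train_r2:.3f}, test R^2={test_r2:.3f}")
```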

In [16]:
# add your code here
from sklearn.linear_model import LinearRegression
model = LinearRegression()

model.fit(X_train, y_train)


LinearRegression(fit_intercept=True, copy_X=True, tol=1e-06, n_jobs=None, positive=False)


#### 🎉 Congratulations! 🎉

**🚀 You’ve just trained your first Machine Learning model! 🚀**

### 4. Make Predictions 🔮  

Use the trained model to predict values on the **testing data**.  
Compare the predictions with the actual values to assess accuracy.  
You can also try making predictions on the **training data** to see how well the model memorized the patterns.  

>🙋 **Why doesn’t the model predict exact values, even when using the training data**❓
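
A small experiment illustrates the question. In the sketch below the data is made up: a straight line plus random noise. A linear model can recover the line, but it has no way to reproduce the noise on individual points, so errors remain even on the rows it was trained on.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up noisy target around the line y = 2.5 * x + 4.
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.5 * X[:, 0] + 4.0 + rng.normal(0, 1.0, size=100)

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)   # errors on the same data used for training

print("largest training error:", np.abs(residuals).max().round(3))
print("mean training error:", residuals.mean().round(6))  # about 0 when an intercept is fitted
```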

In [18]:
# add your code here
y_pred = model.predict(X_test)

### 5. Evaluate the Model 📊

- Measure performance using metrics like MSE, RMSE, and R² Score.
- Plot a loss curve to track training progress.
- Create a scatter plot to compare actual vs. predicted values.

>🙋 **What do MSE, RMSE, and R² Score tell us about the model's performance**❓

>🙋 **How can you tell if your model is overfitting or underfitting**❓


In [20]:
# add your code here
from sklearn.metrics import mean_squared_error, r2_score
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print("Mean Squared Error:", mse)
print("R-squared:", r2)

Mean Squared Error: 0.0002658591989595919
R-squared: 0.9997259708619853
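
The cell above reports MSE and R² but not RMSE. As a sketch of how the three metrics relate, the example below computes them on a handful of made-up fares and checks the scikit-learn values against their definitions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted fares, in dollars.
y_true = np.array([10.0, 12.5, 7.0, 30.0, 18.0])
y_hat = np.array([11.0, 12.0, 8.0, 27.0, 18.5])

mse = mean_squared_error(y_true, y_hat)   # mean of squared errors
rmse = np.sqrt(mse)                       # same units as the target
r2 = r2_score(y_true, y_hat)              # 1 - SS_res / SS_tot

# The same quantities from their definitions.
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
assert np.isclose(mse, ss_res / len(y_true))
assert np.isclose(r2, 1 - ss_res / ss_tot)

print(f"MSE={mse:.3f}  RMSE={rmse:.3f}  R^2={r2:.4f}")
```

In this notebook `FARE` was standardized together with the features, so the reported MSE is in standardized units. Multiplying the RMSE by the standard deviation of the original `FARE` column would express it in dollars.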


---
## Exercise 02 : **University Admission Prediction Challenge**

😇 I know that the exercise is difficult, but you will practice what you learned last week along with your first classification algorithm.

### 🎯 Objective

Your mission is to predict whether a student will be admitted to their desired university based on various academic and application-related factors. You’ll use **Logistic Regression** to build a predictive model and discuss its strengths and limitations.

📊 The Dataset:

[Admission_Predict.csv](https://github.com/1337-Artificial-Intelligence/Entry-Level-ML-Engineer-Bootcamp/blob/main/Week02/Admission_Prediction_Challenge.csv)

The dataset contains information on **400 students** with the following attributes:

- **GRE Score** 🎓
- **TOEFL Score** 📚
- **University Rating** 🏛️
- **Statement of Purpose (SOP) Score** ✍️
- **Letter of Recommendation (LOR) Score** 📩
- **Cumulative Grade Point Average (CGPA)** 🎯
- **Research Experience (Yes/No)** 🔬
- **Chance of Admission (Target Variable: 0 or 1)**

### 🛠️ Steps to Follow

1. **Load & Explore the Data**: Understand the dataset, check for missing values, and analyze distributions.
2. **Feature Selection & Processing**: Identify relevant features and scale them if needed.
3. **Train a Logistic Regression Model**: Implement Logistic Regression (doing it **from scratch** is optional) to classify students into "Admitted" or "Not Admitted."
4. **Evaluate the Model**: Measure accuracy, precision, recall, and other key metrics (do some research 🙂).
5. **Discuss Limitations**: Explore cases where Logistic Regression may struggle and suggest improvements (e.g., feature engineering, alternative models).
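
For the optional from-scratch route in step 3, one possible starting point is batch gradient descent on the log loss. This is a minimal sketch on made-up, already standardized data; `fit_logistic` and `predict` are hypothetical helper names, and the learning rate and epoch count are arbitrary choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, learning_rate=0.1, epochs=1000):
    """Minimize the mean log loss with batch gradient descent; return (w, b)."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)               # predicted probability of class 1
        grad_w = X.T @ (p - y) / n_samples   # gradient of the mean log loss w.r.t. w
        grad_b = float(np.mean(p - y))       # gradient w.r.t. the bias
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

def predict(X, w, b, threshold=0.5):
    return (sigmoid(X @ w + b) >= threshold).astype(int)

# Made-up data: two features, label driven by their sum plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + X[:, 1] + rng.normal(0, 0.5, size=300) > 0).astype(int)

w, b = fit_logistic(X, y)
accuracy = float(np.mean(predict(X, w, b) == y))
print(f"weights={w.round(2)}, bias={b:.2f}, training accuracy={accuracy:.3f}")
```

Comparing the learned weights and accuracy with `sklearn.linear_model.LogisticRegression` on the same data is a useful sanity check. Expect small differences, since scikit-learn applies regularization by default.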

### **Resources**  
[Logistic Regression](https://mlu-explain.github.io/logistic-regression/)

### 🎨 Bonus: Visual Exploration

Use **histograms, correlation heatmaps, and scatter plots** to gain insights before modeling.

🔎 **Can you build a model that accurately predicts student admissions?** Let's find out! 🚀

In [61]:
# add your code here
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score

df = pd.read_csv("~/Entry-Level-ML-Engineer-Bootcamp/Week02/Admission_Prediction_Challenge.csv")

# Turn the admission chance into a binary label: 1 if the chance is at least 0.75.
# Note the trailing space in the 'Chance of Admit ' column name.
df['Admit_Class'] = (df['Chance of Admit '] >= 0.75).astype(int)

# Split the table into features and target.
X = df.drop(columns=['Serial No.', 'Chance of Admit ', 'Admit_Class'])
y = df['Admit_Class']

# Standardize the features.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X = pd.DataFrame(X_scaled, columns=X.columns)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

logistic = LogisticRegression()
logistic.fit(X_train, y_train)

y_pred = logistic.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Accuracy: 0.9125
              precision    recall  f1-score   support

           0       0.95      0.89      0.92        47
           1       0.86      0.94      0.90        33

    accuracy                           0.91        80
   macro avg       0.91      0.92      0.91        80
weighted avg       0.92      0.91      0.91        80
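
The numbers in this report can be reproduced by hand from the four confusion-matrix counts. The counts below are not printed in the notebook; they are inferred from the supports (47 and 33) and the rounded scores above, so treat them as a reconstruction.

```python
# Confusion-matrix counts inferred from the report (class 1 = admitted).
tn, fp, fn, tp = 42, 5, 2, 31

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_1 = tp / (tp + fp)   # of the predicted admits, how many were right
recall_1 = tp / (tp + fn)      # of the actual admits, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

precision_0 = tn / (tn + fn)   # same idea, treating class 0 as the positive class
recall_0 = tn / (tn + fp)

print(f"accuracy={accuracy:.4f}")
print(f"class 1: precision={precision_1:.2f} recall={recall_1:.2f} f1={f1_1:.2f}")
print(f"class 0: precision={precision_0:.2f} recall={recall_0:.2f}")
```

In the notebook itself, `sklearn.metrics.confusion_matrix(y_test, y_pred).ravel()` returns the actual counts in the order `tn, fp, fn, tp` for a binary problem.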



---
## 🎉 **Congratulations!** 🎉  

You've successfully trained your first **Linear Regression** and **Logistic Regression** models! 🚀  

Through this journey, you've learned:  
✅ How to **prepare and preprocess data** for training.  
✅ The importance of **choosing the right features** and tuning **hyperparameters**.  
✅ How to **train, predict, and evaluate models** using key metrics.  
✅ The difference between **regression (predicting continuous values)** and **classification (predicting categories)**.  

This is a **big step** in your Machine Learning journey! 💡 But ML is much more than just linear and logistic regression: there are many other models and techniques to explore.  

🔎 **Next Challenge:**  
- Research other **types of ML models** (e.g., Decision Trees, SVMs, Neural Networks).  
- Try solving different **real-world problems** using what you've learned.  

👏 Keep experimenting, keep learning, and welcome to the world of Machine Learning! 🚀🔥