# Data Scientist Associate Practical Exam Submission



## Task 1 

**1. course_id**:
This column does not match the description in the table. The table specifies a nominal type, but the column is stored as int64. Note that every value is unique in the dataset.

**This column does not contain any missing values**.


To correct the type discrepancy, the column is converted to the categorical type with the pandas astype() method, so that it is not mistaken for a continuous value during model training.
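
A minimal sketch of that conversion, assuming a DataFrame named `data` as in the other snippets of this report. The sample values are made up and only the column name follows the dataset.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are invented).
data = pd.DataFrame({"course_id": [101, 102, 103],
                     "enrollment_count": [165, 250, 261]})

# Cast the identifier to a categorical dtype so it is not treated as a number.
data["course_id"] = data["course_id"].astype("category")

dtype_name = str(data["course_id"].dtype)   # name of the resulting dtype
is_unique = data["course_id"].is_unique     # True when no identifier repeats
print(dtype_name, is_unique)
```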




**2. course_type**:
For this column, the values do not correspond to the description in the table. According to the table it should be of nominal type, but it is stored as "object".

This column **does not contain any missing values** (1850 of 1850 entries are present).
To match the table, it is converted to the "category" type.

This is done using the following code: 
```
data['course_type'] = data['course_type'].astype("category")
``` 





**3. year**:
Values match the given description (discrete values from 2011 to 2022).
**0 missing values.**




**4. enrollment_count**:
Values match the given description (discrete values representing the number of enrolled students).
**0 missing values.**




**5. pre_score**:
The pre_score column is of type "object". Consequently, it does not fulfill the criteria in the table.

There are **130 missing values** that are represented as "-". These values have been replaced with "0".

To comply with the criteria from the table, the column is then converted to float with the astype() method so that it holds continuous values.
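
A short sketch of the two steps just described, on a made-up sample. The DataFrame name `data` and the values are assumptions for illustration.

```python
import pandas as pd

# Invented sample; "-" marks a missing pre-course score, as in the write-up.
data = pd.DataFrame({"pre_score": ["55.2", "-", "71.0"]})

# Swap the placeholder for "0", then cast the strings to float.
data["pre_score"] = data["pre_score"].replace("-", "0").astype(float)

cleaned = data["pre_score"].tolist()
print(cleaned)
```

If the missing scores should stay missing rather than become 0, `pd.to_numeric(data["pre_score"], errors="coerce")` would turn "-" into NaN instead.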





**6. post_score**:
Values match the given description (continuous, representing the average score on the post-course exam). Its type is "float".

The column contains **185 missing values**.
These values have been replaced with 0, as indicated in the table.




**7. pre_requirement**:
The values in this column do not correspond to the description in the table; they are of type "object".

This column contains **89 missing values**.
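The missing values have been replaced with "None", as indicated in the table. Because replace("NaN", "None") only matches the literal text "NaN" and not truly missing entries, fillna() is used:
```
data['pre_requirement'] = data['pre_requirement'].fillna("None")
```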

We then converted this column to the "category" type using the following code:

```
data['pre_requirement'] = data['pre_requirement'].astype("category")
```
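
A small sketch of the difference between replacing the literal text "NaN" and filling genuinely missing values, and of why filling before the category conversion is the safer order. The sample Series is made up.

```python
import numpy as np
import pandas as pd

# Invented sample with one genuinely missing entry (np.nan).
s = pd.Series(["Beginner", np.nan, "Intermediate"])

after_replace = s.replace("NaN", "None")  # only matches the literal text "NaN"
after_fillna = s.fillna("None")           # fills real missing values

still_missing = int(after_replace.isna().sum())
filled_missing = int(after_fillna.isna().sum())

# Fill first, then convert, so that "None" becomes one of the categories.
as_category = after_fillna.astype("category")
categories = list(as_category.cat.categories)
print(still_missing, filled_missing, categories)
```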


**8. department**:
The values in this column do not correspond to the description in the table; they are of type "object".

This column contains **0 missing values**.

To address this issue, we first replaced "Math" with "Mathematics", and then converted the column to the "category" type using the following code:
```
data['department'] = data['department'].replace("Math", "Mathematics")
```
```
data['department'] = data['department'].astype("category")
```

## Task 2

Count: The total number of data points (enrollment counts) in the dataset is 1850.

Mean (Average): The mean enrollment count is approximately 231.57 students. This value represents the central tendency of the distribution, indicating that, on average, the number of students enrolled in courses is around 232.

Standard Deviation: The standard deviation is approximately 36.99. This value measures the spread or variability of enrollment counts around the mean. A higher standard deviation suggests that the enrollment counts are more spread out from the mean value of 232, while a lower standard deviation indicates that the counts are more concentrated around the mean.

Minimum: The minimum enrollment count in the dataset is 154 students. This is the smallest number of students enrolled in a course.

25th Percentile (Q1): The 25th percentile, also known as the first quartile, is 185. This means that 25% of the enrollment counts are below or equal to 185. It provides a measure of the lower range of enrollment counts.

Median (50th Percentile, Q2): The median enrollment count is 251. This is the middle value in the dataset, where 50% of the enrollment counts are below or equal to 251. The median is less sensitive to extreme values and provides a measure of the central tendency.

75th Percentile (Q3): The 75th percentile, also known as the third quartile, is 261. This means that 75% of the enrollment counts are below or equal to 261. It provides a measure of the upper range of enrollment counts.

Maximum: The maximum enrollment count in the dataset is 267 students. This is the largest number of students enrolled in a course.
 
Overall, the distribution of "enrollment_count" is not centered on the mean. The median (251) lies well above the mean (231.57), and the per-type statistics in Task 4 show two separate groups: classroom courses with 154 to 190 students and online courses with 231 to 267 students. The mean therefore sits at the lower edge of the online group rather than at a typical course size.
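
Summary statistics of this kind can be obtained in one call. The sketch below uses a handful of invented enrollment counts, not the real data, and compares the mean with the median as a quick skew check.

```python
import pandas as pd

# Invented enrollment counts for illustration only.
enrollment = pd.Series([154, 170, 185, 251, 256, 261, 267],
                       name="enrollment_count")

# describe() returns count, mean, std, min, 25%, 50%, 75% and max.
summary = enrollment.describe()

# A negative gap means the mean sits below the median (left-leaning data).
gap = summary["mean"] - summary["50%"]
print(summary)
print("mean - median:", round(gap, 2))
```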
 
 
 
 
 
The following diagrams provide more information.

Histogram:

![visualization_enrollment_count](visualization_enrollment_count.png)

Boxplot:

![boxplot_enrollment_count](boxplot_enrollment_count.png)
 

## Task 3
- Number of Online Courses: There are 1375 courses labeled as "online."
- Number of Classroom Courses: There are 475 courses labeled as "classroom."

![couse_type_visualization](couse_type_visualization.png)

a. Course Type with the Most Observations: The course type with the most observations is "online," as it has 1375 courses associated with it.

b. Balance Across Types: **The observations are not balanced across the types.** "Online" courses significantly outnumber "classroom" courses, with a difference of 1375 - 475 = 900 courses. This indicates an imbalance, suggesting that the university offers a significantly higher number of online courses compared to classroom-based courses.
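
A sketch of how the balance can be checked. The label column is rebuilt here from the two counts reported above (1375 and 475); the variable names are illustrative.

```python
import pandas as pd

# Rebuild the label column from the reported counts.
course_type = pd.Series(["online"] * 1375 + ["classroom"] * 475,
                        name="course_type")

counts = course_type.value_counts()                           # absolute counts
shares = course_type.value_counts(normalize=True).round(3)    # proportions
most_common = counts.idxmax()
print(counts)
print(shares)
```

Expressed as proportions, roughly three quarters of the observations are online courses.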

## Task 4
Based on the aggregate functions for enrollment count for each type of course:

- **Classroom Courses:**
  - Minimum Enrollment Count: 154
  - Maximum Enrollment Count: 190
  - Mean Enrollment Count: 170.43 (rounded to two decimal places)
  - Median Enrollment Count: 170.0

- **Online Courses:**
  - Minimum Enrollment Count: 231
  - Maximum Enrollment Count: 267
  - Mean Enrollment Count: 252.69 (rounded to two decimal places)
  - Median Enrollment Count: 256.0

**Comparison:**

The aggregate statistics reveal significant differences between classroom and online courses in terms of enrollment counts:

1. **Enrollment Range:** Both course types cover a range of the same width (36 students: 154 to 190 for classroom, 231 to 267 for online), but the online range is shifted upward. Even the smallest online course (231) has more students than the largest classroom course (190).

2. **Central Tendency:** The mean and median enrollment counts for online courses are higher than those for classroom courses, indicating that online courses generally have higher enrollments on average.

3. **Spread of Enrollment:** The overall range is the same for both types (36 students), so any difference lies inside it. Classroom courses have the narrower interquartile range (from Q1 to Q3), and their mean and median nearly coincide (170.43 vs 170.0), whereas the online mean (252.69) lies 3.31 below its median (256.0).

Overall, the aggregate functions highlight the distinct enrollment characteristics of classroom and online courses. Online courses have consistently higher enrollments, and the two ranges do not overlap. The differences in enrollment patterns between the two course types could be attributed to factors such as accessibility, flexibility, and student preferences for online learning.
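
Per-type figures like the ones above can be produced with a grouped aggregation. This is a sketch on invented rows; only the column names follow the dataset.

```python
import pandas as pd

# Invented rows for illustration.
data = pd.DataFrame({
    "course_type": ["classroom", "classroom", "classroom",
                    "online", "online", "online"],
    "enrollment_count": [154, 170, 190, 231, 256, 267],
})

# One row per course type, one column per aggregate.
stats = (data.groupby("course_type")["enrollment_count"]
             .agg(["min", "max", "mean", "median"]))

# Width of the min-max range, useful when comparing spread.
stats["range"] = stats["max"] - stats["min"]
print(stats)
```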



The following diagrams illustrate this comparison:



![comparaison1](comparaison1.png) 


![comparaison2](comparaison2.png)



## Task 5


The university's task of predicting the number of students who will enroll in a course is a **regression** machine learning problem. Regression is the ideal choice because it focuses on making continuous numeric predictions, precisely what the university needs to forecast the exact enrollment count for their courses. Classification, which predicts categorical labels, and clustering, which groups data points without predicting a target variable, are not suitable for this scenario.

## Task 6

The choice of the Linear Regression model with one-hot encoding is due to its simplicity, interpretability, and flexibility. It serves as a good baseline for predicting enrollment counts, and its performance can be compared to more complex models. Additionally, one-hot encoding allows us to handle categorical features effectively.
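
A minimal sketch of what one-hot encoding produces. It uses `pd.get_dummies` on a made-up frame purely for illustration; the model in this task relies on scikit-learn's `OneHotEncoder` inside a pipeline instead.

```python
import pandas as pd

# Invented sample with a single categorical column.
df = pd.DataFrame({"course_type": ["online", "classroom", "online"]})

# Each category becomes its own 0/1 indicator column.
encoded = pd.get_dummies(df, columns=["course_type"], dtype=int)

column_names = list(encoded.columns)
online_flags = encoded["course_type_online"].tolist()
print(encoded)
```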

```
# Import all modules and packages
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error

# Data manipulation
universty_data = pd.read_csv("university_enrollment_2306.csv")
universty_data['course_type'] = universty_data['course_type'].astype("category")
# Fill missing values before the category conversion so that "None" is a known category
universty_data['pre_requirement'] = universty_data['pre_requirement'].fillna("None")
universty_data['pre_requirement'] = universty_data['pre_requirement'].astype("category")
universty_data['pre_score'] = universty_data['pre_score'].replace("-", "0")
universty_data['pre_score'] = universty_data['pre_score'].astype(float)
universty_data['department'] = universty_data['department'].replace("Math", "Mathematics")
universty_data['department'] = universty_data['department'].astype("category")
# Assign the result back instead of calling fillna(inplace=True) on a column selection
universty_data['post_score'] = universty_data['post_score'].fillna(0)

# Extract the features (input) and target variable (output)
X = universty_data[['year', 'pre_score', 'course_type', 'post_score', 'pre_requirement', 'department']]
y = universty_data['enrollment_count']

# Split the data into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the preprocessing steps for categorical variables
categorical_features = ['course_type', 'pre_requirement', 'department']
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Encode the categorical features; columns not listed here (the numeric ones)
# are dropped because ColumnTransformer defaults to remainder='drop'
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, categorical_features)
    ])

# Create the Linear Regression model with preprocessing pipeline
model = Pipeline(steps=[('preprocessor', preprocessor),
                        ('regressor', LinearRegression())])

# Fit the model to the training data
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model's performance using Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)
```

Output:

```
Mean Squared Error: 0.09890894503206821
```


## Task 7
The code performs hyperparameter tuning using GridSearchCV to find the best settings for the Stacking Regressor. It defines a preprocessing pipeline, base models (RandomForestRegressor and LinearRegression), and a Stacking Regressor that combines the base models. GridSearchCV searches for the optimal hyperparameters (e.g., n_estimators, max_depth, fit_intercept) using cross-validation to minimize the mean squared error. The best model is then evaluated on the test set, providing an optimized solution for predicting enrollment counts.

```
# Import all modules and packages
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import StackingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error

# Data manipulation
universty_data = pd.read_csv("university_enrollment_2306.csv")
universty_data['course_type'] = universty_data['course_type'].astype("category")
# Fill missing values before the category conversion so that "None" is a known category
universty_data['pre_requirement'] = universty_data['pre_requirement'].fillna("None")
universty_data['pre_requirement'] = universty_data['pre_requirement'].astype("category")
universty_data['pre_score'] = universty_data['pre_score'].replace("-", "0")
universty_data['pre_score'] = universty_data['pre_score'].astype(float)
universty_data['department'] = universty_data['department'].replace("Math", "Mathematics")
universty_data['department'] = universty_data['department'].astype("category")
# Assign the result back instead of calling fillna(inplace=True) on a column selection
universty_data['post_score'] = universty_data['post_score'].fillna(0)

# Extract the features (input) and target variable (output)
X = universty_data[['year', 'pre_score', 'course_type', 'post_score', 'pre_requirement', 'department']]
y = universty_data['enrollment_count']

# Split the data into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the preprocessing steps for numeric and categorical variables
numeric_features = ['year', 'pre_score', 'post_score']
categorical_features = ['course_type', 'pre_requirement', 'department']

numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine preprocessing steps for both numeric and categorical features
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Define the base models
base_models = [
    ('rf', RandomForestRegressor(random_state=42)),
    ('lr', LinearRegression())
]

# Create the Stacking Regressor that combines the base models
model = StackingRegressor(estimators=base_models, final_estimator=LinearRegression())

# Create the full pipeline with preprocessor and the Stacking Regressor
full_pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('stacking_regressor', model)])

# Define hyperparameters to tune (these address the base models only)
param_grid = {
    'stacking_regressor__rf__n_estimators': [50, 100, 150],
    'stacking_regressor__rf__max_depth': [None, 10, 20],
    'stacking_regressor__lr__fit_intercept': [True, False]
}

# Perform GridSearchCV for hyperparameter tuning
grid_search = GridSearchCV(full_pipeline, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

# Get the best model and its hyperparameters
best_model = grid_search.best_estimator_
best_params = grid_search.best_params_

# Make predictions on the test set using the best model
y_pred = best_model.predict(X_test)

# Evaluate the model's performance using Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print("Best Hyperparameters:", best_params)
print("Mean Squared Error:", mse)
```

Output:

```
Best Hyperparameters: {'stacking_regressor__lr__fit_intercept': False, 'stacking_regressor__rf__max_depth': 10, 'stacking_regressor__rf__n_estimators': 100}
Mean Squared Error: 0.09918617592240507
```


## Task 8
In part 6, I chose the Linear Regression model with one-hot encoding for categorical variables as a baseline model. Linear Regression is a simple and interpretable regression technique widely used for predicting numeric values. One-hot encoding is applied to handle categorical variables, converting them into numerical representations that the model can work with. The choice of Linear Regression as the baseline model allows us to establish a straightforward prediction approach, which is easy to interpret and provides a good starting point for comparison.

In part 7, I opted for the Stacking Regressor with hyperparameter tuning using GridSearchCV as a comparison model. The Stacking Regressor is an ensemble learning technique that combines the predictions of multiple base models (RandomForestRegressor and Linear Regression in this case) to improve predictive performance. Hyperparameter tuning with GridSearchCV systematically searches for the best combination of hyperparameters for the base models; the final estimator (Linear Regression) is left at its default settings. By leveraging the strengths of different models and optimizing their parameters, the Stacking Regressor aims to provide more accurate predictions than a single model alone.

In summary, the choice of Linear Regression with one-hot encoding as the baseline model and the Stacking Regressor with hyperparameter tuning as the comparison model offers a balanced approach. The baseline model gives us a clear and interpretable prediction, suitable for initial insights, while the Stacking Regressor with hyperparameter tuning aims to leverage the best of both base models and further enhance the predictive capabilities, potentially providing more accurate enrollment count predictions.

## Task 9
In the comparison of the models, we evaluated the performance of Linear Regression and Stacking Regressor models for predicting enrollment counts. We used Mean Squared Error (MSE) to measure the accuracy of the predictions.
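
For reference, MSE is the average of the squared differences between actual and predicted values. The sketch below computes it by hand on made-up numbers; `mse_by_hand` is a hypothetical helper written for this illustration, not part of scikit-learn.

```python
import math

def mse_by_hand(y_true, y_pred):
    """Average of squared differences between actual and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Invented actual and predicted enrollment counts.
y_true = [170, 256, 190]
y_pred = [171, 255, 190]

mse = mse_by_hand(y_true, y_pred)   # squared errors are 1, 1 and 0
rmse = math.sqrt(mse)               # same unit as the target (students)
print(mse, rmse)
```

Taking the square root puts the error back in the unit of the target: an MSE of about 0.099, as reported in this write-up, corresponds to an RMSE of about 0.31 students.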

```
import pandas as pd
# GridSearchCV is used below, so it has to be imported here as well
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import StackingRegressor, RandomForestRegressor
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error

# Data manipulation
universty_data = pd.read_csv("university_enrollment_2306.csv")
universty_data['course_type'] = universty_data['course_type'].astype("category")
# Fill missing values before the category conversion so that "None" is a known category
universty_data['pre_requirement'] = universty_data['pre_requirement'].fillna("None")
universty_data['pre_requirement'] = universty_data['pre_requirement'].astype("category")
universty_data['pre_score'] = universty_data['pre_score'].replace("-", "0")
universty_data['pre_score'] = universty_data['pre_score'].astype(float)
universty_data['department'] = universty_data['department'].replace("Math", "Mathematics")
universty_data['department'] = universty_data['department'].astype("category")
# Assign the result back instead of calling fillna(inplace=True) on a column selection
universty_data['post_score'] = universty_data['post_score'].fillna(0)

# Extract the features (input) and target variable (output)
X = universty_data[['year', 'pre_score', 'course_type', 'post_score', 'pre_requirement', 'department']]
y = universty_data['enrollment_count']

# Split the data into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Part 6: Linear Regression Model (categorical features only)
categorical_features = ['course_type', 'pre_requirement', 'department']
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, categorical_features)
    ])
model_lr = Pipeline(steps=[('preprocessor', preprocessor),
                           ('regressor', LinearRegression())])

model_lr.fit(X_train, y_train)
y_pred_lr = model_lr.predict(X_test)
mse_lr = mean_squared_error(y_test, y_pred_lr)

# Part 7: Stacking Regressor Model (numeric and categorical features)
numeric_features = ['year', 'pre_score', 'post_score']
categorical_features = ['course_type', 'pre_requirement', 'department']

numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

base_models = [
    ('rf', RandomForestRegressor(random_state=42)),
    ('lr', LinearRegression())
]

model_sr = StackingRegressor(estimators=base_models, final_estimator=LinearRegression())

full_pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('stacking_regressor', model_sr)])

param_grid = {
    'stacking_regressor__rf__n_estimators': [50, 100, 150],
    'stacking_regressor__rf__max_depth': [None, 10, 20],
    'stacking_regressor__lr__fit_intercept': [True, False]
}

grid_search = GridSearchCV(full_pipeline, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(X_train, y_train)

best_model_sr = grid_search.best_estimator_
y_pred_sr = best_model_sr.predict(X_test)
mse_sr = mean_squared_error(y_test, y_pred_sr)

# Print the results
print("Mean Squared Error (Linear Regression Model):", mse_lr)
print("Mean Squared Error (Stacking Regressor Model):", mse_sr)
```

Output:

```
Mean Squared Error (Linear Regression Model): 0.09890894503206821
Mean Squared Error (Stacking Regressor Model): 0.09918617592240507
```


## Task 10
The **Linear Regression model slightly outperforms the Stacking Regressor model in terms of Mean Squared Error (MSE)** (0.0989 vs 0.0992). The gap is small, but several factors could explain why the more complex model did not do better:

1. Simplicity of Linear Regression: Linear Regression is a simple and interpretable model that assumes a linear relationship between the features and the target variable. In some cases, when the data exhibits a relatively linear relationship, Linear Regression can provide accurate predictions, especially if there are not many complex interactions between the features.

2. Limited Complexity: Stacking Regressor, on the other hand, combines multiple base models, including RandomForestRegressor and Linear Regression. It has more complexity than the simple Linear Regression model. However, in this particular case, the additional complexity may not have significantly improved the predictive performance. The dataset may not have exhibited strong non-linear patterns or interactions that would have been better captured by the Stacking Regressor.

3. Overfitting: The Stacking Regressor with hyperparameter tuning through GridSearchCV may have become more prone to overfitting the training data, especially with a limited amount of data available. Overfitting occurs when the model becomes too specialized to the training data and performs poorly on unseen data.

4. Data Characteristics: The dataset may have been better suited for a simpler model like Linear Regression. The data's linearity and the relationship between features and target variable could align well with the assumptions of Linear Regression, making it a more appropriate choice for this specific prediction task.
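
One way to probe the overfitting point is to compare training and test error for each model. The sketch below does this on synthetic, purely linear data generated for illustration; it is not the university dataset, and the model settings are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data with a linear signal plus noise (an assumption for this sketch).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

models = {
    "linear": LinearRegression(),
    "forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

# A training error far below the test error is a sign of overfitting.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = {
        "train_mse": mean_squared_error(y_train, model.predict(X_train)),
        "test_mse": mean_squared_error(y_test, model.predict(X_test)),
    }

for name, scores in results.items():
    print(name, scores)
```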


![compar](compar.png)

