# Introduction to Data Science 2025

# Week 6: Recap

## Exercise 1 | Linear regression with feature selection

Download the [TED Talks](https://www.kaggle.com/rounakbanik/ted-talks) dataset from Kaggle. Your task is to predict both the ratings and the number of views of a given TED talk. You should focus only on the <span style="font-weight: bold">ted_main</span> table.

1. Download the data and extract the following ratings from the column <span style="font-weight: bold">ratings</span>: <span style="font-weight: bold">Funny</span>, <span style="font-weight: bold">Confusing</span>, <span style="font-weight: bold">Inspiring</span>. Store these values in their own columns so that they are easier to access. Next, extract the tags from the column <span style="font-weight: bold">tags</span>. Count the number of occurrences of each tag and select the 100 most common tags. Create a binary variable for each of these and include them in your data table, so that you can directly see whether a given tag (among the top-100 tags) is used in a given TED talk or not. The dataset you compose should have dimension (2550, 104) and comprise the 'views' column, the three columns with counts of "Funny", "Confusing" and "Inspiring" ratings, and 100 columns that one-hot encode the top-100 most common tags.
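
The extraction steps can be tried out on a toy table first. The sketch below is only an illustration: the three rows are invented, and the string layout (a list of dictionaries with `name` and `count` keys for ratings, a list of strings for tags) is an assumption about how the columns are stored. It also shows why tags should be matched against the parsed list rather than the raw string.

```python
import ast
from collections import Counter

import pandas as pd

# Invented rows that mimic the assumed string format of the two columns
toy = pd.DataFrame({
    "ratings": [
        "[{'name': 'Funny', 'count': 12}, {'name': 'Inspiring', 'count': 30}]",
        "[{'name': 'Confusing', 'count': 4}]",
        "[{'name': 'Funny', 'count': 1}, {'name': 'Confusing', 'count': 2}, "
        "{'name': 'Inspiring', 'count': 3}]",
    ],
    "tags": ["['art', 'science']", "['art']", "['smart cities']"],
})

def rating_counts(ratings_str):
    """Turn the ratings string into a {name: count} dictionary."""
    return {r["name"]: r["count"] for r in ast.literal_eval(ratings_str)}

parsed = toy["ratings"].apply(rating_counts)
for name in ["Funny", "Confusing", "Inspiring"]:
    # A rating that is absent from a row counts as 0
    toy[name] = parsed.apply(lambda d, n=name: d.get(n, 0))

# Parse the tag strings into real lists before counting
tag_lists = toy["tags"].apply(ast.literal_eval)
tag_counts = Counter(tag for tags in tag_lists for tag in tags)
top_tags = [tag for tag, _ in tag_counts.most_common(1)]  # top-1 here, top-100 in the exercise

for tag in top_tags:
    toy[f"tag_{tag}"] = tag_lists.apply(lambda tags, t=tag: int(t in tags))

# For comparison: a substring test on the raw string also matches 'smart cities'
substring_match = toy["tags"].apply(lambda s: int("art" in s)).tolist()

print(toy[["Funny", "Confusing", "Inspiring", "tag_art"]])
print(substring_match)
```

The list-membership test marks only the first two rows as tagged `art`, whereas the substring test marks all three.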


In [None]:
# Use this cell for your code
import pandas as pd
import numpy as np
import ast
from collections import Counter

df = pd.read_csv('ted_main.csv')

def extract_rating(ratings_str, rating_name):
    try:
        ratings_list = ast.literal_eval(ratings_str)
        for rating in ratings_list:
            if rating['name'] == rating_name:
                return rating['count']
        return 0
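    except (ValueError, SyntaxError, TypeError, KeyError):
        # Malformed or missing ratings are treated as a count of 0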
        return 0

df['Funny'] = df['ratings'].apply(lambda x: extract_rating(x, 'Funny'))
df['Confusing'] = df['ratings'].apply(lambda x: extract_rating(x, 'Confusing'))
df['Inspiring'] = df['ratings'].apply(lambda x: extract_rating(x, 'Inspiring'))

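# Parse each tag string into a Python list once; malformed entries become empty lists
def parse_tags(tags_str):
    try:
        return ast.literal_eval(tags_str)
    except (ValueError, SyntaxError, TypeError):
        return []

tag_lists = df['tags'].apply(parse_tags)

tag_counts = Counter(tag for tags in tag_lists for tag in tags)
top_100_tags = [tag for tag, count in tag_counts.most_common(100)]

# Test membership in the parsed list: a substring test on the raw string
# would also match a short tag inside a longer one (e.g. 'art' in 'smart')
for tag in top_100_tags:
    df[f'tag_{tag}'] = tag_lists.apply(lambda tags: 1 if tag in tags else 0)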

columns_to_keep = ['views', 'Funny', 'Confusing', 'Inspiring'] + [f'tag_{tag}' for tag in top_100_tags]
ted_data = df[columns_to_keep].copy()

print(f"Dataset shape: {ted_data.shape}")
print(f"\nFirst few rows:")
print(ted_data.head())

2. Construct a linear regression model to predict the number of views based on the data in the <span style="font-weight: bold">ted_main</span> table, including the binary variables for the top-100 tags that you just created.

In [None]:
# Use this cell for your code
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

X = ted_data.drop('views', axis=1) 
y = ted_data['views'] 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

lr_views = LinearRegression()
lr_views.fit(X_train, y_train)

y_pred = lr_views.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Linear Regression Model for Views Prediction:")
print(f"Mean Squared Error: {mse:,.2f}")
print(f"R² Score: {r2:.4f}")
print(f"\nModel coefficients shape: {lr_views.coef_.shape}")
print(f"Intercept: {lr_views.intercept_:.2f}")

3. Do the same for the <span style="font-weight: bold">Funny</span>, <span style="font-weight: bold">Confusing</span>, and <span style="font-weight: bold">Inspiring</span> ratings.

In [None]:
# Use this cell for your code
tag_columns = [col for col in ted_data.columns if col.startswith('tag_')]
X_tags = ted_data[tag_columns]

results = {}

# Fit one model per rating, using the same tag features and the same split settings
for rating_name in ['Funny', 'Confusing', 'Inspiring']:
    y_rating = ted_data[rating_name]
    X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(
        X_tags, y_rating, test_size=0.2, random_state=42
    )
    lr_rating = LinearRegression()
    lr_rating.fit(X_train_r, y_train_r)
    y_pred_r = lr_rating.predict(X_test_r)
    results[rating_name] = {
        'model': lr_rating,
        'mse': mean_squared_error(y_test_r, y_pred_r),
        'r2': r2_score(y_test_r, y_pred_r)
    }

print("Linear Regression Models for Ratings Prediction:\n")
for rating_name, metrics in results.items():
    print(f"{rating_name}:")
    print(f"  MSE: {metrics['mse']:,.2f}")
    print(f"  R² Score: {metrics['r2']:.4f}")

4. You will probably notice that most of the tags are not useful in predicting the views and the ratings. You should use some kind of variable selection to prune the set of tags that are included in the model. You can use for example classical p-values or more modern [LASSO](https://en.wikipedia.org/wiki/Lasso_(statistics)) techniques. Which tags are the best predictors of each of the response variables?
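
As a rough illustration of how LASSO prunes variables, the sketch below fits scikit-learn's `Lasso` to synthetic data in which only two of twenty binary "tags" influence the response. The data, the column names and the value `alpha=0.1` are all assumptions made for this example; they are not taken from the TED dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_talks, n_tags = 500, 20

# Synthetic binary tag indicators (hypothetical names tag_0 ... tag_19)
X = pd.DataFrame(
    rng.integers(0, 2, size=(n_talks, n_tags)),
    columns=[f"tag_{i}" for i in range(n_tags)],
)

# Only tag_0 and tag_1 affect the response; the remaining tags are pure noise
y = 5.0 * X["tag_0"] - 3.0 * X["tag_1"] + rng.normal(0.0, 1.0, size=n_talks)

# The L1 penalty shrinks coefficients and sets uninformative ones exactly to zero
lasso = Lasso(alpha=0.1)
lasso.fit(X, y)

coef = pd.Series(lasso.coef_, index=X.columns)
selected = coef[coef != 0].sort_values(ascending=False)
print(selected)
```

The tags that survive with non-zero coefficients are the selected predictors; the sign of a coefficient tells whether the tag is associated with a higher or a lower response. In practice the penalty strength is usually chosen by cross-validation, for example with `LassoCV`.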

In [None]:
# Use this cell for your code
from sklearn.linear_model import LassoCV
import matplotlib.pyplot as plt

selected_features = {}

# Run the same LASSO selection for each response, using only the tag indicators
X_tags = ted_data[tag_columns]
targets = {'Views': 'views', 'Funny': 'Funny', 'Confusing': 'Confusing', 'Inspiring': 'Inspiring'}

for i, (label, column) in enumerate(targets.items(), start=1):
    print(f"\n{i}. {label} Prediction:")
    lasso = LassoCV(cv=5, random_state=42, max_iter=10000)
    lasso.fit(X_tags, ted_data[column])

    # Tags with a non-zero coefficient are the ones LASSO kept
    coef = pd.Series(lasso.coef_, index=tag_columns)
    selected = coef[coef != 0].sort_values(ascending=False)
    selected_features[label] = selected

    # Split by sign so that the two printed lists cannot overlap
    positive = selected[selected > 0]
    negative = selected[selected < 0].sort_values()

    print(f"  Best alpha: {lasso.alpha_:.4f}")
    print(f"  Number of selected tags: {len(selected)}")
    print("  Top 10 positive predictors:")
    for tag, c in positive.head(10).items():
        print(f"    {tag[len('tag_'):]}: {c:.2f}")
    if len(negative) > 0:
        print("  Top 5 negative predictors:")
        for tag, c in negative.head(5).items():
            print(f"    {tag[len('tag_'):]}: {c:.2f}")

5. Produce summaries of your results. Could you recommend good tags – or tags to avoid! – for speakers targeting plenty of views and/or certain ratings?

## Summary and Recommendations

In the LASSO results, tags with positive coefficients are associated with more views or with higher counts of a given rating, and tags with negative coefficients with fewer. These are associations, not causal effects, so any recommendation is tentative. A speaker aiming for many views or for a particular rating could favor the tags listed among the top positive predictors for that response in the output above, and avoid the tags listed among its negative predictors. The specific tags should be read from the printed output rather than assumed in advance, because the selected set depends on the data and on the cross-validated alpha.

**Remember to submit your code on the MOOC platform. You can return this Jupyter notebook (.ipynb) or .py, .R, etc depending on your programming preferences.**

## Exercise 2 | Symbol classification (part 2)

Python is strongly recommended for this exercise. However, if you can find a suitable AutoML implementation for your favorite language (e.g., [here](http://h2o-release.s3.amazonaws.com/h2o/master/3888/docs-website/h2o-docs/automl.html) seems to be one for R), you are free to use that language as well.

Use the preprocessed data from week 3 (you can also produce them using the example solutions of week 3).

1. This time train a *random forest classifier* on the data. A random forest is a collection of *decision trees*, which makes it an *ensemble* of classifiers. Each tree uses a random subset of the features to make its prediction. Without tuning any parameters, what accuracy do you get?

In [None]:
# Use this cell for your code

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

try:
    X_train_symbols = pd.read_csv('X_train_symbols.csv')
    X_test_symbols = pd.read_csv('X_test_symbols.csv')
    y_train_symbols = pd.read_csv('y_train_symbols.csv').values.ravel()
    y_test_symbols = pd.read_csv('y_test_symbols.csv').values.ravel()
    
    print("Data loaded successfully!")
    print(f"Training set size: {X_train_symbols.shape}")
    print(f"Test set size: {X_test_symbols.shape}")
    
    rf_classifier = RandomForestClassifier(random_state=42)
    rf_classifier.fit(X_train_symbols, y_train_symbols)
    
    y_pred = rf_classifier.predict(X_test_symbols)
    
    accuracy = accuracy_score(y_test_symbols, y_pred)
    
    print(f"\nRandom Forest Classifier (default parameters):")
    print(f"Number of trees: {rf_classifier.n_estimators}")
    print(f"Test Accuracy: {accuracy:.4f} ({accuracy*100:.2f}%)")
    
    print("\nClassification Report:")
    print(classification_report(y_test_symbols, y_pred))
    
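except FileNotFoundError:
    # The later cells need these variables, so say clearly what is missing
    print("Preprocessed week 3 files not found (expected X_train_symbols.csv, "
          "X_test_symbols.csv, y_train_symbols.csv, y_test_symbols.csv). "
          "Produce them with the week 3 solutions before running this cell.")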

2. The number of trees in the random forest is an example of a hyperparameter, because it is a parameter that is set prior to the learning process. In contrast, a parameter is a value in the model that is learned from the data. Train 20 classifiers with varying numbers of decision trees, from 10 up to 200, and plot the test accuracy as a function of the number of trees. Does the accuracy keep increasing? Is more better?
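
The distinction between hyperparameters and parameters can be seen directly on a scikit-learn estimator. The following is a minimal sketch on synthetic data (the data set and the choice of 25 trees are assumptions for illustration only): `n_estimators` is known before `fit` is called, whereas the trees themselves exist only after training.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic classification problem, used only for illustration
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

rf = RandomForestClassifier(n_estimators=25, random_state=0)

# Hyperparameter: chosen by us and available before any training happens
n_trees_requested = rf.get_params()["n_estimators"]
fitted_before_training = hasattr(rf, "estimators_")

rf.fit(X, y)

# Learned from the data: the fitted trees and their internal split structure
n_fitted_trees = len(rf.estimators_)
n_nodes_first_tree = rf.estimators_[0].tree_.node_count

print(n_trees_requested, fitted_before_training, n_fitted_trees, n_nodes_first_tree)
```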

In [None]:
# Use this cell for your code
import matplotlib.pyplot as plt

n_trees_range = range(10, 201, 10)
accuracies = []

for n_trees in n_trees_range:
    rf = RandomForestClassifier(n_estimators=n_trees, random_state=42)
    
    rf.fit(X_train_symbols, y_train_symbols)
    y_pred = rf.predict(X_test_symbols)
    
    acc = accuracy_score(y_test_symbols, y_pred)
    accuracies.append(acc)
    
    print(f"Trees: {n_trees:3d} | Test Accuracy: {acc:.4f} ({acc*100:.2f}%)")


plt.figure(figsize=(12, 6))
plt.plot(n_trees_range, accuracies, marker='o', linewidth=2, markersize=6)
plt.xlabel('Number of Trees', fontsize=12)
plt.ylabel('Test Accuracy', fontsize=12)
plt.title('Random Forest Test Accuracy vs Number of Trees', fontsize=14)
plt.grid(True, alpha=0.3)
plt.xticks(n_trees_range)
plt.ylim([min(accuracies) - 0.01, max(accuracies) + 0.01])

best_idx = np.argmax(accuracies)
best_n_trees = list(n_trees_range)[best_idx]
best_accuracy = accuracies[best_idx]
plt.axhline(y=best_accuracy, color='r', linestyle='--', alpha=0.5, label=f'Best: {best_accuracy:.4f}')
plt.legend()

plt.tight_layout()
plt.show()

3. If we had picked the number of decision trees by taking the value with the best test accuracy from the last plot, we would have *overfit* our hyperparameters to the test data. Can you see why it is a mistake to tune hyperparameters of your model by using the test data?

## Why Tuning Hyperparameters on Test Data is a Mistake

Tuning hyperparameters using the test set causes the model to overfit to that specific test data, making the performance estimate overly optimistic and not representative of true generalization. The test set should only be used once at the very end for final evaluation, which is why we need a separate validation set for hyperparameter tuning.
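
The optimism can be made concrete with a small simulation. This is a sketch under invented assumptions: the labels are random, and each of the 50 hypothetical "candidate models" simply thresholds one pure-noise feature, so no candidate can truly do better than 50%. Picking the candidate with the best test accuracy still yields a score clearly above 50% on that test set, while the same candidate falls back to about 50% on fresh data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_candidates = 50

# Labels are independent of the features, so the true accuracy of any candidate is 0.5
X_test = rng.normal(size=(200, n_candidates))
y_test = rng.integers(0, 2, size=200)
X_fresh = rng.normal(size=(20_000, n_candidates))
y_fresh = rng.integers(0, 2, size=20_000)

def accuracy(X, y, j):
    """Accuracy of the candidate that predicts 1 whenever feature j is positive."""
    return np.mean((X[:, j] > 0).astype(int) == y)

# "Tuning" on the test set: keep the candidate with the best test accuracy
test_accs = [accuracy(X_test, y_test, j) for j in range(n_candidates)]
best_j = int(np.argmax(test_accs))
best_test_acc = test_accs[best_j]

# Honest estimate: evaluate the chosen candidate on data it was not selected on
fresh_acc = accuracy(X_fresh, y_fresh, best_j)

print(f"Accuracy on the test set used for selection: {best_test_acc:.3f}")
print(f"Accuracy of the same candidate on fresh data: {fresh_acc:.3f}")
```

The gap between the two numbers is entirely due to selection, which is why the final evaluation needs data that played no part in choosing the model.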

4. Reshuffle and resplit the data so that it is divided into 3 parts: training (80%), validation (10%) and test (10%). Repeatedly train a model of your choosing (e.g., a random forest) on the training data and evaluate its performance on the validation set, tuning the hyperparameters so that the accuracy on the validation set increases. Then, finally, evaluate the performance of your model on the test data. What can you say about the generalization of your model?

In [None]:
# Use this cell for your code
from sklearn.model_selection import train_test_split

X_all = pd.concat([X_train_symbols, X_test_symbols], axis=0)
y_all = np.concatenate([y_train_symbols, y_test_symbols])

X_train_new, X_temp, y_train_new, y_temp = train_test_split(
    X_all, y_all, test_size=0.2, random_state=42, stratify=y_all
)

X_val, X_test_new, y_val, y_test_new = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp
)

n_estimators_options = [50, 100, 150, 200]
max_depth_options = [10, 20, 30, None]
min_samples_split_options = [2, 5, 10]

best_val_accuracy = 0
best_params = {}
tuning_results = []


for n_est in n_estimators_options:
    for max_d in max_depth_options:
        for min_split in min_samples_split_options:
            rf = RandomForestClassifier(
                n_estimators=n_est,
                max_depth=max_d,
                min_samples_split=min_split,
                random_state=42
            )
            rf.fit(X_train_new, y_train_new)
            
            val_accuracy = accuracy_score(y_val, rf.predict(X_val))
            
            tuning_results.append({
                'n_estimators': n_est,
                'max_depth': max_d,
                'min_samples_split': min_split,
                'val_accuracy': val_accuracy
            })
            
            if val_accuracy > best_val_accuracy:
                best_val_accuracy = val_accuracy
                best_params = {
                    'n_estimators': n_est,
                    'max_depth': max_d,
                    'min_samples_split': min_split
                }

print(f"\nBest hyperparameters found:")
print(f"  n_estimators: {best_params['n_estimators']}")
print(f"  max_depth: {best_params['max_depth']}")
print(f"  min_samples_split: {best_params['min_samples_split']}")
print(f"  Validation accuracy: {best_val_accuracy:.4f}")

final_model = RandomForestClassifier(
    n_estimators=best_params['n_estimators'],
    max_depth=best_params['max_depth'],
    min_samples_split=best_params['min_samples_split'],
    random_state=42
)
final_model.fit(X_train_new, y_train_new)

train_accuracy = accuracy_score(y_train_new, final_model.predict(X_train_new))
val_accuracy = accuracy_score(y_val, final_model.predict(X_val))
test_accuracy = accuracy_score(y_test_new, final_model.predict(X_test_new))

print(f"\nFinal model performance:")
print(f"  Training accuracy:   {train_accuracy:.4f} ({train_accuracy*100:.2f}%)")
print(f"  Validation accuracy: {val_accuracy:.4f} ({val_accuracy*100:.2f}%)")
print(f"  Test accuracy:       {test_accuracy:.4f} ({test_accuracy*100:.2f}%)")

print(f"\nGeneralization analysis:")
if abs(train_accuracy - test_accuracy) < 0.05:
    print("  The model generalizes well - training and test accuracies are similar.")
elif train_accuracy > test_accuracy + 0.1:
    print("  The model shows signs of overfitting - training accuracy is much higher than test accuracy.")
else:
    print("  The model shows reasonable generalization with slight overfitting.")
    
print(f"  Gap between train and test: {(train_accuracy - test_accuracy)*100:.2f}%")

print("\nConfusion Matrix (Test Set):")
cm = confusion_matrix(y_test_new, final_model.predict(X_test_new))
print(cm)

**Remember to submit your code on the MOOC platform. You can return this Jupyter notebook (.ipynb) or .py, .R, etc depending on your programming preferences.**

## Exercise 3 | TPOT

The process of picking a suitable model, evaluating its performance and tuning the hyperparameters is very time consuming. A new idea in machine learning is the concept of automating this by using an optimization algorithm to find the best model in the space of models and their hyperparameters. Have a look at [TPOT](https://github.com/EpistasisLab/tpot), an automated ML solution that finds a good model and a good set of hyperparameters automatically. Try it on this data; it should easily outperform simple models like the ones we tried. Note that running the algorithm might take a while, depending on the strength of your computer.

*Note*: If it runs for too long, check whether the parameters you pass to TPOT are reasonable, e.g. try reducing the number of ‘generations’ or the ‘population_size’. TPOT uses cross-validation internally, so we don’t need our own validation set.
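
To see why cross-validation removes the need for a separate validation set, the sketch below runs 5-fold cross-validation with scikit-learn on synthetic data. It does not use TPOT and makes no claim about TPOT's internals; the data set and the random forest settings are assumptions chosen for illustration. Each sample is used for validation exactly once and for training in the other folds.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the training data
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0)

# 5 folds: train on 4/5 of the data, validate on the remaining 1/5, rotate
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")

print("Fold accuracies:", scores)
print("Mean validation accuracy:", scores.mean())
```

A held-out test set is still needed for the final evaluation, since the cross-validated score is itself used to choose between candidate models.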

In [None]:
# Use this cell for your code
from tpot import TPOTClassifier
tpot = TPOTClassifier(
    generations=5,          
    population_size=20,     
    cv=5,                    
    random_state=42,
    verbosity=2,            
    n_jobs=-1,              
    max_time_mins=10,       
    max_eval_time_mins=2    
)

tpot.fit(X_train_new, y_train_new)

test_score = tpot.score(X_test_new, y_test_new)
print(f"TPOT test accuracy: {test_score:.4f}")
tpot.export('tpot_best_pipeline.py')

print(tpot.fitted_pipeline_)

if test_score > test_accuracy:
    print("\nTPOT found a better model through automated search!")
else:
    print("\nThe manual Random Forest performed similarly or better.")

**Remember to submit your code on the MOOC platform. You can return this Jupyter notebook (.ipynb) or .py, .R, etc depending on your programming preferences.**