Please carefully follow these instructions:

Please open the graph view by clicking on the icon <img src="img/graph.svg" width="15"> in the menu on the left.

This notebook includes additional interface features that support exploratory programming:

- Switching between alternative code versions

- Collapsing and zooming into code sections

- Graph-based navigation

You are encouraged to use all available features to help you understand and modify the notebook.

Only complete the exercises under markdown cells whose heading starts with "Task". These are the parts where you are expected to do something (e.g. debug or modify code).

Please try to complete the tasks to the best of your ability, but don’t worry if you don’t know everything.

Once you're done, proceed to the next notebook: 3 - Post-survey.

Good luck!

In [1]:
# Please execute this code cell to download the stopwords
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

# Exploring Tweet Cleaning and Notebook Structure

This notebook uses a dataset of tweets labeled as either:
- **1**: The tweet describes a real disaster
- **0**: The tweet does not

The aim is to prepare these tweets for machine learning by cleaning the text and removing unnecessary tokens like stopwords.
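As a rough illustration of what "cleaning the text and removing stopwords" can look like, here is a minimal sketch. The stopword set `TOY_STOPWORDS`, the helper `toy_clean`, and the link in the example string are made up for this illustration; the notebook itself uses the full NLTK English stopword list and its own cleaning functions.

```python
import re

# Hypothetical mini stopword list, for illustration only; the notebook
# itself relies on the much larger NLTK English list.
TOY_STOPWORDS = {"the", "is", "a", "of", "near", "this", "are", "our"}

def toy_clean(text):
    """Strip URLs and non-letters, lowercase, then drop the toy stopwords."""
    text = re.sub(r"http\S+", "", text)       # remove links
    text = re.sub(r"[^a-zA-Z\s]", "", text)   # keep letters and whitespace only
    words = text.lower().split()
    return " ".join(w for w in words if w not in TOY_STOPWORDS)

# The link below is invented for the example.
example = "Forest fire near La Ronge Sask. Canada http://t.co/xyz"
cleaned = toy_clean(example)
print(cleaned)
```

Which words count as "unnecessary" is a design choice: a larger or smaller stopword set changes what survives cleaning, which is exactly what the versions in this notebook differ on.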

A previous analyst worked on this notebook and tried several approaches. Your job is to explore what has already been done.

You will encounter:
- Multiple versions of similar code
- Possibly unused or inconsistent cells
- Potential issues that require debugging

Focus on understanding the structure, not just running code.


# Part 1 – Cleaning Text and Understanding the Notebook

In this section, you'll explore different approaches the analyst used to clean tweet text.

Your tasks:
1. Understand what each cleaning function is doing
2. Determine which version(s) were actually used later
3. Identify code that was defined but never used
4. Pay attention to the data flow; some mistakes may be subtle


In [2]:
import pandas as pd

# Load dataset
train = pd.read_csv("data/tweets.csv")
train.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [3]:
# Show some tweets
train[['text', 'target']].sample(10, random_state=42)

Unnamed: 0,text,target
683,Morgan Silver Dollar 1880 S Gem BU DMPL Cameo ...,0
7444,Help yourself or those you love who suffer fro...,0
5803,@BLutz10 But the rioting began prior to the de...,1
2484,If the Taken movies took place in India 2 (Vin...,0
4279,Longest Streak of Triple-Digit Heat Since 2013...,1
6973,I want some tsunami take out,0
2929,New and now: Different (FNaF fanfiction): Trix...,0
6213,[55436] 1950 LIONEL TRAINS SMOKE LOCOMOTIVES W...,0
4086,Hail Mary Full of Grace The Lord is with thee....,0
7488,act my age was a MESS everyone was so wild it ...,0


In [4]:
import re

from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

In [a1]:
"""
Version 2 – Alternative Cleaning Strategy (clean_text_v2)

This version appears to modify the stopword list to preserve certain words.

Compare the output to other versions and consider:
- What is this version trying to do?
- Does it behave as expected?
"""

pronouns = {'i', 'you', 'we', 'they', 'he', 'she', 'me', 'us', 'them'}

def clean_text_v2(text):
    if pd.isnull(text):
        return ""

    text = re.sub(r"http\S+", "", text)
    text = re.sub(r"[^a-zA-Z\s]", "", text)
    words = text.lower().split()

    # Tries to define a filtered stopword list
    custom_stopwords = stop_words.intersection(pronouns)

    # Apply stopword filter (but this list now includes only pronouns)
    words = [word for word in words if word not in custom_stopwords]

    return " ".join(words)

train['cleaned_v2'] = train['text'].apply(clean_text_v2)
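
When reasoning about a line such as the `custom_stopwords` assignment above, it may help to compare set intersection with set difference on small sets. The sketch below uses made-up stand-ins (`toy_stop_words`, `toy_pronouns`) rather than the real NLTK list, so the exact words are an assumption.

```python
# Toy stand-ins; the real stop_words set comes from NLTK and is much larger.
toy_stop_words = {"i", "you", "the", "of", "and"}
toy_pronouns = {"i", "you", "we"}

# Intersection: only the words present in BOTH sets.
both = toy_stop_words & toy_pronouns

# Difference: the stopwords with the pronouns taken out.
without_pronouns = toy_stop_words - toy_pronouns

words = "i want some of the cake".split()

# Filtering with the intersection removes the pronouns and keeps the rest.
filtered_with_both = [w for w in words if w not in both]

# Filtering with the difference keeps the pronouns and removes the rest.
filtered_with_difference = [w for w in words if w not in without_pronouns]

print(sorted(both), filtered_with_both)
print(sorted(without_pronouns), filtered_with_difference)
```

The two operations produce opposite filters, so it is worth checking which one matches the stated intent of a cleaning function.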


In [8]:
train[['text', 'cleaned_v1', 'cleaned_v2', 'cleaned_v3']].sample(10, random_state=42)


Unnamed: 0,text,cleaned_v1,cleaned_v2,cleaned_v3
683,Morgan Silver Dollar 1880 S Gem BU DMPL Cameo ...,morgan silver dollar gem bu dmpl cameo rev bla...,morgan silver dollar s gem bu dmpl cameo rev b...,morgan silver dollar gem bu dmpl cameo rev bla...
7444,Help yourself or those you love who suffer fro...,help love suffer selfesteem wounds today,help yourself or those love who suffer from se...,help you love suffer selfesteem wounds you today
5803,@BLutz10 But the rioting began prior to the de...,blutz rioting began prior decision indictment ...,blutz but the rioting began prior to the decis...,blutz rioting began prior decision indictment ...
2484,If the Taken movies took place in India 2 (Vin...,taken movies took place india vine jusreign,if the taken movies took place in india vine b...,taken movies took place india vine jusreign
4279,Longest Streak of Triple-Digit Heat Since 2013...,longest streak tripledigit heat since forecast...,longest streak of tripledigit heat since forec...,longest streak tripledigit heat since forecast...
6973,I want some tsunami take out,want tsunami take,want some tsunami take out,i want tsunami take
2929,New and now: Different (FNaF fanfiction): Trix...,new different fnaf fanfiction trixiedrowned pa...,new and now different fnaf fanfiction trixiedr...,new different fnaf fanfiction trixiedrowned pa...
6213,[55436] 1950 LIONEL TRAINS SMOKE LOCOMOTIVES W...,lionel trains smoke locomotives magnetraction ...,lionel trains smoke locomotives with magnetrac...,lionel trains smoke locomotives magnetraction ...
4086,Hail Mary Full of Grace The Lord is with thee....,hail mary full grace lord thee blessed art tho...,hail mary full of grace the lord is with thee ...,hail mary full grace lord thee blessed art tho...
7488,act my age was a MESS everyone was so wild it ...,act age mess everyone wild fun videos wreck,act my age was a mess everyone was so wild it ...,act age mess everyone wild fun videos wreck


## Task 1 – Compare Cleaning Functions

Fill out the following table based on what you understand from the code above:

| Version        | Stopwords Removed | Pronouns Kept | Cleaned Column | Notes or Issues                    |
|----------------|-------------------|----------------|----------------|-------------------------------------|
| clean_text_v1  |                   |                | cleaned_v1     |                                     |
| clean_text_v2  |                   |                | cleaned_v2     |                                     |
| clean_text_v3  |                   |                | cleaned_v3     |                                     |

Tip: Don't guess from the output; read the code carefully.


# Part 2 – Vectorizing the Cleaned Text

The analyst began converting the cleaned tweets into numerical features using `TfidfVectorizer`. This step is common in text processing: it turns each tweet into a vector of word weights, where a word scores higher when it is frequent in that tweet but rare across the dataset.
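
To make the weighting idea concrete, here is a small pure-Python sketch of TF-IDF on three made-up documents. It uses the smoothed inverse document frequency described in the scikit-learn documentation, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization. Everything else (whitespace tokenization, dictionaries instead of sparse matrices, the helper names) is a simplification for illustration and not how `TfidfVectorizer` is implemented.

```python
import math
from collections import Counter

# Made-up toy documents, already cleaned and lowercased.
docs = ["forest fire near town", "fire drill today", "sunny day today"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each word appears.
df = Counter(word for doc in tokenized for word in set(doc))

def smoothed_idf(word):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + df[word])) + 1

def tfidf_vector(doc):
    """Return a word -> weight mapping, L2-normalized to unit length."""
    counts = Counter(doc)
    weights = {w: c * smoothed_idf(w) for w, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in weights.values()))
    return {w: v / norm for w, v in weights.items()}

first = tfidf_vector(tokenized[0])
print(first)
```

In the first document, "fire" also occurs in another document, so it receives a lower weight than "forest", which appears only once in the whole toy corpus.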

They attempted this in a few different ways. However, each attempt either raised errors or produced output that did not behave as expected later in the notebook.

Your task is to review these steps and identify what needs to be adjusted.



## Task 2 – Fix the Vectorization Pipeline

Review the steps taken to vectorize the cleaned tweet data.

Your goal is to make sure the vectorized output:
- Uses the correct cleaned text column
- Does not include rows where the cleaned text is empty
- Stays correctly aligned with the original dataset

You may want to check the number of rows in the vectorized output compared to the number of rows in the cleaned DataFrame.

Name your version `X_final`.
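
The alignment requirement can be pictured with plain Python lists. This is only a sketch on made-up data, with a stand-in "feature matrix" of word counts; it is not the notebook's solution, and all names here are hypothetical. The point is the order of operations: decide which rows to keep first, apply that decision to texts and labels together, and only then build features.

```python
# Toy data; the texts and labels are made up for illustration.
texts = ["forest fire", "", "flood warning", "   ", "nice day"]
labels = [1, 0, 1, 0, 0]

# Decide which rows to keep BEFORE building any features.
keep = [t.strip() != "" for t in texts]

# Apply the same mask to texts and labels so they stay paired.
kept_texts = [t for t, k in zip(texts, keep) if k]
kept_labels = [y for y, k in zip(labels, keep) if k]

# Stand-in "feature matrix": one row per kept text (here, just a word count).
features = [[len(t.split())] for t in kept_texts]

# Row counts agree because filtering happened before feature building.
print(len(features), len(kept_labels))
```

If the features were built from `texts` before filtering, there would be five feature rows against three labels, which is the kind of mismatch a row-count check reveals.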

To enter the raffle, please give "Enter" as an answer to "Questions?" after entering your email.


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()

# Bug: applies TF-IDF to uncleaned text instead of cleaned_v3
X_wrong = vectorizer.fit_transform(train['text'])

# This includes raw punctuation, casing, etc.
print(X_wrong.shape)


vectorizer = TfidfVectorizer()

# Applies to cleaned_v3, but dataset still contains empty rows
X_partial = vectorizer.fit_transform(train['cleaned_v3'])

# Inspect a slice of rows; some cleaned strings may be empty
print(train['cleaned_v3'].iloc[-25:-15])


# Applies vectorization first, THEN removes empty strings
X_mismatch = vectorizer.fit_transform(train['cleaned_v3'])

train = train[train['cleaned_v3'].str.strip() != ""]

# Row counts differ: X_mismatch was built before the empty rows were dropped
print(X_mismatch.shape, train.shape[0])


# Please write your solution underneath

X_final = ...

# Confirm shapes match
print("Shape of feature matrix:", X_final.shape)
print("Number of rows in DataFrame:", train.shape[0])

# Part 3 – Evaluating the Classifier

In this section, the analyst tried to evaluate the performance of a simple classifier on the tweet dataset.

However, the evaluation result does not seem correct. You might notice that:
- The accuracy is unusually high or low
- The code runs without error, but the numbers don’t make sense
- The prediction or evaluation is based on mismatched data


## Task 3 – Debug the Evaluation

Your goal is to fix the evaluation process so that it correctly shows how well the classifier performs.

Look out for:
- Whether training and testing data are correctly split
- Whether predictions are made on the correct data
- Whether accuracy is being calculated against the right labels

Make sure the final accuracy score reflects actual model performance on unseen data.
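
For reference, a hold-out evaluation generally follows the pattern below. This is a hedged sketch on made-up toy tweets, not the notebook's data or its intended solution; the variable names are hypothetical, and the split ratio and `random_state` are arbitrary choices.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Made-up toy data, for illustration only.
texts = [
    "forest fire spreading fast", "flood destroys homes", "earthquake hits city",
    "wildfire evacuation ordered", "storm damage reported", "tsunami warning issued",
    "love this sunny day", "great pizza tonight", "new phone arrived",
    "watching a movie", "coffee with friends", "my cat is funny",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Split first so the test rows stay unseen during fitting.
X_train_txt, X_test_txt, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# Fit the vectorizer on the training texts only, then reuse it on the test texts.
vec = TfidfVectorizer()
X_train = vec.fit_transform(X_train_txt)
X_test = vec.transform(X_test_txt)

clf = LogisticRegression()
clf.fit(X_train, y_train)

# Predict on the held-out rows and score against their own labels.
test_preds = clf.predict(X_test)
acc = accuracy_score(y_test, test_preds)
print("Held-out accuracy:", acc)
```

The key property is that predictions and labels come from the same held-out rows, and that neither the vectorizer nor the classifier has seen those rows during fitting. On a dataset this small the score itself is not meaningful.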


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Fit model
clf = LogisticRegression()
clf.fit(X_final, train['target'])

# Make predictions
preds = clf.predict(X_final)

# Compare to test labels
print("Accuracy:", accuracy_score(test['target'], preds))

After finishing this task, proceed to the Post-survey notebook.