# Rule-based Classifier for Crisis Response

Run the cells below every time you open the notebook.

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from lib import lib 

## 1. Load and inspect the data

Load the Twitter data. Uncomment (using `ctrl+/`) the line for the dataset you want to use. The variable `tweets` is a list of tweets.

In [None]:
tweets, test_tweets = lib.read_sandy_data(train_path='../data/pd_labeled-data-singlelabels-train.csv', test_path='../data/pd_labeled-data-singlelabels-test.csv')
# tweets, test_tweets = lib.read_haiti_data(train_path='../data/pd_haiti_train.csv', test_path='../data/pd_haiti_test.csv')

### Mini-exercises: Let's inspect the data!

1.) Print the number of tweets in the list.

In [None]:
#### YOUR CODE STARTS HERE ####

#### YOUR CODE ENDS HERE ####

2.) Assign the 11th tweet to the variable "tweet" and print it. 

*Hint: Remember that Python lists are 0-indexed! :)*
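
As a quick illustration of zero-based indexing, here is a small sketch on a toy list (the list `letters` is made up for this example and is not part of the tweet data):

```python
# A toy list, used only to illustrate zero-based indexing.
letters = ["a", "b", "c", "d"]

first = letters[0]   # index 0 is the 1st item
third = letters[2]   # index 2 is the 3rd item

print(len(letters), first, third)
```

The same counting applies to any Python list: the item at position `n` lives at index `n-1`.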

In [None]:
#### YOUR CODE STARTS HERE ####

#### YOUR CODE ENDS HERE ####

3.) To view the category of a tweet, we can access the attribute `tweet.category`. Print the category of the 11th tweet. 

*Hint: use the variable `tweet` assigned above.*

In [None]:
#### YOUR CODE STARTS HERE ####

#### YOUR CODE ENDS HERE ####

The function `lib.show_tweets` prints out a table containing all the tweets, along with their category labels.

In [None]:
lib.show_tweets(tweets)

### Extension exercise: find the most common hashtags in the dataset

No need to add code here! Just take a look!

In [None]:
from collections import Counter
hashtags = Counter()

for t in tweets:
    for idx, token in enumerate(t):
        if token=="#":
            hashtags[t[idx+1]] += 1

print(hashtags.most_common(10))

## 2. Python refresher

First, let's do some exercises to refresh our memory of a few Python concepts.

### Functions
A Python function is written like this:
```
def add_one(x):
    return x+1
```
The name of the function is `add_one`, `x` is the input variable, and the `return` keyword tells us what to give as output.
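
To see the function in action, you call it by name and pass a value for `x`. A minimal sketch (the variable name `result` is just for illustration):

```python
def add_one(x):
    return x + 1

# Call the function with x = 4 and store what it returns.
result = add_one(4)
print(result)
```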

1.) Define a function called `square_minus_1` that takes one variable (`x`), squares it, subtracts 1, and returns the result.

In [None]:
#### YOUR CODE STARTS HERE ####

#### YOUR CODE ENDS HERE ####

print("Testing:")
for x in [3,-4,6.5,0]:
    print(str(x), " -> ", str(square_minus_1(x)), end=' ')
    print("CORRECT" if square_minus_1(x)==(x**2-1) else "INCORRECT")

### If-else statements

An if/else statement looks like this:

```
if electoral_votes >= 270:
    print("You win the election")
else:
    print("You lose the election")
```

The condition (`electoral_votes >= 270`) is evaluated; if it's true, the code under the `if` is executed, and if it's false, the code under the `else` is executed.
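
An if/else statement is often placed inside a function so it can be reused with different inputs. Here is one possible sketch; the function name `election_result` is hypothetical, and it returns the message instead of printing it:

```python
def election_result(electoral_votes):
    # The condition is checked once; exactly one branch runs.
    if electoral_votes >= 270:
        return "You win the election"
    else:
        return "You lose the election"

print(election_result(300))
print(election_result(269))
```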

2.) Define a function called `contains_ss` that takes one variable (`word`) and returns `True` if the word contains a double-s and `False` if it doesn't.

*Hint: to test whether a string e.g. "ss" is inside another string variable e.g. word, you can use `"ss" in word`.*
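
To get a feel for the `in` operator before using it, here is a small sketch on a different substring (the words and variable names are made up for illustration). Note that the check is case-sensitive:

```python
# `in` returns True or False depending on whether the left string
# appears anywhere inside the right string.
found = "oo" in "book"
not_found = "oo" in "desk"
wrong_case = "OO" in "book"   # upper case does not match lower case

print(found, not_found, wrong_case)
```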

In [None]:
#### YOUR CODE STARTS HERE ####


#### YOUR CODE ENDS HERE ####

print("Testing:")
for word in ["computer", "science", "lesson"]:
    print("{:s} -> {}".format(word, contains_ss(word)), end=' ')
    print("CORRECT" if contains_ss(word)==("ss" in word) else "INCORRECT")

### More complex if-else statements

Maybe you want to check *several* conditions? You can use an if/elif/else statement.

```
if teamA_score > teamB_score:
    print("Team A wins")
elif teamA_score < teamB_score:
    print("Team B wins")
else:
    print("It's a tie!")
```

`elif` stands for "else if". In fact, the above code is just a neater way of writing this:
```
if teamA_score > teamB_score:
    print("Team A wins")
else:
    if teamA_score < teamB_score:
        print("Team B wins")
    else:
        print("It's a tie!")
```

You can have as many `elif` statements as you like. These are useful when you want to choose between several options.
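
To check that the two versions above really behave the same, here is a sketch that wraps each one in a function and compares them on a few scores. The function names `result_elif` and `result_nested` are hypothetical, and they return the message instead of printing it:

```python
def result_elif(teamA_score, teamB_score):
    # Flat version using elif.
    if teamA_score > teamB_score:
        return "Team A wins"
    elif teamA_score < teamB_score:
        return "Team B wins"
    else:
        return "It's a tie!"

def result_nested(teamA_score, teamB_score):
    # Nested version: an if/else inside the else branch.
    if teamA_score > teamB_score:
        return "Team A wins"
    else:
        if teamA_score < teamB_score:
            return "Team B wins"
        else:
            return "It's a tie!"

for a, b in [(3, 1), (0, 2), (2, 2)]:
    print(a, b, result_elif(a, b), result_nested(a, b))
```

Only the first branch whose condition is true runs; later branches are skipped even if their conditions would also be true.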

3.) Define a function called `grade` that takes one input (`score`).

If score >= 90, return the string "A".

Otherwise, if score >= 80, return the string "B".

Otherwise, if score >= 70, return the string "C".

Otherwise, if score >= 60, return the string "D".

Otherwise, if score >= 50, return the string "E".

Otherwise, return the string "F".

In [None]:
#### YOUR CODE STARTS HERE ####


#### YOUR CODE ENDS HERE ####

print("Testing:")
for (score,g) in [(77,"C"),(80,"B"),(32,"F"),(100,"A"),(69,"D")]:
    print("%d -> %s" % ((score, grade(score))), end=' ')
    print("CORRECT" if grade(score)==g else "INCORRECT")

## 3. Write a rule-based tweet classifier

Time to write our rule-based classifier!
The function outline below uses an `if/elif/else` statement to return the predicted category of a tweet.

Fill in the missing `if` and `elif` statements with something sensible (there is no one right answer)!

Start with something simple; we'll build it into something more complicated later.

*Hint: Search for a keyword in the tweet. What case should your keyword have?*
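
As a sketch of what a keyword check can look like, here is an example for a made-up category that is not one of the labels used in this notebook. The function name `mentions_shelter`, the keyword, and the example strings are all hypothetical:

```python
def mentions_shelter(text):
    # Lower-case the text first, so the keyword only needs one spelling.
    text = str(text).lower()
    return "shelter" in text

print(mentions_shelter("SHELTER open on 5th Ave"))
print(mentions_shelter("Roads are flooded"))
```

Because the text is converted to lower case before the check, the keyword itself should be written in lower case too.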

In [None]:
def classify_rb(tweet):
    
  tweet = str(tweet).lower() # this makes the tweet lower-case, so we don't have to worry about matching case

  if ________:
    return "Medical"
  elif ________:
    return "Energy"
  elif ________:
    return "Water"
  elif ________:
    return "Food"
  else:
    return "None"

## 4. Test your rule-based classifier on some examples

Run the cell below to see the results of your rule-based classifier. 
You should see a table showing each tweet, along with its true category and the category predicted by your system.

Which types of tweets does your system get right? Which types of tweets does your system get wrong and why? How would you measure the accuracy of your system (this will be the topic of next class! :) )?
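
The next cell builds its list of (tweet, prediction) pairs with a list comprehension. As a refresher, here is a sketch of the same pattern on toy strings, next to the equivalent for-loop. The function `shout` is a hypothetical stand-in for a classifier, and `words` is made-up data:

```python
def shout(text):
    # Hypothetical stand-in for a classifier: returns the text in upper case.
    return text.upper()

words = ["power", "water", "food"]

# List comprehension: build the list of pairs in one line.
pairs_comprehension = [(w, shout(w)) for w in words]

# Equivalent for-loop: start with an empty list and append each pair.
pairs_loop = []
for w in words:
    pairs_loop.append((w, shout(w)))

print(pairs_comprehension == pairs_loop)
```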

In [None]:
# python syntax: list comprehension vs for-loop
predictions = [(tweet, classify_rb(tweet)) for tweet in test_tweets] # a list of (tweet, prediction) pairs

lib.show_predictions(predictions, show_mistakes_only=False)

## 5. Break your rule-based classifier!

It's time to FOOL THE RULES!

You'll be deliberately trying to break your classifier by writing tricky tweets that cause your classifier to predict the wrong category. 

Once your own classifier has been fooled by a tricky tweet, it's your job to amend the rules in your classifier to account for the new case.

In [None]:
def classify_rb_game(tweet):
    
    # TODO: Copy the body of the function classify_rb from above and paste it below. When your classifier gets
    # a tweet wrong, add a new rule so it will be correct.
    
    #### YOUR CODE STARTS HERE ###

    #### YOUR CODE ENDS HERE ####

### Write a tweet about Food that will be misclassified

Below, write a disaster-scenario tweet about Food that the classification function above will get wrong (i.e. fail to recognize it's about food).

*Hint: think of less-obvious food-related keywords that aren't included in the rule-based system above.*

Then run the cell - make sure the tweet is classified as something other than Food!

In [None]:
food_tweet = ""
print("This tweet is classified as: {:s}\n".format(classify_rb_game(food_tweet)))

### Write a tweet about Energy that will be misclassified

In [None]:
energy_tweet = ""
print("This tweet is classified as: {:s}\n".format(classify_rb_game(energy_tweet)))

### Write a tweet about Water that will be misclassified

In [None]:
water_tweet = ""
print("This tweet is classified as: {:s}\n".format(classify_rb_game(water_tweet)))

### Write a tweet about Medical that will be misclassified

In [None]:
medical_tweet = ""
print("This tweet is classified as: {:s}\n".format(classify_rb_game(medical_tweet)))

### Write a tweet NOT about Food, that will be falsely classified as Food

Below, write a disaster-scenario tweet that is NOT about Food, but that the classifier above will classify as Food.

*Hint: you want to trick the classifier into thinking you're talking about food when you're not. Look at the keywords the rule-based system associates with food. Can you find a way to use them while actually talking about not-food?* 

* For example, if the system looks for the word "food" you could write "Waiting out #Sandy by reading Plato. Food for thought."
* If the system looks for the word "cook", you could write "I hear the power's out in Cook County."
* More simply, you could mention food incidentally but the real subject of the tweet is something else e.g. "Was out food shopping when I heard about the power outage on the news. Hope everyone's OK."
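
To see why these examples work, here is a sketch of a simple keyword rule applied to the three tweets above. The rule `looks_like_food` and its keyword list are assumptions made for illustration; your own classifier may use different keywords:

```python
# Hypothetical keyword list; your classifier's rules may differ.
FOOD_KEYWORDS = ["food", "cook"]

def looks_like_food(text):
    # Lower-case the text, then check whether any keyword appears in it.
    text = text.lower()
    return any(keyword in text for keyword in FOOD_KEYWORDS)

tricky_tweets = [
    "Waiting out #Sandy by reading Plato. Food for thought.",
    "I hear the power's out in Cook County.",
    "Was out food shopping when I heard about the power outage on the news. Hope everyone's OK.",
]

for t in tricky_tweets:
    print(looks_like_food(t), "<-", t)
```

A keyword rule only checks whether the letters appear in the text; it has no way of knowing what the tweet is actually about.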

Then run the cell - make sure the tweet is classified as Food!

In [None]:
not_food_tweet = ""
print("This tweet is classified as: {:s}\n".format(classify_rb_game(not_food_tweet)))

### Write a tweet NOT about Energy, that will be falsely classified as Energy

In [None]:
not_energy_tweet = ""
print("This tweet is classified as: {:s}\n".format(classify_rb_game(not_energy_tweet)))

### Write a tweet NOT about Water, that will be falsely classified as Water

In [None]:
not_water_tweet = ""
print("This tweet is classified as: {:s}\n".format(classify_rb_game(not_water_tweet)))

### Write a tweet NOT about Medical, that will be falsely classified as Medical

In [None]:
not_medical_tweet = ""
print("This tweet is classified as: {:s}\n".format(classify_rb_game(not_medical_tweet)))

## 6. Bonus: understand the library functions! :D  
In `lib.py`, read through the following functions we used above: 
  * `read_sandy_data` / `read_haiti_data`
  * `show_predictions`
  * `show_tweets`