## Homework 5 - VAST Challenge 2011 - Mini Challenge 1 

Authors: Suhas Keshavamurthy, Kriti Shrivastava 

Datasource: VAST Challenge 2011

http://hcil2.cs.umd.edu/newvarepository/VAST%20Challenge%202011/taskdescription-of-all2011challenges-printfromoriginalwebisteofchallenge.pdf

#### Task Description: 
Pick a VAST Challenge. This is to be the VAST Challenge for the final project.  Identify what ML algorithms you will need to run on the data. These could be one algorithm you will run several times or different algorithms. Explain your selection.  Find one or more similar example data set(s) anywhere on the web (check the data web sites the syllabus suggested) and run the algorithms. Identify which one you think will work on the VAST Challenge you selected. What visualizations do you think you need?

###### Bokeh Version used: 0.12.6

### Problem Description:

##### Mini Challenge 1  Characterization of an Epidemic Spread

Vastopolis is a major metropolitan area with a population of approximately two million residents. During the last few
days, health professionals at local hospitals have noticed a dramatic increase in reported illnesses. Observed
symptoms are largely flu-like and include fever, chills, sweats, aches and pains, fatigue, coughing, breathing difficulty,
nausea and vomiting, diarrhea, and enlarged lymph nodes. More recently, there have been several deaths believed
to be associated with the current outbreak. City officials fear a possible epidemic and are mobilizing emergency
management resources to mitigate the impact. The challenge is to provide an assessment of the
situation.

Two datasets are provided. The first one contains microblog messages collected from various devices with
GPS capabilities. These devices include laptop computers, handheld computers, and cellular phones. The second
one contains map information for the entire metropolitan area. The map dataset contains a satellite image with
labeled highways, hospitals, important landmarks, and water bodies. Supplemental tables for
population statistics and observed weather data are also made available.

**MC 1.1 Origin and Epidemic Spread:** Identify approximately where the outbreak started on the map (ground zero
location). If possible, outline the affected area. Explain how you arrived at your conclusion. (Short answer)

**MC 1.2 Epidemic Spread:** Present a hypothesis on how the infection is being transmitted. For example, is the
method of transmission person-to-person, airborne, waterborne, or something else? Identify the trends that support
your hypothesis. Is the outbreak contained? Is it necessary for emergency management personnel to deploy
treatment resources outside the affected area? Explain your reasoning. (Detailed answer)


### PART A
### ML Algorithms we plan to use and why

The primary source of data for this problem is the 'Twitter dataset'. The tweets provide an indication of the disease, time and location. The dataset consists of a huge number of tweets during a given time period. 
The first step is to filter the tweets related to the disease. Here we can use NLP techniques to identify the appropriate tweets by analyzing the messages. 

The NLP process:

*Step 1:*
Pre-process the data: remove stop words, strip extra whitespace and punctuation, lemmatize (reduce each word to its root form), and generate N-grams.
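The pre-processing step can be sketched roughly as follows. This is a minimal illustration using only the standard library: the stop-word list is a tiny made-up sample, `preprocess` is a hypothetical helper name, and lemmatization is left out (in practice a library lemmatizer would be added at the marked point).

```python
import string

# Tiny illustrative stop-word list (an assumption); a real run would use a fuller list.
STOP_WORDS = {"a", "an", "the", "is", "it", "i", "am", "so", "has", "oh", "and", "to", "of"}

def preprocess(message, n=2):
    """Lowercase, strip punctuation, drop stop words, and build word n-grams."""
    text = message.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    # split() with no arguments also collapses repeated whitespace
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    # (a lemmatizer would be applied to `tokens` here)
    ngrams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return tokens, ngrams

tokens, bigrams = preprocess("Benjamin has the flu oh the suffering...")
print(tokens)
print(bigrams)
```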

*Step 2:*
Supervised Learning: Next we will apply the Word2Vec technique to convert the text to features, using either the CBOW (continuous bag-of-words) or the skip-gram model. Before we can use these features for supervised learning, we need a labelled training set. The simplest way to build one for this dataset is to search the messages for a few common words that we know are relevant to our problem, such as 'flu', 'sick', 'well' (and the other epidemic-related keywords specified in the problem statement). We mark the matching messages as relevant and use them as the training dataset. We can then use the Word2Vec features to identify all messages related to the epidemic. We will need to compare CBOW and skip-gram to see which gives better results, although our initial guess is that skip-gram is better suited to our case.

Unsupervised Learning: We can also apply unsupervised learning techniques such as LDA (Latent Dirichlet Allocation) to group the messages into topics. The challenge here is estimating the number of topics, as the tweet content is going to be largely random.
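The keyword-based labelling used to build the training set could look roughly like this. It is only a sketch: the seed keywords and the helper name `label_by_keywords` are assumptions made for illustration, and the real keyword list would come from the symptoms in the challenge description.

```python
# Seed keywords are illustrative; the real list would come from the challenge text.
SEED_KEYWORDS = {"flu", "sick", "fever", "chills", "cough", "nausea"}

def label_by_keywords(messages, keywords=SEED_KEYWORDS):
    """Return (message, label) pairs; 'relevant' when any seed keyword appears as a token."""
    labelled = []
    for message in messages:
        # Strip common trailing punctuation so "flu..." still matches "flu"
        tokens = {tok.strip(".,!?").lower() for tok in message.split()}
        label = "relevant" if tokens & keywords else "not_relevant"
        labelled.append((message, label))
    return labelled

sample = [
    "I have a fever and chills",
    "Great concert tonight!",
    "Benjamin has the flu oh the suffering...",
]
labelled = label_by_keywords(sample)
for message, label in labelled:
    print(label, ":", message)
```

Exact token matching keeps the labels conservative; messages that only use related wording are left for the classifier to pick up.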
    
*Step 3:* 
Classification: We can then use a classification algorithm such as Random Forest, SVM or Logistic Regression on the outcome of the supervised learning step to identify relevant tweets. The accuracy needs to be calculated for each of these algorithms. If the accuracy is sufficiently high, we can proceed to select the relevant messages from the entire dataset.

The unsupervised approach is handled similarly. If the estimated term frequency and the overall term frequency are close to each other, we have higher confidence that the grouping generated by this model is accurate, and we can use the terms with high correlation to extract relevant tweets.
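To compare the classifiers, accuracy can be computed on a held-out labelled sample. The sketch below also reports precision and recall, which we assume will matter here because relevant tweets are likely to be a small share of the data. The labels and predictions are made-up values and `evaluate` is a hypothetical helper.

```python
def evaluate(true_labels, predicted, positive="relevant"):
    """Return (accuracy, precision, recall) for the given positive class."""
    pairs = list(zip(true_labels, predicted))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Made-up ground truth and predictions for five tweets
truth = ["relevant", "relevant", "not_relevant", "not_relevant", "relevant"]
preds = ["relevant", "not_relevant", "not_relevant", "relevant", "relevant"]
accuracy, precision, recall = evaluate(truth, preds)
print(accuracy, precision, recall)
```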
    
Once the relevant tweets are identified, we obtain the timestamp and location of each tweet, along with whether the person is sick or not. We assume that this location and time correspond to the location and time of the outbreak. While this is not necessarily true, over a large dataset it will help us identify the trends.
One example of a false positive is "Benjamin has the flu oh the suffering...". Here the person posting the tweet may not have the flu and is referring to a third party. The location of the tweet might also not be where Benjamin currently is.
With this information, we can perform a group-by operation on day, time, location and ID over the reduced, relevant dataset.
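A group-by over day and location could be sketched as below. The record layout (user ID, timestamp, latitude, longitude), the sample values, and the grid-cell size are all assumptions for illustration; the actual column names and coordinate ranges need to be checked against the dataset.

```python
from collections import Counter
from datetime import datetime

# Hypothetical records: (user_id, timestamp, latitude, longitude)
relevant = [
    (101, "2011-05-18 09:15", 42.22, 93.41),
    (102, "2011-05-18 10:40", 42.23, 93.42),
    (103, "2011-05-19 08:05", 42.29, 93.37),
]

def grid_cell(lat, lon, size=0.05):
    """Snap a coordinate to a coarse grid cell so nearby tweets share a key."""
    return (int(lat // size), int(lon // size))

counts = Counter()
for user_id, timestamp, lat, lon in relevant:
    day = datetime.strptime(timestamp, "%Y-%m-%d %H:%M").date().isoformat()
    counts[(day, grid_cell(lat, lon))] += 1

for key, n in sorted(counts.items()):
    print(key, n)
```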

This information by itself might be sufficient for some analysis.
We can group the location information (where the disease has occurred) and identify areas in Vastopolis where the epidemic is rampant. We apply clustering algorithms here. Since the number of clusters is not known in advance, we need algorithms such as DBSCAN or MeanShift, which determine the number of clusters themselves. We can also try hierarchical clustering algorithms, which do not require the number of clusters as input.
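To illustrate how a density-based method finds the number of clusters on its own, here is a minimal pure-Python DBSCAN sketch. It is written for clarity rather than speed (every neighbour query scans all points), the sample points and the `eps` / `min_pts` values are made up, and on the real dataset a library implementation would be the more practical choice.

```python
import math

def region_query(points, idx, eps):
    """Indices of all points within eps of points[idx] (including itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[idx], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point is a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)  # j is a core point, so keep expanding
    return labels

# Two tight groups of made-up locations plus one isolated point
points = [(1.0, 1.0), (1.1, 1.0), (1.0, 1.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
          (9.0, 1.0)]
labels = dbscan(points, eps=0.3, min_pts=3)
n_clusters = len(set(labels) - {-1})
print(labels, n_clusters)
```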

An additional approach is to track the location of all people (using the ID field). If, in a given area and time frame, the number of sick people (as identified above) is roughly proportionate to, or much smaller than, the number of people who are not sick, then we can infer that the epidemic is not spreading from person to person.

Now we can use the additional information provided in the input dataset to identify if other correlations exist for the spread of epidemic. 

1. We can check whether there is a correlation between wind direction and the spread of the epidemic. This might indicate airborne transmission of the disease.

2. It might be relevant to assess the severity of the disease by comparing population density with the number of reported cases. A correlation might indicate person-to-person transmission.

3. From the map of the city, we can visually check whether the river (water source) is a possible cause of the spread, by observing the locations of disease reports over the time-series data.
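One possible way to quantify the wind check in item 1 is to compare the day-to-day drift of the case centroid with the wind vector. This is a sketch under several assumptions: case locations are already in planar map units, the wind is given as the direction it blows toward, and the coordinates below are made up.

```python
import math

def centroid(points):
    """Mean x and mean y of a list of (x, y) points."""
    xs, ys = zip(*points)
    return (sum(xs) / len(xs), sum(ys) / len(ys))

def cosine_alignment(v, w):
    """Cosine of the angle between two 2-D vectors: 1 = same direction, -1 = opposite."""
    dot = v[0] * w[0] + v[1] * w[1]
    norm = math.hypot(*v) * math.hypot(*w)
    return dot / norm if norm else 0.0

# Made-up case locations on two consecutive days
day1 = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
day2 = [(2.0, 0.0), (3.0, 0.0), (2.0, 1.0), (3.0, 1.0)]
wind = (1.0, 0.0)  # assumed: wind blowing toward the east

c1, c2 = centroid(day1), centroid(day2)
drift = (c2[0] - c1[0], c2[1] - c1[1])
alignment = cosine_alignment(drift, wind)
print(drift, alignment)
```

An alignment close to 1 over several days would be consistent with airborne spread, while values near 0 or negative would point to another mechanism.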



### PART B
### Applying approaches on sample data

#### Classifying tweets into 2 categories: Related to Rain, Not related to Rain (when training data is available)

In [5]:
import nltk

## Training Data
rain_tweets = [('this rain is craze today', 'rain'),
               ('Nov 23 17:30 Temperature 3C no or few clouds Wind SW 6 km/h  Humidity 70% France', 'rain'),
               ('missin climbing mountains in the rain', 'rain'),
               ('There are days in live broadcasting Torrential rain in Paris ', 'rain'),
               ('Heavy Rain today in!', 'rain'),
               ('Woman in the boulangerie started complaining about the rain. I said, "its better than terrorists". Need to finesse my jovial patter', 'rain'),
               ('Light to moderate rain over NCR', 'rain'),
               ('After a cold night last night, tonight will be milder and mainly frost-free, with this band of rain. Jo', 'rain'),
               ('But I love the rain. And it rains frequently these days~ So it makes me feel rather good', 'rain'),
               ('With 1000 mm rain already and more rain forecasted 4 Chennai, Nov 2015 will overtake Oct 2005 and Nov 1918 to become the Wettest Month EVER!', 'rain'),
               ('It is raining today. Wet!', 'rain'),
               ('Lots of rain today. Raining!', 'rain'),
               ('Why is it raining?', 'rain'),
               ('So much rain!', 'rain'),
               ('it always rains this time of year', 'rain'),
               ('raining', 'rain'),
               ('raining outside today, rained yesterday too', 'rain'),
               ('rainy weather today! jeez', 'rain'),
               ('Rain has finally extinguished a #wildfire in Olympic National Park that had been burning since May', 'rain'),
               ('The rain had us indoors for Thursdays celebration', 'rain'),
               ('Rain (hourly) 0.0 mm, Pressure: 1012 hPa, falling slowly', 'rain'),
               ('That aspiration yours outfit make ends meet spite of the rainy weather this midsummer?: Edb', 'rain'),
               ('Glasgow\'s bright lights of Gordon st tonight #rain #Glasgow', 'rain'),
               ('Why is it raining? Because it always rains this time of year', 'rain'),
               ('The forecast for this week\'s weather includes lots of rain!', 'rain'),
               ('Morning Has Broken: Morning has BrokenAs I sit in my warm car in between rain squalls I am looking out', 'rain'),
               ('Wind 2.0 mph SW. Barometer 1021.10 mb, Falling. Temperature 5.5 °C. Rain today 0.2 mm. Humidity 78%', 'rain')]

not_rain_tweets = [('I do not like this car', 'not_rain'),
              ('This view is horrible', 'not_rain'),
              ('I feel tired this morning', 'not_rain'),
              ('I am not looking forward to the concert', 'not_rain'),
              ('He is my enemy', 'not_rain'),
              ('I am a bad boy', 'not_rain'),
              ('Tomorrow is going to be fun.', 'not_rain'),
              ('Smiling all around.', 'not_rain'),
              ('These are great apples today.', 'not_rain'),
              ('How about them apples? Thomas is a happy boy.', 'not_rain'),
              ('Thomas is very zen. He is well-mannered.', 'not_rain'),
              ('happy and good lots of light!', 'not_rain'),
              ('I like this new iphone very much', 'not_rain'),
              ('This is not good', 'not_rain'),
              ('I am bothered by this', 'not_rain'),
              ('I am not connected with this', 'not_rain'),
              ('Sadistic creep you ass. Die.', 'not_rain'),
              ('All sorts of crazy and scary as hell.', 'not_rain'),
              ('Not his emails, no.', 'not_rain'),
              ('His father is dead. Returned obviously.', 'not_rain'),
              ('He has a bomb.', 'not_rain'),
              ('Too fast to be on foot. We cannot catch them.', 'not_rain'),
              ('Feeling so stupid stoopid stupid!', 'not_rain'),
              (':-(( :-(', 'not_rain'),
              ('This is the worst way imaginable, all of this traffic', 'not_rain')]


tweets = []
for (words, sentiment) in not_rain_tweets + rain_tweets:
    words_filtered = [e.lower() for e in words.split() if len(e) >= 2]
    tweets.append((words_filtered, sentiment))

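def get_words_in_tweets(tweets):
    """Flatten the tokenized tweets into a single list of words."""
    all_words = []
    for (words, sentiment) in tweets:
        all_words.extend(words)
    return all_words

def get_word_features(wordlist):
    """Return the vocabulary (distinct words) seen in the training tweets."""
    wordlist = nltk.FreqDist(wordlist)
    word_features = wordlist.keys()
    return word_features

def extract_features(document):
    """Build a boolean 'contains(word)' feature for every vocabulary word."""
    document_words = set(document)
    features = {}
    for word in word_features:  # uses the module-level vocabulary defined below
        features['contains(%s)' % word] = (word in document_words)
    return features

word_features = get_word_features(get_words_in_tweets(tweets))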

training_set = nltk.classify.apply_features(extract_features, tweets)
classifier = nltk.NaiveBayesClassifier.train(training_set)

runtweets = []  # test tweets, hard-coded here (they could also be loaded from a file)
## Test data
runtweets.append('I am a bad boy')  # should be not_rain
runtweets.append('rain today')  # should be rain
runtweets.append('so stupid')  # should be not_rain
runtweets.append('it is raining outside')  # should be rain
runtweets.append('I love it')  # should be not_rain
runtweets.append('so good')  # should be not_rain
runtweets.append("The rainwater is everywhere!") # should be rain
not_raincount = 0
raincount = 0
for tweett in runtweets:
    valued = classifier.classify(extract_features(tweett.split()))
    print (tweett," : ", valued)
    if valued == 'not_rain':
        not_raincount = not_raincount + 1
    if valued == 'rain':
        raincount = raincount + 1
print ('\nRain count: %s \nNot Rain count: %s' % (raincount, not_raincount))

I am a bad boy  :  not_rain
rain today  :  rain
so stupid  :  not_rain
it is raining outside  :  rain
I love it  :  not_rain
so good  :  not_rain
The rainwater is everywhere!  :  not_rain

Rain count: 2 
Not Rain count: 5


####  Finding similar words to create a dictionary for a category

In [6]:
from itertools import chain
from nltk.corpus import wordnet

synonyms = wordnet.synsets("rain")
lemmas = list(set(chain.from_iterable([word.lemma_names() for word in synonyms])))
print(lemmas)

['rain_down', 'rainfall', 'pelting', 'rainwater', 'rain']


#### Categorizing tweets using the category dictionary

In [40]:
category_dict = {}
category_dict['rain']  = lemmas
rain_tweet = set()
not_rain_tweet = set()
for items in runtweets:
    items_tweet = set(items.split())
    if set(category_dict['rain']) & items_tweet:
        rain_tweet.add(items)
    else:
        not_rain_tweet.add(items)
print("Rain Tweets: ",rain_tweet)
print("Not Rain Tweets: ",not_rain_tweet)

Rain Tweets:  {'rain today', 'The rainwater is everywhere!'}
Not Rain Tweets:  {'I am a bad boy', 'so good', 'so stupid', 'it is raining outside', 'I love it'}


#### Improving Performance

#### Generating a more cohesive dictionary of relevant words

In [37]:
### Extending dictionary to include hypernyms, holonyms, meronyms and hyponyms
# Copy the lists so the original `lemmas` and `synonyms` are not modified in place
relevantWords = list(lemmas)
relevantWordsSynsets = list(synonyms)

for j in wordnet.synsets('rain'):
    hypernyms = j.hypernyms()
    hyper_lemmas = list(set(chain.from_iterable([word.lemma_names() for word in hypernyms])))
    for word in hyper_lemmas:
        if word not in relevantWords:
            relevantWords.append(word)
    hyponyms = j.hyponyms()
    hypo_lemmas = list(set(chain.from_iterable([word.lemma_names() for word in hyponyms])))
    for word in hypo_lemmas:
        if word not in relevantWords:
            relevantWords.append(word)
    member_holonyms = j.member_holonyms()  
    member_holonyms_lemmas = list(set(chain.from_iterable([word.lemma_names() for word in member_holonyms])))
    for word in member_holonyms_lemmas:
        if word not in relevantWords:
            relevantWords.append(word)
    part_meronyms = j.part_meronyms()
    part_meronyms_lemmas = list(set(chain.from_iterable([word.lemma_names() for word in part_meronyms])))
    for word in part_meronyms_lemmas:
        if word not in relevantWords:
            relevantWords.append(word)
    relevantWordsSynsets.extend(hypernyms)
    relevantWordsSynsets.extend(hyponyms)
    relevantWordsSynsets.extend(member_holonyms)
    relevantWordsSynsets.extend(part_meronyms)
# print(relevantWordsSynsets)
print(relevantWords)


['rain_down', 'rainfall', 'pelting', 'rainwater', 'rain', 'downfall', 'precipitation', 'deluge', 'drizzle', 'torrent', 'mizzle', 'waterspout', 'shower', 'cloudburst', 'rainstorm', 'pelter', 'soaker', 'rain_shower', 'downpour', 'monsoon', 'raindrop', 'freshwater', 'fresh_water', 'chronological_succession', 'sequence', 'successiveness', 'succession', 'chronological_sequence', 'fall', 'precipitate', 'come_down', 'spit', 'pour', 'shower_down', 'sprinkle', 'pelt', 'patter', 'stream', 'rain_buckets', 'pitter-patter', 'rain_cats_and_dogs', 'spatter']


This list is a lot more cohesive than the previous one and includes words such as "drizzle" and "monsoon", which are also relevant to rain. Categorizing with this dictionary of words should give better results. However, the dictionary needs to be inspected carefully for irrelevant words such as "succession" and "sequence", which do not match our needs and must be removed.

#### Semantic similarity

In [45]:
from itertools import product

runtweets.append("It is pouring")
category_dict = {}
category_dict['rain'] = relevantWords
rain_tweet = set()
not_rain_tweet = set()
for items in runtweets:
    items_tweet = items.split()
    if set(category_dict['rain']) & set(items_tweet):
        rain_tweet.add(items)
    else:
        # No exact dictionary hit: fall back to the best Wu-Palmer similarity between
        # any synset of the tweet's words and any synset in the extended dictionary.
        tweet_synsets = set(ss for word in items_tweet for ss in wordnet.synsets(word))
        best = max((wordnet.wup_similarity(s1, s2) or 0
                    for s1, s2 in product(relevantWordsSynsets, tweet_synsets)),
                   default=0)  # default covers tweets whose words have no synsets
        if best > 0.9:
            rain_tweet.add(items)
        else:
            not_rain_tweet.add(items)

print("Rain Tweets: ",rain_tweet)
print("Not Rain Tweets: ",not_rain_tweet)

Rain Tweets:  {'The rainwater is everywhere!', 'it is raining outside', 'rain today', 'It is pouring'}
Not Rain Tweets:  {'I love it', 'so good', 'so stupid', 'I am a bad boy'}


Even after creating a more cohesive dictionary, there is still a possibility of missing some relevant words. One way to handle this is to use the semantic similarity between words. For every word in the tweet, we compute its semantic similarity (here the Wu-Palmer similarity from WordNet) to the words in our dictionary. In this way we can account for words that are not in our dictionary but are similar to the words in it. So if a tweet has a word that is not in our "rain" dictionary but is very similar to one that is (similarity above 0.9 in the code above), it will still be categorized as "rain".
In the example above, the word "pouring" is not in the dictionary, but the tweet is still categorized as "rain" as expected.

### PART C
### Analyzing the techniques and discussing applicability on actual dataset

Approach one (using a Naive Bayes classifier) works well if training data is already available. In our case, however, we don't have training data or ground truth, so we will have to look for alternatives at some cost in accuracy. The second approach creates a dictionary of relevant words for a category, and tweets are then categorized using this dictionary. This compromises accuracy because there is always a chance of missing key words. We also depend on the NLTK library to provide relevant words for the category. In this sample data, we created the dictionary from a single word, "rain"; for our data, we will create a dictionary using all the words mentioned in the challenge description, such as fever, chills, sweats, and aches and pains.

For the actual dataset, we can use a combination of these two techniques. We can generate the training dataset by categorizing with the dictionary technique, followed by manual inspection. We can then use the Naive Bayes classifier or a more complex neural network to predict the class of the remaining dataset. Once we have this information, we can filter the dataset and keep only the tweets that are relevant to our problem. After this is done, we can cluster these tweets by location and plot them over the city map provided. Plotting tweets incrementally with time should give us an idea of where the epidemic originated and the direction in which it is spreading. We expect to face problems because of the size of the actual dataset. This sample data was minuscule compared to the original dataset and was only used to test our techniques.
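The incremental plotting idea can be prepared as cumulative "frames", one per time step, with the centroid of the earliest frame serving as a rough first guess for the origin. This is only a sketch: the records are made-up `(hour_index, x, y)` tuples and `cumulative_frames` is a hypothetical helper.

```python
from collections import defaultdict

# Hypothetical relevant tweets as (hour_index, x, y); values are made up.
records = [(0, 2.0, 3.0), (0, 2.2, 3.1), (1, 2.5, 3.0), (2, 3.1, 2.9), (2, 3.4, 3.0)]

def cumulative_frames(records):
    """Map each hour to all points reported up to and including that hour."""
    by_hour = defaultdict(list)
    for hour, x, y in records:
        by_hour[hour].append((x, y))
    frames, seen = {}, []
    for hour in sorted(by_hour):
        seen = seen + by_hour[hour]  # new list per frame so earlier frames stay unchanged
        frames[hour] = seen
    return frames

frames = cumulative_frames(records)
first = frames[min(frames)]
origin_estimate = (sum(x for x, _ in first) / len(first),
                   sum(y for _, y in first) / len(first))
print({hour: len(points) for hour, points in frames.items()})
print(origin_estimate)
```

On the real data the earliest frame would need enough tweets to be meaningful, so the time step is a parameter to tune.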

### PART D
### Visualizations needed

The following visualizations are expected to be useful for this challenge.

Plot the locations of reported sickness against the background image of the city. Since the number of sick people varies over time, provide the following interactions to the user:
1. Tag each tweet with its location and time (displayed when the user hovers over the location).
2. Provide a play button that plots the growth of disease incidence over time on the map of Vastopolis, like a time-series video.
3. Let the user select a date and time to plot the locations of sick/healthy people at that point in time.
4. Plot a line graph alongside the map showing the rate of growth/decline of the epidemic.
5. Generate a dynamic, weighted graph as a function of time to identify relationships.
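The data behind the growth/decline line graph in item 4 could be derived from daily counts of relevant tweets, roughly as follows. The counts are made-up numbers and `growth_rates` is a hypothetical helper; the rate is the relative change from the previous day.

```python
# Made-up daily counts of relevant tweets
daily_counts = {"2011-05-17": 40, "2011-05-18": 100, "2011-05-19": 250, "2011-05-20": 200}

def growth_rates(counts):
    """Relative day-over-day change; days following a zero count are skipped."""
    days = sorted(counts)
    rates = {}
    for prev, cur in zip(days, days[1:]):
        if counts[prev] == 0:
            continue  # relative growth is undefined when the previous count is zero
        rates[cur] = (counts[cur] - counts[prev]) / counts[prev]
    return rates

rates = growth_rates(daily_counts)
for day, rate in rates.items():
    print(day, round(rate, 2))
```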

The visualization can also animate wind speed and direction for a given day.
The population density for day and night can also be indicated on the background image. (However, this did not seem especially relevant to the problem during the analysis.)

<img src="visualization.png">

##### References

1. Simple Text Classification with Python and TextBlob: http://stevenloria.com/how-to-build-a-text-classification-system-with-python-and-textblob/
2. Labelling tweets using supervised classification: http://mark-kay.net/2013/12/19/labelling-tweets-using-supervised-classification/
3. Stackoverflow question on tweet classification: https://stackoverflow.com/questions/36875780/tweet-classification-into-multiple-categories-on-unsupervised-data-tweets
4. WordNet tutorial: http://stevenloria.com/tutorial-wordnet-textblob/
5. NLP examples in Python: https://github.com/jamesacampbell/python-examples
6. https://blog.nycdatascience.com/student-works/identifying-fake-news-nlp/