# Search Liver Cancer Cures
# Stream Twitter data into a Postgres database

In [113]:
%matplotlib inline

In [120]:
import tweepy
import pandas as pd
import os
import json
import requests
import re
import matplotlib.pyplot as plt

from wordcloud import WordCloud

### Authorize Twitter

In [6]:
consumer_key = os.environ["MF_TWITTER_CONSUMER_API_KEY"]
consumer_secret = os.environ["MF_TWITTER_CONSUMER_API_KEY_SECRET"]
access_token = os.environ["MF_TWITTER_ACCESS_TOKEN"]
access_token_secret = os.environ["MF_TWITTER_ACCESS_TOKEN_SECRET"]

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

### Query Search API

In [73]:
query = 'liver cancer'
max_tweets = 100
# Note: Tweepy 4.x renamed API.search to API.search_tweets
r = list(tweepy.Cursor(api.search, q=query, tweet_mode="extended").items(max_tweets))

In [99]:
statuses = [status._json['full_text'] for status in r if 'retweeted_status' not in status._json]
statuses

['@JasonCrabbMusic Thankful that my cousin got an "all clear" report on the colon and liver cancer he had..that the doctors said was incurable..GOD IS GOOD!',
 'My mom just told me one of my neighbors is going into hospice.  She was diagnosed last week with lung and liver cancer and now she only has a few weeks left to live.  I’m not that close to her but I don’t wanna believe it',
 '@AngelaDwane Thank you, Angéla! Unfortunately, the cancer has recurred, a tumor has been found in the lungs, liver and spleen! It would have been surgery tomorrow, but it was postponed for technical reasons...❤️😘',
 'うーん、トマトですが、💦\n\nStudy finds that in mice, lycopene in tomatoes reduced fatty liver disease, inflammation and liver cancer https://t.co/W3ZM3M8fF7 via @medical_xpress',
 'I have nothing but compassion for those cancer has touched. I watched my Mom slip away. I am afraid every day my cancer is floating around in my body looking for organs to attack. Brain, liver, bones, lungs. My cancer loves th

In [109]:
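def format_status_list(status_list):
    """Strip retweet prefixes, newlines, links, and @mentions from each status."""
    # Anchor to a single leading handle so the match cannot run past later colons
    status_list = [re.sub(r'^RT @\w+: ', '', status) for status in status_list]
    status_list = [status.replace('\n', ' ') for status in status_list]  # flatten newlines
    status_list = [re.sub(r'https?://[^ ]+', '', status) for status in status_list]  # remove links
    status_list = [re.sub(r'@\w+', '', status) for status in status_list]  # remove @mentions
    return status_list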

def filter_http(tokens):
    """Drop tokens that look like links or Twitter URLs."""
    tokens = [word for word in tokens if re.match(r'https?.+', word) is None]
    tokens = [word for word in tokens if re.match(r'.+https?.+', word) is None]
    tokens = [word for word in tokens if re.match(r'.+twitter.+', word) is None]
    return tokens

def tokenize(status_list):
    """Join all statuses into one lowercase string (the input WordCloud expects)."""
    tokens = ','.join(status_list)
    tokens = tokens.replace("'", '')
    tokens = tokens.lower()

    return tokens
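
The cleaning steps can be checked on made-up strings before running them on real tweets. The sketch below is illustrative only: `clean_status` and the sample texts are hypothetical, not part of the notebook. It also shows why the retweet pattern matters: a greedy pattern such as `RT @.+: ` consumes everything up to the last colon-space in the tweet, while a pattern anchored to a single handle removes only the prefix.

```python
import re

def clean_status(text):
    """Hypothetical helper: strip retweet prefix, newlines, links, and @mentions."""
    text = re.sub(r'^RT @\w+: ', '', text)      # only a leading "RT @handle: "
    text = text.replace('\n', ' ')              # flatten newlines
    text = re.sub(r'https?://\S+', '', text)    # remove links
    text = re.sub(r'@\w+', '', text)            # remove @mentions
    return text

# Made-up sample tweets (not from the collected data)
samples = [
    'RT @someone: Liver cancer study\nout now https://example.com/x',
    '@friend thanks for the update',
]
cleaned = [clean_status(s) for s in samples]
print(cleaned)

# Greedy vs. anchored retweet pattern on a tweet that contains a second colon
tweet = 'RT @a: note: liver'
greedy = re.sub('RT @.+: ', '', tweet)
anchored = re.sub(r'^RT @\w+: ', '', tweet)
print(repr(greedy), repr(anchored))
```

The cleaned strings keep the leading or trailing spaces left behind by removed mentions and links; a final `strip()` could be added if that matters downstream.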

In [100]:
# Format statuses
statuses2 = format_status_list(statuses)
statuses2

[' Thankful that my cousin got an "all clear" report on the colon and liver cancer he had..that the doctors said was incurable..GOD IS GOOD!',
 'My mom just told me one of my neighbors is going into hospice.  She was diagnosed last week with lung and liver cancer and now she only has a few weeks left to live.  I’m not that close to her but I don’t wanna believe it',
 ' Thank you, Angéla! Unfortunately, the cancer has recurred, a tumor has been found in the lungs, liver and spleen! It would have been surgery tomorrow, but it was postponed for technical reasons...❤️😘',
 'うーん、トマトですが、💦  Study finds that in mice, lycopene in tomatoes reduced fatty liver disease, inflammation and liver cancer  via ',
 'I have nothing but compassion for those cancer has touched. I watched my Mom slip away. I am afraid every day my cancer is floating around in my body looking for organs to attack. Brain, liver, bones, lungs. My cancer loves these organs.',
 " People who drink alcohol everyday are ten times mor

In [97]:
# Create word cloud


In [118]:
token_string = tokenize(statuses2)

In [123]:
def generate_wordcloud_image(token_string):
    wordcloud = WordCloud(
        width=800,
        height=800,
        background_color='white',
        # stopwords=self.stopwords,
        min_font_size=10,
    ).generate(token_string)

    # Plot the WordCloud
    plt.figure(figsize=(8, 8), facecolor=None)
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.tight_layout(pad=0)
    plt.show()

In [None]:
# generate word clouds
generate_wordcloud_image(token_string)
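
A word cloud conveys relative frequency only visually. As a complementary sketch, the same comma-joined string could be summarized as explicit counts with the standard library. The function name `top_words`, the `min_length` cutoff, and the sample string are assumptions made for illustration.

```python
import re
from collections import Counter

def top_words(token_string, n=5, min_length=3):
    """Hypothetical helper: return the n most frequent words of at least min_length letters."""
    words = re.findall(r"[a-z]+", token_string.lower())
    counts = Counter(w for w in words if len(w) >= min_length)
    return counts.most_common(n)

# Made-up input in the same comma-joined form that tokenize() produces
sample = "liver cancer study,liver disease and liver cancer"
top_two = top_words(sample, n=2)
print(top_two)
```

Note that the `[a-z]+` pattern drops non-ASCII words, which may or may not be desirable for multilingual tweets.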

### Data filtering

We filter data to identify health tweets. Keyword filtering, which is used to obtain the data, is insufficient on its own because it also matches tweets such as “I'm sick of this” and “justin beber ur so cool and i have beber fever” [8]. Instead, we rely on supervised machine learning classification to filter tweets.

We filtered tweets from 2009–2010 with 20,000 keyphrases and annotated a random subset of the remaining 11.7 million tweets using Amazon Mechanical Turk, a crowdsourcing service [30]–[31], to distinguish relevant health tweets from spurious matches. Workers annotated examples as positive (about the user’s health), negative (unrelated, e.g., news updates or advertisements, or not English), or ambiguous. We took two steps to ensure quality. First, we annotated a sample ourselves and required workers to annotate some of these “gold” tweets, which allowed us to check annotator accuracy and exclude inaccurate workers. Second, each tweet was labeled by three annotators and the final label was determined by majority vote; we removed the 1.1% of examples where the majority vote was ambiguous.
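
The majority-vote step can be sketched in a few lines. This is a minimal illustration, not the original annotation pipeline: the tweet IDs and labels are invented, and treating a three-way tie as "no majority" (and dropping it) is an assumption.

```python
from collections import Counter

def majority_label(labels):
    """Return the label chosen by a strict majority of annotators, or None if there is none."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count > len(labels) / 2 else None

# Invented annotations: three workers per tweet
annotations = {
    "t1": ["positive", "positive", "negative"],
    "t2": ["negative", "ambiguous", "ambiguous"],
    "t3": ["positive", "negative", "ambiguous"],
}

final_labels = {}
for tweet_id, labels in annotations.items():
    label = majority_label(labels)
    # Assumption: drop tweets whose majority is "ambiguous" or that have no majority
    if label is None or label == "ambiguous":
        continue
    final_labels[tweet_id] = label

print(final_labels)
```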

This yielded a set of 5,128 tweets (36.1% positive) for training data to create a classifier for health relevance. We trained a binary logistic regression model using the MALLET toolkit [32] with n-gram (1≤n≤3) word features. We tokenized the raw text such that contiguous blocks of punctuation were treated as word separators, with punctuation blocks retained as word tokens. We removed tweets containing URLs, which were almost always false positives.
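
The tokenization and n-gram features described above might look roughly like the following. This is a sketch under stated assumptions rather than the MALLET configuration used in the study: lowercasing, splitting on whitespace, and the helper names `tokenize_with_punctuation`, `ngrams`, and `contains_url` are all illustrative choices.

```python
import re

def tokenize_with_punctuation(text):
    """Split into word tokens and punctuation-block tokens.

    A run of punctuation separates words and is itself kept as one token.
    Lowercasing is an assumption; the source does not specify it.
    """
    return re.findall(r"\w+|[^\w\s]+", text.lower())

def ngrams(tokens, max_n=3):
    """Return all word n-grams for 1 <= n <= max_n as space-joined strings."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]

def contains_url(text):
    """Crude URL check used here to drop tweets with links (assumed pattern)."""
    return re.search(r"https?://", text) is not None

tokens = tokenize_with_punctuation("Ugh... my liver hurts!!")
features = ngrams(tokens)
print(tokens)
print(len(features))
```

Each distinct n-gram string would then serve as one binary or count feature for the logistic regression model.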

We tuned the prediction threshold using 10-fold cross validation to balance precision and recall, arriving at an estimated 68% precision and 72% recall. Applying this classifier to the health stream yielded 144 million health tweets, a nearly hundred-fold increase over our earlier study of 1.6 million tweets [24].
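
To make the threshold trade-off concrete, the sketch below computes precision and recall at several thresholds on a toy set of classifier scores. The scores and labels are invented for illustration, and a single held-out set stands in for the 10-fold cross validation described above.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when predicting positive for scores >= threshold."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and y for p, y in zip(predicted, labels))
    fp = sum(p and not y for p, y in zip(predicted, labels))
    fn = sum((not p) and y for p, y in zip(predicted, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy data: predicted probabilities and true labels (True = health tweet)
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [True, True, False, True, False, False]

for threshold in (0.25, 0.5, 0.75):
    p, r = precision_recall(scores, labels, threshold)
    print(f"threshold={threshold:.2f} precision={p:.2f} recall={r:.2f}")
```

Raising the threshold generally trades recall for precision; the tuned value is whichever point on that curve best matches the desired balance.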

## Keywords and search terms for liver cancer

In [None]:
Hemangioma
Hepatic adenoma
Focal nodular hyperplasia
Cysts
Lipoma
Fibroma
Leiomyoma
Hepatocellular carcinoma (HCC)
Cholangiocarcinoma

Primary liver cancer (hepatocellular carcinoma) tends to occur in livers damaged by birth defects, alcohol abuse, or chronic infection with diseases such as hepatitis B and C, hemochromatosis (a hereditary disease associated with too much iron in the liver), and cirrhosis. More than half of all people diagnosed with primary liver cancer have cirrhosis, a scarring condition of the liver commonly caused by alcohol abuse. Hepatitis B and C and hemochromatosis can cause permanent damage and liver failure. Liver cancer may also be linked to obesity and fatty liver disease.

Various cancer-causing substances are associated with primary liver cancer, including certain herbicides and chemicals such as vinyl chloride and arsenic. Smoking, especially if you abuse alcohol as well, also increases risk. Aflatoxins, cancer-causing substances made by a type of plant mold, have also been implicated. Aflatoxins can contaminate wheat, peanuts, rice, corn, and soybeans, although such contamination is rare in most developed countries like the U.S. Other causes include the hormones androgen and estrogen and Thorotrast, a dye formerly used in medical tests.
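
The keyword list in this section could be combined into a single search string for the `q` parameter used earlier. The sketch below is one possible approach, assuming the search endpoint accepts quoted phrases joined with `OR`; the helper name `build_query` and the choice of terms are illustrative.

```python
def build_query(terms):
    """Join search terms with OR, quoting multi-word phrases so they match exactly."""
    quoted = [f'"{term}"' if " " in term else term for term in terms]
    return " OR ".join(quoted)

# A few terms taken from the keyword list above
liver_terms = ["hepatocellular carcinoma", "cholangiocarcinoma", "hepatic adenoma"]
query = build_query(liver_terms)
print(query)
```

Search APIs typically cap query length, so a long keyword list may need to be split across several queries.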