In [2]:
# Next steps:
# MODEL EVALUATION
# text embeddings, establish baseline performance

# imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

## Sentiment140
### Source Authors
Sentiment140 was created by Alec Go, Richa Bhayani, and Lei Huang, who were Computer Science graduate students at Stanford University.

[Data Link](http://help.sentiment140.com/for-students)

### Source Purpose
"Our approach was unique because our training data was automatically created, as opposed to having humans manual annotate tweets. In our approach, we assume that any tweet with positive emoticons, like :), were positive, and tweets with negative emoticons, like :(, were negative. We used the Twitter Search API to collect these tweets by using keyword search. This is described in our paper."
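
The labeling rule described in the quote can be sketched roughly as follows. This is only an illustration, not the authors' pipeline: the emoticon sets, the `label_by_emoticon` helper, and the choice to discard tweets containing both kinds of emoticon are assumptions made for this sketch.

```python
# Hypothetical emoticon sets; the original authors' lists may differ.
POSITIVE_EMOTICONS = (":)", ":-)", ":D")
NEGATIVE_EMOTICONS = (":(", ":-(")


def label_by_emoticon(tweet):
    """Return (polarity, cleaned_text), or None if the tweet is unusable.

    Polarity follows the Sentiment140 coding: 0 = negative, 4 = positive.
    Tweets with no emoticon, or with both kinds, are discarded
    (an assumption of this sketch).
    """
    has_pos = any(e in tweet for e in POSITIVE_EMOTICONS)
    has_neg = any(e in tweet for e in NEGATIVE_EMOTICONS)
    if has_pos == has_neg:
        return None
    # Strip the emoticons so a model cannot simply memorize them.
    cleaned = tweet
    for emoticon in POSITIVE_EMOTICONS + NEGATIVE_EMOTICONS:
        cleaned = cleaned.replace(emoticon, "")
    return (4 if has_pos else 0), " ".join(cleaned.split())


print(label_by_emoticon("Lyx is cool :)"))
print(label_by_emoticon("missed the bus :("))
print(label_by_emoticon("no emoticon here"))
```

Because the label comes from the emoticon rather than from a human reader, sarcastic or mixed tweets will be mislabeled, which may explain some of the odd "positive" examples in the cell output further down.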

### Schema
The data file is a CSV with emoticons removed. It has 6 fields:
0. the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
1. the id of the tweet (2087)
2. the date of the tweet (Sat May 16 23:58:44 UTC 2009)
3. the query (lyx). If there is no query, then this value is NO_QUERY.
4. the user that tweeted (robotickilldozr)
5. the text of the tweet (Lyx is cool)
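
As a small illustration of the schema above, the sketch below parses one row and decodes the polarity and query fields. The sample row reuses the example values from the field list; its polarity of 4, the quoting style, and the `parse_rows` helper are assumptions for illustration only.

```python
import csv
import io

FIELDS = ["polarity", "id", "date", "query", "user", "text"]
POLARITY_LABELS = {0: "negative", 2: "neutral", 4: "positive"}

# One made-up row assembled from the example values in the schema.
sample = '"4","2087","Sat May 16 23:58:44 UTC 2009","lyx","robotickilldozr","Lyx is cool"\n'


def parse_rows(handle):
    """Yield one dict per CSV row with decoded polarity and query."""
    for values in csv.reader(handle):
        row = dict(zip(FIELDS, values))
        row["polarity"] = int(row["polarity"])
        row["sentiment"] = POLARITY_LABELS[row["polarity"]]
        # NO_QUERY is the documented placeholder for "no query".
        row["query"] = None if row["query"] == "NO_QUERY" else row["query"]
        yield row


rows = list(parse_rows(io.StringIO(sample)))
print(rows[0]["sentiment"], "|", rows[0]["text"])
```
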
### Project Relevance

In [10]:
# Sentiment140
# Data Cleaning
sentiment_train_data = pd.read_csv('./data/Sentiment140/train_data.csv', header=None, 
                         names=['polarity', 'id', 'date', 'query', 'user', 'text'], encoding='latin')
# Data Size
print(f'Sentiment140 Train Data Shape: {sentiment_train_data.shape}')

def search_text(text_series, search_pattern, case):
    return text_series.str.contains(search_pattern, case=case)

def print_tweets(data, polarity, n):
    """Print the first n tweets with the given polarity, separated by blank lines."""
    for elem in data[data.polarity == polarity].text.head(max(n, 0)):
        print(elem)
        print()

# Baseline performance
oakland_results = sentiment_train_data[search_text(sentiment_train_data.text, 'oakland', case=False)]
print('Negative Oakland Tweets')
print_tweets(oakland_results, 0, 3)
print('Positive Oakland Tweets')
print_tweets(oakland_results, 4, 3)
foodbank_results = sentiment_train_data[search_text(sentiment_train_data.text, 'food bank', case=False)]
print('Negative Foodbank Tweets')
print_tweets(foodbank_results, 0, 3)
print('Positive Foodbank Tweets')
print_tweets(foodbank_results, 4, 3)
lgbt_results = sentiment_train_data[search_text(sentiment_train_data.text, 'lgbt', case=False)]
print('Negative LGBT Tweets')
print_tweets(lgbt_results, 0, 3)
print('Positive LGBT Tweets')
print_tweets(lgbt_results, 4, 3)

Sentiment140 Train Data Shape: (1600000, 6)
Negative Oakland Tweets
@CodaQueen oh wait he does have 1 in Oakland on the 18th. Can't understand why he only has 1 &amp; in Oakland 

dear morrissey, stop cancelling shows. it bums people out. first ft lauderdale, now oakland.  get it together. thanks &lt;3

Just worked my last night at the Oakland Children's and I'm sad.  That was my 2nd home for 4 years. New hospital tomorrow...woo

Positive Oakland Tweets
@MadameSoybean Ah, I think that you think that I'm in Oakland. But I'm 145 miles north of there. So you didn't pass by. I was confused. 

@AsPgameLive  i miss nick swisher as well and follow him- i happy for him but want him to come home to oakland  sniff sniff

@debraoakland  Only telling the truth Dibster 

Negative Foodbank Tweets
@KGMB9 i wish the Hawaii Food Bank's food drive wasn't always so close to the Letter Carrier's Food Drive (May 9) 

@JMC_Ministries Churches aren't involved as we r called 2b thru Christ. If we were, there 

## ASU Twitter Graph
### Source Authors
R. Zafarani and H. Liu, (2009). Social Computing Data Repository at ASU [http://socialcomputing.asu.edu]. Tempe, AZ: Arizona State University, School of Computing, Informatics and Decision Systems Engineering.

[Data Link](http://socialcomputing.asu.edu/datasets/Twitter)

### Source Purpose
Social Computing Data Repository hosts data from a collection of many different social media sites, most of which have blogging capacity. Some of the prominent social media sites included in this repository are BlogCatalog, Twitter, MyBlogLog, Digg, StumbleUpon, del.icio.us, MySpace, LiveJournal, The Unofficial Apple Weblog (TUAW), Reddit, etc. The repository contains various facets of blog data including blog site metadata like, user defined tags, predefined categories, blog site description; blog post level metadata like, user defined tags, date and time of posting; blog posts; blog post mood (which is defined as the blogger's emotions when (s)he wrote the blog post); blogger name; blog post comments; and blogger social network.

The repository was designed in 2009 by Reza Zafarani and Huan Liu. Funding support from the Air Force Office of Scientific Research (AFOSR) and Office of Naval Research (ONR) is gratefully acknowledged. Credit also goes to the dataset creators who made gathering this repository possible.

### Schema
nodes.csv
-- the file of all the users. It works as a dictionary of all the users in this data set and is useful for fast reference. It contains all the node ids used in the dataset.

edges.csv
-- this is the friendship/followership network among the users. The friends/followers are represented using edges. Edges are directed.
Here is an example.

{1,2}

This means the user with id "1" is following the user with id "2".
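
To make the edge direction concrete, here is a minimal sketch that counts followers (in-degree) and followees (out-degree) from a toy edge list. The ids are made up; only the (follower, followed) ordering comes from the description above.

```python
from collections import Counter

# Toy edge list in the same (follower, followed) order as edges.csv.
edges = [(1, 2), (3, 2), (2, 1)]

followers = Counter(followed for _, followed in edges)  # in-degree
following = Counter(follower for follower, _ in edges)  # out-degree

print(followers[2])  # number of users following user 2
print(following[2])  # number of users that user 2 follows
```

Under this convention, a node's in-degree is its follower count, which is the quantity the graph cell below ranks by.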

Attribute Information:
Twitter is a social news website. It can be viewed as a hybrid of email, instant messaging and sms messaging all rolled into one neat and simple package. It's a new and easy way to discover the latest news related to subjects you care about.

- Basic statistics
Number of Nodes: 11,316,811
Number of Edges: 85,331,846

### Project Relevance
* While this has no actual text information, this data set can be very useful if augmented with specific values we want to prioritize in our PageRank implementation
* Good testing ground for scaling up algorithms due to data size
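
One way to read "augmented with specific values" is personalized PageRank, where the teleport step favors nodes with higher priority. The sketch below is a plain power iteration under that assumption; the `personalized_pagerank` function, the toy edges, and the priority weights are hypothetical. For real graphs, `nx.pagerank` accepts a `personalization` dict that serves the same purpose.

```python
def personalized_pagerank(edges, personalization, alpha=0.85, iters=100):
    """Power iteration for PageRank with a non-uniform teleport vector.

    edges: iterable of (follower, followed) pairs.
    personalization: dict mapping node -> non-negative priority weight.
    """
    edges = list(edges)
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    total = sum(personalization.get(n, 0.0) for n in nodes)
    if total <= 0:
        raise ValueError("personalization must give weight to at least one node")
    teleport = {n: personalization.get(n, 0.0) / total for n in nodes}

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        # Rank held by nodes without out-links is redistributed via teleport.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        new_rank = {n: (1 - alpha + alpha * dangling) * teleport[n] for n in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += alpha * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


toy_edges = [(1, 2), (3, 2), (2, 1)]
uniform = personalized_pagerank(toy_edges, {1: 1, 2: 1, 3: 1})
boosted = personalized_pagerank(toy_edges, {3: 1})
print(uniform[3], boosted[3])
```

Node 3 has no followers, so its score comes entirely from teleportation; giving it all the priority weight raises its score without changing the graph.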

In [12]:
# ASU Twitter Graph
# graph imports
import operator

import networkx as nx

# Data Cleaning
twitter_nodes = pd.read_csv('./data/Twitter-dataset/data/nodes.csv', header=None)
twitter_edges = pd.read_csv('./data/Twitter-dataset/data/edges.csv', header=None)
# let's take a random sample of twitter_edges to make it a bit smaller
sampled_twitter_edges = twitter_edges.sample(frac=0.01)

G = nx.DiGraph()
G.add_edges_from(sampled_twitter_edges.values)
print(f'Number of nodes: {G.number_of_nodes()}')
in_degrees = dict(G.in_degree)
print(f'Largest number of connections: {max(in_degrees.values())}')

# Baseline performance
print(f'Nodes with largest in-degrees: {dict(sorted(in_degrees.items(), key=operator.itemgetter(1), reverse=True)[:5])}')
pr = nx.pagerank(G)
print(f'Built-in PageRank: {dict(sorted(pr.items(), key=operator.itemgetter(1), reverse=True)[:5])}')

Number of nodes: 585094
Largest number of connections: 5545
Nodes with largest in-degrees: {5994113: 5545, 7496: 3568, 1349110: 3443, 3493: 2187, 3402: 1993}
Built-in PageRank: {5994113: 0.003860802208591317, 7496: 0.0024850155981141118, 1349110: 0.002373434560173397, 3493: 0.001272556362991607, 3402: 0.0011883608812889813}


## Cheng-Caverlee-Lee Twitter Scrape
### Source Authors
Z. Cheng, J. Caverlee, and K. Lee. You Are Where You Tweet: A Content-Based Approach to Geo-locating Twitter Users. In Proceedings of the 19th ACM Conference on Information and Knowledge Management (CIKM), Toronto, Oct 2010.

[Data Link](https://archive.org/details/twitter_cikm_2010)

### Source Purpose
This dataset is a collection of scraped public Twitter updates gathered for an academic project on geolocating Twitter users from the content of their tweets.

### Schema
The training set contains 115,886 Twitter users and 3,844,612 updates from those users. All user locations are self-reported, within the United States, at city-level granularity. The test set contains 5,136 Twitter users and 5,156,047 tweets from those users. All test-set locations were uploaded from the users' smartphones in the form "UT: Latitude,Longitude".

* `training_set_users.txt` - `UserID\tUserLocation`.
* `training_set_tweets.txt` - `UserID\tTweetID\tTweet\tCreatedAt`.
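
Because tweet text can itself contain tabs, a naive split on `\t` may yield more than four fields. The sketch below shows one way to recover the four-column layout by treating the first two fields and the last field as fixed. The `parse_tweet_line` helper and the sample values are made up for illustration.

```python
def parse_tweet_line(line):
    """Split one line into [user_id, tweet_id, tweet, created_at].

    Returns None when fewer than four fields are present.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 4:
        return None
    # Everything between the tweet id and the timestamp is tweet text.
    tweet = " ".join(fields[2:-1])
    return [fields[0], fields[1], tweet, fields[-1]]


print(parse_tweet_line("60730027\t6320951896\thello\tworld\t2009-12-03 18:41:07\n"))
print(parse_tweet_line("60730027\t6320951896\n"))
```

Joining the extra pieces with a space loses the original tab characters; that seems acceptable for text analysis, but it is a design choice rather than a requirement of the format.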

### Project Relevance
* Geographic location imputation
* Size and diversity of tweet text make this a good candidate for a development test set

In [24]:
# Cheng-Caverlee-Lee Twitter Scrape
# Data Cleaning
# This data set comes in a tsv file with inconsistent line lengths and tabs embedded in some tweet texts
# Base pandas has difficulty reading in the data, some cleaning required
def read_ccl_data(fp):
    """Read in the tsv values, accommodating line length inconsistency and embedded tabs in text values"""
    ccl_data_dict = dict()
    with open(fp) as f:
        # counter represents the "row" number of the data
        counter = 0
        for line in f:
            payload = line.split('\t')
            
            # Handle embedded tabs within tweet text
            if len(payload) > 4:
                first_col = payload[0]
                second_col = payload[1]
                fourth_col = payload[-1]
                # Rejoin every field between the tweet id and the timestamp
                third_col = ' '.join(payload[2:-1])
                payload = [first_col, second_col, third_col, fourth_col]
            
            # Skip rows with missing data values
            # TODO: Use data schema patterns to identify which values are missing and impute or mark None
            if len(payload) < 4:
                #skip the line for now
                continue
                
            ccl_data_dict[counter] = payload
            counter += 1

    return pd.DataFrame(ccl_data_dict).T

def query_for_hand_labeling(data, search_term, nrows, outfp):
    """Write the first nrows tweets matching search_term to outfp for hand labeling."""
    text_search = data[search_text(data.loc[:, 2], search_term, case=False)].head(nrows)
    text_search.to_csv(outfp)


def read_hand_labeled_data(fp):
    df = pd.read_csv(fp)
    new = df.loc[:, '3'].str.split(',', expand=True)
    new.loc[:, 2] = new.loc[:, 2].str.strip()
    new = new.rename({0: 'date', 1: 'political', 2: 'sentiment'}, axis=1)
    df = df.drop(['3', '4', '5'], axis=1).rename({'Unnamed: 0':'user_id', 
                                                  '0':'tweet_id', '1':'tweet_id2', 
                                                  '2':'tweet'}, axis=1)
    return df.join(new).astype({'political': 'int32', 'sentiment': 'int32'})

ccl_train_tweets = read_ccl_data('./data/twitter_cikm_2010/training_set_tweets.txt')
ccl_train_users = pd.read_csv('./data/twitter_cikm_2010/training_set_users.txt', sep='\t', header=None)

# Data Size
print(f'User data size: {ccl_train_users.shape}')
print(f'Tweets data size: {ccl_train_tweets.shape}')

# Class balance
print(f'Top 5 classes support:\n {(ccl_train_users.loc[:, 1].value_counts() / ccl_train_users.shape[0]).head()}')

# Hand labeled sentiment and political
#query_for_hand_labeling(ccl_train_tweets, 'lgbt', 100, './data/twitter_cikm_2010/jdunn_labeled.csv')

User data size: (115886, 2)
Tweets data size: (3679161, 4)
Top 5 classes support:
 Los Angeles        0.038987
New York           0.037382
Los Angeles, CA    0.026336
Chicago            0.023696
New York, NY       0.021124
Name: 1, dtype: float64
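
The class support output above counts "Los Angeles" and "Los Angeles, CA" as separate classes, which splits the support for the same city. A rough sketch of one possible normalization follows. The `normalize_city` heuristic and the sample labels are assumptions, and dropping the state suffix would wrongly merge same-named cities in different states.

```python
from collections import Counter


def normalize_city(location):
    """Drop anything after the first comma and tidy case and whitespace."""
    city = location.split(",")[0]
    return " ".join(city.split()).title()


# Made-up labels mirroring the variants visible in the output above.
raw = ["Los Angeles", "Los Angeles, CA", "New York", "New York, NY", "Chicago"]
counts = Counter(normalize_city(loc) for loc in raw)
print(counts)
```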


In [69]:
# supports
df = read_hand_labeled_data('./data/twitter_cikm_2010/jdunn_labeled.csv')
print(f'Support for Political Class 1: {df[df.political == 1].shape[0]/df.shape[0]}')
print(f'Support for Political Class 0: {df[df.political == 0].shape[0]/df.shape[0]}')

print(f'Support for Sentiment Class 1: {df[df.sentiment == 1].shape[0]/df.shape[0]}')
print(f'Support for Sentiment Class 0: {df[df.sentiment == 0].shape[0]/df.shape[0]}')
print(f'Support for Sentiment Class -1: {df[df.sentiment == -1].shape[0]/df.shape[0]}')

Support for Political Class 1: 0.76
Support for Political Class 0: 0.24
Support for Sentiment Class 1: 0.35
Support for Sentiment Class 0: 0.58
Support for Sentiment Class -1: 0.07
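
These supports also give a simple reference point for later models: a classifier that always predicts the most frequent class. The sketch below computes that majority-class accuracy. The `majority_baseline` helper is hypothetical, and the label list is synthetic, built to match the sentiment supports printed above under the assumption of 100 labeled rows.

```python
from collections import Counter


def majority_baseline(labels):
    """Return (most_common_label, accuracy of always predicting it)."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


# Synthetic labels matching the printed supports (35 / 58 / 7 out of 100).
sentiment = [1] * 35 + [0] * 58 + [-1] * 7
baseline_label, baseline_accuracy = majority_baseline(sentiment)
print(baseline_label, baseline_accuracy)
```

Any sentiment model evaluated on this hand-labeled set would need to beat this figure to show it has learned more than the class imbalance.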
