# Introduction
**Sophia and Victoria**

In Ferguson, Missouri, in the summer of 2014, an unarmed black teenager was shot 6 times and subsequently died from his wounds. Three months later, a grand jury declined to indict the white officer who shot him.

The initial incident and the subsequent decision ignited a firestorm both physically, as the streets of Ferguson filled with fires and protests, and virtually, as people from around the world weighed in through social media.

This project explores the phenomenon of social media activism, news sharing, and the relationship between the virtual and physical world through analysis of 13 million tweets over the two weeks following the shooting, and 15 million tweets about the indictment decision in the two weeks following.

# Importing everything!
Let's get ready to do some cool data things!

In [3]:
% matplotlib inline
import pandas as pd
import json
import matplotlib.pyplot as plt
import numpy as np
import time
from datetime import datetime

import cartopy.crs as ccrs
from ipywidgets import widgets
import matplotlib.cm as cm

from matplotlib import animation
from IPython.display import HTML
from tempfile import NamedTemporaryFile

import networkx as nx



# Reading in Data
Here, we're using a function that we've developed to read in the data, a certain number of lines at a time. This uses a file of cleaned tweets that can be found...

In [4]:
def ReadToDf(linesAtATime, filepath, max=-1):
    start = time.time()
    chunks = []  # one DataFrame per batch of linesAtATime tweets
    data = []    # raw tweets waiting to be turned into a DataFrame
    i = 0

    # Open the file and parse one JSON tweet per line
    with open(filepath) as cleanedTweets:
        for tweet in cleanedTweets:
            i += 1
            data.append(json.loads(tweet))
            # Aggregate once we've read in the appropriate number of lines
            if i % linesAtATime == 0:
                print('number of tweets parsed: %d' % i)
                print('total time elapsed: %s' % (time.time() - start))
                chunks.append(pd.DataFrame(data=data))
                # Reset the buffer
                data = []
            # Truncate the data if we just want to run unit tests
            if max > 0 and i >= max:
                break
    # Handle the last few tweets that did not fill a whole batch
    if data:
        chunks.append(pd.DataFrame(data=data))
    # Return the aggregation, keeping the rows in file order
    return pd.concat(chunks) if chunks else pd.DataFrame()
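
As an aside, pandas can also read newline-delimited JSON in batches on its own. The sketch below is not the function used in this notebook: `read_json_lines_in_chunks` is a hypothetical helper, the five-line file is made up, and `dtype=False` / `convert_dates=False` are choices made here so that fields such as `id_str` are left as they appear in the file.

```python
import json
import os
import tempfile

import pandas as pd


def read_json_lines_in_chunks(filepath, lines_at_a_time):
    """Hypothetical helper: read newline-delimited JSON batch by batch."""
    reader = pd.read_json(filepath, lines=True, chunksize=lines_at_a_time,
                          dtype=False, convert_dates=False)
    chunks = [chunk for chunk in reader]
    return pd.concat(chunks, ignore_index=True)


# Tiny made-up file standing in for the cleaned tweets file
toy_tweets = [{"id_str": str(n), "text": "tweet %d" % n} for n in range(5)]
with tempfile.NamedTemporaryFile(mode="w", suffix=".json", delete=False) as f:
    for tweet in toy_tweets:
        f.write(json.dumps(tweet) + "\n")
    toy_path = f.name

toy_df = read_json_lines_in_chunks(toy_path, lines_at_a_time=2)
os.remove(toy_path)
print(toy_df)
```

Whether this is faster than the hand-written loop on the full dataset has not been measured here.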

In [None]:
shooting_df = ReadToDf(500000, 'data/cleanedShootingTweets.json')

number of tweets parsed:  500000
total time elapsed:  9.01916599274
number of tweets parsed:  1000000
total time elapsed:  20.4038479328
number of tweets parsed:  1500000
total time elapsed:  32.8145618439
number of tweets parsed:  2000000
total time elapsed:  46.7838070393
number of tweets parsed:  2500000
total time elapsed:  60.3250980377
number of tweets parsed:  3000000
total time elapsed:  76.4616868496
number of tweets parsed:  3500000
total time elapsed:  92.7481968403
number of tweets parsed:  4000000
total time elapsed:  110.749866009
number of tweets parsed:  4500000
total time elapsed:  126.559936047
number of tweets parsed:  5000000
total time elapsed:  145.757845879
number of tweets parsed:  5500000
total time elapsed:  162.547792912
number of tweets parsed:  6000000
total time elapsed:  183.822963953
number of tweets parsed:  6500000
total time elapsed:  203.087458849
number of tweets parsed:  7000000
total time elapsed:  227.908416033
number of tweets parsed:  7500000
t

In [None]:
shooting_df.head(10)

In [None]:
indictment_df = ReadToDf(500000, 'data/cleanedIndTweets.json')

In [None]:
indictment_df.head(10)

## Cleaning Data
This is where we're going to clean the data that we've read in. The only cleaning left to do here is converting the "created_at" column to a datetime object. The rest of the cleaning is done in the script `development_scripts/clean_json_data.py`. 

In [None]:
def recodeData(dataframe):
    # Parse the "created_at" strings into a datetime column (modifies the dataframe in place)
    dataframe['createdDatetime'] =  pd.to_datetime(
        dataframe['created_at'], 
        format = '%a %b %d %H:%M:%S +0000 %Y')

Now, let's recode the dates for the shooting data and the indictment data! These operations are done in place, so we don't actually have to return anything.

In [None]:
print('recoding shooting dataset')
recodeData(shooting_df)
print('recoding indictment dataset')
recodeData(indictment_df)
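
The map animations later in this notebook filter on `createdDatetime_day` and `createdDatetime_hour`, which no cell shown here creates; presumably they come from the cleaning script. If they are missing, one possible way to derive them (an assumption, not necessarily what the script does) is the pandas `.dt` accessor, shown here on a made-up two-row frame:

```python
import pandas as pd

# Made-up rows in the same "created_at" format that recodeData parses
toy = pd.DataFrame({"created_at": ["Sun Aug 17 00:15:00 +0000 2014",
                                   "Mon Aug 18 13:45:10 +0000 2014"]})
toy["createdDatetime"] = pd.to_datetime(
    toy["created_at"], format="%a %b %d %H:%M:%S +0000 %Y")

# Day of month and hour of day, as used by the animation filters
toy["createdDatetime_day"] = toy["createdDatetime"].dt.day
toy["createdDatetime_hour"] = toy["createdDatetime"].dt.hour
print(toy[["createdDatetime_day", "createdDatetime_hour"]])
```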

# Number of Tweets over Time
In this section, we mostly want to explore broadly the data we have -- both the shooting and indictment-related tweets. First, we want to understand the volume of tweets that we have, and when there are spikes in the data. We will do this for both the shooting and the indictment tweets.

## Tweets after the Mike Brown Shooting
In the dataset related to the shooting, we have approximately 11 million tweets' worth of data. Over the course of the first week, statements from police, from Michael Brown's family, and from the police officer Darren Wilson, along with unrest in the streets, were recorded and shared widely across the web. To get a general understanding of the data, we are going to plot the number of tweets over time and annotate this plot with important events that happened over the course of the two weeks. 

First, we will group the tweets by minute, and then count the number of tweets that happened each minute. 

In [None]:
#Raw numbers of tweets over time
shootingTweetsGroupedTime = (shooting_df
                                 .set_index('createdDatetime')
                                 .groupby([pd.Grouper(freq='min')])
                                 .count()
                                 .reset_index())
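
To make the per-minute grouping concrete, here is a small sketch on made-up timestamps (the `toy` frame and its `text` column are illustrative only). Note that minutes with no tweets still appear, with a count of zero:

```python
import pandas as pd

toy = pd.DataFrame({
    "createdDatetime": pd.to_datetime(["2014-08-11 10:00:05",
                                       "2014-08-11 10:00:40",
                                       "2014-08-11 10:02:10"]),
    "text": ["a", "b", "c"],
})

# Bucket rows into one-minute bins and count the rows in each bin
per_minute = (toy
              .set_index("createdDatetime")
              .groupby(pd.Grouper(freq="min"))
              .count()
              .reset_index())
print(per_minute)
```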

In [None]:
fig, ax = plt.subplots(figsize=(10,10))

shootingTweetsGroupedTime.plot(kind='line', x='createdDatetime', y='created_at', ax=ax)
plt.title('Number of Tweets by Users')

ax.set_ylabel('Count of tweets')
ax.set_xlabel('Date/Time')

#Overlaying Special Events
#August 11, 10AM - first police department demonstration
#August 11, 4PM - parents ask for stop to violence
#August 11, 8PM - tear gas used at protest
#August 12, 10AM - protest in St. Louis
#August 12, 12PM - Al Sharpton addresses crowds
#August 12, 4PM - Obama makes a statement
#August 13, 6PM - Reporters detained
#August 13, 9PM - Tear gas used again, and at reporters
#August 14, 7AM - Antonio French released from jail
#August 14, 11:40AM - Obama Address
#August 14, 6PM - Silent Vigils, first peaceful night
#August 15, 8:45AM - Darren Wilson named
#August 15, 12:30PM - Character assassination statement by family
#August 15 Evening - Huge amounts of protest
#August 16, 3PM - State of emergency issued, curfew issued
#August 17 - Afternoon - Federal Autopsy Ordered
#August 18 - 2AM - Federal Guard Ordered into town
#August 18 - 3:30PM - third Obama address
#August 18 - Trayvon Martin's mother published letter
#August 19 - 7AM - family on the Today Show
#August 19 - 1PM - another man is shot
#August 22 - 12PM - national guard ordered to withdraw
#August 23 - Online fundraisers for officer surpass that of Brown
#August 23 - 7PM - Obama address
#August 24 - 12PM - Private Viewing, Requests for no Violence
#August 25 - Funeral

calendar_dates = {'FirstDemo':datetime(2014,8,11,10,0,0),
                 'ParentReq':datetime(2014,8,11,16,0,0),
                 'TearGas':datetime(2014,8,11,20,0,0),
                 'StLouisProtest':datetime(2014,8,12,10,0,0),
                 'AlSharpton':datetime(2014,8,12,12,0,0),
                 'Obama':datetime(2014,8,12,16,0,0),
                 'Reporters':datetime(2014,8,13,18,0,0),
                 'TearGas2':datetime(2014,8,13,21,0,0),
                 'AntonioFrench':datetime(2014,8,14,7,0,0),
                 'Obama2':datetime(2014,8,14,11,40,0),
                 'SilentVigils':datetime(2014,8,14,18,0,0),
                 'DarrenWilson':datetime(2014,8,15,8,45,0),
                 'CharacterAssassination':datetime(2014,8,15,12,30,0),
                 'StateOfEmergency':datetime(2014,8,16,15,0,0),
                 'FederalGuard':datetime(2014,8,18,2,0,0),
                 'Obama3':datetime(2014,8,18,15,30,0),
                 'TodayShow':datetime(2014,8,19,7,0,0),
                 'AnotherShot':datetime(2014,8,19,13,0,0),
                 'GuardWithdrawn':datetime(2014,8,22,12,0,0),
                 'Obama4':datetime(2014,8,23,19,0,0),
                 'Viewing':datetime(2014,8,24,12,0,0)}

for event, when in calendar_dates.items():
    # ymin/ymax are fractions of the axes height (0 to 1), not data values
    ax.axvline(x=when, ymin=0, ymax=1, linewidth=4, color='g', label=event)
    ax.text(when, 600, event)
plt.show()

Further, a certain percentage of tweets have place or geocoding information attached. For those, we can see spatially where tweets were generated following the shooting.

In [None]:
latLonPopulated = shooting_df[(shooting_df['x'] != 0) & (shooting_df['y'] != 0)]

In [None]:
#Note, the animation code courtesy of and adapted from http://jakevdp.github.io/blog/2013/05/12/embedding-matplotlib-animations/
VIDEO_TAG = """<video controls>
 <source src="data:video/x-m4v;base64,{0}" type="video/mp4">
 Your browser does not support the video tag.
</video>"""

import base64

def anim_to_html(anim):
    if not hasattr(anim, '_encoded_video'):
        with NamedTemporaryFile(suffix='.mp4') as f:
            anim.save(f.name, fps=20, extra_args=['-vcodec', 'libx264', '-pix_fmt', 'yuv420p'])
            with open(f.name, "rb") as videoFile:
                video = videoFile.read()
        # base64.b64encode works on both Python 2 and 3, unlike str.encode("base64")
        anim._encoded_video = base64.b64encode(video).decode('ascii')
    
    return VIDEO_TAG.format(anim._encoded_video)

def display_animation(anim):
    plt.close(anim._fig)
    return HTML(anim_to_html(anim))

In [None]:
# First set up the figure, the axis, and the plot element we want to animate
fig = plt.figure(figsize=(10,10))
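# Create a single map axes and set its extent (the original created two overlapping axes)
ax = plt.axes(projection=ccrs.PlateCarree())
ax.set_extent([-180, 180, -75, 75], crs=ccrs.PlateCarree())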
ax.coastlines()
scat, = plt.plot([], [],'o')

# initialization function: plot the background of each frame
def init():
    scat.set_data([], [])
    return scat,

# animation function.  This is called sequentially by the animator
def animate(i):
    # Frame i is the i-th hour counted from midnight on the 17th
    day = i // 24 + 17
    hour = i % 24
    subset = latLonPopulated[
        (latLonPopulated['createdDatetime_day'] == day) & 
        (latLonPopulated['createdDatetime_hour'] == hour)]

    scat.set_data(subset.x,subset.y)
    
    return scat,

# animation.Animation._repr_html_ = anim_to_html #this yields a deprecation warning, heads up
# call the animator.  blit=True means only re-draw the parts that have changed.
anim = animation.FuncAnimation(fig, animate, init_func=init, frames=336, interval=5000, blit=True)

# call our new function to display the animation
display_animation(anim)

And over all time, statically, this looks like:

In [None]:
#the cartopy map, going simple outline for now
plt.figure(figsize=(15,15))
ax = plt.axes(projection=ccrs.PlateCarree())
ax.coastlines()

#straight-up coordinate data which we just finished recoding
plt.scatter(latLonPopulated.x,latLonPopulated.y,color='r')
plt.axis([-180, 180, -75, 75])
plt.show()

In [None]:
#as a hexbin instead
#the cartopy map, going simple outline for now
plt.figure(figsize=(15,15))
ax = plt.axes(projection=ccrs.PlateCarree())
ax.coastlines()

#straight-up coordinate data which we just finished recoding
plt.hexbin(latLonPopulated.x,latLonPopulated.y,cmap=cm.jet) #add bins='log' to see a log based chart
cb = plt.colorbar()
cb.set_label('counts')
plt.show()

## Tweets after the Decision not to Indict
Now, we also want to look over the same analysis for the tweets following the decision not to indict Darren Wilson. First, let's get an idea of the number of tweets that occurred over time. 

In [None]:
#Raw numbers of tweets over time
indictmentTweetsGroupedTime = (indictment_df
                                 .set_index('createdDatetime')
                                 .groupby([pd.Grouper(freq='min')])
                                 .count()
                                 .reset_index())

fig, ax = plt.subplots(figsize=(10,10))

indictmentTweetsGroupedTime.plot(kind='line', x='createdDatetime', y='created_at', ax=ax)
plt.title('Number of Tweets by Users')

ax.set_ylabel('Count of tweets')
ax.set_xlabel('Date/Time')

Similarly, let's also take a look at the geographic distribution of those tweets over time!

In [None]:
latLonPopulated = indictment_df[(indictment_df['x'] != 0) & (indictment_df['y'] != 0)]

In [None]:
# First set up the figure, the axis, and the plot element we want to animate
fig = plt.figure(figsize=(10,10))
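# Create a single map axes and set its extent (the original created two overlapping axes)
ax = plt.axes(projection=ccrs.PlateCarree())
ax.set_extent([-180, 180, -75, 75], crs=ccrs.PlateCarree())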
ax.coastlines()
scat, = plt.plot([], [],'o')

# initialization function: plot the background of each frame
def init():
    scat.set_data([], [])
    return scat,

# animation function.  This is called sequentially by the animator
def animate(i):
    # Frame i is the i-th hour counted from midnight on the 17th
    day = i // 24 + 17
    hour = i % 24
    subset = latLonPopulated[
        (latLonPopulated['createdDatetime_day'] == day) & 
        (latLonPopulated['createdDatetime_hour'] == hour)]

    scat.set_data(subset.x,subset.y)
    
    return scat,

# animation.Animation._repr_html_ = anim_to_html #this yields a deprecation warning, heads up
# call the animator.  blit=True means only re-draw the parts that have changed.
anim = animation.FuncAnimation(fig, animate, init_func=init, frames=336, interval=5000, blit=True)

# call our new function to display the animation
display_animation(anim)

To see where the tweets came from across the whole period, we can plot all the geo-located tweets at once:

In [None]:
#the cartopy map, going simple outline for now
plt.figure(figsize=(15,15))
ax = plt.axes(projection=ccrs.PlateCarree())
ax.coastlines()

#straight-up coordinate data which we just finished recoding
plt.scatter(latLonPopulated.x,latLonPopulated.y,color='r')
plt.axis([-180, 180, -75, 75])
plt.show()

In [None]:
#as a hexbin instead
#the cartopy map, going simple outline for now
plt.figure(figsize=(15,15))
ax = plt.axes(projection=ccrs.PlateCarree())
ax.coastlines()

#straight-up coordinate data which we just finished recoding
plt.hexbin(latLonPopulated.x,latLonPopulated.y,cmap=cm.jet) #add bins='log' to see a log based chart
cb = plt.colorbar()
cb.set_label('counts')
plt.show()

# Hashtags - Sophia
Now, let's explore the different hashtags used by users. First, let's create a dataframe that has one row per tweet-hashtag combination. So, a tweet using two hashtags would translate into a dataframe with two rows. 

We will create this dataframe by looping over the rows in the dataframe, and looping over the hashtags in each tweet. For each tweet, we will create a json object that represents the row that should be added to our new dataframe. 

In [None]:
hashtagRows = []
hashtagMap = {'MICHAELBROWN': 'MIKEBROWN'}
for i,tweet in df.iterrows():
    for hashtag in tweet['entities_hashtags_text']:
        
        mappedhashtag = hashtag.upper()
        if (mappedhashtag in hashtagMap):
            mappedhashtag = hashtagMap[mappedhashtag]
        #Do we want to map michael brown to mike brown and similar stuff?
        hashtagRows.append({
                'createdDatetime': tweet['createdDatetime'],
                'hashtag': mappedhashtag,
                'tweetId': tweet['id_str'],
                'x': tweet['x'],
                'y': tweet['y'],
                'createdDatetime_day': tweet['createdDatetime_day'],
                'createdDatetime_hour': tweet['createdDatetime_hour']
            })
print('creating dataframe')
hashtagsDf = pd.DataFrame(hashtagRows)

In [None]:
hashtagsDf.head(10)
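
Looping with `iterrows` can be slow on millions of rows. As a possible alternative (a sketch only, assuming a pandas version that provides `DataFrame.explode`, and using a made-up three-tweet frame), the same one-row-per-hashtag shape can be produced without an explicit loop:

```python
import pandas as pd

toy = pd.DataFrame({
    "id_str": ["1", "2", "3"],
    "entities_hashtags_text": [["Ferguson", "MichaelBrown"], [], ["ferguson"]],
})
hashtag_map = {"MICHAELBROWN": "MIKEBROWN"}

# One row per (tweet, hashtag); tweets without hashtags become NaN and are dropped
exploded = (toy
            .explode("entities_hashtags_text")
            .dropna(subset=["entities_hashtags_text"])
            .rename(columns={"entities_hashtags_text": "hashtag",
                             "id_str": "tweetId"}))

# Upper-case every hashtag, then merge known spelling variants
exploded["hashtag"] = exploded["hashtag"].str.upper().replace(hashtag_map)
print(exploded)
```

The other columns of each tweet (coordinates, timestamps) are carried along automatically, since `explode` repeats them for every hashtag.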

Now that we have a dataframe, let's get the most popular hashtags! We will do this by grouping the dataframe by hashtag and then aggregating by count. We'll sort by count, and then transform that information to a list that we can use later. Right now, we'll start by getting the top 10 hashtags and plotting those over time. 

In [None]:
numTopHashtags = 10
popularHashtagsList = (hashtagsDf
                   .groupby('hashtag')
                   .count()
                   .reset_index()
                   .sort_values(by='createdDatetime', ascending=False)['hashtag']
                   .tolist())[0:numTopHashtags]

print(popularHashtagsList)
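
For comparison, the group, count, and sort steps can be collapsed into a single `value_counts` call. This is only a sketch on made-up hashtags, not a replacement for the cell above:

```python
import pandas as pd

toy_hashtags = pd.DataFrame({
    "hashtag": ["FERGUSON", "MIKEBROWN", "FERGUSON",
                "TCOT", "FERGUSON", "MIKEBROWN"],
})

# value_counts sorts by frequency, most common first
top_two = toy_hashtags["hashtag"].value_counts().head(2).index.tolist()
print(top_two)
```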

Now that we have the most popular hashtags, let's filter the hashtags dataframe for just those hashtags. 

In [None]:
popularHashtagsDf = hashtagsDf[hashtagsDf.hashtag.isin(popularHashtagsList)]
popularHashtagsDf.head(10)

Now, let's count the number of times each hashtag was used in a given hour. To do this, we will group the dataframe by hour and hashtag and then aggregate by count.

In [None]:
hashtagTimeCounts = (popularHashtagsDf
                     .set_index('createdDatetime')
                     .groupby([pd.Grouper(freq='h'), 'hashtag'])
                     .count()
                     .reset_index())
hashtagTimeCounts.head(10)

Now that we've gotten the counts for a particular hashtag every hour, let's plot this over time as a line graph.

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111)

for hashtag in popularHashtagsList:
    filteredHashtagTimeCounts = hashtagTimeCounts[hashtagTimeCounts.hashtag == hashtag]
    filteredHashtagTimeCounts.plot(kind = 'line', x = 'createdDatetime', y = 'tweetId', label = hashtag, ax = ax)
    
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.xlabel('Time')
plt.ylabel('Number of Tweets over Time')
plt.title('Hashtag usage over Time')

Now, we can see the number of times each hashtag was used in a tweet in this dataset. Unfortunately, it appears that the hashtag "Ferguson" was used much more than any of the other hashtags, so this plot is a little hard to read. To adjust for that, let's "normalize" each hashtag line on this graph. To do this, we will divide the number of times that hashtag was used in any given hour by the maximum number of times that hashtag was used in any hour. Every line then peaks at 1, so we can see all the lines on the same set of axes. 

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111)

for hashtag in popularHashtagsList:
    # Copy the slice so that adding a column does not trigger SettingWithCopyWarning
    filteredHashtagTimeCounts = hashtagTimeCounts[hashtagTimeCounts.hashtag == hashtag].copy()
    maxCount = filteredHashtagTimeCounts['tweetId'].max()
    filteredHashtagTimeCounts['normalizedCounts'] = filteredHashtagTimeCounts['tweetId']/maxCount
    filteredHashtagTimeCounts.plot(kind = 'line', x = 'createdDatetime', y = 'normalizedCounts', label = hashtag, ax = ax)
    
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.xlabel('Time')
plt.ylabel('Normalized Number of Tweets over Time')
plt.title('Hashtag usage over Time')
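
The normalization itself is just a division by each hashtag's own peak. A minimal sketch on made-up counts (the numbers are illustrative, not taken from the dataset) shows the arithmetic, here done for all hashtags at once with `groupby(...).transform`:

```python
import pandas as pd

toy_counts = pd.DataFrame({
    "hashtag": ["FERGUSON", "FERGUSON", "TCOT", "TCOT"],
    "tweetId": [1000, 500, 10, 40],  # made-up hourly counts
})

# Divide each count by the maximum count observed for the same hashtag
peak = toy_counts.groupby("hashtag")["tweetId"].transform("max")
toy_counts["normalizedCounts"] = toy_counts["tweetId"] / peak
print(toy_counts)
```

Both hashtags now peak at 1.0 even though their raw counts differ by two orders of magnitude.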

We can see that the most popular hashtags seem to follow a similar trajectory. First of all, let's spend a little bit of time talking about what each hashtag means.

**Ferguson:**
This hashtag appears to refer in a broader context to the events that occurred in Ferguson. [source](http://www.dailydot.com/politics/ferguson-michael-brown-eric-garner-black-lives-matter-hashtag-activism/)

**Mike/Michael Brown:**
This hashtag, somewhat self-explanatorily, refers to Michael Brown, the unarmed teen that was shot by police. [source](http://www.dailydot.com/politics/ferguson-michael-brown-eric-garner-black-lives-matter-hashtag-activism/)

**TCOT:**
This hashtag stands for "Top Conservatives on Twitter" and is used to bring together conservatives on Twitter. The corresponding liberal hashtag is "P2". [source](http://blog.sfgate.com/ybenjamin/2010/07/27/the-secret-twitter-war-for-americas-independents-tcot-vs-p2/)  Political discussions about Ferguson started very quickly after the events there. 

**Hands Up Don't Shoot:**
"Hands up Don't Shoot" was a phrase commonly used in the Ferguson protests. This phrase references witnesses' statements that Michael Brown had his hands up before he was shot by police. This phrase was adopted in peaceful protest after the Ferguson shooting. [source](http://www.cbc.ca/newsblogs/yourcommunity/2014/08/hands-up-dont-shoot-gesture-spreads-online-in-support-of-ferguson-protesters.html)

**STL:**
This hashtag, similarly to the hashtag "Ferguson", refers to the city of Saint Louis, Missouri.  [source](https://tagdef.com/stl)

**Unite Blue:**
Although this hashtag is typically used to refer to "uniting liberals on twitter" [source](), in this context this hashtag refers to people uniting in support of the police force. [source]()

**Ezell Ford:**
Ezell Ford is another African American man that was also killed after being shot by police. He was shot on August 11th, 2014 in LA. [source](https://en.wikipedia.org/wiki/Shooting_of_Ezell_Ford)

**Ferguson Shooting:**
This hashtag appears to be used simply to refer to events surrounding the Ferguson shooting. [source]()

TALK ABOUT THE GRAPH HERE!

We can even see where these hashtags are most popular over time!

In [None]:
latLonPopulated_HT = hashtagsDf[(hashtagsDf['x'] != 0) & (hashtagsDf['y'] != 0)]

In [None]:
def plotAnimationProperly(list_functions, data_to_plot, colors_to_plot):
    for i in range(len(list_functions)):
        list_functions[i].set_data(data_to_plot[i].x,data_to_plot[i].y)
        list_functions[i].set_color(colors_to_plot[i])

In [None]:
import matplotlib.patches as mpatches
# First set up the figure, the axis, and the plot element we want to animate
fig = plt.figure(figsize=(10,10))
# Create a single map axes and set its extent (the original created two overlapping axes)
ax = plt.axes(projection=ccrs.PlateCarree())
ax.set_extent([-180, 180, -75, 75], crs=ccrs.PlateCarree())
ax.coastlines()
scat, = plt.plot([], [],'o')
scat_top_1, = plt.plot([], [],'o')
scat_top_2, = plt.plot([], [],'o')
scat_top_3, = plt.plot([], [],'o')
scat_top_4, = plt.plot([], [],'o')
scat_top_5, = plt.plot([], [],'o')
scat_top_6, = plt.plot([], [],'o')
scat_top_7, = plt.plot([], [],'o')
scat_top_8, = plt.plot([], [],'o')
scat_top_9, = plt.plot([], [],'o')

# animation function.  This is called sequentially
patches = []
def animate(i):
    # Frame i is the i-th hour counted from midnight on the 17th
    day = i // 24 + 17
    hour = i % 24
    subset = []
    color = []
    for j,hashtag in enumerate(popularHashtagsList):
        subset.append(latLonPopulated_HT[
            (latLonPopulated_HT['createdDatetime_day'] == day) & 
            (latLonPopulated_HT['createdDatetime_hour']== hour) & 
            (latLonPopulated_HT['hashtag'] == hashtag)])
        color.append(cm.jet(j/float(len(popularHashtagsList))))

    plotAnimationProperly([scat,scat_top_1,scat_top_2,scat_top_3,scat_top_4,scat_top_5,scat_top_6,scat_top_7,scat_top_8,scat_top_9],subset,color)
    
    return scat,scat_top_1,scat_top_2,scat_top_3,scat_top_4,scat_top_5,scat_top_6,scat_top_7,scat_top_8,scat_top_9,

# animation.Animation._repr_html_ = anim_to_html
#set up the legend
for j, hashtag in enumerate(popularHashtagsList):
    patches.append(mpatches.Patch(color=cm.jet(j/float(len(popularHashtagsList))), label=hashtag))
plt.legend(handles=patches, loc='best')

# call the animator.  blit=True means only re-draw the parts that have changed.
anim = animation.FuncAnimation(fig, animate, frames=48, interval=5000, blit=False)

# call our new function to display the animation
display_animation(anim)


In [None]:
#the cartopy map, going simple outline for now
plt.figure(figsize=(15,15))
ax = plt.axes(projection=ccrs.PlateCarree())
ax.coastlines()

for j,hashtag in enumerate(popularHashtagsList):
    subset = latLonPopulated_HT[(latLonPopulated_HT['hashtag'] == hashtag)]
    if len(subset.x) == 0:
        pass
    else:
        plt.scatter(subset.x,subset.y,color=cm.jet(j/float(len(popularHashtagsList))),label=hashtag)

handles, labels = ax.get_legend_handles_labels()
ax.legend(handles,labels, loc='best')
plt.axis([-180, 180, -75, 75])
plt.show()

In [None]:
#each hashtag, hexbin
plt.figure(figsize=(100,20))

for j,hashtag in enumerate(popularHashtagsList):
    subset = latLonPopulated_HT[(latLonPopulated_HT['hashtag'] == hashtag)]
    ax = plt.subplot(2,5,j+1,projection=ccrs.PlateCarree())
    ax.coastlines()
    ax.hexbin(subset.x,subset.y,cmap=cm.jet,bins='log')
    plt.title(hashtag)
    plt.axis([-180, 180, -75, 75])

plt.show()

# Users
introduction goes here!

## Verified Users - Sophia
Additionally, one of the things we wanted to investigate was the role that verified users play in raising awareness about a certain event. On Twitter, verified users are users that represent an organization (like a news source) or a public figure. We hypothesize that getting more verified users involved in talking about social justice will cause more non-verified users to also be engaged in the conversation, as they see what verified users are saying. 

In [None]:
verified = df[df.user_verified == True]
normal = df[df.user_verified == False]

Now let's find the number of tweets by verified users every minute

In [None]:
groupedVerified = verified.set_index('createdDatetime').groupby([pd.Grouper(freq='min')]).count().reset_index()
groupedNormal = normal.set_index('createdDatetime').groupby([pd.Grouper(freq='min')]).count().reset_index()

Now let's plot this information!

In [None]:
fig, ax = plt.subplots(figsize=(10,10))


groupedVerified.plot(kind='line', x='createdDatetime', y='created_at', label='Verified Users', ax=ax)
groupedNormal.plot(kind='line', x='createdDatetime', y='created_at', ax = ax, secondary_y=True, label='Non-verified Users')
plt.title('Number of tweets by Users')

ax.set_ylabel('Count of tweets (Verified users)', color='b')
ax.right_ax.set_ylabel('Count of tweets (Non-Verified users)', color='g')

To see whether the number of tweets by non-verified users and the number of tweets by verified users are related, let's cross-correlate the two per-minute counts. 

In [None]:
verifiedtweetsCount = groupedVerified.sort_values(by='createdDatetime')['created_at'].tolist()
normaltweetsCount = groupedNormal.sort_values(by='createdDatetime')['created_at'].tolist()

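corr = np.correlate(verifiedtweetsCount, normaltweetsCount, mode='full')
# In 'full' mode the lags run from -(len(second) - 1) to len(first) - 1
delays = np.arange(-(len(normaltweetsCount) - 1), len(verifiedtweetsCount))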

In [None]:
plt.plot(delays,corr)
plt.xlabel('delay in number of minutes')
plt.ylabel('correlation')
plt.title('Correlation between number of verified tweets and number of non-verified account tweets')

Now, let's also get the delay (in minutes, since the counts are per minute) with the highest cross-correlation

In [None]:
delays[np.argmax(corr)]

So the cross-correlation of those two vectors is highest when verifiedtweetsCount and normaltweetsCount are offset by 402 minutes. This means that about 6.7 hours after an increase in ____ there tends to also be an increase in ___.
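
To help decide which series leads, it can be useful to check the sign convention of `np.correlate` on a case where the answer is known. The sketch below uses two made-up series in which the second spike comes 3 steps after the first; the names `verified_counts` and `normal_counts` are illustrative only.

```python
import numpy as np

# Made-up counts: the non-verified spike comes 3 steps after the verified one
verified_counts = np.array([0, 0, 5, 0, 0, 0, 0, 0], dtype=float)
normal_counts = np.array([0, 0, 0, 0, 0, 9, 0, 0], dtype=float)

corr = np.correlate(verified_counts, normal_counts, mode="full")

# Lag k pairs verified_counts[n + k] with normal_counts[n]
lags = np.arange(-(len(normal_counts) - 1), len(verified_counts))
best_lag = int(lags[np.argmax(corr)])
print(best_lag)
```

With this argument order, a negative best lag means the first series (verified) moves before the second (non-verified), and a positive one means the reverse. Subtracting each series' mean before correlating is a common refinement that is not applied here.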

## Retweets - Victoria

In [None]:
retweetRows = []
for i,tweet in df.iterrows():
    for retweeter in tweet['entities_user_mentions_screen_name']:
        mappeduser = retweeter.upper()
        retweetRows.append({
                'createdDatetime': tweet['createdDatetime'],
                'retweeter': mappeduser,
                'tweetId': tweet['user_screen_name'].upper(),
                'x': tweet['x'],
                'y': tweet['y'],
                'retweet_count': tweet['retweet_count'],
                'user_verified': tweet['user_verified'],
                'entities_hashtags_text': tweet['entities_hashtags_text'],
#                 'activist': tweet['activist']
            })
print('creating dataframe')
retweetDf = pd.DataFrame(retweetRows)

In [None]:
colormap = {False:'blue',True:'red'}
retweetDf['node_color'] = retweetDf['user_verified'].apply(lambda x: colormap.get(x))

In [None]:
pos = {}
for i,tweet in retweetDf.iterrows():
    pos[tweet['tweetId'].upper()] = np.asarray([tweet['retweet_count'],tweet['createdDatetime']])
    pos[tweet['retweeter'].upper()] = np.asarray([tweet['retweet_count'],tweet['createdDatetime']]) #temporary until we have all data

In [None]:
plt.figure(figsize=(15,15))

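edges = retweetDf[0:1000]
G = nx.from_pandas_edgelist(edges, source='tweetId', target='retweeter')
# other_pos = nx.spectral_layout(G)

# node_color needs one entry per node (not per row): red if the tweeting user is verified, else blue
verifiedNodes = set(edges.loc[edges['user_verified'] == True, 'tweetId'])
nodeColors = ['red' if node in verifiedNodes else 'blue' for node in G.nodes()]

nx.draw_networkx(G, alpha=0.2, with_labels=False, node_size=10, node_color=nodeColors)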
plt.axis('equal')
plt.show()

In [None]:
latLonPopulated_RT = retweetDf[(retweetDf['x'] != 0) & (retweetDf['y'] != 0)]

pos = {}
for i,tweet in latLonPopulated_RT.iterrows():
    pos[tweet['tweetId'].upper()] = np.asarray([tweet['x'],tweet['y']])
    pos[tweet['retweeter'].upper()] = np.asarray([tweet['x'],tweet['y']]) #temporary until we have all data

In [None]:
plt.figure(figsize=(15,15))
ax = plt.axes(projection=ccrs.PlateCarree())
ax.coastlines()

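G = nx.from_pandas_edgelist(latLonPopulated_RT, source='tweetId', target='retweeter')
# other_pos = nx.spectral_layout(G)

# node_color needs one entry per node (not per row): red if the tweeting user is verified, else blue
verifiedNodes = set(latLonPopulated_RT.loc[latLonPopulated_RT['user_verified'] == True, 'tweetId'])
nodeColors = ['red' if node in verifiedNodes else 'blue' for node in G.nodes()]

nx.draw_networkx(G, pos, alpha=0.1, with_labels=False, node_size=10, node_color=nodeColors)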
plt.axis([-130, -50, 20, 50])
plt.show()

In [None]:
#and now, to take a look at networks based upon hashtags...
numTopHashtags = 160
popularHashtagsList = (hashtagsDf
                   .groupby('hashtag')
                   .count()
                   .reset_index()
                   .sort_values(by='createdDatetime', ascending=False)['hashtag']
                   .tolist())[0:numTopHashtags]

print(popularHashtagsList)

In [None]:
hashtagMap = {
#                 'MIKEBROWN':'blue',
               'JUSTICEFORMIKEBROWN':'blue',
               'TCOT':'red',
               'HANDSUPDONTSHOOT':'blue',
               'UNITEBLUE':'red',
               'DONTSHOOT':'blue',
               'CRIMEBUTNOTIME':'blue',
               'POLICEBRUTALITY':'blue',
#                'RESPECT':'blue',
               'P2':'blue',
#                'BEFREE':'blue',
#                'MIKEBROWNRALLY':'blue',
               'DARRENWILSON':'red',
               'STANDWITHFERGUSON':'blue',
#                'RIPMIKEBROWN':'blue',
               'BLACKLIVESMATTER':'blue',
#                'OCCUPYFERGUSON':'blue',
               'POLICESTATE':'blue',
               'WHITEPRIVILEGE':'blue',
               'IFTHEYGUNNEDMEDOWN':'blue',
               'HANDSUP':'blue',
#                'POLICE':'red',
               'GOP':'red',
#                'NOJUSTICENOPEACE':'blue',
               'WEGOTYOUSIS':'blue',
#                'SOLIDARITY':'blue',
               'RACISM':'blue',
               'CIVILRIGHTS':'blue',
               'FERGUSONCOVERUP':'blue',
               'WARONWHITES':'red',
               'MIKEBROWNBHEARD':'blue',
               'BLACKTWITTER':'blue',
#                'TRAYVONMARTIN':'blue',
               'REDNATIONRISING':'red',
               'TLOT':'red',
               'TGDN':'red',
               'PJNET':'red',
               '2A':'red',
               'NMOS14':'blue',
              'FOXNEWS':'red',
#               'KNOWYOURRIGHTS':'blue',
              'CAPTRONJOHNSON':'red',
              'BUNDYRANCH':'red',
              'CCOT':'red',
              'COPSLIE':'blue',
              'LNYHBT':'red',
#               'FILMTHEPOLICE':'blue',
#               'ALSHARPTON':'blue',
              'NOJUSTICENOSLEEP':'blue',
              'FTP':'blue',
#               'WHEREISDARRENWILSON':'blue',
              'NRA':'red',
              'ARRESTDARRENWILSON':'blue',
                   }

# Build one row per (retweet, classified hashtag) pair. Hashtags that are not
# in hashtagMap are skipped, so a retweet with several mapped hashtags
# contributes several rows.
retweetHT_Rows = []
for i,tweet in retweetDf.iterrows():
    for hashtag in tweet['entities_hashtags_text']:
        mappedhashtag = hashtag.upper()
        if (mappedhashtag in hashtagMap):
            retweetHT_Rows.append({
                    'createdDatetime': tweet['createdDatetime'],
                    'hashtag': hashtagMap[mappedhashtag],  # color label from hashtagMap
                    'tweetId': tweet['tweetId'],
                    'x': tweet['x'],
                    'y': tweet['y'],
                    'retweeter': tweet['retweeter'],
                    'retweet_count': tweet['retweet_count'],
                    'user_verified': tweet['user_verified'],
                })
print "creating dataframe"
retweetHashtagsDf = pd.DataFrame(retweetHT_Rows)
                   

In [None]:
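plt.figure(figsize=(15,15))

edges = retweetHashtagsDf[0:5000]
G=nx.from_pandas_dataframe(edges, source='tweetId',target='retweeter')

# node_color needs one entry per node, in G.nodes() order. The 'hashtag'
# column has one entry per edge, so map each endpoint to its row's color first.
nodeColors = dict(zip(edges['tweetId'], edges['hashtag']))
nodeColors.update(zip(edges['retweeter'], edges['hashtag']))

nx.draw_spring(G,alpha=1, with_labels=False, node_size=50,
               node_color=[nodeColors[n] for n in G.nodes()])
plt.axis('equal')
plt.show()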

In [None]:
# Hashtags we classify as "red"
hashtagMapRed = {
    'TCOT':'red',
    'UNITEBLUE':'red',
    'DARRENWILSON':'red',
#     'POLICE':'red',
    'GOP':'red',
    'WARONWHITES':'red',
    'REDNATIONRISING':'red',
    'TLOT':'red',
    'TGDN':'red',
    'PJNET':'red',
    '2A':'red',
    'FOXNEWS':'red',
    'CAPTRONJOHNSON':'red',
    'BUNDYRANCH':'red',
    'CCOT':'red',
    'LNYHBT':'red',
    'NRA':'red',
}

# Hashtags we classify as "blue"
hashtagMapBlue = {
    'MIKEBROWN':'blue',
    'JUSTICEFORMIKEBROWN':'blue',
    'HANDSUPDONTSHOOT':'blue',
    'DONTSHOOT':'blue',
    'CRIMEBUTNOTIME':'blue',
    'POLICEBRUTALITY':'blue',
    'RESPECT':'blue',
    'P2':'blue',
    'BEFREE':'blue',
    'MIKEBROWNRALLY':'blue',
    'STANDWITHFERGUSON':'blue',
    'RIPMIKEBROWN':'blue',
    'BLACKLIVESMATTER':'blue',
    'OCCUPYFERGUSON':'blue',
    'POLICESTATE':'blue',
    'WHITEPRIVILEGE':'blue',
    'IFTHEYGUNNEDMEDOWN':'blue',
    'HANDSUP':'blue',
    'NOJUSTICENOPEACE':'blue',
    'WEGOTYOUSIS':'blue',
    'SOLIDARITY':'blue',
    'RACISM':'blue',
    'CIVILRIGHTS':'blue',
    'FERGUSONCOVERUP':'blue',
    'MIKEBROWNBHEARD':'blue',
    'BLACKTWITTER':'blue',
    'TRAYVONMARTIN':'blue',
    'NMOS14':'blue',
    'KNOWYOURRIGHTS':'blue',
    'COPSLIE':'blue',
    'FILMTHEPOLICE':'blue',
    'ALSHARPTON':'blue',
    'NOJUSTICENOSLEEP':'blue',
    'FTP':'blue',
    'WHEREISDARRENWILSON':'blue',
    'ARRESTDARRENWILSON':'blue',
}

# Label each retweet with the color of its first classified hashtag.
# For a given hashtag the red map is checked before the blue map.
retweetHT_Rows = []
for i,tweet in retweetDf.iterrows():
    for hashtag in tweet['entities_hashtags_text']:
        mappedhashtag = hashtag.upper()
        if (mappedhashtag in hashtagMapRed):
            mappedhashtag = hashtagMapRed[mappedhashtag]
        elif (mappedhashtag in hashtagMapBlue):
            mappedhashtag = hashtagMapBlue[mappedhashtag]
        else:
            continue  # unclassified hashtag: try the next one
        retweetHT_Rows.append({
                'createdDatetime': tweet['createdDatetime'],
                'hashtag': mappedhashtag,
                'tweetId': tweet['tweetId'],
                'x': tweet['x'],
                'y': tweet['y'],
                'retweeter': tweet['retweeter'],
                'retweet_count': tweet['retweet_count'],
                'user_verified': tweet['user_verified'],
            })
        break  # at most one row per retweet

print "creating dataframe"
retweetHashtagsDf = pd.DataFrame(retweetHT_Rows)

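To make the labelling rule easier to check by hand, here is a minimal sketch of the same logic on made-up data. The helper `classify_hashtags` and the two tiny maps are hypothetical stand-ins for illustration only; they are not the notebook's full `hashtagMapRed` and `hashtagMapBlue`.

```python
# Toy stand-ins for hashtagMapRed / hashtagMapBlue (illustrative subsets only).
red_tags = {'TCOT': 'red', 'CCOT': 'red'}
blue_tags = {'HANDSUP': 'blue', 'P2': 'blue'}

def classify_hashtags(hashtags, red=red_tags, blue=blue_tags):
    """Return the color of the first hashtag found in either map, else None."""
    for hashtag in hashtags:
        key = hashtag.upper()  # the maps use upper-case keys
        if key in red:         # red is checked before blue
            return red[key]
        if key in blue:
            return blue[key]
    return None                # no classified hashtag in this tweet

# Made-up hashtag lists, one per tweet.
examples = [
    ['Ferguson', 'tcot'],   # first classified hashtag is red
    ['HandsUp', 'tcot'],    # blue hashtag comes first, so blue wins
    ['Ferguson'],           # nothing classified
]
labels = [classify_hashtags(tags) for tags in examples]
print(labels)
```

Because the first classified hashtag wins, a tweet that mixes red and blue hashtags is labelled by whichever appears first in its hashtag list.
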
In [None]:
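plt.figure(figsize=(15,15))

edges = retweetHashtagsDf[0:1000]
G=nx.from_pandas_dataframe(edges, source='tweetId',target='retweeter')

# node_color needs one entry per node, in G.nodes() order. The 'hashtag'
# column has one entry per edge, so map each endpoint to its row's color first.
nodeColors = dict(zip(edges['tweetId'], edges['hashtag']))
nodeColors.update(zip(edges['retweeter'], edges['hashtag']))

nx.draw_networkx(G,alpha=1, with_labels=False, node_size=50,
                 node_color=[nodeColors[n] for n in G.nodes()])
plt.axis('equal')
plt.show()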

In [None]:
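# Keep only rows that have coordinates (0 marks a missing value)
latLonPopulated_RTHT = retweetHashtagsDf[(retweetHashtagsDf['x'] != 0) & (retweetHashtagsDf['y'] != 0)]

# Map each node label to its (x, y) position. The keys have to match the node
# labels in the graph exactly, so we use the values as stored in the dataframe.
pos = {}
for i,tweet in latLonPopulated_RTHT.iterrows():
    pos[tweet['tweetId']] = np.asarray([tweet['x'],tweet['y']])
    pos[tweet['retweeter']] = np.asarray([tweet['x'],tweet['y']]) #temporary until we have all data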

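As far as we know, networkx looks up every node in `pos` by its exact label and raises an error if one is missing, so a quick check before drawing can save a confusing traceback. The sketch below uses made-up edges and coordinates; `missing_positions` is a hypothetical helper, not part of networkx or of this notebook.

```python
def missing_positions(edges, pos):
    """Return the sorted node labels that appear in edges but not in pos."""
    nodes = set()
    for source, target in edges:
        nodes.add(source)
        nodes.add(target)
    return sorted(n for n in nodes if n not in pos)

# Made-up (source, target) pairs and positions, for illustration only.
edges = [('tweet1', 'userA'), ('tweet2', 'userB')]
pos = {
    'tweet1': (-90.3, 38.7),
    'USERA': (-90.3, 38.7),   # upper-cased key does not match node 'userA'
    'tweet2': (-74.0, 40.7),
    'userB': (-74.0, 40.7),
}
unplaced = missing_positions(edges, pos)
print(unplaced)
```

An empty result means every node has a position. Any label that shows up here usually points to a mismatch in case or type between the graph's nodes and the keys of `pos`.
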
In [None]:
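plt.figure(figsize=(15,15))
ax = plt.axes(projection=ccrs.PlateCarree())
ax.coastlines()

G=nx.from_pandas_dataframe(latLonPopulated_RTHT, source='tweetId',target='retweeter')
# other_pos = nx.spectral_layout(G)

# node_color needs one entry per node, in G.nodes() order. The 'hashtag'
# column has one entry per edge, so map each endpoint to its row's color first.
nodeColors = dict(zip(latLonPopulated_RTHT['tweetId'], latLonPopulated_RTHT['hashtag']))
nodeColors.update(zip(latLonPopulated_RTHT['retweeter'], latLonPopulated_RTHT['hashtag']))

nx.draw_networkx(G,pos,alpha=0.2,with_labels=False,node_size=40,
                 node_color=[nodeColors[n] for n in G.nodes()])
plt.axis([-130, -50, 20, 50])
plt.show()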

## Conclusions - Victoria and Sophia